Dunn-Coleman, Nigel; Goedegebuur, Frits; Ward, Michael; Yiao, Jian
2014-03-18
The present invention provides a novel endoglucanase nucleic acid sequence, designated egl6 (SEQ ID NO:1 encodes the full length endoglucanase; SEQ ID NO:4 encodes the mature form), and the corresponding endoglucanase VI amino acid sequence ("EGVI"; SEQ ID NO:3 is the signal sequence; SEQ ID NO:2 is the mature sequence). The invention also provides expression vectors and host cells comprising a nucleic acid sequence encoding EGVI, recombinant EGVI proteins and methods for producing the same.
Carbohydrate degrading polypeptide and uses thereof
Sagt, Cornelis Maria Jacobus; Schooneveld-Bergmans, Margot Elisabeth Francoise; Roubos, Johannes Andries; Los, Alrik Pieter
2015-10-20
The invention relates to a polypeptide having carbohydrate material degrading activity which comprises the amino acid sequence set out in SEQ ID NO: 2 or an amino acid sequence encoded by the nucleotide sequence of SEQ ID NO: 1 or SEQ ID NO: 4, or a variant polypeptide or variant polynucleotide thereof, wherein the variant polypeptide has at least 96% sequence identity with the sequence set out in SEQ ID NO: 2 or the variant polynucleotide encodes a polypeptide that has at least 96% sequence identity with the sequence set out in SEQ ID NO: 2. The invention features the full length coding sequence of the novel gene as well as the amino acid sequence of the full-length functional protein and functional equivalents of the gene or the amino acid sequence. The invention also relates to methods for using the polypeptide in industrial processes. Also included in the invention are cells transformed with a polynucleotide according to the invention suitable for producing these proteins.
Polypeptide having or assisting in carbohydrate material degrading activity and uses thereof
Schooneveld-Bergmans, Margot Elisabeth Francoise; Heijne, Wilbert Herman Marie; Los, Alrik Pieter
2016-02-16
The invention relates to a polypeptide which comprises the amino acid sequence set out in SEQ ID NO: 2 or an amino acid sequence encoded by the nucleotide sequence of SEQ ID NO: 1, or a variant polypeptide or variant polynucleotide thereof, wherein the variant polypeptide has at least 76% sequence identity with the sequence set out in SEQ ID NO: 2 or the variant polynucleotide encodes a polypeptide that has at least 76% sequence identity with the sequence set out in SEQ ID NO: 2. The invention features the full length coding sequence of the novel gene as well as the amino acid sequence of the full-length functional polypeptide and functional equivalents of the gene or the amino acid sequence. The invention also relates to methods for using the polypeptide in industrial processes. Also included in the invention are cells transformed with a polynucleotide according to the invention suitable for producing these proteins.
Polypeptide having beta-glucosidase activity and uses thereof
DOE Office of Scientific and Technical Information (OSTI.GOV)
Schoonneveld-Bergmans, Margot Elisabeth Francoise; Heijne, Wilbert Herman Marie; De Jong, Rene Marcel
The invention relates to a polypeptide comprising the amino acid sequence set out in SEQ ID NO: 2 or an amino acid sequence encoded by the nucleotide sequence of SEQ ID NO: 1, or a variant polypeptide or variant polynucleotide thereof, wherein the variant polypeptide has at least 96% sequence identity with the sequence set out in SEQ ID NO: 2 or the variant polynucleotide encodes a polypeptide that has at least 96% sequence identity with the sequence set out in SEQ ID NO: 2. The invention features the full length coding sequence of the novel gene as well asmore » the amino acid sequence of the full-length functional polypeptide and functional equivalents of the gene or the amino acid sequence. The invention also relates to methods for using the polypeptide in industrial processes. Also included in the invention are cells transformed with a polynucleotide according to the invention suitable for producing these proteins.« less
Polypeptide having swollenin activity and uses thereof
Schoonneveld-Bergmans, Margot Elizabeth Francoise; Heijne, Wilbert Herman Marie; Vlasie, Monica D; Damveld, Robbertus Antonius
2015-11-04
The invention relates to a polypeptide comprising the amino acid sequence set out in SEQ ID NO: 2 or an amino acid sequence encoded by the nucleotide sequence of SEQ ID NO: 1, or a variant polypeptide or variant polynucleotide thereof, wherein the variant polypeptide has at least 73% sequence identity with the sequence set out in SEQ ID NO: 2 or the variant polynucleotide encodes a polypeptide that has at least 73% sequence identity with the sequence set out in SEQ ID NO: 2. The invention features the full length coding sequence of the novel gene as well as the amino acid sequence of the full-length functional polypeptide and functional equivalents of the gene or the amino acid sequence. The invention also relates to methods for using the polypeptide in industrial processes. Also included in the invention are cells transformed with a polynucleotide according to the invention suitable for producing these proteins.
Polypeptide having beta-glucosidase activity and uses thereof
Schooneveld-Bergmans, Margot Elisabeth Francoise; Heijne, Wilbert Herman Marie; De Jong, Rene Marcel; Damveld, Robbertus Antonius
2015-09-01
The invention relates to a polypeptide comprising the amino acid sequence set out in SEQ ID NO: 2 or an amino acid sequence encoded by the nucleotide sequence of SEQ ID NO: 1, or a variant polypeptide or variant polynucleotide thereof, wherein the variant polypeptide has at least 70% sequence identity with the sequence set out in SEQ ID NO: 2 or the variant polynucleotide encodes a polypeptide that has at least 70% sequence identity with the sequence set out in SEQ ID NO: 2. The invention features the full length coding sequence of the novel gene as well as the amino acid sequence of the full-length functional polypeptide and functional equivalents of the gene or the amino acid sequence. The invention also relates to methods for using the polypeptide in industrial processes. Also included in the invention are cells transformed with a polynucleotide according to the invention suitable for producing these proteins.
Polypeptide having cellobiohydrolase activity and uses thereof
Sagt, Cornelis Maria Jacobus; Schooneveld-Bergmans, Margot Elisabeth Francoise; Roubos, Johannes Andries; Los, Alrik Pieter
2015-09-15
The invention relates to a polypeptide comprising the amino acid sequence set out in SEQ ID NO: 2 or an amino acid sequence encoded by the nucleotide sequence of SEQ ID NO: 1, or a variant polypeptide or variant polynucleotide thereof, wherein the variant polypeptide has at least 93% sequence identity with the sequence set out in SEQ ID NO: 2 or the variant polynucleotide encodes a polypeptide that has at least 93% sequence identity with the sequence set out in SEQ ID NO: 2. The invention features the full length coding sequence of the novel gene as well as the amino acid sequence of the full-length functional polypeptide and functional equivalents of the gene or the amino acid sequence. The invention also relates to methods for using the polypeptide in industrial processes. Also included in the invention are cells transformed with a polynucleotide according to the invention suitable for producing these proteins.
Polypeptide having acetyl xylan esterase activity and uses thereof
Schoonneveld-Bergmans, Margot Elisabeth Francoise; Heijne, Wilbert Herman Marie; Los, Alrik Pieter
2015-10-20
The invention relates to a polypeptide comprising the amino acid sequence set out in SEQ ID NO: 2 or an amino acid sequence encoded by the nucleotide sequence of SEQ ID NO: 1, or a variant polypeptide or variant polynucleotide thereof, wherein the variant polypeptide has at least 82% sequence identity with the sequence set out in SEQ ID NO: 2 or the variant polynucleotide encodes a polypeptide that has at least 82% sequence identity with the sequence set out in SEQ ID NO: 2. The invention features the full length coding sequence of the novel gene as well as the amino acid sequence of the full-length functional polypeptide and functional equivalents of the gene or the amino acid sequence. The invention also relates to methods for using the polypeptide in industrial processes. Also included in the invention are cells transformed with a polynucleotide according to the invention suitable for producing these proteins.
Polypeptide having carbohydrate degrading activity and uses thereof
Schooneveld-Bergmans, Margot Elisabeth Francoise; Heijne, Wilbert Herman Marie; Vlasie, Monica Diana; Damveld, Robbertus Antonius
2015-08-18
The invention relates to a polypeptide comprising the amino acid sequence set out in SEQ ID NO: 2 or an amino acid sequence encoded by the nucleotide sequence of SEQ ID NO: 1, or a variant polypeptide or variant polynucleotide thereof, wherein the variant polypeptide has at least 73% sequence identity with the sequence set out in SEQ ID NO: 2 or the variant polynucleotide encodes a polypeptide that has at least 73% sequence identity with the sequence set out in SEQ ID NO: 2. The invention features the full length coding sequence of the novel gene as well as the amino acid sequence of the full-length functional polypeptide and functional equivalents of the gene or the amino acid sequence. The invention also relates to methods for using the polypeptide in industrial processes. Also included in the invention are cells transformed with a polynucleotide according to the invention suitable for producing these proteins.
Human jagged polypeptide, encoding nucleic acids and methods of use
Li, Linheng; Hood, Leroy
2000-01-01
The present invention provides an isolated polypeptide exhibiting substantially the same amino acid sequence as JAGGED, or an active fragment thereof, provided that the polypeptide does not have the amino acid sequence of SEQ ID NO:5 or SEQ ID NO:6. The invention further provides an isolated nucleic acid molecule containing a nucleotide sequence encoding substantially the same amino acid sequence as JAGGED, or an active fragment thereof, provided that the nucleotide sequence does not encode the amino acid sequence of SEQ ID NO:5 or SEQ ID NO:6. Also provided herein is a method of inhibiting differentiation of hematopoietic progenitor cells by contacting the progenitor cells with an isolated JAGGED polypeptide, or active fragment thereof. The invention additionally provides a method of diagnosing Alagille Syndrome in an individual. The method consists of detecting an Alagille Syndrome disease-associated mutation linked to a JAGGED locus.
Methods of diagnosing alagille syndrome
Li, Linheng; Hood, Leroy; Krantz, Ian D.; Spinner, Nancy B.
2004-03-09
The present invention provides an isolated polypeptide exhibiting substantially the same amino acid sequence as JAGGED, or an active fragment thereof, provided that the polypeptide does not have the amino acid sequence of SEQ ID NO:5 or SEQ ID NO:6. The invention further provides an isolated nucleic acid molecule containing a nucleotide sequence encoding substantially the same amino acid sequence as JAGGED, or an active fragment thereof, provided that the nucleotide sequence does not encode the amino acid sequence of SEQ ID NO:5 or SEQ ID NO:6. Also provided herein is a method of inhibiting differentiation of hematopoietic progenitor cells by contacting the progenitor cells with an isolated JAGGED polypeptide, or active fragment thereof. The invention additionally provides a method of diagnosing Alagille Syndrome in an individual. The method consists of detecting an Alagille Syndrome disease-associated mutation linked to a JAGGED locus.
Croteau, Rodney Bruce; Wildung, Mark Raymond; Lange, Bernd Markus; McCaskill, David G.
2001-01-01
cDNAs encoding 1-deoxyxylulose-5-phosphate synthase from peppermint (Mentha piperita) have been isolated and sequenced, and the corresponding amino acid sequences have been determined. Accordingly, isolated DNA sequences (SEQ ID NO:3, SEQ ID NO:5, SEQ ID NO:7) are provided which code for the expression of 1-deoxyxylulose-5-phosphate synthase from plants. In another aspect the present invention provides for isolated, recombinant DXPS proteins, such as the proteins having the sequences set forth in SEQ ID NO:4, SEQ ID NO:6 and SEQ ID NO:8. In other aspects, replicable recombinant cloning vehicles are provided which code for plant 1-deoxyxylulose-5-phosphate synthases, or for a base sequence sufficiently complementary to at least a portion of 1-deoxyxylulose-5-phosphate synthase DNA or RNA to enable hybridization therewith. In yet other aspects, modified host cells are provided that have been transformed, transfected, infected and/or injected with a recombinant cloning vehicle and/or DNA sequence encoding a plant 1-deoxyxylulose-5-phosphate synthase. Thus, systems and methods are provided for the recombinant expression of the aforementioned recombinant 1-deoxyxylulose-5-phosphate synthase that may be used to facilitate its production, isolation and purification in significant amounts. Recombinant 1-deoxyxylulose-5-phosphate synthase may be used to obtain expression or enhanced expression of 1-deoxyxylulose-5-phosphate synthase in plants in order to enhance the production of 1-deoxyxylulose-5-phosphate, or its derivatives such as isopentenyl diphosphate (BP), or may be otherwise employed for the regulation or expression of 1-deoxyxylulose-5-phosphate synthase, or the production of its products.
Cysteine-containing peptide tag for site-specific conjugation of proteins
Backer, Marina V.; Backer, Joseph M.
2008-04-08
The present invention is directed to a biological conjugate, comprising: (a) a targeting moiety comprising a polypeptide having an amino acid sequence comprising the polypeptide sequence of SEQ ID NO:2 and the polypeptide sequence of a selected targeting protein; and (b) a binding moiety bound to the targeting moiety; the biological conjugate having a covalent bond between the thiol group of SEQ ID NO:2 and a functional group in the binding moiety. The present invention is directed to a biological conjugate, comprising: (a) a targeting moiety comprising a polypeptide having an amino acid sequence comprising the polypeptide sequence of SEQ ID NO:2 and the polypeptide sequence of a selected targeting protein; and (b) a binding moiety that comprises an adapter protein, the adapter protein having a thiol group; the biological conjugate having a disulfide bond between the thiol group of SEQ ID NO:2 and the thiol group of the adapter protein. The present invention is also directed to biological sequences employed in the above biological conjugates, as well as pharmaceutical preparations and methods using the above biological conjugates.
Cysteine-containing peptide tag for site-specific conjugation of proteins
Backer, Marina V.; Backer, Joseph M.
2010-10-05
The present invention is directed to a biological conjugate, comprising: (a) a targeting moiety comprising a polypeptide having an amino acid sequence comprising the polypeptide sequence of SEQ ID NO:2 and the polypeptide sequence of a selected targeting protein; and (b) a binding moiety bound to the targeting moiety; the biological conjugate having a covalent bond between the thiol group of SEQ ID NO:2 and a functional group in the binding moiety. The present invention is directed to a biological conjugate, comprising: (a) a targeting moiety comprising a polypeptide having an amino acid sequence comprising the polypeptide sequence of SEQ ID NO:2 and the polypeptide sequence of a selected targeting protein; and (b) a binding moiety that comprises an adapter protein, the adapter protein having a thiol group; the biological conjugate having a disulfide bond between the thiol group of SEQ ID NO:2 and the thiol group of the adapter protein. The present invention is also directed to biological sequences employed in the above biological conjugates, as well as pharmaceutical preparations and methods using the above biological conjugates.
Monoterpene synthases from common sage (Salvia officinalis)
Croteau, Rodney Bruce; Wise, Mitchell Lynn; Katahira, Eva Joy; Savage, Thomas Jonathan
1999-01-01
cDNAs encoding (+)-bornyl diphosphate synthase, 1,8-cineole synthase and (+)-sabinene synthase from common sage (Salvia officinalis) have been isolated and sequenced, and the corresponding amino acid sequences has been determined. Accordingly, isolated DNA sequences (SEQ ID No:1; SEQ ID No:3 and SEQ ID No:5) are provided which code for the expression of (+)-bornyl diphosphate synthase (SEQ ID No:2), 1,8-cineole synthase (SEQ ID No:4) and (+)-sabinene synthase SEQ ID No:6), respectively, from sage (Salvia officinalis). In other aspects, replicable recombinant cloning vehicles are provided which code for (+)-bornyl diphosphate synthase, 1,8-cineole synthase or (+)-sabinene synthase, or for a base sequence sufficiently complementary to at least a portion of (+)-bornyl diphosphate synthase, 1,8-cineole synthase or (+)-sabinene synthase DNA or RNA to enable hybridization therewith. In yet other aspects, modified host cells are provided that have been transformed, transfected, infected and/or injected with a recombinant cloning vehicle and/or DNA sequence encoding (+)-bornyl diphosphate synthase, 1,8-cineole synthase or (+)-sabinene synthase. Thus, systems and methods are provided for the recombinant expression of the aforementioned recombinant monoterpene synthases that may be used to facilitate their production, isolation and purification in significant amounts. Recombinant (+)-bornyl diphosphate synthase, 1,8-cineole synthase and (+)-sabinene synthase may be used to obtain expression or enhanced expression of (+)-bornyl diphosphate synthase, 1,8-cineole synthase and (+)-sabinene synthase in plants in order to enhance the production of monoterpenoids, or may be otherwise employed for the regulation or expression of (+)-bornyl diphosphate synthase, 1,8-cineole synthase and (+)-sabinene synthase, or the production of their products.
Switchgrass ubiquitin promoter (PVUBI2) and uses thereof
Stewart, C. Neal; Mann, David George James
2013-12-10
The subject application provides polynucleotides, compositions thereof and methods for regulating gene expression in a plant. Polynucleotides disclosed herein comprise novel sequences for a promoter isolated from Panicum virgatum (switchgrass) that initiates transcription of an operably linked nucleotide sequence. Thus, various embodiments of the invention comprise the nucleotide sequence of SEQ ID NO: 2 or fragments thereof comprising nucleotides 1 to 692 of SEQ ID NO: 2 that are capable of driving the expression of an operably linked nucleic acid sequence.
A non-parametric peak calling algorithm for DamID-Seq.
Li, Renhua; Hempel, Leonie U; Jiang, Tingbo
2015-01-01
Protein-DNA interactions play a significant role in gene regulation and expression. In order to identify transcription factor binding sites (TFBS) of double sex (DSX)-an important transcription factor in sex determination, we applied the DNA adenine methylation identification (DamID) technology to the fat body tissue of Drosophila, followed by deep sequencing (DamID-Seq). One feature of DamID-Seq data is that induced adenine methylation signals are not assured to be symmetrically distributed at TFBS, which renders the existing peak calling algorithms for ChIP-Seq, including SPP and MACS, inappropriate for DamID-Seq data. This challenged us to develop a new algorithm for peak calling. A challenge in peaking calling based on sequence data is estimating the averaged behavior of background signals. We applied a bootstrap resampling method to short sequence reads in the control (Dam only). After data quality check and mapping reads to a reference genome, the peaking calling procedure compromises the following steps: 1) reads resampling; 2) reads scaling (normalization) and computing signal-to-noise fold changes; 3) filtering; 4) Calling peaks based on a statistically significant threshold. This is a non-parametric method for peak calling (NPPC). We also used irreproducible discovery rate (IDR) analysis, as well as ChIP-Seq data to compare the peaks called by the NPPC. We identified approximately 6,000 peaks for DSX, which point to 1,225 genes related to the fat body tissue difference between female and male Drosophila. Statistical evidence from IDR analysis indicated that these peaks are reproducible across biological replicates. In addition, these peaks are comparable to those identified by use of ChIP-Seq on S2 cells, in terms of peak number, location, and peaks width.
Fidantsef, Ana; Lamsa, Michael; Gorre-Clancy, Brian
2015-07-14
The present invention relates to variants of a parent beta-glucosidase, comprising a substitution at one or more positions corresponding to positions 142, 183, 266, and 703 of amino acids 1 to 842 of SEQ ID NO: 2 or corresponding to positions 142, 183, 266, and 705 of amino acids 1 to 844 of SEQ ID NO: 70, wherein the variant has beta-glucosidase activity. The present invention also relates to nucleotide sequences encoding the variant beta-glucosidases and to nucleic acid constructs, vectors, and host cells comprising the nucleotide sequences.
Fidantsef, Ana; Lamsa, Michael; Gorre-Clancy, Brian
2014-10-07
The present invention relates to variants of a parent beta-glucosidase, comprising a substitution at one or more positions corresponding to positions 142, 183, 266, and 703 of amino acids 1 to 842 of SEQ ID NO: 2 or corresponding to positions 142, 183, 266, and 705 of amino acids 1 to 844 of SEQ ID NO: 70, wherein the variant has beta-glucosidase activity. The present invention also relates to nucleotide sequences encoding the variant beta-glucosidases and to nucleic acid constructs, vectors, and host cells comprising the nucleotide sequences.
Fidantsef, Ana [Davis, CA; Lamsa, Michael [Davis, CA; Gorre-Clancy, Brian [Elk Grove, CA
2009-12-29
The present invention relates to variants of a parent beta-glucosidase, comprising a substitution at one or more positions corresponding to positions 142, 183, 266, and 703 of amino acids 1 to 842 of SEQ ID NO: 2 or corresponding to positions 142, 183, 266, and 705 of amino acids 1 to 844 of SEQ ID NO: 70, wherein the variant has beta-glucosidase activity. The present invention also relates to nucleotide sequences encoding the variant beta-glucosidases and to nucleic acid constructs, vectors, and host cells comprising the nucleotide sequences.
Variants of glycoside hydrolases
Teter, Sarah; Ward, Connie; Cherry, Joel; Jones, Aubrey; Harris, Paul; Yi, Jung
2013-02-26
The present invention relates to variants of a parent glycoside hydrolase, comprising a substitution at one or more positions corresponding to positions 21, 94, 157, 205, 206, 247, 337, 350, 373, 383, 438, 455, 467, and 486 of amino acids 1 to 513 of SEQ ID NO: 2, and optionally further comprising a substitution at one or more positions corresponding to positions 8, 22, 41, 49, 57, 113, 193, 196, 226, 227, 246, 251, 255, 259, 301, 356, 371, 411, and 462 of amino acids 1 to 513 of SEQ ID NO: 2 a substitution at one or more positions corresponding to positions 8, 22, 41, 49, 57, 113, 193, 196, 226, 227, 246, 251, 255, 259, 301, 356, 371, 411, and 462 of amino acids 1 to 513 of SEQ ID NO: 2, wherein the variants have glycoside hydrolase activity. The present invention also relates to nucleotide sequences encoding the variant glycoside hydrolases and to nucleic acid constructs, vectors, and host cells comprising the nucleotide sequences.
Variants of glycoside hydrolases
Teter, Sarah [Davis, CA; Ward, Connie [Hamilton, MT; Cherry, Joel [Davis, CA; Jones, Aubrey [Davis, CA; Harris, Paul [Carnation, WA; Yi, Jung [Sacramento, CA
2011-04-26
The present invention relates to variants of a parent glycoside hydrolase, comprising a substitution at one or more positions corresponding to positions 21, 94, 157, 205, 206, 247, 337, 350, 373, 383, 438, 455, 467, and 486 of amino acids 1 to 513 of SEQ ID NO: 2, and optionally further comprising a substitution at one or more positions corresponding to positions 8, 22, 41, 49, 57, 113, 193, 196, 226, 227, 246, 251, 255, 259, 301, 356, 371, 411, and 462 of amino acids 1 to 513 of SEQ ID NO: 2 a substitution at one or more positions corresponding to positions 8, 22, 41, 49, 57, 113, 193, 196, 226, 227, 246, 251, 255, 259, 301, 356, 371, 411, and 462 of amino acids 1 to 513 of SEQ ID NO: 2, wherein the variants have glycoside hydrolase activity. The present invention also relates to nucleotide sequences encoding the variant glycoside hydrolases and to nucleic acid constructs, vectors, and host cells comprising the nucleotide sequences.
Variants of glycoside hydrolases
Teter, Sarah; Ward, Connie; Cherry, Joel; Jones, Aubrey; Harris, Paul; Yi, Jung
2017-07-11
The present invention relates to variants of a parent glycoside hydrolase, comprising a substitution at one or more positions corresponding to positions 21, 94, 157, 205, 206, 247, 337, 350, 373, 383, 438, 455, 467, and 486 of amino acids 1 to 513 of SEQ ID NO: 2, and optionally further comprising a substitution at one or more positions corresponding to positions 8, 22, 41, 49, 57, 113, 193, 196, 226, 227, 246, 251, 255, 259, 301, 356, 371, 411, and 462 of amino acids 1 to 513 of SEQ ID NO: 2 a substitution at one or more positions corresponding to positions 8, 22, 41, 49, 57, 113, 193, 196, 226, 227, 246, 251, 255, 259, 301, 356, 371, 411, and 462 of amino acids 1 to 513 of SEQ ID NO: 2, wherein the variants have glycoside hydrolase activity. The present invention also relates to nucleotide sequences encoding the variant glycoside hydrolases and to nucleic acid constructs, vectors, and host cells comprising the nucleotide sequences.
Gene encoding herbicide safener binding protein
Walton, Jonathan D.; Scott-Craig, John S.
1999-01-01
The cDNA encoding safener binding protein (SafBP), also referred to as SBP1, is set forth in FIG. 5 and SEQ ID No. 1. The deduced amino acid sequence is provided in FIG. 5 and SEQ ID No. 2. Methods of making and using SBP1 and SafBP to alter a plant's sensitivity to certain herbicides or a plant's responsiveness to certain safeners are also provided, as well as expression vectors, transgenic plants or other organisms transfected with said vectors and seeds from said plants.
Nucleic Acid Encoding A Lectin-Derived Progenitor Cell Preservation Factor
Colucci, M. Gabriella; Chrispeels, Maarten J.; Moore, Jeffrey G.
2001-10-30
The invention relates to an isolated nucleic acid molecule that encodes a protein that is effective to preserve progenitor cells, such as hematopoietic progenitor cells. The nucleic acid comprises a sequence defined by SEQ ID NO:1, a homolog thereof, or a fragment thereof. The encoded protein has an amino acid sequence that comprises a sequence defined by SEQ ID NO:2, a homolog thereof, or a fragment thereof that contains an amino acid sequence TNNVLQVT. Methods of using the encoded protein for preserving progenitor cells in vitro, ex vivo, and in vivo are also described. The invention, therefore, include methods such as myeloablation therapies for cancer treatment wherein myeloid reconstitution is facilitated by means of the specified protein. Other therapeutic utilities are also enabled through the invention, for example, expanding progenitor cell populations ex vivo to increase chances of engraftation, improving conditions for transporting and storing progenitor cells, and facilitating gene therapy to treat and cure a broad range of life-threatening hematologic diseases.
Karin, Michael; Hibi, Masahiko; Lin, Anning
2002-01-29
The present invention provides an isolated polynucleotide encoding a c-Jun peptide consisting of about amino acid residues 33 to 79 as set fort in SEQ ID NO: 10 or conservative variations thereof. The invention also provides a method for producing a peptide of SEQ ID NO:1 comprising (a) culturing a host cell containing a polynucleotide encoding a c-Jun peptide consisting of about amino acid residues 33 to 79 as set forth in SEQ ID NO: 10 under conditions which allow expression of the polynucleotide; and (b) obtaining the peptide of SEQ ID NO:1.
Exome Pool-Seq in neurodevelopmental disorders.
Popp, Bernt; Ekici, Arif B; Thiel, Christian T; Hoyer, Juliane; Wiesener, Antje; Kraus, Cornelia; Reis, André; Zweier, Christiane
2017-12-01
High throughput sequencing has greatly advanced disease gene identification, especially in heterogeneous entities. Despite falling costs this is still an expensive and laborious technique, particularly when studying large cohorts. To address this problem we applied Exome Pool-Seq as an economic and fast screening technology in neurodevelopmental disorders (NDDs). Sequencing of 96 individuals can be performed in eight pools of 12 samples on less than one Illumina sequencer lane. In a pilot study with 96 cases we identified 27 variants, likely or possibly affecting function. Twenty five of these were identified in 923 established NDD genes (based on SysID database, status November 2016) (ACTB, AHDC1, ANKRD11, ATP6V1B2, ATRX, CASK, CHD8, GNAS, IFIH1, KCNQ2, KMT2A, KRAS, MAOA, MED12, MED13L, RIT1, SETD5, SIN3A, TCF4, TRAPPC11, TUBA1A, WAC, ZBTB18, ZMYND11), two in 543 (SysID) candidate genes (ZNF292, BPTF), and additionally a de novo loss-of-function variant in LRRC7, not previously implicated in NDDs. Most of them were confirmed to be de novo, but we also identified X-linked or autosomal-dominantly or autosomal-recessively inherited variants. With a detection rate of 28%, Exome Pool-Seq achieves comparable results to individual exome analyses but reduces costs by >85%. Compared with other large scale approaches using Molecular Inversion Probes (MIP) or gene panels, it allows flexible re-analysis of data. Exome Pool-Seq is thus well suited for large-scale, cost-efficient and flexible screening in characterized but heterogeneous entities like NDDs.
Nucleic acid molecules encoding isopentenyl monophosphate kinase, and methods of use
Croteau, Rodney B.; Lange, Bernd M.
2001-01-01
A cDNA encoding isopentenyl monophosphate kinase (IPK) from peppermint (Mentha x piperita) has been isolated and sequenced, and the corresponding amino acid sequence has been determined. Accordingly, an isolated DNA sequence (SEQ ID NO:1) is provided which codes for the expression of isopentenyl monophosphate kinase (SEQ ID NO:2), from peppermint (Mentha x piperita). In other aspects, replicable recombinant cloning vehicles are provided which code for isopentenyl monophosphate kinase, or for a base sequence sufficiently complementary to at least a portion of isopentenyl monophosphate kinase DNA or RNA to enable hybridization therewith. In yet other aspects, modified host cells are provided that have been transformed, transfected, infected and/or injected with a recombinant cloning vehicle and/or DNA sequence encoding isopentenyl monophosphate kinase. Thus, systems and methods are provided for the recombinant expression of the aforementioned recombinant isopentenyl monophosphate kinase that may be used to facilitate its production, isolation and purification in significant amounts. Recombinant isopentenyl monophosphate kinase may be used to obtain expression or enhanced expression of isopentenyl monophosphate kinase in plants in order to enhance the production of isopentenyl monophosphate kinase, or isoprenoids derived therefrom, or may be otherwise employed for the regulation or expression of isopentenyl monophosphate kinase, or the production of its products.
Croteau, Rodney Bruce; Wildung, Mark Raymond; Crock, John E.
1999-01-01
A cDNA encoding (E)-.beta.-farnesene synthase from peppermint (Mentha piperita) has been isolated and sequenced, and the corresponding amino acid sequence has been determined. Accordingly, an isolated DNA sequence (SEQ ID NO:1) is provided which codes for the expression of (E)-.beta.-farnesene synthase (SEQ ID NO:2), from peppermint (Mentha piperita). In other aspects, replicable recombinant cloning vehicles are provided which code for (E)-.beta.-farnesene synthase, or for a base sequence sufficiently complementary to at least a portion of (E)-.beta.-farnesene synthase DNA or RNA to enable hybridization therewith. In yet other aspects, modified host cells are provided that have been transformed, transfected, infected and/or injected with a recombinant cloning vehicle and/or DNA sequence encoding (E)-.beta.-farnesene synthase. Thus, systems and methods are provided for the recombinant expression of the aforementioned recombinant (E)-.beta.-farnesene synthase that may be used to facilitate its production, isolation and purification in significant amounts. Recombinant (E)-.beta.-farnesene synthase may be used to obtain expression or enhanced expression of (E)-.beta.-farnesene synthase in plants in order to enhance the production of (E)-.beta.-farnesene, or may be otherwise employed for the regulation or expression of (E)-.beta.-farnesene synthase, or the production of its product.
Croteau, Rodney Bruce; Crock, John E.
2005-01-25
A cDNA encoding (E)-.beta.-farnesene synthase from peppermint (Mentha piperita) has been isolated and sequenced, and the corresponding amino acid sequence has been determined. Accordingly, an isolated DNA sequence (SEQ ID NO:1) is provided which codes for the expression of (E)-.beta.-farnesene synthase (SEQ ID NO:2), from peppermint (Mentha piperita). In other aspects, replicable recombinant cloning vehicles are provided which code for (E)-.beta.-farnesene synthase, or for a base sequence sufficiently complementary to at least a portion of (E)-.beta.-farnesene synthase DNA or RNA to enable hybridization therewith. In yet other aspects, modified host cells are provided that have been transformed, transfected, infected and/or injected with a recombinant cloning vehicle and/or DNA sequence encoding (E)-.beta.-farnesene synthase. Thus, systems and methods are provided for the recombinant expression of the aforementioned recombinant (E)-.beta.-famesene synthase that may be used to facilitate its production, isolation and purification in significant amounts. Recombinant (E)-.beta.-farnesene synthase may be used to obtain expression or enhanced expression of (E)-.beta.-famesene synthase in plants in order to enhance the production of (E)-.beta.-farnesene, or may be otherwise employed for the regulation or expression of (E)-.beta.-farnesene synthase, or the production of its product.
Geranyl diphosphate synthase from mint
Croteau, Rodney Bruce; Wildung, Mark Raymond; Burke, Charles Cullen; Gershenzon, Jonathan
1999-01-01
A cDNA encoding geranyl diphosphate synthase from peppermint has been isolated and sequenced, and the corresponding amino acid sequence has been determined. Accordingly, an isolated DNA sequence (SEQ ID No:1) is provided which codes for the expression of geranyl diphosphate synthase (SEQ ID No:2) from peppermint (Mentha piperita). In other aspects, replicable recombinant cloning vehicles are provided which code for geranyl diphosphate synthase or for a base sequence sufficiently complementary to at least a portion of the geranyl diphosphate synthase DNA or RNA to enable hybridization therewith (e.g., antisense geranyl diphosphate synthase RNA or fragments of complementary geranyl diphosphate synthase DNA which are useful as polymerase chain reaction primers or as probes for geranyl diphosphate synthase or related genes). In yet other aspects, modified host cells are provided that have been transformed, transfected, infected and/or injected with a recombinant cloning vehicle and/or DNA sequence encoding geranyl diphosphate synthase. Thus, systems and methods are provided for the recombinant expression of geranyl diphosphate synthase that may be used to facilitate the production, isolation and purification of significant quantities of recombinant geranyl diphosphate synthase for subsequent use, to obtain expression or enhanced expression of geranyl diphosphate synthase in plants in order to enhance the production of monoterpenoids, to produce geranyl diphosphate in cancerous cells as a precursor to monoterpenoids having anti-cancer properties or may be otherwise employed for the regulation or expression of geranyl diphosphate synthase or the production of geranyl diphosphate.
Geranyl diphosphate synthase from mint
Croteau, R.B.; Wildung, M.R.; Burke, C.C.; Gershenzon, J.
1999-03-02
A cDNA encoding geranyl diphosphate synthase from peppermint has been isolated and sequenced, and the corresponding amino acid sequence has been determined. Accordingly, an isolated DNA sequence (SEQ ID No:1) is provided which codes for the expression of geranyl diphosphate synthase (SEQ ID No:2) from peppermint (Mentha piperita). In other aspects, replicable recombinant cloning vehicles are provided which code for geranyl diphosphate synthase or for a base sequence sufficiently complementary to at least a portion of the geranyl diphosphate synthase DNA or RNA to enable hybridization therewith (e.g., antisense geranyl diphosphate synthase RNA or fragments of complementary geranyl diphosphate synthase DNA which are useful as polymerase chain reaction primers or as probes for geranyl diphosphate synthase or related genes). In yet other aspects, modified host cells are provided that have been transformed, transfected, infected and/or injected with a recombinant cloning vehicle and/or DNA sequence encoding geranyl diphosphate synthase. Thus, systems and methods are provided for the recombinant expression of geranyl diphosphate synthase that may be used to facilitate the production, isolation and purification of significant quantities of recombinant geranyl diphosphate synthase for subsequent use, to obtain expression or enhanced expression of geranyl diphosphate synthase in plants in order to enhance the production of monoterpenoids, to produce geranyl diphosphate in cancerous cells as a precursor to monoterpenoids having anti-cancer properties or may be otherwise employed for the regulation or expression of geranyl diphosphate synthase or the production of geranyl diphosphate. 5 figs.
Statistical models for RNA-seq data derived from a two-condition 48-replicate experiment.
Gierliński, Marek; Cole, Christian; Schofield, Pietà; Schurch, Nicholas J; Sherstnev, Alexander; Singh, Vijender; Wrobel, Nicola; Gharbi, Karim; Simpson, Gordon; Owen-Hughes, Tom; Blaxter, Mark; Barton, Geoffrey J
2015-11-15
High-throughput RNA sequencing (RNA-seq) is now the standard method to determine differential gene expression. Identifying differentially expressed genes crucially depends on estimates of read-count variability. These estimates are typically based on statistical models such as the negative binomial distribution, which is employed by the tools edgeR, DESeq and cuffdiff. Until now, the validity of these models has usually been tested on either low-replicate RNA-seq data or simulations. A 48-replicate RNA-seq experiment in yeast was performed and data tested against theoretical models. The observed gene read counts were consistent with both log-normal and negative binomial distributions, while the mean-variance relation followed the line of constant dispersion parameter of ∼0.01. The high-replicate data also allowed for strict quality control and screening of 'bad' replicates, which can drastically affect the gene read-count distribution. RNA-seq data have been submitted to ENA archive with project ID PRJEB5348. g.j.barton@dundee.ac.uk. © The Author 2015. Published by Oxford University Press.
Statistical models for RNA-seq data derived from a two-condition 48-replicate experiment
Cole, Christian; Schofield, Pietà; Schurch, Nicholas J.; Sherstnev, Alexander; Singh, Vijender; Wrobel, Nicola; Gharbi, Karim; Simpson, Gordon; Owen-Hughes, Tom; Blaxter, Mark; Barton, Geoffrey J.
2015-01-01
Motivation: High-throughput RNA sequencing (RNA-seq) is now the standard method to determine differential gene expression. Identifying differentially expressed genes crucially depends on estimates of read-count variability. These estimates are typically based on statistical models such as the negative binomial distribution, which is employed by the tools edgeR, DESeq and cuffdiff. Until now, the validity of these models has usually been tested on either low-replicate RNA-seq data or simulations. Results: A 48-replicate RNA-seq experiment in yeast was performed and data tested against theoretical models. The observed gene read counts were consistent with both log-normal and negative binomial distributions, while the mean-variance relation followed the line of constant dispersion parameter of ∼0.01. The high-replicate data also allowed for strict quality control and screening of ‘bad’ replicates, which can drastically affect the gene read-count distribution. Availability and implementation: RNA-seq data have been submitted to ENA archive with project ID PRJEB5348. Contact: g.j.barton@dundee.ac.uk PMID:26206307
1-deoxy-d-xylulose-5-phosphate reductoisomerases and method of use
Croteau, Rodney B.; Lange, Bernd M.
2001-01-01
The present invention relates to isolated DNA sequences which code for the expression of plant 1-deoxy-D-xylulose-5-phosphate reductoisomerase protein, such as the sequence presented in SEQ ID NO:1 which encodes a 1-deoxy-D-xylulose-5-phosphate reductoisomerase protein from peppermint (Mentha x piperita). Additionally, the present invention relates to isolated plant 1-deoxy-D-xylulose-5-phosphate reductoisomerase protein. In other aspects, the present invention is directed to replicable recombinant cloning vehicles comprising a nucleic acid sequence which codes for a plant 1-deoxy-D-xylulose-5-phosphate reductoisomerase, to modified host cells transformed, transfected, infected and/or injected with a recombinant cloning vehicle and/or DNA sequence of the invention.
1-deoxy-D-xylulose-5-phosphate reductoisomerases, and methods of use
Croteau, Rodney B.; Lange, Bernd M.
2002-07-16
The present invention relates to isolated DNA sequences which code for the expression of plant 1-deoxy-D-xylulose-5-phosphate reductoisomerase protein, such as the sequence presented in SEQ ID NO:1 which encodes a 1-deoxy-D-xylulose-5-phosphate reductoisomerase protein from peppermint (Mentha x piperita). Additionally, the present invention relates to isolated plant 1-deoxy-D-xylulose-5-phosphate reductoisomerase protein. In other aspects, the present invention is directed to replicable recombinant cloning vehicles comprising a nucleic acid sequence which codes for a plant 1-deoxy-D-xylulose-5-phosphate reductoisomerase, to modified host cells transformed, transfected, infected and/or injected with a recombinant cloning vehicle and/or DNA sequence of the invention.
Brett, Maggie; McPherson, John; Zang, Zhi Jiang; Lai, Angeline; Tan, Ee-Shien; Ng, Ivy; Ong, Lai-Choo; Cham, Breana; Tan, Patrick; Rozen, Steve; Tan, Ene-Choo
2014-01-01
Developmental delay and/or intellectual disability (DD/ID) affects 1–3% of all children. At least half of these are thought to have a genetic etiology. Recent studies have shown that massively parallel sequencing (MPS) using a targeted gene panel is particularly suited for diagnostic testing for genetically heterogeneous conditions. We report on our experiences with using massively parallel sequencing of a targeted gene panel of 355 genes for investigating the genetic etiology of eight patients with a wide range of phenotypes including DD/ID, congenital anomalies and/or autism spectrum disorder. Targeted sequence enrichment was performed using the Agilent SureSelect Target Enrichment Kit and sequenced on the Illumina HiSeq2000 using paired-end reads. For all eight patients, 81–84% of the targeted regions achieved read depths of at least 20×, with average read depths overlapping targets ranging from 322× to 798×. Causative variants were successfully identified in two of the eight patients: a nonsense mutation in the ATRX gene and a canonical splice site mutation in the L1CAM gene. In a third patient, a canonical splice site variant in the USP9X gene could likely explain all or some of her clinical phenotypes. These results confirm the value of targeted MPS for investigating DD/ID in children for diagnostic purposes. However, targeted gene MPS was less likely to provide a genetic diagnosis for children whose phenotype includes autism. PMID:24690944
Compositions and methods for improved protein production
Bodie, Elizabeth A [San Carlos, CA; Kim, Steve [San Francisco, CA
2012-07-10
The present invention relates to the identification of novel nucleic acid sequences, designated herein as 7p, 8k, 7E, 9G, 8Q and 203, in a host cell which effect protein production. The present invention also provides host cells having a mutation or deletion of part or all of the gene encoding 7p, 8k, 7E, 9G, 8Q and 203, which are presented in FIG. 1, and are SEQ ID NOS.: 1-6, respectively. The present invention also provides host cells further comprising a nucleic acid encoding a desired heterologous protein such as an enzyme.
Compositions and methods for improved protein production
Bodie, Elizabeth A.; Kim, Steve Sungjin
2014-06-03
The present invention relates to the identification of novel nucleic acid sequences, designated herein as 7p, 8k, 7E, 9G, 8Q and 203, in a host cell which effect protein production. The present invention also provides host cells having a mutation or deletion of part or all of the gene encoding 7p, 8k, 7E, 9G, 8Q and 203, which are presented in FIG. 1, and are SEQ ID NOS.: 1-6, respectively. The present invention also provides host cells further comprising a nucleic acid encoding a desired heterologous protein such as an enzyme.
Chavan, Shweta S; Bauer, Michael A; Peterson, Erich A; Heuck, Christoph J; Johann, Donald J
2013-01-01
Transcriptome analysis by microarrays has produced important advances in biomedicine. For instance in multiple myeloma (MM), microarray approaches led to the development of an effective disease subtyping via cluster assignment, and a 70 gene risk score. Both enabled an improved molecular understanding of MM, and have provided prognostic information for the purposes of clinical management. Many researchers are now transitioning to Next Generation Sequencing (NGS) approaches and RNA-seq in particular, due to its discovery-based nature, improved sensitivity, and dynamic range. Additionally, RNA-seq allows for the analysis of gene isoforms, splice variants, and novel gene fusions. Given the voluminous amounts of historical microarray data, there is now a need to associate and integrate microarray and RNA-seq data via advanced bioinformatic approaches. Custom software was developed following a model-view-controller (MVC) approach to integrate Affymetrix probe set-IDs, and gene annotation information from a variety of sources. The tool/approach employs an assortment of strategies to integrate, cross reference, and associate microarray and RNA-seq datasets. Output from a variety of transcriptome reconstruction and quantitation tools (e.g., Cufflinks) can be directly integrated, and/or associated with Affymetrix probe set data, as well as necessary gene identifiers and/or symbols from a diversity of sources. Strategies are employed to maximize the annotation and cross referencing process. Custom gene sets (e.g., MM 70 risk score (GEP-70)) can be specified, and the tool can be directly assimilated into an RNA-seq pipeline. A novel bioinformatic approach to aid in the facilitation of both annotation and association of historic microarray data, in conjunction with richer RNA-seq data, is now assisting with the study of MM cancer biology.
Langley, Alexander R.; Gräf, Stefan; Smith, James C.; Krude, Torsten
2016-01-01
Next-generation sequencing has enabled the genome-wide identification of human DNA replication origins. However, different approaches to mapping replication origins, namely (i) sequencing isolated small nascent DNA strands (SNS-seq); (ii) sequencing replication bubbles (bubble-seq) and (iii) sequencing Okazaki fragments (OK-seq), show only limited concordance. To address this controversy, we describe here an independent high-resolution origin mapping technique that we call initiation site sequencing (ini-seq). In this approach, newly replicated DNA is directly labelled with digoxigenin-dUTP near the sites of its initiation in a cell-free system. The labelled DNA is then immunoprecipitated and genomic locations are determined by DNA sequencing. Using this technique we identify >25,000 discrete origin sites at sub-kilobase resolution on the human genome, with high concordance between biological replicates. Most activated origins identified by ini-seq are found at transcriptional start sites and contain G-quadruplex (G4) motifs. They tend to cluster in early-replicating domains, providing a correlation between early replication timing and local density of activated origins. Origins identified by ini-seq show highest concordance with sites identified by SNS-seq, followed by OK-seq and bubble-seq. Furthermore, germline origins identified by positive nucleotide distribution skew jumps overlap with origins identified by ini-seq and OK-seq more frequently and more specifically than do sites identified by either SNS-seq or bubble-seq. PMID:27587586
Langley, Alexander R; Gräf, Stefan; Smith, James C; Krude, Torsten
2016-12-01
Next-generation sequencing has enabled the genome-wide identification of human DNA replication origins. However, different approaches to mapping replication origins, namely (i) sequencing isolated small nascent DNA strands (SNS-seq); (ii) sequencing replication bubbles (bubble-seq) and (iii) sequencing Okazaki fragments (OK-seq), show only limited concordance. To address this controversy, we describe here an independent high-resolution origin mapping technique that we call initiation site sequencing (ini-seq). In this approach, newly replicated DNA is directly labelled with digoxigenin-dUTP near the sites of its initiation in a cell-free system. The labelled DNA is then immunoprecipitated and genomic locations are determined by DNA sequencing. Using this technique we identify >25,000 discrete origin sites at sub-kilobase resolution on the human genome, with high concordance between biological replicates. Most activated origins identified by ini-seq are found at transcriptional start sites and contain G-quadruplex (G4) motifs. They tend to cluster in early-replicating domains, providing a correlation between early replication timing and local density of activated origins. Origins identified by ini-seq show highest concordance with sites identified by SNS-seq, followed by OK-seq and bubble-seq. Furthermore, germline origins identified by positive nucleotide distribution skew jumps overlap with origins identified by ini-seq and OK-seq more frequently and more specifically than do sites identified by either SNS-seq or bubble-seq. © The Author(s) 2016. Published by Oxford University Press on behalf of Nucleic Acids Research.
Comparison and evaluation of two exome capture kits and sequencing platforms for variant calling.
Zhang, Guoqiang; Wang, Jianfeng; Yang, Jin; Li, Wenjie; Deng, Yutian; Li, Jing; Huang, Jun; Hu, Songnian; Zhang, Bing
2015-08-05
To promote the clinical application of next-generation sequencing, it is important to obtain accurate and consistent variants of target genomic regions at low cost. Ion Proton, the latest updated semiconductor-based sequencing instrument from Life Technologies, is designed to provide investigators with an inexpensive platform for human whole exome sequencing that achieves a rapid turnaround time. However, few studies have comprehensively compared and evaluated the accuracy of variant calling between Ion Proton and Illumina sequencing platforms such as HiSeq 2000, which is the most popular sequencing platform for the human genome. The Ion Proton sequencer combined with the Ion TargetSeq Exome Enrichment Kit together make up TargetSeq-Proton, whereas SureSelect-Hiseq is based on the Agilent SureSelect Human All Exon v4 Kit and the HiSeq 2000 sequencer. Here, we sequenced exonic DNA from four human blood samples using both TargetSeq-Proton and SureSelect-HiSeq. We then called variants in the exonic regions that overlapped between the two exome capture kits (33.6 Mb). The rates of shared variant loci called by two sequencing platforms were from 68.0 to 75.3% in four samples, whereas the concordance of co-detected variant loci reached 99%. Sanger sequencing validation revealed that the validated rate of concordant single nucleotide polymorphisms (SNPs) (91.5%) was higher than the SNPs specific to TargetSeq-Proton (60.0%) or specific to SureSelect-HiSeq (88.3%). With regard to 1-bp small insertions and deletions (InDels), the Sanger sequencing validated rates of concordant variants (100.0%) and SureSelect-HiSeq-specific (89.6%) were higher than those of TargetSeq-Proton-specific (15.8%). In the sequencing of exonic regions, a combination of using of two sequencing strategies (SureSelect-HiSeq and TargetSeq-Proton) increased the variant calling specificity for concordant variant loci and the sensitivity for variant loci called by any one platform. However, for the sequencing of platform-specific variants, the accuracy of variant calling by HiSeq 2000 was higher than that of Ion Proton, specifically for the InDel detection. Moreover, the variant calling software also influences the detection of SNPs and, specifically, InDels in Ion Proton exome sequencing.
SeqLib: a C ++ API for rapid BAM manipulation, sequence alignment and sequence assembly
Wala, Jeremiah; Beroukhim, Rameen
2017-01-01
Abstract We present SeqLib, a C ++ API and command line tool that provides a rapid and user-friendly interface to BAM/SAM/CRAM files, global sequence alignment operations and sequence assembly. Four C libraries perform core operations in SeqLib: HTSlib for BAM access, BWA-MEM and BLAT for sequence alignment and Fermi for error correction and sequence assembly. Benchmarking indicates that SeqLib has lower CPU and memory requirements than leading C ++ sequence analysis APIs. We demonstrate an example of how minimal SeqLib code can extract, error-correct and assemble reads from a CRAM file and then align with BWA-MEM. SeqLib also provides additional capabilities, including chromosome-aware interval queries and read plotting. Command line tools are available for performing integrated error correction, micro-assemblies and alignment. Availability and Implementation: SeqLib is available on Linux and OSX for the C ++98 standard and later at github.com/walaj/SeqLib. SeqLib is released under the Apache2 license. Additional capabilities for BLAT alignment are available under the BLAT license. Contact: jwala@broadinstitue.org; rameen@broadinstitute.org PMID:28011768
SeqLib: a C ++ API for rapid BAM manipulation, sequence alignment and sequence assembly.
Wala, Jeremiah; Beroukhim, Rameen
2017-03-01
We present SeqLib, a C ++ API and command line tool that provides a rapid and user-friendly interface to BAM/SAM/CRAM files, global sequence alignment operations and sequence assembly. Four C libraries perform core operations in SeqLib: HTSlib for BAM access, BWA-MEM and BLAT for sequence alignment and Fermi for error correction and sequence assembly. Benchmarking indicates that SeqLib has lower CPU and memory requirements than leading C ++ sequence analysis APIs. We demonstrate an example of how minimal SeqLib code can extract, error-correct and assemble reads from a CRAM file and then align with BWA-MEM. SeqLib also provides additional capabilities, including chromosome-aware interval queries and read plotting. Command line tools are available for performing integrated error correction, micro-assemblies and alignment. SeqLib is available on Linux and OSX for the C ++98 standard and later at github.com/walaj/SeqLib. SeqLib is released under the Apache2 license. Additional capabilities for BLAT alignment are available under the BLAT license. jwala@broadinstitue.org ; rameen@broadinstitute.org. © The Author 2016. Published by Oxford University Press. All rights reserved. For Permissions, please e-mail: journals.permissions@oup.com
GGRNA: an ultrafast, transcript-oriented search engine for genes and transcripts
Naito, Yuki; Bono, Hidemasa
2012-01-01
GGRNA (http://GGRNA.dbcls.jp/) is a Google-like, ultrafast search engine for genes and transcripts. The web server accepts arbitrary words and phrases, such as gene names, IDs, gene descriptions, annotations of gene and even nucleotide/amino acid sequences through one simple search box, and quickly returns relevant RefSeq transcripts. A typical search takes just a few seconds, which dramatically enhances the usability of routine searching. In particular, GGRNA can search sequences as short as 10 nt or 4 amino acids, which cannot be handled easily by popular sequence analysis tools. Nucleotide sequences can be searched allowing up to three mismatches, or the query sequences may contain degenerate nucleotide codes (e.g. N, R, Y, S). Furthermore, Gene Ontology annotations, Enzyme Commission numbers and probe sequences of catalog microarrays are also incorporated into GGRNA, which may help users to conduct searches by various types of keywords. GGRNA web server will provide a simple and powerful interface for finding genes and transcripts for a wide range of users. All services at GGRNA are provided free of charge to all users. PMID:22641850
GGRNA: an ultrafast, transcript-oriented search engine for genes and transcripts.
Naito, Yuki; Bono, Hidemasa
2012-07-01
GGRNA (http://GGRNA.dbcls.jp/) is a Google-like, ultrafast search engine for genes and transcripts. The web server accepts arbitrary words and phrases, such as gene names, IDs, gene descriptions, annotations of gene and even nucleotide/amino acid sequences through one simple search box, and quickly returns relevant RefSeq transcripts. A typical search takes just a few seconds, which dramatically enhances the usability of routine searching. In particular, GGRNA can search sequences as short as 10 nt or 4 amino acids, which cannot be handled easily by popular sequence analysis tools. Nucleotide sequences can be searched allowing up to three mismatches, or the query sequences may contain degenerate nucleotide codes (e.g. N, R, Y, S). Furthermore, Gene Ontology annotations, Enzyme Commission numbers and probe sequences of catalog microarrays are also incorporated into GGRNA, which may help users to conduct searches by various types of keywords. GGRNA web server will provide a simple and powerful interface for finding genes and transcripts for a wide range of users. All services at GGRNA are provided free of charge to all users.
Comparative Analysis of Single-Cell RNA Sequencing Methods.
Ziegenhain, Christoph; Vieth, Beate; Parekh, Swati; Reinius, Björn; Guillaumet-Adkins, Amy; Smets, Martha; Leonhardt, Heinrich; Heyn, Holger; Hellmann, Ines; Enard, Wolfgang
2017-02-16
Single-cell RNA sequencing (scRNA-seq) offers new possibilities to address biological and medical questions. However, systematic comparisons of the performance of diverse scRNA-seq protocols are lacking. We generated data from 583 mouse embryonic stem cells to evaluate six prominent scRNA-seq methods: CEL-seq2, Drop-seq, MARS-seq, SCRB-seq, Smart-seq, and Smart-seq2. While Smart-seq2 detected the most genes per cell and across cells, CEL-seq2, Drop-seq, MARS-seq, and SCRB-seq quantified mRNA levels with less amplification noise due to the use of unique molecular identifiers (UMIs). Power simulations at different sequencing depths showed that Drop-seq is more cost-efficient for transcriptome quantification of large numbers of cells, while MARS-seq, SCRB-seq, and Smart-seq2 are more efficient when analyzing fewer cells. Our quantitative comparison offers the basis for an informed choice among six prominent scRNA-seq methods, and it provides a framework for benchmarking further improvements of scRNA-seq protocols. Copyright © 2017 Elsevier Inc. All rights reserved.
Contemporary microbiology and identification of Corynebacteria spp. causing infections in human.
Zasada, A A; Mosiej, E
2018-06-01
The Corynebacterium is a genus of bacteria of growing clinical importance. Progress in medicine results in growing population of immunocompromised patients and growing number of infections caused by opportunistic pathogens. A new infections caused by new Corynebacterium species and species previously regarded as commensal micro-organisms have been described. Parallel with changes in Corynebacteria infections, the microbiological laboratory diagnostic possibilities are changing. But identification of this group of bacteria to the species level remains difficult. In the paper, we present various manual, semi-automated and automated assays used in clinical laboratories for Corynebacterium identification, such as API Coryne, RapID CB Plus, BBL Crystal Gram Positive ID System, MICRONAUT-RPO, VITEK 2, BD Phoenix System, Sherlock Microbial ID System, MicroSeq Microbial Identification System, Biolog Microbial Identification Systems, MALDI-TOF MS systems, polymerase chain reaction (PCR)-based and sequencing-based assays. The presented assays are based on various properties, like biochemical tests, specific DNA sequences, composition of cellular fatty acids, protein profiles and have specific limitations. The number of opportunistic infections caused by Corynebacteria is increasing due to increase in number of immunocompromised patients. New Corynebacterium species and new human infections, caused by this group of bacteria, has been described recently. However, identification of Corynebacteria is still a challenge despite application of sophisticated laboratory methods. In the study we present possibilities and limitations of various commercial systems for identification of Corynebacteria. © 2018 The Society for Applied Microbiology.
Liu, Yu; Koyutürk, Mehmet; Maxwell, Sean; Xiang, Min; Veigl, Martina; Cooper, Richard S; Tayo, Bamidele O; Li, Li; LaFramboise, Thomas; Wang, Zhenghe; Zhu, Xiaofeng; Chance, Mark R
2014-08-16
Sequences up to several megabases in length have been found to be present in individual genomes but absent in the human reference genome. These sequences may be common in populations, and their absence in the reference genome may indicate rare variants in the genomes of individuals who served as donors for the human genome project. As the reference genome is used in probe design for microarray technology and mapping short reads in next generation sequencing (NGS), this missing sequence could be a source of bias in functional genomic studies and variant analysis. One End Anchor (OEA) and/or orphan reads from paired-end sequencing have been used to identify novel sequences that are absent in reference genome. However, there is no study to investigate the distribution, evolution and functionality of those sequences in human populations. To systematically identify and study the missing common sequences (micSeqs), we extended the previous method by pooling OEA reads from large number of individuals and applying strict filtering methods to remove false sequences. The pipeline was applied to data from phase 1 of the 1000 Genomes Project. We identified 309 micSeqs that are present in at least 1% of the human population, but absent in the reference genome. We confirmed 76% of these 309 micSeqs by comparison to other primate genomes, individual human genomes, and gene expression data. Furthermore, we randomly selected fifteen micSeqs and confirmed their presence using PCR validation in 38 additional individuals. Functional analysis using published RNA-seq and ChIP-seq data showed that eleven micSeqs are highly expressed in human brain and three micSeqs contain transcription factor (TF) binding regions, suggesting they are functional elements. In addition, the identified micSeqs are absent in non-primates and show dynamic acquisition during primate evolution culminating with most micSeqs being present in Africans, suggesting some micSeqs may be important sources of human diversity. 76% of micSeqs were confirmed by a comparative genomics approach. Fourteen micSeqs are expressed in human brain or contain TF binding regions. Some micSeqs are primate-specific, conserved and may play a role in the evolution of primates.
Wetmore, Kelly M.; Price, Morgan N.; Waters, Robert J.; ...
2015-05-12
Transposon mutagenesis with next-generation sequencing (TnSeq) is a powerful approach to annotate gene function in bacteria, but existing protocols for TnSeq require laborious preparation of every sample before sequencing. Thus, the existing protocols are not amenable to the throughput necessary to identify phenotypes and functions for the majority of genes in diverse bacteria. Here, we present a method, random bar code transposon-site sequencing (RB-TnSeq), which increases the throughput of mutant fitness profiling by incorporating random DNA bar codes into Tn5 and mariner transposons and by using bar code sequencing (BarSeq) to assay mutant fitness. RB-TnSeq can be used with anymore » transposon, and TnSeq is performed once per organism instead of once per sample. Each BarSeq assay requires only a simple PCR, and 48 to 96 samples can be sequenced on one lane of an Illumina HiSeq system. We demonstrate the reproducibility and biological significance of RB-TnSeq with Escherichia coli, Phaeobacter inhibens, Pseudomonas stutzeri, Shewanella amazonensis, and Shewanella oneidensis. To demonstrate the increased throughput of RB-TnSeq, we performed 387 successful genome-wide mutant fitness assays representing 130 different bacterium-carbon source combinations and identified 5,196 genes with significant phenotypes across the five bacteria. In P. inhibens, we used our mutant fitness data to identify genes important for the utilization of diverse carbon substrates, including a putative D-mannose isomerase that is required for mannitol catabolism. RB-TnSeq will enable the cost-effective functional annotation of diverse bacteria using mutant fitness profiling. A large challenge in microbiology is the functional assessment of the millions of uncharacterized genes identified by genome sequencing. Transposon mutagenesis coupled to next-generation sequencing (TnSeq) is a powerful approach to assign phenotypes and functions to genes. However, the current strategies for TnSeq are too laborious to be applied to hundreds of experimental conditions across multiple bacteria. Here, we describe an approach, random bar code transposon-site sequencing (RB-TnSeq), which greatly simplifies the measurement of gene fitness by using bar code sequencing (BarSeq) to monitor the abundance of mutants. We performed 387 genome-wide fitness assays across five bacteria and identified phenotypes for over 5,000 genes. RB-TnSeq can be applied to diverse bacteria and is a powerful tool to annotate uncharacterized genes using phenotype data.« less
DOE Office of Scientific and Technical Information (OSTI.GOV)
Wetmore, Kelly M.; Price, Morgan N.; Waters, Robert J.
Transposon mutagenesis with next-generation sequencing (TnSeq) is a powerful approach to annotate gene function in bacteria, but existing protocols for TnSeq require laborious preparation of every sample before sequencing. Thus, the existing protocols are not amenable to the throughput necessary to identify phenotypes and functions for the majority of genes in diverse bacteria. Here, we present a method, random bar code transposon-site sequencing (RB-TnSeq), which increases the throughput of mutant fitness profiling by incorporating random DNA bar codes into Tn5 and mariner transposons and by using bar code sequencing (BarSeq) to assay mutant fitness. RB-TnSeq can be used with anymore » transposon, and TnSeq is performed once per organism instead of once per sample. Each BarSeq assay requires only a simple PCR, and 48 to 96 samples can be sequenced on one lane of an Illumina HiSeq system. We demonstrate the reproducibility and biological significance of RB-TnSeq with Escherichia coli, Phaeobacter inhibens, Pseudomonas stutzeri, Shewanella amazonensis, and Shewanella oneidensis. To demonstrate the increased throughput of RB-TnSeq, we performed 387 successful genome-wide mutant fitness assays representing 130 different bacterium-carbon source combinations and identified 5,196 genes with significant phenotypes across the five bacteria. In P. inhibens, we used our mutant fitness data to identify genes important for the utilization of diverse carbon substrates, including a putative D-mannose isomerase that is required for mannitol catabolism. RB-TnSeq will enable the cost-effective functional annotation of diverse bacteria using mutant fitness profiling. A large challenge in microbiology is the functional assessment of the millions of uncharacterized genes identified by genome sequencing. Transposon mutagenesis coupled to next-generation sequencing (TnSeq) is a powerful approach to assign phenotypes and functions to genes. However, the current strategies for TnSeq are too laborious to be applied to hundreds of experimental conditions across multiple bacteria. Here, we describe an approach, random bar code transposon-site sequencing (RB-TnSeq), which greatly simplifies the measurement of gene fitness by using bar code sequencing (BarSeq) to monitor the abundance of mutants. We performed 387 genome-wide fitness assays across five bacteria and identified phenotypes for over 5,000 genes. RB-TnSeq can be applied to diverse bacteria and is a powerful tool to annotate uncharacterized genes using phenotype data.« less
Methods and kits for predicting a response to an erythropoietic agent
Merchant, Michael L.; Klein, Jon B.; Brier, Michael E.; Gaweda, Adam E.
2015-06-16
Methods for predicting a response to an erythropoietic agent in a subject include providing a biological sample from the subject, and determining an amount in the sample of at least one peptide selected from the group consisting of SEQ ID NOS: 1-17. If there is a measurable difference in the amount of the at least one peptide in the sample, when compared to a control level of the same peptide, the subject is then predicted to have a good response or a poor response to the erythropoietic agent. Kits for predicting a response to an erythropoietic agent are further provided and include one or more antibodies, or fragments thereof, that specifically recognize a peptide of SEQ ID NOS: 1-17.
Robinson, Nick A; Hall, Nathan E; Ross, Elizabeth M; Cooke, Ira R; Shiel, Brett P; Robinson, Andrew J; Strugnell, Jan M
2016-01-01
The mitochondrial genome of greenlip abalone, Haliotis laevigata, is reported. MiSeq and HiSeq sequencing of one individual was assembled to yield a single 16,545 bp contig. The sequence shares 92% identity to the H. rubra mitochondrial genome (a closely related species that hybridize with H. laevigata in the wild). The sequence will be useful for determining the maternal contribution to hybrid populations, for investigating population structure and stock-enhancement effectiveness.
Wetmore, Kelly M.; Price, Morgan N.; Waters, Robert J.; Lamson, Jacob S.; He, Jennifer; Hoover, Cindi A.; Blow, Matthew J.; Bristow, James; Butland, Gareth
2015-01-01
ABSTRACT Transposon mutagenesis with next-generation sequencing (TnSeq) is a powerful approach to annotate gene function in bacteria, but existing protocols for TnSeq require laborious preparation of every sample before sequencing. Thus, the existing protocols are not amenable to the throughput necessary to identify phenotypes and functions for the majority of genes in diverse bacteria. Here, we present a method, random bar code transposon-site sequencing (RB-TnSeq), which increases the throughput of mutant fitness profiling by incorporating random DNA bar codes into Tn5 and mariner transposons and by using bar code sequencing (BarSeq) to assay mutant fitness. RB-TnSeq can be used with any transposon, and TnSeq is performed once per organism instead of once per sample. Each BarSeq assay requires only a simple PCR, and 48 to 96 samples can be sequenced on one lane of an Illumina HiSeq system. We demonstrate the reproducibility and biological significance of RB-TnSeq with Escherichia coli, Phaeobacter inhibens, Pseudomonas stutzeri, Shewanella amazonensis, and Shewanella oneidensis. To demonstrate the increased throughput of RB-TnSeq, we performed 387 successful genome-wide mutant fitness assays representing 130 different bacterium-carbon source combinations and identified 5,196 genes with significant phenotypes across the five bacteria. In P. inhibens, we used our mutant fitness data to identify genes important for the utilization of diverse carbon substrates, including a putative d-mannose isomerase that is required for mannitol catabolism. RB-TnSeq will enable the cost-effective functional annotation of diverse bacteria using mutant fitness profiling. PMID:25968644
Sequence Alignment to Predict Across Species Susceptibility (SeqAPASS) Version 3.0 User Guide
User Guide to describe the complete functionality of the Sequence Alignment to Predict Across Species Susceptibility (SeqAPASS) Version 3.0 online tool. The US Environmental Protection Agency Sequence Alignment to Predict Across Species Susceptibility tool (SeqAPASS; https://seqa...
Nair, Shalima S; Luu, Phuc-Loi; Qu, Wenjia; Maddugoda, Madhavi; Huschtscha, Lily; Reddel, Roger; Chenevix-Trench, Georgia; Toso, Martina; Kench, James G; Horvath, Lisa G; Hayes, Vanessa M; Stricker, Phillip D; Hughes, Timothy P; White, Deborah L; Rasko, John E J; Wong, Justin J-L; Clark, Susan J
2018-05-28
Comprehensive genome-wide DNA methylation profiling is critical to gain insights into epigenetic reprogramming during development and disease processes. Among the different genome-wide DNA methylation technologies, whole genome bisulphite sequencing (WGBS) is considered the gold standard for assaying genome-wide DNA methylation at single base resolution. However, the high sequencing cost to achieve the optimal depth of coverage limits its application in both basic and clinical research. To achieve 15× coverage of the human methylome, using WGBS, requires approximately three lanes of 100-bp-paired-end Illumina HiSeq 2500 sequencing. It is important, therefore, for advances in sequencing technologies to be developed to enable cost-effective high-coverage sequencing. In this study, we provide an optimised WGBS methodology, from library preparation to sequencing and data processing, to enable 16-20× genome-wide coverage per single lane of HiSeq X Ten, HCS 3.3.76. To process and analyse the data, we developed a WGBS pipeline (METH10X) that is fast and can call SNPs. We performed WGBS on both high-quality intact DNA and degraded DNA from formalin-fixed paraffin-embedded tissue. First, we compared different library preparation methods on the HiSeq 2500 platform to identify the best method for sequencing on the HiSeq X Ten. Second, we optimised the PhiX and genome spike-ins to achieve higher quality and coverage of WGBS data on the HiSeq X Ten. Third, we performed integrated whole genome sequencing (WGS) and WGBS of the same DNA sample in a single lane of HiSeq X Ten to improve data output. Finally, we compared methylation data from the HiSeq 2500 and HiSeq X Ten and found high concordance (Pearson r > 0.9×). Together we provide a systematic, efficient and complete approach to perform and analyse WGBS on the HiSeq X Ten. Our protocol allows for large-scale WGBS studies at reasonable processing time and cost on the HiSeq X Ten platform.
Townsley, Brad T; Covington, Michael F; Ichihashi, Yasunori; Zumstein, Kristina; Sinha, Neelima R
2015-01-01
Next Generation Sequencing (NGS) is driving rapid advancement in biological understanding and RNA-sequencing (RNA-seq) has become an indispensable tool for biology and medicine. There is a growing need for access to these technologies although preparation of NGS libraries remains a bottleneck to wider adoption. Here we report a novel method for the production of strand specific RNA-seq libraries utilizing the terminal breathing of double-stranded cDNA to capture and incorporate a sequencing adapter. Breath Adapter Directional sequencing (BrAD-seq) reduces sample handling and requires far fewer enzymatic steps than most available methods to produce high quality strand-specific RNA-seq libraries. The method we present is optimized for 3-prime Digital Gene Expression (DGE) libraries and can easily extend to full transcript coverage shotgun (SHO) type strand-specific libraries and is modularized to accommodate a diversity of RNA and DNA input materials. BrAD-seq offers a highly streamlined and inexpensive option for RNA-seq libraries.
Rhodes, Johanna; Beale, Mathew A; Fisher, Matthew C
2014-01-01
The industry of next-generation sequencing is constantly evolving, with novel library preparation methods and new sequencing machines being released by the major sequencing technology companies annually. The Illumina TruSeq v2 library preparation method was the most widely used kit and the market leader; however, it has now been discontinued, and in 2013 was replaced by the TruSeq Nano and TruSeq PCR-free methods, leaving a gap in knowledge regarding which is the most appropriate library preparation method to use. Here, we used isolates from the pathogenic fungi Cryptococcus neoformans var. grubii and sequenced them using the existing TruSeq DNA v2 kit (Illumina), along with two new kits: the TruSeq Nano DNA kit (Illumina) and the NEBNext Ultra DNA kit (New England Biolabs) to provide a comparison. Compared to the original TruSeq DNA v2 kit, both newer kits gave equivalent or better sequencing data, with increased coverage. When comparing the two newer kits, we found little difference in cost and workflow, with the NEBNext Ultra both slightly cheaper and faster than the TruSeq Nano. However, the quality of data generated using the TruSeq Nano DNA kit was superior due to higher coverage at regions of low GC content, and more SNPs identified. Researchers should therefore evaluate their resources and the type of application (and hence data quality) being considered when ultimately deciding on which library prep method to use.
MITIE: Simultaneous RNA-Seq-based transcript identification and quantification in multiple samples.
Behr, Jonas; Kahles, André; Zhong, Yi; Sreedharan, Vipin T; Drewe, Philipp; Rätsch, Gunnar
2013-10-15
High-throughput sequencing of mRNA (RNA-Seq) has led to tremendous improvements in the detection of expressed genes and reconstruction of RNA transcripts. However, the extensive dynamic range of gene expression, technical limitations and biases, as well as the observed complexity of the transcriptional landscape, pose profound computational challenges for transcriptome reconstruction. We present the novel framework MITIE (Mixed Integer Transcript IdEntification) for simultaneous transcript reconstruction and quantification. We define a likelihood function based on the negative binomial distribution, use a regularization approach to select a few transcripts collectively explaining the observed read data and show how to find the optimal solution using Mixed Integer Programming. MITIE can (i) take advantage of known transcripts, (ii) reconstruct and quantify transcripts simultaneously in multiple samples, and (iii) resolve the location of multi-mapping reads. It is designed for genome- and assembly-based transcriptome reconstruction. We present an extensive study based on realistic simulated RNA-Seq data. When compared with state-of-the-art approaches, MITIE proves to be significantly more sensitive and overall more accurate. Moreover, MITIE yields substantial performance gains when used with multiple samples. We applied our system to 38 Drosophila melanogaster modENCODE RNA-Seq libraries and estimated the sensitivity of reconstructing omitted transcript annotations and the specificity with respect to annotated transcripts. Our results corroborate that a well-motivated objective paired with appropriate optimization techniques lead to significant improvements over the state-of-the-art in transcriptome reconstruction. MITIE is implemented in C++ and is available from http://bioweb.me/mitie under the GPL license.
Introduction to Single-Cell RNA Sequencing.
Olsen, Thale Kristin; Baryawno, Ninib
2018-04-01
During the last decade, high-throughput sequencing methods have revolutionized the entire field of biology. The opportunity to study entire transcriptomes in great detail using RNA sequencing (RNA-seq) has fueled many important discoveries and is now a routine method in biomedical research. However, RNA-seq is typically performed in "bulk," and the data represent an average of gene expression patterns across thousands to millions of cells; this might obscure biologically relevant differences between cells. Single-cell RNA-seq (scRNA-seq) represents an approach to overcome this problem. By isolating single cells, capturing their transcripts, and generating sequencing libraries in which the transcripts are mapped to individual cells, scRNA-seq allows assessment of fundamental biological properties of cell populations and biological systems at unprecedented resolution. Here, we present the most common scRNA-seq protocols in use today and the basics of data analysis and discuss factors that are important to consider before planning and designing an scRNA-seq project. © 2018 by John Wiley & Sons, Inc. Copyright © 2018 John Wiley & Sons, Inc.
Thomsen, Martin Christen Frølund; Nielsen, Morten
2012-01-01
Seq2Logo is a web-based sequence logo generator. Sequence logos are a graphical representation of the information content stored in a multiple sequence alignment (MSA) and provide a compact and highly intuitive representation of the position-specific amino acid composition of binding motifs, active sites, etc. in biological sequences. Accurate generation of sequence logos is often compromised by sequence redundancy and low number of observations. Moreover, most methods available for sequence logo generation focus on displaying the position-specific enrichment of amino acids, discarding the equally valuable information related to amino acid depletion. Seq2logo aims at resolving these issues allowing the user to include sequence weighting to correct for data redundancy, pseudo counts to correct for low number of observations and different logotype representations each capturing different aspects related to amino acid enrichment and depletion. Besides allowing input in the format of peptides and MSA, Seq2Logo accepts input as Blast sequence profiles, providing easy access for non-expert end-users to characterize and identify functionally conserved/variable amino acids in any given protein of interest. The output from the server is a sequence logo and a PSSM. Seq2Logo is available at http://www.cbs.dtu.dk/biotools/Seq2Logo (14 May 2012, date last accessed). PMID:22638583
The Sequence Alignment to Predict Across Species Susceptibility (SeqAPASS) tool was developed to address needs for rapid, cost effective methods of species extrapolation of chemical susceptibility. Specifically, the SeqAPASS tool compares the primary sequence (Level 1), functiona...
Microbe-ID: an open source toolbox for microbial genotyping and species identification.
Tabima, Javier F; Everhart, Sydney E; Larsen, Meredith M; Weisberg, Alexandra J; Kamvar, Zhian N; Tancos, Matthew A; Smart, Christine D; Chang, Jeff H; Grünwald, Niklaus J
2016-01-01
Development of tools to identify species, genotypes, or novel strains of invasive organisms is critical for monitoring emergence and implementing rapid response measures. Molecular markers, although critical to identifying species or genotypes, require bioinformatic tools for analysis. However, user-friendly analytical tools for fast identification are not readily available. To address this need, we created a web-based set of applications called Microbe-ID that allow for customizing a toolbox for rapid species identification and strain genotyping using any genetic markers of choice. Two components of Microbe-ID, named Sequence-ID and Genotype-ID, implement species and genotype identification, respectively. Sequence-ID allows identification of species by using BLAST to query sequences for any locus of interest against a custom reference sequence database. Genotype-ID allows placement of an unknown multilocus marker in either a minimum spanning network or dendrogram with bootstrap support from a user-created reference database. Microbe-ID can be used for identification of any organism based on nucleotide sequences or any molecular marker type and several examples are provided. We created a public website for demonstration purposes called Microbe-ID (microbe-id.org) and provided a working implementation for the genus Phytophthora (phytophthora-id.org). In Phytophthora-ID, the Sequence-ID application allows identification based on ITS or cox spacer sequences. Genotype-ID groups individuals into clonal lineages based on simple sequence repeat (SSR) markers for the two invasive plant pathogen species P. infestans and P. ramorum. All code is open source and available on github and CRAN. Instructions for installation and use are provided at https://github.com/grunwaldlab/Microbe-ID.
Measuring Sister Chromatid Cohesion Protein Genome Occupancy in Drosophila melanogaster by ChIP-seq.
Dorsett, Dale; Misulovin, Ziva
2017-01-01
This chapter presents methods to conduct and analyze genome-wide chromatin immunoprecipitation of the cohesin complex and the Nipped-B cohesin loading factor in Drosophila cells using high-throughput DNA sequencing (ChIP-seq). Procedures for isolation of chromatin, immunoprecipitation, and construction of sequencing libraries for the Ion Torrent Proton high throughput sequencer are detailed, and computational methods to calculate occupancy as input-normalized fold-enrichment are described. The results obtained by ChIP-seq are compared to those obtained by ChIP-chip (genomic ChIP using tiling microarrays), and the effects of sequencing depth on the accuracy are analyzed. ChIP-seq provides similar sensitivity and reproducibility as ChIP-chip, and identifies the same broad regions of occupancy. The locations of enrichment peaks, however, can differ between ChIP-chip and ChIP-seq, and low sequencing depth can splinter broad regions of occupancy into distinct peaks.
O'Leary, Nuala A; Wright, Mathew W; Brister, J Rodney; Ciufo, Stacy; Haddad, Diana; McVeigh, Rich; Rajput, Bhanu; Robbertse, Barbara; Smith-White, Brian; Ako-Adjei, Danso; Astashyn, Alexander; Badretdin, Azat; Bao, Yiming; Blinkova, Olga; Brover, Vyacheslav; Chetvernin, Vyacheslav; Choi, Jinna; Cox, Eric; Ermolaeva, Olga; Farrell, Catherine M; Goldfarb, Tamara; Gupta, Tripti; Haft, Daniel; Hatcher, Eneida; Hlavina, Wratko; Joardar, Vinita S; Kodali, Vamsi K; Li, Wenjun; Maglott, Donna; Masterson, Patrick; McGarvey, Kelly M; Murphy, Michael R; O'Neill, Kathleen; Pujar, Shashikant; Rangwala, Sanjida H; Rausch, Daniel; Riddick, Lillian D; Schoch, Conrad; Shkeda, Andrei; Storz, Susan S; Sun, Hanzhen; Thibaud-Nissen, Francoise; Tolstoy, Igor; Tully, Raymond E; Vatsan, Anjana R; Wallin, Craig; Webb, David; Wu, Wendy; Landrum, Melissa J; Kimchi, Avi; Tatusova, Tatiana; DiCuccio, Michael; Kitts, Paul; Murphy, Terence D; Pruitt, Kim D
2016-01-04
The RefSeq project at the National Center for Biotechnology Information (NCBI) maintains and curates a publicly available database of annotated genomic, transcript, and protein sequence records (http://www.ncbi.nlm.nih.gov/refseq/). The RefSeq project leverages the data submitted to the International Nucleotide Sequence Database Collaboration (INSDC) against a combination of computation, manual curation, and collaboration to produce a standard set of stable, non-redundant reference sequences. The RefSeq project augments these reference sequences with current knowledge including publications, functional features and informative nomenclature. The database currently represents sequences from more than 55,000 organisms (>4800 viruses, >40,000 prokaryotes and >10,000 eukaryotes; RefSeq release 71), ranging from a single record to complete genomes. This paper summarizes the current status of the viral, prokaryotic, and eukaryotic branches of the RefSeq project, reports on improvements to data access and details efforts to further expand the taxonomic representation of the collection. We also highlight diverse functional curation initiatives that support multiple uses of RefSeq data including taxonomic validation, genome annotation, comparative genomics, and clinical testing. We summarize our approach to utilizing available RNA-Seq and other data types in our manual curation process for vertebrate, plant, and other species, and describe a new direction for prokaryotic genomes and protein name management. Published by Oxford University Press on behalf of Nucleic Acids Research 2015. This work is written by (a) US Government employee(s) and is in the public domain in the US.
Karas, Vlad O; Sinnott-Armstrong, Nicholas A; Varghese, Vici; Shafer, Robert W; Greenleaf, William J; Sherlock, Gavin
2018-01-01
Abstract Much of the within species genetic variation is in the form of single nucleotide polymorphisms (SNPs), typically detected by whole genome sequencing (WGS) or microarray-based technologies. However, WGS produces mostly uninformative reads that perfectly match the reference, while microarrays require genome-specific reagents. We have developed Diff-seq, a sequencing-based mismatch detection assay for SNP discovery without the requirement for specialized nucleic-acid reagents. Diff-seq leverages the Surveyor endonuclease to cleave mismatched DNA molecules that are generated after cross-annealing of a complex pool of DNA fragments. Sequencing libraries enriched for Surveyor-cleaved molecules result in increased coverage at the variant sites. Diff-seq detected all mismatches present in an initial test substrate, with specific enrichment dependent on the identity and context of the variation. Application to viral sequences resulted in increased observation of variant alleles in a biologically relevant context. Diff-Seq has the potential to increase the sensitivity and efficiency of high-throughput sequencing in the detection of variation. PMID:29361139
BAsE-Seq: a method for obtaining long viral haplotypes from short sequence reads.
Hong, Lewis Z; Hong, Shuzhen; Wong, Han Teng; Aw, Pauline P K; Cheng, Yan; Wilm, Andreas; de Sessions, Paola F; Lim, Seng Gee; Nagarajan, Niranjan; Hibberd, Martin L; Quake, Stephen R; Burkholder, William F
2014-01-01
We present a method for obtaining long haplotypes, of over 3 kb in length, using a short-read sequencer, Barcode-directed Assembly for Extra-long Sequences (BAsE-Seq). BAsE-Seq relies on transposing a template-specific barcode onto random segments of the template molecule and assembling the barcoded short reads into complete haplotypes. We applied BAsE-Seq on mixed clones of hepatitis B virus and accurately identified haplotypes occurring at frequencies greater than or equal to 0.4%, with >99.9% specificity. Applying BAsE-Seq to a clinical sample, we obtained over 9,000 viral haplotypes, which provided an unprecedented view of hepatitis B virus population structure during chronic infection. BAsE-Seq is readily applicable for monitoring quasispecies evolution in viral diseases.
Hayashi, Tetsutaro; Ozaki, Haruka; Sasagawa, Yohei; Umeda, Mana; Danno, Hiroki; Nikaido, Itoshi
2018-02-12
Total RNA sequencing has been used to reveal poly(A) and non-poly(A) RNA expression, RNA processing and enhancer activity. To date, no method for full-length total RNA sequencing of single cells has been developed despite the potential of this technology for single-cell biology. Here we describe random displacement amplification sequencing (RamDA-seq), the first full-length total RNA-sequencing method for single cells. Compared with other methods, RamDA-seq shows high sensitivity to non-poly(A) RNA and near-complete full-length transcript coverage. Using RamDA-seq with differentiation time course samples of mouse embryonic stem cells, we reveal hundreds of dynamically regulated non-poly(A) transcripts, including histone transcripts and long noncoding RNA Neat1. Moreover, RamDA-seq profiles recursive splicing in >300-kb introns. RamDA-seq also detects enhancer RNAs and their cell type-specific activity in single cells. Taken together, we demonstrate that RamDA-seq could help investigate the dynamics of gene expression, RNA-processing events and transcriptional regulation in single cells.
Microbe-ID: an open source toolbox for microbial genotyping and species identification
Tabima, Javier F.; Everhart, Sydney E.; Larsen, Meredith M.; Weisberg, Alexandra J.; Kamvar, Zhian N.; Tancos, Matthew A.; Smart, Christine D.; Chang, Jeff H.
2016-01-01
Development of tools to identify species, genotypes, or novel strains of invasive organisms is critical for monitoring emergence and implementing rapid response measures. Molecular markers, although critical to identifying species or genotypes, require bioinformatic tools for analysis. However, user-friendly analytical tools for fast identification are not readily available. To address this need, we created a web-based set of applications called Microbe-ID that allow for customizing a toolbox for rapid species identification and strain genotyping using any genetic markers of choice. Two components of Microbe-ID, named Sequence-ID and Genotype-ID, implement species and genotype identification, respectively. Sequence-ID allows identification of species by using BLAST to query sequences for any locus of interest against a custom reference sequence database. Genotype-ID allows placement of an unknown multilocus marker in either a minimum spanning network or dendrogram with bootstrap support from a user-created reference database. Microbe-ID can be used for identification of any organism based on nucleotide sequences or any molecular marker type and several examples are provided. We created a public website for demonstration purposes called Microbe-ID (microbe-id.org) and provided a working implementation for the genus Phytophthora (phytophthora-id.org). In Phytophthora-ID, the Sequence-ID application allows identification based on ITS or cox spacer sequences. Genotype-ID groups individuals into clonal lineages based on simple sequence repeat (SSR) markers for the two invasive plant pathogen species P. infestans and P. ramorum. All code is open source and available on github and CRAN. Instructions for installation and use are provided at https://github.com/grunwaldlab/Microbe-ID. PMID:27602267
ARCPHdb: A comprehensive protein database for SF1 and SF2 helicase from archaea.
Moukhtar, Mirna; Chaar, Wafi; Abdel-Razzak, Ziad; Khalil, Mohamad; Taha, Samir; Chamieh, Hala
2017-01-01
Superfamily 1 and Superfamily 2 helicases, two of the largest helicase protein families, play vital roles in many biological processes including replication, transcription and translation. Study of helicase proteins in the model microorganisms of archaea have largely contributed to the understanding of their function, architecture and assembly. Based on a large phylogenomics approach, we have identified and classified all SF1 and SF2 protein families in ninety five sequenced archaea genomes. Here we developed an online webserver linked to a specialized protein database named ARCPHdb to provide access for SF1 and SF2 helicase families from archaea. ARCPHdb was implemented using MySQL relational database. Web interfaces were developed using Netbeans. Data were stored according to UniProt accession numbers, NCBI Ref Seq ID, PDB IDs and Entrez Databases. A user-friendly interactive web interface has been developed to browse, search and download archaeal helicase protein sequences, their available 3D structure models, and related documentation available in the literature provided by ARCPHdb. The database provides direct links to matching external databases. The ARCPHdb is the first online database to compile all protein information on SF1 and SF2 helicase from archaea in one platform. This database provides essential resource information for all researchers interested in the field. Copyright © 2016 Elsevier Ltd. All rights reserved.
RNA-Seq for Bacterial Gene Expression.
Poulsen, Line Dahl; Vinther, Jeppe
2018-06-01
RNA sequencing (RNA-seq) has become the preferred method for global quantification of bacterial gene expression. With the continued improvements in sequencing technology and data analysis tools, the most labor-intensive and expensive part of an RNA-seq experiment is the preparation of sequencing libraries, which is also essential for the quality of the data obtained. Here, we present a straightforward and inexpensive basic protocol for preparation of strand-specific RNA-seq libraries from bacterial RNA as well as a computational pipeline for the data analysis of sequencing reads. The protocol is based on the Illumina platform and allows easy multiplexing of samples and the removal of sequencing reads that are PCR duplicates. © 2018 by John Wiley & Sons, Inc. © 2018 John Wiley & Sons, Inc.
Chwialkowska, Karolina; Korotko, Urszula; Kosinska, Joanna; Szarejko, Iwona; Kwasniewski, Miroslaw
2017-01-01
Epigenetic mechanisms, including histone modifications and DNA methylation, mutually regulate chromatin structure, maintain genome integrity, and affect gene expression and transposon mobility. Variations in DNA methylation within plant populations, as well as methylation in response to internal and external factors, are of increasing interest, especially in the crop research field. Methylation Sensitive Amplification Polymorphism (MSAP) is one of the most commonly used methods for assessing DNA methylation changes in plants. This method involves gel-based visualization of PCR fragments from selectively amplified DNA that are cleaved using methylation-sensitive restriction enzymes. In this study, we developed and validated a new method based on the conventional MSAP approach called Methylation Sensitive Amplification Polymorphism Sequencing (MSAP-Seq). We improved the MSAP-based approach by replacing the conventional separation of amplicons on polyacrylamide gels with direct, high-throughput sequencing using Next Generation Sequencing (NGS) and automated data analysis. MSAP-Seq allows for global sequence-based identification of changes in DNA methylation. This technique was validated in Hordeum vulgare . However, MSAP-Seq can be straightforwardly implemented in different plant species, including crops with large, complex and highly repetitive genomes. The incorporation of high-throughput sequencing into MSAP-Seq enables parallel and direct analysis of DNA methylation in hundreds of thousands of sites across the genome. MSAP-Seq provides direct genomic localization of changes and enables quantitative evaluation. We have shown that the MSAP-Seq method specifically targets gene-containing regions and that a single analysis can cover three-quarters of all genes in large genomes. Moreover, MSAP-Seq's simplicity, cost effectiveness, and high-multiplexing capability make this method highly affordable. Therefore, MSAP-Seq can be used for DNA methylation analysis in crop plants with large and complex genomes.
Chwialkowska, Karolina; Korotko, Urszula; Kosinska, Joanna; Szarejko, Iwona; Kwasniewski, Miroslaw
2017-01-01
Epigenetic mechanisms, including histone modifications and DNA methylation, mutually regulate chromatin structure, maintain genome integrity, and affect gene expression and transposon mobility. Variations in DNA methylation within plant populations, as well as methylation in response to internal and external factors, are of increasing interest, especially in the crop research field. Methylation Sensitive Amplification Polymorphism (MSAP) is one of the most commonly used methods for assessing DNA methylation changes in plants. This method involves gel-based visualization of PCR fragments from selectively amplified DNA that are cleaved using methylation-sensitive restriction enzymes. In this study, we developed and validated a new method based on the conventional MSAP approach called Methylation Sensitive Amplification Polymorphism Sequencing (MSAP-Seq). We improved the MSAP-based approach by replacing the conventional separation of amplicons on polyacrylamide gels with direct, high-throughput sequencing using Next Generation Sequencing (NGS) and automated data analysis. MSAP-Seq allows for global sequence-based identification of changes in DNA methylation. This technique was validated in Hordeum vulgare. However, MSAP-Seq can be straightforwardly implemented in different plant species, including crops with large, complex and highly repetitive genomes. The incorporation of high-throughput sequencing into MSAP-Seq enables parallel and direct analysis of DNA methylation in hundreds of thousands of sites across the genome. MSAP-Seq provides direct genomic localization of changes and enables quantitative evaluation. We have shown that the MSAP-Seq method specifically targets gene-containing regions and that a single analysis can cover three-quarters of all genes in large genomes. Moreover, MSAP-Seq's simplicity, cost effectiveness, and high-multiplexing capability make this method highly affordable. Therefore, MSAP-Seq can be used for DNA methylation analysis in crop plants with large and complex genomes. PMID:29250096
Sasagawa, Yohei; Danno, Hiroki; Takada, Hitomi; Ebisawa, Masashi; Tanaka, Kaori; Hayashi, Tetsutaro; Kurisaki, Akira; Nikaido, Itoshi
2018-03-09
High-throughput single-cell RNA-seq methods assign limited unique molecular identifier (UMI) counts as gene expression values to single cells from shallow sequence reads and detect limited gene counts. We thus developed a high-throughput single-cell RNA-seq method, Quartz-Seq2, to overcome these issues. Our improvements in the reaction steps make it possible to effectively convert initial reads to UMI counts, at a rate of 30-50%, and detect more genes. To demonstrate the power of Quartz-Seq2, we analyzed approximately 10,000 transcriptomes from in vitro embryonic stem cells and an in vivo stromal vascular fraction with a limited number of reads.
Comparison of Next-Generation Sequencing Systems
Liu, Lin; Li, Yinhu; Li, Siliang; Hu, Ni; He, Yimin; Pong, Ray; Lin, Danni; Lu, Lihua; Law, Maggie
2012-01-01
With fast development and wide applications of next-generation sequencing (NGS) technologies, genomic sequence information is within reach to aid the achievement of goals to decode life mysteries, make better crops, detect pathogens, and improve life qualities. NGS systems are typically represented by SOLiD/Ion Torrent PGM from Life Sciences, Genome Analyzer/HiSeq 2000/MiSeq from Illumina, and GS FLX Titanium/GS Junior from Roche. Beijing Genomics Institute (BGI), which possesses the world's biggest sequencing capacity, has multiple NGS systems including 137 HiSeq 2000, 27 SOLiD, one Ion Torrent PGM, one MiSeq, and one 454 sequencer. We have accumulated extensive experience in sample handling, sequencing, and bioinformatics analysis. In this paper, technologies of these systems are reviewed, and first-hand data from extensive experience is summarized and analyzed to discuss the advantages and specifics associated with each sequencing system. At last, applications of NGS are summarized. PMID:22829749
Identification of Prostate Cancer-Specific microDNAs
2014-12-01
displacement amplification (MDA). 2 adopted multiple displacement amplification (MDA) with random primers for enriched circular DNA by rolling circle ... amplification (RCA) (Fig. 1) and then amplified DNA fragments were subject to deep sequencing. Sequence NO of Reads seq 1 184 seq 2 133 seq 3 2407 seq...prostate cancer cells through multiple displacement amplification . Clone #7 is the top candidate which has been cloned in an expression vector and it
RNA-Seq Technology and Its Application in Fish Transcriptomics
Ba, Yi; Zhuang, Qianfeng
2014-01-01
Abstract High-throughput sequencing technologies, also known as next-generation sequencing (NGS) technologies, have revolutionized the way that genomic research is advancing. In addition to the static genome, these state-of-art technologies have been recently exploited to analyze the dynamic transcriptome, and the resulting technology is termed RNA sequencing (RNA-seq). RNA-seq is free from many limitations of other transcriptomic approaches, such as microarray and tag-based sequencing method. Although RNA-seq has only been available for a short time, studies using this method have completely changed our perspective of the breadth and depth of eukaryotic transcriptomes. In terms of the transcriptomics of teleost fishes, both model and non-model species have benefited from the RNA-seq approach and have undergone tremendous advances in the past several years. RNA-seq has helped not only in mapping and annotating fish transcriptome but also in our understanding of many biological processes in fish, such as development, adaptive evolution, host immune response, and stress response. In this review, we first provide an overview of each step of RNA-seq from library construction to the bioinformatic analysis of the data. We then summarize and discuss the recent biological insights obtained from the RNA-seq studies in a variety of fish species. PMID:24380445
USDA-ARS?s Scientific Manuscript database
Alternative splicing is a well-known phenomenon that dramatically increases eukaryotic transcriptome diversity. The extent of mRNA isoform diversity among porcine tissues was assessed using Pacific Biosciences single-molecule long-read isoform sequencing (Iso-Seq) and Illumina short read sequencing ...
YAMAT-seq: an efficient method for high-throughput sequencing of mature transfer RNAs
Shigematsu, Megumi; Honda, Shozo; Loher, Phillipe; Telonis, Aristeidis G.; Rigoutsos, Isidore
2017-01-01
Abstract Besides translation, transfer RNAs (tRNAs) play many non-canonical roles in various biological pathways and exhibit highly variable expression profiles. To unravel the emerging complexities of tRNA biology and molecular mechanisms underlying them, an efficient tRNA sequencing method is required. However, the rigid structure of tRNA has been presenting a challenge to the development of such methods. We report the development of Y-shaped Adapter-ligated MAture TRNA sequencing (YAMAT-seq), an efficient and convenient method for high-throughput sequencing of mature tRNAs. YAMAT-seq circumvents the issue of inefficient adapter ligation, a characteristic of conventional RNA sequencing methods for mature tRNAs, by employing the efficient and specific ligation of Y-shaped adapter to mature tRNAs using T4 RNA Ligase 2. Subsequent cDNA amplification and next-generation sequencing successfully yield numerous mature tRNA sequences. YAMAT-seq has high specificity for mature tRNAs and high sensitivity to detect most isoacceptors from minute amount of total RNA. Moreover, YAMAT-seq shows quantitative capability to estimate expression levels of mature tRNAs, and has high reproducibility and broad applicability for various cell lines. YAMAT-seq thus provides high-throughput technique for identifying tRNA profiles and their regulations in various transcriptomes, which could play important regulatory roles in translation and other biological processes. PMID:28108659
ChIP-seq: advantages and challenges of a maturing technology.
Park, Peter J
2009-10-01
Chromatin immunoprecipitation followed by sequencing (ChIP-seq) is a technique for genome-wide profiling of DNA-binding proteins, histone modifications or nucleosomes. Owing to the tremendous progress in next-generation sequencing technology, ChIP-seq offers higher resolution, less noise and greater coverage than its array-based predecessor ChIP-chip. With the decreasing cost of sequencing, ChIP-seq has become an indispensable tool for studying gene regulation and epigenetic mechanisms. In this Review, I describe the benefits and challenges in harnessing this technique with an emphasis on issues related to experimental design and data analysis. ChIP-seq experiments generate large quantities of data, and effective computational analysis will be crucial for uncovering biological mechanisms.
Advances in single-cell RNA sequencing and its applications in cancer research.
Zhu, Sibo; Qing, Tao; Zheng, Yuanting; Jin, Li; Shi, Leming
2017-08-08
Unlike population-level approaches, single-cell RNA sequencing enables transcriptomic analysis of an individual cell. Through the combination of high-throughput sequencing and bioinformatic tools, single-cell RNA-seq can detect more than 10,000 transcripts in one cell to distinguish cell subsets and dynamic cellular changes. After several years' development, single-cell RNA-seq can now achieve massively parallel, full-length mRNA sequencing as well as in situ sequencing and even has potential for multi-omic detection. One appealing area of single-cell RNA-seq is cancer research, and it is regarded as a promising way to enhance prognosis and provide more precise target therapy by identifying druggable subclones. Indeed, progresses have been made regarding solid tumor analysis to reveal intratumoral heterogeneity, correlations between signaling pathways, stemness, drug resistance, and tumor architecture shaping the microenvironment. Furthermore, through investigation into circulating tumor cells, many genes have been shown to promote a propensity toward stemness and the epithelial-mesenchymal transition, to enhance anchoring and adhesion, and to be involved in mechanisms of anoikis resistance and drug resistance. This review focuses on advances and progresses of single-cell RNA-seq with regard to the following aspects: 1. Methodologies of single-cell RNA-seq 2. Single-cell isolation techniques 3. Single-cell RNA-seq in solid tumor research 4. Single-cell RNA-seq in circulating tumor cell research 5.
Advances in single-cell RNA sequencing and its applications in cancer research
Zhu, Sibo; Qing, Tao; Zheng, Yuanting; Jin, Li; Shi, Leming
2017-01-01
Unlike population-level approaches, single-cell RNA sequencing enables transcriptomic analysis of an individual cell. Through the combination of high-throughput sequencing and bioinformatic tools, single-cell RNA-seq can detect more than 10,000 transcripts in one cell to distinguish cell subsets and dynamic cellular changes. After several years’ development, single-cell RNA-seq can now achieve massively parallel, full-length mRNA sequencing as well as in situ sequencing and even has potential for multi-omic detection. One appealing area of single-cell RNA-seq is cancer research, and it is regarded as a promising way to enhance prognosis and provide more precise target therapy by identifying druggable subclones. Indeed, progresses have been made regarding solid tumor analysis to reveal intratumoral heterogeneity, correlations between signaling pathways, stemness, drug resistance, and tumor architecture shaping the microenvironment. Furthermore, through investigation into circulating tumor cells, many genes have been shown to promote a propensity toward stemness and the epithelial-mesenchymal transition, to enhance anchoring and adhesion, and to be involved in mechanisms of anoikis resistance and drug resistance. This review focuses on advances and progresses of single-cell RNA-seq with regard to the following aspects: 1. Methodologies of single-cell RNA-seq 2. Single-cell isolation techniques 3. Single-cell RNA-seq in solid tumor research 4. Single-cell RNA-seq in circulating tumor cell research 5. Perspectives PMID:28881849
A weighted U-statistic for genetic association analyses of sequencing data.
Wei, Changshuai; Li, Ming; He, Zihuai; Vsevolozhskaya, Olga; Schaid, Daniel J; Lu, Qing
2014-12-01
With advancements in next-generation sequencing technology, a massive amount of sequencing data is generated, which offers a great opportunity to comprehensively investigate the role of rare variants in the genetic etiology of complex diseases. Nevertheless, the high-dimensional sequencing data poses a great challenge for statistical analysis. The association analyses based on traditional statistical methods suffer substantial power loss because of the low frequency of genetic variants and the extremely high dimensionality of the data. We developed a Weighted U Sequencing test, referred to as WU-SEQ, for the high-dimensional association analysis of sequencing data. Based on a nonparametric U-statistic, WU-SEQ makes no assumption of the underlying disease model and phenotype distribution, and can be applied to a variety of phenotypes. Through simulation studies and an empirical study, we showed that WU-SEQ outperformed a commonly used sequence kernel association test (SKAT) method when the underlying assumptions were violated (e.g., the phenotype followed a heavy-tailed distribution). Even when the assumptions were satisfied, WU-SEQ still attained comparable performance to SKAT. Finally, we applied WU-SEQ to sequencing data from the Dallas Heart Study (DHS), and detected an association between ANGPTL 4 and very low density lipoprotein cholesterol. © 2014 WILEY PERIODICALS, INC.
Library construction for next-generation sequencing: Overviews and challenges
Head, Steven R.; Komori, H. Kiyomi; LaMere, Sarah A.; Whisenant, Thomas; Van Nieuwerburgh, Filip; Salomon, Daniel R.; Ordoukhanian, Phillip
2014-01-01
High-throughput sequencing, also known as next-generation sequencing (NGS), has revolutionized genomic research. In recent years, NGS technology has steadily improved, with costs dropping and the number and range of sequencing applications increasing exponentially. Here, we examine the critical role of sequencing library quality and consider important challenges when preparing NGS libraries from DNA and RNA sources. Factors such as the quantity and physical characteristics of the RNA or DNA source material as well as the desired application (i.e., genome sequencing, targeted sequencing, RNA-seq, ChIP-seq, RIP-seq, and methylation) are addressed in the context of preparing high quality sequencing libraries. In addition, the current methods for preparing NGS libraries from single cells are also discussed. PMID:24502796
Quail, Michael A; Smith, Miriam; Coupland, Paul; Otto, Thomas D; Harris, Simon R; Connor, Thomas R; Bertoni, Anna; Swerdlow, Harold P; Gu, Yong
2012-07-24
Next generation sequencing (NGS) technology has revolutionized genomic and genetic research. The pace of change in this area is rapid with three major new sequencing platforms having been released in 2011: Ion Torrent's PGM, Pacific Biosciences' RS and the Illumina MiSeq. Here we compare the results obtained with those platforms to the performance of the Illumina HiSeq, the current market leader. In order to compare these platforms, and get sufficient coverage depth to allow meaningful analysis, we have sequenced a set of 4 microbial genomes with mean GC content ranging from 19.3 to 67.7%. Together, these represent a comprehensive range of genome content. Here we report our analysis of that sequence data in terms of coverage distribution, bias, GC distribution, variant detection and accuracy. Sequence generated by Ion Torrent, MiSeq and Pacific Biosciences technologies displays near perfect coverage behaviour on GC-rich, neutral and moderately AT-rich genomes, but a profound bias was observed upon sequencing the extremely AT-rich genome of Plasmodium falciparum on the PGM, resulting in no coverage for approximately 30% of the genome. We analysed the ability to call variants from each platform and found that we could call slightly more variants from Ion Torrent data compared to MiSeq data, but at the expense of a higher false positive rate. Variant calling from Pacific Biosciences data was possible but higher coverage depth was required. Context specific errors were observed in both PGM and MiSeq data, but not in that from the Pacific Biosciences platform. All three fast turnaround sequencers evaluated here were able to generate usable sequence. However there are key differences between the quality of that data and the applications it will support.
Li, Wenli; Turner, Amy; Aggarwal, Praful; Matter, Andrea; Storvick, Erin; Arnett, Donna K; Broeckel, Ulrich
2015-12-16
Whole transcriptome sequencing (RNA-seq) represents a powerful approach for whole transcriptome gene expression analysis. However, RNA-seq carries a few limitations, e.g., the requirement of a significant amount of input RNA and complications led by non-specific mapping of short reads. The Ion AmpliSeq Transcriptome Human Gene Expression Kit (AmpliSeq) was recently introduced by Life Technologies as a whole-transcriptome, targeted gene quantification kit to overcome these limitations of RNA-seq. To assess the performance of this new methodology, we performed a comprehensive comparison of AmpliSeq with RNA-seq using two well-established next-generation sequencing platforms (Illumina HiSeq and Ion Torrent Proton). We analyzed standard reference RNA samples and RNA samples obtained from human induced pluripotent stem cell derived cardiomyocytes (hiPSC-CMs). Using published data from two standard RNA reference samples, we observed a strong concordance of log2 fold change for all genes when comparing AmpliSeq to Illumina HiSeq (Pearson's r = 0.92) and Ion Torrent Proton (Pearson's r = 0.92). We used ROC, Matthew's correlation coefficient and RMSD to determine the overall performance characteristics. All three statistical methods demonstrate AmpliSeq as a highly accurate method for differential gene expression analysis. Additionally, for genes with high abundance, AmpliSeq outperforms the two RNA-seq methods. When analyzing four closely related hiPSC-CM lines, we show that both AmpliSeq and RNA-seq capture similar global gene expression patterns consistent with known sources of variations. Our study indicates that AmpliSeq excels in the limiting areas of RNA-seq for gene expression quantification analysis. Thus, AmpliSeq stands as a very sensitive and cost-effective approach for very large scale gene expression analysis and mRNA marker screening with high accuracy.
iSeq: Web-Based RNA-seq Data Analysis and Visualization.
Zhang, Chao; Fan, Caoqi; Gan, Jingbo; Zhu, Ping; Kong, Lei; Li, Cheng
2018-01-01
Transcriptome sequencing (RNA-seq) is becoming a standard experimental methodology for genome-wide characterization and quantification of transcripts at single base-pair resolution. However, downstream analysis of massive amount of sequencing data can be prohibitively technical for wet-lab researchers. A functionally integrated and user-friendly platform is required to meet this demand. Here, we present iSeq, an R-based Web server, for RNA-seq data analysis and visualization. iSeq is a streamlined Web-based R application under the Shiny framework, featuring a simple user interface and multiple data analysis modules. Users without programming and statistical skills can analyze their RNA-seq data and construct publication-level graphs through a standardized yet customizable analytical pipeline. iSeq is accessible via Web browsers on any operating system at http://iseq.cbi.pku.edu.cn .
Razali, Haslina; O'Connor, Emily; Drews, Anna; Burke, Terry; Westerdahl, Helena
2017-07-28
High-throughput sequencing enables high-resolution genotyping of extremely duplicated genes. 454 amplicon sequencing (454) has become the standard technique for genotyping the major histocompatibility complex (MHC) genes in non-model organisms. However, illumina MiSeq amplicon sequencing (MiSeq), which offers a much higher read depth, is now superseding 454. The aim of this study was to quantitatively and qualitatively evaluate the performance of MiSeq in relation to 454 for genotyping MHC class I alleles using a house sparrow (Passer domesticus) dataset with pedigree information. House sparrows provide a good study system for this comparison as their MHC class I genes have been studied previously and, consequently, we had prior expectations concerning the number of alleles per individual. We found that 454 and MiSeq performed equally well in genotyping amplicons with low diversity, i.e. amplicons from individuals that had fewer than 6 alleles. Although there was a higher rate of failure in the 454 dataset in resolving amplicons with higher diversity (6-9 alleles), the same genotypes were identified by both 454 and MiSeq in 98% of cases. We conclude that low diversity amplicons are equally well genotyped using either 454 or MiSeq, but the higher coverage afforded by MiSeq can lead to this approach outperforming 454 in amplicons with higher diversity.
YAMAT-seq: an efficient method for high-throughput sequencing of mature transfer RNAs.
Shigematsu, Megumi; Honda, Shozo; Loher, Phillipe; Telonis, Aristeidis G; Rigoutsos, Isidore; Kirino, Yohei
2017-05-19
Besides translation, transfer RNAs (tRNAs) play many non-canonical roles in various biological pathways and exhibit highly variable expression profiles. To unravel the emerging complexities of tRNA biology and molecular mechanisms underlying them, an efficient tRNA sequencing method is required. However, the rigid structure of tRNA has been presenting a challenge to the development of such methods. We report the development of Y-shaped Adapter-ligated MAture TRNA sequencing (YAMAT-seq), an efficient and convenient method for high-throughput sequencing of mature tRNAs. YAMAT-seq circumvents the issue of inefficient adapter ligation, a characteristic of conventional RNA sequencing methods for mature tRNAs, by employing the efficient and specific ligation of Y-shaped adapter to mature tRNAs using T4 RNA Ligase 2. Subsequent cDNA amplification and next-generation sequencing successfully yield numerous mature tRNA sequences. YAMAT-seq has high specificity for mature tRNAs and high sensitivity to detect most isoacceptors from minute amount of total RNA. Moreover, YAMAT-seq shows quantitative capability to estimate expression levels of mature tRNAs, and has high reproducibility and broad applicability for various cell lines. YAMAT-seq thus provides high-throughput technique for identifying tRNA profiles and their regulations in various transcriptomes, which could play important regulatory roles in translation and other biological processes. © The Author(s) 2017. Published by Oxford University Press on behalf of Nucleic Acids Research.
ChIP-seq and RNA-seq methods to study circadian control of transcription in mammals
Takahashi, Joseph S.; Kumar, Vivek; Nakashe, Prachi; Koike, Nobuya; Huang, Hung-Chung; Green, Carla B.; Kim, Tae-Kyung
2015-01-01
Genome-wide analyses have revolutionized our ability to study the transcriptional regulation of circadian rhythms. The advent of next-generation sequencing methods has facilitated the use of two such technologies, ChIP-seq and RNA-seq. In this chapter, we describe detailed methods and protocols for these two techniques, with emphasis on their usage in circadian rhythm experiments in the mouse liver, a major target organ of the circadian clock system. Critical factors for these methods are highlighted and issues arising with time series samples for ChIP-seq and RNA-seq are discussed. Finally detailed protocols for library preparation suitable for Illumina sequencing platforms are presented. PMID:25662462
A Concise Atlas of Thyroid Cancer Next-Generation Sequencing Panel ThyroSeq v.2
Alsina, Jorge; Alsina, Raul; Gulec, Seza
2017-01-01
The next-generation sequencing technology allows high out-put genomic analysis. An innovative assay in thyroid cancer, ThyroSeq® was developed for targeted mutation detection by next generation sequencing technology in fine needle aspiration and tissue samples. ThyroSeq v.2 next generation sequencing panel offers simultaneous sequencing and detection in >1000 hotspots of 14 thyroid cancer-related genes and for 42 types of gene fusions known to occur in thyroid cancer. ThyroSeq is being increasingly used to further narrow the indeterminate category defined by cytology for thyroid nodules. From a surgical perspective, genomic profiling also provides prognostic and predictive information and closely relates to determination of surgical strategy. Both the genomic analysis technology and the informatics for the cancer genome data base are rapidly developing. In this paper, we have gathered existing information on the thyroid cancer-related genes involved in the initiation and progression of thyroid cancer. Our goal is to assemble a glossary for the current ThyroSeq genomic panel that can help elucidate the role genomics play in thyroid cancer oncogenesis. PMID:28117295
Du, Xiuquan; Hu, Changlin; Yao, Yu; Sun, Shiwei; Zhang, Yanping
2017-12-12
In bioinformatics, exon skipping (ES) event prediction is an essential part of alternative splicing (AS) event analysis. Although many methods have been developed to predict ES events, a solution has yet to be found. In this study, given the limitations of machine learning algorithms with RNA-Seq data or genome sequences, a new feature, called RS (RNA-seq and sequence) features, was constructed. These features include RNA-Seq features derived from the RNA-Seq data and sequence features derived from genome sequences. We propose a novel Rotation Forest classifier to predict ES events with the RS features (RotaF-RSES). To validate the efficacy of RotaF-RSES, a dataset from two human tissues was used, and RotaF-RSES achieved an accuracy of 98.4%, a specificity of 99.2%, a sensitivity of 94.1%, and an area under the curve (AUC) of 98.6%. When compared to the other available methods, the results indicate that RotaF-RSES is efficient and can predict ES events with RS features.
LipidSeq: a next-generation clinical resequencing panel for monogenic dyslipidemias.
Johansen, Christopher T; Dubé, Joseph B; Loyzer, Melissa N; MacDonald, Austin; Carter, David E; McIntyre, Adam D; Cao, Henian; Wang, Jian; Robinson, John F; Hegele, Robert A
2014-04-01
We report the design of a targeted resequencing panel for monogenic dyslipidemias, LipidSeq, for the purpose of replacing Sanger sequencing in the clinical detection of dyslipidemia-causing variants. We also evaluate the performance of the LipidSeq approach versus Sanger sequencing in 84 patients with a range of phenotypes including extreme blood lipid concentrations as well as additional dyslipidemias and related metabolic disorders. The panel performs well, with high concordance (95.2%) in samples with known mutations based on Sanger sequencing and a high detection rate (57.9%) of mutations likely to be causative for disease in samples not previously sequenced. Clinical implementation of LipidSeq has the potential to aid in the molecular diagnosis of patients with monogenic dyslipidemias with a high degree of speed and accuracy and at lower cost than either Sanger sequencing or whole exome sequencing. Furthermore, LipidSeq will help to provide a more focused picture of monogenic and polygenic contributors that underlie dyslipidemia while excluding the discovery of incidental pathogenic clinically actionable variants in nonmetabolism-related genes, such as oncogenes, that would otherwise be identified by a whole exome approach, thus minimizing potential ethical issues.
LipidSeq: a next-generation clinical resequencing panel for monogenic dyslipidemias[S
Johansen, Christopher T.; Dubé, Joseph B.; Loyzer, Melissa N.; MacDonald, Austin; Carter, David E.; McIntyre, Adam D.; Cao, Henian; Wang, Jian; Robinson, John F.; Hegele, Robert A.
2014-01-01
We report the design of a targeted resequencing panel for monogenic dyslipidemias, LipidSeq, for the purpose of replacing Sanger sequencing in the clinical detection of dyslipidemia-causing variants. We also evaluate the performance of the LipidSeq approach versus Sanger sequencing in 84 patients with a range of phenotypes including extreme blood lipid concentrations as well as additional dyslipidemias and related metabolic disorders. The panel performs well, with high concordance (95.2%) in samples with known mutations based on Sanger sequencing and a high detection rate (57.9%) of mutations likely to be causative for disease in samples not previously sequenced. Clinical implementation of LipidSeq has the potential to aid in the molecular diagnosis of patients with monogenic dyslipidemias with a high degree of speed and accuracy and at lower cost than either Sanger sequencing or whole exome sequencing. Furthermore, LipidSeq will help to provide a more focused picture of monogenic and polygenic contributors that underlie dyslipidemia while excluding the discovery of incidental pathogenic clinically actionable variants in nonmetabolism-related genes, such as oncogenes, that would otherwise be identified by a whole exome approach, thus minimizing potential ethical issues. PMID:24503134
Classifying next-generation sequencing data using a zero-inflated Poisson model.
Zhou, Yan; Wan, Xiang; Zhang, Baoxue; Tong, Tiejun
2018-04-15
With the development of high-throughput techniques, RNA-sequencing (RNA-seq) is becoming increasingly popular as an alternative for gene expression analysis, such as RNAs profiling and classification. Identifying which type of diseases a new patient belongs to with RNA-seq data has been recognized as a vital problem in medical research. As RNA-seq data are discrete, statistical methods developed for classifying microarray data cannot be readily applied for RNA-seq data classification. Witten proposed a Poisson linear discriminant analysis (PLDA) to classify the RNA-seq data in 2011. Note, however, that the count datasets are frequently characterized by excess zeros in real RNA-seq or microRNA sequence data (i.e. when the sequence depth is not enough or small RNAs with the length of 18-30 nucleotides). Therefore, it is desired to develop a new model to analyze RNA-seq data with an excess of zeros. In this paper, we propose a Zero-Inflated Poisson Logistic Discriminant Analysis (ZIPLDA) for RNA-seq data with an excess of zeros. The new method assumes that the data are from a mixture of two distributions: one is a point mass at zero, and the other follows a Poisson distribution. We then consider a logistic relation between the probability of observing zeros and the mean of the genes and the sequencing depth in the model. Simulation studies show that the proposed method performs better than, or at least as well as, the existing methods in a wide range of settings. Two real datasets including a breast cancer RNA-seq dataset and a microRNA-seq dataset are also analyzed, and they coincide with the simulation results that our proposed method outperforms the existing competitors. The software is available at http://www.math.hkbu.edu.hk/∼tongt. xwan@comp.hkbu.edu.hk or tongt@hkbu.edu.hk. Supplementary data are available at Bioinformatics online.
USDA-ARS?s Scientific Manuscript database
The newly released rainbow trout genome assembly in NCBI RefSeq has greatly expanded our abilities for analyzing rainbow trout sequencing data. In this poster, we evaluate the utility of this genome assembly for analyzing RNA sequencing (RNA-seq) data of rainbow trout responses to various stressors,...
Qin, Yidan; Yao, Jun; Wu, Douglas C.; Nottingham, Ryan M.; Mohr, Sabine; Hunicke-Smith, Scott; Lambowitz, Alan M.
2016-01-01
Next-generation RNA-sequencing (RNA-seq) has revolutionized transcriptome profiling, gene expression analysis, and RNA-based diagnostics. Here, we developed a new RNA-seq method that exploits thermostable group II intron reverse transcriptases (TGIRTs) and used it to profile human plasma RNAs. TGIRTs have higher thermostability, processivity, and fidelity than conventional reverse transcriptases, plus a novel template-switching activity that can efficiently attach RNA-seq adapters to target RNA sequences without RNA ligation. The new TGIRT-seq method enabled construction of RNA-seq libraries from <1 ng of plasma RNA in <5 h. TGIRT-seq of RNA in 1-mL plasma samples from a healthy individual revealed RNA fragments mapping to a diverse population of protein-coding gene and long ncRNAs, which are enriched in intron and antisense sequences, as well as nearly all known classes of small ncRNAs, some of which have never before been seen in plasma. Surprisingly, many of the small ncRNA species were present as full-length transcripts, suggesting that they are protected from plasma RNases in ribonucleoprotein (RNP) complexes and/or exosomes. This TGIRT-seq method is readily adaptable for profiling of whole-cell, exosomal, and miRNAs, and for related procedures, such as HITS-CLIP and ribosome profiling. PMID:26554030
Impact of sequencing depth and read length on single cell RNA sequencing data of T cells.
Rizzetto, Simone; Eltahla, Auda A; Lin, Peijie; Bull, Rowena; Lloyd, Andrew R; Ho, Joshua W K; Venturi, Vanessa; Luciani, Fabio
2017-10-06
Single cell RNA sequencing (scRNA-seq) provides great potential in measuring the gene expression profiles of heterogeneous cell populations. In immunology, scRNA-seq allowed the characterisation of transcript sequence diversity of functionally relevant T cell subsets, and the identification of the full length T cell receptor (TCRαβ), which defines the specificity against cognate antigens. Several factors, e.g. RNA library capture, cell quality, and sequencing output affect the quality of scRNA-seq data. We studied the effects of read length and sequencing depth on the quality of gene expression profiles, cell type identification, and TCRαβ reconstruction, utilising 1,305 single cells from 8 publically available scRNA-seq datasets, and simulation-based analyses. Gene expression was characterised by an increased number of unique genes identified with short read lengths (<50 bp), but these featured higher technical variability compared to profiles from longer reads. Successful TCRαβ reconstruction was achieved for 6 datasets (81% - 100%) with at least 0.25 millions (PE) reads of length >50 bp, while it failed for datasets with <30 bp reads. Sufficient read length and sequencing depth can control technical noise to enable accurate identification of TCRαβ and gene expression profiles from scRNA-seq data of T cells.
Sequence Capture versus Restriction Site Associated DNA Sequencing for Shallow Systematics.
Harvey, Michael G; Smith, Brian Tilston; Glenn, Travis C; Faircloth, Brant C; Brumfield, Robb T
2016-09-01
Sequence capture and restriction site associated DNA sequencing (RAD-Seq) are two genomic enrichment strategies for applying next-generation sequencing technologies to systematics studies. At shallow timescales, such as within species, RAD-Seq has been widely adopted among researchers, although there has been little discussion of the potential limitations and benefits of RAD-Seq and sequence capture. We discuss a series of issues that may impact the utility of sequence capture and RAD-Seq data for shallow systematics in non-model species. We review prior studies that used both methods, and investigate differences between the methods by re-analyzing existing RAD-Seq and sequence capture data sets from a Neotropical bird (Xenops minutus). We suggest that the strengths of RAD-Seq data sets for shallow systematics are the wide dispersion of markers across the genome, the relative ease and cost of laboratory work, the deep coverage and read overlap at recovered loci, and the high overall information that results. Sequence capture's benefits include flexibility and repeatability in the genomic regions targeted, success using low-quality samples, more straightforward read orthology assessment, and higher per-locus information content. The utility of a method in systematics, however, rests not only on its performance within a study, but on the comparability of data sets and inferences with those of prior work. In RAD-Seq data sets, comparability is compromised by low overlap of orthologous markers across species and the sensitivity of genetic diversity in a data set to an interaction between the level of natural heterozygosity in the samples examined and the parameters used for orthology assessment. In contrast, sequence capture of conserved genomic regions permits interrogation of the same loci across divergent species, which is preferable for maintaining comparability among data sets and studies for the purpose of drawing general conclusions about the impact of historical processes across biotas. We argue that sequence capture should be given greater attention as a method of obtaining data for studies in shallow systematics and comparative phylogeography. © The Author(s) 2016. Published by Oxford University Press, on behalf of the Society of Systematic Biologists. All rights reserved. For Permissions, please email: journals.permissions@oup.com.
How B-Cell Receptor Repertoire Sequencing Can Be Enriched with Structural Antibody Data
Kovaltsuk, Aleksandr; Krawczyk, Konrad; Galson, Jacob D.; Kelly, Dominic F.; Deane, Charlotte M.; Trück, Johannes
2017-01-01
Next-generation sequencing of immunoglobulin gene repertoires (Ig-seq) allows the investigation of large-scale antibody dynamics at a sequence level. However, structural information, a crucial descriptor of antibody binding capability, is not collected in Ig-seq protocols. Developing systematic relationships between the antibody sequence information gathered from Ig-seq and low-throughput techniques such as X-ray crystallography could radically improve our understanding of antibodies. The mapping of Ig-seq datasets to known antibody structures can indicate structurally, and perhaps functionally, uncharted areas. Furthermore, contrasting naïve and antigenically challenged datasets using structural antibody descriptors should provide insights into antibody maturation. As the number of antibody structures steadily increases and more and more Ig-seq datasets become available, the opportunities that arise from combining the two types of information increase as well. Here, we review how these data types enrich one another and show potential for advancing our knowledge of the immune system and improving antibody engineering. PMID:29276518
Sequencing Strategies for Population and Cancer Epidemiology Studies (SeqSPACE) Webinar Series
The Sequencing Strategies for Population and Cancer Epidemiology Studies (SeqSPACE) Webinar Series provides an opportunity for our grantees and other interested individuals to share lessons learned and practical information regarding the application of next generation sequencing to cancer epidemiology studies.
Hong, Jungeui; Gresham, David
2017-11-01
Quantitative analysis of next-generation sequencing (NGS) data requires discriminating duplicate reads generated by PCR from identical molecules that are of unique origin. Typically, PCR duplicates are identified as sequence reads that align to the same genomic coordinates using reference-based alignment. However, identical molecules can be independently generated during library preparation. Misidentification of these molecules as PCR duplicates can introduce unforeseen biases during analyses. Here, we developed a cost-effective sequencing adapter design by modifying Illumina TruSeq adapters to incorporate a unique molecular identifier (UMI) while maintaining the capacity to undertake multiplexed, single-index sequencing. Incorporation of UMIs into TruSeq adapters (TrUMIseq adapters) enables identification of bona fide PCR duplicates as identically mapped reads with identical UMIs. Using TrUMIseq adapters, we show that accurate removal of PCR duplicates results in improved accuracy of both allele frequency (AF) estimation in heterogeneous populations using DNA sequencing and gene expression quantification using RNA-Seq.
Partial bisulfite conversion for unique template sequencing
Kumar, Vijay; Rosenbaum, Julie; Wang, Zihua; Forcier, Talitha; Ronemus, Michael; Wigler, Michael
2018-01-01
Abstract We introduce a new protocol, mutational sequencing or muSeq, which uses sodium bisulfite to randomly deaminate unmethylated cytosines at a fixed and tunable rate. The muSeq protocol marks each initial template molecule with a unique mutation signature that is present in every copy of the template, and in every fragmented copy of a copy. In the sequenced read data, this signature is observed as a unique pattern of C-to-T or G-to-A nucleotide conversions. Clustering reads with the same conversion pattern enables accurate count and long-range assembly of initial template molecules from short-read sequence data. We explore count and low-error sequencing by profiling 135 000 restriction fragments in a PstI representation, demonstrating that muSeq improves copy number inference and significantly reduces sporadic sequencer error. We explore long-range assembly in the context of cDNA, generating contiguous transcript clusters greater than 3,000 bp in length. The muSeq assemblies reveal transcriptional diversity not observable from short-read data alone. PMID:29161423
SPAR: small RNA-seq portal for analysis of sequencing experiments.
Kuksa, Pavel P; Amlie-Wolf, Alexandre; Katanic, Živadin; Valladares, Otto; Wang, Li-San; Leung, Yuk Yee
2018-05-04
The introduction of new high-throughput small RNA sequencing protocols that generate large-scale genomics datasets along with increasing evidence of the significant regulatory roles of small non-coding RNAs (sncRNAs) have highlighted the urgent need for tools to analyze and interpret large amounts of small RNA sequencing data. However, it remains challenging to systematically and comprehensively discover and characterize sncRNA genes and specifically-processed sncRNA products from these datasets. To fill this gap, we present Small RNA-seq Portal for Analysis of sequencing expeRiments (SPAR), a user-friendly web server for interactive processing, analysis, annotation and visualization of small RNA sequencing data. SPAR supports sequencing data generated from various experimental protocols, including smRNA-seq, short total RNA sequencing, microRNA-seq, and single-cell small RNA-seq. Additionally, SPAR includes publicly available reference sncRNA datasets from our DASHR database and from ENCODE across 185 human tissues and cell types to produce highly informative small RNA annotations across all major small RNA types and other features such as co-localization with various genomic features, precursor transcript cleavage patterns, and conservation. SPAR allows the user to compare the input experiment against reference ENCODE/DASHR datasets. SPAR currently supports analyses of human (hg19, hg38) and mouse (mm10) sequencing data. SPAR is freely available at https://www.lisanwanglab.org/SPAR.
Rcorrector: efficient and accurate error correction for Illumina RNA-seq reads.
Song, Li; Florea, Liliana
2015-01-01
Next-generation sequencing of cellular RNA (RNA-seq) is rapidly becoming the cornerstone of transcriptomic analysis. However, sequencing errors in the already short RNA-seq reads complicate bioinformatics analyses, in particular alignment and assembly. Error correction methods have been highly effective for whole-genome sequencing (WGS) reads, but are unsuitable for RNA-seq reads, owing to the variation in gene expression levels and alternative splicing. We developed a k-mer based method, Rcorrector, to correct random sequencing errors in Illumina RNA-seq reads. Rcorrector uses a De Bruijn graph to compactly represent all trusted k-mers in the input reads. Unlike WGS read correctors, which use a global threshold to determine trusted k-mers, Rcorrector computes a local threshold at every position in a read. Rcorrector has an accuracy higher than or comparable to existing methods, including the only other method (SEECER) designed for RNA-seq reads, and is more time and memory efficient. With a 5 GB memory footprint for 100 million reads, it can be run on virtually any desktop or server. The software is available free of charge under the GNU General Public License from https://github.com/mourisl/Rcorrector/.
The power and promise of RNA-seq in ecology and evolution.
Todd, Erica V; Black, Michael A; Gemmell, Neil J
2016-03-01
Reference is regularly made to the power of new genomic sequencing approaches. Using powerful technology, however, is not the same as having the necessary power to address a research question with statistical robustness. In the rush to adopt new and improved genomic research methods, limitations of technology and experimental design may be initially neglected. Here, we review these issues with regard to RNA sequencing (RNA-seq). RNA-seq adds large-scale transcriptomics to the toolkit of ecological and evolutionary biologists, enabling differential gene expression (DE) studies in nonmodel species without the need for prior genomic resources. High biological variance is typical of field-based gene expression studies and means that larger sample sizes are often needed to achieve the same degree of statistical power as clinical studies based on data from cell lines or inbred animal models. Sequencing costs have plummeted, yet RNA-seq studies still underutilize biological replication. Finite research budgets force a trade-off between sequencing effort and replication in RNA-seq experimental design. However, clear guidelines for negotiating this trade-off, while taking into account study-specific factors affecting power, are currently lacking. Study designs that prioritize sequencing depth over replication fail to capitalize on the power of RNA-seq technology for DE inference. Significant recent research effort has gone into developing statistical frameworks and software tools for power analysis and sample size calculation in the context of RNA-seq DE analysis. We synthesize progress in this area and derive an accessible rule-of-thumb guide for designing powerful RNA-seq experiments relevant in eco-evolutionary and clinical settings alike. © 2016 John Wiley & Sons Ltd.
Quantitative phenotyping via deep barcode sequencing.
Smith, Andrew M; Heisler, Lawrence E; Mellor, Joseph; Kaper, Fiona; Thompson, Michael J; Chee, Mark; Roth, Frederick P; Giaever, Guri; Nislow, Corey
2009-10-01
Next-generation DNA sequencing technologies have revolutionized diverse genomics applications, including de novo genome sequencing, SNP detection, chromatin immunoprecipitation, and transcriptome analysis. Here we apply deep sequencing to genome-scale fitness profiling to evaluate yeast strain collections in parallel. This method, Barcode analysis by Sequencing, or "Bar-seq," outperforms the current benchmark barcode microarray assay in terms of both dynamic range and throughput. When applied to a complex chemogenomic assay, Bar-seq quantitatively identifies drug targets, with performance superior to the benchmark microarray assay. We also show that Bar-seq is well-suited for a multiplex format. We completely re-sequenced and re-annotated the yeast deletion collection using deep sequencing, found that approximately 20% of the barcodes and common priming sequences varied from expectation, and used this revised list of barcode sequences to improve data quality. Together, this new assay and analysis routine provide a deep-sequencing-based toolkit for identifying gene-environment interactions on a genome-wide scale.
Sie, Daoud; Snijders, Peter J F; Meijer, Gerrit A; Doeleman, Marije W; van Moorsel, Marinda I H; van Essen, Hendrik F; Eijk, Paul P; Grünberg, Katrien; van Grieken, Nicole C T; Thunnissen, Erik; Verheul, Henk M; Smit, Egbert F; Ylstra, Bauke; Heideman, Daniëlle A M
2014-10-01
Next generation DNA sequencing (NGS) holds promise for diagnostic applications, yet implementation in routine molecular pathology practice requires performance evaluation on DNA derived from routine formalin-fixed paraffin-embedded (FFPE) tissue specimens. The current study presents a comprehensive analysis of TruSeq Amplicon Cancer Panel-based NGS using a MiSeq Personal sequencer (TSACP-MiSeq-NGS) for somatic mutation profiling. TSACP-MiSeq-NGS (testing 212 hotspot mutation amplicons of 48 genes) and a data analysis pipeline were evaluated in a retrospective learning/test set approach (n = 58/n = 45 FFPE-tumor DNA samples) against 'gold standard' high-resolution-melting (HRM)-sequencing for the genes KRAS, EGFR, BRAF and PIK3CA. Next, the performance of the validated test algorithm was assessed in an independent, prospective cohort of FFPE-tumor DNA samples (n = 75). In the learning set, a number of minimum parameter settings was defined to decide whether a FFPE-DNA sample is qualified for TSACP-MiSeq-NGS and for calling mutations. The resulting test algorithm revealed 82% (37/45) compliance to the quality criteria and 95% (35/37) concordant assay findings for KRAS, EGFR, BRAF and PIK3CA with HRM-sequencing (kappa = 0.92; 95% CI = 0.81-1.03) in the test set. Subsequent application of the validated test algorithm to the prospective cohort yielded a success rate of 84% (63/75), and a high concordance with HRM-sequencing (95% (60/63); kappa = 0.92; 95% CI = 0.84-1.01). TSACP-MiSeq-NGS detected 77 mutations in 29 additional genes. TSACP-MiSeq-NGS is suitable for diagnostic gene mutation profiling in oncopathology.
SeqTrim: a high-throughput pipeline for pre-processing any type of sequence read
2010-01-01
Background High-throughput automated sequencing has enabled an exponential growth rate of sequencing data. This requires increasing sequence quality and reliability in order to avoid database contamination with artefactual sequences. The arrival of pyrosequencing enhances this problem and necessitates customisable pre-processing algorithms. Results SeqTrim has been implemented both as a Web and as a standalone command line application. Already-published and newly-designed algorithms have been included to identify sequence inserts, to remove low quality, vector, adaptor, low complexity and contaminant sequences, and to detect chimeric reads. The availability of several input and output formats allows its inclusion in sequence processing workflows. Due to its specific algorithms, SeqTrim outperforms other pre-processors implemented as Web services or standalone applications. It performs equally well with sequences from EST libraries, SSH libraries, genomic DNA libraries and pyrosequencing reads and does not lead to over-trimming. Conclusions SeqTrim is an efficient pipeline designed for pre-processing of any type of sequence read, including next-generation sequencing. It is easily configurable and provides a friendly interface that allows users to know what happened with sequences at every pre-processing stage, and to verify pre-processing of an individual sequence if desired. The recommended pipeline reveals more information about each sequence than previously described pre-processors and can discard more sequencing or experimental artefacts. PMID:20089148
Biallelic missense variants in ZBTB11 can cause intellectual disability in human.
Fattahi, Zohreh; Sheikh, Taimoor I; Musante, Luciana; Rasheed, Memoona; Taskiran, Ibrahim Ihsan; Harripaul, Ricardo; Hu, Hao; Kazeminasab, Somayeh; Alam, Muhammad Rizwan; Hosseini, Masoumeh; Larti, Farzaneh; Ghaderi, Zhila; Celik, Arzu; Ayub, Muhammad; Ansar, Muhammad; Haddadi, Mohammad; Wienker, Thomas F; Ropers, Hans Hilger; Kahrizi, Kimia; Vincent, John B; Najmabadi, H
2018-06-08
Exploring genes and pathways underlying Intellectual Disability (ID) provides insight into brain development and function, clarifying the complex puzzle of how cognition develops. As part of ongoing systematic studies to identify candidate ID genes, linkage analysis and next generation sequencing revealed ZBTB11, as a novel candidate ID gene. ZBTB11 encodes a less-studied transcription regulator and the two identified missense variants in this study may disrupt canonical Zn2+-binding residues of its C2H2 zinc finger domain, leading to possible altered DNA binding. Using HEK293T cells transfected with wild type and mutant GFP-ZBTB11 constructs, we found the ZBTB11 mutants being excluded from the nucleolus, where the wild-type recombinant protein is predominantly localized. Pathway analysis applied to ChIP-seq data deposited in the ENCODE database supports the localization of ZBTB11 in nucleoli, highlighting associated pathways such as rRNA synthesis, ribosomal assembly, RNA modification, stress sensing and provides a direct link between subcellular ZBTB11 location and its function. Furthermore, considering the report of prominent brain and spinal cord degeneration in a zebrafish Zbtb11 mutant, we investigated ZBTB11-ortholog knockdown in Drosophila melanogaster brain by targeting RNAi using the UAS/Gal4 system. The observed approximate reduction to a third of the mushroom body size - possibly through neuronal reduction or degeneration - may affect neuronal circuits in the brain that are required for adaptive behavior, specifying the role of this gene in nervous system. In conclusion, we report two ID families segregating ZBTB11 biallelic mutations disrupting Zn2+-binding motifs, and provide functional evidence linking ZBTB11 dysfunction to this phenotype.
SeqPig: simple and scalable scripting for large sequencing data sets in Hadoop.
Schumacher, André; Pireddu, Luca; Niemenmaa, Matti; Kallio, Aleksi; Korpelainen, Eija; Zanetti, Gianluigi; Heljanko, Keijo
2014-01-01
Hadoop MapReduce-based approaches have become increasingly popular due to their scalability in processing large sequencing datasets. However, as these methods typically require in-depth expertise in Hadoop and Java, they are still out of reach of many bioinformaticians. To solve this problem, we have created SeqPig, a library and a collection of tools to manipulate, analyze and query sequencing datasets in a scalable and simple manner. SeqPigscripts use the Hadoop-based distributed scripting engine Apache Pig, which automatically parallelizes and distributes data processing tasks. We demonstrate SeqPig's scalability over many computing nodes and illustrate its use with example scripts. Available under the open source MIT license at http://sourceforge.net/projects/seqpig/
Isoform Sequencing and State-of-Art Applications for Unravelling Complexity of Plant Transcriptomes
An, Dong; Li, Changsheng; Humbeck, Klaus
2018-01-01
Single-molecule real-time (SMRT) sequencing developed by PacBio, also called third-generation sequencing (TGS), offers longer reads than the second-generation sequencing (SGS). Given its ability to obtain full-length transcripts without assembly, isoform sequencing (Iso-Seq) of transcriptomes by PacBio is advantageous for genome annotation, identification of novel genes and isoforms, as well as the discovery of long non-coding RNA (lncRNA). In addition, Iso-Seq gives access to the direct detection of alternative splicing, alternative polyadenylation (APA), gene fusion, and DNA modifications. Such applications of Iso-Seq facilitate the understanding of gene structure, post-transcriptional regulatory networks, and subsequently proteomic diversity. In this review, we summarize its applications in plant transcriptome study, specifically pointing out challenges associated with each step in the experimental design and highlight the development of bioinformatic pipelines. We aim to provide the community with an integrative overview and a comprehensive guidance to Iso-Seq, and thus to promote its applications in plant research. PMID:29346292
2014-01-01
We present primary results from the Sequencing Quality Control (SEQC) project, coordinated by the United States Food and Drug Administration. Examining Illumina HiSeq, Life Technologies SOLiD and Roche 454 platforms at multiple laboratory sites using reference RNA samples with built-in controls, we assess RNA sequencing (RNA-seq) performance for junction discovery and differential expression profiling and compare it to microarray and quantitative PCR (qPCR) data using complementary metrics. At all sequencing depths, we discover unannotated exon-exon junctions, with >80% validated by qPCR. We find that measurements of relative expression are accurate and reproducible across sites and platforms if specific filters are used. In contrast, RNA-seq and microarrays do not provide accurate absolute measurements, and gene-specific biases are observed, for these and qPCR. Measurement performance depends on the platform and data analysis pipeline, and variation is large for transcript-level profiling. The complete SEQC data sets, comprising >100 billion reads (10Tb), provide unique resources for evaluating RNA-seq analyses for clinical and regulatory settings. PMID:25150838
PRAPI: post-transcriptional regulation analysis pipeline for Iso-Seq.
Gao, Yubang; Wang, Huiyuan; Zhang, Hangxiao; Wang, Yongsheng; Chen, Jinfeng; Gu, Lianfeng
2018-05-01
The single-molecule real-time (SMRT) isoform sequencing (Iso-Seq) based on Pacific Bioscience (PacBio) platform has received increasing attention for its ability to explore full-length isoforms. Thus, comprehensive tools for Iso-Seq bioinformatics analysis are extremely useful. Here, we present a one-stop solution for Iso-Seq analysis, called PRAPI to analyze alternative transcription initiation (ATI), alternative splicing (AS), alternative cleavage and polyadenylation (APA), natural antisense transcripts (NAT), and circular RNAs (circRNAs) comprehensively. PRAPI is capable of combining Iso-Seq full-length isoforms with short read data, such as RNA-Seq or polyadenylation site sequencing (PAS-seq) for differential expression analysis of NAT, AS, APA and circRNAs. Furthermore, PRAPI can annotate new genes and correct mis-annotated genes when gene annotation is available. Finally, PRAPI generates high-quality vector graphics to visualize and highlight the Iso-Seq results. The Dockerfile of PRAPI is available at http://www.bioinfor.org/tool/PRAPI. lfgu@fafu.edu.cn.
CapZyme-Seq Comprehensively Defines Promoter-Sequence Determinants for RNA 5' Capping with NAD.
Vvedenskaya, Irina O; Bird, Jeremy G; Zhang, Yuanchao; Zhang, Yu; Jiao, Xinfu; Barvík, Ivan; Krásný, Libor; Kiledjian, Megerditch; Taylor, Deanne M; Ebright, Richard H; Nickels, Bryce E
2018-05-03
Nucleoside-containing metabolites such as NAD + can be incorporated as 5' caps on RNA by serving as non-canonical initiating nucleotides (NCINs) for transcription initiation by RNA polymerase (RNAP). Here, we report CapZyme-seq, a high-throughput-sequencing method that employs NCIN-decapping enzymes NudC and Rai1 to detect and quantify NCIN-capped RNA. By combining CapZyme-seq with multiplexed transcriptomics, we determine efficiencies of NAD + capping by Escherichia coli RNAP for ∼16,000 promoter sequences. The results define preferred transcription start site (TSS) positions for NAD + capping and define a consensus promoter sequence for NAD + capping: HRRASWW (TSS underlined). By applying CapZyme-seq to E. coli total cellular RNA, we establish that sequence determinants for NCIN capping in vivo match the NAD + -capping consensus defined in vitro, and we identify and quantify NCIN-capped small RNAs (sRNAs). Our findings define the promoter-sequence determinants for NCIN capping with NAD + and provide a general method for analysis of NCIN capping in vitro and in vivo. Copyright © 2018 Elsevier Inc. All rights reserved.
Partial bisulfite conversion for unique template sequencing.
Kumar, Vijay; Rosenbaum, Julie; Wang, Zihua; Forcier, Talitha; Ronemus, Michael; Wigler, Michael; Levy, Dan
2018-01-25
We introduce a new protocol, mutational sequencing or muSeq, which uses sodium bisulfite to randomly deaminate unmethylated cytosines at a fixed and tunable rate. The muSeq protocol marks each initial template molecule with a unique mutation signature that is present in every copy of the template, and in every fragmented copy of a copy. In the sequenced read data, this signature is observed as a unique pattern of C-to-T or G-to-A nucleotide conversions. Clustering reads with the same conversion pattern enables accurate count and long-range assembly of initial template molecules from short-read sequence data. We explore count and low-error sequencing by profiling 135 000 restriction fragments in a PstI representation, demonstrating that muSeq improves copy number inference and significantly reduces sporadic sequencer error. We explore long-range assembly in the context of cDNA, generating contiguous transcript clusters greater than 3,000 bp in length. The muSeq assemblies reveal transcriptional diversity not observable from short-read data alone. © The Author(s) 2017. Published by Oxford University Press on behalf of Nucleic Acids Research.
Zhou, Shuntai; Jones, Corbin; Mieczkowski, Piotr
2015-01-01
ABSTRACT Validating the sampling depth and reducing sequencing errors are critical for studies of viral populations using next-generation sequencing (NGS). We previously described the use of Primer ID to tag each viral RNA template with a block of degenerate nucleotides in the cDNA primer. We now show that low-abundance Primer IDs (offspring Primer IDs) are generated due to PCR/sequencing errors. These artifactual Primer IDs can be removed using a cutoff model for the number of reads required to make a template consensus sequence. We have modeled the fraction of sequences lost due to Primer ID resampling. For a typical sequencing run, less than 10% of the raw reads are lost to offspring Primer ID filtering and resampling. The remaining raw reads are used to correct for PCR resampling and sequencing errors. We also demonstrate that Primer ID reveals bias intrinsic to PCR, especially at low template input or utilization. cDNA synthesis and PCR convert ca. 20% of RNA templates into recoverable sequences, and 30-fold sequence coverage recovers most of these template sequences. We have directly measured the residual error rate to be around 1 in 10,000 nucleotides. We use this error rate and the Poisson distribution to define the cutoff to identify preexisting drug resistance mutations at low abundance in an HIV-infected subject. Collectively, these studies show that >90% of the raw sequence reads can be used to validate template sampling depth and to dramatically reduce the error rate in assessing a genetically diverse viral population using NGS. IMPORTANCE Although next-generation sequencing (NGS) has revolutionized sequencing strategies, it suffers from serious limitations in defining sequence heterogeneity in a genetically diverse population, such as HIV-1 due to PCR resampling and PCR/sequencing errors. The Primer ID approach reveals the true sampling depth and greatly reduces errors. Knowing the sampling depth allows the construction of a model of how to maximize the recovery of sequences from input templates and to reduce resampling of the Primer ID so that appropriate multiplexing can be included in the experimental design. With the defined sampling depth and measured error rate, we are able to assign cutoffs for the accurate detection of minority variants in viral populations. This approach allows the power of NGS to be realized without having to guess about sampling depth or to ignore the problem of PCR resampling, while also being able to correct most of the errors in the data set. PMID:26041299
SeqHound: biological sequence and structure database as a platform for bioinformatics research
2002-01-01
Background SeqHound has been developed as an integrated biological sequence, taxonomy, annotation and 3-D structure database system. It provides a high-performance server platform for bioinformatics research in a locally-hosted environment. Results SeqHound is based on the National Center for Biotechnology Information data model and programming tools. It offers daily updated contents of all Entrez sequence databases in addition to 3-D structural data and information about sequence redundancies, sequence neighbours, taxonomy, complete genomes, functional annotation including Gene Ontology terms and literature links to PubMed. SeqHound is accessible via a web server through a Perl, C or C++ remote API or an optimized local API. It provides functionality necessary to retrieve specialized subsets of sequences, structures and structural domains. Sequences may be retrieved in FASTA, GenBank, ASN.1 and XML formats. Structures are available in ASN.1, XML and PDB formats. Emphasis has been placed on complete genomes, taxonomy, domain and functional annotation as well as 3-D structural functionality in the API, while fielded text indexing functionality remains under development. SeqHound also offers a streamlined WWW interface for simple web-user queries. Conclusions The system has proven useful in several published bioinformatics projects such as the BIND database and offers a cost-effective infrastructure for research. SeqHound will continue to develop and be provided as a service of the Blueprint Initiative at the Samuel Lunenfeld Research Institute. The source code and examples are available under the terms of the GNU public license at the Sourceforge site http://sourceforge.net/projects/slritools/ in the SLRI Toolkit. PMID:12401134
Jupe, Florian; Witek, Kamil; Verweij, Walter; Śliwka, Jadwiga; Pritchard, Leighton; Etherington, Graham J; Maclean, Dan; Cock, Peter J; Leggett, Richard M; Bryan, Glenn J; Cardle, Linda; Hein, Ingo; Jones, Jonathan DG
2013-01-01
Summary RenSeq is a NB-LRR (nucleotide binding-site leucine-rich repeat) gene-targeted, Resistance gene enrichment and sequencing method that enables discovery and annotation of pathogen resistance gene family members in plant genome sequences. We successfully applied RenSeq to the sequenced potato Solanum tuberosum clone DM, and increased the number of identified NB-LRRs from 438 to 755. The majority of these identified R gene loci reside in poorly or previously unannotated regions of the genome. Sequence and positional details on the 12 chromosomes have been established for 704 NB-LRRs and can be accessed through a genome browser that we provide. We compared these NB-LRR genes and the corresponding oligonucleotide baits with the highest sequence similarity and demonstrated that ∼80% sequence identity is sufficient for enrichment. Analysis of the sequenced tomato S. lycopersicum ‘Heinz 1706’ extended the NB-LRR complement to 394 loci. We further describe a methodology that applies RenSeq to rapidly identify molecular markers that co-segregate with a pathogen resistance trait of interest. In two independent segregating populations involving the wild Solanum species S. berthaultii (Rpi-ber2) and S. ruiz-ceballosii (Rpi-rzc1), we were able to apply RenSeq successfully to identify markers that co-segregate with resistance towards the late blight pathogen Phytophthora infestans. These SNP identification workflows were designed as easy-to-adapt Galaxy pipelines. PMID:23937694
A practical guide to single-cell RNA-sequencing for biomedical research and clinical applications.
Haque, Ashraful; Engel, Jessica; Teichmann, Sarah A; Lönnberg, Tapio
2017-08-18
RNA sequencing (RNA-seq) is a genomic approach for the detection and quantitative analysis of messenger RNA molecules in a biological sample and is useful for studying cellular responses. RNA-seq has fueled much discovery and innovation in medicine over recent years. For practical reasons, the technique is usually conducted on samples comprising thousands to millions of cells. However, this has hindered direct assessment of the fundamental unit of biology-the cell. Since the first single-cell RNA-sequencing (scRNA-seq) study was published in 2009, many more have been conducted, mostly by specialist laboratories with unique skills in wet-lab single-cell genomics, bioinformatics, and computation. However, with the increasing commercial availability of scRNA-seq platforms, and the rapid ongoing maturation of bioinformatics approaches, a point has been reached where any biomedical researcher or clinician can use scRNA-seq to make exciting discoveries. In this review, we present a practical guide to help researchers design their first scRNA-seq studies, including introductory information on experimental hardware, protocol choice, quality control, data analysis and biological interpretation.
Wilkinson, Samuel L.; John, Shibu; Walsh, Roddy; Novotny, Tomas; Valaskova, Iveta; Gupta, Manu; Game, Laurence; Barton, Paul J R.; Cook, Stuart A.; Ware, James S.
2013-01-01
Background Molecular genetic testing is recommended for diagnosis of inherited cardiac disease, to guide prognosis and treatment, but access is often limited by cost and availability. Recently introduced high-throughput bench-top DNA sequencing platforms have the potential to overcome these limitations. Methodology/Principal Findings We evaluated two next-generation sequencing (NGS) platforms for molecular diagnostics. The protein-coding regions of six genes associated with inherited arrhythmia syndromes were amplified from 15 human samples using parallelised multiplex PCR (Access Array, Fluidigm), and sequenced on the MiSeq (Illumina) and Ion Torrent PGM (Life Technologies). Overall, 97.9% of the target was sequenced adequately for variant calling on the MiSeq, and 96.8% on the Ion Torrent PGM. Regions missed tended to be of high GC-content, and most were problematic for both platforms. Variant calling was assessed using 107 variants detected using Sanger sequencing: within adequately sequenced regions, variant calling on both platforms was highly accurate (Sensitivity: MiSeq 100%, PGM 99.1%. Positive predictive value: MiSeq 95.9%, PGM 95.5%). At the time of the study the Ion Torrent PGM had a lower capital cost and individual runs were cheaper and faster. The MiSeq had a higher capacity (requiring fewer runs), with reduced hands-on time and simpler laboratory workflows. Both provide significant cost and time savings over conventional methods, even allowing for adjunct Sanger sequencing to validate findings and sequence exons missed by NGS. Conclusions/Significance MiSeq and Ion Torrent PGM both provide accurate variant detection as part of a PCR-based molecular diagnostic workflow, and provide alternative platforms for molecular diagnosis of inherited cardiac conditions. Though there were performance differences at this throughput, platforms differed primarily in terms of cost, scalability, protocol stability and ease of use. Compared with current molecular genetic diagnostic tests for inherited cardiac arrhythmias, these NGS approaches are faster, less expensive, and yet more comprehensive. PMID:23861798
SeqAPASS (Sequence Alignment to Predict Across Species Susceptibility) software and documentation
SeqAPASS is a software application facilitates rapid and streamlined, yet transparent, comparisons of the similarity of toxicologically-significant molecular targets across species. The present application facilitates analysis of primary amino acid sequence similarity (including ...
Hellberg, Rosalee S; Martin, Keely G; Keys, Ashley L; Haney, Christopher J; Shen, Yuelian; Smiley, R Derike
2013-12-01
Use of 16S rRNA partial gene sequencing within the regulatory workflow could greatly reduce the time and labor needed for confirmation and subtyping of Listeria monocytogenes. The goal of this study was to build a 16S rRNA partial gene reference library for Listeria spp. and investigate the potential for 16S rRNA molecular subtyping. A total of 86 isolates of Listeria representing L. innocua, L. seeligeri, L. welshimeri, and L. monocytogenes were obtained for use in building the custom library. Seven non-Listeria species and three additional strains of Listeria were obtained for use in exclusivity and food spiking tests. Isolates were sequenced for the partial 16S rRNA gene using the MicroSeq ID 500 Bacterial Identification Kit (Applied Biosystems). High-quality sequences were obtained for 84 of the custom library isolates and 23 unique 16S sequence types were discovered for use in molecular subtyping. All of the exclusivity strains were negative for Listeria and the three Listeria strains used in food spiking were consistently recovered and correctly identified at the species level. The spiking results also allowed for differentiation beyond the species level, as 87% of replicates for one strain and 100% of replicates for the other two strains consistently matched the same 16S type. Copyright © 2013 Elsevier Ltd. All rights reserved.
Hysteretic energy prediction method for mainshock-aftershock sequences
NASA Astrophysics Data System (ADS)
Zhai, Changhai; Ji, Duofa; Wen, Weiping; Li, Cuihua; Lei, Weidong; Xie, Lili
2018-04-01
Structures located in seismically active regions may be subjected to mainshock-aftershock (MSAS) sequences. Strong aftershocks significantly affect the hysteretic energy demand of structures. The hysteretic energy, E H,seq, is normalized by mass m and expressed in terms of the equivalent velocity, V D,seq, to quantitatively investigate aftershock effects on the hysteretic energy of structures. The equivalent velocity, V D,seq, is computed by analyzing the response time-history of an inelastic single-degree-of-freedom (SDOF) system with a varying vibration period subjected to 309 MSAS sequences. The present study selected two kinds of MSAS sequences, with one aftershock and two aftershocks, respectively. The aftershocks are scaled to maintain different relative intensities. The variation of the equivalent velocity, V D,seq, is studied for consideration of the ductility values, site conditions, relative intensities, number of aftershocks, hysteretic models, and damping ratios. The MSAS sequence with one aftershock exhibited a 10% to 30% hysteretic energy increase, whereas the MSAS sequence with two aftershocks presented a 20% to 40% hysteretic energy increase. Finally, a hysteretic energy prediction equation is proposed as a function of the vibration period, ductility value, and damping ratio to estimate hysteretic energy for mainshock-aftershock sequences.
Quantitative phenotyping via deep barcode sequencing
Smith, Andrew M.; Heisler, Lawrence E.; Mellor, Joseph; Kaper, Fiona; Thompson, Michael J.; Chee, Mark; Roth, Frederick P.; Giaever, Guri; Nislow, Corey
2009-01-01
Next-generation DNA sequencing technologies have revolutionized diverse genomics applications, including de novo genome sequencing, SNP detection, chromatin immunoprecipitation, and transcriptome analysis. Here we apply deep sequencing to genome-scale fitness profiling to evaluate yeast strain collections in parallel. This method, Barcode analysis by Sequencing, or “Bar-seq,” outperforms the current benchmark barcode microarray assay in terms of both dynamic range and throughput. When applied to a complex chemogenomic assay, Bar-seq quantitatively identifies drug targets, with performance superior to the benchmark microarray assay. We also show that Bar-seq is well-suited for a multiplex format. We completely re-sequenced and re-annotated the yeast deletion collection using deep sequencing, found that ∼20% of the barcodes and common priming sequences varied from expectation, and used this revised list of barcode sequences to improve data quality. Together, this new assay and analysis routine provide a deep-sequencing-based toolkit for identifying gene–environment interactions on a genome-wide scale. PMID:19622793
Walter, Vonn; Patel, Nirali M.; Eberhard, David A.; Hayward, Michele C.; Salazar, Ashley H.; Jo, Heejoon; Soloway, Matthew G.; Wilkerson, Matthew D.; Parker, Joel S.; Yin, Xiaoying; Zhang, Guosheng; Siegel, Marni B.; Rosson, Gary B.; Earp, H. Shelton; Sharpless, Norman E.; Gulley, Margaret L.; Weck, Karen E.
2015-01-01
The recent FDA approval of the MiSeqDx platform provides a unique opportunity to develop targeted next generation sequencing (NGS) panels for human disease, including cancer. We have developed a scalable, targeted panel-based assay termed UNCseq, which involves a NGS panel of over 200 cancer-associated genes and a standardized downstream bioinformatics pipeline for detection of single nucleotide variations (SNV) as well as small insertions and deletions (indel). In addition, we developed a novel algorithm, NGScopy, designed for samples with sparse sequencing coverage to detect large-scale copy number variations (CNV), similar to human SNP Array 6.0 as well as small-scale intragenic CNV. Overall, we applied this assay to 100 snap-frozen lung cancer specimens lacking same-patient germline DNA (07–0120 tissue cohort) and validated our results against Sanger sequencing, SNP Array, and our recently published integrated DNA-seq/RNA-seq assay, UNCqeR, where RNA-seq of same-patient tumor specimens confirmed SNV detected by DNA-seq, if RNA-seq coverage depth was adequate. In addition, we applied the UNCseq assay on an independent lung cancer tumor tissue collection with available same-patient germline DNA (11–1115 tissue cohort) and confirmed mutations using assays performed in a CLIA-certified laboratory. We conclude that UNCseq can identify SNV, indel, and CNV in tumor specimens lacking germline DNA in a cost-efficient fashion. PMID:26076459
JVM: Java Visual Mapping tool for next generation sequencing read.
Yang, Ye; Liu, Juan
2015-01-01
We developed a program JVM (Java Visual Mapping) for mapping next generation sequencing read to reference sequence. The program is implemented in Java and is designed to deal with millions of short read generated by sequence alignment using the Illumina sequencing technology. It employs seed index strategy and octal encoding operations for sequence alignments. JVM is useful for DNA-Seq, RNA-Seq when dealing with single-end resequencing. JVM is a desktop application, which supports reads capacity from 1 MB to 10 GB.
SeqMule: automated pipeline for analysis of human exome/genome sequencing data.
Guo, Yunfei; Ding, Xiaolei; Shen, Yufeng; Lyon, Gholson J; Wang, Kai
2015-09-18
Next-generation sequencing (NGS) technology has greatly helped us identify disease-contributory variants for Mendelian diseases. However, users are often faced with issues such as software compatibility, complicated configuration, and no access to high-performance computing facility. Discrepancies exist among aligners and variant callers. We developed a computational pipeline, SeqMule, to perform automated variant calling from NGS data on human genomes and exomes. SeqMule integrates computational-cluster-free parallelization capability built on top of the variant callers, and facilitates normalization/intersection of variant calls to generate consensus set with high confidence. SeqMule integrates 5 alignment tools, 5 variant calling algorithms and accepts various combinations all by one-line command, therefore allowing highly flexible yet fully automated variant calling. In a modern machine (2 Intel Xeon X5650 CPUs, 48 GB memory), when fast turn-around is needed, SeqMule generates annotated VCF files in a day from a 30X whole-genome sequencing data set; when more accurate calling is needed, SeqMule generates consensus call set that improves over single callers, as measured by both Mendelian error rate and consistency. SeqMule supports Sun Grid Engine for parallel processing, offers turn-key solution for deployment on Amazon Web Services, allows quality check, Mendelian error check, consistency evaluation, HTML-based reports. SeqMule is available at http://seqmule.openbioinformatics.org.
Kwon, Andrew T.; Arenillas, David J.; Hunt, Rebecca Worsley; Wasserman, Wyeth W.
2012-01-01
oPOSSUM-3 is a web-accessible software system for identification of over-represented transcription factor binding sites (TFBS) and TFBS families in either DNA sequences of co-expressed genes or sequences generated from high-throughput methods, such as ChIP-Seq. Validation of the system with known sets of co-regulated genes and published ChIP-Seq data demonstrates the capacity for oPOSSUM-3 to identify mediating transcription factors (TF) for co-regulated genes or co-recovered sequences. oPOSSUM-3 is available at http://opossum.cisreg.ca. PMID:22973536
Kwon, Andrew T; Arenillas, David J; Worsley Hunt, Rebecca; Wasserman, Wyeth W
2012-09-01
oPOSSUM-3 is a web-accessible software system for identification of over-represented transcription factor binding sites (TFBS) and TFBS families in either DNA sequences of co-expressed genes or sequences generated from high-throughput methods, such as ChIP-Seq. Validation of the system with known sets of co-regulated genes and published ChIP-Seq data demonstrates the capacity for oPOSSUM-3 to identify mediating transcription factors (TF) for co-regulated genes or co-recovered sequences. oPOSSUM-3 is available at http://opossum.cisreg.ca.
Woo, Patrick C Y; Chung, Liliane M W; Teng, Jade L L; Tse, Herman; Pang, Sherby S Y; Lau, Veronica Y T; Wong, Vanessa W K; Kam, Kwok‐ling; Lau, Susanna K P; Yuen, Kwok‐Yung
2007-01-01
This study is the first study that provides useful guidelines to clinical microbiologists and technicians on the usefulness of full 16S rRNA sequencing, 5′‐end 527‐bp 16S rRNA sequencing and the existing MicroSeq full and 500 16S rDNA bacterial identification system (MicroSeq, Perkin‐Elmer Applied Biosystems Division, Foster City, California, USA) databases for the identification of all existing medically important anaerobic bacteria. Full and 527‐bp 16S rRNA sequencing are able to identify 52–63% of 130 Gram‐positive anaerobic rods, 72–73% of 86 Gram‐negative anaerobic rods and 78% of 23 anaerobic cocci. The existing MicroSeq databases are able to identify only 19–25% of 130 Gram‐positive anaerobic rods, 38% of 86 Gram‐negative anaerobic rods and 39% of 23 anaerobic cocci. These represent only 45–46% of those that should be confidently identified by full and 527‐bp 16S rRNA sequencing. To improve the usefulness of MicroSeq, bacterial species that should be confidently identified by full and/or 527‐bp 16S rRNA sequencing but not included in the existing MicroSeq databases should be included. PMID:17046845
LookSeq: a browser-based viewer for deep sequencing data.
Manske, Heinrich Magnus; Kwiatkowski, Dominic P
2009-11-01
Sequencing a genome to great depth can be highly informative about heterogeneity within an individual or a population. Here we address the problem of how to visualize the multiple layers of information contained in deep sequencing data. We propose an interactive AJAX-based web viewer for browsing large data sets of aligned sequence reads. By enabling seamless browsing and fast zooming, the LookSeq program assists the user to assimilate information at different levels of resolution, from an overview of a genomic region to fine details such as heterogeneity within the sample. A specific problem, particularly if the sample is heterogeneous, is how to depict information about structural variation. LookSeq provides a simple graphical representation of paired sequence reads that is more revealing about potential insertions and deletions than are conventional methods.
Pruitt, Kim D.; Tatusova, Tatiana; Maglott, Donna R.
2005-01-01
The National Center for Biotechnology Information (NCBI) Reference Sequence (RefSeq) database (http://www.ncbi.nlm.nih.gov/RefSeq/) provides a non-redundant collection of sequences representing genomic data, transcripts and proteins. Although the goal is to provide a comprehensive dataset representing the complete sequence information for any given species, the database pragmatically includes sequence data that are currently publicly available in the archival databases. The database incorporates data from over 2400 organisms and includes over one million proteins representing significant taxonomic diversity spanning prokaryotes, eukaryotes and viruses. Nucleotide and protein sequences are explicitly linked, and the sequences are linked to other resources including the NCBI Map Viewer and Gene. Sequences are annotated to include coding regions, conserved domains, variation, references, names, database cross-references, and other features using a combined approach of collaboration and other input from the scientific community, automated annotation, propagation from GenBank and curation by NCBI staff. PMID:15608248
The Chlamydomonas genome project: a decade on.
Blaby, Ian K; Blaby-Haas, Crysten E; Tourasse, Nicolas; Hom, Erik F Y; Lopez, David; Aksoy, Munevver; Grossman, Arthur; Umen, James; Dutcher, Susan; Porter, Mary; King, Stephen; Witman, George B; Stanke, Mario; Harris, Elizabeth H; Goodstein, David; Grimwood, Jane; Schmutz, Jeremy; Vallon, Olivier; Merchant, Sabeeha S; Prochnik, Simon
2014-10-01
The green alga Chlamydomonas reinhardtii is a popular unicellular organism for studying photosynthesis, cilia biogenesis, and micronutrient homeostasis. Ten years since its genome project was initiated an iterative process of improvements to the genome and gene predictions has propelled this organism to the forefront of the omics era. Housed at Phytozome, the plant genomics portal of the Joint Genome Institute (JGI), the most up-to-date genomic data include a genome arranged on chromosomes and high-quality gene models with alternative splice forms supported by an abundance of whole transcriptome sequencing (RNA-Seq) data. We present here the past, present, and future of Chlamydomonas genomics. Specifically, we detail progress on genome assembly and gene model refinement, discuss resources for gene annotations, functional predictions, and locus ID mapping between versions and, importantly, outline a standardized framework for naming genes. Copyright © 2014 Elsevier Ltd. All rights reserved.
Li, Peipei; Piao, Yongjun; Shon, Ho Sun; Ryu, Keun Ho
2015-10-28
Recently, rapid improvements in technology and decrease in sequencing costs have made RNA-Seq a widely used technique to quantify gene expression levels. Various normalization approaches have been proposed, owing to the importance of normalization in the analysis of RNA-Seq data. A comparison of recently proposed normalization methods is required to generate suitable guidelines for the selection of the most appropriate approach for future experiments. In this paper, we compared eight non-abundance (RC, UQ, Med, TMM, DESeq, Q, RPKM, and ERPKM) and two abundance estimation normalization methods (RSEM and Sailfish). The experiments were based on real Illumina high-throughput RNA-Seq of 35- and 76-nucleotide sequences produced in the MAQC project and simulation reads. Reads were mapped with human genome obtained from UCSC Genome Browser Database. For precise evaluation, we investigated Spearman correlation between the normalization results from RNA-Seq and MAQC qRT-PCR values for 996 genes. Based on this work, we showed that out of the eight non-abundance estimation normalization methods, RC, UQ, Med, TMM, DESeq, and Q gave similar normalization results for all data sets. For RNA-Seq of a 35-nucleotide sequence, RPKM showed the highest correlation results, but for RNA-Seq of a 76-nucleotide sequence, least correlation was observed than the other methods. ERPKM did not improve results than RPKM. Between two abundance estimation normalization methods, for RNA-Seq of a 35-nucleotide sequence, higher correlation was obtained with Sailfish than that with RSEM, which was better than without using abundance estimation methods. However, for RNA-Seq of a 76-nucleotide sequence, the results achieved by RSEM were similar to without applying abundance estimation methods, and were much better than with Sailfish. Furthermore, we found that adding a poly-A tail increased alignment numbers, but did not improve normalization results. Spearman correlation analysis revealed that RC, UQ, Med, TMM, DESeq, and Q did not noticeably improve gene expression normalization, regardless of read length. Other normalization methods were more efficient when alignment accuracy was low; Sailfish with RPKM gave the best normalization results. When alignment accuracy was high, RC was sufficient for gene expression calculation. And we suggest ignoring poly-A tail during differential gene expression analysis.
Shiroguchi, Katsuyuki; Jia, Tony Z.; Sims, Peter A.; Xie, X. Sunney
2012-01-01
RNA sequencing (RNA-Seq) is a powerful tool for transcriptome profiling, but is hampered by sequence-dependent bias and inaccuracy at low copy numbers intrinsic to exponential PCR amplification. We developed a simple strategy for mitigating these complications, allowing truly digital RNA-Seq. Following reverse transcription, a large set of barcode sequences is added in excess, and nearly every cDNA molecule is uniquely labeled by random attachment of barcode sequences to both ends. After PCR, we applied paired-end deep sequencing to read the two barcodes and cDNA sequences. Rather than counting the number of reads, RNA abundance is measured based on the number of unique barcode sequences observed for a given cDNA sequence. We optimized the barcodes to be unambiguously identifiable, even in the presence of multiple sequencing errors. This method allows counting with single-copy resolution despite sequence-dependent bias and PCR-amplification noise, and is analogous to digital PCR but amendable to quantifying a whole transcriptome. We demonstrated transcriptome profiling of Escherichia coli with more accurate and reproducible quantification than conventional RNA-Seq. PMID:22232676
Gene expression profiling of human breast tissue samples using SAGE-Seq.
Wu, Zhenhua Jeremy; Meyer, Clifford A; Choudhury, Sibgat; Shipitsin, Michail; Maruyama, Reo; Bessarabova, Marina; Nikolskaya, Tatiana; Sukumar, Saraswati; Schwartzman, Armin; Liu, Jun S; Polyak, Kornelia; Liu, X Shirley
2010-12-01
We present a powerful application of ultra high-throughput sequencing, SAGE-Seq, for the accurate quantification of normal and neoplastic mammary epithelial cell transcriptomes. We develop data analysis pipelines that allow the mapping of sense and antisense strands of mitochondrial and RefSeq genes, the normalization between libraries, and the identification of differentially expressed genes. We find that the diversity of cancer transcriptomes is significantly higher than that of normal cells. Our analysis indicates that transcript discovery plateaus at 10 million reads/sample, and suggests a minimum desired sequencing depth around five million reads. Comparison of SAGE-Seq and traditional SAGE on normal and cancerous breast tissues reveals higher sensitivity of SAGE-Seq to detect less-abundant genes, including those encoding for known breast cancer-related transcription factors and G protein-coupled receptors (GPCRs). SAGE-Seq is able to identify genes and pathways abnormally activated in breast cancer that traditional SAGE failed to call. SAGE-Seq is a powerful method for the identification of biomarkers and therapeutic targets in human disease.
Veeranagouda, Yaligara; Debono-Lagneaux, Delphine; Fournet, Hamida; Thill, Gilbert; Didier, Michel
2018-01-16
The emergence of clustered regularly interspaced short palindromic repeats-Cas9 (CRISPR-Cas9) gene editing systems has enabled the creation of specific mutants at low cost, in a short time and with high efficiency, in eukaryotic cells. Since a CRISPR-Cas9 system typically creates an array of mutations in targeted sites, a successful gene editing project requires careful selection of edited clones. This process can be very challenging, especially when working with multiallelic genes and/or polyploid cells (such as cancer and plants cells). Here we described a next-generation sequencing method called CRISPR-Cas9 Edited Site Sequencing (CRES-Seq) for the efficient and high-throughput screening of CRISPR-Cas9-edited clones. CRES-Seq facilitates the precise genotyping up to 96 CRISPR-Cas9-edited sites (CRES) in a single MiniSeq (Illumina) run with an approximate sequencing cost of $6/clone. CRES-Seq is particularly useful when multiple genes are simultaneously targeted by CRISPR-Cas9, and also for screening of clones generated from multiallelic genes/polyploid cells. © 2018 by John Wiley & Sons, Inc. Copyright © 2018 John Wiley & Sons, Inc.
SeqHBase: a big data toolset for family based sequencing data analysis.
He, Min; Person, Thomas N; Hebbring, Scott J; Heinzen, Ethan; Ye, Zhan; Schrodi, Steven J; McPherson, Elizabeth W; Lin, Simon M; Peissig, Peggy L; Brilliant, Murray H; O'Rawe, Jason; Robison, Reid J; Lyon, Gholson J; Wang, Kai
2015-04-01
Whole-genome sequencing (WGS) and whole-exome sequencing (WES) technologies are increasingly used to identify disease-contributing mutations in human genomic studies. It can be a significant challenge to process such data, especially when a large family or cohort is sequenced. Our objective was to develop a big data toolset to efficiently manipulate genome-wide variants, functional annotations and coverage, together with conducting family based sequencing data analysis. Hadoop is a framework for reliable, scalable, distributed processing of large data sets using MapReduce programming models. Based on Hadoop and HBase, we developed SeqHBase, a big data-based toolset for analysing family based sequencing data to detect de novo, inherited homozygous, or compound heterozygous mutations that may contribute to disease manifestations. SeqHBase takes as input BAM files (for coverage at every site), variant call format (VCF) files (for variant calls) and functional annotations (for variant prioritisation). We applied SeqHBase to a 5-member nuclear family and a 10-member 3-generation family with WGS data, as well as a 4-member nuclear family with WES data. Analysis times were almost linearly scalable with number of data nodes. With 20 data nodes, SeqHBase took about 5 secs to analyse WES familial data and approximately 1 min to analyse WGS familial data. These results demonstrate SeqHBase's high efficiency and scalability, which is necessary as WGS and WES are rapidly becoming standard methods to study the genetics of familial disorders. Published by the BMJ Publishing Group Limited. For permission to use (where not already granted under a licence) please go to http://group.bmj.com/group/rights-licensing/permissions.
Monitoring Error Rates In Illumina Sequencing.
Manley, Leigh J; Ma, Duanduan; Levine, Stuart S
2016-12-01
Guaranteeing high-quality next-generation sequencing data in a rapidly changing environment is an ongoing challenge. The introduction of the Illumina NextSeq 500 and the depreciation of specific metrics from Illumina's Sequencing Analysis Viewer (SAV; Illumina, San Diego, CA, USA) have made it more difficult to determine directly the baseline error rate of sequencing runs. To improve our ability to measure base quality, we have created an open-source tool to construct the Percent Perfect Reads (PPR) plot, previously provided by the Illumina sequencers. The PPR program is compatible with HiSeq 2000/2500, MiSeq, and NextSeq 500 instruments and provides an alternative to Illumina's quality value (Q) scores for determining run quality. Whereas Q scores are representative of run quality, they are often overestimated and are sourced from different look-up tables for each platform. The PPR's unique capabilities as a cross-instrument comparison device, as a troubleshooting tool, and as a tool for monitoring instrument performance can provide an increase in clarity over SAV metrics that is often crucial for maintaining instrument health. These capabilities are highlighted.
Trapnell, Cole; Roberts, Adam; Goff, Loyal; Pertea, Geo; Kim, Daehwan; Kelley, David R; Pimentel, Harold; Salzberg, Steven L; Rinn, John L; Pachter, Lior
2012-01-01
Recent advances in high-throughput cDNA sequencing (RNA-seq) can reveal new genes and splice variants and quantify expression genome-wide in a single assay. The volume and complexity of data from RNA-seq experiments necessitate scalable, fast and mathematically principled analysis software. TopHat and Cufflinks are free, open-source software tools for gene discovery and comprehensive expression analysis of high-throughput mRNA sequencing (RNA-seq) data. Together, they allow biologists to identify new genes and new splice variants of known ones, as well as compare gene and transcript expression under two or more conditions. This protocol describes in detail how to use TopHat and Cufflinks to perform such analyses. It also covers several accessory tools and utilities that aid in managing data, including CummeRbund, a tool for visualizing RNA-seq analysis results. Although the procedure assumes basic informatics skills, these tools assume little to no background with RNA-seq analysis and are meant for novices and experts alike. The protocol begins with raw sequencing reads and produces a transcriptome assembly, lists of differentially expressed and regulated genes and transcripts, and publication-quality visualizations of analysis results. The protocol's execution time depends on the volume of transcriptome sequencing data and available computing resources but takes less than 1 d of computer time for typical experiments and ~1 h of hands-on time. PMID:22383036
Mendoza-Parra, Marco-Antonio; Saravaki, Vincent; Cholley, Pierre-Etienne; Blum, Matthias; Billoré, Benjamin; Gronemeyer, Hinrich
2016-01-01
We have established a certification system for antibodies to be used in chromatin immunoprecipitation assays coupled to massive parallel sequencing (ChIP-seq). This certification comprises a standardized ChIP procedure and the attribution of a numerical quality control indicator (QCi) to biological replicate experiments. The QCi computation is based on a universally applicable quality assessment that quantitates the global deviation of randomly sampled subsets of ChIP-seq dataset with the original genome-aligned sequence reads. Comparison with a QCi database for >28,000 ChIP-seq assays were used to attribute quality grades (ranging from 'AAA' to 'DDD') to a given dataset. In the present report we used the numerical QC system to assess the factors influencing the quality of ChIP-seq assays, including the nature of the target, the sequencing depth and the commercial source of the antibody. We have used this approach specifically to certify mono and polyclonal antibodies obtained from Active Motif directed against the histone modification marks H3K4me3, H3K27ac and H3K9ac for ChIP-seq. The antibodies received the grades AAA to BBC ( www.ngs-qc.org). We propose to attribute such quantitative grading of all antibodies attributed with the label "ChIP-seq grade".
SNP ID-info: SNP ID searching and visualization platform.
Yang, Cheng-Hong; Chuang, Li-Yeh; Cheng, Yu-Huei; Wen, Cheng-Hao; Chang, Phei-Lang; Chang, Hsueh-Wei
2008-09-01
Many association studies provide the relationship between single nucleotide polymorphisms (SNPs), diseases and cancers, without giving a SNP ID, however. Here, we developed the SNP ID-info freeware to provide the SNP IDs within inputting genetic and physical information of genomes. The program provides an "SNP-ePCR" function to generate the full-sequence using primers and template inputs. In "SNPosition," sequence from SNP-ePCR or direct input is fed to match the SNP IDs from SNP fasta-sequence. In "SNP search" and "SNP fasta" function, information of SNPs within the cytogenetic band, contig position, and keyword input are acceptable. Finally, the SNP ID neighboring environment for inputs is completely visualized in the order of contig position and marked with SNP and flanking hits. The SNP identification problems inherent in NCBI SNP BLAST are also avoided. In conclusion, the SNP ID-info provides a visualized SNP ID environment for multiple inputs and assists systematic SNP association studies. The server and user manual are available at http://bio.kuas.edu.tw/snpid-info.
SeqPig: simple and scalable scripting for large sequencing data sets in Hadoop
Schumacher, André; Pireddu, Luca; Niemenmaa, Matti; Kallio, Aleksi; Korpelainen, Eija; Zanetti, Gianluigi; Heljanko, Keijo
2014-01-01
Summary: Hadoop MapReduce-based approaches have become increasingly popular due to their scalability in processing large sequencing datasets. However, as these methods typically require in-depth expertise in Hadoop and Java, they are still out of reach of many bioinformaticians. To solve this problem, we have created SeqPig, a library and a collection of tools to manipulate, analyze and query sequencing datasets in a scalable and simple manner. SeqPigscripts use the Hadoop-based distributed scripting engine Apache Pig, which automatically parallelizes and distributes data processing tasks. We demonstrate SeqPig’s scalability over many computing nodes and illustrate its use with example scripts. Availability and Implementation: Available under the open source MIT license at http://sourceforge.net/projects/seqpig/ Contact: andre.schumacher@yahoo.com Supplementary information: Supplementary data are available at Bioinformatics online. PMID:24149054
Comparison and quantitative verification of mapping algorithms for whole genome bisulfite sequencing
USDA-ARS?s Scientific Manuscript database
Coupling bisulfite conversion with next-generation sequencing (Bisulfite-seq) enables genome-wide measurement of DNA methylation, but poses unique challenges for mapping. However, despite a proliferation of Bisulfite-seq mapping tools, no systematic comparison of their genomic coverage and quantitat...
The bench scientist's guide to RNA-Seq analysis
USDA-ARS?s Scientific Manuscript database
RNA sequencing (RNA-Seq) is emerging as a highly accurate method to quantify transcript abundance. However, analyses of the large data sets obtained by sequencing the entire transcriptome of organisms have generally been performed by bioinformatic specialists. Here we outline a methods strategy desi...
Kim, Tae Hoon; Dekker, Job
2018-05-01
Owing to its digital nature, ChIP-seq has become the standard method for genome-wide ChIP analysis. Using next-generation sequencing platforms (notably the Illumina Genome Analyzer), millions of short sequence reads can be obtained. The densities of recovered ChIP sequence reads along the genome are used to determine the binding sites of the protein. Although a relatively small amount of ChIP DNA is required for ChIP-seq, the current sequencing platforms still require amplification of the ChIP DNA by ligation-mediated PCR (LM-PCR). This protocol, which involves linker ligation followed by size selection, is the standard ChIP-seq protocol using an Illumina Genome Analyzer. The size-selected ChIP DNA is amplified by LM-PCR and size-selected for the second time. The purified ChIP DNA is then loaded into the Genome Analyzer. The ChIP DNA can also be processed in parallel for ChIP-chip results. © 2018 Cold Spring Harbor Laboratory Press.
Uniform, optimal signal processing of mapped deep-sequencing data.
Kumar, Vibhor; Muratani, Masafumi; Rayan, Nirmala Arul; Kraus, Petra; Lufkin, Thomas; Ng, Huck Hui; Prabhakar, Shyam
2013-07-01
Despite their apparent diversity, many problems in the analysis of high-throughput sequencing data are merely special cases of two general problems, signal detection and signal estimation. Here we adapt formally optimal solutions from signal processing theory to analyze signals of DNA sequence reads mapped to a genome. We describe DFilter, a detection algorithm that identifies regulatory features in ChIP-seq, DNase-seq and FAIRE-seq data more accurately than assay-specific algorithms. We also describe EFilter, an estimation algorithm that accurately predicts mRNA levels from as few as 1-2 histone profiles (R ∼0.9). Notably, the presence of regulatory motifs in promoters correlates more with histone modifications than with mRNA levels, suggesting that histone profiles are more predictive of cis-regulatory mechanisms. We show by applying DFilter and EFilter to embryonic forebrain ChIP-seq data that regulatory protein identification and functional annotation are feasible despite tissue heterogeneity. The mathematical formalism underlying our tools facilitates integrative analysis of data from virtually any sequencing-based functional profile.
Cho, Namjin; Hwang, Byungjin; Yoon, Jung-ki; Park, Sangun; Lee, Joongoo; Seo, Han Na; Lee, Jeewon; Huh, Sunghoon; Chung, Jinsoo; Bang, Duhee
2015-09-21
Interpreting epistatic interactions is crucial for understanding evolutionary dynamics of complex genetic systems and unveiling structure and function of genetic pathways. Although high resolution mapping of en masse variant libraries renders molecular biologists to address genotype-phenotype relationships, long-read sequencing technology remains indispensable to assess functional relationship between mutations that lie far apart. Here, we introduce JigsawSeq for multiplexed sequence identification of pooled gene variant libraries by combining a codon-based molecular barcoding strategy and de novo assembly of short-read data. We first validate JigsawSeq on small sub-pools and observed high precision and recall at various experimental settings. With extensive simulations, we then apply JigsawSeq to large-scale gene variant libraries to show that our method can be reliably scaled using next-generation sequencing. JigsawSeq may serve as a rapid screening tool for functional genomics and offer the opportunity to explore evolutionary trajectories of protein variants.
Selective amplification and sequencing of cyclic phosphate-containing RNAs by the cP-RNA-seq method.
Honda, Shozo; Morichika, Keisuke; Kirino, Yohei
2016-03-01
RNA digestions catalyzed by many ribonucleases generate RNA fragments that contain a 2',3'-cyclic phosphate (cP) at their 3' termini. However, standard RNA-seq methods are unable to accurately capture cP-containing RNAs because the cP inhibits the adapter ligation reaction. We recently developed a method named cP-RNA-seq that is able to selectively amplify and sequence cP-containing RNAs. Here we describe the cP-RNA-seq protocol in which the 3' termini of all RNAs, except those containing a cP, are cleaved through a periodate treatment after phosphatase treatment; hence, subsequent adapter ligation and cDNA amplification steps are exclusively applied to cP-containing RNAs. cP-RNA-seq takes ∼6 d, excluding the time required for sequencing and bioinformatics analyses, which are not covered in detail in this protocol. Biochemical validation of the existence of cP in the identified RNAs takes ∼3 d. Even though the cP-RNA-seq method was developed to identify angiogenin-generating 5'-tRNA halves as a proof of principle, the method should be applicable to global identification of cP-containing RNA repertoires in various transcriptomes.
Streaming fragment assignment for real-time analysis of sequencing experiments
Roberts, Adam; Pachter, Lior
2013-01-01
We present eXpress, a software package for highly efficient probabilistic assignment of ambiguously mapping sequenced fragments. eXpress uses a streaming algorithm with linear run time and constant memory use. It can determine abundances of sequenced molecules in real time, and can be applied to ChIP-seq, metagenomics and other large-scale sequencing data. We demonstrate its use on RNA-seq data, showing greater efficiency than other quantification methods. PMID:23160280
Protein Interaction Profile Sequencing (PIP-seq).
Foley, Shawn W; Gregory, Brian D
2016-10-10
Every eukaryotic RNA transcript undergoes extensive post-transcriptional processing from the moment of transcription up through degradation. This regulation is performed by a distinct cohort of RNA-binding proteins which recognize their target transcript by both its primary sequence and secondary structure. Here, we describe protein interaction profile sequencing (PIP-seq), a technique that uses ribonuclease-based footprinting followed by high-throughput sequencing to globally assess both protein-bound RNA sequences and RNA secondary structure. PIP-seq utilizes single- and double-stranded RNA-specific nucleases in the absence of proteins to infer RNA secondary structure. These libraries are also compared to samples that undergo nuclease digestion in the presence of proteins in order to find enriched protein-bound sequences. Combined, these four libraries provide a comprehensive, transcriptome-wide view of RNA secondary structure and RNA protein interaction sites from a single experimental technique. © 2016 by John Wiley & Sons, Inc. Copyright © 2016 John Wiley & Sons, Inc.
Quantum-Sequencing: Biophysics of quantum tunneling through nucleic acids
NASA Astrophysics Data System (ADS)
Casamada Ribot, Josep; Chatterjee, Anushree; Nagpal, Prashant
2014-03-01
Tunneling microscopy and spectroscopy has extensively been used in physical surface sciences to study quantum tunneling to measure electronic local density of states of nanomaterials and to characterize adsorbed species. Quantum-Sequencing (Q-Seq) is a new method based on tunneling microscopy for electronic sequencing of single molecule of nucleic acids. A major goal of third-generation sequencing technologies is to develop a fast, reliable, enzyme-free single-molecule sequencing method. Here, we present the unique ``electronic fingerprints'' for all nucleotides on DNA and RNA using Q-Seq along their intrinsic biophysical parameters. We have analyzed tunneling spectra for the nucleotides at different pH conditions and analyzed the HOMO, LUMO and energy gap for all of them. In addition we show a number of biophysical parameters to further characterize all nucleobases (electron and hole transition voltage and energy barriers). These results highlight the robustness of Q-Seq as a technique for next-generation sequencing.
Azab, Marwa Mohamed; Fayyad, Dalia Mukhtar
2018-01-01
The use of high throughput next generation technologies has allowed more comprehensive analysis than traditional Sanger sequencing. The specific aim of this study was to investigate the microbial diversity of primary endodontic infections using Illumina MiSeq sequencing platform in Egyptian patients. Samples were collected from 19 patients in Suez Canal University Hospital (Endodontic Department) using sterile # 15K file and paper points. DNA was extracted using Mo Bio power soil DNA isolation extraction kit followed by PCR amplification and agarose gel electrophoresis. The microbiome was characterized on the basis of the V3 and V4 hypervariable region of the 16S rRNA gene by using paired-end sequencing on Illumina MiSeq device. MOTHUR software was used in sequence filtration and analysis of sequenced data. A total of 1858 operational taxonomic units at 97% similarity were assigned to 26 phyla, 245 families, and 705 genera. Four main phyla Firmicutes, Bacteroidetes, Proteobacteria, and Synergistetes were predominant in all samples. At genus level, Prevotella, Bacillus, Porphyromonas, Streptococcus, and Bacteroides were the most abundant. Illumina MiSeq platform sequencing can be used to investigate oral microbiome composition of endodontic infections. Elucidating the ecology of endodontic infections is a necessary step in developing effective intracanal antimicrobials. PMID:29849646
Hong, Yoonki; Kim, Woo Jin; Bang, Chi Young; Lee, Jae Cheol; Oh, Yeon-Mok
2016-04-01
Lung cancer is the most common cause of cancer related death. Alterations in gene sequence, structure, and expression have an important role in the pathogenesis of lung cancer. Fusion genes and alternative splicing of cancer-related genes have the potential to be oncogenic. In the current study, we performed RNA-sequencing (RNA-seq) to investigate potential fusion genes and alternative splicing in non-small cell lung cancer. RNA was isolated from lung tissues obtained from 86 subjects with lung cancer. The RNA samples from lung cancer and normal tissues were processed with RNA-seq using the HiSeq 2000 system. Fusion genes were evaluated using Defuse and ChimeraScan. Candidate fusion transcripts were validated by Sanger sequencing. Alternative splicing was analyzed using multivariate analysis of transcript sequencing and validated using quantitative real time polymerase chain reaction. RNA-seq data identified oncogenic fusion genes EML4-ALK and SLC34A2-ROS1 in three of 86 normal-cancer paired samples. Nine distinct fusion transcripts were selected using DeFuse and ChimeraScan; of which, four fusion transcripts were validated by Sanger sequencing. In 33 squamous cell carcinoma, 29 tumor specific skipped exon events and six mutually exclusive exon events were identified. ITGB4 and PYCR1 were top genes that showed significant tumor specific splice variants. In conclusion, RNA-seq data identified novel potential fusion transcripts and splice variants. Further evaluation of their functional significance in the pathogenesis of lung cancer is required.
Chromatin-associated RNA sequencing (ChAR-seq) maps genome-wide RNA-to-DNA contacts
Jukam, David; Teran, Nicole A; Risca, Viviana I; Smith, Owen K; Johnson, Whitney L; Skotheim, Jan M; Greenleaf, William James
2018-01-01
RNA is a critical component of chromatin in eukaryotes, both as a product of transcription, and as an essential constituent of ribonucleoprotein complexes that regulate both local and global chromatin states. Here, we present a proximity ligation and sequencing method called Chromatin-Associated RNA sequencing (ChAR-seq) that maps all RNA-to-DNA contacts across the genome. Using Drosophila cells, we show that ChAR-seq provides unbiased, de novo identification of targets of chromatin-bound RNAs including nascent transcripts, chromosome-specific dosage compensation ncRNAs, and genome-wide trans-associated RNAs involved in co-transcriptional RNA processing. PMID:29648534
Separation and parallel sequencing of the genomes and transcriptomes of single cells using G&T-seq.
Macaulay, Iain C; Teng, Mabel J; Haerty, Wilfried; Kumar, Parveen; Ponting, Chris P; Voet, Thierry
2016-11-01
Parallel sequencing of a single cell's genome and transcriptome provides a powerful tool for dissecting genetic variation and its relationship with gene expression. Here we present a detailed protocol for G&T-seq, a method for separation and parallel sequencing of genomic DNA and full-length polyA(+) mRNA from single cells. We provide step-by-step instructions for the isolation and lysis of single cells; the physical separation of polyA(+) mRNA from genomic DNA using a modified oligo-dT bead capture and the respective whole-transcriptome and whole-genome amplifications; and library preparation and sequence analyses of these amplification products. The method allows the detection of thousands of transcripts in parallel with the genetic variants captured by the DNA-seq data from the same single cell. G&T-seq differs from other currently available methods for parallel DNA and RNA sequencing from single cells, as it involves physical separation of the DNA and RNA and does not require bespoke microfluidics platforms. The process can be implemented manually or through automation. When performed manually, paired genome and transcriptome sequencing libraries from eight single cells can be produced in ∼3 d by researchers experienced in molecular laboratory work. For users with experience in the programming and operation of liquid-handling robots, paired DNA and RNA libraries from 96 single cells can be produced in the same time frame. Sequence analysis and integration of single-cell G&T-seq DNA and RNA data requires a high level of bioinformatics expertise and familiarity with a wide range of informatics tools.
Evaluation of ribosomal RNA removal protocols for Salmonella RNA-Seq projects
USDA-ARS?s Scientific Manuscript database
Next generation sequencing is a powerful technology and its application to sequencing entire RNA populations of food-borne pathogens will provide valuable insights. A problem unique to prokaryotic RNA-Seq is the massive abundance of ribosomal RNA. Unlike eukaryotic messenger RNA (mRNA), bacterial ...
RNA-Seq Atlas of Glycine max: a guide to the soybean transcriptome
USDA-ARS?s Scientific Manuscript database
A first analysis of the Glycine max (L.) Merr. (soybean) transcriptome using next generation sequencing technology and RNA-Sequencing (RNA-Seq) is presented. This analysis will provide an important resource for understanding transcription and gene co-regulatory networks in soybean, the most economic...
Genome-wide profiling of DNA-binding proteins using barcode-based multiplex Solexa sequencing.
Raghav, Sunil Kumar; Deplancke, Bart
2012-01-01
Chromatin immunoprecipitation (ChIP) is a commonly used technique to detect the in vivo binding of proteins to DNA. ChIP is now routinely paired to microarray analysis (ChIP-chip) or next-generation sequencing (ChIP-Seq) to profile the DNA occupancy of proteins of interest on a genome-wide level. Because ChIP-chip introduces several biases, most notably due to the use of a fixed number of probes, ChIP-Seq has quickly become the method of choice as, depending on the sequencing depth, it is more sensitive, quantitative, and provides a greater binding site location resolution. With the ever increasing number of reads that can be generated per sequencing run, it has now become possible to analyze several samples simultaneously while maintaining sufficient sequence coverage, thus significantly reducing the cost per ChIP-Seq experiment. In this chapter, we provide a step-by-step guide on how to perform multiplexed ChIP-Seq analyses. As a proof-of-concept, we focus on the genome-wide profiling of RNA Polymerase II as measuring its DNA occupancy at different stages of any biological process can provide insights into the gene regulatory mechanisms involved. However, the protocol can also be used to perform multiplexed ChIP-Seq analyses of other DNA-binding proteins such as chromatin modifiers and transcription factors.
Zhang, Qi; Zeng, Xin; Younkin, Sam; Kawli, Trupti; Snyder, Michael P; Keleş, Sündüz
2016-02-24
Chromatin immunoprecipitation followed by sequencing (ChIP-seq) experiments revolutionized genome-wide profiling of transcription factors and histone modifications. Although maturing sequencing technologies allow these experiments to be carried out with short (36-50 bps), long (75-100 bps), single-end, or paired-end reads, the impact of these read parameters on the downstream data analysis are not well understood. In this paper, we evaluate the effects of different read parameters on genome sequence alignment, coverage of different classes of genomic features, peak identification, and allele-specific binding detection. We generated 101 bps paired-end ChIP-seq data for many transcription factors from human GM12878 and MCF7 cell lines. Systematic evaluations using in silico variations of these data as well as fully simulated data, revealed complex interplay between the sequencing parameters and analysis tools, and indicated clear advantages of paired-end designs in several aspects such as alignment accuracy, peak resolution, and most notably, allele-specific binding detection. Our work elucidates the effect of design on the downstream analysis and provides insights to investigators in deciding sequencing parameters in ChIP-seq experiments. We present the first systematic evaluation of the impact of ChIP-seq designs on allele-specific binding detection and highlights the power of pair-end designs in such studies.
Cloud, Joann L; Conville, Patricia S; Croft, Ann; Harmsen, Dag; Witebsky, Frank G; Carroll, Karen C
2004-02-01
Identification of clinically significant nocardiae to the species level is important in patient diagnosis and treatment. A study was performed to evaluate Nocardia species identification obtained by partial 16S ribosomal DNA (rDNA) sequencing by the MicroSeq 500 system with an expanded database. The expanded portion of the database was developed from partial 5' 16S rDNA sequences derived from 28 reference strains (from the American Type Culture Collection and the Japanese Collection of Microorganisms). The expanded MicroSeq 500 system was compared to (i). conventional identification obtained from a combination of growth characteristics with biochemical and drug susceptibility tests; (ii). molecular techniques involving restriction enzyme analysis (REA) of portions of the 16S rRNA and 65-kDa heat shock protein genes; and (iii). when necessary, sequencing of a 999-bp fragment of the 16S rRNA gene. An unknown isolate was identified as a particular species if the sequence obtained by partial 16S rDNA sequencing by the expanded MicroSeq 500 system was 99.0% similar to that of the reference strain. Ninety-four nocardiae representing 10 separate species were isolated from patient specimens and examined by using the three different methods. Sequencing of partial 16S rDNA by the expanded MicroSeq 500 system resulted in only 72% agreement with conventional methods for species identification and 90% agreement with the alternative molecular methods. Molecular methods for identification of Nocardia species provide more accurate and rapid results than the conventional methods using biochemical and susceptibility testing. With an expanded database, the MicroSeq 500 system for partial 16S rDNA was able to correctly identify the human pathogens N. brasiliensis, N. cyriacigeorgica, N. farcinica, N. nova, N. otitidiscaviarum, and N. veterana.
2013-09-01
sequence dataset. All procedures were performed by personnel in the IIMT UT Southwestern Genomics and Microarray Core using standard protocols. More... sequencing run, samples were demultiplexed using standard algorithms in the Genomics and Microarray Core and processed into individual sample Illumina single... Sequencing (RNA-Seq), using Illumina’s multiplexing mRNA-Seq to generate full sequence libraries from the poly-A tailed RNA to a read depth of 30
Isolated polypeptide having arabinofuranosidase activity
Foreman, Pamela; Van Solingen, Pieter; Goedegebuur, Frits; Ward, Michael
2010-02-23
Described herein are novel gene sequences isolated from Trichoderma reesei. Two genes encoding proteins comprising a cellulose binding domain, one encoding an arabionfuranosidase and one encoding an acetylxylanesterase are described. The sequences, CIP1 and CIP2, contain a cellulose binding domain. These proteins are especially useful in the textile and detergent industry and in pulp and paper industry. TABLE-US-00001 cip1 cDNA sequence (SEQ ID NO: 1) GACTAGTTCA TAATACAGTA GTTGAGTTCA TAGCAACTTC 50 ACTCTCTAGC TGAACAAATT ATCTGCGCAA ACATGGTTCG CCGGACTGCT 100 CTGCTGGCCC TTGGGGCTCT CTCAACGCTC TCTATGGCCC AAATCTCAGA 150 CGACTTCGAG TCGGGCTGGG ATCAGACTAA ATGGCCCATT TCGGCACCAG 200 ACTGTAACCA GGGCGGCACC GTCAGCCTCG ACACCACAGT AGCCCACAGC 250 GGCAGCAACT CCATGAAGGT CGTTGGTGGC CCCAATGGCT ACTGTGGACA 300 CATCTTCTTC GGCACTACCC AGGTGCCAAC TGGGGATGTA TATGTCAGAG 350 CTTGGATTCG GCTTCAGACT GCTCTCGGCA GCAACCACGT CACATTCATC 400 ATCATGCCAG ACACCGCTCA GGGAGGGAAG CACCTCCGAA TTGGTGGCCA 450 AAGCCAAGTT CTCGACTACA ACCGCGAGTC CGACGATGCC ACTCTTCCGG 500 ACCTGTCTCC CAACGGCATT GCCTCCACCG TCACTCTGCC TACCGGCGCG 550 TTCCAGTGCT TCGAGTACCA CCTGGGCACT GACGGAACCA TCGAGACGTG 600 GCTCAACGGC AGCCTCATCC CGGGCATGAC CGTGGGCCCT GGCGTCGACA 650 ATCCAAACGA CGCTGGCTGG ACGAGGGCCA GCTATATTCC GGAGATCACC 700 GGTGTCAACT TTGGCTGGGA GGCCTACAGC GGAGACGTCA ACACCGTCTG 750 GTTCGACGAC ATCTCGATTG CGTCGACCCG CGTGGGATGC GGCCCCGGCA 800 GCCCCGGCGG TCCTGGAAGC TCGACGACTG GGCGTAGCAG CACCTCGGGC 850 CCGACGAGCA CTTCGAGGCC AAGCACCACC ATTCCGCCAC CGACTTCCAG 900 GACAACGACC GCCACGGGTC CGACTCAGAC ACACTATGGC CAGTGCGGAG 1000 GGATTGGTTA CAGCGGGCCT ACGGTCTGCG CGAGCGGCAC GACCTGCCAG 1050 GTCCTGAACC CATACTACTC CCAGTGCTTA TAAGGGGATG AGCATGGAGT 1100 GAAGTGAAGT GAAGTGGAGA GAGTTGAAGT GGCATTGCGC TCGGCTGGGT 1150 AGATAAAAGT CAGCAGCTAT GAATACTCTA TGTGATGCTC ATTGGCGTGT 1200 ACGTTTTAAA AAAAAAAAAA AAAAAAAAAA AAAAAAAAAA AAAAAAAAAA 1250 AAAAAAAAAA AAAAAAAAAG GGGGCGGCCG C 1271
High-throughput illumina strand-specific RNA sequencing library preparation
USDA-ARS?s Scientific Manuscript database
Conventional Illumina RNA-Seq does not have the resolution to decode the complex eukaryote transcriptome due to the lack of RNA polarity information. Strand-specific RNA sequencing (ssRNA-Seq) can overcome these limitations and as such is better suited for genome annotation, de novo transcriptome as...
Pott, Sebastian
2017-01-01
Gaining insights into the regulatory mechanisms that underlie the transcriptional variation observed between individual cells necessitates the development of methods that measure chromatin organization in single cells. Here I adapted Nucleosome Occupancy and Methylome-sequencing (NOMe-seq) to measure chromatin accessibility and endogenous DNA methylation in single cells (scNOMe-seq). scNOMe-seq recovered characteristic accessibility and DNA methylation patterns at DNase hypersensitive sites (DHSs). An advantage of scNOMe-seq is that sequencing reads are sampled independently of the accessibility measurement. scNOMe-seq therefore controlled for fragment loss, which enabled direct estimation of the fraction of accessible DHSs within individual cells. In addition, scNOMe-seq provided high resolution of chromatin accessibility within individual loci which was exploited to detect footprints of CTCF binding events and to estimate the average nucleosome phasing distances in single cells. scNOMe-seq is therefore well-suited to characterize the chromatin organization of single cells in heterogeneous cellular mixtures. DOI: http://dx.doi.org/10.7554/eLife.23203.001 PMID:28653622
ChIP-chip versus ChIP-seq: Lessons for experimental design and data analysis
2011-01-01
Background Chromatin immunoprecipitation (ChIP) followed by microarray hybridization (ChIP-chip) or high-throughput sequencing (ChIP-seq) allows genome-wide discovery of protein-DNA interactions such as transcription factor bindings and histone modifications. Previous reports only compared a small number of profiles, and little has been done to compare histone modification profiles generated by the two technologies or to assess the impact of input DNA libraries in ChIP-seq analysis. Here, we performed a systematic analysis of a modENCODE dataset consisting of 31 pairs of ChIP-chip/ChIP-seq profiles of the coactivator CBP, RNA polymerase II (RNA PolII), and six histone modifications across four developmental stages of Drosophila melanogaster. Results Both technologies produce highly reproducible profiles within each platform, ChIP-seq generally produces profiles with a better signal-to-noise ratio, and allows detection of more peaks and narrower peaks. The set of peaks identified by the two technologies can be significantly different, but the extent to which they differ varies depending on the factor and the analysis algorithm. Importantly, we found that there is a significant variation among multiple sequencing profiles of input DNA libraries and that this variation most likely arises from both differences in experimental condition and sequencing depth. We further show that using an inappropriate input DNA profile can impact the average signal profiles around genomic features and peak calling results, highlighting the importance of having high quality input DNA data for normalization in ChIP-seq analysis. Conclusions Our findings highlight the biases present in each of the platforms, show the variability that can arise from both technology and analysis methods, and emphasize the importance of obtaining high quality and deeply sequenced input DNA libraries for ChIP-seq analysis. PMID:21356108
SNP discovery in the bovine milk transcriptome using RNA-Seq technology.
Cánovas, Angela; Rincon, Gonzalo; Islas-Trejo, Alma; Wickramasinghe, Saumya; Medrano, Juan F
2010-12-01
High-throughput sequencing of RNA (RNA-Seq) was developed primarily to analyze global gene expression in different tissues. However, it also is an efficient way to discover coding SNPs. The objective of this study was to perform a SNP discovery analysis in the milk transcriptome using RNA-Seq. Seven milk samples from Holstein cows were analyzed by sequencing cDNAs using the Illumina Genome Analyzer system. We detected 19,175 genes expressed in milk samples corresponding to approximately 70% of the total number of genes analyzed. The SNP detection analysis revealed 100,734 SNPs in Holstein samples, and a large number of those corresponded to differences between the Holstein breed and the Hereford bovine genome assembly Btau4.0. The number of polymorphic SNPs within Holstein cows was 33,045. The accuracy of RNA-Seq SNP discovery was tested by comparing SNPs detected in a set of 42 candidate genes expressed in milk that had been resequenced earlier using Sanger sequencing technology. Seventy of 86 SNPs were detected using both RNA-Seq and Sanger sequencing technologies. The KASPar Genotyping System was used to validate unique SNPs found by RNA-Seq but not observed by Sanger technology. Our results confirm that analyzing the transcriptome using RNA-Seq technology is an efficient and cost-effective method to identify SNPs in transcribed regions. This study creates guidelines to maximize the accuracy of SNP discovery and prevention of false-positive SNP detection, and provides more than 33,000 SNPs located in coding regions of genes expressed during lactation that can be used to develop genotyping platforms to perform marker-trait association studies in Holstein cattle.
He, W; Zhao, S; Liu, X; Dong, S; Lv, J; Liu, D; Wang, J; Meng, Z
2013-12-04
Large-scale next-generation sequencing (NGS)-based resequencing detects sequence variations, constructs evolutionary histories, and identifies phenotype-related genotypes. However, NGS-based resequencing studies generate extraordinarily large amounts of data, making computations difficult. Effective use and analysis of these data for NGS-based resequencing studies remains a difficult task for individual researchers. Here, we introduce ReSeqTools, a full-featured toolkit for NGS (Illumina sequencing)-based resequencing analysis, which processes raw data, interprets mapping results, and identifies and annotates sequence variations. ReSeqTools provides abundant scalable functions for routine resequencing analysis in different modules to facilitate customization of the analysis pipeline. ReSeqTools is designed to use compressed data files as input or output to save storage space and facilitates faster and more computationally efficient large-scale resequencing studies in a user-friendly manner. It offers abundant practical functions and generates useful statistics during the analysis pipeline, which significantly simplifies resequencing analysis. Its integrated algorithms and abundant sub-functions provide a solid foundation for special demands in resequencing projects. Users can combine these functions to construct their own pipelines for other purposes.
Early Detection of NSCLC Using Stromal Markers in Peripheral Blood
2016-09-01
circulating myeloid cells, flow cytometry, RNA -sequencing, expression profiling. 3. ACCOMPLISHMENTS: What were the major goals of the project...Subtask 2: Flow cytometry sorting of circulating myeloid cells. Subtask 3: RNA -Sequencing Subtask 4: RNA -seq data analysis Subtask 5: Feasible RT-PCR...accomplished the patient recruitment, flow cytometry sorting of circulating myeloid cells, RNA -sequencing of the samples. During the RNA - seq data analysis, we
Stroma Based Prognosticators Incorporating Differences between African and European Americans
2017-10-01
amenable to bisulfite sequencing of more than a few genes. Exploiting the recent three-fold reduction in the cost of sequencing per read , we developed oligo...cards. The ability of the HiSeq 4000 to obtain about three times as many reads as the HiSeq2500, at the same price, means we can stay on track, though...capture, and sequencing (Table 2). We obtain tens of millions of mapped deduplicated reads per sample, while using only 5% of a sequencing lane per sample
Roy, Somak; Durso, Mary Beth; Wald, Abigail; Nikiforov, Yuri E; Nikiforova, Marina N
2014-01-01
A wide repertoire of bioinformatics applications exist for next-generation sequencing data analysis; however, certain requirements of the clinical molecular laboratory limit their use: i) comprehensive report generation, ii) compatibility with existing laboratory information systems and computer operating system, iii) knowledgebase development, iv) quality management, and v) data security. SeqReporter is a web-based application developed using ASP.NET framework version 4.0. The client-side was designed using HTML5, CSS3, and Javascript. The server-side processing (VB.NET) relied on interaction with a customized SQL server 2008 R2 database. Overall, 104 cases (1062 variant calls) were analyzed by SeqReporter. Each variant call was classified into one of five report levels: i) known clinical significance, ii) uncertain clinical significance, iii) pending pathologists' review, iv) synonymous and deep intronic, and v) platform and panel-specific sequence errors. SeqReporter correctly annotated and classified 99.9% (859 of 860) of sequence variants, including 68.7% synonymous single-nucleotide variants, 28.3% nonsynonymous single-nucleotide variants, 1.7% insertions, and 1.3% deletions. One variant of potential clinical significance was re-classified after pathologist review. Laboratory information system-compatible clinical reports were generated automatically. SeqReporter also facilitated quality management activities. SeqReporter is an example of a customized and well-designed informatics solution to optimize and automate the downstream analysis of clinical next-generation sequencing data. We propose it as a model that may envisage the development of a comprehensive clinical informatics solution. Copyright © 2014 American Society for Investigative Pathology and the Association for Molecular Pathology. Published by Elsevier Inc. All rights reserved.
Processing and population genetic analysis of multigenic datasets with ProSeq3 software.
Filatov, Dmitry A
2009-12-01
The current tendency in molecular population genetics is to use increasing numbers of genes in the analysis. Here I describe a program for handling and population genetic analysis of DNA polymorphism data collected from multiple genes. The program includes a sequence/alignment editor and an internal relational database that simplify the preparation and manipulation of multigenic DNA polymorphism datasets. The most commonly used DNA polymorphism analyses are implemented in ProSeq3, facilitating population genetic analysis of large multigenic datasets. Extensive input/output options make ProSeq3 a convenient hub for sequence data processing and analysis. The program is available free of charge from http://dps.plants.ox.ac.uk/sequencing/proseq.htm.
David, Fabrice P A; Delafontaine, Julien; Carat, Solenne; Ross, Frederick J; Lefebvre, Gregory; Jarosz, Yohan; Sinclair, Lucas; Noordermeer, Daan; Rougemont, Jacques; Leleu, Marion
2014-01-01
The HTSstation analysis portal is a suite of simple web forms coupled to modular analysis pipelines for various applications of High-Throughput Sequencing including ChIP-seq, RNA-seq, 4C-seq and re-sequencing. HTSstation offers biologists the possibility to rapidly investigate their HTS data using an intuitive web application with heuristically pre-defined parameters. A number of open-source software components have been implemented and can be used to build, configure and run HTS analysis pipelines reactively. Besides, our programming framework empowers developers with the possibility to design their own workflows and integrate additional third-party software. The HTSstation web application is accessible at http://htsstation.epfl.ch.
HTSstation: A Web Application and Open-Access Libraries for High-Throughput Sequencing Data Analysis
David, Fabrice P. A.; Delafontaine, Julien; Carat, Solenne; Ross, Frederick J.; Lefebvre, Gregory; Jarosz, Yohan; Sinclair, Lucas; Noordermeer, Daan; Rougemont, Jacques; Leleu, Marion
2014-01-01
The HTSstation analysis portal is a suite of simple web forms coupled to modular analysis pipelines for various applications of High-Throughput Sequencing including ChIP-seq, RNA-seq, 4C-seq and re-sequencing. HTSstation offers biologists the possibility to rapidly investigate their HTS data using an intuitive web application with heuristically pre-defined parameters. A number of open-source software components have been implemented and can be used to build, configure and run HTS analysis pipelines reactively. Besides, our programming framework empowers developers with the possibility to design their own workflows and integrate additional third-party software. The HTSstation web application is accessible at http://htsstation.epfl.ch. PMID:24475057
Delaney, Nigel F.; Marx, Christopher J.
2012-01-01
Understanding evolutionary dynamics within microbial populations requires the ability to accurately follow allele frequencies through time. Here we present a rapid, cost-effective method (FREQ-Seq) that leverages Illumina next-generation sequencing for localized, quantitative allele frequency detection. Analogous to RNA-Seq, FREQ-Seq relies upon counts from the >105 reads generated per locus per time-point to determine allele frequencies. Loci of interest are directly amplified from a mixed population via two rounds of PCR using inexpensive, user-designed oligonucleotides and a bar-coded bridging primer system that can be regenerated in-house. The resulting bar-coded PCR products contain the adapters needed for Illumina sequencing, eliminating further library preparation. We demonstrate the utility of FREQ-Seq by determining the order and dynamics of beneficial alleles that arose as a microbial population, founded with an engineered strain of Methylobacterium, evolved to grow on methanol. Quantifying allele frequencies with minimal bias down to 1% abundance allowed effective analysis of SNPs, small in-dels and insertions of transposable elements. Our data reveal large-scale clonal interference during the early stages of adaptation and illustrate the utility of FREQ-Seq as a cost-effective tool for tracking allele frequencies in populations. PMID:23118913
High-throughput detection of RNA processing in bacteria.
Gill, Erin E; Chan, Luisa S; Winsor, Geoffrey L; Dobson, Neil; Lo, Raymond; Ho Sui, Shannan J; Dhillon, Bhavjinder K; Taylor, Patrick K; Shrestha, Raunak; Spencer, Cory; Hancock, Robert E W; Unrau, Peter J; Brinkman, Fiona S L
2018-03-27
Understanding the RNA processing of an organism's transcriptome is an essential but challenging step in understanding its biology. Here we investigate with unprecedented detail the transcriptome of Pseudomonas aeruginosa PAO1, a medically important and innately multi-drug resistant bacterium. We systematically mapped RNA cleavage and dephosphorylation sites that result in 5'-monophosphate terminated RNA (pRNA) using monophosphate RNA-Seq (pRNA-Seq). Transcriptional start sites (TSS) were also mapped using differential RNA-Seq (dRNA-Seq) and both datasets were compared to conventional RNA-Seq performed in a variety of growth conditions. The pRNA-Seq library revealed known tRNA, rRNA and transfer-messenger RNA (tmRNA) processing sites, together with previously uncharacterized RNA cleavage events that were found disproportionately near the 5' ends of transcripts associated with basic bacterial functions such as oxidative phosphorylation and purine metabolism. The majority (97%) of the processed mRNAs were cleaved at precise codon positions within defined sequence motifs indicative of distinct endonucleolytic activities. The most abundant of these motifs corresponded closely to an E. coli RNase E site previously established in vitro. Using the dRNA-Seq library, we performed an operon analysis and predicted 3159 potential TSS. A correlation analysis uncovered 105 antiparallel pairs of TSS that were separated by 18 bp from each other and were centered on single palindromic TAT(A/T)ATA motifs (likely - 10 promoter elements), suggesting that, consistent with previous in vitro experimentation, these sites can initiate transcription bi-directionally and may thus provide a novel form of transcriptional regulation. TSS and RNA-Seq analysis allowed us to confirm expression of small non-coding RNAs (ncRNAs), many of which are differentially expressed in swarming and biofilm formation conditions. This study uses pRNA-Seq, a method that provides a genome-wide survey of RNA processing, to study the bacterium Pseudomonas aeruginosa and discover extensive transcript processing not previously appreciated. We have also gained novel insight into RNA maturation and turnover as well as a potential novel form of transcription regulation. NOTE: All sequence data has been submitted to the NCBI sequence read archive. Accession numbers are as follows: [NCBI sequence read archive: SRX156386, SRX157659, SRX157660, SRX157661, SRX157683 and SRX158075]. The sequence data is viewable using Jbrowse on www.pseudomonas.com .
Langevin, Stanley A; Bent, Zachary W; Solberg, Owen D; Curtis, Deanna J; Lane, Pamela D; Williams, Kelly P; Schoeniger, Joseph S; Sinha, Anupama; Lane, Todd W; Branda, Steven S
2013-04-01
Use of second generation sequencing (SGS) technologies for transcriptional profiling (RNA-Seq) has revolutionized transcriptomics, enabling measurement of RNA abundances with unprecedented specificity and sensitivity and the discovery of novel RNA species. Preparation of RNA-Seq libraries requires conversion of the RNA starting material into cDNA flanked by platform-specific adaptor sequences. Each of the published methods and commercial kits currently available for RNA-Seq library preparation suffers from at least one major drawback, including long processing times, large starting material requirements, uneven coverage, loss of strand information and high cost. We report the development of a new RNA-Seq library preparation technique that produces representative, strand-specific RNA-Seq libraries from small amounts of starting material in a fast, simple and cost-effective manner. Additionally, we have developed a new quantitative PCR-based assay for precisely determining the number of PCR cycles to perform for optimal enrichment of the final library, a key step in all SGS library preparation workflows.
Prakash, Celine; Haeseler, Arndt Von
2017-03-01
RNA sequencing (RNA-seq) has emerged as the method of choice for measuring the expression of RNAs in a given cell population. In most RNA-seq technologies, sequencing the full length of RNA molecules requires fragmentation into smaller pieces. Unfortunately, the issue of nonuniform sequencing coverage across a genomic feature has been a concern in RNA-seq and is attributed to biases for certain fragments in RNA-seq library preparation and sequencing. To investigate the expected coverage obtained from fragmentation, we develop a simple fragmentation model that is independent of bias from the experimental method and is not specific to the transcript sequence. Essentially, we enumerate all configurations for maximal placement of a given fragment length, F, on transcript length, T, to represent every possible fragmentation pattern, from which we compute the expected coverage profile across a transcript. We extend this model to incorporate general empirical attributes such as read length, fragment length distribution, and number of molecules of the transcript. We further introduce the fragment starting-point, fragment coverage, and read coverage profiles. We find that the expected profiles are not uniform and that factors such as fragment length to transcript length ratio, read length to fragment length ratio, fragment length distribution, and number of molecules influence the variability of coverage across a transcript. Finally, we explore a potential application of the model where, with simulations, we show that it is possible to correctly estimate the transcript copy number for any transcript in the RNA-seq experiment.
Haeseler, Arndt Von
2017-01-01
Abstract RNA sequencing (RNA-seq) has emerged as the method of choice for measuring the expression of RNAs in a given cell population. In most RNA-seq technologies, sequencing the full length of RNA molecules requires fragmentation into smaller pieces. Unfortunately, the issue of nonuniform sequencing coverage across a genomic feature has been a concern in RNA-seq and is attributed to biases for certain fragments in RNA-seq library preparation and sequencing. To investigate the expected coverage obtained from fragmentation, we develop a simple fragmentation model that is independent of bias from the experimental method and is not specific to the transcript sequence. Essentially, we enumerate all configurations for maximal placement of a given fragment length, F, on transcript length, T, to represent every possible fragmentation pattern, from which we compute the expected coverage profile across a transcript. We extend this model to incorporate general empirical attributes such as read length, fragment length distribution, and number of molecules of the transcript. We further introduce the fragment starting-point, fragment coverage, and read coverage profiles. We find that the expected profiles are not uniform and that factors such as fragment length to transcript length ratio, read length to fragment length ratio, fragment length distribution, and number of molecules influence the variability of coverage across a transcript. Finally, we explore a potential application of the model where, with simulations, we show that it is possible to correctly estimate the transcript copy number for any transcript in the RNA-seq experiment. PMID:27661099
2011-01-01
Background The generation and analysis of high-throughput sequencing data are becoming a major component of many studies in molecular biology and medical research. Illumina's Genome Analyzer (GA) and HiSeq instruments are currently the most widely used sequencing devices. Here, we comprehensively evaluate properties of genomic HiSeq and GAIIx data derived from two plant genomes and one virus, with read lengths of 95 to 150 bases. Results We provide quantifications and evidence for GC bias, error rates, error sequence context, effects of quality filtering, and the reliability of quality values. By combining different filtering criteria we reduced error rates 7-fold at the expense of discarding 12.5% of alignable bases. While overall error rates are low in HiSeq data we observed regions of accumulated wrong base calls. Only 3% of all error positions accounted for 24.7% of all substitution errors. Analyzing the forward and reverse strands separately revealed error rates of up to 18.7%. Insertions and deletions occurred at very low rates on average but increased to up to 2% in homopolymers. A positive correlation between read coverage and GC content was found depending on the GC content range. Conclusions The errors and biases we report have implications for the use and the interpretation of Illumina sequencing data. GAIIx and HiSeq data sets show slightly different error profiles. Quality filtering is essential to minimize downstream analysis artifacts. Supporting previous recommendations, the strand-specificity provides a criterion to distinguish sequencing errors from low abundance polymorphisms. PMID:22067484
Necci, Marco; Piovesan, Damiano; Tosatto, Silvio C E
2016-12-01
Intrinsic disorder (ID) in proteins has been extensively described for the last decade; a large-scale classification of ID in proteins is mostly missing. Here, we provide an extensive analysis of ID in the protein universe on the UniProt database derived from sequence-based predictions in MobiDB. Almost half the sequences contain an ID region of at least five residues. About 9% of proteins have a long ID region of over 20 residues which are more abundant in Eukaryotic organisms and most frequently cover less than 20% of the sequence. A small subset of about 67,000 (out of over 80 million) proteins is fully disordered and mostly found in Viruses. Most proteins have only one ID, with short ID evenly distributed along the sequence and long ID overrepresented in the center. The charged residue composition of Das and Pappu was used to classify ID proteins by structural propensities and corresponding functional enrichment. Swollen Coils seem to be used mainly as structural components and in biosynthesis in both Prokaryotes and Eukaryotes. In Bacteria, they are confined in the nucleoid and in Viruses provide DNA binding function. Coils & Hairpins seem to be specialized in ribosome binding and methylation activities. Globules & Tadpoles bind antigens in Eukaryotes but are involved in killing other organisms and cytolysis in Bacteria. The Undefined class is used by Bacteria to bind toxic substances and mediate transport and movement between and within organisms in Viruses. Fully disordered proteins behave similarly, but are enriched for glycine residues and extracellular structures. © 2016 The Protein Society.
ScaffoldSeq: Software for characterization of directed evolution populations.
Woldring, Daniel R; Holec, Patrick V; Hackel, Benjamin J
2016-07-01
ScaffoldSeq is software designed for the numerous applications-including directed evolution analysis-in which a user generates a population of DNA sequences encoding for partially diverse proteins with related functions and would like to characterize the single site and pairwise amino acid frequencies across the population. A common scenario for enzyme maturation, antibody screening, and alternative scaffold engineering involves naïve and evolved populations that contain diversified regions, varying in both sequence and length, within a conserved framework. Analyzing the diversified regions of such populations is facilitated by high-throughput sequencing platforms; however, length variability within these regions (e.g., antibody CDRs) encumbers the alignment process. To overcome this challenge, the ScaffoldSeq algorithm takes advantage of conserved framework sequences to quickly identify diverse regions. Beyond this, unintended biases in sequence frequency are generated throughout the experimental workflow required to evolve and isolate clones of interest prior to DNA sequencing. ScaffoldSeq software uniquely handles this issue by providing tools to quantify and remove background sequences, cluster similar protein families, and dampen the impact of dominant clones. The software produces graphical and tabular summaries for each region of interest, allowing users to evaluate diversity in a site-specific manner as well as identify epistatic pairwise interactions. The code and detailed information are freely available at http://research.cems.umn.edu/hackel. Proteins 2016; 84:869-874. © 2016 Wiley Periodicals, Inc. © 2016 Wiley Periodicals, Inc.
USDA-ARS?s Scientific Manuscript database
We conducted genomic sequencing to identify viruses associated with mosaic disease of an apple tree using the high-throughput sequencing (HTS) Illumina RNA-seq platform. The objective was to examine if rapid identification and characterization of viruses could be effectively achieved by RNA-seq anal...
USDA-ARS?s Scientific Manuscript database
PacBio long-read sequencing technology is increasingly popular in genome sequence assembly and transcriptome cataloguing. Recently, a new-generation pig reference genome was assembled based on long reads from this technology. To finely annotate this genome assembly, transcriptomes of nine tissues fr...
Archer, Stuart K; Shirokikh, Nikolay E; Preiss, Thomas
2015-04-01
Most applications for RNA-seq require the depletion of abundant transcripts to gain greater coverage of the underlying transcriptome. The sequences to be targeted for depletion depend on application and species and in many cases may not be supported by commercial depletion kits. This unit describes a method for generating RNA-seq libraries that incorporates probe-directed degradation (PDD), which can deplete any unwanted sequence set, with the low-bias split-adapter method of library generation (although many other library generation methods are in principle compatible). The overall strategy is suitable for applications requiring customized sequence depletion or where faithful representation of fragment ends and lack of sequence bias is paramount. We provide guidelines to rapidly design specific probes against the target sequence, and a detailed protocol for library generation using the split-adapter method including several strategies for streamlining the technique and reducing adapter dimer content. Copyright © 2015 John Wiley & Sons, Inc.
Reid-Bayliss, Kate S; Loeb, Lawrence A
2017-08-29
Transcriptional mutagenesis (TM) due to misincorporation during RNA transcription can result in mutant RNAs, or epimutations, that generate proteins with altered properties. TM has long been hypothesized to play a role in aging, cancer, and viral and bacterial evolution. However, inadequate methodologies have limited progress in elucidating a causal association. We present a high-throughput, highly accurate RNA sequencing method to measure epimutations with single-molecule sensitivity. Accurate RNA consensus sequencing (ARC-seq) uniquely combines RNA barcoding and generation of multiple cDNA copies per RNA molecule to eliminate errors introduced during cDNA synthesis, PCR, and sequencing. The stringency of ARC-seq can be scaled to accommodate the quality of input RNAs. We apply ARC-seq to directly assess transcriptome-wide epimutations resulting from RNA polymerase mutants and oxidative stress.
Corrie, Brian D; Marthandan, Nishanth; Zimonja, Bojan; Jaglale, Jerome; Zhou, Yang; Barr, Emily; Knoetze, Nicole; Breden, Frances M W; Christley, Scott; Scott, Jamie K; Cowell, Lindsay G; Breden, Felix
2018-07-01
Next-generation sequencing allows the characterization of the adaptive immune receptor repertoire (AIRR) in exquisite detail. These large-scale AIRR-seq data sets have rapidly become critical to vaccine development, understanding the immune response in autoimmune and infectious disease, and monitoring novel therapeutics against cancer. However, at present there is no easy way to compare these AIRR-seq data sets across studies and institutions. The ability to combine and compare information for different disease conditions will greatly enhance the value of AIRR-seq data for improving biomedical research and patient care. The iReceptor Data Integration Platform (gateway.ireceptor.org) provides one implementation of the AIRR Data Commons envisioned by the AIRR Community (airr-community.org), an initiative that is developing protocols to facilitate sharing and comparing AIRR-seq data. The iReceptor Scientific Gateway links distributed (federated) AIRR-seq repositories, allowing sequence searches or metadata queries across multiple studies at multiple institutions, returning sets of sequences fulfilling specific criteria. We present a review of the development of iReceptor, and how it fits in with the general trend toward sharing genomic and health data, and the development of standards for describing and reporting AIRR-seq data. Researchers interested in integrating their repositories of AIRR-seq data into the iReceptor Platform are invited to contact support@ireceptor.org. © 2018 John Wiley & Sons A/S. Published by John Wiley & Sons Ltd.
Langevin, Stanley A.; Bent, Zachary W.; Solberg, Owen D.; Curtis, Deanna J.; Lane, Pamela D.; Williams, Kelly P.; Schoeniger, Joseph S.; Sinha, Anupama; Lane, Todd W.; Branda, Steven S.
2013-01-01
Use of second generation sequencing (SGS) technologies for transcriptional profiling (RNA-Seq) has revolutionized transcriptomics, enabling measurement of RNA abundances with unprecedented specificity and sensitivity and the discovery of novel RNA species. Preparation of RNA-Seq libraries requires conversion of the RNA starting material into cDNA flanked by platform-specific adaptor sequences. Each of the published methods and commercial kits currently available for RNA-Seq library preparation suffers from at least one major drawback, including long processing times, large starting material requirements, uneven coverage, loss of strand information and high cost. We report the development of a new RNA-Seq library preparation technique that produces representative, strand-specific RNA-Seq libraries from small amounts of starting material in a fast, simple and cost-effective manner. Additionally, we have developed a new quantitative PCR-based assay for precisely determining the number of PCR cycles to perform for optimal enrichment of the final library, a key step in all SGS library preparation workflows. PMID:23558773
TRAPR: R Package for Statistical Analysis and Visualization of RNA-Seq Data.
Lim, Jae Hyun; Lee, Soo Youn; Kim, Ju Han
2017-03-01
High-throughput transcriptome sequencing, also known as RNA sequencing (RNA-Seq), is a standard technology for measuring gene expression with unprecedented accuracy. Numerous bioconductor packages have been developed for the statistical analysis of RNA-Seq data. However, these tools focus on specific aspects of the data analysis pipeline, and are difficult to appropriately integrate with one another due to their disparate data structures and processing methods. They also lack visualization methods to confirm the integrity of the data and the process. In this paper, we propose an R-based RNA-Seq analysis pipeline called TRAPR, an integrated tool that facilitates the statistical analysis and visualization of RNA-Seq expression data. TRAPR provides various functions for data management, the filtering of low-quality data, normalization, transformation, statistical analysis, data visualization, and result visualization that allow researchers to build customized analysis pipelines.
Seq-Well: portable, low-cost RNA sequencing of single cells at high throughput.
Gierahn, Todd M; Wadsworth, Marc H; Hughes, Travis K; Bryson, Bryan D; Butler, Andrew; Satija, Rahul; Fortune, Sarah; Love, J Christopher; Shalek, Alex K
2017-04-01
Single-cell RNA-seq can precisely resolve cellular states, but applying this method to low-input samples is challenging. Here, we present Seq-Well, a portable, low-cost platform for massively parallel single-cell RNA-seq. Barcoded mRNA capture beads and single cells are sealed in an array of subnanoliter wells using a semipermeable membrane, enabling efficient cell lysis and transcript capture. We use Seq-Well to profile thousands of primary human macrophages exposed to Mycobacterium tuberculosis.
Conservation of a molecular target across species can be used as a line-of-evidence to predict the likelihood of chemical susceptibility. The web-based Sequence Alignment to Predict Across Species Susceptibility (SeqAPASS) tool was developed to simplify, streamline, and quantitat...
Reporting Differences Between Spacecraft Sequence Files
NASA Technical Reports Server (NTRS)
Khanampompan, Teerapat; Gladden, Roy E.; Fisher, Forest W.
2010-01-01
A suite of computer programs, called seq diff suite, reports differences between the products of other computer programs involved in the generation of sequences of commands for spacecraft. These products consist of files of several types: replacement sequence of events (RSOE), DSN keyword file [DKF (wherein DSN signifies Deep Space Network)], spacecraft activities sequence file (SASF), spacecraft sequence file (SSF), and station allocation file (SAF). These products can include line numbers, request identifications, and other pieces of information that are not relevant when generating command sequence products, though these fields can result in the appearance of many changes to the files, particularly when using the UNIX diff command to inspect file differences. The outputs of prior software tools for reporting differences between such products include differences in these non-relevant pieces of information. In contrast, seq diff suite removes the fields containing the irrelevant pieces of information before processing to extract differences, so that only relevant differences are reported. Thus, seq diff suite is especially useful for reporting changes between successive versions of the various products and in particular flagging difference in fields relevant to the sequence command generation and review process.
CORNAS: coverage-dependent RNA-Seq analysis of gene expression data without biological replicates.
Low, Joel Z B; Khang, Tsung Fei; Tammi, Martti T
2017-12-28
In current statistical methods for calling differentially expressed genes in RNA-Seq experiments, the assumption is that an adjusted observed gene count represents an unknown true gene count. This adjustment usually consists of a normalization step to account for heterogeneous sample library sizes, and then the resulting normalized gene counts are used as input for parametric or non-parametric differential gene expression tests. A distribution of true gene counts, each with a different probability, can result in the same observed gene count. Importantly, sequencing coverage information is currently not explicitly incorporated into any of the statistical models used for RNA-Seq analysis. We developed a fast Bayesian method which uses the sequencing coverage information determined from the concentration of an RNA sample to estimate the posterior distribution of a true gene count. Our method has better or comparable performance compared to NOISeq and GFOLD, according to the results from simulations and experiments with real unreplicated data. We incorporated a previously unused sequencing coverage parameter into a procedure for differential gene expression analysis with RNA-Seq data. Our results suggest that our method can be used to overcome analytical bottlenecks in experiments with limited number of replicates and low sequencing coverage. The method is implemented in CORNAS (Coverage-dependent RNA-Seq), and is available at https://github.com/joel-lzb/CORNAS .
Liu, Wanting; Xiang, Lunping; Zheng, Tingkai; Jin, Jingjie
2018-01-01
Abstract Translation is a key regulatory step, linking transcriptome and proteome. Two major methods of translatome investigations are RNC-seq (sequencing of translating mRNA) and Ribo-seq (ribosome profiling). To facilitate the investigation of translation, we built a comprehensive database TranslatomeDB (http://www.translatomedb.net/) which provides collection and integrated analysis of published and user-generated translatome sequencing data. The current version includes 2453 Ribo-seq, 10 RNC-seq and their 1394 corresponding mRNA-seq datasets in 13 species. The database emphasizes the analysis functions in addition to the dataset collections. Differential gene expression (DGE) analysis can be performed between any two datasets of same species and type, both on transcriptome and translatome levels. The translation indices translation ratios, elongation velocity index and translational efficiency can be calculated to quantitatively evaluate translational initiation efficiency and elongation velocity, respectively. All datasets were analyzed using a unified, robust, accurate and experimentally-verifiable pipeline based on the FANSe3 mapping algorithm and edgeR for DGE analyzes. TranslatomeDB also allows users to upload their own datasets and utilize the identical unified pipeline to analyze their data. We believe that our TranslatomeDB is a comprehensive platform and knowledgebase on translatome and proteome research, releasing the biologists from complex searching, analyzing and comparing huge sequencing data without needing local computational power. PMID:29106630
MultiSeq: unifying sequence and structure data for evolutionary analysis
Roberts, Elijah; Eargle, John; Wright, Dan; Luthey-Schulten, Zaida
2006-01-01
Background Since the publication of the first draft of the human genome in 2000, bioinformatic data have been accumulating at an overwhelming pace. Currently, more than 3 million sequences and 35 thousand structures of proteins and nucleic acids are available in public databases. Finding correlations in and between these data to answer critical research questions is extremely challenging. This problem needs to be approached from several directions: information science to organize and search the data; information visualization to assist in recognizing correlations; mathematics to formulate statistical inferences; and biology to analyze chemical and physical properties in terms of sequence and structure changes. Results Here we present MultiSeq, a unified bioinformatics analysis environment that allows one to organize, display, align and analyze both sequence and structure data for proteins and nucleic acids. While special emphasis is placed on analyzing the data within the framework of evolutionary biology, the environment is also flexible enough to accommodate other usage patterns. The evolutionary approach is supported by the use of predefined metadata, adherence to standard ontological mappings, and the ability for the user to adjust these classifications using an electronic notebook. MultiSeq contains a new algorithm to generate complete evolutionary profiles that represent the topology of the molecular phylogenetic tree of a homologous group of distantly related proteins. The method, based on the multidimensional QR factorization of multiple sequence and structure alignments, removes redundancy from the alignments and orders the protein sequences by increasing linear dependence, resulting in the identification of a minimal basis set of sequences that spans the evolutionary space of the homologous group of proteins. Conclusion MultiSeq is a major extension of the Multiple Alignment tool that is provided as part of VMD, a structural visualization program for analyzing molecular dynamics simulations. Both are freely distributed by the NIH Resource for Macromolecular Modeling and Bioinformatics and MultiSeq is included with VMD starting with version 1.8.5. The MultiSeq website has details on how to download and use the software: PMID:16914055
Eaton, Deren A R; Spriggs, Elizabeth L; Park, Brian; Donoghue, Michael J
2017-05-01
Restriction-site associated DNA (RAD) sequencing and related methods rely on the conservation of enzyme recognition sites to isolate homologous DNA fragments for sequencing, with the consequence that mutations disrupting these sites lead to missing information. There is thus a clear expectation for how missing data should be distributed, with fewer loci recovered between more distantly related samples. This observation has led to a related expectation: that RAD-seq data are insufficiently informative for resolving deeper scale phylogenetic relationships. Here we investigate the relationship between missing information among samples at the tips of a tree and information at edges within it. We re-analyze and review the distribution of missing data across ten RAD-seq data sets and carry out simulations to determine expected patterns of missing information. We also present new empirical results for the angiosperm clade Viburnum (Adoxaceae, with a crown age >50 Ma) for which we examine phylogenetic information at different depths in the tree and with varied sequencing effort. The total number of loci, the proportion that are shared, and phylogenetic informativeness varied dramatically across the examined RAD-seq data sets. Insufficient or uneven sequencing coverage accounted for similar proportions of missing data as dropout from mutation-disruption. Simulations reveal that mutation-disruption, which results in phylogenetically distributed missing data, can be distinguished from the more stochastic patterns of missing data caused by low sequencing coverage. In Viburnum, doubling sequencing coverage nearly doubled the number of parsimony informative sites, and increased by >10X the number of loci with data shared across >40 taxa. Our analysis leads to a set of practical recommendations for maximizing phylogenetic information in RAD-seq studies. [hierarchical redundancy; phylogenetic informativeness; quartet informativeness; Restriction-site associated DNA (RAD) sequencing; sequencing coverage; Viburnum.]. © The authors 2016. Published by Oxford University Press, on behalf of the Society of Systematic Biologists. All rights reserved. For permissions, please e-mail: journals.permission@oup.com.
In vivo insertion pool sequencing identifies virulence factors in a complex fungal–host interaction
Uhse, Simon; Pflug, Florian G.; Stirnberg, Alexandra; Ehrlinger, Klaus; von Haeseler, Arndt
2018-01-01
Large-scale insertional mutagenesis screens can be powerful genome-wide tools if they are streamlined with efficient downstream analysis, which is a serious bottleneck in complex biological systems. A major impediment to the success of next-generation sequencing (NGS)-based screens for virulence factors is that the genetic material of pathogens is often underrepresented within the eukaryotic host, making detection extremely challenging. We therefore established insertion Pool-Sequencing (iPool-Seq) on maize infected with the biotrophic fungus U. maydis. iPool-Seq features tagmentation, unique molecular barcodes, and affinity purification of pathogen insertion mutant DNA from in vivo-infected tissues. In a proof of concept using iPool-Seq, we identified 28 virulence factors, including 23 that were previously uncharacterized, from an initial pool of 195 candidate effector mutants. Because of its sensitivity and quantitative nature, iPool-Seq can be applied to any insertional mutagenesis library and is especially suitable for genetically complex setups like pooled infections of eukaryotic hosts. PMID:29684023
Comprehensive discovery of noncoding RNAs in acute myeloid leukemia cell transcriptomes.
Zhang, Jin; Griffith, Malachi; Miller, Christopher A; Griffith, Obi L; Spencer, David H; Walker, Jason R; Magrini, Vincent; McGrath, Sean D; Ly, Amy; Helton, Nichole M; Trissal, Maria; Link, Daniel C; Dang, Ha X; Larson, David E; Kulkarni, Shashikant; Cordes, Matthew G; Fronick, Catrina C; Fulton, Robert S; Klco, Jeffery M; Mardis, Elaine R; Ley, Timothy J; Wilson, Richard K; Maher, Christopher A
2017-11-01
To detect diverse and novel RNA species comprehensively, we compared deep small RNA and RNA sequencing (RNA-seq) methods applied to a primary acute myeloid leukemia (AML) sample. We were able to discover previously unannotated small RNAs using deep sequencing of a library method using broader insert size selection. We analyzed the long noncoding RNA (lncRNA) landscape in AML by comparing deep sequencing from multiple RNA-seq library construction methods for the sample that we studied and then integrating RNA-seq data from 179 AML cases. This identified lncRNAs that are completely novel, differentially expressed, and associated with specific AML subtypes. Our study revealed the complexity of the noncoding RNA transcriptome through a combined strategy of strand-specific small RNA and total RNA-seq. This dataset will serve as an invaluable resource for future RNA-based analyses. Copyright © 2017 ISEH – Society for Hematology and Stem Cells. Published by Elsevier Inc. All rights reserved.
Gadala-Maria, Daniel; Yaari, Gur; Uduman, Mohamed; Kleinstein, Steven H
2015-02-24
Individual variation in germline and expressed B-cell immunoglobulin (Ig) repertoires has been associated with aging, disease susceptibility, and differential response to infection and vaccination. Repertoire properties can now be studied at large-scale through next-generation sequencing of rearranged Ig genes. Accurate analysis of these repertoire-sequencing (Rep-Seq) data requires identifying the germline variable (V), diversity (D), and joining (J) gene segments used by each Ig sequence. Current V(D)J assignment methods work by aligning sequences to a database of known germline V(D)J segment alleles. However, existing databases are likely to be incomplete and novel polymorphisms are hard to differentiate from the frequent occurrence of somatic hypermutations in Ig sequences. Here we develop a Tool for Ig Genotype Elucidation via Rep-Seq (TIgGER). TIgGER analyzes mutation patterns in Rep-Seq data to identify novel V segment alleles, and also constructs a personalized germline database containing the specific set of alleles carried by a subject. This information is then used to improve the initial V segment assignments from existing tools, like IMGT/HighV-QUEST. The application of TIgGER to Rep-Seq data from seven subjects identified 11 novel V segment alleles, including at least one in every subject examined. These novel alleles constituted 13% of the total number of unique alleles in these subjects, and impacted 3% of V(D)J segment assignments. These results reinforce the highly polymorphic nature of human Ig V genes, and suggest that many novel alleles remain to be discovered. The integration of TIgGER into Rep-Seq processing pipelines will increase the accuracy of V segment assignments, thus improving B-cell repertoire analyses.
Accurate Prediction of Inducible Transcription Factor Binding Intensities In Vivo
Siepel, Adam; Lis, John T.
2012-01-01
DNA sequence and local chromatin landscape act jointly to determine transcription factor (TF) binding intensity profiles. To disentangle these influences, we developed an experimental approach, called protein/DNA binding followed by high-throughput sequencing (PB–seq), that allows the binding energy landscape to be characterized genome-wide in the absence of chromatin. We applied our methods to the Drosophila Heat Shock Factor (HSF), which inducibly binds a target DNA sequence element (HSE) following heat shock stress. PB–seq involves incubating sheared naked genomic DNA with recombinant HSF, partitioning the HSF–bound and HSF–free DNA, and then detecting HSF–bound DNA by high-throughput sequencing. We compared PB–seq binding profiles with ones observed in vivo by ChIP–seq and developed statistical models to predict the observed departures from idealized binding patterns based on covariates describing the local chromatin environment. We found that DNase I hypersensitivity and tetra-acetylation of H4 were the most influential covariates in predicting changes in HSF binding affinity. We also investigated the extent to which DNA accessibility, as measured by digital DNase I footprinting data, could be predicted from MNase–seq data and the ChIP–chip profiles for many histone modifications and TFs, and found GAGA element associated factor (GAF), tetra-acetylation of H4, and H4K16 acetylation to be the most predictive covariates. Lastly, we generated an unbiased model of HSF binding sequences, which revealed distinct biophysical properties of the HSF/HSE interaction and a previously unrecognized substructure within the HSE. These findings provide new insights into the interplay between the genomic sequence and the chromatin landscape in determining transcription factor binding intensity. PMID:22479205
Tian, Yao; Smith, David Roy
2016-05-01
Thousands of mitochondrial genomes have been sequenced, but there are comparatively few available mitochondrial transcriptomes. This might soon be changing. High-throughput RNA sequencing (RNA-Seq) techniques have made it fast and cheap to generate massive amounts of mitochondrial transcriptomic data. Here, we explore the utility of RNA-Seq for assembling mitochondrial genomes and studying their expression patterns. Specifically, we investigate the mitochondrial transcriptomes from Polytomella non-photosynthetic green algae, which have among the smallest, most reduced mitochondrial genomes from the Archaeplastida as well as fragmented rRNA-coding regions, palindromic genes, and linear chromosomes with telomeres. Isolation of whole genomic RNA from the four known Polytomella species followed by Illumina paired-end sequencing generated enough mitochondrial-derived reads to easily recover almost-entire mitochondrial genome sequences. Read-mapping and coverage statistics also gave insights into Polytomella mitochondrial transcriptional architecture, revealing polycistronic transcripts and the expression of telomeres and palindromic genes. Ultimately, RNA-Seq is a promising, cost-effective technique for studying mitochondrial genetics, but it does have drawbacks, which are discussed. One of its greatest potentials, as shown here, is that it can be used to generate near-complete mitochondrial genome sequences, which could be particularly useful in situations where there is a lack of available mtDNA data. Copyright © 2016 The Authors. Published by Elsevier Inc. All rights reserved.
Yates, Kathleen B.; Bi, Kevin; Darko, Samuel; Godec, Jernej; Gerdemann, Ulrike; Swadling, Leo; Douek, Daniel C.; Klenerman, Paul; Barnes, Eleanor J.; Sharpe, Arlene H.
2017-01-01
Abstract The T cell compartment must contain diversity in both T cell receptor (TCR) repertoire and cell state to provide effective immunity against pathogens. However, it remains unclear how differences in the TCR contribute to heterogeneity in T cell state. Single cell RNA-sequencing (scRNA-seq) can allow simultaneous measurement of TCR sequence and global transcriptional profile from single cells. However, current methods for TCR inference from scRNA-seq are limited in their sensitivity and require long sequencing reads, thus increasing the cost and decreasing the number of cells that can be feasibly analyzed. Here we present TRAPeS, a publicly available tool that can efficiently extract TCR sequence information from short-read scRNA-seq libraries. We apply it to investigate heterogeneity in the CD8+ T cell response in humans and mice, and show that it is accurate and more sensitive than existing approaches. Coupling TRAPeS with transcriptome analysis of CD8+ T cells specific for a single epitope from Yellow Fever Virus (YFV), we show that the recently described ‘naive-like’ memory population have significantly longer CDR3 regions and greater divergence from germline sequence than do effector-memory phenotype cells. This suggests that TCR usage is associated with the differentiation state of the CD8+ T cell response to YFV. PMID:28934479
Engagement and communication among participants in the ClinSeq Genomic Sequencing Study.
Hooker, Gillian W; Umstead, Kendall L; Lewis, Katie L; Koehly, Laura K; Biesecker, Leslie G; Biesecker, Barbara B
2017-01-01
As clinical genome sequencing expand its reach, understanding how individuals engage with this process are of critical importance. In this study, we aimed to describe internal engagement and its correlates among a ClinSeq cohort of adults consented to genome sequencing and receipt of results. This study was framed using the precaution adoption process model (PAPM), in which knowledge predicts engagement and engagement predicts subsequent behaviors. Prior to receipt of sequencing results, 630 participants in the study completed a baseline survey. Engagement was assessed as the frequency with which participants thought about their participation in ClinSeq since enrollment. Results were consistent with the PAPM: those with higher genomics knowledge reported higher engagement (r = 0.13, P = 0.001) and those who were more engaged reported more frequent communication with their physicians (r = 0.28, P < 0.001) and family members (r = 0.35, P < 0.001) about ClinSeq. Characteristics of those with higher engagement included poorer overall health (r = -0.13, P = 0.002), greater seeking of health information (r = 0.16, P < 0.001), and more recent study enrollment (r = -0.21, P < 0.001). These data support the importance of internal engagement in communication related to genomic sequencing.Genet Med 19 1, 98-103.
Picardi, Ernesto; Gallo, Angela; Galeano, Federica; Tomaselli, Sara; Pesole, Graziano
2012-01-01
RNA editing is a post-transcriptional process occurring in a wide range of organisms. In human brain, the A-to-I RNA editing, in which individual adenosine (A) bases in pre-mRNA are modified to yield inosine (I), is the most frequent event. Modulating gene expression, RNA editing is essential for cellular homeostasis. Indeed, its deregulation has been linked to several neurological and neurodegenerative diseases. To date, many RNA editing sites have been identified by next generation sequencing technologies employing massive transcriptome sequencing together with whole genome or exome sequencing. While genome and transcriptome reads are not always available for single individuals, RNA-Seq data are widespread through public databases and represent a relevant source of yet unexplored RNA editing sites. In this context, we propose a simple computational strategy to identify genomic positions enriched in novel hypothetical RNA editing events by means of a new two-steps mapping procedure requiring only RNA-Seq data and no a priori knowledge of RNA editing characteristics and genomic reads. We assessed the suitability of our procedure by confirming A-to-I candidates using conventional Sanger sequencing and performing RNA-Seq as well as whole exome sequencing of human spinal cord tissue from a single individual. PMID:22957051
Sanchez-Luque, Francisco J; Richardson, Sandra R; Faulkner, Geoffrey J
2016-01-01
Mobile genetic elements (MGEs) are of critical importance in genomics and developmental biology. Polymorphic and somatic MGE insertions have the potential to impact the phenotype of an individual, depending on their genomic locations and functional consequences. However, the identification of polymorphic and somatic insertions among the plethora of copies residing in the genome presents a formidable technical challenge. Whole genome sequencing has the potential to address this problem; however, its efficacy depends on the abundance of cells carrying the new insertion. Robust detection of somatic insertions present in only a subset of cells within a given sample can also be prohibitively expensive due to a requirement for high sequencing depth. Here, we describe retrotransposon capture sequencing (RC-seq), a sequence capture approach in which Illumina libraries are enriched for fragments containing the 5' and 3' termini of specific MGEs. RC-seq allows the detection of known polymorphic insertions present in an individual, as well as the identification of rare or private germline insertions not previously described. Furthermore, RC-seq can be used to detect and characterize somatic insertions, providing a valuable tool to elucidate the extent and characteristics of MGE activity in healthy tissues and in various disease states.
Synthetic spike-in standards for high-throughput 16S rRNA gene amplicon sequencing
Tourlousse, Dieter M.; Yoshiike, Satowa; Ohashi, Akiko; Matsukura, Satoko; Noda, Naohiro
2017-01-01
Abstract High-throughput sequencing of 16S rRNA gene amplicons (16S-seq) has become a widely deployed method for profiling complex microbial communities but technical pitfalls related to data reliability and quantification remain to be fully addressed. In this work, we have developed and implemented a set of synthetic 16S rRNA genes to serve as universal spike-in standards for 16S-seq experiments. The spike-ins represent full-length 16S rRNA genes containing artificial variable regions with negligible identity to known nucleotide sequences, permitting unambiguous identification of spike-in sequences in 16S-seq read data from any microbiome sample. Using defined mock communities and environmental microbiota, we characterized the performance of the spike-in standards and demonstrated their utility for evaluating data quality on a per-sample basis. Further, we showed that staggered spike-in mixtures added at the point of DNA extraction enable concurrent estimation of absolute microbial abundances suitable for comparative analysis. Results also underscored that template-specific Illumina sequencing artifacts may lead to biases in the perceived abundance of certain taxa. Taken together, the spike-in standards represent a novel bioanalytical tool that can substantially improve 16S-seq-based microbiome studies by enabling comprehensive quality control along with absolute quantification. PMID:27980100
Indel detection from DNA and RNA sequencing data with transIndel.
Yang, Rendong; Van Etten, Jamie L; Dehm, Scott M
2018-04-19
Insertions and deletions (indels) are a major class of genomic variation associated with human disease. Indels are primarily detected from DNA sequencing (DNA-seq) data but their transcriptional consequences remain unexplored due to challenges in discriminating medium-sized and large indels from splicing events in RNA-seq data. Here, we developed transIndel, a splice-aware algorithm that parses the chimeric alignments predicted by a short read aligner and reconstructs the mid-sized insertions and large deletions based on the linear alignments of split reads from DNA-seq or RNA-seq data. TransIndel exhibits competitive or superior performance over eight state-of-the-art indel detection tools on benchmarks using both synthetic and real DNA-seq data. Additionally, we applied transIndel to DNA-seq and RNA-seq datasets from 333 primary prostate cancer patients from The Cancer Genome Atlas (TCGA) and 59 metastatic prostate cancer patients from AACR-PCF Stand-Up- To-Cancer (SU2C) studies. TransIndel enhanced the taxonomy of DNA- and RNA-level alterations in prostate cancer by identifying recurrent FOXA1 indels as well as exitron splicing in genes implicated in disease progression. Our study demonstrates that transIndel is a robust tool for elucidation of medium- and large-sized indels from DNA-seq and RNA-seq data. Including RNA-seq in indel discovery efforts leads to significant improvements in sensitivity for identification of med-sized and large indels missed by DNA-seq, and reveals non-canonical RNA-splicing events in genes associated with disease pathology.
COBRA-Seq: Sensitive and Quantitative Methylome Profiling
Varinli, Hilal; Statham, Aaron L.; Clark, Susan J.; Molloy, Peter L.; Ross, Jason P.
2015-01-01
Combined Bisulfite Restriction Analysis (COBRA) quantifies DNA methylation at a specific locus. It does so via digestion of PCR amplicons produced from bisulfite-treated DNA, using a restriction enzyme that contains a cytosine within its recognition sequence, such as TaqI. Here, we introduce COBRA-seq, a genome wide reduced methylome method that requires minimal DNA input (0.1–1.0 μg) and can either use PCR or linear amplification to amplify the sequencing library. Variants of COBRA-seq can be used to explore CpG-depleted as well as CpG-rich regions in vertebrate DNA. The choice of enzyme influences enrichment for specific genomic features, such as CpG-rich promoters and CpG islands, or enrichment for less CpG dense regions such as enhancers. COBRA-seq coupled with linear amplification has the additional advantage of reduced PCR bias by producing full length fragments at high abundance. Unlike other reduced representative methylome methods, COBRA-seq has great flexibility in the choice of enzyme and can be multiplexed and tuned, to reduce sequencing costs and to interrogate different numbers of sites. Moreover, COBRA-seq is applicable to non-model organisms without the reference genome and compatible with the investigation of non-CpG methylation by using restriction enzymes containing CpA, CpT, and CpC in their recognition site. PMID:26512698
Determination of fetal DNA fraction from the plasma of pregnant women using sequence read counts.
Kim, Sung K; Hannum, Gregory; Geis, Jennifer; Tynan, John; Hogg, Grant; Zhao, Chen; Jensen, Taylor J; Mazloom, Amin R; Oeth, Paul; Ehrich, Mathias; van den Boom, Dirk; Deciu, Cosmin
2015-08-01
This study introduces a novel method, referred to as SeqFF, for estimating the fetal DNA fraction in the plasma of pregnant women and to infer the underlying mechanism that allows for such statistical modeling. Autosomal regional read counts from whole-genome massively parallel single-end sequencing of circulating cell-free DNA (ccfDNA) from the plasma of 25 312 pregnant women were used to train a multivariate model. The pretrained model was then applied to 505 pregnant samples to assess the performance of SeqFF against known methodologies for fetal DNA fraction calculations. Pearson's correlation between chromosome Y and SeqFF for pregnancies with male fetuses from two independent cohorts ranged from 0.932 to 0.938. Comparison between a single-nucleotide polymorphism-based approach and SeqFF yielded a Pearson's correlation of 0.921. Paired-end sequencing suggests that shorter ccfDNA, that is, less than 150 bp in length, is nonuniformly distributed across the genome. Regions exhibiting an increased proportion of short ccfDNA, which are more likely of fetal origin, tend to provide more information in the SeqFF calculations. SeqFF is a robust and direct method to determine fetal DNA fraction. Furthermore, the method is applicable to both male and female pregnancies and can greatly improve the accuracy of noninvasive prenatal testing for fetal copy number variation. © 2015 John Wiley & Sons, Ltd.
TSSAR: TSS annotation regime for dRNA-seq data.
Amman, Fabian; Wolfinger, Michael T; Lorenz, Ronny; Hofacker, Ivo L; Stadler, Peter F; Findeiß, Sven
2014-03-27
Differential RNA sequencing (dRNA-seq) is a high-throughput screening technique designed to examine the architecture of bacterial operons in general and the precise position of transcription start sites (TSS) in particular. Hitherto, dRNA-seq data were analyzed by visualizing the sequencing reads mapped to the reference genome and manually annotating reliable positions. This is very labor intensive and, due to the subjectivity, biased. Here, we present TSSAR, a tool for automated de novo TSS annotation from dRNA-seq data that respects the statistics of dRNA-seq libraries. TSSAR uses the premise that the number of sequencing reads starting at a certain genomic position within a transcriptional active region follows a Poisson distribution with a parameter that depends on the local strength of expression. The differences of two dRNA-seq library counts thus follow a Skellam distribution. This provides a statistical basis to identify significantly enriched primary transcripts.We assessed the performance by analyzing a publicly available dRNA-seq data set using TSSAR and two simple approaches that utilize user-defined score cutoffs. We evaluated the power of reproducing the manual TSS annotation. Furthermore, the same data set was used to reproduce 74 experimentally validated TSS in H. pylori from reliable techniques such as RACE or primer extension. Both analyses showed that TSSAR outperforms the static cutoff-dependent approaches. Having an automated and efficient tool for analyzing dRNA-seq data facilitates the use of the dRNA-seq technique and promotes its application to more sophisticated analysis. For instance, monitoring the plasticity and dynamics of the transcriptomal architecture triggered by different stimuli and growth conditions becomes possible.The main asset of a novel tool for dRNA-seq analysis that reaches out to a broad user community is usability. As such, we provide TSSAR both as intuitive RESTful Web service ( http://rna.tbi.univie.ac.at/TSSAR) together with a set of post-processing and analysis tools, as well as a stand-alone version for use in high-throughput dRNA-seq data analysis pipelines.
Informatics for RNA Sequencing: A Web Resource for Analysis on the Cloud
Griffith, Malachi; Walker, Jason R.; Spies, Nicholas C.; Ainscough, Benjamin J.; Griffith, Obi L.
2015-01-01
Massively parallel RNA sequencing (RNA-seq) has rapidly become the assay of choice for interrogating RNA transcript abundance and diversity. This article provides a detailed introduction to fundamental RNA-seq molecular biology and informatics concepts. We make available open-access RNA-seq tutorials that cover cloud computing, tool installation, relevant file formats, reference genomes, transcriptome annotations, quality-control strategies, expression, differential expression, and alternative splicing analysis methods. These tutorials and additional training resources are accompanied by complete analysis pipelines and test datasets made available without encumbrance at www.rnaseq.wiki. PMID:26248053
Cortijo, Sandra; Charoensawan, Varodom; Roudier, François; Wigge, Philip A
2018-01-01
Chromatin immunoprecipitation combined with next-generation sequencing (ChIP-seq) is a powerful technique to investigate in vivo transcription factor (TF) binding to DNA, as well as chromatin marks. Here we provide a detailed protocol for all the key steps to perform ChIP-seq in Arabidopsis thaliana roots, also working on other A. thaliana tissues and in most non-ligneous plants. We detail all steps from material collection, fixation, chromatin preparation, immunoprecipitation, library preparation, and finally computational analysis based on a combination of publicly available tools.
Mapping RNA-seq Reads with STAR
Dobin, Alexander; Gingeras, Thomas R.
2015-01-01
Mapping of large sets of high-throughput sequencing reads to a reference genome is one of the foundational steps in RNA-seq data analysis. The STAR software package performs this task with high levels of accuracy and speed. In addition to detecting annotated and novel splice junctions, STAR is capable of discovering more complex RNA sequence arrangements, such as chimeric and circular RNA. STAR can align spliced sequences of any length with moderate error rates providing scalability for emerging sequencing technologies. STAR generates output files that can be used for many downstream analyses such as transcript/gene expression quantification, differential gene expression, novel isoform reconstruction, signal visualization, and so forth. In this unit we describe computational protocols that produce various output files, use different RNA-seq datatypes, and utilize different mapping strategies. STAR is Open Source software that can be run on Unix, Linux or Mac OS X systems. PMID:26334920
Mapping RNA-seq Reads with STAR.
Dobin, Alexander; Gingeras, Thomas R
2015-09-03
Mapping of large sets of high-throughput sequencing reads to a reference genome is one of the foundational steps in RNA-seq data analysis. The STAR software package performs this task with high levels of accuracy and speed. In addition to detecting annotated and novel splice junctions, STAR is capable of discovering more complex RNA sequence arrangements, such as chimeric and circular RNA. STAR can align spliced sequences of any length with moderate error rates, providing scalability for emerging sequencing technologies. STAR generates output files that can be used for many downstream analyses such as transcript/gene expression quantification, differential gene expression, novel isoform reconstruction, and signal visualization. In this unit, we describe computational protocols that produce various output files, use different RNA-seq datatypes, and utilize different mapping strategies. STAR is open source software that can be run on Unix, Linux, or Mac OS X systems. Copyright © 2015 John Wiley & Sons, Inc.
Observation weights unlock bulk RNA-seq tools for zero inflation and single-cell applications.
Van den Berge, Koen; Perraudeau, Fanny; Soneson, Charlotte; Love, Michael I; Risso, Davide; Vert, Jean-Philippe; Robinson, Mark D; Dudoit, Sandrine; Clement, Lieven
2018-02-26
Dropout events in single-cell RNA sequencing (scRNA-seq) cause many transcripts to go undetected and induce an excess of zero read counts, leading to power issues in differential expression (DE) analysis. This has triggered the development of bespoke scRNA-seq DE methods to cope with zero inflation. Recent evaluations, however, have shown that dedicated scRNA-seq tools provide no advantage compared to traditional bulk RNA-seq tools. We introduce a weighting strategy, based on a zero-inflated negative binomial model, that identifies excess zero counts and generates gene- and cell-specific weights to unlock bulk RNA-seq DE pipelines for zero-inflated data, boosting performance for scRNA-seq.
Li, Jun; Tibshirani, Robert
2015-01-01
We discuss the identification of features that are associated with an outcome in RNA-Sequencing (RNA-Seq) and other sequencing-based comparative genomic experiments. RNA-Seq data takes the form of counts, so models based on the normal distribution are generally unsuitable. The problem is especially challenging because different sequencing experiments may generate quite different total numbers of reads, or ‘sequencing depths’. Existing methods for this problem are based on Poisson or negative binomial models: they are useful but can be heavily influenced by ‘outliers’ in the data. We introduce a simple, nonparametric method with resampling to account for the different sequencing depths. The new method is more robust than parametric methods. It can be applied to data with quantitative, survival, two-class or multiple-class outcomes. We compare our proposed method to Poisson and negative binomial-based methods in simulated and real data sets, and find that our method discovers more consistent patterns than competing methods. PMID:22127579
Lee, Soohyun; Seo, Chae Hwa; Alver, Burak Han; Lee, Sanghyuk; Park, Peter J
2015-09-03
RNA-seq has been widely used for genome-wide expression profiling. RNA-seq data typically consists of tens of millions of short sequenced reads from different transcripts. However, due to sequence similarity among genes and among isoforms, the source of a given read is often ambiguous. Existing approaches for estimating expression levels from RNA-seq reads tend to compromise between accuracy and computational cost. We introduce a new approach for quantifying transcript abundance from RNA-seq data. EMSAR (Estimation by Mappability-based Segmentation And Reclustering) groups reads according to the set of transcripts to which they are mapped and finds maximum likelihood estimates using a joint Poisson model for each optimal set of segments of transcripts. The method uses nearly all mapped reads, including those mapped to multiple genes. With an efficient transcriptome indexing based on modified suffix arrays, EMSAR minimizes the use of CPU time and memory while achieving accuracy comparable to the best existing methods. EMSAR is a method for quantifying transcripts from RNA-seq data with high accuracy and low computational cost. EMSAR is available at https://github.com/parklab/emsar.
GUIDE-Seq enables genome-wide profiling of off-target cleavage by CRISPR-Cas nucleases
Nguyen, Nhu T.; Liebers, Matthew; Topkar, Ved V.; Thapar, Vishal; Wyvekens, Nicolas; Khayter, Cyd; Iafrate, A. John; Le, Long P.; Aryee, Martin J.; Joung, J. Keith
2014-01-01
CRISPR RNA-guided nucleases (RGNs) are widely used genome-editing reagents, but methods to delineate their genome-wide off-target cleavage activities have been lacking. Here we describe an approach for global detection of DNA double-stranded breaks (DSBs) introduced by RGNs and potentially other nucleases. This method, called Genome-wide Unbiased Identification of DSBs Enabled by Sequencing (GUIDE-Seq), relies on capture of double-stranded oligodeoxynucleotides into breaks Application of GUIDE-Seq to thirteen RGNs in two human cell lines revealed wide variability in RGN off-target activities and unappreciated characteristics of off-target sequences. The majority of identified sites were not detected by existing computational methods or ChIP-Seq. GUIDE-Seq also identified RGN-independent genomic breakpoint ‘hotspots’. Finally, GUIDE-Seq revealed that truncated guide RNAs exhibit substantially reduced RGN-induced off-target DSBs. Our experiments define the most rigorous framework for genome-wide identification of RGN off-target effects to date and provide a method for evaluating the safety of these nucleases prior to clinical use. PMID:25513782
SEQ-REVIEW: A tool for reviewing and checking spacecraft sequences
NASA Astrophysics Data System (ADS)
Maldague, Pierre F.; El-Boushi, Mekki; Starbird, Thomas J.; Zawacki, Steven J.
1994-11-01
A key component of JPL's strategy to make space missions faster, better and cheaper is the Advanced Multi-Mission Operations System (AMMOS), a ground software intensive system currently in use and in further development. AMMOS intends to eliminate the cost of re-engineering a ground system for each new JPL mission. This paper discusses SEQ-REVIEW, a component of AMMOS that was designed to facilitate and automate the task of reviewing and checking spacecraft sequences. SEQ-REVIEW is a smart browser for inspecting files created by other sequence generation tools in the AMMOS system. It can parse sequence-related files according to a computer-readable version of a 'Software Interface Specification' (SIS), which is a standard document for defining file formats. It lets users display one or several linked files and check simple constraints using a Basic-like 'Little Language'. SEQ-REVIEW represents the first application of the Quality Function Development (QFD) method to sequence software development at JPL. The paper will show how the requirements for SEQ-REVIEW were defined and converted into a design based on object-oriented principles. The process starts with interviews of potential users, a small but diverse group that spans multiple disciplines and 'cultures'. It continues with the development of QFD matrices that related product functions and characteristics to user-demanded qualities. These matrices are then turned into a formal Software Requirements Document (SRD). The process concludes with the design phase, in which the CRC (Class, Responsibility, Collaboration) approach was used to convert requirements into a blueprint for the final product.
SEQ-REVIEW: A tool for reviewing and checking spacecraft sequences
NASA Technical Reports Server (NTRS)
Maldague, Pierre F.; El-Boushi, Mekki; Starbird, Thomas J.; Zawacki, Steven J.
1994-01-01
A key component of JPL's strategy to make space missions faster, better and cheaper is the Advanced Multi-Mission Operations System (AMMOS), a ground software intensive system currently in use and in further development. AMMOS intends to eliminate the cost of re-engineering a ground system for each new JPL mission. This paper discusses SEQ-REVIEW, a component of AMMOS that was designed to facilitate and automate the task of reviewing and checking spacecraft sequences. SEQ-REVIEW is a smart browser for inspecting files created by other sequence generation tools in the AMMOS system. It can parse sequence-related files according to a computer-readable version of a 'Software Interface Specification' (SIS), which is a standard document for defining file formats. It lets users display one or several linked files and check simple constraints using a Basic-like 'Little Language'. SEQ-REVIEW represents the first application of the Quality Function Development (QFD) method to sequence software development at JPL. The paper will show how the requirements for SEQ-REVIEW were defined and converted into a design based on object-oriented principles. The process starts with interviews of potential users, a small but diverse group that spans multiple disciplines and 'cultures'. It continues with the development of QFD matrices that related product functions and characteristics to user-demanded qualities. These matrices are then turned into a formal Software Requirements Document (SRD). The process concludes with the design phase, in which the CRC (Class, Responsibility, Collaboration) approach was used to convert requirements into a blueprint for the final product.
Whole-genome sequencing for comparative genomics and de novo genome assembly.
Benjak, Andrej; Sala, Claudia; Hartkoorn, Ruben C
2015-01-01
Next-generation sequencing technologies for whole-genome sequencing of mycobacteria are rapidly becoming an attractive alternative to more traditional sequencing methods. In particular this technology is proving useful for genome-wide identification of mutations in mycobacteria (comparative genomics) as well as for de novo assembly of whole genomes. Next-generation sequencing however generates a vast quantity of data that can only be transformed into a usable and comprehensible form using bioinformatics. Here we describe the methodology one would use to prepare libraries for whole-genome sequencing, and the basic bioinformatics to identify mutations in a genome following Illumina HiSeq or MiSeq sequencing, as well as de novo genome assembly following sequencing using Pacific Biosciences (PacBio).
Tremblay, Julien
2018-01-22
Julien Tremblay from DOE JGI presents "Evaluation of Multiplexed 16S rRNA Microbial Population Surveys Using Illumina MiSeq Platorm" at the 7th Annual Sequencing, Finishing, Analysis in the Future (SFAF) Meeting held in June, 2012 in Santa Fe, NM.
USDA-ARS?s Scientific Manuscript database
About 447 millions of RNA-Seq sequences were generated from 40 RNA libraries covering 8 different berry developmental stages of table grape ‘Kyoho’ and its early ripening bud mutant ‘Fengzao’. These sequences were mapped to 23,178 and 22,982 genes in the flesh and peel tissues, respectively. While m...
DOE Office of Scientific and Technical Information (OSTI.GOV)
Tremblay, Julien
2012-06-01
Julien Tremblay from DOE JGI presents "Evaluation of Multiplexed 16S rRNA Microbial Population Surveys Using Illumina MiSeq Platorm" at the 7th Annual Sequencing, Finishing, Analysis in the Future (SFAF) Meeting held in June, 2012 in Santa Fe, NM.
Position-specific binding of FUS to nascent RNA regulates mRNA length
Masuda, Akio; Takeda, Jun-ichi; Okuno, Tatsuya; Okamoto, Takaaki; Ohkawara, Bisei; Ito, Mikako; Ishigaki, Shinsuke; Sobue, Gen
2015-01-01
More than half of all human genes produce prematurely terminated polyadenylated short mRNAs. However, the underlying mechanisms remain largely elusive. CLIP-seq (cross-linking immunoprecipitation [CLIP] combined with deep sequencing) of FUS (fused in sarcoma) in neuronal cells showed that FUS is frequently clustered around an alternative polyadenylation (APA) site of nascent RNA. ChIP-seq (chromatin immunoprecipitation [ChIP] combined with deep sequencing) of RNA polymerase II (RNAP II) demonstrated that FUS stalls RNAP II and prematurely terminates transcription. When an APA site is located upstream of an FUS cluster, FUS enhances polyadenylation by recruiting CPSF160 and up-regulates the alternative short transcript. In contrast, when an APA site is located downstream from an FUS cluster, polyadenylation is not activated, and the RNAP II-suppressing effect of FUS leads to down-regulation of the alternative short transcript. CAGE-seq (cap analysis of gene expression [CAGE] combined with deep sequencing) and PolyA-seq (a strand-specific and quantitative method for high-throughput sequencing of 3' ends of polyadenylated transcripts) revealed that position-specific regulation of mRNA lengths by FUS is operational in two-thirds of transcripts in neuronal cells, with enrichment in genes involved in synaptic activities. PMID:25995189
RSAT 2018: regulatory sequence analysis tools 20th anniversary.
Nguyen, Nga Thi Thuy; Contreras-Moreira, Bruno; Castro-Mondragon, Jaime A; Santana-Garcia, Walter; Ossio, Raul; Robles-Espinoza, Carla Daniela; Bahin, Mathieu; Collombet, Samuel; Vincens, Pierre; Thieffry, Denis; van Helden, Jacques; Medina-Rivera, Alejandra; Thomas-Chollier, Morgane
2018-05-02
RSAT (Regulatory Sequence Analysis Tools) is a suite of modular tools for the detection and the analysis of cis-regulatory elements in genome sequences. Its main applications are (i) motif discovery, including from genome-wide datasets like ChIP-seq/ATAC-seq, (ii) motif scanning, (iii) motif analysis (quality assessment, comparisons and clustering), (iv) analysis of regulatory variations, (v) comparative genomics. Six public servers jointly support 10 000 genomes from all kingdoms. Six novel or refactored programs have been added since the 2015 NAR Web Software Issue, including updated programs to analyse regulatory variants (retrieve-variation-seq, variation-scan, convert-variations), along with tools to extract sequences from a list of coordinates (retrieve-seq-bed), to select motifs from motif collections (retrieve-matrix), and to extract orthologs based on Ensembl Compara (get-orthologs-compara). Three use cases illustrate the integration of new and refactored tools to the suite. This Anniversary update gives a 20-year perspective on the software suite. RSAT is well-documented and available through Web sites, SOAP/WSDL (Simple Object Access Protocol/Web Services Description Language) web services, virtual machines and stand-alone programs at http://www.rsat.eu/.
Hoople, Gordon D; Richards, Andrew; Wu, Yan; Pisano, Albert P; Zhang, Kun
2018-03-26
The ability to amplify and sequence either DNA or RNA from small starting samples has only been achieved in the last five years. Unfortunately, the standard protocols for generating genomic or transcriptomic libraries are incompatible and researchers must choose whether to sequence DNA or RNA for a particular sample. Gel-seq solves this problem by enabling researchers to simultaneously prepare libraries for both DNA and RNA starting with 100 - 1000 cells using a simple hydrogel device. This paper presents a detailed approach for the fabrication of the device as well as the biological protocol to generate paired libraries. We designed Gel-seq so that it could be easily implemented by other researchers; many genetics labs already have the necessary equipment to reproduce the Gel-seq device fabrication. Our protocol employs commonly-used kits for both whole-transcript amplification (WTA) and library preparation, which are also likely to be familiar to researchers already versed in generating genomic and transcriptomic libraries. Our approach allows researchers to bring to bear the power of both DNA and RNA sequencing on a single sample without splitting and with negligible added cost.
Identification and removal of low-complexity sites in allele-specific analysis of ChIP-seq data.
Waszak, Sebastian M; Kilpinen, Helena; Gschwind, Andreas R; Orioli, Andrea; Raghav, Sunil K; Witwicki, Robert M; Migliavacca, Eugenia; Yurovsky, Alisa; Lappalainen, Tuuli; Hernandez, Nouria; Reymond, Alexandre; Dermitzakis, Emmanouil T; Deplancke, Bart
2014-01-15
High-throughput sequencing technologies enable the genome-wide analysis of the impact of genetic variation on molecular phenotypes at unprecedented resolution. However, although powerful, these technologies can also introduce unexpected artifacts. We investigated the impact of library amplification bias on the identification of allele-specific (AS) molecular events from high-throughput sequencing data derived from chromatin immunoprecipitation assays (ChIP-seq). Putative AS DNA binding activity for RNA polymerase II was determined using ChIP-seq data derived from lymphoblastoid cell lines of two parent-daughter trios. We found that, at high-sequencing depth, many significant AS binding sites suffered from an amplification bias, as evidenced by a larger number of clonal reads representing one of the two alleles. To alleviate this bias, we devised an amplification bias detection strategy, which filters out sites with low read complexity and sites featuring a significant excess of clonal reads. This method will be useful for AS analyses involving ChIP-seq and other functional sequencing assays. The R package abs filter for library clonality simulations and detection of amplification-biased sites is available from http://updepla1srv1.epfl.ch/waszaks/absfilter
Liu, S; Song, L; Cram, D S; Xiong, L; Wang, K; Wu, R; Liu, J; Deng, K; Jia, B; Zhong, M; Yang, F
2015-10-01
To compare the performance of traditional G-banding karyotyping with that of copy number variation sequencing (CNV-Seq) for detection of chromosomal abnormalities associated with miscarriage. Products of conception (POC) were collected from spontaneous miscarriages. Chromosomal abnormalities were detected using high-resolution G-banding karyotyping and CNV sequencing. Quantitative fluorescent polymerase chain reaction analysis of maternal and POC DNA for short tandem repeat (STR) markers was used to both monitor maternal cell contamination and confirm the chromosomal status and sex of the miscarriage tissue. A total of 64 samples of POC, comprising 16 with an abnormal and 48 with a normal karyotype, were selected and coded for analysis by CNV-Seq. CNV-Seq results were concordant for 14 (87.5%) of the 16 gross chromosomal abnormalities identified by karyotyping, including 11 autosomal trisomies and three sex chromosomal aneuploidies (45,X). Of the two discordant results, a 69,XXX polyploidy was missed by CNV-Seq, although supporting STR marker analysis confirmed the triploidy. In contrast, CNV-Seq identified a sample with 45,X karyotype as a 45,X/46,XY mosaic. In the remaining 48 samples of POC with a normal karyotype, CNV-Seq detected a 2.58-Mb 22q deletion associated with DiGeorge syndrome and nine different smaller CNVs of no apparent clinical significance. CNV-Seq used in parallel with STR profiling is a reliable and accurate alternative to karyotyping for identifying chromosome copy number abnormalities associated with spontaneous miscarriage. Copyright © 2015 ISUOG. Published by John Wiley & Sons Ltd.
SPARTA: Simple Program for Automated reference-based bacterial RNA-seq Transcriptome Analysis.
Johnson, Benjamin K; Scholz, Matthew B; Teal, Tracy K; Abramovitch, Robert B
2016-02-04
Many tools exist in the analysis of bacterial RNA sequencing (RNA-seq) transcriptional profiling experiments to identify differentially expressed genes between experimental conditions. Generally, the workflow includes quality control of reads, mapping to a reference, counting transcript abundance, and statistical tests for differentially expressed genes. In spite of the numerous tools developed for each component of an RNA-seq analysis workflow, easy-to-use bacterially oriented workflow applications to combine multiple tools and automate the process are lacking. With many tools to choose from for each step, the task of identifying a specific tool, adapting the input/output options to the specific use-case, and integrating the tools into a coherent analysis pipeline is not a trivial endeavor, particularly for microbiologists with limited bioinformatics experience. To make bacterial RNA-seq data analysis more accessible, we developed a Simple Program for Automated reference-based bacterial RNA-seq Transcriptome Analysis (SPARTA). SPARTA is a reference-based bacterial RNA-seq analysis workflow application for single-end Illumina reads. SPARTA is turnkey software that simplifies the process of analyzing RNA-seq data sets, making bacterial RNA-seq analysis a routine process that can be undertaken on a personal computer or in the classroom. The easy-to-install, complete workflow processes whole transcriptome shotgun sequencing data files by trimming reads and removing adapters, mapping reads to a reference, counting gene features, calculating differential gene expression, and, importantly, checking for potential batch effects within the data set. SPARTA outputs quality analysis reports, gene feature counts and differential gene expression tables and scatterplots. SPARTA provides an easy-to-use bacterial RNA-seq transcriptional profiling workflow to identify differentially expressed genes between experimental conditions. This software will enable microbiologists with limited bioinformatics experience to analyze their data and integrate next generation sequencing (NGS) technologies into the classroom. The SPARTA software and tutorial are available at sparta.readthedocs.org.
Single-Cell Sequencing for Drug Discovery and Drug Development.
Wu, Hongjin; Wang, Charles; Wu, Shixiu
2017-01-01
Next-generation sequencing (NGS), particularly single-cell sequencing, has revolutionized the scale and scope of genomic and biomedical research. Recent technological advances in NGS and singlecell studies have made the deep whole-genome (DNA-seq), whole epigenome and whole-transcriptome sequencing (RNA-seq) at single-cell level feasible. NGS at the single-cell level expands our view of genome, epigenome and transcriptome and allows the genome, epigenome and transcriptome of any organism to be explored without a priori assumptions and with unprecedented throughput. And it does so with single-nucleotide resolution. NGS is also a very powerful tool for drug discovery and drug development. In this review, we describe the current state of single-cell sequencing techniques, which can provide a new, more powerful and precise approach for analyzing effects of drugs on treated cells and tissues. Our review discusses single-cell whole genome/exome sequencing (scWGS/scWES), single-cell transcriptome sequencing (scRNA-seq), single-cell bisulfite sequencing (scBS), and multiple omics of single-cell sequencing. We also highlight the advantages and challenges of each of these approaches. Finally, we describe, elaborate and speculate the potential applications of single-cell sequencing for drug discovery and drug development. Copyright© Bentham Science Publishers; For any queries, please email at epub@benthamscience.org.
P53 Gene Mutagenesis in Breast Cancer
2005-03-01
the wild type T peak. 12 Table 1. Sonic ntations dected by SINtA Individual Cell Sequence Amino Acid Species Conservation 3 ID’ ID Change2 Change... differences in the content of toxic substances in the diet (Biggs et al., 1993; Blaszyk et al., 1996). The development of this p53 mutation load...Changes in the P53 Gene in Single Cells Individual Sequence Amino acid Species conservation ’ ID’ Cell ID change’ change Monkey Mouse Rat Chicken
Exome Sequencing Identifies Three Novel Candidate Genes Implicated in Intellectual Disability
Azam, Maleeha; Ayub, Humaira; Vissers, Lisenka E. L. M.; Gilissen, Christian; Ali, Syeda Hafiza Benish; Riaz, Moeen; Veltman, Joris A.; Pfundt, Rolph; van Bokhoven, Hans; Qamar, Raheel
2014-01-01
Intellectual disability (ID) is a major health problem mostly with an unknown etiology. Recently exome sequencing of individuals with ID identified novel genes implicated in the disease. Therefore the purpose of the present study was to identify the genetic cause of ID in one syndromic and two non-syndromic Pakistani families. Whole exome of three ID probands was sequenced. Missense variations in two plausible novel genes implicated in autosomal recessive ID were identified: lysine (K)-specific methyltransferase 2B (KMT2B), zinc finger protein 589 (ZNF589), as well as hedgehog acyltransferase (HHAT) with a de novo mutation with autosomal dominant mode of inheritance. The KMT2B recessive variant is the first report of recessive Kleefstra syndrome-like phenotype. Identification of plausible causative mutations for two recessive and a dominant type of ID, in genes not previously implicated in disease, underscores the large genetic heterogeneity of ID. These results also support the viewpoint that large number of ID genes converge on limited number of common networks i.e. ZNF589 belongs to KRAB-domain zinc-finger proteins previously implicated in ID, HHAT is predicted to affect sonic hedgehog, which is involved in several disorders with ID, KMT2B associated with syndromic ID fits the epigenetic module underlying the Kleefstra syndromic spectrum. The association of these novel genes in three different Pakistani ID families highlights the importance of screening these genes in more families with similar phenotypes from different populations to confirm the involvement of these genes in pathogenesis of ID. PMID:25405613
Yao, Po-Ju; Chung, Ren-Hua
2016-02-15
It is difficult for current simulation tools to simulate sequence data in a pre-specified pedigree structure and pre-specified affection status. Previously, we developed a flexible tool, SeqSIMLA2, for simulating sequence data in either unrelated case-control or family samples with different disease and quantitative trait models. Here we extended the tool to efficiently simulate sequences with multiple disease sites in large pedigrees with a given disease status for each pedigree member, assuming that the disease prevalence is low. SeqSIMLA2_exact is implemented with C++ and is available at http://seqsimla.sourceforge.net. © The Author 2015. Published by Oxford University Press. All rights reserved. For Permissions, please e-mail: journals.permissions@oup.com.
Sramkó, Gábor; Paun, Ovidiu
2018-01-01
Abstract Background and Aims Bee orchids (Ophrys) have become the most popular model system for studying reproduction via insect-mediated pseudo-copulation and for exploring the consequent, putatively adaptive, evolutionary radiations. However, despite intensive past research, both the phylogenetic structure and species diversity within the genus remain highly contentious. Here, we integrate next-generation sequencing and morphological cladistic techniques to clarify the phylogeny of the genus. Methods At least two accessions of each of the ten species groups previously circumscribed from large-scale cloned nuclear ribosomal internal transcibed spacer (nrITS) sequencing were subjected to restriction site-associated sequencing (RAD-seq). The resulting matrix of 4159 single nucleotide polymorphisms (SNPs) for 34 accessions was used to construct an unrooted network and a rooted maximum likelihood phylogeny. A parallel morphological cladistic matrix of 43 characters generated both polymorphic and non-polymorphic sets of parsimony trees before being mapped across the RAD-seq topology. Key Results RAD-seq data strongly support the monophyly of nine out of ten groups previously circumscribed using nrITS and resolve three major clades; in contrast, supposed microspecies are barely distinguishable. Strong incongruence separated the RAD-seq trees from both the morphological trees and traditional classifications; mapping of the morphological characters across the RAD-seq topology rendered them far more homoplastic. Conclusions The comparatively high level of morphological homoplasy reflects extensive convergence, whereas the derived placement of the fusca group is attributed to paedomorphic simplification. The phenotype of the most recent common ancestor of the extant lineages is inferred, but it post-dates the majority of the character-state changes that typify the genus. RAD-seq may represent the high-water mark of the contribution of molecular phylogenetics to understanding evolution within Ophrys; further progress will require large-scale population-level studies that integrate phenotypic and genotypic data in a cogent conceptual framework. PMID:29325077
Bateman, Richard M; Sramkó, Gábor; Paun, Ovidiu
2018-01-25
Bee orchids (Ophrys) have become the most popular model system for studying reproduction via insect-mediated pseudo-copulation and for exploring the consequent, putatively adaptive, evolutionary radiations. However, despite intensive past research, both the phylogenetic structure and species diversity within the genus remain highly contentious. Here, we integrate next-generation sequencing and morphological cladistic techniques to clarify the phylogeny of the genus. At least two accessions of each of the ten species groups previously circumscribed from large-scale cloned nuclear ribosomal internal transcibed spacer (nrITS) sequencing were subjected to restriction site-associated sequencing (RAD-seq). The resulting matrix of 4159 single nucleotide polymorphisms (SNPs) for 34 accessions was used to construct an unrooted network and a rooted maximum likelihood phylogeny. A parallel morphological cladistic matrix of 43 characters generated both polymorphic and non-polymorphic sets of parsimony trees before being mapped across the RAD-seq topology. RAD-seq data strongly support the monophyly of nine out of ten groups previously circumscribed using nrITS and resolve three major clades; in contrast, supposed microspecies are barely distinguishable. Strong incongruence separated the RAD-seq trees from both the morphological trees and traditional classifications; mapping of the morphological characters across the RAD-seq topology rendered them far more homoplastic. The comparatively high level of morphological homoplasy reflects extensive convergence, whereas the derived placement of the fusca group is attributed to paedomorphic simplification. The phenotype of the most recent common ancestor of the extant lineages is inferred, but it post-dates the majority of the character-state changes that typify the genus. RAD-seq may represent the high-water mark of the contribution of molecular phylogenetics to understanding evolution within Ophrys; further progress will require large-scale population-level studies that integrate phenotypic and genotypic data in a cogent conceptual framework. © The Author(s) 2018. Published by Oxford University Press on behalf of the Annals of Botany Company.
RNA-ID, a Powerful Tool for Identifying and Characterizing Regulatory Sequences.
Brule, C E; Dean, K M; Grayhack, E J
2016-01-01
The identification and analysis of sequences that regulate gene expression is critical because regulated gene expression underlies biology. RNA-ID is an efficient and sensitive method to discover and investigate regulatory sequences in the yeast Saccharomyces cerevisiae, using fluorescence-based assays to detect green fluorescent protein (GFP) relative to a red fluorescent protein (RFP) control in individual cells. Putative regulatory sequences can be inserted either in-frame or upstream of a superfolder GFP fusion protein whose expression, like that of RFP, is driven by the bidirectional GAL1,10 promoter. In this chapter, we describe the methodology to identify and study cis-regulatory sequences in the RNA-ID system, explaining features and variations of the RNA-ID reporter, as well as some applications of this system. We describe in detail the methods to analyze a single regulatory sequence, from construction of a single GFP variant to assay of variants by flow cytometry, as well as modifications required to screen libraries of different strains simultaneously. We also describe subsequent analyses of regulatory sequences. © 2016 Elsevier Inc. All rights reserved.
Doyle, Stephen R; Griffith, Ian S; Murphy, Nick P; Strugnell, Jan M
2015-01-01
The complete mitochondrial genome of the Eastern Rock lobster, Sagmariasus verreauxi, is reported for the first time. Using low-coverage, long read MiSeq next generation sequencing, we constructed and determined the mtDNA genome organization of the 15,470 bp sequence from two isolates from Eastern Tasmania, Australia and Northern New Zealand, and identified 46 polymorphic nucleotides between the two sequences. This genome sequence and its genetic polymorphisms will likely be useful in understanding the distribution and population connectivity of the Eastern Rock Lobster, and in the fisheries management of this commercially important species.
Workflow and web application for annotating NCBI BioProject transcriptome data
Vera Alvarez, Roberto; Medeiros Vidal, Newton; Garzón-Martínez, Gina A.; Barrero, Luz S.; Landsman, David
2017-01-01
Abstract The volume of transcriptome data is growing exponentially due to rapid improvement of experimental technologies. In response, large central resources such as those of the National Center for Biotechnology Information (NCBI) are continually adapting their computational infrastructure to accommodate this large influx of data. New and specialized databases, such as Transcriptome Shotgun Assembly Sequence Database (TSA) and Sequence Read Archive (SRA), have been created to aid the development and expansion of centralized repositories. Although the central resource databases are under continual development, they do not include automatic pipelines to increase annotation of newly deposited data. Therefore, third-party applications are required to achieve that aim. Here, we present an automatic workflow and web application for the annotation of transcriptome data. The workflow creates secondary data such as sequencing reads and BLAST alignments, which are available through the web application. They are based on freely available bioinformatics tools and scripts developed in-house. The interactive web application provides a search engine and several browser utilities. Graphical views of transcript alignments are available through SeqViewer, an embedded tool developed by NCBI for viewing biological sequence data. The web application is tightly integrated with other NCBI web applications and tools to extend the functionality of data processing and interconnectivity. We present a case study for the species Physalis peruviana with data generated from BioProject ID 67621. Database URL: http://www.ncbi.nlm.nih.gov/projects/physalis/ PMID:28605765
Necci, Marco; Piovesan, Damiano
2016-01-01
Abstract Intrinsic disorder (ID) in proteins has been extensively described for the last decade; a large‐scale classification of ID in proteins is mostly missing. Here, we provide an extensive analysis of ID in the protein universe on the UniProt database derived from sequence‐based predictions in MobiDB. Almost half the sequences contain an ID region of at least five residues. About 9% of proteins have a long ID region of over 20 residues which are more abundant in Eukaryotic organisms and most frequently cover less than 20% of the sequence. A small subset of about 67,000 (out of over 80 million) proteins is fully disordered and mostly found in Viruses. Most proteins have only one ID, with short ID evenly distributed along the sequence and long ID overrepresented in the center. The charged residue composition of Das and Pappu was used to classify ID proteins by structural propensities and corresponding functional enrichment. Swollen Coils seem to be used mainly as structural components and in biosynthesis in both Prokaryotes and Eukaryotes. In Bacteria, they are confined in the nucleoid and in Viruses provide DNA binding function. Coils & Hairpins seem to be specialized in ribosome binding and methylation activities. Globules & Tadpoles bind antigens in Eukaryotes but are involved in killing other organisms and cytolysis in Bacteria. The Undefined class is used by Bacteria to bind toxic substances and mediate transport and movement between and within organisms in Viruses. Fully disordered proteins behave similarly, but are enriched for glycine residues and extracellular structures. PMID:27636733
RNA-seq Data: Challenges in and Recommendations for Experimental Design and Analysis.
Williams, Alexander G; Thomas, Sean; Wyman, Stacia K; Holloway, Alisha K
2014-10-01
RNA-seq is widely used to determine differential expression of genes or transcripts as well as identify novel transcripts, identify allele-specific expression, and precisely measure translation of transcripts. Thoughtful experimental design and choice of analysis tools are critical to ensure high-quality data and interpretable results. Important considerations for experimental design include number of replicates, whether to collect paired-end or single-end reads, sequence length, and sequencing depth. Common analysis steps in all RNA-seq experiments include quality control, read alignment, assigning reads to genes or transcripts, and estimating gene or transcript abundance. Our aims are two-fold: to make recommendations for common components of experimental design and assess tool capabilities for each of these steps. We also test tools designed to detect differential expression, since this is the most widespread application of RNA-seq. We hope that these analyses will help guide those who are new to RNA-seq and will generate discussion about remaining needs for tool improvement and development. Copyright © 2014 John Wiley & Sons, Inc.
Using single nuclei for RNA-seq to capture the transcriptome of postmortem neurons
Krishnaswami, Suguna Rani; Grindberg, Rashel V; Novotny, Mark; Venepally, Pratap; Lacar, Benjamin; Bhutani, Kunal; Linker, Sara B; Pham, Son; Erwin, Jennifer A; Miller, Jeremy A; Hodge, Rebecca; McCarthy, James K; Kelder, Martin; McCorrison, Jamison; Aevermann, Brian D; Fuertes, Francisco Diez; Scheuermann, Richard H; Lee, Jun; Lein, Ed S; Schork, Nicholas; McConnell, Michael J; Gage, Fred H; Lasken, Roger S
2016-01-01
A protocol is described for sequencing the transcriptome of a cell nucleus. Nuclei are isolated from specimens and sorted by FACS, cDNA libraries are constructed and RNA-seq is performed, followed by data analysis. Some steps follow published methods (Smart-seq2 for cDNA synthesis and Nextera XT barcoded library preparation) and are not described in detail here. Previous single-cell approaches for RNA-seq from tissues include cell dissociation using protease treatment at 30 °C, which is known to alter the transcriptome. We isolate nuclei at 4 °C from tissue homogenates, which cause minimal damage. Nuclear transcriptomes can be obtained from postmortem human brain tissue stored at −80 °C, making brain archives accessible for RNA-seq from individual neurons. The method also allows investigation of biological features unique to nuclei, such as enrichment of certain transcripts and precursors of some noncoding RNAs. By following this procedure, it takes about 4 d to construct cDNA libraries that are ready for sequencing. PMID:26890679
An interactive environment for agile analysis and visualization of ChIP-sequencing data.
Lerdrup, Mads; Johansen, Jens Vilstrup; Agrawal-Singh, Shuchi; Hansen, Klaus
2016-04-01
To empower experimentalists with a means for fast and comprehensive chromatin immunoprecipitation sequencing (ChIP-seq) data analyses, we introduce an integrated computational environment, EaSeq. The software combines the exploratory power of genome browsers with an extensive set of interactive and user-friendly tools for genome-wide abstraction and visualization. It enables experimentalists to easily extract information and generate hypotheses from their own data and public genome-wide datasets. For demonstration purposes, we performed meta-analyses of public Polycomb ChIP-seq data and established a new screening approach to analyze more than 900 datasets from mouse embryonic stem cells for factors potentially associated with Polycomb recruitment. EaSeq, which is freely available and works on a standard personal computer, can substantially increase the throughput of many analysis workflows, facilitate transparency and reproducibility by automatically documenting and organizing analyses, and enable a broader group of scientists to gain insights from ChIP-seq data.
Chen, Yunshun; Lun, Aaron T L; Smyth, Gordon K
2016-01-01
In recent years, RNA sequencing (RNA-seq) has become a very widely used technology for profiling gene expression. One of the most common aims of RNA-seq profiling is to identify genes or molecular pathways that are differentially expressed (DE) between two or more biological conditions. This article demonstrates a computational workflow for the detection of DE genes and pathways from RNA-seq data by providing a complete analysis of an RNA-seq experiment profiling epithelial cell subsets in the mouse mammary gland. The workflow uses R software packages from the open-source Bioconductor project and covers all steps of the analysis pipeline, including alignment of read sequences, data exploration, differential expression analysis, visualization and pathway analysis. Read alignment and count quantification is conducted using the Rsubread package and the statistical analyses are performed using the edgeR package. The differential expression analysis uses the quasi-likelihood functionality of edgeR.
TopHat: discovering splice junctions with RNA-Seq
Trapnell, Cole; Pachter, Lior; Salzberg, Steven L.
2009-01-01
Motivation: A new protocol for sequencing the messenger RNA in a cell, known as RNA-Seq, generates millions of short sequence fragments in a single run. These fragments, or ‘reads’, can be used to measure levels of gene expression and to identify novel splice variants of genes. However, current software for aligning RNA-Seq data to a genome relies on known splice junctions and cannot identify novel ones. TopHat is an efficient read-mapping algorithm designed to align reads from an RNA-Seq experiment to a reference genome without relying on known splice sites. Results: We mapped the RNA-Seq reads from a recent mammalian RNA-Seq experiment and recovered more than 72% of the splice junctions reported by the annotation-based software from that study, along with nearly 20 000 previously unreported junctions. The TopHat pipeline is much faster than previous systems, mapping nearly 2.2 million reads per CPU hour, which is sufficient to process an entire RNA-Seq experiment in less than a day on a standard desktop computer. We describe several challenges unique to ab initio splice site discovery from RNA-Seq reads that will require further algorithm development. Availability: TopHat is free, open-source software available from http://tophat.cbcb.umd.edu Contact: cole@cs.umd.edu Supplementary information: Supplementary data are available at Bioinformatics online. PMID:19289445
A MBD-seq protocol for large-scale methylome-wide studies with (very) low amounts of DNA.
Aberg, Karolina A; Chan, Robin F; Shabalin, Andrey A; Zhao, Min; Turecki, Gustavo; Staunstrup, Nicklas Heine; Starnawska, Anna; Mors, Ole; Xie, Lin Y; van den Oord, Edwin Jcg
2017-09-01
We recently showed that, after optimization, our methyl-CpG binding domain sequencing (MBD-seq) application approximates the methylome-wide coverage obtained with whole-genome bisulfite sequencing (WGB-seq), but at a cost that enables adequately powered large-scale association studies. A prior drawback of MBD-seq is the relatively large amount of genomic DNA (ideally >1 µg) required to obtain high-quality data. Biomaterials are typically expensive to collect, provide a finite amount of DNA, and may simply not yield sufficient starting material. The ability to use low amounts of DNA will increase the breadth and number of studies that can be conducted. Therefore, we further optimized the enrichment step. With this low starting material protocol, MBD-seq performed equally well, or better, than the protocol requiring ample starting material (>1 µg). Using only 15 ng of DNA as input, there is minimal loss in data quality, achieving 93% of the coverage of WGB-seq (with standard amounts of input DNA) at similar false/positive rates. Furthermore, across a large number of genomic features, the MBD-seq methylation profiles closely tracked those observed for WGB-seq with even slightly larger effect sizes. This suggests that MBD-seq provides similar information about the methylome and classifies methylation status somewhat more accurately. Performance decreases with <15 ng DNA as starting material but, even with as little as 5 ng, MBD-seq still achieves 90% of the coverage of WGB-seq with comparable genome-wide methylation profiles. Thus, the proposed protocol is an attractive option for adequately powered and cost-effective methylome-wide investigations using (very) low amounts of DNA.
Liu, Wanting; Xiang, Lunping; Zheng, Tingkai; Jin, Jingjie; Zhang, Gong
2018-01-04
Translation is a key regulatory step, linking transcriptome and proteome. Two major methods of translatome investigations are RNC-seq (sequencing of translating mRNA) and Ribo-seq (ribosome profiling). To facilitate the investigation of translation, we built a comprehensive database TranslatomeDB (http://www.translatomedb.net/) which provides collection and integrated analysis of published and user-generated translatome sequencing data. The current version includes 2453 Ribo-seq, 10 RNC-seq and their 1394 corresponding mRNA-seq datasets in 13 species. The database emphasizes the analysis functions in addition to the dataset collections. Differential gene expression (DGE) analysis can be performed between any two datasets of same species and type, both on transcriptome and translatome levels. The translation indices translation ratios, elongation velocity index and translational efficiency can be calculated to quantitatively evaluate translational initiation efficiency and elongation velocity, respectively. All datasets were analyzed using a unified, robust, accurate and experimentally-verifiable pipeline based on the FANSe3 mapping algorithm and edgeR for DGE analyzes. TranslatomeDB also allows users to upload their own datasets and utilize the identical unified pipeline to analyze their data. We believe that our TranslatomeDB is a comprehensive platform and knowledgebase on translatome and proteome research, releasing the biologists from complex searching, analyzing and comparing huge sequencing data without needing local computational power. © The Author(s) 2017. Published by Oxford University Press on behalf of Nucleic Acids Research.
zUMIs - A fast and flexible pipeline to process RNA sequencing data with UMIs.
Parekh, Swati; Ziegenhain, Christoph; Vieth, Beate; Enard, Wolfgang; Hellmann, Ines
2018-06-01
Single-cell RNA-sequencing (scRNA-seq) experiments typically analyze hundreds or thousands of cells after amplification of the cDNA. The high throughput is made possible by the early introduction of sample-specific bar codes (BCs), and the amplification bias is alleviated by unique molecular identifiers (UMIs). Thus, the ideal analysis pipeline for scRNA-seq data needs to efficiently tabulate reads according to both BC and UMI. zUMIs is a pipeline that can handle both known and random BCs and also efficiently collapse UMIs, either just for exon mapping reads or for both exon and intron mapping reads. If BC annotation is missing, zUMIs can accurately detect intact cells from the distribution of sequencing reads. Another unique feature of zUMIs is the adaptive downsampling function that facilitates dealing with hugely varying library sizes but also allows the user to evaluate whether the library has been sequenced to saturation. To illustrate the utility of zUMIs, we analyzed a single-nucleus RNA-seq dataset and show that more than 35% of all reads map to introns. Also, we show that these intronic reads are informative about expression levels, significantly increasing the number of detected genes and improving the cluster resolution. zUMIs flexibility makes if possible to accommodate data generated with any of the major scRNA-seq protocols that use BCs and UMIs and is the most feature-rich, fast, and user-friendly pipeline to process such scRNA-seq data.
Missing data and technical variability in single-cell RNA-sequencing experiments.
Hicks, Stephanie C; Townes, F William; Teng, Mingxiang; Irizarry, Rafael A
2017-11-06
Until recently, high-throughput gene expression technology, such as RNA-Sequencing (RNA-seq) required hundreds of thousands of cells to produce reliable measurements. Recent technical advances permit genome-wide gene expression measurement at the single-cell level. Single-cell RNA-Seq (scRNA-seq) is the most widely used and numerous publications are based on data produced with this technology. However, RNA-seq and scRNA-seq data are markedly different. In particular, unlike RNA-seq, the majority of reported expression levels in scRNA-seq are zeros, which could be either biologically-driven, genes not expressing RNA at the time of measurement, or technically-driven, genes expressing RNA, but not at a sufficient level to be detected by sequencing technology. Another difference is that the proportion of genes reporting the expression level to be zero varies substantially across single cells compared to RNA-seq samples. However, it remains unclear to what extent this cell-to-cell variation is being driven by technical rather than biological variation. Furthermore, while systematic errors, including batch effects, have been widely reported as a major challenge in high-throughput technologies, these issues have received minimal attention in published studies based on scRNA-seq technology. Here, we use an assessment experiment to examine data from published studies and demonstrate that systematic errors can explain a substantial percentage of observed cell-to-cell expression variability. Specifically, we present evidence that some of these reported zeros are driven by technical variation by demonstrating that scRNA-seq produces more zeros than expected and that this bias is greater for lower expressed genes. In addition, this missing data problem is exacerbated by the fact that this technical variation varies cell-to-cell. Then, we show how this technical cell-to-cell variability can be confused with novel biological results. Finally, we demonstrate and discuss how batch-effects and confounded experiments can intensify the problem. © The Author 2017. Published by Oxford University Press. All rights reserved. For permissions, please e-mail: journals.permissions@oup.com.
Selective amplification and sequencing of cyclic phosphate-containing RNAs by the cP-RNA-seq method
Honda, Shozo; Morichika, Keisuke; Kirino, Yohei
2016-01-01
RNA digestions catalyzed by many ribonucleases generate RNA fragments containing a 2′,3′-cyclic phosphate (cP) at their 3′-termini. However, standard RNA-seq methods are unable to accurately capture cP-containing RNAs because the cP inhibits the adapter ligation reaction. We recently developed a method named “cP-RNA-seq” that is able to selectively amplify and sequence cP-containing RNAs. Here we describe the cP-RNA-seq protocol in which the 3′-termini of all RNAs, except those containing a cP, are cleaved through a periodate treatment after phosphatase treatment, hence subsequent adapter ligation and cDNA amplification steps are exclusively applied to cP-containing RNAs. cP-RNA-seq takes ~6 d, excluding the time required for sequencing and bioinformatics analyses, such downstream assays are not covered in detail in this protocol. Biochemical validation of the existence of cP in the identified RNAs takes ~3 d. Even though the cP-RNA-seq method was developed to identify angiogenin-generating 5′-tRNA halves as a proof of principle, the method should be applicable to global identification of cP-containing RNA repertoires in various transcriptomes. PMID:26866791
Wiewiórka, Marek S; Messina, Antonio; Pacholewska, Alicja; Maffioletti, Sergio; Gawrysiak, Piotr; Okoniewski, Michał J
2014-09-15
Many time-consuming analyses of next -: generation sequencing data can be addressed with modern cloud computing. The Apache Hadoop-based solutions have become popular in genomics BECAUSE OF: their scalability in a cloud infrastructure. So far, most of these tools have been used for batch data processing rather than interactive data querying. The SparkSeq software has been created to take advantage of a new MapReduce framework, Apache Spark, for next-generation sequencing data. SparkSeq is a general-purpose, flexible and easily extendable library for genomic cloud computing. It can be used to build genomic analysis pipelines in Scala and run them in an interactive way. SparkSeq opens up the possibility of customized ad hoc secondary analyses and iterative machine learning algorithms. This article demonstrates its scalability and overall fast performance by running the analyses of sequencing datasets. Tests of SparkSeq also prove that the use of cache and HDFS block size can be tuned for the optimal performance on multiple worker nodes. Available under open source Apache 2.0 license: https://bitbucket.org/mwiewiorka/sparkseq/. © The Author 2014. Published by Oxford University Press. All rights reserved. For Permissions, please e-mail: journals.permissions@oup.com.
Buschmann, Dominik; Haberberger, Anna; Kirchner, Benedikt; Spornraft, Melanie; Riedmaier, Irmgard; Schelling, Gustav; Pfaffl, Michael W.
2016-01-01
Small RNA-Seq has emerged as a powerful tool in transcriptomics, gene expression profiling and biomarker discovery. Sequencing cell-free nucleic acids, particularly microRNA (miRNA), from liquid biopsies additionally provides exciting possibilities for molecular diagnostics, and might help establish disease-specific biomarker signatures. The complexity of the small RNA-Seq workflow, however, bears challenges and biases that researchers need to be aware of in order to generate high-quality data. Rigorous standardization and extensive validation are required to guarantee reliability, reproducibility and comparability of research findings. Hypotheses based on flawed experimental conditions can be inconsistent and even misleading. Comparable to the well-established MIQE guidelines for qPCR experiments, this work aims at establishing guidelines for experimental design and pre-analytical sample processing, standardization of library preparation and sequencing reactions, as well as facilitating data analysis. We highlight bottlenecks in small RNA-Seq experiments, point out the importance of stringent quality control and validation, and provide a primer for differential expression analysis and biomarker discovery. Following our recommendations will encourage better sequencing practice, increase experimental transparency and lead to more reproducible small RNA-Seq results. This will ultimately enhance the validity of biomarker signatures, and allow reliable and robust clinical predictions. PMID:27317696
A normalization strategy for comparing tag count data
2012-01-01
Background High-throughput sequencing, such as ribonucleic acid sequencing (RNA-seq) and chromatin immunoprecipitation sequencing (ChIP-seq) analyses, enables various features of organisms to be compared through tag counts. Recent studies have demonstrated that the normalization step for RNA-seq data is critical for a more accurate subsequent analysis of differential gene expression. Development of a more robust normalization method is desirable for identifying the true difference in tag count data. Results We describe a strategy for normalizing tag count data, focusing on RNA-seq. The key concept is to remove data assigned as potential differentially expressed genes (DEGs) before calculating the normalization factor. Several R packages for identifying DEGs are currently available, and each package uses its own normalization method and gene ranking algorithm. We compared a total of eight package combinations: four R packages (edgeR, DESeq, baySeq, and NBPSeq) with their default normalization settings and with our normalization strategy. Many synthetic datasets under various scenarios were evaluated on the basis of the area under the curve (AUC) as a measure for both sensitivity and specificity. We found that packages using our strategy in the data normalization step overall performed well. This result was also observed for a real experimental dataset. Conclusion Our results showed that the elimination of potential DEGs is essential for more accurate normalization of RNA-seq data. The concept of this normalization strategy can widely be applied to other types of tag count data and to microarray data. PMID:22475125
Synthetic spike-in standards for high-throughput 16S rRNA gene amplicon sequencing.
Tourlousse, Dieter M; Yoshiike, Satowa; Ohashi, Akiko; Matsukura, Satoko; Noda, Naohiro; Sekiguchi, Yuji
2017-02-28
High-throughput sequencing of 16S rRNA gene amplicons (16S-seq) has become a widely deployed method for profiling complex microbial communities but technical pitfalls related to data reliability and quantification remain to be fully addressed. In this work, we have developed and implemented a set of synthetic 16S rRNA genes to serve as universal spike-in standards for 16S-seq experiments. The spike-ins represent full-length 16S rRNA genes containing artificial variable regions with negligible identity to known nucleotide sequences, permitting unambiguous identification of spike-in sequences in 16S-seq read data from any microbiome sample. Using defined mock communities and environmental microbiota, we characterized the performance of the spike-in standards and demonstrated their utility for evaluating data quality on a per-sample basis. Further, we showed that staggered spike-in mixtures added at the point of DNA extraction enable concurrent estimation of absolute microbial abundances suitable for comparative analysis. Results also underscored that template-specific Illumina sequencing artifacts may lead to biases in the perceived abundance of certain taxa. Taken together, the spike-in standards represent a novel bioanalytical tool that can substantially improve 16S-seq-based microbiome studies by enabling comprehensive quality control along with absolute quantification. © The Author(s) 2016. Published by Oxford University Press on behalf of Nucleic Acids Research.
Cooper, James; Ding, Yi; Song, Jiuzhou; Zhao, Keji
2017-11-01
Increased chromatin accessibility is a feature of cell-type-specific cis-regulatory elements; therefore, mapping of DNase I hypersensitive sites (DHSs) enables the detection of active regulatory elements of transcription, including promoters, enhancers, insulators and locus-control regions. Single-cell DNase sequencing (scDNase-seq) is a method of detecting genome-wide DHSs when starting with either single cells or <1,000 cells from primary cell sources. This technique enables genome-wide mapping of hypersensitive sites in a wide range of cell populations that cannot be analyzed using conventional DNase I sequencing because of the requirement for millions of starting cells. Fresh cells, formaldehyde-cross-linked cells or cells recovered from formalin-fixed paraffin-embedded (FFPE) tissue slides are suitable for scDNase-seq assays. To generate scDNase-seq libraries, cells are lysed and then digested with DNase I. Circular carrier plasmid DNA is included during subsequent DNA purification and library preparation steps to prevent loss of the small quantity of DHS DNA. Libraries are generated for high-throughput sequencing on the Illumina platform using standard methods. Preparation of scDNase-seq libraries requires only 2 d. The materials and molecular biology techniques described in this protocol should be accessible to any general molecular biology laboratory. Processing of high-throughput sequencing data requires basic bioinformatics skills and uses publicly available bioinformatics software.
Pervasive, Genome-Wide Transcription in the Organelle Genomes of Diverse Plastid-Bearing Protists.
Sanitá Lima, Matheus; Smith, David Roy
2017-11-06
Organelle genomes are among the most sequenced kinds of chromosome. This is largely because they are small and widely used in molecular studies, but also because next-generation sequencing technologies made sequencing easier, faster, and cheaper. However, studies of organelle RNA have not kept pace with those of DNA, despite huge amounts of freely available eukaryotic RNA-sequencing (RNA-seq) data. Little is known about organelle transcription in nonmodel species, and most of the available eukaryotic RNA-seq data have not been mined for organelle transcripts. Here, we use publicly available RNA-seq experiments to investigate organelle transcription in 30 diverse plastid-bearing protists with varying organelle genomic architectures. Mapping RNA-seq data to organelle genomes revealed pervasive, genome-wide transcription, regardless of the taxonomic grouping, gene organization, or noncoding content. For every species analyzed, transcripts covered ≥85% of the mitochondrial and/or plastid genomes (all of which were ≤105 kb), indicating that most of the organelle DNA-coding and noncoding-is transcriptionally active. These results follow earlier studies of model species showing that organellar transcription is coupled and ubiquitous across the genome, requiring significant downstream processing of polycistronic transcripts. Our findings suggest that noncoding organelle DNA can be transcriptionally active, raising questions about the underlying function of these transcripts and underscoring the utility of publicly available RNA-seq data for recovering complete genome sequences. If pervasive transcription is also found in bigger organelle genomes (>105 kb) and across a broader range of eukaryotes, this could indicate that noncoding organelle RNAs are regulating fundamental processes within eukaryotic cells. Copyright © 2017 Sanitá Lima and Smith.
Hu, Haijing; Wang, Cong; Li, Xia; Tang, Yunyun; Wang, Yufang; Chen, Shuanglin; Yan, Shuzhen
2018-05-08
The endophytic bacteria Bacillus cereus BCM2 has shown great potential as a defense against the parasitic nematode Meloidogyne incognita. Here, we studied the endophytic bacteria-mediated plant defense against M. incognita and searched for defense-related candidate genes using RNA-Seq. The induced systemic resistance of BCM2 against M. incognita was tested using the split-root method. Pre-inoculated BCM2 on the inducer side was associated with a dramatic reduction in galls and egg masses at the responder side, but inoculated BCM2 alone did not produce the same effect. In order to investigate which plant defense-related genes are specifically activated by BCM2, four RNA samples from tomato roots were sequenced, and four high quality total clean bases were obtained, ranging from 6.64 to 6.75 Gb, with an average of 21558 total genes. The 34 candidate defense-related genes were identified by pair-wise comparison among libraries, representing the targets for BCM2 priming resistance against M. incognita. Functional characterization revealed that the plant-pathogen interaction pathway (ID: ko04626) was significantly enriched for BCM2-mediated M. incognita resistance. This study demonstrates that B. cereus BCM2 maintains a harmonious host-microbe relationship with tomato, but appeared to prime the plant, resulting in more vigorous defense response toward the infection nematode. This article is protected by copyright. All rights reserved.
ALEA: a toolbox for allele-specific epigenomics analysis.
Younesy, Hamid; Möller, Torsten; Heravi-Moussavi, Alireza; Cheng, Jeffrey B; Costello, Joseph F; Lorincz, Matthew C; Karimi, Mohammad M; Jones, Steven J M
2014-04-15
The assessment of expression and epigenomic status using sequencing based methods provides an unprecedented opportunity to identify and correlate allelic differences with epigenomic status. We present ALEA, a computational toolbox for allele-specific epigenomics analysis, which incorporates allelic variation data within existing resources, allowing for the identification of significant associations between epigenetic modifications and specific allelic variants in human and mouse cells. ALEA provides a customizable pipeline of command line tools for allele-specific analysis of next-generation sequencing data (ChIP-seq, RNA-seq, etc.) that takes the raw sequencing data and produces separate allelic tracks ready to be viewed on genome browsers. The pipeline has been validated using human and hybrid mouse ChIP-seq and RNA-seq data. The package, test data and usage instructions are available online at http://www.bcgsc.ca/platform/bioinfo/software/alea CONTACT: : mkarimi1@interchange.ubc.ca or sjones@bcgsc.ca Supplementary information: Supplementary data are available at Bioinformatics online. © The Author 2013. Published by Oxford University Press. All rights reserved. For Permissions, please e-mail: journals.permissions@oup.com.
Integrating RNA sequencing into neuro-oncology practice.
Rogawski, David S; Vitanza, Nicholas A; Gauthier, Angela C; Ramaswamy, Vijay; Koschmann, Carl
2017-11-01
Malignant tumors of the central nervous system (CNS) cause substantial morbidity and mortality, yet efforts to optimize chemo- and radiotherapy have largely failed to improve dismal prognoses. Over the past decade, RNA sequencing (RNA-seq) has emerged as a powerful tool to comprehensively characterize the transcriptome of CNS tumor cells in one high-throughput step, leading to improved understanding of CNS tumor biology and suggesting new routes for targeted therapies. RNA-seq has been instrumental in improving the diagnostic classification of brain tumors, characterizing oncogenic fusion genes, and shedding light on intratumor heterogeneity. Currently, RNA-seq is beginning to be incorporated into regular neuro-oncology practice in the form of precision neuro-oncology programs, which use information from tumor sequencing to guide implementation of personalized targeted therapies. These programs show great promise in improving patient outcomes for tumors where single agent trials have been ineffective. As RNA-seq is a relatively new technique, many further applications yielding new advances in CNS tumor research and management are expected in the coming years. Copyright © 2017 Elsevier Inc. All rights reserved.
Han, Kook; Tjaden, Brian; Lory, Stephen
2016-12-22
The first step in the post-transcriptional regulatory function of most bacterial small non-coding RNAs (sRNAs) is base pairing with partially complementary sequences of targeted transcripts. We present a simple method for identifying sRNA targets in vivo and defining processing sites of the regulated transcripts. The technique, referred to as global small non-coding RNA target identification by ligation and sequencing (GRIL-seq), is based on preferential ligation of sRNAs to the ends of base-paired targets in bacteria co-expressing T4 RNA ligase, followed by sequencing to identify the chimaeras. In addition to the RNA chaperone Hfq, the GRIL-seq method depends on the activity of the pyrophosphorylase RppH. Using PrrF1, an iron-regulated sRNA in Pseudomonas aeruginosa, we demonstrated that direct regulatory targets of this sRNA can readily be identified. Therefore, GRIL-seq represents a powerful tool not only for identifying direct targets of sRNAs in a variety of environments, but also for uncovering novel roles for sRNAs and their targets in complex regulatory networks.
Limitations and possibilities of low cell number ChIP-seq
2012-01-01
Background Chromatin immunoprecipitation coupled with high-throughput DNA sequencing (ChIP-seq) offers high resolution, genome-wide analysis of DNA-protein interactions. However, current standard methods require abundant starting material in the range of 1–20 million cells per immunoprecipitation, and remain a bottleneck to the acquisition of biologically relevant epigenetic data. Using a ChIP-seq protocol optimised for low cell numbers (down to 100,000 cells / IP), we examined the performance of the ChIP-seq technique on a series of decreasing cell numbers. Results We present an enhanced native ChIP-seq method tailored to low cell numbers that represents a 200-fold reduction in input requirements over existing protocols. The protocol was tested over a range of starting cell numbers covering three orders of magnitude, enabling determination of the lower limit of the technique. At low input cell numbers, increased levels of unmapped and duplicate reads reduce the number of unique reads generated, and can drive up sequencing costs and affect sensitivity if ChIP is attempted from too few cells. Conclusions The optimised method presented here considerably reduces the input requirements for performing native ChIP-seq. It extends the applicability of the technique to isolated primary cells and rare cell populations (e.g. biobank samples, stem cells), and in many cases will alleviate the need for cell culture and any associated alteration of epigenetic marks. However, this study highlights a challenge inherent to ChIP-seq from low cell numbers: as cell input numbers fall, levels of unmapped sequence reads and PCR-generated duplicate reads rise. We discuss a number of solutions to overcome the effects of reducing cell number that may aid further improvements to ChIP performance. PMID:23171294
Cost analysis of whole genome sequencing in German clinical practice.
Plöthner, Marika; Frank, Martin; von der Schulenburg, J-Matthias Graf
2017-06-01
Whole genome sequencing (WGS) is an emerging tool in clinical diagnostics. However, little has been said about its procedure costs, owing to a dearth of related cost studies. This study helps fill this research gap by analyzing the execution costs of WGS within the setting of German clinical practice. First, to estimate costs, a sequencing process related to clinical practice was undertaken. Once relevant resources were identified, a quantification and monetary evaluation was conducted using data and information from expert interviews with clinical geneticists, and personnel at private enterprises and hospitals. This study focuses on identifying the costs associated with the standard sequencing process, and the procedure costs for a single WGS were analyzed on the basis of two sequencing platforms-namely, HiSeq 2500 and HiSeq Xten, both by Illumina, Inc. In addition, sensitivity analyses were performed to assess the influence of various uses of sequencing platforms and various coverage values on a fixed-cost degression. In the base case scenario-which features 80 % utilization and 30-times coverage-the cost of a single WGS analysis with the HiSeq 2500 was estimated at €3858.06. The cost of sequencing materials was estimated at €2848.08; related personnel costs of €396.94 and acquisition/maintenance costs (€607.39) were also found. In comparison, the cost of sequencing that uses the latest technology (i.e., HiSeq Xten) was approximately 63 % cheaper, at €1411.20. The estimated costs of WGS currently exceed the prediction of a 'US$1000 per genome', by more than a factor of 3.8. In particular, the material costs in themselves exceed this predicted cost.
Assessment of the cPAS-based BGISEQ-500 platform for metagenomic sequencing.
Fang, Chao; Zhong, Huanzi; Lin, Yuxiang; Chen, Bing; Han, Mo; Ren, Huahui; Lu, Haorong; Luber, Jacob M; Xia, Min; Li, Wangsheng; Stein, Shayna; Xu, Xun; Zhang, Wenwei; Drmanac, Radoje; Wang, Jian; Yang, Huanming; Hammarström, Lennart; Kostic, Aleksandar D; Kristiansen, Karsten; Li, Junhua
2018-03-01
More extensive use of metagenomic shotgun sequencing in microbiome research relies on the development of high-throughput, cost-effective sequencing. Here we present a comprehensive evaluation of the performance of the new high-throughput sequencing platform BGISEQ-500 for metagenomic shotgun sequencing and compare its performance with that of 2 Illumina platforms. Using fecal samples from 20 healthy individuals, we evaluated the intra-platform reproducibility for metagenomic sequencing on the BGISEQ-500 platform in a setup comprising 8 library replicates and 8 sequencing replicates. Cross-platform consistency was evaluated by comparing 20 pairwise replicates on the BGISEQ-500 platform vs the Illumina HiSeq 2000 platform and the Illumina HiSeq 4000 platform. In addition, we compared the performance of the 2 Illumina platforms against each other. By a newly developed overall accuracy quality control method, an average of 82.45 million high-quality reads (96.06% of raw reads) per sample, with 90.56% of bases scoring Q30 and above, was obtained using the BGISEQ-500 platform. Quantitative analyses revealed extremely high reproducibility between BGISEQ-500 intra-platform replicates. Cross-platform replicates differed slightly more than intra-platform replicates, yet a high consistency was observed. Only a low percentage (2.02%-3.25%) of genes exhibited significant differences in relative abundance comparing the BGISEQ-500 and HiSeq platforms, with a bias toward genes with higher GC content being enriched on the HiSeq platforms. Our study provides the first set of performance metrics for human gut metagenomic sequencing data using BGISEQ-500. The high accuracy and technical reproducibility confirm the applicability of the new platform for metagenomic studies, though caution is still warranted when combining metagenomic data from different platforms.
Key gene regulating cell wall biosynthesis and recalcitrance in Populus, gene Y
Chen, Jay; Engle, Nancy; Gunter, Lee E.; Jawdy, Sara; Tschaplinski, Timothy J.; Tuskan, Gerald A.
2015-12-08
This disclosure provides methods and transgenic plants for improved production of renewable biofuels and other plant-derived biomaterials by altering the expression and/or activity of Gene Y, an O-acetyltransferase. This disclosure also provides expression vectors containing a nucleic acid (Gene Y) which encodes the polypeptide of SEQ ID NO: 1 and is operably linked to a heterologous promoter.
Software for rapid time dependent ChIP-sequencing analysis (TDCA).
Myschyshyn, Mike; Farren-Dai, Marco; Chuang, Tien-Jui; Vocadlo, David
2017-11-25
Chromatin immunoprecipitation followed by DNA sequencing (ChIP-seq) and associated methods are widely used to define the genome wide distribution of chromatin associated proteins, post-translational epigenetic marks, and modifications found on DNA bases. An area of emerging interest is to study time dependent changes in the distribution of such proteins and marks by using serial ChIP-seq experiments performed in a time resolved manner. Despite such time resolved studies becoming increasingly common, software to facilitate analysis of such data in a robust automated manner is limited. We have designed software called Time-Dependent ChIP-Sequencing Analyser (TDCA), which is the first program to automate analysis of time-dependent ChIP-seq data by fitting to sigmoidal curves. We provide users with guidance for experimental design of TDCA for modeling of time course (TC) ChIP-seq data using two simulated data sets. Furthermore, we demonstrate that this fitting strategy is widely applicable by showing that automated analysis of three previously published TC data sets accurately recapitulates key findings reported in these studies. Using each of these data sets, we highlight how biologically relevant findings can be readily obtained by exploiting TDCA to yield intuitive parameters that describe behavior at either a single locus or sets of loci. TDCA enables customizable analysis of user input aligned DNA sequencing data, coupled with graphical outputs in the form of publication-ready figures that describe behavior at either individual loci or sets of loci sharing common traits defined by the user. TDCA accepts sequencing data as standard binary alignment map (BAM) files and loci of interest in browser extensible data (BED) file format. TDCA accurately models the number of sequencing reads, or coverage, at loci from TC ChIP-seq studies or conceptually related TC sequencing experiments. TC experiments are reduced to intuitive parametric values that facilitate biologically relevant data analysis, and the uncovering of variations in the time-dependent behavior of chromatin. TDCA automates the analysis of TC ChIP-seq experiments, permitting researchers to easily obtain raw and modeled data for specific loci or groups of loci with similar behavior while also enhancing consistency of data analysis of TC data within the genomics field.
Combining multiple ChIP-seq peak detection systems using combinatorial fusion.
Schweikert, Christina; Brown, Stuart; Tang, Zuojian; Smith, Phillip R; Hsu, D Frank
2012-01-01
Due to the recent rapid development in ChIP-seq technologies, which uses high-throughput next-generation DNA sequencing to identify the targets of Chromatin Immunoprecipitation, there is an increasing amount of sequencing data being generated that provides us with greater opportunity to analyze genome-wide protein-DNA interactions. In particular, we are interested in evaluating and enhancing computational and statistical techniques for locating protein binding sites. Many peak detection systems have been developed; in this study, we utilize the following six: CisGenome, MACS, PeakSeq, QuEST, SISSRs, and TRLocator. We define two methods to merge and rescore the regions of two peak detection systems and analyze the performance based on average precision and coverage of transcription start sites. The results indicate that ChIP-seq peak detection can be improved by fusion using score or rank combination. Our method of combination and fusion analysis would provide a means for generic assessment of available technologies and systems and assist researchers in choosing an appropriate system (or fusion method) for analyzing ChIP-seq data. This analysis offers an alternate approach for increasing true positive rates, while decreasing false positive rates and hence improving the ChIP-seq peak identification process.
Wei, Yulong; Silke, Jordan R; Xia, Xuhua
2017-12-15
Bacterial translation initiation is influenced by base pairing between the Shine-Dalgarno (SD) sequence in the 5' UTR of mRNA and the anti-SD (aSD) sequence at the free 3' end of the 16S rRNA (3' TAIL) due to: 1) the SD/aSD sequence binding location and 2) SD/aSD binding affinity. In order to understand what makes an SD/aSD interaction optimal, we must define: 1) terminus of the 3' TAIL and 2) extent of the core aSD sequence within the 3' TAIL. Our approach to characterize these components in Escherichia coli and Bacillus subtilis involves 1) mapping the 3' boundary of the mature 16S rRNA using high-throughput RNA sequencing (RNA-Seq), and 2) identifying the segment within the 3' TAIL that is strongly preferred in SD/aSD pairing. Using RNA-Seq data, we resolve previous discrepancies in the reported 3' TAIL in B. subtilis and recovered the established 3' TAIL in E. coli. Furthermore, we extend previous studies to suggest that both highly and lowly expressed genes favor SD sequences with intermediate binding affinity, but this trend is exclusive to SD sequences that complement the core aSD sequences defined herein.
Zhao, Shanrong; Xi, Li; Quan, Jie; Xi, Hualin; Zhang, Ying; von Schack, David; Vincent, Michael; Zhang, Baohong
2016-01-08
RNA sequencing (RNA-seq), a next-generation sequencing technique for transcriptome profiling, is being increasingly used, in part driven by the decreasing cost of sequencing. Nevertheless, the analysis of the massive amounts of data generated by large-scale RNA-seq remains a challenge. Multiple algorithms pertinent to basic analyses have been developed, and there is an increasing need to automate the use of these tools so as to obtain results in an efficient and user friendly manner. Increased automation and improved visualization of the results will help make the results and findings of the analyses readily available to experimental scientists. By combing the best open source tools developed for RNA-seq data analyses and the most advanced web 2.0 technologies, we have implemented QuickRNASeq, a pipeline for large-scale RNA-seq data analyses and visualization. The QuickRNASeq workflow consists of three main steps. In Step #1, each individual sample is processed, including mapping RNA-seq reads to a reference genome, counting the numbers of mapped reads, quality control of the aligned reads, and SNP (single nucleotide polymorphism) calling. Step #1 is computationally intensive, and can be processed in parallel. In Step #2, the results from individual samples are merged, and an integrated and interactive project report is generated. All analyses results in the report are accessible via a single HTML entry webpage. Step #3 is the data interpretation and presentation step. The rich visualization features implemented here allow end users to interactively explore the results of RNA-seq data analyses, and to gain more insights into RNA-seq datasets. In addition, we used a real world dataset to demonstrate the simplicity and efficiency of QuickRNASeq in RNA-seq data analyses and interactive visualizations. The seamless integration of automated capabilites with interactive visualizations in QuickRNASeq is not available in other published RNA-seq pipelines. The high degree of automation and interactivity in QuickRNASeq leads to a substantial reduction in the time and effort required prior to further downstream analyses and interpretation of the analyses findings. QuickRNASeq advances primary RNA-seq data analyses to the next level of automation, and is mature for public release and adoption.
Sokol, Martin; Jessen, Karen Margrethe; Pedersen, Finn Skou
2016-01-01
Several studies have shown that human endogenous retroviruses and endogenous retrovirus-like repeats (here collectively HERVs) impose direct regulation on human genes through enhancer and promoter motifs present in their long terminal repeats (LTRs). Although chimeric transcription in which novel gene isoforms containing retroviral and human sequence are transcribed from viral promoters are commonly associated with disease, regulation by HERVs is beneficial in other settings; for example, in human testis chimeric isoforms of TP63 induced by an ERV9 LTR protect the male germ line upon DNA damage by inducing apoptosis, whereas in the human globin locus the γ- and β-globin switch during normal hematopoiesis is mediated by complex interactions of an ERV9 LTR and surrounding human sequence. The advent of deep sequencing or next-generation sequencing (NGS) has revolutionized the way researchers solve important scientific questions and develop novel hypotheses in relation to human genome regulation. We recently applied next-generation paired-end RNA-sequencing (RNA-seq) together with chromatin immunoprecipitation with sequencing (ChIP-seq) to examine ERV9 chimeric transcription in human reference cell lines from Encyclopedia of DNA Elements (ENCODE). This led to the discovery of advanced regulation mechanisms by ERV9s and other HERVs across numerous human loci including transcription of large gene-unannotated genomic regions, as well as cooperative regulation by multiple HERVs and non-LTR repeats such as Alu elements. In this article, well-established examples of human gene regulation by HERVs are reviewed followed by a description of paired-end RNA-seq, and its application in identifying chimeric transcription genome-widely. Based on integrative analyses of RNA-seq and ChIP-seq, data we then present novel examples of regulation by ERV9s of tumor suppressor genes CADM2 and SEMA3A, as well as transcription of an unannotated region. Taken together, this article highlights the high suitability of contemporary sequencing methods in future analyses of human biology in relation to evolutionary acquired retroviruses in the human genome. © 2016 APMIS. Published by John Wiley & Sons Ltd.
Devailly, Guillaume; Mantsoki, Anna; Joshi, Anagha
2016-11-01
Better protocols and decreasing costs have made high-throughput sequencing experiments now accessible even to small experimental laboratories. However, comparing one or few experiments generated by an individual lab to the vast amount of relevant data freely available in the public domain might be limited due to lack of bioinformatics expertise. Though several tools, including genome browsers, allow such comparison at a single gene level, they do not provide a genome-wide view. We developed Heat*seq, a web-tool that allows genome scale comparison of high throughput experiments chromatin immuno-precipitation followed by sequencing, RNA-sequencing and Cap Analysis of Gene Expression) provided by a user, to the data in the public domain. Heat*seq currently contains over 12 000 experiments across diverse tissues and cell types in human, mouse and drosophila. Heat*seq displays interactive correlation heatmaps, with an ability to dynamically subset datasets to contextualize user experiments. High quality figures and tables are produced and can be downloaded in multiple formats. Web application: http://www.heatstarseq.roslin.ed.ac.uk/ Source code: https://github.com/gdevailly CONTACT: Guillaume.Devailly@roslin.ed.ac.uk or Anagha.Joshi@roslin.ed.ac.ukSupplementary information: Supplementary data are available at Bioinformatics online. © The Author 2016. Published by Oxford University Press.
Munger, Steven C.; Raghupathy, Narayanan; Choi, Kwangbom; Simons, Allen K.; Gatti, Daniel M.; Hinerfeld, Douglas A.; Svenson, Karen L.; Keller, Mark P.; Attie, Alan D.; Hibbs, Matthew A.; Graber, Joel H.; Chesler, Elissa J.; Churchill, Gary A.
2014-01-01
Massively parallel RNA sequencing (RNA-seq) has yielded a wealth of new insights into transcriptional regulation. A first step in the analysis of RNA-seq data is the alignment of short sequence reads to a common reference genome or transcriptome. Genetic variants that distinguish individual genomes from the reference sequence can cause reads to be misaligned, resulting in biased estimates of transcript abundance. Fine-tuning of read alignment algorithms does not correct this problem. We have developed Seqnature software to construct individualized diploid genomes and transcriptomes for multiparent populations and have implemented a complete analysis pipeline that incorporates other existing software tools. We demonstrate in simulated and real data sets that alignment to individualized transcriptomes increases read mapping accuracy, improves estimation of transcript abundance, and enables the direct estimation of allele-specific expression. Moreover, when applied to expression QTL mapping we find that our individualized alignment strategy corrects false-positive linkage signals and unmasks hidden associations. We recommend the use of individualized diploid genomes over reference sequence alignment for all applications of high-throughput sequencing technology in genetically diverse populations. PMID:25236449
Kuhn, Jens H.; Andersen, Kristian G.; Bào, Yīmíng; Bavari, Sina; Becker, Stephan; Bennett, Richard S.; Bergman, Nicholas H.; Blinkova, Olga; Bradfute, Steven; Brister, J. Rodney; Bukreyev, Alexander; Chandran, Kartik; Chepurnov, Alexander A.; Davey, Robert A.; Dietzgen, Ralf G.; Doggett, Norman A.; Dolnik, Olga; Dye, John M.; Enterlein, Sven; Fenimore, Paul W.; Formenty, Pierre; Freiberg, Alexander N.; Garry, Robert F.; Garza, Nicole L.; Gire, Stephen K.; Gonzalez, Jean-Paul; Griffiths, Anthony; Happi, Christian T.; Hensley, Lisa E.; Herbert, Andrew S.; Hevey, Michael C.; Hoenen, Thomas; Honko, Anna N.; Ignatyev, Georgy M.; Jahrling, Peter B.; Johnson, Joshua C.; Johnson, Karl M.; Kindrachuk, Jason; Klenk, Hans-Dieter; Kobinger, Gary; Kochel, Tadeusz J.; Lackemeyer, Matthew G.; Lackner, Daniel F.; Leroy, Eric M.; Lever, Mark S.; Mühlberger, Elke; Netesov, Sergey V.; Olinger, Gene G.; Omilabu, Sunday A.; Palacios, Gustavo; Panchal, Rekha G.; Park, Daniel J.; Patterson, Jean L.; Paweska, Janusz T.; Peters, Clarence J.; Pettitt, James; Pitt, Louise; Radoshitzky, Sheli R.; Ryabchikova, Elena I.; Saphire, Erica Ollmann; Sabeti, Pardis C.; Sealfon, Rachel; Shestopalov, Aleksandr M.; Smither, Sophie J.; Sullivan, Nancy J.; Swanepoel, Robert; Takada, Ayato; Towner, Jonathan S.; van der Groen, Guido; Volchkov, Viktor E.; Volchkova, Valentina A.; Wahl-Jensen, Victoria; Warren, Travis K.; Warfield, Kelly L.; Weidmann, Manfred; Nichol, Stuart T.
2014-01-01
Sequence determination of complete or coding-complete genomes of viruses is becoming common practice for supporting the work of epidemiologists, ecologists, virologists, and taxonomists. Sequencing duration and costs are rapidly decreasing, sequencing hardware is under modification for use by non-experts, and software is constantly being improved to simplify sequence data management and analysis. Thus, analysis of virus disease outbreaks on the molecular level is now feasible, including characterization of the evolution of individual virus populations in single patients over time. The increasing accumulation of sequencing data creates a management problem for the curators of commonly used sequence databases and an entry retrieval problem for end users. Therefore, utilizing the data to their fullest potential will require setting nomenclature and annotation standards for virus isolates and associated genomic sequences. The National Center for Biotechnology Information’s (NCBI’s) RefSeq is a non-redundant, curated database for reference (or type) nucleotide sequence records that supplies source data to numerous other databases. Building on recently proposed templates for filovirus variant naming [
HSA: a heuristic splice alignment tool.
Bu, Jingde; Chi, Xuebin; Jin, Zhong
2013-01-01
RNA-Seq methodology is a revolutionary transcriptomics sequencing technology, which is the representative of Next generation Sequencing (NGS). With the high throughput sequencing of RNA-Seq, we can acquire much more information like differential expression and novel splice variants from deep sequence analysis and data mining. But the short read length brings a great challenge to alignment, especially when the reads span two or more exons. A two steps heuristic splice alignment tool is generated in this investigation. First, map raw reads to reference with unspliced aligner--BWA; second, split initial unmapped reads into three equal short reads (seeds), align each seed to the reference, filter hits, search possible split position of read and extend hits to a complete match. Compare with other splice alignment tools like SOAPsplice and Tophat2, HSA has a better performance in call rate and efficiency, but its results do not as accurate as the other software to some extent. HSA is an effective spliced aligner of RNA-Seq reads mapping, which is available at https://github.com/vlcc/HSA.
SeqDepot: streamlined database of biological sequences and precomputed features.
Ulrich, Luke E; Zhulin, Igor B
2014-01-15
Assembling and/or producing integrated knowledge of sequence features continues to be an onerous and redundant task despite a large number of existing resources. We have developed SeqDepot-a novel database that focuses solely on two primary goals: (i) assimilating known primary sequences with predicted feature data and (ii) providing the most simple and straightforward means to procure and readily use this information. Access to >28.5 million sequences and 300 million features is provided through a well-documented and flexible RESTful interface that supports fetching specific data subsets, bulk queries, visualization and searching by MD5 digests or external database identifiers. We have also developed an HTML5/JavaScript web application exemplifying how to interact with SeqDepot and Perl/Python scripts for use with local processing pipelines. Freely available on the web at http://seqdepot.net/. RESTaccess via http://seqdepot.net/api/v1. Database files and scripts maybe downloaded from http://seqdepot.net/download.
Ajawatanawong, Pravech; Atkinson, Gemma C; Watson-Haigh, Nathan S; Mackenzie, Bryony; Baldauf, Sandra L
2012-07-01
Analyses of multiple sequence alignments generally focus on well-defined conserved sequence blocks, while the rest of the alignment is largely ignored or discarded. This is especially true in phylogenomics, where large multigene datasets are produced through automated pipelines. However, some of the most powerful phylogenetic markers have been found in the variable length regions of multiple alignments, particularly insertions/deletions (indels) in protein sequences. We have developed Sequence Feature and Indel Region Extractor (SeqFIRE) to enable the automated identification and extraction of indels from protein sequence alignments. The program can also extract conserved blocks and identify fast evolving sites using a combination of conservation and entropy. All major variables can be adjusted by the user, allowing them to identify the sets of variables most suited to a particular analysis or dataset. Thus, all major tasks in preparing an alignment for further analysis are combined in a single flexible and user-friendly program. The output includes a numbered list of indels, alignments in NEXUS format with indels annotated or removed and indel-only matrices. SeqFIRE is a user-friendly web application, freely available online at www.seqfire.org/.
Recent advances in ChIP-seq analysis: from quality management to whole-genome annotation.
Nakato, Ryuichiro; Shirahige, Katsuhiko
2017-03-01
Chromatin immunoprecipitation followed by sequencing (ChIP-seq) analysis can detect protein/DNA-binding and histone-modification sites across an entire genome. Recent advances in sequencing technologies and analyses enable us to compare hundreds of samples simultaneously; such large-scale analysis has potential to reveal the high-dimensional interrelationship level for regulatory elements and annotate novel functional genomic regions de novo. Because many experimental considerations are relevant to the choice of a method in a ChIP-seq analysis, the overall design and quality management of the experiment are of critical importance. This review offers guiding principles of computation and sample preparation for ChIP-seq analyses, highlighting the validity and limitations of the state-of-the-art procedures at each step. We also discuss the latest challenges of single-cell analysis that will encourage a new era in this field. © The Author 2016. Published by Oxford University Press.
Lott, Steffen C; Wolfien, Markus; Riege, Konstantin; Bagnacani, Andrea; Wolkenhauer, Olaf; Hoffmann, Steve; Hess, Wolfgang R
2017-11-10
RNA-Sequencing (RNA-Seq) has become a widely used approach to study quantitative and qualitative aspects of transcriptome data. The variety of RNA-Seq protocols, experimental study designs and the characteristic properties of the organisms under investigation greatly affect downstream and comparative analyses. In this review, we aim to explain the impact of structured pre-selection, classification and integration of best-performing tools within modularized data analysis workflows and ready-to-use computing infrastructures towards experimental data analyses. We highlight examples for workflows and use cases that are presented for pro-, eukaryotic and mixed dual RNA-Seq (meta-transcriptomics) experiments. In addition, we are summarizing the expertise of the laboratories participating in the project consortium "Structured Analysis and Integration of RNA-Seq experiments" (de.STAIR) and its integration with the Galaxy-workbench of the RNA Bioinformatics Center (RBC). Copyright © 2017 The Authors. Published by Elsevier B.V. All rights reserved.
Batch effects in single-cell RNA-sequencing data are corrected by matching mutual nearest neighbors.
Haghverdi, Laleh; Lun, Aaron T L; Morgan, Michael D; Marioni, John C
2018-06-01
Large-scale single-cell RNA sequencing (scRNA-seq) data sets that are produced in different laboratories and at different times contain batch effects that may compromise the integration and interpretation of the data. Existing scRNA-seq analysis methods incorrectly assume that the composition of cell populations is either known or identical across batches. We present a strategy for batch correction based on the detection of mutual nearest neighbors (MNNs) in the high-dimensional expression space. Our approach does not rely on predefined or equal population compositions across batches; instead, it requires only that a subset of the population be shared between batches. We demonstrate the superiority of our approach compared with existing methods by using both simulated and real scRNA-seq data sets. Using multiple droplet-based scRNA-seq data sets, we demonstrate that our MNN batch-effect-correction method can be scaled to large numbers of cells.
A ChIP-Seq Data Analysis Pipeline Based on Bioconductor Packages.
Park, Seung-Jin; Kim, Jong-Hwan; Yoon, Byung-Ha; Kim, Seon-Young
2017-03-01
Nowadays, huge volumes of chromatin immunoprecipitation-sequencing (ChIP-Seq) data are generated to increase the knowledge on DNA-protein interactions in the cell, and accordingly, many tools have been developed for ChIP-Seq analysis. Here, we provide an example of a streamlined workflow for ChIP-Seq data analysis composed of only four packages in Bioconductor: dada2, QuasR, mosaics, and ChIPseeker. 'dada2' performs trimming of the high-throughput sequencing data. 'QuasR' and 'mosaics' perform quality control and mapping of the input reads to the reference genome and peak calling, respectively. Finally, 'ChIPseeker' performs annotation and visualization of the called peaks. This workflow runs well independently of operating systems (e.g., Windows, Mac, or Linux) and processes the input fastq files into various results in one run. R code is available at github: https://github.com/ddhb/Workflow_of_Chipseq.git.
A ChIP-Seq Data Analysis Pipeline Based on Bioconductor Packages
Park, Seung-Jin; Kim, Jong-Hwan; Yoon, Byung-Ha; Kim, Seon-Young
2017-01-01
Nowadays, huge volumes of chromatin immunoprecipitation-sequencing (ChIP-Seq) data are generated to increase the knowledge on DNA-protein interactions in the cell, and accordingly, many tools have been developed for ChIP-Seq analysis. Here, we provide an example of a streamlined workflow for ChIP-Seq data analysis composed of only four packages in Bioconductor: dada2, QuasR, mosaics, and ChIPseeker. ‘dada2’ performs trimming of the high-throughput sequencing data. ‘QuasR’ and ‘mosaics’ perform quality control and mapping of the input reads to the reference genome and peak calling, respectively. Finally, ‘ChIPseeker’ performs annotation and visualization of the called peaks. This workflow runs well independently of operating systems (e.g., Windows, Mac, or Linux) and processes the input fastq files into various results in one run. R code is available at github: https://github.com/ddhb/Workflow_of_Chipseq.git. PMID:28416945
RnaSeqSampleSize: real data based sample size estimation for RNA sequencing.
Zhao, Shilin; Li, Chung-I; Guo, Yan; Sheng, Quanhu; Shyr, Yu
2018-05-30
One of the most important and often neglected components of a successful RNA sequencing (RNA-Seq) experiment is sample size estimation. A few negative binomial model-based methods have been developed to estimate sample size based on the parameters of a single gene. However, thousands of genes are quantified and tested for differential expression simultaneously in RNA-Seq experiments. Thus, additional issues should be carefully addressed, including the false discovery rate for multiple statistic tests, widely distributed read counts and dispersions for different genes. To solve these issues, we developed a sample size and power estimation method named RnaSeqSampleSize, based on the distributions of gene average read counts and dispersions estimated from real RNA-seq data. Datasets from previous, similar experiments such as the Cancer Genome Atlas (TCGA) can be used as a point of reference. Read counts and their dispersions were estimated from the reference's distribution; using that information, we estimated and summarized the power and sample size. RnaSeqSampleSize is implemented in R language and can be installed from Bioconductor website. A user friendly web graphic interface is provided at http://cqs.mc.vanderbilt.edu/shiny/RnaSeqSampleSize/ . RnaSeqSampleSize provides a convenient and powerful way for power and sample size estimation for an RNAseq experiment. It is also equipped with several unique features, including estimation for interested genes or pathway, power curve visualization, and parameter optimization.
Quantum-Sequencing: Fast electronic single DNA molecule sequencing
NASA Astrophysics Data System (ADS)
Casamada Ribot, Josep; Chatterjee, Anushree; Nagpal, Prashant
2014-03-01
A major goal of third-generation sequencing technologies is to develop a fast, reliable, enzyme-free, high-throughput and cost-effective, single-molecule sequencing method. Here, we present the first demonstration of unique ``electronic fingerprint'' of all nucleotides (A, G, T, C), with single-molecule DNA sequencing, using Quantum-tunneling Sequencing (Q-Seq) at room temperature. We show that the electronic state of the nucleobases shift depending on the pH, with most distinct states identified at acidic pH. We also demonstrate identification of single nucleotide modifications (methylation here). Using these unique electronic fingerprints (or tunneling data), we report a partial sequence of beta lactamase (bla) gene, which encodes resistance to beta-lactam antibiotics, with over 95% success rate. These results highlight the potential of Q-Seq as a robust technique for next-generation sequencing.
RefSeq microbial genomes database: new representation and annotation strategy.
Tatusova, Tatiana; Ciufo, Stacy; Fedorov, Boris; O'Neill, Kathleen; Tolstoy, Igor
2014-01-01
The source of the microbial genomic sequences in the RefSeq collection is the set of primary sequence records submitted to the International Nucleotide Sequence Database public archives. These can be accessed through the Entrez search and retrieval system at http://www.ncbi.nlm.nih.gov/genome. Next-generation sequencing has enabled researchers to perform genomic sequencing at rates that were unimaginable in the past. Microbial genomes can now be sequenced in a matter of hours, which has led to a significant increase in the number of assembled genomes deposited in the public archives. This huge increase in DNA sequence data presents new challenges for the annotation, analysis and visualization bioinformatics tools. New strategies have been developed for the annotation and representation of reference genomes and sequence variations derived from population studies and clinical outbreaks.
Coverage Bias and Sensitivity of Variant Calling for Four Whole-genome Sequencing Technologies
Lasitschka, Bärbel; Jones, David; Northcott, Paul; Hutter, Barbara; Jäger, Natalie; Kool, Marcel; Taylor, Michael; Lichter, Peter; Pfister, Stefan; Wolf, Stephan; Brors, Benedikt; Eils, Roland
2013-01-01
The emergence of high-throughput, next-generation sequencing technologies has dramatically altered the way we assess genomes in population genetics and in cancer genomics. Currently, there are four commonly used whole-genome sequencing platforms on the market: Illumina’s HiSeq2000, Life Technologies’ SOLiD 4 and its completely redesigned 5500xl SOLiD, and Complete Genomics’ technology. A number of earlier studies have compared a subset of those sequencing platforms or compared those platforms with Sanger sequencing, which is prohibitively expensive for whole genome studies. Here we present a detailed comparison of the performance of all currently available whole genome sequencing platforms, especially regarding their ability to call SNVs and to evenly cover the genome and specific genomic regions. Unlike earlier studies, we base our comparison on four different samples, allowing us to assess the between-sample variation of the platforms. We find a pronounced GC bias in GC-rich regions for Life Technologies’ platforms, with Complete Genomics performing best here, while we see the least bias in GC-poor regions for HiSeq2000 and 5500xl. HiSeq2000 gives the most uniform coverage and displays the least sample-to-sample variation. In contrast, Complete Genomics exhibits by far the smallest fraction of bases not covered, while the SOLiD platforms reveal remarkable shortcomings, especially in covering CpG islands. When comparing the performance of the four platforms for calling SNPs, HiSeq2000 and Complete Genomics achieve the highest sensitivity, while the SOLiD platforms show the lowest false positive rate. Finally, we find that integrating sequencing data from different platforms offers the potential to combine the strengths of different technologies. In summary, our results detail the strengths and weaknesses of all four whole-genome sequencing platforms. It indicates application areas that call for a specific sequencing platform and disallow other platforms. This helps to identify the proper sequencing platform for whole genome studies with different application scopes. PMID:23776689
The Porcelain Crab Transcriptome and PCAD, the Porcelain Crab Microarray and Sequence Database
DOE Office of Scientific and Technical Information (OSTI.GOV)
Tagmount, Abderrahmane; Wang, Mei; Lindquist, Erika
2010-01-27
Background: With the emergence of a completed genome sequence of the freshwater crustacean Daphnia pulex, construction of genomic-scale sequence databases for additional crustacean sequences are important for comparative genomics and annotation. Porcelain crabs, genus Petrolisthes, have been powerful crustacean models for environmental and evolutionary physiology with respect to thermal adaptation and understanding responses of marine organisms to climate change. Here, we present a large-scale EST sequencing and cDNA microarray database project for the porcelain crab Petrolisthes cinctipes. Methodology/Principal Findings: A set of ~;;30K unique sequences (UniSeqs) representing ~;;19K clusters were generated from ~;;98K high quality ESTs from a set ofmore » tissue specific non-normalized and mixed-tissue normalized cDNA libraries from the porcelain crab Petrolisthes cinctipes. Homology for each UniSeq was assessed using BLAST, InterProScan, GO and KEGG database searches. Approximately 66percent of the UniSeqs had homology in at least one of the databases. All EST and UniSeq sequences along with annotation results and coordinated cDNA microarray datasets have been made publicly accessible at the Porcelain Crab Array Database (PCAD), a feature-enriched version of the Stanford and Longhorn Array Databases.Conclusions/Significance: The EST project presented here represents the third largest sequencing effort for any crustacean, and the largest effort for any crab species. Our assembly and clustering results suggest that our porcelain crab EST data set is equally diverse to the much larger EST set generated in the Daphnia pulex genome sequencing project, and thus will be an important resource to the Daphnia research community. Our homology results support the pancrustacea hypothesis and suggest that Malacostraca may be ancestral to Branchiopoda and Hexapoda. Our results also suggest that our cDNA microarrays cover as much of the transcriptome as can reasonably be captured in EST library sequencing approaches, and thus represent a rich resource for studies of environmental genomics.« less
Shi, Yang; Chinnaiyan, Arul M; Jiang, Hui
2015-07-01
High-throughput sequencing of transcriptomes (RNA-Seq) has become a powerful tool to study gene expression. Here we present an R package, rSeqNP, which implements a non-parametric approach to test for differential expression and splicing from RNA-Seq data. rSeqNP uses permutation tests to access statistical significance and can be applied to a variety of experimental designs. By combining information across isoforms, rSeqNP is able to detect more differentially expressed or spliced genes from RNA-Seq data. The R package with its source code and documentation are freely available at http://www-personal.umich.edu/∼jianghui/rseqnp/. jianghui@umich.edu Supplementary data are available at Bioinformatics online. © The Author 2015. Published by Oxford University Press. All rights reserved. For Permissions, please e-mail: journals.permissions@oup.com.
Genotype calling from next-generation sequencing data using haplotype information of reads
Zhi, Degui; Wu, Jihua; Liu, Nianjun; Zhang, Kui
2012-01-01
Motivation: Low coverage sequencing provides an economic strategy for whole genome sequencing. When sequencing a set of individuals, genotype calling can be challenging due to low sequencing coverage. Linkage disequilibrium (LD) based refinement of genotyping calling is essential to improve the accuracy. Current LD-based methods use read counts or genotype likelihoods at individual potential polymorphic sites (PPSs). Reads that span multiple PPSs (jumping reads) can provide additional haplotype information overlooked by current methods. Results: In this article, we introduce a new Hidden Markov Model (HMM)-based method that can take into account jumping reads information across adjacent PPSs and implement it in the HapSeq program. Our method extends the HMM in Thunder and explicitly models jumping reads information as emission probabilities conditional on the states of adjacent PPSs. Our simulation results show that, compared to Thunder, HapSeq reduces the genotyping error rate by 30%, from 0.86% to 0.60%. The results from the 1000 Genomes Project show that HapSeq reduces the genotyping error rate by 12 and 9%, from 2.24% and 2.76% to 1.97% and 2.50% for individuals with European and African ancestry, respectively. We expect our program can improve genotyping qualities of the large number of ongoing and planned whole genome sequencing projects. Contact: dzhi@ms.soph.uab.edu; kzhang@ms.soph.uab.edu Availability: The software package HapSeq and its manual can be found and downloaded at www.ssg.uab.edu/hapseq/. Supplementary information: Supplementary data are available at Bioinformatics online. PMID:22285565
Hybrid De Novo Genome Assembly Using MiSeq and SOLiD Short Read Data
Ikegami, Tsutomu; Inatsugi, Toyohiro; Kojima, Isao; Umemura, Myco; Hagiwara, Hiroko; Machida, Masayuki; Asai, Kiyoshi
2015-01-01
A hybrid de novo assembly pipeline was constructed to utilize both MiSeq and SOLiD short read data in combination in the assembly. The short read data were converted to a standard format of the pipeline, and were supplied to the pipeline components such as ABySS and SOAPdenovo. The assembly pipeline proceeded through several stages, and either MiSeq paired-end data, SOLiD mate-paired data, or both of them could be specified as input data at each stage separately. The pipeline was examined on the filamentous fungus Aspergillus oryzae RIB40, by aligning the assembly results against the reference sequences. Using both the MiSeq and the SOLiD data in the hybrid assembly, the alignment length was improved by a factor of 3 to 8, compared with the assemblies using either one of the data types. The number of the reproduced gene cluster regions encoding secondary metabolite biosyntheses (SMB) was also improved by the hybrid assemblies. These results imply that the MiSeq data with long read length are essential to construct accurate nucleotide sequences, while the SOLiD mate-paired reads with long insertion length enhance long-range arrangements of the sequences. The pipeline was also tested on the actinomycete Streptomyces avermitilis MA-4680, whose gene is known to have high-GC content. Although the quality of the SOLiD reads was too low to perform any meaningful assemblies by themselves, the alignment length to the reference was improved by a factor of 2, compared with the assembly using only the MiSeq data. PMID:25919614
CRISPR-FOCUS: A web server for designing focused CRISPR screening experiments.
Cao, Qingyi; Ma, Jian; Chen, Chen-Hao; Xu, Han; Chen, Zhi; Li, Wei; Liu, X Shirley
2017-01-01
The recently developed CRISPR screen technology, based on the CRISPR/Cas9 genome editing system, enables genome-wide interrogation of gene functions in an efficient and cost-effective manner. Although many computational algorithms and web servers have been developed to design single-guide RNAs (sgRNAs) with high specificity and efficiency, algorithms specifically designed for conducting CRISPR screens are still lacking. Here we present CRISPR-FOCUS, a web-based platform to search and prioritize sgRNAs for CRISPR screen experiments. With official gene symbols or RefSeq IDs as the only mandatory input, CRISPR-FOCUS filters and prioritizes sgRNAs based on multiple criteria, including efficiency, specificity, sequence conservation, isoform structure, as well as genomic variations including Single Nucleotide Polymorphisms and cancer somatic mutations. CRISPR-FOCUS also provides pre-defined positive and negative control sgRNAs, as well as other necessary sequences in the construct (e.g., U6 promoters to drive sgRNA transcription and RNA scaffolds of the CRISPR/Cas9). These features allow users to synthesize oligonucleotides directly based on the output of CRISPR-FOCUS. Overall, CRISPR-FOCUS provides a rational and high-throughput approach for sgRNA library design that enables users to efficiently conduct a focused screen experiment targeting up to thousands of genes. (CRISPR-FOCUS is freely available at http://cistrome.org/crispr-focus/).
Cornwell, MacIntosh; Vangala, Mahesh; Taing, Len; Herbert, Zachary; Köster, Johannes; Li, Bo; Sun, Hanfei; Li, Taiwen; Zhang, Jian; Qiu, Xintao; Pun, Matthew; Jeselsohn, Rinath; Brown, Myles; Liu, X Shirley; Long, Henry W
2018-04-12
RNA sequencing has become a ubiquitous technology used throughout life sciences as an effective method of measuring RNA abundance quantitatively in tissues and cells. The increase in use of RNA-seq technology has led to the continuous development of new tools for every step of analysis from alignment to downstream pathway analysis. However, effectively using these analysis tools in a scalable and reproducible way can be challenging, especially for non-experts. Using the workflow management system Snakemake we have developed a user friendly, fast, efficient, and comprehensive pipeline for RNA-seq analysis. VIPER (Visualization Pipeline for RNA-seq analysis) is an analysis workflow that combines some of the most popular tools to take RNA-seq analysis from raw sequencing data, through alignment and quality control, into downstream differential expression and pathway analysis. VIPER has been created in a modular fashion to allow for the rapid incorporation of new tools to expand the capabilities. This capacity has already been exploited to include very recently developed tools that explore immune infiltrate and T-cell CDR (Complementarity-Determining Regions) reconstruction abilities. The pipeline has been conveniently packaged such that minimal computational skills are required to download and install the dozens of software packages that VIPER uses. VIPER is a comprehensive solution that performs most standard RNA-seq analyses quickly and effectively with a built-in capacity for customization and expansion.
Using the self-select paradigm to delineate the nature of speech motor programming.
Wright, David L; Robin, Don A; Rhee, Jooyhun; Vaculin, Amber; Jacks, Adam; Guenther, Frank H; Fox, Peter T
2009-06-01
The authors examined the involvement of 2 speech motor programming processes identified by S. T. Klapp (1995, 2003) during the articulation of utterances differing in syllable and sequence complexity. According to S. T. Klapp, 1 process, INT, resolves the demands of the programmed unit, whereas a second process, SEQ, oversees the serial order demands of longer sequences. A modified reaction time paradigm was used to assess INT and SEQ demands. Specifically, syllable complexity was dependent on syllable structure, whereas sequence complexity involved either repeated or unique syllabi within an utterance. INT execution was slowed when articulating single syllables in the form CCCV compared to simpler CV syllables. Planning unique syllables within a multisyllabic utterance rather than repetitions of the same syllable slowed INT but not SEQ. The INT speech motor programming process, important for mental syllabary access, is sensitive to changes in both syllable structure and the number of unique syllables in an utterance.
Matsunaga, Hiroko; Goto, Mari; Arikawa, Koji; Shirai, Masataka; Tsunoda, Hiroyuki; Huang, Huan; Kambara, Hideki
2015-02-15
Analyses of gene expressions in single cells are important for understanding detailed biological phenomena. Here, a highly sensitive and accurate method by sequencing (called "bead-seq") to obtain a whole gene expression profile for a single cell is proposed. A key feature of the method is to use a complementary DNA (cDNA) library on magnetic beads, which enables adding washing steps to remove residual reagents in a sample preparation process. By adding the washing steps, the next steps can be carried out under the optimal conditions without losing cDNAs. Error sources were carefully evaluated to conclude that the first several steps were the key steps. It is demonstrated that bead-seq is superior to the conventional methods for single-cell gene expression analyses in terms of reproducibility, quantitative accuracy, and biases caused during sample preparation and sequencing processes. Copyright © 2014 Elsevier Inc. All rights reserved.
Normalization, bias correction, and peak calling for ChIP-seq
Diaz, Aaron; Park, Kiyoub; Lim, Daniel A.; Song, Jun S.
2012-01-01
Next-generation sequencing is rapidly transforming our ability to profile the transcriptional, genetic, and epigenetic states of a cell. In particular, sequencing DNA from the immunoprecipitation of protein-DNA complexes (ChIP-seq) and methylated DNA (MeDIP-seq) can reveal the locations of protein binding sites and epigenetic modifications. These approaches contain numerous biases which may significantly influence the interpretation of the resulting data. Rigorous computational methods for detecting and removing such biases are still lacking. Also, multi-sample normalization still remains an important open problem. This theoretical paper systematically characterizes the biases and properties of ChIP-seq data by comparing 62 separate publicly available datasets, using rigorous statistical models and signal processing techniques. Statistical methods for separating ChIP-seq signal from background noise, as well as correcting enrichment test statistics for sequence-dependent and sonication biases, are presented. Our method effectively separates reads into signal and background components prior to normalization, improving the signal-to-noise ratio. Moreover, most peak callers currently use a generic null model which suffers from low specificity at the sensitivity level requisite for detecting subtle, but true, ChIP enrichment. The proposed method of determining a cell type-specific null model, which accounts for cell type-specific biases, is shown to be capable of achieving a lower false discovery rate at a given significance threshold than current methods. PMID:22499706
Arpeggio: harmonic compression of ChIP-seq data reveals protein-chromatin interaction signatures
Stanton, Kelly Patrick; Parisi, Fabio; Strino, Francesco; Rabin, Neta; Asp, Patrik; Kluger, Yuval
2013-01-01
Researchers generating new genome-wide data in an exploratory sequencing study can gain biological insights by comparing their data with well-annotated data sets possessing similar genomic patterns. Data compression techniques are needed for efficient comparisons of a new genomic experiment with large repositories of publicly available profiles. Furthermore, data representations that allow comparisons of genomic signals from different platforms and across species enhance our ability to leverage these large repositories. Here, we present a signal processing approach that characterizes protein–chromatin interaction patterns at length scales of several kilobases. This allows us to efficiently compare numerous chromatin-immunoprecipitation sequencing (ChIP-seq) data sets consisting of many types of DNA-binding proteins collected from a variety of cells, conditions and organisms. Importantly, these interaction patterns broadly reflect the biological properties of the binding events. To generate these profiles, termed Arpeggio profiles, we applied harmonic deconvolution techniques to the autocorrelation profiles of the ChIP-seq signals. We used 806 publicly available ChIP-seq experiments and showed that Arpeggio profiles with similar spectral densities shared biological properties. Arpeggio profiles of ChIP-seq data sets revealed characteristics that are not easily detected by standard peak finders. They also allowed us to relate sequencing data sets from different genomes, experimental platforms and protocols. Arpeggio is freely available at http://sourceforge.net/p/arpeggio/wiki/Home/. PMID:23873955
Arpeggio: harmonic compression of ChIP-seq data reveals protein-chromatin interaction signatures.
Stanton, Kelly Patrick; Parisi, Fabio; Strino, Francesco; Rabin, Neta; Asp, Patrik; Kluger, Yuval
2013-09-01
Researchers generating new genome-wide data in an exploratory sequencing study can gain biological insights by comparing their data with well-annotated data sets possessing similar genomic patterns. Data compression techniques are needed for efficient comparisons of a new genomic experiment with large repositories of publicly available profiles. Furthermore, data representations that allow comparisons of genomic signals from different platforms and across species enhance our ability to leverage these large repositories. Here, we present a signal processing approach that characterizes protein-chromatin interaction patterns at length scales of several kilobases. This allows us to efficiently compare numerous chromatin-immunoprecipitation sequencing (ChIP-seq) data sets consisting of many types of DNA-binding proteins collected from a variety of cells, conditions and organisms. Importantly, these interaction patterns broadly reflect the biological properties of the binding events. To generate these profiles, termed Arpeggio profiles, we applied harmonic deconvolution techniques to the autocorrelation profiles of the ChIP-seq signals. We used 806 publicly available ChIP-seq experiments and showed that Arpeggio profiles with similar spectral densities shared biological properties. Arpeggio profiles of ChIP-seq data sets revealed characteristics that are not easily detected by standard peak finders. They also allowed us to relate sequencing data sets from different genomes, experimental platforms and protocols. Arpeggio is freely available at http://sourceforge.net/p/arpeggio/wiki/Home/.
Sequence Alignment to Predict Across Species Susceptibility ...
Conservation of a molecular target across species can be used as a line-of-evidence to predict the likelihood of chemical susceptibility. The web-based Sequence Alignment to Predict Across Species Susceptibility (SeqAPASS) tool was developed to simplify, streamline, and quantitatively assess protein sequence/structural similarity across taxonomic groups as a means to predict relative intrinsic susceptibility. The intent of the tool is to allow for evaluation of any potential protein target, so it is amenable to variable degrees of protein characterization, depending on available information about the chemical/protein interaction and the molecular target itself. To allow for flexibility in the analysis, a layered strategy was adopted for the tool. The first level of the SeqAPASS analysis compares primary amino acid sequences to a query sequence, calculating a metric for sequence similarity (including detection of candidate orthologs), the second level evaluates sequence similarity within selected domains (e.g., ligand-binding domain, DNA binding domain), and the third level of analysis compares individual amino acid residue positions identified as being of importance for protein conformation and/or ligand binding upon chemical perturbation. Each level of the SeqAPASS analysis provides increasing evidence to apply toward rapid, screening-level assessments of probable cross species susceptibility. Such analyses can support prioritization of chemicals for further ev
The LAM-PCR Method to Sequence LV Integration Sites.
Wang, Wei; Bartholomae, Cynthia C; Gabriel, Richard; Deichmann, Annette; Schmidt, Manfred
2016-01-01
Integrating viral gene transfer vectors are commonly used gene delivery tools in clinical gene therapy trials providing stable integration and continuous gene expression of the transgene in the treated host cell. However, integration of the reverse-transcribed vector DNA into the host genome is a potentially mutagenic event that may directly contribute to unwanted side effects. A comprehensive and accurate analysis of the integration site (IS) repertoire is indispensable to study clonality in transduced cells obtained from patients undergoing gene therapy and to identify potential in vivo selection of affected cell clones. To date, next-generation sequencing (NGS) of vector-genome junctions allows sophisticated studies on the integration repertoire in vitro and in vivo. We have explored the use of the Illumina MiSeq Personal Sequencer platform to sequence vector ISs amplified by non-restrictive linear amplification-mediated PCR (nrLAM-PCR) and LAM-PCR. MiSeq-based high-quality IS sequence retrieval is accomplished by the introduction of a double-barcode strategy that substantially minimizes the frequency of IS sequence collisions compared to the conventionally used single-barcode protocol. Here, we present an updated protocol of (nr)LAM-PCR for the analysis of lentiviral IS using a double-barcode system and followed by deep sequencing using the MiSeq device.
Yan, Yong-Wei; Zou, Bin; Zhu, Ting; Hozzein, Wael N.
2017-01-01
RNA-seq-based SSU (small subunit) rRNA (ribosomal RNA) analysis has provided a better understanding of potentially active microbial community within environments. However, for RNA-seq library construction, high quantities of purified RNA are typically required. We propose a modified RNA-seq method for SSU rRNA-based microbial community analysis that depends on the direct ligation of a 5’ adaptor to RNA before reverse-transcription. The method requires only a low-input quantity of RNA (10–100 ng) and does not require a DNA removal step. The method was initially tested on three mock communities synthesized with enriched SSU rRNA of archaeal, bacterial and fungal isolates at different ratios, and was subsequently used for environmental samples of high or low biomass. For high-biomass salt-marsh sediments, enriched SSU rRNA and total nucleic acid-derived RNA-seq datasets revealed highly consistent community compositions for all of the SSU rRNA sequences, and as much as 46.4%-59.5% of 16S rRNA sequences were suitable for OTU (operational taxonomic unit)-based community and diversity analyses with complete coverage of V1-V2 regions. OTU-based community structures for the two datasets were also highly consistent with those determined by all of the 16S rRNA reads. For low-biomass samples, total nucleic acid-derived RNA-seq datasets were analyzed, and highly active bacterial taxa were also identified by the OTU-based method, notably including members of the previously underestimated genus Nitrospira and phylum Acidobacteria in tap water, members of the phylum Actinobacteria on a shower curtain, and members of the phylum Cyanobacteria on leaf surfaces. More than half of the bacterial 16S rRNA sequences covered the complete region of primer 8F, and non-coverage rates as high as 38.7% were obtained for phylum-unclassified sequences, providing many opportunities to identify novel bacterial taxa. This modified RNA-seq method will provide a better snapshot of diverse microbial communities, most notably by OTU-based analysis, even communities with low-biomass samples. PMID:29016661
Single-cell Transcriptome Study as Big Data
Yu, Pingjian; Lin, Wei
2016-01-01
The rapid growth of single-cell RNA-seq studies (scRNA-seq) demands efficient data storage, processing, and analysis. Big-data technology provides a framework that facilitates the comprehensive discovery of biological signals from inter-institutional scRNA-seq datasets. The strategies to solve the stochastic and heterogeneous single-cell transcriptome signal are discussed in this article. After extensively reviewing the available big-data applications of next-generation sequencing (NGS)-based studies, we propose a workflow that accounts for the unique characteristics of scRNA-seq data and primary objectives of single-cell studies. PMID:26876720
GUIDEseq: a bioconductor package to analyze GUIDE-Seq datasets for CRISPR-Cas nucleases.
Zhu, Lihua Julie; Lawrence, Michael; Gupta, Ankit; Pagès, Hervé; Kucukural, Alper; Garber, Manuel; Wolfe, Scot A
2017-05-15
Genome editing technologies developed around the CRISPR-Cas9 nuclease system have facilitated the investigation of a broad range of biological questions. These nucleases also hold tremendous promise for treating a variety of genetic disorders. In the context of their therapeutic application, it is important to identify the spectrum of genomic sequences that are cleaved by a candidate nuclease when programmed with a particular guide RNA, as well as the cleavage efficiency of these sites. Powerful new experimental approaches, such as GUIDE-seq, facilitate the sensitive, unbiased genome-wide detection of nuclease cleavage sites within the genome. Flexible bioinformatics analysis tools for processing GUIDE-seq data are needed. Here, we describe an open source, open development software suite, GUIDEseq, for GUIDE-seq data analysis and annotation as a Bioconductor package in R. The GUIDEseq package provides a flexible platform with more than 60 adjustable parameters for the analysis of datasets associated with custom nuclease applications. These parameters allow data analysis to be tailored to different nuclease platforms with different length and complexity in their guide and PAM recognition sequences or their DNA cleavage position. They also enable users to customize sequence aggregation criteria, and vary peak calling thresholds that can influence the number of potential off-target sites recovered. GUIDEseq also annotates potential off-target sites that overlap with genes based on genome annotation information, as these may be the most important off-target sites for further characterization. In addition, GUIDEseq enables the comparison and visualization of off-target site overlap between different datasets for a rapid comparison of different nuclease configurations or experimental conditions. For each identified off-target, the GUIDEseq package outputs mapped GUIDE-Seq read count as well as cleavage score from a user specified off-target cleavage score prediction algorithm permitting the identification of genomic sequences with unexpected cleavage activity. The GUIDEseq package enables analysis of GUIDE-data from various nuclease platforms for any species with a defined genomic sequence. This software package has been used successfully to analyze several GUIDE-seq datasets. The software, source code and documentation are freely available at http://www.bioconductor.org/packages/release/bioc/html/GUIDEseq.html .
Personal Genome Sequencing in Ostensibly Healthy Individuals and the PeopleSeq Consortium
Linderman, Michael D.; Nielsen, Daiva E.; Green, Robert C.
2016-01-01
Thousands of ostensibly healthy individuals have had their exome or genome sequenced, but a much smaller number of these individuals have received any personal genomic results from that sequencing. We term those projects in which ostensibly healthy participants can receive sequencing-derived genetic findings and may also have access to their genomic data as participatory predispositional personal genome sequencing (PPGS). Here we are focused on genome sequencing applied in a pre-symptomatic context and so define PPGS to exclude diagnostic genome sequencing intended to identify the molecular cause of suspected or diagnosed genetic disease. In this report we describe the design of completed and underway PPGS projects, briefly summarize the results reported to date and introduce the PeopleSeq Consortium, a newly formed collaboration of PPGS projects designed to collect much-needed longitudinal outcome data. PMID:27023617
Liao, Wei; Jordaan, Gwen; Nham, Phillipp; Phan, Ryan T; Pelegrini, Matteo; Sharma, Sanjai
2015-10-16
To determine differentially expressed and spliced RNA transcripts in chronic lymphocytic leukemia specimens a high throughput RNA-sequencing (HTS RNA-seq) analysis was performed. Ten CLL specimens and five normal peripheral blood CD19+ B cells were analyzed by HTS RNA-seq. The library preparation was performed with Illumina TrueSeq RNA kit and analyzed by Illumina HiSeq 2000 sequencing system. An average of 48.5 million reads for B cells, and 50.6 million reads for CLL specimens were obtained with 10396 and 10448 assembled transcripts for normal B cells and primary CLL specimens respectively. With the Cuffdiff analysis, 2091 differentially expressed genes (DEG) between B cells and CLL specimens based on FPKM (fragments per kilobase of transcript per million reads and false discovery rate, FDR q < 0.05, fold change >2) were identified. Expression of selected DEGs (n = 32) with up regulated and down regulated expression in CLL from RNA-seq data were also analyzed by qRT-PCR in a test cohort of CLL specimens. Even though there was a variation in fold expression of DEG genes between RNA-seq and qRT-PCR; more than 90 % of analyzed genes were validated by qRT-PCR analysis. Analysis of RNA-seq data for splicing alterations in CLL and B cells was performed by Multivariate Analysis of Transcript Splicing (MATS analysis). Skipped exon was the most frequent splicing alteration in CLL specimens with 128 significant events (P-value <0.05, minimum inclusion level difference >0.1). The RNA-seq analysis of CLL specimens identifies novel DEG and alternatively spliced genes that are potential prognostic markers and therapeutic targets. High level of validation by qRT-PCR for a number of DEG genes supports the accuracy of this analysis. Global comparison of transcriptomes of B cells, IGVH non-mutated CLL (U-CLL) and mutated CLL specimens (M-CLL) with multidimensional scaling analysis was able to segregate CLL and B cell transcriptomes but the M-CLL and U-CLL transcriptomes were indistinguishable. The analysis of HTS RNA-seq data to identify alternative splicing events and other genetic abnormalities specific to CLL is an added advantage of RNA-seq that is not feasible with other genome wide analysis.
Soyer, Jessica L; Möller, Mareike; Schotanus, Klaas; Connolly, Lanelle R; Galazka, Jonathan M; Freitag, Michael; Stukenbrock, Eva H
2015-06-01
The presence or absence of specific transcription factors, chromatin remodeling machineries, chromatin modification enzymes, post-translational histone modifications and histone variants all play crucial roles in the regulation of pathogenicity genes. Chromatin immunoprecipitation (ChIP) followed by high-throughput sequencing (ChIP-seq) provides an important tool to study genome-wide protein-DNA interactions to help understand gene regulation in the context of native chromatin. ChIP-seq is a convenient in vivo technique to identify, map and characterize occupancy of specific DNA fragments with proteins against which specific antibodies exist or which can be epitope-tagged in vivo. We optimized existing ChIP protocols for use in the wheat pathogen Zymoseptoria tritici and closely related sister species. Here, we provide a detailed method, underscoring which aspects of the technique are organism-specific. Library preparation for Illumina sequencing is described, as this is currently the most widely used ChIP-seq method. One approach for the analysis and visualization of representative sequence is described; improved tools for these analyses are constantly being developed. Using ChIP-seq with antibodies against H3K4me2, which is considered a mark for euchromatin or H3K9me3 and H3K27me3, which are considered marks for heterochromatin, the overall distribution of euchromatin and heterochromatin in the genome of Z. tritici can be determined. Our ChIP-seq protocol was also successfully applied to Z. tritici strains with high levels of melanization or aberrant colony morphology, and to different species of the genus (Z. ardabiliae and Z. pseudotritici), suggesting that our technique is robust. The methods described here provide a powerful framework to study new aspects of chromatin biology and gene regulation in this prominent wheat pathogen. Copyright © 2015 Elsevier Inc. All rights reserved.
Next-Generation Genomics Facility at C-CAMP: Accelerating Genomic Research in India
S, Chandana; Russiachand, Heikham; H, Pradeep; S, Shilpa; M, Ashwini; S, Sahana; B, Jayanth; Atla, Goutham; Jain, Smita; Arunkumar, Nandini; Gowda, Malali
2014-01-01
Next-Generation Sequencing (NGS; http://www.genome.gov/12513162) is a recent life-sciences technological revolution that allows scientists to decode genomes or transcriptomes at a much faster rate with a lower cost. Genomic-based studies are in a relatively slow pace in India due to the non-availability of genomics experts, trained personnel and dedicated service providers. Using NGS there is a lot of potential to study India's national diversity (of all kinds). We at the Centre for Cellular and Molecular Platforms (C-CAMP) have launched the Next Generation Genomics Facility (NGGF) to provide genomics service to scientists, to train researchers and also work on national and international genomic projects. We have HiSeq1000 from Illumina and GS-FLX Plus from Roche454. The long reads from GS FLX Plus, and high sequence depth from HiSeq1000, are the best and ideal hybrid approaches for de novo and re-sequencing of genomes and transcriptomes. At our facility, we have sequenced around 70 different organisms comprising of more than 388 genomes and 615 transcriptomes – prokaryotes and eukaryotes (fungi, plants and animals). In addition we have optimized other unique applications such as small RNA (miRNA, siRNA etc), long Mate-pair sequencing (2 to 20 Kb), Coding sequences (Exome), Methylome (ChIP-Seq), Restriction Mapping (RAD-Seq), Human Leukocyte Antigen (HLA) typing, mixed genomes (metagenomes) and target amplicons, etc. Translating DNA sequence data from NGS sequencer into meaningful information is an important exercise. Under NGGF, we have bioinformatics experts and high-end computing resources to dissect NGS data such as genome assembly and annotation, gene expression, target enrichment, variant calling (SSR or SNP), comparative analysis etc. Our services (sequencing and bioinformatics) have been utilized by more than 45 organizations (academia and industry) both within India and outside, resulting several publications in peer-reviewed journals and several genomic/transcriptomic data is available at NCBI.
S-MART, a software toolbox to aid RNA-Seq data analysis.
Zytnicki, Matthias; Quesneville, Hadi
2011-01-01
High-throughput sequencing is now routinely performed in many experiments. But the analysis of the millions of sequences generated, is often beyond the expertise of the wet labs who have no personnel specializing in bioinformatics. Whereas several tools are now available to map high-throughput sequencing data on a genome, few of these can extract biological knowledge from the mapped reads. We have developed a toolbox called S-MART, which handles mapped RNA-Seq data. S-MART is an intuitive and lightweight tool which performs many of the tasks usually required for the analysis of mapped RNA-Seq reads. S-MART does not require any computer science background and thus can be used by all of the biologist community through a graphical interface. S-MART can run on any personal computer, yielding results within an hour even for Gb of data for most queries. S-MART may perform the entire analysis of the mapped reads, without any need for other ad hoc scripts. With this tool, biologists can easily perform most of the analyses on their computer for their RNA-Seq data, from the mapped data to the discovery of important loci.
S-MART, A Software Toolbox to Aid RNA-seq Data Analysis
Zytnicki, Matthias; Quesneville, Hadi
2011-01-01
High-throughput sequencing is now routinely performed in many experiments. But the analysis of the millions of sequences generated, is often beyond the expertise of the wet labs who have no personnel specializing in bioinformatics. Whereas several tools are now available to map high-throughput sequencing data on a genome, few of these can extract biological knowledge from the mapped reads. We have developed a toolbox called S-MART, which handles mapped RNA-Seq data. S-MART is an intuitive and lightweight tool which performs many of the tasks usually required for the analysis of mapped RNA-Seq reads. S-MART does not require any computer science background and thus can be used by all of the biologist community through a graphical interface. S-MART can run on any personal computer, yielding results within an hour even for Gb of data for most queries. S-MART may perform the entire analysis of the mapped reads, without any need for other ad hoc scripts. With this tool, biologists can easily perform most of the analyses on their computer for their RNA-Seq data, from the mapped data to the discovery of important loci. PMID:21998740
Using Poisson mixed-effects model to quantify transcript-level gene expression in RNA-Seq.
Hu, Ming; Zhu, Yu; Taylor, Jeremy M G; Liu, Jun S; Qin, Zhaohui S
2012-01-01
RNA sequencing (RNA-Seq) is a powerful new technology for mapping and quantifying transcriptomes using ultra high-throughput next-generation sequencing technologies. Using deep sequencing, gene expression levels of all transcripts including novel ones can be quantified digitally. Although extremely promising, the massive amounts of data generated by RNA-Seq, substantial biases and uncertainty in short read alignment pose challenges for data analysis. In particular, large base-specific variation and between-base dependence make simple approaches, such as those that use averaging to normalize RNA-Seq data and quantify gene expressions, ineffective. In this study, we propose a Poisson mixed-effects (POME) model to characterize base-level read coverage within each transcript. The underlying expression level is included as a key parameter in this model. Since the proposed model is capable of incorporating base-specific variation as well as between-base dependence that affect read coverage profile throughout the transcript, it can lead to improved quantification of the true underlying expression level. POME can be freely downloaded at http://www.stat.purdue.edu/~yuzhu/pome.html. yuzhu@purdue.edu; zhaohui.qin@emory.edu Supplementary data are available at Bioinformatics online.
GenomicTools: a computational platform for developing high-throughput analytics in genomics.
Tsirigos, Aristotelis; Haiminen, Niina; Bilal, Erhan; Utro, Filippo
2012-01-15
Recent advances in sequencing technology have resulted in the dramatic increase of sequencing data, which, in turn, requires efficient management of computational resources, such as computing time, memory requirements as well as prototyping of computational pipelines. We present GenomicTools, a flexible computational platform, comprising both a command-line set of tools and a C++ API, for the analysis and manipulation of high-throughput sequencing data such as DNA-seq, RNA-seq, ChIP-seq and MethylC-seq. GenomicTools implements a variety of mathematical operations between sets of genomic regions thereby enabling the prototyping of computational pipelines that can address a wide spectrum of tasks ranging from pre-processing and quality control to meta-analyses. Additionally, the GenomicTools platform is designed to analyze large datasets of any size by minimizing memory requirements. In practical applications, where comparable, GenomicTools outperforms existing tools in terms of both time and memory usage. The GenomicTools platform (version 2.0.0) was implemented in C++. The source code, documentation, user manual, example datasets and scripts are available online at http://code.google.com/p/ibm-cbc-genomic-tools.
TEA: the epigenome platform for Arabidopsis methylome study.
Su, Sheng-Yao; Chen, Shu-Hwa; Lu, I-Hsuan; Chiang, Yih-Shien; Wang, Yu-Bin; Chen, Pao-Yang; Lin, Chung-Yen
2016-12-22
Bisulfite sequencing (BS-seq) has become a standard technology to profile genome-wide DNA methylation at single-base resolution. It allows researchers to conduct genome-wise cytosine methylation analyses on issues about genomic imprinting, transcriptional regulation, cellular development and differentiation. One single data from a BS-Seq experiment is resolved into many features according to the sequence contexts, making methylome data analysis and data visualization a complex task. We developed a streamlined platform, TEA, for analyzing and visualizing data from whole-genome BS-Seq (WGBS) experiments conducted in the model plant Arabidopsis thaliana. To capture the essence of the genome methylation level and to meet the efficiency for running online, we introduce a straightforward method for measuring genome methylation in each sequence context by gene. The method is scripted in Java to process BS-Seq mapping results. Through a simple data uploading process, the TEA server deploys a web-based platform for deep analysis by linking data to an updated Arabidopsis annotation database and toolkits. TEA is an intuitive and efficient online platform for analyzing the Arabidopsis genomic DNA methylation landscape. It provides several ways to help users exploit WGBS data. TEA is freely accessible for academic users at: http://tea.iis.sinica.edu.tw .
Optimizing exosomal RNA isolation for RNA-Seq analyses of archival sera specimens.
Prendergast, Emily N; de Souza Fonseca, Marcos Abraão; Dezem, Felipe Segato; Lester, Jenny; Karlan, Beth Y; Noushmehr, Houtan; Lin, Xianzhi; Lawrenson, Kate
2018-01-01
Exosomes are endosome-derived membrane vesicles that contain proteins, lipids, and nucleic acids. The exosomal transcriptome mediates intercellular communication, and represents an understudied reservoir of novel biomarkers for human diseases. Next-generation sequencing enables complex quantitative characterization of exosomal RNAs from diverse sources. However, detailed protocols describing exosome purification for preparation of exosomal RNA-sequence (RNA-Seq) libraries are lacking. Here we compared methods for isolation of exosomes and extraction of exosomal RNA from human cell-free serum, as well as strategies for attaining equal representation of samples within pooled RNA-Seq libraries. We compared commercial precipitation with ultracentrifugation for exosome purification and confirmed the presence of exosomes via both transmission electron microscopy and immunoblotting. Exosomal RNA extraction was compared using four different RNA purification methods. We determined the minimal starting volume of serum required for exosome preparation and showed that high quality exosomal RNA can be isolated from sera stored for over a decade. Finally, RNA-Seq libraries were successfully prepared with exosomal RNAs extracted from human cell-free serum, cataloguing both coding and non-coding exosomal transcripts. This method provides researchers with strategic options to prepare RNA-Seq libraries and compare RNA-Seq data quantitatively from minimal volumes of fresh and archival human cell-free serum for disease biomarker discovery.
Ruttanajit, Tida; Chanchamroen, Sujin; Cram, David S; Sawakwongpra, Kritchakorn; Suksalak, Wanwisa; Leng, Xue; Fan, Junmei; Wang, Li; Yao, Yuanqing; Quangkananurug, Wiwat
2016-02-01
Currently, our understanding of the nature and reproductive potential of blastocysts associated with trophectoderm (TE) lineage chromosomal mosaicism is limited. The objective of this study was to first validate copy number variation sequencing (CNV-Seq) for measuring the level of mosaicism and second, examine the nature and level of mosaicism in TE biopsies of patient's blastocysts. TE biopy samples were analysed by array comparative genomic hybridization (CGH) and CNV-Seq to discriminate between euploid, aneuploid and mosaic blastocysts. Using artificial models of TE mosaicism for five different chromosomes, CNV-Seq accurately and reproducibly quantitated mosaicism at levels of 50% and 20%. In a comparative 24-chromosome study of 49 blastocysts by array CGH and CNV-Seq, 43 blastocysts (87.8%) had a concordant diagnosis and 6 blastocysts (12.2%) were discordant. The discordance was attributed to low to medium levels of chromosomal mosaicism (30-70%) not detected by array CGH. In an expanded study of 399 blastocysts using CNV-Seq as the sole diagnostic method, the proportion of diploid-aneuploid mosaics (34, 8.5%) was significantly higher than aneuploid mosaics (18, 4.5%) (p < 0.02). Mosaicism is a significant chromosomal abnormality associated with the TE lineage of human blastocysts that can be reliably and accurately detected by CNV-Seq. © 2015 John Wiley & Sons, Ltd.
SeqCompress: an algorithm for biological sequence compression.
Sardaraz, Muhammad; Tahir, Muhammad; Ikram, Ataul Aziz; Bajwa, Hassan
2014-10-01
The growth of Next Generation Sequencing technologies presents significant research challenges, specifically to design bioinformatics tools that handle massive amount of data efficiently. Biological sequence data storage cost has become a noticeable proportion of total cost in the generation and analysis. Particularly increase in DNA sequencing rate is significantly outstripping the rate of increase in disk storage capacity, which may go beyond the limit of storage capacity. It is essential to develop algorithms that handle large data sets via better memory management. This article presents a DNA sequence compression algorithm SeqCompress that copes with the space complexity of biological sequences. The algorithm is based on lossless data compression and uses statistical model as well as arithmetic coding to compress DNA sequences. The proposed algorithm is compared with recent specialized compression tools for biological sequences. Experimental results show that proposed algorithm has better compression gain as compared to other existing algorithms. Copyright © 2014 Elsevier Inc. All rights reserved.
Workflow and web application for annotating NCBI BioProject transcriptome data.
Vera Alvarez, Roberto; Medeiros Vidal, Newton; Garzón-Martínez, Gina A; Barrero, Luz S; Landsman, David; Mariño-Ramírez, Leonardo
2017-01-01
The volume of transcriptome data is growing exponentially due to rapid improvement of experimental technologies. In response, large central resources such as those of the National Center for Biotechnology Information (NCBI) are continually adapting their computational infrastructure to accommodate this large influx of data. New and specialized databases, such as Transcriptome Shotgun Assembly Sequence Database (TSA) and Sequence Read Archive (SRA), have been created to aid the development and expansion of centralized repositories. Although the central resource databases are under continual development, they do not include automatic pipelines to increase annotation of newly deposited data. Therefore, third-party applications are required to achieve that aim. Here, we present an automatic workflow and web application for the annotation of transcriptome data. The workflow creates secondary data such as sequencing reads and BLAST alignments, which are available through the web application. They are based on freely available bioinformatics tools and scripts developed in-house. The interactive web application provides a search engine and several browser utilities. Graphical views of transcript alignments are available through SeqViewer, an embedded tool developed by NCBI for viewing biological sequence data. The web application is tightly integrated with other NCBI web applications and tools to extend the functionality of data processing and interconnectivity. We present a case study for the species Physalis peruviana with data generated from BioProject ID 67621. URL: http://www.ncbi.nlm.nih.gov/projects/physalis/. Published by Oxford University Press 2017. This work is written by US Government employees and is in the public domain in the US.
Sawal, Humaira Aziz; Harripaul, Ricardo; Mikhailov, Anna; Vleuten, Kayla; Naeem, Farooq; Nasr, Tanveer; Hassan, Muhammad Jawad; Vincent, John B; Ayub, Muhammad; Rafiq, Muhammad Arshad
2018-06-01
Bilateral frontoparietal polymicrogyria (BFPP, MIM 606854) is a heterogeneous autosomal recessive disorder of abnormal cortical lamination, leading to moderate-to-severe intellectual disability (ID), seizure disorder, and motor difficulties, and caused by mutations in the G protein-coupled receptor 56 ( GPR56 ) gene. Twenty-eight mutations in 40 different families have been reported in the literature. The clinical and neuroimaging phenotype is consistent in these cases. The BFPP cortex consists of numerous small gyral cells, with scalloping of the cortical-white matter junction. There are also associated white matter, brain stem, and cerebellar changes. GPR56 is a member of an adhesion G protein-coupled receptor family with a very long N-terminal stalk and seven transmembrane domains. In this study, we identified three families from Pakistan, ascertained primarily for ID, with overlapping approximately 1 Mb region (chr16:56,973,335-57,942,866) of homozygosity by descent, including 24 RefSeq genes. We found three GPR56 homozygous mutations, using next-generation sequencing. These mutations include a substitutional variant, c.1460T > C; p.L487P, (chr16:57693480 T > C), a 13-bp insertion causing the frameshift and truncating mutation, p.Leu269Hisfs*21 (NM_005682.6:c.803_804insCCATGGAGGTGCT; Chr16: 57689345_57689346insCCATGGAGGTGCT), and a truncating mutation c.1426C > T; p.Arg476* (Chr16:57693446C > T). These mutations fully segregated with ID in these families and were absent in the Exome Aggregation Consortium database that has approximately 8,000 control samples of South Asian origin. Two of these mutations have been reported in ClinVar database, and the third one has not been reported before. Three families from Pakistan with GPR56 mutations have been reported before. With the addition of our findings, the total number of mutations reported in Pakistani patients now is six. These results increase our knowledge regarding the mutational spectrum of the GPR56 gene causing BFPP/ID.
Evaluation of the reproducibility of amplicon sequencing with Illumina MiSeq platform
Van Nostrand, Joy D.; Ning, Daliang; Sun, Bo; Xue, Kai; Liu, Feifei; Deng, Ye; Liang, Yuting; Zhou, Jizhong
2017-01-01
Illumina’s MiSeq has become the dominant platform for gene amplicon sequencing in microbial ecology studies; however, various technical concerns, such as reproducibility, still exist. To assess reproducibility, 16S rRNA gene amplicons from 18 soil samples of a reciprocal transplantation experiment were sequenced on an Illumina MiSeq. The V4 region of 16S rRNA gene from each sample was sequenced in triplicate with each replicate having a unique barcode. The average OTU overlap, without considering sequence abundance, at a rarefaction level of 10,323 sequences was 33.4±2.1% and 20.2±1.7% between two and among three technical replicates, respectively. When OTU sequence abundance was considered, the average sequence abundance weighted OTU overlap was 85.6±1.6% and 81.2±2.1% for two and three replicates, respectively. Removing singletons significantly increased the overlap for both (~1–3%, p<0.001). Increasing the sequencing depth to 160,000 reads by deep sequencing increased OTU overlap both when sequence abundance was considered (95%) and when not (44%). However, if singletons were not removed the overlap between two technical replicates (not considering sequence abundance) plateaus at 39% with 30,000 sequences. Diversity measures were not affected by the low overlap as α-diversities were similar among technical replicates while β-diversities (Bray-Curtis) were much smaller among technical replicates than among treatment replicates (e.g., 0.269 vs. 0.374). Higher diversity coverage, but lower OTU overlap, was observed when replicates were sequenced in separate runs. Detrended correspondence analysis indicated that while there was considerable variation among technical replicates, the reproducibility was sufficient for detecting treatment effects for the samples examined. These results suggest that although there is variation among technical replicates, amplicon sequencing on MiSeq is useful for analyzing microbial community structure if used appropriately and with caution. For example, including technical replicates, removing spurious sequences and unrepresentative OTUs, using a clustering method with a high stringency for OTU generation, estimating treatment effects at higher taxonomic levels, and adapting the unique molecular identifier (UMI) and other newly developed methods to lower PCR and sequencing error and to identify true low abundance rare species all can increase reproducibility. PMID:28453559
Evaluation of the reproducibility of amplicon sequencing with Illumina MiSeq platform
DOE Office of Scientific and Technical Information (OSTI.GOV)
Wen, Chongqing; Wu, Liyou; Qin, Yujia
Illumina's MiSeq has become the dominant platform for gene amplicon sequencing in microbial ecology studies; however, various technical concerns, such as reproducibility, still exist. To assess reproducibility, 16S rRNA gene amplicons from 18 soil samples of a reciprocal transplantation experiment were sequenced on an Illumina MiSeq. The V4 region of 16S rRNA gene from each sample was sequenced in triplicate with each replicate having a unique barcode. The average OTU overlap, without considering sequence abundance, at a rarefaction level of 10,323 sequences was 33.4±2.1% and 20.2±1.7% between two and among three technical replicates, respectively. When OTU sequence abundance was considered,more » the average sequence abundance weighted OTU overlap was 85.6±1.6% and 81.2±2.1% for two and three replicates, respectively. Removing singletons significantly increased the overlap for both (~1-3%, p<0.001). Increasing the sequencing depth to 160,000 reads by deep sequencing increased OTU overlap both when sequence abundance was considered (95%) and when not (44%). However, if singletons were not removed the overlap between two technical replicates (not considering sequence abundance) plateaus at 39% with 30,000 sequences. Diversity measures were not affected by the low overlap as α-diversities were similar among technical replicates while β-diversities (Bray-Curtis) were much smaller among technical replicates than among treatment replicates (e.g., 0.269 vs. 0.374). Higher diversity coverage, but lower OTU overlap, was observed when replicates were sequenced in separate runs. Detrended correspondence analysis indicated that while there was considerable variation among technical replicates, the reproducibility was sufficient for detecting treatment effects for the samples examined. These results suggest that although there is variation among technical replicates, amplicon sequencing on MiSeq is useful for analyzing microbial community structure if used appropriately and with caution. For example, including technical replicates, removing spurious sequences and unrepresentative OTUs, using a clustering method with a high stringency for OTU generation, estimating treatment effects at higher taxonomic levels, and adapting the unique molecular identifier (UMI) and other newly developed methods to lower PCR and sequencing error and to identify true low abundance rare species all can increase reproducibility.« less
Evaluation of the reproducibility of amplicon sequencing with Illumina MiSeq platform
Wen, Chongqing; Wu, Liyou; Qin, Yujia; ...
2017-04-28
Illumina's MiSeq has become the dominant platform for gene amplicon sequencing in microbial ecology studies; however, various technical concerns, such as reproducibility, still exist. To assess reproducibility, 16S rRNA gene amplicons from 18 soil samples of a reciprocal transplantation experiment were sequenced on an Illumina MiSeq. The V4 region of 16S rRNA gene from each sample was sequenced in triplicate with each replicate having a unique barcode. The average OTU overlap, without considering sequence abundance, at a rarefaction level of 10,323 sequences was 33.4±2.1% and 20.2±1.7% between two and among three technical replicates, respectively. When OTU sequence abundance was considered,more » the average sequence abundance weighted OTU overlap was 85.6±1.6% and 81.2±2.1% for two and three replicates, respectively. Removing singletons significantly increased the overlap for both (~1-3%, p<0.001). Increasing the sequencing depth to 160,000 reads by deep sequencing increased OTU overlap both when sequence abundance was considered (95%) and when not (44%). However, if singletons were not removed the overlap between two technical replicates (not considering sequence abundance) plateaus at 39% with 30,000 sequences. Diversity measures were not affected by the low overlap as α-diversities were similar among technical replicates while β-diversities (Bray-Curtis) were much smaller among technical replicates than among treatment replicates (e.g., 0.269 vs. 0.374). Higher diversity coverage, but lower OTU overlap, was observed when replicates were sequenced in separate runs. Detrended correspondence analysis indicated that while there was considerable variation among technical replicates, the reproducibility was sufficient for detecting treatment effects for the samples examined. These results suggest that although there is variation among technical replicates, amplicon sequencing on MiSeq is useful for analyzing microbial community structure if used appropriately and with caution. For example, including technical replicates, removing spurious sequences and unrepresentative OTUs, using a clustering method with a high stringency for OTU generation, estimating treatment effects at higher taxonomic levels, and adapting the unique molecular identifier (UMI) and other newly developed methods to lower PCR and sequencing error and to identify true low abundance rare species all can increase reproducibility.« less
Evaluation of the reproducibility of amplicon sequencing with Illumina MiSeq platform.
Wen, Chongqing; Wu, Liyou; Qin, Yujia; Van Nostrand, Joy D; Ning, Daliang; Sun, Bo; Xue, Kai; Liu, Feifei; Deng, Ye; Liang, Yuting; Zhou, Jizhong
2017-01-01
Illumina's MiSeq has become the dominant platform for gene amplicon sequencing in microbial ecology studies; however, various technical concerns, such as reproducibility, still exist. To assess reproducibility, 16S rRNA gene amplicons from 18 soil samples of a reciprocal transplantation experiment were sequenced on an Illumina MiSeq. The V4 region of 16S rRNA gene from each sample was sequenced in triplicate with each replicate having a unique barcode. The average OTU overlap, without considering sequence abundance, at a rarefaction level of 10,323 sequences was 33.4±2.1% and 20.2±1.7% between two and among three technical replicates, respectively. When OTU sequence abundance was considered, the average sequence abundance weighted OTU overlap was 85.6±1.6% and 81.2±2.1% for two and three replicates, respectively. Removing singletons significantly increased the overlap for both (~1-3%, p<0.001). Increasing the sequencing depth to 160,000 reads by deep sequencing increased OTU overlap both when sequence abundance was considered (95%) and when not (44%). However, if singletons were not removed the overlap between two technical replicates (not considering sequence abundance) plateaus at 39% with 30,000 sequences. Diversity measures were not affected by the low overlap as α-diversities were similar among technical replicates while β-diversities (Bray-Curtis) were much smaller among technical replicates than among treatment replicates (e.g., 0.269 vs. 0.374). Higher diversity coverage, but lower OTU overlap, was observed when replicates were sequenced in separate runs. Detrended correspondence analysis indicated that while there was considerable variation among technical replicates, the reproducibility was sufficient for detecting treatment effects for the samples examined. These results suggest that although there is variation among technical replicates, amplicon sequencing on MiSeq is useful for analyzing microbial community structure if used appropriately and with caution. For example, including technical replicates, removing spurious sequences and unrepresentative OTUs, using a clustering method with a high stringency for OTU generation, estimating treatment effects at higher taxonomic levels, and adapting the unique molecular identifier (UMI) and other newly developed methods to lower PCR and sequencing error and to identify true low abundance rare species all can increase reproducibility.
Okamoto, Nobuhiko; Nakashima, Mitsuko; Tsurusaki, Yoshinori; Miyake, Noriko; Saitsu, Hirotomo; Matsumoto, Naomichi
2013-01-01
Next-generation sequencing (NGS) combined with enrichment of target genes enables highly efficient and low-cost sequencing of multiple genes for genetic diseases. The aim of this study was to validate the accuracy and sensitivity of our method for comprehensive mutation detection in autism spectrum disorder (ASD). We assessed the performance of the bench-top Ion Torrent PGM and Illumina MiSeq platforms as optimized solutions for mutation detection, using microdroplet PCR-based enrichment of 62 ASD associated genes. Ten patients with known mutations were sequenced using NGS to validate the sensitivity of our method. The overall read quality was better with MiSeq, largely because of the increased indel-related error associated with PGM. The sensitivity of SNV detection was similar between the two platforms, suggesting they are both suitable for SNV detection in the human genome. Next, we used these methods to analyze 28 patients with ASD, and identified 22 novel variants in genes associated with ASD, with one mutation detected by MiSeq only. Thus, our results support the combination of target gene enrichment and NGS as a valuable molecular method for investigating rare variants in ASD. PMID:24066114
RNA-Seq for gene identification and transcript profiling of three Stevia rebaudiana genotypes.
Chen, Junwen; Hou, Kai; Qin, Peng; Liu, Hongchang; Yi, Bin; Yang, Wenting; Wu, Wei
2014-07-07
Stevia (Stevia rebaudiana) is an important medicinal plant that yields diterpenoid steviol glycosides (SGs). SGs are currently used in the preparation of medicines, food products and neutraceuticals because of its sweetening property (zero calories and about 300 times sweeter than sugar). Recently, some progress has been made in understanding the biosynthesis of SGs in Stevia, but little is known about the molecular mechanisms underlying this process. Additionally, the genomics of Stevia, a non-model species, remains uncharacterized. The recent advent of RNA-Seq, a next generation sequencing technology, provides an opportunity to expand the identification of Stevia genes through in-depth transcript profiling. We present a comprehensive landscape of the transcriptome profiles of three genotypes of Stevia with divergent SG compositions characterized using RNA-seq. 191,590,282 high-quality reads were generated and then assembled into 171,837 transcripts with an average sequence length of 969 base pairs. A total of 80,160 unigenes were annotated, and 14,211 of the unique sequences were assigned to specific metabolic pathways by the Kyoto Encyclopedia of Genes and Genomes. Gene sequences of all enzymes known to be involved in SG synthesis were examined. A total of 143 UDP-glucosyltransferase (UGT) unigenes were identified, some of which might be involved in SG biosynthesis. The expression patterns of eight of these genes were further confirmed by RT-QPCR. RNA-seq analysis identified candidate genes encoding enzymes responsible for the biosynthesis of SGs in Stevia, a non-model plant without a reference genome. The transcriptome data from this study yielded new insights into the process of SG accumulation in Stevia. Our results demonstrate that RNA-Seq can be successfully used for gene identification and transcript profiling in a non-model species.
MOCCS: Clarifying DNA-binding motif ambiguity using ChIP-Seq data.
Ozaki, Haruka; Iwasaki, Wataru
2016-08-01
As a key mechanism of gene regulation, transcription factors (TFs) bind to DNA by recognizing specific short sequence patterns that are called DNA-binding motifs. A single TF can accept ambiguity within its DNA-binding motifs, which comprise both canonical (typical) and non-canonical motifs. Clarification of such DNA-binding motif ambiguity is crucial for revealing gene regulatory networks and evaluating mutations in cis-regulatory elements. Although chromatin immunoprecipitation sequencing (ChIP-seq) now provides abundant data on the genomic sequences to which a given TF binds, existing motif discovery methods are unable to directly answer whether a given TF can bind to a specific DNA-binding motif. Here, we report a method for clarifying the DNA-binding motif ambiguity, MOCCS. Given ChIP-Seq data of any TF, MOCCS comprehensively analyzes and describes every k-mer to which that TF binds. Analysis of simulated datasets revealed that MOCCS is applicable to various ChIP-Seq datasets, requiring only a few minutes per dataset. Application to the ENCODE ChIP-Seq datasets proved that MOCCS directly evaluates whether a given TF binds to each DNA-binding motif, even if known position weight matrix models do not provide sufficient information on DNA-binding motif ambiguity. Furthermore, users are not required to provide numerous parameters or background genomic sequence models that are typically unavailable. MOCCS is implemented in Perl and R and is freely available via https://github.com/yuifu/moccs. By complementing existing motif-discovery software, MOCCS will contribute to the basic understanding of how the genome controls diverse cellular processes via DNA-protein interactions. Copyright © 2016 Elsevier Ltd. All rights reserved.
Cross-species inference of long non-coding RNAs greatly expands the ruminant transcriptome.
Bush, Stephen J; Muriuki, Charity; McCulloch, Mary E B; Farquhar, Iseabail L; Clark, Emily L; Hume, David A
2018-04-24
mRNA-like long non-coding RNAs (lncRNAs) are a significant component of mammalian transcriptomes, although most are expressed only at low levels, with high tissue-specificity and/or at specific developmental stages. Thus, in many cases lncRNA detection by RNA-sequencing (RNA-seq) is compromised by stochastic sampling. To account for this and create a catalogue of ruminant lncRNAs, we compared de novo assembled lncRNAs derived from large RNA-seq datasets in transcriptional atlas projects for sheep and goats with previous lncRNAs assembled in cattle and human. We then combined the novel lncRNAs with the sheep transcriptional atlas to identify co-regulated sets of protein-coding and non-coding loci. Few lncRNAs could be reproducibly assembled from a single dataset, even with deep sequencing of the same tissues from multiple animals. Furthermore, there was little sequence overlap between lncRNAs that were assembled from pooled RNA-seq data. We combined positional conservation (synteny) with cross-species mapping of candidate lncRNAs to identify a consensus set of ruminant lncRNAs and then used the RNA-seq data to demonstrate detectable and reproducible expression in each species. In sheep, 20 to 30% of lncRNAs were located close to protein-coding genes with which they are strongly co-expressed, which is consistent with the evolutionary origin of some ncRNAs in enhancer sequences. Nevertheless, most of the lncRNAs are not co-expressed with neighbouring protein-coding genes. Alongside substantially expanding the ruminant lncRNA repertoire, the outcomes of our analysis demonstrate that stochastic sampling can be partly overcome by combining RNA-seq datasets from related species. This has practical implications for the future discovery of lncRNAs in other species.
SeqAPASS: Predicting chemical susceptibility to threatened/endangered species
Conservation of a molecular target across species can be used as a line-of-evidence to predict the likelihood of chemical susceptibility. The web-based Sequence Alignment to Predict Across Species Susceptibility (SeqAPASS; https://seqapass.epa.gov/seqapass/) application was devel...
Genetics of intellectual disability in consanguineous families.
Hu, Hao; Kahrizi, Kimia; Musante, Luciana; Fattahi, Zohreh; Herwig, Ralf; Hosseini, Masoumeh; Oppitz, Cornelia; Abedini, Seyedeh Sedigheh; Suckow, Vanessa; Larti, Farzaneh; Beheshtian, Maryam; Lipkowitz, Bettina; Akhtarkhavari, Tara; Mehvari, Sepideh; Otto, Sabine; Mohseni, Marzieh; Arzhangi, Sanaz; Jamali, Payman; Mojahedi, Faezeh; Taghdiri, Maryam; Papari, Elaheh; Soltani Banavandi, Mohammad Javad; Akbari, Saeide; Tonekaboni, Seyed Hassan; Dehghani, Hossein; Ebrahimpour, Mohammad Reza; Bader, Ingrid; Davarnia, Behzad; Cohen, Monika; Khodaei, Hossein; Albrecht, Beate; Azimi, Sarah; Zirn, Birgit; Bastami, Milad; Wieczorek, Dagmar; Bahrami, Gholamreza; Keleman, Krystyna; Vahid, Leila Nouri; Tzschach, Andreas; Gärtner, Jutta; Gillessen-Kaesbach, Gabriele; Varaghchi, Jamileh Rezazadeh; Timmermann, Bernd; Pourfatemi, Fatemeh; Jankhah, Aria; Chen, Wei; Nikuei, Pooneh; Kalscheuer, Vera M; Oladnabi, Morteza; Wienker, Thomas F; Ropers, Hans-Hilger; Najmabadi, Hossein
2018-01-04
Autosomal recessive (AR) gene defects are the leading genetic cause of intellectual disability (ID) in countries with frequent parental consanguinity, which account for about 1/7th of the world population. Yet, compared to autosomal dominant de novo mutations, which are the predominant cause of ID in Western countries, the identification of AR-ID genes has lagged behind. Here, we report on whole exome and whole genome sequencing in 404 consanguineous predominantly Iranian families with two or more affected offspring. In 219 of these, we found likely causative variants, involving 77 known and 77 novel AR-ID (candidate) genes, 21 X-linked genes, as well as 9 genes previously implicated in diseases other than ID. This study, the largest of its kind published to date, illustrates that high-throughput DNA sequencing in consanguineous families is a superior strategy for elucidating the thousands of hitherto unknown gene defects underlying AR-ID, and it sheds light on their prevalence.
Tjhung, Katrina F; Deiss, Frédérique; Tran, Jessica; Chou, Ying; Derda, Ratmir
2015-01-01
In this paper, we describe multivalent display of peptide and protein sequences typically censored from traditional N-terminal display on protein pIII of filamentous bacteriophage M13. Using site-directed mutagenesis of commercially available M13KE phage cloning vector, we introduced sites that permit efficient cloning using restriction enzymes between domains N1 and N2 of the pIII protein. As infectivity of phage is directly linked to the integrity of the connection between N1 and N2 domains, intra-domain phage display (ID-PhD) allows for simple quality control of the display and the natural variations in the displayed sequences. Additionally, direct linkage to phage propagation allows efficient monitoring of sequence cleavage, providing a convenient system for selection and evolution of protease-susceptible or protease-resistant sequences. As an example of the benefits of such an ID-PhD system, we displayed a negatively charged FLAG sequence, which is known to be post-translationally excised from pIII when displayed on the N-terminus, as well as positively charged sequences which suppress production of phage when displayed on the N-terminus. ID-PhD of FLAG exhibited sub-nanomolar apparent Kd suggesting multivalent nature of the display. A TEV-protease recognition sequence (TEVrs) co-expressed in tandem with FLAG, allowed us to demonstrate that 99.9997% of the phage displayed the FLAG-TEVrs tandem and can be recognized and cleaved by TEV-protease. The residual 0.0003% consisted of phage clones that have excised the insert from their genome. ID-PhD is also amenable to display of protein mini-domains, such as the 33-residue minimized Z-domain of protein A. We show that it is thus possible to use ID-PhD for multivalent display and selection of mini-domain proteins (Affibodies, scFv, etc.).
Makita, Yuko; Kawashima, Mika; Lau, Nyok Sean; Othman, Ahmad Sofiman; Matsui, Minami
2018-01-19
Natural rubber is an economically important material. Currently the Pará rubber tree, Hevea brasiliensis is the main commercial source. Little is known about rubber biosynthesis at the molecular level. Next-generation sequencing (NGS) technologies brought draft genomes of three rubber cultivars and a variety of RNA sequencing (RNA-seq) data. However, no current genome or transcriptome databases (DB) are organized by gene. A gene-oriented database is a valuable support for rubber research. Based on our original draft genome sequence of H. brasiliensis RRIM600, we constructed a rubber tree genome and transcriptome DB. Our DB provides genome information including gene functional annotations and multi-transcriptome data of RNA-seq, full-length cDNAs including PacBio Isoform sequencing (Iso-Seq), ESTs and genome wide transcription start sites (TSSs) derived from CAGE technology. Using our original and publically available RNA-seq data, we calculated co-expressed genes for identifying functionally related gene sets and/or genes regulated by the same transcription factor (TF). Users can access multi-transcriptome data through both a gene-oriented web page and a genome browser. For the gene searching system, we provide keyword search, sequence homology search and gene expression search; users can also select their expression threshold easily. The rubber genome and transcriptome DB provides rubber tree genome sequence and multi-transcriptomics data. This DB is useful for comprehensive understanding of the rubber transcriptome. This will assist both industrial and academic researchers for rubber and economically important close relatives such as R. communis, M. esculenta and J. curcas. The Rubber Transcriptome DB release 2017.03 is accessible at http://matsui-lab.riken.jp/rubber/ .
A computational method for estimating the PCR duplication rate in DNA and RNA-seq experiments.
Bansal, Vikas
2017-03-14
PCR amplification is an important step in the preparation of DNA sequencing libraries prior to high-throughput sequencing. PCR amplification introduces redundant reads in the sequence data and estimating the PCR duplication rate is important to assess the frequency of such reads. Existing computational methods do not distinguish PCR duplicates from "natural" read duplicates that represent independent DNA fragments and therefore, over-estimate the PCR duplication rate for DNA-seq and RNA-seq experiments. In this paper, we present a computational method to estimate the average PCR duplication rate of high-throughput sequence datasets that accounts for natural read duplicates by leveraging heterozygous variants in an individual genome. Analysis of simulated data and exome sequence data from the 1000 Genomes project demonstrated that our method can accurately estimate the PCR duplication rate on paired-end as well as single-end read datasets which contain a high proportion of natural read duplicates. Further, analysis of exome datasets prepared using the Nextera library preparation method indicated that 45-50% of read duplicates correspond to natural read duplicates likely due to fragmentation bias. Finally, analysis of RNA-seq datasets from individuals in the 1000 Genomes project demonstrated that 70-95% of read duplicates observed in such datasets correspond to natural duplicates sampled from genes with high expression and identified outlier samples with a 2-fold greater PCR duplication rate than other samples. The method described here is a useful tool for estimating the PCR duplication rate of high-throughput sequence datasets and for assessing the fraction of read duplicates that correspond to natural read duplicates. An implementation of the method is available at https://github.com/vibansal/PCRduplicates .
Xu, Joshua; Gong, Binsheng; Wu, Leihong; Thakkar, Shraddha; Hong, Huixiao; Tong, Weida
2016-03-15
Studies on gene expression in response to therapy have led to the discovery of pharmacogenomics biomarkers and advances in precision medicine. Whole transcriptome sequencing (RNA-seq) is an emerging tool for profiling gene expression and has received wide adoption in the biomedical research community. However, its value in regulatory decision making requires rigorous assessment and consensus between various stakeholders, including the research community, regulatory agencies, and industry. The FDA-led SEquencing Quality Control (SEQC) consortium has made considerable progress in this direction, and is the subject of this review. Specifically, three RNA-seq platforms (Illumina HiSeq, Life Technologies SOLiD, and Roche 454) were extensively evaluated at multiple sites to assess cross-site and cross-platform reproducibility. The results demonstrated that relative gene expression measurements were consistently comparable across labs and platforms, but not so for the measurement of absolute expression levels. As part of the quality evaluation several studies were included to evaluate the utility of RNA-seq in clinical settings and safety assessment. The neuroblastoma study profiled tumor samples from 498 pediatric neuroblastoma patients by both microarray and RNA-seq. RNA-seq offers more utilities than microarray in determining the transcriptomic characteristics of cancer. However, RNA-seq and microarray-based models were comparable in clinical endpoint prediction, even when including additional features unique to RNA-seq beyond gene expression. The toxicogenomics study compared microarray and RNA-seq profiles of the liver samples from rats exposed to 27 different chemicals representing multiple toxicity modes of action. Cross-platform concordance was dependent on chemical treatment and transcript abundance. Though both RNA-seq and microarray are suitable for developing gene expression based predictive models with comparable prediction performance, RNA-seq offers advantages over microarray in profiling genes with low expression. The rat BodyMap study provided a comprehensive rat transcriptomic body map by performing RNA-Seq on 320 samples from 11 organs in either sex of juvenile, adolescent, adult and aged Fischer 344 rats. Lastly, the transferability study demonstrated that signature genes of predictive models are reciprocally transferable between microarray and RNA-seq data for model development using a comprehensive approach with two large clinical data sets. This result suggests continued usefulness of legacy microarray data in the coming RNA-seq era. In conclusion, the SEQC project enhances our understanding of RNA-seq and provides valuable guidelines for RNA-seq based clinical application and safety evaluation to advance precision medicine.
Goya, Stephanie; Valinotto, Laura E; Tittarelli, Estefania; Rojo, Gabriel L; Nabaes Jodar, Mercedes S; Greninger, Alexander L; Zaiat, Jonathan J; Marti, Marcelo A; Mistchenko, Alicia S; Viegas, Mariana
2018-01-01
Over the last decade, the number of viral genome sequences deposited in available databases has grown exponentially. However, sequencing methodology vary widely and many published works have relied on viral enrichment by viral culture or nucleic acid amplification with specific primers rather than through unbiased techniques such as metagenomics. The genome of RNA viruses is highly variable and these enrichment methodologies may be difficult to achieve or may bias the results. In order to obtain genomic sequences of human respiratory syncytial virus (HRSV) from positive nasopharyngeal aspirates diverse methodologies were evaluated and compared. A total of 29 nearly complete and complete viral genomes were obtained. The best performance was achieved with a DNase I treatment to the RNA directly extracted from the nasopharyngeal aspirate (NPA), sequence-independent single-primer amplification (SISPA) and library preparation performed with Nextera XT DNA Library Prep Kit with manual normalization. An average of 633,789 and 1,674,845 filtered reads per library were obtained with MiSeq and NextSeq 500 platforms, respectively. The higher output of NextSeq 500 was accompanied by the increasing of duplicated reads percentage generated during SISPA (from an average of 1.5% duplicated viral reads in MiSeq to an average of 74% in NextSeq 500). HRSV genome recovery was not affected by the presence or absence of duplicated reads but the computational demand during the analysis was increased. Considering that only samples with viral load ≥ E+06 copies/ml NPA were tested, no correlation between sample viral loads and number of total filtered reads was observed, nor with the mapped viral reads. The HRSV genomes showed a mean coverage of 98.46% with the best methodology. In addition, genomes of human metapneumovirus (HMPV), human rhinovirus (HRV) and human parainfluenza virus types 1-3 (HPIV1-3) were also obtained with the selected optimal methodology.
Grape RNA-Seq analysis pipeline environment
Knowles, David G.; Röder, Maik; Merkel, Angelika; Guigó, Roderic
2013-01-01
Motivation: The avalanche of data arriving since the development of NGS technologies have prompted the need for developing fast, accurate and easily automated bioinformatic tools capable of dealing with massive datasets. Among the most productive applications of NGS technologies is the sequencing of cellular RNA, known as RNA-Seq. Although RNA-Seq provides similar or superior dynamic range than microarrays at similar or lower cost, the lack of standard and user-friendly pipelines is a bottleneck preventing RNA-Seq from becoming the standard for transcriptome analysis. Results: In this work we present a pipeline for processing and analyzing RNA-Seq data, that we have named Grape (Grape RNA-Seq Analysis Pipeline Environment). Grape supports raw sequencing reads produced by a variety of technologies, either in FASTA or FASTQ format, or as prealigned reads in SAM/BAM format. A minimal Grape configuration consists of the file location of the raw sequencing reads, the genome of the species and the corresponding gene and transcript annotation. Grape first runs a set of quality control steps, and then aligns the reads to the genome, a step that is omitted for prealigned read formats. Grape next estimates gene and transcript expression levels, calculates exon inclusion levels and identifies novel transcripts. Grape can be run on a single computer or in parallel on a computer cluster. It is distributed with specific mapping and quantification tools, but given its modular design, any tool supporting popular data interchange formats can be integrated. Availability: Grape can be obtained from the Bioinformatics and Genomics website at: http://big.crg.cat/services/grape. Contact: david.gonzalez@crg.eu or roderic.guigo@crg.eu PMID:23329413
Fontana, Carla; Favaro, Marco; Pelliccioni, Marco; Pistoia, Enrico Salvatore; Favalli, Cartesio
2005-01-01
Reliable automated identification and susceptibility testing of clinically relevant bacteria is an essential routine for microbiology laboratories, thus improving patient care. Examples of automated identification systems include the Phoenix (Becton Dickinson) and the VITEK 2 (bioMérieux). However, more and more frequently, microbiologists must isolate “difficult” strains that automated systems often fail to identify. An alternative approach could be the genetic identification of isolates; this is based on 16S rRNA gene sequencing and analysis. The aim of the present study was to evaluate the possible use of MicroSeq 500 (Applera) for sequencing the 16S rRNA gene to identify isolates whose identification is unobtainable by conventional systems. We analyzed 83 “difficult” clinical isolates: 25 gram-positive and 58 gram-negative strains that were contemporaneously identified by both systems—VITEK 2 and Phoenix—while genetic identification was performed by using the MicroSeq 500 system. The results showed that phenotypic identifications by VITEK 2 and Phoenix were remarkably similar: 74% for gram-negative strains (43 of 58) and 80% for gram-positive strains were concordant by both systems and also concordant with genetic characterization. The exceptions were the 15 gram-negative and 9 gram-positive isolates whose phenotypic identifications were contrasting or inconclusive. For these, the use of MicroSeq 500 was fundamental to achieving species identification. In clinical microbiology the use of MicroSeq 500, particularly for strains with ambiguous biochemical profiles (including slow-growing strains), identifies strains more easily than do conventional systems. Moreover, MicroSeq 500 is easy to use and cost-effective, making it applicable also in the clinical laboratory. PMID:15695654
Insight into biases and sequencing errors for amplicon sequencing with the Illumina MiSeq platform.
Schirmer, Melanie; Ijaz, Umer Z; D'Amore, Rosalinda; Hall, Neil; Sloan, William T; Quince, Christopher
2015-03-31
With read lengths of currently up to 2 × 300 bp, high throughput and low sequencing costs Illumina's MiSeq is becoming one of the most utilized sequencing platforms worldwide. The platform is manageable and affordable even for smaller labs. This enables quick turnaround on a broad range of applications such as targeted gene sequencing, metagenomics, small genome sequencing and clinical molecular diagnostics. However, Illumina error profiles are still poorly understood and programs are therefore not designed for the idiosyncrasies of Illumina data. A better knowledge of the error patterns is essential for sequence analysis and vital if we are to draw valid conclusions. Studying true genetic variation in a population sample is fundamental for understanding diseases, evolution and origin. We conducted a large study on the error patterns for the MiSeq based on 16S rRNA amplicon sequencing data. We tested state-of-the-art library preparation methods for amplicon sequencing and showed that the library preparation method and the choice of primers are the most significant sources of bias and cause distinct error patterns. Furthermore we tested the efficiency of various error correction strategies and identified quality trimming (Sickle) combined with error correction (BayesHammer) followed by read overlapping (PANDAseq) as the most successful approach, reducing substitution error rates on average by 93%. © The Author(s) 2015. Published by Oxford University Press on behalf of Nucleic Acids Research.
Validation of Pooled Whole-Genome Re-Sequencing in Arabidopsis lyrata.
Fracassetti, Marco; Griffin, Philippa C; Willi, Yvonne
2015-01-01
Sequencing pooled DNA of multiple individuals from a population instead of sequencing individuals separately has become popular due to its cost-effectiveness and simple wet-lab protocol, although some criticism of this approach remains. Here we validated a protocol for pooled whole-genome re-sequencing (Pool-seq) of Arabidopsis lyrata libraries prepared with low amounts of DNA (1.6 ng per individual). The validation was based on comparing single nucleotide polymorphism (SNP) frequencies obtained by pooling with those obtained by individual-based Genotyping By Sequencing (GBS). Furthermore, we investigated the effect of sample number, sequencing depth per individual and variant caller on population SNP frequency estimates. For Pool-seq data, we compared frequency estimates from two SNP callers, VarScan and Snape; the former employs a frequentist SNP calling approach while the latter uses a Bayesian approach. Results revealed concordance correlation coefficients well above 0.8, confirming that Pool-seq is a valid method for acquiring population-level SNP frequency data. Higher accuracy was achieved by pooling more samples (25 compared to 14) and working with higher sequencing depth (4.1× per individual compared to 1.4× per individual), which increased the concordance correlation coefficient to 0.955. The Bayesian-based SNP caller produced somewhat higher concordance correlation coefficients, particularly at low sequencing depth. We recommend pooling at least 25 individuals combined with sequencing at a depth of 100× to produce satisfactory frequency estimates for common SNPs (minor allele frequency above 0.05).
DOE Office of Scientific and Technical Information (OSTI.GOV)
Daum, Christopher; Zane, Matthew; Han, James
2011-01-31
The U.S. Department of Energy (DOE) Joint Genome Institute's (JGI) Production Sequencing group is committed to the generation of high-quality genomic DNA sequence to support the mission areas of renewable energy generation, global carbon management, and environmental characterization and clean-up. Within the JGI's Production Sequencing group, a robust Illumina Genome Analyzer and HiSeq pipeline has been established. Optimization of the sesequencer pipelines has been ongoing with the aim of continual process improvement of the laboratory workflow, reducing operational costs and project cycle times to increases ample throughput, and improving the overall quality of the sequence generated. A sequence QC analysismore » pipeline has been implemented to automatically generate read and assembly level quality metrics. The foremost of these optimization projects, along with sequencing and operational strategies, throughput numbers, and sequencing quality results will be presented.« less
Accurate and predictive antibody repertoire profiling by molecular amplification fingerprinting.
Khan, Tarik A; Friedensohn, Simon; Gorter de Vries, Arthur R; Straszewski, Jakub; Ruscheweyh, Hans-Joachim; Reddy, Sai T
2016-03-01
High-throughput antibody repertoire sequencing (Ig-seq) provides quantitative molecular information on humoral immunity. However, Ig-seq is compromised by biases and errors introduced during library preparation and sequencing. By using synthetic antibody spike-in genes, we determined that primer bias from multiplex polymerase chain reaction (PCR) library preparation resulted in antibody frequencies with only 42 to 62% accuracy. Additionally, Ig-seq errors resulted in antibody diversity measurements being overestimated by up to 5000-fold. To rectify this, we developed molecular amplification fingerprinting (MAF), which uses unique molecular identifier (UID) tagging before and during multiplex PCR amplification, which enabled tagging of transcripts while accounting for PCR efficiency. Combined with a bioinformatic pipeline, MAF bias correction led to measurements of antibody frequencies with up to 99% accuracy. We also used MAF to correct PCR and sequencing errors, resulting in enhanced accuracy of full-length antibody diversity measurements, achieving 98 to 100% error correction. Using murine MAF-corrected data, we established a quantitative metric of recent clonal expansion-the intraclonal diversity index-which measures the number of unique transcripts associated with an antibody clone. We used this intraclonal diversity index along with antibody frequencies and somatic hypermutation to build a logistic regression model for prediction of the immunological status of clones. The model was able to predict clonal status with high confidence but only when using MAF error and bias corrected Ig-seq data. Improved accuracy by MAF provides the potential to greatly advance Ig-seq and its utility in immunology and biotechnology.
Accurate and predictive antibody repertoire profiling by molecular amplification fingerprinting
Khan, Tarik A.; Friedensohn, Simon; de Vries, Arthur R. Gorter; Straszewski, Jakub; Ruscheweyh, Hans-Joachim; Reddy, Sai T.
2016-01-01
High-throughput antibody repertoire sequencing (Ig-seq) provides quantitative molecular information on humoral immunity. However, Ig-seq is compromised by biases and errors introduced during library preparation and sequencing. By using synthetic antibody spike-in genes, we determined that primer bias from multiplex polymerase chain reaction (PCR) library preparation resulted in antibody frequencies with only 42 to 62% accuracy. Additionally, Ig-seq errors resulted in antibody diversity measurements being overestimated by up to 5000-fold. To rectify this, we developed molecular amplification fingerprinting (MAF), which uses unique molecular identifier (UID) tagging before and during multiplex PCR amplification, which enabled tagging of transcripts while accounting for PCR efficiency. Combined with a bioinformatic pipeline, MAF bias correction led to measurements of antibody frequencies with up to 99% accuracy. We also used MAF to correct PCR and sequencing errors, resulting in enhanced accuracy of full-length antibody diversity measurements, achieving 98 to 100% error correction. Using murine MAF-corrected data, we established a quantitative metric of recent clonal expansion—the intraclonal diversity index—which measures the number of unique transcripts associated with an antibody clone. We used this intraclonal diversity index along with antibody frequencies and somatic hypermutation to build a logistic regression model for prediction of the immunological status of clones. The model was able to predict clonal status with high confidence but only when using MAF error and bias corrected Ig-seq data. Improved accuracy by MAF provides the potential to greatly advance Ig-seq and its utility in immunology and biotechnology. PMID:26998518
Yu, Miao; Ji, Lexiang; Neumann, Drexel A.; ...
2015-07-15
Restriction-modification (R-M) systems pose a major barrier to DNA transformation and genetic engineering of bacterial species. Systematic identification of DNA methylation in R-M systems, including N 6-methyladenine (6mA), 5-methylcytosine (5mC) and N 4-methylcytosine (4mC), will enable strategies to make these species genetically tractable. Although single-molecule, real time (SMRT) sequencing technology is capable of detecting 4mC directly for any bacterial species regardless of whether an assembled genome exists or not, it is not as scalable to profiling hundreds to thousands of samples compared with the commonly used next-generation sequencing technologies. Here, we present 4mC-Tet-assisted bisulfite-sequencing (4mC-TAB-seq), a next-generation sequencing method thatmore » rapidly and cost efficiently reveals the genome-wide locations of 4mC for bacterial species with an available assembled reference genome. In 4mC-TAB-seq, both cytosines and 5mCs are read out as thymines, whereas only 4mCs are read out as cytosines, revealing their specific positions throughout the genome. We applied 4mC-TAB-seq to study the methylation of a member of the hyperthermophilc genus, Caldicellulosiruptor, in which 4mC-related restriction is a major barrier to DNA transformation from other species. Lastly, in combination with MethylC-seq, both 4mC- and 5mC-containing motifs are identified which can assist in rapid and efficient genetic engineering of these bacteria in the future.« less
SeqAPASS v3.0 for Extrapolation of Toxicity Knowledge Across Species
The U.S. Environmental Protection Agency Sequence Alignment to Predict Across Species Susceptibility (SeqAPASS; https://seqapass.epa.gov/seqapass/) tool was initially released to the public in 2016, providing a novel means to begin to address challenges in extrapolating toxicity ...
DNA Breaks and End Resection Measured Genome-wide by End Sequencing.
Canela, Andres; Sridharan, Sriram; Sciascia, Nicholas; Tubbs, Anthony; Meltzer, Paul; Sleckman, Barry P; Nussenzweig, André
2016-09-01
DNA double-strand breaks (DSBs) arise during physiological transcription, DNA replication, and antigen receptor diversification. Mistargeting or misprocessing of DSBs can result in pathological structural variation and mutation. Here we describe a sensitive method (END-seq) to monitor DNA end resection and DSBs genome-wide at base-pair resolution in vivo. We utilized END-seq to determine the frequency and spectrum of restriction-enzyme-, zinc-finger-nuclease-, and RAG-induced DSBs. Beyond sequence preference, chromatin features dictate the repertoire of these genome-modifying enzymes. END-seq can detect at least one DSB per cell among 10,000 cells not harboring DSBs, and we estimate that up to one out of 60 cells contains off-target RAG cleavage. In addition to site-specific cleavage, we detect DSBs distributed over extended regions during immunoglobulin class-switch recombination. Thus, END-seq provides a snapshot of DNA ends genome-wide, which can be utilized for understanding genome-editing specificities and the influence of chromatin on DSB pathway choice. Published by Elsevier Inc.
DrImpute: imputing dropout events in single cell RNA sequencing data.
Gong, Wuming; Kwak, Il-Youp; Pota, Pruthvi; Koyano-Nakagawa, Naoko; Garry, Daniel J
2018-06-08
The single cell RNA sequencing (scRNA-seq) technique begin a new era by allowing the observation of gene expression at the single cell level. However, there is also a large amount of technical and biological noise. Because of the low number of RNA transcriptomes and the stochastic nature of the gene expression pattern, there is a high chance of missing nonzero entries as zero, which are called dropout events. We develop DrImpute to impute dropout events in scRNA-seq data. We show that DrImpute has significantly better performance on the separation of the dropout zeros from true zeros than existing imputation algorithms. We also demonstrate that DrImpute can significantly improve the performance of existing tools for clustering, visualization and lineage reconstruction of nine published scRNA-seq datasets. DrImpute can serve as a very useful addition to the currently existing statistical tools for single cell RNA-seq analysis. DrImpute is implemented in R and is available at https://github.com/gongx030/DrImpute .
Hyb-Seq: Combining target enrichment and genome skimming for plant phylogenomics1
Weitemier, Kevin; Straub, Shannon C. K.; Cronn, Richard C.; Fishbein, Mark; Schmickl, Roswitha; McDonnell, Angela; Liston, Aaron
2014-01-01
• Premise of the study: Hyb-Seq, the combination of target enrichment and genome skimming, allows simultaneous data collection for low-copy nuclear genes and high-copy genomic targets for plant systematics and evolution studies. • Methods and Results: Genome and transcriptome assemblies for milkweed (Asclepias syriaca) were used to design enrichment probes for 3385 exons from 768 genes (>1.6 Mbp) followed by Illumina sequencing of enriched libraries. Hyb-Seq of 12 individuals (10 Asclepias species and two related genera) resulted in at least partial assembly of 92.6% of exons and 99.7% of genes and an average assembly length >2 Mbp. Importantly, complete plastomes and nuclear ribosomal DNA cistrons were assembled using off-target reads. Phylogenomic analyses demonstrated signal conflict between genomes. • Conclusions: The Hyb-Seq approach enables targeted sequencing of thousands of low-copy nuclear exons and flanking regions, as well as genome skimming of high-copy repeats and organellar genomes, to efficiently produce genome-scale data sets for phylogenomics. PMID:25225629
visnormsc: A Graphical User Interface to Normalize Single-cell RNA Sequencing Data.
Tang, Lijun; Zhou, Nan
2017-12-26
Single-cell RNA sequencing (RNA-seq) allows the analysis of gene expression with high resolution. The intrinsic defects of this promising technology imports technical noise into the single-cell RNA-seq data, increasing the difficulty of accurate downstream inference. Normalization is a crucial step in single-cell RNA-seq data pre-processing. SCnorm is an accurate and efficient method that can be used for this purpose. An R implementation of this method is currently available. On one hand, the R package possesses many excellent features from R. On the other hand, R programming ability is required, which prevents the biologists who lack the skills from learning to use it quickly. To make this method more user-friendly, we developed a graphical user interface, visnormsc, for normalization of single-cell RNA-seq data. It is implemented in Python and is freely available at https://github.com/solo7773/visnormsc . Although visnormsc is based on the existing method, it contributes to this field by offering a user-friendly alternative. The out-of-the-box and cross-platform features make visnormsc easy to learn and to use. It is expected to serve biologists by simplifying single-cell RNA-seq normalization.
Guo, Fei; Yu, Jiao; Zhang, Lu; Li, Jun
2017-11-01
The ForenSeq™ DNA Signature Prep Kit (ForenSeq Kit) is designed to detect more than 200 forensically relevant markers in a single reaction on the MiSeq FGx™ Forensic Genomics System (MiSeq FGx System), including Amelogenin, 27 autosomal short tandem repeats (A-STRs), 7 X chromosomal STRs (X-STRs), 24 Y chromosomal STRs (Y-STRs) and 94 identity-informative single nucleotide polymorphisms (iSNPs) with the option to contain 22 phenotypic-informative SNPs (pSNPs) and 56 ancestry-informative SNPs (aSNPs). In this study, we evaluated the MiSeq FGx System on three major parts: methodological optimization (DNA extraction, sample quantification, library normalization, diluted libraries concentration, and sample-to-cell arrangement), massively parallel sequencing (MPS) performance (depth of coverage, sequence coverage ratio, and allele coverage ratio), and ForenSeq Kit characteristics (repeatability and concordance, sensitivity, mixture, stability and case-type samples). Results showed that quantitative polymerase chain reaction (qPCR)-based sample quantification and library normalization and the appropriate number of pooled libraries and concentration of diluted libraries provided a greater level of MPS performance and repeatability. Repeatable and concordant genotypes were obtained by the ForenSeq Kit. Full profiles were obtained from ≥100pg input DNA for STRs and ≥200pg for SNPs. A sample with ≥5% minor contributors was considered as a mixture by imbalanced allele coverage ratio distribution, and full profiles from minor contributors were easily detected between 9:1 and 1:9 mixtures with known reference profiles. The ForenSeq Kit tolerated considerable concentrations of inhibitors like ≤200μM hematin and ≤50μg/ml humic acid, and >56% STR profiles and >88% SNP profiles were obtained from ≥200-bp degraded samples. Also, it was adapted to case-type samples. As a whole, the ForenSeq Kit is a well-performed, robust, reliable, reproducible and highly informative assay, and it can fully meet requirements for human identification. Further, sensitive QC indicator and automated sample comparison function in the ForenSeq™ Universal Analysis Software are quite helpful, so that we can concentrate on questionable genotypes and avoid tedious and time-consuming labor to maximum the time spent in data analysis. Copyright © 2017 Elsevier B.V. All rights reserved.
Véliz, David; Vega-Retter, Caren; Quezada-Romegialli, Claudio
2016-01-01
The complete sequence of the mitochondrial genome for the Chilean silverside Basilichthys microlepidotus is reported for the first time. The entire mitochondrial genome was 16,544 bp in length (GenBank accession no. KM245937); gene composition and arrangement was conformed to that reported for most fishes and contained the typical structure of 2 rRNAs, 13 protein-coding genes, 22 tRNAs and a non-coding region. The assembled mitogenome was validated against sequences of COI and Control Region previously sequenced in our lab, functional genes from RNA-Seq data for the same species and the mitogenome of two other atherinopsid species available in Genbank.
Identification of an active ID-like group of SINEs in the mouse
Kass, David H; Jamison, Nicole
2007-01-01
The mouse genome consists of five known families of SINEs: B1, B2, B4/RSINE, ID, and MIR. Using RT-PCR we identified a germ-line transcript that demonstrates 92.7% sequence identity to ID (excluding primer sequence), yet a BLAST search identified numerous matches of 100% sequence identity. We analyzed four of these elements for their presence in orthologous genes in strains and subspecies of M. musculus as well as other species of Mus using a PCR-based assay. All four analyzed elements were either identified only in M. musculus or exclusively in both M. musculus and M. domesticus indicative of recent integrations. In conjunction with the identification of transcripts, we present an active ID-like group of elements that is not derived from the proposed BC1 master gene of ID elements. A BLAST of the rat genome indicated that these elements were not in the rat. Therefore, this family of SINEs has recently evolved, and since thus far has mainly been observed in M. musculus, we then refer to this family as MMIDL. PMID:17572061
Identification of an active ID-like group of SINEs in the mouse.
Kass, David H; Jamison, Nicole
2007-09-01
The mouse genome consists of five known families of SINEs: B1, B2, B4/RSINE, ID, and MIR. Using RT-PCR we identified a germ-line transcript that demonstrates 92.7% sequence identity to ID (excluding primer sequence), yet a BLAST search identified numerous matches of 100% sequence identity. We analyzed four of these elements for their presence in orthologous genes in strains and subspecies of Mus musculus as well as other species of Mus using a PCR-based assay. All four analyzed elements were identified either only in M. musculus or exclusively in both M. musculus and M. domesticus, indicative of recent integrations. In conjunction with the identification of transcripts, we present an active ID-like group of elements that is not derived from the proposed BC1 master gene of ID elements. A BLAST of the rat genome indicated that these elements were not in the rat. Therefore, this family of SINEs has recently evolved, and since it has thus far been observed mainly in M. musculus, we refer to this family as MMIDL.
Using the Self-Select Paradigm to Delineate the Nature of Speech Motor Programming
Wright, David L.; Robin, Don A.; Rhee, Jooyhun; Vaculin, Amber; Jacks, Adam; Guenther, Frank H.; Fox, Peter T.
2015-01-01
Purpose The authors examined the involvement of 2 speech motor programming processes identified by S. T. Klapp (1995, 2003) during the articulation of utterances differing in syllable and sequence complexity. According to S. T. Klapp, 1 process, INT, resolves the demands of the programmed unit, whereas a second process, SEQ, oversees the serial order demands of longer sequences. Method A modified reaction time paradigm was used to assess INT and SEQ demands. Specifically, syllable complexity was dependent on syllable structure, whereas sequence complexity involved either repeated or unique syllabi within an utterance. Results INT execution was slowed when articulating single syllables in the form CCCV compared to simpler CV syllables. Planning unique syllables within a multisyllabic utterance rather than repetitions of the same syllable slowed INT but not SEQ. Conclusions The INT speech motor programming process, important for mental syllabary access, is sensitive to changes in both syllable structure and the number of unique syllables in an utterance. PMID:19474396
Sanders, Ashley D; Falconer, Ester; Hills, Mark; Spierings, Diana C J; Lansdorp, Peter M
2017-06-01
The ability to distinguish between genome sequences of homologous chromosomes in single cells is important for studies of copy-neutral genomic rearrangements (such as inversions and translocations), building chromosome-length haplotypes, refining genome assemblies, mapping sister chromatid exchange events and exploring cellular heterogeneity. Strand-seq is a single-cell sequencing technology that resolves the individual homologs within a cell by restricting sequence analysis to the DNA template strands used during DNA replication. This protocol, which takes up to 4 d to complete, relies on the directionality of DNA, in which each single strand of a DNA molecule is distinguished based on its 5'-3' orientation. Culturing cells in a thymidine analog for one round of cell division labels nascent DNA strands, allowing for their selective removal during genomic library construction. To preserve directionality of template strands, genomic preamplification is bypassed and labeled nascent strands are nicked and not amplified during library preparation. Each single-cell library is multiplexed for pooling and sequencing, and the resulting sequence data are aligned, mapping to either the minus or plus strand of the reference genome, to assign template strand states for each chromosome in the cell. The major adaptations to conventional single-cell sequencing protocols include harvesting of daughter cells after a single round of BrdU incorporation, bypassing of whole-genome amplification, and removal of the BrdU + strand during Strand-seq library preparation. By sequencing just template strands, the structure and identity of each homolog are preserved.
Sudhagar, Arun; El-Matbouli, Mansour
2018-01-01
In recent years, with the advent of next-generation sequencing along with the development of various bioinformatics tools, RNA sequencing (RNA-Seq)-based transcriptome analysis has become much more affordable in the field of biological research. This technique has even opened up avenues to explore the transcriptome of non-model organisms for which a reference genome is not available. This has made fish health researchers march towards this technology to understand pathogenic processes and immune reactions in fish during the event of infection. Recent studies using this technology have altered and updated the previous understanding of many diseases in fish. RNA-Seq has been employed in the understanding of fish pathogens like bacteria, virus, parasites, and oomycetes. Also, it has been helpful in unraveling the immune mechanisms in fish. Additionally, RNA-Seq technology has made its way for future works, such as genetic linkage mapping, quantitative trait analysis, disease-resistant strain or broodstock selection, and the development of effective vaccines and therapies. Until now, there are no reviews that comprehensively summarize the studies which made use of RNA-Seq to explore the mechanisms of infection of pathogens and the defense strategies of fish hosts. This review aims to summarize the contemporary understanding and findings with regard to infectious pathogens and the immune system of fish that have been achieved through RNA-Seq technology. PMID:29342931
Kim, Taemook; Seo, Hogyu David; Hennighausen, Lothar; Lee, Daeyoup
2018-01-01
Abstract Octopus-toolkit is a stand-alone application for retrieving and processing large sets of next-generation sequencing (NGS) data with a single step. Octopus-toolkit is an automated set-up-and-analysis pipeline utilizing the Aspera, SRA Toolkit, FastQC, Trimmomatic, HISAT2, STAR, Samtools, and HOMER applications. All the applications are installed on the user's computer when the program starts. Upon the installation, it can automatically retrieve original files of various epigenomic and transcriptomic data sets, including ChIP-seq, ATAC-seq, DNase-seq, MeDIP-seq, MNase-seq and RNA-seq, from the gene expression omnibus data repository. The downloaded files can then be sequentially processed to generate BAM and BigWig files, which are used for advanced analyses and visualization. Currently, it can process NGS data from popular model genomes such as, human (Homo sapiens), mouse (Mus musculus), dog (Canis lupus familiaris), plant (Arabidopsis thaliana), zebrafish (Danio rerio), fruit fly (Drosophila melanogaster), worm (Caenorhabditis elegans), and budding yeast (Saccharomyces cerevisiae) genomes. With the processed files from Octopus-toolkit, the meta-analysis of various data sets, motif searches for DNA-binding proteins, and the identification of differentially expressed genes and/or protein-binding sites can be easily conducted with few commands by users. Overall, Octopus-toolkit facilitates the systematic and integrative analysis of available epigenomic and transcriptomic NGS big data. PMID:29420797
Complex and dynamic landscape of RNA polyadenylation revealed by PAS-Seq
Shepard, Peter J.; Choi, Eun-A; Lu, Jente; Flanagan, Lisa A.; Hertel, Klemens J.; Shi, Yongsheng
2011-01-01
Alternative polyadenylation (APA) of mRNAs has emerged as an important mechanism for post-transcriptional gene regulation in higher eukaryotes. Although microarrays have recently been used to characterize APA globally, they have a number of serious limitations that prevents comprehensive and highly quantitative analysis. To better characterize APA and its regulation, we have developed a deep sequencing-based method called Poly(A) Site Sequencing (PAS-Seq) for quantitatively profiling RNA polyadenylation at the transcriptome level. PAS-Seq not only accurately and comprehensively identifies poly(A) junctions in mRNAs and noncoding RNAs, but also provides quantitative information on the relative abundance of polyadenylated RNAs. PAS-Seq analyses of human and mouse transcriptomes showed that 40%–50% of all expressed genes produce alternatively polyadenylated mRNAs. Furthermore, our study detected evolutionarily conserved polyadenylation of histone mRNAs and revealed novel features of mitochondrial RNA polyadenylation. Finally, PAS-Seq analyses of mouse embryonic stem (ES) cells, neural stem/progenitor (NSP) cells, and neurons not only identified more poly(A) sites than what was found in the entire mouse EST database, but also detected significant changes in the global APA profile that lead to lengthening of 3′ untranslated regions (UTR) in many mRNAs during stem cell differentiation. Together, our PAS-Seq analyses revealed a complex landscape of RNA polyadenylation in mammalian cells and the dynamic regulation of APA during stem cell differentiation. PMID:21343387
Tn5Prime, a Tn5 based 5' capture method for single cell RNA-seq.
Cole, Charles; Byrne, Ashley; Beaudin, Anna E; Forsberg, E Camilla; Vollmers, Christopher
2018-06-01
RNA-sequencing (RNA-seq) is a powerful technique to investigate and quantify entire transcriptomes. Recent advances in the field have made it possible to explore the transcriptomes of single cells. However, most widely used RNA-seq protocols fail to provide crucial information regarding transcription start sites. Here we present a protocol, Tn5Prime, that takes advantage of the Tn5 transposase-based Smart-seq2 protocol to create RNA-seq libraries that capture the 5' end of transcripts. The Tn5Prime method dramatically streamlines the 5' capture process and is both cost effective and reliable. By applying Tn5Prime to bulk RNA and single cell samples, we were able to define transcription start sites as well as quantify transcriptomes at high accuracy and reproducibility. Additionally, similar to 3' end-based high-throughput methods like Drop-seq and 10× Genomics Chromium, the 5' capture Tn5Prime method allows the introduction of cellular identifiers during reverse transcription, simplifying the analysis of large numbers of single cells. In contrast to 3' end-based methods, Tn5Prime also enables the assembly of the variable 5' ends of the antibody sequences present in single B-cell data. Therefore, Tn5Prime presents a robust tool for both basic and applied research into the adaptive immune system and beyond.
Yassour, Moran; Grabherr, Manfred; Blood, Philip D.; Bowden, Joshua; Couger, Matthew Brian; Eccles, David; Li, Bo; Lieber, Matthias; MacManes, Matthew D.; Ott, Michael; Orvis, Joshua; Pochet, Nathalie; Strozzi, Francesco; Weeks, Nathan; Westerman, Rick; William, Thomas; Dewey, Colin N.; Henschel, Robert; LeDuc, Richard D.; Friedman, Nir; Regev, Aviv
2013-01-01
De novo assembly of RNA-Seq data allows us to study transcriptomes without the need for a genome sequence, such as in non-model organisms of ecological and evolutionary importance, cancer samples, or the microbiome. In this protocol, we describe the use of the Trinity platform for de novo transcriptome assembly from RNA-Seq data in non-model organisms. We also present Trinity’s supported companion utilities for downstream applications, including RSEM for transcript abundance estimation, R/Bioconductor packages for identifying differentially expressed transcripts across samples, and approaches to identify protein coding genes. In an included tutorial we provide a workflow for genome-independent transcriptome analysis leveraging the Trinity platform. The software, documentation and demonstrations are freely available from http://trinityrnaseq.sf.net. PMID:23845962
Panek, Marina; Čipčić Paljetak, Hana; Barešić, Anja; Perić, Mihaela; Matijašić, Mario; Lojkić, Ivana; Vranešić Bender, Darija; Krznarić, Željko; Verbanac, Donatella
2018-03-23
The information on microbiota composition in the human gastrointestinal tract predominantly originates from the analyses of human faeces by application of next generation sequencing (NGS). However, the detected composition of the faecal bacterial community can be affected by various factors including experimental design and procedures. This study evaluated the performance of different protocols for collection and storage of faecal samples (native and OMNIgene.GUT system) and bacterial DNA extraction (MP Biomedicals, QIAGEN and MO BIO kits), using two NGS platforms for 16S rRNA gene sequencing (Ilumina MiSeq and Ion Torrent PGM). OMNIgene.GUT proved as a reliable and convenient system for collection and storage of faecal samples although favouring Sutterella genus. MP provided superior DNA yield and quality, MO BIO depleted Gram positive organisms while using QIAGEN with OMNIgene.GUT resulted in greatest variability compared to other two kits. MiSeq and IT platforms in their supplier recommended setups provided comparable reproducibility of donor faecal microbiota. The differences included higher diversity observed with MiSeq and increased capacity of MiSeq to detect Akkermansia muciniphila, [Odoribacteraceae], Erysipelotrichaceae and Ruminococcaceae (primarily Faecalibacterium prausnitzii). The results of our study could assist the investigators using NGS technologies to make informed decisions on appropriate tools for their experimental pipelines.
Rettig, Trisha A; Ward, Claire; Pecaut, Michael J; Chapes, Stephen K
2017-07-01
Spaceflight is known to affect immune cell populations. In particular, splenic B cell numbers decrease during spaceflight and in ground-based physiological models. Although antibody isotype changes have been assessed during and after space flight, an extensive characterization of the impact of spaceflight on antibody composition has not been conducted in mice. Next Generation Sequencing and bioinformatic tools are now available to assess antibody repertoires. We can now identify immunoglobulin gene- segment usage, junctional regions, and modifications that contribute to specificity and diversity. Due to limitations on the International Space Station, alternate sample collection and storage methods must be employed. Our group compared Illumina MiSeq sequencing data from multiple sample preparation methods in normal C57Bl/6J mice to validate that sample preparation and storage would not bias the outcome of antibody repertoire characterization. In this report, we also compared sequencing techniques and a bioinformatic workflow on the data output when we assessed the IgH and Igκ variable gene usage. This included assessments of our bioinformatic workflow on Illumina HiSeq and MiSeq datasets and is specifically designed to reduce bias, capture the most information from Ig sequences, and produce a data set that provides other data mining options. We validated our workflow by comparing our normal mouse MiSeq data to existing murine antibody repertoire studies validating it for future antibody repertoire studies.
Bioinformatic Analysis of the Human Recombinant Iduronate 2-Sulfate Sulfatase
Morales-Álvarez, Edwin D.; Rivera-Hoyos, Claudia M.; Landázuri, Patricia; Poutou-Piñales, Raúl A.; Pedroza-Rodríguez, Aura M.
2016-01-01
Mucopolysaccharidosis type II is a human recessive disease linked to the X chromosome caused by deficiency of lysosomal enzyme Iduronate 2-Sulfate Sulfatase (IDS), which leads to accumulation of glycosaminoglycans in tissues and organs. The human enzyme has been expressed in Escherichia coli and Pichia pastoris in attempt to develop more successful expression systems that allow the production of recombinant IDS for Enzyme Replacement Therapy (ERT). However, the preservation of native signal peptide in the sequence has caused conflicts in processing and recognition in the past, which led to problems in expression and enzyme activity. With the main object being the improvement of the expression system, we eliminate the native signal peptide of human recombinant IDS. The resulting sequence showed two modified codons, thus, our study aimed to analyze computationally the nucleotide sequence of the IDSnh without signal peptide in order to determine the 3D structure and other biochemical properties to compare them with the native human IDS (IDSnh). Results showed that there are no significant differences between both molecules in spite of the two-codon modifications detected in the recombinant DNA sequence. PMID:27335624
Brooks, Matthew J.; Rajasimha, Harsha K.; Roger, Jerome E.
2011-01-01
Purpose Next-generation sequencing (NGS) has revolutionized systems-based analysis of cellular pathways. The goals of this study are to compare NGS-derived retinal transcriptome profiling (RNA-seq) to microarray and quantitative reverse transcription polymerase chain reaction (qRT–PCR) methods and to evaluate protocols for optimal high-throughput data analysis. Methods Retinal mRNA profiles of 21-day-old wild-type (WT) and neural retina leucine zipper knockout (Nrl−/−) mice were generated by deep sequencing, in triplicate, using Illumina GAIIx. The sequence reads that passed quality filters were analyzed at the transcript isoform level with two methods: Burrows–Wheeler Aligner (BWA) followed by ANOVA (ANOVA) and TopHat followed by Cufflinks. qRT–PCR validation was performed using TaqMan and SYBR Green assays. Results Using an optimized data analysis workflow, we mapped about 30 million sequence reads per sample to the mouse genome (build mm9) and identified 16,014 transcripts in the retinas of WT and Nrl−/− mice with BWA workflow and 34,115 transcripts with TopHat workflow. RNA-seq data confirmed stable expression of 25 known housekeeping genes, and 12 of these were validated with qRT–PCR. RNA-seq data had a linear relationship with qRT–PCR for more than four orders of magnitude and a goodness of fit (R2) of 0.8798. Approximately 10% of the transcripts showed differential expression between the WT and Nrl−/− retina, with a fold change ≥1.5 and p value <0.05. Altered expression of 25 genes was confirmed with qRT–PCR, demonstrating the high degree of sensitivity of the RNA-seq method. Hierarchical clustering of differentially expressed genes uncovered several as yet uncharacterized genes that may contribute to retinal function. Data analysis with BWA and TopHat workflows revealed a significant overlap yet provided complementary insights in transcriptome profiling. Conclusions Our study represents the first detailed analysis of retinal transcriptomes, with biologic replicates, generated by RNA-seq technology. The optimized data analysis workflows reported here should provide a framework for comparative investigations of expression profiles. Our results show that NGS offers a comprehensive and more accurate quantitative and qualitative evaluation of mRNA content within a cell or tissue. We conclude that RNA-seq based transcriptome characterization would expedite genetic network analyses and permit the dissection of complex biologic functions. PMID:22162623
nuID: a universal naming scheme of oligonucleotides for Illumina, Affymetrix, and other microarrays
Du, Pan; Kibbe, Warren A; Lin, Simon M
2007-01-01
Background Oligonucleotide probes that are sequence identical may have different identifiers between manufacturers and even between different versions of the same company's microarray; and sometimes the same identifier is reused and represents a completely different oligonucleotide, resulting in ambiguity and potentially mis-identification of the genes hybridizing to that probe. Results We have devised a unique, non-degenerate encoding scheme that can be used as a universal representation to identify an oligonucleotide across manufacturers. We have named the encoded representation 'nuID', for nucleotide universal identifier. Inspired by the fact that the raw sequence of the oligonucleotide is the true definition of identity for a probe, the encoding algorithm uniquely and non-degenerately transforms the sequence itself into a compact identifier (a lossless compression). In addition, we added a redundancy check (checksum) to validate the integrity of the identifier. These two steps, encoding plus checksum, result in an nuID, which is a unique, non-degenerate, permanent, robust and efficient representation of the probe sequence. For commercial applications that require the sequence identity to be confidential, we have an encryption schema for nuID. We demonstrate the utility of nuIDs for the annotation of Illumina microarrays, and we believe it has universal applicability as a source-independent naming convention for oligomers. Reviewers This article was reviewed by Itai Yanai, Rong Chen (nominated by Mark Gerstein), and Gregory Schuler (nominated by David Lipman). PMID:17540033
USDA-ARS?s Scientific Manuscript database
Remarkable advances in next-generation sequencing (NGS) technologies, bioinformatics algorithms, and computational technologies have significantly accelerated genomic research. However, complicated NGS data analysis still remains as a major bottleneck. RNA-seq, as one of the major area in the NGS fi...
Identification of single-nucleotide variants in RNA-seq data. Current version focuses on detection of RNA editing sites without requiring genome sequence data. New version is under development to separately identify RNA editing sites and genetic variants using RNA-seq data alone.
A reference human genome dataset of the BGISEQ-500 sequencer.
Huang, Jie; Liang, Xinming; Xuan, Yuankai; Geng, Chunyu; Li, Yuxiang; Lu, Haorong; Qu, Shoufang; Mei, Xianglin; Chen, Hongbo; Yu, Ting; Sun, Nan; Rao, Junhua; Wang, Jiahao; Zhang, Wenwei; Chen, Ying; Liao, Sha; Jiang, Hui; Liu, Xin; Yang, Zhaopeng; Mu, Feng; Gao, Shangxian
2017-05-01
BGISEQ-500 is a new desktop sequencer developed by BGI. Using DNA nanoball and combinational probe anchor synthesis developed from Complete Genomics™ sequencing technologies, it generates short reads at a large scale. Here, we present the first human whole-genome sequencing dataset of BGISEQ-500. The dataset was generated by sequencing the widely used cell line HG001 (NA12878) in two sequencing runs of paired-end 50 bp (PE50) and two sequencing runs of paired-end 100 bp (PE100). We also include examples of the raw images from the sequencer for reference. Finally, we identified variations using this dataset, estimated the accuracy of the variations, and compared to that of the variations identified from similar amounts of publicly available HiSeq2500 data. We found similar single nucleotide polymorphism (SNP) detection accuracy for the BGISEQ-500 PE100 data (false positive rate [FPR] = 0.00020%, sensitivity = 96.20%) compared to the PE150 HiSeq2500 data (FPR = 0.00017%, sensitivity = 96.60%) better SNP detection accuracy than the PE50 data (FPR = 0.0006%, sensitivity = 94.15%). But for insertions and deletions (indels), we found lower accuracy for BGISEQ-500 data (FPR = 0.00069% and 0.00067% for PE100 and PE50 respectively, sensitivity = 88.52% and 70.93%) than the HiSeq2500 data (FPR = 0.00032%, sensitivity = 96.28%). Our dataset can serve as the reference dataset, providing basic information not just for future development, but also for all research and applications based on the new sequencing platform. © The Authors 2017. Published by Oxford University Press.
JingleBells: A Repository of Immune-Related Single-Cell RNA-Sequencing Datasets.
Ner-Gaon, Hadas; Melchior, Ariel; Golan, Nili; Ben-Haim, Yael; Shay, Tal
2017-05-01
Recent advances in single-cell RNA-sequencing (scRNA-seq) technology increase the understanding of immune differentiation and activation processes, as well as the heterogeneity of immune cell types. Although the number of available immune-related scRNA-seq datasets increases rapidly, their large size and various formats render them hard for the wider immunology community to use, and read-level data are practically inaccessible to the non-computational immunologist. To facilitate datasets reuse, we created the JingleBells repository for immune-related scRNA-seq datasets ready for analysis and visualization of reads at the single-cell level (http://jinglebells.bgu.ac.il/). To this end, we collected the raw data of publicly available immune-related scRNA-seq datasets, aligned the reads to the relevant genome, and saved aligned reads in a uniform format, annotated for cell of origin. We also added scripts and a step-by-step tutorial for visualizing each dataset at the single-cell level, through the commonly used Integrated Genome Viewer (www.broadinstitute.org/igv/). The uniform scRNA-seq format used in JingleBells can facilitate reuse of scRNA-seq data by computational biologists. It also enables immunologists who are interested in a specific gene to visualize the reads aligned to this gene to estimate cell-specific preferences for splicing, mutation load, or alleles. Thus JingleBells is a resource that will extend the usefulness of scRNA-seq datasets outside the programming aficionado realm. Copyright © 2017 by The American Association of Immunologists, Inc.
2013-01-01
Background The production of multiple transcript isoforms from one gene is a major source of transcriptome complexity. RNA-Seq experiments, in which transcripts are converted to cDNA and sequenced, allow the resolution and quantification of alternative transcript isoforms. However, methods to analyze splicing are underdeveloped and errors resulting in incorrect splicing calls occur in every experiment. Results We used RNA-Seq data to develop sequencing and aligner error models. By applying these error models to known input from simulations, we found that errors result from false alignment to minor splice motifs and antisense stands, shifted junction positions, paralog joining, and repeat induced gaps. By using a series of quantitative and qualitative filters, we eliminated diagnosed errors in the simulation, and applied this to RNA-Seq data from Drosophila melanogaster heads. We used high-confidence junction detections to specifically interrogate local splicing differences between transcripts. This method out-performed commonly used RNA-seq methods to identify known alternative splicing events in the Drosophila sex determination pathway. We describe a flexible software package to perform these tasks called Splicing Analysis Kit (Spanki), available at http://www.cbcb.umd.edu/software/spanki. Conclusions Splice-junction centric analysis of RNA-Seq data provides advantages in specificity for detection of alternative splicing. Our software provides tools to better understand error profiles in RNA-Seq data and improve inference from this new technology. The splice-junction centric approach that this software enables will provide more accurate estimates of differentially regulated splicing than current tools. PMID:24209455
Sturgill, David; Malone, John H; Sun, Xia; Smith, Harold E; Rabinow, Leonard; Samson, Marie-Laure; Oliver, Brian
2013-11-09
The production of multiple transcript isoforms from one gene is a major source of transcriptome complexity. RNA-Seq experiments, in which transcripts are converted to cDNA and sequenced, allow the resolution and quantification of alternative transcript isoforms. However, methods to analyze splicing are underdeveloped and errors resulting in incorrect splicing calls occur in every experiment. We used RNA-Seq data to develop sequencing and aligner error models. By applying these error models to known input from simulations, we found that errors result from false alignment to minor splice motifs and antisense stands, shifted junction positions, paralog joining, and repeat induced gaps. By using a series of quantitative and qualitative filters, we eliminated diagnosed errors in the simulation, and applied this to RNA-Seq data from Drosophila melanogaster heads. We used high-confidence junction detections to specifically interrogate local splicing differences between transcripts. This method out-performed commonly used RNA-seq methods to identify known alternative splicing events in the Drosophila sex determination pathway. We describe a flexible software package to perform these tasks called Splicing Analysis Kit (Spanki), available at http://www.cbcb.umd.edu/software/spanki. Splice-junction centric analysis of RNA-Seq data provides advantages in specificity for detection of alternative splicing. Our software provides tools to better understand error profiles in RNA-Seq data and improve inference from this new technology. The splice-junction centric approach that this software enables will provide more accurate estimates of differentially regulated splicing than current tools.
DEMO: Sequence Alignment to Predict Across Species Susceptibility
The US Environmental Protection Agency Sequence Alignment to Predict Across Species Susceptibility tool (SeqAPASS; https://seqapass.epa.gov/seqapass/) was developed to comparatively evaluate protein sequence and structural similarity across species as a means to extrapolate toxic...
Evaluation of tools for highly variable gene discovery from single-cell RNA-seq data.
Yip, Shun H; Sham, Pak Chung; Wang, Junwen
2018-02-21
Traditional RNA sequencing (RNA-seq) allows the detection of gene expression variations between two or more cell populations through differentially expressed gene (DEG) analysis. However, genes that contribute to cell-to-cell differences are not discoverable with RNA-seq because RNA-seq samples are obtained from a mixture of cells. Single-cell RNA-seq (scRNA-seq) allows the detection of gene expression in each cell. With scRNA-seq, highly variable gene (HVG) discovery allows the detection of genes that contribute strongly to cell-to-cell variation within a homogeneous cell population, such as a population of embryonic stem cells. This analysis is implemented in many software packages. In this study, we compare seven HVG methods from six software packages, including BASiCS, Brennecke, scLVM, scran, scVEGs and Seurat. Our results demonstrate that reproducibility in HVG analysis requires a larger sample size than DEG analysis. Discrepancies between methods and potential issues in these tools are discussed and recommendations are made.
Species classifier choice is a key consideration when analysing low-complexity food microbiome data.
Walsh, Aaron M; Crispie, Fiona; O'Sullivan, Orla; Finnegan, Laura; Claesson, Marcus J; Cotter, Paul D
2018-03-20
The use of shotgun metagenomics to analyse low-complexity microbial communities in foods has the potential to be of considerable fundamental and applied value. However, there is currently no consensus with respect to choice of species classification tool, platform, or sequencing depth. Here, we benchmarked the performances of three high-throughput short-read sequencing platforms, the Illumina MiSeq, NextSeq 500, and Ion Proton, for shotgun metagenomics of food microbiota. Briefly, we sequenced six kefir DNA samples and a mock community DNA sample, the latter constructed by evenly mixing genomic DNA from 13 food-related bacterial species. A variety of bioinformatic tools were used to analyse the data generated, and the effects of sequencing depth on these analyses were tested by randomly subsampling reads. Compositional analysis results were consistent between the platforms at divergent sequencing depths. However, we observed pronounced differences in the predictions from species classification tools. Indeed, PERMANOVA indicated that there was no significant differences between the compositional results generated by the different sequencers (p = 0.693, R 2 = 0.011), but there was a significant difference between the results predicted by the species classifiers (p = 0.01, R 2 = 0.127). The relative abundances predicted by the classifiers, apart from MetaPhlAn2, were apparently biased by reference genome sizes. Additionally, we observed varying false-positive rates among the classifiers. MetaPhlAn2 had the lowest false-positive rate, whereas SLIMM had the greatest false-positive rate. Strain-level analysis results were also similar across platforms. Each platform correctly identified the strains present in the mock community, but accuracy was improved slightly with greater sequencing depth. Notably, PanPhlAn detected the dominant strains in each kefir sample above 500,000 reads per sample. Again, the outputs from functional profiling analysis using SUPER-FOCUS were generally accordant between the platforms at different sequencing depths. Finally, and expectedly, metagenome assembly completeness was significantly lower on the MiSeq than either on the NextSeq (p = 0.03) or the Proton (p = 0.011), and it improved with increased sequencing depth. Our results demonstrate a remarkable similarity in the results generated by the three sequencing platforms at different sequencing depths, and, in fact, the choice of bioinformatics methodology had a more evident impact on results than the choice of sequencer did.
High-throughput full-length single-cell mRNA-seq of rare cells.
Ooi, Chin Chun; Mantalas, Gary L; Koh, Winston; Neff, Norma F; Fuchigami, Teruaki; Wong, Dawson J; Wilson, Robert J; Park, Seung-Min; Gambhir, Sanjiv S; Quake, Stephen R; Wang, Shan X
2017-01-01
Single-cell characterization techniques, such as mRNA-seq, have been applied to a diverse range of applications in cancer biology, yielding great insight into mechanisms leading to therapy resistance and tumor clonality. While single-cell techniques can yield a wealth of information, a common bottleneck is the lack of throughput, with many current processing methods being limited to the analysis of small volumes of single cell suspensions with cell densities on the order of 107 per mL. In this work, we present a high-throughput full-length mRNA-seq protocol incorporating a magnetic sifter and magnetic nanoparticle-antibody conjugates for rare cell enrichment, and Smart-seq2 chemistry for sequencing. We evaluate the efficiency and quality of this protocol with a simulated circulating tumor cell system, whereby non-small-cell lung cancer cell lines (NCI-H1650 and NCI-H1975) are spiked into whole blood, before being enriched for single-cell mRNA-seq by EpCAM-functionalized magnetic nanoparticles and the magnetic sifter. We obtain high efficiency (> 90%) capture and release of these simulated rare cells via the magnetic sifter, with reproducible transcriptome data. In addition, while mRNA-seq data is typically only used for gene expression analysis of transcriptomic data, we demonstrate the use of full-length mRNA-seq chemistries like Smart-seq2 to facilitate variant analysis of expressed genes. This enables the use of mRNA-seq data for differentiating cells in a heterogeneous population by both their phenotypic and variant profile. In a simulated heterogeneous mixture of circulating tumor cells in whole blood, we utilize this high-throughput protocol to differentiate these heterogeneous cells by both their phenotype (lung cancer versus white blood cells), and mutational profile (H1650 versus H1975 cells), in a single sequencing run. This high-throughput method can help facilitate single-cell analysis of rare cell populations, such as circulating tumor or endothelial cells, with demonstrably high-quality transcriptomic data.
2010-01-01
Background De novo assembly of transcript sequences produced by short-read DNA sequencing technologies offers a rapid approach to obtain expressed gene catalogs for non-model organisms. A draft genome sequence will be produced in 2010 for a Eucalyptus tree species (E. grandis) representing the most important hardwood fibre crop in the world. Genome annotation of this valuable woody plant and genetic dissection of its superior growth and productivity will be greatly facilitated by the availability of a comprehensive collection of expressed gene sequences from multiple tissues and organs. Results We present an extensive expressed gene catalog for a commercially grown E. grandis × E. urophylla hybrid clone constructed using only Illumina mRNA-Seq technology and de novo assembly. A total of 18,894 transcript-derived contigs, a large proportion of which represent full-length protein coding genes were assembled and annotated. Analysis of assembly quality, length and diversity show that this dataset represent the most comprehensive expressed gene catalog for any Eucalyptus tree. mRNA-Seq analysis furthermore allowed digital expression profiling of all of the assembled transcripts across diverse xylogenic and non-xylogenic tissues, which is invaluable for ascribing putative gene functions. Conclusions De novo assembly of Illumina mRNA-Seq reads is an efficient approach for transcriptome sequencing and profiling in Eucalyptus and other non-model organisms. The transcriptome resource (Eucspresso, http://eucspresso.bi.up.ac.za/) generated by this study will be of value for genomic analysis of woody biomass production in Eucalyptus and for comparative genomic analysis of growth and development in woody and herbaceous plants. PMID:21122097
Foulk, Michael S.; Urban, John M.; Casella, Cinzia; Gerbi, Susan A.
2015-01-01
Nascent strand sequencing (NS-seq) is used to discover DNA replication origins genome-wide, allowing identification of features for their specification. NS-seq depends on the ability of lambda exonuclease (λ-exo) to efficiently digest parental DNA while leaving RNA-primer protected nascent strands intact. We used genomics and biochemical approaches to determine if λ-exo digests all parental DNA sequences equally. We report that λ-exo does not efficiently digest G-quadruplex (G4) structures in a plasmid. Moreover, λ-exo digestion of nonreplicating genomic DNA (LexoG0) enriches GC-rich DNA and G4 motifs genome-wide. We used LexoG0 data to control for nascent strand–independent λ-exo biases in NS-seq and validated this approach at the rDNA locus. The λ-exo–controlled NS-seq peaks are not GC-rich, and only 35.5% overlap with 6.8% of all G4s, suggesting that G4s are not general determinants for origin specification but may play a role for a subset. Interestingly, we observed a periodic spacing of G4 motifs and nucleosomes around the peak summits, suggesting that G4s may position nucleosomes at this subset of origins. Finally, we demonstrate that use of Na+ instead of K+ in the λ-exo digestion buffer reduced the effect of G4s on λ-exo digestion and discuss ways to increase both the sensitivity and specificity of NS-seq. PMID:25695952
Foulk, Michael S; Urban, John M; Casella, Cinzia; Gerbi, Susan A
2015-05-01
Nascent strand sequencing (NS-seq) is used to discover DNA replication origins genome-wide, allowing identification of features for their specification. NS-seq depends on the ability of lambda exonuclease (λ-exo) to efficiently digest parental DNA while leaving RNA-primer protected nascent strands intact. We used genomics and biochemical approaches to determine if λ-exo digests all parental DNA sequences equally. We report that λ-exo does not efficiently digest G-quadruplex (G4) structures in a plasmid. Moreover, λ-exo digestion of nonreplicating genomic DNA (LexoG0) enriches GC-rich DNA and G4 motifs genome-wide. We used LexoG0 data to control for nascent strand-independent λ-exo biases in NS-seq and validated this approach at the rDNA locus. The λ-exo-controlled NS-seq peaks are not GC-rich, and only 35.5% overlap with 6.8% of all G4s, suggesting that G4s are not general determinants for origin specification but may play a role for a subset. Interestingly, we observed a periodic spacing of G4 motifs and nucleosomes around the peak summits, suggesting that G4s may position nucleosomes at this subset of origins. Finally, we demonstrate that use of Na(+) instead of K(+) in the λ-exo digestion buffer reduced the effect of G4s on λ-exo digestion and discuss ways to increase both the sensitivity and specificity of NS-seq. © 2015 Foulk et al.; Published by Cold Spring Harbor Laboratory Press.
The Impact of Normalization Methods on RNA-Seq Data Analysis
Zyprych-Walczak, J.; Szabelska, A.; Handschuh, L.; Górczak, K.; Klamecka, K.; Figlerowicz, M.; Siatkowski, I.
2015-01-01
High-throughput sequencing technologies, such as the Illumina Hi-seq, are powerful new tools for investigating a wide range of biological and medical problems. Massive and complex data sets produced by the sequencers create a need for development of statistical and computational methods that can tackle the analysis and management of data. The data normalization is one of the most crucial steps of data processing and this process must be carefully considered as it has a profound effect on the results of the analysis. In this work, we focus on a comprehensive comparison of five normalization methods related to sequencing depth, widely used for transcriptome sequencing (RNA-seq) data, and their impact on the results of gene expression analysis. Based on this study, we suggest a universal workflow that can be applied for the selection of the optimal normalization procedure for any particular data set. The described workflow includes calculation of the bias and variance values for the control genes, sensitivity and specificity of the methods, and classification errors as well as generation of the diagnostic plots. Combining the above information facilitates the selection of the most appropriate normalization method for the studied data sets and determines which methods can be used interchangeably. PMID:26176014
Evaluation of normalization methods in mammalian microRNA-Seq data
Garmire, Lana Xia; Subramaniam, Shankar
2012-01-01
Simple total tag count normalization is inadequate for microRNA sequencing data generated from the next generation sequencing technology. However, so far systematic evaluation of normalization methods on microRNA sequencing data is lacking. We comprehensively evaluate seven commonly used normalization methods including global normalization, Lowess normalization, Trimmed Mean Method (TMM), quantile normalization, scaling normalization, variance stabilization, and invariant method. We assess these methods on two individual experimental data sets with the empirical statistical metrics of mean square error (MSE) and Kolmogorov-Smirnov (K-S) statistic. Additionally, we evaluate the methods with results from quantitative PCR validation. Our results consistently show that Lowess normalization and quantile normalization perform the best, whereas TMM, a method applied to the RNA-Sequencing normalization, performs the worst. The poor performance of TMM normalization is further evidenced by abnormal results from the test of differential expression (DE) of microRNA-Seq data. Comparing with the models used for DE, the choice of normalization method is the primary factor that affects the results of DE. In summary, Lowess normalization and quantile normalization are recommended for normalizing microRNA-Seq data, whereas the TMM method should be used with caution. PMID:22532701
Brown, Roger B; Madrid, Nathaniel J; Suzuki, Hideaki; Ness, Scott A
2017-01-01
RNA-sequencing (RNA-seq) has become the standard method for unbiased analysis of gene expression but also provides access to more complex transcriptome features, including alternative RNA splicing, RNA editing, and even detection of fusion transcripts formed through chromosomal translocations. However, differences in library methods can adversely affect the ability to recover these different types of transcriptome data. For example, some methods have bias for one end of transcripts or rely on low-efficiency steps that limit the complexity of the resulting library, making detection of rare transcripts less likely. We tested several commonly used methods of RNA-seq library preparation and found vast differences in the detection of advanced transcriptome features, such as alternatively spliced isoforms and RNA editing sites. By comparing several different protocols available for the Ion Proton sequencer and by utilizing detailed bioinformatics analysis tools, we were able to develop an optimized random primer based RNA-seq technique that is reliable at uncovering rare transcript isoforms and RNA editing features, as well as fusion reads from oncogenic chromosome rearrangements. The combination of optimized libraries and rapid Ion Proton sequencing provides a powerful platform for the transcriptome analysis of research and clinical samples.
RSEQtools: a modular framework to analyze RNA-Seq data using compact, anonymized data summaries.
Habegger, Lukas; Sboner, Andrea; Gianoulis, Tara A; Rozowsky, Joel; Agarwal, Ashish; Snyder, Michael; Gerstein, Mark
2011-01-15
The advent of next-generation sequencing for functional genomics has given rise to quantities of sequence information that are often so large that they are difficult to handle. Moreover, sequence reads from a specific individual can contain sufficient information to potentially identify and genetically characterize that person, raising privacy concerns. In order to address these issues, we have developed the Mapped Read Format (MRF), a compact data summary format for both short and long read alignments that enables the anonymization of confidential sequence information, while allowing one to still carry out many functional genomics studies. We have developed a suite of tools (RSEQtools) that use this format for the analysis of RNA-Seq experiments. These tools consist of a set of modules that perform common tasks such as calculating gene expression values, generating signal tracks of mapped reads and segmenting that signal into actively transcribed regions. Moreover, the tools can readily be used to build customizable RNA-Seq workflows. In addition to the anonymization afforded by MRF, this format also facilitates the decoupling of the alignment of reads from downstream analyses. RSEQtools is implemented in C and the source code is available at http://rseqtools.gersteinlab.org/.
Gene: a gene-centered information resource at NCBI.
Brown, Garth R; Hem, Vichet; Katz, Kenneth S; Ovetsky, Michael; Wallin, Craig; Ermolaeva, Olga; Tolstoy, Igor; Tatusova, Tatiana; Pruitt, Kim D; Maglott, Donna R; Murphy, Terence D
2015-01-01
The National Center for Biotechnology Information's (NCBI) Gene database (www.ncbi.nlm.nih.gov/gene) integrates gene-specific information from multiple data sources. NCBI Reference Sequence (RefSeq) genomes for viruses, prokaryotes and eukaryotes are the primary foundation for Gene records in that they form the critical association between sequence and a tracked gene upon which additional functional and descriptive content is anchored. Additional content is integrated based on the genomic location and RefSeq transcript and protein sequence data. The content of a Gene record represents the integration of curation and automated processing from RefSeq, collaborating model organism databases, consortia such as Gene Ontology, and other databases within NCBI. Records in Gene are assigned unique, tracked integers as identifiers. The content (citations, nomenclature, genomic location, gene products and their attributes, phenotypes, sequences, interactions, variation details, maps, expression, homologs, protein domains and external databases) is available via interactive browsing through NCBI's Entrez system, via NCBI's Entrez programming utilities (E-Utilities and Entrez Direct) and for bulk transfer by FTP. Published by Oxford University Press on behalf of Nucleic Acids Research 2014. This work is written by (a) US Government employee(s) and is in the public domain in the US.
Nakato, Ryuichiro; Itoh, Tahehiko; Shirahige, Katsuhiko
2013-07-01
Chromatin immunoprecipitation with high-throughput sequencing (ChIP-seq) can identify genomic regions that bind proteins involved in various chromosomal functions. Although the development of next-generation sequencers offers the technology needed to identify these protein-binding sites, the analysis can be computationally challenging because sequencing data sometimes consist of >100 million reads/sample. Herein, we describe a cost-effective and time-efficient protocol that is generally applicable to ChIP-seq analysis; this protocol uses a novel peak-calling program termed DROMPA to identify peaks and an additional program, parse2wig, to preprocess read-map files. This two-step procedure drastically reduces computational time and memory requirements compared with other programs. DROMPA enables the identification of protein localization sites in repetitive sequences and efficiently identifies both broad and sharp protein localization peaks. Specifically, DROMPA outputs a protein-binding profile map in pdf or png format, which can be easily manipulated by users who have a limited background in bioinformatics. © 2013 The Authors Genes to Cells © 2013 by the Molecular Biology Society of Japan and Wiley Publishing Asia Pty Ltd.
Spliced synthetic genes as internal controls in RNA sequencing experiments.
Hardwick, Simon A; Chen, Wendy Y; Wong, Ted; Deveson, Ira W; Blackburn, James; Andersen, Stacey B; Nielsen, Lars K; Mattick, John S; Mercer, Tim R
2016-09-01
RNA sequencing (RNA-seq) can be used to assemble spliced isoforms, quantify expressed genes and provide a global profile of the transcriptome. However, the size and diversity of the transcriptome, the wide dynamic range in gene expression and inherent technical biases confound RNA-seq analysis. We have developed a set of spike-in RNA standards, termed 'sequins' (sequencing spike-ins), that represent full-length spliced mRNA isoforms. Sequins have an entirely artificial sequence with no homology to natural reference genomes, but they align to gene loci encoded on an artificial in silico chromosome. The combination of multiple sequins across a range of concentrations emulates alternative splicing and differential gene expression, and it provides scaling factors for normalization between samples. We demonstrate the use of sequins in RNA-seq experiments to measure sample-specific biases and determine the limits of reliable transcript assembly and quantification in accompanying human RNA samples. In addition, we have designed a complementary set of sequins that represent fusion genes arising from rearrangements of the in silico chromosome to aid in cancer diagnosis. RNA sequins provide a qualitative and quantitative reference with which to navigate the complexity of the human transcriptome.
MetaRNA-Seq: An Interactive Tool to Browse and Annotate Metadata from RNA-Seq Studies.
Kumar, Pankaj; Halama, Anna; Hayat, Shahina; Billing, Anja M; Gupta, Manish; Yousri, Noha A; Smith, Gregory M; Suhre, Karsten
2015-01-01
The number of RNA-Seq studies has grown in recent years. The design of RNA-Seq studies varies from very simple (e.g., two-condition case-control) to very complicated (e.g., time series involving multiple samples at each time point with separate drug treatments). Most of these publically available RNA-Seq studies are deposited in NCBI databases, but their metadata are scattered throughout four different databases: Sequence Read Archive (SRA), Biosample, Bioprojects, and Gene Expression Omnibus (GEO). Although the NCBI web interface is able to provide all of the metadata information, it often requires significant effort to retrieve study- or project-level information by traversing through multiple hyperlinks and going to another page. Moreover, project- and study-level metadata lack manual or automatic curation by categories, such as disease type, time series, case-control, or replicate type, which are vital to comprehending any RNA-Seq study. Here we describe "MetaRNA-Seq," a new tool for interactively browsing, searching, and annotating RNA-Seq metadata with the capability of semiautomatic curation at the study level.
GWIPS-viz: development of a ribo-seq genome browser
Michel, Audrey M.; Fox, Gearoid; M. Kiran, Anmol; De Bo, Christof; O’Connor, Patrick B. F.; Heaphy, Stephen M.; Mullan, James P. A.; Donohue, Claire A.; Higgins, Desmond G.; Baranov, Pavel V.
2014-01-01
We describe the development of GWIPS-viz (http://gwips.ucc.ie), an online genome browser for viewing ribosome profiling data. Ribosome profiling (ribo-seq) is a recently developed technique that provides genome-wide information on protein synthesis (GWIPS) in vivo. It is based on the deep sequencing of ribosome-protected messenger RNA (mRNA) fragments, which allows the ribosome density along all mRNA transcripts present in the cell to be quantified. Since its inception, ribo-seq has been carried out in a number of eukaryotic and prokaryotic organisms. Owing to the increasing interest in ribo-seq, there is a pertinent demand for a dedicated ribo-seq genome browser. GWIPS-viz is based on The University of California Santa Cruz (UCSC) Genome Browser. Ribo-seq tracks, coupled with mRNA-seq tracks, are currently available for several genomes: human, mouse, zebrafish, nematode, yeast, bacteria (Escherichia coli K12, Bacillus subtilis), human cytomegalovirus and bacteriophage lambda. Our objective is to continue incorporating published ribo-seq data sets so that the wider community can readily view ribosome profiling information from multiple studies without the need to carry out computational processing. PMID:24185699
Purpose: High-risk neuroblastoma is an aggressive disease. DNA sequencing studies have revealed a paucity of actionable genomic alterations and a low mutation burden, posing challenges to develop effective novel therapies. We used RNA sequencing (RNA-seq) to investigate the biology of this disease including a focus on tumor-infiltrating lymphocytes (TILs). Experimental Design: We performed deep RNA-seq on pre-treatment diagnostic tumors from 129 high-risk and 21 low- or intermediate-risk patients with neuroblastomas.
USDA-ARS?s Scientific Manuscript database
Analysis of DNA methylation patterns relies increasingly on sequencing-based profiling methods. The four most frequently used sequencing-based technologies are the bisulfite-based methods MethylC-seq and reduced representation bisulfite sequencing (RRBS), and the enrichment-based techniques methylat...
Impact of sequencing depth in ChIP-seq experiments
Jung, Youngsook L.; Luquette, Lovelace J.; Ho, Joshua W.K.; Ferrari, Francesco; Tolstorukov, Michael; Minoda, Aki; Issner, Robbyn; Epstein, Charles B.; Karpen, Gary H.; Kuroda, Mitzi I.; Park, Peter J.
2014-01-01
In a chromatin immunoprecipitation followed by high-throughput sequencing (ChIP-seq) experiment, an important consideration in experimental design is the minimum number of sequenced reads required to obtain statistically significant results. We present an extensive evaluation of the impact of sequencing depth on identification of enriched regions for key histone modifications (H3K4me3, H3K36me3, H3K27me3 and H3K9me2/me3) using deep-sequenced datasets in human and fly. We propose to define sufficient sequencing depth as the number of reads at which detected enrichment regions increase <1% for an additional million reads. Although the required depth depends on the nature of the mark and the state of the cell in each experiment, we observe that sufficient depth is often reached at <20 million reads for fly. For human, there are no clear saturation points for the examined datasets, but our analysis suggests 40–50 million reads as a practical minimum for most marks. We also devise a mathematical model to estimate the sufficient depth and total genomic coverage of a mark. Lastly, we find that the five algorithms tested do not agree well for broad enrichment profiles, especially at lower depths. Our findings suggest that sufficient sequencing depth and an appropriate peak-calling algorithm are essential for ensuring robustness of conclusions derived from ChIP-seq data. PMID:24598259
Zhao, Shanrong; Prenger, Kurt; Smith, Lance
2013-01-01
RNA-Seq is becoming a promising replacement to microarrays in transcriptome profiling and differential gene expression study. Technical improvements have decreased sequencing costs and, as a result, the size and number of RNA-Seq datasets have increased rapidly. However, the increasing volume of data from large-scale RNA-Seq studies poses a practical challenge for data analysis in a local environment. To meet this challenge, we developed Stormbow, a cloud-based software package, to process large volumes of RNA-Seq data in parallel. The performance of Stormbow has been tested by practically applying it to analyse 178 RNA-Seq samples in the cloud. In our test, it took 6 to 8 hours to process an RNA-Seq sample with 100 million reads, and the average cost was $3.50 per sample. Utilizing Amazon Web Services as the infrastructure for Stormbow allows us to easily scale up to handle large datasets with on-demand computational resources. Stormbow is a scalable, cost effective, and open-source based tool for large-scale RNA-Seq data analysis. Stormbow can be freely downloaded and can be used out of box to process Illumina RNA-Seq datasets. PMID:25937948
Zhao, Shanrong; Prenger, Kurt; Smith, Lance
2013-01-01
RNA-Seq is becoming a promising replacement to microarrays in transcriptome profiling and differential gene expression study. Technical improvements have decreased sequencing costs and, as a result, the size and number of RNA-Seq datasets have increased rapidly. However, the increasing volume of data from large-scale RNA-Seq studies poses a practical challenge for data analysis in a local environment. To meet this challenge, we developed Stormbow, a cloud-based software package, to process large volumes of RNA-Seq data in parallel. The performance of Stormbow has been tested by practically applying it to analyse 178 RNA-Seq samples in the cloud. In our test, it took 6 to 8 hours to process an RNA-Seq sample with 100 million reads, and the average cost was $3.50 per sample. Utilizing Amazon Web Services as the infrastructure for Stormbow allows us to easily scale up to handle large datasets with on-demand computational resources. Stormbow is a scalable, cost effective, and open-source based tool for large-scale RNA-Seq data analysis. Stormbow can be freely downloaded and can be used out of box to process Illumina RNA-Seq datasets.
Hughes, Paul; Deng, Wenjie; Olson, Scott C; Coombs, Robert W; Chung, Michael H; Frenkel, Lisa M
2016-03-01
Accurate analysis of minor populations of drug-resistant HIV requires analysis of a sufficient number of viral templates. We assessed the effect of experimental conditions on the analysis of HIV pol 454 pyrosequences generated from plasma using (1) the "Insertion-deletion (indel) and Carry Forward Correction" (ICC) pipeline, which clusters sequence reads using a nonsubstitution approach and can correct for indels and carry forward errors, and (2) the "Primer Identification (ID)" method, which facilitates construction of a consensus sequence to correct for sequencing errors and allelic skewing. The Primer ID and ICC methods produced similar estimates of viral diversity, but differed in the number of sequence variants generated. Sequence preparation for ICC was comparably simple, but was limited by an inability to assess the number of templates analyzed and allelic skewing. The more costly Primer ID method corrected for allelic skewing and provided the number of viral templates analyzed, which revealed that amplifiable HIV templates varied across specimens and did not correlate with clinical viral load. This latter observation highlights the value of the Primer ID method, which by determining the number of templates amplified, enables more accurate assessment of minority species in the virus population, which may be relevant to prescribing effective antiretroviral therapy.
Hoshino, Tatsuhiko; Inagaki, Fumio
2017-01-01
Next-generation sequencing (NGS) is a powerful tool for analyzing environmental DNA and provides the comprehensive molecular view of microbial communities. For obtaining the copy number of particular sequences in the NGS library, however, additional quantitative analysis as quantitative PCR (qPCR) or digital PCR (dPCR) is required. Furthermore, number of sequences in a sequence library does not always reflect the original copy number of a target gene because of biases caused by PCR amplification, making it difficult to convert the proportion of particular sequences in the NGS library to the copy number using the mass of input DNA. To address this issue, we applied stochastic labeling approach with random-tag sequences and developed a NGS-based quantification protocol, which enables simultaneous sequencing and quantification of the targeted DNA. This quantitative sequencing (qSeq) is initiated from single-primer extension (SPE) using a primer with random tag adjacent to the 5' end of target-specific sequence. During SPE, each DNA molecule is stochastically labeled with the random tag. Subsequently, first-round PCR is conducted, specifically targeting the SPE product, followed by second-round PCR to index for NGS. The number of random tags is only determined during the SPE step and is therefore not affected by the two rounds of PCR that may introduce amplification biases. In the case of 16S rRNA genes, after NGS sequencing and taxonomic classification, the absolute number of target phylotypes 16S rRNA gene can be estimated by Poisson statistics by counting random tags incorporated at the end of sequence. To test the feasibility of this approach, the 16S rRNA gene of Sulfolobus tokodaii was subjected to qSeq, which resulted in accurate quantification of 5.0 × 103 to 5.0 × 104 copies of the 16S rRNA gene. Furthermore, qSeq was applied to mock microbial communities and environmental samples, and the results were comparable to those obtained using digital PCR and relative abundance based on a standard sequence library. We demonstrated that the qSeq protocol proposed here is advantageous for providing less-biased absolute copy numbers of each target DNA with NGS sequencing at one time. By this new experiment scheme in microbial ecology, microbial community compositions can be explored in more quantitative manner, thus expanding our knowledge of microbial ecosystems in natural environments.
USDA-ARS?s Scientific Manuscript database
To better understand maize endosperm filling and maturation, we developed a novel functional genomics platform that combined Bulked Segregant RNA and Exome sequencing (BSREx-seq) to map causative mutations and identify candidate genes within mapping intervals. Using gamma-irradiation of B73 maize to...
ProteinSeq: High-Performance Proteomic Analyses by Proximity Ligation and Next Generation Sequencing
Vänelid, Johan; Siegbahn, Agneta; Ericsson, Olle; Fredriksson, Simon; Bäcklin, Christofer; Gut, Marta; Heath, Simon; Gut, Ivo Glynne; Wallentin, Lars; Gustafsson, Mats G.; Kamali-Moghaddam, Masood; Landegren, Ulf
2011-01-01
Despite intense interest, methods that provide enhanced sensitivity and specificity in parallel measurements of candidate protein biomarkers in numerous samples have been lacking. We present herein a multiplex proximity ligation assay with readout via realtime PCR or DNA sequencing (ProteinSeq). We demonstrate improved sensitivity over conventional sandwich assays for simultaneous analysis of sets of 35 proteins in 5 µl of blood plasma. Importantly, we observe a minimal tendency to increased background with multiplexing, compared to a sandwich assay, suggesting that higher levels of multiplexing are possible. We used ProteinSeq to analyze proteins in plasma samples from cardiovascular disease (CVD) patient cohorts and matched controls. Three proteins, namely P-selectin, Cystatin-B and Kallikrein-6, were identified as putative diagnostic biomarkers for CVD. The latter two have not been previously reported in the literature and their potential roles must be validated in larger patient cohorts. We conclude that ProteinSeq is promising for screening large numbers of proteins and samples while the technology can provide a much-needed platform for validation of diagnostic markers in biobank samples and in clinical use. PMID:21980495
Shieh, Fwu-Shan; Jongeneel, Patrick; Steffen, Jamin D; Lin, Selena; Jain, Surbhi; Song, Wei; Su, Ying-Hsiu
2017-01-01
Identification of viral integration sites has been important in understanding the pathogenesis and progression of diseases associated with particular viral infections. The advent of next-generation sequencing (NGS) has enabled researchers to understand the impact that viral integration has on the host, such as tumorigenesis. Current computational methods to analyze NGS data of virus-host junction sites have been limited in terms of their accessibility to a broad user base. In this study, we developed a software application (named ChimericSeq), that is the first program of its kind to offer a graphical user interface, compatibility with both Windows and Mac operating systems, and optimized for effectively identifying and annotating virus-host chimeric reads within NGS data. In addition, ChimericSeq's pipeline implements custom filtering to remove artifacts and detect reads with quantitative analytical reporting to provide functional significance to discovered integration sites. The improved accessibility of ChimericSeq through a GUI interface in both Windows and Mac has potential to expand NGS analytical support to a broader spectrum of the scientific community.
Skelly, Daniel A.; Johansson, Marnie; Madeoy, Jennifer; Wakefield, Jon; Akey, Joshua M.
2011-01-01
Variation in gene expression is thought to make a significant contribution to phenotypic diversity among individuals within populations. Although high-throughput cDNA sequencing offers a unique opportunity to delineate the genome-wide architecture of regulatory variation, new statistical methods need to be developed to capitalize on the wealth of information contained in RNA-seq data sets. To this end, we developed a powerful and flexible hierarchical Bayesian model that combines information across loci to allow both global and locus-specific inferences about allele-specific expression (ASE). We applied our methodology to a large RNA-seq data set obtained in a diploid hybrid of two diverse Saccharomyces cerevisiae strains, as well as to RNA-seq data from an individual human genome. Our statistical framework accurately quantifies levels of ASE with specified false-discovery rates, achieving high reproducibility between independent sequencing platforms. We pinpoint loci that show unusual and biologically interesting patterns of ASE, including allele-specific alternative splicing and transcription termination sites. Our methodology provides a rigorous, quantitative, and high-resolution tool for profiling ASE across whole genomes. PMID:21873452
Han, Kook; Tjaden, Brian; Lory, Stephen
2017-01-01
The first step in the post-transcriptional regulatory function of most bacterial small non-coding RNAs (sRNAs) is base-pairing with partially complementary sequences of targeted transcripts. We present a simple method for identifying sRNA targets in vivo and defining processing sites of the regulated transcripts. The technique (referred to as GRIL-Seq) is based on preferential ligation of sRNAs to ends of base-paired targets in bacteria co-expressing T4 RNA ligase, followed by sequencing to identify the chimeras. In addition to the RNA chaperone Hfq, the GRIL-Seq method depends on the activity of the pyrophosphorylase RppH. Using PrrF1, an iron-regulated sRNA in Pseudomonas aeruginosa, we demonstrate that direct regulatory targets of this sRNA can be readily identified. Therefore, GRIL-Seq represents a powerful tool not only for identifying direct targets of sRNAs in a variety of environments, but can also result in uncovering novel roles for sRNAs and their targets in complex regulatory networks. PMID:28005055
ChIP-seq guidelines and practices of the ENCODE and modENCODE consortia.
Landt, Stephen G; Marinov, Georgi K; Kundaje, Anshul; Kheradpour, Pouya; Pauli, Florencia; Batzoglou, Serafim; Bernstein, Bradley E; Bickel, Peter; Brown, James B; Cayting, Philip; Chen, Yiwen; DeSalvo, Gilberto; Epstein, Charles; Fisher-Aylor, Katherine I; Euskirchen, Ghia; Gerstein, Mark; Gertz, Jason; Hartemink, Alexander J; Hoffman, Michael M; Iyer, Vishwanath R; Jung, Youngsook L; Karmakar, Subhradip; Kellis, Manolis; Kharchenko, Peter V; Li, Qunhua; Liu, Tao; Liu, X Shirley; Ma, Lijia; Milosavljevic, Aleksandar; Myers, Richard M; Park, Peter J; Pazin, Michael J; Perry, Marc D; Raha, Debasish; Reddy, Timothy E; Rozowsky, Joel; Shoresh, Noam; Sidow, Arend; Slattery, Matthew; Stamatoyannopoulos, John A; Tolstorukov, Michael Y; White, Kevin P; Xi, Simon; Farnham, Peggy J; Lieb, Jason D; Wold, Barbara J; Snyder, Michael
2012-09-01
Chromatin immunoprecipitation (ChIP) followed by high-throughput DNA sequencing (ChIP-seq) has become a valuable and widely used approach for mapping the genomic location of transcription-factor binding and histone modifications in living cells. Despite its widespread use, there are considerable differences in how these experiments are conducted, how the results are scored and evaluated for quality, and how the data and metadata are archived for public use. These practices affect the quality and utility of any global ChIP experiment. Through our experience in performing ChIP-seq experiments, the ENCODE and modENCODE consortia have developed a set of working standards and guidelines for ChIP experiments that are updated routinely. The current guidelines address antibody validation, experimental replication, sequencing depth, data and metadata reporting, and data quality assessment. We discuss how ChIP quality, assessed in these ways, affects different uses of ChIP-seq data. All data sets used in the analysis have been deposited for public viewing and downloading at the ENCODE (http://encodeproject.org/ENCODE/) and modENCODE (http://www.modencode.org/) portals.
ChIP-seq guidelines and practices of the ENCODE and modENCODE consortia
Landt, Stephen G.; Marinov, Georgi K.; Kundaje, Anshul; Kheradpour, Pouya; Pauli, Florencia; Batzoglou, Serafim; Bernstein, Bradley E.; Bickel, Peter; Brown, James B.; Cayting, Philip; Chen, Yiwen; DeSalvo, Gilberto; Epstein, Charles; Fisher-Aylor, Katherine I.; Euskirchen, Ghia; Gerstein, Mark; Gertz, Jason; Hartemink, Alexander J.; Hoffman, Michael M.; Iyer, Vishwanath R.; Jung, Youngsook L.; Karmakar, Subhradip; Kellis, Manolis; Kharchenko, Peter V.; Li, Qunhua; Liu, Tao; Liu, X. Shirley; Ma, Lijia; Milosavljevic, Aleksandar; Myers, Richard M.; Park, Peter J.; Pazin, Michael J.; Perry, Marc D.; Raha, Debasish; Reddy, Timothy E.; Rozowsky, Joel; Shoresh, Noam; Sidow, Arend; Slattery, Matthew; Stamatoyannopoulos, John A.; Tolstorukov, Michael Y.; White, Kevin P.; Xi, Simon; Farnham, Peggy J.; Lieb, Jason D.; Wold, Barbara J.; Snyder, Michael
2012-01-01
Chromatin immunoprecipitation (ChIP) followed by high-throughput DNA sequencing (ChIP-seq) has become a valuable and widely used approach for mapping the genomic location of transcription-factor binding and histone modifications in living cells. Despite its widespread use, there are considerable differences in how these experiments are conducted, how the results are scored and evaluated for quality, and how the data and metadata are archived for public use. These practices affect the quality and utility of any global ChIP experiment. Through our experience in performing ChIP-seq experiments, the ENCODE and modENCODE consortia have developed a set of working standards and guidelines for ChIP experiments that are updated routinely. The current guidelines address antibody validation, experimental replication, sequencing depth, data and metadata reporting, and data quality assessment. We discuss how ChIP quality, assessed in these ways, affects different uses of ChIP-seq data. All data sets used in the analysis have been deposited for public viewing and downloading at the ENCODE (http://encodeproject.org/ENCODE/) and modENCODE (http://www.modencode.org/) portals. PMID:22955991
Rutvisuttinunt, Wiriya; Chinnawirotpisan, Piyawan; Simasathien, Sriluck; Shrestha, Sanjaya K; Yoon, In-Kyu; Klungthong, Chonticha; Fernandez, Stefan
2013-11-01
Active global surveillance and characterization of influenza viruses are essential for better preparation against possible pandemic events. Obtaining comprehensive information about the influenza genome can improve our understanding of the evolution of influenza viruses and emergence of new strains, and improve the accuracy when designing preventive vaccines. This study investigated the use of deep sequencing by the next-generation sequencing (NGS) Illumina MiSeq Platform to obtain complete genome sequence information from influenza virus isolates. The influenza virus isolates were cultured from 6 respiratory acute clinical specimens collected in Thailand and Nepal. DNA libraries obtained from each viral isolate were mixed and all were sequenced simultaneously. Total information of 2.6 Gbases was obtained from a 455±14 K/mm2 density with 95.76% (8,571,655/8,950,724 clusters) of the clusters passing quality control (QC) filters. Approximately 93.7% of all sequences from Read1 and 83.5% from Read2 contained high quality sequences that were ≥Q30, a base calling QC score standard. Alignments analysis identified three seasonal influenza A H3N2 strains, one 2009 pandemic influenza A H1N1 strain and two influenza B strains. The nearly entire genomes of all six virus isolates yielded equal or greater than 600-fold sequence coverage depth. MiSeq Platform identified seasonal influenza A H3N2, 2009 pandemic influenza A H1N1and influenza B in the DNA library mixtures efficiently. Copyright © 2013 The Authors. Published by Elsevier B.V. All rights reserved.
Genome sequencing of the redbanded stink bug (Piezodorus guildinii)
USDA-ARS?s Scientific Manuscript database
We assembled a partial genome sequence from the redbanded stink bug, Piezodorus guildinii from Illumina MiSeq sequencing runs. The sequence has been submitted and published under NCBI GenBank Accession Number JTEQ01000000. The BioProject and BioSample Accession numbers are PRJNA263369 and SAMN030997...
Zhang, Zijun; Xing, Yi
2017-09-19
Crosslinking or RNA immunoprecipitation followed by sequencing (CLIP-seq or RIP-seq) allows transcriptome-wide discovery of RNA regulatory sites. As CLIP-seq/RIP-seq reads are short, existing computational tools focus on uniquely mapped reads, while reads mapped to multiple loci are discarded. We present CLAM (CLIP-seq Analysis of Multi-mapped reads). CLAM uses an expectation-maximization algorithm to assign multi-mapped reads and calls peaks combining uniquely and multi-mapped reads. To demonstrate the utility of CLAM, we applied it to a wide range of public CLIP-seq/RIP-seq datasets involving numerous splicing factors, microRNAs and m6A RNA methylation. CLAM recovered a large number of novel RNA regulatory sites inaccessible by uniquely mapped reads. The functional significance of these sites was demonstrated by consensus motif patterns and association with alternative splicing (splicing factors), transcript abundance (AGO2) and mRNA half-life (m6A). CLAM provides a useful tool to discover novel protein-RNA interactions and RNA modification sites from CLIP-seq and RIP-seq data, and reveals the significant contribution of repetitive elements to the RNA regulatory landscape of the human transcriptome. © The Author(s) 2017. Published by Oxford University Press on behalf of Nucleic Acids Research.
Michel, Audrey M.; Mullan, James P. A.; Velayudhan, Vimalkumar; O'Connor, Patrick B. F.; Donohue, Claire A.; Baranov, Pavel V.
2016-01-01
ABSTRACT Ribosome profiling (ribo-seq) is a technique that uses high-throughput sequencing to reveal the exact locations and densities of translating ribosomes at the entire transcriptome level. The technique has become very popular since its inception in 2009. Yet experimentalists who generate ribo-seq data often have to rely on bioinformaticians to process and analyze their data. We present RiboGalaxy (http://ribogalaxy.ucc.ie), a freely available Galaxy-based web server for processing and analyzing ribosome profiling data with the visualization functionality provided by GWIPS-viz (http://gwips.ucc.ie). RiboGalaxy offers researchers a suite of tools specifically tailored for processing ribo-seq and corresponding mRNA-seq data. Researchers can take advantage of the published workflows which reduce the multi-step alignment process to a minimum of inputs from the user. Users can then explore their own aligned data as custom tracks in GWIPS-viz and compare their ribosome profiles to existing ribo-seq tracks from published studies. In addition, users can assess the quality of their ribo-seq data, determine the strength of the triplet periodicity signal, generate meta-gene ribosome profiles as well as analyze the relative impact of mRNA sequence features on local read density. RiboGalaxy is accompanied by extensive documentation and tips for helping users. In addition we provide a forum (http://gwips.ucc.ie/Forum) where we encourage users to post their questions and feedback to improve the overall RiboGalaxy service. PMID:26821742
Comparison of alternative approaches for analysing multi-level RNA-seq data
Mohorianu, Irina; Bretman, Amanda; Smith, Damian T.; Fowler, Emily K.; Dalmay, Tamas
2017-01-01
RNA sequencing (RNA-seq) is widely used for RNA quantification in the environmental, biological and medical sciences. It enables the description of genome-wide patterns of expression and the identification of regulatory interactions and networks. The aim of RNA-seq data analyses is to achieve rigorous quantification of genes/transcripts to allow a reliable prediction of differential expression (DE), despite variation in levels of noise and inherent biases in sequencing data. This can be especially challenging for datasets in which gene expression differences are subtle, as in the behavioural transcriptomics test dataset from D. melanogaster that we used here. We investigated the power of existing approaches for quality checking mRNA-seq data and explored additional, quantitative quality checks. To accommodate nested, multi-level experimental designs, we incorporated sample layout into our analyses. We employed a subsampling without replacement-based normalization and an identification of DE that accounted for the hierarchy and amplitude of effect sizes within samples, then evaluated the resulting differential expression call in comparison to existing approaches. In a final step to test for broader applicability, we applied our approaches to a published set of H. sapiens mRNA-seq samples, The dataset-tailored methods improved sample comparability and delivered a robust prediction of subtle gene expression changes. The proposed approaches have the potential to improve key steps in the analysis of RNA-seq data by incorporating the structure and characteristics of biological experiments. PMID:28792517
Riazuddin, S; Hussain, M; Razzaq, A; Iqbal, Z; Shahzad, M; Polla, D L; Song, Y; van Beusekom, E; Khan, A A; Tomas-Roca, L; Rashid, M; Zahoor, M Y; Wissink-Lindhout, W M; Basra, M A R; Ansar, M; Agha, Z; van Heeswijk, K; Rasheed, F; Van de Vorst, M; Veltman, J A; Gilissen, C; Akram, J; Kleefstra, T; Assir, M Z; Grozeva, D; Carss, K; Raymond, F L; O'Connor, T D; Riazuddin, S A; Khan, S N; Ahmed, Z M; de Brouwer, A P M; van Bokhoven, H; Riazuddin, S
2017-01-01
Intellectual disability (ID) is a clinically and genetically heterogeneous disorder, affecting 1–3% of the general population. Although research into the genetic causes of ID has recently gained momentum, identification of pathogenic mutations that cause autosomal recessive ID (ARID) has lagged behind, predominantly due to non-availability of sizeable families. Here we present the results of exome sequencing in 121 large consanguineous Pakistani ID families. In 60 families, we identified homozygous or compound heterozygous DNA variants in a single gene, 30 affecting reported ID genes and 30 affecting novel candidate ID genes. Potential pathogenicity of these alleles was supported by co-segregation with the phenotype, low frequency in control populations and the application of stringent bioinformatics analyses. In another eight families segregation of multiple pathogenic variants was observed, affecting 19 genes that were either known or are novel candidates for ID. Transcriptome profiles of normal human brain tissues showed that the novel candidate ID genes formed a network significantly enriched for transcriptional co-expression (P<0.0001) in the frontal cortex during fetal development and in the temporal–parietal and sub-cortex during infancy through adulthood. In addition, proteins encoded by 12 novel ID genes directly interact with previously reported ID proteins in six known pathways essential for cognitive function (P<0.0001). These results suggest that disruptions of temporal parietal and sub-cortical neurogenesis during infancy are critical to the pathophysiology of ID. These findings further expand the existing repertoire of genes involved in ARID, and provide new insights into the molecular mechanisms and the transcriptome map of ID. PMID:27457812
Riazuddin, S; Hussain, M; Razzaq, A; Iqbal, Z; Shahzad, M; Polla, D L; Song, Y; van Beusekom, E; Khan, A A; Tomas-Roca, L; Rashid, M; Zahoor, M Y; Wissink-Lindhout, W M; Basra, M A R; Ansar, M; Agha, Z; van Heeswijk, K; Rasheed, F; Van de Vorst, M; Veltman, J A; Gilissen, C; Akram, J; Kleefstra, T; Assir, M Z; Grozeva, D; Carss, K; Raymond, F L; O'Connor, T D; Riazuddin, S A; Khan, S N; Ahmed, Z M; de Brouwer, A P M; van Bokhoven, H; Riazuddin, S
2017-11-01
Intellectual disability (ID) is a clinically and genetically heterogeneous disorder, affecting 1-3% of the general population. Although research into the genetic causes of ID has recently gained momentum, identification of pathogenic mutations that cause autosomal recessive ID (ARID) has lagged behind, predominantly due to non-availability of sizeable families. Here we present the results of exome sequencing in 121 large consanguineous Pakistani ID families. In 60 families, we identified homozygous or compound heterozygous DNA variants in a single gene, 30 affecting reported ID genes and 30 affecting novel candidate ID genes. Potential pathogenicity of these alleles was supported by co-segregation with the phenotype, low frequency in control populations and the application of stringent bioinformatics analyses. In another eight families segregation of multiple pathogenic variants was observed, affecting 19 genes that were either known or are novel candidates for ID. Transcriptome profiles of normal human brain tissues showed that the novel candidate ID genes formed a network significantly enriched for transcriptional co-expression (P<0.0001) in the frontal cortex during fetal development and in the temporal-parietal and sub-cortex during infancy through adulthood. In addition, proteins encoded by 12 novel ID genes directly interact with previously reported ID proteins in six known pathways essential for cognitive function (P<0.0001). These results suggest that disruptions of temporal parietal and sub-cortical neurogenesis during infancy are critical to the pathophysiology of ID. These findings further expand the existing repertoire of genes involved in ARID, and provide new insights into the molecular mechanisms and the transcriptome map of ID.
Han, Joon-Hee; Chon, Jae-Kyung; Ahn, Jong-Hwa; Choi, Ik-Young; Lee, Yong-Hwan; Kim, Kyoung Su
2016-06-01
Colletotrichum acutatum is a destructive fungal pathogen which causes anthracnose in a wide range of crops. Here we report the whole genome sequence and annotation of C. acutatum strain KC05, isolated from an infected pepper in Kangwon, South Korea. Genomic DNA from the KC05 strain was used for the whole genome sequencing using a PacBio sequencer and the MiSeq system. The KC05 genome was determined to be 52,190,760 bp in size with a G + C content of 51.73% in 27 scaffolds and to contain 13,559 genes with an average length of 1516 bp. Gene prediction and annotation were performed by incorporating RNA-Seq data. The genome sequence of the KC05 was deposited at DDBJ/ENA/GenBank under the accession number LUXP00000000.
The promise and challenge of high-throughput sequencing of the antibody repertoire
Georgiou, George; Ippolito, Gregory C; Beausang, John; Busse, Christian E; Wardemann, Hedda; Quake, Stephen R
2014-01-01
Efforts to determine the antibody repertoire encoded by B cells in the blood or lymphoid organs using high-throughput DNA sequencing technologies have been advancing at an extremely rapid pace and are transforming our understanding of humoral immune responses. Information gained from high-throughput DNA sequencing of immunoglobulin genes (Ig-seq) can be applied to detect B-cell malignancies with high sensitivity, to discover antibodies specific for antigens of interest, to guide vaccine development and to understand autoimmunity. Rapid progress in the development of experimental protocols and informatics analysis tools is helping to reduce sequencing artifacts, to achieve more precise quantification of clonal diversity and to extract the most pertinent biological information. That said, broader application of Ig-seq, especially in clinical settings, will require the development of a standardized experimental design framework that will enable the sharing and meta-analysis of sequencing data generated by different laboratories. PMID:24441474
SeqWare Query Engine: storing and searching sequence data in the cloud.
O'Connor, Brian D; Merriman, Barry; Nelson, Stanley F
2010-12-21
Since the introduction of next-generation DNA sequencers the rapid increase in sequencer throughput, and associated drop in costs, has resulted in more than a dozen human genomes being resequenced over the last few years. These efforts are merely a prelude for a future in which genome resequencing will be commonplace for both biomedical research and clinical applications. The dramatic increase in sequencer output strains all facets of computational infrastructure, especially databases and query interfaces. The advent of cloud computing, and a variety of powerful tools designed to process petascale datasets, provide a compelling solution to these ever increasing demands. In this work, we present the SeqWare Query Engine which has been created using modern cloud computing technologies and designed to support databasing information from thousands of genomes. Our backend implementation was built using the highly scalable, NoSQL HBase database from the Hadoop project. We also created a web-based frontend that provides both a programmatic and interactive query interface and integrates with widely used genome browsers and tools. Using the query engine, users can load and query variants (SNVs, indels, translocations, etc) with a rich level of annotations including coverage and functional consequences. As a proof of concept we loaded several whole genome datasets including the U87MG cell line. We also used a glioblastoma multiforme tumor/normal pair to both profile performance and provide an example of using the Hadoop MapReduce framework within the query engine. This software is open source and freely available from the SeqWare project (http://seqware.sourceforge.net). The SeqWare Query Engine provided an easy way to make the U87MG genome accessible to programmers and non-programmers alike. This enabled a faster and more open exploration of results, quicker tuning of parameters for heuristic variant calling filters, and a common data interface to simplify development of analytical tools. The range of data types supported, the ease of querying and integrating with existing tools, and the robust scalability of the underlying cloud-based technologies make SeqWare Query Engine a nature fit for storing and searching ever-growing genome sequence datasets.
SeqWare Query Engine: storing and searching sequence data in the cloud
2010-01-01
Background Since the introduction of next-generation DNA sequencers the rapid increase in sequencer throughput, and associated drop in costs, has resulted in more than a dozen human genomes being resequenced over the last few years. These efforts are merely a prelude for a future in which genome resequencing will be commonplace for both biomedical research and clinical applications. The dramatic increase in sequencer output strains all facets of computational infrastructure, especially databases and query interfaces. The advent of cloud computing, and a variety of powerful tools designed to process petascale datasets, provide a compelling solution to these ever increasing demands. Results In this work, we present the SeqWare Query Engine which has been created using modern cloud computing technologies and designed to support databasing information from thousands of genomes. Our backend implementation was built using the highly scalable, NoSQL HBase database from the Hadoop project. We also created a web-based frontend that provides both a programmatic and interactive query interface and integrates with widely used genome browsers and tools. Using the query engine, users can load and query variants (SNVs, indels, translocations, etc) with a rich level of annotations including coverage and functional consequences. As a proof of concept we loaded several whole genome datasets including the U87MG cell line. We also used a glioblastoma multiforme tumor/normal pair to both profile performance and provide an example of using the Hadoop MapReduce framework within the query engine. This software is open source and freely available from the SeqWare project (http://seqware.sourceforge.net). Conclusions The SeqWare Query Engine provided an easy way to make the U87MG genome accessible to programmers and non-programmers alike. This enabled a faster and more open exploration of results, quicker tuning of parameters for heuristic variant calling filters, and a common data interface to simplify development of analytical tools. The range of data types supported, the ease of querying and integrating with existing tools, and the robust scalability of the underlying cloud-based technologies make SeqWare Query Engine a nature fit for storing and searching ever-growing genome sequence datasets. PMID:21210981
Testa, Alison C; Hane, James K; Ellwood, Simon R; Oliver, Richard P
2015-03-11
The impact of gene annotation quality on functional and comparative genomics makes gene prediction an important process, particularly in non-model species, including many fungi. Sets of homologous protein sequences are rarely complete with respect to the fungal species of interest and are often small or unreliable, especially when closely related species have not been sequenced or annotated in detail. In these cases, protein homology-based evidence fails to correctly annotate many genes, or significantly improve ab initio predictions. Generalised hidden Markov models (GHMM) have proven to be invaluable tools in gene annotation and, recently, RNA-seq has emerged as a cost-effective means to significantly improve the quality of automated gene annotation. As these methods do not require sets of homologous proteins, improving gene prediction from these resources is of benefit to fungal researchers. While many pipelines now incorporate RNA-seq data in training GHMMs, there has been relatively little investigation into additionally combining RNA-seq data at the point of prediction, and room for improvement in this area motivates this study. CodingQuarry is a highly accurate, self-training GHMM fungal gene predictor designed to work with assembled, aligned RNA-seq transcripts. RNA-seq data informs annotations both during gene-model training and in prediction. Our approach capitalises on the high quality of fungal transcript assemblies by incorporating predictions made directly from transcript sequences. Correct predictions are made despite transcript assembly problems, including those caused by overlap between the transcripts of adjacent gene loci. Stringent benchmarking against high-confidence annotation subsets showed CodingQuarry predicted 91.3% of Schizosaccharomyces pombe genes and 90.4% of Saccharomyces cerevisiae genes perfectly. These results are 4-5% better than those of AUGUSTUS, the next best performing RNA-seq driven gene predictor tested. Comparisons against whole genome Sc. pombe and S. cerevisiae annotations further substantiate a 4-5% improvement in the number of correctly predicted genes. We demonstrate the success of a novel method of incorporating RNA-seq data into GHMM fungal gene prediction. This shows that a high quality annotation can be achieved without relying on protein homology or a training set of genes. CodingQuarry is freely available ( https://sourceforge.net/projects/codingquarry/ ), and suitable for incorporation into genome annotation pipelines.
Guinea Pig ID-Like Families of SINEs
Kass, David H.; Schaetz, Brian A.; Beitler, Lindsey; Bonney, Kevin M.; Jamison, Nicole; Wiesner, Cathy
2009-01-01
Previous studies have indicated a paucity of SINEs within the genomes of the guinea pig and nutria, representatives of the Hystricognathi suborder of rodents. More recent work has shown that the guinea pig genome contains a large number of B1 elements, expanding to various levels among different rodents. In this work we utilized A–B PCR and screened GenBank with sequences from isolated clones to identify potentially uncharacterized SINEs within the guinea pig genome, and identified numerous sequences with a high degree of similarity (>92%) specific to the guinea pig. The presence of A-tails and flanking direct repeats associated with these sequences supported the identification of a full-length SINE, with a consensus sequence notably distinct from other rodent SINEs. Although most similar to the ID SINE, it clearly was not derived from the known ID master gene (BC1), hence we refer to this element as guinea pig ID-like (GPIDL). Using the consensus to screen the guinea pig genomic database (Assembly CavPor2) with Ensembl BlastView, we estimated at least 100,000 copies, which contrasts markedly to just over 100 copies of ID elements. Additionally we provided evidence of recent integrations of GPIDL as two of seven analyzed conserved GPIDL-containing loci demonstrated presence/absence variants in Cavia porcellus and C. aperea. Using intra-IDL PCR and sequence analyses we also provide evidence that GPIDL is derived from a hystricognath-specific SINE family. These results demonstrate that this SINE family continues to contribute to the dynamics of genomes of hystricognath rodents. PMID:19232383
Guinea pig ID-like families of SINEs.
Kass, David H; Schaetz, Brian A; Beitler, Lindsey; Bonney, Kevin M; Jamison, Nicole; Wiesner, Cathy
2009-05-01
Previous studies have indicated a paucity of SINEs within the genomes of the guinea pig and nutria, representatives of the Hystricognathi suborder of rodents. More recent work has shown that the guinea pig genome contains a large number of B1 elements, expanding to various levels among different rodents. In this work we utilized A-B PCR and screened GenBank with sequences from isolated clones to identify potentially uncharacterized SINEs within the guinea pig genome, and identified numerous sequences with a high degree of similarity (>92%) specific to the guinea pig. The presence of A-tails and flanking direct repeats associated with these sequences supported the identification of a full-length SINE, with a consensus sequence notably distinct from other rodent SINEs. Although most similar to the ID SINE, it clearly was not derived from the known ID master gene (BC1), hence we refer to this element as guinea pig ID-like (GPIDL). Using the consensus to screen the guinea pig genomic database (Assembly CavPor2) with Ensembl BlastView, we estimated at least 100,000 copies, which contrasts markedly to just over 100 copies of ID elements. Additionally we provided evidence of recent integrations of GPIDL as two of seven analyzed conserved GPIDL-containing loci demonstrated presence/absence variants in Cavia porcellus and C. aperea. Using intra-IDL PCR and sequence analyses we also provide evidence that GPIDL is derived from a hystricognath-specific SINE family. These results demonstrate that this SINE family continues to contribute to the dynamics of genomes of hystricognath rodents.
A comparative study of ChIP-seq sequencing library preparation methods.
Sundaram, Arvind Y M; Hughes, Timothy; Biondi, Shea; Bolduc, Nathalie; Bowman, Sarah K; Camilli, Andrew; Chew, Yap C; Couture, Catherine; Farmer, Andrew; Jerome, John P; Lazinski, David W; McUsic, Andrew; Peng, Xu; Shazand, Kamran; Xu, Feng; Lyle, Robert; Gilfillan, Gregor D
2016-10-21
ChIP-seq is the primary technique used to investigate genome-wide protein-DNA interactions. As part of this procedure, immunoprecipitated DNA must undergo "library preparation" to enable subsequent high-throughput sequencing. To facilitate the analysis of biopsy samples and rare cell populations, there has been a recent proliferation of methods allowing sequencing library preparation from low-input DNA amounts. However, little information exists on the relative merits, performance, comparability and biases inherent to these procedures. Notably, recently developed single-cell ChIP procedures employing microfluidics must also employ library preparation reagents to allow downstream sequencing. In this study, seven methods designed for low-input DNA/ChIP-seq sample preparation (Accel-NGS® 2S, Bowman-method, HTML-PCR, SeqPlex™, DNA SMART™, TELP and ThruPLEX®) were performed on five replicates of 1 ng and 0.1 ng input H3K4me3 ChIP material, and compared to a "gold standard" reference PCR-free dataset. The performance of each method was examined for the prevalence of unmappable reads, amplification-derived duplicate reads, reproducibility, and for the sensitivity and specificity of peak calling. We identified consistent high performance in a subset of the tested reagents, which should aid researchers in choosing the most appropriate reagents for their studies. Furthermore, we expect this work to drive future advances by identifying and encouraging use of the most promising methods and reagents. The results may also aid judgements on how comparable are existing datasets that have been prepared with different sample library preparation reagents.
Measuring the diversity of the human microbiota with targeted next-generation sequencing.
Finotello, Francesca; Mastrorilli, Eleonora; Di Camillo, Barbara
2016-12-26
The human microbiota is a complex ecological community of commensal, symbiotic and pathogenic microorganisms harboured by the human body. Next-generation sequencing (NGS) technologies, in particular targeted amplicon sequencing of the 16S ribosomal RNA gene (16S-seq), are enabling the identification and quantification of human-resident microorganisms at unprecedented resolution, providing novel insights into the role of the microbiota in health and disease. Once microbial abundances are quantified through NGS data analysis, diversity indices provide valuable mathematical tools to describe the ecological complexity of a single sample or to detect species differences between samples. However, diversity is not a determined physical quantity for which a consensus definition and unit of measure have been established, and several diversity indices are currently available. Furthermore, they were originally developed for macroecology and their robustness to the possible bias introduced by sequencing has not been characterized so far. To assist the reader with the selection and interpretation of diversity measures, we review a panel of broadly used indices, describing their mathematical formulations, purposes and properties, and characterize their behaviour and criticalities in dependence of the data features using simulated data as ground truth. In addition, we make available an R package, DiversitySeq, which implements in a unified framework the full panel of diversity indices and a simulator of 16S-seq data, and thus represents a valuable resource for the analysis of diversity from NGS count data and for the benchmarking of computational methods for 16S-seq. © The Author 2016. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oup.com.
Bayesian Correlation Analysis for Sequence Count Data
Lau, Nelson; Perkins, Theodore J.
2016-01-01
Evaluating the similarity of different measured variables is a fundamental task of statistics, and a key part of many bioinformatics algorithms. Here we propose a Bayesian scheme for estimating the correlation between different entities’ measurements based on high-throughput sequencing data. These entities could be different genes or miRNAs whose expression is measured by RNA-seq, different transcription factors or histone marks whose expression is measured by ChIP-seq, or even combinations of different types of entities. Our Bayesian formulation accounts for both measured signal levels and uncertainty in those levels, due to varying sequencing depth in different experiments and to varying absolute levels of individual entities, both of which affect the precision of the measurements. In comparison with a traditional Pearson correlation analysis, we show that our Bayesian correlation analysis retains high correlations when measurement confidence is high, but suppresses correlations when measurement confidence is low—especially for entities with low signal levels. In addition, we consider the influence of priors on the Bayesian correlation estimate. Perhaps surprisingly, we show that naive, uniform priors on entities’ signal levels can lead to highly biased correlation estimates, particularly when different experiments have widely varying sequencing depths. However, we propose two alternative priors that provably mitigate this problem. We also prove that, like traditional Pearson correlation, our Bayesian correlation calculation constitutes a kernel in the machine learning sense, and thus can be used as a similarity measure in any kernel-based machine learning algorithm. We demonstrate our approach on two RNA-seq datasets and one miRNA-seq dataset. PMID:27701449
SIDR: simultaneous isolation and parallel sequencing of genomic DNA and total RNA from single cells.
Han, Kyung Yeon; Kim, Kyu-Tae; Joung, Je-Gun; Son, Dae-Soon; Kim, Yeon Jeong; Jo, Areum; Jeon, Hyo-Jeong; Moon, Hui-Sung; Yoo, Chang Eun; Chung, Woosung; Eum, Hye Hyeon; Kim, Sangmin; Kim, Hong Kwan; Lee, Jeong Eon; Ahn, Myung-Ju; Lee, Hae-Ock; Park, Donghyun; Park, Woong-Yang
2018-01-01
Simultaneous sequencing of the genome and transcriptome at the single-cell level is a powerful tool for characterizing genomic and transcriptomic variation and revealing correlative relationships. However, it remains technically challenging to analyze both the genome and transcriptome in the same cell. Here, we report a novel method for simultaneous isolation of genomic DNA and total RNA (SIDR) from single cells, achieving high recovery rates with minimal cross-contamination, as is crucial for accurate description and integration of the single-cell genome and transcriptome. For reliable and efficient separation of genomic DNA and total RNA from single cells, the method uses hypotonic lysis to preserve nuclear lamina integrity and subsequently captures the cell lysate using antibody-conjugated magnetic microbeads. Evaluating the performance of this method using real-time PCR demonstrated that it efficiently recovered genomic DNA and total RNA. Thorough data quality assessments showed that DNA and RNA simultaneously fractionated by the SIDR method were suitable for genome and transcriptome sequencing analysis at the single-cell level. The integration of single-cell genome and transcriptome sequencing by SIDR (SIDR-seq) showed that genetic alterations, such as copy-number and single-nucleotide variations, were more accurately captured by single-cell SIDR-seq compared with conventional single-cell RNA-seq, although copy-number variations positively correlated with the corresponding gene expression levels. These results suggest that SIDR-seq is potentially a powerful tool to reveal genetic heterogeneity and phenotypic information inferred from gene expression patterns at the single-cell level. © 2018 Han et al.; Published by Cold Spring Harbor Laboratory Press.
SIDR: simultaneous isolation and parallel sequencing of genomic DNA and total RNA from single cells
Han, Kyung Yeon; Kim, Kyu-Tae; Joung, Je-Gun; Son, Dae-Soon; Kim, Yeon Jeong; Jo, Areum; Jeon, Hyo-Jeong; Moon, Hui-Sung; Yoo, Chang Eun; Chung, Woosung; Eum, Hye Hyeon; Kim, Sangmin; Kim, Hong Kwan; Lee, Jeong Eon; Ahn, Myung-Ju; Lee, Hae-Ock; Park, Donghyun; Park, Woong-Yang
2018-01-01
Simultaneous sequencing of the genome and transcriptome at the single-cell level is a powerful tool for characterizing genomic and transcriptomic variation and revealing correlative relationships. However, it remains technically challenging to analyze both the genome and transcriptome in the same cell. Here, we report a novel method for simultaneous isolation of genomic DNA and total RNA (SIDR) from single cells, achieving high recovery rates with minimal cross-contamination, as is crucial for accurate description and integration of the single-cell genome and transcriptome. For reliable and efficient separation of genomic DNA and total RNA from single cells, the method uses hypotonic lysis to preserve nuclear lamina integrity and subsequently captures the cell lysate using antibody-conjugated magnetic microbeads. Evaluating the performance of this method using real-time PCR demonstrated that it efficiently recovered genomic DNA and total RNA. Thorough data quality assessments showed that DNA and RNA simultaneously fractionated by the SIDR method were suitable for genome and transcriptome sequencing analysis at the single-cell level. The integration of single-cell genome and transcriptome sequencing by SIDR (SIDR-seq) showed that genetic alterations, such as copy-number and single-nucleotide variations, were more accurately captured by single-cell SIDR-seq compared with conventional single-cell RNA-seq, although copy-number variations positively correlated with the corresponding gene expression levels. These results suggest that SIDR-seq is potentially a powerful tool to reveal genetic heterogeneity and phenotypic information inferred from gene expression patterns at the single-cell level. PMID:29208629
RAP: RNA-Seq Analysis Pipeline, a new cloud-based NGS web application
2015-01-01
Background The study of RNA has been dramatically improved by the introduction of Next Generation Sequencing platforms allowing massive and cheap sequencing of selected RNA fractions, also providing information on strand orientation (RNA-Seq). The complexity of transcriptomes and of their regulative pathways make RNA-Seq one of most complex field of NGS applications, addressing several aspects of the expression process (e.g. identification and quantification of expressed genes and transcripts, alternative splicing and polyadenylation, fusion genes and trans-splicing, post-transcriptional events, etc.). Moreover, the huge volume of data generated by NGS platforms introduces unprecedented computational and technological challenges to efficiently analyze and store sequence data and results. Methods In order to provide researchers with an effective and friendly resource for analyzing RNA-Seq data, we present here RAP (RNA-Seq Analysis Pipeline), a cloud computing web application implementing a complete but modular analysis workflow. This pipeline integrates both state-of-the-art bioinformatics tools for RNA-Seq analysis and in-house developed scripts to offer to the user a comprehensive strategy for data analysis. RAP is able to perform quality checks (adopting FastQC and NGS QC Toolkit), identify and quantify expressed genes and transcripts (with Tophat, Cufflinks and HTSeq), detect alternative splicing events (using SpliceTrap) and chimeric transcripts (with ChimeraScan). This pipeline is also able to identify splicing junctions and constitutive or alternative polyadenylation sites (implementing custom analysis modules) and call for statistically significant differences in genes and transcripts expression, splicing pattern and polyadenylation site usage (using Cuffdiff2 and DESeq). Results Through a user friendly web interface, the RAP workflow can be suitably customized by the user and it is automatically executed on our cloud computing environment. This strategy allows to access to bioinformatics tools and computational resources without specific bioinformatics and IT skills. RAP provides a set of tabular and graphical results that can be helpful to browse, filter and export analyzed data, according to the user needs. PMID:26046471
Distributed biotin–streptavidin transcription roadblocks for mapping cotranscriptional RNA folding
Strobel, Eric J.; Nedialkov, Yuri; Artsimovitch, Irina
2017-01-01
Abstract RNA folding during transcription directs an order of folding that can determine RNA structure and function. However, the experimental study of cotranscriptional RNA folding has been limited by the lack of easily approachable methods that can interrogate nascent RNA structure at nucleotide resolution. To address this, we previously developed cotranscriptional selective 2΄-hydroxyl acylation analyzed by primer extension sequencing (SHAPE-Seq) to simultaneously probe all intermediate RNA transcripts during transcription by stalling elongation complexes at catalytically dead EcoRIE111Q roadblocks. While effective, the distribution of elongation complexes using EcoRIE111Q requires laborious PCR using many different oligonucleotides for each sequence analyzed. Here, we improve the broad applicability of cotranscriptional SHAPE-Seq by developing a sequence-independent biotin–streptavidin (SAv) roadblocking strategy that simplifies the preparation of roadblocking DNA templates. We first determine the properties of biotin–SAv roadblocks. We then show that randomly distributed biotin–SAv roadblocks can be used in cotranscriptional SHAPE-Seq experiments to identify the same RNA structural transitions related to a riboswitch decision-making process that we previously identified using EcoRIE111Q. Lastly, we find that EcoRIE111Q maps nascent RNA structure to specific transcript lengths more precisely than biotin–SAv and propose guidelines to leverage the complementary strengths of each transcription roadblock in cotranscriptional SHAPE-Seq. PMID:28398514
High-sensitivity HLA typing by Saturated Tiling Capture Sequencing (STC-Seq).
Jiao, Yang; Li, Ran; Wu, Chao; Ding, Yibin; Liu, Yanning; Jia, Danmei; Wang, Lifeng; Xu, Xiang; Zhu, Jing; Zheng, Min; Jia, Junling
2018-01-15
Highly polymorphic human leukocyte antigen (HLA) genes are responsible for fine-tuning the adaptive immune system. High-resolution HLA typing is important for the treatment of autoimmune and infectious diseases. Additionally, it is routinely performed for identifying matched donors in transplantation medicine. Although many HLA typing approaches have been developed, the complexity, low-efficiency and high-cost of current HLA-typing assays limit their application in population-based high-throughput HLA typing for donors, which is required for creating large-scale databases for transplantation and precision medicine. Here, we present a cost-efficient Saturated Tiling Capture Sequencing (STC-Seq) approach to capturing 14 HLA class I and II genes. The highly efficient capture (an approximately 23,000-fold enrichment) of these genes allows for simplified allele calling. Tests on five genes (HLA-A/B/C/DRB1/DQB1) from 31 human samples and 351 datasets using STC-Seq showed results that were 98% consistent with the known two sets of digitals (field1 and field2) genotypes. Additionally, STC can capture genomic DNA fragments longer than 3 kb from HLA loci, making the library compatible with the third-generation sequencing. STC-Seq is a highly accurate and cost-efficient method for HLA typing which can be used to facilitate the establishment of population-based HLA databases for the precision and transplantation medicine.
Distributed biotin-streptavidin transcription roadblocks for mapping cotranscriptional RNA folding.
Strobel, Eric J; Watters, Kyle E; Nedialkov, Yuri; Artsimovitch, Irina; Lucks, Julius B
2017-07-07
RNA folding during transcription directs an order of folding that can determine RNA structure and function. However, the experimental study of cotranscriptional RNA folding has been limited by the lack of easily approachable methods that can interrogate nascent RNA structure at nucleotide resolution. To address this, we previously developed cotranscriptional selective 2΄-hydroxyl acylation analyzed by primer extension sequencing (SHAPE-Seq) to simultaneously probe all intermediate RNA transcripts during transcription by stalling elongation complexes at catalytically dead EcoRIE111Q roadblocks. While effective, the distribution of elongation complexes using EcoRIE111Q requires laborious PCR using many different oligonucleotides for each sequence analyzed. Here, we improve the broad applicability of cotranscriptional SHAPE-Seq by developing a sequence-independent biotin-streptavidin (SAv) roadblocking strategy that simplifies the preparation of roadblocking DNA templates. We first determine the properties of biotin-SAv roadblocks. We then show that randomly distributed biotin-SAv roadblocks can be used in cotranscriptional SHAPE-Seq experiments to identify the same RNA structural transitions related to a riboswitch decision-making process that we previously identified using EcoRIE111Q. Lastly, we find that EcoRIE111Q maps nascent RNA structure to specific transcript lengths more precisely than biotin-SAv and propose guidelines to leverage the complementary strengths of each transcription roadblock in cotranscriptional SHAPE-Seq. © The Author(s) 2017. Published by Oxford University Press on behalf of Nucleic Acids Research.
Research Associate | Center for Cancer Research
The Basic Science Program (BSP) at the Frederick National Laboratory for Cancer Research (FNLCR) pursues independent, multidisciplinary research programs in basic or applied molecular biology, immunology, retrovirology, cancer biology or human genetics. As part of the BSP, the Microbiome and Genetics Core (the Core) characterizes microbiomes by next-generation sequencing to determine their composition and variation, as influenced by immune, genetic, and host health factors. The Core provides support across a spectrum of processes, from nucleic acid isolation through bioinformatics and statistical analysis. KEY ROLES/RESPONSIBILITIES The Research Associate II will provide support in the areas of automated isolation, preparation, PCR and sequencing of DNA on next generation platforms (Illumina MiSeq and NextSeq). An opportunity exists to join the Core’s team of highly trained experimentalists and bioinformaticians working to characterize microbiome samples. The following represent requirements of the position: A minimum of five (5) years related of biomedical experience. Experience with high-throughput nucleic acid (DNA/RNA) extraction. Experience in performing PCR amplification (including quantitative real-time PCR). Experience or familiarity with robotic liquid handling protocols (especially on the Eppendorf epMotion 5073 or 5075 platforms). Experience in operating and maintaining benchtop Illumina sequencers (MiSeq and NextSeq). Ability to evaluate experimental quality and to troubleshoot molecular biology protocols. Experience with sample tracking, inventory management and biobanking. Ability to operate and communicate effectively in a team-oriented work environment.
Reyes, Juan M; Chitwood, James L; Ross, Pablo J
2015-02-01
Molecular changes occurring during mammalian oocyte maturation are partly regulated by cytoplasmic polyadenylation (CP) and affect oocyte quality, yet the extent of CP activity during oocyte maturation remains unknown. Single bovine oocyte RNA sequencing (RNA-Seq) was performed to examine changes in transcript abundance during in vitro oocyte maturation in cattle. Polyadenylated RNA from individual germinal-vesicle and metaphase-II oocytes was amplified and processed for Illumina sequencing, producing approximately 30 million reads per replicate for each sample type. A total of 10,494 genes were found to be expressed, of which 2,455 were differentially expressed (adjusted P < 0.05 and fold change >2) between stages, with 503 and 1,952 genes respectively increasing and decreasing in abundance. Differentially expressed genes with complete 3'-untranslated-region sequence (279 increasing and 918 decreasing in polyadenylated transcript abundance) were examined for the presence, position, and distribution of motifs mediating CP, revealing enrichment (85%) and lack thereof (18%) in up- and down-regulated genes, respectively. Examination of total and polyadenylated RNA abundance by quantitative PCR validated these RNA-Seq findings. The observed increases in polyadenylated transcript abundance within the RNA-Seq data are likely due to CP, providing novel insight into targeted transcripts and resultant differential gene expression profiles that contribute to oocyte maturation. © 2015 Wiley Periodicals, Inc.
Gene expression distribution deconvolution in single-cell RNA sequencing.
Wang, Jingshu; Huang, Mo; Torre, Eduardo; Dueck, Hannah; Shaffer, Sydney; Murray, John; Raj, Arjun; Li, Mingyao; Zhang, Nancy R
2018-06-26
Single-cell RNA sequencing (scRNA-seq) enables the quantification of each gene's expression distribution across cells, thus allowing the assessment of the dispersion, nonzero fraction, and other aspects of its distribution beyond the mean. These statistical characterizations of the gene expression distribution are critical for understanding expression variation and for selecting marker genes for population heterogeneity. However, scRNA-seq data are noisy, with each cell typically sequenced at low coverage, thus making it difficult to infer properties of the gene expression distribution from raw counts. Based on a reexamination of nine public datasets, we propose a simple technical noise model for scRNA-seq data with unique molecular identifiers (UMI). We develop deconvolution of single-cell expression distribution (DESCEND), a method that deconvolves the true cross-cell gene expression distribution from observed scRNA-seq counts, leading to improved estimates of properties of the distribution such as dispersion and nonzero fraction. DESCEND can adjust for cell-level covariates such as cell size, cell cycle, and batch effects. DESCEND's noise model and estimation accuracy are further evaluated through comparisons to RNA FISH data, through data splitting and simulations and through its effectiveness in removing known batch effects. We demonstrate how DESCEND can clarify and improve downstream analyses such as finding differentially expressed genes, identifying cell types, and selecting differentiation markers. Copyright © 2018 the Author(s). Published by PNAS.
Global Analysis of Transcription Factor-Binding Sites in Yeast Using ChIP-Seq
Lefrançois, Philippe; Gallagher, Jennifer E. G.; Snyder, Michael
2016-01-01
Transcription factors influence gene expression through their ability to bind DNA at specific regulatory elements. Specific DNA-protein interactions can be isolated through the chromatin immunoprecipitation (ChIP) procedure, in which DNA fragments bound by the protein of interest are recovered. ChIP is followed by high-throughput DNA sequencing (Seq) to determine the genomic provenance of ChIP DNA fragments and their relative abundance in the sample. This chapter describes a ChIP-Seq strategy adapted for budding yeast to enable the genome-wide characterization of binding sites of transcription factors (TFs) and other DNA-binding proteins in an efficient and cost-effective way. Yeast strains with epitope-tagged TFs are most commonly used for ChIP-Seq, along with their matching untagged control strains. The initial step of ChIP involves the cross-linking of DNA and proteins. Next, yeast cells are lysed and sonicated to shear chromatin into smaller fragments. An antibody against an epitope-tagged TF is used to pull down chromatin complexes containing DNA and the TF of interest. DNA is then purified and proteins degraded. Specific barcoded adapters for multiplex DNA sequencing are ligated to ChIP DNA. Short DNA sequence reads (28–36 base pairs) are parsed according to the barcode and aligned against the yeast reference genome, thus generating a nucleotide-resolution map of transcription factor-binding sites and their occupancy. PMID:25213249
Qu, Xiancheng; Hu, Menghong; Shang, Yueyong; Pan, Lisha; Jia, Peixuan; Fu, Chunxue; Liu, Qigen; Wang, Youji
2018-01-01
Next-generation sequencing was used to analyze the effects of toxic microcystin-LR (MC-LR) on silver carp (Hypophthalmichthys molitrix). Silver carps were intraperitoneally injected with MC-LR, and RNA-seq and miRNA-seq in the liver were analyzed at 0.25, 0.5, and 1 h. The expression of glutathione S-transferase (GST), which acts as a marker gene for MC-LR, was tested to determine the earliest time point at which GST transcription was initiated in the liver tissues of the MC-LR-treated silver carps. Hepatic RNA-seq/miRNA-seq analysis and data integration analysis were conducted with reference to the identified time point. Quantitative PCR (qPCR) was performed to detect the expression of the following genes at the three time points: heme oxygenase 1 (HO-1), interleukin-10 receptor 1 (IL-10R1), apolipoprotein A-I (apoA-I), and heme binding protein 2 (HBP2). Results showed that the liver GST expression was remarkably decreased at 0.25 h (P < 0.05). RNA-seq at this time point revealed that the liver tissue contained 97,505 unigenes, including 184 significantly different unigenes and 75 unknown genes. Gene Ontology (GO) term enrichment analysis suggested that 35 of the 145 enriched GO terms were significantly enriched and mainly related to the immune system regulation network. KEGG pathway enrichment analysis showed that 18 of the 189 pathways were significantly enriched, and the most significant was a ribosome pathway containing 77 differentially expressed genes. miRNA-seq analysis indicated that the longest miRNA had 22 nucleotides (nt), followed by 21 and 23 nt. A total of 286 known miRNAs, 332 known miRNA precursor sequences, and 438 new miRNAs were predicted. A total of 1,048,575 mRNA–miRNA interaction sites were obtained, and 21,252 and 21,241 target genes were respectively predicted in known and new miRNAs. qPCR revealed that HO-1, IL-10R1, apoA-I, and HBP2 were significantly differentially expressed and might play important roles in the toxicity and liver detoxification of MC-LR in fish. These results were consistent with those of high-throughput sequencing, thereby verifying the accuracy of our sequencing data. RNA-seq and miRNA-seq analyses of silver carp liver injected with MC-LR provided valuable and new insights into the toxic effects of MC-LR and the antitoxic mechanisms of MC-LR in fish. The RNA/miRNA data are available from the NCBI database Registration No. : SRP075165. PMID:29692738
Wang, Jing; Chen, Lin; Zhou, Cong; Wang, Li; Xie, Hanbin; Xiao, Yuanyuan; Zhu, Hongmei; Hu, Ting; Zhang, Zhu; Zhu, Qian; Liu, Zhiying; Liu, Shanlin; Wang, He; Xu, Mengnan; Ren, Zhilin; Yu, Fuli; Cram, David S; Liu, Hongqian
2018-05-28
Next generation sequencing (NGS) is emerging as a viable alternative to chromosome microarray analysis for the diagnosis of chromosome disease syndromes. One NGS methodology, copy number variation sequencing (CNV-Seq), has been shown to deliver high reliability, accuracy and reproducibility for detection of fetal CNVs in prenatal samples. However, its clinical utility as a first tier diagnostic method has yet to be demonstrated in a large cohort of pregnant women referred for fetal chromosome testing. To evaluate CNV-Seq as a first tier diagnostic method for detection of fetal chromosome anomalies in a general population of pregnant women with high-risk prenatal indications. Prospective analysis of 3429 pregnant women referred for amniocentesis and fetal chromosome testing for different risk indications, including advanced maternal age (AMA), high-risk maternal serum screening (HR-MSS), and positivity for an ultrasound soft marker (USM). Amniocentesis was performed by standard procedures. Amniocyte DNA was analyzed by CNV-Seq with a chromosome resolution of 0.1 Mb. Fetal chromosome anomalies including whole chromosome aneuploidy and segmental imbalances were independently confirmed by gold standard cytogenetic and molecular methods and their pathogenicity determined following guidelines of the American College of Medical Genetics for sequence variants. Clear interpretable CNV-Seq results were obtained for all 3429 amniocentesis samples. CNV-Seq identified 3293 (96%) samples with a normal molecular karyotype and 136 samples (4%) with an altered molecular karyotype. A total of 146 fetal chromosome anomalies were detected, comprising 46 whole chromosome aneuploidies (pathogenic), 29 submicroscopic microdeletions/microduplications with known or suspected associations with chromosome disease syndromes (pathogenic), 22 other microdeletions/microduplications (likely pathogenic) and 49 variants of uncertain significance (VUS). Overall, the cumulative frequency of pathogenic/likely pathogenic and VUS chromosome anomalies in the patient cohort was 2.83% and 1.43%, respectively. In the three high-risk AMA, HR-MSS and USM groups, the most common whole chromosome aneuploidy detected was trisomy 21, followed by sex chromosome aneuploidies, trisomy 18 and trisomy 13. Across all clinical indications, there was a similar incidence of submicroscopic CNVs, with approximately equal proportions of pathogenic/likely pathogenic and VUS CNVs. If karyotyping had been used as an alternate cytogenetics detection method, CNV-Seq would have returned a 1% higher yield of pathogenic or likely pathogenic CNVs. In a large prospective clinical study, CNV-Seq delivered high reliability and accuracy for identifying clinically significant fetal anomalies in prenatal samples. Based on key performance criteria, CNV-Seq appears to be a well-suited methodology for first tier diagnosis of pregnant women in the general population at risk of having a fetal chromosome abnormality. Copyright © 2018. Published by Elsevier Inc.
Montagne, Louise; Derhourhi, Mehdi; Piton, Amélie; Toussaint, Bénédicte; Durand, Emmanuelle; Vaillant, Emmanuel; Thuillier, Dorothée; Gaget, Stefan; De Graeve, Franck; Rabearivelo, Iandry; Lansiaux, Amélie; Lenne, Bruno; Sukno, Sylvie; Desailloud, Rachel; Cnop, Miriam; Nicolescu, Ramona; Cohen, Lior; Zagury, Jean-François; Amouyal, Mélanie; Weill, Jacques; Muller, Jean; Sand, Olivier; Delobel, Bruno; Froguel, Philippe; Bonnefond, Amélie
2018-05-16
The molecular diagnosis of extreme forms of obesity, in which accurate detection of both copy number variations (CNVs) and point mutations, is crucial for an optimal care of the patients and genetic counseling for their families. Whole-exome sequencing (WES) has benefited considerably this molecular diagnosis, but its poor ability to detect CNVs remains a major limitation. We aimed to develop a method (CoDE-seq) enabling the accurate detection of both CNVs and point mutations in one step. CoDE-seq is based on an augmented WES method, using probes distributed uniformly throughout the genome. CoDE-seq was validated in 40 patients for whom chromosomal DNA microarray was available. CNVs and mutations were assessed in 82 children/young adults with suspected Mendelian obesity and/or intellectual disability and in their parents when available (n total = 145). CoDE-seq not only detected all of the 97 CNVs identified by chromosomal DNA microarrays but also found 84 additional CNVs, due to a better resolution. When compared to CoDE-seq and chromosomal DNA microarrays, WES failed to detect 37% and 14% of CNVs, respectively. In the 82 patients, a likely molecular diagnosis was achieved in >30% of the patients. Half of the genetic diagnoses were explained by CNVs while the other half by mutations. CoDE-seq has proven cost-efficient and highly effective as it avoids the sequential genetic screening approaches currently used in clinical practice for the accurate detection of CNVs and point mutations. Copyright © 2018 The Authors. Published by Elsevier GmbH.. All rights reserved.
Preparation of Low-Input and Ligation-Free ChIP-seq Libraries Using Template-Switching Technology.
Bolduc, Nathalie; Lehman, Alisa P; Farmer, Andrew
2016-10-10
Chromatin immunoprecipitation (ChIP) followed by high-throughput sequencing (ChIP-seq) has become the gold standard for mapping of transcription factors and histone modifications throughout the genome. However, for ChIP experiments involving few cells or targeting low-abundance transcription factors, the small amount of DNA recovered makes ligation of adapters very challenging. In this unit, we describe a ChIP-seq workflow that can be applied to small cell numbers, including a robust single-tube and ligation-free method for preparation of sequencing libraries from sub-nanogram amounts of ChIP DNA. An example ChIP protocol is first presented, resulting in selective enrichment of DNA-binding proteins and cross-linked DNA fragments immobilized on beads via an antibody bridge. This is followed by a protocol for fast and easy cross-linking reversal and DNA recovery. Finally, we describe a fast, ligation-free library preparation protocol, featuring DNA SMART technology, resulting in samples ready for Illumina sequencing. © 2016 by John Wiley & Sons, Inc. Copyright © 2016 John Wiley & Sons, Inc.
The ChIP-exo Method: Identifying Protein-DNA Interactions with Near Base Pair Precision.
Perreault, Andrea A; Venters, Bryan J
2016-12-23
Chromatin immunoprecipitation (ChIP) is an indispensable tool in the fields of epigenetics and gene regulation that isolates specific protein-DNA interactions. ChIP coupled to high throughput sequencing (ChIP-seq) is commonly used to determine the genomic location of proteins that interact with chromatin. However, ChIP-seq is hampered by relatively low mapping resolution of several hundred base pairs and high background signal. The ChIP-exo method is a refined version of ChIP-seq that substantially improves upon both resolution and noise. The key distinction of the ChIP-exo methodology is the incorporation of lambda exonuclease digestion in the library preparation workflow to effectively footprint the left and right 5' DNA borders of the protein-DNA crosslink site. The ChIP-exo libraries are then subjected to high throughput sequencing. The resulting data can be leveraged to provide unique and ultra-high resolution insights into the functional organization of the genome. Here, we describe the ChIP-exo method that we have optimized and streamlined for mammalian systems and next-generation sequencing-by-synthesis platform.
Unifying cancer and normal RNA sequencing data from different sources
Wang, Qingguo; Armenia, Joshua; Zhang, Chao; Penson, Alexander V.; Reznik, Ed; Zhang, Liguo; Minet, Thais; Ochoa, Angelica; Gross, Benjamin E.; Iacobuzio-Donahue, Christine A.; Betel, Doron; Taylor, Barry S.; Gao, Jianjiong; Schultz, Nikolaus
2018-01-01
Driven by the recent advances of next generation sequencing (NGS) technologies and an urgent need to decode complex human diseases, a multitude of large-scale studies were conducted recently that have resulted in an unprecedented volume of whole transcriptome sequencing (RNA-seq) data, such as the Genotype Tissue Expression project (GTEx) and The Cancer Genome Atlas (TCGA). While these data offer new opportunities to identify the mechanisms underlying disease, the comparison of data from different sources remains challenging, due to differences in sample and data processing. Here, we developed a pipeline that processes and unifies RNA-seq data from different studies, which includes uniform realignment, gene expression quantification, and batch effect removal. We find that uniform alignment and quantification is not sufficient when combining RNA-seq data from different sources and that the removal of other batch effects is essential to facilitate data comparison. We have processed data from GTEx and TCGA and successfully corrected for study-specific biases, enabling comparative analysis between TCGA and GTEx. The normalized datasets are available for download on figshare. PMID:29664468
Pyicos: a versatile toolkit for the analysis of high-throughput sequencing data.
Althammer, Sonja; González-Vallinas, Juan; Ballaré, Cecilia; Beato, Miguel; Eyras, Eduardo
2011-12-15
High-throughput sequencing (HTS) has revolutionized gene regulation studies and is now fundamental for the detection of protein-DNA and protein-RNA binding, as well as for measuring RNA expression. With increasing variety and sequencing depth of HTS datasets, the need for more flexible and memory-efficient tools to analyse them is growing. We describe Pyicos, a powerful toolkit for the analysis of mapped reads from diverse HTS experiments: ChIP-Seq, either punctuated or broad signals, CLIP-Seq and RNA-Seq. We prove the effectiveness of Pyicos to select for significant signals and show that its accuracy is comparable and sometimes superior to that of methods specifically designed for each particular type of experiment. Pyicos facilitates the analysis of a variety of HTS datatypes through its flexibility and memory efficiency, providing a useful framework for data integration into models of regulatory genomics. Open-source software, with tutorials and protocol files, is available at http://regulatorygenomics.upf.edu/pyicos or as a Galaxy server at http://regulatorygenomics.upf.edu/galaxy eduardo.eyras@upf.edu Supplementary data are available at Bioinformatics online.
McCarthy, Davis J; Campbell, Kieran R; Lun, Aaron T L; Wills, Quin F
2017-04-15
Single-cell RNA sequencing (scRNA-seq) is increasingly used to study gene expression at the level of individual cells. However, preparing raw sequence data for further analysis is not a straightforward process. Biases, artifacts and other sources of unwanted variation are present in the data, requiring substantial time and effort to be spent on pre-processing, quality control (QC) and normalization. We have developed the R/Bioconductor package scater to facilitate rigorous pre-processing, quality control, normalization and visualization of scRNA-seq data. The package provides a convenient, flexible workflow to process raw sequencing reads into a high-quality expression dataset ready for downstream analysis. scater provides a rich suite of plotting tools for single-cell data and a flexible data structure that is compatible with existing tools and can be used as infrastructure for future software development. The open-source code, along with installation instructions, vignettes and case studies, is available through Bioconductor at http://bioconductor.org/packages/scater . davis@ebi.ac.uk. Supplementary data are available at Bioinformatics online. © The Author 2017. Published by Oxford University Press.
NASA Astrophysics Data System (ADS)
Savinainen, Antti; Mäkynen, Asko; Nieminen, Pasi; Viiri, Jouni
2017-02-01
This paper presents a research-based teaching-learning sequence (TLS) that focuses on the notion of interaction in teaching Newton's third law (N3 law) which is, as earlier studies have shown, a challenging topic for students to learn. The TLS made systematic use of a visual representation tool—an interaction diagram (ID)—highlighting interactions between objects and addressing the learning demand related to N3 law. This approach had been successful in enhancing students' understanding of N3 law in pilot studies conducted by teacher-researchers. However, it was unclear whether teachers, who have neither been involved with the research nor received intensive tutoring, could replicate the positive results in ordinary school settings. To address this question, we present an empirical study conducted in 10 Finnish upper secondary schools with students ( n = 261, aged 16) taking their mandatory physics course. The study design involved three groups: the heavy ID group (the TLS with seven to eight exercises on IDs), the light ID group (two to three exercises on IDs) and the no ID group (no exercises on IDs). The heavy and light ID groups answered eight ID questions, and all the students answered four questions on N3 law after teaching the force concept. The findings clearly suggest that systematic use of the IDs in teaching the force concept significantly fostered students' understanding of N3 law even with teachers who have no intensive tutoring or research background.
2014-01-01
Background RNA sequencing (RNA-seq) is emerging as a critical approach in biological research. However, its high-throughput advantage is significantly limited by the capacity of bioinformatics tools. The research community urgently needs user-friendly tools to efficiently analyze the complicated data generated by high throughput sequencers. Results We developed a standalone tool with graphic user interface (GUI)-based analytic modules, known as eRNA. The capacity of performing parallel processing and sample management facilitates large data analyses by maximizing hardware usage and freeing users from tediously handling sequencing data. The module miRNA identification” includes GUIs for raw data reading, adapter removal, sequence alignment, and read counting. The module “mRNA identification” includes GUIs for reference sequences, genome mapping, transcript assembling, and differential expression. The module “Target screening” provides expression profiling analyses and graphic visualization. The module “Self-testing” offers the directory setups, sample management, and a check for third-party package dependency. Integration of other GUIs including Bowtie, miRDeep2, and miRspring extend the program’s functionality. Conclusions eRNA focuses on the common tools required for the mapping and quantification analysis of miRNA-seq and mRNA-seq data. The software package provides an additional choice for scientists who require a user-friendly computing environment and high-throughput capacity for large data analysis. eRNA is available for free download at https://sourceforge.net/projects/erna/?source=directory. PMID:24593312
2014-01-01
Background Although it is possible to recover the complete mitogenome directly from shotgun sequencing data, currently reported methods and pipelines are still relatively time consuming and costly. Using a sample of the Australian freshwater crayfish Engaeus lengana, we demonstrate that it is possible to achieve three-day turnaround time (four hours hands-on time) from tissue sample to NCBI-ready submission file through the integration of MiSeq sequencing platform, Nextera sample preparation protocol, MITObim assembly algorithm and MITOS annotation pipeline. Results The complete mitochondrial genome of the parastacid freshwater crayfish, Engaeus lengana, was recovered by modest shotgun sequencing (1.2 giga bases) using the Illumina MiSeq benchtop sequencing platform. Genome assembly using the MITObim mitogenome assembler recovered the mitochondrial genome as a single contig with a 97-fold mean coverage (min. = 17; max. = 138). The mitogenome consists of 15,934 base pairs and contains the typical 37 mitochondrial genes and a non-coding AT-rich region. The genome arrangement is similar to the only other published parastacid mitogenome from the Australian genus Cherax. Conclusions We infer that the gene order arrangement found in Cherax destructor is common to Australian crayfish and may be a derived feature of the southern hemisphere family Parastacidae. Further, we report to our knowledge, the simplest and fastest protocol for the recovery and assembly of complete mitochondrial genomes using the MiSeq benchtop sequencer. PMID:24484414
Yuan, Tiezheng; Huang, Xiaoyi; Dittmar, Rachel L; Du, Meijun; Kohli, Manish; Boardman, Lisa; Thibodeau, Stephen N; Wang, Liang
2014-03-05
RNA sequencing (RNA-seq) is emerging as a critical approach in biological research. However, its high-throughput advantage is significantly limited by the capacity of bioinformatics tools. The research community urgently needs user-friendly tools to efficiently analyze the complicated data generated by high throughput sequencers. We developed a standalone tool with graphic user interface (GUI)-based analytic modules, known as eRNA. The capacity of performing parallel processing and sample management facilitates large data analyses by maximizing hardware usage and freeing users from tediously handling sequencing data. The module miRNA identification" includes GUIs for raw data reading, adapter removal, sequence alignment, and read counting. The module "mRNA identification" includes GUIs for reference sequences, genome mapping, transcript assembling, and differential expression. The module "Target screening" provides expression profiling analyses and graphic visualization. The module "Self-testing" offers the directory setups, sample management, and a check for third-party package dependency. Integration of other GUIs including Bowtie, miRDeep2, and miRspring extend the program's functionality. eRNA focuses on the common tools required for the mapping and quantification analysis of miRNA-seq and mRNA-seq data. The software package provides an additional choice for scientists who require a user-friendly computing environment and high-throughput capacity for large data analysis. eRNA is available for free download at https://sourceforge.net/projects/erna/?source=directory.
Nakazato, Takeru; Ohta, Tazro; Bono, Hidemasa
2013-01-01
High-throughput sequencing technology, also called next-generation sequencing (NGS), has the potential to revolutionize the whole process of genome sequencing, transcriptomics, and epigenetics. Sequencing data is captured in a public primary data archive, the Sequence Read Archive (SRA). As of January 2013, data from more than 14,000 projects have been submitted to SRA, which is double that of the previous year. Researchers can download raw sequence data from SRA website to perform further analyses and to compare with their own data. However, it is extremely difficult to search entries and download raw sequences of interests with SRA because the data structure is complicated, and experimental conditions along with raw sequences are partly described in natural language. Additionally, some sequences are of inconsistent quality because anyone can submit sequencing data to SRA with no quality check. Therefore, as a criterion of data quality, we focused on SRA entries that were cited in journal articles. We extracted SRA IDs and PubMed IDs (PMIDs) from SRA and full-text versions of journal articles and retrieved 2748 SRA ID-PMID pairs. We constructed a publication list referring to SRA entries. Since, one of the main themes of -omics analyses is clarification of disease mechanisms, we also characterized SRA entries by disease keywords, according to the Medical Subject Headings (MeSH) extracted from articles assigned to each SRA entry. We obtained 989 SRA ID-MeSH disease term pairs, and constructed a disease list referring to SRA data. We previously developed feature profiles of diseases in a system called “Gendoo”. We generated hyperlinks between diseases extracted from SRA and the feature profiles of it. The developed project, publication and disease lists resulting from this study are available at our web service, called “DBCLS SRA” (http://sra.dbcls.jp/). This service will improve accessibility to high-quality data from SRA. PMID:24167589
RSAT 2015: Regulatory Sequence Analysis Tools
Medina-Rivera, Alejandra; Defrance, Matthieu; Sand, Olivier; Herrmann, Carl; Castro-Mondragon, Jaime A.; Delerce, Jeremy; Jaeger, Sébastien; Blanchet, Christophe; Vincens, Pierre; Caron, Christophe; Staines, Daniel M.; Contreras-Moreira, Bruno; Artufel, Marie; Charbonnier-Khamvongsa, Lucie; Hernandez, Céline; Thieffry, Denis; Thomas-Chollier, Morgane; van Helden, Jacques
2015-01-01
RSAT (Regulatory Sequence Analysis Tools) is a modular software suite for the analysis of cis-regulatory elements in genome sequences. Its main applications are (i) motif discovery, appropriate to genome-wide data sets like ChIP-seq, (ii) transcription factor binding motif analysis (quality assessment, comparisons and clustering), (iii) comparative genomics and (iv) analysis of regulatory variations. Nine new programs have been added to the 43 described in the 2011 NAR Web Software Issue, including a tool to extract sequences from a list of coordinates (fetch-sequences from UCSC), novel programs dedicated to the analysis of regulatory variants from GWAS or population genomics (retrieve-variation-seq and variation-scan), a program to cluster motifs and visualize the similarities as trees (matrix-clustering). To deal with the drastic increase of sequenced genomes, RSAT public sites have been reorganized into taxon-specific servers. The suite is well-documented with tutorials and published protocols. The software suite is available through Web sites, SOAP/WSDL Web services, virtual machines and stand-alone programs at http://www.rsat.eu/. PMID:25904632
Bowman, Megan J.; Park, Wonkeun; Bauer, Philip J.; Udall, Joshua A.; Page, Justin T.; Raney, Joshua; Scheffler, Brian E.; Jones, Don. C.; Campbell, B. Todd
2013-01-01
An RNA-Seq experiment was performed using field grown well-watered and naturally rain fed cotton plants to identify differentially expressed transcripts under water-deficit stress. Our work constitutes the first application of the newly published diploid D5 Gossypium raimondii sequence in the study of tetraploid AD1 upland cotton RNA-seq transcriptome analysis. A total of 1,530 transcripts were differentially expressed between well-watered and water-deficit stressed root tissues, in patterns that confirm the accuracy of this technique for future studies in cotton genomics. Additionally, putative sequence based genome localization of differentially expressed transcripts detected A2 genome specific gene expression under water-deficit stress. These data will facilitate efforts to understand the complex responses governing transcriptomic regulatory mechanisms and to identify candidate genes that may benefit applied plant breeding programs. PMID:24324815
Deng, Yue; Bao, Feng; Yang, Yang; Ji, Xiangyang; Du, Mulong; Zhang, Zhengdong
2017-01-01
Abstract The automated transcript discovery and quantification of high-throughput RNA sequencing (RNA-seq) data are important tasks of next-generation sequencing (NGS) research. However, these tasks are challenging due to the uncertainties that arise in the inference of complete splicing isoform variants from partially observed short reads. Here, we address this problem by explicitly reducing the inherent uncertainties in a biological system caused by missing information. In our approach, the RNA-seq procedure for transforming transcripts into short reads is considered an information transmission process. Consequently, the data uncertainties are substantially reduced by exploiting the information transduction capacity of information theory. The experimental results obtained from the analyses of simulated datasets and RNA-seq datasets from cell lines and tissues demonstrate the advantages of our method over state-of-the-art competitors. Our algorithm is an open-source implementation of MaxInfo. PMID:28911101
Hosseinkhani, Farideh; Emaneini, Mohammad; van Leeuwen, Willem
2017-07-20
Using Illumina HiSeq and PacBio technologies, we sequenced the genome of the multidrug-resistant bacterium Staphylococcus haemolyticus , originating from a bloodstream infection in a neonate. The sequence data can be used as an accurate reference sequence. Copyright © 2017 Hosseinkhani et al.
Varela, Ignacio; Fisher, Rosalie; McGranahan, Nicholas; Matthews, Nicholas; Santos, Claudio R; Martinez, Pierre; Phillimore, Benjamin; Begum, Sharmin; Rabinowitz, Adam; Spencer-Dene, Bradley; Gulati, Sakshi; Bates, Paul A; Stamp, Gordon; Pickering, Lisa; Gore, Martin; Nicol, David L; Hazell, Steven; Futreal, P Andrew; Stewart, Aengus; Swanton, Charles
2015-01-01
Clear cell renal carcinomas (ccRCCs) can display intratumor heterogeneity (ITH). We applied multiregion exome sequencing (M-seq) to resolve the genetic architecture and evolutionary histories of ten ccRCCs. Ultra-deep sequencing identified ITH in all cases. We found that 73–75% of identified ccRCC driver aberrations were subclonal, confounding estimates of driver mutation prevalence. ITH increased with the number of biopsies analyzed, without evidence of saturation in most tumors. Chromosome 3p loss and VHL aberrations were the only ubiquitous events. The proportion of C>T transitions at CpG sites increased during tumor progression. M-seq permits the temporal resolution of ccRCC evolution and refines mutational signatures occurring during tumor development. PMID:24487277
A fast one-chip event-preprocessor and sequencer for the Simbol-X Low Energy Detector
NASA Astrophysics Data System (ADS)
Schanz, T.; Tenzer, C.; Maier, D.; Kendziorra, E.; Santangelo, A.
2010-12-01
We present an FPGA-based digital camera electronics consisting of an Event-Preprocessor (EPP) for on-board data preprocessing and a related Sequencer (SEQ) to generate the necessary signals to control the readout of the detector. The device has been originally designed for the Simbol-X low energy detector (LED). The EPP operates on 64×64 pixel images and has a real-time processing capability of more than 8000 frames per second. The already working releases of the EPP and the SEQ are now combined into one Digital-Camera-Controller-Chip (D3C).
Castoe, Todd A.; Poole, Alexander W.; de Koning, A. P. Jason; Jones, Kenneth L.; Tomback, Diana F.; Oyler-McCance, Sara J.; Fike, Jennifer A.; Lance, Stacey L.; Streicher, Jeffrey W.; Smith, Eric N.; Pollock, David D.
2012-01-01
Identification of microsatellites, or simple sequence repeats (SSRs), can be a time-consuming and costly investment requiring enrichment, cloning, and sequencing of candidate loci. Recently, however, high throughput sequencing (with or without prior enrichment for specific SSR loci) has been utilized to identify SSR loci. The direct "Seq-to-SSR" approach has an advantage over enrichment-based strategies in that it does not require a priori selection of particular motifs, or prior knowledge of genomic SSR content. It has been more expensive per SSR locus recovered, however, particularly for genomes with few SSR loci, such as bird genomes. The longer but relatively more expensive 454 reads have been preferred over less expensive Illumina reads. Here, we use Illumina paired-end sequence data to identify potentially amplifiable SSR loci (PALs) from a snake (the Burmese python, Python molurus bivittatus), and directly compare these results to those from 454 data. We also compare the python results to results from Illumina sequencing of two bird genomes (Gunnison Sage-grouse, Centrocercus minimus, and Clark's Nutcracker, Nucifraga columbiana), which have considerably fewer SSRs than the python. We show that direct Illumina Seq-to-SSR can identify and characterize thousands of potentially amplifiable SSR loci for as little as $10 per sample – a fraction of the cost of 454 sequencing. Given that Illumina Seq-to-SSR is effective, inexpensive, and reliable even for species such as birds that have few SSR loci, it seems that there are now few situations for which prior hybridization is justifiable.
Castoe, T.A.; Poole, A.W.; de Koning, A. P. J.; Jones, K.L.; Tomback, D.F.; Oyler-McCance, S.J.; Fike, J.A.; Lance, S.L.; Streicher, J.W.; Smith, E.N.; Pollock, D.D.
2012-01-01
Identification of microsatellites, or simple sequence repeats (SSRs), can be a time-consuming and costly investment requiring enrichment, cloning, and sequencing of candidate loci. Recently, however, high throughput sequencing (with or without prior enrichment for specific SSR loci) has been utilized to identify SSR loci. The direct "Seq-to-SSR" approach has an advantage over enrichment-based strategies in that it does not require a priori selection of particular motifs, or prior knowledge of genomic SSR content. It has been more expensive per SSR locus recovered, however, particularly for genomes with few SSR loci, such as bird genomes. The longer but relatively more expensive 454 reads have been preferred over less expensive Illumina reads. Here, we use Illumina paired-end sequence data to identify potentially amplifiable SSR loci (PALs) from a snake (the Burmese python, Python molurus bivittatus), and directly compare these results to those from 454 data. We also compare the python results to results from Illumina sequencing of two bird genomes (Gunnison Sage-grouse, Centrocercus minimus, and Clark's Nutcracker, Nucifraga columbiana), which have considerably fewer SSRs than the python. We show that direct Illumina Seq-to-SSR can identify and characterize thousands of potentially amplifiable SSR loci for as little as $10 per sample - a fraction of the cost of 454 sequencing. Given that Illumina Seq-to-SSR is effective, inexpensive, and reliable even for species such as birds that have few SSR loci, it seems that there are now few situations for which prior hybridization is justifiable. ?? 2012 Castoe et al.
Castoe, Todd A; Poole, Alexander W; de Koning, A P Jason; Jones, Kenneth L; Tomback, Diana F; Oyler-McCance, Sara J; Fike, Jennifer A; Lance, Stacey L; Streicher, Jeffrey W; Smith, Eric N; Pollock, David D
2012-01-01
Identification of microsatellites, or simple sequence repeats (SSRs), can be a time-consuming and costly investment requiring enrichment, cloning, and sequencing of candidate loci. Recently, however, high throughput sequencing (with or without prior enrichment for specific SSR loci) has been utilized to identify SSR loci. The direct "Seq-to-SSR" approach has an advantage over enrichment-based strategies in that it does not require a priori selection of particular motifs, or prior knowledge of genomic SSR content. It has been more expensive per SSR locus recovered, however, particularly for genomes with few SSR loci, such as bird genomes. The longer but relatively more expensive 454 reads have been preferred over less expensive Illumina reads. Here, we use Illumina paired-end sequence data to identify potentially amplifiable SSR loci (PALs) from a snake (the Burmese python, Python molurus bivittatus), and directly compare these results to those from 454 data. We also compare the python results to results from Illumina sequencing of two bird genomes (Gunnison Sage-grouse, Centrocercus minimus, and Clark's Nutcracker, Nucifraga columbiana), which have considerably fewer SSRs than the python. We show that direct Illumina Seq-to-SSR can identify and characterize thousands of potentially amplifiable SSR loci for as little as $10 per sample--a fraction of the cost of 454 sequencing. Given that Illumina Seq-to-SSR is effective, inexpensive, and reliable even for species such as birds that have few SSR loci, it seems that there are now few situations for which prior hybridization is justifiable.
Granatum: a graphical single-cell RNA-Seq analysis pipeline for genomics scientists.
Zhu, Xun; Wolfgruber, Thomas K; Tasato, Austin; Arisdakessian, Cédric; Garmire, David G; Garmire, Lana X
2017-12-05
Single-cell RNA sequencing (scRNA-Seq) is an increasingly popular platform to study heterogeneity at the single-cell level. Computational methods to process scRNA-Seq data are not very accessible to bench scientists as they require a significant amount of bioinformatic skills. We have developed Granatum, a web-based scRNA-Seq analysis pipeline to make analysis more broadly accessible to researchers. Without a single line of programming code, users can click through the pipeline, setting parameters and visualizing results via the interactive graphical interface. Granatum conveniently walks users through various steps of scRNA-Seq analysis. It has a comprehensive list of modules, including plate merging and batch-effect removal, outlier-sample removal, gene-expression normalization, imputation, gene filtering, cell clustering, differential gene expression analysis, pathway/ontology enrichment analysis, protein network interaction visualization, and pseudo-time cell series construction. Granatum enables broad adoption of scRNA-Seq technology by empowering bench scientists with an easy-to-use graphical interface for scRNA-Seq data analysis. The package is freely available for research use at http://garmiregroup.org/granatum/app.
Gao, Z J; Jiang, Q; Cheng, D Z; Yan, X X; Chen, Q; Xu, K M
2016-10-02
Objective: To evaluate the application of single nucleotide polymorphism (SNP)-microarray and target gene sequencing technology in the clinical molecular genetic diagnosis of unexplained intellectual disability(ID) or developmental delay (DD). Method: Patients with ID or DD were recruited in the Department of Neurology, Affiliated Children's Hospital of Capital Institute of Pediatrics between September 2015 and February 2016. The intellectual assessment of the patients was performed using 0-6-year-old pediatric examination table of neuropsychological development or Wechsler intelligence scale (>6 years). Patients with a DQ less than 49 or IQ less than 51 were included in this study. The patients were scanned by SNP-array for detection of genomic copy number variations (CNV), and the revealed genomic imbalance was confirmed by quantitative real time-PCR. Candidate gene mutation screening was carried out by target gene sequencing technology.Causal mutations or likely pathogenic variants were verified by polymerase chain reaction and direct sequencing. Result: There were 15 children with ID or DD enrolled, 9 males and 6 females. The age of these patients was 7 months-16 years and 9 months. SNP-array revealed that two of the 15 patients had genomic CNV. Both CNV were de novo micro deletions, one involved 11q24.1q25 and the other micro deletion located on 21q22.2q22.3. Both micro deletions were proved to have a clinical significance due to their association with ID, brain DD, unusual faces etc. by querying Decipher database. Thirteen patients with negative findings in SNP-array were consequently examined with target gene sequencing technology, genotype-phenotype correlation analysis and genetic analysis. Five patients were diagnosed with monogenic disorder, two were diagnosed with suspected genetic disorder and six were still negative. Conclusion: Sequential use of SNP-array and target gene sequencing technology can significantly increase the molecular genetic etiologic diagnosis rate of the patients with unexplained ID or DD. Combined use of these technologies can serve as a useful examinational method in assisting differential diagnosis of children with unexplained ID or DD.
An Annotation Agnostic Algorithm for Detecting Nascent RNA Transcripts in GRO-Seq.
Azofeifa, Joseph G; Allen, Mary A; Lladser, Manuel E; Dowell, Robin D
2017-01-01
We present a fast and simple algorithm to detect nascent RNA transcription in global nuclear run-on sequencing (GRO-seq). GRO-seq is a relatively new protocol that captures nascent transcripts from actively engaged polymerase, providing a direct read-out on bona fide transcription. Most traditional assays, such as RNA-seq, measure steady state RNA levels which are affected by transcription, post-transcriptional processing, and RNA stability. GRO-seq data, however, presents unique analysis challenges that are only beginning to be addressed. Here, we describe a new algorithm, Fast Read Stitcher (FStitch), that takes advantage of two popular machine-learning techniques, hidden Markov models and logistic regression, to classify which regions of the genome are transcribed. Given a small user-defined training set, our algorithm is accurate, robust to varying read depth, annotation agnostic, and fast. Analysis of GRO-seq data without a priori need for annotation uncovers surprising new insights into several aspects of the transcription process.
Polyester: simulating RNA-seq datasets with differential transcript expression.
Frazee, Alyssa C; Jaffe, Andrew E; Langmead, Ben; Leek, Jeffrey T
2015-09-01
Statistical methods development for differential expression analysis of RNA sequencing (RNA-seq) requires software tools to assess accuracy and error rate control. Since true differential expression status is often unknown in experimental datasets, artificially constructed datasets must be utilized, either by generating costly spike-in experiments or by simulating RNA-seq data. Polyester is an R package designed to simulate RNA-seq data, beginning with an experimental design and ending with collections of RNA-seq reads. Its main advantage is the ability to simulate reads indicating isoform-level differential expression across biological replicates for a variety of experimental designs. Data generated by Polyester is a reasonable approximation to real RNA-seq data and standard differential expression workflows can recover differential expression set in the simulation by the user. Polyester is freely available from Bioconductor (http://bioconductor.org/). jtleek@gmail.com Supplementary data are available at Bioinformatics online. © The Author 2015. Published by Oxford University Press. All rights reserved. For Permissions, please e-mail: journals.permissions@oup.com.
USDA-ARS?s Scientific Manuscript database
Over the past decade, Next Generation Sequencing (NGS) technologies, also called deep sequencing, have continued to evolve, increasing capacity and lower the cost necessary for large genome sequencing projects. The one of the advantage of NGS platforms is the possibility to sequence the samples with...
Functional assessment of human enhancer activities using whole-genome STARR-sequencing.
Liu, Yuwen; Yu, Shan; Dhiman, Vineet K; Brunetti, Tonya; Eckart, Heather; White, Kevin P
2017-11-20
Genome-wide quantification of enhancer activity in the human genome has proven to be a challenging problem. Recent efforts have led to the development of powerful tools for enhancer quantification. However, because of genome size and complexity, these tools have yet to be applied to the whole human genome. In the current study, we use a human prostate cancer cell line, LNCaP as a model to perform whole human genome STARR-seq (WHG-STARR-seq) to reliably obtain an assessment of enhancer activity. This approach builds upon previously developed STARR-seq in the fly genome and CapSTARR-seq techniques in targeted human genomic regions. With an improved library preparation strategy, our approach greatly increases the library complexity per unit of starting material, which makes it feasible and cost-effective to explore the landscape of regulatory activity in the much larger human genome. In addition to our ability to identify active, accessible enhancers located in open chromatin regions, we can also detect sequences with the potential for enhancer activity that are located in inaccessible, closed chromatin regions. When treated with the histone deacetylase inhibitor, Trichostatin A, genes nearby this latter class of enhancers are up-regulated, demonstrating the potential for endogenous functionality of these regulatory elements. WHG-STARR-seq provides an improved approach to current pipelines for analysis of high complexity genomes to gain a better understanding of the intricacies of transcriptional regulation.
Determination of in vivo RNA kinetics using RATE-seq.
Neymotin, Benjamin; Athanasiadou, Rodoniki; Gresham, David
2014-10-01
The abundance of a transcript is determined by its rate of synthesis and its rate of degradation; however, global methods for quantifying RNA abundance cannot distinguish variation in these two processes. Here, we introduce RNA approach to equilibrium sequencing (RATE-seq), which uses in vivo metabolic labeling of RNA and approach to equilibrium kinetics, to determine absolute RNA degradation and synthesis rates. RATE-seq does not disturb cellular physiology, uses straightforward normalization with exogenous spike-ins, and can be readily adapted for studies in most organisms. We demonstrate the use of RATE-seq to estimate genome-wide kinetic parameters for coding and noncoding transcripts in Saccharomyces cerevisiae. © 2014 Neymotin et al.; Published by Cold Spring Harbor Laboratory Press for the RNA Society.
Liu, Lian; Zhang, Shao-Wu; Huang, Yufei; Meng, Jia
2017-08-31
As a newly emerged research area, RNA epigenetics has drawn increasing attention recently for the participation of RNA methylation and other modifications in a number of crucial biological processes. Thanks to high throughput sequencing techniques, such as, MeRIP-Seq, transcriptome-wide RNA methylation profile is now available in the form of count-based data, with which it is often of interests to study the dynamics at epitranscriptomic layer. However, the sample size of RNA methylation experiment is usually very small due to its costs; and additionally, there usually exist a large number of genes whose methylation level cannot be accurately estimated due to their low expression level, making differential RNA methylation analysis a difficult task. We present QNB, a statistical approach for differential RNA methylation analysis with count-based small-sample sequencing data. Compared with previous approaches such as DRME model based on a statistical test covering the IP samples only with 2 negative binomial distributions, QNB is based on 4 independent negative binomial distributions with their variances and means linked by local regressions, and in the way, the input control samples are also properly taken care of. In addition, different from DRME approach, which relies only the input control sample only for estimating the background, QNB uses a more robust estimator for gene expression by combining information from both input and IP samples, which could largely improve the testing performance for very lowly expressed genes. QNB showed improved performance on both simulated and real MeRIP-Seq datasets when compared with competing algorithms. And the QNB model is also applicable to other datasets related RNA modifications, including but not limited to RNA bisulfite sequencing, m 1 A-Seq, Par-CLIP, RIP-Seq, etc.
TagDigger: user-friendly extraction of read counts from GBS and RAD-seq data.
Clark, Lindsay V; Sacks, Erik J
2016-01-01
In genotyping-by-sequencing (GBS) and restriction site-associated DNA sequencing (RAD-seq), read depth is important for assessing the quality of genotype calls and estimating allele dosage in polyploids. However, existing pipelines for GBS and RAD-seq do not provide read counts in formats that are both accurate and easy to access. Additionally, although existing pipelines allow previously-mined SNPs to be genotyped on new samples, they do not allow the user to manually specify a subset of loci to examine. Pipelines that do not use a reference genome assign arbitrary names to SNPs, making meta-analysis across projects difficult. We created the software TagDigger, which includes three programs for analyzing GBS and RAD-seq data. The first script, tagdigger_interactive.py, rapidly extracts read counts and genotypes from FASTQ files using user-supplied sets of barcodes and tags. Input and output is in CSV format so that it can be opened by spreadsheet software. Tag sequences can also be imported from the Stacks, TASSEL-GBSv2, TASSEL-UNEAK, or pyRAD pipelines, and a separate file can be imported listing the names of markers to retain. A second script, tag_manager.py, consolidates marker names and sequences across multiple projects. A third script, barcode_splitter.py, assists with preparing FASTQ data for deposit in a public archive by splitting FASTQ files by barcode and generating MD5 checksums for the resulting files. TagDigger is open-source and freely available software written in Python 3. It uses a scalable, rapid search algorithm that can process over 100 million FASTQ reads per hour. TagDigger will run on a laptop with any operating system, does not consume hard drive space with intermediate files, and does not require programming skill to use.
Characterization of Human Salivary Extracellular RNA by Next-generation Sequencing.
Li, Feng; Kaczor-Urbanowicz, Karolina Elżbieta; Sun, Jie; Majem, Blanca; Lo, Hsien-Chun; Kim, Yong; Koyano, Kikuye; Liu Rao, Shannon; Young Kang, So; Mi Kim, Su; Kim, Kyoung-Mee; Kim, Sung; Chia, David; Elashoff, David; Grogan, Tristan R; Xiao, Xinshu; Wong, David T W
2018-04-23
It was recently discovered that abundant and stable extracellular RNA (exRNA) species exist in bodily fluids. Saliva is an emerging biofluid for biomarker development for noninvasive detection and screening of local and systemic diseases. Use of RNA-Sequencing (RNA-Seq) to profile exRNA is rapidly growing; however, no single preparation and analysis protocol can be used for all biofluids. Specifically, RNA-Seq of saliva is particularly challenging owing to high abundance of bacterial contents and low abundance of salivary exRNA. Given the laborious procedures needed for RNA-Seq library construction, sequencing, data storage, and data analysis, saliva-specific and optimized protocols are essential. We compared different RNA isolation methods and library construction kits for long and small RNA sequencing. The role of ribosomal RNA (rRNA) depletion also was evaluated. The miRNeasy Micro Kit (Qiagen) showed the highest total RNA yield (70.8 ng/mL cell-free saliva) and best small RNA recovery, and the NEBNext library preparation kits resulted in the highest number of detected human genes [5649-6813 at 1 reads per kilobase RNA per million mapped (RPKM)] and small RNAs [482-696 microRNAs (miRNAs) and 190-214 other small RNAs]. The proportion of human RNA-Seq reads was much higher in rRNA-depleted saliva samples (41%) than in samples without rRNA depletion (14%). In addition, the transfer RNA (tRNA)-derived RNA fragments (tRFs), a novel class of small RNAs, were highly abundant in human saliva, specifically tRF-4 (4%) and tRF-5 (15.25%). Our results may help in selection of the best adapted methods of RNA isolation and small and long RNA library constructions for salivary exRNA studies. © 2018 American Association for Clinical Chemistry.
PySeqLab: an open source Python package for sequence labeling and segmentation.
Allam, Ahmed; Krauthammer, Michael
2017-11-01
Text and genomic data are composed of sequential tokens, such as words and nucleotides that give rise to higher order syntactic constructs. In this work, we aim at providing a comprehensive Python library implementing conditional random fields (CRFs), a class of probabilistic graphical models, for robust prediction of these constructs from sequential data. Python Sequence Labeling (PySeqLab) is an open source package for performing supervised learning in structured prediction tasks. It implements CRFs models, that is discriminative models from (i) first-order to higher-order linear-chain CRFs, and from (ii) first-order to higher-order semi-Markov CRFs (semi-CRFs). Moreover, it provides multiple learning algorithms for estimating model parameters such as (i) stochastic gradient descent (SGD) and its multiple variations, (ii) structured perceptron with multiple averaging schemes supporting exact and inexact search using 'violation-fixing' framework, (iii) search-based probabilistic online learning algorithm (SAPO) and (iv) an interface for Broyden-Fletcher-Goldfarb-Shanno (BFGS) and the limited-memory BFGS algorithms. Viterbi and Viterbi A* are used for inference and decoding of sequences. Using PySeqLab, we built models (classifiers) and evaluated their performance in three different domains: (i) biomedical Natural language processing (NLP), (ii) predictive DNA sequence analysis and (iii) Human activity recognition (HAR). State-of-the-art performance comparable to machine-learning based systems was achieved in the three domains without feature engineering or the use of knowledge sources. PySeqLab is available through https://bitbucket.org/A_2/pyseqlab with tutorials and documentation. ahmed.allam@yale.edu or michael.krauthammer@yale.edu. Supplementary data are available at Bioinformatics online. © The Author 2017. Published by Oxford University Press. All rights reserved. For Permissions, please e-mail: journals.permissions@oup.com
GenSeq: An updated nomenclature and ranking for genetic sequences from type and non-type sources
Chakrabarty, Prosanta; Warren, Melanie; Page, Lawrence M.; Baldwin, Carole C.
2013-01-01
Abstract An improved and expanded nomenclature for genetic sequences is introduced that corresponds with a ranking of the reliability of the taxonomic identification of the source specimens. This nomenclature is an advancement of the “Genetypes” naming system, which some have been reluctant to adopt because of the use of the “type” suffix in the terminology. In the new nomenclature, genetic sequences are labeled “genseq,” followed by a reliability ranking (e.g., 1 if the sequence is from a primary type), followed by the name of the genes from which the sequences were derived (e.g., genseq-1 16S, COI). The numbered suffix provides an indication of the likely reliability of taxonomic identification of the voucher. Included in this ranking system, in descending order of taxonomic reliability, are the following: sequences from primary types – “genseq-1,” secondary types – “genseq-2,” collection-vouchered topotypes – “genseq-3,” collection-vouchered non-types – “genseq-4,” and non-types that lack specimen vouchers but have photo vouchers – “genseq-5.” To demonstrate use of the new nomenclature, we review recently published new-species descriptions in the ichthyological literature that include DNA data and apply the GenSeq nomenclature to sequences referenced in those publications. We encourage authors to adopt the GenSeq nomenclature (note capital “G” and “S” when referring to the nomenclatural program) to provide a searchable tag (e.g., “genseq”; note lowercase “g” and “s” when referring to sequences) for genetic sequences from types and other vouchered specimens. Use of the new nomenclature and ranking system will improve integration of molecular phylogenetics and biological taxonomy and enhance the ability of researchers to assess the reliability of sequence data. We further encourage authors to update sequence information on databases such as GenBank whenever nomenclatural changes are made. PMID:24223486
USDA-ARS?s Scientific Manuscript database
The transcriptome provides a functional footprint of the genome by enumerating the molecular components of cells and tissues. The field of transcript discovery has been revolutionized through high-throughput mRNA sequencing (RNA-seq). Here, we present a methodology that replicates and improves exist...
Karnik, Rahul; Beer, Michael A.
2015-01-01
The generation of genomic binding or accessibility data from massively parallel sequencing technologies such as ChIP-seq and DNase-seq continues to accelerate. Yet state-of-the-art computational approaches for the identification of DNA binding motifs often yield motifs of weak predictive power. Here we present a novel computational algorithm called MotifSpec, designed to find predictive motifs, in contrast to over-represented sequence elements. The key distinguishing feature of this algorithm is that it uses a dynamic search space and a learned threshold to find discriminative motifs in combination with the modeling of motifs using a full PWM (position weight matrix) rather than k-mer words or regular expressions. We demonstrate that our approach finds motifs corresponding to known binding specificities in several mammalian ChIP-seq datasets, and that our PWMs classify the ChIP-seq signals with accuracy comparable to, or marginally better than motifs from the best existing algorithms. In other datasets, our algorithm identifies novel motifs where other methods fail. Finally, we apply this algorithm to detect motifs from expression datasets in C. elegans using a dynamic expression similarity metric rather than fixed expression clusters, and find novel predictive motifs. PMID:26465884
Karnik, Rahul; Beer, Michael A
2015-01-01
The generation of genomic binding or accessibility data from massively parallel sequencing technologies such as ChIP-seq and DNase-seq continues to accelerate. Yet state-of-the-art computational approaches for the identification of DNA binding motifs often yield motifs of weak predictive power. Here we present a novel computational algorithm called MotifSpec, designed to find predictive motifs, in contrast to over-represented sequence elements. The key distinguishing feature of this algorithm is that it uses a dynamic search space and a learned threshold to find discriminative motifs in combination with the modeling of motifs using a full PWM (position weight matrix) rather than k-mer words or regular expressions. We demonstrate that our approach finds motifs corresponding to known binding specificities in several mammalian ChIP-seq datasets, and that our PWMs classify the ChIP-seq signals with accuracy comparable to, or marginally better than motifs from the best existing algorithms. In other datasets, our algorithm identifies novel motifs where other methods fail. Finally, we apply this algorithm to detect motifs from expression datasets in C. elegans using a dynamic expression similarity metric rather than fixed expression clusters, and find novel predictive motifs.
Rozenberg, Andrey; Leese, Florian; Weiss, Linda C; Tollrian, Ralph
2016-01-01
Tag-Seq is a high-throughput approach used for discovering SNPs and characterizing gene expression. In comparison to RNA-Seq, Tag-Seq eases data processing and allows detection of rare mRNA species using only one tag per transcript molecule. However, reduced library complexity raises the issue of PCR duplicates, which distort gene expression levels. Here we present a novel Tag-Seq protocol that uses the least biased methods for RNA library preparation combined with a novel approach for joint PCR template and sample labeling. In our protocol, input RNA is fragmented by hydrolysis, and poly(A)-bearing RNAs are selected and directly ligated to mixed DNA-RNA P5 adapters. The P5 adapters contain i5 barcodes composed of sample-specific (moderately) degenerate base regions (mDBRs), which later allow detection of PCR duplicates. The P7 adapter is attached via reverse transcription with individual i7 barcodes added during the amplification step. The resulting libraries can be sequenced on an Illumina sequencer. After sample demultiplexing and PCR duplicate removal with a free software tool we designed, the data are ready for downstream analysis. Our protocol was tested on RNA samples from predator-induced and control Daphnia microcrustaceans.
Genome Sequence of Stachybotrys chartarum Strain 51-11
Kim, Jean; Levy, Josh
2015-01-01
The Stachybotrys chartarum strain 51-11 genome was sequenced by shotgun sequencing utilizing Illumina HiSeq 2000 and PacBio technologies. Since S. chartarum has been implicated as having health impacts within water-damaged buildings, any information extracted from the genomic sequence data relating to toxins or the metabolism of the fungus might be useful. PMID:26430036
First report of Cocksfoot mottle virus infecting wheat (Triticum aestivum) in Ohio
USDA-ARS?s Scientific Manuscript database
Cocksfoot mottle virus (CfMV) was discovered in Ohio wheat during a 2016 field survey utilizing RNA-Seq to identify virus-like sequences. Virus sequences were confirmed by reverse transcriptase-polymerase chain reaction (RT-PCR) and Sanger sequencing, and CfMV was transmitted to orchardgrass and pas...
Issues with RNA-seq analysis in non-model organisms: A salmonid example.
Sundaram, Arvind; Tengs, Torstein; Grimholt, Unni
2017-10-01
High throughput sequencing (HTS) is useful for many purposes as exemplified by the other topics included in this special issue. The purpose of this paper is to look into the unique challenges of using this technology in non-model organisms where resources such as genomes, functional genome annotations or genome complexity provide obstacles not met in model organisms. To describe these challenges, we narrow our scope to RNA sequencing used to study differential gene expression in response to pathogen challenge. As a demonstration species we chose Atlantic salmon, which has a sequenced genome with poor annotation and an added complexity due to many duplicated genes. We find that our RNA-seq analysis pipeline deciphers between duplicates despite high sequence identity. However, annotation issues provide problems in linking differentially expressed genes to pathways. Also, comparing results between approaches and species are complicated due to lack of standardized annotation. Copyright © 2017 Elsevier Ltd. All rights reserved.
Amissah, Nana Ama; Chlebowicz, Monika A; Ablordey, Anthony; Sabat, Artur J; Tetteh, Caitlin S; Prah, Isaac; van der Werf, Tjip S; Friedrich, Alex W; van Dijl, Jan Maarten; Rossen, John W; Stienstra, Ymkje
2015-01-01
Buruli ulcer (BU) is a skin infection caused by Mycobacterium ulcerans. The wounds of most BU patients are colonized with different microorganisms, including Staphylococcus aureus. This study investigated possible patient-to-patient transmission events of S. aureus during wound care in a health care center. S. aureus isolates from different BU patients with overlapping visits to the clinic were whole-genome sequenced and analyzed by a gene-by-gene approach using SeqSphere(+) software. In addition, sequence data were screened for the presence of genes that conferred antibiotic resistance. SeqSphere(+) analysis of whole-genome sequence data confirmed transmission of methicillin resistant S. aureus (MRSA) and methicillin susceptible S. aureus among patients that took place during wound care. Interestingly, our sequence data show that the investigated MRSA isolates carry a novel allele of the fexB gene conferring chloramphenicol resistance, which had thus far not been observed in S. aureus.
Kuroda, Yukiko; Ohashi, Ikuko; Naruto, Takuya; Ida, Kazumi; Enomoto, Yumi; Saito, Toshiyuki; Nagai, Jun-Ichi; Wada, Takahito; Kurosawa, Kenji
2015-06-01
Next-generation sequencing has enabled the screening for a causative mutation in X-linked intellectual disability (XLID). We identified KIAA2022 mutations in two unrelated male patients by targeted sequencing. We selected 13 Japanese male patients with severe intellectual disability (ID), including four sibling patients and nine sporadic patients. Two of thirteen had a KIAA2022 mutation. Patient 1 was a 3-year-old boy. He had severe ID with autistic behavior and hypotonia. Patient 2 was a 5-year-old boy. He also had severe ID with autistic behavior, hypotonia, central hypothyroidism, and steroid-dependent nephrotic syndrome. Both patients revealed consistent distinctive features, including upswept hair, narrow forehead, downslanting eyebrows, wide palpebral fissures, long nose, hypoplastic alae nasi, open mouth, and large ears. De novo KIAA2022 mutations (p.Q705X in Patient 1, p.R322X in Patient 2) were detected by targeted sequencing and confirmed by Sanger sequencing. KIAA2022 mutations and alterations have been reported in only four families with nonsyndromic ID and epilepsy. KIAA2022 is highly expressed in the fetal and adult brain and plays a crucial role in neuronal development. These additional patients support the evidence that KIAA2022 is a causative gene for XLID. © 2015 Wiley Periodicals, Inc.
Mykles, Donald L.; Burnett, Karen G.; Durica, David S.; Joyce, Blake L.; McCarthy, Fiona M.; Schmidt, Carl J.; Stillman, Jonathon H.
2016-01-01
High-throughput RNA sequencing (RNA-seq) technology has become an important tool for studying physiological responses of organisms to changes in their environment. De novo assembly of RNA-seq data has allowed researchers to create a comprehensive catalog of genes expressed in a tissue and to quantify their expression without a complete genome sequence. The contributions from the “Tapping the Power of Crustacean Transcriptomics to Address Grand Challenges in Comparative Biology” symposium in this issue show the successes and limitations of using RNA-seq in the study of crustaceans. In conjunction with the symposium, the Animal Genome to Phenome Research Coordination Network collated comments from participants at the meeting regarding the challenges encountered when using transcriptomics in their research. Input came from novices and experts ranging from graduate students to principal investigators. Many were unaware of the bioinformatics analysis resources currently available on the CyVerse platform. Our analysis of community responses led to three recommendations for advancing the field: (1) integration of genomic and RNA-seq sequence assemblies for crustacean gene annotation and comparative expression; (2) development of methodologies for the functional analysis of genes; and (3) information and training exchange among laboratories for transmission of best practices. The field lacks the methods for manipulating tissue-specific gene expression. The decapod crustacean research community should consider the cherry shrimp, Neocaridina denticulata, as a decapod model for the application of transgenic tools for functional genomics. This would require a multi-investigator effort. PMID:27639274
Evaluation of sequencing approaches for high-throughput ...
Whole-genome in vitro transcriptomics has shown the capability to identify mechanisms of action and estimates of potency for chemical-mediated effects in a toxicological framework, but with limited throughput and high cost. We present the evaluation of three toxicogenomics platforms for potential application to high-throughput screening: 1. TempO-Seq utilizing custom designed paired probes per gene; 2. Targeted sequencing (TSQ) utilizing Illumina’s TruSeq RNA Access Library Prep Kit containing tiled exon-specific probe sets; 3. Low coverage whole transcriptome sequencing (LSQ) using Illumina’s TruSeq Stranded mRNA Kit. Each platform was required to cover the ~20,000 genes of the full transcriptome, operate directly with cell lysates, and be automatable with 384-well plates. Technical reproducibility was assessed using MAQC control RNA samples A and B, while functional utility for chemical screening was evaluated using six treatments at a single concentration after 6 hr in MCF7 breast cancer cells: 10 µM chlorpromazine, 10 µM ciclopriox, 10 µM genistein, 100 nM sirolimus, 1 µM tanespimycin, and 1 µM trichostatin A. All RNA samples and chemical treatments were run with 5 technical replicates. The three platforms achieved different read depths, with the TempO-Seq having ~34M mapped reads per sample, while TSQ and LSQ averaged 20M and 11M aligned reads per sample, respectively. Inter-replicate correlation averaged ≥0.95 for raw log2 expression values i
SraTailor: graphical user interface software for processing and visualizing ChIP-seq data.
Oki, Shinya; Maehara, Kazumitsu; Ohkawa, Yasuyuki; Meno, Chikara
2014-12-01
Raw data from ChIP-seq (chromatin immunoprecipitation combined with massively parallel DNA sequencing) experiments are deposited in public databases as SRAs (Sequence Read Archives) that are publically available to all researchers. However, to graphically visualize ChIP-seq data of interest, the corresponding SRAs must be downloaded and converted into BigWig format, a process that involves complicated command-line processing. This task requires users to possess skill with script languages and sequence data processing, a requirement that prevents a wide range of biologists from exploiting SRAs. To address these challenges, we developed SraTailor, a GUI (Graphical User Interface) software package that automatically converts an SRA into a BigWig-formatted file. Simplicity of use is one of the most notable features of SraTailor: entering an accession number of an SRA and clicking the mouse are the only steps required to obtain BigWig-formatted files and to graphically visualize the extents of reads at given loci. SraTailor is also able to make peak calls, generate files of other formats, process users' own data, and accept various command-line-like options. Therefore, this software makes ChIP-seq data fully exploitable by a wide range of biologists. SraTailor is freely available at http://www.devbio.med.kyushu-u.ac.jp/sra_tailor/, and runs on both Mac and Windows machines. © 2014 The Authors Genes to Cells © 2014 by the Molecular Biology Society of Japan and Wiley Publishing Asia Pty Ltd.
Waltari, Eric; Jia, Manxue; Jiang, Caroline S; Lu, Hong; Huang, Jing; Fernandez, Cristina; Finzi, Andrés; Kaufmann, Daniel E; Markowitz, Martin; Tsuji, Moriya; Wu, Xueling
2018-01-01
Using 5' rapid amplification of cDNA ends, Illumina MiSeq, and basic flow cytometry, we systematically analyzed the expressed B cell receptor (BCR) repertoire in 14 healthy adult PBMCs, 5 HIV-1+ adult PBMCs, 5 cord blood samples, and 3 HIS-CD4/B mice, examining the full-length variable region of μ, γ, α, κ, and λ chains for V-gene usage, somatic hypermutation (SHM), and CDR3 length. Adding to the known repertoire of healthy adults, Illumina MiSeq consistently detected small fractions of reads with high mutation frequencies including hypermutated μ reads, and reads with long CDR3s. Additionally, the less studied IgA repertoire displayed similar characteristics to that of IgG. Compared to healthy adults, the five HIV-1 chronically infected adults displayed elevated mutation frequencies for all μ, γ, α, κ, and λ chains examined and slightly longer CDR3 lengths for γ, α, and λ. To evaluate the reconstituted human BCR sequences in a humanized mouse model, we analyzed cord blood and HIS-CD4/B mice, which all lacked the typical SHM seen in the adult reference. Furthermore, MiSeq revealed identical unmutated IgM sequences derived from separate cell aliquots, thus for the first time demonstrating rare clonal members of unmutated IgM B cells by sequencing.
A Framework Phylogeny of the American Oak Clade Based on Sequenced RAD Data
Hipp, Andrew L.; Eaton, Deren A. R.; Cavender-Bares, Jeannine; Fitzek, Elisabeth; Nipper, Rick; Manos, Paul S.
2014-01-01
Previous phylogenetic studies in oaks (Quercus, Fagaceae) have failed to resolve the backbone topology of the genus with strong support. Here, we utilize next-generation sequencing of restriction-site associated DNA (RAD-Seq) to resolve a framework phylogeny of a predominantly American clade of oaks whose crown age is estimated at 23–33 million years old. Using a recently developed analytical pipeline for RAD-Seq phylogenetics, we created a concatenated matrix of 1.40 E06 aligned nucleotides, constituting 27,727 sequence clusters. RAD-Seq data were readily combined across runs, with no difference in phylogenetic placement between technical replicates, which overlapped by only 43–64% in locus coverage. 17% (4,715) of the loci we analyzed could be mapped with high confidence to one or more expressed sequence tags in NCBI Genbank. A concatenated matrix of the loci that BLAST to at least one EST sequence provides approximately half as many variable or parsimony-informative characters as equal-sized datasets from the non-EST loci. The EST-associated matrix is more complete (fewer missing loci) and has slightly lower homoplasy than non-EST subsampled matrices of the same size, but there is no difference in phylogenetic support or relative attribution of base substitutions to internal versus terminal branches of the phylogeny. We introduce a partitioned RAD visualization method (implemented in the R package RADami; http://cran.r-project.org/web/packages/RADami) to investigate the possibility that suboptimal topologies supported by large numbers of loci—due, for example, to reticulate evolution or lineage sorting—are masked by the globally optimal tree. We find no evidence for strongly-supported alternative topologies in our study, suggesting that the phylogeny we recover is a robust estimate of large-scale phylogenetic patterns in the American oak clade. Our study is one of the first to demonstrate the utility of RAD-Seq data for inferring phylogeny in a 23–33 million year-old clade. PMID:24705617
Sequence-specific bias correction for RNA-seq data using recurrent neural networks.
Zhang, Yao-Zhong; Yamaguchi, Rui; Imoto, Seiya; Miyano, Satoru
2017-01-25
The recent success of deep learning techniques in machine learning and artificial intelligence has stimulated a great deal of interest among bioinformaticians, who now wish to bring the power of deep learning to bare on a host of bioinformatical problems. Deep learning is ideally suited for biological problems that require automatic or hierarchical feature representation for biological data when prior knowledge is limited. In this work, we address the sequence-specific bias correction problem for RNA-seq data redusing Recurrent Neural Networks (RNNs) to model nucleotide sequences without pre-determining sequence structures. The sequence-specific bias of a read is then calculated based on the sequence probabilities estimated by RNNs, and used in the estimation of gene abundance. We explore the application of two popular RNN recurrent units for this task and demonstrate that RNN-based approaches provide a flexible way to model nucleotide sequences without knowledge of predetermined sequence structures. Our experiments show that training a RNN-based nucleotide sequence model is efficient and RNN-based bias correction methods compare well with the-state-of-the-art sequence-specific bias correction method on the commonly used MAQC-III data set. RNNs provides an alternative and flexible way to calculate sequence-specific bias without explicitly pre-determining sequence structures.
2017-01-01
Unique Molecular Identifiers (UMIs) are random oligonucleotide barcodes that are increasingly used in high-throughput sequencing experiments. Through a UMI, identical copies arising from distinct molecules can be distinguished from those arising through PCR amplification of the same molecule. However, bioinformatic methods to leverage the information from UMIs have yet to be formalized. In particular, sequencing errors in the UMI sequence are often ignored or else resolved in an ad hoc manner. We show that errors in the UMI sequence are common and introduce network-based methods to account for these errors when identifying PCR duplicates. Using these methods, we demonstrate improved quantification accuracy both under simulated conditions and real iCLIP and single-cell RNA-seq data sets. Reproducibility between iCLIP replicates and single-cell RNA-seq clustering are both improved using our proposed network-based method, demonstrating the value of properly accounting for errors in UMIs. These methods are implemented in the open source UMI-tools software package. PMID:28100584
Singh, Vikas K; Khan, Aamir W; Saxena, Rachit K; Kumar, Vinay; Kale, Sandip M; Sinha, Pallavi; Chitikineni, Annapurna; Pazhamala, Lekha T; Garg, Vanika; Sharma, Mamta; Sameer Kumar, Chanda Venkata; Parupalli, Swathi; Vechalapu, Suryanarayana; Patil, Suyash; Muniswamy, Sonnappa; Ghanta, Anuradha; Yamini, Kalinati Narasimhan; Dharmaraj, Pallavi Subbanna; Varshney, Rajeev K
2016-05-01
To map resistance genes for Fusarium wilt (FW) and sterility mosaic disease (SMD) in pigeonpea, sequencing-based bulked segregant analysis (Seq-BSA) was used. Resistant (R) and susceptible (S) bulks from the extreme recombinant inbred lines of ICPL 20096 × ICPL 332 were sequenced. Subsequently, SNP index was calculated between R- and S-bulks with the help of draft genome sequence and reference-guided assembly of ICPL 20096 (resistant parent). Seq-BSA has provided seven candidate SNPs for FW and SMD resistance in pigeonpea. In parallel, four additional genotypes were re-sequenced and their combined analysis with R- and S-bulks has provided a total of 8362 nonsynonymous (ns) SNPs. Of 8362 nsSNPs, 60 were found within the 2-Mb flanking regions of seven candidate SNPs identified through Seq-BSA. Haplotype analysis narrowed down to eight nsSNPs in seven genes. These eight nsSNPs were further validated by re-sequencing 11 genotypes that are resistant and susceptible to FW and SMD. This analysis revealed association of four candidate nsSNPs in four genes with FW resistance and four candidate nsSNPs in three genes with SMD resistance. Further, In silico protein analysis and expression profiling identified two most promising candidate genes namely C.cajan_01839 for SMD resistance and C.cajan_03203 for FW resistance. Identified candidate genomic regions/SNPs will be useful for genomics-assisted breeding in pigeonpea. © 2015 The Authors. Plant Biotechnology Journal published by Society for Experimental Biology and The Association of Applied Biologists and John Wiley & Sons Ltd.
Xiao, Chuan-Le; Mai, Zhi-Biao; Lian, Xin-Lei; Zhong, Jia-Yong; Jin, Jing-Jie; He, Qing-Yu; Zhang, Gong
2014-01-01
Correct and bias-free interpretation of the deep sequencing data is inevitably dependent on the complete mapping of all mappable reads to the reference sequence, especially for quantitative RNA-seq applications. Seed-based algorithms are generally slow but robust, while Burrows-Wheeler Transform (BWT) based algorithms are fast but less robust. To have both advantages, we developed an algorithm FANSe2 with iterative mapping strategy based on the statistics of real-world sequencing error distribution to substantially accelerate the mapping without compromising the accuracy. Its sensitivity and accuracy are higher than the BWT-based algorithms in the tests using both prokaryotic and eukaryotic sequencing datasets. The gene identification results of FANSe2 is experimentally validated, while the previous algorithms have false positives and false negatives. FANSe2 showed remarkably better consistency to the microarray than most other algorithms in terms of gene expression quantifications. We implemented a scalable and almost maintenance-free parallelization method that can utilize the computational power of multiple office computers, a novel feature not present in any other mainstream algorithm. With three normal office computers, we demonstrated that FANSe2 mapped an RNA-seq dataset generated from an entire Illunima HiSeq 2000 flowcell (8 lanes, 608 M reads) to masked human genome within 4.1 hours with higher sensitivity than Bowtie/Bowtie2. FANSe2 thus provides robust accuracy, full indel sensitivity, fast speed, versatile compatibility and economical computational utilization, making it a useful and practical tool for deep sequencing applications. FANSe2 is freely available at http://bioinformatics.jnu.edu.cn/software/fanse2/.
2012-01-01
Background RNA sequencing (RNA-Seq) has emerged as a powerful approach for the detection of differential gene expression with both high-throughput and high resolution capabilities possible depending upon the experimental design chosen. Multiplex experimental designs are now readily available, these can be utilised to increase the numbers of samples or replicates profiled at the cost of decreased sequencing depth generated per sample. These strategies impact on the power of the approach to accurately identify differential expression. This study presents a detailed analysis of the power to detect differential expression in a range of scenarios including simulated null and differential expression distributions with varying numbers of biological or technical replicates, sequencing depths and analysis methods. Results Differential and non-differential expression datasets were simulated using a combination of negative binomial and exponential distributions derived from real RNA-Seq data. These datasets were used to evaluate the performance of three commonly used differential expression analysis algorithms and to quantify the changes in power with respect to true and false positive rates when simulating variations in sequencing depth, biological replication and multiplex experimental design choices. Conclusions This work quantitatively explores comparisons between contemporary analysis tools and experimental design choices for the detection of differential expression using RNA-Seq. We found that the DESeq algorithm performs more conservatively than edgeR and NBPSeq. With regard to testing of various experimental designs, this work strongly suggests that greater power is gained through the use of biological replicates relative to library (technical) replicates and sequencing depth. Strikingly, sequencing depth could be reduced as low as 15% without substantial impacts on false positive or true positive rates. PMID:22985019
Robles, José A; Qureshi, Sumaira E; Stephen, Stuart J; Wilson, Susan R; Burden, Conrad J; Taylor, Jennifer M
2012-09-17
RNA sequencing (RNA-Seq) has emerged as a powerful approach for the detection of differential gene expression with both high-throughput and high resolution capabilities possible depending upon the experimental design chosen. Multiplex experimental designs are now readily available, these can be utilised to increase the numbers of samples or replicates profiled at the cost of decreased sequencing depth generated per sample. These strategies impact on the power of the approach to accurately identify differential expression. This study presents a detailed analysis of the power to detect differential expression in a range of scenarios including simulated null and differential expression distributions with varying numbers of biological or technical replicates, sequencing depths and analysis methods. Differential and non-differential expression datasets were simulated using a combination of negative binomial and exponential distributions derived from real RNA-Seq data. These datasets were used to evaluate the performance of three commonly used differential expression analysis algorithms and to quantify the changes in power with respect to true and false positive rates when simulating variations in sequencing depth, biological replication and multiplex experimental design choices. This work quantitatively explores comparisons between contemporary analysis tools and experimental design choices for the detection of differential expression using RNA-Seq. We found that the DESeq algorithm performs more conservatively than edgeR and NBPSeq. With regard to testing of various experimental designs, this work strongly suggests that greater power is gained through the use of biological replicates relative to library (technical) replicates and sequencing depth. Strikingly, sequencing depth could be reduced as low as 15% without substantial impacts on false positive or true positive rates.
Sequencing on the SOLiD 5500xl System - in-depth characterization of the GC bias.
Roeh, Simone; Weber, Peter; Rex-Haffner, Monika; Deussing, Jan M; Binder, Elisabeth B; Jakovcevski, Mira
2017-07-04
Different types of sequencing biases have been described and subsequently improved for a variety of sequencing systems, mostly focusing on the widely used Illumina systems. Similar studies are missing for the SOLiD 5500xl system, a sequencer which produced many data sets available to researchers today. Describing and understanding the bias is important to accurately interpret and integrate these published data in various ongoing research projects. We report a particularly strong GC bias for this sequencing system when analyzing a defined gDNA mix of 5 microbes with a wide range of different GC contents (20-72%) when comparing to the expected distribution and Illumina MiSeq data from the same DNA pool. Since we observed this bias already under PCR-free conditions, changing the PCR conditions during library preparation - a common strategy to handle bias in the Illumina system - was not relevant. Source of the bias appeared to be an uneven heat distribution during the SOLiD emulsion PCR (ePCR) - for enrichment of libraries prior loading - since ePCR in either small pouches or in 96-well plates improved the GC bias. Sequencing of chromatin immunoprecipitated DNA (ChIP-seq) is a common approach in epigenetics. ChIP-seq of the mixed source histone mark H3K9ac (acetyl Histone H3 lysine 9), typically found on promoter regions and on gene bodies, including CpG islands, performed on a SOLiD 5500xl machine, resulted in major loss of reads at GC rich loci (GC content ≥ 62%), not explained by low sequencing depth. This was improved with adaptations of the ePCR.
Zhang, ZhiZhuo; Chang, Cheng Wei; Hugo, Willy; Cheung, Edwin; Sung, Wing-Kin
2013-03-01
Although de novo motifs can be discovered through mining over-represented sequence patterns, this approach misses some real motifs and generates many false positives. To improve accuracy, one solution is to consider some additional binding features (i.e., position preference and sequence rank preference). This information is usually required from the user. This article presents a de novo motif discovery algorithm called SEME (sampling with expectation maximization for motif elicitation), which uses pure probabilistic mixture model to model the motif's binding features and uses expectation maximization (EM) algorithms to simultaneously learn the sequence motif, position, and sequence rank preferences without asking for any prior knowledge from the user. SEME is both efficient and accurate thanks to two important techniques: the variable motif length extension and importance sampling. Using 75 large-scale synthetic datasets, 32 metazoan compendium benchmark datasets, and 164 chromatin immunoprecipitation sequencing (ChIP-Seq) libraries, we demonstrated the superior performance of SEME over existing programs in finding transcription factor (TF) binding sites. SEME is further applied to a more difficult problem of finding the co-regulated TF (coTF) motifs in 15 ChIP-Seq libraries. It identified significantly more correct coTF motifs and, at the same time, predicted coTF motifs with better matching to the known motifs. Finally, we show that the learned position and sequence rank preferences of each coTF reveals potential interaction mechanisms between the primary TF and the coTF within these sites. Some of these findings were further validated by the ChIP-Seq experiments of the coTFs. The application is available online.
NASA Astrophysics Data System (ADS)
Cai, Lei; Yuan, Wei; Zhang, Zhou; He, Lin; Chou, Kuo-Chen
2016-11-01
Four popular somatic single nucleotide variant (SNV) calling methods (Varscan, SomaticSniper, Strelka and MuTect2) were carefully evaluated on the real whole exome sequencing (WES, depth of ~50X) and ultra-deep targeted sequencing (UDT-Seq, depth of ~370X) data. The four tools returned poor consensus on candidates (only 20% of calls were with multiple hits by the callers). For both WES and UDT-Seq, MuTect2 and Strelka obtained the largest proportion of COSMIC entries as well as the lowest rate of dbSNP presence and high-alternative-alleles-in-control calls, demonstrating their superior sensitivity and accuracy. Combining different callers does increase reliability of candidates, but narrows the list down to very limited range of tumor read depth and variant allele frequency. Calling SNV on UDT-Seq data, which were of much higher read-depth, discovered additional true-positive variations, despite an even more tremendous growth in false positive predictions. Our findings not only provide valuable benchmark for state-of-the-art SNV calling methods, but also shed light on the access to more accurate SNV identification in the future.
De novo transcriptome assembly databases for the butterfly orchid Phalaenopsis equestris
Niu, Shan-Ce; Xu, Qing; Zhang, Guo-Qiang; Zhang, Yong-Qiang; Tsai, Wen-Chieh; Hsu, Jui-Ling; Liang, Chieh-Kai; Luo, Yi-Bo; Liu, Zhong-Jian
2016-01-01
Orchids are renowned for their spectacular flowers and ecological adaptations. After the sequencing of the genome of the tropical epiphytic orchid Phalaenopsis equestris, we combined Illumina HiSeq2000 for RNA-Seq and Trinity for de novo assembly to characterize the transcriptomes for 11 diverse P. equestris tissues representing the root, stem, leaf, flower buds, column, lip, petal, sepal and three developmental stages of seeds. Our aims were to contribute to a better understanding of the molecular mechanisms driving the analysed tissue characteristics and to enrich the available data for P. equestris. Here, we present three databases. The first dataset is the RNA-Seq raw reads, which can be used to execute new experiments with different analysis approaches. The other two datasets allow different types of searches for candidate homologues. The second dataset includes the sets of assembled unigenes and predicted coding sequences and proteins, enabling a sequence-based search. The third dataset consists of the annotation results of the aligned unigenes versus the Nonredundant (Nr) protein database, Kyoto Encyclopaedia of Genes and Genomes (KEGG) and Clusters of Orthologous Groups (COG) databases with low e-values, enabling a name-based search. PMID:27673730
Pyicos: a versatile toolkit for the analysis of high-throughput sequencing data
Althammer, Sonja; González-Vallinas, Juan; Ballaré, Cecilia; Beato, Miguel; Eyras, Eduardo
2011-01-01
Motivation: High-throughput sequencing (HTS) has revolutionized gene regulation studies and is now fundamental for the detection of protein–DNA and protein–RNA binding, as well as for measuring RNA expression. With increasing variety and sequencing depth of HTS datasets, the need for more flexible and memory-efficient tools to analyse them is growing. Results: We describe Pyicos, a powerful toolkit for the analysis of mapped reads from diverse HTS experiments: ChIP-Seq, either punctuated or broad signals, CLIP-Seq and RNA-Seq. We prove the effectiveness of Pyicos to select for significant signals and show that its accuracy is comparable and sometimes superior to that of methods specifically designed for each particular type of experiment. Pyicos facilitates the analysis of a variety of HTS datatypes through its flexibility and memory efficiency, providing a useful framework for data integration into models of regulatory genomics. Availability: Open-source software, with tutorials and protocol files, is available at http://regulatorygenomics.upf.edu/pyicos or as a Galaxy server at http://regulatorygenomics.upf.edu/galaxy Contact: eduardo.eyras@upf.edu Supplementary Information: Supplementary data are available at Bioinformatics online. PMID:21994224
Protospacer Adjacent Motif (PAM)-Distal Sequences Engage CRISPR Cas9 DNA Target Cleavage
Ethier, Sylvain; Schmeing, T. Martin; Dostie, Josée; Pelletier, Jerry
2014-01-01
The clustered regularly interspaced short palindromic repeat (CRISPR)-associated enzyme Cas9 is an RNA-guided nuclease that has been widely adapted for genome editing in eukaryotic cells. However, the in vivo target specificity of Cas9 is poorly understood and most studies rely on in silico predictions to define the potential off-target editing spectrum. Using chromatin immunoprecipitation followed by sequencing (ChIP-seq), we delineate the genome-wide binding panorama of catalytically inactive Cas9 directed by two different single guide (sg) RNAs targeting the Trp53 locus. Cas9:sgRNA complexes are able to load onto multiple sites with short seed regions adjacent to 5′NGG3′ protospacer adjacent motifs (PAM). Yet among 43 ChIP-seq sites harboring seed regions analyzed for mutational status, we find editing only at the intended on-target locus and one off-target site. In vitro analysis of target site recognition revealed that interactions between the 5′ end of the guide and PAM-distal target sequences are necessary to efficiently engage Cas9 nucleolytic activity, providing an explanation for why off-target editing is significantly lower than expected from ChIP-seq data. PMID:25275497
Vukmirovic, Milica; Herazo-Maya, Jose D; Blackmon, John; Skodric-Trifunovic, Vesna; Jovanovic, Dragana; Pavlovic, Sonja; Stojsic, Jelena; Zeljkovic, Vesna; Yan, Xiting; Homer, Robert; Stefanovic, Branko; Kaminski, Naftali
2017-01-12
Idiopathic Pulmonary Fibrosis (IPF) is a lethal lung disease of unknown etiology. A major limitation in transcriptomic profiling of lung tissue in IPF has been a dependence on snap-frozen fresh tissues (FF). In this project we sought to determine whether genome scale transcript profiling using RNA Sequencing (RNA-Seq) could be applied to archived Formalin-Fixed Paraffin-Embedded (FFPE) IPF tissues. We isolated total RNA from 7 IPF and 5 control FFPE lung tissues and performed 50 base pair paired-end sequencing on Illumina 2000 HiSeq. TopHat2 was used to map sequencing reads to the human genome. On average ~62 million reads (53.4% of ~116 million reads) were mapped per sample. 4,131 genes were differentially expressed between IPF and controls (1,920 increased and 2,211 decreased (FDR < 0.05). We compared our results to differentially expressed genes calculated from a previously published dataset generated from FF tissues analyzed on Agilent microarrays (GSE47460). The overlap of differentially expressed genes was very high (760 increased and 1,413 decreased, FDR < 0.05). Only 92 differentially expressed genes changed in opposite directions. Pathway enrichment analysis performed using MetaCore confirmed numerous IPF relevant genes and pathways including extracellular remodeling, TGF-beta, and WNT. Gene network analysis of MMP7, a highly differentially expressed gene in both datasets, revealed the same canonical pathways and gene network candidates in RNA-Seq and microarray data. For validation by NanoString nCounter® we selected 35 genes that had a fold change of 2 in at least one dataset (10 discordant, 10 significantly differentially expressed in one dataset only and 15 concordant genes). High concordance of fold change and FDR was observed for each type of the samples (FF vs FFPE) with both microarrays (r = 0.92) and RNA-Seq (r = 0.90) and the number of discordant genes was reduced to four. Our results demonstrate that RNA sequencing of RNA obtained from archived FFPE lung tissues is feasible. The results obtained from FFPE tissue are highly comparable to FF tissues. The ability to perform RNA-Seq on archived FFPE IPF tissues should greatly enhance the availability of tissue biopsies for research in IPF.
Michel, Audrey M; Kiniry, Stephen J; O’Connor, Patrick B F; Mullan, James P
2018-01-01
Abstract The GWIPS-viz browser (http://gwips.ucc.ie/) is an on-line genome browser which is tailored for exploring ribosome profiling (Ribo-seq) data. Since its publication in 2014, GWIPS-viz provides Ribo-seq data for an additional 14 genomes bringing the current total to 23. The integration of new Ribo-seq data has been automated thereby increasing the number of available tracks to 1792, a 10-fold increase in the last three years. The increase is particularly substantial for data derived from human sources. Following user requests, we added the functionality to download these tracks in bigWig format. We also incorporated new types of data (e.g. TCP-seq) as well as auxiliary tracks from other sources that help with the interpretation of Ribo-seq data. Improvements in the visualization of the data have been carried out particularly for bacterial genomes where the Ribo-seq data are now shown in a strand specific manner. For higher eukaryotic datasets, we provide characteristics of individual datasets using the RUST program which includes the triplet periodicity, sequencing biases and relative inferred A-site dwell times. This information can be used for assessing the quality of Ribo-seq datasets. To improve the power of the signal, we aggregate Ribo-seq data from several studies into Global aggregate tracks for each genome. PMID:28977460
Chen, Ziyi; Quan, Lijun; Huang, Anfei; Zhao, Qiang; Yuan, Yao; Yuan, Xuye; Shen, Qin; Shang, Jingzhe; Ben, Yinyin; Qin, F Xiao-Feng; Wu, Aiping
2018-01-01
The RNA sequencing approach has been broadly used to provide gene-, pathway-, and network-centric analyses for various cell and tissue samples. However, thus far, rich cellular information carried in tissue samples has not been thoroughly characterized from RNA-Seq data. Therefore, it would expand our horizons to better understand the biological processes of the body by incorporating a cell-centric view of tissue transcriptome. Here, a computational model named seq-ImmuCC was developed to infer the relative proportions of 10 major immune cells in mouse tissues from RNA-Seq data. The performance of seq-ImmuCC was evaluated among multiple computational algorithms, transcriptional platforms, and simulated and experimental datasets. The test results showed its stable performance and superb consistency with experimental observations under different conditions. With seq-ImmuCC, we generated the comprehensive landscape of immune cell compositions in 27 normal mouse tissues and extracted the distinct signatures of immune cell proportion among various tissue types. Furthermore, we quantitatively characterized and compared 18 different types of mouse tumor tissues of distinct cell origins with their immune cell compositions, which provided a comprehensive and informative measurement for the immune microenvironment inside tumor tissues. The online server of seq-ImmuCC are freely available at http://wap-lab.org:3200/immune/.
Transformation and model choice for RNA-seq co-expression analysis.
Rau, Andrea; Maugis-Rabusseau, Cathy
2018-05-01
Although a large number of clustering algorithms have been proposed to identify groups of co-expressed genes from microarray data, the question of if and how such methods may be applied to RNA sequencing (RNA-seq) data remains unaddressed. In this work, we investigate the use of data transformations in conjunction with Gaussian mixture models for RNA-seq co-expression analyses, as well as a penalized model selection criterion to select both an appropriate transformation and number of clusters present in the data. This approach has the advantage of accounting for per-cluster correlation structures among samples, which can be strong in RNA-seq data. In addition, it provides a rigorous statistical framework for parameter estimation, an objective assessment of data transformations and number of clusters and the possibility of performing diagnostic checks on the quality and homogeneity of the identified clusters. We analyze four varied RNA-seq data sets to illustrate the use of transformations and model selection in conjunction with Gaussian mixture models. Finally, we propose a Bioconductor package coseq (co-expression of RNA-seq data) to facilitate implementation and visualization of the recommended RNA-seq co-expression analyses.
Exploring the single-cell RNA-seq analysis landscape with the scRNA-tools database.
Zappia, Luke; Phipson, Belinda; Oshlack, Alicia
2018-06-25
As single-cell RNA-sequencing (scRNA-seq) datasets have become more widespread the number of tools designed to analyse these data has dramatically increased. Navigating the vast sea of tools now available is becoming increasingly challenging for researchers. In order to better facilitate selection of appropriate analysis tools we have created the scRNA-tools database (www.scRNA-tools.org) to catalogue and curate analysis tools as they become available. Our database collects a range of information on each scRNA-seq analysis tool and categorises them according to the analysis tasks they perform. Exploration of this database gives insights into the areas of rapid development of analysis methods for scRNA-seq data. We see that many tools perform tasks specific to scRNA-seq analysis, particularly clustering and ordering of cells. We also find that the scRNA-seq community embraces an open-source and open-science approach, with most tools available under open-source licenses and preprints being extensively used as a means to describe methods. The scRNA-tools database provides a valuable resource for researchers embarking on scRNA-seq analysis and records the growth of the field over time.
RNA-Seq reveals complex genetic response to Deepwater Horizon oil release in Fundulus grandis.
Garcia, Tzintzuni I; Shen, Yingjia; Crawford, Douglas; Oleksiak, Marjorie F; Whitehead, Andrew; Walter, Ronald B
2012-09-12
The release of oil resulting from the blowout of the Deepwater Horizon (DH) drilling platform was one of the largest in history discharging more than 189 million gallons of oil and subject to widespread application of oil dispersants. This event impacted a wide range of ecological habitats with a complex mix of pollutants whose biological impact is still not yet fully understood. To better understand the effects on a vertebrate genome, we studied gene expression in the salt marsh minnow Fundulus grandis, which is local to the northern coast of the Gulf of Mexico and is a sister species of the ecotoxicological model Fundulus heteroclitus. To assess genomic changes, we quantified mRNA expression using high throughput sequencing technologies (RNA-Seq) in F. grandis populations in the marshes and estuaries impacted by DH oil release. This application of RNA-Seq to a non-model, wild, and ecologically significant organism is an important evaluation of the technology to quickly assess similar events in the future. Our de novo assembly of RNA-Seq data produced a large set of sequences which included many duplicates and fragments. In many cases several of these could be associated with a common reference sequence using blast to query a reference database. This reduced the set of significant genes to 1,070 down-regulated and 1,251 up-regulated genes. These genes indicate a broad and complex genomic response to DH oil exposure including the expected AHR-mediated response and CYP genes. In addition a response to hypoxic conditions and an immune response are also indicated. Several genes in the choriogenin family were down-regulated in the exposed group; a response that is consistent with AH exposure. These analyses are in agreement with oligonucleotide-based microarray analyses, and describe only a subset of significant genes with aberrant regulation in the exposed set. RNA-Seq may be successfully applied to feral and extremely polymorphic organisms that do not have an underlying genome sequence assembly to address timely environmental problems. Additionally, the observed changes in a large set of transcript expression levels are indicative of a complex response to the varied petroleum components to which the fish were exposed.
Linkage maps of the Atlantic salmon (Salmo salar) genome derived from RAD sequencing
2014-01-01
Background Genetic linkage maps are useful tools for mapping quantitative trait loci (QTL) influencing variation in traits of interest in a population. Genotyping-by-sequencing approaches such as Restriction-site Associated DNA sequencing (RAD-Seq) now enable the rapid discovery and genotyping of genome-wide SNP markers suitable for the development of dense SNP linkage maps, including in non-model organisms such as Atlantic salmon (Salmo salar). This paper describes the development and characterisation of a high density SNP linkage map based on SbfI RAD-Seq SNP markers from two Atlantic salmon reference families. Results Approximately 6,000 SNPs were assigned to 29 linkage groups, utilising markers from known genomic locations as anchors. Linkage maps were then constructed for the four mapping parents separately. Overall map lengths were comparable between male and female parents, but the distribution of the SNPs showed sex-specific patterns with a greater degree of clustering of sire-segregating SNPs to single chromosome regions. The maps were integrated with the Atlantic salmon draft reference genome contigs, allowing the unique assignment of ~4,000 contigs to a linkage group. 112 genome contigs mapped to two or more linkage groups, highlighting regions of putative homeology within the salmon genome. A comparative genomics analysis with the stickleback reference genome identified putative genes closely linked to approximately half of the ordered SNPs and demonstrated blocks of orthology between the Atlantic salmon and stickleback genomes. A subset of 47 RAD-Seq SNPs were successfully validated using a high-throughput genotyping assay, with a correspondence of 97% between the two assays. Conclusions This Atlantic salmon RAD-Seq linkage map is a resource for salmonid genomics research as genotyping-by-sequencing becomes increasingly common. This is aided by the integration of the SbfI RAD-Seq SNPs with existing reference maps and the draft reference genome, as well as the identification of putative genes proximal to the SNPs. Differences in the distribution of recombination events between the sexes is evident, and regions of homeology have been identified which are reflective of the recent salmonid whole genome duplication. PMID:24571138
Database Resources of the BIG Data Center in 2018
Xu, Xingjian; Hao, Lili; Zhu, Junwei; Tang, Bixia; Zhou, Qing; Song, Fuhai; Chen, Tingting; Zhang, Sisi; Dong, Lili; Lan, Li; Wang, Yanqing; Sang, Jian; Hao, Lili; Liang, Fang; Cao, Jiabao; Liu, Fang; Liu, Lin; Wang, Fan; Ma, Yingke; Xu, Xingjian; Zhang, Lijuan; Chen, Meili; Tian, Dongmei; Li, Cuiping; Dong, Lili; Du, Zhenglin; Yuan, Na; Zeng, Jingyao; Zhang, Zhewen; Wang, Jinyue; Shi, Shuo; Zhang, Yadong; Pan, Mengyu; Tang, Bixia; Zou, Dong; Song, Shuhui; Sang, Jian; Xia, Lin; Wang, Zhennan; Li, Man; Cao, Jiabao; Niu, Guangyi; Zhang, Yang; Sheng, Xin; Lu, Mingming; Wang, Qi; Xiao, Jingfa; Zou, Dong; Wang, Fan; Hao, Lili; Liang, Fang; Li, Mengwei; Sun, Shixiang; Zou, Dong; Li, Rujiao; Yu, Chunlei; Wang, Guangyu; Sang, Jian; Liu, Lin; Li, Mengwei; Li, Man; Niu, Guangyi; Cao, Jiabao; Sun, Shixiang; Xia, Lin; Yin, Hongyan; Zou, Dong; Xu, Xingjian; Ma, Lina; Chen, Huanxin; Sun, Yubin; Yu, Lei; Zhai, Shuang; Sun, Mingyuan; Zhang, Zhang; Zhao, Wenming; Xiao, Jingfa; Bao, Yiming; Song, Shuhui; Hao, Lili; Li, Rujiao; Ma, Lina; Sang, Jian; Wang, Yanqing; Tang, Bixia; Zou, Dong; Wang, Fan
2018-01-01
Abstract The BIG Data Center at Beijing Institute of Genomics (BIG) of the Chinese Academy of Sciences provides freely open access to a suite of database resources in support of worldwide research activities in both academia and industry. With the vast amounts of omics data generated at ever-greater scales and rates, the BIG Data Center is continually expanding, updating and enriching its core database resources through big-data integration and value-added curation, including BioCode (a repository archiving bioinformatics tool codes), BioProject (a biological project library), BioSample (a biological sample library), Genome Sequence Archive (GSA, a data repository for archiving raw sequence reads), Genome Warehouse (GWH, a centralized resource housing genome-scale data), Genome Variation Map (GVM, a public repository of genome variations), Gene Expression Nebulas (GEN, a database of gene expression profiles based on RNA-Seq data), Methylation Bank (MethBank, an integrated databank of DNA methylomes), and Science Wikis (a series of biological knowledge wikis for community annotations). In addition, three featured web services are provided, viz., BIG Search (search as a service; a scalable inter-domain text search engine), BIG SSO (single sign-on as a service; a user access control system to gain access to multiple independent systems with a single ID and password) and Gsub (submission as a service; a unified submission service for all relevant resources). All of these resources are publicly accessible through the home page of the BIG Data Center at http://bigd.big.ac.cn. PMID:29036542
Genome Sequence of Stachybotrys chartarum Strain 51-11.
Betancourt, Doris A; Dean, Timothy R; Kim, Jean; Levy, Josh
2015-10-01
The Stachybotrys chartarum strain 51-11 genome was sequenced by shotgun sequencing utilizing Illumina HiSeq 2000 and PacBio technologies. Since S. chartarum has been implicated as having health impacts within water-damaged buildings, any information extracted from the genomic sequence data relating to toxins or the metabolism of the fungus might be useful. Copyright © 2015 Betancourt et al.
Motor programming when sequencing multiple elements of the same duration.
Magnuson, Curt E; Robin, Donald A; Wright, David L
2008-11-01
Motor programming at the self-select paradigm was adopted in 2 experiments to examine the processing demands of independent processes. One process (INT) is responsible for organizing the internal features of the individual elements in a movement (e.g., response duration). The 2nd process (SEQ) is responsible for placing the elements into the proper serial order before execution. Participants in Experiment 1 performed tasks involving 1 key press or sequences of 4 key presses of the same duration. Implementing INT and SEQ was more time consuming for key-pressing sequences than for single key-press tasks. Experiment 2 examined whether the INT costs resulting from the increase in sequence length observed in Experiment 1 resulted from independent planning of each sequence element or via a separate "multiplier" process that handled repetitions of elements of the same duration. Findings from Experiment 2, in which participants performed single key presses or double or triple key sequences of the same duration, suggested that INT is involved with the independent organization of each element contained in the sequence. Researchers offer an elaboration of the 2-process account of motor programming to incorporate the present findings and the findings from other recent sequence-learning research.
Primer and platform effects on 16S rRNA tag sequencing
DOE Office of Scientific and Technical Information (OSTI.GOV)
Tremblay, Julien; Singh, Kanwar; Fern, Alison
Sequencing of 16S rRNA gene tags is a popular method for profiling and comparing microbial communities. The protocols and methods used, however, vary considerably with regard to amplification primers, sequencing primers, sequencing technologies; as well as quality filtering and clustering. How results are affected by these choices, and whether data produced with different protocols can be meaningfully compared, is often unknown. Here we compare results obtained using three different amplification primer sets (targeting V4, V6–V8, and V7–V8) and two sequencing technologies (454 pyrosequencing and Illumina MiSeq) using DNA from a mock community containing a known number of species as wellmore » as complex environmental samples whose PCR-independent profiles were estimated using shotgun sequencing. We find that paired-end MiSeq reads produce higher quality data and enabled the use of more aggressive quality control parameters over 454, resulting in a higher retention rate of high quality reads for downstream data analysis. While primer choice considerably influences quantitative abundance estimations, sequencing platform has relatively minor effects when matched primers are used. In conclusion, beta diversity metrics are surprisingly robust to both primer and sequencing platform biases.« less
Primer and platform effects on 16S rRNA tag sequencing
Tremblay, Julien; Singh, Kanwar; Fern, Alison; ...
2015-08-04
Sequencing of 16S rRNA gene tags is a popular method for profiling and comparing microbial communities. The protocols and methods used, however, vary considerably with regard to amplification primers, sequencing primers, sequencing technologies; as well as quality filtering and clustering. How results are affected by these choices, and whether data produced with different protocols can be meaningfully compared, is often unknown. Here we compare results obtained using three different amplification primer sets (targeting V4, V6–V8, and V7–V8) and two sequencing technologies (454 pyrosequencing and Illumina MiSeq) using DNA from a mock community containing a known number of species as wellmore » as complex environmental samples whose PCR-independent profiles were estimated using shotgun sequencing. We find that paired-end MiSeq reads produce higher quality data and enabled the use of more aggressive quality control parameters over 454, resulting in a higher retention rate of high quality reads for downstream data analysis. While primer choice considerably influences quantitative abundance estimations, sequencing platform has relatively minor effects when matched primers are used. In conclusion, beta diversity metrics are surprisingly robust to both primer and sequencing platform biases.« less
Linnorm: improved statistical analysis for single cell RNA-seq expression data
Yip, Shun H.; Wang, Panwen; Kocher, Jean-Pierre A.; Sham, Pak Chung
2017-01-01
Abstract Linnorm is a novel normalization and transformation method for the analysis of single cell RNA sequencing (scRNA-seq) data. Linnorm is developed to remove technical noises and simultaneously preserve biological variations in scRNA-seq data, such that existing statistical methods can be improved. Using real scRNA-seq data, we compared Linnorm with existing normalization methods, including NODES, SAMstrt, SCnorm, scran, DESeq and TMM. Linnorm shows advantages in speed, technical noise removal and preservation of cell heterogeneity, which can improve existing methods in the discovery of novel subtypes, pseudo-temporal ordering of cells, clustering analysis, etc. Linnorm also performs better than existing DEG analysis methods, including BASiCS, NODES, SAMstrt, Seurat and DESeq2, in false positive rate control and accuracy. PMID:28981748
ATACseqQC: a Bioconductor package for post-alignment quality assessment of ATAC-seq data.
Ou, Jianhong; Liu, Haibo; Yu, Jun; Kelliher, Michelle A; Castilla, Lucio H; Lawson, Nathan D; Zhu, Lihua Julie
2018-03-01
ATAC-seq (Assays for Transposase-Accessible Chromatin using sequencing) is a recently developed technique for genome-wide analysis of chromatin accessibility. Compared to earlier methods for assaying chromatin accessibility, ATAC-seq is faster and easier to perform, does not require cross-linking, has higher signal to noise ratio, and can be performed on small cell numbers. However, to ensure a successful ATAC-seq experiment, step-by-step quality assurance processes, including both wet lab quality control and in silico quality assessment, are essential. While several tools have been developed or adopted for assessing read quality, identifying nucleosome occupancy and accessible regions from ATAC-seq data, none of the tools provide a comprehensive set of functionalities for preprocessing and quality assessment of aligned ATAC-seq datasets. We have developed a Bioconductor package, ATACseqQC, for easily generating various diagnostic plots to help researchers quickly assess the quality of their ATAC-seq data. In addition, this package contains functions to preprocess aligned ATAC-seq data for subsequent peak calling. Here we demonstrate the utilities of our package using 25 publicly available ATAC-seq datasets from four studies. We also provide guidelines on what the diagnostic plots should look like for an ideal ATAC-seq dataset. This software package has been used successfully for preprocessing and assessing several in-house and public ATAC-seq datasets. Diagnostic plots generated by this package will facilitate the quality assessment of ATAC-seq data, and help researchers to evaluate their own ATAC-seq experiments as well as select high-quality ATAC-seq datasets from public repositories such as GEO to avoid generating hypotheses or drawing conclusions from low-quality ATAC-seq experiments. The software, source code, and documentation are freely available as a Bioconductor package at https://bioconductor.org/packages/release/bioc/html/ATACseqQC.html .
Nguyen, Lam S; Kim, Hyung-Goo; Rosenfeld, Jill A; Shen, Yiping; Gusella, James F; Lacassie, Yves; Layman, Lawrence C; Shaffer, Lisa G; Gécz, Jozef
2013-05-01
The nonsense-mediated mRNA decay (NMD) pathway functions not only to degrade transcripts containing premature termination codons (PTC), but also to regulate the transcriptome. UPF3B and RBM8A, important components of NMD, have been implicated in various forms of intellectual disability (ID) and Thrombocytopenia with Absent Radius (TAR) syndrome, which is also associated with ID. To gauge the contribution of other NMD factors to ID, we performed a comprehensive search for copy number variants (CNVs) of 18 NMD genes among individuals with ID and/or congenital anomalies. We identified 11 cases with heterozygous deletions of the genomic region encompassing UPF2, which encodes for a direct interacting protein of UPF3B. Using RNA-Seq, we showed that the genome-wide consequence of reduced expression of UPF2 is similar to that seen in patients with UPF3B mutations. Out of the 1009 genes found deregulated in patients with UPF2 deletions by at least 2-fold, majority (95%) were deregulated similarly in patients with UPF3B mutations. This supports the major role of deletion of UPF2 in ID. Furthermore, we found that four other NMD genes, UPF3A, SMG6, EIF4A3 and RNPS1 are frequently deleted and/or duplicated in the patients. We postulate that dosage imbalances of these NMD genes are likely to be the causes or act as predisposing factors for neuro-developmental disorders. Our findings further emphasize the importance of NMD pathway(s) in learning and memory.
Comparison of ribosomal RNA removal methods for transcriptome sequencing workflows in teleost fish
USDA-ARS?s Scientific Manuscript database
RNA sequencing (RNA-Seq) is becoming the standard for transcriptome analysis. Removal of contaminating ribosomal RNA (rRNA) is a priority in the preparation of libraries suitable for sequencing. rRNAs are commonly removed from total RNA via either mRNA selection or rRNA depletion. These methods have...
Complete genome sequence of ‘Candidatus Liberibacter africanus’
USDA-ARS?s Scientific Manuscript database
The complete genome sequence of ‘Candidatus Liberibacter africanus’ (Laf), strain ptsapsy, was obtained by an Illumina HiSeq 2000. The Laf genome comprises 1,192,232 nucleotides, 34.5% GC content, 1,141 predicted coding sequences, 44 tRNAs, 3 complete copies of ribosomal RNA genes (16S, 23S and 5S) ...