Science.gov

Sample records for deep short-read sequencing

  1. Unlocking Short Read Sequencing for Metagenomics

    DOE PAGESBeta

    Rodrigue, Sébastien; Materna, Arne C.; Timberlake, Sonia C.; Blackburn, Matthew C.; Malmstrom, Rex R.; Alm, Eric J.; Chisholm, Sallie W.; Gilbert, Jack Anthony

    2010-07-28

    We describe an experimental and computational pipeline yielding millions of reads that can exceed 200 bp with quality scores approaching that of traditional Sanger sequencing. The method combines an automatable gel-less library construction step with paired-end sequencing on a short-read instrument. With appropriately sized library inserts, mate-pair sequences can overlap, and we describe the SHERA software package that joins them to form a longer composite read.
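
    The idea of joining overlapping paired-end reads into one composite read can be sketched as follows. This is only an illustrative toy, not the SHERA algorithm itself: the function names, the `min_overlap` and `max_mismatch_frac` parameters, and the longest-overlap-first search are all assumptions made for the example, and base quality scores are ignored.

```python
def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    comp = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}
    return "".join(comp[base] for base in reversed(seq))


def merge_pair(read1, read2, min_overlap=10, max_mismatch_frac=0.1):
    """Join read1 with the reverse complement of read2 if their ends overlap.

    Hypothetical helper: tries the longest candidate overlap first and
    accepts the first one whose mismatch fraction is small enough.
    Returns the composite read, or None when no acceptable overlap exists.
    """
    mate = reverse_complement(read2)
    longest = min(len(read1), len(mate))
    for olen in range(longest, min_overlap - 1, -1):
        tail, head = read1[-olen:], mate[:olen]
        mismatches = sum(a != b for a, b in zip(tail, head))
        if mismatches <= max_mismatch_frac * olen:
            return read1 + mate[olen:]
    return None


# A 60 bp fragment sequenced from both ends with 40 bp reads (20 bp overlap).
fragment = "ACGTTGCAAGCTTAGGCTAACCGTTAGCATGCCATTGACA" + "CGTTCGGTCCTGTGCTTCGG"
r1 = fragment[:40]
r2 = reverse_complement(fragment[-40:])
print(merge_pair(r1, r2, max_mismatch_frac=0.0) == fragment)  # True
```

    A production tool would also weigh base qualities when resolving disagreements inside the overlap; the sketch simply keeps the bases of the first read.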

  2. Unlocking Short Read Sequencing for Metagenomics.

    SciTech Connect

    Rodrigue, S; Materna, A C; Timberlake, S C; Blackburn, M C; Malmstrom, R R; Alm, E J; Chisholm, S W

    2010-01-01

    We describe an experimental and computational pipeline yielding millions of reads that can exceed 200 bp with quality scores approaching that of traditional Sanger sequencing. The method combines an automatable gel-less library construction step with paired-end sequencing on a short-read instrument. With appropriately sized library inserts, mate-pair sequences can overlap, and we describe the SHERA software package that joins them to form a longer composite read.

  3. Unlocking Short Read Sequencing for Metagenomics

    PubMed Central

    Timberlake, Sonia C.; Blackburn, Matthew C.; Malmstrom, Rex R.; Alm, Eric J.; Chisholm, Sallie W.

    2010-01-01

    Background: Different high-throughput nucleic acid sequencing platforms are currently available, but a trade-off exists between the cost and number of reads that can be generated versus the read length that can be achieved. Methodology/Principal Findings: We describe an experimental and computational pipeline yielding millions of reads that can exceed 200 bp with quality scores approaching that of traditional Sanger sequencing. The method combines an automatable gel-less library construction step with paired-end sequencing on a short-read instrument. With appropriately sized library inserts, mate-pair sequences can overlap, and we describe the SHERA software package that joins them to form a longer composite read. Conclusions/Significance: This strategy is broadly applicable to sequencing applications that benefit from low-cost high-throughput sequencing, but require longer read lengths. We demonstrate that our approach enables metagenomic analyses using the Illumina Genome Analyzer, with low error rates, and at a fraction of the cost of pyrosequencing. PMID:20676378

  4. Fast Search of Thousands of Short-Read Sequencing Experiments

    PubMed Central

    Solomon, Brad; Kingsford, Carl

    2015-01-01

    We introduce Sequence Bloom Trees, a method for querying thousands of short-read sequencing experiments by sequence 485 times faster than existing approaches. The approach searches large data archives for all experiments that involve a given sequence. We use Sequence Bloom Trees to search 2652 human blood, breast, and brain RNA-seq experiments for all 214,293 known transcripts in under 4 days using less than 239 MB of RAM and a single CPU. PMID:26854477
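
    A much-simplified sketch of the tree-of-Bloom-filters idea follows. It is not the authors' implementation: the class and function names, the filter size, the SHA-256 based hashing, and the threshold `theta` (fraction of query k-mers that must be present) are assumptions chosen for illustration. Each leaf stores the k-mers of one experiment, each internal node stores the union of its children, and a query descends only into subtrees whose filter contains enough of the query's k-mers.

```python
import hashlib


class Bloom:
    """Tiny Bloom filter backed by a Python integer used as a bit set."""

    def __init__(self, m=4096, h=3):
        self.m, self.h, self.bits = m, h, 0

    def _positions(self, item):
        for i in range(self.h):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, item):
        for p in self._positions(item):
            self.bits |= 1 << p

    def __contains__(self, item):
        return all((self.bits >> p) & 1 for p in self._positions(item))


def kmers(seq, k):
    """Set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


class Node:
    def __init__(self, name=None, left=None, right=None):
        self.name, self.left, self.right = name, left, right
        self.bloom = Bloom()


def leaf(name, reads, k):
    """Leaf node holding the k-mers of one experiment."""
    node = Node(name=name)
    for read in reads:
        for km in kmers(read, k):
            node.bloom.add(km)
    return node


def parent(left, right):
    """Internal node whose filter is the bitwise union of its children."""
    node = Node(left=left, right=right)
    node.bloom.bits = left.bloom.bits | right.bloom.bits
    return node


def query(node, query_kmers, theta=0.8):
    """Names of experiments containing at least theta of the query k-mers."""
    hits = sum(km in node.bloom for km in query_kmers)
    if hits < theta * len(query_kmers):
        return []  # prune: nothing below this node can match
    if node.name is not None:
        return [node.name]
    return query(node.left, query_kmers, theta) + query(node.right, query_kmers, theta)


root = parent(
    leaf("A", ["ACGTACGTTAGCCGATAGGCTTAACG"], 8),
    leaf("B", ["TTTTGGGGCCCCAAAATTTTGGGGCC"], 8),
)
print(query(root, kmers("CGTTAGCCGATAGG", 8)))
```

    Bloom filters can report false positives but never false negatives, so an experiment that truly contains the query is always returned; the pruning at internal nodes is what avoids scanning every experiment.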

  5. Deep short-read sequencing of chromosome 17 from the mouse strains A/J and CAST/Ei identifies significant germline variation and candidate genes that regulate liver triglyceride levels.

    PubMed

    Sudbery, Ian; Stalker, Jim; Simpson, Jared T; Keane, Thomas; Rust, Alistair G; Hurles, Matthew E; Walter, Klaudia; Lynch, Dee; Teboul, Lydia; Brown, Steve D; Li, Heng; Ning, Zemin; Nadeau, Joseph H; Croniger, Colleen M; Durbin, Richard; Adams, David J

    2009-01-01

    Genome sequences are essential tools for comparative and mutational analyses. Here we present the short read sequence of mouse chromosome 17 from the Mus musculus domesticus derived strain A/J, and the Mus musculus castaneus derived strain CAST/Ei. We describe approaches for the accurate identification of nucleotide and structural variation in the genomes of vertebrate experimental organisms, and show how these techniques can be applied to help prioritize candidate genes within quantitative trait loci. PMID:19825173

  6. Development and transferability of black and red raspberry microsatellite markers from short-read sequences

    Technology Transfer Automated Retrieval System (TEKTRAN)

    The advent of next-generation sequencing technologies has been a boon to the cost-effective development of molecular markers, particularly in non-model species. Here, we demonstrate the efficiency of microsatellite or simple sequence repeat (SSR) marker development from short-read sequences using th...

  7. An analysis of the feasibility of short read sequencing

    PubMed Central

    Whiteford, Nava; Haslam, Niall; Weber, Gerald; Prügel-Bennett, Adam; Essex, Jonathan W.; Roach, Peter L.; Bradley, Mark; Neylon, Cameron

    2005-01-01

    Several methods for ultra high-throughput DNA sequencing are currently under investigation. Many of these methods yield very short blocks of sequence information (reads). Here we report on an analysis showing the level of genome sequencing possible as a function of read length. It is shown that re-sequencing and de novo sequencing of the majority of a bacterial genome is possible with read lengths of 20–30 nt, and that reads of 50 nt can provide reconstructed contigs (a contiguous fragment of sequence data) of 1000 nt and greater that cover 80% of human chromosome 1. PMID:16275781
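
    One simple way to explore how read length limits re-sequencing is to ask what fraction of positions in a sequence start a substring that occurs only once. The sketch below is an assumption-laden illustration, not the analysis from the paper: the function name is hypothetical, only one strand is considered, and the random test sequence has none of the repeat structure of a real genome.

```python
import random
from collections import Counter


def unique_fraction(genome, read_len):
    """Fraction of start positions whose read_len-mer occurs exactly once."""
    counts = Counter(
        genome[i:i + read_len] for i in range(len(genome) - read_len + 1)
    )
    total = sum(counts.values())
    unique = sum(1 for c in counts.values() if c == 1)
    return unique / total


# Toy usage on a random 20 kb sequence (a stand-in, not real genomic data).
random.seed(0)
toy_genome = "".join(random.choice("ACGT") for _ in range(20000))
for length in (4, 8, 12, 16):
    print(length, round(unique_fraction(toy_genome, length), 3))
```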

  8. Short read sequencing for Genomic Analysis of the brown rot fungus Fibroporia radiculosa

    Technology Transfer Automated Retrieval System (TEKTRAN)

    The practical capability of short read sequencing for whole genome gene prediction was investigated for Fibroporia radiculosa, a copper-tolerant basidiomycete fungus that causes brown rot decay of wood. Illumina GAIIX reads from a single run of a paired-end library (75 nt read length, 300 bp insert...

  9. Whole-genome sequencing and assembly with high-throughput, short-read technologies.

    PubMed

    Sundquist, Andreas; Ronaghi, Mostafa; Tang, Haixu; Pevzner, Pavel; Batzoglou, Serafim

    2007-01-01

    While recently developed short-read sequencing technologies may dramatically reduce the sequencing cost and eventually achieve the $1000 goal for re-sequencing, their limitations prevent the de novo sequencing of eukaryotic genomes with the standard shotgun sequencing protocol. We present SHRAP (SHort Read Assembly Protocol), a sequencing protocol and assembly methodology that utilizes high-throughput short-read technologies. We describe a variation on hierarchical sequencing with two crucial differences: (1) we select a clone library from the genome randomly rather than as a tiling path and (2) we sample clones from the genome at high coverage and reads from the clones at low coverage. We assume that 200 bp read lengths with a 1% error rate and inexpensive random fragment cloning on whole mammalian genomes is feasible. Our assembly methodology is based on first ordering the clones and subsequently performing read assembly in three stages: (1) local assemblies of regions significantly smaller than a clone size, (2) clone-sized assemblies of the results of stage 1, and (3) chromosome-sized assemblies. By aggressively localizing the assembly problem during the first stage, our method succeeds in assembling short, unpaired reads sampled from repetitive genomes. We tested our assembler using simulated reads from D. melanogaster and human chromosomes 1, 11, and 21, and produced assemblies with large sets of contiguous sequence and a misassembly rate comparable to other draft assemblies. Tested on D. melanogaster and the entire human genome, our clone-ordering method produces accurate maps, thereby localizing fragment assembly and enabling the parallelization of the subsequent steps of our pipeline. Thus, we have demonstrated that truly inexpensive de novo sequencing of mammalian genomes will soon be possible with high-throughput, short-read technologies using our methodology. PMID:17534434

  10. Identifying wrong assemblies in de novo short read primary sequence assembly contigs.

    PubMed

    Chawla, Vandna; Kumar, Rajnish; Shankar, Ravi

    2016-09-01

    With the advent of short-read-based genome sequencing approaches, a large number of organisms are being sequenced all over the world. Most of these assemblies are done using de novo short-read assemblers and other related approaches. However, the contigs produced this way are prone to wrong assembly. So far, there is a conspicuous dearth of reliable tools to identify mis-assembled contigs. Mis-assemblies could result from incorrectly deleted or wrongly arranged genomic sequences. In the present work, various factors related to sequence, sequencing and assembling have been assessed for their role in causing mis-assembly by using different genome sequencing data. Finally, some mis-assembly detecting tools have been evaluated for their ability to detect the wrongly assembled primary contigs, suggesting a lot of scope for improvement in this area. The present work also proposes a simple, novel unsupervised learning-based approach to identify mis-assemblies in the contigs, which was found to perform reasonably well when compared to the existing tools for reporting mis-assembled contigs. It was observed that the proposed methodology may work as a complementary system to the existing tools to enhance their accuracy. PMID:27581937

  11. Assembled sequence contigs by SOAPdenovo and Velvet algorithms from metagenomic short reads of a new bacterial isolate of gut origin

    Technology Transfer Automated Retrieval System (TEKTRAN)

    Assembled sequence contigs by SOAPdenovo and Velvet algorithms from metagenomic short reads of a new bacterial isolate of gut origin. This study included 2 submissions with a total of 9.8 million bp of assembled contigs....

  12. Haplotype-Phased Synthetic Long Reads from Short-Read Sequencing.

    PubMed

    Stapleton, James A; Kim, Jeongwoon; Hamilton, John P; Wu, Ming; Irber, Luiz C; Maddamsetti, Rohan; Briney, Bryan; Newton, Linsey; Burton, Dennis R; Brown, C Titus; Chan, Christina; Buell, C Robin; Whitehead, Timothy A

    2016-01-01

    Next-generation DNA sequencing has revolutionized the study of biology. However, the short read lengths of the dominant instruments complicate assembly of complex genomes and haplotype phasing of mixtures of similar sequences. Here we demonstrate a method to reconstruct the sequences of individual nucleic acid molecules up to 11.6 kilobases in length from short (150-bp) reads. We show that our method can construct 99.97%-accurate synthetic reads from bacterial, plant, and animal genomic samples, full-length mRNA sequences from human cancer cell lines, and individual HIV env gene variants from a mixture. The preparation of multiple samples can be multiplexed into a single tube, further reducing effort and cost relative to competing approaches. Our approach generates sequencing libraries in three days from less than one microgram of DNA in a single-tube format without custom equipment or specialized expertise. PMID:26789840

  13. Haplotype-Phased Synthetic Long Reads from Short-Read Sequencing

    PubMed Central

    Stapleton, James A.; Kim, Jeongwoon; Hamilton, John P.; Wu, Ming; Irber, Luiz C.; Maddamsetti, Rohan; Briney, Bryan; Newton, Linsey; Burton, Dennis R.; Brown, C. Titus; Chan, Christina; Buell, C. Robin; Whitehead, Timothy A.

    2016-01-01

    Next-generation DNA sequencing has revolutionized the study of biology. However, the short read lengths of the dominant instruments complicate assembly of complex genomes and haplotype phasing of mixtures of similar sequences. Here we demonstrate a method to reconstruct the sequences of individual nucleic acid molecules up to 11.6 kilobases in length from short (150-bp) reads. We show that our method can construct 99.97%-accurate synthetic reads from bacterial, plant, and animal genomic samples, full-length mRNA sequences from human cancer cell lines, and individual HIV env gene variants from a mixture. The preparation of multiple samples can be multiplexed into a single tube, further reducing effort and cost relative to competing approaches. Our approach generates sequencing libraries in three days from less than one microgram of DNA in a single-tube format without custom equipment or specialized expertise. PMID:26789840

  14. Stacks: Building and Genotyping Loci De Novo From Short-Read Sequences

    PubMed Central

    Catchen, Julian M.; Amores, Angel; Hohenlohe, Paul; Cresko, William; Postlethwait, John H.

    2011-01-01

    Advances in sequencing technology provide special opportunities for genotyping individuals with speed and thrift, but the lack of software to automate the calling of tens of thousands of genotypes over hundreds of individuals has hindered progress. Stacks is a software system that uses short-read sequence data to identify and genotype loci in a set of individuals either de novo or by comparison to a reference genome. From reduced representation Illumina sequence data, such as RAD-tags, Stacks can recover thousands of single nucleotide polymorphism (SNP) markers useful for the genetic analysis of crosses or populations. Stacks can generate markers for ultra-dense genetic linkage maps, facilitate the examination of population phylogeography, and help in reference genome assembly. We report here the algorithms implemented in Stacks and demonstrate their efficacy by constructing loci from simulated RAD-tags taken from the stickleback reference genome and by recapitulating and improving a genetic map of the zebrafish, Danio rerio. PMID:22384329

  15. The effect of strand bias in Illumina short-read sequencing data

    PubMed Central

    2012-01-01

    Background: When using Illumina high-throughput short-read data, the genotypes inferred from the positive strand and the negative strand are sometimes significantly different, with one homozygous and the other heterozygous. This phenomenon is known as strand bias. In this study, we used Illumina short-read sequencing data to evaluate the effect of strand bias on genotyping quality, and to explore the possible causes of strand bias. Results: We collected 22 breast cancer samples from 22 patients and sequenced their exomes using the Illumina GAIIx machine. By comparing the consistency between the genotypes inferred from this sequencing data with the genotypes inferred from SNP chip data, we found that, when using sequencing data, SNPs with extreme strand bias did not have significantly lower consistency rates compared to SNPs with low or no strand bias. However, this result may be limited by the small subset of SNPs present in both the exome sequencing and the SNP chip data. We further compared the transition and transversion ratio and the number of novel non-synonymous SNPs between the SNPs with low or no strand bias and those with extreme strand bias, and found that SNPs with low or no strand bias have better overall quality. We also discovered that the strand bias occurs randomly at genomic positions across these samples, and observed no consistent pattern of strand bias location across samples. By comparing results from two different aligners, BWA and Bowtie, we found very consistent strand bias patterns. Thus strand bias is unlikely to be caused by alignment artifacts. We successfully replicated our results using two additional independent datasets with different capturing methods and Illumina sequencers. Conclusion: Extreme strand bias indicates a potential high false-positive rate for SNPs. PMID:23176052
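
    One common way to quantify strand bias at a candidate SNP is a Fisher's exact test on the 2x2 table of allele counts by strand. Whether this matches the scores used in the study is not stated in the abstract, so the sketch below is an assumption; the function name and the example counts are hypothetical.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact p-value for the table [[a, b], [c, d]].

    Here a, b are reference/alternate allele counts on the forward strand
    and c, d the same counts on the reverse strand.
    """
    row1, row2, col1 = a + b, c + d, a + c
    n = row1 + row2

    def prob(x):
        # Hypergeometric probability of x reference reads on the forward strand.
        return comb(row1, x) * comb(row2, col1 - x) / comb(n, col1)

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum every table that is at most as likely as the observed one.
    return sum(
        prob(x) for x in range(lo, hi + 1) if prob(x) <= p_obs * (1 + 1e-9)
    )


# Balanced support on both strands versus an alternate allele seen on one strand only.
print(fisher_exact_two_sided(10, 10, 10, 10))
print(fisher_exact_two_sided(20, 0, 10, 10))
```

    A small p-value flags sites where the alternate allele is supported unevenly by the two strands; the threshold for calling the bias "extreme" is a separate choice.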

  16. Investigating bisulfite short-read mapping failure with hairpin bisulfite sequencing data

    PubMed Central

    2015-01-01

    Background: DNA methylation is an important epigenetic mark relevant to normal development and disease genesis. A common approach to characterizing genome-wide DNA methylation is using Next Generation Sequencing technology to sequence bisulfite-treated DNA. The short sequence reads are mapped to the reference genome to determine the methylation statuses of Cs. However, despite intense effort, a much smaller proportion of the reads derived from bisulfite-treated DNA (usually about 40-80%) can be mapped than in regular short-read mapping (>90%), and it is unclear what factors lead to this low mapping efficiency. Results: To address this issue, we used the hairpin bisulfite sequencing technology to determine sequences of both DNA double strands simultaneously. This enabled the recovery of the original non-bisulfite-converted sequences. We used Bismark for bisulfite read mapping and Bowtie2 for recovered read mapping. We found that recovering the reads improved unique mapping efficiency by 9-10% compared to the bisulfite reads. Such improvement in mapping efficiency is related to sequence entropy. Conclusions: The hairpin recovery technique improves mapping efficiency, and sequence entropy relates to mapping efficiency. PMID:26576456
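
    The link between bisulfite conversion and sequence entropy can be illustrated with a small calculation. The abstract does not define its entropy measure, so the sketch assumes plain Shannon entropy of base composition, and it models full conversion of every C to T (that is, no methylated cytosines); both function names are hypothetical.

```python
from collections import Counter
from math import log2


def shannon_entropy(seq):
    """Shannon entropy (bits per base) of the base composition of seq."""
    counts = Counter(seq)
    n = len(seq)
    return -sum(c / n * log2(c / n) for c in counts.values())


def bisulfite_convert(seq):
    """In-silico conversion assuming every C is unmethylated (C -> T)."""
    return seq.replace("C", "T")


read = "ACGT"
print(shannon_entropy(read))                     # 2.0
print(shannon_entropy(bisulfite_convert(read)))  # 1.5
```

    Collapsing C into T shrinks the effective alphabet, so converted reads carry less information per base, which is one intuitive reason they are harder to place uniquely.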

  17. AREM: Aligning Short Reads from ChIP-Sequencing by Expectation Maximization

    NASA Astrophysics Data System (ADS)

    Newkirk, Daniel; Biesinger, Jacob; Chon, Alvin; Yokomori, Kyoko; Xie, Xiaohui

    High-throughput sequencing coupled to chromatin immunoprecipitation (ChIP-Seq) is widely used in characterizing genome-wide binding patterns of transcription factors, cofactors, chromatin modifiers, and other DNA binding proteins. A key step in ChIP-Seq data analysis is to map short reads from high-throughput sequencing to a reference genome and identify peak regions enriched with short reads. Although several methods have been proposed for ChIP-Seq analysis, most existing methods only consider reads that can be uniquely placed in the reference genome, and therefore have low power for detecting peaks located within repeat sequences. Here we introduce a probabilistic approach for ChIP-Seq data analysis which utilizes all reads, providing a truly genome-wide view of binding patterns. Reads are modeled using a mixture model corresponding to K enriched regions and a null genomic background. We use maximum likelihood to estimate the locations of the enriched regions, and implement an expectation-maximization (E-M) algorithm, called AREM (aligning reads by expectation maximization), to update the alignment probabilities of each read to different genomic locations. We apply the algorithm to identify genome-wide binding events of two proteins: Rad21, a component of cohesin and a key factor involved in chromatid cohesion, and Srebp-1, a transcription factor important for lipid/cholesterol homeostasis. Using AREM, we were able to identify 19,935 Rad21 peaks and 1,748 Srebp-1 peaks in the mouse genome with high confidence, including 1,517 (7.6%) Rad21 peaks and 227 (13%) Srebp-1 peaks that were missed using only uniquely mapped reads. The open source implementation of our algorithm is available at http://sourceforge.net/projects/arem

  18. Reference-based compression of short-read sequences using path encoding

    PubMed Central

    Kingsford, Carl; Patro, Rob

    2015-01-01

    Motivation: Storing, transmitting and archiving data produced by next-generation sequencing is a significant computational burden. New compression techniques tailored to short-read sequence data are needed. Results: We present here an approach to compression that reduces the difficulty of managing large-scale sequencing data. Our novel approach sits between pure reference-based compression and reference-free compression and combines much of the benefit of reference-based approaches with the flexibility of de novo encoding. Our method, called path encoding, draws a connection between storing paths in de Bruijn graphs and context-dependent arithmetic coding. Supporting this method is a system to compactly store sets of k-mers that is of independent interest. We are able to encode RNA-seq reads using 3–11% of the space of the sequence in raw FASTA files, which is on average more than 34% smaller than competing approaches. We also show that even if the reference is very poorly matched to the reads that are being encoded, good compression can still be achieved. Availability and implementation: Source code and binaries freely available for download at http://www.cs.cmu.edu/~ckingsf/software/pathenc/, implemented in Go and supported on Linux and Mac OS X. Contact: carlk@cs.cmu.edu. Supplementary information: Supplementary data are available at Bioinformatics online. PMID:25649622

  19. MOST: a modified MLST typing tool based on short read sequencing.

    PubMed

    Tewolde, Rediat; Dallman, Timothy; Schaefer, Ulf; Sheppard, Carmen L; Ashton, Philip; Pichon, Bruno; Ellington, Matthew; Swift, Craig; Green, Jonathan; Underwood, Anthony

    2016-01-01

    Multilocus sequence typing (MLST) is an effective method to describe bacterial populations. Conventionally, MLST involves Polymerase Chain Reaction (PCR) amplification of housekeeping genes followed by Sanger DNA sequencing. Public Health England (PHE) is in the process of replacing the conventional MLST methodology with a method based on short read sequence data derived from Whole Genome Sequencing (WGS). This paper reports the comparison of the reliability of MLST results derived from WGS data, comparing mapping and assembly-based approaches to conventional methods using 323 bacterial genomes of diverse species. The sensitivity of the two WGS-based methods was further investigated with 26 mixed and 29 low coverage genomic data sets from Salmonella enteritidis and Streptococcus pneumoniae. Of the 323 samples, 92.9% (n = 300), 97.5% (n = 315) and 99.7% (n = 322) full MLST profiles were derived by the conventional method, assembly- and mapping-based approaches, respectively. The concordance between samples that were typed by conventional (92.9%) and both WGS methods was 100%. From the 55 mixed and low coverage genomes, 89.1% (n = 49) and 67.3% (n = 37) full MLST profiles were derived from the mapping and assembly based approaches, respectively. In conclusion, deriving MLST from WGS data is more sensitive than the conventional method. When comparing WGS based methods, the mapping based approach was the most sensitive. In addition, the mapping based approach described here derives quality metrics, which are difficult to determine quantitatively using conventional and WGS-assembly based approaches. PMID:27602279

  20. MOST: a modified MLST typing tool based on short read sequencing

    PubMed Central

    Dallman, Timothy; Schaefer, Ulf; Sheppard, Carmen L.; Ashton, Philip; Pichon, Bruno; Ellington, Matthew; Swift, Craig; Green, Jonathan; Underwood, Anthony

    2016-01-01

    Multilocus sequence typing (MLST) is an effective method to describe bacterial populations. Conventionally, MLST involves Polymerase Chain Reaction (PCR) amplification of housekeeping genes followed by Sanger DNA sequencing. Public Health England (PHE) is in the process of replacing the conventional MLST methodology with a method based on short read sequence data derived from Whole Genome Sequencing (WGS). This paper reports the comparison of the reliability of MLST results derived from WGS data, comparing mapping and assembly-based approaches to conventional methods using 323 bacterial genomes of diverse species. The sensitivity of the two WGS-based methods was further investigated with 26 mixed and 29 low coverage genomic data sets from Salmonella enteritidis and Streptococcus pneumoniae. Of the 323 samples, 92.9% (n = 300), 97.5% (n = 315) and 99.7% (n = 322) full MLST profiles were derived by the conventional method, assembly- and mapping-based approaches, respectively. The concordance between samples that were typed by conventional (92.9%) and both WGS methods was 100%. From the 55 mixed and low coverage genomes, 89.1% (n = 49) and 67.3% (n = 37) full MLST profiles were derived from the mapping and assembly based approaches, respectively. In conclusion, deriving MLST from WGS data is more sensitive than the conventional method. When comparing WGS based methods, the mapping based approach was the most sensitive. In addition, the mapping based approach described here derives quality metrics, which are difficult to determine quantitatively using conventional and WGS-assembly based approaches. PMID:27602279

  1. Short-read, high-throughput sequencing technology for STR genotyping

    PubMed Central

    Bornman, Daniel M.; Hester, Mark E.; Schuetter, Jared M.; Kasoji, Manjula D.; Minard-Smith, Angela; Barden, Curt A.; Nelson, Scott C.; Godbold, Gene D.; Baker, Christine H.; Yang, Boyu; Walther, Jacquelyn E.; Tornes, Ivan E.; Yan, Pearlly S.; Rodriguez, Benjamin; Bundschuh, Ralf; Dickens, Michael L.; Young, Brian A.; Faith, Seth A.

    2013-01-01

    DNA-based methods for human identification principally rely upon genotyping of short tandem repeat (STR) loci. Electrophoretic-based techniques for variable-length classification of STRs are universally utilized, but are limited in that they have relatively low throughput and do not yield nucleotide sequence information. High-throughput sequencing technology may provide a more powerful instrument for human identification, but is not currently validated for forensic casework. Here, we present a systematic method to perform high-throughput genotyping analysis of the Combined DNA Index System (CODIS) STR loci using short-read (150 bp) massively parallel sequencing technology. Open source reference alignment tools were optimized to evaluate PCR-amplified STR loci using a custom designed STR genome reference. Evaluation of this approach demonstrated that the 13 CODIS STR loci and amelogenin (AMEL) locus could be accurately called from individual and mixture samples. Sensitivity analysis showed that as few as 18,500 reads, aligned to an in silico referenced genome, were required to genotype an individual (>99% confidence) for the CODIS loci. The power of this technology was further demonstrated by identification of variant alleles containing single nucleotide polymorphisms (SNPs) and the development of quantitative measurements (reads) for resolving mixed samples. PMID:25621315

  2. Short-read sequencing for genomic analysis of the brown rot fungus Fibroporia radiculosa.

    PubMed

    Tang, Juliet D; Perkins, Andy D; Sonstegard, Tad S; Schroeder, Steven G; Burgess, Shane C; Diehl, Susan V

    2012-04-01

    The feasibility of short-read sequencing for genomic analysis was demonstrated for Fibroporia radiculosa, a copper-tolerant fungus that causes brown rot decay of wood. The effect of read quality on genomic assembly was assessed by filtering Illumina GAIIx reads from a single run of a paired-end library (75-nucleotide read length and 300-bp fragment size) at three different stringency levels and then assembling each data set with Velvet. A simple approach was devised to determine which filter stringency was "best." Venn diagrams identified the regions containing reads that were used in an assembly but were of a low-enough quality to be removed by a filter. By plotting base quality histograms of reads in this region, we judged whether a filter was too stringent or not stringent enough. Our best assembly had a genome size of 33.6 Mb, an N50 of 65.8 kb for a k-mer of 51, and a maximum contig length of 347 kb. Using GeneMark, 9,262 genes were predicted. TargetP and SignalP analyses showed that among the 1,213 genes with secreted products, 986 had motifs for signal peptides and 227 had motifs for signal anchors. Blast2GO analysis provided functional annotation for 5,407 genes. We identified 29 genes with putative roles in copper tolerance and 73 genes for lignocellulose degradation. A search for homologs of these 102 genes showed that F. radiculosa exhibited more similarity to Postia placenta than Serpula lacrymans. Notable differences were found, however, and their involvements in copper tolerance and wood decay are discussed. PMID:22247176
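
    The N50 statistic quoted above is the contig length at which the contigs of that length or longer account for at least half of the total assembly size. A minimal sketch of the standard calculation follows; the function name and the toy contig lengths are illustrative and are not taken from the F. radiculosa assembly.

```python
def n50(lengths):
    """Length L such that contigs of length >= L cover at least half the assembly."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    raise ValueError("n50 requires at least one contig")


# Toy contig lengths summing to 54; 10 + 9 + 8 = 27 reaches half the total.
print(n50([2, 3, 4, 5, 6, 7, 8, 9, 10]))  # 8
```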

  3. Characterization of a biogas-producing microbial community by short-read next generation DNA sequencing

    PubMed Central

    2012-01-01

    Background: Renewable energy production is currently a major issue worldwide. Biogas is a promising renewable energy carrier as the technology of its production combines the elimination of organic waste with the formation of a versatile energy carrier, methane. As a consequence of the complexity of the microbial communities and metabolic pathways involved, the biotechnology of the microbiological process leading to biogas production is poorly understood. Metagenomic approaches are suitable means of addressing related questions. In the present work a novel high-throughput technique was tested for its benefits in resolving the functional and taxonomical complexity of such microbial consortia. Results: It was demonstrated that the extremely parallel SOLiD™ short-read DNA sequencing platform is capable of providing sufficient useful information to decipher the systematic and functional contexts within a biogas-producing community. Although this technology has not been employed to address such problems previously, the data obtained compare well with those from similar high-throughput approaches such as 454-pyrosequencing GS FLX or Titanium. The predominant microbes contributing to the decomposition of organic matter include members of the Eubacteria, class Clostridia, order Clostridiales, family Clostridiaceae. Bacteria belonging to other systematic groups contribute to the diversity of the microbial consortium. Archaea comprise a remarkably small minority in this community, given their crucial role in biogas production. Among the Archaea, the predominant order is the Methanomicrobiales and the most abundant species is Methanoculleus marisnigri. The Methanomicrobiales are hydrogenotrophic methanogens. Besides corroborating earlier findings on the significance of the contribution of the Clostridia to organic substrate decomposition, the results demonstrate the importance of the metabolism of hydrogen within the biogas producing microbial community. Conclusions: Both

  4. PARalyzer: definition of RNA binding sites from PAR-CLIP short-read sequence data

    PubMed Central

    2011-01-01

    Crosslinking and immunoprecipitation (CLIP) protocols have made it possible to identify transcriptome-wide RNA-protein interaction sites. In particular, PAR-CLIP utilizes a photoactivatable nucleoside for more efficient crosslinking. We present an approach, centered on the novel PARalyzer tool, for mapping high-confidence sites from PAR-CLIP deep-sequencing data. We show that PARalyzer delineates sites with a high signal-to-noise ratio. Motif finding identifies the sequence preferences of RNA-binding proteins, as well as seed-matches for highly expressed microRNAs when profiling Argonaute proteins. Our study describes tailored analytical methods and provides guidelines for future efforts to utilize high-throughput sequencing in RNA biology. PARalyzer is available at http://www.genome.duke.edu/labs/ohler/research/PARalyzer/. PMID:21851591

  5. Efficient Graph Based Assembly of Short-Read Sequences on Hybrid Core Architecture

    SciTech Connect

    Sczyrba, Alex; Pratap, Abhishek; Canon, Shane; Han, James; Copeland, Alex; Wang, Zhong; Brewer, Tony; Soper, David; D'Jamoos, Mike; Collins, Kirby; Vacek, George

    2011-03-22

    Advanced architectures can deliver dramatically increased throughput for genomics and proteomics applications, reducing time-to-completion in some cases from days to minutes. One such architecture, hybrid-core computing, marries a traditional x86 environment with a reconfigurable coprocessor, based on field programmable gate array (FPGA) technology. In addition to higher throughput, increased performance can fundamentally improve research quality by allowing more accurate, previously impractical approaches. We will discuss the approach used by Convey's de Bruijn graph constructor for short-read, de novo assembly. Bioinformatics applications that have random access patterns to large memory spaces, such as graph-based algorithms, experience memory performance limitations on cache-based x86 servers. Convey's highly parallel memory subsystem allows application-specific logic to simultaneously access 8192 individual words in memory, significantly increasing effective memory bandwidth over cache-based memory systems. Many algorithms, such as Velvet and other de Bruijn graph based, short-read, de novo assemblers, can greatly benefit from this type of memory architecture. Furthermore, small data type operations (each of the four nucleotides can be represented in two bits) make more efficient use of logic gates than the data types dictated by conventional programming models. JGI is comparing the performance of Convey's graph constructor and Velvet on both synthetic and real data. We will present preliminary results on memory usage and run time metrics for various data sets with different sizes, from small microbial and fungal genomes to a very large cow rumen metagenome. For genomes with references we will also present assembly quality comparisons between the two assemblers.
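
    The two-bits-per-nucleotide representation mentioned above can be sketched in a few lines. The particular code assignment (A=0, C=1, G=2, T=3) and the function names are assumptions made for illustration; nothing here reflects how the Convey graph constructor actually lays out its data.

```python
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
BASES = "ACGT"


def pack(seq):
    """Pack a DNA string into an integer, two bits per base, first base highest."""
    value = 0
    for base in seq:
        value = (value << 2) | CODE[base]
    return value


def unpack(value, length):
    """Recover a DNA string of the given length from its packed integer."""
    out = []
    for shift in range(2 * (length - 1), -1, -2):
        out.append(BASES[(value >> shift) & 3])
    return "".join(out)


print(pack("ACGT"))     # 27, i.e. 0b00011011
print(unpack(27, 4))    # ACGT
```

    With this scheme four bases fit in one byte, so a 31-mer fits in a single 64-bit word, which is why k-mer based assemblers commonly favor such compact encodings.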

  6. Similarity thresholds used in DNA sequence assembly from short reads can reduce the comparability of population histories across species

    PubMed Central

    Judy, Caroline Duffie; Seeholzer, Glenn F.; Maley, James M.; Graves, Gary R.; Brumfield, Robb T.

    2015-01-01

    Comparing inferences among datasets generated using short read sequencing may provide insight into the concerted impacts of divergence, gene flow and selection across organisms, but comparisons are complicated by biases introduced during dataset assembly. Sequence similarity thresholds allow the de novo assembly of short reads into clusters of alleles representing different loci, but the resulting datasets are sensitive to both the similarity threshold used and to the variation naturally present in the organism under study. Thresholds that require high sequence similarity among reads for assembly (stringent thresholds) as well as highly variable species may result in datasets in which divergent alleles are lost or divided into separate loci (‘over-splitting’), whereas liberal thresholds increase the risk of paralogous loci being combined into a single locus (‘under-splitting’). Comparisons among datasets or species are therefore potentially biased if different similarity thresholds are applied or if the species differ in levels of within-lineage genetic variation. We examine the impact of a range of similarity thresholds on assembly of empirical short read datasets from populations of four different non-model bird lineages (species or species pairs) with different levels of genetic divergence. We find that, in all species, stringent similarity thresholds result in fewer alleles per locus than more liberal thresholds, which appears to be the result of high levels of over-splitting. The frequency of putative under-splitting, conversely, is low at all thresholds. Inferred genetic distances between individuals, gene tree depths, and estimates of the ancestral mutation-scaled effective population size (θ) differ depending upon the similarity threshold applied. Relative differences in inferences across species differ even when the same threshold is applied, but may be dramatically different when datasets assembled under different thresholds are compared. These
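
    The effect of a similarity threshold on locus assembly can be illustrated with a toy greedy clustering in Python. This is only a sketch under stated assumptions: the sequences are equal-length and pre-aligned, identity is a simple per-position match fraction, and the function names are hypothetical. It is not the assembly software used in the study.

```python
def identity(a, b):
    """Fraction of matching positions between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def greedy_cluster(seqs, threshold):
    """Assign each sequence to the first cluster whose seed it matches.

    The first member of each cluster acts as its seed. A sequence that
    matches no seed at or above the threshold starts a new cluster (locus).
    """
    clusters = []
    for seq in seqs:
        for cluster in clusters:
            if identity(seq, cluster[0]) >= threshold:
                cluster.append(seq)
                break
        else:
            clusters.append([seq])
    return clusters
```

    With a stringent threshold, divergent alleles of one locus end up in separate clusters (over-splitting); with a liberal threshold, they are grouped together, at the risk of also merging paralogs.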

  7. TGS-TB: Total Genotyping Solution for Mycobacterium tuberculosis Using Short-Read Whole-Genome Sequencing

    PubMed Central

    Sekizuka, Tsuyoshi; Yamashita, Akifumi; Murase, Yoshiro; Iwamoto, Tomotada; Mitarai, Satoshi; Kato, Seiya; Kuroda, Makoto

    2015-01-01

    Whole-genome sequencing (WGS) with next-generation DNA sequencing (NGS) is an increasingly accessible and affordable method for genotyping hundreds of Mycobacterium tuberculosis (Mtb) isolates, leading to more effective epidemiological studies involving single nucleotide variations (SNVs) in core genomic sequences based on molecular evolution. We developed an all-in-one web-based tool for genotyping Mtb, referred to as the Total Genotyping Solution for TB (TGS-TB), to facilitate multiple genotyping platforms using NGS for spoligotyping and the detection of phylogenies with core genomic SNVs, IS6110 insertion sites, and 43 customized loci for variable number tandem repeat (VNTR) through a user-friendly, simple click interface. This methodology is implemented with a KvarQ script to predict MTBC lineages/sublineages and potential antimicrobial resistance. Seven Mtb isolates (JP01 to JP07) in this study showing the same VNTR profile were accurately discriminated through median-joining network analysis using SNVs unique to those isolates. An additional IS6110 insertion was detected in one of those isolates as supportive genetic information in addition to core genomic SNVs. The results of in silico analyses using TGS-TB are consistent with those obtained using conventional molecular genotyping methods, suggesting that NGS short reads could provide multiple genotypes to discriminate multiple strains of Mtb, although longer NGS reads (≥300-mer) will be required for full genotyping on the TGS-TB web site. Most available short reads (~100-mer) can be utilized to discriminate the isolates based on the core genome phylogeny. TGS-TB provides a more accurate and discriminative strain typing for clinical and epidemiological investigations; NGS strain typing offers a total genotyping solution for Mtb outbreak and surveillance. TGS-TB web site: https://gph.niid.go.jp/tgs-tb/. PMID:26565975

  8. Rapid Short-Read Sequencing and Aneuploidy Detection Using MinION Nanopore Technology

    PubMed Central

    Wei, Shan; Williams, Zev

    2016-01-01

    MinION is a memory stick–sized nanopore-based sequencer designed primarily for single-molecule sequencing of long DNA fragments (>6 kb). We developed a library preparation and data-analysis method to enable rapid real-time sequencing of short DNA fragments (<1 kb) that resulted in the sequencing of 500 reads in 3 min and 40,000–80,000 reads in 2–4 hr at a rate of 30 nt/sec. We then demonstrated the clinical applicability of this approach by performing successful aneuploidy detection in prenatal and miscarriage samples with sequencing in <4 hr. This method broadens the application of nanopore-based single-molecule sequencing and makes it a promising and versatile tool for rapid clinical and research applications. PMID:26500254

  9. De novo bacterial genome sequencing: Millions of very short reads assembled on a desktop computer

    PubMed Central

    Hernandez, David; François, Patrice; Farinelli, Laurent; Østerås, Magne; Schrenzel, Jacques

    2008-01-01

    Novel high-throughput DNA sequencing technologies allow researchers to characterize a bacterial genome during a single experiment and at a moderate cost. However, the increase in sequencing throughput that is allowed by using such platforms is obtained at the expense of individual sequence read length, which must be assembled into longer contigs to be exploitable. This study focuses on the Illumina sequencing platform that produces millions of very short sequences that are 35 bases in length. We propose a de novo assembler software that is dedicated to process such data. Based on a classical overlap graph representation and on the detection of potentially spurious reads, our software generates a set of accurate contigs of several kilobases that cover most of the bacterial genome. The assembly results were validated by comparing data sets that were obtained experimentally for Staphylococcus aureus strain MW2 and Helicobacter acinonychis strain Sheeba with that of their published genomes acquired by conventional sequencing of 1.5- to 3.0-kb fragments. We also provide indications that the broad coverage achieved by high-throughput sequencing might allow for the detection of clonal polymorphisms in the set of DNA molecules being sequenced. PMID:18332092
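
    The overlap graph representation named in this abstract can be sketched in a few lines of Python. This is a simplified illustration, not the authors' assembler: it uses exact suffix-prefix matches only, compares all read pairs (quadratic time), and the function names and the `min_len` parameter are assumptions.

```python
def overlap(a, b, min_len):
    """Length of the longest suffix of a that equals a prefix of b.

    Returns 0 if no such overlap of at least min_len bases exists.
    """
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def overlap_graph(reads, min_len):
    """Build edges (i, j) -> overlap length for every ordered pair of reads."""
    edges = {}
    for i, a in enumerate(reads):
        for j, b in enumerate(reads):
            if i != j:
                n = overlap(a, b, min_len)
                if n:
                    edges[(i, j)] = n
    return edges
```

    Contigs then correspond to paths through this graph; a real assembler would also have to handle sequencing errors, reverse complements, and repeats.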

  10. Optimal pooling for genome re-sequencing with ultra-high-throughput short-read technologies

    PubMed Central

    Hajirasouliha, Iman; Hormozdiari, Fereydoun; Sahinalp, S. Cenk; Birol, Inanc

    2008-01-01

    New generation sequencing technologies offer unique opportunities and challenges for re-sequencing studies. In this article, we focus on re-sequencing experiments using the Solexa technology, based on bacterial artificial chromosome (BAC) clones, and address an experimental design problem. In these specific experiments, approximate coordinates of the BACs on a reference genome are known, and fine-scale differences between the BAC sequences and the reference are of interest. The high-throughput characteristics of the sequencing technology make it possible to multiplex BAC sequencing experiments by pooling BACs for a cost-effective operation. However, the way BACs are pooled in such re-sequencing experiments has an effect on the downstream analysis of the generated data, mostly due to subsequences common to multiple BACs. The experimental design strategy we develop in this article offers combinatorial solutions based on approximation algorithms for the well-known max n-cut problem and the related max n-section problem on hypergraphs. Our algorithms, when applied to a number of sample cases, give more than a 2-fold performance improvement over random partitioning. Contact: cenk@cs.sfu.ca PMID:18586730
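
    The pooling problem can be viewed as partitioning BACs so that clones sharing subsequences land in different pools. The Python sketch below uses a simple greedy heuristic on an ordinary weighted graph. It is an assumption-laden simplification: the paper works with hypergraphs and its own approximation algorithms, and the weights, function names, and data layout here are hypothetical.

```python
def greedy_pools(n_items, weights, n_pools):
    """Assign items 0..n_items-1 to pools one at a time.

    weights maps a pair (i, j) with i < j to the amount of sequence the two
    items share. Each item goes to the pool where it shares the least
    sequence with the items already placed there.
    """
    assignment = []
    for v in range(n_items):
        cost = [0.0] * n_pools
        for u in range(v):
            cost[assignment[u]] += weights.get((u, v), 0.0)
        assignment.append(cost.index(min(cost)))
    return assignment


def within_pool_weight(assignment, weights):
    """Total shared sequence between items placed in the same pool."""
    return sum(w for (i, j), w in weights.items()
               if assignment[i] == assignment[j])
```

    Minimizing the within-pool weight is equivalent to maximizing the weight cut between pools, which is the max n-cut objective the abstract refers to.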

  11. BarraCUDA - a fast short read sequence aligner using graphics processing units

    PubMed Central

    2012-01-01

    Background With the maturation of next-generation DNA sequencing (NGS) technologies, the throughput of DNA sequencing reads has soared to over 600 gigabases from a single instrument run. General purpose computing on graphics processing units (GPGPU) extracts the computing power from hundreds of parallel stream processors within graphics processing cores and provides a cost-effective and energy efficient alternative to traditional high-performance computing (HPC) clusters. In this article, we describe the implementation of BarraCUDA, a GPGPU sequence alignment software that is based on BWA, to accelerate the alignment of sequencing reads generated by these instruments to a reference DNA sequence. Findings Using the NVIDIA Compute Unified Device Architecture (CUDA) software development environment, we ported the most computationally intensive alignment component of BWA to GPU to take advantage of the massive parallelism. As a result, BarraCUDA offers an order-of-magnitude performance boost in alignment throughput when compared to a CPU core while delivering the same level of alignment fidelity. The software is also capable of supporting multiple CUDA devices in parallel to further accelerate the alignment throughput. Conclusions BarraCUDA is designed to take advantage of the parallelism of GPU to accelerate the alignment of millions of sequencing reads generated by NGS instruments. By doing this, we could, at least in part, streamline the current bioinformatics pipeline such that the wider scientific community could benefit from the sequencing technology. BarraCUDA is currently available from http://seqbarracuda.sf.net PMID:22244497

  12. RNA-Seq Analysis and Gene Discovery of Andrias davidianus Using Illumina Short Read Sequencing

    PubMed Central

    Li, Fenggang; Wang, Lixin; Lan, Qingjing; Yang, Hui; Li, Yang; Liu, Xiaolin; Yang, Zhaoxia

    2015-01-01

    The Chinese giant salamander, Andrias davidianus, is an important species in the course of evolution; however, there is insufficient genomic data in public databases for understanding its immunologic mechanisms. High-throughput transcriptome sequencing is necessary to generate an enormous number of transcript sequences from A. davidianus for gene discovery. In this study, we generated more than 40 million reads from samples of spleen and skin tissue using the Illumina paired-end sequencing technology. De novo assembly yielded 87,297 transcripts with a mean length of 734 base pairs (bp). Based on sequence similarity searches against known proteins, 38,916 genes were identified. Gene enrichment analysis determined that 981 transcripts were assigned to the immune system. Tissue-specific expression analysis indicated that 443 transcripts were specifically expressed in the spleen and skin. Among these transcripts, 147 transcripts were found to be involved in immune responses and inflammatory reactions, such as fucolectin, β-defensins and lymphotoxin beta. Eight tissue-specific genes were selected for validation using real time reverse transcription quantitative PCR (qRT-PCR). The results showed that these genes were significantly more expressed in spleen and skin than in other tissues, suggesting that these genes have vital roles in the immune response. This work provides a comprehensive genomic sequence resource for A. davidianus and lays the foundation for future research on the immunologic and disease resistance mechanisms of A. davidianus and other amphibians. PMID:25874626

  13. MOSAIK: a hash-based algorithm for accurate next-generation sequencing short-read mapping.

    PubMed

    Lee, Wan-Ping; Stromberg, Michael P; Ward, Alistair; Stewart, Chip; Garrison, Erik P; Marth, Gabor T

    2014-01-01

    MOSAIK is a stable, sensitive and open-source program for mapping second and third-generation sequencing reads to a reference genome. Uniquely among current mapping tools, MOSAIK can align reads generated by all the major sequencing technologies, including Illumina, Applied Biosystems SOLiD, Roche 454, Ion Torrent and Pacific BioSciences SMRT. Indeed, MOSAIK was the only aligner to provide consistent mappings for all the generated data (sequencing technologies, low-coverage and exome) in the 1000 Genomes Project. To provide highly accurate alignments, MOSAIK employs a hash clustering strategy coupled with the Smith-Waterman algorithm. This method is well-suited to capture mismatches as well as short insertions and deletions. To support the growing interest in larger structural variant (SV) discovery, MOSAIK provides explicit support for handling known-sequence SVs, e.g. mobile element insertions (MEIs) as well as generating outputs tailored to aid in SV discovery. All variant discovery benefits from an accurate description of the read placement confidence. To this end, MOSAIK uses a neural-network based training scheme to provide well-calibrated mapping quality scores, demonstrated by a correlation coefficient between MOSAIK assigned and actual mapping qualities greater than 0.98. In order to ensure that studies of any genome are supported, a training pipeline is provided to ensure optimal mapping quality scores for the genome under investigation. MOSAIK is multi-threaded, open source, and incorporated into our command and pipeline launcher system GKNO (http://gkno.me). PMID:24599324
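
    The Smith-Waterman step named in this abstract can be illustrated with a minimal Python scoring routine. This is a textbook sketch with a linear gap penalty, not MOSAIK's implementation; the scoring values (match 2, mismatch -1, gap -2) and the function name are assumptions, and the hash clustering stage is not shown.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Return the best local alignment score between strings a and b.

    Uses the standard dynamic programming recurrence, keeping only the
    previous row to limit memory use.
    """
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            curr.append(score)
            best = max(best, score)
        prev = curr
    return best
```

    Because scores are floored at zero, the alignment can start and end anywhere, which is what lets the method tolerate mismatches and short insertions or deletions inside a read.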

  14. Choice of Reference Sequence and Assembler for Alignment of Listeria monocytogenes Short-Read Sequence Data Greatly Influences Rates of Error in SNP Analyses

    PubMed Central

    Pightling, Arthur W.; Petronella, Nicholas; Pagotto, Franco

    2014-01-01

    The wide availability of whole-genome sequencing (WGS) and an abundance of open-source software have made detection of single-nucleotide polymorphisms (SNPs) in bacterial genomes an increasingly accessible and effective tool for comparative analyses. Thus, ensuring that real nucleotide differences between genomes (i.e., true SNPs) are detected at high rates and that the influences of errors (such as false positive SNPs, ambiguously called sites, and gaps) are mitigated is of utmost importance. The choices researchers make regarding the generation and analysis of WGS data can greatly influence the accuracy of short-read sequence alignments and, therefore, the efficacy of such experiments. We studied the effects of some of these choices, including: i) depth of sequencing coverage, ii) choice of reference-guided short-read sequence assembler, iii) choice of reference genome, and iv) whether to perform read-quality filtering and trimming, on our ability to detect true SNPs and on the frequencies of errors. We performed benchmarking experiments, during which we assembled simulated and real Listeria monocytogenes strain 08-5578 short-read sequence datasets of varying quality with four commonly used assemblers (BWA, MOSAIK, Novoalign, and SMALT), using reference genomes of varying genetic distances, and with or without read pre-processing (i.e., quality filtering and trimming). We found that assemblies of at least 50-fold coverage provided the most accurate results. In addition, MOSAIK yielded the fewest errors when reads were aligned to a nearly identical reference genome, while using SMALT to align reads against a reference sequence that is ∼0.82% distant from 08-5578 at the nucleotide level resulted in the detection of the greatest numbers of true SNPs and the fewest errors. Finally, we show that whether read pre-processing improves SNP detection depends upon the choice of reference sequence and assembler. In total, this study demonstrates that researchers should

  15. Choice of reference sequence and assembler for alignment of Listeria monocytogenes short-read sequence data greatly influences rates of error in SNP analyses.

    PubMed

    Pightling, Arthur W; Petronella, Nicholas; Pagotto, Franco

    2014-01-01

    The wide availability of whole-genome sequencing (WGS) and an abundance of open-source software have made detection of single-nucleotide polymorphisms (SNPs) in bacterial genomes an increasingly accessible and effective tool for comparative analyses. Thus, ensuring that real nucleotide differences between genomes (i.e., true SNPs) are detected at high rates and that the influences of errors (such as false positive SNPs, ambiguously called sites, and gaps) are mitigated is of utmost importance. The choices researchers make regarding the generation and analysis of WGS data can greatly influence the accuracy of short-read sequence alignments and, therefore, the efficacy of such experiments. We studied the effects of some of these choices, including: i) depth of sequencing coverage, ii) choice of reference-guided short-read sequence assembler, iii) choice of reference genome, and iv) whether to perform read-quality filtering and trimming, on our ability to detect true SNPs and on the frequencies of errors. We performed benchmarking experiments, during which we assembled simulated and real Listeria monocytogenes strain 08-5578 short-read sequence datasets of varying quality with four commonly used assemblers (BWA, MOSAIK, Novoalign, and SMALT), using reference genomes of varying genetic distances, and with or without read pre-processing (i.e., quality filtering and trimming). We found that assemblies of at least 50-fold coverage provided the most accurate results. In addition, MOSAIK yielded the fewest errors when reads were aligned to a nearly identical reference genome, while using SMALT to align reads against a reference sequence that is ∼0.82% distant from 08-5578 at the nucleotide level resulted in the detection of the greatest numbers of true SNPs and the fewest errors. Finally, we show that whether read pre-processing improves SNP detection depends upon the choice of reference sequence and assembler. In total, this study demonstrates that researchers should

  16. Alignment of Short Reads: A Crucial Step for Application of Next-Generation Sequencing Data in Precision Medicine

    PubMed Central

    Ye, Hao; Meehan, Joe; Tong, Weida; Hong, Huixiao

    2015-01-01

    Precision medicine or personalized medicine has been proposed as a modernized and promising medical strategy. Genetic variants of patients are the key information for implementation of precision medicine. Next-generation sequencing (NGS) is an emerging technology for deciphering genetic variants. Alignment of raw reads to a reference genome is one of the key steps in NGS data analysis. Many algorithms have been developed for alignment of short read sequences since 2008. Users have to make a decision on which alignment algorithm to use in their studies. Selection of the right alignment algorithm determines not only the alignment algorithm but also the set of suitable parameters to be used by the algorithm. Understanding these algorithms helps in selecting the appropriate alignment algorithm for different applications in precision medicine. Here, we review current available algorithms and their major strategies such as seed-and-extend and q-gram filter. We also discuss the challenges in current alignment algorithms, including alignment in multiple repeated regions, long reads alignment and alignment facilitated with known genetic variants. PMID:26610555
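
    The q-gram filter strategy mentioned in this review can be sketched in Python using the standard q-gram lemma: two sequences of length n within edit distance e share at least (n - q + 1) - q*e q-grams. The code below is a minimal illustration, not taken from any specific aligner; the function names are hypothetical.

```python
from collections import Counter


def qgrams(seq, q):
    """Count every substring of length q in seq."""
    return Counter(seq[i:i + q] for i in range(len(seq) - q + 1))


def qgram_threshold(n, q, max_errors):
    """Minimum number of shared q-grams implied by the q-gram lemma."""
    return (n - q + 1) - q * max_errors


def passes_filter(read, window, q, max_errors):
    """Return True if the window cannot be ruled out as a match for the read.

    Shared q-grams are counted as a multiset intersection. Windows that
    fail this cheap test are skipped before any full alignment is attempted.
    """
    shared = sum((qgrams(read, q) & qgrams(window, q)).values())
    return shared >= qgram_threshold(len(read), q, max_errors)
```

    The filter is a necessary condition only: windows that pass still need to be verified by a full alignment, which corresponds to the extend step of seed-and-extend approaches.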

  17. Short-read reading-frame predictors are not created equal: sequence error causes loss of signal

    PubMed Central

    2012-01-01

    Background Gene prediction algorithms (or gene callers) are an essential tool for analyzing shotgun nucleic acid sequence data. Gene prediction is a ubiquitous step in sequence analysis pipelines; it reduces the volume of data by identifying the most likely reading frame for a fragment, permitting the out-of-frame translations to be ignored. In this study we evaluate five widely used ab initio gene-calling algorithms (FragGeneScan, MetaGeneAnnotator, MetaGeneMark, Orphelia, and Prodigal) for accuracy on short (75–1000 bp) fragments containing sequence error from previously published artificial data and “real” metagenomic datasets. Results While gene prediction tools have similar accuracies predicting genes on error-free fragments, in the presence of sequencing errors considerable differences between tools become evident. For error-containing short reads, FragGeneScan finds more prokaryotic coding regions than do MetaGeneAnnotator, MetaGeneMark, Orphelia, or Prodigal. This improved detection of genes in error-containing fragments, however, comes at the cost of much lower (50%) specificity and overprediction of genes in noncoding regions. Conclusions Ab initio gene callers offer a significant reduction in the computational burden of annotating individual nucleic acid reads and are used in many metagenomic annotation systems. For predicting reading frames on raw reads, we find the hidden Markov model approach in FragGeneScan is more sensitive than other gene prediction tools, while Prodigal, MGA, and MGM are better suited for higher-quality sequences such as assembled contigs. PMID:22839106

  18. Analysis of gene expression for microminipig liver transcriptomes using parallel long-read technology and short-read sequencing.

    PubMed

    Sakai, Chizuka; Iwano, Shunsuke; Shimizu, Makiko; Onodera, Jun; Uchida, Masashi; Sakurada, Eri; Yamazaki, Yuri; Asaoka, Yoshiji; Imura, Naoko; Uno, Yasuhiro; Murayama, Norie; Hayashi, Ryoji; Yamazaki, Hiroshi; Miyamoto, Yohei

    2016-05-01

    The microminipig is one of the smallest minipigs that has emerged as a possible experimental animal model, because it shares many anatomical and/or physiological similarities with humans, including the coronary artery distribution in the heart, the digestive physiology, the kidney size and its structure, and so on. However, information on gene expression profiles, including those on drug-metabolizing phase I and II enzymes, in the microminipig is limited. Therefore, the aim of the present study was to identify transcripts in microminipig livers and to determine gene expression profiles. De novo assembly and expression analyses of microminipig transcripts were conducted with liver samples from three male and three female microminipigs using parallel long-read and short-read sequencing technologies. After unique sequences had been automatically aligned by assembling software, the mean contig length of 50843 transcripts was 707 bp. The expression profiles of cytochrome P450 (P450) 1A2, 2C, 2E1 and 3A genes in livers in microminipigs were similar to those in humans. Liver carboxylesterase (CES) precursor, liver CES-like, UDP-glucuronosyltransferase (UGT) 2C1-like, amine sulfotransferase (SULT)-like, N-acetyltransferases (NAT8) and glutathione S-transferase (GST) A2 genes, which are relatively unknown genes in pigs and/or humans, were expressed strongly. Furthermore, no significant gender differences were observed in the gene expression profiles of phase I enzymes, whereas UGT2B17, SULT1E1, SULT2A1, amine SULT-like, NAT8 and GSTT4 genes were different between males and females among phase II enzyme genes under the present sample conditions. These results provide a foundation for mechanistic studies and the use of microminipigs as model animals for drug development in the future. Copyright © 2016 John Wiley & Sons, Ltd. PMID:27214158

  19. Short reads and nonmodel species: exploring the complexities of next-generation sequence assembly and SNP discovery in the absence of a reference genome.

    PubMed

    Everett, M V; Grau, E D; Seeb, J E

    2011-03-01

    How practical is gene and SNP discovery in a nonmodel species using short read sequences? Next-generation sequencing technologies are being applied to an increasing number of species with no reference genome. For nonmodel species, the cost, availability of existing genetic resources, genome complexity and the planned method of assembly must all be considered when selecting a sequencing platform. Our goal was to examine the feasibility and optimal methodology for SNP and gene discovery in the sockeye salmon (Oncorhynchus nerka) using short read sequences. SOLiD short reads (up to 50 bp) were generated from single- and pooled-tissue transcriptome libraries from ten sockeye salmon. The individuals were from five distinct populations from the Wood River Lakes and Mendeltna Creek, Alaska. As no reference genome was available for sockeye salmon, the SOLiD sequence reads were assembled to publicly available EST reference sequences from sockeye salmon and two closely related species, rainbow trout (Oncorhynchus mykiss) and Atlantic salmon (Salmo salar). Additionally, de novo assembly of the SOLiD data was carried out, and the SOLiD reads were remapped to the de novo contigs. The results from each reference assembly were compared across all references. The number and size of contigs assembled varied with the size of the reference sequences. In silico SNP discovery was carried out on contigs from all four EST references; however, discovery of valid SNPs was most successful using one of the two conspecific references. PMID:21429166

  20. Fine De Novo Sequencing of a Fungal Genome Using only SOLiD Short Read Data: Verification on Aspergillus oryzae RIB40

    PubMed Central

    Takeda, Itaru; Hagiwara, Hiroko; Ikegami, Tsutomu; Koike, Hideaki; Machida, Masayuki

    2013-01-01

    The development of next-generation sequencing (NGS) technologies has dramatically increased the throughput, speed, and efficiency of genome sequencing. The short read data generated from NGS platforms, such as SOLiD and Illumina, are quite useful for mapping analysis. However, the SOLiD read data with lengths of <60 bp have been considered to be too short for de novo genome sequencing. Here, to investigate whether de novo sequencing of fungal genomes is possible using only SOLiD short read sequence data, we performed de novo assembly of the Aspergillus oryzae RIB40 genome using only SOLiD read data of 50 bp generated from mate-paired libraries with 2.8- or 1.9-kb insert sizes. The assembled scaffolds showed an N50 value of 1.6 Mb, a 22-fold increase over those obtained using only SOLiD short reads in other published reports. In addition, almost 99% of the reference genome was accurately aligned by the assembled scaffold fragments in long lengths. The sequences of secondary metabolite biosynthetic genes and clusters, whose products are of considerable interest in fungal studies due to their potential medicinal, agricultural, and cosmetic properties, were also highly reconstructed in the assembled scaffolds. Based on these findings, we concluded that de novo genome sequencing using only SOLiD short reads is feasible and practical for molecular biological study of fungi. We also investigated the effect of filtering low quality data, library insert size, and k-mer size on the assembly performance, and recommend assembling mildly filtered read data (for which the N50 was not greatly degraded) from a library with an insert size of ∼2.0 kb, using a k-mer size of 33. PMID:23667655
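
    The N50 statistic reported in this abstract follows a standard definition: it is the length of the shortest scaffold in the smallest set of longest scaffolds that together cover at least half of the total assembly length. A minimal Python sketch follows; the function name is an assumption.

```python
def n50(lengths):
    """Return the N50 of a collection of contig or scaffold lengths."""
    total = sum(lengths)
    running = 0
    # Walk from the longest sequence down until half the total is covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    raise ValueError("lengths must not be empty")
```

    A higher N50 indicates that more of the assembly sits in long scaffolds, which is why it is commonly used to compare assemblies of the same genome.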

  1. Methods for accurate quantification of LTR-retrotransposon copy number using short-read sequence data: a case study in Sorghum.

    PubMed

    Ramachandran, Dhanushya; Hawkins, Jennifer S

    2016-10-01

    Transposable elements (TEs) are ubiquitous in eukaryotic genomes and their mobility impacts genome structure and function in myriad ways. Because of their abundance, activity, and repetitive nature, the characterization and analysis of TEs remain challenging, particularly from short-read sequencing projects. To overcome this difficulty, we have developed a method that estimates TE copy number from short-read sequences. To test the accuracy of our method, we first performed an in silico analysis of the reference Sorghum bicolor genome, using both reference-based and de novo approaches. The resulting TE copy number estimates were strikingly similar to the annotated numbers. We then tested our method on real short-read data by estimating TE copy numbers in several accessions of S. bicolor and its close relative S. propinquum. Both methods effectively identify and rank similar TE families from highest to lowest abundance. We found that de novo characterization was effective at capturing qualitative variation, but underestimated the abundance of some TE families, specifically families of more ancient origin. Also, interspecific reference-based mapping of S. propinquum reads to the S. bicolor database failed to fully describe TE content in S. propinquum, indicative of recent TE activity leading to changes in the respective repetitive landscapes over very short evolutionary timescales. We conclude that reference-based analyses are best suited for within-species comparisons, while de novo approaches are more reliable for evolutionarily distant comparisons. PMID:27295958
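
    The abstract does not spell out how copy number is computed. One generic way to estimate it from mapped reads is to divide the read depth over a TE reference sequence by the genome-wide depth. The Python sketch below shows only that generic depth-ratio idea, which is an assumption and not necessarily the authors' method; the function and parameter names are hypothetical.

```python
def estimate_copy_number(te_mapped_bases, te_length, genome_coverage):
    """Estimate copies of a TE family from read depth.

    te_mapped_bases: total bases from reads that map to the TE reference
    te_length:       length of the TE reference sequence in bases
    genome_coverage: average sequencing depth across the genome
    """
    te_depth = te_mapped_bases / te_length
    return te_depth / genome_coverage
```

    For example, 3,000,000 mapped bases over a 10,000 bp element gives a depth of 300, which at 30-fold genome coverage suggests about 10 copies.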

  2. The Long March: A Sample Preparation Technique that Enhances Contig Length and Coverage by High-Throughput Short-Read Sequencing

    PubMed Central

    Webster, Dale; Dimon, Michelle; Ruby, J. Graham; Hekele, Armin; DeRisi, Joseph L.

    2008-01-01

    High-throughput short-read technologies have revolutionized DNA sequencing by drastically reducing the cost per base of sequencing information. Despite producing gigabases of sequence per run, these technologies still present obstacles in resequencing and de novo assembly applications due to biased or insufficient target sequence coverage. We present here a simple sample preparation method termed the “long march” that increases both contig lengths and target sequence coverage using high-throughput short-read technologies. By incorporating a Type IIS restriction enzyme recognition motif into the sequencing primer adapter, successive rounds of restriction enzyme cleavage and adapter ligation produce a set of nested sub-libraries from the initial amplicon library. Sequence reads from these sub-libraries are offset from each other with enough overlap to aid assembly and contig extension. We demonstrate the utility of the long march in resequencing of the Plasmodium falciparum transcriptome, where the number of genomic bases covered was increased by 39%, as well as in metagenomic analysis of a serum sample from a patient with hepatitis B virus (HBV)-related acute liver failure, where the number of HBV bases covered was increased by 42%. We also offer a theoretical optimization of the long march for de novo sequence assembly. PMID:18941527

  3. Crystallizing short-read assemblies around seeds

    PubMed Central

    Hossain, Mohammad Sajjad; Azimi, Navid; Skiena, Steven

    2009-01-01

    Background New short-read sequencing technologies produce enormous volumes of 25–30 base paired-end reads. The resulting reads have vastly different characteristics than those produced by Sanger sequencing, and require different approaches than the previous generation of sequence assemblers. In this paper, we present a short-read de novo assembler particularly targeted at the new ABI SOLiD sequencing technology. Results This paper presents what we believe to be the first de novo sequence assembly results on real data from the emerging SOLiD platform, introduced by Applied Biosystems. Our assembler SHORTY augments short-paired reads using a trivially small number (5 – 10) of seeds of length 300 – 500 bp. These seeds enable us to produce significant assemblies using short-read coverage no more than 100×, which can be obtained in a single run of these high-capacity sequencers. SHORTY exploits two ideas which we believe to be of interest to the short-read assembly community: (1) using single seed reads to crystallize assemblies, and (2) estimating intercontig distances accurately from multiple spanning paired-end reads. Conclusion We demonstrate effective assemblies (N50 contig sizes ~40 kb) of three different bacterial species using simulated SOLiD data. Sequencing artifacts limit our performance on real data; however, our results on these data are substantially better than those achieved by competing assemblers. PMID:19208115
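
    The idea of estimating intercontig distances from multiple spanning read pairs can be sketched in Python as follows. This is a simplified illustration, not SHORTY's algorithm: it assumes a known fixed insert size, read pairs whose left read lies on contig A and right read on contig B, and a median to combine the estimates. All names are hypothetical.

```python
from statistics import median


def gap_estimates(pairs, contig_a_len, insert_size):
    """Return one gap estimate per spanning read pair.

    Each pair is (left_start, right_end): the 0-based start of the left read
    on contig A and the exclusive end of the right read on contig B.
    The gap is the insert size minus the parts of the insert that lie on
    contig A and on contig B.
    """
    estimates = []
    for left_start, right_end in pairs:
        on_a = contig_a_len - left_start
        on_b = right_end
        estimates.append(insert_size - on_a - on_b)
    return estimates


def estimate_gap(pairs, contig_a_len, insert_size):
    """Combine the per-pair estimates with a median to resist outliers."""
    return median(gap_estimates(pairs, contig_a_len, insert_size))
```

    Using many spanning pairs averages out the natural variation in insert size, which is why multiple pairs give a more accurate distance than any single pair.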

  4. Whole genome sequencing of environmental Vibrio cholerae O1 from 10 nanograms of DNA using short reads.

    PubMed

    Pérez Chaparro, Paula Juliana; McCulloch, John Anthony; Cerdeira, Louise Teixeira; Al-Dilaimi, Arwa; Canto de Sá, Lena Lillian; de Oliveira, Rodrigo; Tauch, Andreas; de Carvalho Azevedo, Vasco Ariston; Cruz Schneider, Maria Paula; da Silva, Artur Luiz da Costa

    2011-11-01

    Multiple Displacement Amplification (MDA) of DNA using φ29 (phi29) DNA polymerase amplifies DNA several billion-fold, which has proved to be potentially very useful for evaluating genome information in a culture-independent manner. Whole genome sequencing using DNA from a single prokaryotic genome copy amplified by MDA has not yet been achieved due to the formation of chimeras and skewed amplification of genomic regions during the MDA step, which preclude genome assembly. We addressed this issue by using 10 ng of genomic Vibrio cholerae DNA extracted within an agarose plug to ensure circularity as a starting point for MDA and then sequencing the amplified yield using the SOLiD platform. We successfully assembled the entire genome of V. cholerae strain LMA3984-4 (environmental O1 strain isolated in urban Amazonia) using a hybrid de novo assembly strategy. Using our method, only 178 of 16,713 contigs (1%) could not be inserted into either chromosome scaffold, and of these 178, only 3 appeared to be chimeras. The other contigs seem to be the result of template-independent non-specific amplification during MDA, yielding spurious reads. Extraction of genomic DNA within an agarose plug in order to ensure circularity of the extracted genome might be key to minimizing amplification bias by MDA for WGS. PMID:21871929

  5. COPS: a sensitive and accurate tool for detecting somatic Copy Number Alterations using short-read sequence data from paired samples.

    PubMed

    Krishnan, Neeraja M; Gaur, Prakhar; Chaudhary, Rakshit; Rao, Arjun A; Panda, Binay

    2012-01-01

    Copy Number Alterations (CNAs), such as deletions and duplications, compose a larger percentage of genetic variations than single nucleotide polymorphisms or other structural variations in cancer genomes that undergo major chromosomal re-arrangements. It is, therefore, imperative to identify cancer-specific somatic copy number alterations (SCNAs), with respect to matched normal tissue, in order to understand their association with the disease. We have devised an accurate, sensitive, and easy-to-use tool, COPS, COpy number using Paired Samples, for detecting SCNAs. We rigorously tested the performance of COPS using short sequence simulated reads at various sizes and coverage of SCNAs, read depths, read lengths and also with real tumor:normal paired samples. We found COPS to perform better in comparison to other known SCNA detection tools for all evaluated parameters, namely, sensitivity (detection of true positives), specificity (detection of false positives) and size accuracy. COPS performed well for sequencing reads of all lengths when used with most upstream read alignment tools. Additionally, by incorporating a downstream boundary segmentation detection tool, the accuracy of SCNA boundaries was further improved. Here, we report an accurate, sensitive and easy-to-use tool for detecting cancer-specific SCNAs using short-read sequence data. In addition to cancer, COPS can be used for any disease as long as sequence reads from both disease and normal samples from the same individual are available. An added boundary segmentation detection module makes COPS-detected SCNA boundaries more specific for the samples studied. COPS is available at ftp://115.119.160.213 with username "cops" and password "cops". PMID:23110103
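
A common way to compare read depth between paired samples is a per-window log2 ratio after normalizing for library size. The sketch below illustrates that general idea only; it is not the COPS algorithm, and the window counts, function name, and pseudocount are assumptions made for illustration.

```python
import math


def log2_ratios(tumor_counts, normal_counts, pseudocount=0.5):
    """Per-window log2(tumor / normal) read-depth ratios.

    Counts are first scaled by each sample's total so that differences in
    sequencing depth between the two libraries cancel out. The pseudocount
    (an illustrative choice) avoids division by zero in empty windows.
    """
    tumor_total = sum(tumor_counts)
    normal_total = sum(normal_counts)
    ratios = []
    for tumor, normal in zip(tumor_counts, normal_counts):
        tumor_fraction = (tumor + pseudocount) / tumor_total
        normal_fraction = (normal + pseudocount) / normal_total
        ratios.append(math.log2(tumor_fraction / normal_fraction))
    return ratios


if __name__ == "__main__":
    tumor = [100, 100, 210, 100, 100]
    normal = [100, 100, 100, 100, 100]
    for window, ratio in enumerate(log2_ratios(tumor, normal)):
        print(window, round(ratio, 2))
```

Windows with a clearly positive ratio are candidate gains and clearly negative ones candidate losses; a real tool would add segmentation and statistical testing on top of this.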

  6. Short Read Alignment Using SOAP2.

    PubMed

    Hurgobin, Bhavna

    2016-01-01

    Next-generation sequencing (NGS) technologies have rapidly evolved in the last 5 years, leading to the generation of millions of short reads in a single run. Consequently, various sequence alignment algorithms have been developed to compare these reads to an appropriate reference in order to perform important downstream analysis. SOAP2 from the SOAP series is one of the most commonly used alignment programs to handle NGS data, and it efficiently does so using low computer memory usage and fast alignment speed. This chapter describes the protocol used to align short reads to a reference genome using SOAP2, and highlights the significance of using the in-built command-line options to tune the behavior of the algorithm according to the inputs and the desired results. PMID:26519410

  7. Deep sequencing of evolving pathogen populations: applications, errors, and bioinformatic solutions

    PubMed Central

    2014-01-01

    Deep sequencing harnesses the high throughput nature of next generation sequencing technologies to generate population samples, treating information contained in individual reads as meaningful. Here, we review applications of deep sequencing to pathogen evolution. Pioneering deep sequencing studies from the virology literature are discussed, such as whole genome Roche-454 sequencing analyses of the dynamics of the rapidly mutating pathogens hepatitis C virus and HIV. Extension of the deep sequencing approach to bacterial populations is then discussed, including the impacts of emerging sequencing technologies. While it is clear that deep sequencing has unprecedented potential for assessing the genetic structure and evolutionary history of pathogen populations, bioinformatic challenges remain. We summarise current approaches to overcoming these challenges, in particular methods for detecting low frequency variants in the context of sequencing error and reconstructing individual haplotypes from short reads. PMID:24428920
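
One simple way to reason about low-frequency variants against a background of sequencing error is a binomial tail probability. The sketch below is a generic illustration, not a method taken from the review: it assumes independent errors at a single uniform per-base rate, which real data generally violate.

```python
from math import comb


def binomial_tail(k, depth, error_rate):
    """P(X >= k) for X ~ Binomial(depth, error_rate).

    Interpreted here as the chance that sequencing error alone produces
    at least k variant-supporting reads at a site covered by `depth` reads.
    Computed via the complement, so it is intended for small k.
    """
    below = sum(
        comb(depth, i) * error_rate**i * (1 - error_rate) ** (depth - i)
        for i in range(k)
    )
    return max(0.0, 1.0 - below)


if __name__ == "__main__":
    depth, error_rate = 1000, 0.001
    for k in (1, 5, 20):
        print(k, binomial_tail(k, depth, error_rate))
```

Under this model a single variant read at 1000× depth is unremarkable, whereas 20 variant reads would be very unlikely to arise from error alone.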

  8. Making sense of deep sequencing.

    PubMed

    Goldman, D; Domschke, K

    2014-10-01

    This review, the first of an occasional series, tries to make sense of the concepts and uses of deep sequencing of polynucleic acids (DNA and RNA). Deep sequencing, synonymous with next-generation sequencing, high-throughput sequencing and massively parallel sequencing, includes whole genome sequencing but is more often and diversely applied to specific parts of the genome captured in different ways, for example the highly expressed portion of the genome known as the exome and portions of the genome that are epigenetically marked either by DNA methylation, the binding of proteins including histones, or that are in different configurations and thus more or less accessible to enzymes that cleave DNA. Deep sequencing of RNA (RNASeq) reverse-transcribed to complementary DNA is invaluable for measuring RNA expression and detecting changes in RNA structure. Important concepts in deep sequencing include the length and depth of sequence reads, mapping and assembly of reads, sequencing error, haplotypes, and the propensity of deep sequencing, as with other types of 'big data', to generate large numbers of errors, requiring monitoring for methodologic biases and strategies for replication and validation. Deep sequencing yields a unique genetic fingerprint that can be used to identify a person, and a trove of predictors of genetic medical diseases. Deep sequencing to identify epigenetic events including changes in DNA methylation and RNA expression can reveal the history and impact of environmental exposures. Because of the power of sequencing to identify and deliver biomedically significant information about a person and their blood relatives, it creates ethical dilemmas and practical challenges in research and clinical care, for example the decision and procedures to report incidental findings that will increasingly and frequently be discovered. PMID:24925306

  9. Comparison of de novo short read assemblers on metagenomic data

    Technology Transfer Automated Retrieval System (TEKTRAN)

    Next-generation sequencing technologies have the potential to revolutionize genomics and biological research. A flurry of short-read assemblers have been developed recently to facilitate the analysis of the short sequences generated using these technologies. However, none of these assemblers has spec...

  10. Objective and comprehensive evaluation of bisulfite short read mapping tools.

    PubMed

    Tran, Hong; Porter, Jacob; Sun, Ming-An; Xie, Hehuang; Zhang, Liqing

    2014-01-01

    Background. Large-scale bisulfite treatment and short reads sequencing technology allow comprehensive estimation of methylation states of Cs in the genomes of different tissues, cell types, and developmental stages. Accurate characterization of DNA methylation is essential for understanding genotype-phenotype association, gene and environment interaction, diseases, and cancer. Aligning bisulfite short reads to a reference genome has been a challenging task. We compared five bisulfite short read mapping tools, BSMAP, Bismark, BS-Seeker, BiSS, and BRAT-BW, representing two classes of mapping algorithms (hash table and suffix/prefix tries). We examined their mapping efficiency (i.e., the percentage of reads that can be mapped to the genomes), usability, running time, and effects of changing default parameter settings using both real and simulated reads. We also investigated how preprocessing data might affect mapping efficiency. Conclusion. Among the five programs compared, in terms of mapping efficiency, Bismark performs the best on the real data, followed by BiSS, BSMAP, and finally BRAT-BW and BS-Seeker with very similar performance. If CPU time is not a constraint, Bismark is a good choice of program for mapping bisulfite treated short reads. Data quality greatly affects mapping efficiency. Although increasing the number of mismatches allowed can increase mapping efficiency, it not only significantly slows down the program, but also runs the risk of having increased false positives. Therefore, users should carefully set the related parameters depending on the quality of their sequencing data. PMID:24839440

  11. Objective and Comprehensive Evaluation of Bisulfite Short Read Mapping Tools

    PubMed Central

    Zhang, Liqing

    2014-01-01

    Background. Large-scale bisulfite treatment and short reads sequencing technology allow comprehensive estimation of methylation states of Cs in the genomes of different tissues, cell types, and developmental stages. Accurate characterization of DNA methylation is essential for understanding genotype-phenotype association, gene and environment interaction, diseases, and cancer. Aligning bisulfite short reads to a reference genome has been a challenging task. We compared five bisulfite short read mapping tools, BSMAP, Bismark, BS-Seeker, BiSS, and BRAT-BW, representing two classes of mapping algorithms (hash table and suffix/prefix tries). We examined their mapping efficiency (i.e., the percentage of reads that can be mapped to the genomes), usability, running time, and effects of changing default parameter settings using both real and simulated reads. We also investigated how preprocessing data might affect mapping efficiency. Conclusion. Among the five programs compared, in terms of mapping efficiency, Bismark performs the best on the real data, followed by BiSS, BSMAP, and finally BRAT-BW and BS-Seeker with very similar performance. If CPU time is not a constraint, Bismark is a good choice of program for mapping bisulfite treated short reads. Data quality greatly affects mapping efficiency. Although increasing the number of mismatches allowed can increase mapping efficiency, it not only significantly slows down the program, but also runs the risk of having increased false positives. Therefore, users should carefully set the related parameters depending on the quality of their sequencing data. PMID:24839440

  12. Droplet barcoding for massively parallel single-molecule deep sequencing.

    PubMed

    Lan, Freeman; Haliburton, John R; Yuan, Aaron; Abate, Adam R

    2016-01-01

    The ability to accurately sequence long DNA molecules is important across biology, but existing sequencers are limited in read length and accuracy. Here, we demonstrate a method to leverage short-read sequencing to obtain long and accurate reads. Using droplet microfluidics, we isolate, amplify, fragment and barcode single DNA molecules in aqueous picolitre droplets, allowing the full-length molecules to be sequenced with multi-fold coverage using short-read sequencing. We show that this approach can provide accurate sequences of up to 10 kb, allowing us to identify rare mutations below the detection limit of conventional sequencing and directly link them into haplotypes. This barcoding methodology can be a powerful tool in sequencing heterogeneous populations such as viruses. PMID:27353563

  13. Droplet barcoding for massively parallel single-molecule deep sequencing

    PubMed Central

    Lan, Freeman; Haliburton, John R.; Yuan, Aaron; Abate, Adam R.

    2016-01-01

    The ability to accurately sequence long DNA molecules is important across biology, but existing sequencers are limited in read length and accuracy. Here, we demonstrate a method to leverage short-read sequencing to obtain long and accurate reads. Using droplet microfluidics, we isolate, amplify, fragment and barcode single DNA molecules in aqueous picolitre droplets, allowing the full-length molecules to be sequenced with multi-fold coverage using short-read sequencing. We show that this approach can provide accurate sequences of up to 10 kb, allowing us to identify rare mutations below the detection limit of conventional sequencing and directly link them into haplotypes. This barcoding methodology can be a powerful tool in sequencing heterogeneous populations such as viruses. PMID:27353563

  14. Qualitative De Novo Analysis of Full Length cDNA and Quantitative Analysis of Gene Expression for Common Marmoset (Callithrix jacchus) Transcriptomes Using Parallel Long-Read Technology and Short-Read Sequencing

    PubMed Central

    Uno, Yasuhiro; Uehara, Shotaro; Inoue, Takashi; Murayama, Norie; Onodera, Jun; Sasaki, Erika; Yamazaki, Hiroshi

    2014-01-01

    The common marmoset (Callithrix jacchus) is a non-human primate that could prove useful as human pharmacokinetic and biomedical research models. The cytochromes P450 (P450s) are a superfamily of enzymes that have critical roles in drug metabolism and disposition via monooxygenation of a broad range of xenobiotics; however, information on some marmoset P450s is currently limited. Therefore, identification and quantitative analysis of tissue-specific mRNA transcripts, including those of P450s and flavin-containing monooxygenases (FMO, another monooxygenase family), need to be carried out in detail before the marmoset can be used as an animal model in drug development. De novo assembly and expression analysis of marmoset transcripts were conducted with pooled liver, intestine, kidney, and brain samples from three male and three female marmosets. After unique sequences were automatically aligned by assembling software, the mean contig length was 718 bp (with a standard deviation of 457 bp) among a total of 47,883 transcripts. Approximately 30% of the total transcripts were matched to known marmoset sequences. Gene expression in 18 marmoset P450- and 4 FMO-like genes displayed some tissue-specific patterns. Of these, the three most highly expressed in marmoset liver were P450 2D-, 2E-, and 3A-like genes. In extrahepatic tissues, including brain, gene expressions of these monooxygenases were lower than those in liver, although P450 3A4 (previously P450 3A21) in intestine and P450 4A11- and FMO1-like genes in kidney were relatively highly expressed. By means of massive parallel long-read sequencing and short-read technology applied to marmoset liver, intestine, kidney, and brain, the combined next-generation sequencing analyses reported here were able to identify novel marmoset drug-metabolizing P450 transcripts that have until now been little reported. These results provide a foundation for mechanistic studies and pave the way for the use of marmosets as model animals

  15. A hybrid short read mapping accelerator

    PubMed Central

    2013-01-01

    Background The rapid growth of short read datasets poses a new challenge to the short read mapping problem in terms of sensitivity and execution speed. Existing methods often use a restrictive error model for computing the alignments to improve speed, whereas more flexible error models are generally too slow for large-scale applications. A number of short read mapping software tools have been proposed. However, designs based on hardware are relatively rare. Field programmable gate arrays (FPGAs) have been successfully used in a number of specific application areas, such as the DSP and communications domains due to their outstanding parallel data processing capabilities, making them a competitive platform to solve problems that are “inherently parallel”. Results We present a hybrid system for short read mapping utilizing both FPGA-based hardware and CPU-based software. The computation intensive alignment and the seed generation operations are mapped onto an FPGA. We present a computationally efficient, parallel block-wise alignment structure (Align Core) to approximate the conventional dynamic programming algorithm. The performance is compared to the multi-threaded CPU-based GASSST and BWA software implementations. For single-end alignment, our hybrid system achieves faster processing speed than GASSST (with a similar sensitivity) and BWA (with a higher sensitivity); for pair-end alignment, our design achieves a slightly worse sensitivity than that of BWA but has a higher processing speed. Conclusions This paper shows that our hybrid system can effectively accelerate the mapping of short reads to a reference genome based on the seed-and-extend approach. The performance comparison to the GASSST and BWA software implementations under different conditions shows that our hybrid design achieves a high degree of sensitivity and requires less overall execution time with only modest FPGA resource utilization. Our hybrid system design also shows that the performance

  16. RAPSearch: a fast protein similarity search tool for short reads

    PubMed Central

    2011-01-01

    Background Next Generation Sequencing (NGS) is producing enormous corpuses of short DNA reads, affecting emerging fields like metagenomics. Protein similarity search--a key step to achieve annotation of protein-coding genes in these short reads, and identification of their biological functions--faces daunting challenges because of the sheer size of the short read datasets. Results We developed a fast protein similarity search tool RAPSearch that utilizes a reduced amino acid alphabet and suffix array to detect seeds of flexible length. For short reads (translated in 6 frames) we tested, RAPSearch achieved ~20-90 times speedup as compared to BLASTX. RAPSearch missed only a small fraction (~1.3-3.2%) of BLASTX similarity hits, but it also discovered additional homologous proteins (~0.3-2.1%) that BLASTX missed. By contrast, BLAT, a tool that is even slightly faster than RAPSearch, had significant loss of sensitivity as compared to RAPSearch and BLAST. Conclusions RAPSearch is implemented as open-source software and is accessible at http://omics.informatics.indiana.edu/mg/RAPSearch. It enables faster protein similarity search. The application of RAPSearch in metagenomics has also been demonstrated. PMID:21575167
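
The benefit of a reduced amino acid alphabet for seeding can be shown with a small sketch. The residue grouping below is a hypothetical, chemistry-based illustration and is not RAPSearch's alphabet; the sketch also uses fixed-length k-mers in a Python set, whereas the abstract describes flexible-length seeds found with a suffix array.

```python
# Hypothetical grouping of the 20 amino acids (illustration only).
GROUPS = ["ILMV", "FWY", "KRH", "DE", "STNQ", "AG", "C", "P"]
# Map every residue to the first letter of its group.
REDUCE = {aa: group[0] for group in GROUPS for aa in group}


def reduce_sequence(protein):
    """Rewrite a protein in the reduced alphabet; unknown residues become X."""
    return "".join(REDUCE.get(aa, "X") for aa in protein.upper())


def shared_seeds(query, subject, k=4):
    """Reduced-alphabet k-mers present in both sequences."""
    reduced_query = reduce_sequence(query)
    reduced_subject = reduce_sequence(subject)
    query_kmers = {
        reduced_query[i:i + k] for i in range(len(reduced_query) - k + 1)
    }
    subject_kmers = {
        reduced_subject[i:i + k] for i in range(len(reduced_subject) - k + 1)
    }
    return query_kmers & subject_kmers


if __name__ == "__main__":
    # No identical residue at any position, yet the reduced strings match.
    print(reduce_sequence("MKVL"), reduce_sequence("LRIV"))
    print(shared_seeds("MKVL", "LRIV"))
```

Because similar residues collapse to the same symbol, exact matching in the reduced alphabet can find seeds between sequences that share no exact matches in the full alphabet.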

  17. EC: an efficient error correction algorithm for short reads

    PubMed Central

    2015-01-01

    Background In highly parallel next-generation sequencing (NGS) techniques millions to billions of short reads are produced from a genomic sequence in a single run. Due to the limitation of the NGS technologies, there could be errors in the reads. The error rate of the reads can be reduced with trimming and by correcting the erroneous bases of the reads. It helps to achieve high quality data and the computational complexity of many biological applications will be greatly reduced if the reads are first corrected. We have developed a novel error correction algorithm called EC and compared it with four other state-of-the-art algorithms using both real and simulated sequencing reads. Results We have done extensive and rigorous experiments that reveal that EC is indeed an effective, scalable, and efficient error correction tool. Real reads that we have employed in our performance evaluation are Illumina-generated short reads of various lengths. Six experimental datasets we have utilized are taken from the Sequence Read Archive (SRA) at NCBI. The simulated reads are obtained by picking substrings from random positions of reference genomes. To introduce errors, some of the bases of the simulated reads are changed to other bases with some probabilities. Conclusions Error correction is a vital problem in biology, especially for NGS data. In this paper we present a novel algorithm, called Error Corrector (EC), for correcting substitution errors in biological sequencing reads. We plan to investigate the possibility of employing the techniques introduced in this research paper to handle insertion and deletion errors also. Software availability The implementation is freely available for non-commercial purposes. It can be downloaded from: http://engr.uconn.edu/~rajasek/EC.zip. PMID:26678663
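
The read simulation described in the abstract (substrings from random positions, with some bases changed to other bases) can be sketched in a few lines. This is an illustrative reconstruction, not the authors' simulator: the uniform per-base error rate, the function name, and the seeded random generator are all assumptions.

```python
import random

BASES = "ACGT"


def simulate_reads(reference, read_length, n_reads, error_rate, seed=0):
    """Return (start, read) tuples sampled from `reference`.

    Each read is a substring taken from a uniformly random position; every
    base is then replaced by a different base with probability `error_rate`
    (substitution errors only, assumed uniform along the read).
    """
    rng = random.Random(seed)
    reads = []
    for _ in range(n_reads):
        start = rng.randrange(len(reference) - read_length + 1)
        read = list(reference[start:start + read_length])
        for i, base in enumerate(read):
            if rng.random() < error_rate:
                read[i] = rng.choice([b for b in BASES if b != base])
        reads.append((start, "".join(read)))
    return reads


if __name__ == "__main__":
    genome = "ACGTACGGTCAGTTCAGGCATCGATCGGATTACA"
    for start, read in simulate_reads(genome, 10, 3, error_rate=0.1):
        print(start, read)
```

Keeping the true start position with each read makes it possible to score an error corrector afterwards by comparing its output with the original substring.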

  18. Quantitative phenotyping via deep barcode sequencing.

    PubMed

    Smith, Andrew M; Heisler, Lawrence E; Mellor, Joseph; Kaper, Fiona; Thompson, Michael J; Chee, Mark; Roth, Frederick P; Giaever, Guri; Nislow, Corey

    2009-10-01

    Next-generation DNA sequencing technologies have revolutionized diverse genomics applications, including de novo genome sequencing, SNP detection, chromatin immunoprecipitation, and transcriptome analysis. Here we apply deep sequencing to genome-scale fitness profiling to evaluate yeast strain collections in parallel. This method, Barcode analysis by Sequencing, or "Bar-seq," outperforms the current benchmark barcode microarray assay in terms of both dynamic range and throughput. When applied to a complex chemogenomic assay, Bar-seq quantitatively identifies drug targets, with performance superior to the benchmark microarray assay. We also show that Bar-seq is well-suited for a multiplex format. We completely re-sequenced and re-annotated the yeast deletion collection using deep sequencing, found that approximately 20% of the barcodes and common priming sequences varied from expectation, and used this revised list of barcode sequences to improve data quality. Together, this new assay and analysis routine provide a deep-sequencing-based toolkit for identifying gene-environment interactions on a genome-wide scale. PMID:19622793
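
At its core, barcode sequencing turns reads into per-strain counts. The sketch below illustrates that counting step only and is not the Bar-seq analysis routine: it assumes exact barcode matches at a fixed position in each read, and the barcode table and strain names are hypothetical.

```python
from collections import Counter


def count_barcodes(reads, barcode_to_strain, barcode_length, barcode_start=0):
    """Count reads per strain by exact barcode lookup.

    Reads whose barcode is not in the table are ignored; a real pipeline
    would likely tolerate mismatches and report the unmatched fraction.
    """
    counts = Counter()
    for read in reads:
        barcode = read[barcode_start:barcode_start + barcode_length]
        strain = barcode_to_strain.get(barcode)
        if strain is not None:
            counts[strain] += 1
    return counts


if __name__ == "__main__":
    table = {"ACGT": "strain_1", "GGGG": "strain_2"}  # hypothetical barcodes
    reads = ["ACGTTTTT", "ACGTAAAA", "GGGGCCCC", "TTTTACGT"]
    print(count_barcodes(reads, table, barcode_length=4))
```

Exact lookup against an expected list is also where a revised barcode list matters: a barcode that differs from expectation would otherwise lose all of its reads.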

  19. SAMMate: a GUI tool for processing short read alignments in SAM/BAM format

    PubMed Central

    2011-01-01

    Background Next Generation Sequencing (NGS) technology generates tens of millions of short reads for each DNA/RNA sample. A key step in NGS data analysis is the short read alignment of the generated sequences to a reference genome. Although storing alignment information in the Sequence Alignment/Map (SAM) or Binary SAM (BAM) format is now standard, biomedical researchers still have difficulty accessing this information. Results We have developed a Graphical User Interface (GUI) software tool named SAMMate. SAMMate allows biomedical researchers to quickly process SAM/BAM files and is compatible with both single-end and paired-end sequencing technologies. SAMMate also automates some standard procedures in DNA-seq and RNA-seq data analysis. Using either standard or customized annotation files, SAMMate allows users to accurately calculate the short read coverage of genomic intervals. In particular, for RNA-seq data SAMMate can accurately calculate the gene expression abundance scores for customized genomic intervals using short reads originating from both exons and exon-exon junctions. Furthermore, SAMMate can quickly calculate a whole-genome signal map at base-wise resolution allowing researchers to solve an array of bioinformatics problems. Finally, SAMMate can export both a wiggle file for alignment visualization in the UCSC genome browser and an alignment statistics report. The biological impact of these features is demonstrated via several case studies that predict miRNA targets using short read alignment information files. Conclusions With just a few mouse clicks, SAMMate will provide biomedical researchers easy access to important alignment information stored in SAM/BAM files. Our software is constantly updated and will greatly facilitate the downstream analysis of NGS data. Both the source code and the GUI executable are freely available under the GNU General Public License at http://sammate.sourceforge.net. PMID:21232146
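
Base-wise read coverage of a genomic interval can be computed from alignment coordinates alone. The sketch below is a generic illustration and does not reflect SAMMate's code or file handling: it assumes alignments are already available as half-open `(start, end)` pairs on one chromosome, with no SAM/BAM parsing.

```python
def base_coverage(alignments, region_start, region_end):
    """Depth at every base of [region_start, region_end).

    alignments: iterable of half-open (start, end) reference coordinates.
    Uses a difference array, so the cost is linear in the number of
    alignments plus the region length.
    """
    size = region_end - region_start
    diff = [0] * (size + 1)
    for start, end in alignments:
        start = max(start, region_start)   # clip to the region
        end = min(end, region_end)
        if start < end:
            diff[start - region_start] += 1
            diff[end - region_start] -= 1
    depth, running = [], 0
    for delta in diff[:size]:
        running += delta
        depth.append(running)
    return depth


if __name__ == "__main__":
    print(base_coverage([(0, 5), (3, 8)], 0, 10))
```

Summing or averaging the returned list over an annotated interval gives the interval-level coverage values that downstream expression estimates are built on.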

  20. CRISPR Detection From Short Reads Using Partial Overlap Graphs.

    PubMed

    Ben-Bassat, Ilan; Chor, Benny

    2016-06-01

    Clustered regularly interspaced short palindromic repeats (CRISPR) are structured regions in bacterial and archaeal genomes, which are part of an adaptive immune system against phages. CRISPRs are important for many microbial studies and are playing an essential role in current gene editing techniques. As such, they attract substantial research interest. The exponential growth in the amount of bacterial sequence data in recent years enables the exploration of CRISPR loci in more and more species. Most of the automated tools that detect CRISPR loci rely on fully assembled genomes. However, many assemblers do not handle repetitive regions successfully. The first tool to work directly on raw sequence data is Crass, which requires reads that are long enough to contain two copies of the same repeat. We present a method to identify CRISPR repeats from raw sequence data of short reads. The algorithm is based on an observation differentiating CRISPR repeats from other types of repeats, and it involves a series of partial constructions of the overlap graph. This enables us to avoid many of the difficulties that assemblers face, as we merely aim to identify the repeats that belong to CRISPR loci. A preliminary implementation of the algorithm shows good results and detects CRISPR repeats in cases where other existing tools fail to do so. PMID:27058690

  1. Non-referenced genome assembly from epigenomic short-read data.

    PubMed

    Kaspi, Antony; Ziemann, Mark; Keating, Samuel T; Khurana, Ishant; Connor, Timothy; Spolding, Briana; Cooper, Adrian; Lazarus, Ross; Walder, Ken; Zimmet, Paul; El-Osta, Assam

    2014-10-01

    Current computational methods used to analyze changes in DNA methylation and chromatin modification rely on sequenced genomes. Here we describe a pipeline for the detection of these changes from short-read sequence data that does not require a reference genome. Open source software packages were used for sequence assembly, alignment, and measurement of differential enrichment. The method was evaluated by comparing results with reference-based results showing a strong correlation between chromatin modification and gene expression. We then used our de novo sequence assembly to build the DNA methylation profile for the non-referenced Psammomys obesus genome. The pipeline described uses open source software for fast annotation and visualization of unreferenced genomic regions from short-read data. PMID:25437048

  2. SNP discovery and allele frequency estimation by deep sequencing of reduced representation libraries

    Technology Transfer Automated Retrieval System (TEKTRAN)

    Genome projects routinely produce draft sequences for species from diverse evolutionary clades, but generally do not create single nucleotide polymorphism (SNP) resources. We present an approach for de novo SNP discovery based on short-read sequencing of reduced representation libraries (RRL) to ge...

  3. A Bayesian Assignment Method for Ambiguous Bisulfite Short Reads

    PubMed Central

    Tran, Hong; Wu, Xiaowei; Tithi, Saima; Sun, Ming-an; Xie, Hehuang; Zhang, Liqing

    2016-01-01

    DNA methylation is an epigenetic modification critical for normal development and diseases. The determination of genome-wide DNA methylation at single-nucleotide resolution is made possible by sequencing bisulfite treated DNA with next generation high-throughput sequencing. However, aligning bisulfite short reads to a reference genome remains challenging as only a limited proportion of them (around 50–70%) can be aligned uniquely; a significant proportion, known as multireads, are mapped to multiple locations and thus discarded from downstream analyses, causing financial waste and biased methylation inference. To address this issue, we develop a Bayesian model that assigns multireads to their most likely locations based on the posterior probability derived from information hidden in uniquely aligned reads. Analyses of both simulated data and real hairpin bisulfite sequencing data show that our method can effectively assign approximately 70% of the multireads to their best locations with up to 90% accuracy, leading to a significant increase in the overall mapping efficiency. Moreover, the assignment model shows robust performance with low coverage depth, making it particularly attractive considering the prohibitive cost of bisulfite sequencing. Additionally, results show that longer reads help improve the performance of the assignment model. The assignment model is also robust to varying degrees of methylation and varying sequencing error rates. Finally, incorporating prior knowledge on mutation rate and context specific methylation level into the assignment model increases inference accuracy. The assignment model is implemented in the BAM-ABS package and freely available at https://github.com/zhanglabvt/BAM_ABS. PMID:27011215
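
The principle of assigning a multiread by posterior probability can be sketched with Bayes' rule. This is not the BAM-ABS model. The prior (proportional to the number of uniquely aligned reads at each candidate location, plus one) and the per-location likelihood values are assumptions chosen only to show the calculation.

```python
def assign_multiread(candidates):
    """Pick the most probable location for one multiread.

    candidates: dict mapping location -> (unique_read_count, likelihood),
        where likelihood is P(read | location) from some alignment model
        (hypothetical inputs). The prior is taken proportional to
        unique_read_count + 1.
    Returns (best_location, posteriors).
    """
    weights = {
        location: (unique_count + 1) * likelihood
        for location, (unique_count, likelihood) in candidates.items()
    }
    total = sum(weights.values())
    if total == 0:
        raise ValueError("all candidate locations have zero likelihood")
    posteriors = {location: w / total for location, w in weights.items()}
    best = max(posteriors, key=posteriors.get)
    return best, posteriors


if __name__ == "__main__":
    candidates = {"chr1:100": (9, 0.5), "chr2:200": (4, 0.5)}
    print(assign_multiread(candidates))
```

With equal likelihoods the assignment is driven entirely by the prior from uniquely aligned reads; a threshold on the winning posterior could be used to leave genuinely ambiguous reads unassigned.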

  4. Deep Whole-Genome Sequencing to Detect Mixed Infection of Mycobacterium tuberculosis

    PubMed Central

    Gan, Mingyu; Liu, Qingyun; Yang, Chongguang; Gao, Qian; Luo, Tao

    2016-01-01

    Mixed infection by multiple Mycobacterium tuberculosis (MTB) strains is associated with poor treatment outcome of tuberculosis (TB). Traditional genotyping methods have been used to detect mixed infections of MTB; however, their sensitivity and resolution are limited. Deep whole-genome sequencing (WGS) has proved highly sensitive and discriminative for studying population heterogeneity of MTB. Here, we developed a phylogenetic-based method to detect MTB mixed infections using WGS data. We collected published WGS data of 782 global MTB strains from public databases. We called homogeneous and heterogeneous single nucleotide variations (SNVs) of individual strains by mapping short reads to the ancestral MTB reference genome. We constructed a phylogenomic database based on 68,639 homogeneous SNVs of 652 MTB strains. Mixed infections were determined if multiple evolutionary paths were identified by mapping the SNVs of individual samples to the phylogenomic database. By simulation, our method could specifically detect mixed infections when the sequencing depth of minor strains was as low as 1× coverage, and when the genomic distance of two mixed strains was as small as 16 SNVs. By applying our methods to all 782 samples, we detected 47 mixed infections and 45 of them were caused by locally endemic strains. The results indicate that our method is highly sensitive and discriminative for identifying mixed infections from deep WGS data of MTB isolates. PMID:27391214

  5. Deep Whole-Genome Sequencing to Detect Mixed Infection of Mycobacterium tuberculosis.

    PubMed

    Gan, Mingyu; Liu, Qingyun; Yang, Chongguang; Gao, Qian; Luo, Tao

    2016-01-01

    Mixed infection by multiple Mycobacterium tuberculosis (MTB) strains is associated with poor treatment outcome of tuberculosis (TB). Traditional genotyping methods have been used to detect mixed infections of MTB; however, their sensitivity and resolution are limited. Deep whole-genome sequencing (WGS) has proved highly sensitive and discriminative for studying population heterogeneity of MTB. Here, we developed a phylogenetic-based method to detect MTB mixed infections using WGS data. We collected published WGS data of 782 global MTB strains from public databases. We called homogeneous and heterogeneous single nucleotide variations (SNVs) of individual strains by mapping short reads to the ancestral MTB reference genome. We constructed a phylogenomic database based on 68,639 homogeneous SNVs of 652 MTB strains. Mixed infections were determined if multiple evolutionary paths were identified by mapping the SNVs of individual samples to the phylogenomic database. By simulation, our method could specifically detect mixed infections when the sequencing depth of minor strains was as low as 1× coverage, and when the genomic distance of two mixed strains was as small as 16 SNVs. By applying our methods to all 782 samples, we detected 47 mixed infections and 45 of them were caused by locally endemic strains. The results indicate that our method is highly sensitive and discriminative for identifying mixed infections from deep WGS data of MTB isolates. PMID:27391214

  6. Short read DNA fragment anchoring algorithm

    PubMed Central

    Wang, Wendi; Zhang, Peiheng; Liu, Xinchun

    2009-01-01

    Background The emerging next-generation sequencing method based on PCR technology boosts genome sequencing speed considerably, and the expense has also decreased. It has been utilized to address a broad range of bioinformatics problems. Limited by the reliable output sequence length of next-generation sequencing technologies, we are generally confined to studying gene fragments of 30~50 bp, which is shorter than the traditional gene fragment length. Anchoring gene fragments in a long reference sequence is an essential prerequisite step for further assembly and analysis work. Due to the sheer number of fragments produced by next-generation sequencing technologies and the huge size of reference sequences, anchoring would rapidly become a computational bottleneck. Results and discussion We compared algorithm efficiency on BLAT, SOAP and EMBF. The efficiency is defined as the count of total output results divided by the time consumed to retrieve them. The data show that our algorithm EMBF has a 3~4 times efficiency advantage over SOAP, and at least 150 times over BLAT. Moreover, when the reference sequence size is increased, the efficiency of SOAP degrades as far as 30%, while EMBF shows a favorable increasing tendency. Conclusion In conclusion, we deem that EMBF is more suitable for the short fragment anchoring problem where result completeness and accuracy are predominant and the reference sequences are relatively large. PMID:19208116
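
    The efficiency measure defined in the abstract above (output results divided by time consumed) is simple enough to write down directly. The sketch below uses made-up numbers purely to show the arithmetic. They are not measurements of BLAT, SOAP or EMBF, and the function names are hypothetical.

```python
def efficiency(num_results, seconds):
    """Results retrieved per second of run time."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    return num_results / seconds


def relative_advantage(tool_eff, baseline_eff):
    """How many times more efficient one tool is than a baseline."""
    return tool_eff / baseline_eff


# Made-up numbers, purely to show the arithmetic.
fast = efficiency(num_results=1_200_000, seconds=100)   # 12000.0
slow = efficiency(num_results=1_200_000, seconds=400)   # 3000.0
print(relative_advantage(fast, slow))                   # 4.0
```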

  7. Simultaneous alignment of short reads against multiple genomes

    PubMed Central

    Schneeberger, Korbinian; Hagmann, Jörg; Ossowski, Stephan; Warthmann, Norman; Gesing, Sandra; Kohlbacher, Oliver; Weigel, Detlef

    2009-01-01

    Genome resequencing with short reads generally relies on alignments against a single reference. GenomeMapper supports simultaneous mapping of short reads against multiple genomes by integrating related genomes (e.g., individuals of the same species) into a single graph structure. It constitutes the first approach for handling multiple references and introduces representations for alignments against complex structures. Demonstrated benefits include access to polymorphisms that cannot be identified by alignments against the reference alone. Download GenomeMapper at . PMID:19761611

  8. Deep Ion Torrent sequencing identifies soil fungal community shifts after frequent prescribed fires in a southeastern US forest ecosystem.

    PubMed

    Brown, Shawn P; Callaham, Mac A; Oliver, Alena K; Jumpponen, Ari

    2013-12-01

    Prescribed burning is a common management tool to control fuel loads, ground vegetation, and facilitate desirable game species. We evaluated soil fungal community responses to long-term prescribed fire treatments in a loblolly pine forest on the Piedmont of Georgia and utilized deep Internal Transcribed Spacer Region 1 (ITS1) amplicon sequencing afforded by the recent Ion Torrent Personal Genome Machine (PGM). These deep sequence data (19,000 + reads per sample after subsampling) indicate that frequent fires (3-year fire interval) shift soil fungus communities, whereas infrequent fires (6-year fire interval) permit system resetting to a state similar to that without prescribed fire. Furthermore, in nonmetric multidimensional scaling analyses, primarily ectomycorrhizal taxa were correlated with axes associated with long fire intervals, whereas soil saprobes tended to be correlated with the frequent fire recurrence. We conclude that (1) multiplexed Ion Torrent PGM analyses allow deep cost effective sequencing of fungal communities but may suffer from short read lengths and inconsistent sequence quality adjacent to the sequencing adaptor; (2) frequent prescribed fires elicit a shift in soil fungal communities; and (3) such shifts do not occur when fire intervals are longer. Our results emphasize the general responsiveness of these forests to management, and the importance of fire return intervals in meeting management objectives. PMID:23869991

  9. DistMap: a toolkit for distributed short read mapping on a Hadoop cluster.

    PubMed

    Pandey, Ram Vinay; Schlötterer, Christian

    2013-01-01

    With the rapid and steady increase of next generation sequencing data output, the mapping of short reads has become a major data analysis bottleneck. On a single computer, it can take several days to map the vast quantity of reads produced from a single Illumina HiSeq lane. In an attempt to ameliorate this bottleneck we present a new tool, DistMap - a modular, scalable and integrated workflow to map reads in the Hadoop distributed computing framework. DistMap is easy to use, currently supports nine different short read mapping tools and can be run on all Unix-based operating systems. It accepts reads in FASTQ format as input and provides mapped reads in a SAM/BAM format. DistMap supports both paired-end and single-end reads thereby allowing the mapping of read data produced by different sequencing platforms. DistMap is available from http://code.google.com/p/distmap/ PMID:24009693

  10. DistMap: A Toolkit for Distributed Short Read Mapping on a Hadoop Cluster

    PubMed Central

    Pandey, Ram Vinay; Schlötterer, Christian

    2013-01-01

    With the rapid and steady increase of next generation sequencing data output, the mapping of short reads has become a major data analysis bottleneck. On a single computer, it can take several days to map the vast quantity of reads produced from a single Illumina HiSeq lane. In an attempt to ameliorate this bottleneck we present a new tool, DistMap - a modular, scalable and integrated workflow to map reads in the Hadoop distributed computing framework. DistMap is easy to use, currently supports nine different short read mapping tools and can be run on all Unix-based operating systems. It accepts reads in FASTQ format as input and provides mapped reads in a SAM/BAM format. DistMap supports both paired-end and single-end reads thereby allowing the mapping of read data produced by different sequencing platforms. DistMap is available from http://code.google.com/p/distmap/ PMID:24009693
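
    Distributed mapping of the kind described above depends on cutting the FASTQ input into pieces that can be mapped independently. The following is a minimal sketch of that idea only. It assumes the common four-line FASTQ layout, and the function names are hypothetical. It is not DistMap's code and does not touch Hadoop.

```python
from itertools import islice


def read_fastq(lines):
    """Yield (header, sequence, separator, quality) tuples.

    Assumes each record occupies exactly four lines.
    """
    it = iter(lines)
    while True:
        record = list(islice(it, 4))
        if not record:
            return
        if len(record) != 4 or not record[0].startswith("@"):
            raise ValueError("malformed FASTQ record")
        yield tuple(line.rstrip("\n") for line in record)


def chunk_records(records, chunk_size):
    """Group records into lists of at most chunk_size records."""
    it = iter(records)
    while True:
        chunk = list(islice(it, chunk_size))
        if not chunk:
            return
        yield chunk


fastq_lines = [
    "@r1\n", "ACGT\n", "+\n", "IIII\n",
    "@r2\n", "TTGA\n", "+\n", "IIII\n",
    "@r3\n", "GGCC\n", "+\n", "IIII\n",
]
chunks = list(chunk_records(read_fastq(fastq_lines), chunk_size=2))
print([len(c) for c in chunks])   # [2, 1]
```

    Each chunk could then be handed to a separate mapper process and the resulting SAM/BAM outputs merged afterwards. For paired-end data, the two mate files would need to be chunked in lockstep so that pairs stay together.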

  11. Whole Chloroplast Genome Sequencing in Fragaria Using Deep Sequencing: A Comparison of Three Methods

    Technology Transfer Automated Retrieval System (TEKTRAN)

    Chloroplast sequences previously investigated in Fragaria revealed low amounts of variation. Deep sequencing technologies enable economical sequencing of complete chloroplast genomes. These sequences can potentially provide robust phylogenetic resolution, even at low taxonomic levels within plant gr...

  12. Short-read DNA sequencing yields microsatellite markers for Rheum

    Technology Transfer Automated Retrieval System (TEKTRAN)

    Identifying culinary rhubarb (Rheum ×hybridum Murray) cultivars using morphological characteristics is problematic due to variability within individual genotypes, variation caused by environmental factors, plant and leaf age, similarity between genetically diverse genotypes, multiple cultivar names ...

  13. Deep Sequencing: Becoming a Critical Tool in Clinical Virology

    PubMed Central

    QUIÑONES-MATEU, Miguel E.; AVILA, Santiago; REYES-TERAN, Gustavo; MARTINEZ, Miguel A.

    2014-01-01

    Population (Sanger) sequencing has been the standard method in basic and clinical DNA sequencing for almost 40 years; however, next-generation (deep) sequencing methodologies are now revolutionizing the field of genomics, and clinical virology is no exception. Deep sequencing is highly efficient, producing an enormous amount of information at low cost in a relatively short period of time. High-throughput sequencing techniques have enabled significant contributions to multiple areas in virology, including virus discovery and metagenomics (viromes), molecular epidemiology, pathogenesis, and studies of how viruses escape the host immune system and antiviral pressures. In addition, new and more affordable deep sequencing-based assays are now being implemented in clinical laboratories. Here we review the use of the current deep sequencing platforms in virology, focusing on three of the most studied viruses: human immunodeficiency virus (HIV), hepatitis C virus (HCV), and influenza virus. PMID:24998424

  14. Deep Sequencing to Identify the Causes of Viral Encephalitis

    PubMed Central

    Chan, Benjamin K.; Wilson, Theodore; Fischer, Kael F.; Kriesel, John D.

    2014-01-01

    Deep sequencing allows for a rapid, accurate characterization of microbial DNA and RNA sequences in many types of samples. Deep sequencing (also called next generation sequencing or NGS) is being developed to assist with the diagnosis of a wide variety of infectious diseases. In this study, seven frozen brain samples from deceased subjects with recent encephalitis were investigated. RNA from each sample was extracted, randomly reverse transcribed and sequenced. The sequence analysis was performed in a blinded fashion and confirmed with pathogen-specific PCR. This analysis successfully identified measles virus sequences in two brain samples and herpes simplex virus type-1 sequences in three brain samples. No pathogen was identified in the other two brain specimens. These results were concordant with pathogen-specific PCR and partially concordant with prior neuropathological examinations, demonstrating that deep sequencing can accurately identify viral infections in frozen brain tissue. PMID:24699691

  15. Full Genome Virus Detection in Fecal Samples Using Sensitive Nucleic Acid Preparation, Deep Sequencing, and a Novel Iterative Sequence Classification Algorithm

    PubMed Central

    Cotten, Matthew; Oude Munnink, Bas; Canuti, Marta; Deijs, Martin; Watson, Simon J.; Kellam, Paul; van der Hoek, Lia

    2014-01-01

    We have developed a full genome virus detection process that combines sensitive nucleic acid preparation optimised for virus identification in fecal material with Illumina MiSeq sequencing and a novel post-sequencing virus identification algorithm. Enriched viral nucleic acid was converted to double-stranded DNA and subjected to Illumina MiSeq sequencing. The resulting short reads were processed with a novel iterative Python algorithm SLIM for the identification of sequences with homology to known viruses. De novo assembly was then used to generate full viral genomes. The sensitivity of this process was demonstrated with a set of fecal samples from HIV-1 infected patients. A quantitative assessment of the mammalian, plant, and bacterial virus content of this compartment was generated and the deep sequencing data were sufficient to assemble 12 complete viral genomes from 6 virus families. The method detected high levels of enteropathic viruses that are normally controlled in healthy adults, but may be involved in the pathogenesis of HIV-1 infection and will provide a powerful tool for virus detection and for analyzing changes in the fecal virome associated with HIV-1 progression and pathogenesis. PMID:24695106

  16. Fitness Inference from Short-Read Data: Within-Host Evolution of a Reassortant H5N1 Influenza Virus

    PubMed Central

    Illingworth, Christopher J.R.

    2015-01-01

    We present a method to infer the role of selection acting during the within-host evolution of the influenza virus from short-read genome sequence data. Linkage disequilibrium between loci is accounted for by treating short-read sequences as noisy multilocus emissions from an underlying model of haplotype evolution. A hierarchical model-selection procedure is used to infer the underlying fitness landscape of the virus insofar as that landscape is explored by the viral population. In a first application of our method, we analyze data from an evolutionary experiment describing the growth of a reassortant H5N1 virus in ferrets. Across two sets of replica experiments we infer multiple alleles to be under selection, including variants associated with receptor binding specificity, glycosylation, and with the increased transmissibility of the virus. We identify epistasis as an important component of the within-host fitness landscape, and show that adaptation can proceed through multiple genetic pathways. PMID:26243288

  17. Comparative Analysis of Functional Metagenomic Annotation and the Mappability of Short Reads

    PubMed Central

    Carr, Rogan; Borenstein, Elhanan

    2014-01-01

    To assess the functional capacities of microbial communities, including those inhabiting the human body, shotgun metagenomic reads are often aligned to a database of known genes. Such homology-based annotation practices critically rely on the assumption that short reads can map to orthologous genes of similar function. This assumption, however, and the various factors that impact short read annotation, have not been systematically evaluated. To address this challenge, we generated an extremely large database of simulated reads (totaling 15.9 Gb), spanning over 500,000 microbial genes and 170 curated genomes and including, for many genomes, every possible read of a given length. We annotated each read using common metagenomic protocols, fully characterizing the effect of read length, sequencing error, phylogeny, database coverage, and mapping parameters. We additionally rigorously quantified gene-, genome-, and protocol-specific annotation biases. Overall, our findings provide a first comprehensive evaluation of the capabilities and limitations of functional metagenomic annotation, providing crucial goal-specific best-practice guidelines to inform future metagenomic research. PMID:25148512

  18. Deep sequencing increases hepatitis C virus phylogenetic cluster detection compared to Sanger sequencing.

    PubMed

    Montoya, Vincent; Olmstead, Andrea; Tang, Patrick; Cook, Darrel; Janjua, Naveed; Grebely, Jason; Jacka, Brendan; Poon, Art F Y; Krajden, Mel

    2016-09-01

    Effective surveillance and treatment strategies are required to control the hepatitis C virus (HCV) epidemic. Phylogenetic analyses are powerful tools for reconstructing the evolutionary history of viral outbreaks and identifying transmission clusters. These studies often rely on Sanger sequencing which typically generates a single consensus sequence for each infected individual. For rapidly mutating viruses such as HCV, consensus sequencing underestimates the complexity of the viral quasispecies population and could therefore generate different phylogenetic tree topologies. Although deep sequencing provides a more detailed quasispecies characterization, in-depth phylogenetic analyses are challenging due to dataset complexity and computational limitations. Here, we apply deep sequencing to a characterized population to assess its ability to identify phylogenetic clusters compared with consensus Sanger sequencing. For deep sequencing, a sample specific threshold determined by the 50th percentile of the patristic distance distribution for all variants within each individual was used to identify clusters. Among seven patristic distance thresholds tested for the Sanger sequence phylogeny ranging from 0.005-0.06, a threshold of 0.03 was found to provide the maximum balance between positive agreement (samples in a cluster) and negative agreement (samples not in a cluster) relative to the deep sequencing dataset. From 77 HCV seroconverters, 10 individuals were identified in phylogenetic clusters using both methods. Deep sequencing analysis identified an additional 4 individuals and excluded 8 other individuals relative to Sanger sequencing. The application of this deep sequencing approach could be a more effective tool to understand onward HCV transmission dynamics compared with Sanger sequencing, since the incorporation of minority sequence variants improves the discrimination of phylogenetically linked clusters. PMID:27282472
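
    The sample-specific threshold mentioned above (the 50th percentile of the patristic distance distribution for all variants within an individual) is simply the median of those distances. The sketch below assumes the within-individual pairwise distances have already been extracted from a tree. The function name and the example values are hypothetical, and the rule the authors then apply to call clusters is not reproduced.

```python
from statistics import median


def sample_threshold(within_host_distances):
    """Median (50th percentile) of one individual's pairwise
    patristic distances between variants."""
    if not within_host_distances:
        raise ValueError("need at least one distance")
    return median(within_host_distances)


# Made-up distances for one individual, for illustration only.
print(sample_threshold([0.01, 0.02, 0.05]))   # 0.02
```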

  19. Preparing DNA Libraries for Multiplexed Paired-End Deep Sequencing for Illumina GA Sequencers

    PubMed Central

    Son, Mike S.; Taylor, Ronald K.

    2011-01-01

    Whole genome sequencing, also known as deep sequencing, is becoming a more affordable and efficient way to identify SNP mutations, deletions and insertions in DNA sequences across several different strains. Two major obstacles preventing the widespread use of deep sequencers are the costs involved in services used to prepare DNA libraries for sequencing and the overall accuracy of the sequencing data. This Unit describes the preparation of DNA libraries for multiplexed paired-end sequencing using the Illumina GA series sequencer. Self-preparation of DNA libraries can help reduce overall expenses, especially if optimization is required for the different samples, and use of the Illumina GA Sequencer can improve the quality of the data. PMID:21400673

  20. Deep sequencing and human antibody repertoire analysis.

    PubMed

    Boyd, Scott D; Crowe, James E

    2016-06-01

    In the past decade, high-throughput DNA sequencing (HTS) methods and improved approaches for isolating antigen-specific B cells and their antibody genes have been applied in many areas of human immunology. This work has greatly increased our understanding of human antibody repertoires and the specific clones responsible for protective immunity or immune-mediated pathogenesis. Although the principles underlying selection of individual B cell clones in the intact immune system are still under investigation, the combination of more powerful genetic tracking of antibody lineage development and functional testing of the encoded proteins promises to transform therapeutic antibody discovery and optimization. Here, we highlight recent advances in this fast-moving field. PMID:27065089

  1. GAViT: Genome Assembly Visualization Tool for Short Read Data

    SciTech Connect

    Syed, Aijazuddin; Shapiro, Harris; Tu, Hank; Pangilinan, Jasmyn; Trong, Stephan

    2008-03-14

    It is a challenging job for genome analysts to accurately debug, troubleshoot, and validate genome assembly results. Genome analysts rely on visualization tools to help validate and troubleshoot assembly results, including such problems as mis-assemblies, low-quality regions, and repeats. Short read data adds further complexity and makes it extremely challenging for the visualization tools to scale and to view all needed assembly information. As a result, there is a need for a visualization tool that can scale to display assembly data from the new sequencing technologies. We present Genome Assembly Visualization Tool (GAViT), a highly scalable and interactive assembly visualization tool developed at the DOE Joint Genome Institute (JGI).

  2. Complete Genome Sequence of the WHO International Standard for HIV-2 RNA Determined by Deep Sequencing

    PubMed Central

    Ham, Claire; Morris, Clare

    2016-01-01

    The World Health Organization (WHO) International Standard for HIV-2 RNA nucleic acid assays was characterized by complete genome deep sequencing. The entire coding sequence and flanking long terminal repeats (LTRs), including minority species, were assigned subtype A. This information will aid design, development, and evaluation of HIV-2 RNA amplification assays. PMID:26847885

  3. deepTools: a flexible platform for exploring deep-sequencing data

    PubMed Central

    Ramírez, Fidel; Dündar, Friederike; Diehl, Sarah; Grüning, Björn A.; Manke, Thomas

    2014-01-01

    We present a Galaxy based web server for processing and visualizing deeply sequenced data. The web server's core functionality consists of a suite of newly developed tools, called deepTools, that enable users with little bioinformatic background to explore the results of their sequencing experiments in a standardized setting. Users can upload pre-processed files with continuous data in standard formats and generate heatmaps and summary plots in a straight-forward, yet highly customizable manner. In addition, we offer several tools for the analysis of files containing aligned reads and enable efficient and reproducible generation of normalized coverage files. As a modular and open-source platform, deepTools can easily be expanded and customized to future demands and developments. The deepTools webserver is freely available at http://deeptools.ie-freiburg.mpg.de and is accompanied by extensive documentation and tutorials aimed at conveying the principles of deep-sequencing data analysis. The web server can be used without registration. deepTools can be installed locally either stand-alone or as part of Galaxy. PMID:24799436

  4. deepTools: a flexible platform for exploring deep-sequencing data.

    PubMed

    Ramírez, Fidel; Dündar, Friederike; Diehl, Sarah; Grüning, Björn A; Manke, Thomas

    2014-07-01

    We present a Galaxy based web server for processing and visualizing deeply sequenced data. The web server's core functionality consists of a suite of newly developed tools, called deepTools, that enable users with little bioinformatic background to explore the results of their sequencing experiments in a standardized setting. Users can upload pre-processed files with continuous data in standard formats and generate heatmaps and summary plots in a straight-forward, yet highly customizable manner. In addition, we offer several tools for the analysis of files containing aligned reads and enable efficient and reproducible generation of normalized coverage files. As a modular and open-source platform, deepTools can easily be expanded and customized to future demands and developments. The deepTools webserver is freely available at http://deeptools.ie-freiburg.mpg.de and is accompanied by extensive documentation and tutorials aimed at conveying the principles of deep-sequencing data analysis. The web server can be used without registration. deepTools can be installed locally either stand-alone or as part of Galaxy. PMID:24799436

  5. DSAP: deep-sequencing small RNA analysis pipeline.

    PubMed

    Huang, Po-Jung; Liu, Yi-Chung; Lee, Chi-Ching; Lin, Wei-Chen; Gan, Richie Ruei-Chi; Lyu, Ping-Chiang; Tang, Petrus

    2010-07-01

    DSAP is an automated multiple-task web service designed to provide a total solution to analyzing deep-sequencing small RNA datasets generated by next-generation sequencing technology. DSAP uses a tab-delimited file as an input format, which holds the unique sequence reads (tags) and their corresponding number of copies generated by the Solexa sequencing platform. The input data will go through four analysis steps in DSAP: (i) cleanup: removal of adaptors and poly-A/T/C/G/N nucleotides; (ii) clustering: grouping of cleaned sequence tags into unique sequence clusters; (iii) non-coding RNA (ncRNA) matching: sequence homology mapping against a transcribed sequence library from the ncRNA database Rfam (http://rfam.sanger.ac.uk/); and (iv) known miRNA matching: detection of known miRNAs in miRBase (http://www.mirbase.org/) based on sequence homology. The expression levels corresponding to matched ncRNAs and miRNAs are summarized in multi-color clickable bar charts linked to external databases. DSAP is also capable of displaying miRNA expression levels from different jobs using a log(2)-scaled color matrix. Furthermore, a cross-species comparative function is also provided to show the distribution of identified miRNAs in different species as deposited in miRBase. DSAP is available at http://dsap.cgu.edu.tw. PMID:20478825
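
    Steps (i) and (ii) of the workflow above, cleanup and clustering of a tab-delimited tag file, can be sketched as follows. This is an illustration under stated assumptions rather than DSAP's implementation: the adaptor sequence is made up, adaptor matching is exact, and a tag is discarded only when it consists of a single repeated nucleotide.

```python
from collections import Counter


def trim_adaptor(seq, adaptor):
    """Cut the read at the first exact occurrence of the adaptor."""
    idx = seq.find(adaptor)
    return seq if idx == -1 else seq[:idx]


def is_homopolymer(seq):
    """True for empty strings and runs of one nucleotide (e.g. AAAA, NNNN)."""
    return len(set(seq)) <= 1


def clean_and_cluster(lines, adaptor):
    """Parse 'sequence<TAB>count' lines, clean each tag, and sum the
    counts of tags that become identical after cleaning."""
    clusters = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        seq, count = line.split("\t")
        seq = trim_adaptor(seq.upper(), adaptor)
        if is_homopolymer(seq):
            continue
        clusters[seq] += int(count)
    return clusters


tags = [
    "ACGTACGTTCGTATGC\t5",   # tag followed by the (made-up) adaptor
    "ACGTACGT\t2",           # same tag, adaptor not read
    "AAAAAAAA\t9",           # poly-A, discarded
    "NNNNTCGTATGC\t1",       # poly-N after trimming, discarded
]
print(clean_and_cluster(tags, adaptor="TCGTATGC"))
# Counter({'ACGTACGT': 7})
```

    The resulting unique sequences with summed counts are the kind of input that the later homology-matching steps (iii) and (iv) would work from.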

  6. MiRGator v3.0: a microRNA portal for deep sequencing, expression profiling and mRNA targeting.

    PubMed

    Cho, Sooyoung; Jang, Insu; Jun, Yukyung; Yoon, Suhyeon; Ko, Minjeong; Kwon, Yeajee; Choi, Ikjung; Chang, Hyeshik; Ryu, Daeun; Lee, Byungwook; Kim, V Narry; Kim, Wankyu; Lee, Sanghyuk

    2013-01-01

    Biogenesis and molecular function are two key subjects in the field of microRNA (miRNA) research. Deep sequencing has become the principal technique in cataloging of miRNA repertoire and generating expression profiles in an unbiased manner. Here, we describe the miRGator v3.0 update (http://mirgator.kobic.re.kr) that compiled the deep sequencing miRNA data available in public and implemented several novel tools to facilitate exploration of massive data. The miR-seq browser supports users to examine short read alignment with the secondary structure and read count information available in concurrent windows. Features such as sequence editing, sorting, ordering, import and export of user data would be of great utility for studying iso-miRs, miRNA editing and modifications. miRNA-target relation is essential for understanding miRNA function. Coexpression analysis of miRNA and target mRNAs, based on miRNA-seq and RNA-seq data from the same sample, is visualized in the heat-map and network views where users can investigate the inverse correlation of gene expression and target relations, compiled from various databases of predicted and validated targets. By keeping datasets and analytic tools up-to-date, miRGator should continue to serve as an integrated resource for biogenesis and functional investigation of miRNAs. PMID:23193297

  7. Unbiased Deep Sequencing of RNA Viruses from Clinical Samples.

    PubMed

    Matranga, Christian B; Gladden-Young, Adrianne; Qu, James; Winnicki, Sarah; Nosamiefan, Dolo; Levin, Joshua Z; Sabeti, Pardis C

    2016-01-01

    Here we outline a next-generation RNA sequencing protocol that enables de novo assemblies and intra-host variant calls of viral genomes collected from clinical and biological sources. The method is unbiased and universal; it uses random primers for cDNA synthesis and requires no prior knowledge of the viral sequence content. Before library construction, selective RNase H-based digestion is used to deplete unwanted RNA - including poly(rA) carrier and ribosomal RNA - from the viral RNA sample. Selective depletion improves both the data quality and the number of unique reads in viral RNA sequencing libraries. Moreover, a transposase-based 'tagmentation' step is used in the protocol as it reduces overall library construction time. The protocol has enabled rapid deep sequencing of over 600 Lassa and Ebola virus samples, including collections from both blood and tissue isolates, and is broadly applicable to other microbial genomics studies. PMID:27403729

  8. Transcriptome Sequences Resolve Deep Relationships of the Grape Family

    PubMed Central

    Wen, Jun; Xiong, Zhiqiang; Nie, Ze-Long; Mao, Likai; Zhu, Yabing; Kan, Xian-Zhao; Ickert-Bond, Stefanie M.; Gerrath, Jean; Zimmer, Elizabeth A.; Fang, Xiao-Dong

    2013-01-01

    Previous phylogenetic studies of the grape family (Vitaceae) yielded poorly resolved deep relationships, thus impeding our understanding of the evolution of the family. Next-generation sequencing now offers access to protein coding sequences very easily, quickly and cost-effectively. To improve upon earlier work, we extracted 417 orthologous single-copy nuclear genes from the transcriptomes of 15 species of the Vitaceae, covering its phylogenetic diversity. The resulting transcriptome phylogeny provides robust support for the deep relationships, showing the phylogenetic utility of transcriptome data for plants over a time scale at least since the mid-Cretaceous. The pros and cons of transcriptome data for phylogenetic inference in plants are also evaluated. PMID:24069307

  9. Inferring short tandem repeat variation from paired-end short reads

    PubMed Central

    Cao, Minh Duc; Tasker, Edward; Willadsen, Kai; Imelfort, Michael; Vishwanathan, Sailaja; Sureshkumar, Sridevi; Balasubramanian, Sureshkumar; Bodén, Mikael

    2014-01-01

    The advances of high-throughput sequencing offer an unprecedented opportunity to study genetic variation. This is challenged by the difficulty of resolving variant calls in repetitive DNA regions. We present a Bayesian method to estimate repeat-length variation from paired-end sequence read data. The method makes variant calls based on deviations in sequence fragment sizes, allowing the analysis of repeats at lengths of relevance to a range of phenotypes. We demonstrate the method’s ability to detect and quantify changes in repeat lengths from short read genomic sequence data across genotypes. We use the method to estimate repeat variation among 12 strains of Arabidopsis thaliana and demonstrate experimentally that our method compares favourably against existing methods. Using this method, we have identified all repeats across the genome, which are likely to be polymorphic. In addition, our predicted polymorphic repeats also included the only known repeat expansion in A. thaliana, suggesting an ability to discover potential unstable repeats. PMID:24353318
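
    The idea of calling repeat-length variants from deviations in fragment size can be illustrated with a simple point estimate. When a sample carries an expanded repeat, read pairs that span it map closer together on the shorter reference than the library's typical insert size would predict. The sketch below takes the difference of means. It is a deliberately simplified stand-in for the Bayesian method in the abstract, and its function and parameter names are hypothetical.

```python
from statistics import mean


def estimate_repeat_length_change(spanning_insert_sizes, library_mean_insert):
    """Estimated change in repeat length, in bp, relative to the reference.

    Positive values suggest an expansion in the sample and negative
    values a contraction. This is a naive difference of means, not the
    Bayesian estimator described in the paper.
    """
    if not spanning_insert_sizes:
        raise ValueError("no read pairs span the repeat")
    return library_mean_insert - mean(spanning_insert_sizes)


# Made-up values: library mean 300 bp, pairs spanning the repeat map ~288 bp apart.
print(estimate_repeat_length_change([288, 290, 286], library_mean_insert=300))  # 12
```

    In practice the spread of the insert-size distribution limits how small a change can be detected, which is one reason a probabilistic treatment like the one the authors describe is needed.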

  10. Deep Sequencing of the Transcriptomes of Soybean Aphid and Associated Endosymbionts

    PubMed Central

    Liu, Sijun; Chougule, Nanasaheb P.; Vijayendran, Diveena; Bonning, Bryony C.

    2012-01-01

    Background The soybean aphid has significantly impacted soybean production in the U.S. Transcriptomic analyses were conducted for further insight into leads for potential novel management strategies. Methodology/Principal Findings Transcriptomic data were generated from whole aphids and from 2,000 aphid guts using an Illumina GAII sequencer. The sequence data were assembled de novo using the Velvet assembler. In addition to providing a general overview, we demonstrate (i) the use of the Multiple-k/Multiple-C method for de novo assembly of short read sequences, followed by BLAST annotation of contigs for increased transcript identification: From 400,000 contigs analyzed, 16,257 non-redundant BLAST hits were identified; (ii) analysis of species distributions of top non-redundant hits: 80% of BLAST hits (minimum e-value of 1.0-E3) were to the pea aphid or other aphid species, representing about half of the pea aphid genes; (iii) comparison of relative depth of sequence coverage to relative transcript abundance for genes with high (membrane alanyl aminopeptidase N) or low transcript abundance; (iv) analysis of the Buchnera transcriptome: Transcripts from 57.6% of the genes from Buchnera aphidicola were identified; (v) identification of Arsenophonus and Wolbachia as potential secondary endosymbionts; (vi) alignment of full length sequences from RNA-seq data for the putative salivary gland protein C002, the silencing of which has potential for aphid management, and the putative Bacillus thuringiensis Cry toxin receptors, aminopeptidase N and alkaline phosphatase. 
    Conclusions/Significance This study provides the most comprehensive data set to date for soybean aphid gene expression. This work also illustrates the utility of short-read transcriptome sequencing and the Multiple-k/Multiple-C method followed by BLAST annotation for rapid identification of target genes for organisms for which reference genome sequences are not available, and extends the utility to include the

  11. deepBase: a database for deeply annotating and mining deep sequencing data

    PubMed Central

    Yang, Jian-Hua; Shao, Peng; Zhou, Hui; Chen, Yue-Qin; Qu, Liang-Hu

    2010-01-01

    Advances in high-throughput next-generation sequencing technology have reshaped the transcriptomic research landscape. However, exploration of these massive data remains a daunting challenge. In this study, we describe a novel database, deepBase, which we have developed to facilitate the comprehensive annotation and discovery of small RNAs from transcriptomic data. The current release of deepBase contains deep sequencing data from 185 small RNA libraries from diverse tissues and cell lines of seven organisms: human, mouse, chicken, Ciona intestinalis, Drosophila melanogaster, Caenorhabditis elegans and Arabidopsis thaliana. By analyzing ∼14.6 million unique reads that perfectly mapped to more than 284 million genomic loci, we annotated and identified ∼380 000 unique ncRNA-associated small RNAs (nasRNAs), ∼1.5 million unique promoter-associated small RNAs (pasRNAs), ∼4.0 million unique exon-associated small RNAs (easRNAs) and ∼6 million unique repeat-associated small RNAs (rasRNAs). Furthermore, 2038 miRNA and 1889 snoRNA candidates were predicted by miRDeep and snoSeeker. All of the mapped reads can be grouped into about 1.2 million RNA clusters. For the purpose of comparative analysis, deepBase provides an integrative, interactive and versatile display. A convenient search option, related publications and other useful information are also provided for further investigation. deepBase is available at: http://deepbase.sysu.edu.cn/. PMID:19966272

  12. SHARAKU: an algorithm for aligning and clustering read mapping profiles of deep sequencing in non-coding RNA processing

    PubMed Central

    Tsuchiya, Mariko; Amano, Kojiro; Abe, Masaya; Seki, Misato; Hase, Sumitaka; Sato, Kengo; Sakakibara, Yasubumi

    2016-01-01

    Motivation: Deep sequencing of the transcripts of regulatory non-coding RNA generates footprints of post-transcriptional processes. After obtaining sequence reads, the short reads are mapped to a reference genome, and specific mapping patterns can be detected called read mapping profiles, which are distinct from random non-functional degradation patterns. These patterns reflect the maturation processes that lead to the production of shorter RNA sequences. Recent next-generation sequencing studies have revealed not only the typical maturation process of miRNAs but also the various processing mechanisms of small RNAs derived from tRNAs and snoRNAs. Results: We developed an algorithm termed SHARAKU to align two read mapping profiles of next-generation sequencing outputs for non-coding RNAs. In contrast with previous work, SHARAKU incorporates the primary and secondary sequence structures into an alignment of read mapping profiles to allow for the detection of common processing patterns. Using a benchmark simulated dataset, SHARAKU exhibited superior performance to previous methods for correctly clustering the read mapping profiles with respect to 5′-end processing and 3′-end processing from degradation patterns and in detecting similar processing patterns in deriving the shorter RNAs. Further, using experimental data of small RNA sequencing for the common marmoset brain, SHARAKU succeeded in identifying the significant clusters of read mapping profiles for similar processing patterns of small derived RNA families expressed in the brain. Availability and Implementation: The source code of our program SHARAKU is available at http://www.dna.bio.keio.ac.jp/sharaku/, and the simulated dataset used in this work is available at the same link. Accession code: The sequence data from the whole RNA transcripts in the hippocampus of the left brain used in this work is available from the DNA DataBank of Japan (DDBJ) Sequence Read Archive (DRA) under the accession number DRA

  13. Genetics and Epigenetics of the Skin Meet Deep Sequence

    PubMed Central

    Cheng, Jeffrey B.; Cho, Raymond J.

    2014-01-01

    Rapid advances in next-generation sequencing technology are revolutionizing approaches to genomic and epigenomic studies of skin. Deep sequencing of cutaneous malignancies reveals heavily mutagenized genomes with large numbers of low-prevalence mutations and multiple resistance mechanisms to targeted therapies. Next-generation sequencing approaches have already paid rich dividends in identifying the genetic causes of dermatologic disease, both in heritable mutations and the somatic aberrations that underlie cutaneous mosaicism. Although epigenetic alterations clearly influence tumorigenesis, pluripotent stem cell biology, and epidermal cell lineage decisions, labor and cost-intensive approaches long delayed a genome-scale perspective. New insights into epigenomic mechanisms in skin disease should arise from the accelerating assessment of histone modification, DNA methylation, and related gene expression signatures. PMID:22237701

  14. Deep sequencing approach for investigating infectious agents causing fever.

    PubMed

    Susilawati, T N; Jex, A R; Cantacessi, C; Pearson, M; Navarro, S; Susianto, A; Loukas, A C; McBride, W J H

    2016-07-01

    Acute undifferentiated fever (AUF) poses a diagnostic challenge due to the variety of possible aetiologies. While the majority of AUFs resolve spontaneously, some cases become prolonged and cause significant morbidity and mortality, necessitating improved diagnostic methods. This study evaluated the utility of deep sequencing in fever investigation. DNA and RNA were isolated from plasma/sera of AUF cases being investigated at Cairns Hospital in northern Australia, including eight control samples from patients with a confirmed diagnosis. Following isolation, DNA and RNA were bulk amplified and RNA was reverse transcribed to cDNA. The resulting DNA and cDNA amplicons were subjected to deep sequencing on an Illumina HiSeq 2000 platform. Bioinformatics analysis was performed using the program Kraken and the CLC assembly-alignment pipeline. The results were compared with the outcomes of clinical tests. We generated between 4 and 20 million reads per sample. The results of Kraken and CLC analyses concurred with diagnoses obtained by other means in 87.5 % (7/8) and 25 % (2/8) of control samples, respectively. Some plausible causes of fever were identified in ten patients who remained undiagnosed following routine hospital investigations, including Escherichia coli bacteraemia and scrub typhus that eluded conventional tests. Achromobacter xylosoxidans, Alteromonas macleodii and Enterobacteria phage were prevalent in all samples. A deep sequencing approach of patient plasma/serum samples led to the identification of aetiological agents putatively implicated in AUFs and enabled the study of microbial diversity in human blood. The application of this approach in hospital practice is currently limited by sequencing input requirements and complicated data analysis. PMID:27180244

  15. proovread: large-scale high-accuracy PacBio correction through iterative short read consensus

    PubMed Central

    Hackl, Thomas; Hedrich, Rainer; Schultz, Jörg; Förster, Frank

    2014-01-01

    Motivation: Today, the base code of DNA is mostly determined through sequencing by synthesis as provided by the Illumina sequencers. Although highly accurate, resulting reads are short, making their analyses challenging. Recently, a new technology, single molecule real-time (SMRT) sequencing, was developed that could address these challenges, as it generates reads of several thousand bases. But, their broad application has been hampered by a high error rate. Therefore, hybrid approaches that use high-quality short reads to correct erroneous SMRT long reads have been developed. Still, current implementations have great demands on hardware, work only in well-defined computing infrastructures and reject a substantial amount of reads. This limits their usability considerably, especially in the case of large sequencing projects. Results: Here we present proovread, a hybrid correction pipeline for SMRT reads, which can be flexibly adapted on existing hardware and infrastructure from a laptop to a high-performance computing cluster. On genomic and transcriptomic test cases covering Escherichia coli, Arabidopsis thaliana and human, proovread achieved accuracies up to 99.9% and outperformed the existing hybrid correction programs. Furthermore, proovread-corrected sequences were longer and the throughput was higher. Thus, proovread combines the most accurate correction results with an excellent adaptability to the available hardware. It will therefore increase the applicability and value of SMRT sequencing. Availability and implementation: proovread is available at the following URL: http://proovread.bioapps.biozentrum.uni-wuerzburg.de Contact: frank.foerster@biozentrum.uni-wuerzburg.de Supplementary information: Supplementary data are available at Bioinformatics online. PMID:25015988

  16. Deep sequencing of HIV: clinical and research applications.

    PubMed

    Chabria, Shiven B; Gupta, Shaili; Kozal, Michael J

    2014-01-01

    Human immunodeficiency virus (HIV) exhibits remarkable diversity in its genomic makeup and exists in any given individual as a complex distribution of closely related but nonidentical genomes called a viral quasispecies, which is subject to genetic variation, competition, and selection. This viral diversity clinically manifests as a selection of mutant variants based on viral fitness in treatment-naive individuals and based on drug-selective pressure in those on antiretroviral therapy (ART). The current standard-of-care ART consists of a combination of antiretroviral agents, which ensures maximal viral suppression while preventing the emergence of drug-resistant HIV variants. Unfortunately, transmission of drug-resistant HIV does occur, affecting 5% to >20% of newly infected individuals. To optimize therapy, clinicians rely on viral genotypic information obtained from conventional population sequencing-based assays, which cannot reliably detect viral variants that constitute <20% of the circulating viral quasispecies. These low-frequency variants can be detected by highly sensitive genotyping methods collectively grouped under the moniker of deep sequencing. Low-frequency variants have been correlated to treatment failures and HIV transmission, and detection of these variants is helping to inform strategies for vaccine development. Here, we discuss the molecular virology of HIV, viral heterogeneity, drug-resistance mutations, and the application of deep sequencing technologies in research and the clinical care of HIV-infected individuals. PMID:24821496

  17. deepTools2: a next generation web server for deep-sequencing data analysis

    PubMed Central

    Ramírez, Fidel; Ryan, Devon P; Grüning, Björn; Bhardwaj, Vivek; Kilpert, Fabian; Richter, Andreas S; Heyne, Steffen; Dündar, Friederike; Manke, Thomas

    2016-01-01

    We present an update to our Galaxy-based web server for processing and visualizing deeply sequenced data. Its core tool set, deepTools, allows users to perform complete bioinformatic workflows ranging from quality controls and normalizations of aligned reads to integrative analyses, including clustering and visualization approaches. Since we first described our deepTools Galaxy server in 2014, we have implemented new solutions for many requests from the community and our users. Here, we introduce significant enhancements and new tools to further improve data visualization and interpretation. deepTools continue to be open to all users and freely available as a web service at deeptools.ie-freiburg.mpg.de. The new deepTools2 suite can be easily deployed within any Galaxy framework via the toolshed repository, and we also provide source code for command line usage under Linux and Mac OS X. A public and documented API for access to deepTools functionality is also available. PMID:27079975

  18. deepTools2: a next generation web server for deep-sequencing data analysis.

    PubMed

    Ramírez, Fidel; Ryan, Devon P; Grüning, Björn; Bhardwaj, Vivek; Kilpert, Fabian; Richter, Andreas S; Heyne, Steffen; Dündar, Friederike; Manke, Thomas

    2016-07-01

    We present an update to our Galaxy-based web server for processing and visualizing deeply sequenced data. Its core tool set, deepTools, allows users to perform complete bioinformatic workflows ranging from quality controls and normalizations of aligned reads to integrative analyses, including clustering and visualization approaches. Since we first described our deepTools Galaxy server in 2014, we have implemented new solutions for many requests from the community and our users. Here, we introduce significant enhancements and new tools to further improve data visualization and interpretation. deepTools continue to be open to all users and freely available as a web service at deeptools.ie-freiburg.mpg.de. The new deepTools2 suite can be easily deployed within any Galaxy framework via the toolshed repository, and we also provide source code for command line usage under Linux and Mac OS X. A public and documented API for access to deepTools functionality is also available. PMID:27079975

  19. Parallel and Scalable Short-Read Alignment on Multi-Core Clusters Using UPC++

    PubMed Central

    González-Domínguez, Jorge; Liu, Yongchao; Schmidt, Bertil

    2016-01-01

    The growth of next-generation sequencing (NGS) datasets poses a challenge to the alignment of reads to reference genomes in terms of alignment quality and execution speed. Some available aligners have been shown to obtain high quality mappings at the expense of long execution times. Finding fast yet accurate software solutions is of high importance to research, since availability and size of NGS datasets continue to increase. In this work we present an efficient parallelization approach for NGS short-read alignment on multi-core clusters. Our approach takes advantage of a distributed shared memory programming model based on the new UPC++ language. Experimental results using the CUSHAW3 aligner show that our implementation based on dynamic scheduling obtains good scalability on multi-core clusters. Through our evaluation, we are able to complete the single-end and paired-end alignments of 246 million reads of length 150 base-pairs in 11.54 and 16.64 minutes, respectively, using 32 nodes with four AMD Opteron 6272 16-core CPUs per node. In contrast, the multi-threaded original tool needs 2.77 and 5.54 hours to perform the same alignments on the 64 cores of one node. The source code of our parallel implementation is publicly available at the CUSHAW3 homepage (http://cushaw3.sourceforge.net). PMID:26731399
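
    The timings quoted in the abstract above can be turned into a rough speedup estimate. The following is only a back-of-the-envelope sketch: the function names are illustrative, and the baseline is the multi-threaded original tool on one 64-core node, so the "efficiency" here is a ratio against that baseline rather than a strict scaling measurement.

```python
# Figures quoted in the abstract: 246 million 150-bp reads.
SINGLE_NODE_HOURS = {"single-end": 2.77, "paired-end": 5.54}   # original tool, 1 node
CLUSTER_MINUTES = {"single-end": 11.54, "paired-end": 16.64}   # UPC++ version, 32 nodes
NODES = 32


def speedup(mode):
    """Ratio of single-node runtime to 32-node runtime for the given mode."""
    return SINGLE_NODE_HOURS[mode] * 60.0 / CLUSTER_MINUTES[mode]


def efficiency(mode):
    """Speedup divided by the number of nodes (1.0 would be ideal scaling)."""
    return speedup(mode) / NODES


if __name__ == "__main__":
    for mode in ("single-end", "paired-end"):
        print(f"{mode}: speedup {speedup(mode):.1f}x, "
              f"efficiency {efficiency(mode):.2f}")
```

    Under these assumptions the paired-end run comes out close to a 20-fold reduction in wall-clock time on 32 nodes, and the single-end run somewhat lower.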

  20. Parallel and Scalable Short-Read Alignment on Multi-Core Clusters Using UPC++.

    PubMed

    González-Domínguez, Jorge; Liu, Yongchao; Schmidt, Bertil

    2016-01-01

    The growth of next-generation sequencing (NGS) datasets poses a challenge to the alignment of reads to reference genomes in terms of alignment quality and execution speed. Some available aligners have been shown to obtain high quality mappings at the expense of long execution times. Finding fast yet accurate software solutions is of high importance to research, since availability and size of NGS datasets continue to increase. In this work we present an efficient parallelization approach for NGS short-read alignment on multi-core clusters. Our approach takes advantage of a distributed shared memory programming model based on the new UPC++ language. Experimental results using the CUSHAW3 aligner show that our implementation based on dynamic scheduling obtains good scalability on multi-core clusters. Through our evaluation, we are able to complete the single-end and paired-end alignments of 246 million reads of length 150 base-pairs in 11.54 and 16.64 minutes, respectively, using 32 nodes with four AMD Opteron 6272 16-core CPUs per node. In contrast, the multi-threaded original tool needs 2.77 and 5.54 hours to perform the same alignments on the 64 cores of one node. The source code of our parallel implementation is publicly available at the CUSHAW3 homepage (http://cushaw3.sourceforge.net). PMID:26731399

  1. Clinical actionability enhanced through deep targeted sequencing of solid tumors

    PubMed Central

    Chen, Ken; Meric-Bernstam, Funda; Zhao, Hao; Zhang, Qingxiu; Ezzeddine, Nader; Tang, Lin-ya; Qi, Yuan; Mao, Yong; Chen, Tenghui; Chong, Zechen; Zhou, Wanding; Zheng, Xiaofeng; Johnson, Amber; Aldape, Kenneth D.; Routbort, Mark J.; Luthra, Rajyalakshmi; Kopetz, Scott; Davies, Michael A.; de Groot, John; Moulder, Stacy; Vinod, Ravi; Farhangfar, Carol J.; Shaw, Kenna Mills; Mendelsohn, John; Mills, Gordon B.; Eterovic, Agda Karina

    2015-01-01

    Background Further advances of targeted cancer therapy require comprehensive in-depth profiling of somatic mutations that are present in subpopulations of tumor cells in a clinical tumor sample. However, it is unclear to what extent such intra-tumor heterogeneity is present and whether it may affect clinical decision making. To unravel this challenge, we established a deep targeted sequencing platform to identify potentially actionable DNA alterations in tumor samples. Methods We assayed 515 FFPE tumor samples and matched germline (475 patients) from 11 disease sites by capturing and sequencing all the exons in 201 cancer related genes. Mutations, indels and copy number data were reported. Results We obtained a 1000-fold average sequencing depth and identified 4794 non-synonymous mutations in the samples analyzed, of which 15.2% were present at less than 10% allele frequency. Most of these low level mutations occurred at known oncogenic hotspots and are likely functional. Identifying low level mutations improved identification of mutations in actionable genes in 118 (24.84%) patients, among which 47 (9.8%) would otherwise be unactionable. In addition, acquiring ultra-high depth also ensured a low false discovery rate (less than 2.2%) from FFPE samples. Conclusion Our results were as accurate as a commercially available CLIA-compliant hotspot panel, but allowed the detection of a higher number of mutations in actionable genes. Our study revealed the critical importance of acquiring and utilizing high depth in profiling clinical tumor samples and presented a very useful platform for implementing routine sequencing in a cancer care institution. PMID:25626406

  2. Target Enrichment Improves Mapping of Complex Traits by Deep Sequencing

    PubMed Central

    Guo, Jianjun; Fan, Jue; Hauser, Bernard A.; Rhee, Seung Y.

    2015-01-01

    Complex traits such as crop performance and human diseases are controlled by multiple genetic loci, many of which have small effects and often go undetected by traditional quantitative trait locus (QTL) mapping. Recently, bulked segregant analysis with large F2 pools and genome-level markers (named extreme-QTL or X-QTL mapping) has been used to identify many QTL. To estimate parameters impacting QTL detection for X-QTL mapping, we simulated the effects of population size, marker density, and sequencing depth of markers on QTL detectability for traits with differing heritabilities. These simulations indicate that a high (>90%) chance of detecting QTL with at least 5% effect requires 5000× sequencing depth for a trait with heritability of 0.4−0.7. For most eukaryotic organisms, whole-genome sequencing at this depth is not economically feasible. Therefore, we tested and confirmed the feasibility of applying deep sequencing of target-enriched markers for X-QTL mapping. We used two traits in Arabidopsis thaliana with different heritabilities: seed size (H² = 0.61) and seedling greening in response to salt (H² = 0.94). We used a modified G test to identify QTL regions and developed a model-based statistical framework to resolve individual peaks by incorporating recombination rates. Multiple QTL were identified for both traits, including previously undiscovered QTL. We call our method target-enriched X-QTL (TEX-QTL) mapping; this mapping approach is not limited by the genome size or the availability of recombinant inbred populations and should be applicable to many organisms and traits. PMID:26530422
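
    The abstract above mentions a modified G test but does not give its form. As a point of reference only, the following sketch computes the standard (unmodified) G statistic for allele read counts at one marker in two bulks; the function name and the example counts are hypothetical and are not taken from the paper.

```python
import math


def g_statistic(table):
    """Standard G statistic, G = 2 * sum(O * ln(O / E)), for a 2x2 count table.

    `table` is [[ref_high, alt_high], [ref_low, alt_low]]: allele read counts
    in the two bulks. Expected counts come from the row and column totals.
    This is the textbook statistic, not the modified test used by the authors.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    g = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            if observed > 0:  # a zero cell contributes nothing to the sum
                g += observed * math.log(observed / expected)
    return 2.0 * g


if __name__ == "__main__":
    # Hypothetical marker: allele frequencies diverge between the two bulks.
    print(round(g_statistic([[60, 40], [40, 60]]), 3))
    # Hypothetical marker with identical allele frequencies in both bulks.
    print(round(g_statistic([[50, 50], [50, 50]]), 3))
```

    A marker linked to a QTL would be expected to give a larger statistic than an unlinked one, since selection shifts allele frequencies between bulks.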

  3. Error analysis of deep sequencing of phage libraries: peptides censored in sequencing.

    PubMed

    Matochko, Wadim L; Derda, Ratmir

    2013-01-01

    Next-generation sequencing techniques empower selection of ligands from phage-display libraries because they can detect low abundant clones and quantify changes in the copy numbers of clones without excessive selection rounds. Identification of errors in deep sequencing data is the most critical step in this process because these techniques have error rates >1%. Mechanisms that yield errors in Illumina and other techniques have been proposed, but no reports to date describe error analysis in phage libraries. Our paper focuses on error analysis of 7-mer peptide libraries sequenced by the Illumina method. Low theoretical complexity of this phage library, as compared to complexity of long genetic reads and genomes, allowed us to describe this library using a convenient linear vector and operator framework. We describe a phage library as an N × 1 frequency vector n = ||n_i||, where n_i is the copy number of the ith sequence and N is the theoretical diversity, that is, the total number of all possible sequences. Any manipulation to the library is an operator acting on n. Selection, amplification, or sequencing could be described as a product of an N × N matrix and a stochastic sampling operator (Sa). The latter is a random diagonal matrix that describes sampling of a library. In this paper, we focus on the properties of Sa and use them to define the sequencing operator (Seq). Sequencing without any bias and errors is Seq = Sa I_N, where I_N is an N × N unity matrix. Any bias in sequencing changes I_N to a nonunity matrix. We identified a diagonal censorship matrix (CEN), which describes elimination, or statistically significant downsampling, of specific reads during the sequencing process. PMID:24416071
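
    The vector and operator description in the abstract above can be illustrated with a toy example. This is a minimal sketch under stated assumptions, not the authors' implementation: diagonal matrices are stored as lists of their diagonal entries, the sampling operator Sa is modelled by keeping each copy independently with a fixed probability, and the function names and example counts are hypothetical.

```python
import random


def sampling_operator(n, fraction, rng):
    """Diagonal of a toy Sa: keep each copy independently with prob. `fraction`.

    Entry i is (copies kept) / (copies present), so Sa applied to n gives the
    sampled copy numbers. The thinning model is an assumption of this sketch.
    """
    diag = []
    for copies in n:
        kept = sum(1 for _ in range(copies) if rng.random() < fraction)
        diag.append(kept / copies if copies else 0.0)
    return diag


def apply_diagonal(diag, vec):
    """Multiply a vector by a diagonal matrix given by its diagonal entries."""
    return [d * v for d, v in zip(diag, vec)]


def sequence(n, sa_diag, cen_diag=None):
    """Seq applied to n: Sa * CEN * n, with CEN defaulting to the identity."""
    if cen_diag is None:
        cen_diag = [1.0] * len(n)  # unbiased sequencing: Seq = Sa * I_N
    return apply_diagonal(sa_diag, apply_diagonal(cen_diag, n))


if __name__ == "__main__":
    rng = random.Random(0)
    n = [120, 40, 0, 7]                      # copy numbers of 4 sequences
    sa = sampling_operator(n, 0.5, rng)
    print(sequence(n, sa))                   # unbiased sampling
    print(sequence(n, sa, [1.0, 0.0, 1.0, 1.0]))  # second sequence censored
```

    In this toy model a zero on the diagonal of CEN removes a sequence from the output regardless of its abundance, while a value between 0 and 1 would down-weight it.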

  4. DeepBase: annotation and discovery of microRNAs and other noncoding RNAs from deep-sequencing data.

    PubMed

    Yang, Jian-Hua; Qu, Liang-Hu

    2012-01-01

    Recent advances in high-throughput deep-sequencing technology have produced large numbers of short and long RNA sequences and enabled the detection and profiling of known and novel microRNAs (miRNAs) and other noncoding RNAs (ncRNAs) at unprecedented sensitivity and depth. In this chapter, we describe the use of deepBase, a database that we have developed to integrate all public deep-sequencing data and to facilitate the comprehensive annotation and discovery of miRNAs and other ncRNAs from these data. deepBase provides an integrative, interactive, and versatile web graphical interface to evaluate miRBase-annotated miRNA genes and other known ncRNAs, explores the expression patterns of miRNAs and other ncRNAs, and discovers novel miRNAs and other ncRNAs from deep-sequencing data. deepBase also provides a deepView genome browser to comparatively analyze these data at multiple levels. deepBase is available at http://deepbase.sysu.edu.cn/. PMID:22144203

  5. Deep sequencing reveals stepwise mutation acquisition in paroxysmal nocturnal hemoglobinuria

    PubMed Central

    Shen, Wenyi; Clemente, Michael J.; Hosono, Naoko; Yoshida, Kenichi; Przychodzen, Bartlomiej; Yoshizato, Tetsuichi; Shiraishi, Yuichi; Miyano, Satoru; Ogawa, Seishi; Maciejewski, Jaroslaw P.; Makishima, Hideki

    2014-01-01

    Paroxysmal nocturnal hemoglobinuria (PNH) is a nonmalignant clonal disease of hematopoietic stem cells that is associated with hemolysis, marrow failure, and thrombophilia. PNH has been considered a monogenic disease that results from somatic mutations in the gene encoding PIGA, which is required for biosynthesis of glycosylphosphatidylinositol-anchored (GPI-anchored) proteins. The loss of certain GPI-anchored proteins is hypothesized to provide the mutant clone with an extrinsic growth advantage, but some features of PNH argue that there are intrinsic drivers of clonal expansion. Here, we performed whole-exome sequencing of paired PNH+ and PNH– fractions on samples taken from 12 patients as well as targeted deep sequencing of an additional 36 PNH patients. We identified additional somatic mutations that resulted in a complex hierarchical clonal architecture, similar to that observed in myeloid neoplasms. In addition to mutations in PIGA, mutations were found in genes known to be involved in myeloid neoplasm pathogenesis, including TET2, SUZ12, U2AF1, and JAK2. Clonal analysis indicated that these additional mutations arose either as a subclone within the PIGA-mutant population, or prior to PIGA mutation. Together, our data indicate that in addition to PIGA mutations, accessory genetic events are frequent in PNH, suggesting a stepwise clonal evolution derived from a singular stem cell clone. PMID:25244093

  6. A Protein Deep Sequencing Evaluation of Metastatic Melanoma Tissues

    PubMed Central

    Welinder, Charlotte; Pawłowski, Krzysztof; Sugihara, Yutaka; Yakovleva, Maria; Jönsson, Göran; Ingvar, Christian; Lundgren, Lotta; Baldetorp, Bo; Olsson, Håkan; Rezeli, Melinda; Jansson, Bo; Laurell, Thomas; Fehniger, Thomas; Döme, Balazs; Malm, Johan; Wieslander, Elisabet; Nishimura, Toshihide; Marko-Varga, György

    2015-01-01

    Malignant melanoma has the highest increase of incidence of malignancies in the western world. In early stages, front line therapy is surgical excision of the primary tumor. Metastatic disease has very limited possibilities for cure. Recently, several protein kinase inhibitors and immune modifiers have shown promising clinical results but drug resistance in metastasized melanoma remains a major problem. The need for routine clinical biomarkers to follow disease progression and treatment efficacy is high. The aim of the present study was to build a protein sequence database in metastatic melanoma, searching for novel, relevant biomarkers. Ten lymph node metastases (South-Swedish Malignant Melanoma Biobank) were subjected to global protein expression analysis using two proteomics approaches (with/without orthogonal fractionation). Fractionation produced higher numbers of protein identifications (4284). Combining both methods, 5326 unique proteins were identified (2641 proteins overlapping). Deep mining proteomics may contribute to the discovery of novel biomarkers for metastatic melanoma, for example dividing the samples into two metastatic melanoma “genomic subtypes”, (“pigmentation” and “high immune”) revealed several proteins showing differential levels of expression. In conclusion, the present study provides an initial version of a metastatic melanoma protein sequence database producing a total of more than 5000 unique protein identifications. The raw data have been deposited to the ProteomeXchange with identifiers PXD001724 and PXD001725. PMID:25874936

  7. Detecting copy number variation with mated short reads

    PubMed Central

    Medvedev, Paul; Fiume, Marc; Dzamba, Misko; Smith, Tim; Brudno, Michael

    2010-01-01

    The development of high-throughput sequencing (HTS) technologies has opened the door to novel methods for detecting copy number variants (CNVs) in the human genome. While in the past CNVs have been detected based on array CGH data, recent studies have shown that depth-of-coverage information from HTS technologies can also be used for the reliable identification of large copy-variable regions. Such methods, however, are hindered by sequencing biases that lead certain regions of the genome to be over- or undersampled, lowering their resolution and ability to accurately identify the exact breakpoints of the variants. In this work, we develop a method for CNV detection that supplements the depth-of-coverage with paired-end mapping information, where mate pairs mapping discordantly to the reference serve to indicate the presence of variation. Our algorithm, called CNVer, combines this information within a unified computational framework called the donor graph, allowing us to better mitigate the sequencing biases that cause uneven local coverage and accurately predict CNVs. We use CNVer to detect 4879 CNVs in the recently described genome of a Yoruban individual. Most of the calls (77%) coincide with previously known variants within the Database of Genomic Variants, while 81% of deletion copy number variants previously known for this individual coincide with one of our loss calls. Furthermore, we demonstrate that CNVer can reconstruct the absolute copy counts of segments of the donor genome and evaluate the feasibility of using CNVer with low coverage datasets. PMID:20805290

  8. A parallel algorithm for error correction in high-throughput short-read data on CUDA-enabled graphics hardware.

    PubMed

    Shi, Haixiang; Schmidt, Bertil; Liu, Weiguo; Müller-Wittig, Wolfgang

    2010-04-01

    Emerging DNA sequencing technologies open up exciting new opportunities for genome sequencing by generating read data with a massive throughput. However, produced reads are significantly shorter and more error-prone compared to the traditional Sanger shotgun sequencing method. This poses challenges for de novo DNA fragment assembly algorithms in terms of both accuracy (to deal with short, error-prone reads) and scalability (to deal with very large input data sets). In this article, we present a scalable parallel algorithm for correcting sequencing errors in high-throughput short-read data so that error-free reads can be available before DNA fragment assembly, which is of high importance to many graph-based short-read assembly tools. The algorithm is based on spectral alignment and uses the Compute Unified Device Architecture (CUDA) programming model. To gain efficiency we are taking advantage of the CUDA texture memory using a space-efficient Bloom filter data structure for spectrum membership queries. We have tested the runtime and accuracy of our algorithm using real and simulated Illumina data for different read lengths, error rates, input sizes, and algorithmic parameters. Using a CUDA-enabled mass-produced GPU (available for less than US$400 at any local computer outlet), this results in speedups of 12-84 times for the parallelized error correction, and speedups of 3-63 times for both sequential preprocessing and parallelized error correction compared to the publicly available Euler-SR program. Our implementation is freely available for download from http://cuda-ec.sourceforge.net . PMID:20426693
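
    The abstract above describes using a Bloom filter for spectrum membership queries. The following CPU-only sketch shows the general idea, assuming a spectrum made of k-mers seen at least `min_count` times; it is not the CUDA implementation, and the class, function names, parameters and example reads are all hypothetical.

```python
import hashlib
from collections import Counter


class BloomFilter:
    """Space-efficient set with no false negatives and rare false positives."""

    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        # Double hashing: derive all bit positions from one SHA-256 digest.
        digest = hashlib.sha256(item.encode("ascii")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))


def kmers(read, k):
    """Yield every length-k substring of a read, left to right."""
    return (read[i:i + k] for i in range(len(read) - k + 1))


def build_spectrum(reads, k, min_count, num_bits=1 << 16, num_hashes=4):
    """Store k-mers occurring at least `min_count` times in a Bloom filter."""
    counts = Counter(km for read in reads for km in kmers(read, k))
    spectrum = BloomFilter(num_bits, num_hashes)
    for km, count in counts.items():
        if count >= min_count:
            spectrum.add(km)
    return spectrum


def untrusted_positions(read, k, spectrum):
    """Start positions of k-mers in `read` that are absent from the spectrum."""
    return [i for i, km in enumerate(kmers(read, k)) if km not in spectrum]


if __name__ == "__main__":
    reads = ["ACGTACGTAC"] * 3 + ["ACGTACCTAC"]  # last read has one error
    spectrum = build_spectrum(reads, k=4, min_count=2)
    print(untrusted_positions("ACGTACCTAC", 4, spectrum))
```

    A run of untrusted k-mers localizes the likely error; a corrector would then try base substitutions that bring all overlapping k-mers back into the spectrum. Because a Bloom filter can return false positives, an erroneous k-mer is occasionally accepted, which is the price paid for the compact representation.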

  9. Accurate indel prediction using paired-end short reads

    PubMed Central

    2013-01-01

    Background One of the major open challenges in next generation sequencing (NGS) is the accurate identification of structural variants such as insertions and deletions (indels). Current methods for indel calling assign scores to different types of evidence or counter-evidence for the presence of an indel, such as the number of split read alignments spanning the boundaries of a deletion candidate or reads that map within a putative deletion. Candidates with a score above a manually defined threshold are then predicted to be true indels. As a consequence, structural variants detected in this manner contain many false positives. Results Here, we present a machine learning based method which is able to discover and distinguish true from false indel candidates in order to reduce the false positive rate. Our method identifies indel candidates using a discriminative classifier based on features of split read alignment profiles and trained on true and false indel candidates that were validated by Sanger sequencing. We demonstrate the usefulness of our method with paired-end Illumina reads from 80 genomes of the first phase of the 1001 Genomes Project ( http://www.1001genomes.org) in Arabidopsis thaliana. Conclusion In this work we show that indel classification is a necessary step to reduce the number of false positive candidates. We demonstrate that missing classification may lead to spurious biological interpretations. The software is available at: http://agkb.is.tuebingen.mpg.de/Forschung/SV-M/. PMID:23442375

  10. Concurrent and Accurate Short Read Mapping on Multicore Processors.

    PubMed

    Martínez, Héctor; Tárraga, Joaquín; Medina, Ignacio; Barrachina, Sergio; Castillo, Maribel; Dopazo, Joaquín; Quintana-Ortí, Enrique S

    2015-01-01

    We introduce a parallel aligner with a work-flow organization for fast and accurate mapping of RNA sequences on servers equipped with multicore processors. Our software, HPG Aligner SA (an open-source application available at http://www.opencb.org), exploits a suffix array to rapidly map a large fraction of the RNA fragments (reads), and leverages the accuracy of the Smith-Waterman algorithm to deal with conflictive reads. The aligner is enhanced with a careful strategy to detect splice junctions based on an adaptive division of RNA reads into small segments (or seeds), which are then mapped onto a number of candidate alignment locations, providing crucial information for the successful alignment of the complete reads. The experimental results on a platform with Intel multicore technology report the parallel performance of HPG Aligner SA, on RNA reads of 100-400 nucleotides, which excels in execution time/sensitivity compared with state-of-the-art aligners such as TopHat 2+Bowtie 2, MapSplice, and STAR. PMID:26451814

  11. Unified View of Backward Backtracking in Short Read Mapping

    NASA Astrophysics Data System (ADS)

    Mäkinen, Veli; Välimäki, Niko; Laaksonen, Antti; Katainen, Riku

    Mapping short DNA reads to the reference genome is the core task in the recent high-throughput technologies to study e.g. protein-DNA interactions (ChIP-seq) and alternative splicing (RNA-seq). Several tools for the task (bowtie, bwa, SOAP2, TopHat) have been developed that exploit the Burrows-Wheeler transform and the backward backtracking technique on it, to map the reads to their best approximate occurrences in the genome. These tools use different tailored mechanisms for small error-levels to prune the search phase significantly. We propose a new pruning mechanism that can be seen as a generalization of the tailored mechanisms used so far. It uses a novel idea of storing all cyclic rotations of fixed length substrings of the reference sequence with a compressed index that is able to exploit the repetitions created to level out the growth of the input set. For RNA-seq we propose a new method that combines dynamic programming with backtracking to map efficiently and correctly all reads that span two exons. The same mechanism can also be used for mapping mate-pair reads.
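
    As a rough illustration of backward backtracking on the Burrows-Wheeler transform, the sketch below performs backward search with a simple mismatch budget. It does not implement the pruning mechanism proposed in this record; the function names, the uncompressed occurrence table, and the toy reference are assumptions made for the example.

```python
def bwt_index(text):
    """Build a toy BWT index: suffix array, C table, and occurrence table."""
    text += "$"  # unique terminator, sorts before A, C, G, T
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    bwt = "".join(text[i - 1] for i in sa)
    alphabet = sorted(set(text))
    # C[c]: number of characters in text strictly smaller than c
    C, total = {}, 0
    for c in alphabet:
        C[c] = total
        total += text.count(c)
    # occ[c][i]: number of occurrences of c in bwt[:i]
    occ = {c: [0] * (len(bwt) + 1) for c in alphabet}
    for i, ch in enumerate(bwt):
        for c in alphabet:
            occ[c][i + 1] = occ[c][i] + (ch == c)
    return sa, C, occ


def backward_search(pattern, sa, C, occ, max_mismatches=0):
    """Return text positions matching pattern with at most max_mismatches
    substitutions, scanning the pattern from its last character backwards."""
    hits = set()

    def extend(i, lo, hi, budget):
        if lo >= hi:  # empty suffix-array interval: prune this branch
            return
        if i < 0:  # whole pattern consumed: report the interval
            hits.update(sa[lo:hi])
            return
        for c in C:
            if c == "$":
                continue
            cost = 0 if c == pattern[i] else 1
            if cost <= budget:
                extend(i - 1, C[c] + occ[c][lo], C[c] + occ[c][hi], budget - cost)

    extend(len(pattern) - 1, 0, len(sa), max_mismatches)
    return sorted(hits)


# Hypothetical toy reference, for illustration only.
sa, C, occ = bwt_index("ACGTACGTGACG")
exact_hits = backward_search("ACG", sa, C, occ)
one_mismatch_hits = backward_search("ACC", sa, C, occ, max_mismatches=1)
```

    Without pruning, the number of branches explored grows quickly with the mismatch budget, which is the cost that tailored pruning mechanisms aim to reduce.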

  12. Remote triggering of deep earthquakes in the 2002 Tonga sequences.

    PubMed

    Tibi, Rigobert; Wiens, Douglas A; Inoue, Hiroshi

    2003-08-21

    It is well established that an earthquake in the Earth's crust can trigger subsequent earthquakes, but such triggering has not been documented for deeper earthquakes. Models for shallow fault interactions suggest that static (permanent) stress changes can trigger nearby earthquakes, within a few fault lengths from the causative earthquake, whereas dynamic (transient) stresses carried by seismic waves may trigger earthquakes both nearby and at remote distances. Here we present a detailed analysis of the 19 August 2002 Tonga deep earthquake sequences and show evidence for both static and dynamic triggering. Seven minutes after a magnitude 7.6 earthquake occurred at a depth of 598 km, a magnitude 7.7 earthquake (664 km depth) occurred 300 km away, in a previously aseismic region. We found that nearby aftershocks of the first mainshock are preferentially located in regions where static stresses are predicted to have been enhanced by the mainshock. But the second mainshock and other triggered events are located at larger distances where static stress increases should be negligible, thus suggesting dynamic triggering. The origin times of the triggered events do not correspond to arrival times of the main seismic waves from the mainshocks and the dynamically triggered earthquakes frequently occur in aseismic regions below or adjacent to the seismic zone. We propose that these events are triggered by transient effects in regions near criticality, but where earthquakes have difficulty nucleating without external influences. PMID:12931183

  13. Key roles for freshwater Actinobacteria revealed by deep metagenomic sequencing.

    PubMed

    Ghai, Rohit; Mizuno, Carolina Megumi; Picazo, Antonio; Camacho, Antonio; Rodriguez-Valera, Francisco

    2014-12-01

    Freshwater ecosystems are critical but fragile environments directly affecting society and its welfare. However, our understanding of genuinely freshwater microbial communities, constrained by our capacity to manipulate its prokaryotic participants in axenic cultures, remains very rudimentary. Even the most abundant components, freshwater Actinobacteria, remain largely unknown. Here, applying deep metagenomic sequencing to the microbial community of a freshwater reservoir, we were able to circumvent this traditional bottleneck and reconstruct de novo seven distinct streamlined actinobacterial genomes. These genomes represent three new groups of photoheterotrophic, planktonic Actinobacteria. We describe for the first time genomes of two novel clades, acMicro (Micrococcineae, related to Luna2) and acAMD (Actinomycetales, related to acTH1). Besides, an aggregate of contigs belonged to a new branch of the Acidimicrobiales. All are estimated to have small genomes (approximately 1.2 Mb), and their GC content varied from 40 to 61%. One of the Micrococcineae genomes encodes a proteorhodopsin, a rhodopsin type reported for the first time in Actinobacteria. The remarkable potential capacity of some of these genomes to transform recalcitrant plant detrital material, particularly lignin-derived compounds, suggests close linkages between the terrestrial and aquatic realms. Moreover, abundances of Actinobacteria correlate inversely to those of Cyanobacteria that are responsible for prolonged and frequently irretrievable damage to freshwater ecosystems. This suggests that they might serve as sentinels of impending ecological catastrophes. PMID:25355242

  14. Deep-Sea, Deep-Sequencing: Metabarcoding Extracellular DNA from Sediments of Marine Canyons.

    PubMed

    Guardiola, Magdalena; Uriz, María Jesús; Taberlet, Pierre; Coissac, Eric; Wangensteen, Owen Simon; Turon, Xavier

    2015-01-01

    Marine sediments are home to one of the richest species pools on Earth, but logistics and a dearth of taxonomic work-force hinder the knowledge of their biodiversity. We characterized α- and β-diversity of deep-sea assemblages from submarine canyons in the western Mediterranean using environmental DNA metabarcoding. We used a new primer set targeting a short eukaryotic 18S sequence (ca. 110 bp). We applied a protocol designed to obtain extractions enriched in extracellular DNA from replicated sediment corers. With this strategy we captured information from DNA (local or deposited from the water column) that persists adsorbed to inorganic particles and buffered short-term spatial and temporal heterogeneity. We analysed replicated samples from 20 localities including 2 deep-sea canyons, 1 shallower canal, and two open slopes (depth range 100-2,250 m). We identified 1,629 MOTUs, among which the dominant groups were Metazoa (with representatives of 19 phyla), Alveolata, Stramenopiles, and Rhizaria. There was a marked small-scale heterogeneity as shown by differences in replicates within corers and within localities. The spatial variability between canyons was significant, as was the depth component in one of the canyons where it was tested. Likewise, the composition of the first layer (1 cm) of sediment was significantly different from deeper layers. We found that qualitative (presence-absence) and quantitative (relative number of reads) data showed consistent trends of differentiation between samples and geographic areas. The subset of exclusively benthic MOTUs showed similar patterns of β-diversity and community structure as the whole dataset. Separate analyses of the main metazoan phyla (in number of MOTUs) showed some differences in distribution attributable to different lifestyles. Our results highlight the differentiation that can be found even between geographically close assemblages, and set the ground for future monitoring and conservation efforts on

  15. Deep-Sea, Deep-Sequencing: Metabarcoding Extracellular DNA from Sediments of Marine Canyons

    PubMed Central

    Guardiola, Magdalena; Uriz, María Jesús; Taberlet, Pierre; Coissac, Eric; Wangensteen, Owen Simon; Turon, Xavier

    2015-01-01

    Marine sediments are home to one of the richest species pools on Earth, but logistics and a dearth of taxonomic work-force hinder the knowledge of their biodiversity. We characterized α- and β-diversity of deep-sea assemblages from submarine canyons in the western Mediterranean using environmental DNA metabarcoding. We used a new primer set targeting a short eukaryotic 18S sequence (ca. 110 bp). We applied a protocol designed to obtain extractions enriched in extracellular DNA from replicated sediment corers. With this strategy we captured information from DNA (local or deposited from the water column) that persists adsorbed to inorganic particles and buffered short-term spatial and temporal heterogeneity. We analysed replicated samples from 20 localities including 2 deep-sea canyons, 1 shallower canal, and two open slopes (depth range 100–2,250 m). We identified 1,629 MOTUs, among which the dominant groups were Metazoa (with representatives of 19 phyla), Alveolata, Stramenopiles, and Rhizaria. There was a marked small-scale heterogeneity as shown by differences in replicates within corers and within localities. The spatial variability between canyons was significant, as was the depth component in one of the canyons where it was tested. Likewise, the composition of the first layer (1 cm) of sediment was significantly different from deeper layers. We found that qualitative (presence-absence) and quantitative (relative number of reads) data showed consistent trends of differentiation between samples and geographic areas. The subset of exclusively benthic MOTUs showed similar patterns of β-diversity and community structure as the whole dataset. Separate analyses of the main metazoan phyla (in number of MOTUs) showed some differences in distribution attributable to different lifestyles. Our results highlight the differentiation that can be found even between geographically close assemblages, and set the ground for future monitoring and conservation efforts on

  16. High-resolution microbial community reconstruction by integrating short reads from multiple 16S rRNA regions.

    PubMed

    Amir, Amnon; Zeisel, Amit; Zuk, Or; Elgart, Michael; Stern, Shay; Shamir, Ohad; Turnbaugh, Peter J; Soen, Yoav; Shental, Noam

    2013-12-01

    The emergence of massively parallel sequencing technology has revolutionized microbial profiling, allowing the unprecedented comparison of microbial diversity across time and space in a wide range of host-associated and environmental ecosystems. Although the high-throughput nature of such methods enables the detection of low-frequency bacteria, these advances come at the cost of sequencing read length, limiting the phylogenetic resolution possible by current methods. Here, we present a generic approach for integrating short reads from large genomic regions, thus enabling phylogenetic resolution far exceeding current methods. The approach is based on a mapping to a statistical model that is later solved as a constrained optimization problem. We demonstrate the utility of this method by analyzing human saliva and Drosophila samples, using Illumina single-end sequencing of a 750 bp amplicon of the 16S rRNA gene. Phylogenetic resolution is significantly extended while reducing the number of falsely detected bacteria, as compared with standard single-region Roche 454 Pyrosequencing. Our approach can be seamlessly applied to simultaneous sequencing of multiple genes providing a higher resolution view of the composition and activity of complex microbial communities. PMID:24214960

  17. High-resolution microbial community reconstruction by integrating short reads from multiple 16S rRNA regions

    PubMed Central

    Amir, Amnon; Zeisel, Amit; Zuk, Or; Elgart, Michael; Stern, Shay; Shamir, Ohad; Turnbaugh, Peter J.; Soen, Yoav; Shental, Noam

    2013-01-01

    The emergence of massively parallel sequencing technology has revolutionized microbial profiling, allowing the unprecedented comparison of microbial diversity across time and space in a wide range of host-associated and environmental ecosystems. Although the high-throughput nature of such methods enables the detection of low-frequency bacteria, these advances come at the cost of sequencing read length, limiting the phylogenetic resolution possible by current methods. Here, we present a generic approach for integrating short reads from large genomic regions, thus enabling phylogenetic resolution far exceeding current methods. The approach is based on a mapping to a statistical model that is later solved as a constrained optimization problem. We demonstrate the utility of this method by analyzing human saliva and Drosophila samples, using Illumina single-end sequencing of a 750 bp amplicon of the 16S rRNA gene. Phylogenetic resolution is significantly extended while reducing the number of falsely detected bacteria, as compared with standard single-region Roche 454 Pyrosequencing. Our approach can be seamlessly applied to simultaneous sequencing of multiple genes providing a higher resolution view of the composition and activity of complex microbial communities. PMID:24214960

  18. Complete Genome Sequence of Bacteriophage Deep-Blue Infecting Emetic Bacillus cereus.

    PubMed

    Hock, Louise; Gillis, Annika; Mahillon, Jacques

    2016-01-01

    The Bacillus cereus emetic pathotype is responsible for important food-borne intoxications. Here, we describe the complete genome sequence of bacteriophage Deep-Blue, which is able to infect emetic strains of B. cereus. Deep-Blue is a 159-kb myophage of the Bastille-like group within the Spounavirinae. PMID:27313285

  19. Complete Genome Sequence of Bacteriophage Deep-Blue Infecting Emetic Bacillus cereus

    PubMed Central

    Hock, Louise; Gillis, Annika

    2016-01-01

    The Bacillus cereus emetic pathotype is responsible for important food-borne intoxications. Here, we describe the complete genome sequence of bacteriophage Deep-Blue, which is able to infect emetic strains of B. cereus. Deep-Blue is a 159-kb myophage of the Bastille-like group within the Spounavirinae. PMID:27313285

  20. Mutascope: sensitive detection of somatic mutations from deep amplicon sequencing

    PubMed Central

    Yost, Shawn E.; Alakus, Hakan; Matsui, Hiroko; Schwab, Richard B.; Jepsen, Kristen; Frazer, Kelly A.; Harismendy, Olivier

    2013-01-01

    Summary: We present Mutascope, a sequencing analysis pipeline specifically developed for the identification of somatic variants present at low-allelic fraction from high-throughput sequencing of amplicons from matched tumor-normal specimen. Using datasets reproducing tumor genetic heterogeneity, we demonstrate that Mutascope has a higher sensitivity and generates fewer false-positive calls than tools designed for shotgun sequencing or diploid genomes. Availability: Freely available on the web at http://sourceforge.net/projects/mutascope/. Contact: oharismendy@ucsd.edu Supplementary information: Supplementary data are available at Bioinformatics online. PMID:23712659

  1. DNA Methyltransferase Accessibility Protocol for Individual Templates by Deep Sequencing

    PubMed Central

    Darst, Russell P.; Nabilsi, Nancy H.; Pardo, Carolina E.; Riva, Alberto; Kladde, Michael P.

    2013-01-01

    A single-molecule probe of chromatin structure can uncover dynamic chromatin states and rare epigenetic variants of biological importance that bulk measures of chromatin structure miss. In bisulfite genomic sequencing, each sequenced clone records the methylation status of multiple sites on an individual molecule of DNA. An exogenous DNA methyltransferase can thus be used to image nucleosomes and other protein–DNA complexes. In this chapter, we describe the adaptation of this technique, termed Methylation Accessibility Protocol for individual templates, to modern high-throughput sequencing, which both simplifies the workflow and extends its utility. PMID:22929770

  2. HIV-1 Quasispecies Delineation by Tag Linkage Deep Sequencing

    PubMed Central

    Wu, Nicholas C.; De La Cruz, Justin; Al-Mawsawi, Laith Q.; Olson, C. Anders; Qi, Hangfei; Luan, Harding H.; Nguyen, Nguyen; Du, Yushen; Le, Shuai; Wu, Ting-Ting; Li, Xinmin; Lewis, Martha J.; Yang, Otto O.; Sun, Ren

    2014-01-01

    Trade-offs between throughput, read length, and error rates in high-throughput sequencing limit certain applications such as monitoring viral quasispecies. Here, we describe a molecular-based tag linkage method that allows assemblage of short sequence reads into long DNA fragments. It enables haplotype phasing with high accuracy and sensitivity to interrogate individual viral sequences in a quasispecies. This approach is demonstrated to deduce ∼2000 unique 1.3 kb viral sequences from HIV-1 quasispecies in vivo and after passaging ex vivo with a detection limit of ∼0.005% to ∼0.001%. Reproducibility of the method is validated quantitatively and qualitatively by a technical replicate. This approach can improve monitoring of the genetic architecture and evolution dynamics in any quasispecies population. PMID:24842159

  3. Deep Sequencing Analysis of Nucleolar Small RNAs: Bioinformatics.

    PubMed

    Bai, Baoyan; Laiho, Marikki

    2016-01-01

    Small RNAs (size 20-30 nt) of various types have been actively investigated in recent years, and their subcellular compartmentalization and relative concentrations are likely to be of importance to their cellular and physiological functions. Comprehensive data on this subset of the transcriptome can only be obtained by application of high-throughput sequencing, which yields data that are inherently complex and multidimensional, as sequence composition, length, and abundance will all inform small RNA function. Subsequent data analysis, hypothesis testing, and presentation/visualization of the results are correspondingly challenging. We have constructed small RNA libraries derived from different cellular compartments, including the nucleolus, and asked whether small RNAs exist in the nucleolus and whether they are distinct from cytoplasmic and nuclear small RNAs, the miRNAs. Here, we present a workflow for analysis of small RNA sequencing data generated by the Ion Torrent PGM sequencer from samples derived from different cellular compartments. PMID:27576724

  4. Draft Genome Sequence of Loktanella cinnabarina LL-001T, Isolated from Deep-Sea Floor Sediment

    PubMed Central

    Tsubouchi, Taishi; Takaki, Yoshihiro; Koyanagi, Ryo; Satoh, Nori; Maruyama, Tadashi; Hatada, Yuji

    2013-01-01

    This report describes the draft genome sequence of Loktanella cinnabarina LL-001T, which was the first isolated strain from deep-sea floor sediment of the genus Loktanella. The draft genome sequence contains 3,896,245 bp, with a G+C content of 66.7%. PMID:24233588

  5. Draft Genome Sequence of Loktanella cinnabarina LL-001T, Isolated from Deep-Sea Floor Sediment.

    PubMed

    Nishi, Shinro; Tsubouchi, Taishi; Takaki, Yoshihiro; Koyanagi, Ryo; Satoh, Nori; Maruyama, Tadashi; Hatada, Yuji

    2013-01-01

    This report describes the draft genome sequence of Loktanella cinnabarina LL-001(T), which was the first isolated strain from deep-sea floor sediment of the genus Loktanella. The draft genome sequence contains 3,896,245 bp, with a G+C content of 66.7%. PMID:24233588

  6. Molecular Diagnosis of Actinomadura madurae Infection by 16S rRNA Deep Sequencing

    PubMed Central

    SenGupta, Dhruba J.; Hoogestraat, Daniel R.; Cummings, Lisa A.; Bryant, Bronwyn H.; Natividad, Catherine; Thielges, Stephanie; Monsaas, Peter W.; Chau, Mimosa; Barbee, Lindley A.; Rosenthal, Christopher; Cookson, Brad T.; Hoffman, Noah G.

    2013-01-01

    Next-generation DNA sequencing can be used to catalog individual organisms within complex, polymicrobial specimens. Here, we utilized deep sequencing of 16S rRNA to implicate Actinomadura madurae as the cause of mycetoma in a diabetic patient when culture and conventional molecular methods were overwhelmed by overgrowth of other organisms. PMID:24108607

  7. Predicting effects of noncoding variants with deep learning-based sequence model.

    PubMed

    Zhou, Jian; Troyanskaya, Olga G

    2015-10-01

    Identifying functional effects of noncoding variants is a major challenge in human genetics. To predict the noncoding-variant effects de novo from sequence, we developed a deep learning-based algorithmic framework, DeepSEA (http://deepsea.princeton.edu/), that directly learns a regulatory sequence code from large-scale chromatin-profiling data, enabling prediction of chromatin effects of sequence alterations with single-nucleotide sensitivity. We further used this capability to improve prioritization of functional variants including expression quantitative trait loci (eQTLs) and disease-associated variants. PMID:26301843

  8. Predicting effects of noncoding variants with deep learning–based sequence model

    PubMed Central

    Zhou, Jian; Troyanskaya, Olga G

    2016-01-01

    Identifying functional effects of noncoding variants is a major challenge in human genetics. To predict the noncoding-variant effects de novo from sequence, we developed a deep learning–based algorithmic framework, DeepSEA (http://deepsea.princeton.edu/), that directly learns a regulatory sequence code from large-scale chromatin-profiling data, enabling prediction of chromatin effects of sequence alterations with single-nucleotide sensitivity. We further used this capability to improve prioritization of functional variants including expression quantitative trait loci (eQTLs) and disease-associated variants. PMID:26301843

  9. Use of S1 nuclease in deep sequencing for detection of double-stranded RNA viruses.

    PubMed

    Shimada, Saya; Nagai, Makoto; Moriyama, Hiromitsu; Fukuhara, Toshiyuki; Koyama, Satoshi; Omatsu, Tsutomu; Furuya, Tetsuya; Shirai, Junsuke; Mizutani, Tetsuya

    2015-09-01

    A metagenomic approach using next-generation DNA sequencing has facilitated the detection of many pathogenic viruses from fecal samples. However, in many cases, the majority of the detected sequences originate from the host genome and bacterial flora in the gut. Here, to improve the efficiency of detecting double-stranded (ds) RNA viruses in samples, we evaluated the applicability of S1 nuclease to deep sequencing. Treating total RNA with S1 nuclease resulted in 1.5-28.4- and 10.1-208.9-fold increases in sequence reads of group A rotavirus in fecal and viral culture samples, respectively. Moreover, increasing coverage of mapping to reference sequences allowed for sufficient genotyping using analytical software. These results suggest that library construction using S1 nuclease is useful for deep sequencing in the detection of dsRNA viruses. PMID:25843154

  10. Short-Read Assembly of Full-Length 16S Amplicons Reveals Bacterial Diversity in Subsurface Sediments

    PubMed Central

    Miller, Christopher S.; Handley, Kim M.; Wrighton, Kelly C.; Frischkorn, Kyle R.; Thomas, Brian C.; Banfield, Jillian F.

    2013-01-01

    In microbial ecology, a fundamental question relates to how community diversity and composition change in response to perturbation. Most studies have had limited ability to deeply sample community structure (e.g. Sanger-sequenced 16S rRNA libraries), or have had limited taxonomic resolution (e.g. studies based on 16S rRNA hypervariable region sequencing). Here, we combine the higher taxonomic resolution of near-full-length 16S rRNA gene amplicons with the economics and sensitivity of short-read sequencing to assay the abundance and identity of organisms that represent as little as 0.01% of sediment bacterial communities. We used a new version of EMIRGE optimized for large data size to reconstruct near-full-length 16S rRNA genes from amplicons sheared and sequenced with Illumina technology. The approach allowed us to differentiate the community composition among samples acquired before perturbation, after acetate amendment shifted the predominant metabolism to iron reduction, and once sulfate reduction began. Results were highly reproducible across technical replicates, and identified specific taxa that responded to the perturbation. All samples contain very high alpha diversity and abundant organisms from phyla without cultivated representatives. Surprisingly, at the time points measured, there was no strong loss of evenness, despite the selective pressure of acetate amendment and change in the terminal electron accepting process. However, community membership was altered significantly. The method allows for sensitive, accurate profiling of the “long tail” of low abundance organisms that exist in many microbial communities, and can resolve population dynamics in response to environmental change. PMID:23405248

  11. Deep Sequencing Analysis of the Ixodes ricinus Haemocytome

    PubMed Central

    Franta, Zdeněk; Pedra, Joao H. F.; Ribeiro, José M. C.

    2015-01-01

    Background Ixodes ricinus is the main tick vector of the microbes that cause Lyme disease and tick-borne encephalitis in Europe. Pathogens transmitted by ticks have to overcome innate immunity barriers present in tick tissues, including midgut, salivary glands epithelia and the hemocoel. Molecularly, invertebrate immunity is initiated when pathogen recognition molecules trigger serum or cellular signalling cascades leading to the production of antimicrobials, pathogen opsonization and phagocytosis. We presently aimed at identifying hemocyte transcripts from semi-engorged female I. ricinus ticks by mass sequencing a hemocyte cDNA library and annotating immune-related transcripts based on their hemocyte abundance as well as their ubiquitous distribution. Methodology/principal findings De novo assembly of 926,596 pyrosequence reads plus 49,328,982 Illumina reads (148 nt length) from a hemocyte library, together with over 189 million Illumina reads from salivary gland and midgut libraries, generated 15,716 extracted coding sequences (CDS); these are displayed in an annotated hyperlinked spreadsheet format. Read mapping allowed the identification and annotation of tissue-enriched transcripts. A total of 327 transcripts were found significantly over expressed in the hemocyte libraries, including those coding for scavenger receptors, antimicrobial peptides, pathogen recognition proteins, proteases and protease inhibitors. Vitellogenin and lipid metabolism transcription enrichment suggests fat body components. We additionally annotated ubiquitously distributed transcripts associated with immune function, including immune-associated signal transduction proteins and transcription factors, including the STAT transcription factor. Conclusions/significance This is the first systems biology approach to describe the genes expressed in the haemocytes of this neglected disease vector. A total of 2,860 coding sequences were deposited to GenBank, increasing to 27,547 the number so

  12. Lineage analysis by microsatellite loci deep sequencing in mice.

    PubMed

    Luo, Tao; He, Xionglei; Xing, Ke

    2016-05-01

    Lineage analysis is the identification of all the progeny of a single progenitor cell, and has become particularly useful for studying developmental processes and cancer biology. Here, we propose a novel and effective method for lineage analysis that combines sequence capture and next-generation sequencing technology. Genome-wide mononucleotide and dinucleotide microsatellite loci in eight samples from two mice were identified and used to construct phylogenetic trees based on somatic indel mutations at these loci, which were unique enough to distinguish and parse samples from different mice into different groups along the lineage tree. For example, biopsies from the liver and stomach, which originate from the endoderm, were located in the same clade, while samples in kidney, which originate from the mesoderm, were located in another clade. Yet, tissue with a common developmental origin may still contain cells of a mixed ancestry. This genome-wide approach thus provides a non-invasive lineage analysis method based on mutations that accumulate in the genomes of opaque multicellular organism somatic cells. Mol. Reprod. Dev. 83: 387-391, 2016. © 2016 Wiley Periodicals, Inc. PMID:26932355

  13. SNP discovery through de novo deep sequencing using the next generation of DNA sequencers

    Technology Transfer Automated Retrieval System (TEKTRAN)

    The production of high volumes of DNA sequence data using new technologies has permitted more efficient identification of single nucleotide polymorphisms in vertebrate genomes. This chapter presented practical methodology for production and analysis of DNA sequence data for SNP discovery....

  14. Determining mutant spectra of three RNA viral samples using ultra-deep sequencing

    SciTech Connect

    Chen, H

    2012-06-06

    RNA viruses have extremely high mutation rates that enable the virus to adapt to new host environments and even jump from one species to another. As part of a viral transmission study, three viral samples collected from naturally infected animals were sequenced using Illumina paired-end technology at ultra-deep coverage. In order to determine the mutant spectra within the viral quasispecies, it is critical to understand the sequencing error rates and control for false positive calls of viral variants (point mutations). I will estimate the sequencing error rate from two control sequences and characterize the mutant spectra in the natural samples with this error rate.
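
    One common way to use a control-derived error rate to limit false-positive variant calls is a binomial tail test, sketched below. The record does not state which statistical procedure was used, so the binomial model, the function names, the significance threshold, and all counts here are assumptions for illustration only.

```python
from math import exp, lgamma, log


def error_rate(mismatches, total_bases):
    """Per-base error rate estimated from control reads (pooled counts)."""
    return mismatches / total_bases


def log_binom_pmf(i, n, p):
    """Log of the Binomial(n, p) probability of exactly i events."""
    return (lgamma(n + 1) - lgamma(i + 1) - lgamma(n - i + 1)
            + i * log(p) + (n - i) * log(1 - p))


def binom_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p), summed in log space to avoid overflow."""
    return sum(exp(log_binom_pmf(i, n, p)) for i in range(k, n + 1))


def is_variant(alt_count, depth, p_err, alpha=1e-6):
    """Call a variant when alt_count non-reference bases at a site are
    unlikely to arise from sequencing error alone (alpha is an assumed cutoff)."""
    return binom_tail(alt_count, depth, p_err) < alpha


# Hypothetical numbers: 50 mismatches in 100,000 control bases.
p_err = error_rate(50, 100_000)
call_low = is_variant(5, 10_000, p_err)     # about the expected error count
call_high = is_variant(100, 10_000, p_err)  # far above the expected error count
```

    At 10,000x depth and this error rate, about five erroneous bases per site are expected, so only counts well above that level would be called as variants under this model.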

  15. Deep sequencing as a probe of normal stem cell fate and preneoplasia in human epidermis

    PubMed Central

    Simons, Benjamin D.

    2016-01-01

    Using deep sequencing technology, methods based on the sporadic acquisition of somatic DNA mutations in human tissues have been used to trace the clonal evolution of progenitor cells in diseased states. However, the potential of these approaches to explore cell fate behavior of normal tissues and the initiation of preneoplasia remains underexploited. Focusing on the results of a recent deep sequencing study of eyelid epidermis, we show that the quantitative analysis of mutant clone size provides a general method to resolve the pattern of normal stem cell fate and to detect and characterize the mutational signature of rare field transformations in human tissues, with implications for the early detection of preneoplasia. PMID:26699486

  16. Using Small RNA Deep Sequencing Data to Detect Human Viruses

    PubMed Central

    Wang, Fang; Sun, Yu; Ruan, Jishou; Chen, Rui; Chen, Xin; Chen, Chengjie; Kreuze, Jan F.; Fei, ZhangJun; Zhu, Xiao

    2016-01-01

    Small RNA sequencing (sRNA-seq) can be used to detect viruses in infected hosts without the necessity to have any prior knowledge or specialized sample preparation. The sRNA-seq method was initially used for viral detection and identification in plants and then in invertebrates and fungi. However, it is still controversial to use sRNA-seq in the detection of mammalian or human viruses. In this study, we used 931 sRNA-seq runs of data from the NCBI SRA database to detect and identify viruses in human cells or tissues, particularly from some clinical samples. Six viruses including HPV-18, HBV, HCV, HIV-1, SMRV, and EBV were detected from 36 runs of data. Four viruses were consistent with the annotations from the previous studies. HIV-1 was found in clinical samples without the HIV-positive reports, and SMRV was found in Diffuse Large B-Cell Lymphoma cells for the first time. In conclusion, these results suggest the sRNA-seq can be used to detect viruses in mammals and humans. PMID:27066498

  17. Deep sequencing reveals 50 novel genes for recessive cognitive disorders.

    PubMed

    Najmabadi, Hossein; Hu, Hao; Garshasbi, Masoud; Zemojtel, Tomasz; Abedini, Seyedeh Sedigheh; Chen, Wei; Hosseini, Masoumeh; Behjati, Farkhondeh; Haas, Stefan; Jamali, Payman; Zecha, Agnes; Mohseni, Marzieh; Püttmann, Lucia; Vahid, Leyla Nouri; Jensen, Corinna; Moheb, Lia Abbasi; Bienek, Melanie; Larti, Farzaneh; Mueller, Ines; Weissmann, Robert; Darvish, Hossein; Wrogemann, Klaus; Hadavi, Valeh; Lipkowitz, Bettina; Esmaeeli-Nieh, Sahar; Wieczorek, Dagmar; Kariminejad, Roxana; Firouzabadi, Saghar Ghasemi; Cohen, Monika; Fattahi, Zohreh; Rost, Imma; Mojahedi, Faezeh; Hertzberg, Christoph; Dehghan, Atefeh; Rajab, Anna; Banavandi, Mohammad Javad Soltani; Hoffer, Julia; Falah, Masoumeh; Musante, Luciana; Kalscheuer, Vera; Ullmann, Reinhard; Kuss, Andreas Walter; Tzschach, Andreas; Kahrizi, Kimia; Ropers, H Hilger

    2011-10-01

    Common diseases are often complex because they are genetically heterogeneous, with many different genetic defects giving rise to clinically indistinguishable phenotypes. This has been amply documented for early-onset cognitive impairment, or intellectual disability, one of the most complex disorders known and a very important health care problem worldwide. More than 90 different gene defects have been identified for X-chromosome-linked intellectual disability alone, but research into the more frequent autosomal forms of intellectual disability is still in its infancy. To expedite the molecular elucidation of autosomal-recessive intellectual disability, we have now performed homozygosity mapping, exon enrichment and next-generation sequencing in 136 consanguineous families with autosomal-recessive intellectual disability from Iran and elsewhere. This study, the largest published so far, has revealed additional mutations in 23 genes previously implicated in intellectual disability or related neurological disorders, as well as single, probably disease-causing variants in 50 novel candidate genes. Proteins encoded by several of these genes interact directly with products of known intellectual disability genes, and many are involved in fundamental cellular processes such as transcription and translation, cell-cycle control, energy metabolism and fatty-acid synthesis, which seem to be pivotal for normal brain development and function. PMID:21937992

  18. Using Small RNA Deep Sequencing Data to Detect Human Viruses.

    PubMed

    Wang, Fang; Sun, Yu; Ruan, Jishou; Chen, Rui; Chen, Xin; Chen, Chengjie; Kreuze, Jan F; Fei, ZhangJun; Zhu, Xiao; Gao, Shan

    2016-01-01

    Small RNA sequencing (sRNA-seq) can be used to detect viruses in infected hosts without requiring any prior knowledge or specialized sample preparation. The sRNA-seq method was initially used for viral detection and identification in plants and then in invertebrates and fungi. However, the use of sRNA-seq to detect mammalian or human viruses remains controversial. In this study, we used 931 sRNA-seq runs from the NCBI SRA database to detect and identify viruses in human cells or tissues, particularly in some clinical samples. Six viruses including HPV-18, HBV, HCV, HIV-1, SMRV, and EBV were detected in 36 runs. Four viruses were consistent with the annotations from the previous studies. HIV-1 was found in clinical samples without HIV-positive reports, and SMRV was found in Diffuse Large B-Cell Lymphoma cells for the first time. In conclusion, these results suggest that sRNA-seq can be used to detect viruses in mammals and humans. PMID:27066498

  19. Novel lineages of Southern Ocean deep-sea foraminifera revealed by environmental DNA sequencing

    NASA Astrophysics Data System (ADS)

    Pawlowski, Jan; Fontaine, Delia; da Silva, Ana Aranda; Guiard, Jackie

    2011-10-01

    Diversity of deep-sea foraminifera is commonly studied based on analysis of agglutinated and calcareous tests preserved in the dried sediment samples. Soft-walled and agglutinated monothalamous (single-chambered) foraminifera are usually ignored because they are poorly preserved and difficult to identify. Moreover, the assemblage examined is usually limited to sediment size fraction larger than 63 or 125 μm. To overcome these problems, we analysed the foraminiferal assemblage based on ribosomal DNA sequences amplified specifically from total DNA extracted from unsieved and fine fraction (<32 μm) of sediment samples from three sites in Southern Ocean. We obtained 392 sequences, representing 123 phylotypes of foraminifera. Over 90% of phylotypes (112) could not be assigned to any previously sequenced species or genera. Among these new phylotypes, 20 belong to the clade of multi-chambered calcareous Rotaliida and agglutinated Textulariida, while 94 branch among the radiation of monothalamous species. Many new phylotypes clustered together with other environmental foraminiferal sequences and sequences of unknown origin. Eight new lineages of environmental foraminiferal sequences (ENFOR 1-8) were distinguished. The morphology of species included in these novel lineages is unknown, but we can speculate that they are tiny, amoeboid protists present in the deep-sea sediments. Their diversity may be as high as that of better known large-sized foraminifera. Documenting this hidden component of deep-sea foraminiferal assemblages is a major challenge for the future.

  20. Deep Sequencing of the Murine Olfactory Receptor Neuron Transcriptome

    PubMed Central

    Kanageswaran, Ninthujah; Demond, Marilen; Nagel, Maximilian; Schreiner, Benjamin S. P.; Baumgart, Sabrina; Scholz, Paul; Altmüller, Janine; Becker, Christian; Doerner, Julia F.; Conrad, Heike; Oberland, Sonja; Wetzel, Christian H.; Neuhaus, Eva M.; Hatt, Hanns; Gisselmann, Günter

    2015-01-01

    The ability of animals to sense and differentiate among thousands of odorants relies on a large set of olfactory receptors (OR) and a multitude of accessory proteins within the olfactory epithelium (OE). ORs and related signaling mechanisms have been the subject of intensive studies over the past years, but our knowledge regarding olfactory processing remains limited. The recent development of next generation sequencing (NGS) techniques encouraged us to assess the transcriptome of the murine OE. We analyzed RNA from OEs of female and male adult mice and from fluorescence-activated cell sorting (FACS)-sorted olfactory receptor neurons (ORNs) obtained from transgenic OMP-GFP mice. The Illumina RNA-Seq protocol was utilized to generate up to 86 million reads per transcriptome. In OE samples, nearly all OR and trace amine-associated receptor (TAAR) genes involved in the perception of volatile amines were detectably expressed. Other genes known to participate in olfactory signaling pathways were among the 200 genes with the highest expression levels in the OE. To identify OE-specific genes, we compared olfactory neuron expression profiles with RNA-Seq transcriptome data from different murine tissues. By analyzing different transcript classes, we detected the expression of non-olfactory GPCRs in ORNs and established an expression ranking for GPCRs detected in the OE. We also identified other previously undescribed membrane proteins as potential new players in olfaction. The quantitative and comprehensive transcriptome data provide a virtually complete catalogue of genes expressed in the OE and present a useful tool to uncover candidate genes involved in, for example, olfactory signaling, OR trafficking and recycling, and proliferation. PMID:25590618

  1. Deep sequencing of the murine olfactory receptor neuron transcriptome.

    PubMed

    Kanageswaran, Ninthujah; Demond, Marilen; Nagel, Maximilian; Schreiner, Benjamin S P; Baumgart, Sabrina; Scholz, Paul; Altmüller, Janine; Becker, Christian; Doerner, Julia F; Conrad, Heike; Oberland, Sonja; Wetzel, Christian H; Neuhaus, Eva M; Hatt, Hanns; Gisselmann, Günter

    2015-01-01

    The ability of animals to sense and differentiate among thousands of odorants relies on a large set of olfactory receptors (OR) and a multitude of accessory proteins within the olfactory epithelium (OE). ORs and related signaling mechanisms have been the subject of intensive studies over the past years, but our knowledge regarding olfactory processing remains limited. The recent development of next generation sequencing (NGS) techniques encouraged us to assess the transcriptome of the murine OE. We analyzed RNA from OEs of female and male adult mice and from fluorescence-activated cell sorting (FACS)-sorted olfactory receptor neurons (ORNs) obtained from transgenic OMP-GFP mice. The Illumina RNA-Seq protocol was utilized to generate up to 86 million reads per transcriptome. In OE samples, nearly all OR and trace amine-associated receptor (TAAR) genes involved in the perception of volatile amines were detectably expressed. Other genes known to participate in olfactory signaling pathways were among the 200 genes with the highest expression levels in the OE. To identify OE-specific genes, we compared olfactory neuron expression profiles with RNA-Seq transcriptome data from different murine tissues. By analyzing different transcript classes, we detected the expression of non-olfactory GPCRs in ORNs and established an expression ranking for GPCRs detected in the OE. We also identified other previously undescribed membrane proteins as potential new players in olfaction. The quantitative and comprehensive transcriptome data provide a virtually complete catalogue of genes expressed in the OE and present a useful tool to uncover candidate genes involved in, for example, olfactory signaling, OR trafficking and recycling, and proliferation. PMID:25590618

  2. Deep Sequencing of the Vaginal Microbiota of Women with HIV

    PubMed Central

    Hummelen, Ruben; Fernandes, Andrew D.; Macklaim, Jean M.; Dickson, Russell J.; Changalucha, John

    2010-01-01

    Background Women living with HIV and co-infected with bacterial vaginosis (BV) are at higher risk for transmitting HIV to a partner or newborn. It is poorly understood which bacterial communities constitute BV or the normal vaginal microbiota among this population and how the microbiota associated with BV responds to antibiotic treatment. Methods and Findings The vaginal microbiota of 132 HIV positive Tanzanian women, including 39 who received metronidazole treatment for BV, were profiled using Illumina to sequence the V6 region of the 16S rRNA gene. Of note, Gardnerella vaginalis and Lactobacillus iners were detected in each sample constituting core members of the vaginal microbiota. Eight major clusters were detected with relatively uniform microbiota compositions. Two clusters dominated by L. iners or L. crispatus were strongly associated with a normal microbiota. The L. crispatus dominated microbiota were associated with low pH, but when L. crispatus was not present, a large fraction of L. iners was required to predict a low pH. Four clusters were strongly associated with BV, and were dominated by Prevotella bivia, Lachnospiraceae, or a mixture of different species. Metronidazole treatment reduced the microbial diversity and perturbed the BV-associated microbiota, but rarely resulted in the establishment of a lactobacilli-dominated microbiota. Conclusions Illumina-based microbial profiling enabled high-throughput analyses of microbial samples at a high phylogenetic resolution. The vaginal microbiota among women living with HIV in Sub-Saharan Africa constitutes several profiles associated with a normal microbiota or BV. Recurrence of BV frequently constitutes a different BV-associated profile than before antibiotic treatment. PMID:20711427

  3. Denoising DNA deep sequencing data—high-throughput sequencing errors and their correction

    PubMed Central

    Laehnemann, David; Borkhardt, Arndt

    2016-01-01

    Characterizing the errors generated by common high-throughput sequencing platforms and telling true genetic variation from technical artefacts are two interdependent steps, essential to many analyses such as single nucleotide variant calling, haplotype inference, sequence assembly and evolutionary studies. Both random and systematic errors can show a specific occurrence profile for each of the six prominent sequencing platforms surveyed here: 454 pyrosequencing, Complete Genomics DNA nanoball sequencing, Illumina sequencing by synthesis, Ion Torrent semiconductor sequencing, Pacific Biosciences single-molecule real-time sequencing and Oxford Nanopore sequencing. There is a large variety of programs available for error removal in sequencing read data, which differ in the error models and statistical techniques they use, the features of the data they analyse, the parameters they determine from them and the data structures and algorithms they use. We highlight the assumptions they make and for which data types these hold, providing guidance which tools to consider for benchmarking with regard to the data properties. While no benchmarking results are included here, such specific benchmarks would greatly inform tool choices and future software development. The development of stand-alone error correctors, as well as single nucleotide variant and haplotype callers, could also benefit from using more of the knowledge about error profiles and from (re)combining ideas from the existing approaches presented here. PMID:26026159

  4. Draft Genome Sequence of the Deep-Sea Bacterium Shewanella benthica Strain KT99.

    PubMed

    Lauro, F M; Chastain, R A; Ferriera, S; Johnson, J; Yayanos, A A; Bartlett, D H

    2013-01-01

    We report the draft genome sequence of the obligately piezophilic Shewanella benthica strain KT99 isolated from the abyssal South Pacific Ocean. Strain KT99 is the first piezophilic isolate from the Tonga-Kermadec trench, and its genome provides many clues on high-pressure adaptation and the evolution of deep-sea piezophilic bacteria. PMID:23723392

  5. Deep-sequencing of the peach latent mosaic viroid reveals new aspects of population heterogeneity.

    PubMed

    Glouzon, Jean-Pierre Sehi; Bolduc, François; Wang, Shengrui; Najmanovich, Rafael J; Perreault, Jean-Pierre

    2014-01-01

    Viroids are small circular single-stranded infectious RNAs characterized by a relatively high mutation level. Knowledge of their sequence heterogeneity remains largely elusive and previous studies, using Sanger sequencing, were based on a limited number of sequences. In an attempt to address sequence heterogeneity from a population dynamics perspective, a GF305-indicator peach tree was infected with a single variant of the Avsunviroidae family member Peach latent mosaic viroid (PLMVd). Six months post-inoculation, full-length circular conformers of PLMVd were isolated and deep-sequenced. We devised an original approach to the bioinformatics refinement of our sequence libraries involving important phenotypic data, based on the systematic analysis of hammerhead self-cleavage activity. Two distinct libraries yielded a total of 3,939 different PLMVd variants. Sequence variants exhibiting up to ∼17% of mutations relative to the inoculated viroid were retrieved, clearly illustrating the high level of divergence dynamics within a unique population. While we initially assumed that most positions of the viroid sequence would mutate, we were surprised to discover that ∼50% of positions remained perfectly conserved, including several small stretches as well as a small motif reminiscent of a GNRA tetraloop which are the result of various selective pressures. Using a hierarchical clustering algorithm, the different variants harvested were subdivided into 7 clusters. We found that most sequences contained an average of 4.6 to 6.4 mutations compared to the variant used to initially inoculate the plant. Interestingly, it was possible to reconstitute and compare the sequence evolution of each of these clusters. In doing so, we identified several key mutations. This study provides a reliable pipeline for the treatment of viroid deep-sequencing. It also sheds new light on the extent of sequence variation that a viroid population can sustain, and which may give rise to a

  6. Studies of a Biochemical Factory: Tomato Trichome Deep Expressed Sequence Tag Sequencing and Proteomics

    PubMed Central

    Schilmiller, Anthony L.; Miner, Dennis P.; Larson, Matthew; McDowell, Eric; Gang, David R.; Wilkerson, Curtis; Last, Robert L.

    2010-01-01

    Shotgun proteomics analysis allows hundreds of proteins to be identified and quantified from a single sample at relatively low cost. Extensive DNA sequence information is a prerequisite for shotgun proteomics, and it is ideal to have sequence for the organism being studied rather than from related species or accessions. While this requirement has limited the set of organisms that are candidates for this approach, next generation sequencing technologies make it feasible to obtain deep DNA sequence coverage from any organism. As part of our studies of specialized (secondary) metabolism in tomato (Solanum lycopersicum) trichomes, 454 sequencing of cDNA was combined with shotgun proteomics analyses to obtain in-depth profiles of genes and proteins expressed in leaf and stem glandular trichomes of 3-week-old plants. The expressed sequence tag and proteomics data sets combined with metabolite analysis led to the discovery and characterization of a sesquiterpene synthase that produces β-caryophyllene and α-humulene from E,E-farnesyl diphosphate in trichomes of leaf but not of stem. This analysis demonstrates the utility of combining high-throughput cDNA sequencing with proteomics experiments in a target tissue. These data can be used for dissection of other biochemical processes in these specialized epidermal cells. PMID:20431087

  7. Resolving the Complexity of Human Skin Metagenomes Using Single-Molecule Sequencing

    PubMed Central

    Tsai, Yu-Chih; Deming, Clayton; Segre, Julia A.; Kong, Heidi H.; Korlach, Jonas

    2016-01-01

    ABSTRACT Deep metagenomic shotgun sequencing has emerged as a powerful tool to interrogate composition and function of complex microbial communities. Computational approaches to assemble genome fragments have been demonstrated to be an effective tool for de novo reconstruction of genomes from these communities. However, the resultant “genomes” are typically fragmented and incomplete due to the limited ability of short-read sequence data to assemble complex or low-coverage regions. Here, we use single-molecule, real-time (SMRT) sequencing to reconstruct a high-quality, closed genome of a previously uncharacterized Corynebacterium simulans and its companion bacteriophage from a skin metagenomic sample. Considerable improvement in assembly quality occurs in hybrid approaches incorporating short-read data, with even relatively small amounts of long-read data being sufficient to improve metagenome reconstruction. Using short-read data to evaluate strain variation of this C. simulans in its skin community at single-nucleotide resolution, we observed a dominant C. simulans strain with moderate allelic heterozygosity throughout the population. We demonstrate the utility of SMRT sequencing and hybrid approaches in metagenome quantitation, reconstruction, and annotation. PMID:26861018

  8. A user-friendly computational workflow for the analysis of microRNA deep sequencing data.

    PubMed

    Majer, Anna; Caligiuri, Kyle A; Booth, Stephanie A

    2013-01-01

    Second-generation high-throughput sequencing is a robust and inexpensive methodology that is becoming an increasingly common technique for the study of microRNA (miRNA) expression levels in the central nervous system. This method allows for the identification of both known and novel miRNAs, reporting on the qualitative and quantitative levels these RNA species represent in any given sample. Numerous bioinformatic programs are currently available to analyze deep sequencing data but many require at least a partial understanding of the command line interface. In this chapter, we describe a user-friendly computational workflow guiding the user through the process from the initial FASTQ deep sequencing file to the identification of known and potentially novel miRNAs in a given experiment, as well as the assessment of the differential expression of these miRNAs between experimental samples. Furthermore, programs that can predict potential targets for these miRNAs are also highlighted. PMID:23007497
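
As a rough illustration of the first step in such a workflow, the sketch below reads a FASTQ stream and tallies read lengths, which is a common sanity check for small RNA libraries because mature miRNAs are roughly 22 nt long. It is not taken from the chapter: the function name and the example reads are hypothetical, and it assumes the plain four-lines-per-record FASTQ layout with adapters already trimmed.

```python
from collections import Counter
import io


def read_length_histogram(handle):
    """Count read lengths in a 4-line-per-record FASTQ stream."""
    lengths = Counter()
    for i, line in enumerate(handle):
        # The second line of every 4-line record (index 1, 5, 9, ...) is the sequence.
        if i % 4 == 1:
            lengths[len(line.strip())] += 1
    return lengths


# Two made-up reads: a 22 nt sequence and the same sequence with its last base removed.
seq_a = "TGAGGTAGTAGGTTGTATAGTT"
seq_b = seq_a[:-1]
example = io.StringIO(
    "@read1\n" + seq_a + "\n+\n" + "I" * len(seq_a) + "\n"
    "@read2\n" + seq_b + "\n+\n" + "I" * len(seq_b) + "\n"
)
hist = read_length_histogram(example)
```

A real analysis would follow this with adapter trimming, alignment, and annotation against known miRNAs, none of which is shown here.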

  9. Pooled Amplicon Deep Sequencing of Candidate Plasmodium falciparum Transmission-Blocking Vaccine Antigens.

    PubMed

    Juliano, Jonathan J; Parobek, Christian M; Brazeau, Nicholas F; Ngasala, Billy; Randrianarivelojosia, Milijaona; Lon, Chanthap; Mwandagalirwa, Kashamuka; Tshefu, Antoinette; Dhar, Ravi; Das, Bidyut K; Hoffman, Irving; Martinson, Francis; Mårtensson, Andreas; Saunders, David L; Kumar, Nirbhay; Meshnick, Steven R

    2016-01-01

    Polymorphisms within Plasmodium falciparum vaccine candidate antigens have the potential to compromise vaccine efficacy. Understanding the allele frequencies of polymorphisms in critical binding regions of antigens can help in the designing of strain-transcendent vaccines. Here, we adopt a pooled deep-sequencing approach, originally designed to study P. falciparum drug resistance mutations, to study the diversity of two leading transmission-blocking vaccine candidates, Pfs25 and Pfs48/45. We sequenced 329 P. falciparum field isolates from six different geographic regions. Pfs25 showed little diversity, with only one known polymorphism identified in the region associated with binding of transmission-blocking antibodies among our isolates. However, we identified four new mutations among eight non-synonymous mutations within the presumed antibody-binding region of Pfs48/45. Pooled deep sequencing provides a scalable and cost-effective approach for the targeted study of allele frequencies of P. falciparum candidate vaccine antigens. PMID:26503281

  10. Classification of ncRNAs using position and size information in deep sequencing data

    PubMed Central

    Erhard, Florian; Zimmer, Ralf

    2010-01-01

    Motivation: Small non-coding RNAs (ncRNAs) play important roles in various cellular functions in all clades of life. With next-generation sequencing techniques, it has become possible to study ncRNAs in a high-throughput manner and by using specialized algorithms ncRNA classes such as miRNAs can be detected in deep sequencing data. Typically, such methods are targeted to a certain class of ncRNA. Many methods rely on RNA secondary structure prediction, which is not always accurate and not all ncRNA classes are characterized by a common secondary structure. Unbiased classification methods for ncRNAs could be important to improve accuracy and to detect new ncRNA classes in sequencing data. Results: Here, we present a scoring system called ALPS (alignment of pattern matrices score) that only uses primary information from a deep sequencing experiment, i.e. the relative positions and lengths of reads, to classify ncRNAs. ALPS makes no further assumptions, e.g. about common structural properties in the ncRNA class and is nevertheless able to identify ncRNA classes with high accuracy. Since ALPS is not designed to recognize a certain class of ncRNA, it can be used to detect novel ncRNA classes, as long as these unknown ncRNAs have a characteristic pattern of deep sequencing read lengths and positions. We evaluate our scoring system on publicly available deep sequencing data and show that it is able to classify known ncRNAs with high sensitivity and specificity. Availability: Calculated pattern matrices of the datasets hESC and EB are available at the project web site http://www.bio.ifi.lmu.de/ALPS. An implementation of the described method is available upon request from the authors. Contact: florian.erhard@bio.ifi.lmu.de PMID:20823303
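
The abstract above says only that ALPS uses the relative positions and lengths of reads; it does not give the definition of the pattern matrices or of the score. The sketch below therefore shows just one plausible way to tabulate that input: counting reads by (offset within a locus, read length) and normalizing to fractions. The function name, the length bounds, and the example reads are assumptions for illustration, not the authors' method.

```python
from collections import Counter


def pattern_matrix(reads, locus_start, locus_len, min_len=18, max_len=30):
    """Tabulate reads by (offset within locus, read length) as fractions.

    reads: iterable of (start, length) tuples in the same coordinates as locus_start.
    Returns a dict mapping (offset, length) to the fraction of counted reads.
    """
    counts = Counter()
    for start, length in reads:
        offset = start - locus_start
        # Keep only reads that start inside the locus and fall in the length range.
        if 0 <= offset < locus_len and min_len <= length <= max_len:
            counts[(offset, length)] += 1
    total = sum(counts.values())
    if total == 0:
        return {}
    return {key: n / total for key, n in counts.items()}


# Hypothetical reads (start, length); the last one lies outside the locus.
reads = [(100, 22), (100, 22), (100, 21), (140, 22), (500, 22)]
matrix = pattern_matrix(reads, locus_start=100, locus_len=60)
```

Two loci could then be compared by some similarity measure over their matrices; which measure ALPS uses is not stated in the abstract.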

  11. Enhanced arbovirus surveillance with deep sequencing: identification of novel rhabdoviruses and bunyaviruses in Australian mosquitoes

    PubMed Central

    Coffey, Lark L.; Page, Brady L.; Greninger, Alexander L.; Herring, Belinda L.; Russell, Richard C.; Doggett, Stephen L.; Haniotis, John; Wang, Chunlin; Deng, Xutao; Delwart, Eric L.

    2013-01-01

    Viral metagenomics characterizes known and identifies unknown viruses based on sequence similarities to any previously sequenced viral genomes. A metagenomics approach was used to identify virus sequences in Australian mosquitoes causing cytopathic effects in inoculated mammalian cell cultures. Sequence comparisons revealed strains of Liao Ning virus (Reovirus, Seadornavirus), previously detected only in China, livestock-infecting Stretch Lagoon virus (Reovirus, Orbivirus), two novel dimarhabdoviruses, named Beaumont and North Creek viruses, and two novel orthobunyaviruses, named Murrumbidgee and Salt Ash viruses. The novel virus proteomes diverged by ≥50% relative to their closest previously genetically characterized viral relatives. Deep sequencing also generated genomes of Warrego and Wallal viruses, orbiviruses linked to kangaroo blindness, whose genomes had not been fully characterized. This study highlights viral metagenomics in concert with traditional arbovirus surveillance to characterize known and new arboviruses in field-collected mosquitoes. Follow-up epidemiological studies are required to determine whether the novel viruses infect humans. PMID:24314645

  12. Enhanced arbovirus surveillance with deep sequencing: Identification of novel rhabdoviruses and bunyaviruses in Australian mosquitoes.

    PubMed

    Coffey, Lark L; Page, Brady L; Greninger, Alexander L; Herring, Belinda L; Russell, Richard C; Doggett, Stephen L; Haniotis, John; Wang, Chunlin; Deng, Xutao; Delwart, Eric L

    2014-01-01

    Viral metagenomics characterizes known and identifies unknown viruses based on sequence similarities to any previously sequenced viral genomes. A metagenomics approach was used to identify virus sequences in Australian mosquitoes causing cytopathic effects in inoculated mammalian cell cultures. Sequence comparisons revealed strains of Liao Ning virus (Reovirus, Seadornavirus), previously detected only in China, livestock-infecting Stretch Lagoon virus (Reovirus, Orbivirus), two novel dimarhabdoviruses, named Beaumont and North Creek viruses, and two novel orthobunyaviruses, named Murrumbidgee and Salt Ash viruses. The novel virus proteomes diverged by ≥ 50% relative to their closest previously genetically characterized viral relatives. Deep sequencing also generated genomes of Warrego and Wallal viruses, orbiviruses linked to kangaroo blindness, whose genomes had not been fully characterized. This study highlights viral metagenomics in concert with traditional arbovirus surveillance to characterize known and new arboviruses in field-collected mosquitoes. Follow-up epidemiological studies are required to determine whether the novel viruses infect humans. PMID:24314645

  13. Deep sequencing reveals global patterns of mRNA recruitment during translation initiation

    PubMed Central

    Gao, Rong; Yu, Kai; Nie, Jukui; Lian, Tengfei; Jin, Jianshi; Liljas, Anders; Su, Xiao-Dong

    2016-01-01

    In this work, we developed a method to systematically study the sequence preference of mRNAs during translation initiation. Traditionally, the dynamic process of translation initiation has been studied at the single molecule level with limited sequencing possibility. Using deep sequencing techniques, we identified the sequence preference at different stages of the initiation complexes. Our results provide a comprehensive and dynamic view of the initiation elements in the translation initiation region (TIR), including the S1 binding sequence, the Shine-Dalgarno (SD)/anti-SD interaction and the second codon, at the equilibrium of different initiation complexes. Moreover, our experiments reveal the conformational changes and regional dynamics throughout the dynamic process of mRNA recruitment. PMID:27460773

  14. Using deep sequencing to characterize the biophysical mechanism of a transcriptional regulatory sequence

    PubMed Central

    Kinney, Justin B.; Murugan, Anand; Callan, Curtis G.; Cox, Edward C.

    2010-01-01

    Cells use protein-DNA and protein-protein interactions to regulate transcription. A biophysical understanding of this process has, however, been limited by the lack of methods for quantitatively characterizing the interactions that occur at specific promoters and enhancers in living cells. Here we show how such biophysical information can be revealed by a simple experiment in which a library of partially mutated regulatory sequences is partitioned according to their in vivo transcriptional activities and then sequenced en masse. Computational analysis of the sequence data produced by this experiment can provide precise quantitative information about how the regulatory proteins at a specific arrangement of binding sites work together to regulate transcription. This ability to reliably extract precise information about regulatory biophysics in the face of experimental noise is made possible by a recently identified relationship between likelihood and mutual information. Applying our experimental and computational techniques to the Escherichia coli lac promoter, we demonstrate the ability to identify regulatory protein binding sites de novo, determine the sequence-dependent binding energy of the proteins that bind these sites, and, importantly, measure the in vivo interaction energy between RNA polymerase and a DNA-bound transcription factor. Our approach provides a generally applicable method for characterizing the biophysical basis of transcriptional regulation by a specified regulatory sequence. The principles of our method can also be applied to a wide range of other problems in molecular biology. PMID:20439748
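
To make the mutual-information idea concrete, the sketch below computes the mutual information between two discrete variables, for example the base at one promoter position and the activity bin a sequence was sorted into. It is a plain plug-in estimate under assumed inputs, not the authors' inference procedure; the function name and the toy observations are hypothetical.

```python
import math
from collections import Counter


def mutual_information(pairs):
    """Mutual information in bits between two discrete variables.

    pairs: list of (x, y) observations, e.g. (base at a position, activity bin).
    Uses raw frequency estimates with no small-sample correction.
    """
    n = len(pairs)
    joint = Counter(pairs)
    px = Counter(x for x, _ in pairs)
    py = Counter(y for _, y in pairs)
    mi = 0.0
    for (x, y), c in joint.items():
        # p(x, y) * log2(p(x, y) / (p(x) * p(y))), expressed with counts.
        mi += (c / n) * math.log2(c * n / (px[x] * py[y]))
    return mi


# Toy data: the base fully determines the bin in the first case, not at all in the second.
informative = [("A", "high"), ("A", "high"), ("T", "low"), ("T", "low")]
uninformative = [("A", "high"), ("A", "low"), ("T", "high"), ("T", "low")]
mi_informative = mutual_information(informative)
mi_uninformative = mutual_information(uninformative)
```

A position whose base predicts the activity bin carries high mutual information, which is the intuition behind using it to locate binding sites; with real, finite data the estimate would need a bias correction.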

  15. Ultra-deep sequencing of intra-host rabies virus populations during cross-species transmission.

    PubMed

    Borucki, Monica K; Chen-Harris, Haiyin; Lao, Victoria; Vanier, Gilda; Wadford, Debra A; Messenger, Sharon; Allen, Jonathan E

    2013-11-01

    One of the hurdles to understanding the role of viral quasispecies in RNA virus cross-species transmission (CST) events is the need to analyze a densely sampled outbreak using deep sequencing in order to measure the amount of mutation occurring on a small time scale. In 2009, the California Department of Public Health reported a dramatic increase (350%) in the number of gray foxes infected with a rabies virus variant for which striped skunks serve as a reservoir host in Humboldt County. To better understand the evolution of rabies, deep-sequencing was applied to 40 unpassaged rabies virus samples from the Humboldt outbreak. For each sample, approximately 11 kb of the 12 kb genome was amplified and sequenced using the Illumina platform. Average coverage was 17,448 and this allowed characterization of the rabies virus population present in each sample at unprecedented depths. Phylogenetic analysis of the consensus sequence data demonstrated that samples clustered according to date (1995 vs. 2009) and geographic location (northern vs. southern). A single amino acid change in the G protein distinguished a subset of northern foxes from a haplotype present in both foxes and skunks, suggesting this mutation may have played a role in the observed increased transmission among foxes in this region. Deep-sequencing data indicated that many genetic changes associated with the CST event occurred prior to 2009 since several nonsynonymous mutations that were present in the consensus sequences of skunk and fox rabies samples obtained from 2003-2010 were present at the sub-consensus level (as rare variants in the viral population) in skunk and fox samples from 1995. These results suggest that analysis of rare variants within a viral population may yield clues to ancestral genomes and identify rare variants that have the potential to be selected for if environmental conditions change. PMID:24278493
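
The distinction between consensus and sub-consensus variants can be sketched for a single genome position as follows. This is only an illustration under assumed inputs: the function name, the per-base counts, and the 1% reporting threshold are hypothetical, and the study's actual variant-calling procedure is not described in the abstract.

```python
def site_summary(base_counts, min_freq=0.01):
    """Return the consensus base, depth, and minor variants at one position.

    base_counts: dict of read counts per base, e.g. {"A": 9700, "G": 250}.
    min_freq: assumed reporting threshold; in practice it depends on error rates.
    """
    depth = sum(base_counts.values())
    if depth == 0:
        return None, 0, {}
    consensus = max(base_counts, key=base_counts.get)
    minor = {
        base: count / depth
        for base, count in base_counts.items()
        if base != consensus and count / depth >= min_freq
    }
    return consensus, depth, minor


# Made-up counts: G is a rare variant above the threshold, T falls below it.
consensus, depth, minor = site_summary({"A": 9700, "G": 250, "T": 50})
```

High coverage is what makes such rare variants measurable at all: a variant at 2.5% frequency is supported by hundreds of reads at a depth of 10,000 but by only a handful at a depth of 100.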

  16. Ultra-Deep Sequencing of Intra-host Rabies Virus Populations during Cross-species Transmission

    PubMed Central

    Borucki, Monica K.; Chen-Harris, Haiyin; Lao, Victoria; Vanier, Gilda; Wadford, Debra A.; Messenger, Sharon; Allen, Jonathan E.

    2013-01-01

    One of the hurdles to understanding the role of viral quasispecies in RNA virus cross-species transmission (CST) events is the need to analyze a densely sampled outbreak using deep sequencing in order to measure the amount of mutation occurring on a small time scale. In 2009, the California Department of Public Health reported a dramatic increase (350%) in the number of gray foxes infected with a rabies virus variant for which striped skunks serve as a reservoir host in Humboldt County. To better understand the evolution of rabies, deep-sequencing was applied to 40 unpassaged rabies virus samples from the Humboldt outbreak. For each sample, approximately 11 kb of the 12 kb genome was amplified and sequenced using the Illumina platform. Average coverage was 17,448 and this allowed characterization of the rabies virus population present in each sample at unprecedented depths. Phylogenetic analysis of the consensus sequence data demonstrated that samples clustered according to date (1995 vs. 2009) and geographic location (northern vs. southern). A single amino acid change in the G protein distinguished a subset of northern foxes from a haplotype present in both foxes and skunks, suggesting this mutation may have played a role in the observed increased transmission among foxes in this region. Deep-sequencing data indicated that many genetic changes associated with the CST event occurred prior to 2009 since several nonsynonymous mutations that were present in the consensus sequences of skunk and fox rabies samples obtained from 2003-2010 were present at the sub-consensus level (as rare variants in the viral population) in skunk and fox samples from 1995. These results suggest that analysis of rare variants within a viral population may yield clues to ancestral genomes and identify rare variants that have the potential to be selected for if environmental conditions change. PMID:24278493

  17. Seismic sequence stratigraphy of Tertiary sediments, offshore Sarawak deep-water area

    SciTech Connect

    Mohammad, A.M.

    1994-07-01

    Tectonic processes and sea level changes are the key factors that have strongly influenced clastic and carbonate sedimentation in the Sarawak deep-water area. A seismic sequence stratigraphy of Tertiary sediments was conducted in the area with the main objective of developing a workable genetic chronostratigraphic framework that defines the sequence and systems tract boundaries within which depositional systems and lithofacies can be identified, mapped and interpreted. This study has resulted in the identification of eight major depositional sequences that are bounded by regional unconformities and correlative conformities. These sequences can generally be grouped into four megasequences, based on the main tectonic events observed in the area. Three systems tracts of a type-1, third-order sequence boundary were recognized in most of the sequences: lowstand, transgressive, and highstand systems tracts. The lowstand systems tract includes basin-floor fans, slope fans, and lowstand prograding wedges. Paleoenvironmental distribution maps constructed for each of the sequences using seismic facies analysis and nearby well control suggest that the sequence intervals are predominantly transgressive units that have been intermittently interrupted by regressive pulses brought about by changes in eustatic sea level. The trend of the paleocoastline observed during Oligocene to Miocene times changes from a northwest-southeast orientation to a position roughly parallel to the present coastline. Seismic facies maps generated from late Oligocene to early Miocene indicate the depositional environment was coastal to coastal plain in the western and the middle part of the study area, becoming more marine toward the east and northeast.

  18. Draft genome sequence of Pseudomonas oleovorans strain MGY01 isolated from deep sea water.

    PubMed

    Wang, Runping; Ren, Chong; Huang, Nan; Liu, Yang; Zeng, Runying

    2015-04-01

    Pseudomonas oleovorans MGY01 isolated from the deep-sea water of the South China Sea could effectively degrade malachite green. The draft genome of P. oleovorans MGY01 was sequenced and analyzed to gain insights into its efficient metabolic pathway for degrading malachite green. The data obtained revealed 109 contigs (N50: 128,269 bp) with a whole-genome size of 5,201,892 bp. The draft genome sequence of strain MGY01 will be helpful in studying the genetic pathways involved in the degradation of malachite green. PMID:25528517
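
    The record above reports an N50 statistic for the draft assembly. As a minimal sketch of how N50 is conventionally computed from a list of contig lengths (the function name `n50` and the toy lengths are illustrative assumptions, not data from the record):

```python
def n50(contig_lengths):
    """Return the N50 of an assembly.

    N50 is the length L such that contigs of length >= L together
    cover at least half of the total assembled bases.
    """
    if not contig_lengths:
        raise ValueError("need at least one contig length")
    total = sum(contig_lengths)
    running = 0
    # Walk contigs from longest to shortest, accumulating bases.
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length


# Toy example (hypothetical lengths, in bp).
example_lengths = [2, 3, 4, 5, 6]
example_n50 = n50(example_lengths)
```

    Sorting in descending order and stopping at the first contig that pushes the running sum past half the total is the usual definition; some tools differ in how they treat ties or minimum contig length cutoffs.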

  19. miRBase: integrating microRNA annotation and deep-sequencing data.

    PubMed

    Kozomara, Ana; Griffiths-Jones, Sam

    2011-01-01

    miRBase is the primary online repository for all microRNA sequences and annotation. The current release (miRBase 16) contains over 15,000 microRNA gene loci in over 140 species, and over 17,000 distinct mature microRNA sequences. Deep-sequencing technologies have delivered a sharp rise in the rate of novel microRNA discovery. We have mapped reads from short RNA deep-sequencing experiments to microRNAs in miRBase and developed web interfaces to view these mappings. The user can view all read data associated with a given microRNA annotation, filter reads by experiment and count, and search for microRNAs by tissue- and stage-specific expression. These data can be used as a proxy for relative expression levels of microRNA sequences, provide detailed evidence for microRNA annotations and alternative isoforms of mature microRNAs, and allow us to revisit previous annotations. miRBase is available online at: http://www.mirbase.org/. PMID:21037258

  20. Characterization of microRNA transcriptome in lung cancer by next-generation deep sequencing

    PubMed Central

    Ma, Jie; Mannoor, Kaiissar; Gao, Lu; Tan, Afang; Guarnera, Maria A.; Zhan, Min; Shetty, Amol; Stass, Sanford A; Xing, Lingxiao; Jiang, Feng

    2014-01-01

    Non-small cell lung cancer (NSCLC) is the leading cause of cancer death. Systematically characterizing miRNAs in NSCLC will help develop biomarkers for its diagnosis and subclassification, and identify therapeutic targets for the treatment. We used next-generation deep sequencing to comprehensively characterize miRNA profiles in eight lung tumor tissues consisting of two major types of NSCLC, squamous cell carcinoma (SCC) and adenocarcinoma (AC). We used quantitative PCR (qPCR) to verify the findings in 40 pairs of stage I NSCLC tissues and the paired normal tissues, and 60 NSCLC tissues of different types and stages. We also investigated the function of identified miRNAs in lung tumorigenesis. Deep sequencing identified 896 known miRNAs and 14 novel miRNAs, of which, 24 miRNAs displayed dysregulation with fold change ≥4.5 in either stage I ACs or SCCs or both relative to normal tissues. qPCR validation showed that 14 of 24 miRNAs exhibited consistent changes with deep sequencing data. Seven miRNAs displayed distinctive expressions between SCC and AC, from which, a panel of four miRNAs (miRs-944, 205-3p, 135a-5p, and 577) was identified that could differentiate SCC from AC with 93.3% sensitivity and 86.7% specificity. Manipulation of miR-944 expression in NSCLC cells affected cell growth, proliferation, and invasion by targeting a tumor suppressor, SOCS4. Evaluating miR-944 in 52 formalin-fixed paraffin-embedded SCC tissues revealed that miR-944 expression was associated with lymph node metastasis. This study presents the earliest use of deep sequencing for profiling miRNAs in lung tumor specimens. The identified miRNA signatures may provide biomarkers for early detection, subclassification, and predicting metastasis, and potential therapeutic targets of NSCLC. PMID:24785186

  1. Ultra deep sequencing detects a low rate of mosaic mutations in Tuberous Sclerosis Complex

    PubMed Central

    Qin, Wei; Kozlowski, Piotr; Taillon, Bruce E.; Bouffard, Pascal; Holmes, Alison J.; Janne, Pasi; Camposano, Susana; Thiele, Elizabeth; Franz, David; Kwiatkowski, David J.

    2010-01-01

    Tuberous sclerosis complex (TSC) is an autosomal dominant neurocutaneous syndrome caused by mutations in TSC1 and TSC2. However, 10 to 15% of TSC patients have no mutation identified with conventional molecular diagnostic studies. We used the ultra-deep pyrosequencing technique of 454 Sequencing to search for mosaicism in 38 TSC patients who had no TSC1 or TSC2 mutation identified by conventional methods. Two TSC2 mutations were identified, each at 5.3% read frequency in different patients, consistent with mosaicism. Both mosaic mutations were confirmed by several methods. Five of 38 samples were found to have heterozygous non-mosaic mutations, which had been missed in earlier analyses. Several other possible low frequency mosaic mutations were identified by deep sequencing, but were discarded as artifacts by secondary studies. The low frequency of detection of mosaic mutations, 2 (6%) of 33, suggests that in the majority of TSC patients who have no mutation identified, the disease is not due to mosaicism but rather to other causes, which remain to be determined. These findings indicate the ability of deep sequencing, coupled with secondary confirmatory analyses, to detect low frequency mosaic mutations. PMID:20165957

  2. MetaGeniE: Characterizing Human Clinical Samples Using Deep Metagenomic Sequencing

    PubMed Central

    Rawat, Arun; Engelthaler, David M.; Driebe, Elizabeth M.; Keim, Paul; Foster, Jeffrey T.

    2014-01-01

    With the decreasing cost of next-generation sequencing, deep sequencing of clinical samples provides unique opportunities to understand host-associated microbial communities. Among the primary challenges of clinical metagenomic sequencing is the rapid filtering of human reads to survey for pathogens with high specificity and sensitivity. Metagenomes are inherently variable due to different microbes in the samples and their relative abundance, the size and architecture of genomes, and factors such as target DNA amounts in tissue samples (i.e. human DNA versus pathogen DNA concentration). This variation in metagenomes typically manifests in sequencing datasets as low pathogen abundance, a high number of host reads, and the presence of close relatives and complex microbial communities. In addition to these challenges posed by the composition of metagenomes, high numbers of reads generated from high-throughput deep sequencing pose immense computational challenges. Accurate identification of pathogens is confounded by individual reads mapping to multiple different reference genomes due to gene similarity in different taxa present in the community or close relatives in the reference database. Available global and local sequence aligners also vary in sensitivity, specificity, and speed of detection. The efficiency of detection of pathogens in clinical samples is largely dependent on the desired taxonomic resolution of the organisms. We have developed an efficient strategy that identifies “all against all” relationships between sequencing reads and reference genomes. Our approach allows for scaling to large reference databases and then genome reconstruction by aggregating global and local alignments, thus allowing genetic characterization of pathogens at higher taxonomic resolution. These results were consistent with strain level SNP genotyping and bacterial identification from laboratory culture. PMID:25365329

  3. Efficient selection of biomineralizing DNA aptamers using deep sequencing and population clustering.

    PubMed

    Bawazer, Lukmaan A; Newman, Aaron M; Gu, Qian; Ibish, Abdullah; Arcila, Mary; Cooper, James B; Meldrum, Fiona C; Morse, Daniel E

    2014-01-28

    DNA-based information systems drive the combinatorial optimization processes of natural evolution, including the evolution of biominerals. Advances in high-throughput DNA sequencing expand the power of DNA as a potential information platform for combinatorial engineering, but many applications remain to be developed due in part to the challenge of handling large amounts of sequence data. Here we employ high-throughput sequencing and a recently developed clustering method (AutoSOME) to identify single-stranded DNA sequence families that bind specifically to ZnO semiconductor mineral surfaces. These sequences were enriched from a diverse DNA library after a single round of screening, whereas previous screening approaches typically require 5-15 rounds of enrichment for effective sequence identification. The consensus sequence of the largest cluster was poly d(T)30. This consensus sequence exhibited clear aptamer behavior and was shown to promote the synthesis of crystalline ZnO from aqueous solution at near-neutral pH. This activity is significant, as the crystalline form of this wide-bandgap semiconductor is not typically amenable to solution synthesis in this pH range. High-resolution TEM revealed that this DNA synthesis route yields ZnO nanoparticles with an amorphous-crystalline core-shell structure, suggesting that the mechanism of mineralization involves nanoscale coacervation around the DNA template. We thus demonstrate that our new method, termed Single round Enrichment of Ligands by deep Sequencing (SEL-Seq), can facilitate biomimetic synthesis of technological nanomaterials by accelerating combinatorial selection of biomolecular-mineral interactions. Moreover, by enabling direct characterization of sequence family demographics, we anticipate that SEL-Seq will enhance aptamer discovery in applications employing additional rounds of screening. PMID:24341560

  4. miRBase: annotating high confidence microRNAs using deep sequencing data

    PubMed Central

    Kozomara, Ana; Griffiths-Jones, Sam

    2014-01-01

    We describe an update of the miRBase database (http://www.mirbase.org/), the primary microRNA sequence repository. The latest miRBase release (v20, June 2013) contains 24 521 microRNA loci from 206 species, processed to produce 30 424 mature microRNA products. The rate of deposition of novel microRNAs and the number of researchers involved in their discovery continue to increase, driven largely by small RNA deep sequencing experiments. In the face of these increases, and a range of microRNA annotation methods and criteria, maintaining the quality of the microRNA sequence data set is a significant challenge. Here, we describe recent developments of the miRBase database to address this issue. In particular, we describe the collation and use of deep sequencing data sets to assign levels of confidence to miRBase entries. We now provide a high confidence subset of miRBase entries, based on the pattern of mapped reads. The high confidence microRNA data set is available alongside the complete microRNA collection at http://www.mirbase.org/. We also describe embedding microRNA-specific Wikipedia pages on the miRBase website to encourage the microRNA community to contribute and share textual and functional information. PMID:24275495

  5. Analyzing the microRNA Transcriptome in Plants Using Deep Sequencing Data

    PubMed Central

    Yang, Xiaozeng; Li, Lei

    2012-01-01

    MicroRNAs (miRNAs) are 20- to 24-nucleotide endogenous small RNA molecules emerging as an important class of sequence-specific, trans-acting regulators for modulating gene expression at the post-transcription level. There has been a surge of interest in the past decade in identifying miRNAs and profiling their expression pattern using various experimental approaches. In particular, ultra-deep sampling of specifically prepared low-molecular-weight RNA libraries based on next-generation sequencing technologies has been used successfully in diverse species. The challenge now is to effectively deconvolute the complex sequencing data to provide comprehensive and reliable information on the miRNAs, miRNA precursors, and expression profile of miRNA genes. Here we review the recently developed computational tools and their applications in profiling the miRNA transcriptomes, with an emphasis on the model plant Arabidopsis thaliana. Highlighted is also progress and insight into miRNA biology derived from analyzing available deep sequencing data. PMID:24832228

  6. Nautilus: a bioinformatics package for the analysis of HIV type 1 targeted deep sequencing data.

    PubMed

    Kijak, Gustavo H; Pham, Phuc; Sanders-Buell, Eric; Harbolick, Elizabeth A; Eller, Leigh Anne; Robb, Merlin L; Michael, Nelson L; Kim, Jerome H; Tovanabutra, Sodsai

    2013-10-01

    The advent of next generation sequencing technologies is providing new insight into HIV-1 diversity and evolution, which has created the need for bioinformatics tools that could be applied to the characterization of viral quasispecies. Here we present Nautilus, a bioinformatics package for the analysis of HIV-1 targeted deep sequencing data. The DeepHaplo module determines the nucleotide base frequency and read depth at each position and computes the haplotype frequencies based on the linkage among polymorphisms in the same next generation sequence read. The Motifs module computes the frequency of the variants in the setting of their sequence context and mapping orientation, which allows for the validation of polymorphisms and haplotypes when strand bias is suspected. Both modules are accessed through a user-friendly GUI, which runs on Mac OS X (version 10.7.4 or later), and are based on Python, JAVA, and R scripts. Nautilus is available from www.hivresearch.org/research.php?ServiceID=5&SubServiceID=6 . PMID:23809062
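
    The record above describes a module that determines nucleotide base frequency and read depth at each position. The following is a minimal sketch of that general idea only, not Nautilus's actual implementation; it assumes gap-free reads given as hypothetical (start, sequence) pairs with 0-based reference coordinates, and the name `base_frequencies` is illustrative:

```python
from collections import Counter


def base_frequencies(aligned_reads):
    """Compute per-position read depth and base frequencies.

    aligned_reads: iterable of (start, sequence) tuples, where start is
    the 0-based reference position of the first base and the read is
    assumed to align without insertions or deletions.
    Returns {position: (depth, {base: frequency})}.
    """
    counts = {}
    for start, seq in aligned_reads:
        for offset, base in enumerate(seq):
            counts.setdefault(start + offset, Counter())[base] += 1

    profile = {}
    for pos, counter in sorted(counts.items()):
        depth = sum(counter.values())
        profile[pos] = (depth, {b: n / depth for b, n in counter.items()})
    return profile


# Toy input: three short reads covering positions 0-4 (hypothetical data).
reads = [(0, "ACGT"), (1, "CGTA"), (1, "CTTA")]
profile = base_frequencies(reads)
```

    Real pipelines would read alignments from SAM/BAM files, handle indels and soft clipping, and usually filter on base quality before counting.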

  7. miRBase: annotating high confidence microRNAs using deep sequencing data.

    PubMed

    Kozomara, Ana; Griffiths-Jones, Sam

    2014-01-01

    We describe an update of the miRBase database (http://www.mirbase.org/), the primary microRNA sequence repository. The latest miRBase release (v20, June 2013) contains 24 521 microRNA loci from 206 species, processed to produce 30 424 mature microRNA products. The rate of deposition of novel microRNAs and the number of researchers involved in their discovery continue to increase, driven largely by small RNA deep sequencing experiments. In the face of these increases, and a range of microRNA annotation methods and criteria, maintaining the quality of the microRNA sequence data set is a significant challenge. Here, we describe recent developments of the miRBase database to address this issue. In particular, we describe the collation and use of deep sequencing data sets to assign levels of confidence to miRBase entries. We now provide a high confidence subset of miRBase entries, based on the pattern of mapped reads. The high confidence microRNA data set is available alongside the complete microRNA collection at http://www.mirbase.org/. We also describe embedding microRNA-specific Wikipedia pages on the miRBase website to encourage the microRNA community to contribute and share textual and functional information. PMID:24275495

  8. Complete Genome Sequence of a Reference Stock of Simian Immunodeficiency Virus RNA (SIVmac251/32H/L28) Determined by Deep Sequencing

    PubMed Central

    Jenkins, Adrian; Ham, Claire; Almond, Neil

    2016-01-01

    A reference preparation for simian immunodeficiency virus (SIV) RNA nucleic acid assays was characterized by complete genome deep sequencing. The entire coding sequence and flanking long terminal repeats, including minority species, were determined. This information will inform SIV research investigations and aid evaluation and development of amplification assays for SIV RNA quantification. PMID:27231355

  9. Complete Genome Sequence of a Reference Stock of Simian Immunodeficiency Virus RNA (SIVmac251/32H/L28) Determined by Deep Sequencing.

    PubMed

    Jenkins, Adrian; Ham, Claire; Almond, Neil; Berry, Neil

    2016-01-01

    A reference preparation for simian immunodeficiency virus (SIV) RNA nucleic acid assays was characterized by complete genome deep sequencing. The entire coding sequence and flanking long terminal repeats, including minority species, were determined. This information will inform SIV research investigations and aid evaluation and development of amplification assays for SIV RNA quantification. PMID:27231355

  10. De novo meta-assembly of ultra-deep sequencing data

    PubMed Central

    Mirebrahim, Hamid; Close, Timothy J.; Lonardi, Stefano

    2015-01-01

    We introduce a new divide and conquer approach to deal with the problem of de novo genome assembly in the presence of ultra-deep sequencing data (i.e. coverage of 1000x or higher). Our proposed meta-assembler Slicembler partitions the input data into optimal-sized ‘slices’ and uses a standard assembly tool (e.g. Velvet, SPAdes, IDBA_UD and Ray) to assemble each slice individually. Slicembler uses majority voting among the individual assemblies to identify long contigs that can be merged to the consensus assembly. To improve its efficiency, Slicembler uses a generalized suffix tree to identify these frequent contigs (or fraction thereof). Extensive experimental results on real ultra-deep sequencing data (8000x coverage) and simulated data show that Slicembler significantly improves the quality of the assembly compared with the performance of the base assembler. In fact, most of the time, Slicembler generates error-free assemblies. We also show that Slicembler is much more resistant to high sequencing error rates than the base assembler. Availability and implementation: Slicembler can be accessed at http://slicembler.cs.ucr.edu/. Contact: hamid.mirebrahim@email.ucr.edu PMID:26072514
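
    The partitioning step described in the record above can be illustrated with a rough sketch. This is not Slicembler's algorithm: the round-robin assignment, the function name `slice_reads`, and the per-slice coverage target are all assumptions made for illustration, and the record does not say how the optimal slice size is chosen.

```python
import math


def slice_reads(reads, total_coverage, slice_coverage):
    """Split reads into slices of roughly `slice_coverage` each.

    The number of slices is ceil(total_coverage / slice_coverage);
    reads are dealt out round-robin so each slice samples the whole
    input rather than one contiguous chunk of the file.
    """
    if slice_coverage <= 0:
        raise ValueError("slice_coverage must be positive")
    n_slices = max(1, math.ceil(total_coverage / slice_coverage))
    slices = [[] for _ in range(n_slices)]
    for i, read in enumerate(reads):
        slices[i % n_slices].append(read)
    return slices


# Hypothetical example: 10 read IDs at 8000x total, targeting 2000x per slice.
slices = slice_reads(list(range(10)), total_coverage=8000, slice_coverage=2000)
```

    Each slice would then be handed to a base assembler separately; the voting and merging of the resulting contigs is outside the scope of this sketch.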

  11. HomozygosityMapper2012--bridging the gap between homozygosity mapping and deep sequencing.

    PubMed

    Seelow, Dominik; Schuelke, Markus

    2012-07-01

    Homozygosity mapping is a common method to map recessive traits in consanguineous families. To facilitate these analyses, we have developed HomozygosityMapper, a web-based approach to homozygosity mapping. HomozygosityMapper allows researchers to directly upload the genotype files produced by the major genotyping platforms as well as deep sequencing data. It detects stretches of homozygosity shared by the affected individuals and displays them graphically. Users can interactively inspect the underlying genotypes, manually refine these regions and eventually submit them to our candidate gene search engine GeneDistiller to identify the most promising candidate genes. Here, we present the new version of HomozygosityMapper. The most striking new feature is the support of Next Generation Sequencing *.vcf files as input. Upon users' requests, we have implemented the analysis of common experimental rodents as well as of important farm animals. Furthermore, we have extended the options for single families and loss of heterozygosity studies. Another new feature is the export of *.bed files for targeted enrichment of the potential disease regions for deep sequencing strategies. HomozygosityMapper also generates files for conventional linkage analyses which are already restricted to the possible disease regions, hence superseding CPU-intensive genome-wide analyses. HomozygosityMapper is freely available at http://www.homozygositymapper.org/. PMID:22669902

  12. Genotyping Influenza Virus by Next-Generation Deep Sequencing in Clinical Specimens.

    PubMed

    Seong, Moon Woo; Cho, Sung Im; Park, Hyunwoong; Seo, Soo Hyun; Lee, Seung Jun; Kim, Eui Chong; Park, Sung Sup

    2016-05-01

    Rapid and accurate identification of an influenza outbreak is essential for patient care and treatment. We describe a next-generation sequencing (NGS)-based, unbiased deep sequencing method in clinical specimens to investigate an influenza outbreak. Nasopharyngeal swabs from patients were collected for molecular epidemiological analysis. Total RNA was sequenced by using the NGS technology as paired-end 250 bp reads. A total of 7 to 12 million reads was obtained. After mapping to the human reference genome, we analyzed the 3-4% of reads that originated from a non-human source. A BLAST search of the contigs reconstructed de novo revealed high sequence similarity with that of the pandemic H1N1 virus. In the phylogenetic analysis, the HA gene of our samples clustered closely with that of A/Senegal/VR785/2010(H1N1), A/Wisconsin/11/2013(H1N1), and A/Korea/01/2009(H1N1), and the NA gene of our samples clustered closely with A/Wisconsin/11/2013(H1N1). This study suggests that NGS-based unbiased sequencing can be effectively applied to investigate molecular characteristics of a nosocomial influenza outbreak by using clinical specimens such as nasopharyngeal swabs. PMID:26915615

  13. Optimization of affinity, specificity and function of designed influenza inhibitors using deep sequencing

    SciTech Connect

    Whitehead, Timothy A.; Chevalier, Aaron; Song, Yifan; Dreyfus, Cyrille; Fleishman, Sarel J.; De Mattos, Cecilia; Myers, Chris A.; Kamisetty, Hetunandan; Blair, Patrick; Wilson, Ian A.; Baker, David

    2012-06-19

    We show that comprehensive sequence-function maps obtained by deep sequencing can be used to reprogram interaction specificity and to leapfrog over bottlenecks in affinity maturation by combining many individually small contributions not detectable in conventional approaches. We use this approach to optimize two computationally designed inhibitors against H1N1 influenza hemagglutinin and, in both cases, obtain variants with subnanomolar binding affinity. The most potent of these, a 51-residue protein, is broadly cross-reactive against all influenza group 1 hemagglutinins, including human H2, and neutralizes H1N1 viruses with a potency that rivals that of several human monoclonal antibodies, demonstrating that computational design followed by comprehensive energy landscape mapping can generate proteins with potential therapeutic utility.

  14. Antibody repertoire deep sequencing reveals antigen-independent selection in maturing B cells

    PubMed Central

    Kaplinsky, Joseph; Li, Anthony; Sun, Amy; Coffre, Maryaline; Koralov, Sergei B.; Arnaout, Ramy

    2014-01-01

    Antibody repertoires are known to be shaped by selection for antigen binding. Unexpectedly, we now show that selection also acts on a non–antigen-binding antibody region: the heavy-chain variable (VH)–encoded “elbow” between variable and constant domains. By sequencing 2.8 million recombined heavy-chain genes from immature and mature B-cell subsets in mice, we demonstrate a striking gradient in VH gene use as pre-B cells mature into follicular and then into marginal zone B cells. Cells whose antibodies use VH genes that encode a more flexible elbow are more likely to mature. This effect is distinct from, and exceeds in magnitude, previously described maturation-associated changes in heavy-chain complementarity determining region 3, a key antigen-binding region, which arise from junctional diversity rather than differential VH gene use. Thus, deep sequencing reveals a previously unidentified mode of B-cell selection. PMID:24927543

  15. Optimization of affinity, specificity and function of designed influenza inhibitors using deep sequencing

    PubMed Central

    Whitehead, Timothy A; Chevalier, Aaron; Song, Yifan; Dreyfus, Cyrille; Fleishman, Sarel J; De Mattos, Cecilia; Myers, Chris A; Kamisetty, Hetunandan; Blair, Patrick; Wilson, Ian A; Baker, David

    2013-01-01

    We show that comprehensive sequence-function maps obtained by deep sequencing can be used to reprogram interaction specificity and to leapfrog over bottlenecks in affinity maturation by combining many individually small contributions not detectable in conventional approaches. We use this approach to optimize two computationally designed inhibitors against H1N1 influenza hemagglutinin and, in both cases, obtain variants with subnanomolar binding affinity. The most potent of these, a 51-residue protein, is broadly cross-reactive against all influenza group 1 hemagglutinins, including human H2, and neutralizes H1N1 viruses with a potency that rivals that of several human monoclonal antibodies, demonstrating that computational design followed by comprehensive energy landscape mapping can generate proteins with potential therapeutic utility. PMID:22634563

  16. Detection and characterization of mycoviruses in arbuscular mycorrhizal fungi by deep-sequencing.

    PubMed

    Ezawa, Tatsuhiro; Ikeda, Yoji; Shimura, Hanako; Masuta, Chikara

    2015-01-01

    Fungal viruses (mycoviruses) often have a significant impact not only on phenotypic expression of the host fungus but also on higher order biological interactions, e.g., conferring plant stress tolerance via an endophytic host fungus. Arbuscular mycorrhizal (AM) fungi in the phylum Glomeromycota associate with most land plants and supply mineral nutrients to the host plants. So far, little information about mycoviruses has been obtained in these fungi due to their obligate biotrophic nature. Here we provide a technical breakthrough, a "two-step strategy" in combination with deep-sequencing, for virological study in AM fungi; dsRNA is first extracted and sequenced using material obtained from highly productive open pot culture, and then the presence of viruses is verified using pure material produced in the in vitro monoxenic culture. This approach enabled us to demonstrate the presence of several viruses for the first time from a glomeromycotan fungus. PMID:25287503

  17. Inside the intraterrestrials: The deep biosphere seen through massively parallel sequencing

    NASA Astrophysics Data System (ADS)

    Biddle, J.

    2009-12-01

    Deeply buried marine sediments may house a large amount of the Earth’s microbial population. Initial studies based on 16S rRNA clone libraries suggest that these sediments contain unique phylotypes of microorganisms, particularly from the archaeal domain. Since this environment is so difficult to study, microbiologists are challenged to find ways to examine these populations remotely. A major approach taken to study this environment uses massively parallel sequencing to examine the inner genetic workings of these microorganisms after the sediment has been drilled. Both metagenomics and tagged amplicon sequencing have been employed on deep sediments, and initial results show that different geographic regions can be differentiated through genomics and also minor populations may cause major geochemical changes.

  18. Rapid fine conformational epitope mapping using comprehensive mutagenesis and deep sequencing.

    PubMed

    Kowalsky, Caitlin A; Faber, Matthew S; Nath, Aritro; Dann, Hailey E; Kelly, Vince W; Liu, Li; Shanker, Purva; Wagner, Ellen K; Maynard, Jennifer A; Chan, Christina; Whitehead, Timothy A

    2015-10-30

    Knowledge of the fine location of neutralizing and non-neutralizing epitopes on human pathogens affords a better understanding of the structural basis of antibody efficacy, which will expedite rational design of vaccines, prophylactics, and therapeutics. However, full utilization of the wealth of information from single cell techniques and antibody repertoire sequencing awaits the development of a high throughput, inexpensive method to map the conformational epitopes for antibody-antigen interactions. Here we show such an approach that combines comprehensive mutagenesis, cell surface display, and DNA deep sequencing. We develop analytical equations to identify epitope positions and show the method's effectiveness by mapping the fine epitope for different antibodies targeting TNF, pertussis toxin, and the cancer target TROP2. In all three cases, the experimentally determined conformational epitope was consistent with previous experimental datasets, confirming the reliability of the experimental pipeline. Once the comprehensive library is generated, fine conformational epitope maps can be prepared at a rate of four per day. PMID:26296891

  19. Deep sequencing of New World screw-worm transcripts to discover genes involved in insecticide resistance

    PubMed Central

    2010-01-01

    Background The New World screw-worm (NWS), Cochliomyia hominivorax, is one of the most important myiasis-causing flies, causing severe losses to the livestock industry. In its current geographical distribution, this species has been controlled by the application of insecticides, mainly organophosphate (OP) compounds, but a number of lineages have been identified that are resistant to such chemicals. Despite its economic importance, only limited genetic information is available for the NWS. Here, as a part of an effort to characterize the C. hominivorax genome and identify putative genes involved in insecticide resistance, we sampled its transcriptome by deep sequencing of polyadenylated transcripts using the 454 sequencing technology. Results Deep sequencing on the 454 platform of three normalized libraries (larval, adult male and adult female) generated a total of 548,940 reads. Eighteen candidate genes coding for three metabolic detoxification enzyme families, cytochrome P450 monooxygenases, glutathione S-transferases and carboxyl/cholinesterases were selected and gene expression levels were measured using quantitative real-time polymerase chain reaction (qRT-PCR). Of the investigated candidates, only one gene was expressed differently between control and resistant larvae with, at least, a 10-fold down-regulation in the resistant larvae. The presence of mutations in the acetylcholinesterase (target site) and carboxylesterase E3 genes was investigated and all of the resistant flies presented E3 mutations previously associated with insecticide resistance. Conclusions Here, we provided the largest database of NWS expressed sequence tags that is an important resource, not only for further studies on the molecular basis of the OP resistance in NWS fly, but also for functional and comparative studies among Calliphoridae flies. 
    Among our candidates, only one gene was found to be differentially expressed in resistant individuals, and its role in insecticide resistance should

  20. Draft Genome Sequence of Caloranaerobacter sp. TR13, an Anaerobic Thermophilic Bacterium Isolated from a Deep-Sea Hydrothermal Vent.

    PubMed

    Zhou, Meixian; Xie, Yunbiao; Dong, Binbin; Liu, Qing; Chen, Xiaoyao

    2015-01-01

    Here, we report the draft 2,261,881-bp genome sequence of Caloranaerobacter sp. TR13, isolated from a deep-sea hydrothermal vent on the East Pacific Rise. The sequence will be helpful for understanding the genetic and metabolic features, as well as potential biotechnological application in the genus Caloranaerobacter. PMID:26679595

  1. Draft Genome Sequence of Psychrobacter piscatorii Strain LQ58, a Psychrotolerant Bacterium Isolated from a Deep-Sea Hydrothermal Vent.

    PubMed

    Zhou, Meixian; Dong, Binbin; Liu, Qing

    2016-01-01

    Here, we report the 3.1-Mb draft genome sequence of Psychrobacter piscatorii strain LQ58, isolated from a deep-sea hydrothermal vent on the East Pacific Rise. The sequence will provide further insight into the environmental adaptation of psychrotolerant bacteria and the development of novel cold-active enzymes for industrial application. PMID:26941137

  2. Draft Genome Sequence of Psychrobacter piscatorii Strain LQ58, a Psychrotolerant Bacterium Isolated from a Deep-Sea Hydrothermal Vent

    PubMed Central

    Dong, Binbin; Liu, Qing

    2016-01-01

    Here, we report the 3.1-Mb draft genome sequence of Psychrobacter piscatorii strain LQ58, isolated from a deep-sea hydrothermal vent on the East Pacific Rise. The sequence will provide further insight into the environmental adaptation of psychrotolerant bacteria and the development of novel cold-active enzymes for industrial application. PMID:26941137

  3. Draft Genome Sequence of Caloranaerobacter sp. TR13, an Anaerobic Thermophilic Bacterium Isolated from a Deep-Sea Hydrothermal Vent

    PubMed Central

    Xie, Yunbiao; Dong, Binbin; Liu, Qing; Chen, Xiaoyao

    2015-01-01

    Here, we report the draft 2,261,881-bp genome sequence of Caloranaerobacter sp. TR13, isolated from a deep-sea hydrothermal vent on the East Pacific Rise. The sequence will be helpful for understanding the genetic and metabolic features, as well as potential biotechnological application in the genus Caloranaerobacter. PMID:26679595

  4. Population-genomic variation within RNA viruses of the Western honey bee, Apis mellifera, inferred from deep sequencing

    Technology Transfer Automated Retrieval System (TEKTRAN)

    Deep sequencing of viruses isolated from infected hosts is an efficient way to measure population-genetic variation and can reveal patterns of dispersal and natural selection. In this study, we mined existing Illumina sequence reads to investigate single-nucleotide polymorphisms (SNPs) within two RN...

  5. Human norovirus hyper-mutation revealed by ultra-deep sequencing.

    PubMed

    Cuevas, José M; Combe, Marine; Torres-Puente, Manoli; Garijo, Raquel; Guix, Susana; Buesa, Javier; Rodríguez-Díaz, Jesús; Sanjuán, Rafael

    2016-07-01

    Human noroviruses (NoVs) are a major cause of gastroenteritis worldwide. It is thought that, similar to other RNA viruses, high mutation rates allow NoVs to evolve fast and to undergo rapid immune escape at the population level. However, the rate and spectrum of spontaneous mutations of human NoVs have not been quantified previously. Here, we analyzed the intra-patient diversity of the NoV capsid by carrying out RT-PCR and ultra-deep sequencing with 100,000-fold coverage of 16 stool samples from symptomatic patients. This revealed the presence of low-frequency sequences carrying large numbers of U-to-C or A-to-G base transitions, suggesting a role for hyper-mutation in NoV diversity. To more directly test for hyper-mutation, we performed transfection assays in which the production of mutations was restricted to a single cell infection cycle. This confirmed the presence of sequences with multiple U-to-C/A-to-G transitions, and suggested that hyper-mutation contributed a large fraction of the total NoV spontaneous mutation rate. The type of changes produced and their sequence context are compatible with ADAR-mediated editing of the viral RNA. PMID:27094861

  6. Targeted Deep Sequencing Reveals No Definitive Evidence for Somatic Mosaicism in Atrial Fibrillation

    PubMed Central

    Roberts, Jason D.; Longoria, James; Poon, Annie; Gollob, Michael H.; Dewland, Thomas A.; Kwok, Pui-Yan; Olgin, Jeffrey E.; Deo, Rahul C.; Marcus, Gregory M.

    2015-01-01

    Background Studies of ≤15 atrial fibrillation (AF) patients have identified atrial-specific mutations within connexin genes, suggesting that somatic mutations may account for sporadic cases of the arrhythmia. We sought to identify atrial somatic mutations among patients with and without AF using targeted deep next-generation sequencing of 560 genes, including genetic culprits implicated in AF, the Mendelian cardiomyopathies and channelopathies, and all ion channels within the genome. Methods and Results Targeted gene capture and next-generation sequencing were performed on DNA from lymphocytes and left atrial appendages of 34 patients (25 with AF). Twenty AF patients had undergone cardiac surgery exclusively for pulmonary vein isolation, and 17 had no structural heart disease. Sequence alignment and variant calling were performed for each atrial-lymphocyte pair using the Burrows-Wheeler Aligner, the Genome Analysis Toolkit, and MuTect packages. Next-generation sequencing yielded a median 265-fold coverage depth (IQR 164–369). Comparison of the 3 million base pairs from each atrial-lymphocyte pair revealed a single potential somatic missense mutation in 3 AF patients and 2 in a single control (12 vs. 11%; p=1). All potential discordant variants had low allelic fractions (range: 2.3–7.3%) and none were detected with conventional sequencing. Conclusions Using high-depth next-generation sequencing and state-of-the-art somatic mutation calling approaches, no pathogenic atrial somatic mutations could be confirmed among 25 AF patients in a comprehensive cardiac arrhythmia genetic panel. These findings indicate that atrial-specific mutations are rare and that somatic mosaicism is unlikely to exert a prominent role in AF pathogenesis. PMID:25406240
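    The carrier proportions and p-value quoted in this abstract (12 vs. 11%; p=1) can be checked with a short calculation. The sketch below assumes a two-sided Fisher exact test on a 2x2 table of carriers versus non-carriers; the abstract does not name the test used, so that choice, the function name, and the variable names are illustrative assumptions. The counts come from the abstract: 3 of 25 AF patients and 1 of the remaining 9 controls (34 patients in total) carried a potential somatic variant.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]].

    Sums the hypergeometric probabilities of every table (with the same
    margins) that is no more likely than the observed one.
    """
    row1, row2 = a + b, c + d
    col1 = a + c
    denom = comb(row1 + row2, col1)

    def prob(k):
        # Probability that k of the col1 carriers fall in the first row.
        return comb(row1, k) * comb(row2, col1 - k) / denom

    p_obs = prob(a)
    lo = max(0, col1 - row2)
    hi = min(row1, col1)
    # Small relative tolerance guards against floating-point ties.
    p = sum(prob(k) for k in range(lo, hi + 1) if prob(k) <= p_obs * (1 + 1e-9))
    return min(1.0, p)


# Counts taken from the abstract; the test itself is an assumption.
af_with, af_total = 3, 25      # AF patients with a potential somatic variant
ctrl_with, ctrl_total = 1, 9   # controls (34 patients - 25 with AF)

af_pct = round(100 * af_with / af_total)
ctrl_pct = round(100 * ctrl_with / ctrl_total)
p_value = fisher_exact_two_sided(
    af_with, af_total - af_with, ctrl_with, ctrl_total - ctrl_with
)
print(af_pct, ctrl_pct, round(p_value, 3))
```

    Under these assumptions the percentages round to 12 and 11, and the observed table is the most probable one given its margins, so the two-sided p-value is 1, in agreement with the figures reported in the abstract.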

  7. Analysis of the full-length genome sequence of papaya lethal yellowing virus (PLYV), determined by deep sequencing, confirms its classification in the genus Sobemovirus.

    PubMed

    Pereira, Alvaro J; Alfenas-Zerbini, Poliane; Cascardo, Renan S; Andrade, Eduardo C; Murilo Zerbini, F

    2012-10-01

    Papaya lethal yellowing virus (PLYV) causes an economically important disease in papayas in northeastern Brazil. Based on biological and molecular properties, PLYV has been tentatively assigned to the genus Sobemovirus. We report the sequence of the full-length genome of a PLYV isolate from Brazil, determined by deep sequencing. The PLYV genome is 4,145 nt long and contains four ORFs, with an arrangement identical to that of sobemoviruses. The polyprotein and CP display significant sequence identity with the corresponding proteins of other sobemoviruses. Pairwise comparisons and phylogenetic analysis based on complete nucleotide sequences confirm the classification of PLYV in the genus Sobemovirus. PMID:22743825

  8. Evaluation of ultra-deep targeted sequencing for personalized breast cancer care

    PubMed Central

    2013-01-01

    Introduction The increasing number of targeted therapies, together with a deeper understanding of cancer genetics and drug response, have prompted major healthcare centers to implement personalized treatment approaches relying on high-throughput tumor DNA sequencing. However, the optimal way to implement this transformative methodology is not yet clear. Current assays may miss important clinical information such as the mutation allelic fraction, the presence of sub-clones or chromosomal rearrangements, or the distinction between inherited variants and somatic mutations. Here, we present the evaluation of ultra-deep targeted sequencing (UDT-Seq) to generate and interpret the molecular profile of 38 breast cancer patients from two academic medical centers. Methods We sequenced 47 genes in matched germline and tumor DNA samples from 38 breast cancer patients. The selected genes, or the pathways they belong to, can be targeted by drugs or are important in familial cancer risk or drug metabolism. Results Relying on the added value of sequencing matched tumor and germline DNA and using a dedicated analysis, UDT-Seq has a high sensitivity to identify mutations in tumors with low malignant cell content. Applying UDT-Seq to matched tumor and germline specimens from the 38 patients resulted in a proposal for at least one targeted therapy for 22 patients, the identification of tumor sub-clones in 3 patients, the suggestion of potential adverse drug effects in 3 patients and a recommendation for genetic counseling for 2 patients. Conclusion Overall our study highlights the additional benefits of a sequencing strategy, which includes germline DNA and is optimized for heterogeneous tumor tissues. PMID:24326041

  9. Deep sequencing of pigeonpea sterility mosaic virus discloses five RNA segments related to emaraviruses.

    PubMed

    Elbeaino, Toufic; Digiaro, Michele; Uppala, Mangala; Sudini, Harikishan

    2014-08-01

    The sequences of five viral RNA segments of pigeonpea sterility mosaic virus (PPSMV), the agent of sterility mosaic disease (SMD) of pigeonpea (Cajanus cajan, Fabaceae), were determined using deep sequencing technology. Each of the five RNAs encodes a single protein on the negative-sense strand with an open reading frame (ORF) of 6885, 1947, 927, 1086, and 1422 nt, respectively. In order, from RNA1 to RNA5, these ORFs encode the RNA-dependent RNA polymerase (p1, 267.9 kDa), a putative glycoprotein precursor (p2, 74.3 kDa), a putative nucleocapsid protein (p3, 34.6 kDa), a putative movement protein (p4, 40.8 kDa), while p5 (55 kDa) has an unknown function. All RNA segments of PPSMV showed the highest identity with orthologs of fig mosaic virus (FMV) and Rose rosette virus (RRV). In phylogenetic trees constructed with the amino acid sequences of p1, p2 and p3, PPSMV clustered consistently with other emaraviruses, close to clades comprising members of other genera of the family Bunyaviridae. Based on the molecular characteristics unveiled in this study and the morphological and epidemiological features similar to other emaraviruses, PPSMV seems to be the seventh species to join the list of emaraviruses known to date and accordingly, its classification in the genus Emaravirus seems now legitimate. PMID:24685674

  10. Deep transcriptome profiling of clinical Klebsiella pneumoniae isolates reveals strain and sequence type-specific adaptation.

    PubMed

    Bruchmann, Sebastian; Muthukumarasamy, Uthayakumar; Pohl, Sarah; Preusse, Matthias; Bielecka, Agata; Nicolai, Tanja; Hamann, Isabell; Hillert, Roger; Kola, Axel; Gastmeier, Petra; Eckweiler, Denitsa; Häussler, Susanne

    2015-11-01

    Health-care-associated infections by multi-drug-resistant bacteria constitute one of the greatest challenges to modern medicine. Bacterial pathogens devise various mechanisms to withstand the activity of a wide range of antimicrobial compounds, among which the acquisition of carbapenemases is one of the most concerning. In Klebsiella pneumoniae, the dissemination of the K. pneumoniae carbapenemase is tightly connected to the global spread of certain clonal lineages. Although antibiotic resistance is a key driver for the global distribution of epidemic high-risk clones, there seem to be other adaptive traits that may explain their success. Here, we exploited the power of deep transcriptome profiling (RNA-seq) to shed light on the transcriptomic landscape of 37 clinical K. pneumoniae isolates of diverse phylogenetic origins. We identified a large set of 3346 genes which was expressed in all isolates. While the core-transcriptome profiles varied substantially between groups of different sequence types, they were more homogenous among isolates of the same sequence type. We furthermore linked the detailed information on differentially expressed genes with the clinically relevant phenotypes of biofilm formation and bacterial virulence. This allowed for the identification of a diminished expression of biofilm-specific genes within the low biofilm producing ST258 isolates as a sequence type-specific trait. PMID:26261087

  11. Heteroplasmic substitutions in the entire mitochondrial genomes of human colon cells detected by ultra-deep 454 sequencing.

    PubMed

    Skonieczna, Katarzyna; Malyarchuk, Boris; Jawień, Arkadiusz; Marszałek, Andrzej; Banaszkiewicz, Zbigniew; Jarmocik, Paweł; Borcz, Marcelina; Bała, Piotr; Grzybowski, Tomasz

    2015-03-01

    Mitochondrial DNA (mtDNA) heteroplasmy has been widely described from clinical, evolutionary and analytical points of view. Historically, the majority of studies have been based on Sanger sequencing. However, next-generation sequencing technologies are now being used for heteroplasmy analysis. Ultra-deep sequencing approaches provide increased sensitivity for detecting minority variants. However, a phylogenetic a posteriori analysis revealed that most of the next-generation sequencing data published to date suffers from shortcomings. Because implementation of new technologies in clinical, population, or forensic studies requires proper verification, in this paper we present a direct comparison of ultra-deep 454 and Sanger sequencing for the detection of heteroplasmy in complete mitochondrial genomes of normal colon cells. The spectrum of heteroplasmic mutations is discussed against the background of mitochondrial DNA variability in human populations. PMID:25465762

  12. Deep Sequencing Analysis of Nucleolar Small RNAs: RNA Isolation and Library Preparation.

    PubMed

    Bai, Baoyan; Laiho, Marikki

    2016-01-01

    The nucleolus is a subcellular compartment with an essential function in ribosome biogenesis. The nucleolus is rich in noncoding RNAs, mostly the ribosomal RNAs and small nucleolar RNAs. Surprisingly, several miRNAs have also been detected in the nucleolus, raising the question as to whether other small RNA species are present and functional in the nucleolus. We have developed a strategy for stepwise enrichment of nucleolar small RNAs from the total nucleolar RNA extracts and subsequent construction of nucleolar small RNA libraries which are suitable for deep sequencing. Our method successfully isolates the small RNA population from total RNAs and monitors the RNA quality in each step to ensure that small RNAs recovered represent the actual small RNA population in the nucleolus and not degradation products from larger RNAs. We have further applied this approach to characterize the distribution of small RNAs in different cellular compartments. PMID:27576723

  13. Analysis of Plasmodium falciparum diversity in natural infections by deep sequencing

    PubMed Central

    Manske, Magnus; Miotto, Olivo; Campino, Susana; Auburn, Sarah; Almagro-Garcia, Jacob; Maslen, Gareth; O’Brien, Jack; Djimde, Abdoulaye; Doumbo, Ogobara; Zongo, Issaka; Ouedraogo, Jean-Bosco; Michon, Pascal; Mueller, Ivo; Siba, Peter; Nzila, Alexis; Borrmann, Steffen; Kiara, Steven M.; Marsh, Kevin; Jiang, Hongying; Su, Xin-Zhuan; Amaratunga, Chanaki; Fairhurst, Rick; Socheat, Duong; Nosten, Francois; Imwong, Mallika; White, Nicholas J.; Sanders, Mandy; Anastasi, Elisa; Alcock, Dan; Drury, Eleanor; Oyola, Samuel; Quail, Michael A.; Turner, Daniel J.; Rubio, Valentin Ruano; Jyothi, Dushyanth; Amenga-Etego, Lucas; Hubbart, Christina; Jeffreys, Anna; Rowlands, Kate; Sutherland, Colin; Roper, Cally; Mangano, Valentina; Modiano, David; Tan, John C.; Ferdig, Michael T.; Amambua-Ngwa, Alfred; Conway, David J.; Takala-Harrison, Shannon; Plowe, Christopher V.; Rayner, Julian C.; Rockett, Kirk A.; Clark, Taane G.; Newbold, Chris I.; Berriman, Matthew; MacInnis, Bronwyn; Kwiatkowski, Dominic P.

    2013-01-01

    Malaria elimination strategies require surveillance of the parasite population for genetic changes that demand a public health response, such as new forms of drug resistance. Here we describe methods for large-scale analysis of genetic variation in Plasmodium falciparum by deep sequencing of parasite DNA obtained from the blood of patients with malaria, either directly or after short term culture. Analysis of 86,158 exonic SNPs that passed genotyping quality control in 227 samples from Africa, Asia and Oceania provides genome-wide estimates of allele frequency distribution, population structure and linkage disequilibrium. By comparing the genetic diversity of individual infections with that of the local parasite population, we derive a metric of within-host diversity that is related to the level of inbreeding in the population. An open-access web application has been established for exploration of regional differences in allele frequency and of highly differentiated loci in the P. falciparum genome. PMID:22722859

  14. Polymorphism Identification and Improved Genome Annotation of Brassica rapa Through Deep RNA Sequencing

    PubMed Central

    Devisetty, Upendra Kumar; Covington, Michael F.; Tat, An V.; Lekkala, Saradadevi; Maloof, Julin N.

    2014-01-01

    The mapping and functional analysis of quantitative traits in Brassica rapa can be greatly improved with the availability of physically positioned, gene-based genetic markers and accurate genome annotation. In this study, deep transcriptome RNA sequencing (RNA-Seq) of Brassica rapa was undertaken with two objectives: SNP detection and improved transcriptome annotation. We performed SNP detection on two varieties that are parents of a mapping population to aid in development of a marker system for this population and subsequent development of a high-resolution genetic map. An improved Brassica rapa transcriptome was constructed to detect novel transcripts and to improve the current genome annotation. This is useful for accurate mRNA abundance and detection of expression QTL (eQTLs) in mapping populations. Deep RNA-Seq of two Brassica rapa genotypes—R500 (var. trilocularis, Yellow Sarson) and IMB211 (a rapid cycling variety)—using eight different tissues (root, internode, leaf, petiole, apical meristem, floral meristem, silique, and seedling) grown across three different environments (growth chamber, greenhouse and field) and under two different treatments (simulated sun and simulated shade) generated 2.3 billion high-quality Illumina reads. A total of 330,995 SNPs were identified in transcribed regions between the two genotypes with an average frequency of one SNP in every 200 bases. The deep RNA-Seq reassembled Brassica rapa transcriptome identified 44,239 protein-coding genes. Compared with current gene models of B. rapa, we detected 3537 novel transcripts, 23,754 gene models had structural modifications, and 3655 annotated proteins changed. Gaps in the current genome assembly of B. rapa are highlighted by our identification of 780 unmapped transcripts. All the SNPs, annotations, and predicted transcripts can be viewed at http://phytonetworks.ucdavis.edu/. PMID:25122667

  15. Identification of MicroRNAs in Meloidogyne incognita Using Deep Sequencing

    PubMed Central

    Wang, Yunsheng; Mao, Zhenchuan; Yan, Jin; Cheng, Xinyue; Liu, Feng; Xiao, Luo; Dai, Liangying; Luo, Feng; Xie, Bingyan

    2015-01-01

    MicroRNAs play important regulatory roles in eukaryotic lineages. In this paper, we employed deep sequencing technology to sequence and identify microRNAs in the genome of M. incognita, an important plant-parasitic nematode. We identified 102 M. incognita microRNA genes, which can be grouped into 71 nonredundant miRNAs based on mature sequences. Among the 71 miRNAs, 27 are known miRNAs and 44 are novel miRNAs. We identified seven miRNA clusters in the M. incognita genome. Four of the seven clusters, miR-100/let-7, miR-71-1/miR-2a-1, miR-71-2/miR-2a-2 and miR-279/miR-2b, are conserved in other species. We validated the expression of 5 M. incognita microRNAs, including 3 known microRNAs (miR-71, miR-100b and let-7) and 2 novel microRNAs (NOVEL-1 and NOVEL-2), using RT-PCR. All 5 microRNAs were detected. The expression levels of four microRNAs obtained using RT-PCR were consistent with those obtained by high-throughput sequencing except for those of let-7. We also examined how M. incognita miRNAs are conserved in four other nematode species: C. elegans, A. suum, B. malayi and P. pacificus. We found that four microRNAs, miR-100, miR-92, miR-279 and miR-137, exist only in the genomes of parasitic nematodes and not in the genome of the free-living nematode C. elegans. Our research created a unique resource for the research of plant-parasitic nematodes. The candidate microRNAs could help elucidate the genomic structure, gene regulation, evolutionary processes, and developmental features of plant-parasitic nematodes and nematode-plant interaction. PMID:26241472

  16. Deep Sequencing the Transcriptome Reveals Seasonal Adaptive Mechanisms in a Hibernating Mammal

    PubMed Central

    Hampton, Marshall; Melvin, Richard G.; Kendall, Anne H.; Kirkpatrick, Brian R.; Peterson, Nichole; Andrews, Matthew T.

    2011-01-01

    Mammalian hibernation is a complex phenotype involving metabolic rate reduction, bradycardia, profound hypothermia, and a reliance on stored fat that allows the animal to survive for months without food in a state of suspended animation. To determine the genes responsible for this phenotype in the thirteen-lined ground squirrel (Ictidomys tridecemlineatus) we used the Roche 454 platform to sequence mRNA isolated at six points throughout the year from three key tissues: heart, skeletal muscle, and white adipose tissue (WAT). Deep sequencing generated approximately 3.7 million cDNA reads from 18 samples (6 time points × 3 tissues) with a mean read length of 335 bases. Of these, 3,125,337 reads were assembled into 140,703 contigs. Approximately 90% of all sequences were matched to proteins in the human UniProt database. The total number of distinct human proteins matched by ground squirrel transcripts was 13,637 for heart, 12,496 for skeletal muscle, and 14,351 for WAT. Extensive mitochondrial RNA sequences enabled a novel approach of using the transcriptome to construct the complete mitochondrial genome for I. tridecemlineatus. Seasonal and activity-specific changes in mRNA levels that met our stringent false discovery rate cutoff (1.0×10⁻¹¹) were used to identify patterns of gene expression involving various aspects of the hibernation phenotype. Among these patterns are differentially expressed genes encoding heart proteins AT1A1, NAC1 and RYR2 controlling ion transport required for contraction and relaxation at low body temperatures. Abundant RNAs in skeletal muscle coding ubiquitin pathway proteins ASB2, UBC and DDB1 peak in October, suggesting an increase in muscle proteolysis. Finally, genes in WAT that encode proteins involved in lipogenesis (ACOD, FABP4) are highly expressed in August, but gradually decline in expression during the seasonal transition to lipolysis. PMID:22046435

  17. Evaluation of Nine Somatic Variant Callers for Detection of Somatic Mutations in Exome and Targeted Deep Sequencing Data

    PubMed Central

    Krøigård, Anne Bruun; Thomassen, Mads; Lænkholm, Anne-Vibeke; Kruse, Torben A.; Larsen, Martin Jakob

    2016-01-01

    Next generation sequencing is extensively applied to catalogue somatic mutations in cancer, in research settings and increasingly in clinical settings for molecular diagnostics, guiding therapy decisions. Somatic variant callers perform paired comparisons of sequencing data from cancer tissue and matched normal tissue in order to detect somatic mutations. The advent of many new somatic variant callers creates a need for comparison and validation of the tools, as no de facto standard for detection of somatic mutations exists and only limited comparisons have been reported. We have performed a comprehensive evaluation using exome sequencing and targeted deep sequencing data of paired tumor-normal samples from five breast cancer patients to evaluate the performance of nine publicly available somatic variant callers: EBCall, Mutect, Seurat, Shimmer, Indelocator, Somatic Sniper, Strelka, VarScan 2 and Virmid for the detection of single nucleotide mutations and small deletions and insertions. We report a large variation in the number of calls from the nine somatic variant callers on the same sequencing data and highly variable agreement. Sequencing depth had markedly diverse impact on individual callers, as for some callers, increased sequencing depth highly improved sensitivity. For SNV calling, we report EBCall, Mutect, Virmid and Strelka to be the most reliable somatic variant callers for both exome sequencing and targeted deep sequencing. For indel calling, EBCall is superior due to high sensitivity and robustness to changes in sequencing depths. PMID:27002637

  18. Uncovering microRNA-mediated response to SO2 stress in Arabidopsis thaliana by deep sequencing.

    PubMed

    Li, Lihong; Xue, Meizhao; Yi, Huilan

    2016-10-01

    Sulfur dioxide (SO2) is a major air pollutant and has significant impacts on plants. MicroRNAs (miRNAs) are a class of gene expression regulators that play important roles in response to environmental stresses. In this study, deep sequencing was used for genome-wide identification of miRNAs and their expression profiles in response to SO2 stress in Arabidopsis thaliana shoots. A total of 27 conserved miRNAs and 5 novel miRNAs were found to be differentially expressed under SO2 stress. qRT-PCR analysis showed mostly negative correlation between miRNA accumulation and target gene mRNA abundance, suggesting regulatory roles of these miRNAs during SO2 exposure. The target genes of SO2-responsive miRNAs encode transcription factors and proteins that regulate auxin signaling and stress response, and the miRNA-mediated suppression of these genes could improve plant resistance to SO2 stress. Promoter sequence analysis of genes encoding SO2-responsive miRNAs showed that stress-responsive and phytohormone-related cis-regulatory elements occurred frequently, providing additional evidence of the involvement of miRNAs in adaptation to SO2 stress. This study represents a comprehensive expression profiling of SO2-responsive miRNAs in Arabidopsis and broadens our perspective on the ubiquitous regulatory roles of miRNAs under stress conditions. PMID:27232729

  19. Transcript analysis of a goat mesenteric lymph node by deep next-generation sequencing.

    PubMed

    E, G X; Zhao, Y J; Na, R S; Huang, Y F

    2016-01-01

    Deep RNA sequencing (RNA-seq) provides a practical and inexpensive alternative for exploring genomic data in non-model organisms. The functional annotation of non-model mammalian genomes, such as that of goats, is still poor compared to that of humans and mice. In the current study, we performed a whole transcriptome analysis of an intestinal mucous membrane lymph node to comprehensively characterize the transcript catalogue of this tissue in a goat. Using an Illumina HiSeq 4000 sequencing platform, 9.692 GB of raw reads were acquired. A total of 57,526 lymph transcripts were obtained, and the majority of these were mapped to known transcriptional units (42.67%). A comparison of the mRNA expression of the mesenteric lymph nodes during the juvenile and post-adolescent stages revealed 8949 transcripts that were differentially expressed, including 6174 known genes. In addition, we functionally classified these transcripts using Gene Ontology (GO) and the Kyoto Encyclopedia of Genes and Genomes (KEGG) terms. A total of 6174 known genes were assigned to 64 GO terms, and 3782 genes were assigned to 303 KEGG pathways, including some related to immunity. Our results reveal the complex transcriptome profile of the lymph node and suggest that the immune system is immature in the mesenteric lymph nodes of juvenile goats. PMID:27173308

  20. Deep sequencing of the MHC region in the Chinese population contributes to studies of complex disease.

    PubMed

    Zhou, Fusheng; Cao, Hongzhi; Zuo, Xianbo; Zhang, Tao; Zhang, Xiaoguang; Liu, Xiaomin; Xu, Ricong; Chen, Gang; Zhang, Yuanwei; Zheng, Xiaodong; Jin, Xin; Gao, Jinping; Mei, Junpu; Sheng, Yujun; Li, Qibin; Liang, Bo; Shen, Juan; Shen, Changbing; Jiang, Hui; Zhu, Caihong; Fan, Xing; Xu, Fengping; Yue, Min; Yin, Xianyong; Ye, Chen; Zhang, Cuicui; Liu, Xiao; Yu, Liang; Wu, Jinghua; Chen, Mengyun; Zhuang, Xuehan; Tang, Lili; Shao, Haojing; Wu, Longmao; Li, Jian; Xu, Yu; Zhang, Yijie; Zhao, Suli; Wang, Yu; Li, Ge; Xu, Hanshi; Zeng, Lei; Wang, Jianan; Bai, Mingzhou; Chen, Yanling; Chen, Wei; Kang, Tian; Wu, Yanyan; Xu, Xun; Zhu, Zhengwei; Cui, Yong; Wang, Zaixing; Yang, Chunjun; Wang, Peiguang; Xiang, Leihong; Chen, Xiang; Zhang, Anping; Gao, Xinghua; Zhang, Furen; Xu, Jinhua; Zheng, Min; Zheng, Jie; Zhang, Jianzhong; Yu, Xueqing; Li, Yingrui; Yang, Sen; Yang, Huanming; Wang, Jian; Liu, Jianjun; Hammarström, Lennart; Sun, Liangdan; Wang, Jun; Zhang, Xuejun

    2016-07-01

    The human major histocompatibility complex (MHC) region has been shown to be associated with numerous diseases. However, it remains a challenge to pinpoint the causal variants for these associations because of the extreme complexity of the region. We thus sequenced the entire 5-Mb MHC region in 20,635 individuals of Han Chinese ancestry (10,689 controls and 9,946 patients with psoriasis) and constructed a Han-MHC database that includes both variants and HLA gene typing results of high accuracy. We further identified multiple independent new susceptibility loci in HLA-C, HLA-B, HLA-DPB1 and BTNL2 and an intergenic variant, rs118179173, associated with psoriasis and confirmed the well-established risk allele HLA-C*06:02. We anticipate that our Han-MHC reference panel built by deep sequencing of a large number of samples will serve as a useful tool for investigating the role of the MHC region in a variety of diseases and thus advance understanding of the pathogenesis of these disorders. PMID:27213287

  1. Frequency-based haplotype reconstruction from deep sequencing data of bacterial populations

    PubMed Central

    Pulido-Tamayo, Sergio; Sánchez-Rodríguez, Aminael; Swings, Toon; Van den Bergh, Bram; Dubey, Akanksha; Steenackers, Hans; Michiels, Jan; Fostier, Jan; Marchal, Kathleen

    2015-01-01

    Clonal populations accumulate mutations over time, resulting in different haplotypes. Deep sequencing of such a population in principle provides information to reconstruct these haplotypes and the frequency at which the haplotypes occur. However, this reconstruction is technically not trivial, especially not in clonal systems with a relatively low mutation frequency. The low number of segregating sites in those systems adds ambiguity to the haplotype phasing and thus obviates the reconstruction of genome-wide haplotypes based on sequence overlap information. Therefore, we present EVORhA, a haplotype reconstruction method that complements phasing information in the non-empty read overlap with the frequency estimations of inferred local haplotypes. As was shown with simulated data, as soon as read lengths and/or mutation rates become restrictive for state-of-the-art methods, the use of this additional frequency information allows EVORhA to still reliably reconstruct genome-wide haplotypes. On real data, we show the applicability of the method in reconstructing the population composition of evolved bacterial populations and in decomposing mixed bacterial infections from clinical samples. PMID:25990729

  2. Deep sequencing identifies genetic heterogeneity and recurrent convergent evolution in chronic lymphocytic leukemia

    PubMed Central

    Ojha, Juhi; Ayres, Jackline; Secreto, Charla; Tschumper, Renee; Rabe, Kari; Van Dyke, Daniel; Slager, Susan; Shanafelt, Tait; Fonseca, Rafael; Kay, Neil E.

    2015-01-01

    Recent high-throughput sequencing and microarray studies have characterized the genetic landscape and clonal complexity of chronic lymphocytic leukemia (CLL). Here, we performed a longitudinal study in a homogeneously treated cohort of 12 patients, with sequential samples obtained at comparable stages of disease. We identified clonal competition between 2 or more genetic subclones in 70% of the patients with relapse, and stable clonal dynamics in the remaining 30%. By deep sequencing, we identified a high reservoir of genetic heterogeneity in the form of several driver genes mutated in small subclones underlying the disease course. Furthermore, in 2 patients, we identified convergent evolution, characterized by the combination of genetic lesions affecting the same genes or copy number abnormality in different subclones. The phenomenon affects multiple CLL putative driver abnormalities, including mutations in NOTCH1, SF3B1, DDX3X, and del(11q23). This is the first report documenting convergent evolution as a recurrent event in the CLL genome. Furthermore, this finding suggests the selective advantage of specific combinations of genetic lesions for CLL pathogenesis in a subset of patients. PMID:25377784

  3. Deep RNA sequencing analysis of readthrough gene fusions in human prostate adenocarcinoma and reference samples

    PubMed Central

    2011-01-01

    Background Readthrough fusions across adjacent genes in the genome, or transcription-induced chimeras (TICs), have been estimated using expressed sequence tag (EST) libraries to involve 4-6% of all genes. Deep transcriptional sequencing (RNA-Seq) now makes it possible to study the occurrence and expression levels of TICs in individual samples across the genome. Methods We performed single-end RNA-Seq on three human prostate adenocarcinoma samples and their corresponding normal tissues, as well as brain and universal reference samples. We developed two bioinformatics methods to specifically identify TIC events: a targeted alignment method using artificial exon-exon junctions within 200,000 bp from adjacent genes, and genomic alignment allowing splicing within individual reads. We performed further experimental verification and characterization of selected TIC and fusion events using quantitative RT-PCR and comparative genomic hybridization microarrays. Results Targeted alignment against artificial exon-exon junctions yielded 339 distinct TIC events, including 32 gene pairs with multiple isoforms. The false discovery rate was estimated to be 1.5%. Spliced alignment to the genome was less sensitive, finding only 18% of those found by targeted alignment in 33-nt reads and 59% of those in 50-nt reads. However, spliced alignment revealed 30 cases of TICs with intervening exons, in addition to distant inversions, scrambled genes, and translocations. Our findings increase the catalog of observed TIC gene pairs by 66%. We verified 6 of 6 predicted TICs in all prostate samples, and 2 of 5 predicted novel distant gene fusions, both private events among 54 prostate tumor samples tested. Expression of TICs correlates with that of the upstream gene, which can explain the prostate-specific pattern of some TIC events and the restriction of the SLC45A3-ELK4 e4-e2 TIC to ERG-negative prostate samples, as confirmed in 20 matched prostate tumor and normal samples and 9 lung cancer

  4. Deep-sequencing transcriptome analysis of chilling tolerance mechanisms of a subnival alpine plant, Chorispora bungeana

    PubMed Central

    2012-01-01

    Background The plant tolerance mechanisms to low temperature have been studied extensively in the model plant Arabidopsis at the transcriptional level. However, few studies were carried out in plants with strong inherited cold tolerance. Chorispora bungeana is a subnival alpine plant possessing strong cold tolerance mechanisms. To get a deeper insight into its cold tolerance mechanisms, the transcriptome profiles of chilling-treated C. bungeana seedlings were analyzed by Illumina deep-sequencing and compared with Arabidopsis. Results Two cDNA libraries constructed from mRNAs of control and chilling-treated seedlings were sequenced by Illumina technology. A total of 54,870 unigenes were obtained by de novo assembly, and 3,484 chilling up-regulated and 4,571 down-regulated unigenes were identified. The expressions of 18 out of top 20 up-regulated unigenes were confirmed by qPCR analysis. Functional network analysis of the up-regulated genes revealed some common biological processes, including cold responses, and molecular functions in C. bungeana and Arabidopsis responding to chilling. Karrikins were found as new plant growth regulators involved in chilling responses of C. bungeana and Arabidopsis. However, genes involved in cold acclimation were enriched in chilling up-regulated genes in Arabidopsis but not in C. bungeana. In addition, although transcription activations were stimulated in both C. bungeana and Arabidopsis, no CBF putative ortholog was up-regulated in C. bungeana while CBF2 and CBF3 were chilling up-regulated in Arabidopsis. On the other hand, up-regulated genes related to protein phosphorylation and auto-ubiquitination processes were over-represented in C. bungeana but not in Arabidopsis. Conclusions We conducted the first deep-sequencing transcriptome profiling and chilling stress regulatory network analysis of C. bungeana, a subnival alpine plant with inherited cold tolerance. Comparative transcriptome analysis suggests that cold acclimation is not

  5. An optimized kit-free method for making strand-specific deep sequencing libraries from RNA fragments.

    PubMed

    Heyer, Erin E; Ozadam, Hakan; Ricci, Emiliano P; Cenik, Can; Moore, Melissa J

    2015-01-01

    Deep sequencing of strand-specific cDNA libraries is now a ubiquitous tool for identifying and quantifying RNAs in diverse sample types. The accuracy of conclusions drawn from these analyses depends on precise and quantitative conversion of the RNA sample into a DNA library suitable for sequencing. Here, we describe an optimized method of preparing strand-specific RNA deep sequencing libraries from small RNAs and variably sized RNA fragments obtained from ribonucleoprotein particle footprinting experiments or fragmentation of long RNAs. Our approach works across a wide range of input amounts (400 pg to 200 ng), is easy to follow and produces a library in 2-3 days at relatively low reagent cost, all while giving the user complete control over every step. Because all enzymatic reactions were optimized and driven to apparent completion, sequence diversity and species abundance in the input sample are well preserved. PMID:25505164

  6. Deep Sequencing Analysis Reveals Temporal Microbiota Changes Associated with Development of Bovine Digital Dermatitis

    PubMed Central

    Krull, Adam C.; Shearer, Jan K.; Gorden, Patrick J.; Cooper, Vickie L.; Phillips, Gregory J.

    2014-01-01

    Bovine digital dermatitis (DD) is a leading cause of lameness in dairy cattle throughout the world. Despite 35 years of research, the definitive etiologic agent associated with the disease process is still unknown. Previous studies have demonstrated that multiple bacterial species are associated with lesions, with spirochetes being the most reliably identified organism. This study details the deep sequencing-based metagenomic evaluation of 48 staged DD biopsy specimens collected during a 3-year longitudinal study of disease progression. Over 175 million sequences were evaluated by utilizing both shotgun and 16S metagenomic techniques. Based on the shotgun sequencing results, there was no evidence of a fungal or DNA viral etiology. The bacterial microbiota of biopsy specimens progresses through a systematic series of changes that correlate with the novel morphological lesion scoring system developed as part of this project. This scoring system was validated, as the microbiota of each stage was statistically significantly different from those of other stages (P < 0.001). The microbiota of control biopsy specimens were the most diverse and became less diverse as lesions developed. Although Treponema spp. predominated in the advanced lesions, they were in relatively low abundance in the newly described early lesions that are associated with the initiation of the disease process. The consortium of Treponema spp. identified at the onset of disease changes considerably as the lesions progress through the morphological stages identified. The results of this study support the hypothesis that DD is a polybacterial disease process and provide unique insights into the temporal changes in bacterial populations throughout lesion development. PMID:24866801

  7. Reconstructing the Dynamics of HIV Evolution within Hosts from Serial Deep Sequence Data

    PubMed Central

    Poon, Art F. Y.; Swenson, Luke C.; Bunnik, Evelien M.; Edo-Matas, Diana; Schuitemaker, Hanneke; van 't Wout, Angélique B.; Harrigan, P. Richard

    2012-01-01

    At the early stage of infection, human immunodeficiency virus (HIV)-1 predominantly uses the CCR5 coreceptor for host cell entry. The subsequent emergence of HIV variants that use the CXCR4 coreceptor in roughly half of all infections is associated with an accelerated decline of CD4+ T-cells and rate of progression to AIDS. The presence of a ‘fitness valley’ separating CCR5- and CXCR4-using genotypes is postulated to be a biological determinant of whether the HIV coreceptor switch occurs. Using phylogenetic methods to reconstruct the evolutionary dynamics of HIV within hosts enables us to discriminate between competing models of this process. We have developed a phylogenetic pipeline for the molecular clock analysis, ancestral reconstruction, and visualization of deep sequence data. These data were generated by next-generation sequencing of HIV RNA extracted from longitudinal serum samples (median 7 time points) from 8 untreated subjects with chronic HIV infections (Amsterdam Cohort Studies on HIV-1 infection and AIDS). We used the known dates of sampling to directly estimate rates of evolution and to map ancestral mutations to a reconstructed timeline in units of days. HIV coreceptor usage was predicted from reconstructed ancestral sequences using the geno2pheno algorithm. We determined that the first mutations contributing to CXCR4 use emerged about 16 (per subject range 4 to 30) months before the earliest predicted CXCR4-using ancestor, which preceded the first positive cell-based assay of CXCR4 usage by 10 (range 5 to 25) months. CXCR4 usage arose in multiple lineages within 5 of 8 subjects, and ancestral lineages following alternate mutational pathways before going extinct were common. We observed highly patient-specific distributions and time-scales of mutation accumulation, implying that the role of a fitness valley is contingent on the genotype of the transmitted variant. PMID:23133358
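
    The abstract above says that known sampling dates were used to estimate rates of evolution directly. One common, much simpler stand-in for that idea is a least-squares fit of genetic divergence against sampling day. The sketch below is not the authors' phylogenetic pipeline; the function name estimate_rate and all input values are hypothetical and chosen only for illustration.

```python
def estimate_rate(sampling_days, divergences):
    """Ordinary least-squares slope and intercept of divergence against sampling day."""
    n = len(sampling_days)
    mean_t = sum(sampling_days) / n
    mean_d = sum(divergences) / n
    # Sum of squared deviations in time, and cross-deviations of time and divergence.
    sxx = sum((t - mean_t) ** 2 for t in sampling_days)
    sxy = sum((t - mean_t) * (d - mean_d) for t, d in zip(sampling_days, divergences))
    slope = sxy / sxx
    intercept = mean_d - slope * mean_t
    return slope, intercept


# Made-up example: divergence from a reference sequence (substitutions per site)
# measured at four sampling days. These are NOT values from the study.
days = [0, 100, 200, 300]
divergence = [0.010, 0.012, 0.014, 0.016]

rate, intercept = estimate_rate(days, divergence)
# Day on which the fitted line reaches zero divergence (a rough origin estimate).
origin_day = -intercept / rate

print(f"rate per site per day: {rate:.2e}")
print(f"extrapolated origin day: {origin_day:.0f}")
```

    The slope is the estimated substitution rate per site per day, and extrapolating the line back to zero divergence gives a rough date for the common ancestor. A full molecular clock analysis would account for the shared ancestry of the sequences, which this regression ignores.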

  8. High resolution sequence stratigraphy of Miocene deep-water clastic outcrops, Taranaki coast, New Zealand

    SciTech Connect

    King, P.R.; Browne, G.H.; Slatt, R.M.

    1995-08-01

    Approximately 700 m of deep water clastic deposits of Mt. Messenger Formation are superbly exposed along the Taranaki coast of North Island, New Zealand. Biostratigraphy indicates the interval was deposited during the time span 10.5-9.2 m.y. in water depths grading upward from lower bathyal to middle-upper bathyal. This interval is considered part of a 3rd order depositional sequence deposited under conditions of fluctuating relative sea-level, concomitant with high sedimentation rates. Several 4th order depositional sequences, reflecting successive sea-level falls, are recognized within the interval. Sequence boundaries display a range of erosive morphologies from metre-wide canyons to scours several hundred metres across. All components of a generic lowstand systems tract--basin floor fan, channel-levee complex and prograding complex--are present in logical and temporal order. They are repetitive through the interval, with the relatively shallower-water components becoming more prevalent upward. Basin floor fan lithologies are mainly m-thick, massive and convolute-bedded sandstones that alternate with cm- and dm-thick massive, horizontally-stratified and ripple-laminated sandstones and bioturbated mudstones. Channel-levee deposits consist of interleaving packages of thin-bedded, climbing-rippled and parallel-laminated sandstones and millstones; infrequent channels are filled with sandstones and mudstones, and sometimes lined with conglomerate. Thin beds of parallel to convoluted mudstone comprise prograding complex deposits. Similar lowstand systems tracts can be recognized and correlated on subsurface seismic reflection profiles and wireline logs. Such correlation has been aided by a continuous outcrop gamma-ray log obtained over most of the measured interval. In the adjacent Taranaki peninsula, basin floor fan and channel-levee deposits comprise hydrocarbon reservoir intervals. Outcrop and subsurface reservoir sandstones exhibit similar permeabilities.

  9. Transcriptome-Wide Identification of Hfq-Associated RNAs in Brucella suis by Deep Sequencing

    PubMed Central

    Saadeh, Bashir; Caswell, Clayton C.; Berta, Philippe; Wattam, Alice Rebecca; Roop, R. Martin

    2015-01-01

    ABSTRACT Recent breakthroughs in next-generation sequencing technologies have led to the identification of small noncoding RNAs (sRNAs) as a new important class of regulatory molecules. In prokaryotes, sRNAs are often bound to the chaperone protein Hfq, which allows them to interact with their partner mRNA(s). We screened the genome of the zoonotic and human pathogen Brucella suis 1330 for the presence of this class of RNAs. We designed a coimmunoprecipitation strategy that relies on the use of Hfq as a bait to enrich the sample with sRNAs and eventually their target mRNAs. By deep sequencing analysis of the Hfq-bound transcripts, we identified a number of mRNAs and 33 sRNA candidates associated with Hfq. The expression of 10 sRNAs in the early stationary growth phase was experimentally confirmed by Northern blotting and/or reverse transcriptase PCR. IMPORTANCE Brucella organisms are facultative intracellular pathogens that use stealth strategies to avoid host defenses. Adaptation to the host environment requires tight control of gene expression. Recently, small noncoding RNAs (sRNAs) and the sRNA chaperone Hfq have been shown to play a role in the fine-tuning of gene expression. Here we have used RNA sequencing to identify RNAs associated with the B. suis Hfq protein. We have identified a novel list of 33 sRNAs and 62 Hfq-associated mRNAs for future studies aiming to understand the intracellular lifestyle of this pathogen. PMID:26553849

  10. Whole-genome sequence of Sunxiuqinia dokdonensis DH1(T), isolated from deep sub-seafloor sediment in Dokdo Island.

    PubMed

    Lim, Sooyeon; Chang, Dong-Ho; Kim, Byoung-Chan

    2016-09-01

    Sunxiuqinia dokdonensis DH1(T) was isolated from deep sub-seafloor sediment at a depth of 900 m below the seafloor off Seo-do (the west part of Dokdo Island) in the East Sea of the Republic of Korea and subjected to whole genome sequencing on the HiSeq platform and annotated on RAST. The nucleotide sequence of this genome was deposited into DDBJ/EMBL/GenBank under the accession LGIA00000000. PMID:27437183

  11. Draft Genome Sequence of Alcanivorax sp. Strain KX64203 Isolated from Deep-Sea Sediments of Iheya North, Okinawa Trough.

    PubMed

    Zhang, Huan; Liu, Rui; Wang, Mengqiang; Wang, Hao; Gao, Qiang; Hou, Zhanhui; Gao, Dahai; Wang, Lingling

    2016-01-01

    This report describes the draft genome sequence of Alcanivorax sp. strain KX64203, isolated from deep-sea sediment samples. The reads generated by an Ion Torrent PGM were assembled into contigs, with a total size of 4.76 Mb. The data will improve our understanding of the strain's function in alkane degradation. PMID:27563046

  12. Small RNA deep sequencing revealed that mixed infection of known and unknown viruses were common in field collected vegetable samples

    Technology Transfer Automated Retrieval System (TEKTRAN)

    In an effort to characterize the causal agents for plant diseases in field collected samples using the small RNA deep sequencing technology, numerous known or novel viruses and viroids were identified. In many cases, a mixed infection with multiple pathogen species was common. Such situation compl...

  13. Draft Genome Sequence of Alcanivorax sp. Strain KX64203 Isolated from Deep-Sea Sediments of Iheya North, Okinawa Trough

    PubMed Central

    Liu, Rui; Wang, Mengqiang; Wang, Hao; Gao, Qiang; Hou, Zhanhui; Gao, Dahai

    2016-01-01

    This report describes the draft genome sequence of Alcanivorax sp. strain KX64203, isolated from deep-sea sediment samples. The reads generated by an Ion Torrent PGM were assembled into contigs, with a total size of 4.76 Mb. The data will improve our understanding of the strain’s function in alkane degradation. PMID:27563046

  14. High-Resolution Hepatitis C Virus Subtyping Using NS5B Deep Sequencing and Phylogeny, an Alternative to Current Methods

    PubMed Central

    Gregori, Josep; Rodríguez-Frias, Francisco; Buti, Maria; Madejon, Antonio; Perez-del-Pulgar, Sofia; Garcia-Cehic, Damir; Casillas, Rosario; Blasi, Maria; Homs, Maria; Tabernero, David; Alvarez-Tejado, Miguel; Muñoz, Jose Manuel; Cubero, Maria; Caballero, Andrea; delCampo, Jose Antonio; Domingo, Esteban; Belmonte, Irene; Nieto, Leonardo; Lens, Sabela; Muñoz-de-Rueda, Paloma; Sanz-Cameno, Paloma; Sauleda, Silvia; Bes, Marta; Gomez, Jordi; Briones, Carlos; Perales, Celia; Sheldon, Julie; Castells, Lluis; Viladomiu, Lluis; Salmeron, Javier; Ruiz-Extremera, Angela; Quiles-Pérez, Rosa; Moreno-Otero, Ricardo; López-Rodríguez, Rosario; Allende, Helena; Romero-Gómez, Manuel; Guardia, Jaume; Esteban, Rafael; Garcia-Samaniego, Javier; Forns, Xavier

    2014-01-01

    Hepatitis C virus (HCV) is classified into seven major genotypes and 67 subtypes. Recent studies have shown that in HCV genotype 1-infected patients, response rates to regimens containing direct-acting antivirals (DAAs) are subtype dependent. Currently available genotyping methods have limited subtyping accuracy. We have evaluated the performance of a deep-sequencing-based HCV subtyping assay, developed for the 454/GS-Junior platform, in comparison with those of two commercial assays (Versant HCV genotype 2.0 and Abbott Real-time HCV Genotype II) and using direct NS5B sequencing as a gold standard (direct sequencing), in 114 clinical specimens previously tested by first-generation hybridization assay (82 genotype 1 and 32 with uninterpretable results). Phylogenetic analysis of deep-sequencing reads matched subtype 1 calling by population Sanger sequencing (69% 1b, 31% 1a) in 81 specimens and identified a mixed-subtype infection (1b/3a/1a) in one sample. Similarly, among the 32 previously indeterminate specimens, identical genotype and subtype results were obtained by direct and deep sequencing in all but four samples with dual infection. In contrast, both Versant HCV Genotype 2.0 and Abbott Real-time HCV Genotype II failed subtype 1 calling in 13 (16%) samples each and were unable to identify the HCV genotype and/or subtype in more than half of the non-genotype 1 samples. We concluded that deep sequencing is more efficient for HCV subtyping than currently available methods and allows qualitative identification of mixed infections and may be more helpful with respect to informing treatment strategies with new DAA-containing regimens across all HCV subtypes. PMID:25378574

  15. Discovering the Unknown: Improving Detection of Novel Species and Genera from Short Reads

    DOE PAGESBeta

    Rosen, Gail L.; Polikar, Robi; Caseiro, Diamantino A.; Essinger, Steven D.; Sokhansanj, Bahrad A.

    2011-01-01

    High-throughput sequencing technologies enable metagenome profiling, simultaneous sequencing of multiple microbial species present within an environmental sample. Since metagenomic data includes sequence fragments (“reads”) from organisms that are absent from any database, new algorithms must be developed for the identification and annotation of novel sequence fragments. Homology-based techniques have been modified to detect novel species and genera, but composition-based methods have not been adapted. We develop a detection technique that can discriminate between “known” and “unknown” taxa, which can be used with composition-based methods, as well as a hybrid method. Unlike previous studies, we rigorously evaluate all algorithms for their ability to detect novel taxa. First, we show that the integration of a detector with a composition-based method performs significantly better than homology-based methods for the detection of novel species and genera, with best performance at finer taxonomic resolutions. Most importantly, we evaluate all the algorithms by introducing an “unknown” class and show that the modified version of PhymmBL has similar or better overall classification performance than the other modified algorithms, especially for the species-level and ultrashort reads. Finally, we evaluate the performance of several algorithms on a real acid mine drainage dataset.

  16. DeepSNVMiner: a sequence analysis tool to detect emergent, rare mutations in subsets of cell populations

    PubMed Central

    Andrews, T. Daniel; Jeelall, Yogesh; Talaulikar, Dipti; Goodnow, Christopher C.

    2016-01-01

    Background. Massively parallel sequencing technology is being used to sequence highly diverse populations of DNA such as that derived from heterogeneous cell mixtures containing both wild-type and disease-related states. At the core of such molecule tagging techniques is the tagging and identification of sequence reads derived from individual input DNA molecules, which must be first computationally disambiguated to generate read groups sharing common sequence tags, with each read group representing a single input DNA molecule. This disambiguation typically generates huge numbers of read groups, each of which requires additional variant detection analysis steps to be run specific to each read group, thus representing a significant computational challenge. While sequencing technologies for producing these data are approaching maturity, the lack of available computational tools for analysing such heterogeneous sequence data represents an obstacle to the widespread adoption of this technology. Results. Using synthetic data we successfully detect unique variants at dilution levels of 1 in 1,000,000 molecules, and find DeepSNVMiner obtains significantly lower false positive and false negative rates compared to popular variant callers GATK, SAMTools, FreeBayes and LoFreq, particularly as the variant concentration levels decrease. In a dilution series with genomic DNA from two cell lines, we find DeepSNVMiner identifies a known somatic variant when present at concentrations of only 1 in 1,000 molecules in the input material, the lowest concentration amongst all variant callers tested. Conclusions. Here we present DeepSNVMiner; a tool to disambiguate tagged sequence groups and robustly identify sequence variants specific to subsets of starting DNA molecules that may indicate the presence of a disease. DeepSNVMiner is an automated workflow of custom sequence analysis utilities and open source tools able to differentiate somatic DNA variants from artefactual sequence

  17. DeepSNVMiner: a sequence analysis tool to detect emergent, rare mutations in subsets of cell populations.

    PubMed

    Andrews, T Daniel; Jeelall, Yogesh; Talaulikar, Dipti; Goodnow, Christopher C; Field, Matthew A

    2016-01-01

    Background. Massively parallel sequencing technology is being used to sequence highly diverse populations of DNA such as that derived from heterogeneous cell mixtures containing both wild-type and disease-related states. At the core of such molecule tagging techniques is the tagging and identification of sequence reads derived from individual input DNA molecules, which must be first computationally disambiguated to generate read groups sharing common sequence tags, with each read group representing a single input DNA molecule. This disambiguation typically generates huge numbers of read groups, each of which requires additional variant detection analysis steps to be run specific to each read group, thus representing a significant computational challenge. While sequencing technologies for producing these data are approaching maturity, the lack of available computational tools for analysing such heterogeneous sequence data represents an obstacle to the widespread adoption of this technology. Results. Using synthetic data we successfully detect unique variants at dilution levels of 1 in 1,000,000 molecules, and find DeepSNVMiner obtains significantly lower false positive and false negative rates compared to popular variant callers GATK, SAMTools, FreeBayes and LoFreq, particularly as the variant concentration levels decrease. In a dilution series with genomic DNA from two cell lines, we find DeepSNVMiner identifies a known somatic variant when present at concentrations of only 1 in 1,000 molecules in the input material, the lowest concentration amongst all variant callers tested. Conclusions. Here we present DeepSNVMiner; a tool to disambiguate tagged sequence groups and robustly identify sequence variants specific to subsets of starting DNA molecules that may indicate the presence of a disease. DeepSNVMiner is an automated workflow of custom sequence analysis utilities and open source tools able to differentiate somatic DNA variants from artefactual sequence
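
    The disambiguation step described in the DeepSNVMiner abstracts (collect reads that share a molecule tag into read groups, then analyse each group separately) can be sketched roughly as follows. This is an illustration only, not DeepSNVMiner's implementation: the fixed 4-base tag at the start of each read, the majority-vote consensus, and every function name here are assumptions made for the example.

```python
from collections import Counter, defaultdict

TAG_LENGTH = 4  # hypothetical: the first 4 bases of each read are the molecule tag


def group_reads_by_tag(reads, tag_length=TAG_LENGTH):
    """Split each read into (tag, insert) and collect the inserts per tag."""
    groups = defaultdict(list)
    for read in reads:
        tag, insert = read[:tag_length], read[tag_length:]
        groups[tag].append(insert)
    return dict(groups)


def consensus(inserts):
    """Majority base at each position across equal-length inserts."""
    return "".join(Counter(column).most_common(1)[0][0] for column in zip(*inserts))


def variants_per_group(groups, reference):
    """List (position, ref_base, alt_base) where a group's consensus differs from the reference."""
    calls = {}
    for tag, inserts in groups.items():
        cons = consensus(inserts)
        calls[tag] = [(i, r, c) for i, (r, c) in enumerate(zip(reference, cons)) if r != c]
    return calls


# Toy reads, invented for this example.
reads = [
    "AAAAACGTAC",  # tag AAAA, insert matches the reference
    "AAAAACGTAC",
    "AAAAACGTTC",  # one read in the group carries an isolated error
    "CCCCACGGAC",  # tag CCCC, every read carries T->G at position 3
    "CCCCACGGAC",
    "CCCCACGGAC",
]
reference = "ACGTAC"

calls = variants_per_group(group_reads_by_tag(reads), reference)
print(calls)
```

    In this toy input the single discordant read in group AAAA is outvoted by the other two reads, while the change shared by all reads in group CCCC is reported. That is the intuition behind separating variants present in the original molecule from artefacts introduced later.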

  18. Sequence stratigraphy of Cenozoic deepwater deposits in the Perdido fold belt, Northwestern Deep Gulf of Mexico

    SciTech Connect

    Fiduk, J.C.; Weimer, P.; Trudgill, B.D.

    1996-12-31

    Analysis of 12,000 km of 2-D multifold seismic data shows three large Cenozoic wedges of deepwater deposits in the Perdido fold belt that differ in seismic facies, areal distribution, and potential reservoir geometries. Together, these three wedges reflect the changing positions of Cenozoic depocenters and record the evolution of the Perdido structural province. Lithologic interpretation is based upon seismic facies and analogous facies in other drilled areas in the Gulf of Mexico. (1) The Paleocene to middle Oligocene interval, which is strongly folded, reflects pre-growth deposition. Paleocene and Oligocene strata thicken westward and consist of medium to high amplitude, subparallel reflections of varying continuity. Broad channels and channel-levee systems are interpreted, suggesting turbidite deposition. These strata are interpreted as the down-dip equivalent of the Wilcox and Frio shallow-water depo-centers and are potentially sand-prone. Eocene strata are low amplitude, discontinuous, subparallel reflections interpreted to be shale-prone. (2) The upper Oligocene to upper Miocene interval consists of multiple well-developed sequences with variable amplitude, divergent reflections, many of which onlap against the fold crests. Sequences within this interval are often modified by erosion, faulting, and/or slumping against the folds. (3) The upper Miocene to Recent interval, which overlies most folds, consists of channel-levee, overbank, slump, and layered or amalgamated turbidite sheet deposits. These are similar to other coeval submarine fan sediments in the northern deep Gulf. Thus, the Cenozoic section in the Perdido fold belt is interpreted as mostly shale-prone, with some sand-prone intervals, based upon seismic facies, isopach thickening to the west, and similar producing facies elsewhere in the Gulf of Mexico.

  19. Sequence stratigraphy of Cenozoic deepwater deposits in the Perdido fold belt, Northwestern Deep Gulf of Mexico

    SciTech Connect

    Fiduk, J.C.; Weimer, P.; Trudgill, B.D.

    1996-01-01

    Analysis of 12,000 km of 2-D multifold seismic data shows three large Cenozoic wedges of deepwater deposits in the Perdido fold belt that differ in seismic facies, areal distribution, and potential reservoir geometries. Together, these three wedges reflect the changing positions of Cenozoic depocenters and record the evolution of the Perdido structural province. Lithologic interpretation is based upon seismic facies and analogous facies in other drilled areas in the Gulf of Mexico. (1) The Paleocene to middle Oligocene interval, which is strongly folded, reflects pre-growth deposition. Paleocene and Oligocene strata thicken westward and consist of medium to high amplitude, subparallel reflections of varying continuity. Broad channels and channel-levee systems are interpreted, suggesting turbidite deposition. These strata are interpreted as the down-dip equivalent of the Wilcox and Frio shallow-water depo-centers and are potentially sand-prone. Eocene strata are low amplitude, discontinuous, subparallel reflections interpreted to be shale-prone. (2) The upper Oligocene to upper Miocene interval consists of multiple well-developed sequences with variable amplitude, divergent reflections, many of which onlap against the fold crests. Sequences within this interval are often modified by erosion, faulting, and/or slumping against the folds. (3) The upper Miocene to Recent interval, which overlies most folds, consists of channel-levee, overbank, slump, and layered or amalgamated turbidite sheet deposits. These are similar to other coeval submarine fan sediments in the northern deep Gulf. Thus, the Cenozoic section in the Perdido fold belt is interpreted as mostly shale-prone, with some sand-prone intervals, based upon seismic facies, isopach thickening to the west, and similar producing facies elsewhere in the Gulf of Mexico.

  20. mRNA deep sequencing reveals 75 new genes and a complex transcriptional landscape in Mimivirus.

    PubMed

    Legendre, Matthieu; Audic, Stéphane; Poirot, Olivier; Hingamp, Pascal; Seltzer, Virginie; Byrne, Deborah; Lartigue, Audrey; Lescot, Magali; Bernadac, Alain; Poulain, Julie; Abergel, Chantal; Claverie, Jean-Michel

    2010-05-01

    Mimivirus, a virus infecting Acanthamoeba, is the prototype of the Mimiviridae, the latest addition to the nucleocytoplasmic large DNA viruses. The Mimivirus genome encodes close to 1000 proteins, many of them never before encountered in a virus, such as four amino-acyl tRNA synthetases. To explore the physiology of this exceptional virus and identify the genes involved in the building of its characteristic intracytoplasmic "virion factory," we coupled electron microscopy observations with the massively parallel pyrosequencing of the polyadenylated RNA fractions of Acanthamoeba castellanii cells at various times post-infection. We generated 633,346 reads, of which 322,904 correspond to Mimivirus transcripts. This first application of deep mRNA sequencing (454 Life Sciences [Roche] FLX) to a large DNA virus allowed the precise delineation of the 5' and 3' extremities of Mimivirus mRNAs and revealed 75 new transcripts including several noncoding RNAs. Mimivirus genes are expressed across a wide dynamic range, in a finely regulated manner broadly described by three main temporal classes: early, intermediate, and late. This RNA-seq study confirmed the AAAATTGA sequence as an early promoter element, as well as the presence of palindromes at most of the polyadenylation sites. It also revealed a new promoter element correlating with late gene expression, which is also prominent in Sputnik, the recently described Mimivirus "virophage." These results, validated genome-wide by the hybridization of total RNA extracted from infected Acanthamoeba cells on a tiling array (Agilent), will constitute the foundation on which to build subsequent functional studies of the Mimivirus/Acanthamoeba system. PMID:20360389

  1. Improved Sequence Learning with Subthalamic Nucleus Deep Brain Stimulation: Evidence for Treatment-Specific Network Modulation

    PubMed Central

    Mure, Hideo; Tang, Chris C.; Argyelan, Miklos; Ghilardi, Maria-Felice; Kaplitt, Michael G.; Dhawan, Vijay; Eidelberg, David

    2015-01-01

    We used a network approach to study the effects of anti-parkinsonian treatment on motor sequence learning in humans. Eight Parkinson’s disease (PD) patients with bilateral subthalamic nucleus (STN) deep brain stimulation underwent H2 15O positron emission tomography (PET) imaging to measure regional cerebral blood flow (rCBF) while they performed kinematically matched sequence learning and movement tasks at baseline and during stimulation. Network analysis revealed a significant learning-related spatial covariance pattern characterized by consistent increases in subject expression during stimulation (p = 0.008, permutation test). The network was associated with increased activity in the lateral cerebellum, dorsal premotor cortex, and parahippocampal gyrus, with covarying reductions in the supplementary motor area (SMA) and orbitofrontal cortex. Stimulation-mediated increases in network activity correlated with concurrent improvement in learning performance (p < 0.02). To determine whether similar changes occurred during dopaminergic pharmacotherapy, we studied the subjects during an intravenous levodopa infusion titrated to achieve a motor response equivalent to stimulation. Despite consistent improvement in motor ratings during infusion, levodopa did not alter learning performance or network activity. Analysis of learning-related rCBF in network regions revealed improvement in baseline abnormalities with STN stimulation but not levodopa. These effects were most pronounced in the SMA. In this region, a consistent rCBF response to stimulation was observed across subjects and trials (p = 0.01), although the levodopa response was not significant. These findings link the cognitive treatment response in PD to changes in the activity of a specific cerebello-premotor cortical network. Selective modulation of overactive SMA–STN projection pathways may underlie the improvement in learning found with stimulation. PMID:22357863

  2. A High-Dimensional, Deep-Sequencing Study of Lung Adenocarcinoma in Female Never-Smokers

    PubMed Central

    Kim, Pora; Park, Jehwan; Seo, Jihae; Kim, Jiwoong; Park, Seongjin; Jang, Insu; Kim, Namshin; Yang, Jin Ok; Lee, Byungwook; Rho, Kyoohyoung; Jung, Yeonhwa; Keum, Juhee; Lee, Jinseon; Han, Jungho; Kang, Sangeun; Bae, Sujin; Choi, So-Jung; Kim, Sujin; Lee, Jong-Eun; Kim, Wankyu; Kim, Jhingook; Lee, Sanghyuk

    2013-01-01

    Background Deep sequencing techniques provide a remarkable opportunity for comprehensive understanding of tumorigenesis at the molecular level. As omics studies become popular, integrative approaches need to be developed to move from a simple cataloguing of mutations and changes in gene expression to dissecting the molecular nature of carcinogenesis at the systemic level and understanding the complex networks that lead to cancer development. Results Here, we describe a high-throughput, multi-dimensional sequencing study of primary lung adenocarcinoma tumors and adjacent normal tissues of six Korean female never-smoker patients. Our data encompass results from exome-seq, RNA-seq, small RNA-seq, and MeDIP-seq. We identified and validated novel genetic aberrations, including 47 somatic mutations and 19 fusion transcripts. One of the fusions involves the c-RET gene, which was recently reported to form fusion genes that may function as drivers of carcinogenesis in lung cancer patients. We also characterized gene expression profiles, which we integrated with genomic aberrations and gene regulations into functional networks. The most prominent gene network module that emerged indicates that disturbances in G2/M transition and mitotic progression are causally linked to tumorigenesis in these patients. Also, results from the analysis strongly suggest that several novel microRNA-target interactions represent key regulatory elements of the gene network. Conclusions Our study not only provides an overview of the alterations occurring in lung adenocarcinoma at multiple levels from genome to transcriptome and epigenome, but also offers a model for integrative genomics analysis and proposes potential target pathways for the control of lung adenocarcinoma. PMID:23405175

  3. Hybrid DNA virus in Chinese patients with seronegative hepatitis discovered by deep sequencing

    PubMed Central

    Xu, Baoyan; Zhi, Ning; Hu, Gangqing; Wan, Zhihong; Zheng, Xiaobin; Liu, Xiaohong; Wong, Susan; Kajigaya, Sachiko; Zhao, Keji; Mao, Qing; Young, Neal S.

    2013-01-01

    Seronegative hepatitis (non-A, non-B, non-C, non-D, non-E hepatitis) is poorly characterized but strongly associated with serious complications. We collected 92 sera specimens from patients with non-A–E hepatitis in Chongqing, China between 1999 and 2007. Ten sera pools were screened by Solexa deep sequencing. We discovered a 3,780-bp contig present in all 10 pools that yielded BLASTx E scores of 7e-05–0.008 against parvoviruses. The complete sequence of the in silico-assembled 3,780-bp contig was confirmed by gene amplification of overlapping regions over almost the entire genome, and the virus was provisionally designated NIH-CQV. Further analysis revealed that the contig was composed of two major ORFs. By protein BLAST, ORF1 and ORF2 were most homologous to the replication-associated protein of bat circovirus and the capsid protein of porcine parvovirus, respectively. Phylogenetic analysis indicated that NIH-CQV is located at the interface of Parvoviridae and Circoviridae. Prevalence of NIH-CQV in patients was determined by quantitative PCR. Sixty-three of 90 patient samples (70%) were positive, but all those from 45 healthy controls were negative. Average virus titer in the patient specimens was 1.05 × 10^4 copies/µL. Specific antibodies against NIH-CQV were sought by immunoblotting. Eighty-four percent of patients were positive for IgG, and 31% were positive for IgM; in contrast, 78% of healthy controls were positive for IgG, but all were negative for IgM. Although more work is needed to determine the etiologic role of NIH-CQV in human disease, our data indicate that a parvovirus-like virus is highly prevalent in a cohort of patients with non-A–E hepatitis. PMID:23716702

  4. The ribosome profiling strategy for monitoring translation in vivo by deep sequencing of ribosome-protected mRNA fragments

    PubMed Central

    Ingolia, Nicholas T.; Brar, Gloria A.; Rouskin, Silvia; McGeachy, Anna M.; Weissman, Jonathan S.

    2012-01-01

    Recent studies highlight the importance of translational control in determining protein abundance, underscoring the value of measuring gene expression at the level of translation. We present a protocol for genome-wide, quantitative analysis of in vivo translation by deep sequencing. This ribosome profiling approach maps the exact positions of ribosomes on transcripts by nuclease footprinting. The nuclease-protected mRNA fragments are converted into a DNA library suitable for deep sequencing using a strategy that minimizes bias. The abundance of different footprint fragments in deep sequencing data reports on the amount of translation of a gene. Additionally, footprints reveal the exact regions of the transcriptome that are translated. To better define translated reading frames, we describe an adaptation that reveals the sites of translation initiation by pre-treating cells with harringtonine to immobilize initiating ribosomes. The protocol we describe requires 5–7 days to generate a completed ribosome profiling sequencing library. Sequencing and data analysis require a further 4–5 days. PMID:22836135

  5. Complete genome sequence of Southern tomato virus naturally infecting tomatoes in Bangladesh using small RNA deep sequencing

    Technology Transfer Automated Retrieval System (TEKTRAN)

    The complete genome sequence of a Southern tomato virus (STV) isolate on tomato plants in a seed production field in Bangladesh was obtained for the first time using next generation sequencing. The identified isolate STV_BD-13 shares high degree of sequence identity (99%) with several known STV isol...

  6. Efficient de novo assembly of highly heterozygous genomes from whole-genome shotgun short reads

    PubMed Central

    Kajitani, Rei; Toshimoto, Kouta; Noguchi, Hideki; Toyoda, Atsushi; Ogura, Yoshitoshi; Okuno, Miki; Yabana, Mitsuru; Harada, Masayuki; Nagayasu, Eiji; Maruyama, Haruhiko; Kohara, Yuji; Fujiyama, Asao; Hayashi, Tetsuya; Itoh, Takehiko

    2014-01-01

    Although many de novo genome assembly projects have recently been conducted using high-throughput sequencers, assembling highly heterozygous diploid genomes is a substantial challenge due to the increased complexity of the de Bruijn graph structure predominantly used. To address the increasing demand for sequencing of nonmodel and/or wild-type samples, in most cases inbred lines or fosmid-based hierarchical sequencing methods are used to overcome such problems. However, these methods are costly and time consuming, forfeiting the advantages of massive parallel sequencing. Here, we describe a novel de novo assembler, Platanus, that can effectively manage high-throughput data from heterozygous samples. Platanus assembles DNA fragments (reads) into contigs by constructing de Bruijn graphs with automatically optimized k-mer sizes followed by the scaffolding of contigs based on paired-end information. The complicated graph structures that result from the heterozygosity are simplified during not only the contig assembly step but also the scaffolding step. We evaluated the assembly results on eukaryotic samples with various levels of heterozygosity. Compared with other assemblers, Platanus yields assembly results that have a larger scaffold NG50 length without any accompanying loss of accuracy in both simulated and real data. In addition, Platanus recorded the largest scaffold NG50 values for two of the three low-heterozygosity species used in the de novo assembly contest, Assemblathon 2. Platanus therefore provides a novel and efficient approach for the assembly of gigabase-sized highly heterozygous genomes and is an attractive alternative to the existing assemblers designed for genomes of lower heterozygosity. PMID:24755901
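
    The de Bruijn graph construction that the abstract above refers to can be illustrated with a minimal Python sketch. This is not Platanus code: the function name `build_de_bruijn`, the toy reads, and the fixed `k` are assumptions for illustration only, and the sketch omits the automatic k-mer size optimization, graph simplification, and scaffolding that the abstract describes.

```python
from collections import defaultdict


def build_de_bruijn(reads, k):
    """Build a toy de Bruijn graph (hypothetical helper, not the Platanus API).

    Nodes are (k-1)-mers; every k-mer found in a read adds one edge
    from its (k-1)-base prefix to its (k-1)-base suffix.
    """
    graph = defaultdict(list)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].append(kmer[1:])
    return graph


# Two overlapping toy reads (made-up data).
reads = ["ACGTAC", "CGTACG"]
graph = build_de_bruijn(reads, k=4)

for node in sorted(graph):
    print(node, "->", graph[node])
```

    Edges that appear more than once (here `CGT -> GTA`) reflect k-mers seen in several reads. In a heterozygous sample, allelic differences would instead create parallel branches ("bubbles") in such a graph, which is the added complexity the abstract attributes to heterozygosity.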

  7. A novel method for identifying polymorphic transposable elements via scanning of high-throughput short reads

    PubMed Central

    Kang, Houxiang; Zhu, Dan; Lin, Runmao; Opiyo, Stephen Obol; Jiang, Ning; Shiu, Shin-Han; Wang, Guo-Liang

    2016-01-01

    Identification of polymorphic transposable elements (TEs) is important because TE polymorphism creates genetic diversity and influences the function of genes in the host genome. However, de novo scanning of polymorphic TEs remains a challenge. Here, we report a novel computational method, called PTEMD (polymorphic TEs and their movement detection), for de novo discovery of genome-wide polymorphic TEs. PTEMD searches for highly identical sequences using read-supported breakpoint evidence. Using PTEMD, we identified 14 polymorphic TE families (905 sequences) in rice blast fungus Magnaporthe oryzae, and 68 (10,618 sequences) in maize. We validated one polymorphic TE family experimentally, MoTE-1; all MoTE-1 family members are located in different genomic loci in the three tested isolates. We found that 57.1% (8 of 14) of the PTEMD-detected polymorphic TE families in M. oryzae are active. Furthermore, our data indicate that there are more polymorphic DNA transposons in maize than polymorphic retrotransposons, despite the fact that retrotransposons occupy the largest fraction of genomic mass. We demonstrated that PTEMD is an effective tool for identifying polymorphic TEs in M. oryzae and maize genomes. PTEMD and the genome-wide polymorphic TEs in M. oryzae and maize are publicly available at http://www.kanglab.cn/blast/PTEMD_V1.02.htm. PMID:27098848

  8. A novel method for identifying polymorphic transposable elements via scanning of high-throughput short reads.

    PubMed

    Kang, Houxiang; Zhu, Dan; Lin, Runmao; Opiyo, Stephen Obol; Jiang, Ning; Shiu, Shin-Han; Wang, Guo-Liang

    2016-06-01

    Identification of polymorphic transposable elements (TEs) is important because TE polymorphism creates genetic diversity and influences the function of genes in the host genome. However, de novo scanning of polymorphic TEs remains a challenge. Here, we report a novel computational method, called PTEMD (polymorphic TEs and their movement detection), for de novo discovery of genome-wide polymorphic TEs. PTEMD searches for highly identical sequences using read-supported breakpoint evidence. Using PTEMD, we identified 14 polymorphic TE families (905 sequences) in rice blast fungus Magnaporthe oryzae, and 68 (10,618 sequences) in maize. We validated one polymorphic TE family experimentally, MoTE-1; all MoTE-1 family members are located in different genomic loci in the three tested isolates. We found that 57.1% (8 of 14) of the PTEMD-detected polymorphic TE families in M. oryzae are active. Furthermore, our data indicate that there are more polymorphic DNA transposons in maize than polymorphic retrotransposons, despite the fact that retrotransposons occupy the largest fraction of genomic mass. We demonstrated that PTEMD is an effective tool for identifying polymorphic TEs in M. oryzae and maize genomes. PTEMD and the genome-wide polymorphic TEs in M. oryzae and maize are publicly available at http://www.kanglab.cn/blast/PTEMD_V1.02.htm. PMID:27098848

  9. Deep sequencing reveals abundant noncanonical retroviral microRNAs in B-cell leukemia/lymphoma.

    PubMed

    Rosewick, Nicolas; Momont, Mélanie; Durkin, Keith; Takeda, Haruko; Caiment, Florian; Cleuter, Yvette; Vernin, Céline; Mortreux, Franck; Wattel, Eric; Burny, Arsène; Georges, Michel; Van den Broeke, Anne

    2013-02-01

    Viral tumor models have significantly contributed to our understanding of oncogenic mechanisms. How transforming delta-retroviruses induce malignancy, however, remains poorly understood, especially as viral mRNA/protein are tightly silenced in tumors. Here, using deep sequencing of broad windows of small RNA sizes in the bovine leukemia virus ovine model of leukemia/lymphoma, we provide in vivo evidence of the production of noncanonical RNA polymerase III (Pol III)-transcribed viral microRNAs in leukemic B cells in the complete absence of Pol II 5'-LTR-driven transcriptional activity. Processed from a cluster of five independent self-sufficient transcriptional units located in a proviral region dispensable for in vivo infectivity, bovine leukemia virus microRNAs represent ∼40% of all microRNAs in both experimental and natural malignancy. They are subject to strong purifying selection and associate with Argonautes, consistent with a critical function in silencing of important cellular and/or viral targets. Bovine leukemia virus microRNAs are strongly expressed in preleukemic and malignant cells in which structural and regulatory gene expression is repressed, suggesting a key role in tumor onset and progression. Understanding how Pol III-dependent microRNAs subvert cellular and viral pathways will contribute to deciphering the intricate perturbations that underlie malignant transformation. PMID:23345446

  10. Genomic region operation kit for flexible processing of deep sequencing data.

    PubMed

    Ovaska, Kristian; Lyly, Lauri; Sahu, Biswajyoti; Jänne, Olli A; Hautaniemi, Sampsa

    2013-01-01

    Computational analysis of data produced in deep sequencing (DS) experiments is challenging due to large data volumes and requirements for flexible analysis approaches. Here, we present a mathematical formalism based on set algebra for frequently performed operations in DS data analysis to facilitate translation of biomedical research questions to language amenable for computational analysis. With the help of this formalism, we implemented the Genomic Region Operation Kit (GROK), which supports various DS-related operations such as preprocessing, filtering, file conversion, and sample comparison. GROK provides high-level interfaces for R, Python, Lua, and command line, as well as an extension C++ API. It supports major genomic file formats and allows storing custom genomic regions in efficient data structures such as red-black trees and SQL databases. To demonstrate the utility of GROK, we have characterized the roles of two major transcription factors (TFs) in prostate cancer using data from 10 DS experiments. GROK is freely available with a user guide from http://csbi.ltdk.helsinki.fi/grok/. PMID:23702556
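
    The idea of treating genomic regions as sets can be sketched in plain Python. The sketch below does not use GROK or its API. The helpers `merge` and `intersect`, the half-open `(start, end)` coordinates on a single chromosome, and the peak lists are all hypothetical, chosen only to show what union and intersection of region sets mean.

```python
def merge(regions):
    """Union of half-open (start, end) intervals: overlapping or touching ones are fused."""
    merged = []
    for start, end in sorted(regions):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def intersect(a, b):
    """Intersection of two region sets, computed with a two-pointer sweep."""
    a, b = merge(a), merge(b)
    i = j = 0
    shared = []
    while i < len(a) and j < len(b):
        lo = max(a[i][0], b[j][0])
        hi = min(a[i][1], b[j][1])
        if lo < hi:
            shared.append((lo, hi))
        # Advance whichever interval ends first.
        if a[i][1] < b[j][1]:
            i += 1
        else:
            j += 1
    return shared


# Made-up binding regions for two transcription factors on one chromosome.
peaks_tf1 = [(100, 200), (150, 300), (500, 600)]
peaks_tf2 = [(250, 550)]

union_regions = merge(peaks_tf1 + peaks_tf2)
shared_regions = intersect(peaks_tf1, peaks_tf2)
print(union_regions)
print(shared_regions)
```

    A question such as "where do both factors bind?" then becomes a single intersection, which is the kind of translation from research question to set operation that the abstract describes. A real toolkit would also track chromosome, strand, and scores for each region.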

  11. Deep sequencing reveals unique small RNA repertoire that is regulated during head regeneration in Hydra magnipapillata.

    PubMed

    Krishna, Srikar; Nair, Aparna; Cheedipudi, Sirisha; Poduval, Deepak; Dhawan, Jyotsna; Palakodeti, Dasaradhi; Ghanekar, Yashoda

    2013-01-01

    Small non-coding RNAs such as miRNAs, piRNAs and endo-siRNAs fine-tune gene expression through post-transcriptional regulation, modulating important processes in development, differentiation, homeostasis and regeneration. Using deep sequencing, we have profiled small non-coding RNAs in Hydra magnipapillata and investigated changes in small RNA expression pattern during head regeneration. Our results reveal a unique repertoire of small RNAs in hydra. We have identified 126 miRNA loci; 123 of these miRNAs are unique to hydra. Less than 50% are conserved across two different strains of Hydra vulgaris tested in this study, indicating a highly diverse nature of hydra miRNAs in contrast to bilaterian miRNAs. We also identified siRNAs derived from precursors with perfect stem-loop structure and that arise from inverted repeats. piRNAs were the most abundant small RNAs in hydra, mapping to transposable elements, the annotated transcriptome and unique non-coding regions on the genome. piRNAs that map to transposable elements and the annotated transcriptome display a ping-pong signature. Further, we have identified several miRNAs and piRNAs whose expression is regulated during hydra head regeneration. Our study defines different classes of small RNAs in this cnidarian model system, which may play a role in orchestrating gene expression essential for hydra regeneration. PMID:23166307

  12. Deep sequencing reveals unique small RNA repertoire that is regulated during head regeneration in Hydra magnipapillata

    PubMed Central

    Krishna, Srikar; Nair, Aparna; Cheedipudi, Sirisha; Poduval, Deepak; Dhawan, Jyotsna; Palakodeti, Dasaradhi; Ghanekar, Yashoda

    2013-01-01

    Small non-coding RNAs such as miRNAs, piRNAs and endo-siRNAs fine-tune gene expression through post-transcriptional regulation, modulating important processes in development, differentiation, homeostasis and regeneration. Using deep sequencing, we have profiled small non-coding RNAs in Hydra magnipapillata and investigated changes in small RNA expression pattern during head regeneration. Our results reveal a unique repertoire of small RNAs in hydra. We have identified 126 miRNA loci; 123 of these miRNAs are unique to hydra. Less than 50% are conserved across two different strains of Hydra vulgaris tested in this study, indicating a highly diverse nature of hydra miRNAs in contrast to bilaterian miRNAs. We also identified siRNAs derived from precursors with perfect stem–loop structure and that arise from inverted repeats. piRNAs were the most abundant small RNAs in hydra, mapping to transposable elements, the annotated transcriptome and unique non-coding regions on the genome. piRNAs that map to transposable elements and the annotated transcriptome display a ping–pong signature. Further, we have identified several miRNAs and piRNAs whose expression is regulated during hydra head regeneration. Our study defines different classes of small RNAs in this cnidarian model system, which may play a role in orchestrating gene expression essential for hydra regeneration. PMID:23166307

  13. MICA: A fast short-read aligner that takes full advantage of Many Integrated Core Architecture (MIC)

    PubMed Central

    2015-01-01

    Background Short-read aligners have recently gained a lot of speed by exploiting the massive parallelism of GPUs. An emerging alternative to the GPU is Intel MIC; supercomputers like Tianhe-2, currently top of the TOP500, are built with 48,000 MIC boards to offer ~55 PFLOPS. The CPU-like architecture of MIC allows CPU-based software to be parallelized easily; however, the performance is often inferior to GPU counterparts as an MIC card contains only ~60 cores (while a GPU card typically has over a thousand cores). Results To better utilize MIC-enabled computers for NGS data analysis, we developed a new short-read aligner MICA that is optimized in view of MIC's limitation and the extra parallelism inside each MIC core. By utilizing the 512-bit vector units in the MIC and implementing a new seeding strategy, experiments on aligning 150 bp paired-end reads show that MICA using one MIC card is 4.9 times faster than BWA-MEM (using 6 cores of a top-end CPU), and slightly faster than SOAP3-dp (using a GPU). Furthermore, MICA's simplicity allows very efficient scale-up when multiple MIC cards are used in a node (3 cards give a 14.1-fold speedup over BWA-MEM). Summary MICA can be readily used by MIC-enabled supercomputers for production purposes. We have tested MICA on Tianhe-2 with 90 WGS samples (17.47 Tera-bases), which can be aligned in an hour using 400 nodes. MICA has impressive performance even though MIC is only in its initial stage of development. Availability and implementation MICA's source code is freely available at http://sourceforge.net/projects/mica-aligner under GPL v3. Supplementary information Supplementary information is available as "Additional File 1". Datasets are available at www.bio8.cs.hku.hk/dataset/mica. PMID:25952019

  14. Patchiness of deep-sea benthic Foraminifera across the Southern Ocean: Insights from high-throughput DNA sequencing

    NASA Astrophysics Data System (ADS)

    Lejzerowicz, Franck; Esling, Philippe; Pawlowski, Jan

    2014-10-01

    Spatial patchiness is a natural feature that strongly influences the level of species richness we perceive in surface sediments sampled in the deep sea. Recent environmental DNA (eDNA) surveys of benthic micro- and meiofauna confirmed this exceptional richness. However, it is unknown to what extent the results of these studies, usually based on a few grams of sediment, are affected by spatial patchiness of deep-sea benthos. Here, we analyse the eDNA diversity of Foraminifera in 42 deep-sea sediment samples collected across different scales in the Southern Ocean. At three stations, we deployed the multicorer at least twice and, from each multicorer cast, we subsampled 3 sediment replicates per core for 2 cores. Using high-throughput sequencing (HTS), we generated over 2.35 million high-quality sequences that we clustered into 451 operational taxonomic units (OTUs). The majority of OTUs were assigned to the monothalamous (single-chambered) taxa and environmental clades. On average, a one-gram sediment sample captures 57.9% of the overall OTU diversity found in a single core, while three replicates cover at most 61.9% of the diversity found in a station. The OTUs found in all the replicates of each core gather up to 87.9% of the total sequenced reads, but only represent from 12.2% to 30% of the OTUs found in one core. These OTUs represent the most abundant species, among which environmental lineages dominate. The majority of the OTUs are represented by few sequences comprising several well-known deep-sea morphospecies or remaining unassigned. It is crucial to study wider arrays of sample and PCR replicates as well as RNA together with DNA in order to overcome biases stemming from deep-sea patchiness and molecular methods.

  15. Deep sequencing analysis of viral infection and evolution allows rapid and detailed characterization of viral mutant spectrum

    PubMed Central

    Isakov, Ofer; Bordería, Antonio V.; Golan, David; Hamenahem, Amir; Celniker, Gershon; Yoffe, Liron; Blanc, Hervé; Vignuzzi, Marco; Shomron, Noam

    2015-01-01

    Motivation: The study of RNA virus populations is a challenging task. Each population of RNA virus is composed of a collection of different, yet related genomes often referred to as mutant spectra or quasispecies. Virologists using deep sequencing technologies face major obstacles when studying virus population dynamics, both experimentally and in natural settings due to the relatively high error rates of these technologies and the lack of high performance pipelines. In order to overcome these hurdles we developed a computational pipeline, termed ViVan (Viral Variance Analysis). ViVan is a complete pipeline facilitating the identification, characterization and comparison of sequence variance in deep sequenced virus populations. Results: Applying ViVan on deep sequenced data obtained from samples that were previously characterized by more classical approaches, we uncovered novel and potentially crucial aspects of virus populations. With our experimental work, we illustrate how ViVan can be used for studies ranging from the more practical, detection of resistant mutations and effects of antiviral treatments, to the more theoretical temporal characterization of the population in evolutionary studies. Availability and implementation: Freely available on the web at http://www.vivanbioinfo.org Contact: nshomron@post.tau.ac.il Supplementary information: Supplementary data are available at Bioinformatics online. PMID:25701575

  16. De Novo Sequencing and Transcriptome Analysis of the Central Nervous System of Mollusc Lymnaea stagnalis by Deep RNA Sequencing

    PubMed Central

    Sadamoto, Hisayo; Takahashi, Hironobu; Okada, Taketo; Kenmoku, Hiromichi; Toyota, Masao; Asakawa, Yoshinori

    2012-01-01

    The pond snail Lymnaea stagnalis is among several mollusc species that have been well investigated due to the simplicity of their nervous systems and large identifiable neurons. Nonetheless, despite the continued attention given to the physiological characteristics of its nervous system, the genetic information of the Lymnaea central nervous system (CNS) has not yet been fully explored. The absence of genetic information is a large disadvantage for transcriptome sequencing because it makes transcriptome assembly difficult. We here performed transcriptome sequencing for Lymnaea CNS using an Illumina Genome Analyzer IIx platform and obtained 81.9 million 100 base pair (bp) single-end reads. For de novo assembly, five programs were used: ABySS, Velvet, OASES, Trinity and Rnnotator. Based on a comparison of the assemblies, we chose the Rnnotator dataset for the following blast searches and gene ontology analyses. The present dataset, 116,355 contigs of Lymnaea transcriptome shotgun assembly (TSA), contained longer sequences and was much larger compared to the previously reported Lymnaea expression sequence tag (EST) established by classical Sanger sequencing. The TSA sequences were subjected to blast analyses against several protein databases and Aplysia EST data. The results demonstrated that about 20,000 sequences had significant similarity to the reported sequences using a cutoff value of 1e-6, and showed the lack of molluscan sequences in the public databases. The richness of the present TSA data allowed us to identify a large number of new transcripts in Lymnaea and molluscan species. PMID:22870333

  17. QuasR: quantification and annotation of short reads in R

    PubMed Central

    Gaidatzis, Dimos; Lerch, Anita; Hahne, Florian; Stadler, Michael B.

    2015-01-01

    Summary: QuasR is a package for the integrated analysis of high-throughput sequencing data in R, covering all steps from read preprocessing, alignment and quality control to quantification. QuasR supports different experiment types (including RNA-seq, ChIP-seq and Bis-seq) and analysis variants (e.g. paired-end, stranded, spliced and allele-specific), and is integrated in Bioconductor so that its output can be directly processed for statistical analysis and visualization. Availability and implementation: QuasR is implemented in R and C/C++. Source code and binaries for major platforms (Linux, OS X and MS Windows) are available from Bioconductor (www.bioconductor.org/packages/release/bioc/html/QuasR.html). The package includes a ‘vignette’ with step-by-step examples for typical work flows. Contact: michael.stadler@fmi.ch Supplementary information: Supplementary data are available at Bioinformatics online. PMID:25417205

  18. Microbial Dark Matter: Unusual intervening sequences in 16S rRNA genes of candidate phyla from the deep subsurface

    SciTech Connect

    Jarett, Jessica; Stepanauskas, Ramunas; Kieft, Thomas; Onstott, Tullis; Woyke, Tanja

    2014-03-17

    The Microbial Dark Matter project has sequenced genomes from over 200 single cells from candidate phyla, greatly expanding our knowledge of the ecology, inferred metabolism, and evolution of these widely distributed, yet poorly understood lineages. The second phase of this project aims to sequence an additional 800 single cells from known as well as potentially novel candidate phyla derived from a variety of environments. In order to identify whole genome amplified single cells, screening based on phylogenetic placement of 16S rRNA gene sequences is being conducted. Briefly, derived 16S rRNA gene sequences are aligned to a custom version of the Greengenes reference database and added to a reference tree in ARB using parsimony. In multiple samples from deep subsurface habitats but not from other habitats, a large number of sequences proved difficult to align and therefore to place in the tree. Based on comparisons to reference sequences and structural alignments using SSU-ALIGN, many of these "difficult" sequences appear to originate from candidate phyla, and contain intervening sequences (IVSs) within the 16S rRNA genes. These IVSs are short (39-79 nt) and do not appear to be self-splicing or to contain open reading frames. IVSs were found in the loop regions of stem-loop structures in several different taxonomic groups. Phylogenetic placement of sequences is strongly affected by IVSs; two out of three groups investigated were classified as different phyla after their removal. Based on data from samples screened in this project, IVSs appear to be more common in microbes occurring in deep subsurface habitats, although the reasons for this remain elusive.

  19. Acyclic Identification of Aptamers for Human alpha-Thrombin Using Over-Represented Libraries and Deep Sequencing

    PubMed Central

    Kupakuwana, Gillian V.; Crill, James E.; McPike, Mark P.; Borer, Philip N.

    2011-01-01

    Background Aptamers are oligonucleotides that bind proteins and other targets with high affinity and selectivity. Twenty years ago elements of natural selection were adapted to in vitro selection in order to distinguish aptamers among randomized sequence libraries. The primary bottleneck in traditional aptamer discovery is multiple cycles of in vitro evolution. Methodology/Principal Findings We show that over-representation of sequences in aptamer libraries and deep sequencing enables acyclic identification of aptamers. We demonstrated this by isolating a known family of aptamers for human α-thrombin. Aptamers were found within a library containing an average of 56,000 copies of each possible randomized 15mer segment. The high affinity sequences were counted many times above the background in 2–6 million reads. Clustering analysis of sequences with more than 10 counts distinguished two sequence motifs with candidates at high abundance. Motif I contained the previously observed consensus 15mer, Thb1 (46,000 counts), and related variants with mostly G/T substitutions; secondary analysis showed that affinity for thrombin correlated with abundance (Kd = 12 nM for Thb1). The signal-to-noise ratio for this experiment was roughly 10,000∶1 for Thb1. Motif II was unrelated to Thb1 with the leading candidate (29,000 counts) being a novel aptamer against hexose sugars in the storage and elution buffers for Concanavalin A (Kd = 0.5 µM for α-methyl-mannoside); ConA was used to immobilize α-thrombin. Conclusions/Significance Over-representation together with deep sequencing can dramatically shorten the discovery process, distinguish aptamers having a wide range of affinity for the target, allow an exhaustive search of the sequence space within a simplified library, reduce the quantity of the target required, eliminate cycling artifacts, and should allow multiplexing of sequencing experiments and targets. PMID:21625587
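
    The counting step described in the abstract above (tallying identical 15mers and keeping those seen more than 10 times) can be sketched in a few lines of Python. This is an illustration, not the authors' pipeline: the function name `abundant_sequences` and the toy reads are hypothetical, and the sketch assumes each read has already been trimmed to the 15-nt randomized segment. It also computes the size of the 15mer sequence space, 4^15, which shows why an exhaustive search of a library this short is feasible.

```python
from collections import Counter

SEGMENT_LENGTH = 15

# Number of distinct sequences of length 15 over the alphabet {A, C, G, T}.
possible_15mers = 4 ** SEGMENT_LENGTH


def abundant_sequences(reads, min_count=10):
    """Return sequences observed more than `min_count` times (hypothetical helper)."""
    counts = Counter(reads)
    return {seq: n for seq, n in counts.items() if n > min_count}


# Made-up reads: one enriched sequence, one at the threshold, one rare.
reads = ["A" * 15] * 12 + ["C" * 15] * 10 + ["G" * 15] * 3
hits = abundant_sequences(reads)

print(f"{possible_15mers:,} possible 15mers")
print(hits)
```

    With roughly 1.07 billion possible 15mers and only 2–6 million reads, most individual sequences are expected to be seen rarely or not at all, so a sequence counted tens of thousands of times stands out clearly from the background.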

  20. Deep sequencing reveals the complete genome and evidence for transcriptional activity of the first virus-like sequences identified in Aristotelia chilensis (Maqui Berry).

    PubMed

    Villacreses, Javier; Rojas-Herrera, Marcelo; Sánchez, Carolina; Hewstone, Nicole; Undurraga, Soledad F; Alzate, Juan F; Manque, Patricio; Maracaja-Coutinho, Vinicius; Polanco, Victor

    2015-04-01

    Here, we report the genome sequence and evidence for transcriptional activity of a virus-like element in the native Chilean berry tree Aristotelia chilensis. We propose to name the endogenous sequence Aristotelia chilensis Virus 1 (AcV1). High-throughput sequencing of the genome of this tree uncovered an endogenous viral element, with a size of 7122 bp, corresponding to the complete genome of AcV1. Its sequence contains three open reading frames (ORFs): ORFs 1 and 2 share 66%-73% amino acid similarity with members of the Caulimoviridae virus family, especially the Petunia vein clearing virus (PVCV), Petuvirus genus. ORF1 encodes a movement protein (MP); ORF2 a Reverse Transcriptase (RT) and a Ribonuclease H (RNase H) domain; and ORF3 showed no amino acid sequence similarity with any other known virus proteins. Analogous to other known endogenous pararetrovirus sequences (EPRVs), AcV1 is integrated in the genome of Maqui Berry and showed low viral transcriptional activity, which was detected by deep sequencing technology (DNA and RNA-seq). Phylogenetic analysis of AcV1 and other pararetroviruses revealed a closer resemblance with Petuvirus. Overall, our data suggest that AcV1 could be a new member of the Caulimoviridae family, genus Petuvirus, and the first evidence of this kind of virus in a fruit plant. PMID:25855242

  1. Deep Sequencing Reveals the Complete Genome and Evidence for Transcriptional Activity of the First Virus-Like Sequences Identified in Aristotelia chilensis (Maqui Berry)

    PubMed Central

    Villacreses, Javier; Rojas-Herrera, Marcelo; Sánchez, Carolina; Hewstone, Nicole; Undurraga, Soledad F.; Alzate, Juan F.; Manque, Patricio; Maracaja-Coutinho, Vinicius; Polanco, Victor

    2015-01-01

    Here, we report the genome sequence and evidence for transcriptional activity of a virus-like element in the native Chilean berry tree Aristotelia chilensis. We propose to name the endogenous sequence Aristotelia chilensis Virus 1 (AcV1). High-throughput sequencing of the genome of this tree uncovered an endogenous viral element, with a size of 7122 bp, corresponding to the complete genome of AcV1. Its sequence contains three open reading frames (ORFs): ORFs 1 and 2 share 66%–73% amino acid similarity with members of the Caulimoviridae virus family, especially the Petunia vein clearing virus (PVCV), Petuvirus genus. ORF1 encodes a movement protein (MP); ORF2 a Reverse Transcriptase (RT) and a Ribonuclease H (RNase H) domain; and ORF3 showed no amino acid sequence similarity with any other known virus proteins. Analogous to other known endogenous pararetrovirus sequences (EPRVs), AcV1 is integrated in the genome of Maqui Berry and showed low viral transcriptional activity, which was detected by deep sequencing technology (DNA and RNA-seq). Phylogenetic analysis of AcV1 and other pararetroviruses revealed a closer resemblance with Petuvirus. Overall, our data suggest that AcV1 could be a new member of the Caulimoviridae family, genus Petuvirus, and the first evidence of this kind of virus in a fruit plant. PMID:25855242

  2. Deep sequencing reveals a novel closterovirus associated with wild rose leaf rosette disease.

    PubMed

    He, Yan; Yang, Zuokun; Hong, Ni; Wang, Guoping; Ning, Guogui; Xu, Wenxing

    2015-06-01

    A bizarre virus-like symptom of a leaf rosette formed by dense small leaves on branches of wild roses (Rosa multiflora Thunb.), designated as 'wild rose leaf rosette disease' (WRLRD), was observed in China. To investigate the presumed causal virus, a wild rose sample affected by WRLRD was subjected to deep sequencing of small interfering RNAs (siRNAs) for a complete survey of the infecting viruses and viroids. The assembly of siRNAs led to the reconstruction of the complete genomes of three known viruses, namely Apple stem grooving virus (ASGV), Blackberry chlorotic ringspot virus (BCRV) and Prunus necrotic ringspot virus (PNRSV), and of a novel virus provisionally named 'rose leaf rosette-associated virus' (RLRaV). Phylogenetic analysis clearly placed RLRaV alongside members of the genus Closterovirus, family Closteroviridae. Genome organization of RLRaV RNA (17,653 nucleotides) showed 13 open reading frames (ORFs); apart from ORF1 and the quintuple gene block, most showed no significant similarities with known viral proteins but instead had detectable identities to fungal or bacterial proteins. Additional novel molecular features indicated that RLRaV seems to be the most complex virus among the known genus members. To our knowledge, this is the first report of WRLRD and its associated closterovirus, as well as two ilarviruses and one capillovirus, infecting wild roses. Our findings present novel information about the closterovirus and the aetiology of this rose disease which should facilitate its control. More importantly, the novel features of RLRaV help to clarify the molecular and evolutionary features of the closterovirus. PMID:25187347

  3. Unique gene program of rat small resistance mesenteric arteries as revealed by deep RNA sequencing

    PubMed Central

    Reho, John J; Shetty, Amol; Dippold, Rachael P; Mahurkar, Anup; Fisher, Steven A

    2015-01-01

    Deep sequencing of RNA samples from rat small mesenteric arteries (MA) and aorta (AO) identified common and unique features of their gene programs. ∼5% of mRNAs were quantitatively differentially expressed in MA versus AO. Unique transcriptional control in MA smooth muscle is suggested by the selective or enriched expression of transcription factors Nkx2-3, HAND2, and Tcf21 (Capsulin). Enrichment in AO of PPAR transcription factors and their target genes of mitochondrial function, lipid metabolism, and oxidative phosphorylation is consistent with slow (oxidative) tonic smooth muscle. In contrast MA was enriched in contractile and calcium channel mRNAs suggestive of components of fast (glycolytic) phasic smooth muscle. Myosin phosphatase regulatory subunit paralogs Mypt1 and p85 were expressed at similar levels, while smooth muscle MLCK was the only such kinase expressed, suggesting functional redundancy of the former but not the latter in accordance with mouse knockout studies. With regard to vaso-regulatory signals, purinergic receptors P2rx1 and P2rx5 were reciprocally expressed in MA versus AO, while the olfactory receptor Olr59 was enriched in MA. Alox15, which generates the EDHF HPETE, was enriched in MA while eNOS was equally expressed, consistent with the greater role of EDHF in the smaller arteries. mRNAs that were not expressed at a level consistent with impugned function include skeletal myogenic factors, IKK2, nonmuscle myosin, and Gnb3. This screening analysis of gene expression in the small mesenteric resistance arteries suggests testable hypotheses regarding unique aspects of small artery function in the regional control of blood flow. PMID:26156969

  4. Sequence stratigraphy and sedimentology of a shelf-margin lowstand wedge in the deep Wilcox flexure trend of south Texas

    SciTech Connect

    Snedden, J.W.; Cooke, J.C.; Johnson, R.K.; Conrad, K.T.

    1991-03-01

    An integrated sedimentologic and biostratigraphic study of 15 wells and over 1400 ft (430 m) of core facilitated establishment of a sequence stratigraphic framework for the deep Wilcox Group of south Texas. This analysis also revealed the presence of a dip-restricted, sand-prone sediment wedge that produces hydrocarbons in growth-fault structures. A sequence stratigraphic framework for the Wilcox was constructed via the use of faunal-increase markers, thin intervals present in well cuttings characterized by rises in the relative abundance of planktonic foraminifera. These marine flooding horizons can be utilized to subdivide the Wilcox Group into four depositional sequences termed P(aleogene)-8, P-7, P-4, and P-3, in descending order. Identification of standard sequence-bounding unconformities is hampered by the poor seismic expression of the Wilcox and the structural complexity of the area.

  5. Development of a candidate reference material for adventitious virus detection in vaccine and biologicals manufacturing by deep sequencing

    PubMed Central

    Mee, Edward T.; Preston, Mark D.; Minor, Philip D.; Schepelmann, Silke; Huang, Xuening; Nguyen, Jenny; Wall, David; Hargrove, Stacey; Fu, Thomas; Xu, George; Li, Li; Cote, Colette; Delwart, Eric; Li, Linlin; Hewlett, Indira; Simonyan, Vahan; Ragupathy, Viswanath; Alin, Voskanian-Kordi; Mermod, Nicolas; Hill, Christiane; Ottenwälder, Birgit; Richter, Daniel C.; Tehrani, Arman; Jacqueline, Weber-Lehmann; Cassart, Jean-Pol; Letellier, Carine; Vandeputte, Olivier; Ruelle, Jean-Louis; Deyati, Avisek; La Neve, Fabio; Modena, Chiara; Mee, Edward; Schepelmann, Silke; Preston, Mark; Minor, Philip; Eloit, Marc; Muth, Erika; Lamamy, Arnaud; Jagorel, Florence; Cheval, Justine; Anscombe, Catherine; Misra, Raju; Wooldridge, David; Gharbia, Saheer; Rose, Graham; Ng, Siemon H.S.; Charlebois, Robert L.; Gisonni-Lex, Lucy; Mallet, Laurent; Dorange, Fabien; Chiu, Charles; Naccache, Samia; Kellam, Paul; van der Hoek, Lia; Cotten, Matt; Mitchell, Christine; Baier, Brian S.; Sun, Wenping; Malicki, Heather D.

    2016-01-01

    Background Unbiased deep sequencing offers the potential for improved adventitious virus screening in vaccines and biotherapeutics. Successful implementation of such assays will require appropriate control materials to confirm assay performance and sensitivity. Methods A common reference material containing 25 target viruses was produced and 16 laboratories were invited to process it using their preferred adventitious virus detection assay. Results Fifteen laboratories returned results, obtained using a wide range of wet-lab and informatics methods. Six of 25 target viruses were detected by all laboratories, with the remaining viruses detected by 4–14 laboratories. Six non-target viruses were detected by three or more laboratories. Conclusion The study demonstrated that a wide range of methods are currently used for adventitious virus detection screening in biological products by deep sequencing and that they can yield significantly different results. This underscores the need for common reference materials to ensure satisfactory assay performance and enable comparisons between laboratories. PMID:26709640

  6. Proteome-wide Identification of Novel Ceramide-binding Proteins by Yeast Surface cDNA Display and Deep Sequencing.

    PubMed

    Bidlingmaier, Scott; Ha, Kevin; Lee, Nam-Kyung; Su, Yang; Liu, Bin

    2016-04-01

    Although the bioactive sphingolipid ceramide is an important cell signaling molecule, relatively few direct ceramide-interacting proteins are known. We used an approach combining yeast surface cDNA display and deep sequencing technology to identify novel proteins binding directly to ceramide. We identified 234 candidate ceramide-binding protein fragments and validated binding for 20. Most (17) bound selectively to ceramide, although a few (3) bound to other lipids as well. Several novel ceramide-binding domains were discovered, including the EF-hand calcium-binding motif, the heat shock chaperonin-binding motif STI1, the SCP2 sterol-binding domain, and the tetratricopeptide repeat region motif. Interestingly, four of the verified ceramide-binding proteins (HPCA, HPCAL1, NCS1, and VSNL1) and an additional three candidate ceramide-binding proteins (NCALD, HPCAL4, and KCNIP3) belong to the neuronal calcium sensor family of EF hand-containing proteins. We used mutagenesis to map the ceramide-binding site in HPCA and to create a mutant HPCA that does not bind to ceramide. We demonstrated selective binding to ceramide by mammalian cell-produced wild type but not mutant HPCA. Intriguingly, we also identified a fragment from prostaglandin D2 synthase that binds preferentially to ceramide 1-phosphate. The wide variety of proteins and domains capable of binding to ceramide suggests that many of the signaling functions of ceramide may be regulated by direct binding to these proteins. Based on the deep sequencing data, we estimate that our yeast surface cDNA display library covers ∼60% of the human proteome and our selection/deep sequencing protocol can identify target-interacting protein fragments that are present at extremely low frequency in the starting library. Thus, the yeast surface cDNA display/deep sequencing approach is a rapid, comprehensive, and flexible method for the analysis of protein-ligand interactions, particularly for the study of non-protein ligands.
PMID

  7. The mitochondrial genome sequence of a deep-sea, hydrothermal vent limpet, Lepetodrilus nux, presents a novel vetigastropod gene arrangement.

    PubMed

    Nakajima, Yuichi; Shinzato, Chuya; Khalturina, Mariia; Nakamura, Masako; Watanabe, Hiromi; Satoh, Noriyuki; Mitarai, Satoshi

    2016-08-01

    While mitochondrial (mt) genomes are used extensively for comparative and evolutionary genomics, few mt genomes of deep-sea species, including hydrothermal vent species, have been determined. The genus Lepetodrilus is a major deep-sea gastropod taxon that occurs in various deep-sea ecosystems. Using next-generation sequencing, we determined the nearly complete mitochondrial genome sequence of Lepetodrilus nux, which inhabits hydrothermal vents in the Okinawa Trough. The total length of the mitochondrial genome is 16,353 bp, excluding the repeat region. It contains 13 protein-coding genes, 22 tRNA genes, two rRNA genes, and a control region, typical of most metazoan genomes. Compared with other vetigastropod mt genome sequences, L. nux employs a novel mt gene arrangement. Other novel arrangements have been identified in the vetigastropod Fissurella volcano and in Chrysomallon squamiferum, a neomphaline gastropod; however, all three gene arrangements are different, and Bayesian inference suggests that each lineage diverged independently. Our findings suggest that vetigastropod mt gene arrangements are more diverse than previously realized. PMID:27102631

  8. Microbial Diversity in Deep-sea Methane Seep Sediments Presented by SSU rRNA Gene Tag Sequencing

    PubMed Central

    Nunoura, Takuro; Takaki, Yoshihiro; Kazama, Hiromi; Hirai, Miho; Ashi, Juichiro; Imachi, Hiroyuki; Takai, Ken

    2012-01-01

    Microbial community structures in methane seep sediments in the Nankai Trough were analyzed by tag-sequencing analysis for the small subunit (SSU) rRNA gene using a newly developed primer set. The dominant members of Archaea were Deep-sea Hydrothermal Vent Euryarchaeotic Group 6 (DHVEG 6), Marine Group I (MGI) and Deep Sea Archaeal Group (DSAG), and those in Bacteria were Alpha-, Gamma-, Delta- and Epsilonproteobacteria, Chloroflexi, Bacteroidetes, Planctomycetes and Acidobacteria. Diversity and richness were examined by 8,709 and 7,690 tag-sequences from sediments at 5 and 25 cm below the seafloor (cmbsf), respectively. The estimated diversity and richness in the methane seep sediment are as high as those in soil and deep-sea hydrothermal environments, although the tag-sequences obtained in this study were not sufficient to show whole microbial diversity in this analysis. We also compared the diversity and richness of each taxon/division between the sediments from the two depths, and found that the diversity and richness of some taxa/divisions varied significantly along with the depth. PMID:22510646
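
    The abstract does not state which diversity and richness estimators were used. As a rough illustration only, the sketch below computes observed richness and the Shannon index from a list of per-tag taxon labels; both the choice of the Shannon index and the toy labels are assumptions, not the authors' method.

```python
import math
from collections import Counter


def observed_richness(labels):
    """Number of distinct taxa (or OTUs) seen among the tag labels."""
    return len(set(labels))


def shannon_index(labels):
    """Shannon diversity H = -sum(p_i * ln p_i) over taxon proportions p_i."""
    counts = Counter(labels)
    total = sum(counts.values())
    return -sum((n / total) * math.log(n / total) for n in counts.values())


# Toy labels (hypothetical): one entry per tag sequence.
shallow = ["MGI", "MGI", "DSAG", "DSAG"]
deep = ["MGI", "MGI", "MGI", "DSAG"]
```

    With equal richness, the evenly split sample has the higher Shannon index, which is the kind of per-depth difference the comparison in the abstract is concerned with. Because both measures depend on sequencing depth, samples with different tag counts are usually subsampled to a common depth before being compared.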

  9. Deep Sequencing for the Detection of Virus-Like Sequences in the Brains of Patients with Multiple Sclerosis: Detection of GBV-C in Human Brain

    PubMed Central

    Kriesel, John D.; Hobbs, Maurine R.; Jones, Brandt B.; Milash, Brett; Nagra, Rashed M.; Fischer, Kael F.

    2012-01-01

    Multiple sclerosis (MS) is a demyelinating disease of unknown origin that affects the central nervous system of an estimated 400,000 Americans. GBV-C or hepatitis G is a flavivirus that is found in the serum of 1–2% of blood donors. It was originally associated with hepatitis, but is now believed to be a relatively non-pathogenic lymphotropic virus. Fifty frozen specimens from the brains of deceased persons affected by MS were obtained along with 15 normal control brain specimens. RNA was extracted and ribosomal RNAs were depleted before sequencing on the Illumina GAII. These 36 bp reads were compared with a non-redundant database derived from the 600,000+ viral sequences in GenBank organized into 4080 taxa. An individual read successfully aligned to the viral database was considered to be a “hit”. Normalized MS specimen hit rates for each viral taxon were compared to the distribution of hits in the normal controls. Seventeen MS and 11 control brain extracts were sequenced, yielding 4–10 million sequences (“reads”) each. Over-representation of sequence from at least one of 12 viral taxa was observed in 7 of the 17 MS samples. Sequences resembling other viruses previously implicated in the pathogenesis of MS were not significantly enriched in any of the diseased brain specimens. Sequences from GB virus C (GBV-C), a flavivirus not previously isolated from brain, were enriched in one of the MS samples. GBV-C in this brain specimen was confirmed by specific amplification in this single MS brain specimen, but not in the 30 other MS brain samples available. The entire 9.4 kb sequence of this GBV-C isolate is reported here. This study shows the feasibility of deep sequencing for the detection of occult viral infections in the brains of deceased persons with MS. The first isolation of GBV-C from human brain is reported here. PMID:22412845
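
    The comparison described in this abstract (normalizing per-taxon hit counts by sequencing depth, then comparing each MS specimen against the distribution in normal controls) might be sketched as follows. The abstract does not give the normalization or the statistic used, so the hits-per-million scaling, the z-score, and all function names and toy numbers here are assumptions for illustration.

```python
from statistics import mean, stdev


def hits_per_million(hits, total_reads):
    """Scale a raw hit count for one viral taxon to hits per million reads."""
    return hits * 1_000_000 / total_reads


def enrichment_z(sample_rate, control_rates):
    """Z-score of a sample's normalized hit rate against control specimens.

    Assumes at least two control values; a zero control spread is handled
    explicitly to avoid dividing by zero.
    """
    mu = mean(control_rates)
    sigma = stdev(control_rates)
    if sigma == 0:
        return float("inf") if sample_rate > mu else 0.0
    return (sample_rate - mu) / sigma


# Toy numbers (hypothetical, not from the study).
control_rates = [1.0, 2.0, 3.0]
sample_rate = hits_per_million(50, 5_000_000)
z = enrichment_z(sample_rate, control_rates)
```

    A large positive score for one taxon in one specimen would flag it for follow-up, which in the study took the form of confirmation by specific amplification.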

  10. Correlating Geochemical and Deep Sequence Data: An Example from the Uzon Caldera, Kamchatka

    NASA Astrophysics Data System (ADS)

    Crowe, D. E.; Wagner, I. D.; Mou, X.; Ye, W.; Sun, S.; Romanek, C. S.; Moran, M. A.

    2008-12-01

    Microbial community structure is complex and relatively unknown in high temperature extreme environments. The relationship between community structure and the variable physicochemical environment that hosts the community is similarly not well understood. One of the most significant roadblocks to elucidating these relationships is the difficulty of determining which microorganisms are present in a given environment, and in what percentages. We carried out deep sequencing of 16S rRNA genes using 454 pyrosequencing methods from a series of terrestrial hot springs in the Uzon Caldera, Kamchatka, Far East Russia. Using Primer v5 software, we correlated community structure and membership to variable geochemical parameters within the springs, and determined which set of parameters is most predictive in terms of community structure. Six hot springs within the caldera were selected for study. For each spring, temperature, pH, oxygen and hydrogen isotope ratios, and a suite of elements were measured. Sediment samples were collected and bulk DNA was extracted. A set of 12 primers with broad coverage of the Bacteria and Archaea V6 region of the 16S rRNA gene was used. The 311,981 reads obtained were clustered at an identity threshold of 99%. Rarefaction analysis revealed that although between 3350 and 6700 OTUs (Bacteria plus Archaea) were identified in each pool, saturation was not attained. PCA and MDS analyses were used to evaluate relationships within the geochemical data. Both data sets were analyzed using the BIOENV subprogram of Primer v5 to evaluate which set of physicochemical parameters best explained the community structure in all pools. The results of the BIOENV analysis revealed that 94.6% of the variance in membership between pools is explained by a set of highly correlated parameters consisting of As, Cl, Li, Ca, K, and Na concentrations. 
Where As and salinity are high, Bacterial communities were dominated by Thermaceae, Pseudomonadaceae, and Nitrospiraceae, and