Science.gov

Sample records for deep short-read sequencing

  1. Unlocking Short Read Sequencing for Metagenomics

    SciTech Connect

    Rodrigue, Sébastien; Materna, Arne C.; Timberlake, Sonia C.; Blackburn, Matthew C.; Malmstrom, Rex R.; Alm, Eric J.; Chisholm, Sallie W.; Gilbert, Jack Anthony

    2010-07-28

    We describe an experimental and computational pipeline yielding millions of reads that can exceed 200 bp with quality scores approaching that of traditional Sanger sequencing. The method combines an automatable gel-less library construction step with paired-end sequencing on a short-read instrument. With appropriately sized library inserts, mate-pair sequences can overlap, and we describe the SHERA software package that joins them to form a longer composite read.
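
The joining of overlapping mates described above can be illustrated with a minimal sketch. This is not the SHERA algorithm itself; the function names, the minimum overlap, and the mismatch tolerance are assumptions chosen for illustration, and quality scores are ignored.

```python
# Hypothetical sketch of joining overlapping paired-end reads into one
# composite read. Parameter values are illustrative assumptions only.

def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    comp = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}
    return "".join(comp[base] for base in reversed(seq))


def merge_pair(read1: str, read2: str, min_overlap: int = 10,
               max_mismatch_frac: float = 0.1):
    """Join read1 with the reverse complement of read2.

    The longest suffix of read1 that matches a prefix of the
    reverse-complemented mate (within the mismatch tolerance) is used
    as the overlap. Returns None when no acceptable overlap exists.
    """
    mate = reverse_complement(read2)
    for olen in range(min(len(read1), len(mate)), min_overlap - 1, -1):
        suffix = read1[-olen:]
        prefix = mate[:olen]
        mismatches = sum(a != b for a, b in zip(suffix, prefix))
        if mismatches <= max_mismatch_frac * olen:
            # Keep read1 in the overlap and append the rest of the mate.
            return read1 + mate[olen:]
    return None
```

A real overlap joiner would typically weigh base qualities when resolving disagreements inside the overlap; this sketch simply keeps the bases of the first read.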

  2. Fast search of thousands of short-read sequencing experiments.

    PubMed

    Solomon, Brad; Kingsford, Carl

    2016-03-01

    The amount of sequence information in public repositories is growing at a rapid rate. Although these data are likely to contain clinically important information that has not yet been uncovered, our ability to effectively mine these repositories is limited. Here we introduce Sequence Bloom Trees (SBTs), a method for querying thousands of short-read sequencing experiments by sequence, 162 times faster than existing approaches. The approach searches large data archives for all experiments that involve a given sequence. We use SBTs to search 2,652 human blood, breast and brain RNA-seq experiments for all 214,293 known transcripts in under 4 days using less than 239 MB of RAM and a single CPU. Searching sequence archives at this scale and in this time frame is currently not possible using existing tools.
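
The idea of querying many experiments through a tree of Bloom filters can be sketched as follows. This is a toy illustration under stated assumptions, not the authors' SBT implementation: the k-mer length, filter size, hash scheme, pairwise tree construction, and the threshold `theta` are all hypothetical choices.

```python
import hashlib
from dataclasses import dataclass, field
from typing import List, Optional

K = 5          # toy k-mer length (assumption)
M_BITS = 4096  # Bloom filter size in bits (assumption)
N_HASHES = 2   # hash functions per k-mer (assumption)


def kmers(seq: str, k: int = K) -> set:
    """All distinct k-mers of a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def bit_positions(kmer: str) -> List[int]:
    """Derive N_HASHES bit positions from one SHA-256 digest."""
    digest = hashlib.sha256(kmer.encode()).digest()
    return [int.from_bytes(digest[4 * i:4 * i + 4], "big") % M_BITS
            for i in range(N_HASHES)]


class Bloom:
    """Bloom filter stored as a Python integer bit mask."""

    def __init__(self, bits: int = 0):
        self.bits = bits

    def add(self, kmer: str) -> None:
        for pos in bit_positions(kmer):
            self.bits |= 1 << pos

    def __contains__(self, kmer: str) -> bool:
        return all((self.bits >> pos) & 1 for pos in bit_positions(kmer))

    def union(self, other: "Bloom") -> "Bloom":
        return Bloom(self.bits | other.bits)


@dataclass
class Node:
    bloom: Bloom
    name: Optional[str] = None
    children: List["Node"] = field(default_factory=list)


def leaf(name: str, reads: List[str]) -> Node:
    """One experiment: a Bloom filter over all k-mers of its reads."""
    bloom = Bloom()
    for read in reads:
        for kmer in kmers(read):
            bloom.add(kmer)
    return Node(bloom, name=name)


def build(nodes: List[Node]) -> Node:
    """Pair nodes bottom-up; each parent holds the union of its children."""
    while len(nodes) > 1:
        merged = []
        for i in range(0, len(nodes), 2):
            pair = nodes[i:i + 2]
            if len(pair) == 1:
                merged.append(pair[0])
            else:
                merged.append(Node(pair[0].bloom.union(pair[1].bloom),
                                   children=pair))
        nodes = merged
    return nodes[0]


def query(node: Node, query_kmers: set, theta: float = 0.8) -> List[str]:
    """Return experiments whose filter holds >= theta of the query k-mers."""
    hits = sum(kmer in node.bloom for kmer in query_kmers)
    if hits < theta * len(query_kmers):
        return []  # prune: no descendant can reach the threshold
    if not node.children:
        return [node.name]
    return [name for child in node.children
            for name in query(child, query_kmers, theta)]
```

Because a parent filter is the union of its children, a subtree can be skipped as soon as its root fails the threshold, which is what allows a search to avoid visiting most experiments. Bloom filters can report false positives but never false negatives.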

  3. Viral population analysis and minority-variant detection using short read next-generation sequencing

    PubMed Central

    Watson, Simon J.; Welkers, Matthijs R. A.; Depledge, Daniel P.; Coulter, Eve; Breuer, Judith M.; de Jong, Menno D.; Kellam, Paul

    2013-01-01

    RNA viruses within infected individuals exist as a population of evolutionary-related variants. Owing to evolutionary change affecting the constitution of this population, the frequency and/or occurrence of individual viral variants can show marked or subtle fluctuations. Since the development of massively parallel sequencing platforms, such viral populations can now be investigated to unprecedented resolution. A critical problem with such analyses is the presence of sequencing-related errors that obscure the identification of true biological variants present at low frequency. Here, we report the development and assessment of the Quality Assessment of Short Read (QUASR) Pipeline (http://sourceforge.net/projects/quasr) specific for virus genome short read analysis that minimizes sequencing errors from multiple deep-sequencing platforms, and enables post-mapping analysis of the minority variants within the viral population. QUASR significantly reduces the error-related noise in deep-sequencing datasets, resulting in increased mapping accuracy and reduction of erroneous mutations. Using QUASR, we have determined influenza virus genome dynamics in sequential samples from an in vitro evolution of 2009 pandemic H1N1 (A/H1N1/09) influenza from samples sequenced on both the Roche 454 GSFLX and Illumina GAIIx platforms. Importantly, concordance between the 454 and Illumina sequencing allowed unambiguous minority-variant detection and accurate determination of virus population turnover in vitro. PMID:23382427

  4. Viral population analysis and minority-variant detection using short read next-generation sequencing.

    PubMed

    Watson, Simon J; Welkers, Matthijs R A; Depledge, Daniel P; Coulter, Eve; Breuer, Judith M; de Jong, Menno D; Kellam, Paul

    2013-03-19

    RNA viruses within infected individuals exist as a population of evolutionary-related variants. Owing to evolutionary change affecting the constitution of this population, the frequency and/or occurrence of individual viral variants can show marked or subtle fluctuations. Since the development of massively parallel sequencing platforms, such viral populations can now be investigated to unprecedented resolution. A critical problem with such analyses is the presence of sequencing-related errors that obscure the identification of true biological variants present at low frequency. Here, we report the development and assessment of the Quality Assessment of Short Read (QUASR) Pipeline (http://sourceforge.net/projects/quasr) specific for virus genome short read analysis that minimizes sequencing errors from multiple deep-sequencing platforms, and enables post-mapping analysis of the minority variants within the viral population. QUASR significantly reduces the error-related noise in deep-sequencing datasets, resulting in increased mapping accuracy and reduction of erroneous mutations. Using QUASR, we have determined influenza virus genome dynamics in sequential samples from an in vitro evolution of 2009 pandemic H1N1 (A/H1N1/09) influenza from samples sequenced on both the Roche 454 GSFLX and Illumina GAIIx platforms. Importantly, concordance between the 454 and Illumina sequencing allowed unambiguous minority-variant detection and accurate determination of virus population turnover in vitro.

  5. Software for pre-processing Illumina next-generation sequencing short read sequences

    PubMed Central

    2014-01-01

    Background When compared to Sanger sequencing technology, next-generation sequencing (NGS) technologies are hindered by shorter sequence read length, higher base-call error rate, non-uniform coverage, and platform-specific sequencing artifacts. These characteristics lower the quality of their downstream analyses, e.g. de novo and reference-based assembly, by introducing sequencing artifacts and errors that may contribute to incorrect interpretation of data. Although many tools have been developed for quality control and pre-processing of NGS data, none of them provide flexible and comprehensive trimming options in conjunction with parallel processing to expedite pre-processing of large NGS datasets. Methods We developed ngsShoRT (next-generation sequencing Short Reads Trimmer), a flexible and comprehensive open-source software package written in Perl that provides a set of algorithms commonly used for pre-processing NGS short read sequences. We compared the features and performance of ngsShoRT with existing tools: CutAdapt, NGS QC Toolkit and Trimmomatic. We also compared the effects of using pre-processed short read sequences generated by different algorithms on de novo and reference-based assembly for three different genomes: Caenorhabditis elegans, Saccharomyces cerevisiae S288c, and Escherichia coli O157 H7. Results Several combinations of ngsShoRT algorithms were tested on publicly available Illumina GA II, HiSeq 2000, and MiSeq eukaryotic and bacteria genomic short read sequences with the focus on removing sequencing artifacts and low-quality reads and/or bases. Our results show that across three organisms and three sequencing platforms, trimming improved the mean quality scores of trimmed sequences. Using trimmed sequences for de novo and reference-based assembly improved assembly quality as well as assembler performance. In general, ngsShoRT outperformed comparable trimming tools in terms of trimming speed and improvement of de novo and reference

  6. Meeting the challenges of non-referenced genome assembly from short-read sequence data

    Treesearch

    M. Parks; A. Liston; R. Cronn

    2010-01-01

    Massively parallel sequencing technologies (MPST) offer unprecedented opportunities for novel sequencing projects. MPST, while offering tremendous sequencing capacity, are typically most effective in resequencing projects (as opposed to the sequencing of novel genomes) due to the fact that sequence is returned in relatively short reads. Nonetheless, there is great...

  7. Development and transferability of black and red raspberry microsatellite markers from short-read sequences

    USDA-ARS's Scientific Manuscript database

    The advent of next-generation sequencing technologies has been a boon to the cost-effective development of molecular markers, particularly in non-model species. Here, we demonstrate the efficiency of microsatellite or simple sequence repeat (SSR) marker development from short-read sequences using th...

  8. An analysis of the feasibility of short read sequencing

    PubMed Central

    Whiteford, Nava; Haslam, Niall; Weber, Gerald; Prügel-Bennett, Adam; Essex, Jonathan W.; Roach, Peter L.; Bradley, Mark; Neylon, Cameron

    2005-01-01

    Several methods for ultra high-throughput DNA sequencing are currently under investigation. Many of these methods yield very short blocks of sequence information (reads). Here we report on an analysis showing the level of genome sequencing possible as a function of read length. It is shown that re-sequencing and de novo sequencing of the majority of a bacterial genome is possible with read lengths of 20–30 nt, and that reads of 50 nt can provide reconstructed contigs (a contiguous fragment of sequence data) of 1000 nt and greater that cover 80% of human chromosome 1. PMID:16275781
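
The dependence of sequencing feasibility on read length can be illustrated by measuring how many reads of a given length occur only once in a sequence. The sketch below is a simplified stand-in for the analysis described above, assuming a single strand, error-free reads, and a hypothetical function name.

```python
from collections import Counter


def unique_fraction(genome: str, read_len: int) -> float:
    """Fraction of start positions whose read of length read_len is unique.

    Assumes error-free reads from one strand only (an illustrative
    simplification).
    """
    counts = Counter(genome[i:i + read_len]
                     for i in range(len(genome) - read_len + 1))
    positions = sum(counts.values())
    unique = sum(1 for c in counts.values() if c == 1)
    return unique / positions
```

On a repetitive toy sequence such as `ACGACGTT`, reads of length 3 are unique at 4 of 6 positions, while reads of length 4 are unique at every position. The same trend, at a much larger scale, is why longer reads resolve more of a genome.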

  9. Short-Read Sequencing for Genomic Analysis of the Brown Rot Fungus Fibroporia radiculosa

    Treesearch

    J. D. Tang; A. D. Perkins; T. S. Sonstegard; S. G. Schroeder; S. C. Burgess; S. V. Diehl

    2012-01-01

    The feasibility of short-read sequencing for genomic analysis was demonstrated for Fibroporia radiculosa, a copper-tolerant fungus that causes brown rot decay of wood. The effect of read quality on genomic assembly was assessed by filtering Illumina GAIIx reads from a single run of a paired-end library (75-nucleotide read length and 300-bp fragment...

  10. Short read sequencing for Genomic Analysis of the brown rot fungus Fibroporia radiculosa

    USDA-ARS's Scientific Manuscript database

    The practical capability of short read sequencing for whole genome gene prediction was investigated for Fibroporia radiculosa, a copper-tolerant basidiomycete fungus that causes brown rot decay of wood. Illumina GAIIX reads from a single run of a paired-end library (75 nt read length, 300 bp insert...

  11. Whole-Genome Sequencing and Assembly with High-Throughput, Short-Read Technologies

    PubMed Central

    Sundquist, Andreas; Ronaghi, Mostafa; Tang, Haixu; Pevzner, Pavel; Batzoglou, Serafim

    2007-01-01

    While recently developed short-read sequencing technologies may dramatically reduce the sequencing cost and eventually achieve the $1000 goal for re-sequencing, their limitations prevent the de novo sequencing of eukaryotic genomes with the standard shotgun sequencing protocol. We present SHRAP (SHort Read Assembly Protocol), a sequencing protocol and assembly methodology that utilizes high-throughput short-read technologies. We describe a variation on hierarchical sequencing with two crucial differences: (1) we select a clone library from the genome randomly rather than as a tiling path and (2) we sample clones from the genome at high coverage and reads from the clones at low coverage. We assume that 200 bp read lengths with a 1% error rate and inexpensive random fragment cloning on whole mammalian genomes is feasible. Our assembly methodology is based on first ordering the clones and subsequently performing read assembly in three stages: (1) local assemblies of regions significantly smaller than a clone size, (2) clone-sized assemblies of the results of stage 1, and (3) chromosome-sized assemblies. By aggressively localizing the assembly problem during the first stage, our method succeeds in assembling short, unpaired reads sampled from repetitive genomes. We tested our assembler using simulated reads from D. melanogaster and human chromosomes 1, 11, and 21, and produced assemblies with large sets of contiguous sequence and a misassembly rate comparable to other draft assemblies. Tested on D. melanogaster and the entire human genome, our clone-ordering method produces accurate maps, thereby localizing fragment assembly and enabling the parallelization of the subsequent steps of our pipeline. Thus, we have demonstrated that truly inexpensive de novo sequencing of mammalian genomes will soon be possible with high-throughput, short-read technologies using our methodology. PMID:17534434

  12. ResSeq: Enhancing Short-Read Sequencing Alignment By Rescuing Error-Containing Reads.

    PubMed

    Feng, Weixing; Sang, Peichao; Lian, Deyuan; Dong, Yansheng; Song, Fengfei; Li, Meng; He, Bo; Cao, Fenglin; Liu, Yunlong

    2015-01-01

    Next-generation short-read sequencing is widely utilized in genomic studies. Biological applications require an alignment step to map sequencing reads to the reference genome, before acquiring expected genomic information. This requirement makes alignment accuracy a key factor for effective biological interpretation. Normally, when accounting for measurement errors and single nucleotide polymorphisms, short read mappings with a few mismatches are generally considered acceptable. However, to further improve the efficiency of short-read sequencing alignment, we propose a method to retrieve additional reliably aligned reads (reads with more than a pre-defined number of mismatches), using a Bayesian-based approach. In this method, we first retrieve the sequence context around the mismatched nucleotides within the already aligned reads; these loci contain the genomic features where sequencing errors occur. Then, using the derived pattern, we evaluate the remaining (typically discarded) reads with more than the allowed number of mismatches, and calculate a score that represents the probability that a specific alignment is correct. This strategy allows the extraction of more reliably aligned reads, therefore improving alignment sensitivity. The source code of our tool, ResSeq, can be downloaded from: https://github.com/hrbeubiocenter/Resseq.

  13. Short reads from honey bee (Apis sp.) sequencing projects reflect microbial associate diversity.

    PubMed

    Gerth, Michael; Hurst, Gregory D D

    2017-01-01

    High throughput (or 'next generation') sequencing has transformed most areas of biological research and is now a standard method that underpins empirical study of organismal biology, and (through comparison of genomes) reveals patterns of evolution. For projects focused on animals, these sequencing methods do not discriminate between the primary target of sequencing (the animal genome) and 'contaminating' material, such as associated microbes. A common first step is to filter out these contaminants to allow better assembly of the animal genome or transcriptome. Here, we aimed to assess if these 'contaminations' provide information with regard to biologically important microorganisms associated with the individual. To achieve this, we examined whether the short read data from Apis retrieved elements of its well established microbiome. To this end, we screened almost 1,000 short read libraries of honey bee (Apis sp.) DNA sequencing projects for the presence of microbial sequences, and found sequences from known honey bee microbial associates in at least 11% of them. Further to this, we screened ∼500 Apis RNA sequencing libraries for evidence of viral infections, which were found to be present in about half of them. We then used the data to reconstruct draft genomes of three Apis associated bacteria, as well as several viral strains de novo. We conclude that 'contamination' in short read sequencing libraries can provide useful genomic information on microbial taxa known to be associated with the target organisms, and may even lead to the discovery of novel associations. Finally, we demonstrate that RNAseq samples from experiments commonly carry uneven viral loads across libraries. We note that variation in viral presence and load may be a confounding feature of differential gene expression analyses, and as such it should be incorporated as a random factor in analyses.

  14. Assembled sequence contigs by SOAPdenovo and Velvet algorithms from metagenomic short reads of a new bacterial isolate of gut origin

    USDA-ARS's Scientific Manuscript database

    Assembled sequence contigs by SOAPdenovo and Velvet algorithms from metagenomic short reads of a new bacterial isolate of gut origin. This study included 2 submissions with a total of 9.8 million bp of assembled contigs....

  15. Haplotype-Phased Synthetic Long Reads from Short-Read Sequencing

    PubMed Central

    Stapleton, James A.; Kim, Jeongwoon; Hamilton, John P.; Wu, Ming; Irber, Luiz C.; Maddamsetti, Rohan; Briney, Bryan; Newton, Linsey; Burton, Dennis R.; Brown, C. Titus; Chan, Christina; Buell, C. Robin; Whitehead, Timothy A.

    2016-01-01

    Next-generation DNA sequencing has revolutionized the study of biology. However, the short read lengths of the dominant instruments complicate assembly of complex genomes and haplotype phasing of mixtures of similar sequences. Here we demonstrate a method to reconstruct the sequences of individual nucleic acid molecules up to 11.6 kilobases in length from short (150-bp) reads. We show that our method can construct 99.97%-accurate synthetic reads from bacterial, plant, and animal genomic samples, full-length mRNA sequences from human cancer cell lines, and individual HIV env gene variants from a mixture. The preparation of multiple samples can be multiplexed into a single tube, further reducing effort and cost relative to competing approaches. Our approach generates sequencing libraries in three days from less than one microgram of DNA in a single-tube format without custom equipment or specialized expertise. PMID:26789840

  16. The effect of strand bias in Illumina short-read sequencing data

    PubMed Central

    2012-01-01

    Background When using Illumina high throughput short read data, sometimes the genotype inferred from the positive strand and negative strand are significantly different, with one homozygous and the other heterozygous. This phenomenon is known as strand bias. In this study, we used Illumina short-read sequencing data to evaluate the effect of strand bias on genotyping quality, and to explore the possible causes of strand bias. Result We collected 22 breast cancer samples from 22 patients and sequenced their exome using the Illumina GAIIx machine. By comparing the consistency between the genotypes inferred from this sequencing data with the genotypes inferred from SNP chip data, we found that, when using sequencing data, SNPs with extreme strand bias did not have significantly lower consistency rates compared to SNPs with low or no strand bias. However, this result may be limited by the small subset of SNPs present in both the exome sequencing and the SNP chip data. We further compared the transition and transversion ratio and the number of novel non-synonymous SNPs between the SNPs with low or no strand bias and those with extreme strand bias, and found that SNPs with low or no strand bias have better overall quality. We also discovered that the strand bias occurs randomly at genomic positions across these samples, and observed no consistent pattern of strand bias location across samples. By comparing results from two different aligners, BWA and Bowtie, we found very consistent strand bias patterns. Thus strand bias is unlikely to be caused by alignment artifacts. We successfully replicated our results using two additional independent datasets with different capturing methods and Illumina sequencers. Conclusion Extreme strand bias indicates a potential high false-positive rate for SNPs. PMID:23176052
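
The definition of strand bias given above (different genotypes inferred from the two strands) can be expressed as a small check. This is a minimal sketch, not the study's method: the allele-fraction thresholds and function names are assumptions for illustration.

```python
def call_genotype(ref_count: int, alt_count: int,
                  het_low: float = 0.2, het_high: float = 0.8):
    """Naive genotype call from allele counts on one strand.

    The 0.2 / 0.8 alternate-allele fraction cut-offs are hypothetical.
    Returns None when the strand has no coverage.
    """
    total = ref_count + alt_count
    if total == 0:
        return None
    alt_fraction = alt_count / total
    if alt_fraction < het_low:
        return "hom_ref"
    if alt_fraction > het_high:
        return "hom_alt"
    return "het"


def has_strand_bias(fwd_ref: int, fwd_alt: int,
                    rev_ref: int, rev_alt: int) -> bool:
    """True when both strands are covered and their genotype calls differ."""
    fwd = call_genotype(fwd_ref, fwd_alt)
    rev = call_genotype(rev_ref, rev_alt)
    return fwd is not None and rev is not None and fwd != rev
```

For example, a site with 10 reference and 10 alternate reads on the forward strand but 20 reference and 0 alternate reads on the reverse strand would be flagged, since one strand suggests a heterozygous call and the other a homozygous one.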

  17. AREM: Aligning Short Reads from ChIP-Sequencing by Expectation Maximization

    NASA Astrophysics Data System (ADS)

    Newkirk, Daniel; Biesinger, Jacob; Chon, Alvin; Yokomori, Kyoko; Xie, Xiaohui

    High-throughput sequencing coupled to chromatin immunoprecipitation (ChIP-Seq) is widely used in characterizing genome-wide binding patterns of transcription factors, cofactors, chromatin modifiers, and other DNA binding proteins. A key step in ChIP-Seq data analysis is to map short reads from high-throughput sequencing to a reference genome and identify peak regions enriched with short reads. Although several methods have been proposed for ChIP-Seq analysis, most existing methods only consider reads that can be uniquely placed in the reference genome, and therefore have low power for detecting peaks located within repeat sequences. Here we introduce a probabilistic approach for ChIP-Seq data analysis which utilizes all reads, providing a truly genome-wide view of binding patterns. Reads are modeled using a mixture model corresponding to K enriched regions and a null genomic background. We use maximum likelihood to estimate the locations of the enriched regions, and implement an expectation-maximization (E-M) algorithm, called AREM (aligning reads by expectation maximization), to update the alignment probabilities of each read to different genomic locations. We apply the algorithm to identify genome-wide binding events of two proteins: Rad21, a component of cohesin and a key factor involved in chromatid cohesion, and Srebp-1, a transcription factor important for lipid/cholesterol homeostasis. Using AREM, we were able to identify 19,935 Rad21 peaks and 1,748 Srebp-1 peaks in the mouse genome with high confidence, including 1,517 (7.6%) Rad21 peaks and 227 (13%) Srebp-1 peaks that were missed using only uniquely mapped reads. The open source implementation of our algorithm is available at http://sourceforge.net/projects/arem
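
The expectation-maximization idea for reads that map to several locations can be sketched in a reduced form. This is not the AREM implementation: it drops the background component and peak positions, and treats each candidate locus as one mixture component with a single weight. All names are hypothetical.

```python
from typing import Dict, List


def em_multireads(read_candidates: List[List[str]],
                  n_iter: int = 50) -> Dict[str, float]:
    """Estimate locus weights from reads with one or more candidate loci.

    E-step: split each read across its candidate loci in proportion to
    the current locus weights. M-step: set each weight to the share of
    reads assigned to that locus.
    """
    loci = sorted({locus for cands in read_candidates for locus in cands})
    weight = {locus: 1.0 / len(loci) for locus in loci}
    n_reads = len(read_candidates)
    for _ in range(n_iter):
        totals = dict.fromkeys(loci, 0.0)
        for cands in read_candidates:
            norm = sum(weight[locus] for locus in cands)
            for locus in cands:
                totals[locus] += weight[locus] / norm  # responsibility
        weight = {locus: totals[locus] / n_reads for locus in loci}
    return weight
```

With three reads that map only to locus `A` and two reads that map to both `A` and `B`, the ambiguous reads are drawn toward `A` over the iterations, because the uniquely mapped reads provide evidence of enrichment there. This is the intuition behind using all reads rather than discarding multi-mapped ones.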

  18. Improved variant discovery through local re-alignment of short-read next-generation sequencing data using SRMA

    PubMed Central

    2010-01-01

    A primary component of next-generation sequencing analysis is to align short reads to a reference genome, with each read aligned independently. However, reads that observe the same non-reference DNA sequence are highly correlated and can be used to better model the true variation in the target genome. A novel short-read micro re-aligner, SRMA, that leverages this correlation to better resolve a consensus of the underlying DNA sequence of the targeted genome is described here. PMID:20932289

  19. MOST: a modified MLST typing tool based on short read sequencing

    PubMed Central

    Dallman, Timothy; Schaefer, Ulf; Sheppard, Carmen L.; Ashton, Philip; Pichon, Bruno; Ellington, Matthew; Swift, Craig; Green, Jonathan; Underwood, Anthony

    2016-01-01

    Multilocus sequence typing (MLST) is an effective method to describe bacterial populations. Conventionally, MLST involves Polymerase Chain Reaction (PCR) amplification of housekeeping genes followed by Sanger DNA sequencing. Public Health England (PHE) is in the process of replacing the conventional MLST methodology with a method based on short read sequence data derived from Whole Genome Sequencing (WGS). This paper reports the comparison of the reliability of MLST results derived from WGS data, comparing mapping and assembly-based approaches to conventional methods using 323 bacterial genomes of diverse species. The sensitivity of the two WGS based methods was further investigated with 26 mixed and 29 low coverage genomic data sets from Salmonella enteritidis and Streptococcus pneumoniae. Of the 323 samples, 92.9% (n = 300), 97.5% (n = 315) and 99.7% (n = 322) full MLST profiles were derived by the conventional method, assembly- and mapping-based approaches, respectively. The concordance between samples that were typed by conventional (92.9%) and both WGS methods was 100%. From the 55 mixed and low coverage genomes, 89.1% (n = 49) and 67.3% (n = 37) full MLST profiles were derived from the mapping and assembly based approaches, respectively. In conclusion, deriving MLST from WGS data is more sensitive than the conventional method. When comparing WGS based methods, the mapping based approach was the most sensitive. In addition, the mapping based approach described here derives quality metrics, which are difficult to determine quantitatively using conventional and WGS-assembly based approaches. PMID:27602279
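
The percentages reported above follow directly from the sample counts, as this short check shows. The counts are taken from the abstract; the variable and function names are illustrative.

```python
# Counts as reported in the abstract above.
TOTAL_SAMPLES = 323
CHALLENGE_SETS = 26 + 29  # mixed plus low coverage data sets

full_profiles = {"conventional": 300, "assembly": 315, "mapping": 322}
challenge_profiles = {"mapping": 49, "assembly": 37}


def percent(n: int, total: int) -> float:
    """Percentage rounded to one decimal place."""
    return round(100 * n / total, 1)


main_pct = {method: percent(n, TOTAL_SAMPLES)
            for method, n in full_profiles.items()}
challenge_pct = {method: percent(n, CHALLENGE_SETS)
                 for method, n in challenge_profiles.items()}
```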

  20. Reference-based compression of short-read sequences using path encoding

    PubMed Central

    Kingsford, Carl; Patro, Rob

    2015-01-01

    Motivation: Storing, transmitting and archiving data produced by next-generation sequencing is a significant computational burden. New compression techniques tailored to short-read sequence data are needed. Results: We present here an approach to compression that reduces the difficulty of managing large-scale sequencing data. Our novel approach sits between pure reference-based compression and reference-free compression and combines much of the benefit of reference-based approaches with the flexibility of de novo encoding. Our method, called path encoding, draws a connection between storing paths in de Bruijn graphs and context-dependent arithmetic coding. Supporting this method is a system to compactly store sets of kmers that is of independent interest. We are able to encode RNA-seq reads using 3–11% of the space of the sequence in raw FASTA files, which is on average more than 34% smaller than competing approaches. We also show that even if the reference is very poorly matched to the reads that are being encoded, good compression can still be achieved. Availability and implementation: Source code and binaries freely available for download at http://www.cs.cmu.edu/~ckingsf/software/pathenc/, implemented in Go and supported on Linux and Mac OS X. Contact: carlk@cs.cmu.edu. Supplementary information: Supplementary data are available at Bioinformatics online. PMID:25649622

  1. Efficient yeast ChIP-Seq using multiplex short-read DNA sequencing

    PubMed Central

    Lefrançois, Philippe; Euskirchen, Ghia M; Auerbach, Raymond K; Rozowsky, Joel; Gibson, Theodore; Yellman, Christopher M; Gerstein, Mark; Snyder, Michael

    2009-01-01

    Background Short-read high-throughput DNA sequencing technologies provide new tools to answer biological questions. However, high cost and low throughput limit their widespread use, particularly in organisms with smaller genomes such as S. cerevisiae. Although ChIP-Seq in mammalian cell lines is replacing array-based ChIP-chip as the standard for transcription factor binding studies, ChIP-Seq in yeast is still underutilized compared to ChIP-chip. We developed a multiplex barcoding system that allows simultaneous sequencing and analysis of multiple samples using Illumina's platform. We applied this method to analyze the chromosomal distributions of three yeast DNA binding proteins (Ste12, Cse4 and RNA PolII) and a reference sample (input DNA) in a single experiment and demonstrate its utility for rapid and accurate results at reduced costs. Results We developed a barcoding ChIP-Seq method for the concurrent analysis of transcription factor binding sites in yeast. Our multiplex strategy generated high quality data that was indistinguishable from data obtained with non-barcoded libraries. None of the barcoded adapters induced differences relative to a non-barcoded adapter when applied to the same DNA sample. We used this method to map the binding sites for Cse4, Ste12 and Pol II throughout the yeast genome and we found 148 binding targets for Cse4, 823 targets for Ste12 and 2508 targets for PolII. Cse4 was strongly bound to all yeast centromeres as expected and the remaining non-centromeric targets correspond to highly expressed genes in rich media. The presence of Cse4 non-centromeric binding sites was not reported previously. Conclusion We designed a multiplex short-read DNA sequencing method to perform efficient ChIP-Seq in yeast and other small genome model organisms. This method produces accurate results with higher throughput and reduced cost. Given constant improvements in high-throughput sequencing technologies, increasing multiplexing will be possible to

  2. Indel variant analysis of short-read sequencing data with Scalpel.

    PubMed

    Fang, Han; Bergmann, Ewa A; Arora, Kanika; Vacic, Vladimir; Zody, Michael C; Iossifov, Ivan; O'Rawe, Jason A; Wu, Yiyang; Jimenez Barron, Laura T; Rosenbaum, Julie; Ronemus, Michael; Lee, Yoon-Ha; Wang, Zihua; Dikoglu, Esra; Jobanputra, Vaidehi; Lyon, Gholson J; Wigler, Michael; Schatz, Michael C; Narzisi, Giuseppe

    2016-12-01

    As the second most common type of variation in the human genome, insertions and deletions (indels) have been linked to many diseases, but the discovery of indels of more than a few bases in size from short-read sequencing data remains challenging. Scalpel (http://scalpel.sourceforge.net) is an open-source software for reliable indel detection based on the microassembly technique. It has been successfully used to discover mutations in novel candidate genes for autism, and it is extensively used in other large-scale studies of human diseases. This protocol gives an overview of the algorithm and describes how to use Scalpel to perform highly accurate indel calling from whole-genome and whole-exome sequencing data. We provide detailed instructions for an exemplary family-based de novo study, but we also characterize the other two supported modes of operation: single-sample and somatic analysis. Indel normalization, visualization and annotation of the mutations are also illustrated. Using a standard server, indel discovery and characterization in the exonic regions of the example sequencing data can be completed in ∼5 h after read mapping.

  3. Short-read, high-throughput sequencing technology for STR genotyping.

    PubMed

    Bornman, Daniel M; Hester, Mark E; Schuetter, Jared M; Kasoji, Manjula D; Minard-Smith, Angela; Barden, Curt A; Nelson, Scott C; Godbold, Gene D; Baker, Christine H; Yang, Boyu; Walther, Jacquelyn E; Tornes, Ivan E; Yan, Pearlly S; Rodriguez, Benjamin; Bundschuh, Ralf; Dickens, Michael L; Young, Brian A; Faith, Seth A

    2012-04-01

    DNA-based methods for human identification principally rely upon genotyping of short tandem repeat (STR) loci. Electrophoretic-based techniques for variable-length classification of STRs are universally utilized, but are limited in that they have relatively low throughput and do not yield nucleotide sequence information. High-throughput sequencing technology may provide a more powerful instrument for human identification, but is not currently validated for forensic casework. Here, we present a systematic method to perform high-throughput genotyping analysis of the Combined DNA Index System (CODIS) STR loci using short-read (150 bp) massively parallel sequencing technology. Open source reference alignment tools were optimized to evaluate PCR-amplified STR loci using a custom designed STR genome reference. Evaluation of this approach demonstrated that the 13 CODIS STR loci and amelogenin (AMEL) locus could be accurately called from individual and mixture samples. Sensitivity analysis showed that as few as 18,500 reads, aligned to an in silico referenced genome, were required to genotype an individual (>99% confidence) for the CODIS loci. The power of this technology was further demonstrated by identification of variant alleles containing single nucleotide polymorphisms (SNPs) and the development of quantitative measurements (reads) for resolving mixed samples.
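
Calling STR alleles from reads amounts to counting repeat units per read and tallying the counts. The sketch below illustrates that idea only; it is not the published pipeline, which aligns reads to a custom STR reference. The function names and the `min_fraction` cut-off are assumptions.

```python
import re
from collections import Counter
from typing import List


def longest_repeat_run(read: str, motif: str) -> int:
    """Number of motif copies in the longest uninterrupted run in a read."""
    pattern = re.compile(f"(?:{re.escape(motif)})+")
    runs = [len(m.group(0)) // len(motif) for m in pattern.finditer(read)]
    return max(runs, default=0)


def call_alleles(reads: List[str], motif: str,
                 min_fraction: float = 0.2) -> List[int]:
    """Repeat counts supported by at least min_fraction of the reads.

    The 0.2 cut-off is a hypothetical way to ignore low-level artefacts
    such as stutter; reads without any repeat are skipped.
    """
    counts = Counter(longest_repeat_run(read, motif) for read in reads)
    counts.pop(0, None)
    total = sum(counts.values())
    if total == 0:
        return []
    return sorted(n for n, c in counts.items() if c / total >= min_fraction)
```

Because each allele call is backed by a read count, the same tally gives a quantitative handle on mixtures, which is the kind of measurement the abstract refers to.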

  4. Reconstruction of Acetogenesis Pathway Using Short-Read Sequencing of Clostridium aceticum Genome.

    PubMed

    Lee, Sooin; Song, Yoseb; Choe, Donghui; Cho, Suhyung; Yu, Seok Jong; Cho, Yongseong; Kim, Sun Chang; Cho, Byung-Kwan

    2015-05-01

    Clostridium aceticum is an anaerobic homoacetogen, able to reduce CO2 to multi-carbon products using the reductive acetyl-CoA pathway. This unique ability to use CO2 or CO makes the microbe a potential platform for the biotech industry. However, the development of genetically engineered homoacetogens for the large-scale production of commodity chemicals is hampered by the limited amount of genetic and metabolic information available for them. Here we exploited next-generation sequencing to reveal the C. aceticum genome. The short-read sequencing produced 44,871,196 high quality reads with an average length of 248 bases. Following a sequence trimming step, 30,256,976 reads were assembled into 12,563 contigs with 168-fold coverage and 1,971 bases in length using a de Bruijn graph algorithm. Since the k-mer hash length in the algorithm is an important factor for the quality of output contigs, a window of k-mers (k-51 to k-201) was tested to obtain high quality contigs. In addition to the assembly metrics, the functional annotation of the contigs was investigated to select the k-mer optimum. Metabolic pathway mapping using the functional annotation identified the majority of central metabolic pathways, such as glycolysis and the TCA cycle. Further, these analyses elucidated the enzymes constituting the Wood-Ljungdahl pathway, in which CO2 is fixed into acetyl-CoA. Thus, the metabolic reconstruction based on the draft genome assembly provides a foundation for the functional genomics required to engineer C. aceticum.
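
The role of the k-mer length in a de Bruijn graph assembly can be seen in a minimal graph construction. This sketch is not the assembler used in the study; it only shows how the graph built from the same reads changes with `k`, using hypothetical names and toy input.

```python
from collections import defaultdict
from typing import Dict, List, Set


def de_bruijn(reads: List[str], k: int) -> Dict[str, Set[str]]:
    """Build a de Bruijn graph: nodes are (k-1)-mers, edges are k-mers.

    Each k-mer in a read adds an edge from its first k-1 bases to its
    last k-1 bases. Reverse complements and sequencing errors are
    ignored (an illustrative simplification).
    """
    graph: Dict[str, Set[str]] = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return graph
```

For the toy read `ACGTAC`, `k = 3` yields a cycle through the node `AC`, whereas `k = 4` yields a simple path. Larger `k` resolves more repeats but requires longer error-free overlaps between reads, which is why a range of values is usually tested.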

  5. Bioinformatics challenges in de novo transcriptome assembly using short read sequences in the absence of a reference genome sequence.

    PubMed

    Góngora-Castillo, Elsa; Buell, C Robin

    2013-04-01

    Plant natural product research can be facilitated through genome and transcriptome sequencing approaches that generate informative sequence and expression datasets that enable characterization of biochemical pathways of interest. As the overwhelming majority of plant-derived natural products are derived from species with little, if any, sequence and/or genomic resources, the ability to perform whole genome shotgun sequencing and assembly has been and will continue to be transformative as access to a genome sequence provides molecular resources and a context for discovery and characterization of biosynthetic pathways. Due to the reduced size and complexity of the transcriptome relative to the genome, transcriptome sequencing provides a rapid, inexpensive approach to access gene sequences, gene expression abundances, and gene expression patterns in any species, including those that lack a reference genome sequence. To date, successful applications of RNA sequencing in conjunction with de novo transcriptome assembly have enabled identification of new genes in an array of biochemical pathways in plants. While sequencing technologies are well developed, challenges remain in the handling and analysis of transcriptome sequences. In this Highlight article, we provide an overview of the bioinformatics challenges associated with transcriptome analyses using short read sequences and how to address these issues in plant species that lack a reference genome.

  6. Assembly-based inference of B-cell receptor repertoires from short read RNA sequencing data with V’DJer

    PubMed Central

    Mose, Lisle E.; Selitsky, Sara R.; Bixby, Lisa M.; Marron, David L.; Iglesia, Michael D.; Serody, Jonathan S.; Perou, Charles M.; Vincent, Benjamin G.; Parker, Joel S.

    2016-01-01

    Motivation: B-cell receptor (BCR) repertoire profiling is an important tool for understanding the biology of diverse immunologic processes. Current methods for analyzing adaptive immune receptor repertoires depend upon PCR amplification of VDJ rearrangements followed by long read amplicon sequencing spanning the VDJ junctions. While this approach has proven to be effective, it is frequently not feasible due to cost or limited sample material. Additionally, there are many existing datasets where short-read RNA sequencing data are available but PCR amplified BCR data are not. Results: We present here V’DJer, an assembly-based method that reconstructs adaptive immune receptor repertoires from short-read RNA sequencing data. This method captures expressed BCR loci from a standard RNA-seq assay. We applied this method to 473 Melanoma samples from The Cancer Genome Atlas and demonstrate V’DJer’s ability to accurately reconstruct BCR repertoires from short read mRNA-seq data. Availability and Implementation: V’DJer is implemented in C/C++, freely available for academic use and can be downloaded from GitHub: https://github.com/mozack/vdjer Contact: benjamin_vincent@med.unc.edu or parkerjs@email.unc.edu Supplementary information: Supplementary data are available at Bioinformatics online. PMID:27559159

  7. Characterization of a biogas-producing microbial community by short-read next generation DNA sequencing

    PubMed Central

    2012-01-01

    Background Renewable energy production is currently a major issue worldwide. Biogas is a promising renewable energy carrier as the technology of its production combines the elimination of organic waste with the formation of a versatile energy carrier, methane. In consequence of the complexity of the microbial communities and metabolic pathways involved, the biotechnology of the microbiological process leading to biogas production is poorly understood. Metagenomic approaches are suitable means of addressing related questions. In the present work a novel high-throughput technique was tested for its benefits in resolving the functional and taxonomical complexity of such microbial consortia. Results It was demonstrated that the extremely parallel SOLiD™ short-read DNA sequencing platform is capable of providing sufficient useful information to decipher the systematic and functional contexts within a biogas-producing community. Although this technology has not been employed to address such problems previously, the data obtained compare well with those from similar high-throughput approaches such as 454-pyrosequencing GS FLX or Titanium. The predominant microbes contributing to the decomposition of organic matter include members of the Eubacteria, class Clostridia, order Clostridiales, family Clostridiaceae. Bacteria belonging in other systematic groups contribute to the diversity of the microbial consortium. Archaea comprise a remarkably small minority in this community, given their crucial role in biogas production. Among the Archaea, the predominant order is the Methanomicrobiales and the most abundant species is Methanoculleus marisnigri. The Methanomicrobiales are hydrogenotrophic methanogens. Besides corroborating earlier findings on the significance of the contribution of the Clostridia to organic substrate decomposition, the results demonstrate the importance of the metabolism of hydrogen within the biogas producing microbial community. Conclusions Both

  8. Efficient Graph Based Assembly of Short-Read Sequences on Hybrid Core Architecture

    SciTech Connect

    Sczyrba, Alex; Pratap, Abhishek; Canon, Shane; Han, James; Copeland, Alex; Wang, Zhong; Brewer, Tony; Soper, David; D'Jamoos, Mike; Collins, Kirby; Vacek, George

    2011-03-22

    Advanced architectures can deliver dramatically increased throughput for genomics and proteomics applications, reducing time-to-completion in some cases from days to minutes. One such architecture, hybrid-core computing, marries a traditional x86 environment with a reconfigurable coprocessor, based on field programmable gate array (FPGA) technology. In addition to higher throughput, increased performance can fundamentally improve research quality by allowing more accurate, previously impractical approaches. We will discuss the approach used by Convey's de Bruijn graph constructor for short-read, de-novo assembly. Bioinformatics applications that have random access patterns to large memory spaces, such as graph-based algorithms, experience memory performance limitations on cache-based x86 servers. Convey's highly parallel memory subsystem allows application-specific logic to simultaneously access 8192 individual words in memory, significantly increasing effective memory bandwidth over cache-based memory systems. Many algorithms, such as Velvet and other de Bruijn graph based, short-read, de-novo assemblers, can greatly benefit from this type of memory architecture. Furthermore, small data type operations (four nucleotides can be represented in two bits) make more efficient use of logic gates than the data types dictated by conventional programming models. JGI is comparing the performance of Convey's graph constructor and Velvet on both synthetic and real data. We will present preliminary results on memory usage and run time metrics for various data sets with different sizes, from small microbial and fungal genomes to a very large cow rumen metagenome. For genomes with references we will also present assembly quality comparisons between the two assemblers.
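
    The remark that the four nucleotides fit in two bits each can be made concrete with a short sketch. The bit assignment (A=00, C=01, G=10, T=11) and the function names are assumptions chosen for illustration; the abstract does not describe the encoding actually used on the coprocessor.

```python
# Hypothetical 2-bit code; any fixed one-to-one assignment would work.
ENCODE = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}
DECODE = "ACGT"


def pack(seq):
    """Pack a DNA string into one integer, 2 bits per base, first base highest."""
    value = 0
    for base in seq:
        value = (value << 2) | ENCODE[base]
    return value


def unpack(value, length):
    """Recover a DNA string of the given length from its packed integer."""
    bases = []
    for shift in range(2 * (length - 1), -1, -2):
        bases.append(DECODE[(value >> shift) & 0b11])
    return "".join(bases)


if __name__ == "__main__":
    kmer = "GATTACA"
    packed = pack(kmer)
    # 7 bases occupy 14 bits instead of 7 bytes as ASCII text.
    print(kmer, bin(packed), unpack(packed, len(kmer)))
```

    The length has to be stored alongside the packed value, because leading A bases are indistinguishable from zero padding.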

  9. Similarity thresholds used in DNA sequence assembly from short reads can reduce the comparability of population histories across species

    PubMed Central

    Judy, Caroline Duffie; Seeholzer, Glenn F.; Maley, James M.; Graves, Gary R.; Brumfield, Robb T.

    2015-01-01

    Comparing inferences among datasets generated using short read sequencing may provide insight into the concerted impacts of divergence, gene flow and selection across organisms, but comparisons are complicated by biases introduced during dataset assembly. Sequence similarity thresholds allow the de novo assembly of short reads into clusters of alleles representing different loci, but the resulting datasets are sensitive to both the similarity threshold used and to the variation naturally present in the organism under study. Thresholds that require high sequence similarity among reads for assembly (stringent thresholds) as well as highly variable species may result in datasets in which divergent alleles are lost or divided into separate loci (‘over-splitting’), whereas liberal thresholds increase the risk of paralogous loci being combined into a single locus (‘under-splitting’). Comparisons among datasets or species are therefore potentially biased if different similarity thresholds are applied or if the species differ in levels of within-lineage genetic variation. We examine the impact of a range of similarity thresholds on assembly of empirical short read datasets from populations of four different non-model bird lineages (species or species pairs) with different levels of genetic divergence. We find that, in all species, stringent similarity thresholds result in fewer alleles per locus than more liberal thresholds, which appears to be the result of high levels of over-splitting. The frequency of putative under-splitting, conversely, is low at all thresholds. Inferred genetic distances between individuals, gene tree depths, and estimates of the ancestral mutation-scaled effective population size (θ) differ depending upon the similarity threshold applied. Relative differences in inferences across species differ even when the same threshold is applied, but may be dramatically different when datasets assembled under different thresholds are compared. These

  10. Fast and accurate HLA typing from short-read next-generation sequence data with xHLA

    PubMed Central

    Xie, Chao; Yeo, Zhen Xuan; Wong, Marie; Piper, Jason; Long, Tao; Kirkness, Ewen F.; Biggs, William H.; Bloom, Ken; Spellman, Stephen; Vierra-Green, Cynthia; Brady, Colleen; Scheuermann, Richard H.; Howard, Sally; Brewerton, Suzanne; Turpaz, Yaron; Venter, J. Craig

    2017-01-01

    The HLA gene complex on human chromosome 6 is one of the most polymorphic regions in the human genome and contributes in large part to the diversity of the immune system. Accurate typing of HLA genes with short-read sequencing data has historically been difficult due to the sequence similarity between the polymorphic alleles. Here, we introduce an algorithm, xHLA, that iteratively refines the mapping results at the amino acid level to achieve 99–100% four-digit typing accuracy for both class I and II HLA genes, taking only ∼3 min to process a 30× whole-genome BAM file on a desktop computer. PMID:28674023

  11. Fast and accurate HLA typing from short-read next-generation sequence data with xHLA.

    PubMed

    Xie, Chao; Yeo, Zhen Xuan; Wong, Marie; Piper, Jason; Long, Tao; Kirkness, Ewen F; Biggs, William H; Bloom, Ken; Spellman, Stephen; Vierra-Green, Cynthia; Brady, Colleen; Scheuermann, Richard H; Telenti, Amalio; Howard, Sally; Brewerton, Suzanne; Turpaz, Yaron; Venter, J Craig

    2017-07-25

    The HLA gene complex on human chromosome 6 is one of the most polymorphic regions in the human genome and contributes in large part to the diversity of the immune system. Accurate typing of HLA genes with short-read sequencing data has historically been difficult due to the sequence similarity between the polymorphic alleles. Here, we introduce an algorithm, xHLA, that iteratively refines the mapping results at the amino acid level to achieve 99-100% four-digit typing accuracy for both class I and II HLA genes, taking only ∼3 min to process a 30× whole-genome BAM file on a desktop computer.

  12. BarraCUDA - a fast short read sequence aligner using graphics processing units

    PubMed Central

    2012-01-01

    Background With the maturation of next-generation DNA sequencing (NGS) technologies, the throughput of DNA sequencing reads has soared to over 600 gigabases from a single instrument run. General purpose computing on graphics processing units (GPGPU) extracts the computing power from hundreds of parallel stream processors within graphics processing cores and provides a cost-effective and energy efficient alternative to traditional high-performance computing (HPC) clusters. In this article, we describe the implementation of BarraCUDA, a GPGPU sequence alignment software that is based on BWA, to accelerate the alignment of sequencing reads generated by these instruments to a reference DNA sequence. Findings Using the NVIDIA Compute Unified Device Architecture (CUDA) software development environment, we ported the most computationally intensive alignment component of BWA to GPU to take advantage of the massive parallelism. As a result, BarraCUDA offers an order-of-magnitude performance boost in alignment throughput when compared to a CPU core while delivering the same level of alignment fidelity. The software is also capable of supporting multiple CUDA devices in parallel to further accelerate the alignment throughput. Conclusions BarraCUDA is designed to take advantage of the parallelism of GPU to accelerate the alignment of millions of sequencing reads generated by NGS instruments. By doing this, we could, at least in part, streamline the current bioinformatics pipeline such that the wider scientific community could benefit from the sequencing technology. BarraCUDA is currently available from http://seqbarracuda.sf.net PMID:22244497

  13. The Effect of Primer Choice and Short Read Sequences on the Outcome of 16S rRNA Gene Based Diversity Studies

    PubMed Central

    Heylen, Kim; Sessitsch, Angela; De Vos, Paul

    2013-01-01

    Different regions of the bacterial 16S rRNA gene evolve at different evolutionary rates. The scientific outcome of short read sequencing studies therefore alters with the gene region sequenced. We wanted to gain insight into the impact of primer choice on the outcome of short read sequencing efforts. All the unknowns associated with sequencing data, i.e. primer coverage rate, phylogeny, OTU-richness and taxonomic assignment, were therefore implemented in one study for ten well established universal primers (338f/r, 518f/r, 799f/r, 926f/r and 1062f/r) targeting dispersed regions of the bacterial 16S rRNA gene. All analyses were performed on nearly full length and in silico generated short read sequence libraries containing 1175 sequences that were carefully chosen to present a representative substitute of the SILVA SSU database. The 518f and 799r primers, targeting the V4 region of the 16S rRNA gene, were found to be particularly suited for short read sequencing studies, while the primer 1062r, targeting V6, seemed to be least reliable. Our results will assist scientists in considering whether the best option for their study is to select the most informative primer, or the primer that excludes interferences by host-organelle DNA. The methodology followed can be extrapolated to other primers, allowing their evaluation prior to the experiment. PMID:23977026

  14. MOSAIK: a hash-based algorithm for accurate next-generation sequencing short-read mapping.

    PubMed

    Lee, Wan-Ping; Stromberg, Michael P; Ward, Alistair; Stewart, Chip; Garrison, Erik P; Marth, Gabor T

    2014-01-01

    MOSAIK is a stable, sensitive and open-source program for mapping second and third-generation sequencing reads to a reference genome. Uniquely among current mapping tools, MOSAIK can align reads generated by all the major sequencing technologies, including Illumina, Applied Biosystems SOLiD, Roche 454, Ion Torrent and Pacific BioSciences SMRT. Indeed, MOSAIK was the only aligner to provide consistent mappings for all the generated data (sequencing technologies, low-coverage and exome) in the 1000 Genomes Project. To provide highly accurate alignments, MOSAIK employs a hash clustering strategy coupled with the Smith-Waterman algorithm. This method is well-suited to capture mismatches as well as short insertions and deletions. To support the growing interest in larger structural variant (SV) discovery, MOSAIK provides explicit support for handling known-sequence SVs, e.g. mobile element insertions (MEIs) as well as generating outputs tailored to aid in SV discovery. All variant discovery benefits from an accurate description of the read placement confidence. To this end, MOSAIK uses a neural-network based training scheme to provide well-calibrated mapping quality scores, demonstrated by a correlation coefficient between MOSAIK assigned and actual mapping qualities greater than 0.98. In order to ensure that studies of any genome are supported, a training pipeline is provided to ensure optimal mapping quality scores for the genome under investigation. MOSAIK is multi-threaded, open source, and incorporated into our command and pipeline launcher system GKNO (http://gkno.me).
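
    The Smith-Waterman step mentioned above can be sketched as a plain dynamic-programming routine. This is not MOSAIK's implementation: the hash clustering stage that selects candidate reference regions is omitted, and the scoring values (match 2, mismatch -1, linear gap -2) are assumptions chosen for illustration.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Return the best local alignment score between strings a and b.

    Uses a linear gap penalty. Cell scores are floored at zero, which is
    what makes the alignment local rather than global.
    """
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            substitution = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                0,
                score[i - 1][j - 1] + substitution,  # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,               # gap in b
                score[i][j - 1] + gap,               # gap in a
            )
            best = max(best, score[i][j])
    return best


if __name__ == "__main__":
    read = "TTACGTTT"
    reference_window = "GGACGGG"
    print(smith_waterman(read, reference_window))
```

    Because the full matrix costs time proportional to the product of the two lengths, aligners typically run it only on short reference windows selected by a faster seeding or hashing stage.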

  15. Choice of Reference Sequence and Assembler for Alignment of Listeria monocytogenes Short-Read Sequence Data Greatly Influences Rates of Error in SNP Analyses

    PubMed Central

    Pightling, Arthur W.; Petronella, Nicholas; Pagotto, Franco

    2014-01-01

    The wide availability of whole-genome sequencing (WGS) and an abundance of open-source software have made detection of single-nucleotide polymorphisms (SNPs) in bacterial genomes an increasingly accessible and effective tool for comparative analyses. Thus, ensuring that real nucleotide differences between genomes (i.e., true SNPs) are detected at high rates and that the influences of errors (such as false positive SNPs, ambiguously called sites, and gaps) are mitigated is of utmost importance. The choices researchers make regarding the generation and analysis of WGS data can greatly influence the accuracy of short-read sequence alignments and, therefore, the efficacy of such experiments. We studied the effects of some of these choices, including: i) depth of sequencing coverage, ii) choice of reference-guided short-read sequence assembler, iii) choice of reference genome, and iv) whether to perform read-quality filtering and trimming, on our ability to detect true SNPs and on the frequencies of errors. We performed benchmarking experiments, during which we assembled simulated and real Listeria monocytogenes strain 08-5578 short-read sequence datasets of varying quality with four commonly used assemblers (BWA, MOSAIK, Novoalign, and SMALT), using reference genomes of varying genetic distances, and with or without read pre-processing (i.e., quality filtering and trimming). We found that assemblies of at least 50-fold coverage provided the most accurate results. In addition, MOSAIK yielded the fewest errors when reads were aligned to a nearly identical reference genome, while using SMALT to align reads against a reference sequence that is ∼0.82% distant from 08-5578 at the nucleotide level resulted in the detection of the greatest numbers of true SNPs and the fewest errors. Finally, we show that whether read pre-processing improves SNP detection depends upon the choice of reference sequence and assembler. In total, this study demonstrates that researchers should

  16. Partial short-read sequencing of a highly inbred Iberian pig and genomics inference thereof

    PubMed Central

    Esteve-Codina, A; Kofler, R; Himmelbauer, H; Ferretti, L; Vivancos, A P; Groenen, M A M; Folch, J M; Rodríguez, M C; Pérez-Enciso, M

    2011-01-01

    Despite dramatic reduction in sequencing costs with the advent of next generation sequencing technologies, obtaining a complete mammalian genome sequence at sufficient depth is still costly. An alternative is partial sequencing. Here, we have sequenced a reduced representation library of an Iberian sow from the Guadyerbas strain, a highly inbred strain that has been used in numerous QTL studies because of its extreme phenotypic characteristics. Using the Illumina Genome Analyzer II (San Diego, CA, USA), we resequenced ∼1% of the genome with an average depth of 4×, identifying 68 778 polymorphisms. Of these, 55 457 were putative fixed differences with respect to the assembly, based on the genome of a Duroc pig, and 13 321 were heterozygous positions within Guadyerbas. Despite being highly inbred, the estimate of heterozygosity within Guadyerbas was ∼0.78 kb−1 in autosomes, after correcting for low depth. Nucleotide variability was consistently higher at the telomeric regions than on the rest of the chromosome, likely a result of increased recombination rates. Further, variability was 50% lower in the X-chromosome than in autosomes, which may be explained by a recent bottleneck or by selection. We divided the whole genome into 500 kb windows and we analyzed overrepresented gene ontology terms in regions of low and high variability. Multi organism process, pigmentation and cell killing were overrepresented in high variability regions and metabolic process ontology, within low variability regions. Further, a genome wide Hudson–Kreitman–Aguadé test was carried out per window; overall, variability was in agreement with neutral expectations. PMID:21407255

  17. Alignment of Short Reads: A Crucial Step for Application of Next-Generation Sequencing Data in Precision Medicine

    PubMed Central

    Ye, Hao; Meehan, Joe; Tong, Weida; Hong, Huixiao

    2015-01-01

    Precision medicine or personalized medicine has been proposed as a modernized and promising medical strategy. Genetic variants of patients are the key information for implementation of precision medicine. Next-generation sequencing (NGS) is an emerging technology for deciphering genetic variants. Alignment of raw reads to a reference genome is one of the key steps in NGS data analysis. Many algorithms have been developed for alignment of short read sequences since 2008. Users have to make a decision on which alignment algorithm to use in their studies. Selection of the right alignment algorithm determines not only the alignment algorithm but also the set of suitable parameters to be used by the algorithm. Understanding these algorithms helps in selecting the appropriate alignment algorithm for different applications in precision medicine. Here, we review currently available algorithms and their major strategies such as seed-and-extend and q-gram filter. We also discuss the challenges in current alignment algorithms, including alignment in multiple repeated regions, long-read alignment and alignment facilitated with known genetic variants. PMID:26610555

  18. Short reads and nonmodel species: exploring the complexities of next-generation sequence assembly and SNP discovery in the absence of a reference genome.

    PubMed

    Everett, M V; Grau, E D; Seeb, J E

    2011-03-01

    How practical is gene and SNP discovery in a nonmodel species using short read sequences? Next-generation sequencing technologies are being applied to an increasing number of species with no reference genome. For nonmodel species, the cost, availability of existing genetic resources, genome complexity and the planned method of assembly must all be considered when selecting a sequencing platform. Our goal was to examine the feasibility and optimal methodology for SNP and gene discovery in the sockeye salmon (Oncorhynchus nerka) using short read sequences. SOLiD short reads (up to 50 bp) were generated from single- and pooled-tissue transcriptome libraries from ten sockeye salmon. The individuals were from five distinct populations from the Wood River Lakes and Mendeltna Creek, Alaska. As no reference genome was available for sockeye salmon, the SOLiD sequence reads were assembled to publicly available EST reference sequences from sockeye salmon and two closely related species, rainbow trout (Oncorhynchus mykiss) and Atlantic salmon (Salmo salar). Additionally, de novo assembly of the SOLiD data was carried out, and the SOLiD reads were remapped to the de novo contigs. The results from each reference assembly were compared across all references. The number and size of contigs assembled varied with the size of the reference sequences. In silico SNP discovery was carried out on contigs from all four EST references; however, discovery of valid SNPs was most successful using one of the two conspecific references.

  19. A Family-Based Probabilistic Method for Capturing De Novo Mutations from High-Throughput Short-Read Sequencing Data

    PubMed Central

    Cartwright, Reed A.; Hussin, Julie; Keebler, Jonathan E. M.; Stone, Eric A.; Awadalla, Philip

    2013-01-01

    Recent advances in high-throughput DNA sequencing technologies and associated statistical analyses have enabled in-depth analysis of whole-genome sequences. As this technology is applied to a growing number of individual human genomes, entire families are now being sequenced. Information contained within the pedigree of a sequenced family can be leveraged when inferring the donors' genotypes. The presence of a de novo mutation within the pedigree is indicated by a violation of Mendelian inheritance laws. Here, we present a method for probabilistically inferring genotypes across a pedigree using high-throughput sequencing data and producing the posterior probability of de novo mutation at each genomic site examined. This framework can be used to disentangle the effects of germline and somatic mutational processes and to simultaneously estimate the effect of sequencing error and the initial genetic variation in the population from which the founders of the pedigree arise. This approach is examined in detail through simulations and areas for method improvement are noted. By applying this method to data from members of a well-defined nuclear family with accurate pedigree information, the stage is set to make the most direct estimates of the human mutation rate to date. PMID:22499693
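
    The signal described above, a violation of Mendelian inheritance in a parent-child trio, can be shown with a hard genotype check. This is only a sketch of the underlying rule: the method in the abstract is probabilistic and accounts for sequencing error, which this check does not. The function name and the genotype representation (two-character allele strings) are hypothetical.

```python
from itertools import product


def is_mendelian_consistent(child, mother, father):
    """Return True if the child's diploid genotype can be formed by taking
    one allele from the mother and one allele from the father.

    Genotypes are two-character strings such as "AG"; allele order is ignored.
    """
    possible = {tuple(sorted(pair)) for pair in product(mother, father)}
    return tuple(sorted(child)) in possible


if __name__ == "__main__":
    # Hypothetical called genotypes at three sites: (child, mother, father).
    trio_sites = [("AG", "AA", "GG"), ("AG", "AA", "AA"), ("CC", "CT", "CT")]
    for child, mother, father in trio_sites:
        status = "consistent" if is_mendelian_consistent(child, mother, father) else "candidate de novo"
        print(child, mother, father, status)
```

    With real data, most apparent violations come from genotyping errors rather than true mutations, which is the motivation for reporting a posterior probability instead of a yes/no call.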

  20. Using GPUs for the exact alignment of short-read genetic sequences by means of the Burrows-Wheeler transform.

    PubMed

    Salavert Torres, José; Blanquer Espert, Ignacio; Domínguez, Andrés Tomás; Hernández García, Vicente; Medina Castelló, Ignacio; Tárraga Giménez, Joaquín; Dopazo Blázquez, Joaquín

    2012-01-01

    General Purpose Graphic Processing Units (GPGPUs) constitute an inexpensive resource for computing-intensive applications that could exploit an intrinsic fine-grain parallelism. This paper presents the design and implementation in GPGPUs of an exact alignment tool for nucleotide sequences based on the Burrows-Wheeler Transform. We compare this algorithm with state-of-the-art implementations of the same algorithm over standard CPUs, and considering the same conditions in terms of I/O. Excluding disk transfers, the implementation of the algorithm in GPUs shows a speedup larger than 12, when compared to CPU execution. This implementation exploits the parallelism by concurrently searching different sequences on the same reference search tree, maximizing memory locality and ensuring a symmetric access to the data. The paper describes the behavior of the algorithm in GPU, showing a good scalability in the performance, only limited by the size of the GPU inner memory.
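
    Exact matching on the Burrows-Wheeler Transform can be sketched on the CPU with a backward search. This is a minimal illustration under simplifying assumptions, not the GPU implementation from the abstract: the transform is built by sorting all rotations, and occurrence counts are recomputed by scanning rather than read from a precomputed table. All function names are hypothetical.

```python
def bwt(text):
    """Burrows-Wheeler Transform of text, using '$' as the end sentinel.

    Built naively by sorting all rotations, which is fine for short examples.
    """
    text += "$"
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    return "".join(rotation[-1] for rotation in rotations)


def count_exact(bwt_str, pattern):
    """Count exact occurrences of pattern in the original text via backward search."""
    # first[c] = number of characters in the text that sort strictly before c.
    counts = {}
    for ch in bwt_str:
        counts[ch] = counts.get(ch, 0) + 1
    first, total = {}, 0
    for ch in sorted(counts):
        first[ch] = total
        total += counts[ch]

    def occ(ch, i):
        # Occurrences of ch in bwt_str[:i]; a real index would tabulate this.
        return bwt_str[:i].count(ch)

    lo, hi = 0, len(bwt_str)  # half-open range of matching sorted rotations
    for ch in reversed(pattern):
        if ch not in first:
            return 0
        lo = first[ch] + occ(ch, lo)
        hi = first[ch] + occ(ch, hi)
        if lo >= hi:
            return 0
    return hi - lo


if __name__ == "__main__":
    reference = "GATTACATTA"
    index = bwt(reference)
    for read in ("ATTA", "TTAC", "GG"):
        print(read, count_exact(index, read))
```

    Each read is searched independently against the same index, which is the property that lets many reads be processed concurrently.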

  1. Whole genome sequencing of environmental Vibrio cholerae O1 from 10 nanograms of DNA using short reads.

    PubMed

    Pérez Chaparro, Paula Juliana; McCulloch, John Anthony; Cerdeira, Louise Teixeira; Al-Dilaimi, Arwa; Canto de Sá, Lena Lillian; de Oliveira, Rodrigo; Tauch, Andreas; de Carvalho Azevedo, Vasco Ariston; Cruz Schneider, Maria Paula; da Silva, Artur Luiz da Costa

    2011-11-01

    Multiple Displacement Amplification (MDA) of DNA using φ29 (phi29) DNA polymerase amplifies DNA several billion-fold, which has proved to be potentially very useful for evaluating genome information in a culture-independent manner. Whole genome sequencing using DNA from a single prokaryotic genome copy amplified by MDA has not yet been achieved due to the formation of chimeras and skewed amplification of genomic regions during the MDA step, which then precludes genome assembly. We have hereby addressed the issue by using 10 ng of genomic Vibrio cholerae DNA extracted within an agarose plug to ensure circularity as a starting point for MDA and then sequencing the amplified yield using the SOLiD platform. We successfully managed to assemble the entire genome of V. cholerae strain LMA3984-4 (environmental O1 strain isolated in urban Amazonia) using a hybrid de novo assembly strategy. Using our method, only 178 out of 16,713 (1%) of contigs were not able to be inserted into either chromosome scaffold, and out of these 178, only 3 appeared to be chimeras. The other contigs seem to be the result of template-independent non-specific amplification during MDA, yielding spurious reads. Extraction of genomic DNA within an agarose plug in order to ensure circularity of the extracted genome might be key to minimizing amplification bias by MDA for WGS.

  2. Comprehensive definition of genome features in Spirodela polyrhiza by high-depth physical mapping and short-read DNA sequencing strategies.

    PubMed

    Michael, Todd P; Bryant, Douglas; Gutierrez, Ryan; Borisjuk, Nikolai; Chu, Philomena; Zhang, Hanzhong; Xia, Jing; Zhou, Junfei; Peng, Hai; El Baidouri, Moaine; Ten Hallers, Boudewijn; Hastie, Alex R; Liang, Tiffany; Acosta, Kenneth; Gilbert, Sarah; McEntee, Connor; Jackson, Scott A; Mockler, Todd C; Zhang, Weixiong; Lam, Eric

    2017-02-01

    Spirodela polyrhiza is a fast-growing aquatic monocot with highly reduced morphology, genome size and number of protein-coding genes. Considering these biological features of Spirodela and its basal position in the monocot lineage, understanding its genome architecture could shed light on plant adaptation and genome evolution. Like many draft genomes, however, the 158-Mb Spirodela genome sequence has not been resolved to chromosomes, and important genome characteristics have not been defined. Here we deployed rapid genome-wide physical maps combined with high-coverage short-read sequencing to resolve the 20 chromosomes of Spirodela and to empirically delineate its genome features. Our data revealed a dramatic reduction in the number of the rDNA repeat units in Spirodela to fewer than 100, which is even fewer than that reported for yeast. Consistent with its unique phylogenetic position, small RNA sequencing revealed 29 Spirodela-specific microRNA, with only two being shared with Elaeis guineensis (oil palm) and Musa balbisiana (banana). Combining DNA methylation data and small RNA sequencing enabled the accurate prediction of 20.5% long terminal repeats (LTRs) that doubled the previous estimate, and revealed a high Solo:Intact LTR ratio of 8.2. Interestingly, we found that Spirodela has the lowest global DNA methylation levels (9%) of any plant species tested. Taken together our results reveal a genome that has undergone reduction, likely through eliminating non-essential protein coding genes, rDNA and LTRs. In addition to delineating the genome features of this unique plant, the methodologies described and large-scale genome resources from this work will enable future evolutionary and functional studies of this basal monocot family.

  3. Deep sequencing of evolving pathogen populations: applications, errors, and bioinformatic solutions

    PubMed Central

    2014-01-01

    Deep sequencing harnesses the high throughput nature of next generation sequencing technologies to generate population samples, treating information contained in individual reads as meaningful. Here, we review applications of deep sequencing to pathogen evolution. Pioneering deep sequencing studies from the virology literature are discussed, such as whole genome Roche-454 sequencing analyses of the dynamics of the rapidly mutating pathogens hepatitis C virus and HIV. Extension of the deep sequencing approach to bacterial populations is then discussed, including the impacts of emerging sequencing technologies. While it is clear that deep sequencing has unprecedented potential for assessing the genetic structure and evolutionary history of pathogen populations, bioinformatic challenges remain. We summarise current approaches to overcoming these challenges, in particular methods for detecting low frequency variants in the context of sequencing error and reconstructing individual haplotypes from short reads. PMID:24428920

  4. Making sense of deep sequencing.

    PubMed

    Goldman, D; Domschke, K

    2014-10-01

    This review, the first of an occasional series, tries to make sense of the concepts and uses of deep sequencing of polynucleic acids (DNA and RNA). Deep sequencing, synonymous with next-generation sequencing, high-throughput sequencing and massively parallel sequencing, includes whole genome sequencing but is more often and diversely applied to specific parts of the genome captured in different ways, for example the highly expressed portion of the genome known as the exome and portions of the genome that are epigenetically marked either by DNA methylation, the binding of proteins including histones, or that are in different configurations and thus more or less accessible to enzymes that cleave DNA. Deep sequencing of RNA (RNASeq) reverse-transcribed to complementary DNA is invaluable for measuring RNA expression and detecting changes in RNA structure. Important concepts in deep sequencing include the length and depth of sequence reads, mapping and assembly of reads, sequencing error, haplotypes, and the propensity of deep sequencing, as with other types of 'big data', to generate large numbers of errors, requiring monitoring for methodologic biases and strategies for replication and validation. Deep sequencing yields a unique genetic fingerprint that can be used to identify a person, and a trove of predictors of genetic medical diseases. Deep sequencing to identify epigenetic events including changes in DNA methylation and RNA expression can reveal the history and impact of environmental exposures. Because of the power of sequencing to identify and deliver biomedically significant information about a person and their blood relatives, it creates ethical dilemmas and practical challenges in research and clinical care, for example the decision and procedures to report incidental findings that will increasingly and frequently be discovered.

  5. Short Read Mapping: An Algorithmic Tour.

    PubMed

    Canzar, Stefan; Salzberg, Steven L

    2017-03-01

    Ultra-high-throughput next-generation sequencing (NGS) technology allows us to determine the sequence of nucleotides of many millions of DNA molecules in parallel. Accompanied by a dramatic reduction in cost since its introduction in 2004, NGS technology has provided a new way of addressing a wide range of biological and biomedical questions, from the study of human genetic disease to the analysis of gene expression, protein-DNA interactions, and patterns of DNA methylation. The data generated by NGS instruments comprise huge numbers of very short DNA sequences, or 'reads', that carry little information by themselves. These reads therefore have to be pieced together by well-engineered algorithms to reconstruct biologically meaningful measurements, such as the level of expression of a gene. To solve this complex, high-dimensional puzzle, reads must be mapped back to a reference genome to determine their origin. Due to sequencing errors and to genuine differences between the reference genome and the individual being sequenced, this mapping process must be tolerant of mismatches, insertions, and deletions. Although optimal alignment algorithms to solve this problem have long been available, the practical requirements of aligning hundreds of millions of short reads to the 3 billion base pair long human genome have stimulated the development of new, more efficient methods, which today are used routinely throughout the world for the analysis of NGS data.

  6. Droplet barcoding for massively parallel single-molecule deep sequencing

    PubMed Central

    Lan, Freeman; Haliburton, John R.; Yuan, Aaron; Abate, Adam R.

    2016-01-01

    The ability to accurately sequence long DNA molecules is important across biology, but existing sequencers are limited in read length and accuracy. Here, we demonstrate a method to leverage short-read sequencing to obtain long and accurate reads. Using droplet microfluidics, we isolate, amplify, fragment and barcode single DNA molecules in aqueous picolitre droplets, allowing the full-length molecules to be sequenced with multi-fold coverage using short-read sequencing. We show that this approach can provide accurate sequences of up to 10 kb, allowing us to identify rare mutations below the detection limit of conventional sequencing and directly link them into haplotypes. This barcoding methodology can be a powerful tool in sequencing heterogeneous populations such as viruses. PMID:27353563

  7. Leveraging FPGAs for Accelerating Short Read Alignment.

    PubMed

    Arram, James; Kaplan, Thomas; Luk, Wayne; Jiang, Peiyong

    2016-02-29

    One of the key challenges facing genomics today is how to efficiently analyse the massive amounts of data produced by next-generation sequencing platforms. With general-purpose computing systems struggling to address this challenge, specialised processors such as the Field-Programmable Gate Array (FPGA) are receiving growing interest. The means by which to leverage this technology for accelerating genomic data analysis is however largely unexplored. In this paper we present a runtime reconfigurable architecture for accelerating short read alignment using FPGAs. This architecture exploits the reconfigurability of FPGAs to allow the development of fast yet flexible alignment designs. We apply this architecture to develop an alignment design which supports exact and approximate alignment with up to 2 mismatches. Our design is based on the FM-index, with optimisations to improve the alignment performance. In particular, the n-step FM-index, index oversampling, a seed-and-compare stage, and bi-directional backtracking are included. Our design is implemented and evaluated on a 1U Maxeler MPC-X2000 dataflow node with 8 Altera Stratix-V FPGAs. Measurements show that our design is 28 times faster than Bowtie2 running with 16 threads on dual Intel Xeon E5-2640 CPUs, and 9 times faster than Soap3-dp running on an NVIDIA Tesla C2070 GPU.
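
    For readers unfamiliar with the FM-index, a minimal software sketch of exact-match backward search follows. It is not the authors' FPGA design: it omits the n-step index, oversampling, the seed-and-compare stage and mismatch backtracking, and it builds the suffix array naively. The function names are illustrative assumptions.

```python
def build_fm_index(text):
    """Build a toy FM-index (suffix array, C table, Occ table) for text."""
    text += "$"  # sentinel, sorts before A, C, G, T
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    bwt = "".join(text[i - 1] for i in sa)  # text[-1] is "$" when i == 0
    alphabet = sorted(set(text))
    # C[c]: number of characters in text strictly smaller than c
    C, total = {}, 0
    for c in alphabet:
        C[c] = total
        total += text.count(c)
    # occ[c][i]: occurrences of c in bwt[:i]
    occ = {c: [0] * (len(bwt) + 1) for c in alphabet}
    for i, ch in enumerate(bwt):
        for c in alphabet:
            occ[c][i + 1] = occ[c][i] + (1 if ch == c else 0)
    return sa, C, occ


def backward_search(pattern, sa, C, occ):
    """Return sorted start positions of exact matches of pattern."""
    lo, hi = 0, len(sa)
    for ch in reversed(pattern):
        if ch not in C:
            return []
        lo = C[ch] + occ[ch][lo]
        hi = C[ch] + occ[ch][hi]
        if lo >= hi:
            return []
    return sorted(sa[lo:hi])


sa, C, occ = build_fm_index("ACGTACGT")
print(backward_search("ACG", sa, C, occ))  # [0, 4]
```

    Each pattern character narrows a suffix-array interval with two table lookups, which is why the search cost depends on read length rather than reference length.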

  8. Qualitative De Novo Analysis of Full Length cDNA and Quantitative Analysis of Gene Expression for Common Marmoset (Callithrix jacchus) Transcriptomes Using Parallel Long-Read Technology and Short-Read Sequencing

    PubMed Central

    Uno, Yasuhiro; Uehara, Shotaro; Inoue, Takashi; Murayama, Norie; Onodera, Jun; Sasaki, Erika; Yamazaki, Hiroshi

    2014-01-01

    The common marmoset (Callithrix jacchus) is a non-human primate that could prove useful as human pharmacokinetic and biomedical research models. The cytochromes P450 (P450s) are a superfamily of enzymes that have critical roles in drug metabolism and disposition via monooxygenation of a broad range of xenobiotics; however, information on some marmoset P450s is currently limited. Therefore, identification and quantitative analysis of tissue-specific mRNA transcripts, including those of P450s and flavin-containing monooxygenases (FMO, another monooxygenase family), need to be carried out in detail before the marmoset can be used as an animal model in drug development. De novo assembly and expression analysis of marmoset transcripts were conducted with pooled liver, intestine, kidney, and brain samples from three male and three female marmosets. After unique sequences were automatically aligned by assembling software, the mean contig length was 718 bp (with a standard deviation of 457 bp) among a total of 47,883 transcripts. Approximately 30% of the total transcripts were matched to known marmoset sequences. Gene expression in 18 marmoset P450- and 4 FMO-like genes displayed some tissue-specific patterns. Of these, the three most highly expressed in marmoset liver were P450 2D-, 2E-, and 3A-like genes. In extrahepatic tissues, including brain, gene expressions of these monooxygenases were lower than those in liver, although P450 3A4 (previously P450 3A21) in intestine and P450 4A11- and FMO1-like genes in kidney were relatively highly expressed. By means of massive parallel long-read sequencing and short-read technology applied to marmoset liver, intestine, kidney, and brain, the combined next-generation sequencing analyses reported here were able to identify novel marmoset drug-metabolizing P450 transcripts that have until now been little reported. These results provide a foundation for mechanistic studies and pave the way for the use of marmosets as model animals

  9. Qualitative de novo analysis of full length cDNA and quantitative analysis of gene expression for common marmoset (Callithrix jacchus) transcriptomes using parallel long-read technology and short-read sequencing.

    PubMed

    Shimizu, Makiko; Iwano, Shunsuke; Uno, Yasuhiro; Uehara, Shotaro; Inoue, Takashi; Murayama, Norie; Onodera, Jun; Sasaki, Erika; Yamazaki, Hiroshi

    2014-01-01

    The common marmoset (Callithrix jacchus) is a non-human primate that could prove useful as human pharmacokinetic and biomedical research models. The cytochromes P450 (P450s) are a superfamily of enzymes that have critical roles in drug metabolism and disposition via monooxygenation of a broad range of xenobiotics; however, information on some marmoset P450s is currently limited. Therefore, identification and quantitative analysis of tissue-specific mRNA transcripts, including those of P450s and flavin-containing monooxygenases (FMO, another monooxygenase family), need to be carried out in detail before the marmoset can be used as an animal model in drug development. De novo assembly and expression analysis of marmoset transcripts were conducted with pooled liver, intestine, kidney, and brain samples from three male and three female marmosets. After unique sequences were automatically aligned by assembling software, the mean contig length was 718 bp (with a standard deviation of 457 bp) among a total of 47,883 transcripts. Approximately 30% of the total transcripts were matched to known marmoset sequences. Gene expression in 18 marmoset P450- and 4 FMO-like genes displayed some tissue-specific patterns. Of these, the three most highly expressed in marmoset liver were P450 2D-, 2E-, and 3A-like genes. In extrahepatic tissues, including brain, gene expressions of these monooxygenases were lower than those in liver, although P450 3A4 (previously P450 3A21) in intestine and P450 4A11- and FMO1-like genes in kidney were relatively highly expressed. By means of massive parallel long-read sequencing and short-read technology applied to marmoset liver, intestine, kidney, and brain, the combined next-generation sequencing analyses reported here were able to identify novel marmoset drug-metabolizing P450 transcripts that have until now been little reported. These results provide a foundation for mechanistic studies and pave the way for the use of marmosets as model animals

  10. De novo peptide sequencing by deep learning.

    PubMed

    Tran, Ngoc Hieu; Zhang, Xianglilan; Xin, Lei; Shan, Baozhen; Li, Ming

    2017-07-18

    De novo peptide sequencing from tandem MS data is the key technology in proteomics for the characterization of proteins, especially for new sequences, such as mAbs. In this study, we propose a deep neural network model, DeepNovo, for de novo peptide sequencing. DeepNovo architecture combines recent advances in convolutional neural networks and recurrent neural networks to learn features of tandem mass spectra, fragment ions, and sequence patterns of peptides. The networks are further integrated with local dynamic programming to solve the complex optimization task of de novo sequencing. We evaluated the method on a wide variety of species and found that DeepNovo considerably outperformed state of the art methods, achieving 7.7-22.9% higher accuracy at the amino acid level and 38.1-64.0% higher accuracy at the peptide level. We further used DeepNovo to automatically reconstruct the complete sequences of antibody light and heavy chains of mouse, achieving 97.5-100% coverage and 97.2-99.5% accuracy, without assisting databases. Moreover, DeepNovo is retrainable to adapt to any sources of data and provides a complete end-to-end training and prediction solution to the de novo sequencing problem. Not only does our study extend the deep learning revolution to a new field, but it also shows an innovative approach in solving optimization problems by using deep learning and dynamic programming.

  11. De novo peptide sequencing by deep learning

    PubMed Central

    Tran, Ngoc Hieu; Zhang, Xianglilan; Xin, Lei; Shan, Baozhen; Li, Ming

    2017-01-01

    De novo peptide sequencing from tandem MS data is the key technology in proteomics for the characterization of proteins, especially for new sequences, such as mAbs. In this study, we propose a deep neural network model, DeepNovo, for de novo peptide sequencing. DeepNovo architecture combines recent advances in convolutional neural networks and recurrent neural networks to learn features of tandem mass spectra, fragment ions, and sequence patterns of peptides. The networks are further integrated with local dynamic programming to solve the complex optimization task of de novo sequencing. We evaluated the method on a wide variety of species and found that DeepNovo considerably outperformed state of the art methods, achieving 7.7–22.9% higher accuracy at the amino acid level and 38.1–64.0% higher accuracy at the peptide level. We further used DeepNovo to automatically reconstruct the complete sequences of antibody light and heavy chains of mouse, achieving 97.5–100% coverage and 97.2–99.5% accuracy, without assisting databases. Moreover, DeepNovo is retrainable to adapt to any sources of data and provides a complete end-to-end training and prediction solution to the de novo sequencing problem. Not only does our study extend the deep learning revolution to a new field, but it also shows an innovative approach in solving optimization problems by using deep learning and dynamic programming. PMID:28720701

  12. A hybrid short read mapping accelerator

    PubMed Central

    2013-01-01

    Background: The rapid growth of short read datasets poses a new challenge to the short read mapping problem in terms of sensitivity and execution speed. Existing methods often use a restrictive error model for computing the alignments to improve speed, whereas more flexible error models are generally too slow for large-scale applications. A number of short read mapping software tools have been proposed. However, designs based on hardware are relatively rare. Field programmable gate arrays (FPGAs) have been successfully used in a number of specific application areas, such as the DSP and communications domains due to their outstanding parallel data processing capabilities, making them a competitive platform to solve problems that are “inherently parallel”. Results: We present a hybrid system for short read mapping utilizing both FPGA-based hardware and CPU-based software. The computation intensive alignment and the seed generation operations are mapped onto an FPGA. We present a computationally efficient, parallel block-wise alignment structure (Align Core) to approximate the conventional dynamic programming algorithm. The performance is compared to the multi-threaded CPU-based GASSST and BWA software implementations. For single-end alignment, our hybrid system achieves faster processing speed than GASSST (with a similar sensitivity) and BWA (with a higher sensitivity); for pair-end alignment, our design achieves a slightly worse sensitivity than that of BWA but has a higher processing speed. Conclusions: This paper shows that our hybrid system can effectively accelerate the mapping of short reads to a reference genome based on the seed-and-extend approach. The performance comparison to the GASSST and BWA software implementations under different conditions shows that our hybrid design achieves a high degree of sensitivity and requires less overall execution time with only modest FPGA resource utilization. Our hybrid system design also shows that the performance

  13. SlideSort: all pairs similarity search for short reads

    PubMed Central

    Shimizu, Kana; Tsuda, Koji

    2011-01-01

    Motivation: Recent progress in DNA sequencing technologies calls for fast and accurate algorithms that can evaluate sequence similarity for a huge amount of short reads. Searching similar pairs from a string pool is a fundamental process of de novo genome assembly, genome-wide alignment and other important analyses. Results: In this study, we designed and implemented an exact algorithm SlideSort that finds all similar pairs from a string pool in terms of edit distance. Using an efficient pattern growth algorithm, SlideSort discovers chains of common k-mers to narrow down the search. Compared to existing methods based on single k-mers, our method is more effective in reducing the number of edit distance calculations. In comparison to backtracking methods such as BWA, our method is much faster in finding remote matches, scaling easily to tens of millions of sequences. Our software has an additional function of single link clustering, which is useful in summarizing short reads for further processing. Availability: Executable binary files and C++ libraries are available at http://www.cbrc.jp/~shimizu/slidesort/ for Linux and Windows. Contact: slidesort@m.aist.go.jp; shimizu-kana@aist.go.jp Supplementary information: Supplementary data are available at Bioinformatics online. PMID:21148542
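
    The all-pairs task itself can be sketched with a much simpler baseline than SlideSort: a pigeonhole block filter followed by an exact edit-distance check. This is explicitly not the chained k-mer pattern-growth algorithm described above, and the function names are assumptions. The filter is lossless because d edits can disturb at most d of the d + 1 blocks of a read, so at least one block must occur unchanged in any partner within distance d.

```python
from itertools import combinations


def edit_distance(a, b):
    """Levenshtein distance by row-wise dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


def split_blocks(s, n):
    """Split s into n contiguous blocks of near-equal length."""
    size, rem = divmod(len(s), n)
    blocks, start = [], 0
    for i in range(n):
        end = start + size + (1 if i < rem else 0)
        blocks.append(s[start:end])
        start = end
    return blocks


def similar_pairs(reads, max_dist):
    """Return (i, j, distance) for all pairs within max_dist edits."""
    hits = []
    for i, j in combinations(range(len(reads)), 2):
        a, b = reads[i], reads[j]
        # Pigeonhole filter: skip pairs sharing no intact block.
        if not any(block in b for block in split_blocks(a, max_dist + 1)):
            continue
        d = edit_distance(a, b)
        if d <= max_dist:
            hits.append((i, j, d))
    return hits


reads = ["ACGTACGT", "ACGTACGA", "TTTTTTTT", "ACGTCGT"]
print(similar_pairs(reads, 1))  # [(0, 1, 1), (0, 3, 1)]
```

    This baseline still enumerates every pair, so it does not scale to tens of millions of reads; it only shows what is being computed.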

  14. Quantitative phenotyping via deep barcode sequencing.

    PubMed

    Smith, Andrew M; Heisler, Lawrence E; Mellor, Joseph; Kaper, Fiona; Thompson, Michael J; Chee, Mark; Roth, Frederick P; Giaever, Guri; Nislow, Corey

    2009-10-01

    Next-generation DNA sequencing technologies have revolutionized diverse genomics applications, including de novo genome sequencing, SNP detection, chromatin immunoprecipitation, and transcriptome analysis. Here we apply deep sequencing to genome-scale fitness profiling to evaluate yeast strain collections in parallel. This method, Barcode analysis by Sequencing, or "Bar-seq," outperforms the current benchmark barcode microarray assay in terms of both dynamic range and throughput. When applied to a complex chemogenomic assay, Bar-seq quantitatively identifies drug targets, with performance superior to the benchmark microarray assay. We also show that Bar-seq is well-suited for a multiplex format. We completely re-sequenced and re-annotated the yeast deletion collection using deep sequencing, found that approximately 20% of the barcodes and common priming sequences varied from expectation, and used this revised list of barcode sequences to improve data quality. Together, this new assay and analysis routine provide a deep-sequencing-based toolkit for identifying gene-environment interactions on a genome-wide scale.

  15. From deep sequencing to actual clones.

    PubMed

    D'Angelo, Sara; Kumar, Sandeep; Naranjo, Leslie; Ferrara, Fortunato; Kiss, Csaba; Bradbury, Andrew R M

    2014-10-01

    The application of deep sequencing to in vitro display technologies has been invaluable for the straightforward analysis of enriched clones. After sequencing in vitro selected populations, clones are binned into identical or similar groups and ordered by abundance, allowing identification of those that are most enriched. However, the greatest strength of deep sequencing is also its greatest weakness: clones are easily identified by their DNA sequences, but are not physically available for testing without a laborious multistep process involving several rounds of polymerase chain reaction (PCR), assembly and cloning. Here, using the isolation of antibody genes from a phage and yeast display selection as an example, we show the power of a rapid and simple inverse PCR-based method to easily isolate clones identified by deep sequencing. Once primers have been received, clone isolation can be carried out in a single day, rather than two days. Furthermore the reduced number of PCRs required will reduce PCR mutations correspondingly. We have observed a 100% success rate in amplifying clones with an abundance as low as 0.5% in a polyclonal population. This approach allows us to obtain full-length clones even when an incomplete sequence is available, and greatly simplifies the subcloning process. Moreover, rarer, but functional clones missed by traditional screening can be easily isolated using this method, and the approach can be extended to any selected library (scFv, cDNA, libraries based on scaffold proteins) where a unique sequence signature for the desired clones of interest is available.
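
    The binning-and-ranking step mentioned at the start of this abstract can be sketched in a few lines. The sketch groups identical sequences only (grouping of similar sequences is not attempted), and the function name and toy reads are assumptions for illustration.

```python
from collections import Counter


def rank_clones(sequences):
    """Bin identical sequences and order the bins by abundance.

    Returns (sequence, read_count, fraction_of_reads) tuples,
    most abundant first.
    """
    counts = Counter(sequences)
    total = sum(counts.values())
    return [(seq, n, n / total) for seq, n in counts.most_common()]


reads = ["AAA"] * 5 + ["CCC"] * 3 + ["GGG"] * 2
for seq, n, frac in rank_clones(reads):
    print(seq, n, f"{frac:.0%}")
```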

  16. A Bayesian Assignment Method for Ambiguous Bisulfite Short Reads.

    PubMed

    Tran, Hong; Wu, Xiaowei; Tithi, Saima; Sun, Ming-an; Xie, Hehuang; Zhang, Liqing

    2016-01-01

    DNA methylation is an epigenetic modification critical for normal development and diseases. The determination of genome-wide DNA methylation at single-nucleotide resolution is made possible by sequencing bisulfite treated DNA with next generation high-throughput sequencing. However, aligning bisulfite short reads to a reference genome remains challenging as only a limited proportion of them (around 50-70%) can be aligned uniquely; a significant proportion, known as multireads, are mapped to multiple locations and thus discarded from downstream analyses, causing financial waste and biased methylation inference. To address this issue, we develop a Bayesian model that assigns multireads to their most likely locations based on the posterior probability derived from information hidden in uniquely aligned reads. Analyses of both simulated data and real hairpin bisulfite sequencing data show that our method can effectively assign approximately 70% of the multireads to their best locations with up to 90% accuracy, leading to a significant increase in the overall mapping efficiency. Moreover, the assignment model shows robust performance with low coverage depth, making it particularly attractive considering the prohibitive cost of bisulfite sequencing. Additionally, results show that longer reads help improve the performance of the assignment model. The assignment model is also robust to varying degrees of methylation and varying sequencing error rates. Finally, incorporating prior knowledge on mutation rate and context specific methylation level into the assignment model increases inference accuracy. The assignment model is implemented in the BAM-ABS package and freely available at https://github.com/zhanglabvt/BAM_ABS.
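
    The idea of assigning a multiread by posterior probability can be illustrated with a deliberately simplified model. This is not the BAM-ABS model: the prior (coverage by uniquely aligned reads with add-one smoothing), the likelihood (independent mismatches at a fixed error rate) and all names are assumptions chosen for illustration.

```python
def assign_multiread(candidates, read_length, error_rate=0.01):
    """Pick the most probable origin of a multiread.

    candidates: list of (location, unique_coverage, mismatches), where
    unique_coverage is the number of uniquely aligned reads at that
    location and mismatches is the multiread's mismatch count there.
    Returns (best_location, posterior_by_location).
    """
    weights = {}
    for location, unique_coverage, mismatches in candidates:
        prior = unique_coverage + 1  # add-one smoothing (assumption)
        likelihood = (error_rate ** mismatches
                      * (1 - error_rate) ** (read_length - mismatches))
        weights[location] = prior * likelihood
    total = sum(weights.values())
    posterior = {loc: w / total for loc, w in weights.items()}
    best = max(posterior, key=posterior.get)
    return best, posterior


best, posterior = assign_multiread(
    [("chr1:100", 30, 1), ("chr2:500", 10, 0)], read_length=50)
print(best, posterior)
```

    In this toy case the perfect match at the lower-coverage site outweighs the one-mismatch hit at the higher-coverage site, because a single mismatch costs roughly a factor of 100 at a 1% error rate.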

  17. Deep Whole-Genome Sequencing to Detect Mixed Infection of Mycobacterium tuberculosis

    PubMed Central

    Gan, Mingyu; Liu, Qingyun; Yang, Chongguang; Gao, Qian; Luo, Tao

    2016-01-01

    Mixed infection by multiple Mycobacterium tuberculosis (MTB) strains is associated with poor treatment outcome of tuberculosis (TB). Traditional genotyping methods have been used to detect mixed infections of MTB; however, their sensitivity and resolution are limited. Deep whole-genome sequencing (WGS) has been proved highly sensitive and discriminative for studying population heterogeneity of MTB. Here, we developed a phylogenetic-based method to detect MTB mixed infections using WGS data. We collected published WGS data of 782 global MTB strains from public databases. We called homogeneous and heterogeneous single nucleotide variations (SNVs) of individual strains by mapping short reads to the ancestral MTB reference genome. We constructed a phylogenomic database based on 68,639 homogeneous SNVs of 652 MTB strains. Mixed infections were determined if multiple evolutionary paths were identified by mapping the SNVs of individual samples to the phylogenomic database. By simulation, our method could specifically detect mixed infections when the sequencing depth of minor strains was as low as 1× coverage, and when the genomic distance of two mixed strains was as small as 16 SNVs. By applying our methods to all 782 samples, we detected 47 mixed infections and 45 of them were caused by locally endemic strains. The results indicate that our method is highly sensitive and discriminative for identifying mixed infections from deep WGS data of MTB isolates. PMID:27391214

  18. SAMMate: a GUI tool for processing short read alignments in SAM/BAM format

    PubMed Central

    2011-01-01

    Background: Next Generation Sequencing (NGS) technology generates tens of millions of short reads for each DNA/RNA sample. A key step in NGS data analysis is the short read alignment of the generated sequences to a reference genome. Although storing alignment information in the Sequence Alignment/Map (SAM) or Binary SAM (BAM) format is now standard, biomedical researchers still have difficulty accessing this information. Results: We have developed a Graphical User Interface (GUI) software tool named SAMMate. SAMMate allows biomedical researchers to quickly process SAM/BAM files and is compatible with both single-end and paired-end sequencing technologies. SAMMate also automates some standard procedures in DNA-seq and RNA-seq data analysis. Using either standard or customized annotation files, SAMMate allows users to accurately calculate the short read coverage of genomic intervals. In particular, for RNA-seq data SAMMate can accurately calculate the gene expression abundance scores for customized genomic intervals using short reads originating from both exons and exon-exon junctions. Furthermore, SAMMate can quickly calculate a whole-genome signal map at base-wise resolution allowing researchers to solve an array of bioinformatics problems. Finally, SAMMate can export both a wiggle file for alignment visualization in the UCSC genome browser and an alignment statistics report. The biological impact of these features is demonstrated via several case studies that predict miRNA targets using short read alignment information files. Conclusions: With just a few mouse clicks, SAMMate will provide biomedical researchers easy access to important alignment information stored in SAM/BAM files. Our software is constantly updated and will greatly facilitate the downstream analysis of NGS data. Both the source code and the GUI executable are freely available under the GNU General Public License at http://sammate.sourceforge.net. PMID:21232146

  19. CRISPR Detection From Short Reads Using Partial Overlap Graphs.

    PubMed

    Ben-Bassat, Ilan; Chor, Benny

    2016-06-01

    Clustered regularly interspaced short palindromic repeats (CRISPR) are structured regions in bacterial and archaeal genomes, which are part of an adaptive immune system against phages. CRISPRs are important for many microbial studies and are playing an essential role in current gene editing techniques. As such, they attract substantial research interest. The exponential growth in the amount of bacterial sequence data in recent years enables the exploration of CRISPR loci in more and more species. Most of the automated tools that detect CRISPR loci rely on fully assembled genomes. However, many assemblers do not handle repetitive regions successfully. The first tool to work directly on raw sequence data is Crass, which requires reads that are long enough to contain two copies of the same repeat. We present a method to identify CRISPR repeats from raw sequence data of short reads. The algorithm is based on an observation differentiating CRISPR repeats from other types of repeats, and it involves a series of partial constructions of the overlap graph. This enables us to avoid many of the difficulties that assemblers face, as we merely aim to identify the repeats that belong to CRISPR loci. A preliminary implementation of the algorithm shows good results and detects CRISPR repeats in cases where other existing tools fail to do so.

  20. Non-referenced genome assembly from epigenomic short-read data.

    PubMed

    Kaspi, Antony; Ziemann, Mark; Keating, Samuel T; Khurana, Ishant; Connor, Timothy; Spolding, Briana; Cooper, Adrian; Lazarus, Ross; Walder, Ken; Zimmet, Paul; El-Osta, Assam

    2014-10-01

    Current computational methods used to analyze changes in DNA methylation and chromatin modification rely on sequenced genomes. Here we describe a pipeline for the detection of these changes from short-read sequence data that does not require a reference genome. Open source software packages were used for sequence assembly, alignment, and measurement of differential enrichment. The method was evaluated by comparing results with reference-based results showing a strong correlation between chromatin modification and gene expression. We then used our de novo sequence assembly to build the DNA methylation profile for the non-referenced Psammomys obesus genome. The pipeline described uses open source software for fast annotation and visualization of unreferenced genomic regions from short-read data.

  1. Optimization of de novo short read assembly of seabuckthorn (Hippophae rhamnoides L.) transcriptome.

    PubMed

    Ghangal, Rajesh; Chaudhary, Saurabh; Jain, Mukesh; Purty, Ram Singh; Chand Sharma, Prakash

    2013-01-01

    Seabuckthorn (Hippophae rhamnoides L.) is known for its medicinal, nutritional and environmental importance since ancient times. However, very limited efforts have been made to characterize the genome and transcriptome of this wonder plant. Here, we report the use of next generation massive parallel sequencing technology (Illumina platform) and de novo assembly to gain a comprehensive view of the seabuckthorn transcriptome. We assembled 86,253,874 high quality short reads using six assembly tools. In our hands, assembly of non-redundant short reads following a two-step procedure was found to be the best considering various assembly quality parameters. Initially, ABySS tool was used following an additive k-mer approach. The assembled transcripts were subsequently subjected to TGICL suite. Finally, de novo short read assembly yielded 88,297 transcripts (> 100 bp), representing about 53 Mb of seabuckthorn transcriptome. The average length of transcripts was 610 bp, N50 length 1198 bp and 91% of the short reads uniquely mapped back to seabuckthorn transcriptome. A total of 41,340 (46.8%) transcripts showed significant similarity with sequences present in nr protein databases of NCBI (E-value < 1E-06). We also screened the assembled transcripts for the presence of transcription factors and simple sequence repeats. Our strategy involving the use of short read assembler (ABySS) followed by TGICL will be useful for the researchers working with a non-model organism's transcriptome in terms of saving time and reducing complexity in data management. The seabuckthorn transcriptome data generated here provide a valuable resource for gene discovery and development of functional molecular markers.
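
    The N50 statistic quoted above is the length L such that sequences of length L or longer contain at least half of the total assembled bases. A minimal sketch of the calculation follows; the function name and the toy lengths are illustrative, not data from the study.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths."""
    if not lengths:
        raise ValueError("n50 requires at least one length")
    total = sum(lengths)
    running = 0
    # Walk from the longest sequence down until half the bases are covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length


print(n50([2, 3, 4, 5, 6]))  # 5
```

    Because it is weighted by length, N50 (1198 bp here) is usually larger than the mean transcript length (610 bp here) when many short transcripts are present.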

  2. Optimization of De Novo Short Read Assembly of Seabuckthorn (Hippophae rhamnoides L.) Transcriptome

    PubMed Central

    Ghangal, Rajesh; Chaudhary, Saurabh; Jain, Mukesh; Purty, Ram Singh; Chand Sharma, Prakash

    2013-01-01

    Seabuckthorn (Hippophae rhamnoides L.) is known for its medicinal, nutritional and environmental importance since ancient times. However, very limited efforts have been made to characterize the genome and transcriptome of this wonder plant. Here, we report the use of next generation massive parallel sequencing technology (Illumina platform) and de novo assembly to gain a comprehensive view of the seabuckthorn transcriptome. We assembled 86,253,874 high quality short reads using six assembly tools. In our hands, assembly of non-redundant short reads following a two-step procedure was found to be the best considering various assembly quality parameters. Initially, ABySS tool was used following an additive k-mer approach. The assembled transcripts were subsequently subjected to TGICL suite. Finally, de novo short read assembly yielded 88,297 transcripts (> 100 bp), representing about 53 Mb of seabuckthorn transcriptome. The average length of transcripts was 610 bp, N50 length 1198 bp and 91% of the short reads uniquely mapped back to seabuckthorn transcriptome. A total of 41,340 (46.8%) transcripts showed significant similarity with sequences present in nr protein databases of NCBI (E-value < 1E-06). We also screened the assembled transcripts for the presence of transcription factors and simple sequence repeats. Our strategy involving the use of short read assembler (ABySS) followed by TGICL will be useful for the researchers working with a non-model organism’s transcriptome in terms of saving time and reducing complexity in data management. The seabuckthorn transcriptome data generated here provide a valuable resource for gene discovery and development of functional molecular markers. PMID:23991119

  3. Deep Ion Torrent sequencing identifies soil fungal community shifts after frequent prescribed fires in a southeastern US forest ecosystem.

    PubMed

    Brown, Shawn P; Callaham, Mac A; Oliver, Alena K; Jumpponen, Ari

    2013-12-01

    Prescribed burning is a common management tool to control fuel loads and ground vegetation and to facilitate desirable game species. We evaluated soil fungal community responses to long-term prescribed fire treatments in a loblolly pine forest on the Piedmont of Georgia and utilized deep Internal Transcribed Spacer Region 1 (ITS1) amplicon sequencing afforded by the recent Ion Torrent Personal Genome Machine (PGM). These deep sequence data (19,000+ reads per sample after subsampling) indicate that frequent fires (3-year fire interval) shift soil fungus communities, whereas infrequent fires (6-year fire interval) permit system resetting to a state similar to that without prescribed fire. Furthermore, in nonmetric multidimensional scaling analyses, primarily ectomycorrhizal taxa were correlated with axes associated with long fire intervals, whereas soil saprobes tended to be correlated with the frequent fire recurrence. We conclude that (1) multiplexed Ion Torrent PGM analyses allow deep, cost-effective sequencing of fungal communities but may suffer from short read lengths and inconsistent sequence quality adjacent to the sequencing adaptor; (2) frequent prescribed fires elicit a shift in soil fungal communities; and (3) such shifts do not occur when fire intervals are longer. Our results emphasize the general responsiveness of these forests to management, and the importance of fire return intervals in meeting management objectives. © 2013 Federation of European Microbiological Societies. Published by John Wiley & Sons Ltd. All rights reserved.

  4. Assembly complexity of prokaryotic genomes using short reads

    PubMed Central

    2010-01-01

    Background De Bruijn graphs are a theoretical framework underlying several modern genome assembly programs, especially those that deal with very short reads. We describe an application of de Bruijn graphs to analyze the global repeat structure of prokaryotic genomes. Results We provide the first survey of the repeat structure of a large number of genomes. The analysis gives an upper-bound on the performance of genome assemblers for de novo reconstruction of genomes across a wide range of read lengths. Further, we demonstrate that the majority of genes in prokaryotic genomes can be reconstructed uniquely using very short reads even if the genomes themselves cannot. The non-reconstructible genes are overwhelmingly related to mobile elements (transposons, IS elements, and prophages). Conclusions Our results improve upon previous studies on the feasibility of assembly with short reads and provide a comprehensive benchmark against which to compare the performance of the short-read assemblers currently being developed. PMID:20064276
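
    To make the de Bruijn graph framework mentioned above more concrete, here is a minimal sketch that builds such a graph from reads and lists the nodes where a path branches, which is where a repeat makes unambiguous extension impossible. It is a toy under stated assumptions (error-free reads, a single strand, no reverse complements) and is not the analysis method of the paper; the function names and reads are hypothetical.

```python
from collections import defaultdict


def de_bruijn(reads, k):
    """Map each (k-1)-mer to the set of (k-1)-mers that follow it in a read."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            # Each k-mer is an edge from its prefix to its suffix.
            graph[kmer[:-1]].add(kmer[1:])
    return graph


def branching_nodes(graph):
    """Nodes with more than one successor, i.e. ambiguous extensions."""
    return sorted(node for node, successors in graph.items()
                  if len(successors) > 1)


# Hypothetical reads in which the 3-mer GCG is followed by two contexts.
reads = ["ATGCGT", "GCGTA", "TGCGA"]
graph = de_bruijn(reads, k=4)
print(branching_nodes(graph))
```

    Increasing k (which longer reads allow) resolves repeats shorter than k-1, which matches the abstract's point that achievable assembly quality depends on read length.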

  5. Short read DNA fragment anchoring algorithm.

    PubMed

    Wang, Wendi; Zhang, Peiheng; Liu, Xinchun

    2009-01-30

    The emerging next-generation sequencing method based on PCR technology boosts genome sequencing speed considerably and also reduces its cost. It has been used to address a broad range of bioinformatics problems. Because the reliable output sequence length of next-generation sequencing technologies is limited, we are generally confined to studying gene fragments of 30-50 bp, which is short relative to traditional gene fragment lengths. Anchoring gene fragments in a long reference sequence is an essential prerequisite for further assembly and analysis. Due to the sheer number of fragments produced by next-generation sequencing technologies and the huge size of reference sequences, anchoring rapidly becomes a computational bottleneck. We compared the efficiency of BLAT, SOAP and EMBF, where efficiency is defined as the total number of output results divided by the time taken to retrieve them. The data show that our algorithm, EMBF, is 3-4 times more efficient than SOAP and at least 150 times more efficient than BLAT. Moreover, when the reference sequence size is increased, the efficiency of SOAP can degrade by as much as 30%, while that of EMBF tends to increase. In conclusion, we consider EMBF more suitable for the short-fragment anchoring problem when completeness and accuracy of results are paramount and the reference sequences are relatively large.

  6. Effects of Short Read Quality and Quantity on a de novo Vertebrate Transcriptome Assembly

    PubMed Central

    Garcia, T.I.; Shen, Y.; Catchen, J.; Amores, A.; Schartl, M.; Postlethwait, J.; Walter, R. B.

    2011-01-01

    For many researchers, next generation sequencing data holds the key to answering a category of questions previously unassailable. One of the important and challenging steps in achieving these goals is accurately assembling the massive quantity of short sequencing reads into full nucleic acid sequences. For research groups working with non-model or wild systems, short read assembly can pose a significant challenge due to the lack of pre-existing EST or genome reference libraries. While many publications describe the overall process of sequencing and assembly, few address the topic of how many and what types of reads are best for assembly. The goal of this project was to use real-world data to explore the effects of read quantity and short read quality scores on the resulting de novo assemblies. Using several samples of short reads of various sizes and qualities, we produced many assemblies in an automated manner. We observe how the properties of read length, read quality, and read quantity affect the resulting assemblies and provide some general recommendations based on our real-world data set. PMID:21651990

  7. DistMap: A Toolkit for Distributed Short Read Mapping on a Hadoop Cluster

    PubMed Central

    Pandey, Ram Vinay; Schlötterer, Christian

    2013-01-01

    With the rapid and steady increase of next generation sequencing data output, the mapping of short reads has become a major data analysis bottleneck. On a single computer, it can take several days to map the vast quantity of reads produced from a single Illumina HiSeq lane. In an attempt to ameliorate this bottleneck we present a new tool, DistMap - a modular, scalable and integrated workflow to map reads in the Hadoop distributed computing framework. DistMap is easy to use, currently supports nine different short read mapping tools and can be run on all Unix-based operating systems. It accepts reads in FASTQ format as input and provides mapped reads in a SAM/BAM format. DistMap supports both paired-end and single-end reads thereby allowing the mapping of read data produced by different sequencing platforms. DistMap is available from http://code.google.com/p/distmap/ PMID:24009693

  8. DistMap: a toolkit for distributed short read mapping on a Hadoop cluster.

    PubMed

    Pandey, Ram Vinay; Schlötterer, Christian

    2013-01-01

    With the rapid and steady increase of next generation sequencing data output, the mapping of short reads has become a major data analysis bottleneck. On a single computer, it can take several days to map the vast quantity of reads produced from a single Illumina HiSeq lane. In an attempt to ameliorate this bottleneck we present a new tool, DistMap - a modular, scalable and integrated workflow to map reads in the Hadoop distributed computing framework. DistMap is easy to use, currently supports nine different short read mapping tools and can be run on all Unix-based operating systems. It accepts reads in FASTQ format as input and provides mapped reads in a SAM/BAM format. DistMap supports both paired-end and single-end reads thereby allowing the mapping of read data produced by different sequencing platforms. DistMap is available from http://code.google.com/p/distmap/

  9. Approaching marine bioprospecting in hexacorals by RNA deep sequencing.

    PubMed

    Johansen, Steinar D; Emblem, Ase; Karlsen, Bård Ove; Okkenhaug, Siri; Hansen, Hilde; Moum, Truls; Coucheron, Dag H; Seternes, Ole Morten

    2010-07-31

    RNA deep sequencing represents a new complementary approach in marine bioprospecting. Next-generation sequencing platforms have recently been developed for de novo whole transcriptome analysis, small RNA discovery and gene expression profiling. Deep sequencing transcriptomics (sequencing the complete set of cellular transcripts at a specific stage or condition) leads to sequential identification of all expressed genes in a sample. When combined with high-throughput bioinformatics and protein synthesis, RNA deep sequencing represents a powerful new approach in gene product discovery and bioprospecting. Here we summarize recent progress in the analyses of hexacoral transcriptomes with the focus on cold-water sea anemones and related organisms.

  10. Deep sequencing: becoming a critical tool in clinical virology.

    PubMed

    Quiñones-Mateu, Miguel E; Avila, Santiago; Reyes-Teran, Gustavo; Martinez, Miguel A

    2014-09-01

    Population (Sanger) sequencing has been the standard method in basic and clinical DNA sequencing for almost 40 years; however, next-generation (deep) sequencing methodologies are now revolutionizing the field of genomics, and clinical virology is no exception. Deep sequencing is highly efficient, producing an enormous amount of information at low cost in a relatively short period of time. High-throughput sequencing techniques have enabled significant contributions to multiple areas in virology, including virus discovery and metagenomics (viromes), molecular epidemiology, pathogenesis, and studies of how viruses escape the host immune system and antiviral pressures. In addition, new and more affordable deep sequencing-based assays are now being implemented in clinical laboratories. Here, we review the use of the current deep sequencing platforms in virology, focusing on three of the most studied viruses: human immunodeficiency virus (HIV), hepatitis C virus (HCV), and influenza virus.

  11. Deep Sequencing: Becoming a Critical Tool in Clinical Virology

    PubMed Central

    QUIÑONES-MATEU, Miguel E.; AVILA, Santiago; REYES-TERAN, Gustavo; MARTINEZ, Miguel A.

    2014-01-01

    Population (Sanger) sequencing has been the standard method in basic and clinical DNA sequencing for almost 40 years; however, next-generation (deep) sequencing methodologies are now revolutionizing the field of genomics, and clinical virology is no exception. Deep sequencing is highly efficient, producing an enormous amount of information at low cost in a relatively short period of time. High-throughput sequencing techniques have enabled significant contributions to multiple areas in virology, including virus discovery and metagenomics (viromes), molecular epidemiology, pathogenesis, and studies of how viruses escape the host immune system and antiviral pressures. In addition, new and more affordable deep sequencing-based assays are now being implemented in clinical laboratories. Here we review the use of the current deep sequencing platforms in virology, focusing on three of the most studied viruses: human immunodeficiency virus (HIV), hepatitis C virus (HCV), and influenza virus. PMID:24998424

  12. Deep Sequencing to Identify the Causes of Viral Encephalitis

    PubMed Central

    Chan, Benjamin K.; Wilson, Theodore; Fischer, Kael F.; Kriesel, John D.

    2014-01-01

    Deep sequencing allows for a rapid, accurate characterization of microbial DNA and RNA sequences in many types of samples. Deep sequencing (also called next generation sequencing or NGS) is being developed to assist with the diagnosis of a wide variety of infectious diseases. In this study, seven frozen brain samples from deceased subjects with recent encephalitis were investigated. RNA from each sample was extracted, randomly reverse transcribed and sequenced. The sequence analysis was performed in a blinded fashion and confirmed with pathogen-specific PCR. This analysis successfully identified measles virus sequences in two brain samples and herpes simplex virus type-1 sequences in three brain samples. No pathogen was identified in the other two brain specimens. These results were concordant with pathogen-specific PCR and partially concordant with prior neuropathological examinations, demonstrating that deep sequencing can accurately identify viral infections in frozen brain tissue. PMID:24699691

  13. CUSHAW: a CUDA compatible short read aligner to large genomes based on the Burrows-Wheeler transform.

    PubMed

    Liu, Yongchao; Schmidt, Bertil; Maskell, Douglas L

    2012-07-15

    New high-throughput sequencing technologies have promoted the production of short reads with dramatically low unit cost. The explosive growth of short read datasets poses a challenge to the mapping of short reads to reference genomes, such as the human genome, in terms of alignment quality and execution speed. We present CUSHAW, a parallelized short read aligner based on the compute unified device architecture (CUDA) parallel programming model. We exploit CUDA-compatible graphics hardware as accelerators to achieve fast speed. Our algorithm uses a quality-aware bounded search approach based on the Burrows-Wheeler transform (BWT) and the Ferragina-Manzini index to reduce the search space and achieve high alignment quality. Performance evaluation, using simulated as well as real short read datasets, reveals that our algorithm running on one or two graphics processing units achieves significant speedups in terms of execution time, while yielding comparable or even better alignment quality for paired-end alignments compared with three popular BWT-based aligners: Bowtie, BWA and SOAP2. CUSHAW also delivers competitive performance in terms of single-nucleotide polymorphism calling for an Escherichia coli test dataset. http://cushaw.sourceforge.net
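
    The abstract above names the Burrows-Wheeler transform (BWT) and the Ferragina-Manzini (FM) index as the basis of the aligner's search. The sketch below shows only the textbook core of that idea: building a BWT naively and counting exact occurrences of a pattern by backward search. It makes no attempt at CUSHAW's quality-aware bounded search, mismatch handling or GPU parallelism, and the function names are hypothetical.

```python
def bwt(text):
    """Naive Burrows-Wheeler transform using sorted rotations.

    Assumes '$' does not occur in text and sorts before every other symbol.
    """
    text += "$"
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    return "".join(rotation[-1] for rotation in rotations)


def count_occurrences(bwt_str, pattern):
    """Count exact matches of pattern via FM-index style backward search."""
    counts = {}
    for ch in bwt_str:
        counts[ch] = counts.get(ch, 0) + 1
    # first[c]: number of symbols in the text that sort strictly before c.
    first, total = {}, 0
    for ch in sorted(counts):
        first[ch] = total
        total += counts[ch]

    def occ(ch, i):
        # Occurrences of ch in bwt_str[:i]; real indexes precompute this.
        return bwt_str[:i].count(ch)

    lo, hi = 0, len(bwt_str)  # half-open range of matching suffixes
    for ch in reversed(pattern):
        if ch not in first:
            return 0
        lo = first[ch] + occ(ch, lo)
        hi = first[ch] + occ(ch, hi)
        if lo >= hi:
            return 0
    return hi - lo


reference = "GATTACATTA"  # toy reference, not real data
index = bwt(reference)
print(count_occurrences(index, "TTA"))
```

    The appeal for short-read mapping is that each pattern character narrows the suffix range in one step, so search time depends on read length rather than on reference size; production aligners replace the linear-time occ scan here with sampled rank tables.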

  14. Short-read DNA sequencing yields microsatellite markers for Rheum

    USDA-ARS's Scientific Manuscript database

    Identifying culinary rhubarb (Rheum ×hybridum Murray) cultivars using morphological characteristics is problematic due to variability within individual genotypes, variation caused by environmental factors, plant and leaf age, similarity between genetically diverse genotypes, multiple cultivar names ...

  15. Whole genome complete resequencing of Bacillus subtilis natto by combining long reads with high-quality short reads.

    PubMed

    Kamada, Mayumi; Hase, Sumitaka; Sato, Kengo; Toyoda, Atsushi; Fujiyama, Asao; Sakakibara, Yasubumi

    2014-01-01

    De novo microbial genome sequencing reached a turning point with third-generation sequencing (TGS) platforms, and several microbial genomes have been improved by TGS long reads. Bacillus subtilis natto is closely related to the laboratory standard strain B. subtilis Marburg 168, and it has a function in the production of the traditional Japanese fermented food "natto." The B. subtilis natto BEST195 genome was previously sequenced with short reads, but it included some incomplete regions. We resequenced the BEST195 genome using a PacBio RS sequencer, and we successfully obtained a complete genome sequence from one scaffold without any gaps, and we also applied Illumina MiSeq short reads to enhance quality. Compared with the previous BEST195 draft genome and Marburg 168 genome, we found that incomplete regions in the previous genome sequence were attributed to GC-bias and repetitive sequences, and we also identified some novel genes that are found only in the new genome.

  16. Optimizing de novo common wheat transcriptome assembly using short-read RNA-Seq data.

    PubMed

    Duan, Jialei; Xia, Chuan; Zhao, Guangyao; Jia, Jizeng; Kong, Xiuying

    2012-08-14

    Rapid advances in next-generation sequencing methods have provided new opportunities for transcriptome sequencing (RNA-Seq). The unprecedented sequencing depth provided by RNA-Seq makes it a powerful and cost-efficient method for transcriptome study, and it has been widely used in model organisms and non-model organisms to identify and quantify RNA. For non-model organisms lacking well-defined genomes, de novo assembly is typically required for downstream RNA-Seq analyses, including SNP discovery and identification of genes differentially expressed by phenotypes. Although RNA-Seq has been successfully used to sequence many non-model organisms, the results of de novo assembly from short reads can still be improved by using recent bioinformatic developments. In this study, we used 212.6 million paired-end reads, which accounted for 16.2 Gb, to assemble the hexaploid wheat transcriptome. Two state-of-the-art assemblers, Trinity and Trans-ABySS, which use the single and multiple k-mer methods, respectively, were used, and the whole de novo assembly process was divided into the following four steps: pre-assembly, merging different samples, removal of redundancy and scaffolding. We documented every detail of these steps and how these steps influenced assembly performance to gain insight into transcriptome assembly from short reads. After optimization, the assembled transcripts were comparable to Sanger-derived ESTs in terms of both continuity and accuracy. We also provided considerable new wheat transcript data to the community. It is feasible to assemble the hexaploid wheat transcriptome from short reads. Special attention should be paid to dealing with multiple samples to balance the spectrum of expression levels and redundancy. To obtain an accurate overview of RNA profiling, removal of redundancy may be crucial in de novo assembly.

  17. Error tolerant indexing and alignment of short reads with covering template families.

    PubMed

    Giladi, Eldar; Healy, John; Myers, Gene; Hart, Chris; Kapranov, Philipp; Lipson, Doron; Roels, Steve; Thayer, Edward; Letovsky, Stan

    2010-10-01

    The rapid adoption of high-throughput next generation sequence data in biological research is presenting a major challenge for sequence alignment tools—specifically, the efficient alignment of vast amounts of short reads to large references in the presence of differences arising from sequencing errors and biological sequence variations. To address this challenge, we developed a short read aligner for high-throughput sequencer data that is tolerant of errors or mutations of all types—namely, substitutions, deletions, and insertions. The aligner utilizes a multi-stage approach in which template-based indexing is used to identify candidate regions for alignment with dynamic programming. A template is a pair of gapped seeds, with one used with the read and one used with the reference. In this article, we focus on the development of template families that yield error-tolerant indexing up to a given error-budget. A general algorithm for finding those families is presented, and a recursive construction that creates families with higher error tolerance from ones with a lower error tolerance is developed.

  18. ANDES: Statistical tools for the ANalyses of DEep Sequencing.

    PubMed

    Li, Kelvin; Venter, Eli; Yooseph, Shibu; Stockwell, Timothy B; Eckerle, Lance D; Denison, Mark R; Spiro, David J; Methé, Barbara A

    2010-07-15

    The advancements in DNA sequencing technologies have allowed researchers to progress from the analyses of a single organism towards the deep sequencing of a sample of organisms. With sufficient sequencing depth, it is now possible to detect subtle variations between members of the same species, or between mixed species with shared biomarkers, such as the 16S rRNA gene. However, traditional sequencing analyses of samples from largely homogeneous populations are often still based on multiple sequence alignments (MSA), where each sequence is placed along a separate row and similarities between aligned bases can be followed down each column. While this visual format is intuitive for a small set of aligned sequences, the representation quickly becomes cumbersome as sequencing depths cover loci hundreds or thousands of reads deep. We have developed ANDES, a software library and a suite of applications, written in Perl and R, for the statistical ANalyses of DEep Sequencing. The fundamental data structure underlying ANDES is the position profile, which contains the nucleotide distributions for each genomic position resultant from a multiple sequence alignment (MSA). Tools include the root mean square deviation (RMSD) plot, which allows for the visual comparison of multiple samples on a position-by-position basis, and the computation of base conversion frequencies (transition/transversion rates), variation (Shannon entropy), inter-sample clustering and visualization (dendrogram and multidimensional scaling (MDS) plot), threshold-driven consensus sequence generation and polymorphism detection, and the estimation of empirically determined sequencing quality values. As new sequencing technologies evolve, deep sequencing will become increasingly cost-efficient and the inter and intra-sample comparisons of largely homogeneous sequences will become more common. We have provided a software package and demonstrated its application on various empirically-derived datasets
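
    The position profile described above (per-position nucleotide distributions derived from a multiple sequence alignment) and the Shannon entropy used as a variation measure can be sketched in a few lines. ANDES itself is written in Perl and R; the Python below is an independent illustration with hypothetical function names and toy reads, and it assumes equal-length aligned sequences in which a gap character, if present, is simply counted as another symbol.

```python
import math
from collections import Counter


def position_profile(aligned_reads):
    """Return one Counter of symbol counts per alignment column."""
    if not aligned_reads:
        raise ValueError("need at least one aligned sequence")
    length = len(aligned_reads[0])
    if any(len(read) != length for read in aligned_reads):
        raise ValueError("aligned sequences must have equal length")
    return [Counter(column) for column in zip(*aligned_reads)]


def shannon_entropy(counts):
    """Shannon entropy in bits of one column's symbol distribution."""
    total = sum(counts.values())
    entropy = 0.0
    for n in counts.values():
        p = n / total
        entropy -= p * math.log2(p)
    return entropy


# Toy alignment (hypothetical): columns 2 and 4 are polymorphic.
reads = ["ACGT", "ACGA", "ATGT", "ATGA"]
profile = position_profile(reads)
for position, counts in enumerate(profile, start=1):
    print(position, dict(counts), round(shannon_entropy(counts), 3))
```

    Summarizing each column as counts keeps memory proportional to alignment length rather than read depth, which is the practical advantage over viewing thousands of aligned rows directly.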

  19. Deep sequencing in the management of hepatitis virus infections.

    PubMed

    Quer, Josep; Rodríguez-Frias, Francisco; Gregori, Josep; Tabernero, David; Soria, Maria Eugenia; García-Cehic, Damir; Homs, Maria; Bosch, Albert; Pintó, Rosa María; Esteban, Juan Ignacio; Domingo, Esteban; Perales, Celia

    2016-12-28

    The hepatitis viruses represent a major public health problem worldwide. Procedures for characterization of the genomic composition of their populations, accurate diagnosis, identification of multiple infections, and information on inhibitor-escape mutants for treatment decisions are needed. Deep sequencing methodologies are extremely useful for these viruses since they replicate as complex and dynamic quasispecies swarms whose complexity and mutant composition are biologically relevant traits. Population complexity is a major challenge for disease prevention and control, but also an opportunity to distinguish among related but phenotypically distinct variants that might anticipate disease progression and treatment outcome. Detailed characterization of mutant spectra should permit choosing better treatment options, given the increasing number of new antiviral inhibitors available. In the present review we briefly summarize our experience on the use of deep sequencing for the management of hepatitis virus infections, particularly for hepatitis B and C viruses, and outline some possible new applications of deep sequencing for these important human pathogens.

  20. InteMAP: Integrated metagenomic assembly pipeline for NGS short reads.

    PubMed

    Lai, Binbin; Wang, Fumeng; Wang, Xiaoqi; Duan, Liping; Zhu, Huaiqiu

    2015-08-07

    Next-generation sequencing (NGS) has greatly facilitated metagenomic analysis but also raised new challenges for metagenomic DNA sequence assembly, owing to its high-throughput nature and extremely short reads generated by sequencers such as Illumina. To date, how to generate a high-quality draft assembly for metagenomic sequencing projects has not been fully addressed. We conducted a comprehensive assessment on state-of-the-art de novo assemblers and revealed that the performance of each assembler depends critically on the sequencing depth. To address this problem, we developed a pipeline named InteMAP to integrate three assemblers, ABySS, IDBA-UD and CABOG, which were found to complement each other in assembling metagenomic sequences. Making a decision of which assembling approaches to use according to the sequencing coverage estimation algorithm for each short read, the pipeline presents an automatic platform suitable to assemble real metagenomic NGS data with uneven coverage distribution of sequencing depth. By comparing the performance of InteMAP with current assemblers on both synthetic and real NGS metagenomic data, we demonstrated that InteMAP achieves better performance with a longer total contig length and higher contiguity, and contains more genes than others. We developed a de novo pipeline, named InteMAP, that integrates existing tools for metagenomics assembly. The pipeline outperforms previous assembly methods on metagenomic assembly by providing a longer total contig length, a higher contiguity and covering more genes. InteMAP, therefore, could potentially be a useful tool for the research community of metagenomics.

  1. Deep sequencing increases hepatitis C virus phylogenetic cluster detection compared to Sanger sequencing.

    PubMed

    Montoya, Vincent; Olmstead, Andrea; Tang, Patrick; Cook, Darrel; Janjua, Naveed; Grebely, Jason; Jacka, Brendan; Poon, Art F Y; Krajden, Mel

    2016-09-01

    Effective surveillance and treatment strategies are required to control the hepatitis C virus (HCV) epidemic. Phylogenetic analyses are powerful tools for reconstructing the evolutionary history of viral outbreaks and identifying transmission clusters. These studies often rely on Sanger sequencing which typically generates a single consensus sequence for each infected individual. For rapidly mutating viruses such as HCV, consensus sequencing underestimates the complexity of the viral quasispecies population and could therefore generate different phylogenetic tree topologies. Although deep sequencing provides a more detailed quasispecies characterization, in-depth phylogenetic analyses are challenging due to dataset complexity and computational limitations. Here, we apply deep sequencing to a characterized population to assess its ability to identify phylogenetic clusters compared with consensus Sanger sequencing. For deep sequencing, a sample specific threshold determined by the 50th percentile of the patristic distance distribution for all variants within each individual was used to identify clusters. Among seven patristic distance thresholds tested for the Sanger sequence phylogeny ranging from 0.005-0.06, a threshold of 0.03 was found to provide the maximum balance between positive agreement (samples in a cluster) and negative agreement (samples not in a cluster) relative to the deep sequencing dataset. From 77 HCV seroconverters, 10 individuals were identified in phylogenetic clusters using both methods. Deep sequencing analysis identified an additional 4 individuals and excluded 8 other individuals relative to Sanger sequencing. The application of this deep sequencing approach could be a more effective tool to understand onward HCV transmission dynamics compared with Sanger sequencing, since the incorporation of minority sequence variants improves the discrimination of phylogenetically linked clusters.

  2. Deep sequencing and human antibody repertoire analysis

    PubMed Central

    Boyd, Scott D; Crowe, James E

    2016-01-01

    In the past decade, high-throughput DNA sequencing (HTS) methods and improved approaches for isolating antigen-specific B cells and their antibody genes have been applied in many areas of human immunology. This work has greatly increased our understanding of human antibody repertoires and the specific clones responsible for protective immunity or immune-mediated pathogenesis. Although the principles underlying selection of individual B cell clones in the intact immune system are still under investigation, the combination of more powerful genetic tracking of antibody lineage development and functional testing of the encoded proteins promises to transform therapeutic antibody discovery and optimization. Here, we highlight recent advances in this fast-moving field. PMID:27065089

  3. QSRA – a quality-value guided de novo short read assembler

    PubMed Central

    Bryant, Douglas W; Wong, Weng-Keen; Mockler, Todd C

    2009-01-01

    Background New rapid high-throughput sequencing technologies have sparked the creation of a new class of assembler. Since all high-throughput sequencing platforms incorporate errors in their output, short-read assemblers must be designed to account for this error while utilizing all available data. Results We have designed and implemented an assembler, Quality-value guided Short Read Assembler, created to take advantage of quality-value scores as a further method of dealing with error. Compared to previous published algorithms, our assembler shows significant improvements not only in speed but also in output quality. Conclusion QSRA generally produced the highest genomic coverage, while being faster than VCAKE. QSRA is extremely competitive in its longest contig and N50/N80 contig lengths, producing results of similar quality to those of EDENA and VELVET. QSRA provides a step closer to the goal of de novo assembly of complex genomes, improving upon the original VCAKE algorithm by not only drastically reducing runtimes but also increasing the viability of the assembly algorithm through further error handling capabilities. PMID:19239711

  4. Whole genome assembly of a natto production strain Bacillus subtilis natto from very short read data.

    PubMed

    Nishito, Yukari; Osana, Yasunori; Hachiya, Tsuyoshi; Popendorf, Kris; Toyoda, Atsushi; Fujiyama, Asao; Itaya, Mitsuhiro; Sakakibara, Yasubumi

    2010-04-16

    Bacillus subtilis natto is closely related to the laboratory standard strain B. subtilis Marburg 168, and functions as a starter for the production of the traditional Japanese food "natto" made from soybeans. Although re-sequencing whole genomes of several laboratory domesticated B. subtilis 168 derivatives has already been attempted using short read sequencing data, the assembly of the whole genome sequence of a closely related strain, B. subtilis natto, from very short read data is more challenging, particularly with our aim to assemble one fully connected scaffold from short reads around 35 bp in length. We applied a comparative genome assembly method, which combines de novo assembly and reference guided assembly, to one of the B. subtilis natto strains. We successfully assembled 28 scaffolds and managed to avoid substantial fragmentation. Completion of the assembly through long PCR experiments resulted in one connected scaffold for B. subtilis natto. Based on the assembled genome sequence, our orthologous gene analysis between natto BEST195 and Marburg 168 revealed that 82.4% of 4375 predicted genes in BEST195 are one-to-one orthologous to genes in 168, with two genes in-paralog, 3.2% are deleted in 168, 14.3% are inserted in BEST195, and 5.9% of genes present in 168 are deleted in BEST195. The natto genome contains the same alleles in the promoter region of degQ and the coding region of swrAA as the wild strain, RO-FF-1. These are specific for gamma-PGA production ability, which is related to natto production. Further, the B. subtilis natto strain completely lacked a polyketide synthesis operon, disrupted the plipastatin production operon, and possesses previously unidentified transposases. The determination of the whole genome sequence of Bacillus subtilis natto provided detailed analyses of a set of genes related to natto production, demonstrating the number and locations of insertion sequences that B. subtilis natto harbors but B. subtilis 168 lacks

  5. Whole genome assembly of a natto production strain Bacillus subtilis natto from very short read data

    PubMed Central

    2010-01-01

    Background Bacillus subtilis natto is closely related to the laboratory standard strain B. subtilis Marburg 168, and functions as a starter for the production of the traditional Japanese food "natto" made from soybeans. Although re-sequencing whole genomes of several laboratory domesticated B. subtilis 168 derivatives has already been attempted using short read sequencing data, the assembly of the whole genome sequence of a closely related strain, B. subtilis natto, from very short read data is more challenging, particularly with our aim to assemble one fully connected scaffold from short reads around 35 bp in length. Results We applied a comparative genome assembly method, which combines de novo assembly and reference guided assembly, to one of the B. subtilis natto strains. We successfully assembled 28 scaffolds and managed to avoid substantial fragmentation. Completion of the assembly through long PCR experiments resulted in one connected scaffold for B. subtilis natto. Based on the assembled genome sequence, our orthologous gene analysis between natto BEST195 and Marburg 168 revealed that 82.4% of 4375 predicted genes in BEST195 are one-to-one orthologous to genes in 168, with two genes in-paralog, 3.2% are deleted in 168, 14.3% are inserted in BEST195, and 5.9% of genes present in 168 are deleted in BEST195. The natto genome contains the same alleles in the promoter region of degQ and the coding region of swrAA as the wild strain, RO-FF-1. These are specific for γ-PGA production ability, which is related to natto production. Further, the B. subtilis natto strain completely lacked a polyketide synthesis operon, disrupted the plipastatin production operon, and possesses previously unidentified transposases. Conclusions The determination of the whole genome sequence of Bacillus subtilis natto provided detailed analyses of a set of genes related to natto production, demonstrating the number and locations of insertion sequences that B. subtilis natto harbors but B

  6. Deep sequencing analysis of apple infecting viruses in Korea

    USDA-ARS's Scientific Manuscript database

    Deep sequencing of viruses isolated from eight symptomatic apple trees in Korea has generated 52 contigs derived from five viruses: Apple chlorotic leaf spot virus (ACLSV), Apple stem grooving virus (ASGV), Apple stem pitting virus (ASPV), Apple green crinkle associated virus (AGCaV) and Apricot lat...

  7. Fitness Inference from Short-Read Data: Within-Host Evolution of a Reassortant H5N1 Influenza Virus

    PubMed Central

    Illingworth, Christopher J.R.

    2015-01-01

    We present a method to infer the role of selection acting during the within-host evolution of the influenza virus from short-read genome sequence data. Linkage disequilibrium between loci is accounted for by treating short-read sequences as noisy multilocus emissions from an underlying model of haplotype evolution. A hierarchical model-selection procedure is used to infer the underlying fitness landscape of the virus insofar as that landscape is explored by the viral population. In a first application of our method, we analyze data from an evolutionary experiment describing the growth of a reassortant H5N1 virus in ferrets. Across two sets of replica experiments we infer multiple alleles to be under selection, including variants associated with receptor binding specificity, glycosylation, and with the increased transmissibility of the virus. We identify epistasis as an important component of the within-host fitness landscape, and show that adaptation can proceed through multiple genetic pathways. PMID:26243288

  8. Deep sequencing analysis of phage libraries using Illumina platform.

    PubMed

    Matochko, Wadim L; Chu, Kiki; Jin, Bingjie; Lee, Sam W; Whitesides, George M; Derda, Ratmir

    2012-09-01

    This paper presents an analysis of phage-displayed libraries of peptides using Illumina sequencing. We describe steps for the preparation of short DNA fragments for deep sequencing and MATLAB software for the analysis of the results. Screening of peptide libraries displayed on the surface of bacteriophage (phage display) can be used to discover peptides that bind to any target. The key step in this discovery is the analysis of peptide sequences present in the library. This analysis is usually performed by Sanger sequencing, which is labor intensive and limited to examination of a few hundred phage clones. On the other hand, Illumina deep-sequencing technology can characterize over 10^7 reads in a single run. We applied Illumina sequencing to analyze phage libraries. Using PCR, we isolated the variable regions from M13KE phage vectors from a phage display library. The PCR primers contained (i) sequences flanking the variable region, (ii) barcodes, and (iii) variable 5'-terminal region. We used this approach to examine how diversity of peptides in phage display libraries changes as a result of amplification of libraries in bacteria. Using HiSeq single-end Illumina sequencing of these fragments, we acquired over 2×10^7 reads, 57 base pairs (bp) in length. Each read contained information about the barcode (6 bp), one complementary region (12 bp) and a variable region (36 bp). We applied this sequencing to a model library of 10^6 unique clones and observed that amplification enriches ∼150 clones, which dominate ∼20% of the library. Deep sequencing, for the first time, characterized the collapse of diversity in phage libraries. The results suggest that screens based on repeated amplification and small-scale sequencing identify a few binding clones and miss thousands of useful clones. The deep sequencing approach described here could identify under-represented clones in phage screens. It could also be instrumental in developing new screening strategies, which can preserve
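
    The read structure in this abstract (a 6 bp barcode, a 12 bp constant region and a 36 bp variable region inside a 57 bp read) lends itself to a small parsing sketch. The Python below is not the authors' MATLAB software. It assumes the three fields are contiguous, in that order, starting at a configurable offset; the abstract does not state the actual order or where the remaining 3 bp sit. The function names are hypothetical.

```python
from collections import Counter

BARCODE_LEN = 6     # from the abstract
CONSTANT_LEN = 12   # the "complementary region" in the abstract
VARIABLE_LEN = 36   # from the abstract


def split_read(read, offset=0):
    """Split one read into (barcode, constant, variable).

    Assumed layout: barcode, constant region, variable region, contiguous
    and starting at `offset`. The real order is not given in the abstract.
    """
    end = offset + BARCODE_LEN + CONSTANT_LEN + VARIABLE_LEN
    if len(read) < end:
        raise ValueError("read too short for the assumed layout")
    b = offset + BARCODE_LEN
    c = b + CONSTANT_LEN
    return read[offset:b], read[b:c], read[c:end]


def count_clones(reads, offset=0):
    """Count how often each variable region occurs, grouped by barcode."""
    counts = {}
    for read in reads:
        barcode, _, variable = split_read(read, offset)
        counts.setdefault(barcode, Counter())[variable] += 1
    return counts


def top_fraction(clone_counts, k):
    """Fraction of all reads that belong to the k most abundant clones."""
    total = sum(clone_counts.values())
    if total == 0:
        return 0.0
    return sum(n for _, n in clone_counts.most_common(k)) / total
```

    With real data, top_fraction(counts[barcode], 150) would give the kind of number the abstract reports as roughly 20% of the library held by about 150 clones.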

  9. deepTools: a flexible platform for exploring deep-sequencing data.

    PubMed

    Ramírez, Fidel; Dündar, Friederike; Diehl, Sarah; Grüning, Björn A; Manke, Thomas

    2014-07-01

    We present a Galaxy based web server for processing and visualizing deeply sequenced data. The web server's core functionality consists of a suite of newly developed tools, called deepTools, that enable users with little bioinformatic background to explore the results of their sequencing experiments in a standardized setting. Users can upload pre-processed files with continuous data in standard formats and generate heatmaps and summary plots in a straight-forward, yet highly customizable manner. In addition, we offer several tools for the analysis of files containing aligned reads and enable efficient and reproducible generation of normalized coverage files. As a modular and open-source platform, deepTools can easily be expanded and customized to future demands and developments. The deepTools webserver is freely available at http://deeptools.ie-freiburg.mpg.de and is accompanied by extensive documentation and tutorials aimed at conveying the principles of deep-sequencing data analysis. The web server can be used without registration. deepTools can be installed locally either stand-alone or as part of Galaxy. © The Author(s) 2014. Published by Oxford University Press on behalf of Nucleic Acids Research.

  10. Estimation of genetic diversity in viral populations from next generation sequencing data with extremely deep coverage.

    PubMed

    Zukurov, Jean P; do Nascimento-Brito, Sieberth; Volpini, Angela C; Oliveira, Guilherme C; Janini, Luiz Mario R; Antoneli, Fernando

    2016-01-01

    In this paper we propose a method and discuss its computational implementation as an integrated tool for the analysis of viral genetic diversity on data generated by high-throughput sequencing. The main motivation for this work is to better understand the genetic diversity of viruses with high rates of nucleotide substitution, such as HIV-1 and influenza. Most methods for viral diversity estimation proposed so far are intended to exploit the longer reads produced by some next-generation sequencing platforms in order to estimate a population of haplotypes which represent the diversity of the original population. The method proposed here is custom-made to take advantage of the very low error rate and extremely deep coverage per site, which are the main features of some neglected technologies that have not received much attention due to the short length of their reads, which precludes haplotype estimation. This approach allowed us to avoid some hard problems related to haplotype reconstruction (the need for long reads, preliminary error filtering and assembly). We propose to measure genetic diversity of a viral population through a family of multinomial probability distributions indexed by the sites of the virus genome, each one representing the distribution of nucleic bases per site. Moreover, the implementation of the method focuses on two main optimization strategies: a read mapping/alignment procedure that aims at the recovery of the maximum possible number of short reads, and the inference of the multinomial parameters in a Bayesian framework with smoothed Dirichlet estimation. The Bayesian approach provides conditional probability distributions for the multinomial parameters allowing one to take into account the prior information of the control experiment and providing a natural way to separate signal from noise, since it automatically furnishes Bayesian confidence intervals and thus avoids the drawbacks of preliminary error filtering. The methods described in this
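
    The per-site multinomial model with a Dirichlet prior can be illustrated with a short Python sketch. It assumes reads are already aligned and given as (start, sequence) pairs, and it uses a symmetric Dirichlet prior with a single pseudocount alpha. Both are simplifying assumptions; the authors' "smoothed Dirichlet estimation" and their use of a control experiment as prior information may differ. The function names are hypothetical.

```python
BASES = "ACGT"


def site_counts(aligned_reads, genome_length):
    """Tally base counts per genome site from (start, sequence) pairs."""
    counts = [dict.fromkeys(BASES, 0) for _ in range(genome_length)]
    for start, seq in aligned_reads:
        for i, base in enumerate(seq):
            pos = start + i
            # Ignore positions off the genome and ambiguous bases such as N.
            if 0 <= pos < genome_length and base in BASES:
                counts[pos][base] += 1
    return counts


def posterior_mean(counts, alpha=1.0):
    """Posterior mean of the multinomial parameters at one site.

    With a symmetric Dirichlet(alpha) prior and observed counts n_b, the
    posterior is Dirichlet(n_b + alpha), whose mean is
    (n_b + alpha) / (N + 4 * alpha).
    """
    total = sum(counts.values()) + alpha * len(BASES)
    return {b: (counts[b] + alpha) / total for b in BASES}
```

    Because every base keeps a non-zero posterior probability, low-frequency variants at deeply covered sites are estimated without a separate error-filtering pass, which is the property the abstract emphasizes.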

  11. DSAP: deep-sequencing small RNA analysis pipeline.

    PubMed

    Huang, Po-Jung; Liu, Yi-Chung; Lee, Chi-Ching; Lin, Wei-Chen; Gan, Richie Ruei-Chi; Lyu, Ping-Chiang; Tang, Petrus

    2010-07-01

    DSAP is an automated multiple-task web service designed to provide a total solution to analyzing deep-sequencing small RNA datasets generated by next-generation sequencing technology. DSAP uses a tab-delimited file as an input format, which holds the unique sequence reads (tags) and their corresponding number of copies generated by the Solexa sequencing platform. The input data will go through four analysis steps in DSAP: (i) cleanup: removal of adaptors and poly-A/T/C/G/N nucleotides; (ii) clustering: grouping of cleaned sequence tags into unique sequence clusters; (iii) non-coding RNA (ncRNA) matching: sequence homology mapping against a transcribed sequence library from the ncRNA database Rfam (http://rfam.sanger.ac.uk/); and (iv) known miRNA matching: detection of known miRNAs in miRBase (http://www.mirbase.org/) based on sequence homology. The expression levels corresponding to matched ncRNAs and miRNAs are summarized in multi-color clickable bar charts linked to external databases. DSAP is also capable of displaying miRNA expression levels from different jobs using a log2-scaled color matrix. Furthermore, a cross-species comparative function is also provided to show the distribution of identified miRNAs in different species as deposited in miRBase. DSAP is available at http://dsap.cgu.edu.tw.
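
    Steps (i) and (ii) of the pipeline can be sketched in a few lines of Python. This is not DSAP's code. The adaptor string is a made-up placeholder, and treating any single-nucleotide tag as a poly-A/T/C/G/N artifact is a deliberately crude assumption. The input format (one tag and its copy number per tab-delimited line) follows the abstract.

```python
from collections import Counter

ADAPTOR = "TCGTATGCCG"  # hypothetical 3' adaptor prefix, for illustration only


def clean_tag(seq, adaptor=ADAPTOR):
    """Trim from the first adaptor occurrence onward.

    Returns None for tags that end up empty or consist of a single
    repeated nucleotide (poly-A/T/C/G/N).
    """
    seq = seq.upper()
    cut = seq.find(adaptor)
    if cut != -1:
        seq = seq[:cut]
    if not seq or len(set(seq)) == 1:
        return None
    return seq


def cluster_tags(lines, adaptor=ADAPTOR):
    """Merge copy numbers of tags that are identical after cleanup."""
    clusters = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        seq, copies = line.split("\t")
        cleaned = clean_tag(seq, adaptor)
        if cleaned is not None:
            clusters[cleaned] += int(copies)
    return clusters
```

    The resulting clusters, with their summed copy numbers, are what steps (iii) and (iv) would then match against Rfam and miRBase.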

  12. Deep whole-genome sequencing of 100 southeast Asian Malays.

    PubMed

    Wong, Lai-Ping; Ong, Rick Twee-Hee; Poh, Wan-Ting; Liu, Xuanyao; Chen, Peng; Li, Ruoying; Lam, Kevin Koi-Yau; Pillai, Nisha Esakimuthu; Sim, Kar-Seng; Xu, Haiyan; Sim, Ngak-Leng; Teo, Shu-Mei; Foo, Jia-Nee; Tan, Linda Wei-Lin; Lim, Yenly; Koo, Seok-Hwee; Gan, Linda Seo-Hwee; Cheng, Ching-Yu; Wee, Sharon; Yap, Eric Peng-Huat; Ng, Pauline Crystal; Lim, Wei-Yen; Soong, Richie; Wenk, Markus Rene; Aung, Tin; Wong, Tien-Yin; Khor, Chiea-Chuen; Little, Peter; Chia, Kee-Seng; Teo, Yik-Ying

    2013-01-10

    Whole-genome sequencing across multiple samples in a population provides an unprecedented opportunity for comprehensively characterizing the polymorphic variants in the population. Although the 1000 Genomes Project (1KGP) has offered brief insights into the value of population-level sequencing, the low coverage has compromised the ability to confidently detect rare and low-frequency variants. In addition, the composition of populations in the 1KGP is not complete, despite the fact that the study design has been extended to more than 2,500 samples from more than 20 population groups. The Malays are one of the Austronesian groups predominantly present in Southeast Asia and Oceania, and the Singapore Sequencing Malay Project (SSMP) aims to perform deep whole-genome sequencing of 100 healthy Malays. By sequencing at a minimum of 30× coverage, we have illustrated the higher sensitivity at detecting low-frequency and rare variants and the ability to investigate the presence of hotspots of functional mutations. Compared to the low-pass sequencing in the 1KGP, the deeper coverage allows more functional variants to be identified for each person. A comparison of the fidelity of genotype imputation of Malays indicated that a population-specific reference panel, such as the SSMP, outperforms a cosmopolitan panel with larger number of individuals for common SNPs. For lower-frequency (<5%) markers, a larger number of individuals might have to be whole-genome sequenced so that the accuracy currently afforded by the 1KGP can be achieved. The SSMP data are expected to be the benchmark for evaluating the value of deep population-level sequencing versus low-pass sequencing, especially in populations that are poorly represented in population-genetics studies.
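
    The "30× coverage" figure refers to mean sequencing depth, which is conventionally computed as number of reads times read length divided by genome size. The Python sketch below works through that arithmetic. The read length and genome size in the usage note are illustrative round numbers, not values from the study.

```python
import math


def mean_coverage(n_reads, read_length, genome_size):
    """Mean depth: total sequenced bases divided by genome size."""
    return n_reads * read_length / genome_size


def reads_needed(target_coverage, read_length, genome_size):
    """Smallest number of reads that reaches the target mean depth."""
    return math.ceil(target_coverage * genome_size / read_length)
```

    For example, reads_needed(30, 100, 3_000_000_000) gives 900,000,000 reads of 100 bp for a 3 Gb genome. Mean depth says nothing about how evenly the reads are spread, which is why the abstract states a minimum of 30×.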

  13. Deep Whole-Genome Sequencing of 100 Southeast Asian Malays

    PubMed Central

    Wong, Lai-Ping; Ong, Rick Twee-Hee; Poh, Wan-Ting; Liu, Xuanyao; Chen, Peng; Li, Ruoying; Lam, Kevin Koi-Yau; Pillai, Nisha Esakimuthu; Sim, Kar-Seng; Xu, Haiyan; Sim, Ngak-Leng; Teo, Shu-Mei; Foo, Jia-Nee; Tan, Linda Wei-Lin; Lim, Yenly; Koo, Seok-Hwee; Gan, Linda Seo-Hwee; Cheng, Ching-Yu; Wee, Sharon; Yap, Eric Peng-Huat; Ng, Pauline Crystal; Lim, Wei-Yen; Soong, Richie; Wenk, Markus Rene; Aung, Tin; Wong, Tien-Yin; Khor, Chiea-Chuen; Little, Peter; Chia, Kee-Seng; Teo, Yik-Ying

    2013-01-01

    Whole-genome sequencing across multiple samples in a population provides an unprecedented opportunity for comprehensively characterizing the polymorphic variants in the population. Although the 1000 Genomes Project (1KGP) has offered brief insights into the value of population-level sequencing, the low coverage has compromised the ability to confidently detect rare and low-frequency variants. In addition, the composition of populations in the 1KGP is not complete, despite the fact that the study design has been extended to more than 2,500 samples from more than 20 population groups. The Malays are one of the Austronesian groups predominantly present in Southeast Asia and Oceania, and the Singapore Sequencing Malay Project (SSMP) aims to perform deep whole-genome sequencing of 100 healthy Malays. By sequencing at a minimum of 30× coverage, we have illustrated the higher sensitivity at detecting low-frequency and rare variants and the ability to investigate the presence of hotspots of functional mutations. Compared to the low-pass sequencing in the 1KGP, the deeper coverage allows more functional variants to be identified for each person. A comparison of the fidelity of genotype imputation of Malays indicated that a population-specific reference panel, such as the SSMP, outperforms a cosmopolitan panel with larger number of individuals for common SNPs. For lower-frequency (<5%) markers, a larger number of individuals might have to be whole-genome sequenced so that the accuracy currently afforded by the 1KGP can be achieved. The SSMP data are expected to be the benchmark for evaluating the value of deep population-level sequencing versus low-pass sequencing, especially in populations that are poorly represented in population-genetics studies. PMID:23290073

  14. RNA-CODE: a noncoding RNA classification tool for short reads in NGS data lacking reference genomes.

    PubMed

    Yuan, Cheng; Sun, Yanni

    2013-01-01

    The number of transcriptomic sequencing projects of various non-model organisms is still accumulating rapidly. As non-coding RNAs (ncRNAs) are highly abundant in living organism and play important roles in many biological processes, identifying fragmentary members of ncRNAs in small RNA-seq data is an important step in post-NGS analysis. However, the state-of-the-art ncRNA search tools are not optimized for next-generation sequencing (NGS) data, especially for very short reads. In this work, we propose and implement a comprehensive ncRNA classification tool (RNA-CODE) for very short reads. RNA-CODE is specifically designed for ncRNA identification in NGS data that lack quality reference genomes. Given a set of short reads, our tool classifies the reads into different types of ncRNA families. The classification results can be used to quantify the expression levels of different types of ncRNAs in RNA-seq data and ncRNA composition profiles in metagenomic data, respectively. The experimental results of applying RNA-CODE to RNA-seq of Arabidopsis and a metagenomic data set sampled from human guts demonstrate that RNA-CODE competes favorably in both sensitivity and specificity with other tools. The source codes of RNA-CODE can be downloaded at http://www.cse.msu.edu/~chengy/RNA_CODE.

  15. DEEP MOTIF DASHBOARD: VISUALIZING AND UNDERSTANDING GENOMIC SEQUENCES USING DEEP NEURAL NETWORKS.

    PubMed

    Lanchantin, Jack; Singh, Ritambhara; Wang, Beilun; Qi, Yanjun

    2016-01-01

    Deep neural network (DNN) models have recently obtained state-of-the-art prediction accuracy for the transcription factor binding (TFBS) site classification task. However, it remains unclear how these approaches identify meaningful DNA sequence signals and give insights as to why TFs bind to certain locations. In this paper, we propose a toolkit called the Deep Motif Dashboard (DeMo Dashboard) which provides a suite of visualization strategies to extract motifs, or sequence patterns from deep neural network models for TFBS classification. We demonstrate how to visualize and understand three important DNN models: convolutional, recurrent, and convolutional-recurrent networks. Our first visualization method is finding a test sequence's saliency map which uses first-order derivatives to describe the importance of each nucleotide in making the final prediction. Second, considering recurrent models make predictions in a temporal manner (from one end of a TFBS sequence to the other), we introduce temporal output scores, indicating the prediction score of a model over time for a sequential input. Lastly, a class-specific visualization strategy finds the optimal input sequence for a given TFBS positive class via stochastic gradient optimization. Our experimental results indicate that a convolutional-recurrent architecture performs the best among the three architectures. The visualization techniques indicate that CNN-RNN makes predictions by modeling both motifs as well as dependencies among them.
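
    The saliency-map idea (first-order derivatives of the prediction with respect to each input nucleotide) can be shown on a toy model. The Python sketch below replaces the deep network with a sigmoid over position-specific weights so that the gradient can be written by hand. A real CNN or RNN would need automatic differentiation, and reading off the gradient at the observed base is an assumption about how the per-nucleotide score is formed. All names are hypothetical.

```python
import math

BASES = "ACGT"


def one_hot(seq):
    """Encode a DNA string as a list of 4-element one-hot rows (ACGT only)."""
    return [[1.0 if b == base else 0.0 for base in BASES] for b in seq]


def score(x, weights, bias=0.0):
    """Toy model: sigmoid of a position-specific linear score."""
    z = bias + sum(w * v for wr, xr in zip(weights, x) for w, v in zip(wr, xr))
    return 1.0 / (1.0 + math.exp(-z))


def saliency(seq, weights, bias=0.0):
    """Per-nucleotide saliency: |d score / d x| at the observed base."""
    s = score(one_hot(seq), weights, bias)
    slope = s * (1.0 - s)  # derivative of the sigmoid at the current input
    # d score / d x[i][j] = slope * weights[i][j]; keep the observed base only.
    return [abs(slope * weights[i][BASES.index(b)]) for i, b in enumerate(seq)]
```

    Positions whose weights contribute nothing to the score get a saliency of zero, while positions that push the prediction up or down are highlighted in proportion to their gradient.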

  16. Unbiased Deep Sequencing of RNA Viruses from Clinical Samples

    PubMed Central

    Matranga, Christian B.; Gladden-Young, Adrianne; Qu, James; Winnicki, Sarah; Nosamiefan, Dolo; Levin, Joshua Z.; Sabeti, Pardis C.

    2016-01-01

    Here we outline a next-generation RNA sequencing protocol that enables de novo assemblies and intra-host variant calls of viral genomes collected from clinical and biological sources. The method is unbiased and universal; it uses random primers for cDNA synthesis and requires no prior knowledge of the viral sequence content. Before library construction, selective RNase H-based digestion is used to deplete unwanted RNA — including poly(rA) carrier and ribosomal RNA — from the viral RNA sample. Selective depletion improves both the data quality and the number of unique reads in viral RNA sequencing libraries. Moreover, a transposase-based 'tagmentation' step is used in the protocol as it reduces overall library construction time. The protocol has enabled rapid deep sequencing of over 600 Lassa and Ebola virus samples-including collections from both blood and tissue isolates-and is broadly applicable to other microbial genomics studies. PMID:27403729

  17. Deep Sequencing Analysis of Apple Infecting Viruses in Korea

    PubMed Central

    Cho, In-Sook; Igori, Davaajargal; Lim, Seungmo; Choi, Gug-Seoun; Hammond, John; Lim, Hyoun-Sub; Moon, Jae Sun

    2016-01-01

    Deep sequencing has generated 52 contigs derived from five viruses; Apple chlorotic leaf spot virus (ACLSV), Apple stem grooving virus (ASGV), Apple stem pitting virus (ASPV), Apple green crinkle associated virus (AGCaV), and Apricot latent virus (ApLV) were identified from eight apple samples showing small leaves and/or growth retardation. Nucleotide (nt) sequence identity of the assembled contigs was from 68% to 99% compared to the reference sequences of the five respective viral genomes. Sequences of ASPV and ASGV were the most abundantly represented by the 52 contigs assembled. The presence of the five viruses in the samples was confirmed by RT-PCR using specific primers based on the sequences of each assembled contig. All five viruses were detected in three of the samples, whereas all samples had mixed infections with at least two viruses. The most frequently detected virus was ASPV, followed by ASGV, ApLV, ACLSV, and AGCaV which were withal found in mixed infections in the tested samples. AGCaV was identified in assembled contigs ID 1012480 and 93549, which showed 82% and 78% nt sequence identity with ORF1 of AGCaV isolate Aurora-1. ApLV was identified in three assembled contigs, ID 65587, 1802365, and 116777, which showed 77%, 78%, and 76% nt sequence identity respectively with ORF1 of ApLV isolate LA2. Deep sequencing assay was shown to be a valuable and powerful tool for detection and identification of known and unknown virome in infected apple trees, here identifying ApLV and AGCaV in commercial orchards in Korea for the first time. PMID:27721694

  18. Deep Sequencing Analysis of Apple Infecting Viruses in Korea.

    PubMed

    Cho, In-Sook; Igori, Davaajargal; Lim, Seungmo; Choi, Gug-Seoun; Hammond, John; Lim, Hyoun-Sub; Moon, Jae Sun

    2016-10-01

    Deep sequencing has generated 52 contigs derived from five viruses; Apple chlorotic leaf spot virus (ACLSV), Apple stem grooving virus (ASGV), Apple stem pitting virus (ASPV), Apple green crinkle associated virus (AGCaV), and Apricot latent virus (ApLV) were identified from eight apple samples showing small leaves and/or growth retardation. Nucleotide (nt) sequence identity of the assembled contigs was from 68% to 99% compared to the reference sequences of the five respective viral genomes. Sequences of ASPV and ASGV were the most abundantly represented by the 52 contigs assembled. The presence of the five viruses in the samples was confirmed by RT-PCR using specific primers based on the sequences of each assembled contig. All five viruses were detected in three of the samples, whereas all samples had mixed infections with at least two viruses. The most frequently detected virus was ASPV, followed by ASGV, ApLV, ACLSV, and AGCaV which were withal found in mixed infections in the tested samples. AGCaV was identified in assembled contigs ID 1012480 and 93549, which showed 82% and 78% nt sequence identity with ORF1 of AGCaV isolate Aurora-1. ApLV was identified in three assembled contigs, ID 65587, 1802365, and 116777, which showed 77%, 78%, and 76% nt sequence identity respectively with ORF1 of ApLV isolate LA2. Deep sequencing assay was shown to be a valuable and powerful tool for detection and identification of known and unknown virome in infected apple trees, here identifying ApLV and AGCaV in commercial orchards in Korea for the first time.

  19. deepBase: a database for deeply annotating and mining deep sequencing data

    PubMed Central

    Yang, Jian-Hua; Shao, Peng; Zhou, Hui; Chen, Yue-Qin; Qu, Liang-Hu

    2010-01-01

    Advances in high-throughput next-generation sequencing technology have reshaped the transcriptomic research landscape. However, exploration of these massive data remains a daunting challenge. In this study, we describe a novel database, deepBase, which we have developed to facilitate the comprehensive annotation and discovery of small RNAs from transcriptomic data. The current release of deepBase contains deep sequencing data from 185 small RNA libraries from diverse tissues and cell lines of seven organisms: human, mouse, chicken, Ciona intestinalis, Drosophila melanogaster, Caenorhabditis elegans and Arabidopsis thaliana. By analyzing ∼14.6 million unique reads that perfectly mapped to more than 284 million genomic loci, we annotated and identified ∼380 000 unique ncRNA-associated small RNAs (nasRNAs), ∼1.5 million unique promoter-associated small RNAs (pasRNAs), ∼4.0 million unique exon-associated small RNAs (easRNAs) and ∼6 million unique repeat-associated small RNAs (rasRNAs). Furthermore, 2038 miRNA and 1889 snoRNA candidates were predicted by miRDeep and snoSeeker. All of the mapped reads can be grouped into about 1.2 million RNA clusters. For the purpose of comparative analysis, deepBase provides an integrative, interactive and versatile display. A convenient search option, related publications and other useful information are also provided for further investigation. deepBase is available at: http://deepbase.sysu.edu.cn/. PMID:19966272

  20. GAViT: Genome Assembly Visualization Tool for Short Read Data

    SciTech Connect

    Syed, Aijazuddin; Shapiro, Harris; Tu, Hank; Pangilinan, Jasmyn; Trong, Stephan

    2008-03-14

    It is a challenging job for genome analysts to accurately debug, troubleshoot, and validate genome assembly results. Genome analysts rely on visualization tools to help validate and troubleshoot assembly results, including such problems as mis-assemblies, low-quality regions, and repeats. Short read data adds further complexity and makes it extremely challenging for the visualization tools to scale and to view all needed assembly information. As a result, there is a need for a visualization tool that can scale to display assembly data from the new sequencing technologies. We present Genome Assembly Visualization Tool (GAViT), a highly scalable and interactive assembly visualization tool developed at the DOE Joint Genome Institute (JGI).

  1. Deep whole-genome sequencing of 90 Han Chinese genomes.

    PubMed

    Lan, Tianming; Lin, Haoxiang; Zhu, Wenjuan; Laurent, Tellier Christian Asker Melchior; Yang, Mengcheng; Liu, Xin; Wang, Jun; Wang, Jian; Yang, Huanming; Xu, Xun; Guo, Xiaosen

    2017-09-01

    Next-generation sequencing provides a high-resolution insight into human genetic information. However, the focus of previous studies has primarily been on low-coverage data due to the high cost of sequencing. Although the 1000 Genomes Project and the Haplotype Reference Consortium have both provided powerful reference panels for imputation, low-frequency and novel variants remain difficult to discover and call with accuracy on the basis of low-coverage data. Deep sequencing provides an optimal solution for the problem of these low-frequency and novel variants. Although whole-exome sequencing is also a viable choice for exome regions, it cannot account for noncoding regions, sometimes resulting in the absence of important, causal variants. For Han Chinese populations, the majority of variants have been discovered based upon low-coverage data from the 1000 Genomes Project. However, high-coverage, whole-genome sequencing data are limited for any population, and a large amount of low-frequency, population-specific variants remain uncharacterized. We have performed whole-genome sequencing at a high depth (∼80×) of 90 unrelated individuals of Chinese ancestry, collected from the 1000 Genomes Project samples, including 45 Northern Han Chinese and 45 Southern Han Chinese samples. Eighty-three of these 90 have been sequenced by the 1000 Genomes Project. We have identified 12 568 804 single nucleotide polymorphisms, 2 074 210 short InDels, and 26 142 structural variations from these 90 samples. Compared to the Han Chinese data from the 1000 Genomes Project, we have found 7 000 629 novel variants with low frequency (defined as minor allele frequency < 5%), including 5 813 503 single nucleotide polymorphisms, 1 169 199 InDels, and 17 927 structural variants. Using deep sequencing data, we have built a greatly expanded spectrum of genetic variation for the Han Chinese genome. Compared to the 1000 Genomes Project, these Han Chinese deep sequencing data enhance the

  2. Deep sequencing of 10,000 human genomes

    PubMed Central

    Pierce, Levi C. T.; Biggs, William H.; di Iulio, Julia; Wong, Emily H. M.; Fabani, Martin M.; Kirkness, Ewen F.; Moustafa, Ahmed; Shah, Naisha; Xie, Chao; Brewerton, Suzanne C.; Bulsara, Nadeem; Garner, Chad; Metzker, Gary; Sandoval, Efren; Perkins, Brad A.; Och, Franz J.; Turpaz, Yaron; Venter, J. Craig

    2016-01-01

    We report on the sequencing of 10,545 human genomes at 30×–40× coverage with an emphasis on quality metrics and novel variant and sequence discovery. We find that 84% of an individual human genome can be sequenced confidently. This high-confidence region includes 91.5% of exon sequence and 95.2% of known pathogenic variant positions. We present the distribution of over 150 million single-nucleotide variants in the coding and noncoding genome. Each newly sequenced genome contributes an average of 8,579 novel variants. In addition, each genome carries on average 0.7 Mb of sequence that is not found in the main build of the hg38 reference genome. The density of this catalog of variation allowed us to construct high-resolution profiles that define genomic sites that are highly intolerant of genetic variation. These results indicate that the data generated by deep genome sequencing is of the quality necessary for clinical use. PMID:27702888

  3. Deep sequencing of 10,000 human genomes.

    PubMed

    Telenti, Amalio; Pierce, Levi C T; Biggs, William H; di Iulio, Julia; Wong, Emily H M; Fabani, Martin M; Kirkness, Ewen F; Moustafa, Ahmed; Shah, Naisha; Xie, Chao; Brewerton, Suzanne C; Bulsara, Nadeem; Garner, Chad; Metzker, Gary; Sandoval, Efren; Perkins, Brad A; Och, Franz J; Turpaz, Yaron; Venter, J Craig

    2016-10-18

    We report on the sequencing of 10,545 human genomes at 30×-40× coverage with an emphasis on quality metrics and novel variant and sequence discovery. We find that 84% of an individual human genome can be sequenced confidently. This high-confidence region includes 91.5% of exon sequence and 95.2% of known pathogenic variant positions. We present the distribution of over 150 million single-nucleotide variants in the coding and noncoding genome. Each newly sequenced genome contributes an average of 8,579 novel variants. In addition, each genome carries on average 0.7 Mb of sequence that is not found in the main build of the hg38 reference genome. The density of this catalog of variation allowed us to construct high-resolution profiles that define genomic sites that are highly intolerant of genetic variation. These results indicate that the data generated by deep genome sequencing is of the quality necessary for clinical use.

  4. Whole Genome Complete Resequencing of Bacillus subtilis Natto by Combining Long Reads with High-Quality Short Reads

    PubMed Central

    Kamada, Mayumi; Hase, Sumitaka; Sato, Kengo; Toyoda, Atsushi; Fujiyama, Asao; Sakakibara, Yasubumi

    2014-01-01

    De novo microbial genome sequencing reached a turning point with third-generation sequencing (TGS) platforms, and several microbial genomes have been improved by TGS long reads. Bacillus subtilis natto is closely related to the laboratory standard strain B. subtilis Marburg 168, and it has a function in the production of the traditional Japanese fermented food “natto.” The B. subtilis natto BEST195 genome was previously sequenced with short reads, but it included some incomplete regions. We resequenced the BEST195 genome using a PacBio RS sequencer, and we successfully obtained a complete genome sequence from one scaffold without any gaps, and we also applied Illumina MiSeq short reads to enhance quality. Compared with the previous BEST195 draft genome and Marburg 168 genome, we found that incomplete regions in the previous genome sequence were attributed to GC-bias and repetitive sequences, and we also identified some novel genes that are found only in the new genome. PMID:25329997

  5. Deep sequencing approach for investigating infectious agents causing fever.

    PubMed

    Susilawati, T N; Jex, A R; Cantacessi, C; Pearson, M; Navarro, S; Susianto, A; Loukas, A C; McBride, W J H

    2016-07-01

    Acute undifferentiated fever (AUF) poses a diagnostic challenge due to the variety of possible aetiologies. While the majority of AUFs resolve spontaneously, some cases become prolonged and cause significant morbidity and mortality, necessitating improved diagnostic methods. This study evaluated the utility of deep sequencing in fever investigation. DNA and RNA were isolated from plasma/sera of AUF cases being investigated at Cairns Hospital in northern Australia, including eight control samples from patients with a confirmed diagnosis. Following isolation, DNA and RNA were bulk amplified and RNA was reverse transcribed to cDNA. The resulting DNA and cDNA amplicons were subjected to deep sequencing on an Illumina HiSeq 2000 platform. Bioinformatics analysis was performed using the program Kraken and the CLC assembly-alignment pipeline. The results were compared with the outcomes of clinical tests. We generated between 4 and 20 million reads per sample. The results of Kraken and CLC analyses concurred with diagnoses obtained by other means in 87.5 % (7/8) and 25 % (2/8) of control samples, respectively. Some plausible causes of fever were identified in ten patients who remained undiagnosed following routine hospital investigations, including Escherichia coli bacteraemia and scrub typhus that eluded conventional tests. Achromobacter xylosoxidans, Alteromonas macleodii and Enterobacteria phage were prevalent in all samples. A deep sequencing approach of patient plasma/serum samples led to the identification of aetiological agents putatively implicated in AUFs and enabled the study of microbial diversity in human blood. The application of this approach in hospital practice is currently limited by sequencing input requirements and complicated data analysis.

  6. deepTools2: a next generation web server for deep-sequencing data analysis.

    PubMed

    Ramírez, Fidel; Ryan, Devon P; Grüning, Björn; Bhardwaj, Vivek; Kilpert, Fabian; Richter, Andreas S; Heyne, Steffen; Dündar, Friederike; Manke, Thomas

    2016-07-08

    We present an update to our Galaxy-based web server for processing and visualizing deeply sequenced data. Its core tool set, deepTools, allows users to perform complete bioinformatic workflows ranging from quality controls and normalizations of aligned reads to integrative analyses, including clustering and visualization approaches. Since we first described our deepTools Galaxy server in 2014, we have implemented new solutions for many requests from the community and our users. Here, we introduce significant enhancements and new tools to further improve data visualization and interpretation. deepTools continue to be open to all users and freely available as a web service at deeptools.ie-freiburg.mpg.de. The new deepTools2 suite can be easily deployed within any Galaxy framework via the toolshed repository, and we also provide source code for command line usage under Linux and Mac OS X. A public and documented API for access to deepTools functionality is also available. © The Author(s) 2016. Published by Oxford University Press on behalf of Nucleic Acids Research.

  7. Deep sequencing methods for protein engineering and design.

    PubMed

    Wrenbeck, Emily E; Faber, Matthew S; Whitehead, Timothy A

    2016-11-22

    The advent of next-generation sequencing (NGS) has revolutionized protein science, and the development of complementary methods enabling NGS-driven protein engineering has followed. In general, these experiments address the functional consequences of thousands of protein variants in a massively parallel manner using genotype-phenotype linked high-throughput functional screens followed by DNA counting via deep sequencing. We highlight the use of information-rich datasets to engineer protein molecular recognition. Examples include the creation of multiple dual-affinity Fabs targeting structurally dissimilar epitopes and engineering of a broad germline-targeted anti-HIV-1 immunogen. Additionally, we highlight the generation of enzyme fitness landscapes for conducting fundamental studies of protein behavior and evolution. We conclude with discussion of technological advances. Copyright © 2016 Elsevier Ltd. All rights reserved.

  8. Signatures of Crested Ibis MHC Revealed by Recombination Screening and Short-Reads Assembly Strategy

    PubMed Central

    Liu, Yuanhong; Xiong, Zijun; Fu, Dongke; Li, Bo; Wei, Shuguang; Xu, Xun; Li, Shengbin; Yuan, Hui

    2016-01-01

    Whole-genome shotgun (WGS) sequencing has become a routine method in genome research over the past decade. However, the assembly of highly polymorphic regions in WGS projects remains a challenge, especially for large genomes. The traditional strategy, which employs BAC library construction, PCR screening and Sanger sequencing, is laborious and expensive, which hampers research on polymorphic genomic regions. As one of the most highly polymorphic regions, the major histocompatibility complex (MHC) plays a central role in the adaptive immunity of all jawed vertebrates. In this study, we introduced an efficient procedure based on recombination screening and short-reads assembly. With this procedure, we constructed a high quality 488-kb region of crested ibis MHC that consists of 3 superscaffolds and contains 50 genes. Our sequence showed comparable quality (97.29% identity) to traditional Sanger assembly, while the workload was reduced almost 7 times. Comparative study revealed distinctive features of crested ibis by exhibiting the COL11A2-BLA-BLB-BRD2 cluster and presenting both ADPRH and odorant receptor (OR) gene in the MHC region. Furthermore, the conservation of the BF-TAP1-TAP2 structure in crested ibis and other vertebrate lineages is interesting in light of the hypothesis that coevolution of functionally related genes in the primordial MHC is responsible for the appearance of the antigen presentation pathways at the birth of the adaptive immune system. PMID:27997612

  9. CUSHAW3: Sensitive and Accurate Base-Space and Color-Space Short-Read Alignment with Hybrid Seeding

    PubMed Central

    Liu, Yongchao; Popp, Bernt; Schmidt, Bertil

    2014-01-01

    The majority of next-generation sequencing short-reads can be properly aligned by leading aligners at high speed. However, the alignment quality can still be further improved, since usually not all reads can be correctly aligned to large genomes, such as the human genome, even for simulated data. Moreover, even slight improvements in this area are important but challenging, and usually require significantly more computational endeavor. In this paper, we present CUSHAW3, an open-source parallelized, sensitive and accurate short-read aligner for both base-space and color-space sequences. In this aligner, we have investigated a hybrid seeding approach to improve alignment quality, which incorporates three different seed types, i.e. maximal exact match seeds, exact-match k-mer seeds and variable-length seeds, into the alignment pipeline. Furthermore, three techniques: weighted seed-pairing heuristic, paired-end alignment pair ranking and read mate rescuing have been conceived to facilitate accurate paired-end alignment. For base-space alignment, we have compared CUSHAW3 to Novoalign, CUSHAW2, BWA-MEM, Bowtie2 and GEM, by aligning both simulated and real reads to the human genome. The results show that CUSHAW3 consistently outperforms CUSHAW2, BWA-MEM, Bowtie2 and GEM in terms of single-end and paired-end alignment. Furthermore, our aligner has demonstrated better paired-end alignment performance than Novoalign for short-reads with high error rates. For color-space alignment, CUSHAW3 is consistently one of the best aligners compared to SHRiMP2 and BFAST. The source code of CUSHAW3 and all simulated data are available at http://cushaw3.sourceforge.net. PMID:24466273
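
    One of the three seed types named above is the maximal exact match (MEM): an exact match between read and reference that cannot be extended in either direction. The sketch below is a naive quadratic-time illustration of that definition only. It is not CUSHAW3's implementation (production aligners typically rely on compressed index structures), and the function name and `min_len` parameter are hypothetical.

```python
def maximal_exact_matches(read, reference, min_len):
    """Return (read_start, ref_start, length) for every exact match that
    cannot be extended to the left or right and has length >= min_len."""
    mems = []
    for i in range(len(read)):
        for j in range(len(reference)):
            if read[i] != reference[j]:
                continue
            # If the preceding characters also match, this match is
            # left-extendable and is reported from its true start instead.
            if i > 0 and j > 0 and read[i - 1] == reference[j - 1]:
                continue
            # Extend to the right as far as the characters agree.
            length = 0
            while (i + length < len(read)
                   and j + length < len(reference)
                   and read[i + length] == reference[j + length]):
                length += 1
            if length >= min_len:
                mems.append((i, j, length))
    return mems


if __name__ == "__main__":
    print(maximal_exact_matches("ACGTTT", "GGACGTAA", min_len=3))
```

    A real aligner would then chain or rank such seeds before running a full alignment around them; that step is omitted here.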

  10. Deep homology in the age of next-generation sequencing.

    PubMed

    Tschopp, Patrick; Tabin, Clifford J

    2017-02-05

    The principle of homology is central to conceptualizing the comparative aspects of morphological evolution. The distinctions between homologous or non-homologous structures have become blurred, however, as modern evolutionary developmental biology (evo-devo) has shown that novel features often result from modification of pre-existing developmental modules, rather than arising completely de novo. With this realization in mind, the term 'deep homology' was coined, in recognition of the remarkably conserved gene expression during the development of certain animal structures that would not be considered homologous by previous strict definitions. At its core, it can help to formulate an understanding of deeper layers of ontogenetic conservation for anatomical features that lack any clear phylogenetic continuity. Here, we review deep homology and related concepts in the context of a gene expression-based homology discussion. We then focus on how these conceptual frameworks have profited from the recent rise of high-throughput next-generation sequencing. These techniques have greatly expanded the range of organisms amenable to such studies. Moreover, they helped to elevate the traditional gene-by-gene comparison to a transcriptome-wide level. We will end with an outlook on the next challenges in the field and how technological advances might provide exciting new strategies to tackle these questions. This article is part of the themed issue 'Evo-devo in the genomics era, and the origins of morphological diversity'. © 2016 The Author(s).

  11. Clinical actionability enhanced through deep targeted sequencing of solid tumors

    PubMed Central

    Chen, Ken; Meric-Bernstam, Funda; Zhao, Hao; Zhang, Qingxiu; Ezzeddine, Nader; Tang, Lin-ya; Qi, Yuan; Mao, Yong; Chen, Tenghui; Chong, Zechen; Zhou, Wanding; Zheng, Xiaofeng; Johnson, Amber; Aldape, Kenneth D.; Routbort, Mark J.; Luthra, Rajyalakshmi; Kopetz, Scott; Davies, Michael A.; de Groot, John; Moulder, Stacy; Vinod, Ravi; Farhangfar, Carol J.; Shaw, Kenna Mills; Mendelsohn, John; Mills, Gordon B.; Eterovic, Agda Karina

    2015-01-01

    Background: Further advances of targeted cancer therapy require comprehensive in-depth profiling of somatic mutations that are present in subpopulations of tumor cells in a clinical tumor sample. However, it is unclear to what extent such intra-tumor heterogeneity is present and whether it may affect clinical decision making. To unravel this challenge, we established a deep targeted sequencing platform to identify potentially actionable DNA alterations in tumor samples. Methods: We assayed 515 FFPE tumor samples and matched germline (475 patients) from 11 disease sites by capturing and sequencing all the exons in 201 cancer related genes. Mutations, indels and copy number data were reported. Results: We obtained a 1000-fold average sequencing depth and identified 4794 non-synonymous mutations in the samples analyzed, of which 15.2% were present at less than 10% allele frequency. Most of these low level mutations occurred at known oncogenic hotspots and are likely functional. Identifying low level mutations improved identification of mutations in actionable genes in 118 (24.84%) patients, among which 47 (9.8%) would otherwise be unactionable. In addition, acquiring ultra-high depth also ensured a low false discovery rate (less than 2.2%) from FFPE samples. Conclusion: Our results were as accurate as a commercially available CLIA-compliant hotspot panel, but allowed the detection of a higher number of mutations in actionable genes. Our study revealed the critical importance of acquiring and utilizing high depth in profiling clinical tumor samples and presented a very useful platform for implementing routine sequencing in a cancer care institution. PMID:25626406
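
    To make the "less than 10% allele frequency" criterion concrete: at the reported 1000-fold average depth, such a mutation is supported by fewer than about 100 reads at a site of average coverage. The following is a minimal sketch of that calculation. The function names and the default threshold argument are illustrative assumptions, not part of the authors' platform.

```python
def variant_allele_frequency(alt_reads, total_reads):
    """Fraction of reads covering a site that support the variant allele."""
    if total_reads <= 0:
        raise ValueError("total_reads must be positive")
    if not 0 <= alt_reads <= total_reads:
        raise ValueError("alt_reads must lie between 0 and total_reads")
    return alt_reads / total_reads


def is_low_level(alt_reads, total_reads, threshold=0.10):
    """True when the variant falls below the allele-frequency threshold
    (10% here, matching the cut-off quoted in the abstract)."""
    return variant_allele_frequency(alt_reads, total_reads) < threshold


if __name__ == "__main__":
    # 50 supporting reads out of 1000 is a 5% variant: low level.
    print(variant_allele_frequency(50, 1000), is_low_level(50, 1000))
```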

  12. Target Enrichment Improves Mapping of Complex Traits by Deep Sequencing.

    PubMed

    Guo, Jianjun; Fan, Jue; Hauser, Bernard A; Rhee, Seung Y

    2015-11-03

    Complex traits such as crop performance and human diseases are controlled by multiple genetic loci, many of which have small effects and often go undetected by traditional quantitative trait locus (QTL) mapping. Recently, bulked segregant analysis with large F2 pools and genome-level markers (named extreme-QTL or X-QTL mapping) has been used to identify many QTL. To estimate parameters impacting QTL detection for X-QTL mapping, we simulated the effects of population size, marker density, and sequencing depth of markers on QTL detectability for traits with differing heritabilities. These simulations indicate that a high (>90%) chance of detecting QTL with at least 5% effect requires 5000× sequencing depth for a trait with heritability of 0.4-0.7. For most eukaryotic organisms, whole-genome sequencing at this depth is not economically feasible. Therefore, we tested and confirmed the feasibility of applying deep sequencing of target-enriched markers for X-QTL mapping. We used two traits in Arabidopsis thaliana with different heritabilities: seed size (H(2) = 0.61) and seedling greening in response to salt (H(2) = 0.94). We used a modified G test to identify QTL regions and developed a model-based statistical framework to resolve individual peaks by incorporating recombination rates. Multiple QTL were identified for both traits, including previously undiscovered QTL. We call our method target-enriched X-QTL (TEX-QTL) mapping; this mapping approach is not limited by the genome size or the availability of recombinant inbred populations and should be applicable to many organisms and traits.
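
    The abstract mentions a modified G test but does not describe the modification. For orientation only, here is a sketch of the standard (unmodified) G statistic, G = 2 Σ O ln(O/E), applied to a table of allele read counts in two bulked pools. The table layout and function name are assumptions made for illustration.

```python
import math


def g_statistic(table):
    """Standard G statistic for a contingency table given as a list of rows.

    Expected counts come from the row and column totals under independence;
    cells with an observed count of zero contribute nothing to the sum.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    if grand_total == 0:
        raise ValueError("table must contain at least one observation")
    g = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            if observed > 0:
                expected = row_totals[i] * col_totals[j] / grand_total
                g += observed * math.log(observed / expected)
    return 2.0 * g


if __name__ == "__main__":
    # Rows: high and low phenotype pools; columns: reads for allele A and B.
    print(g_statistic([[70, 30], [30, 70]]))
```

    A marker with identical allele ratios in both pools gives G = 0, while a skew between pools, as expected near a QTL, gives a larger value.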

  13. Localized suffix array and its application to genome mapping problems for paired-end short reads.

    PubMed

    Kimura, Kouichi; Koike, Asako

    2009-10-01

    We introduce a new data structure, a localized suffix array, based on which occurrence information is dynamically represented as the combination of global positional information and local lexicographic order information in text search applications. For the search of a pair of words within a given distance, many candidate positions that share a coarse-grained global position can be compactly represented in terms of local lexicographic orders as in the conventional suffix array, and they can be simultaneously examined for violation of the distance constraint at the coarse-grained resolution. Trade-off between the positional and lexicographical information is progressively shifted towards finer positional resolution, and the distance constraint is reexamined accordingly. Thus the paired search can be efficiently performed even if there are a large number of occurrences for each word. The localized suffix array itself is in fact a reordering of bits inside the conventional suffix array, and their memory requirements are essentially the same. We demonstrate an application to genome mapping problems for paired-end short reads generated by new-generation DNA sequencers. When paired reads are highly repetitive, it is time-consuming to naïvely calculate, sort, and compare all of the coordinates. For human genome re-sequencing data with 36-base-pair reads, more than 10 times speedups over the naïve method were observed in almost half of the cases where the sums of redundancies (number of individual occurrences) of paired reads were greater than 2,000.
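
    For contrast with the localized suffix array, the naïve baseline mentioned in the abstract (calculate, sort, and compare all coordinates) can be sketched with a conventional suffix array. This toy version builds the array by sorting suffixes directly, which is only practical for short strings. All names are illustrative and none of this reproduces the authors' data structure.

```python
import bisect


def build_suffix_array(text):
    """Start positions of all suffixes of text in lexicographic order."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def occurrences(text, sa, word):
    """Sorted start positions of word, located by binary search on sa."""
    m = len(word)
    lo, hi = 0, len(sa)
    while lo < hi:  # first suffix whose m-character prefix is >= word
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < word:
            lo = mid + 1
        else:
            hi = mid
    first = lo
    hi = len(sa)
    while lo < hi:  # first suffix whose m-character prefix is > word
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= word:
            lo = mid + 1
        else:
            hi = mid
    return sorted(sa[first:lo])


def paired_hits(text, sa, left, right, max_distance):
    """All (left_pos, right_pos) with left_pos < right_pos and the two
    start positions at most max_distance apart."""
    left_pos = occurrences(text, sa, left)
    right_pos = occurrences(text, sa, right)
    hits = []
    for p in left_pos:
        i = bisect.bisect_right(right_pos, p)
        while i < len(right_pos) and right_pos[i] - p <= max_distance:
            hits.append((p, right_pos[i]))
            i += 1
    return hits


if __name__ == "__main__":
    genome = "ACGTTTACGTAAACGG"
    sa = build_suffix_array(genome)
    print(paired_hits(genome, sa, "ACG", "GG", max_distance=5))
```

    The cost of this baseline grows with the number of occurrences of each word, which is the case the abstract identifies as time-consuming for highly repetitive reads.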

  14. Assemblathon 1: A competitive assessment of de novo short read assembly methods

    PubMed Central

    Earl, Dent; Bradnam, Keith; St. John, John; Darling, Aaron; Lin, Dawei; Fass, Joseph; Yu, Hung On Ken; Buffalo, Vince; Zerbino, Daniel R.; Diekhans, Mark; Nguyen, Ngan; Ariyaratne, Pramila Nuwantha; Sung, Wing-Kin; Ning, Zemin; Haimel, Matthias; Simpson, Jared T.; Fonseca, Nuno A.; Birol, İnanç; Docking, T. Roderick; Ho, Isaac Y.; Rokhsar, Daniel S.; Chikhi, Rayan; Lavenier, Dominique; Chapuis, Guillaume; Naquin, Delphine; Maillet, Nicolas; Schatz, Michael C.; Kelley, David R.; Phillippy, Adam M.; Koren, Sergey; Yang, Shiaw-Pyng; Wu, Wei; Chou, Wen-Chi; Srivastava, Anuj; Shaw, Timothy I.; Ruby, J. Graham; Skewes-Cox, Peter; Betegon, Miguel; Dimon, Michelle T.; Solovyev, Victor; Seledtsov, Igor; Kosarev, Petr; Vorobyev, Denis; Ramirez-Gonzalez, Ricardo; Leggett, Richard; MacLean, Dan; Xia, Fangfang; Luo, Ruibang; Li, Zhenyu; Xie, Yinlong; Liu, Binghang; Gnerre, Sante; MacCallum, Iain; Przybylski, Dariusz; Ribeiro, Filipe J.; Yin, Shuangye; Sharpe, Ted; Hall, Giles; Kersey, Paul J.; Durbin, Richard; Jackman, Shaun D.; Chapman, Jarrod A.; Huang, Xiaoqiu; DeRisi, Joseph L.; Caccamo, Mario; Li, Yingrui; Jaffe, David B.; Green, Richard E.; Haussler, David; Korf, Ian; Paten, Benedict

    2011-01-01

    Low-cost short read sequencing technology has revolutionized genomics, though it is only just becoming practical for the high-quality de novo assembly of a novel large genome. We describe the Assemblathon 1 competition, which aimed to comprehensively assess the state of the art in de novo assembly methods when applied to current sequencing technologies. In a collaborative effort, teams were asked to assemble a simulated Illumina HiSeq data set of an unknown, simulated diploid genome. A total of 41 assemblies from 17 different groups were received. Novel haplotype aware assessments of coverage, contiguity, structure, base calling, and copy number were made. We establish that within this benchmark: (1) It is possible to assemble the genome to a high level of coverage and accuracy, and that (2) large differences exist between the assemblies, suggesting room for further improvements in current methods. The simulated benchmark, including the correct answer, the assemblies, and the code that was used to evaluate the assemblies is now public and freely available from http://www.assemblathon.org/. PMID:21926179

  15. Error analysis of deep sequencing of phage libraries: peptides censored in sequencing.

    PubMed

    Matochko, Wadim L; Derda, Ratmir

    2013-01-01

    Next-generation sequencing techniques empower selection of ligands from phage-display libraries because they can detect low abundant clones and quantify changes in the copy numbers of clones without excessive selection rounds. Identification of errors in deep sequencing data is the most critical step in this process because these techniques have error rates >1%. Mechanisms that yield errors in Illumina and other techniques have been proposed, but no reports to date describe error analysis in phage libraries. Our paper focuses on error analysis of 7-mer peptide libraries sequenced by the Illumina method. Low theoretical complexity of this phage library, as compared to complexity of long genetic reads and genomes, allowed us to describe this library using a convenient linear vector and operator framework. We describe a phage library as an N × 1 frequency vector n = ||n_i||, where n_i is the copy number of the ith sequence and N is the theoretical diversity, that is, the total number of all possible sequences. Any manipulation to the library is an operator acting on n. Selection, amplification, or sequencing could be described as a product of an N × N matrix and a stochastic sampling operator (Sa). The latter is a random diagonal matrix that describes sampling of a library. In this paper, we focus on the properties of Sa and use them to define the sequencing operator (Seq). Sequencing without any bias and errors is Seq = Sa I_N, where I_N is an N × N unity matrix. Any bias in sequencing changes I_N to a nonunity matrix. We identified a diagonal censorship matrix (CEN), which describes elimination or statistically significant downsampling of specific reads during the sequencing process.

  16. Error Analysis of Deep Sequencing of Phage Libraries: Peptides Censored in Sequencing

    PubMed Central

    Matochko, Wadim L.; Derda, Ratmir

    2013-01-01

    Next-generation sequencing techniques empower selection of ligands from phage-display libraries because they can detect low abundant clones and quantify changes in the copy numbers of clones without excessive selection rounds. Identification of errors in deep sequencing data is the most critical step in this process because these techniques have error rates >1%. Mechanisms that yield errors in Illumina and other techniques have been proposed, but no reports to date describe error analysis in phage libraries. Our paper focuses on error analysis of 7-mer peptide libraries sequenced by the Illumina method. Low theoretical complexity of this phage library, as compared to complexity of long genetic reads and genomes, allowed us to describe this library using a convenient linear vector and operator framework. We describe a phage library as an N × 1 frequency vector n = ||n_i||, where n_i is the copy number of the ith sequence and N is the theoretical diversity, that is, the total number of all possible sequences. Any manipulation to the library is an operator acting on n. Selection, amplification, or sequencing could be described as a product of an N × N matrix and a stochastic sampling operator (Sa). The latter is a random diagonal matrix that describes sampling of a library. In this paper, we focus on the properties of Sa and use them to define the sequencing operator (Seq). Sequencing without any bias and errors is Seq = Sa I_N, where I_N is an N × N unity matrix. Any bias in sequencing changes I_N to a nonunity matrix. We identified a diagonal censorship matrix (CEN), which describes elimination or statistically significant downsampling of specific reads during the sequencing process. PMID:24416071
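
    The operator framework above can be illustrated with a toy simulation. Because every matrix involved is diagonal, each is stored here as a list of its diagonal entries. Two assumptions go beyond the abstract and are made only for this sketch: the sampling operator Sa is modelled as independent retention of each copy with probability `keep_prob`, and a biased run is written as Sa applied after CEN. Neither is claimed to be the authors' formulation.

```python
import random


def sampling_operator(n, keep_prob, rng):
    """Diagonal of a hypothetical Sa: for each clone, the fraction of its
    copies retained when every copy is kept with probability keep_prob."""
    diag = []
    for copies in n:
        kept = sum(1 for _ in range(copies) if rng.random() < keep_prob)
        diag.append(kept / copies if copies else 0.0)
    return diag


def apply_diagonal(diag, vec):
    """Multiply a diagonal matrix (given by its diagonal) by a count vector."""
    return [round(d * v) for d, v in zip(diag, vec)]


def sequence(n, keep_prob, rng, cen=None):
    """Toy sequencing operator acting on the frequency vector n.

    cen=None stands for the identity I_N (no bias); otherwise cen is the
    diagonal of an assumed censorship matrix with entries between 0 and 1.
    """
    if cen is None:
        cen = [1.0] * len(n)
    censored = apply_diagonal(cen, n)
    sa = sampling_operator(censored, keep_prob, rng)
    return apply_diagonal(sa, censored)


if __name__ == "__main__":
    rng = random.Random(0)
    library = [100, 50, 10]  # copy numbers n_i for N = 3 sequences
    print(sequence(library, 0.5, rng))
    print(sequence(library, 0.5, rng, cen=[1.0, 0.0, 1.0]))
```

    With `keep_prob=1.0` and no censorship the library is returned unchanged, which corresponds to the unbiased, error-free case; a zero entry in `cen` removes that sequence from the output regardless of its abundance.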

  17. DeepBase: annotation and discovery of microRNAs and other noncoding RNAs from deep-sequencing data.

    PubMed

    Yang, Jian-Hua; Qu, Liang-Hu

    2012-01-01

    Recent advances in high-throughput deep-sequencing technology have produced large numbers of short and long RNA sequences and enabled the detection and profiling of known and novel microRNAs (miRNAs) and other noncoding RNAs (ncRNAs) at unprecedented sensitivity and depth. In this chapter, we describe the use of deepBase, a database that we have developed to integrate all public deep-sequencing data and to facilitate the comprehensive annotation and discovery of miRNAs and other ncRNAs from these data. deepBase provides an integrative, interactive, and versatile web graphical interface to evaluate miRBase-annotated miRNA genes and other known ncRNAs, explores the expression patterns of miRNAs and other ncRNAs, and discovers novel miRNAs and other ncRNAs from deep-sequencing data. deepBase also provides a deepView genome browser to comparatively analyze these data at multiple levels. deepBase is available at http://deepbase.sysu.edu.cn/.

  18. Measuring Cation Dependent DNA Polymerase Fidelity Landscapes by Deep Sequencing

    PubMed Central

    Kording, Konrad; Schmidt, Daniel; Martin-Alarcon, Daniel; Tyo, Keith; Boyden, Edward S.; Church, George

    2012-01-01

    High-throughput recording of signals embedded within inaccessible micro-environments is a technological challenge. The ideal recording device would be a nanoscale machine capable of quantitatively transducing a wide range of variables into a molecular recording medium suitable for long-term storage and facile readout in the form of digital data. We have recently proposed such a device, in which cation concentrations modulate the misincorporation rate of a DNA polymerase (DNAP) on a known template, allowing DNA sequences to encode information about the local cation concentration. In this work we quantify the cation sensitivity of DNAP misincorporation rates, making possible the indirect readout of cation concentration by DNA sequencing. Using multiplexed deep sequencing, we quantify the misincorporation properties of two DNA polymerases – Dpo4 and Klenow exo− – obtaining the probability and base selectivity of misincorporation at all positions within the template. We find that Dpo4 acts as a DNA recording device for Mn2+ with a misincorporation rate gain of ∼2%/mM. This modulation of misincorporation rate is selective to the template base: the probability of misincorporation on template T by Dpo4 increases >50-fold over the range tested, while the other template bases are affected less strongly. Furthermore, cation concentrations act as scaling factors for misincorporation: on a given template base, Mn2+ and Mg2+ change the overall misincorporation rate but do not alter the relative frequencies of incoming misincorporated nucleotides. Characterization of the ion dependence of DNAP misincorporation serves as the first step towards repurposing it as a molecular recording device. PMID:22928047

  19. ECHO: a reference-free short-read error correction algorithm.

    PubMed

    Kao, Wei-Chun; Chan, Andrew H; Song, Yun S

    2011-07-01

    Developing accurate, scalable algorithms to improve data quality is an important computational challenge associated with recent advances in high-throughput sequencing technology. In this study, a novel error-correction algorithm, called ECHO, is introduced for correcting base-call errors in short-reads, without the need of a reference genome. Unlike most previous methods, ECHO does not require the user to specify parameters of which optimal values are typically unknown a priori. ECHO automatically sets the parameters in the assumed model and estimates error characteristics specific to each sequencing run, while maintaining a running time that is within the range of practical use. ECHO is based on a probabilistic model and is able to assign a quality score to each corrected base. Furthermore, it explicitly models heterozygosity in diploid genomes and provides a reference-free method for detecting bases that originated from heterozygous sites. On both real and simulated data, ECHO is able to improve the accuracy of previous error-correction methods by several folds to an order of magnitude, depending on the sequence coverage depth and the position in the read. The improvement is most pronounced toward the end of the read, where previous methods become noticeably less effective. Using a whole-genome yeast data set, it is demonstrated here that ECHO is capable of coping with nonuniform coverage. Also, it is shown that using ECHO to perform error correction as a preprocessing step considerably facilitates de novo assembly, particularly in the case of low-to-moderate sequence coverage depth.

  20. Parallel and Scalable Short-Read Alignment on Multi-Core Clusters Using UPC++

    PubMed Central

    González-Domínguez, Jorge; Liu, Yongchao; Schmidt, Bertil

    2016-01-01

    The growth of next-generation sequencing (NGS) datasets poses a challenge to the alignment of reads to reference genomes in terms of alignment quality and execution speed. Some available aligners have been shown to obtain high quality mappings at the expense of long execution times. Finding fast yet accurate software solutions is of high importance to research, since availability and size of NGS datasets continue to increase. In this work we present an efficient parallelization approach for NGS short-read alignment on multi-core clusters. Our approach takes advantage of a distributed shared memory programming model based on the new UPC++ language. Experimental results using the CUSHAW3 aligner show that our implementation based on dynamic scheduling obtains good scalability on multi-core clusters. Through our evaluation, we are able to complete the single-end and paired-end alignments of 246 million reads of length 150 base-pairs in 11.54 and 16.64 minutes, respectively, using 32 nodes with four AMD Opteron 6272 16-core CPUs per node. In contrast, the multi-threaded original tool needs 2.77 and 5.54 hours to perform the same alignments on the 64 cores of one node. The source code of our parallel implementation is publicly available at the CUSHAW3 homepage (http://cushaw3.sourceforge.net). PMID:26731399
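
    The timings quoted above imply speedups that the abstract does not state explicitly. The short calculation below derives them, together with a parallel efficiency relative to the single 64-core node. These derived figures are arithmetic on the reported numbers, not results from the paper.

```python
def speedup(baseline_hours, parallel_minutes):
    """Ratio of the single-node runtime to the cluster runtime."""
    return baseline_hours * 60.0 / parallel_minutes


NODES = 32  # cluster size reported in the abstract

if __name__ == "__main__":
    runs = [("single-end", 2.77, 11.54), ("paired-end", 5.54, 16.64)]
    for label, base_hours, cluster_minutes in runs:
        s = speedup(base_hours, cluster_minutes)
        # Efficiency: achieved speedup divided by the 32-fold node increase.
        print(f"{label}: speedup {s:.1f}x, efficiency {s / NODES:.0%}")
```

    This works out to roughly 14.4-fold for single-end and 20.0-fold for paired-end alignment when going from 1 node to 32.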

  1. Parallel and Scalable Short-Read Alignment on Multi-Core Clusters Using UPC++.

    PubMed

    González-Domínguez, Jorge; Liu, Yongchao; Schmidt, Bertil

    2016-01-01

    The growth of next-generation sequencing (NGS) datasets poses a challenge to the alignment of reads to reference genomes in terms of alignment quality and execution speed. Some available aligners have been shown to obtain high quality mappings at the expense of long execution times. Finding fast yet accurate software solutions is of high importance to research, since availability and size of NGS datasets continue to increase. In this work we present an efficient parallelization approach for NGS short-read alignment on multi-core clusters. Our approach takes advantage of a distributed shared memory programming model based on the new UPC++ language. Experimental results using the CUSHAW3 aligner show that our implementation based on dynamic scheduling obtains good scalability on multi-core clusters. Through our evaluation, we are able to complete the single-end and paired-end alignments of 246 million reads of length 150 base-pairs in 11.54 and 16.64 minutes, respectively, using 32 nodes with four AMD Opteron 6272 16-core CPUs per node. In contrast, the multi-threaded original tool needs 2.77 and 5.54 hours to perform the same alignments on the 64 cores of one node. The source code of our parallel implementation is publicly available at the CUSHAW3 homepage (http://cushaw3.sourceforge.net).

  2. Ultra-deep sequencing of foraminiferal microbarcodes unveils hidden richness of early monothalamous lineages in deep-sea sediments.

    PubMed

    Lecroq, Béatrice; Lejzerowicz, Franck; Bachar, Dipankar; Christen, Richard; Esling, Philippe; Baerlocher, Loïc; Østerås, Magne; Farinelli, Laurent; Pawlowski, Jan

    2011-08-09

    Deep-sea floors represent one of the largest and most complex ecosystems on Earth but remain essentially unexplored. The vastness and remoteness of this ecosystem make deep-sea sampling difficult, hampering traditional taxonomic observations and diversity assessment. This problem is particularly true in the case of the deep-sea meiofauna, which largely comprises small-sized, fragile, and difficult-to-identify metazoans and protists. Here, we introduce an ultra-deep sequencing-based metagenetic approach to examine the richness of benthic foraminifera, a principal component of deep-sea meiofauna. We used Illumina sequencing technology to assess foraminiferal richness in 31 unsieved deep-sea sediment samples from five distinct oceanic regions. We sequenced an extremely short fragment (36 bases) of the small subunit ribosomal DNA hypervariable region 37f, which has been shown to accurately distinguish foraminiferal species. In total, we obtained 495,978 unique sequences that were grouped into 1,643 operational taxonomic units, of which about half (841) could be reliably assigned to foraminifera. The vast majority of the operational taxonomic units (nearly 90%) were either assigned to early (ancient) lineages of soft-walled, single-chambered (monothalamous) foraminifera or remained undetermined and yet possibly belong to unknown early lineages. Contrasting with the classical view of multichambered taxa dominating foraminiferal assemblages, our work reflects an unexpected diversity of monothalamous lineages that are as yet unknown using conventional micropaleontological observations. Although we can only speculate about their morphology, the immense richness of deep-sea phylotypes revealed by this study suggests that ultra-deep sequencing can improve understanding of deep-sea benthic diversity considered until now as unknowable based on a traditional taxonomic approach.

  3. Estimating Copy Number and Allelic Variation at the Immunoglobulin Heavy Chain Locus Using Short Reads.

    PubMed

    Luo, Shishi; Yu, Jane A; Song, Yun S

    2016-09-01

    The study of genomic regions that contain gene copies and structural variation is a major challenge in modern genomics. Unlike variation involving single nucleotide changes, data on the variation of copy number is difficult to collect and few tools exist for analyzing the variation between individuals. The immunoglobulin heavy variable (IGHV) locus, which plays an integral role in the adaptive immune response, is an example of a complex genomic region that varies in gene copy number. Lack of standard methods to genotype this region prevents it from being included in association studies and is holding back the growing field of antibody repertoire analysis. Here we develop a method that takes short reads from high-throughput sequencing and outputs a genetic profile of the IGHV locus with the read coverage depth and a putative nucleotide sequence for each operationally defined gene cluster. Our operationally defined gene clusters aim to address a major challenge in studying the IGHV locus: the high sequence similarity between gene segments in different genomic locations. Tests on simulated data demonstrate that our approach can accurately determine the presence or absence of a gene cluster from reads as short as 70 bp. More detailed resolution on the copy number of gene clusters can be obtained from read coverage depth using longer reads (e.g., ≥ 100 bp). Detail at the nucleotide resolution of single copy genes (genes present in one copy per haplotype) can be determined with 250 bp reads. For IGHV genes with more than one copy, accurate nucleotide-resolution reconstruction is currently beyond the means of our approach. When applied to a family of European ancestry, our pipeline outputs genotypes that are consistent with the family pedigree, confirms existing multigene variants and suggests new copy number variants. This study paves the way for analyzing population-level patterns of variation in IGHV gene clusters in larger diverse datasets and for quantitatively

  4. Key roles for freshwater Actinobacteria revealed by deep metagenomic sequencing.

    PubMed

    Ghai, Rohit; Mizuno, Carolina Megumi; Picazo, Antonio; Camacho, Antonio; Rodriguez-Valera, Francisco

    2014-12-01

    Freshwater ecosystems are critical but fragile environments directly affecting society and its welfare. However, our understanding of genuinely freshwater microbial communities, constrained by our capacity to manipulate their prokaryotic participants in axenic cultures, remains very rudimentary. Even the most abundant components, freshwater Actinobacteria, remain largely unknown. Here, applying deep metagenomic sequencing to the microbial community of a freshwater reservoir, we were able to circumvent this traditional bottleneck and reconstruct de novo seven distinct streamlined actinobacterial genomes. These genomes represent three new groups of photoheterotrophic, planktonic Actinobacteria. We describe for the first time genomes of two novel clades, acMicro (Micrococcineae, related to Luna2) and acAMD (Actinomycetales, related to acTH1). In addition, an aggregate of contigs belonged to a new branch of the Acidimicrobiales. All are estimated to have small genomes (approximately 1.2 Mb), and their GC content varied from 40 to 61%. One of the Micrococcineae genomes encodes a proteorhodopsin, a rhodopsin type reported for the first time in Actinobacteria. The remarkable potential capacity of some of these genomes to transform recalcitrant plant detrital material, particularly lignin-derived compounds, suggests close linkages between the terrestrial and aquatic realms. Moreover, abundances of Actinobacteria correlate inversely with those of Cyanobacteria, which are responsible for prolonged and frequently irretrievable damage to freshwater ecosystems. This suggests that they might serve as sentinels of impending ecological catastrophes.

  5. Mapping Billions of Short Reads to a Reference Genome.

    PubMed

    Hung, Jui-Hung; Weng, Zhiping

    2017-01-03

    Rapid development and commercialization of instruments that can accurately, rapidly, and cheaply sequence billions of DNA bases is revolutionizing molecular biology and medicine. Because a reference genome is usually available, the first bioinformatics challenge presented by the new generation of high-throughput sequencers is the genome mapping problem, where each read is mapped to a reference genome to reveal its location(s). An introduction to mapping algorithms, as well as factors that influence their results, is provided here. © 2017 Cold Spring Harbor Laboratory Press.
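
    The mapping problem described above can be illustrated with a minimal seed-and-extend sketch. This is only one simple way to locate a read in a reference, not a method taken from the article: the function names (`build_kmer_index`, `map_read`), the seed length `k`, and the mismatch threshold are assumptions chosen for illustration, and indels are ignored.

```python
from collections import defaultdict


def build_kmer_index(reference, k):
    """Map every k-mer of the reference to the positions where it starts."""
    index = defaultdict(list)
    for i in range(len(reference) - k + 1):
        index[reference[i:i + k]].append(i)
    return index


def map_read(read, reference, index, k, max_mismatches=2):
    """Return sorted reference positions where the read fits with few mismatches.

    Hypothetical helper: every k-mer of the read is used as a seed, and each
    seed hit is verified by counting substitutions over the full read length.
    """
    hits = set()
    for offset in range(len(read) - k + 1):
        for pos in index.get(read[offset:offset + k], []):
            start = pos - offset
            # Skip candidates that would run off either end of the reference.
            if start < 0 or start + len(read) > len(reference):
                continue
            window = reference[start:start + len(read)]
            mismatches = sum(a != b for a, b in zip(read, window))
            if mismatches <= max_mismatches:
                hits.add(start)
    return sorted(hits)


reference = "ACGTACGTTAGCCGATAGGCTAACGTTAGC"
index = build_kmer_index(reference, k=4)
print(map_read("TAGCCGAT", reference, index, k=4))
```

    Production mappers replace the dictionary with compressed indexes and use gapped alignment for verification; the sketch only shows why a read can map to one or several locations.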

  6. Deep-Sea, Deep-Sequencing: Metabarcoding Extracellular DNA from Sediments of Marine Canyons

    PubMed Central

    Guardiola, Magdalena; Uriz, María Jesús; Taberlet, Pierre; Coissac, Eric; Wangensteen, Owen Simon; Turon, Xavier

    2015-01-01

    Marine sediments are home to one of the richest species pools on Earth, but logistics and a dearth of taxonomic work-force hinder the knowledge of their biodiversity. We characterized α- and β-diversity of deep-sea assemblages from submarine canyons in the western Mediterranean using environmental DNA metabarcoding. We used a new primer set targeting a short eukaryotic 18S sequence (ca. 110 bp). We applied a protocol designed to obtain extractions enriched in extracellular DNA from replicated sediment corers. With this strategy we captured information from DNA (local or deposited from the water column) that persists adsorbed to inorganic particles and buffered short-term spatial and temporal heterogeneity. We analysed replicated samples from 20 localities including two deep-sea canyons, one shallower canal, and two open slopes (depth range 100–2,250 m). We identified 1,629 MOTUs, among which the dominant groups were Metazoa (with representatives of 19 phyla), Alveolata, Stramenopiles, and Rhizaria. There was a marked small-scale heterogeneity as shown by differences in replicates within corers and within localities. The spatial variability between canyons was significant, as was the depth component in one of the canyons where it was tested. Likewise, the composition of the first layer (1 cm) of sediment was significantly different from deeper layers. We found that qualitative (presence-absence) and quantitative (relative number of reads) data showed consistent trends of differentiation between samples and geographic areas. The subset of exclusively benthic MOTUs showed similar patterns of β-diversity and community structure as the whole dataset. Separate analyses of the main metazoan phyla (in number of MOTUs) showed some differences in distribution attributable to different lifestyles. Our results highlight the differentiation that can be found even between geographically close assemblages, and set the ground for future monitoring and conservation efforts on

  7. Deep-Sea, Deep-Sequencing: Metabarcoding Extracellular DNA from Sediments of Marine Canyons.

    PubMed

    Guardiola, Magdalena; Uriz, María Jesús; Taberlet, Pierre; Coissac, Eric; Wangensteen, Owen Simon; Turon, Xavier

    2015-01-01

    Marine sediments are home to one of the richest species pools on Earth, but logistics and a dearth of taxonomic work-force hinder the knowledge of their biodiversity. We characterized α- and β-diversity of deep-sea assemblages from submarine canyons in the western Mediterranean using environmental DNA metabarcoding. We used a new primer set targeting a short eukaryotic 18S sequence (ca. 110 bp). We applied a protocol designed to obtain extractions enriched in extracellular DNA from replicated sediment corers. With this strategy we captured information from DNA (local or deposited from the water column) that persists adsorbed to inorganic particles and buffered short-term spatial and temporal heterogeneity. We analysed replicated samples from 20 localities including two deep-sea canyons, one shallower canal, and two open slopes (depth range 100-2,250 m). We identified 1,629 MOTUs, among which the dominant groups were Metazoa (with representatives of 19 phyla), Alveolata, Stramenopiles, and Rhizaria. There was a marked small-scale heterogeneity as shown by differences in replicates within corers and within localities. The spatial variability between canyons was significant, as was the depth component in one of the canyons where it was tested. Likewise, the composition of the first layer (1 cm) of sediment was significantly different from deeper layers. We found that qualitative (presence-absence) and quantitative (relative number of reads) data showed consistent trends of differentiation between samples and geographic areas. The subset of exclusively benthic MOTUs showed similar patterns of β-diversity and community structure as the whole dataset. Separate analyses of the main metazoan phyla (in number of MOTUs) showed some differences in distribution attributable to different lifestyles. Our results highlight the differentiation that can be found even between geographically close assemblages, and set the ground for future monitoring and conservation efforts on

  8. Optimizing de novo transcriptome assembly from short-read RNA-Seq data: a comparative study.

    PubMed

    Zhao, Qiong-Yi; Wang, Yi; Kong, Yi-Meng; Luo, Da; Li, Xuan; Hao, Pei

    2011-12-14

    With the fast advances in next-generation sequencing technology, high-throughput RNA sequencing has emerged as a powerful and cost-effective way for transcriptome study. De novo assembly of transcripts provides an important solution to transcriptome analysis for organisms with no reference genome. However, it was not well understood how the different variables affected assembly outcomes, and there was no consensus on how to approach an optimal solution by selecting a software tool and a suitable strategy based on the properties of RNA-Seq data. To reveal the performance of different programs for transcriptome assembly, this work analyzed some important factors, including k-mer values, genome complexity, coverage depth, directional reads, etc. Seven program conditions, four single k-mer assemblers (SK: SOAPdenovo, ABySS, Oases and Trinity) and three multiple k-mer methods (MK: SOAPdenovo-MK, trans-ABySS and Oases-MK) were tested. While small and large k-mer values performed better for reconstructing lowly and highly expressed transcripts, respectively, the MK strategy worked well for almost all ranges of expression quintiles. Among SK tools, Trinity performed well across various conditions but took the longest running time. Oases consumed the most memory whereas SOAPdenovo required the shortest runtime but performed poorly at reconstructing full-length CDS. ABySS showed a good balance between resource usage and quality of assemblies. Our work compared the performance of publicly available transcriptome assemblers, and analyzed important factors affecting de novo assembly. Some practical guidelines for transcript reconstruction from short-read RNA-Seq data were proposed. De novo assembly of the C. sinensis transcriptome was greatly improved using some optimized methods.

  9. A parallel algorithm for error correction in high-throughput short-read data on CUDA-enabled graphics hardware.

    PubMed

    Shi, Haixiang; Schmidt, Bertil; Liu, Weiguo; Müller-Wittig, Wolfgang

    2010-04-01

    Emerging DNA sequencing technologies open up exciting new opportunities for genome sequencing by generating read data with a massive throughput. However, produced reads are significantly shorter and more error-prone compared to the traditional Sanger shotgun sequencing method. This poses challenges for de novo DNA fragment assembly algorithms in terms of both accuracy (to deal with short, error-prone reads) and scalability (to deal with very large input data sets). In this article, we present a scalable parallel algorithm for correcting sequencing errors in high-throughput short-read data so that error-free reads can be available before DNA fragment assembly, which is of high importance to many graph-based short-read assembly tools. The algorithm is based on spectral alignment and uses the Compute Unified Device Architecture (CUDA) programming model. To gain efficiency we are taking advantage of the CUDA texture memory using a space-efficient Bloom filter data structure for spectrum membership queries. We have tested the runtime and accuracy of our algorithm using real and simulated Illumina data for different read lengths, error rates, input sizes, and algorithmic parameters. Using a CUDA-enabled mass-produced GPU (available for less than US$400 at any local computer outlet), this results in speedups of 12-84 times for the parallelized error correction, and speedups of 3-63 times for both sequential preprocessing and parallelized error correction compared to the publicly available Euler-SR program. Our implementation is freely available for download from http://cuda-ec.sourceforge.net .
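
    The idea of spectral alignment with a Bloom filter for spectrum membership queries can be sketched on the CPU as follows. This is a minimal illustration under stated assumptions, not the CUDA implementation described above: the class and function names, the SHA-256 based hashing, the filter size, and the single-substitution search are all hypothetical choices, and k-mer counting is done exactly with a `Counter` before the "solid" k-mers are stored in the filter.

```python
import hashlib
from collections import Counter


class BloomFilter:
    """Space-efficient set with possible false positives, no false negatives."""

    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        # Derive several bit positions from seeded SHA-256 digests (assumed scheme).
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, item):
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(item))


def kmers(seq, k):
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def build_spectrum(reads, k, min_count, num_bits=1 << 16, num_hashes=3):
    """Store every k-mer seen at least min_count times ("solid" k-mers)."""
    counts = Counter(km for read in reads for km in kmers(read, k))
    spectrum = BloomFilter(num_bits, num_hashes)
    for km, count in counts.items():
        if count >= min_count:
            spectrum.add(km)
    return spectrum


def correct_read(read, spectrum, k):
    """Return the read, or the first single-base substitution that makes
    every k-mer of the read a member of the spectrum."""
    def is_solid(seq):
        return all(km in spectrum for km in kmers(seq, k))

    if is_solid(read):
        return read
    for i in range(len(read)):
        for base in "ACGT":
            if base != read[i]:
                candidate = read[:i] + base + read[i + 1:]
                if is_solid(candidate):
                    return candidate
    return read  # left unchanged when no single substitution helps


reads = ["ACGTACGGTA"] * 5 + ["ACGTACCGTA"]
spectrum = build_spectrum(reads, k=4, min_count=2)
print(correct_read("ACGTACCGTA", spectrum, k=4))
```

    Because a Bloom filter can report false positives, an erroneous k-mer is occasionally accepted as solid; the filter size and number of hash functions trade memory against that risk.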

  10. Concurrent and Accurate Short Read Mapping on Multicore Processors.

    PubMed

    Martínez, Héctor; Tárraga, Joaquín; Medina, Ignacio; Barrachina, Sergio; Castillo, Maribel; Dopazo, Joaquín; Quintana-Ortí, Enrique S

    2015-01-01

    We introduce a parallel aligner with a work-flow organization for fast and accurate mapping of RNA sequences on servers equipped with multicore processors. Our software, HPG Aligner SA (an open-source application available at http://www.opencb.org), exploits a suffix array to rapidly map a large fraction of the RNA fragments (reads) and leverages the accuracy of the Smith-Waterman algorithm to deal with conflictive reads. The aligner is enhanced with a careful strategy to detect splice junctions based on an adaptive division of RNA reads into small segments (or seeds), which are then mapped onto a number of candidate alignment locations, providing crucial information for the successful alignment of the complete reads. Experimental results on a platform with Intel multicore technology report the parallel performance of HPG Aligner SA on RNA reads of 100-400 nucleotides, where it outperforms state-of-the-art aligners such as TopHat 2+Bowtie 2, MapSplice, and STAR in execution time/sensitivity.
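
    The Smith-Waterman step named above can be sketched as a short dynamic program that returns the best local alignment score. This is a generic textbook formulation with a linear gap penalty, not the aligner's own code; the scoring values (`match`, `mismatch`, `gap`) are assumptions chosen for illustration.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score between sequences a and b.

    Only two rows of the score matrix are kept, so memory is O(len(b)).
    """
    cols = len(b) + 1
    prev = [0] * cols
    best = 0
    for i in range(1, len(a) + 1):
        cur = [0] * cols
        for j in range(1, cols):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A local alignment may restart anywhere, hence the floor at 0.
            cur[j] = max(0, diag, prev[j] + gap, cur[j - 1] + gap)
            best = max(best, cur[j])
        prev = cur
    return best


print(smith_waterman("TTACGTTT", "GGACGGG"))
```

    The quadratic cost of this recurrence is why aligners typically reserve it for the reads that a faster index-based search cannot place.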

  11. Unified View of Backward Backtracking in Short Read Mapping

    NASA Astrophysics Data System (ADS)

    Mäkinen, Veli; Välimäki, Niko; Laaksonen, Antti; Katainen, Riku

    Mapping short DNA reads to the reference genome is the core task in recent high-throughput technologies used to study, e.g., protein-DNA interactions (ChIP-seq) and alternative splicing (RNA-seq). Several tools for the task (bowtie, bwa, SOAP2, TopHat) have been developed that exploit the Burrows-Wheeler transform and the backward backtracking technique on it to map the reads to their best approximate occurrences in the genome. These tools use different tailored mechanisms for small error levels to prune the search phase significantly. We propose a new pruning mechanism that can be seen as a generalization of the tailored mechanisms used so far. It uses a novel idea of storing all cyclic rotations of fixed-length substrings of the reference sequence with a compressed index that is able to exploit the repetitions created to level out the growth of the input set. For RNA-seq we propose a new method that combines dynamic programming with backtracking to map efficiently and correctly all reads that span two exons. The same mechanism can also be used for mapping mate-pair reads.
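
    Backward search on the Burrows-Wheeler transform, the building block that backtracking extends, can be sketched for the exact-match case as follows. This is a minimal, uncompressed illustration and not the pruning mechanism proposed above: the function names are hypothetical, the suffix array is built by naive sorting, and the occurrence table is stored in full. Approximate matching would additionally branch over alternative characters at each step.

```python
def build_fm_index(text):
    """Return (suffix array, C table, occurrence table) for text + '$'."""
    text += "$"  # sentinel, smaller than any DNA letter
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    bwt = "".join(text[i - 1] for i in sa)  # i == 0 wraps around to '$'
    alphabet = sorted(set(text))
    # C[c]: number of characters in the text that are smaller than c.
    C, total = {}, 0
    for c in alphabet:
        C[c] = total
        total += text.count(c)
    # occ[c][i]: number of occurrences of c in bwt[:i].
    occ = {c: [0] * (len(bwt) + 1) for c in alphabet}
    for i, ch in enumerate(bwt):
        for c in alphabet:
            occ[c][i + 1] = occ[c][i] + (ch == c)
    return sa, C, occ


def backward_search(pattern, sa, C, occ):
    """Return sorted start positions of exact occurrences of pattern."""
    lo, hi = 0, len(sa)  # current suffix-array interval [lo, hi)
    for ch in reversed(pattern):  # extend the match one character to the left
        if ch not in C:
            return []
        lo = C[ch] + occ[ch][lo]
        hi = C[ch] + occ[ch][hi]
        if lo >= hi:
            return []
    return sorted(sa[i] for i in range(lo, hi))


sa, C, occ = build_fm_index("ACGTACGTTAGC")
print(backward_search("ACG", sa, C, occ))
```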

  12. Predicting the sequence specificities of DNA- and RNA-binding proteins by deep learning.

    PubMed

    Alipanahi, Babak; Delong, Andrew; Weirauch, Matthew T; Frey, Brendan J

    2015-08-01

    Knowing the sequence specificities of DNA- and RNA-binding proteins is essential for developing models of the regulatory processes in biological systems and for identifying causal disease variants. Here we show that sequence specificities can be ascertained from experimental data with 'deep learning' techniques, which offer a scalable, flexible and unified computational approach for pattern discovery. Using a diverse array of experimental data and evaluation metrics, we find that deep learning outperforms other state-of-the-art methods, even when training on in vitro data and testing on in vivo data. We call this approach DeepBind and have built a stand-alone software tool that is fully automatic and handles millions of sequences per experiment. Specificities determined by DeepBind are readily visualized as a weighted ensemble of position weight matrices or as a 'mutation map' that indicates how variations affect binding within a specific sequence.

  13. Complete Genome Sequence of Bacteriophage Deep-Blue Infecting Emetic Bacillus cereus.

    PubMed

    Hock, Louise; Gillis, Annika; Mahillon, Jacques

    2016-06-16

    The Bacillus cereus emetic pathotype is responsible for important food-borne intoxications. Here, we describe the complete genome sequence of bacteriophage Deep-Blue, which is able to infect emetic strains of B. cereus. Deep-Blue is a 159-kb myophage of the Bastille-like group within the Spounavirinae.

  14. Complete Genome Sequence of Bacteriophage Deep-Blue Infecting Emetic Bacillus cereus

    PubMed Central

    Hock, Louise; Gillis, Annika

    2016-01-01

    The Bacillus cereus emetic pathotype is responsible for important food-borne intoxications. Here, we describe the complete genome sequence of bacteriophage Deep-Blue, which is able to infect emetic strains of B. cereus. Deep-Blue is a 159-kb myophage of the Bastille-like group within the Spounavirinae. PMID:27313285

  15. Virus Identification in Unknown Tropical Febrile Illness Cases Using Deep Sequencing

    PubMed Central

    Balmaseda, Angel; Harris, Eva; DeRisi, Joseph L.

    2012-01-01

    Dengue virus is an emerging infectious agent that infects an estimated 50–100 million people annually worldwide, yet current diagnostic practices cannot detect an etiologic pathogen in ∼40% of dengue-like illnesses. Metagenomic approaches to pathogen detection, such as viral microarrays and deep sequencing, are promising tools to address emerging and non-diagnosable disease challenges. In this study, we used the Virochip microarray and deep sequencing to characterize the spectrum of viruses present in human sera from 123 Nicaraguan patients presenting with dengue-like symptoms but testing negative for dengue virus. We utilized a barcoding strategy to simultaneously deep sequence multiple serum specimens, generating on average over 1 million reads per sample. We then implemented a stepwise bioinformatic filtering pipeline to remove the majority of human and low-quality sequences to improve the speed and accuracy of subsequent unbiased database searches. By deep sequencing, we were able to detect virus sequence in 37% (45/123) of previously negative cases. These included 13 cases with Human Herpesvirus 6 sequences. Other samples contained sequences with similarity to sequences from viruses in the Herpesviridae, Flaviviridae, Circoviridae, Anelloviridae, Asfarviridae, and Parvoviridae families. In some cases, the putative viral sequences were virtually identical to known viruses, and in others they diverged, suggesting that they may derive from novel viruses. These results demonstrate the utility of unbiased metagenomic approaches in the detection of known and divergent viruses in the study of tropical febrile illness. PMID:22347512

  16. Virus identification in unknown tropical febrile illness cases using deep sequencing.

    PubMed

    Yozwiak, Nathan L; Skewes-Cox, Peter; Stenglein, Mark D; Balmaseda, Angel; Harris, Eva; DeRisi, Joseph L

    2012-01-01

    Dengue virus is an emerging infectious agent that infects an estimated 50-100 million people annually worldwide, yet current diagnostic practices cannot detect an etiologic pathogen in ∼40% of dengue-like illnesses. Metagenomic approaches to pathogen detection, such as viral microarrays and deep sequencing, are promising tools to address emerging and non-diagnosable disease challenges. In this study, we used the Virochip microarray and deep sequencing to characterize the spectrum of viruses present in human sera from 123 Nicaraguan patients presenting with dengue-like symptoms but testing negative for dengue virus. We utilized a barcoding strategy to simultaneously deep sequence multiple serum specimens, generating on average over 1 million reads per sample. We then implemented a stepwise bioinformatic filtering pipeline to remove the majority of human and low-quality sequences to improve the speed and accuracy of subsequent unbiased database searches. By deep sequencing, we were able to detect virus sequence in 37% (45/123) of previously negative cases. These included 13 cases with Human Herpesvirus 6 sequences. Other samples contained sequences with similarity to sequences from viruses in the Herpesviridae, Flaviviridae, Circoviridae, Anelloviridae, Asfarviridae, and Parvoviridae families. In some cases, the putative viral sequences were virtually identical to known viruses, and in others they diverged, suggesting that they may derive from novel viruses. These results demonstrate the utility of unbiased metagenomic approaches in the detection of known and divergent viruses in the study of tropical febrile illness.

  17. Error rates, PCR recombination, and sampling depth in HIV-1 whole genome deep sequencing.

    PubMed

    Zanini, Fabio; Brodin, Johanna; Albert, Jan; Neher, Richard A

    2016-12-27

    Deep sequencing is a powerful and cost-effective tool to characterize the genetic diversity and evolution of virus populations. While modern sequencing instruments readily cover viral genomes many thousand fold and very rare variants can in principle be detected, sequencing errors, amplification biases, and other artifacts can limit sensitivity and complicate data interpretation. For this reason, the number of studies using whole genome deep sequencing to characterize viral quasi-species in clinical samples is still limited. We have previously undertaken a large scale whole genome deep sequencing study of HIV-1 populations. Here we discuss the challenges, error profiles, control experiments, and computational tests we developed to quantify the accuracy of variant frequency estimation.

  18. KRAS, BRAF, and TP53 deep sequencing for colorectal carcinoma patient diagnostics.

    PubMed

    Rechsteiner, Markus; von Teichman, Adriana; Rüschoff, Jan H; Fankhauser, Niklaus; Pestalozzi, Bernhard; Schraml, Peter; Weber, Achim; Wild, Peter; Zimmermann, Dieter; Moch, Holger

    2013-05-01

    In colorectal carcinoma, KRAS (alias Ki-ras) and BRAF mutations have emerged as predictors of resistance to anti-epidermal growth factor receptor antibody treatment and worse patient outcome, respectively. In this study, we aimed to establish a high-throughput deep sequencing workflow according to 454 pyrosequencing technology to cope with the increasing demand for sequence information at medical institutions. A cohort of 81 patients with known KRAS mutation status detected by Sanger sequencing was chosen for deep sequencing. The workflow allowed us to analyze seven amplicons (one BRAF, two KRAS, and four TP53 exons) of nine patients in parallel in one deep sequencing run. Target amplification and variant calling showed reproducible results with input DNA derived from FFPE tissue that ranged from 0.4 to 50 ng with the use of different targets and multiplex identifiers. Equimolar pooling of each amplicon in a deep sequencing run was necessary to counterbalance differences in patient tissue quality. Five BRAF and 49 TP53 mutations with functional consequences were detected. The lowest mutation frequency detected in a patient tumor population was 5% in TP53 exon 5. This low-frequency mutation was successfully verified in a second PCR and deep sequencing run. In summary, our workflow allows us to process 315 targets a week and provides the quality, flexibility, and speed needed to be integrated as standard procedure for mutational analysis in diagnostics.

  19. GateKeeper: A New Hardware Architecture for Accelerating Pre-Alignment in DNA Short Read Mapping.

    PubMed

    Alser, Mohammed; Hassan, Hasan; Xin, Hongyi; Ergin, Oguz; Mutlu, Onur; Alkan, Can

    2017-05-31

    High throughput DNA sequencing (HTS) technologies generate an excessive number of small DNA segments, called short reads, that cause significant computational burden. To analyze the entire genome, each of the billions of short reads must be mapped to a reference genome based on the similarity between a read and "candidate" locations in that reference genome. The similarity measurement, called alignment, formulated as an approximate string matching problem, is the computational bottleneck because: (1) it is implemented using quadratic-time dynamic programming algorithms, and (2) the majority of candidate locations in the reference genome do not align with a given read due to high dissimilarity. Calculating the alignment of such incorrect candidate locations consumes an overwhelming majority of a modern read mapper's execution time. Therefore, it is crucial to develop a fast and effective filter that can detect incorrect candidate locations and eliminate them before using computationally costly alignment operations. We propose GateKeeper, a new hardware accelerator that functions as a pre-alignment step that quickly filters out most incorrect candidate locations. GateKeeper is the first design to accelerate pre-alignment using Field-Programmable Gate Arrays (FPGAs), which can perform pre-alignment much faster than software. When implemented on a single FPGA chip, GateKeeper maintains high accuracy (on average >96%) while providing, on average, 90-fold and 130-fold speedup over the state-of-the-art software pre-alignment techniques, Adjacency Filter and Shifted Hamming Distance (SHD), respectively. The addition of GateKeeper as a pre-alignment step can reduce the verification time of the mrFAST mapper by a factor of 10. Availability: https://github.com/BilkentCompGen/GateKeeper. Contact: mohammedalser@bilkent.edu.tr, onur.mutlu@inf.ethz.ch, calkan@cs.bilkent.edu.tr. Supplementary data are available at Bioinformatics online.
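
    The role of a pre-alignment filter can be illustrated with a simplified software sketch in the spirit of shifted Hamming comparisons. It is not GateKeeper's FPGA design nor the published SHD algorithm: the function name and the rule used here are assumptions. A read position counts as unmatched only if it differs from the reference at every shift within the edit threshold; if more than `max_edits` positions are unmatched, the candidate cannot align within the threshold and is rejected without running dynamic programming.

```python
def passes_prefilter(read, ref_window, max_edits):
    """Return False only when the pair certainly needs more than max_edits edits.

    Hypothetical simplified filter: it may accept some bad candidates
    (which the full aligner then rejects) but never discards a good one.
    """
    unmatched = 0
    for i, base in enumerate(read):
        matched = False
        for shift in range(-max_edits, max_edits + 1):
            j = i + shift
            # Positions outside the window are treated permissively as matches,
            # so reads whose alignment extends past the window are not rejected.
            if j < 0 or j >= len(ref_window) or base == ref_window[j]:
                matched = True
                break
        if not matched:
            unmatched += 1
            if unmatched > max_edits:
                return False
    return True


print(passes_prefilter("ACGTAGCT", "ACGAGCTT", max_edits=1))
print(passes_prefilter("AAAAAAAA", "CCCCCCCC", max_edits=2))
```

    The checks are independent per position and per shift, which is the kind of structure that maps naturally onto bitwise operations and parallel hardware.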

  20. Sniper: improved SNP discovery by multiply mapping deep sequenced reads.

    PubMed

    Simola, Daniel F; Kim, Junhyong

    2011-06-20

    SNP (single nucleotide polymorphism) discovery using next-generation sequencing data remains difficult primarily because of redundant genomic regions, such as interspersed repetitive elements and paralogous genes, present in all eukaryotic genomes. To address this problem, we developed Sniper, a novel multi-locus Bayesian probabilistic model and a computationally efficient algorithm that explicitly incorporates sequence reads that map to multiple genomic loci. Our model fully accounts for sequencing error, template bias, and multi-locus SNP combinations, maintaining high sensitivity and specificity under a broad range of conditions. An implementation of Sniper is freely available at http://kim.bio.upenn.edu/software/sniper.shtml.

  1. DNA Methyltransferase Accessibility Protocol for Individual Templates by Deep Sequencing

    PubMed Central

    Darst, Russell P.; Nabilsi, Nancy H.; Pardo, Carolina E.; Riva, Alberto; Kladde, Michael P.

    2013-01-01

    A single-molecule probe of chromatin structure can uncover dynamic chromatin states and rare epigenetic variants of biological importance that bulk measures of chromatin structure miss. In bisulfite genomic sequencing, each sequenced clone records the methylation status of multiple sites on an individual molecule of DNA. An exogenous DNA methyltransferase can thus be used to image nucleosomes and other protein–DNA complexes. In this chapter, we describe the adaptation of this technique, termed Methylation Accessibility Protocol for individual templates, to modern high-throughput sequencing, which both simplifies the workflow and extends its utility. PMID:22929770

  2. HIV-1 quasispecies delineation by tag linkage deep sequencing.

    PubMed

    Wu, Nicholas C; De La Cruz, Justin; Al-Mawsawi, Laith Q; Olson, C Anders; Qi, Hangfei; Luan, Harding H; Nguyen, Nguyen; Du, Yushen; Le, Shuai; Wu, Ting-Ting; Li, Xinmin; Lewis, Martha J; Yang, Otto O; Sun, Ren

    2014-01-01

    Trade-offs between throughput, read length, and error rates in high-throughput sequencing limit certain applications such as monitoring viral quasispecies. Here, we describe a molecular-based tag linkage method that allows assemblage of short sequence reads into long DNA fragments. It enables haplotype phasing with high accuracy and sensitivity to interrogate individual viral sequences in a quasispecies. This approach is demonstrated to deduce ∼ 2000 unique 1.3 kb viral sequences from HIV-1 quasispecies in vivo and after passaging ex vivo with a detection limit of ∼ 0.005% to ∼ 0.001%. Reproducibility of the method is validated quantitatively and qualitatively by a technical replicate. This approach can improve monitoring of the genetic architecture and evolution dynamics in any quasispecies population.

  3. Deep sequencing of small RNAs in plants: applied bioinformatics.

    PubMed

    Studholme, David J

    2012-01-01

    Small RNAs, including microRNA and short-interfering RNAs, play important roles in plants. In recent years, developments in sequencing technology have enabled the large-scale discovery of sRNAs in various cells, tissues and developmental stages and in response to various stresses. This review describes the bioinformatics challenges to analysing these large datasets of short-RNA sequences and some of the solutions to those challenges.

  4. Deep sequencing of phage display libraries to support antibody discovery.

    PubMed

    Ravn, Ulla; Didelot, Gérard; Venet, Sophie; Ng, Kwok-Ting; Gueneau, Franck; Rousseau, François; Calloud, Sébastien; Kosco-Vilbois, Marie; Fischer, Nicolas

    2013-03-15

    The use of next generation sequencing (NGS) for the analysis of antibody sequences both in phage display libraries and during in vitro selection processes has become increasingly popular in the last few years. Here, our methods developed for DNA preparation, sequencing and data analysis are presented. A key parameter has also been to develop new software designed for high throughput antibody sequence analysis that is used in combination with publicly available tools. As an example of our methods, we provide data from the extensive analysis of five scFv libraries generated using different heavy chain CDR3 diversification strategies. The results not only confirm that the library designs were correct but also reveal differences in quality not easily identified by standard DNA sequencing approaches. The very large number of reads permits extensive sequence coverage after the selection process. Furthermore, as samples can be multiplexed, costs decrease and more information is gained per NGS run. Using examples of results obtained post phage display selections against two antigens, frequency and clustering analysis identified novel antibody fragments that were then shown to be specific for the target antigen. In summary, the methods described here demonstrate how NGS analysis enhances quality control of complex antibody libraries as well as facilitates the antibody discovery process.

  5. High-resolution microbial community reconstruction by integrating short reads from multiple 16S rRNA regions

    PubMed Central

    Amir, Amnon; Zeisel, Amit; Zuk, Or; Elgart, Michael; Stern, Shay; Shamir, Ohad; Turnbaugh, Peter J.; Soen, Yoav; Shental, Noam

    2013-01-01

    The emergence of massively parallel sequencing technology has revolutionized microbial profiling, allowing the unprecedented comparison of microbial diversity across time and space in a wide range of host-associated and environmental ecosystems. Although the high-throughput nature of such methods enables the detection of low-frequency bacteria, these advances come at the cost of sequencing read length, limiting the phylogenetic resolution possible by current methods. Here, we present a generic approach for integrating short reads from large genomic regions, thus enabling phylogenetic resolution far exceeding current methods. The approach is based on a mapping to a statistical model that is later solved as a constrained optimization problem. We demonstrate the utility of this method by analyzing human saliva and Drosophila samples, using Illumina single-end sequencing of a 750 bp amplicon of the 16S rRNA gene. Phylogenetic resolution is significantly extended while reducing the number of falsely detected bacteria, as compared with standard single-region Roche 454 Pyrosequencing. Our approach can be seamlessly applied to simultaneous sequencing of multiple genes providing a higher resolution view of the composition and activity of complex microbial communities. PMID:24214960

  6. Molecular Diagnosis of Actinomadura madurae Infection by 16S rRNA Deep Sequencing

    PubMed Central

    SenGupta, Dhruba J.; Hoogestraat, Daniel R.; Cummings, Lisa A.; Bryant, Bronwyn H.; Natividad, Catherine; Thielges, Stephanie; Monsaas, Peter W.; Chau, Mimosa; Barbee, Lindley A.; Rosenthal, Christopher; Cookson, Brad T.; Hoffman, Noah G.

    2013-01-01

    Next-generation DNA sequencing can be used to catalog individual organisms within complex, polymicrobial specimens. Here, we utilized deep sequencing of 16S rRNA to implicate Actinomadura madurae as the cause of mycetoma in a diabetic patient when culture and conventional molecular methods were overwhelmed by overgrowth of other organisms. PMID:24108607

  7. Molecular diagnosis of Actinomadura madurae infection by 16S rRNA deep sequencing.

    PubMed

    Salipante, Stephen J; Sengupta, Dhruba J; Hoogestraat, Daniel R; Cummings, Lisa A; Bryant, Bronwyn H; Natividad, Catherine; Thielges, Stephanie; Monsaas, Peter W; Chau, Mimosa; Barbee, Lindley A; Rosenthal, Christopher; Cookson, Brad T; Hoffman, Noah G

    2013-12-01

    Next-generation DNA sequencing can be used to catalog individual organisms within complex, polymicrobial specimens. Here, we utilized deep sequencing of 16S rRNA to implicate Actinomadura madurae as the cause of mycetoma in a diabetic patient when culture and conventional molecular methods were overwhelmed by overgrowth of other organisms.

  8. Deep sequencing approach for genetic stability evaluation of influenza A viruses.

    PubMed

    Bidzhieva, Bella; Zagorodnyaya, Tatiana; Karagiannis, Konstantinos; Simonyan, Vahan; Laassri, Majid; Chumakov, Konstantin

    2014-04-01

    Assessment of genetic stability of viruses could be used to monitor the manufacturing process of both live and inactivated viral vaccines. Until recently such studies were limited by the difficulty of detecting and quantifying mutations in heterogeneous viral populations. High-throughput sequencing technologies (deep sequencing) can generate massive amounts of genetic information and could be used to reveal and quantify mutations. Comparison of different approaches for deep sequencing of the complete influenza A genome was performed to determine the best way to detect and quantify mutants in attenuated influenza reassortant strain A/Brisbane/59/2007 (H1N1) and its passages in different cell substrates. Full-length amplicons of influenza A virus segments as well as multiple overlapping amplicons covering the entire viral genome were subjected to several ways of DNA library preparation followed by deep sequencing using Solexa (Illumina) and pyrosequencing (454 Life Sciences) technologies. Sequencing coverage (the number of times each nucleotide was determined) of mutational profiles generated after 454-pyrosequencing of individually synthesized overlapping amplicons was relatively low and insufficiently uniform. Amplification of the entire genome of influenza virus followed by its enzymatic fragmentation, library construction, and Illumina sequencing resulted in high and uniform sequencing coverage enabling sensitive quantitation of mutations. A new bioinformatic procedure was developed to improve the post-alignment quality control for deep-sequencing data analysis.
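
    The abstract defines sequencing coverage as the number of times each nucleotide was determined and judges library methods by how high and how uniform that coverage is. A minimal sketch of that bookkeeping follows. It assumes alignments are already available as 0-based, half-open (start, end) intervals; the function names and the choice of the coefficient of variation as a uniformity measure are illustrative and are not taken from the study.

```python
from statistics import mean, pstdev


def per_base_coverage(genome_length, alignments):
    """Return a list with the number of reads covering each position.

    alignments: iterable of (start, end) pairs, 0-based and half-open.
    Uses a difference array so the cost is O(reads + genome_length).
    """
    diff = [0] * (genome_length + 1)
    for start, end in alignments:
        start = max(0, start)
        end = min(genome_length, end)
        if start >= end:
            continue  # read lies outside the reference
        diff[start] += 1
        diff[end] -= 1

    coverage = []
    depth = 0
    for i in range(genome_length):
        depth += diff[i]
        coverage.append(depth)
    return coverage


def coverage_summary(coverage):
    """Summarise depth and uniformity (lower CV means more uniform)."""
    avg = mean(coverage)
    cv = pstdev(coverage) / avg if avg > 0 else float("inf")
    return {"mean": avg, "min": min(coverage), "cv": cv}


# Hypothetical toy input: four reads on a 10-nt reference.
reads = [(0, 6), (2, 8), (4, 10), (5, 10)]
cov = per_base_coverage(10, reads)
summary = coverage_summary(cov)
```

    In this toy case the depth rises toward the middle of the reference, and a low minimum or a high coefficient of variation would flag the kind of non-uniform coverage the abstract describes for the overlapping-amplicon libraries.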

  9. Protein sequences bound to mineral surfaces persist into deep time

    PubMed Central

    Demarchi, Beatrice; Hall, Shaun; Roncal-Herrero, Teresa; Freeman, Colin L; Woolley, Jos; Crisp, Molly K; Wilson, Julie; Fotakis, Anna; Fischer, Roman; Kessler, Benedikt M; Rakownikow Jersie-Christensen, Rosa; Olsen, Jesper V; Haile, James; Thomas, Jessica; Marean, Curtis W; Parkington, John; Presslee, Samantha; Lee-Thorp, Julia; Ditchfield, Peter; Hamilton, Jacqueline F; Ward, Martyn W; Wang, Chunting Michelle; Shaw, Marvin D; Harrison, Terry; Domínguez-Rodrigo, Manuel; MacPhee, Ross DE; Kwekason, Amandus; Ecker, Michaela; Kolska Horwitz, Liora; Chazan, Michael; Kröger, Roland; Thomas-Oates, Jane; Harding, John H; Cappellini, Enrico; Penkman, Kirsty; Collins, Matthew J

    2016-01-01

    Proteins persist longer in the fossil record than DNA, but the longevity, survival mechanisms and substrates remain contested. Here, we demonstrate the role of mineral binding in preserving the protein sequence in ostrich (Struthionidae) eggshell, including from the palaeontological sites of Laetoli (3.8 Ma) and Olduvai Gorge (1.3 Ma) in Tanzania. By tracking protein diagenesis back in time we find consistent patterns of preservation, demonstrating authenticity of the surviving sequences. Molecular dynamics simulations of struthiocalcin-1 and -2, the dominant proteins within the eggshell, reveal that distinct domains bind to the mineral surface. It is the domain with the strongest calculated binding energy to the calcite surface that is selectively preserved. Thermal age calculations demonstrate that the Laetoli and Olduvai peptides are 50 times older than any previously authenticated sequence (equivalent to ~16 Ma at a constant 10°C). DOI: http://dx.doi.org/10.7554/eLife.17092.001 PMID:27668515

  10. VING: a software for visualization of deep sequencing signals.

    PubMed

    Descrimes, Marc; Ben Zouari, Yousra; Wery, Maxime; Legendre, Rachel; Gautheret, Daniel; Morillon, Antonin

    2015-09-07

    Next generation sequencing (NGS) data treatment often requires mapping sequenced reads onto a reference genome for further analysis. Mapped data are commonly visualized using genome browsers. However, such software is not suited for a publication-ready and versatile representation of NGS data coverage, especially when multiple experiments are simultaneously treated. We developed 'VING', a stand-alone R script that takes as input NGS mapping files and genome annotations to produce accurate snapshots of the NGS coverage signal for any specified genomic region. VING offers multiple viewing options, including strand-specific views and a special heatmap mode for representing multiple experiments in a single figure. VING produces high-quality figures for NGS data representation in a genome region of interest. It is available at http://vm-gb.curie.fr/ving/. We also developed a Galaxy wrapper, available in the Galaxy tool shed with installation and usage instructions.

  11. Using Amplicon Deep Sequencing to Detect Genetic Signatures of Plasmodium vivax Relapse

    PubMed Central

    Lin, Jessica T.; Hathaway, Nicholas J.; Saunders, David L.; Lon, Chanthap; Balasubramanian, Sujata; Kharabora, Oksana; Gosi, Panita; Sriwichai, Sabaithip; Kartchner, Laurel; Chuor, Char Meng; Satharath, Prom; Lanteri, Charlotte; Bailey, Jeffrey A.; Juliano, Jonathan J.

    2015-01-01

    Plasmodium vivax infections often recur due to relapse of hypnozoites from the liver. In malaria-endemic areas, tools to distinguish relapse from reinfection are needed. We applied amplicon deep sequencing to P. vivax isolates from 78 Cambodian volunteers, nearly one-third of whom suffered recurrence at a median of 68 days. Deep sequencing at a highly variable region of the P. vivax merozoite surface protein 1 gene revealed impressive diversity—generating 67 unique haplotypes and detecting on average 3.6 cocirculating parasite clones within individuals, compared to 2.1 clones detected by a combination of 3 microsatellite markers. This diversity enabled a scheme to classify over half of recurrences as probable relapses based on the low probability of reinfection by multiple recurring variants. In areas of high P. vivax diversity, targeted deep sequencing can help detect genetic signatures of relapse, key to evaluating antivivax interventions and achieving a better understanding of relapse-reinfection epidemiology. PMID:25748326

  12. Deep sequencing as a probe of normal stem cell fate and preneoplasia in human epidermis

    PubMed Central

    Simons, Benjamin D.

    2016-01-01

    Using deep sequencing technology, methods based on the sporadic acquisition of somatic DNA mutations in human tissues have been used to trace the clonal evolution of progenitor cells in diseased states. However, the potential of these approaches to explore cell fate behavior of normal tissues and the initiation of preneoplasia remains underexploited. Focusing on the results of a recent deep sequencing study of eyelid epidermis, we show that the quantitative analysis of mutant clone size provides a general method to resolve the pattern of normal stem cell fate and to detect and characterize the mutational signature of rare field transformations in human tissues, with implications for the early detection of preneoplasia. PMID:26699486

  13. Deep sequencing as a probe of normal stem cell fate and preneoplasia in human epidermis.

    PubMed

    Simons, Benjamin D

    2016-01-05

    Using deep sequencing technology, methods based on the sporadic acquisition of somatic DNA mutations in human tissues have been used to trace the clonal evolution of progenitor cells in diseased states. However, the potential of these approaches to explore cell fate behavior of normal tissues and the initiation of preneoplasia remains underexploited. Focusing on the results of a recent deep sequencing study of eyelid epidermis, we show that the quantitative analysis of mutant clone size provides a general method to resolve the pattern of normal stem cell fate and to detect and characterize the mutational signature of rare field transformations in human tissues, with implications for the early detection of preneoplasia.

  14. Insights into C4 metabolism from comparative deep sequencing.

    PubMed

    Burgess, Steven J; Hibberd, Julian M

    2015-06-01

    C4 photosynthesis suppresses the oxygenation activity of Ribulose Bisphosphate Carboxylase Oxygenase and so limits photorespiration. Although highly complex, it is estimated to have evolved in 66 plant lineages, with the vast majority lacking sequenced genomes. Transcriptomics has recently initiated assessments of the degree to which transcript abundance differs between C3 and C4 leaves, identified novel components of C4 metabolism, and also led to mathematical models explaining the repeated evolution of this complex phenotype. Evidence is accumulating that this complex and convergent phenotype is partly underpinned by parallel evolution of structural genes, but also regulatory elements in both cis and trans. Furthermore, it appears that initial events associated with acquisition of C4 traits likely represent evolutionary exaptations related to non-photosynthetic processes. Copyright © 2015 Elsevier Ltd. All rights reserved.

  15. Determining mutant spectra of three RNA viral samples using ultra-deep sequencing

    SciTech Connect

    Chen, H

    2012-06-06

    RNA viruses have extremely high mutation rates that enable the virus to adapt to new host environments and even jump from one species to another. As part of a viral transmission study, three viral samples collected from naturally infected animals were sequenced using Illumina paired-end technology at ultra-deep coverage. In order to determine the mutant spectra within the viral quasispecies, it is critical to understand the sequencing error rates and control for false positive calls of viral variants (point mutations). I will estimate the sequencing error rate from two control sequences and characterize the mutant spectra in the natural samples with this error rate.
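
    The plan in this abstract (estimate an error rate from control sequences, then use it to control false positive variant calls) can be illustrated with a short sketch. The abstract does not say which statistical test is used; the one-sided binomial test below, the function names, and the default significance level are all assumptions made for illustration.

```python
from math import exp, lgamma, log


def estimate_error_rate(mismatches, total_bases):
    """Per-base error rate observed in a control of known sequence."""
    return mismatches / total_bases


def binomial_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p), with 0 < p < 1.

    Terms are computed in log space so that ultra-deep coverage
    (n in the thousands) does not overflow.
    """
    if k <= 0:
        return 1.0
    if k > n:
        return 0.0
    log_p, log_q = log(p), log(1.0 - p)
    total = 0.0
    for i in range(k, n + 1):
        log_pmf = (lgamma(n + 1) - lgamma(i + 1) - lgamma(n - i + 1)
                   + i * log_p + (n - i) * log_q)
        total += exp(log_pmf)
    return min(1.0, total)


def is_probable_variant(alt_count, depth, error_rate, alpha=1e-3):
    """Flag a site whose non-reference count is unlikely under error alone.

    alpha is an illustrative threshold, not a value from the study.
    """
    return binomial_tail(alt_count, depth, error_rate) < alpha


# Hypothetical numbers: 50 mismatches in 100,000 control bases.
rate = estimate_error_rate(50, 100_000)
```

    With a control-derived rate of 0.001 and a depth of 10,000, about 10 erroneous calls are expected at a site, so a count of 12 is unremarkable while a count of 100 is flagged. A real analysis would also need to correct for testing many sites at once.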

  16. Expression Profile of Ectopic Olfactory Receptors Determined by Deep Sequencing

    PubMed Central

    Flegel, Caroline; Manteniotis, Stavros; Osthold, Sandra; Hatt, Hanns; Gisselmann, Günter

    2013-01-01

    Olfactory receptors (ORs) provide the molecular basis for the detection of volatile odorant molecules by olfactory sensory neurons. The OR supergene family encodes G-protein coupled proteins that belong to the seven-transmembrane-domain receptor family. It was initially postulated that ORs are exclusively expressed in the olfactory epithelium. However, recent studies have demonstrated ectopic expression of some ORs in a variety of other tissues. In the present study, we conducted a comprehensive expression analysis of ORs using an extended panel of human tissues. This analysis made use of recent dramatic technical developments of the so-called Next Generation Sequencing (NGS) technique, which encouraged us to use open access data for the first comprehensive RNA-Seq expression analysis of ectopically expressed ORs in multiple human tissues. We analyzed mRNA-Seq data obtained by Illumina sequencing of 16 human tissues available from Illumina Body Map project 2.0 and from an additional study of OR expression in testis. At least some ORs were expressed in all the tissues analyzed. In several tissues, we could detect broadly expressed ORs such as OR2W3 and OR51E1. We also identified ORs that showed exclusive expression in one investigated tissue, such as OR4N4 in testis. For some ORs, the coding exon was found to be part of a transcript of upstream genes. In total, 111 of 400 OR genes were expressed with an FPKM (fragments per kilobase of exon per million fragments mapped) higher than 0.1 in at least one tissue. For several ORs, mRNA expression was verified by RT-PCR. Our results support the idea that ORs are broadly expressed in a variety of tissues and provide the basis for further functional studies. PMID:23405139
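
    The expression measure quoted above, FPKM (fragments per kilobase of exon per million fragments mapped), and the "higher than 0.1 in at least one tissue" criterion can be written out as a small sketch. The function names, gene labels, and numbers below are hypothetical placeholders, not data from the study.

```python
def fpkm(fragments, exon_length_bp, total_mapped_fragments):
    """Fragments per kilobase of exon per million fragments mapped.

    FPKM = fragments / (exon_length_bp / 1e3) / (total_mapped_fragments / 1e6)
         = fragments * 1e9 / (exon_length_bp * total_mapped_fragments)
    """
    return fragments * 1e9 / (exon_length_bp * total_mapped_fragments)


def expressed_genes(fpkm_by_gene, threshold=0.1):
    """Genes whose FPKM exceeds the threshold in at least one tissue.

    fpkm_by_gene: {gene: {tissue: fpkm_value}}
    """
    return sorted(
        gene
        for gene, by_tissue in fpkm_by_gene.items()
        if max(by_tissue.values()) > threshold
    )


# Hypothetical example: 20 fragments on a 1,000-bp exon model in a
# library of 20 million mapped fragments gives an FPKM of 1.0.
example_value = fpkm(20, 1000, 20_000_000)

# Made-up table with placeholder gene and tissue names.
table = {
    "gene_a": {"tissue_1": 0.05, "tissue_2": 0.40},
    "gene_b": {"tissue_1": 0.02, "tissue_2": 0.09},
}
passing = expressed_genes(table)
```

    Normalising by both exon length and library size is what makes values comparable across genes and across the different tissue libraries.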

  17. Complete genome of Hainan papaya ringspot virus using small RNA deep sequencing.

    PubMed

    Zhang, Yuliang; Yu, Naitong; Huang, Qixing; Yin, Guohua; Guo, Anping; Wang, Xiangfeng; Xiong, Zhongguo; Liu, Zhixin

    2014-06-01

    Small RNA deep sequencing allows for virus identification, virus genome assembly, and strain differentiation. In this study, papaya plants with virus-like symptoms collected in Hainan province were used for deep sequencing and small RNA library construction. After in silico subtraction of the papaya sRNAs, small RNA reads were used in the viral genome assembly using a reference-guided, iterative assembly approach. A nearly complete genome was assembled for a Hainan isolate of papaya ringspot virus (PRSV-HN-2). The complete PRSV-HN-2 genome (accession no.: KF734962) was obtained after a 15-nucleotide gap was filled by direct sequencing of the amplified genomic region. Direct sequencing of several random genomic regions of the PRSV isolate did not find any sequence discrepancy with the sRNA-assembled genome. The newly sequenced PRSV-HN-2 genome shared a nucleotide identity of 96% and 94% to that of the PRSV-HN (EF183499) and PRSV-HN-1 (HQ424465) isolates, and together with these two isolates formed a new PRSV clade. These data demonstrate that the small RNA deep sequencing technology provides a viable and rapid means to assemble complete viral genomes in plants.

  18. Deep sequencing extends the diversity of human papillomaviruses in human skin

    PubMed Central

    Bzhalava, Davit; Mühr, Laila Sara Arroyo; Lagheden, Camilla; Ekström, Johanna; Forslund, Ola; Dillner, Joakim; Hultin, Emilie

    2014-01-01

    Most viruses in human skin are known to be human papillomaviruses (HPVs). Previous sequencing of skin samples has identified 273 different cutaneous HPV types, including 47 previously unknown types. In the present study, we wished to extend prior studies using deeper sequencing. This deeper sequencing without prior PCR of a pool of 142 whole genome amplified skin lesions identified 23 known HPV types, 3 novel putative HPV types and 4 non-HPV viruses. The complete sequence was obtained for one of the known putative types and almost the complete sequence was obtained for one of the novel putative types. In addition, sequencing of amplimers from HPV consensus PCR of 326 skin lesions detected 385 different HPV types, including 226 previously unknown putative types. In conclusion, metagenomic deep sequencing of human skin samples identified no less than 396 different HPV types in human skin, out of which 229 putative HPV types were previously unknown. PMID:25055967

  19. Deep sequencing reveals 50 novel genes for recessive cognitive disorders.

    PubMed

    Najmabadi, Hossein; Hu, Hao; Garshasbi, Masoud; Zemojtel, Tomasz; Abedini, Seyedeh Sedigheh; Chen, Wei; Hosseini, Masoumeh; Behjati, Farkhondeh; Haas, Stefan; Jamali, Payman; Zecha, Agnes; Mohseni, Marzieh; Püttmann, Lucia; Vahid, Leyla Nouri; Jensen, Corinna; Moheb, Lia Abbasi; Bienek, Melanie; Larti, Farzaneh; Mueller, Ines; Weissmann, Robert; Darvish, Hossein; Wrogemann, Klaus; Hadavi, Valeh; Lipkowitz, Bettina; Esmaeeli-Nieh, Sahar; Wieczorek, Dagmar; Kariminejad, Roxana; Firouzabadi, Saghar Ghasemi; Cohen, Monika; Fattahi, Zohreh; Rost, Imma; Mojahedi, Faezeh; Hertzberg, Christoph; Dehghan, Atefeh; Rajab, Anna; Banavandi, Mohammad Javad Soltani; Hoffer, Julia; Falah, Masoumeh; Musante, Luciana; Kalscheuer, Vera; Ullmann, Reinhard; Kuss, Andreas Walter; Tzschach, Andreas; Kahrizi, Kimia; Ropers, H Hilger

    2011-09-21

    Common diseases are often complex because they are genetically heterogeneous, with many different genetic defects giving rise to clinically indistinguishable phenotypes. This has been amply documented for early-onset cognitive impairment, or intellectual disability, one of the most complex disorders known and a very important health care problem worldwide. More than 90 different gene defects have been identified for X-chromosome-linked intellectual disability alone, but research into the more frequent autosomal forms of intellectual disability is still in its infancy. To expedite the molecular elucidation of autosomal-recessive intellectual disability, we have now performed homozygosity mapping, exon enrichment and next-generation sequencing in 136 consanguineous families with autosomal-recessive intellectual disability from Iran and elsewhere. This study, the largest published so far, has revealed additional mutations in 23 genes previously implicated in intellectual disability or related neurological disorders, as well as single, probably disease-causing variants in 50 novel candidate genes. Proteins encoded by several of these genes interact directly with products of known intellectual disability genes, and many are involved in fundamental cellular processes such as transcription and translation, cell-cycle control, energy metabolism and fatty-acid synthesis, which seem to be pivotal for normal brain development and function.

  20. Using Small RNA Deep Sequencing Data to Detect Human Viruses.

    PubMed

    Wang, Fang; Sun, Yu; Ruan, Jishou; Chen, Rui; Chen, Xin; Chen, Chengjie; Kreuze, Jan F; Fei, ZhangJun; Zhu, Xiao; Gao, Shan

    2016-01-01

    Small RNA sequencing (sRNA-seq) can be used to detect viruses in infected hosts without the necessity to have any prior knowledge or specialized sample preparation. The sRNA-seq method was initially used for viral detection and identification in plants and then in invertebrates and fungi. However, it is still controversial to use sRNA-seq in the detection of mammalian or human viruses. In this study, we used 931 sRNA-seq runs of data from the NCBI SRA database to detect and identify viruses in human cells or tissues, particularly from some clinical samples. Six viruses including HPV-18, HBV, HCV, HIV-1, SMRV, and EBV were detected from 36 runs of data. Four viruses were consistent with the annotations from the previous studies. HIV-1 was found in clinical samples without the HIV-positive reports, and SMRV was found in Diffuse Large B-Cell Lymphoma cells for the first time. In conclusion, these results suggest the sRNA-seq can be used to detect viruses in mammals and humans.

  1. Using Small RNA Deep Sequencing Data to Detect Human Viruses

    PubMed Central

    Wang, Fang; Sun, Yu; Ruan, Jishou; Chen, Rui; Chen, Xin; Chen, Chengjie; Kreuze, Jan F.; Fei, ZhangJun; Zhu, Xiao

    2016-01-01

    Small RNA sequencing (sRNA-seq) can be used to detect viruses in infected hosts without the necessity to have any prior knowledge or specialized sample preparation. The sRNA-seq method was initially used for viral detection and identification in plants and then in invertebrates and fungi. However, it is still controversial to use sRNA-seq in the detection of mammalian or human viruses. In this study, we used 931 sRNA-seq runs of data from the NCBI SRA database to detect and identify viruses in human cells or tissues, particularly from some clinical samples. Six viruses including HPV-18, HBV, HCV, HIV-1, SMRV, and EBV were detected from 36 runs of data. Four viruses were consistent with the annotations from the previous studies. HIV-1 was found in clinical samples without the HIV-positive reports, and SMRV was found in Diffuse Large B-Cell Lymphoma cells for the first time. In conclusion, these results suggest the sRNA-seq can be used to detect viruses in mammals and humans. PMID:27066498

  2. Deep sequencing of the vaginal microbiota of women with HIV.

    PubMed

    Hummelen, Ruben; Fernandes, Andrew D; Macklaim, Jean M; Dickson, Russell J; Changalucha, John; Gloor, Gregory B; Reid, Gregor

    2010-08-12

    Women living with HIV and co-infected with bacterial vaginosis (BV) are at higher risk for transmitting HIV to a partner or newborn. It is poorly understood which bacterial communities constitute BV or the normal vaginal microbiota among this population and how the microbiota associated with BV responds to antibiotic treatment. The vaginal microbiota of 132 HIV positive Tanzanian women, including 39 who received metronidazole treatment for BV, were profiled using Illumina to sequence the V6 region of the 16S rRNA gene. Of note, Gardnerella vaginalis and Lactobacillus iners were detected in each sample constituting core members of the vaginal microbiota. Eight major clusters were detected with relatively uniform microbiota compositions. Two clusters dominated by L. iners or L. crispatus were strongly associated with a normal microbiota. The L. crispatus dominated microbiota were associated with low pH, but when L. crispatus was not present, a large fraction of L. iners was required to predict a low pH. Four clusters were strongly associated with BV, and were dominated by Prevotella bivia, Lachnospiraceae, or a mixture of different species. Metronidazole treatment reduced the microbial diversity and perturbed the BV-associated microbiota, but rarely resulted in the establishment of a lactobacilli-dominated microbiota. Illumina-based microbial profiling enabled high-throughput analyses of microbial samples at a high phylogenetic resolution. The vaginal microbiota among women living with HIV in Sub-Saharan Africa constitutes several profiles associated with a normal microbiota or BV. Recurrence of BV frequently constitutes a different BV-associated profile than before antibiotic treatment.

  3. Deep Sequencing of the Vaginal Microbiota of Women with HIV

    PubMed Central

    Hummelen, Ruben; Fernandes, Andrew D.; Macklaim, Jean M.; Dickson, Russell J.; Changalucha, John

    2010-01-01

    Background: Women living with HIV and co-infected with bacterial vaginosis (BV) are at higher risk for transmitting HIV to a partner or newborn. It is poorly understood which bacterial communities constitute BV or the normal vaginal microbiota among this population and how the microbiota associated with BV responds to antibiotic treatment. Methods and Findings: The vaginal microbiota of 132 HIV positive Tanzanian women, including 39 who received metronidazole treatment for BV, were profiled using Illumina to sequence the V6 region of the 16S rRNA gene. Of note, Gardnerella vaginalis and Lactobacillus iners were detected in each sample constituting core members of the vaginal microbiota. Eight major clusters were detected with relatively uniform microbiota compositions. Two clusters dominated by L. iners or L. crispatus were strongly associated with a normal microbiota. The L. crispatus dominated microbiota were associated with low pH, but when L. crispatus was not present, a large fraction of L. iners was required to predict a low pH. Four clusters were strongly associated with BV, and were dominated by Prevotella bivia, Lachnospiraceae, or a mixture of different species. Metronidazole treatment reduced the microbial diversity and perturbed the BV-associated microbiota, but rarely resulted in the establishment of a lactobacilli-dominated microbiota. Conclusions: Illumina-based microbial profiling enabled high-throughput analyses of microbial samples at a high phylogenetic resolution. The vaginal microbiota among women living with HIV in Sub-Saharan Africa constitutes several profiles associated with a normal microbiota or BV. Recurrence of BV frequently constitutes a different BV-associated profile than before antibiotic treatment. PMID:20711427

  4. Deep Sequencing of the Murine Olfactory Receptor Neuron Transcriptome

    PubMed Central

    Kanageswaran, Ninthujah; Demond, Marilen; Nagel, Maximilian; Schreiner, Benjamin S. P.; Baumgart, Sabrina; Scholz, Paul; Altmüller, Janine; Becker, Christian; Doerner, Julia F.; Conrad, Heike; Oberland, Sonja; Wetzel, Christian H.; Neuhaus, Eva M.; Hatt, Hanns; Gisselmann, Günter

    2015-01-01

    The ability of animals to sense and differentiate among thousands of odorants relies on a large set of olfactory receptors (OR) and a multitude of accessory proteins within the olfactory epithelium (OE). ORs and related signaling mechanisms have been the subject of intensive studies over the past years, but our knowledge regarding olfactory processing remains limited. The recent development of next generation sequencing (NGS) techniques encouraged us to assess the transcriptome of the murine OE. We analyzed RNA from OEs of female and male adult mice and from fluorescence-activated cell sorting (FACS)-sorted olfactory receptor neurons (ORNs) obtained from transgenic OMP-GFP mice. The Illumina RNA-Seq protocol was utilized to generate up to 86 million reads per transcriptome. In OE samples, nearly all OR and trace amine-associated receptor (TAAR) genes involved in the perception of volatile amines were detectably expressed. Other genes known to participate in olfactory signaling pathways were among the 200 genes with the highest expression levels in the OE. To identify OE-specific genes, we compared olfactory neuron expression profiles with RNA-Seq transcriptome data from different murine tissues. By analyzing different transcript classes, we detected the expression of non-olfactory GPCRs in ORNs and established an expression ranking for GPCRs detected in the OE. We also identified other previously undescribed membrane proteins as potential new players in olfaction. The quantitative and comprehensive transcriptome data provide a virtually complete catalogue of genes expressed in the OE and present a useful tool to uncover candidate genes involved in, for example, olfactory signaling, OR trafficking and recycling, and proliferation. PMID:25590618

  5. Deep sequencing the circadian and diurnal transcriptome of Drosophila brain

    PubMed Central

    Hughes, Michael E.; Grant, Gregory R.; Paquin, Christina; Qian, Jack; Nitabach, Michael N.

    2012-01-01

    Eukaryotic circadian clocks include transcriptional/translational feedback loops that drive 24-h rhythms of transcription. These transcriptional rhythms underlie oscillations of protein abundance, thereby mediating circadian rhythms of behavior, physiology, and metabolism. Numerous studies over the last decade have used microarrays to profile circadian transcriptional rhythms in various organisms and tissues. Here we use RNA sequencing (RNA-seq) to profile the circadian transcriptome of Drosophila melanogaster brain from wild-type and period-null clock-defective animals. We identify several hundred transcripts whose abundance oscillates with 24-h periods in either constant darkness or 12 h light/dark diurnal cycles, including several noncoding RNAs (ncRNAs) that were not identified in previous microarray studies. Of particular interest are U snoRNA host genes (Uhgs), a family of diurnal cycling noncoding RNAs that encode the precursors of more than 50 box-C/D small nucleolar RNAs, key regulators of ribosomal biogenesis. Transcriptional profiling at the level of individual exons reveals alternative splice isoforms for many genes whose relative abundances are regulated by either period or circadian time, although the effect of circadian time is muted in comparison to that of period. Interestingly, period loss of function significantly alters the frequency of RNA editing at several editing sites, suggesting an unexpected link between a key circadian gene and RNA editing. We also identify tens of thousands of novel splicing events beyond those previously annotated by the modENCODE Consortium, including several that affect key circadian genes. These studies demonstrate extensive circadian control of ncRNA expression, reveal the extent of clock control of alternative splicing and RNA editing, and provide a novel, genome-wide map of splicing in Drosophila brain. PMID:22472103

  6. Denoising DNA deep sequencing data-high-throughput sequencing errors and their correction.

    PubMed

    Laehnemann, David; Borkhardt, Arndt; McHardy, Alice Carolyn

    2016-01-01

    Characterizing the errors generated by common high-throughput sequencing platforms and telling true genetic variation from technical artefacts are two interdependent steps, essential to many analyses such as single nucleotide variant calling, haplotype inference, sequence assembly and evolutionary studies. Both random and systematic errors can show a specific occurrence profile for each of the six prominent sequencing platforms surveyed here: 454 pyrosequencing, Complete Genomics DNA nanoball sequencing, Illumina sequencing by synthesis, Ion Torrent semiconductor sequencing, Pacific Biosciences single-molecule real-time sequencing and Oxford Nanopore sequencing. There is a large variety of programs available for error removal in sequencing read data, which differ in the error models and statistical techniques they use, the features of the data they analyse, the parameters they determine from them and the data structures and algorithms they use. We highlight the assumptions they make and for which data types these hold, providing guidance which tools to consider for benchmarking with regard to the data properties. While no benchmarking results are included here, such specific benchmarks would greatly inform tool choices and future software development. The development of stand-alone error correctors, as well as single nucleotide variant and haplotype callers, could also benefit from using more of the knowledge about error profiles and from (re)combining ideas from the existing approaches presented here.

  7. VirusDetect: An automated pipeline for efficient virus discovery using deep sequencing of small RNAs

    USDA-ARS's Scientific Manuscript database

    Accurate detection of viruses in plants and animals is critical for agriculture production and human health. Deep sequencing and assembly of virus-derived siRNAs has proven to be a highly efficient approach for virus discovery. However, to date no computational tools specifically designed for both k...

  8. Draft Genome Sequence of the Deep-Subsurface Actinobacterium Tessaracoccus lapidicaptus IPBSL-7T

    PubMed Central

    Pieper, Dietmar H.; Arce-Rodríguez, Alejandro

    2016-01-01

    The type strain of Tessaracoccus lapidicaptus was isolated from the deep subsurface of the Iberian Pyrite Belt (southwest Spain). Here, we report its draft genome, consisting of 27 contigs with a ~3.1-Mb genome size. The annotation revealed 2,905 coding DNA sequences, 45 tRNA genes, and three rRNA genes. PMID:27688325

  9. Accurate sampling and deep sequencing of the HIV-1 protease gene using a Primer ID.

    PubMed

    Jabara, Cassandra B; Jones, Corbin D; Roach, Jeffrey; Anderson, Jeffrey A; Swanstrom, Ronald

    2011-12-13

    Viruses can create complex genetic populations within a host, and deep sequencing technologies allow extensive sampling of these populations. Limitations of these technologies, however, potentially bias this sampling, particularly when a PCR step precedes the sequencing protocol. Typically, an unknown number of templates are used in initiating the PCR amplification, and this can lead to unrecognized sequence resampling creating apparent homogeneity; also, PCR-mediated recombination can disrupt linkage, and differential amplification can skew allele frequency. Finally, misincorporation of nucleotides during PCR and errors during the sequencing protocol can inflate diversity. We have solved these problems by including a random sequence tag in the initial primer such that each template receives a unique Primer ID. After sequencing, repeated identification of a Primer ID reveals sequence resampling. These resampled sequences are then used to create an accurate consensus sequence for each template, correcting for recombination, allelic skewing, and misincorporation/sequencing errors. The resulting population of consensus sequences directly represents the initial sampled templates. We applied this approach to the HIV-1 protease (pro) gene to view the distribution of sequence variation of a complex viral population within a host. We identified major and minor polymorphisms at coding and noncoding positions. In addition, we observed dynamic genetic changes within the population during intermittent drug exposure, including the emergence of multiple resistant alleles. These results provide an unprecedented view of a complex viral population in the absence of PCR resampling.
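
    The Primer ID idea described above (tag each template with a random sequence, treat reads sharing a tag as resampling of one template, and collapse them into a consensus) can be sketched as follows. This is only an illustration of the principle: it assumes the tag has already been extracted from each read and that reads within a group are pre-aligned and of equal length, and the `min_reads` cutoff is a hypothetical parameter, not one taken from the paper.

```python
from collections import Counter, defaultdict


def consensus_by_primer_id(tagged_reads, min_reads=3):
    """Collapse reads that share a Primer ID into one consensus sequence.

    tagged_reads: iterable of (primer_id, sequence) pairs.
    Groups with fewer than min_reads reads are dropped, since a
    majority vote over one or two reads cannot correct errors.
    """
    groups = defaultdict(list)
    for primer_id, sequence in tagged_reads:
        groups[primer_id].append(sequence)

    consensus = {}
    for primer_id, sequences in groups.items():
        if len(sequences) < min_reads:
            continue
        length = len(sequences[0])
        if any(len(s) != length for s in sequences):
            continue  # sketch assumes equal-length, pre-aligned reads
        # Majority base at each column; ties go to the base seen first.
        consensus[primer_id] = "".join(
            Counter(column).most_common(1)[0][0]
            for column in zip(*sequences)
        )
    return consensus


# Hypothetical toy input: two well-sampled templates and one singleton.
reads = [
    ("AAAC", "ACGT"), ("AAAC", "ACGT"), ("AAAC", "ACTT"),
    ("GGTA", "ACGA"), ("GGTA", "ACGA"), ("GGTA", "ACGA"),
    ("TTTT", "ACGT"),
]
templates = consensus_by_primer_id(reads)
```

    In the toy input the single discordant base in the first group is outvoted, and the resulting set contains one sequence per sampled template, so allele frequencies are counted over templates rather than over PCR copies.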

  10. Short-read assembly of full-length 16S amplicons reveals bacterial diversity in subsurface sediments.

    PubMed

    Miller, Christopher S; Handley, Kim M; Wrighton, Kelly C; Frischkorn, Kyle R; Thomas, Brian C; Banfield, Jillian F

    2013-01-01

    In microbial ecology, a fundamental question relates to how community diversity and composition change in response to perturbation. Most studies have had limited ability to deeply sample community structure (e.g. Sanger-sequenced 16S rRNA libraries), or have had limited taxonomic resolution (e.g. studies based on 16S rRNA hypervariable region sequencing). Here, we combine the higher taxonomic resolution of near-full-length 16S rRNA gene amplicons with the economics and sensitivity of short-read sequencing to assay the abundance and identity of organisms that represent as little as 0.01% of sediment bacterial communities. We used a new version of EMIRGE optimized for large data size to reconstruct near-full-length 16S rRNA genes from amplicons sheared and sequenced with Illumina technology. The approach allowed us to differentiate the community composition among samples acquired before perturbation, after acetate amendment shifted the predominant metabolism to iron reduction, and once sulfate reduction began. Results were highly reproducible across technical replicates, and identified specific taxa that responded to the perturbation. All samples contain very high alpha diversity and abundant organisms from phyla without cultivated representatives. Surprisingly, at the time points measured, there was no strong loss of evenness, despite the selective pressure of acetate amendment and change in the terminal electron accepting process. However, community membership was altered significantly. The method allows for sensitive, accurate profiling of the "long tail" of low abundance organisms that exist in many microbial communities, and can resolve population dynamics in response to environmental change.

  11. Studies of a Biochemical Factory: Tomato Trichome Deep Expressed Sequence Tag Sequencing and Proteomics

    PubMed Central

    Schilmiller, Anthony L.; Miner, Dennis P.; Larson, Matthew; McDowell, Eric; Gang, David R.; Wilkerson, Curtis; Last, Robert L.

    2010-01-01

    Shotgun proteomics analysis allows hundreds of proteins to be identified and quantified from a single sample at relatively low cost. Extensive DNA sequence information is a prerequisite for shotgun proteomics, and it is ideal to have sequence for the organism being studied rather than from related species or accessions. While this requirement has limited the set of organisms that are candidates for this approach, next generation sequencing technologies make it feasible to obtain deep DNA sequence coverage from any organism. As part of our studies of specialized (secondary) metabolism in tomato (Solanum lycopersicum) trichomes, 454 sequencing of cDNA was combined with shotgun proteomics analyses to obtain in-depth profiles of genes and proteins expressed in leaf and stem glandular trichomes of 3-week-old plants. The expressed sequence tag and proteomics data sets combined with metabolite analysis led to the discovery and characterization of a sesquiterpene synthase that produces β-caryophyllene and α-humulene from E,E-farnesyl diphosphate in trichomes of leaf but not of stem. This analysis demonstrates the utility of combining high-throughput cDNA sequencing with proteomics experiments in a target tissue. These data can be used for dissection of other biochemical processes in these specialized epidermal cells. PMID:20431087

  12. Isoform discovery by targeted cloning, “deep well” pooling, and parallel sequencing

    PubMed Central

    Salehi-Ashtiani, Kourosh; Yang, Xinping; Derti, Adnan; Tian, Weidong; Hao, Tong; Lin, Chenwei; Makowski, Kathryn; Shen, Lei; Murray, Ryan R; Szeto, David; Tusneem, Nadeem; Smith, Douglas R; Cusick, Michael E; Hill, David E; Roth, Frederick P; Vidal, Marc

    2009-01-01

    Describing the “ORFeome” of an organism, including all major isoforms, is essential for a systems understanding of any species; however, conventional cloning and sequencing approaches are prohibitively costly and labor-intensive. We describe a potentially genome-wide methodology for efficiently capturing novel coding isoforms using RT-PCR recombinational cloning, “deep well” pooling, and a “next generation” sequencing platform. This ORFeome discovery pipeline will be applicable to any eukaryotic species with a sequenced genome. PMID:18552854

  13. Genome-Wide Probing of RNA Structures In Vitro Using Nucleases and Deep Sequencing.

    PubMed

    Wan, Yue; Qu, Kun; Ouyang, Zhengqing; Chang, Howard Y

    2016-01-01

    RNA structure probing is an important technique that studies the secondary and tertiary conformations of an RNA. While it was traditionally performed on one RNA at a time, recent advances in deep sequencing have enabled the secondary structure mapping of thousands of RNAs simultaneously. Here, we describe the method Parallel Analysis for RNA Structures (PARS), which couples double- and single-strand-specific nuclease probing to high-throughput sequencing. Upon cloning of the cleavage sites into a cDNA library, deep sequencing and mapping of reads to the transcriptome, the position of paired and unpaired bases along cellular RNAs can be identified. PARS can be performed under diverse solution conditions and on different organismal RNAs to provide genome-wide RNA structural information. This information can also be further used to constrain computational predictions to provide better RNA structure models under different conditions.

  14. Pooled Amplicon Deep Sequencing of Candidate Plasmodium falciparum Transmission-Blocking Vaccine Antigens

    PubMed Central

    Juliano, Jonathan J.; Parobek, Christian M.; Brazeau, Nicholas F.; Ngasala, Billy; Randrianarivelojosia, Milijaona; Lon, Chanthap; Mwandagalirwa, Kashamuka; Tshefu, Antoinette; Dhar, Ravi; Das, Bidyut K.; Hoffman, Irving; Martinson, Francis; Mårtensson, Andreas; Saunders, David L.; Kumar, Nirbhay; Meshnick, Steven R.

    2016-01-01

    Polymorphisms within Plasmodium falciparum vaccine candidate antigens have the potential to compromise vaccine efficacy. Understanding the allele frequencies of polymorphisms in critical binding regions of antigens can help in the designing of strain-transcendent vaccines. Here, we adopt a pooled deep-sequencing approach, originally designed to study P. falciparum drug resistance mutations, to study the diversity of two leading transmission-blocking vaccine candidates, Pfs25 and Pfs48/45. We sequenced 329 P. falciparum field isolates from six different geographic regions. Pfs25 showed little diversity, with only one known polymorphism identified in the region associated with binding of transmission-blocking antibodies among our isolates. However, we identified four new mutations among eight non-synonymous mutations within the presumed antibody-binding region of Pfs48/45. Pooled deep sequencing provides a scalable and cost-effective approach for the targeted study of allele frequencies of P. falciparum candidate vaccine antigens. PMID:26503281

  15. Deep-Sequencing Technologies and Potential Applications in Forensic DNA Testing.

    PubMed

    Zascavage, R R; Shewale, S J; Planz, J V

    2013-03-01

    The development of second- and third-generation DNA sequencing technologies has enabled an increasing number of applications in different areas such as molecular diagnostics, gene therapy, monitoring food and pharmaceutical products, biosecurity, and forensics. These technologies are based on different biochemical principles such as monitoring released pyrophosphate upon incorporation of a base (pyrosequencing), fluorescence detection subsequent to reversible incorporation of a fluorescently labeled terminator base, a ligation-based approach wherein fluorescence of the cleaved nucleotide after ligation is measured, measuring the proton released after incorporation of a base (semiconductor-based sequencing), monitoring incorporation of a nucleotide by measuring the fluorescence of the fluorophore attached to the phosphate chain of the nucleotide, and detecting the altered charge in a protein nanopore due to a nucleotide released by exonuclease cleavage of a DNA strand. Analysis of multiple DNA fragments in parallel increases the depth of coverage while decreasing labor, cost, and time, highlighting some major advantages of deep-sequencing technologies. DNA sequencing has been routinely used in forensic laboratories for mitochondrial DNA analysis. Fragment analysis, however, is the preferred method for Short Tandem Repeat genotyping due to the cumbersome and costly nature of first-generation DNA sequencing methodologies. Deep-sequencing technologies have brought a new perspective to forensic DNA analysis. Studies include STR analysis to reveal hidden variation in the repeat regions, mtDNA sequencing, Single Nucleotide Polymorphism analysis, mixture resolution, and body fluid identification. Recent publications reveal that attempts are being made to expand the capability.

  16. Enhanced arbovirus surveillance with deep sequencing: identification of novel rhabdoviruses and bunyaviruses in Australian mosquitoes

    PubMed Central

    Coffey, Lark L.; Page, Brady L.; Greninger, Alexander L.; Herring, Belinda L.; Russell, Richard C.; Doggett, Stephen L.; Haniotis, John; Wang, Chunlin; Deng, Xutao; Delwart, Eric L.

    2013-01-01

    Viral metagenomics characterizes known and identifies unknown viruses based on sequence similarities to any previously sequenced viral genomes. A metagenomics approach was used to identify virus sequences in Australian mosquitoes causing cytopathic effects in inoculated mammalian cell cultures. Sequence comparisons revealed strains of Liao Ning virus (Reovirus, Seadornavirus), previously detected only in China, livestock-infecting Stretch Lagoon virus (Reovirus, Orbivirus), two novel dimarhabdoviruses, named Beaumont and North Creek viruses, and two novel orthobunyaviruses, named Murrumbidgee and Salt Ash viruses. The novel virus proteomes diverged by ≥50% relative to their closest previously genetically characterized viral relatives. Deep sequencing also generated genomes of Warrego and Wallal viruses, orbiviruses linked to kangaroo blindness, whose genomes had not been fully characterized. This study highlights viral metagenomics in concert with traditional arbovirus surveillance to characterize known and new arboviruses in field-collected mosquitoes. Follow-up epidemiological studies are required to determine whether the novel viruses infect humans. PMID:24314645

  17. Deep sequencing reveals global patterns of mRNA recruitment during translation initiation

    PubMed Central

    Gao, Rong; Yu, Kai; Nie, Jukui; Lian, Tengfei; Jin, Jianshi; Liljas, Anders; Su, Xiao-Dong

    2016-01-01

    In this work, we developed a method to systematically study the sequence preference of mRNAs during translation initiation. Traditionally, the dynamic process of translation initiation has been studied at the single molecule level with limited sequencing possibility. Using deep sequencing techniques, we identified the sequence preference at different stages of the initiation complexes. Our results provide a comprehensive and dynamic view of the initiation elements in the translation initiation region (TIR), including the S1 binding sequence, the Shine-Dalgarno (SD)/anti-SD interaction and the second codon, at the equilibrium of different initiation complexes. Moreover, our experiments reveal the conformational changes and regional dynamics throughout the dynamic process of mRNA recruitment. PMID:27460773

  18. Enhanced arbovirus surveillance with deep sequencing: Identification of novel rhabdoviruses and bunyaviruses in Australian mosquitoes.

    PubMed

    Coffey, Lark L; Page, Brady L; Greninger, Alexander L; Herring, Belinda L; Russell, Richard C; Doggett, Stephen L; Haniotis, John; Wang, Chunlin; Deng, Xutao; Delwart, Eric L

    2014-01-05

    Viral metagenomics characterizes known and identifies unknown viruses based on sequence similarities to any previously sequenced viral genomes. A metagenomics approach was used to identify virus sequences in Australian mosquitoes causing cytopathic effects in inoculated mammalian cell cultures. Sequence comparisons revealed strains of Liao Ning virus (Reovirus, Seadornavirus), previously detected only in China, livestock-infecting Stretch Lagoon virus (Reovirus, Orbivirus), two novel dimarhabdoviruses, named Beaumont and North Creek viruses, and two novel orthobunyaviruses, named Murrumbidgee and Salt Ash viruses. The novel virus proteomes diverged by ≥ 50% relative to their closest previously genetically characterized viral relatives. Deep sequencing also generated genomes of Warrego and Wallal viruses, orbiviruses linked to kangaroo blindness, whose genomes had not been fully characterized. This study highlights viral metagenomics in concert with traditional arbovirus surveillance to characterize known and new arboviruses in field-collected mosquitoes. Follow-up epidemiological studies are required to determine whether the novel viruses infect humans. © 2013 Elsevier Inc. All rights reserved.

  19. Novel TRAF1-ALK fusion identified by deep RNA sequencing of anaplastic large cell lymphoma.

    PubMed

    Feldman, Andrew L; Vasmatzis, George; Asmann, Yan W; Davila, Jaime; Middha, Sumit; Eckloff, Bruce W; Johnson, Sarah H; Porcher, Julie C; Ansell, Stephen M; Caride, Ariel

    2013-11-01

    Chromosomal translocations leading to expression of abnormal fusion proteins play a major role in the pathogenesis of various hematologic malignancies. The recent development of high-throughput, "deep" sequencing has allowed discovery of novel translocations leading to a rapid increase in understanding these diseases. Translocations involving the anaplastic lymphoma kinase (ALK) gene leading to ALK fusion proteins originally were discovered in anaplastic large cell lymphomas (ALCLs). Among ALCLs, NPM1-ALK fusions are most common and lead to nuclear localization of the fusion protein. Here, we present a 50-year-old male with ALCL demonstrating cytoplasmic ALK immunoreactivity only, suggesting the presence of a non-NPM1 fusion partner. We performed deep RNA sequencing of tumor tissue from this patient and identified a novel transcript fusing Exon 6 of TRAF1 to Exon 20 of ALK. The TRAF1-ALK fusion transcript was confirmed at the mRNA level by Sanger sequencing and the fusion protein was visualized by Western blot. The discovery of this TRAF1-ALK fusion expands the diversity of known ALK fusion partners and highlights the power of deep sequencing for fusion transcript discovery. © 2013 Wiley Periodicals, Inc.

  20. Exploring fungal diversity in deep-sea sediments from Okinawa Trough using high-throughput Illumina sequencing

    NASA Astrophysics Data System (ADS)

    Zhang, Xiao-Yong; Wang, Guang-Hua; Xu, Xin-Ya; Nong, Xu-Hua; Wang, Jie; Amin, Muhammad; Qi, Shu-Hua

    2016-10-01

    The present study investigated the fungal diversity in four different deep-sea sediments from Okinawa Trough using high-throughput Illumina sequencing of the nuclear ribosomal internal transcribed spacer-1 (ITS1). A total of 40,297 fungal ITS1 sequences clustered into 420 operational taxonomic units (OTUs) with 97% sequence similarity and 170 taxa were recovered from these sediments. Most ITS1 sequences (78%) belonged to the phylum Ascomycota, followed by Basidiomycota (17.3%), Zygomycota (1.5%) and Chytridiomycota (0.8%), and a small proportion (2.4%) belonged to unassigned fungal phyla. Compared with previous studies of fungal diversity in deep-sea sediments based on culture-dependent approaches and clone library analysis, the present results suggest that Illumina sequencing dramatically accelerates the discovery of fungal communities in deep-sea sediments. Furthermore, our results revealed that Sordariomycetes was the most diverse and abundant fungal class in this study, challenging the traditional view that the diversity of Sordariomycetes phylotypes is low in deep-sea environments. In addition, more than 12 taxa, accounting for 21.5% of sequences, have rarely been reported as deep-sea fungi, suggesting that the deep-sea sediments from Okinawa Trough harbor fungal communities that differ from those of other deep-sea environments. To our knowledge, this study is the first exploration of the fungal diversity in deep-sea sediments from Okinawa Trough using high-throughput Illumina sequencing.

  1. Resolving the Complexity of Human Skin Metagenomes Using Single-Molecule Sequencing

    PubMed Central

    Tsai, Yu-Chih; Deming, Clayton; Segre, Julia A.; Kong, Heidi H.; Korlach, Jonas

    2016-01-01

    Deep metagenomic shotgun sequencing has emerged as a powerful tool to interrogate composition and function of complex microbial communities. Computational approaches to assemble genome fragments have been demonstrated to be an effective tool for de novo reconstruction of genomes from these communities. However, the resultant “genomes” are typically fragmented and incomplete due to the limited ability of short-read sequence data to assemble complex or low-coverage regions. Here, we use single-molecule, real-time (SMRT) sequencing to reconstruct a high-quality, closed genome of a previously uncharacterized Corynebacterium simulans and its companion bacteriophage from a skin metagenomic sample. Considerable improvement in assembly quality occurs in hybrid approaches incorporating short-read data, with even relatively small amounts of long-read data being sufficient to improve metagenome reconstruction. Using short-read data to evaluate strain variation of this C. simulans in its skin community at single-nucleotide resolution, we observed a dominant C. simulans strain with moderate allelic heterozygosity throughout the population. We demonstrate the utility of SMRT sequencing and hybrid approaches in metagenome quantitation, reconstruction, and annotation. PMID:26861018

  2. Ultra-deep sequencing of intra-host rabies virus populations during cross-species transmission.

    PubMed

    Borucki, Monica K; Chen-Harris, Haiyin; Lao, Victoria; Vanier, Gilda; Wadford, Debra A; Messenger, Sharon; Allen, Jonathan E

    2013-11-01

    One of the hurdles to understanding the role of viral quasispecies in RNA virus cross-species transmission (CST) events is the need to analyze a densely sampled outbreak using deep sequencing in order to measure the amount of mutation occurring on a small time scale. In 2009, the California Department of Public Health reported a dramatic increase (350%) in the number of gray foxes infected with a rabies virus variant for which striped skunks serve as a reservoir host in Humboldt County. To better understand the evolution of rabies, deep-sequencing was applied to 40 unpassaged rabies virus samples from the Humboldt outbreak. For each sample, approximately 11 kb of the 12 kb genome was amplified and sequenced using the Illumina platform. Average coverage was 17,448 and this allowed characterization of the rabies virus population present in each sample at unprecedented depths. Phylogenetic analysis of the consensus sequence data demonstrated that samples clustered according to date (1995 vs. 2009) and geographic location (northern vs. southern). A single amino acid change in the G protein distinguished a subset of northern foxes from a haplotype present in both foxes and skunks, suggesting this mutation may have played a role in the observed increased transmission among foxes in this region. Deep-sequencing data indicated that many genetic changes associated with the CST event occurred prior to 2009 since several nonsynonymous mutations that were present in the consensus sequences of skunk and fox rabies samples obtained from 2003–2010 were present at the sub-consensus level (as rare variants in the viral population) in skunk and fox samples from 1995. These results suggest that analysis of rare variants within a viral population may yield clues to ancestral genomes and identify rare variants that have the potential to be selected for if environmental conditions change.

  3. Deep Impact Sequence Planning Using Multi-Mission Adaptable Planning Tools With Integrated Spacecraft Models

    NASA Technical Reports Server (NTRS)

    Wissler, Steven S.; Maldague, Pierre; Rocca, Jennifer; Seybold, Calina

    2006-01-01

    The Deep Impact mission was ambitious and challenging. JPL's well proven, easily adaptable multi-mission sequence planning tools combined with integrated spacecraft subsystem models enabled a small operations team to develop, validate, and execute extremely complex sequence-based activities within very short development times. This paper focuses on the core planning tool used in the mission, APGEN. It shows how the multi-mission design and adaptability of APGEN made it possible to model spacecraft subsystems as well as ground assets throughout the lifecycle of the Deep Impact project, starting with models of initial, high-level mission objectives, and culminating in detailed predictions of spacecraft behavior during mission-critical activities.

  4. Deep sequencing approaches for the analysis of prokaryotic transcriptional boundaries and dynamics.

    PubMed

    James, Katherine; Cockell, Simon J; Zenkin, Nikolay

    2017-05-01

    The identification of the protein-coding regions of a genome is straightforward due to the universality of start and stop codons. However, the boundaries of the transcribed regions, conditional operon structures, non-coding RNAs and the dynamics of transcription, such as pausing of elongation, are non-trivial to identify, even in the comparatively simple genomes of prokaryotes. Traditional methods for the study of these areas, such as tiling arrays, are noisy, labour-intensive and lack the resolution required for densely-packed bacterial genomes. Recently, deep sequencing has become increasingly popular for the study of the transcriptome due to its lower costs, higher accuracy and single nucleotide resolution. These methods have revolutionised our understanding of prokaryotic transcriptional dynamics. Here, we review the deep sequencing and data analysis techniques that are available for the study of transcription in prokaryotes, and discuss the bioinformatic considerations of these analyses. Copyright © 2017 Elsevier Inc. All rights reserved.

  5. Quantifying RNA allelic ratios by microfluidics-based multiplex PCR and deep sequencing

    PubMed Central

    Zhang, Rui; Li, Xin; Ramaswami, Gokul; Smith, Kevin S; Turecki, Gustavo; Montgomery, Stephen B; Li, Jin Billy

    2013-01-01

    We developed a targeted RNA sequencing method that couples microfluidics-based multiplex PCR and deep sequencing (mmPCR-seq) to uniformly and simultaneously amplify up to 960 loci in 48 samples independently of their gene expression levels, and accurately and cost-effectively measure allelic ratios even for low-quantity or low-quality RNA samples. We applied mmPCR-seq to RNA editing and allele-specific expression studies. mmPCR-seq complements RNA-seq and provides a highly desirable solution for future applications. PMID:24270603
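
    As a rough illustration of what measuring an allelic ratio from deep-sequencing counts involves, the sketch below computes the alternate-allele fraction at a locus from read counts. The function name `allelic_ratio` and the `min_depth` threshold are hypothetical; the abstract does not specify how mmPCR-seq filters or normalizes counts.

```python
def allelic_ratio(ref_count, alt_count, min_depth=20):
    """Return the alternate-allele fraction alt / (ref + alt).

    min_depth is an assumed cutoff for this sketch: loci with fewer
    covering reads return None because the ratio would be too noisy.
    """
    depth = ref_count + alt_count
    if depth < min_depth:
        return None
    return alt_count / depth


# Hypothetical per-locus read counts: (reference reads, alternate reads).
locus_counts = {"locus_1": (75, 25), "locus_2": (3, 2)}
ratios = {name: allelic_ratio(ref, alt) for name, (ref, alt) in locus_counts.items()}
```

    For an RNA editing site, the same ratio read as edited reads over total reads gives the editing level; for allele-specific expression, departure from the ratio expected from genomic DNA is the quantity of interest.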

  6. Prognostic value of deep sequencing method for minimal residual disease detection in multiple myeloma

    PubMed Central

    Lahuerta, Juan J.; Pepin, François; González, Marcos; Barrio, Santiago; Ayala, Rosa; Puig, Noemí; Montalban, María A.; Paiva, Bruno; Weng, Li; Jiménez, Cristina; Sopena, María; Moorhead, Martin; Cedena, Teresa; Rapado, Immaculada; Mateos, María Victoria; Rosiñol, Laura; Oriol, Albert; Blanchard, María J.; Martínez, Rafael; Bladé, Joan; San Miguel, Jesús; Faham, Malek; García-Sanz, Ramón

    2014-01-01

    We assessed the prognostic value of minimal residual disease (MRD) detection in multiple myeloma (MM) patients using a sequencing-based platform in bone marrow samples from 133 MM patients in at least very good partial response (VGPR) after front-line therapy. Deep sequencing was carried out in patients in whom a high-frequency myeloma clone was identified and MRD was assessed using the IGH-VDJH, IGH-DJH, and IGK assays. The results were contrasted with those of multiparametric flow cytometry (MFC) and allele-specific oligonucleotide polymerase chain reaction (ASO-PCR). The applicability of deep sequencing was 91%. Concordance between sequencing and MFC and ASO-PCR was 83% and 85%, respectively. Patients who were MRD– by sequencing had a significantly longer time to tumor progression (TTP) (median 80 vs 31 months; P < .0001) and overall survival (median not reached vs 81 months; P = .02), compared with patients who were MRD+. When stratifying patients by different levels of MRD, the respective TTP medians were: MRD ≥10^-3 27 months, MRD 10^-3 to 10^-5 48 months, and MRD <10^-5 80 months (P = .003 to .0001). Ninety-two percent of VGPR patients were MRD+. In complete response patients, the TTP remained significantly longer for MRD– compared with MRD+ patients (131 vs 35 months; P = .0009). PMID:24646471

  7. Prognostic value of deep sequencing method for minimal residual disease detection in multiple myeloma.

    PubMed

    Martinez-Lopez, Joaquin; Lahuerta, Juan J; Pepin, François; González, Marcos; Barrio, Santiago; Ayala, Rosa; Puig, Noemí; Montalban, María A; Paiva, Bruno; Weng, Li; Jiménez, Cristina; Sopena, María; Moorhead, Martin; Cedena, Teresa; Rapado, Immaculada; Mateos, María Victoria; Rosiñol, Laura; Oriol, Albert; Blanchard, María J; Martínez, Rafael; Bladé, Joan; San Miguel, Jesús; Faham, Malek; García-Sanz, Ramón

    2014-05-15

    We assessed the prognostic value of minimal residual disease (MRD) detection in multiple myeloma (MM) patients using a sequencing-based platform in bone marrow samples from 133 MM patients in at least very good partial response (VGPR) after front-line therapy. Deep sequencing was carried out in patients in whom a high-frequency myeloma clone was identified and MRD was assessed using the IGH-VDJH, IGH-DJH, and IGK assays. The results were contrasted with those of multiparametric flow cytometry (MFC) and allele-specific oligonucleotide polymerase chain reaction (ASO-PCR). The applicability of deep sequencing was 91%. Concordance between sequencing and MFC and ASO-PCR was 83% and 85%, respectively. Patients who were MRD(-) by sequencing had a significantly longer time to tumor progression (TTP) (median 80 vs 31 months; P < .0001) and overall survival (median not reached vs 81 months; P = .02), compared with patients who were MRD(+). When stratifying patients by different levels of MRD, the respective TTP medians were: MRD ≥10(-3) 27 months, MRD 10(-3) to 10(-5) 48 months, and MRD <10(-5) 80 months (P = .003 to .0001). Ninety-two percent of VGPR patients were MRD(+). In complete response patients, the TTP remained significantly longer for MRD(-) compared with MRD(+) patients (131 vs 35 months; P = .0009).

  8. miRBase: integrating microRNA annotation and deep-sequencing data.

    PubMed

    Kozomara, Ana; Griffiths-Jones, Sam

    2011-01-01

    miRBase is the primary online repository for all microRNA sequences and annotation. The current release (miRBase 16) contains over 15,000 microRNA gene loci in over 140 species, and over 17,000 distinct mature microRNA sequences. Deep-sequencing technologies have delivered a sharp rise in the rate of novel microRNA discovery. We have mapped reads from short RNA deep-sequencing experiments to microRNAs in miRBase and developed web interfaces to view these mappings. The user can view all read data associated with a given microRNA annotation, filter reads by experiment and count, and search for microRNAs by tissue- and stage-specific expression. These data can be used as a proxy for relative expression levels of microRNA sequences, provide detailed evidence for microRNA annotations and alternative isoforms of mature microRNAs, and allow us to revisit previous annotations. miRBase is available online at: http://www.mirbase.org/.

  9. Deep sequencing of Lotus corniculatus L. reveals key enzymes and potential transcription factors related to the flavonoid biosynthesis pathway.

    PubMed

    Wang, Ying; Hua, Wenping; Wang, Jian; Hannoufa, Abdelali; Xu, Ziqin; Wang, Zhezhi

    2013-04-01

    Lotus corniculatus L. is used worldwide as a forage crop due to its abundance of secondary metabolites and its ability to grow in severe environments. Although the entire genome of L. corniculatus var. japonicus R. is being sequenced, the differences in morphology and production of secondary metabolites between these two related species have led us to investigate this variability at the genetic level, in particular the differences in flavonoid biosynthesis. Our goal is to use the resulting information to develop more valuable forage crops and medicinal materials. Here, we conducted Illumina/Solexa sequencing to profile the transcriptome of L. corniculatus. We produced 26,492,952 short reads that corresponded to 2.38 gigabytes of total nucleotides. These reads were then assembled into 45,698 unigenes, of which a large number associated with secondary metabolism were annotated. In addition, we identified 2,998 unigenes based on homology with L. japonicus transcription factors (TFs) and grouped them into 55 families. Meanwhile, a comparison of four tag-based digital gene expression libraries, built from the flowers, pods, leaves, and roots, revealed distinct patterns of spatial expression of candidate unigenes in flavonoid biosynthesis. Based on these results, we identified many key enzymes from L. corniculatus which were different from reference genes of L. japonicus, and five TFs that are potential enhancers in flavonoid biosynthesis. Our results provide initial genetics resources that will be valuable in efforts to manipulate the flavonoid metabolic pathway in plants.

  10. Seismic sequence stratigraphy of Tertiary sediments, offshore Sarawak deep-water area

    SciTech Connect

    Mohammad, A.M.

    1994-07-01

    Tectonic processes and sea level changes are the key factors that have strongly influenced clastic and carbonate sedimentation in the Sarawak deep-water area. A seismic sequence stratigraphy of Tertiary sediments was conducted in the area with the main objective of developing a workable genetic chronostratigraphic framework that defines the sequence and systems tract boundaries within which depositional systems and lithofacies can be identified, mapped and interpreted. This study has resulted in the identification of eight major depositional sequences that are bounded by regional unconformities and correlative conformities. These sequences can generally be grouped into four megasequences, based on the main tectonic events observed in the area. Three systems tracts of a type-1, third-order sequence boundary were recognized in most of the sequences: lowstand, transgressive, and highstand systems tracts. The lowstand systems tract includes basin-floor fans, slope fans, and lowstand prograding wedges. Paleoenvironmental distribution maps constructed for each of the sequences using seismic facies analysis and nearby well control suggest that the sequence intervals are predominantly transgressive units that have been intermittently interrupted by regressive pulses brought about by changes in eustatic sea level. The trend of the paleocoastline observed during Oligocene to Miocene times changes from a northwest-southeast orientation to a position roughly parallel to the present coastline. Seismic facies maps generated from late Oligocene to early Miocene indicate the depositional environment was coastal to coastal plain in the western and the middle part of the study area, becoming more marine toward the east and northeast.

  11. SOAP3: ultra-fast GPU-based parallel alignment tool for short reads.

    PubMed

    Liu, Chi-Man; Wong, Thomas; Wu, Edward; Luo, Ruibang; Yiu, Siu-Ming; Li, Yingrui; Wang, Bingqiang; Yu, Chang; Chu, Xiaowen; Zhao, Kaiyong; Li, Ruiqiang; Lam, Tak-Wah

    2012-03-15

    SOAP3 is the first short read alignment tool that leverages the multi-processors in a graphics processing unit (GPU) to achieve a drastic improvement in speed. We adapted the compressed full-text index (BWT) used by SOAP2 in view of the advantages and disadvantages of the GPU. When tested with millions of Illumina HiSeq 2000 length-100 bp reads, SOAP3 takes < 30 s to align a million read pairs onto the human reference genome and is at least 7.5 and 20 times faster than BWA and Bowtie, respectively. For aligning reads with up to four mismatches, SOAP3 aligns slightly more reads than BWA and Bowtie; this is because SOAP3, unlike BWA and Bowtie, is not heuristic-based and always reports all answers.
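
    As a rough illustration of the BWT-based indexing mentioned in the abstract, the following minimal Python sketch builds a Burrows-Wheeler transform naively and counts exact occurrences of a read by backward search. It is a toy under stated assumptions: the function names `bwt_index` and `count_matches` are hypothetical, the index is uncompressed, and nothing here reproduces SOAP3's GPU implementation or its mismatch handling.

```python
from collections import Counter


def bwt_index(text):
    """Build a naive BWT index of `text` (hypothetical helper, toy inputs only)."""
    text += "$"  # terminator, sorts before A/C/G/T
    # Suffix array by direct sorting; real tools use far more efficient builders.
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    bwt = "".join(text[i - 1] for i in sa)  # text[-1] is '$' when i == 0
    # first[c] = number of characters in text strictly smaller than c
    counts = Counter(text)
    first, total = {}, 0
    for c in sorted(counts):
        first[c] = total
        total += counts[c]
    # occ[c][i] = occurrences of c in bwt[:i]
    occ = {c: [0] * (len(bwt) + 1) for c in counts}
    for i, ch in enumerate(bwt):
        for c in occ:
            occ[c][i + 1] = occ[c][i] + (1 if c == ch else 0)
    return first, occ, len(bwt)


def count_matches(index, read):
    """Count exact occurrences of `read` using backward search."""
    first, occ, n = index
    lo, hi = 0, n  # current suffix-array interval [lo, hi)
    for c in reversed(read):
        if c not in first:
            return 0
        lo = first[c] + occ[c][lo]
        hi = first[c] + occ[c][hi]
        if lo >= hi:
            return 0
    return hi - lo


index = bwt_index("GATTACAGATTACA")
print(count_matches(index, "ATTA"))
```

    Backward search touches the index once per read character, which is why BWT-style indexes suit large batches of short reads.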

  12. Genomic dark matter: the reliability of short read mapping illustrated by the genome mappability score

    PubMed Central

    Lee, Hayan; Schatz, Michael C.

    2012-01-01

    Motivation: Genome resequencing and short read mapping are two of the primary tools of genomics and are used for many important applications. The current state-of-the-art in mapping uses the quality values and mapping quality scores to evaluate the reliability of the mapping. These attributes, however, are assigned to individual reads and do not directly measure the problematic repeats across the genome. Here, we present the Genome Mappability Score (GMS) as a novel measure of the complexity of resequencing a genome. The GMS is a weighted probability that any read could be unambiguously mapped to a given position and thus measures the overall composition of the genome itself. Results: We have developed the Genome Mappability Analyzer to compute the GMS of every position in a genome. It leverages the parallelism of cloud computing to analyze large genomes, and enabled us to identify the 5–14% of the human, mouse, fly and yeast genomes that are difficult to analyze with short reads. We examined the accuracy of the widely used BWA/SAMtools polymorphism discovery pipeline in the context of the GMS, and found discovery errors are dominated by false negatives, especially in regions with poor GMS. These errors are fundamental to the mapping process and cannot be overcome by increasing coverage. As such, the GMS should be considered in every resequencing project to pinpoint the ‘dark matter’ of the genome, including of known clinically relevant variations in these regions. Availability: The source code and profiles of several model organisms are available at http://gma-bio.sourceforge.net Contact: hlee@cshl.edu Supplementary Information: Supplementary data are available at Bioinformatics online. PMID:22668792
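
    The abstract defines the GMS as a weighted probability of unambiguous mapping. As a much simpler illustration of the underlying idea, the sketch below scores each position by the fraction of read-length windows covering it whose sequence occurs exactly once in the genome. This is a hypothetical stand-in and not the GMS formula: the name `toy_mappability` is invented here, and the sketch ignores reverse complements, mismatches, and mapping quality.

```python
from collections import Counter


def toy_mappability(genome, read_len):
    """Per-position fraction of covering windows that are unique (toy measure)."""
    n_windows = len(genome) - read_len + 1
    if read_len <= 0 or n_windows <= 0:
        raise ValueError("read_len must be between 1 and len(genome)")
    windows = [genome[i:i + read_len] for i in range(n_windows)]
    counts = Counter(windows)
    unique = [counts[w] == 1 for w in windows]
    scores = []
    for pos in range(len(genome)):
        # Windows starting in [start, end] overlap position `pos`.
        start = max(0, pos - read_len + 1)
        end = min(pos, n_windows - 1)
        covering = unique[start:end + 1]
        scores.append(sum(covering) / len(covering))
    return scores


print(toy_mappability("ACACGT", 2))
```

    Positions inside the repeated "AC" score lower than positions in the unique tail, mirroring how repeats reduce mappability regardless of sequencing depth.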

  13. CPSS: a computational platform for the analysis of small RNA deep sequencing data.

    PubMed

    Zhang, Yuanwei; Xu, Bo; Yang, Yifan; Ban, Rongjun; Zhang, Huan; Jiang, Xiaohua; Cooke, Howard J; Xue, Yu; Shi, Qinghua

    2012-07-15

    Next generation sequencing (NGS) techniques have been widely used to document the small ribonucleic acids (RNAs) implicated in a variety of biological, physiological and pathological processes. An integrated computational tool is needed for handling and analysing the enormous datasets from small RNA deep sequencing approach. Herein, we present a novel web server, CPSS (a computational platform for the analysis of small RNA deep sequencing data), designed to completely annotate and functionally analyse microRNAs (miRNAs) from NGS data on one platform with a single data submission. Small RNA NGS data can be submitted to this server with analysis results being returned in two parts: (i) annotation analysis, which provides the most comprehensive analysis for small RNA transcriptome, including length distribution and genome mapping of sequencing reads, small RNA quantification, prediction of novel miRNAs, identification of differentially expressed miRNAs, piwi-interacting RNAs and other non-coding small RNAs between paired samples and detection of miRNA editing and modifications and (ii) functional analysis, including prediction of miRNA targeted genes by multiple tools, enrichment of gene ontology terms, signalling pathway involvement and protein-protein interaction analysis for the predicted genes. CPSS, a ready-to-use web server that integrates most functions of currently available bioinformatics tools, provides all the information wanted by the majority of users from small RNA deep sequencing datasets. CPSS is implemented in PHP/PERL+MySQL+R and can be freely accessed at http://mcg.ustc.edu.cn/db/cpss/index.html or http://mcg.ustc.edu.cn/sdap1/cpss/index.html.

  14. A transcriptional sketch of a primary human breast cancer by 454 deep sequencing.

    PubMed

    Guffanti, Alessandro; Iacono, Michele; Pelucchi, Paride; Kim, Namshin; Soldà, Giulia; Croft, Larry J; Taft, Ryan J; Rizzi, Ermanno; Askarian-Amiri, Marjan; Bonnal, Raoul J; Callari, Maurizio; Mignone, Flavio; Pesole, Graziano; Bertalot, Giovanni; Bernardi, Luigi Rossi; Albertini, Alberto; Lee, Christopher; Mattick, John S; Zucchi, Ileana; De Bellis, Gianluca

    2009-04-20

    The cancer transcriptome is difficult to explore due to the heterogeneity of quantitative and qualitative changes in gene expression linked to the disease status. An increasing number of "unconventional" transcripts, such as novel isoforms, non-coding RNAs, somatic gene fusions and deletions have been associated with the tumoral state. Massively parallel sequencing techniques provide a framework for exploring the transcriptional complexity inherent to cancer with a limited laboratory and financial effort. We developed a deep sequencing and bioinformatics analysis protocol to investigate the molecular composition of a breast cancer poly(A)+ transcriptome. This method utilizes a cDNA library normalization step to diminish the representation of highly expressed transcripts and biology-oriented bioinformatic analyses to facilitate detection of rare and novel transcripts. We analyzed over 132,000 Roche 454 high-confidence deep sequencing reads from a primary human lobular breast cancer tissue specimen, and detected a range of unusual transcriptional events that were subsequently validated by RT-PCR in additional eight primary human breast cancer samples. We identified and validated one deletion, two novel ncRNAs (one intergenic and one intragenic), ten previously unknown or rare transcript isoforms and a novel gene fusion specific to a single primary tissue sample. We also explored the non-protein-coding portion of the breast cancer transcriptome, identifying thousands of novel non-coding transcripts and more than three hundred reads corresponding to the non-coding RNA MALAT1, which is highly expressed in many human carcinomas. Our results demonstrate that combining 454 deep sequencing with a normalization step and careful bioinformatic analysis facilitates the discovery and quantification of rare transcripts or ncRNAs, and can be used as a qualitative tool to characterize transcriptome complexity, revealing many hitherto unknown transcripts, splice isoforms, gene

  15. A transcriptional sketch of a primary human breast cancer by 454 deep sequencing

    PubMed Central

    Guffanti, Alessandro; Iacono, Michele; Pelucchi, Paride; Kim, Namshin; Soldà, Giulia; Croft, Larry J; Taft, Ryan J; Rizzi, Ermanno; Askarian-Amiri, Marjan; Bonnal, Raoul J; Callari, Maurizio; Mignone, Flavio; Pesole, Graziano; Bertalot, Giovanni; Bernardi, Luigi Rossi; Albertini, Alberto; Lee, Christopher; Mattick, John S; Zucchi, Ileana; De Bellis, Gianluca

    2009-01-01

    Background The cancer transcriptome is difficult to explore due to the heterogeneity of quantitative and qualitative changes in gene expression linked to the disease status. An increasing number of "unconventional" transcripts, such as novel isoforms, non-coding RNAs, somatic gene fusions and deletions have been associated with the tumoral state. Massively parallel sequencing techniques provide a framework for exploring the transcriptional complexity inherent to cancer with a limited laboratory and financial effort. We developed a deep sequencing and bioinformatics analysis protocol to investigate the molecular composition of a breast cancer poly(A)+ transcriptome. This method utilizes a cDNA library normalization step to diminish the representation of highly expressed transcripts and biology-oriented bioinformatic analyses to facilitate detection of rare and novel transcripts. Results We analyzed over 132,000 Roche 454 high-confidence deep sequencing reads from a primary human lobular breast cancer tissue specimen, and detected a range of unusual transcriptional events that were subsequently validated by RT-PCR in additional eight primary human breast cancer samples. We identified and validated one deletion, two novel ncRNAs (one intergenic and one intragenic), ten previously unknown or rare transcript isoforms and a novel gene fusion specific to a single primary tissue sample. We also explored the non-protein-coding portion of the breast cancer transcriptome, identifying thousands of novel non-coding transcripts and more than three hundred reads corresponding to the non-coding RNA MALAT1, which is highly expressed in many human carcinomas. Conclusion Our results demonstrate that combining 454 deep sequencing with a normalization step and careful bioinformatic analysis facilitates the discovery and quantification of rare transcripts or ncRNAs, and can be used as a qualitative tool to characterize transcriptome complexity, revealing many hitherto unknown

  16. Continuous Distributed Representation of Biological Sequences for Deep Proteomics and Genomics.

    PubMed

    Asgari, Ehsaneddin; Mofrad, Mohammad R K

    2015-01-01

    We introduce a new representation and feature extraction method for biological sequences. Named bio-vectors (BioVec) to refer to biological sequences in general with protein-vectors (ProtVec) for proteins (amino-acid sequences) and gene-vectors (GeneVec) for gene sequences, this representation can be widely used in applications of deep learning in proteomics and genomics. In the present paper, we focus on protein-vectors that can be utilized in a wide array of bioinformatics investigations such as family classification, protein visualization, structure prediction, disordered protein identification, and protein-protein interaction prediction. In this method, we adopt artificial neural network approaches and represent a protein sequence with a single dense n-dimensional vector. To evaluate this method, we apply it in classification of 324,018 protein sequences obtained from Swiss-Prot belonging to 7,027 protein families, where an average family classification accuracy of 93%±0.06% is obtained, outperforming existing family classification methods. In addition, we use ProtVec representation to predict disordered proteins from structured proteins. Two databases of disordered sequences are used: the DisProt database as well as a database featuring the disordered regions of nucleoporins rich with phenylalanine-glycine repeats (FG-Nups). Using support vector machine classifiers, FG-Nup sequences are distinguished from structured protein sequences found in Protein Data Bank (PDB) with a 99.8% accuracy, and unstructured DisProt sequences are differentiated from structured DisProt sequences with 100.0% accuracy. These results indicate that by only providing sequence data for various proteins into this model, accurate information about protein structure can be determined. Importantly, this model needs to be trained only once and can then be applied to extract a comprehensive set of information regarding proteins of interest. 
    Moreover, this representation can be considered as

  17. Continuous Distributed Representation of Biological Sequences for Deep Proteomics and Genomics

    PubMed Central

    Asgari, Ehsaneddin; Mofrad, Mohammad R. K.

    2015-01-01

    We introduce a new representation and feature extraction method for biological sequences. Named bio-vectors (BioVec) to refer to biological sequences in general with protein-vectors (ProtVec) for proteins (amino-acid sequences) and gene-vectors (GeneVec) for gene sequences, this representation can be widely used in applications of deep learning in proteomics and genomics. In the present paper, we focus on protein-vectors that can be utilized in a wide array of bioinformatics investigations such as family classification, protein visualization, structure prediction, disordered protein identification, and protein-protein interaction prediction. In this method, we adopt artificial neural network approaches and represent a protein sequence with a single dense n-dimensional vector. To evaluate this method, we apply it in classification of 324,018 protein sequences obtained from Swiss-Prot belonging to 7,027 protein families, where an average family classification accuracy of 93%±0.06% is obtained, outperforming existing family classification methods. In addition, we use ProtVec representation to predict disordered proteins from structured proteins. Two databases of disordered sequences are used: the DisProt database as well as a database featuring the disordered regions of nucleoporins rich with phenylalanine-glycine repeats (FG-Nups). Using support vector machine classifiers, FG-Nup sequences are distinguished from structured protein sequences found in Protein Data Bank (PDB) with a 99.8% accuracy, and unstructured DisProt sequences are differentiated from structured DisProt sequences with 100.0% accuracy. These results indicate that by only providing sequence data for various proteins into this model, accurate information about protein structure can be determined. Importantly, this model needs to be trained only once and can then be applied to extract a comprehensive set of information regarding proteins of interest. 
    Moreover, this representation can be considered as

  18. Analysis of insertion-deletion from deep-sequencing data: software evaluation for optimal detection.

    PubMed

    Neuman, Joseph A; Isakov, Ofer; Shomron, Noam

    2013-01-01

    Insertion and deletion (indel) mutations, the most common type of structural variance in the human genome, affect a multitude of human traits and diseases. New sequencing technologies, such as deep sequencing, allow massive throughput of sequence data and greatly contribute to the field of disease-causing mutation detection, in general, and indel detection, specifically. In order to infer indel presence (indel calling), the deep-sequencing data have to undergo comprehensive computational analysis. The choice of indel-calling software can often skew the results, and inherent tool limitations may affect downstream analysis. In order to better understand these inter-software differences, we evaluated the performance of several indel-calling software tools for short indel (1-10 nt) detection. We compared the tools' sensitivity and predictive values in the presence of varying parameters such as read depth (coverage), read length, indel size and frequency. We pinpoint several key features that assist successful experimental design and appropriate tool selection. Our study may also serve as a basis for future evaluation of additional indel calling methods.
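
    The sensitivity and positive predictive value compared in such an evaluation can be computed as in the minimal Python sketch below. It assumes, purely for illustration, that each indel is represented as a hypothetical `(chrom, pos, ref, alt)` tuple and that a call counts as correct only on an exact match; the abstract does not state the matching rule the authors used.

```python
def indel_call_metrics(called, truth):
    """Compare called indels with a truth set using exact matching (assumption)."""
    called, truth = set(called), set(truth)
    tp = len(called & truth)   # true positives: called and real
    fp = len(called - truth)   # false positives: called but not real
    fn = len(truth - called)   # false negatives: real but missed
    sensitivity = tp / (tp + fn) if truth else float("nan")
    ppv = tp / (tp + fp) if called else float("nan")
    return {"tp": tp, "fp": fp, "fn": fn, "sensitivity": sensitivity, "ppv": ppv}


# Hypothetical example data, not taken from the study.
truth = [("chr1", 100, "A", "AT"), ("chr1", 250, "GC", "G"),
         ("chr2", 40, "T", "TAA"), ("chr2", 900, "CAG", "C")]
called = [("chr1", 100, "A", "AT"), ("chr1", 250, "GC", "G"),
          ("chr3", 7, "G", "GT")]
print(indel_call_metrics(called, truth))
```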

  19. Deep Sequencing Analysis of HBV Genotype Shift and Correlation with Antiviral Efficiency during Adefovir Dipivoxil Therapy

    PubMed Central

    Shan, Youlan; Huang, Wenxiang; Zhang, Dazhi; Zen, Aizhong; Zhou, Xin; Zhao, Yao; Gong, Xuyang; Xu, Ge; Zhang, Xiuyu; Chen, Juan; Huang, Ailong

    2015-01-01

    Background Viral genotype shift in chronic hepatitis B (CHB) patients during antiviral therapy has been reported, but the underlying mechanism remains elusive. Methods 38 CHB patients treated with ADV for one year were selected for studying genotype shift by both deep sequencing and Sanger sequencing methods. Results The Sanger sequencing method found that 7.9% of patients showed mixed genotype before ADV therapy. In contrast, all 38 patients showed mixed genotype before ADV treatment by deep sequencing. A 95.5% mixed genotype rate was also obtained from an additional 200 treatment-naïve CHB patients. Of the 13 patients with genotype shift, the fraction of the minor genotype in 5 patients (38%) increased gradually during the course of ADV treatment. Furthermore, responses to ADV and HBeAg seroconversion were associated with the high rate of genotype shift, suggesting drug and immune pressure may be key factors to induce genotype shift. Interestingly, patients with genotype C had a significantly higher rate of genotype shift than genotype B. In the genotype shift group, ADV treatment induced a marked enhancement of genotype B ratio accompanied by a reduction of genotype C ratio, suggesting genotype C may be more sensitive to ADV than genotype B. Moreover, patients with dominant genotype C may have a better therapeutic effect. Finally, genotype shift was correlated with clinical improvement in terms of ALT. Conclusions Our findings provided a rational explanation for genotype shift among ADV-treated CHB patients. The genotype and genotype shift might be associated with antiviral efficiency. PMID:26110616

  20. Use of deep sequencing data for routine analysis of HIV resistance in newly diagnosed patients

    PubMed Central

    Fernández-Caballero, Jose-Angel; Chueca, Natalia; Alvarez, Marta; Gonzalez, Dimitri; García, Federico

    2014-01-01

    Introduction Use of deep sequencing is becoming a critical tool in clinical virology, with an important impact in the HIV field for routine diagnostic purposes. Here, we present the comparison of deep and Sanger sequencing in newly diagnosed HIV patients, and the use of DeepChek v1.3 & VisibleChek for their interpretation and integration with virological and clinical data. Patients and Methods Plasma samples from 88 newly diagnosed HIV-1-infected patients were included in the study. Median age (IQR) was 37 (27–47), median CD4 count (IQR) was 387 (220–554), and 85% were males. Median Viral Load (Log, IQR) was 5.03 (4.51–5.53). Deep sequencing was obtained using a GS-Junior (Roche). Sequences were preprocessed with the 454 AVA software; aligned reads were uploaded into the DeepChek v1.3 system (ABL SA). Sanger sequences (Trugene) were uploaded in parallel. Stanford algorithm (version 7.0) resistance interpretation to first line drugs and all the mutations (score≥5) were analyzed. For deep sequencing, 1%, 5% and 10% thresholds were chosen for resistance interpretation. Results Using VisibleChek for analysis, we were able to describe the detection of any mutation using Sanger in 37/88 patients, with a total number of 50 Stanford ≥5 mutations, K103N and E138A being the most prevalent (n=4). Using UDS-1%, we found 72/88 patients with at least one mutation (total of 206 Stanford ≥5 mutations). Using Sanger data, 9/88 patients (10.22%) showed any resistance to NNRTIs, while none showed resistance to NRTIs or PIs. Using UDS-10% increased resistance to NRTIs [3/88 (3.40%)], to NNRTIs [12/88 (13.63%)], and to a lesser extent to PIs [1/88 (1.13%)]. Using UDS-5% increased resistance to NRTIs [4/88 (4.54%)] and to NNRTIs [12/88 (13.63%)], but not to PIs. Using UDS-1% increased resistance to all classes: NRTIs [14/88 (15.90%)], NNRTIs [26/88 (30.68%)], and PIs [6/88 (6.81%)]. Conclusions DeepChek and VisibleChek allow for an easy, reliable and rapid analysis of UDS data

  1. Use of deep sequencing data for routine analysis of HIV resistance in newly diagnosed patients.

    PubMed

    Fernández-Caballero, Jose-Angel; Chueca, Natalia; Alvarez, Marta; Gonzalez, Dimitri; García, Federico

    2014-01-01

    Use of deep sequencing is becoming a critical tool in clinical virology, with an important impact in the HIV field for routine diagnostic purposes. Here, we present the comparison of deep and Sanger sequencing in newly diagnosed HIV patients, and the use of DeepChek v1.3 & VisibleChek for their interpretation and integration with virological and clinical data. Plasma samples from 88 newly diagnosed HIV-1-infected patients were included in the study. Median age (IQR) was 37 (27-47), median CD4 count (IQR) was 387 (220-554), and 85% were males. Median Viral Load (Log, IQR) was 5.03 (4.51-5.53). Deep sequencing was obtained using a GS-Junior (Roche). Sequences were preprocessed with the 454 AVA software; aligned reads were uploaded into the DeepChek v1.3 system (ABL SA). Sanger sequences (Trugene) were uploaded in parallel. Stanford algorithm (version 7.0) resistance interpretation to first line drugs and all the mutations (score≥5) were analyzed. For deep sequencing, 1%, 5% and 10% thresholds were chosen for resistance interpretation. Using VisibleChek for analysis, we were able to describe the detection of any mutation using Sanger in 37/88 patients, with a total number of 50 Stanford ≥5 mutations, K103N and E138A being the most prevalent (n=4). Using UDS-1%, we found 72/88 patients with at least one mutation (total of 206 Stanford ≥5 mutations). Using Sanger data, 9/88 patients (10.22%) showed any resistance to NNRTIs, while none showed resistance to NRTIs or PIs. Using UDS-10% increased resistance to NRTIs [3/88 (3.40%)], to NNRTIs [12/88 (13.63%)], and to a lesser extent to PIs [1/88 (1.13%)]. Using UDS-5% increased resistance to NRTIs [4/88 (4.54%)] and to NNRTIs [12/88 (13.63%)], but not to PIs. Using UDS-1% increased resistance to all classes: NRTIs [14/88 (15.90%)], NNRTIs [26/88 (30.68%)], and PIs [6/88 (6.81%)]. DeepChek and VisibleChek allow for an easy, reliable and rapid analysis of UDS data from HIV-1. Compared to Sanger data, UDS detected a higher

  2. Sequence-based prediction of protein protein interaction using a deep-learning algorithm.

    PubMed

    Sun, Tanlin; Zhou, Bo; Lai, Luhua; Pei, Jianfeng

    2017-05-25

    Protein-protein interactions (PPIs) are critical for many biological processes. It is therefore important to develop accurate high-throughput methods for identifying PPI to better understand protein function, disease occurrence, and therapy design. Though various computational methods for predicting PPI have been developed, their robustness for prediction with external datasets is unknown. Deep-learning algorithms have achieved successful results in diverse areas, but their effectiveness for PPI prediction has not been tested. We used a stacked autoencoder, a type of deep-learning algorithm, to study the sequence-based PPI prediction. The best model achieved an average accuracy of 97.19% with 10-fold cross-validation. The prediction accuracies for various external datasets ranged from 87.99% to 99.21%, which are superior to those achieved with previous methods. To our knowledge, this research is the first to apply a deep-learning algorithm to sequence-based PPI prediction, and the results demonstrate its potential in this field.

  3. An introduction to Deep learning on biological sequence data - Examples and solutions.

    PubMed

    Jurtz, Vanessa Isabell; Rosenberg Johansen, Alexander; Nielsen, Morten; Almagro Armenteros, Jose Juan; Nielsen, Henrik; Kaae Sønderby, Casper; Winther, Ole; Kaae Sønderby, Søren

    2017-08-23

    Deep neural network architectures such as convolutional and long short-term memory networks have become increasingly popular as machine learning tools during the recent years. The availability of greater computational resources, more data, new algorithms for training deep models and easy to use libraries for implementation and training of neural networks are the drivers of this development. The use of deep learning has been especially successful in image recognition; and the development of tools, applications and code examples are in most cases centered within this field rather than within biology. Here, we aim to further the development of deep learning methods within biology by providing application examples and ready to apply and adapt code templates. Given such examples, we illustrate how architectures consisting of convolutional and long short-term memory neural networks can relatively easily be designed and trained to state-of-the-art performance on three biological sequence problems: prediction of subcellular localization, protein secondary structure and the binding of peptides to MHC Class II molecules. All implementations and datasets are available online to the scientific community at https://github.com/vanessajurtz/lasagne4bio . Supplementary data are available at Bioinformatics online.

  4. Genome Sequence of the Deep-Sea Denitrifier Pseudomonas sp. Strain MT-1, Isolated from the Mariana Trench.

    PubMed

    Fujinami, Shun; Oikawa, Yuji; Araki, Takuma; Shinmura, Yui; Midorikawa, Ryota; Ishizaka, Hikari; Kato, Chiaki; Horikoshi, Koki; Ito, Masahiro; Tamegai, Hideyuki

    2014-12-18

    Pseudomonas sp. strain MT-1 was the first deep-sea denitrifier isolated and characterized from mud recovered from a depth of 11,000 m in the Mariana Trench. We report here the genome sequence of this bacterium, which contributes to our understanding of denitrification and bioenergetics in the deep sea.

  5. HomozygosityMapper2012--bridging the gap between homozygosity mapping and deep sequencing.

    PubMed

    Seelow, Dominik; Schuelke, Markus

    2012-07-01

    Homozygosity mapping is a common method to map recessive traits in consanguineous families. To facilitate these analyses, we have developed HomozygosityMapper, a web-based approach to homozygosity mapping. HomozygosityMapper allows researchers to directly upload the genotype files produced by the major genotyping platforms as well as deep sequencing data. It detects stretches of homozygosity shared by the affected individuals and displays them graphically. Users can interactively inspect the underlying genotypes, manually refine these regions and eventually submit them to our candidate gene search engine GeneDistiller to identify the most promising candidate genes. Here, we present the new version of HomozygosityMapper. The most striking new feature is the support of Next Generation Sequencing *.vcf files as input. Upon users' requests, we have implemented the analysis of common experimental rodents as well as of important farm animals. Furthermore, we have extended the options for single families and loss of heterozygosity studies. Another new feature is the export of *.bed files for targeted enrichment of the potential disease regions for deep sequencing strategies. HomozygosityMapper also generates files for conventional linkage analyses which are already restricted to the possible disease regions, hence superseding CPU-intensive genome-wide analyses. HomozygosityMapper is freely available at http://www.homozygositymapper.org/.
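
    The idea of detecting stretches of homozygosity shared by affected individuals can be sketched in Python as follows. This is a simplified illustration, not HomozygosityMapper's actual scoring: the function names, the two-letter genotype strings, and the `min_markers` cutoff are all assumptions made for the example.

```python
def is_shared_homozygous(calls):
    """True if every individual has the same homozygous genotype, e.g. 'AA'."""
    if not calls:
        return False
    first = calls[0]
    return len(first) == 2 and first[0] == first[1] and all(c == first for c in calls)


def shared_homozygous_runs(markers, min_markers=3):
    """Return (start_pos, end_pos, n_markers) for each qualifying run.

    `markers` is a position-sorted list of (position, [genotype per individual]).
    """
    runs, start, last, count = [], None, None, 0
    for pos, calls in markers:
        if is_shared_homozygous(calls):
            if start is None:
                start, count = pos, 0
            count += 1
            last = pos
        else:
            if start is not None and count >= min_markers:
                runs.append((start, last, count))
            start = None
    # Close a run that extends to the final marker.
    if start is not None and count >= min_markers:
        runs.append((start, last, count))
    return runs


# Hypothetical genotypes for two affected individuals.
markers = [(100, ["AA", "AA"]), (200, ["GG", "GG"]), (300, ["CC", "CC"]),
           (400, ["AG", "AA"]), (500, ["TT", "TT"]), (600, ["CC", "CC"])]
print(shared_homozygous_runs(markers))
```

    Requiring the same homozygous genotype in all affected individuals reflects the expectation that a recessive disease allele is inherited identical by descent within a consanguineous family.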

  6. Ultra-deep sequencing of VHSV isolates contributes to understanding the role of viral quasispecies.

    PubMed

    Schönherz, Anna A; Lorenzen, Niels; Guldbrandtsen, Bernt; Buitenhuis, Bart; Einer-Jensen, Katja

    2016-01-08

    The high mutation rate of RNA viruses enables the generation of a genetically diverse viral population, termed a quasispecies, within a single infected host. This high in-host genetic diversity enables an RNA virus to adapt to a diverse array of selective pressures such as host immune response and switching between host species. The negative-sense, single-stranded RNA virus, viral haemorrhagic septicaemia virus (VHSV), was originally considered an epidemic virus of cultured rainbow trout in Europe, but was later proved to be endemic among a range of marine fish species in the Northern hemisphere. To better understand the nature of a virus quasispecies related to the evolutionary potential of VHSV, a deep-sequencing protocol specific to VHSV was established and applied to 4 VHSV isolates, 2 originating from rainbow trout and 2 from Atlantic herring. Each isolate was subjected to Illumina paired end shotgun sequencing after PCR amplification and the 11.1 kb genome was successfully sequenced with an average coverage of 0.5-1.9 × 10^6 sequenced copies. Differences in single nucleotide polymorphism (SNP) frequency were detected both within and between isolates, possibly related to their stage of adaptation to host species and host immune reactions. The N, M, P and Nv genes appeared nearly fixed, while genetic variation in the G and L genes demonstrated presence of diverse genetic populations particularly in two isolates. The results demonstrate that deep sequencing and analysis methodologies can be useful for future in vivo host adaption studies of VHSV.

  7. 3′ terminal diversity of MRP RNA and other human noncoding RNAs revealed by deep sequencing

    PubMed Central

    2013-01-01

    Background Post-transcriptional 3′ end processing is a key component of RNA regulation. The abundant and essential RNA subunit of RNase MRP has been proposed to function in three distinct cellular compartments and therefore may utilize this mode of regulation. Here we employ 3′ RACE coupled with high-throughput sequencing to characterize the 3′ terminal sequences of human MRP RNA and other noncoding RNAs that form RNP complexes. Results The 3′ terminal sequence of MRP RNA from HEK293T cells has a distinctive distribution of genomically encoded termini (including an assortment of U residues) with a portion of these selectively tagged by oligo(A) tails. This profile contrasts with the relatively homogenous 3′ terminus of an in vitro transcribed MRP RNA control and the differing 3′ terminal profiles of U3 snoRNA, RNase P RNA, and telomerase RNA (hTR). Conclusions 3′ RACE coupled with deep sequencing provides a valuable framework for the functional characterization of 3′ terminal sequences of noncoding RNAs. PMID:24053768

  8. Genotyping Influenza Virus by Next-Generation Deep Sequencing in Clinical Specimens.

    PubMed

    Seong, Moon Woo; Cho, Sung Im; Park, Hyunwoong; Seo, Soo Hyun; Lee, Seung Jun; Kim, Eui Chong; Park, Sung Sup

    2016-05-01

    Rapid and accurate identification of an influenza outbreak is essential for patient care and treatment. We describe a next-generation sequencing (NGS)-based, unbiased deep sequencing method in clinical specimens to investigate an influenza outbreak. Nasopharyngeal swabs from patients were collected for molecular epidemiological analysis. Total RNA was sequenced by using the NGS technology as paired-end 250 bp reads. A total of 7 to 12 million reads were obtained. After mapping to the human reference genome, we analyzed the 3-4% of reads that originated from a non-human source. A BLAST search of the contigs reconstructed de novo revealed high sequence similarity with that of the pandemic H1N1 virus. In the phylogenetic analysis, the HA gene of our samples clustered closely with that of A/Senegal/VR785/2010(H1N1), A/Wisconsin/11/2013(H1N1), and A/Korea/01/2009(H1N1), and the NA gene of our samples clustered closely with A/Wisconsin/11/2013(H1N1). This study suggests that NGS-based unbiased sequencing can be effectively applied to investigate molecular characteristics of a nosocomial influenza outbreak by using clinical specimens such as nasopharyngeal swabs.

  9. Metatranscriptomic analysis of small RNAs present in soybean deep sequencing libraries

    PubMed Central

    Molina, Lorrayne Gomes; da Fonseca, Guilherme Cordenonsi; de Morais, Guilherme Loss; de Oliveira, Luiz Felipe Valter; de Carvalho, Joseane Biso; Kulcheski, Franceli Rodrigues; Margis, Rogerio

    2012-01-01

    A large number of small RNAs unrelated to the soybean genome were identified after deep sequencing of soybean small RNA libraries. A metatranscriptomic analysis was carried out to identify the origin of these sequences. Comparative analyses of small interference RNAs (siRNAs) present in samples collected in open areas corresponding to soybean field plantations and samples from soybean cultivated in greenhouses under a controlled environment were made. Different pathogenic, symbiotic and free-living organisms were identified from samples of both growth systems. They included viruses, bacteria and different groups of fungi. This approach can be useful not only to identify potentially unknown pathogens and pests, but also to understand the relations that soybean plants establish with microorganisms that may affect, directly or indirectly, plant health and crop production. PMID:22802714

  10. Deep Sequencing Analysis of Aptazyme Variants Based on a Pistol Ribozyme.

    PubMed

    Kobori, Shungo; Takahashi, Kei; Yokobayashi, Yohei

    2017-04-14

    Chemically regulated self-cleaving ribozymes, or aptazymes, are emerging as a promising class of genetic devices that allow dynamic control of gene expression in synthetic biology. However, further expansion of the limited repertoire of ribozymes and aptamers, and development of new strategies to couple the RNA elements to engineer functional aptazymes are highly desirable for synthetic biology applications. Here, we report aptazymes based on the recently identified self-cleaving pistol ribozyme class using a guanine aptamer as the molecular sensing element. Two aptazyme architectures were studied by constructing and assaying 17 728 mutants by deep sequencing. Although one of the architectures did not yield functional aptazymes, a novel aptazyme design in which the aptamer and the ribozyme were placed in tandem yielded a number of guanine-inhibited ribozymes. Detailed analysis of the extensive sequence-function data suggests a mechanism that involves a competition between two mutually exclusive RNA structures reminiscent of natural bacterial riboswitches.

  11. Optimization of affinity, specificity and function of designed influenza inhibitors using deep sequencing

    SciTech Connect

    Whitehead, Timothy A.; Chevalier, Aaron; Song, Yifan; Dreyfus, Cyrille; Fleishman, Sarel J.; De Mattos, Cecilia; Myers, Chris A.; Kamisetty, Hetunandan; Blair, Patrick; Wilson, Ian A.; Baker, David

    2012-06-19

    We show that comprehensive sequence-function maps obtained by deep sequencing can be used to reprogram interaction specificity and to leapfrog over bottlenecks in affinity maturation by combining many individually small contributions not detectable in conventional approaches. We use this approach to optimize two computationally designed inhibitors against H1N1 influenza hemagglutinin and, in both cases, obtain variants with subnanomolar binding affinity. The most potent of these, a 51-residue protein, is broadly cross-reactive against all influenza group 1 hemagglutinins, including human H2, and neutralizes H1N1 viruses with a potency that rivals that of several human monoclonal antibodies, demonstrating that computational design followed by comprehensive energy landscape mapping can generate proteins with potential therapeutic utility.

  12. Metatranscriptomic analysis of small RNAs present in soybean deep sequencing libraries.

    PubMed

    Molina, Lorrayne Gomes; da Fonseca, Guilherme Cordenonsi; de Morais, Guilherme Loss; de Oliveira, Luiz Felipe Valter; de Carvalho, Joseane Biso; Kulcheski, Franceli Rodrigues; Margis, Rogerio

    2012-06-01

    A large number of small RNAs unrelated to the soybean genome were identified after deep sequencing of soybean small RNA libraries. A metatranscriptomic analysis was carried out to identify the origin of these sequences. Comparative analyses of small interference RNAs (siRNAs) present in samples collected in open areas corresponding to soybean field plantations and samples from soybean cultivated in greenhouses under a controlled environment were made. Different pathogenic, symbiotic and free-living organisms were identified from samples of both growth systems. They included viruses, bacteria and different groups of fungi. This approach can be useful not only to identify potentially unknown pathogens and pests, but also to understand the relations that soybean plants establish with microorganisms that may affect, directly or indirectly, plant health and crop production.

  13. Deep sequencing reveals persistence of cell-associated mumps vaccine virus in chronic encephalitis.

    PubMed

    Morfopoulou, Sofia; Mee, Edward T; Connaughton, Sarah M; Brown, Julianne R; Gilmour, Kimberly; Chong, W K 'Kling'; Duprex, W Paul; Ferguson, Deborah; Hubank, Mike; Hutchinson, Ciaran; Kaliakatsos, Marios; McQuaid, Stephen; Paine, Simon; Plagnol, Vincent; Ruis, Christopher; Virasami, Alex; Zhan, Hong; Jacques, Thomas S; Schepelmann, Silke; Qasim, Waseem; Breuer, Judith

    2017-01-01

    Routine childhood vaccination against measles, mumps and rubella has virtually abolished virus-related morbidity and mortality. Notwithstanding this, we describe here devastating neurological complications associated with the detection of live-attenuated mumps virus Jeryl Lynn (MuV(JL5)) in the brain of a child who had undergone successful allogeneic transplantation for severe combined immunodeficiency (SCID). This is the first confirmed report of MuV(JL5) associated with chronic encephalitis and highlights the need to exclude immunodeficient individuals from immunisation with live-attenuated vaccines. The diagnosis was only possible by deep sequencing of the brain biopsy. Sequence comparison of the vaccine batch to the MuV(JL5) isolated from brain identified biased hypermutation, particularly in the matrix gene, similar to those found in measles from cases of SSPE. The findings provide unique insights into the pathogenesis of paramyxovirus brain infections.

  14. Multiplexed Metagenomic Deep Sequencing To Analyze the Composition of High-Priority Pathogen Reagents

    PubMed Central

    Wilson, Michael R.; Stenglein, Mark D.; Olejnik, Judith; Rennick, Linda J.; Nambulli, Sham; Feldmann, Friederike; Duprex, W. Paul

    2016-01-01

    ABSTRACT Laboratories studying high-priority pathogens need comprehensive methods to confirm microbial species and strains while also detecting contamination. Metagenomic deep sequencing (MDS) inventories nucleic acids present in laboratory stocks, providing an unbiased assessment of pathogen identity, the extent of genomic variation, and the presence of contaminants. Double-stranded cDNA MDS libraries were constructed from RNA extracted from in vitro-passaged stocks of six viruses (La Crosse virus, Ebola virus, canine distemper virus, measles virus, human respiratory syncytial virus, and vesicular stomatitis virus). Each library was dual indexed and pooled for sequencing. A custom bioinformatics pipeline determined the organisms present in each sample in a blinded fashion. Single nucleotide variant (SNV) analysis identified viral isolates. We confirmed that (i) each sample contained the expected microbe, (ii) dual indexing of the samples minimized false assignments of individual sequences, (iii) multiple viral and bacterial contaminants were present, and (iv) SNV analysis of the viral genomes allowed precise identification of the viral isolates. MDS can be multiplexed to allow simultaneous and unbiased interrogation of mixed microbial cultures and (i) confirm pathogen identity, (ii) characterize the extent of genomic variation, (iii) confirm the cell line used for virus propagation, and (iv) assess for contaminating microbes. These assessments ensure the true composition of these high-priority reagents and generate a comprehensive database of microbial genomes studied in each facility. MDS can serve as an integral part of a pathogen-tracking program which in turn will enhance sample security and increase experimental rigor and precision. IMPORTANCE Both the integrity and reproducibility of experiments using select agents depend in large part on unbiased validation to ensure the correct identity and purity of the species in question. Metagenomic deep sequencing

  15. Identification and characterization of novel microRNA candidates from deep sequencing.

    PubMed

    Wu, Qian; Wang, Chao; Guo, Li; Ge, Qinyu; Lu, Zuhong

    2013-01-16

    In our previous study, we screened a candidate new microRNA (miRNA) based on deep sequencing and bioinformatics analysis. In this paper, we evaluated the novel miRNA in the following experiments: 1) the secondary structure of the precursor of novel-miR has the characteristic stem-loop hairpin structure, and the mature miRNA is far from loops and bulges. 2) we used BLAST (Basic Local Alignment Search Tool) to compare the novel-miR sequence to sequences in GenBank. The novel-miR sequence existed in Mus musculus, Drosophila grimshawi, Rattus norvegicus, Xenopus laevis, Spodoptera frugiperda, Papio anubis, Salmo salar and so on. Multiple sequence alignment (MSA) then showed that the sequence from 5 to 11 bp and 13 to 17 bp exhibited 100% similarity, indicating significant sequence conservation. Novel-miR showed similarity in the seed region with the known miR-3675-3p, indicating that these miRNAs are likely to belong to the same family and thus may share common biology. 3) novel-miR from MCF-7 and MDA-MB-231 was validated by Northern blot and detected in the serum and tissue samples of BC patients, respectively, by real-time PCR. The data showed that novel-miR was downregulated in the cancerous tissues and serum of breast cancer (BC) patients (P<0.05). 4) transfection of novel-miR mimics into MCF-7 cells significantly inhibited cell growth as detected by CCK-8 assay (P<0.05). 5) to identify the mRNA targets of novel-miR, we performed a computational screen for genes with novel-miR complementary sites in their 3'-UTR using several open access databases. In addition, we used the CapitalBio® Molecule Annotation System V3.0 to perform gene ontology (GO) analysis on the target genes of novel-miR and specific biological process categories were enriched. Seven genes (CUL3, KRAS, ETS1, MNT, CNTN3, CCNK and FOXO3) which have a high prediction score and are associated with cell proliferation, apoptosis and cell cycle were chosen. 3'-UTR luciferase reporter assay suggested that miR-BS1
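
    The seed-region comparison mentioned in this abstract can be sketched roughly as follows. This is an illustrative sketch only, not the authors' analysis: it assumes the common convention that the seed spans nucleotides 2-8 of the mature miRNA, and the function names and all three sequences are made-up placeholders rather than the sequences of novel-miR or miR-3675-3p.

```python
def seed(mature_mirna, start=2, end=8):
    """Return the seed of a mature miRNA (1-based positions start..end, inclusive).

    Positions 2-8 are an assumed convention; some tools use 2-7 instead.
    DNA-style input (T) is converted to RNA (U) before slicing.
    """
    seq = mature_mirna.upper().replace("T", "U")
    return seq[start - 1:end]


def share_seed(mirna_a, mirna_b):
    """True if two mature miRNAs have identical seed regions."""
    return seed(mirna_a) == seed(mirna_b)


# Made-up placeholder sequences chosen only for illustration.
candidate = "UGGAAUCCGAUUACGGCAUAGC"
known = "AGGAAUCCCUUAGGCAUUCGAA"
unrelated = "UCCGAUAGGCAUUACGGAUCCA"

print(seed(candidate), share_seed(candidate, known), share_seed(candidate, unrelated))
```

    Two miRNAs that share a seed are usually grouped into one family, which is the inference the abstract draws; a shared seed alone does not establish common targets.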

  16. Pathogen-specific deep sequence-coupled biopanning: A method for surveying human antibody responses

    PubMed Central

    Pascale, Juan M.; Moreno, Brechla; Chackerian, Bryce; Peabody, David S.

    2017-01-01

    Identifying the targets of antibody responses during infection is important for designing vaccines, developing diagnostic and prognostic tools, and understanding pathogenesis. We developed a novel deep sequence-coupled biopanning approach capable of identifying the protein epitopes of antibodies present in human polyclonal serum. Here, we report the adaptation of this approach for the identification of pathogen-specific epitopes recognized by antibodies elicited during acute infection. As a proof-of-principle, we applied this approach to assessing antibodies to Dengue virus (DENV). Using a panel of sera from patients with acute secondary DENV infection, we panned a DENV antigen fragment library displayed on the surface of bacteriophage MS2 virus-like particles and characterized the population of affinity-selected peptide epitopes by deep sequence analysis. Although there was considerable variation in the responses of individuals, we found several epitopes within the Envelope glycoprotein and Non-Structural Protein 1 that were commonly enriched. This report establishes a novel approach for characterizing pathogen-specific antibody responses in human sera, and has future utility in identifying novel diagnostic and vaccine targets. PMID:28152075

  17. Deep Sequencing Identification of Novel Glucocorticoid-Responsive miRNAs in Apoptotic Primary Lymphocytes

    PubMed Central

    Mav, Deepak; Scoltock, Alyson B.; Cidlowski, John A.

    2013-01-01

    Apoptosis of lymphocytes governs the response of the immune system to environmental stress and toxic insult. Signaling through the ubiquitously expressed glucocorticoid receptor, stress-induced glucocorticoid hormones induce apoptosis via mechanisms requiring altered gene expression. Several reports have detailed the changes in gene expression mediating glucocorticoid-induced apoptosis of lymphocytes. However, few studies have examined the role of non-coding miRNAs in this essential physiological process. Previously, using hybridization-based gene expression analysis and deep sequencing of small RNAs, we described the prevalent post-transcriptional repression of annotated miRNAs during glucocorticoid-induced apoptosis of lymphocytes. Here, we describe the development of a customized bioinformatics pipeline that facilitates the deep sequencing-mediated discovery of novel glucocorticoid-responsive miRNAs in apoptotic primary lymphocytes. This analysis identifies the potential presence of over 200 novel glucocorticoid-responsive miRNAs. We have validated the expression of two novel glucocorticoid-responsive miRNAs using small RNA-specific qPCR. Furthermore, through the use of Ingenuity Pathways Analysis (IPA) we determined that the putative targets of these novel validated miRNAs are predicted to regulate cell death processes. These findings identify two and predict the presence of additional novel glucocorticoid-responsive miRNAs in the rat transcriptome, suggesting a potential role for both annotated and novel miRNAs in glucocorticoid-induced apoptosis of lymphocytes. PMID:24250753

  18. Deep sequencing reveals as-yet-undiscovered small RNAs in Escherichia coli

    PubMed Central

    2011-01-01

    Background In Escherichia coli, approximately 100 regulatory small RNAs (sRNAs) have been identified experimentally and many more have been predicted by various methods. To provide a comprehensive overview of sRNAs, we analysed the low-molecular-weight RNAs (< 200 nt) of E. coli with deep sequencing, because the regulatory RNAs in bacteria are usually 50-200 nt in length. Results We discovered 229 novel candidate sRNAs (≥ 50 nt) with computational or experimental evidence of transcription initiation. Among them, the expression of seven intergenic sRNAs and three cis-antisense sRNAs was detected by northern blot analysis. Interestingly, five novel sRNAs are expressed from prophage regions and we note that these sRNAs have several specific characteristics. Furthermore, we conducted an evolutionary conservation analysis of the candidate sRNAs and summarised the data among closely related bacterial strains. Conclusions This comprehensive screen for E. coli sRNAs using a deep sequencing approach has shown that many as-yet-undiscovered sRNAs are potentially encoded in the E. coli genome. We constructed the Escherichia coli Small RNA Browser (ECSBrowser; http://rna.iab.keio.ac.jp/), which integrates the data for previously identified sRNAs and the novel sRNAs found in this study. PMID:21864382

  19. Pathogen-specific deep sequence-coupled biopanning: A method for surveying human antibody responses.

    PubMed

    Frietze, Kathryn M; Pascale, Juan M; Moreno, Brechla; Chackerian, Bryce; Peabody, David S

    2017-01-01

    Identifying the targets of antibody responses during infection is important for designing vaccines, developing diagnostic and prognostic tools, and understanding pathogenesis. We developed a novel deep sequence-coupled biopanning approach capable of identifying the protein epitopes of antibodies present in human polyclonal serum. Here, we report the adaptation of this approach for the identification of pathogen-specific epitopes recognized by antibodies elicited during acute infection. As a proof-of-principle, we applied this approach to assessing antibodies to Dengue virus (DENV). Using a panel of sera from patients with acute secondary DENV infection, we panned a DENV antigen fragment library displayed on the surface of bacteriophage MS2 virus-like particles and characterized the population of affinity-selected peptide epitopes by deep sequence analysis. Although there was considerable variation in the responses of individuals, we found several epitopes within the Envelope glycoprotein and Non-Structural Protein 1 that were commonly enriched. This report establishes a novel approach for characterizing pathogen-specific antibody responses in human sera, and has future utility in identifying novel diagnostic and vaccine targets.

  20. Deep sequencing reveals a global reprogramming of lncRNA transcriptome during EMT.

    PubMed

    Liao, Jian-You; Wu, Jue; Wang, Yan-Jie; He, Jie-Hua; Deng, Wei-Xi; Hu, KaiShun; Zhang, Yu-Chan; Zhang, Yin; Yan, Haiyan; Wang, Dan-Lan; Liu, Qiang; Zeng, Mu-Sheng; Phillip Koeffler, H; Song, Erwei; Yin, Dong

    2017-10-01

    Several studies have shown that long non-coding RNAs (lncRNAs) may play an essential role in Epithelial-Mesenchymal Transition (EMT), which is an important step in tumor metastasis; however, little is known about the global change of the lncRNA transcriptome during EMT. To investigate how lncRNA transcriptome alterations contribute to the regulation of EMT progression, we deep-sequenced the whole transcriptome of MCF10A as the cells underwent TGF-β-induced EMT. Deep-sequencing results showed that the long RNA transcriptome of MCF10A had undergone global changes as early as 8 h after treatment with TGF-β. The expression of 3403 known and novel lncRNAs and 570 known and novel circRNAs was altered during EMT. To identify the key lncRNA regulator, we constructed the co-expression network and found that all junction nodes in the network are lncRNAs. One junction node, RP6-65G23.5, was further verified as a key regulator of EMT. Intriguingly, we identified 216 clusters containing lncRNAs which were located in "gene desert" regions. The expression of all lncRNAs in these clusters changed concurrently during EMT, strongly suggesting that these clusters might play important roles in EMT. Our study reveals a global reprogramming of the lncRNA transcriptome during EMT and provides clues for the future study of the molecular mechanism of EMT. Copyright © 2017 Elsevier B.V. All rights reserved.

  1. De Novo Assembly of the Complete Genome of an Enhanced Electricity-Producing Variant of Geobacter sulfurreducens Using Only Short Reads

    PubMed Central

    Nagarajan, Harish; Butler, Jessica E.; Klimes, Anna; Qiu, Yu; Zengler, Karsten; Ward, Joy; Young, Nelson D.; Methé, Barbara A.; Palsson, Bernhard Ø.; Lovley, Derek R.; Barrett, Christian L.

    2010-01-01

    State-of-the-art DNA sequencing technologies are transforming the life sciences due to their ability to generate nucleotide sequence information with a speed and quantity that is unapproachable with traditional Sanger sequencing. Genome sequencing is a principal application of this technology, where the ultimate goal is the full and complete sequence of the organism of interest. Due to the nature of the raw data produced by these technologies, a full genomic sequence attained without the aid of Sanger sequencing has yet to be demonstrated. We have successfully developed a four-phase strategy for using only next-generation sequencing technologies (Illumina and 454) to assemble a complete microbial genome de novo. We applied this approach to completely assemble the 3.7 Mb genome of a rare Geobacter variant (KN400) that is capable of unprecedented current production at an electrode. Two key components of our strategy enabled us to achieve this result. First, we integrated the two data types early in the process to maximally leverage their complementary characteristics. And second, we used the output of different short read assembly programs in such a way so as to leverage the complementary nature of their different underlying algorithms or of their different implementations of the same underlying algorithm. The significance of our result is that it demonstrates a general approach for maximizing the efficiency and success of genome assembly projects as new sequencing technologies and new assembly algorithms are introduced. The general approach is a meta strategy, wherein sequencing data are integrated as early as possible and in particular ways and wherein multiple assembly algorithms are judiciously applied such that the deficiencies in one are complemented by another. PMID:20544019

  2. Population-genomic variation within RNA viruses of the Western honey bee, Apis mellifera, inferred from deep sequencing

    USDA-ARS's Scientific Manuscript database

    Deep sequencing of viruses isolated from infected hosts is an efficient way to measure population-genetic variation and can reveal patterns of dispersal and natural selection. In this study, we mined existing Illumina sequence reads to investigate single-nucleotide polymorphisms (SNPs) within two RN...

  3. Draft Genome Sequence of Caloranaerobacter sp. TR13, an Anaerobic Thermophilic Bacterium Isolated from a Deep-Sea Hydrothermal Vent

    PubMed Central

    Xie, Yunbiao; Dong, Binbin; Liu, Qing; Chen, Xiaoyao

    2015-01-01

    Here, we report the draft 2,261,881-bp genome sequence of Caloranaerobacter sp. TR13, isolated from a deep-sea hydrothermal vent on the East Pacific Rise. The sequence will be helpful for understanding the genetic and metabolic features, as well as potential biotechnological application in the genus Caloranaerobacter. PMID:26679595

  4. MinVar: A rapid and versatile tool for HIV-1 drug resistance genotyping by deep sequencing.

    PubMed

    Huber, Michael; Metzner, Karin J; Geissberger, Fabienne D; Shah, Cyril; Leemann, Christine; Klimkait, Thomas; Böni, Jürg; Trkola, Alexandra; Zagordi, Osvaldo

    2017-02-01

    Genotypic monitoring of drug-resistance mutations (DRMs) in HIV-1 infected individuals is strongly recommended to guide selection of the initial antiretroviral therapy (ART) and changes of drug regimens. Traditionally, mutations conferring drug resistance are detected by population sequencing of the reverse transcribed viral RNA encoding the HIV-1 enzymes targeted by ART, followed by manual analysis and interpretation of Sanger sequencing traces. This process is labor intensive, relies on subjective interpretation from the operator, and offers limited sensitivity as only mutations above 20% frequency can be reliably detected. Here we present MinVar, a pipeline for the analysis of deep sequencing data, which allows reliable and automated detection of DRMs down to 5%. We evaluated MinVar with data from amplicon sequencing of defined mixtures of molecular virus clones with known DRMs and plasma samples of viremic HIV-1 infected individuals, and we compared it to VirVarSeq, another virus variant detection tool that works exclusively on Illumina deep sequencing data. MinVar was designed to be compatible with a diverse range of sequencing platforms and allows the detection of DRMs and insertions/deletions from deep sequencing data without the need to perform additional bioinformatics analysis, a prerequisite to a widespread implementation of HIV-1 genotyping using deep sequencing in routine diagnostic settings.
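
    The difference between the roughly 20% Sanger threshold and the 5% deep-sequencing threshold described in this abstract can be illustrated with a small sketch. It is not MinVar's implementation: the function name, the per-position base-count layout, the minimum-depth filter, and the example counts are all assumptions made for illustration.

```python
from collections import Counter


def call_minority_variants(base_counts, reference, min_freq=0.05, min_depth=100):
    """Report non-reference bases whose frequency is at least min_freq.

    base_counts: dict of position -> Counter of observed bases (hypothetical layout)
    reference:   dict of position -> reference base
    Returns a list of (position, base, frequency) tuples sorted by position.
    """
    calls = []
    for pos in sorted(base_counts):
        counts = base_counts[pos]
        depth = sum(counts.values())
        if depth < min_depth:
            continue  # too few reads to estimate a frequency (assumed filter)
        for base, n in sorted(counts.items()):
            if base == reference[pos]:
                continue
            freq = n / depth
            if freq >= min_freq:
                calls.append((pos, base, freq))
    return calls


# Invented counts: a 7% variant, a 1% variant, and a 30% variant.
counts = {
    10: Counter({"A": 930, "G": 70}),
    11: Counter({"C": 990, "T": 10}),
    12: Counter({"T": 700, "C": 300}),
}
ref = {10: "A", 11: "C", 12: "T"}

deep = call_minority_variants(counts, ref, min_freq=0.05)
sanger_like = call_minority_variants(counts, ref, min_freq=0.20)
print(deep)
print(sanger_like)
```

    With these invented counts, the 7% variant at position 10 is reported only at the 5% cutoff, while the 1% variant is below both cutoffs. A real pipeline would also need to account for sequencing error rates and amplification bias, which this sketch ignores.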

  5. A near complete snapshot of the Zea mays seedling transcriptome revealed from ultra-deep sequencing

    PubMed Central

    Martin, Jeffrey A.; Johnson, Nicole V.; Gross, Stephen M.; Schnable, James; Meng, Xiandong; Wang, Mei; Coleman-Derr, Devin; Lindquist, Erika; Wei, Chia-Lin; Kaeppler, Shawn; Chen, Feng; Wang, Zhong

    2014-01-01

    RNA-sequencing (RNA-seq) enables in-depth exploration of transcriptomes, but typical sequencing depth often limits its comprehensiveness. In this study, we generated nearly 3 billion RNA-Seq reads, totaling 341 Gb of sequence, from a Zea mays seedling sample. At this depth, a near complete snapshot of the transcriptome was observed consisting of over 90% of the annotated transcripts, including lowly expressed transcription factors. A novel hybrid strategy combining de novo and reference-based assemblies yielded a transcriptome consisting of 126,708 transcripts with 88% of expressed known genes assembled to full-length. We improved current annotations by adding 4,842 previously unannotated transcript variants and many new features, including 212 maize transcripts, 201 genes, 10 genes with undocumented potential roles in seedlings as well as maize lineage specific gene fusion events. We demonstrated the power of deep sequencing for large transcriptome studies by generating a high quality transcriptome, which provides a rich resource for the research community. PMID:24682209

  6. Deep sequencing analysis of the developing mouse brain reveals a novel microRNA

    PubMed Central

    2011-01-01

    Background MicroRNAs (miRNAs) are small non-coding RNAs that can exert multilevel inhibition/repression at a post-transcriptional or protein synthesis level during disease or development. Characterisation of miRNAs in adult mammalian brains by deep sequencing has been reported previously. However, to date, no small RNA profiling of the developing brain has been undertaken using this method. We have performed deep sequencing and small RNA analysis of a developing (E15.5) mouse brain. Results We identified the expression of 294 known miRNAs in the E15.5 developing mouse brain, which were mostly represented by let-7 family and other brain-specific miRNAs such as miR-9 and miR-124. We also discovered 4 putative 22-23 nt miRNAs: mm_br_e15_1181, mm_br_e15_279920, mm_br_e15_96719 and mm_br_e15_294354 each with a 70-76 nt predicted pre-miRNA. We validated the 4 putative miRNAs and further characterised one of them, mm_br_e15_1181, throughout embryogenesis. Mm_br_e15_1181 biogenesis was Dicer1-dependent and was expressed in E3.5 blastocysts and E7 whole embryos. Embryo-wide expression patterns were observed at E9.5 and E11.5 followed by a near complete loss of expression by E13.5, with expression restricted to a specialised layer of cells within the developing and early postnatal brain. Mm_br_e15_1181 was upregulated during neurodifferentiation of P19 teratocarcinoma cells. This novel miRNA has been identified as miR-3099. Conclusions We have generated and analysed the first deep sequencing dataset of small RNA sequences of the developing mouse brain. The analysis revealed a novel miRNA, miR-3099, with potential regulatory effects on early embryogenesis, and involvement in neuronal cell differentiation/function in the brain during late embryonic and early neonatal development. PMID:21466694

  7. Shotgun metagenomics of biological stains using ultra-deep DNA sequencing.

    PubMed

    Brenig, B; Beck, J; Schütz, E

    2010-07-01

    A detailed molecular analysis of blood or other biological stains at a crime scene is often hampered by the low quantity and quality of the extractable DNA. However, the determination of the origin and composition of a stain is in most cases a prerequisite for the final elucidation of a criminal case. Standard methodologies, e.g. amplification of DNA followed by microsatellite typing or mitochondrial DNA sequencing, are often not sensitive enough to result in sufficient and conclusive data. We have applied ultra-deep DNA sequencing using the 454 pyrosequencing technology on a whole genome amplified (WGA) environmental biological stain, which was analysed unsuccessfully with standard methodologies following WGA. With the combination of WGA and 454 pyrosequencing, however, we were able to generate 7242 single sequences with an average length of 195 bp. A total of 1,441,971 bp of DNA sequence was generated and compared with public DNA sequence databases. Using RepeatMasker and Basic Local Alignment Search Tool (BLAST) searches against known microbial and mammalian genomes it was possible to determine the metagenomic composition of the stain, i.e. 4.2% bacterial DNA, 0.3% viral DNA, 2.7% fungal DNA, 10.3% mammalian repetitive DNA, 0.9% porcine DNA, 0.13% human DNA and 81.5% DNA of unknown origin. Our data demonstrate that 454 pyrosequencing has the potential to become a powerful tool not only in basic research but also in the metagenomic analysis of biological trace materials for forensic genetics.
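
    A composition breakdown like the one reported in this abstract could be tallied along the following lines once each read has been assigned a category. This is a minimal sketch under stated assumptions: the abstract does not say whether the percentages are by base pairs or by read count (base pairs are assumed here), and the function name, category labels, and read lengths are invented for illustration.

```python
from collections import defaultdict


def composition_by_bp(classified_reads):
    """Return {category: percent of total base pairs}.

    classified_reads: iterable of (category, read_length_bp) pairs, e.g. the
    best-hit label per read after a database search (hypothetical input).
    """
    totals = defaultdict(int)
    for category, length_bp in classified_reads:
        totals[category] += length_bp
    grand_total = sum(totals.values())
    return {cat: 100.0 * bp / grand_total for cat, bp in totals.items()}


# Invented example reads, not data from the study.
reads = [
    ("bacterial", 200),
    ("unknown", 190),
    ("unknown", 210),
    ("mammalian repeat", 200),
    ("unknown", 200),
]

result = composition_by_bp(reads)
for category, percent in sorted(result.items()):
    print(f"{category}: {percent:.1f}%")
```

    Reads with no database hit fall into the "unknown" category here, which corresponds to the large fraction of DNA of unknown origin reported in the abstract.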

  8. Identification of torque teno virus in culture-negative endophthalmitis by representational deep DNA sequencing.

    PubMed

    Lee, Aaron Y; Akileswaran, Lakshmi; Tibbetts, Michael D; Garg, Sunir J; Van Gelder, Russell N

    2015-03-01

    To test the hypothesis that uncultured organisms may be present in cases of culture-negative endophthalmitis by use of deep DNA sequencing of vitreous biopsies. Single-center, consecutive, prospective, observational study. Aqueous or vitreous biopsies from 21 consecutive patients presenting with presumed infectious endophthalmitis and 7 vitreous samples from patients undergoing surgery for noninfectious retinal disorders. Traditional bacterial and fungal culture, 16S quantitative polymerase chain reaction (qPCR), and a representational deep-sequencing method (biome representational in silico karyotyping [BRiSK]) were applied in parallel to samples to identify DNA sequences corresponding to potential pathogens. Presence of potential pathogen DNA in ocular samples. Zero of 7 control eyes undergoing routine vitreous surgery yielded positive results for bacteria or virus by culture or 16S polymerase chain reaction (PCR). A total of 14 of the 21 samples (66.7%) from eyes harboring suspected infectious endophthalmitis were culture-positive, the most common being Staphylococcal and Streptococcal species. There was good agreement among culture, 16S bacterial PCR, and BRiSK methodologies for culture-positive cases (Fleiss' kappa of 0.621). 16S PCR did not yield a recognizable pathogen sequence in any culture-negative sample, whereas BRiSK suggested the presence of Streptococcus in 1 culture-negative sample. With the use of BRiSK, 57.1% of culture-positive and 100% of culture-negative samples demonstrated the presence of torque teno virus (TTV) sequences, compared with none in the controls (P=0.0005, Fisher exact test). The presence of TTV viral DNA was confirmed in 7 cases by qPCR. No other known viruses or potential pathogens were identified in these samples. Culture, 16S qPCR, and BRiSK provide complementary information in presumed infectious endophthalmitis. The majority of culture-negative endophthalmitis samples did not contain significant levels of bacterial DNA

  9. Identification of torque teno virus in culture-negative endophthalmitis by representational deep-DNA sequencing

    PubMed Central

    Lee, Aaron Y.; Akileswaran, Lakshmi; Tibbetts, Michael D.; Garg, Sunir J.; Van Gelder, Russell N.

    2014-01-01

    Purpose To test the hypothesis that uncultured organisms may be present in cases of culture-negative endophthalmitis, by use of deep DNA sequencing of vitreous biopsies. Design Single center consecutive prospective observational study. Participants and Controls Aqueous or vitreous biopsies from 21 consecutive patients presenting with presumed infectious endophthalmitis, and seven vitreous samples from patients undergoing surgery for non-infectious retinal disorders. Methods Traditional bacterial and fungal culture, 16S quantitative polymerase chain reaction (qPCR) and a representational deep-sequencing method (Biome Representational in Silico Karyotyping [BRiSK]) were applied in parallel to samples to identify DNA sequences corresponding to potential pathogens. Main Outcome Measures Presence of potential pathogen DNA in ocular samples. Results None of 7 control eyes undergoing routine vitreous surgery yielded positive results for bacteria or virus by culture or 16S PCR. Fourteen of the 21 samples (66.7%) from eyes harboring suspected infectious endophthalmitis were culture-positive, the most common being Staphylococcal and Streptococcal species. There was good agreement among culture, 16S bacterial PCR, and BRiSK methodologies for culture-positive cases (Fleiss’ kappa of 0.621). 16S PCR did not yield a recognizable pathogen sequence in any culture-negative sample, while BRiSK suggested the presence of Streptococcus in one culture-negative sample. Surprisingly, using BRiSK, 57.1% of culture-positive and 100% of culture-negative samples demonstrated the presence of Torque Teno Virus (TTV) sequences, compared to none in the controls (Fisher exact, p = 0.0005). Presence of TTV viral DNA was confirmed in seven cases by qPCR. No other known viruses or potential pathogens were identified in these samples. Conclusion Culture, 16S qPCR, and BRiSK provide complementary information in presumed infectious endophthalmitis. The majority of culture-negative endophthalmitis samples did

  10. Fungal communities from the calcareous deep-sea sediments in the Southwest India Ridge revealed by Illumina sequencing technology.

    PubMed

    Zhang, Likui; Kang, Manyu; Huang, Yangchao; Yang, Lixiang

    2016-05-01

    The diversity and ecological significance of bacteria and archaea in deep-sea environments have been thoroughly investigated, but eukaryotic microorganisms in these areas, such as fungi, are poorly understood. To elucidate fungal diversity in calcareous deep-sea sediments in the Southwest India Ridge (SWIR), the internal transcribed spacer (ITS) regions of rRNA genes from two sediment metagenomic DNA samples were amplified and sequenced using the Illumina sequencing platform. The results revealed that 58-63 % and 36-42 % of the ITS sequences (97 % similarity) belonged to Basidiomycota and Ascomycota, respectively. These findings suggest that Basidiomycota and Ascomycota are the predominant fungal phyla in the two samples. We also found that Agaricomycetes, Leotiomycetes, and Pezizomycetes were the major fungal classes in the two samples. At the species level, Thelephoraceae sp. and Phialocephala fortinii were major fungal species in the two samples. Despite the low relative abundance, unidentified fungal sequences were also observed in the two samples. Furthermore, we found that there were slight differences in fungal diversity between the two sediment samples, although both were collected from the SWIR. Thus, our results demonstrate that calcareous deep-sea sediments in the SWIR harbor diverse fungi, which augment the fungal groups in deep-sea sediments. This is the first report of fungal communities in calcareous deep-sea sediments in the SWIR revealed by Illumina sequencing.

  11. Integrated analysis of microRNA regulatory network in nasopharyngeal carcinoma with deep sequencing.

    PubMed

    Wang, Fan; Lu, Juan; Peng, Xiaohong; Wang, Jie; Liu, Xiong; Chen, Xiaomei; Jiang, Yiqi; Li, Xiangping; Zhang, Bao

    2016-01-22

    MicroRNAs (miRNAs) have been shown to play a critical role in the development and progression of nasopharyngeal carcinoma (NPC). Although accumulating studies have been performed on the molecular mechanisms of NPC, the miRNA regulatory networks in cancer progression remain largely unknown. Laser capture microdissection (LCM) and deep sequencing are powerful tools that can help provide an integrated view of the miRNA-target network. Illumina Hiseq2000 deep sequencing was used to screen differentially expressed miRNAs in laser-microdissected biopsies between 12 NPC and 8 chronic nasopharyngitis patients. The result was validated by real-time PCR on 201 NPC and 25 chronic nasopharyngitis patients. The potential candidate target genes of the miRNAs were predicted using published target prediction software tools (RNAhybrid, TargetScan, Miranda, PITA), and the overlapping set was analyzed for Gene Ontology (GO) and Kyoto Encyclopedia of Genes and Genomes (KEGG) biological processes. The miRNA regulatory network analysis was performed using the Ingenuity Pathway Analysis (IPA) software. Eight differentially expressed miRNAs were identified between NPC and chronic nasopharyngitis patients by deep sequencing. Further qRT-PCR assays confirmed 3 down-regulated miRNAs (miR-34c-5p, miR-375 and miR-449c-5p) and 4 up-regulated miRNAs (miR-205-5p, miR-92a-3p, miR-193b-3p and miR-27a-5p). Additionally, the low level of miR-34c-5p (miR-34c) was significantly correlated with advanced TNM stage. GO and KEGG enrichment analyses showed that 914 target genes were involved in processes including cell cycle, cytokine secretion and tumor immunology. IPA revealed that cancer was the top disease associated with those dysregulated miRNAs, and the genes regulated by miR-34c were in the center of miRNA-mRNA regulatory network, including TP53, CCND1, CDK6, MET and BCL2, and the PI3K/AKT/mTOR signaling was regarded as a significant function pathway in this network. Our study presents the current knowledge of mi

  12. Mapping vaccinia virus DNA replication origins at nucleotide level by deep sequencing.

    PubMed

    Senkevich, Tatiana G; Bruno, Daniel; Martens, Craig; Porcella, Stephen F; Wolf, Yuri I; Moss, Bernard

    2015-09-01

    Poxviruses reproduce in the host cytoplasm and encode most or all of the enzymes and factors needed for expression and synthesis of their double-stranded DNA genomes. Nevertheless, the mode of poxvirus DNA replication and the nature and location of the replication origins remain unknown. A current but unsubstantiated model posits only leading strand synthesis starting at a nick near one covalently closed end of the genome and continuing around the other end to generate a concatemer that is subsequently resolved into unit genomes. The existence of specific origins has been questioned because any plasmid can replicate in cells infected by vaccinia virus (VACV), the prototype poxvirus. We applied directional deep sequencing of short single-stranded DNA fragments enriched for RNA-primed nascent strands isolated from the cytoplasm of VACV-infected cells to pinpoint replication origins. The origins were identified as the switching points of the fragment directions, which correspond to the transition from continuous to discontinuous DNA synthesis. Origins containing a prominent initiation point mapped to a sequence within the hairpin loop at one end of the VACV genome and to the same sequence within the concatemeric junction of replication intermediates. These findings support a model for poxvirus genome replication that involves leading and lagging strand synthesis and is consistent with the requirements for primase and ligase activities as well as earlier electron microscopic and biochemical studies implicating a replication origin at the end of the VACV genome.
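
    The idea of locating origins at the points where fragment direction switches can be illustrated with a minimal sketch. It is not the authors' pipeline: the function name, the bin size, and the toy input of (position, strand) pairs are all assumptions made for illustration. Fragments are binned along the genome, the dominant strand per bin is determined, and bins where the dominant strand flips are reported.

```python
from collections import defaultdict

def strand_switch_points(fragments, bin_size=100):
    """Return bin start coordinates where the dominant strand changes.

    fragments: iterable of (position, strand) pairs, strand being '+' or '-'.
    bin_size is a hypothetical resolution chosen for this toy example.
    """
    counts = defaultdict(lambda: [0, 0])  # bin index -> [plus count, minus count]
    for pos, strand in fragments:
        counts[pos // bin_size][0 if strand == "+" else 1] += 1

    switches = []
    previous = None
    for b in sorted(counts):
        plus, minus = counts[b]
        if plus == minus:
            continue  # no clear majority in this bin, so it is skipped
        dominant = "+" if plus > minus else "-"
        if previous is not None and dominant != previous:
            switches.append(b * bin_size)
        previous = dominant
    return switches

# Toy data (invented): minus-strand fragments left of 200, plus-strand to the right.
toy = [(10, "-"), (40, "-"), (120, "-"), (150, "-"),
       (210, "+"), (260, "+"), (330, "+")]
print(strand_switch_points(toy))
```

    On real data the per-bin strand ratio would be noisy, so a smoothing step or a minimum read count per bin would likely be needed before calling a switch.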

  13. VirusDetect: An automated pipeline for efficient virus discovery using deep sequencing of small RNAs.

    PubMed

    Zheng, Yi; Gao, Shan; Padmanabhan, Chellappan; Li, Rugang; Galvez, Marco; Gutierrez, Dina; Fuentes, Segundo; Ling, Kai-Shu; Kreuze, Jan; Fei, Zhangjun

    2017-01-01

    Accurate detection of viruses in plants and animals is critical for agriculture production and human health. Deep sequencing and assembly of virus-derived small interfering RNAs has proven to be a highly efficient approach for virus discovery. Here we present VirusDetect, a bioinformatics pipeline that can efficiently analyze large-scale small RNA (sRNA) datasets for both known and novel virus identification. VirusDetect performs both reference-guided assemblies through aligning sRNA sequences to a curated virus reference database and de novo assemblies of sRNA sequences with automated parameter optimization and the option of host sRNA subtraction. The assembled contigs are compared to a curated and classified reference virus database for known and novel virus identification, and evaluated for their sRNA size profiles to identify novel viruses. Extensive evaluations using plant and insect sRNA datasets suggest that VirusDetect is highly sensitive and efficient in identifying known and novel viruses. VirusDetect is freely available at http://bioinfo.bti.cornell.edu/tool/VirusDetect/. Copyright © 2016 Elsevier Inc. All rights reserved.
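
    The sRNA size profile used to evaluate contigs can be pictured as the length distribution of the reads aligned to a contig. The sketch below is only an illustration and does not reproduce VirusDetect's code; the function name and the toy reads are hypothetical.

```python
from collections import Counter

def size_profile(reads):
    """Return {read length: fraction of reads} for the reads mapped to one contig."""
    lengths = Counter(len(read) for read in reads)
    total = sum(lengths.values())
    return {length: count / total for length, count in sorted(lengths.items())}

# Toy reads (invented): six of 21 nt, three of 22 nt, one of 24 nt.
toy_reads = ["A" * 21] * 6 + ["A" * 22] * 3 + ["A" * 24]
profile = size_profile(toy_reads)
print(profile)
```

    A profile concentrated in a narrow range of lengths could then be compared against whatever size classes a given host is expected to produce; that comparison step is left out here because the abstract does not describe it.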

  14. Extracting features from protein sequences to improve deep extreme learning machine for protein fold recognition.

    PubMed

    Ibrahim, Wisam; Abadeh, Mohammad Saniee

    2017-03-27

    Protein fold recognition is an important problem in bioinformatics for predicting the three-dimensional structure of a protein. One of the most challenging tasks in the protein fold recognition problem is the extraction of efficient features from the amino-acid sequences to obtain better classifiers. In this paper, we have proposed six descriptors to extract features from protein sequences. These descriptors are applied in the first stage of a three-stage framework, PCA-DELM-LDA, to extract feature vectors from the amino-acid sequences. Principal Component Analysis (PCA) has been implemented to reduce the number of extracted features. The extracted feature vectors have been used with original features to improve the performance of the Deep Extreme Learning Machine (DELM) in the second stage. Four new features have been extracted from the second stage and used in the third stage by Linear Discriminant Analysis (LDA) to classify the instances into 27 folds. The proposed framework is implemented on the independent and combined feature sets in SCOP datasets. The experimental results show that the feature vectors extracted in the first stage could improve the performance of DELM in extracting new useful features in the second stage.

  15. Identification of Dirofilaria immitis miRNA using Illumina deep sequencing

    PubMed Central

    2013-01-01

    The heartworm Dirofilaria immitis is the causative agent of cardiopulmonary dirofilariosis in dogs and cats, which also infects a wide range of wild mammals and humans. The complex life cycle of D. immitis with several developmental stages in its invertebrate mosquito vectors and its vertebrate hosts indicates the importance of miRNA in growth and development, and their ability to regulate infection of mammalian hosts. This study identified the miRNA profiles of D. immitis of zoonotic significance by deep sequencing. A total of 1063 conserved miRNA candidates, including 68 anti-sense miRNA (miRNA*) sequences, were predicted by computational methods and could be grouped into 808 miRNA families. A significant bias towards family members, family abundance and sequence nucleotides was observed. Thirteen novel miRNA candidates were predicted by alignment with the Brugia malayi genome. Eleven out of 13 predicted miRNA candidates were verified by using a PCR-based method. Target genes of the novel miRNA candidates were predicted by using the heartworm transcriptome dataset. To our knowledge, this is the first report of miRNA profiles in D. immitis, which will contribute to a better understanding of the complex biology of this zoonotic filarial nematode and the molecular regulation roles of miRNA involved. Our findings may also become a useful resource for small RNA studies in other filarial parasitic nematodes. PMID:23331513

  16. Genome diversity in Brachypodium distachyon: deep sequencing of highly diverse inbred lines.

    PubMed

    Gordon, Sean P; Priest, Henry; Des Marais, David L; Schackwitz, Wendy; Figueroa, Melania; Martin, Joel; Bragg, Jennifer N; Tyler, Ludmila; Lee, Cheng-Ruei; Bryant, Doug; Wang, Wenqin; Messing, Joachim; Manzaneda, Antonio J; Barry, Kerrie; Garvin, David F; Budak, Hikmet; Tuna, Metin; Mitchell-Olds, Thomas; Pfender, William F; Juenger, Thomas E; Mockler, Todd C; Vogel, John P

    2014-08-01

    Brachypodium distachyon is a small annual grass that has been adopted as a model for the grasses. Its small genome, high-quality reference genome, large germplasm collection, and selfing nature make it an excellent subject for studies of natural variation. We sequenced six divergent lines to identify a comprehensive set of polymorphisms and analyze their distribution and concordance with gene expression. Multiple methods and controls were utilized to identify polymorphisms and validate their quality. mRNA-Seq experiments under control and simulated drought-stress conditions identified 300 genes with a genotype-dependent treatment response. We showed that large-scale sequence variants had extremely high concordance with altered expression of hundreds of genes, including many with genotype-dependent treatment responses. We generated a deep mRNA-Seq dataset for the most divergent line and created a de novo transcriptome assembly. This led to the discovery of >2400 previously unannotated transcripts and hundreds of genes not present in the reference genome. We built a public database for visualization and investigation of sequence variants among these widely used inbred lines.

  17. Microbes in deep marine sediments viewed through amplicon sequencing and metagenomics

    NASA Astrophysics Data System (ADS)

    Biddle, J.; Leon, Z. R.; Russell, J. A., III; Martino, A. J.

    2016-12-01

    Nearly twenty percent of microbial biomass on Earth can be found in the marine subsurface. The majority of this is concentrated on continental margins, which have been investigated by scientific drilling. On the Costa Rica, Iberian and Peru Margins, sediment samples have been investigated through DNA extraction followed by amplicon and metagenomic sequencing. Overall, samples show a high degree of microbial diversity, including many lineages of newly defined groups. In this talk, metagenome-assembled genomes of unusual lineages will be presented, including their relationships to shallower relatives. From Costa Rica, in particular, we have retrieved deep relatives of Lokiarchaeota and Thorarchaeota, as well as other deeply branching archaeal relatives. We discuss their genome similarities to both other archaea and eukaryotes. From the Iberian Margin, relatives of Atribacteria and Aerophobetes will be discussed. Finally, we will detail the knowledge lost or gained depending on whether samples are studied via amplicon sequencing or total metagenomics, as studies in other environments have shown that up to 15% of microbial diversity is ignored when samples are studied via amplicon sequencing alone.

  18. De novo PacBio long-read and phased avian genome assemblies correct and add to reference genes generated with intermediate and short reads.

    PubMed

    Korlach, Jonas; Gedman, Gregory; Kingan, Sarah B; Chin, Chen-Shan; Howard, Jason T; Audet, Jean-Nicolas; Cantin, Lindsey; Jarvis, Erich D

    2017-10-01

    Reference-quality genomes are expected to provide a resource for studying gene structure, function, and evolution. However, often genes of interest are not completely or accurately assembled, leading to unknown errors in analyses or additional cloning efforts for the correct sequences. A promising solution is long-read sequencing. Here we tested PacBio-based long-read sequencing and diploid assembly for potential improvements to the Sanger-based intermediate-read zebra finch reference and Illumina-based short-read Anna's hummingbird reference, 2 vocal learning avian species widely studied in neuroscience and genomics. With DNA of the same individuals used to generate the reference genomes, we generated diploid assemblies with the FALCON-Unzip assembler, resulting in contigs with no gaps in the megabase range, representing 150-fold and 200-fold improvements over the current zebra finch and hummingbird references, respectively. These long-read and phased assemblies corrected and resolved what we discovered to be numerous misassemblies in the references, including missing sequences in gaps, erroneous sequences flanking gaps, base call errors in difficult-to-sequence regions, complex repeat structure errors, and allelic differences between the 2 haplotypes. These improvements were validated by single long-genome and transcriptome reads and resulted for the first time in completely resolved protein-coding genes widely studied in neuroscience and specialized in vocal learning species. These findings demonstrate the impact of long reads, sequencing of previously difficult-to-sequence regions, and phasing of haplotypes on generating the high-quality assemblies necessary for understanding gene structure, function, and evolution. © The Authors 2017. Published by Oxford University Press.

  19. Polymorphism identification and improved genome annotation of Brassica rapa through Deep RNA sequencing.

    PubMed

    Devisetty, Upendra Kumar; Covington, Michael F; Tat, An V; Lekkala, Saradadevi; Maloof, Julin N

    2014-08-12

    The mapping and functional analysis of quantitative traits in Brassica rapa can be greatly improved with the availability of physically positioned, gene-based genetic markers and accurate genome annotation. In this study, deep transcriptome RNA sequencing (RNA-Seq) of Brassica rapa was undertaken with two objectives: SNP detection and improved transcriptome annotation. We performed SNP detection on two varieties that are parents of a mapping population to aid in development of a marker system for this population and subsequent development of a high-resolution genetic map. An improved Brassica rapa transcriptome was constructed to detect novel transcripts and to improve the current genome annotation. This is useful for accurate mRNA abundance and detection of expression QTL (eQTLs) in mapping populations. Deep RNA-Seq of two Brassica rapa genotypes, R500 (var. trilocularis, Yellow Sarson) and IMB211 (a rapid cycling variety), using eight different tissues (root, internode, leaf, petiole, apical meristem, floral meristem, silique, and seedling) grown across three different environments (growth chamber, greenhouse and field) and under two different treatments (simulated sun and simulated shade) generated 2.3 billion high-quality Illumina reads. A total of 330,995 SNPs were identified in transcribed regions between the two genotypes with an average frequency of one SNP in every 200 bases. The deep RNA-Seq reassembled Brassica rapa transcriptome identified 44,239 protein-coding genes. Compared with current gene models of B. rapa, we detected 3537 novel transcripts, 23,754 gene models had structural modifications, and 3655 annotated proteins changed. Gaps in the current genome assembly of B. rapa are highlighted by our identification of 780 unmapped transcripts. All the SNPs, annotations, and predicted transcripts can be viewed at http://phytonetworks.ucdavis.edu/.

  20. Polymorphism Identification and Improved Genome Annotation of Brassica rapa Through Deep RNA Sequencing

    PubMed Central

    Devisetty, Upendra Kumar; Covington, Michael F.; Tat, An V.; Lekkala, Saradadevi; Maloof, Julin N.

    2014-01-01

    The mapping and functional analysis of quantitative traits in Brassica rapa can be greatly improved with the availability of physically positioned, gene-based genetic markers and accurate genome annotation. In this study, deep transcriptome RNA sequencing (RNA-Seq) of Brassica rapa was undertaken with two objectives: SNP detection and improved transcriptome annotation. We performed SNP detection on two varieties that are parents of a mapping population to aid in development of a marker system for this population and subsequent development of a high-resolution genetic map. An improved Brassica rapa transcriptome was constructed to detect novel transcripts and to improve the current genome annotation. This is useful for accurate mRNA abundance and detection of expression QTL (eQTLs) in mapping populations. Deep RNA-Seq of two Brassica rapa genotypes, R500 (var. trilocularis, Yellow Sarson) and IMB211 (a rapid cycling variety), using eight different tissues (root, internode, leaf, petiole, apical meristem, floral meristem, silique, and seedling) grown across three different environments (growth chamber, greenhouse and field) and under two different treatments (simulated sun and simulated shade) generated 2.3 billion high-quality Illumina reads. A total of 330,995 SNPs were identified in transcribed regions between the two genotypes with an average frequency of one SNP in every 200 bases. The deep RNA-Seq reassembled Brassica rapa transcriptome identified 44,239 protein-coding genes. Compared with current gene models of B. rapa, we detected 3537 novel transcripts, 23,754 gene models had structural modifications, and 3655 annotated proteins changed. Gaps in the current genome assembly of B. rapa are highlighted by our identification of 780 unmapped transcripts. All the SNPs, annotations, and predicted transcripts can be viewed at http://phytonetworks.ucdavis.edu/. PMID:25122667

  1. Mitogenome polymorphism in a single branch sample revealed by SOLiD deep sequencing of the Lophelia pertusa coral genome.

    PubMed

    Emblem, Ase; Karlsen, Bård Ove; Evertsen, Jussi; Miller, David J; Moum, Truls; Johansen, Steinar D

    2012-09-15

    We present an initial genomic analysis of the non-symbiotic scleractinian coral Lophelia pertusa, the dominant cold-water reef-building coral species in the North Atlantic Ocean. A significant fraction of the deep sequencing reads was of mitochondrial and microbial origins. SOLiD deep sequencing reads from fragment library experiments of total DNA and PCR amplified mitogenome generated about 21,000 times and 136,000 times coverage, respectively, of the 16,150 bp mitogenome. Five polymorphic sites that include two non-synonymous sites in the NADH dehydrogenase subunit 5 genes were detected in both experiments. This observation is surprising since anthozoans in general exhibit very low mtDNA sequence variation at intraspecific level compared to nuclear sequences. Sequences from more than fifty bacterial species associated with the coral isolate were also detected, representing at least ten complete genomes. Most reads, however, were predicted to originate from the Lophelia nuclear genome.
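
    The coverage figures can be related to sequencing volume with simple arithmetic: mean coverage is the number of mapped bases divided by the genome length. The short sketch below works through the numbers quoted in the abstract; the function names are illustrative, and the read length and read counts of the actual experiments are not given in the abstract, so only total mapped bases are computed.

```python
MITOGENOME_BP = 16_150  # mitogenome length stated in the abstract

def mean_coverage(mapped_bases, genome_length):
    """Mean depth of coverage: mapped bases per genome position."""
    return mapped_bases / genome_length

def bases_for_coverage(coverage, genome_length):
    """Total mapped bases implied by a given mean coverage."""
    return coverage * genome_length

total_dna_bases = bases_for_coverage(21_000, MITOGENOME_BP)
pcr_amplicon_bases = bases_for_coverage(136_000, MITOGENOME_BP)
print(total_dna_bases)     # 339150000, roughly 0.34 Gb of mitochondrial reads
print(pcr_amplicon_bases)  # 2196400000, roughly 2.2 Gb of mitochondrial reads
```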

  2. Draft Genome Sequence of Deep-Sea Alteromonas sp. Strain V450 Isolated from the Marine Sponge Leiodermatium sp.

    PubMed Central

    Barrett, Nolan H.; McCarthy, Peter J.

    2017-01-01

    ABSTRACT The proteobacterium Alteromonas sp. strain V450 was isolated from the Atlantic deep-sea sponge Leiodermatium sp. Here, we report the draft genome sequence of this strain, with a genome size of approx. 4.39 Mb and a G+C content of 44.01%. The results will aid deep-sea microbial ecology, evolution, and sponge-microbe association studies. PMID:28153886

  3. Deep sequencing of small RNAs confirms an annelid affinity of Myzostomida.

    PubMed

    Helm, Conrad; Bernhart, Stephan H; Höner zu Siederdissen, Christian; Nickel, Birgit; Bleidorn, Christoph

    2012-07-01

    Myzostomida comprise a group of marine worms associated mainly with echinoderms since the Carboniferous. Because of their unusual morphology, their phylogenetic position in relation to other Lophotrochozoa has been debated since their description. According to different morphological and molecular markers, the Myzostomida are close to either Platyzoa or Annelida. Here we investigated small non-coding RNAs of Myzostoma cirriferum to infer the phylogenetic position of myzostomids. Based on transcriptomic data collected by Illumina deep sequencing, we analyzed the microRNA (miRNA) families occurring in M. cirriferum. Phylogenetic analysis revealed the presence of 13 miRNA families exclusively shared by Annelida (including Sipuncula) and Myzostomida, providing highly significant support for an annelid origin of myzostomids. Furthermore, using a mapping approach and secondary structure models, we predicted several miRNA candidates unique to myzostomids.

  4. Analysis of Plasmodium falciparum diversity in natural infections by deep sequencing

    PubMed Central

    Manske, Magnus; Miotto, Olivo; Campino, Susana; Auburn, Sarah; Almagro-Garcia, Jacob; Maslen, Gareth; O’Brien, Jack; Djimde, Abdoulaye; Doumbo, Ogobara; Zongo, Issaka; Ouedraogo, Jean-Bosco; Michon, Pascal; Mueller, Ivo; Siba, Peter; Nzila, Alexis; Borrmann, Steffen; Kiara, Steven M.; Marsh, Kevin; Jiang, Hongying; Su, Xin-Zhuan; Amaratunga, Chanaki; Fairhurst, Rick; Socheat, Duong; Nosten, Francois; Imwong, Mallika; White, Nicholas J.; Sanders, Mandy; Anastasi, Elisa; Alcock, Dan; Drury, Eleanor; Oyola, Samuel; Quail, Michael A.; Turner, Daniel J.; Rubio, Valentin Ruano; Jyothi, Dushyanth; Amenga-Etego, Lucas; Hubbart, Christina; Jeffreys, Anna; Rowlands, Kate; Sutherland, Colin; Roper, Cally; Mangano, Valentina; Modiano, David; Tan, John C.; Ferdig, Michael T.; Amambua-Ngwa, Alfred; Conway, David J.; Takala-Harrison, Shannon; Plowe, Christopher V.; Rayner, Julian C.; Rockett, Kirk A.; Clark, Taane G.; Newbold, Chris I.; Berriman, Matthew; MacInnis, Bronwyn; Kwiatkowski, Dominic P.

    2013-01-01

    Malaria elimination strategies require surveillance of the parasite population for genetic changes that demand a public health response, such as new forms of drug resistance. Here we describe methods for large-scale analysis of genetic variation in Plasmodium falciparum by deep sequencing of parasite DNA obtained from the blood of patients with malaria, either directly or after short term culture. Analysis of 86,158 exonic SNPs that passed genotyping quality control in 227 samples from Africa, Asia and Oceania provides genome-wide estimates of allele frequency distribution, population structure and linkage disequilibrium. By comparing the genetic diversity of individual infections with that of the local parasite population, we derive a metric of within-host diversity that is related to the level of inbreeding in the population. An open-access web application has been established for exploration of regional differences in allele frequency and of highly differentiated loci in the P. falciparum genome. PMID:22722859
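
    The abstract does not state the formula for the within-host diversity metric. One form that such a comparison can take, assumed here purely for illustration, is an inbreeding-style coefficient F = 1 - Hw/Hs, where Hw is the mean heterozygosity within a single infection and Hs is the mean heterozygosity in the local population. The sketch below uses biallelic SNPs and a simple ratio of means; the function names and toy allele frequencies are hypothetical, and the published estimator may differ.

```python
def heterozygosity(p):
    """Expected heterozygosity at a biallelic SNP with allele frequency p."""
    return 2.0 * p * (1.0 - p)

def within_host_index(within_host_freqs, population_freqs):
    """Assumed metric 1 - Hw/Hs, using mean heterozygosity across SNPs."""
    if len(within_host_freqs) != len(population_freqs):
        raise ValueError("both inputs must cover the same SNPs")
    hw = sum(heterozygosity(p) for p in within_host_freqs) / len(within_host_freqs)
    hs = sum(heterozygosity(p) for p in population_freqs) / len(population_freqs)
    if hs == 0:
        raise ValueError("population heterozygosity is zero; metric undefined")
    return 1.0 - hw / hs

population = [0.3, 0.5, 0.2]      # toy population allele frequencies
clonal_sample = [0.0, 1.0, 0.0]   # every SNP fixed within the host
mixed_sample = [0.3, 0.5, 0.2]    # as diverse as the population itself

print(within_host_index(clonal_sample, population))
print(within_host_index(mixed_sample, population))
```

    Under this assumed definition a clonal infection scores 1 and an infection as diverse as the surrounding population scores 0.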

  5. Advancing Eucalyptus genomics: identification and sequencing of lignin biosynthesis genes from deep-coverage BAC libraries

    PubMed Central

    2011-01-01

    Background Eucalyptus species are among the most planted hardwoods in the world because of their rapid growth, adaptability and valuable wood properties. The development and integration of genomic resources into breeding practice will be increasingly important in the decades to come. Bacterial artificial chromosome (BAC) libraries are key genomic tools that enable positional cloning of important traits, synteny evaluation, and the development of genome framework physical maps for genetic linkage and genome sequencing. Results We describe the construction and characterization of two deep-coverage BAC libraries, EG_Ba and EG_Bb, obtained from nuclear DNA fragments of E. grandis (clone BRASUZ1) digested with HindIII and BstYI, respectively. Genome coverages of 17 and 15 haploid genome equivalents were estimated for EG_Ba and EG_Bb, respectively. Both libraries contained large inserts, with average sizes ranging from 135 Kb (EG_Bb) to 157 Kb (EG_Ba), and very low extra-nuclear genome contamination, providing a probability of finding a single-copy gene of ≥ 99.99%. Libraries were screened for the presence of several genes of interest via hybridizations to high-density BAC filters followed by PCR validation. Five selected BAC clones were sequenced and assembled using the Roche GS FLX technology, providing the whole sequence of the E. grandis chloroplast genome and complete genomic sequences of important lignin biosynthesis genes. Conclusions The two E. grandis BAC libraries described in this study represent an important milestone for the advancement of Eucalyptus genomics and forest tree research. These BAC resources have a highly redundant genome coverage (> 15×), contain large average inserts and have a very low percentage of clones with organellar DNA or empty vectors. These publicly available BAC libraries are thus suitable for a broad range of applications in genetic and genomic research in Eucalyptus and possibly in related species of Myrtaceae, including genome

  6. Complex Genotype Mixtures Analyzed by Deep Sequencing in Two Different Regions of Hepatitis B Virus.

    PubMed

    Caballero, Andrea; Gregori, Josep; Homs, Maria; Tabernero, David; Gonzalez, Carolina; Quer, Josep; Blasi, Maria; Casillas, Rosario; Nieto, Leonardo; Riveiro-Barciela, Mar; Esteban, Rafael; Buti, Maria; Rodriguez-Frias, Francisco

    2015-01-01

    This study assesses the presence and outcome of genotype mixtures in the polymerase/surface and X/preCore regions of the HBV genome in patients with chronic hepatitis B virus (HBV) infection. Thirty samples from ten chronic hepatitis B patients were included. The polymerase/surface and X/preCore regions were analyzed by deep sequencing (UDPS) in the first available sample at diagnosis, a pre-treatment sample, and a sample while under treatment. HBV genotype was determined by phylogenesis. Quasispecies complexity was evaluated by mutation frequency and nucleotide diversity. The polymerase/surface and X/preCore regions were validated for genotyping from 113 GenBank reference sequences. UDPS yielded a median of 10,960 sequences per sample (IQR 16,645) in the polymerase/surface region and 11,595 sequences per sample (IQR 14,682) in X/preCore. Genotype mixtures were more common in X/preCore (90%) than in polymerase/surface (30%) (p<0.001). On X/preCore genotyping, all samples were genotype A, whereas polymerase/surface yielded genotypes A (80%), D (16.7%), and F (3.3%) (p = 0.036). Genotype changes in polymerase/surface were observed in four patients during natural quasispecies dynamics and in two patients during treatment. There were no genotype changes in X/preCore. Quasispecies complexity was higher in X/preCore than in polymerase/surface (p = 0.004). The results provide evidence of genotype mixtures and differential genotype proportions in the polymerase/surface and X/preCore regions. The genotype dynamics in HBV infection and the different patterns of quasispecies complexity in the HBV genome suggest a new paradigm for HBV genotype classification.
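
    The two quasispecies complexity measures named in the abstract can be sketched as follows. The definitions are common ones and are assumed here, since the abstract does not give the study's exact estimators: mutation frequency is taken as the fraction of sequenced bases that differ from the most abundant haplotype, and nucleotide diversity as the mean pairwise difference per site between reads, with no small-sample correction. The haplotypes and read counts are invented toy data.

```python
from itertools import combinations

def hamming(a, b):
    """Number of mismatching positions between two aligned, equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b))

def mutation_frequency(haplotypes):
    """haplotypes: dict of sequence -> read count. Mismatches vs. the dominant haplotype."""
    master = max(haplotypes, key=haplotypes.get)
    total_reads = sum(haplotypes.values())
    mismatches = sum(n * hamming(seq, master) for seq, n in haplotypes.items())
    return mismatches / (total_reads * len(master))

def nucleotide_diversity(haplotypes):
    """Mean pairwise differences per site, weighting haplotypes by read frequency."""
    total_reads = sum(haplotypes.values())
    length = len(next(iter(haplotypes)))
    pi = 0.0
    for (s1, n1), (s2, n2) in combinations(haplotypes.items(), 2):
        pi += 2 * (n1 / total_reads) * (n2 / total_reads) * hamming(s1, s2)
    return pi / length

toy = {"ACGTACGT": 80, "ACGTACGA": 15, "ACCTACGA": 5}
print(mutation_frequency(toy))
print(nucleotide_diversity(toy))
```

    For the toy data, 25 mismatching bases out of 800 sequenced bases give a mutation frequency of 0.03125, and the frequency-weighted pairwise differences sum to 0.415 over 8 sites, giving a nucleotide diversity of about 0.0519.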

  7. Complex Genotype Mixtures Analyzed by Deep Sequencing in Two Different Regions of Hepatitis B Virus

    PubMed Central

    Homs, Maria; Tabernero, David; Gonzalez, Carolina; Quer, Josep; Blasi, Maria; Casillas, Rosario; Nieto, Leonardo; Riveiro-Barciela, Mar; Esteban, Rafael; Buti, Maria; Rodriguez-Frias, Francisco

    2015-01-01

    This study assesses the presence and outcome of genotype mixtures in the polymerase/surface and X/preCore regions of the HBV genome in patients with chronic hepatitis B virus (HBV) infection. Thirty samples from ten chronic hepatitis B patients were included. The polymerase/surface and X/preCore regions were analyzed by deep sequencing (UDPS) in the first available sample at diagnosis, a pre-treatment sample, and a sample while under treatment. HBV genotype was determined by phylogenesis. Quasispecies complexity was evaluated by mutation frequency and nucleotide diversity. The polymerase/surface and X/preCore regions were validated for genotyping from 113 GenBank reference sequences. UDPS yielded a median of 10,960 sequences per sample (IQR 16,645) in the polymerase/surface region and 11,595 sequences per sample (IQR 14,682) in X/preCore. Genotype mixtures were more common in X/preCore (90%) than in polymerase/surface (30%) (p<0.001). On X/preCore genotyping, all samples were genotype A, whereas polymerase/surface yielded genotypes A (80%), D (16.7%), and F (3.3%) (p = 0.036). Genotype changes in polymerase/surface were observed in four patients during natural quasispecies dynamics and in two patients during treatment. There were no genotype changes in X/preCore. Quasispecies complexity was higher in X/preCore than in polymerase/surface (p = 0.004). The results provide evidence of genotype mixtures and differential genotype proportions in the polymerase/surface and X/preCore regions. The genotype dynamics in HBV infection and the different patterns of quasispecies complexity in the HBV genome suggest a new paradigm for HBV genotype classification. PMID:26714168

  8. Dysregulation of B Cell Repertoire Formation in Myasthenia Gravis Patients Revealed through Deep Sequencing.

    PubMed

    Vander Heiden, Jason A; Stathopoulos, Panos; Zhou, Julian Q; Chen, Luan; Gilbert, Tamara J; Bolen, Christopher R; Barohn, Richard J; Dimachkie, Mazen M; Ciafaloni, Emma; Broering, Teresa J; Vigneault, Francois; Nowak, Richard J; Kleinstein, Steven H; O'Connor, Kevin C

    2017-02-15

    Myasthenia gravis (MG) is a prototypical B cell-mediated autoimmune disease affecting 20-50 people per 100,000. The majority of patients fall into two clinically distinguishable types based on whether they produce autoantibodies targeting the acetylcholine receptor (AChR-MG) or muscle specific kinase (MuSK-MG). The autoantibodies are pathogenic, but whether their generation is associated with broader defects in the B cell repertoire is unknown. To address this question, we performed deep sequencing of the BCR repertoire of AChR-MG, MuSK-MG, and healthy subjects to generate ∼518,000 unique VH and VL sequences from sorted naive and memory B cell populations. AChR-MG and MuSK-MG subjects displayed distinct gene segment usage biases in both VH and VL sequences within the naive and memory compartments. The memory compartment of AChR-MG was further characterized by reduced positive selection of somatic mutations in the VH CDR and altered VH CDR3 physicochemical properties. The VL repertoire of MuSK-MG was specifically characterized by reduced V-J segment distance in recombined sequences, suggesting diminished VL receptor editing during B cell development. Our results identify large-scale abnormalities in both the naive and memory B cell repertoires. Particular abnormalities were unique to either AChR-MG or MuSK-MG, indicating that the repertoires reflect the distinct properties of the subtypes. These repertoire abnormalities are consistent with previously observed defects in B cell tolerance checkpoints in MG, thereby offering additional insight regarding the impact of tolerance defects on peripheral autoimmune repertoires. These collective findings point toward a deformed B cell repertoire as a fundamental component of MG. Copyright © 2017 by The American Association of Immunologists, Inc.

  9. Deep RNA Sequencing of the Skeletal Muscle Transcriptome in Swimming Fish

    PubMed Central

    Palstra, Arjan P.; Beltran, Sergi; Burgerhout, Erik; Brittijn, Sebastiaan A.; Magnoni, Leonardo J.; Henkel, Christiaan V.; Jansen, Hans J.; van den Thillart, Guido E. E. J. M.; Spaink, Herman P.; Planas, Josep V.

    2013-01-01

    Deep RNA sequencing (RNA-seq) was performed to provide an in-depth view of the transcriptome of red and white skeletal muscle of exercised and non-exercised rainbow trout (Oncorhynchus mykiss) with the specific objective to identify expressed genes and quantify the transcriptomic effects of swimming-induced exercise. Pubertal autumn-spawning seawater-raised female rainbow trout were rested (n = 10) or swum (n = 10) for 1176 km at 0.75 body-lengths per second in a 6,000-L swim-flume under reproductive conditions for 40 days. Red and white muscle RNA of exercised and non-exercised fish (4 lanes) was sequenced and resulted in 15–17 million reads per lane that, after de novo assembly, yielded 149,159 red and 118,572 white muscle contigs. Most contigs were annotated using an iterative homology search strategy against salmonid ESTs, the zebrafish Danio rerio genome and general Metazoan genes. When selecting for large contigs (>500 nucleotides), a number of novel rainbow trout gene sequences were identified in this study: 1,085 and 1,228 novel gene sequences for red and white muscle, respectively, which included a number of important molecules for skeletal muscle function. Transcriptomic analysis revealed that sustained swimming increased transcriptional activity in skeletal muscle and specifically an up-regulation of genes involved in muscle growth and developmental processes in white muscle. The unique collection of transcripts will contribute to our understanding of red and white muscle physiology, specifically during the long-term reproductive migration of salmonids. PMID:23308156

  10. Deep Sequencing Reveals Low Incidence of Endogenous LINE-1 Retrotransposition in Human Induced Pluripotent Stem Cells

    PubMed Central

    Arokium, Hubert; Kim, Namshin; Liang, Min; Presson, Angela P.; Chen, Irvin S.

    2014-01-01

    Long interspersed element-1 (LINE-1 or L1) retrotransposition induces insertional mutations that can result in diseases. It was recently shown that the copy number of L1 and other retroelements is stable in induced pluripotent stem cells (iPSCs). However, by using an engineered reporter construct over-expressing L1, another study suggests that reprogramming activates L1 mobility in iPSCs. Given the potential of human iPSCs in therapeutic applications, it is important to clarify whether these cells harbor somatic insertions resulting from endogenous L1 retrotransposition. Here, we verified L1 expression during and after reprogramming as well as potential somatic insertions driven by the most active human endogenous L1 subfamily (L1Hs). Our results indicate that L1 over-expression is initiated during the reprogramming process and is subsequently sustained in isolated clones. To detect potential somatic insertions in iPSCs caused by L1Hs retrotransposition, we used a novel sequencing strategy. In contrast to the conventional sequencing direction, we sequenced from the 3′ end of L1Hs to the genomic DNA, thus enabling the direct detection of the polyA tail signature of retrotransposition for verification of true insertions. Deep coverage sequencing thus allowed us to detect seven potential somatic insertions with low read counts from two iPSC clones. Negative PCR amplification in parental cells, presence of a polyA tail and absence from seven L1 germline insertion databases strongly suggested true somatic insertions in iPSCs. Furthermore, these insertions could not be detected in iPSCs by PCR, likely due to low abundance. We conclude that L1Hs retrotransposes at low levels in iPSCs and therefore warrants careful analyses for genotoxic effects. PMID:25289675

  11. Retrospective review using targeted deep sequencing reveals mutational differences between gastroesophageal junction and gastric carcinomas.

    PubMed

    Li-Chang, Hector H; Kasaian, Katayoon; Ng, Ying; Lum, Amy; Kong, Esther; Lim, Howard; Jones, Steven Jm; Huntsman, David G; Schaeffer, David F; Yip, Stephen

    2015-02-06

    Adenocarcinomas of both the gastroesophageal junction and stomach are molecularly complex, but differ with respect to epidemiology, etiology and survival. There are few data directly comparing the frequencies of single nucleotide mutations in cancer-related genes between the two sites. Sequencing of targeted gene panels may be useful in uncovering multiple genomic aberrations using a single test. DNA from 92 gastroesophageal junction and 75 gastric adenocarcinoma resection specimens was extracted from formalin-fixed paraffin-embedded tissue. Targeted deep sequencing of 46 cancer-related genes was performed through emulsion PCR followed by semiconductor-based sequencing. Gastroesophageal junction and gastric carcinomas were contrasted with respect to mutational profiles, immunohistochemistry and in situ hybridization, as well as corresponding clinicopathologic data. Gastroesophageal junction carcinomas were associated with younger age, more frequent intestinal-type histology, more frequent p53 overexpression, and worse disease-free survival on multivariable analysis. Among all cases, 145 mutations were detected in 31 genes. TP53 mutations were the most common abnormality detected, and were more common in gastroesophageal junction carcinomas (42% vs. 27%, p = 0.036). Mutations in the Wnt pathway components APC and CTNNB1 were more common among gastric carcinomas (16% vs. 3%, p = 0.006), and gastric carcinomas were more likely to have ≥3 driver mutations detected (11% vs. 2%, p = 0.044). Twenty percent of cases had potentially actionable mutations identified. R132H and R132C missense mutations in the IDH1 gene were observed, and are the first reported mutations of their kind in gastric carcinoma. Panel sequencing of routine pathology material can yield mutational information on several driver genes, including some for which targeted therapies are available. Differing rates of mutations and clinicopathologic differences support a distinction between

  12. Small RNA Library Cloning Procedure for Deep Sequencing of Specific Endogenous siRNA Classes in Caenorhabditis elegans

    PubMed Central

    Ow, Maria C.; Lau, Nelson C.; Hall, Sarah E.

    2017-01-01

    In recent years, distinct classes of small RNAs ranging in size from ~21 to 26 nucleotides have been discovered and shown to play important roles in a wide array of cellular functions. Because of the abundance of these small RNAs, library preparation from an RNA sample followed by deep sequencing provides the identity and quantity of a particular class of small RNAs. In this chapter we describe a detailed protocol for preparing small RNA libraries for deep sequencing on the Illumina platform from the nematode C. elegans. PMID:24920360

  13. Evaluation of Nine Somatic Variant Callers for Detection of Somatic Mutations in Exome and Targeted Deep Sequencing Data.

    PubMed

    Krøigård, Anne Bruun; Thomassen, Mads; Lænkholm, Anne-Vibeke; Kruse, Torben A; Larsen, Martin Jakob

    2016-01-01

    Next generation sequencing is extensively applied to catalogue somatic mutations in cancer, in research settings and increasingly in clinical settings for molecular diagnostics, guiding therapy decisions. Somatic variant callers perform paired comparisons of sequencing data from cancer tissue and matched normal tissue in order to detect somatic mutations. The advent of many new somatic variant callers creates a need for comparison and validation of the tools, as no de facto standard for detection of somatic mutations exists and only limited comparisons have been reported. We have performed a comprehensive evaluation using exome sequencing and targeted deep sequencing data of paired tumor-normal samples from five breast cancer patients to evaluate the performance of nine publicly available somatic variant callers: EBCall, Mutect, Seurat, Shimmer, Indelocator, Somatic Sniper, Strelka, VarScan 2 and Virmid for the detection of single nucleotide mutations and small deletions and insertions. We report a large variation in the number of calls from the nine somatic variant callers on the same sequencing data and highly variable agreement. Sequencing depth had markedly diverse impact on individual callers, as for some callers, increased sequencing depth highly improved sensitivity. For SNV calling, we report EBCall, Mutect, Virmid and Strelka to be the most reliable somatic variant callers for both exome sequencing and targeted deep sequencing. For indel calling, EBCall is superior due to high sensitivity and robustness to changes in sequencing depths.

  14. Evaluation of Nine Somatic Variant Callers for Detection of Somatic Mutations in Exome and Targeted Deep Sequencing Data

    PubMed Central

    Krøigård, Anne Bruun; Thomassen, Mads; Lænkholm, Anne-Vibeke; Kruse, Torben A.; Larsen, Martin Jakob

    2016-01-01

    Next generation sequencing is extensively applied to catalogue somatic mutations in cancer, in research settings and increasingly in clinical settings for molecular diagnostics, guiding therapy decisions. Somatic variant callers perform paired comparisons of sequencing data from cancer tissue and matched normal tissue in order to detect somatic mutations. The advent of many new somatic variant callers creates a need for comparison and validation of the tools, as no de facto standard for detection of somatic mutations exists and only limited comparisons have been reported. We have performed a comprehensive evaluation using exome sequencing and targeted deep sequencing data of paired tumor-normal samples from five breast cancer patients to evaluate the performance of nine publicly available somatic variant callers: EBCall, Mutect, Seurat, Shimmer, Indelocator, Somatic Sniper, Strelka, VarScan 2 and Virmid for the detection of single nucleotide mutations and small deletions and insertions. We report a large variation in the number of calls from the nine somatic variant callers on the same sequencing data and highly variable agreement. Sequencing depth had markedly diverse impact on individual callers, as for some callers, increased sequencing depth highly improved sensitivity. For SNV calling, we report EBCall, Mutect, Virmid and Strelka to be the most reliable somatic variant callers for both exome sequencing and targeted deep sequencing. For indel calling, EBCall is superior due to high sensitivity and robustness to changes in sequencing depths. PMID:27002637
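
The "highly variable agreement" between callers reported above can be quantified in several ways. The sketch below uses the pairwise Jaccard index over call sets, which is an assumption for illustration and not necessarily the measure used in the study. The caller names and variant tuples are hypothetical toy data.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard index of two call sets; defined as 1.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def pairwise_agreement(calls):
    """Return {(caller_a, caller_b): Jaccard index} for every caller pair."""
    return {
        (x, y): jaccard(calls[x], calls[y])
        for x, y in combinations(sorted(calls), 2)
    }


# Hypothetical toy call sets keyed by (chrom, pos, ref, alt).
# These are NOT results from the evaluation described above.
calls = {
    "caller_A": {("chr1", 100, "A", "T"), ("chr2", 200, "G", "C"), ("chr3", 300, "C", "T")},
    "caller_B": {("chr1", 100, "A", "T"), ("chr2", 200, "G", "C")},
    "caller_C": {("chr1", 100, "A", "T"), ("chr4", 400, "T", "G")},
}

agreement = pairwise_agreement(calls)
for pair, score in agreement.items():
    print(pair, round(score, 3))
```

A value of 1.0 means two callers report identical variant sets on the same data; values near 0 mean they share almost no calls.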

  15. Enhanced methods for unbiased deep sequencing of Lassa and Ebola RNA viruses from clinical and biological samples.

    PubMed

    Matranga, Christian B; Andersen, Kristian G; Winnicki, Sarah; Busby, Michele; Gladden, Adrianne D; Tewhey, Ryan; Stremlau, Matthew; Berlin, Aaron; Gire, Stephen K; England, Eleina; Moses, Lina M; Mikkelsen, Tarjei S; Odia, Ikponmwonsa; Ehiane, Philomena E; Folarin, Onikepe; Goba, Augustine; Kahn, S Humarr; Grant, Donald S; Honko, Anna; Hensley, Lisa; Happi, Christian; Garry, Robert F; Malboeuf, Christine M; Birren, Bruce W; Gnirke, Andreas; Levin, Joshua Z; Sabeti, Pardis C

    2014-01-01

    We have developed a robust RNA sequencing method for generating complete de novo assemblies with intra-host variant calls of Lassa and Ebola virus genomes in clinical and biological samples. Our method uses targeted RNase H-based digestion to remove contaminating poly(rA) carrier and ribosomal RNA. This depletion step improves both the quality of data and quantity of informative reads in unbiased total RNA sequencing libraries. We have also developed a hybrid-selection protocol to further enrich the viral content of sequencing libraries. These protocols have enabled rapid deep sequencing of both Lassa and Ebola virus and are broadly applicable to other viral genomics studies.

  16. Engineering and analysis of peptide-recognition domain specificities by phage display and deep sequencing.

    PubMed

    McLaughlin, Megan E; Sidhu, Sachdev S

    2013-01-01

    Protein interaction networks depend in part on the specific recognition of unstructured peptides by folded domains. Understanding how members of a domain family use a similar fold to recognize different peptide sequences selectively is a fundamental question. One way to advance our understanding of peptide recognition is to apply an existing model of peptide recognition for a particular domain toward engineering synthetic domain variants with desired properties. Successes, failures, and unintended outcomes can help refine the model and can illuminate more general principles of peptide recognition. Using the PDZ domain fold as an example, we describe methods for (1) structure-based combinatorial library design and directed evolution of domain variants and (2) specificity profiling of large repertoires of synthetic variants using multiplexed deep sequencing. Peptide-binding preferences for hundreds of variants can be decoded in parallel, enabling comparisons between different library designs and selection pressures. The tremendous depth of coverage of the binding peptide profiles also permits robust computational analysis. This approach to studying peptide recognition can be applied to other domains and to a variety of structural and functional models by tailoring the combinatorial library design and selection pressures accordingly. Copyright © 2013 Elsevier Inc. All rights reserved.

  17. Quantifying perinatal transmission of Hepatitis B viral quasispecies by tag linkage deep sequencing.

    PubMed

    Du, Yushen; Chi, Xiumei; Wang, Chong; Jiang, Jing; Kong, Fei; Yan, Hongqing; Wang, Xiaomei; Li, Jie; Wu, Nicholas C; Dai, Lei; Zhang, Tian-Hao; Shu, Sara; Zhou, Jian; Yoshizawa, Janice M; Li, Xinmin; Bhattacharya, Debika; Wu, Ting-Ting; Niu, Junqi; Sun, Ren

    2017-08-31

    Despite full immunoprophylaxis, mother-to-child transmission (MTCT) of Hepatitis B Virus still occurs in approximately 2-5% of HBsAg positive mothers. Little is known about the bottleneck of HBV transmission and the evolution of viral quasispecies in the context of MTCT. Here we adopted a newly developed tag linkage deep sequencing method and analyzed the quasispecies of four MTCT pairs that broke through immunoprophylaxis. By assigning unique tags to individual viral sequences, we accurately reconstructed HBV haplotypes in a region of 836 bp, which contains the major immune epitopes and drug resistance mutations. The detection limit of minor viral haplotypes reached 0.1% for individual patient samples. Dominance of "a determinant" polymorphisms was observed in two children; these polymorphisms pre-existed as minor quasispecies in maternal samples. In all four pairs of MTCT samples, we consistently observed a significant overlap of viral haplotypes shared between mother and child. We also demonstrate that the data can be potentially useful to estimate the bottleneck effect during HBV MTCT, which provides information to optimize treatment for reducing the frequency of MTCT.
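
The abstract above does not spell out how tagged reads are turned into haplotypes. As a minimal sketch, assuming that reads sharing a tag derive from one template molecule and can be collapsed by column-wise majority vote, haplotype frequencies could be estimated as follows. The function names, the `min_reads` threshold and the toy reads are all hypothetical.

```python
from collections import Counter, defaultdict


def consensus(seqs):
    """Column-wise majority vote over equal-length sequences."""
    return "".join(Counter(col).most_common(1)[0][0] for col in zip(*seqs))


def haplotype_frequencies(tagged_reads, min_reads=2):
    """Collapse reads by tag, then return {haplotype: fraction of tags}.

    Tags supported by fewer than min_reads reads are dropped (assumed filter).
    """
    groups = defaultdict(list)
    for tag, seq in tagged_reads:
        groups[tag].append(seq)
    haplotypes = Counter(
        consensus(seqs) for seqs in groups.values() if len(seqs) >= min_reads
    )
    total = sum(haplotypes.values())
    return {h: n / total for h, n in haplotypes.items()} if total else {}


# Hypothetical toy reads as (tag, sequence); T1 carries one sequencing error.
reads = [
    ("T1", "ACGT"), ("T1", "ACGT"), ("T1", "ACGA"),
    ("T2", "ACGT"), ("T2", "ACGT"),
    ("T3", "TCGT"), ("T3", "TCGT"), ("T3", "TCGT"),
    ("T4", "GGGG"),  # single read, dropped by the min_reads filter
]
freqs = haplotype_frequencies(reads)
print(freqs)
```

Counting tags rather than raw reads means that uneven amplification of individual templates does not distort the estimated haplotype frequencies.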

  18. Ultra Deep Sequencing of a Baculovirus Population Reveals Widespread Genomic Variations

    PubMed Central

    Chateigner, Aurélien; Bézier, Annie; Labrousse, Carole; Jiolle, Davy; Barbe, Valérie; Herniou, Elisabeth A.

    2015-01-01

    Viruses rely on widespread genetic variation and large population size for adaptation. Large DNA virus populations are thought to harbor little variation though natural populations may be polymorphic. To measure the genetic variation present in a dsDNA virus population, we deep sequenced a natural strain of the baculovirus Autographa californica multiple nucleopolyhedrovirus. With 124,221X average genome coverage of our 133,926 bp long consensus, we could detect low frequency mutations (0.025%). K-means clustering was used to classify the mutations in four categories according to their frequency in the population. We found 60 high frequency non-synonymous mutations under balancing selection distributed in all functional classes. These mutants could alter viral adaptation dynamics, either through competitive or synergistic processes. Lastly, we developed a technique for the delimitation of large deletions in next generation sequencing data. We found that large deletions occur along the entire viral genome, with hotspots located in homologous repeat regions (hrs). Present in 25.4% of the genomes, these deletion mutants presumably require functional complementation to complete their infection cycle. They might thus have a large impact on the fitness of the baculovirus population. Altogether, we found a wide breadth of genomic variation in the baculovirus population, suggesting it has high adaptive potential. PMID:26198241
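
To see why the reported depth makes a 0.025% variant detectable, it helps to work out how many reads would be expected to carry it. The sketch below is a back-of-the-envelope calculation using the two figures quoted above. It assumes uniform coverage and independent sampling of reads and ignores sequencing error, none of which is stated in the abstract.

```python
def expected_variant_reads(depth, frequency):
    """Expected number of reads carrying a variant at a given site."""
    return depth * frequency


def prob_no_supporting_read(depth, frequency):
    """Probability that no read carries the variant (independent sampling)."""
    return (1.0 - frequency) ** depth


depth = 124_221        # average genome coverage quoted in the abstract
frequency = 0.00025    # 0.025% expressed as a fraction

expected = expected_variant_reads(depth, frequency)
p_missed = prob_no_supporting_read(depth, frequency)

print(f"expected supporting reads: {expected:.2f}")
print(f"probability of zero supporting reads: {p_missed:.2e}")
```

Under these assumptions, a variant at 0.025% is expected to appear in roughly 31 reads per site, so missing it entirely through sampling alone is very unlikely. Separating it from sequencing error is a separate problem that this sketch does not address.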

  19. Deep sequencing analysis of defective genomes of parainfluenza virus 5 and their role in interferon induction.

    PubMed

    Killip, M J; Young, D F; Gatherer, D; Ross, C S; Short, J A L; Davison, A J; Goodbourn, S; Randall, R E

    2013-05-01

    Preparations of parainfluenza virus 5 (PIV5) that are potent activators of the interferon (IFN) induction cascade were generated by high-multiplicity passage in order to accumulate defective interfering virus genomes (DIs). Nucleocapsid RNA from these virus preparations was extracted and subjected to deep sequencing. Sequencing data were analyzed using methods designed to detect internal deletion and "copyback" DIs in order to identify and characterize the different DIs present and to approximately quantify the ratio of defective to nondefective genomes. Trailer copybacks dominated the DI populations in IFN-inducing preparations of both the PIV5 wild type (wt) and PIV5-VΔC (a recombinant virus that does not encode a functional V protein). Although the PIV5 V protein is an efficient inhibitor of the IFN induction cascade, we show that nondefective PIV5 wt is unable to prevent activation of the IFN response by coinfecting copyback DIs due to the interfering effects of copyback DIs on nondefective virus protein expression. As a result, copyback DIs are able to very rapidly activate the IFN induction cascade prior to the expression of detectable levels of V protein by coinfecting nondefective virus.

  20. Deep RNA sequencing analysis of readthrough gene fusions in human prostate adenocarcinoma and reference samples

    PubMed Central

    2011-01-01

    Background: Readthrough fusions across adjacent genes in the genome, or transcription-induced chimeras (TICs), have been estimated using expressed sequence tag (EST) libraries to involve 4-6% of all genes. Deep transcriptional sequencing (RNA-Seq) now makes it possible to study the occurrence and expression levels of TICs in individual samples across the genome. Methods: We performed single-end RNA-Seq on three human prostate adenocarcinoma samples and their corresponding normal tissues, as well as brain and universal reference samples. We developed two bioinformatics methods to specifically identify TIC events: a targeted alignment method using artificial exon-exon junctions within 200,000 bp from adjacent genes, and genomic alignment allowing splicing within individual reads. We performed further experimental verification and characterization of selected TIC and fusion events using quantitative RT-PCR and comparative genomic hybridization microarrays. Results: Targeted alignment against artificial exon-exon junctions yielded 339 distinct TIC events, including 32 gene pairs with multiple isoforms. The false discovery rate was estimated to be 1.5%. Spliced alignment to the genome was less sensitive, finding only 18% of those found by targeted alignment in 33-nt reads and 59% of those in 50-nt reads. However, spliced alignment revealed 30 cases of TICs with intervening exons, in addition to distant inversions, scrambled genes, and translocations. Our findings increase the catalog of observed TIC gene pairs by 66%. We verified 6 of 6 predicted TICs in all prostate samples, and 2 of 5 predicted novel distant gene fusions, both private events among 54 prostate tumor samples tested. Expression of TICs correlates with that of the upstream gene, which can explain the prostate-specific pattern of some TIC events and the restriction of the SLC45A3-ELK4 e4-e2 TIC to ERG-negative prostate samples, as confirmed in 20 matched prostate tumor and normal samples and 9 lung cancer

  1. Characterization of Small Interfering RNAs Derived from the Geminivirus/Betasatellite Complex Using Deep Sequencing

    PubMed Central

    Yang, Xiuling; Wang, Yu; Guo, Wei; Xie, Yan; Xie, Qi; Fan, Longjiang; Zhou, Xueping

    2011-01-01

    Background: Small RNA (sRNA)-guided RNA silencing is a critical antiviral defense mechanism employed by a variety of eukaryotic organisms. Although the induction of RNA silencing by bipartite and monopartite begomoviruses has been described in plants, the nature of begomovirus/betasatellite complexes remains undefined. Methodology/Principal Findings: Solanum lycopersicum plant leaves systemically infected with Tomato yellow leaf curl China virus (TYLCCNV) alone or together with its associated betasatellite (TYLCCNB), and Nicotiana benthamiana plant leaves systemically infected with TYLCCNV alone, or together with TYLCCNB or with mutant TYLCCNB were harvested for RNA extraction; sRNA cDNA libraries were then constructed and submitted to Solexa-based deep sequencing. Both sense and anti-sense TYLCCNV and TYLCCNB-derived sRNAs (V-sRNAs and S-sRNAs) accumulated preferentially as 22 nucleotide species in infected S. lycopersicum and N. benthamiana plants. High resolution mapping of V-sRNAs and S-sRNAs revealed heterogeneous distribution of V-sRNA and S-sRNA sequences across the TYLCCNV and TYLCCNB genomes. In TYLCCNV-infected S. lycopersicum or N. benthamiana and TYLCCNV and βC1-mutant TYLCCNB co-infected N. benthamiana plants, the primary TYLCCNV targets were AV2 and the 5′ terminus of AV1. In TYLCCNV and betasatellite-infected plants, the number of V-sRNAs targeting this region decreased and the production of V-sRNAs increased corresponding to the overlapping regions of AC2 and AC3, as well as the 3′ terminus of AC1. βC1 is the primary determinant mediating symptom induction and also the primary silencing target of the TYLCCNB genome even in its mutated form. Conclusions/Significance: We report the first high-resolution sRNA map for a monopartite begomovirus and its associated betasatellite using Solexa-based deep sequencing. Our results suggest that viral transcripts might act as RDR substrates resulting in dsRNA and secondary siRNA production. In addition, the

  2. An optimized kit-free method for making strand-specific deep sequencing libraries from RNA fragments

    PubMed Central

    Heyer, Erin E.; Ozadam, Hakan; Ricci, Emiliano P.; Cenik, Can; Moore, Melissa J.

    2015-01-01

    Deep sequencing of strand-specific cDNA libraries is now a ubiquitous tool for identifying and quantifying RNAs in diverse sample types. The accuracy of conclusions drawn from these analyses depends on precise and quantitative conversion of the RNA sample into a DNA library suitable for sequencing. Here, we describe an optimized method of preparing strand-specific RNA deep sequencing libraries from small RNAs and variably sized RNA fragments obtained from ribonucleoprotein particle footprinting experiments or fragmentation of long RNAs. Our approach works across a wide range of input amounts (400 pg to 200 ng), is easy to follow and produces a library in 2–3 days at relatively low reagent cost, all while giving the user complete control over every step. Because all enzymatic reactions were optimized and driven to apparent completion, sequence diversity and species abundance in the input sample are well preserved. PMID:25505164

  3. Deep Sequencing Analysis Reveals Temporal Microbiota Changes Associated with Development of Bovine Digital Dermatitis

    PubMed Central

    Krull, Adam C.; Shearer, Jan K.; Gorden, Patrick J.; Cooper, Vickie L.; Phillips, Gregory J.

    2014-01-01

    Bovine digital dermatitis (DD) is a leading cause of lameness in dairy cattle throughout the world. Despite 35 years of research, the definitive etiologic agent associated with the disease process is still unknown. Previous studies have demonstrated that multiple bacterial species are associated with lesions, with spirochetes being the most reliably identified organism. This study details the deep sequencing-based metagenomic evaluation of 48 staged DD biopsy specimens collected during a 3-year longitudinal study of disease progression. Over 175 million sequences were evaluated by utilizing both shotgun and 16S metagenomic techniques. Based on the shotgun sequencing results, there was no evidence of a fungal or DNA viral etiology. The bacterial microbiota of biopsy specimens progresses through a systematic series of changes that correlate with the novel morphological lesion scoring system developed as part of this project. This scoring system was validated, as the microbiota of each stage was statistically significantly different from those of other stages (P < 0.001). The microbiota of control biopsy specimens were the most diverse and became less diverse as lesions developed. Although Treponema spp. predominated in the advanced lesions, they were in relatively low abundance in the newly described early lesions that are associated with the initiation of the disease process. The consortium of Treponema spp. identified at the onset of disease changes considerably as the lesions progress through the morphological stages identified. The results of this study support the hypothesis that DD is a polybacterial disease process and provide unique insights into the temporal changes in bacterial populations throughout lesion development. PMID:24866801
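
The abstract above reports that control specimens were the most diverse without naming a diversity measure. As an illustration only, the sketch below computes the Shannon index from taxon read counts, which is one common choice. Both the choice of index and the toy counts are assumptions, not details taken from the study.

```python
import math


def shannon_index(counts):
    """Shannon diversity H = -sum(p_i * ln p_i) over taxa with nonzero counts."""
    total = sum(counts)
    if total == 0:
        raise ValueError("at least one count must be positive")
    h = 0.0
    for c in counts:
        if c > 0:
            p = c / total
            h -= p * math.log(p)
    return h


# Hypothetical read counts per taxon (not data from the study).
control = [25, 25, 25, 25]   # evenly spread community
advanced = [97, 1, 1, 1]     # dominated by a single taxon

h_control = shannon_index(control)
h_advanced = shannon_index(advanced)
print(round(h_control, 3), round(h_advanced, 3))
```

An evenly spread community scores higher than one dominated by a single taxon, which matches the qualitative pattern described for control versus advanced lesions.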

  4. Ultra-Deep Sequencing of Mouse Mitochondrial DNA: Mutational Patterns and Their Origins

    PubMed Central

    Freyer, Christoph; Hagström, Erik; Ingman, Max; Larsson, Nils-Göran; Gyllensten, Ulf

    2011-01-01

    Somatic mutations of mtDNA are implicated in the aging process, but there is no universally accepted method for their accurate quantification. We have used ultra-deep sequencing to study genome-wide mtDNA mutation load in the liver of normally- and prematurely-aging mice. Mice that are homozygous for an allele expressing a proof-reading–deficient mtDNA polymerase (mtDNA mutator mice) have 10-times-higher point mutation loads than their wildtype siblings. In addition, the mtDNA mutator mice have increased levels of a truncated linear mtDNA molecule, resulting in decreased sequence coverage in the deleted region. In contrast, circular mtDNA molecules with large deletions occur at extremely low frequencies in mtDNA mutator mice and can therefore not drive the premature aging phenotype. Sequence analysis shows that the main proportion of the mutation load in heterozygous mtDNA mutator mice and their wildtype siblings is inherited from their heterozygous mothers consistent with germline transmission. We found no increase in levels of point mutations or deletions in wildtype C57Bl/6N mice with increasing age, thus questioning the causative role of these changes in aging. In addition, there was no increased frequency of transversion mutations with time in any of the studied genotypes, arguing against oxidative damage as a major cause of mtDNA mutations. Our results from studies of mice thus indicate that most somatic mtDNA mutations occur as replication errors during development and do not result from damage accumulation in adult life. PMID:21455489
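
The argument above rests on the fraction of transversion mutations. The following sketch shows how point substitutions are conventionally classified: a transition stays within purines (A, G) or within pyrimidines (C, T), and a transversion crosses between the two classes. The function names and example substitutions are hypothetical and are not the study's data or pipeline.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def classify_substitution(ref, alt):
    """Return 'transition' or 'transversion' for a single-base change."""
    ref, alt = ref.upper(), alt.upper()
    valid = PURINES | PYRIMIDINES
    if ref not in valid or alt not in valid or ref == alt:
        raise ValueError(f"not a valid substitution: {ref}>{alt}")
    same_class = {ref, alt} <= PURINES or {ref, alt} <= PYRIMIDINES
    return "transition" if same_class else "transversion"


def transversion_fraction(substitutions):
    """Fraction of (ref, alt) pairs that are transversions."""
    kinds = [classify_substitution(r, a) for r, a in substitutions]
    return kinds.count("transversion") / len(kinds)


# Hypothetical substitutions for illustration.
subs = [("A", "G"), ("C", "T"), ("G", "T"), ("C", "A")]
fraction = transversion_fraction(subs)
print(fraction)
```

Tracking this fraction across age groups is one way to check whether a transversion-rich signature grows with time.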

  5. High resolution sequence stratigraphy of Miocene deep-water clastic outcrops, Taranaki coast, New Zealand

    SciTech Connect

    King, P.R.; Browne, G.H.; Slatt, R.M.

    1995-08-01

    Approximately 700 m of deep water clastic deposits of Mt. Messenger Formation are superbly exposed along the Taranaki coast of North Island, New Zealand. Biostratigraphy indicates the interval was deposited during the time span 10.5-9.2 m.y. in water depths grading upward from lower bathyal to middle-upper bathyal. This interval is considered part of a 3rd order depositional sequence deposited under conditions of fluctuating relative sea-level, concomitant with high sedimentation rates. Several 4th order depositional sequences, reflecting successive sea-level falls, are recognized within the interval. Sequence boundaries display a range of erosive morphologies from metre-wide canyons to scours several hundred metres across. All components of a generic lowstand systems tract--basin floor fan, channel-levee complex and prograding complex--are present in logical and temporal order. They are repetitive through the interval, with the relatively shallower-water components becoming more prevalent upward. Basin floor fan lithologies are mainly m-thick, massive and convolute-bedded sandstones that alternate with cm- and dm-thick massive, horizontally-stratified and ripple-laminated sandstones and bioturbated mudstones. Channel-levee deposits consist of interleaving packages of thin-bedded, climbing-rippled and parallel-laminated sandstones and mudstones; infrequent channels are filled with sandstones and mudstones, and sometimes lined with conglomerate. Thin beds of parallel to convoluted mudstone comprise prograding complex deposits. Similar lowstand systems tracts can be recognized and correlated on subsurface seismic reflection profiles and wireline logs. Such correlation has been aided by a continuous outcrop gamma-ray log obtained over most of the measured interval. In the adjacent Taranaki peninsula, basin floor fan and channel-levee deposits comprise hydrocarbon reservoir intervals. Outcrop and subsurface reservoir sandstones exhibit similar permeabilities.

  6. Deep sequencing analysis reveals temporal microbiota changes associated with development of bovine digital dermatitis.

    PubMed

    Krull, Adam C; Shearer, Jan K; Gorden, Patrick J; Cooper, Vickie L; Phillips, Gregory J; Plummer, Paul J

    2014-08-01

    Bovine digital dermatitis (DD) is a leading cause of lameness in dairy cattle throughout the world. Despite 35 years of research, the definitive etiologic agent associated with the disease process is still unknown. Previous studies have demonstrated that multiple bacterial species are associated with lesions, with spirochetes being the most reliably identified organism. This study details the deep sequencing-based metagenomic evaluation of 48 staged DD biopsy specimens collected during a 3-year longitudinal study of disease progression. Over 175 million sequences were evaluated by utilizing both shotgun and 16S metagenomic techniques. Based on the shotgun sequencing results, there was no evidence of a fungal or DNA viral etiology. The bacterial microbiota of biopsy specimens progresses through a systematic series of changes that correlate with the novel morphological lesion scoring system developed as part of this project. This scoring system was validated, as the microbiota of each stage was statistically significantly different from those of other stages (P < 0.001). The microbiota of control biopsy specimens were the most diverse and became less diverse as lesions developed. Although Treponema spp. predominated in the advanced lesions, they were in relatively low abundance in the newly described early lesions that are associated with the initiation of the disease process. The consortium of Treponema spp. identified at the onset of disease changes considerably as the lesions progress through the morphological stages identified. The results of this study support the hypothesis that DD is a polybacterial disease process and provide unique insights into the temporal changes in bacterial populations throughout lesion development. Copyright © 2014, American Society for Microbiology. All Rights Reserved.

  7. Reconstructing the Dynamics of HIV Evolution within Hosts from Serial Deep Sequence Data

    PubMed Central

    Poon, Art F. Y.; Swenson, Luke C.; Bunnik, Evelien M.; Edo-Matas, Diana; Schuitemaker, Hanneke; van 't Wout, Angélique B.; Harrigan, P. Richard

    2012-01-01

    At the early stage of infection, human immunodeficiency virus (HIV)-1 predominantly uses the CCR5 coreceptor for host cell entry. The subsequent emergence of HIV variants that use the CXCR4 coreceptor in roughly half of all infections is associated with an accelerated decline of CD4+ T-cells and rate of progression to AIDS. The presence of a ‘fitness valley’ separating CCR5- and CXCR4-using genotypes is postulated to be a biological determinant of whether the HIV coreceptor switch occurs. Using phylogenetic methods to reconstruct the evolutionary dynamics of HIV within hosts enables us to discriminate between competing models of this process. We have developed a phylogenetic pipeline for the molecular clock analysis, ancestral reconstruction, and visualization of deep sequence data. These data were generated by next-generation sequencing of HIV RNA extracted from longitudinal serum samples (median 7 time points) from 8 untreated subjects with chronic HIV infections (Amsterdam Cohort Studies on HIV-1 infection and AIDS). We used the known dates of sampling to directly estimate rates of evolution and to map ancestral mutations to a reconstructed timeline in units of days. HIV coreceptor usage was predicted from reconstructed ancestral sequences using the geno2pheno algorithm. We determined that the first mutations contributing to CXCR4 use emerged about 16 (per subject range 4 to 30) months before the earliest predicted CXCR4-using ancestor, which preceded the first positive cell-based assay of CXCR4 usage by 10 (range 5 to 25) months. CXCR4 usage arose in multiple lineages within 5 of 8 subjects, and ancestral lineages following alternate mutational pathways before going extinct were common. We observed highly patient-specific distributions and time-scales of mutation accumulation, implying that the role of a fitness valley is contingent on the genotype of the transmitted variant. PMID:23133358

  8. Viral deep sequencing needs an adaptive approach: IRMA, the iterative refinement meta-assembler.

    PubMed

    Shepard, Samuel S; Meno, Sarah; Bahl, Justin; Wilson, Malania M; Barnes, John; Neuhaus, Elizabeth

    2016-09-05

    Deep sequencing makes it possible to observe low-frequency viral variants and sub-populations with greater accuracy and sensitivity than ever before. Existing platforms can be used to multiplex a large number of samples; however, analysis of the resulting data is complex and involves separating barcoded samples and various read manipulation processes ending in final assembly. Many assembly tools were designed with larger genomes and higher fidelity polymerases in mind and do not perform well with reads derived from highly variable viral genomes. Reference-based assemblers may leave gaps in viral assemblies while de novo assemblers may struggle to assemble unique genomes. The IRMA (iterative refinement meta-assembler) pipeline solves the problem of viral variation by the iterative optimization of read gathering and assembly. As with all reference-based assembly, reads are included in assembly when they match consensus template sets; however, IRMA provides for on-the-fly reference editing, correction, and optional elongation without the need for additional reference selection. This increases both read depth and breadth. IRMA also focuses on quality control, error correction, indel reporting, variant calling and variant phasing. In fact, IRMA's ability to detect and phase minor variants is one of its most distinguishing features. We have built modules for influenza and ebolavirus. We demonstrate usage and provide calibration data from mixture experiments. Methods for variant calling, phasing, and error estimation/correction have been redesigned to meet the needs of viral genomic sequencing. IRMA provides a robust next-generation sequencing assembly solution that is adapted to the needs and characteristics of viral genomes. The software solves issues related to the genetic diversity of viruses while providing customized variant calling, phasing, and quality control. IRMA is freely available for non-commercial use on Linux and Mac OS X and has been parallelized for high

  9. ROCker: accurate detection and quantification of target genes in short-read metagenomic data sets by modeling sliding-window bitscores.

    PubMed

    Orellana, Luis H; Rodriguez-R, Luis M; Konstantinidis, Konstantinos T

    2017-02-17

    Functional annotation of metagenomic and metatranscriptomic data sets relies on similarity searches based on e-value thresholds resulting in an unknown number of false positive and negative matches. To overcome these limitations, we introduce ROCker, aimed at identifying position-specific, most-discriminant thresholds in sliding windows along the sequence of a target protein, accounting for non-discriminative domains shared by unrelated proteins. ROCker employs the receiver operating characteristic (ROC) curve to minimize false discovery rate (FDR) and calculate the best thresholds based on how simulated shotgun metagenomic reads of known composition map onto well-curated reference protein sequences and thus, differs from HMM profiles and related methods. We showcase ROCker using ammonia monooxygenase (amoA) and nitrous oxide reductase (nosZ) genes, mediating oxidation of ammonia and the reduction of the potent greenhouse gas, N2O, to inert N2, respectively. ROCker typically showed 60-fold lower FDR when compared to the common practice of using fixed e-values. Previously uncounted ‘atypical’ nosZ genes were found to be two times more abundant, on average, than their typical counterparts in most soil metagenomes and the abundance of bacterial amoA was quantified against the highly-related particulate methane monooxygenase (pmoA). Therefore, ROCker can reliably detect and quantify target genes in short-read metagenomes.
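
    As a rough illustration of the position-specific thresholds described above, the sketch below groups simulated reads of known origin into windows along a reference protein and picks, for each window, the bitscore cutoff that maximizes true-positive rate minus false-positive rate (Youden's J). This is a simplified stand-in and not ROCker's implementation: the selection criterion, the function names, the window size, and the toy data are all assumptions made for illustration.

```python
from collections import defaultdict

def best_threshold(hits):
    """Return the bitscore cutoff maximizing TPR - FPR (Youden's J).

    hits: list of (bitscore, is_target) pairs for one window.
    A read is called positive when bitscore >= cutoff.
    """
    pos = sum(1 for _, t in hits if t)
    neg = len(hits) - pos
    if pos == 0 or neg == 0:
        return None  # no basis for choosing a cutoff in this window
    best_cut, best_j = None, -1.0
    for cut in sorted({s for s, _ in hits}):
        tp = sum(1 for s, t in hits if t and s >= cut)
        fp = sum(1 for s, t in hits if not t and s >= cut)
        j = tp / pos - fp / neg
        if j > best_j:
            best_cut, best_j = cut, j
    return best_cut

def window_thresholds(read_hits, window=20):
    """Group simulated read hits by window and pick one cutoff per window.

    read_hits: iterable of (midpoint_position, bitscore, is_target).
    """
    by_window = defaultdict(list)
    for pos, score, is_target in read_hits:
        by_window[pos // window].append((score, is_target))
    return {w: best_threshold(h) for w, h in sorted(by_window.items())}

# Hypothetical simulated reads: (alignment midpoint, bitscore, from target gene?)
simulated = [
    (5, 80.0, True), (8, 75.0, True), (12, 40.0, False), (15, 35.0, False),
    (25, 60.0, True), (28, 58.0, False), (30, 62.0, True), (35, 30.0, False),
]
thresholds = window_thresholds(simulated)
```

    In this toy data the two windows end up with different cutoffs, which is the point of a position-specific threshold: a single fixed value would be either too strict for one window or too lenient for the other.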

  10. ROCker: accurate detection and quantification of target genes in short-read metagenomic data sets by modeling sliding-window bitscores

    DOE PAGES

    Orellana, Luis H.; Rodriguez-R, Luis M.; Konstantinidis, Konstantinos T.

    2016-10-07

    Functional annotation of metagenomic and metatranscriptomic data sets relies on similarity searches based on e-value thresholds resulting in an unknown number of false positive and negative matches. To overcome these limitations, we introduce ROCker, aimed at identifying position-specific, most-discriminant thresholds in sliding windows along the sequence of a target protein, accounting for non-discriminative domains shared by unrelated proteins. ROCker employs the receiver operating characteristic (ROC) curve to minimize false discovery rate (FDR) and calculate the best thresholds based on how simulated shotgun metagenomic reads of known composition map onto well-curated reference protein sequences and thus, differs from HMM profiles and related methods. We showcase ROCker using ammonia monooxygenase (amoA) and nitrous oxide reductase (nosZ) genes, mediating oxidation of ammonia and the reduction of the potent greenhouse gas, N2O, to inert N2, respectively. ROCker typically showed 60-fold lower FDR when compared to the common practice of using fixed e-values. Previously uncounted ‘atypical’ nosZ genes were found to be two times more abundant, on average, than their typical counterparts in most soil metagenomes and the abundance of bacterial amoA was quantified against the highly-related particulate methane monooxygenase (pmoA). Therefore, ROCker can reliably detect and quantify target genes in short-read metagenomes.

  11. ROCker: accurate detection and quantification of target genes in short-read metagenomic data sets by modeling sliding-window bitscores

    SciTech Connect

    Orellana, Luis H.; Rodriguez-R, Luis M.; Konstantinidis, Konstantinos T.

    2016-10-07

    Functional annotation of metagenomic and metatranscriptomic data sets relies on similarity searches based on e-value thresholds resulting in an unknown number of false positive and negative matches. To overcome these limitations, we introduce ROCker, aimed at identifying position-specific, most-discriminant thresholds in sliding windows along the sequence of a target protein, accounting for non-discriminative domains shared by unrelated proteins. ROCker employs the receiver operating characteristic (ROC) curve to minimize false discovery rate (FDR) and calculate the best thresholds based on how simulated shotgun metagenomic reads of known composition map onto well-curated reference protein sequences and thus, differs from HMM profiles and related methods. We showcase ROCker using ammonia monooxygenase (amoA) and nitrous oxide reductase (nosZ) genes, mediating oxidation of ammonia and the reduction of the potent greenhouse gas, N2O, to inert N2, respectively. ROCker typically showed 60-fold lower FDR when compared to the common practice of using fixed e-values. Previously uncounted ‘atypical’ nosZ genes were found to be two times more abundant, on average, than their typical counterparts in most soil metagenomes and the abundance of bacterial amoA was quantified against the highly-related particulate methane monooxygenase (pmoA). Therefore, ROCker can reliably detect and quantify target genes in short-read metagenomes.

  12. ROCker: accurate detection and quantification of target genes in short-read metagenomic data sets by modeling sliding-window bitscores

    PubMed Central

    2017-01-01

    Functional annotation of metagenomic and metatranscriptomic data sets relies on similarity searches based on e-value thresholds resulting in an unknown number of false positive and negative matches. To overcome these limitations, we introduce ROCker, aimed at identifying position-specific, most-discriminant thresholds in sliding windows along the sequence of a target protein, accounting for non-discriminative domains shared by unrelated proteins. ROCker employs the receiver operating characteristic (ROC) curve to minimize false discovery rate (FDR) and calculate the best thresholds based on how simulated shotgun metagenomic reads of known composition map onto well-curated reference protein sequences and thus, differs from HMM profiles and related methods. We showcase ROCker using ammonia monooxygenase (amoA) and nitrous oxide reductase (nosZ) genes, mediating oxidation of ammonia and the reduction of the potent greenhouse gas, N2O, to inert N2, respectively. ROCker typically showed 60-fold lower FDR when compared to the common practice of using fixed e-values. Previously uncounted ‘atypical’ nosZ genes were found to be two times more abundant, on average, than their typical counterparts in most soil metagenomes and the abundance of bacterial amoA was quantified against the highly-related particulate methane monooxygenase (pmoA). Therefore, ROCker can reliably detect and quantify target genes in short-read metagenomes. PMID:28180325

  13. Multiregion ultra-deep sequencing reveals early intermixing and variable levels of intratumoral heterogeneity in colorectal cancer.

    PubMed

    Suzuki, Yuka; Ng, Sarah Boonhsi; Chua, Clarinda; Leow, Wei Qiang; Chng, Jermain; Liu, Shi Yang; Ramnarayanan, Kalpana; Gan, Anna; Ho, Dan Liang; Ten, Rachel; Su, Yan; Lezhava, Alexandar; Lai, Jiunn Herng; Koh, Dennis; Lim, Kiat Hon; Tan, Patrick; Rozen, Steven G; Tan, Iain Beehuat

    2017-02-01

    Intratumor heterogeneity (ITH) contributes to cancer progression and chemoresistance. We sought to comprehensively describe ITH of somatic mutations, copy number, and transcriptomic alterations involving clinically and biologically relevant gene pathways in colorectal cancer (CRC). We performed multiregion, high-depth (384× on average) sequencing of 799 cancer-associated genes in 24 spatially separated primary tumor and nonmalignant tissues from four treatment-naïve CRC patients. We then used ultra-deep sequencing (17 075× on average) to accurately verify the presence or absence of identified somatic mutations in each sector. We also digitally measured gene expression and copy number alterations using NanoString assays. We identified the subclonal point mutations and determined the mutational timing and phylogenetic relationships among spatially separated sectors of each tumor. Truncal mutations, those shared by all sectors in the tumor, affected the well-described driver genes such as APC, TP53, and KRAS. With sequencing at 17 075×, we found that mutations first detected at a sequencing depth of 384× were in fact more widely shared among sectors than originally assessed. Interestingly, ultra-deep sequencing also revealed some mutations that were present in all spatially dispersed sectors, but at subclonal levels. Ultra-high-depth validation sequencing, copy number analysis, and gene expression profiling provided a comprehensive and accurate genomic landscape of spatial heterogeneity in CRC. Ultra-deep sequencing allowed more sensitive detection of somatic mutations and a more accurate assessment of ITH. By detecting the subclonal mutations with ultra-deep sequencing, we traced the genomic histories of each tumor and the relative timing of mutational events. We found evidence of early mixing, in which the subclonal ancestral mutations intermixed across the sectors before the acquisition of subsequent nontruncal mutations. Our findings also indicate that
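
    To make the depth comparison above concrete, here is a back-of-the-envelope sketch. It assumes each read samples the variant allele independently with probability equal to the variant allele fraction (a binomial model that ignores sequencing error) and computes the chance of seeing at least a minimum number of supporting reads at the two average depths quoted in the abstract. The 0.5% allele fraction and the 5-read minimum are illustrative assumptions, not values from the study.

```python
from math import comb

def detection_probability(depth, vaf, min_reads):
    """P(at least min_reads variant-supporting reads) under Binomial(depth, vaf)."""
    p_fewer = sum(
        comb(depth, k) * vaf**k * (1 - vaf) ** (depth - k)
        for k in range(min_reads)
    )
    return 1.0 - p_fewer

# Average depths quoted in the abstract; vaf and min_reads are assumed values.
p_384 = detection_probability(384, 0.005, 5)
p_17075 = detection_probability(17075, 0.005, 5)
```

    Under these assumptions a 0.5% subclonal variant is detected less than one time in ten at 384×, but essentially always at 17 075×, which is consistent with mutations appearing more widely shared among sectors once depth is increased.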

  14. Draft Genome Sequence of Hydrogenibacillus schlegelii MA48, a Deep-Branching Member of the Bacilli Class of Firmicutes

    PubMed Central

    Maker, Allison; Pace, Laura A.; Ward, Lewis M.; Fischer, Woodward W.

    2017-01-01

    We report here the draft genome sequence of Hydrogenibacillus schlegelii MA48, a thermophilic facultative anaerobe that can oxidize hydrogen aerobically. H. schlegelii MA48 belongs to a deep-branching clade of the Bacilli class and provides important insight into the acquisition of aerobic respiration within the Firmicutes phylum. PMID:28104644

  15. Small RNA deep sequencing revealed that mixed infection of known and unknown viruses were common in field collected vegetable samples

    USDA-ARS's Scientific Manuscript database

    In an effort to characterize the causal agents for plant diseases in field collected samples using the small RNA deep sequencing technology, numerous known or novel viruses and viroids were identified. In many cases, a mixed infection with multiple pathogen species was common. Such situation compl...

  16. Draft Genome Sequence of Alcanivorax sp. Strain KX64203 Isolated from Deep-Sea Sediments of Iheya North, Okinawa Trough

    PubMed Central

    Liu, Rui; Wang, Mengqiang; Wang, Hao; Gao, Qiang; Hou, Zhanhui; Gao, Dahai

    2016-01-01

    This report describes the draft genome sequence of Alcanivorax sp. strain KX64203, isolated from deep-sea sediment samples. The reads generated by an Ion Torrent PGM were assembled into contigs, with a total size of 4.76 Mb. The data will improve our understanding of the strain’s function in alkane degradation. PMID:27563046

  17. High-Resolution Hepatitis C Virus Subtyping Using NS5B Deep Sequencing and Phylogeny, an Alternative to Current Methods

    PubMed Central

    Gregori, Josep; Rodríguez-Frias, Francisco; Buti, Maria; Madejon, Antonio; Perez-del-Pulgar, Sofia; Garcia-Cehic, Damir; Casillas, Rosario; Blasi, Maria; Homs, Maria; Tabernero, David; Alvarez-Tejado, Miguel; Muñoz, Jose Manuel; Cubero, Maria; Caballero, Andrea; delCampo, Jose Antonio; Domingo, Esteban; Belmonte, Irene; Nieto, Leonardo; Lens, Sabela; Muñoz-de-Rueda, Paloma; Sanz-Cameno, Paloma; Sauleda, Silvia; Bes, Marta; Gomez, Jordi; Briones, Carlos; Perales, Celia; Sheldon, Julie; Castells, Lluis; Viladomiu, Lluis; Salmeron, Javier; Ruiz-Extremera, Angela; Quiles-Pérez, Rosa; Moreno-Otero, Ricardo; López-Rodríguez, Rosario; Allende, Helena; Romero-Gómez, Manuel; Guardia, Jaume; Esteban, Rafael; Garcia-Samaniego, Javier; Forns, Xavier

    2014-01-01

    Hepatitis C virus (HCV) is classified into seven major genotypes and 67 subtypes. Recent studies have shown that in HCV genotype 1-infected patients, response rates to regimens containing direct-acting antivirals (DAAs) are subtype dependent. Currently available genotyping methods have limited subtyping accuracy. We have evaluated the performance of a deep-sequencing-based HCV subtyping assay, developed for the 454/GS-Junior platform, in comparison with those of two commercial assays (Versant HCV genotype 2.0 and Abbott Real-time HCV Genotype II) and using direct NS5B sequencing as a gold standard (direct sequencing), in 114 clinical specimens previously tested by first-generation hybridization assay (82 genotype 1 and 32 with uninterpretable results). Phylogenetic analysis of deep-sequencing reads matched subtype 1 calling by population Sanger sequencing (69% 1b, 31% 1a) in 81 specimens and identified a mixed-subtype infection (1b/3a/1a) in one sample. Similarly, among the 32 previously indeterminate specimens, identical genotype and subtype results were obtained by direct and deep sequencing in all but four samples with dual infection. In contrast, both Versant HCV Genotype 2.0 and Abbott Real-time HCV Genotype II failed subtype 1 calling in 13 (16%) samples each and were unable to identify the HCV genotype and/or subtype in more than half of the non-genotype 1 samples. We concluded that deep sequencing is more efficient for HCV subtyping than currently available methods and allows qualitative identification of mixed infections and may be more helpful with respect to informing treatment strategies with new DAA-containing regimens across all HCV subtypes. PMID:25378574

  18. Hybridization Capture-Based Next-Generation Sequencing to Evaluate Coding Sequence and Deep Intronic Mutations in the NF1 Gene

    PubMed Central

    Cunha, Karin Soares; Oliveira, Nathalia Silva; Fausto, Anna Karoline; de Souza, Carolina Cruz; Gros, Audrey; Bandres, Thomas; Idrissi, Yamina; Merlio, Jean-Philippe; de Moura Neto, Rodrigo Soares; Silva, Rosane; Geller, Mauro; Cappellen, David

    2016-01-01

    Neurofibromatosis 1 (NF1) is one of the most common genetic disorders and is caused by mutations in the NF1 gene. NF1 gene mutational analysis presents a considerable challenge because of its large size, existence of highly homologous pseudogenes located throughout the human genome, absence of mutational hotspots, and diversity of mutation types, including deep intronic splicing mutations. We aimed to evaluate the use of hybridization capture-based next-generation sequencing to screen coding and noncoding NF1 regions. Hybridization capture-based next-generation sequencing, with genomic DNA as starting material, was used to sequence the whole NF1 gene (exons and introns) from 11 unrelated individuals and 1 relative, who all had NF1. All of them met the NF1 clinical diagnostic criteria. We showed a mutation detection rate of 91% (10 out of 11). We identified eight recurrent and two novel mutations, which were all confirmed by Sanger methodology. In the Sanger sequencing confirmation, we also included another three relatives with NF1. Splicing alterations accounted for 50% of the mutations. One of them was caused by a deep intronic mutation (c.1260+1604A>G). Frameshift truncation and missense mutations corresponded to 30% and 20% of the pathogenic variants, respectively. In conclusion, we show the use of a simple and fast approach to screen, at once, the entire NF1 gene (exons and introns) for different types of pathogenic variations, including the deep intronic splicing mutations. PMID:27999334

  19. Hybridization Capture-Based Next-Generation Sequencing to Evaluate Coding Sequence and Deep Intronic Mutations in the NF1 Gene.

    PubMed

    Cunha, Karin Soares; Oliveira, Nathalia Silva; Fausto, Anna Karoline; de Souza, Carolina Cruz; Gros, Audrey; Bandres, Thomas; Idrissi, Yamina; Merlio, Jean-Philippe; de Moura Neto, Rodrigo Soares; Silva, Rosane; Geller, Mauro; Cappellen, David

    2016-12-17

    Neurofibromatosis 1 (NF1) is one of the most common genetic disorders and is caused by mutations in the NF1 gene. NF1 gene mutational analysis presents a considerable challenge because of its large size, existence of highly homologous pseudogenes located throughout the human genome, absence of mutational hotspots, and diversity of mutation types, including deep intronic splicing mutations. We aimed to evaluate the use of hybridization capture-based next-generation sequencing to screen coding and noncoding NF1 regions. Hybridization capture-based next-generation sequencing, with genomic DNA as starting material, was used to sequence the whole NF1 gene (exons and introns) from 11 unrelated individuals and 1 relative, who all had NF1. All of them met the NF1 clinical diagnostic criteria. We showed a mutation detection rate of 91% (10 out of 11). We identified eight recurrent and two novel mutations, which were all confirmed by Sanger methodology. In the Sanger sequencing confirmation, we also included another three relatives with NF1. Splicing alterations accounted for 50% of the mutations. One of them was caused by a deep intronic mutation (c.1260+1604A>G). Frameshift truncation and missense mutations corresponded to 30% and 20% of the pathogenic variants, respectively. In conclusion, we show the use of a simple and fast approach to screen, at once, the entire NF1 gene (exons and introns) for different types of pathogenic variations, including the deep intronic splicing mutations.

  20. DeepSNVMiner: a sequence analysis tool to detect emergent, rare mutations in subsets of cell populations.

    PubMed

    Andrews, T Daniel; Jeelall, Yogesh; Talaulikar, Dipti; Goodnow, Christopher C; Field, Matthew A

    2016-01-01

    Background. Massively parallel sequencing technology is being used to sequence highly diverse populations of DNA such as that derived from heterogeneous cell mixtures containing both wild-type and disease-related states. At the core of such molecule tagging techniques is the tagging and identification of sequence reads derived from individual input DNA molecules, which must be first computationally disambiguated to generate read groups sharing common sequence tags, with each read group representing a single input DNA molecule. This disambiguation typically generates huge numbers of read groups, each of which requires additional variant detection analysis steps to be run specific to each read group, thus representing a significant computational challenge. While sequencing technologies for producing these data are approaching maturity, the lack of available computational tools for analysing such heterogeneous sequence data represents an obstacle to the widespread adoption of this technology. Results. Using synthetic data we successfully detect unique variants at dilution levels of 1 in 1,000,000 molecules, and find DeepSNVMiner obtains significantly lower false positive and false negative rates compared to popular variant callers GATK, SAMTools, FreeBayes and LoFreq, particularly as the variant concentration levels decrease. In a dilution series with genomic DNA from two cell lines, we find DeepSNVMiner identifies a known somatic variant when present at concentrations of only 1 in 1,000 molecules in the input material, the lowest concentration amongst all variant callers tested. Conclusions. Here we present DeepSNVMiner, a tool to disambiguate tagged sequence groups and robustly identify sequence variants specific to subsets of starting DNA molecules that may indicate the presence of a disease. DeepSNVMiner is an automated workflow of custom sequence analysis utilities and open source tools able to differentiate somatic DNA variants from artefactual sequence
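
    The read-grouping step described above can be pictured with a small sketch: reads are grouped by their molecular tag, and a per-position majority consensus is taken within each group, so that an error confined to a single read is outvoted while a variant present in every read of the group survives. This is a toy illustration of the general tagging idea, not DeepSNVMiner's code; the function names, the assumption of pre-aligned equal-length reads, and the min_group_size cutoff are all hypothetical.

```python
from collections import Counter, defaultdict

def group_by_tag(reads):
    """Group read sequences by molecular tag. reads: iterable of (tag, seq)."""
    groups = defaultdict(list)
    for tag, seq in reads:
        groups[tag].append(seq)
    return groups

def consensus(seqs):
    """Majority base at each position (reads assumed aligned, equal length)."""
    return "".join(Counter(col).most_common(1)[0][0] for col in zip(*seqs))

def tag_consensuses(reads, min_group_size=3):
    """Consensus per tag, skipping groups too small to outvote errors."""
    return {
        tag: consensus(seqs)
        for tag, seqs in group_by_tag(reads).items()
        if len(seqs) >= min_group_size
    }

# Hypothetical tagged reads: (tag, sequence)
reads = [
    ("AAC", "ACGT"), ("AAC", "ACGT"), ("AAC", "ACTT"),  # one read-level error
    ("GGT", "ATGT"), ("GGT", "ATGT"), ("GGT", "ATGT"),  # variant in all reads
    ("TTA", "ACGT"),                                    # too few reads to trust
]
result = tag_consensuses(reads)
```

    Here the isolated G-to-T error in the first group is removed by the vote, whereas the C-to-T difference shared by every read of the second group is kept as a candidate variant for that input molecule.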

  1. Deep sequencing of mycovirus-derived small RNAs from Botrytis species.

    PubMed

    Donaire, Livia; Ayllón, María A

    2016-08-31

    RNA silencing is an ancient regulatory mechanism operating in all eukaryotic cells. In fungi, it was first discovered in Neurospora crassa, although its potential as a defence mechanism against mycoviruses was first reported in Cryphonectria parasitica and, later, in several fungal species. There is little evidence of the antiviral potential of RNA silencing in the phytopathogenic species of the fungal genus Botrytis. Moreover, little is known about the RNA silencing components in these fungi, although the analysis of public genome databases identified two Dicer-like genes in B. cinerea, as in most of the ascomycetes sequenced to date. In this work, we used deep sequencing to study the virus-derived small RNA (vsiRNA) populations from different mycoviruses infecting field isolates of Botrytis spp. The mycoviruses under study belong to different genera and species, and have different types of genome [double-stranded RNA (dsRNA), (+)single-stranded RNA (ssRNA) and (-)ssRNA]. In general, vsiRNAs derived from mycoviruses are mostly of 21, 20 and 22 nucleotides in length, possess sense or antisense orientation, either in a similar ratio or with a predominance of sense polarity depending on the virus species, have predominantly U at their 5' end, and are unevenly distributed along the viral genome, showing conspicuous hotspots of vsiRNA accumulation. These characteristics reveal striking similarities with vsiRNAs produced by plant viruses, suggesting similar pathways of viral targeting in plants and fungi. We have shown that the fungal RNA silencing machinery acts against the mycoviruses used in this work in a similar manner independent of their viral or fungal origin.
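
    The summary statistics named above (read length distribution, strand polarity, and 5'-terminal nucleotide) are simple tallies over the virus-mapped reads. A minimal sketch, assuming each read is available as a (sequence, strand) pair; the function name and the toy reads are hypothetical and are not data from the study.

```python
from collections import Counter

def profile_small_rnas(reads):
    """Tally length, strand, and 5' nucleotide for a list of (sequence, strand).

    strand is '+' (sense) or '-' (antisense); sequences are written 5'->3'
    as RNA, so the first character is the 5'-terminal nucleotide.
    """
    lengths = Counter(len(seq) for seq, _ in reads)
    strands = Counter(strand for _, strand in reads)
    five_prime = Counter(seq[0] for seq, _ in reads)
    return lengths, strands, five_prime

# Made-up example reads
reads = [
    ("UGAGGUAGUAGGUUGUAUAGU", "+"),   # 21 nt
    ("UCCGAUUGCAAGGCUUACGAA", "-"),   # 21 nt
    ("UAGCUUAUCAGACUGAUGUUGA", "+"),  # 22 nt
    ("AGCUACAUUGUCUGCUGGGU", "+"),    # 20 nt
]
lengths, strands, five_prime = profile_small_rnas(reads)
```

    Comparing such tallies across viruses is what allows statements like "mostly 21 nt" or "predominantly U at the 5' end" to be made.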

  2. Identification of miRNAs involved in fruit ripening in Cavendish bananas by deep sequencing.

    PubMed

    Bi, Fangcheng; Meng, Xiangchun; Ma, Chao; Yi, Ganjun

    2015-10-13

    MicroRNAs (miRNAs) are a family of non-coding small RNAs that play an important regulatory role in various biological processes. Previous studies have reported that miRNAs are closely related to the ripening process in model plants. However, the miRNAs that are closely involved in the banana fruit ripening process remain unknown. Here, we investigated the miRNA populations from banana fruits in response to ethylene or 1-MCP treatment using a deep sequencing approach and bioinformatics analysis combined with quantitative RT-PCR validation. A total of 125 known miRNAs and 26 novel miRNAs were identified from three libraries. MiRNA profiling of bananas in response to ethylene treatment compared with 1-MCP treatment showed differential expression of 82 miRNAs. Furthermore, the differentially expressed miRNAs were predicted to target a total of 815 target genes. Interestingly, some targets were annotated as transcription factors and other functional proteins closely involved in the development and the ripening process in other plant species. Analysis by qRT-PCR validated the contrasting expression patterns between several miRNAs and their target genes. The miRNAome of the banana fruit in response to ethylene or 1-MCP treatment was identified by high-throughput sequencing. A total of 82 differentially expressed miRNAs were found to be closely associated with the ripening process. The miRNA target genes encode transcription factors and other functional proteins, including SPL, APETALA2, EIN3, E3 ubiquitin ligase, β-galactosidase, and β-glucosidase. These findings provide valuable information for further functional research of the miRNAs involved in banana fruit ripening.

  3. Deep sequencing-based analysis of the anaerobic stimulon in Neisseria gonorrhoeae

    PubMed Central

    2011-01-01

    Background: Maintenance of an anaerobic denitrification system in the obligate human pathogen, Neisseria gonorrhoeae, suggests that an anaerobic lifestyle may be important during the course of infection. Furthermore, mounting evidence suggests that reduction of host-produced nitric oxide has several immunomodulatory effects on the host. However, at this point there have been no studies analyzing the complete gonococcal transcriptome response to anaerobiosis. Here we performed deep sequencing to compare the gonococcal transcriptomes of aerobically and anaerobically grown cells. Using the information derived from this sequencing, we discuss the implications of the robust transcriptional response to anaerobic growth. Results: We determined that 198 chromosomal genes were differentially expressed (~10% of the genome) in response to anaerobic conditions. We also observed a large induction of genes encoded within the cryptic plasmid, pJD1. Validation of RNA-seq data using translational-lacZ fusions or RT-PCR demonstrated the RNA-seq results to be very reproducible. Surprisingly, many genes of prophage origin were induced anaerobically, as well as several transcriptional regulators previously unknown to be involved in anaerobic growth. We also confirmed expression and regulation of a small RNA, likely a functional equivalent of fnrS in the Enterobacteriaceae family. We also determined that many genes found to be responsive to anaerobiosis have also been shown to be responsive to iron and/or oxidative stress. Conclusions: Gonococci will be subject to many forms of environmental stress, including oxygen-limitation, during the course of infection. Here we determined that the anaerobic stimulon in gonococci was larger than previous studies would suggest. Many new targets for future research have been uncovered, and the results derived from this study may have helped to elucidate factors or mechanisms of virulence that may have otherwise been overlooked. PMID:21251255

  4. Deep sequencing-based analysis of the anaerobic stimulon in Neisseria gonorrhoeae.

    PubMed

    Isabella, Vincent M; Clark, Virginia L

    2011-01-20

    Maintenance of an anaerobic denitrification system in the obligate human pathogen, Neisseria gonorrhoeae, suggests that an anaerobic lifestyle may be important during the course of infection. Furthermore, mounting evidence suggests that reduction of host-produced nitric oxide has several immunomodulatory effects on the host. However, at this point there have been no studies analyzing the complete gonococcal transcriptome response to anaerobiosis. Here we performed deep sequencing to compare the gonococcal transcriptomes of aerobically and anaerobically grown cells. Using the information derived from this sequencing, we discuss the implications of the robust transcriptional response to anaerobic growth. We determined that 198 chromosomal genes were differentially expressed (~10% of the genome) in response to anaerobic conditions. We also observed a large induction of genes encoded within the cryptic plasmid, pJD1. Validation of RNA-seq data using translational-lacZ fusions or RT-PCR demonstrated the RNA-seq results to be very reproducible. Surprisingly, many genes of prophage origin were induced anaerobically, as well as several transcriptional regulators previously unknown to be involved in anaerobic growth. We also confirmed expression and regulation of a small RNA, likely a functional equivalent of fnrS in the Enterobacteriaceae family. We also determined that many genes found to be responsive to anaerobiosis have also been shown to be responsive to iron and/or oxidative stress. Gonococci will be subject to many forms of environmental stress, including oxygen-limitation, during the course of infection. Here we determined that the anaerobic stimulon in gonococci was larger than previous studies would suggest. Many new targets for future research have been uncovered, and the results derived from this study may have helped to elucidate factors or mechanisms of virulence that may have otherwise been overlooked.

  5. mRNA deep sequencing reveals 75 new genes and a complex transcriptional landscape in Mimivirus

    PubMed Central

    Legendre, Matthieu; Audic, Stéphane; Poirot, Olivier; Hingamp, Pascal; Seltzer, Virginie; Byrne, Deborah; Lartigue, Audrey; Lescot, Magali; Bernadac, Alain; Poulain, Julie; Abergel, Chantal; Claverie, Jean-Michel

    2010-01-01

    Mimivirus, a virus infecting Acanthamoeba, is the prototype of the Mimiviridae, the latest addition to the nucleocytoplasmic large DNA viruses. The Mimivirus genome encodes close to 1000 proteins, many of them never before encountered in a virus, such as four amino-acyl tRNA synthetases. To explore the physiology of this exceptional virus and identify the genes involved in the building of its characteristic intracytoplasmic “virion factory,” we coupled electron microscopy observations with the massively parallel pyrosequencing of the polyadenylated RNA fractions of Acanthamoeba castellanii cells at various times post-infection. We generated 633,346 reads, of which 322,904 correspond to Mimivirus transcripts. This first application of deep mRNA sequencing (454 Life Sciences [Roche] FLX) to a large DNA virus allowed the precise delineation of the 5′ and 3′ extremities of Mimivirus mRNAs and revealed 75 new transcripts including several noncoding RNAs. Mimivirus genes are expressed across a wide dynamic range, in a finely regulated manner broadly described by three main temporal classes: early, intermediate, and late. This RNA-seq study confirmed the AAAATTGA sequence as an early promoter element, as well as the presence of palindromes at most of the polyadenylation sites. It also revealed a new promoter element correlating with late gene expression, which is also prominent in Sputnik, the recently described Mimivirus “virophage.” These results, validated genome-wide by the hybridization of total RNA extracted from infected Acanthamoeba cells on a tiling array (Agilent), will constitute the foundation on which to build subsequent functional studies of the Mimivirus/Acanthamoeba system. PMID:20360389

  6. mRNA deep sequencing reveals 75 new genes and a complex transcriptional landscape in Mimivirus.

    PubMed

    Legendre, Matthieu; Audic, Stéphane; Poirot, Olivier; Hingamp, Pascal; Seltzer, Virginie; Byrne, Deborah; Lartigue, Audrey; Lescot, Magali; Bernadac, Alain; Poulain, Julie; Abergel, Chantal; Claverie, Jean-Michel

    2010-05-01

    Mimivirus, a virus infecting Acanthamoeba, is the prototype of the Mimiviridae, the latest addition to the nucleocytoplasmic large DNA viruses. The Mimivirus genome encodes close to 1000 proteins, many of them never before encountered in a virus, such as four amino-acyl tRNA synthetases. To explore the physiology of this exceptional virus and identify the genes involved in the building of its characteristic intracytoplasmic "virion factory," we coupled electron microscopy observations with the massively parallel pyrosequencing of the polyadenylated RNA fractions of Acanthamoeba castellanii cells at various times post-infection. We generated 633,346 reads, of which 322,904 correspond to Mimivirus transcripts. This first application of deep mRNA sequencing (454 Life Sciences [Roche] FLX) to a large DNA virus allowed the precise delineation of the 5' and 3' extremities of Mimivirus mRNAs and revealed 75 new transcripts including several noncoding RNAs. Mimivirus genes are expressed across a wide dynamic range, in a finely regulated manner broadly described by three main temporal classes: early, intermediate, and late. This RNA-seq study confirmed the AAAATTGA sequence as an early promoter element, as well as the presence of palindromes at most of the polyadenylation sites. It also revealed a new promoter element correlating with late gene expression, which is also prominent in Sputnik, the recently described Mimivirus "virophage." These results, validated genome-wide by the hybridization of total RNA extracted from infected Acanthamoeba cells on a tiling array (Agilent), will constitute the foundation on which to build subsequent functional studies of the Mimivirus/Acanthamoeba system.

  7. Hybrid DNA virus in Chinese patients with seronegative hepatitis discovered by deep sequencing.

    PubMed

    Xu, Baoyan; Zhi, Ning; Hu, Gangqing; Wan, Zhihong; Zheng, Xiaobin; Liu, Xiaohong; Wong, Susan; Kajigaya, Sachiko; Zhao, Keji; Mao, Qing; Young, Neal S

    2013-06-18

    Seronegative hepatitis (non-A, non-B, non-C, non-D, non-E hepatitis) is poorly characterized but strongly associated with serious complications. We collected 92 sera specimens from patients with non-A-E hepatitis in Chongqing, China between 1999 and 2007. Ten sera pools were screened by Solexa deep sequencing. We discovered a 3,780-bp contig present in all 10 pools that yielded BLASTx E scores of 7e-05 to 0.008 against parvoviruses. The complete sequence of the in silico-assembled 3,780-bp contig was confirmed by gene amplification of overlapping regions over almost the entire genome, and the virus was provisionally designated NIH-CQV. Further analysis revealed that the contig was composed of two major ORFs. By protein BLAST, ORF1 and ORF2 were most homologous to the replication-associated protein of bat circovirus and the capsid protein of porcine parvovirus, respectively. Phylogenetic analysis indicated that NIH-CQV is located at the interface of Parvoviridae and Circoviridae. Prevalence of NIH-CQV in patients was determined by quantitative PCR. Sixty-three of 90 patient samples (70%) were positive, but all those from 45 healthy controls were negative. Average virus titer in the patient specimens was 1.05 × 10^4 copies/µL. Specific antibodies against NIH-CQV were sought by immunoblotting. Eighty-four percent of patients were positive for IgG, and 31% were positive for IgM; in contrast, 78% of healthy controls were positive for IgG, but all were negative for IgM. Although more work is needed to determine the etiologic role of NIH-CQV in human disease, our data indicate that a parvovirus-like virus is highly prevalent in a cohort of patients with non-A-E hepatitis.

  8. Hybrid DNA virus in Chinese patients with seronegative hepatitis discovered by deep sequencing

    PubMed Central

    Xu, Baoyan; Zhi, Ning; Hu, Gangqing; Wan, Zhihong; Zheng, Xiaobin; Liu, Xiaohong; Wong, Susan; Kajigaya, Sachiko; Zhao, Keji; Mao, Qing; Young, Neal S.

    2013-01-01

    Seronegative hepatitis—non-A, non-B, non-C, non-D, non-E hepatitis—is poorly characterized but strongly associated with serious complications. We collected 92 sera specimens from patients with non-A–E hepatitis in Chongqing, China between 1999 and 2007. Ten sera pools were screened by Solexa deep sequencing. We discovered a 3,780-bp contig present in all 10 pools that yielded BLASTx E scores of 7e-05–0.008 against parvoviruses. The complete sequence of the in silico-assembled 3,780-bp contig was confirmed by gene amplification of overlapping regions over almost the entire genome, and the virus was provisionally designated NIH-CQV. Further analysis revealed that the contig was composed of two major ORFs. By protein BLAST, ORF1 and ORF2 were most homologous to the replication-associated protein of bat circovirus and the capsid protein of porcine parvovirus, respectively. Phylogenetic analysis indicated that NIH-CQV is located at the interface of Parvoviridae and Circoviridae. Prevalence of NIH-CQV in patients was determined by quantitative PCR. Sixty-three of 90 patient samples (70%) were positive, but all those from 45 healthy controls were negative. Average virus titer in the patient specimens was 1.05 e4 copies/µL. Specific antibodies against NIH-CQV were sought by immunoblotting. Eighty-four percent of patients were positive for IgG, and 31% were positive for IgM; in contrast, 78% of healthy controls were positive for IgG, but all were negative for IgM. Although more work is needed to determine the etiologic role of NIH-CQV in human disease, our data indicate that a parvovirus-like virus is highly prevalent in a cohort of patients with non-A–E hepatitis. PMID:23716702

  9. Improved sequence learning with subthalamic nucleus deep brain stimulation: evidence for treatment-specific network modulation.

    PubMed

    Mure, Hideo; Tang, Chris C; Argyelan, Miklos; Ghilardi, Maria-Felice; Kaplitt, Michael G; Dhawan, Vijay; Eidelberg, David

    2012-02-22

    We used a network approach to study the effects of anti-parkinsonian treatment on motor sequence learning in humans. Eight Parkinson's disease (PD) patients with bilateral subthalamic nucleus (STN) deep brain stimulation underwent H(2)(15)O positron emission tomography (PET) imaging to measure regional cerebral blood flow (rCBF) while they performed kinematically matched sequence learning and movement tasks at baseline and during stimulation. Network analysis revealed a significant learning-related spatial covariance pattern characterized by consistent increases in subject expression during stimulation (p = 0.008, permutation test). The network was associated with increased activity in the lateral cerebellum, dorsal premotor cortex, and parahippocampal gyrus, with covarying reductions in the supplementary motor area (SMA) and orbitofrontal cortex. Stimulation-mediated increases in network activity correlated with concurrent improvement in learning performance (p < 0.02). To determine whether similar changes occurred during dopaminergic pharmacotherapy, we studied the subjects during an intravenous levodopa infusion titrated to achieve a motor response equivalent to stimulation. Despite consistent improvement in motor ratings during infusion, levodopa did not alter learning performance or network activity. Analysis of learning-related rCBF in network regions revealed improvement in baseline abnormalities with STN stimulation but not levodopa. These effects were most pronounced in the SMA. In this region, a consistent rCBF response to stimulation was observed across subjects and trials (p = 0.01), although the levodopa response was not significant. These findings link the cognitive treatment response in PD to changes in the activity of a specific cerebello-premotor cortical network. Selective modulation of overactive SMA-STN projection pathways may underlie the improvement in learning found with stimulation.

  10. Targeted deep sequencing improves outcome stratification in chronic myelomonocytic leukemia with low risk cytogenetic features

    PubMed Central

    Palomo, Laura; Garcia, Olga; Arnan, Montse; Xicoy, Blanca; Fuster, Francisco; Cabezón, Marta; Coll, Rosa; Ademà, Vera; Grau, Javier; Jiménez, Maria-José; Pomares, Helena; Marcé, Sílvia; Mallo, Mar; Millá, Fuensanta; Alonso, Esther; Sureda, Anna; Gallardo, David; Feliu, Evarist; Ribera, Josep-Maria; Solé, Francesc; Zamora, Lurdes

    2016-01-01

    Clonal cytogenetic abnormalities are found in 20-30% of patients with chronic myelomonocytic leukemia (CMML), while gene mutations are present in >90% of cases. Patients with low risk cytogenetic features account for 80% of CMML cases and often fall into the low risk categories of CMML prognostic scoring systems, but the outcome differs considerably among them. We performed targeted deep sequencing of 83 myeloid-related genes in 56 CMML patients with low risk cytogenetic features or uninformative conventional cytogenetics (CC) at diagnosis, with the aim of identifying the genetic characteristics of patients with a more aggressive disease. Targeted sequencing was also performed in a subset of these patients at time of acute myeloid leukemia (AML) transformation. Overall, 98% of patients harbored at least one mutation. Mutations in cell signaling genes were acquired at time of AML progression. Mutations in ASXL1, EZH2 and NRAS correlated with higher risk features and shorter overall survival (OS) and progression free survival (PFS). SRSF2 mutations were associated with poorer OS, while absence of TET2 mutations (TET2wt) was predictive of shorter PFS. A decrease in OS and PFS was observed as the number of adverse risk gene mutations (ASXL1, EZH2, NRAS and SRSF2) increased. On multivariate analyses, CMML-specific scoring system (CPSS) and presence of adverse risk gene mutations remained significant for OS, while CPSS and TET2wt were predictive of PFS. These results confirm that mutation analysis can add prognostic value to patients with CMML and low risk cytogenetic features or uninformative CC. PMID:27486981

  11. Sequence stratigraphy of Cenozoic deepwater deposits in the Perdido fold belt, Northwestern Deep Gulf of Mexico

    SciTech Connect

    Fiduk, J.C.; Weimer, P.; Trudgill, B.D.

    1996-12-31

    Analysis of 12,000 km of 2-D multifold seismic data shows three large Cenozoic wedges of deepwater deposits in the Perdido fold belt that differ in seismic facies, areal distribution, and potential reservoir geometries. Together, these three wedges reflect the changing positions of Cenozoic depocenters and record the evolution of the Perdido structural province. Lithologic interpretation is based upon seismic facies and analogous facies in other drilled areas in the Gulf of Mexico. (1) The Paleocene to middle Oligocene interval, which is strongly folded, reflects pre-growth deposition. Paleocene and Oligocene strata thicken westward and consist of medium to high amplitude, subparallel reflections of varying continuity. Broad channels and channel-levee systems are interpreted, suggesting turbidite deposition. These strata are interpreted as the down-dip equivalent of the Wilcox and Frio shallow-water depo-centers and are potentially sand-prone. Eocene strata are low amplitude, discontinuous, subparallel reflections interpreted to be shale-prone. (2) The upper Oligocene to upper Miocene interval consists of multiple well-developed sequences with variable amplitude, divergent reflections, many of which onlap against the fold crests. Sequences within this interval are often modified by erosion, faulting, and/or slumping against the folds. (3) The upper Miocene to Recent interval, which overlies most folds, consists of channel-levee, overbank, slump, and layered or amalgamated turbidite sheet deposits. These are similar to other coeval submarine fan sediments in the northern deep Gulf. Thus, the Cenozoic section in the Perdido fold belt is interpreted as mostly shale-prone, with some sand-prone intervals, based upon seismic facies, isopach thickening to the west, and similar producing facies elsewhere in the Gulf of Mexico.

  12. Sequence stratigraphy of Cenozoic deepwater deposits in the Perdido fold belt, Northwestern Deep Gulf of Mexico

    SciTech Connect

    Fiduk, J.C.; Weimer, P.; Trudgill, B.D.

    1996-01-01

    Analysis of 12,000 km of 2-D multifold seismic data shows three large Cenozoic wedges of deepwater deposits in the Perdido fold belt that differ in seismic facies, areal distribution, and potential reservoir geometries. Together, these three wedges reflect the changing positions of Cenozoic depocenters and record the evolution of the Perdido structural province. Lithologic interpretation is based upon seismic facies and analogous facies in other drilled areas in the Gulf of Mexico. (1) The Paleocene to middle Oligocene interval, which is strongly folded, reflects pre-growth deposition. Paleocene and Oligocene strata thicken westward and consist of medium to high amplitude, subparallel reflections of varying continuity. Broad channels and channel-levee systems are interpreted, suggesting turbidite deposition. These strata are interpreted as the down-dip equivalent of the Wilcox and Frio shallow-water depo-centers and are potentially sand-prone. Eocene strata are low amplitude, discontinuous, subparallel reflections interpreted to be shale-prone. (2) The upper Oligocene to upper Miocene interval consists of multiple well-developed sequences with variable amplitude, divergent reflections, many of which onlap against the fold crests. Sequences within this interval are often modified by erosion, faulting, and/or slumping against the folds. (3) The upper Miocene to Recent interval, which overlies most folds, consists of channel-levee, overbank, slump, and layered or amalgamated turbidite sheet deposits. These are similar to other coeval submarine fan sediments in the northern deep Gulf. Thus, the Cenozoic section in the Perdido fold belt is interpreted as mostly shale-prone, with some sand-prone intervals, based upon seismic facies, isopach thickening to the west, and similar producing facies elsewhere in the Gulf of Mexico.

  13. Improved Sequence Learning with Subthalamic Nucleus Deep Brain Stimulation: Evidence for Treatment-Specific Network Modulation

    PubMed Central

    Mure, Hideo; Tang, Chris C.; Argyelan, Miklos; Ghilardi, Maria-Felice; Kaplitt, Michael G.; Dhawan, Vijay; Eidelberg, David

    2015-01-01

    We used a network approach to study the effects of anti-parkinsonian treatment on motor sequence learning in humans. Eight Parkinson’s disease (PD) patients with bilateral subthalamic nucleus (STN) deep brain stimulation underwent H₂¹⁵O positron emission tomography (PET) imaging to measure regional cerebral blood flow (rCBF) while they performed kinematically matched sequence learning and movement tasks at baseline and during stimulation. Network analysis revealed a significant learning-related spatial covariance pattern characterized by consistent increases in subject expression during stimulation (p = 0.008, permutation test). The network was associated with increased activity in the lateral cerebellum, dorsal premotor cortex, and parahippocampal gyrus, with covarying reductions in the supplementary motor area (SMA) and orbitofrontal cortex. Stimulation-mediated increases in network activity correlated with concurrent improvement in learning performance (p < 0.02). To determine whether similar changes occurred during dopaminergic pharmacotherapy, we studied the subjects during an intravenous levodopa infusion titrated to achieve a motor response equivalent to stimulation. Despite consistent improvement in motor ratings during infusion, levodopa did not alter learning performance or network activity. Analysis of learning-related rCBF in network regions revealed improvement in baseline abnormalities with STN stimulation but not levodopa. These effects were most pronounced in the SMA. In this region, a consistent rCBF response to stimulation was observed across subjects and trials (p = 0.01), although the levodopa response was not significant. These findings link the cognitive treatment response in PD to changes in the activity of a specific cerebello-premotor cortical network. Selective modulation of overactive SMA–STN projection pathways may underlie the improvement in learning found with stimulation. PMID:22357863

  14. The ribosome profiling strategy for monitoring translation in vivo by deep sequencing of ribosome-protected mRNA fragments

    PubMed Central

    Ingolia, Nicholas T.; Brar, Gloria A.; Rouskin, Silvia; McGeachy, Anna M.; Weissman, Jonathan S.

    2012-01-01

    Recent studies highlight the importance of translational control in determining protein abundance, underscoring the value of measuring gene expression at the level of translation. We present a protocol for genome-wide, quantitative analysis of in vivo translation by deep sequencing. This ribosome profiling approach maps the exact positions of ribosomes on transcripts by nuclease footprinting. The nuclease-protected mRNA fragments are converted into a DNA library suitable for deep sequencing using a strategy that minimizes bias. The abundance of different footprint fragments in deep sequencing data reports on the amount of translation of a gene. Additionally, footprints reveal the exact regions of the transcriptome that are translated. To better define translated reading frames, we describe an adaptation that reveals the sites of translation initiation by pre-treating cells with harringtonine to immobilize initiating ribosomes. The protocol we describe requires 5–7 days to generate a completed ribosome profiling sequencing library. Sequencing and data analysis require a further 4–5 days. PMID:22836135

  15. Implementation of a custom hardware-accelerator for short-read mapping using Burrows-Wheeler alignment.

    PubMed

    Waidyasooriya, Hasitha Muthumala; Hariyama, Masanori; Kameyama, Michitaka

    2013-01-01

    The mapping of millions of short DNA fragments to a large genome is a great challenge in modern computational biology. Usually, it takes many hours or days to map a large genome using software. However, the recent progress of programmable hardware such as field programmable gate arrays (FPGAs) provides a cost effective solution to this challenge. FPGAs contain millions of programmable logic gates to design massively parallel accelerators. This paper proposes a hardware architecture to accelerate the short-read mapping using Burrows-Wheeler alignment. The speed-up of the proposed architecture is estimated to be at least 10 times compared to its equivalent software application.
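To illustrate the Burrows-Wheeler search that such an accelerator parallelizes, here is a minimal software sketch of exact-match backward search over a BWT index. It is not the hardware architecture proposed in the paper: the function names `bwt_index` and `find_exact` are hypothetical, the suffix array is built naively, and mismatches are not handled.

```python
from collections import Counter


def bwt_index(text):
    """Build a toy BWT index (suffix array, C table, occurrence table).

    Assumes `text` does not contain the sentinel character '$'.
    """
    text += "$"
    # Naive suffix array: fine for a toy example, far too slow for a genome.
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    # BWT character = the character preceding each sorted suffix.
    bwt = "".join(text[i - 1] for i in sa)
    # c_table[ch] = number of characters in text strictly smaller than ch.
    counts = Counter(text)
    c_table, total = {}, 0
    for ch in sorted(counts):
        c_table[ch] = total
        total += counts[ch]
    # occ[ch][i] = number of occurrences of ch in bwt[:i].
    occ = {ch: [0] * (len(bwt) + 1) for ch in counts}
    for i, b in enumerate(bwt):
        for ch in occ:
            occ[ch][i + 1] = occ[ch][i] + (1 if ch == b else 0)
    return sa, c_table, occ


def find_exact(read, index):
    """Return sorted start positions of exact occurrences of `read`."""
    sa, c_table, occ = index
    lo, hi = 0, len(sa)  # half-open interval of matching suffix-array rows
    for ch in reversed(read):  # backward search: last character first
        if ch not in c_table:
            return []
        lo = c_table[ch] + occ[ch][lo]
        hi = c_table[ch] + occ[ch][hi]
        if lo >= hi:
            return []
    return sorted(sa[lo:hi])


index = bwt_index("ACGTACGA")
print(find_exact("ACG", index))
```

Each read needs only a few table lookups per base and reads are independent of one another, which is the kind of workload that lends itself to massively parallel hardware.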

  16. Filtering of deep sequencing data reveals the existence of abundant Dicer-dependent small RNAs derived from tRNAs

    PubMed Central

    Cole, Christian; Sobala, Andrew; Lu, Cheng; Thatcher, Shawn R.; Bowman, Andrew; Brown, John W.S.; Green, Pamela J.; Barton, Geoffrey J.; Hutvagner, Gyorgy

    2009-01-01

    Deep sequencing technologies such as Illumina, SOLiD, and 454 platforms have become very powerful tools in discovering and quantifying small RNAs in diverse organisms. Sequencing small RNA fractions always identifies RNAs derived from abundant RNA species such as rRNAs, tRNAs, snRNA, and snoRNA, and they are widely considered to be random degradation products. We carried out bioinformatic analysis of deep sequenced HeLa RNA and after quality filtering, identified highly abundant small RNA fragments, derived from mature tRNAs that are likely produced by specific processing rather than from random degradation. Moreover, we showed that the processing of small RNAs derived from tRNAGln is dependent on Dicer in vivo and that Dicer cleaves the tRNA in vitro. PMID:19850906

  17. An effective differential expression analysis of deep-sequencing data based on the Poisson log-normal model.

    PubMed

    Wu, Jun; Zhao, Xiaodong; Lin, Zongli; Shao, Zhifeng

    2015-04-01

    The tremendous amount of deep-sequencing data now available, in the form of digital sequence reads, has greatly improved our understanding of biomedical science. To mine useful information from such data, a distribution suited to modeling the full range of the count data and accurate parameter estimation are required. In this paper, we propose a method, called "DEPln," for differential expression analysis based on the Poisson log-normal (PLN) distribution with an accurate parameter estimation strategy, which aims to overcome the inconvenience in the mathematical analysis of the traditional PLN distribution. The performance of our proposed method is validated by both synthetic and real data. Experimental results indicate that our method outperforms the traditional methods in terms of the discrimination ability and results in a good tradeoff between the recall rate and the precision. Thus, our work provides a new approach for gene expression analysis and has strong potential in deep-sequencing based research.
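The Poisson log-normal distribution has no closed-form probability mass function, which is the analytical inconvenience the abstract alludes to. The sketch below evaluates the PMF by plain numerical integration over the latent log-rate. It shows the distribution only and is not the DEPln estimation strategy; the function name `pln_pmf` and the integration settings are assumptions chosen for illustration.

```python
import math


def pln_pmf(k, mu, sigma, n_steps=2000, width=8.0):
    """P(Y = k) where Y ~ Poisson(exp(X)) and X ~ Normal(mu, sigma^2).

    Integrates over x with the trapezoid rule on mu +/- width * sigma.
    """
    a, b = mu - width * sigma, mu + width * sigma
    h = (b - a) / n_steps
    total = 0.0
    for i in range(n_steps + 1):
        x = a + i * h
        # log Poisson(k | rate = e^x), computed in log space for stability
        log_pois = k * x - math.exp(x) - math.lgamma(k + 1)
        # log Normal(x | mu, sigma)
        log_norm = -0.5 * ((x - mu) / sigma) ** 2 - math.log(
            sigma * math.sqrt(2.0 * math.pi)
        )
        weight = 0.5 if i in (0, n_steps) else 1.0  # trapezoid end points
        total += weight * math.exp(log_pois + log_norm)
    return total * h


mu, sigma = 1.0, 0.5
probs = [pln_pmf(k, mu, sigma) for k in range(200)]
mean = sum(k * p for k, p in enumerate(probs))
# The theoretical mean of a PLN variable is exp(mu + sigma^2 / 2).
print(sum(probs), mean, math.exp(mu + sigma ** 2 / 2))
```

Because every PMF evaluation requires an integral, likelihood-based fitting of PLN models is costly, which is why dedicated estimation strategies are of interest.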

  18. Deep sequencing reveals unique small RNA repertoire that is regulated during head regeneration in Hydra magnipapillata

    PubMed Central

    Krishna, Srikar; Nair, Aparna; Cheedipudi, Sirisha; Poduval, Deepak; Dhawan, Jyotsna; Palakodeti, Dasaradhi; Ghanekar, Yashoda

    2013-01-01

    Small non-coding RNAs such as miRNAs, piRNAs and endo-siRNAs fine-tune gene expression through post-transcriptional regulation, modulating important processes in development, differentiation, homeostasis and regeneration. Using deep sequencing, we have profiled small non-coding RNAs in Hydra magnipapillata and investigated changes in small RNA expression pattern during head regeneration. Our results reveal a unique repertoire of small RNAs in hydra. We have identified 126 miRNA loci; 123 of these miRNAs are unique to hydra. Less than 50% are conserved across two different strains of Hydra vulgaris tested in this study, indicating a highly diverse nature of hydra miRNAs in contrast to bilaterian miRNAs. We also identified siRNAs derived from precursors with perfect stem–loop structure and that arise from inverted repeats. piRNAs were the most abundant small RNAs in hydra, mapping to transposable elements, the annotated transcriptome and unique non-coding regions on the genome. piRNAs that map to transposable elements and the annotated transcriptome display a ping–pong signature. Further, we have identified several miRNAs and piRNAs whose expression is regulated during hydra head regeneration. Our study defines different classes of small RNAs in this cnidarian model system, which may play a role in orchestrating gene expression essential for hydra regeneration. PMID:23166307

  19. Deep sequencing reveals unique small RNA repertoire that is regulated during head regeneration in Hydra magnipapillata.

    PubMed

    Krishna, Srikar; Nair, Aparna; Cheedipudi, Sirisha; Poduval, Deepak; Dhawan, Jyotsna; Palakodeti, Dasaradhi; Ghanekar, Yashoda

    2013-01-07

    Small non-coding RNAs such as miRNAs, piRNAs and endo-siRNAs fine-tune gene expression through post-transcriptional regulation, modulating important processes in development, differentiation, homeostasis and regeneration. Using deep sequencing, we have profiled small non-coding RNAs in Hydra magnipapillata and investigated changes in small RNA expression pattern during head regeneration. Our results reveal a unique repertoire of small RNAs in hydra. We have identified 126 miRNA loci; 123 of these miRNAs are unique to hydra. Less than 50% are conserved across two different strains of Hydra vulgaris tested in this study, indicating a highly diverse nature of hydra miRNAs in contrast to bilaterian miRNAs. We also identified siRNAs derived from precursors with perfect stem-loop structure and that arise from inverted repeats. piRNAs were the most abundant small RNAs in hydra, mapping to transposable elements, the annotated transcriptome and unique non-coding regions on the genome. piRNAs that map to transposable elements and the annotated transcriptome display a ping-pong signature. Further, we have identified several miRNAs and piRNAs whose expression is regulated during hydra head regeneration. Our study defines different classes of small RNAs in this cnidarian model system, which may play a role in orchestrating gene expression essential for hydra regeneration.

  20. Metagenomes obtained by 'deep sequencing' - what do they tell about the enhanced biological phosphorus removal communities?

    PubMed

    Albertsen, Mads; Saunders, Aaron M; Nielsen, Kåre L; Nielsen, Per H

    2013-01-01

    Metagenomics enables studies of the genomic potential of complex microbial communities by sequencing bulk genomic DNA directly from the environment. Knowledge of the genetic potential of a community can be used to formulate and test ecological hypotheses about stability and performance. In this study deep metagenomics and fluorescence in situ hybridization (FISH) were used to study a full-scale wastewater treatment plant with enhanced biological phosphorus removal (EBPR), and the results were compared to an existing EBPR metagenome. EBPR is a widely used process that relies on a complex community of microorganisms to function properly. Insight into community and species level stability and dynamics is valuable for knowledge-driven optimization of the EBPR process. The metagenomes of the EBPR communities were distinct compared to metagenomes of communities from a wide range of other environments, which could be attributed to selection pressures of the EBPR process. The metabolic potential of one of the key microorganisms in the EBPR process, Accumulibacter, was investigated in more detail in the two plants, revealing a potential importance of phage predation on the dynamics of Accumulibacter populations. The results demonstrate that metagenomics can be used as a powerful tool for system wide characterization of the EBPR community as well as for a deeper understanding of the function of specific community members. Furthermore, we discuss and illustrate some of the general pitfalls in metagenomics and stress the need for additional DNA extraction independent information in metagenome studies.

  1. Analysis of Plasmodium falciparum diversity in natural infections by deep sequencing.

    PubMed

    Manske, Magnus; Miotto, Olivo; Campino, Susana; Auburn, Sarah; Almagro-Garcia, Jacob; Maslen, Gareth; O'Brien, Jack; Djimde, Abdoulaye; Doumbo, Ogobara; Zongo, Issaka; Ouedraogo, Jean-Bosco; Michon, Pascal; Mueller, Ivo; Siba, Peter; Nzila, Alexis; Borrmann, Steffen; Kiara, Steven M; Marsh, Kevin; Jiang, Hongying; Su, Xin-Zhuan; Amaratunga, Chanaki; Fairhurst, Rick; Socheat, Duong; Nosten, Francois; Imwong, Mallika; White, Nicholas J; Sanders, Mandy; Anastasi, Elisa; Alcock, Dan; Drury, Eleanor; Oyola, Samuel; Quail, Michael A; Turner, Daniel J; Ruano-Rubio, Valentin; Jyothi, Dushyanth; Amenga-Etego, Lucas; Hubbart, Christina; Jeffreys, Anna; Rowlands, Kate; Sutherland, Colin; Roper, Cally; Mangano, Valentina; Modiano, David; Tan, John C; Ferdig, Michael T; Amambua-Ngwa, Alfred; Conway, David J; Takala-Harrison, Shannon; Plowe, Christopher V; Rayner, Julian C; Rockett, Kirk A; Clark, Taane G; Newbold, Chris I; Berriman, Matthew; MacInnis, Bronwyn; Kwiatkowski, Dominic P

    2012-07-19

    Malaria elimination strategies require surveillance of the parasite population for genetic changes that demand a public health response, such as new forms of drug resistance. Here we describe methods for the large-scale analysis of genetic variation in Plasmodium falciparum by deep sequencing of parasite DNA obtained from the blood of patients with malaria, either directly or after short-term culture. Analysis of 86,158 exonic single nucleotide polymorphisms that passed genotyping quality control in 227 samples from Africa, Asia and Oceania provides genome-wide estimates of allele frequency distribution, population structure and linkage disequilibrium. By comparing the genetic diversity of individual infections with that of the local parasite population, we derive a metric of within-host diversity that is related to the level of inbreeding in the population. An open-access web application has been established for the exploration of regional differences in allele frequency and of highly differentiated loci in the P. falciparum genome.

  2. Multiple platform assessment of the EGF dependent transcriptome by microarray and deep tag sequencing analysis

    PubMed Central

    2011-01-01

    Background: Epidermal Growth Factor (EGF) is a key regulatory growth factor activating many processes relevant to normal development and disease, affecting cell proliferation and survival. Here we use a combined approach to study the EGF dependent transcriptome of HeLa cells by using multiple long oligonucleotide based microarray platforms (from Agilent, Operon, and Illumina) in combination with digital gene expression profiling (DGE) with the Illumina Genome Analyzer. Results: By applying a procedure for cross-platform data meta-analysis based on RankProd and GlobalAncova tests, we establish a well validated gene set with transcript levels altered after EGF treatment. We use this robust gene list to build higher order networks of gene interaction by interconnecting associated networks, supporting and extending the important role of the EGF signaling pathway in cancer. In addition, we find an entirely new set of genes previously unrelated to the currently accepted EGF associated cellular functions. Conclusions: We propose that global genomic cross-validation derived from high content technologies (microarrays or deep sequencing) can be used to generate more reliable datasets. This approach should help to improve the confidence of downstream in silico functional inference analyses based on high content data. PMID:21699700

  3. Deep sequencing uncovers rice long siRNAs and its involvement in immunity against Rhizoctonia solani.

    PubMed

    Niu, Dongdong; Zhang, Xin; Song, Xiaoou; Wang, Zhihui; Li, Yanqiang; Qiao, Lulu; Wang, Zhaoyun; Liu, Junzhong; Deng, Yiwen; He, Zuhua; Yang, Donglei; Liu, Renyi; Wang, Yangli; Zhao, Hongwei

    2017-09-06

    Small RNA (sRNA) is a class of non-coding RNA that can silence the expression of target genes. In rice, the majority of characterized sRNAs are within the range of 21 to 24 nucleotides long, and their biogenesis and function are associated with a specific set of components, such as Dicer-like (OsDCLs) and Argonaute proteins (OsAGOs). Rice sRNAs longer than 24 nt are occasionally reported, with biogenesis and functional mechanism uninvestigated, especially in the context of defense responses against pathogen infection. By using deep sequencing, we identified a group of rice long small interfering RNAs (lsiRNAs) that are within the range of 25-40 nt in length. Our results show that some rice lsiRNAs are differentially expressed upon infection of Rhizoctonia solani, the causal agent of the rice sheath blight disease. Bioinformatic analysis and experimental validation indicate that some rice lsiRNAs can target defense-related genes. We further demonstrate that rice lsiRNAs are neither derived from RNA degradation nor originated as secondary small interfering RNAs (siRNAs). Moreover, lsiRNAs require OsDCL4 for biogenesis and OsAGO18 for function. Therefore, our study indicates that rice lsiRNAs are a unique class of endogenous sRNAs produced in rice, which may participate in response against pathogens.

  4. Age-related decrease in TCR repertoire diversity measured with deep and normalized sequence profiling.

    PubMed

    Britanova, Olga V; Putintseva, Ekaterina V; Shugay, Mikhail; Merzlyak, Ekaterina M; Turchaninova, Maria A; Staroverov, Dmitriy B; Bolotin, Dmitriy A; Lukyanov, Sergey; Bogdanova, Ekaterina A; Mamedov, Ilgar Z; Lebedev, Yuriy B; Chudakov, Dmitriy M

    2014-03-15

    The decrease of TCR diversity with aging has never been studied by direct methods. In this study, we combined high-throughput Illumina sequencing with unique cDNA molecular identifier technology to achieve deep and precisely normalized profiling of TCR β repertoires in 39 healthy donors aged 6-90 y. We demonstrate that TCR β diversity per 10^6 T cells decreases roughly linearly with age, with significant reduction already apparent by age 40. The percentage of naive T cells showed a strong correlation with measured TCR diversity and decreased linearly up to age 70. Remarkably, the oldest group (average age 82 y) was characterized by a higher percentage of naive CD4+ T cells, lower abundance of expanded clones, and increased TCR diversity compared with the previous age group (average age 62 y), suggesting the influence of age selection and association of these three related parameters with longevity. Interestingly, cross-analysis of individual TCR β repertoires revealed a set of >10,000 of the most representative public TCR β clonotypes, whose abundance among the top 100,000 clones correlated with TCR diversity and decreased with aging.

  5. Characterization of the stress associated microRNAs in Glycine max by deep sequencing

    PubMed Central

    2011-01-01

    Background Plants involved in highly complex and well-coordinated systems have evolved a considerable degree of developmental plasticity, thus minimizing the damage caused by stress. MicroRNAs (miRNAs) have recently emerged as key regulators in gene regulation, developmental processes and stress tolerance in plants. Results In this study, soybean miRNAs associated with stress responses (drought, salinity, and alkalinity) have been identified and analyzed in combination with deep sequencing technology and in-depth bioinformatics analysis. One hundred and thirty three conserved miRNAs representing 95 miRNA families were expressed in soybeans under three treatments. In addition, 71, 50, and 45 miRNAs are either uniquely or differently expressed under drought, salinity, and alkalinity, respectively, suggesting that many miRNAs are inducible and are differentially expressed in response to certain stress. Conclusion Our study has important implications for further identification of gene regulation under abiotic stresses and significantly contributes a complete profile of miRNAs in Glycine max. PMID:22112171

  6. Deep sequencing-based identification of small regulatory RNAs in Synechocystis sp. PCC 6803.

    PubMed

    Xu, Wen; Chen, Hui; He, Chen-Liu; Wang, Qiang

    2014-01-01

    Synechocystis sp. PCC 6803 is a genetically tractable model organism for photosynthesis research. The genome of Synechocystis sp. PCC 6803 consists of a circular chromosome and seven plasmids. The importance of small regulatory RNAs (sRNAs) as mediators of a number of cellular processes in bacteria has begun to be recognized. However, little is known regarding sRNAs in Synechocystis sp. PCC 6803. To provide a comprehensive overview of sRNAs in this model organism, the sRNAs of Synechocystis sp. PCC 6803 were analyzed using deep sequencing, and 7,951,189 reads were obtained. High quality mapping reads (6,127,890) were mapped onto the genome and assembled into 16,192 transcribed regions (clusters) based on read overlap. A total number of 5211 putative sRNAs were revealed from the genome and the 4 megaplasmids, and 27 of these molecules, including four from plasmids, were confirmed by RT-PCR. In addition, possible target genes regulated by all of the putative sRNAs identified in this study were predicted by IntaRNA and analyzed for functional categorization and biological pathways, which provided evidence that sRNAs are indeed involved in many different metabolic pathways, including basic metabolic pathways, such as glycolysis/gluconeogenesis, the citrate cycle, fatty acid metabolism and adaptations to environmentally stress-induced changes. The information from this study provides a valuable reservoir for understanding the sRNA-mediated regulation of the complex physiology and metabolic processes of cyanobacteria.

  7. Complete genome sequence of Southern tomato virus naturally infecting tomatoes in Bangladesh using small RNA deep sequencing

    USDA-ARS's Scientific Manuscript database

    The complete genome sequence of a Southern tomato virus (STV) isolate on tomato plants in a seed production field in Bangladesh was obtained for the first time using next generation sequencing. The identified isolate STV_BD-13 shares high degree of sequence identity (99%) with several known STV isol...

  8. Patchiness of deep-sea benthic Foraminifera across the Southern Ocean: Insights from high-throughput DNA sequencing

    NASA Astrophysics Data System (ADS)

    Lejzerowicz, Franck; Esling, Philippe; Pawlowski, Jan

    2014-10-01

    Spatial patchiness is a natural feature that strongly influences the level of species richness we perceive in surface sediments sampled in the deep-sea. Recent environmental DNA (eDNA) surveys of benthic micro- and meiofauna confirmed this exceptional richness. However, it is unknown to which extent the results of these studies, based usually on few grams of sediment, are affected by spatial patchiness of deep-sea benthos. Here, we analyse the eDNA diversity of Foraminifera in 42 deep-sea sediment samples collected across different scales in the Southern Ocean. At three stations, we deployed at least twice the multicorer and from each multicorer cast, we subsampled 3 sediment replicates per core for 2 cores. Using high-throughput sequencing (HTS), we generated over 2.35 million high-quality sequences that we clustered into 451 operational taxonomic units (OTUs). The majority of OTUs were assigned to the monothalamous (single-chambered) taxa and environmental clades. On average, a one-gram sediment sample captures 57.9% of the overall OTU diversity found in a single core, while three replicates cover at most 61.9% of the diversity found in a station. The OTUs found in all the replicates of each core gather up to 87.9% of the total sequenced reads, but only represent from 12.2% to 30% of the OTUs found in one core. These OTUs represent the most abundant species, among which dominate environmental lineages. The majority of the OTUs are represented by few sequences comprising several well-known deep-sea morphospecies or remaining unassigned. It is crucial to study wider arrays of sample and PCR replicates as well as RNA together with DNA in order to overcome biases stemming from deep-sea patchiness and molecular methods.

  9. Deep sequencing analysis of viral infection and evolution allows rapid and detailed characterization of viral mutant spectrum.

    PubMed

    Isakov, Ofer; Bordería, Antonio V; Golan, David; Hamenahem, Amir; Celniker, Gershon; Yoffe, Liron; Blanc, Hervé; Vignuzzi, Marco; Shomron, Noam

    2015-07-01

    The study of RNA virus populations is a challenging task. Each population of RNA virus is composed of a collection of different, yet related genomes often referred to as mutant spectra or quasispecies. Virologists using deep sequencing technologies face major obstacles when studying virus population dynamics, both experimentally and in natural settings, due to the relatively high error rates of these technologies and the lack of high performance pipelines. In order to overcome these hurdles we developed a computational pipeline, termed ViVan (Viral Variance Analysis). ViVan is a complete pipeline facilitating the identification, characterization and comparison of sequence variance in deep sequenced virus populations. Applying ViVan on deep sequenced data obtained from samples that were previously characterized by more classical approaches, we uncovered novel and potentially crucial aspects of virus populations. With our experimental work, we illustrate how ViVan can be used for studies ranging from the more practical, detection of resistant mutations and effects of antiviral treatments, to the more theoretical temporal characterization of the population in evolutionary studies. Freely available on the web at http://www.vivanbioinfo.org. Contact: nshomron@post.tau.ac.il. Supplementary data are available at Bioinformatics online. © The Author 2015. Published by Oxford University Press.

  10. Deep Sequencing Reveals Potential Antigenic Variants at Low Frequencies in Influenza A Virus-Infected Humans

    PubMed Central

    Dinis, Jorge M.; Florek, Nicholas W.; Fatola, Omolayo O.; Moncla, Louise H.; Mutschler, James P.; Charlier, Olivia K.; Meece, Jennifer K.; Belongia, Edward A.

    2016-01-01

    ABSTRACT Influenza vaccines must be frequently reformulated to account for antigenic changes in the viral envelope protein, hemagglutinin (HA). The rapid evolution of influenza virus under immune pressure is likely enhanced by the virus's genetic diversity within a host, although antigenic change has rarely been investigated on the level of individual infected humans. We used deep sequencing to characterize the between- and within-host genetic diversity of influenza viruses in a cohort of patients that included individuals who were vaccinated and then infected in the same season. We characterized influenza HA segments from the predominant circulating influenza A subtypes during the 2012-2013 (H3N2) and 2013-2014 (pandemic H1N1; H1N1pdm) flu seasons. We found that HA consensus sequences were similar in nonvaccinated and vaccinated subjects. In both groups, purifying selection was the dominant force shaping HA genetic diversity. Interestingly, viruses from multiple individuals harbored low-frequency mutations encoding amino acid substitutions in HA antigenic sites at or near the receptor-binding domain. These mutations included two substitutions in H1N1pdm viruses, G158K and N159K, which were recently found to confer escape from virus-specific antibodies. These findings raise the possibility that influenza antigenic diversity can be generated within individual human hosts but may not become fixed in the viral population even when they would be expected to have a strong fitness advantage. Understanding constraints on influenza antigenic evolution within individual hosts may elucidate potential future pathways of antigenic evolution at the population level. IMPORTANCE Influenza vaccines must be frequently reformulated due to the virus's rapid evolution rate. We know that influenza viruses exist within each infected host as a “swarm” of genetically distinct viruses, but the role of this within-host diversity in the antigenic evolution of influenza has been unclear

  11. MicroRNA Discovery and Analysis of Pinewood Nematode Bursaphelenchus xylophilus by Deep Sequencing

    PubMed Central

    Huang, Qi-Xing; Cheng, Xin-Yue; Mao, Zhen-Chuan; Wang, Yun-Sheng; Zhao, Li-Lin; Yan, Xia; Ferris, Virginia R.; Xu, Ru-Mei; Xie, Bing-Yan

    2010-01-01

    Background MicroRNAs (miRNAs) are considered to be very important in regulating the growth, development, behavior and stress response in animals and plants in post-transcriptional gene regulation. Pinewood nematode, Bursaphelenchus xylophilus, is an important invasive plant parasitic nematode in Asia. To have a comprehensive knowledge about miRNAs of the nematode is necessary for further in-depth study on roles of miRNAs in the ecological adaptation of the invasive species. Methods and Findings Five small RNA libraries were constructed and sequenced by Illumina/Solexa deep-sequencing technology. A total of 810 miRNA candidates (49 conserved and 761 novel) were predicted by a computational pipeline, of which 57 miRNAs (20 conserved and 37 novel) encoded by 53 miRNA precursors were identified by experimental methods. Ten novel miRNAs were considered to be species-specific miRNAs of B. xylophilus. Comparison of expression profiles of miRNAs in the five small RNA libraries showed that many miRNAs exhibited obviously different expression levels in the third-stage dispersal juvenile and at a cold-stressed status. Most of the miRNAs exhibited obviously down-regulated expression in the dispersal stage. But differences among the three geographic libraries were not prominent. A total of 979 genes were predicted to be targets of these authentic miRNAs. Among them, seven heat shock protein genes were targeted by 14 miRNAs, and six FMRFamide-like neuropeptides genes were targeted by 17 miRNAs. A real-time quantitative polymerase chain reaction was used to quantify the mRNA expression levels of target genes. Conclusions Basing on the fact that a negative correlation existed between the expression profiles of miRNAs and the mRNA expression profiles of their target genes (hsp, flp) by comparing those of the nematodes at a cold stressed status and a normal status, we suggested that miRNAs might participate in ecological adaptation and behavior regulation of the nematode. This is

  12. Identifying Conserved and Novel MicroRNAs in Developing Seeds of Brassica napus Using Deep Sequencing

    PubMed Central

    Körbes, Ana Paula; Machado, Ronei Dorneles; Guzman, Frank; Almerão, Mauricio Pereira; de Oliveira, Luiz Felipe Valter; Loss-Morais, Guilherme; Turchetto-Zolet, Andreia Carina; Cagliari, Alexandro; dos Santos Maraschin, Felipe; Margis-Pinheiro, Marcia; Margis, Rogerio

    2012-01-01

    MicroRNAs (miRNAs) are important post-transcriptional regulators of plant development and seed formation. In Brassica napus, an important edible oil crop, valuable lipids are synthesized and stored in specific seed tissues during embryogenesis. The miRNA transcriptome of B. napus is currently poorly characterized, especially at different seed developmental stages. This work aims to describe the miRNAome of developing seeds of B. napus by identifying plant-conserved and novel miRNAs and comparing miRNA abundance in mature versus developing seeds. Members of 59 miRNA families were detected through a computational analysis of a large number of reads obtained from deep sequencing two small RNA and two RNA-seq libraries of (i) pooled immature developing stages and (ii) mature B. napus seeds. Among these miRNA families, 17 families are currently known to exist in B. napus; additionally 29 families not reported in B. napus but conserved in other plant species were identified by alignment with known plant mature miRNAs. Assembled mRNA-seq contigs allowed for a search of putative new precursors and led to the identification of 13 novel miRNA families. Analysis of miRNA population between libraries reveals that several miRNAs and isomiRNAs have different abundance in developing stages compared to mature seeds. The predicted miRNA target genes encode a broad range of proteins related to seed development and energy storage. This work presents a comparative study of the miRNA transcriptome of mature and developing B. napus seeds and provides a basis for future research on individual miRNAs and their functions in embryogenesis, seed maturation and lipid accumulation in B. napus. PMID:23226347

  13. MicroRNA deep-sequencing reveals master regulators of follicular and papillary thyroid tumors.

    PubMed

    Mancikova, Veronika; Castelblanco, Esmeralda; Pineiro-Yanez, Elena; Perales-Paton, Javier; de Cubas, Aguirre A; Inglada-Perez, Lucia; Matias-Guiu, Xavier; Capel, Ismael; Bella, Maria; Lerma, Enrique; Riesco-Eizaguirre, Garcilaso; Santisteban, Pilar; Maravall, Francisco; Mauricio, Didac; Al-Shahrour, Fatima; Robledo, Mercedes

    2015-06-01

    MicroRNA deregulation could be a crucial event in thyroid carcinogenesis. However, current knowledge is based on studies that have used inherently biased methods. Thus, we aimed to define in an unbiased way a list of deregulated microRNAs in well-differentiated thyroid cancer in order to identify diagnostic and prognostic markers. We performed a microRNA deep-sequencing study using the largest well-differentiated thyroid tumor collection reported to date, comprising 127 molecularly characterized tumors with follicular or papillary patterns of growth and available clinical follow-up data, and 17 normal tissue samples. Furthermore, we integrated microRNA and gene expression data for the same tumors to propose targets for the novel molecules identified. Two main microRNA expression profiles were identified: one common for follicular-pattern tumors, and a second for papillary tumors. Follicular tumors showed a notable overexpression of several members of miR-515 family, and downregulation of the novel microRNA miR-1247. Among papillary tumors, top upregulated microRNAs were miR-146b and the miR-221~222 cluster, while miR-1179 was downregulated. BRAF-positive samples displayed extreme downregulation of miR-7 and -204. The identification of the predicted targets for the novel molecules gave insights into the proliferative potential of the transformed follicular cell. Finally, by integrating clinical follow-up information with microRNA expression, we propose a prediction model for disease relapse based on expression of two miRNAs (miR-192 and let-7a) and several other clinicopathological features. This comprehensive study complements the existing knowledge about deregulated microRNAs in the development of well-differentiated thyroid cancer and identifies novel markers associated with recurrence-free survival.

  14. Deep sequencing reveals microRNAs predictive of antiangiogenic drug response

    PubMed Central

    García-Donas, Jesús; Beuselinck, Benoit; Inglada-Pérez, Lucía; Graña, Osvaldo; Schöffski, Patrick; Wozniak, Agnieszka; Bechter, Oliver; Apellániz-Ruiz, Maria; Leandro-García, Luis Javier; Esteban, Emilio; Castellano, Daniel E.; González del Alba, Aranzazu; Climent, Miguel Angel; Hernando, Susana; Arranz, José Angel; Morente, Manuel; Pisano, David G.; Robledo, Mercedes

    2016-01-01

    The majority of metastatic renal cell carcinoma (RCC) patients are treated with tyrosine kinase inhibitors (TKI) in first-line treatment; however, a fraction are refractory to these antiangiogenic drugs. MicroRNAs (miRNAs) are regulatory molecules proven to be accurate biomarkers in cancer. Here, we identified miRNAs predictive of progressive disease under TKI treatment through deep sequencing of 74 metastatic clear cell RCC cases uniformly treated with these drugs. Twenty-nine miRNAs were differentially expressed in the tumors of patients who progressed under TKI therapy (P values from 6 × 10^-9 to 3 × 10^-3). Among 6 miRNAs selected for validation in an independent series, the most relevant associations corresponded to miR-1307-3p, miR-155-5p, and miR-221-3p (P = 4.6 × 10^-3, 6.5 × 10^-3, and 3.4 × 10^-2, respectively). Furthermore, a 2 miRNA-based classifier discriminated individuals with progressive disease upon TKI treatment (AUC = 0.75, 95% CI, 0.64–0.85; P = 1.3 × 10^-4) with better predictive value than clinicopathological risk factors commonly used. We also identified miRNAs significantly associated with progression-free survival and overall survival (P = 6.8 × 10^-8 and 7.8 × 10^-7 for top hits, respectively), and 7 overlapped with early progressive disease. In conclusion, this is the first miRNome comprehensive study, to our knowledge, that demonstrates a predictive value of miRNAs for TKI response and provides a new set of relevant markers that can help rationalize metastatic RCC treatment. PMID:27699216

  15. A deep sequencing approach to uncover the miRNOME in the human heart.

    PubMed

    Leptidis, Stefanos; El Azzouzi, Hamid; Lok, Sjoukje I; de Weger, Roel; Olieslagers, Servé; Kisters, Natasja; Silva, Gustavo J; Heymans, Stephane; Cuppen, Edwin; Berezikov, Eugene; De Windt, Leon J; da Costa Martins, Paula

    2013-01-01

    MicroRNAs (miRNAs) are a class of non-coding RNAs of ∼22 nucleotides in length, and constitute a novel class of gene regulators by imperfect base-pairing to the 3'UTR of protein encoding messenger RNAs. Growing evidence indicates that miRNAs are implicated in several pathological processes in myocardial disease. The past years, we have witnessed several profiling attempts using high-density oligonucleotide array-based approaches to identify the complete miRNA content (miRNOME) in the healthy and diseased mammalian heart. These efforts have demonstrated that the failing heart displays differential expression of several dozens of miRNAs. While the total number of experimentally validated human miRNAs is roughly two thousand, the number of expressed miRNAs in the human myocardium remains elusive. Our objective was to perform an unbiased assay to identify the miRNOME of the human heart, both under physiological and pathophysiological conditions. We used deep sequencing and bioinformatics to annotate and quantify microRNA expression in healthy and diseased human heart (heart failure secondary to hypertrophic or dilated cardiomyopathy). Our results indicate that the human heart expresses >800 miRNAs, the majority of which not being annotated nor described so far and some of which being unique to primate species. Furthermore, >250 miRNAs show differential and etiology-dependent expression in human dilated cardiomyopathy (DCM) or hypertrophic cardiomyopathy (HCM). The human cardiac miRNOME still possesses a large number of miRNAs that remain virtually unexplored. The current study provides a starting point for a more comprehensive understanding of the role of miRNAs in regulating human heart disease.

  16. Deep sequencing reveals a novel closterovirus associated with wild rose leaf rosette disease.

    PubMed

    He, Yan; Yang, Zuokun; Hong, Ni; Wang, Guoping; Ning, Guogui; Xu, Wenxing

    2015-06-01

    A bizarre virus-like symptom of a leaf rosette formed by dense small leaves on branches of wild roses (Rosa multiflora Thunb.), designated as 'wild rose leaf rosette disease' (WRLRD), was observed in China. To investigate the presumed causal virus, a wild rose sample affected by WRLRD was subjected to deep sequencing of small interfering RNAs (siRNAs) for a complete survey of the infecting viruses and viroids. The assembly of siRNAs led to the reconstruction of the complete genomes of three known viruses, namely Apple stem grooving virus (ASGV), Blackberry chlorotic ringspot virus (BCRV) and Prunus necrotic ringspot virus (PNRSV), and of a novel virus provisionally named 'rose leaf rosette-associated virus' (RLRaV). Phylogenetic analysis clearly placed RLRaV alongside members of the genus Closterovirus, family Closteroviridae. Genome organization of RLRaV RNA (17,653 nucleotides) showed 13 open reading frames (ORFs), except ORF1 and the quintuple gene block, most of which showed no significant similarities with known viral proteins, but, instead, had detectable identities to fungal or bacterial proteins. Additional novel molecular features indicated that RLRaV seems to be the most complex virus among the known genus members. To our knowledge, this is the first report of WRLRD and its associated closterovirus, as well as two ilarviruses and one capilovirus, infecting wild roses. Our findings present novel information about the closterovirus and the aetiology of this rose disease which should facilitate its control. More importantly, the novel features of RLRaV help to clarify the molecular and evolutionary features of the closterovirus.

  17. Microbial Dark Matter: Unusual intervening sequences in 16S rRNA genes of candidate phyla from the deep subsurface

    SciTech Connect

    Jarett, Jessica; Stepanauskas, Ramunas; Kieft, Thomas; Onstott, Tullis; Woyke, Tanja

    2014-03-17

    The Microbial Dark Matter project has sequenced genomes from over 200 single cells from candidate phyla, greatly expanding our knowledge of the ecology, inferred metabolism, and evolution of these widely distributed, yet poorly understood lineages. The second phase of this project aims to sequence an additional 800 single cells from known as well as potentially novel candidate phyla derived from a variety of environments. In order to identify whole genome amplified single cells, screening based on phylogenetic placement of 16S rRNA gene sequences is being conducted. Briefly, derived 16S rRNA gene sequences are aligned to a custom version of the Greengenes reference database and added to a reference tree in ARB using parsimony. In multiple samples from deep subsurface habitats but not from other habitats, a large number of sequences proved difficult to align and therefore to place in the tree. Based on comparisons to reference sequences and structural alignments using SSU-ALIGN, many of these "difficult" sequences appear to originate from candidate phyla, and contain intervening sequences (IVSs) within the 16S rRNA genes. These IVSs are short (39 - 79 nt) and do not appear to be self-splicing or to contain open reading frames. IVSs were found in the loop regions of stem-loop structures in several different taxonomic groups. Phylogenetic placement of sequences is strongly affected by IVSs; two out of three groups investigated were classified as different phyla after their removal. Based on data from samples screened in this project, IVSs appear to be more common in microbes occurring in deep subsurface habitats, although the reasons for this remain elusive.

  18. Acyclic identification of aptamers for human alpha-thrombin using over-represented libraries and deep sequencing.

    PubMed

    Kupakuwana, Gillian V; Crill, James E; McPike, Mark P; Borer, Philip N

    2011-01-01

    Aptamers are oligonucleotides that bind proteins and other targets with high affinity and selectivity. Twenty years ago elements of natural selection were adapted to in vitro selection in order to distinguish aptamers among randomized sequence libraries. The primary bottleneck in traditional aptamer discovery is multiple cycles of in vitro evolution. We show that over-representation of sequences in aptamer libraries and deep sequencing enables acyclic identification of aptamers. We demonstrated this by isolating a known family of aptamers for human α-thrombin. Aptamers were found within a library containing an average of 56,000 copies of each possible randomized 15mer segment. The high affinity sequences were counted many times above the background in 2-6 million reads. Clustering analysis of sequences with more than 10 counts distinguished two sequence motifs with candidates at high abundance. Motif I contained the previously observed consensus 15mer, Thb1 (46,000 counts), and related variants with mostly G/T substitutions; secondary analysis showed that affinity for thrombin correlated with abundance (Kd = 12 nM for Thb1). The signal-to-noise ratio for this experiment was roughly 10,000:1 for Thb1. Motif II was unrelated to Thb1 with the leading candidate (29,000 counts) being a novel aptamer against hexose sugars in the storage and elution buffers for Concanavalin A (Kd = 0.5 µM for α-methyl-mannoside); ConA was used to immobilize α-thrombin. Over-representation together with deep sequencing can dramatically shorten the discovery process, distinguish aptamers having a wide range of affinity for the target, allow an exhaustive search of the sequence space within a simplified library, reduce the quantity of the target required, eliminate cycling artifacts, and should allow multiplexing of sequencing experiments and targets.

  19. Acyclic Identification of Aptamers for Human alpha-Thrombin Using Over-Represented Libraries and Deep Sequencing

    PubMed Central

    Kupakuwana, Gillian V.; Crill, James E.; McPike, Mark P.; Borer, Philip N.

    2011-01-01

    Background Aptamers are oligonucleotides that bind proteins and other targets with high affinity and selectivity. Twenty years ago elements of natural selection were adapted to in vitro selection in order to distinguish aptamers among randomized sequence libraries. The primary bottleneck in traditional aptamer discovery is multiple cycles of in vitro evolution. Methodology/Principal Findings We show that over-representation of sequences in aptamer libraries and deep sequencing enables acyclic identification of aptamers. We demonstrated this by isolating a known family of aptamers for human α-thrombin. Aptamers were found within a library containing an average of 56,000 copies of each possible randomized 15mer segment. The high affinity sequences were counted many times above the background in 2–6 million reads. Clustering analysis of sequences with more than 10 counts distinguished two sequence motifs with candidates at high abundance. Motif I contained the previously observed consensus 15mer, Thb1 (46,000 counts), and related variants with mostly G/T substitutions; secondary analysis showed that affinity for thrombin correlated with abundance (Kd = 12 nM for Thb1). The signal-to-noise ratio for this experiment was roughly 10,000:1 for Thb1. Motif II was unrelated to Thb1 with the leading candidate (29,000 counts) being a novel aptamer against hexose sugars in the storage and elution buffers for Concanavalin A (Kd = 0.5 µM for α-methyl-mannoside); ConA was used to immobilize α-thrombin. Conclusions/Significance Over-representation together with deep sequencing can dramatically shorten the discovery process, distinguish aptamers having a wide range of affinity for the target, allow an exhaustive search of the sequence space within a simplified library, reduce the quantity of the target required, eliminate cycling artifacts, and should allow multiplexing of sequencing experiments and targets. PMID:21625587

  20. Deep Sequencing Reveals the Complete Genome and Evidence for Transcriptional Activity of the First Virus-Like Sequences Identified in Aristotelia chilensis (Maqui Berry)

    PubMed Central

    Villacreses, Javier; Rojas-Herrera, Marcelo; Sánchez, Carolina; Hewstone, Nicole; Undurraga, Soledad F.; Alzate, Juan F.; Manque, Patricio; Maracaja-Coutinho, Vinicius; Polanco, Victor

    2015-01-01

    Here, we report the genome sequence and evidence for transcriptional activity of a virus-like element in the native Chilean berry tree Aristotelia chilensis. We propose to name the endogenous sequence Aristotelia chilensis Virus 1 (AcV1). High-throughput sequencing of the genome of this tree uncovered an endogenous viral element, with a size of 7122 bp, corresponding to the complete genome of AcV1. Its sequence contains three open reading frames (ORFs): ORFs 1 and 2 share 66%–73% amino acid similarity with members of the Caulimoviridae virus family, especially the Petunia vein clearing virus (PVCV), Petuvirus genus. ORF1 encodes a movement protein (MP); ORF2 a Reverse Transcriptase (RT) and a Ribonuclease H (RNase H) domain; and ORF3 showed no amino acid sequence similarity with any other known virus proteins. Analogous to other known endogenous pararetrovirus sequences (EPRVs), AcV1 is integrated in the genome of Maqui Berry and showed low viral transcriptional activity, which was detected by deep sequencing technology (DNA and RNA-seq). Phylogenetic analysis of AcV1 and other pararetroviruses revealed a closer resemblance with Petuvirus. Overall, our data suggest that AcV1 could be a new member of the Caulimoviridae family, genus Petuvirus, and the first evidence of this kind of virus in a fruit plant. PMID:25855242

  1. Deep sequencing reveals the complete genome and evidence for transcriptional activity of the first virus-like sequences identified in Aristotelia chilensis (Maqui Berry).

    PubMed

    Villacreses, Javier; Rojas-Herrera, Marcelo; Sánchez, Carolina; Hewstone, Nicole; Undurraga, Soledad F; Alzate, Juan F; Manque, Patricio; Maracaja-Coutinho, Vinicius; Polanco, Victor

    2015-04-03

    Here, we report the genome sequence and evidence for transcriptional activity of a virus-like element in the native Chilean berry tree Aristotelia chilensis. We propose to name the endogenous sequence Aristotelia chilensis Virus 1 (AcV1). High-throughput sequencing of the genome of this tree uncovered an endogenous viral element, with a size of 7122 bp, corresponding to the complete genome of AcV1. Its sequence contains three open reading frames (ORFs): ORFs 1 and 2 share 66%-73% amino acid similarity with members of the Caulimoviridae virus family, especially the Petunia vein clearing virus (PVCV), Petuvirus genus. ORF1 encodes a movement protein (MP); ORF2 a Reverse Transcriptase (RT) and a Ribonuclease H (RNase H) domain; and ORF3 showed no amino acid sequence similarity with any other known virus proteins. Analogous to other known endogenous pararetrovirus sequences (EPRVs), AcV1 is integrated in the genome of Maqui Berry and showed low viral transcriptional activity, which was detected by deep sequencing technology (DNA and RNA-seq). Phylogenetic analysis of AcV1 and other pararetroviruses revealed a closer resemblance with Petuvirus. Overall, our data suggest that AcV1 could be a new member of the Caulimoviridae family, genus Petuvirus, and the first evidence of this kind of virus in a fruit plant.

  2. A filtering method to generate high quality short reads using illumina paired-end technology.

    PubMed

    Eren, A Murat; Vineis, Joseph H; Morrison, Hilary G; Sogin, Mitchell L

    2013-01-01

    Consensus between independent reads improves the accuracy of genome and transcriptome analyses; however, lack of consensus between very similar sequences in metagenomic studies can and often does represent natural variation of biological significance. Machine-assigned quality scores, commonly used on next-generation platforms, do not necessarily correlate with accuracy. Here, we describe using the overlap of paired-end, short sequence reads to identify error-prone reads in marker gene analyses and their contribution to spurious OTUs following clustering analysis using QIIME. Our approach can also reduce error in shotgun sequencing data generated from libraries with small, tightly constrained insert sizes. The open-source implementation of this algorithm in the Python programming language, with user instructions, can be obtained from https://github.com/meren/illumina-utils.
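
    The idea of using the overlap between paired-end reads to flag error-prone reads can be sketched as below. This is a simplified illustration and not the illumina-utils implementation: it assumes the overlap length is known in advance, ignores quality scores, and uses hypothetical function names and a made-up fragment.

```python
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")

def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]

def overlap_mismatches(read1, read2, overlap_len):
    """Count disagreements between the end of read1 and the start of
    reverse-complemented read2 over a known overlap length (an assumption)."""
    if not 0 < overlap_len <= min(len(read1), len(read2)):
        raise ValueError("overlap_len must be between 1 and the shorter read length")
    tail = read1[-overlap_len:]
    head = reverse_complement(read2)[:overlap_len]
    return sum(a != b for a, b in zip(tail, head))

def keep_pair(read1, read2, overlap_len, max_mismatches=0):
    """Keep a pair only if its overlapping bases agree closely enough."""
    return overlap_mismatches(read1, read2, overlap_len) <= max_mismatches

# Toy example: a 20-nt fragment sequenced with 14-nt reads overlaps by 8 nt.
fragment = "ACGTTGCAAGGCTTAACCGG"
read1 = fragment[:14]
read2 = reverse_complement(fragment)[:14]
print(keep_pair(read1, read2, overlap_len=8))
```

    Pairs whose overlapping bases disagree are treated as suspect and dropped; the mismatch threshold is a tunable choice in this sketch.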

  3. New developments of alignment-free sequence comparison: measures, statistics and next-generation sequencing

    PubMed Central

    Song, Kai; Ren, Jie; Reinert, Gesine; Deng, Minghua

    2014-01-01

    With the development of next-generation sequencing (NGS) technologies, a large amount of short read data has been generated. Assembly of these short reads can be challenging for genomes and metagenomes without template sequences, making alignment-based genome sequence comparison difficult. In addition, sequence reads from NGS can come from different regions of various genomes and they may not be alignable. Sequence signature-based methods for genome comparison based on the frequencies of word patterns in genomes and metagenomes can potentially be useful for the analysis of short reads data from NGS. Here we review the recent development of alignment-free genome and metagenome comparison based on the frequencies of word patterns with emphasis on the dissimilarity measures between sequences, the statistical power of these measures when two sequences are related and the applications of these measures to NGS data. PMID:24064230
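
    To make the word-frequency idea concrete, the sketch below counts k-mers ("words") in two sequences and compares them without alignment. The inner product of the count vectors is the classical D2 statistic; the cosine-style scaling into a 0 to 0.5 dissimilarity is an illustrative choice made here, not a specific measure taken from the review.

```python
from collections import Counter
from math import sqrt

def kmer_counts(seq, k):
    """Count every overlapping word of length k in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

def d2_statistic(x, y):
    """Inner product of two k-mer count vectors (the D2 statistic)."""
    return sum(n * y[word] for word, n in x.items())

def normalized_dissimilarity(x, y):
    """Illustrative scaling: 0 for proportional count vectors,
    0.5 when the two sequences share no words."""
    norm = sqrt(sum(n * n for n in x.values())) * sqrt(sum(n * n for n in y.values()))
    if norm == 0:
        raise ValueError("both sequences must contain at least one k-mer")
    return 0.5 * (1 - d2_statistic(x, y) / norm)

a = kmer_counts("ACGTACGT", 2)
b = kmer_counts("TTTTTTTT", 2)
print(d2_statistic(a, a), normalized_dissimilarity(a, b))
```

    The same functions apply to pooled reads: counts from all reads in a sample can be summed into one Counter before comparison.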

  4. DREAM: a webserver for the identification of editing sites in mature miRNAs using deep sequencing data.

    PubMed

    Alon, Shahar; Erew, Muhammad; Eisenberg, Eli

    2015-08-01

    DREAM (detecting RNA editing associated with microRNAs) is a webserver for the identification of mature microRNA editing events using deep sequencing data. Raw microRNA sequencing reads can be provided as input; the reads are aligned against the genome, and custom scripts process the data, search for potential editing sites and assess the statistical significance of the findings. The output is a text file with the location and the statistical description of all the putative editing sites detected.

  5. Deep Transcriptome Sequencing of Two Green Algae, Chara vulgaris and Chlamydomonas reinhardtii, Provides No Evidence of Organellar RNA Editing

    PubMed Central

    Cahoon, A. Bruce; Nauss, John A.; Stanley, Conner D.; Qureshi, Ali

    2017-01-01

    Nearly all land plants post-transcriptionally modify specific nucleotides within RNAs, a process known as RNA editing. This adaptation allows the correction of deleterious mutations within the asexually reproducing and presumably non-recombinant chloroplast and mitochondrial genomes. There are no reports of RNA editing in any of the green algae, so this phenomenon is presumed to have originated in embryophytes either after the invasion of land or in the now extinct algal ancestor of all land plants. This was challenged when a recent in silico screen for RNA edit sites based on genomic sequence homology predicted edit sites in the green alga Chara vulgaris, a multicellular alga found within the Streptophyta clade and one of the closest extant algal relatives of land plants. In this study, the organelle transcriptomes of C. vulgaris and Chlamydomonas reinhardtii were deep sequenced for a comprehensive assessment of RNA editing. Initial analyses based solely on sequence comparisons suggested potential edit sites in both species, but subsequent high-resolution melt analysis, RNase H-dependent PCR (rhPCR), and Sanger sequencing of DNA and complementary DNAs (cDNAs) from each of the putative edit sites revealed them to be either single-nucleotide polymorphisms (SNPs) or spurious deep sequencing results. The lack of RNA editing in these two lineages is consistent with the current hypothesis that RNA editing evolved after embryophytes split from their ancestral algal lineage. PMID:28230734

  6. Discovering the unknown: improving detection of novel species and genera from short reads.

    PubMed

    Rosen, Gail L; Polikar, Robi; Caseiro, Diamantino A; Essinger, Steven D; Sokhansanj, Bahrad A

    2011-01-01

    High-throughput sequencing technologies enable metagenome profiling, simultaneous sequencing of multiple microbial species present within an environmental sample. Since metagenomic data includes sequence fragments ("reads") from organisms that are absent from any database, new algorithms must be developed for the identification and annotation of novel sequence fragments. Homology-based techniques have been modified to detect novel species and genera, but composition-based methods have not been adapted. We develop a detection technique that can discriminate between "known" and "unknown" taxa, which can be used with composition-based methods, as well as a hybrid method. Unlike previous studies, we rigorously evaluate all algorithms for their ability to detect novel taxa. First, we show that the integration of a detector with a composition-based method performs significantly better than homology-based methods for the detection of novel species and genera, with best performance at finer taxonomic resolutions. Most importantly, we evaluate all the algorithms by introducing an "unknown" class and show that the modified version of PhymmBL has similar or better overall classification performance than the other modified algorithms, especially for the species-level and ultrashort reads. Finally, we evaluate the performance of several algorithms on a real acid mine drainage dataset.

  7. Discovering the Unknown: Improving Detection of Novel Species and Genera from Short Reads

    PubMed Central

    Rosen, Gail L.; Polikar, Robi; Caseiro, Diamantino A.; Essinger, Steven D.; Sokhansanj, Bahrad A.

    2011-01-01

    High-throughput sequencing technologies enable metagenome profiling, simultaneous sequencing of multiple microbial species present within an environmental sample. Since metagenomic data includes sequence fragments (“reads”) from organisms that are absent from any database, new algorithms must be developed for the identification and annotation of novel sequence fragments. Homology-based techniques have been modified to detect novel species and genera, but composition-based methods have not been adapted. We develop a detection technique that can discriminate between “known” and “unknown” taxa, which can be used with composition-based methods, as well as a hybrid method. Unlike previous studies, we rigorously evaluate all algorithms for their ability to detect novel taxa. First, we show that the integration of a detector with a composition-based method performs significantly better than homology-based methods for the detection of novel species and genera, with best performance at finer taxonomic resolutions. Most importantly, we evaluate all the algorithms by introducing an “unknown” class and show that the modified version of PhymmBL has similar or better overall classification performance than the other modified algorithms, especially for the species-level and ultrashort reads. Finally, we evaluate the performance of several algorithms on a real acid mine drainage dataset. PMID:21541181

  8. Discovering the Unknown: Improving Detection of Novel Species and Genera from Short Reads

    DOE PAGES

    Rosen, Gail L.; Polikar, Robi; Caseiro, Diamantino A.; ...

    2011-01-01

    High-throughput sequencing technologies enable metagenome profiling, simultaneous sequencing of multiple microbial species present within an environmental sample. Since metagenomic data includes sequence fragments (“reads”) from organisms that are absent from any database, new algorithms must be developed for the identification and annotation of novel sequence fragments. Homology-based techniques have been modified to detect novel species and genera, but composition-based methods have not been adapted. We develop a detection technique that can discriminate between “known” and “unknown” taxa, which can be used with composition-based methods, as well as a hybrid method. Unlike previous studies, we rigorously evaluate all algorithms for their ability to detect novel taxa. First, we show that the integration of a detector with a composition-based method performs significantly better than homology-based methods for the detection of novel species and genera, with best performance at finer taxonomic resolutions. Most importantly, we evaluate all the algorithms by introducing an “unknown” class and show that the modified version of PhymmBL has similar or better overall classification performance than the other modified algorithms, especially for the species-level and ultrashort reads. Finally, we evaluate the performance of several algorithms on a real acid mine drainage dataset.

  9. When Is a Microbial Culture “Pure”? Persistent Cryptic Contaminant Escapes Detection Even with Deep Genome Sequencing

    PubMed Central

    Shrestha, Pravin Malla; Nevin, Kelly P.; Shrestha, Minita; Lovley, Derek R.

    2013-01-01

    Geobacter sulfurreducens strain KN400 was recovered in previous studies in which a culture of the DL1 strain of G. sulfurreducens served as the inoculum in investigations of microbial current production at low anode potentials (−400 mV versus Ag/AgCl). Differences in the genome sequences of KN400 and DL1 were too great to have arisen from adaptive evolution during growth on the anode. Previous deep sequencing (80-fold coverage) of the DL1 culture failed to detect sequences specific to KN400, suggesting that KN400 was an external contaminant inadvertently introduced into the anode culturing system. In order to evaluate this further, a portion of the gene for OmcS, a c-type cytochrome that both KN400 and DL1 possess, was amplified from the DL1 culture. HiSeq-2000 Illumina sequencing of the PCR product detected the KN400 sequence, which differs from the DL1 sequence at 14 bp, at a frequency of ca. 1 in 10^5 copies of the DL1 sequence. A similar low frequency of KN400 was detected with quantitative PCR of a KN400-specific gene. KN400 persisted at this frequency after intensive restreaking of isolated colonies from the DL1 culture. However, a culture in which KN400 could no longer be detected was obtained by serial dilution to extinction in liquid medium. The KN400-free culture could not grow on an anode poised at −400 mV. Thus, KN400 cryptically persisted in the culture dominated by DL1 for more than a decade, undetected by even deep whole-genome sequencing, and was only fortuitously uncovered by the unnatural selection pressure of growth on a low-potential electrode. PMID:23481604
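
    A rough back-of-the-envelope calculation shows why 80-fold whole-genome coverage would be expected to miss a contaminant this rare. It takes the minority frequency as roughly f = 10^-5 and assumes, for illustration only, that reads sample genome copies independently and uniformly and that sequencing errors are ignored.

```latex
% Assumptions (illustrative): coverage c = 80, minority frequency f = 10^{-5},
% independent uniform sampling of genome copies, no sequencing error.
\begin{align*}
\text{expected minority reads per site} &= c\,f = 80 \times 10^{-5} = 8 \times 10^{-4},\\
P(\text{at least one minority read at a site}) &= 1 - (1 - f)^{c} \approx c\,f = 8 \times 10^{-4},\\
\text{coverage for about one expected minority read} &= 1/f = 10^{5}\text{-fold}.
\end{align*}
```

    Under these assumptions, amplicon sequencing of a single gene at very high depth reaches the required coverage far more easily than whole-genome sequencing does.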

  10. Clinical Application of Targeted Deep Sequencing in Solid-Cancer Patients; Utility of Targeted Deep Sequencing for Biomarker-Selected Clinical Trial.

    PubMed

    Kim, Seung Tae; Kim, Kyoung-Mee; Kim, Nayoung K D; Park, Joon Oh; Ahn, Soomin; Yun, Jae-Won; Kim, Kyu-Tae; Park, Se Hoon; Park, Peter J; Kim, Hee Cheol; Sohn, Tae Sung; Choi, Dong Il; Cho, Jong Ho; Heo, Jin Seok; Kwon, Wooil; Lee, Hyuk; Min, Byung-Hoon; Hong, Sung No; Park, Young Suk; Lim, Ho Yeong; Kang, Won Ki; Park, Woong-Yang; Lee, Jeeyun

    2017-07-12

    Molecular profiling of actionable mutations in refractory cancer patients has the potential to enable "precision medicine," wherein individualized therapies are guided based on genomic profiling. The molecular-screening program was intended to route participants to different candidate drugs in trials based on clinical-sequencing reports. In this screening program, we used a custom target-enrichment panel consisting of cancer-related genes to interrogate single-nucleotide variants, insertions and deletions, copy number variants, and a subset of gene fusions. From August 2014 through April 2015, 654 patients consented to participate in the program at Samsung Medical Center. Of these patients, 588 passed the quality control process for the 381-gene cancer-panel test, and 418 patients were included in the final analysis as being eligible for any anticancer treatment (127 gastric cancer, 122 colorectal cancer, 62 pancreatic/biliary tract cancer, 67 sarcoma/other cancer, and 40 genitourinary cancer patients). Of the 418 patients, 55 (12%) harbored a biomarker that guided them to a biomarker-selected clinical trial, and 184 (44%) patients harbored at least one genomic alteration that was potentially targetable. This study demonstrated that the panel-based sequencing program resulted in an increased rate of trial enrollment of metastatic cancer patients into biomarker-selected clinical trials. Given the expanding list of biomarker-selected trials, the guidance percentage to matched trials is anticipated to increase.

  11. Comprehensive transcriptome analysis using synthetic long-read sequencing reveals molecular co-association of distant splicing events

    PubMed Central

    Tilgner, Hagen; Jahanbani, Fereshteh; Blauwkamp, Tim; Moshrefi, Ali; Jaeger, Erich; Chen, Feng; Harel, Itamar; Bustamante, Carlos D; Rasmussen, Morten; Snyder, Michael P

    2016-01-01

    Alternative splicing shapes mammalian transcriptomes, with many RNA molecules undergoing multiple distant alternative splicing events. Comprehensive transcriptome analysis, including analysis of exon co-association in the same molecule, requires deep, long-read sequencing. Here we introduce an RNA sequencing method, synthetic long-read RNA sequencing (SLR-RNA-seq), in which small pools (≤1,000 molecules/pool, ≤1 molecule/gene for most genes) of full-length cDNAs are amplified, fragmented and short-read-sequenced. We demonstrate that these RNA sequences reconstructed from the short reads from each of the pools are mostly close to full length and contain few insertion and deletion errors. We report many previously undescribed isoforms (human brain: ∼13,800 affected genes, 14.5% of molecules; mouse brain ∼8,600 genes, 18% of molecules) and up to 165 human distant molecularly associated exon pairs (dMAPs) and distant molecularly and mutually exclusive pairs (dMEPs). Of 16 associated pairs detected in the mouse brain, 9 are conserved in human. Our results indicate conserved mechanisms that can produce distant but phased features on transcript and proteome isoforms. PMID:25985263

  12. Comprehensive transcriptome analysis using synthetic long-read sequencing reveals molecular co-association of distant splicing events.

    PubMed

    Tilgner, Hagen; Jahanbani, Fereshteh; Blauwkamp, Tim; Moshrefi, Ali; Jaeger, Erich; Chen, Feng; Harel, Itamar; Bustamante, Carlos D; Rasmussen, Morten; Snyder, Michael P

    2015-07-01

    Alternative splicing shapes mammalian transcriptomes, with many RNA molecules undergoing multiple distant alternative splicing events. Comprehensive transcriptome analysis, including analysis of exon co-association in the same molecule, requires deep, long-read sequencing. Here we introduce an RNA sequencing method, synthetic long-read RNA sequencing (SLR-RNA-seq), in which small pools (≤1,000 molecules/pool, ≤1 molecule/gene for most genes) of full-length cDNAs are amplified, fragmented and short-read-sequenced. We demonstrate that these RNA sequences reconstructed from the short reads from each of the pools are mostly close to full length and contain few insertion and deletion errors. We report many previously undescribed isoforms (human brain: ∼13,800 affected genes, 14.5% of molecules; mouse brain ∼8,600 genes, 18% of molecules) and up to 165 human distant molecularly associated exon pairs (dMAPs) and distant molecularly and mutually exclusive pairs (dMEPs). Of 16 associated pairs detected in the mouse brain, 9 are conserved in human. Our results indicate conserved mechanisms that can produce distant but phased features on transcript and proteome isoforms.

  13. Homology-independent discovery of replicating pathogenic circular RNAs by deep sequencing and a new computational algorithm.

    PubMed

    Wu, Qingfa; Wang, Ying; Cao, Mengji; Pantaleo, Vitantonio; Burgyan, Joszef; Li, Wan-Xiang; Ding, Shou-Wei

    2012-03-06

    A common challenge in pathogen discovery by deep sequencing approaches is to recognize viral or subviral pathogens in samples of diseased tissue that share no significant homology with a known pathogen. Here we report a homology-independent approach for discovering viroids, a distinct class of free circular RNA subviral pathogens that encode no protein and are known to infect plants only. Our approach involves analyzing the sequences of the total small RNAs of the infected plants obtained by deep sequencing with a unique computational algorithm, progressive filtering of overlapping small RNAs (PFOR). Viroid infection triggers production of viroid-derived overlapping siRNAs that cover the entire genome with high densities. PFOR retains viroid-specific siRNAs for genome assembly by progressively eliminating nonoverlapping small RNAs and those that overlap but cannot be assembled into a direct repeat RNA, which is synthesized from circular or multimeric repeated-sequence templates during viroid replication. We show that viroids from the two known families are readily identified and their full-length sequences assembled by PFOR from small RNAs sequenced from infected plants. PFOR analysis of a grapevine library further identified a viroid-like circular RNA 375 nt long that shared no significant sequence homology with known molecules and encoded active hammerhead ribozymes in RNAs of both plus and minus polarities, which presumably self-cleave to release monomer from multimeric replicative intermediates. A potential application of the homology-independent approach for viroid discovery in plant and animal species where RNA replication triggers the biogenesis of siRNAs is discussed.

  14. MICA: A fast short-read aligner that takes full advantage of Many Integrated Core Architecture (MIC)

    PubMed Central

    2015-01-01

    Background: Short-read aligners have recently gained a lot of speed by exploiting the massive parallelism of GPUs. An emerging alternative to the GPU is Intel MIC; supercomputers such as Tianhe-2, currently top of the TOP500, are built with 48,000 MIC boards to offer ~55 PFLOPS. The CPU-like architecture of MIC allows CPU-based software to be parallelized easily; however, the performance is often inferior to GPU counterparts as an MIC card contains only ~60 cores (while a GPU card typically has over a thousand cores). Results: To better utilize MIC-enabled computers for NGS data analysis, we developed a new short-read aligner, MICA, that is optimized in view of MIC's limitation and the extra parallelism inside each MIC core. By utilizing the 512-bit vector units in the MIC and implementing a new seeding strategy, experiments on aligning 150 bp paired-end reads show that MICA using one MIC card is 4.9 times faster than BWA-MEM (using 6 cores of a top-end CPU), and slightly faster than SOAP3-dp (using a GPU). Furthermore, MICA's simplicity allows very efficient scale-up when multiple MIC cards are used in a node (3 cards give a 14.1-fold speedup over BWA-MEM). Summary: MICA can be readily used by MIC-enabled supercomputers for production purposes. We have tested MICA on Tianhe-2 with 90 WGS samples (17.47 Tera-bases), which can be aligned in an hour using 400 nodes. MICA has impressive performance even though MIC is only in its initial stage of development. Availability and implementation: MICA's source code is freely available at http://sourceforge.net/projects/mica-aligner under GPL v3. Supplementary information: Supplementary information is available as "Additional File 1". Datasets are available at www.bio8.cs.hku.hk/dataset/mica. PMID:25952019
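
    The scale-up claim can be checked with simple arithmetic on the figures quoted in the abstract. The efficiency definition below (multi-card speedup divided by the ideal linear speedup) is a common convention adopted here for illustration, and "an hour" is treated as exactly one hour.

```latex
% Figures from the abstract: S_1 = 4.9 (one card), S_3 = 14.1 (three cards),
% 17.47 Tbases aligned in about one hour on 400 nodes.
\begin{align*}
\text{scaling efficiency} &= \frac{S_3}{3\,S_1} = \frac{14.1}{3 \times 4.9} = \frac{14.1}{14.7} \approx 0.96,\\
\text{throughput per node} &\approx \frac{17.47\ \text{Tbases}}{400\ \text{nodes} \times 1\ \text{h}} \approx 43.7\ \text{Gbases per node-hour}.
\end{align*}
```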

  15. MICA: A fast short-read aligner that takes full advantage of Many Integrated Core Architecture (MIC).

    PubMed

    Luo, Ruibang; Cheung, Jeanno; Wu, Edward; Wang, Heng; Chan, Sze-Hang; Law, Wai-Chun; He, Guangzhu; Yu, Chang; Liu, Chi-Man; Zhou, Dazong; Li, Yingrui; Li, Ruiqiang; Wang, Jun; Zhu, Xiaoqian; Peng, Shaoliang; Lam, Tak-Wah

    2015-01-01

    Short-read aligners have recently gained a lot of speed by exploiting the massive parallelism of GPUs. An emerging alternative to the GPU is Intel MIC; supercomputers such as Tianhe-2, currently top of the TOP500, are built with 48,000 MIC boards to offer ~55 PFLOPS. The CPU-like architecture of MIC allows CPU-based software to be parallelized easily; however, the performance is often inferior to GPU counterparts as an MIC card contains only ~60 cores (while a GPU card typically has over a thousand cores). To better utilize MIC-enabled computers for NGS data analysis, we developed a new short-read aligner, MICA, that is optimized in view of MIC's limitation and the extra parallelism inside each MIC core. By utilizing the 512-bit vector units in the MIC and implementing a new seeding strategy, experiments on aligning 150 bp paired-end reads show that MICA using one MIC card is 4.9 times faster than BWA-MEM (using 6 cores of a top-end CPU), and slightly faster than SOAP3-dp (using a GPU). Furthermore, MICA's simplicity allows very efficient scale-up when multiple MIC cards are used in a node (3 cards give a 14.1-fold speedup over BWA-MEM). MICA can be readily used by MIC-enabled supercomputers for production purposes. We have tested MICA on Tianhe-2 with 90 WGS samples (17.47 Tera-bases), which can be aligned in an hour using 400 nodes. MICA has impressive performance even though MIC is only in its initial stage of development. MICA's source code is freely available at http://sourceforge.net/projects/mica-aligner under GPL v3. Supplementary information is available as "Additional File 1". Datasets are available at www.bio8.cs.hku.hk/dataset/mica.

  16. A Systematic Assessment of Accuracy in Detecting Somatic Mosaic Variants by Deep Amplicon Sequencing: Application to NF2 Gene

    PubMed Central

    Sestini, Roberta; Candita, Luisa; Capone, Gabriele Lorenzo; Barbetti, Lorenzo; Falconi, Serena; Frusconi, Sabrina; Giotti, Irene; Giuliani, Costanza; Torricelli, Francesca; Benelli, Matteo; Papi, Laura

    2015-01-01

    The accurate detection of low-allelic variants is still challenging, particularly for the identification of somatic mosaicism, where a matched control sample is not available. High throughput sequencing, by the simultaneous and independent analysis of thousands of different DNA fragments, might overcome many of the limits of traditional methods, greatly increasing the sensitivity. However, it is necessary to take into account the high number of false positives that may arise due to the lack of matched control samples. Here, we applied deep amplicon sequencing to the analysis of samples with known genotype and variant allele fraction (VAF) followed by a tailored statistical analysis. This method allowed us to define a minimum value of VAF for detecting mosaic variants with high accuracy. Then, we exploited the estimated VAF to select candidate alterations in the NF2 gene in 34 samples with unknown genotype (30 blood and 4 tumor DNAs), demonstrating the suitability of our method. The strategy we propose optimizes the use of deep amplicon sequencing for the identification of low abundance variants. Moreover, our method can be applied to different high throughput sequencing approaches to estimate the background noise and define the accuracy of the experimental design. PMID:26066488
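
    The abstract does not specify its statistical analysis, so the sketch below shows only one generic way a VAF cutoff and a background-noise model could be combined. It assumes a binomial error model with a known per-base background error rate; the function names, the error rate, the VAF cutoff and the significance level are all hypothetical.

```python
from math import exp, lgamma, log, log1p

def log_binom_pmf(k, n, p):
    """Log of the Binomial(n, p) probability of exactly k successes (0 < p < 1)."""
    return (lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)
            + k * log(p) + (n - k) * log1p(-p))

def background_tail_probability(alt_reads, depth, error_rate):
    """P(at least alt_reads variant reads arise from background error alone)."""
    total = sum(exp(log_binom_pmf(k, depth, error_rate))
                for k in range(alt_reads, depth + 1))
    return min(1.0, total)

def is_candidate(alt_reads, depth, error_rate, min_vaf, alpha=1e-3):
    """Flag a site when its VAF reaches the cutoff and background error
    is an unlikely explanation (assumed binomial noise model)."""
    vaf = alt_reads / depth
    return vaf >= min_vaf and background_tail_probability(alt_reads, depth, error_rate) < alpha

# Hypothetical numbers: 1000x depth, 0.1% background error, 2% VAF cutoff.
print(is_candidate(50, 1000, 0.001, 0.02))
print(is_candidate(2, 1000, 0.001, 0.02))
```

    In practice the background error rate would be estimated from control samples of known genotype, as the abstract describes, rather than fixed in advance.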

  17. Deep sequencing of immune repertoires during bovine development and in response to respiratory pathogen challenge

    USDA-ARS's Scientific Manuscript database

    Vertebrate immune systems generate diverse repertoires of antibodies capable of mediating response to a variety of antigens. Single-molecule circular consensus sequencing permits the sequencing of expressed antibody repertoires at previously unattainable depths of coverage and accuracy. We examined...

  18. Development of high-throughput SNP-based genotyping in Acacia auriculiformis x A. mangium hybrids using short-read transcriptome data

    PubMed Central

    2012-01-01

    Background: Next Generation Sequencing has provided comprehensive, affordable and high-throughput DNA sequences for Single Nucleotide Polymorphism (SNP) discovery in Acacia auriculiformis and Acacia mangium. Like other non-model species, SNP detection and genotyping in Acacia are challenging due to the lack of genome sequences. The main objective of this study is to develop the first high-throughput SNP genotyping assay for linkage map construction of A. auriculiformis x A. mangium hybrids. Results: We identified a total of 37,786 putative SNPs by aligning short read transcriptome data from four parents of two Acacia hybrid mapping populations using Bowtie against 7,839 de novo transcriptome contigs. Given a set of 10 validated SNPs from two lignin genes, our in silico SNP detection approach is highly accurate (100%) compared to the traditional in vitro approach (44%). Further validation of 96 SNPs using Illumina GoldenGate Assay gave an overall assay success rate of 89.6% and conversion rate of 37.5%. We explored possible factors lowering assay success rate by predicting exon-intron boundaries and paralogous genes of Acacia contigs using the Medicago truncatula genome as reference. This assessment revealed that the presence of an exon-intron boundary is the main cause (50%) of assay failure. Subsequent SNP filtering and improved assay design resulted in assay success and conversion rates of 92.4% and 57.4%, respectively, based on genotyping of 768 SNPs. Analysis of clustering patterns revealed that 27.6% of the assays were not reproducible and flanking sequence might play a role in determining cluster compression. In addition, we identified a total of 258 and 319 polymorphic SNPs in A. auriculiformis and A. mangium natural germplasms, respectively. Conclusion: We have successfully discovered a large number of SNP markers in A. auriculiformis x A. mangium hybrids using next generation transcriptome sequencing. By using a reference genome from the most closely related species, we

  19. Development of high-throughput SNP-based genotyping in Acacia auriculiformis x A. mangium hybrids using short-read transcriptome data.

    PubMed

    Wong, Melissa M L; Cannon, Charles H; Wickneswari, Ratnam

    2012-12-24

    Next Generation Sequencing has provided comprehensive, affordable and high-throughput DNA sequences for Single Nucleotide Polymorphism (SNP) discovery in Acacia auriculiformis and Acacia mangium. Like other non-model species, SNP detection and genotyping in Acacia are challenging due to the lack of genome sequences. The main objective of this study is to develop the first high-throughput SNP genotyping assay for linkage map construction of A. auriculiformis x A. mangium hybrids. We identified a total of 37,786 putative SNPs by aligning short read transcriptome data from four parents of two Acacia hybrid mapping populations using Bowtie against 7,839 de novo transcriptome contigs. Given a set of 10 validated SNPs from two lignin genes, our in silico SNP detection approach is highly accurate (100%) compared to the traditional in vitro approach (44%). Further validation of 96 SNPs using Illumina GoldenGate Assay gave an overall assay success rate of 89.6% and conversion rate of 37.5%. We explored possible factors lowering assay success rate by predicting exon-intron boundaries and paralogous genes of Acacia contigs using the Medicago truncatula genome as reference. This assessment revealed that the presence of an exon-intron boundary is the main cause (50%) of assay failure. Subsequent SNP filtering and improved assay design resulted in assay success and conversion rates of 92.4% and 57.4%, respectively, based on genotyping of 768 SNPs. Analysis of clustering patterns revealed that 27.6% of the assays were not reproducible and flanking sequence might play a role in determining cluster compression. In addition, we identified a total of 258 and 319 polymorphic SNPs in A. auriculiformis and A. mangium natural germplasms, respectively. We have successfully discovered a large number of SNP markers in A. auriculiformis x A. mangium hybrids using next generation transcriptome sequencing. By using a reference genome from the most closely related species, we converted most SNPs to successful

  20. Metavisitor, a Suite of Galaxy Tools for Simple and Rapid Detection and Discovery of Viruses in Deep Sequence Data

    PubMed Central

    Vernick, Kenneth D.

    2017-01-01

    Metavisitor is a software package that allows biologists and clinicians without specialized bioinformatics expertise to detect and assemble viral genomes from deep sequence datasets. The package is composed of a set of modular bioinformatic tools and workflows that are implemented in the Galaxy framework. Using the graphical Galaxy workflow editor, users with minimal computational skills can use existing Metavisitor workflows or adapt them to suit specific needs by adding or modifying analysis modules. Metavisitor works with DNA, RNA or small RNA sequencing data over a range of read lengths and can use a combination of de novo and guided approaches to assemble genomes from sequencing reads. We show that the software has the potential for quick diagnosis as well as discovery of viruses from a vast array of organisms. Importantly, we provide here executable Metavisitor use cases, which increase the accessibility and transparency of the software, ultimately enabling biologists or clinicians to focus on biological or medical questions. PMID:28045932

  1. Deep sequencing unearths nuclear mitochondrial sequences under Leber's hereditary optic neuropathy-associated false heteroplasmic mitochondrial DNA variants.

    PubMed

    Petruzzella, Vittoria; Carrozzo, Rosalba; Calabrese, Claudia; Dell'Aglio, Rosa; Trentadue, Raffaella; Piredda, Roberta; Artuso, Lucia; Rizza, Teresa; Bianchi, Marzia; Porcelli, Anna Maria; Guerriero, Silvana; Gasparre, Giuseppe; Attimonelli, Marcella

    2012-09-01

    Leber's hereditary optic neuropathy (LHON) is associated with mitochondrial DNA (mtDNA) ND mutations that are mostly homoplasmic. However, these mutations are not sufficient to explain the peculiar features of penetrance and the tissue-specific expression of the disease and are believed to be causative in association with unknown environmental or other genetic factors. Discerning between clear-cut pathogenetic variants, such as those that appear to be heteroplasmic, and less penetrant variants, such as the homoplasmic, remains a challenging issue that we have addressed here using a next-generation sequencing approach. We set up a protocol to quantify MTND5 heteroplasmy levels in a family in which the proband manifests a LHON phenotype. Furthermore, to study this mtDNA haplotype, we applied the cybridization protocol. The results demonstrate that the mutations are mostly homoplasmic, whereas the suspected heteroplasmic feature of the observed mutations is due to the co-amplification of Nuclear mitochondrial Sequences.

  2. Natural variation in Brachypodium distachyon: Deep Sequencing of Highly Diverse Natural Accessions (2013 DOE JGI Genomics of Energy and Environment 8th Annual User Meeting)

    SciTech Connect

    Gordon, Sean

    2013-03-01

    Sean Gordon of the USDA on "Natural variation in Brachypodium distachyon: Deep Sequencing of Highly Diverse Natural Accessions" at the 8th Annual Genomics of Energy & Environment Meeting on March 27, 2013 in Walnut Creek, Calif.

  3. A kinetic model-based algorithm to classify NGS short reads by their allele origin.

    PubMed

    Marinoni, Andrea; Rizzo, Ettore; Limongelli, Ivan; Gamba, Paolo; Bellazzi, Riccardo

    2015-02-01

    Genotyping Next Generation Sequencing (NGS) data of a diploid genome aims to assign the zygosity of identified variants through comparison with a reference genome. Current methods typically employ probabilistic models that rely on the pileup of bases at each locus and on a priori knowledge. We present a new algorithm, called Kimimila (KInetic Modeling based on InforMation theory to Infer Labels of Alleles), which is able to assign reads to alleles by using a distance geometry approach and to infer the variant genotypes accurately, without any kind of assumption. The performance of the model has been assessed on simulated and real data of the 1000 Genomes Project and the results have been compared with several commonly used genotyping methods, i.e., GATK, Samtools, VarScan, FreeBayes and Atlas2. Although our algorithm does not make use of a priori knowledge, the percentage of correctly genotyped variants is comparable to these algorithms. Furthermore, our method allows the user to split the read pool depending on the inferred allele origin.

  4. HPV Population Profiling in Healthy Men by Next-Generation Deep Sequencing Coupled with HPV-QUEST

    PubMed Central

    Yin, Li; Yao, Jin; Chang, Kaifen; Gardner, Brent P.; Yu, Fahong; Giuliano, Anna R.; Goodenow, Maureen M.

    2016-01-01

    Multiple-type human papillomavirus (HPV) infection presents a greater risk for persistence in asymptomatic individuals and may accelerate cancer development. To extend the scope of HPV types defined by probe-based assays, multiplexing deep sequencing of HPV L1, coupled with an HPV-QUEST genotyping server and a bioinformatic pipeline, was established and applied to survey the diversity of HPV genotypes among a subset of healthy men from the HPV in Men (HIM) Multinational Study. Twenty-one HPV genotypes (12 high-risk and 9 low-risk) were detected in the genital area from 18 asymptomatic individuals. A single HPV type, either HPV16, HPV6b or HPV83, was detected in 7 individuals, while coinfection by 2 to 5 high-risk and/or low-risk genotypes was identified in the other 11 participants. In two individuals studied for over one year, HPV16 persisted, while fluctuations of coinfecting genotypes occurred. HPV L1 regions were generally identical between query and reference sequences, although nonsynonymous and synonymous nucleotide polymorphisms of HPV16, 18, 31, 35h, 59, 70, 73, cand85, 6b, 62, 81, 83, cand89 or JEB2 L1 genotypes, mostly unidentified by linear array, were evident. Deep sequencing coupled with HPV-QUEST provides efficient and unambiguous classification of HPV genotypes in multiple-type HPV infection in host ecosystems. PMID:26821041

  5. Deep sequencing detects very-low-grade somatic mosaicism in the unaffected mother of siblings with nemaline myopathy.

    PubMed

    Miyatake, Satoko; Koshimizu, Eriko; Hayashi, Yukiko K; Miya, Kazushi; Shiina, Masaaki; Nakashima, Mitsuko; Tsurusaki, Yoshinori; Miyake, Noriko; Saitsu, Hirotomo; Ogata, Kazuhiro; Nishino, Ichizo; Matsumoto, Naomichi

    2014-07-01

    When an expected mutation in a particular disease-causing gene is not identified in a suspected carrier, it is usually assumed to be due to germline mosaicism. We report here very-low-grade somatic mosaicism in ACTA1 in an unaffected mother of two siblings affected with a neonatal form of nemaline myopathy. The mosaicism was detected by deep resequencing using a next-generation sequencer. We identified a novel heterozygous mutation in ACTA1, c.448A>G (p.Thr150Ala), in the affected siblings. Three-dimensional structural modeling suggested that this mutation may affect polymerization and/or actin's interactions with other proteins. In this family, we expected autosomal dominant inheritance with either parent demonstrating germline or somatic mosaicism. Sanger sequencing identified no mutation. However, further deep resequencing of this mutation on a next-generation sequencer identified very-low-grade somatic mosaicism in the mother: 0.4%, 1.1%, and 8.3% in the saliva, blood leukocytes, and nails, respectively. Our study demonstrates the possibility of very-low-grade somatic mosaicism in suspected carriers, rather than germline mosaicism. Copyright © 2014 Elsevier B.V. All rights reserved.

  6. HPV Population Profiling in Healthy Men by Next-Generation Deep Sequencing Coupled with HPV-QUEST.

    PubMed

    Yin, Li; Yao, Jin; Chang, Kaifen; Gardner, Brent P; Yu, Fahong; Giuliano, Anna R; Goodenow, Maureen M

    2016-01-25

    Multiple-type human papillomavirus (HPV) infection presents a greater risk for persistence in asymptomatic individuals and may accelerate cancer development. To extend the scope of HPV types defined by probe-based assays, multiplexing deep sequencing of HPV L1, coupled with an HPV-QUEST genotyping server and a bioinformatic pipeline, was established and applied to survey the diversity of HPV genotypes among a subset of healthy men from the HPV in Men (HIM) Multinational Study. Twenty-one HPV genotypes (12 high-risk and 9 low-risk) were detected in the genital area from 18 asymptomatic individuals. A single HPV type, either HPV16, HPV6b or HPV83, was detected in 7 individuals, while coinfection by 2 to 5 high-risk and/or low-risk genotypes was identified in the other 11 participants. In two individuals studied for over one year, HPV16 persisted, while fluctuations of coinfecting genotypes occurred. HPV L1 regions were generally identical between query and reference sequences, although nonsynonymous and synonymous nucleotide polymorphisms of HPV16, 18, 31, 35h, 59, 70, 73, cand85, 6b, 62, 81, 83, cand89 or JEB2 L1 genotypes, mostly unidentified by linear array, were evident. Deep sequencing coupled with HPV-QUEST provides efficient and unambiguous classification of HPV genotypes in multiple-type HPV infection in host ecosystems.

  7. Heavy-light chain interrelations of MS-associated immunoglobulins probed by deep sequencing and rational variation.

    PubMed

    Lomakin, Yakov A; Zakharova, Maria Yu; Stepanov, Alexey V; Dronina, Maria A; Smirnov, Ivan V; Bobik, Tatyana V; Pyrkov, Andrey Yu; Tikunova, Nina V; Sharanova, Svetlana N; Boitsov, Vitali M; Vyazmin, Sergey Yu; Kabilov, Marsel R; Tupikin, Alexey E; Krasnov, Alexey N; Bykova, Nadezda A; Medvedeva, Yulia A; Fridman, Marina V; Favorov, Alexander V; Ponomarenko, Natalia A; Dubina, Michael V; Boyko, Alexey N; Vlassov, Valentin V; Belogurov, Alexey A; Gabibov, Alexander G

    2014-12-01

    The mechanisms triggering most autoimmune diseases are still obscure. Autoreactive B cells play a crucial role in the development of such pathologies and, in particular, production of autoantibodies of different specificities. The combination of deep-sequencing technology with functional studies of antibodies selected from highly representative immunoglobulin combinatorial libraries may provide unique information on specific features in the repertoires of autoreactive B cells. Here, we have analyzed cross-combinations of the variable regions of human immunoglobulins against the myelin basic protein (MBP) previously selected from a multiple sclerosis (MS)-related scFv phage-display library. On the other hand, we have performed deep sequencing of the sublibraries of scFvs against MBP, Epstein-Barr virus (EBV) latent membrane protein 1 (LMP1), and myelin oligodendrocyte glycoprotein (MOG). Bioinformatics analysis of sequencing data and surface plasmon resonance (SPR) studies have shown that it is the variable fragments of antibody heavy chains that mainly determine both the affinity of antibodies to the parent autoantigen and their cross-reactivity. It is suggested that LMP1-cross-reactive anti-myelin autoantibodies contain heavy chains encoded by certain germline gene segments, which may be a hallmark of the EBV-specific B cell subpopulation involved in MS triggering.

  8. Evolution of the ISM in main-sequence versus starburst galaxies: A motivation for molecular deep fields

    NASA Astrophysics Data System (ADS)

    Aravena, Manuel

    In the last decade, significant progress has been made to understand the evolution with redshift of star formation processes in galaxies. It is now clear that the majority of galaxies at z<3 form a nearly linear correlation between their stellar mass and star formation rates and appear to create most of their stars on timescales of ~1 Gyr. At the highest luminosities, a significant fraction of galaxies deviate from this main sequence, showing short duty cycles and thus producing most of their stars in a single burst of star formation within ~100 Myr, being likely driven by major merger activity. Despite the large luminosities of starbursts, main-sequence galaxies appear to dominate the star formation density of the Universe at its peak. While progress has been impressive, a number of questions are still unanswered. In this paper, I briefly review our current observational understanding of this main-sequence vs starburst galaxy paradigm, and address how future observations will help us to have better insights into the fundamental properties of the interstellar medium of these galaxies. Finally, I show recent attempts to conduct molecular deep field observations and the motivation to perform molecular deep field spectroscopy with the Atacama Large Millimeter/submillimeter Array.

  9. Genomic Analysis by Deep Sequencing of the Probiotic Lactobacillus brevis KB290 Harboring Nine Plasmids Reveals Genomic Stability

    PubMed Central

    Fukao, Masanori; Oshima, Kenshiro; Morita, Hidetoshi; Toh, Hidehiro; Suda, Wataru; Kim, Seok-Won; Suzuki, Shigenori; Yakabe, Takafumi; Hattori, Masahira; Yajima, Nobuhiro

    2013-01-01

    We determined the complete genome sequence of Lactobacillus brevis KB290, a probiotic lactic acid bacterium isolated from a traditional Japanese fermented vegetable. The genome contained a 2,395,134-bp chromosome that housed 2,391 protein-coding genes and nine plasmids that together accounted for 191 protein-coding genes. KB290 contained no virulence factor genes, and several genes related to presumptive cell wall-associated polysaccharide biosynthesis and the stress response were present in L. brevis KB290 but not in the closely related L. brevis ATCC 367. Plasmid-curing experiments revealed that the presence of plasmid pKB290-1 was essential for the strain's gastrointestinal tract tolerance and tendency to aggregate. Using next-generation deep sequencing of current and 18-year-old stock strains to detect low frequency variants, we evaluated genome stability. Deep sequencing of four periodic KB290 culture stocks with more than 1,000-fold coverage revealed 3 mutation sites and 37 minority variation sites, indicating long-term stability and providing a useful method for assessing the stability of industrial bacteria at the nucleotide level. PMID:23544154

  10. Comparison of Illumina and 454 Deep Sequencing in Participants Failing Raltegravir-Based Antiretroviral Therapy

    PubMed Central

    Li, Jonathan Z.; Chapman, Brad; Charlebois, Patrick; Hofmann, Oliver; Weiner, Brian; Porter, Alyssa J.; Samuel, Reshmi; Vardhanabhuti, Saran; Zheng, Lu; Eron, Joseph; Taiwo, Babafemi; Zody, Michael C.; Henn, Matthew R.; Kuritzkes, Daniel R.; Hide, Winston; Wilson, Cara C.; Berzins, Baiba I.; Acosta, Edward P.; Bastow, Barbara; Kim, Peter S.; Read, Sarah W.; Janik, Jennifer; Meres, Debra S.; Lederman, Michael M.; Mong-Kryspin, Lori; Shaw, Karl E.; Zimmerman, Louis G.; Leavitt, Randi; De La Rosa, Guy; Jennings, Amy

    2014-01-01

    Background: The impact of raltegravir-resistant HIV-1 minority variants (MVs) on raltegravir treatment failure is unknown. Illumina sequencing offers greater throughput than 454, but sequence analysis tools for viral sequencing are needed. We evaluated Illumina and 454 for the detection of HIV-1 raltegravir-resistant MVs. Methods: A5262 was a single-arm study of raltegravir and darunavir/ritonavir in treatment-naïve patients. Pre-treatment plasma was obtained from 5 participants with raltegravir resistance at the time of virologic failure. A control library was created by pooling integrase clones at predefined proportions. Multiplexed sequencing was performed with Illumina and 454 platforms at comparable costs. Illumina sequence analysis was performed with the novel snp-assess tool and 454 sequencing was analyzed with V-Phaser. Results: Illumina sequencing resulted in significantly higher sequence coverage and a 0.095% limit of detection. Illumina accurately detected all MVs in the control library at ≥0.5% and 7/10 MVs expected at 0.1%. 454 sequencing failed to detect any MVs at 0.1% with 5 false positive calls. For MVs detected in the patient samples by both 454 and Illumina, the correlation in the detected variant frequencies was high (R² = 0.92, P<0.001). Illumina sequencing detected 2.4-fold greater nucleotide MVs and 2.9-fold greater amino acid MVs compared to 454. The only raltegravir-resistant MV detected was an E138K mutation in one participant by Illumina sequencing, but not by 454. Conclusions: In participants of A5262 with raltegravir resistance at virologic failure, baseline raltegravir-resistant MVs were rarely detected. At comparable costs to 454 sequencing, Illumina demonstrated greater depth of coverage, increased sensitivity for detecting HIV MVs, and fewer false positive variant calls. PMID:24603872
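
    One way to read a limit of detection such as the 0.095% reported above is as the variant frequency at which observed variant reads can no longer be separated from background sequencing error. The sketch below illustrates that idea with a one-sided binomial test. It is not the snp-assess or V-Phaser method; the function names, the per-base error rate (0.001) and the significance cutoff (1e-3) are all assumptions chosen for illustration.

```python
import math


def binom_tail(k: int, n: int, p: float) -> float:
    """Return P(X >= k) for X ~ Binomial(n, p), computed in log space."""
    if k <= 0:
        return 1.0
    cdf = 0.0
    for i in range(k):
        # log of the binomial probability mass at i, via lgamma to avoid overflow
        log_pmf = (
            math.lgamma(n + 1)
            - math.lgamma(i + 1)
            - math.lgamma(n - i + 1)
            + i * math.log(p)
            + (n - i) * math.log1p(-p)
        )
        cdf += math.exp(log_pmf)
    return max(0.0, 1.0 - cdf)


def is_minority_variant(alt_reads: int, depth: int,
                        error_rate: float = 0.001,  # assumed background error
                        alpha: float = 1e-3) -> bool:  # assumed cutoff
    """Call a variant if its read count is unlikely under error alone."""
    return binom_tail(alt_reads, depth, error_rate) < alpha


if __name__ == "__main__":
    depth = 10_000
    for alt in (10, 50):
        freq = 100.0 * alt / depth
        print(f"{alt} reads ({freq:.2f}%): call = {is_minority_variant(alt, depth)}")
```

    Under these assumed settings, a variant at 0.5% of 10,000 reads stands well clear of the error background, while one at 0.1% is indistinguishable from it. Calling variants at lower frequencies requires a lower effective error rate, greater depth, or both.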

  11. Comparison of illumina and 454 deep sequencing in participants failing raltegravir-based antiretroviral therapy.

    PubMed

    Li, Jonathan Z; Chapman, Brad; Charlebois, Patrick; Hofmann, Oliver; Weiner, Brian; Porter, Alyssa J; Samuel, Reshmi; Vardhanabhuti, Saran; Zheng, Lu; Eron, Joseph; Taiwo, Babafemi; Zody, Michael C; Henn, Matthew R; Kuritzkes, Daniel R; Hide, Winston; Wilson, Cara C; Berzins, Baiba I; Acosta, Edward P; Bastow, Barbara; Kim, Peter S; Read, Sarah W; Janik, Jennifer; Meres, Debra S; Lederman, Michael M; Mong-Kryspin, Lori; Shaw, Karl E; Zimmerman, Louis G; Leavitt, Randi; De La Rosa, Guy; Jennings, Amy

    2014-01-01

    The impact of raltegravir-resistant HIV-1 minority variants (MVs) on raltegravir treatment failure is unknown. Illumina sequencing offers greater throughput than 454, but sequence analysis tools for viral sequencing are needed. We evaluated Illumina and 454 for the detection of HIV-1 raltegravir-resistant MVs. A5262 was a single-arm study of raltegravir and darunavir/ritonavir in treatment-naïve patients. Pre-treatment plasma was obtained from 5 participants with raltegravir resistance at the time of virologic failure. A control library was created by pooling integrase clones at predefined proportions. Multiplexed sequencing was performed with Illumina and 454 platforms at comparable costs. Illumina sequence analysis was performed with the novel snp-assess tool and 454 sequencing was analyzed with V-Phaser. Illumina sequencing resulted in significantly higher sequence coverage and a 0.095% limit of detection. Illumina accurately detected all MVs in the control library at ≥0.5% and 7/10 MVs expected at 0.1%. 454 sequencing failed to detect any MVs at 0.1% with 5 false positive calls. For MVs detected in the patient samples by both 454 and Illumina, the correlation in the detected variant frequencies was high (R2 = 0.92, P<0.001). Illumina sequencing detected 2.4-fold greater nucleotide MVs and 2.9-fold greater amino acid MVs compared to 454. The only raltegravir-resistant MV detected was an E138K mutation in one participant by Illumina sequencing, but not by 454. In participants of A5262 with raltegravir resistance at virologic failure, baseline raltegravir-resistant MVs were rarely detected. At comparable costs to 454 sequencing, Illumina demonstrated greater depth of coverage, increased sensitivity for detecting HIV MVs, and fewer false positive variant calls.

  12. Draft Genome Sequence of the Deep-Sea Bacterium Moritella sp. JT01 and Identification of Biotechnologically Relevant Genes.

    PubMed

    Freitas, Robert Cardoso de; Odisi, Estácio Jussie; Kato, Chiaki; da Silva, Marcus Adonai Castro; Lima, André Oliveira de Souza

    2017-07-22

    Deep-sea bacteria can produce various biotechnologically relevant enzymes due to their adaptations to high pressures and low temperatures. To identify such enzymes, we have sequenced the genome of the polycaprolactone-degrading bacterium Moritella sp. JT01, isolated from sediment samples from Japan Trench (6957 m depth), using an Illumina HiSeq2000 sequencer (12.1 million paired-end reads) and CLC Genomics Workbench (version 6.5.1) for the assembly, resulting in a 4.83-Mb genome (42 scaffolds). The genome was annotated using Rapid Annotation using Subsystem Technology (RAST), Protein Homology/analogY Recognition Engine V 2.0 (PHYRE2), and BLAST2Go, revealing 4439 protein coding sequences and 101 RNAs. Gene products with industrial relevance, such as lipases (three) and esterases (four), were identified and are related to the bacterium's ability to degrade polycaprolactone. The annotation revealed proteins related to deep-sea survival, such as cold-shock proteins (six) and desaturases (three). The presence of secondary metabolite biosynthetic gene clusters suggests that this bacterium could produce nonribosomal peptides, polyunsaturated fatty acids, and bacteriocins. To demonstrate the potential of this genome, a lipase was cloned and introduced into Escherichia coli. The lipase was purified and characterized, showing activity over a wide temperature range (over 50% at 20-60 °C) and pH range (over 80% at pH 6.3 to 9). This enzyme has tolerance to the surfactant action of sodium dodecyl sulfate and shows 30% increased activity when subjected to a working pressure of 200 MPa. The genomic characterization of Moritella sp. JT01 reveals traits associated with survival in the deep-sea and their potential uses in biotechnology, as exemplified by the characterized lipase.

  13. Genetic Heterogeneity of Hepatitis C Virus in Association with Antiviral Therapy Determined by Ultra-Deep Sequencing

    PubMed Central

    Nasu, Akihiro; Marusawa, Hiroyuki; Ueda, Yoshihide; Nishijima, Norihiro; Takahashi, Ken; Osaki, Yukio; Yamashita, Yukitaka; Inokuma, Tetsuro; Tamada, Takashi; Fujiwara, Takeshi; Sato, Fumiaki; Shimizu, Kazuharu; Chiba, Tsutomu

    2011-01-01

    Background and Aims: The hepatitis C virus (HCV) invariably shows wide heterogeneity in infected patients, referred to as a quasispecies population. Massive amounts of genetic information due to the abundance of HCV variants could be an obstacle to evaluate the viral genetic heterogeneity in detail. Methods: Using a newly developed massive-parallel ultra-deep sequencing technique, we investigated the viral genetic heterogeneity in 27 chronic hepatitis C patients receiving peg-interferon (IFN) α2b plus ribavirin therapy. Results: Ultra-deep sequencing determined a total of more than 10 million nucleotides of the HCV genome, corresponding to a mean of more than 1000 clones in each specimen, and unveiled extremely high genetic heterogeneity in the genotype 1b HCV population. There was no significant difference in the level of viral complexity between immediate virologic responders and non-responders at baseline (p = 0.39). Immediate virologic responders (n = 8) showed a significant reduction in the genetic complexity spanning all the viral genetic regions at the early phase of IFN administration (p = 0.037). In contrast, non-virologic responders (n = 8) showed no significant changes in the level of viral quasispecies (p = 0.12), indicating that very few viral clones are sensitive to IFN treatment. We also demonstrated that clones resistant to direct-acting antivirals for HCV, such as viral protease and polymerase inhibitors, preexist with various abundances in all 27 treatment-naïve patients, suggesting the risk of the development of drug resistance against these agents. Conclusion: Use of the ultra-deep sequencing technology revealed massive genetic heterogeneity of HCV, which has important implications regarding the treatment response and outcome of antiviral therapy. PMID:21966381

  14. Sequence boundaries in uppermost Proterozoic mixed siliciclastic-carbonate rocks: Deep Spring Formation, southern Basin and Range

    SciTech Connect

    Parsons, S.M.; Rees, M.N. (Geosciences Dept.)

    1993-04-01

    The authors propose that a sequence boundary lies at the top of the Reed Dolomite and another at the top of the lower member of the overlying Deep Spring Formation. These boundaries should be useful in correlating critical pre-trilobite Neoproterozoic rocks across the southern Basin and Range Province. Furthermore, the mixed siliciclastic-carbonate rocks between these boundaries reflect an intimate interplay between subsidence, sea-level change and the different rates at which siliciclastic and carbonate sediments accumulate. The Type 2 sequence boundary at the top of the Reed Dolomite is marked in outcrop near Bishop, California by minor channelization and dissolution surfaces that resulted from subaerial exposure of the carbonate platform. This sea level low stand is recorded in the lower Deep Spring Formation, 150 km northwest, by carbonate sediment-gravity-flow deposits. With initiation of transgression, siliciclastics buried the eroded platform and carbonate sedimentation continued in the northwest. As sea level continued to rise, carbonate deposition occurred across the region. Time of maximum flooding is represented by lagoonal deposits in the southeast and a condensed section to the northwest. The condensed section is characterized by dolomitized limestones containing glauconite and small shelly fossils that are overlain by thinly interbedded shales and siltstones with rare trace fossils. The slower rate of siliciclastic deposition on the rapidly subsiding shelf produced an increase in accommodation space resulting in development of an ooid shoal to the southeast. To the northwest, however, continued submarine deposition produced thinly interbedded limestone turbidites and shales. Ooid accumulation outpaced subsidence and together with sea level fall resulted in extensive subaerial exposure of the oolite. Thus, the top of the lower member of the Deep Spring Formation represents the second Type 2 sequence boundary.

  15. Ultrasensitive measurement of hotspot mutations in tumor DNA in blood using error-suppressed multiplexed deep sequencing.

    PubMed

    Narayan, Azeet; Carriero, Nicholas J; Gettinger, Scott N; Kluytenaar, Jeannie; Kozak, Kevin R; Yock, Torunn I; Muscato, Nicole E; Ugarelli, Pedro; Decker, Roy H; Patel, Abhijit A

    2012-07-15

    Detection of cell-free tumor DNA in the blood has offered promise as a cancer biomarker, but practical clinical implementations have been impeded by the lack of a sensitive and accurate method for quantitation that is also simple, inexpensive, and readily scalable. Here we present an approach that uses next-generation sequencing to quantify the small fraction of DNA molecules that contain tumor-specific mutations within a background of normal DNA in plasma. Using layers of sequence redundancy designed to distinguish true mutations from sequencer misreads and PCR misincorporations, we achieved a detection sensitivity of approximately 1 variant in 5,000 molecules. In addition, the attachment of modular barcode tags to the DNA fragments to be sequenced facilitated the simultaneous analysis of more than 100 patient samples. As proof-of-principle, we showed the successful use of this method to follow treatment-associated changes in circulating tumor DNA levels in patients with non-small cell lung cancer. Our findings suggest that the deep sequencing approach described here may be applied to the development of a practical diagnostic test that measures tumor-derived DNA levels in blood.

  16. MaxSSmap: a GPU program for mapping divergent short reads to genomes with the maximum scoring subsequence.

    PubMed

    Turki, Turki; Roshan, Usman

    2014-11-15

    Programs based on hash tables and the Burrows-Wheeler transform are very fast for mapping short reads to genomes but have low accuracy in the presence of mismatches and gaps. Such reads can be aligned accurately with the Smith-Waterman algorithm, but it can take hours or days to map millions of reads even for bacterial genomes. We introduce a GPU program called MaxSSmap with the aim of achieving comparable accuracy to Smith-Waterman but with faster runtimes. Similar to most programs, MaxSSmap identifies a local region of the genome followed by exact alignment. Instead of using hash tables or Burrows-Wheeler in the first part, MaxSSmap calculates the maximum scoring subsequence score between the read and disjoint fragments of the genome in parallel on a GPU and selects the highest scoring fragment for exact alignment. We evaluate MaxSSmap's accuracy and runtime when mapping simulated Illumina E. coli and human chromosome one reads of different lengths and 10% to 30% mismatches with gaps to the E. coli genome and human chromosome one. We also demonstrate applications on real data by mapping ancient horse DNA reads to modern genomes and unmapped paired reads from NA12878 in 1000 genomes. We show that MaxSSmap attains comparable high accuracy and low error to fast Smith-Waterman programs yet has much lower runtimes. We show that MaxSSmap can map reads rejected by BWA and NextGenMap with high accuracy and low error much faster than if Smith-Waterman were used. On short read lengths of 36 and 51, both MaxSSmap and Smith-Waterman have lower accuracy than at longer lengths. On real data MaxSSmap produces many alignments with high score and mapping quality that are not given by NextGenMap and BWA. The MaxSSmap source code in CUDA and OpenCL is freely available from http://www.cs.njit.edu/usman/MaxSSmap.
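
    The fragment-selection step described in this abstract can be illustrated with a small CPU-only sketch. It is not the MaxSSmap implementation, which runs in CUDA/OpenCL on a GPU. The function names, the +1/-1 match and mismatch scores, and the choice of fragments equal in length to the read are all assumptions made for illustration. The maximum scoring subsequence is computed with the standard linear-time running-sum scan.

```python
from typing import Tuple


def max_scoring_subsequence(read: str, fragment: str,
                            match: int = 1, mismatch: int = -1) -> int:
    """Best score of any contiguous run of aligned positions (no gaps).

    Positions are compared one-to-one; the scores are assumed values.
    """
    best = 0
    running = 0
    for a, b in zip(read, fragment):
        running += match if a == b else mismatch
        if running < 0:
            running = 0  # a negative prefix never helps; restart the run
        if running > best:
            best = running
    return best


def best_fragment(read: str, genome: str) -> Tuple[int, int]:
    """Score the read against disjoint read-length fragments of the genome.

    Returns (fragment start, score) for the highest scoring fragment.
    On a GPU each fragment could be scored in parallel; here it is a loop.
    """
    n = len(read)
    best_start, best_score = -1, -1
    for start in range(0, len(genome) - n + 1, n):
        score = max_scoring_subsequence(read, genome[start:start + n])
        if score > best_score:
            best_start, best_score = start, score
    return best_start, best_score


if __name__ == "__main__":
    genome = "AAAAAAAA" + "ACGTACGT" + "TTTTTTTT"
    read = "ACGTTCGT"  # one mismatch relative to the middle fragment
    print(best_fragment(read, genome))
```

    In a full mapper the selected fragment would then be passed to an exact aligner such as Smith-Waterman, as the abstract describes. Because the scan tolerates isolated mismatches inside a long matching run, it can rank the correct fragment first even for divergent reads.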

  17. Ultra-deep sequencing leads to earlier and more sensitive detection of the tyrosine kinase inhibitor resistance mutation T315I in chronic myeloid leukemia

    PubMed Central

    Baer, Constance; Kern, Wolfgang; Koch, Sarah; Nadarajah, Niroshan; Schindela, Sonja; Meggendorfer, Manja; Haferlach, Claudia; Haferlach, Torsten

    2016-01-01

    Chronic myeloid leukemia cells acquire resistance to tyrosine kinase inhibitors through mutations in the ABL1 kinase domain. The T315I mutation mediates resistance to imatinib, dasatinib, nilotinib and bosutinib, whereas sensitivity to ponatinib remains. Mutation detection by conventional Sanger sequencing requires 10%–20% expansion of the mutated subclone. We studied the T315I mutation development by ultra-deep sequencing on the 454 XL+ platform (Roche) in comparison to Sanger sequencing. By ultra-deep sequencing, mutations were detected at loads of 1%–2%. We selected 40 patients who had failed first-line to third-line treatment (imatinib, dasatinib, nilotinib) and had high loads of the T315I mutation detected by Sanger sequencing. We confirmed T315I mutations by ultra-deep sequencing and investigated the mutation dynamics by backtracking earlier samples. In 20 of 40 patients, we identified the T315I three months (median) before Sanger sequencing detection limits were reached. To exclude sporadic low percentage mutation development without subsequent mutation outgrowth, we selected 42 patients without resistance mutations detected by Sanger sequencing but loss of major molecular response. Here, no mutation was detected by ultradeep sequencing. Additional non-T315I resistance mutations were found in 20 of 40 patients. Only 15% had two mutations per cell; the other cases showed multiple independently mutated clones and the T315I clone demonstrated a rapid outgrowth. In conclusion, T315I mutations could be detected earlier by ultra-deep sequencing compared to Sanger sequencing in a selected group of cases. Earlier mutation detection by ultra-deep sequencing might allow treatment to be changed before clonal increase of cells with the T315I mutation. PMID:27102501

  18. Ultra-deep sequencing leads to earlier and more sensitive detection of the tyrosine kinase inhibitor resistance mutation T315I in chronic myeloid leukemia.

    PubMed

    Baer, Constance; Kern, Wolfgang; Koch, Sarah; Nadarajah, Niroshan; Schindela, Sonja; Meggendorfer, Manja; Haferlach, Claudia; Haferlach, Torsten

    2016-07-01

    Chronic myeloid leukemia cells acquire resistance to tyrosine kinase inhibitors through mutations in the ABL1 kinase domain. The T315I mutation mediates resistance to imatinib, dasatinib, nilotinib and bosutinib, whereas sensitivity to ponatinib remains. Mutation detection by conventional Sanger sequencing requires 10%-20% expansion of the mutated subclone. We studied the T315I mutation development by ultra-deep sequencing on the 454 XL+ platform (Roche) in comparison to Sanger sequencing. By ultra-deep sequencing, mutations were detected at loads of 1%-2%. We selected 40 patients who had failed first-line to third-line treatment (imatinib, dasatinib, nilotinib) and had high loads of the T315I mutation detected by Sanger sequencing. We confirmed T315I mutations by ultra-deep sequencing and investigated the mutation dynamics by backtracking earlier samples. In 20 of 40 patients, we identified the T315I three months (median) before Sanger sequencing detection limits were reached. To exclude sporadic low percentage mutation development without subsequent mutation outgrowth, we selected 42 patients without resistance mutations detected by Sanger sequencing but loss of major molecular response. Here, no mutation was detected by ultradeep sequencing. Additional non-T315I resistance mutations were found in 20 of 40 patients. Only 15% had two mutations per cell; the other cases showed multiple independently mutated clones and the T315I clone demonstrated a rapid outgrowth. In conclusion, T315I mutations could be detected earlier by ultra-deep sequencing compared to Sanger sequencing in a selected group of cases. Earlier mutation detection by ultra-deep sequencing might allow treatment to be changed before clonal increase of cells with the T315I mutation.

  19. Characterization of the Genomic Diversity of Norovirus in Linked Patients Using a Metagenomic Deep Sequencing Approach

    PubMed Central

    Nasheri, Neda; Petronella, Nicholas; Ronholm, Jennifer; Bidawid, Sabah; Corneau, Nathalie

    2017-01-01

    Norovirus (NoV) is the leading cause of gastroenteritis worldwide. A robust cell culture system does not exist for NoV and therefore detailed characterization of outbreak and sporadic strains relies on molecular techniques. In this study, we employed a metagenomic approach that uses non-specific amplification followed by next-generation sequencing to whole genome sequence NoV genomes directly from clinical samples obtained from 8 linked patients. Enough sequencing depth was obtained for each sample to use a de novo assembly of near-complete genome sequences. The resultant consensus sequences were then used to identify inter-host nucleotide variations that occur after direct transmission, analyze amino acid variations in the major capsid protein, and provide evidence of recombination events. The analysis of intra-host quasispecies diversity was possible due to high coverage-depth. We also observed a linear relationship between NoV viral load in the clinical sample and the number of sequence reads that could be attributed to NoV. The method demonstrated here has the potential for future use in whole genome sequence analyses of other RNA viruses isolated from clinical, environmental, and food specimens. PMID:28197136

  20. Using deep RNA sequencing for the structural annotation of the Laccaria bicolor mycorrhizal transcriptome.

    SciTech Connect

    Larsen, P. E.; Trivedi, G.; Sreedasyam, A.; Lu, V.; Podila, G. K.; Collart, F. R.; Biosciences Division; Univ. of Alabama

    2010-07-06

    Accurate structural annotation is important for prediction of function and required for in vitro approaches to characterize or validate the gene expression products. Despite significant efforts in the field, determination of the gene structure from genomic data alone is a challenging and inaccurate process. The ease of acquisition of transcriptomic sequence provides a direct route to identify expressed sequences and determine the correct gene structure. We developed methods to utilize RNA-seq data to correct errors in the structural annotation and extend the boundaries of current gene models using assembly approaches. The methods were validated with a transcriptomic data set derived from the fungus Laccaria bicolor, which develops a mycorrhizal symbiotic association with the roots of many tree species. Our analysis focused on the subset of 1501 gene models that are differentially expressed in the free living vs. mycorrhizal transcriptome and are expected to be important elements related to carbon metabolism, membrane permeability and transport, and intracellular signaling. Of the set of 1501 gene models, 1439 (96%) successfully generated modified gene models in which all error flags were successfully resolved and the sequences aligned to the genomic sequence. The remaining 4% (62 gene models) either had deviations from transcriptomic data that could not be spanned or generated sequence that did not align to genomic sequence. The outcome of this process is a set of high confidence gene models that can be reliably used for experimental characterization of protein function. 69% of expressed mycorrhizal JGI 'best' gene models deviated from the transcript sequence derived by this method. The transcriptomic sequence enabled correction of a majority of the structural inconsistencies and resulted in a set of validated models for 96% of the mycorrhizal genes. 
The method described here can be applied to improve gene structural annotation in other species, provided that there

  1. Deep HST Imaging in 47 Tuc and NGC 6397: Main Sequence Turnoff Ages

    NASA Astrophysics Data System (ADS)

    Dotter, Aaron L.; Anderson, J.; Fahlman, G.; Hansen, B.; Hurley, J.; Kalirai, J.; King, I.; Reitzel, D.; Rich, R. M.; Richer, H.; Shara, M.; Stetson, P.; Woodley, K.; Zurek, D.

    2011-01-01

    The ages of Galactic globular clusters provide insight into the formation history of the Milky Way. Utilizing HST photometry of unprecedented depth and wavelength coverage, we determine the main sequence turnoff ages of the nearby globular clusters NGC 6397 and 47 Tuc. The ages are determined by comparing stellar evolution models to the main sequences with a chi-squared minimization technique. Our analysis of 47 Tuc leverages the pronounced 'kink' or 'knee' feature that appears in the lower main sequence in the near-IR. We present our age estimates as probability distributions and construct confidence intervals over input parameters such as metallicity, distance, and reddening.

  2. Draft Genome Sequence of Pseudomonas pachastrellae Strain CCUG 46540T, a Deep-Sea Bacterium

    PubMed Central

    2017-01-01

    ABSTRACT Pseudomonas pachastrellae strain CCUG 46540T (KMM 330T) was isolated from a deep-sea sponge specimen collected in the Philippine Sea at a depth of 750 m. The draft genome has an estimated size of 4.0 Mb, exhibits a G+C content of 61.2 mol%, and is predicted to encode 3,592 proteins, including pathways for the degradation of aromatic compounds. PMID:28385850

  3. Increasing Clinical Severity during a Dengue Virus Type 3 Cuban Epidemic: Deep Sequencing of Evolving Viral Populations

    PubMed Central

    Blanc, Hervé; Bordería, Antonio V.; Díaz, Gisell; Henningsson, Rasmus; Gonzalez, Daniel; Santana, Emidalys; Alvarez, Mayling; Castro, Osvaldo; Fontes, Magnus; Vignuzzi, Marco; Guzman, Maria G.

    2016-01-01

    ABSTRACT During the dengue virus type 3 (DENV-3) epidemic that occurred in Havana in 2001 to 2002, severe disease was associated with the infection sequence DENV-1 followed by DENV-3 (DENV-1/DENV-3), while the sequence DENV-2/DENV-3 was associated with mild/asymptomatic infections. To determine the role of the virus in the increasing severity demonstrated during the epidemic, serum samples collected at different time points were studied. A total of 22 full-length sequences were obtained using a deep-sequencing approach. Bayesian phylogenetic analysis of consensus sequences revealed that two DENV-3 lineages were circulating in Havana at that time, both grouped within genotype III. The predominant lineage is closely related to Peruvian and Ecuadorian strains, while the minor lineage is related to Venezuelan strains. According to consensus sequences, relatively few nonsynonymous mutations were observed; only one was fixed during the epidemic at position 4380 in the NS2B gene. Intrahost genetic analysis indicated that a significant minor population was selected and became predominant toward the end of the epidemic. In conclusion, greater variability was detected during the epidemic's progression in terms of significant minority variants, particularly in the nonstructural genes. An increasing trend of genetic diversity toward the end of the epidemic was observed only for synonymous variant allele rates, with higher variability in secondary cases. Remarkably, significant intrahost genetic variation was demonstrated within the same patient during the course of secondary infection with DENV-1/DENV-3, including changes in the structural proteins premembrane (PrM) and envelope (E). Therefore, the dynamic of evolving viral populations in the context of heterotypic antibodies could be related to the increasing clinical severity observed during the epidemic. 
IMPORTANCE Based on the evidence that DENV fitness is context dependent, our research has focused on the study of viral

  4. Deep sequencing in library selection projects: what insight does it bring?

    PubMed

    Glanville, J; D'Angelo, S; Khan, T A; Reddy, S T; Naranjo, L; Ferrara, F; Bradbury, A R M

    2015-08-01

    High throughput sequencing is poised to change all aspects of the way antibodies and other binders are discovered and engineered. Millions of available sequence reads provide an unprecedented sampling depth able to guide the design and construction of effective, high quality naïve libraries containing tens of billions of unique molecules. Furthermore, during selections, high throughput sequencing enables quantitative tracing of enriched clones and position-specific guidance to amino acid variation under positive selection during antibody engineering. Successful application of the technologies relies on specific PCR reagent design, correct sequencing platform selection, and effective use of computational tools and statistical measures to remove error, identify antibodies, estimate diversity, and extract signatures of selection from the clone down to individual structural positions. Here we review these considerations and discuss some of the remaining challenges to the widespread adoption of the technology. Copyright © 2015 Elsevier Ltd. All rights reserved.

  5. A generic assay for whole-genome amplification and deep sequencing of enterovirus A71.

    PubMed

    Tan, Le Van; Tuyen, Nguyen Thi Kim; Thanh, Tran Tan; Ngan, Tran Thuy; Van, Hoang Minh Tu; Sabanathan, Saraswathy; Van, Tran Thi My; Thanh, Le Thi My; Nguyet, Lam Anh; Geoghegan, Jemma L; Ong, Kien Chai; Perera, David; Hang, Vu Thi Ty; Ny, Nguyen Thi Han; Anh, Nguyen To; Ha, Do Quang; Qui, Phan Tu; Viet, Do Chau; Tuan, Ha Manh; Wong, Kum Thong; Holmes, Edward C; Chau, Nguyen Van Vinh; Thwaites, Guy; van Doorn, H Rogier

    2015-04-01

    Enterovirus A71 (EV-A71) has emerged as the most important cause of large outbreaks of severe and sometimes fatal hand, foot and mouth disease (HFMD) across the Asia-Pacific region. EV-A71 outbreaks have been associated with (sub)genogroup switches, sometimes accompanied by recombination events. Understanding EV-A71 population dynamics is therefore essential for understanding this emerging infection, and may provide pivotal information for vaccine development. Despite the public health burden of EV-A71, relatively few EV-A71 complete-genome sequences are available for analysis and from limited geographical localities. The availability of an efficient procedure for whole-genome sequencing would stimulate effort to generate more viral sequence data. Herein, we report for the first time the development of a next-generation sequencing based protocol for whole-genome sequencing of EV-A71 directly from clinical specimens. We were able to sequence viruses of subgenogroup C4 and B5, while RNA from culture materials of diverse EV-A71 subgenogroups belonging to both genogroup B and C was successfully amplified. The nature of intra-host genetic diversity was explored in 22 clinical samples, revealing 107 positions carrying minor variants (ranging from 0 to 15 variants per sample). Our analysis of EV-A71 strains sampled in 2013 showed that they all belonged to subgenogroup B5, representing the first report of this subgenogroup in Vietnam. In conclusion, we have successfully developed a high-throughput next-generation sequencing-based assay for whole-genome sequencing of EV-A71 from clinical samples.

  6. A generic assay for whole-genome amplification and deep sequencing of enterovirus A71

    PubMed Central

    Tan, Le Van; Tuyen, Nguyen Thi Kim; Thanh, Tran Tan; Ngan, Tran Thuy; Van, Hoang Minh Tu; Sabanathan, Saraswathy; Van, Tran Thi My; Thanh, Le Thi My; Nguyet, Lam Anh; Geoghegan, Jemma L.; Ong, Kien Chai; Perera, David; Hang, Vu Thi Ty; Ny, Nguyen Thi Han; Anh, Nguyen To; Ha, Do Quang; Qui, Phan Tu; Viet, Do Chau; Tuan, Ha Manh; Wong, Kum Thong; Holmes, Edward C.; Chau, Nguyen Van Vinh; Thwaites, Guy; van Doorn, H. Rogier

    2015-01-01

    Enterovirus A71 (EV-A71) has emerged as the most important cause of large outbreaks of severe and sometimes fatal hand, foot and mouth disease (HFMD) across the Asia-Pacific region. EV-A71 outbreaks have been associated with (sub)genogroup switches, sometimes accompanied by recombination events. Understanding EV-A71 population dynamics is therefore essential for understanding this emerging infection, and may provide pivotal information for vaccine development. Despite the public health burden of EV-A71, relatively few EV-A71 complete-genome sequences are available for analysis and from limited geographical localities. The availability of an efficient procedure for whole-genome sequencing would stimulate effort to generate more viral sequence data. Herein, we report for the first time the development of a next-generation sequencing based protocol for whole-genome sequencing of EV-A71 directly from clinical specimens. We were able to sequence viruses of subgenogroup C4 and B5, while RNA from culture materials of diverse EV-A71 subgenogroups belonging to both genogroup B and C was successfully amplified. The nature of intra-host genetic diversity was explored in 22 clinical samples, revealing 107 positions carrying minor variants (ranging from 0 to 15 variants per sample). Our analysis of EV-A71 strains sampled in 2013 showed that they all belonged to subgenogroup B5, representing the first report of this subgenogroup in Vietnam. In conclusion, we have successfully developed a high-throughput next-generation sequencing-based assay for whole-genome sequencing of EV-A71 from clinical samples. PMID:25704598

  7. Insights into Deep-Sea Sediment Fungal Communities from the East Indian Ocean Using Targeted Environmental Sequencing Combined with Traditional Cultivation

    PubMed Central

    Zhang, Xiao-yong; Tang, Gui-ling; Xu, Xin-ya; Nong, Xu-hua; Qi, Shu-Hua

    2014-01-01

    The fungal diversity in deep-sea environments has recently gained an increasing amount of attention. Our knowledge and understanding of the true fungal diversity and the role it plays in deep-sea environments, however, is still limited. We investigated the fungal community structure in five sediments from a depth of ∼4000 m in the East Indian Ocean using a combination of targeted environmental sequencing and traditional cultivation. This approach resulted in the recovery of a total of 45 fungal operational taxonomic units (OTUs) and 20 culturable fungal phylotypes. This finding indicates that there is a great amount of fungal diversity in the deep-sea sediments collected in the East Indian Ocean. Three fungal OTUs and one culturable phylotype demonstrated high divergence (89%–97%) from the existing sequences in the GenBank. Moreover, 44.4% fungal OTUs and 30% culturable fungal phylotypes are new reports for deep-sea sediments. These results suggest that the deep-sea sediments from the East Indian Ocean can serve as habitats for new fungal communities compared with other deep-sea environments. In addition, different fungal communities could be detected when using targeted environmental sequencing compared with traditional cultivation in this study, which suggests that a combination of targeted environmental sequencing and traditional cultivation will generate a more diverse fungal community in deep-sea environments than using either targeted environmental sequencing or traditional cultivation alone. This study is the first to report new insights into the fungal communities in deep-sea sediments from the East Indian Ocean, which increases our knowledge and understanding of the fungal diversity in deep-sea environments. PMID:25272044

  8. Insights into deep-sea sediment fungal communities from the East Indian Ocean using targeted environmental sequencing combined with traditional cultivation.

    PubMed

    Zhang, Xiao-yong; Tang, Gui-ling; Xu, Xin-ya; Nong, Xu-hua; Qi, Shu-hua

    2014-01-01

    The fungal diversity in deep-sea environments has recently gained an increasing amount of attention. Our knowledge and understanding of the true fungal diversity and the role it plays in deep-sea environments, however, is still limited. We investigated the fungal community structure in five sediments from a depth of ∼4000 m in the East Indian Ocean using a combination of targeted environmental sequencing and traditional cultivation. This approach resulted in the recovery of a total of 45 fungal operational taxonomic units (OTUs) and 20 culturable fungal phylotypes. This finding indicates that there is a great amount of fungal diversity in the deep-sea sediments collected in the East Indian Ocean. Three fungal OTUs and one culturable phylotype demonstrated high divergence (89%-97%) from the existing sequences in the GenBank. Moreover, 44.4% fungal OTUs and 30% culturable fungal phylotypes are new reports for deep-sea sediments. These results suggest that the deep-sea sediments from the East Indian Ocean can serve as habitats for new fungal communities compared with other deep-sea environments. In addition, different fungal communities could be detected when using targeted environmental sequencing compared with traditional cultivation in this study, which suggests that a combination of targeted environmental sequencing and traditional cultivation will generate a more diverse fungal community in deep-sea environments than using either targeted environmental sequencing or traditional cultivation alone. This study is the first to report new insights into the fungal communities in deep-sea sediments from the East Indian Ocean, which increases our knowledge and understanding of the fungal diversity in deep-sea environments.

  9. Dissection of the octoploid strawberry genome by deep sequencing of the genomes of Fragaria species.

    PubMed

    Hirakawa, Hideki; Shirasawa, Kenta; Kosugi, Shunichi; Tashiro, Kosuke; Nakayama, Shinobu; Yamada, Manabu; Kohara, Mistuyo; Watanabe, Akiko; Kishida, Yoshie; Fujishiro, Tsunakazu; Tsuruoka, Hisano; Minami, Chiharu; Sasamoto, Shigemi; Kato, Midori; Nanri, Keiko; Komaki, Akiko; Yanagi, Tomohiro; Guoxin, Qin; Maeda, Fumi; Ishikawa, Masami; Kuhara, Satoru; Sato, Shusei; Tabata, Satoshi; Isobe, Sachiko N

    2014-01-01

    Cultivated strawberry (Fragaria x ananassa) is octoploid and shows allogamous behaviour. The present study aims at dissecting this octoploid genome through comparison with its wild relatives, F. iinumae, F. nipponica, F. nubicola, and F. orientalis by de novo whole-genome sequencing on Illumina and Roche 454 platforms. The total length of the assembled Illumina genome sequences obtained was 698 Mb for F. x ananassa, and ∼200 Mb each for the four wild species. Subsequently, a virtual reference genome termed FANhybrid_r1.2 was constructed by integrating the sequences of the four homoeologous subgenomes of F. x ananassa, from which heterozygous regions in the Roche 454 and Illumina genome sequences were eliminated. The total length of FANhybrid_r1.2 thus created was 173.2 Mb with the N50 length of 5137 bp. The Illumina-assembled genome sequences of F. x ananassa and the four wild species were then mapped onto the reference genome, along with the previously published F. vesca genome sequence to establish the subgenomic structure of F. x ananassa. The strategy adopted in this study has turned out to be successful in dissecting the genome of octoploid F. x ananassa and appears promising when applied to the analysis of other polyploid plant species.
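
    The N50 length quoted above is a standard assembly summary statistic. As a minimal sketch, assuming only the usual definition (the largest length L such that sequences of length L or more account for at least half of the total assembled bases), it could be computed as follows. The function name and the toy contig lengths are hypothetical and are not taken from the study.

```python
def n50(lengths):
    """Return the N50 of a collection of contig/scaffold lengths.

    N50 is taken here as the length of the contig at which the running
    sum of lengths, sorted from longest to shortest, first reaches at
    least half of the total assembly length.
    """
    if not lengths:
        raise ValueError("no contig lengths given")
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        # Compare 2 * running with total to stay in integer arithmetic.
        if 2 * running >= total:
            return length


# Toy contig lengths (hypothetical values, for illustration only).
toy_contigs = [2, 3, 4, 5, 6, 7, 8, 9, 10]
print(n50(toy_contigs))
```

    A single long contig can dominate this statistic, so N50 is usually read together with the total assembly length, as the abstract does.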

  10. Dissection of the Octoploid Strawberry Genome by Deep Sequencing of the Genomes of Fragaria Species

    PubMed Central

    Hirakawa, Hideki; Shirasawa, Kenta; Kosugi, Shunichi; Tashiro, Kosuke; Nakayama, Shinobu; Yamada, Manabu; Kohara, Mistuyo; Watanabe, Akiko; Kishida, Yoshie; Fujishiro, Tsunakazu; Tsuruoka, Hisano; Minami, Chiharu; Sasamoto, Shigemi; Kato, Midori; Nanri, Keiko; Komaki, Akiko; Yanagi, Tomohiro; Guoxin, Qin; Maeda, Fumi; Ishikawa, Masami; Kuhara, Satoru; Sato, Shusei; Tabata, Satoshi; Isobe, Sachiko N.

    2014-01-01

    Cultivated strawberry (Fragaria x ananassa) is octoploid and shows allogamous behaviour. The present study aims at dissecting this octoploid genome through comparison with its wild relatives, F. iinumae, F. nipponica, F. nubicola, and F. orientalis by de novo whole-genome sequencing on Illumina and Roche 454 platforms. The total length of the assembled Illumina genome sequences obtained was 698 Mb for F. x ananassa, and ∼200 Mb each for the four wild species. Subsequently, a virtual reference genome termed FANhybrid_r1.2 was constructed by integrating the sequences of the four homoeologous subgenomes of F. x ananassa, from which heterozygous regions in the Roche 454 and Illumina genome sequences were eliminated. The total length of FANhybrid_r1.2 thus created was 173.2 Mb with the N50 length of 5137 bp. The Illumina-assembled genome sequences of F. x ananassa and the four wild species were then mapped onto the reference genome, along with the previously published F. vesca genome sequence to establish the subgenomic structure of F. x ananassa. The strategy adopted in this study has turned out to be successful in dissecting the genome of octoploid F. x ananassa and appears promising when applied to the analysis of other polyploid plant species. PMID:24282021

  11. Gene Discovery Using Mutagen-Induced Polymorphisms and Deep Sequencing: Application to Plant Disease Resistance

    PubMed Central

    Zhu, Ying; Mang, Hyung-gon; Sun, Qi; Qian, Jun; Hipps, Ashley; Hua, Jian

    2012-01-01

    Next-generation sequencing technologies are accelerating gene discovery by combining multiple steps of mapping and cloning used in the traditional map-based approach into one step using DNA sequence polymorphisms existing between two different accessions/strains/backgrounds of the same species. The existing next-generation sequencing method, like the traditional one, requires the use of a segregating population from a cross of a mutant organism in one accession with a wild-type (WT) organism in a different accession. It therefore could potentially be limited by modification of mutant phenotypes in different accessions and/or by the lengthy process required to construct a particular mapping parent in a second accession. Here we present mapping and cloning of an enhancer mutation with next-generation sequencing on bulked segregants in the same accession using sequence polymorphisms induced by a chemical mutagen. This method complements the conventional cloning approach and makes forward genetics more feasible and powerful in molecularly dissecting biological processes in any organism. The pipeline developed in this study can be used to clone causal genes in the background of single or higher-order mutants and in species with or without sequence information on multiple accessions. PMID:22714407

  12. Gene discovery using mutagen-induced polymorphisms and deep sequencing: application to plant disease resistance.

    PubMed

    Zhu, Ying; Mang, Hyung-gon; Sun, Qi; Qian, Jun; Hipps, Ashley; Hua, Jian

    2012-09-01

    Next-generation sequencing technologies are accelerating gene discovery by combining multiple steps of mapping and cloning used in the traditional map-based approach into one step using DNA sequence polymorphisms existing between two different accessions/strains/backgrounds of the same species. The existing next-generation sequencing method, like the traditional one, requires the use of a segregating population from a cross of a mutant organism in one accession with a wild-type (WT) organism in a different accession. It therefore could potentially be limited by modification of mutant phenotypes in different accessions and/or by the lengthy process required to construct a particular mapping parent in a second accession. Here we present mapping and cloning of an enhancer mutation with next-generation sequencing on bulked segregants in the same accession using sequence polymorphisms induced by a chemical mutagen. This method complements the conventional cloning approach and makes forward genetics more feasible and powerful in molecularly dissecting biological processes in any organism. The pipeline developed in this study can be used to clone causal genes in the background of single or higher-order mutants and in species with or without sequence information on multiple accessions.

  13. Taxonomic classification of bacterial 16S rRNA genes using short sequencing reads: evaluation of effective study designs.

    PubMed

    Mizrahi-Man, Orna; Davenport, Emily R; Gilad, Yoav

    2013-01-01

    Massively parallel high-throughput sequencing technologies allow us to interrogate the microbial composition of biological samples at unprecedented resolution. The typical approach is to perform high-throughput sequencing of 16S rRNA genes, which are then taxonomically classified based on similarity to known sequences in existing databases. Current technologies cause a predicament though, because although they enable deep coverage of samples, they are limited in the length of sequence they can produce. As a result, high-throughput studies of microbial communities often do not sequence the entire 16S rRNA gene. The challenge is to obtain reliable representation of bacterial communities through taxonomic classification of short 16S rRNA gene sequences. In this study we explored properties of different study designs and developed specific recommendations for effective use of short-read sequencing technologies for the purpose of interrogating bacterial communities, with a focus on classification using naïve Bayesian classifiers. To assess precision and coverage of each design, we used a collection of ∼8,500 manually curated 16S rRNA gene sequences from cultured bacteria and a set of over one million bacterial 16S rRNA gene sequences retrieved from environmental samples, respectively. We also tested different configurations of taxonomic classification approaches using short read sequencing data, and provide recommendations for optimal choice of the relevant parameters. We conclude that with a judicious selection of the sequenced region and the corresponding choice of a suitable training set for taxonomic classification, it is possible to explore bacterial communities at great depth using current technologies, with only a minimal loss of taxonomic resolution.

  14. Mosaic KCNJ2 mutation in Andersen-Tawil syndrome: targeted deep sequencing is useful for the detection of mosaicism.

    PubMed

    Hasegawa, K; Ohno, S; Kimura, H; Itoh, H; Makiyama, T; Yoshida, Y; Horie, M

    2015-03-01

    Andersen-Tawil syndrome (ATS) is an inherited disease characterized by ventricular arrhythmias, periodic paralysis, and dysmorphic features. It results from a heterozygous mutation of KCNJ2, but little is known about mosaicism in ATS. We performed genetic analysis of KCNJ2 in 32 ATS probands and their family members and identified KCNJ2 mutations in 25 probands, 20 families who underwent extensive genetic testing. These tests revealed that seven probands carried de novo mutations while 13 carried inherited mutations from their parents. We then specifically assessed a single proband and the respective family. The proband was a 9-year-old girl who fulfilled the ATS triad and carried an insertion mutation (p.75_76insThr). We determined that the proband's mother carried a somatic mosaicism and that the proband's younger brother also carried the ATS phenotype with the same insertion mutation. The mother, who exhibited mosaicism, was asymptomatic, although she exhibited Q(T)U prolongation. Mutant allele frequency was 11% as per TA cloning and 17.3% as per targeted deep sequencing. Our observations suggest that targeted deep sequencing is useful for the detection of mosaicism and that the detection of mosaic mutations in parents of apparently sporadic ATS patients can help in the process of genetic counseling.
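
    The mutant allele frequencies reported above are, in general terms, the fraction of observations at a site that carry the mutant allele. The sketch below illustrates that ratio under the assumption that the frequency is simply mutant-supporting reads divided by total reads at the position; the abstract does not describe the authors' actual calculation, and the function name and read counts are hypothetical.

```python
def mutant_allele_frequency(mutant_reads, total_reads):
    """Fraction of reads covering a site that support the mutant allele."""
    if total_reads <= 0:
        raise ValueError("total_reads must be positive")
    if not 0 <= mutant_reads <= total_reads:
        raise ValueError("mutant_reads must lie between 0 and total_reads")
    return mutant_reads / total_reads


# Hypothetical counts chosen only to reproduce a 17.3% frequency.
freq = mutant_allele_frequency(mutant_reads=173, total_reads=1000)
print(f"{freq:.1%}")
```

    A germline heterozygous variant would be expected near 50% under this ratio, so values well below that, such as those reported for the mother, are what point toward mosaicism.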

  15. Development of a candidate reference material for adventitious virus detection in vaccine and biologicals manufacturing by deep sequencing.

    PubMed

    Mee, Edward T; Preston, Mark D; Minor, Philip D; Schepelmann, Silke

    2016-04-12

    Unbiased deep sequencing offers the potential for improved adventitious virus screening in vaccines and biotherapeutics. Successful implementation of such assays will require appropriate control materials to confirm assay performance and sensitivity. A common reference material containing 25 target viruses was produced and 16 laboratories were invited to process it using their preferred adventitious virus detection assay. Fifteen laboratories returned results, obtained using a wide range of wet-lab and informatics methods. Six of 25 target viruses were detected by all laboratories, with the remaining viruses detected by 4-14 laboratories. Six non-target viruses were detected by three or more laboratories. The study demonstrated that a wide range of methods are currently used for adventitious virus detection screening in biological products by deep sequencing and that they can yield significantly different results. This underscores the need for common reference materials to ensure satisfactory assay performance and enable comparisons between laboratories. Copyright © 2015 The Authors. Published by Elsevier Ltd. All rights reserved.

  16. Acute West Nile Virus Meningoencephalitis Diagnosed Via Metagenomic Deep Sequencing of Cerebrospinal Fluid in a Renal Transplant Patient.

    PubMed

    Wilson, M R; Zimmermann, L L; Crawford, E D; Sample, H A; Soni, P R; Baker, A N; Khan, L M; DeRisi, J L

    2017-03-01

    Solid organ transplant patients are vulnerable to suffering neurologic complications from a wide array of viral infections and can be sentinels in the population who are first to get serious complications from emerging infections like the recent waves of arboviruses, including West Nile virus, Chikungunya virus, Zika virus, and Dengue virus. The diverse and rapidly changing landscape of possible causes of viral encephalitis poses great challenges for traditional candidate-based infectious disease diagnostics that already fail to identify a causative pathogen in approximately 50% of encephalitis cases. We present the case of a 14-year-old girl on immunosuppression for a renal transplant who presented with acute meningoencephalitis. Traditional diagnostics failed to identify an etiology. RNA extracted from her cerebrospinal fluid was subjected to unbiased metagenomic deep sequencing, enhanced with the use of a Cas9-based technique for host depletion. This analysis identified West Nile virus (WNV). Convalescent serum serologies subsequently confirmed WNV seroconversion. These results support a clear clinical role for metagenomic deep sequencing in the setting of suspected viral encephalitis, especially in the context of the high-risk transplant patient population.

  17. Generic Amplicon Deep Sequencing to Determine Ilarvirus Species Diversity in Australian Prunus

    PubMed Central

    Kinoti, Wycliff M.; Constable, Fiona E.; Nancarrow, Narelle; Plummer, Kim M.; Rodoni, Brendan

    2017-01-01

    The distribution of Ilarvirus species populations amongst 61 Australian Prunus trees was determined by next generation sequencing (NGS) of amplicons generated using a genus-based generic RT-PCR targeting a conserved region of the Ilarvirus RNA2 component that encodes the RNA dependent RNA polymerase (RdRp) gene. Presence of Ilarvirus sequences in each positive sample was further validated by Sanger sequencing of cloned amplicons of regions of each of RNA1, RNA2 and/or RNA3 that were generated by species specific PCRs and by metagenomic NGS. Prunus necrotic ringspot virus (PNRSV) was the most frequently detected Ilarvirus, occurring in 48 of the 61 Ilarvirus-positive trees and Prune dwarf virus (PDV) and Apple mosaic virus (ApMV) were detected in three trees and one tree, respectively. American plum line pattern virus (APLPV) was detected in three trees and represents the first report of APLPV detection in Australia. Two novel and distinct groups of Ilarvirus-like RNA2 amplicon sequences were also identified in several trees by the generic amplicon NGS approach. The high read depth from the amplicon NGS of the generic PCR products allowed the detection of distinct RNA2 RdRp sequence variant populations of PNRSV, PDV, ApMV, APLPV and the two novel Ilarvirus-like sequences. Mixed infections of ilarviruses were also detected in seven Prunus trees. Sanger sequencing of specific RNA1, RNA2, and/or RNA3 genome segments of each virus and total nucleic acid metagenomics NGS confirmed the presence of PNRSV, PDV, ApMV and APLPV detected by RNA2 generic amplicon NGS. However, the two novel groups of Ilarvirus-like RNA2 amplicon sequences detected by the generic amplicon NGS could not be associated to the presence of sequence from RNA1 or RNA3 genome segments or full Ilarvirus genomes, and their origin is unclear. 
This work highlights the sensitivity of genus-specific amplicon NGS in detection of virus sequences and their distinct populations in multiple samples, and the need

  18. Differential expression analysis of Paralichthys olivaceus microRNAs in adult ovary and testis by deep sequencing.

    PubMed

    Gu, Yifeng; Zhang, Lei; Chen, Xiaowu

    2014-08-01

    MicroRNAs (miRNAs) play an important role in gonadal development and differentiation in fish. However, understanding of the mechanism of this process is hindered by our poor knowledge of miRNA expression patterns in fish gonads. In this study, miRNA libraries derived from adult gonads of Paralichthys olivaceus were generated by using next-generation sequencing (NGS) technology. Bioinformatics analysis was performed to distinguish mature miRNA sequences from two classes of small RNAs represented in the sequencing data. A total of 141 mature miRNAs were identified, of which 21 miRNAs were found in P. olivaceus for the first time. Variation and bias in miRNA expression were inferred from the deep sequencing reads. Some miRNAs, such as pol-miR-143, pol-miR-26a and pol-let-7a, were found at quite high expression levels in both gonads, while others exhibited clearly sex-biased expression between the gonads. Approximately 20.0% and 13.1% of the isolated miRNAs were preferentially expressed in the testis (FC<0.5) or ovary (FC>2), respectively. The identification and preliminary analysis of sex-biased miRNA expression in P. olivaceus gonads by NGS provides a basic catalog of miRNAs to facilitate future study and exploitation of sexual regulatory mechanisms in P. olivaceus. Copyright © 2014. Published by Elsevier Inc.
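
    The fold-change thresholds quoted in this abstract (FC<0.5 for testis, FC>2 for ovary) can be illustrated with a minimal sketch. It assumes that FC is the ratio of normalized ovary expression to normalized testis expression, which the abstract implies but does not state; the function name, the miRNA names and the counts are hypothetical and are not taken from the study.

```python
def classify_sex_bias(ovary, testis, low=0.5, high=2.0):
    """Label one miRNA by fold change FC = ovary / testis (assumed direction)."""
    if ovary < 0 or testis <= 0:
        raise ValueError("expression values must be non-negative, testis > 0")
    fc = ovary / testis
    if fc > high:
        return "ovary-biased"
    if fc < low:
        return "testis-biased"
    return "unbiased"


# Hypothetical normalized read counts: name -> (ovary, testis)
counts = {
    "miR-a": (50.0, 400.0),   # FC = 0.125
    "miR-b": (900.0, 300.0),  # FC = 3.0
    "miR-c": (120.0, 100.0),  # FC = 1.2
}
labels = {name: classify_sex_bias(o, t) for name, (o, t) in counts.items()}
```

    Requiring a non-zero testis value avoids a division by zero; a real analysis would more likely add a pseudocount before taking the ratio.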

  19. Characterization and Development of EST-SSRs by Deep Transcriptome Sequencing in Chinese Cabbage (Brassica rapa L. ssp. pekinensis)

    PubMed Central

    Ding, Qian; Li, Jingjuan; Wang, Fengde; Zhang, Yihui; Li, Huayin; Zhang, Jiannong; Gao, Jianwei

    2015-01-01

    Simple sequence repeats (SSRs) are among the most important markers for population analysis and have been widely used in plant genetic mapping and molecular breeding. Expressed sequence tag-SSR (EST-SSR) markers, located in the coding regions, are potentially more efficient for QTL mapping, gene targeting, and marker-assisted breeding. In this study, we investigated 51,694 nonredundant unigenes, assembled from clean reads from deep transcriptome sequencing with a Solexa/Illumina platform, for identification and development of EST-SSRs in Chinese cabbage. In total, 10,420 EST-SSRs with over 12 bp were identified and characterized, among which 2744 EST-SSRs are new and 2317 are known ones showing polymorphism with previously reported SSRs. A total of 7877 PCR primer pairs for 1561 EST-SSR loci were designed, and primer pairs for twenty-four EST-SSRs were selected for primer evaluation. In nineteen EST-SSR loci (79.2%), amplicons were successfully generated with high quality. Seventeen (89.5%) showed polymorphism in twenty-four cultivars of Chinese cabbage. The polymorphic alleles of each polymorphic locus were sequenced, and the results showed that most polymorphisms were due to variations of SSR repeat motifs. The EST-SSRs identified and characterized in this study have important implications for developing new tools for genetics and molecular breeding in Chinese cabbage. PMID:26504770
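
    The abstract does not name the tool used to detect SSRs, so the following is only a sketch of how perfect tandem repeats can be located with a regular expression. It assumes motifs of 1 to 6 bp and reads "over 12 bp" as a minimum repeat length of 12 bp; the function name and both parameters are illustrative choices rather than the authors' settings.

```python
import re


def find_ssrs(seq, min_len=12, max_motif=6):
    """Return (start, motif, repeat_count) for perfect SSRs of >= min_len bp."""
    seq = seq.upper()
    hits = []
    for k in range(1, max_motif + 1):
        # Smallest number of motif copies that reaches min_len (at least 2).
        min_repeats = max(2, -(-min_len // k))
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (k, min_repeats - 1))
        for m in pattern.finditer(seq):
            motif = m.group(1)
            # Skip motifs that are themselves repeats of a shorter unit,
            # e.g. "AGAG" is reported as "AG", not as a tetranucleotide.
            if any(motif == motif[:d] * (k // d)
                   for d in range(1, k) if k % d == 0):
                continue
            hits.append((m.start(), motif, len(m.group(0)) // k))
    return sorted(hits)


example = "TTC" + "AG" * 6 + "CCT" + "A" * 12 + "GC"
ssrs = find_ssrs(example)
```

    Only perfect repeats are found here; compound or interrupted SSRs would need extra handling.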

  20. Characterization of microRNAs and their targets in wild barley (Hordeum vulgare subsp. spontaneum) using deep sequencing.

    PubMed

    Deng, Pingchuan; Bian, Jianxin; Yue, Hong; Feng, Kewei; Wang, Mengxing; Du, Xianghong; Weining, Song; Nie, Xiaojun

    2016-05-01

    MicroRNAs (miRNA) are a class of small, endogenous RNAs that play a negative regulatory role in various developmental and metabolic processes of plants. Wild barley (Hordeum vulgare subsp. spontaneum), as the progenitor of cultivated barley (Hordeum vulgare subsp. vulgare), has served as a valuable germplasm resource for barley genetic improvement. To survey miRNAs in wild barley, we sequenced the small RNA library prepared from wild barley using the Illumina deep sequencing technology. A total of 70 known miRNAs and 18 putative novel miRNAs were identified. Sequence analysis revealed that all of the miRNAs identified in wild barley contained the highly conserved hairpin sequences found in barley cultivars. MiRNA target predictions showed that 12 out of 52 miRNA families were predicted to target transcription factors, including 8 highly conserved miRNA families in plants and 4 wheat-barley conserved miRNA families. In addition to transcription factors, other predicted target genes were involved in diverse physiological and metabolic processes and stress defense. Our study for the first time reported the large-scale investigation of small RNAs in wild barley, which will provide essential information for understanding the regulatory role of miRNAs in wild barley and also shed light on future practical utilization of miRNAs for barley improvement.

  1. Location and sequence of muscle onset in deep abdominal muscles measured by different modes of ultrasound imaging.

    PubMed

    Westad, Christian; Mork, Paul J; Vasseljen, Ottar

    2010-10-01

    Various modes of ultrasound (US) imaging have been introduced as an alternative to electromyography for determining muscle onset. The purpose of this study was to compare the agreement between US motion-mode (US(m-mode)) and US strain rate (US(SR)) derived from tissue velocity imaging in determining latency time, location and sequence of muscle onset in abdominal muscles using the same data set (contractions). Twenty-four subjects performed four rapid arm flexions in response to a light signal while US recordings were made from the abdominal muscles on the contralateral side. The examined muscles were transversus abdominis (TrA), superficial and deep obliquus internus abdominis (OI(deep) and OI(sup)), and obliquus externus abdominis (OE). The results showed that the two methods detected the first muscle onset on average within 0.1 ms (95% CI; +/-1.4 ms) of each other. US(SR) detected the second muscle onset on average 27 ms after US(m-mode). While US(SR) and US(m-mode) can be used interchangeably to detect the first muscle onset, the location of both first onset and subsequent muscle onsets can be reliably detected by US(SR) only. Furthermore, this study indicates that OI may be functionally subdivided into a superficial and deep region, with onset in OI(deep) occurring on average 53 ms before OI(sup). First onset was detected more frequently in OI than in TrA (65% versus 25% of detected onsets, 10% were equal). Copyright (c) 2010 Elsevier Ltd. All rights reserved.

  2. Identification of Hepatotropic Viruses from Plasma Using Deep Sequencing: A Next Generation Diagnostic Tool

    PubMed Central

    Patterson, Jordan; Ford, Glenn; O’keefe, Sandra; Wang, Weiwei; Meng, Bo; Song, Deyong; Zhang, Yong; Tian, Zhijian; Wasilenko, Shawn T.; Rahbari, Mandana; Mitchell, Troy; Jordan, Tracy; Carpenter, Eric; Mason, Andrew L.; Wong, Gane Ka-Shu

    2013-01-01

    We conducted an unbiased metagenomics survey using plasma from patients with chronic hepatitis B, chronic hepatitis C, autoimmune hepatitis (AIH), non-alcoholic steatohepatitis (NASH), and patients without liver disease (control). RNA and DNA libraries were sequenced from plasma filtrates enriched in viral particles to catalog virus populations. Hepatitis viruses were readily detected at high coverage in patients with chronic viral hepatitis B and C, but only a limited number of sequences resembling other viruses were found. The exception was a library from a patient diagnosed with hepatitis C virus (HCV) infection that contained multiple sequences matching GB virus C (GBV-C). Abundant GBV-C reads were also found in plasma from patients with AIH, whereas Torque teno virus (TTV) was found at high frequency in samples from patients with AIH and NASH. After taxonomic classification of sequences by BLASTn, a substantial fraction in each library, ranging from 35% to 76%, remained unclassified. These unknown sequences were assembled into scaffolds along with virus, phage and endogenous retrovirus sequences and then analyzed by BLASTx against the non-redundant protein database. Nearly the full genome of a heretofore-unknown circovirus was assembled, along with many scaffolds that encoded proteins with similarity to plant, insect and mammalian viruses. The presence of this novel circovirus was confirmed by PCR. BLASTx also identified many polypeptides resembling nucleo-cytoplasmic large DNA viruses (NCLDV) proteins. We re-evaluated these alignments with a profile hidden Markov method, HHblits, and observed inconsistencies in the target proteins reported by the different algorithms. This suggests that sequence alignments are insufficient to identify NCLDV proteins, especially when these alignments are only to small portions of the target protein. 
Nevertheless, we have now established a reliable protocol for the identification of viruses in plasma that can also be adapted to other

  3. Analyses of deep mammalian sequence alignments and constraint predictions for 1% of the human genome.

    PubMed

    Margulies, Elliott H; Cooper, Gregory M; Asimenos, George; Thomas, Daryl J; Dewey, Colin N; Siepel, Adam; Birney, Ewan; Keefe, Damian; Schwartz, Ariel S; Hou, Minmei; Taylor, James; Nikolaev, Sergey; Montoya-Burgos, Juan I; Löytynoja, Ari; Whelan, Simon; Pardi, Fabio; Massingham, Tim; Brown, James B; Bickel, Peter; Holmes, Ian; Mullikin, James C; Ureta-Vidal, Abel; Paten, Benedict; Stone, Eric A; Rosenbloom, Kate R; Kent, W James; Bouffard, Gerard G; Guan, Xiaobin; Hansen, Nancy F; Idol, Jacquelyn R; Maduro, Valerie V B; Maskeri, Baishali; McDowell, Jennifer C; Park, Morgan; Thomas, Pamela J; Young, Alice C; Blakesley, Robert W; Muzny, Donna M; Sodergren, Erica; Wheeler, David A; Worley, Kim C; Jiang, Huaiyang; Weinstock, George M; Gibbs, Richard A; Graves, Tina; Fulton, Robert; Mardis, Elaine R; Wilson, Richard K; Clamp, Michele; Cuff, James; Gnerre, Sante; Jaffe, David B; Chang, Jean L; Lindblad-Toh, Kerstin; Lander, Eric S; Hinrichs, Angie; Trumbower, Heather; Clawson, Hiram; Zweig, Ann; Kuhn, Robert M; Barber, Galt; Harte, Rachel; Karolchik, Donna; Field, Matthew A; Moore, Richard A; Matthewson, Carrie A; Schein, Jacqueline E; Marra, Marco A; Antonarakis, Stylianos E; Batzoglou, Serafim; Goldman, Nick; Hardison, Ross; Haussler, David; Miller, Webb; Pachter, Lior; Green, Eric D; Sidow, Arend

    2007-06-01

    A key component of the ongoing ENCODE project involves rigorous comparative sequence analyses for the initially targeted 1% of the human genome. Here, we present orthologous sequence generation, alignment, and evolutionary constraint analyses of 23 mammalian species for all ENCODE targets. Alignments were generated using four different methods; comparisons of these methods reveal large-scale consistency but substantial differences in terms of small genomic rearrangements, sensitivity (sequence coverage), and specificity (alignment accuracy). We describe the quantitative and qualitative trade-offs concomitant with alignment method choice and the levels of technical error that need to be accounted for in applications that require multisequence alignments. Using the generated alignments, we identified constrained regions using three different methods. While the different constraint-detecting methods are in general agreement, there are important discrepancies relating to both the underlying alignments and the specific algorithms. However, by integrating the results across the alignments and constraint-detecting methods, we produced constraint annotations that were found to be robust based on multiple independent measures. Analyses of these annotations illustrate that most classes of experimentally annotated functional elements are enriched for constrained sequences; however, large portions of each class (with the exception of protein-coding sequences) do not overlap constrained regions. The latter elements might not be under primary sequence constraint, might not be constrained across all mammals, or might have expendable molecular functions. Conversely, 40% of the constrained sequences do not overlap any of the functional elements that have been experimentally identified. Together, these findings demonstrate and quantify how many genomic functional elements await basic molecular characterization.

  4. Mitochondrial genome sequences reveal deep divergences among Anopheles punctulatus sibling species in Papua New Guinea

    PubMed Central

    2013-01-01

    Background: Members of the Anopheles punctulatus group (AP group) are the primary vectors of human malaria in Papua New Guinea. The AP group includes 13 sibling species, most of them morphologically indistinguishable. Understanding why only certain species are able to transmit malaria requires a better comprehension of their evolutionary history. In particular, understanding relationships and divergence times among Anopheles species may enable assessing how malaria-related traits (e.g. blood feeding behaviours, vector competence) have evolved. Methods: DNA sequences of 14 mitochondrial (mt) genomes from five AP sibling species and two species of the Anopheles dirus complex of Southeast Asia were sequenced. DNA sequences from all concatenated protein coding genes (10,770 bp) were then analysed using a Bayesian approach to reconstruct phylogenetic relationships and date the divergence of the AP sibling species. Results: Phylogenetic reconstruction using the concatenated DNA sequence of all mitochondrial protein coding genes indicates that the ancestors of the AP group arrived in Papua New Guinea 25 to 54 million years ago and rapidly diverged to form the current sibling species. Conclusion: Through evaluation of newly described mt genome sequences, this study has revealed a divergence among members of the AP group in Papua New Guinea that would significantly predate the arrival of humans in this region, 50 thousand years ago. The divergence observed among the mtDNA sequences studied here may have resulted from reproductive isolation during historical changes in sea-level through glacial minima and maxima. This leads to a hypothesis that the AP sibling species have evolved independently for potentially thousands of generations. This suggests that the evolution of many phenotypes, such as insecticide resistance, will arise independently in each of the AP sibling species studied here. PMID:23405960

  5. Whole Genome Deep Sequencing of HIV-1 Reveals the Impact of Early Minor Variants Upon Immune Recognition During Acute Infection

    PubMed Central

    Henn, Matthew R.; Lennon, Niall J.; Power, Karen A.; Macalalad, Alexander R.; Berlin, Aaron M.; Malboeuf, Christine M.; Ryan, Elizabeth M.; Gnerre, Sante; Zody, Michael C.; Erlich, Rachel L.; Green, Lisa M.; Berical, Andrew; Wang, Yaoyu; Casali, Monica; Streeck, Hendrik; Bloom, Allyson K.; Dudek, Tim; Tully, Damien; Newman, Ruchi; Axten, Karen L.; Gladden, Adrianne D.; Battis, Laura; Kemper, Michael; Zeng, Qiandong; Shea, Terrance P.; Gujja, Sharvari; Zedlack, Carmen; Gasser, Olivier; Brander, Christian; Hess, Christoph; Günthard, Huldrych F.; Brumme, Zabrina L.; Brumme, Chanson J.; Bazner, Suzane; Rychert, Jenna; Tinsley, Jake P.; Mayer, Ken H.; Rosenberg, Eric; Pereyra, Florencia; Levin, Joshua Z.; Young, Sarah K.; Jessen, Heiko; Altfeld, Marcus; Birren, Bruce W.; Walker, Bruce D.; Allen, Todd M.

    2012-01-01

    Deep sequencing technologies have the potential to transform the study of highly variable viral pathogens by providing a rapid and cost-effective approach to sensitively characterize rapidly evolving viral quasispecies. Here, we report on a high-throughput whole HIV-1 genome deep sequencing platform that combines 454 pyrosequencing with novel assembly and variant detection algorithms. In one subject we combined these genetic data with detailed immunological analyses to comprehensively evaluate viral evolution and immune escape during the acute phase of HIV-1 infection. The majority of early, low frequency mutations represented viral adaptation to host CD8+ T cell responses, evidence of strong immune selection pressure occurring during the early decline from peak viremia. CD8+ T cell responses capable of recognizing these low frequency escape variants coincided with the selection and evolution of more effective secondary HLA-anchor escape mutations. Frequent, and in some cases rapid, reversion of transmitted mutations was also observed across the viral genome. When located within restricted CD8 epitopes these low frequency reverting mutations were sufficient to prime de novo responses to these epitopes, again illustrating the capacity of the immune response to recognize and respond to low frequency variants. More importantly, rapid viral escape from the most immunodominant CD8+ T cell responses coincided with plateauing of the initial viral load decline in this subject, suggestive of a potential link between maintenance of effective, dominant CD8 responses and the degree of early viremia reduction. We conclude that the early control of HIV-1 replication by immunodominant CD8+ T cell responses may be substantially influenced by rapid, low frequency viral adaptations not detected by conventional sequencing approaches, which warrants further investigation. 
These data support the critical need for vaccine-induced CD8+ T cell responses to target more highly constrained

  6. Exome and deep sequencing of clinically aggressive neuroblastoma reveal somatic mutations that affect key pathways involved in cancer progression

    PubMed Central

    Lasorsa, Vito Alessandro; Formicola, Daniela; Pignataro, Piero; Cimmino, Flora; Calabrese, Francesco Maria; Mora, Jaume; Esposito, Maria Rosaria; Pantile, Marcella; Zanon, Carlo; De Mariano, Marilena; Longo, Luca; Hogarty, Michael D.; de Torres, Carmen; Tonini, Gian Paolo; Iolascon, Achille; Capasso, Mario

    2016-01-01

    The spectrum of somatic mutation of the most aggressive forms of neuroblastoma is not completely determined. We sought to identify potential cancer drivers in clinically aggressive neuroblastoma. Whole exome sequencing was conducted on 17 germline and tumor DNA samples from high-risk patients with adverse events within 36 months from diagnosis (HR-Event3) to identify somatic mutations, and deep targeted sequencing of 134 genes selected from the initial screening was conducted in an additional 48 germline and tumor pairs (62.5% HR-Event3 and high-risk patients), 17 HR-Event3 tumors and 17 human-derived neuroblastoma cell lines. We revealed 22 significantly mutated genes, many of which are implicated in cancer progression. Fifteen genes (68.2%) were highly expressed in neuroblastoma, supporting their involvement in the disease. CHD9, a cancer driver gene, was the most significantly altered (4.0% of cases) after ALK. Other genes (PTK2, NAV3, NAV1, FZD1 and ATRX), expressed in neuroblastoma and involved in cell invasion and migration, were mutated at frequencies ranging from 2% to 4%. Focal adhesion and regulation of actin cytoskeleton pathways were frequently disrupted (14.1% of cases), thus suggesting potential novel therapeutic strategies to prevent disease progression. Notably, BARD1, CHEK2 and AXIN2 were enriched in rare, potentially pathogenic, germline variants. In summary, whole exome and deep targeted sequencing identified novel cancer genes of clinically aggressive neuroblastoma. Our analyses show pathway-level implications of infrequently mutated genes in leading neuroblastoma progression. PMID:27009842

  7. Whole genome deep sequencing of HIV-1 reveals the impact of early minor variants upon immune recognition during acute infection.

    PubMed

    Henn, Matthew R; Boutwell, Christian L; Charlebois, Patrick; Lennon, Niall J; Power, Karen A; Macalalad, Alexander R; Berlin, Aaron M; Malboeuf, Christine M; Ryan, Elizabeth M; Gnerre, Sante; Zody, Michael C; Erlich, Rachel L; Green, Lisa M; Berical, Andrew; Wang, Yaoyu; Casali, Monica; Streeck, Hendrik; Bloom, Allyson K; Dudek, Tim; Tully, Damien; Newman, Ruchi; Axten, Karen L; Gladden, Adrianne D; Battis, Laura; Kemper, Michael; Zeng, Qiandong; Shea, Terrance P; Gujja, Sharvari; Zedlack, Carmen; Gasser, Olivier; Brander, Christian; Hess, Christoph; Günthard, Huldrych F; Brumme, Zabrina L; Brumme, Chanson J; Bazner, Suzane; Rychert, Jenna; Tinsley, Jake P; Mayer, Ken H; Rosenberg, Eric; Pereyra, Florencia; Levin, Joshua Z; Young, Sarah K; Jessen, Heiko; Altfeld, Marcus; Birren, Bruce W; Walker, Bruce D; Allen, Todd M

    2012-01-01

    Deep sequencing technologies have the potential to transform the study of highly variable viral pathogens by providing a rapid and cost-effective approach to sensitively characterize rapidly evolving viral quasispecies. Here, we report on a high-throughput whole HIV-1 genome deep sequencing platform that combines 454 pyrosequencing with novel assembly and variant detection algorithms. In one subject we combined these genetic data with detailed immunological analyses to comprehensively evaluate viral evolution and immune escape during the acute phase of HIV-1 infection. The majority of early, low frequency mutations represented viral adaptation to host CD8+ T cell responses, evidence of strong immune selection pressure occurring during the early decline from peak viremia. CD8+ T cell responses capable of recognizing these low frequency escape variants coincided with the selection and evolution of more effective secondary HLA-anchor escape mutations. Frequent, and in some cases rapid, reversion of transmitted mutations was also observed across the viral genome. When located within restricted CD8 epitopes these low frequency reverting mutations were sufficient to prime de novo responses to these epitopes, again illustrating the capacity of the immune response to recognize and respond to low frequency variants. More importantly, rapid viral escape from the most immunodominant CD8+ T cell responses coincided with plateauing of the initial viral load decline in this subject, suggestive of a potential link between maintenance of effective, dominant CD8 responses and the degree of early viremia reduction. We conclude that the early control of HIV-1 replication by immunodominant CD8+ T cell responses may be substantially influenced by rapid, low frequency viral adaptations not detected by conventional sequencing approaches, which warrants further investigation. 
These data support the critical need for vaccine-induced CD8+ T cell responses to target more highly constrained

  8. Deep sequencing of protease inhibitor resistant HIV patient isolates reveals patterns of correlated mutations in Gag and protease.

    PubMed

    Flynn, William F; Chang, Max W; Tan, Zhiqiang; Oliveira, Glenn; Yuan, Jinyun; Okulicz, Jason F; Torbett, Bruce E; Levy, Ronald M

    2015-04-01

    While the role of drug resistance mutations in HIV protease has been studied comprehensively, mutations in its substrate, Gag, have not been extensively cataloged. Using deep sequencing, we analyzed a unique collection of longitudinal viral samples from 93 patients who have been treated with therapies containing protease inhibitors (PIs). Due to the high sequence coverage within each sample, the frequencies of mutations at individual positions were calculated with high precision. We used this information to characterize the variability in the Gag polyprotein and its effects on PI-therapy outcomes. To examine covariation of mutations between two different sites using deep sequencing data, we developed an approach to estimate the tight bounds on the two-site bivariate probabilities in each viral sample, and the mutual information between pairs of positions based on all the bounds. Utilizing the new methodology we found that mutations in the matrix and p6 proteins contribute to continued therapy failure and have a major role in the network of strongly correlated mutations in the Gag polyprotein, as well as between Gag and protease. Although covariation is not direct evidence of structural propensities, we found the strongest correlations between residues on capsid and matrix of the same Gag protein were often due to structural proximity. This suggests that some of the strongest inter-protein Gag correlations are the result of structural proximity. Moreover, the strong covariation between residues in matrix and capsid at the N-terminus with p1 and p6 at the C-terminus is consistent with residue-residue contacts between these proteins at some point in the viral life cycle.
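
    As a point of reference for the covariation measure mentioned in this abstract, the sketch below computes the standard mutual information between two sites from a fully specified joint probability table. It is not the authors' method: they estimate bounds on the two-site probabilities from deep sequencing reads, whereas this example assumes the joint table is already known. The function name and the example tables are illustrative.

```python
import math


def mutual_information(joint):
    """Mutual information in bits; joint[a][b] = P(site i = a, site j = b)."""
    total = sum(sum(row) for row in joint)
    if abs(total - 1.0) > 1e-9:
        raise ValueError("joint probabilities must sum to 1")
    p_i = [sum(row) for row in joint]        # marginal for site i
    p_j = [sum(col) for col in zip(*joint)]  # marginal for site j
    mi = 0.0
    for a, row in enumerate(joint):
        for b, p in enumerate(row):
            if p > 0:  # zero cells contribute nothing
                mi += p * math.log2(p / (p_i[a] * p_j[b]))
    return mi


independent = [[0.25, 0.25], [0.25, 0.25]]  # mutations occur independently
linked = [[0.5, 0.0], [0.0, 0.5]]           # mutations always co-occur
mi_independent = mutual_information(independent)
mi_linked = mutual_information(linked)
```

    Independent sites give zero bits, while two perfectly linked binary sites with equal frequencies give one bit, the maximum for a 2 x 2 table.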

  9. Deep sequencing of uveal melanoma identifies a recurrent mutation in PLCB4

    PubMed Central

    Johansson, Peter; Aoude, Lauren G.; Wadt, Karin; Glasson, William J.; Warrier, Sunil K.; Hewitt, Alex W.; Kiilgaard, Jens Folke; Heegaard, Steffen; Isaacs, Tim; Franchina, Maria; Ingvar, Christian; Vermeulen, Tersia; Whitehead, Kevin J.; Schmidt, Christopher W.; Palmer, Jane M.; Symmons, Judith; Gerdes, Anne-Marie; Jönsson, Göran; Hayward, Nicholas K.

    2016-01-01

    Next generation sequencing of uveal melanoma (UM) samples has identified a number of recurrent oncogenic or loss-of-function mutations in key driver genes including: GNAQ, GNA11, EIF1AX, SF3B1 and BAP1. To search for additional driver mutations in this tumor type we carried out whole-genome or whole-exome sequencing of 28 tumors or primary cell lines. These samples have a low mutation burden, with a mean of 10.6 protein changing mutations per sample (range 0 to 53). As expected for these sun-shielded melanomas, the mutation spectrum was not consistent with an ultraviolet radiation signature; instead, a BRCA mutation signature predominated. In addition to mutations in the known UM driver genes, we found a recurrent mutation in PLCB4 (c.G1888T, p.D630Y, NM_000933), which was validated using Sanger sequencing. The identical mutation was also found in published UM sequence data (1 of 56 tumors), supporting its role as a novel driver mutation in UM. PLCB4 p.D630Y mutations are mutually exclusive with mutations in GNA11 and GNAQ, consistent with PLCB4 being the canonical downstream target of the former gene products. Taken together, these data suggest that the PLCB4 hotspot mutation is similarly a gain-of-function mutation leading to activation of the same signaling pathway, promoting UM tumorigenesis. PMID:26683228
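
    The two notations for the PLCB4 variant (c.G1888T and p.D630Y) are linked by simple codon arithmetic, sketched below. The abstract does not give the reference codon in NM_000933, so the example checks both standard aspartate codons (GAT and GAC); the helper names are hypothetical and only the four codon-table entries needed here are included.

```python
def codon_of(cdna_pos):
    """Map a 1-based coding position to (codon number, 0-based offset in codon)."""
    return (cdna_pos - 1) // 3 + 1, (cdna_pos - 1) % 3


def apply_snv(codon, offset, alt):
    """Return the codon with the base at `offset` replaced by `alt`."""
    return codon[:offset] + alt + codon[offset + 1:]


# Subset of the standard genetic code: aspartate (D) and tyrosine (Y).
CODON_TABLE = {"GAT": "D", "GAC": "D", "TAT": "Y", "TAC": "Y"}

codon_number, offset = codon_of(1888)  # position 1888 is base 1 of codon 630
changes = {
    ref: CODON_TABLE[apply_snv(ref, offset, "T")]
    for ref in ("GAT", "GAC")          # either possible Asp reference codon
}
```

    Whichever aspartate codon is present, a G to T change at its first base yields a tyrosine codon, which is consistent with the reported p.D630Y.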

  10. De Novo Peptide Sequencing: Deep Mining of High-Resolution Mass Spectrometry Data.

    PubMed

    Islam, Mohammad Tawhidul; Mohamedali, Abidali; Fernandes, Criselda Santan; Baker, Mark S; Ranganathan, Shoba

    2017-01-01

    High resolution mass spectrometry has revolutionized proteomics over the past decade, resulting in tremendous amounts of data in the form of mass spectra, being generated in a relatively short span of time. The mining of this spectral data for analysis and interpretation though has lagged behind such that potentially valuable data is being overlooked because it does not fit into the mold of traditional database searching methodologies. Although the analysis of spectra by de novo sequences removes such biases and has been available for a long period of time, its uptake has been slow or almost nonexistent within the scientific community. In this chapter, we propose a methodology to integrate de novo peptide sequencing using three commonly available software solutions in tandem, complemented by homology searching, and manual validation of spectra. This simplified method would allow greater use of de novo sequencing approaches and potentially greatly increase proteome coverage leading to the unearthing of valuable insights into protein biology, especially of organisms whose genomes have been recently sequenced or are poorly annotated.

  11. Draft Genome Sequences of Thermophiles Isolated from Yates Shaft, a Deep-Subsurface Environment.

    PubMed

    Singh, Nitin K; Carlson, Courtney; Sani, Rajesh K; Venkateswaran, Kasthuri

    2017-06-01

    The whole-genome sequences of seven thermophiles that could grow at >55°C, but not at 37°C, were generated. These thermophilic bacteria will play a useful role as model microorganisms, and analyzing their genomes will help to understand the observed production of novel bioactive compounds, including thermozymes and macromolecules. Copyright © 2017 Singh et al.

  12. Ultra Deep Sequencing of Listeria monocytogenes sRNA Transcriptome Revealed New Antisense RNAs

    PubMed Central

    Behrens, Sebastian; Widder, Stefanie; Mannala, Gopala Krishna; Qing, Xiaoxing; Madhugiri, Ramakanth; Kefer, Nathalie; Mraheil, Mobarak Abu; Rattei, Thomas; Hain, Torsten

    2014-01-01

    Listeria monocytogenes, a gram-positive pathogen and causative agent of listeriosis, has become a widely used model organism for intracellular infections. Recent studies have identified small non-coding RNAs (sRNAs) as important factors for regulating gene expression and pathogenicity of L. monocytogenes. Increased speed and reduced costs of high throughput sequencing (HTS) techniques have made RNA sequencing (RNA-Seq) the state-of-the-art method to study bacterial transcriptomes. We created a large transcriptome dataset of L. monocytogenes containing a total of 21 million reads, using the SOLiD sequencing technology. The dataset contained cDNA sequences generated from L. monocytogenes RNA collected under intracellular and extracellular conditions and additionally was size-fractionated into three different size ranges: <40 nt, 40–150 nt and >150 nt. We report here the identification of nine new sRNA candidates of L. monocytogenes and a reevaluation of known sRNAs of L. monocytogenes EGD-e. Automatic comparison to known sRNAs revealed a high recovery rate of 55%, which was increased to 90% by manual revision of the data. Moreover, thorough classification of known sRNAs shed further light on their possible biological functions. Interestingly, among the newly identified sRNA candidates are antisense RNAs (asRNAs) associated with the housekeeping genes purA, fumC and pgi and potentially their regulation, emphasizing the significance of sRNAs for metabolic adaptation in L. monocytogenes. PMID:24498259

  13. Testing deep reticulate evolution in Amaryllidaceae Tribe Hippeastreae (Asparagales) with ITS and chloroplast sequence data

    USDA-ARS's Scientific Manuscript database

    The phylogeny of Amaryllidaceae tribe Hippeastreae was inferred using chloroplast (3’ycf1, ndhF, trnL-F) and nuclear (ITS rDNA) sequence data under maximum parsimony and maximum likelihood frameworks. Network analyses were applied to resolve conflicting signals among data sets and putative scenarios...

  14. Considering DNA damage when interpreting mtDNA heteroplasmy in deep sequencing data.

    PubMed

    Rathbun, Molly M; McElhoe, Jennifer A; Parson, Walther; Holland, Mitchell M

    2017-01-01

    Resolution of mitochondrial (mt) DNA heteroplasmy is now possible when applying a massively parallel sequencing (MPS) approach, including minor components down to 1%. However, reporting thresholds and interpretation criteria will need to be established for calling heteroplasmic variants that address a number of important topics, one of which is DNA damage. We assessed the impact of increasing amounts of DNA damage on the interpretation of minor component sequence variants in the mtDNA control region, including low-level mixed sites. A passive approach was used to evaluate the impact of storage conditions, and an active approach was employed to accelerate the process of hydrolytic damage (for example, replication errors associated with depurination events). The patterns of damage were compared and assessed in relation to damage typically encountered in poor quality samples. As expected, the number of miscoding lesions increased as conditions worsened. Single nucleotide polymorphisms (SNPs) associated with miscoding lesions were indistinguishable from innate heteroplasmy and were most often observed as 1-2% of the total sequencing reads. Numerous examples of miscoding lesions above 2% were identified, including two complete changes in the nucleotide sequence, presenting a challenge when assessing the placement of reporting thresholds for heteroplasmy. To mitigate the impact, replication of miscoding lesions was not observed in stored samples, and was rarely seen in data associated with accelerated hydrolysis. In addition, a significant decrease in the expected transition:transversion ratio was observed, providing a useful tool for predicting the presence of damage-induced lesions. The results of this study directly impact MPS analysis of minor sequence variants from poorly preserved DNA extracts, and when biological samples have been exposed to agents that induce DNA damage. These findings are particularly relevant to clinical and forensic investigations. 

  15. Deep sequencing and variant analysis of an Italian pathogenic field strain of equine infectious anaemia virus.

    PubMed

    Cappelli, K; Cook, R F; Stefanetti, V; Passamonti, F; Autorino, G L; Scicluna, M T; Coletti, M; Verini Supplizi, A; Capomaccio, S

    2017-03-15

    Equine infectious anaemia virus (EIAV) is a lentivirus with an almost worldwide distribution that causes persistent infections in equids. Technical limitations have restricted genetic analysis of EIAV field isolates predominantly to gag sequences resulting in very little published information concerning the extent of inter-strain variation in pol, env and the three ancillary open reading frames (ORFs). Here, we describe the use of long-range PCR in conjunction with next-generation sequencing (NGS) for rapid molecular characterization of all viral ORFs and known transcription factor binding motifs within the long terminal repeat of two EIAV isolates from the 2006 Italian outbreak. These isolates were from foals believed to have been exposed to the same source material but with different clinical histories: one died 53 days post-infection (SA) while the other (DE) survived 5 months despite experiencing multiple febrile episodes. Nucleotide sequence identity between the isolates was 99.358% confirming infection with the same EIAV strain with most differences comprising single nucleotide polymorphisms in env and the second exon of rev. Although the synonymous:non-synonymous nucleotide substitution ratio was approximately 2:1 in gag and pol, the situation is reversed in env and ORF3 suggesting these sequences are subjected to host-mediated selective pressure. EIAV proviral quasispecies complexity in vivo has not been extensively investigated; however, analysis suggests it was relatively low in SA at the time of death. These results highlight advantages of NGS for molecular characterization of EIAV namely it avoids potential artefacts generated by traditional composite sequencing strategies and can provide information about viral quasispecies complexity. © 2017 Blackwell Verlag GmbH.
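    The nucleotide sequence identity quoted above (99.358%) is a percentage of matching positions between two aligned sequences. The abstract does not state how gaps were treated, so the sketch below makes an assumption: columns containing a gap in either sequence are ignored. The function name and the toy sequences are hypothetical.

```python
# Minimal sketch: percent identity between two pre-aligned sequences.
# Assumption: gap columns are excluded from both numerator and denominator;
# other conventions (counting gaps as mismatches) give lower values.


def percent_identity(aligned_a, aligned_b, gap="-"):
    """Return percent identity over columns where neither sequence has a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    matches = 0
    for x, y in zip(aligned_a.upper(), aligned_b.upper()):
        if x == gap or y == gap:
            continue
        compared += 1
        if x == y:
            matches += 1
    if compared == 0:
        raise ValueError("no comparable (ungapped) columns")
    return 100.0 * matches / compared


# Toy example: one mismatch in ten ungapped columns.
print(percent_identity("ACGTACGTAC", "ACGTACGTAT"))
```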

  16. Contribution of Ultra Deep Sequencing in the Clinical Diagnosis of a New Fungal Pathogen Species: Basidiobolus meristosporus.

    PubMed

    Sitterlé, Emilie; Rodriguez, Christophe; Mounier, Roman; Calderaro, Julien; Foulet, Françoise; Develoux, Michel; Pawlotsky, Jean-Michel; Botterel, Françoise

    2017-01-01

    Some cases of fungal infection remain undiagnosed, especially when the pathogens are uncommon, require specific conditions for in vitro growth, or when several microbial species are present in the specimen. Ultra-Deep Sequencing (UDS) could be considered a precise tool for identifying the pathogens involved in order to improve patient treatment. In this study, we report the implementation of UDS technology in a medical laboratory during the follow-up of an atypical fungal infection case. Thanks to UDS technology, we document the first case of gastro-intestinal basidiobolomycosis (GIB) due to Basidiobolus meristosporus. The diagnosis was suspected after histopathological examination, but conventional microbiological methods failed to supply proof. The final diagnosis was made by means of an original approach based on UDS. DNA was extracted from the embedded colon biopsy obtained after hemicolectomy, and a fragment encompassing the internal transcribed spacer (ITS) rDNA region was PCR-amplified. An amplicon library was then prepared using Genome Sequencer Junior Titanium Kits (Roche/454 Life Sciences) and the library was pyrosequenced on a GS Junior (Roche/454 Life Sciences). Using this method, 2,247 sequences with more than 100 bases were generated and used for UDS analysis. B. meristosporus represented 80% of the sequences, with an average homology of 98.8%. A phylogenetic tree with Basidiobolus reference sequences confirmed the presence of B. meristosporus (bootstrap value of 99%). Conclusion: UDS-based diagnostic approaches are ready to be integrated into conventional diagnostic testing to improve documentation of infectious disease and the therapeutic management of patients.

  17. Contribution of Ultra Deep Sequencing in the Clinical Diagnosis of a New Fungal Pathogen Species: Basidiobolus meristosporus

    PubMed Central

    Sitterlé, Emilie; Rodriguez, Christophe; Mounier, Roman; Calderaro, Julien; Foulet, Françoise; Develoux, Michel; Pawlotsky, Jean-Michel; Botterel, Françoise

    2017-01-01

    Some cases of fungal infection remain undiagnosed, especially when the pathogens are uncommon, require specific conditions for in vitro growth, or when several microbial species are present in the specimen. Ultra-Deep Sequencing (UDS) could be considered a precise tool for identifying the pathogens involved in order to improve patient treatment. In this study, we report the implementation of UDS technology in a medical laboratory during the follow-up of an atypical fungal infection case. Thanks to UDS technology, we document the first case of gastro-intestinal basidiobolomycosis (GIB) due to Basidiobolus meristosporus. The diagnosis was suspected after histopathological examination, but conventional microbiological methods failed to supply proof. The final diagnosis was made by means of an original approach based on UDS. DNA was extracted from the embedded colon biopsy obtained after hemicolectomy, and a fragment encompassing the internal transcribed spacer (ITS) rDNA region was PCR-amplified. An amplicon library was then prepared using Genome Sequencer Junior Titanium Kits (Roche/454 Life Sciences) and the library was pyrosequenced on a GS Junior (Roche/454 Life Sciences). Using this method, 2,247 sequences with more than 100 bases were generated and used for UDS analysis. B. meristosporus represented 80% of the sequences, with an average homology of 98.8%. A phylogenetic tree with Basidiobolus reference sequences confirmed the presence of B. meristosporus (bootstrap value of 99%). Conclusion: UDS-based diagnostic approaches are ready to be integrated into conventional diagnostic testing to improve documentation of infectious disease and the therapeutic management of patients. PMID:28326064

  18. Identification and profiling of novel microRNAs in the Brassica rapa genome based on small RNA deep sequencing

    PubMed Central

    2012-01-01

    Background: MicroRNAs (miRNAs) are one of the functional non-coding small RNAs involved in the epigenetic control of the plant genome. Although plants contain both evolutionarily conserved miRNAs and species-specific miRNAs within their genomes, computational methods often only identify evolutionarily conserved miRNAs. The recent sequencing of the Brassica rapa genome enables us to identify miRNAs and their putative target genes. In this study, we sought to provide a more comprehensive prediction of B. rapa miRNAs based on high throughput small RNA deep sequencing. Results: We sequenced small RNAs from five types of tissue: seedlings, roots, petioles, leaves, and flowers. By analyzing 2.75 million unique reads that mapped to the B. rapa genome, we identified 216 novel and 196 conserved miRNAs that were predicted to target approximately 20% of the genome’s protein coding genes. Quantitative analysis of miRNAs from the five types of tissue revealed that novel miRNAs were expressed in diverse tissues but their expression levels were lower than those of the conserved miRNAs. Comparative analysis of the miRNAs between the B. rapa and Arabidopsis thaliana genomes demonstrated that redundant copies of conserved miRNAs in the B. rapa genome may have been deleted after whole genome triplication. Novel miRNA members seemed to have spontaneously arisen from the B. rapa and A. thaliana genomes, suggesting the species-specific expansion of miRNAs. We have made this data publicly available in a miRNA database of B. rapa called BraMRs. The database allows the user to retrieve miRNA sequences, their expression profiles, and a description of their target genes from the five tissue types investigated here. Conclusions: This is the first report to identify novel miRNAs from Brassica crops using genome-wide high throughput techniques. The combination of computational methods and small RNA deep sequencing provides robust predictions of miRNAs in the genome. The finding of numerous novel mi

  19. Identification of representative genes of the central nervous system of the locust, Locusta migratoria manilensis by deep sequencing.

    PubMed

    Zhang, Zhengyi; Peng, Zhi-Yu; Yi, Kang; Cheng, Yanbing; Xia, Yuxian

    2012-01-01

    The shortage of available genomic and transcriptomic data hampers molecular study of the central nervous system (CNS) of the migratory locust, Locusta migratoria manilensis (L.) (Orthoptera: Acrididae). In this study, locust CNS RNA was sequenced by deep sequencing. 41,179 unigenes were obtained with an average length of 570 bp, and 5,519 unigenes were longer than 1,000 bp. Compared with an EST database of another locust species, Schistocerca gregaria Forsskål, 9,069 unigenes were found conserved, while 32,110 unigenes were differentially expressed. A total of 15,895 unigenes were identified, including 644 nervous system relevant unigenes. Among the 25,284 unknown unigenes, 9,482 were found to be specific to the CNS by filtering out previously acquired ESTs from locust organs other than the CNS. The locust CNS showed the most matches (18%) with Tribolium castaneum (Herbst) (Coleoptera: Tenebrionidae) sequences. Comprehensive assessment reveals that the database generated in this study is broadly representative of the CNS of adult locust, providing comprehensive gene information at the transcriptional level that could facilitate research on the locust CNS, including various physiological aspects and pesticide target finding.

  20. MiRNA Expression Profile for the Human Gastric Antrum Region Using Ultra-Deep Sequencing

    PubMed Central

    Hamoy, Igor G.; Darnet, Sylvain; Burbano, Rommel; Khayat, André; Gonçalves, André Nicolau; Alencar, Dayse O.; Cruz, Aline; Magalhães, Leandro; Araújo Jr., Wilson; Silva, Artur; Santos, Sidney; Demachki, Samia; Assumpção, Paulo; Ribeiro-dos-Santos, Ândrea

    2014-01-01

    Background: MicroRNAs are small non-coding nucleotide sequences that regulate gene expression. These structures are fundamental to several biological processes, including cell proliferation, development, differentiation and apoptosis. Identifying the expression profile of microRNAs in healthy human gastric antrum mucosa may help elucidate the miRNA regulatory mechanisms of the human stomach. Methodology/Principal Findings: A small RNA library of stomach antrum tissue was sequenced using high-throughput SOLiD sequencing technology. The total read count for the gastric mucosa antrum region was greater than 618,000. After filtering and aligning with miRBase, 148 mature miRNAs were identified in the gastric antrum tissue, totaling 3,181 quality reads; 63.5% (2,021) of the reads were concentrated in the eight most highly expressed miRNAs (hsa-mir-145, hsa-mir-29a, hsa-mir-29c, hsa-mir-21, hsa-mir-451a, hsa-mir-192, hsa-mir-191 and hsa-mir-148a). RT-PCR validated the expression profiles of seven of these highly expressed miRNAs and confirmed the sequencing results obtained using the SOLiD platform. Conclusions/Significance: In comparison with other tissues, the antrum’s expression profile was unique with respect to the most highly expressed miRNAs, suggesting that this expression profile is specific to stomach antrum tissue. The current study provides a starting point for a more comprehensive understanding of the role of miRNAs in the regulation of the molecular processes of the human stomach. PMID:24647245

  1. Transcriptome dynamics through alternative polyadenylation in developmental and environmental responses in plants revealed by deep sequencing

    PubMed Central

    Shen, Yingjia; Venu, R.C.; Nobuta, Kan; Wu, Xiaohui; Notibala, Varun; Demirci, Caghan; Meyers, Blake C.; Wang, Guo-Liang; Ji, Guoli; Li, Qingshun Q.

    2011-01-01

    Polyadenylation sites mark the ends of mRNA transcripts. Alternative polyadenylation (APA) may alter sequence elements and/or the coding capacity of transcripts, a mechanism that has been demonstrated to regulate gene expression and transcriptome diversity. To study the role of APA in transcriptome dynamics, we analyzed a large-scale data set of RNA “tags” that signify poly(A) sites and expression levels of mRNA. These tags were derived from a wide range of tissues and developmental stages that were mutated or exposed to environmental treatments, and generated using digital gene expression (DGE)–based protocols of the massively parallel signature sequencing (MPSS-DGE) and the Illumina sequencing-by-synthesis (SBS-DGE) sequencing platforms. The data offer a global view of APA and how it contributes to transcriptome dynamics. Upon analysis of these data, we found that ∼60% of Arabidopsis genes have multiple poly(A) sites. Likewise, ∼47% and 82% of rice genes use APA, supported by MPSS-DGE and SBS-DGE tags, respectively. In both species, ∼49%–66% of APA events were mapped upstream of annotated stop codons. Interestingly, 10% of the transcriptomes are made up of APA transcripts that are differentially distributed among developmental stages and in tissues responding to environmental stresses, providing an additional level of transcriptome dynamics. Examples of pollen-specific APA switching and salicylic acid treatment-specific APA clearly demonstrated such dynamics. The significance of these APAs is more evident in the 3034 genes that have conserved APA events between rice and Arabidopsis. PMID:21813626

  2. Deep Sequencing of the Oral Microbiome Reveals Signatures of Periodontal Disease

    PubMed Central

    Ghodsi, Mohammad; Sommer, Daniel D.; Gibbons, Theodore R.; Treangen, Todd J.; Chang, Yi-Chien; Li, Shan; Stine, O. Colin; Hasturk, Hatice; Kasif, Simon; Segrè, Daniel; Pop, Mihai; Amar, Salomon

    2012-01-01

    The oral microbiome, the complex ecosystem of microbes inhabiting the human mouth, harbors several thousands of bacterial types. The proliferation of pathogenic bacteria within the mouth gives rise to periodontitis, an inflammatory disease known to also constitute a risk factor for cardiovascular disease. While much is known about individual species associated with pathogenesis, the system-level mechanisms underlying the transition from health to disease are still poorly understood. Through the sequencing of the 16S rRNA gene and of whole community DNA we provide a glimpse at the global genetic, metabolic, and ecological changes associated with periodontitis in 15 subgingival plaque samples, four from each of two periodontitis patients, and the remaining samples from three healthy individuals. We also demonstrate the power of whole-metagenome sequencing approaches in characterizing the genomes of key players in the oral microbiome, including an unculturable TM7 organism. We reveal the disease microbiome to be enriched in virulence factors, and adapted to a parasitic lifestyle that takes advantage of the disrupted host homeostasis. Furthermore, diseased samples share a common structure that was not found in completely healthy samples, suggesting that the disease state may occupy a narrow region within the space of possible configurations of the oral microbiome. Our pilot study demonstrates the power of high-throughput sequencing as a tool for understanding the role of the oral microbiome in periodontal disease. Despite a modest level of sequencing (∼2 lanes Illumina 76 bp PE) and high human DNA contamination (up to ∼90%) we were able to partially reconstruct several oral microbes and to preliminarily characterize some systems-level differences between the healthy and diseased oral microbiomes. PMID:22675498

  3. Deep sequencing identifies viral and wasp genes with potential roles in replication of Microplitis demolitor Bracovirus.

    PubMed

    Burke, Gaelen R; Strand, Michael R

    2012-03-01

    Viruses in the genus Bracovirus (BV) (Polydnaviridae) are symbionts of parasitoid wasps that specifically replicate in the ovaries of females. Recent analysis of expressed sequence tags from two wasp species, Cotesia congregata and Chelonus inanitus, identified transcripts related to 24 different nudivirus genes. These results together with other data strongly indicate that BVs evolved from a nudivirus ancestor. However, it remains unclear whether BV-carrying wasps contain other nudivirus-like genes and what types of wasp genes may also be required for BV replication. Microplitis demolitor carries Microplitis demolitor bracovirus (MdBV). Here we characterized MdBV replication and performed massively parallel sequencing of M. demolitor ovary transcripts. Our results indicated that MdBV replication begins in stage 2 pupae and continues in adults. Analysis of prereplication- and active-replication-stage ovary RNAs yielded 22 Gb of sequence that assembled into 66,425 transcripts. This breadth of sampling indicated that a large percentage of genes in the M. demolitor genome were sequenced. A total of 41 nudivirus-like transcripts were identified, of which a majority were highly expressed during MdBV replication. Our results also identified a suite of wasp genes that were highly expressed during MdBV replication. Among these products were several transcripts with conserved roles in regulating locus-specific DNA amplification by eukaryotes. Overall, our data set together with prior results likely identify the majority of nudivirus-related genes that are transcriptionally functional during BV replication. Our results also suggest that amplification of proviral DNAs for packaging into BV virions may depend upon the replication machinery of wasps.

  4. High diversity of picornaviruses in rats from different continents revealed by deep sequencing

    PubMed Central

    Hansen, Thomas Arn; Mollerup, Sarah; Nguyen, Nam-phuong; White, Nicole E; Coghlan, Megan; Alquezar-Planas, David E; Joshi, Tejal; Jensen, Randi Holm; Fridholm, Helena; Kjartansdóttir, Kristín Rós; Mourier, Tobias; Warnow, Tandy; Belsham, Graham J; Bunce, Michael; Willerslev, Eske; Nielsen, Lars Peter; Vinner, Lasse; Hansen, Anders Johannes

    2016-01-01

    Outbreaks of zoonotic diseases in humans and livestock are not uncommon, and an important component in containment of such emerging viral diseases is rapid and reliable diagnostics. Such methods are often PCR-based and hence require the availability of sequence data from the pathogen. Rattus norvegicus (R. norvegicus) is a known reservoir for important zoonotic pathogens. Transmission may be direct via contact with the animal, for example, through exposure to its faecal matter, or indirectly mediated by arthropod vectors. Here we investigated the viral content in rat faecal matter (n=29) collected from two continents by analyzing 2.2 billion next-generation sequencing reads derived from both DNA and RNA. Among other virus families, we found sequences from members of the Picornaviridae to be abundant in the microbiome of all the samples. Here we describe the diversity of the picornavirus-like contigs including near-full-length genomes closely related to the Boone cardiovirus and Theiler's encephalomyelitis virus. From this study, we conclude that picornaviruses within R. norvegicus are more diverse than previously recognized. The virome of R. norvegicus should be investigated further to assess the full potential for zoonotic virus transmission. PMID:27530749

  5. New mutations in chronic lymphocytic leukemia identified by target enrichment and deep sequencing.

    PubMed

    Doménech, Elena; Gómez-López, Gonzalo; Gzlez-Peña, Daniel; López, Mar; Herreros, Beatriz; Menezes, Juliane; Gómez-Lozano, Natalia; Carro, Angel; Graña, Osvaldo; Pisano, David G; Domínguez, Orlando; García-Marco, José A; Piris, Miguel A; Sánchez-Beato, Margarita

    2012-01-01

    Chronic lymphocytic leukemia (CLL) is a heterogeneous disease without a well-defined genetic alteration responsible for the onset of the disease. Several lines of evidence coincide in identifying stimulatory and growth signals delivered by B-cell receptor (BCR), and co-receptors together with NFkB pathway, as being the driving force in B-cell survival in CLL. However, the molecular mechanism responsible for this activation has not been identified. Based on the hypothesis that BCR activation may depend on somatic mutations of the BCR and related pathways we have performed a complete mutational screening of 301 selected genes associated with BCR signaling and related pathways using massive parallel sequencing technology in 10 CLL cases. Four mutated genes in coding regions (KRAS, SMARCA2, NFKBIE and PRKD3) have been confirmed by capillary sequencing. In conclusion, this study identifies new genes mutated in CLL, all of them in cases with progressive disease, and demonstrates that next-generation sequencing technologies applied to selected genes or pathways of interest are powerful tools for identifying novel mutational changes.

  6. High diversity of picornaviruses in rats from different continents revealed by deep sequencing.

    PubMed

    Hansen, Thomas Arn; Mollerup, Sarah; Nguyen, Nam-Phuong; White, Nicole E; Coghlan, Megan; Alquezar-Planas, David E; Joshi, Tejal; Jensen, Randi Holm; Fridholm, Helena; Kjartansdóttir, Kristín Rós; Mourier, Tobias; Warnow, Tandy; Belsham, Graham J; Bunce, Michael; Willerslev, Eske; Nielsen, Lars Peter; Vinner, Lasse; Hansen, Anders Johannes

    2016-08-17

    Outbreaks of zoonotic diseases in humans and livestock are not uncommon, and an important component in containment of such emerging viral diseases is rapid and reliable diagnostics. Such methods are often PCR-based and hence require the availability of sequence data from the pathogen. Rattus norvegicus (R. norvegicus) is a known reservoir for important zoonotic pathogens. Transmission may be direct via contact with the animal, for example, through exposure to its faecal matter, or indirectly mediated by arthropod vectors. Here we investigated the viral content in rat faecal matter (n=29) collected from two continents by analyzing 2.2 billion next-generation sequencing reads derived from both DNA and RNA. Among other virus families, we found sequences from members of the Picornaviridae to be abundant in the microbiome of all the samples. Here we describe the diversity of the picornavirus-like contigs including near-full-length genomes closely related to the Boone cardiovirus and Theiler's encephalomyelitis virus. From this study, we conclude that picornaviruses within R. norvegicus are more diverse than previously recognized. The virome of R. norvegicus should be investigated further to assess the full potential for zoonotic virus transmission.

  7. MicroRNA repertoire for functional genome research in tilapia identified by deep sequencing.

    PubMed

    Yan, Biao; Wang, Zhen-Hua; Zhu, Chang-Dong; Guo, Jin-Tao; Zhao, Jin-Liang

    2014-08-01

    The Nile tilapia (Oreochromis niloticus; Cichlidae) is an economically important species in aquaculture and occupies a prominent position in the aquaculture industry. MicroRNAs (miRNAs) are a class of noncoding RNAs that post-transcriptionally regulate gene expression involved in diverse biological and metabolic processes. To increase the repertoire of miRNAs characterized in tilapia, we used the Illumina/Solexa sequencing technology to sequence a small RNA library using a pooled RNA sample isolated from different developmental stages of tilapia. Bioinformatic analyses suggest that 197 conserved and 27 novel miRNAs are expressed in tilapia. Sequence alignments indicate that all tested miRNAs and miRNAs* are highly conserved across many species. In addition, we characterized the tissue expression patterns of five miRNAs using real-time quantitative PCR. We found that miR-1/206, miR-7/9, and miR-122 are abundantly expressed in muscle, brain, and liver, respectively, implying a potential role in the regulation of tissue differentiation or the maintenance of tissue identity. Overall, our results expand the number of tilapia miRNAs, and the discovery of miRNAs in the tilapia genome contributes to a better understanding of the role of miRNAs in regulating diverse biological processes.

  8. Deep sequencing uncovers protistan plankton diversity in the Portuguese Ria Formosa solar saltern ponds.

    PubMed

    Filker, Sabine; Gimmler, Anna; Dunthorn, Micah; Mahé, Frédéric; Stoeck, Thorsten

    2015-03-01

    We used high-throughput sequencing to unravel the genetic diversity of protistan (including fungal) plankton in hypersaline ponds of the Ria Formosa solar saltern works in Portugal. From three ponds of different salinity (4, 12 and 38 %), we obtained ca. 105,000 amplicons (V4 region of the SSU rDNA). The genetic diversity we found was higher than what has been described from solar saltern ponds thus far by microscopy or molecular studies. The obtained operational taxonomic units (OTUs) could be assigned to 14 high-rank taxonomic groups and blasted to 120 eukaryotic families. The novelty of this genetic diversity was extremely high, with 27 % of all OTUs having a sequence divergence of more than 10 % to deposited sequences of described taxa. The highest degree of novelty was found at intermediate salinity of 12 % within the ciliates, which traditionally are considered as the best known and described taxon group within the kingdom Protista. Further substantial novelty was detected within the stramenopiles and the chlorophytes. Analyses of community structures suggest a transition boundary for protistan plankton between 4 and 12 % salinity, suggesting different haloadaptation strategies in individual evolutionary lineages as a result of environmental filtering. Our study makes evident the gaps in our knowledge not only of protistan and fungal plankton diversity in hypersaline environments, but also in their ecology and their strategies to cope with these environmental conditions. It substantiates that specific future research needs to fill these gaps.

  9. Single nucleotide polymorphism discovery in rainbow trout by deep sequencing of a reduced representation library.

    PubMed

    Sánchez, Cecilia Castaño; Smith, Timothy P L; Wiedmann, Ralph T; Vallejo, Roger L; Salem, Mohamed; Yao, Jianbo; Rexroad, Caird E

    2009-11-25

    To enhance capabilities for genomic analyses in rainbow trout, such as genomic selection, a large suite of polymorphic markers that are amenable to high-throughput genotyping protocols must be identified. Expressed Sequence Tags (ESTs) have been used for single nucleotide polymorphism (SNP) discovery in salmonids. In those strategies, the salmonid semi-tetraploid genomes often led to assemblies of paralogous sequences and therefore resulted in a high rate of false positive SNP identification. Sequencing genomic DNA using primers identified from ESTs proved to be an effective but time consuming methodology of SNP identification in rainbow trout, therefore not suitable for high throughput SNP discovery. In this study, we employed a high-throughput strategy that used pyrosequencing technology to generate data from a reduced representation library constructed with genomic DNA pooled from 96 unrelated rainbow trout that represent the National Center for Cool and Cold Water Aquaculture (NCCCWA) broodstock population. The reduced representation library consisted of 440 bp fragments resulting from complete digestion with the restriction enzyme HaeIII; sequencing produced 2,000,000 reads providing an average 6 fold coverage of the estimated 150,000 unique genomic restriction fragments (300,000 fragment ends). Three independent data analyses identified 22,022 to 47,128 putative SNPs on 13,140 to 24,627 independent contigs. A set of 384 putative SNPs, randomly selected from the sets produced by the three analyses were genotyped on individual fish to determine the validation rate of putative SNPs among analyses, distinguish apparent SNPs that actually represent paralogous loci in the tetraploid genome, examine Mendelian segregation, and place the validated SNPs on the rainbow trout linkage map. Approximately 48% (183) of the putative SNPs were validated; 167 markers were successfully incorporated into the rainbow trout linkage map. In addition, 2% of the sequences from the

  10. Predicting backbone Cα angles and dihedrals from protein sequences by stacked sparse auto-encoder deep neural network.

    PubMed

    Lyons, James; Dehzangi, Abdollah; Heffernan, Rhys; Sharma, Alok; Paliwal, Kuldip; Sattar, Abdul; Zhou, Yaoqi; Yang, Yuedong

    2014-10-30

    Because of the nearly constant distance between two neighbouring Cα atoms, the local backbone structure of proteins can be represented accurately by the angle formed by C(αi-1)-C(αi)-C(αi+1) (θ) and a dihedral angle rotated about the C(αi)-C(αi+1) bond (τ). θ and τ angles, as representatives of the structural properties of three to four amino-acid residues, offer a description of backbone conformations that is complementary to φ and ψ angles (single residue) and secondary structures (>3 residues). Here, we report the first machine-learning technique for sequence-based prediction of θ and τ angles. Predicted angles based on an independent test have a mean absolute error of 9° for θ and 34° for τ with a distribution on the θ-τ plane close to that of native values. The average root-mean-square distance of 10-residue fragment structures constructed from predicted θ and τ angles is only 1.9 Å from their corresponding native structures. Predicted θ and τ angles are expected to be complementary to predicted φ and ψ angles and secondary structures for use in model validation and template-based as well as template-free structure prediction. The deep neural network learning technique is available as an on-line server called Structural Property prediction with Integrated DEep neuRal network (SPIDER) at http://sparks-lab.org.
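    The θ and τ angles described above can be computed directly from Cα coordinates. The sketch below is not the SPIDER predictor, which works from sequence; it only shows the geometric definitions, assuming that τ is the dihedral over the four consecutive atoms Cα(i-1), Cα(i), Cα(i+1), Cα(i+2). All function names and the toy coordinates are hypothetical.

```python
# Minimal sketch: theta (three-atom angle) and tau (four-atom dihedral)
# from Calpha coordinates given as (x, y, z) tuples. Angles are in degrees.
import math


def _sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])


def _dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def _norm(a):
    return math.sqrt(_dot(a, a))


def theta(p_prev, p_mid, p_next):
    """Angle at p_mid formed by Ca(i-1)-Ca(i)-Ca(i+1)."""
    u = _sub(p_prev, p_mid)
    v = _sub(p_next, p_mid)
    cos_t = _dot(u, v) / (_norm(u) * _norm(v))
    # Clamp to guard against rounding slightly outside [-1, 1].
    return math.degrees(math.acos(max(-1.0, min(1.0, cos_t))))


def tau(p0, p1, p2, p3):
    """Dihedral about the p1-p2 bond, using the atan2 formulation."""
    b1, b2, b3 = _sub(p1, p0), _sub(p2, p1), _sub(p3, p2)
    n1, n2 = _cross(b1, b2), _cross(b2, b3)
    y = _norm(b2) * _dot(b1, n2)
    x = _dot(n1, n2)
    return math.degrees(math.atan2(y, x))


def ca_angles(coords):
    """Return (thetas, taus) for a chain of Calpha coordinates."""
    n = len(coords)
    thetas = [theta(coords[i - 1], coords[i], coords[i + 1])
              for i in range(1, n - 1)]
    taus = [tau(*coords[i - 1:i + 3]) for i in range(1, n - 2)]
    return thetas, taus


# Toy four-atom chain (not a real protein fragment).
toy_chain = [(1, 0, 0), (0, 0, 0), (0, 0, 1), (0, 1, 1)]
print(ca_angles(toy_chain))
```

    For a chain of n Cα atoms this yields n-2 θ values and n-3 τ values, since the terminal residues lack the neighbours needed to define the angles.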

  11. Exploring the Gastrointestinal “Nemabiome”: Deep Amplicon Sequencing to Quantify the Species Composition of Parasitic Nematode Communities

    PubMed Central

    Avramenko, Russell W.; Redman, Elizabeth M.; Lewis, Roy; Yazwinski, Thomas A.; Wasmuth, James D.; Gilleard, John S.

    2015-01-01

    Parasitic helminth infections have a considerable impact on global human health as well as animal welfare and production. Although co-infection with multiple parasite species within a host is common, there is a dearth of tools with which to study the composition of these complex parasite communities. Helminth species vary in their pathogenicity, epidemiology and drug sensitivity and the interactions that occur between co-infecting species and their hosts are poorly understood. We describe the first application of deep amplicon sequencing to study parasitic nematode communities as well as introduce the concept of the gastro-intestinal “nemabiome”. The approach is analogous to 16S rDNA deep sequencing used to explore microbial communities, but utilizes the nematode ITS-2 rDNA locus instead. Gastro-intestinal parasites of cattle were used to develop the concept, as this host has many well-defined gastro-intestinal nematode species that commonly occur as complex co-infections. Further, the availability of pure mono-parasite populations from experimentally infected cattle allowed us to prepare mock parasite communities to determine, and correct for, species representation biases in the sequence data. We demonstrate that, once these biases have been corrected, accurate relative quantitation of gastro-intestinal parasitic nematode communities in cattle fecal samples can be achieved. We have validated the accuracy of the method applied to field-samples by comparing the results of detailed morphological examination of L3 larvae populations with those of the sequencing assay. The results illustrate the insights that can be gained into the species composition of parasite communities, using grazing cattle in the mid-west USA as an example. However, both the technical approach and the concept of the ‘nemabiome’ have a wide range of potential applications in human and veterinary medicine. These include investigations of host-parasite and parasite-parasite interactions

  12. Deep Sequencing-Based Analysis of the Cymbidium ensifolium Floral Transcriptome

    PubMed Central

    Li, Xiaobai; Luo, Jie; Yan, Tianlian; Xiang, Lin; Jin, Feng; Qin, Dehui; Sun, Chongbo; Xie, Ming

    2013-01-01

    Cymbidium ensifolium is a Chinese Cymbidium with an elegant shape, beautiful appearance, and a fragrant aroma. C. ensifolium has a long history of cultivation in China and it has excellent commercial value as a potted plant and cut flower. The development of C. ensifolium genomic resources has been delayed because of its large genome size. Taking advantage of technical and cost improvement of RNA-Seq, we extracted total mRNA from flower buds and mature flowers and obtained a total of 9.52 Gb of filtered nucleotides comprising 98,819,349 filtered reads. The filtered reads were assembled into 101,423 isotigs, representing 51,696 genes. Of the 101,423 isotigs, 41,873 were putative homologs of annotated sequences in the public databases, of which 158 were associated with floral development and 119 were associated with flowering. The isotigs were categorized according to their putative functions. In total, 10,212 of the isotigs were assigned into 25 eukaryotic orthologous groups (KOGs), 41,690 into 58 gene ontology (GO) terms, and 9,830 into 126 Arabidopsis Kyoto Encyclopedia of Genes and Genomes (KEGG) pathways, and 9,539 isotigs into 123 rice pathways. Comparison of the isotigs with those of the two related orchid species P. equestris and C. sinense showed that 17,906 isotigs are unique to C. ensifolium. In addition, a total of 7,936 SSRs and 16,676 putative SNPs were identified. To our knowledge, this transcriptome database is the first major genomic resource for C. ensifolium and the most comprehensive transcriptomic resource for genus Cymbidium. These sequences provide valuable information for understanding the molecular mechanisms of floral development and flowering. Sequences predicted to be unique to C. ensifolium would provide more insights into C. ensifolium gene diversity. The numerous SNPs and SSRs identified in the present study will contribute to marker development for C. ensifolium. PMID:24392013
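
    The counts quoted in this abstract can be turned into a few summary ratios. The sketch below is only illustrative arithmetic on the figures above. It assumes that "Gb" means 10^9 bases, and the variable names are hypothetical rather than taken from the paper.

```python
# Figures quoted in the abstract (assumption: 1 Gb = 1e9 bases).
filtered_bases = 9.52e9
filtered_reads = 98_819_349
isotigs = 101_423
genes = 51_696
annotated_isotigs = 41_873   # putative homologs of annotated sequences
unique_isotigs = 17_906      # reported as unique to C. ensifolium

# Derived ratios; none of these values is stated in the abstract itself.
mean_read_length = filtered_bases / filtered_reads   # bases per filtered read
isotigs_per_gene = isotigs / genes
annotated_fraction = annotated_isotigs / isotigs
unique_fraction = unique_isotigs / isotigs

print(f"mean filtered read length: {mean_read_length:.1f} bp")
print(f"isotigs per gene:          {isotigs_per_gene:.2f}")
print(f"annotated isotigs:         {annotated_fraction:.1%}")
print(f"species-unique isotigs:    {unique_fraction:.1%}")
```

    Under these assumptions the mean filtered read length is a little under 100 bp, and roughly two in five isotigs have a putative database homolog.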

  13. Identification of an NAC Transcription Factor Family by Deep Transcriptome Sequencing in Onion (Allium cepa L.)

    PubMed Central

    Zhu, Siyuan; Dai, Qiuzhong; Liu, Touming

    2016-01-01

    Although onion has been used extensively in the past for cytogenetic studies, molecular analysis has been lacking because the availability of genetic resources is limited. NAM, ATAF, and CUC (NAC) transcription factors (TFs) are plant-specific proteins, and they play key roles in plant growth, development, and stress tolerance. However, none of the onion NAC (CepNAC) genes had been identified thus far. In this study, the transcriptome of onion leaves was analyzed by Illumina paired-end sequencing. Approximately 102.9 million clean sequence reads were produced and used for de novo assembly, which generated 117,189 non-redundant transcripts. Of these transcripts, 39,472 were annotated for their function. In order to mine the CepNAC TFs, CepNAC genes were searched from the transcripts assembled, resulting in the identification of all 39 CepNAC genes. These 39 CepNAC proteins were subjected to phylogenetic analysis together with 47 NAC proteins of known function that were previously identified in other species. The results showed that they can be divided into five groups (NAC-I–V). Interestingly, the NAC-IV and -V groups were found to be likely related to the processes of secondary wall synthesis and stress response, respectively. The transcriptome analysis generated a substantial amount of transcripts, which will aid immensely in identifying important genes and accelerating our understanding of onion growth and development. Moreover, the discovery of 39 CepNAC TFs and the identification of the sequence conservation between them and NAC proteins published will provide a basis for further characterization and validation of their functions in the future. PMID:27331904

  14. Sequence-of-Events-Driven Automation of the Deep Space Network

    NASA Technical Reports Server (NTRS)

    Hill, R., Jr.; Fayyad, K.; Smyth, C.; Santos, T.; Chen, R.; Chien, S.; Bevan, R.

    1996-01-01

    In February 1995, sequence-of-events (SOE)-driven automation technology was demonstrated for a Voyager telemetry downlink track at DSS 13. This demonstration entailed automated generation of an operations procedure (in the form of a temporal dependency network) from project SOE information using artificial intelligence planning technology and automated execution of the temporal dependency network using the link monitor and control operator assistant system. This article describes the overall approach to SOE-driven automation that was demonstrated, identifies gaps in SOE definitions and project profiles that hamper automation, and provides detailed measurements of the knowledge engineering effort required for automation.

  15. Sequence-of-events-driven automation of the deep space network

    NASA Technical Reports Server (NTRS)

    Hill, R., Jr.; Fayyad, K.; Smyth, C.; Santos, T.; Chen, R.; Chien, S.; Bevan, R.

    1996-01-01

    In February 1995, sequence-of-events (SOE)-driven automation technology was demonstrated for a Voyager telemetry downlink track at DSS 13. This demonstration entailed automated generation of an operations procedure (in the form of a temporal dependency network) from project SOE information using artificial intelligence planning technology and automated execution of the temporal dependency network using the link monitor and control operator assistant system. This article describes the overall approach to SOE-driven automation that was demonstrated, identifies gaps in SOE definitions and project profiles that hamper automation, and provides detailed measurements of the knowledge engineering effort required for automation.

  16. Deep Sequencing of Plant and Animal DNA Contained within Traditional Chinese Medicines Reveals Legality Issues and Health Safety Concerns

    PubMed Central

    Coghlan, Megan L.; Haile, James; Houston, Jayne; Murray, Dáithí C.; White, Nicole E.; Moolhuijzen, Paula; Bellgard, Matthew I.; Bunce, Michael

    2012-01-01

    Traditional Chinese medicine (TCM) has been practiced for thousands of years, but only within the last few decades has its use become more widespread outside of Asia. Concerns continue to be raised about the efficacy, legality, and safety of many popular complementary alternative medicines, including TCMs. Ingredients of some TCMs are known to include derivatives of endangered, trade-restricted species of plants and animals, and therefore contravene the Convention on International Trade in Endangered Species (CITES) legislation. Chromatographic studies have detected the presence of heavy metals and plant toxins within some TCMs, and there are numerous cases of adverse reactions. It is in the interests of both biodiversity conservation and public safety that techniques are developed to screen medicinals like TCMs. Targeting both the p-loop region of the plastid trnL gene and the mitochondrial 16S ribosomal RNA gene, over 49,000 amplicon sequence reads were generated from 15 TCM samples presented in the form of powders, tablets, capsules, bile flakes, and herbal teas. Here we show that second-generation, high-throughput sequencing (HTS) of DNA represents an effective means to genetically audit organic ingredients within complex TCMs. Comparison of DNA sequence data to reference databases revealed the presence of 68 different plant families and included genera, such as Ephedra and Asarum, that are potentially toxic. Similarly, animal families were identified that include genera that are classified as vulnerable, endangered, or critically endangered, including Asiatic black bear (Ursus thibetanus) and Saiga antelope (Saiga tatarica). Bovidae, Cervidae, and Bufonidae DNA were also detected in many of the TCM samples and were rarely declared on the product packaging. This study demonstrates that deep sequencing via HTS is an efficient and cost-effective way to audit highly processed TCM products and will assist in monitoring their legality and safety especially when

  17. Deep sequencing of plant and animal DNA contained within traditional Chinese medicines reveals legality issues and health safety concerns.

    PubMed

    Coghlan, Megan L; Haile, James; Houston, Jayne; Murray, Dáithí C; White, Nicole E; Moolhuijzen, Paula; Bellgard, Matthew I; Bunce, Michael

    2012-01-01

    Traditional Chinese medicine (TCM) has been practiced for thousands of years, but only within the last few decades has its use become more widespread outside of Asia. Concerns continue to be raised about the efficacy, legality, and safety of many popular complementary alternative medicines, including TCMs. Ingredients of some TCMs are known to include derivatives of endangered, trade-restricted species of plants and animals, and therefore contravene the Convention on International Trade in Endangered Species (CITES) legislation. Chromatographic studies have detected the presence of heavy metals and plant toxins within some TCMs, and there are numerous cases of adverse reactions. It is in the interests of both biodiversity conservation and public safety that techniques are developed to screen medicinals like TCMs. Targeting both the p-loop region of the plastid trnL gene and the mitochondrial 16S ribosomal RNA gene, over 49,000 amplicon sequence reads were generated from 15 TCM samples presented in the form of powders, tablets, capsules, bile flakes, and herbal teas. Here we show that second-generation, high-throughput sequencing (HTS) of DNA represents an effective means to genetically audit organic ingredients within complex TCMs. Comparison of DNA sequence data to reference databases revealed the presence of 68 different plant families and included genera, such as Ephedra and Asarum, that are potentially toxic. Similarly, animal families were identified that include genera that are classified as vulnerable, endangered, or critically endangered, including Asiatic black bear (Ursus thibetanus) and Saiga antelope (Saiga tatarica). Bovidae, Cervidae, and Bufonidae DNA were also detected in many of the TCM samples and were rarely declared on the product packaging. This study demonstrates that deep sequencing via HTS is an efficient and cost-effective way to audit highly processed TCM products and will assist in monitoring their legality and safety especially when

  18. Identification of MicroRNAs in Helicoverpa armigera and Spodoptera litura Based on Deep Sequencing and Homology Analysis

    PubMed Central

    Ge, Xie; Zhang, Yong; Jiang, Jianhao; Zhong, Yi; Yang, Xiaonan; Li, Zhiqian; Huang, Yongping; Tan, Anjiang

    2013-01-01

    The current identification of microRNAs (miRNAs) in insects is largely dependent on genome sequences. However, the lack of available genome sequences inhibits the identification of miRNAs in various insect species. In this study, we used a miRNA database of the silkworm Bombyx mori as a reference to identify miRNAs in Helicoverpa armigera and Spodoptera litura using deep sequencing and homology analysis. Because all three species belong to the Lepidoptera, the experiment produced reliable results. Our study identified 97 and 91 conserved miRNAs in H. armigera and S. litura, respectively. Using the genome of B. mori and BAC sequences of H. armigera as references, 1 novel miRNA and 8 novel miRNA candidates were identified in H. armigera, and 4 novel miRNA candidates were identified in S. litura. An evolutionary analysis revealed that most of the identified miRNAs were insect-specific, and more than 20 miRNAs were Lepidoptera-specific. The investigation of the expression patterns of miR-2a, miR-34, miR-2796-3p and miR-11 revealed their potential roles in insect development. miRNA target prediction revealed that conserved miRNA target sites exist in various genes in the 3 species. Conserved miRNA target sites for the Hsp90 gene among the 3 species were validated in the mammalian 293T cell line using a dual-luciferase reporter assay. Our study provides a new approach with which to identify miRNAs in insects lacking genome information and contributes to the functional analysis of insect miRNAs. PMID:23289012

  19. Deep Sequencing of the T-cell Receptor Repertoire Demonstrates Polyclonal T-cell Infiltrates in Psoriasis

    PubMed Central

    Harden, Jamie L.; Hamm, David; Gulati, Nicholas; Lowes, Michelle A.; Krueger, James G.

    2015-01-01

    It is well known that infiltration of pathogenic T-cells plays an important role in psoriasis pathogenesis. However, the antigen specificity of these activated T-cells is relatively unknown. Previous studies using T-cell receptor polymerase chain reaction technology (TCR-PCR) have suggested there are expanded T-cell receptor (TCR) clones in psoriatic skin, suggesting a response to an unknown psoriatic antigen. Here we describe the results of high-throughput deep sequencing of the entire αβ- and γδ- TCR repertoire in normal healthy skin and psoriatic lesional and non-lesional skin. From this study, we were able to determine that there is a significant increase in the abundance of unique β- and γ- TCR sequences in psoriatic lesional skin compared to non-lesional and normal skin, and that the entire T-cell repertoire in psoriasis is polyclonal, with similar diversity to normal and non-lesional skin. Comparison of the αβ- and γδ- TCR repertoire in paired non-lesional and lesional samples showed many common clones within a patient, and these clones were often equally abundant in non-lesional and lesional skin, again suggesting a diverse T-cell repertoire. Although there were similar (and low) amounts of shared β-chain sequences between different patient samples, there was significantly increased sequence sharing of the γ-chain in psoriatic skin from different individuals compared to those without psoriasis. This suggests that although the T-cell response in psoriasis is highly polyclonal, particular γδ- T-cell subsets may be associated with this disease. Overall, our findings present the feasibility of this technology to determine the entire αβ- and γδ- T-cell repertoire in skin, and that psoriasis contains polyclonal and diverse αβ- and γδ- T-cell populations. PMID:26594339

  20. Deep Sequencing of Subseafloor Eukaryotic rRNA Reveals Active Fungi across Marine Subsurface Provinces

    PubMed Central

    Orsi, William; Biddle, Jennifer F.; Edgcomb, Virginia

    2013-01-01

    The deep marine subsurface is a vast habitat for microbial life where cells may live on geologic timescales. Because DNA in sediments may be preserved on long timescales, ribosomal RNA (rRNA) is suggested to be a proxy for the active fraction of a microbial community in the subsurface. During an investigation of eukaryotic 18S rRNA by amplicon pyrosequencing, unique profiles of Fungi were found across a range of marine subsurface provinces including ridge flanks, continental margins, and abyssal plains. Subseafloor fungal populations exhibit statistically significant correlations with total organic carbon (TOC), nitrate, sulfide, and dissolved inorganic carbon (DIC). These correlations are supported by terminal restriction fragment length polymorphism (TRFLP) analyses of fungal rRNA. Geochemical correlations with fungal pyrosequencing and TRFLP data from this geographically broad sample set suggest environmental selection of active Fungi in the marine subsurface. Within the same dataset, ancient rRNA signatures were recovered from plants and diatoms in marine sediments ranging from 0.03 to 2.7 million years old, suggesting that rRNA from some eukaryotic taxa may be much more stable than previously considered in the marine subsurface. PMID:23418556

  1. Deep sequencing of subseafloor eukaryotic rRNA reveals active Fungi across marine subsurface provinces.

    PubMed

    Orsi, William; Biddle, Jennifer F; Edgcomb, Virginia

    2013-01-01

    The deep marine subsurface is a vast habitat for microbial life where cells may live on geologic timescales. Because DNA in sediments may be preserved on long timescales, ribosomal RNA (rRNA) is suggested to be a proxy for the active fraction of a microbial community in the subsurface. During an investigation of eukaryotic 18S rRNA by amplicon pyrosequencing, unique profiles of Fungi were found across a range of marine subsurface provinces including ridge flanks, continental margins, and abyssal plains. Subseafloor fungal populations exhibit statistically significant correlations with total organic carbon (TOC), nitrate, sulfide, and dissolved inorganic carbon (DIC). These correlations are supported by terminal restriction fragment length polymorphism (TRFLP) analyses of fungal rRNA. Geochemical correlations with fungal pyrosequencing and TRFLP data from this geographically broad sample set suggest environmental selection of active Fungi in the marine subsurface. Within the same dataset, ancient rRNA signatures were recovered from plants and diatoms in marine sediments ranging from 0.03 to 2.7 million years old, suggesting that rRNA from some eukaryotic taxa may be much more stable than previously considered in the marine subsurface.

  2. Identification of Hop stunt viroid infecting Citrus limon in China using small RNAs deep sequencing approach.

    PubMed

    Su, Xiu; Fu, Shuai; Qian, Yajuan; Xu, Yi; Zhou, Xueping

    2015-07-07

    The advent of next generation sequencing technology has allowed for significant advances in plant virus discovery, particularly for identification of covert viruses and previously undescribed viruses. The Citrus limon Burm. f. (C. limon) is a small evergreen tree native to Asia, and it is a main lemon producing agricultural cultivar. China is the world's top lemon-producing nation. In this work, lemon samples were collected from southwestern China, where an unknown disease outbreak had caused huge losses in the lemon production industry. Using high-throughput pyrosequencing and the assembly of small RNAs, we showed that the Hop stunt viroid (HSVd) was present in C. limon leaf sample. The majority of Hop stunt viroid derived siRNAs (HSVd-siRNAs) in C. limon were 21 nucleotides in length, and nearly equal amounts of HSVd-siRNAs originated from the plus-genomic RNA strand as from the complementary strand. A bias of HSVd-siRNAs toward sequences beginning with a 5'-Guanine was observed. Furthermore, hotspot analysis showed that a large amount of HSVd-siRNAs derived from the central and variant domains of the HSVd genome. Our results suggest that C. limon could set up a small RNA-mediated gene silencing response to Hop stunt viroid. Interestingly, based on bioinformatics analysis, our results also suggest that the large amounts of HSVd-siRNAs from central and variant domains might be involved in interference with host gene expression and affect symptom development.

  3. Revolution of nephrology research by deep sequencing: ChIP-seq and RNA-seq.

    PubMed

    Mimura, Imari; Kanki, Yasuharu; Kodama, Tatsuhiko; Nangaku, Masaomi

    2014-01-01

    The recent and rapid advent of next-generation sequencing (NGS) has made this technology broadly available not only to researchers in various molecular and cellular biology fields but also to those in kidney disease. In this paper, we describe the usage of ChIP-seq (chromatin immunoprecipitation with sequencing) and RNA-seq for sample preparation and interpretation of raw data in the investigation of biological phenomena in renal diseases. ChIP-seq identifies genome-wide transcriptional DNA-binding sites as well as histone modifications, which are known to regulate gene expression, in the intragenic as well as in the intergenic regions. With regard to RNA-seq, this process analyzes not only the expression level of mRNA but also splicing variants, non-coding RNA, and microRNA on a genome-wide scale. The combination of ChIP-seq and RNA-seq allows the clarification of novel transcriptional mechanisms, which have important roles in various kinds of diseases, including chronic kidney disease. The rapid development of these techniques requires an update on the latest information and methods of NGS. In this review, we highlight the merits and characteristics of ChIP-seq and RNA-seq and discuss the use of the genome-wide analysis in kidney disease.

  4. Targeted deep sequencing of flowering regulators in Brassica napus reveals extensive copy number variation

    PubMed Central

    Schiessl, Sarah; Huettel, Bruno; Kuehn, Diana; Reinhardt, Richard; Snowdon, Rod J.

    2017-01-01

    Gene copy number variation (CNV) is increasingly implicated in control of complex trait networks, particularly in polyploid plants like rapeseed (Brassica napus L.) with an evolutionary history of genome restructuring. Here we performed sequence capture to assay nucleotide variation and CNV in a panel of central flowering time regulatory genes across a species-wide diversity set of 280 B. napus accessions. The genes were chosen based on prior knowledge from Arabidopsis thaliana and related Brassica species. Target enrichment was performed using the Agilent SureSelect technology, followed by Illumina sequencing. A bait (probe) pool was developed based on results of a preliminary experiment with representatives from different B. napus morphotypes. A very high mean target coverage of ~670x allowed reliable calling of CNV, single nucleotide polymorphisms (SNPs) and insertion-deletion (InDel) polymorphisms. No accession was free of CNV, and at least one homolog of every gene we investigated showed CNV in some accessions. Some CNV appear more often in specific morphotypes, indicating a role in diversification. PMID:28291231

  5. Evolutionary relations of Hexanchiformes deep-sea sharks elucidated by whole mitochondrial genome sequences.

    PubMed

    Tanaka, Keiko; Shiina, Takashi; Tomita, Taketeru; Suzuki, Shingo; Hosomichi, Kazuyoshi; Sano, Kazumi; Doi, Hiroyuki; Kono, Azumi; Komiyama, Tomoyoshi; Inoko, Hidetoshi; Kulski, Jerzy K; Tanaka, Sho

    2013-01-01

    Hexanchiformes is regarded as a monophyletic taxon, but the morphological and genetic relationships between the five extant species within the order are still uncertain. In this study, we determined the whole mitochondrial DNA (mtDNA) sequences of seven sharks including representatives of the five Hexanchiformes, one squaliform, and one carcharhiniform and inferred the phylogenetic relationships among those species and 12 other Chondrichthyes (cartilaginous fishes) species for which the complete mitogenome is available. The monophyly of Hexanchiformes and its close relation with all other Squaliformes sharks were strongly supported by likelihood and Bayesian phylogenetic analysis of 13,749 aligned nucleotides of 13 protein coding genes and two rRNA genes that were derived from the whole mtDNA sequences of the 19 species. The phylogeny suggested that Hexanchiformes is in the superorder Squalomorphi, Chlamydoselachus anguineus (frilled shark) is the sister species to all other Hexanchiformes, and the relations within Hexanchiformes are well resolved as Chlamydoselachus, (Notorynchus, (Heptranchias, (Hexanchus griseus, H. nakamurai))). Based on our phylogeny, we discussed evolutionary scenarios of the jaw suspension mechanism and gill slit numbers that are significant features in the sharks.

  6. Deep Sequencing of the Human Retinae Reveals the Expression of Odorant Receptors

    PubMed Central

    Jovancevic, Nikolina; Wunderlich, Kirsten A.; Haering, Claudia; Flegel, Caroline; Maßberg, Désirée; Weinrich, Markus; Weber, Lea; Tebbe, Lars; Kampik, Anselm; Gisselmann, Günter; Wolfrum, Uwe; Hatt, Hanns; Gelis, Lian

    2017-01-01

    Several studies have demonstrated that the expression of odorant receptors (ORs) occurs in various tissues. These findings have served as a basis for functional studies that demonstrate the potential of ORs as drug targets for a clinical application. To the best of our knowledge, this report describes the first evaluation of the mRNA expression of ORs and the localization of OR proteins in the human retina that set a stage for subsequent functional analyses. RNA-Sequencing datasets of three individual neural retinae were generated using Next-generation sequencing and were compared to previously published but reanalyzed datasets of the peripheral and the macular human retina and to reference tissues. The protein localization of several ORs was investigated by immunohistochemistry. The transcriptome analyses detected an average of 14 OR transcripts in the neural retina, of which OR6B3 is one of the most highly expressed ORs. Immunohistochemical stainings of retina sections localized OR2W3 to the photosensitive outer segment membranes of cones, whereas OR6B3 was found in various cell types. OR5P3 and OR10AD1 were detected at the base of the photoreceptor connecting cilium, and OR10AD1 was also localized to the nuclear envelope of all of the nuclei of the retina. The cell type-specific expression of the ORs in the retina suggests that there are unique biological functions for those receptors. PMID:28174521

  7. Evolutionary Relations of Hexanchiformes Deep-Sea Sharks Elucidated by Whole Mitochondrial Genome Sequences

    PubMed Central

    Tanaka, Keiko; Tomita, Taketeru; Suzuki, Shingo; Hosomichi, Kazuyoshi; Sano, Kazumi; Doi, Hiroyuki; Kono, Azumi; Inoko, Hidetoshi; Kulski, Jerzy K.; Tanaka, Sho

    2013-01-01

    Hexanchiformes is regarded as a monophyletic taxon, but the morphological and genetic relationships between the five extant species within the order are still uncertain. In this study, we determined the whole mitochondrial DNA (mtDNA) sequences of seven sharks including representatives of the five Hexanchiformes, one squaliform, and one carcharhiniform and inferred the phylogenetic relationships among those species and 12 other Chondrichthyes (cartilaginous fishes) species for which the complete mitogenome is available. The monophyly of Hexanchiformes and its close relation with all other Squaliformes sharks were strongly supported by likelihood and Bayesian phylogenetic analysis of 13,749 aligned nucleotides of 13 protein coding genes and two rRNA genes that were derived from the whole mtDNA sequences of the 19 species. The phylogeny suggested that Hexanchiformes is in the superorder Squalomorphi, Chlamydoselachus anguineus (frilled shark) is the sister species to all other Hexanchiformes, and the relations within Hexanchiformes are well resolved as Chlamydoselachus, (Notorynchus, (Heptranchias, (Hexanchus griseus, H. nakamurai))). Based on our phylogeny, we discussed evolutionary scenarios of the jaw suspension mechanism and gill slit numbers that are significant features in the sharks. PMID:24089661

  8. Power of deep sequencing and Agilent microarray for gene expression profiling study.

    PubMed

    Feng, Lin; Liu, Hang; Liu, Yu; Lu, Zhike; Guo, Guangwu; Guo, Suping; Zheng, Hongwei; Gao, Yanning; Cheng, Shujun; Wang, Jian; Zhang, Kaitai; Zhang, Yong

    2010-06-01

    Next-generation sequencing-based Digital Gene Expression tag profiling (DGE) has been used to study the changes in gene expression profiling. To compare the quality of the data generated by microarray and DGE, we examined the gene expression profiles of an in vitro cell model with these platforms. In this study, 17,362 and 15,938 genes were detected by microarray and DGE, respectively, with 13,221 overlapping genes. The correlation coefficients between the technical replicates were >0.99 and the detection variance was <9% for both platforms. The dynamic range of microarray was fixed with four orders of magnitude, whereas that of DGE was extendable. The consistency of the two platforms was high, especially for those abundant genes. It was more difficult for the microarray to distinguish the expression variation of less abundant genes. Although microarrays might be eventually replaced by DGE or transcriptome sequencing (RNA-seq) in the near future, microarrays are still stable, practical, and feasible, which may be useful for most biological researchers.
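
    The platform overlap reported in this abstract can be summarized with simple set arithmetic. The sketch below uses only the three gene counts quoted above; the helper name overlap_summary and the choice of the Jaccard index as a summary statistic are assumptions made for illustration, not part of the study.

```python
# Gene counts quoted in the abstract.
microarray_genes = 17_362
dge_genes = 15_938
shared_genes = 13_221


def overlap_summary(a: int, b: int, shared: int) -> dict:
    """Summarize the overlap of two detected-gene sets from their sizes (hypothetical helper)."""
    union = a + b - shared  # inclusion-exclusion
    return {
        "union": union,
        "only_a": a - shared,
        "only_b": b - shared,
        "frac_a_shared": shared / a,
        "frac_b_shared": shared / b,
        "jaccard": shared / union,
    }


summary = overlap_summary(microarray_genes, dge_genes, shared_genes)
for key, value in summary.items():
    print(f"{key:>14}: {value:.3f}" if isinstance(value, float) else f"{key:>14}: {value}")
```

    Read this way, about three quarters of the microarray-detected genes and a somewhat larger share of the DGE-detected genes fall in the shared set.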

  9. The transcriptome of Verticillium dahliae-infected Nicotiana benthamiana determined by deep RNA sequencing.

    PubMed

    Faino, Luigi; de Jonge, Ronnie; Thomma, Bart P H J

    2012-09-01

    Verticillium wilt disease is caused by fungi of the Verticillium genus that occur on a wide range of host plants, including Solanaceous species such as tomato and tobacco. Currently, the well characterized Ve1 gene of tomato is the only Verticillium wilt resistance gene cloned. During experiments to identify the Verticillium molecule that activates Ve1 resistance in tomato, RNA sequencing (RNA-Seq) of Verticillium-infected Nicotiana benthamiana was performed. In total, over 99% of the obtained reads were derived from N. benthamiana. Here, we report the assembly and annotation of the N. benthamiana transcriptome. In total, 142,738 transcripts > 100 bp were obtained, amounting to a total transcriptome size of 38.7 Mbp, which is comparable to the Arabidopsis transcriptome. About 30,282 transcripts could be annotated based on homology to Arabidopsis genes. By assembly of the N. benthamiana transcriptome, we provide a catalogue of transcripts of a Solanaceous model plant under pathogen stress.

  10. Focused Evolution of HIV-1 Neutralizing Antibodies Revealed by Structures and Deep Sequencing

    SciTech Connect

    Wu, Xueling; Zhou, Tongqing; Zhu, Jiang; Zhang, Baoshan; Georgiev, Ivelin; Wang, Charlene; Chen, Xuejun; Longo, Nancy S.; Louder, Mark; McKee, Krisha; O’Dell, Sijy; Perfetto, Stephen; Schmidt, Stephen D.; Shi, Wei; Wu, Lan; Yang, Yongping; Yang, Zhi-Yong; Yang, Zhongjia; Zhang, Zhenhai; Bonsignori, Mattia; Crump, John A.; Kapiga, Saidi H.; Sam, Noel E.; Haynes, Barton F.; Simek, Melissa; Burton, Dennis R.; Koff, Wayne C.; Doria-Rose, Nicole A.; Connors, Mark; Mullikin, James C.; Nabel, Gary J.; Roederer, Mario; Shapiro, Lawrence; Kwong, Peter D.; Mascola, John R.

    2013-03-04

    Antibody VRC01 is a human immunoglobulin that neutralizes about 90% of HIV-1 isolates. To understand how such broadly neutralizing antibodies develop, we used x-ray crystallography and 454 pyrosequencing to characterize additional VRC01-like antibodies from HIV-1-infected individuals. Crystal structures revealed a convergent mode of binding for diverse antibodies to the same CD4-binding-site epitope. A functional genomics analysis of expressed heavy and light chains revealed common pathways of antibody-heavy chain maturation, confined to the IGHV1-2*02 lineage, involving dozens of somatic changes, and capable of pairing with different light chains. Broadly neutralizing HIV-1 immunity associated with VRC01-like antibodies thus involves the evolution of antibodies to a highly affinity-matured state required to recognize an invariant viral structure, with lineages defined from thousands of sequences providing a genetic roadmap of their development.

  11. Genomic DNA sequences from mastodon and woolly mammoth reveal deep speciation of forest and savanna elephants.

    PubMed

    Rohland, Nadin; Reich, David; Mallick, Swapan; Meyer, Matthias; Green, Richard E; Georgiadis, Nicholas J; Roca, Alfred L; Hofreiter, Michael

    2010-12-21

    To elucidate the history of living and extinct elephantids, we generated 39,763 bp of aligned nuclear DNA sequence across 375 loci for African savanna elephant, African forest elephant, Asian elephant, the extinct American mastodon, and the woolly mammoth. Our data establish that the Asian elephant is the closest living relative of the extinct mammoth in the nuclear genome, extending previous findings from mitochondrial DNA analyses. We also find that savanna and forest elephants, which some have argued are the same species, are as or more divergent in the nuclear genome as mammoths and Asian elephants, which are considered to be distinct genera, thus resolving a long-standing debate about the appropriate taxonomic classification of the African elephants. Finally, we document a much larger effective population size in forest elephants compared with the other elephantid taxa, likely reflecting species differences in ancient geographic structure and range and differences in life history traits such as variance in male reproductive success.

  12. Genome-wide analysis of SRSF10-regulated alternative splicing by deep sequencing of chicken transcriptome.

    PubMed

    Zhou, Xuexia; Wu, Wenwu; Wei, Ning; Cheng, Yuanming; Xie, Zhiqin; Feng, Ying

    2014-12-01

    Splicing factor SRSF10 is known to function as a sequence-specific splicing activator that is capable of regulating alternative splicing both in vitro and in vivo. We recently used an RNA-seq approach coupled with bioinformatics analysis to identify the extensive splicing network regulated by SRSF10 in chicken cells. We found that SRSF10 promoted both exon inclusion and exclusion. Functionally, many of the SRSF10-verified alternative exons are linked to pathways of response to external stimulus. Here we describe in detail the experimental design, bioinformatics analysis and GO/pathway enrichment analysis of SRSF10-regulated genes to correspond with our data in the Gene Expression Omnibus with accession number GSE53354. Our data thus provide a resource for studying regulation of alternative splicing in vivo that underlines biological functions of splicing regulatory proteins in cells.

  13. Genomic DNA Sequences from Mastodon and Woolly Mammoth Reveal Deep Speciation of Forest and Savanna Elephants

    PubMed Central

    Mallick, Swapan; Meyer, Matthias; Green, Richard E.; Georgiadis, Nicholas J.; Roca, A