Science.gov

Sample records for novo sequencing approach

  1. Proteomics-grade de novo sequencing approach.

    PubMed

    Savitski, Mikhail M; Nielsen, Michael L; Kjeldsen, Frank; Zubarev, Roman A

    2005-01-01

    The conventional approach in modern proteomics to identify proteins from limited information provided by molecular and fragment masses of their enzymatic degradation products carries an inherent risk of both false positive and false negative identifications. For reliable identification of even known proteins, complete de novo sequencing of their peptides is desired. The main problems of conventional sequencing based on tandem mass spectrometry are incomplete backbone fragmentation and the frequent overlap of fragment masses. In this work, the first proteomics-grade de novo approach is presented, where the above problems are alleviated by the use of complementary fragmentation techniques CAD and ECD. Implementation of a high-current, large-area dispenser cathode as a source of low-energy electrons provided efficient ECD of doubly charged peptides, the most abundant species (65-80%), in a typical trypsin-based proteomics experiment. A new linear de novo algorithm is developed combining efficiency and speed, processing 1000 MS/MS data sets in 60 s on a conventional 3 GHz PC. More than 6% of all MS/MS data for doubly charged peptides yielded complete sequences, and another 13% gave nearly complete sequences with a maximum gap of two amino acid residues. These figures are comparable with the typical success rates (5-15%) of database identification. For peptides reliably found in the database (Mowse score ≥ 34), the agreement with de novo-derived full sequences was >95%. Full sequences were derived in 67% of the cases when full sequence information was present in MS/MS spectra. Thus the new de novo sequencing approach reached the same level of efficiency and reliability as conventional database-identification strategies. PMID:16335984

  2. A Machine Learning Based Approach to de novo Sequencing of Glycans from Tandem Mass Spectrometry Spectrum.

    PubMed

    Kumozaki, Shotaro; Sato, Kengo; Sakakibara, Yasubumi

    2015-01-01

    Recently, glycomics has been actively studied and various technologies for glycomics have been rapidly developed. Currently, tandem mass spectrometry (MS/MS) is one of the key experimental tools for identification of structures of oligosaccharides. MS/MS can observe MS/MS peaks of fragmented glycan ions including cross-ring ions resulting from internal cleavages, which provide valuable information to infer glycan structures. Thus, the aim of de novo sequencing of glycans is to find the most probable assignments of observed MS/MS peaks to glycan substructures without databases. However, there are few satisfactory algorithms for glycan de novo sequencing from MS/MS spectra. We present a machine learning based approach to de novo sequencing of glycans from MS/MS spectra. First, we build a suitable model for the fragmentation of glycans including cross-ring ions, and implement a solver that employs Lagrangian relaxation with a dynamic programming technique. Then, to optimize scores for the algorithm, we introduce a machine learning technique called structured support vector machines that enable us to learn parameters including scores for cross-ring ions from training data, i.e., known glycan mass spectra. Furthermore, we implement additional constraints for core structures of well-known glycan types including N-linked glycans and O-linked glycans. This enables us to predict more accurate glycan structures if the glycan type of given spectra is known. Computational experiments show that our algorithm performs accurate de novo sequencing of glycans. The implementation of our algorithm and the datasets are available at http://glyfon.dna.bio.keio.ac.jp/. PMID:26671799

  3. New Approaches and Technologies to Sequence de novo Plant reference Genomes (2013 DOE JGI Genomics of Energy and Environment 8th Annual User Meeting)

    SciTech Connect

    Schmutz, Jeremy

    2013-03-01

    Jeremy Schmutz of the HudsonAlpha Institute for Biotechnology on "New approaches and technologies to sequence de novo plant reference genomes" at the 8th Annual Genomics of Energy & Environment Meeting on March 27, 2013 in Walnut Creek, Calif.

  4. De novo sequencing of sunflower genome for SNP discovery using RAD (Restriction site Associated DNA) approach

    PubMed Central

    2013-01-01

    Background Application of Single Nucleotide Polymorphism (SNP) marker technology as a tool in sunflower breeding programs offers enormous potential to improve sunflower genetics, and facilitate faster release of sunflower hybrids to the market place. Through a National Sunflower Association (NSA) funded initiative, we report on the process of SNP discovery through reductive genome sequencing and local assembly of six diverse sunflower inbred lines that represent oil as well as confection types. Results A combination of Restriction site Associated DNA Sequencing (RAD-Seq) protocols and Illumina paired-end sequencing chemistry generated high quality 89.4 M paired end reads from the six lines which represent 5.3 GB of the sequencing data. Raw reads from the sunflower line, RHA 464 were assembled de novo to serve as a framework reference genome. About 15.2 Mb of sunflower genome distributed over 42,267 contigs were obtained upon assembly of RHA 464 sequencing data, the contig lengths ranged from 200 to 950 bp with an N50 length of 393 bp. SNP calling was performed by aligning sequencing data from the six sunflower lines to the assembled reference RHA 464. On average, 1 SNP was located every 143 bp of the sunflower genome sequence. Based on several filtering criteria, a final set of 16,467 putative sequence variants with characteristics favorable for Illumina Infinium Genotyping Technology (IGT) were mined from the sequence data generated across six diverse sunflower lines. Conclusion Here we report the molecular and computational methodology involved in SNP development for a complex genome like sunflower lacking reference assembly, offering an attractive tool for molecular breeding purposes in sunflower. PMID:23947483
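
    The N50 contig length quoted above is a standard assembly statistic: the length L such that contigs of length L or longer together cover at least half of the assembled bases. The following minimal sketch shows the calculation; the function name and the toy contig lengths are illustrative assumptions, not data from the study.

```python
def n50(contig_lengths):
    """Return the N50 of a collection of contig lengths.

    N50 is the length L such that contigs of length >= L together
    account for at least half of the total assembled bases.
    """
    lengths = sorted(contig_lengths, reverse=True)
    half_total = sum(lengths) / 2
    running = 0
    for length in lengths:
        running += length
        if running >= half_total:
            return length
    raise ValueError("contig_lengths must not be empty")


# Hypothetical toy assembly (NOT the RHA 464 contigs).
toy_contigs = [950, 600, 400, 393, 300, 250, 200]
print(n50(toy_contigs))  # prints 600
```

    Here the two longest contigs (950 + 600 = 1550 bp) already exceed half of the 3093 bp total, so the N50 is 600 bp.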

  5. A high-throughput de novo sequencing approach for shotgun proteomics using high-resolution tandem mass spectrometry

    SciTech Connect

    Pan, Chongle; Park, Byung H; McDonald, W Hayes; Carey, Patricia A; Banfield, Jillian F.; Verberkmoes, Nathan C; Hettich, Robert L; Samatova, Nagiza F

    2010-01-01

    Background High-resolution tandem mass spectra can now be readily acquired with hybrid instruments, such as LTQ-Orbitrap and LTQ-FT, in high-throughput shotgun proteomics workflows. The improved spectral quality enables more accurate de novo sequencing for identification of post-translational modifications and amino acid polymorphisms. Results In this study, a new de novo sequencing algorithm, called Vonode, has been developed specifically for analysis of such high-resolution tandem mass spectra. To fully exploit the high mass accuracy of these spectra, a unique scoring system is proposed to evaluate sequence tags based primarily on mass accuracy information of fragment ions. Consensus sequence tags were inferred for 11,422 spectra with an average peptide length of 5.5 residues from a total of 40,297 input spectra acquired in a 24-hour proteomics measurement of Rhodopseudomonas palustris. The accuracy of inferred consensus sequence tags was 84%. According to our comparison, the performance of Vonode was shown to be superior to the PepNovo v2.0 algorithm, in terms of the number of de novo sequenced spectra and the sequencing accuracy. Conclusions Here, we improved de novo sequencing performance by developing a new algorithm specifically for high-resolution tandem mass spectral data. The Vonode algorithm is freely available for download at http://compbio.ornl.gov/Vonode.

  6. A model of random sequences for de novo peptide sequencing

    SciTech Connect

    Jarman, Kenneth D.; Cannon, William R.; Jarman, Kristin H.; Heredia-Langner, Alejandro

    2003-04-15

    We present a model for the probability of random sequences appearing in product ion spectra obtained from tandem mass spectrometry experiments using collision-induced dissociation. We demonstrate the use of these probabilities for ranking candidate peptide sequences obtained using a de novo algorithm. Sequence candidates are obtained from a spectrum graph that is greatly reduced in size from those in previous graph-theoretical de novo approaches. Evidence of multiple instances of subsequences of each candidate, due to different fragment ion type series as well as isotopic peaks, is incorporated in a hierarchical scoring scheme. This approach is shown to be useful for confirming results from database search and as a first step towards a statistically rigorous de novo algorithm.
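
    A spectrum graph, as mentioned above, can be pictured as follows: peaks are nodes, and two peaks are joined by an edge when their mass difference matches an amino acid residue mass. The sketch below is a simplified illustration only. It is not the authors' reduced graph or their hierarchical scoring scheme, and the residue table (a small subset of monoisotopic masses), the tolerance, and the function names are assumptions chosen for the example.

```python
# Monoisotopic residue masses in Da; a deliberately small subset for illustration.
RESIDUE_MASS = {
    "G": 57.02146,
    "A": 71.03711,
    "S": 87.03203,
    "V": 99.06841,
    "L": 113.08406,
}


def spectrum_graph(peaks, tol=0.01):
    """Map each peak mass to a list of (heavier_peak, residue) edges."""
    peaks = sorted(peaks)
    edges = {}
    for i, low in enumerate(peaks):
        for high in peaks[i + 1:]:
            diff = high - low
            for residue, mass in RESIDUE_MASS.items():
                if abs(diff - mass) <= tol:
                    edges.setdefault(low, []).append((high, residue))
    return edges


def longest_tag(peaks, tol=0.01):
    """Return the longest residue string readable along a path in the graph."""
    peaks = sorted(peaks)
    graph = spectrum_graph(peaks, tol)
    best = {}
    # Edges always point to heavier peaks, so visit peaks from heaviest to lightest.
    for peak in reversed(peaks):
        candidates = (res + best[nxt] for nxt, res in graph.get(peak, []))
        best[peak] = max(candidates, key=len, default="")
    return max(best.values(), key=len, default="")


# Hypothetical prefix-mass ladder for the toy peptide fragment "GAVL" plus one noise peak.
toy_peaks = [0.0, 57.02146, 128.05857, 150.0, 227.12698, 340.21104]
print(longest_tag(toy_peaks))
```

    A real implementation would also have to handle multiple ion series, isotopic peaks, and ambiguous mass differences, which is where a probabilistic ranking of candidates such as the one described in the abstract becomes useful.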

  7. Two Different High Throughput Sequencing Approaches Identify Thousands of De Novo Genomic Markers for the Genetically Depleted Bornean Elephant

    PubMed Central

    Sharma, Reeta; Goossens, Benoit; Kun-Rodrigues, Célia; Teixeira, Tatiana; Othman, Nurzhafarina; Boone, Jason Q.; Jue, Nathaniel K.; Obergfell, Craig; O'Neill, Rachel J.; Chikhi, Lounès

    2012-01-01

    High throughput sequencing technologies are being applied to an increasing number of model species with a high-quality reference genome. The application and analyses of whole-genome sequence data in non-model species with no prior genomic information are currently under way. Recent sequencing technologies provide new opportunities for gathering genomic data in natural populations, laying the empirical foundation for future research in the field of conservation and population genomics. Here we present the case study of the Bornean elephant, which is the most endangered subspecies of Asian elephant and exhibits very low genetic diversity. We used two different sequencing platforms, the Roche 454 FLX (shotgun) and Illumina, GAIIx (Restriction site associated DNA, RAD) to evaluate the feasibility of the two methodologies for the discovery of de novo markers (single nucleotide polymorphism, SNPs and microsatellites) using low coverage data. Approximately, 6,683 (shotgun) and 14,724 (RAD) SNPs were detected within our elephant sequence dataset. Genotyping of a representative sample of 194 SNPs resulted in a SNP validation rate of ∼ 83 to 94% and 17% of the loci were polymorphic with a low diversity (Ho = 0.057). Different numbers of microsatellites were identified through shotgun (27,226) and RAD (868) techniques. Out of all di-, tri-, and tetra-microsatellite loci, 1,706 loci had sufficient flanking regions (shotgun) while only 7 were found with RAD. All microsatellites were monomorphic in the Bornean but polymorphic in another elephant subspecies. Despite using different sample sizes, and the well known differences in the two platforms used regarding sequence length and throughput, the two approaches showed high validation rate. The approaches used here for marker development in a threatened species demonstrate the utility of high throughput sequencing technologies as a starting point for the development of genomic tools in a non-model species and in particular

  8. De novo sequencing of unique sequence tags for discovery of post-translational modifications of proteins

    SciTech Connect

    Shen, Yufeng; Tolic, Nikola; Hixson, Kim K.; Purvine, Samuel O.; Anderson, Gordon A.; Smith, Richard D.

    2008-10-15

    De novo sequencing holds promise for discovering protein post-translational modifications; however, the approach is still in its infancy and is not widely applied in proteomics practice because of its limited reliability. In this work, we describe a de novo sequencing approach for discovery of protein modifications through identification of the UStags (Anal. Chem. 2008, 80, 1871-1882). The de novo information was obtained from Fourier-transform tandem mass spectrometry for peptides and polypeptides in a yeast lysate, and the de novo sequences obtained were filtered to define a more limited set of UStags. The DNA-predicted database protein sequences were then compared to the UStags, and the differences observed across or in the UStags (i.e., the UStags’ prefix and suffix sequences and the UStags themselves) were used to infer the possible sequence modifications. With this de novo-UStag approach, we uncovered some unexpected variances of yeast protein sequences due to amino acid mutations and/or multiple modifications to the predicted protein sequences. Random matching of the de novo sequences to the predicted sequences was examined using two random (false) databases, and ~3% false discovery rates were estimated for the de novo-UStag approach. The factors affecting the reliability (e.g., existence of de novo sequencing noise residues and redundant sequences) and the sensitivity are described. The de novo-UStag approach complements the UStag method previously reported by enabling discovery of new protein modifications.

  9. Identification of Disulfide Bonds in Protein Proteolytic Degradation Products Using de Novo-Protein Unique Sequence Tags Approach

    SciTech Connect

    Shen, Yufeng; Tolic, Nikola; Purvine, Samuel O.; Smith, Richard D.

    2010-08-01

    Disulfide bonds are a form of posttranslational modification that often determines protein structure(s) and function(s). In this work, we report a mass spectrometry method for identification of disulfides in degradation products of proteins, and specifically endogenous peptides in the human blood plasma peptidome. LC-Fourier transform tandem mass spectrometry (FT MS/MS) was used for acquiring mass spectra that were de novo sequenced and then searched against the IPI human protein database. Through the use of unique sequence tags (UStags) we unambiguously correlated the spectra to specific database proteins. Examination of the UStags’ prefix and/or suffix sequences that contain cysteine(s) in conjunction with sequences of the UStags-specified database proteins is shown to enable the unambiguous determination of disulfide bonds. Using this method, we identified the intermolecular and intramolecular disulfides in human blood plasma peptidome peptides that have molecular weights of up to ~10 kDa.

  10. Identification of disulfide bonds in protein proteolytic degradation products using de novo-protein unique sequence tags approach.

    PubMed

    Shen, Yufeng; Tolić, Nikola; Purvine, Samuel O; Smith, Richard D

    2010-08-01

    Disulfide bonds are a form of post-translational modification that often determines protein structure(s) and function(s). In this work, we report a mass spectrometry method for identification of disulfides in degradation products of proteins, specifically endogenous peptides in the human blood plasma peptidome. LC-Fourier transform tandem mass spectrometry (FT MS/MS) was used for acquiring mass spectra that were de novo sequenced and then searched against the IPI human protein database. Through the use of unique sequence tags (UStags), we unambiguously correlated the spectra to specific database proteins. Examination of the UStags' prefix and/or suffix sequences that contain cysteine(s) in conjunction with sequences of the UStags-specified database proteins is shown to enable the unambiguous determination of disulfide bonds. Using this method, we identified the intermolecular and intramolecular disulfides in human blood plasma peptidome peptides that have molecular weights of up to approximately 10 kDa. PMID:20590115

  11. mPUMA: a computational approach to microbiota analysis by de novo assembly of operational taxonomic units based on protein-coding barcode sequences

    PubMed Central

    2013-01-01

    Background Formation of operational taxonomic units (OTU) is a common approach to data aggregation in microbial ecology studies based on amplification and sequencing of individual gene targets. The de novo assembly of OTU sequences has been recently demonstrated as an alternative to widely used clustering methods, providing robust information from experimental data alone, without any reliance on an external reference database. Results Here we introduce mPUMA (microbial Profiling Using Metagenomic Assembly, http://mpuma.sourceforge.net), a software package for identification and analysis of protein-coding barcode sequence data. It was developed originally for Cpn60 universal target sequences (also known as GroEL or Hsp60). Using an unattended process that is independent of external reference sequences, mPUMA forms OTUs by DNA sequence assembly and is capable of tracking OTU abundance. mPUMA processes microbial profiles both in terms of the direct DNA sequence as well as in the translated amino acid sequence for protein coding barcodes. By forming OTUs and calculating abundance through an assembly approach, mPUMA is capable of generating inputs for several popular microbiota analysis tools. Using SFF data from sequencing of a synthetic community of Cpn60 sequences derived from the human vaginal microbiome, we demonstrate that mPUMA can faithfully reconstruct all expected OTU sequences and produce compositional profiles consistent with actual community structure. Conclusions mPUMA enables analysis of microbial communities while empowering the discovery of novel organisms through OTU assembly. PMID:24451012

  12. De novo sequencing of unique sequence tags for discovery of post-translational modifications of proteins.

    PubMed

    Shen, Yufeng; Tolić, Nikola; Hixson, Kim K; Purvine, Samuel O; Anderson, Gordon A; Smith, Richard D

    2008-10-15

    De novo sequencing is a spectrum analysis approach for mass spectrometry data to discover post-translational modifications in proteins; however, such an approach is still in its infancy and is still not widely applied to proteomic practices due to its limited reliability. In this work, we describe a de novo sequencing approach for the discovery of protein modifications based on identification of the proteome UStags (Shen, Y.; Tolić, N.; Hixson, K. K.; Purvine, S. O.; Pasa-Tolić, L.; Qian, W. J.; Adkins, J. N.; Moore, R. J.; Smith, R. D. Anal. Chem. 2008, 80, 1871-1882). The de novo information was obtained from Fourier-transform tandem mass spectrometry data for peptides and polypeptides from a yeast lysate, and the de novo sequences obtained were selected based on filter levels designed to provide a limited yet high quality subset of UStags. The DNA-predicted database protein sequences were then compared to the UStags, and the differences observed across or in the UStags (i.e., the UStags' prefix and suffix sequences and the UStags themselves) were used to infer possible sequence modifications. With this de novo-UStag approach, we uncovered some unexpected variances within several yeast protein sequences due to amino acid mutations and/or multiple modifications to the predicted protein sequences. To determine false discovery rates, two random (false) databases were independently used for sequence matching, and ~3% false discovery rates were estimated for the de novo-UStag approach. The factors affecting the reliability (e.g., existence of de novo sequencing noise residues and redundant sequences) and the sensitivity of the approach were investigated and described. The combined de novo-UStag approach complements the UStag method previously reported by enabling the discovery of new protein modifications. PMID:18783246
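
    The false discovery rate estimate described above rests on a simple idea: matches to a random (false) database can only be chance hits, so their count approximates the number of chance hits hidden among the matches to the real database. The sketch below illustrates that arithmetic. The abstract does not say how the results from the two random databases were combined, so averaging them is an assumption here, as are the function name and the match counts.

```python
def estimate_fdr(target_matches, decoy_matches_per_db):
    """Estimate FDR as (mean chance matches in random databases) / (target matches).

    Averaging over the random databases is an assumption of this sketch.
    """
    if target_matches <= 0:
        raise ValueError("target_matches must be positive")
    if not decoy_matches_per_db:
        raise ValueError("at least one random database is required")
    mean_decoy = sum(decoy_matches_per_db) / len(decoy_matches_per_db)
    return mean_decoy / target_matches


# Hypothetical counts, chosen only to illustrate a rate near 3%.
fdr = estimate_fdr(target_matches=1000, decoy_matches_per_db=[28, 32])
print(f"{fdr:.1%}")  # prints 3.0%
```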

  13. Novor: Real-Time Peptide de Novo Sequencing Software

    NASA Astrophysics Data System (ADS)

    Ma, Bin

    2015-11-01

    De novo sequencing software has been widely used in proteomics to sequence new peptides from tandem mass spectrometry data. This study presents a new software tool, Novor, to greatly improve both the speed and accuracy of today's peptide de novo sequencing analyses. To improve the accuracy, Novor's scoring functions are based on two large decision trees built from a peptide spectral library with more than 300,000 spectra with machine learning. Important knowledge about peptide fragmentation is extracted automatically from the library and incorporated into the scoring functions. The decision tree model also enables efficient score calculation and contributes to the speed improvement. To further improve the speed, a two-stage algorithmic approach, namely dynamic programming and refinement, is used. The software program was also carefully optimized. On the testing datasets, Novor sequenced 7%-37% more correct residues than the state-of-the-art de novo sequencing tool, PEAKS, while being an order of magnitude faster. Novor can de novo sequence more than 300 MS/MS spectra per second on a laptop computer. The speed surpasses the acquisition speed of today's mass spectrometer and, therefore, opens a new possibility to de novo sequence in real time while the spectrometer is acquiring the spectral data.

  14. Multiplex De Novo Sequencing of Peptide Antibiotics

    NASA Astrophysics Data System (ADS)

    Mohimani, Hosein; Liu, Wei-Ting; Yang, Yu-Liang; Gaudêncio, Susana P.; Fenical, William; Dorrestein, Pieter C.; Pevzner, Pavel A.

    Proliferation of drug-resistant diseases raises the challenge of searching for new, more efficient antibiotics. Currently, some of the most effective antibiotics (e.g., Vancomycin and Daptomycin) are cyclic peptides produced by non-ribosomal biosynthetic pathways. The isolation and sequencing of cyclic peptide antibiotics, unlike the same activity with linear peptides, is time-consuming and error-prone. The dominant technique for sequencing cyclic peptides is NMR-based and requires large amounts (milligrams) of purified materials that, for most compounds, are not possible to obtain. Given these facts, there is a need for new tools to sequence cyclic NRPs using picograms of material. Since nearly all cyclic NRPs are produced along with related analogs, we develop a mass spectrometry approach for sequencing all related peptides at once (in contrast to the existing approach that analyzes individual peptides). Our results suggest that instead of attempting to isolate and NMR-sequence the most abundant compound, one should acquire spectra of many related compounds and sequence all of them simultaneously using tandem mass spectrometry. We illustrate applications of this approach by sequencing new variants of cyclic peptide antibiotics from Bacillus brevis, as well as sequencing a previously unknown family of cyclic NRPs produced by marine bacteria.

  15. Current challenges in de novo plant genome sequencing and assembly

    PubMed Central

    2012-01-01

    Genome sequencing is now affordable, but assembling plant genomes de novo remains challenging. We assess the state of the art of assembly and review the best practices for the community. PMID:22546054

  16. Ameliorated de novo transcriptome assembly using Illumina paired end sequence data with Trinity Assembler

    PubMed Central

    Bankar, Kiran Gopinath; Todur, Vivek Nagaraj; Shukla, Rohit Nandan; Vasudevan, Madavan

    2015-01-01

    Advent of Next Generation Sequencing has led to possibilities of de novo transcriptome assembly of organisms without availability of complete genome sequence. Among various sequencing platforms available, Illumina is the most widely used platform based on data quality, quantity and cost. Various de novo transcriptome assemblers are also available today for construction of de novo transcriptome. In this study, we aimed at obtaining an ameliorated de novo transcriptome assembly with sequence reads obtained from Illumina platform and assembled using Trinity Assembler. We found that, primary transcriptome assembly obtained as a result of Trinity can be ameliorated on the basis of transcript length, coverage, and depth and protein homology. Our approach to ameliorate is reproducible and could enhance the sensitivity and specificity of the assembled transcriptome which could be critical for validation of the assembled transcripts and for planning various downstream biological assays. PMID:26484285

  17. BSTA: a targeted approach combines bulked segregant analysis with next- generation sequencing and de novo transcriptome assembly for SNP discovery in sunflower

    PubMed Central

    2013-01-01

    Background Sunflower belongs to the largest plant family on earth, the genomically poorly explored Compositae. Downy mildew Plasmopara halstedii (Farlow) Berlese & de Toni is one of the major diseases of cultivated sunflower (Helianthus annuus L.). In the search for new sources of downy mildew resistance, the locus PlARG on linkage group 1 (LG1) originating from H. argophyllus is promising since it confers resistance against all known races of the pathogen. However, the mapping resolution in the PlARG region is hampered by significantly suppressed recombination and by limited availability of polymorphic markers. Here we examined a strategy developed for the enrichment of molecular markers linked to this specific genomic region. We combined bulked segregant analysis (BSA) with next-generation sequencing (NGS) and de novo assembly of the sunflower transcriptome for single nucleotide polymorphism (SNP) discovery in a sequence resource combining reads originating from two sunflower species, H. annuus and H. argophyllus. Results A computational pipeline developed for SNP calling and pattern detection identified 219 candidate genes. For a proof of concept, 42 resistance gene-like sequences were subjected to experimental SNP validation. Using a high-resolution mapping population, 12 SNP markers were mapped to LG1. We successfully verified candidate sequences either co-segregating with or closely flanking PlARG. Conclusions This study is the first successful example to improve bulked segregant analysis with de novo transcriptome assembly using next generation sequencing. The BSTA pipeline we developed provides a useful guide for similar studies in other non-model organisms. Our results demonstrate this method is an efficient way to enrich molecular markers and to identify candidate genes in a specific mapping interval. PMID:24330545

  18. Complete De Novo Assembly of Monoclonal Antibody Sequences.

    PubMed

    Tran, Ngoc Hieu; Rahman, M Ziaur; He, Lin; Xin, Lei; Shan, Baozhen; Li, Ming

    2016-01-01

    De novo protein sequencing is one of the key problems in mass spectrometry-based proteomics, especially for novel proteins such as monoclonal antibodies for which genome information is often limited or not available. However, due to limitations in peptides fragmentation and coverage, as well as ambiguities in spectra interpretation, complete de novo assembly of unknown protein sequences still remains challenging. To address this problem, we propose an integrated system, ALPS, which for the first time can automatically assemble full-length monoclonal antibody sequences. Our system integrates de novo sequencing peptides, their quality scores and error-correction information from databases into a weighted de Bruijn graph to assemble protein sequences. We evaluated ALPS performance on two antibody data sets, each including a heavy chain and a light chain. The results show that ALPS was able to assemble three complete monoclonal antibody sequences of length 216-441 AA, at 100% coverage, and 96.64-100% accuracy. PMID:27562653
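
    The de Bruijn graph assembly mentioned above can be illustrated with a much simpler stand-in: overlapping peptides are cut into k-mers, each k-mer becomes an edge between its (k-1)-residue prefix and suffix, and a walk through the graph spells out the assembled sequence. The sketch below weights edges by occurrence count only and follows the heaviest edge greedily. It is not the ALPS algorithm, which also uses quality scores and database-derived error correction, and the peptides, the value of k, and the function names are assumptions for illustration.

```python
from collections import defaultdict


def build_de_bruijn(peptides, k=4):
    """Build a count-weighted de Bruijn graph: (k-1)-mer -> {next (k-1)-mer: count}."""
    graph = defaultdict(lambda: defaultdict(int))
    for peptide in peptides:
        for i in range(len(peptide) - k + 1):
            kmer = peptide[i:i + k]
            graph[kmer[:-1]][kmer[1:]] += 1
    return graph


def greedy_walk(graph, start):
    """Extend from `start`, always taking the highest-count edge; stop at a dead end or a cycle."""
    path, node, seen = start, start, {start}
    while graph.get(node):
        successors = graph[node]
        nxt = max(successors, key=successors.get)
        if nxt in seen:
            break
        path += nxt[-1]
        seen.add(nxt)
        node = nxt
    return path


# Hypothetical overlapping de novo peptides covering one short toy sequence.
toy_peptides = ["EVQLVESG", "LVESGGGL", "SGGGLVQ"]
graph = build_de_bruijn(toy_peptides, k=4)
print(greedy_walk(graph, "EVQ"))
```

    A greedy walk is the simplest possible traversal and will fail on repeats or sequencing errors; weighting edges by confidence, as the abstract describes, is one way to make the path choice more robust.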

  19. Complete De Novo Assembly of Monoclonal Antibody Sequences

    PubMed Central

    Tran, Ngoc Hieu; Rahman, M. Ziaur; He, Lin; Xin, Lei; Shan, Baozhen; Li, Ming

    2016-01-01

    De novo protein sequencing is one of the key problems in mass spectrometry-based proteomics, especially for novel proteins such as monoclonal antibodies for which genome information is often limited or not available. However, due to limitations in peptides fragmentation and coverage, as well as ambiguities in spectra interpretation, complete de novo assembly of unknown protein sequences still remains challenging. To address this problem, we propose an integrated system, ALPS, which for the first time can automatically assemble full-length monoclonal antibody sequences. Our system integrates de novo sequencing peptides, their quality scores and error-correction information from databases into a weighted de Bruijn graph to assemble protein sequences. We evaluated ALPS performance on two antibody data sets, each including a heavy chain and a light chain. The results show that ALPS was able to assemble three complete monoclonal antibody sequences of length 216–441 AA, at 100% coverage, and 96.64–100% accuracy. PMID:27562653

  20. A gradient-boosting approach for filtering de novo mutations in parent–offspring trios

    PubMed Central

    Liu, Yongzhuang; Li, Bingshan; Tan, Renjie; Zhu, Xiaolin; Wang, Yadong

    2014-01-01

    Motivation: Whole-genome and -exome sequencing on parent–offspring trios is a powerful approach to identifying disease-associated genes by detecting de novo mutations in patients. Accurate detection of de novo mutations from sequencing data is a critical step in trio-based genetic studies. Existing bioinformatic approaches usually yield high error rates due to sequencing artifacts and alignment issues, which may either miss true de novo mutations or call too many false ones, making downstream validation and analysis difficult. In particular, current approaches have much worse specificity than sensitivity, and developing effective filters to discriminate genuine from spurious de novo mutations remains an unsolved challenge. Results: In this article, we curated 59 sequence features in whole genome and exome alignment context which are considered to be relevant to discriminating true de novo mutations from artifacts, and then employed a machine-learning approach to classify candidates as true or false de novo mutations. Specifically, we built a classifier, named De Novo Mutation Filter (DNMFilter), using gradient boosting as the classification algorithm. We built the training set using experimentally validated true and false de novo mutations as well as collected false de novo mutations from an in-house large-scale exome-sequencing project. We evaluated DNMFilter’s theoretical performance and investigated relative importance of different sequence features on the classification accuracy. Finally, we applied DNMFilter on our in-house whole exome trios and one CEU trio from the 1000 Genomes Project and found that DNMFilter could be coupled with commonly used de novo mutation detection approaches as an effective filtering approach to significantly reduce false discovery rate without sacrificing sensitivity. Availability: The software DNMFilter implemented using a combination of Java and R is freely available from the website at http

  1. De novo transcript sequence reconstruction from RNA-Seq: reference generation and analysis with Trinity

    Technology Transfer Automated Retrieval System (TEKTRAN)

    De novo assembly of RNA-seq data enables researchers to study transcriptomes without the need for a genome sequence; this approach can be usefully applied, for instance, in research on 'non-model organisms' of ecological and evolutionary importance, cancer samples or the microbiome. In this protocol...

  2. Considering Transposable Element Diversification in De Novo Annotation Approaches

    PubMed Central

    Flutre, Timothée; Duprat, Elodie; Feuillet, Catherine; Quesneville, Hadi

    2011-01-01

    Transposable elements (TEs) are mobile, repetitive DNA sequences that are almost ubiquitous in prokaryotic and eukaryotic genomes. They have a large impact on genome structure, function and evolution. With the recent development of high-throughput sequencing methods, many genome sequences have become available, making possible comparative studies of TE dynamics at an unprecedented scale. Several methods have been proposed for the de novo identification of TEs in sequenced genomes. Most begin with the detection of genomic repeats, but the subsequent steps for defining TE families differ. High-quality TE annotations are available for the Drosophila melanogaster and Arabidopsis thaliana genome sequences, providing a solid basis for the benchmarking of such methods. We compared the performance of specific algorithms for the clustering of interspersed repeats and found that only a particular combination of algorithms detected TE families with good recovery of the reference sequences. We then applied a new procedure for reconciling the different clustering results and classifying TE sequences. The whole approach was implemented in a pipeline using the REPET package. Finally, we show that our combined approach highlights the dynamics of well defined TE families by making it possible to identify structural variations among their copies. This approach makes it possible to annotate TE families and to study their diversification in a single analysis, improving our understanding of TE dynamics at the whole-genome scale and for diverse species. PMID:21304975

  3. De novo assembly of a bell pepper endornavirus genome sequence using RNA sequencing data.

    PubMed

    Jo, Yeonhwa; Choi, Hoseng; Cho, Won Kyong

    2015-01-01

    Members of the genus Endornavirus are double-stranded RNA viruses that infect a wide range of hosts. In this study, we report on the de novo assembly of a bell pepper endornavirus genome sequence by RNA sequencing (RNA-Seq). Our results demonstrate the successful application of RNA-Seq to obtain a complete viral genome sequence from transcriptome data. PMID:25792042

  4. De novo genome sequence assembly of a filamentous fungus using Sanger, 454 and Illumina sequence data

    PubMed Central

    DiGuistini, Scott; Liao, Nancy Y; Platt, Darren; Robertson, Gordon; Seidel, Michael; Chan, Simon K; Docking, T Roderick; Birol, Inanc; Holt, Robert A; Hirst, Martin; Mardis, Elaine; Marra, Marco A; Hamelin, Richard C; Bohlmann, Jörg; Breuil, Colette; Jones, Steven JM

    2009-01-01

    Sequencing-by-synthesis technologies can reduce the cost of generating de novo genome assemblies. We report a method for assembling draft genome sequences of eukaryotic organisms that integrates sequence information from different sources, and demonstrate its effectiveness by assembling an approximately 32.5 Mb draft genome sequence for the forest pathogen Grosmannia clavigera, an ascomycete fungus. We also developed a method for assessing draft assemblies using Illumina paired end read data and demonstrate how we are using it to guide future sequence finishing. Our results demonstrate that eukaryotic genome sequences can be accurately assembled by combining Illumina, 454 and Sanger sequence data. PMID:19747388

  5. De novo SNP discovery and development of an interspecific cotton genome map using a simplified genotyping-by-sequencing (GBS) approach

    Technology Transfer Automated Retrieval System (TEKTRAN)

    Recent developments in next-generation sequencing (NGS) technology have lowered the cost of sequencing per base and enabled whole genome re-sequencing, genome-wide association studies, and for some species, unprecedented discovery of molecular markers. For species with large, complex genomes, genot...

  6. In vitro, long-range sequence information for de novo genome assembly via transposase contiguity.

    PubMed

    Adey, Andrew; Kitzman, Jacob O; Burton, Joshua N; Daza, Riza; Kumar, Akash; Christiansen, Lena; Ronaghi, Mostafa; Amini, Sasan; Gunderson, Kevin L; Steemers, Frank J; Shendure, Jay

    2014-12-01

    We describe a method that exploits contiguity preserving transposase sequencing (CPT-seq) to facilitate the scaffolding of de novo genome assemblies. CPT-seq is an entirely in vitro means of generating libraries comprised of 9216 indexed pools, each of which contains thousands of sparsely sequenced long fragments ranging from 5 kilobases to > 1 megabase. These pools are "subhaploid," in that the lengths of fragments contained in each pool sum to ∼5% to 10% of the full genome. The scaffolding approach described here, termed fragScaff, leverages coincidences between the content of different pools as a source of contiguity information. Specifically, CPT-seq data is mapped to a de novo genome assembly, followed by the identification of pairs of contigs or scaffolds whose ends disproportionately co-occur in the same indexed pools, consistent with true adjacency in the genome. Such candidate "joins" are used to construct a graph, which is then resolved by a minimum spanning tree. As a proof-of-concept, we apply CPT-seq and fragScaff to substantially boost the contiguity of de novo assemblies of the human, mouse, and fly genomes, increasing the scaffold N50 of de novo assemblies by eight- to 57-fold with high accuracy. We also demonstrate that fragScaff is complementary to Hi-C-based contact probability maps, providing midrange contiguity to support robust, accurate chromosome-scale de novo genome assemblies without the need for laborious in vivo cloning steps. Finally, we demonstrate CPT-seq as a means of anchoring unplaced novel human contigs to the reference genome as well as for detecting misassembled sequences. PMID:25327137
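
    The pool co-occurrence idea described above can be illustrated with a toy sketch. This is not the fragScaff implementation: the Jaccard score, the end-naming scheme (`contig:L` / `contig:R`), and the function names are assumptions made for illustration. Candidate joins between contig ends are scored by shared pool membership, and a Kruskal-style pass keeps the highest-scoring joins that do not close a cycle, which is equivalent to a minimum spanning tree over negated scores.

```python
from itertools import combinations


def jaccard(a, b):
    """Fraction of pools shared by two contig ends (hypothetical score)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def spanning_joins(pool_hits, min_score=0.0):
    """Keep the highest-scoring joins that form a spanning forest.

    pool_hits maps a contig-end name such as "c1:R" to the set of pool
    indices in which reads hit that end. Returns (end_a, end_b, score).
    """
    edges = []
    for a, b in combinations(sorted(pool_hits), 2):
        # The two ends of one contig are already linked; skip that pair.
        if a.rsplit(":", 1)[0] == b.rsplit(":", 1)[0]:
            continue
        score = jaccard(pool_hits[a], pool_hits[b])
        if score > min_score:
            edges.append((score, a, b))
    # Highest score first; names break ties deterministically.
    edges.sort(key=lambda e: (-e[0], e[1], e[2]))

    parent = {name: name for name in pool_hits}

    def find(x):
        # Union-find root lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    kept = []
    for score, a, b in edges:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:  # joining two components never closes a cycle
            parent[root_a] = root_b
            kept.append((a, b, score))
    return kept
```

    A real scaffolder would also have to decide contig orientation and filter joins statistically; the sketch only shows how shared pool membership can be turned into a graph and reduced to a tree.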

  7. In vitro, long-range sequence information for de novo genome assembly via transposase contiguity

    PubMed Central

    Adey, Andrew; Kitzman, Jacob O.; Burton, Joshua N.; Daza, Riza; Kumar, Akash; Christiansen, Lena; Ronaghi, Mostafa; Amini, Sasan; Gunderson, Kevin L.; Steemers, Frank J.

    2014-01-01

    We describe a method that exploits contiguity preserving transposase sequencing (CPT-seq) to facilitate the scaffolding of de novo genome assemblies. CPT-seq is an entirely in vitro means of generating libraries comprised of 9216 indexed pools, each of which contains thousands of sparsely sequenced long fragments ranging from 5 kilobases to >1 megabase. These pools are “subhaploid,” in that the lengths of fragments contained in each pool sum to ∼5% to 10% of the full genome. The scaffolding approach described here, termed fragScaff, leverages coincidences between the content of different pools as a source of contiguity information. Specifically, CPT-seq data is mapped to a de novo genome assembly, followed by the identification of pairs of contigs or scaffolds whose ends disproportionately co-occur in the same indexed pools, consistent with true adjacency in the genome. Such candidate “joins” are used to construct a graph, which is then resolved by a minimum spanning tree. As a proof-of-concept, we apply CPT-seq and fragScaff to substantially boost the contiguity of de novo assemblies of the human, mouse, and fly genomes, increasing the scaffold N50 of de novo assemblies by eight- to 57-fold with high accuracy. We also demonstrate that fragScaff is complementary to Hi-C-based contact probability maps, providing midrange contiguity to support robust, accurate chromosome-scale de novo genome assemblies without the need for laborious in vivo cloning steps. Finally, we demonstrate CPT-seq as a means of anchoring unplaced novel human contigs to the reference genome as well as for detecting misassembled sequences. PMID:25327137

  8. High-confidence de novo peptide sequencing using positive charge derivatization and tandem MS spectra merging.

    PubMed

    An, Mingrui; Zou, Xiao; Wang, Qingsong; Zhao, Xuyang; Wu, Jing; Xu, Li-Ming; Shen, Hong-Yan; Xiao, Xueyuan; He, Dacheng; Ji, Jianguo

    2013-05-01

    De novo peptide sequencing holds great promise for discovering new protein sequences and modifications but has often been hindered by the low success rate of mass spectra interpretation, mainly due to the diversity of fragment ion types and insufficient information for each ion series. Here, we describe a novel methodology that combines highly efficient on-tip charge derivatization and tandem MS spectra merging, which greatly boosts the performance of interpretation. TMPP-Ac-OSu (succinimidyloxycarbonylmethyl tris(2,4,6-trimethoxyphenyl)phosphonium bromide) was used to derivatize peptides at N-termini on tips to reduce mass spectra complexity. Then, a novel approach of spectra merging was adopted to combine the benefits of collision-induced dissociation (CID) and electron transfer dissociation (ETD) fragmentation. We applied this methodology to rat C6 glioma cells and Cyprinus carpio and searched the resulting peptide sequences against the protein database. This yielded thousands of high-confidence peptide sequences, a level that conventional de novo sequencing methods could not reach. Next, we identified dozens of novel peptide sequences by homology searching of sequences that were fully backbone covered but unmatched during the database search. Furthermore, we randomly chose 34 sequences discovered in rat C6 cells and verified them. Finally, we conclude that this novel methodology that combines on-tip positive charge derivatization and tandem MS spectra merging will greatly facilitate the discovery of novel proteins and the proteome analysis of nonmodel organisms. PMID:23536960

  9. De Novo Methylation of Repeated Sequences in Coprinus Cinereus

    PubMed Central

    Freedman, T.; Pukkila, P. J.

    1993-01-01

    We have examined the stability of duplicated DNA sequences in the sexual phase of the life cycle of the basidiomycete fungus, Coprinus cinereus. We observed premeiotic de novo methylation in haploid nuclei containing either a triplication, a tandem duplication, or an ectopic duplication. Methylation changes were not observed in unique sequences. Repeated sequences underwent methylation changes during the dikaryotic stage. In one cross, 27% of the segregants exhibited methylation-directed gene inactivation. However, all auxotrophs eventually reverted to prototrophy. C to T transition mutations were not observed in this study. Our studies also revealed one inversion that occurred in 50% of the segregants in a single triplication cross, and a single pop-out event that occurred during vegetative growth. These alterations were similar to changes reported in experiments with duplicated sequences in Neurospora crassa and Ascobolus immersus. However, significant differences were also noted. First, the extent of methylation was much less in C. cinereus than in the other two fungi. Second, CpG sequences appeared to be the preferred targets of methylation. PMID:8244000

  10. LESSONS IN DE NOVO PEPTIDE SEQUENCING BY TANDEM MASS SPECTROMETRY

    PubMed Central

    Medzihradszky, Katalin F.; Chalkley, Robert J.

    2015-01-01

    Mass spectrometry has become the method of choice for the qualitative and quantitative characterization of protein mixtures isolated from all kinds of living organisms. The raw data in these studies are MS/MS spectra, usually of peptides produced by proteolytic digestion of a protein. These spectra are “translated” into peptide sequences, normally with the help of various search engines. Data acquisition and interpretation have both been automated, and most researchers look only at the summary of the identifications without ever viewing the underlying raw data used for assignments. Automated analysis of data is essential due to the volume produced. However, being familiar with the finer intricacies of peptide fragmentation processes, and experiencing the difficulties of manual data interpretation allow a researcher to be able to more critically evaluate key results, particularly because there are many known rules of peptide fragmentation that are not incorporated into search engine scoring. Since the most commonly used MS/MS activation method is collision-induced dissociation (CID), in this article we present a brief review of the history of peptide CID analysis. Next, we provide a detailed tutorial on how to determine peptide sequences from CID data. Although the focus of the tutorial is de novo sequencing, the lessons learned and resources supplied are useful for data interpretation in general. PMID:25667941
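
    The core step of the manual interpretation this tutorial covers, reading residues off the mass differences between consecutive peaks of a single fragment ion series, can be sketched as follows. This is a minimal illustration and not the authors' procedure: the function name, the 0.01 Da tolerance, and the assumption of a clean, singly charged ion series are hypothetical. The monoisotopic residue masses are standard values. Leucine and isoleucine are isobaric and are reported together as `L/I`.

```python
# Monoisotopic residue masses in Da (standard values).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L/I": 113.08406,
    "N": 114.04293, "D": 115.02694, "Q": 128.05858, "K": 128.09496,
    "E": 129.04259, "M": 131.04049, "H": 137.05891, "F": 147.06841,
    "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}


def read_ladder(peaks, tol=0.01):
    """Translate gaps between consecutive peaks of one ion series into residues.

    peaks: m/z values of singly charged fragments from a single series.
    tol:   mass tolerance in Da (hypothetical default).
    Gaps matching no residue are reported as "?"; ambiguous gaps list
    every candidate separated by "|".
    """
    ordered = sorted(peaks)
    calls = []
    for low, high in zip(ordered, ordered[1:]):
        delta = high - low
        matches = [aa for aa, mass in RESIDUE_MASS.items()
                   if abs(delta - mass) <= tol]
        calls.append("|".join(matches) if matches else "?")
    return calls
```

    For example, a hypothetical b-ion series with peaks at 58.02874, 129.06585, 216.09788 and 344.19284 has gaps that read as A, S, K. With a wider tolerance (for instance 0.05 Da) the last gap could no longer separate K from Q, which illustrates the "frequent overlap of fragment masses" that makes real spectra harder than this sketch suggests.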

  11. The De Novo Assembly of Mitochondrial Genomes of the Extinct Passenger Pigeon (Ectopistes migratorius) with Next Generation Sequencing

    PubMed Central

    Hung, Chih-Ming; Lin, Rong-Chien; Chu, Jui-Hua; Yeh, Chia-Fen; Yao, Chiou-Ju; Li, Shou-Hsien

    2013-01-01

    The information from ancient DNA (aDNA) provides an unparalleled opportunity to infer phylogenetic relationships and population history of extinct species and to investigate genetic evolution directly. However, the degraded and fragmented nature of aDNA has posed technical challenges for studies based on conventional PCR amplification. In this study, we present an approach based on next generation sequencing to efficiently sequence the complete mitochondrial genome (mitogenome) of two extinct passenger pigeons (Ectopistes migratorius) using de novo assembly of massive short (90 bp), paired-end or single-end reads. Although varying levels of human contamination and low levels of postmortem nucleotide lesion were observed, they did not impact sequencing accuracy. Our results demonstrated that the de novo assembly of shotgun sequence reads could be a potent approach to sequence mitogenomes, and offered an efficient way to infer evolutionary history of extinct species. PMID:23437111

  12. Evaluation and validation of de novo and hybrid assembly techniques to derive high quality genome sequences

    DOE PAGESBeta

    Utturkar, Sagar M.; Klingeman, Dawn Marie; Land, Miriam L.; Schadt, Christopher Warren; Doktycz, Mitchel John; Pelletier, Dale A.; Brown, Steven D.

    2014-06-14

    Our motivation with this work was to assess the potential of different types of sequence data combined with de novo and hybrid assembly approaches to improve existing draft genome sequences. Our results show Illumina, 454 and PacBio sequencing technologies were used to generate de novo and hybrid genome assemblies for four different bacteria, which were assessed for quality using summary statistics (e.g. number of contigs, N50) and in silico evaluation tools. Differences in predictions of multiple copies of rDNA operons for each respective bacterium were evaluated by PCR and Sanger sequencing, and then the validated results were applied as an additional criterion to rank assemblies. In general, assemblies using longer PacBio reads were better able to resolve repetitive regions. In this study, the combination of Illumina and PacBio sequence data assembled through the ALLPATHS-LG algorithm gave the best summary statistics and most accurate rDNA operon number predictions. This study will aid others looking to improve existing draft genome assemblies. As to availability and implementation–all assembly tools except CLC Genomics Workbench are freely available under GNU General Public License.

  13. Evaluation and validation of de novo and hybrid assembly techniques to derive high quality genome sequences

    SciTech Connect

    Utturkar, Sagar M.; Klingeman, Dawn Marie

    2014-06-14

    Our motivation with this work was to assess the potential of different types of sequence data combined with de novo and hybrid assembly approaches to improve existing draft genome sequences. Our results show Illumina, 454 and PacBio sequencing technologies were used to generate de novo and hybrid genome assemblies for four different bacteria, which were assessed for quality using summary statistics (e.g. number of contigs, N50) and in silico evaluation tools. Differences in predictions of multiple copies of rDNA operons for each respective bacterium were evaluated by PCR and Sanger sequencing, and then the validated results were applied as an additional criterion to rank assemblies. In general, assemblies using longer PacBio reads were better able to resolve repetitive regions. In this study, the combination of Illumina and PacBio sequence data assembled through the ALLPATHS-LG algorithm gave the best summary statistics and most accurate rDNA operon number predictions. This study will aid others looking to improve existing draft genome assemblies. As to availability and implementation–all assembly tools except CLC Genomics Workbench are freely available under GNU General Public License.
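
    The N50 summary statistic mentioned above is easy to state in code. The sketch below uses the common definition (the length of the shortest contig in the smallest set of longest contigs that together cover at least half of the total assembly length). The function name and the example lengths are illustrative and are not taken from the study.

```python
def n50(contig_lengths):
    """Return the N50 of an assembly given its contig lengths.

    Contigs are added from longest to shortest until they cover at least
    half of the total assembly length; the length of the last contig
    added is the N50.
    """
    lengths = sorted(contig_lengths, reverse=True)
    half_total = sum(lengths) / 2
    running = 0
    for length in lengths:
        running += length
        if running >= half_total:
            return length
    raise ValueError("n50() needs at least one contig length")
```

    With contigs of 80, 70, 50, 40, 30 and 20 kb (290 kb in total), the two longest contigs already cover 150 kb, so the N50 is 70 kb. N50 rewards contiguity only; it says nothing about correctness, which is why the study pairs it with PCR-validated rDNA operon counts.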

  14. Streamlined analysis of duplex sequencing data with Du Novo.

    PubMed

    Stoler, Nicholas; Arbeithuber, Barbara; Guiblet, Wilfried; Makova, Kateryna D; Nekrutenko, Anton

    2016-01-01

    Duplex sequencing was originally developed to detect rare nucleotide polymorphisms normally obscured by the noise of high-throughput sequencing. Here we describe a new, streamlined, reference-free approach for the analysis of duplex sequencing data. We show the approach performs well on simulated data and precisely reproduces previously published results and apply it to a newly produced dataset, enabling us to type low-frequency variants in human mitochondrial DNA. Finally, we provide all necessary tools as stand-alone components as well as integrate them into the Galaxy platform. All analyses performed in this manuscript can be repeated exactly as described at http://usegalaxy.org/duplex . PMID:27566673

  15. Improving de novo sequence assembly using machine learning and comparative genomics for overlap correction

    PubMed Central

    2010-01-01

    Background: With the rapid expansion of DNA sequencing databases, it is now feasible to identify relevant information from prior sequencing projects and completed genomes and apply it to de novo sequencing of new organisms. As an example, this paper demonstrates how such extra information can be used to improve de novo assemblies by augmenting the overlapping step. Finding all pairs of overlapping reads is a key task in many genome assemblers, and to this end, highly efficient algorithms have been developed to find alignments in large collections of sequences. It is well known that due to repeated sequences, many aligned pairs of reads nevertheless do not overlap. But no overlapping algorithm to date takes a rigorous approach to separating aligned but non-overlapping read pairs from true overlaps. Results: We present an approach that extends the Minimus assembler by a data driven step to classify overlaps as true or false prior to contig construction. We trained several different classification models within the Weka framework using various statistics derived from overlaps of reads available from prior sequencing projects. These statistics included percent mismatch and k-mer frequencies within the overlaps as well as a comparative genomics score derived from mapping reads to multiple reference genomes. We show that in real whole-genome sequencing data from the E. coli and S. aureus genomes, by providing a curated set of overlaps to the contigging phase of the assembler, we nearly doubled the median contig length (N50) without sacrificing coverage of the genome or increasing the number of mis-assemblies. Conclusions: Machine learning methods that use comparative and non-comparative features to classify overlaps as true or false can be used to improve the quality of a sequence assembly. PMID:20078885
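
    Two of the overlap statistics named above, percent mismatch and k-mer frequency within the overlap, can be computed as in the sketch below. This is an illustration under stated assumptions rather than the paper's feature extraction: it assumes an ungapped suffix-prefix overlap of known length, summarizes k-mer frequency as the count of the most repeated k-mer, and uses hypothetical function and key names. The resulting feature dictionary is the kind of input a classifier could be trained on.

```python
from collections import Counter


def percent_mismatch(a, b):
    """Percentage of positions at which two equal-length strings differ."""
    if len(a) != len(b) or not a:
        raise ValueError("sequences must be non-empty and of equal length")
    mismatches = sum(x != y for x, y in zip(a, b))
    return 100.0 * mismatches / len(a)


def kmer_counts(seq, k):
    """Count every k-mer occurring in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def overlap_features(read_a, read_b, overlap_len, k=3):
    """Features for the overlap of read_a's suffix with read_b's prefix.

    Assumes an ungapped overlap of overlap_len bases (a simplification).
    """
    if not 0 < overlap_len <= min(len(read_a), len(read_b)):
        raise ValueError("overlap_len must fit inside both reads")
    suffix = read_a[-overlap_len:]
    prefix = read_b[:overlap_len]
    counts = kmer_counts(suffix, k)
    return {
        "percent_mismatch": percent_mismatch(suffix, prefix),
        # A high count hints that the overlap lies in low-complexity or
        # repetitive sequence, where false overlaps are more likely.
        "max_kmer_count": max(counts.values()) if counts else 0,
    }
```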

  16. Homology-Driven Proteomics of Dinoflagellates with Unsequenced Genomes Using MALDI-TOF/TOF and Automated De Novo Sequencing

    PubMed Central

    Wang, Da-Zhi; Li, Cheng; Xie, Zhang-Xian; Dong, Hong-Po; Lin, Lin; Hong, Hua-Sheng

    2011-01-01

    This study developed a multilayered, gel-based, and underivatized strategy for de novo protein sequence analysis of unsequenced dinoflagellates using a MALDI-TOF/TOF mass spectrometer with the assistance of DeNovo Explorer software. MASCOT was applied as the first layer screen to identify either known or unknown proteins sharing identical peptides present in a database. Once the confident identifications were removed after searching against the NCBInr database, the remainder was searched against the dinoflagellate expressed sequence tag database. In the last layer, those borderline and nonconfident hits were further subjected to de novo interpretation using DeNovo Explorer software. The de novo sequences passing a reliability filter were subsequently submitted to nonredundant MS-BLAST search. Using this layered identification method, 216 protein spots representing 158 unique proteins out of 220 selected protein spots from Alexandrium tamarense, a dinoflagellate with unsequenced genome, were confidently or tentatively identified by database searching. These proteins were involved in various intracellular physiological activities. This study is the first effort to develop a completely automated approach to identify proteins from unsequenced dinoflagellate databases and establishes a preliminary protein database for various physiological studies of dinoflagellates in the future. PMID:21977052

  17. Identifying wrong assemblies in de novo short read primary sequence assembly contigs.

    PubMed

    Chawla, Vandna; Kumar, Rajnish; Shankar, Ravi

    2016-09-01

    With the advent of short-read-based genome sequencing approaches, a large number of organisms are being sequenced all over the world. Most of these assemblies are produced with de novo short read assemblers and related approaches. However, the contigs produced this way are prone to mis-assembly. So far, there is a conspicuous dearth of reliable tools to identify mis-assembled contigs. Mis-assemblies could result from incorrectly deleted or wrongly arranged genomic sequences. In the present work, various factors related to sequence, sequencing and assembling have been assessed for their role in causing mis-assembly by using different genome sequencing data. Some mis-assembly detection tools have also been evaluated for their ability to detect wrongly assembled primary contigs, and the results suggest considerable scope for improvement in this area. The present work also proposes a simple, novel, unsupervised learning-based approach to identify mis-assemblies in contigs, which performed reasonably well when compared with existing tools for reporting mis-assembled contigs. It was observed that the proposed methodology may work as a complementary system to the existing tools to enhance their accuracy. PMID:27581937

  18. De Novo Sequencing of Heparan Sulfate Oligosaccharides by Electron-Activated Dissociation

    PubMed Central

    Huang, Yu; Yu, Xiang; Mao, Yang; Costello, Catherine E.; Zaia, Joseph; Lin, Cheng

    2014-01-01

    Structural characterization of highly sulfated glycosaminoglycans (GAGs) by collisionally activated dissociation (CAD) is challenging because of the extensive sulfate losses mediated by free protons. While removal of the free protons may be achieved through the use of derivatization, metal cation adducts, and/or electrospray supercharging reagents, these steps add complexity to the experimental workflow. It is therefore desirable to develop an analytical approach for GAG sequencing that does not require derivatization or addition of reagents to the electrospray solution. Electron detachment dissociation (EDD) can produce extensive and informative fragmentation for GAGs without the need to remove free protons from the precursor ions. However, EDD is an inefficient process, often requiring consumption of large sample quantities (typically several micrograms), particularly for highly sulfated GAG ions. Here, we report that with improved instrumentation, optimization of the ionization and ion transfer parameters, and enhanced EDD efficiency, it is possible to generate highly informative EDD spectra of highly sulfated GAGs on the liquid chromatography (LC) time-scale, with consumption of only a few nanograms of sample. We further show that negative electron transfer dissociation (NETD) is an even more effective fragmentation technique for GAG sequencing, producing fewer sulfate losses while consuming smaller amounts of sample. Finally, a simple algorithm was developed for de novo HS sequencing based on their high resolution tandem mass spectra. These results demonstrate the potential of EDD and NETD as sensitive analytical tools for detailed, high-throughput, de novo structural analyses of highly sulfated GAGs. PMID:24224699

  19. Combining phage display with de novo protein sequencing for reverse engineering of monoclonal antibodies.

    PubMed

    Rickert, Keith W; Grinberg, Luba; Woods, Robert M; Wilson, Susan; Bowen, Michael A; Baca, Manuel

    2016-04-01

    The enormous diversity created by gene recombination and somatic hypermutation makes de novo protein sequencing of monoclonal antibodies a uniquely challenging problem. Modern mass spectrometry-based sequencing will rarely, if ever, provide a single unambiguous sequence for the variable domains. A more likely outcome is computation of an ensemble of highly similar sequences that can satisfy the experimental data. This outcome can result in the need for empirical testing of many candidate sequences, sometimes iteratively, to identify one which can replicate the activity of the parental antibody. Here we describe an improved approach to antibody protein sequencing by using phage display technology to generate a combinatorial library of sequences that satisfy the mass spectrometry data, and selecting for functional candidates that bind antigen. This approach was used to reverse engineer 2 commercially obtained monoclonal antibodies against murine CD137. Proteomic data enabled us to assign the majority of the variable domain sequences, with the exception of 3-5% of the sequence located within or adjacent to complementarity-determining regions. To efficiently resolve the sequence in these regions, small phage-displayed libraries were generated and subjected to antigen binding selection. Following enrichment of antigen-binding clones, 2 clones were selected for each antibody and recombinantly expressed as antigen-binding fragments (Fabs). In both cases, the reverse-engineered Fabs exhibited the same antigen binding affinity, within error, as Fabs produced from the commercial IgGs. This combination of proteomic and protein engineering techniques provides a useful approach to simplifying the technically challenging process of reverse engineering monoclonal antibodies from protein material. PMID:26852694

  20. Genomic Resources for Water Yam (Dioscorea alata L.): Analyses of EST-Sequences, De Novo Sequencing and GBS Libraries.

    PubMed

    Saski, Christopher A; Bhattacharjee, Ranjana; Scheffler, Brian E; Asiedu, Robert

    2015-01-01

    The falling cost and rapid progress in next-generation sequencing techniques coupled with high performance computational approaches have resulted in large-scale discovery of advanced genomic resources in several model and non-model plant species. Yam (Dioscorea spp.) is a major food and cash crop in many countries, but research efforts to understand the genetics and generate genomic information for the crop have been limited. The availability of a large number of genomic resources including genome-wide molecular markers will accelerate the breeding efforts and application of genomic selection in yams. In the present study, several methods including expressed sequence tags (EST)-sequencing, de novo sequencing, and genotyping-by-sequencing (GBS) profiles on two yam (Dioscorea alata L.) genotypes (TDa 95/00328 and TDa 95-310) were performed to generate genomic resources for use in its improvement programs. This includes a comprehensive set of EST-SSRs, genomic SSRs, whole genome SNPs, and reduced representation SNPs. A total of 1,152 EST-SSRs were developed from >40,000 EST-sequences generated from the two genotypes. A set of 388 EST-SSRs were validated as polymorphic showing a polymorphism rate of 34% when tested on two diverse parents targeted for anthracnose disease. In addition, approximately 40X de novo whole genome sequence coverage was generated for each of the two genotypes, and a total of 18,584 and 15,952 genomic SSRs were identified for TDa 95/00328 and TDa 95-310, respectively. A custom made pipeline resulted in the selection of 573 genomic SSRs common across the two genotypes, of which only eight failed, 478 being polymorphic and 62 monomorphic indicating a polymorphic rate of 83.5%. Additionally, 288,505 high quality SNPs were also identified between these two genotypes. Genotyping by sequencing reads on these two genotypes also revealed 36,790 overlapping SNP positions that are distributed throughout the genome.
Our efforts in using different approaches

  1. Genomic Resources for Water Yam (Dioscorea alata L.): Analyses of EST-Sequences, De Novo Sequencing and GBS Libraries

    PubMed Central

    Saski, Christopher A.; Bhattacharjee, Ranjana; Scheffler, Brian E.; Asiedu, Robert

    2015-01-01

    The falling cost and rapid progress in next-generation sequencing techniques coupled with high performance computational approaches have resulted in large-scale discovery of advanced genomic resources in several model and non-model plant species. Yam (Dioscorea spp.) is a major food and cash crop in many countries, but research efforts to understand the genetics and generate genomic information for the crop have been limited. The availability of a large number of genomic resources including genome-wide molecular markers will accelerate the breeding efforts and application of genomic selection in yams. In the present study, several methods including expressed sequence tags (EST)-sequencing, de novo sequencing, and genotyping-by-sequencing (GBS) profiles on two yam (Dioscorea alata L.) genotypes (TDa 95/00328 and TDa 95-310) were performed to generate genomic resources for use in its improvement programs. This includes a comprehensive set of EST-SSRs, genomic SSRs, whole genome SNPs, and reduced representation SNPs. A total of 1,152 EST-SSRs were developed from >40,000 EST-sequences generated from the two genotypes. A set of 388 EST-SSRs were validated as polymorphic showing a polymorphism rate of 34% when tested on two diverse parents targeted for anthracnose disease. In addition, approximately 40X de novo whole genome sequence coverage was generated for each of the two genotypes, and a total of 18,584 and 15,952 genomic SSRs were identified for TDa 95/00328 and TDa 95-310, respectively. A custom made pipeline resulted in the selection of 573 genomic SSRs common across the two genotypes, of which only eight failed, 478 being polymorphic and 62 monomorphic indicating a polymorphic rate of 83.5%. Additionally, 288,505 high quality SNPs were also identified between these two genotypes. Genotyping by sequencing reads on these two genotypes also revealed 36,790 overlapping SNP positions that are distributed throughout the genome.
Our efforts in using different approaches

  2. Using Illumina next generation sequencing technologies to sequence multigene families in de novo species.

    PubMed

    Hughes, Graham M; Gang, Li; Murphy, William J; Higgins, Desmond G; Teeling, Emma C

    2013-05-01

    The advent of Next Generation Sequencing Technology (NGST) has revolutionized molecular biology research, allowing for rapid gene/genome sequencing from a multitude of diverse species. As high throughput sequencing becomes more accessible, more efficient workflows must be developed to deal with the amounts of data produced and better assemble the genomes of de novo lineages. We combine traditional laboratory methods with Illumina NGST to amplify and sequence the largest mammalian multigene family, the Olfactory Receptor gene family, for species with and without a reference genome. We develop novel assembly methods to annotate and filter these data, which can be utilized for any gene family or any species. We find no significant difference between the ratio of genes within their respective gene families of our data compared with available genomic data. Using simulated data we explore the limitations of short-read sequence data and our assembly in recovering this gene family. We highlight the benefits and shortcomings of these methods. Compared with data generated from traditional polymerase chain reaction, cloning and Sanger sequencing methodologies, sequence data generated using our pipeline increases yield and sequencing efficiency without reducing the number of unique genes amplified. A cloning step is not required, therefore shortening data generation time. The novel downstream methodologies and workflows described provide a tool to be utilized by many fields of biology, to access and analyze the vast quantities of data generated. By combining laboratory and in silico methods, we provide a means of extracting genomic information for multigene families without complete genome sequencing. PMID:23480365

  3. A general approach for discriminative de novo motif discovery from high-throughput data

    PubMed Central

    Grau, Jan; Posch, Stefan; Grosse, Ivo; Keilwagen, Jens

    2013-01-01

    De novo motif discovery has been an important challenge of bioinformatics for the past two decades. Since the emergence of high-throughput techniques like ChIP-seq, ChIP-exo and protein-binding microarrays (PBMs), the focus of de novo motif discovery has shifted to runtime and accuracy on large data sets. For this purpose, specialized algorithms have been designed for discovering motifs in ChIP-seq or PBM data. However, none of the existing approaches work perfectly for all three high-throughput techniques. In this article, we propose Dimont, a general approach for fast and accurate de novo motif discovery from high-throughput data. We demonstrate that Dimont yields a higher number of correct motifs from ChIP-seq data than any of the specialized approaches and achieves a higher accuracy for predicting PBM intensities from probe sequence than any of the approaches specifically designed for that purpose. Dimont also reports the expected motifs for several ChIP-exo data sets. Investigating differences between in vitro and in vivo binding, we find that for most transcription factors, the motifs discovered by Dimont are in good accordance between techniques, but we also find notable exceptions. We also observe that modeling intra-motif dependencies may increase accuracy, which indicates that more complex motif models are a worthwhile field of research. PMID:24057214

  4. An ensemble strategy that significantly improves de novo assembly of microbial genomes from metagenomic next-generation sequencing data

    PubMed Central

    Deng, Xutao; Naccache, Samia N.; Ng, Terry; Federman, Scot; Li, Linlin; Chiu, Charles Y.; Delwart, Eric L.

    2015-01-01

    Next-generation sequencing (NGS) approaches rapidly produce millions to billions of short reads, which allow pathogen detection and discovery in human clinical, animal and environmental samples. A major limitation of sequence homology-based identification for highly divergent microorganisms is the short length of reads generated by most highly parallel sequencing technologies. Short reads require a high level of sequence similarities to annotated genes to confidently predict gene function or homology. Such recognition of highly divergent homologues can be improved by reference-free (de novo) assembly of short overlapping sequence reads into larger contigs. We describe an ensemble strategy that integrates the sequential use of various de Bruijn graph and overlap-layout-consensus assemblers with a novel partitioned sub-assembly approach. We also proposed new quality metrics that are suitable for evaluating metagenome de novo assembly. We demonstrate that this new ensemble strategy tested using in silico spike-in, clinical and environmental NGS datasets achieved significantly better contigs than current approaches. PMID:25586223
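    As a rough illustration of the de Bruijn graph idea mentioned in this abstract, the sketch below builds a graph whose nodes are (k-1)-mers and walks it while the path is unambiguous. It is a toy under stated assumptions, not the ensemble strategy of the paper: the reads, the value of k, and the function names de_bruijn_graph and extend_unambiguous are all hypothetical, and real assemblers also handle sequencing errors, reverse complements and coverage.

```python
from collections import defaultdict


def de_bruijn_graph(reads, k):
    """Map each (k-1)-mer to the (k-1)-mers that follow it in some read."""
    graph = defaultdict(list)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].append(kmer[1:])
    return graph


def extend_unambiguous(graph, start):
    """Extend a contig from `start` while each node has one distinct successor."""
    contig, node, seen = start, start, {start}
    while True:
        successors = set(graph.get(node, []))
        if len(successors) != 1:
            break  # dead end or branch: stop extending
        node = successors.pop()
        if node in seen:
            break  # cycle: stop rather than loop forever
        seen.add(node)
        contig += node[-1]
    return contig


# Hypothetical overlapping short reads, error-free for simplicity.
reads = ["ATGGCGT", "GGCGTGC", "GTGCAAT"]
graph = de_bruijn_graph(reads, k=4)
contig = extend_unambiguous(graph, "ATG")
print(contig)
```

    With these toy reads the walk joins all three reads into a single contig longer than any individual read, which is the property the abstract relies on when it says assembly into larger contigs helps recognize divergent homologues.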

  5. De novo sequencing of peptides from top-down tandem mass spectra

    SciTech Connect

    Vyatkina, Kira; Wu, Si; Dekker, Leendert J.; vanDuijn, Martijn M.; Liu, Xiaowen; Tolic, Nikola; Dvorkin, Mikhail; Alexandrova, Sonya; Luider, Theo N.; Pasa-Tolic, Ljiljana; Pevzner, Pavel A.

    2015-09-28

    De novo sequencing of proteins and peptides is one of the most important problems in mass spectrometry-driven proteomics. A variety of methods have been developed to accomplish this task from a set of bottom-up tandem (MS/MS) mass spectra. However, the more recently emerged top-down technology, which is steadily gaining popularity, opens new perspectives for protein analysis and characterization, creating a need for efficient algorithms for processing this kind of MS/MS data. Here we describe a method that retrieves long and accurate sequence fragments of the proteins contained in a sample from a set of top-down MS/MS spectra. To this end, we outline a strategy for generating high-quality sequence tags from top-down spectra, and introduce the concept of a T-Bruijn graph by adapting the notion of an A-Bruijn graph, widely used in genomics, to the case of tags. The output of the proposed approach is the set of amino acid strings spelled out by optimal paths in the connected components of a T-Bruijn graph. We illustrate its performance on top-down datasets acquired from carbonic anhydrase 2 (CAH2) and the Fab region of alemtuzumab.

  6. De novo meta-assembly of ultra-deep sequencing data

    PubMed Central

    Mirebrahim, Hamid; Close, Timothy J.; Lonardi, Stefano

    2015-01-01

    We introduce a new divide and conquer approach to deal with the problem of de novo genome assembly in the presence of ultra-deep sequencing data (i.e. coverage of 1000x or higher). Our proposed meta-assembler Slicembler partitions the input data into optimal-sized ‘slices’ and uses a standard assembly tool (e.g. Velvet, SPAdes, IDBA_UD and Ray) to assemble each slice individually. Slicembler uses majority voting among the individual assemblies to identify long contigs that can be merged into the consensus assembly. To improve its efficiency, Slicembler uses a generalized suffix tree to identify these frequent contigs (or fractions thereof). Extensive experimental results on real ultra-deep sequencing data (8000x coverage) and simulated data show that Slicembler significantly improves the quality of the assembly compared with the performance of the base assembler. In fact, most of the time, Slicembler generates error-free assemblies. We also show that Slicembler is much more robust to high sequencing error rates than the base assembler. Availability and implementation: Slicembler can be accessed at http://slicembler.cs.ucr.edu/. Contact: hamid.mirebrahim@email.ucr.edu PMID:26072514
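    The slice-and-vote idea in this abstract can be sketched in a few lines. This is a deliberately simplified illustration and not Slicembler itself: it assumes the per-slice assemblies are already available as lists of contig strings, it votes on exact contig matches rather than on shared substrings found with a generalized suffix tree, and the names partition and majority_contigs are hypothetical.

```python
from collections import Counter


def partition(reads, slice_size):
    """Split the read list into consecutive slices of at most `slice_size` reads."""
    return [reads[i:i + slice_size] for i in range(0, len(reads), slice_size)]


def majority_contigs(assemblies):
    """Keep contigs that appear in more than half of the per-slice assemblies."""
    counts = Counter(c for contigs in assemblies for c in set(contigs))
    threshold = len(assemblies) / 2
    return sorted(c for c, n in counts.items() if n > threshold)


# Hypothetical reads split into slices of three.
slices = partition(["r1", "r2", "r3", "r4", "r5", "r6", "r7"], slice_size=3)

# Hypothetical contigs, standing in for the output of a base assembler
# run once per slice.
assemblies = [["AAAC", "GGT"], ["AAAC", "GGA"], ["AAAC", "GGT"]]
consensus = majority_contigs(assemblies)
print(slices, consensus)
```

    The intuition is that a contig produced independently from most slices is unlikely to be an artifact of sequencing errors confined to one slice, whereas a contig seen only once is treated as suspect.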

  7. De novo sequencing and variant calling with nanopores using PoreSeq

    PubMed Central

    Szalay, Tamas; Golovchenko, Jene A.

    2016-01-01

    The single-molecule accuracy of nanopore sequencing has been an area of rapid academic and commercial advancement, but remains challenging for the de novo analysis of genomes. We introduce here a novel algorithm for the error correction of nanopore data, utilizing statistical models of the physical system in order to obtain high accuracy de novo sequences at a range of coverage depths. We demonstrate the technique by sequencing M13 bacteriophage DNA to 99% accuracy at moderate coverage as well as its use in an assembly pipeline by sequencing E. coli and λ DNA at a range of coverages. We also show the algorithm’s ability to accurately classify sequence variants at far lower coverage than existing methods. PMID:26352647

  8. De novo mutations revealed by whole exome sequencing are strongly associated with autism

    PubMed Central

    Sanders, Stephan J.; Murtha, Michael T.; Gupta, Abha R.; Murdoch, John D.; Raubeson, Melanie J.; Willsey, A. Jeremy; Ercan-Sencicek, A. Gulhan; DiLullo, Nicholas M.; Parikshak, Neelroop N.; Stein, Jason L.; Walker, Michael F.; Ober, Gordon T.; Teran, Nicole A.; Song, Youeun; El-Fishawy, Paul; Murtha, Ryan C.; Choi, Murim; Overton, John D.; Bjornson, Robert D.; Carriero, Nicholas J.; Meyer, Kyle A.; Bilguvar, Kaya; Mane, Shrikant M.; Šestan, Nenad; Lifton, Richard P.; Günel, Murat; Roeder, Kathryn; Geschwind, Daniel H.; Devlin, Bernie; State, Matthew W.

    2013-01-01

    Multiple studies have confirmed the contribution of rare de novo copy number variations (CNVs) to the risk for Autism Spectrum Disorders (ASD).1-3 While de novo single nucleotide variants (SNVs) have been identified in affected individuals,4 their contribution to risk has yet to be clarified. Specifically, the frequency and distribution of these mutations have not been well characterized in matched unaffected controls, data that are vital to the interpretation of de novo coding mutations observed in probands. Here we show, via whole-exome sequencing of 928 individuals, including 200 phenotypically discordant sibling pairs, that highly disruptive (nonsense and splice-site) de novo mutations in brain-expressed genes are associated with ASD and carry large effects (OR=5.65; CI: 1.44-22.2; p=0.01 asymptotic test). Based on mutation rates in unaffected individuals, we demonstrate that multiple independent de novo SNVs in the same gene among unrelated probands reliably identify risk alleles, providing a clear path forward for gene discovery. Among a total of 279 identified de novo coding mutations, there is a single instance in probands, and none in siblings, in which two independent nonsense variants disrupt the same gene, SCN2A (Sodium Channel, Voltage-Gated, Type II, Alpha Subunit), a result that is highly unlikely by chance (p=0.005). PMID:22495306

  9. REPdenovo: Inferring De Novo Repeat Motifs from Short Sequence Reads

    PubMed Central

    Chu, Chong; Nielsen, Rasmus; Wu, Yufeng

    2016-01-01

    Repeat elements are important components of eukaryotic genomes. One limitation in our understanding of repeat elements is that most analyses rely on reference genomes that are incomplete and often contain missing data in highly repetitive regions that are difficult to assemble. To overcome this problem we develop a new method, REPdenovo, which assembles repeat sequences directly from raw shotgun sequencing data. REPdenovo can construct various types of repeats that are highly repetitive and have low sequence divergence within copies. We show that REPdenovo is substantially better than existing methods both in terms of the number and the completeness of the repeat sequences that it recovers. The key advantage of REPdenovo is that it can reconstruct long repeats from sequence reads. We apply the method to human data and discover a number of potentially new repeats sequences that have been missed by previous repeat annotations. Many of these sequences are incorporated into various parasite genomes, possibly because the filtering process for host DNA involved in the sequencing of the parasite genomes failed to exclude the host derived repeat sequences. REPdenovo is a new powerful computational tool for annotating genomes and for addressing questions regarding the evolution of repeat families. The software tool, REPdenovo, is available for download at https://github.com/Reedwarbler/REPdenovo. PMID:26977803

  10. DIME: a novel framework for de novo metagenomic sequence assembly.

    PubMed

    Guo, Xuan; Yu, Ning; Ding, Xiaojun; Wang, Jianxin; Pan, Yi

    2015-02-01

    The recently developed next generation sequencing platforms not only decrease the cost for metagenomics data analysis, but also greatly enlarge the size of metagenomic sequence datasets. A common bottleneck of available assemblers is that the trade-off between the noise of the resulting contigs and the gain in sequence length for better annotation has not received enough attention in large-scale sequencing projects, especially for datasets with low coverage and a large number of nonoverlapping contigs. To address this limitation and promote both accuracy and efficiency, we develop a novel metagenomic sequence assembly framework, DIME, by taking the DIvide, conquer, and MErge strategies. In addition, we give two MapReduce implementations of DIME, DIME-cap3 and DIME-genovo, on the Apache Hadoop platform. For a systematic comparison of the performance of the assembly tasks, we tested DIME and five other popular short read assembly programs, Cap3, Genovo, MetaVelvet, SOAPdenovo, and SPAdes, on four synthetic and three real metagenomic sequence datasets ranging in size from fifty thousand to a couple of million reads. The experimental results demonstrate that our method not only partitions the sequence reads with an extremely high accuracy, but also reconstructs more bases, generates higher quality assembled consensus, and yields higher assembly scores, including corrected N50 and BLAST-score-per-base, than other tools with a nearly theoretical speed-up. Results indicate that DIME offers great improvement in assembly across a range of sequence abundances and thus is robust to decreasing coverage. PMID:25684202
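    Since this abstract reports assembly scores based on N50, a short sketch of the plain N50 statistic may help. Note the assumption: this is the standard, uncorrected N50 (the length of the contig at which the running total of lengths, taken from longest to shortest, first reaches half of the assembly size), not the "corrected N50" used by the authors, whose exact definition is not given here. The contig lengths are hypothetical.

```python
def n50(lengths):
    """Return the plain N50 of a list of contig lengths (0 for an empty list)."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:  # reached at least half of the total bases
            return length
    return 0


# Hypothetical contig lengths in base pairs (total 290).
contig_lengths = [80, 70, 50, 40, 30, 20]
print(n50(contig_lengths))
```

    Here the two longest contigs together cover 150 of 290 bases, so the statistic is the length of the second one. Because N50 looks only at lengths, a misassembly that wrongly joins two contigs raises it, which is why size-only metrics are usually paired with correctness measures.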

  11. DIME: A Novel Framework for De Novo Metagenomic Sequence Assembly

    PubMed Central

    Guo, Xuan; Yu, Ning; Ding, Xiaojun; Wang, Jianxin

    2015-01-01

    The recently developed next generation sequencing platforms not only decrease the cost for metagenomics data analysis, but also greatly enlarge the size of metagenomic sequence datasets. A common bottleneck of available assemblers is that the trade-off between the noise of the resulting contigs and the gain in sequence length for better annotation has not received enough attention in large-scale sequencing projects, especially for datasets with low coverage and a large number of nonoverlapping contigs. To address this limitation and promote both accuracy and efficiency, we develop a novel metagenomic sequence assembly framework, DIME, by taking the DIvide, conquer, and MErge strategies. In addition, we give two MapReduce implementations of DIME, DIME-cap3 and DIME-genovo, on the Apache Hadoop platform. For a systematic comparison of the performance of the assembly tasks, we tested DIME and five other popular short read assembly programs, Cap3, Genovo, MetaVelvet, SOAPdenovo, and SPAdes, on four synthetic and three real metagenomic sequence datasets ranging in size from fifty thousand to a couple of million reads. The experimental results demonstrate that our method not only partitions the sequence reads with an extremely high accuracy, but also reconstructs more bases, generates higher quality assembled consensus, and yields higher assembly scores, including corrected N50 and BLAST-score-per-base, than other tools with a nearly theoretical speed-up. Results indicate that DIME offers great improvement in assembly across a range of sequence abundances and thus is robust to decreasing coverage. PMID:25684202

  12. Optimization of de novo transcriptome assembly from next-generation sequencing data.

    PubMed

    Surget-Groba, Yann; Montoya-Burgos, Juan I

    2010-10-01

    Transcriptome analysis has important applications in many biological fields. However, assembling a transcriptome without a known reference remains a challenging task requiring algorithmic improvements. We present two methods for substantially improving transcriptome de novo assembly. The first method relies on the observation that the use of a single k-mer length by current de novo assemblers is suboptimal to assemble transcriptomes where the sequence coverage of transcripts is highly heterogeneous. We present the Multiple-k method in which various k-mer lengths are used for de novo transcriptome assembly. We demonstrate its good performance by assembling de novo a published next-generation transcriptome sequence data set of Aedes aegypti, using the existing genome to check the accuracy of our method. The second method relies on the use of a reference proteome to improve the de novo assembly. We developed the Scaffolding using Translation Mapping (STM) method that uses mapping against the closest available reference proteome for scaffolding contigs that map onto the same protein. In a controlled experiment using simulated data, we show that the STM method considerably improves the assembly, with few errors. We applied these two methods to assemble the transcriptome of the non-model catfish Loricaria gr. cataphracta. Using the Multiple-k and STM methods, the assembly increases in contiguity and in gene identification, showing that our methods clearly improve quality and can be widely used. The new methods were used to assemble successfully the transcripts of the core set of genes regulating tooth development in vertebrates, while classic de novo assembly failed. PMID:20693479
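    To make the role of the k-mer length concrete, the sketch below counts k-mers in the same reads for several values of k. It only illustrates the input side of a multiple-k strategy and is not the authors' Multiple-k method, which merges assemblies produced at different k; the reads and the names kmer_counts and multi_k_spectra are hypothetical.

```python
from collections import Counter


def kmer_counts(reads, k):
    """Count every k-mer occurring in the reads."""
    return Counter(
        read[i:i + k] for read in reads for i in range(len(read) - k + 1)
    )


def multi_k_spectra(reads, ks):
    """Build one k-mer spectrum per requested k."""
    return {k: kmer_counts(reads, k) for k in ks}


# A single hypothetical read containing a short internal repeat.
reads = ["ACGTACGT"]
spectra = multi_k_spectra(reads, ks=[4, 6])
print(spectra[4]["ACGT"], len(spectra[6]))
```

    In this toy case the 4-mer ACGT occurs twice, so a graph built at k=4 sees a repeat, while at k=6 every k-mer is unique. The general trade-off is that shorter k-mers overlap more easily, which helps weakly covered transcripts, while longer k-mers separate repeats better in highly covered ones; this is one reason a single k is a compromise when coverage is heterogeneous.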

  13. De Novo Sequencing of Complex Mixtures of Heparan Sulfate Oligosaccharides.

    PubMed

    Huang, Rongrong; Zong, Chengli; Venot, Andre; Chiu, Yulun; Zhou, Dandan; Boons, Geert-Jan; Sharp, Joshua S

    2016-05-17

    Here, we describe the first sequencing method of a complex mixture of heparan sulfate tetrasaccharides by LC-MS/MS. Heparin and heparan sulfate (HS) are linear polysaccharides that are modified in a complex manner by N- and O-sulfation, N-acetylation, and epimerization of the uronic acid. Heparin and HS are involved in various essential cellular communication processes. The structural analysis of these glycosaminoglycans is challenging due to the lability of their sulfate groups, the high heterogeneity of modifications, and the epimerization of the uronic acids. While advances in liquid chromatography (LC) and mass spectrometry (MS) have enabled compositional profiling of HS oligosaccharide mixtures, online separation and detailed structural analysis of isomeric and epimeric HS mixtures has not been achieved. Here, we report the development and evaluation of a chemical derivatization and tandem mass spectrometry method that can separate and identify isomeric and epimeric structures from complex mixtures. A series of well-defined synthetic HS tetrasaccharides varying in sulfation patterns and uronic acid epimerization were analyzed by chemical derivatization and LC-MS/MS. These synthetic compounds made it possible to establish relationships between HS structure, chromatographic behavior and MS/MS fragmentation characteristics. Using the analytical characteristics determined through the analysis of the synthetic HS tetrasaccharide standards, an HS tetrasaccharide mixture derived from natural sources was successfully sequenced. This method represents the first sequencing of complex mixtures of HS oligosaccharides, an essential milestone in the analysis of structure-function relationships of these carbohydrates. PMID:27087275

  14. Terminal sequence importance of de novo proteins from binary-patterned library: stable artificial proteins with 11- or 12-amino acid alphabet.

    PubMed

    Okura, Hiromichi; Takahashi, Tsuyoshi; Mihara, Hisakazu

    2012-06-01

    Successful approaches to de novo protein design suggest a great potential to create novel structural folds and to understand natural rules of protein folding. For these purposes, smaller and simpler de novo proteins have been developed. Here, we constructed smaller proteins by removing the terminal sequences from stable de novo vTAJ proteins and compared stabilities between mutant and original proteins. vTAJ proteins were screened from an α3β3 binary-patterned library which was designed with polar/nonpolar periodicities of α-helix and β-sheet. vTAJ proteins have additional terminal sequences due to the method of constructing the genetically repeated library sequences. By removing parts of these sequences, we successfully obtained stable, smaller de novo protein mutants with fewer amino acid alphabets than the originals. However, these mutants showed differences in ANS binding properties and in stabilities against denaturant and pH change. The terminal sequences, which were designed just as flexible linkers and not as secondary structure units, substantially affected these physicochemical details. This study has implications for adjusting protein stabilities by designing N- and C-terminal sequences. PMID:22519540

  15. De novo assembly of the carrot mitochondrial genome using next generation sequencing of whole genomic DNA provides first evidence of DNA transfer into an angiosperm plastid genome

    PubMed Central

    2012-01-01

    Background: Sequence analysis of organelle genomes has revealed important aspects of plant cell evolution. The scope of this study was to develop an approach for de novo assembly of the carrot mitochondrial genome using next generation sequence data from total genomic DNA. Results: Sequencing data from a carrot 454 whole genome library were used to develop a de novo assembly of the mitochondrial genome. Development of a new bioinformatic tool allowed visualizing contig connections and elucidation of the de novo assembly. Southern hybridization demonstrated recombination across two large repeats. Genome annotation allowed identification of 44 protein coding genes, three rRNA and 17 tRNA. Identification of the plastid genome sequence allowed organelle genome comparison. Mitochondrial intergenic sequence analysis allowed detection of a fragment of DNA specific to the carrot plastid genome. PCR amplification and sequence analysis across different Apiaceae species revealed consistent conservation of this fragment in the mitochondrial genomes and an insertion in Daucus plastid genomes, giving evidence of a mitochondrial to plastid transfer of DNA. Sequence similarity with a retrotransposon element suggests a possibility that a transposon-like event transferred this sequence into the plastid genome. Conclusions: This study confirmed that whole genome sequencing is a practical approach for de novo assembly of higher plant mitochondrial genomes. In addition, a new aspect of intercompartmental genome interaction was reported providing the first evidence for DNA transfer into an angiosperm plastid genome. The approach used here could be used more broadly to sequence and assemble mitochondrial genomes of diverse species. This information will allow us to better understand intercompartmental interactions and cell evolution. PMID:22548759

  16. Proteomics of Soil and Sediment: Protein Identification by De Novo Sequencing of Mass Spectra Complements Traditional Database Searching

    NASA Astrophysics Data System (ADS)

    Miller, S.; Rizzo, A. I.; Waldbauer, J.

    2014-12-01

    Proteomics has the potential to elucidate the metabolic pathways and taxa responsible for in situ biogeochemical transformations. However, low rates of protein identification from high resolution mass spectra have been a barrier to the development of proteomics in complex environmental samples. Much of the difficulty lies in the computational challenge of linking mass spectra to their corresponding proteins. Traditional database search methods for matching peptide sequences to mass spectra are often inadequate due to the complexity of environmental proteomes and the large database search space, as we demonstrate with soil and sediment proteomes generated via a range of extraction methods. One alternative to traditional database searching is de novo sequencing, which identifies peptide sequences without the need for a database. BLAST can then be used to match de novo sequences to similar genetic sequences. Assigning confidence to putative identifications has been one hurdle for the implementation of de novo sequencing. We found that accurate de novo sequences can be screened by quality score and length. Screening criteria are verified by comparing the results of de novo sequencing and traditional database searching for well-characterized proteomes from simple biological systems. The BLAST hits of screened sequences are interrogated for taxonomic and functional information. We applied de novo sequencing to organic topsoil and marine sediment proteomes. Peak-rich proteomes, which can result from various extraction techniques, yield thousands of high-confidence protein identifications, an improvement over previous proteomic studies of soil and sediment. User-friendly software tools for de novo metaproteomics analysis have been developed. This "De Novo Analysis" Pipeline is also a faster method of data analysis than constructing a tailored sequence database for traditional database searching.
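    The screening step described in this abstract, keeping only de novo sequences that pass quality-score and length criteria before the BLAST search, can be sketched as a simple filter. The abstract does not state the thresholds or the scoring scale, so the cut-offs below (score of at least 80 on a 0-100 scale, at least 8 residues), the DeNovoPeptide record and the example peptides are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class DeNovoPeptide:
    sequence: str   # amino acid string reported by the de novo tool
    score: float    # hypothetical confidence score on a 0-100 scale


def screen(peptides, min_score=80.0, min_length=8):
    """Keep peptides that meet both the score and the length threshold."""
    return [
        p for p in peptides
        if p.score >= min_score and len(p.sequence) >= min_length
    ]


# Hypothetical de novo results.
candidates = [
    DeNovoPeptide("LVNELTEFAK", 92.0),  # long and high scoring: kept
    DeNovoPeptide("PEPT", 95.0),        # high scoring but too short
    DeNovoPeptide("AGLQFPVGR", 41.0),   # long enough but low scoring
]
kept = screen(candidates)
print([p.sequence for p in kept])
```

    Requiring both criteria reflects the reasoning in the abstract: a short sequence can match many database entries by chance, and a low-scoring one may simply be wrong, so either weakness makes the downstream BLAST hit hard to trust.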

  17. Proteomics of Soil and Sediment: Protein Identification by De Novo Sequencing of Mass Spectra Complements Traditional Database Searching

    NASA Astrophysics Data System (ADS)

    Miller, S.; Rizzo, A. I.; Waldbauer, J.

    2015-12-01

    Proteomics has the potential to elucidate the metabolic pathways and taxa responsible for in situ biogeochemical transformations. However, low rates of protein identification from high resolution mass spectra have been a barrier to the development of proteomics in complex environmental samples. Much of the difficulty lies in the computational challenge of linking mass spectra to their corresponding proteins. Traditional database search methods for matching peptide sequences to mass spectra are often inadequate due to the complexity of environmental proteomes and the large database search space, as we demonstrate with soil and sediment proteomes generated via a range of extraction methods. One alternative to traditional database searching is de novo sequencing, which identifies peptide sequences without the need for a database. BLAST can then be used to match de novo sequences to similar genetic sequences. Assigning confidence to putative identifications has been one hurdle for the implementation of de novo sequencing. We found that accurate de novo sequences can be screened by quality score and length. Screening criteria are verified by comparing the results of de novo sequencing and traditional database searching for well-characterized proteomes from simple biological systems. The BLAST hits of screened sequences are interrogated for taxonomic and functional information. We applied de novo sequencing to organic topsoil and marine sediment proteomes. Peak-rich proteomes, which can result from various extraction techniques, yield thousands of high-confidence protein identifications, an improvement over previous proteomic studies of soil and sediment. User-friendly software tools for de novo metaproteomics analysis have been developed. This "De Novo Analysis" Pipeline is also a faster method of data analysis than constructing a tailored sequence database for traditional database searching.

  18. Annotation and re-sequencing of genes from de novo transcriptome assembly of Abies alba (Pinaceae)

    PubMed Central

    Roschanski, Anna M.; Fady, Bruno; Ziegenhagen, Birgit; Liepelt, Sascha

    2013-01-01

    • Premise of the study: We present a protocol for the annotation of transcriptome sequence data and the identification of candidate genes therein using the example of the nonmodel conifer Abies alba. • Methods and Results: A normalized cDNA library was built from an A. alba seedling. The sequencing on a 454 platform yielded more than 1.5 million reads that were de novo assembled into 25149 contigs. Two complementary approaches were applied to annotate gene fragments that code for (1) well-known proteins and (2) proteins that are potentially adaptively relevant. Primer development and testing yielded 88 amplicons that could successfully be resequenced from genomic DNA. • Conclusions: The annotation workflow offers an efficient way to identify potential adaptively relevant genes from the large quantity of transcriptome sequence data. The primer set presented should be prioritized for single-nucleotide polymorphism detection in adaptively relevant genes in A. alba. PMID:25202477

  19. NxRepair: error correction in de novo sequence assembly using Nextera mate pairs.

    PubMed

    Murphy, Rebecca R; O'Connell, Jared; Cox, Anthony J; Schulz-Trieglaff, Ole

    2015-01-01

    Scaffolding errors and incorrect repeat disambiguation during de novo assembly can result in large scale misassemblies in draft genomes. Nextera mate pair sequencing data provide additional information to resolve assembly ambiguities during scaffolding. Here, we introduce NxRepair, an open source toolkit for error correction in de novo assemblies that uses Nextera mate pair libraries to identify and correct large-scale errors. We show that NxRepair can identify and correct large scaffolding errors, without use of a reference sequence, resulting in quantitative improvements in the assembly quality. NxRepair can be downloaded from GitHub or PyPI, the Python Package Index; a tutorial and user documentation are also available. PMID:26056623

  20. Feature-by-Feature – Evaluating De Novo Sequence Assembly

    PubMed Central

    Vezzi, Francesco; Narzisi, Giuseppe; Mishra, Bud

    2012-01-01

    The whole-genome sequence assembly (WGSA) problem is one of the most studied problems in computational biology. Despite the availability of a plethora of tools (i.e., assemblers), all claiming to have solved the WGSA problem, little has been done to systematically compare their accuracy and power. Traditional methods rely on standard metrics and read simulation: on the one hand, metrics like N50 and number of contigs focus only on size without proportionately emphasizing the correctness of the assembly; on the other hand, comparisons performed on simulated datasets can be highly biased by the non-realistic assumptions in the underlying read generator. Recently the Feature Response Curve (FRC) method was proposed to assess the overall assembly quality and correctness: FRC transparently captures the trade-offs between contigs' quality and their sizes. Nevertheless, the relationship among the different features and their relative importance remains unknown. In particular, FRC cannot account for the correlation among the different features. We analyzed the correlation among different features in order to better describe their relationships and their importance in gauging assembly quality and correctness. In particular, using multivariate techniques like principal and independent component analysis we were able to estimate the “excess-dimensionality” of the feature space. Moreover, principal component analysis allowed us to show how poorly the acclaimed N50 metric describes the assembly quality. Applying independent component analysis we identified a subset of features that better describe the assemblers' performance. We demonstrated that by focusing on a reduced set of highly informative features we can use the FRC curve to better describe and compare the performance of different assemblers. Moreover, as a by-product of our analysis, we discovered how often evaluation based on simulated data, obtained with state of the art

  1. De novo assembly and characterization of the Trichuris trichiura adult worm transcriptome using Ion Torrent sequencing.

    PubMed

    Santos, Leonardo N; Silva, Eduardo S; Santos, André S; De Sá, Pablo H; Ramos, Rommel T; Silva, Artur; Cooper, Philip J; Barreto, Maurício L; Loureiro, Sebastião; Pinheiro, Carina S; Alcantara-Neves, Neuza M; Pacheco, Luis G C

    2016-07-01

    Infection with helminthic parasites, including the soil-transmitted helminth Trichuris trichiura (human whipworm), has been shown to modulate host immune responses and, consequently, to have an impact on the development and manifestation of chronic human inflammatory diseases. De novo derivation of helminth proteomes from sequencing of transcriptomes will provide valuable data to aid identification of parasite proteins that could be evaluated as potential immunotherapeutic molecules in the near future. Herein, we characterized the transcriptome of the adult stage of the human whipworm T. trichiura, using next-generation sequencing technology and a de novo assembly strategy. Nearly 17.6 million high-quality clean reads were assembled into 6414 contiguous sequences, with an N50 of 1606 bp. In total, 5673 protein-encoding sequences were confidently identified in the T. trichiura adult worm transcriptome; of these, 1013 sequences represent potential newly discovered proteins for the species, most of which have orthologs already annotated in the related species T. suis. A number of transcripts representing probable novel non-coding transcripts for the species T. trichiura were also identified. Among the most abundant transcripts, we found sequences that code for proteins involved in lipid transport, such as vitellogenins, and several chitin-binding proteins. Through a cross-species expression analysis of gene orthologs shared by T. trichiura and the closely related parasites T. suis and T. muris it was possible to find twenty-six protein-encoding genes that are consistently highly expressed in the adult stages of the three helminth species. Additionally, twenty transcripts could be identified that code for proteins previously detected by mass spectrometry analysis of protein fractions of the whipworm somatic extract that present immunomodulatory activities. Five of these transcripts were amongst the most highly expressed protein-encoding sequences in the T

  2. De novo assembly and characterization of the carrot mitochondrial genome using next generation sequencing data from whole genomic DNA provides first evidence of DNA transfer into an angiosperm plastid genome

    Technology Transfer Automated Retrieval System (TEKTRAN)

    Sequence analysis of organelle genomes has revealed important aspects of plant cell evolution. The scope of this study was to develop an approach for de novo assembly of the carrot mitochondrial genome using next generation sequence data from total genomic DNA. Sequencing data from a carrot 454 whol...

  3. De Novo Centromere Formation and Centromeric Sequence Expansion in Wheat and its Wide Hybrids.

    PubMed

    Guo, Xiang; Su, Handong; Shi, Qinghua; Fu, Shulan; Wang, Jing; Zhang, Xiangqi; Hu, Zanmin; Han, Fangpu

    2016-04-01

    Centromeres typically contain tandem repeat sequences, but centromere function does not necessarily depend on these sequences. We identified functional centromeres with significant quantitative changes in the centromeric retrotransposons of wheat (CRW) contents in wheat aneuploids (Triticum aestivum) and the offspring of wheat wide hybrids. The CRW signals were strongly reduced or essentially lost in some wheat ditelosomic lines and in the addition lines from the wide hybrids. The total loss of the CRW sequences but the presence of CENH3 in these lines suggests that the centromeres were formed de novo. In wheat and its wide hybrids, which carry large complex genomes or no sequenced genome, we performed CENH3-ChIP-dot-blot methods alone or in combination with CENH3-ChIP-seq and identified the ectopic genomic sequences present at the new centromeres. In addition, the transcription of the identified DNA sequences was remarkably increased at the new centromere, suggesting that the transcription of the corresponding sequences may be associated with de novo centromere formation. Stable alien chromosomes with two and three regions containing CRW sequences induced by centromere breakage were observed in the wheat-Th. elongatum hybrid derivatives, but only one was a functional centromere. In wheat-rye (Secale cereale) hybrids, the rye centromere-specific sequences spread along the chromosome arms and may have caused centromere expansion. Frequent and significant quantitative alterations in the centromere sequence via chromosomal rearrangement have been systematically described in wheat wide hybridizations, which may affect the retention or loss of the alien chromosomes in the hybrids. Thus, the centromere behavior in wide crosses likely has an important impact on the generation of biodiversity, which ultimately has implications for speciation. PMID:27110907

  4. De Novo Centromere Formation and Centromeric Sequence Expansion in Wheat and its Wide Hybrids

    PubMed Central

    Fu, Shulan; Wang, Jing; Zhang, Xiangqi; Hu, Zanmin; Han, Fangpu

    2016-01-01

    Centromeres typically contain tandem repeat sequences, but centromere function does not necessarily depend on these sequences. We identified functional centromeres with significant quantitative changes in the centromeric retrotransposons of wheat (CRW) contents in wheat aneuploids (Triticum aestivum) and the offspring of wheat wide hybrids. The CRW signals were strongly reduced or essentially lost in some wheat ditelosomic lines and in the addition lines from the wide hybrids. The total loss of the CRW sequences but the presence of CENH3 in these lines suggests that the centromeres were formed de novo. In wheat and its wide hybrids, which carry large complex genomes or no sequenced genome, we performed CENH3-ChIP-dot-blot methods alone or in combination with CENH3-ChIP-seq and identified the ectopic genomic sequences present at the new centromeres. In addition, the transcription of the identified DNA sequences was remarkably increased at the new centromere, suggesting that the transcription of the corresponding sequences may be associated with de novo centromere formation. Stable alien chromosomes with two and three regions containing CRW sequences induced by centromere breakage were observed in the wheat-Th. elongatum hybrid derivatives, but only one was a functional centromere. In wheat-rye (Secale cereale) hybrids, the rye centromere-specific sequences spread along the chromosome arms and may have caused centromere expansion. Frequent and significant quantitative alterations in the centromere sequence via chromosomal rearrangement have been systematically described in wheat wide hybridizations, which may affect the retention or loss of the alien chromosomes in the hybrids. Thus, the centromere behavior in wide crosses likely has an important impact on the generation of biodiversity, which ultimately has implications for speciation. PMID:27110907

  5. Long-read sequencing and de novo assembly of a Chinese genome

    PubMed Central

    Shi, Lingling; Guo, Yunfei; Dong, Chengliang; Huddleston, John; Yang, Hui; Han, Xiaolu; Fu, Aisi; Li, Quan; Li, Na; Gong, Siyi; Lintner, Katherine E.; Ding, Qiong; Wang, Zou; Hu, Jiang; Wang, Depeng; Wang, Feng; Wang, Lin; Lyon, Gholson J.; Guan, Yongtao; Shen, Yufeng; Evgrafov, Oleg V.; Knowles, James A.; Thibaud-Nissen, Francoise; Schneider, Valerie; Yu, Chack-Yung; Zhou, Libing; Eichler, Evan E.; So, Kwok-Fai; Wang, Kai

    2016-01-01

    Short-read sequencing has enabled the de novo assembly of several individual human genomes, but with inherent limitations in characterizing repeat elements. Here we sequence a Chinese individual HX1 by single-molecule real-time (SMRT) long-read sequencing, construct a physical map by NanoChannel arrays and generate a de novo assembly of 2.93 Gb (contig N50: 8.3 Mb, scaffold N50: 22.0 Mb, including 39.3 Mb N-bases), together with 206 Mb of alternative haplotypes. The assembly fully or partially fills 274 (28.4%) N-gaps in the reference genome GRCh38. Comparison to GRCh38 reveals 12.8 Mb of HX1-specific sequences, including 4.1 Mb that are not present in previously reported Asian genomes. Furthermore, long-read sequencing of the transcriptome reveals novel spliced genes that are not annotated in GENCODE and are missed by short-read RNA-Seq. Our results imply that improved characterization of genome functional variation may require the use of a range of genomic technologies on diverse human populations. PMID:27356984
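
    The contig and scaffold N50 values quoted above are length-weighted medians: the N50 is the length of the shortest sequence in the smallest set of longest sequences that together cover at least half of the assembly. A minimal sketch of that calculation follows. The function name `n50` and the toy contig lengths are illustrative assumptions and are not taken from the HX1 assembly.

```python
def n50(lengths):
    """Return the N50 of a collection of contig (or scaffold) lengths.

    Contigs are visited from longest to shortest; the N50 is the length of
    the contig at which the running total first reaches half of the sum.
    """
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length
    raise ValueError("no contig lengths given")


# Hypothetical toy assembly: total length 300, so half is 150.
toy_contigs = [80, 70, 50, 40, 30, 20, 10]
print(n50(toy_contigs))  # 80 + 70 reaches 150, so this prints 70
```

    Because the statistic depends only on the length distribution, a high N50 says nothing by itself about assembly correctness.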

  6. Long-read sequencing and de novo assembly of a Chinese genome.

    PubMed

    Shi, Lingling; Guo, Yunfei; Dong, Chengliang; Huddleston, John; Yang, Hui; Han, Xiaolu; Fu, Aisi; Li, Quan; Li, Na; Gong, Siyi; Lintner, Katherine E; Ding, Qiong; Wang, Zou; Hu, Jiang; Wang, Depeng; Wang, Feng; Wang, Lin; Lyon, Gholson J; Guan, Yongtao; Shen, Yufeng; Evgrafov, Oleg V; Knowles, James A; Thibaud-Nissen, Francoise; Schneider, Valerie; Yu, Chack-Yung; Zhou, Libing; Eichler, Evan E; So, Kwok-Fai; Wang, Kai

    2016-01-01

    Short-read sequencing has enabled the de novo assembly of several individual human genomes, but with inherent limitations in characterizing repeat elements. Here we sequence a Chinese individual HX1 by single-molecule real-time (SMRT) long-read sequencing, construct a physical map by NanoChannel arrays and generate a de novo assembly of 2.93 Gb (contig N50: 8.3 Mb, scaffold N50: 22.0 Mb, including 39.3 Mb N-bases), together with 206 Mb of alternative haplotypes. The assembly fully or partially fills 274 (28.4%) N-gaps in the reference genome GRCh38. Comparison to GRCh38 reveals 12.8 Mb of HX1-specific sequences, including 4.1 Mb that are not present in previously reported Asian genomes. Furthermore, long-read sequencing of the transcriptome reveals novel spliced genes that are not annotated in GENCODE and are missed by short-read RNA-Seq. Our results imply that improved characterization of genome functional variation may require the use of a range of genomic technologies on diverse human populations. PMID:27356984

  7. Anchored pseudo-de novo assembly of human genomes identifies extensive sequence variation from unmapped sequence reads.

    PubMed

    Faber-Hammond, Joshua J; Brown, Kim H

    2016-07-01

    The completion of the human genome reference (HGR) marked the beginning of the genomics era, yet despite its utility, its universal application is limited by the small number of individuals used in its development. This is highlighted by the presence of high-quality sequence reads failing to map within the HGR. Sequences failing to map generally represent 2-5% of total reads, which may harbor regions that would enhance our understanding of population variation, evolution, and disease. Alternatively, complete de novo assemblies can be created, but these effectively ignore the groundwork of the HGR. In an effort to find a middle ground, we developed a bioinformatic pipeline that maps paired-end reads to the HGR as separate single reads, exports unmappable reads, de novo assembles these reads per individual and then combines assemblies into a secondary reference assembly used for comparative analysis. Using 45 diverse 1000 Genomes Project individuals, we identified 351,361 contigs covering 195.5 Mb of sequence unincorporated in GRCh38. 30,879 contigs are represented in multiple individuals with ~40% showing high sequence complexity. Genomic coordinates were generated for 99.9%, with 52.5% exhibiting high-quality mapping scores. Comparative genomic analyses with archaic humans and primates revealed significant sequence alignments and comparisons with model organism RefSeq gene datasets identified novel human genes. If incorporated, these sequences will expand the HGR, but more importantly our data highlight that with this method low coverage (~10-20×) next-generation sequencing can still be used to identify novel unmapped sequences to explore biological functions contributing to human phenotypic variation, disease and functionality for personal genomic medicine. PMID:27061184
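
    The "export unmappable reads" step of such a pipeline can be illustrated with a small sketch, assuming the alignments are available as SAM text. In the SAM format, bit 0x4 of the FLAG field marks an unmapped read. The function name `unmapped_reads` and the toy records are hypothetical; this is not the authors' pipeline, which the abstract does not describe at this level of detail.

```python
def unmapped_reads(sam_lines):
    """Yield (read_name, sequence) for SAM records flagged as unmapped.

    SAM columns are tab-separated: QNAME, FLAG, RNAME, POS, MAPQ, CIGAR,
    RNEXT, PNEXT, TLEN, SEQ, QUAL. Bit 0x4 of FLAG means "segment unmapped".
    """
    for line in sam_lines:
        if line.startswith("@"):  # header lines carry no alignments
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 11:  # skip malformed records
            continue
        if int(fields[1]) & 0x4:
            yield fields[0], fields[9]


# Hypothetical toy input: one mapped read (FLAG 0) and one unmapped (FLAG 4).
toy_sam = [
    "@HD\tVN:1.6",
    "read1\t0\tchr1\t100\t60\t6M\t*\t0\t0\tACGTAC\tIIIIII",
    "read2\t4\t*\t0\t0\t*\t*\t0\t0\tTTGACA\tIIIIII",
]
print(list(unmapped_reads(toy_sam)))  # [('read2', 'TTGACA')]
```

    The reads collected this way would then be passed to a de novo assembler of the user's choice.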

  8. De Novo Sequencing and Transcriptome Analysis of the Central Nervous System of Mollusc Lymnaea stagnalis by Deep RNA Sequencing

    PubMed Central

    Sadamoto, Hisayo; Takahashi, Hironobu; Okada, Taketo; Kenmoku, Hiromichi; Toyota, Masao; Asakawa, Yoshinori

    2012-01-01

    The pond snail Lymnaea stagnalis is among several mollusc species that have been well investigated due to the simplicity of their nervous systems and large identifiable neurons. Nonetheless, despite the continued attention given to the physiological characteristics of its nervous system, the genetic information of the Lymnaea central nervous system (CNS) has not yet been fully explored. The absence of genetic information is a large disadvantage for transcriptome sequencing because it makes transcriptome assembly difficult. We here performed transcriptome sequencing for Lymnaea CNS using an Illumina Genome Analyzer IIx platform and obtained 81.9 M of 100 base pair (bp) single end reads. For de novo assembly, five programs were used: ABySS, Velvet, OASES, Trinity and Rnnotator. Based on a comparison of the assemblies, we chose the Rnnotator dataset for the following blast searches and gene ontology analyses. The present dataset, 116,355 contigs of Lymnaea transcriptome shotgun assembly (TSA), contained longer sequences and was much larger compared to the previously reported Lymnaea expression sequence tag (EST) established by classical Sanger sequencing. The TSA sequences were subjected to blast analyses against several protein databases and Aplysia EST data. The results demonstrated that about 20,000 sequences had significant similarity to the reported sequences using a cutoff value of 1e-6, and showed the lack of molluscan sequences in the public databases. The richness of the present TSA data allowed us to identify a large number of new transcripts in Lymnaea and molluscan species. PMID:22870333

  9. De novo bacterial genome sequencing: Millions of very short reads assembled on a desktop computer

    PubMed Central

    Hernandez, David; François, Patrice; Farinelli, Laurent; Østerås, Magne; Schrenzel, Jacques

    2008-01-01

    Novel high-throughput DNA sequencing technologies allow researchers to characterize a bacterial genome during a single experiment and at a moderate cost. However, the increase in sequencing throughput that is allowed by using such platforms is obtained at the expense of individual sequence read length, which must be assembled into longer contigs to be exploitable. This study focuses on the Illumina sequencing platform that produces millions of very short sequences that are 35 bases in length. We propose a de novo assembler software that is dedicated to process such data. Based on a classical overlap graph representation and on the detection of potentially spurious reads, our software generates a set of accurate contigs of several kilobases that cover most of the bacterial genome. The assembly results were validated by comparing data sets that were obtained experimentally for Staphylococcus aureus strain MW2 and Helicobacter acinonychis strain Sheeba with that of their published genomes acquired by conventional sequencing of 1.5- to 3.0-kb fragments. We also provide indications that the broad coverage achieved by high-throughput sequencing might allow for the detection of clonal polymorphisms in the set of DNA molecules being sequenced. PMID:18332092
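
    The overlap graph representation mentioned above can be illustrated with a toy sketch: each read is a node, and a directed edge links read a to read b when a suffix of a equals a prefix of b. The function names, the minimum overlap of 3 and the example reads are assumptions chosen for illustration. The sketch does not reproduce the authors' assembler and omits its detection of spurious reads.

```python
from itertools import permutations


def overlap(a, b, min_len):
    """Length of the longest suffix of a that is a prefix of b, or 0.

    Only overlaps of at least min_len bases are reported.
    """
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def overlap_graph(reads, min_len):
    """Map each ordered pair of distinct reads to its overlap length."""
    edges = {}
    for a, b in permutations(reads, 2):
        k = overlap(a, b, min_len)
        if k:
            edges[(a, b)] = k
    return edges


# Hypothetical toy reads that tile the sequence ATGCGTACGG.
reads = ["ATGCGT", "GCGTAC", "GTACGG"]
for (a, b), k in overlap_graph(reads, min_len=3).items():
    print(a, "->", b, "overlap", k)
```

    This all-against-all comparison is quadratic in the number of reads; practical assemblers rely on indexing to avoid comparing every pair.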

  10. Whole exome sequencing reveals de novo pathogenic variants in KAT6A as a cause of a neurodevelopmental disorder.

    PubMed

    Millan, Francisca; Cho, Megan T; Retterer, Kyle; Monaghan, Kristin G; Bai, Renkui; Vitazka, Patrik; Everman, David B; Smith, Brooke; Angle, Brad; Roberts, Victoria; Immken, LaDonna; Nagakura, Honey; DiFazio, Marc; Sherr, Elliott; Haverfield, Eden; Friedman, Bethany; Telegrafi, Aida; Juusola, Jane; Chung, Wendy K; Bale, Sherri

    2016-07-01

    Neurodevelopmental disorders (NDD) are common, with 1-3% of the general population being affected, but the etiology is unknown in most individuals. Clinical whole-exome sequencing (WES) has proven to be a powerful tool for the identification of pathogenic variants leading to Mendelian disorders, among which NDD represent a significant percentage. Performing WES with a trio approach has proven to be extremely effective in identifying de novo pathogenic variants as a common cause of NDD. Here we report six unrelated individuals with a common phenotype consisting of NDD with severe speech delay, hypotonia, and facial dysmorphism. These patients underwent WES with a trio approach and de novo heterozygous predicted pathogenic novel variants in the KAT6A gene were identified. The KAT6A gene encodes a histone acetyltransferase protein and it has long been known for its structural involvement in acute myeloid leukemia; however, it has not previously been associated with any congenital disorder. In animal models the KAT6A ortholog is involved in transcriptional regulation during development. Given the similar findings in animal models and our patients' phenotypes, we hypothesize that KAT6A could play a role in development of the brain, face, and heart in humans. © 2016 Wiley Periodicals, Inc. PMID:27133397

  11. Stacks: Building and Genotyping Loci De Novo From Short-Read Sequences

    PubMed Central

    Catchen, Julian M.; Amores, Angel; Hohenlohe, Paul; Cresko, William; Postlethwait, John H.

    2011-01-01

    Advances in sequencing technology provide special opportunities for genotyping individuals with speed and thrift, but the lack of software to automate the calling of tens of thousands of genotypes over hundreds of individuals has hindered progress. Stacks is a software system that uses short-read sequence data to identify and genotype loci in a set of individuals either de novo or by comparison to a reference genome. From reduced representation Illumina sequence data, such as RAD-tags, Stacks can recover thousands of single nucleotide polymorphism (SNP) markers useful for the genetic analysis of crosses or populations. Stacks can generate markers for ultra-dense genetic linkage maps, facilitate the examination of population phylogeography, and help in reference genome assembly. We report here the algorithms implemented in Stacks and demonstrate their efficacy by constructing loci from simulated RAD-tags taken from the stickleback reference genome and by recapitulating and improving a genetic map of the zebrafish, Danio rerio. PMID:22384329

  12. The sequence and de novo assembly of the giant panda genome

    PubMed Central

    Li, Ruiqiang; Fan, Wei; Tian, Geng; Zhu, Hongmei; He, Lin; Cai, Jing; Huang, Quanfei; Cai, Qingle; Li, Bo; Bai, Yinqi; Zhang, Zhihe; Zhang, Yaping; Wang, Wen; Li, Jun; Wei, Fuwen; Li, Heng; Jian, Min; Li, Jianwen; Zhang, Zhaolei; Nielsen, Rasmus; Li, Dawei; Gu, Wanjun; Yang, Zhentao; Xuan, Zhaoling; Ryder, Oliver A.; Leung, Frederick Chi-Ching; Zhou, Yan; Cao, Jianjun; Sun, Xiao; Fu, Yonggui; Fang, Xiaodong; Guo, Xiaosen; Wang, Bo; Hou, Rong; Shen, Fujun; Mu, Bo; Ni, Peixiang; Lin, Runmao; Qian, Wubin; Wang, Guodong; Yu, Chang; Nie, Wenhui; Wang, Jinhuan; Wu, Zhigang; Liang, Huiqing; Min, Jiumeng; Wu, Qi; Cheng, Shifeng; Ruan, Jue; Wang, Mingwei; Shi, Zhongbin; Wen, Ming; Liu, Binghang; Ren, Xiaoli; Zheng, Huisong; Dong, Dong; Cook, Kathleen; Shan, Gao; Zhang, Hao; Kosiol, Carolin; Xie, Xueying; Lu, Zuhong; Zheng, Hancheng; Li, Yingrui; Steiner, Cynthia C.; Lam, Tommy Tsan-Yuk; Lin, Siyuan; Zhang, Qinghui; Li, Guoqing; Tian, Jing; Gong, Timing; Liu, Hongde; Zhang, Dejin; Fang, Lin; Ye, Chen; Zhang, Juanbin; Hu, Wenbo; Xu, Anlong; Ren, Yuanyuan; Zhang, Guojie; Bruford, Michael W.; Li, Qibin; Ma, Lijia; Guo, Yiran; An, Na; Hu, Yujie; Zheng, Yang; Shi, Yongyong; Li, Zhiqiang; Liu, Qing; Chen, Yanling; Zhao, Jing; Qu, Ning; Zhao, Shancen; Tian, Feng; Wang, Xiaoling; Wang, Haiyin; Xu, Lizhi; Liu, Xiao; Vinar, Tomas; Wang, Yajun; Lam, Tak-Wah; Yiu, Siu-Ming; Liu, Shiping; Zhang, Hemin; Li, Desheng; Huang, Yan; Wang, Xia; Yang, Guohua; Jiang, Zhi; Wang, Junyi; Qin, Nan; Li, Li; Li, Jingxiang; Bolund, Lars; Kristiansen, Karsten; Wong, Gane Ka-Shu; Olson, Maynard; Zhang, Xiuqing; Li, Songgang; Yang, Huanming; Wang, Jian; Wang, Jun

    2013-01-01

    Using next-generation sequencing technology alone, we have successfully generated and assembled a draft sequence of the giant panda genome. The assembled contigs (2.25 gigabases (Gb)) cover approximately 94% of the whole genome, and the remaining gaps (0.05 Gb) seem to contain carnivore-specific repeats and tandem repeats. Comparisons with the dog and human showed that the panda genome has a lower divergence rate. The assessment of panda genes potentially underlying some of its unique traits indicated that its bamboo diet might be more dependent on its gut microbiome than its own genetic composition. We also identified more than 2.7 million heterozygous single nucleotide polymorphisms in the diploid genome. Our data and analyses provide a foundation for promoting mammalian genetic research, and demonstrate the feasibility for using next-generation sequencing technologies for accurate, cost-effective and rapid de novo assembly of large eukaryotic genomes. PMID:20010809

  13. Whole-genome sequencing in autism identifies hot spots for de novo germline mutation.

    PubMed

    Michaelson, Jacob J; Shi, Yujian; Gujral, Madhusudan; Zheng, Hancheng; Malhotra, Dheeraj; Jin, Xin; Jian, Minghan; Liu, Guangming; Greer, Douglas; Bhandari, Abhishek; Wu, Wenting; Corominas, Roser; Peoples, Aine; Koren, Amnon; Gore, Athurva; Kang, Shuli; Lin, Guan Ning; Estabillo, Jasper; Gadomski, Therese; Singh, Balvindar; Zhang, Kun; Akshoomoff, Natacha; Corsello, Christina; McCarroll, Steven; Iakoucheva, Lilia M; Li, Yingrui; Wang, Jun; Sebat, Jonathan

    2012-12-21

    De novo mutation plays an important role in autism spectrum disorders (ASDs). Notably, pathogenic copy number variants (CNVs) are characterized by high mutation rates. We hypothesize that hypermutability is a property of ASD genes and may also include nucleotide-substitution hot spots. We investigated global patterns of germline mutation by whole-genome sequencing of monozygotic twins concordant for ASD and their parents. Mutation rates varied widely throughout the genome (by 100-fold) and could be explained by intrinsic characteristics of DNA sequence and chromatin structure. Dense clusters of mutations within individual genomes were attributable to compound mutation or gene conversion. Hypermutability was a characteristic of genes involved in ASD and other diseases. In addition, genes impacted by mutations in this study were associated with ASD in independent exome-sequencing data sets. Our findings suggest that regional hypermutation is a significant factor shaping patterns of genetic variation and disease risk in humans. PMID:23260136

  14. Sequencing, de novo assembly and comparative analysis of Raphanus sativus transcriptome.

    PubMed

    Wu, Gang; Zhang, Libin; Yin, Yongtai; Wu, Jiangsheng; Yu, Longjiang; Zhou, Yanhong; Li, Maoteng

    2015-01-01

    Raphanus sativus is an important Brassicaceae plant and also an edible vegetable with great economic value. However, currently there is not enough transcriptome information of R. sativus tissues, which impedes further functional genomics research on R. sativus. In this study, RNA-seq technology was employed to characterize the transcriptome of leaf tissues. Approximately 70 million clean paired-end reads were obtained and used for de novo assembly with the Trinity program, which generated 68,086 unigenes with an average length of 576 bp. All the unigenes were annotated against GO and KEGG databases. Meanwhile, we merged leaf sequencing data with existing root sequencing data and obtained a better de novo assembly of R. sativus using the Oases program. Accordingly, potential simple sequence repeats (SSRs), transcription factors (TFs) and enzyme codes were identified in R. sativus. Additionally, we detected a total of 3563 significantly differentially expressed genes (DEGs, P = 0.05) and tissue-specific biological processes between leaf and root tissues. Furthermore, a TFs-based regulation network was constructed using Cytoscape software. Taken together, these results not only provide a comprehensive genomic resource of R. sativus but also shed light on functional genomic and proteomic research on R. sativus in the future. PMID:26029219

  15. CYCLONE—A Utility for De Novo Sequencing of Microbial Cyclic Peptides

    NASA Astrophysics Data System (ADS)

    Kavan, Daniel; Kuzma, Marek; Lemr, Karel; Schug, Kevin A.; Havlicek, Vladimir

    2013-08-01

    We have developed a de novo sequencing software tool (CYCLONE) and applied it for determination of cyclic peptides. The program uses a non-redundant database of 312 nonribosomal building blocks identified to date in bacteria and fungi (more than 230 additional residues in the database list were isobaric). The software was used to fully characterize the tandem mass spectrum of several cyclic peptides and provide sequence tags. The general strategy of the script was based on fragment ion pre-characterization to accomplish unambiguous b-ion series assignments. Showcase examples were a cyclic tetradepsipeptide beauverolide, a cyclic hexadepsipeptide roseotoxin A, a lasso-like hexapeptide pseudacyclin A, and a cyclic undecapeptide cyclosporin A. The extent of ion scrambling in smaller peptides was as low as 5% of total ion current; this demonstrated the feasibility of CYCLONE de novo sequencing. The robustness of the script was also tested against database sets of various sizes and isotope-containing data. It can be downloaded from http://ms.biomed.cas.cz/MSTools/.

  16. CYCLONE--a utility for de novo sequencing of microbial cyclic peptides.

    PubMed

    Kavan, Daniel; Kuzma, Marek; Lemr, Karel; Schug, Kevin A; Havlicek, Vladimir

    2013-08-01

    We have developed a de novo sequencing software tool (CYCLONE) and applied it for determination of cyclic peptides. The program uses a non-redundant database of 312 nonribosomal building blocks identified to date in bacteria and fungi (more than 230 additional residues in the database list were isobaric). The software was used to fully characterize the tandem mass spectrum of several cyclic peptides and provide sequence tags. The general strategy of the script was based on fragment ion pre-characterization to accomplish unambiguous b-ion series assignments. Showcase examples were a cyclic tetradepsipeptide beauverolide, a cyclic hexadepsipeptide roseotoxin A, a lasso-like hexapeptide pseudacyclin A, and a cyclic undecapeptide cyclosporin A. The extent of ion scrambling in smaller peptides was as low as 5% of total ion current; this demonstrated the feasibility of CYCLONE de novo sequencing. The robustness of the script was also tested against database sets of various sizes and isotope-containing data. It can be downloaded from the http://ms.biomed.cas.cz/MSTools/ website. PMID:23702710

  17. Sequencing, de novo assembly and comparative analysis of Raphanus sativus transcriptome

    PubMed Central

    Wu, Gang; Zhang, Libin; Yin, Yongtai; Wu, Jiangsheng; Yu, Longjiang; Zhou, Yanhong; Li, Maoteng

    2015-01-01

    Raphanus sativus is an important Brassicaceae plant and also an edible vegetable with great economic value. However, currently there is not enough transcriptome information of R. sativus tissues, which impedes further functional genomics research on R. sativus. In this study, RNA-seq technology was employed to characterize the transcriptome of leaf tissues. Approximately 70 million clean paired-end reads were obtained and used for de novo assembly with the Trinity program, which generated 68,086 unigenes with an average length of 576 bp. All the unigenes were annotated against GO and KEGG databases. Meanwhile, we merged leaf sequencing data with existing root sequencing data and obtained a better de novo assembly of R. sativus using the Oases program. Accordingly, potential simple sequence repeats (SSRs), transcription factors (TFs) and enzyme codes were identified in R. sativus. Additionally, we detected a total of 3563 significantly differentially expressed genes (DEGs, P = 0.05) and tissue-specific biological processes between leaf and root tissues. Furthermore, a TFs-based regulation network was constructed using Cytoscape software. Taken together, these results not only provide a comprehensive genomic resource of R. sativus but also shed light on functional genomic and proteomic research on R. sativus in the future. PMID:26029219

  18. RoboOligo: software for mass spectrometry data to support manual and de novo sequencing of post-transcriptionally modified ribonucleic acids

    PubMed Central

    Sample, Paul J.; Gaston, Kirk W.; Alfonzo, Juan D.; Limbach, Patrick A.

    2015-01-01

    Ribosomal ribonucleic acid (RNA), transfer RNA and other biological or synthetic RNA polymers can contain nucleotides that have been modified by the addition of chemical groups. Traditional Sanger sequencing methods cannot establish the chemical nature and sequence of these modified-nucleotide containing oligomers. Mass spectrometry (MS) has become the conventional approach for determining the nucleotide composition, modification status and sequence of modified RNAs. Modified RNAs are analyzed by MS using collision-induced dissociation tandem mass spectrometry (CID MS/MS), which produces a complex dataset of oligomeric fragments that must be interpreted to identify and place modified nucleosides within the RNA sequence. Here we report the development of RoboOligo, an interactive software program for the robust analysis of data generated by CID MS/MS of RNA oligomers. There are three main functions of RoboOligo: (i) automated de novo sequencing via the local search paradigm; (ii) manual sequencing with real-time spectrum labeling and cumulative intensity scoring; and (iii) a hybrid approach, coined ‘variable sequencing’, which combines the user intuition of manual sequencing with the high-throughput sampling of automated de novo sequencing. PMID:25820423

  19. SNP discovery through de novo deep sequencing using the next generation of DNA sequencers

    Technology Transfer Automated Retrieval System (TEKTRAN)

    The production of high volumes of DNA sequence data using new technologies has permitted more efficient identification of single nucleotide polymorphisms in vertebrate genomes. This chapter presented practical methodology for production and analysis of DNA sequence data for SNP discovery....

  20. CycloBranch: De Novo Sequencing of Nonribosomal Peptides from Accurate Product Ion Mass Spectra

    NASA Astrophysics Data System (ADS)

    Novák, Jiří; Lemr, Karel; Schug, Kevin A.; Havlíček, Vladimír

    2015-07-01

    Nonribosomal peptides have a wide range of biological and medical applications. Their identification by tandem mass spectrometry remains a challenging task. A new open-source de novo peptide identification engine CycloBranch was developed and successfully applied in identification or detailed characterization of 11 linear, cyclic, branched, and branch-cyclic peptides. CycloBranch is based on annotated building block databases the size of which is defined by the user according to ribosomal or nonribosomal peptide origin. The current number of involved nonisobaric and isobaric building blocks is 287 and 521, respectively. Contrary to all other peptide sequencing tools utilizing either peptide libraries or peptide fragment libraries, CycloBranch represents a true de novo sequencing engine developed for accurate mass spectrometric data. It is a stand-alone and cross-platform application with a graphical and user-friendly interface; it supports mzML, mzXML, mgf, txt, and baf file formats and can be run in parallel on multiple threads. It can be downloaded for free from http://ms.biomed.cas.cz/cyclobranch/, where the User's manual and video tutorials can be found.

  1. De Novo Sequences of Haloquadratum walsbyi from Lake Tyrrell, Australia, Reveal a Variable Genomic Landscape

    PubMed Central

    Tully, Benjamin J.; Emerson, Joanne B.; Andrade, Karen; Brocks, Jochen J.; Allen, Eric E.; Banfield, Jillian F.; Heidelberg, Karla B.

    2015-01-01

    Hypersaline systems near salt saturation levels represent an extreme environment, in which organisms grow and survive near the limits of life. One of the abundant members of the microbial communities in hypersaline systems is the square archaeon, Haloquadratum walsbyi. Utilizing a short-read metagenome from Lake Tyrrell, a hypersaline ecosystem in Victoria, Australia, we performed a comparative genomic analysis of H. walsbyi to better understand the extent of variation between strains/subspecies. Results revealed that previously isolated strains/subspecies do not fully describe the complete repertoire of the genomic landscape present in H. walsbyi. Rearrangements, insertions, and deletions were observed for the Lake Tyrrell derived Haloquadratum genomes and were supported by environmental de novo sequences, including shifts in the dominant genomic landscape of the two most abundant strains. Analysis pertaining to halomucins indicated that homologs for this large protein are not a feature common for all species of Haloquadratum. Further, we analyzed ATP-binding cassette transporters (ABC-type transporters) for evidence of niche partitioning between different strains/subspecies. We were able to identify unique and variable transporter subunits from all five genomes analyzed and the de novo environmental sequences, suggesting that differences in nutrient and carbon source acquisition may play a role in maintaining distinct strains/subspecies. PMID:25709557

  2. A Real-Time de novo DNA Sequencing Assembly Platform Based on an FPGA Implementation.

    PubMed

    Hu, Yuanqi; Georgiou, Pantelis

    2016-01-01

    This paper presents an FPGA based DNA comparison platform which can be run concurrently with the sensing phase of DNA sequencing and shortens the overall time needed for de novo DNA assembly. A hybrid overlap searching algorithm is applied which is scalable and can deal with incremental detection of new bases. To handle the incomplete data set which gradually increases during sequencing time, all-against-all comparisons are broken down into successive window-against-window comparison phases and executed using a novel dynamic suffix comparison algorithm combined with a partitioned dynamic programming method. The complete system has been designed to facilitate parallel processing in hardware, which allows real-time comparison and full scalability as well as a decrease in the number of computations required. A base pair comparison rate of 51.2 G/s is achieved when implemented on an FPGA with successful DNA comparison when using data sets from real genomes. PMID:27045828

  3. Exome sequencing of case-unaffected-parents trios reveals recessive and de novo genetic variants in sporadic ALS

    PubMed Central

    Steinberg, Karyn Meltz; Yu, Bing; Koboldt, Daniel C.; Mardis, Elaine R.; Pamphlett, Roger

    2015-01-01

    The contribution of genetic variants to sporadic amyotrophic lateral sclerosis (ALS) remains largely unknown. Either recessive or de novo variants could result in an apparently sporadic occurrence of ALS. In an attempt to find such variants we sequenced the exomes of 44 ALS-unaffected-parents trios. Rare and potentially damaging compound heterozygous variants were found in 27% of ALS patients, homozygous recessive variants in 14% and coding de novo variants in 27%. In 20% of patients more than one of the above variants was present. Genes with recessive variants were enriched in nucleotide binding capacity, ATPase activity, and the dynein heavy chain. Genes with de novo variants were enriched in transcription regulation and cell cycle processes. This trio study indicates that rare private recessive variants could be a mechanism underlying some cases of sporadic ALS, and that de novo mutations are also likely to play a part in the disease. PMID:25773295
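
    The basic logic of calling a candidate de novo variant in a case-parents trio can be sketched as follows: a site qualifies when the child carries an alternate allele that neither parent carries. The genotype encoding (tuples of allele codes, 0 for reference), the function name and the site labels are assumptions made for illustration. Real pipelines also weigh read depth, genotype quality and population frequency, none of which is modeled here.

```python
def is_candidate_de_novo(child, mother, father):
    """True when the child has a non-reference allele absent from both parents.

    Each genotype is a tuple of allele codes: 0 = reference, >0 = alternate.
    """
    parental_alleles = set(mother) | set(father)
    return any(a != 0 and a not in parental_alleles for a in child)


# Hypothetical genotype calls per site: (child, mother, father).
trio_calls = {
    "site_A": ((0, 1), (0, 0), (0, 0)),  # alternate allele only in the child
    "site_B": ((0, 1), (0, 1), (0, 0)),  # inherited from the mother
    "site_C": ((1, 1), (0, 1), (0, 1)),  # homozygous, one allele from each parent
}

candidates = [site for site, (child, mother, father) in trio_calls.items()
              if is_candidate_de_novo(child, mother, father)]
print(candidates)  # ['site_A']
```

    Site C illustrates the recessive pattern discussed in the abstract: the child is homozygous for an allele that both unaffected parents carry in heterozygous form, so it is not flagged as de novo.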

  4. Exome sequencing for bipolar disorder points to roles of de novo loss-of-function and protein-altering mutations.

    PubMed

    Kataoka, M; Matoba, N; Sawada, T; Kazuno, A-A; Ishiwata, M; Fujii, K; Matsuo, K; Takata, A; Kato, T

    2016-07-01

    Although numerous genetic studies have been conducted for bipolar disorder (BD), its genetic architecture remains elusive. Here we perform, to the best of our knowledge, the first trio-based exome sequencing study for BD to investigate potential roles of de novo mutations in the disease etiology. We identified 71 de novo point mutations and one de novo copy-number mutation in 79 BD probands. Among the genes hit by de novo loss-of-function (LOF; nonsense, splice site or frameshift) or protein-altering (LOF, missense and inframe indel) mutations, we found significant enrichment of genes highly intolerant (first percentile of intolerant genes assessed by Residual Variation Intolerance Score) to protein-altering variants in general population, an observation that is also reported in autism and schizophrenia. When we performed a joint analysis using the data of schizoaffective disorder in published studies, we found global enrichment of de novo LOF and protein-altering mutations in the combined group of bipolar I and schizoaffective disorders. Considering relationship between de novo mutations and clinical phenotypes, we observed significantly earlier disease onset among the BD probands with de novo protein-altering mutations when compared with non-carriers. Gene ontology enrichment analysis of genes hit by de novo protein-altering mutations in bipolar I and schizoaffective disorders did not identify any significant enrichment. These results of exploratory analyses collectively point to the roles of de novo LOF and protein-altering mutations in the etiology of bipolar disorder and warrant further large-scale studies. PMID:27217147

  5. De Novo Sequencing and Characterization of the Floral Transcriptome of Dendrocalamus latiflorus (Poaceae: Bambusoideae)

    PubMed Central

    Li, De-Zhu; Guo, Zhen-Hua

    2012-01-01

    Background Transcriptome sequencing can be used to determine gene sequences and transcript abundance in non-model species, and the advent of next-generation sequencing (NGS) technologies has greatly decreased the cost and time required for this process. Transcriptome data are especially desirable in bamboo species, as certain members constitute an economically and culturally important group of mostly semelparous plants with remarkable flowering features, yet little bamboo genomic research has been performed. Here we present, for the first time, extensive sequence and transcript abundance data for the floral transcriptome of a key bamboo species, Dendrocalamus latiflorus, obtained using the Illumina GAII sequencing platform. Our further goal was to identify patterns of gene expression during bamboo flower development. Results Approximately 96 million sequencing reads were generated and assembled de novo, yielding 146,395 high quality unigenes with an average length of 461 bp. Of these, 80,418 were identified as putative homologs of annotated sequences in the public protein databases, of which 290 were associated with the floral transition and 47 were related to flower development. Digital abundance analysis identified 26,529 transcripts differentially enriched between two developmental stages, young flower buds and older developing flowers. Unigenes found at each stage were categorized according to their putative functional categories. These sequence and putative function data comprise a resource for future investigation of the floral transition and flower development in bamboo species. Conclusions Our results present the first broad survey of a bamboo floral transcriptome. Although it will be necessary to validate the functions carried out by these genes, these results represent a starting point for future functional research on D. latiflorus and related species. PMID:22916120

  6. Whole Exome Sequencing for a Patient with Rubinstein-Taybi Syndrome Reveals de Novo Variants besides an Overt CREBBP Mutation

    PubMed Central

    Yoo, Hee Jeong; Kim, Kyung; Kim, In Hyang; Rho, Seong-Hwan; Park, Jong-Eun; Lee, Ki Young; Kim, Soon Ae; Choi, Byung Yoon; Kim, Namshin

    2015-01-01

    Rubinstein-Taybi syndrome (RSTS) is a rare condition with a prevalence of 1 in 125,000–720,000 births and is characterized by clinical features that include facial, dental, and limb dysmorphology and growth retardation. Most cases of RSTS occur sporadically and are caused by de novo mutations. Cytogenetic or molecular abnormalities are detected in only 55% of RSTS cases. Previous genetic studies have yielded inconsistent results due to the variety of methods used for genetic analysis. The purpose of this study was to use whole exome sequencing (WES) to evaluate the genetic causes of RSTS in a young girl presenting with an Autism phenotype. We used the Autism diagnostic observation schedule (ADOS) and Autism diagnostic interview revised (ADI-R) to confirm her diagnosis of Autism. In addition, various questionnaires were used to evaluate other psychiatric features. We used WES to analyze the DNA sequences of the patient and her parents and to search for de novo variants. The patient showed all the typical features of Autism. WES revealed a de novo frameshift mutation in CREBBP and de novo sequence variants in the TNC and IGFALS genes. Mutations in the CREBBP gene have been extensively reported in RSTS patients, while potential missense mutations in the TNC and IGFALS genes have not previously been associated with RSTS. The TNC and IGFALS genes are involved in central nervous system development and growth. It is possible for patients with RSTS to have additional de novo variants that could account for previously unexplained phenotypes. PMID:25768348

  7. Whole exome sequencing for a patient with Rubinstein-Taybi syndrome reveals de novo variants besides an overt CREBBP mutation.

    PubMed

    Yoo, Hee Jeong; Kim, Kyung; Kim, In Hyang; Rho, Seong-Hwan; Park, Jong-Eun; Lee, Ki Young; Kim, Soon Ae; Choi, Byung Yoon; Kim, Namshin

    2015-01-01

    Rubinstein-Taybi syndrome (RSTS) is a rare condition with a prevalence of 1 in 125,000-720,000 births and is characterized by clinical features that include facial, dental, and limb dysmorphology and growth retardation. Most cases of RSTS occur sporadically and are caused by de novo mutations. Cytogenetic or molecular abnormalities are detected in only 55% of RSTS cases. Previous genetic studies have yielded inconsistent results due to the variety of methods used for genetic analysis. The purpose of this study was to use whole exome sequencing (WES) to evaluate the genetic causes of RSTS in a young girl presenting with an Autism phenotype. We used the Autism diagnostic observation schedule (ADOS) and Autism diagnostic interview revised (ADI-R) to confirm her diagnosis of Autism. In addition, various questionnaires were used to evaluate other psychiatric features. We used WES to analyze the DNA sequences of the patient and her parents and to search for de novo variants. The patient showed all the typical features of Autism. WES revealed a de novo frameshift mutation in CREBBP and de novo sequence variants in the TNC and IGFALS genes. Mutations in the CREBBP gene have been extensively reported in RSTS patients, while potential missense mutations in the TNC and IGFALS genes have not previously been associated with RSTS. The TNC and IGFALS genes are involved in central nervous system development and growth. It is possible for patients with RSTS to have additional de novo variants that could account for previously unexplained phenotypes. PMID:25768348

  8. De Novo Sequencing and Analysis of Lemongrass Transcriptome Provide First Insights into the Essential Oil Biosynthesis of Aromatic Grasses

    PubMed Central

    Meena, Seema; Kumar, Sarma R.; Venkata Rao, D. K.; Dwivedi, Varun; Shilpashree, H. B.; Rastogi, Shubhra; Shasany, Ajit K.; Nagegowda, Dinesh A.

    2016-01-01

    Aromatic grasses of the genus Cymbopogon (Poaceae family) represent a unique group of plants that produce diverse compositions of monoterpene-rich essential oils, which have great value in the flavor, fragrance, cosmetic, and aromatherapy industries. Despite the commercial importance of these natural aromatic oils, their biosynthesis at the molecular level remains unexplored. As the first step toward understanding essential oil biosynthesis, we performed de novo transcriptome assembly and analysis of C. flexuosus (lemongrass) by employing Illumina sequencing. Mining of transcriptome data and subsequent phylogenetic analysis led to identification of terpene synthases, pyrophosphatases, alcohol dehydrogenases, aldo-keto reductases, carotenoid cleavage dioxygenases, alcohol acetyltransferases, and aldehyde dehydrogenases, which are potentially involved in essential oil biosynthesis. Comparative essential oil profiling and mRNA expression analysis in three Cymbopogon species (C. flexuosus, aldehyde type; C. martinii, alcohol type; and C. winterianus, intermediate type) with varying essential oil composition indicated the involvement of identified candidate genes in the formation of alcohols, aldehydes, and acetates. Molecular modeling and docking further supported the role of identified protein sequences in aroma formation in Cymbopogon. Also, simple sequence repeats were found in the transcriptome, with many linked to terpene pathway genes including the genes potentially involved in aroma biosynthesis. This work provides the first insights into the essential oil biosynthesis of aromatic grasses, and the identified candidate genes and markers can be a great resource for biotechnological and molecular breeding approaches to modulate the essential oil composition. PMID:27516768

  9. De Novo Sequencing and Analysis of Lemongrass Transcriptome Provide First Insights into the Essential Oil Biosynthesis of Aromatic Grasses.

    PubMed

    Meena, Seema; Kumar, Sarma R; Venkata Rao, D K; Dwivedi, Varun; Shilpashree, H B; Rastogi, Shubhra; Shasany, Ajit K; Nagegowda, Dinesh A

    2016-01-01

    Aromatic grasses of the genus Cymbopogon (Poaceae family) represent a unique group of plants that produce diverse compositions of monoterpene-rich essential oils, which have great value in the flavor, fragrance, cosmetic, and aromatherapy industries. Despite the commercial importance of these natural aromatic oils, their biosynthesis at the molecular level remains unexplored. As the first step toward understanding essential oil biosynthesis, we performed de novo transcriptome assembly and analysis of C. flexuosus (lemongrass) by employing Illumina sequencing. Mining of transcriptome data and subsequent phylogenetic analysis led to identification of terpene synthases, pyrophosphatases, alcohol dehydrogenases, aldo-keto reductases, carotenoid cleavage dioxygenases, alcohol acetyltransferases, and aldehyde dehydrogenases, which are potentially involved in essential oil biosynthesis. Comparative essential oil profiling and mRNA expression analysis in three Cymbopogon species (C. flexuosus, aldehyde type; C. martinii, alcohol type; and C. winterianus, intermediate type) with varying essential oil composition indicated the involvement of identified candidate genes in the formation of alcohols, aldehydes, and acetates. Molecular modeling and docking further supported the role of identified protein sequences in aroma formation in Cymbopogon. Also, simple sequence repeats were found in the transcriptome, with many linked to terpene pathway genes including the genes potentially involved in aroma biosynthesis. This work provides the first insights into the essential oil biosynthesis of aromatic grasses, and the identified candidate genes and markers can be a great resource for biotechnological and molecular breeding approaches to modulate the essential oil composition. PMID:27516768

  10. Sequencing and De Novo Assembly of the Gonadal Transcriptome of the Endangered Chinese Sturgeon (Acipenser sinensis)

    PubMed Central

    Du, Hao; Zhang, Shuhuan; Wei, Qiwei

    2015-01-01

    Background The Chinese sturgeon (Acipenser sinensis) is endangered through anthropogenic activities including over-fishing, damming, shipping, and pollution. Controlled reproduction has been adopted and successfully conducted for conservation. However, little information is available on the reproductive regulation of the species. In this study, we conducted de novo transcriptome assembly of the gonad tissue to create a comprehensive dataset for A. sinensis. Results The Illumina sequencing platform was adopted to obtain 47,333,701 and 47,229,705 high quality reads from testis and ovary cDNA libraries generated from three-year-old A. sinensis. We identified 86,027 unigenes of which 30,268 were annotated in the NCBI non-redundant protein database and 28,281 were annotated in the Swiss-prot database. Among the annotated unigenes, 26,152 and 7,734 unigenes, respectively, were assigned to gene ontology categories and clusters of orthologous groups. In addition, 12,557 unigenes were mapped to 231 pathways in the Kyoto Encyclopedia of Genes and Genomes Pathway database. A total of 1,896 unigenes, potentially differentially expressed between the two gonad types, were found, with 1,894 predicted to be up-regulated in ovary and only two in testis. Fifty-five potential gametogenesis-related genes were screened in the transcriptome and 34 genes with significant matches were found. Besides, more paralogs of 11 genes in three gene families (sox, apolipoprotein and cyclin) were found in A. sinensis compared to their orthologs in the diploid Danio rerio. In addition, 12,151 putative simple sequence repeats (SSRs) were detected. Conclusions This study provides the first de novo transcriptome analysis currently available for A. sinensis. The transcriptomic data represents the fundamental resource for future research on the mechanism of early gametogenesis in sturgeons. The SSRs identified in this work will be valuable for assessment of genetic diversity of wild fish and genealogy

  11. De Novo Transcriptome Sequencing of Oryza officinalis Wall ex Watt to Identify Disease-Resistance Genes

    PubMed Central

    He, Bin; Gu, Yinghong; Tao, Xiang; Cheng, Xiaojie; Wei, Changhe; Fu, Jian; Cheng, Zaiquan; Zhang, Yizheng

    2015-01-01

    Oryza officinalis Wall ex Watt is one of the most important wild relatives of cultivated rice and exhibits high resistance to many diseases. It has been used as a source of genes for introgression into cultivated rice. However, there are limited genomic resources and little genetic information publicly reported for this species. To better understand the pathways and factors involved in disease resistance and to accelerate the process of rice breeding, we carried out de novo transcriptome sequencing of O. officinalis. In this research, 137,229 contigs ranging from 200 to 19,214 bp, with an N50 of 2331 bp, were obtained through de novo assembly of leaves, stems and roots in O. officinalis using an Illumina HiSeq 2000 platform. Based on sequence similarity searches against a non-redundant protein database, a total of 88,249 contigs were annotated with gene descriptions and 75,589 transcripts were further assigned to GO terms. Candidate genes for plant–pathogen interaction and plant hormone regulation pathways involved in disease resistance were identified. Further analyses of gene expression profiles showed that the majority of genes related to disease resistance were expressed in all three tissues. In addition, there are two kinds of rice bacterial blight-resistance genes in O. officinalis, including two Xa1 genes and three Xa26 genes. Both Xa1 genes showed the highest expression level in stem, whereas one Xa26 gene was expressed predominantly in leaf and the other two Xa26 genes displayed low expression levels in all three tissues. This transcriptomic database provides an opportunity for identifying the genes involved in disease resistance and will provide a basis for studying functional genomics of O. officinalis and genetic improvement of cultivated rice in the future. PMID:26690414

  12. De novo Sequencing, Assembly and Characterization of Antennal Transcriptome of Anomala corpulenta Motschulsky (Coleoptera: Rutelidae)

    PubMed Central

    Chen, Haoliang; Lin, Lulu; Xie, Minghui; Zhang, Guangling; Su, Weihua

    2014-01-01

    Background Anomala corpulenta is an important insect pest and can cause enormous economic losses in agriculture, horticulture and forestry. It is widely distributed in China, and both larvae and adults can cause serious damage. It is difficult to control this pest because the larvae live underground. Any new control strategy should exploit alternatives to heavily and frequently used chemical insecticides. However, little genetic research has been carried out on A. corpulenta due to the lack of genomic resources. Genomic resources could be produced by next generation sequencing technologies at low cost and in a short time. In this study, we performed de novo sequencing, assembly and characterization of the antennal transcriptome of A. corpulenta. Results Illumina sequencing technology was used to sequence the antennal transcriptome of A. corpulenta. Approximately 76.7 million total raw reads and about 68.9 million total clean reads were obtained, and then 35,656 unigenes were assembled. Of these unigenes, 21,463 could be annotated in the NCBI nr database, and, among the annotated unigenes, 11,154 and 6,625 unigenes could be assigned to GO and COG, respectively. Additionally, 16,350 unigenes could be annotated in the Swiss-Prot database, and 14,499 unigenes could be mapped onto 258 pathways in the KEGG Pathway database. We also found 24 unigenes related to OBPs, 6 to CSPs, and in total 167 unigenes related to chemodetection. We analyzed 4 OBP and 3 CSP sequences, and their RT-qPCR results agreed well with their FPKM values. Conclusion We produced the first large-scale antennal transcriptome of A. corpulenta, a species that has little genomic information in public databases. The identified chemodetection unigenes can promote the molecular mechanistic study of behavior in A. corpulenta. These findings provide a general sequence resource for molecular genetics research on A. corpulenta. PMID:25461610

  13. A Practical Comparison of De Novo Genome Assembly Software Tools for Next-Generation Sequencing Technologies

    PubMed Central

    Zhang, Wenyu; Chen, Jiajia; Yang, Yang; Tang, Yifei; Shang, Jing; Shen, Bairong

    2011-01-01

    The advent of next-generation sequencing technologies has been accompanied by the development of many whole-genome sequence assembly methods and software tools, especially for de novo fragment assembly. Because little is known about the applicability and performance of these tools, choosing a suitable assembler is a difficult task. Here, we provide information on the adaptability of each program and, above all, compare the performance of eight distinct tools against eight groups of simulated datasets from the Solexa sequencing platform. Considering computational time, maximum random access memory (RAM) occupancy, and assembly accuracy and integrity, our study indicates that string-based assemblers and overlap-layout-consensus (OLC) assemblers are well suited for very short reads and for longer reads of small genomes, respectively. For large datasets of more than a hundred million short reads, De Bruijn graph-based assemblers would be more appropriate. In terms of software implementation, string-based assemblers are superior to graph-based ones, of which SOAPdenovo is complex because it requires the creation of a configuration file. Our comparison study will assist researchers in selecting a well-suited assembler and offers essential information for the improvement of existing assemblers or the development of novel ones. PMID:21423806
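
    To make the De Bruijn graph-based category above concrete, the following is a minimal sketch of how such a graph can be built from reads. It is not taken from any of the compared tools; the function name and the adjacency-list representation are assumptions, and real assemblers add error correction, graph simplification and contig traversal on top of this step.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def de_bruijn_graph(reads: Iterable[str], k: int) -> Dict[str, List[str]]:
    """Build a De Bruijn graph: nodes are (k-1)-mers, one edge per k-mer occurrence."""
    if k < 2:
        raise ValueError("k must be at least 2")
    graph: Dict[str, List[str]] = defaultdict(list)
    for read in reads:
        # Slide a window of width k over the read; reads shorter than k add nothing.
        for start in range(len(read) - k + 1):
            kmer = read[start:start + k]
            prefix, suffix = kmer[:-1], kmer[1:]
            graph[prefix].append(suffix)
    return dict(graph)
```

    Because each k-mer becomes an edge rather than each read becoming a node, the graph size depends on the number of distinct k-mers instead of the number of reads, which is one reason this design is attractive for very large short-read datasets.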

  14. Mining Novel Allergens from Coconut Pollen Employing Manual De Novo Sequencing and Homology-Driven Proteomics.

    PubMed

    Saha, Bodhisattwa; Sircar, Gaurab; Pandey, Naren; Gupta Bhattacharya, Swati

    2015-11-01

    Coconut pollen, one of the major palm pollen grains is an important constituent among vectors of inhalant allergens in India and a major sensitizer for respiratory allergy in susceptible patients. To gain insight into its allergenic components, pollen proteins were analyzed by two-dimensional electrophoresis, immunoblotted with coconut pollen sensitive patient sera, followed by mass spectrometry of IgE reactive proteins. Coconut being largely unsequenced, a proteomic workflow has been devised that combines the conventional database-dependent analysis of tandem mass spectral data and manual de novo sequencing followed by a homology-based search for identifying the allergenic proteins. N-terminal acetylation helped to distinguish "b" ions from others, facilitating reliable sequencing. This led to the identification of 12 allergenic proteins. Cluster analysis with individual patient sera recognized vicilin-like protein as a major allergen, which was purified to assess its in vitro allergenicity and then partially sequenced. Other IgE-sensitive spots showed significant homology with well-known allergenic proteins such as 11S globulin, enolase, and isoflavone reductase along with a few which are reported as novel allergens. The allergens identified can be used as potential candidates to develop hypoallergenic vaccines, to design specific immunotherapy trials, and to enrich the repertoire of existing IgE reactive proteins. PMID:26426307

  15. Oxford Nanopore sequencing, hybrid error correction, and de novo assembly of a eukaryotic genome

    PubMed Central

    Goodwin, Sara; Gurtowski, James; Ethe-Sayers, Scott; Deshpande, Panchajanya; Schatz, Michael C.; McCombie, W. Richard

    2015-01-01

    Monitoring the progress of DNA molecules through a membrane pore has been postulated as a method for sequencing DNA for several decades. Recently, a nanopore-based sequencing instrument, the Oxford Nanopore MinION, has become available, and we used this for sequencing the Saccharomyces cerevisiae genome. To make use of these data, we developed a novel open-source hybrid error correction algorithm Nanocorr specifically for Oxford Nanopore reads, because existing packages were incapable of assembling the long read lengths (5–50 kbp) at such high error rates (between ∼5% and 40% error). With this new method, we were able to perform a hybrid error correction of the nanopore reads using complementary MiSeq data and produce a de novo assembly that is highly contiguous and accurate: The contig N50 length is more than ten times greater than an Illumina-only assembly (678 kbp versus 59.9 kbp) and has >99.88% consensus identity when compared to the reference. Furthermore, the assembly with the long nanopore reads presents a much more complete representation of the features of the genome and correctly assembles gene cassettes, rRNAs, transposable elements, and other genomic features that were almost entirely absent in the Illumina-only assembly. PMID:26447147
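
    Several records in this list report a contig N50. As a reminder of what that statistic measures, here is a minimal sketch under the usual definition: the length of the shortest contig in the smallest set of longest contigs that together cover at least half of the assembled bases. The function name is an assumption and the code is not drawn from any cited study.

```python
from typing import Iterable


def n50(contig_lengths: Iterable[int]) -> int:
    """Return the N50 of a collection of contig lengths."""
    lengths = sorted(contig_lengths, reverse=True)
    if not lengths:
        raise ValueError("at least one contig length is required")
    total = sum(lengths)
    running = 0
    for length in lengths:
        running += length
        # Stop at the first contig where the cumulative sum reaches half the total.
        if 2 * running >= total:
            return length
    raise AssertionError("unreachable for non-empty input")
```

    A higher N50 indicates that more of the assembly sits in long contigs, which is why it is used above to compare the nanopore-based and Illumina-only assemblies.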

  16. De novo Ixodes ricinus salivary gland transcriptome analysis using two next-generation sequencing methodologies

    PubMed Central

    Schwarz, Alexandra; von Reumont, Björn M.; Erhart, Jan; Chagas, Andrezza C.; Ribeiro, José M. C.; Kotsyfakis, Michalis

    2013-01-01

    Tick salivary gland (SG) proteins possess powerful pharmacologic properties that facilitate tick feeding and pathogen transmission. For the first time, SG transcriptomes of Ixodes ricinus, an important disease vector for humans and animals, were analyzed using next-generation sequencing. SGs were collected from different tick life stages fed on various animal species, including cofeeding of nymphs and adults on the same host. Four cDNA samples were sequenced, discriminating tick SG transcriptomes of early- and late-feeding nymphs or adults. In total, 441,381,454 pyrosequencing reads and 67,703,183 Illumina reads were assembled into 272,220 contigs, of which 34,560 extensively annotated coding sequences are disclosed; 8686 coding sequences were submitted to GenBank. Overall, 13% of contigs were classified as secreted proteins that showed significant differences in the transcript representation among the 4 SG samples, including high numbers of sample-specific transcripts. Detailed phylogenetic reconstructions of two relatively abundant SG-secreted protein families demonstrated how this study improves our understanding of the molecular evolution of hematophagy in arthropods. Our data significantly increase the available genomic information for I. ricinus and form a solid basis for future tick genome/transcriptome assemblies and the functional analysis of effectors that mediate the feeding physiology and parasite-vector interaction of I. ricinus. PMID:23964076

  17. Sequencing and De Novo Assembly of the Transcriptome of the Glassy-Winged Sharpshooter (Homalodisca vitripennis)

    PubMed Central

    Nandety, Raja Sekhar; Kamita, Shizuo G.; Hammock, Bruce D.; Falk, Bryce W.

    2013-01-01

    Background The glassy-winged sharpshooter Homalodisca vitripennis (Hemiptera: Cicadellidae) is a xylem-feeding leafhopper and an important vector of the bacterium Xylella fastidiosa, the causal agent of Pierce’s disease of grapevines. The functional complexity of the transcriptome of H. vitripennis has not been elucidated thus far. It is a necessary blueprint for an understanding of the development of H. vitripennis and for designing efficient biorational control strategies including those based on RNA interference. Results Here we elucidate and explore the transcriptome of adult H. vitripennis using high-throughput paired end deep sequencing and de novo assembly. A total of 32,803,656 paired-end reads were obtained with an average transcript length of 624 nucleotides. We assembled 32.9 Mb of the transcriptome of H. vitripennis that spanned across 47,265 loci and 52,708 transcripts. Comparison with the non-redundant database showed that 45% of the deduced proteins of H. vitripennis exhibit identity (e-value ≤ 1e−5) with known proteins. We assigned Gene Ontology (GO) terms, Kyoto Encyclopedia of Genes and Genomes (KEGG) annotations, and potential Pfam domains to each transcript isoform. In order to gain insight into the molecular basis of key regulatory genes of H. vitripennis, we characterized predicted proteins involved in the metabolism of juvenile hormone, and biogenesis of small RNAs (Dicer and Piwi sequences) from the transcriptomic sequences. Analysis of transposable element sequences of H. vitripennis indicated that the genome is less expanded in comparison to many other insects with approximately 1% of the transcriptome carrying transposable elements. Conclusions Our data significantly enhance the molecular resources available for future study and control of this economically important hemipteran. This transcriptional information not only provides a more nuanced understanding of the underlying biological and physiological mechanisms that govern H

  18. De Novo Assembly and Transcriptome Characterization of Canine Retina Using High-Throughput Sequencing

    PubMed Central

    Reddy, Bhaskar; Patel, Amrutlal K.; Singh, Krishna M.; Patil, Deepak B.; Parikh, Pinesh V.; Kelawala, Divyesh N.; Koringa, Prakash G.; Bhatt, Vaibhav D.; Rao, Mandava V.; Joshi, Chaitanya G.

    2015-01-01

    We performed transcriptome sequencing of canine retinal tissue by 454 GS-FLX and Ion Torrent PGM platforms. RNA-Seq analysis by CLC Genomics Workbench mapped expression of 10,360 genes. Gene ontology analysis of retinal transcriptome revealed abundance of transcripts known to be involved in vision associated processes. The de novo assembly of the sequences using CAP3 generated 29,683 contigs with mean length of 560.9 and N50 of 619 bases. Further analysis of contigs predicted 3,827 full-length cDNAs and 29,481 (99%) open reading frames (ORFs). In addition, 3,782 contigs were assigned to 316 KEGG pathways which included melanogenesis, phototransduction, and retinol metabolism with 33, 15, and 11 contigs, respectively. Among the identified microsatellites, dinucleotide repeats were 68.84%, followed by trinucleotides, tetranucleotides, pentanucleotides, and hexanucleotides in proportions of 25.76, 9.40, 2.52, and 0.96%, respectively. This study will serve as a valuable resource for understanding the biology and function of canine retina. PMID:26788372
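
    The microsatellite classes tallied above (di- to hexanucleotide repeats) can be detected with a simple pattern search. The sketch below is only an illustration: the minimum of four repeats, the function name and the exclusion of homopolymer motifs are assumptions, and it is not the tool used in any of the listed studies, which typically apply class-specific repeat thresholds.

```python
import re
from typing import List, Tuple

# Hypothetical labels for motif lengths 2 through 6.
MOTIF_CLASS = {2: "di", 3: "tri", 4: "tetra", 5: "penta", 6: "hexa"}


def find_ssrs(sequence: str, min_repeats: int = 4) -> List[Tuple[int, str, int]]:
    """Return (start, motif, repeat_count) for tandem repeats of 2-6 bp motifs."""
    if min_repeats < 2:
        raise ValueError("min_repeats must be at least 2")
    # Lazy motif capture prefers the shortest motif; the backreference
    # then requires at least min_repeats - 1 further copies in tandem.
    pattern = re.compile(r"([ACGT]{2,6}?)\1{%d,}" % (min_repeats - 1))
    hits = []
    for match in pattern.finditer(sequence.upper()):
        motif = match.group(1)
        if len(set(motif)) == 1:
            continue  # skip homopolymer runs such as AAAAAAAA
        repeats = len(match.group(0)) // len(motif)
        hits.append((match.start(), motif, repeats))
    return hits


def motif_class(motif: str) -> str:
    """Map a motif to its repeat class name, e.g. 'AC' -> 'di'."""
    return MOTIF_CLASS[len(motif)]
```

    Counting `motif_class(motif)` over all hits gives the kind of per-class proportions reported in the record above.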

  19. Sequencing and de novo assembly of the red cusk-eel (Genypterus chilensis) transcriptome.

    PubMed

    Aedo, J E; Maldonado, J; Estrada, J M; Fuentes, E N; Silva, H; Gallardo-Escarate, C; Molina, A; Valdés, J A

    2014-12-01

    The red cusk-eel (Genypterus chilensis) is an endemic fish species distributed along the coasts of the Eastern South Pacific. Biological studies on this fish are scarce, and genomic information for G. chilensis is practically non-existent. Thus, transcriptome information for this species is an essential resource that will greatly enrich molecular information and benefit future studies of red cusk-eel biology. In this work, we obtained transcriptome information of G. chilensis using the Illumina platform. The RNA sequencing generated 66,307,362 and 59,925,554 paired-end reads from skeletal muscle and liver tissues, respectively. De novo assembly using the CLC Genomic Workbench version 7.0.3 produced 48,480 contigs and created a reference transcriptome with an N50 of 846 bp and average read coverage of 28.3×. By sequence similarity search for known proteins, a total of 21,272 (43.9%) contigs were annotated for their function. Among the GO annotation results for these annotated contigs, 33.5% were for biological processes, 32.6% for cellular components and 34.5% for molecular functions. This dataset represents the first transcriptomic resource for the red cusk-eel and for a member of the Ophidiimorpharia taxon. PMID:25139027

  20. Rationale-Based, De Novo Design of Dehydrophenylalanine-Containing Antibiotic Peptides and Systematic Modification in Sequence for Enhanced Potency

    PubMed Central

    Pathak, Sarika; Chauhan, Virander Singh

    2011-01-01

    Increased microbial drug resistance has generated a global requirement for new anti-infective agents. As part of an effort to develop new, low-molecular-mass peptide antibiotics, we used a rationale-based minimalist approach to design short, nonhemolytic, potent, and broad-spectrum antibiotic peptides with increased serum stability. These peptides were designed to attain an amphipathic structure in helical conformations. VS1 was used as the lead compound, and its properties were compared with three series of derivatives obtained by (i) N-terminal amino acid addition, (ii) systematic Trp substitution, and (iii) peptide dendrimerization. The Trp substitution approach underlined the optimized sequence of VS2 in terms of potency, faster membrane permeation, and cost-effectiveness. VS2 (a variant of VS1 with two Trp substitutions) was found to exhibit good antimicrobial activity against both the Gram-negative bacterium Escherichia coli and the Gram-positive bacterium Staphylococcus aureus. It was also found to have noncytolytic activity and the ability to permeate and depolarize the bacterial membrane. Lysis of the bacterial cell wall and inner membrane by the peptide was confirmed by transmission electron microscopy. A combination of small size, the presence of unnatural amino acids, high antimicrobial activity, insignificant hemolysis, and proteolytic resistance provides fundamental information for the de novo design of an antimicrobial peptide useful for the management of infectious disease. PMID:21321136

  1. Sequencing and De Novo Assembly of the Western Tarnished Plant Bug (Lygus hesperus) Transcriptome

    PubMed Central

    Hull, J. Joe; Geib, Scott M.; Fabrick, Jeffrey A.; Brent, Colin S.

    2013-01-01

    Background Mirid plant bugs (Hemiptera: Miridae) are economically important insect pests of many crops worldwide. The western tarnished plant bug Lygus hesperus Knight is a pest of cotton, alfalfa, fruit and vegetable crops, and potentially of several emerging biofuel and natural product feedstocks in the western US. However, little is known about the underlying molecular genetics, biochemistry, or physiology of L. hesperus, including their ability to survive extreme environmental conditions. Methodology/Principal Findings We used 454 pyrosequencing of a normalized adult cDNA library and de novo assembly to obtain an adult L. hesperus transcriptome consisting of 1,429,818 transcriptomic reads representing 36,131 transcript isoforms (isotigs) that correspond to 19,742 genes. A search of the transcriptome against deposited L. hesperus protein sequences revealed that 86 out of 87 were represented. Comparison with the non-redundant database indicated that 54% of the transcriptome exhibited similarity (e-value ≤ 1e−5) with known proteins. In addition, Gene Ontology (GO) terms, Kyoto Encyclopedia of Genes and Genomes (KEGG) annotations, and potential Pfam domains were assigned to each transcript isoform. To gain insight into the molecular basis of the L. hesperus thermal stress response we used transcriptomic sequences to identify 52 potential heat shock protein (Hsp) homologs. A subset of these transcripts was sequence verified and their expression response to thermal stress monitored by semi-quantitative PCR. Potential homologs of Hsp70, Hsp40, and 2 small Hsps were found to be upregulated in the heat-challenged adults, suggesting a role in thermotolerance. Conclusions/Significance The L. hesperus transcriptome advances the underlying molecular understanding of this arthropod pest by significantly increasing the number of known genes, and provides the basis for further exploration and understanding of the fundamental mechanisms of abiotic stress responses. PMID

  2. Identification of lignin genes and regulatory sequences involved in secondary cell wall formation in Acacia auriculiformis and Acacia mangium via de novo transcriptome sequencing

    PubMed Central

    2011-01-01

    Background Acacia auriculiformis × Acacia mangium hybrids are commercially important trees for the timber and pulp industry in Southeast Asia. Increasing pulp yield while reducing pulping costs are major objectives of tree breeding programs. The general monolignol biosynthesis and secondary cell wall formation pathways are well-characterized but genes in these pathways are poorly characterized in Acacia hybrids. RNA-seq on short-read platforms is a rapid approach for obtaining comprehensive transcriptomic data and to discover informative sequence variants. Results We sequenced transcriptomes of A. auriculiformis and A. mangium from non-normalized cDNA libraries synthesized from pooled young stem and inner bark tissues using paired-end libraries and a single lane of an Illumina GAII machine. De novo assembly produced a total of 42,217 and 35,759 contigs with an average length of 496 bp and 498 bp for A. auriculiformis and A. mangium respectively. The assemblies of A. auriculiformis and A. mangium had a total length of 21,022,649 bp and 17,838,260 bp, respectively, with the largest contig 15,262 bp long. We detected all ten monolignol biosynthetic genes using Blastx and further analysis revealed 18 lignin isoforms for each species. We also identified five contigs homologous to R2R3-MYB proteins in other plant species that are involved in transcriptional regulation of secondary cell wall formation and lignin deposition. We searched the contigs against public microRNA database and predicted the stem-loop structures of six highly conserved microRNA families (miR319, miR396, miR160, miR172, miR162 and miR168) and one legume-specific family (miR2086). Three microRNA target genes were predicted to be involved in wood formation and flavonoid biosynthesis. By using the assemblies as a reference, we discovered 16,648 and 9,335 high quality putative Single Nucleotide Polymorphisms (SNPs) in the transcriptomes of A. auriculiformis and A. mangium, respectively, thus yielding

  3. Highly efficient de novo mutant identification in a Sorghum bicolor TILLING population using the ComSeq approach.

    PubMed

    Nida, Habte; Blum, Shula; Zielinski, Dina; Srivastava, Dhruv A; Elbaum, Rivka; Xin, Zhanguo; Erlich, Yaniv; Fridman, Eyal; Shental, Noam

    2016-05-01

    Screening large populations for carriers of known or de novo rare single nucleotide polymorphisms (SNPs) is required both in Targeting induced local lesions in genomes (TILLING) experiments in plants and in screening of human populations. We previously suggested an approach that combines the mathematical field of compressed sensing with next-generation sequencing to allow such large-scale screening. Based on pooled measurements, this method identifies multiple carriers of heterozygous or homozygous rare alleles while using only a small fraction of resources. Its rigorous mathematical foundations allow scalable and robust detection, and provide error correction and resilience to experimental noise. Here we present a large-scale experimental demonstration of our computational approach, in which we targeted a TILLING population of 1024 Sorghum bicolor lines to detect carriers of de novo SNPs whose frequency was less than 0.1%, using only 48 pools. Subsequent validation confirmed that all detected lines were indeed carriers of the predicted mutations. This novel approach provides a highly cost-effective and robust tool for biologists and breeders to allow identification of novel alleles and subsequent functional analysis. PMID:26959378

  4. The First Illumina-Based De Novo Transcriptome Sequencing and Analysis of Safflower Flowers

    PubMed Central

    Lulin, Huang; Xiao, Yang; Pei, Sun; Wen, Tong; Shangqin, Hu

    2012-01-01

    Background The safflower, Carthamus tinctorius L., is a worldwide oil crop, and its flowers, which have a high flavonoid content, are an important medicinal resource against cardiovascular disease in traditional medicine. Because the safflower has a large and complex genome, the development of its genomic resources has been delayed. Second-generation Illumina sequencing is now an efficient route for generating an enormous volume of sequences that can represent a large number of genes and their expression levels. Methodology/Principal Findings To investigate the genes and pathways that might control flavonoids and other secondary metabolites in the safflower, we used Illumina sequencing to perform a de novo assembly of the safflower tubular flower tissue transcriptome. We obtained a total of 4.69 Gb in clean nucleotides comprising 52,119,104 clean sequencing reads, 195,320 contigs, and 120,778 unigenes. Based on similarity searches with known proteins, we annotated 70,342 of the unigenes (about 58% of the identified unigenes) with cut-off E-values of 10^-5. In total, 21,943 of the safflower unigenes were found to have COG classifications, and BLAST2GO assigned 26,332 of the unigenes to 1,754 GO term annotations. In addition, we assigned 30,203 of the unigenes to 121 KEGG pathways. When we focused on genes identified as contributing to flavonoid biosynthesis and the biosynthesis of unsaturated fatty acids, which are important pathways that control flower and seed quality, respectively, we found that these genes were fairly well conserved in the safflower genome compared to those of other plants. Conclusions/Significance Our study provides abundant genomic data for Carthamus tinctorius L. and offers comprehensive sequence resources for studying the safflower. We believe that these transcriptome datasets will serve as an important public information platform to accelerate studies of the safflower genome, and may help us define the mechanisms of flower tissue

  5. De novo sequencing and characterization of Picrorhiza kurrooa transcriptome at two temperatures showed major transcriptome adjustments

    PubMed Central

    2012-01-01

    Background Picrorhiza kurrooa Royle ex Benth. is an endangered plant species of medicinal importance. The medicinal property is attributed to monoterpenoids picroside I and II, which are modulated by temperature. The transcriptome information of this species is limited with the availability of few hundreds of expressed sequence tags (ESTs) in the public databases. In order to gain insight into temperature mediated molecular changes, high throughput de novo transcriptome sequencing and analyses were carried out at 15°C and 25°C, the temperatures known to modulate picrosides content. Results Using paired-end (PE) Illumina sequencing technology, a total of 20,593,412 and 44,229,272 PE reads were obtained after quality filtering for 15°C and 25°C, respectively. Available (e.g., De-Bruijn/Eulerian graph) and in-house developed bioinformatics tools were used for assembly and annotation of transcriptome. A total of 74,336 assembled transcript sequences were obtained, with an average coverage of 76.6 and average length of 439.5. Guanine-cytosine (GC) content was observed to be 44.6%, while the transcriptome exhibited abundance of trinucleotide simple sequence repeat (SSR; 45.63%) markers. Large scale expression profiling through "read per exon kilobase per million (RPKM)", showed changes in several biological processes and metabolic pathways including cytochrome P450s (CYPs), UDP-glycosyltransferases (UGTs) and those associated with picrosides biosynthesis. RPKM data were validated by reverse transcriptase-polymerase chain reaction using a set of 19 genes, wherein 11 genes behaved in accordance with the two expression methods. Conclusions Study generated transcriptome of P. kurrooa at two different temperatures. Large scale expression profiling through RPKM showed major transcriptome changes in response to temperature reflecting alterations in major biological processes and metabolic pathways, and provided insight of GC content and SSR markers. Analysis also identified
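    The record above profiles expression with RPKM. As a minimal sketch of that normalization (reads per kilobase of exon per million mapped reads), assuming the standard definition rather than the authors' in-house tools; the function name and the example counts are hypothetical:

```python
def rpkm(read_count, exon_length_bp, total_mapped_reads):
    """Reads per kilobase of exon per million mapped reads (standard definition)."""
    if exon_length_bp <= 0 or total_mapped_reads <= 0:
        raise ValueError("exon length and total mapped reads must be positive")
    # reads / (length in kb) / (library size in millions)
    #   = reads * 1e9 / (length in bp * total mapped reads)
    return read_count * 1e9 / (exon_length_bp * total_mapped_reads)


# Hypothetical transcript: 500 reads on 2,000 bp of exon in a 20-million-read library.
example_value = rpkm(500, 2_000, 20_000_000)
print(example_value)
```

    Dividing by both transcript length and library size is what makes values comparable between two libraries of different depth, such as the 15°C and 25°C samples.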

  6. An integer programming approach to DNA sequence assembly.

    PubMed

    Chang, Youngjung; Sahinidis, Nikolaos V

    2011-08-10

    De novo sequence assembly is a ubiquitous combinatorial problem in all DNA sequencing technologies. In the presence of errors in the experimental data, the assembly problem is computationally challenging, and its solution may not lead to a unique reconstruct. The enumeration of all alternative solutions is important in drawing a reliable conclusion on the target sequence, and is often overlooked in the heuristic approaches that are currently available. In this paper, we develop an integer programming formulation and global optimization solution strategy to solve the sequence assembly problem with errors in the data. We also propose an efficient technique to identify all alternative reconstructs. When applied to examples of sequencing-by-hybridization, our approach dramatically increases the length of DNA sequences that can be handled with global optimality certificate to over 10,000, which is more than 10 times longer than previously reported. For some problem instances, alternative solutions exhibited a wide range of different ability in reproducing the target DNA sequence. Therefore, it is important to utilize the methodology proposed in this paper in order to obtain all alternative solutions to reliably infer the true reconstruct. These alternative solutions can be used to refine the obtained results and guide the design of further experiments to correctly reconstruct the target DNA sequence. PMID:21864794

  7. De novo transcriptome sequencing and discovery of genes related to copper tolerance in Paeonia ostii.

    PubMed

    Wang, Yanjie; Dong, Chunlan; Xue, Zeyun; Jin, Qijiang; Xu, Yingchun

    2016-01-15

    Paeonia ostii, an important ornamental and medicinal plant, grows normally on copper (Cu) mines with widespread Cu contamination of soils, and it has the ability to lower Cu contents in the Cu-contaminated soils. However, very little molecular information concerned with Cu resistance of P. ostii is available. In this study, high-throughput de novo transcriptome sequencing was carried out for P. ostii with and without Cu treatment using Illumina HiSeq 2000 platform. A total of 77,704 All-unigenes were obtained with a mean length of 710 bp. Of these unigenes, 47,461 were annotated with public databases based on sequence similarities. Comparative transcript profiling allowed the discovery of 4324 differentially expressed genes (DEGs), with 2207 up-regulated and 2117 down-regulated unigenes in Cu-treated library as compared to the control counterpart. Based on these DEGs, Gene Ontology (GO) enrichment analysis indicated Cu stress-relevant terms, such as 'membrane' and 'antioxidant activity'. Meanwhile, Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway enrichment analysis uncovered some important pathways, including 'biosynthesis of secondary metabolites' and 'metabolic pathways'. In addition, expression patterns of 12 selected DEGs derived from quantitative real-time polymerase chain reaction (qRT-PCR) were consistent with their transcript abundance changes obtained by transcriptomic analyses, suggesting that all the 12 genes were authentically involved in Cu tolerance in P. ostii. This is the first report to identify genes related to Cu stress responses in P. ostii, which could offer valuable information on the molecular mechanisms of Cu resistance, and provide a basis for further genomics research on this and related ornamental species for phytoremediation. PMID:26435192

  8. De Novo Transcriptome Assembly of the Chinese Swamp Buffalo by RNA Sequencing and SSR Marker Discovery

    PubMed Central

    Lu, Xingrong; Zhu, Peng; Duan, Anqin; Tan, Zhengzhun; Huang, Jian; Li, Hui; Chen, Mingtan; Liang, Xianwei

    2016-01-01

    The Chinese swamp buffalo (Bubalus bubalis) is vital to the lives of small farmers and has tremendous economic importance. However, a lack of genomic information has hampered research on augmenting marker assisted breeding programs in this species. Thus, a high-throughput transcriptomic sequencing of B. bubalis was conducted to generate a transcriptomic sequence dataset for gene discovery and molecular marker development. Illumina paired-end sequencing generated a total of 54,109,173 raw reads. After trimming, de novo assembly was performed, which yielded 86,017 unigenes, with an average length of 972.41 bp, an N50 of 1,505 bp, and an average GC content of 49.92%. A total of 62,337 unigenes were successfully annotated. Among the annotated unigenes, 27,025 (43.35%) and 23,232 (37.27%) unigenes showed significant similarity to known proteins in NCBI non-redundant protein and Swiss-Prot databases (E-value < 1.0E-5), respectively. Of these annotated unigenes, 14,439 and 15,813 unigenes were assigned to the Gene Ontology (GO) categories and EuKaryotic Ortholog Group (KOG) cluster, respectively. In addition, a total of 14,167 unigenes were assigned to 331 Kyoto Encyclopedia of Genes and Genomes (KEGG) pathways. Furthermore, 17,401 simple sequence repeats (SSRs) were identified as potential molecular markers. One hundred and fifteen primer pairs were randomly selected for amplification to detect polymorphisms. The results revealed that 110 primer pairs (95.65%) yielded PCR amplicons and 69 primer pairs (60.00%) presented polymorphisms in 35 individual buffaloes. A phylogenetic analysis showed that the five swamp buffalo populations were clustered together, whereas two river buffalo breeds clustered separately. In the present study, the Illumina RNA-seq technology was utilized to perform transcriptome analysis and SSR marker discovery in the swamp buffalo without using a reference genome. Our findings will enrich the current SSR marker resources and help spearhead molecular
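    The record above reports an N50 alongside the average unigene length. As a rough illustration of how that statistic is usually computed (the length L such that sequences of length at least L contain half of the assembled bases), here is a short sketch; the function name and the toy lengths are hypothetical and are not taken from the study:

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths."""
    if not lengths:
        raise ValueError("need at least one sequence length")
    total = sum(lengths)
    running = 0
    # Walk from the longest sequence down until half of all bases are covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length


# Toy assembly: 100 bases in total, half is reached inside the 30 bp sequence.
toy_lengths = [10, 20, 30, 40]
print(n50(toy_lengths))
```

    Because long sequences dominate the cumulative sum, N50 is typically larger than the mean length, which matches the pattern reported above (N50 of 1,505 bp against a mean of 972.41 bp).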

  9. Genomic resources for water yam (Dioscorea alata L.): analyses of EST-Sequences, De Novo sequencing and GBS libraries

    Technology Transfer Automated Retrieval System (TEKTRAN)

    The reducing cost and rapid progress in next-generation sequencing techniques coupled with high performance computational approaches have resulted in large-scale discovery of advanced genomic resources such as SSRs, SNPs and InDels in several model and non-model plant species. Yam (Dioscorea spp.) i...

  10. MetaVelvet: An Extension of Velvet Assembler to de novo Metagenome Assembly from Short Sequence Reads (Metagenomics Informatics Challenges Workshop: 10K Genomes at a Time)

    ScienceCinema

    Sakakibara, Yasumbumi [Keio University]

    2013-01-22

    Keio University's Yasumbumi Sakakibara on "MetaVelvet: An Extension of Velvet Assembler to de novo Metagenome Assembly from Short Sequence Reads" at the Metagenomics Informatics Challenges Workshop held at the DOE JGI on October 12-13, 2011.

  11. MetaVelvet: An Extension of Velvet Assembler to de novo Metagenome Assembly from Short Sequence Reads (Metagenomics Informatics Challenges Workshop: 10K Genomes at a Time)

    SciTech Connect

    Sakakibara, Yasumbumi

    2011-10-13

    Keio University's Yasumbumi Sakakibara on "MetaVelvet: An Extension of Velvet Assembler to de novo Metagenome Assembly from Short Sequence Reads" at the Metagenomics Informatics Challenges Workshop held at the DOE JGI on October 12-13, 2011.

  12. Hybrid error correction and de novo assembly of single-molecule sequencing reads

    PubMed Central

    Koren, Sergey; Schatz, Michael C.; Walenz, Brian P.; Martin, Jeffrey; Howard, Jason; Ganapathy, Ganeshkumar; Wang, Zhong; Rasko, David A.; McCombie, W. Richard; Jarvis, Erich D.; Phillippy, Adam M.

    2012-01-01

    Emerging single-molecule sequencing instruments can generate multi-kilobase sequences with the potential to dramatically improve genome and transcriptome assembly. However, the high error rate of single-molecule reads is challenging, and has limited their use to resequencing bacteria. To address this limitation, we introduce a novel correction algorithm and assembly strategy that utilizes shorter, high-identity sequences to correct the error in single-molecule sequences. We demonstrate the utility of this approach on PacBio RS reads of phage, prokaryotic, and eukaryotic whole genomes, including the novel genome of the parrot Melopsittacus undulatus, as well as for RNA-seq reads of the corn (Zea mays) transcriptome. Our approach achieves over 99.9% read correction accuracy and produces substantially better assemblies than current sequencing strategies: in the best example, quintupling the median contig size relative to high-coverage, second-generation assemblies. Greater gains are predicted if read lengths continue to increase, including the prospect of single-contig bacterial chromosome assembly. PMID:22750884

  13. De novo assembly and characterization of Camelina sativa transcriptome by paired-end sequencing

    PubMed Central

    2013-01-01

    Background Biofuels extracted from the seeds of Camelina sativa have recently been used successfully as environmentally friendly jet-fuel to reduce greenhouse gas emissions. Camelina sativa is genetically very close to Arabidopsis thaliana, and both are members of the Brassicaceae. Although public databases are currently available for some members of the Brassicaceae, such as A. thaliana, A. lyrata, Brassica napus, B. juncea and B. rapa, there are no public Expressed Sequence Tags (EST) or genomic data for Camelina sativa. In this study, a high-throughput, large-scale RNA sequencing (RNA-seq) of the Camelina sativa transcriptome was carried out to generate a database that will be useful for further functional analyses. Results Approximately 27 million clean “reads” filtered from raw reads by removal of adaptors, ambiguous reads and low-quality reads (2.42 gigabase pairs) were generated by Illumina paired-end RNA-seq technology. All of these clean reads were assembled de novo into 83,493 unigenes and 103,196 transcripts using SOAPdenovo and Trinity, respectively. The average length of the transcripts generated by Trinity was 697 bp (N50 = 976), which was longer than the average length of unigenes (319 bp, N50 = 346 bp). Nonetheless, the assembly generated by SOAPdenovo produced a similar number of non-redundant hits (22,435) to that of Trinity (22,433) in BLASTN searches of the Arabidopsis thaliana CDS sequence database (TAIR). Four public databases, the Kyoto Encyclopedia of Genes and Genomes (KEGG), Swiss-prot, NCBI non-redundant protein (NR), and the Cluster of Orthologous Groups (COG), were used for unigene annotation; 67,791 of 83,493 unigenes (81.2%) were finally annotated with gene descriptions or conserved protein domains that were mapped to 25,329 non-redundant protein sequences. We mapped 27,042 of 83,493 unigenes (32.4%) to 119 KEGG metabolic pathways. Conclusions This is the first report of a transcriptome database for Camelina sativa, an

  14. Rapid Microsatellite Isolation from a Butterfly by De Novo Transcriptome Sequencing: Performance and a Comparison with AFLP-Derived Distances

    PubMed Central

    Mikheyev, Alexander S.; Vo, Tanya; Wee, Brian; Singer, Michael C.; Parmesan, Camille

    2010-01-01

    Background The isolation of microsatellite markers remains laborious and expensive. For some taxa, such as Lepidoptera, development of microsatellite markers has been particularly difficult, as many markers appear to be located in repetitive DNA and have nearly identical flanking regions. We attempted to circumvent this problem by bioinformatic mining of microsatellite sequences from a de novo-sequenced transcriptome of a butterfly (Euphydryas editha). Principal Findings By searching the assembled sequence data for perfect microsatellite repeats we found 10 polymorphic loci. Although, like many expressed sequence tag-derived microsatellites, our markers show strong deviations from Hardy-Weinberg equilibrium in many populations, and, in some cases, a high incidence of null alleles, we show that they nonetheless provide measures of population differentiation consistent with those obtained by amplified fragment length polymorphism analysis. Estimates of pairwise population differentiation between 23 populations were concordant between microsatellite-derived data and AFLP analysis of the same samples (r = 0.71, p<0.00001, 425 individuals from 23 populations). Significance De novo transcriptional sequencing appears to be a rapid and cost-effective tool for developing microsatellite markers for difficult genomes. PMID:20585453
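    The record above mines assembled sequences for perfect microsatellite repeats. The abstract does not give the search criteria, so the following is only a sketch of one common way to do it, using a regular expression with a back-reference; the function name, the unit sizes (2 to 6 bp), and the minimum of 5 repeats are assumptions chosen for illustration:

```python
import re


def find_perfect_ssrs(seq, min_unit=2, max_unit=6, min_repeats=5):
    """Return (start, unit, repeat_count) for perfect tandem repeats in seq.

    Thresholds are illustrative assumptions, not values from the study.
    """
    # A lazily matched unit followed by at least (min_repeats - 1) exact copies.
    pattern = re.compile(
        r"([ACGT]{%d,%d}?)\1{%d,}" % (min_unit, max_unit, min_repeats - 1)
    )
    hits = []
    for match in pattern.finditer(seq.upper()):
        unit = match.group(1)
        if len(set(unit)) == 1:
            continue  # skip homopolymer runs such as AAAAAAAAAA
        hits.append((match.start(), unit, len(match.group(0)) // len(unit)))
    return hits


# Hypothetical contig containing an (AC)6 and a (GAT)5 repeat.
contig = "GGCT" + "AC" * 6 + "TTGCA" + "GAT" * 5 + "CC"
print(find_perfect_ssrs(contig))
```

    The lazy quantifier makes the regex try the shortest repeat unit first, so a run of AC is reported with unit AC rather than ACAC.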

  15. De novo sequencing and analysis of root transcriptome using 454 pyrosequencing to discover putative genes associated with drought tolerance in Ammopiptanthus mongolicus

    PubMed Central

    2012-01-01

    Background De novo assembly of transcript sequences produced by next-generation sequencing technologies offers a rapid approach to obtain expressed gene sequences for non-model organisms. Ammopiptanthus mongolicus, a super-xerophytic broadleaf evergreen wood, is an ecologically important foundation species in desert ecosystems and exhibits substantial drought tolerance in Mid-Asia desert. Root plays an important role in water absorption of plant. There are insufficient transcriptomic and genomic data in public databases for understanding of the molecular mechanism underlying the drought tolerance of A. mongolicus. Thus, high throughput transcriptome sequencing from A. mongolicus root is helpful to generate a large amount of transcript sequences for gene discovery and molecular marker development. Results A total of 672,002 sequencing reads were obtained from a 454 GS XLR70 Titanium pyrosequencer with a mean length of 279 bp. These reads were assembled into 29,056 unique sequences including 15,173 contigs and 13,883 singlets. In our assembled sequences, 1,827 potential simple sequence repeats (SSR) molecular markers were discovered. Based on sequence similarity with known plant proteins, the assembled sequences represent approximately 9,771 proteins in PlantGDB. Based on the Gene ontology (GO) analysis, hundreds of drought stress-related genes were found. We further analyzed the gene expression profiles of 27 putative genes involved in drought tolerance using quantitative real-time PCR (qRT-PCR) assay. Conclusions Our sequence collection represents a major transcriptomic resource for A. mongolicus, and the large number of genetic markers predicted should contribute to future research in Ammopiptanthus genus. The potential drought stress related transcripts identified in this study provide a good start for further investigation into the drought adaptation in Ammopiptanthus. PMID:22721448

  16. Cost-Effective Sequencing of Full-Length cDNA Clones Powered by a De Novo-Reference Hybrid Assembly

    PubMed Central

    Sugano, Sumio; Morishita, Shinichi; Suzuki, Yutaka

    2010-01-01

    Background Sequencing full-length cDNA clones is important to determine gene structures including alternative splice forms, and provides valuable resources for experimental analyses to reveal the biological functions of coded proteins. However, previous approaches for sequencing cDNA clones were expensive or time-consuming, and therefore, a fast and efficient sequencing approach was demanded. Methodology We developed a program, MuSICA 2, that assembles millions of short (36-nucleotide) reads collected from a single flow cell lane of Illumina Genome Analyzer to shotgun-sequence ∼800 human full-length cDNA clones. MuSICA 2 performs a hybrid assembly in which an external de novo assembler is run first and the result is then improved by reference alignment of shotgun reads. We compared the MuSICA 2 assembly with 200 pooled full-length cDNA clones finished independently by the conventional primer-walking using Sanger sequencers. The exon-intron structure of the coding sequence was correct for more than 95% of the clones with coding sequence annotation when we excluded cDNA clones insufficiently represented in the shotgun library due to PCR failure (42 out of 200 clones excluded), and the nucleotide-level accuracy of coding sequences of those correct clones was over 99.99%. We also applied MuSICA 2 to full-length cDNA clones from Toxoplasma gondii, to confirm that its ability was competent even for non-human species. Conclusions The entire sequencing and shotgun assembly takes less than 1 week and the consumables cost only ∼US$3 per clone, demonstrating a significant advantage over previous approaches. PMID:20479877

  17. Comparative analysis of two phenologically divergent populations of the pine processionary moth (Thaumetopoea pityocampa) by de novo transcriptome sequencing.

    PubMed

    Gschloessl, Bernhard; Vogel, Heiko; Burban, Christian; Heckel, David; Streiff, Réjane; Kerdelhué, Carole

    2014-03-01

    The pine processionary moth Thaumetopoea pityocampa is a Mediterranean lepidopteran defoliator that experiences a rapid range expansion towards higher latitudes and altitudes due to the current climate warming. Its phenology - the time of sexual reproduction - is certainly a key trait for the local adaptation of the processionary moth to climatic conditions. Moreover, an exceptional case of allochronic differentiation was discovered ca. 15 years ago in this species. A population with a shifted phenology (the summer population, SP) co-exists near Leiria, Portugal, with a population following the classical cycle (the winter population, WP). The existence of this population is an outstanding opportunity to decipher the genetic bases of phenology. No genomic resources were so far available for T. pityocampa. We developed a high-throughput sequencing approach to build a first reference transcriptome, and to proceed with comparative analyses of the sympatric SP and WP. We pooled RNA extracted from whole individuals of various developmental stages, and performed a transcriptome characterisation for both populations combining Roche 454-FLX and traditional Sanger data. The obtained sequences were clustered into ca. 12,000 transcripts corresponding to 9265 unigenes. The mean transcript coverage was 21.9 reads per bp. Almost 70% of the de novo assembled transcripts displayed significant similarity to previously published proteins and around 50% of the transcripts contained a full-length coding region. Comparative analyses of the population transcriptomes allowed to investigate genes specifically expressed in one of the studied populations only, and to identify the most divergent homologous SP/WP transcripts. The most divergent pairs of transcripts did not correspond to obvious phenology-related candidate genes, and 43% could not be functionally annotated. This study provides the first comprehensive genome-wide resource for the target species T. pityocampa. Many of the

  18. Identification of a De Novo Heterozygous Missense FLNB Mutation in Lethal Atelosteogenesis Type I by Exome Sequencing

    PubMed Central

    Jeon, Ga Won; Lee, Mi-Na; Jung, Ji Mi; Hong, Seong Yeon; Kim, Young Nam; Sin, Jong Beom

    2014-01-01

    Background Atelosteogenesis type I (AO-I) is a rare lethal skeletal dysplastic disorder characterized by severe short-limbed dwarfism and dislocated hips, knees, and elbows. AO-I is caused by mutations in the filamin B (FLNB) gene; however, several other genes can cause AO-like lethal skeletal dysplasias. Methods In order to screen all possible genes associated with AO-like lethal skeletal dysplasias simultaneously, we performed whole-exome sequencing in a female newborn having clinical features of AO-I. Results Exome sequencing identified a novel missense variant (c.517G>A; p.Ala173Thr) in exon 2 of the FLNB gene in the patient. Sanger sequencing validated this variant, and genetic analysis of the patient's parents suggested a de novo occurrence of the variant. Conclusions This study shows that exome sequencing can be a useful tool for the identification of causative mutations in lethal skeletal dysplasia patients. PMID:24624349
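    The variant reported above, c.517G>A with p.Ala173Thr, can be sanity-checked with simple codon arithmetic. The sketch below assumes 1-based coding-sequence numbering and the standard genetic code; the helper names are hypothetical, and the actual FLNB codon is not stated in the abstract, so all four alanine codons are tried:

```python
ALA_CODONS = ("GCT", "GCC", "GCA", "GCG")  # standard genetic code: Ala = GCN
THR_CODONS = {"ACT", "ACC", "ACA", "ACG"}  # standard genetic code: Thr = ACN


def codon_position(cds_pos):
    """Map a 1-based coding position to (codon number, 1-based position in codon)."""
    if cds_pos < 1:
        raise ValueError("coding positions are 1-based")
    return (cds_pos - 1) // 3 + 1, (cds_pos - 1) % 3 + 1


def substitute(codon, pos_in_codon, ref, alt):
    """Apply a single-base substitution to a codon, checking the reference base."""
    if codon[pos_in_codon - 1] != ref:
        raise ValueError("reference base does not match codon")
    return codon[: pos_in_codon - 1] + alt + codon[pos_in_codon:]


codon_number, pos_in_codon = codon_position(517)
mutated = [substitute(c, pos_in_codon, "G", "A") for c in ALA_CODONS]
print(codon_number, pos_in_codon, mutated)
```

    Position 517 falls on the first base of codon 173, and a G>A change at the first base of any alanine codon yields a threonine codon, which is consistent with the reported protein-level change.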

  19. Rapid genome mapping in nano channel array for highly complete and accurate de novo sequence assembly of the complex Aegilops tauschii genome

    Technology Transfer Automated Retrieval System (TEKTRAN)

    Next-generation sequencing (NGS) technologies have enabled high-throughput and low-cost generation of sequence data; however, de novo genome assembly remains a great challenge, particularly for large genomes. NGS short reads are often insufficient to create large contigs that span repeat sequences...

  20. Comparison of Illumina de novo assembled and Sanger sequenced viral genomes: A case study for RNA viruses recovered from the plant pathogenic fungus Sclerotinia sclerotiorum.

    PubMed

    Khalifa, Mahmoud E; Varsani, Arvind; Ganley, Austen R D; Pearson, Michael N

    2016-07-01

    The advent of 'next generation sequencing' (NGS) technologies has led to the discovery of many novel mycoviruses, the majority of which are sufficiently different from previously sequenced viruses that there is no appropriate reference sequence on which to base the sequence assembly. Although many new genome sequences are generated by NGS, confirmation of the sequence by Sanger sequencing is still essential for formal classification by the International Committee for the Taxonomy of Viruses (ICTV), although this is currently under review. To empirically test the validity of de novo assembled mycovirus genomes from dsRNA extracts, we compared the results from Illumina sequencing with those from random cloning plus targeted PCR coupled with Sanger sequencing for viruses from five Sclerotinia sclerotiorum isolates. Through Sanger sequencing we detected nine viral genomes while through Illumina sequencing we detected the same nine viruses plus one additional virus from the same samples. Critically, the Illumina derived sequences share >99.3% identity to those obtained by cloning and Sanger sequencing. Although there is scope for errors in de novo assembled viral genomes, our results demonstrate that by maximising the proportion of viral sequence in the data and using sufficiently rigorous quality controls, it is possible to generate de novo genome sequences of comparable accuracy from Illumina sequencing to those obtained by Sanger sequencing. PMID:26581665

  1. Sequencing crop genomes: approaches and applications

    Technology Transfer Automated Retrieval System (TEKTRAN)

    Plant genome sequencing methodology parallels the sequencing of the human genome. The first projects were slow and very expensive. BAC by BAC approaches were utilized first and whole-genome shotgun sequencing rapidly replaced that approach. So-called 'next generation' technologies such as short rea...

  2. De Novo Designed Proteins from a Library of Artificial Sequences Function in Escherichia Coli and Enable Cell Growth

    PubMed Central

    Fisher, Michael A.; McKinley, Kara L.; Bradley, Luke H.; Viola, Sara R.; Hecht, Michael H.

    2011-01-01

    A central challenge of synthetic biology is to enable the growth of living systems using parts that are not derived from nature, but designed and synthesized in the laboratory. As an initial step toward achieving this goal, we probed the ability of a collection of >10^6 de novo designed proteins to provide biological functions necessary to sustain cell growth. Our collection of proteins was drawn from a combinatorial library of 102-residue sequences, designed by binary patterning of polar and nonpolar residues to fold into stable 4-helix bundles. We probed the capacity of proteins from this library to function in vivo by testing their abilities to rescue 27 different knockout strains of Escherichia coli, each deleted for a conditionally essential gene. Four different strains – ΔserB, ΔgltA, ΔilvA, and Δfes – were rescued by specific sequences from our library. Further experiments demonstrated that a strain simultaneously deleted for all four genes was rescued by co-expression of four novel sequences. Thus, cells deleted for ∼0.1% of the E. coli genome (and ∼1% of the genes required for growth under nutrient-poor conditions) can be sustained by sequences designed de novo. PMID:21245923

  3. Somatic mutations and germline sequence variants in the expressed tyrosine kinase genes of patients with de novo acute myeloid leukemia

    PubMed Central

    Xiang, Zhifu; Walgren, Richard; Zhao, Yu; Kasai, Yumi; Miner, Tracie; Ries, Rhonda E.; Lubman, Olga; Fremont, Daved H.; McLellan, Michael D.; Payton, Jacqueline E.; Westervelt, Peter; DiPersio, John F.; Link, Daniel C.; Walter, Matthew J.; Graubert, Timothy A.; Watson, Mark; Baty, Jack; Heath, Sharon; Shannon, William D.; Nagarajan, Rakesh; Bloomfield, Clara D.; Mardis, Elaine R.; Wilson, Richard K.; Ley, Timothy J.

    2008-01-01

    Activating mutations in tyrosine kinase (TK) genes (eg, FLT3 and KIT) are found in more than 30% of patients with de novo acute myeloid leukemia (AML); many groups have speculated that mutations in other TK genes may be present in the remaining 70%. We performed high-throughput resequencing of the kinase domains of 26 TK genes (11 receptor TK; 15 cytoplasmic TK) expressed in most AML patients using genomic DNA from the bone marrow (tumor) and matched skin biopsy samples (“germline”) from 94 patients with de novo AML; sequence variants were validated in an additional 94 AML tumor samples (14.3 million base pairs of sequence were obtained and analyzed). We identified known somatic mutations in FLT3, KIT, and JAK2 TK genes at the expected frequencies and found 4 novel somatic mutations, JAK1 V623A, JAK1 T478S, DDR1 A803V, and NTRK1 S677N, once each in 4 respective patients of 188 tested. We also identified novel germline sequence changes encoding amino acid substitutions (ie, nonsynonymous changes) in 14 TK genes, including TYK2, which had the largest number of nonsynonymous sequence variants (11 total detected). Additional studies will be required to define the roles that these somatic and germline TK gene variants play in AML pathogenesis. PMID:18270328

  4. PERGA: A Paired-End Read Guided De Novo Assembler for Extending Contigs Using SVM and Look Ahead Approach

    PubMed Central

    Chin, Francis Y. L.; Yiu, Siu Ming; Quan, Guangri; Liu, Bo; Wang, Yadong

    2014-01-01

    Since the read lengths of high throughput sequencing (HTS) technologies are short, de novo assembly, which plays a significant role in many applications, remains a great challenge. Most state-of-the-art approaches are based on the de Bruijn graph strategy or the overlap-layout strategy. However, these approaches, which depend on k-mers or read overlaps, do not fully utilize the information in paired-end and single-end reads when resolving branches. Because they treat all single-end reads whose overlap length exceeds a fixed threshold equally, they fail to favor the more confident long-overlap reads during assembly and mix them with the relatively short-overlap reads. Moreover, these approaches have not been specially designed to handle tandem repeats (repeats that occur adjacently in the genome), and they usually break contigs near tandem repeats. We present PERGA (Paired-End Reads Guided Assembler), a novel sequence-reads-guided de novo assembly approach, which adopts a greedy-like prediction strategy for assembling reads into contigs and scaffolds, using paired-end reads and read overlap sizes ranging from Omax to Omin to resolve gaps and branches. By constructing a decision model using a machine learning approach based on branch features, PERGA can determine the correct extension in 99.7% of cases. When the correct extension cannot be determined, PERGA tries to extend the contig by all feasible extensions and determines the correct one using a look-ahead approach. Many hard-to-resolve branches are due to tandem repeats that lie close together in the genome. PERGA detects the different copies of such repeats to resolve the branches, making the extension much longer and more accurate. We evaluated PERGA on both real and simulated Illumina datasets ranging from small bacterial genomes to a large human chromosome, and it constructed longer and more accurate contigs and scaffolds than other state-of-the-art assemblers. PERGA can be freely downloaded at https
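
    The idea of preferring long, more confident overlaps before falling back to shorter ones can be illustrated with a small sketch. This is not PERGA's implementation: the function name `extend_contig`, its parameters, and the one-base-at-a-time extension are assumptions made for illustration, and the sketch simply stops at a branch, where the abstract describes paired-end information, a learned decision model, and look-ahead instead.

```python
def extend_contig(contig, reads, o_max, o_min, max_len=10_000):
    """Greedily extend `contig` to the right, one base at a time.

    Hypothetical helper for illustration only. Overlap sizes are tried
    from `o_max` down to `o_min`, so longer (more confident) overlaps
    are consulted before shorter ones.
    """
    while len(contig) < max_len:
        next_base = None
        for overlap in range(o_max, o_min - 1, -1):
            if overlap > len(contig):
                continue
            suffix = contig[-overlap:]
            # Reads whose first `overlap` bases match the contig end
            # vote for the base that follows the overlap.
            bases = {
                read[overlap]
                for read in reads
                if len(read) > overlap and read.startswith(suffix)
            }
            if len(bases) == 1:
                next_base = bases.pop()
                break
            if len(bases) > 1:
                # Branch: candidates disagree. This sketch gives up here.
                return contig
        if next_base is None:
            return contig  # no read supports any further extension
        contig += next_base
    return contig


# Toy data: 8-base reads sampled every 2 bases from a short sequence.
genome = "ATGCGTACCTGAAGTC"
reads = [genome[i:i + 8] for i in range(0, 9, 2)]
result = extend_contig(genome[:8], reads, o_max=7, o_min=4)
```

    On this toy input the seed read is extended back to the full sequence. The `max_len` guard is only there to keep the loop finite on repetitive input.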

  5. Exome sequencing identified a novel de novo OPA1 mutation in a consanguineous family presenting with optic atrophy.

    PubMed

    Cohen, Lior; Tzur, Shay; Goldenberg-Cohen, Nitza; Bormans, Concetta; Behar, Doron M; Reinstein, Eyal

    2016-01-01

    Inherited optic neuropathies are a heterogeneous group of disorders characterized by mild to severe visual loss, colour vision deficit, central or paracentral visual field defects and optic disc pallor. Optic atrophies can be classified into isolated (non-syndromic) and syndromic forms. While multiple modes of inheritance have been reported, autosomal dominant optic atrophy and mitochondrially inherited Leber's hereditary optic neuropathy are the most common forms. Optic atrophy type 1, caused by mutations in the OPA1 gene, is believed to be the most common hereditary optic neuropathy, and most patients inherit a mutation from an affected parent. In this study we used whole-exome sequencing to investigate the genetic aetiology in a patient affected with isolated optic atrophy. Since the proband was the only affected individual in his extended family, and was a product of consanguineous marriage, homozygosity mapping followed by whole-exome sequencing was pursued. Exome results identified a novel de novo OPA1 mutation in the proband. We conclude that, although de novo OPA1 mutations are uncommon, testing of common optic atrophy-associated genes, such as screening for mitochondrial mutations and OPA1 gene sequencing, should be performed first in single individuals presenting with optic neuropathy, even when dominant inheritance is not apparent. PMID:27265430

  6. De novo Sequencing of the Leaf Transcriptome Reveals Complex Light-Responsive Regulatory Networks in Camellia sinensis cv. Baijiguan.

    PubMed

    Wu, Quanjin; Chen, Zhidan; Sun, Weijiang; Deng, Tingting; Chen, Mingjie

    2016-01-01

    Tea plants (Camellia sinensis L.) possess high genetic diversity that is important for breeding. One cultivar, Baijiguan, exhibits a yellow leaf phenotype, reduced chlorophyll (Chl) content, and aberrant chloroplast structures under high light intensity. In contrast, under low light intensity, the flush shoot from Baijiguan becomes green, the Chl content increases significantly, and the chloroplasts exhibit normal structures. To understand the underlying molecular mechanisms for these observations, we performed de novo transcriptome sequencing and digital gene expression (DGE) profiling using Illumina sequencing technology. De novo transcriptome assembly identified 88,788 unigenes, including 1652 transcription factors from 25 families. In total, 1993 and 2576 differentially expressed genes (DEGs) were identified in Baijiguan plants treated with 3 and 6 days of shade, respectively. Gene Ontology (GO) and pathway enrichment analyses indicated that the DEGs are predominantly involved in the ROS scavenging system, chloroplast development, photosynthetic pigment synthesis, secondary metabolism, and circadian systems. The light-responsive gene POR (protochlorophyllide oxidoreductase) and transcription factor HY5 were identified. Quantitative real-time PCR (qRT-PCR) analysis of 20 selected DEGs confirmed the RNA-sequencing (RNA-Seq) results. Overall, these findings suggest that high light intensity inhibits the expression of photosystem II 10-kDa protein (PsbR) in Baijiguan, thus affecting PSII stability, chloroplast development and chlorophyll biosynthesis. PMID:27047513

  7. De novo Sequencing of the Leaf Transcriptome Reveals Complex Light-Responsive Regulatory Networks in Camellia sinensis cv. Baijiguan

    PubMed Central

    Wu, Quanjin; Chen, Zhidan; Sun, Weijiang; Deng, Tingting; Chen, Mingjie

    2016-01-01

    Tea plants (Camellia sinensis L.) possess high genetic diversity that is important for breeding. One cultivar, Baijiguan, exhibits a yellow leaf phenotype, reduced chlorophyll (Chl) content, and aberrant chloroplast structures under high light intensity. In contrast, under low light intensity, the flush shoot from Baijiguan becomes green, the Chl content increases significantly, and the chloroplasts exhibit normal structures. To understand the underlying molecular mechanisms for these observations, we performed de novo transcriptome sequencing and digital gene expression (DGE) profiling using Illumina sequencing technology. De novo transcriptome assembly identified 88,788 unigenes, including 1652 transcription factors from 25 families. In total, 1993 and 2576 differentially expressed genes (DEGs) were identified in Baijiguan plants treated with 3 and 6 days of shade, respectively. Gene Ontology (GO) and pathway enrichment analyses indicated that the DEGs are predominantly involved in the ROS scavenging system, chloroplast development, photosynthetic pigment synthesis, secondary metabolism, and circadian systems. The light-responsive gene POR (protochlorophyllide oxidoreductase) and transcription factor HY5 were identified. Quantitative real-time PCR (qRT-PCR) analysis of 20 selected DEGs confirmed the RNA-sequencing (RNA-Seq) results. Overall, these findings suggest that high light intensity inhibits the expression of photosystem II 10-kDa protein (PsbR) in Baijiguan, thus affecting PSII stability, chloroplast development and chlorophyll biosynthesis. PMID:27047513

  8. Functional categorization of unique expressed sequence tags obtained from the yeast-like growth phase of the elm pathogen Ophiostoma novo-ulmi

    PubMed Central

    2011-01-01

    Background The highly aggressive pathogenic fungus Ophiostoma novo-ulmi continues to be a serious threat to the American elm (Ulmus americana) in North America. Extensive studies have been conducted in North America to understand the mechanisms of virulence of this introduced pathogen and its evolving population structure, with a view to identifying potential strategies for the control of Dutch elm disease. As part of a larger study to examine the genomes of economically important Ophiostoma spp. and the genetic basis of virulence, we have constructed an expressed sequence tag (EST) library using total RNA extracted from the yeast-like growth phase of O. novo-ulmi (isolate H327). Results A total of 4,386 readable EST sequences were annotated by determining their closest matches to known or theoretical sequences in public databases by BLASTX analysis. Searches matched 2,093 sequences to entries found in Genbank, including 1,761 matches with known proteins and 332 matches with unknown (hypothetical/predicted) proteins. Known proteins included a collection of 880 unique transcripts which were categorized to obtain a functional profile of the transcriptome and to evaluate physiological function. These assignments yielded 20 primary functional categories (FunCat), the largest including Metabolism (FunCat 01, 20.28% of total), Sub-cellular localization (70, 10.23%), Protein synthesis (12, 10.14%), Transcription (11, 8.27%), Biogenesis of cellular components (42, 8.15%), Cellular transport, facilitation and routes (20, 6.08%), Classification unresolved (98, 5.80%), Cell rescue, defence and virulence (32, 5.31%) and the unclassified category, or known sequences of unknown metabolic function (99, 7.5%). A list of specific transcripts of interest was compiled to initiate an evaluation of their impact upon strain virulence in subsequent studies. Conclusions This is the first large-scale study of the O. novo-ulmi transcriptome. The expression profile obtained from the yeast

  9. A Quantitative Tool to Distinguish Isobaric Leucine and Isoleucine Residues for Mass Spectrometry-Based De Novo Monoclonal Antibody Sequencing

    NASA Astrophysics Data System (ADS)

    Poston, Chloe N.; Higgs, Richard E.; You, Jinsam; Gelfanova, Valentina; Hale, John E.; Knierman, Michael D.; Siegel, Robert; Gutierrez, Jesus A.

    2014-07-01

    De novo sequencing by mass spectrometry (MS) allows for the determination of the complete amino acid (AA) sequence of a given protein based on the mass difference of detected ions from MS/MS fragmentation spectra. The technique relies on obtaining specific masses that can be attributed to characteristic theoretical masses of AAs. A major limitation of de novo sequencing by MS is the inability to distinguish between the isobaric residues leucine (Leu) and isoleucine (Ile). Incorrect identification of Ile as Leu or vice versa often results in loss of activity in recombinant antibodies. This functional ambiguity is commonly resolved with costly and time-consuming AA mutation and peptide sequencing experiments. Here, we describe a set of orthogonal biochemical protocols, which experimentally determine the identity of Ile or Leu residues in monoclonal antibodies (mAb) based on the selectivity that leucine aminopeptidase shows for N-terminal Leu residues and the cleavage preference for Leu by chymotrypsin. The resulting observations are combined with germline frequencies and incorporated into a logistic regression model, called Predictor for Xle Sites (PXleS), to provide a statistical likelihood for the identity of Leu at an ambiguous site. We demonstrate that PXleS can generate a probability for an Xle site in mAbs with 96% accuracy. The implementation of PXleS precludes the expression of several possible sequences and, therefore, reduces the overall time and resources required to go from spectra generation to a biologically active sequence for a mAb when an Ile or Leu residue is in question.
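
    A logistic regression of the general kind described above maps a weighted sum of observations to a probability. The sketch below shows only that generic form. The feature order (aminopeptidase signal, chymotrypsin signal, germline Leu frequency), the weights, and the bias are made-up placeholders, not the published PXleS coefficients, which the abstract does not give.

```python
import math


def leu_probability(features, weights, bias):
    """Return P(Leu) from a generic logistic model.

    Hypothetical illustration: `features`, `weights`, and `bias` are
    placeholders, not values from the PXleS model.
    """
    z = bias + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))


# Assumed feature order: aminopeptidase signal, chymotrypsin signal,
# germline Leu frequency. All numbers are invented for the example.
example_weights = [1.5, 1.0, 2.0]
p_leu = leu_probability([0.8, 0.6, 0.7], example_weights, bias=-1.5)
```

    With all-zero features and zero bias the function returns exactly 0.5, the point of maximal ambiguity between Leu and Ile.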

  10. Whole Genome Sequencing Reveals a De Novo SHANK3 Mutation in Familial Autism Spectrum Disorder

    PubMed Central

    Nemirovsky, Sergio I.; Córdoba, Marta; Zaiat, Jonathan J.; Completa, Sabrina P.; Vega, Patricia A.; González-Morón, Dolores; Medina, Nancy M.; Fabbro, Mónica; Romero, Soledad; Brun, Bianca; Revale, Santiago; Ogara, María Florencia; Pecci, Adali; Marti, Marcelo; Vazquez, Martin; Turjanski, Adrián; Kauffman, Marcelo A.

    2015-01-01

    Introduction Clinical genomics promise to be especially suitable for the study of etiologically heterogeneous conditions such as Autism Spectrum Disorder (ASD). Here we present three siblings with ASD where we evaluated the usefulness of Whole Genome Sequencing (WGS) for the diagnostic approach to ASD. Methods We identified a family segregating ASD in three siblings with an unidentified cause. We performed WGS in the three probands and used a state-of-the-art comprehensive bioinformatic analysis pipeline and prioritized the identified variants located in genes likely to be related to ASD. We validated the finding by Sanger sequencing in the probands and their parents. Results Three male siblings presented a syndrome characterized by severe intellectual disability, absence of language, autism spectrum symptoms and epilepsy with negative family history for mental retardation, language disorders, ASD or other psychiatric disorders. We found germline mosaicism for a heterozygous deletion of a cytosine in the exon 21 of the SHANK3 gene, resulting in a missense sequence of 5 codons followed by a premature stop codon (NM_033517:c.3259_3259delC, p.Ser1088Profs*6). Conclusions We reported an infrequent form of familial ASD where WGS proved useful in the clinic. We identified a mutation in SHANK3 that underscores its relevance in Autism Spectrum Disorder. PMID:25646853

  11. Characterization of Quasispecies of Pandemic 2009 Influenza A Virus (A/H1N1/2009) by De Novo Sequencing Using a Next-Generation DNA Sequencer

    PubMed Central

    Kuroda, Makoto; Katano, Harutaka; Nakajima, Noriko; Tobiume, Minoru; Ainai, Akira; Sekizuka, Tsuyoshi; Hasegawa, Hideki; Tashiro, Masato; Sasaki, Yuko; Arakawa, Yoshichika; Hata, Satoru; Watanabe, Masahide; Sata, Tetsutaro

    2010-01-01

    Pandemic 2009 influenza A virus (A/H1N1/2009) has emerged globally. In this study, we performed a comprehensive detection of potential pathogens by de novo sequencing using a next-generation DNA sequencer on total RNAs extracted from an autopsy lung of a patient who died of viral pneumonia with A/H1N1/2009. Among a total of 9.4×10^6 40-mer short reads, more than 98% appeared to be human, while 0.85% were identified as A/H1N1/2009 (A/Nagano/RC1-L/2009(H1N1)). Suspected bacterial reads such as Streptococcus pneumoniae and other oral bacteria flora were very low at 0.005%, and a significant bacterial infection was not histologically observed. De novo assembly and read mapping analysis of A/Nagano/RC1-L/2009(H1N1) showed more than 200× coverage on average, and revealed nucleotide heterogeneity on hemagglutinin as quasispecies, specifically at two amino acids (Gly172Glu and Gly239Asn of HA) located on the Sa and Ca2 antigenic sites, respectively. Gly239 and Asn239 on antigenic site Ca2 appeared to be minor amino acids compared with the highly distributed Asp239 in H1N1 HAs. This study demonstrated that de novo sequencing can comprehensively detect pathogens, and such in-depth investigation facilitates the identification of influenza A viral heterogeneity. To better characterize the A/H1N1/2009 virus, unbiased comprehensive techniques will be indispensable for the primary investigations of emerging infectious diseases. PMID:20428231

  12. Fast, cheap and out of control--Insights into thermodynamic and informatic constraints on natural protein sequences from de novo protein design.

    PubMed

    Brisendine, Joseph M; Koder, Ronald L

    2016-05-01

    The accumulated results of thirty years of rational and computational de novo protein design have taught us important lessons about the stability, information content, and evolution of natural proteins. First, de novo protein design has complicated the assertion that biological function is equivalent to biological structure - demonstrating the capacity to abstract active sites from natural contexts and paste them into non-native topologies without loss of function. The structure-function relationship has thus been revealed to be either a generality or strictly true only in a local sense. Second, the simplification to "maquette" topologies carried out by rational protein design also has demonstrated that even sophisticated functions such as conformational switching, cooperative ligand binding, and light-activated electron transfer can be achieved with low-information design approaches. This is because for simple topologies the functional footprint in sequence space is enormous and easily exceeds the number of structures which could have possibly existed in the history of life on Earth. Finally, the pervasiveness of extraordinary stability in designed proteins challenges accepted models for the "marginal stability" of natural proteins, suggesting that there must be a selection pressure against highly stable proteins. This can be explained using recent theories which relate non-equilibrium thermodynamics and self-replication. This article is part of a Special Issue entitled Biodesign for Bioenergetics--The design and engineering of electron transfer cofactors, proteins and protein networks, edited by Ronald L. Koder and J.L. Ross Anderson. PMID:26498191

  13. An evaluation of the PacBio RS platform for sequencing and de novo assembly of a chloroplast genome

    PubMed Central

    2013-01-01

    Background Second generation sequencing has permitted detailed sequence characterisation at the whole genome level of a growing number of non-model organisms, but the data produced have short read-lengths and biased genome coverage leading to fragmented genome assemblies. The PacBio RS long-read sequencing platform offers the promise of increased read length and unbiased genome coverage and thus the potential to produce genome sequence data of a finished quality containing fewer gaps and longer contigs. However, these advantages come at a much greater cost per nucleotide and with a perceived increase in error-rate. In this investigation, we evaluated the performance of the PacBio RS sequencing platform through the sequencing and de novo assembly of the Potentilla micrantha chloroplast genome. Results Following error-correction, a total of 28,638 PacBio RS reads were recovered with a mean read length of 1,902 bp totalling 54,492,250 nucleotides and representing an average depth of coverage of 320× the chloroplast genome. The dataset covered the entire 154,959 bp of the chloroplast genome in a single contig (100% coverage) compared to seven contigs (90.59% coverage) recovered from Illumina data, and revealed no bias in coverage of GC rich regions. Post-assembly, the data were largely concordant with the Illumina data generated and allowed 187 ambiguities in the Illumina data to be resolved. The additional read length also permitted small differences in the two inverted repeat regions to be assigned unambiguously. Conclusions This is the first report to our knowledge of a chloroplast genome assembled de novo using PacBio sequence data. The PacBio RS data generated here were assembled into a single large contig spanning the P. micrantha chloroplast genome, with a higher degree of accuracy than an Illumina dataset generated at a much greater depth of coverage, due to longer read lengths and lower GC bias in the data. The results we present suggest PacBio data will be

  14. De Novo Transcriptome Sequencing of Desert Herbaceous Achnatherum splendens (Achnatherum) Seedlings and Identification of Salt Tolerance Genes

    PubMed Central

    Liu, Jiangtao; Zhou, Yuelong; Luo, Changxin; Xiang, Yun; An, Lizhe

    2016-01-01

    Achnatherum splendens is an important forage herb in Northwestern China. It has a high tolerance to salinity and is, thus, considered one of the most important constructive plants in saline and alkaline areas of land in Northwest China. However, the mechanisms of salt stress tolerance in A. splendens remain unknown. Next-generation sequencing (NGS) technologies can be used for global gene expression profiling. In this study, we examined sequence and transcript abundance data for the root/leaf transcriptome of A. splendens obtained using an Illumina HiSeq 2500. Over 35 million clean reads were obtained from the leaf and root libraries. All of the RNA sequencing (RNA-seq) reads were assembled de novo into a total of 126,235 unigenes and 36,511 coding DNA sequences (CDS). We further identified 1663 differentially-expressed genes (DEGs) between the salt stress treatment and control. Functional annotation of the DEGs by gene ontology (GO), using Arabidopsis and rice as references, revealed enrichment of salt stress-related GO categories, including “oxidation reduction”, “transcription factor activity”, and “ion channel transporter”. Thus, this global transcriptome analysis of A. splendens has provided an important genetic resource for the study of salt tolerance in this halophyte. The identified sequences and their putative functional data will facilitate future investigations of the tolerance of Achnatherum species to various types of abiotic stress. PMID:27023614

  15. Transcriptome Sequencing, De Novo Assembly and Differential Gene Expression Analysis of the Early Development of Acipenser baeri

    PubMed Central

    Song, Wei; Jiang, Keji; Zhang, Fengying; Lin, Yu; Ma, Lingbo

    2015-01-01

    The molecular mechanisms that drive the development of the endangered fossil fish species Acipenser baeri are difficult to study due to the lack of genomic data. Recent advances in sequencing technologies and the falling cost of sequencing offer exceptional opportunities for exploring important molecular mechanisms underlying specific biological processes. This manuscript describes the large scale sequencing and analyses of mRNA from Acipenser baeri collected at five developmental time points using the Illumina HiSeq 2000 platform. The sequencing reads were de novo assembled and clustered into 278167 unigenes, of which 57346 (20.62%) had 45837 known homologous proteins in UniProt protein databases, while 11509 proteins matched at least one sequence of the assembled unigenes. The remaining 79.38% of unigenes could represent non-coding unigenes or unigenes specific to A. baeri. A total of 43062 unigenes were annotated into functional categories via Gene Ontology (GO) annotation, whereas 29526 unigenes were associated with 329 pathways by mapping to the KEGG database. Subsequently, 3479 differentially expressed genes were identified across developmental stages and clustered into 50 gene expression profiles. Genes preferentially expressed at each stage were also identified. Through GO and KEGG pathway enrichment analysis, relevant physiological variations during the early development of A. baeri could be better understood. Accordingly, the present study gives insights into the transcriptome profile of the early development of A. baeri, and the information contained in this large scale transcriptome will provide substantial references for A. baeri developmental biology and promote its aquaculture research. PMID:26359664

  16. De Novo Transcriptome Sequencing of Desert Herbaceous Achnatherum splendens (Achnatherum) Seedlings and Identification of Salt Tolerance Genes.

    PubMed

    Liu, Jiangtao; Zhou, Yuelong; Luo, Changxin; Xiang, Yun; An, Lizhe

    2016-01-01

    Achnatherum splendens is an important forage herb in Northwestern China. It has a high tolerance to salinity and is, thus, considered one of the most important constructive plants in saline and alkaline areas of land in Northwest China. However, the mechanisms of salt stress tolerance in A. splendens remain unknown. Next-generation sequencing (NGS) technologies can be used for global gene expression profiling. In this study, we examined sequence and transcript abundance data for the root/leaf transcriptome of A. splendens obtained using an Illumina HiSeq 2500. Over 35 million clean reads were obtained from the leaf and root libraries. All of the RNA sequencing (RNA-seq) reads were assembled de novo into a total of 126,235 unigenes and 36,511 coding DNA sequences (CDS). We further identified 1663 differentially-expressed genes (DEGs) between the salt stress treatment and control. Functional annotation of the DEGs by gene ontology (GO), using Arabidopsis and rice as references, revealed enrichment of salt stress-related GO categories, including "oxidation reduction", "transcription factor activity", and "ion channel transporter". Thus, this global transcriptome analysis of A. splendens has provided an important genetic resource for the study of salt tolerance in this halophyte. The identified sequences and their putative functional data will facilitate future investigations of the tolerance of Achnatherum species to various types of abiotic stress. PMID:27023614

  17. De novo transcriptome sequencing and analysis of Rapana venosa from six different developmental stages using Hi-seq 2500.

    PubMed

    Song, Hao; Yu, Zheng-Lin; Sun, Li-Na; Gao, Yan; Zhang, Tao; Wang, Hai-Yan

    2016-03-01

    The carnivorous whelk Rapana venosa is regarded as a biological invader with strong ecological fitness in the United States, Argentina, France and other countries. R. venosa may seriously damage bivalve resources. Nonetheless, in China, R. venosa is an important commercial species. Larval development, especially metamorphosis, influences the natural population and industrial breeding. However, there are few studies on the early development of R. venosa, and our understanding is further limited by a lack of genomic information. In this study, de novo sequencing was performed to obtain a comprehensive transcriptome profile during early development. A Hi-seq 2500 sequencing run produced 148,737,902 raw reads that were assembled into 1,137,556 unigenes (average length of 619 nucleotides, of which 49,673 could be annotated). The unigenes were assigned to biological processes and functions after annotation in Gene Ontology, eukaryotic Ortholog Groups and Kyoto Encyclopedia of Genes and Genomes. We also identified 93,196 simple sequence repeats among the unigenes. Six unique sequences associated with neuroendocrine function were analyzed by quantitative real-time PCR. Our data represent the first comprehensive transcriptomic resource for R. venosa. Functional annotation of the unigenes involved in various biological processes could stimulate research on the mechanisms of early development in this species. Understanding the mechanism of early development and metamorphosis would benefit antifouling research and aquaculture of R. venosa. PMID:26845471

  18. Transcriptome de novo assembly sequencing and analysis of the toxic dinoflagellate Alexandrium catenella using the Illumina platform.

    PubMed

    Zhang, Shu; Sui, Zhenghong; Chang, Lianpeng; Kang, Kyoungho; Ma, Jinhua; Kong, Fanna; Zhou, Wei; Wang, Jinguo; Guo, Liliang; Geng, Huili; Zhong, Jie; Ma, Qingxia

    2014-03-10

    In this article, high-throughput de novo transcriptomic sequencing was performed in Alexandrium catenella, which provided the first view of the gene repertoire in this dinoflagellate based on next-generation sequencing (NGS) technologies. A total of 118,304 unigenes were identified with an average length of 673 bp (base pairs). Of these unigenes, 77,936 (65.9%) were annotated with known proteins based on sequence similarities, among which 24,149 and 22,956 unigenes were assigned to gene ontology categories (GO) and clusters of orthologous groups (COGs), respectively. Furthermore, 16,467 unigenes were mapped onto 322 pathways using the Kyoto Encyclopedia of Genes and Genomes Pathway database (KEGG). We also detected 1143 simple sequence repeats (SSRs), in which the tri-nucleotide repeat motif (69.3%) was the most abundant. The genetic facts and significance derived from the transcriptome dataset were suggested and discussed. All four core nucleosomal histones and linker histones were detected, in addition to the unigenes involved in histone modifications. A total of 190 unigenes were identified as being involved in the endocytosis pathway, and clathrin-dependent endocytosis was suggested to play a role in the heterotrophy of A. catenella. A conserved 22-nt spliced leader (SL) was identified in 21 unigenes, which suggested the existence of trans-splicing processing of mRNA in A. catenella. PMID:24440238

  19. De novo sequencing of Eucommia ulmoides flower bud transcriptomes for identification of genes related to floral development.

    PubMed

    Liu, Huimin; Fu, JianMin; Du, Hongyan; Hu, Jingjing; Wuyun, Tana

    2016-09-01

    Eucommia ulmoides Oliver is a woody perennial dioecious species native to China and has great economic value. However, little is known about flower bud development in this species. In this study, the transcriptomes of female and male flower buds were sequenced using the Illumina platform, a next-generation sequencing technology that provides cost-effective, highly efficient transcriptome profiling. In total, 11,558,188,080 clean reads were assembled into 75,065 unigenes with an average length of 1011 bp by de novo assembly using Trinity software. Through similarity comparisons with known protein databases, 47,071 unigenes were annotated, 146 of which were putatively related to the floral development of E. ulmoides. Fifteen of the 146 unigenes had significantly different expression levels between the two samples. Additionally, 24,346 simple sequence repeats were identified in 18,565 unigenes with 12,793 sequences suitable for the designed primers. In total, 67,447 and 58,236 single nucleotide polymorphisms were identified in male and female buds, respectively. This study provides a valuable resource for further conservation genetics and functional genomics research on E. ulmoides. PMID:27486566

  20. De novo Sequence Assembly and Characterization of Lycoris aurea Transcriptome Using GS FLX Titanium Platform of 454 Pyrosequencing

    PubMed Central

    Wang, Ren; Xu, Sheng; Jiang, Yumei; Jiang, Jingwei; Li, Xiaodan; Liang, Lijian; He, Jia; Peng, Feng; Xia, Bing

    2013-01-01

    Background Lycoris aurea, also called Golden Magic Lily, is an ornamentally and medicinally important species of the Amaryllidaceae family. As for many non-model organisms, no whole-genome sequence is available to date. Transcriptomic information is also scarce for this species. In this study, we performed de novo transcriptome sequencing to produce the first comprehensive expressed sequence tag (EST) dataset for L. aurea using high-throughput sequencing technology. Methodology and Principal Findings Total RNA was isolated from leaves with sodium nitroprusside (SNP), salicylic acid (SA), or methyl jasmonate (MeJA) treatment, stems, and flowers at the bud, blooming, and wilting stages. Equal quantities of RNA from each tissue and stage were pooled to construct a cDNA library. Using 454 pyrosequencing technology, a total of 937,990 high quality reads (308.63 Mb) with an average read length of 329 bp were generated. Clustering and assembly of these reads produced a non-redundant set of 141,111 unique sequences, comprising 24,604 contigs and 116,507 singletons. All of the unique sequences were involved in the biological process, cellular component and molecular function categories by GO analysis. Potential genes and their functions were predicted by KEGG pathway mapping and COG analysis. Based on our sequence analysis and published literature, many putative genes involved in Amaryllidaceae alkaloid synthesis, including PAL, TYDC, OMT, NMT, P450, and other potentially important candidate genes, were identified for the first time in this Lycoris. Furthermore, 6,386 SSRs and 18,107 high-confidence SNPs were identified in this EST dataset. Conclusions The transcriptome provides an invaluable new functional genomics resource for future biological research in L. aurea. The molecular markers identified in this study will provide a material basis for future genetic linkage and quantitative trait loci analyses, and will provide useful information for functional

  1. A prime number approach to biological sequencing.

    PubMed

    Greer, W; Barrett, A N; Sowden, J M

    1985-03-01

    Computational sequencing of nucleic acid and amino acid sequences is placing increasing demands on computer resources. The use of prime numbers is explored as a convenient means of improving program speed and reducing storage requirements. It is concluded that the application of the prime number approach leads to significant increases in speed and some reduction in storage requirements. PMID:3840126
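
    The abstract above does not say how the primes are applied, so the following is only an illustrative sketch of one well-known prime-based trick, not the method of Greer et al. It assumes each nucleotide is mapped to a distinct small prime (the `PRIMES` table and all function names are hypothetical), so that the product over a sequence is a single integer that identifies its base composition regardless of order.

```python
# Hypothetical mapping of nucleotides to distinct primes (an assumption for
# illustration; the abstract does not specify any particular encoding).
PRIMES = {"A": 2, "C": 3, "G": 5, "T": 7}


def composition_key(seq):
    """Multiply the primes of all bases; equal keys mean equal base composition."""
    key = 1
    for base in seq.upper():
        key *= PRIMES[base]
    return key


def same_composition(a, b):
    """True when two sequences contain the same bases in any order."""
    return composition_key(a) == composition_key(b)


def covers_composition(window, query):
    """True when the window contains at least the bases of the query.

    By unique prime factorisation, the query key divides the window key
    exactly when every base of the query is available in the window.
    """
    return composition_key(window) % composition_key(query) == 0
```

    Because Python integers are arbitrary precision, the key never overflows, although it grows with sequence length; a fixed-width implementation would need to bound the window size.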

  2. A Statistical Approach for Ambiguous Sequence Mappings

    Technology Transfer Automated Retrieval System (TEKTRAN)

    When attempting to map RNA sequences to a reference genome, high percentages of short sequence reads are often assigned to multiple genomic locations. One approach to handling these “ambiguous mappings” has been to discard them. This results in a loss of data, which can sometimes be as much as 45% o...

  3. A population-based evolutionary search approach to the multiple minima problem in de novo protein structure prediction

    PubMed Central

    2013-01-01

    Background Elucidating the native structure of a protein molecule from its sequence of amino acids, a problem known as de novo structure prediction, is a long standing challenge in computational structural biology. Difficulties in silico arise due to the high dimensionality of the protein conformational space and the ruggedness of the associated energy surface. The issue of multiple minima is a particularly troublesome hallmark of energy surfaces probed with current energy functions. In contrast to the true energy surface, these surfaces are weakly-funneled and rich in comparably deep minima populated by non-native structures. For this reason, many algorithms seek to be inclusive and obtain a broad view of the low-energy regions through an ensemble of low-energy (decoy) conformations. Conformational diversity in this ensemble is key to increasing the likelihood that the native structure has been captured. Methods We propose an evolutionary search approach to address the multiple-minima problem in decoy sampling for de novo structure prediction. Two population-based evolutionary search algorithms are presented that follow the basic approach of treating conformations as individuals in an evolving population. Coarse graining and molecular fragment replacement are used to efficiently obtain protein-like child conformations from parents. Potential energy is used both to bias parent selection and determine which subset of parents and children will be retained in the evolving population. The effect on the decoy ensemble of sampling minima directly is measured by additionally mapping a conformation to its nearest local minimum before considering it for retainment. The resulting memetic algorithm thus evolves not just a population of conformations but a population of local minima. Results and conclusions Results show that both algorithms are effective in terms of sampling conformations in proximity of the known native structure. The additional minimization is shown to be
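
    The memetic scheme described in the record above (energy-biased parent selection, child generation, mapping each child to its nearest local minimum, and energy-based retention) can be sketched on a toy problem. This is a minimal sketch under stated assumptions: the one-dimensional `energy` function, the Gaussian perturbation standing in for molecular fragment replacement, the tournament selection rule, and all names and parameter values are hypothetical and are not taken from the paper.

```python
import math
import random


def energy(x):
    """Toy rugged landscape: a broad funnel with ripples that add local minima."""
    return 0.1 * x * x + math.sin(5.0 * x)


def local_minimize(x, step=0.01, iters=200):
    """Greedy descent on a fixed grid: move downhill until neither neighbour is lower."""
    for _ in range(iters):
        best = min((x - step, x, x + step), key=energy)
        if best == x:
            break
        x = best
    return x


def memetic_search(pop_size=20, generations=30, seed=0):
    """Evolve a population of local minima and return it sorted by energy."""
    rng = random.Random(seed)
    population = [local_minimize(rng.uniform(-10.0, 10.0)) for _ in range(pop_size)]
    for _ in range(generations):
        children = []
        for _ in range(pop_size):
            # Energy-biased parent selection: tournament between two individuals.
            a, b = rng.sample(population, 2)
            parent = a if energy(a) < energy(b) else b
            # Random perturbation (a stand-in for fragment replacement).
            child = parent + rng.gauss(0.0, 1.0)
            # Memetic step: map the child to a nearby local minimum.
            children.append(local_minimize(child))
        # Retain the lowest-energy subset of parents and children.
        population = sorted(population + children, key=energy)[:pop_size]
    return population
```

    Calling `memetic_search()` returns a list of 20 positions ordered from lowest to highest energy. Keeping parents in the retention pool makes the search elitist, so the best energy found never worsens between generations.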

  4. Highly efficient de novo mutant identification in a sorghum bicolor tilling population using the ComSeq approach

    Technology Transfer Automated Retrieval System (TEKTRAN)

    Screening large populations for carriers of known or de novo rare SNPs is required both in Targeting induced local lesions IN genomes (TILLING) experiments in plants and analogously in screening human populations. We formerly suggested an approach that combines the celebrated mathematical field of c...

  5. The genome sequence of the outbreeding globe artichoke constructed de novo incorporating a phase-aware low-pass sequencing strategy of F1 progeny

    PubMed Central

    Scaglione, Davide; Reyes-Chin-Wo, Sebastian; Acquadro, Alberto; Froenicke, Lutz; Portis, Ezio; Beitel, Christopher; Tirone, Matteo; Mauro, Rosario; Lo Monaco, Antonino; Mauromicale, Giovanni; Faccioli, Primetta; Cattivelli, Luigi; Rieseberg, Loren; Michelmore, Richard; Lanteri, Sergio

    2016-01-01

    Globe artichoke (Cynara cardunculus var. scolymus) is an out-crossing, perennial, multi-use crop species that is grown worldwide and belongs to the Compositae, one of the most successful Angiosperm families. We describe the first genome sequence of globe artichoke. The assembly, comprising of 13,588 scaffolds covering 725 of the 1,084 Mb genome, was generated using ~133-fold Illumina sequencing data and encodes 26,889 predicted genes. Re-sequencing (30×) of globe artichoke and cultivated cardoon (C. cardunculus var. altilis) parental genotypes and low-coverage (0.5 to 1×) genotyping-by-sequencing of 163 F1 individuals resulted in 73% of the assembled genome being anchored in 2,178 genetic bins ordered along 17 chromosomal pseudomolecules. This was achieved using a novel pipeline, SOILoCo (Scaffold Ordering by Imputation with Low Coverage), to detect heterozygous regions and assign parental haplotypes with low sequencing read depth and of unknown phase. SOILoCo provides a powerful tool for de novo genome analysis of outcrossing species. Our data will enable genome-scale analyses of evolutionary processes among crops, weeds, and wild species within and beyond the Compositae, and will facilitate the identification of economically important genes from related species. PMID:26786968

  6. The genome sequence of the outbreeding globe artichoke constructed de novo incorporating a phase-aware low-pass sequencing strategy of F1 progeny.

    PubMed

    Scaglione, Davide; Reyes-Chin-Wo, Sebastian; Acquadro, Alberto; Froenicke, Lutz; Portis, Ezio; Beitel, Christopher; Tirone, Matteo; Mauro, Rosario; Lo Monaco, Antonino; Mauromicale, Giovanni; Faccioli, Primetta; Cattivelli, Luigi; Rieseberg, Loren; Michelmore, Richard; Lanteri, Sergio

    2016-01-01

    Globe artichoke (Cynara cardunculus var. scolymus) is an out-crossing, perennial, multi-use crop species that is grown worldwide and belongs to the Compositae, one of the most successful Angiosperm families. We describe the first genome sequence of globe artichoke. The assembly, comprising of 13,588 scaffolds covering 725 of the 1,084 Mb genome, was generated using ~133-fold Illumina sequencing data and encodes 26,889 predicted genes. Re-sequencing (30×) of globe artichoke and cultivated cardoon (C. cardunculus var. altilis) parental genotypes and low-coverage (0.5 to 1×) genotyping-by-sequencing of 163 F1 individuals resulted in 73% of the assembled genome being anchored in 2,178 genetic bins ordered along 17 chromosomal pseudomolecules. This was achieved using a novel pipeline, SOILoCo (Scaffold Ordering by Imputation with Low Coverage), to detect heterozygous regions and assign parental haplotypes with low sequencing read depth and of unknown phase. SOILoCo provides a powerful tool for de novo genome analysis of outcrossing species. Our data will enable genome-scale analyses of evolutionary processes among crops, weeds, and wild species within and beyond the Compositae, and will facilitate the identification of economically important genes from related species. PMID:26786968

  7. High-Throughput Sequencing and De Novo Assembly of Red and Green Forms of the Perilla frutescens var. crispa Transcriptome

    PubMed Central

    Fukushima, Atsushi; Nakamura, Michimi; Suzuki, Hideyuki; Saito, Kazuki; Yamazaki, Mami

    2015-01-01

    Perilla frutescens var. crispa (Labiatae) has two chemo-varietal forms, i.e. red and green forms of perilla, that differ in the production of anthocyanins. To facilitate molecular biological and biochemical studies in perilla-specialized metabolism we used Illumina RNA-sequencing technology in our comprehensive comparison of the transcriptome map of the leaves of red and green forms of perilla. Sequencing generated over 1.2 billion short reads with an average length of 101 nt. De novo transcriptome assembly yielded 47,788 and 47,840 unigenes in the red and green forms of perilla plants, respectively. Comparison of the assembled unigenes and existing perilla cDNA sequences showed highly reliable alignment. All unigenes were annotated with gene ontology (GO) and Enzyme Commission numbers and entered into the Kyoto Encyclopedia of Genes and Genomes. We identified 68 differentially expressed genes (DEGs) in red and green forms of perilla. GO enrichment analysis of the DEGs showed that genes involved in the anthocyanin metabolic process were enriched. Differential expression analysis revealed that the transcript level of anthocyanin biosynthetic unigenes encoding flavonoid 3’-hydroxylase, dihydroflavonol 4-reductase, and anthocyanidin synthase was significantly higher in red perilla, while the transcript level of unigenes encoding limonene synthase was significantly higher in green perilla. Our data serve as a basis for future research on perilla bio-engineering and provide a shortcut for the characterization of new functional genes in P. frutescens. PMID:26070213

  8. De novo assembly and next-generation sequencing to analyse full-length gene variants from codon-barcoded libraries.

    PubMed

    Cho, Namjin; Hwang, Byungjin; Yoon, Jung-ki; Park, Sangun; Lee, Joongoo; Seo, Han Na; Lee, Jeewon; Huh, Sunghoon; Chung, Jinsoo; Bang, Duhee

    2015-01-01

    Interpreting epistatic interactions is crucial for understanding evolutionary dynamics of complex genetic systems and unveiling structure and function of genetic pathways. Although high resolution mapping of en masse variant libraries enables molecular biologists to address genotype-phenotype relationships, long-read sequencing technology remains indispensable to assess functional relationship between mutations that lie far apart. Here, we introduce JigsawSeq for multiplexed sequence identification of pooled gene variant libraries by combining a codon-based molecular barcoding strategy and de novo assembly of short-read data. We first validate JigsawSeq on small sub-pools and observe high precision and recall at various experimental settings. With extensive simulations, we then apply JigsawSeq to large-scale gene variant libraries to show that our method can be reliably scaled using next-generation sequencing. JigsawSeq may serve as a rapid screening tool for functional genomics and offer the opportunity to explore evolutionary trajectories of protein variants. PMID:26387459

  9. De novo assembly and next-generation sequencing to analyse full-length gene variants from codon-barcoded libraries

    PubMed Central

    Cho, Namjin; Hwang, Byungjin; Yoon, Jung-ki; Park, Sangun; Lee, Joongoo; Seo, Han Na; Lee, Jeewon; Huh, Sunghoon; Chung, Jinsoo; Bang, Duhee

    2015-01-01

    Interpreting epistatic interactions is crucial for understanding evolutionary dynamics of complex genetic systems and unveiling structure and function of genetic pathways. Although high resolution mapping of en masse variant libraries enables molecular biologists to address genotype-phenotype relationships, long-read sequencing technology remains indispensable to assess functional relationship between mutations that lie far apart. Here, we introduce JigsawSeq for multiplexed sequence identification of pooled gene variant libraries by combining a codon-based molecular barcoding strategy and de novo assembly of short-read data. We first validate JigsawSeq on small sub-pools and observe high precision and recall at various experimental settings. With extensive simulations, we then apply JigsawSeq to large-scale gene variant libraries to show that our method can be reliably scaled using next-generation sequencing. JigsawSeq may serve as a rapid screening tool for functional genomics and offer the opportunity to explore evolutionary trajectories of protein variants. PMID:26387459

  10. Sequencing, De novo Assembly, Functional Annotation and Analysis of Phyllanthus amarus Leaf Transcriptome Using the Illumina Platform

    PubMed Central

    Bose Mazumdar, Aparupa; Chattopadhyay, Sharmila

    2016-01-01

    Phyllanthus amarus Schum. and Thonn., a widely distributed annual medicinal herb has a long history of use in the traditional system of medicine for over 2000 years. However, the lack of genomic data for P. amarus, a non-model organism hinders research at the molecular level. In the present study, high-throughput sequencing technology has been employed to enhance better understanding of this herb and provide comprehensive genomic information for future work. Here P. amarus leaf transcriptome was sequenced using the Illumina Miseq platform. We assembled 85,927 non-redundant (nr) “unitranscript” sequences with an average length of 1548 bp, from 18,060,997 raw reads. Sequence similarity analyses and annotation of these unitranscripts were performed against databases like green plants nr protein database, Gene Ontology (GO), Clusters of Orthologous Groups (COG), PlnTFDB, KEGG databases. As a result, 69,394 GO terms, 583 enzyme codes (EC), 134 KEGG maps, and 59 Transcription Factor (TF) families were generated. Functional and comparative analyses of assembled unitranscripts were also performed with the most closely related species like Populus trichocarpa and Ricinus communis using TRAPID. KEGG analysis showed that a number of assembled unitranscripts were involved in secondary metabolites, mainly phenylpropanoid, flavonoid, terpenoids, alkaloids, and lignan biosynthetic pathways that have significant medicinal attributes. Further, Fragments Per Kilobase of transcript per Million mapped reads (FPKM) values of the identified secondary metabolite pathway genes were determined and Reverse Transcription PCR (RT-PCR) of a few of these genes were performed to validate the de novo assembled leaf transcriptome dataset. In addition 65,273 simple sequence repeats (SSRs) were also identified. To the best of our knowledge, this is the first transcriptomic dataset of P. amarus till date. Our study provides the largest genetic resource that will lead to drug development and pave

  11. Differential gene expression analysis between anagen and telogen of Capra hircus skin based on the de novo assembled transcriptome sequence.

    PubMed

    Xu, Teng; Guo, Xudong; Wang, Hui; Hao, Fei; Du, Xiaoyuan; Gao, Xiaoyu; Liu, Dongjun

    2013-05-10

    Capra hircus, an economically important livestock, plays an indispensable role in the world animal fiber industry. In the present study, using Illumina/Solexa high throughput sequencing technology, we sequenced and de novo assembled the goat skin transcriptome corresponding to the anagen and telogen of the hair growth cycle. Approximately 53Mb of transcriptome sequences consisting of 57,040 high quality contigs was obtained. More than 8300 contigs were predicted to contain a full length coding sequence. Approximately 43% of the total contigs were identified as harboring homologs of sequences from other organisms in the public database. Based on the assembled transcript-derived contigs, we identified about 7000 transcripts that were differentially expressed between the anagen and telogen libraries. These differentially expressed genes were mainly enriched in signal transduction mechanisms, extracellular structures and cytoskeleton from the KOG database and in ECM receptor interaction, focal adhesion and gap junction from the KEGG pathway database, indicating the essential roles of these genes may play in cell-to-cell and cell-to-matrix communications during the active hair growth phase. In addition, many signaling pathway associated ligands and/or receptors were also identified as up-regulated genes during the anagen phase compared with the telogen stage, suggesting that enhanced cross-talk among signaling transduction pathways may be required for anagen of the hair cycle. These differentially expressed genes, especially those that were over-represented in each of the functional clusters and biochemical pathways, provide valuable resources and opportunities for characterizing the gene functions associated with hair fiber growth as well as for breeding elite Cashmere goat species. PMID:23466980

  12. High throughput de novo RNA sequencing elucidates novel responses in Penicillium chrysogenum under microgravity.

    PubMed

    Sathishkumar, Yesupatham; Krishnaraj, Chandran; Rajagopal, Kalyanaraman; Sen, Dwaipayan; Lee, Yang Soo

    2016-02-01

    In this study, the transcriptional alterations in Penicillium chrysogenum under simulated microgravity conditions were analyzed for the first time using an RNA-Seq method. The increasing plethora of eukaryotic microbial flora inside spacecraft demands a basic understanding of fungal biology in the absence of the gravity vector. Penicillium species are the second most dominant fungal contaminant in the International Space Station. Penicillium chrysogenum, an industrially important organism, also has the potential to emerge as an opportunistic pathogen for astronauts during long-term space missions. To date, however, the cellular mechanisms underlying the survival and adaptation of Penicillium chrysogenum to microgravity conditions have not been clearly elucidated. A reference genome for Penicillium chrysogenum is not yet available in the NCBI database. Hence, we performed comparative de novo transcriptome analysis of Penicillium chrysogenum grown under microgravity versus normal gravity. In addition, the changes due to microgravity are documented at the molecular level. An increased response to environmental stimuli and changes in the cell wall component ABC transporter/MFS transporters are noteworthy. Interestingly, a sustained increase in the expression of Acyl-coenzyme A: isopenicillin N acyltransferase (Acyltransferase) under microgravity revealed the significance of gravity in penicillin production, which could be exploited industrially. PMID:26603994

  13. Integrating mapping-, assembly- and haplotype-based approaches for calling variants in clinical sequencing applications

    PubMed Central

    Mathieson, Iain; Iqbal, Zamin; Twigg, Stephen R F; Wilkie, Andrew O M; McVean, Gil; Lunter, Gerton

    2016-01-01

    High-throughput DNA sequencing technology has transformed genetic research and is starting to make an impact on clinical practice. However, analyzing high-throughput sequencing data remains challenging, particularly in clinical settings where accuracy and turnaround times are critical. We present a new approach to this problem, implemented in a software package called Platypus. Platypus achieves high sensitivity and specificity for SNPs, indels and complex polymorphisms by using local de novo assembly to generate candidate variants, followed by local realignment and probabilistic haplotype estimation. It is an order of magnitude faster than existing tools and generates calls from raw aligned read data without preprocessing. We demonstrate the performance of Platypus in clinically relevant experimental designs by comparing with SAMtools and GATK on whole-genome and exome-capture data, by identifying de novo variation in 15 parent-offspring trios with high sensitivity and specificity, and by estimating human leukocyte antigen genotypes directly from variant calls. PMID:25017105

  14. Integrating mapping-, assembly- and haplotype-based approaches for calling variants in clinical sequencing applications.

    PubMed

    Rimmer, Andy; Phan, Hang; Mathieson, Iain; Iqbal, Zamin; Twigg, Stephen R F; Wilkie, Andrew O M; McVean, Gil; Lunter, Gerton

    2014-08-01

    High-throughput DNA sequencing technology has transformed genetic research and is starting to make an impact on clinical practice. However, analyzing high-throughput sequencing data remains challenging, particularly in clinical settings where accuracy and turnaround times are critical. We present a new approach to this problem, implemented in a software package called Platypus. Platypus achieves high sensitivity and specificity for SNPs, indels and complex polymorphisms by using local de novo assembly to generate candidate variants, followed by local realignment and probabilistic haplotype estimation. It is an order of magnitude faster than existing tools and generates calls from raw aligned read data without preprocessing. We demonstrate the performance of Platypus in clinically relevant experimental designs by comparing with SAMtools and GATK on whole-genome and exome-capture data, by identifying de novo variation in 15 parent-offspring trios with high sensitivity and specificity, and by estimating human leukocyte antigen genotypes directly from variant calls. PMID:25017105

  15. De Novo Transcriptome Sequencing Reveals Important Molecular Networks and Metabolic Pathways of the Plant, Chlorophytum borivilianum

    PubMed Central

    Kalra, Shikha; Puniya, Bhanwar Lal; Kulshreshtha, Deepika; Kumar, Sunil; Kaur, Jagdeep; Ramachandran, Srinivasan; Singh, Kashmir

    2013-01-01

    Chlorophytum borivilianum, an endangered medicinal plant species is highly recognized for its aphrodisiac properties provided by saponins present in the plant. The transcriptome information of this species is limited and only few hundred expressed sequence tags (ESTs) are available in the public databases. To gain molecular insight of this plant, high throughput transcriptome sequencing of leaf RNA was carried out using Illumina's HiSeq 2000 sequencing platform. A total of 22,161,444 single end reads were retrieved after quality filtering. Available (e.g., De-Bruijn/Eulerian graph) and in-house developed bioinformatics tools were used for assembly and annotation of transcriptome. A total of 101,141 assembled transcripts were obtained, with coverage size of 22.42 Mb and average length of 221 bp. Guanine-cytosine (GC) content was found to be 44%. Bioinformatics analysis, using non-redundant proteins, gene ontology (GO), enzyme commission (EC) and kyoto encyclopedia of genes and genomes (KEGG) databases, extracted all the known enzymes involved in saponin and flavonoid biosynthesis. Few genes of the alkaloid biosynthesis, along with anticancer and plant defense genes, were also discovered. Additionally, several cytochrome P450 (CYP450) and glycosyltransferase unique sequences were also found. We identified simple sequence repeat motifs in transcripts with an abundance of di-nucleotide simple sequence repeat (SSR; 43.1%) markers. Large scale expression profiling through Reads per Kilobase per Million mapped reads (RPKM) showed major genes involved in different metabolic pathways of the plant. Genes, expressed sequence tags (ESTs) and unique sequences from this study provide an important resource for the scientific community, interested in the molecular genetics and functional genomics of C. borivilianum. PMID:24376689
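
    The RPKM and GC-content figures quoted in the record above follow standard definitions, which can be written out as a short sketch. RPKM is taken here as 10^9 × C / (N × L), where C is the number of reads mapped to a transcript, N the total number of mapped reads, and L the transcript length in bp. The function names are hypothetical, and the sketch is not the pipeline used by the authors.

```python
def rpkm(read_count, transcript_length_bp, total_mapped_reads):
    """Reads Per Kilobase of transcript per Million mapped reads."""
    if transcript_length_bp <= 0 or total_mapped_reads <= 0:
        raise ValueError("length and total read count must be positive")
    return read_count * 1e9 / (transcript_length_bp * total_mapped_reads)


def gc_content(seq):
    """Fraction of G and C among the unambiguous bases (A, C, G, T) of a sequence."""
    seq = seq.upper()
    counted = sum(seq.count(base) for base in "ACGT")
    if counted == 0:
        raise ValueError("sequence contains no A, C, G or T")
    return (seq.count("G") + seq.count("C")) / counted
```

    For example, a hypothetical 2,000 bp transcript with 500 mapped reads in a library of 10 million mapped reads has an RPKM of 25, and the sequence `ATGC` has a GC content of 0.5.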

  16. Comparative Transcriptomic Approaches Exploring Contamination Stress Tolerance in Salix sp. Reveal the Importance for a Metaorganismal de Novo Assembly Approach for Nonmodel Plants.

    PubMed

    Brereton, Nicholas J B; Gonzalez, Emmanuel; Marleau, Julie; Nissim, Werther Guidi; Labrecque, Michel; Joly, Simon; Pitre, Frederic E

    2016-05-01

    Metatranscriptomic study of nonmodel organisms requires strategies that retain the highly resolved genetic information generated from model organisms while allowing for identification of the unexpected. A real-world biological application of phytoremediation, the field growth of 10 Salix cultivars on polluted soils, was used as an exemplar nonmodel and multifaceted crop response well-disposed to the study of gene expression. Sequence reads were assembled de novo to create 10 independent transcriptomes, a global transcriptome, and were mapped against the Salix purpurea 94006 reference genome. Annotation of assembled contigs was performed without a priori assumption of the originating organism. Global transcriptome construction from 3.03 billion paired-end reads revealed 606,880 unique contigs annotated from 1588 species, often common in all 10 cultivars. Comparisons between transcriptomic and metatranscriptomic methodologies provide clear evidence that nonnative RNA can mistakenly map to reference genomes, especially to conserved regions of common housekeeping genes, such as actin, α/β-tubulin, and elongation factor 1-α. In Salix, Rubisco activase transcripts were down-regulated in contaminated trees across all 10 cultivars, whereas thiamine thiazole synthase and CP12, a Calvin Cycle master regulator, were uniformly up-regulated. De novo assembly approaches, with unconstrained annotation, can improve data quality; care should be taken when exploring such plant genetics to reduce de facto data exclusion by mapping to a single reference genome alone. Salix gene expression patterns strongly suggest cultivar-wide alteration of specific photosynthetic apparatus and protection of the antenna complexes from oxidation damage in contaminated trees, providing an insight into common stress tolerance strategies in a real-world phytoremediation system. PMID:27002060

  17. Comparative Transcriptomic Approaches Exploring Contamination Stress Tolerance in Salix sp. Reveal the Importance for a Metaorganismal de Novo Assembly Approach for Nonmodel Plants

    PubMed Central

    Brereton, Nicholas J. B.; Marleau, Julie; Nissim, Werther Guidi; Labrecque, Michel; Joly, Simon; Pitre, Frederic E.

    2016-01-01

    Metatranscriptomic study of nonmodel organisms requires strategies that retain the highly resolved genetic information generated from model organisms while allowing for identification of the unexpected. A real-world biological application of phytoremediation, the field growth of 10 Salix cultivars on polluted soils, was used as an exemplar nonmodel and multifaceted crop response well-disposed to the study of gene expression. Sequence reads were assembled de novo to create 10 independent transcriptomes, a global transcriptome, and were mapped against the Salix purpurea 94006 reference genome. Annotation of assembled contigs was performed without a priori assumption of the originating organism. Global transcriptome construction from 3.03 billion paired-end reads revealed 606,880 unique contigs annotated from 1588 species, often common in all 10 cultivars. Comparisons between transcriptomic and metatranscriptomic methodologies provide clear evidence that nonnative RNA can mistakenly map to reference genomes, especially to conserved regions of common housekeeping genes, such as actin, α/β-tubulin, and elongation factor 1-α. In Salix, Rubisco activase transcripts were down-regulated in contaminated trees across all 10 cultivars, whereas thiamine thiazole synthase and CP12, a Calvin Cycle master regulator, were uniformly up-regulated. De novo assembly approaches, with unconstrained annotation, can improve data quality; care should be taken when exploring such plant genetics to reduce de facto data exclusion by mapping to a single reference genome alone. Salix gene expression patterns strongly suggest cultivar-wide alteration of specific photosynthetic apparatus and protection of the antenna complexes from oxidation damage in contaminated trees, providing an insight into common stress tolerance strategies in a real-world phytoremediation system. PMID:27002060

  18. De Novo Assembly, Gene Annotation, and Marker Discovery in Stored-Product Pest Liposcelis entomophila (Enderlein) Using Transcriptome Sequences

    PubMed Central

    Wei, Dan-Dan; Chen, Er-Hu; Ding, Tian-Bo; Chen, Shi-Chun; Dou, Wei; Wang, Jin-Jun

    2013-01-01

    Background As a major stored-product pest insect, Liposcelis entomophila has developed high levels of resistance to various insecticides in grain storage systems. However, the molecular mechanisms underlying resistance and environmental stress have not been characterized. To date, there is a lack of genomic information for this species. Therefore, studies aimed at profiling the L. entomophila transcriptome would provide a better understanding of the biological functions at the molecular level. Methodology/Principal Findings We applied Illumina sequencing technology to sequence the transcriptome of L. entomophila. A total of 54,406,328 clean reads were obtained and assembled de novo into 54,220 unigenes, with an average length of 571 bp. Through a similarity search, 33,404 (61.61%) unigenes were matched to known proteins in the NCBI non-redundant (Nr) protein database. These unigenes were further functionally annotated with gene ontology (GO), cluster of orthologous groups of proteins (COG), and Kyoto Encyclopedia of Genes and Genomes (KEGG) databases. A large number of genes potentially involved in insecticide resistance were manually curated, including 68 putative cytochrome P450 genes, 37 putative glutathione S-transferase (GST) genes, 19 putative carboxyl/cholinesterase (CCE) genes, and 126 other transcripts containing target site sequences or encoding detoxification genes, representing eight types of resistance enzymes. Furthermore, to gain insight into the molecular basis of the L. entomophila response to thermal stresses, 25 heat shock protein (Hsp) genes were identified. In addition, 1,100 SSRs and 57,757 SNPs were detected and 231 pairs of SSR primers were designed for investigating genetic diversity in the future. Conclusions/Significance We developed a comprehensive transcriptomic database for L. entomophila. These sequences and putative molecular markers would further promote our understanding of the molecular mechanisms underlying insecticide resistance

  19. Transcriptome sequencing and de novo characterization of Korean endemic land snail, Koreanohadra kurodana for functional transcripts and SSR markers.

    PubMed

    Kang, Se Won; Patnaik, Bharat Bhusan; Hwang, Hee-Ju; Park, So Young; Chung, Jong Min; Song, Dae Kwon; Patnaik, Hongray Howrelia; Lee, Jae Bong; Kim, Changmu; Kim, Soonok; Park, Hong Seog; Han, Yeon Soo; Lee, Jun Sang; Lee, Yong Seok

    2016-10-01

    The Korean endemic land snail Koreanohadra kurodana (Gastropoda: Bradybaenidae) found in humid areas of broadleaf forests and shrubs have been considered vulnerable as the number of individuals are declining in recent years. The species is poorly characterized at the genomic level that limits the understanding of functions at the molecular and genetics level. In the present study, we performed de novo transcriptome sequencing to produce a comprehensive transcript dataset of visceral mass tissue of K. kurodana by the Illumina paired-end sequencing technology. Over 234 million quality reads were assembled to a total of 315,924 contigs and 191,071 unigenes, with an average and N50 length of 585.6 and 715 bp and 678 and 927 bp, respectively. Overall, 36.32 % of the unigenes found matches to known protein/nucleotide sequences in the public databases. The direction of the unigenes to functional categories was determined using COG, GO, KEGG, and InterProScan protein domain search. The GO analysis search resulted in 22,967 unigenes (12.02 %) being categorized into 40 functional groups. The KEGG annotation revealed that metabolism pathway genes were enriched. The most prominent protein motifs include the zinc finger, ribonuclease H, reverse transcriptase, and ankyrin repeat domains. The simple sequence repeats (SSRs) identified from >1 kb length of unigenes show a dominancy of dinucleotide repeat motifs followed with tri- and tetranucleotide motifs. A number of unigenes were putatively assessed to belong to adaptation and defense mechanisms including heat shock proteins 70, Toll-like receptor 4, AMP-activated protein kinase, aquaporin-2, etc. Our data provide a rich source for the identification and functional characterization of new genes and candidate polymorphic SSR markers in K. kurodana. The availability of transcriptome information ( http://bioinfo.sch.ac.kr/submission/ ) would promote the utilization of the resources for phylogenetics study and genetic diversity
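
    The record above reports N50 lengths for its contigs and unigenes. As a brief illustration of that statistic, the sketch below computes N50 in the usual way: sort the sequence lengths from longest to shortest and return the length at which the running total first reaches half of all assembled bases. The function name and the sample lengths are hypothetical and are not taken from the study.

```python
def n50(lengths):
    """Return the N50 of a collection of contig (or unigene) lengths."""
    ordered = sorted(lengths, reverse=True)
    if not ordered:
        raise ValueError("need at least one sequence length")
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        # The first length at which the cumulative sum covers half the assembly.
        if running >= half_total:
            return length
```

    With the made-up lengths 100, 200, 300 and 400 bp, the total is 1,000 bp; the two longest sequences together reach 700 bp, so the N50 is 300 bp. An N50 larger than the mean length, as reported in the abstract, indicates that a small number of long sequences hold a large share of the assembled bases.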

  20. Disease-Targeted Sequencing of Ion Channel Genes identifies de novo mutations in Patients with Non-Familial Brugada Syndrome

    PubMed Central

    Juang, Jyh-Ming Jimmy; Lu, Tzu-Pin; Lai, Liang-Chuan; Ho, Chia-Chuan; Liu, Yen-Bin; Tsai, Chia-Ti; Lin, Lian-Yu; Yu, Chih-Chieh; Chen, Wen-Jone; Chiang, Fu-Tien; Yeh, Shih-Fan Sherri; Lai, Ling-Ping; Chuang, Eric Y.; Lin, Jiunn-Lee

    2014-01-01

    Brugada syndrome (BrS) is one of the ion channelopathies associated with sudden cardiac death (SCD). The most common BrS-associated gene (SCN5A) only accounts for approximately 20–25% of BrS patients. This study aims to identify novel mutations across human ion channels in non-familial BrS patients without SCN5A variants through disease-targeted sequencing. We performed disease-targeted multi-gene sequencing across 133 human ion channel genes and 12 reported BrS-associated genes in 15 unrelated, non-familial BrS patients without SCN5A variants. Candidate variants were validated by mass spectrometry and Sanger sequencing. Five de novo mutations were identified in four genes (SCNN1A, KCNJ16, KCNB2, and KCNT1) in three BrS patients (20%). Two of the three patients presented with SCD and one had syncope. Interestingly, the two patients who presented with SCD had compound mutations (SCNN1A:Arg350Gln and KCNB2:Glu522Lys; SCNN1A:Arg597* and KCNJ16:Ser261Gly). Importantly, two SCNN1A mutations were identified from different families. The KCNT1:Arg1106Gln mutation was identified in a patient with syncope. Bioinformatics algorithms predicted severe functional interruptions in these four mutation loci, suggesting their pivotal roles in BrS. This study identified four novel BrS-associated genes and indicated the effectiveness of this disease-targeted sequencing across ion channel genes for non-familial BrS patients without SCN5A variants. PMID:25339316

  1. De novo RNA sequencing and transcriptome analysis of Colletotrichum gloeosporioides ES026 reveal genes related to biosynthesis of huperzine A.

    PubMed

    Zhang, Guowei; Wang, Wenjuan; Zhang, Xiangmei; Xia, Qianqian; Zhao, Xinmei; Ahn, Youngjoon; Ahmed, Nevin; Cosoveanu, Andreea; Wang, Mo; Wang, Jialu; Shu, Shaohua

    2015-01-01

    Huperzine A is important in the treatment of Alzheimer's disease. There are major challenges for the mass production of huperzine A from plants due to the limited number of huperzine-A-producing plants, as well as the low content of huperzine A in these plants. Various endophytic fungi produce huperzine A. Colletotrichum gloeosporioides ES026 was previously isolated from a huperzine-A-producing plant Huperzia serrata, and this fungus also produces huperzine A. In this study, de novo RNA sequencing of C. gloeosporioides ES026 was carried out with an Illumina HiSeq2000. A total of 4,324,299,051 bp from 50,442,617 high-quality sequence reads of ES026 were obtained. These raw data were assembled into 24,998 unigenes, 40,536,684 residues and 19,790 genes. The majority of the unique sequences were assigned to corresponding putative functions based on BLAST searches of public databases. The molecular functions, biological processes and biochemical pathways of these unique sequences were determined using gene ontology (GO) and Kyoto Encyclopedia of Genes and Genomes (KEGG) assignments. A gene encoding copper amine oxidase (CAO) (unigene 9322) was annotated for the conversion of cadaverine to 5-aminopentanal in the biosynthesis of huperzine A. This gene was also detected in the root, stem and leaf of H. serrata. Furthermore, a close relationship was observed between expression of the CAO gene (unigene 9322) and quantity of crude huperzine A extracted from ES026. Therefore, CAO might be involved in the biosynthesis of huperzine A and it most likely plays a key role in regulating the content of huperzine A in ES026. PMID:25799531

  2. De Novo Assembly of Bitter Gourd Transcriptomes: Gene Expression and Sequence Variations in Gynoecious and Monoecious Lines.

    PubMed

    Shukla, Anjali; Singh, V K; Bharadwaj, D R; Kumar, Rajesh; Rai, Ashutosh; Rai, A K; Mugasimangalam, Raja; Parameswaran, Sriram; Singh, Major; Naik, P S

    2015-01-01

    Bitter gourd (Momordica charantia L.) is a nutritious vegetable crop of Asian origin, used as a medicinal herb in Indian and Chinese traditional medicine. Molecular breeding in bitter gourd is in its infancy, due to limited molecular resources, particularly on functional markers for traits such as gynoecy. We performed de novo transcriptome sequencing of bitter gourd using an Illumina next-generation sequencer, from root, flower bud, stem and leaf samples of a gynoecious line (Gy323) and a monoecious line (DRAR1). A total of 65,540 transcripts for Gy323 and 61,490 for DRAR1 were obtained. Comparisons revealed SNP and SSR variations between these lines and allowed the identification of gene classes. Based on available transcripts we identified 80 WRKY transcription factors, several of them reported in responses to biotic and abiotic stresses, and 56 ARF genes, which play a pivotal role in auxin-regulated gene expression and development. The data presented will be useful in both functional studies and breeding programs in bitter gourd. PMID:26047102

  3. Novel proline-hydroxyproline glycopeptides from the dandelion (Taraxacum officinale Wigg.) flowers: de novo sequencing and biological activity.

    PubMed

    Astafieva, Alexandra A; Enyenihi, Atim A; Rogozhin, Eugene A; Kozlov, Sergey A; Grishin, Eugene V; Odintsova, Tatyana I; Zubarev, Roman A; Egorov, Tsezi A

    2015-09-01

    Two novel homologous peptides named ToHyp1 and ToHyp2 that show no similarity to any known proteins were isolated from Taraxacum officinale Wigg. flowers by multidimensional liquid chromatography. Amino acid and mass spectrometry analyses demonstrated that the peptides have an unusual structure: they are cysteine-free, proline-hydroxyproline-rich and post-translationally glycosylated by pentoses, with 5 carbohydrates in ToHyp2 and 10 in ToHyp1. The ToHyp2 peptide, with a monoisotopic molecular mass of 4350.3 Da, was completely sequenced by a combination of Edman degradation and de novo sequencing via top-down multistage collision induced dissociation (CID) and higher energy dissociation (HCD) tandem mass spectrometry (MS(n)). ToHyp2 consists of 35 amino acids, including 18 proline residues, 8 of which are hydroxylated. The peptide displays antifungal activity and inhibits growth of Gram-positive and Gram-negative bacteria. We further showed that carbohydrate moieties have no significant impact on the peptide structure, but are important for antifungal activity although not absolutely necessary. The deglycosylated ToHyp2 peptide was less active against the susceptible fungus Bipolaris sorokiniana than the native peptide. Unique structural features of the ToHyp2 peptide place it into a new family of plant defense peptides. The discovery of ToHyp peptides in T. officinale flowers expands the repertoire of molecules of plant origin with practical applications. PMID:26259198

  4. De Novo Assembly of Bitter Gourd Transcriptomes: Gene Expression and Sequence Variations in Gynoecious and Monoecious Lines

    PubMed Central

    Shukla, Anjali; Singh, V. K.; Bharadwaj, D. R.; Kumar, Rajesh; Rai, Ashutosh; Rai, A. K.; Mugasimangalam, Raja; Parameswaran, Sriram; Singh, Major; Naik, P. S.

    2015-01-01

    Bitter gourd (Momordica charantia L.) is a nutritious vegetable crop of Asian origin, used as a medicinal herb in Indian and Chinese traditional medicine. Molecular breeding in bitter gourd is in its infancy, due to limited molecular resources, particularly on functional markers for traits such as gynoecy. We performed de novo transcriptome sequencing of bitter gourd using an Illumina next-generation sequencer, from root, flower bud, stem and leaf samples of a gynoecious line (Gy323) and a monoecious line (DRAR1). A total of 65,540 transcripts for Gy323 and 61,490 for DRAR1 were obtained. Comparisons revealed SNP and SSR variations between these lines and allowed the identification of gene classes. Based on available transcripts we identified 80 WRKY transcription factors, several of them reported in responses to biotic and abiotic stresses, and 56 ARF genes, which play a pivotal role in auxin-regulated gene expression and development. The data presented will be useful in both functional studies and breeding programs in bitter gourd. PMID:26047102

  5. De Novo Transcriptome Sequencing of Olea europaea L. to Identify Genes Involved in the Development of the Pollen Tube.

    PubMed

    Iaria, Domenico; Chiappetta, Adriana; Muzzalupo, Innocenzo

    2016-01-01

    In olive (Olea europaea L.), the processes controlling self-incompatibility are still unclear and the molecular basis underlying this process is still not fully characterized. In order to determine compatibility relationships, using next-generation sequencing techniques and a de novo transcriptome assembly strategy, we show that pollen tubes from different olive plants, grown in vitro in a medium containing its own pistil and in combination pollen/pistil from self-sterile and self-fertile cultivars, have a distinct gene expression profile and many of the differentially expressed sequences between the samples fall within gene families involved in the development of the pollen tube, such as lipase, carboxylesterase, pectinesterase, pectin methylesterase, and callose synthase. Moreover, different genes involved in signal transduction, transcription, and growth are overrepresented. The analysis also allowed us to identify members of the actin, actin depolymerization factor, and fibrin gene families and a member of the Ca2+ binding gene family related to the development and polarization of the pollen apical tip. The whole transcriptomic analysis, through the identification of the differentially expressed transcripts set and an extended functional annotation analysis, will lead to a better understanding of the mechanisms of pollen germination and pollen tube growth in the olive. PMID:26998509

  6. De Novo Transcriptome Sequencing of Olea europaea L. to Identify Genes Involved in the Development of the Pollen Tube

    PubMed Central

    Iaria, Domenico

    2016-01-01

    In olive (Olea europaea L.), the processes controlling self-incompatibility are still unclear and the molecular basis underlying this process is still not fully characterized. In order to determine compatibility relationships, using next-generation sequencing techniques and a de novo transcriptome assembly strategy, we show that pollen tubes from different olive plants, grown in vitro in a medium containing its own pistil and in combination pollen/pistil from self-sterile and self-fertile cultivars, have a distinct gene expression profile and many of the differentially expressed sequences between the samples fall within gene families involved in the development of the pollen tube, such as lipase, carboxylesterase, pectinesterase, pectin methylesterase, and callose synthase. Moreover, different genes involved in signal transduction, transcription, and growth are overrepresented. The analysis also allowed us to identify members of the actin, actin depolymerization factor, and fibrin gene families and a member of the Ca2+ binding gene family related to the development and polarization of the pollen apical tip. The whole transcriptomic analysis, through the identification of the differentially expressed transcripts set and an extended functional annotation analysis, will lead to a better understanding of the mechanisms of pollen germination and pollen tube growth in the olive. PMID:26998509

  7. De Novo Transcriptome Sequencing of Low Temperature-Treated Phlox subulata and Analysis of the Genes Involved in Cold Stress

    PubMed Central

    Qu, Yanting; Zhou, Aimin; Zhang, Xing; Tang, Huanwei; Liang, Ming; Han, Hui; Zuo, Yuhu

    2015-01-01

    Phlox subulata, a perennial herbaceous flower, can survive during the winter of northeast China, where the temperature can drop to −30 °C, suggesting that P. subulata is an ideal model for studying the molecular mechanisms of cold acclimation in plants. However, little is known about the gene expression profile of P. subulata under cold stress. Here, we examined changes in cold stress-related genes in P. subulata. We sequenced three cold-treated (CT) and control (CK) samples of P. subulata. After de novo assembly and quantitative assessment of the obtained reads, 99,174 unigenes were generated. Based on similarity searches with known proteins in public protein databases, 59,994 unigenes were functionally annotated. Among all differentially expressed genes (DEGs), 8302, 10,638 and 11,021 up-regulated genes and 9898, 17,876, and 12,358 down-regulated genes were identified after treatment at 4, 0, and −10 °C, respectively. Furthermore, 3417 up-regulated unigenes were expressed only in CT samples. Twenty major cold-related genes, including transcription factors, antioxidant enzymes, osmoregulation proteins, and Ca2+ and ABA signaling components, were identified, and their expression levels were estimated. Overall, this is the first transcriptome sequencing of this plant species under cold stress. Studies of DEGs involved in cold-related metabolic pathways may facilitate the discovery of cold-resistance genes. PMID:25938968

  8. De novo sequencing, assembly and analysis of salivary gland transcriptome of Haemaphysalis flava and identification of sialoprotein genes.

    PubMed

    Xu, Xing-Li; Cheng, Tian-Yin; Yang, Hu; Yan, Fen; Yang, Ya

    2015-06-01

    Saliva plays an important role in feeding and pathogen transmission, and the identification and analysis of tick salivary gland (SG) proteins is considered a hot spot in anti-tick research. Herein, we present the first description of the SG transcriptome of Haemaphysalis flava using next-generation sequencing (NGS). A total of over 143 million high-quality reads were assembled into 54,357 unigenes, of which 20,145 (37.06%) had significant similarities to proteins in the Swiss-Prot database. 13,513 annotated sequences were associated with GO terms. Kyoto Encyclopedia of Genes and Genomes (KEGG) analysis showed that 14,280 unigenes were assigned to 279 KEGG pathways in total. Reads per kb per million reads (RPKM) analysis showed that there were 3035 down-regulated unigenes and 2260 up-regulated unigenes in the engorged ticks (ET) compared with the semi-engorged ticks (SET). Several important genes associated with blood feeding and ingestion as secreted salivary proteins, including cysteine, longipain, 4D8, calreticulin, metalloproteases, serine protease inhibitor, enolase, heat shock protein and AV422 in SG, were identified. The qRT-PCR results confirmed that the expression patterns of these genes (except for the longipain gene) were consistent with RNA-seq results. This de novo assembly of the SG transcriptome of H. flava not only provides more opportunities for screening and cloning functional genes, but also forms a solid basis for further insight into the changes of salivary proteins during blood-feeding. PMID:25784566
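
    The RPKM measure mentioned in this abstract normalizes a raw read count by transcript length and by sequencing depth. The following is a minimal sketch of the standard definition only; the function name and the example numbers are hypothetical and are not taken from the study.

```python
def rpkm(read_count: int, gene_length_bp: int, total_mapped_reads: int) -> float:
    """Reads per kilobase of transcript per million mapped reads.

    RPKM = reads * 1e9 / (gene length in bp * total mapped reads)
    """
    if gene_length_bp <= 0 or total_mapped_reads <= 0:
        raise ValueError("gene length and total mapped reads must be positive")
    return read_count * 1e9 / (gene_length_bp * total_mapped_reads)


# Hypothetical unigene: 500 reads on a 2,000 bp transcript,
# in a library of 10 million mapped reads.
example_value = rpkm(500, 2_000, 10_000_000)
print(example_value)
```

    Comparing such values between two conditions (for example engorged versus semi-engorged ticks) is what allows unigenes to be called up- or down-regulated; the thresholds used for that call are study-specific and are not shown here.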

  9. De novo sequencing of antimicrobial peptides isolated from the venom glands of the wolf spider Lycosa singoriensis.

    PubMed

    Budnik, B A; Olsen, J V; Egorov, T A; Anisimova, V E; Galkina, T G; Musolyamov, A K; Grishin, E V; Zubarev, R A

    2004-02-01

    Antimicrobial peptides (AMPs), named lycocitin 1, 2 and 3, and a peptide with a monoisotopic molecular mass of 3038.70 Da were detected in the venom glands of the wolf spider Lycosa singoriensis. Two of the peptides, lycocitin 1 and 2, are new AMPs whereas lycocitin 3 is highly homologous to lycotoxin II isolated from the venom of spider Lycosa carolinensis. In addition, two other peptides with monoisotopic masses of 2034.20 and 2340.28 Da showing the motif typical for antimicrobial peptides were also identified. These peptides and lycocitin 1, 2 and 3 were de novo sequenced using electron capture dissociation and low-energy collisional tandem mass spectrometry. The amino acid sequence of lycocitin 1 was determined as GKLQAFLAKMKEIAAQTL-NH(2). Lycocitin 2 differs from lycocitin 1 by a replacement of a lysine residue for an arginine residue at the second position. Lycocitin 3 differs from the known lycotoxin II consisting of 27 amino acid residues by a deletion of Gly-26. Both lycocitin 1 and 2 inhibit growth of Gram-positive (Staphylococcus aureus, Bacillus subtilis) and Gram-negative (Escherichia coli) bacteria and fungi (Candida albicans, Pseudomonas aeruginosa) at micromolar concentrations. PMID:14991689

  10. Increased Frequency of De Novo Copy Number Variations in Congenital Heart Disease by Integrative Analysis of SNP Array and Exome Sequence Data

    PubMed Central

    Rodriguez-Murillo, Laura; Fromer, Menachem; Mazaika, Erica; Vardarajan, Badri; Italia, Michael; Leipzig, Jeremy; DePalma, Steven R.; Golhar, Ryan; Sanders, Stephan J.; Yamrom, Boris; Ronemus, Michael; Iossifov, Ivan; Willsey, A. Jeremy; State, Matthew W.; Kaltman, Jonathan R.; White, Peter S.; Shen, Yufeng; Warburton, Dorothy; Brueckner, Martina; Seidman, Christine; Goldmuntz, Elizabeth; Gelb, Bruce D.; Lifton, Richard; Seidman, Jonathan; Hakonarson, Hakon; Chung, Wendy K.

    2014-01-01

    Rationale: Congenital heart disease (CHD) is among the most common birth defects. Most cases are of unknown etiology. Objective: To determine the contribution of de novo copy number variants (CNVs) in the etiology of sporadic CHD. Methods and Results: We studied 538 CHD trios using genome-wide dense single nucleotide polymorphism (SNP) arrays and/or whole exome sequencing (WES). Results were experimentally validated using digital droplet PCR. We compared validated CNVs in CHD cases to CNVs in 1,301 healthy control trios. The two complementary high-resolution technologies identified 63 validated de novo CNVs in 51 CHD cases. A significant increase in CNV burden was observed when comparing CHD trios with healthy trios, using either SNP array (p=7×10^−5, Odds Ratio (OR)=4.6) or WES data (p=6×10^−4, OR=3.5) and remained after removing 16% of de novo CNV loci previously reported as pathogenic (p=0.02, OR=2.7). We observed recurrent de novo CNVs on 15q11.2 encompassing CYFIP1, NIPA1, and NIPA2 and single de novo CNVs encompassing DUSP1, JUN, JUP, MED15, MED9, PTPRE, SREBF1, TOP2A, and ZEB2, genes that interact with established CHD proteins NKX2-5 and GATA4. Integrating de novo variants in WES and CNV data suggests that ETS1 is the pathogenic gene altered by 11q24.2-q25 deletions in Jacobsen syndrome and that CTBP2 is the pathogenic gene in 10q sub-telomeric deletions. Conclusions: We demonstrate a significantly increased frequency of rare de novo CNVs in CHD patients compared with healthy controls and suggest several novel genetic loci for CHD. PMID:25205790

  11. Fine De Novo Sequencing of a Fungal Genome Using only SOLiD Short Read Data: Verification on Aspergillus oryzae RIB40

    PubMed Central

    Takeda, Itaru; Hagiwara, Hiroko; Ikegami, Tsutomu; Koike, Hideaki; Machida, Masayuki

    2013-01-01

    The development of next-generation sequencing (NGS) technologies has dramatically increased the throughput, speed, and efficiency of genome sequencing. The short read data generated from NGS platforms, such as SOLiD and Illumina, are quite useful for mapping analysis. However, the SOLiD read data with lengths of <60 bp have been considered to be too short for de novo genome sequencing. Here, to investigate whether de novo sequencing of fungal genomes is possible using only SOLiD short read sequence data, we performed de novo assembly of the Aspergillus oryzae RIB40 genome using only SOLiD read data of 50 bp generated from mate-paired libraries with 2.8- or 1.9-kb insert sizes. The assembled scaffolds showed an N50 value of 1.6 Mb, a 22-fold increase over those obtained using only SOLiD short reads in other published reports. In addition, almost 99% of the reference genome was accurately aligned by the assembled scaffold fragments in long lengths. The sequences of secondary metabolite biosynthetic genes and clusters, whose products are of considerable interest in fungal studies due to their potential medicinal, agricultural, and cosmetic properties, were also highly reconstructed in the assembled scaffolds. Based on these findings, we concluded that de novo genome sequencing using only SOLiD short reads is feasible and practical for molecular biological study of fungi. We also investigated the effect of filtering low quality data, library insert size, and k-mer size on the assembly performance, and for assembly we recommend mildly filtered read data (for which the N50 was not greatly degraded), a library insert size of ∼2.0 kb, and a k-mer size of 33. PMID:23667655
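
    Several records in this listing report assembly quality as an N50 value. As a rough illustration of how that statistic is conventionally computed, the sketch below sorts scaffold lengths from longest to shortest and returns the length at which the running total first reaches half of the assembly size. The function name and the example lengths are hypothetical; none of the numbers come from the study.

```python
def n50(lengths: list[int]) -> int:
    """Return the N50 of a collection of contig or scaffold lengths.

    N50 is the length L such that sequences of length >= L together
    contain at least half of all assembled bases.
    """
    if not lengths:
        raise ValueError("at least one sequence length is required")
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length
    raise AssertionError("unreachable: running total always reaches half")


# Hypothetical scaffold lengths (total 290; half is 145).
print(n50([80, 70, 50, 40, 30, 20]))
```

    Because N50 depends on the total assembled length, values are only directly comparable between assemblies of similar total size.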

  12. Sequence analysis for a de novo genome assembly of Bos indicus (Nelore) cattle

    Technology Transfer Automated Retrieval System (TEKTRAN)

    A second draft sequence assembly of the bovine genome based on the sub-species, Bos indicus, is essential to better evaluate the genetic variation underlying the prototypical beef and dairy cattle in tropical and sub-tropical production environments. A linebred bull (Futuro), two generations remove...

  13. Sequencing and De novo Draft Assemblies of the Fathead Minnow (Pimephales promelas) Reference Genome

    EPA Science Inventory

    This study was undertaken to develop genome-scale resources for the fathead minnow (Pimephales promelas), an important model organism widely used in both aquatic ecotoxicology research and in regulatory toxicity testing. We report on the first sequencing and two draft assemblies fo...

  14. De Novo Genome Sequence of "Candidatus Liberibacter solanacearum" from a Single Potato Psyllid in California.

    PubMed

    Wu, F; Deng, X; Liang, G; Wallis, C; Trumble, J T; Prager, S; Chen, J

    2015-01-01

    The draft genome sequence of "Candidatus Liberibacter solanacearum" strain RSTM from a potato psyllid (Bactericera cockerelli) in California is reported here. The RSTM strain has a genome size of 1,286,787 bp, a G+C content of 35.1%, 1,211 predicted open reading frames (ORFs), and 43 RNA genes. PMID:26679599

  15. De novo Transcriptome Sequencing Reveals a Considerable Bias in the Incidence of Simple Sequence Repeats towards the Downstream of ‘Pre-miRNAs’ of Black Pepper

    PubMed Central

    Joy, Nisha; Asha, Srinivasan; Mallika, Vijayan; Soniya, Eppurathu Vasudevan

    2013-01-01

    Next generation sequencing has an advantage in the transformational development of species with limited available sequence data as it helps to decode the genome and transcriptome. We carried out de novo sequencing using the Illumina HiSeq™ 2000 to generate the first leaf transcriptome of black pepper (Piper nigrum L.), an important spice variety native to South India and also grown in other tropical regions. Despite the economic and biochemical importance of pepper, a scientifically rigorous study at the molecular level is far from complete due to the lack of sufficient sequence information and the cytological complexity of its genome. The 55 million raw reads obtained, when assembled using the Trinity program, generated 223,386 contigs and 128,157 unigenes. Reports suggest that the repeat-rich genomic regions give rise to small non-coding functional RNAs. MicroRNAs (miRNAs) are the most abundant type of non-coding regulatory RNAs. In spite of the widespread research on miRNAs, little is known about the hair-pin precursors of miRNAs bearing Simple Sequence Repeats (SSRs). We used the array of transcripts generated for the in silico prediction and detection of ‘43 pre-miRNA candidates bearing different types of SSR motifs’. The analysis identified 3913 different types of SSR motifs with an average of one SSR per 3.04 MB of the transcriptome. About 0.033% of the transcriptome constituted ‘pre-miRNA candidates bearing SSRs’. The abundance, type and distribution of SSR motifs studied across the hair-pin miRNA precursors showed a significant bias in the position of SSRs towards the downstream of predicted ‘pre-miRNA candidates’. The catalogue of transcripts identified, together with the demonstration of reliable existence of SSRs in the miRNA precursors, permits future opportunities for understanding the genetic mechanism of black pepper and likely functions of ‘tandem repeats’ in miRNAs. PMID:23469176
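
    Simple sequence repeats of the kind surveyed in this and other records are tandem repetitions of a short motif, so a basic scan can be written with a regular-expression backreference. The sketch below is illustrative only and is not the pipeline used in the study: the function names and the minimum repeat counts (6 for di-, 5 for tri-, 4 for tetranucleotide motifs) are assumptions chosen for the example.

```python
import re

# Assumed thresholds for illustration: motif length -> minimum number of repeats.
MIN_REPEATS = {2: 6, 3: 5, 4: 4}


def is_compound_of_shorter_unit(unit: str) -> bool:
    """True if the motif is itself a repeat of a shorter motif (e.g. AA, ATAT)."""
    n = len(unit)
    return any(
        n % d == 0 and unit == unit[:d] * (n // d)
        for d in range(1, n)
    )


def find_ssrs(seq: str) -> list[tuple[int, str, int]]:
    """Return (start position, motif, repeat count) for each SSR found in seq."""
    seq = seq.upper()
    hits = []
    for unit_len, min_rep in MIN_REPEATS.items():
        # ([ACGT]{n}) captures one motif; \1{m,} requires m further tandem copies.
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (unit_len, min_rep - 1))
        for match in pattern.finditer(seq):
            unit = match.group(1)
            if is_compound_of_shorter_unit(unit):
                continue
            hits.append((match.start(), unit, len(match.group(0)) // unit_len))
    return sorted(hits)


# Hypothetical transcript fragment with an (AT)7 and a (GAA)5 repeat.
fragment = "GGC" + "AT" * 7 + "CCG" + "GAA" * 5 + "T"
print(find_ssrs(fragment))
```

    Positions are 0-based. A real survey would also need to handle compound and imperfect repeats, which this sketch ignores.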

  16. A Fast Alignment-Free Approach for De Novo Detection of Protein Conserved Regions

    PubMed Central

    Abnousi, Armen; Broschat, Shira L.; Kalyanaraman, Ananth

    2016-01-01

    Background: Identifying conserved regions in protein sequences is a fundamental operation, occurring in numerous sequence-driven analysis pipelines. It is used as a way to decode domain-rich regions within proteins, to compute protein clusters, to annotate sequence function, and to compute evolutionary relationships among protein sequences. A number of approaches exist for identifying and characterizing protein families based on their domains, and because domains represent conserved portions of a protein sequence, the primary computation involved in protein family characterization is identification of such conserved regions. However, identifying conserved regions from large collections (millions) of protein sequences presents significant challenges. Methods: In this paper we present a new, alignment-free method for detecting conserved regions in protein sequences called NADDA (No-Alignment Domain Detection Algorithm). Our method exploits the abundance of exact matching short subsequences (k-mers) to quickly detect conserved regions, and the power of machine learning is used to improve the prediction accuracy of detection. We present a parallel implementation of NADDA using the MapReduce framework and show that our method is highly scalable. Results: We have compared NADDA with Pfam and InterPro databases. For known domains annotated by Pfam, accuracy is 83%, sensitivity 96%, and specificity 44%. For sequences with new domains not present in the training set an average accuracy of 63% is achieved when compared to Pfam. A boost in results in comparison with InterPro demonstrates the ability of NADDA to capture conserved regions beyond those present in Pfam. We have also compared NADDA with ADDA and MKDOM2, assuming Pfam as ground-truth. On average NADDA shows comparable accuracy, more balanced sensitivity and specificity, and being alignment-free, is significantly faster. Excluding the one-time cost of training, runtimes on a single processor were 49s, 10,566s, and 456s
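
    The abstract describes using the abundance of exactly matching k-mers as a signal for conserved regions. The sketch below illustrates only that general idea and is not NADDA itself: it omits the machine-learning step and the MapReduce implementation, and the function names, the value of k, and the support threshold are all assumptions made for the example.

```python
from collections import Counter


def kmer_support(sequences: list[str], k: int) -> Counter:
    """Count, for every k-mer, how many sequences contain it at least once."""
    support = Counter()
    for seq in sequences:
        # A set ensures each sequence contributes at most once per k-mer.
        support.update({seq[i:i + k] for i in range(len(seq) - k + 1)})
    return support


def conserved_positions(query: str, sequences: list[str], k: int,
                        min_support: int) -> list[bool]:
    """Flag each residue of query covered by a k-mer found in >= min_support sequences."""
    support = kmer_support(sequences, k)
    flags = [False] * len(query)
    for i in range(len(query) - k + 1):
        if support[query[i:i + k]] >= min_support:
            for j in range(i, i + k):
                flags[j] = True
    return flags


# Three hypothetical protein fragments sharing the segment "VLAAG".
proteins = ["MKVLAAGIT", "QQVLAAGPP", "TTVLAAGSS"]
flags = conserved_positions(proteins[0], proteins, k=4, min_support=3)
print("".join("C" if f else "." for f in flags))
```

    Because only exact k-mer matches are counted, no pairwise alignment is needed, which is the property the abstract credits for the method's speed.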

  17. IDBA-MT: de novo assembler for metatranscriptomic data generated from next-generation sequencing technology.

    PubMed

    Leung, Henry C M; Yiu, Siu-Ming; Parkinson, John; Chin, Francis Y L

    2013-07-01

    High-throughput next-generation sequencing technology provides a great opportunity for analyzing metatranscriptomic data. However, the reads produced by these technologies are short, and an assembly step is required to combine the short reads into longer contigs. As there are many repeat patterns in mRNAs from different genomes and the abundance ratio of mRNAs in a sample varies widely, existing assemblers for genomic data, transcriptomic data, and metagenomic data do not work on metatranscriptomic data and produce chimeric contigs, that is, incorrect contigs formed by merging multiple mRNA sequences. To the best of our knowledge, there is no assembler designed for metatranscriptomic data. In this article, we introduce an assembler called IDBA-MT, which is designed for assembling reads from metatranscriptomic data. IDBA-MT produces far fewer chimeric contigs (a reduction of 50% or more) when compared with existing assemblers such as Oases, IDBA-UD, and Trinity. PMID:23829653

  18. De Novo Sequencing and Resurrection of a Human Astrovirus-Neutralizing Antibody

    PubMed Central

    2016-01-01

    Monoclonal antibody (mAb) therapeutics targeting cancer, autoimmune diseases, inflammatory diseases, and infectious diseases are growing exponentially. Although numerous panels of mAbs targeting infectious disease agents have been developed, their progression into clinically useful mAbs is often hindered by the lack of sequence information and/or loss of hybridoma cells that produce them. Here we combine the power of crystallography and mass spectrometry to determine the amino acid sequence and glycosylation modification of the Fab fragment of a potent human astrovirus-neutralizing mAb. We used this information to engineer a recombinant antibody single-chain variable fragment that has the same specificity as the parent monoclonal antibody to bind to the astrovirus capsid protein. This antibody can now potentially be developed as a therapeutic and diagnostic agent. PMID:27213181

  19. Sequencing and de novo assembly of a Dahlia hybrid cultivar transcriptome

    PubMed Central

    Lehnert, Erik M.; Walbot, Virginia

    2014-01-01

    Dahlia variabilis, with an exceptionally high diversity of floral forms and colors, is a popular flower amongst both commercial growers and hobbyists. Recently, some genetic controls of pigment patterns have been elucidated. These studies have been limited, however, by the lack of comprehensive transcriptomic resources for this species. Here we report the sequencing, assembly, and annotation of the transcriptome of the developing leaves, stems, and floral buds of D. variabilis. This resulted in 35,638 contigs, most of which seem to contain the complete coding sequence, and of which 20,881 could be successfully annotated by similarity to UniProt. Furthermore, we conducted a preliminary investigation to identify contigs with expression patterns consistent with tissue-specificity. These results will accelerate research into the genetic controls of pigmentation and floral form of D. variabilis. PMID:25101098

  20. De Novo Assembly of Auricularia polytricha Transcriptome Using Illumina Sequencing for Gene Discovery and SSR Marker Identification

    PubMed Central

    Zhou, Yan; Chen, Lianfu; Fan, Xiuzhi; Bian, Yinbing

    2014-01-01

    Auricularia polytricha (Mont.) Sacc., a type of edible black-brown mushroom with a gelatinous and modality-specific fruiting body, is in high demand in Asia due to its nutritional and medicinal properties. Illumina Solexa sequencing technology was used to generate very large transcript sequences from the mycelium and the mature fruiting body of A. polytricha for gene discovery and molecular marker development. De novo assembly generated 36,483 ESTs with an N50 length of 636 bp. A total of 28,108 ESTs demonstrated significant hits with known proteins in the nr database, and 94.03% of the annotated ESTs showed the greatest similarity to A. delicata, a related species of A. polytricha. Functional categorization of the Gene Ontology (GO), Clusters of Orthologous Groups (COG) and Kyoto Encyclopedia of Genes and Genomes (KEGG) metabolic pathways revealed the conservation of genes involved in various biological processes in A. polytricha. Gene expression profile analysis indicated that a total of 2,057 ESTs were differentially expressed, including 1,020 ESTs that were up-regulated in the mycelium and 1,037 up-regulated in the fruiting body. Functional enrichment showed that the ESTs associated with biosynthesis, metabolism and assembly of proteins were more active in fruiting body development. The expression patterns of homologous transcription factors indicated that the molecular mechanisms of fruiting body formation and development were not exactly the same as for other agarics. Interestingly, an EST encoding tyrosinase was significantly up-regulated in the fruiting body, indicating that melanins accumulated during the processes of the formation of the black-brown color of the fruiting body in A. polytricha development. In addition, a total of 1,715 potential SSRs were detected in this transcriptome. The transcriptome analysis of A. polytricha provides valuable sequence resources and numerous molecular markers to facilitate further functional genomics studies and

  1. De novo genome assembly of the economically important weed horseweed using integrated data from multiple sequencing platforms.

    PubMed

    Peng, Yanhui; Lai, Zhao; Lane, Thomas; Nageswara-Rao, Madhugiri; Okada, Miki; Jasieniuk, Marie; O'Geen, Henriette; Kim, Ryan W; Sammons, R Douglas; Rieseberg, Loren H; Stewart, C Neal

    2014-11-01

    Horseweed (Conyza canadensis), a member of the Compositae (Asteraceae) family, was the first broadleaf weed to evolve resistance to glyphosate. Horseweed, one of the most problematic weeds in the world, is a true diploid (2n = 2x = 18), with the smallest genome of any known agricultural weed (335 Mb). Thus, it is an appropriate candidate to help us understand the genetic and genomic bases of weediness. We undertook a draft de novo genome assembly of horseweed by combining data from multiple sequencing platforms (454 GS-FLX, Illumina HiSeq 2000, and PacBio RS) using various libraries with different insertion sizes (approximately 350 bp, 600 bp, 3 kb, and 10 kb) of a Tennessee-accessed, glyphosate-resistant horseweed biotype. From 116.3 Gb (approximately 350× coverage) of data, the genome was assembled into 13,966 scaffolds with 50% of the assembly = 33,561 bp. The assembly covered 92.3% of the genome, including the complete chloroplast genome (approximately 153 kb) and a nearly complete mitochondrial genome (approximately 450 kb in 120 scaffolds). The nuclear genome is composed of 44,592 protein-coding genes. Genome resequencing of seven additional horseweed biotypes was performed. These sequence data were assembled and used to analyze genome variation. Simple sequence repeat and single-nucleotide polymorphisms were surveyed. Genomic patterns were detected that associated with glyphosate-resistant or -susceptible biotypes. The draft genome will be useful to better understand weediness and the evolution of herbicide resistance and to devise new management strategies. The genome will also be useful as another reference genome in the Compositae. To our knowledge, this article represents the first published draft genome of an agricultural weed. PMID:25209985
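
    The coverage figure quoted above can be checked with simple arithmetic, and the phrase "50% of the assembly = 33,561 bp" reads like an N50-style contiguity statistic. The sketch below is illustrative only: the function names are hypothetical, the scaffold lengths in the N50 example are toy values rather than data from the study, and treating the quoted figure as an N50 is an assumption.

```python
def coverage_depth(total_bases, genome_size):
    """Average sequencing depth: total sequenced bases divided by genome size."""
    return total_bases / genome_size


def n50(lengths):
    """Length L such that scaffolds of length >= L hold at least half of the assembly."""
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half:
            return length
    raise ValueError("no scaffold lengths given")


# Figures from the abstract: 116.3 Gb of data, 335 Mb genome.
depth = coverage_depth(116.3e9, 335e6)
print(round(depth))  # 347, consistent with "approximately 350x"

# Toy scaffold lengths (hypothetical, not from the study).
print(n50([10, 8, 6, 4, 2]))  # 8
```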

  2. De Novo Assembly of Transcriptome Sequencing in Caragana korshinskii Kom. and Characterization of EST-SSR Markers

    PubMed Central

    Long, Yan; Wang, Yanyan; Wu, Shanshan; Wang, Jiao; Tian, Xinjie; Pei, Xinwu

    2015-01-01

    Caragana korshinskii Kom. is widely distributed in various habitats, including gravel desert, clay desert, fixed and semi-fixed sand, and saline land in the Asian and African deserts. To date, no genomic information or EST-SSR markers have been reported for the genus Caragana Fabr. In this study, more than two billion bases of high-quality C. korshinskii sequence were generated using Illumina sequencing technology, demonstrating de novo assembly and annotation of genes without prior genome information. These reads were assembled into 86,265 unigenes (mean length = 709 bp). The similarity search indicated that 33,955 and 21,978 unigenes showed significant similarities to known proteins from the NCBI non-redundant and Swiss-Prot protein databases, respectively. Among these annotated unigenes, 26,232 unigenes were separately assigned to the Gene Ontology (GO) database. When 22,756 unigenes were searched against the Kyoto Encyclopedia of Genes and Genomes Pathway (KEGG) database, 5,598 unigenes were assigned to 5 main categories including 32 KEGG pathways. Among the main KEGG categories, metabolism was the biggest category (2,862, 43.7%), suggesting active metabolic processes in the desert tree. In addition, a total of 19,150 EST-SSRs were identified from 15,484 unigenes, and the characteristics of the EST-SSRs were further compared with those of four other species in Fabaceae. A total of 126 potential marker sites were randomly selected to validate the assembly quality and develop EST-SSR markers. Among the 9 germplasms in the genus Caragana Fabr., the PCR success rate was 93.7%, and a phylogenetic tree was constructed based on the genotypic data. This research generated a substantial fraction of transcriptome sequences, which are very useful resources for gene annotation and discovery, molecular marker development, and genome assembly and annotation. The EST-SSR markers identified and developed in this study will facilitate marker-assisted selection breeding. PMID:25629164
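
    EST-SSR identification of the kind described above amounts to scanning each unigene for short motifs repeated in tandem. The following is a minimal sketch, not the tool or settings used in the study: the function names are hypothetical, and the minimum repeat counts per motif length are illustrative assumptions.

```python
import re


def is_primitive(motif):
    """True if the motif is not itself a repeat of a shorter unit (e.g. 'ATAT' is not)."""
    for k in range(1, len(motif)):
        if len(motif) % k == 0 and motif[:k] * (len(motif) // k) == motif:
            return False
    return True


def find_ssrs(sequence, min_repeats=None):
    """Return (start, motif, repeat_count) for each tandem repeat found.

    min_repeats maps motif length to the minimum number of tandem copies;
    the defaults here are assumed thresholds for illustration only.
    """
    if min_repeats is None:
        min_repeats = {2: 6, 3: 5, 4: 4, 5: 4, 6: 4}
    seq = sequence.upper()
    hits = []
    for motif_len, min_rep in min_repeats.items():
        # A motif of motif_len bases followed by at least (min_rep - 1) more copies.
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (motif_len, min_rep - 1))
        for match in pattern.finditer(seq):
            motif = match.group(1)
            if not is_primitive(motif):
                continue  # already counted under a shorter motif, or a homopolymer
            hits.append((match.start(), motif, len(match.group(0)) // motif_len))
    return sorted(hits)


toy = "GG" + "AT" * 6 + "CC" + "CAG" * 5 + "T"  # hypothetical sequence
print(find_ssrs(toy))  # [(2, 'AT', 6), (16, 'CAG', 5)]
```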

  3. A comparison across non-model animals suggests an optimal sequencing depth for de novo transcriptome assembly

    PubMed Central

    2013-01-01

    Background: The lack of genomic resources can present challenges for studies of non-model organisms. Transcriptome sequencing offers an attractive method to gather information about genes and gene expression without the need for a reference genome. However, it is unclear what sequencing depth is adequate to assemble the transcriptome de novo for these purposes. Results: We assembled transcriptomes of animals from six different phyla (Annelids, Arthropods, Chordates, Cnidarians, Ctenophores, and Molluscs) at regular increments of reads using Velvet/Oases and Trinity to determine how read count affects the assembly. This included an assembly of mouse heart reads because we could compare those against the reference genome that is available. We found qualitative differences in the assemblies of whole-animals versus tissues. With increasing reads, whole-animal assemblies show rapid increase of transcripts and discovery of conserved genes, while single-tissue assemblies show a slower discovery of conserved genes though the assembled transcripts were often longer. A deeper examination of the mouse assemblies shows that with more reads, assembly errors become more frequent but such errors can be mitigated with more stringent assembly parameters. Conclusions: These assembly trends suggest that representative assemblies are generated with as few as 20 million reads for tissue samples and 30 million reads for whole-animals for RNA-level coverage. These depths provide a good balance between coverage and noise. Beyond 60 million reads, the discovery of new genes is low and sequencing errors of highly-expressed genes are likely to accumulate. Finally, siphonophores (polymorphic Cnidarians) are an exception and possibly require alternate assembly strategies. PMID:23496952
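
    Assembling "at regular increments of reads" implies drawing read subsets of increasing size before each assembly run. One way this could be sketched is shown below; the function name is hypothetical, and the choice of nested subsets (each larger set containing the smaller ones) is an assumption rather than a detail given in the abstract.

```python
import random


def nested_subsamples(reads, step, seed=0):
    """Shuffle the reads once, then return prefixes of size step, 2*step, ...

    Using prefixes of one shuffled list makes the subsets nested, so each
    larger assembly input contains every read of the smaller ones.
    """
    rng = random.Random(seed)  # fixed seed keeps the subsets reproducible
    shuffled = list(reads)
    rng.shuffle(shuffled)
    return [shuffled[:n] for n in range(step, len(shuffled) + 1, step)]


# Toy example: 10 read identifiers sampled in increments of 3.
subsets = nested_subsamples(range(10), step=3)
print([len(s) for s in subsets])  # [3, 6, 9]
```

    Each subset would then be passed to the assembler separately, so that transcript counts and gene discovery can be compared across depths.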

  4. The Chaperonin-60 Universal Target Is a Barcode for Bacteria That Enables De Novo Assembly of Metagenomic Sequence Data

    PubMed Central

    Links, Matthew G.; Dumonceaux, Tim J.; Hemmingsen, Sean M.; Hill, Janet E.

    2012-01-01

    Barcoding with molecular sequences is widely used to catalogue eukaryotic biodiversity. Studies investigating the community dynamics of microbes have relied heavily on gene-centric metagenomic profiling using two genes (16S rRNA and cpn60) to identify and track Bacteria. While there have been criteria formalized for barcoding of eukaryotes, these criteria have not been used to evaluate gene targets for other domains of life. Using the framework of the International Barcode of Life we evaluated DNA barcodes for Bacteria. Candidates from the 16S rRNA gene and the protein coding cpn60 gene were evaluated. Within complete bacterial genomes in the public domain representing 983 species from 21 phyla, the largest difference between median pairwise inter- and intra-specific distances (“barcode gap”) was found from cpn60. Distribution of sequence diversity along the ∼555 bp cpn60 target region was remarkably uniform. The barcode gap of the cpn60 universal target facilitated the faithful de novo assembly of full-length operational taxonomic units from pyrosequencing data from a synthetic microbial community. Analysis supported the recognition of both 16S rRNA and cpn60 as DNA barcodes for Bacteria. The cpn60 universal target was found to have a much larger barcode gap than 16S rRNA suggesting cpn60 as a preferred barcode for Bacteria. A large barcode gap for cpn60 provided a robust target for species-level characterization of data. The assembly of consensus sequences for barcodes was shown to be a reliable method for the identification and tracking of novel microbes in metagenomic studies. PMID:23189159
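
    The "barcode gap" is described above as the difference between median pairwise inter-specific and intra-specific distances. A minimal sketch of that calculation follows, assuming pre-aligned, equal-length sequences and an uncorrected p-distance; both the distance measure and the function names are assumptions for illustration, and the sequences are toy data.

```python
from itertools import combinations
from statistics import median


def p_distance(a, b):
    """Proportion of differing sites between two aligned, equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b)) / len(a)


def barcode_gap(records):
    """Median inter-specific distance minus median intra-specific distance.

    records is a list of (species_label, aligned_sequence) pairs.
    """
    intra, inter = [], []
    for (sp1, seq1), (sp2, seq2) in combinations(records, 2):
        distance = p_distance(seq1, seq2)
        (intra if sp1 == sp2 else inter).append(distance)
    return median(inter) - median(intra)


# Toy alignment (hypothetical sequences, two species with two records each).
records = [
    ("A", "ACGTACGT"),
    ("A", "ACGTACGA"),
    ("B", "TCGTTCGA"),
    ("B", "TCGTTCGT"),
]
print(barcode_gap(records))  # 0.1875
```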

  5. A Cost-Effective Approach to Sequence Hundreds of Complete Mitochondrial Genomes.

    PubMed

    Nunez, Joaquin C B; Oleksiak, Marjorie F

    2016-01-01

    We present a cost-effective approach to sequence whole mitochondrial genomes for hundreds of individuals. Our approach uses small reaction volumes and unmodified (non-phosphorylated) barcoded adaptors to minimize reagent costs. We demonstrate our approach by sequencing 383 Fundulus sp. mitochondrial genomes (192 F. heteroclitus and 191 F. majalis). Prior to sequencing, we amplified the mitochondrial genomes using 4-5 custom-made, overlapping primer pairs, and sequencing was performed on an Illumina HiSeq 2500 platform. After removing low quality and short sequences, 2.9 million and 2.8 million reads were generated for F. heteroclitus and F. majalis respectively. Individual genomes were assembled for each species by mapping barcoded reads to a reference genome. For F. majalis, the reference genome was built de novo. On average, individual consensus sequences had high coverage: 61-fold for F. heteroclitus and 57-fold for F. majalis. The approach discussed in this paper is optimized for sequencing mitochondrial genomes on an Illumina platform. However, with the proper modifications, this approach could be easily applied to other small genomes and sequencing platforms. PMID:27505419

  6. A Cost-Effective Approach to Sequence Hundreds of Complete Mitochondrial Genomes

    PubMed Central

    Oleksiak, Marjorie F.

    2016-01-01

    We present a cost-effective approach to sequence whole mitochondrial genomes for hundreds of individuals. Our approach uses small reaction volumes and unmodified (non-phosphorylated) barcoded adaptors to minimize reagent costs. We demonstrate our approach by sequencing 383 Fundulus sp. mitochondrial genomes (192 F. heteroclitus and 191 F. majalis). Prior to sequencing, we amplified the mitochondrial genomes using 4–5 custom-made, overlapping primer pairs, and sequencing was performed on an Illumina HiSeq 2500 platform. After removing low quality and short sequences, 2.9 million and 2.8 million reads were generated for F. heteroclitus and F. majalis respectively. Individual genomes were assembled for each species by mapping barcoded reads to a reference genome. For F. majalis, the reference genome was built de novo. On average, individual consensus sequences had high coverage: 61-fold for F. heteroclitus and 57-fold for F. majalis. The approach discussed in this paper is optimized for sequencing mitochondrial genomes on an Illumina platform. However, with the proper modifications, this approach could be easily applied to other small genomes and sequencing platforms. PMID:27505419

  7. De Novo Sequencing and Characterization of the Transcriptome of Dwarf Polish Wheat (Triticum polonicum L.)

    PubMed Central

    Wang, Chao; Wang, Xiaolu; Peng, Fan; Wang, Ruijiao; Jiang, Yulin; Zeng, Jian; Fan, Xing; Kang, Houyang; Sha, Lina; Zhang, Haiqin; Xiao, Xue

    2016-01-01

    Construction and characterization of a Polish wheat transcriptome is a crucial step in studying useful traits of Polish wheat. In this study, a transcriptome comprising 76,014 unigenes was assembled from dwarf Polish wheat (DPW) roots, stems, and leaves using Trinity. Among these unigenes, 61,748 (81.23%) were functionally annotated in public databases and classified into different functional types. When this transcriptome was aligned against the draft wheat genome released by the International Wheat Genome Sequencing Consortium (IWGSC), 57,331 (75.42%) unigenes, including 26,122 AB-specific and 2,622 D-specific unigenes, were mapped on the A, B, and/or D genomes. Compared with the transcriptome of T. turgidum, 56,343 unigenes were matched with 103,327 unigenes of T. turgidum. Compared with the genomes of barley and rice, 14,404 and 7,007 unigenes were matched with 14,608 genes of barley and 7,708 genes of rice, respectively. In addition, 2,148, 1,611, and 2,707 unigenes were expressed specifically in roots, stems, and leaves, respectively. Finally, 5,531 SSR sequences were observed from 4,531 unigenes, and 518 primer pairs were designed. PMID:27429972

  8. De novo transcriptome sequencing facilitates genomic resource generation in Tinospora cordifolia.

    PubMed

    Singh, Rakesh; Kumar, Rajesh; Mahato, Ajay Kumar; Paliwal, Ritu; Singh, Amit Kumar; Kumar, Sundeep; Marla, Soma S; Kumar, Ashok; Singh, Nagendra K

    2016-09-01

    Tinospora cordifolia is known for its medicinal properties owing to the presence of useful constituents of secondary metabolic origin, such as terpenes, glycosides, steroids, alkaloids, and flavonoids. However, there is little information available pertaining to critical genomic elements (ESTs, molecular markers) necessary for judicious exploitation of its germplasm. We employed 454 GS-FLX pyrosequencing of entire transcripts, and altogether ∼25 K assembled transcripts or expressed sequence tags (ESTs) were identified. As the interest in T. cordifolia is primarily due to its secondary metabolite constituents, the ESTs pertaining to the terpenoid biosynthetic pathway were identified in the present study. Additionally, several ESTs were assigned to different transcription factor families. To validate our transcript dataset, novel EST-SSR markers were generated to assess the genetic diversity among germplasm of T. cordifolia. These EST-SSR markers were found to be polymorphic, and the dendrogram based on the Dice similarity index revealed three distinct clusters of accessions. The present study demonstrates the effectiveness of using both the NEWBLER and MIRA sequence read assembler software for enriching the transcript dataset, and thus enables better exploitation of EST resources for mining candidate genes and designing molecular markers. PMID:27465295

  9. De novo transcriptome sequencing and gene expression profiling of spinach (Spinacia oleracea L.) leaves under heat stress

    PubMed Central

    Yan, Jun; Yu, Li; Xuan, Jiping; Lu, Ying; Lu, Shijun; Zhu, Weimin

    2016-01-01

    Spinach (Spinacia oleracea) is cold tolerant but heat sensitive. The spinach variety ‘Island’ is suitable for summer periods. There is a lack of molecular information available on the response of spinach to heat stress. In this study, high-throughput de novo transcriptome sequencing and gene expression analyses were carried out on leaves of the spinach variety ‘Island’ (grown at 24 °C (control), or exposed to 35 °C for 30 min (S1) or 5 h (S2)). A total of 133,200,898 clean reads were assembled into 59,413 unigenes (average size 1259.55 bp), of which 33,573 matched entries in public databases. There were 986 differentially expressed genes (DEGs) between control and S1, 1741 between control and S2, and 1587 between S1 and S2. Gene Ontology (GO) and pathway enrichment analysis indicated that a great many heat-responsive genes and other stress-responsive genes were present among these DEGs, suggesting that the heat stress may have induced an extensive abiotic stress effect. Comparative transcriptome analysis found 896 unique genes among the spinach heat-response transcripts. The expression patterns of 13 selected genes were verified by RT-qPCR (quantitative real-time PCR). Our study found a series of candidate genes and pathways that may be related to heat resistance in spinach. PMID:26857466

  10. De novo transcriptome sequencing and gene expression profiling of spinach (Spinacia oleracea L.) leaves under heat stress.

    PubMed

    Yan, Jun; Yu, Li; Xuan, Jiping; Lu, Ying; Lu, Shijun; Zhu, Weimin

    2016-01-01

    Spinach (Spinacia oleracea) is cold tolerant but heat sensitive. The spinach variety 'Island' is suitable for summer periods. There is a lack of molecular information available on the response of spinach to heat stress. In this study, high-throughput de novo transcriptome sequencing and gene expression analyses were carried out on leaves of the spinach variety 'Island' (grown at 24 °C (control), or exposed to 35 °C for 30 min (S1) or 5 h (S2)). A total of 133,200,898 clean reads were assembled into 59,413 unigenes (average size 1259.55 bp), of which 33,573 matched entries in public databases. There were 986 differentially expressed genes (DEGs) between control and S1, 1741 between control and S2, and 1587 between S1 and S2. Gene Ontology (GO) and pathway enrichment analysis indicated that a great many heat-responsive genes and other stress-responsive genes were present among these DEGs, suggesting that the heat stress may have induced an extensive abiotic stress effect. Comparative transcriptome analysis found 896 unique genes among the spinach heat-response transcripts. The expression patterns of 13 selected genes were verified by RT-qPCR (quantitative real-time PCR). Our study found a series of candidate genes and pathways that may be related to heat resistance in spinach. PMID:26857466

  11. Restriction site-associated DNA sequencing, genotyping error estimation and de novo assembly optimization for population genetic inference.

    PubMed

    Mastretta-Yanes, A; Arrigo, N; Alvarez, N; Jorgensen, T H; Piñero, D; Emerson, B C

    2015-01-01

    Restriction site-associated DNA sequencing (RADseq) provides researchers with the ability to record genetic polymorphism across thousands of loci for nonmodel organisms, potentially revolutionizing the field of molecular ecology. However, as with other genotyping methods, RADseq is prone to a number of sources of error that may have consequential effects for population genetic inferences, and these have received only limited attention in terms of the estimation and reporting of genotyping error rates. Here we use individual sample replicates, under the expectation of identical genotypes, to quantify genotyping error in the absence of a reference genome. We then use sample replicates to (i) optimize de novo assembly parameters within the program Stacks, by minimizing error and maximizing the retrieval of informative loci; and (ii) quantify error rates for loci, alleles and single-nucleotide polymorphisms. As an empirical example, we use a double-digest RAD data set of a nonmodel plant species, Berberis alpina, collected from high-altitude mountains in Mexico. PMID:24916682
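
    The replicate-based error estimate described above rests on the expectation that two replicates of the same individual should carry identical genotypes, so any disagreement at a locus called in both counts as an error. A minimal sketch under that assumption is given below; the data layout (a dictionary from locus name to an allele pair, with None for a missing call) and the function name are hypothetical and are not taken from Stacks.

```python
def replicate_error_rate(rep_a, rep_b):
    """Fraction of loci called in both replicates whose genotypes disagree.

    Returns (error_rate, number_of_loci_compared). Genotypes are compared
    without regard to allele order, so ('A', 'G') equals ('G', 'A').
    """
    compared = 0
    mismatches = 0
    for locus, call_a in rep_a.items():
        call_b = rep_b.get(locus)
        if call_a is None or call_b is None:
            continue  # locus missing in one replicate: not informative
        compared += 1
        if sorted(call_a) != sorted(call_b):
            mismatches += 1
    if compared == 0:
        raise ValueError("no loci were called in both replicates")
    return mismatches / compared, compared


# Toy genotype calls for one replicate pair (hypothetical data).
rep_a = {"loc1": ("A", "A"), "loc2": ("A", "G"), "loc3": ("C", "T"), "loc4": None}
rep_b = {"loc1": ("A", "A"), "loc2": ("G", "A"), "loc3": ("C", "C"), "loc4": ("T", "T")}
rate, n = replicate_error_rate(rep_a, rep_b)
print(n, round(rate, 3))  # 3 0.333
```

    Repeating this over several assembly parameter settings would give an error rate per setting to weigh against the number of informative loci retained.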

  12. De novo sequencing, assembly and analysis of eight different transcriptomes from the Malayan pangolin.

    PubMed

    Mohamed Yusoff, Aini; Tan, Tze King; Hari, Ranjeev; Koepfli, Klaus-Peter; Wee, Wei Yee; Antunes, Agostinho; Sitam, Frankie Thomas; Rovie-Ryan, Jeffrine Japning; Karuppannan, Kayal Vizi; Wong, Guat Jah; Lipovich, Leonard; Warren, Wesley C; O'Brien, Stephen J; Choo, Siew Woh

    2016-01-01

    Pangolins are scale-covered mammals, comprising eight endangered species. Maintaining pangolins in captivity is a significant challenge, in part because little is known about their genetics. Here we provide the first large-scale sequencing of the critically endangered Manis javanica transcriptomes from eight different organs using Illumina HiSeq technology, yielding ~75 gigabases and 89,754 unigenes. We found some unigenes involved in the insect hormone biosynthesis pathway and also 747 lipid metabolism-related unigenes that may provide insight into the lipid metabolism system in pangolins. Comparative analysis between M. javanica and other mammals revealed many pangolin-specific genes significantly over-represented in stress-related processes, cell proliferation and external stimulus, probably reflecting the traits and adaptations of the analyzed pregnant female M. javanica. Our study provides an invaluable resource for future functional work that may be highly relevant for the conservation of pangolins. PMID:27618997

  13. De Novo Transcriptome Sequencing and Analysis of the Cereal Cyst Nematode, Heterodera avenae

    PubMed Central

    Kumar, Mukesh; Gantasala, Nagavara Prasad; Roychowdhury, Tanmoy; Thakur, Prasoon Kumar; Banakar, Prakash; Shukla, Rohit N.; Jones, Michael G. K.; Rao, Uma

    2014-01-01

    The cereal cyst nematode (CCN, Heterodera avenae) is a major pest of wheat (Triticum spp) that reduces crop yields in many countries. Cyst nematodes are obligate sedentary endoparasites that reproduce by amphimixis. Here, we report the first transcriptome analysis of two stages of H. avenae. After sequencing extracted RNA from pre-parasitic infective juvenile and adult stages of the life cycle, 131 million Illumina high-quality paired-end reads were obtained, which generated 27,765 contigs with an N50 of 1,028 base pairs, of which 10,452 were annotated. Comparative analyses were undertaken to evaluate H. avenae sequences against those of other plant, animal and free-living nematodes to identify differences in expressed genes. There were 4,431 transcripts common to H. avenae and the free-living nematode Caenorhabditis elegans, and 9,462 in common with the more closely related potato cyst nematode, Globodera pallida. Annotation of H. avenae carbohydrate-active enzymes (CAZy) revealed fewer glycoside hydrolases (GHs) but more glycosyl transferases (GTs) and carbohydrate esterases (CEs) when compared to M. incognita. A total of 1,280 transcripts were found to have a secretory signature: the presence of a signal peptide and the absence of transmembrane domains. In a comparison of genes expressed in the pre-parasitic juvenile and feeding female stages, expression levels of 30 genes with high RPKM (reads per kilobase per million mapped reads) values were analysed by qRT-PCR, which confirmed the observed differences in their expression levels. In addition, we have also developed a user-friendly resource, the Heterodera transcriptome database (HATdb), for public access to the data generated in this study. The new data provided on the transcriptome of H. avenae add to the genetic resources available to study plant parasitic nematodes and provide an opportunity to seek new effectors that are specifically involved in the H. avenae-cereal host interaction. PMID:24802510
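
    RPKM, the expression measure mentioned above, is conventionally defined as reads per kilobase of transcript per million mapped reads. The short sketch below works through that standard formula; the function name and the numbers in the example are hypothetical and are not values from the study.

```python
def rpkm(read_count, transcript_length_bp, total_mapped_reads):
    """Reads per kilobase of transcript per million mapped reads.

    RPKM = reads / (length in kb * total mapped reads in millions)
         = reads * 1e9 / (length in bp * total mapped reads)
    """
    return read_count * 1e9 / (transcript_length_bp * total_mapped_reads)


# Toy example: 500 reads on a 2,000 bp transcript in a library of 10 million mapped reads.
print(rpkm(500, 2000, 10_000_000))  # 25.0
```

    Dividing by transcript length and library size is what makes values comparable between long and short transcripts and between stages sequenced to different depths.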

  14. De novo sequence assembly of Albugo candida reveals a small genome relative to other biotrophic oomycetes

    PubMed Central

    2011-01-01

    Background: Albugo candida is a biotrophic oomycete that parasitizes various species of Brassicaceae, causing a disease (white blister rust) with remarkable convergence in behaviour to unrelated rusts of basidiomycete fungi. Results: A recent genome analysis of the oomycete Hyaloperonospora arabidopsidis suggests that a reduction in the number of genes encoding secreted pathogenicity proteins, enzymes for assimilation of inorganic nitrogen and sulphur represent a genomic signature for the evolution of obligate biotrophy. Here, we report a draft reference genome of a major crop pathogen Albugo candida (another obligate biotrophic oomycete) with an estimated genome of 45.3 Mb. This is very similar to the genome size of a necrotrophic oomycete Pythium ultimum (43 Mb) but less than half that of H. arabidopsidis (99 Mb). Sequencing of A. candida transcripts from infected host tissue and zoosporangia combined with genome-wide annotation revealed 15,824 predicted genes. Most of the predicted genes lack significant similarity with sequences from other oomycetes. Most intriguingly, A. candida appears to have a much smaller repertoire of pathogenicity-related proteins than H. arabidopsidis including genes that encode RXLR effector proteins, CRINKLER-like genes, and elicitins. Necrosis and Ethylene inducing Peptides were not detected in the genome of A. candida. Putative orthologs of tat-C, a component of the twin arginine translocase system, were identified from multiple oomycete genera along with proteins containing putative tat-secretion signal peptides. Conclusion: Albugo candida has a comparatively small genome amongst oomycetes, retains motility of sporangial inoculum, and harbours a much smaller repertoire of candidate effectors than was recently reported for H. arabidopsidis. This minimal gene repertoire could indicate a lack of expansion, rather than a reduction, in the number of genes that signify the evolution of biotrophy in oomycetes. PMID:21995639

  15. De novo transcriptome sequencing and analysis of the cereal cyst nematode, Heterodera avenae.

    PubMed

    Kumar, Mukesh; Gantasala, Nagavara Prasad; Roychowdhury, Tanmoy; Thakur, Prasoon Kumar; Banakar, Prakash; Shukla, Rohit N; Jones, Michael G K; Rao, Uma

    2014-01-01

    The cereal cyst nematode (CCN, Heterodera avenae) is a major pest of wheat (Triticum spp) that reduces crop yields in many countries. Cyst nematodes are obligate sedentary endoparasites that reproduce by amphimixis. Here, we report the first transcriptome analysis of two stages of H. avenae. After sequencing extracted RNA from pre-parasitic infective juvenile and adult stages of the life cycle, 131 million Illumina high-quality paired-end reads were obtained, which generated 27,765 contigs with an N50 of 1,028 base pairs, of which 10,452 were annotated. Comparative analyses were undertaken to evaluate H. avenae sequences against those of other plant, animal and free-living nematodes to identify differences in expressed genes. There were 4,431 transcripts common to H. avenae and the free-living nematode Caenorhabditis elegans, and 9,462 in common with the more closely related potato cyst nematode, Globodera pallida. Annotation of H. avenae carbohydrate-active enzymes (CAZy) revealed fewer glycoside hydrolases (GHs) but more glycosyl transferases (GTs) and carbohydrate esterases (CEs) when compared to M. incognita. A total of 1,280 transcripts were found to have a secretory signature: the presence of a signal peptide and the absence of transmembrane domains. In a comparison of genes expressed in the pre-parasitic juvenile and feeding female stages, expression levels of 30 genes with high RPKM (reads per kilobase per million mapped reads) values were analysed by qRT-PCR, which confirmed the observed differences in their expression levels. In addition, we have also developed a user-friendly resource, the Heterodera transcriptome database (HATdb), for public access to the data generated in this study. The new data provided on the transcriptome of H. avenae add to the genetic resources available to study plant parasitic nematodes and provide an opportunity to seek new effectors that are specifically involved in the H. avenae-cereal host interaction. PMID:24802510

  16. De novo Transcriptome Analysis of Chinese Citrus Fly, Bactrocera minax (Diptera: Tephritidae), by High-Throughput Illumina Sequencing.

    PubMed

    Wang, Jia; Xiong, Ke-Cai; Liu, Ying-Hong

    2016-01-01

    The Chinese citrus fly, Bactrocera minax (Enderlein), is one of the most devastating pests of citrus in the temperate areas of Asia. So far, studies involving the molecular biology and physiology of B. minax are still scarce, partly because of the lack of genomic information and the inability to rear this insect in the laboratory. In this study, de novo assembly of a transcriptome was performed using Illumina sequencing technology. A total of 20,928,907 clean reads were obtained and assembled into 33,324 unigenes, with an average length of 908.44 bp. Unigenes were annotated by alignment against the NCBI non-redundant protein (Nr), Swiss-Prot, Clusters of Orthologous Groups (COG), Gene Ontology (GO), and Kyoto Encyclopedia of Genes and Genomes Pathway (KEGG) databases. Genes potentially involved in stress tolerance, including 20 heat shock protein (Hsp) genes, 26 glutathione S-transferase (GST) genes, and 2 ferritin subunit genes, were identified. These genes may play roles in stress tolerance during the diapause stage of B. minax. It has previously been found that 20E application on B. minax pupae could avert diapause, but the underlying mechanisms remain unknown. Thus, genes encoding enzymes in the 20E biosynthesis pathway, including Neverland, Spook, Phantom, Disembodied, Shadow, Shade, and Cyp18a1, and genes encoding 20E receptor proteins, ecdysone receptor (EcR) and ultraspiracle (USP), were identified. The expression patterns of 20E-related genes among developmental stages and between 20E-treated and untreated pupae demonstrated their roles in the diapause program. In addition, 1,909 simple sequence repeats (SSRs) were detected, which will contribute to molecular marker development. The findings in this study greatly improve our genetic understanding of B. minax, and lay the foundation for future studies on this species. PMID:27331903

  17. De novo Transcriptome Analysis of Chinese Citrus Fly, Bactrocera minax (Diptera: Tephritidae), by High-Throughput Illumina Sequencing

    PubMed Central

    Wang, Jia; Xiong, Ke-Cai; Liu, Ying-Hong

    2016-01-01

    The Chinese citrus fly, Bactrocera minax (Enderlein), is one of the most devastating pests of citrus in the temperate areas of Asia. So far, studies involving the molecular biology and physiology of B. minax are still scarce, partly because of the lack of genomic information and the inability to rear this insect in the laboratory. In this study, de novo assembly of a transcriptome was performed using Illumina sequencing technology. A total of 20,928,907 clean reads were obtained and assembled into 33,324 unigenes, with an average length of 908.44 bp. Unigenes were annotated by alignment against the NCBI non-redundant protein (Nr), Swiss-Prot, Clusters of Orthologous Groups (COG), Gene Ontology (GO), and Kyoto Encyclopedia of Genes and Genomes Pathway (KEGG) databases. Genes potentially involved in stress tolerance, including 20 heat shock protein (Hsp) genes, 26 glutathione S-transferase (GST) genes, and 2 ferritin subunit genes, were identified. These genes may play roles in stress tolerance during the diapause stage of B. minax. It has previously been found that 20E application on B. minax pupae could avert diapause, but the underlying mechanisms remain unknown. Thus, genes encoding enzymes in the 20E biosynthesis pathway, including Neverland, Spook, Phantom, Disembodied, Shadow, Shade, and Cyp18a1, and genes encoding 20E receptor proteins, ecdysone receptor (EcR) and ultraspiracle (USP), were identified. The expression patterns of 20E-related genes among developmental stages and between 20E-treated and untreated pupae demonstrated their roles in the diapause program. In addition, 1,909 simple sequence repeats (SSRs) were detected, which will contribute to molecular marker development. The findings in this study greatly improve our genetic understanding of B. minax, and lay the foundation for future studies on this species. PMID:27331903

  18. De novo transcriptome sequencing of Momordica cochinchinensis to identify genes involved in the carotenoid biosynthesis.

    PubMed

    Hyun, Tae Kyung; Rim, Yeonggil; Jang, Hui-Jeong; Kim, Cheol Hong; Park, Jongsun; Kumar, Ritesh; Lee, Sunghoon; Kim, Byung Chul; Bhak, Jong; Nguyen-Quoc, Binh; Kim, Seon-Won; Lee, Sang Yeol; Kim, Jae-Yean

    2012-07-01

    The ripe fruit of Momordica cochinchinensis Spreng, known as gac, is characterized by a very high carotenoid content. Although this plant might be a good resource for carotenoid metabolic engineering, the genes involved in the carotenoid metabolic pathways in gac have so far remained unidentified due to the lack of genomic information in public databases. In order to expedite the process of gene discovery, we have undertaken Illumina deep sequencing of mRNA prepared from the aril of gac fruit. From 51,446,670 high-quality reads, we obtained 81,404 assembled unigenes with an average length of 388 base pairs. At the protein level, gac aril transcripts showed about 81.5% similarity with cucumber proteomes. In addition, 17,104 unigenes have been assigned to specific metabolic pathways in the Kyoto Encyclopedia of Genes and Genomes, and all known enzymes involved in the terpenoid backbone biosynthetic and carotenoid biosynthetic pathways were also identified in our library. To analyze the relationship between putative carotenoid biosynthesis genes and the alteration of carotenoid content during fruit ripening, digital gene expression analysis was performed on three different ripening stages of the aril. This study has revealed that putative phytoene synthase, 15-cis-phytoene desaturase, zeta-carotene desaturase, carotenoid isomerase and lycopene epsilon cyclase might be key factors controlling carotenoid content during aril ripening. Taken together, this study has also made available a large gene database. This unique information for gac gene discovery would help to facilitate functional studies for improving carotenoid quantities. PMID:22580955

  19. Transcriptomic Analysis of Flower Blooming in Jasminum sambac through De Novo RNA Sequencing.

    PubMed

    Li, Yong-Hua; Zhang, Wei; Li, Yong

    2015-01-01

    Flower blooming is a critical and complicated plant developmental process in flowering plants. However, insufficient information is available about the complex network that regulates flower blooming in Jasminum sambac. In this study, we used the RNA-Seq platform to analyze the molecular regulation of flower blooming in J. sambac by comparing the transcript profiles at two flower developmental stages: budding and blooming. A total of 4577 differentially-expressed genes (DEGs) were identified between the two floral stages. The Gene Ontology and the Kyoto Encyclopedia of Genes and Genomes pathway enrichment analyses revealed that the DEGs in the "oxidation-reduction process", "extracellular region", "steroid biosynthesis", "glycosphingolipid biosynthesis", "plant hormone signal transduction" and "pentose and glucuronate interconversions" might be associated with flower development. A total of 103 and 92 unigenes exhibited sequence similarities to the known flower development and floral scent genes from other plants. Among these unigenes, five flower development and 19 floral scent unigenes exhibited at least four-fold differences in expression between the two stages. Our results provide abundant genetic resources for studying the flower blooming mechanisms and molecular breeding of J. sambac. PMID:26065837

  20. Motor Sequence Learning and Consolidation in Unilateral De Novo Patients with Parkinson’s Disease

    PubMed Central

    Doyon, Julien; Chan, Piu

    2015-01-01

    Previous research investigating motor sequence learning (MSL) and consolidation in patients with Parkinson’s disease (PD) has predominantly included heterogeneous participant samples with early and advanced disease stages; thus, little is known about the onset of potential behavioral impairments. We employed a multisession MSL paradigm to investigate whether behavioral deficits in learning and consolidation appear immediately after or prior to the detection of clinical symptoms in the tested (left) hand. Specifically, our patient sample was limited to recently diagnosed patients with pure unilateral PD. The left hand symptomatic (LH-S) patients provided an assessment of performance following the onset of clinical symptoms in the tested hand. Conversely, right hand affected (left hand asymptomatic, LH-A) patients served to investigate whether MSL impairments appear before symptoms in the tested hand. LH-S patients demonstrated impaired learning during the initial training session and both LH-S and LH-A patients demonstrated decreased performance compared to controls during the next-day retest. Critically, the impairments in later learning stages in the LH-A patients were evident even before the appearance of traditional clinical symptoms in the tested hand. Results may be explained by the progression of disease-related alterations in relevant corticostriatal networks. PMID:26222151

  1. Motor Sequence Learning and Consolidation in Unilateral De Novo Patients with Parkinson's Disease.

    PubMed

    Dan, Xiaojuan; King, Bradley R; Doyon, Julien; Chan, Piu

    2015-01-01

    Previous research investigating motor sequence learning (MSL) and consolidation in patients with Parkinson's disease (PD) has predominantly included heterogeneous participant samples with early and advanced disease stages; thus, little is known about the onset of potential behavioral impairments. We employed a multisession MSL paradigm to investigate whether behavioral deficits in learning and consolidation appear immediately after or prior to the detection of clinical symptoms in the tested (left) hand. Specifically, our patient sample was limited to recently diagnosed patients with pure unilateral PD. The left hand symptomatic (LH-S) patients provided an assessment of performance following the onset of clinical symptoms in the tested hand. Conversely, right hand affected (left hand asymptomatic, LH-A) patients served to investigate whether MSL impairments appear before symptoms in the tested hand. LH-S patients demonstrated impaired learning during the initial training session and both LH-S and LH-A patients demonstrated decreased performance compared to controls during the next-day retest. Critically, the impairments in later learning stages in the LH-A patients were evident even before the appearance of traditional clinical symptoms in the tested hand. Results may be explained by the progression of disease-related alterations in relevant corticostriatal networks. PMID:26222151

  2. Sequencing, De Novo Assembly and Annotation of the Colorado Potato Beetle, Leptinotarsa decemlineata, Transcriptome

    PubMed Central

    Kumar, Abhishek; Congiu, Leonardo; Lindström, Leena; Piiroinen, Saija; Vidotto, Michele; Grapputo, Alessandro

    2014-01-01

    Background The Colorado potato beetle (Leptinotarsa decemlineata) is a major pest and a serious threat to potato cultivation throughout the northern hemisphere. Despite its high importance for invasion biology, phenology and pest management, little is known about L. decemlineata from a genomic perspective. We subjected European L. decemlineata adult and larval transcriptome samples to 454-FLX massively-parallel DNA sequencing to characterize a basal set of genes from this species. We created a combined assembly of the adult and larval datasets including the publicly available midgut larval Roche 454 reads and provided basic annotation. We were particularly interested in diapause-specific genes and genes involved in pesticide and Bacillus thuringiensis (Bt) resistance. Results Using 454-FLX pyrosequencing, we obtained a total of 898,048 reads which, together with the publicly available 804,056 midgut larval reads, were assembled into 121,912 contigs. We established a repository of genes of interest, with 101 out of the 108 diapause-specific genes described in Drosophila montana; and 621 contigs involved in insecticide resistance, including 221 CYP450, 45 GSTs, 13 catalases, 15 superoxide dismutases, 22 glutathione peroxidases, 194 esterases, 3 ADAM metalloproteases, 10 cadherins and 98 calmodulins. We found 460 putative miRNAs and we predicted a significant number of single nucleotide polymorphisms (29,205) and microsatellite loci (17,284). Conclusions This report of the assembly and annotation of the transcriptome of L. decemlineata offers new insights into diapause-associated and insecticide-resistance-associated genes in this species and provides a foundation for comparative studies with other species of insects. The data will also open new avenues for researchers using L. decemlineata as a model species, and for pest management research. Our results provide the basis for performing future gene expression and functional analysis in L. decemlineata and improve our

  3. De Novo Transcriptome Sequencing Analysis and Comparison of Differentially Expressed Genes (DEGs) in Macrobrachium rosenbergii in China

    PubMed Central

    Liu, Qigen

    2014-01-01

    Giant freshwater prawn (GFP; Macrobrachium rosenbergii) is an exotic species that was introduced into China in 1976 and thereafter became a major species in freshwater aquaculture. However, gene discovery in this species has been limited to small-scale data collection in China. Using next-generation sequencing, we sequenced the transcriptome of hepatopancreas samples from individuals of 4 GFP groups (A1, A2, B1 and B2). De novo transcriptome sequencing generated 66,953 isogenes. Using BLASTX to search the Non-redundant (NR), Search Tool for the Retrieval of Interacting Genes (STRING), and Kyoto Encyclopedia of Genes and Genomes (KEGG) databases, 21,224 unigenes were annotated; 9,552 unigenes matched the Gene Ontology (GO) classification; 5,782 unigenes matched 25 categories of Clusters of Orthologous Groups of proteins (COG); and 20,859 unigenes were assigned to 312 KEGG pathways. Between the A and B groups 147 differentially expressed genes (DEGs) were identified; between the A1 and A2 groups 6,860 DEGs were identified and between the B1 and B2 groups 5,229 DEGs were identified. After enrichment, the A and B groups identified 38 DEGs, but none of them were significantly enriched. For the A1 and A2 groups, 21,856 DEGs fell into three main functional categories (biological process, cellular component and molecular function); in the KEGG analysis, 2,459 genes had a KEGG Ortholog ID (KO-ID) and could be categorized into 251 pathways, of which 9 were significantly enriched. For the B1 and B2 groups, 5,940 DEGs fell into the same three functional categories; in the KEGG analysis, 1,543 genes had a KO-ID and could be categorized into 240 pathways, of which 2 were significantly enriched. We investigated 99 queries (GO) related to the growth of GFP in the 4 groups. After enrichment we

  4. De novo transcriptome sequencing in Bixa orellana to identify genes involved in methylerythritol phosphate, carotenoid and bixin biosynthesis

    DOE PAGESBeta

    Cárdenas-Conejo, Yair; Carballo-Uicab, Víctor; Lieberman, Meric; Aguilar-Espinosa, Margarita; Comai, Luca; Rivera-Madrid, Renata

    2015-10-28

    Bixin or annatto is a commercially important natural orange-red pigment derived from lycopene that is produced and stored in seeds of Bixa orellana L. An enzymatic pathway for bixin biosynthesis was inferred from homology of putative proteins encoded by differentially expressed seed cDNAs. Some activities were later validated in a heterologous system. Nevertheless, much of the pathway remains to be clarified. For example, it is essential to identify the methylerythritol phosphate (MEP) and carotenoid pathways genes. In order to investigate the MEP, carotenoid, and bixin pathways genes, total RNA from young leaves and two different developmental stages of seeds from B. orellana were used for the construction of indexed mRNA libraries, sequenced on the Illumina HiSeq 2500 platform and assembled de novo using Velvet, CLC Genomics Workbench and CAP3 software. A total of 52,549 contigs were obtained with average length of 1,924 bp. Two phylogenetic analyses of inferred proteins, in one case encoded by thirteen general, single-copy cDNAs, in the other from carotenoid and MEP cDNAs, indicated that B. orellana is closely related to sister Malvales species cacao and cotton. Using homology, we identified 7 and 14 core gene products from the MEP and carotenoid pathways, respectively. Surprisingly, previously defined bixin pathway cDNAs were not present in our transcriptome. Here we propose a new set of gene products involved in bixin pathway. In conclusion, the identification and qRT-PCR quantification of cDNAs involved in annatto production suggest a hypothetical model for bixin biosynthesis that involve coordinated activation of some MEP, carotenoid and bixin pathway genes. These findings provide a better understanding of the mechanisms regulating these pathways and will facilitate the genetic improvement of B. orellana.

  5. Paired-End Sequencing of Long-Range DNA Fragments for De Novo Assembly of Large, Complex Mammalian Genomes by Direct Intra-Molecule Ligation

    PubMed Central

    Wu, Kui; Cai, Qingle; Wang, Yu; Lang, Yongshan; Cao, Hongzhi; Yang, Huangming; Wang, Jian; Zhang, Xiuqing

    2012-01-01

    Background The relatively short read lengths from next generation sequencing (NGS) technologies still pose a challenge for de novo assembly of complex mammalian genomes. One important solution is to use paired-end (PE) sequence information experimentally obtained from long-range DNA fragments (>1 kb). Here, we characterize and extend a long-range PE library construction method based on direct intra-molecule ligation (or molecular linker-free circularization) for NGS. Results We found that the method performs stably for PE sequencing of 2- to 5-kb DNA fragments, and can be extended to 10–20 kb (and even in extremes, up to ∼35 kb). We also characterized the impact of low quality input DNA on the method, and developed a whole-genome amplification (WGA) based protocol using limited input DNA (<1 µg). Using this PE dataset, we accurately assembled the YanHuang (YH) genome, the first sequenced Asian genome, into a scaffold N50 size of >2 Mb, which is over 100 times greater than the initial size produced with only small insert PE reads (17 kb). In addition, we mapped two 7- to 8-kb insertions in the YH genome using the larger insert sizes of the long-range PE data. Conclusions In conclusion, we demonstrate here the effectiveness of this long-range PE sequencing method and its use for the de novo assembly of a large, complex genome using NGS short reads. PMID:23029438
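
    The scaffold N50 quoted in this abstract is a length-weighted median: sort the scaffolds from longest to shortest and report the length at which the running total first reaches half of the assembled bases. A minimal Python sketch of that standard definition follows; the function name `n50` and the toy lengths are illustrative assumptions, not code or data from the study.

```python
def n50(lengths):
    """Return the N50 of a collection of scaffold (or contig) lengths.

    N50 is the length L such that scaffolds of length >= L together
    contain at least half of all assembled bases.
    """
    if not lengths:
        raise ValueError("need at least one scaffold length")
    ordered = sorted(lengths, reverse=True)  # longest first
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length


if __name__ == "__main__":
    # Hypothetical toy assembly (lengths in kb), not data from the paper.
    toy_scaffolds = [2, 2, 2, 3, 3, 4, 8, 8]
    print(n50(toy_scaffolds))  # 8: the two 8 kb scaffolds hold 16 of 32 kb
```

    Because the statistic is weighted by length, joining many short scaffolds into a few long ones raises N50 sharply, which is why long-range paired-end links can change it by orders of magnitude.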

  6. Combining bottom-up and top-down mass spectrometric strategies for de novo sequencing of the crustacean hyperglycemic hormone from Cancer borealis.

    PubMed

    Ma, Mingming; Chen, Ruibing; Ge, Ying; He, Huan; Marshall, Alan G; Li, Lingjun

    2009-01-01

    The crustacean hyperglycemic hormone (CHH) is a 72-amino acid residue polypeptide with multiple physiological effects. The X-organ/sinus gland is the primary source for CHH and its family members. However, the amino acid sequence of CHH in Cancer borealis, a premier model system for neuromodulation, has not been characterized. In this study, a novel hybrid strategy combining "bottom-up" and "top-down" methodologies enabled direct sequencing of CHH peptide in the sinus gland of C. borealis. Multiple mass spectrometry (MS)-based techniques were employed to characterize the CHH peptide, including direct tissue analysis by MALDI-FT-ICR-MS, de novo sequencing of tryptic digested CHH by nano-LC/ESI-Q-TOF MS and intact CHH analysis by LC/FT-ICR-MS. In-trap cleaning removed the extensive matrix adducts of CHH in the direct tissue analysis by MALDI-FT-ICR-MS. Fragmentation efficiency of the intact CHH was drastically improved after the reduction-alkylation of the disulfide bonds. The sequence coverage was further enhanced by employing multiple complementary fragmentation techniques. Overall, this example is the largest neuropeptide de novo sequenced in C. borealis by mass spectrometric methods. PMID:19046072

  7. Sequence analysis of the chitin synthase A gene of the Dutch elm pathogen Ophiostoma novo-ulmi indicates a close association with the human pathogen Sporothrix schenckii.

    PubMed

    Hintz, W E

    1999-09-01

    Degenerate oligonucleotide primers were designed according to conserved regions of the chitin synthase gene family and used to amplify a 621 basepair (bp) fragment from genomic DNA of Ophiostoma novo-ulmi, the causal agent of Dutch elm disease. The amplification product was used as a hybridization probe to screen a library of genomic DNA sequences and to retrieve a full-length chitin synthase gene (chsA). The putative coding region of the gene was 2619 bp long, lacked introns, and encoded a polypeptide of 873 amino acids. Based on the similarity of the predicted amino acid sequence to the full-length chsC gene of Aspergillus nidulans and chsA gene of Ampelomyces quisqualis, the O. novo-ulmi chsA was classified as a Class I chitin synthase. The phylogenies constructed, according to a subregion of all available chitin synthases, showed that O. novo-ulmi consistently clustered most closely with the human pathogen Sporothrix schenckii, recently classified as a member of the mitosporic Ophiostomataceae. Disruption of the chsA gene locus had no obvious effects on the growth or morphology of the fungus. PMID:10524253

  8. Evaluating Characteristics of De Novo Assembly Software on 454 Transcriptome Data: A Simulation Approach

    PubMed Central

    Mundry, Marvin; Bornberg-Bauer, Erich; Sammeth, Michael; Feulner, Philine G. D.

    2012-01-01

    Background The quantity of transcriptome data is rapidly increasing for non-model organisms. As sequencing technology advances, focus shifts towards solving bioinformatic challenges, of which sequence read assembly is the first task. Recent studies have compared the performance of different software to establish a best practice for transcriptome assembly. Here, we adapted a simulation approach to evaluate specific features of assembly programs on 454 data. The novelty of our study is that the simulation allows us to calculate a model assembly as reference point for comparison. Findings The simulation approach allows us to compare basic metrics of assemblies computed by different software applications (CAP3, MIRA, Newbler, and Oases) to a known optimal solution. We found MIRA and CAP3 are conservative in merging reads. This resulted in a comparatively high number of short contigs. In contrast, Newbler more readily merged reads into longer contigs, while Oases produced the overall shortest assembly. Due to the simulation approach, reads could be traced back to their correct placement within the transcriptome. Together with mapping reads onto the assembled contigs, we were able to evaluate ambiguity in the assemblies. This analysis further supported the conservative nature of MIRA and CAP3, which resulted in low proportions of chimeric contigs, but high redundancy. Newbler produced less redundancy, but the proportion of chimeric contigs was higher. Conclusion Our evaluation of four assemblers suggested that MIRA and Newbler slightly outperformed the other programs, while showing contrasting characteristics. Oases did not perform very well on the 454 reads. Our evaluation indicated that the software was either conservative (MIRA) or liberal (Newbler) about merging reads into contigs. This suggested that in choosing an assembly program researchers should carefully consider their follow-up analysis and the consequences of the chosen approach to obtaining an assembly. PMID:22384018
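
    The trade-off described in this abstract (conservative merging gives few chimeric contigs but high redundancy, liberal merging the reverse) can be illustrated with simulated reads whose true source transcript is known. The Python sketch below uses deliberately simplified definitions that are assumptions for illustration, not the metrics of the study: a contig counts as chimeric if its reads come from more than one transcript, and a transcript counts as redundantly assembled if its reads are spread over more than one contig. All names and the toy data are hypothetical.

```python
from collections import defaultdict


def assembly_ambiguity(read_origin, contig_reads):
    """Summarise chimerism and redundancy of an assembly of simulated reads.

    read_origin:  dict mapping read id -> true source transcript id
    contig_reads: dict mapping contig id -> list of read ids placed in it
    """
    if not contig_reads:
        raise ValueError("assembly contains no contigs")

    contigs_per_transcript = defaultdict(set)
    chimeric = 0
    for contig, reads in contig_reads.items():
        sources = {read_origin[read] for read in reads}
        if len(sources) > 1:  # reads from different transcripts were merged
            chimeric += 1
        for transcript in sources:
            contigs_per_transcript[transcript].add(contig)

    # Transcripts whose reads ended up in more than one contig.
    redundant = sum(1 for c in contigs_per_transcript.values() if len(c) > 1)
    n_transcripts = len(contigs_per_transcript)
    return {
        "chimeric_fraction": chimeric / len(contig_reads),
        "redundant_fraction": redundant / n_transcripts if n_transcripts else 0.0,
    }


if __name__ == "__main__":
    # Hypothetical simulation: four reads drawn from two transcripts.
    origin = {"r1": "tA", "r2": "tA", "r3": "tB", "r4": "tB"}
    conservative = {"c1": ["r1"], "c2": ["r2"], "c3": ["r3", "r4"]}
    liberal = {"c1": ["r1", "r2", "r3", "r4"]}
    print(assembly_ambiguity(origin, conservative))
    print(assembly_ambiguity(origin, liberal))
```

    In the toy data the conservative assembly has no chimeric contigs but splits transcript tA over two contigs, while the liberal assembly has no redundancy but its single contig is chimeric.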

  9. Genetic variation and the de novo assembly of human genomes

    PubMed Central

    Chaisson, Mark J. P.; Wilson, Richard K.; Eichler, Evan E.

    2016-01-01

    The discovery of genetic variation and the assembly of genome sequences are both inextricably linked to advances in DNA-sequencing technology. Short-read massively parallel sequencing has revolutionized our ability to discover genetic variation but is insufficient to generate high-quality genome assemblies or resolve most structural variation. Full resolution of variation is only guaranteed by complete de novo assembly of a genome. Here, we review approaches to genome assembly, the nature of gaps or missing sequences, and biases in the assembly process. We describe the challenges of generating a complete de novo genome assembly using current technologies and the impact that being able to perfectly sequence the genome would have on understanding human disease and evolution. Finally, we summarize recent technological advances that improve both contiguity and accuracy and emphasize the importance of complete de novo assembly as opposed to read mapping as the primary means to understanding the full range of human genetic variation. PMID:26442640

  10. Tetrasomy 18p de novo: identification by FISH with conventional and microdissection probes and analysis of parental origin and formation by short sequence repeat typing.

    PubMed

    Eggermann, T; Engels, H; Moskalonek, B; Nöthen, M M; Müller-Navia, J; Schleiermacher, E; Schwanitz, G; Stengel-Rutkowski, S

    1996-05-01

    We report a de novo supernumerary isochromosome 18p in a child with tetrasomy 18p, analyzed by a straightforward combination of cytogenetic and molecular cytogenetic methods. The diagnostic procedure consisted of standard banding techniques and fluorescence in situ hybridization (FISH) with centromere and library DNA probes for chromosome 18, and 18p-specific FISH probes prepared by chromosome dissection and in vitro amplification. The maternal origin as well as the most probable cell stages of formation of the supernumerary isochromosome were determined by typing of short sequence repeats (SSRs). The pattern of allelic distribution suggests a nondisjunction during meiosis followed by a centromeric misdivision in an early postzygotic mitosis as the most probable mode of isochromosome 18p formation. The combination of the applied methods represents a powerful tool to investigate the nature and the origin of de novo marker chromosomes. PMID:8655132

  11. Large Scale Discovery and De Novo-Assisted Sequencing of Cationic Antimicrobial Peptides (CAMPs) by Microparticle Capture and Electron-Transfer Dissociation (ETD) Mass Spectrometry.

    PubMed

    Juba, Melanie L; Russo, Paul S; Devine, Megan; Barksdale, Stephanie; Rodriguez, Carlos; Vliet, Kent A; Schnur, Joel M; van Hoek, Monique L; Bishop, Barney M

    2015-10-01

    The identification and sequencing of novel cationic antimicrobial peptides (CAMPs) have proven challenging due to the limitations associated with traditional proteomics methods and difficulties sequencing peptides present in complex biomolecular mixtures. We present here a process for large-scale identification and de novo-assisted sequencing of newly discovered CAMPs using microparticle capture followed by tandem mass spectrometry equipped with electron-transfer dissociation (ETD). This process was initially evaluated and verified using known CAMPs with varying physicochemical properties. The effective parameters were then applied in the analysis of a complex mixture of peptides harvested from American alligator plasma using custom-made (Bioprospector) functionalized hydrogel particles. Here, we report the successful sequencing process for CAMPs that has led to the identification of 340 unique peptides and the discovery of five novel CAMPs from American alligator plasma. PMID:26327436

  12. Complete telomere-to-telomere de novo assembly of the Plasmodium falciparum genome through long-read (>11 kb), single molecule, real-time sequencing

    PubMed Central

    Vembar, Shruthi Sridhar; Seetin, Matthew; Lambert, Christine; Nattestad, Maria; Schatz, Michael C.; Baybayan, Primo; Scherf, Artur; Smith, Melissa Laird

    2016-01-01

    The application of next-generation sequencing to estimate genetic diversity of Plasmodium falciparum, the most lethal malaria parasite, has proved challenging due to the skewed AT-richness [∼80.6% (A + T)] of its genome and the lack of technology to assemble highly polymorphic subtelomeric regions that contain clonally variant, multigene virulence families (Ex: var and rifin). To address this, we performed amplification-free, single molecule, real-time sequencing of P. falciparum genomic DNA and generated reads of average length 12 kb, with 50% of the reads between 15.5 and 50 kb in length. Next, using the Hierarchical Genome Assembly Process, we assembled the P. falciparum genome de novo and successfully compiled all 14 nuclear chromosomes telomere-to-telomere. We also accurately resolved centromeres [∼90–99% (A + T)] and subtelomeric regions and identified large insertions and duplications that add extra var and rifin genes to the genome, along with smaller structural variants such as homopolymer tract expansions. Overall, we show that amplification-free, long-read sequencing combined with de novo assembly overcomes major challenges inherent to studying the P. falciparum genome. Indeed, this technology may not only identify the polymorphic and repetitive subtelomeric sequences of parasite populations from endemic areas but may also evaluate structural variation linked to virulence, drug resistance and disease transmission. PMID:27345719
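
    The (A + T) percentages quoted in this abstract are simple base-composition fractions. A minimal Python sketch of that calculation follows, assuming a plain DNA string as input; the function name `at_fraction` and the example sequences are hypothetical and are not taken from the P. falciparum assembly.

```python
def at_fraction(sequence):
    """Return the fraction of unambiguous bases (A, C, G, T) that are A or T.

    Ambiguity codes such as N are ignored rather than counted against AT.
    """
    bases = [base for base in sequence.upper() if base in "ACGT"]
    if not bases:
        raise ValueError("sequence contains no unambiguous bases")
    at_count = sum(1 for base in bases if base in "AT")
    return at_count / len(bases)


if __name__ == "__main__":
    # Hypothetical toy sequence: 4 of 5 bases are A or T.
    print(at_fraction("AATTG"))  # 0.8
```

    Applied to sliding windows along a chromosome, the same function would show where composition rises from the genome-wide average toward the centromeric range quoted in the abstract.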

  13. Complete telomere-to-telomere de novo assembly of the Plasmodium falciparum genome through long-read (>11 kb), single molecule, real-time sequencing.

    PubMed

    Vembar, Shruthi Sridhar; Seetin, Matthew; Lambert, Christine; Nattestad, Maria; Schatz, Michael C; Baybayan, Primo; Scherf, Artur; Smith, Melissa Laird

    2016-08-01

    The application of next-generation sequencing to estimate genetic diversity of Plasmodium falciparum, the most lethal malaria parasite, has proved challenging due to the skewed AT-richness [∼80.6% (A + T)] of its genome and the lack of technology to assemble highly polymorphic subtelomeric regions that contain clonally variant, multigene virulence families (Ex: var and rifin). To address this, we performed amplification-free, single molecule, real-time sequencing of P. falciparum genomic DNA and generated reads of average length 12 kb, with 50% of the reads between 15.5 and 50 kb in length. Next, using the Hierarchical Genome Assembly Process, we assembled the P. falciparum genome de novo and successfully compiled all 14 nuclear chromosomes telomere-to-telomere. We also accurately resolved centromeres [∼90-99% (A + T)] and subtelomeric regions and identified large insertions and duplications that add extra var and rifin genes to the genome, along with smaller structural variants such as homopolymer tract expansions. Overall, we show that amplification-free, long-read sequencing combined with de novo assembly overcomes major challenges inherent to studying the P. falciparum genome. Indeed, this technology may not only identify the polymorphic and repetitive subtelomeric sequences of parasite populations from endemic areas but may also evaluate structural variation linked to virulence, drug resistance and disease transmission. PMID:27345719

  14. Novel variation and de novo mutation rates in population-wide de novo assembled Danish trios

    PubMed Central

    Besenbacher, Søren; Liu, Siyang; Izarzugaza, José M. G.; Grove, Jakob; Belling, Kirstine; Bork-Jensen, Jette; Huang, Shujia; Als, Thomas D.; Li, Shengting; Yadav, Rachita; Rubio-García, Arcadio; Lescai, Francesco; Demontis, Ditte; Rao, Junhua; Ye, Weijian; Mailund, Thomas; Friborg, Rune M.; Pedersen, Christian N. S.; Xu, Ruiqi; Sun, Jihua; Liu, Hao; Wang, Ou; Cheng, Xiaofang; Flores, David; Rydza, Emil; Rapacki, Kristoffer; Damm Sørensen, John; Chmura, Piotr; Westergaard, David; Dworzynski, Piotr; Sørensen, Thorkild I. A.; Lund, Ole; Hansen, Torben; Xu, Xun; Li, Ning; Bolund, Lars; Pedersen, Oluf; Eiberg, Hans; Krogh, Anders; Børglum, Anders D.; Brunak, Søren; Kristiansen, Karsten; Schierup, Mikkel H.; Wang, Jun; Gupta, Ramneek; Villesen, Palle; Rasmussen, Simon

    2015-01-01

    Building a population-specific catalogue of single nucleotide variants (SNVs), indels and structural variants (SVs) with frequencies, termed a national pan-genome, is critical for further advancing clinical and public health genetics in large cohorts. Here we report a Danish pan-genome obtained from sequencing 10 trios to high depth (50×). We report 536k novel SNVs and 283k novel short indels from mapping approaches and develop a population-wide de novo assembly approach to identify 132k novel indels larger than 10 nucleotides with low false discovery rates. We identify a higher proportion of indels and SVs than previous efforts showing the merits of high coverage and de novo assembly approaches. In addition, we use trio information to identify de novo mutations and use a probabilistic method to provide direct estimates of 1.27e−8 and 1.5e−9 per nucleotide per generation for SNVs and indels, respectively. PMID:25597990

  15. De novo Transcriptome Sequencing and Development of Abscission Zone-Specific Microarray as a New Molecular Tool for Analysis of Tomato Organ Abscission

    PubMed Central

    Sundaresan, Srivignesh; Philosoph-Hadas, Sonia; Riov, Joseph; Mugasimangalam, Raja; Kuravadi, Nagesh A.; Kochanek, Bettina; Salim, Shoshana; Tucker, Mark L.; Meir, Shimon

    2016-01-01

    Abscission of flower pedicels and leaf petioles of tomato (Solanum lycopersicum) can be induced by flower removal or leaf deblading, respectively, which leads to auxin depletion, resulting in increased sensitivity of the abscission zone (AZ) to ethylene. However, the molecular mechanisms that drive the acquisition of abscission competence and its modulation by auxin gradients are not yet known. We used RNA-Sequencing (RNA-Seq) to obtain a comprehensive transcriptome of tomato flower AZ (FAZ) and leaf AZ (LAZ) during abscission. RNA-Seq was performed on a pool of total RNA extracted from tomato FAZ and LAZ, at different abscission stages, followed by de novo assembly. The assembled clusters contained transcripts that are already known in the Solanaceae (SOL) genomics and NCBI databases, and over 8823 identified novel tomato transcripts of varying sizes. An AZ-specific microarray, encompassing the novel transcripts identified in this study and all known transcripts from the SOL genomics and NCBI databases, was constructed to study the abscission process. Multiple probes for longer genes and key AZ-specific genes, including antisense probes for all transcripts, make this array a unique tool for studying abscission with a comprehensive set of transcripts, and for mining for naturally occurring antisense transcripts. We focused on comparing the global transcriptomes generated from the FAZ and the LAZ to establish the divergences and similarities in their transcriptional networks, and particularly to characterize the processes and transcriptional regulators enriched in gene clusters that are differentially regulated in these two AZs. This study is the first attempt to analyze the global gene expression in different AZs in tomato by combining the RNA-Seq technique with oligonucleotide microarrays. Our AZ-specific microarray chip provides a cost-effective approach for expression profiling and robust analysis of multiple samples in a rapid succession. PMID:26834766

  16. De novo Transcriptome Sequencing and Development of Abscission Zone-Specific Microarray as a New Molecular Tool for Analysis of Tomato Organ Abscission.

    PubMed

    Sundaresan, Srivignesh; Philosoph-Hadas, Sonia; Riov, Joseph; Mugasimangalam, Raja; Kuravadi, Nagesh A; Kochanek, Bettina; Salim, Shoshana; Tucker, Mark L; Meir, Shimon

    2015-01-01

    Abscission of flower pedicels and leaf petioles of tomato (Solanum lycopersicum) can be induced by flower removal or leaf deblading, respectively, which leads to auxin depletion, resulting in increased sensitivity of the abscission zone (AZ) to ethylene. However, the molecular mechanisms that drive the acquisition of abscission competence and its modulation by auxin gradients are not yet known. We used RNA-Sequencing (RNA-Seq) to obtain a comprehensive transcriptome of tomato flower AZ (FAZ) and leaf AZ (LAZ) during abscission. RNA-Seq was performed on a pool of total RNA extracted from tomato FAZ and LAZ, at different abscission stages, followed by de novo assembly. The assembled clusters contained transcripts that are already known in the Solanaceae (SOL) genomics and NCBI databases, and over 8823 identified novel tomato transcripts of varying sizes. An AZ-specific microarray, encompassing the novel transcripts identified in this study and all known transcripts from the SOL genomics and NCBI databases, was constructed to study the abscission process. Multiple probes for longer genes and key AZ-specific genes, including antisense probes for all transcripts, make this array a unique tool for studying abscission with a comprehensive set of transcripts, and for mining for naturally occurring antisense transcripts. We focused on comparing the global transcriptomes generated from the FAZ and the LAZ to establish the divergences and similarities in their transcriptional networks, and particularly to characterize the processes and transcriptional regulators enriched in gene clusters that are differentially regulated in these two AZs. This study is the first attempt to analyze the global gene expression in different AZs in tomato by combining the RNA-Seq technique with oligonucleotide microarrays. Our AZ-specific microarray chip provides a cost-effective approach for expression profiling and robust analysis of multiple samples in a rapid succession. PMID:26834766

  17. Color Sequence of Triton Approach Images

    NASA Technical Reports Server (NTRS)

    1989-01-01

    Triton Voyager 2 approach sequence with latitude-longitude grid superposed. The color image was reconstructed by making a computer composite of three black and white images taken through red, green, and blue filters. Details on Triton's surface unfold dramatically in this sequence of approach images. The South Pole is near the bottom of the images, at the convergence of the lines of longitude. Resolution changes from about 60 km/pixel (37 mi/pixel) in the image at upper left, taken from a distance of 500,000 km (311,000 mi), to about 5 km/pixel (3.1 mi/pixel) for the image at lower right. Global and regional albedo features are visible in all of the images. The albedo features can be tracked in successive images and show that Triton has undergone about 3/4 of a rotation during the 4.3-day interval over which these images were obtained. A southern polar cap of bright pink, yellow, and white materials covers nearly all of the southern hemisphere; these materials consist of nitrogen ice with traces of other substances, including frozen methane and carbon monoxide. Feeble ultraviolet radiation from the sun is thought to act on methane to cause chemical reactions that produce the pinkish-yellowish substances. At the time of the Voyager 2 flyby (Aug. 1989) Triton's southern hemisphere was starting the summer season and the South Pole was canted toward the sun day and night, such that the polar cap was sublimating under the relatively 'hot' summer sun (surface temperature about 38 K, about -391 degrees F). Numerous dark streaks on the southern polar nitrogen-ice cap are thought to consist of dark dust deposited by prevailing winds in Triton's tenuous nitrogen atmosphere. A bluish band, seen in all of the images, nearly circumscribes Triton's equator; this band is thought to consist of fairly fresh nitrogen frost, perhaps deposited in the decade prior to Voyager 2's flyby.
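
    The compositing step mentioned in this record (three black and white frames taken through red, green, and blue filters, combined into one color image) can be sketched as follows. This is only an illustration under simplifying assumptions: the frames are already co-registered and the same size, pixel values are plain numbers, and the function name `compose_rgb` is hypothetical rather than part of any NASA processing pipeline.

```python
def compose_rgb(red, green, blue):
    """Combine three same-sized grayscale frames into one grid of (R, G, B) pixels.

    Each frame is a list of rows, and each row is a list of brightness values.
    Assumption: the frames are already aligned, so pixel (y, x) in every
    frame shows the same point on the surface.
    """
    rows = len(red)
    cols = len(red[0]) if rows else 0
    for frame in (red, green, blue):
        if len(frame) != rows or any(len(row) != cols for row in frame):
            raise ValueError("all three filter frames must have the same shape")
    # Pair up the three brightness values measured at each pixel position.
    return [
        [(red[y][x], green[y][x], blue[y][x]) for x in range(cols)]
        for y in range(rows)
    ]


if __name__ == "__main__":
    r = [[200, 180], [90, 60]]
    g = [[150, 160], [80, 60]]
    b = [[140, 170], [70, 60]]
    for row in compose_rgb(r, g, b):
        print(row)
```

    Real mission processing also involves calibration and registration of the frames, which this sketch deliberately leaves out.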

  18. New genes from non-coding sequence: the role of de novo protein-coding genes in eukaryotic evolutionary innovation

    PubMed Central

    McLysaght, Aoife; Guerzoni, Daniele

    2015-01-01

    The origin of novel protein-coding genes de novo was once considered so improbable as to be impossible. In less than a decade, and especially in the last five years, this view has been overturned by extensive evidence from diverse eukaryotic lineages. There is now evidence that this mechanism has contributed a significant number of genes to genomes of organisms as diverse as Saccharomyces, Drosophila, Plasmodium, Arabidopsis and human. From simple beginnings, these genes have in some instances acquired complex structure, regulated expression and important functional roles. New genes are often thought of as dispensable late additions; however, some recent de novo genes in human can play a role in disease. Rather than an extremely rare occurrence, it is now evident that there is a relatively constant trickle of proto-genes released into the testing ground of natural selection. It is currently unknown whether de novo genes arise primarily through an ‘RNA-first’ or ‘ORF-first’ pathway. Either way, evolutionary tinkering with this pool of genetic potential may have been a significant player in the origins of lineage-specific traits and adaptations. PMID:26323763

  19. Transcriptome analysis of colored calla lily (Zantedeschia rehmannii Engl.) by Illumina sequencing: de novo assembly, annotation and EST-SSR marker development

    PubMed Central

    Cui, Binbin; Zhang, Qixiang; Xiong, Min; Wang, Xian

    2016-01-01

    Colored calla lily is the short name for the species or hybrids in section Aestivae of genus Zantedeschia. It is currently one of the most popular flower plants in the world due to its beautiful flower spathe and long postharvest life. However, little genomic information and few molecular markers are available for its genetic improvement. Here, de novo transcriptome sequencing was performed to produce large transcript sequences for Z. rehmannii cv. ‘Rehmannii’ using an Illumina HiSeq 2000 instrument. More than 59.9 million cDNA sequence reads were obtained and assembled into 39,298 unigenes with an average length of 1,038 bp. Among these, 21,077 unigenes showed significant similarity to protein sequences in the non-redundant protein database (Nr) and in the Swiss-Prot, Gene Ontology (GO), Cluster of Orthologous Group (COG) and Kyoto Encyclopedia of Genes and Genomes (KEGG) databases. Moreover, a total of 117 unique transcripts were then defined that might regulate the flower spathe development of colored calla lily. Additionally, 9,933 simple sequence repeats (SSRs) and 7,162 single nucleotide polymorphisms (SNPs) were identified as putative molecular markers. High-quality primers for 200 SSR loci were designed and selected, of which 58 amplified reproducible amplicons were polymorphic among 21 accessions of colored calla lily. The sequence information and molecular markers in the present study will provide valuable resources for genetic diversity analysis, germplasm characterization and marker-assisted selection in the genus Zantedeschia.

  20. Identification and sequence analysis of Escherichia coli purE and purK genes encoding 5'-phosphoribosyl-5-amino-4-imidazole carboxylase for de novo purine biosynthesis.

    PubMed Central

    Watanabe, W; Sampei, G; Aiba, A; Mizobuchi, K

    1989-01-01

    It has been shown that the Escherichia coli purE locus specifying 5'-phosphoribosyl-5-amino-4-imidazole carboxylase in de novo purine nucleotide synthesis is divided into two cistrons. We cloned and determined a 2,449-nucleotide sequence including the purE locus. This sequence contains two overlapped open reading frames, ORF-18 and ORF-39, encoding proteins with molecular weights of 18,000 and 39,000, respectively. The purE mutations of CSH57A and DCSP22 were complemented by plasmids carrying ORF-18, while that of NK6051 was complemented by plasmids carrying ORF-39. Thus, the purE locus consists of two distinct genes, designated purE and purK for ORF-18 and ORF-39, respectively. These genes constitute a single operon. A highly conserved 16-nucleotide sequence, termed the PUR box, was found in the upstream region of purE by comparing the sequences of the purF and purMN operons. We also found three entire and one partial repetitive extragenic palindromic (REP) sequences in the downstream region of purK. Roles of the PUR box and REP sequences are discussed in relation to the genesis of the purEK operon. PMID:2644189

  1. "De-novo" amino acid sequence elucidation of protein G'e by combined "Top-Down" and "Bottom-Up" mass spectrometry

    NASA Astrophysics Data System (ADS)

    Yefremova, Yelena; Al-Majdoub, Mahmoud; Opuni, Kwabena F. M.; Koy, Cornelia; Cui, Weidong; Yan, Yuetian; Gross, Michael L.; Glocker, Michael O.

    2015-03-01

    Mass spectrometric de-novo sequencing was applied to review the amino acid sequence of a commercially available recombinant protein G' with great scientific and economic importance. Substantial deviations to the published amino acid sequence (Uniprot Q54181) were found by the presence of 46 additional amino acids at the N-terminus, including a so-called "His-tag" as well as an N-terminal partial α-N-gluconoylation and α-N-phosphogluconoylation, respectively. The unexpected amino acid sequence of the commercial protein G' comprised 241 amino acids and resulted in a molecular mass of 25,998.9 ± 0.2 Da for the unmodified protein. Due to the higher mass that is caused by its extended amino acid sequence compared with the original protein G' (185 amino acids), we named this protein "protein G'e." By means of mass spectrometric peptide mapping, the suggested amino acid sequence, as well as the N-terminal partial α-N-gluconoylations, was confirmed with 100% sequence coverage. After the protein G'e sequence was determined, we were able to determine the expression vector pET-28b from Novagen with the Xho I restriction enzyme cleavage site as the best option that was used for cloning and expressing the recombinant protein G'e in E. coli. A dissociation constant (Kd) value of 9.4 nM for protein G'e was determined thermophoretically, showing that the N-terminal flanking sequence extension did not cause significant changes in the binding affinity to immunoglobulins.

  2. "De-novo" amino acid sequence elucidation of protein G'e by combined "top-down" and "bottom-up" mass spectrometry.

    PubMed

    Yefremova, Yelena; Al-Majdoub, Mahmoud; Opuni, Kwabena F M; Koy, Cornelia; Cui, Weidong; Yan, Yuetian; Gross, Michael L; Glocker, Michael O

    2015-03-01

    Mass spectrometric de-novo sequencing was applied to review the amino acid sequence of a commercially available recombinant protein G' with great scientific and economic importance. Substantial deviations to the published amino acid sequence (Uniprot Q54181) were found by the presence of 46 additional amino acids at the N-terminus, including a so-called "His-tag" as well as an N-terminal partial α-N-gluconoylation and α-N-phosphogluconoylation, respectively. The unexpected amino acid sequence of the commercial protein G' comprised 241 amino acids and resulted in a molecular mass of 25,998.9 ± 0.2 Da for the unmodified protein. Due to the higher mass that is caused by its extended amino acid sequence compared with the original protein G' (185 amino acids), we named this protein "protein G'e." By means of mass spectrometric peptide mapping, the suggested amino acid sequence, as well as the N-terminal partial α-N-gluconoylations, was confirmed with 100% sequence coverage. After the protein G'e sequence was determined, we were able to determine the expression vector pET-28b from Novagen with the Xho I restriction enzyme cleavage site as the best option that was used for cloning and expressing the recombinant protein G'e in E. coli. A dissociation constant (Kd) value of 9.4 nM for protein G'e was determined thermophoretically, showing that the N-terminal flanking sequence extension did not cause significant changes in the binding affinity to immunoglobulins. PMID:25560987

  3. Exome sequencing identifies de novo gain of function missense mutation in KCND2 in identical twins with autism and seizures that slows potassium channel inactivation.

    PubMed

    Lee, Hane; Lin, Meng-chin A; Kornblum, Harley I; Papazian, Diane M; Nelson, Stanley F

    2014-07-01

    Numerous studies and case reports show comorbidity of autism and epilepsy, suggesting some common molecular underpinnings of the two phenotypes. However, the relationship between the two, on the molecular level, remains unclear. Here, whole exome sequencing was performed on a family with identical twins affected with autism and severe, intractable seizures. A de novo variant was identified in the KCND2 gene, which encodes the Kv4.2 potassium channel. Kv4.2 is a major pore-forming subunit in somatodendritic subthreshold A-type potassium current (ISA) channels. The de novo mutation p.Val404Met is novel and occurs at a highly conserved residue within the C-terminal end of the transmembrane helix S6 region of the ion permeation pathway. Functional analysis revealed the likely pathogenicity of the variant in that the p.Val404Met mutant construct showed significantly slowed inactivation, either by itself or after equimolar coexpression with the wild-type Kv4.2 channel construct consistent with a dominant effect. Further, the effect of the mutation on closed-state inactivation was evident in the presence of auxiliary subunits that associate with Kv4 subunits to form ISA channels in vivo. Discovery of a functionally relevant novel de novo variant, coupled with physiological evidence that the mutant protein disrupts potassium current inactivation, strongly supports KCND2 as the causal gene for epilepsy in this family. Interaction of KCND2 with other genes implicated in autism and the role of KCND2 in synaptic plasticity provide suggestive evidence of an etiological role in autism. PMID:24501278

  4. Exome sequencing identifies de novo gain of function missense mutation in KCND2 in identical twins with autism and seizures that slows potassium channel inactivation

    PubMed Central

    Lee, Hane; Lin, Meng-chin A.; Kornblum, Harley I.; Papazian, Diane M.; Nelson, Stanley F.

    2014-01-01

    Numerous studies and case reports show comorbidity of autism and epilepsy, suggesting some common molecular underpinnings of the two phenotypes. However, the relationship between the two, on the molecular level, remains unclear. Here, whole exome sequencing was performed on a family with identical twins affected with autism and severe, intractable seizures. A de novo variant was identified in the KCND2 gene, which encodes the Kv4.2 potassium channel. Kv4.2 is a major pore-forming subunit in somatodendritic subthreshold A-type potassium current (ISA) channels. The de novo mutation p.Val404Met is novel and occurs at a highly conserved residue within the C-terminal end of the transmembrane helix S6 region of the ion permeation pathway. Functional analysis revealed the likely pathogenicity of the variant in that the p.Val404Met mutant construct showed significantly slowed inactivation, either by itself or after equimolar coexpression with the wild-type Kv4.2 channel construct consistent with a dominant effect. Further, the effect of the mutation on closed-state inactivation was evident in the presence of auxiliary subunits that associate with Kv4 subunits to form ISA channels in vivo. Discovery of a functionally relevant novel de novo variant, coupled with physiological evidence that the mutant protein disrupts potassium current inactivation, strongly supports KCND2 as the causal gene for epilepsy in this family. Interaction of KCND2 with other genes implicated in autism and the role of KCND2 in synaptic plasticity provide suggestive evidence of an etiological role in autism. PMID:24501278

  5. Evaluation of Methods for de novo Genome assembly from High-throughput Sequencing Reads Reveals Dependencies that Affect the Quality of the Results

    Technology Transfer Automated Retrieval System (TEKTRAN)

    Recent developments in high-throughput sequencing technology have made low-cost sequencing an attractive approach for many genome analysis tasks. Increasing read lengths, improving quality and the production of increasingly larger numbers of usable sequences per instrument-run continue to make whole...

  6. Identification and de novo sequencing of housekeeping genes appropriate for gene expression analyses in farmed maraena whitefish (Coregonus maraena) during crowding stress.

    PubMed

    Altmann, Simone; Rebl, Alexander; Kühn, Carsten; Goldammer, Tom

    2015-04-01

    Maraena whitefish (Coregonus maraena; synonym Coregonus lavaretus f. balticus) is a high-quality food fish in the Southern Baltic Sea belonging to the group of salmonid fishes. Coregonus sp. is successfully kept in aquaculture throughout northern Europe (e.g. in Finland, Germany, Russia) and North America. In this regard, the molecular and immunological characterisation of stress response in maraena whitefish contributes to the development of robust and fast-growing maraena whitefish breeding strains for aquaculture. Thus, in the present study, the potential housekeeping genes beta actin (ACTB), elongation factor 1 alpha (EEF1A1), glyceraldehyde-3-phosphate dehydrogenase (GAPDH), ribosomal protein L9 (RPL9), ribosomal protein L32 (RPL32) and ribosomal protein S20 (RPS20) were de novo sequenced and tested concerning their applicability as reference genes in quantitative real-time PCR (qPCR) in maraena whitefish under different stocking densities. For this purpose, tissue samples of liver, kidney, gills, head kidney, skin, adipose tissue, heart and dorsal fin were investigated. qPCR data were analysed with Normfinder tool to determine gene expression stability. DNA sequencing exposed transcribed paralogous EEF1A1A and EEF1A1B genes differing in their putative protein structure. Normfinder analysis revealed RPL9 and RPL32 as most stable, GAPDH and ACTB as least stable genes for qPCR analyses, respectively. This is the first study that provides a subset of seven de novo sequenced housekeeping genes usable as reference genes in studies of stress response in maraena whitefish. PMID:25249196

  7. The ATP binding site of the chromatin remodeling homolog Lsh is required for nucleosome density and de novo DNA methylation at repeat sequences

    PubMed Central

    Ren, Jianke; Briones, Victorino; Barbour, Samantha; Yu, Weishi; Han, Yixing; Terashima, Minoru; Muegge, Kathrin

    2015-01-01

    Lsh, a chromatin remodeling protein of the SNF2 family, is critical for normal heterochromatin structure. In particular, DNA methylation at repeat elements, a hallmark of heterochromatin, is greatly reduced in Lsh−/− (KO) cells. Here, we examined the presumed nucleosome remodeling activity of Lsh on chromatin in the context of DNA methylation. We found that dynamic CG methylation was dependent on Lsh in embryonic stem cells. Moreover, we demonstrate that ATP function is critical for de novo methylation at repeat sequences. The ATP binding site of Lsh is in part required to promote stable association of the DNA methyltransferase 3b with the repeat locus. By performing nucleosome occupancy assays, we found distinct nucleosome occupancy in KO ES cells compared to WT ES cells after differentiation. Nucleosome density was restored to wild-type level by re-expressing wild-type Lsh but not the ATP mutant in KO ES cells. Our results suggest that ATP-dependent nucleosome remodeling is the primary molecular function of Lsh, which may promote de novo methylation in differentiating ES cells. PMID:25578963

  8. De Novo variants in the KMT2A (MLL) gene causing atypical Wiedemann-Steiner syndrome in two unrelated individuals identified by clinical exome sequencing

    PubMed Central

    2014-01-01

    Background Wiedemann-Steiner Syndrome (WSS) is characterized by short stature, a variety of dysmorphic facial and skeletal features, characteristic hypertrichosis cubiti (excessive hair on the elbows), mild-to-moderate developmental delay and intellectual disability. [MIM#: 605130]. Here we report two unrelated children for whom clinical exome sequencing of parent-proband trios was performed at UCLA, resulting in a molecular diagnosis of WSS and atypical clinical presentation. Case presentation For patient 1, clinical features at 9 years of age included developmental delay, craniofacial abnormalities, and multiple minor anomalies. Patient 2 presented at 1 year of age with developmental delay, microphthalmia, partial 3–4 left hand syndactyly, and craniofacial abnormalities. A de novo missense c.4342T>C variant and a de novo splice site c.4086+G>A variant were identified in the KMT2A gene in patients 1 and 2, respectively. Conclusions Based on the clinical and molecular findings, both patients appear to have novel presentations of WSS. As the hallmark hypertrichosis cubiti was not initially appreciated in either case, this syndrome was not suspected during the clinical evaluation. This report expands the phenotypic spectrum of the clinical phenotypes and KMT2A variants associated with WSS. PMID:24886118

  9. An evolution-based approach to De Novo protein design and case study on Mycobacterium tuberculosis.

    PubMed

    Mitra, Pralay; Shultis, David; Brender, Jeffrey R; Czajka, Jeff; Marsh, David; Gray, Felicia; Cierpicki, Tomasz; Zhang, Yang

    2013-10-01

    Computational protein design is a reverse procedure of protein folding and structure prediction, where constructing structures from evolutionarily related proteins has been demonstrated to be the most reliable method for protein 3-dimensional structure prediction. Following this spirit, we developed a novel method to design new protein sequences based on evolutionarily related protein families. For a given target structure, a set of proteins having similar fold are identified from the PDB library by structural alignments. A structural profile is then constructed from the protein templates and used to guide the conformational search of amino acid sequence space, where physicochemical packing is accommodated by single-sequence based solvation, torsion angle, and secondary structure predictions. The method was tested on a computational folding experiment based on a large set of 87 protein structures covering different fold classes, which showed that the evolution-based design significantly enhances the foldability and biological functionality of the designed sequences compared to the traditional physics-based force field methods. Without using homologous proteins, the designed sequences can be folded with an average root-mean-square-deviation of 2.1 Å to the target. As a case study, the method is extended to redesign all 243 structurally resolved proteins in the pathogenic bacteria Mycobacterium tuberculosis, which is the second leading cause of death from infectious disease. On a smaller scale, five sequences were randomly selected from the design pool and subjected to experimental validation. The results showed that all the designed proteins are soluble with distinct secondary structure and three have well ordered tertiary structure, as demonstrated by circular dichroism and NMR spectroscopy. 
    Together, these results demonstrate a new avenue in computational protein design that uses knowledge of evolutionary conservation from protein structural families to engineer

  10. An Evolution-Based Approach to De Novo Protein Design and Case Study on Mycobacterium tuberculosis

    PubMed Central

    Brender, Jeffrey R.; Czajka, Jeff; Marsh, David; Gray, Felicia; Cierpicki, Tomasz; Zhang, Yang

    2013-01-01

    Computational protein design is a reverse procedure of protein folding and structure prediction, where constructing structures from evolutionarily related proteins has been demonstrated to be the most reliable method for protein 3-dimensional structure prediction. Following this spirit, we developed a novel method to design new protein sequences based on evolutionarily related protein families. For a given target structure, a set of proteins having similar fold are identified from the PDB library by structural alignments. A structural profile is then constructed from the protein templates and used to guide the conformational search of amino acid sequence space, where physicochemical packing is accommodated by single-sequence based solvation, torsion angle, and secondary structure predictions. The method was tested on a computational folding experiment based on a large set of 87 protein structures covering different fold classes, which showed that the evolution-based design significantly enhances the foldability and biological functionality of the designed sequences compared to the traditional physics-based force field methods. Without using homologous proteins, the designed sequences can be folded with an average root-mean-square-deviation of 2.1 Å to the target. As a case study, the method is extended to redesign all 243 structurally resolved proteins in the pathogenic bacteria Mycobacterium tuberculosis, which is the second leading cause of death from infectious disease. On a smaller scale, five sequences were randomly selected from the design pool and subjected to experimental validation. The results showed that all the designed proteins are soluble with distinct secondary structure and three have well ordered tertiary structure, as demonstrated by circular dichroism and NMR spectroscopy. 
    Together, these results demonstrate a new avenue in computational protein design that uses knowledge of evolutionary conservation from protein structural families to engineer

  11. De novo sequencing of Astyanax mexicanus surface fish and Pachón cavefish transcriptomes reveals enrichment of mutations in cavefish putative eye genes.

    PubMed

    Hinaux, Hélène; Poulain, Julie; Da Silva, Corinne; Noirot, Céline; Jeffery, William R; Casane, Didier; Rétaux, Sylvie

    2013-01-01

    Astyanax mexicanus, a teleost species with surface dwelling (surface fish) and cave adapted (cavefish) morphs, is an important model system in evolutionary developmental biology (evodevo). Astyanax cavefish differ from surface fish in numerous traits, including the enhancement of non-visual sensory systems, and the loss of eyes and pigmentation. The genetic bases for these differences are not fully understood as genomic and transcriptomic data are lacking. We here present de novo transcriptome sequencing of embryonic and larval stages of a surface fish population and a cavefish population originating from the Pachón cave using the Sanger method. This effort represents the first large scale sequence and clone resource for the Astyanax research community. The analysis of these sequences shows low levels of polymorphism in cavefish compared to surface fish, confirming previous studies on a small number of genes. A high proportion of the genes mutated in cavefish are known to be expressed in the zebrafish visual system. Such a high number of mutations in cavefish putative eye genes may be explained by relaxed selection for vision during the evolution in the absence of light. Based on these sequence differences, we provide a list of 11 genes that are potential candidates for having a role in cavefish visual system degeneration. PMID:23326453

  12. De novo transcriptome analysis of an imminent biofuel crop, Camelina sativa L. using Illumina GAIIX sequencing platform and identification of SSR markers.

    PubMed

    Mudalkar, Shalini; Golla, Ramesh; Ghatty, Sreenivas; Reddy, Attipalli Ramachandra

    2014-01-01

    Camelina sativa L. is an emerging biofuel crop with potential applications in industry, medicine, cosmetics and human nutrition. The crop is unexploited owing to very limited availability of transcriptome and genomic data. In order to analyse the various metabolic pathways, we performed de novo assembly of the transcriptome on Illumina GAIIX platform with paired end sequencing for obtaining short reads. The sequencing output generated a FastQ file size of 2.97 GB with 10.83 million reads having a maximum read length of 101 nucleotides. The number of contigs generated was 53,854 with maximum and minimum lengths of 10,086 and 200 nucleotides respectively. These transcripts were annotated using BLAST search against the Aracyc, Swiss-Prot, TrEMBL, gene ontology and clusters of orthologous groups (KOG) databases. The genes involved in lipid metabolism were studied and the transcription factors were identified. Sequence similarity studies of Camelina with the other related organisms indicated the close relatedness of Camelina with Arabidopsis. In addition, bioinformatics analysis revealed the presence of a total of 19,379 simple sequence repeats. This is the first report on Camelina sativa L., where the transcriptome of the entire plant, including seedlings, seed, root, leaves and stem was done. Our data established an excellent resource for gene discovery and provide useful information for functional and comparative genomic studies in this promising biofuel crop. PMID:24002439
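
    To make the notion of a simple sequence repeat (SSR) concrete, here is a minimal sketch of how perfect SSRs might be located in an assembled contig with a regular expression. It is not the tool used in the study (the record does not name one). The function name `find_ssrs` and the minimum repeat counts are illustrative assumptions, and real SSR miners also handle compound and imperfect repeats.

```python
import re

# Illustrative thresholds (assumption): motif length -> minimum number of repeats.
DEFAULT_MIN_REPEATS = {2: 6, 3: 5, 4: 4, 5: 4, 6: 4}


def find_ssrs(sequence, min_repeats=None):
    """Return sorted (start, motif, repeat_count) tuples for perfect SSRs.

    `start` is the 0-based position of the first repeat unit.
    """
    thresholds = DEFAULT_MIN_REPEATS if min_repeats is None else min_repeats
    sequence = sequence.upper()
    hits = []
    for motif_len, min_count in sorted(thresholds.items()):
        # One motif followed by at least (min_count - 1) exact copies of it.
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (motif_len, min_count - 1))
        for match in pattern.finditer(sequence):
            motif = match.group(1)
            # Skip motifs that are themselves built from a shorter unit,
            # e.g. "ATAT" is really the dinucleotide "AT".
            if any(
                motif == motif[:k] * (motif_len // k)
                for k in range(1, motif_len)
                if motif_len % k == 0
            ):
                continue
            hits.append((match.start(), motif, len(match.group(0)) // motif_len))
    return sorted(hits)


if __name__ == "__main__":
    contig = "GG" + "AT" * 7 + "CCG" + "GAA" * 5 + "T"
    for start, motif, count in find_ssrs(contig):
        print(start, motif, count)
```

    Loci found this way are typically the starting point for designing flanking primers, which is how SSRs become usable molecular markers.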

  13. De Novo Transcriptome Sequencing of the Octopus vulgaris Hemocytes Using Illumina RNA-Seq Technology: Response to the Infection by the Gastrointestinal Parasite Aggregata octopiana

    PubMed Central

    Castellanos-Martínez, Sheila; Arteta, David; Catarino, Susana; Gestal, Camino

    2014-01-01

    Background Octopus vulgaris is a highly valuable species of great commercial interest and excellent candidate for aquaculture diversification; however, the octopus’ well-being is impaired by pathogens, of which the gastrointestinal coccidian parasite Aggregata octopiana is one of the most important. The knowledge of the molecular mechanisms of the immune response in cephalopods, especially in octopus is scarce. The transcriptome of the hemocytes of O. vulgaris was de novo sequenced using the high-throughput paired-end Illumina technology to identify genes involved in immune defense and to understand the molecular basis of octopus tolerance/resistance to coccidiosis. Results A bi-directional mRNA library was constructed from hemocytes of two groups of octopus according to the infection by A. octopiana, sick octopus, suffering coccidiosis, and healthy octopus, and reads were de novo assembled together. The differential expression of transcripts was analysed using the general assembly as a reference for mapping the reads from each condition. After sequencing, a total of 75,571,280 high quality reads were obtained from the sick octopus group and 74,731,646 from the healthy group. The general transcriptome of the O. vulgaris hemocytes was assembled in 254,506 contigs. A total of 48,225 contigs were successfully identified, and 538 transcripts exhibited differential expression between groups of infection. The general transcriptome revealed genes involved in pathways like NF-kB, TLR and Complement. Differential expression of TLR-2, PGRP, C1q and PRDX genes due to infection was validated using RT-qPCR. In sick octopuses, only TLR-2 was up-regulated in hemocytes, but all of them were up-regulated in caecum and gills. Conclusion The transcriptome reported here de novo establishes the first molecular clues to understand how the octopus immune system works and interacts with a highly pathogenic coccidian. 
    The data provided here will contribute to identification of biomarkers

  14. High-fidelity target sequencing of individual molecules identified using barcode sequences: de novo detection and absolute quantitation of mutations in plasma cell-free DNA from cancer patients

    PubMed Central

    Kukita, Yoji; Matoba, Ryo; Uchida, Junji; Hamakawa, Takuya; Doki, Yuichiro; Imamura, Fumio; Kato, Kikuya

    2015-01-01

    Circulating tumour DNA (ctDNA) is an emerging field of cancer research. However, current ctDNA analysis is usually restricted to one or a few mutation sites due to technical limitations. In the case of massively parallel DNA sequencers, the number of false positives caused by a high read error rate is a major problem. In addition, the final sequence reads do not represent the original DNA population due to the global amplification step during the template preparation. We established a high-fidelity target sequencing system of individual molecules identified in plasma cell-free DNA using barcode sequences; this system consists of the following two steps. (i) A novel target sequencing method that adds barcode sequences by adaptor ligation. This method uses linear amplification to eliminate the errors introduced during the early cycles of polymerase chain reaction. (ii) The monitoring and removal of erroneous barcode tags. This process involves the identification of individual molecules that have been sequenced and for which the number of mutations has been absolutely quantitated. Using plasma cell-free DNA from patients with gastric or lung cancer, we demonstrated that the system achieved near complete elimination of false positives and enabled de novo detection and absolute quantitation of mutations in plasma cell-free DNA. PMID:26126624
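
    The general idea behind barcode-based error suppression can be sketched as follows: reads that share a barcode are assumed to derive from one original molecule, so a per-position majority vote within each barcode family removes most random read errors. This is a generic illustration, not the authors' pipeline (in particular, it does not model their monitoring and removal of erroneous barcode tags). The function name, the `min_family_size` parameter, and the equal-length-read restriction are assumptions made for brevity.

```python
from collections import Counter, defaultdict


def consensus_by_barcode(reads, min_family_size=2):
    """Collapse (barcode, sequence) reads into one consensus per barcode family.

    Families smaller than `min_family_size` are dropped, because a single
    read cannot be checked against its siblings. Positions without a strict
    majority base are reported as "N".
    """
    families = defaultdict(list)
    for barcode, sequence in reads:
        families[barcode].append(sequence)

    consensus = {}
    for barcode, members in families.items():
        if len(members) < min_family_size:
            continue
        if any(len(m) != len(members[0]) for m in members):
            continue  # this sketch only handles equal-length reads
        bases = []
        for column in zip(*members):
            base, count = Counter(column).most_common(1)[0]
            bases.append(base if count * 2 > len(members) else "N")
        consensus[barcode] = "".join(bases)
    return consensus


if __name__ == "__main__":
    reads = [
        ("AAC", "ACGT"), ("AAC", "ACGT"), ("AAC", "ACTT"),  # one read error
        ("GGT", "ACGA"),                                    # singleton family
        ("TTA", "ACGT"), ("TTA", "AGGT"),                   # unresolved position
    ]
    print(consensus_by_barcode(reads))
```

    Counting consensus molecules rather than raw reads is also what makes an absolute count of mutant molecules possible, since amplification duplicates collapse into one family.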

  15. High-fidelity target sequencing of individual molecules identified using barcode sequences: de novo detection and absolute quantitation of mutations in plasma cell-free DNA from cancer patients.

    PubMed

    Kukita, Yoji; Matoba, Ryo; Uchida, Junji; Hamakawa, Takuya; Doki, Yuichiro; Imamura, Fumio; Kato, Kikuya

    2015-08-01

    Circulating tumour DNA (ctDNA) is an emerging field of cancer research. However, current ctDNA analysis is usually restricted to one or a few mutation sites due to technical limitations. In the case of massively parallel DNA sequencers, the number of false positives caused by a high read error rate is a major problem. In addition, the final sequence reads do not represent the original DNA population due to the global amplification step during the template preparation. We established a high-fidelity target sequencing system of individual molecules identified in plasma cell-free DNA using barcode sequences; this system consists of the following two steps. (i) A novel target sequencing method that adds barcode sequences by adaptor ligation. This method uses linear amplification to eliminate the errors introduced during the early cycles of polymerase chain reaction. (ii) The monitoring and removal of erroneous barcode tags. This process involves the identification of individual molecules that have been sequenced and for which the number of mutations has been absolutely quantitated. Using plasma cell-free DNA from patients with gastric or lung cancer, we demonstrated that the system achieved near complete elimination of false positives and enabled de novo detection and absolute quantitation of mutations in plasma cell-free DNA. PMID:26126624

  16. Rational Structure-Based Rescaffolding Approach to De Novo Design of Interleukin 10 (IL-10) Receptor-1 Mimetics

    PubMed Central

    Philipp, Jenny; Künze, Georg; Wodtke, Robert; Löser, Reik; Fahmy, Karim; Pisabarro, M. Teresa

    2016-01-01

    Tackling protein interfaces with small molecules capable of modulating protein-protein interactions remains a challenge in structure-based ligand design. Particularly arduous are cases in which the epitopes involved in molecular recognition have a non-structured and discontinuous nature. Here, the basic strategy of translating continuous binding epitopes into mimetic scaffolds cannot be applied, and other innovative approaches are therefore required. We present a structure-based rational approach involving the use of a regular expression syntax inspired by the well-established PROSITE to define minimal descriptors of geometric and functional constraints signifying relevant functionalities for recognition in protein interfaces of non-continuous and unstructured nature. These descriptors feed a search engine that explores the currently available three-dimensional chemical space of the Protein Data Bank (PDB) in order to identify in a straightforward manner regular architectures containing the desired functionalities, which could be used as templates to guide the rational design of small natural-like scaffolds mimicking the targeted recognition site. The application of this rescaffolding strategy to the discovery of natural scaffolds incorporating a selection of functionalities of interleukin-10 receptor-1 (IL-10R1), which are relevant for its interaction with interleukin-10 (IL-10), has resulted in the de novo design of a new class of potent IL-10 peptidomimetic ligands. PMID:27123592

  17. Rational Structure-Based Rescaffolding Approach to De Novo Design of Interleukin 10 (IL-10) Receptor-1 Mimetics.

    PubMed

    Ruiz-Gómez, Gloria; Hawkins, John C; Philipp, Jenny; Künze, Georg; Wodtke, Robert; Löser, Reik; Fahmy, Karim; Pisabarro, M Teresa

    2016-01-01

    Tackling protein interfaces with small molecules capable of modulating protein-protein interactions remains a challenge in structure-based ligand design. Particularly arduous are cases in which the epitopes involved in molecular recognition have a non-structured and discontinuous nature. Here, the basic strategy of translating continuous binding epitopes into mimetic scaffolds cannot be applied, and other innovative approaches are therefore required. We present a structure-based rational approach involving the use of a regular expression syntax inspired by the well-established PROSITE to define minimal descriptors of geometric and functional constraints signifying relevant functionalities for recognition in protein interfaces of non-continuous and unstructured nature. These descriptors feed a search engine that explores the currently available three-dimensional chemical space of the Protein Data Bank (PDB) in order to identify in a straightforward manner regular architectures containing the desired functionalities, which could be used as templates to guide the rational design of small natural-like scaffolds mimicking the targeted recognition site. The application of this rescaffolding strategy to the discovery of natural scaffolds incorporating a selection of functionalities of interleukin-10 receptor-1 (IL-10R1), which are relevant for its interaction with interleukin-10 (IL-10), has resulted in the de novo design of a new class of potent IL-10 peptidomimetic ligands. PMID:27123592
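
    For readers unfamiliar with the PROSITE syntax that the abstract's descriptors are modelled on, the following Python sketch translates a sequence-level PROSITE pattern into an ordinary regular expression. It illustrates the pattern syntax only; the authors' descriptors encode geometric and functional constraints on three-dimensional structures, which this sketch does not attempt. The function name prosite_to_regex and the test sequence are hypothetical.

```python
import re

def prosite_to_regex(pattern):
    """Translate a PROSITE-style sequence pattern into a Python regex string."""
    pattern = pattern.rstrip(".")  # PROSITE patterns may end with a period
    parts = []
    for element in pattern.split("-"):
        element = element.replace("{", "[^").replace("}", "]")  # {P}    -> [^P]
        element = element.replace("(", "{").replace(")", "}")   # x(2,4) -> x{2,4}
        element = element.replace("x", ".")                      # x      -> any residue
        element = element.replace("<", "^").replace(">", "$")   # terminal anchors
        parts.append(element)
    return "".join(parts)

# N-glycosylation site pattern (PROSITE entry PS00001).
regex = prosite_to_regex("N-{P}-[ST]-{P}")
sites = re.findall(regex, "MKNGTAANPSQ")
```

    Here the first asparagine starts a match (NGTA), while the second is rejected because it is followed by proline.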

  18. Transcriptome Profile of the Asian Giant Hornet (Vespa mandarinia) Using Illumina HiSeq 4000 Sequencing: De Novo Assembly, Functional Annotation, and Discovery of SSR Markers

    PubMed Central

    Park, So Young; Kang, Se Won; Hwang, Hee-Ju; Wang, Tae Hun; Park, Eun Bi; Chung, Jong Min; Song, Dae Kwon; Kim, Changmu; Kim, Soonok; Lee, Jae Bong; Jeong, Heon Cheon; Park, Hong Seog; Han, Yeon Soo; Lee, Yong Seok

    2016-01-01

    Vespa mandarinia found in the forests of East Asia, including Korea, occupies the highest rank in the arthropod food web within its geographical range. It serves as a source of nutrition in the form of Vespa amino acid mixture and is listed as a threatened species, although no conservation measures have been implemented. Here, we performed de novo assembly of the V. mandarinia transcriptome by Illumina HiSeq 4000 sequencing. Over 60 million raw reads and 59,184,811 clean reads were obtained. After assembly, a total of 66,837 unigenes were clustered, 40,887, 44,455, and 22,390 of which showed homologous matches against the PANM, Unigene, and KOG databases, respectively. A total of 15,675 unigenes were assigned to Gene Ontology terms, and 5,132 unigenes were mapped to 115 KEGG pathways. The zinc finger domain (C2H2-like), serine/threonine/dual specificity protein kinase domain, and RNA recognition motif domain were among the top InterProScan domains predicted for V. mandarinia sequences. Among the unigenes, we identified 534,922 cDNA simple sequence repeats as potential markers. This is the first transcriptomic analysis of the wasp V. mandarinia using Illumina HiSeq 4000. The obtained datasets should promote the search for new genes to understand the physiological attributes of this wasp. PMID:26881195

  19. De Novo Assembly of Coding Sequences of the Mangrove Palm (Nypa fruticans) Using RNA-Seq and Discovery of Whole-Genome Duplications in the Ancestor of Palms

    PubMed Central

    Guo, Wuxia; Zhang, Ying; Zhou, Renchao; Shi, Suhua

    2015-01-01

    Nypa fruticans (Arecaceae) is the only monocot species of true mangroves. This species represents the earliest mangrove fossil recorded. How N. fruticans adapts to the harsh and unstable intertidal zone is an interesting question. However, the 60 gene segments deposited in NCBI are insufficient for solving this question. In this study, we sequenced, assembled and annotated the transcriptome of N. fruticans using next-generation sequencing technology. A total of 19,918,800 clean paired-end reads were de novo assembled into 45,368 unigenes with a N50 length of 1,096 bp. A total of 41.35% unigenes were functionally annotated using Blast2GO. Many genes annotated to “response to stress” and 15 putative positively selected genes were identified. Simple sequence repeats were identified and compared with other palms. The divergence time between N. fruticans and other palms was estimated at 75 million years ago using the genomic data, which is consistent with the fossil record. After calculating the synonymous substitution rate between paralogs, we found that two whole-genome duplication events were shared by N. fruticans and other palms. These duplication events provided a large amount of raw material for the more than 2,000 later speciation events in Arecaceae. This study provides a high quality resource for further functional and evolutionary studies of N. fruticans and palms in general. PMID:26684618

  20. The genetic landscape of paediatric de novo acute myeloid leukaemia as defined by single nucleotide polymorphism array and exon sequencing of 100 candidate genes.

    PubMed

    Olsson, Linda; Zettermark, Sofia; Biloglav, Andrea; Castor, Anders; Behrendtz, Mikael; Forestier, Erik; Paulsson, Kajsa; Johansson, Bertil

    2016-07-01

    Cytogenetic analyses of a consecutive series of 67 paediatric (median age 8 years; range 0-17) de novo acute myeloid leukaemia (AML) patients revealed aberrations in 55 (82%) cases. The most common subgroups were KMT2A rearrangement (29%), normal karyotype (15%), RUNX1-RUNX1T1 (10%), deletions of 5q, 7q and/or 17p (9%), myeloid leukaemia associated with Down syndrome (7%), PML-RARA (7%) and CBFB-MYH11 (5%). Single nucleotide polymorphism array (SNP-A) analysis and exon sequencing of 100 genes, performed in 52 and 40 cases, respectively (39 overlapping), revealed ≥1 aberration in 89%; when adding cytogenetic data, this frequency increased to 98%. Uniparental isodisomies (UPIDs) were detected in 13% and copy number aberrations (CNAs) in 63% (median 2/case); three UPIDs and 22 CNAs were recurrent. Twenty-two genes were targeted by focal CNAs, including AEBP2 and PHF6 deletions and genes involved in AML-associated gene fusions. Deep sequencing identified mutations in 65% of cases (median 1/case). In total, 60 mutations were found in 30 genes, primarily those encoding signalling proteins (47%), transcription factors (25%), or epigenetic modifiers (13%). Twelve genes (BCOR, CEBPA, FLT3, GATA1, KIT, KRAS, NOTCH1, NPM1, NRAS, PTPN11, SMC3 and TP53) were recurrently mutated. We conclude that SNP-A and deep sequencing analyses complement the cytogenetic diagnosis of paediatric AML. PMID:27022003

  1. Transcriptome Profile of the Asian Giant Hornet (Vespa mandarinia) Using Illumina HiSeq 4000 Sequencing: De Novo Assembly, Functional Annotation, and Discovery of SSR Markers.

    PubMed

    Patnaik, Bharat Bhusan; Park, So Young; Kang, Se Won; Hwang, Hee-Ju; Wang, Tae Hun; Park, Eun Bi; Chung, Jong Min; Song, Dae Kwon; Kim, Changmu; Kim, Soonok; Lee, Jae Bong; Jeong, Heon Cheon; Park, Hong Seog; Han, Yeon Soo; Lee, Yong Seok

    2016-01-01

    Vespa mandarinia found in the forests of East Asia, including Korea, occupies the highest rank in the arthropod food web within its geographical range. It serves as a source of nutrition in the form of Vespa amino acid mixture and is listed as a threatened species, although no conservation measures have been implemented. Here, we performed de novo assembly of the V. mandarinia transcriptome by Illumina HiSeq 4000 sequencing. Over 60 million raw reads and 59,184,811 clean reads were obtained. After assembly, a total of 66,837 unigenes were clustered, 40,887, 44,455, and 22,390 of which showed homologous matches against the PANM, Unigene, and KOG databases, respectively. A total of 15,675 unigenes were assigned to Gene Ontology terms, and 5,132 unigenes were mapped to 115 KEGG pathways. The zinc finger domain (C2H2-like), serine/threonine/dual specificity protein kinase domain, and RNA recognition motif domain were among the top InterProScan domains predicted for V. mandarinia sequences. Among the unigenes, we identified 534,922 cDNA simple sequence repeats as potential markers. This is the first transcriptomic analysis of the wasp V. mandarinia using Illumina HiSeq 4000. The obtained datasets should promote the search for new genes to understand the physiological attributes of this wasp. PMID:26881195
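
    Simple sequence repeats such as those reported above are typically found by scanning sequences for a short motif repeated in tandem. The Python sketch below shows one way to do this with a back-referencing regular expression. The abstract does not name the tool or thresholds used, so the minimum repeat counts and the function names here are illustrative assumptions.

```python
import re

# Minimum tandem repeat count per motif length; illustrative values,
# not taken from the abstract.
DEFAULT_MIN_REPEATS = {2: 6, 3: 5, 4: 5, 5: 5, 6: 5}

def is_compound_of_shorter(unit):
    """True if the motif is itself a repeat of a shorter motif (e.g. ATAT, AAA)."""
    n = len(unit)
    return any(n % k == 0 and unit == unit[:k] * (n // k) for k in range(1, n))

def find_ssrs(seq, min_repeats=DEFAULT_MIN_REPEATS):
    """Return (start, motif, repeat_count) for each tandem repeat found."""
    hits = []
    for unit_len, min_n in sorted(min_repeats.items()):
        # One copy of the motif followed by at least (min_n - 1) further copies.
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (unit_len, min_n - 1))
        for match in pattern.finditer(seq):
            unit = match.group(1)
            if is_compound_of_shorter(unit):
                continue  # avoid reporting (AT)n again as (ATAT)n, or homopolymers
            hits.append((match.start(), unit, len(match.group(0)) // unit_len))
    return hits

seq = "GG" + "AT" * 7 + "CC" + "CAG" * 5 + "TT"
hits = find_ssrs(seq)
```

    In this toy sequence the scan reports a dinucleotide repeat (AT, 7 copies) and a trinucleotide repeat (CAG, 5 copies), each with its 0-based start position.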

  2. De Novo Assembly of Coding Sequences of the Mangrove Palm (Nypa fruticans) Using RNA-Seq and Discovery of Whole-Genome Duplications in the Ancestor of Palms.

    PubMed

    He, Ziwen; Zhang, Zhang; Guo, Wuxia; Zhang, Ying; Zhou, Renchao; Shi, Suhua

    2015-01-01

    Nypa fruticans (Arecaceae) is the only monocot species of true mangroves. This species represents the earliest mangrove fossil recorded. How N. fruticans adapts to the harsh and unstable intertidal zone is an interesting question. However, the 60 gene segments deposited in NCBI are insufficient for solving this question. In this study, we sequenced, assembled and annotated the transcriptome of N. fruticans using next-generation sequencing technology. A total of 19,918,800 clean paired-end reads were de novo assembled into 45,368 unigenes with a N50 length of 1,096 bp. A total of 41.35% unigenes were functionally annotated using Blast2GO. Many genes annotated to "response to stress" and 15 putative positively selected genes were identified. Simple sequence repeats were identified and compared with other palms. The divergence time between N. fruticans and other palms was estimated at 75 million years ago using the genomic data, which is consistent with the fossil record. After calculating the synonymous substitution rate between paralogs, we found that two whole-genome duplication events were shared by N. fruticans and other palms. These duplication events provided a large amount of raw material for the more than 2,000 later speciation events in Arecaceae. This study provides a high quality resource for further functional and evolutionary studies of N. fruticans and palms in general. PMID:26684618
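
    The N50 length quoted above is a standard assembly statistic: the length L such that sequences of length L or longer together contain at least half of all assembled bases. A minimal Python sketch of the calculation follows; the function name and the toy lengths are illustrative and are not data from the study.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths."""
    total = sum(lengths)
    running = 0
    # Walk from the longest sequence down until half the bases are covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    raise ValueError("lengths must be non-empty")

toy_lengths = [100, 200, 300, 400, 500]
toy_n50 = n50(toy_lengths)
```

    For the toy lengths the total is 1,500 bases; the two longest sequences (500 + 400) already cover at least 750, so the N50 is 400.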

  3. Sequencing, De Novo Assembly, and Annotation of the Transcriptome of the Endangered Freshwater Pearl Bivalve, Cristaria plicata, Provides Novel Insights into Functional Genes and Marker Discovery

    PubMed Central

    Kang, Se Won; Hwang, Hee-Ju; Park, So Young; Park, Eun Bi; Chung, Jong Min; Song, Dae Kwon; Kim, Changmu; Kim, Soonok; Lee, Jun Sang; Han, Yeon Soo; Park, Hong Seog; Lee, Yong Seok

    2016-01-01

    Background: The freshwater mussel Cristaria plicata (Bivalvia: Eulamellibranchia: Unionidae) is an economically important species in molluscan aquaculture due to its use in pearl farming. The species has been listed as endangered in South Korea due to the loss of natural habitats caused by anthropogenic activities. The decreasing population and a lack of genomic information on the species are concerning for environmentalists and conservationists. In this study, we conducted a de novo transcriptome sequencing and annotation analysis of C. plicata using Illumina HiSeq 2500 next-generation sequencing (NGS) technology, the Trinity assembler, and bioinformatics databases to prepare a sustainable resource for the identification of candidate genes involved in immunity, defense, and reproduction. Results: The C. plicata transcriptome analysis included a total of 286,152,584 raw reads and 281,322,837 clean reads. The de novo assembly identified a total of 453,931 contigs and 374,794 non-redundant unigenes with average lengths of 731.2 and 737.1 bp, respectively. Furthermore, 100% coverage of C. plicata mitochondrial genes within two unigenes supported the quality of the assembler. In total, 84,274 unigenes showed homology to entries in at least one database, and 23,246 unigenes were allocated to one or more Gene Ontology (GO) terms. The most prominent GO biological process, cellular component, and molecular function categories (level 2) were cellular process, membrane, and binding, respectively. A total of 4,776 unigenes were mapped to 123 biological pathways in the KEGG database. Based on the GO terms and KEGG annotation, the unigenes were suggested to be involved in immunity, stress responses, sex-determination, and reproduction. A total of 17,251 cDNA simple sequence repeats (cSSRs) were identified from 61,141 unigenes (size of >1 kb) with the most abundant being dinucleotide repeats. Conclusions: This dataset represents the first transcriptome analysis of the endangered

  4. De novo sequencing of root transcriptome reveals complex cadmium-responsive regulatory networks in radish (Raphanus sativus L.).

    PubMed

    Xu, Liang; Wang, Yan; Liu, Wei; Wang, Jin; Zhu, Xianwen; Zhang, Keyun; Yu, Rugang; Wang, Ronghua; Xie, Yang; Zhang, Wei; Gong, Yiqin; Liu, Liwang

    2015-07-01

    Cadmium (Cd) is a nonessential metallic trace element that poses potential chronic toxicity to living organisms. To date, little is known about the Cd-responsive regulatory network in root vegetable crops including radish. In this study, 31,015 unigenes representing 66,552 assembled unique transcripts were isolated from radish root under Cd stress based on de novo transcriptome assembly. In all, 1496 differentially expressed genes (DEGs) consisting of 3579 transcripts were identified from Cd-free (CK) and Cd-treated (Cd200) libraries. Gene Ontology and pathway enrichment analysis indicated that the up- and down-regulated DEGs were predominantly involved in glucosinolate biosynthesis as well as cysteine and methionine-related pathways, respectively. RT-qPCR showed that the expression profiles of DEGs were consistent with results from RNA-Seq analysis. Several candidate genes encoding phytochelatin synthase (PCS), metallothioneins (MTs), glutathione (GSH), zinc iron permease (ZIPs) and ABC transporter were responsible for Cd uptake, accumulation, translocation and detoxification in radish. A schematic model of the DEGs and microRNAs involved in the Cd-responsive regulatory network was proposed. This study represents a first comprehensive transcriptome-based characterization of Cd-responsive DEGs in radish. These results could provide fundamental insight into complex Cd-responsive regulatory networks and facilitate further genetic manipulation of Cd accumulation in root vegetable crops. PMID:26025544

  5. De novo sequencing and transcriptome analysis of a low temperature tolerant Saccharum spontaneum clone IND 00-1037.

    PubMed

    Dharshini, S; Chakravarthi, M; J, Ashwin Narayan; Manoj, V M; Naveenarani, M; Kumar, Ravinder; Meena, Minturam; Ram, Bakshi; Appunu, C

    2016-08-10

    Saccharum spontaneum L., a wild relative of sugarcane, is known for its adaptability to environmental stresses, particularly cold stress. In the present study, an attempt was made for transcriptome profiling of the low temperature (10°C) tolerant S. spontaneum clone IND 00-1037 collected from high altitude regions of Arunachal Pradesh, North Eastern India. The Illumina Nextseq500 platform yielded a total of 47.63 and 48.18 million reads corresponding to 4.7 and 4.8 gigabase pairs (Gb) of processed reads for control and cold stressed (10°C for 24 h) samples, respectively. These reads were de novo assembled into 214,611 unigenes with an average length of 801 bp. Further, all unigenes were aligned to GO, KEGG and COG databases in order to identify novel genes and pathways responsive to low temperature conditions. The differential gene expression analysis revealed that about 2583 genes were upregulated and 3302 genes were downregulated during the stress. This is perhaps the comprehensive transcriptome data of a low temperature tolerant clone of S. spontaneum. This study would aid in identifying novel genes and also in future genomic studies pertaining to sugarcane and its wild relatives. PMID:27269250

  6. De Novo Sequencing and Transcriptome Analysis of Pleurotus eryngii subsp. tuoliensis (Bailinggu) Mycelia in Response to Cold Stimulation.

    PubMed

    Fu, Yong-Ping; Liang, Yuan; Dai, Yue-Ting; Yang, Chen-Tao; Duan, Ming-Zheng; Zhang, Zhuo; Hu, Song-Nian; Zhang, Zhi-Wu; Li, Yu

    2016-01-01

    Cold stimulation of Bailinggu's mycelia is the main factor that triggers primordia initiation for successful production of fruiting bodies under commercial cultivation. Yet, the molecular-level mechanisms involved in mycelia response to cold stimulation are still unclear. Here, we performed comparative transcriptomic analysis using RNA-Seq technology to better understand the gene expression regulation during different temporal stages of cold stimulation in Bailinggu. A total of 21,558 Bailinggu mycelia unigenes were de novo assembled and annotated from four libraries (control at 25 °C, plus cold stimulation treatments at -3 °C for a duration of 1-2 days, 5-6 days, and 9-10 days). GO and KEGG pathway analysis indicated that functional groups of differentially expressed unigenes associated with cell wall and membrane stabilization, calcium signaling and mitogen-activated protein kinases (MAPK) pathways, and soluble sugars and protein biosynthesis and metabolism pathways play a vital role in Bailinggu's response to cold stimulation. Six hundred and seven potential EST-based SSRs loci were identified in these unigenes, and 100 EST-SSR primers were randomly selected for validation. The overall polymorphism rate was 92% by using 10 wild strains of Bailinggu. Therefore, these results can serve as a valuable resource for a better understanding of the molecular mechanisms associated with Bailinggu's response to cold stimulation. PMID:27196889

  7. De novo transcriptome sequencing in a songbird, the dark-eyed junco (Junco hyemalis): genomic tools for an ecological model system

    PubMed Central

    2012-01-01

    Background: Though genomic-level data are becoming widely available, many of the metazoan species sequenced are laboratory systems whose natural history is not well documented. In contrast, the wide array of species with very well-characterized natural history have, until recently, lacked genomics tools. It is now possible to address significant evolutionary genomics questions by applying high-throughput sequencing to discover the majority of genes for ecologically tractable species, and by subsequently developing microarray platforms from which to investigate gene regulatory networks that function in natural systems. We used GS-FLX Titanium Sequencing (Roche/454-Sequencing) of two normalized libraries of pooled RNA samples to characterize a transcriptome of the dark-eyed junco (Junco hyemalis), a North American sparrow that is a classically studied species in the fields of photoperiodism, speciation, and hormone-mediated behavior. Results: From a broad pool of RNA sampled from tissues throughout the body of a male and a female junco, we sequenced a total of 434 million nucleotides from 1.17 million reads that were assembled de novo into 31,379 putative transcripts representing 22,765 gene sets covering 35.8 million nucleotides with 12-fold average depth of coverage. Annotation of roughly half of the putative genes was accomplished using sequence similarity, and expression was confirmed for the majority with a preliminary microarray analysis. Of 716 core bilaterian genes, 646 (90%) were recovered within our characterized gene set. Gene Ontology, orthoDB orthology groups, and KEGG Pathway annotation provide further functional information about the sequences, and 25,781 potential SNPs were identified. Conclusions: The extensive sequence information returned by this effort adds to the growing store of genomic data on diverse species. The extent of coverage and annotation achieved and confirmation of expression, show that transcriptome sequencing provides useful

  8. BG7: a new approach for bacterial genome annotation designed for next generation sequencing data.

    PubMed

    Pareja-Tobes, Pablo; Manrique, Marina; Pareja-Tobes, Eduardo; Pareja, Eduardo; Tobes, Raquel

    2012-01-01

    BG7 is a new system for de novo bacterial, archaeal and viral genome annotation based on a new approach specifically designed for annotating genomes sequenced with next generation sequencing technologies. The system is versatile and able to annotate genes even in the step of preliminary assembly of the genome. It is especially efficient at detecting unexpected genes horizontally acquired from distant bacterial or archaeal genomes, phages, plasmids, and mobile elements. From the initial phases of the gene annotation process, BG7 exploits the massive availability of annotated protein sequences in databases. BG7 predicts ORFs and infers their function based on protein similarity with a wide set of reference proteins, integrating ORF prediction and functional annotation phases in just one step. BG7 is especially tolerant to sequencing errors in start and stop codons, to frameshifts, and to assembly or scaffolding errors. The system is also tolerant to the high level of gene fragmentation which is frequently found in not fully assembled genomes. The current version of BG7, which is developed in Java, takes advantage of Amazon Web Services (AWS) cloud computing features, but it can also be run locally in any operating system. BG7 is a fast, automated and scalable system that can cope with the challenge of analyzing the huge number of genomes that are being sequenced with NGS technologies. Its capabilities and efficiency were demonstrated in the 2011 EHEC Germany outbreak, in which BG7 was used to produce the first annotations the day after the first entero-hemorrhagic E. coli genome sequences were made publicly available. The suitability of BG7 for genome annotation has been proved for Illumina, 454, Ion Torrent, and PacBio sequencing technologies. Besides, thanks to its plasticity, our system could be very easily adapted to work with new technologies in the future. PMID:23185310

  9. Malan syndrome: Sotos-like overgrowth with de novo NFIX sequence variants and deletions in six new patients and a review of the literature.

    PubMed

    Klaassens, Merel; Morrogh, Deborah; Rosser, Elisabeth M; Jaffer, Fatima; Vreeburg, Maaike; Bok, Levinus A; Segboer, Tim; van Belzen, Martine; Quinlivan, Ros M; Kumar, Ajith; Hurst, Jane A; Scott, Richard H

    2015-05-01

    De novo monoallelic variants in NFIX cause two distinct syndromes. Whole gene deletions, nonsense variants and missense variants affecting the DNA-binding domain have been seen in association with a Sotos-like phenotype that we propose is referred to as Malan syndrome. Frameshift and splice-site variants thought to avoid nonsense-mediated RNA decay have been seen in Marshall-Smith syndrome. We report six additional patients with Malan syndrome and de novo NFIX deletions or sequence variants and review the 20 patients now reported. The phenotype is characterised by moderate postnatal overgrowth and macrocephaly. Median height and head circumference in childhood are 2.0 and 2.3 standard deviations (SD) above the mean, respectively. There is overlap of the facial phenotype with NSD1-positive Sotos syndrome in some cases including a prominent forehead, high anterior hairline, downslanting palpebral fissures and prominent chin. Neonatal feeding difficulties and/or hypotonia have been reported in 30% of patients. Developmental delay/learning disability have been reported in all cases and are typically moderate. Ocular phenotypes are common, including strabismus (65%), nystagmus (25%) and optic disc pallor/hypoplasia (25%). Other recurrent features include pectus excavatum (40%) and scoliosis (25%). Eight reported patients have a deletion also encompassing CACNA1A, haploinsufficiency of which causes episodic ataxia type 2 or familial hemiplegic migraine. One previous case had episodic ataxia and one case we report has had cyclical vomiting responsive to pizotifen. In individuals with this contiguous gene deletion syndrome, awareness of possible later neurological manifestations is important, although their penetrance is not yet clear. PMID:25118028

  10. Whole exome sequencing is necessary to clarify ID/DD cases with de novo copy number variants of uncertain significance: Two proof-of-concept examples.

    PubMed

    Giorgio, Elisa; Ciolfi, Andrea; Biamino, Elisa; Caputo, Viviana; Di Gregorio, Eleonora; Belligni, Elga Fabia; Calcia, Alessandro; Gaidolfi, Elena; Bruselles, Alessandro; Mancini, Cecilia; Cavalieri, Simona; Molinatto, Cristina; Cirillo Silengo, Margherita; Ferrero, Giovanni Battista; Tartaglia, Marco; Brusco, Alfredo

    2016-07-01

    Whole exome sequencing (WES) is a powerful tool to identify clinically undefined forms of intellectual disability/developmental delay (ID/DD), especially in consanguineous families. Here we report the genetic definition of two sporadic cases, with syndromic ID/DD for whom array-Comparative Genomic Hybridization (aCGH) identified a de novo copy number variant (CNV) of uncertain significance. The phenotypes included microcephaly with brachycephaly and a distinctive facies in one proband, and hypotonia in the legs and mild ataxia in the other. WES allowed identification of a functionally relevant homozygous variant affecting a known disease gene for rare syndromic ID/DD in each proband, that is, c.1423C>T (p.Arg377*) in the Trafficking Protein Particle Complex 9 (TRAPPC9), and c.154T>C (p.Cys52Arg) in the Very Low Density Lipoprotein Receptor (VLDLR). Four mutations affecting TRAPPC9 have been previously reported, and the present finding further depicts this syndromic form of ID, which includes microcephaly with brachycephaly, corpus callosum hypoplasia, facial dysmorphism, and overweight. VLDLR-associated cerebellar hypoplasia (VLDLR-CH) is characterized by non-progressive congenital ataxia and moderate-to-profound intellectual disability. The c.154T>C (p.Cys52Arg) mutation was associated with a very mild form of ataxia, mild intellectual disability, and cerebellar hypoplasia without cortical gyri simplification. In conclusion, we report two novel cases with rare causes of autosomal recessive ID, which document how interpreting de novo array-CGH variants represents a challenge in consanguineous families; as such, clinical WES should be considered in diagnostic testing. © 2016 Wiley Periodicals, Inc. PMID:27108886