Science.gov

Sample records for novo sequencing approach

  1. A Hybrid Approach for de novo Human Genome Sequence Assembly and Phasing

    PubMed Central

    Mostovoy, Yulia; Levy-Sakin, Michal; Lam, Jessica; Lam, Ernest T; Hastie, Alex R; Marks, Patrick; Lee, Joyce; Chu, Catherine; Lin, Chin; Džakula, Željko; Cao, Han; Schlebusch, Stephen A.; Giorda, Kristina; Schnall-Levin, Michael; Wall, Jeffrey D.; Kwok, Pui-Yan

    2016-01-01

    Despite tremendous progress in genome sequencing, the basic goal of producing phased (haplotype-resolved) genome sequence with end-to-end contiguity for each chromosome at reasonable cost and effort is still unrealized. In this study, we describe a new approach to perform de novo genome assembly and experimental phasing by integrating the data from Illumina short-read sequencing, 10X Genomics Linked-Read sequencing, and BioNano Genomics genome mapping to yield a high-quality, phased, de novo assembled human genome. PMID:27159086

  2. A Machine Learning Based Approach to de novo Sequencing of Glycans from Tandem Mass Spectrometry Spectrum.

    PubMed

    Kumozaki, Shotaro; Sato, Kengo; Sakakibara, Yasubumi

    2015-01-01

    Recently, glycomics has been actively studied and various technologies for glycomics have been rapidly developed. Currently, tandem mass spectrometry (MS/MS) is one of the key experimental tools for identification of structures of oligosaccharides. MS/MS can observe peaks of fragmented glycan ions, including cross-ring ions resulting from internal cleavages, which provide valuable information to infer glycan structures. Thus, the aim of de novo sequencing of glycans is to find the most probable assignments of observed MS/MS peaks to glycan substructures without databases. However, there are few satisfactory algorithms for glycan de novo sequencing from MS/MS spectra. We present a machine learning based approach to de novo sequencing of glycans from MS/MS spectra. First, we build a suitable model for the fragmentation of glycans including cross-ring ions, and implement a solver that employs Lagrangian relaxation with a dynamic programming technique. Then, to optimize scores for the algorithm, we introduce a machine learning technique called structured support vector machines, which enables us to learn parameters including scores for cross-ring ions from training data, i.e., known glycan mass spectra. Furthermore, we implement additional constraints for core structures of well-known glycan types including N-linked glycans and O-linked glycans. This enables us to predict more accurate glycan structures if the glycan type of given spectra is known. Computational experiments show that our algorithm performs accurate de novo sequencing of glycans. The implementation of our algorithm and the datasets are available at http://glyfon.dna.bio.keio.ac.jp/.

  3. Constrained de novo sequencing of conotoxins.

    PubMed

    Bhatia, Swapnil; Kil, Yong J; Ueberheide, Beatrix; Chait, Brian T; Tayo, Lemmuel; Cruz, Lourdes; Lu, Bingwen; Yates, John R; Bern, Marshall

    2012-08-03

    De novo peptide sequencing by mass spectrometry (MS) can determine the amino acid sequence of an unknown peptide without reference to a protein database. MS-based de novo sequencing assumes special importance in focused studies of families of biologically active peptides and proteins, such as hormones, toxins, and antibodies, for which amino acid sequences may be difficult to obtain through genomic methods. These protein families often exhibit sequence homology or characteristic amino acid content; yet, current de novo sequencing approaches do not take advantage of this prior knowledge and, hence, search an unnecessarily large space of possible sequences. Here, we describe an algorithm for de novo sequencing that incorporates sequence constraints into the core graph algorithm and thereby reduces the search space by many orders of magnitude. We demonstrate our algorithm in a study of cysteine-rich toxins from two cone snail species (Conus textile and Conus stercusmuscarum) and report 13 de novo and about 60 total toxins.
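
    The idea of folding a sequence constraint into the graph search can be sketched as follows. This is only a minimal illustration, not the authors' algorithm: it assumes the spectrum has already been reduced to a list of prefix masses (real spectra contain fragment ion m/z values and noise), it uses a small subset of the standard monoisotopic residue masses, and the names `prefix_masses`, `sequences` and the cysteine-count constraint `n_cys` are hypothetical.

```python
# Standard monoisotopic residue masses (Da) for a small subset of amino acids.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "N": 114.04293, "D": 115.02694, "K": 128.09496, "E": 129.04259,
}


def prefix_masses(peptide):
    """Cumulative residue masses of a peptide, starting at 0.0."""
    masses = [0.0]
    for aa in peptide:
        masses.append(masses[-1] + RESIDUE_MASS[aa])
    return masses


def sequences(peaks, n_cys=None, tol=0.01):
    """Enumerate sequences whose prefix masses all appear among the peaks.

    Nodes of the (implicit) spectrum graph are the peaks; an edge joins two
    peaks whose mass difference matches one residue. If n_cys is given, any
    branch that exceeds that cysteine count is pruned during the search
    (hypothetical constraint, chosen because conotoxins are cysteine-rich).
    """
    masses = sorted(peaks)
    target = masses[-1]
    results = []

    def extend(current, path, cys):
        if abs(current - target) <= tol:
            if n_cys is None or cys == n_cys:
                results.append("".join(path))
            return
        for aa, mass in RESIDUE_MASS.items():
            new_cys = cys + (aa == "C")
            if n_cys is not None and new_cys > n_cys:
                continue  # the constraint removes this branch early
            nxt = current + mass
            if any(abs(nxt - p) <= tol for p in masses):
                extend(nxt, path + [aa], new_cys)

    extend(0.0, [], 0)
    return sorted(results)


# Peaks consistent with two candidate peptides.
peaks = set(prefix_masses("CAGC")) | set(prefix_masses("CGAC"))
print(sequences(peaks))           # both candidates survive
print(sequences(peaks, n_cys=1))  # the constraint rejects every path
```

    Applying the constraint inside the recursion, rather than filtering finished sequences, is what shrinks the search space: branches that can no longer satisfy the constraint are never expanded.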

  4. Constrained De Novo Sequencing of Conotoxins

    PubMed Central

    Bhatia, Swapnil; Kil, Yong J.; Ueberheide, Beatrix; Chait, Brian T.; Tayo, Lemmuel; Cruz, Lourdes; Lu, Bingwen; Yates, John R.; Bern, Marshall

    2012-01-01

    De novo peptide sequencing by mass spectrometry (MS) can determine the amino acid sequence of an unknown peptide without reference to a protein database. MS-based de novo sequencing assumes special importance in focused studies of families of biologically active peptides and proteins, such as hormones, toxins, and antibodies, for which amino acid sequences may be difficult to obtain through genomic methods. These protein families often exhibit sequence homology or characteristic amino acid content, yet current de novo sequencing approaches do not take advantage of this prior knowledge and hence search an unnecessarily large space of possible sequences. Here, we describe an algorithm for de novo sequencing that incorporates sequence constraints into the core graph algorithm, and thereby reduces the search space by many orders of magnitude. We demonstrate our algorithm in a study of cysteine-rich toxins from two cone snail species (Conus textile and Conus stercusmuscarum), and report 13 de novo and about 60 total toxins. PMID:22709442

  5. De novo peptide sequencing by deep learning.

    PubMed

    Tran, Ngoc Hieu; Zhang, Xianglilan; Xin, Lei; Shan, Baozhen; Li, Ming

    2017-07-18

    De novo peptide sequencing from tandem MS data is the key technology in proteomics for the characterization of proteins, especially for new sequences, such as mAbs. In this study, we propose a deep neural network model, DeepNovo, for de novo peptide sequencing. DeepNovo architecture combines recent advances in convolutional neural networks and recurrent neural networks to learn features of tandem mass spectra, fragment ions, and sequence patterns of peptides. The networks are further integrated with local dynamic programming to solve the complex optimization task of de novo sequencing. We evaluated the method on a wide variety of species and found that DeepNovo considerably outperformed state of the art methods, achieving 7.7-22.9% higher accuracy at the amino acid level and 38.1-64.0% higher accuracy at the peptide level. We further used DeepNovo to automatically reconstruct the complete sequences of antibody light and heavy chains of mouse, achieving 97.5-100% coverage and 97.2-99.5% accuracy, without assisting databases. Moreover, DeepNovo is retrainable to adapt to any sources of data and provides a complete end-to-end training and prediction solution to the de novo sequencing problem. Not only does our study extend the deep learning revolution to a new field, but it also shows an innovative approach in solving optimization problems by using deep learning and dynamic programming.

  6. De novo peptide sequencing by deep learning

    PubMed Central

    Tran, Ngoc Hieu; Zhang, Xianglilan; Xin, Lei; Shan, Baozhen; Li, Ming

    2017-01-01

    De novo peptide sequencing from tandem MS data is the key technology in proteomics for the characterization of proteins, especially for new sequences, such as mAbs. In this study, we propose a deep neural network model, DeepNovo, for de novo peptide sequencing. DeepNovo architecture combines recent advances in convolutional neural networks and recurrent neural networks to learn features of tandem mass spectra, fragment ions, and sequence patterns of peptides. The networks are further integrated with local dynamic programming to solve the complex optimization task of de novo sequencing. We evaluated the method on a wide variety of species and found that DeepNovo considerably outperformed state of the art methods, achieving 7.7–22.9% higher accuracy at the amino acid level and 38.1–64.0% higher accuracy at the peptide level. We further used DeepNovo to automatically reconstruct the complete sequences of antibody light and heavy chains of mouse, achieving 97.5–100% coverage and 97.2–99.5% accuracy, without assisting databases. Moreover, DeepNovo is retrainable to adapt to any sources of data and provides a complete end-to-end training and prediction solution to the de novo sequencing problem. Not only does our study extend the deep learning revolution to a new field, but it also shows an innovative approach in solving optimization problems by using deep learning and dynamic programming. PMID:28720701

  7. New Approaches and Technologies to Sequence de novo Plant reference Genomes (2013 DOE JGI Genomics of Energy and Environment 8th Annual User Meeting)

    SciTech Connect

    Schmutz, Jeremy

    2013-03-01

    Jeremy Schmutz of the HudsonAlpha Institute for Biotechnology on "New approaches and technologies to sequence de novo plant reference genomes" at the 8th Annual Genomics of Energy & Environment Meeting on March 27, 2013 in Walnut Creek, Calif.

  8. A combined de novo protein sequencing and cDNA library approach to the venomic analysis of Chinese spider Araneus ventricosus.

    PubMed

    Duan, Zhigui; Cao, Rui; Jiang, Liping; Liang, Songping

    2013-01-14

    In past years, spider venoms have attracted increasing attention due to their extraordinary chemical and pharmacological diversity. The recently popularized proteomic method has greatly improved our ability to analyze the proteins in the venom. However, the lack of sequence information for isolated venom proteins dramatically limits the ability to confidently identify venom proteins. In the present paper, the venom from Araneus ventricosus was analyzed using two complementary approaches: 2-DE/Shotgun-LC-MS/MS coupled to MASCOT search and 2-DE/Shotgun-LC-MS/MS coupled to manual de novo sequencing followed by local venom protein database (LVPD) search. The LVPD was constructed with toxin-like protein sequences obtained from the analysis of a cDNA library from A. ventricosus venom glands. Our results indicate that a total of 130 toxin-like protein sequences were unambiguously identified by manual de novo sequencing coupled to LVPD search, accounting for 86.67% of all toxin-like proteins in the LVPD. Thus manual de novo sequencing coupled to LVPD search proved to be an extremely effective approach for the analysis of venom proteins. In addition, the approach shows a clear advantage in validating mutant positions of isoforms from the same toxin-like family. Intriguingly, methyl esterification of glutamic acid was discovered for the first time in animal venom proteins by manual de novo sequencing.

  9. De novo sequencing of unique sequence tags for discovery of post-translational modifications of proteins

    SciTech Connect

    Shen, Yufeng; Tolic, Nikola; Hixson, Kim K.; Purvine, Samuel O.; Anderson, Gordon A.; Smith, Richard D.

    2008-10-15

    De novo sequencing holds promise for discovering protein post-translational modifications; however, such approaches are still in their infancy and are not widely applied in proteomics practice because of their limited reliability. In this work, we describe a de novo sequencing approach for discovery of protein modifications through identification of the UStags (Anal. Chem. 2008, 80, 1871-1882). The de novo information was obtained from Fourier-transform tandem mass spectrometry for peptides and polypeptides in a yeast lysate, and the de novo sequences obtained were filtered to define a more limited set of UStags. The DNA-predicted database protein sequences were then compared to the UStags, and the differences observed across or in the UStags (i.e., the UStags’ prefix and suffix sequences and the UStags themselves) were used to infer the possible sequence modifications. With this de novo-UStag approach, we uncovered some unexpected variances of yeast protein sequences due to amino acid mutations and/or multiple modifications to the predicted protein sequences. Random matching of the de novo sequences to the predicted sequences was examined using two random (false) databases, and ~3% false discovery rates were estimated for the de novo-UStag approach. The factors affecting the reliability (e.g., existence of de novo sequencing noise residues and redundant sequences) and the sensitivity are described. The de novo-UStag approach complements the UStag method previously reported by enabling discovery of new protein modifications.

  10. De novo assembly of Dekkera bruxellensis: a multi technology approach using short and long-read sequencing and optical mapping.

    PubMed

    Olsen, Remi-Andre; Bunikis, Ignas; Tiukova, Ievgeniia; Holmberg, Kicki; Lötstedt, Britta; Pettersson, Olga Vinnere; Passoth, Volkmar; Käller, Max; Vezzi, Francesco

    2015-01-01

    It remains a challenge to perform de novo assembly using next-generation sequencing (NGS). Despite the availability of multiple sequencing technologies and tools (e.g., assemblers) it is still difficult to assemble new genomes at chromosome resolution (i.e., one sequence per chromosome). Obtaining high quality draft assemblies is extremely important in the case of yeast genomes to better characterise major events in their evolutionary history. The aim of this work is two-fold: on the one hand we want to show how combining different and somewhat complementary technologies is key to improving assembly quality and correctness, and on the other hand we present a de novo assembly pipeline we believe to be beneficial to core facility bioinformaticians. To demonstrate both the effectiveness of combining technologies and the simplicity of the pipeline, here we present the results obtained using the Dekkera bruxellensis genome. In this work we used short-read Illumina data and long-read PacBio data combined with the extreme long-range information from OpGen optical maps in the task of de novo genome assembly and finishing. Moreover, we developed NouGAT, a semi-automated pipeline for read-preprocessing, de novo assembly and assembly evaluation, which was instrumental for this work. We obtained a high quality draft assembly of a yeast genome, resolved on a chromosomal level. Furthermore, this assembly was corrected for mis-assembly errors as demonstrated by resolving a large collapsed repeat and by receiving higher scores by assembly evaluation tools. With the inclusion of PacBio data we were able to fill about 5 % of the optical mapped genome not covered by the Illumina data.

  11. RNA-seq analysis of Rubus idaeus cv. Nova: transcriptome sequencing and de novo assembly for subsequent functional genomics approaches.

    PubMed

    Hyun, Tae Kyung; Lee, Sarah; Kumar, Dhinesh; Rim, Yeonggil; Kumar, Ritesh; Lee, Sang Yeol; Lee, Choong Hwan; Kim, Jae-Yean

    2014-10-01

    Using Illumina sequencing technology, we have generated large-scale transcriptome sequencing data containing abundant information on genes involved in the metabolic pathways in R. idaeus cv. Nova fruits. Rubus idaeus (red raspberry) is an economically important crop that possesses numerous nutrients, micronutrients and phytochemicals with essential health benefits to humans. The molecular mechanism underlying the ripening process and phytochemical biosynthesis in red raspberry is attributed to changes in gene expression, but very limited transcriptomic and genomic information is available in public databases. To address this issue, we generated more than 51 million sequencing reads from R. idaeus cv. Nova fruit using Illumina RNA-Seq technology. After de novo assembly, we obtained 42,604 unigenes with an average length of 812 bp. At the protein level, the Nova fruit transcriptome showed 77 and 68 % sequence similarities with Rubus coreanus and Fragaria vesca, respectively, indicating the evolutionary relationship between them. In addition, 69 % of assembled unigenes were annotated using public databases including NCBI non-redundant, Cluster of Orthologous Groups and Gene Ontology databases, suggesting that our transcriptome dataset provides a valuable resource for investigating metabolic processes in red raspberry. To analyze the relationship between several novel transcripts and the amounts of metabolites such as γ-aminobutyric acid and anthocyanins, real-time PCR and target metabolite analysis were performed on two different ripening stages of Nova. This is the first attempt to use the Illumina sequencing platform for RNA sequencing and de novo assembly of Nova fruit without a reference genome. Our data provide the most comprehensive transcriptome resource available for Rubus fruits, and will be useful for understanding the ripening process and for breeding R. idaeus cultivars with improved fruit quality.

  12. RNA-Seq analysis of Cocos nucifera: transcriptome sequencing and de novo assembly for subsequent functional genomics approaches.

    PubMed

    Fan, Haikuo; Xiao, Yong; Yang, Yaodong; Xia, Wei; Mason, Annaliese S; Xia, Zhihui; Qiao, Fei; Zhao, Songlin; Tang, Haoru

    2013-01-01

    Cocos nucifera (coconut), a member of the Arecaceae family, is an economically important woody palm grown in tropical regions. Despite its agronomic importance, previous germplasm assessment studies have relied solely on morphological and agronomical traits. Molecular biology techniques have been scarcely used in assessment of genetic resources and for improvement of important agronomic and quality traits in Cocos nucifera, mostly due to the absence of available sequence information. To provide basic information for molecular breeding and further molecular biological analysis in Cocos nucifera, we applied RNA-seq technology and de novo assembly to gain a global overview of the Cocos nucifera transcriptome from mixed tissue samples. Using Illumina sequencing, we obtained 54.9 million short reads and conducted de novo assembly to obtain 57,304 unigenes with an average length of 752 base pairs. Sequence comparison between assembled unigenes and released cDNA sequences of Cocos nucifera and Elaeis guineensis indicated that the assembled sequences were of high quality. Approximately 99.9% of unigenes were novel compared to the released coconut EST sequences. Using BLASTX, 68.2% of unigenes were successfully annotated based on the Genbank non-redundant (Nr) protein database. The annotated unigenes were then further classified using the Gene Ontology (GO), Clusters of Orthologous Groups (COG) and Kyoto Encyclopedia of Genes and Genomes (KEGG) databases. Our study provides a large quantity of novel genetic information for Cocos nucifera. This information will act as a valuable resource for further molecular genetic studies and breeding in coconut, as well as for isolation and characterization of functional genes involved in different biochemical pathways in this important tropical crop species.
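
    Summary figures such as the unigene count and average length quoted above are simple to compute from a list of assembled sequence lengths. The sketch below is illustrative only: the function name `assembly_stats` and the sample lengths are made up, and the N50 value is an addition that the abstract does not report (it is included because it is commonly quoted alongside the mean).

```python
def assembly_stats(lengths):
    """Return count, mean length and N50 for a list of sequence lengths (bp).

    N50 is the length L such that sequences of length >= L together cover
    at least half of the total assembled bases.
    """
    if not lengths:
        raise ValueError("at least one sequence length is required")
    ordered = sorted(lengths, reverse=True)
    total = sum(ordered)
    running = 0
    n50 = 0
    for length in ordered:
        running += length
        if running * 2 >= total:
            n50 = length
            break
    return {"count": len(ordered), "mean": total / len(ordered), "n50": n50}


# Hypothetical unigene lengths, not data from the study.
stats = assembly_stats([500, 300, 200, 1000, 400, 600])
print(stats)
```

    Because the mean is pulled down by many short sequences, it is usually read together with a length-weighted statistic such as N50 when judging assembly contiguity.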

  13. RNA-Seq Analysis of Cocos nucifera: Transcriptome Sequencing and De Novo Assembly for Subsequent Functional Genomics Approaches

    PubMed Central

    Xia, Wei; Mason, Annaliese S.; Xia, Zhihui; Qiao, Fei; Zhao, Songlin; Tang, Haoru

    2013-01-01

    Background: Cocos nucifera (coconut), a member of the Arecaceae family, is an economically important woody palm grown in tropical regions. Despite its agronomic importance, previous germplasm assessment studies have relied solely on morphological and agronomical traits. Molecular biology techniques have been scarcely used in assessment of genetic resources and for improvement of important agronomic and quality traits in Cocos nucifera, mostly due to the absence of available sequence information. Methodology/Principal Findings: To provide basic information for molecular breeding and further molecular biological analysis in Cocos nucifera, we applied RNA-seq technology and de novo assembly to gain a global overview of the Cocos nucifera transcriptome from mixed tissue samples. Using Illumina sequencing, we obtained 54.9 million short reads and conducted de novo assembly to obtain 57,304 unigenes with an average length of 752 base pairs. Sequence comparison between assembled unigenes and released cDNA sequences of Cocos nucifera and Elaeis guineensis indicated that the assembled sequences were of high quality. Approximately 99.9% of unigenes were novel compared to the released coconut EST sequences. Using BLASTX, 68.2% of unigenes were successfully annotated based on the Genbank non-redundant (Nr) protein database. The annotated unigenes were then further classified using the Gene Ontology (GO), Clusters of Orthologous Groups (COG) and Kyoto Encyclopedia of Genes and Genomes (KEGG) databases. Conclusions/Significance: Our study provides a large quantity of novel genetic information for Cocos nucifera. This information will act as a valuable resource for further molecular genetic studies and breeding in coconut, as well as for isolation and characterization of functional genes involved in different biochemical pathways in this important tropical crop species. PMID:23555859

  14. Novor: Real-Time Peptide de Novo Sequencing Software

    NASA Astrophysics Data System (ADS)

    Ma, Bin

    2015-11-01

    De novo sequencing software has been widely used in proteomics to sequence new peptides from tandem mass spectrometry data. This study presents a new software tool, Novor, to greatly improve both the speed and accuracy of today's peptide de novo sequencing analyses. To improve the accuracy, Novor's scoring functions are based on two large decision trees built from a peptide spectral library with more than 300,000 spectra with machine learning. Important knowledge about peptide fragmentation is extracted automatically from the library and incorporated into the scoring functions. The decision tree model also enables efficient score calculation and contributes to the speed improvement. To further improve the speed, a two-stage algorithmic approach, namely dynamic programming and refinement, is used. The software program was also carefully optimized. On the testing datasets, Novor sequenced 7%-37% more correct residues than the state-of-the-art de novo sequencing tool, PEAKS, while being an order of magnitude faster. Novor can de novo sequence more than 300 MS/MS spectra per second on a laptop computer. The speed surpasses the acquisition speed of today's mass spectrometer and, therefore, opens a new possibility to de novo sequence in real time while the spectrometer is acquiring the spectral data.

  15. NIPTL-Novo: Non-isobaric peptide termini labeling assisted peptide de novo sequencing.

    PubMed

    Zhang, Shen; Shan, Yichu; Zhang, Shurong; Sui, Zhigang; Zhang, Lihua; Liang, Zhen; Zhang, Yukui

    2017-02-10

    A simple and effective de novo sequencing strategy assisted by non-isobaric peptide termini labeling, NIPTL-Novo, was established. The y-series ions and b-series ions of peptides can be clearly distinguished according to the different mass tags incorporated at the N-terminus and C-terminus. This helps to improve the accuracy of peptide sequencing and to increase the sequencing speed. For the spectra identified by both de novo sequencing and database searching software (Mascot or MaxQuant), NIPTL-Novo gave identical results for more than 85% of these spectra. Furthermore, quantitative profiling of the sample can be performed simultaneously with de novo sequencing. Finally, this strategy can be applied to discover peptides with potential mutation sites by combining it with mass-defect based isotopic labeling.

  16. DeNovoID: a web-based tool for identifying peptides from sequence and mass tags deduced from de novo peptide sequencing by mass spectroscopy.

    PubMed

    Halligan, Brian D; Ruotti, Victor; Twigger, Simon N; Greene, Andrew S

    2005-07-01

    One of the core activities of high-throughput proteomics is the identification of peptides from mass spectra. Some peptides can be identified using spectral matching programs like Sequest or Mascot, but many spectra do not produce high quality database matches. De novo peptide sequencing is an approach to determine partial peptide sequences for some of the unidentified spectra. A drawback of de novo peptide sequencing is that it produces a series of ordered and disordered sequence tags and mass tags rather than a complete, non-degenerate peptide amino acid sequence. This incomplete data is difficult to use in conventional search programs such as BLAST or FASTA. DeNovoID is a program that has been specifically designed to use degenerate amino acid sequence and mass data derived from MS experiments to search a peptide database. Since the algorithm employed depends on the amino acid composition of the peptide and not its sequence, DeNovoID does not have to consider all possible sequences, but rather a smaller number of compositions consistent with a spectrum. DeNovoID also uses a geometric indexing scheme that reduces the number of calculations required to determine the best peptide match in the database. DeNovoID is available at http://proteomics.mcw.edu/denovoid.
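
    The composition-based matching described above can be sketched in a few lines. This is a simplified illustration under stated assumptions, not DeNovoID's implementation: it treats the de novo output as an unordered bag of residues, scans the database linearly instead of using the geometric indexing scheme, ignores mass tags, and the function name `composition_search` and the toy database are hypothetical.

```python
from collections import Counter


def composition_search(tag_residues, database):
    """Rank database peptides whose composition contains the query residues.

    The order of residues in the query is irrelevant; only counts matter.
    Hits are ranked by how many residues of the peptide are left unexplained.
    """
    query = Counter(tag_residues)
    hits = []
    for peptide in database:
        composition = Counter(peptide)
        if all(composition[aa] >= n for aa, n in query.items()):
            unexplained = len(peptide) - sum(query.values())
            hits.append((unexplained, peptide))
    return [peptide for _, peptide in sorted(hits)]


# Toy peptide database (hypothetical sequences).
database = ["PEPTIDE", "PROTEIN", "EDIT"]
print(composition_search("TIDE", database))
```

    Since the query is reduced to residue counts, a disordered tag such as "TIDE" and its permutation "EDIT" retrieve the same candidates, which is why the number of compositions to consider is far smaller than the number of possible sequences.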

  17. Multiplex De Novo Sequencing of Peptide Antibiotics

    NASA Astrophysics Data System (ADS)

    Mohimani, Hosein; Liu, Wei-Ting; Yang, Yu-Liang; Gaudêncio, Susana P.; Fenical, William; Dorrestein, Pieter C.; Pevzner, Pavel A.

    Proliferation of drug-resistant diseases raises the challenge of searching for new, more efficient antibiotics. Currently, some of the most effective antibiotics (e.g., Vancomycin and Daptomycin) are cyclic peptides produced by non-ribosomal biosynthetic pathways. The isolation and sequencing of cyclic peptide antibiotics, unlike the same activity with linear peptides, is time-consuming and error-prone. The dominant technique for sequencing cyclic peptides is NMR-based and requires large amounts (milligrams) of purified materials that, for most compounds, are not possible to obtain. Given these facts, there is a need for new tools to sequence cyclic non-ribosomal peptides (NRPs) using picograms of material. Since nearly all cyclic NRPs are produced along with related analogs, we develop a mass spectrometry approach for sequencing all related peptides at once (in contrast to the existing approach that analyzes individual peptides). Our results suggest that instead of attempting to isolate and NMR-sequence the most abundant compound, one should acquire spectra of many related compounds and sequence all of them simultaneously using tandem mass spectrometry. We illustrate applications of this approach by sequencing new variants of cyclic peptide antibiotics from Bacillus brevis, as well as sequencing a previously unknown family of cyclic NRPs produced by marine bacteria.

  18. MRUniNovo: an efficient tool for de novo peptide sequencing utilizing the hadoop distributed computing framework.

    PubMed

    Li, Chuang; Chen, Tao; He, Qiang; Zhu, Yunping; Li, Kenli

    2017-03-15

    Tandem mass spectrometry-based de novo peptide sequencing is a complex and time-consuming process. The current algorithms for de novo peptide sequencing cannot rapidly and thoroughly process large mass spectrometry datasets. In this paper, we propose MRUniNovo, a novel tool for parallel de novo peptide sequencing. MRUniNovo parallelizes UniNovo based on the Hadoop compute platform. Our experimental results demonstrate that MRUniNovo significantly reduces the computation time of de novo peptide sequencing without sacrificing the correctness and accuracy of the results, and thus can process very large datasets that UniNovo cannot. MRUniNovo is an open source software tool implemented in Java. The source code and the parameter settings are available at http://bioinfo.hupo.org.cn/MRUniNovo/index.php. Contact: s131020002@hnu.edu.cn; taochen1019@163.com. Supplementary data are available at Bioinformatics online.

  19. Ameliorated de novo transcriptome assembly using Illumina paired end sequence data with Trinity Assembler

    PubMed Central

    Bankar, Kiran Gopinath; Todur, Vivek Nagaraj; Shukla, Rohit Nandan; Vasudevan, Madavan

    2015-01-01

    The advent of next-generation sequencing has made de novo transcriptome assembly possible for organisms without a complete genome sequence. Among the various sequencing platforms available, Illumina is the most widely used on the basis of data quality, quantity and cost. Various de novo transcriptome assemblers are also available today. In this study, we aimed to obtain an ameliorated de novo transcriptome assembly from sequence reads generated on the Illumina platform and assembled using the Trinity assembler. We found that the primary transcriptome assembly produced by Trinity can be ameliorated on the basis of transcript length, coverage, depth and protein homology. Our amelioration approach is reproducible and could enhance the sensitivity and specificity of the assembled transcriptome, which could be critical for validation of the assembled transcripts and for planning various downstream biological assays. PMID:26484285

  20. Complete De Novo Assembly of Monoclonal Antibody Sequences

    PubMed Central

    Tran, Ngoc Hieu; Rahman, M. Ziaur; He, Lin; Xin, Lei; Shan, Baozhen; Li, Ming

    2016-01-01

    De novo protein sequencing is one of the key problems in mass spectrometry-based proteomics, especially for novel proteins such as monoclonal antibodies for which genome information is often limited or not available. However, due to limitations in peptide fragmentation and coverage, as well as ambiguities in spectrum interpretation, complete de novo assembly of unknown protein sequences still remains challenging. To address this problem, we propose an integrated system, ALPS, which for the first time can automatically assemble full-length monoclonal antibody sequences. Our system integrates de novo sequenced peptides, their quality scores and error-correction information from databases into a weighted de Bruijn graph to assemble protein sequences. We evaluated ALPS performance on two antibody data sets, each including a heavy chain and a light chain. The results show that ALPS was able to assemble three complete monoclonal antibody sequences of length 216–441 AA, at 100% coverage, and 96.64–100% accuracy. PMID:27562653
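
    A weighted de Bruijn graph over peptides can be illustrated with a short generic sketch. It is not the ALPS implementation, whose details the abstract does not give: the functions `build_graph` and `greedy_assemble`, the greedy highest-weight walk, the choice of k, and the example peptides with their scores are all assumptions made for illustration, and no database error correction is modeled.

```python
from collections import defaultdict


def build_graph(peptides, k):
    """Build a de Bruijn graph whose nodes are (k-1)-mers.

    Each k-mer of a peptide adds an edge from its prefix to its suffix,
    weighted by the peptide's quality score; weights of repeated k-mers add up.
    """
    graph = defaultdict(lambda: defaultdict(float))
    for sequence, score in peptides:
        for i in range(len(sequence) - k + 1):
            kmer = sequence[i:i + k]
            graph[kmer[:-1]][kmer[1:]] += score
    return graph


def greedy_assemble(graph, start):
    """Extend a contig from `start` by always following the heaviest edge."""
    contig, node, seen = start, start, {start}
    while node in graph:
        nxt = max(graph[node], key=graph[node].get)
        if nxt in seen:  # stop instead of looping through a cycle
            break
        contig += nxt[-1]
        seen.add(nxt)
        node = nxt
    return contig


# Hypothetical de novo peptides with confidence scores; the third one
# contains a low-confidence residue error (A instead of S).
peptides = [("EVQLVES", 0.9), ("LVESGGG", 0.8), ("QLVEAGG", 0.2)]
graph = build_graph(peptides, k=4)
print(greedy_assemble(graph, "EVQ"))
```

    At the branch after "LVE", the edge supported by two confident peptides outweighs the edge from the low-scoring one, so the erroneous residue is not carried into the contig. A production assembler would need a more careful traversal than this greedy walk.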

  1. De novo sequencing and variant calling with nanopores using PoreSeq.

    PubMed

    Szalay, Tamas; Golovchenko, Jene A

    2015-10-01

    The accuracy of sequencing single DNA molecules with nanopores is continually improving, but de novo genome sequencing and assembly using only nanopore data remain challenging. Here we describe PoreSeq, an algorithm that identifies and corrects errors in nanopore sequencing data and improves the accuracy of de novo genome assembly with increasing coverage depth. The approach relies on modeling the possible sources of uncertainty that occur as DNA transits through the nanopore and finds the sequence that best explains multiple reads of the same region. PoreSeq increases nanopore sequencing read accuracy of M13 bacteriophage DNA from 85% to 99% at 100× coverage. We also use the algorithm to assemble Escherichia coli with 30× coverage and the λ genome at a range of coverages from 3× to 50×. Additionally, we classify sequence variants at an order of magnitude lower coverage than is possible with existing methods.

  2. Considering Transposable Element Diversification in De Novo Annotation Approaches

    PubMed Central

    Flutre, Timothée; Duprat, Elodie; Feuillet, Catherine; Quesneville, Hadi

    2011-01-01

    Transposable elements (TEs) are mobile, repetitive DNA sequences that are almost ubiquitous in prokaryotic and eukaryotic genomes. They have a large impact on genome structure, function and evolution. With the recent development of high-throughput sequencing methods, many genome sequences have become available, making possible comparative studies of TE dynamics at an unprecedented scale. Several methods have been proposed for the de novo identification of TEs in sequenced genomes. Most begin with the detection of genomic repeats, but the subsequent steps for defining TE families differ. High-quality TE annotations are available for the Drosophila melanogaster and Arabidopsis thaliana genome sequences, providing a solid basis for the benchmarking of such methods. We compared the performance of specific algorithms for the clustering of interspersed repeats and found that only a particular combination of algorithms detected TE families with good recovery of the reference sequences. We then applied a new procedure for reconciling the different clustering results and classifying TE sequences. The whole approach was implemented in a pipeline using the REPET package. Finally, we show that our combined approach highlights the dynamics of well defined TE families by making it possible to identify structural variations among their copies. This approach makes it possible to annotate TE families and to study their diversification in a single analysis, improving our understanding of TE dynamics at the whole-genome scale and for diverse species. PMID:21304975

  3. De novo assembly of a bell pepper endornavirus genome sequence using RNA sequencing data.

    PubMed

    Jo, Yeonhwa; Choi, Hoseng; Cho, Won Kyong

    2015-03-19

    The genus Endornavirus comprises double-stranded RNA viruses that infect a wide range of hosts. In this study, we report on the de novo assembly of a bell pepper endornavirus genome sequence by RNA sequencing (RNA-Seq). Our results demonstrate the successful application of RNA-Seq to obtain a complete viral genome sequence from transcriptome data.

  4. Database Independent Protein Sequencing (DiPS) enables full-length de-novo protein and antibody sequence determination.

    PubMed

    Savidor, Alon; Barzilay, Rotem; Elinger, Dalia; Yarden, Yosef; Lindzen, Moshit; Gabashvili, Alexandra; Adiv Tal, Ophir; Levin, Yishai

    2017-03-27

    Traditional 'bottom-up' proteomics approaches use proteolytic digestion, LC-MS/MS and database searching to elucidate peptide identities and their parent proteins. Protein sequences absent from the database cannot be identified, and even if present in the database, complete sequence coverage is rarely achieved even for the most abundant proteins in the sample. Thus, sequencing of unknown proteins such as antibodies or constituents of metaproteomes remains a challenging problem. To date, there is no available method for full-length protein sequencing, independent of a reference database, in high throughput. Here we present Database Independent Protein Sequencing (DiPS), a method for unambiguous, rapid, database independent, full-length protein sequencing. The method is a novel combination of non-enzymatic, semi-random cleavage of the protein, LC-MS/MS analysis, peptide de novo sequencing, extraction of peptide tags, and their assembly into a consensus sequence using an algorithm named "Peptide Tag Assembler" (pTA). As proof-of-concept, the method was applied to samples of three known proteins representing three size classes and to a previously un-sequenced, clinically relevant, monoclonal antibody. Excluding leucine/isoleucine and glutamic-acid/deamidated glutamine ambiguities, end-to-end, full-length de novo sequencing was achieved with 99-100% accuracy for all benchmarking proteins and the antibody light chain. Accuracy of the sequenced antibody heavy chain, including the entire variable region, was also 100% but there was a 23 residue gap in the constant region sequence.

  5. Top-down analysis of protein samples by de novo sequencing techniques

    SciTech Connect

    Vyatkina, Kira; Wu, Si; Dekker, Lennard J. M.; VanDuijn, Martijn M.; Liu, Xiaowen; Tolić, Nikola; Luider, Theo M.; Paša-Tolić, Ljiljana; Pevzner, Pavel A.

    2016-05-14

    MOTIVATION: Recent technological advances have made high-resolution mass spectrometers affordable to many laboratories, thus boosting rapid development of top-down mass spectrometry, and implying a need for efficient methods for analyzing this kind of data. RESULTS: We describe a method for analysis of protein samples from top-down tandem mass spectrometry data, which capitalizes on de novo sequencing of fragments of the proteins present in the sample. Our algorithm takes as input a set of de novo amino acid strings derived from the given mass spectra using the recently proposed Twister approach, and combines them into aggregated strings endowed with offsets. The former typically constitute accurate sequence fragments of sufficiently well-represented proteins from the sample being analyzed, while the latter indicate their location in the protein sequence, and also bear information on post-translational modifications and fragmentation patterns.

  6. Nucleotide-sequence-specific de novo methylation in a somatic murine cell line.

    PubMed Central

    Szyf, M; Schimmer, B P; Seidman, J G

    1989-01-01

    DNA fragments encoding the mouse steroid 21-hydroxylase (C21 or Cyp21A1) gene are de novo methylated when introduced into the mouse adrenocortical tumor cell line Y1 by DNA-mediated gene transfer. Although CCGG sequences within the C21 gene are de novo methylated, CCGG sites within flanking vector sequences, other mammalian gene sequences driven by the C21 promoter, and the neomycin-resistance gene, which was cotransfected with the C21 gene, do not become methylated. At least two separate signals for de novo methylation are encoded within the gene since three fragments derived from the C21 gene were methylated de novo. Specific de novo methylation of C21-derived sequences does not occur in L cells or Y1 kin8 cells; this suggests that the cellular factors needed for de novo methylation of the C21 gene are not ubiquitous. Most DNA sequences are not de novo methylated when introduced into somatic cells and DNA sequences other than the C21 gene are not de novo methylated when introduced into Y1 cells. Several groups have suggested that de novo methylation occurs in early embryonic cells and that somatic cells strictly maintain their methylation pattern by a semiconservative methyltransferase. Our results suggest that de novo methylation of specific nucleotide sequences can occur in some mammalian somatic cells. PMID:2789380

  7. In vitro, long-range sequence information for de novo genome assembly via transposase contiguity.

    PubMed

    Adey, Andrew; Kitzman, Jacob O; Burton, Joshua N; Daza, Riza; Kumar, Akash; Christiansen, Lena; Ronaghi, Mostafa; Amini, Sasan; Gunderson, Kevin L; Steemers, Frank J; Shendure, Jay

    2014-12-01

    We describe a method that exploits contiguity preserving transposase sequencing (CPT-seq) to facilitate the scaffolding of de novo genome assemblies. CPT-seq is an entirely in vitro means of generating libraries comprised of 9216 indexed pools, each of which contains thousands of sparsely sequenced long fragments ranging from 5 kilobases to > 1 megabase. These pools are "subhaploid," in that the lengths of fragments contained in each pool sums to ∼5% to 10% of the full genome. The scaffolding approach described here, termed fragScaff, leverages coincidences between the content of different pools as a source of contiguity information. Specifically, CPT-seq data is mapped to a de novo genome assembly, followed by the identification of pairs of contigs or scaffolds whose ends disproportionately co-occur in the same indexed pools, consistent with true adjacency in the genome. Such candidate "joins" are used to construct a graph, which is then resolved by a minimum spanning tree. As a proof-of-concept, we apply CPT-seq and fragScaff to substantially boost the contiguity of de novo assemblies of the human, mouse, and fly genomes, increasing the scaffold N50 of de novo assemblies by eight- to 57-fold with high accuracy. We also demonstrate that fragScaff is complementary to Hi-C-based contact probability maps, providing midrange contiguity to support robust, accurate chromosome-scale de novo genome assemblies without the need for laborious in vivo cloning steps. Finally, we demonstrate CPT-seq as a means of anchoring unplaced novel human contigs to the reference genome as well as for detecting misassembled sequences.
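
    The scaffolding steps described above (score co-occurrence of contig ends across pools, build a graph of candidate joins, resolve it with a minimum spanning tree) can be sketched as follows. This is not the fragScaff implementation. The Jaccard score, the 0.3 threshold, the `contig:end` naming, and the pool data are all assumptions made for illustration, since the abstract does not specify the statistic used.

```python
from itertools import combinations

def jaccard(a, b):
    """Shared pools divided by total pools for two contig ends."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def candidate_joins(end_pools, min_score=0.3):
    """Score every pair of ends from different contigs; keep likely adjacencies."""
    joins = []
    for (e1, p1), (e2, p2) in combinations(sorted(end_pools.items()), 2):
        if e1.split(":")[0] == e2.split(":")[0]:
            continue  # two ends of the same contig are not a join
        score = jaccard(p1, p2)
        if score >= min_score:
            joins.append((1.0 - score, e1, e2))  # distance: lower is better
    return joins

def minimum_spanning_forest(nodes, edges):
    """Kruskal's algorithm with union-find; returns (end, end, score) edges."""
    parent = {n: n for n in nodes}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    chosen = []
    for dist, u, v in sorted(edges):
        ru, rv = find(u), find(v)
        if ru != rv:  # skip edges that would close a cycle
            parent[ru] = rv
            chosen.append((u, v, round(1.0 - dist, 3)))
    return chosen

# Made-up data: indexed pools (integers) in which reads hit each contig end.
end_pools = {
    "c1:L": {1, 2, 3},
    "c1:R": {10, 11, 12, 13},
    "c2:L": {10, 11, 12, 14},
    "c2:R": {20, 21, 22},
    "c3:L": {20, 21, 23},
    "c3:R": {30, 31},
}
joins = candidate_joins(end_pools)
tree = minimum_spanning_forest(end_pools, joins)
print(tree)  # [('c1:R', 'c2:L', 0.6), ('c2:R', 'c3:L', 0.5)]
```

    In this toy input the retained joins order the contigs as c1, c2, c3. A real scaffolder would also have to handle orientation, conflicting joins, and the link between the two ends of each contig, none of which this sketch attempts.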

  8. In vitro, long-range sequence information for de novo genome assembly via transposase contiguity

    PubMed Central

    Adey, Andrew; Kitzman, Jacob O.; Burton, Joshua N.; Daza, Riza; Kumar, Akash; Christiansen, Lena; Ronaghi, Mostafa; Amini, Sasan; Gunderson, Kevin L.; Steemers, Frank J.

    2014-01-01

    We describe a method that exploits contiguity preserving transposase sequencing (CPT-seq) to facilitate the scaffolding of de novo genome assemblies. CPT-seq is an entirely in vitro means of generating libraries comprised of 9216 indexed pools, each of which contains thousands of sparsely sequenced long fragments ranging from 5 kilobases to >1 megabase. These pools are “subhaploid,” in that the lengths of fragments contained in each pool sums to ∼5% to 10% of the full genome. The scaffolding approach described here, termed fragScaff, leverages coincidences between the content of different pools as a source of contiguity information. Specifically, CPT-seq data is mapped to a de novo genome assembly, followed by the identification of pairs of contigs or scaffolds whose ends disproportionately co-occur in the same indexed pools, consistent with true adjacency in the genome. Such candidate “joins” are used to construct a graph, which is then resolved by a minimum spanning tree. As a proof-of-concept, we apply CPT-seq and fragScaff to substantially boost the contiguity of de novo assemblies of the human, mouse, and fly genomes, increasing the scaffold N50 of de novo assemblies by eight- to 57-fold with high accuracy. We also demonstrate that fragScaff is complementary to Hi-C-based contact probability maps, providing midrange contiguity to support robust, accurate chromosome-scale de novo genome assemblies without the need for laborious in vivo cloning steps. Finally, we demonstrate CPT-seq as a means of anchoring unplaced novel human contigs to the reference genome as well as for detecting misassembled sequences. PMID:25327137

  9. De Novo Peptide Sequencing: Deep Mining of High-Resolution Mass Spectrometry Data.

    PubMed

    Islam, Mohammad Tawhidul; Mohamedali, Abidali; Fernandes, Criselda Santan; Baker, Mark S; Ranganathan, Shoba

    2017-01-01

    High resolution mass spectrometry has revolutionized proteomics over the past decade, resulting in tremendous amounts of data in the form of mass spectra, being generated in a relatively short span of time. The mining of this spectral data for analysis and interpretation though has lagged behind such that potentially valuable data is being overlooked because it does not fit into the mold of traditional database searching methodologies. Although the analysis of spectra by de novo sequences removes such biases and has been available for a long period of time, its uptake has been slow or almost nonexistent within the scientific community. In this chapter, we propose a methodology to integrate de novo peptide sequencing using three commonly available software solutions in tandem, complemented by homology searching, and manual validation of spectra. This simplified method would allow greater use of de novo sequencing approaches and potentially greatly increase proteome coverage leading to the unearthing of valuable insights into protein biology, especially of organisms whose genomes have been recently sequenced or are poorly annotated.

  10. PepExplorer: a similarity-driven tool for analyzing de novo sequencing results.

    PubMed

    Leprevost, Felipe V; Valente, Richard H; Lima, Diogo B; Perales, Jonas; Melani, Rafael; Yates, John R; Barbosa, Valmir C; Junqueira, Magno; Carvalho, Paulo C

    2014-09-01

    Peptide spectrum matching is the current gold standard for protein identification via mass-spectrometry-based proteomics. Peptide spectrum matching compares experimental mass spectra against theoretical spectra generated from a protein sequence database to perform identification, but protein sequences not present in a database cannot be identified unless their sequences are in part conserved. The alternative approach, de novo sequencing, can make it possible to infer a peptide sequence directly from a mass spectrum, but interpreting long lists of peptide sequences resulting from large-scale experiments is not trivial. With this as motivation, PepExplorer was developed to use rigorous pattern recognition to assemble a list of homologue proteins using de novo sequencing data coupled to sequence alignment to allow biological interpretation of the data. PepExplorer can read the output of various widely adopted de novo sequencing tools and converge to a list of proteins with a global false-discovery rate. To this end, it employs a radial basis function neural network that considers precursor charge states, de novo sequencing scores, peptide lengths, and alignment scores to select similar protein candidates, from a target-decoy database, usually obtained from phylogenetically related species. Alignments are performed using a modified Smith-Waterman algorithm tailored for the task at hand. We verified the effectiveness of our approach using a reference set of identifications generated by ProLuCID when searching for Pyrococcus furiosus mass spectra on the corresponding NCBI RefSeq database. We then modified the sequence database by swapping amino acids until ProLuCID was no longer capable of identifying any proteins. By searching the mass spectra using PepExplorer on the modified database, we were able to recover most of the identifications at a 1% false-discovery rate. Finally, we employed PepExplorer to disclose a comprehensive proteomic assessment of the Bothrops

  11. PepExplorer: A Similarity-driven Tool for Analyzing de Novo Sequencing Results *

    PubMed Central

    Leprevost, Felipe V.; Valente, Richard H.; Lima, Diogo B.; Perales, Jonas; Melani, Rafael; Yates, John R.; Barbosa, Valmir C.; Junqueira, Magno; Carvalho, Paulo C.

    2014-01-01

    Peptide spectrum matching is the current gold standard for protein identification via mass-spectrometry-based proteomics. Peptide spectrum matching compares experimental mass spectra against theoretical spectra generated from a protein sequence database to perform identification, but protein sequences not present in a database cannot be identified unless their sequences are in part conserved. The alternative approach, de novo sequencing, can make it possible to infer a peptide sequence directly from a mass spectrum, but interpreting long lists of peptide sequences resulting from large-scale experiments is not trivial. With this as motivation, PepExplorer was developed to use rigorous pattern recognition to assemble a list of homologue proteins using de novo sequencing data coupled to sequence alignment to allow biological interpretation of the data. PepExplorer can read the output of various widely adopted de novo sequencing tools and converge to a list of proteins with a global false-discovery rate. To this end, it employs a radial basis function neural network that considers precursor charge states, de novo sequencing scores, peptide lengths, and alignment scores to select similar protein candidates, from a target-decoy database, usually obtained from phylogenetically related species. Alignments are performed using a modified Smith–Waterman algorithm tailored for the task at hand. We verified the effectiveness of our approach using a reference set of identifications generated by ProLuCID when searching for Pyrococcus furiosus mass spectra on the corresponding NCBI RefSeq database. We then modified the sequence database by swapping amino acids until ProLuCID was no longer capable of identifying any proteins. By searching the mass spectra using PepExplorer on the modified database, we were able to recover most of the identifications at a 1% false-discovery rate. Finally, we employed PepExplorer to disclose a comprehensive proteomic assessment of the Bothrops

  12. Bioinformatics challenges in de novo transcriptome assembly using short read sequences in the absence of a reference genome sequence.

    PubMed

    Góngora-Castillo, Elsa; Buell, C Robin

    2013-04-01

    Plant natural product research can be facilitated through genome and transcriptome sequencing approaches that generate informative sequence and expression datasets that enable characterization of biochemical pathways of interest. As the overwhelming majority of plant-derived natural products are derived from species with little, if any, sequence and/or genomic resources, the ability to perform whole genome shotgun sequencing and assembly has been and will continue to be transformative as access to a genome sequence provides molecular resources and a context for discovery and characterization of biosynthetic pathways. Due to the reduced size and complexity of the transcriptome relative to the genome, transcriptome sequencing provides a rapid, inexpensive approach to access gene sequences, gene expression abundances, and gene expression patterns in any species, including those that lack a reference genome sequence. To date, successful applications of RNA sequencing in conjunction with de novo transcriptome assembly has enabled identification of new genes in an array of biochemical pathways in plants. While sequencing technologies are well developed, challenges remain in the handling and analysis of transcriptome sequences. In this Highlight article, we provide an overview of the bioinformatics challenges associated with transcriptome analyses using short read sequences and how to address these issues in plant species that lack a reference genome.

  13. LESSONS IN DE NOVO PEPTIDE SEQUENCING BY TANDEM MASS SPECTROMETRY

    PubMed Central

    Medzihradszky, Katalin F.; Chalkley, Robert J.

    2015-01-01

    Mass spectrometry has become the method of choice for the qualitative and quantitative characterization of protein mixtures isolated from all kinds of living organisms. The raw data in these studies are MS/MS spectra, usually of peptides produced by proteolytic digestion of a protein. These spectra are “translated” into peptide sequences, normally with the help of various search engines. Data acquisition and interpretation have both been automated, and most researchers look only at the summary of the identifications without ever viewing the underlying raw data used for assignments. Automated analysis of data is essential due to the volume produced. However, being familiar with the finer intricacies of peptide fragmentation processes, and experiencing the difficulties of manual data interpretation allow a researcher to be able to more critically evaluate key results, particularly because there are many known rules of peptide fragmentation that are not incorporated into search engine scoring. Since the most commonly used MS/MS activation method is collision-induced dissociation (CID), in this article we present a brief review of the history of peptide CID analysis. Next, we provide a detailed tutorial on how to determine peptide sequences from CID data. Although the focus of the tutorial is de novo sequencing, the lessons learned and resources supplied are useful for data interpretation in general. PMID:25667941
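
    The core arithmetic of reading a sequence from fragment ions can be sketched in a few lines: the gap between consecutive ions of one series corresponds to a residue mass. The example below is a minimal illustration, not the tutorial's procedure. It assumes a complete, clean, singly charged b-ion ladder that includes b1, which real CID spectra rarely provide. The residue masses are standard monoisotopic values; the function names and tolerances are illustrative.

```python
# Monoisotopic residue masses in Da; leucine and isoleucine are isobaric.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L/I": 113.08406,
    "N": 114.04293, "D": 115.02694, "Q": 128.05858, "K": 128.09496,
    "E": 129.04259, "M": 131.04049, "H": 137.05891, "F": 147.06841,
    "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
PROTON = 1.00728  # mass of a proton in Da

def b_ions(residues):
    """Singly charged b-ion m/z values for a list of residue labels."""
    total, out = PROTON, []
    for r in residues:
        total += RESIDUE_MASS[r]
        out.append(total)
    return out

def read_ladder(mz_values, tol=0.01):
    """Translate gaps between consecutive b ions into residue labels.

    Gaps matching several residues are joined with "|"; unmatched gaps give "?".
    """
    peaks = [PROTON] + sorted(mz_values)
    calls = []
    for lo, hi in zip(peaks, peaks[1:]):
        gap = hi - lo
        matches = [aa for aa, m in RESIDUE_MASS.items() if abs(m - gap) <= tol]
        calls.append("|".join(matches) if matches else "?")
    return calls

ladder = b_ions(["P", "E", "P", "T", "L/I", "D", "E"])
print(read_ladder(ladder))                       # high mass accuracy
print(read_ladder(b_ions(["K"]), tol=0.05))      # ['Q|K'] at low mass accuracy
```

    The second call shows why mass accuracy matters: glutamine and lysine differ by about 0.036 Da, so a loose tolerance cannot tell them apart, and leucine and isoleucine cannot be separated by mass at all.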

  14. De novo generation of simple sequence during gene amplification.

    PubMed Central

    Kirschner, L S

    1996-01-01

    Mammalian cells that have undergone gene amplification and/or gene rearrangement have been used as resources to gain insight into the questions of chromosome structure and dynamics. The multidrug resistant murine cell line J7.V2-1 has been shown previously to contain two distinct forms of the highly amplified mdr2 gene, a member of the mouse gene family responsible for the multidrug resistant (MDR) phenotype [Kirschner, L. S. (1995) DNA Cell Biol. 14, 47-59]. Characterization of both forms of the gene revealed that one form corresponded to the wild-type structure of the gene, whereas the other represented a rearrangement. Investigation of this altered gene demonstrated a deletion of 1.6 kb of the wild-type sequence, and replacement of this region with a poly(AT) tract that appears to have been generated de novo. Analysis of the native sequence in this region demonstrated the absence of repetitive elements, but was notable for the presence of two long stretches of polypurine: polypyrimidine strand asymmetry. Analysis of mdr2 transcripts in this cell line revealed that nearly all of the mRNA is transcribed from the rearranged form of the gene. This message is unable to code for a functional mdr2 gene product, owing to a deletion of the fourth exon during this event. Mechanisms of the rearrangement, as well as the significance of this curious effect on transcription, are discussed. PMID:8759018

  15. De novo transcript sequence reconstruction from RNA-Seq: reference generation and analysis with Trinity

    PubMed Central

    Yassour, Moran; Grabherr, Manfred; Blood, Philip D.; Bowden, Joshua; Couger, Matthew Brian; Eccles, David; Li, Bo; Lieber, Matthias; MacManes, Matthew D.; Ott, Michael; Orvis, Joshua; Pochet, Nathalie; Strozzi, Francesco; Weeks, Nathan; Westerman, Rick; William, Thomas; Dewey, Colin N.; Henschel, Robert; LeDuc, Richard D.; Friedman, Nir; Regev, Aviv

    2013-01-01

    De novo assembly of RNA-Seq data allows us to study transcriptomes without the need for a genome sequence, such as in non-model organisms of ecological and evolutionary importance, cancer samples, or the microbiome. In this protocol, we describe the use of the Trinity platform for de novo transcriptome assembly from RNA-Seq data in non-model organisms. We also present Trinity’s supported companion utilities for downstream applications, including RSEM for transcript abundance estimation, R/Bioconductor packages for identifying differentially expressed transcripts across samples, and approaches to identify protein coding genes. In an included tutorial we provide a workflow for genome-independent transcriptome analysis leveraging the Trinity platform. The software, documentation and demonstrations are freely available from http://trinityrnaseq.sf.net. PMID:23845962

  16. Optimizing and benchmarking de novo transcriptome sequencing: from library preparation to assembly evaluation.

    PubMed

    Hara, Yuichiro; Tatsumi, Kaori; Yoshida, Michio; Kajikawa, Eriko; Kiyonari, Hiroshi; Kuraku, Shigehiro

    2015-11-18

    RNA-seq enables gene expression profiling in selected spatiotemporal windows and yields massive sequence information with relatively low cost and time investment, even for non-model species. However, there remains considerable room for optimizing its workflow, in order to take full advantage of continuously developing sequencing capacity. Transcriptome sequencing for three embryonic stages of Madagascar ground gecko (Paroedura picta) was performed with the Illumina platform. The output reads were assembled de novo for reconstructing transcript sequences. In order to evaluate the completeness of transcriptome assemblies, we prepared a reference gene set consisting of vertebrate one-to-one orthologs. To take advantage of increased read length of >150 nt, we demonstrated shortened RNA fragmentation time, which resulted in a dramatic shift of insert size distribution. To evaluate products of multiple de novo assembly runs incorporating reads with different RNA sources, read lengths, and insert sizes, we introduce a new reference gene set, core vertebrate genes (CVG), consisting of 233 genes that are shared as one-to-one orthologs by all vertebrate genomes examined (29 species). The completeness assessment performed by the computational pipelines CEGMA and BUSCO referring to CVG demonstrated higher accuracy and resolution than with the gene set previously established for this purpose. As a result of the assessment with CVG, we have derived the most comprehensive transcript sequence set of the Madagascar ground gecko by means of assembling individual libraries followed by clustering the assembled sequences based on their overall similarities. Our results provide several insights into optimizing de novo RNA-seq workflow, including the coordination between library insert size and read length, which manifested in improved connectivity of assemblies. The approach and assembly assessment with CVG demonstrated here would be applicable to transcriptome analysis of other species as

  17. Evaluation and validation of de novo and hybrid assembly techniques to derive high quality genome sequences

    SciTech Connect

    Utturkar, Sagar M.; Klingeman, Dawn Marie

    2014-06-14

    Our motivation with this work was to assess the potential of different types of sequence data combined with de novo and hybrid assembly approaches to improve existing draft genome sequences. Our results show Illumina, 454 and PacBio sequencing technologies were used to generate de novo and hybrid genome assemblies for four different bacteria, which were assessed for quality using summary statistics (e.g. number of contigs, N50) and in silico evaluation tools. Differences in predictions of multiple copies of rDNA operons for each respective bacterium were evaluated by PCR and Sanger sequencing, and then the validated results were applied as an additional criterion to rank assemblies. In general, assemblies using longer PacBio reads were better able to resolve repetitive regions. In this study, the combination of Illumina and PacBio sequence data assembled through the ALLPATHS-LG algorithm gave the best summary statistics and most accurate rDNA operon number predictions. This study will aid others looking to improve existing draft genome assemblies. As to availability and implementation–all assembly tools except CLC Genomics Workbench are freely available under GNU General Public License.
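
    N50, one of the summary statistics mentioned above, is straightforward to compute: it is the length L such that contigs of length at least L together contain at least half of the assembled bases. The sketch below uses the common definition; the function name and the contig lengths are made up for illustration and are not taken from the study.

```python
def n50(contig_lengths):
    """Return the N50 of an assembly given its contig lengths."""
    if not contig_lengths:
        raise ValueError("no contigs")
    half = sum(contig_lengths) / 2
    running = 0
    # Walk from the longest contig down until half the bases are covered.
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if running >= half:
            return length

# Two hypothetical assemblies with the same total size (300 bases).
contiguous = [80, 70, 50, 40, 30, 20, 10]
fragmented = [10] * 30
print(n50(contiguous))  # 70
print(n50(fragmented))  # 10
```

    Because both assemblies have the same total length, the difference in N50 reflects contiguity only. This is why N50 is usually reported alongside the number of contigs and an independent check of correctness.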

  18. Evaluation and validation of de novo and hybrid assembly techniques to derive high quality genome sequences

    DOE PAGES

    Utturkar, Sagar M.; Klingeman, Dawn Marie; Land, Miriam L.; ...

    2014-06-14

    Our motivation with this work was to assess the potential of different types of sequence data combined with de novo and hybrid assembly approaches to improve existing draft genome sequences. Our results show Illumina, 454 and PacBio sequencing technologies were used to generate de novo and hybrid genome assemblies for four different bacteria, which were assessed for quality using summary statistics (e.g. number of contigs, N50) and in silico evaluation tools. Differences in predictions of multiple copies of rDNA operons for each respective bacterium were evaluated by PCR and Sanger sequencing, and then the validated results were applied as an additional criterion to rank assemblies. In general, assemblies using longer PacBio reads were better able to resolve repetitive regions. In this study, the combination of Illumina and PacBio sequence data assembled through the ALLPATHS-LG algorithm gave the best summary statistics and most accurate rDNA operon number predictions. This study will aid others looking to improve existing draft genome assemblies. As to availability and implementation–all assembly tools except CLC Genomics Workbench are freely available under GNU General Public License.

  19. The de novo assembly of mitochondrial genomes of the extinct passenger pigeon (Ectopistes migratorius) with next generation sequencing.

    PubMed

    Hung, Chih-Ming; Lin, Rong-Chien; Chu, Jui-Hua; Yeh, Chia-Fen; Yao, Chiou-Ju; Li, Shou-Hsien

    2013-01-01

    The information from ancient DNA (aDNA) provides an unparalleled opportunity to infer phylogenetic relationships and population history of extinct species and to investigate genetic evolution directly. However, the degraded and fragmented nature of aDNA has posed technical challenges for studies based on conventional PCR amplification. In this study, we present an approach based on next generation sequencing to efficiently sequence the complete mitochondrial genome (mitogenome) of two extinct passenger pigeons (Ectopistes migratorius) using de novo assembly of massive short (90 bp), paired-end or single-end reads. Although varying levels of human contamination and low levels of postmortem nucleotide lesion were observed, they did not impact sequencing accuracy. Our results demonstrated that the de novo assembly of shotgun sequence reads could be a potent approach to sequence mitogenomes, and offered an efficient way to infer evolutionary history of extinct species.

  20. The De Novo Assembly of Mitochondrial Genomes of the Extinct Passenger Pigeon (Ectopistes migratorius) with Next Generation Sequencing

    PubMed Central

    Hung, Chih-Ming; Lin, Rong-Chien; Chu, Jui-Hua; Yeh, Chia-Fen; Yao, Chiou-Ju; Li, Shou-Hsien

    2013-01-01

    The information from ancient DNA (aDNA) provides an unparalleled opportunity to infer phylogenetic relationships and population history of extinct species and to investigate genetic evolution directly. However, the degraded and fragmented nature of aDNA has posed technical challenges for studies based on conventional PCR amplification. In this study, we present an approach based on next generation sequencing to efficiently sequence the complete mitochondrial genome (mitogenome) of two extinct passenger pigeons (Ectopistes migratorius) using de novo assembly of massive short (90 bp), paired-end or single-end reads. Although varying levels of human contamination and low levels of postmortem nucleotide lesion were observed, they did not impact sequencing accuracy. Our results demonstrated that the de novo assembly of shotgun sequence reads could be a potent approach to sequence mitogenomes, and offered an efficient way to infer evolutionary history of extinct species. PMID:23437111

  1. Computational approaches for fragment-based and de novo design.

    PubMed

    Loving, Kathryn; Alberts, Ian; Sherman, Woody

    2010-01-01

    Fragment-based and de novo design strategies have been used in drug discovery for years. The methodologies for these strategies are typically discussed separately, yet the applications of these techniques overlap substantially. We present a review of various fragment-based discovery and de novo design protocols with an emphasis on successful applications in real-world drug discovery projects. Furthermore, we illustrate the strengths and weaknesses of the various approaches and discuss how one method can be used to complement another. We also discuss how the incorporation of experimental data as constraints in computational models can produce novel compounds that occupy unique areas in intellectual property (IP) space yet are biased toward the desired chemical property space. Finally, we present recent research results suggesting that computational tools applied to fragment-based discovery and de novo design can have a greater impact on the discovery process when coupled with the right experiments.

  2. High-definition De Novo Sequencing of Crustacean Hyperglycemic Hormone (CHH)-family Neuropeptides*

    PubMed Central

    Jia, Chenxi; Hui, Limei; Cao, Weifeng; Lietz, Christopher B.; Jiang, Xiaoyue; Chen, Ruibing; Catherman, Adam D.; Thomas, Paul M.; Ge, Ying; Kelleher, Neil L.; Li, Lingjun

    2012-01-01

    A complete understanding of the biological functions of large signaling peptides (>4 kDa) requires comprehensive characterization of their amino acid sequences and post-translational modifications, which presents significant analytical challenges. In the past decade, there has been great success with mass spectrometry-based de novo sequencing of small neuropeptides. However, these approaches are less applicable to larger neuropeptides because of the inefficient fragmentation of peptides larger than 4 kDa and their lower endogenous abundance. The conventional proteomics approach focuses on large-scale determination of protein identities via database searching, lacking the ability for in-depth elucidation of individual amino acid residues. Here, we present a multifaceted MS approach for identification and characterization of large crustacean hyperglycemic hormone (CHH)-family neuropeptides, a class of peptide hormones that play central roles in the regulation of many important physiological processes of crustaceans. Six crustacean CHH-family neuropeptides (8–9.5 kDa), including two novel peptides with extensive disulfide linkages and PTMs, were fully sequenced without reference to genomic databases. High-definition de novo sequencing was achieved by a combination of bottom-up, off-line top-down, and on-line top-down tandem MS methods. Statistical evaluation indicated that these methods provided complementary information for sequence interpretation and increased the local identification confidence of each amino acid. Further investigations by MALDI imaging MS mapped the spatial distribution and colocalization patterns of various CHH-family neuropeptides in the neuroendocrine organs, revealing that two CHH-subfamilies are involved in distinct signaling pathways. PMID:23028060

  3. First de novo whole genome sequencing and assembly of the pink-footed goose.

    PubMed

    Pujolar, J M; Dalén, L; Olsen, R A; Hansen, M M; Madsen, J

    2017-08-30

    Annotated genomes can provide new perspectives on the biology of species. We present the first de novo whole genome sequencing for the pink-footed goose. In order to obtain a high-quality de novo assembly, the strategy used was to combine one short insert paired-end library with two mate-pair libraries. The pink-footed goose genome was assembled de novo using three different assemblers and an assembly evaluation was subsequently performed in order to choose the best assembler. For our data, ALLPATHS-LG performed the best, since the assembly produced covers most of the genome, while introducing the fewest errors. A total of 26,134 genes were annotated, with bird species accounting for virtually all BLAST hits. We also estimated the substitution rate in the pink-footed goose, which can be of use in future demographic studies, by using a comparative approach with the genome of the chicken, the mallard and the swan goose. A substitution rate of 1.38×10⁻⁷ per nucleotide per generation was obtained when comparing the genomes of the two closely-related goose species (the pink-footed and the swan goose). Altogether, we provide a valuable tool for future genomic studies aiming at particular genes and regions of the pink-footed goose genome as well as other bird species. Copyright © 2017 Elsevier Inc. All rights reserved.

  4. Personal genome sequencing: current approaches and challenges

    PubMed Central

    Snyder, Michael; Du, Jiang; Gerstein, Mark

    2010-01-01

    The revolution in DNA sequencing technologies has now made it feasible to determine the genome sequences of many individuals; i.e., “personal genomes.” Genome sequences of cells and tissues from both normal and disease states have been determined. Using current approaches, whole human genome sequences are not typically assembled and determined de novo, but, instead, variations relative to a reference sequence are identified. We discuss the current state of personal genome sequencing, the main steps involved in determining a genome sequence (i.e., identifying single-nucleotide polymorphisms [SNPs] and structural variations [SVs], assembling new sequences, and phasing haplotypes), and the challenges and performance metrics for evaluating the accuracy of the reconstruction. Finally, we consider the possible individual and societal benefits of personal genome sequences. PMID:20194435

  5. De Novo Sequencing of Heparan Sulfate Oligosaccharides by Electron-Activated Dissociation

    PubMed Central

    Huang, Yu; Yu, Xiang; Mao, Yang; Costello, Catherine E.; Zaia, Joseph; Lin, Cheng

    2014-01-01

    Structural characterization of highly sulfated glycosaminoglycans (GAGs) by collisionally activated dissociation (CAD) is challenging because of the extensive sulfate losses mediated by free protons. While removal of the free protons may be achieved through the use of derivatization, metal cation adducts, and/or electrospray supercharging reagents, these steps add complexity to the experimental workflow. It is therefore desirable to develop an analytical approach for GAG sequencing that does not require derivatization or addition of reagents to the electrospray solution. Electron detachment dissociation (EDD) can produce extensive and informative fragmentation for GAGs without the need to remove free protons from the precursor ions. However, EDD is an inefficient process, often requiring consumption of large sample quantities (typically several micrograms), particularly for highly sulfated GAG ions. Here, we report that with improved instrumentation, optimization of the ionization and ion transfer parameters, and enhanced EDD efficiency, it is possible to generate highly informative EDD spectra of highly sulfated GAGs on the liquid chromatography (LC) time-scale, with consumption of only a few nanograms of sample. We further show that negative electron transfer dissociation (NETD) is an even more effective fragmentation technique for GAG sequencing, producing fewer sulfate losses while consuming smaller amounts of sample. Finally, a simple algorithm was developed for de novo HS sequencing based on their high resolution tandem mass spectra. These results demonstrate the potential of EDD and NETD as sensitive analytical tools for detailed, high-throughput, de novo structural analyses of highly sulfated GAGs. PMID:24224699

  6. De Novo Transcriptome Sequencing in Anopheles funestus Using Illumina RNA-Seq Technology

    PubMed Central

    Crawford, Jacob E.; Guelbeogo, Wamdaogo M.; Sanou, Antoine; Traoré, Alphonse; Vernick, Kenneth D.; Sagnon, N'Fale; Lazzaro, Brian P.

    2010-01-01

    Background: Anopheles funestus is one of the primary vectors of human malaria, which causes a million deaths each year in sub-Saharan Africa. Few scientific resources are available to facilitate studies of this mosquito species and relatively little is known about its basic biology and evolution, making development and implementation of novel disease control efforts more difficult. The An. funestus genome has not been sequenced, so in order to facilitate genome-scale experimental biology, we have sequenced the adult female transcriptome of An. funestus from a newly founded colony in Burkina Faso, West Africa, using the Illumina GAIIx next generation sequencing platform. Methodology/Principal Findings: We assembled short Illumina reads de novo using a novel approach involving iterative de novo assemblies and “target-based” contig clustering. We then selected a conservative set of 15,527 contigs through comparisons to four Dipteran transcriptomes as well as multiple functional and conserved protein domain databases. Comparison to the Anopheles gambiae immune system identified 339 contigs as putative immune genes, thus identifying a large portion of the immune system that can form the basis for subsequent studies of this important malaria vector. We identified 5,434 1∶1 orthologues between An. funestus and An. gambiae and found that among these 1∶1 orthologues, the protein sequence of those with putative immune function were significantly more diverged than the transcriptome as a whole. Short read alignments to the contig set revealed almost 367,000 genetic polymorphisms segregating in the An. funestus colony and demonstrated the utility of the assembled transcriptome for use in RNA-seq based measurements of gene expression. Conclusions/Significance: We developed a pipeline that makes de novo transcriptome sequencing possible in virtually any organism at a very reasonable cost ($6,300 in sequencing costs in our case). We anticipate that our approach could be used

  7. Combining phage display with de novo protein sequencing for reverse engineering of monoclonal antibodies.

    PubMed

    Rickert, Keith W; Grinberg, Luba; Woods, Robert M; Wilson, Susan; Bowen, Michael A; Baca, Manuel

    2016-01-01

    The enormous diversity created by gene recombination and somatic hypermutation makes de novo protein sequencing of monoclonal antibodies a uniquely challenging problem. Modern mass spectrometry-based sequencing will rarely, if ever, provide a single unambiguous sequence for the variable domains. A more likely outcome is computation of an ensemble of highly similar sequences that can satisfy the experimental data. This outcome can result in the need for empirical testing of many candidate sequences, sometimes iteratively, to identify one which can replicate the activity of the parental antibody. Here we describe an improved approach to antibody protein sequencing by using phage display technology to generate a combinatorial library of sequences that satisfy the mass spectrometry data, and selecting for functional candidates that bind antigen. This approach was used to reverse engineer 2 commercially-obtained monoclonal antibodies against murine CD137. Proteomic data enabled us to assign the majority of the variable domain sequences, with the exception of 3-5% of the sequence located within or adjacent to complementarity-determining regions. To efficiently resolve the sequence in these regions, small phage-displayed libraries were generated and subjected to antigen binding selection. Following enrichment of antigen-binding clones, 2 clones were selected for each antibody and recombinantly expressed as antigen-binding fragments (Fabs). In both cases, the reverse-engineered Fabs exhibited identical antigen binding affinity, within error, as Fabs produced from the commercial IgGs. This combination of proteomic and protein engineering techniques provides a useful approach to simplifying the technically challenging process of reverse engineering monoclonal antibodies from protein material.

  8. Combining phage display with de novo protein sequencing for reverse engineering of monoclonal antibodies

    PubMed Central

    Rickert, Keith W.; Grinberg, Luba; Woods, Robert M.; Wilson, Susan; Bowen, Michael A.; Baca, Manuel

    2016-01-01

    The enormous diversity created by gene recombination and somatic hypermutation makes de novo protein sequencing of monoclonal antibodies a uniquely challenging problem. Modern mass spectrometry-based sequencing will rarely, if ever, provide a single unambiguous sequence for the variable domains. A more likely outcome is computation of an ensemble of highly similar sequences that can satisfy the experimental data. This outcome can result in the need for empirical testing of many candidate sequences, sometimes iteratively, to identify one which can replicate the activity of the parental antibody. Here we describe an improved approach to antibody protein sequencing by using phage display technology to generate a combinatorial library of sequences that satisfy the mass spectrometry data, and selecting for functional candidates that bind antigen. This approach was used to reverse engineer 2 commercially-obtained monoclonal antibodies against murine CD137. Proteomic data enabled us to assign the majority of the variable domain sequences, with the exception of 3–5% of the sequence located within or adjacent to complementarity-determining regions. To efficiently resolve the sequence in these regions, small phage-displayed libraries were generated and subjected to antigen binding selection. Following enrichment of antigen-binding clones, 2 clones were selected for each antibody and recombinantly expressed as antigen-binding fragments (Fabs). In both cases, the reverse-engineered Fabs exhibited identical antigen binding affinity, within error, as Fabs produced from the commercial IgGs. This combination of proteomic and protein engineering techniques provides a useful approach to simplifying the technically challenging process of reverse engineering monoclonal antibodies from protein material. PMID:26852694

  9. Automated Antibody De Novo Sequencing and Its Utility in Biopharmaceutical Discovery

    NASA Astrophysics Data System (ADS)

    Sen, K. Ilker; Tang, Wilfred H.; Nayak, Shruti; Kil, Yong J.; Bern, Marshall; Ozoglu, Berk; Ueberheide, Beatrix; Davis, Darryl; Becker, Christopher

    2017-05-01

    Applications of antibody de novo sequencing in the biopharmaceutical industry range from the discovery of new antibody drug candidates to identifying reagents for research and determining the primary structure of innovator products for biosimilar development. When murine, phage display, or patient-derived monoclonal antibodies against a target of interest are available, but the cDNA or the original cell line is not, de novo protein sequencing is required to humanize and recombinantly express these antibodies, followed by in vitro and in vivo testing for functional validation. Availability of fully automated software tools for monoclonal antibody de novo sequencing enables efficient and routine analysis. Here, we present a novel method to automatically de novo sequence antibodies using mass spectrometry and the Supernovo software. The robustness of the algorithm is demonstrated through a series of stress tests.

  10. Automated Antibody De Novo Sequencing and Its Utility in Biopharmaceutical Discovery

    NASA Astrophysics Data System (ADS)

    Sen, K. Ilker; Tang, Wilfred H.; Nayak, Shruti; Kil, Yong J.; Bern, Marshall; Ozoglu, Berk; Ueberheide, Beatrix; Davis, Darryl; Becker, Christopher

    2017-01-01

    Applications of antibody de novo sequencing in the biopharmaceutical industry range from the discovery of new antibody drug candidates to identifying reagents for research and determining the primary structure of innovator products for biosimilar development. When murine, phage display, or patient-derived monoclonal antibodies against a target of interest are available, but the cDNA or the original cell line is not, de novo protein sequencing is required to humanize and recombinantly express these antibodies, followed by in vitro and in vivo testing for functional validation. Availability of fully automated software tools for monoclonal antibody de novo sequencing enables efficient and routine analysis. Here, we present a novel method to automatically de novo sequence antibodies using mass spectrometry and the Supernovo software. The robustness of the algorithm is demonstrated through a series of stress tests.

  11. Evaluating de novo sequencing in proteomics: already an accurate alternative to database-driven peptide identification?

    PubMed

    Muth, Thilo; Renard, Bernhard Y

    2017-03-21

    While peptide identifications in mass spectrometry (MS)-based shotgun proteomics are mostly obtained using database search methods, high-resolution spectrum data from modern MS instruments nowadays offer the prospect of improving the performance of computational de novo peptide sequencing. The major benefit of de novo sequencing is that it does not require a reference database to deduce full-length or partial tag-based peptide sequences directly from experimental tandem mass spectrometry spectra. Although various algorithms have been developed for automated de novo sequencing, the prediction accuracy of proposed solutions has been rarely evaluated in independent benchmarking studies. The main objective of this work is to provide a detailed evaluation on the performance of de novo sequencing algorithms on high-resolution data. For this purpose, we processed four experimental data sets acquired from different instrument types from collision-induced dissociation and higher energy collisional dissociation (HCD) fragmentation mode using the software packages Novor, PEAKS and PepNovo. Moreover, the accuracy of these algorithms is also tested on ground truth data based on simulated spectra generated from peak intensity prediction software. We found that Novor shows the overall best performance compared with PEAKS and PepNovo with respect to the accuracy of correct full peptide, tag-based and single-residue predictions. In addition, the same tool outpaced the commercial competitor PEAKS in terms of running time speedup by factors of around 12-17. Despite around 35% prediction accuracy for complete peptide sequences on HCD data sets, taken as a whole, the evaluated algorithms perform moderately on experimental data but show a significantly better performance on simulated data (up to 84% accuracy). Further, we describe the most frequently occurring de novo sequencing errors and evaluate the influence of missing fragment ion peaks and spectral noise on the accuracy. 
Finally

  12. Long-read sequencing and de novo assembly of a Chinese genome

    USDA-ARS's Scientific Manuscript database

    Short-read sequencing has enabled the de novo assembly of several individual human genomes, but with inherent limitations in characterizing repeat elements. Here we sequence a Chinese individual HX1 by single-molecule real-time (SMRT) long-read sequencing, construct a physical map by NanoChannel arr...

  13. Genomic Resources for Water Yam (Dioscorea alata L.): Analyses of EST-Sequences, De Novo Sequencing and GBS Libraries

    PubMed Central

    Saski, Christopher A.; Bhattacharjee, Ranjana; Scheffler, Brian E.; Asiedu, Robert

    2015-01-01

    The reducing cost and rapid progress in next-generation sequencing techniques coupled with high performance computational approaches have resulted in large-scale discovery of advanced genomic resources in several model and non-model plant species. Yam (Dioscorea spp.) is a major food and cash crop in many countries but research efforts have been limited to understand the genetics and generate genomic information for the crop. The availability of a large number of genomic resources including genome-wide molecular markers will accelerate the breeding efforts and application of genomic selection in yams. In the present study, several methods including expressed sequence tags (EST)-sequencing, de novo sequencing, and genotyping-by-sequencing (GBS) profiles on two yam (Dioscorea alata L.) genotypes (TDa 95/00328 and TDa 95-310) were performed to generate genomic resources for use in its improvement programs. This includes a comprehensive set of EST-SSRs, genomic SSRs, whole genome SNPs, and reduced representation SNPs. A total of 1,152 EST-SSRs were developed from >40,000 EST-sequences generated from the two genotypes. A set of 388 EST-SSRs were validated as polymorphic showing a polymorphism rate of 34% when tested on two diverse parents targeted for anthracnose disease. In addition, approximately 40X de novo whole genome sequence coverage was generated for each of the two genotypes, and a total of 18,584 and 15,952 genomic SSRs were identified for TDa 95/00328 and TDa 95-310, respectively. A custom made pipeline resulted in the selection of 573 genomic SSRs common across the two genotypes, of which only eight failed, 478 being polymorphic and 62 monomorphic indicating a polymorphic rate of 83.5%. Additionally, 288,505 high quality SNPs were also identified between these two genotypes. Genotyping by sequencing reads on these two genotypes also revealed 36,790 overlapping SNP positions that are distributed throughout the genome.
Our efforts in using different approaches

  14. Genomic Resources for Water Yam (Dioscorea alata L.): Analyses of EST-Sequences, De Novo Sequencing and GBS Libraries.

    PubMed

    Saski, Christopher A; Bhattacharjee, Ranjana; Scheffler, Brian E; Asiedu, Robert

    2015-01-01

    The reducing cost and rapid progress in next-generation sequencing techniques coupled with high performance computational approaches have resulted in large-scale discovery of advanced genomic resources in several model and non-model plant species. Yam (Dioscorea spp.) is a major food and cash crop in many countries but research efforts have been limited to understand the genetics and generate genomic information for the crop. The availability of a large number of genomic resources including genome-wide molecular markers will accelerate the breeding efforts and application of genomic selection in yams. In the present study, several methods including expressed sequence tags (EST)-sequencing, de novo sequencing, and genotyping-by-sequencing (GBS) profiles on two yam (Dioscorea alata L.) genotypes (TDa 95/00328 and TDa 95-310) were performed to generate genomic resources for use in its improvement programs. This includes a comprehensive set of EST-SSRs, genomic SSRs, whole genome SNPs, and reduced representation SNPs. A total of 1,152 EST-SSRs were developed from >40,000 EST-sequences generated from the two genotypes. A set of 388 EST-SSRs were validated as polymorphic showing a polymorphism rate of 34% when tested on two diverse parents targeted for anthracnose disease. In addition, approximately 40X de novo whole genome sequence coverage was generated for each of the two genotypes, and a total of 18,584 and 15,952 genomic SSRs were identified for TDa 95/00328 and TDa 95-310, respectively. A custom made pipeline resulted in the selection of 573 genomic SSRs common across the two genotypes, of which only eight failed, 478 being polymorphic and 62 monomorphic indicating a polymorphic rate of 83.5%. Additionally, 288,505 high quality SNPs were also identified between these two genotypes. Genotyping by sequencing reads on these two genotypes also revealed 36,790 overlapping SNP positions that are distributed throughout the genome.
Our efforts in using different approaches

  15. Partial De Novo Sequencing and Unusual CID Fragmentation of a 7 kDa, Disulfide-Bridged Toxin

    NASA Astrophysics Data System (ADS)

    Medzihradszky, Katalin F.; Bohlen, Christopher J.

    2012-05-01

    A 7 kDa toxin isolated from the venom of the Texas coral snake (Micrurus tener tener) was subjected to collision-induced dissociation (CID) and electron-transfer dissociation (ETD) analyses both before and after reduction at low pH. Manual and automated approaches to de novo sequencing are compared in detail. Manual de novo sequencing utilizing the combination of high accuracy CID and ETD data and an acid-related cleavage yielded the N-terminal half of the sequence from the reduced species. The intact polypeptide, containing 3 disulfide bridges, produced a series of unusual fragments in ion trap CID experiments: abundant internal amino acid losses were detected, and also one of the disulfide-linkage positions could be determined from fragments formed by the cleavage of two bonds. In addition, internal and c-type fragments were also observed.

  16. Partial de novo sequencing and unusual CID fragmentation of a 7 kDa, disulfide-bridged toxin

    PubMed Central

    Medzihradszky, Katalin F.; Bohlen, Christopher J.

    2015-01-01

    A 7 kDa toxin isolated from the venom of the Texas coral snake (Micrurus tener tener) was subjected to collision-induced dissociation (CID) and electron-transfer dissociation (ETD) analyses both before and after reduction at low pH. Manual and automated approaches to de novo sequencing are compared in detail. Manual de novo sequencing utilizing the combination of high accuracy CID and ETD data and an acid-related cleavage yielded the N-terminal half of the sequence from the reduced species. The intact polypeptide, containing 3 disulfide bridges, produced a series of unusual fragments in ion trap CID experiments: abundant internal amino acid losses were detected, and also one of the disulfide-linkage positions could be determined from fragments formed by the cleavage of two bonds. In addition, internal and c-type fragments were also observed. PMID:22351294

  17. Comprehensive de novo peptide sequencing from MS/MS pairs generated through complementary collision induced dissociation and 351 nm ultraviolet photodissociation.

    PubMed

    Horton, Andrew Pitchford; Robotham, Scott A; Cannon, Joe R; Holden, Dustin D; Marcotte, Edward M; Brodbelt, Jennifer S

    2017-02-24

    We describe a strategy for de novo peptide sequencing based on matched pairs of tandem mass spectra (MS/MS) obtained by collision induced dissociation (CID) and 351 nm ultraviolet photodissociation (UVPD). Each precursor ion is isolated twice with the mass spectrometer switching between CID and UVPD activation modes to obtain a complementary MS/MS pair. To interpret these paired spectra, we modified the UVnovo de novo sequencing software to automatically learn from and interpret fragmentation spectra, provided a representative set of training data. This machine learning procedure, using random forests, synthesizes information from one or multiple complementary spectra, such as the CID/UVPD pairs, into peptide fragmentation site predictions. In doing so, the burden of fragmentation model definition shifts from programmer to machine and opens up the model parameter space for inclusion of nonobvious features and interactions. This spectral synthesis also serves to transform distinct types of spectra into a common representation for subsequent activation-independent processing steps. Then, independent from precursor activation constraints, UVnovo's de novo sequencing procedure generates and scores sequence candidates for each precursor. We demonstrate the combined experimental and computational approach for de novo sequencing using whole cell E. coli lysate. In benchmarks on the CID/UVPD data, UVnovo assigned correct full-length sequences to 83% of the spectral pairs of doubly charged ions with high-confidence database identifications. Considering only top-ranked de novo predictions, 70% of the pairs were deciphered correctly. This de novo sequencing performance exceeds that of PEAKS and PepNovo on the CID spectra and that of UVnovo on CID or UVPD spectra alone. As presented here, the methods for paired CID/UVPD spectral acquisition and interpretation constitute a powerful workflow for high-throughput and accurate de novo peptide sequencing.

  18. Whole-genome sequencing for comparative genomics and de novo genome assembly.

    PubMed

    Benjak, Andrej; Sala, Claudia; Hartkoorn, Ruben C

    2015-01-01

    Next-generation sequencing technologies for whole-genome sequencing of mycobacteria are rapidly becoming an attractive alternative to more traditional sequencing methods. In particular this technology is proving useful for genome-wide identification of mutations in mycobacteria (comparative genomics) as well as for de novo assembly of whole genomes. Next-generation sequencing however generates a vast quantity of data that can only be transformed into a usable and comprehensible form using bioinformatics. Here we describe the methodology one would use to prepare libraries for whole-genome sequencing, and the basic bioinformatics to identify mutations in a genome following Illumina HiSeq or MiSeq sequencing, as well as de novo genome assembly following sequencing using Pacific Biosciences (PacBio).

  19. De Novo Sequencing of Peptides from Top-Down Tandem Mass Spectra

    SciTech Connect

    Vyatkina, Kira; Wu, Si; Dekker, Lennard J. M.; VanDuijn, Martijn M.; Liu, Xiaowen; Tolić, Nikola; Dvorkin, Mikhail; Alexandrova, Sonya; Luider, Theo M.; Paša-Tolić, Ljiljana; Pevzner, Pavel A.

    2015-11-06

    De novo sequencing of proteins and peptides is one of the most important problems in mass spectrometry-driven proteomics. A variety of methods have been developed to accomplish this task from a set of bottom-up tandem (MS/MS) mass spectra. However, a more recently emerged top-down technology, now gaining more and more popularity, opens new perspectives for protein analysis and characterization, implying a need for efficient algorithms for processing this kind of MS/MS data. Here we describe a method for retrieving long and accurate sequence fragments of the proteins contained in a sample from a set of top-down MS/MS spectra. To this end, we outline a strategy for generating high-quality sequence tags from top-down spectra, and introduce the concept of a T-Bruijn graph by adapting to the case of tags the notion of an A-Bruijn graph widely used in genomics. The output of the proposed approach represents the set of amino acid strings spelled out by optimal paths in the connected components of a T-Bruijn graph. We illustrate its performance on top-down datasets acquired from carbonic anhydrase 2 (CAH2) and the Fab region of alemtuzumab.
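
    The notion of a sequence tag mentioned above can be illustrated with a minimal sketch. This is not the authors' method and does not build a T-Bruijn graph; it only shows the generic idea that consecutive fragment masses differing by one amino acid residue mass spell out a residue. The names RESIDUE_MASS, match_residue and longest_tag, the 0.01 Da tolerance, and the assumption that the input is a list of neutral (deconvoluted) fragment masses from a single ion series are all illustrative choices.

```python
# Monoisotopic residue masses in daltons. Isoleucine is omitted because it
# has the same mass as leucine and cannot be told apart by mass alone.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "N": 114.04293, "D": 115.02694, "Q": 128.05858, "K": 128.09496,
    "E": 129.04259, "M": 131.04049, "H": 137.05891, "F": 147.06841,
    "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}


def match_residue(delta, tol=0.01):
    """Return every residue whose mass lies within tol of the mass gap."""
    return [aa for aa, mass in RESIDUE_MASS.items() if abs(mass - delta) <= tol]


def longest_tag(peaks, tol=0.01):
    """Return the longest residue string readable from gaps between peaks.

    peaks: fragment masses (hypothetical input format, see the lead-in).
    Uses dynamic programming over the sorted masses: best[j] holds the
    longest tag that ends at peak j.
    """
    peaks = sorted(peaks)
    best = [""] * len(peaks)
    for j in range(len(peaks)):
        for i in range(j):
            for aa in match_residue(peaks[j] - peaks[i], tol):
                if len(best[i]) + 1 > len(best[j]):
                    best[j] = best[i] + aa
    return max(best, key=len, default="")


# Toy ladder: an arbitrary 100.0 Da offset followed by G, A and S.
ladder = [100.0, 157.02146, 228.05857, 315.0906]
print(longest_tag(ladder))  # GAS
```

    In this toy ladder the gap spanning G and A together (128.05857 Da) also matches Q within the tolerance, so a tag generator has to prefer the longer path through the intermediate peak; ambiguities of this kind are one reason real tools score candidate tags rather than accept the first match.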

  20. Strict de novo methylation of the 35S enhancer sequence in gentian.

    PubMed

    Mishiba, Kei-ichiro; Yamasaki, Satoshi; Nakatsuka, Takashi; Abe, Yoshiko; Daimon, Hiroyuki; Oda, Masayuki; Nishihara, Masahiro

    2010-03-23

    A novel transgene silencing phenomenon was found in the ornamental plant, gentian (Gentiana triflora x G. scabra), in which the introduced Cauliflower mosaic virus (CaMV) 35S promoter region was strictly methylated, irrespective of the transgene copy number and integrated loci. Transgenic tobacco having the same vector did not show the silencing behavior. Not only unmodified, but also modified 35S promoters containing a 35S enhancer sequence were found to be highly methylated in the single copy transgenic gentian lines. The 35S core promoter (-90)-introduced transgenic lines showed a small degree of methylation, implying that the 35S enhancer sequence was involved in the methylation machinery. The rigorous silencing phenomenon enabled us to analyze methylation in a number of the transgenic lines in parallel, which led to the discovery of a consensus target region for de novo methylation, which comprised an asymmetric cytosine (CpHpH; H is A, C or T) sequence. Consequently, distinct footprints of de novo methylation were detected in each (modified) 35S promoter sequence, and the enhancer region (-148 to -85) was identified as a crucial target for de novo methylation. Electrophoretic mobility shift assay (EMSA) showed that complexes formed in gentian nuclear extract with the -149 to -124 and -107 to -83 region probes were distinct from those of tobacco nuclear extracts, suggesting that the complexes might contribute to de novo methylation. Our results provide insights into the phenomenon of sequence- and species- specific gene silencing in higher plants.

  1. DIME: a novel framework for de novo metagenomic sequence assembly.

    PubMed

    Guo, Xuan; Yu, Ning; Ding, Xiaojun; Wang, Jianxin; Pan, Yi

    2015-02-01

    The recently developed next generation sequencing platforms not only decrease the cost for metagenomics data analysis, but also greatly enlarge the size of metagenomic sequence datasets. A common bottleneck of available assemblers is that the trade-off between the noise of the resulting contigs and the gain in sequence length for better annotation has not received enough attention for large-scale sequencing projects, especially for the datasets with low coverage and a large number of nonoverlapping contigs. To address this limitation and promote both accuracy and efficiency, we develop a novel metagenomic sequence assembly framework, DIME, by taking the DIvide, conquer, and MErge strategies. In addition, we give two MapReduce implementations of DIME, DIME-cap3 and DIME-genovo, on Apache Hadoop platform. For a systematic comparison of the performance of the assembly tasks, we tested DIME and five other popular short read assembly programs, Cap3, Genovo, MetaVelvet, SOAPdenovo, and SPAdes on four synthetic and three real metagenomic sequence datasets with various reads from fifty thousand to a couple million in size. The experimental results demonstrate that our method not only partitions the sequence reads with an extremely high accuracy, but also reconstructs more bases, generates higher quality assembled consensus, and yields higher assembly scores, including corrected N50 and BLAST-score-per-base, than other tools with a nearly theoretical speed-up. Results indicate that DIME offers great improvement in assembly across a range of sequence abundances and thus is robust to decreasing coverage.
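
    For readers unfamiliar with the N50 score mentioned above, the following minimal sketch computes the plain N50 of a set of contig lengths: the length of the shortest contig in the smallest set of longest contigs that together cover at least half of the total assembled bases. The "corrected" N50 reported in the abstract is a variant whose exact definition is not given here, so this sketch should not be read as the authors' scoring code; the function name n50 is an illustrative choice.

```python
def n50(contig_lengths):
    """Return the plain N50 of an iterable of contig lengths."""
    lengths = sorted(contig_lengths, reverse=True)
    if not lengths:
        raise ValueError("need at least one contig length")
    half_total = sum(lengths) / 2
    running = 0
    # Walk from the longest contig down until half of all bases are covered.
    for length in lengths:
        running += length
        if running >= half_total:
            return length


# Nine contigs totalling 54 bases; 10 + 9 + 8 = 27 reaches half the total.
print(n50([2, 3, 4, 5, 6, 7, 8, 9, 10]))  # 8
```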

  2. De Novo Sequencing of Top-Down Tandem Mass Spectra: A Next Step towards Retrieving a Complete Protein Sequence

    PubMed Central

    Vyatkina, Kira

    2017-01-01

    De novo sequencing of tandem (MS/MS) mass spectra represents the only way to determine the sequence of proteins from organisms with unknown genomes, or the ones not directly inscribed in a genome—such as antibodies, or novel splice variants. Top-down mass spectrometry provides new opportunities for analyzing such proteins; however, retrieving a complete protein sequence from top-down MS/MS spectra still remains a distant goal. In this paper, we review the state-of-the-art on this subject, and enhance our previously developed Twister algorithm for de novo sequencing of peptides from top-down MS/MS spectra to derive longer sequence fragments of a target protein. PMID:28248257

  3. De novo mutations revealed by whole exome sequencing are strongly associated with autism

    PubMed Central

    Sanders, Stephan J.; Murtha, Michael T.; Gupta, Abha R.; Murdoch, John D.; Raubeson, Melanie J.; Willsey, A. Jeremy; Ercan-Sencicek, A. Gulhan; DiLullo, Nicholas M.; Parikshak, Neelroop N.; Stein, Jason L.; Walker, Michael F.; Ober, Gordon T.; Teran, Nicole A.; Song, Youeun; El-Fishawy, Paul; Murtha, Ryan C.; Choi, Murim; Overton, John D.; Bjornson, Robert D.; Carriero, Nicholas J.; Meyer, Kyle A.; Bilguvar, Kaya; Mane, Shrikant M.; Šestan, Nenad; Lifton, Richard P.; Günel, Murat; Roeder, Kathryn; Geschwind, Daniel H.; Devlin, Bernie; State, Matthew W.

    2013-01-01

    Multiple studies have confirmed the contribution of rare de novo copy number variations (CNVs) to the risk for Autism Spectrum Disorders (ASD) [1-3]. While de novo single nucleotide variants (SNVs) have been identified in affected individuals [4], their contribution to risk has yet to be clarified. Specifically, the frequency and distribution of these mutations have not been well characterized in matched unaffected controls, data that are vital to the interpretation of de novo coding mutations observed in probands. Here we show, via whole-exome sequencing of 928 individuals, including 200 phenotypically discordant sibling pairs, that highly disruptive (nonsense and splice-site) de novo mutations in brain-expressed genes are associated with ASD and carry large effects (OR=5.65; CI: 1.44-22.2; p=0.01 asymptotic test). Based on mutation rates in unaffected individuals, we demonstrate that multiple independent de novo SNVs in the same gene among unrelated probands reliably identify risk alleles, providing a clear path forward for gene discovery. Among a total of 279 identified de novo coding mutations, there is a single instance in probands, and none in siblings, in which two independent nonsense variants disrupt the same gene, SCN2A (Sodium Channel, Voltage-Gated, Type II, Alpha Subunit), a result that is highly unlikely by chance (p=0.005). PMID:22495306

  4. De novo mutations revealed by whole-exome sequencing are strongly associated with autism.

    PubMed

    Sanders, Stephan J; Murtha, Michael T; Gupta, Abha R; Murdoch, John D; Raubeson, Melanie J; Willsey, A Jeremy; Ercan-Sencicek, A Gulhan; DiLullo, Nicholas M; Parikshak, Neelroop N; Stein, Jason L; Walker, Michael F; Ober, Gordon T; Teran, Nicole A; Song, Youeun; El-Fishawy, Paul; Murtha, Ryan C; Choi, Murim; Overton, John D; Bjornson, Robert D; Carriero, Nicholas J; Meyer, Kyle A; Bilguvar, Kaya; Mane, Shrikant M; Sestan, Nenad; Lifton, Richard P; Günel, Murat; Roeder, Kathryn; Geschwind, Daniel H; Devlin, Bernie; State, Matthew W

    2012-04-04

    Multiple studies have confirmed the contribution of rare de novo copy number variations to the risk for autism spectrum disorders. But whereas de novo single nucleotide variants have been identified in affected individuals, their contribution to risk has yet to be clarified. Specifically, the frequency and distribution of these mutations have not been well characterized in matched unaffected controls, and such data are vital to the interpretation of de novo coding mutations observed in probands. Here we show, using whole-exome sequencing of 928 individuals, including 200 phenotypically discordant sibling pairs, that highly disruptive (nonsense and splice-site) de novo mutations in brain-expressed genes are associated with autism spectrum disorders and carry large effects. On the basis of mutation rates in unaffected individuals, we demonstrate that multiple independent de novo single nucleotide variants in the same gene among unrelated probands reliably identifies risk alleles, providing a clear path forward for gene discovery. Among a total of 279 identified de novo coding mutations, there is a single instance in probands, and none in siblings, in which two independent nonsense variants disrupt the same gene, SCN2A (sodium channel, voltage-gated, type II, α subunit), a result that is highly unlikely by chance.

  5. Identification of optimum sequencing depth especially for de novo genome assembly of small genomes using next generation sequencing data.

    PubMed

    Desai, Aarti; Marwah, Veer Singh; Yadav, Akshay; Jha, Vineet; Dhaygude, Kishor; Bangar, Ujwala; Kulkarni, Vivek; Jere, Abhay

    2013-01-01

    Next Generation Sequencing (NGS) is a disruptive technology that has found widespread acceptance in the life sciences research community. The high throughput and low cost of sequencing have encouraged researchers to undertake ambitious genomic projects, especially in de novo genome sequencing. Currently, NGS systems generate sequence data as short reads, and de novo genome assembly using these short reads is computationally very intensive. Due to lower cost of sequencing and higher throughput, NGS systems now provide the ability to sequence genomes at high depth. However, currently no report is available highlighting the impact of high sequence depth on genome assembly using real data sets and multiple assembly algorithms. Recently, some studies have evaluated the impact of sequence coverage, error rate and average read length on genome assembly using multiple assembly algorithms; however, these evaluations were performed using simulated datasets. One limitation of using simulated datasets is that variables such as error rates, read length and coverage, which are known to impact genome assembly, are carefully controlled. Hence, this study was undertaken to identify the minimum depth of sequencing required for de novo assembly for different sized genomes using graph based assembly algorithms and real datasets. Illumina reads for E. coli (4.6 MB), S. kudriavzevii (11.18 MB) and C. elegans (100 MB) were assembled using SOAPdenovo, Velvet, ABySS, Meraculous and IDBA-UD. Our analysis shows that 50X is the optimum read depth for assembling these genomes using all assemblers except Meraculous, which requires 100X read depth. Moreover, our analysis shows that de novo assembly from 50X read data requires only 6-40 GB RAM depending on the genome size and assembly algorithm used. We believe that this information can be extremely valuable for researchers in designing experiments and multiplexing, which will enable optimum utilization of sequencing as well as analysis resources.
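
    To make the 50X figure concrete, the sketch below applies the standard depth relation (depth = number of reads × read length / genome size) to the three genome sizes quoted in the abstract. It assumes that "MB" means megabases and that reads are 100 bp long; the read length and the helper name `reads_for_depth` are illustrative assumptions, not values from the study.

```python
import math


def reads_for_depth(genome_size_bp: int, depth: float, read_length_bp: int) -> int:
    """Smallest number of reads such that reads * read_length / genome_size >= depth."""
    return math.ceil(depth * genome_size_bp / read_length_bp)


# Genome sizes as quoted in the abstract (interpreted as base pairs).
GENOMES_BP = {
    "E. coli": 4_600_000,
    "S. kudriavzevii": 11_180_000,
    "C. elegans": 100_000_000,
}

ASSUMED_READ_LENGTH_BP = 100  # hypothetical read length, not given in the abstract

if __name__ == "__main__":
    for name, size in GENOMES_BP.items():
        n_reads = reads_for_depth(size, depth=50, read_length_bp=ASSUMED_READ_LENGTH_BP)
        print(f"{name}: {n_reads:,} reads for 50X")
```

    Under these assumptions the required read count scales linearly with genome size, which is why depth rather than raw read count is the quantity to plan around when multiplexing samples.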

  6. A Family-Based Probabilistic Method for Capturing De Novo Mutations from High-Throughput Short-Read Sequencing Data

    PubMed Central

    Cartwright, Reed A.; Hussin, Julie; Keebler, Jonathan E. M.; Stone, Eric A.; Awadalla, Philip

    2013-01-01

    Recent advances in high-throughput DNA sequencing technologies and associated statistical analyses have enabled in-depth analysis of whole-genome sequences. As this technology is applied to a growing number of individual human genomes, entire families are now being sequenced. Information contained within the pedigree of a sequenced family can be leveraged when inferring the donors' genotypes. The presence of a de novo mutation within the pedigree is indicated by a violation of Mendelian inheritance laws. Here, we present a method for probabilistically inferring genotypes across a pedigree using high-throughput sequencing data and producing the posterior probability of de novo mutation at each genomic site examined. This framework can be used to disentangle the effects of germline and somatic mutational processes and to simultaneously estimate the effect of sequencing error and the initial genetic variation in the population from which the founders of the pedigree arise. This approach is examined in detail through simulations and areas for method improvement are noted. By applying this method to data from members of a well-defined nuclear family with accurate pedigree information, the stage is set to make the most direct estimates of the human mutation rate to date. PMID:22499693
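
    The statement that a de novo mutation shows up as a violation of Mendelian inheritance can be illustrated with a much simpler check than the probabilistic method described in the abstract. The sketch below works on hard genotype calls for a child, mother and father and flags sites where the child's genotype cannot be formed from one maternal and one paternal allele. The function names and the toy genotypes are hypothetical, and the check ignores sequencing error, which is exactly what the authors' posterior-probability framework is meant to handle.

```python
from itertools import product


def is_mendelian_consistent(child, mother, father):
    """True if the child's two alleles can be formed from one maternal and one paternal allele."""
    return any(sorted((m, f)) == sorted(child) for m, f in product(mother, father))


def candidate_de_novo_sites(trio_genotypes):
    """Return positions whose child genotype violates Mendelian inheritance.

    trio_genotypes maps position -> (child, mother, father), each a pair of alleles.
    """
    return [
        pos
        for pos, (child, mother, father) in sorted(trio_genotypes.items())
        if not is_mendelian_consistent(child, mother, father)
    ]


if __name__ == "__main__":
    toy_sites = {
        101: (("A", "A"), ("A", "A"), ("A", "A")),  # consistent
        202: (("A", "G"), ("A", "A"), ("A", "A")),  # G is absent from both parents
        303: (("A", "G"), ("A", "G"), ("A", "A")),  # G inherited from the mother
    }
    print(candidate_de_novo_sites(toy_sites))
```

    A hard-call filter like this produces many false positives at realistic error rates, which is the motivation for scoring each site probabilistically instead.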

  7. De novo structure prediction of globular proteins aided by sequence variation-derived contacts.

    PubMed

    Kosciolek, Tomasz; Jones, David T

    2014-01-01

    The advent of high accuracy residue-residue intra-protein contact prediction methods enabled a significant boost in the quality of de novo structure predictions. Here, we investigate the potential benefits of combining a well-established fragment-based folding algorithm, FRAGFOLD, with PSICOV, a contact prediction method which uses sparse inverse covariance estimation to identify co-varying sites in multiple sequence alignments. Using a comprehensive set of 150 diverse globular target proteins, up to 266 amino acids in length, we are able to address the effectiveness and some limitations of such approaches to globular proteins in practice. Overall we find that using fragment assembly with both statistical potentials and predicted contacts is significantly better than either statistical potentials or contacts alone. Results show up to nearly 80% of correct predictions (TM-score ≥0.5) within the analysed dataset and a mean TM-score of 0.54. Unsuccessful modelling cases emerged either from conformational sampling problems, or insufficient contact prediction accuracy. Nevertheless, a strong dependency of the quality of final models on the fraction of satisfied predicted long-range contacts was observed. This not only highlights the importance of these contacts in determining the protein fold, but also (combined with other ensemble-derived qualities) provides a powerful guide as to the choice of correct models and the global quality of the selected model. A proposed quality assessment scoring function achieves 0.93 precision and 0.77 recall for the discrimination of correct folds on our dataset of decoys. These findings suggest the approach is well-suited for blind predictions on a variety of globular proteins of unknown 3D structure, provided that enough homologous sequences are available to construct a large and accurate multiple sequence alignment for the initial contact prediction step.

  8. De Novo Structure Prediction of Globular Proteins Aided by Sequence Variation-Derived Contacts

    PubMed Central

    Kosciolek, Tomasz; Jones, David T.

    2014-01-01

    The advent of high accuracy residue-residue intra-protein contact prediction methods enabled a significant boost in the quality of de novo structure predictions. Here, we investigate the potential benefits of combining a well-established fragment-based folding algorithm, FRAGFOLD, with PSICOV, a contact prediction method which uses sparse inverse covariance estimation to identify co-varying sites in multiple sequence alignments. Using a comprehensive set of 150 diverse globular target proteins, up to 266 amino acids in length, we are able to address the effectiveness and some limitations of such approaches to globular proteins in practice. Overall we find that using fragment assembly with both statistical potentials and predicted contacts is significantly better than either statistical potentials or contacts alone. Results show up to nearly 80% of correct predictions (TM-score ≥0.5) within the analysed dataset and a mean TM-score of 0.54. Unsuccessful modelling cases emerged either from conformational sampling problems, or insufficient contact prediction accuracy. Nevertheless, a strong dependency of the quality of final models on the fraction of satisfied predicted long-range contacts was observed. This not only highlights the importance of these contacts in determining the protein fold, but also (combined with other ensemble-derived qualities) provides a powerful guide as to the choice of correct models and the global quality of the selected model. A proposed quality assessment scoring function achieves 0.93 precision and 0.77 recall for the discrimination of correct folds on our dataset of decoys. These findings suggest the approach is well-suited for blind predictions on a variety of globular proteins of unknown 3D structure, provided that enough homologous sequences are available to construct a large and accurate multiple sequence alignment for the initial contact prediction step. PMID:24637808

  9. De novo design of signal sequences to localize cargo to the 1,2-propanediol utilization microcompartment.

    PubMed

    Jakobson, Christopher M; Slininger Lee, Marilyn F; Tullman-Ercek, Danielle

    2017-05-01

    Organizing heterologous biosyntheses inside bacterial cells can alleviate common problems owing to toxicity, poor kinetic performance, and cofactor imbalances. A subcellular organelle known as a bacterial microcompartment, such as the 1,2-propanediol utilization microcompartment of Salmonella, is a promising chassis for this strategy. Here we demonstrate de novo design of the N-terminal signal sequences used to direct cargo to these microcompartment organelles. We expand the native repertoire of signal sequences using rational and library-based approaches and show that a canonical leucine-zipper motif can function as a signal sequence for microcompartment localization. Our strategy can be applied to generate new signal sequences localizing arbitrary cargo proteins to the 1,2-propanediol utilization microcompartments. © 2017 The Protein Society.

  10. Approaching marine bioprospecting in hexacorals by RNA deep sequencing.

    PubMed

    Johansen, Steinar D; Emblem, Ase; Karlsen, Bård Ove; Okkenhaug, Siri; Hansen, Hilde; Moum, Truls; Coucheron, Dag H; Seternes, Ole Morten

    2010-07-31

    RNA deep sequencing represents a new complementary approach in marine bioprospecting. Next-generation sequencing platforms have recently been developed for de novo whole transcriptome analysis, small RNA discovery and gene expression profiling. Deep sequencing transcriptomics (sequencing the complete set of cellular transcripts at a specific stage or condition) leads to sequential identification of all expressed genes in a sample. When combined with high-throughput bioinformatics and protein synthesis, RNA deep sequencing represents a powerful new approach in gene product discovery and bioprospecting. Here we summarize recent progress in the analyses of hexacoral transcriptomes with the focus on cold-water sea anemones and related organisms.

  11. SMRT sequencing only de novo assembly of the sugar beet (Beta vulgaris) chloroplast genome.

    PubMed

    Stadermann, Kai Bernd; Weisshaar, Bernd; Holtgräwe, Daniela

    2015-09-16

    Third generation sequencing methods, like SMRT (Single Molecule, Real-Time) sequencing developed by Pacific Biosciences, offer much longer read length in comparison to Next Generation Sequencing (NGS) methods. Hence, they are well suited for de novo sequencing or re-sequencing projects. Sequences generated for these purposes will not only contain reads originating from the nuclear genome, but also a significant amount of reads originating from the organelles of the target organism. These reads are usually discarded but they can also be used for an assembly of organellar replicons. The long read length supports resolution of repetitive regions and repeats within the organelle genome which might be problematic when just using short read data. Additionally, SMRT sequencing is less influenced by GC rich areas and by long stretches of the same base. We describe a workflow for a de novo assembly of the sugar beet (Beta vulgaris ssp. vulgaris) chloroplast genome sequence based only on data originating from a SMRT sequencing dataset targeted on its nuclear genome. We show that the data obtained from such an experiment are sufficient to create a high quality assembly with a higher reliability than assemblies derived from e.g. Illumina reads only. The chloroplast genome is especially challenging for de novo assembly as it contains two large inverted repeat (IR) regions. We also describe some limitations that still apply even though long reads are used for the assembly. SMRT sequencing reads extracted from a dataset created for nuclear genome (re)sequencing can be used to obtain a high quality de novo assembly of the chloroplast of the sequenced organism. Even with a relatively small overall coverage for the nuclear genome it is possible to collect more than enough reads to generate a high quality assembly that outperforms short read based assemblies. However, even with long reads it is not always possible to clarify the order of elements of a chloroplast genome sequence reliably.

  12. Sequence Comparative Analysis Using Networks: Software for Evaluating De Novo Transcript Assembly from Next-Generation Sequencing

    PubMed Central

    Misner, Ian; Bicep, Cédric; Lopez, Philippe; Halary, Sébastien; Bapteste, Eric; Lane, Christopher E.

    2013-01-01

    DNA sequencing technology is becoming more accessible to a variety of researchers as costs continue to decline. As researchers begin to sequence novel transcriptomes, most of these data sets lack a reference genome and will have to rely on de novo assemblers. Making comparisons across assemblies can be difficult: each program has its strengths and weaknesses, and no tool exists to comparatively evaluate these data sets. We developed software in R, called Sequence Comparative Analysis using Networks (SCAN), to perform statistical comparisons between distinct assemblies. SCAN uses a reference data set to identify the most accurate de novo assembly and the “good” transcripts in the user’s data. We tested SCAN on three publicly available transcriptomes, each assembled using three assembly programs. Moreover, we sequenced the transcriptome of the oomycete Achlya hypogyna and compared de novo assemblies from Velvet, ABySS, and the CLC Genomics Workbench assembly algorithms. One thousand one hundred twenty-eight of the CLC transcripts were statistically similar to the reference, compared with 49 of the Velvet transcripts and 937 of the ABySS transcripts. SCAN’s strength is providing statistical support for transcript assemblies in a biological context. However, SCAN is designed to compare distinct node sets in networks, therefore it can also easily be extended to perform statistical comparisons on any network graph regardless of what the nodes represent. PMID:23666209

  13. A simplified method for peptide de novo sequencing using (18)O labeling.

    PubMed

    Voráč, Aleš; Sedo, Ondrej; Havliš, Jan; Zdráhal, Zbyněk

    2014-01-01

    Incorporation of an (18)O atom into a peptide C-terminus by proteolytic cleavage in the presence of H2(18)O is one of the most effective ways of enhancing tandem mass spectrometry (MS/MS)-based de novo sequencing. Incorporation is usually accomplished by procedures including vacuum-assisted drying of tryptic peptides extracted from gels, their subsequent reconstitution in a H2(16)O/H2(18)O mixture and re-treatment with trypsin. In the present work, we propose a simplified procedure for (18)O incorporation into tryptic peptides by adding H2(18)O and trypsin to the original digest solution. In comparison to published methods, the proposed protocol for peptide de novo sequencing brings significant advantages in analysis and workflow with no deterioration in method performance. We show that labeling by this simplified method highlights the y-ion fragment series in the peptide matrix-assisted laser desorption/ionization (MALDI)-MS/MS data, which facilitates MS/MS data interpretation. We also show that eliminating acid extraction of peptides from gels does not result in a decrease in sequence coverage or a qualitative loss of particular peptides detectable by MALDI-MS. The method was examined by MALDI-MS/MS on bovine serum albumin and recombinant histidine kinase CKI1 from Arabidopsis thaliana, and was verified by de novo sequencing of tryptic peptides originating from Apodemus sylvaticus salivary proteins.
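
    One way to see why partial (18)O labeling highlights the y-ion series: a fragment that retains the labeled C-terminus appears as a pair of peaks separated by the (18)O/(16)O mass difference (about 2.004 Da for singly charged ions), whereas N-terminal fragments do not. The sketch below scans a peak list for such pairs. It assumes singly charged ions and a single incorporated (18)O atom; the function name, the tolerance and the example peak list are hypothetical and are not taken from the paper.

```python
# Monoisotopic mass difference between 18O and 16O, in Da.
O18_SHIFT = 17.999160 - 15.994915


def find_doublets(mz_values, tolerance=0.02):
    """Return (light, heavy) m/z pairs separated by the 18O/16O shift.

    Assumes singly charged ions, so the m/z spacing equals the mass shift.
    """
    peaks = sorted(mz_values)
    pairs = []
    for i, light in enumerate(peaks):
        for heavy in peaks[i + 1:]:
            delta = heavy - light
            if delta > O18_SHIFT + tolerance:
                break  # peaks are sorted, so later ones are even farther away
            if abs(delta - O18_SHIFT) <= tolerance:
                pairs.append((light, heavy))
    return pairs


if __name__ == "__main__":
    # Hypothetical peak list: two labeled doublets and one unpaired peak.
    example_peaks = [175.119, 177.123, 246.156, 304.161, 306.165]
    for light, heavy in find_doublets(example_peaks):
        print(f"candidate y-ion doublet: {light:.3f} / {heavy:.3f}")
```

    Peaks that pair up in this way are candidates for the y-ion ladder; unpaired peaks are more likely to belong to other fragment series.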

  14. De novo proteomic sequencing of a monoclonal antibody raised against OX40 ligand.

    PubMed

    Pham, Victoria; Henzel, William J; Arnott, David; Hymowitz, Sarah; Sandoval, Wendy N; Truong, Bao-Tran; Lowman, Henry; Lill, Jennie R

    2006-05-01

    De novo sequencing of a full-length monoclonal antibody raised against OX40 ligand is described. Using a combination of overlapping complementary proteolytic and chemical digestions, with analysis by mass spectrometry and Edman degradation, both the heavy and light chains were fully sequenced. Particular attention was paid to those modifications that could be susceptible to degradation in the complementarity determining region and Fc region. An overview of the protocol is described, and suggestions for improvements to aid in such sequencing projects in the future are discussed.

  15. Feature-by-Feature – Evaluating De Novo Sequence Assembly

    PubMed Central

    Vezzi, Francesco; Narzisi, Giuseppe; Mishra, Bud

    2012-01-01

    The whole-genome sequence assembly (WGSA) problem is among the most studied problems in computational biology. Despite the availability of a plethora of tools (i.e., assemblers), all claiming to have solved the WGSA problem, little has been done to systematically compare their accuracy and power. Traditional methods rely on standard metrics and read simulation: while on the one hand, metrics like N50 and number of contigs focus only on size without proportionately emphasizing the information about the correctness of the assembly, comparisons performed on simulated datasets, on the other hand, can be highly biased by the non-realistic assumptions in the underlying read generator. Recently the Feature Response Curve (FRC) method was proposed to assess the overall assembly quality and correctness: FRC transparently captures the trade-offs between contigs' quality and their sizes. Nevertheless, the relationship among the different features and their relative importance remains unknown. In particular, FRC cannot account for the correlation among the different features. We analyzed the correlation among different features in order to better describe their relationships and their importance in gauging assembly quality and correctness. In particular, using multivariate techniques like principal and independent component analysis we were able to estimate the “excess-dimensionality” of the feature space. Moreover, principal component analysis allowed us to show how poorly the acclaimed N50 metric describes the assembly quality. Applying independent component analysis we identified a subset of features that better describe the assemblers' performance. We demonstrated that by focusing on a reduced set of highly informative features we can use the FRC curve to better describe and compare the performance of different assemblers. Moreover, as a by-product of our analysis, we discovered how often evaluation based on simulated data, obtained with state of the art
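
    For readers unfamiliar with the N50 metric criticized in this abstract, the sketch below computes it from a list of contig lengths using the usual definition: the largest length L such that contigs of length at least L together contain at least half of the assembled bases. The contig lengths are made up for illustration. The second call shows why a size-only metric can mislead: wrongly joining two contigs raises N50 even though the assembly has become less correct.

```python
def n50(contig_lengths):
    """Return the N50 of an assembly given its contig lengths.

    Contigs are visited from longest to shortest; the N50 is the length of the
    contig at which the running total first reaches half of all assembled bases.
    """
    total = sum(contig_lengths)
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    raise ValueError("contig_lengths must be non-empty")


if __name__ == "__main__":
    correct_assembly = [100, 80, 60, 40, 20]  # hypothetical contig lengths
    chimeric_assembly = [180, 60, 40, 20]     # the 100 and 80 contigs joined in error
    print(n50(correct_assembly), n50(chimeric_assembly))
```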

  16. De novo assembly of the carrot mitochondrial genome using next generation sequencing of whole genomic DNA provides first evidence of DNA transfer into an angiosperm plastid genome

    PubMed Central

    2012-01-01

    Background: Sequence analysis of organelle genomes has revealed important aspects of plant cell evolution. The scope of this study was to develop an approach for de novo assembly of the carrot mitochondrial genome using next generation sequence data from total genomic DNA. Results: Sequencing data from a carrot 454 whole genome library were used to develop a de novo assembly of the mitochondrial genome. Development of a new bioinformatic tool allowed visualizing contig connections and elucidation of the de novo assembly. Southern hybridization demonstrated recombination across two large repeats. Genome annotation allowed identification of 44 protein coding genes, three rRNA and 17 tRNA. Identification of the plastid genome sequence allowed organelle genome comparison. Mitochondrial intergenic sequence analysis allowed detection of a fragment of DNA specific to the carrot plastid genome. PCR amplification and sequence analysis across different Apiaceae species revealed consistent conservation of this fragment in the mitochondrial genomes and an insertion in Daucus plastid genomes, giving evidence of a mitochondrial to plastid transfer of DNA. Sequence similarity with a retrotransposon element suggests a possibility that a transposon-like event transferred this sequence into the plastid genome. Conclusions: This study confirmed that whole genome sequencing is a practical approach for de novo assembly of higher plant mitochondrial genomes. In addition, a new aspect of intercompartmental genome interaction was reported providing the first evidence for DNA transfer into an angiosperm plastid genome. The approach used here could be used more broadly to sequence and assemble mitochondrial genomes of diverse species. This information will allow us to better understand intercompartmental interactions and cell evolution. PMID:22548759

  17. Terminal sequence importance of de novo proteins from binary-patterned library: stable artificial proteins with 11- or 12-amino acid alphabet.

    PubMed

    Okura, Hiromichi; Takahashi, Tsuyoshi; Mihara, Hisakazu

    2012-06-01

    Successful approaches to de novo protein design suggest a great potential to create novel structural folds and to understand natural rules of protein folding. For these purposes, smaller and simpler de novo proteins have been developed. Here, we constructed smaller proteins by removing the terminal sequences from stable de novo vTAJ proteins and compared stabilities between mutant and original proteins. vTAJ proteins were screened from an α3β3 binary-patterned library which was designed with polar/nonpolar periodicities of α-helix and β-sheet. vTAJ proteins have additional terminal sequences due to the method of constructing the genetically repeated library sequences. By removing parts of these sequences, we successfully obtained stable, smaller de novo protein mutants with fewer amino acid alphabets than the originals. However, these mutants showed differences in ANS binding properties and in stabilities against denaturant and pH change. The terminal sequences, which were designed just as flexible linkers and not as secondary structure units, substantially affected these physicochemical details. This study showed implications for adjusting protein stabilities by designing N- and C-terminal sequences.

  18. NxRepair: error correction in de novo sequence assembly using Nextera mate pairs.

    PubMed

    Murphy, Rebecca R; O'Connell, Jared; Cox, Anthony J; Schulz-Trieglaff, Ole

    2015-01-01

    Scaffolding errors and incorrect repeat disambiguation during de novo assembly can result in large scale misassemblies in draft genomes. Nextera mate pair sequencing data provide additional information to resolve assembly ambiguities during scaffolding. Here, we introduce NxRepair, an open source toolkit for error correction in de novo assemblies that uses Nextera mate pair libraries to identify and correct large-scale errors. We show that NxRepair can identify and correct large scaffolding errors, without use of a reference sequence, resulting in quantitative improvements in the assembly quality. NxRepair can be downloaded from GitHub or PyPI, the Python Package Index; a tutorial and user documentation are also available.

  19. Annotation and re-sequencing of genes from de novo transcriptome assembly of Abies alba (Pinaceae)

    PubMed Central

    Roschanski, Anna M.; Fady, Bruno; Ziegenhagen, Birgit; Liepelt, Sascha

    2013-01-01

    • Premise of the study: We present a protocol for the annotation of transcriptome sequence data and the identification of candidate genes therein using the example of the nonmodel conifer Abies alba. • Methods and Results: A normalized cDNA library was built from an A. alba seedling. The sequencing on a 454 platform yielded more than 1.5 million reads that were de novo assembled into 25149 contigs. Two complementary approaches were applied to annotate gene fragments that code for (1) well-known proteins and (2) proteins that are potentially adaptively relevant. Primer development and testing yielded 88 amplicons that could successfully be resequenced from genomic DNA. • Conclusions: The annotation workflow offers an efficient way to identify potential adaptively relevant genes from the large quantity of transcriptome sequence data. The primer set presented should be prioritized for single-nucleotide polymorphism detection in adaptively relevant genes in A. alba. PMID:25202477

  20. Proteomics of Soil and Sediment: Protein Identification by De Novo Sequencing of Mass Spectra Complements Traditional Database Searching

    NASA Astrophysics Data System (ADS)

    Miller, S.; Rizzo, A. I.; Waldbauer, J.

    2015-12-01

    Proteomics has the potential to elucidate the metabolic pathways and taxa responsible for in situ biogeochemical transformations. However, low rates of protein identification from high resolution mass spectra have been a barrier to the development of proteomics in complex environmental samples. Much of the difficulty lies in the computational challenge of linking mass spectra to their corresponding proteins. Traditional database search methods for matching peptide sequences to mass spectra are often inadequate due to the complexity of environmental proteomes and the large database search space, as we demonstrate with soil and sediment proteomes generated via a range of extraction methods. One alternative to traditional database searching is de novo sequencing, which identifies peptide sequences without the need for a database. BLAST can then be used to match de novo sequences to similar genetic sequences. Assigning confidence to putative identifications has been one hurdle for the implementation of de novo sequencing. We found that accurate de novo sequences can be screened by quality score and length. Screening criteria are verified by comparing the results of de novo sequencing and traditional database searching for well-characterized proteomes from simple biological systems. The BLAST hits of screened sequences are interrogated for taxonomic and functional information. We applied de novo sequencing to organic topsoil and marine sediment proteomes. Peak-rich proteomes, which can result from various extraction techniques, yield thousands of high-confidence protein identifications, an improvement over previous proteomic studies of soil and sediment. User-friendly software tools for de novo metaproteomics analysis have been developed. This "De Novo Analysis" Pipeline is also a faster method of data analysis than constructing a tailored sequence database for traditional database searching.
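
    The screening step described here, keeping only de novo sequences that pass a quality-score and length cutoff before BLAST, can be sketched as a simple filter. The record type, the field names and both threshold values below are hypothetical placeholders; the abstract does not state the cutoffs used, and score scales differ between de novo sequencing tools.

```python
from dataclasses import dataclass


@dataclass
class DeNovoPeptide:
    sequence: str  # peptide sequence proposed by the de novo tool
    score: float   # tool-specific quality score (scale assumed to be 0-100 here)


def screen(peptides, min_score=80.0, min_length=7):
    """Keep peptides whose score and length both meet the (assumed) cutoffs."""
    return [
        p for p in peptides
        if p.score >= min_score and len(p.sequence) >= min_length
    ]


if __name__ == "__main__":
    candidates = [
        DeNovoPeptide("LVNELTEFAK", 92.0),  # long and high scoring: kept
        DeNovoPeptide("AGK", 95.0),         # too short to search reliably
        DeNovoPeptide("YLYEIARR", 41.0),    # low confidence
    ]
    for peptide in screen(candidates):
        print(peptide.sequence, peptide.score)
```

    In practice the cutoffs would be calibrated as the abstract describes, by comparing de novo results against database-search results on a well-characterized proteome.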

  1. Proteomics of Soil and Sediment: Protein Identification by De Novo Sequencing of Mass Spectra Complements Traditional Database Searching

    NASA Astrophysics Data System (ADS)

    Miller, S.; Rizzo, A. I.; Waldbauer, J.

    2014-12-01

    Proteomics has the potential to elucidate the metabolic pathways and taxa responsible for in situ biogeochemical transformations. However, low rates of protein identification from high resolution mass spectra have been a barrier to the development of proteomics in complex environmental samples. Much of the difficulty lies in the computational challenge of linking mass spectra to their corresponding proteins. Traditional database search methods for matching peptide sequences to mass spectra are often inadequate due to the complexity of environmental proteomes and the large database search space, as we demonstrate with soil and sediment proteomes generated via a range of extraction methods. One alternative to traditional database searching is de novo sequencing, which identifies peptide sequences without the need for a database. BLAST can then be used to match de novo sequences to similar genetic sequences. Assigning confidence to putative identifications has been one hurdle for the implementation of de novo sequencing. We found that accurate de novo sequences can be screened by quality score and length. Screening criteria are verified by comparing the results of de novo sequencing and traditional database searching for well-characterized proteomes from simple biological systems. The BLAST hits of screened sequences are interrogated for taxonomic and functional information. We applied de novo sequencing to organic topsoil and marine sediment proteomes. Peak-rich proteomes, which can result from various extraction techniques, yield thousands of high-confidence protein identifications, an improvement over previous proteomic studies of soil and sediment. User-friendly software tools for de novo metaproteomics analysis have been developed. This "De Novo Analysis" Pipeline is also a faster method of data analysis than constructing a tailored sequence database for traditional database searching.

  2. De novo assembly and characterization of the Trichuris trichiura adult worm transcriptome using Ion Torrent sequencing.

    PubMed

    Santos, Leonardo N; Silva, Eduardo S; Santos, André S; De Sá, Pablo H; Ramos, Rommel T; Silva, Artur; Cooper, Philip J; Barreto, Maurício L; Loureiro, Sebastião; Pinheiro, Carina S; Alcantara-Neves, Neuza M; Pacheco, Luis G C

    2016-07-01

    Infection with helminthic parasites, including the soil-transmitted helminth Trichuris trichiura (human whipworm), has been shown to modulate host immune responses and, consequently, to have an impact on the development and manifestation of chronic human inflammatory diseases. De novo derivation of helminth proteomes from sequencing of transcriptomes will provide valuable data to aid identification of parasite proteins that could be evaluated as potential immunotherapeutic molecules in the near future. Herein, we characterized the transcriptome of the adult stage of the human whipworm T. trichiura, using next-generation sequencing technology and a de novo assembly strategy. Nearly 17.6 million high-quality clean reads were assembled into 6414 contiguous sequences, with an N50 of 1606 bp. In total, 5673 protein-encoding sequences were confidently identified in the T. trichiura adult worm transcriptome; of these, 1013 sequences represent potential newly discovered proteins for the species, most of which have orthologs already annotated in the related species T. suis. A number of transcripts representing probable novel non-coding transcripts for the species T. trichiura were also identified. Among the most abundant transcripts, we found sequences that code for proteins involved in lipid transport, such as vitellogenins, and several chitin-binding proteins. Through a cross-species expression analysis of gene orthologs shared by T. trichiura and the closely related parasites T. suis and T. muris, it was possible to find twenty-six protein-encoding genes that are consistently highly expressed in the adult stages of the three helminth species. Additionally, twenty transcripts could be identified that code for proteins previously detected by mass spectrometry analysis of protein fractions of the whipworm somatic extract that present immunomodulatory activities. Five of these transcripts were amongst the most highly expressed protein-encoding sequences in the T

  3. De novo sequencing of highly modified therapeutic oligonucleotides by hydrophobic tag sequencing coupled with LC-MS.

    PubMed

    Goto, R; Miyakawa, S; Inomata, E; Takami, T; Yamaura, J; Nakamura, Y

    2017-02-01

    Correct sequences are a prerequisite for quality control of therapeutic oligonucleotides. However, there is no definitive method available for determining sequences of highly modified therapeutic RNAs and, consequently, most of these oligonucleotides have been used clinically without direct sequence determination. In this study, we developed a novel sequencing method called 'hydrophobic tag sequencing'. Highly modified oligonucleotides are sequenced by partially digesting oligonucleotides conjugated with a 5'-hydrophobic tag, followed by liquid chromatography-mass spectrometry analysis. 5'-Hydrophobic tag-printed fragments (5'-tag degradates) can be separated in order of their molecular masses from tag-free oligonucleotides by reversed-phase liquid chromatography. As models for the sequencing, the anti-VEGF aptamer (Macugen) and the highly modified 38-mer RNA sequences were analyzed under blind conditions. Most nucleotides were identified from the molecular weight of hydrophobic 5'-tag degradates calculated from monoisotopic mass in simple full mass data. When monoisotopic mass could not be assigned, the nucleotide was estimated using the molecular weight of the most abundant mass. The sequences of Macugen and 38-mer RNA perfectly matched the theoretical sequences. The hydrophobic tag sequencing worked well to obtain simple full mass data, resulting in accurate and clear sequencing. The present study provides for the first time a de novo sequencing technology for highly modified RNAs and contributes to quality control of therapeutic oligonucleotides. Copyright © 2016 John Wiley & Sons, Ltd.
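
    The general idea of reading a sequence from a ladder of tagged fragments can be sketched as below: each fragment is one residue longer than the previous one, so the mass difference between neighbours identifies the added nucleotide. This is an illustrative sketch under stated assumptions, not the published method. The table holds approximate monoisotopic residue masses for unmodified RNA only (the study concerns heavily modified residues, which would need their own entries), and the tag mass, tolerance, and function name are hypothetical.

```python
# Approximate monoisotopic residue masses (Da) of unmodified RNA
# nucleotides (nucleoside monophosphate minus water). Modified residues
# would need additional entries; none are included in this sketch.
RESIDUE_MASS = {"A": 329.0525, "C": 305.0413, "G": 345.0474, "U": 306.0253}


def read_ladder(fragment_masses, tolerance=0.02):
    """Call one nucleotide per mass step between successive fragments."""
    ordered = sorted(fragment_masses)
    calls = []
    for lighter, heavier in zip(ordered, ordered[1:]):
        delta = heavier - lighter
        # Pick the residue whose mass is closest to the observed step.
        base, mass = min(RESIDUE_MASS.items(),
                         key=lambda kv: abs(kv[1] - delta))
        # Emit "N" when even the closest residue is outside the tolerance.
        calls.append(base if abs(mass - delta) <= tolerance else "N")
    return "".join(calls)


# Synthetic ladder: a hypothetical tag mass plus the residues of "GAUC".
tag = 500.0
ladder = [tag]
for base in "GAUC":
    ladder.append(ladder[-1] + RESIDUE_MASS[base])
sequence = read_ladder(ladder)
```

    A step that matches no table entry is reported as `N` rather than forced onto the nearest residue, which keeps ambiguous positions visible for manual review.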

  4. De Novo Centromere Formation and Centromeric Sequence Expansion in Wheat and its Wide Hybrids

    PubMed Central

    Fu, Shulan; Wang, Jing; Zhang, Xiangqi; Hu, Zanmin; Han, Fangpu

    2016-01-01

    Centromeres typically contain tandem repeat sequences, but centromere function does not necessarily depend on these sequences. We identified functional centromeres with significant quantitative changes in the centromeric retrotransposons of wheat (CRW) contents in wheat aneuploids (Triticum aestivum) and the offspring of wheat wide hybrids. The CRW signals were strongly reduced or essentially lost in some wheat ditelosomic lines and in the addition lines from the wide hybrids. The total loss of the CRW sequences but the presence of CENH3 in these lines suggests that the centromeres were formed de novo. In wheat and its wide hybrids, which carry large complex genomes or no sequenced genome, we performed CENH3-ChIP-dot-blot methods alone or in combination with CENH3-ChIP-seq and identified the ectopic genomic sequences present at the new centromeres. In addition, the transcription of the identified DNA sequences was remarkably increased at the new centromere, suggesting that the transcription of the corresponding sequences may be associated with de novo centromere formation. Stable alien chromosomes with two and three regions containing CRW sequences induced by centromere breakage were observed in the wheat-Th. elongatum hybrid derivatives, but only one was a functional centromere. In wheat-rye (Secale cereale) hybrids, the rye centromere-specific sequences spread along the chromosome arms and may have caused centromere expansion. Frequent and significant quantitative alterations in the centromere sequence via chromosomal rearrangement have been systematically described in wheat wide hybridizations, which may affect the retention or loss of the alien chromosomes in the hybrids. Thus, the centromere behavior in wide crosses likely has an important impact on the generation of biodiversity, which ultimately has implications for speciation. PMID:27110907

  5. De Novo Centromere Formation and Centromeric Sequence Expansion in Wheat and its Wide Hybrids.

    PubMed

    Guo, Xiang; Su, Handong; Shi, Qinghua; Fu, Shulan; Wang, Jing; Zhang, Xiangqi; Hu, Zanmin; Han, Fangpu

    2016-04-01

    Centromeres typically contain tandem repeat sequences, but centromere function does not necessarily depend on these sequences. We identified functional centromeres with significant quantitative changes in the centromeric retrotransposons of wheat (CRW) contents in wheat aneuploids (Triticum aestivum) and the offspring of wheat wide hybrids. The CRW signals were strongly reduced or essentially lost in some wheat ditelosomic lines and in the addition lines from the wide hybrids. The total loss of the CRW sequences but the presence of CENH3 in these lines suggests that the centromeres were formed de novo. In wheat and its wide hybrids, which carry large complex genomes or no sequenced genome, we performed CENH3-ChIP-dot-blot methods alone or in combination with CENH3-ChIP-seq and identified the ectopic genomic sequences present at the new centromeres. In addition, the transcription of the identified DNA sequences was remarkably increased at the new centromere, suggesting that the transcription of the corresponding sequences may be associated with de novo centromere formation. Stable alien chromosomes with two and three regions containing CRW sequences induced by centromere breakage were observed in the wheat-Th. elongatum hybrid derivatives, but only one was a functional centromere. In wheat-rye (Secale cereale) hybrids, the rye centromere-specific sequences spread along the chromosome arms and may have caused centromere expansion. Frequent and significant quantitative alterations in the centromere sequence via chromosomal rearrangement have been systematically described in wheat wide hybridizations, which may affect the retention or loss of the alien chromosomes in the hybrids. Thus, the centromere behavior in wide crosses likely has an important impact on the generation of biodiversity, which ultimately has implications for speciation.

  6. De novo transcriptome sequencing of axolotl blastema for identification of differentially expressed genes during limb regeneration

    PubMed Central

    2013-01-01

    Background: Salamanders are unique among vertebrates in their ability to completely regenerate amputated limbs through the mediation of blastema cells located at the stump ends. This regeneration is nerve-dependent because blastema formation and regeneration do not occur after limb denervation. To obtain the genomic information of blastema tissues, de novo transcriptomes from both blastema tissues and denervated stump ends of Ambystoma mexicanum (axolotls) 14 days post-amputation were sequenced and compared using Solexa DNA sequencing. Results: The sequencing done for this study produced 40,688,892 reads that were assembled into 307,345 transcribed sequences. The N50 of transcribed sequence length was 562 bases. A similarity search with known proteins identified 39,200 different genes to be expressed during limb regeneration with a cut-off E-value exceeding 10^-5. We annotated assembled sequences by using gene descriptions, gene ontology, and clusters of orthologous group terms. Targeted searches using these annotations showed that the majority of the genes were in the categories of essential metabolic pathways, transcription factors and conserved signaling pathways, and novel candidate genes for regenerative processes. We discovered and confirmed numerous sequences of the candidate genes by using quantitative polymerase chain reaction and in situ hybridization. Conclusion: The results of this study demonstrate that de novo transcriptome sequencing allows gene expression analysis in a species lacking genome information and provides the most comprehensive mRNA sequence resources for axolotls. The characterization of the axolotl transcriptome can help elucidate the molecular mechanisms underlying blastema formation during limb regeneration. PMID:23815514

  7. A Proteomic Workflow Using High-Throughput De Novo Sequencing Towards Complementation of Genome Information for Improved Comparative Crop Science.

    PubMed

    Turetschek, Reinhard; Lyon, David; Desalegn, Getinet; Kaul, Hans-Peter; Wienkoop, Stefanie

    2016-01-01

    The proteomic study of non-model organisms, such as many crop plants, is challenging due to the lack of comprehensive genome information. Changing environmental conditions require the study and selection of adapted cultivars. Mutations, inherent to cultivars, hamper protein identification and thus considerably complicate the qualitative and quantitative comparison in large-scale systems biology approaches. With this workflow, cultivar-specific mutations are detected from high-throughput comparative MS analyses by extracting sequence polymorphisms with de novo sequencing. Stringent criteria are suggested to filter for confident mutations. Subsequently, these polymorphisms complement the initially used database, which is ready to use with any preferred database search algorithm. In our example, we thereby identified 26 specific mutations in two cultivars of Pisum sativum and achieved an increased number (17%) of peptide spectrum matches.
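
    One way to picture the polymorphism-extraction step is to compare a de novo peptide with the database peptide it most resembles and report a single-residue difference. The sketch below is only an illustration of that idea, not the published workflow: the function name and the rule of accepting exactly one substitution are assumptions. Leucine and isoleucine are treated as identical because they have the same mass and cannot be told apart by mass alone.

```python
def single_substitution(db_peptide, de_novo_peptide):
    """Return (position, database residue, observed residue) when the two
    peptides differ at exactly one site, otherwise None.

    Positions are 1-based. I and L are treated as the same residue.
    """
    if len(db_peptide) != len(de_novo_peptide):
        return None
    normalise = str.maketrans("I", "L")  # isobaric residues compare equal
    a = db_peptide.translate(normalise)
    b = de_novo_peptide.translate(normalise)
    diffs = [i for i, (x, y) in enumerate(zip(a, b)) if x != y]
    if len(diffs) != 1:
        return None
    i = diffs[0]
    return (i + 1, db_peptide[i], de_novo_peptide[i])


hit = single_substitution("LVNELTEFAK", "LVNDLTEFAK")
```

    Candidates found this way would still need the stringent filtering the abstract mentions before being appended to the search database.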

  8. De novo sequencing and characterization of floral transcriptome in two species of buckwheat (Fagopyrum)

    PubMed Central

    2011-01-01

    Background: Transcriptome sequencing data has become an integral component of modern genetics, genomics and evolutionary biology. However, despite advances in the technologies of DNA sequencing, such data are lacking for many groups of living organisms, in particular, many plant taxa. We present here the results of transcriptome sequencing for two closely related plant species. These species, Fagopyrum esculentum and F. tataricum, belong to the order Caryophyllales - a large group of flowering plants with uncertain evolutionary relationships. F. esculentum (common buckwheat) is also an important food crop. Despite these practical and evolutionary considerations Fagopyrum species have not been the subject of large-scale sequencing projects. Results: Normalized cDNA corresponding to genes expressed in flowers and inflorescences of F. esculentum and F. tataricum was sequenced using the 454 pyrosequencing technology. This resulted in 267 thousand (F. esculentum) and 229 thousand (F. tataricum) reads with an average length of 341-349 nucleotides. De novo assembly of the reads produced about 25 thousand contigs for each species, with 7.5-8.2× coverage. Comparative analysis of two transcriptomes demonstrated their overall similarity but also revealed genes that are presumably differentially expressed. Among them are retrotransposon genes and genes involved in sugar biosynthesis and metabolism. Thirteen single-copy genes were used for phylogenetic analysis; the resulting trees are largely consistent with those inferred from multigenic plastid datasets. The sister relationships of the Caryophyllales and asterids now gained high support from nuclear gene sequences. Conclusions: 454 transcriptome sequencing and de novo assembly were performed for two congeneric flowering plant species, F. esculentum and F. tataricum. As a result, a large set of cDNA sequences that represent orthologs of known plant genes as well as potential new genes was generated. PMID:21232141

  9. De novo sequencing and characterization of floral transcriptome in two species of buckwheat (Fagopyrum).

    PubMed

    Logacheva, Maria D; Kasianov, Artem S; Vinogradov, Dmitriy V; Samigullin, Tagir H; Gelfand, Mikhail S; Makeev, Vsevolod J; Penin, Aleksey A

    2011-01-13

    Transcriptome sequencing data has become an integral component of modern genetics, genomics and evolutionary biology. However, despite advances in the technologies of DNA sequencing, such data are lacking for many groups of living organisms, in particular, many plant taxa. We present here the results of transcriptome sequencing for two closely related plant species. These species, Fagopyrum esculentum and F. tataricum, belong to the order Caryophyllales--a large group of flowering plants with uncertain evolutionary relationships. F. esculentum (common buckwheat) is also an important food crop. Despite these practical and evolutionary considerations Fagopyrum species have not been the subject of large-scale sequencing projects. Normalized cDNA corresponding to genes expressed in flowers and inflorescences of F. esculentum and F. tataricum was sequenced using the 454 pyrosequencing technology. This resulted in 267 thousand (F. esculentum) and 229 thousand (F. tataricum) reads with an average length of 341-349 nucleotides. De novo assembly of the reads produced about 25 thousand contigs for each species, with 7.5-8.2× coverage. Comparative analysis of two transcriptomes demonstrated their overall similarity but also revealed genes that are presumably differentially expressed. Among them are retrotransposon genes and genes involved in sugar biosynthesis and metabolism. Thirteen single-copy genes were used for phylogenetic analysis; the resulting trees are largely consistent with those inferred from multigenic plastid datasets. The sister relationships of the Caryophyllales and asterids now gained high support from nuclear gene sequences. 454 transcriptome sequencing and de novo assembly were performed for two congeneric flowering plant species, F. esculentum and F. tataricum. As a result, a large set of cDNA sequences that represent orthologs of known plant genes as well as potential new genes was generated.

  10. A framework for the detection of de novo mutations in family-based sequencing data

    PubMed Central

    Francioli, Laurent C; Cretu-Stancu, Mircea; Garimella, Kiran V; Fromer, Menachem; Kloosterman, Wigard P; Wijmenga, Cisca; Swertz, Morris A; van Duijn, Cornelia M; Boomsma, Dorret I; Slagboom, P Eline; van Ommen, Gertjan B; de Bakker, Paul IW; Swertz, Morris A; Francioli, Laurent C; van Dijk, Freerk; Menelaou, Androniki; Neerincx, Pieter BT; Pulit, Sara L; Deelen, Patrick; Elbers, Clara C; Francesco Palamara, Pier; Pe'er, Itsik; Abdellaoui, Abdel; Kloosterman, Wigard P; van Oven, Mannis; Vermaat, Martijn; Li, Mingkun; Laros, Jeroen FJ; Stoneking, Mark; de Knijff, Peter; Kayser, Manfred; Veldink, Jan H; van den Berg, Leonard H; Byelas, Heorhiy; den Dunnen, Johan T; Dijkstra, Martijn; Amin, Najaf; van der Velde, K Joeri; Hottenga, Jouke Jan; van Setten, Jessica; van Leeuwen, Elisabeth M; Kanterakis, Alexandros; Kattenberg, Mathijs; Karssen, Lennart C; van Schaik, Barbera DC; Bot, Jan; Nijman, Isaäc J; Renkens, Ivo; van Enckevort, David; Mei, Hailiang; Koval, Vyacheslav; Estrada, Karol; Medina-Gomez, Carolina; Ye, Kai; Lameijer, Eric-Wubbo; Moed, Matthijs H; Hehir-Kwa, Jayne Y; Handsaker, Robert E; McCarroll, Steven A; Sunyaev, Shamil R; Polak, Paz; Vuzman, Dana; Sohail, Mashaal; Hormozdiari, Fereydoun; Marschall, Tobias; Schönhuth, Alexander; Guryev, Victor; de Bakker, Paul IW; Slagboom, P Eline; Beekman, Marian B; de Craen, Anton JM; Suchiman, H Eka D; Hofman, Albert; van Duijn, Cornelia M; Oostra, Ben; Isaacs, Aaron; Amin, Najaf; Rivadeneira, Fernando; Uitterlinden, André G; Boomsma, Dorret I; Willemsen, Gonneke; Platteel, Mathieu; Pitts, Steven J; Potluri, Shobha; Sundar, Purnima; Cox, David R; Li, Qibin; Li, Yingrui; Du, Yuanping; Chen, Ruoyan; Cao, Hongzhi; Li, Ning; Cao, Sujie; Wang, Jun; Bovenberg, Jasper A; Brandsma, Margreet; Samocha, Kaitlin E; Neale, Benjamin M; Daly, Mark J; Banks, Eric; DePristo, Mark A; de Bakker, Paul IW

    2017-01-01

    Germline mutation detection from human DNA sequence data is challenging due to the rarity of such events relative to the intrinsic error rates of sequencing technologies and the uneven coverage across the genome. We developed PhaseByTransmission (PBT) to identify de novo single nucleotide variants and short insertions and deletions (indels) from sequence data collected in parent-offspring trios. We compute the joint probability of the data given the genotype likelihoods in the individual family members, the known familial relationships and a prior probability for the mutation rate. Candidate de novo mutations (DNMs) are reported along with their posterior probability, providing a systematic way to prioritize them for validation. Our tool is integrated in the Genome Analysis Toolkit and can be used together with the ReadBackedPhasing module to infer the parental origin of DNMs based on phase-informative reads. Using simulated data, we show that PBT outperforms existing tools, especially in low coverage data and on the X chromosome. We further show that PBT displays high validation rates on empirical parent-offspring sequencing data for whole-exome data from 104 trios and X-chromosome data from 249 parent-offspring families. Finally, we demonstrate an association between father's age at conception and the number of DNMs in female offspring's X chromosome, consistent with previous literature reports. PMID:27876817
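
    The Bayesian calculation outlined above (genotype likelihoods for each family member, the familial relationship, and a mutation-rate prior combined into a posterior) can be illustrated with a toy model for a single biallelic site. This is a simplified sketch, not the PhaseByTransmission implementation: the flat prior on parental genotypes, the way the mutation probability `mu` is spread over Mendelian-violating child genotypes, and all function names are assumptions made for illustration.

```python
from itertools import product

GENOTYPES = (0, 1, 2)  # number of alternate alleles carried


def mendel(child, mother, father):
    """Mendelian probability of the child's genotype given both parents."""
    pm, pf = mother / 2.0, father / 2.0  # chance of passing the alt allele
    return {0: (1 - pm) * (1 - pf),
            1: pm * (1 - pf) + (1 - pm) * pf,
            2: pm * pf}[child]


def transmission(child, mother, father, mu):
    """Mendelian transmission with total mass mu moved to violating genotypes."""
    violating = [g for g in GENOTYPES if mendel(g, mother, father) == 0.0]
    if not violating:
        return mendel(child, mother, father)
    if child in violating:
        return mu / len(violating)
    return (1 - mu) * mendel(child, mother, father)


def de_novo_posterior(lik_mother, lik_father, lik_child, mu=1e-8):
    """Posterior probability that the trio genotypes violate Mendelian
    inheritance, given P(data | genotype) for each family member."""
    flat_prior = 1.0 / len(GENOTYPES)  # assumed flat prior on each parent
    total = 0.0
    de_novo = 0.0
    for m, f, c in product(GENOTYPES, repeat=3):
        joint = (flat_prior * flat_prior * transmission(c, m, f, mu)
                 * lik_mother[m] * lik_father[f] * lik_child[c])
        total += joint
        if mendel(c, m, f) == 0.0:
            de_novo += joint
    return de_novo / total


# Parents confidently homozygous reference, child confidently heterozygous.
strong = de_novo_posterior((1.0, 1e-15, 1e-30), (1.0, 1e-15, 1e-30),
                           (1e-15, 1.0, 1e-15))
# Same call pattern, but with weak evidence in every family member.
weak = de_novo_posterior((1.0, 1e-3, 1e-6), (1.0, 1e-3, 1e-6),
                         (1e-3, 1.0, 1e-3))
```

    With a small `mu`, the posterior is high only when the parental likelihoods make an undetected heterozygous parent far less plausible than a new mutation, which is why low-coverage data calls for this kind of joint treatment.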

  11. Genome Report: Identification and Validation of Antigenic Proteins from Pajaroellobacter abortibovis Using De Novo Genome Sequence Assembly and Reverse Vaccinology

    PubMed Central

    Welly, Bryan T.; Miller, Michael R.; Stott, Jeffrey L.; Blanchard, Myra T.; Islas-Trejo, Alma D.; O’Rourke, Sean M.; Young, Amy E.; Medrano, Juan F.; Van Eenennaam, Alison L.

    2016-01-01

    Epizootic bovine abortion (EBA), or “foothill abortion,” is the leading cause of beef cattle abortion in California and has also been reported in Nevada and Oregon. In the 1970s, the soft-shelled tick Ornithodoros coriaceus, or “pajaroello tick,” was confirmed as the disease-transmitting vector. In 2005, a novel Deltaproteobacterium was discovered as the etiologic agent of EBA (aoEBA), recently named Pajaroellobacter abortibovis. This organism cannot be grown in culture using traditional microbiological techniques; it can only be grown in experimentally-infected severe combined immunodeficient (SCID) mice. The objectives of this study were to perform a de novo genome assembly for P. abortibovis and identify and validate potential antigenic proteins as candidates for future recombinant vaccine development. DNA and RNA were extracted from spleen tissue collected from experimentally-infected SCID mice following exposure to P. abortibovis. This combination of mouse and bacterial DNA was sequenced and aligned to the mouse genome. Mouse sequences were subtracted from the sequence pool and the remaining sequences were de novo assembled at 50x coverage into a 1.82 Mbp complete closed circular Deltaproteobacterial genome containing 2250 putative protein-coding sequences. Phylogenetic analysis of P. abortibovis predicts that this bacterium is most closely related to the organisms of the order Myxococcales, referred to as Myxobacteria. In silico prediction of vaccine candidates was performed using a reverse vaccinology approach resulting in the identification and ranking of the top 10 candidate proteins that are likely to be antigenic. Immunologic testing of these candidate proteins confirmed antigenicity of seven of the nine expressed protein candidates using serum from P. abortibovis immunized mice. PMID:28040777

  12. The genome of flax (Linum usitatissimum) assembled de novo from short shotgun sequence reads.

    PubMed

    Wang, Zhiwen; Hobson, Neil; Galindo, Leonardo; Zhu, Shilin; Shi, Daihu; McDill, Joshua; Yang, Linfeng; Hawkins, Simon; Neutelings, Godfrey; Datla, Raju; Lambert, Georgina; Galbraith, David W; Grassa, Christopher J; Geraldes, Armando; Cronk, Quentin C; Cullis, Christopher; Dash, Prasanta K; Kumar, Polumetla A; Cloutier, Sylvie; Sharpe, Andrew G; Wong, Gane K-S; Wang, Jun; Deyholos, Michael K

    2012-11-01

    Flax (Linum usitatissimum) is an ancient crop that is widely cultivated as a source of fiber, oil and medicinally relevant compounds. To accelerate crop improvement, we performed whole-genome shotgun sequencing of the nuclear genome of flax. Seven paired-end libraries ranging in size from 300 bp to 10 kb were sequenced using an Illumina genome analyzer. A de novo assembly, comprised exclusively of deep-coverage (approximately 94× raw, approximately 69× filtered) short-sequence reads (44-100 bp), produced a set of scaffolds with N50 = 694 kb, including contigs with N50 = 20.1 kb. The contig assembly contained 302 Mb of non-redundant sequence representing an estimated 81% genome coverage. Up to 96% of published flax ESTs aligned to the whole-genome shotgun scaffolds. However, comparisons with independently sequenced BACs and fosmids showed some mis-assembly of regions at the genome scale. A total of 43384 protein-coding genes were predicted in the whole-genome shotgun assembly, and up to 93% of published flax ESTs, and 86% of A. thaliana genes aligned to these predicted genes, indicating excellent coverage and accuracy at the gene level. Analysis of the synonymous substitution rates (Ks) observed within duplicate gene pairs was consistent with a recent (5-9 MYA) whole-genome duplication in flax. Within the predicted proteome, we observed enrichment of many conserved domains (Pfam-A) that may contribute to the unique properties of this crop, including agglutinin proteins. Together these results show that de novo assembly, based solely on whole-genome shotgun short-sequence reads, is an efficient means of obtaining nearly complete genome sequence information for some plant species.
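
    The N50 statistic quoted for the scaffolds and contigs is the length L such that sequences of length L or longer contain at least half of all assembled bases. A short sketch of that standard definition follows; the function name and the toy lengths are illustrative and unrelated to the flax data.

```python
def n50(lengths):
    """Return the N50 of a collection of contig or scaffold lengths."""
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        # The first length at which the running total reaches half of
        # all bases is the N50.
        if running >= half:
            return length
    raise ValueError("n50 needs at least one sequence")


toy_n50 = n50([2, 2, 2, 3, 3, 4, 8, 8])
```

    Because N50 depends only on the length distribution, it says nothing about correctness; a mis-joined assembly can have a high N50, which is why the abstract also reports comparisons against independently sequenced BACs and fosmids.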

  13. Whole Exome Sequencing Identifies de Novo Mutations in GATA6 Associated with Congenital Diaphragmatic Hernia

    PubMed Central

    Yu, Lan; Bennett, James T.; Wynn, Julia; Carvill, Gemma L.; Cheung, Yee Him; Shen, Yufeng; Mychaliska, George B.; Azarow, Kenneth S.; Crombleholme, Timothy M.; Chung, Dai H.; Potoka, Douglas; Warner, Brad W.; Bucher, Brian; Lim, Foong-Yen; Pietsch, John; Stolar, Charles; Aspelund, Gudrun; Arkovitz, Marc S.; Mefford, Heather; Chung, Wendy K.

    2014-01-01

    Background: Congenital diaphragmatic hernia (CDH) is a common birth defect affecting 1 in 3,000 births. It is characterized by herniation of abdominal viscera through an incompletely formed diaphragm. Although chromosomal anomalies and mutations in several genes have been implicated, the cause for most patients is unknown. Methods: We used whole exome sequencing in two families with CDH and congenital heart disease, and identified mutations in GATA6 in both. Results: In the first family, we identified a de novo missense mutation (c.1366C>T, p.R456C) in a sporadic CDH patient with tetralogy of Fallot. In the second, a nonsense mutation (c.712G>T, p.G238*) was identified in two siblings with CDH and a large ventricular septal defect. The G238* mutation was inherited from their mother, who was clinically affected with congenital absence of the pericardium, patent ductus arteriosus, and intestinal malrotation. Deep sequencing of blood- and saliva-derived DNA from the mother suggested somatic mosaicism as an explanation for her milder phenotype, with only approximately 15% mutant alleles. To determine the frequency of GATA6 mutations in CDH, we sequenced the gene in 378 patients with CDH. We identified one additional de novo mutation (c.1071delG, p.V358Cfs34*). Conclusions: Mutations in GATA6 have been previously associated with pancreatic agenesis and congenital heart disease. We conclude that, in addition to the heart and the pancreas, GATA6 is involved in development of two additional organs, the diaphragm and the pericardium. In addition, we have shown that de novo mutations can contribute to the development of CDH, a common birth defect. PMID:24385578

  14. Genome Calligrapher: A Web Tool for Refactoring Bacterial Genome Sequences for de Novo DNA Synthesis.

    PubMed

    Christen, Matthias; Deutsch, Samuel; Christen, Beat

    2015-08-21

    Recent advances in synthetic biology have resulted in an increasing demand for the de novo synthesis of large-scale DNA constructs. Any process improvement that enables fast and cost-effective streamlining of digitized genetic information into fabricable DNA sequences holds great promise to study, mine, and engineer genomes. Here, we present Genome Calligrapher, a computer-aided design web tool intended for whole genome refactoring of bacterial chromosomes for de novo DNA synthesis. By applying a neutral recoding algorithm, Genome Calligrapher optimizes GC content and removes obstructive DNA features known to interfere with the synthesis of double-stranded DNA and the higher order assembly into large DNA constructs. Subsequent bioinformatics analysis revealed that synthesis constraints are prevalent among bacterial genomes. However, a low level of codon replacement is sufficient for refactoring bacterial genomes into easy-to-synthesize DNA sequences. To test the algorithm, 168 kb of synthetic DNA comprising approximately 20 percent of the synthetic essential genome of the cell-cycle bacterium Caulobacter crescentus was streamlined and then ordered from a commercial supplier of low-cost de novo DNA synthesis. The successful assembly into eight 20 kb segments indicates that the Genome Calligrapher algorithm can be efficiently used to refactor difficult-to-synthesize DNA. Genome Calligrapher is broadly applicable to recode biosynthetic pathways, DNA sequences, and whole bacterial genomes, thus offering new opportunities to use synthetic biology tools to explore the functionality of microbial diversity. The Genome Calligrapher web tool can be accessed at https://christenlab.ethz.ch/GenomeCalligrapher.
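
    As an illustration of what detecting a GC-related synthesis constraint might look like, the sketch below scans a sequence in sliding windows and flags windows whose GC fraction falls outside a chosen range. It is not the Genome Calligrapher algorithm and covers only detection, not recoding; the window size, step, and the 0.30-0.70 range are hypothetical values picked for the example.

```python
def gc_fraction(seq):
    """Fraction of G and C bases in a non-empty DNA string."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)


def flag_windows(seq, window=20, step=10, low=0.30, high=0.70):
    """Return (start, gc) for each window whose GC fraction is outside
    the closed interval [low, high]. Start positions are 0-based."""
    flagged = []
    for start in range(0, len(seq) - window + 1, step):
        gc = gc_fraction(seq[start:start + window])
        if gc < low or gc > high:
            flagged.append((start, gc))
    return flagged


# Toy sequence: balanced region, GC-only region, then AT-only region.
demo = "ATGC" * 5 + "G" * 10 + "C" * 10 + "AT" * 10
hits = flag_windows(demo)
```

    A recoding tool would then try synonymous codon replacements inside the flagged windows; the abstract reports that a low level of such replacement is sufficient.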

  15. GapFiller: a de novo assembly approach to fill the gap within paired reads

    PubMed Central

    2012-01-01

    /deletions detection pipelines, pre-processing routines on datasets for de novo assembly pipelines, or in any hierarchical approach designed to assemble, analyse or validate pools of sequences. PMID:23095524

  16. Whole-genome sequencing in autism identifies hot spots for de novo germline mutation.

    PubMed

    Michaelson, Jacob J; Shi, Yujian; Gujral, Madhusudan; Zheng, Hancheng; Malhotra, Dheeraj; Jin, Xin; Jian, Minghan; Liu, Guangming; Greer, Douglas; Bhandari, Abhishek; Wu, Wenting; Corominas, Roser; Peoples, Aine; Koren, Amnon; Gore, Athurva; Kang, Shuli; Lin, Guan Ning; Estabillo, Jasper; Gadomski, Therese; Singh, Balvindar; Zhang, Kun; Akshoomoff, Natacha; Corsello, Christina; McCarroll, Steven; Iakoucheva, Lilia M; Li, Yingrui; Wang, Jun; Sebat, Jonathan

    2012-12-21

    De novo mutation plays an important role in autism spectrum disorders (ASDs). Notably, pathogenic copy number variants (CNVs) are characterized by high mutation rates. We hypothesize that hypermutability is a property of ASD genes and may also include nucleotide-substitution hot spots. We investigated global patterns of germline mutation by whole-genome sequencing of monozygotic twins concordant for ASD and their parents. Mutation rates varied widely throughout the genome (by 100-fold) and could be explained by intrinsic characteristics of DNA sequence and chromatin structure. Dense clusters of mutations within individual genomes were attributable to compound mutation or gene conversion. Hypermutability was a characteristic of genes involved in ASD and other diseases. In addition, genes impacted by mutations in this study were associated with ASD in independent exome-sequencing data sets. Our findings suggest that regional hypermutation is a significant factor shaping patterns of genetic variation and disease risk in humans.

  17. The sequence and de novo assembly of the giant panda genome.

    PubMed

    Li, Ruiqiang; Fan, Wei; Tian, Geng; Zhu, Hongmei; He, Lin; Cai, Jing; Huang, Quanfei; Cai, Qingle; Li, Bo; Bai, Yinqi; Zhang, Zhihe; Zhang, Yaping; Wang, Wen; Li, Jun; Wei, Fuwen; Li, Heng; Jian, Min; Li, Jianwen; Zhang, Zhaolei; Nielsen, Rasmus; Li, Dawei; Gu, Wanjun; Yang, Zhentao; Xuan, Zhaoling; Ryder, Oliver A; Leung, Frederick Chi-Ching; Zhou, Yan; Cao, Jianjun; Sun, Xiao; Fu, Yonggui; Fang, Xiaodong; Guo, Xiaosen; Wang, Bo; Hou, Rong; Shen, Fujun; Mu, Bo; Ni, Peixiang; Lin, Runmao; Qian, Wubin; Wang, Guodong; Yu, Chang; Nie, Wenhui; Wang, Jinhuan; Wu, Zhigang; Liang, Huiqing; Min, Jiumeng; Wu, Qi; Cheng, Shifeng; Ruan, Jue; Wang, Mingwei; Shi, Zhongbin; Wen, Ming; Liu, Binghang; Ren, Xiaoli; Zheng, Huisong; Dong, Dong; Cook, Kathleen; Shan, Gao; Zhang, Hao; Kosiol, Carolin; Xie, Xueying; Lu, Zuhong; Zheng, Hancheng; Li, Yingrui; Steiner, Cynthia C; Lam, Tommy Tsan-Yuk; Lin, Siyuan; Zhang, Qinghui; Li, Guoqing; Tian, Jing; Gong, Timing; Liu, Hongde; Zhang, Dejin; Fang, Lin; Ye, Chen; Zhang, Juanbin; Hu, Wenbo; Xu, Anlong; Ren, Yuanyuan; Zhang, Guojie; Bruford, Michael W; Li, Qibin; Ma, Lijia; Guo, Yiran; An, Na; Hu, Yujie; Zheng, Yang; Shi, Yongyong; Li, Zhiqiang; Liu, Qing; Chen, Yanling; Zhao, Jing; Qu, Ning; Zhao, Shancen; Tian, Feng; Wang, Xiaoling; Wang, Haiyin; Xu, Lizhi; Liu, Xiao; Vinar, Tomas; Wang, Yajun; Lam, Tak-Wah; Yiu, Siu-Ming; Liu, Shiping; Zhang, Hemin; Li, Desheng; Huang, Yan; Wang, Xia; Yang, Guohua; Jiang, Zhi; Wang, Junyi; Qin, Nan; Li, Li; Li, Jingxiang; Bolund, Lars; Kristiansen, Karsten; Wong, Gane Ka-Shu; Olson, Maynard; Zhang, Xiuqing; Li, Songgang; Yang, Huanming; Wang, Jian; Wang, Jun

    2010-01-21

    Using next-generation sequencing technology alone, we have successfully generated and assembled a draft sequence of the giant panda genome. The assembled contigs (2.25 gigabases (Gb)) cover approximately 94% of the whole genome, and the remaining gaps (0.05 Gb) seem to contain carnivore-specific repeats and tandem repeats. Comparisons with the dog and human showed that the panda genome has a lower divergence rate. The assessment of panda genes potentially underlying some of its unique traits indicated that its bamboo diet might be more dependent on its gut microbiome than its own genetic composition. We also identified more than 2.7 million heterozygous single nucleotide polymorphisms in the diploid genome. Our data and analyses provide a foundation for promoting mammalian genetic research, and demonstrate the feasibility for using next-generation sequencing technologies for accurate, cost-effective and rapid de novo assembly of large eukaryotic genomes.

  18. The sequence and de novo assembly of the giant panda genome

    PubMed Central

    Li, Ruiqiang; Fan, Wei; Tian, Geng; Zhu, Hongmei; He, Lin; Cai, Jing; Huang, Quanfei; Cai, Qingle; Li, Bo; Bai, Yinqi; Zhang, Zhihe; Zhang, Yaping; Wang, Wen; Li, Jun; Wei, Fuwen; Li, Heng; Jian, Min; Li, Jianwen; Zhang, Zhaolei; Nielsen, Rasmus; Li, Dawei; Gu, Wanjun; Yang, Zhentao; Xuan, Zhaoling; Ryder, Oliver A.; Leung, Frederick Chi-Ching; Zhou, Yan; Cao, Jianjun; Sun, Xiao; Fu, Yonggui; Fang, Xiaodong; Guo, Xiaosen; Wang, Bo; Hou, Rong; Shen, Fujun; Mu, Bo; Ni, Peixiang; Lin, Runmao; Qian, Wubin; Wang, Guodong; Yu, Chang; Nie, Wenhui; Wang, Jinhuan; Wu, Zhigang; Liang, Huiqing; Min, Jiumeng; Wu, Qi; Cheng, Shifeng; Ruan, Jue; Wang, Mingwei; Shi, Zhongbin; Wen, Ming; Liu, Binghang; Ren, Xiaoli; Zheng, Huisong; Dong, Dong; Cook, Kathleen; Shan, Gao; Zhang, Hao; Kosiol, Carolin; Xie, Xueying; Lu, Zuhong; Zheng, Hancheng; Li, Yingrui; Steiner, Cynthia C.; Lam, Tommy Tsan-Yuk; Lin, Siyuan; Zhang, Qinghui; Li, Guoqing; Tian, Jing; Gong, Timing; Liu, Hongde; Zhang, Dejin; Fang, Lin; Ye, Chen; Zhang, Juanbin; Hu, Wenbo; Xu, Anlong; Ren, Yuanyuan; Zhang, Guojie; Bruford, Michael W.; Li, Qibin; Ma, Lijia; Guo, Yiran; An, Na; Hu, Yujie; Zheng, Yang; Shi, Yongyong; Li, Zhiqiang; Liu, Qing; Chen, Yanling; Zhao, Jing; Qu, Ning; Zhao, Shancen; Tian, Feng; Wang, Xiaoling; Wang, Haiyin; Xu, Lizhi; Liu, Xiao; Vinar, Tomas; Wang, Yajun; Lam, Tak-Wah; Yiu, Siu-Ming; Liu, Shiping; Zhang, Hemin; Li, Desheng; Huang, Yan; Wang, Xia; Yang, Guohua; Jiang, Zhi; Wang, Junyi; Qin, Nan; Li, Li; Li, Jingxiang; Bolund, Lars; Kristiansen, Karsten; Wong, Gane Ka-Shu; Olson, Maynard; Zhang, Xiuqing; Li, Songgang; Yang, Huanming; Wang, Jian; Wang, Jun

    2013-01-01

    Using next-generation sequencing technology alone, we have successfully generated and assembled a draft sequence of the giant panda genome. The assembled contigs (2.25 gigabases (Gb)) cover approximately 94% of the whole genome, and the remaining gaps (0.05 Gb) seem to contain carnivore-specific repeats and tandem repeats. Comparisons with the dog and human showed that the panda genome has a lower divergence rate. The assessment of panda genes potentially underlying some of its unique traits indicated that its bamboo diet might be more dependent on its gut microbiome than its own genetic composition. We also identified more than 2.7 million heterozygous single nucleotide polymorphisms in the diploid genome. Our data and analyses provide a foundation for promoting mammalian genetic research, and demonstrate the feasibility for using next-generation sequencing technologies for accurate, cost-effective and rapid de novo assembly of large eukaryotic genomes. PMID:20010809

  19. Sequencing, de novo assembly and comparative analysis of Raphanus sativus transcriptome.

    PubMed

    Wu, Gang; Zhang, Libin; Yin, Yongtai; Wu, Jiangsheng; Yu, Longjiang; Zhou, Yanhong; Li, Maoteng

    2015-01-01

    Raphanus sativus is an important Brassicaceae plant and also an edible vegetable with great economic value. However, there is currently not enough transcriptome information for R. sativus tissues, which impedes further functional genomics research on R. sativus. In this study, RNA-seq technology was employed to characterize the transcriptome of leaf tissues. Approximately 70 million clean paired-end reads were obtained and used for de novo assembly with the Trinity program, which generated 68,086 unigenes with an average length of 576 bp. All the unigenes were annotated against the GO and KEGG databases. Meanwhile, we merged the leaf sequencing data with existing root sequencing data and obtained a better de novo assembly of R. sativus using the Oases program. Accordingly, potential simple sequence repeats (SSRs), transcription factors (TFs) and enzyme codes were identified in R. sativus. Additionally, we detected a total of 3563 significantly differentially expressed genes (DEGs, P = 0.05) and tissue-specific biological processes between leaf and root tissues. Furthermore, a TF-based regulation network was constructed using Cytoscape software. Taken together, these results not only provide a comprehensive genomic resource for R. sativus but also shed light on future functional genomic and proteomic research on R. sativus.

  20. CYCLONE—A Utility for De Novo Sequencing of Microbial Cyclic Peptides

    NASA Astrophysics Data System (ADS)

    Kavan, Daniel; Kuzma, Marek; Lemr, Karel; Schug, Kevin A.; Havlicek, Vladimir

    2013-08-01

    We have developed a de novo sequencing software tool (CYCLONE) and applied it for determination of cyclic peptides. The program uses a non-redundant database of 312 nonribosomal building blocks identified to date in bacteria and fungi (more than 230 additional residues in the database list were isobaric). The software was used to fully characterize the tandem mass spectrum of several cyclic peptides and provide sequence tags. The general strategy of the script was based on fragment ion pre-characterization to accomplish unambiguous b-ion series assignments. Showcase examples were a cyclic tetradepsipeptide beauverolide, a cyclic hexadepsipeptide roseotoxin A, a lasso-like hexapeptide pseudacyclin A, and a cyclic undecapeptide cyclosporin A. The extent of ion scrambling in smaller peptides was as low as 5 % of total ion current; this demonstrated the feasibility of CYCLONE de novo sequencing. The robustness of the script was also tested against database sets of various sizes and isotope-containing data. It can be downloaded from the http://ms.biomed.cas.cz/MSTools/.

  1. Sequencing, de novo assembly and comparative analysis of Raphanus sativus transcriptome

    PubMed Central

    Wu, Gang; Zhang, Libin; Yin, Yongtai; Wu, Jiangsheng; Yu, Longjiang; Zhou, Yanhong; Li, Maoteng

    2015-01-01

    Raphanus sativus is an important Brassicaceae plant and also an edible vegetable with great economic value. However, there is currently not enough transcriptome information for R. sativus tissues, which impedes further functional genomics research on R. sativus. In this study, RNA-seq technology was employed to characterize the transcriptome of leaf tissues. Approximately 70 million clean paired-end reads were obtained and used for de novo assembly with the Trinity program, which generated 68,086 unigenes with an average length of 576 bp. All the unigenes were annotated against the GO and KEGG databases. Meanwhile, we merged the leaf sequencing data with existing root sequencing data and obtained a better de novo assembly of R. sativus using the Oases program. Accordingly, potential simple sequence repeats (SSRs), transcription factors (TFs) and enzyme codes were identified in R. sativus. Additionally, we detected a total of 3563 significantly differentially expressed genes (DEGs, P = 0.05) and tissue-specific biological processes between leaf and root tissues. Furthermore, a TF-based regulation network was constructed using Cytoscape software. Taken together, these results not only provide a comprehensive genomic resource for R. sativus but also shed light on future functional genomic and proteomic research on R. sativus. PMID:26029219

  2. RoboOligo: software for mass spectrometry data to support manual and de novo sequencing of post-transcriptionally modified ribonucleic acids.

    PubMed

    Sample, Paul J; Gaston, Kirk W; Alfonzo, Juan D; Limbach, Patrick A

    2015-05-26

    Ribosomal ribonucleic acid (RNA), transfer RNA and other biological or synthetic RNA polymers can contain nucleotides that have been modified by the addition of chemical groups. Traditional Sanger sequencing methods cannot establish the chemical nature and sequence of these modified-nucleotide containing oligomers. Mass spectrometry (MS) has become the conventional approach for determining the nucleotide composition, modification status and sequence of modified RNAs. Modified RNAs are analyzed by MS using collision-induced dissociation tandem mass spectrometry (CID MS/MS), which produces a complex dataset of oligomeric fragments that must be interpreted to identify and place modified nucleosides within the RNA sequence. Here we report the development of RoboOligo, an interactive software program for the robust analysis of data generated by CID MS/MS of RNA oligomers. There are three main functions of RoboOligo: (i) automated de novo sequencing via the local search paradigm; (ii) manual sequencing with real-time spectrum labeling and cumulative intensity scoring; and (iii) a hybrid approach, coined 'variable sequencing', which combines the user intuition of manual sequencing with the high-throughput sampling of automated de novo sequencing.

  3. In Silico Whole Genome Sequencer and Analyzer (iWGS): a Computational Pipeline to Guide the Design and Analysis of de novo Genome Sequencing Studies

    PubMed Central

    Zhou, Xiaofan; Peris, David; Kominek, Jacek; Kurtzman, Cletus P.; Hittinger, Chris Todd; Rokas, Antonis

    2016-01-01

    The availability of genomes across the tree of life is highly biased toward vertebrates, pathogens, human disease models, and organisms with relatively small and simple genomes. Recent progress in genomics has enabled the de novo decoding of the genome of virtually any organism, greatly expanding its potential for understanding the biology and evolution of the full spectrum of biodiversity. The increasing diversity of sequencing technologies, assays, and de novo assembly algorithms have augmented the complexity of de novo genome sequencing projects in nonmodel organisms. To reduce the costs and challenges in de novo genome sequencing projects and streamline their experimental design and analysis, we developed iWGS (in silico Whole Genome Sequencer and Analyzer), an automated pipeline for guiding the choice of appropriate sequencing strategy and assembly protocols. iWGS seamlessly integrates the four key steps of a de novo genome sequencing project: data generation (through simulation), data quality control, de novo assembly, and assembly evaluation and validation. The last three steps can also be applied to the analysis of real data. iWGS is designed to enable the user to have great flexibility in testing the range of experimental designs available for genome sequencing projects, and supports all major sequencing technologies and popular assembly tools. Three case studies illustrate how iWGS can guide the design of de novo genome sequencing projects, and evaluate the performance of a wide variety of user-specified sequencing strategies and assembly protocols on genomes of differing architectures. iWGS, along with a detailed documentation, is freely available at https://github.com/zhouxiaofan1983/iWGS. PMID:27638685

  4. in silico Whole Genome Sequencer & Analyzer (iWGS): A Computational Pipeline to Guide the Design and Analysis of de novo Genome Sequencing Studies.

    PubMed

    Zhou, Xiaofan; Peris, David; Kominek, Jacek; Kurtzman, Cletus P; Hittinger, Chris Todd; Rokas, Antonis

    2016-09-16

    The availability of genomes across the tree of life is highly biased toward vertebrates, pathogens, human disease models, and organisms with relatively small and simple genomes. Recent progress in genomics has enabled the de novo decoding of the genome of virtually any organism, greatly expanding its potential for understanding the biology and evolution of the full spectrum of biodiversity. The increasing diversity of sequencing technologies, assays, and de novo assembly algorithms have augmented the complexity of de novo genome sequencing projects in non-model organisms. To reduce the costs and challenges in de novo genome sequencing projects and streamline their experimental design and analysis, we developed iWGS (in silico Whole Genome Sequencer and Analyzer), an automated pipeline for guiding the choice of appropriate sequencing strategy and assembly protocols. iWGS seamlessly integrates the four key steps of a de novo genome sequencing project: data generation (through simulation), data quality control, de novo assembly, and assembly evaluation and validation. The last three steps can also be applied to the analysis of real data. iWGS is designed to enable the user to have great flexibility in testing the range of experimental designs available for genome sequencing projects, and supports all major sequencing technologies and popular assembly tools. Three case studies illustrate how iWGS can guide the design of de novo genome sequencing projects and evaluate the performance of a wide variety of user-specified sequencing strategies and assembly protocols on genomes of differing architectures. iWGS, along with a detailed documentation, is freely available at https://github.com/zhouxiaofan1983/iWGS.

  5. in silico Whole Genome Sequencer & Analyzer (iWGS): A Computational Pipeline to Guide the Design and Analysis of de novo Genome Sequencing Studies

    SciTech Connect

    Zhou, Xiaofan; Peris, David; Kominek, Jacek; Kurtzman, Cletus P.; Hittinger, Chris Todd; Rokas, A.

    2016-09-16

    The availability of genomes across the tree of life is highly biased toward vertebrates, pathogens, human disease models, and organisms with relatively small and simple genomes. Recent progress in genomics has enabled the de novo decoding of the genome of virtually any organism, greatly expanding its potential for understanding the biology and evolution of the full spectrum of biodiversity. The increasing diversity of sequencing technologies, assays, and de novo assembly algorithms have augmented the complexity of de novo genome sequencing projects in nonmodel organisms. To reduce the costs and challenges in de novo genome sequencing projects and streamline their experimental design and analysis, we developed iWGS (in silico Whole Genome Sequencer and Analyzer), an automated pipeline for guiding the choice of appropriate sequencing strategy and assembly protocols. iWGS seamlessly integrates the four key steps of a de novo genome sequencing project: data generation (through simulation), data quality control, de novo assembly, and assembly evaluation and validation. The last three steps can also be applied to the analysis of real data. iWGS is designed to enable the user to have great flexibility in testing the range of experimental designs available for genome sequencing projects, and supports all major sequencing technologies and popular assembly tools. Three case studies illustrate how iWGS can guide the design of de novo genome sequencing projects, and evaluate the performance of a wide variety of user-specified sequencing strategies and assembly protocols on genomes of differing architectures. iWGS, along with a detailed documentation, is freely available at https://github.com/zhouxiaofan1983/iWGS.

  6. in silico Whole Genome Sequencer & Analyzer (iWGS): A Computational Pipeline to Guide the Design and Analysis of de novo Genome Sequencing Studies

    DOE PAGES

    Zhou, Xiaofan; Peris, David; Kominek, Jacek; ...

    2016-09-16

    The availability of genomes across the tree of life is highly biased toward vertebrates, pathogens, human disease models, and organisms with relatively small and simple genomes. Recent progress in genomics has enabled the de novo decoding of the genome of virtually any organism, greatly expanding its potential for understanding the biology and evolution of the full spectrum of biodiversity. The increasing diversity of sequencing technologies, assays, and de novo assembly algorithms have augmented the complexity of de novo genome sequencing projects in nonmodel organisms. To reduce the costs and challenges in de novo genome sequencing projects and streamline their experimental design and analysis, we developed iWGS (in silico Whole Genome Sequencer and Analyzer), an automated pipeline for guiding the choice of appropriate sequencing strategy and assembly protocols. iWGS seamlessly integrates the four key steps of a de novo genome sequencing project: data generation (through simulation), data quality control, de novo assembly, and assembly evaluation and validation. The last three steps can also be applied to the analysis of real data. iWGS is designed to enable the user to have great flexibility in testing the range of experimental designs available for genome sequencing projects, and supports all major sequencing technologies and popular assembly tools. Three case studies illustrate how iWGS can guide the design of de novo genome sequencing projects, and evaluate the performance of a wide variety of user-specified sequencing strategies and assembly protocols on genomes of differing architectures. iWGS, along with a detailed documentation, is freely available at https://github.com/zhouxiaofan1983/iWGS.

  7. DNApi: A De Novo Adapter Prediction Algorithm for Small RNA Sequencing Data.

    PubMed

    Tsuji, Junko; Weng, Zhiping

    2016-01-01

    With the rapid accumulation of publicly available small RNA sequencing datasets, third-party meta-analysis across many datasets is becoming increasingly powerful. Although removing the 3´ adapter is an essential step for small RNA sequencing analysis, the adapter sequence information is not always available in the metadata. The information can also be erroneous even when it is available. In this study, we developed DNApi, a lightweight Python software package that predicts the 3´ adapter sequence de novo and provides the user with cleansed small RNA sequences ready for downstream analysis. Tested on 539 publicly available small RNA libraries accompanied by 3´ adapter sequences in their metadata, DNApi shows near-perfect accuracy (98.5%) with fast runtime (~2.85 seconds per library) and efficient memory usage (~43 MB on average). In addition to 3´ adapter prediction, it is also important to classify whether the input small RNA libraries were already processed, i.e. the 3´ adapters were removed. Given another batch of datasets, DNApi correctly judged that all 192 publicly available processed libraries were "ready-to-map" small RNA sequences. DNApi is compatible with Python 2 and 3, and is available at https://github.com/jnktsj/DNApi. The 731 small RNA libraries used for DNApi evaluation were from human tissues and were carefully and manually collected. This study also provides readers with the curated datasets that can be integrated into their studies.

  8. DNApi: A De Novo Adapter Prediction Algorithm for Small RNA Sequencing Data

    PubMed Central

    Tsuji, Junko; Weng, Zhiping

    2016-01-01

    With the rapid accumulation of publicly available small RNA sequencing datasets, third-party meta-analysis across many datasets is becoming increasingly powerful. Although removing the 3´ adapter is an essential step for small RNA sequencing analysis, the adapter sequence information is not always available in the metadata. The information can also be erroneous even when it is available. In this study, we developed DNApi, a lightweight Python software package that predicts the 3´ adapter sequence de novo and provides the user with cleansed small RNA sequences ready for downstream analysis. Tested on 539 publicly available small RNA libraries accompanied by 3´ adapter sequences in their metadata, DNApi shows near-perfect accuracy (98.5%) with fast runtime (~2.85 seconds per library) and efficient memory usage (~43 MB on average). In addition to 3´ adapter prediction, it is also important to classify whether the input small RNA libraries were already processed, i.e. the 3´ adapters were removed. Given another batch of datasets, DNApi correctly judged that all 192 publicly available processed libraries were “ready-to-map” small RNA sequences. DNApi is compatible with Python 2 and 3, and is available at https://github.com/jnktsj/DNApi. The 731 small RNA libraries used for DNApi evaluation were from human tissues and were carefully and manually collected. This study also provides readers with the curated datasets that can be integrated into their studies. PMID:27736901
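
    Editorial illustration (not part of the record): the abstract above treats removal of the 3´ adapter as an essential preprocessing step. The sketch below shows one simple way such trimming could work once an adapter sequence is known or predicted. It is not DNApi's prediction algorithm and does not use DNApi's API; the function name `trim_3prime_adapter` and the `min_overlap` parameter are hypothetical, and only exact matches are handled.

```python
def trim_3prime_adapter(read, adapter, min_overlap=5):
    """Remove a 3' adapter from a read using exact matching only.

    Hypothetical helper for illustration; real trimmers also tolerate
    sequencing errors, which this sketch does not.
    """
    # Case 1: the whole adapter occurs inside the read.
    idx = read.find(adapter)
    if idx != -1:
        return read[:idx]
    # Case 2: only the start of the adapter fits at the end of the read.
    # Try the longest possible partial match first.
    longest = min(len(adapter), len(read)) - 1
    for k in range(longest, min_overlap - 1, -1):
        if read.endswith(adapter[:k]):
            return read[:-k]
    # No adapter evidence: return the read unchanged.
    return read


if __name__ == "__main__":
    adapter = "TGGAATTCTCGGGTGCCAAGG"  # example adapter string, assumed for the demo
    print(trim_3prime_adapter("ACGTACGTTGGAATTCTCGG", adapter))
```

    The `min_overlap` threshold is a design choice: a lower value trims more aggressively but risks removing genuine small RNA bases that happen to match the first few adapter bases.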

  9. Gaspra Approach Sequence

    NASA Image and Video Library

    1996-01-29

    This montage of 11 images, taken by NASA's Galileo spacecraft as it flew by the asteroid Gaspra in October 1991, shows Gaspra growing progressively larger in the field of view of Galileo's solid-state imaging camera as the spacecraft approached the asteroid. http://photojournal.jpl.nasa.gov/catalog/PIA00079

  10. Using phage display selected antibodies to dissect microbiomes for complete de novo genome sequencing of low abundance microbes

    PubMed Central

    2013-01-01

    Background Single cell genomics has revolutionized microbial sequencing, but complete coverage of genomes in complex microbiomes is imperfect due to enormous variation in organismal abundance and amplification bias. Empirical methods that complement rapidly improving bioinformatic tools will improve characterization of microbiomes and facilitate better genome coverage for low abundance microbes. Methods We describe a new approach to sequencing individual species from microbiomes that combines antibody phage display against intact bacteria with fluorescence activated cell sorting (FACS). Single chain (scFv) antibodies are selected using phage display against a bacterium or microbial community, resulting in species-specific antibodies that can be used in FACS for relative quantification of an organism in a community, as well as enrichment or depletion prior to genome sequencing. Results We selected antibodies against Lactobacillus acidophilus and demonstrate a FACS-based approach for identification and enrichment of the organism from both laboratory-cultured and commercially derived bacterial mixtures. The ability to selectively enrich for L. acidophilus when it is present at a very low abundance (<0.2%) leads to complete (>99.8%) de novo genome coverage whereas the standard single-cell sequencing approach is incomplete (<68%). We show that specific antibodies can be selected against L. acidophilus when the monoculture is used as antigen as well as when a community of 10 closely related species is used, demonstrating that in principle antibodies can be generated against individual organisms within microbial communities. Conclusions The approach presented here demonstrates that phage-selected antibodies against bacteria enable identification, enrichment of rare species, and depletion of abundant organisms, making it tractable to virtually any microbe or microbial community. Combining antibody specificity with FACS provides a new approach for characterizing and

  11. De novo sequences of Haloquadratum walsbyi from Lake Tyrrell, Australia, reveal a variable genomic landscape.

    PubMed

    Tully, Benjamin J; Emerson, Joanne B; Andrade, Karen; Brocks, Jochen J; Allen, Eric E; Banfield, Jillian F; Heidelberg, Karla B

    2015-01-01

    Hypersaline systems near salt saturation levels represent an extreme environment, in which organisms grow and survive near the limits of life. One of the abundant members of the microbial communities in hypersaline systems is the square archaeon, Haloquadratum walsbyi. Utilizing a short-read metagenome from Lake Tyrrell, a hypersaline ecosystem in Victoria, Australia, we performed a comparative genomic analysis of H. walsbyi to better understand the extent of variation between strains/subspecies. Results revealed that previously isolated strains/subspecies do not fully describe the complete repertoire of the genomic landscape present in H. walsbyi. Rearrangements, insertions, and deletions were observed for the Lake Tyrrell derived Haloquadratum genomes and were supported by environmental de novo sequences, including shifts in the dominant genomic landscape of the two most abundant strains. Analysis pertaining to halomucins indicated that homologs for this large protein are not a feature common for all species of Haloquadratum. Further, we analyzed ATP-binding cassette transporters (ABC-type transporters) for evidence of niche partitioning between different strains/subspecies. We were able to identify unique and variable transporter subunits from all five genomes analyzed and the de novo environmental sequences, suggesting that differences in nutrient and carbon source acquisition may play a role in maintaining distinct strains/subspecies.

  12. De Novo Sequences of Haloquadratum walsbyi from Lake Tyrrell, Australia, Reveal a Variable Genomic Landscape

    PubMed Central

    Tully, Benjamin J.; Emerson, Joanne B.; Andrade, Karen; Brocks, Jochen J.; Allen, Eric E.; Banfield, Jillian F.; Heidelberg, Karla B.

    2015-01-01

    Hypersaline systems near salt saturation levels represent an extreme environment, in which organisms grow and survive near the limits of life. One of the abundant members of the microbial communities in hypersaline systems is the square archaeon, Haloquadratum walsbyi. Utilizing a short-read metagenome from Lake Tyrrell, a hypersaline ecosystem in Victoria, Australia, we performed a comparative genomic analysis of H. walsbyi to better understand the extent of variation between strains/subspecies. Results revealed that previously isolated strains/subspecies do not fully describe the complete repertoire of the genomic landscape present in H. walsbyi. Rearrangements, insertions, and deletions were observed for the Lake Tyrrell derived Haloquadratum genomes and were supported by environmental de novo sequences, including shifts in the dominant genomic landscape of the two most abundant strains. Analysis pertaining to halomucins indicated that homologs for this large protein are not a feature common for all species of Haloquadratum. Further, we analyzed ATP-binding cassette transporters (ABC-type transporters) for evidence of niche partitioning between different strains/subspecies. We were able to identify unique and variable transporter subunits from all five genomes analyzed and the de novo environmental sequences, suggesting that differences in nutrient and carbon source acquisition may play a role in maintaining distinct strains/subspecies. PMID:25709557

  13. Effects of GC Bias in Next-Generation-Sequencing Data on De Novo Genome Assembly

    PubMed Central

    Yu, Chun-Hui; Chiang, Tzen-Yuh; Hwang, Chi-Chuan

    2013-01-01

    Next-generation-sequencing (NGS) has revolutionized the field of genome assembly because of its much higher data throughput and much lower cost compared with traditional Sanger sequencing. However, NGS poses new computational challenges to de novo genome assembly. Among the challenges, GC bias in NGS data is known to aggravate genome assembly. However, it is not clear to what extent GC bias affects genome assembly in general. In this work, we conduct a systematic analysis on the effects of GC bias on genome assembly. Our analyses reveal that GC bias only lowers assembly completeness when the degree of GC bias is above a threshold. At a strong GC bias, the assembly fragmentation due to GC bias can be explained by the low coverage of reads in the GC-poor or GC-rich regions of a genome. This effect is observed for all the assemblers under study. Increasing the total amount of NGS data thus rescues the assembly fragmentation because of GC bias. However, the amount of data needed for a full rescue depends on the distribution of GC contents. Both low and high coverage depths due to GC bias lower the accuracy of assembly. These pieces of information provide guidance toward a better de novo genome assembly in the presence of GC bias. PMID:23638157
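
    Editorial illustration (not part of the record): the abstract above attributes assembly fragmentation to low read coverage in GC-poor or GC-rich regions. A minimal sketch of how such regions might be located is to compute GC content in fixed, non-overlapping windows and flag the extremes. The function names, the window scheme, and the 0.3/0.7 thresholds are assumptions made for illustration and are not taken from the paper.

```python
def gc_fraction(seq):
    """Fraction of G and C among the unambiguous bases (A, C, G, T)."""
    seq = seq.upper()
    acgt = sum(seq.count(base) for base in "ACGT")
    if acgt == 0:
        return 0.0  # window contains only ambiguous bases such as N
    return (seq.count("G") + seq.count("C")) / acgt


def windowed_gc(seq, window):
    """GC fraction of each complete, non-overlapping window."""
    if window <= 0:
        raise ValueError("window must be positive")
    return [
        gc_fraction(seq[start:start + window])
        for start in range(0, len(seq) - window + 1, window)
    ]


def extreme_windows(seq, window, low=0.3, high=0.7):
    """Indices of windows whose GC fraction is below low or above high."""
    return [
        i for i, gc in enumerate(windowed_gc(seq, window))
        if gc < low or gc > high
    ]


if __name__ == "__main__":
    demo = "GGGGAAAAGCAT"
    print(windowed_gc(demo, 4))
    print(extreme_windows(demo, 4))
```

    In practice the flagged windows would be compared against per-window read depth to see whether coverage actually drops there.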

  14. CycloBranch: De Novo Sequencing of Nonribosomal Peptides from Accurate Product Ion Mass Spectra

    NASA Astrophysics Data System (ADS)

    Novák, Jiří; Lemr, Karel; Schug, Kevin A.; Havlíček, Vladimír

    2015-07-01

    Nonribosomal peptides have a wide range of biological and medical applications. Their identification by tandem mass spectrometry remains a challenging task. A new open-source de novo peptide identification engine CycloBranch was developed and successfully applied in identification or detailed characterization of 11 linear, cyclic, branched, and branch-cyclic peptides. CycloBranch is based on annotated building block databases the size of which is defined by the user according to ribosomal or nonribosomal peptide origin. The current number of involved nonisobaric and isobaric building blocks is 287 and 521, respectively. Contrary to all other peptide sequencing tools utilizing either peptide libraries or peptide fragment libraries, CycloBranch represents a true de novo sequencing engine developed for accurate mass spectrometric data. It is a stand-alone and cross-platform application with a graphical and user-friendly interface; it supports mzML, mzXML, mgf, txt, and baf file formats and can be run in parallel on multiple threads. It can be downloaded for free from http://ms.biomed.cas.cz/cyclobranch/, where the User's manual and video tutorials can be found.

  15. A Real-Time de novo DNA Sequencing Assembly Platform Based on an FPGA Implementation.

    PubMed

    Hu, Yuanqi; Georgiou, Pantelis

    2016-01-01

    This paper presents an FPGA based DNA comparison platform which can be run concurrently with the sensing phase of DNA sequencing and shortens the overall time needed for de novo DNA assembly. A hybrid overlap searching algorithm is applied which is scalable and can deal with incremental detection of new bases. To handle the incomplete data set which gradually increases during sequencing time, all-against-all comparisons are broken down into successive window-against-window comparison phases and executed using a novel dynamic suffix comparison algorithm combined with a partitioned dynamic programming method. The complete system has been designed to facilitate parallel processing in hardware, which allows real-time comparison and full scalability as well as a decrease in the number of computations required. A base pair comparison rate of 51.2 G/s is achieved when implemented on an FPGA with successful DNA comparison when using data sets from real genomes.

  16. De novo transcriptome sequencing in Pueraria lobata to identify putative genes involved in isoflavones biosynthesis.

    PubMed

    Wang, Xin; Li, Shutao; Li, Jia; Li, Changfu; Zhang, Yansheng

    2015-05-01

    Using Illumina sequencing technology, we have generated large-scale transcriptome sequencing data and identified many putative genes involved in isoflavones biosynthesis in Pueraria lobata. Pueraria lobata, a member of the Leguminosae family, is a traditional Chinese herb which has been used since ancient times. P. lobata root has extensive clinical usages, because it contains a rich source of isoflavones, including daidzin and puerarin. However, the knowledge of isoflavone metabolism and the characterization of corresponding genes in such a pathway remain largely unknown. In this study, de novo transcriptome of P. lobata root and leaf was sequenced using the Solexa sequencing platform. Over 140 million high-quality reads were assembled into 163,625 unigenes, of which about 43.1% were aligned to the Nr protein database. Using the RPKM (reads per kilobase per million reads) method, 3,148 unigenes were found to be upregulated, and 2,011 genes were downregulated in the leaf as compared to those in the root. Towards a further understanding of these differentially expressed genes, Gene Ontology enrichment and metabolic pathway enrichment analyses were performed. Based on these results, 47 novel structural genes were identified in the biosynthesis of isoflavones. Also, 22 putative UDP glycosyltransferases and 45 O-methyltransferases unigenes were identified as the candidates most likely to be involved in the tailoring processes of isoflavonoid downstream pathway. Moreover, MYB transcription factors were analyzed, and 133 of them were found to have higher expression levels in the roots than in the leaves. In conclusion, the de novo transcriptome investigation of these unique transcripts provided an invaluable resource for the global discovery of functional genes related to isoflavones biosynthesis in P. lobata.
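
    Editorial illustration (not part of the record): the abstract above uses the RPKM method to compare expression between leaf and root. As commonly defined, RPKM scales the number of reads mapped to a transcript by the transcript length in kilobases and by the total number of mapped reads in millions. The sketch below implements that standard formula; the function name and the example numbers are hypothetical and do not come from the study.

```python
def rpkm(read_count, transcript_length_bp, total_mapped_reads):
    """Reads per kilobase of transcript per million mapped reads.

    RPKM = read_count * 1e9 / (transcript_length_bp * total_mapped_reads)
    """
    if transcript_length_bp <= 0 or total_mapped_reads <= 0:
        raise ValueError("length and total mapped reads must be positive")
    return read_count * 1e9 / (transcript_length_bp * total_mapped_reads)


if __name__ == "__main__":
    # Hypothetical unigene: 500 reads, 2,000 bp long, 10 million mapped reads.
    print(rpkm(500, 2000, 10_000_000))
```

    Because the value is normalised for both transcript length and library size, the same unigene can be compared between two libraries of different sequencing depth.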

  17. Transcriptome Sequencing and De Novo Analysis of the Copepod Calanus sinicus Using 454 GS FLX

    PubMed Central

    Ning, Juan; Wang, Minxiao; Li, Chaolun; Sun, Song

    2013-01-01

    Background Despite their species abundance and primary economic importance, genomic information about copepods is still limited. In particular, genomic resources are lacking for the copepod Calanus sinicus, which is a dominant species in the coastal waters of East Asia. In this study, we performed de novo transcriptome sequencing to produce a large number of expressed sequence tags for the copepod C. sinicus. Results Copepodid larvae and adults were used as the basic material for transcriptome sequencing. Using 454 pyrosequencing, a total of 1,470,799 reads were obtained, which were assembled into 56,809 high quality expressed sequence tags. Based on their sequence similarity to known proteins, about 14,000 different genes were identified, including members of all major conserved signaling pathways. Transcripts that were putatively involved with growth, lipid metabolism, molting, and diapause were also identified among these genes. Differentially expressed genes related to several processes were found in C. sinicus copepodid larvae and adults. We detected 284,154 single nucleotide polymorphisms (SNPs) that provide a resource for gene function studies. Conclusion Our data provide the most comprehensive transcriptome resource available for C. sinicus. This resource allowed us to identify genes associated with primary physiological processes and SNPs in coding regions, which facilitated the quantitative analysis of differential gene expression. These data should provide a foundation for future genetic and genomic studies of this and related species. PMID:23671698

  18. De Novo Sequencing and Characterization of the Floral Transcriptome of Dendrocalamus latiflorus (Poaceae: Bambusoideae)

    PubMed Central

    Li, De-Zhu; Guo, Zhen-Hua

    2012-01-01

    Background Transcriptome sequencing can be used to determine gene sequences and transcript abundance in non-model species, and the advent of next-generation sequencing (NGS) technologies has greatly decreased the cost and time required for this process. Transcriptome data are especially desirable in bamboo species, as certain members constitute an economically and culturally important group of mostly semelparous plants with remarkable flowering features, yet little bamboo genomic research has been performed. Here we present, for the first time, extensive sequence and transcript abundance data for the floral transcriptome of a key bamboo species, Dendrocalamus latiflorus, obtained using the Illumina GAII sequencing platform. Our further goal was to identify patterns of gene expression during bamboo flower development. Results Approximately 96 million sequencing reads were generated and assembled de novo, yielding 146,395 high quality unigenes with an average length of 461 bp. Of these, 80,418 were identified as putative homologs of annotated sequences in the public protein databases, of which 290 were associated with the floral transition and 47 were related to flower development. Digital abundance analysis identified 26,529 transcripts differentially enriched between two developmental stages, young flower buds and older developing flowers. Unigenes found at each stage were categorized according to their putative functional categories. These sequence and putative function data comprise a resource for future investigation of the floral transition and flower development in bamboo species. Conclusions Our results present the first broad survey of a bamboo floral transcriptome. Although it will be necessary to validate the functions carried out by these genes, these results represent a starting point for future functional research on D. latiflorus and related species. PMID:22916120

  19. De novo Sequencing, Characterization, and Comparison of Inflorescence Transcriptomes of Cornus canadensis and C. florida (Cornaceae)

    PubMed Central

    Zhang, Jian; Franks, Robert G.; Liu, Xiang; Kang, Ming; Keebler, Jonathan E. M.; Schaff, Jennifer E.; Huang, Hong-Wen; Xiang, Qiu-Yun (Jenny)

    2013-01-01

    Background Transcriptome sequencing analysis is a powerful tool in molecular genetics and evolutionary biology. Here we report the results of de novo 454 sequencing, characterization, and comparison of inflorescence transcriptomes of two closely related dogwood species, Cornus canadensis and C. florida (Cornaceae). Our goals were to build a preliminary source of genome sequence data, and to identify genes potentially expressed differentially between the inflorescence transcriptomes for these important horticultural species. Results The sequencing of cDNAs from inflorescence buds of C. canadensis (cc) and C. florida (cf), and normalized cDNAs from leaves of C. canadensis resulted in 251799 (ccBud), 96245 (ccLeaf) and 114648 (cfBud) raw reads, respectively. The de novo assembly of the high quality (HQ) reads resulted in 36088, 17802 and 21210 unigenes for ccBud, ccLeaf and cfBud. A reference transcriptome for C. canadensis was built by assembling HQ reads of ccBud and ccLeaf, containing 40884 unigenes. Reference mapping and comparative analyses found 10926 sequences were putatively specific to ccBud, and 6979 putatively specific to cfBud. Putative differentially expressed genes between ccBud and cfBud that are related to flower development and/or stress response were identified among 7718 shared sequences by ccBud and cfBud. Bi-directional BLAST found 87 (41.83% of 208) of Arabidopsis genes related to inflorescence development had putative orthologs in the dogwood transcriptomes. Comparisons of the shared sequences by ccBud and cfBud yielded 65931 high quality SNPs between two species. The twenty unigenes with the most SNPs are listed as potential genetic markers for evolutionary studies. Conclusions The data provide an important, although preliminary, information platform for functional genomics and evolutionary developmental biology in Cornus. The study identified putative candidates potentially involved in the genetic regulation of inflorescence evolution and

  20. De novo assembly and validation of planaria transcriptome by massive parallel sequencing and shotgun proteomics.

    PubMed

    Adamidi, Catherine; Wang, Yongbo; Gruen, Dominic; Mastrobuoni, Guido; You, Xintian; Tolle, Dominic; Dodt, Matthias; Mackowiak, Sebastian D; Gogol-Doering, Andreas; Oenal, Pinar; Rybak, Agnieszka; Ross, Eric; Sánchez Alvarado, Alejandro; Kempa, Stefan; Dieterich, Christoph; Rajewsky, Nikolaus; Chen, Wei

    2011-07-01

    Freshwater planaria are a very attractive model system for stem cell biology, tissue homeostasis, and regeneration. The genome of the planarian Schmidtea mediterranea has recently been sequenced and is estimated to contain >20,000 protein-encoding genes. However, the characterization of its transcriptome is far from complete. Furthermore, not a single proteome of the entire phylum has been assayed on a genome-wide level. We devised an efficient sequencing strategy that allowed us to de novo assemble a major fraction of the S. mediterranea transcriptome. We then used independent assays and massive shotgun proteomics to validate the authenticity of transcripts. In total, our de novo assembly yielded 18,619 candidate transcripts with a mean length of 1118 nt after filtering. A total of 17,564 candidate transcripts could be mapped to 15,284 distinct loci on the current genome reference sequence. RACE confirmed complete or almost complete 5' and 3' ends for 22/24 transcripts. The frequencies of frame shifts, fusion, and fission events in the assembled transcripts were computationally estimated to be 4.2%-13%, 0%-3.7%, and 2.6%, respectively. Our shotgun proteomics produced 16,135 distinct peptides that validated 4200 transcripts (FDR ≤1%). The catalog of transcripts assembled in this study, together with the identified peptides, dramatically expands and refines planarian gene annotation, demonstrated by validation of several previously unknown transcripts with stem cell-dependent expression patterns. In addition, our robust transcriptome characterization pipeline could be applied to other organisms without genome assembly. All of our data, including homology annotation, are freely available at SmedGD, the S. mediterranea genome database.

  1. Exome sequencing for bipolar disorder points to roles of de novo loss-of-function and protein-altering mutations

    PubMed Central

    Kataoka, M; Matoba, N; Sawada, T; Kazuno, A-A; Ishiwata, M; Fujii, K; Matsuo, K; Takata, A; Kato, T

    2016-01-01

    Although numerous genetic studies have been conducted for bipolar disorder (BD), its genetic architecture remains elusive. Here we perform, to the best of our knowledge, the first trio-based exome sequencing study for BD to investigate potential roles of de novo mutations in the disease etiology. We identified 71 de novo point mutations and one de novo copy-number mutation in 79 BD probands. Among the genes hit by de novo loss-of-function (LOF; nonsense, splice site or frameshift) or protein-altering (LOF, missense and inframe indel) mutations, we found significant enrichment of genes highly intolerant (first percentile of intolerant genes assessed by Residual Variation Intolerance Score) to protein-altering variants in the general population, an observation that is also reported in autism and schizophrenia. When we performed a joint analysis using the data of schizoaffective disorder in published studies, we found global enrichment of de novo LOF and protein-altering mutations in the combined group of bipolar I and schizoaffective disorders. Considering the relationship between de novo mutations and clinical phenotypes, we observed significantly earlier disease onset among the BD probands with de novo protein-altering mutations when compared with non-carriers. Gene ontology enrichment analysis of genes hit by de novo protein-altering mutations in bipolar I and schizoaffective disorders did not identify any significant enrichment. These results of exploratory analyses collectively point to the roles of de novo LOF and protein-altering mutations in the etiology of bipolar disorder and warrant further large-scale studies. PMID:27217147

  2. Exome sequencing for bipolar disorder points to roles of de novo loss-of-function and protein-altering mutations.

    PubMed

    Kataoka, M; Matoba, N; Sawada, T; Kazuno, A-A; Ishiwata, M; Fujii, K; Matsuo, K; Takata, A; Kato, T

    2016-07-01

    Although numerous genetic studies have been conducted for bipolar disorder (BD), its genetic architecture remains elusive. Here we perform, to the best of our knowledge, the first trio-based exome sequencing study for BD to investigate potential roles of de novo mutations in the disease etiology. We identified 71 de novo point mutations and one de novo copy-number mutation in 79 BD probands. Among the genes hit by de novo loss-of-function (LOF; nonsense, splice site or frameshift) or protein-altering (LOF, missense and inframe indel) mutations, we found significant enrichment of genes highly intolerant (first percentile of intolerant genes assessed by Residual Variation Intolerance Score) to protein-altering variants in the general population, an observation that is also reported in autism and schizophrenia. When we performed a joint analysis using the data of schizoaffective disorder in published studies, we found global enrichment of de novo LOF and protein-altering mutations in the combined group of bipolar I and schizoaffective disorders. Considering the relationship between de novo mutations and clinical phenotypes, we observed significantly earlier disease onset among the BD probands with de novo protein-altering mutations when compared with non-carriers. Gene ontology enrichment analysis of genes hit by de novo protein-altering mutations in bipolar I and schizoaffective disorders did not identify any significant enrichment. These results of exploratory analyses collectively point to the roles of de novo LOF and protein-altering mutations in the etiology of bipolar disorder and warrant further large-scale studies.

  3. Sequencing, de novo assembly and annotation of a pink bollworm larval midgut transcriptome.

    PubMed

    Tassone, Erica E; Zastrow-Hayes, Gina; Mathis, John; Nelson, Mark E; Wu, Gusui; Flexner, J Lindsey; Carrière, Yves; Tabashnik, Bruce E; Fabrick, Jeffrey A

    2016-06-22

    The pink bollworm Pectinophora gossypiella (Saunders) (Lepidoptera: Gelechiidae) is one of the world's most important pests of cotton. Insecticide sprays and transgenic cotton producing toxins of the bacterium Bacillus thuringiensis (Bt) are currently used to manage this pest. Bt toxins kill susceptible insects by specifically binding to and destroying midgut cells, but they are not toxic to most other organisms. Pink bollworm is useful as a model for understanding insect responses to Bt toxins, yet advances in understanding at the molecular level have been limited because basic genomic information is lacking for this cosmopolitan pest. Here, we have sequenced, de novo assembled and annotated a comprehensive larval midgut transcriptome from a susceptible strain of pink bollworm. A de novo transcriptome assembly for the midgut of P. gossypiella was generated containing 46,458 transcripts (average length of 770 bp) derived from 39,874 unigenes. The size of the transcriptome is similar to published midgut transcriptomes of other Lepidoptera and includes up to 91 % annotated contigs. The dataset is publicly available in NCBI and GigaDB as a resource for researchers. Foundational knowledge of protein-coding genes from the pink bollworm midgut is critical for understanding how this important insect pest functions. The transcriptome data presented here represent the first large-scale molecular resource for this species, and may be used for deciphering relevant midgut proteins critical for xenobiotic detoxification, nutrient digestion and allocation, as well as for the discovery of protein receptors important for Bt intoxication.

  4. De novo assembly and characterization of the garlic (Allium sativum) bud transcriptome by Illumina sequencing.

    PubMed

    Sun, Xiudong; Zhou, Shumei; Meng, Fanlu; Liu, Shiqi

    2012-10-01

    Garlic is widely used as a spice throughout the world for the culinary value of its flavor and aroma, which are created by the chemical transformation of a series of organic sulfur compounds. To analyze the transcriptome of Allium sativum and discover the genes involved in sulfur metabolism, cDNAs derived from the total RNA of Allium sativum buds were analyzed by Illumina sequencing. Approximately 26.67 million 90 bp paired-end clean reads were achieved in two libraries. A total of 127,933 unigenes were generated by de novo assembly and were compared with the sequences in public databases. Of these, 45,286 unigenes had significant hits to the sequences in the Nr database, 29,514 showed significant similarity to known proteins in the Swiss-Prot database, and 20,706 and 21,952 unigenes had significant similarity to existing sequences in the KEGG and COG databases, respectively. Moreover, genes involved in organic sulfur biosynthesis were identified. These unigene data will provide the foundation for research on gene expression, genomics and functional genomics in Allium sativum. Key message The obtained unigenes will provide the foundation for research on functional genomics in Allium sativum and its closely related species, and fill the gap of the existing plant EST database.

  5. A De Novo Genome Sequence Assembly of the Arabidopsis thaliana Accession Niederzenz-1 Displays Presence/Absence Variation and Strong Synteny

    PubMed Central

    Pucker, Boas; Holtgräwe, Daniela; Rosleff Sörensen, Thomas; Stracke, Ralf; Viehöver, Prisca

    2016-01-01

    Arabidopsis thaliana is the most important model organism for fundamental plant biology. The genome diversity of different accessions of this species has been intensively studied, for example in the 1001 genome project which led to the identification of many single nucleotide polymorphisms (SNPs) and small insertions and deletions (InDels). In addition, presence/absence variation (PAV), copy number variation (CNV) and mobile genetic elements contribute to genomic differences between A. thaliana accessions. To address larger genome rearrangements between the A. thaliana reference accession Columbia-0 (Col-0) and another accession of about average distance to Col-0, we created a de novo next generation sequencing (NGS)-based assembly from the accession Niederzenz-1 (Nd-1). The result was evaluated with respect to assembly strategy and synteny to Col-0. We provide a high quality genome sequence of the A. thaliana accession (Nd-1, LXSY01000000). The assembly displays an N50 of 0.590 Mbp and covers 99% of the Col-0 reference sequence. Scaffolds from the de novo assembly were positioned on the basis of sequence similarity to the reference. Errors in this automatic scaffold anchoring were manually corrected based on analyzing reciprocal best BLAST hits (RBHs) of genes. Comparison of the final Nd-1 assembly to the reference revealed duplications and deletions (PAV). We identified 826 insertions and 746 deletions in Nd-1. Randomly selected candidates of PAV were experimentally validated. Our Nd-1 de novo assembly allowed reliable identification of larger genic and intergenic variants, which was difficult or error-prone by short read mapping approaches alone. While overall sequence similarity as well as synteny is very high, we detected short and larger (affecting more than 100 bp) differences between Col-0 and Nd-1 based on bi-directional comparisons. The de novo assembly provided here and additional assemblies that will certainly be published in the future will allow to

  6. De Novo Sequencing and Analysis of Lemongrass Transcriptome Provide First Insights into the Essential Oil Biosynthesis of Aromatic Grasses.

    PubMed

    Meena, Seema; Kumar, Sarma R; Venkata Rao, D K; Dwivedi, Varun; Shilpashree, H B; Rastogi, Shubhra; Shasany, Ajit K; Nagegowda, Dinesh A

    2016-01-01

    Aromatic grasses of the genus Cymbopogon (Poaceae family) represent a unique group of plants that produce diverse compositions of monoterpene-rich essential oils, which have great value in flavor, fragrance, cosmetic, and aromatherapy industries. Despite the commercial importance of these natural aromatic oils, their biosynthesis at the molecular level remains unexplored. As the first step toward understanding the essential oil biosynthesis, we performed de novo transcriptome assembly and analysis of C. flexuosus (lemongrass) by employing Illumina sequencing. Mining of transcriptome data and subsequent phylogenetic analysis led to identification of terpene synthases, pyrophosphatases, alcohol dehydrogenases, aldo-keto reductases, carotenoid cleavage dioxygenases, alcohol acetyltransferases, and aldehyde dehydrogenases, which are potentially involved in essential oil biosynthesis. Comparative essential oil profiling and mRNA expression analysis in three Cymbopogon species (C. flexuosus, aldehyde type; C. martinii, alcohol type; and C. winterianus, intermediate type) with varying essential oil composition indicated the involvement of identified candidate genes in the formation of alcohols, aldehydes, and acetates. Molecular modeling and docking further supported the role of identified protein sequences in aroma formation in Cymbopogon. Also, simple sequence repeats were found in the transcriptome with many linked to terpene pathway genes including the genes potentially involved in aroma biosynthesis. This work provides the first insights into the essential oil biosynthesis of aromatic grasses, and the identified candidate genes and markers can be a great resource for biotechnological and molecular breeding approaches to modulate the essential oil composition.

  7. De Novo Sequencing and Analysis of Lemongrass Transcriptome Provide First Insights into the Essential Oil Biosynthesis of Aromatic Grasses

    PubMed Central

    Meena, Seema; Kumar, Sarma R.; Venkata Rao, D. K.; Dwivedi, Varun; Shilpashree, H. B.; Rastogi, Shubhra; Shasany, Ajit K.; Nagegowda, Dinesh A.

    2016-01-01

    Aromatic grasses of the genus Cymbopogon (Poaceae family) represent a unique group of plants that produce diverse compositions of monoterpene-rich essential oils, which have great value in flavor, fragrance, cosmetic, and aromatherapy industries. Despite the commercial importance of these natural aromatic oils, their biosynthesis at the molecular level remains unexplored. As the first step toward understanding the essential oil biosynthesis, we performed de novo transcriptome assembly and analysis of C. flexuosus (lemongrass) by employing Illumina sequencing. Mining of transcriptome data and subsequent phylogenetic analysis led to identification of terpene synthases, pyrophosphatases, alcohol dehydrogenases, aldo-keto reductases, carotenoid cleavage dioxygenases, alcohol acetyltransferases, and aldehyde dehydrogenases, which are potentially involved in essential oil biosynthesis. Comparative essential oil profiling and mRNA expression analysis in three Cymbopogon species (C. flexuosus, aldehyde type; C. martinii, alcohol type; and C. winterianus, intermediate type) with varying essential oil composition indicated the involvement of identified candidate genes in the formation of alcohols, aldehydes, and acetates. Molecular modeling and docking further supported the role of identified protein sequences in aroma formation in Cymbopogon. Also, simple sequence repeats were found in the transcriptome with many linked to terpene pathway genes including the genes potentially involved in aroma biosynthesis. This work provides the first insights into the essential oil biosynthesis of aromatic grasses, and the identified candidate genes and markers can be a great resource for biotechnological and molecular breeding approaches to modulate the essential oil composition. PMID:27516768

  8. De Novo Transcriptome Sequencing of Oryza officinalis Wall ex Watt to Identify Disease-Resistance Genes.

    PubMed

    He, Bin; Gu, Yinghong; Tao, Xiang; Cheng, Xiaojie; Wei, Changhe; Fu, Jian; Cheng, Zaiquan; Zhang, Yizheng

    2015-12-10

    Oryza officinalis Wall ex Watt is one of the most important wild relatives of cultivated rice and exhibits high resistance to many diseases. It has been used as a source of genes for introgression into cultivated rice. However, there are limited genomic resources and little genetic information publicly reported for this species. To better understand the pathways and factors involved in disease resistance and to accelerate the process of rice breeding, we carried out de novo transcriptome sequencing of O. officinalis. In this research, 137,229 contigs were obtained ranging from 200 to 19,214 bp with an N50 of 2331 bp through de novo assembly of leaves, stems and roots in O. officinalis using an Illumina HiSeq 2000 platform. Based on sequence similarity searches against a non-redundant protein database, a total of 88,249 contigs were annotated with gene descriptions and 75,589 transcripts were further assigned to GO terms. Candidate genes for plant-pathogen interaction and plant hormone regulation pathways involved in disease resistance were identified. Further analyses of gene expression profiles showed that the majority of genes related to disease resistance were expressed in all three tissues. In addition, there are two kinds of rice bacterial blight-resistance genes in O. officinalis, including two Xa1 genes and three Xa26 genes. Both Xa1 genes showed the highest expression level in stem, whereas one Xa26 gene was expressed predominantly in leaf and the other two Xa26 genes displayed low expression levels in all three tissues. This transcriptomic database provides an opportunity for identifying the genes involved in disease resistance and will provide a basis for studying functional genomics of O. officinalis and genetic improvement of cultivated rice in the future.
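
    The N50 statistic reported for this assembly can be computed as in the following minimal sketch, which assumes the usual definition (the length of the shortest contig in the smallest set of longest contigs that together cover at least half of the total assembly length). The function name and the toy contig lengths are hypothetical and are not data from the study.

```python
def n50(contig_lengths):
    """Return the N50 of a collection of contig lengths (in bp)."""
    if not contig_lengths:
        raise ValueError("at least one contig length is required")
    ordered = sorted(contig_lengths, reverse=True)  # longest contigs first
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        # The first contig that pushes the cumulative sum to half of the
        # assembly length defines the N50.
        if running >= half_total:
            return length


# Toy assembly: total length 54, so half is 27; 10 + 9 + 8 reaches 27.
toy_n50 = n50([2, 3, 4, 5, 6, 7, 8, 9, 10])
```

    N50 is weighted toward long contigs, so a large number of short contigs (for example near the 200 bp minimum) lowers it far less than it lowers the mean contig length.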

  9. Sequencing and De Novo Assembly of the Gonadal Transcriptome of the Endangered Chinese Sturgeon (Acipenser sinensis).

    PubMed

    Yue, Huamei; Li, Chuangju; Du, Hao; Zhang, Shuhuan; Wei, Qiwei

    2015-01-01

    The Chinese sturgeon (Acipenser sinensis) is endangered through anthropogenic activities including over-fishing, damming, shipping, and pollution. Controlled reproduction has been adopted and successfully conducted for conservation. However, little information is available on the reproductive regulation of the species. In this study, we conducted de novo transcriptome assembly of the gonad tissue to create a comprehensive dataset for A. sinensis. The Illumina sequencing platform was adopted to obtain 47,333,701 and 47,229,705 high quality reads from testis and ovary cDNA libraries generated from three-year-old A. sinensis. We identified 86,027 unigenes of which 30,268 were annotated in the NCBI non-redundant protein database and 28,281 were annotated in the Swiss-prot database. Among the annotated unigenes, 26,152 and 7,734 unigenes, respectively, were assigned to gene ontology categories and clusters of orthologous groups. In addition, 12,557 unigenes were mapped to 231 pathways in the Kyoto Encyclopedia of Genes and Genomes Pathway database. A total of 1,896 unigenes, potentially differentially expressed between the two gonad types, were found, with 1,894 predicted to be up-regulated in ovary and only two in testis. Fifty-five potential gametogenesis-related genes were screened in the transcriptome and 34 genes with significant matches were found. Besides, more paralogs of 11 genes in three gene families (sox, apolipoprotein and cyclin) were found in A. sinensis compared to their orthologs in the diploid Danio rerio. In addition, 12,151 putative simple sequence repeats (SSRs) were detected. This study provides the first de novo transcriptome analysis currently available for A. sinensis. The transcriptomic data represents the fundamental resource for future research on the mechanism of early gametogenesis in sturgeons. The SSRs identified in this work will be valuable for assessment of genetic diversity of wild fish and genealogy management of cultured fish.

  10. Sequencing and De Novo Assembly of the Gonadal Transcriptome of the Endangered Chinese Sturgeon (Acipenser sinensis)

    PubMed Central

    Du, Hao; Zhang, Shuhuan; Wei, Qiwei

    2015-01-01

    Background The Chinese sturgeon (Acipenser sinensis) is endangered through anthropogenic activities including over-fishing, damming, shipping, and pollution. Controlled reproduction has been adopted and successfully conducted for conservation. However, little information is available on the reproductive regulation of the species. In this study, we conducted de novo transcriptome assembly of the gonad tissue to create a comprehensive dataset for A. sinensis. Results The Illumina sequencing platform was adopted to obtain 47,333,701 and 47,229,705 high quality reads from testis and ovary cDNA libraries generated from three-year-old A. sinensis. We identified 86,027 unigenes of which 30,268 were annotated in the NCBI non-redundant protein database and 28,281 were annotated in the Swiss-prot database. Among the annotated unigenes, 26,152 and 7,734 unigenes, respectively, were assigned to gene ontology categories and clusters of orthologous groups. In addition, 12,557 unigenes were mapped to 231 pathways in the Kyoto Encyclopedia of Genes and Genomes Pathway database. A total of 1,896 unigenes, potentially differentially expressed between the two gonad types, were found, with 1,894 predicted to be up-regulated in ovary and only two in testis. Fifty-five potential gametogenesis-related genes were screened in the transcriptome and 34 genes with significant matches were found. Besides, more paralogs of 11 genes in three gene families (sox, apolipoprotein and cyclin) were found in A. sinensis compared to their orthologs in the diploid Danio rerio. In addition, 12,151 putative simple sequence repeats (SSRs) were detected. Conclusions This study provides the first de novo transcriptome analysis currently available for A. sinensis. The transcriptomic data represents the fundamental resource for future research on the mechanism of early gametogenesis in sturgeons. The SSRs identified in this work will be valuable for assessment of genetic diversity of wild fish and genealogy

  11. Transcriptome Sequencing and De Novo Analysis for Ma Bamboo (Dendrocalamus latiflorus Munro) Using the Illumina Platform

    PubMed Central

    Liu, Mingying; Qiao, Guirong; Jiang, Jing; Yang, Huiqin; Xie, Lihua; Xie, Jinzhong; Zhuo, Renying

    2012-01-01

    Background Bamboo occupies an important phylogenetic node in the grass family with remarkable sizes, woodiness and a striking life history. However, limited genetic research has focused on bamboo partially because of the lack of genomic resources. The advent of high-throughput sequencing technologies enables generation of genomic resources in a short time and at a minimal cost, and therefore provides a turning point for bamboo research. In the present study, we performed de novo transcriptome sequencing for the first time to produce a comprehensive dataset for the Ma bamboo (Dendrocalamus latiflorus Munro). Results The Ma bamboo transcriptome was sequenced using the Illumina paired-end sequencing technology. We produced 15,138,726 reads and assembled them into 103,354 scaffolds. A total of 68,229 unigenes were identified, among which 46,087 were annotated in the NCBI non-redundant protein database and 28,165 were annotated in the Swiss-Prot database. Of these annotated unigenes, 11,921 and 10,147 unigenes were assigned to gene ontology categories and clusters of orthologous groups, respectively. We could map 45,649 unigenes onto 292 pathways using the Kyoto Encyclopedia of Genes and Genomes Pathway database. The annotated unigenes were compared against Moso bamboo, rice and millet. Unigenes that did not match any of those three sequence datasets are considered to be Ma bamboo unique. We predicted 105 unigenes encoding eight key enzymes involved in lignin biosynthesis. In addition, 621 simple sequence repeats (SSRs) were detected. Conclusion Our data provide the most comprehensive transcriptomic resource currently available for D. latiflorus Munro. Candidate genes potentially involved in growth and development were identified, and those predicted to be unique to Ma bamboo are expected to give a better insight on Ma bamboo gene diversity. Numerous SSRs characterized contributed to marker development. These data constitute a new valuable resource for genomic studies

  12. Transcriptome sequencing and de novo analysis for Ma bamboo (Dendrocalamus latiflorus Munro) using the Illumina platform.

    PubMed

    Liu, Mingying; Qiao, Guirong; Jiang, Jing; Yang, Huiqin; Xie, Lihua; Xie, Jinzhong; Zhuo, Renying

    2012-01-01

    Bamboo occupies an important phylogenetic node in the grass family with remarkable sizes, woodiness and a striking life history. However, limited genetic research has focused on bamboo partially because of the lack of genomic resources. The advent of high-throughput sequencing technologies enables generation of genomic resources in a short time and at a minimal cost, and therefore provides a turning point for bamboo research. In the present study, we performed de novo transcriptome sequencing for the first time to produce a comprehensive dataset for the Ma bamboo (Dendrocalamus latiflorus Munro). The Ma bamboo transcriptome was sequenced using the Illumina paired-end sequencing technology. We produced 15,138,726 reads and assembled them into 103,354 scaffolds. A total of 68,229 unigenes were identified, among which 46,087 were annotated in the NCBI non-redundant protein database and 28,165 were annotated in the Swiss-Prot database. Of these annotated unigenes, 11,921 and 10,147 unigenes were assigned to gene ontology categories and clusters of orthologous groups, respectively. We could map 45,649 unigenes onto 292 pathways using the Kyoto Encyclopedia of Genes and Genomes Pathway database. The annotated unigenes were compared against Moso bamboo, rice and millet. Unigenes that did not match any of those three sequence datasets are considered to be Ma bamboo unique. We predicted 105 unigenes encoding eight key enzymes involved in lignin biosynthesis. In addition, 621 simple sequence repeats (SSRs) were detected. Our data provide the most comprehensive transcriptomic resource currently available for D. latiflorus Munro. Candidate genes potentially involved in growth and development were identified, and those predicted to be unique to Ma bamboo are expected to give a better insight on Ma bamboo gene diversity. Numerous SSRs characterized contributed to marker development. These data constitute a new valuable resource for genomic studies on D. latiflorus Munro and

  13. De novo Sequencing, Assembly and Characterization of Antennal Transcriptome of Anomala corpulenta Motschulsky (Coleoptera: Rutelidae)

    PubMed Central

    Chen, Haoliang; Lin, Lulu; Xie, Minghui; Zhang, Guangling; Su, Weihua

    2014-01-01

    Background: Anomala corpulenta is an important insect pest and can cause enormous economic losses in agriculture, horticulture and forestry. It is widely distributed in China, and both larvae and adults can cause serious damage. It is difficult to control this pest because the larvae live underground. Any new control strategy should exploit alternatives to heavily and frequently used chemical insecticides. However, little genetic research has been carried out on A. corpulenta due to the lack of genomic resources. Genomic resources could be produced by next generation sequencing technologies with low cost and in a short time. In this study, we performed de novo sequencing, assembly and characterization of the antennal transcriptome of A. corpulenta. Results: Illumina sequencing technology was used to sequence the antennal transcriptome of A. corpulenta. Approximately 76.7 million total raw reads and about 68.9 million total clean reads were obtained, and then 35,656 unigenes were assembled. Of these unigenes, 21,463 could be annotated in the NCBI nr database, and, among the annotated unigenes, 11,154 and 6,625 unigenes could be assigned to GO and COG, respectively. Additionally, 16,350 unigenes could be annotated in the Swiss-Prot database, and 14,499 unigenes could map onto 258 pathways in the KEGG Pathway database. We also found 24 unigenes related to OBPs, 6 to CSPs, and in total 167 unigenes related to chemodetection. We analyzed 4 OBP and 3 CSP sequences, and their RT-qPCR results agreed well with their FPKM values. Conclusion: We produced the first large-scale antennal transcriptome of A. corpulenta, which is a species that has little genomic information in public databases. The identified chemodetection unigenes can promote the molecular mechanistic study of behavior in A. corpulenta. These findings provide a general sequence resource for molecular genetics research on A. corpulenta. PMID:25461610
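
    The FPKM values mentioned in the record above follow a standard normalization: fragments mapped to a gene, scaled by gene length in kilobases and by sequencing depth in millions of mapped fragments. A minimal sketch of that arithmetic follows; the function name and the example numbers are illustrative assumptions, not values from the study.

```python
def fpkm(fragments_on_gene: int, gene_length_bp: int, total_mapped_fragments: int) -> float:
    """Fragments per kilobase of transcript per million mapped fragments.

    Hypothetical helper for illustration; the study's own pipeline is not described
    in the abstract.
    """
    if gene_length_bp <= 0 or total_mapped_fragments <= 0:
        raise ValueError("gene length and total mapped fragments must be positive")
    # 1e9 = 1e3 (bases per kilobase) * 1e6 (fragments per million)
    return fragments_on_gene * 1e9 / (gene_length_bp * total_mapped_fragments)


if __name__ == "__main__":
    # Made-up example: 500 fragments on a 2,000 bp unigene, 10 million mapped fragments.
    print(fpkm(500, 2000, 10_000_000))
```

    Because the value is normalized for both length and depth, it can be compared across unigenes within a library, which is what makes a comparison with RT-qPCR trends meaningful.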

  14. Oxford Nanopore sequencing, hybrid error correction, and de novo assembly of a eukaryotic genome.

    PubMed

    Goodwin, Sara; Gurtowski, James; Ethe-Sayers, Scott; Deshpande, Panchajanya; Schatz, Michael C; McCombie, W Richard

    2015-11-01

    Monitoring the progress of DNA molecules through a membrane pore has been postulated as a method for sequencing DNA for several decades. Recently, a nanopore-based sequencing instrument, the Oxford Nanopore MinION, has become available, and we used this for sequencing the Saccharomyces cerevisiae genome. To make use of these data, we developed a novel open-source hybrid error correction algorithm, Nanocorr, specifically for Oxford Nanopore reads, because existing packages were incapable of assembling the long read lengths (5-50 kbp) at such high error rates (between ∼5% and 40% error). With this new method, we were able to perform a hybrid error correction of the nanopore reads using complementary MiSeq data and produce a de novo assembly that is highly contiguous and accurate: The contig N50 length is more than ten times greater than an Illumina-only assembly (678 kbp versus 59.9 kbp) and has >99.88% consensus identity when compared to the reference. Furthermore, the assembly with the long nanopore reads presents a much more complete representation of the features of the genome and correctly assembles gene cassettes, rRNAs, transposable elements, and other genomic features that were almost entirely absent in the Illumina-only assembly.
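
    Several records in this list compare assemblies by contig N50. As a reminder of what that statistic measures, here is a minimal sketch using the common definition (the largest length L such that contigs of length at least L contain at least half of the assembled bases). The function name and the toy contig lengths are assumptions for illustration and are not taken from any of the studies.

```python
def n50(contig_lengths):
    """Return the N50 of a collection of contig lengths (common definition)."""
    if not contig_lengths:
        raise ValueError("need at least one contig length")
    total = sum(contig_lengths)
    running = 0
    # Walk from the longest contig down until half of all bases are covered.
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length


if __name__ == "__main__":
    toy_assembly = [80, 70, 50, 40, 30, 20]  # made-up lengths
    print(n50(toy_assembly))
```

    N50 summarizes contiguity only; it says nothing about correctness, which is why the abstracts report identity or accuracy alongside it.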

  15. De novo assembly, characterization and functional annotation of pineapple fruit transcriptome through massively parallel sequencing.

    PubMed

    Ong, Wen Dee; Voo, Lok-Yung Christopher; Kumar, Vijay Subbiah

    2012-01-01

    Pineapple (Ananas comosus var. comosus) is an important tropical non-climacteric fruit with high commercial potential. Understanding the mechanism and processes underlying fruit ripening would enable scientists to improve quality traits such as flavor, texture, appearance and fruit sweetness. Although the pineapple is an important fruit, there is insufficient transcriptomic or genomic information available in public databases. Application of high throughput transcriptome sequencing to profile the pineapple fruit transcripts is therefore needed. To facilitate this, we have performed transcriptome sequencing of ripe yellow pineapple fruit flesh using Illumina technology. About 4.7 million Illumina paired-end reads were generated and assembled using the Velvet de novo assembler. The assembly produced 28,728 unique transcripts with a mean length of approximately 200 bp. Sequence similarity search against the non-redundant NCBI database identified a total of 16,932 unique transcripts (58.93%) with significant hits. Out of these, 15,507 unique transcripts were assigned to gene ontology terms. Functional annotation against the Kyoto Encyclopedia of Genes and Genomes pathway database identified 13,598 unique transcripts (47.33%) which were mapped to 126 pathways. The assembly revealed many transcripts that were previously unknown. The unique transcripts derived from this work have rapidly increased the number of pineapple fruit mRNA transcripts now available in public databases. This information can be further utilized in gene expression, genomics and other functional genomics studies in pineapple.

  16. Mining Novel Allergens from Coconut Pollen Employing Manual De Novo Sequencing and Homology-Driven Proteomics.

    PubMed

    Saha, Bodhisattwa; Sircar, Gaurab; Pandey, Naren; Gupta Bhattacharya, Swati

    2015-11-06

    Coconut pollen, one of the major palm pollen grains, is an important constituent among vectors of inhalant allergens in India and a major sensitizer for respiratory allergy in susceptible patients. To gain insight into its allergenic components, pollen proteins were analyzed by two-dimensional electrophoresis, immunoblotted with coconut pollen sensitive patient sera, followed by mass spectrometry of IgE reactive proteins. Coconut being largely unsequenced, a proteomic workflow has been devised that combines the conventional database-dependent analysis of tandem mass spectral data and manual de novo sequencing followed by a homology-based search for identifying the allergenic proteins. N-terminal acetylation helped to distinguish "b" ions from others, facilitating reliable sequencing. This led to the identification of 12 allergenic proteins. Cluster analysis with individual patient sera recognized vicilin-like protein as a major allergen, which was purified to assess its in vitro allergenicity and then partially sequenced. Other IgE-sensitive spots showed significant homology with well-known allergenic proteins such as 11S globulin, enolase, and isoflavone reductase along with a few which are reported as novel allergens. The allergens identified can be used as potential candidates to develop hypoallergenic vaccines, to design specific immunotherapy trials, and to enrich the repertoire of existing IgE reactive proteins.

  17. Oxford Nanopore sequencing, hybrid error correction, and de novo assembly of a eukaryotic genome

    PubMed Central

    Goodwin, Sara; Gurtowski, James; Ethe-Sayers, Scott; Deshpande, Panchajanya; Schatz, Michael C.; McCombie, W. Richard

    2015-01-01

    Monitoring the progress of DNA molecules through a membrane pore has been postulated as a method for sequencing DNA for several decades. Recently, a nanopore-based sequencing instrument, the Oxford Nanopore MinION, has become available, and we used this for sequencing the Saccharomyces cerevisiae genome. To make use of these data, we developed a novel open-source hybrid error correction algorithm, Nanocorr, specifically for Oxford Nanopore reads, because existing packages were incapable of assembling the long read lengths (5–50 kbp) at such high error rates (between ∼5% and 40% error). With this new method, we were able to perform a hybrid error correction of the nanopore reads using complementary MiSeq data and produce a de novo assembly that is highly contiguous and accurate: The contig N50 length is more than ten times greater than an Illumina-only assembly (678 kbp versus 59.9 kbp) and has >99.88% consensus identity when compared to the reference. Furthermore, the assembly with the long nanopore reads presents a much more complete representation of the features of the genome and correctly assembles gene cassettes, rRNAs, transposable elements, and other genomic features that were almost entirely absent in the Illumina-only assembly. PMID:26447147

  18. Whole genome sequencing data and de novo draft assemblies for 66 teleost species

    PubMed Central

    Malmstrøm, Martin; Matschiner, Michael; Tørresen, Ole K.; Jakobsen, Kjetill S.; Jentoft, Sissel

    2017-01-01

    Teleost fishes comprise more than half of all vertebrate species, yet genomic data are only available for 0.2% of their diversity. Here, we present whole genome sequencing data for 66 new species of teleosts, vastly expanding the availability of genomic data for this important vertebrate group. We report on de novo assemblies based on low-coverage (9–39×) sequencing and present detailed methodology for all analyses. To facilitate further utilization of this data set, we present statistical analyses of the gene space completeness and verify the expected phylogenetic position of the sequenced genomes in a large mitogenomic context. We further present a nuclear marker set used for phylogenetic inference and evaluate each gene tree in relation to the species tree to test for homogeneity in the phylogenetic signal. Collectively, these analyses illustrate the robustness of this highly diverse data set and enable extensive reuse of the selected phylogenetic markers and the genomic data in general. This data set covers all major teleost lineages and provides unprecedented opportunities for comparative studies of teleosts. PMID:28094797

  19. The first Illumina-based de novo transcriptome sequencing and analysis of safflower flowers.

    PubMed

    Lulin, Huang; Xiao, Yang; Pei, Sun; Wen, Tong; Shangqin, Hu

    2012-01-01

    The safflower, Carthamus tinctorius L., is a worldwide oil crop, and its flowers, which have a high flavonoid content, are an important medicinal resource against cardiovascular disease in traditional medicine. Because the safflower has a large and complex genome, the development of its genomic resources has been delayed. Second-generation Illumina sequencing is now an efficient route for generating an enormous volume of sequences that can represent a large number of genes and their expression levels. To investigate the genes and pathways that might control flavonoids and other secondary metabolites in the safflower, we used Illumina sequencing to perform a de novo assembly of the safflower tubular flower tissue transcriptome. We obtained a total of 4.69 Gb in clean nucleotides comprising 52,119,104 clean sequencing reads, 195,320 contigs, and 120,778 unigenes. Based on similarity searches with known proteins, we annotated 70,342 of the unigenes (about 58% of the identified unigenes) with cut-off E-values of 10^-5. In total, 21,943 of the safflower unigenes were found to have COG classifications, and BLAST2GO assigned 26,332 of the unigenes to 1,754 GO term annotations. In addition, we assigned 30,203 of the unigenes to 121 KEGG pathways. When we focused on genes identified as contributing to flavonoid biosynthesis and the biosynthesis of unsaturated fatty acids, which are important pathways that control flower and seed quality, respectively, we found that these genes were fairly well conserved in the safflower genome compared to those of other plants. Our study provides abundant genomic data for Carthamus tinctorius L. and offers comprehensive sequence resources for studying the safflower. We believe that these transcriptome datasets will serve as an important public information platform to accelerate studies of the safflower genome, and may help us define the mechanisms of flower tissue-specific and secondary metabolism in this non-model plant.
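
    Annotation counts like the one above come from keeping only similarity hits at or below an E-value cut-off. The sketch below shows one way such a filter could look, assuming BLAST tabular output (the 12-column "outfmt 6" layout, where the E-value is the 11th column). The function name and the sample hit lines are made up for illustration, and the abstract does not say which output format the authors used.

```python
def annotated_queries(blast_tab_lines, max_evalue=1e-5):
    """Return the set of query IDs with at least one hit at or below max_evalue.

    Assumes tab-separated BLAST tabular lines (qseqid, sseqid, pident, length,
    mismatch, gapopen, qstart, qend, sstart, send, evalue, bitscore).
    """
    annotated = set()
    for line in blast_tab_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment headers
        fields = line.split("\t")
        query_id, evalue = fields[0], float(fields[10])
        if evalue <= max_evalue:
            annotated.add(query_id)
    return annotated


if __name__ == "__main__":
    # Hypothetical hits for three unigenes; only two pass the 1e-5 cut-off.
    hits = [
        "unigene1\tP12345\t85.0\t200\t30\t0\t1\t200\t5\t204\t1e-50\t180",
        "unigene2\tQ67890\t40.0\t60\t36\t1\t1\t60\t10\t70\t0.002\t35",
        "unigene3\tO11111\t60.0\t120\t48\t0\t1\t120\t3\t122\t3e-06\t75",
    ]
    kept = annotated_queries(hits)
    print(sorted(kept), f"{100 * len(kept) / 3:.1f}% annotated")
```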

  20. De Novo whole genome sequence of Xylella fastidiosa subsp. multiplex strain BB01 from blueberry in Georgia, USA

    USDA-ARS's Scientific Manuscript database

    This study reports a de novo assembled draft genome sequence of Xylella fastidiosa subsp. multiplex strain BB01 causing blueberry bacterial leaf scorch in Georgia, USA. The BB01 genome is 2,517,579 bp with a G+C content of 51.8% and 2,943 open reading frames (ORFs) and 48 RNA genes....
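
    The G+C content quoted above is simply the percentage of G and C bases among the called bases of the assembly. A minimal sketch of that calculation follows; the function name and the toy sequence are assumptions, and ambiguous bases such as N are ignored here by choice.

```python
def gc_content(sequence: str) -> float:
    """Percentage of G and C among unambiguous bases (A, C, G, T)."""
    called = [base for base in sequence.upper() if base in "ACGT"]
    if not called:
        raise ValueError("sequence contains no unambiguous bases")
    gc = sum(1 for base in called if base in "GC")
    return 100.0 * gc / len(called)


if __name__ == "__main__":
    print(gc_content("GGCCAATT"))  # toy sequence, not the BB01 genome
```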

  1. Optimizing Information in Next-Generation-Sequencing (NGS) Reads for Improving De Novo Genome Assembly

    PubMed Central

    Liu, Tsunglin; Tsai, Cheng-Hung; Lee, Wen-Bin; Chiang, Jung-Hsien

    2013-01-01

    Next-Generation-Sequencing is advantageous because of its much higher data throughput and much lower cost compared with the traditional Sanger method. However, NGS reads are shorter than Sanger reads, making de novo genome assembly very challenging. Because genome assembly is essential for all downstream biological studies, great efforts have been made to enhance the completeness of genome assembly, which requires the presence of long reads or long distance information. To improve de novo genome assembly, we develop a computational program, ARF-PE, to increase the length of Illumina reads. ARF-PE takes as input Illumina paired-end (PE) reads and recovers the original DNA fragments from whose two ends the paired reads were obtained. On the PE data of four bacteria, ARF-PE recovered >87% of the DNA fragments and achieved >98% perfect DNA fragment recovery. Using Velvet, SOAPdenovo, Newbler, and CABOG, we evaluated the benefits of recovered DNA fragments to genome assembly. For all four bacteria, the recovered DNA fragments increased the assembly contiguity. For example, the N50 lengths of the P. brasiliensis contigs assembled by SOAPdenovo and Newbler increased from 80,524 bp to 166,573 bp and from 80,655 bp to 193,388 bp, respectively. ARF-PE also increased assembly accuracy in many cases. On the PE data of two fungi and a human chromosome, ARF-PE doubled and tripled the N50 length. The assembly accuracies dropped but still remained >91%. In general, ARF-PE can increase both assembly contiguity and accuracy for bacterial genomes. For complex eukaryotic genomes, ARF-PE is promising because it raises assembly contiguity, but future error correction is needed for ARF-PE to also increase the assembly accuracy. ARF-PE is freely available at http://140.116.235.124/~tliu/arf-pe/. PMID:23922726
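
    The abstract does not describe how ARF-PE recovers fragments, so the sketch below is not its algorithm. It only illustrates the simplest case of the general idea, in which the two reads of a pair overlap: the second read is reverse-complemented and joined to the first at the longest exact overlap. The function names, the exact-match rule and the minimum overlap are all assumptions for illustration.

```python
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq: str) -> str:
    """Reverse complement of a DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def merge_pair(read1: str, read2: str, min_overlap: int = 10):
    """Join an overlapping read pair into one fragment, or return None.

    read2 is assumed to come from the opposite strand at the far end of the
    fragment, as in standard paired-end libraries. Only exact overlaps are
    accepted; real tools tolerate mismatches and use base qualities.
    """
    mate = reverse_complement(read2)
    # Try the longest possible overlap first so the most specific join wins.
    for overlap in range(min(len(read1), len(mate)), min_overlap - 1, -1):
        if read1[-overlap:] == mate[:overlap]:
            return read1 + mate[overlap:]
    return None


if __name__ == "__main__":
    fragment = "ACGTTGCAAGGCTTAACCGGATCGATTACG"   # made-up 30 bp fragment
    r1 = fragment[:20]                            # forward read from the left end
    r2 = reverse_complement(fragment[-20:])       # reverse read from the right end
    print(merge_pair(r1, r2) == fragment)
```

    Pairs whose reads do not overlap cannot be handled this way and would need extra information, such as other reads bridging the gap.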

  2. Bromine isotopic signature facilitates de novo sequencing of peptides in free-radical-initiated peptide sequencing (FRIPS) mass spectrometry.

    PubMed

    Nam, Jungjoo; Kwon, Hyuksu; Jang, Inae; Jeon, Aeran; Moon, Jingyu; Lee, Sun Young; Kang, Dukjin; Han, Sang Yun; Moon, Bongjin; Oh, Han Bin

    2015-02-01

    We recently showed that free-radical-initiated peptide sequencing mass spectrometry (FRIPS MS) assisted by the remarkable thermochemical stability of (2,2,6,6-tetramethyl-piperidin-1-yl)oxyl (TEMPO) is another attractive radical-driven peptide fragmentation MS tool. Facile homolytic cleavage of the bond between the benzylic carbon and the oxygen of the TEMPO moiety in o-TEMPO-Bz-C(O)-peptide and the high reactivity of the benzylic radical species generated in •Bz-C(O)-peptide are key elements leading to extensive radical-driven peptide backbone fragmentation. In the present study, we demonstrate that the incorporation of bromine into the benzene ring, i.e. o-TEMPO-Bz(Br)-C(O)-peptide, allows unambiguous distinction of the N-terminal peptide fragments from the C-terminal fragments through the unique bromine doublet isotopic signature. Furthermore, bromine substitution does not alter the overall radical-driven peptide backbone dissociation pathways of o-TEMPO-Bz-C(O)-peptide. From a practical perspective, the presence of the bromine isotopic signature in the N-terminal peptide fragments in TEMPO-assisted FRIPS MS represents a useful and cost-effective opportunity for de novo peptide sequencing.
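
    The bromine doublet arises because bromine has two stable isotopes, 79Br and 81Br, of nearly equal abundance and about 1.998 Da apart, so any fragment that retains the brominated tag appears as two peaks of similar height. A minimal sketch of how such pairs might be flagged in a centroided peak list follows. The function name, the tolerances, the intensity-ratio threshold and the toy peaks are assumptions and do not come from the study.

```python
BR_ISOTOPE_SPACING = 1.99795  # Da, mass difference between 81Br and 79Br


def find_bromine_doublets(peaks, charge=1, mz_tol=0.01, min_ratio=0.7):
    """Return (light_mz, heavy_mz) pairs that look like a Br isotopic doublet.

    peaks: iterable of (m/z, intensity). A pair qualifies when the spacing
    matches the Br isotope difference for the given charge and the two
    intensities are similar (weaker/stronger >= min_ratio).
    """
    ordered = sorted(peaks)
    spacing = BR_ISOTOPE_SPACING / charge
    doublets = []
    for i, (mz_light, int_light) in enumerate(ordered):
        for mz_heavy, int_heavy in ordered[i + 1:]:
            delta = mz_heavy - mz_light
            if delta > spacing + mz_tol:
                break  # peaks are sorted, so later ones are even farther away
            strongest = max(int_light, int_heavy)
            if abs(delta - spacing) <= mz_tol and strongest > 0:
                if min(int_light, int_heavy) / strongest >= min_ratio:
                    doublets.append((mz_light, mz_heavy))
    return doublets


if __name__ == "__main__":
    toy_spectrum = [(300.100, 100.0), (302.098, 96.0),   # doublet-like pair
                    (450.200, 80.0), (451.200, 40.0),    # ordinary 13C-like spacing
                    (500.000, 50.0)]
    print(find_bromine_doublets(toy_spectrum))
```

    In this scheme, fragments flagged as doublets would be candidates for the tagged (N-terminal) series, while singlets would be candidates for the C-terminal series.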

  3. Overlapping Genes Produce Proteins with Unusual Sequence Properties and Offer Insight into De Novo Protein Creation

    PubMed Central

    Rancurel, Corinne; Khosravi, Mahvash; Dunker, A. Keith; Romero, Pedro R.; Karlin, David

    2009-01-01

    It is widely assumed that new proteins are created by duplication, fusion, or fission of existing coding sequences. Another mechanism of protein birth is provided by overlapping genes. They are created de novo by mutations within a coding sequence that lead to the expression of a novel protein in another reading frame, a process called “overprinting.” To investigate this mechanism, we have analyzed the sequences of the protein products of manually curated overlapping genes from 43 genera of unspliced RNA viruses infecting eukaryotes. Overlapping proteins have a sequence composition globally biased toward disorder-promoting amino acids and are predicted to contain significantly more structural disorder than nonoverlapping proteins. By analyzing the phylogenetic distribution of overlapping proteins, we were able to confirm that 17 of these had been created de novo and to study them individually. Most proteins created de novo are orphans (i.e., restricted to one species or genus). Almost all are accessory proteins that play a role in viral pathogenicity or spread, rather than proteins central to viral replication or structure. Most proteins created de novo are predicted to be fully disordered and have a highly unusual sequence composition. This suggests that some viral overlapping reading frames encoding hypothetical proteins with highly biased composition, often discarded as noncoding, might in fact encode proteins. Some proteins created de novo are predicted to be ordered, however, and whenever a three-dimensional structure of such a protein has been solved, it corresponds to a fold previously unobserved, suggesting that the study of these proteins could enhance our knowledge of protein space. PMID:19640978

  4. Sequencing and De Novo Assembly of the Transcriptome of the Glassy-Winged Sharpshooter (Homalodisca vitripennis)

    PubMed Central

    Nandety, Raja Sekhar; Kamita, Shizuo G.; Hammock, Bruce D.; Falk, Bryce W.

    2013-01-01

    Background: The glassy-winged sharpshooter Homalodisca vitripennis (Hemiptera: Cicadellidae) is a xylem-feeding leafhopper and important vector of the bacterium Xylella fastidiosa, the causal agent of Pierce’s disease of grapevines. The functional complexity of the transcriptome of H. vitripennis has not been elucidated thus far. It is a necessary blueprint for an understanding of the development of H. vitripennis and for designing efficient biorational control strategies including those based on RNA interference. Results: Here we elucidate and explore the transcriptome of adult H. vitripennis using high-throughput paired-end deep sequencing and de novo assembly. A total of 32,803,656 paired-end reads were obtained with an average transcript length of 624 nucleotides. We assembled 32.9 Mb of the transcriptome of H. vitripennis that spanned across 47,265 loci and 52,708 transcripts. Comparison of our non-redundant database showed that 45% of the deduced proteins of H. vitripennis exhibit identity (e-value ≤ 1e−5) with known proteins. We assigned Gene Ontology (GO) terms, Kyoto Encyclopedia of Genes and Genomes (KEGG) annotations, and potential Pfam domains to each transcript isoform. In order to gain insight into the molecular basis of key regulatory genes of H. vitripennis, we characterized predicted proteins involved in the metabolism of juvenile hormone, and biogenesis of small RNAs (Dicer and Piwi sequences) from the transcriptomic sequences. Analysis of transposable element sequences of H. vitripennis indicated that the genome is less expanded in comparison to many other insects with approximately 1% of the transcriptome carrying transposable elements. Conclusions: Our data significantly enhance the molecular resources available for future study and control of this economically important hemipteran. This transcriptional information not only provides a more nuanced understanding of the underlying biological and physiological mechanisms that govern H

  5. De Novo Assembly and Transcriptome Characterization of Canine Retina Using High-Throughput Sequencing

    PubMed Central

    Reddy, Bhaskar; Patel, Amrutlal K.; Singh, Krishna M.; Patil, Deepak B.; Parikh, Pinesh V.; Kelawala, Divyesh N.; Koringa, Prakash G.; Bhatt, Vaibhav D.; Rao, Mandava V.; Joshi, Chaitanya G.

    2015-01-01

    We performed transcriptome sequencing of canine retinal tissue by 454 GS-FLX and Ion Torrent PGM platforms. RNA-Seq analysis by CLC Genomics Workbench mapped expression of 10,360 genes. Gene ontology analysis of retinal transcriptome revealed abundance of transcripts known to be involved in vision associated processes. The de novo assembly of the sequences using CAP3 generated 29,683 contigs with mean length of 560.9 and N50 of 619 bases. Further analysis of contigs predicted 3,827 full-length cDNAs and 29,481 (99%) open reading frames (ORFs). In addition, 3,782 contigs were assigned to 316 KEGG pathways which included melanogenesis, phototransduction, and retinol metabolism with 33, 15, and 11 contigs, respectively. Among the identified microsatellites, dinucleotide repeats were 68.84%, followed by trinucleotides, tetranucleotides, pentanucleotides, and hexanucleotides in proportions of 25.76, 9.40, 2.52, and 0.96%, respectively. This study will serve as a valuable resource for understanding the biology and function of canine retina. PMID:26788372
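
    Microsatellites (simple sequence repeats) such as those tallied above are typically found by scanning for short motifs repeated in tandem. The sketch below shows one regex-based way to do that for motifs of 2 to 6 bases. The minimum repeat counts, the function names and the toy sequence are assumptions, since the abstract does not state which tool or thresholds were used.

```python
import re

# Hypothetical thresholds: motif length -> minimum number of tandem copies.
DEFAULT_MIN_REPEATS = {2: 6, 3: 5, 4: 4, 5: 4, 6: 4}


def is_primitive(motif: str) -> bool:
    """True if the motif is not itself a repetition of a shorter unit (e.g. not 'AGAG')."""
    n = len(motif)
    return not any(n % k == 0 and motif == motif[:k] * (n // k) for k in range(1, n))


def find_ssrs(sequence: str, min_repeats=None):
    """Return sorted (start, motif, copies) tuples for tandem repeats in sequence."""
    thresholds = DEFAULT_MIN_REPEATS if min_repeats is None else min_repeats
    seq = sequence.upper()
    hits = []
    for unit_len, min_copies in thresholds.items():
        # One captured unit followed by at least (min_copies - 1) further copies.
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (unit_len, min_copies - 1))
        for match in pattern.finditer(seq):
            motif = match.group(1)
            if is_primitive(motif):
                hits.append((match.start(), motif, len(match.group(0)) // unit_len))
    return sorted(hits)


if __name__ == "__main__":
    toy = "TTGC" + "AG" * 7 + "CCTA" + "ATC" * 5 + "GGA"  # made-up sequence
    for start, motif, copies in find_ssrs(toy):
        print(start, motif, copies)
```

    Counting the hits by motif length would give the kind of dinucleotide, trinucleotide and longer-motif breakdown reported in the record.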

  6. Sequencing and de novo assembly of the red cusk-eel (Genypterus chilensis) transcriptome.

    PubMed

    Aedo, J E; Maldonado, J; Estrada, J M; Fuentes, E N; Silva, H; Gallardo-Escarate, C; Molina, A; Valdés, J A

    2014-12-01

    The red cusk-eel (Genypterus chilensis) is an endemic fish species distributed along the coasts of the Eastern South Pacific. Biological studies on this fish are scarce, and genomic information for G. chilensis is practically non-existent. Thus, transcriptome information for this species is an essential resource that will greatly enrich molecular information and benefit future studies of red cusk-eel biology. In this work, we obtained transcriptome information of G. chilensis using the Illumina platform. The RNA sequencing generated 66,307,362 and 59,925,554 paired-end reads from skeletal muscle and liver tissues, respectively. De novo assembly using the CLC Genomics Workbench version 7.0.3 produced 48,480 contigs and created a reference transcriptome with an N50 of 846 bp and average read coverage of 28.3×. By sequence similarity search for known proteins, a total of 21,272 (43.9%) contigs were annotated for their function. Of the GO annotation results for these annotated contigs, 33.5% were for biological processes, 32.6% for cellular components and 34.5% for molecular functions. This dataset represents the first transcriptomic resource for the red cusk-eel and for a member of the Ophidiimorpharia taxon. Copyright © 2014 Elsevier B.V. All rights reserved.

  7. Exome Sequencing Identifies a Recurrent De Novo ZSWIM6 Mutation Associated with Acromelic Frontonasal Dysostosis

    PubMed Central

    Smith, Joshua D.; Hing, Anne V.; Clarke, Christine M.; Johnson, Nathan M.; Perez, Francisco A.; Park, Sarah S.; Horst, Jeremy A.; Mecham, Brig; Maves, Lisa; Nickerson, Deborah A.; Cunningham, Michael L.

    2014-01-01

    Acromelic frontonasal dysostosis (AFND) is a rare disorder characterized by distinct craniofacial, brain, and limb malformations, including frontonasal dysplasia, interhemispheric lipoma, agenesis of the corpus callosum, tibial hemimelia, preaxial polydactyly of the feet, and intellectual disability. Exome sequencing of one trio and two unrelated probands revealed the same heterozygous variant (c.3487C>T [p.Arg1163Trp]) in a highly conserved protein domain of ZSWIM6; this variant has not been seen in the 1000 Genomes data, dbSNP, or the Exome Sequencing Project. Sanger validation of the three trios confirmed that the variant was de novo and was also present in a fourth isolated proband. In situ hybridization of early zebrafish embryos at 24 hr postfertilization (hpf) demonstrated telencephalic expression of zswim6 and onset of midbrain, hindbrain, and retinal expression at 48 hpf. Immunohistochemistry of later-stage mouse embryos demonstrated tissue-specific expression in the derivatives of all three germ layers. qRT-PCR expression analysis of osteoblast and fibroblast cell lines available from two probands was suggestive of Hedgehog pathway activation, indicating that the ZSWIM6 mutation associated with AFND may lead to the craniofacial, brain and limb malformations through the disruption of Hedgehog signaling. PMID:25105228

  8. Exome sequencing identifies a recurrent de novo ZSWIM6 mutation associated with acromelic frontonasal dysostosis.

    PubMed

    Smith, Joshua D; Hing, Anne V; Clarke, Christine M; Johnson, Nathan M; Perez, Francisco A; Park, Sarah S; Horst, Jeremy A; Mecham, Brig; Maves, Lisa; Nickerson, Deborah A; Cunningham, Michael L

    2014-08-07

    Acromelic frontonasal dysostosis (AFND) is a rare disorder characterized by distinct craniofacial, brain, and limb malformations, including frontonasal dysplasia, interhemispheric lipoma, agenesis of the corpus callosum, tibial hemimelia, preaxial polydactyly of the feet, and intellectual disability. Exome sequencing of one trio and two unrelated probands revealed the same heterozygous variant (c.3487C>T [p.Arg1163Trp]) in a highly conserved protein domain of ZSWIM6; this variant has not been seen in the 1000 Genomes data, dbSNP, or the Exome Sequencing Project. Sanger validation of the three trios confirmed that the variant was de novo and was also present in a fourth isolated proband. In situ hybridization of early zebrafish embryos at 24 hr postfertilization (hpf) demonstrated telencephalic expression of zswim6 and onset of midbrain, hindbrain, and retinal expression at 48 hpf. Immunohistochemistry of later-stage mouse embryos demonstrated tissue-specific expression in the derivatives of all three germ layers. qRT-PCR expression analysis of osteoblast and fibroblast cell lines available from two probands was suggestive of Hedgehog pathway activation, indicating that the ZSWIM6 mutation associated with AFND may lead to the craniofacial, brain and limb malformations through the disruption of Hedgehog signaling. Copyright © 2014 The American Society of Human Genetics. Published by Elsevier Inc. All rights reserved.

  9. A case study of de novo sequence analysis of N-sulfonated peptides by MALDI TOF/TOF mass spectrometry.

    PubMed

    Samyn, Bart; Debyser, Griet; Sergeant, Kjell; Devreese, Bart; Van Beeumen, Jozef

    2004-12-01

    The simplicity and sensitivity of matrix-assisted laser desorption/ionization time-of-flight mass spectrometry have increased its application in recent years. The most common method of "peptide mass fingerprint" analysis often does not provide robust identification. Additional sequence information, obtained by post-source decay or collision induced dissociation, provides additional constraints for database searches. However, de novo sequencing by mass spectrometry is not yet common practice, most likely because of the difficulties associated with the interpretation of high and low energy CID spectra. Success with this type of sequencing requires full sequence coverage and demands better quality spectra than those typically used for database searching. In this report we show that full-length de novo sequencing is possible using MALDI TOF/TOF analysis. The interpretation of MS/MS data is facilitated by N-terminal sulfonation after protection of lysine side chains (Keough et al., Proc. Natl. Acad. Sci. U.S.A. 1999, 96, 7131-7136). Reliable de novo sequence analysis has been obtained using sub-picomole quantities of peptides, and peptide sequences of up to 16 amino acid residues in length have been determined. The simple, predictable fragmentation pattern allows routine de novo interpretation, either manually or using software. Characterization of the complete primary structure of a peptide is often hindered because of differences in fragmentation efficiencies and in specific fragmentation patterns for different peptides. These differences are controlled by various structural parameters including the nature of the residues present. The influence of the presence of internal Pro, acidic and basic residues on the TOF/TOF fragmentation pattern will be discussed, both for underivatized and guanidinated/sulfonated peptides.
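
    When a spectrum is dominated by a single, predictable ion series, de novo interpretation reduces to reading the mass differences between consecutive fragment ions and matching each one to an amino acid residue. The sketch below illustrates that step for singly charged ions, using standard monoisotopic residue masses. The function name, the tolerance and the toy ladder are assumptions, and derivatized residues (for example guanidinated lysine) would need their own table entries.

```python
# Monoisotopic residue masses in Da; Leu and Ile are isobaric and share one entry.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276, "V": 99.06841,
    "T": 101.04768, "C": 103.00919, "L/I": 113.08406, "N": 114.04293,
    "D": 115.02694, "Q": 128.05858, "K": 128.09496, "E": 129.04259,
    "M": 131.04049, "H": 137.05891, "F": 147.06841, "R": 156.10111,
    "Y": 163.06333, "W": 186.07931,
}


def read_ladder(ion_mz, tol=0.02):
    """Translate consecutive m/z differences of one singly charged ion series into residues.

    Returns one residue label per gap, or "?" when no residue mass lies within tol.
    """
    ordered = sorted(ion_mz)
    residues = []
    for lighter, heavier in zip(ordered, ordered[1:]):
        delta = heavier - lighter
        name, mass = min(RESIDUE_MASS.items(), key=lambda item: abs(item[1] - delta))
        residues.append(name if abs(mass - delta) <= tol else "?")
    return residues


if __name__ == "__main__":
    # Toy ladder built by adding residue masses to an arbitrary starting ion.
    start = 147.11280
    ladder = [start]
    for residue in ("G", "A", "F"):
        ladder.append(ladder[-1] + RESIDUE_MASS[residue])
    print(read_ladder(ladder))
```

    For a y-type series the residues come out in C-terminal to N-terminal order, so the list would be reversed to write the peptide conventionally. Gaps reported as "?" usually indicate a missing ion or a modified residue.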

  10. Sequencing and De Novo Assembly of the Western Tarnished Plant Bug (Lygus hesperus) Transcriptome

    PubMed Central

    Hull, J. Joe; Geib, Scott M.; Fabrick, Jeffrey A.; Brent, Colin S.

    2013-01-01

    Background: Mirid plant bugs (Hemiptera: Miridae) are economically important insect pests of many crops worldwide. The western tarnished plant bug Lygus hesperus Knight is a pest of cotton, alfalfa, fruit and vegetable crops, and potentially of several emerging biofuel and natural product feedstocks in the western US. However, little is known about the underlying molecular genetics, biochemistry, or physiology of L. hesperus, including their ability to survive extreme environmental conditions. Methodology/Principal Findings: We used 454 pyrosequencing of a normalized adult cDNA library and de novo assembly to obtain an adult L. hesperus transcriptome consisting of 1,429,818 transcriptomic reads representing 36,131 transcript isoforms (isotigs) that correspond to 19,742 genes. A search of the transcriptome against deposited L. hesperus protein sequences revealed that 86 out of 87 were represented. Comparison with the non-redundant database indicated that 54% of the transcriptome exhibited similarity (e-value ≤ 1e−5) with known proteins. In addition, Gene Ontology (GO) terms, Kyoto Encyclopedia of Genes and Genomes (KEGG) annotations, and potential Pfam domains were assigned to each transcript isoform. To gain insight into the molecular basis of the L. hesperus thermal stress response we used transcriptomic sequences to identify 52 potential heat shock protein (Hsp) homologs. A subset of these transcripts was sequence verified and their expression response to thermal stress monitored by semi-quantitative PCR. Potential homologs of Hsp70, Hsp40, and 2 small Hsps were found to be upregulated in the heat-challenged adults, suggesting a role in thermotolerance. Conclusions/Significance: The L. hesperus transcriptome advances the underlying molecular understanding of this arthropod pest by significantly increasing the number of known genes, and provides the basis for further exploration and understanding of the fundamental mechanisms of abiotic stress responses. PMID

  11. De Novo Assembly, Characterization and Functional Annotation of Pineapple Fruit Transcriptome through Massively Parallel Sequencing

    PubMed Central

    Ong, Wen Dee; Voo, Lok-Yung Christopher; Kumar, Vijay Subbiah

    2012-01-01

    Background: Pineapple (Ananas comosus var. comosus) is an important tropical non-climacteric fruit with high commercial potential. Understanding the mechanism and processes underlying fruit ripening would enable scientists to improve quality traits such as flavor, texture, appearance and fruit sweetness. Although the pineapple is an important fruit, there is insufficient transcriptomic or genomic information available in public databases. Application of high throughput transcriptome sequencing to profile the pineapple fruit transcripts is therefore needed. Methodology/Principal Findings: To facilitate this, we have performed transcriptome sequencing of ripe yellow pineapple fruit flesh using Illumina technology. About 4.7 million Illumina paired-end reads were generated and assembled using the Velvet de novo assembler. The assembly produced 28,728 unique transcripts with a mean length of approximately 200 bp. Sequence similarity search against the non-redundant NCBI database identified a total of 16,932 unique transcripts (58.93%) with significant hits. Out of these, 15,507 unique transcripts were assigned to gene ontology terms. Functional annotation against the Kyoto Encyclopedia of Genes and Genomes pathway database identified 13,598 unique transcripts (47.33%) which were mapped to 126 pathways. The assembly revealed many transcripts that were previously unknown. Conclusions: The unique transcripts derived from this work have rapidly increased the number of pineapple fruit mRNA transcripts now available in public databases. This information can be further utilized in gene expression, genomics and other functional genomics studies in pineapple. PMID:23091603

  12. Rationale-Based, De Novo Design of Dehydrophenylalanine-Containing Antibiotic Peptides and Systematic Modification in Sequence for Enhanced Potency▿

    PubMed Central

    Pathak, Sarika; Chauhan, Virander Singh

    2011-01-01

    Increased microbial drug resistance has generated a global requirement for new anti-infective agents. As part of an effort to develop new, low-molecular-mass peptide antibiotics, we used a rationale-based minimalist approach to design short, nonhemolytic, potent, and broad-spectrum antibiotic peptides with increased serum stability. These peptides were designed to attain an amphipathic structure in helical conformations. VS1 was used as the lead compound, and its properties were compared with three series of derivates obtained by (i) N-terminal amino acid addition, (ii) systematic Trp substitution, and (iii) peptide dendrimerization. The Trp substitution approach underlined the optimized sequence of VS2 in terms of potency, faster membrane permeation, and cost-effectiveness. VS2 (a variant of VS1 with two Trp substitutions) was found to exhibit good antimicrobial activity against both the Gram-negative Escherichia coli and the Gram-positive bacterium Staphylococcus aureus. It was also found to have noncytolytic activity and the ability to permeate and depolarize the bacterial membrane. Lysis of the bacterial cell wall and inner membrane by the peptide was confirmed by transmission electron microscopy. A combination of small size, the presence of unnatural amino acids, high antimicrobial activity, insignificant hemolysis, and proteolytic resistance provides fundamental information for the de novo design of an antimicrobial peptide useful for the management of infectious disease. PMID:21321136

  13. Identification of lignin genes and regulatory sequences involved in secondary cell wall formation in Acacia auriculiformis and Acacia mangium via de novo transcriptome sequencing

    PubMed Central

    2011-01-01

    Background Acacia auriculiformis × Acacia mangium hybrids are commercially important trees for the timber and pulp industry in Southeast Asia. Increasing pulp yield while reducing pulping costs are major objectives of tree breeding programs. The general monolignol biosynthesis and secondary cell wall formation pathways are well-characterized but genes in these pathways are poorly characterized in Acacia hybrids. RNA-seq on short-read platforms is a rapid approach for obtaining comprehensive transcriptomic data and to discover informative sequence variants. Results We sequenced transcriptomes of A. auriculiformis and A. mangium from non-normalized cDNA libraries synthesized from pooled young stem and inner bark tissues using paired-end libraries and a single lane of an Illumina GAII machine. De novo assembly produced a total of 42,217 and 35,759 contigs with an average length of 496 bp and 498 bp for A. auriculiformis and A. mangium respectively. The assemblies of A. auriculiformis and A. mangium had a total length of 21,022,649 bp and 17,838,260 bp, respectively, with the largest contig 15,262 bp long. We detected all ten monolignol biosynthetic genes using Blastx and further analysis revealed 18 lignin isoforms for each species. We also identified five contigs homologous to R2R3-MYB proteins in other plant species that are involved in transcriptional regulation of secondary cell wall formation and lignin deposition. We searched the contigs against public microRNA database and predicted the stem-loop structures of six highly conserved microRNA families (miR319, miR396, miR160, miR172, miR162 and miR168) and one legume-specific family (miR2086). Three microRNA target genes were predicted to be involved in wood formation and flavonoid biosynthesis. By using the assemblies as a reference, we discovered 16,648 and 9,335 high quality putative Single Nucleotide Polymorphisms (SNPs) in the transcriptomes of A. auriculiformis and A. mangium, respectively, thus yielding

  14. Identification of lignin genes and regulatory sequences involved in secondary cell wall formation in Acacia auriculiformis and Acacia mangium via de novo transcriptome sequencing.

    PubMed

    Wong, Melissa M L; Cannon, Charles H; Wickneswari, Ratnam

    2011-07-05

    Acacia auriculiformis × Acacia mangium hybrids are commercially important trees for the timber and pulp industry in Southeast Asia. Increasing pulp yield while reducing pulping costs are major objectives of tree breeding programs. The general monolignol biosynthesis and secondary cell wall formation pathways are well-characterized but genes in these pathways are poorly characterized in Acacia hybrids. RNA-seq on short-read platforms is a rapid approach for obtaining comprehensive transcriptomic data and to discover informative sequence variants. We sequenced transcriptomes of A. auriculiformis and A. mangium from non-normalized cDNA libraries synthesized from pooled young stem and inner bark tissues using paired-end libraries and a single lane of an Illumina GAII machine. De novo assembly produced a total of 42,217 and 35,759 contigs with an average length of 496 bp and 498 bp for A. auriculiformis and A. mangium respectively. The assemblies of A. auriculiformis and A. mangium had a total length of 21,022,649 bp and 17,838,260 bp, respectively, with the largest contig 15,262 bp long. We detected all ten monolignol biosynthetic genes using Blastx and further analysis revealed 18 lignin isoforms for each species. We also identified five contigs homologous to R2R3-MYB proteins in other plant species that are involved in transcriptional regulation of secondary cell wall formation and lignin deposition. We searched the contigs against public microRNA database and predicted the stem-loop structures of six highly conserved microRNA families (miR319, miR396, miR160, miR172, miR162 and miR168) and one legume-specific family (miR2086). Three microRNA target genes were predicted to be involved in wood formation and flavonoid biosynthesis. By using the assemblies as a reference, we discovered 16,648 and 9,335 high quality putative Single Nucleotide Polymorphisms (SNPs) in the transcriptomes of A. auriculiformis and A. mangium, respectively, thus yielding useful markers for

  15. The First Illumina-Based De Novo Transcriptome Sequencing and Analysis of Safflower Flowers

    PubMed Central

    Lulin, Huang; Xiao, Yang; Pei, Sun; Wen, Tong; Shangqin, Hu

    2012-01-01

    Background The safflower, Carthamus tinctorius L., is a worldwide oil crop, and its flowers, which have a high flavonoid content, are an important medicinal resource against cardiovascular disease in traditional medicine. Because the safflower has a large and complex genome, the development of its genomic resources has been delayed. Second-generation Illumina sequencing is now an efficient route for generating an enormous volume of sequences that can represent a large number of genes and their expression levels. Methodology/Principal Findings To investigate the genes and pathways that might control flavonoids and other secondary metabolites in the safflower, we used Illumina sequencing to perform a de novo assembly of the safflower tubular flower tissue transcriptome. We obtained a total of 4.69 Gb in clean nucleotides comprising 52,119,104 clean sequencing reads, 195,320 contigs, and 120,778 unigenes. Based on similarity searches with known proteins, we annotated 70,342 of the unigenes (about 58% of the identified unigenes) with cut-off E-values of 10⁻⁵. In total, 21,943 of the safflower unigenes were found to have COG classifications, and BLAST2GO assigned 26,332 of the unigenes to 1,754 GO term annotations. In addition, we assigned 30,203 of the unigenes to 121 KEGG pathways. When we focused on genes identified as contributing to flavonoid biosynthesis and the biosynthesis of unsaturated fatty acids, which are important pathways that control flower and seed quality, respectively, we found that these genes were fairly well conserved in the safflower genome compared to those of other plants. Conclusions/Significance Our study provides abundant genomic data for Carthamus tinctorius L. and offers comprehensive sequence resources for studying the safflower. We believe that these transcriptome datasets will serve as an important public information platform to accelerate studies of the safflower genome, and may help us define the mechanisms of flower tissue

  16. De novo sequencing and characterization of Picrorhiza kurrooa transcriptome at two temperatures showed major transcriptome adjustments

    PubMed Central

    2012-01-01

    Background Picrorhiza kurrooa Royle ex Benth. is an endangered plant species of medicinal importance. The medicinal property is attributed to the monoterpenoids picroside I and II, which are modulated by temperature. Transcriptome information for this species is limited to a few hundred expressed sequence tags (ESTs) in the public databases. In order to gain insight into temperature-mediated molecular changes, high-throughput de novo transcriptome sequencing and analyses were carried out at 15°C and 25°C, the temperatures known to modulate picroside content. Results Using paired-end (PE) Illumina sequencing technology, a total of 20,593,412 and 44,229,272 PE reads were obtained after quality filtering for 15°C and 25°C, respectively. Available (e.g., De-Bruijn/Eulerian graph) and in-house developed bioinformatics tools were used for assembly and annotation of the transcriptome. A total of 74,336 assembled transcript sequences were obtained, with an average coverage of 76.6 and average length of 439.5. Guanine-cytosine (GC) content was observed to be 44.6%, while the transcriptome exhibited an abundance of trinucleotide simple sequence repeat (SSR; 45.63%) markers. Large-scale expression profiling through "reads per exon kilobase per million (RPKM)" showed changes in several biological processes and metabolic pathways, including cytochrome P450s (CYPs), UDP-glycosyltransferases (UGTs) and those associated with picroside biosynthesis. RPKM data were validated by reverse transcriptase-polymerase chain reaction using a set of 19 genes, wherein 11 genes behaved in accordance with the two expression methods. Conclusions The study generated the transcriptome of P. kurrooa at two different temperatures. Large-scale expression profiling through RPKM showed major transcriptome changes in response to temperature, reflecting alterations in major biological processes and metabolic pathways, and provided insight into GC content and SSR markers. Analysis also identified
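
    For readers unfamiliar with the RPKM measure named in this record, the usual definition can be sketched as follows: reads mapped to a transcript, scaled by the transcript length in kilobases and by the total number of mapped reads in millions. The function names, the pseudocount, and the example numbers are illustrative assumptions, not code or values from the study.

```python
import math


def rpkm(read_count: int, transcript_length_bp: int, total_mapped_reads: int) -> float:
    """Reads per kilobase of transcript per million mapped reads."""
    if transcript_length_bp <= 0 or total_mapped_reads <= 0:
        raise ValueError("length and total mapped reads must be positive")
    # (reads / (length / 1e3)) / (total / 1e6) == reads * 1e9 / (length * total)
    return read_count * 1e9 / (transcript_length_bp * total_mapped_reads)


def log2_fold_change(rpkm_a: float, rpkm_b: float, pseudocount: float = 1.0) -> float:
    """log2 ratio of condition B over condition A.

    The pseudocount (an assumption of this sketch) avoids division by zero
    for transcripts with no reads in one condition.
    """
    return math.log2((rpkm_b + pseudocount) / (rpkm_a + pseudocount))


# Hypothetical transcript: 2,000 bp long, 500 reads out of 20 million mapped.
example = rpkm(500, 2000, 20_000_000)
print(example)
```

    Because RPKM normalizes by both length and library size, values from the two temperature libraries become comparable even though the libraries differ in depth.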

  17. Transcriptome Sequencing and De Novo Assembly of Golden Cuttlefish Sepia esculenta Hoyle

    PubMed Central

    Liu, Changlin; Zhao, Fazhen; Yan, Jingping; Liu, Chunsheng; Liu, Siwei; Chen, Siqing

    2016-01-01

    Golden cuttlefish Sepia esculenta Hoyle is an economically important cephalopod species. However, artificial hatching is currently challenged by low survival rate of larvae due to abnormal embryonic development. Dissecting the genetic foundation and regulatory mechanisms in embryonic development requires genomic background knowledge. Therefore, we carried out a transcriptome sequencing on Sepia embryos and larvae via mRNA-Seq. 32,597,241 raw reads were filtered and assembled into 98,615 unigenes (N50 length at 911 bp) which were annotated in NR database, GO and KEGG databases respectively. Digital gene expression analysis was carried out on cleavage stage embryos, healthy larvae and malformed larvae. Unigenes functioning in cell proliferation exhibited higher transcriptional levels at cleavage stage while those related to animal disease and organ development showed increased transcription in malformed larvae. Homologs of key genes in regulatory pathways related to early development of animals were identified in Sepia. Most of them exhibit higher transcriptional levels in cleavage stage than larvae, suggesting their potential roles in embryonic development of Sepia. The de novo assembly of Sepia transcriptome is fundamental genetic background for further exploration in Sepia research. Our demonstration on the transcriptional variations of genes in three developmental stages will provide new perspectives in understanding the molecular mechanisms in early embryonic development of cuttlefish. PMID:27782082
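
    The N50 statistic quoted for the unigene set is, under its standard definition, the length of the shortest sequence in the smallest set of longest sequences that together cover at least half of the total assembled length. A minimal sketch, with a hypothetical function name and made-up lengths:

```python
from typing import Iterable


def n50(lengths: Iterable[int]) -> int:
    """Return the N50 of a collection of contig or unigene lengths."""
    ordered = sorted(lengths, reverse=True)
    if not ordered:
        raise ValueError("at least one length is required")
    total = sum(ordered)
    running = 0
    for length in ordered:
        running += length
        # Stop at the first length where the cumulative sum reaches half the total.
        if running * 2 >= total:
            return length
    raise AssertionError("unreachable: cumulative sum always reaches the total")


# Made-up lengths, not data from the study.
print(n50([2, 3, 4, 5, 6, 7, 8, 9, 10]))
```

    Unlike the mean length, N50 is weighted toward long sequences, so a few long unigenes raise it more than many short fragments lower it.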

  18. Transcriptome Sequencing and De Novo Assembly of Golden Cuttlefish Sepia esculenta Hoyle.

    PubMed

    Liu, Changlin; Zhao, Fazhen; Yan, Jingping; Liu, Chunsheng; Liu, Siwei; Chen, Siqing

    2016-10-22

    Golden cuttlefish Sepia esculenta Hoyle is an economically important cephalopod species. However, artificial hatching is currently challenged by low survival rate of larvae due to abnormal embryonic development. Dissecting the genetic foundation and regulatory mechanisms in embryonic development requires genomic background knowledge. Therefore, we carried out a transcriptome sequencing on Sepia embryos and larvae via mRNA-Seq. 32,597,241 raw reads were filtered and assembled into 98,615 unigenes (N50 length at 911 bp) which were annotated in NR database, GO and KEGG databases respectively. Digital gene expression analysis was carried out on cleavage stage embryos, healthy larvae and malformed larvae. Unigenes functioning in cell proliferation exhibited higher transcriptional levels at cleavage stage while those related to animal disease and organ development showed increased transcription in malformed larvae. Homologs of key genes in regulatory pathways related to early development of animals were identified in Sepia. Most of them exhibit higher transcriptional levels in cleavage stage than larvae, suggesting their potential roles in embryonic development of Sepia. The de novo assembly of Sepia transcriptome is fundamental genetic background for further exploration in Sepia research. Our demonstration on the transcriptional variations of genes in three developmental stages will provide new perspectives in understanding the molecular mechanisms in early embryonic development of cuttlefish.

  19. MotifHyades: Expectation Maximization for de novo DNA Motif Pair Discovery on Paired Sequences.

    PubMed

    Wong, Ka-Chun

    2017-06-13

    In higher eukaryotes, protein-DNA binding interactions are the central activities in gene regulation. In particular, DNA motifs such as transcription factor binding sites are the key components in gene transcription. Harnessing the recently available chromatin interaction data, computational methods are desired for systematically identifying the coupling DNA motif pairs enriched on long-range chromatin-interacting sequence pairs (e.g. promoter-enhancer pairs). To fill the void, a novel probabilistic model (namely, MotifHyades) is proposed and developed for de novo DNA motif pair discovery on paired sequences. In particular, two expectation maximization algorithms are derived for efficient model training with linear computational complexity. Under diverse scenarios, MotifHyades is demonstrated to be faster and more accurate than the existing ad hoc computational pipeline. In addition, MotifHyades is applied to discover thousands of DNA motif pairs with higher gold standard motif matching ratio, higher DNase accessibility, and higher evolutionary conservation than the previous ones in the human K562 cell line. Lastly, it has been run on five other human cell lines (i.e. GM12878, HeLa-S3, HUVEC, IMR90, and NHEK), revealing thousands more novel DNA motif pairs, which are characterized across a broad spectrum of genomic features on long-range promoter-enhancer pairs. The matrix-algebra-optimized versions of MotifHyades and the discovered DNA motif pairs can be found at http://bioinfo.cs.cityu.edu.hk/MotifHyades. Contact: kc.w@cityu.edu.hk. Supplementary data are available at Bioinformatics online.

  20. De novo prediction of RNA-protein interactions from sequence information.

    PubMed

    Wang, Ying; Chen, Xiaowei; Liu, Zhi-Ping; Huang, Qiang; Wang, Yong; Xu, Derong; Zhang, Xiang-Sun; Chen, Runsheng; Chen, Luonan

    2013-01-27

    Protein-RNA interactions are fundamentally important in understanding cellular processes. In particular, non-coding RNA-protein interactions play an important role to facilitate biological functions in signalling, transcriptional regulation, and even the progression of complex diseases. However, experimental determination of protein-RNA interactions remains time-consuming and labour-intensive. Here, we develop a novel extended naïve-Bayes-classifier for de novo prediction of protein-RNA interactions, only using protein and RNA sequence information. Specifically, we first collect a set of known protein-RNA interactions as gold-standard positives and extract sequence-based features to represent each protein-RNA pair. To fill the gap between high dimensional features and scarcity of gold-standard positives, we select effective features by cutting a likelihood ratio score, which not only reduces the computational complexity but also allows transparent feature integration during prediction. An extended naïve Bayes classifier is then constructed using these effective features to train a protein-RNA interaction prediction model. Numerical experiments show that our method can achieve the prediction accuracy of 0.77 even though only a small number of protein-RNA interaction data are available. In particular, we demonstrate that the extended naïve-Bayes-classifier is superior to the naïve-Bayes-classifier by fully considering the dependences among features. Importantly, we conduct ncRNA pull-down experiments to validate the predicted novel protein-RNA interactions and identify the interacting proteins of sbRNA CeN72 in C. elegans, which further demonstrates the effectiveness of our method.

  1. Stable isotope N-phosphorylation labeling for Peptide de novo sequencing and protein quantification based on organic phosphorus chemistry.

    PubMed

    Gao, Xiang; Wu, Hanzhi; Lee, Kim-Chung; Liu, Hongxia; Zhao, Yufen; Cai, Zongwei; Jiang, Yuyang

    2012-12-04

    In this paper, we describe the development of a novel stable isotope N-phosphorylation labeling (SIPL) strategy for peptide de novo sequencing and protein quantification based on organic phosphorus chemistry. The labeling reaction could be performed easily and completed within 40 min in a one-pot reaction without additional cleanup procedures. It was found that N-phosphorylation labeling reagents were activated in situ to form labeling intermediates with high reactivity targeting the N-terminus and ε-amino groups of lysine under mild reaction conditions. The introduction of the N-terminal-labeled phosphoryl group not only improved the ionization efficiency of peptides and increased the protein sequence coverage for peptide mass fingerprints but also greatly enhanced the intensities of b ions, suppressed the internal fragments, and reduced the complexity of the tandem mass spectrometry (MS/MS) fragmentation patterns of peptides. By using nano liquid chromatography chip/time-of-flight mass spectrometry (nano LC-chip/TOF MS) for the protein quantification, the obtained results showed excellent correlation of the measured ratios to theoretical ratios with relative errors ranging from 0.5% to 6.7% and relative standard deviation of less than 10.6%, indicating that the developed method was reproducible and precise. The isotope effect was negligible because the deuterium atoms were placed adjacent to the neutral phosphoryl group with high electrophilicity and moderately small size. Moreover, the SIPL approach used inexpensive reagents and was amenable to samples from various sources, including cell culture, biological fluids, and tissues. The method development based on organic phosphorus chemistry offered a new approach for quantitative proteomics by using novel stable isotope labeling reagents.

  2. De Novo Transcriptome Assembly of the Chinese Swamp Buffalo by RNA Sequencing and SSR Marker Discovery

    PubMed Central

    Lu, Xingrong; Zhu, Peng; Duan, Anqin; Tan, Zhengzhun; Huang, Jian; Li, Hui; Chen, Mingtan; Liang, Xianwei

    2016-01-01

    The Chinese swamp buffalo (Bubalus bubalis) is vital to the lives of small farmers and has tremendous economic importance. However, a lack of genomic information has hampered research on augmenting marker-assisted breeding programs in this species. Thus, high-throughput transcriptomic sequencing of B. bubalis was conducted to generate a transcriptomic sequence dataset for gene discovery and molecular marker development. Illumina paired-end sequencing generated a total of 54,109,173 raw reads. After trimming, de novo assembly was performed, which yielded 86,017 unigenes, with an average length of 972.41 bp, an N50 of 1,505 bp, and an average GC content of 49.92%. A total of 62,337 unigenes were successfully annotated. Among the annotated unigenes, 27,025 (43.35%) and 23,232 (37.27%) unigenes showed significant similarity to known proteins in NCBI non-redundant protein and Swiss-Prot databases (E-value < 1.0E-5), respectively. Of these annotated unigenes, 14,439 and 15,813 unigenes were assigned to the Gene Ontology (GO) categories and EuKaryotic Ortholog Group (KOG) cluster, respectively. In addition, a total of 14,167 unigenes were assigned to 331 Kyoto Encyclopedia of Genes and Genomes (KEGG) pathways. Furthermore, 17,401 simple sequence repeats (SSRs) were identified as potential molecular markers. One hundred and fifteen primer pairs were randomly selected for amplification to detect polymorphisms. The results revealed that 110 primer pairs (95.65%) yielded PCR amplicons and 69 primer pairs (60.00%) presented polymorphisms in 35 individual buffaloes. A phylogenetic analysis showed that the five swamp buffalo populations were clustered together, whereas two river buffalo breeds clustered separately. In the present study, the Illumina RNA-seq technology was utilized to perform transcriptome analysis and SSR marker discovery in the swamp buffalo without using a reference genome. Our findings will enrich the current SSR marker resources and help spearhead molecular
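
    The E-value cutoff used for annotation in this record (E-value < 1.0E-5) can be applied to tabular similarity-search output roughly as follows. This sketch assumes BLAST's 12-column tabular format (`-outfmt 6`), in which the E-value is the 11th column; the function name and the sample rows are hypothetical and do not come from the study.

```python
import csv
import io
from typing import Dict, Tuple


def best_hits(tabular_text: str, max_evalue: float = 1e-5) -> Dict[str, Tuple[str, float]]:
    """Keep, per query, the lowest-E-value hit that passes the cutoff."""
    best: Dict[str, Tuple[str, float]] = {}
    for row in csv.reader(io.StringIO(tabular_text), delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment lines
        query, subject, evalue = row[0], row[1], float(row[10])
        if evalue >= max_evalue:
            continue  # hit is not significant under the chosen cutoff
        if query not in best or evalue < best[query][1]:
            best[query] = (subject, evalue)
    return best


# Hypothetical rows: qseqid sseqid pident length mismatch gapopen
# qstart qend sstart send evalue bitscore
sample = (
    "unigene1\tP1\t98.0\t200\t4\t0\t1\t200\t1\t200\t2e-30\t250\n"
    "unigene1\tP2\t80.0\t150\t30\t0\t1\t150\t1\t150\t1e-8\t90\n"
    "unigene2\tP3\t40.0\t50\t30\t0\t1\t50\t1\t50\t0.5\t20\n"
)
hits = best_hits(sample)
print(hits)
```

    Dividing the number of queries with a retained hit by the total number of unigenes gives an annotation rate of the kind reported above (for example, 27,025 of 62,337 annotated unigenes is 43.35%).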

  3. Genomic resources for water yam (Dioscorea alata L.): analyses of EST-Sequences, De Novo sequencing and GBS libraries

    USDA-ARS's Scientific Manuscript database

    The reducing cost and rapid progress in next-generation sequencing techniques coupled with high performance computational approaches have resulted in large-scale discovery of advanced genomic resources such as SSRs, SNPs and InDels in several model and non-model plant species. Yam (Dioscorea spp.) i...

  4. Kinase inhibitor data modeling and de novo inhibitor design with fragment approaches.

    PubMed

    Vieth, Michal; Erickson, Jon; Wang, Jibo; Webster, Yue; Mader, Mary; Higgs, Richard; Watson, Ian

    2009-10-22

    A reconstructive approach based on computational fragmentation of existing inhibitors and validated kinase potency models to recombine and create "de novo" kinase inhibitor small molecule libraries is described. The screening results from model-selected molecules from the corporate database and seven computationally derived small molecule libraries were used to evaluate this approach. Specifically, 1895 model-selected database molecules were screened at 20 µM in six kinase assays and yielded an overall hit rate of 84%. These models were then used in the de novo design of seven chemical libraries consisting of 20-50 compounds each. Then 179 compounds from synthesized libraries were tested against these six kinases with an overall hit rate of 92%. Comparing predicted and observed selectivity profiles serves to highlight the strengths and limitations of the methodology, while analysis of functional group contributions from the libraries suggests general principles governing binding of ATP-competitive compounds.

  5. Hybrid error correction and de novo assembly of single-molecule sequencing reads

    PubMed Central

    Koren, Sergey; Schatz, Michael C.; Walenz, Brian P.; Martin, Jeffrey; Howard, Jason; Ganapathy, Ganeshkumar; Wang, Zhong; Rasko, David A.; McCombie, W. Richard; Jarvis, Erich D.; Phillippy, Adam M.

    2012-01-01

    Emerging single-molecule sequencing instruments can generate multi-kilobase sequences with the potential to dramatically improve genome and transcriptome assembly. However, the high error rate of single-molecule reads is challenging, and has limited their use to resequencing bacteria. To address this limitation, we introduce a novel correction algorithm and assembly strategy that utilizes shorter, high-identity sequences to correct the errors in single-molecule sequences. We demonstrate the utility of this approach on PacBio RS reads of phage, prokaryotic, and eukaryotic whole genomes, including the novel genome of the parrot Melopsittacus undulatus, as well as for RNA-seq reads of the corn (Zea mays) transcriptome. Our approach achieves over 99.9% read correction accuracy and produces substantially better assemblies than current sequencing strategies: in the best example, quintupling the median contig size relative to high-coverage, second-generation assemblies. Greater gains are predicted if read lengths continue to increase, including the prospect of single-contig bacterial chromosome assembly. PMID:22750884

  6. De novo transcriptome sequencing and analysis of the juvenile and adult stages of Fasciola gigantica.

    PubMed

    Zhang, Xiao-Xuan; Cong, Wei; Elsheikha, Hany M; Liu, Guo-Hua; Ma, Jian-Gang; Huang, Wei-Yi; Zhao, Quan; Zhu, Xing-Quan

    2017-03-09

    Fasciola gigantica is regarded as the major liver fluke causing fasciolosis in livestock in tropical countries. Despite the significant economic and public health impacts of F. gigantica, there are few studies on the pathogenesis of this parasite, and our understanding is further limited by the lack of genome and transcriptome information. In this study, de novo Illumina RNA sequencing (RNA-seq) was performed to obtain a comprehensive transcriptome profile of the juvenile (42 days post infection) and adult stages of F. gigantica. A total of 49,720 unigenes were produced from juvenile and adult stages of F. gigantica, with an average length of 1286 nucleotides (nt) and N50 of 2076 nt. A total of 27,862 (56.03%) unigenes were annotated by BLAST similarity searches against the NCBI non-redundant protein database. Because F. gigantica needs to feed on and/or digest host tissues, proteases (including cysteine proteases and aspartic proteases) that play a role in the degradation of host tissues (protein) received particular attention in the present study. A total of 6511 distinct genes were found to be differentially expressed between juveniles and adults, of which 3993 genes were up-regulated and 2518 genes were down-regulated in adults versus juveniles. Moreover, stage-specific differentially expressed genes were identified in juvenile (17,009) and adult (6517) F. gigantica. The significantly divergent pathways of differentially expressed genes included cAMP signaling pathway (226; 4.12%), proteoglycans in cancer (256; 4.67%) and focal adhesion (199; 3.63%). The transcription pattern also revealed two egg-laying-associated pathways: cGMP-PKG signaling pathway and TGF-β signaling pathway. This study provides the first comparative transcriptomic data concerning juvenile and adult stages of F. gigantica that will be of great value for future research efforts into understanding parasite pathogenesis and developing vaccines against this important parasite.

  7. The first complete chloroplast genome sequences of Ulmus species by de novo sequencing: Genome comparative and taxonomic position analysis

    PubMed Central

    Zhang, Shuang; Yu, Xiao-Yue; Ren, Ya-Chao; Yang, Min-Sheng; Wang, Jin-Mao

    2017-01-01

    Elm (Ulmus) has a long history of use as a high-quality heavy hardwood famous for its resistance to drought, cold, and salt. It grows in temperate, warm temperate, and subtropical regions. This is the first report of Ulmaceae chloroplast genomes by de novo sequencing. The Ulmus chloroplast genomes exhibited a typical quadripartite structure with two single-copy regions (long single copy [LSC] and short single copy [SSC] sections) separated by a pair of inverted repeats (IRs). The lengths of the chloroplast genomes from five Ulmus ranged from 158,953 to 159,453 bp, with the largest observed in Ulmus davidiana and the smallest in Ulmus laciniata. The genomes contained 137–145 protein-coding genes, of which Ulmus davidiana var. japonica and U. davidiana had the most and U. pumila had the fewest. The five Ulmus species exhibited different evolutionary routes, as some genes had been lost. In total, 18 genes contained introns, 13 of which (trnL-TAA+, trnL-TAA−, rpoC1-, rpl2-, ndhA-, ycf1, rps12-, rps12+, trnA-TGC+, trnA-TGC-, trnV-TAC-, trnI-GAT+, and trnI-GAT) were shared among all five species. The intron of ycf1 was the longest (5,675bp) while that of trnF-AAA was the smallest (53bp). All Ulmus species except U. davidiana exhibited the same degree of amplification in the IR region. To determine the phylogenetic positions of the Ulmus species, we performed phylogenetic analyses using common protein-coding genes in chloroplast sequences of 42 other species published in NCBI. The cluster results showed the closest plants to Ulmaceae were Moraceae and Cannabaceae, followed by Rosaceae. Ulmaceae and Moraceae both belonged to Urticales, and the chloroplast genome clustering results were consistent with their traditional taxonomy. The results strongly supported the position of Ulmaceae as a member of the order Urticales. In addition, we found a potential error in the traditional taxonomies of U. davidiana and U. davidiana var. japonica, which should be confirmed with a

  8. The first complete chloroplast genome sequences of Ulmus species by de novo sequencing: Genome comparative and taxonomic position analysis.

    PubMed

    Zuo, Li-Hui; Shang, Ai-Qin; Zhang, Shuang; Yu, Xiao-Yue; Ren, Ya-Chao; Yang, Min-Sheng; Wang, Jin-Mao

    2017-01-01

    Elm (Ulmus) has a long history of use as a high-quality heavy hardwood famous for its resistance to drought, cold, and salt. It grows in temperate, warm temperate, and subtropical regions. This is the first report of Ulmaceae chloroplast genomes by de novo sequencing. The Ulmus chloroplast genomes exhibited a typical quadripartite structure with two single-copy regions (long single copy [LSC] and short single copy [SSC] sections) separated by a pair of inverted repeats (IRs). The lengths of the chloroplast genomes from five Ulmus ranged from 158,953 to 159,453 bp, with the largest observed in Ulmus davidiana and the smallest in Ulmus laciniata. The genomes contained 137-145 protein-coding genes, of which Ulmus davidiana var. japonica and U. davidiana had the most and U. pumila had the fewest. The five Ulmus species exhibited different evolutionary routes, as some genes had been lost. In total, 18 genes contained introns, 13 of which (trnL-TAA+, trnL-TAA-, rpoC1-, rpl2-, ndhA-, ycf1, rps12-, rps12+, trnA-TGC+, trnA-TGC-, trnV-TAC-, trnI-GAT+, and trnI-GAT) were shared among all five species. The intron of ycf1 was the longest (5,675bp) while that of trnF-AAA was the smallest (53bp). All Ulmus species except U. davidiana exhibited the same degree of amplification in the IR region. To determine the phylogenetic positions of the Ulmus species, we performed phylogenetic analyses using common protein-coding genes in chloroplast sequences of 42 other species published in NCBI. The cluster results showed the closest plants to Ulmaceae were Moraceae and Cannabaceae, followed by Rosaceae. Ulmaceae and Moraceae both belonged to Urticales, and the chloroplast genome clustering results were consistent with their traditional taxonomy. The results strongly supported the position of Ulmaceae as a member of the order Urticales. In addition, we found a potential error in the traditional taxonomies of U. davidiana and U. davidiana var. japonica, which should be confirmed with a

  9. MetaVelvet: An Extension of Velvet Assembler to de novo Metagenome Assembly from Short Sequence Reads (Metagenomics Informatics Challenges Workshop: 10K Genomes at a Time)

    ScienceCinema

    Sakakibara, Yasumbumi [Keio University]

    2016-07-12

    Keio University's Yasumbumi Sakakibara on "MetaVelvet: An Extension of Velvet Assembler to de novo Metagenome Assembly from Short Sequence Reads" at the Metagenomics Informatics Challenges Workshop held at the DOE JGI on October 12-13, 2011.

  10. MetaVelvet: An Extension of Velvet Assembler to de novo Metagenome Assembly from Short Sequence Reads (Metagenomics Informatics Challenges Workshop: 10K Genomes at a Time)

    SciTech Connect

    Sakakibara, Yasumbumi

    2011-10-13

    Keio University's Yasumbumi Sakakibara on "MetaVelvet: An Extension of Velvet Assembler to de novo Metagenome Assembly from Short Sequence Reads" at the Metagenomics Informatics Challenges Workshop held at the DOE JGI on October 12-13, 2011.

  11. Rapid Microsatellite Isolation from a Butterfly by De Novo Transcriptome Sequencing: Performance and a Comparison with AFLP-Derived Distances

    PubMed Central

    Mikheyev, Alexander S.; Vo, Tanya; Wee, Brian; Singer, Michael C.; Parmesan, Camille

    2010-01-01

    Background: The isolation of microsatellite markers remains laborious and expensive. For some taxa, such as Lepidoptera, development of microsatellite markers has been particularly difficult, as many markers appear to be located in repetitive DNA and have nearly identical flanking regions. We attempted to circumvent this problem by bioinformatic mining of microsatellite sequences from a de novo-sequenced transcriptome of a butterfly (Euphydryas editha). Principal Findings: By searching the assembled sequence data for perfect microsatellite repeats we found 10 polymorphic loci. Although, like many expressed sequence tag-derived microsatellites, our markers show strong deviations from Hardy-Weinberg equilibrium in many populations, and, in some cases, a high incidence of null alleles, we show that they nonetheless provide measures of population differentiation consistent with those obtained by amplified fragment length polymorphism analysis. Estimates of pairwise population differentiation between 23 populations were concordant between microsatellite-derived data and AFLP analysis of the same samples (r = 0.71, p<0.00001, 425 individuals from 23 populations). Significance: De novo transcriptional sequencing appears to be a rapid and cost-effective tool for developing microsatellite markers for difficult genomes. PMID:20585453
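
    The search for perfect microsatellite repeats described above can be sketched with a regular expression. This is only an illustration, not the authors' pipeline: the function name, the unit lengths (2 to 6 bp) and the minimum of five repeats are all assumptions chosen for the example.

```python
import re


def find_perfect_microsatellites(seq, min_unit=2, max_unit=6, min_repeats=5):
    """Return (start, unit, repeat_count) for perfect tandem repeats in seq.

    All thresholds are illustrative assumptions, not values from the abstract.
    """
    seq = seq.upper()
    # A short unit (tried shortest first) followed by at least
    # min_repeats - 1 further exact copies of itself.
    pattern = re.compile(
        r"([ACGT]{%d,%d}?)\1{%d,}" % (min_unit, max_unit, min_repeats - 1)
    )
    hits = []
    for match in pattern.finditer(seq):
        unit = match.group(1)
        if len(set(unit)) == 1:
            # Skip homopolymer runs such as AAAAAAAAAA.
            continue
        hits.append((match.start(), unit, len(match.group(0)) // len(unit)))
    return hits


if __name__ == "__main__":
    contig = "TTG" + "CA" * 8 + "GGT"
    print(find_perfect_microsatellites(contig))
```

    A real marker-development workflow would also need flanking sequence for primer design and a polymorphism screen, neither of which is covered by this sketch.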

  12. de novo assembly and population genomic survey of natural yeast isolates with the Oxford Nanopore MinION sequencer.

    PubMed

    Istace, Benjamin; Friedrich, Anne; d'Agata, Léo; Faye, Sébastien; Payen, Emilie; Beluche, Odette; Caradec, Claudia; Davidas, Sabrina; Cruaud, Corinne; Liti, Gianni; Lemainque, Arnaud; Engelen, Stefan; Wincker, Patrick; Schacherer, Joseph; Aury, Jean-Marc

    2017-02-01

    Oxford Nanopore Technologies Ltd (Oxford, UK) have recently commercialized MinION, a small single-molecule nanopore sequencer, that offers the possibility of sequencing long DNA fragments from small genomes in a matter of seconds. The Oxford Nanopore technology is truly disruptive; it has the potential to revolutionize genomic applications due to its portability, low cost, and ease of use compared with existing long-read sequencing technologies. The MinION sequencer enables the rapid sequencing of small eukaryotic genomes, such as the yeast genome. Combined with existing assembler algorithms, near complete genome assemblies can be generated and comprehensive population genomic analyses can be performed. Here, we resequenced the genome of the Saccharomyces cerevisiae S288C strain to evaluate the performance of nanopore-only assemblers. Then we de novo sequenced and assembled the genomes of 21 isolates representative of the S. cerevisiae genetic diversity using the MinION platform. The contiguity of our assemblies was 14 times higher than the Illumina-only assemblies and we obtained one or two long contigs for 65% of the chromosomes. This high contiguity allowed us to accurately detect large structural variations across the 21 studied genomes. Because of the high completeness of the nanopore assemblies, we were able to produce a complete cartography of transposable element insertions and inspect structural variants that are generally missed using a short-read sequencing strategy. Our analyses show that the Oxford Nanopore technology is already usable for de novo sequencing and assembly; however, non-random errors in homopolymers require polishing the consensus using an alternate sequencing technology. © The Author 2017. Published by Oxford University Press.
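
    Assembly contiguity is often summarized with the N50 statistic. The abstract does not say which metric underlies the 14-fold figure, so the sketch below is only an assumed example of how such a comparison could be computed; the function name and the toy contig lengths are hypothetical.

```python
def n50(contig_lengths):
    """Length L such that contigs of length >= L cover at least half the assembly."""
    lengths = sorted(contig_lengths, reverse=True)
    half_total = sum(lengths) / 2
    running = 0
    for length in lengths:
        running += length
        if running >= half_total:
            return length
    raise ValueError("contig_lengths must not be empty")


if __name__ == "__main__":
    # Toy numbers only: the same 300 bases split into few or many contigs.
    long_read_like = [100, 80, 60, 40, 20]
    short_read_like = [10] * 30
    print(n50(long_read_like), n50(short_read_like))
```

    Two assemblies of the same total size can differ widely in N50, which is why it is a convenient single-number proxy for contiguity.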

  13. Single-molecule sequencing and chromatin conformation capture enable de novo reference assembly of the domestic goat genome.

    PubMed

    Bickhart, Derek M; Rosen, Benjamin D; Koren, Sergey; Sayre, Brian L; Hastie, Alex R; Chan, Saki; Lee, Joyce; Lam, Ernest T; Liachko, Ivan; Sullivan, Shawn T; Burton, Joshua N; Huson, Heather J; Nystrom, John C; Kelley, Christy M; Hutchison, Jana L; Zhou, Yang; Sun, Jiajie; Crisà, Alessandra; Ponce de León, F Abel; Schwartz, John C; Hammond, John A; Waldbieser, Geoffrey C; Schroeder, Steven G; Liu, George E; Dunham, Maitreya J; Shendure, Jay; Sonstegard, Tad S; Phillippy, Adam M; Van Tassell, Curtis P; Smith, Timothy P L

    2017-04-01

    The decrease in sequencing cost and increased sophistication of assembly algorithms for short-read platforms has resulted in a sharp increase in the number of species with genome assemblies. However, these assemblies are highly fragmented, with many gaps, ambiguities, and errors, impeding downstream applications. We demonstrate current state of the art for de novo assembly using the domestic goat (Capra hircus) based on long reads for contig formation, short reads for consensus validation, and scaffolding by optical and chromatin interaction mapping. These combined technologies produced what is, to our knowledge, the most continuous de novo mammalian assembly to date, with chromosome-length scaffolds and only 649 gaps. Our assembly represents a ∼400-fold improvement in continuity due to properly assembled gaps, compared to the previously published C. hircus assembly, and better resolves repetitive structures longer than 1 kb, representing the largest repeat family and immune gene complex yet produced for an individual of a ruminant species.

  14. Cost-effective sequencing of full-length cDNA clones powered by a de novo-reference hybrid assembly.

    PubMed

    Kuroshu, Reginaldo M; Watanabe, Junichi; Sugano, Sumio; Morishita, Shinichi; Suzuki, Yutaka; Kasahara, Masahiro

    2010-05-07

    Sequencing full-length cDNA clones is important to determine gene structures including alternative splice forms, and provides valuable resources for experimental analyses to reveal the biological functions of coded proteins. However, previous approaches for sequencing cDNA clones were expensive or time-consuming, and therefore, a fast and efficient sequencing approach was demanded. We developed a program, MuSICA 2, that assembles millions of short (36-nucleotide) reads collected from a single flow cell lane of Illumina Genome Analyzer to shotgun-sequence approximately 800 human full-length cDNA clones. MuSICA 2 performs a hybrid assembly in which an external de novo assembler is run first and the result is then improved by reference alignment of shotgun reads. We compared the MuSICA 2 assembly with 200 pooled full-length cDNA clones finished independently by the conventional primer-walking using Sanger sequencers. The exon-intron structure of the coding sequence was correct for more than 95% of the clones with coding sequence annotation when we excluded cDNA clones insufficiently represented in the shotgun library due to PCR failure (42 out of 200 clones excluded), and the nucleotide-level accuracy of coding sequences of those correct clones was over 99.99%. We also applied MuSICA 2 to full-length cDNA clones from Toxoplasma gondii, to confirm that its ability was competent even for non-human species. The entire sequencing and shotgun assembly takes less than 1 week and the consumables cost only approximately US$3 per clone, demonstrating a significant advantage over previous approaches.

  15. Cost-Effective Sequencing of Full-Length cDNA Clones Powered by a De Novo-Reference Hybrid Assembly

    PubMed Central

    Sugano, Sumio; Morishita, Shinichi; Suzuki, Yutaka

    2010-01-01

    Background: Sequencing full-length cDNA clones is important to determine gene structures including alternative splice forms, and provides valuable resources for experimental analyses to reveal the biological functions of coded proteins. However, previous approaches for sequencing cDNA clones were expensive or time-consuming, and therefore, a fast and efficient sequencing approach was demanded. Methodology: We developed a program, MuSICA 2, that assembles millions of short (36-nucleotide) reads collected from a single flow cell lane of Illumina Genome Analyzer to shotgun-sequence ∼800 human full-length cDNA clones. MuSICA 2 performs a hybrid assembly in which an external de novo assembler is run first and the result is then improved by reference alignment of shotgun reads. We compared the MuSICA 2 assembly with 200 pooled full-length cDNA clones finished independently by the conventional primer-walking using Sanger sequencers. The exon-intron structure of the coding sequence was correct for more than 95% of the clones with coding sequence annotation when we excluded cDNA clones insufficiently represented in the shotgun library due to PCR failure (42 out of 200 clones excluded), and the nucleotide-level accuracy of coding sequences of those correct clones was over 99.99%. We also applied MuSICA 2 to full-length cDNA clones from Toxoplasma gondii, to confirm that its ability was competent even for non-human species. Conclusions: The entire sequencing and shotgun assembly takes less than 1 week and the consumables cost only ∼US$3 per clone, demonstrating a significant advantage over previous approaches. PMID:20479877

  16. Identification of a De Novo Heterozygous Missense FLNB Mutation in Lethal Atelosteogenesis Type I by Exome Sequencing

    PubMed Central

    Jeon, Ga Won; Lee, Mi-Na; Jung, Ji Mi; Hong, Seong Yeon; Kim, Young Nam; Sin, Jong Beom

    2014-01-01

    Background: Atelosteogenesis type I (AO-I) is a rare lethal skeletal dysplastic disorder characterized by severe short-limbed dwarfism and dislocated hips, knees, and elbows. AO-I is caused by mutations in the filamin B (FLNB) gene; however, several other genes can cause AO-like lethal skeletal dysplasias. Methods: In order to screen all possible genes associated with AO-like lethal skeletal dysplasias simultaneously, we performed whole-exome sequencing in a female newborn having clinical features of AO-I. Results: Exome sequencing identified a novel missense variant (c.517G>A; p.Ala173Thr) in exon 2 of the FLNB gene in the patient. Sanger sequencing validated this variant, and genetic analysis of the patient's parents suggested a de novo occurrence of the variant. Conclusions: This study shows that exome sequencing can be a useful tool for the identification of causative mutations in lethal skeletal dysplasia patients. PMID:24624349
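
    The variant notation c.517G>A; p.Ala173Thr can be checked for internal consistency with a little arithmetic. The sketch below is illustrative: the helper names are hypothetical, and because the abstract does not give the reference codon, the code simply tries all four alanine codons from the standard genetic code.

```python
ALA_CODONS = {"GCT", "GCC", "GCA", "GCG"}
THR_CODONS = {"ACT", "ACC", "ACA", "ACG"}


def codon_position(cdna_pos):
    """Map a 1-based coding-DNA position to (codon number, 1-based offset in codon)."""
    return (cdna_pos - 1) // 3 + 1, (cdna_pos - 1) % 3 + 1


def apply_substitution(codon, offset, ref, alt):
    """Replace the base at the 1-based offset, checking the reference base first."""
    if codon[offset - 1] != ref:
        raise ValueError("reference base does not match codon")
    return codon[: offset - 1] + alt + codon[offset:]


if __name__ == "__main__":
    codon_number, offset = codon_position(517)
    print(codon_number, offset)
    for codon in sorted(ALA_CODONS):
        mutated = apply_substitution(codon, offset, "G", "A")
        print(codon, "->", mutated, mutated in THR_CODONS)
```

    Position 517 falls on the first base of codon 173, and a G-to-A change at the first base of any alanine codon (GCN) gives a threonine codon (ACN), which matches the reported protein change.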

  17. Sequencing crop genomes: approaches and applications

    USDA-ARS's Scientific Manuscript database

    Plant genome sequencing methodology parallels the sequencing of the human genome. The first projects were slow and very expensive. BAC by BAC approaches were utilized first and whole-genome shotgun sequencing rapidly replaced that approach. So-called 'next generation' technologies such as short rea...

  18. Somatic mutations and germline sequence variants in the expressed tyrosine kinase genes of patients with de novo acute myeloid leukemia

    PubMed Central

    Xiang, Zhifu; Walgren, Richard; Zhao, Yu; Kasai, Yumi; Miner, Tracie; Ries, Rhonda E.; Lubman, Olga; Fremont, Daved H.; McLellan, Michael D.; Payton, Jacqueline E.; Westervelt, Peter; DiPersio, John F.; Link, Daniel C.; Walter, Matthew J.; Graubert, Timothy A.; Watson, Mark; Baty, Jack; Heath, Sharon; Shannon, William D.; Nagarajan, Rakesh; Bloomfield, Clara D.; Mardis, Elaine R.; Wilson, Richard K.; Ley, Timothy J.

    2008-01-01

    Activating mutations in tyrosine kinase (TK) genes (eg, FLT3 and KIT) are found in more than 30% of patients with de novo acute myeloid leukemia (AML); many groups have speculated that mutations in other TK genes may be present in the remaining 70%. We performed high-throughput resequencing of the kinase domains of 26 TK genes (11 receptor TK; 15 cytoplasmic TK) expressed in most AML patients using genomic DNA from the bone marrow (tumor) and matched skin biopsy samples (“germline”) from 94 patients with de novo AML; sequence variants were validated in an additional 94 AML tumor samples (14.3 million base pairs of sequence were obtained and analyzed). We identified known somatic mutations in FLT3, KIT, and JAK2 TK genes at the expected frequencies and found 4 novel somatic mutations, JAK1 V623A, JAK1 T478S, DDR1 A803V, and NTRK1 S677N, once each in 4 respective patients of 188 tested. We also identified novel germline sequence changes encoding amino acid substitutions (ie, nonsynonymous changes) in 14 TK genes, including TYK2, which had the largest number of nonsynonymous sequence variants (11 total detected). Additional studies will be required to define the roles that these somatic and germline TK gene variants play in AML pathogenesis. PMID:18270328

  19. De Novo Designed Proteins from a Library of Artificial Sequences Function in Escherichia Coli and Enable Cell Growth

    PubMed Central

    Fisher, Michael A.; McKinley, Kara L.; Bradley, Luke H.; Viola, Sara R.; Hecht, Michael H.

    2011-01-01

    A central challenge of synthetic biology is to enable the growth of living systems using parts that are not derived from nature, but designed and synthesized in the laboratory. As an initial step toward achieving this goal, we probed the ability of a collection of >10^6 de novo designed proteins to provide biological functions necessary to sustain cell growth. Our collection of proteins was drawn from a combinatorial library of 102-residue sequences, designed by binary patterning of polar and nonpolar residues to fold into stable 4-helix bundles. We probed the capacity of proteins from this library to function in vivo by testing their abilities to rescue 27 different knockout strains of Escherichia coli, each deleted for a conditionally essential gene. Four different strains – ΔserB, ΔgltA, ΔilvA, and Δfes – were rescued by specific sequences from our library. Further experiments demonstrated that a strain simultaneously deleted for all four genes was rescued by co-expression of four novel sequences. Thus, cells deleted for ∼0.1% of the E. coli genome (and ∼1% of the genes required for growth under nutrient-poor conditions) can be sustained by sequences designed de novo. PMID:21245923

  20. Rapid genome mapping in nano channel array for highly complete and accurate de novo sequence assembly of the complex Aegilops tauschii genome

    USDA-ARS's Scientific Manuscript database

    Next-generation sequencing (NGS) technologies have enabled high-throughput and low-cost generation of sequence data; however, de novo genome assembly remains a great challenge, particularly for large genomes. NGS short reads are often insufficient to create large contigs that span repeat sequences...

  1. LTQ Orbitrap Velos in routine de novo sequencing of non-tryptic skin peptides from the frog Rana latastei with traditional and reliable manual spectra interpretation.

    PubMed

    Samgina, Tatiana Yu; Tolpina, Miriam D; Trebse, Polonca; Torkar, Gregor; Artemenko, Konstantin A; Bergquist, Jonas; Lebedev, Albert T

    2016-01-30

    Mass spectrometry has shown itself to be the most efficient tool for the sequencing of peptides. However, de novo sequencing of novel natural peptides is significantly more challenging in comparison with the same procedure applied for the tryptic peptides. To reach the goal in this case it is essential to select the most efficient methods of triggering fragmentation and combine all the possible complementary techniques. Collision-induced dissociation (CID), high-energy collision dissociation (HCD), and electron-transfer dissociation (ETD) tandem mass spectra recorded with a LTQ Orbitrap Velos instrument were used for the elucidation of the sequence of the natural non-tryptic peptides from the skin secretion of Rana latastei. Manual interpretation of the spectra was applied. The combined approach using CID, HCD, and ETD tandem mass spectra of the multiprotonated peptides in various charge states, as well as of their proteolytic fragments, allowed the sequences of seven novel peptides from the skin secretion of Rana latastei to be established. Manual mass spectrometry sequencing of natural non-tryptic peptides from the skin secretion of Rana latastei provided the opportunity to work successfully with these species and demonstrated once again its advantage over automatic approaches.

  2. High-Quality de Novo Genome Assembly of the Dekkera bruxellensis Yeast Isolate Using Nanopore MinION Sequencing.

    PubMed

    Fournier, Téo; Gounot, Jean-Sébastien; Freel, Kelle; Cruaud, Corinne; Lemainque, Arnaud; Aury, Jean-Marc; Wincker, Patrick; Schacherer, Joseph; Friedrich, Anne

    2017-08-09

    Genetic variation in natural populations represents the raw material for phenotypic diversity. Species-wide characterization of genetic variants is crucial to have a deeper insight into the genotype-phenotype relationship. With the advent of new sequencing strategies and more recently the release of long-read sequencing platforms, it is now possible to explore the genetic diversity of any non-model organisms, representing a fundamental resource for biological research. In the frame of population genomic surveys, a first step is to obtain the complete sequence and high quality assembly of a reference genome. Here, we sequenced and assembled a reference genome of the non-conventional Dekkera bruxellensis yeast. While this species is a major cause of wine spoilage, it paradoxically contributes to the specific flavor profile of some Belgian beers. In addition, an extreme karyotype variability is observed across natural isolates, highlighting that the D. bruxellensis genome is very dynamic. The whole genome of the D. bruxellensis UMY321 isolate was sequenced using a combination of Nanopore long-read and Illumina short-read sequencing data. We generated the most complete and contiguous de novo assembly of D. bruxellensis to date and obtained a first glimpse into the genomic variability within this species by comparing the sequences of several isolates. This genome sequence is therefore of high value for population genomic surveys and represents a reference to study genome dynamics in this yeast species. Copyright © 2017, G3: Genes, Genomes, Genetics.

  3. De novo sequencing and comparative analysis of the blueberry transcriptome to discover putative genes related to antioxidants.

    PubMed

    Li, Xiaoyan; Sun, Haiyue; Pei, Jiabo; Dong, Yuanyuan; Wang, Fawei; Chen, Huan; Sun, Yepeng; Wang, Nan; Li, Haiyan; Li, Yadong

    2012-12-10

    Blueberry (Vaccinium spp.) is an important small fruit crop rich in antioxidants. However, tissue-specific transcriptome and genomic data in public databases are not sufficient for an understanding of the molecular mechanisms associated with antioxidants, especially the biosynthesis of anthocyanins. Here, we obtained more than 64 million sequencing reads from blueberry skin and pulp using Illumina sequencing technology. De novo assemblies yielded 34,464 unigenes, among them 1236 transcripts and 862 putative transcription factors involved in the main antioxidant biosynthesis pathway were identified. Comparative transcript profiling allowed the identification of 92 differentially expressed genes with potential relevance in regulating the fruit metabolism and anthocyanin content during ripening. A series of qRT-PCR confirmed the high expression level of the anthocyanin pathway genes in the skin of the blue fruit from the in silico study. This sequence collection provides a significant resource for the blueberry research and breeding work.

  4. Whole Genome Sequencing Reveals a De Novo SHANK3 Mutation in Familial Autism Spectrum Disorder

    PubMed Central

    Nemirovsky, Sergio I.; Córdoba, Marta; Zaiat, Jonathan J.; Completa, Sabrina P.; Vega, Patricia A.; González-Morón, Dolores; Medina, Nancy M.; Fabbro, Mónica; Romero, Soledad; Brun, Bianca; Revale, Santiago; Ogara, María Florencia; Pecci, Adali; Marti, Marcelo; Vazquez, Martin; Turjanski, Adrián; Kauffman, Marcelo A.

    2015-01-01

    Introduction: Clinical genomics promise to be especially suitable for the study of etiologically heterogeneous conditions such as Autism Spectrum Disorder (ASD). Here we present three siblings with ASD where we evaluated the usefulness of Whole Genome Sequencing (WGS) for the diagnostic approach to ASD. Methods: We identified a family segregating ASD in three siblings with an unidentified cause. We performed WGS in the three probands and used a state-of-the-art comprehensive bioinformatic analysis pipeline and prioritized the identified variants located in genes likely to be related to ASD. We validated the finding by Sanger sequencing in the probands and their parents. Results: Three male siblings presented a syndrome characterized by severe intellectual disability, absence of language, autism spectrum symptoms and epilepsy with negative family history for mental retardation, language disorders, ASD or other psychiatric disorders. We found germline mosaicism for a heterozygous deletion of a cytosine in the exon 21 of the SHANK3 gene, resulting in a missense sequence of 5 codons followed by a premature stop codon (NM_033517:c.3259_3259delC, p.Ser1088Profs*6). Conclusions: We reported an infrequent form of familial ASD where WGS proved useful in the clinic. We identified a mutation in SHANK3 that underscores its relevance in Autism Spectrum Disorder. PMID:25646853

  5. Whole genome sequencing reveals a de novo SHANK3 mutation in familial autism spectrum disorder.

    PubMed

    Nemirovsky, Sergio I; Córdoba, Marta; Zaiat, Jonathan J; Completa, Sabrina P; Vega, Patricia A; González-Morón, Dolores; Medina, Nancy M; Fabbro, Mónica; Romero, Soledad; Brun, Bianca; Revale, Santiago; Ogara, María Florencia; Pecci, Adali; Marti, Marcelo; Vazquez, Martin; Turjanski, Adrián; Kauffman, Marcelo A

    2015-01-01

    Clinical genomics promise to be especially suitable for the study of etiologically heterogeneous conditions such as Autism Spectrum Disorder (ASD). Here we present three siblings with ASD where we evaluated the usefulness of Whole Genome Sequencing (WGS) for the diagnostic approach to ASD. We identified a family segregating ASD in three siblings with an unidentified cause. We performed WGS in the three probands and used a state-of-the-art comprehensive bioinformatic analysis pipeline and prioritized the identified variants located in genes likely to be related to ASD. We validated the finding by Sanger sequencing in the probands and their parents. Three male siblings presented a syndrome characterized by severe intellectual disability, absence of language, autism spectrum symptoms and epilepsy with negative family history for mental retardation, language disorders, ASD or other psychiatric disorders. We found germline mosaicism for a heterozygous deletion of a cytosine in the exon 21 of the SHANK3 gene, resulting in a missense sequence of 5 codons followed by a premature stop codon (NM_033517:c.3259_3259delC, p.Ser1088Profs*6). We reported an infrequent form of familial ASD where WGS proved useful in the clinic. We identified a mutation in SHANK3 that underscores its relevance in Autism Spectrum Disorder.

  6. PERGA: a paired-end read guided de novo assembler for extending contigs using SVM and look ahead approach.

    PubMed

    Zhu, Xiao; Leung, Henry C M; Chin, Francis Y L; Yiu, Siu Ming; Quan, Guangri; Liu, Bo; Wang, Yadong

    2014-01-01

    Since the read lengths of high throughput sequencing (HTS) technologies are short, de novo assembly, which plays significant roles in many applications, remains a great challenge. Most of the state-of-the-art approaches are based on the de Bruijn graph strategy or the overlap-layout strategy. However, these approaches, which depend on k-mers or read overlaps, do not fully utilize information from paired-end and single-end reads when resolving branches. Since they treat all single-end reads with overlap length larger than a fixed threshold equally, they fail to prioritize the more confident long-overlap reads during assembly and mix them up with the relatively short-overlap reads. Moreover, these approaches have not been specially designed for handling tandem repeats (repeats that occur adjacently in the genome) and they usually break contigs near the tandem repeats. We present PERGA (Paired-End Reads Guided Assembler), a novel sequence-reads-guided de novo assembly approach, which adopts a greedy-like prediction strategy for assembling reads into contigs and scaffolds using paired-end reads and different read overlap sizes ranging from Omax to Omin to resolve the gaps and branches. By constructing a decision model using a machine learning approach based on branch features, PERGA can determine the correct extension in 99.7% of cases. When the correct extension cannot be determined, PERGA will try to extend the contig by all feasible extensions and determine the correct extension by using a look-ahead approach. Many hard-to-resolve branches are due to tandem repeats which are close in the genome. PERGA detects such different copies of the repeats to resolve the branches, making the extension much longer and more accurate. We evaluated PERGA on both real and simulated Illumina datasets ranging from small bacterial genomes to a large human chromosome, and it constructed longer and more accurate contigs and scaffolds than other state-of-the-art assemblers. PERGA can be freely downloaded at https://github.com/hitbio/PERGA.
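
    The idea of trying long overlaps before short ones (from Omax down to Omin) can be illustrated with a toy greedy extender. This is not PERGA's implementation: it ignores paired-end guidance, the learned decision model and the look-ahead step, and it simply stops at the first branch. The function name, parameters and the `max_len` guard are assumptions made for the sketch.

```python
def extend_contig(contig, reads, o_max, o_min, max_len=10_000):
    """Greedily extend a contig to the right, one base at a time.

    Longer overlaps are trusted first; shorter ones are used only when
    no read overlaps the contig end at the longer size.
    """
    extended = True
    while extended and len(contig) < max_len:
        extended = False
        for overlap in range(o_max, o_min - 1, -1):
            suffix = contig[-overlap:]
            # Reads whose prefix matches the contig end and that add a base.
            candidates = [r for r in reads if len(r) > overlap and r.startswith(suffix)]
            next_bases = {r[overlap] for r in candidates}
            if len(next_bases) == 1:
                contig += next_bases.pop()  # unambiguous extension
                extended = True
                break
            if len(next_bases) > 1:
                return contig  # branch: this toy version gives up here
    return contig


if __name__ == "__main__":
    genome = "ACGTTGCAAGGCT"
    reads = [genome[i:i + 6] for i in range(len(genome) - 5)]
    print(extend_contig("ACGTTG", reads, o_max=5, o_min=3))
```

    Where this sketch returns at a branch, the abstract describes PERGA as applying its decision model and, failing that, a look-ahead over all feasible extensions.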

  7. A Quantitative Tool to Distinguish Isobaric Leucine and Isoleucine Residues for Mass Spectrometry-Based De Novo Monoclonal Antibody Sequencing

    NASA Astrophysics Data System (ADS)

    Poston, Chloe N.; Higgs, Richard E.; You, Jinsam; Gelfanova, Valentina; Hale, John E.; Knierman, Michael D.; Siegel, Robert; Gutierrez, Jesus A.

    2014-07-01

    De novo sequencing by mass spectrometry (MS) allows for the determination of the complete amino acid (AA) sequence of a given protein based on the mass difference of detected ions from MS/MS fragmentation spectra. The technique relies on obtaining specific masses that can be attributed to characteristic theoretical masses of AAs. A major limitation of de novo sequencing by MS is the inability to distinguish between the isobaric residues leucine (Leu) and isoleucine (Ile). Incorrect identification of Ile as Leu or vice versa often results in loss of activity in recombinant antibodies. This functional ambiguity is commonly resolved with costly and time-consuming AA mutation and peptide sequencing experiments. Here, we describe a set of orthogonal biochemical protocols, which experimentally determine the identity of Ile or Leu residues in monoclonal antibodies (mAb) based on the selectivity that leucine aminopeptidase shows for n-terminal Leu residues and the cleavage preference for Leu by chymotrypsin. The resulting observations are combined with germline frequencies and incorporated into a logistic regression model, called Predictor for Xle Sites (PXleS) to provide a statistical likelihood for the identity of Leu at an ambiguous site. We demonstrate that PXleS can generate a probability for an Xle site in mAbs with 96% accuracy. The implementation of PXleS precludes the expression of several possible sequences and, therefore, reduces the overall time and resources required to go from spectra generation to a biologically active sequence for a mAb when an Ile or Leu residue is in question.
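
    The reason Leu and Ile cannot be told apart from mass differences alone is that both residues have the same elemental composition (C6H11NO). A minimal sketch, using standard monoisotopic element masses and including valine only for contrast; the dictionary and function names are illustrative.

```python
# Monoisotopic masses of the elements, in daltons.
MONOISOTOPIC = {"C": 12.0, "H": 1.00782503, "N": 14.0030740, "O": 15.9949146}

# Elemental composition of each residue (amino acid minus one water).
RESIDUE_FORMULA = {
    "Leu": {"C": 6, "H": 11, "N": 1, "O": 1},
    "Ile": {"C": 6, "H": 11, "N": 1, "O": 1},
    "Val": {"C": 5, "H": 9, "N": 1, "O": 1},
}


def residue_mass(name):
    """Monoisotopic residue mass computed from the elemental composition."""
    return sum(MONOISOTOPIC[el] * n for el, n in RESIDUE_FORMULA[name].items())


if __name__ == "__main__":
    for name in ("Leu", "Ile", "Val"):
        print(name, round(residue_mass(name), 5))
    print("Leu - Ile:", residue_mass("Leu") - residue_mass("Ile"))
```

    Because the Leu and Ile masses are identical, a fragment-ion ladder assigns the same mass step to either residue, which is why orthogonal evidence such as the enzymatic assays and the logistic regression model described above is needed.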

  8. UVnovo: A De Novo Sequencing Algorithm Using Single Series of Fragment Ions via Chromophore Tagging and 351 nm Ultraviolet Photodissociation Mass Spectrometry

    PubMed Central

    Robotham, Scott A.; Horton, Andrew P.; Cannon, Joe R.; Cotham, Victoria C.; Marcotte, Edward M.; Brodbelt, Jennifer S.

    2016-01-01

    De novo peptide sequencing by mass spectrometry represents an important strategy for characterizing novel peptides and proteins, in which a peptide’s amino acid sequence is inferred directly from the precursor peptide mass and tandem mass spectrum (MS/MS or MS3) fragment ions, without comparison to a reference proteome. This method is ideal for organisms or samples lacking a complete or well-annotated reference sequence set. One of the major barriers to de novo spectral interpretation arises from confusion of N- and C-terminal ion series due to the symmetry between b and y ion pairs created by collisional activation methods (or c, z ions for electron-based activation methods). This is known as the ‘antisymmetric path problem’ and leads to inverted amino acid subsequences within a de novo reconstruction. Here, we combine several key strategies for de novo peptide sequencing into a single high-throughput pipeline: high efficiency carbamylation blocks lysine side chains, and subsequent tryptic digestion and N-terminal peptide derivatization with the ultraviolet chromophore AMCA yields peptides susceptible to 351 nm ultraviolet photodissociation (UVPD). UVPD-MS/MS of the AMCA-modified peptides then predominantly produces y ions in the MS/MS spectra, specifically addressing the antisymmetric path problem. Finally, the program UVnovo applies a random forest algorithm to automatically learn from and then interpret UVPD mass spectra, passing results to a hidden Markov model for de novo sequence prediction and scoring. We show this combined strategy provides high performance de novo peptide sequencing, enabling the de novo sequencing of thousands of peptides from an E. coli lysate at high confidence. PMID:26938041
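
    The symmetry between b and y ions behind the antisymmetric path problem can be seen numerically. The sketch below computes singly charged b and y ion masses for a hypothetical peptide (GVNIK, chosen only for illustration) from standard monoisotopic residue masses; it is not part of UVnovo.

```python
# Standard monoisotopic residue masses (Da) for the residues used here.
RESIDUE_MASS = {
    "G": 57.02146, "V": 99.06841, "N": 114.04293,
    "I": 113.08406, "K": 128.09496,
}
WATER = 18.01056
PROTON = 1.00728


def peptide_mass(peptide):
    """Neutral monoisotopic mass: residues plus one water."""
    return sum(RESIDUE_MASS[aa] for aa in peptide) + WATER


def b_ions(peptide):
    """Singly charged N-terminal fragments b1..b(n-1)."""
    masses = [RESIDUE_MASS[aa] for aa in peptide]
    return [sum(masses[:i]) + PROTON for i in range(1, len(peptide))]


def y_ions(peptide):
    """Singly charged C-terminal fragments y1..y(n-1)."""
    masses = [RESIDUE_MASS[aa] for aa in peptide]
    return [sum(masses[-i:]) + WATER + PROTON for i in range(1, len(peptide))]


if __name__ == "__main__":
    pep = "GVNIK"
    b, y = b_ions(pep), y_ions(pep)
    n = len(pep)
    for i in range(1, n):
        # Each complementary pair sums to the same constant.
        print(i, round(b[i - 1] + y[n - i - 1], 5))
    print(round(peptide_mass(pep) + 2 * PROTON, 5))
```

    Every y ion of a peptide equals the corresponding b ion of the reversed peptide plus one water, so a ladder read from the wrong terminus spells the sequence backwards. A spectrum dominated by a single ion series, as described for the AMCA-tagged peptides, removes that ambiguity.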

  9. Functional categorization of unique expressed sequence tags obtained from the yeast-like growth phase of the elm pathogen Ophiostoma novo-ulmi

    PubMed Central

    2011-01-01

    Background: The highly aggressive pathogenic fungus Ophiostoma novo-ulmi continues to be a serious threat to the American elm (Ulmus americana) in North America. Extensive studies have been conducted in North America to understand the mechanisms of virulence of this introduced pathogen and its evolving population structure, with a view to identifying potential strategies for the control of Dutch elm disease. As part of a larger study to examine the genomes of economically important Ophiostoma spp. and the genetic basis of virulence, we have constructed an expressed sequence tag (EST) library using total RNA extracted from the yeast-like growth phase of O. novo-ulmi (isolate H327). Results: A total of 4,386 readable EST sequences were annotated by determining their closest matches to known or theoretical sequences in public databases by BLASTX analysis. Searches matched 2,093 sequences to entries found in GenBank, including 1,761 matches with known proteins and 332 matches with unknown (hypothetical/predicted) proteins. Known proteins included a collection of 880 unique transcripts which were categorized to obtain a functional profile of the transcriptome and to evaluate physiological function. These assignments yielded 20 primary functional categories (FunCat), the largest including Metabolism (FunCat 01, 20.28% of total), Sub-cellular localization (70, 10.23%), Protein synthesis (12, 10.14%), Transcription (11, 8.27%), Biogenesis of cellular components (42, 8.15%), Cellular transport, facilitation and routes (20, 6.08%), Classification unresolved (98, 5.80%), Cell rescue, defence and virulence (32, 5.31%) and the unclassified category, or known sequences of unknown metabolic function (99, 7.5%). A list of specific transcripts of interest was compiled to initiate an evaluation of their impact upon strain virulence in subsequent studies. Conclusions: This is the first large-scale study of the O. novo-ulmi transcriptome. The expression profile obtained from the yeast

  10. Identification of single amino acid substitutions (SAAS) in neuraminidase from influenza a virus (H1N1) via mass spectrometry analysis coupled with de novo peptide sequencing.

    PubMed

    Peng, Qisheng; Wang, Zijian; Wu, Donglin; Li, Xiaoou; Liu, Xiaofeng; Sun, Wanchun; Liu, Ning

    2016-08-01

    Amino acid substitutions in the neuraminidase of the influenza virus are the main cause of the emergence of resistance to zanamivir or oseltamivir during seasonal influenza treatment; they are the result of non-synonymous mutations in the viral genome that can be successfully detected by polymerase chain reaction (PCR)-based approaches. There is always an urgent need to detect variation in amino acid sequences directly at the protein level. Mass spectrometry coupled with de novo sequencing has been explored as an alternative and straightforward strategy for detecting amino acid substitutions; this approach is the primary focus of the present study. Influenza virus (A/Puerto Rico/8/1934 H1N1) propagated in embryonated chicken eggs was purified by ultracentrifugation, followed by PNGase F treatment. The deglycosylated virion was lysed and separated by sodium dodecyl sulfate polyacrylamide gel electrophoresis (SDS-PAGE). The gel band corresponding to neuraminidase was excised and subjected to liquid chromatography tandem mass spectrometry (LC-MS/MS) analysis. LC-MS/MS analyses, coupled with manual de novo sequencing, allowed the determination of three amino acid substitutions: R346K, S349N, and S370I/L, in the neuraminidase from the influenza virus (A/Puerto Rico/8/1934 H1N1), which were located in three mutated peptides of the neuraminidase: YGNGVWIGK, TKNHSSR, and PNGWTETDI/LK, respectively. We found that the amino acid substitutions in the proteins of RNA viruses (including influenza A virus) resulting from non-synonymous gene mutations can indeed be directly analyzed via mass spectrometry, and that manual interpretation of the MS/MS data may be beneficial. Copyright © 2016 John Wiley & Sons, Ltd.
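
    As a rough illustration of why the third substitution above is written S370I/L: a substitution shows up in MS/MS data as a shift in residue mass, and isoleucine and leucine have identical residue masses, so mass alone cannot tell them apart. The sketch below is not taken from the study; it assumes standard monoisotopic residue masses, and the function name `substitution_shift` is hypothetical.

```python
# Standard monoisotopic residue masses in Da (assumed values, not from the abstract).
RESIDUE_MASS = {
    "R": 156.10111,  # arginine
    "K": 128.09496,  # lysine
    "S": 87.03203,   # serine
    "N": 114.04293,  # asparagine
    "I": 113.08406,  # isoleucine
    "L": 113.08406,  # leucine (isobaric with isoleucine)
}


def substitution_shift(original: str, replacement: str) -> float:
    """Return the residue mass change (Da) when `original` is replaced by `replacement`."""
    return RESIDUE_MASS[replacement] - RESIDUE_MASS[original]


if __name__ == "__main__":
    # The three substitutions named in the abstract.
    for label, old, new in [("R346K", "R", "K"), ("S349N", "S", "N"),
                            ("S370I", "S", "I"), ("S370L", "S", "L")]:
        print(f"{label}: {substitution_shift(old, new):+.5f} Da")
```

    The S370I and S370L lines print the same shift, which is why a mass-based de novo reading reports the residue as I/L.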

  11. Pseudo-De Novo Assembly and Analysis of Unmapped Genome Sequence Reads in Wild Zebrafish Reveal Novel Gene Content.

    PubMed

    Faber-Hammond, Joshua J; Brown, Kim H

    2016-04-01

    Zebrafish represents the third vertebrate with an officially completed genome, yet it remains incomplete with additions and corrections continuing with the current release, GRCz10, having 13% of zebrafish cDNA sequences unmapped. This disparity may result from population differences, given that the genome reference was generated from clonal individuals with limited genetic diversity. This is supported by the recent analysis of a single wild zebrafish, which identified over 5.2 million SNPs and 1.6 million in/dels in the previous genome build, zv9. Re-examination of this sequence data set indicated that 13.8% of quality sequence reads failed to align to GRCz10. Using a novel bioinformatics de novo assembly pipeline on these unmappable reads, we identified 1,514,491 novel contigs covering ∼224 Mb of genomic sequence. Among these, 1083 contigs were found to contain a potential gene coding sequence. RNA-seq data comparison confirmed that 362 contigs contained a transcribed DNA sequence, suggesting that a large amount of functional genomic sequence remains unannotated in the zebrafish reference genome. By utilizing the bioinformatics pipeline developed in this study, the zebrafish genome will be bolstered as a model for human disease research. Adaptation of the pipeline described here also offers a cost-efficient and effective method to identify and map novel genetic content across any genome and will ultimately aid in the completion of additional genomes for a broad range of species.

  12. Rapid 'de novo' peptide sequencing by a combination of nanoelectrospray, isotopic labeling and a quadrupole/time-of-flight mass spectrometer.

    PubMed

    Shevchenko, A; Chernushevich, I; Ens, W; Standing, K G; Thomson, B; Wilm, M; Mann, M

    1997-01-01

    Protein microanalysis usually involves the sequencing of gel-separated proteins available in very small amounts. While mass spectrometry has become the method of choice for identifying proteins in databases, in almost all laboratories 'de novo' protein sequencing is still performed by Edman degradation. Here we show that a combination of the nanoelectrospray ion source, isotopic end labeling of peptides and a quadrupole/time-of-flight instrument allows facile read-out of the sequences of tryptic peptides. Isotopic labeling was performed by enzymatic digestion of proteins in 1:1 16O/18O water, eliminating the need for peptide derivatization. A quadrupole/time-of-flight mass spectrometer was constructed from a triple quadrupole and an electrospray time-of-flight instrument. Tandem mass spectra of peptides were obtained with better than 50 ppm mass accuracy and resolution routinely in excess of 5000. Unique and error tolerant identification of yeast proteins as well as the sequencing of a novel protein illustrate the potential of the approach. The high data quality in tandem mass spectra and the additional information provided by the isotopic end labeling of peptides enabled automated interpretation of the spectra via simple software algorithms. The technique demonstrated here removes one of the last obstacles to routine and high throughput protein sequencing by mass spectrometry.
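
    To make the two figures of merit quoted above concrete, the following sketch shows how mass accuracy in ppm and resolving power are conventionally computed. It is only an illustration: the m/z values are invented for the example, the function names are hypothetical, and resolving power is assumed to use the common m/Δm (full width at half maximum) definition, which the abstract does not specify.

```python
def ppm_error(observed_mz: float, theoretical_mz: float) -> float:
    """Mass error in parts per million, relative to the theoretical m/z."""
    return (observed_mz - theoretical_mz) / theoretical_mz * 1e6


def resolving_power(mz: float, peak_width_fwhm: float) -> float:
    """Resolving power as m / delta-m, with delta-m taken as the peak width at half maximum."""
    return mz / peak_width_fwhm


if __name__ == "__main__":
    # Invented example values: a fragment expected at m/z 1000.00 and observed at 1000.04.
    error = ppm_error(1000.04, 1000.00)
    print(f"mass error: {error:.1f} ppm, within 50 ppm: {abs(error) < 50}")
    # A peak 0.2 m/z units wide at m/z 1000 corresponds to a resolving power of 5000.
    print(f"resolving power: {resolving_power(1000.0, 0.2):.0f}")
```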

  13. Neurodevelopmental disease-associated de novo mutations and rare sequence variants affect TRIO GDP/GTP exchange factor activity.

    PubMed

    Katrancha, Sara M; Wu, Yi; Zhu, Minsheng; Eipper, Betty A; Koleske, Anthony J; Mains, Richard E

    2017-09-14

    Bipolar disorder, schizophrenia, autism, and intellectual disability are complex neurodevelopmental disorders, debilitating millions of people. Therapeutic progress is limited by poor understanding of underlying molecular pathways. Using a targeted search, we identified an enrichment of de novo mutations in the gene encoding the 330-kDa triple functional domain (TRIO) protein associated with neurodevelopmental disorders. By generating multiple TRIO antibodies, we show that the smaller TRIO9 isoform is the major brain protein product, and its levels decrease after birth. TRIO9 contains two guanine nucleotide exchange factor (GEF) domains with distinct specificities: GEF1 activates both Rac1 and RhoG; GEF2 activates RhoA. To understand the impact of disease-associated de novo mutations and other rare sequence variants on TRIO function, we utilized two FRET-based biosensors: a Rac1 biosensor to study mutations in TRIO (T)GEF1, and a RhoA biosensor to study mutations in TGEF2. We discovered that one autism-associated de novo mutation in TGEF1 (K1431M), at the TGEF1/Rac1 interface, markedly decreased its overall activity toward Rac1. A schizophrenia-associated rare sequence variant in TGEF1 (F1538Intron) was substantially less active, normalized to protein level, and expressed poorly. Overall, mutations in TGEF1 decreased GEF1 activity toward Rac1. One bipolar disorder-associated rare variant (M2145T) in TGEF2 impaired inhibition by the TGEF2 pleckstrin-homology domain, resulting in dramatically increased TGEF2 activity. Overall, genetic damage to both TGEF domains altered TRIO catalytic activity, decreasing TGEF1 activity and increasing TGEF2 activity. Importantly, both GEF changes are expected to decrease neurite outgrowth, perhaps consistent with their association with neurodevelopmental disorders. © The Author 2017. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oup.com.

  14. Knowledge-based approach to de novo design using reaction vectors.

    PubMed

    Patel, Hina; Bodkin, Michael J; Chen, Beining; Gillet, Valerie J

    2009-05-01

    A knowledge-based approach to the de novo design of synthetically feasible molecules is described. The method is based on reaction vectors which represent the structural changes that take place at the reaction center along with the environment in which the reaction occurs. The reaction vectors are derived automatically from a database of reactions which is not restricted by size or reaction complexity. A structure generation algorithm has been developed whereby reaction vectors can be applied to previously unseen starting materials in order to suggest novel syntheses. The approach has been implemented in KNIME and is validated by reproducing known synthetic routes. We then present applications of the method in different drug design scenarios including lead optimization and library enumeration. The method offers great potential for capturing and using the growing body of data on reactions that is becoming available through electronic laboratory notebooks.

  15. Application of de novo sequencing tools to study abiogenic peptide formations by tandem mass spectrometry. The case of homo-peptides from glutamic acid complicated by substitutions of hydrogen by sodium or potassium atoms.

    PubMed

    Terterov, Ivan; Vyatkina, Kira; Kononikhin, Alexey S; Boitsov, Vitali; Vyazmin, Sergey; Popov, Igor A; Nikolaev, Eugene N; Pevzner, Pavel; Dubina, Michael

    2014-01-15

    Peptides and proteins are among the most important components of living systems. Different attempts have been made to experimentally model the formation of peptides from amino acid monomers in investigations of the origin of life. Detailed characterization of peptides formed under various conditions in such reactions is very important for understanding processes of abiogenic peptide formation. We used liquid chromatography coupled with tandem mass spectrometry (MS/MS) for an accurate study of homo-peptides formed in a model reaction: glutamic acid oligomerization catalyzed by 1,1'-carbonyldiimidazole in aqueous solution with 1 M of sodium or potassium chloride and without any salts. We used de novo sequencing software for peptide identification. In addition, we propose an approach that uses more spectral information for de novo sequencing than standard methods. Peptides up to 9 amino acids long were found in the experiments with KCl, while in experiments with NaCl and without salts only peptides of up to 7 amino acids were detected. Due to high salt concentrations in the samples, a high number of singly charged peptide ions with up to 4 substitutions of hydrogen atoms by sodium or potassium atoms were observed. De novo sequencing software provided correct identifications even for peptide ions with substitutions. Multiple substitutions of hydrogen by alkali metal atoms in peptide ions strongly change their fragmentation patterns. The proposed approach for de novo sequencing was found to be very effective, even for ions with substitutions. It may therefore be useful in more complicated cases such as sequencing abiogenic peptides consisting of different amino acids. Copyright © 2013 John Wiley & Sons, Ltd.
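
    The effect of hydrogen-by-metal substitution on precursor mass can be sketched with simple arithmetic. The example below is an illustration under stated assumptions rather than the authors' method: it uses standard monoisotopic masses, assumes the singly charged ion carries one proton in addition to any substituted hydrogens, and the function name `poly_glu_mz` is hypothetical.

```python
# Standard monoisotopic masses in Da (assumed values, not from the abstract).
GLU_RESIDUE = 129.04259  # glutamic acid residue, C5H7NO3
WATER = 18.01056         # added once per linear peptide
PROTON = 1.00728         # charge carrier for a singly protonated ion
H_ATOM = 1.00783
NA_ATOM = 22.98977
K_ATOM = 38.96371


def poly_glu_mz(n_residues: int, n_na: int = 0, n_k: int = 0) -> float:
    """m/z of a singly protonated Glu homo-peptide with H atoms replaced by Na and/or K."""
    neutral = n_residues * GLU_RESIDUE + WATER
    # Each substitution removes one hydrogen atom and adds one metal atom.
    neutral += n_na * (NA_ATOM - H_ATOM) + n_k * (K_ATOM - H_ATOM)
    return neutral + PROTON


if __name__ == "__main__":
    # A 7-residue peptide with 0 to 4 sodium substitutions (the abstract reports up to 4).
    for n_na in range(5):
        print(f"Glu7, {n_na} Na: m/z {poly_glu_mz(7, n_na=n_na):.4f}")
```

    Each sodium substitution adds about 21.98 Da and each potassium substitution about 37.96 Da, so the substituted ions form a regularly spaced series above the unsubstituted peak.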

  16. De Novo Transcriptome Sequencing of Desert Herbaceous Achnatherum splendens (Achnatherum) Seedlings and Identification of Salt Tolerance Genes

    PubMed Central

    Liu, Jiangtao; Zhou, Yuelong; Luo, Changxin; Xiang, Yun; An, Lizhe

    2016-01-01

    Achnatherum splendens is an important forage herb in Northwestern China. It has a high tolerance to salinity and is, thus, considered one of the most important constructive plants in saline and alkaline areas of land in Northwest China. However, the mechanisms of salt stress tolerance in A. splendens remain unknown. Next-generation sequencing (NGS) technologies can be used for global gene expression profiling. In this study, we examined sequence and transcript abundance data for the root/leaf transcriptome of A. splendens obtained using an Illumina HiSeq 2500. Over 35 million clean reads were obtained from the leaf and root libraries. All of the RNA sequencing (RNA-seq) reads were assembled de novo into a total of 126,235 unigenes and 36,511 coding DNA sequences (CDS). We further identified 1663 differentially-expressed genes (DEGs) between the salt stress treatment and control. Functional annotation of the DEGs by gene ontology (GO), using Arabidopsis and rice as references, revealed enrichment of salt stress-related GO categories, including “oxidation reduction”, “transcription factor activity”, and “ion channel transporter”. Thus, this global transcriptome analysis of A. splendens has provided an important genetic resource for the study of salt tolerance in this halophyte. The identified sequences and their putative functional data will facilitate future investigations of the tolerance of Achnatherum species to various types of abiotic stress. PMID:27023614

  17. Transcriptome Sequencing, De Novo Assembly and Differential Gene Expression Analysis of the Early Development of Acipenser baeri

    PubMed Central

    Song, Wei; Jiang, Keji; Zhang, Fengying; Lin, Yu; Ma, Lingbo

    2015-01-01

    The molecular mechanisms that drive the development of the endangered fossil fish species Acipenser baeri are difficult to study due to the lack of genomic data. Recent advances in sequencing technologies and the falling cost of sequencing offer exclusive opportunities for exploring important molecular mechanisms underlying specific biological processes. This manuscript describes the large scale sequencing and analyses of mRNA from Acipenser baeri collected at five development time points using the Illumina HiSeq 2000 platform. The sequencing reads were de novo assembled and clustered into 278167 unigenes, of which 57346 (20.62%) had 45837 known homologous proteins in Uniprot protein databases while 11509 proteins matched with at least one sequence of assembled unigenes. The remaining 79.38% of unigenes could stand for non-coding unigenes or unigenes specific to A. baeri. A total of 43062 unigenes were annotated into functional categories via Gene Ontology (GO) annotation whereas 29526 unigenes were associated with 329 pathways by mapping to the KEGG database. Subsequently, 3479 differentially expressed genes were scanned within developmental stages and clustered into 50 gene expression profiles. Genes preferentially expressed at each stage were also identified. Through GO and KEGG pathway enrichment analysis, relevant physiological variations during the early development of A. baeri could be better understood. Accordingly, the present study gives insights into the transcriptome profile of the early development of A. baeri, and the information contained in this large scale transcriptome will provide substantial references for A. baeri developmental biology and promote its aquaculture research. PMID:26359664

  18. A gene expression microarray for Nicotiana benthamiana based on de novo transcriptome sequence assembly.

    PubMed

    Goralski, Michal; Sobieszczanska, Paula; Obrepalska-Steplowska, Aleksandra; Swiercz, Aleksandra; Zmienko, Agnieszka; Figlerowicz, Marek

    2016-01-01

    microarray capabilities for studying gene expression in this plant. Additionally, by defining the sense orientation of over 106,000 contigs, we substantially improved the functional information on the N. benthamiana transcriptome. The simple hybridization-based approach for detecting the sense orientation of computationally assembled sequences can be used for updating the transcriptomes of other non-model organisms, including cases where no significant homology to known proteins exists.

  19. Complete genome sequence of novel carbon monoxide oxidizing bacteria Citrobacter amalonaticus Y19, assembled de novo.

    PubMed

    Ainala, Satish Kumar; Seol, Eunhee; Park, Sunghoon

    2015-10-10

    We report here the complete genome sequence of Citrobacter amalonaticus Y19 isolated from an anaerobic digester. PacBio single-molecule real-time (SMRT) sequencing was employed, resulting in a single scaffold of 5.58 Mb. The sequence of a mega plasmid of 291 kb is also presented.

  20. Single-Cell RNA Sequencing Reveals T Helper Cells Synthesizing Steroids De Novo to Contribute to Immune Homeostasis

    PubMed Central

    Mahata, Bidesh; Zhang, Xiuwei; Kolodziejczyk, Aleksandra A.; Proserpio, Valentina; Haim-Vilmovsky, Liora; Taylor, Angela E.; Hebenstreit, Daniel; Dingler, Felix A.; Moignard, Victoria; Göttgens, Berthold; Arlt, Wiebke; McKenzie, Andrew N.J.; Teichmann, Sarah A.

    2014-01-01

    Summary T helper 2 (Th2) cells regulate helminth infections, allergic disorders, tumor immunity, and pregnancy by secreting various cytokines. It is likely that there are undiscovered Th2 signaling molecules. Although steroids are known to be immunoregulators, de novo steroid production from immune cells has not been previously characterized. Here, we demonstrate production of the steroid pregnenolone by Th2 cells in vitro and in vivo in a helminth infection model. Single-cell RNA sequencing and quantitative PCR analysis suggest that pregnenolone synthesis in Th2 cells is related to immunosuppression. In support of this, we show that pregnenolone inhibits Th cell proliferation and B cell immunoglobulin class switching. We also show that steroidogenic Th2 cells inhibit Th cell proliferation in a Cyp11a1 enzyme-dependent manner. We propose pregnenolone as a “lymphosteroid,” a steroid produced by lymphocytes. We speculate that this de novo steroid production may be an intrinsic phenomenon of Th2-mediated immune responses to actively restore immune homeostasis. PMID:24813893

  1. De novo Sequence Assembly and Characterization of Lycoris aurea Transcriptome Using GS FLX Titanium Platform of 454 Pyrosequencing

    PubMed Central

    Wang, Ren; Xu, Sheng; Jiang, Yumei; Jiang, Jingwei; Li, Xiaodan; Liang, Lijian; He, Jia; Peng, Feng; Xia, Bing

    2013-01-01

    Background Lycoris aurea, also called Golden Magic Lily, is an ornamentally and medicinally important species of the Amaryllidaceae family. To date, no whole-genome sequence is available for this non-model organism. Transcriptomic information is also scarce for this species. In this study, we performed de novo transcriptome sequencing to produce the first comprehensive expressed sequence tag (EST) dataset for L. aurea using high-throughput sequencing technology. Methodology and Principal Findings Total RNA was isolated from leaves with sodium nitroprusside (SNP), salicylic acid (SA), or methyl jasmonate (MeJA) treatment, stems, and flowers at the bud, blooming, and wilting stages. Equal quantities of RNA from each tissue and stage were pooled to construct a cDNA library. Using 454 pyrosequencing technology, a total of 937,990 high quality reads (308.63 Mb) with an average read length of 329 bp were generated. Clustering and assembly of these reads produced a non-redundant set of 141,111 unique sequences, comprising 24,604 contigs and 116,507 singletons. All of the unique sequences were assigned to the biological process, cellular component and molecular function categories by GO analysis. Potential genes and their functions were predicted by KEGG pathway mapping and COG analysis. Based on our sequence analysis and the published literature, many putative genes involved in Amaryllidaceae alkaloid synthesis, including PAL, TYDC, OMT, NMT, P450, and other potentially important candidate genes, were identified for the first time in this Lycoris. Furthermore, 6,386 SSRs and 18,107 high-confidence SNPs were identified in this EST dataset. Conclusions The transcriptome provides an invaluable new functional genomics resource for future biological research in L. aurea. The molecular markers identified in this study will provide a material basis for future genetic linkage and quantitative trait loci analyses, and will provide useful information for functional

  2. A Statistical Approach for Ambiguous Sequence Mappings

    USDA-ARS's Scientific Manuscript database

    When attempting to map RNA sequences to a reference genome, high percentages of short sequence reads are often assigned to multiple genomic locations. One approach to handling these “ambiguous mappings” has been to discard them. This results in a loss of data, which can sometimes be as much as 45% o...

  3. De novo construction of a "Gene-space" for diploid plant genome rich in repetitive sequences by an iterative Process of Extraction and Assembly of NGS reads (iPEA protocol) with limited computing resources.

    PubMed

    Aluome, Christelle; Aubert, Grégoire; Alves Carvalho, Susete; Le Paslier, Marie-Christine; Burstin, Judith; Brunel, Dominique

    2016-02-11

    The continuing increase in size and quality of the "short reads" raw data is a significant help for the quality of the assembly obtained through various bioinformatics tools. However, building a reference genome sequence for most plant species remains a significant challenge due to the large number of repeated sequences, which are problematic for a whole-genome quality de novo assembly. Furthermore, for most SNP identification approaches in plant genetics and breeding, only the "Gene-space" regions including the promoter, exon and intron sequences are considered. We developed the iPEA protocol to produce a de novo Gene-space assembly by reconstructing, in an iterative way, the non-coding sequence flanking the Unigene cDNA sequence through addition of next-generation DNA-seq data. The approach was elaborated with the large diploid genome of pea (Pisum sativum L.), rich in repetitive sequences. The final Gene-space assembly included 35,400 contigs (97 Mb), covering 88% of the 40,227 contigs (53.1 Mb) of the PsCam_low-copy Unigene set. Its accuracy was validated by the results of the built GenoPea 13.2 K SNP Array. The iPEA protocol allows the reconstruction of a Gene-space from RNA-Seq and DNA-seq data with limited computing resources.

  4. Increased frequency of de novo copy number variants in congenital heart disease by integrative analysis of single nucleotide polymorphism array and exome sequence data.

    PubMed

    Glessner, Joseph T; Bick, Alexander G; Ito, Kaoru; Homsy, Jason; Rodriguez-Murillo, Laura; Fromer, Menachem; Mazaika, Erica; Vardarajan, Badri; Italia, Michael; Leipzig, Jeremy; DePalma, Steven R; Golhar, Ryan; Sanders, Stephan J; Yamrom, Boris; Ronemus, Michael; Iossifov, Ivan; Willsey, A Jeremy; State, Matthew W; Kaltman, Jonathan R; White, Peter S; Shen, Yufeng; Warburton, Dorothy; Brueckner, Martina; Seidman, Christine; Goldmuntz, Elizabeth; Gelb, Bruce D; Lifton, Richard; Seidman, Jonathan; Hakonarson, Hakon; Chung, Wendy K

    2014-10-24

    Congenital heart disease (CHD) is among the most common birth defects. Most cases are of unknown pathogenesis. We sought to determine the contribution of de novo copy number variants (CNVs) to the pathogenesis of sporadic CHD. We studied 538 CHD trios using genome-wide dense single nucleotide polymorphism arrays and whole exome sequencing. Results were experimentally validated using digital droplet polymerase chain reaction. We compared validated CNVs in CHD cases with CNVs in 1301 healthy control trios. The 2 complementary high-resolution technologies identified 63 validated de novo CNVs in 51 CHD cases. A significant increase in CNV burden was observed when comparing CHD trios with healthy trios, using either single nucleotide polymorphism array (P=7×10⁻⁵; odds ratio, 4.6) or whole exome sequencing data (P=6×10⁻⁴; odds ratio, 3.5) and remained after removing 16% of de novo CNV loci previously reported as pathogenic (P=0.02; odds ratio, 2.7). We observed recurrent de novo CNVs on 15q11.2 encompassing CYFIP1, NIPA1, and NIPA2 and single de novo CNVs encompassing DUSP1, JUN, JUP, MED15, MED9, PTPRE, SREBF1, TOP2A, and ZEB2, genes that interact with established CHD proteins NKX2-5 and GATA4. Integrating de novo variants in whole exome sequencing and CNV data suggests that ETS1 is the pathogenic gene altered by 11q24.2-q25 deletions in Jacobsen syndrome and that CTBP2 is the pathogenic gene in 10q subtelomeric deletions. We demonstrate a significantly increased frequency of rare de novo CNVs in CHD patients compared with healthy controls and suggest several novel genetic loci for CHD. © 2014 American Heart Association, Inc.

  5. Intellectual disability and non-compaction cardiomyopathy with a de novo NONO mutation identified by exome sequencing.

    PubMed

    Reinstein, Eyal; Tzur, Shay; Cohen, Rony; Bormans, Concetta; Behar, Doron M

    2016-11-01

    Pathogenic variants in the NONO gene have recently been implicated in X-linked intellectual disability syndrome. This observation has been supported by studies of NONO-deficient mice showing that NONO has an important role in regulating inhibitory synaptic activity. Thus far, the phenotypic spectrum of affected patients remains limited. We applied whole exome sequencing to members of a family in which the proband presented with a complex phenotype consisting of developmental delay, dysmorphism, and non-compaction cardiomyopathy. Exome analysis identified a novel de novo splice-site variant c.1171+1G>T in exon 11 of the NONO gene that is suspected to abolish the donor splicing site. Thus, we propose that the phenotypic spectrum of NONO-related disorder is much broader than described and that pathogenic variants in NONO cause a recognizable phenotype.

  6. The genome sequence of the outbreeding globe artichoke constructed de novo incorporating a phase-aware low-pass sequencing strategy of F1 progeny.

    PubMed

    Scaglione, Davide; Reyes-Chin-Wo, Sebastian; Acquadro, Alberto; Froenicke, Lutz; Portis, Ezio; Beitel, Christopher; Tirone, Matteo; Mauro, Rosario; Lo Monaco, Antonino; Mauromicale, Giovanni; Faccioli, Primetta; Cattivelli, Luigi; Rieseberg, Loren; Michelmore, Richard; Lanteri, Sergio

    2016-01-20

    Globe artichoke (Cynara cardunculus var. scolymus) is an out-crossing, perennial, multi-use crop species that is grown worldwide and belongs to the Compositae, one of the most successful Angiosperm families. We describe the first genome sequence of globe artichoke. The assembly, comprising 13,588 scaffolds covering 725 of the 1,084 Mb genome, was generated using ~133-fold Illumina sequencing data and encodes 26,889 predicted genes. Re-sequencing (30×) of globe artichoke and cultivated cardoon (C. cardunculus var. altilis) parental genotypes and low-coverage (0.5 to 1×) genotyping-by-sequencing of 163 F1 individuals resulted in 73% of the assembled genome being anchored in 2,178 genetic bins ordered along 17 chromosomal pseudomolecules. This was achieved using a novel pipeline, SOILoCo (Scaffold Ordering by Imputation with Low Coverage), to detect heterozygous regions and assign parental haplotypes with low sequencing read depth and of unknown phase. SOILoCo provides a powerful tool for de novo genome analysis of outcrossing species. Our data will enable genome-scale analyses of evolutionary processes among crops, weeds, and wild species within and beyond the Compositae, and will facilitate the identification of economically important genes from related species.

  7. The genome sequence of the outbreeding globe artichoke constructed de novo incorporating a phase-aware low-pass sequencing strategy of F1 progeny

    PubMed Central

    Scaglione, Davide; Reyes-Chin-Wo, Sebastian; Acquadro, Alberto; Froenicke, Lutz; Portis, Ezio; Beitel, Christopher; Tirone, Matteo; Mauro, Rosario; Lo Monaco, Antonino; Mauromicale, Giovanni; Faccioli, Primetta; Cattivelli, Luigi; Rieseberg, Loren; Michelmore, Richard; Lanteri, Sergio

    2016-01-01

    Globe artichoke (Cynara cardunculus var. scolymus) is an out-crossing, perennial, multi-use crop species that is grown worldwide and belongs to the Compositae, one of the most successful Angiosperm families. We describe the first genome sequence of globe artichoke. The assembly, comprising 13,588 scaffolds covering 725 of the 1,084 Mb genome, was generated using ~133-fold Illumina sequencing data and encodes 26,889 predicted genes. Re-sequencing (30×) of globe artichoke and cultivated cardoon (C. cardunculus var. altilis) parental genotypes and low-coverage (0.5 to 1×) genotyping-by-sequencing of 163 F1 individuals resulted in 73% of the assembled genome being anchored in 2,178 genetic bins ordered along 17 chromosomal pseudomolecules. This was achieved using a novel pipeline, SOILoCo (Scaffold Ordering by Imputation with Low Coverage), to detect heterozygous regions and assign parental haplotypes with low sequencing read depth and of unknown phase. SOILoCo provides a powerful tool for de novo genome analysis of outcrossing species. Our data will enable genome-scale analyses of evolutionary processes among crops, weeds, and wild species within and beyond the Compositae, and will facilitate the identification of economically important genes from related species. PMID:26786968

  8. De novo sequencing analysis of the Rosa roxburghii fruit transcriptome reveals putative ascorbate biosynthetic genes and EST-SSR markers.

    PubMed

    Yan, Xiuqin; Zhang, Xue; Lu, Min; He, Yong; An, Huaming

    2015-04-25

    Rosa roxburghii Tratt. is a well-known ornamental rose species native to China. In addition, the fruits of this species are valued for their nutritional and medicinal characteristics, especially their high ascorbic acid (AsA) levels. Nevertheless, AsA biosynthesis in R. roxburghii fruit has not been explored in detail because of a lack of genomic resources for this species. High-throughput transcriptomic sequencing generating large volumes of transcript sequence data can aid in gene discovery and molecular marker development. In this study, we generated more than 53 million clean reads using Illumina paired-end sequencing technology. De novo assembly yielded 106,590 unigenes, with an average length of 343 bp. On the basis of sequence similarity to known proteins, 9301 and 2393 unigenes were classified into Gene Ontology and Clusters of Orthologous Group categories, respectively. There were 7480 unigenes assigned to 124 pathways in the Kyoto Encyclopedia of Genes and Genomes pathway database. BLASTx searches identified 498 unique putative transcripts encoding various transcription factors, some known to regulate fruit development. qRT-PCR validated the expression of most of the genes encoding the main enzymes involved in ascorbate biosynthesis. In addition, 9131 potential simple sequence repeat (SSR) loci were identified among the unigenes. One hundred and two primer pairs were synthesized and 71 pairs produced an amplification product during initial screening. Among the amplified products, 30 were polymorphic in the 16 R. roxburghii germplasms tested. Our study was the first to produce a large volume of transcriptome data from R. roxburghii. The resulting sequence collection is a valuable resource for gene discovery and marker-assisted selective breeding in this rose species.
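
    The abstract above does not say how the SSR loci were detected. As a minimal sketch of the general idea only, a simple sequence repeat can be found by searching each sequence for a short motif repeated in tandem. Everything specific in the example is an assumption: the function name `find_ssrs`, the motif lengths considered, and the minimum repeat counts are illustrative and are not the thresholds used in the study.

```python
import re

# Hypothetical thresholds: motif length -> minimum number of tandem copies.
DEFAULT_MIN_REPEATS = {2: 6, 3: 5, 4: 4}


def is_primitive(unit: str) -> bool:
    """True if the motif is not itself a repeat of a shorter motif (e.g. 'ATAT' or 'AA')."""
    n = len(unit)
    return not any(n % d == 0 and unit == unit[:d] * (n // d) for d in range(1, n))


def find_ssrs(seq: str, min_repeats=None):
    """Return sorted (start, motif, copies) tuples for perfect tandem repeats in `seq`."""
    thresholds = DEFAULT_MIN_REPEATS if min_repeats is None else min_repeats
    seq = seq.upper()
    hits = []
    for unit_len, min_rep in thresholds.items():
        # A motif of unit_len bases followed by at least (min_rep - 1) further copies.
        pattern = re.compile(r"(([ACGT]{%d})\2{%d,})" % (unit_len, min_rep - 1))
        for match in pattern.finditer(seq):
            unit = match.group(2)
            if is_primitive(unit):
                hits.append((match.start(), unit, len(match.group(1)) // unit_len))
    return sorted(hits)


if __name__ == "__main__":
    # Invented test sequence containing (AG)7 and (CAT)5.
    example = "GGC" + "AG" * 7 + "TTC" + "CAT" * 5 + "GA"
    for start, motif, copies in find_ssrs(example):
        print(f"position {start}: ({motif}){copies}")
```

    Dedicated SSR tools also handle compound and imperfect repeats, which this sketch deliberately ignores.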

  9. De novo Assembly and Characterization of the Global Transcriptome for Rhyacionia leptotubula Using Illumina Paired-End Sequencing

    PubMed Central

    Zhu, Jia-Ying; Li, Yong-He; Yang, Song; Li, Qin-Wen

    2013-01-01

    Background The pine tip moth, Rhyacionia leptotubula (Lepidoptera: Tortricidae) is one of the most destructive forestry pests in Yunnan Province, China. Despite its importance, little is known about this pest. Understanding its genetic information is essential for exploring its specific traits at the molecular level. Thus, we sequenced the transcriptome of R. leptotubula with high-throughput Illumina sequencing. Methodology/Principal Findings In a single run, more than 60 million sequencing reads were generated. De novo assembly was performed to generate a collection of 46,910 unigenes with a mean length of 642 bp. Based on a Blastx search with an E-value cut-off of 10⁻⁵, 22,581 unigenes showed significant similarities to known proteins from the National Center for Biotechnology Information (NCBI) non-redundant (Nr) protein database. Of these annotated unigenes, 10,360, 6,937 and 13,894 were assigned to Gene Ontology (GO), Clusters of Orthologous Group (COG), and Kyoto Encyclopedia of Genes and Genomes (KEGG) databases, respectively. A total of 5,926 unigenes were annotated with domain similarity derived functional information, of which 55 and 39 unigenes encoded the insecticide resistance related enzymes cytochrome P450 and carboxylesterase, respectively. Using the transcriptome data, 47 unigenes belonging to the typical “stress” genes of the heat shock protein (Hsp) family were retrieved. Furthermore, 1,450 simple sequence repeats (SSRs) were detected; 3.09% of the unigenes contained SSRs. Large numbers of SSR primer pairs were designed, and 80% of the randomly verified primer pairs successfully yielded amplicons. Conclusions/Significance A large number of putative R. leptotubula transcript sequences has been obtained from the deep sequencing, which extensively increases the comprehensive and integrated genomic resources of this pest. This large-scale transcriptome dataset will be an important information platform for promoting our

  10. Highly efficient de novo mutant identification in a Sorghum bicolor TILLING population using the ComSeq approach

    USDA-ARS's Scientific Manuscript database

    Screening large populations for carriers of known or de novo rare SNPs is required both in Targeting induced local lesions IN genomes (TILLING) experiments in plants and analogously in screening human populations. We formerly suggested an approach that combines the celebrated mathematical field of c...

  11. De novo assembly and next-generation sequencing to analyse full-length gene variants from codon-barcoded libraries

    PubMed Central

    Cho, Namjin; Hwang, Byungjin; Yoon, Jung-ki; Park, Sangun; Lee, Joongoo; Seo, Han Na; Lee, Jeewon; Huh, Sunghoon; Chung, Jinsoo; Bang, Duhee

    2015-01-01

    Interpreting epistatic interactions is crucial for understanding evolutionary dynamics of complex genetic systems and unveiling structure and function of genetic pathways. Although high-resolution mapping of en masse variant libraries enables molecular biologists to address genotype-phenotype relationships, long-read sequencing technology remains indispensable to assess functional relationships between mutations that lie far apart. Here, we introduce JigsawSeq for multiplexed sequence identification of pooled gene variant libraries by combining a codon-based molecular barcoding strategy and de novo assembly of short-read data. We first validated JigsawSeq on small sub-pools and observed high precision and recall at various experimental settings. With extensive simulations, we then applied JigsawSeq to large-scale gene variant libraries to show that our method can be reliably scaled using next-generation sequencing. JigsawSeq may serve as a rapid screening tool for functional genomics and offer the opportunity to explore evolutionary trajectories of protein variants. PMID:26387459
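
    The abstract above reports precision and recall without stating how they were computed. As a minimal sketch, assuming the usual set-based definitions (precision = true positives / all called variants, recall = true positives / all true variants), the calculation might look like this. The function name and the variant identifiers are hypothetical and are not taken from JigsawSeq.

```python
def precision_recall(called, truth):
    """Return (precision, recall) of a set of called variants against a truth set."""
    called, truth = set(called), set(truth)
    true_positives = len(called & truth)
    # Guard against empty inputs so the ratios are always defined.
    precision = true_positives / len(called) if called else 0.0
    recall = true_positives / len(truth) if truth else 0.0
    return precision, recall


# Hypothetical variant identifiers, for illustration only.
truth = {"v1", "v2", "v3", "v4"}
called = {"v1", "v2", "v3", "v9"}
print(precision_recall(called, truth))
```

    Here three of the four called variants are real and three of the four real variants are recovered, so both ratios are 0.75.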

  12. High-Throughput Sequencing and De Novo Assembly of Red and Green Forms of the Perilla frutescens var. crispa Transcriptome

    PubMed Central

    Fukushima, Atsushi; Nakamura, Michimi; Suzuki, Hideyuki; Saito, Kazuki; Yamazaki, Mami

    2015-01-01

    Perilla frutescens var. crispa (Labiatae) has two chemo-varietal forms, i.e. red and green forms of perilla, that differ in the production of anthocyanins. To facilitate molecular biological and biochemical studies in perilla-specialized metabolism we used Illumina RNA-sequencing technology in our comprehensive comparison of the transcriptome map of the leaves of red and green forms of perilla. Sequencing generated over 1.2 billion short reads with an average length of 101 nt. De novo transcriptome assembly yielded 47,788 and 47,840 unigenes in the red and green forms of perilla plants, respectively. Comparison of the assembled unigenes and existing perilla cDNA sequences showed highly reliable alignment. All unigenes were annotated with gene ontology (GO) and Enzyme Commission numbers and entered into the Kyoto Encyclopedia of Genes and Genomes. We identified 68 differentially expressed genes (DEGs) in red and green forms of perilla. GO enrichment analysis of the DEGs showed that genes involved in the anthocyanin metabolic process were enriched. Differential expression analysis revealed that the transcript level of anthocyanin biosynthetic unigenes encoding flavonoid 3’-hydroxylase, dihydroflavonol 4-reductase, and anthocyanidin synthase was significantly higher in red perilla, while the transcript level of unigenes encoding limonene synthase was significantly higher in green perilla. Our data serve as a basis for future research on perilla bio-engineering and provide a shortcut for the characterization of new functional genes in P. frutescens. PMID:26070213

  13. Restriction site associated DNA (RAD) for de novo sequencing and marker discovery in sugarcane borer, Diatraea saccharalis Fab. (Lepidoptera: Crambidae).

    PubMed

    Pavinato, V A C; Margarido, G R A; Wijeratne, A J; Wijeratne, S; Meulia, T; Souza, A P; Michel, A P; Zucchi, M I

    2017-05-01

    We present the development of a genomic library using RADseq (restriction site associated DNA sequencing) protocol for marker discovery that can be applied on evolutionary studies of the sugarcane borer Diatraea saccharalis, an important South American insect pest. A RADtag protocol combined with Illumina paired-end sequencing allowed de novo discovery of 12 811 SNPs and a high-quality assembly of 122.8M paired-end reads from six individuals, representing 40 Gb of sequencing data. Approximately 1.7 Mb of the sugarcane borer genome distributed over 5289 minicontigs were obtained upon assembly of second reads from first reads RADtag loci where at least one SNP was discovered and genotyped. Minicontig lengths ranged from 200 to 611 bp and were used for functional annotation and microsatellite discovery. These markers will be used in future studies to understand gene flow and adaptation to host plants and control tactics. © 2016 John Wiley & Sons Ltd.

  14. High-Throughput Sequencing and De Novo Assembly of Red and Green Forms of the Perilla frutescens var. crispa Transcriptome.

    PubMed

    Fukushima, Atsushi; Nakamura, Michimi; Suzuki, Hideyuki; Saito, Kazuki; Yamazaki, Mami

    2015-01-01

    Perilla frutescens var. crispa (Labiatae) has two chemo-varietal forms, i.e. red and green forms of perilla, that differ in the production of anthocyanins. To facilitate molecular biological and biochemical studies in perilla-specialized metabolism we used Illumina RNA-sequencing technology in our comprehensive comparison of the transcriptome map of the leaves of red and green forms of perilla. Sequencing generated over 1.2 billion short reads with an average length of 101 nt. De novo transcriptome assembly yielded 47,788 and 47,840 unigenes in the red and green forms of perilla plants, respectively. Comparison of the assembled unigenes and existing perilla cDNA sequences showed highly reliable alignment. All unigenes were annotated with gene ontology (GO) and Enzyme Commission numbers and entered into the Kyoto Encyclopedia of Genes and Genomes. We identified 68 differentially expressed genes (DEGs) in red and green forms of perilla. GO enrichment analysis of the DEGs showed that genes involved in the anthocyanin metabolic process were enriched. Differential expression analysis revealed that the transcript level of anthocyanin biosynthetic unigenes encoding flavonoid 3'-hydroxylase, dihydroflavonol 4-reductase, and anthocyanidin synthase was significantly higher in red perilla, while the transcript level of unigenes encoding limonene synthase was significantly higher in green perilla. Our data serve as a basis for future research on perilla bio-engineering and provide a shortcut for the characterization of new functional genes in P. frutescens.

  15. High throughput de novo RNA sequencing elucidates novel responses in Penicillium chrysogenum under microgravity.

    PubMed

    Sathishkumar, Yesupatham; Krishnaraj, Chandran; Rajagopal, Kalyanaraman; Sen, Dwaipayan; Lee, Yang Soo

    2016-02-01

    In this study, the transcriptional alterations in Penicillium chrysogenum under simulated microgravity conditions were analyzed for the first time using an RNA-Seq method. The increasing plethora of eukaryotic microbial flora inside spacecraft demands a basic understanding of fungal biology in the absence of a gravity vector. Penicillium species are the second most dominant fungal contaminants on the International Space Station. Penicillium chrysogenum, an industrially important organism, also has the potential to emerge as an opportunistic pathogen for astronauts during long-term space missions. To date, however, the cellular mechanisms underlying the survival and adaptation of Penicillium chrysogenum to microgravity conditions have not been clearly elucidated. A reference genome for Penicillium chrysogenum is not yet available in the NCBI database. Hence, we performed comparative de novo transcriptome analysis of Penicillium chrysogenum grown under microgravity versus normal gravity. In addition, the changes due to microgravity are documented at the molecular level. An increased response to environmental stimuli and changes in the cell wall component ABC transporter/MFS transporters are noteworthy. Interestingly, a sustained increase in the expression of acyl-coenzyme A: isopenicillin N acyltransferase (Acyltransferase) under microgravity revealed the significance of gravity in penicillin production, which could be exploited industrially.

  16. De Novo transcriptome sequencing reveals important molecular networks and metabolic pathways of the plant, Chlorophytum borivilianum.

    PubMed

    Kalra, Shikha; Puniya, Bhanwar Lal; Kulshreshtha, Deepika; Kumar, Sunil; Kaur, Jagdeep; Ramachandran, Srinivasan; Singh, Kashmir

    2013-01-01

    Chlorophytum borivilianum, an endangered medicinal plant species, is highly recognized for its aphrodisiac properties provided by saponins present in the plant. The transcriptome information of this species is limited and only a few hundred expressed sequence tags (ESTs) are available in the public databases. To gain molecular insight into this plant, high-throughput transcriptome sequencing of leaf RNA was carried out using Illumina's HiSeq 2000 sequencing platform. A total of 22,161,444 single-end reads were retrieved after quality filtering. Available (e.g., De-Bruijn/Eulerian graph) and in-house developed bioinformatics tools were used for assembly and annotation of the transcriptome. A total of 101,141 assembled transcripts were obtained, with coverage size of 22.42 Mb and average length of 221 bp. Guanine-cytosine (GC) content was found to be 44%. Bioinformatics analysis, using non-redundant proteins, gene ontology (GO), enzyme commission (EC) and Kyoto Encyclopedia of Genes and Genomes (KEGG) databases, extracted all the known enzymes involved in saponin and flavonoid biosynthesis. A few genes of alkaloid biosynthesis, along with anticancer and plant defense genes, were also discovered. Additionally, several cytochrome P450 (CYP450) and glycosyltransferase unique sequences were found. We identified simple sequence repeat motifs in transcripts with an abundance of di-nucleotide simple sequence repeat (SSR; 43.1%) markers. Large-scale expression profiling through Reads per Kilobase per Million mapped reads (RPKM) showed major genes involved in different metabolic pathways of the plant. Genes, expressed sequence tags (ESTs) and unique sequences from this study provide an important resource for the scientific community interested in the molecular genetics and functional genomics of C. borivilianum.
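
    The abstract above names RPKM as its expression measure but does not show the calculation. The sketch below uses the standard definition: reads mapped to a transcript, divided by the transcript length in kilobases and by the total mapped reads in millions. The function name and the example numbers are hypothetical and are not values from the study.

```python
def rpkm(read_count, transcript_length_bp, total_mapped_reads):
    """Reads Per Kilobase of transcript per Million mapped reads."""
    # 1e9 = 1e3 (bp -> kb) * 1e6 (reads -> millions of reads)
    return read_count * 1e9 / (transcript_length_bp * total_mapped_reads)


# Hypothetical example: 500 reads on a 2,000 bp transcript,
# in a library with 20 million mapped reads.
print(rpkm(500, 2_000, 20_000_000))
```

    With these inputs the value is 12.5: 500 reads over 2 kb is 250 reads per kb, and dividing by 20 million mapped reads gives 12.5.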

  17. De Novo Transcriptome Sequencing Reveals Important Molecular Networks and Metabolic Pathways of the Plant, Chlorophytum borivilianum

    PubMed Central

    Kalra, Shikha; Puniya, Bhanwar Lal; Kulshreshtha, Deepika; Kumar, Sunil; Kaur, Jagdeep; Ramachandran, Srinivasan; Singh, Kashmir

    2013-01-01

    Chlorophytum borivilianum, an endangered medicinal plant species, is highly recognized for its aphrodisiac properties provided by saponins present in the plant. The transcriptome information of this species is limited and only a few hundred expressed sequence tags (ESTs) are available in the public databases. To gain molecular insight into this plant, high-throughput transcriptome sequencing of leaf RNA was carried out using Illumina's HiSeq 2000 sequencing platform. A total of 22,161,444 single-end reads were retrieved after quality filtering. Available (e.g., De-Bruijn/Eulerian graph) and in-house developed bioinformatics tools were used for assembly and annotation of the transcriptome. A total of 101,141 assembled transcripts were obtained, with coverage size of 22.42 Mb and average length of 221 bp. Guanine-cytosine (GC) content was found to be 44%. Bioinformatics analysis, using non-redundant proteins, gene ontology (GO), enzyme commission (EC) and Kyoto Encyclopedia of Genes and Genomes (KEGG) databases, extracted all the known enzymes involved in saponin and flavonoid biosynthesis. A few genes of alkaloid biosynthesis, along with anticancer and plant defense genes, were also discovered. Additionally, several cytochrome P450 (CYP450) and glycosyltransferase unique sequences were found. We identified simple sequence repeat motifs in transcripts with an abundance of di-nucleotide simple sequence repeat (SSR; 43.1%) markers. Large-scale expression profiling through Reads per Kilobase per Million mapped reads (RPKM) showed major genes involved in different metabolic pathways of the plant. Genes, expressed sequence tags (ESTs) and unique sequences from this study provide an important resource for the scientific community interested in the molecular genetics and functional genomics of C. borivilianum. PMID:24376689

  18. Sequencing, De novo Assembly, Functional Annotation and Analysis of Phyllanthus amarus Leaf Transcriptome Using the Illumina Platform

    PubMed Central

    Bose Mazumdar, Aparupa; Chattopadhyay, Sharmila

    2016-01-01

    Phyllanthus amarus Schum. and Thonn., a widely distributed annual medicinal herb, has been used in the traditional system of medicine for over 2000 years. However, the lack of genomic data for P. amarus, a non-model organism, hinders research at the molecular level. In the present study, high-throughput sequencing technology has been employed to improve understanding of this herb and provide comprehensive genomic information for future work. Here, the P. amarus leaf transcriptome was sequenced using the Illumina MiSeq platform. We assembled 85,927 non-redundant (nr) “unitranscript” sequences with an average length of 1548 bp, from 18,060,997 raw reads. Sequence similarity analyses and annotation of these unitranscripts were performed against databases including the green plants nr protein database, Gene Ontology (GO), Clusters of Orthologous Groups (COG), PlnTFDB, and KEGG. As a result, 69,394 GO terms, 583 enzyme codes (EC), 134 KEGG maps, and 59 Transcription Factor (TF) families were generated. Functional and comparative analyses of assembled unitranscripts were also performed with the most closely related species, such as Populus trichocarpa and Ricinus communis, using TRAPID. KEGG analysis showed that a number of assembled unitranscripts were involved in secondary metabolite pathways, mainly phenylpropanoid, flavonoid, terpenoid, alkaloid, and lignan biosynthetic pathways, that have significant medicinal attributes. Further, Fragments Per Kilobase of transcript per Million mapped reads (FPKM) values of the identified secondary metabolite pathway genes were determined, and Reverse Transcription PCR (RT-PCR) of a few of these genes was performed to validate the de novo assembled leaf transcriptome dataset. In addition, 65,273 simple sequence repeats (SSRs) were identified. To the best of our knowledge, this is the first transcriptomic dataset of P. amarus to date. Our study provides the largest genetic resource that will lead to drug development and pave

  19. A de novo next generation genomic sequence assembler based on string graph and MapReduce cloud computing framework

    PubMed Central

    2012-01-01

    Background State-of-the-art high-throughput sequencers, e.g., the Illumina HiSeq series, generate sequencing reads that are longer than 150 bp up to a total of 600 Gbp of data per run. The high-throughput sequencers generate lengthier reads with greater sequencing depth than those generated by previous technologies. Two major challenges exist in using the high-throughput technology for de novo assembly of genomes. First, the amount of physical memory may be insufficient to store the data structure of the assembly algorithm, even for high-end multicore processors. Moreover, the graph-theoretical model used to capture intersection relationships of the reads may contain structural defects that are not well managed by existing assembly algorithms. Results We developed a distributed genome assembler based on string graphs and MapReduce framework, known as the CloudBrush. The assembler includes a novel edge-adjustment algorithm to detect structural defects by examining the neighboring reads of a specific read for sequencing errors and adjusting the edges of the string graph, if necessary. CloudBrush is evaluated against GAGE benchmarks to compare its assembly quality with the other assemblers. The results show that our assemblies have a moderate N50, a low misassembly rate of misjoins, and indels of > 5 bp. In addition, we have introduced two measures, known as precision and recall, to address the issues of faithfully aligned contigs to target genomes. Compared with the assembly tools used in the GAGE benchmarks, CloudBrush is shown to produce contigs with high precision and recall. We also verified the effectiveness of the edge-adjustment algorithm using simulated datasets and ran CloudBrush on a nematode dataset using a commercial cloud. CloudBrush assembler is available at https://github.com/ice91/CloudBrush. PMID:23282094
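
    The abstract above compares assemblers by N50 without defining it. By the usual definition, N50 is the length of the shortest contig in the smallest set of longest contigs that together cover at least half of the total assembly length. A minimal sketch follows; the function name and contig lengths are illustrative and are not taken from CloudBrush or the GAGE benchmarks.

```python
def n50(lengths):
    """Return the N50 of a collection of contig lengths (0 if empty)."""
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        # The first contig that pushes the running sum to half the total is the N50.
        if running >= half_total:
            return length
    return 0


# Hypothetical contig lengths: total 300, so half is 150.
# 100 alone is not enough; 100 + 80 = 180 reaches it, so N50 = 80.
print(n50([100, 80, 60, 40, 20]))
```

    A high N50 alone does not indicate a correct assembly, which is why the abstract also reports misassembly rates alongside it.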

  20. De Novo Whole-Genome Sequence of Xylella fastidiosa subsp. multiplex Strain BB01 Isolated from a Blueberry in Georgia, USA

    PubMed Central

    Van Horn, Christopher; Chang, Chung-Jan

    2017-01-01

    This study reports a de novo-assembled draft genome sequence of Xylella fastidiosa subsp. multiplex strain BB01 causing blueberry bacterial leaf scorch in Georgia, USA. The BB01 genome is 2,517,579 bp, with a G+C content of 51.8%, 2,943 open reading frames (ORFs), and 48 RNA genes. PMID:28183766
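
    The G+C content quoted above is the fraction of G and C bases among all unambiguous bases. As a small sketch, assuming ambiguous symbols such as N are excluded from the denominator (conventions differ between tools), it can be computed as follows. The function name and the toy sequence are hypothetical.

```python
def gc_content(seq):
    """Return the G+C fraction of a nucleotide sequence, ignoring non-ACGT symbols."""
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    acgt = gc + seq.count("A") + seq.count("T")
    return gc / acgt if acgt else 0.0


# Hypothetical sequence: 4 G/C out of 8 unambiguous bases (the two N are ignored).
print(gc_content("ATGCGCNNAT"))
```

    Multiplying the result by 100 gives a percentage in the same form as the figure reported for the genome.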

  1. De novo assembly of the chimpanzee transcriptome from NextGen mRNA sequences.

    PubMed

    Maudhoo, Mnirnal D; Madison, Jacob D; Norgren, Robert B

    2015-01-01

    Common chimpanzees (Pan troglodytes) and bonobos (Pan paniscus) are the species most closely related to humans. For this reason, it is especially important to have complete and accurate chimpanzee nucleotide and protein sequences to understand how humans evolved their unique capabilities. We provide transcriptome data from four untransformed cell types derived from the reference Pan troglodytes, "Clint", to better annotate the chimpanzee genome and provide empirical validation for proposed gene models of this important species. RNA was extracted from primary cells cultured from four tissues: skin, adipose stroma, vascular smooth muscle and skeletal muscle. These four RNA samples were sequenced on the Illumina HiSeq 2000 platform. Sequences were deposited in the National Center for Biotechnology Information (NCBI) Sequence Read Archive (SRA). Transcripts were assembled, annotated and deposited in the NCBI Transcriptome Shotgun Assembly (TSA) database. We have provided a high quality annotation of 44,275 transcripts with full-length coding sequence (CDS). This set represented a total of 10,110 unique genes, thus providing empirical support for their existence. This dataset can be used to improve the annotation of the Pan troglodytes genome.

  2. Sequencing and de novo draft assemblies of a fathead minnow (Pimephales promelas) reference genome.

    PubMed

    Burns, Frank R; Cogburn, Amarin L; Ankley, Gerald T; Villeneuve, Daniel L; Waits, Eric; Chang, Yun-Juan; Llaca, Victor; Deschamps, Stephane D; Jackson, Raymond E; Hoke, Robert Alan

    2016-01-01

    The present study was undertaken to provide the foundation for development of genome-scale resources for the fathead minnow (Pimephales promelas), an important model organism widely used in both aquatic toxicology research and regulatory testing. The authors report on the first sequencing and 2 draft assemblies for the reference genome of this species. Approximately 120× sequence coverage was achieved via Illumina sequencing of a combination of paired-end, mate-pair, and fosmid libraries. Evaluation and comparison of these assemblies demonstrate that they are of sufficient quality to be useful for genome-enabled studies, with 418 of 458 (91%) conserved eukaryotic genes mapping to at least 1 of the assemblies. In addition to its immediate utility, the present work provides a strong foundation on which to build further refinements of a reference genome for the fathead minnow.

  3. De novo Assembly of a 40 Mb Eukaryotic Genome from Short Sequence Reads: Sordaria macrospora, a Model Organism for Fungal Morphogenesis

    PubMed Central

    Nowrousian, Minou; Stajich, Jason E.; Chu, Meiling; Engh, Ines; Espagne, Eric; Halliday, Karen; Kamerewerd, Jens; Kempken, Frank; Knab, Birgit; Kuo, Hsiao-Che; Osiewacz, Heinz D.; Pöggeler, Stefanie; Read, Nick D.; Seiler, Stephan; Smith, Kristina M.; Zickler, Denise; Kück, Ulrich; Freitag, Michael

    2010-01-01

    Filamentous fungi are of great importance in ecology, agriculture, medicine, and biotechnology. Thus, it is not surprising that genomes for more than 100 filamentous fungi have been sequenced, most of them by Sanger sequencing. While next-generation sequencing techniques have revolutionized genome resequencing, e.g. for strain comparisons, genetic mapping, or transcriptome and ChIP analyses, de novo assembly of eukaryotic genomes still presents significant hurdles, because of their large size and stretches of repetitive sequences. Filamentous fungi contain few repetitive regions in their 30–90 Mb genomes and thus are suitable candidates to test de novo genome assembly from short sequence reads. Here, we present a high-quality draft sequence of the Sordaria macrospora genome that was obtained by a combination of Illumina/Solexa and Roche/454 sequencing. Paired-end Solexa sequencing of genomic DNA to 85-fold coverage and an additional 10-fold coverage by single-end 454 sequencing resulted in ∼4 Gb of DNA sequence. Reads were assembled to a 40 Mb draft version (N50 of 117 kb) with the Velvet assembler. Comparative analysis with Neurospora genomes increased the N50 to 498 kb. The S. macrospora genome contains even fewer repeat regions than its closest sequenced relative, Neurospora crassa. Comparison with genomes of other fungi showed that S. macrospora, a model organism for morphogenesis and meiosis, harbors duplications of several genes involved in self/nonself-recognition. Furthermore, S. macrospora contains more polyketide biosynthesis genes than N. crassa. Phylogenetic analyses suggest that some of these genes may have been acquired by horizontal gene transfer from a distantly related ascomycete group. Our study shows that, for typical filamentous fungi, de novo assembly of genomes from short sequence reads alone is feasible, that a mixture of Solexa and 454 sequencing substantially improves the assembly, and that the resulting data can be used for comparative

  4. De novo assembly of a 40 Mb eukaryotic genome from short sequence reads: Sordaria macrospora, a model organism for fungal morphogenesis.

    PubMed

    Nowrousian, Minou; Stajich, Jason E; Chu, Meiling; Engh, Ines; Espagne, Eric; Halliday, Karen; Kamerewerd, Jens; Kempken, Frank; Knab, Birgit; Kuo, Hsiao-Che; Osiewacz, Heinz D; Pöggeler, Stefanie; Read, Nick D; Seiler, Stephan; Smith, Kristina M; Zickler, Denise; Kück, Ulrich; Freitag, Michael

    2010-04-08

    Filamentous fungi are of great importance in ecology, agriculture, medicine, and biotechnology. Thus, it is not surprising that genomes for more than 100 filamentous fungi have been sequenced, most of them by Sanger sequencing. While next-generation sequencing techniques have revolutionized genome resequencing, e.g. for strain comparisons, genetic mapping, or transcriptome and ChIP analyses, de novo assembly of eukaryotic genomes still presents significant hurdles, because of their large size and stretches of repetitive sequences. Filamentous fungi contain few repetitive regions in their 30-90 Mb genomes and thus are suitable candidates to test de novo genome assembly from short sequence reads. Here, we present a high-quality draft sequence of the Sordaria macrospora genome that was obtained by a combination of Illumina/Solexa and Roche/454 sequencing. Paired-end Solexa sequencing of genomic DNA to 85-fold coverage and an additional 10-fold coverage by single-end 454 sequencing resulted in approximately 4 Gb of DNA sequence. Reads were assembled to a 40 Mb draft version (N50 of 117 kb) with the Velvet assembler. Comparative analysis with Neurospora genomes increased the N50 to 498 kb. The S. macrospora genome contains even fewer repeat regions than its closest sequenced relative, Neurospora crassa. Comparison with genomes of other fungi showed that S. macrospora, a model organism for morphogenesis and meiosis, harbors duplications of several genes involved in self/nonself-recognition. Furthermore, S. macrospora contains more polyketide biosynthesis genes than N. crassa. Phylogenetic analyses suggest that some of these genes may have been acquired by horizontal gene transfer from a distantly related ascomycete group. Our study shows that, for typical filamentous fungi, de novo assembly of genomes from short sequence reads alone is feasible, that a mixture of Solexa and 454 sequencing substantially improves the assembly, and that the resulting data can be used for

  5. De Novo Assembly, Gene Annotation, and Marker Discovery in Stored-Product Pest Liposcelis entomophila (Enderlein) Using Transcriptome Sequences

    PubMed Central

    Wei, Dan-Dan; Chen, Er-Hu; Ding, Tian-Bo; Chen, Shi-Chun; Dou, Wei; Wang, Jin-Jun

    2013-01-01

    Background As a major stored-product pest insect, Liposcelis entomophila has developed high levels of resistance to various insecticides in grain storage systems. However, the molecular mechanisms underlying resistance and environmental stress have not been characterized. To date, there is a lack of genomic information for this species. Therefore, studies aimed at profiling the L. entomophila transcriptome would provide a better understanding of the biological functions at the molecular level. Methodology/Principal Findings We applied Illumina sequencing technology to sequence the transcriptome of L. entomophila. A total of 54,406,328 clean reads were obtained and de novo assembled into 54,220 unigenes, with an average length of 571 bp. Through a similarity search, 33,404 (61.61%) unigenes were matched to known proteins in the NCBI non-redundant (Nr) protein database. These unigenes were further functionally annotated with gene ontology (GO), cluster of orthologous groups of proteins (COG), and Kyoto Encyclopedia of Genes and Genomes (KEGG) databases. A large number of genes potentially involved in insecticide resistance were manually curated, including 68 putative cytochrome P450 genes, 37 putative glutathione S-transferase (GST) genes, 19 putative carboxyl/cholinesterase (CCE) genes, and 126 other transcripts containing target site sequences or encoding detoxification genes, representing eight types of resistance enzymes. Furthermore, to gain insight into the molecular basis of the L. entomophila response to thermal stress, 25 heat shock protein (Hsp) genes were identified. In addition, 1,100 SSRs and 57,757 SNPs were detected and 231 pairs of SSR primers were designed for future investigation of genetic diversity. Conclusions/Significance We developed a comprehensive transcriptomic database for L. entomophila. These sequences and putative molecular markers would further promote our understanding of the molecular mechanisms underlying insecticide resistance

  6. De novo RNA sequencing and transcriptome analysis of Colletotrichum gloeosporioides ES026 reveal genes related to biosynthesis of huperzine A.

    PubMed

    Zhang, Guowei; Wang, Wenjuan; Zhang, Xiangmei; Xia, Qianqian; Zhao, Xinmei; Ahn, Youngjoon; Ahmed, Nevin; Cosoveanu, Andreea; Wang, Mo; Wang, Jialu; Shu, Shaohua

    2015-01-01

    Huperzine A is important in the treatment of Alzheimer's disease. There are major challenges for the mass production of huperzine A from plants due to the limited number of huperzine-A-producing plants, as well as the low content of huperzine A in these plants. Various endophytic fungi produce huperzine A. Colletotrichum gloeosporioides ES026 was previously isolated from a huperzine-A-producing plant Huperzia serrata, and this fungus also produces huperzine A. In this study, de novo RNA sequencing of C. gloeosporioides ES026 was carried out with an Illumina HiSeq2000. A total of 4,324,299,051 bp from 50,442,617 high-quality sequence reads of ES026 were obtained. These raw data were assembled into 24,998 unigenes, 40,536,684 residues and 19,790 genes. The majority of the unique sequences were assigned to corresponding putative functions based on BLAST searches of public databases. The molecular functions, biological processes and biochemical pathways of these unique sequences were determined using gene ontology (GO) and Kyoto Encyclopedia of Genes and Genomes (KEGG) assignments. A gene encoding copper amine oxidase (CAO) (unigene 9322) was annotated for the conversion of cadaverine to 5-aminopentanal in the biosynthesis of huperzine A. This gene was also detected in the root, stem and leaf of H. serrata. Furthermore, a close relationship was observed between expression of the CAO gene (unigene 9322) and quantity of crude huperzine A extracted from ES026. Therefore, CAO might be involved in the biosynthesis of huperzine A and it most likely plays a key role in regulating the content of huperzine A in ES026.

  7. Disease-targeted sequencing of ion channel genes identifies de novo mutations in patients with non-familial Brugada syndrome.

    PubMed

    Juang, Jyh-Ming Jimmy; Lu, Tzu-Pin; Lai, Liang-Chuan; Ho, Chia-Chuan; Liu, Yen-Bin; Tsai, Chia-Ti; Lin, Lian-Yu; Yu, Chih-Chieh; Chen, Wen-Jone; Chiang, Fu-Tien; Yeh, Shih-Fan Sherri; Lai, Ling-Ping; Chuang, Eric Y; Lin, Jiunn-Lee

    2014-10-23

    Brugada syndrome (BrS) is one of the ion channelopathies associated with sudden cardiac death (SCD). The most common BrS-associated gene (SCN5A) only accounts for approximately 20-25% of BrS patients. This study aims to identify novel mutations across human ion channels in non-familial BrS patients without SCN5A variants through disease-targeted sequencing. We performed disease-targeted multi-gene sequencing across 133 human ion channel genes and 12 reported BrS-associated genes in 15 unrelated, non-familial BrS patients without SCN5A variants. Candidate variants were validated by mass spectrometry and Sanger sequencing. Five de novo mutations were identified in four genes (SCNN1A, KCNJ16, KCNB2, and KCNT1) in three BrS patients (20%). Two of the three patients presented with SCD and one had syncope. Interestingly, the two patients who presented with SCD had compound mutations (SCNN1A:Arg350Gln and KCNB2:Glu522Lys; SCNN1A:Arg597* and KCNJ16:Ser261Gly). Importantly, two SCNN1A mutations were identified from different families. The KCNT1:Arg1106Gln mutation was identified in a patient with syncope. Bioinformatics algorithms predicted severe functional interruptions in these four mutation loci, suggesting their pivotal roles in BrS. This study identified four novel BrS-associated genes and indicated the effectiveness of this disease-targeted sequencing across ion channel genes for non-familial BrS patients without SCN5A variants.

  8. De novo sequencing and transcriptome analysis of female venom glands of ectoparasitoid Bracon hebetor (Say.) (Hymenoptera: Braconidae).

    PubMed

    Manzoor, Atif; UlAbdin, Zain; Webb, Bruce A; Arif, Muhammad Jalal; Jamil, Amer

    2016-12-01

    Venom is a key factor in the regulation of host physiology by parasitic Hymenoptera and a potentially rich source of novel bioactive substances for biotechnological applications. The limited study of venom from the ectoparasitoid Bracon hebetor, a tiny wasp that attacks larval pest insects of field and stored products and is thus a potential insect control agent, has not described the full complement and composition of these biomolecules. To obtain a comprehensive picture of genes expressed in the venom glands of B. hebetor, a venom gland transcriptome was assembled using next-generation sequencing technologies. De novo assembly of the 10.81 M sequence reads yielded 22,425 contigs, of which 10,581 had significant BLASTx hits to known genes. The majority of hits were to Diachasma alloeum, an ectoparasitoid from the same taxonomic family, as well as other wasps. Gene ontology grouped the sequences into molecular functions, among which catalytic activity was the most represented (42.2%); cellular components, among which cell was the most represented (33.8%); and biological processes, among which metabolic process was the most represented (30%). In this study, we highlight the most abundant sequences, and those that are likely to be functional components of the venom for parasitization. Full-length ORFs of Calreticulin, Venom Acid Phosphatase Acph-1 like protein and arginine kinase proteins were isolated and their tissue-specific expression was studied by RT-PCR. Our report is the first to characterize components of the B. hebetor venom glands that may be useful for developing control tools for insect pests and other applications. Copyright © 2016 Elsevier Inc. All rights reserved.

  9. De Novo Assembly of Bitter Gourd Transcriptomes: Gene Expression and Sequence Variations in Gynoecious and Monoecious Lines.

    PubMed

    Shukla, Anjali; Singh, V K; Bharadwaj, D R; Kumar, Rajesh; Rai, Ashutosh; Rai, A K; Mugasimangalam, Raja; Parameswaran, Sriram; Singh, Major; Naik, P S

    2015-01-01

    Bitter gourd (Momordica charantia L.) is a nutritious vegetable crop of Asian origin, used as a medicinal herb in Indian and Chinese traditional medicine. Molecular breeding in bitter gourd is in its infancy, due to limited molecular resources, particularly functional markers for traits such as gynoecy. We performed de novo transcriptome sequencing of bitter gourd using an Illumina next-generation sequencer, from root, flower bud, stem and leaf samples of a gynoecious line (Gy323) and a monoecious line (DRAR1). A total of 65,540 transcripts for Gy323 and 61,490 for DRAR1 were obtained. Comparisons revealed SNP and SSR variations between these lines and allowed identification of gene classes. Based on the available transcripts, we identified 80 WRKY transcription factors, several of which are reported in responses to biotic and abiotic stresses, and 56 ARF genes, which play a pivotal role in auxin-regulated gene expression and development. The data presented will be useful in both functional studies and breeding programs in bitter gourd.

  10. De Novo Assembly of Bitter Gourd Transcriptomes: Gene Expression and Sequence Variations in Gynoecious and Monoecious Lines

    PubMed Central

    Shukla, Anjali; Singh, V. K.; Bharadwaj, D. R.; Kumar, Rajesh; Rai, Ashutosh; Rai, A. K.; Mugasimangalam, Raja; Parameswaran, Sriram; Singh, Major; Naik, P. S.

    2015-01-01

    Bitter gourd (Momordica charantia L.) is a nutritious vegetable crop of Asian origin, used as a medicinal herb in Indian and Chinese traditional medicine. Molecular breeding in bitter gourd is in its infancy, due to limited molecular resources, particularly functional markers for traits such as gynoecy. We performed de novo transcriptome sequencing of bitter gourd using an Illumina next-generation sequencer, from root, flower bud, stem and leaf samples of a gynoecious line (Gy323) and a monoecious line (DRAR1). A total of 65,540 transcripts for Gy323 and 61,490 for DRAR1 were obtained. Comparisons revealed SNP and SSR variations between these lines and allowed identification of gene classes. Based on the available transcripts, we identified 80 WRKY transcription factors, several of which are reported in responses to biotic and abiotic stresses, and 56 ARF genes, which play a pivotal role in auxin-regulated gene expression and development. The data presented will be useful in both functional studies and breeding programs in bitter gourd. PMID:26047102

  11. de novo Sequencing and Disulfide Mapping of a Bromotryptophan-Containing Conotoxin by Fourier Transform Ion Cyclotron Resonance Mass Spectrometry

    PubMed Central

    Nair, Sudarslal Sadasivan; Nilsson, Carol L.; Emmett, Mark R.; Schaub, Tanner M.; Gowd, Konkallu Hanumae; Thakur, Suman S.; Krishnan, K. S.; Balaram, Padmanabhan; Marshall, Alan G.

    2008-01-01

    T-1-family conotoxins belong to the T-superfamily and are composed of 10−17 amino acids. They share a common cysteine framework and disulfide connectivity, and exhibit unusual posttranslational modifications, such as tryptophan bromination, glutamic acid carboxylation and threonine glycosylation. We have isolated and characterized a novel peptide, Mo1274, containing 11 amino acids, that shows the same cysteine pattern, -CC-CC, and disulfide linkage as those of the T-1-family members. The complete sequence, GNWCCSARVCC, in which W denotes bromotryptophan, was derived from MS-based de novo sequencing. The FT-ICR MS/MS techniques of electron capture dissociation (ECD), infrared multiphoton dissociation (IRMPD), and collision-induced dissociation (CID) served to detect and localize the tryptophan bromination. The bromine contributes a distinctive isotopic distribution in all fragments that contain bromotryptophan. ECD fragmentation results in the loss of bromine and return to the normal isotopic distribution. Disulfide connectivity of Mo1274, between cysteine pairs 1−3 and 2−4, was determined by mass spectrometry in combination with chemical derivatization employing tris(2-carboxyethyl) phosphine, followed by differential alkylation with N-ethylmaleimide and iodoacetamide. The ECD spectra of the native and partially modified peptide reveal a loss of bromine in a process that requires the presence of a disulfide bond. PMID:17134143

  12. De Novo Transcriptome Sequencing of Low Temperature-Treated Phlox subulata and Analysis of the Genes Involved in Cold Stress

    PubMed Central

    Qu, Yanting; Zhou, Aimin; Zhang, Xing; Tang, Huanwei; Liang, Ming; Han, Hui; Zuo, Yuhu

    2015-01-01

    Phlox subulata, a perennial herbaceous flower, can survive during the winter of northeast China, where the temperature can drop to −30 °C, suggesting that P. subulata is an ideal model for studying the molecular mechanisms of cold acclimation in plants. However, little is known about the gene expression profile of P. subulata under cold stress. Here, we examined changes in cold stress-related genes in P. subulata. We sequenced three cold-treated (CT) and control (CK) samples of P. subulata. After de novo assembly and quantitative assessment of the obtained reads, 99,174 unigenes were generated. Based on similarity searches with known proteins in public protein databases, 59,994 unigenes were functionally annotated. Among all differentially expressed genes (DEGs), 8302, 10,638 and 11,021 up-regulated genes and 9898, 17,876, and 12,358 down-regulated genes were identified after treatment at 4, 0, and −10 °C, respectively. Furthermore, 3417 up-regulated unigenes were expressed only in CT samples. Twenty major cold-related genes, including transcription factors, antioxidant enzymes, osmoregulation proteins, and Ca2+ and ABA signaling components, were identified, and their expression levels were estimated. Overall, this is the first transcriptome sequencing of this plant species under cold stress. Studies of DEGs involved in cold-related metabolic pathways may facilitate the discovery of cold-resistance genes. PMID:25938968

  13. Novel proline-hydroxyproline glycopeptides from the dandelion (Taraxacum officinale Wigg.) flowers: de novo sequencing and biological activity.

    PubMed

    Astafieva, Alexandra A; Enyenihi, Atim A; Rogozhin, Eugene A; Kozlov, Sergey A; Grishin, Eugene V; Odintsova, Tatyana I; Zubarev, Roman A; Egorov, Tsezi A

    2015-09-01

    Two novel homologous peptides named ToHyp1 and ToHyp2 that show no similarity to any known proteins were isolated from Taraxacum officinale Wigg. flowers by multidimensional liquid chromatography. Amino acid and mass spectrometry analyses demonstrated that the peptides have an unusual structure: they are cysteine-free, proline-hydroxyproline-rich and post-translationally glycosylated by pentoses, with 5 carbohydrates in ToHyp2 and 10 in ToHyp1. The ToHyp2 peptide, with a monoisotopic molecular mass of 4350.3 Da, was completely sequenced by a combination of Edman degradation and de novo sequencing via top-down multistage collision-induced dissociation (CID) and higher-energy dissociation (HCD) tandem mass spectrometry (MS(n)). ToHyp2 consists of 35 amino acids and contains 18 proline residues, of which 8 are hydroxylated. The peptide displays antifungal activity and inhibits growth of Gram-positive and Gram-negative bacteria. We further showed that carbohydrate moieties have no significant impact on the peptide structure, but are important for antifungal activity although not absolutely necessary. The deglycosylated ToHyp2 peptide was less active against the susceptible fungus Bipolaris sorokiniana than the native peptide. Unique structural features of the ToHyp2 peptide place it into a new family of plant defense peptides. The discovery of ToHyp peptides in T. officinale flowers expands the repertoire of molecules of plant origin with practical applications.

  14. De Novo Transcriptome Sequencing of Olea europaea L. to Identify Genes Involved in the Development of the Pollen Tube.

    PubMed

    Iaria, Domenico; Chiappetta, Adriana; Muzzalupo, Innocenzo

    2016-01-01

    In olive (Olea europaea L.), the processes controlling self-incompatibility are still unclear and the molecular basis underlying them is still not fully characterized. In order to determine compatibility relationships, using next-generation sequencing techniques and a de novo transcriptome assembly strategy, we show that pollen tubes from different olive plants, grown in vitro in a medium containing their own pistil and in pollen/pistil combinations from self-sterile and self-fertile cultivars, have distinct gene expression profiles. Many of the sequences differentially expressed between the samples fall within gene families involved in the development of the pollen tube, such as lipase, carboxylesterase, pectinesterase, pectin methylesterase, and callose synthase. Moreover, different genes involved in signal transduction, transcription, and growth are overrepresented. The analysis also allowed us to identify members of the actin, actin depolymerization factor, and fibrin gene families and members of the Ca(2+) binding gene family related to the development and polarization of the pollen apical tip. The whole transcriptomic analysis, through the identification of the set of differentially expressed transcripts and an extended functional annotation analysis, will lead to a better understanding of the mechanisms of pollen germination and pollen tube growth in the olive.

  15. Single-molecule sequencing and conformational capture enable de novo mammalian reference genomes

    USDA-ARS's Scientific Manuscript database

    Genome assemblies have been produced for numerous species as a result of advances in sequencing technologies. However, many of the assemblies are fragmented, with many gaps, ambiguities, and errors. We use the genome of the domestic goat (Capra hircus) to demonstrate current state of the art for ef...

  16. Sequencing and De novo Draft Assemblies of the Fathead Minnow (Pimephales promelas) Reference Genome

    EPA Science Inventory

    This study was undertaken to develop genome-scale resources for the fathead minnow (Pimephales promelas), an important model organism widely used in both aquatic ecotoxicology research and in regulatory toxicity testing. We report on the first sequencing and two draft assemblies fo...

  18. Comparative Transcriptomic Approaches Exploring Contamination Stress Tolerance in Salix sp. Reveal the Importance for a Metaorganismal de Novo Assembly Approach for Nonmodel Plants

    PubMed Central

    Brereton, Nicholas J. B.; Marleau, Julie; Nissim, Werther Guidi; Labrecque, Michel; Joly, Simon; Pitre, Frederic E.

    2016-01-01

    Metatranscriptomic study of nonmodel organisms requires strategies that retain the highly resolved genetic information generated from model organisms while allowing for identification of the unexpected. A real-world biological application of phytoremediation, the field growth of 10 Salix cultivars on polluted soils, was used as an exemplar nonmodel and multifaceted crop response well-disposed to the study of gene expression. Sequence reads were assembled de novo to create 10 independent transcriptomes and a global transcriptome, and were mapped against the Salix purpurea 94006 reference genome. Annotation of assembled contigs was performed without a priori assumption of the originating organism. Global transcriptome construction from 3.03 billion paired-end reads revealed 606,880 unique contigs annotated from 1588 species, often common in all 10 cultivars. Comparisons between transcriptomic and metatranscriptomic methodologies provide clear evidence that nonnative RNA can mistakenly map to reference genomes, especially to conserved regions of common housekeeping genes, such as actin, α/β-tubulin, and elongation factor 1-α. In Salix, Rubisco activase transcripts were down-regulated in contaminated trees across all 10 cultivars, whereas thiamine thiazole synthase and CP12, a Calvin Cycle master regulator, were uniformly up-regulated. De novo assembly approaches, with unconstrained annotation, can improve data quality; care should be taken when exploring such plant genetics to reduce de facto data exclusion by mapping to a single reference genome alone. Salix gene expression patterns strongly suggest cultivar-wide alteration of specific photosynthetic apparatus and protection of the antenna complexes from oxidation damage in contaminated trees, providing an insight into common stress tolerance strategies in a real-world phytoremediation system. PMID:27002060

  19. Comparative Transcriptomic Approaches Exploring Contamination Stress Tolerance in Salix sp. Reveal the Importance for a Metaorganismal de Novo Assembly Approach for Nonmodel Plants.

    PubMed

    Brereton, Nicholas J B; Gonzalez, Emmanuel; Marleau, Julie; Nissim, Werther Guidi; Labrecque, Michel; Joly, Simon; Pitre, Frederic E

    2016-05-01

    Metatranscriptomic study of nonmodel organisms requires strategies that retain the highly resolved genetic information generated from model organisms while allowing for identification of the unexpected. A real-world biological application of phytoremediation, the field growth of 10 Salix cultivars on polluted soils, was used as an exemplar nonmodel and multifaceted crop response well-disposed to the study of gene expression. Sequence reads were assembled de novo to create 10 independent transcriptomes and a global transcriptome, and were mapped against the Salix purpurea 94006 reference genome. Annotation of assembled contigs was performed without a priori assumption of the originating organism. Global transcriptome construction from 3.03 billion paired-end reads revealed 606,880 unique contigs annotated from 1588 species, often common in all 10 cultivars. Comparisons between transcriptomic and metatranscriptomic methodologies provide clear evidence that nonnative RNA can mistakenly map to reference genomes, especially to conserved regions of common housekeeping genes, such as actin, α/β-tubulin, and elongation factor 1-α. In Salix, Rubisco activase transcripts were down-regulated in contaminated trees across all 10 cultivars, whereas thiamine thiazole synthase and CP12, a Calvin Cycle master regulator, were uniformly up-regulated. De novo assembly approaches, with unconstrained annotation, can improve data quality; care should be taken when exploring such plant genetics to reduce de facto data exclusion by mapping to a single reference genome alone. Salix gene expression patterns strongly suggest cultivar-wide alteration of specific photosynthetic apparatus and protection of the antenna complexes from oxidation damage in contaminated trees, providing an insight into common stress tolerance strategies in a real-world phytoremediation system. © 2016 American Society of Plant Biologists. All Rights Reserved.

  20. De novo sequencing and a comprehensive analysis of purple sweet potato (Ipomoea batatas L.) transcriptome.

    PubMed

    Xie, Fuliang; Burklew, Caitlin E; Yang, Yanfang; Liu, Min; Xiao, Peng; Zhang, Baohong; Qiu, Deyou

    2012-07-01

    High-throughput RNA sequencing was performed to comprehensively analyze the transcriptome of the purple sweet potato. A total of 58,800 unigenes were obtained, ranging from 200 nt to 10,380 nt with an average length of 476 nt. The average expression of one unigene was 34 reads per kb per million reads (RPKM) with a maximum expression of 1,935 RPKM. At least 40,280 (68.5%) unigenes were identified as protein-coding genes, of which 11,978 and 5,184 genes were homologous to Arabidopsis and rice proteins, respectively. Gene ontology (GO) and Kyoto encyclopedia of genes and genomes (KEGG) analysis showed that 19,707 (33.5%) unigenes were classified to 1,807 terms of GO including molecular functions, biological processes, and cellular components and 9,970 (17.0%) unigenes were enriched to 11,119 KEGG pathways. We found that at least 3,553 genes may be involved in the biosynthesis pathways of starch, alkaloids, anthocyanin pigments, and vitamins. Additionally, 851 potential simple sequence repeats (SSRs) were identified in all unigenes. Transcriptome sequencing on tuberous roots of the sweet potato yielded substantial transcriptional sequences and potentially useful SSR markers which provide an important data source for sweet potato research. Comparison of two RNA-sequence datasets from the purple and the yellow sweet potato showed that UDP-glucose-flavonoid 3-O-glucosyltransferase was one of the key enzymes in the pathway of anthocyanin biosynthesis and that anthocyanin-3-glucoside might be one of the major components for anthocyanin pigments in the purple sweet potato. This study contributes to the understanding of the molecular mechanisms of sweet potato development and metabolism and therefore increases the potential utilization of the sweet potato in food nutrition and pharmacy.
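
    The RPKM unit quoted in the abstract above (reads per kb per million reads) can be illustrated with a short calculation. This is only a sketch of the standard RPKM formula; the function name and the input numbers are hypothetical and are not taken from the study.

```python
def rpkm(read_count: int, transcript_length_nt: int, total_mapped_reads: int) -> float:
    """Reads per kilobase of transcript per million mapped reads.

    RPKM = read_count * 1e9 / (transcript_length_nt * total_mapped_reads)
    """
    if transcript_length_nt <= 0 or total_mapped_reads <= 0:
        raise ValueError("length and total read count must be positive")
    return read_count * 1e9 / (transcript_length_nt * total_mapped_reads)


# Hypothetical example: 34 reads on a 1,000 nt unigene in a library of
# 1,000,000 mapped reads corresponds to 34 RPKM.
example = rpkm(read_count=34, transcript_length_nt=1_000, total_mapped_reads=1_000_000)
print(example)
```

    Normalizing by both transcript length and library size is what makes values comparable across unigenes of different lengths and across sequencing runs of different depths.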

  1. Fast, cheap and out of control--Insights into thermodynamic and informatic constraints on natural protein sequences from de novo protein design.

    PubMed

    Brisendine, Joseph M; Koder, Ronald L

    2016-05-01

    The accumulated results of thirty years of rational and computational de novo protein design have taught us important lessons about the stability, information content, and evolution of natural proteins. First, de novo protein design has complicated the assertion that biological function is equivalent to biological structure - demonstrating the capacity to abstract active sites from natural contexts and paste them into non-native topologies without loss of function. The structure-function relationship has thus been revealed to be either a generality or strictly true only in a local sense. Second, the simplification to "maquette" topologies carried out by rational protein design also has demonstrated that even sophisticated functions such as conformational switching, cooperative ligand binding, and light-activated electron transfer can be achieved with low-information design approaches. This is because for simple topologies the functional footprint in sequence space is enormous and easily exceeds the number of structures which could have possibly existed in the history of life on Earth. Finally, the pervasiveness of extraordinary stability in designed proteins challenges accepted models for the "marginal stability" of natural proteins, suggesting that there must be a selection pressure against highly stable proteins. This can be explained using recent theories which relate non-equilibrium thermodynamics and self-replication. This article is part of a Special Issue entitled Biodesign for Bioenergetics--The design and engineering of electron transfer cofactors, proteins and protein networks, edited by Ronald L. Koder and J.L. Ross Anderson. Copyright © 2016 Elsevier B.V. All rights reserved.

  2. De novo sequencing and resurrection of a human astrovirus-neutralizing antibody

    DOE PAGES

    Bogdanoff, Walter A.; Morgenstern, David; Bern, Marshall; ...

    2016-03-14

    Monoclonal antibody (mAb) therapeutics targeting cancer, autoimmune diseases, inflammatory diseases, and infectious diseases are growing exponentially. Although numerous panels of mAbs targeting infectious disease agents have been developed, their progression into clinically useful mAbs is often hindered by the lack of sequence information and/or loss of hybridoma cells that produce them. Here we combine the power of crystallography and mass spectrometry to determine the amino acid sequence and glycosylation modification of the Fab fragment of a potent human astrovirus-neutralizing mAb. We used this information to engineer a recombinant antibody single-chain variable fragment that has the same specificity as the parent monoclonal antibody to bind to the astrovirus capsid protein. Furthermore, this antibody can now potentially be developed as a therapeutic and diagnostic agent.

  3. De Novo Sequencing and Resurrection of a Human Astrovirus-Neutralizing Antibody

    PubMed Central

    2016-01-01

    Monoclonal antibody (mAb) therapeutics targeting cancer, autoimmune diseases, inflammatory diseases, and infectious diseases are growing exponentially. Although numerous panels of mAbs targeting infectious disease agents have been developed, their progression into clinically useful mAbs is often hindered by the lack of sequence information and/or loss of hybridoma cells that produce them. Here we combine the power of crystallography and mass spectrometry to determine the amino acid sequence and glycosylation modification of the Fab fragment of a potent human astrovirus-neutralizing mAb. We used this information to engineer a recombinant antibody single-chain variable fragment that has the same specificity as the parent monoclonal antibody to bind to the astrovirus capsid protein. This antibody can now potentially be developed as a therapeutic and diagnostic agent. PMID:27213181

  4. De Novo Sequencing and Resurrection of a Human Astrovirus-Neutralizing Antibody.

    PubMed

    Bogdanoff, Walter A; Morgenstern, David; Bern, Marshall; Ueberheide, Beatrix M; Sanchez-Fauquier, Alicia; DuBois, Rebecca M

    2016-05-13

    Monoclonal antibody (mAb) therapeutics targeting cancer, autoimmune diseases, inflammatory diseases, and infectious diseases are growing exponentially. Although numerous panels of mAbs targeting infectious disease agents have been developed, their progression into clinically useful mAbs is often hindered by the lack of sequence information and/or loss of hybridoma cells that produce them. Here we combine the power of crystallography and mass spectrometry to determine the amino acid sequence and glycosylation modification of the Fab fragment of a potent human astrovirus-neutralizing mAb. We used this information to engineer a recombinant antibody single-chain variable fragment that has the same specificity as the parent monoclonal antibody to bind to the astrovirus capsid protein. This antibody can now potentially be developed as a therapeutic and diagnostic agent.

  5. Massively Parallel Sequencing Reveals an Accumulation of De Novo Mutations and an Activating Mutation of LPAR1 in a Patient with Metastatic Neuroblastoma

    PubMed Central

    Wei, Jun S.; Johansson, Peter; Chen, Li; Song, Young K.; Tolman, Catherine; Li, Samuel; Hurd, Laura; Patidar, Rajesh; Wen, Xinyu; Badgett, Thomas C.; Cheuk, Adam T. C.; Marshall, Jean-Claude; Steeg, Patricia S.; Vaqué Díez, José P.; Yu, Yanlin; Gutkind, J. Silvio; Khan, Javed

    2013-01-01

    Neuroblastoma is one of the most genomically heterogeneous childhood malignancies studied to date, and the molecular events that occur during the course of the disease are not fully understood. Genomic studies in neuroblastoma have shown only a few recurrent mutations and a low somatic mutation burden. However, none of these studies has examined the mutations arising during the course of disease, nor have they systematically examined the expression of mutant genes. Here we performed genomic analyses on tumors taken during a 3.5-year disease course from a neuroblastoma patient (bone marrow biopsy at diagnosis, adrenal primary tumor taken at surgical resection, and a liver metastasis at autopsy). Whole genome sequencing of the index liver metastasis identified 44 non-synonymous somatic mutations in 42 genes (0.85 mutation/MB) and a large hemizygous deletion in the ATRX gene, which has been recently reported in neuroblastoma. Of these 45 somatic alterations, 15 were also detected in the primary tumor and bone marrow biopsy, while the other 30 were unique to the index tumor, indicating accumulation of de novo mutations during therapy. Furthermore, transcriptome sequencing on the 3 tumors demonstrated that only 3 of the 15 commonly mutated genes (LPAR1, GATA2, and NUFIP1) had a high level of expression of the mutant alleles, suggesting potential oncogenic driver roles of these mutated genes. Among them, the druggable G-protein coupled receptor LPAR1 was highly expressed in all tumors. Cells expressing the LPAR1 R163W mutant demonstrated significantly increased motility through elevated Rho signaling, but the mutant had no effect on growth. Therefore, this study highlights the need for multiple biopsies and sequencing during progression of a cancer and for a combinatorial DNA and RNA sequencing approach for systematic identification of expressed driver mutations. PMID:24147068

  6. Increased Frequency of De Novo Copy Number Variations in Congenital Heart Disease by Integrative Analysis of SNP Array and Exome Sequence Data

    PubMed Central

    Rodriguez-Murillo, Laura; Fromer, Menachem; Mazaika, Erica; Vardarajan, Badri; Italia, Michael; Leipzig, Jeremy; DePalma, Steven R.; Golhar, Ryan; Sanders, Stephan J.; Yamrom, Boris; Ronemus, Michael; Iossifov, Ivan; Willsey, A. Jeremy; State, Matthew W.; Kaltman, Jonathan R.; White, Peter S.; Shen, Yufeng; Warburton, Dorothy; Brueckner, Martina; Seidman, Christine; Goldmuntz, Elizabeth; Gelb, Bruce D.; Lifton, Richard; Seidman, Jonathan; Hakonarson, Hakon; Chung, Wendy K.

    2014-01-01

    Rationale: Congenital heart disease (CHD) is among the most common birth defects. Most cases are of unknown etiology. Objective: To determine the contribution of de novo copy number variants (CNVs) in the etiology of sporadic CHD. Methods and Results: We studied 538 CHD trios using genome-wide dense single nucleotide polymorphism (SNP) arrays and/or whole exome sequencing (WES). Results were experimentally validated using digital droplet PCR. We compared validated CNVs in CHD cases to CNVs in 1,301 healthy control trios. The two complementary high-resolution technologies identified 63 validated de novo CNVs in 51 CHD cases. A significant increase in CNV burden was observed when comparing CHD trios with healthy trios, using either SNP array (p=7×10⁻⁵, Odds Ratio (OR)=4.6) or WES data (p=6×10⁻⁴, OR=3.5) and remained after removing 16% of de novo CNV loci previously reported as pathogenic (p=0.02, OR=2.7). We observed recurrent de novo CNVs on 15q11.2 encompassing CYFIP1, NIPA1, and NIPA2 and single de novo CNVs encompassing DUSP1, JUN, JUP, MED15, MED9, PTPRE, SREBF1, TOP2A, and ZEB2, genes that interact with established CHD proteins NKX2-5 and GATA4. Integrating de novo variants in WES and CNV data suggests that ETS1 is the pathogenic gene altered by 11q24.2-q25 deletions in Jacobsen syndrome and that CTBP2 is the pathogenic gene in 10q sub-telomeric deletions. Conclusions: We demonstrate a significantly increased frequency of rare de novo CNVs in CHD patients compared with healthy controls and suggest several novel genetic loci for CHD. PMID:25205790

  7. Sequencing and de novo transcriptome assembly of Anthopleura dowii Verrill (1869), from Mexico.

    PubMed

    Ayala-Sumuano, Jorge-Tonatiuh; Licea-Navarro, Alexei; Rudiño-Piñera, Enrique; Rodríguez, Estefanía; Rodríguez-Almazán, Claudia

    2017-03-01

    Next-generation technologies for determination of genomics and transcriptomics composition have a wide range of applications. Moreover, the development of tools for big data set analysis has allowed the identification of molecules and networks involved in metabolism, evolution or behavior. In their natural habitats, aquatic organisms have developed molecular strategies for survival, including the production and secretion of compounds toxic to their predators; these organisms are therefore possible sources of proteins or peptides with potential biotechnological application. In the last decade anthozoans, mainly octocorals but also sea anemones, have proven to be a source of natural products. Members of the genus Anthopleura are among the best known and most studied sea anemones because they are common constituents of rocky intertidal communities and show interesting ecological and biological phenomena (e.g. intraspecific competition, symbiosis, etc.); however, many aspects of these taxa remain to be analyzed. This work describes the transcriptome sequencing of Anthopleura dowii Verrill, 1869 (Cnidaria: Anthozoa: Actiniaria); this is the first report of this kind for this species. The data set used to construct the transcriptome has been deposited in NCBI's database. Illumina sequence reads are available under BioProject accession number PRJNA329297 and Sequence Read Archive under accession number SRP078992.

  8. Comparative transcriptome sequencing and de novo analysis of Vaccinium corymbosum during fruit and color development.

    PubMed

    Li, Lingli; Zhang, Hehua; Liu, Zhongshuai; Cui, Xiaoyue; Zhang, Tong; Li, Yanfang; Zhang, Lingyun

    2016-10-12

    Blueberry is an economically important fruit crop in the Ericaceae family. The substantial quantities of flavonoids in blueberry have been implicated in a broad range of health benefits. However, transcriptome-level information on fruit development and flavonoid metabolites is still limited. In the present study, transcriptome and gene expression profiling over berry development, especially during color development, was initiated. A total of approximately 13.67 Gbp of data were obtained and assembled into 186,962 transcripts and 80,836 unigenes from three stages of blueberry fruit and color development. A large number of simple sequence repeats (SSRs) and candidate genes, which are potentially involved in plant development, metabolic and hormone pathways, were identified. A total of 6429 sequences containing 8796 SSRs were characterized from 15,457 unigenes and 1763 unigenes contained more than one SSR. The expression profiles of key genes involved in anthocyanin biosynthesis were also studied. In addition, a comparison between our dataset and other published results was carried out. The high-quality reads produced in this study are an important advancement and provide a new resource for the interpretation of high-throughput data for blueberry species, in terms of both sequencing depth and species coverage. This transcriptome data will serve as a valuable public information database for studies of the blueberry genome and should greatly boost research on fruit and color development, flavonoid metabolism and regulation, and breeding of more healthful blueberries.
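
    Several abstracts in this listing report counts of simple sequence repeats (SSRs) without stating how they were detected. As a rough illustration only, the sketch below finds perfect tandem repeats with a regular expression. The function name, the motif length range (1 to 6 nt) and the minimum repeat count are assumptions made for this example, not the settings or the tool used in any of the studies.

```python
import re


def find_ssrs(sequence: str, min_repeats: int = 5, max_motif_len: int = 6):
    """Return (start, motif, repeat_count) for each perfect tandem repeat.

    A lazy motif group followed by a backreference keeps the shortest motif
    that repeats at least `min_repeats` times in a row.
    """
    pattern = re.compile(
        r"([ACGT]{1,%d}?)\1{%d,}" % (max_motif_len, min_repeats - 1)
    )
    hits = []
    for match in pattern.finditer(sequence.upper()):
        motif = match.group(1)
        repeats = len(match.group(0)) // len(motif)
        hits.append((match.start(), motif, repeats))
    return hits


# Hypothetical sequence containing a dinucleotide repeat (AC)5.
print(find_ssrs("TTGACACACACACGG"))
```

    Dedicated SSR-mining tools typically apply different minimum repeat counts per motif length and can also report compound repeats, which this sketch does not attempt.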

  9. Zseq: An Approach for Preprocessing Next-Generation Sequencing Data.

    PubMed

    Alkhateeb, Abedalrhman; Rueda, Luis

    2017-08-01

    Next-generation sequencing technology generates a huge number of reads (short sequences), which contain a vast amount of genomic data. The sequencing process, however, comes with artifacts. Preprocessing of sequences is mandatory for further downstream analysis. We present Zseq, a linear method that identifies the most informative genomic sequences and reduces the number of biased sequences, sequence duplications, and ambiguous nucleotides. Zseq measures the complexity of each sequence by counting its number of unique k-mers as the corresponding score, and also takes into account other factors such as ambiguous nucleotides or high GC-content percentage in k-mers. Based on a z-score threshold, Zseq sweeps through the sequences again and filters out those with a z-score less than the user-defined threshold. The Zseq algorithm is able to provide a better mapping rate; it reduces the number of ambiguous bases significantly in comparison with other methods. Evaluation of the filtered reads has been conducted by aligning the reads and assembling the transcripts using the reference genome as well as de novo assembly. The assembled transcripts show a better discriminative ability to separate cancer and normal samples in comparison with another state-of-the-art method. Moreover, de novo assembled transcripts from the reads filtered by Zseq have longer genomic sequences than other tested methods. Estimating the threshold of the cutoff point is introduced using labeling rules with optimistic results.
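
    The two passes described in the abstract above (score each read by its number of unique k-mers, then filter by z-score) can be sketched as follows. This is a simplified illustration and not the Zseq implementation: the function names, the default k, the default threshold, and the handling of ambiguous bases are assumptions, and the GC-content factor mentioned in the abstract is left out.

```python
from statistics import mean, pstdev


def unique_kmer_count(read: str, k: int = 4) -> int:
    """Score a read by its number of distinct k-mers.

    k-mers containing an ambiguous base (N) are not counted (an assumption
    of this sketch).
    """
    kmers = {read[i:i + k] for i in range(len(read) - k + 1)}
    return sum(1 for kmer in kmers if "N" not in kmer)


def filter_by_zscore(reads, k: int = 4, threshold: float = -1.0):
    """Keep reads whose complexity z-score is at or above the threshold."""
    # First pass: score every read.
    scores = [unique_kmer_count(read, k) for read in reads]
    mu = mean(scores)
    sigma = pstdev(scores)
    if sigma == 0:
        # All reads have the same score, so nothing can be separated.
        return list(reads)
    # Second pass: drop reads whose z-score falls below the threshold.
    return [
        read for read, score in zip(reads, scores)
        if (score - mu) / sigma >= threshold
    ]


# Hypothetical reads: the homopolymer read has a single unique 4-mer and is removed.
reads = ["ACGTACGTTGCA", "AAAAAAAAAAAA", "ACGTTGCAGTCA", "ACGGTCATGCAA"]
print(filter_by_zscore(reads))
```

    Both passes touch each read once, which matches the linear behavior the abstract claims for the method.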

  10. De Novo Sequencing and Comparative Analysis of Schima superba Seedlings to Explore the Response to Drought Stress

    PubMed Central

    Han, Bao-cai; Wei, Wei; Mi, Xiang-cheng; Ma, Ke-ping

    2016-01-01

    Schima superba is an important dominant species in subtropical evergreen broadleaved forests of China, and plays a vital role in community structure and dynamics. However, the survival rate of its seedlings in the field is low, and water shortage could be a factor that limits its regeneration. In order to better understand the response of its seedlings to drought stress on a functional genomics scale, RNA-seq technology was utilized in this study to perform a large-scale transcriptome sequencing of the S. superba seedlings under drought stress. More than 320 million clean reads were generated and 72218 unique transcripts were obtained through de novo assembly. These unigenes were further annotated by blasting with different public databases and a total of 53300 unique transcripts were annotated. A total of 31586 simple sequence repeat (SSR) loci were presented. Through gene expression profiling analysis between drought treatment and control, 11038 genes were found to be significantly enriched in drought-stressed seedlings. Based on these differentially expressed genes (DEGs), Gene Ontology (GO) terms enrichment and Kyoto Encyclopedia of Genes and Genomes pathway (KEGG) enrichment analysis indicated that drought stress caused a number of changes in the types of sugars, enzymes, secondary mechanisms, and light responses, and induced some potential physical protection mechanisms. In addition, the expression patterns of 18 transcripts induced by drought, as determined by quantitative real-time PCR, were consistent with their transcript abundance changes, as identified by RNA-seq. This transcriptome study provides a rapid method for understanding the response of S. superba seedlings to drought stress and provides a number of gene sequences available for further functional genomics studies. PMID:27930677

  11. De Novo Sequencing and Comparative Analysis of Schima superba Seedlings to Explore the Response to Drought Stress.

    PubMed

    Han, Bao-Cai; Wei, Wei; Mi, Xiang-Cheng; Ma, Ke-Ping

    2016-01-01

    Schima superba is an important dominant species in subtropical evergreen broadleaved forests of China, and plays a vital role in community structure and dynamics. However, the survival rate of its seedlings in the field is low, and water shortage could be a factor that limits its regeneration. In order to better understand the response of its seedlings to drought stress on a functional genomics scale, RNA-seq technology was utilized in this study to perform a large-scale transcriptome sequencing of the S. superba seedlings under drought stress. More than 320 million clean reads were generated and 72218 unique transcripts were obtained through de novo assembly. These unigenes were further annotated by blasting with different public databases and a total of 53300 unique transcripts were annotated. A total of 31586 simple sequence repeat (SSR) loci were presented. Through gene expression profiling analysis between drought treatment and control, 11038 genes were found to be significantly enriched in drought-stressed seedlings. Based on these differentially expressed genes (DEGs), Gene Ontology (GO) terms enrichment and Kyoto Encyclopedia of Genes and Genomes pathway (KEGG) enrichment analysis indicated that drought stress caused a number of changes in the types of sugars, enzymes, secondary mechanisms, and light responses, and induced some potential physical protection mechanisms. In addition, the expression patterns of 18 transcripts induced by drought, as determined by quantitative real-time PCR, were consistent with their transcript abundance changes, as identified by RNA-seq. This transcriptome study provides a rapid method for understanding the response of S. superba seedlings to drought stress and provides a number of gene sequences available for further functional genomics studies.

  12. De novo genome assembly of the economically important weed horseweed using integrated data from multiple sequencing platforms.

    PubMed

    Peng, Yanhui; Lai, Zhao; Lane, Thomas; Nageswara-Rao, Madhugiri; Okada, Miki; Jasieniuk, Marie; O'Geen, Henriette; Kim, Ryan W; Sammons, R Douglas; Rieseberg, Loren H; Stewart, C Neal

    2014-11-01

    Horseweed (Conyza canadensis), a member of the Compositae (Asteraceae) family, was the first broadleaf weed to evolve resistance to glyphosate. Horseweed, one of the most problematic weeds in the world, is a true diploid (2n = 2x = 18), with the smallest genome of any known agricultural weed (335 Mb). Thus, it is an appropriate candidate to help us understand the genetic and genomic bases of weediness. We undertook a draft de novo genome assembly of horseweed by combining data from multiple sequencing platforms (454 GS-FLX, Illumina HiSeq 2000, and PacBio RS) using various libraries with different insertion sizes (approximately 350 bp, 600 bp, 3 kb, and 10 kb) of a Tennessee-accessed, glyphosate-resistant horseweed biotype. From 116.3 Gb (approximately 350× coverage) of data, the genome was assembled into 13,966 scaffolds with 50% of the assembly = 33,561 bp. The assembly covered 92.3% of the genome, including the complete chloroplast genome (approximately 153 kb) and a nearly complete mitochondrial genome (approximately 450 kb in 120 scaffolds). The nuclear genome is composed of 44,592 protein-coding genes. Genome resequencing of seven additional horseweed biotypes was performed. These sequence data were assembled and used to analyze genome variation. Simple sequence repeat and single-nucleotide polymorphisms were surveyed. Genomic patterns were detected that associated with glyphosate-resistant or -susceptible biotypes. The draft genome will be useful to better understand weediness and the evolution of herbicide resistance and to devise new management strategies. The genome will also be useful as another reference genome in the Compositae. To our knowledge, this article represents the first published draft genome of an agricultural weed.
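
    The coverage figure quoted above follows from dividing the sequenced bases by the genome size. A minimal check of that arithmetic, using only the numbers given in the abstract (the function and variable names are illustrative):

```python
def fold_coverage(total_bases, genome_size_bp):
    """Average number of times each genome position is sequenced."""
    return total_bases / genome_size_bp


GENOME_SIZE_BP = 335e6   # 335 Mb genome, from the abstract
TOTAL_BASES = 116.3e9    # 116.3 Gb of sequence data, from the abstract

coverage = fold_coverage(TOTAL_BASES, GENOME_SIZE_BP)

# The assembly is reported to cover 92.3% of the genome.
assembled_span_bp = 0.923 * GENOME_SIZE_BP

if __name__ == "__main__":
    print(f"coverage: {coverage:.1f}x")
    print(f"assembled span: {assembled_span_bp / 1e6:.1f} Mb")
```

    The result is about 347×, consistent with the "approximately 350×" stated in the abstract, and 92.3% of 335 Mb corresponds to roughly 309 Mb of assembled sequence.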

  13. A comparison across non-model animals suggests an optimal sequencing depth for de novo transcriptome assembly

    PubMed Central

    2013-01-01

    Background The lack of genomic resources can present challenges for studies of non-model organisms. Transcriptome sequencing offers an attractive method to gather information about genes and gene expression without the need for a reference genome. However, it is unclear what sequencing depth is adequate to assemble the transcriptome de novo for these purposes. Results We assembled transcriptomes of animals from six different phyla (Annelids, Arthropods, Chordates, Cnidarians, Ctenophores, and Molluscs) at regular increments of reads using Velvet/Oases and Trinity to determine how read count affects the assembly. This included an assembly of mouse heart reads because we could compare those against the reference genome that is available. We found qualitative differences in the assemblies of whole-animals versus tissues. With increasing reads, whole-animal assemblies show rapid increase of transcripts and discovery of conserved genes, while single-tissue assemblies show a slower discovery of conserved genes though the assembled transcripts were often longer. A deeper examination of the mouse assemblies shows that with more reads, assembly errors become more frequent but such errors can be mitigated with more stringent assembly parameters. Conclusions These assembly trends suggest that representative assemblies are generated with as few as 20 million reads for tissue samples and 30 million reads for whole-animals for RNA-level coverage. These depths provide a good balance between coverage and noise. Beyond 60 million reads, the discovery of new genes is low and sequencing errors of highly-expressed genes are likely to accumulate. Finally, siphonophores (polymorphic Cnidarians) are an exception and possibly require alternate assembly strategies. PMID:23496952

  14. De novo sequencing and transcriptome analysis of Wolfiporia cocos to reveal genes related to biosynthesis of triterpenoids.

    PubMed

    Shu, Shaohua; Chen, Bei; Zhou, Mengchun; Zhao, Xinmei; Xia, Haiyang; Wang, Mo

    2013-01-01

    Wolfiporia cocos Ryvarden et Gilbertson is a saprophytic fungus in the Basidiomycetes. Its dried sclerotium is widely used as a traditional crude drug in East Asia. Especially in China, the dried sclerotium is regarded as the silver of the Chinese traditional drugs, not only for its white color, but also its medicinal value. Furthermore, triterpenoids from W. cocos are the main active compounds with antitumor and anti-inflammatory activity. Biosynthesis of the triterpenoids has rarely been researched. In this study, the de novo sequencing of the mycelia and sclerotia of W. cocos was carried out by Illumina HiSeq 2000. A total of 3,484,996,740 bp from 38,722,186 sequence reads of mycelia, and 3,573,921,960 bp from 39,710,244 high quality sequence reads of sclerotium were obtained. These raw data were assembled into 60,354 contigs and 40,939 singletons, and 56,938 contigs and 37,220 singletons for mycelia and sclerotia, respectively. The transcriptomic data clearly showed that terpenoid biosynthesis was only via the MVA pathway in W. cocos. The production of total triterpenoids and pachymic acid was examined in the dry mycelia and sclerotia. The content of total triterpenoids was 5.36% and 1.43% in mycelia and sclerotia, respectively, and the content of pachymic acid was 0.458% and 0.174%. Some genes involved in the triterpenoid biosynthetic pathway were chosen to be verified by qRT-PCR. The unigenes encoding diphosphomevalonate decarboxylase (Unigene 20430), farnesyl diphosphate synthase (Unigene 14106 and 21656), hydroxymethylglutaryl-CoA reductase (NADPH) (Unigene 6395_All) and lanosterol synthase (Unigene28001_All) were upregulated in the mycelia stage. It is likely that expression of these genes influences the biosynthesis of triterpenoids in the mycelia stage.
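
    As a quick consistency check on the read statistics above, the total base counts divided by the read counts give the mean read length for each library. The helper name is illustrative; the numbers are taken directly from the abstract.

```python
def mean_read_length(total_bp, n_reads):
    """Mean read length in bp for a sequencing library."""
    return total_bp / n_reads


# Figures reported in the abstract.
mycelia_len = mean_read_length(3_484_996_740, 38_722_186)
sclerotia_len = mean_read_length(3_573_921_960, 39_710_244)

if __name__ == "__main__":
    print(mycelia_len, sclerotia_len)
```

    Both libraries work out to exactly 90 bp per read, so the reported totals are internally consistent.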

  15. De Novo Sequencing and Transcriptome Analysis of Wolfiporia cocos to Reveal Genes Related to Biosynthesis of Triterpenoids

    PubMed Central

    Shu, Shaohua; Chen, Bei; Zhou, Mengchun; Zhao, Xinmei; Xia, Haiyang; Wang, Mo

    2013-01-01

    Wolfiporia cocos Ryvarden et Gilbertson is a saprophytic fungus in the Basidiomycetes. Its dried sclerotium is widely used as a traditional crude drug in East Asia. Especially in China, the dried sclerotium is regarded as the silver of the Chinese traditional drugs, not only for its white color, but also its medicinal value. Furthermore, triterpenoids from W. cocos are the main active compounds with antitumor and anti-inflammatory activity. Biosynthesis of the triterpenoids has rarely been researched. In this study, the de novo sequencing of the mycelia and sclerotia of W. cocos was carried out by Illumina HiSeq 2000. A total of 3,484,996,740 bp from 38,722,186 sequence reads of mycelia, and 3,573,921,960 bp from 39,710,244 high quality sequence reads of sclerotium were obtained. These raw data were assembled into 60,354 contigs and 40,939 singletons, and 56,938 contigs and 37,220 singletons for mycelia and sclerotia, respectively. The transcriptomic data clearly showed that terpenoid biosynthesis was only via the MVA pathway in W. cocos. The production of total triterpenoids and pachymic acid was examined in the dry mycelia and sclerotia. The content of total triterpenoids was 5.36% and 1.43% in mycelia and sclerotia, respectively, and the content of pachymic acid was 0.458% and 0.174%. Some genes involved in the triterpenoid biosynthetic pathway were chosen to be verified by qRT-PCR. The unigenes encoding diphosphomevalonate decarboxylase (Unigene 20430), farnesyl diphosphate synthase (Unigene 14106 and 21656), hydroxymethylglutaryl-CoA reductase (NADPH) (Unigene 6395_All) and lanosterol synthase (Unigene28001_All) were upregulated in the mycelia stage. It is likely that expression of these genes influences the biosynthesis of triterpenoids in the mycelia stage. PMID:23967197

  16. De novo sequencing and comparative analysis of testicular transcriptome from different reproductive phases in freshwater spotted snakehead Channa punctatus

    PubMed Central

    Roy, Alivia; Basak, Reetuparna

    2017-01-01

    The spotted snakehead Channa punctatus is a seasonally breeding teleost widely distributed in the Indian subcontinent and economically important due to high nutritional value. The declining population of C. punctatus prompted us to focus on genetic regulation of its reproduction. The present study carried out de novo testicular transcriptome sequencing during the four reproductive phases and correlated differential expression of transcripts with various testicular events in C. punctatus. The Illumina paired-end sequencing of testicular transcriptome from resting, preparatory, spawning and postspawning phases generated 41.94, 47.51, 61.81 and 44.45 million reads, and 105526, 105169, 122964 and 106544 transcripts, respectively. Transcripts annotated using Rattus norvegicus reference protein sequences and classified under various subcategories of biological process, molecular function and cellular component showed that the majority of the subcategories had highest number of transcripts during spawning phase. In addition, analysis of transcripts exhibiting differential expression during the four phases revealed an appreciable increase in upregulated transcripts of biological processes such as cell proliferation and differentiation, cytoskeleton organization, response to vitamin A, transcription and translation, regulation of angiogenesis and response to hypoxia during spermatogenically active phases. The study also identified significant differential expression of transcripts relevant to spermatogenesis (mgat3, nqo1, hes2, rgs4, cxcl2, alcam, agmat), steroidogenesis (star, tkt, gipc3), cell proliferation (eef1a2, btg3, pif1, myo16, grik3, trim39, plbd1), cytoskeletal organization (espn, wipf3, cd276), sperm development (klhl10, mast1, hspa1a, slc6a1, ros1, foxj1, hipk1), and sperm transport and motility (hint1, muc13). Analysis of functional annotation and differential expression of testicular transcripts depending on reproductive phases of C. punctatus helped in

  17. De novo assembly of transcriptome sequencing in Caragana korshinskii Kom. and characterization of EST-SSR markers.

    PubMed

    Long, Yan; Wang, Yanyan; Wu, Shanshan; Wang, Jiao; Tian, Xinjie; Pei, Xinwu

    2015-01-01

    Caragana korshinskii Kom. is widely distributed in various habitats, including gravel desert, clay desert, fixed and semi-fixed sand, and saline land in the Asian and African deserts. To date, no genomic information or EST-SSR marker has been reported for the genus Caragana Fabr. In this study, more than two billion bases of high-quality sequence of C. korshinskii were generated using Illumina sequencing technology, demonstrating de novo assembly and annotation of genes without prior genome information. These reads were assembled into 86,265 unigenes (mean length = 709 bp). The similarity search indicated that 33,955 and 21,978 unigenes showed significant similarities to known proteins from the NCBI non-redundant and Swiss-Prot protein databases, respectively. Among these annotated unigenes, 26,232 unigenes were separately assigned to the Gene Ontology (GO) database. When 22,756 unigenes were searched against the Kyoto Encyclopedia of Genes and Genomes Pathway (KEGG) database, 5,598 unigenes were assigned to 5 main categories including 32 KEGG pathways. Among the main KEGG categories, metabolism was the biggest category (2,862, 43.7%), suggesting active metabolic processes in the desert tree. In addition, a total of 19,150 EST-SSRs were identified from 15,484 unigenes, and the characteristics of the EST-SSRs were further compared with those of four other species in Fabaceae. A total of 126 potential marker sites were randomly selected to validate the assembly quality and develop EST-SSR markers. Among the 9 germplasms in the genus Caragana Fabr., the PCR success rate was 93.7%, and a phylogenetic tree was constructed based on the genotypic data. This research generated a substantial fraction of transcriptome sequences, which are very useful resources for gene annotation and discovery, molecular marker development, genome assembly and annotation. The EST-SSR markers identified and developed in this study will facilitate marker-assisted selection breeding.
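
    The abstract does not say which tool or thresholds were used to call EST-SSRs. As a rough illustration of how simple sequence repeats can be located in a unigene, the sketch below uses regular expressions with assumed minimum repeat counts (6 for dinucleotides, 5 for trinucleotides, 4 for tetranucleotides); the function names and thresholds are hypothetical and not taken from the study.

```python
import re


def is_primitive(unit):
    """True if the motif is not itself a repeat of a shorter motif."""
    n = len(unit)
    return not any(
        n % d == 0 and unit == unit[:d] * (n // d)
        for d in range(1, n)
    )


def find_ssrs(seq, min_repeats=None):
    """Return (start, motif, repeat_count) for each perfect SSR found.

    `min_repeats` maps motif length to the minimum number of tandem
    copies; the defaults are assumptions for illustration only.
    """
    if min_repeats is None:
        min_repeats = {2: 6, 3: 5, 4: 4}
    seq = seq.upper()
    hits = []
    for unit_len, n in sorted(min_repeats.items()):
        # One captured motif followed by at least n - 1 further copies.
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (unit_len, n - 1))
        for match in pattern.finditer(seq):
            unit = match.group(1)
            if not is_primitive(unit):
                continue  # e.g. skip "AA" or "ACAC"
            hits.append((match.start(), unit, len(match.group(0)) // unit_len))
    return hits


if __name__ == "__main__":
    unigene = "GGG" + "AC" * 7 + "CCC" + "GAT" * 5 + "C"
    print(find_ssrs(unigene))
```

    For the example sequence this reports an (AC)7 repeat at position 3 and a (GAT)5 repeat at position 20. Filtering non-primitive motifs keeps a dinucleotide run from being counted again as a tetranucleotide repeat.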

  18. De novo transcriptome sequencing of radish (Raphanus sativus L.) and analysis of major genes involved in glucosinolate metabolism.

    PubMed

    Wang, Yan; Pan, Yan; Liu, Zhe; Zhu, Xianwen; Zhai, Lulu; Xu, Liang; Yu, Rugang; Gong, Yiqin; Liu, Liwang

    2013-11-27

    Radish (Raphanus sativus L.) is an important root vegetable crop worldwide. Glucosinolates in the fleshy taproot significantly affect the flavor and nutritional quality of radish. However, little is known about the molecular mechanisms underlying glucosinolate metabolism in radish taproots. The limited availability of radish genomic information has greatly hindered functional genomic analysis and molecular breeding in radish. In this study, a high-throughput, large-scale RNA sequencing technology was employed to characterize the de novo transcriptome of radish roots at different stages of development. Approximately 66.11 million paired-end reads representing 73,084 unigenes with an N50 length of 1,095 bp, and a total length of 55.73 Mb were obtained. Comparison with the publicly available protein database indicates that a total of 67,305 (about 92.09% of the assembled unigenes) unigenes exhibit similarity (e-value ≤ 1.0e⁻⁵) to known proteins. The functional annotation and classification including Gene Ontology (GO), Clusters of Orthologous Group (COG) and Kyoto Encyclopedia of Genes and Genomes (KEGG) analysis revealed that the main activated genes in radish taproots are predominantly involved in basic physiological and metabolic processes, biosynthesis of secondary metabolite pathways, signal transduction mechanisms and other cellular components and molecular function related terms. The majority of the genes encoding enzymes involved in glucosinolate (GS) metabolism and regulation pathways were identified in the unigene dataset by targeted searches of their annotations. A number of candidate radish genes in the glucosinolate metabolism related pathways were also discovered, from which, eight genes were validated by T-A cloning and sequencing while four were validated by quantitative RT-PCR expression profiling. The ensuing transcriptome dataset provides a comprehensive sequence resource for molecular genetics research in radish. It will serve as an

  19. De novo sequencing and comparative analysis of testicular transcriptome from different reproductive phases in freshwater spotted snakehead Channa punctatus.

    PubMed

    Roy, Alivia; Basak, Reetuparna; Rai, Umesh

    2017-01-01

    The spotted snakehead Channa punctatus is a seasonally breeding teleost widely distributed in the Indian subcontinent and economically important due to high nutritional value. The declining population of C. punctatus prompted us to focus on genetic regulation of its reproduction. The present study carried out de novo testicular transcriptome sequencing during the four reproductive phases and correlated differential expression of transcripts with various testicular events in C. punctatus. The Illumina paired-end sequencing of testicular transcriptome from resting, preparatory, spawning and postspawning phases generated 41.94, 47.51, 61.81 and 44.45 million reads, and 105526, 105169, 122964 and 106544 transcripts, respectively. Transcripts annotated using Rattus norvegicus reference protein sequences and classified under various subcategories of biological process, molecular function and cellular component showed that the majority of the subcategories had highest number of transcripts during spawning phase. In addition, analysis of transcripts exhibiting differential expression during the four phases revealed an appreciable increase in upregulated transcripts of biological processes such as cell proliferation and differentiation, cytoskeleton organization, response to vitamin A, transcription and translation, regulation of angiogenesis and response to hypoxia during spermatogenically active phases. The study also identified significant differential expression of transcripts relevant to spermatogenesis (mgat3, nqo1, hes2, rgs4, cxcl2, alcam, agmat), steroidogenesis (star, tkt, gipc3), cell proliferation (eef1a2, btg3, pif1, myo16, grik3, trim39, plbd1), cytoskeletal organization (espn, wipf3, cd276), sperm development (klhl10, mast1, hspa1a, slc6a1, ros1, foxj1, hipk1), and sperm transport and motility (hint1, muc13). Analysis of functional annotation and differential expression of testicular transcripts depending on reproductive phases of C. punctatus helped in

  20. De Novo Sequencing and Assembly Analysis of the Pseudostellaria heterophylla Transcriptome

    PubMed Central

    Li, Jun; Zhen, Wei; Long, Dengkai; Ding, Ling; Gong, Anhui; Xiao, Chenghong; Jiang, Weike; Liu, Xiaoqing; Zhou, Tao; Huang, Luqi

    2016-01-01

    Pseudostellaria heterophylla (Miq.) Pax is a mild tonic herb widely cultivated in the Southern part of China. The tuberous roots of P. heterophylla accumulate high levels of secondary metabolism products of medicinal value such as saponins, flavonoids, and isoquinoline alkaloids. Despite numerous studies on the pharmacological importance and purification of these compounds in P. heterophylla, their biosynthesis is not well understood. In the present study, we used Illumina HiSeq 4000 sequencing platform to sequence the RNA from flowers, leaves, stem, root cortex and xylem tissues of P. heterophylla. We obtained 616,413,316 clean reads that we assembled into 127,334 unique sequences with an N50 length of 951 bp. Among these unigenes, 53,184 unigenes (41.76%) were annotated in a public database and 39,795 unigenes were assigned to 356 KEGG pathways; 23,714 unigenes (8.82%) had high homology with the genes from Beta vulgaris. We discovered 32,095 DEGs in different tissues and performed GO and KEGG enrichment analysis. The most enriched KEGG pathway of secondary metabolism showed up-regulated expression in tuberous roots as compared with the ground parts of P. heterophylla. Moreover, we identified 72 candidate genes involved in triterpenoids saponins biosynthesis in P. heterophylla. The expression profiles of 11 candidate unigenes were analyzed by quantitative real-time PCR (RT-qPCR). Our study established a global transcriptome database of P. heterophylla for gene identification and regulation. We also identified the candidate unigenes involved in triterpenoids saponins biosynthesis. Our results provide an invaluable resource for the secondary metabolites and physiological processes in different tissues of P. heterophylla. PMID:27764127
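
    Several records in this listing report an N50 length as an assembly statistic. For readers unfamiliar with it, N50 is the length L such that sequences of length L or longer together contain at least half of the total assembled bases. A minimal sketch, with toy lengths that are not taken from any of the studies:

```python
def n50(lengths):
    """Return the N50 of a collection of contig or unigene lengths."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        # Stop at the first length where the cumulative sum
        # reaches half of the total assembly size.
        if 2 * running >= total:
            return length
    raise ValueError("n50() requires at least one non-zero length")


if __name__ == "__main__":
    toy_lengths = [10, 20, 30, 40, 50]
    print(n50(toy_lengths))
```

    For the toy lengths the total is 150 bp; the two longest sequences (50 and 40 bp) together reach 90 bp, which is at least half, so the N50 is 40.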

  1. De novo assembly and characterization of germinating lettuce seed transcriptome using Illumina paired-end sequencing.

    PubMed

    Liu, Shu-Jun; Song, Shun-Hua; Wang, Wei-Qing; Song, Song-Quan

    2015-11-01

    At supraoptimal temperature, germination of lettuce (Lactuca sativa L.) seeds exhibits a typical germination thermoinhibition, which can be alleviated by sodium nitroprusside (SNP) in a nitric oxide-dependent manner. However, the molecular mechanism of seed germination thermoinhibition and its alleviation by SNP are poorly understood. In the present study, lettuce seeds imbibed at optimal temperature in water, or at supraoptimal temperature with or without 100 μM SNP, for different periods of time were used as experimental materials. Total RNA was extracted and sequenced, yielding 147,271,347 raw reads using the Illumina paired-end sequencing technique, and the transcriptome of germinating lettuce seeds was assembled. A total of 51,792 unigenes with a mean length of 849 nucleotides were obtained. Of these unigenes, a total of 29,542 unigenes were annotated by sequence similarity searching in four databases: the NCBI non-redundant protein database, the SwissProt protein database, the euKaryotic Ortholog Groups database, and the NCBI nucleotide database. Among the annotated unigenes, 22,276 unigenes were assigned to the Gene Ontology database. When all the annotated unigenes were searched against the Kyoto Encyclopedia of Genes and Genomes Pathway database, a total of 8,810 unigenes were mapped to 5 main categories including 260 pathways. We first obtained many unigenes encoding proteins involved in abscisic acid (ABA) signaling in lettuce, including 11 ABA receptors, 94 protein phosphatase 2Cs and 16 sucrose non-fermenting 1-related protein kinases. These results will help us to better understand the molecular mechanism of seed germination, thermoinhibition of seed germination and its alleviation by SNP. Copyright © 2015 Elsevier Masson SAS. All rights reserved.

  2. Discovery of Novel Antimicrobial Peptides from Varanus komodoensis (Komodo Dragon) by Large-Scale Analyses and De-Novo-Assisted Sequencing Using Electron-Transfer Dissociation Mass Spectrometry.

    PubMed

    Bishop, Barney M; Juba, Melanie L; Russo, Paul S; Devine, Megan; Barksdale, Stephanie M; Scott, Shaylyn; Settlage, Robert; Michalak, Pawel; Gupta, Kajal; Vliet, Kent; Schnur, Joel M; van Hoek, Monique L

    2017-04-07

    Komodo dragons are the largest living lizards and are the apex predators in their environs. They endure numerous strains of pathogenic bacteria in their saliva and recover from wounds inflicted by other dragons, reflecting the inherent robustness of their innate immune defense. We have employed a custom bioprospecting approach combining partial de novo peptide sequencing with transcriptome assembly to identify cationic antimicrobial peptides from Komodo dragon plasma. Through these analyses, we identified 48 novel potential cationic antimicrobial peptides. All but one of the identified peptides were derived from histone proteins. The antimicrobial effectiveness of eight of these peptides was evaluated against Pseudomonas aeruginosa (ATCC 9027) and Staphylococcus aureus (ATCC 25923), with seven peptides exhibiting antimicrobial activity against both microbes and one only showing significant potency against P. aeruginosa. This study demonstrates the power and promise of our bioprospecting approach to cationic antimicrobial peptide discovery, and it reveals the presence of a plethora of novel histone-derived antimicrobial peptides in the plasma of the Komodo dragon. These findings may have broader implications regarding the role that intact histones and histone-derived peptides play in defending the host from infection. Data are available via ProteomeXChange with identifier PXD005043.

  3. Color Sequence of Triton Approach Images

    NASA Image and Video Library

    1998-06-04

    This color image from NASA Voyager 2 was reconstructed by making a computer composite of three black and white images taken through red, green, and blue filters. Details on Triton surface unfold dramatically in this sequence of approach images. http://photojournal.jpl.nasa.gov/catalog/PIA00329
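
    The compositing step mentioned above, combining three grayscale frames taken through red, green, and blue filters into one color image, can be illustrated with a toy example. This is a generic sketch using nested lists of pixel intensities, not NASA's processing pipeline; the function name is illustrative, and real frames would also need registration and calibration, which are omitted here.

```python
def compose_rgb(red, green, blue):
    """Stack three equally sized grayscale frames into RGB pixels.

    Each frame is a list of rows; each row is a list of intensities.
    """
    if not (len(red) == len(green) == len(blue)):
        raise ValueError("frames must have the same number of rows")
    return [
        [(r, g, b) for r, g, b in zip(r_row, g_row, b_row)]
        for r_row, g_row, b_row in zip(red, green, blue)
    ]


if __name__ == "__main__":
    red_frame = [[255, 0]]
    green_frame = [[0, 128]]
    blue_frame = [[10, 20]]
    print(compose_rgb(red_frame, green_frame, blue_frame))
```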

  4. De novo discovery of neuropeptides in the genomes of parasitic flatworms using a novel comparative approach.

    PubMed

    Koziol, Uriel; Koziol, Miguel; Preza, Matías; Costábile, Alicia; Brehm, Klaus; Castillo, Estela

    2016-10-01

    Neuropeptide mediated signalling is an ancient mechanism found in almost all animals and has been proposed as a promising target for the development of novel drugs against helminths. However, identification of neuropeptides from genomic data is challenging, and knowledge of the neuropeptide complement of parasitic flatworms is still fragmentary. In this work, we have developed an evolution-based strategy for the de novo discovery of neuropeptide precursors, based on the detection of localised sequence conservation between possible prohormone convertase cleavage sites. The method detected known neuropeptide precursors with good precision and specificity in the models Drosophila melanogaster and Caenorhabditis elegans. Furthermore, it identified novel putative neuropeptide precursors in nematodes, including the first description of allatotropin homologues in this phylum. Our search for neuropeptide precursors in the genomes of parasitic flatworms resulted in the description of 34 conserved neuropeptide precursor families, including 13 new ones, and of hundreds of new homologues of known neuropeptide precursor families. Most neuropeptide precursor families show a wide phylogenetic distribution among parasitic flatworms and show little similarity to neuropeptide precursors of other bilaterian animals. However, we could also find orthologs of some conserved bilaterian neuropeptides including pyrokinin, crustacean cardioactive peptide, myomodulin, neuropeptide-Y, neuropeptide KY and SIF-amide. Finally, we determined the expression patterns of seven putative neuropeptide precursor genes in the protoscolex of Echinococcus multilocularis. All genes were expressed in the nervous system with different patterns, indicating a hidden complexity of peptidergic signalling in cestodes. Copyright © 2016 Australian Society for Parasitology. Published by Elsevier Ltd. All rights reserved.

  5. Sequencing and de novo assembly of the Asian gypsy moth transcriptome using the Illumina platform.

    PubMed

    Xiaojun, Fan; Chun, Yang; Jianhong, Liu; Chang, Zhang; Yao, Li

    2017-01-01

    The Asian gypsy moth (Lymantria dispar) is a serious pest of forest and shade trees in many Asian and some European countries. However, there have been few studies of L. dispar genetic information and comprehensive genetic analyses of this species are needed in order to understand its genetic and metabolic sensitivities, such as the molting mechanism during larval development. In this study, high-throughput sequencing technology was used to sequence the transcriptome of the Asian subspecies of the gypsy moth, after which a comprehensive analysis of chitin metabolism was undertaken. We generated 37,750,380 high-quality reads and assembled them into contigs. A total of 37,098 unigenes were identified, of which 15,901 were annotated in the NCBI non-redundant protein database and 9,613 were annotated in the Swiss-Prot database. We mapped 4,329 unigenes onto 317 pathways using the Kyoto Encyclopedia of Genes and Genomes Pathway database. Chitin metabolism unigenes were found in the transcriptome and the data indicated that a variety of enzymes were involved in chitin catabolic and biosynthetic pathways.

  6. De Novo Sequencing and Characterization of the Transcriptome of Dwarf Polish Wheat (Triticum polonicum L.).

    PubMed

    Wang, Yi; Wang, Chao; Wang, Xiaolu; Peng, Fan; Wang, Ruijiao; Jiang, Yulin; Zeng, Jian; Fan, Xing; Kang, Houyang; Sha, Lina; Zhang, Haiqin; Xiao, Xue; Zhou, Yonghong

    2016-01-01

    Construction as well as characterization of a Polish wheat transcriptome is a crucial step to study useful traits of Polish wheat. In this study, a transcriptome, including 76,014 unigenes, was assembled from dwarf Polish wheat (DPW) roots, stems, and leaves using the Trinity software. Among these unigenes, 61,748 (81.23%) unigenes were functionally annotated in public databases and classified into differentially functional types. Aligning this transcriptome against the draft wheat genome released by the International Wheat Genome Sequencing Consortium (IWGSC), 57,331 (75.42%) unigenes, including 26,122 AB-specific and 2,622 D-specific unigenes, were mapped on A, B, and/or D genomes. Compared with the transcriptome of T. turgidum, 56,343 unigenes were matched with 103,327 unigenes of T. turgidum. Compared with the genomes of rice and barley, 14,404 and 7,007 unigenes were matched with 14,608 genes of barley and 7,708 genes of rice, respectively. On the other hand, 2,148, 1,611, and 2,707 unigenes were expressed specifically in roots, stems, and leaves, respectively. Finally, 5,531 SSR sequences were observed from 4,531 unigenes, and 518 primer pairs were designed.
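
    The percentages quoted in this abstract can be reproduced from the unigene counts. A small check, with an illustrative helper name and all figures taken from the abstract:

```python
def percent(part, whole):
    """Percentage of `whole` represented by `part`, to two decimals."""
    return round(100 * part / whole, 2)


TOTAL_UNIGENES = 76_014

annotated_pct = percent(61_748, TOTAL_UNIGENES)  # functionally annotated
mapped_pct = percent(57_331, TOTAL_UNIGENES)     # mapped to the IWGSC draft

if __name__ == "__main__":
    print(annotated_pct, mapped_pct)
```

    Both values agree with the reported 81.23% and 75.42%.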

  7. De novo sequencing, assembly and analysis of eight different transcriptomes from the Malayan pangolin

    PubMed Central

    Mohamed Yusoff, Aini; Tan, Tze King; Hari, Ranjeev; Koepfli, Klaus-Peter; Wee, Wei Yee; Antunes, Agostinho; Sitam, Frankie Thomas; Rovie-Ryan, Jeffrine Japning; Karuppannan, Kayal Vizi; Wong, Guat Jah; Lipovich, Leonard; Warren, Wesley C.; O’Brien, Stephen J.; Choo, Siew Woh

    2016-01-01

    Pangolins are scale-covered mammals comprising eight endangered species. Maintaining pangolins in captivity is a significant challenge, in part because little is known about their genetics. Here we provide the first large-scale sequencing of the critically endangered Manis javanica transcriptomes from eight different organs using Illumina HiSeq technology, yielding ~75 gigabases and 89,754 unigenes. We found some unigenes involved in the insect hormone biosynthesis pathway and also 747 lipid metabolism-related unigenes that may help in understanding the lipid metabolism system in pangolins. Comparative analysis between M. javanica and other mammals revealed many pangolin-specific genes significantly over-represented in stress-related processes, cell proliferation and external stimulus, probably reflecting the traits and adaptations of the analyzed pregnant female M. javanica. Our study provides an invaluable resource for future functional work that may be highly relevant for the conservation of pangolins. PMID:27618997

  8. De novo sequencing, assembly and analysis of eight different transcriptomes from the Malayan pangolin.

    PubMed

    Mohamed Yusoff, Aini; Tan, Tze King; Hari, Ranjeev; Koepfli, Klaus-Peter; Wee, Wei Yee; Antunes, Agostinho; Sitam, Frankie Thomas; Rovie-Ryan, Jeffrine Japning; Karuppannan, Kayal Vizi; Wong, Guat Jah; Lipovich, Leonard; Warren, Wesley C; O'Brien, Stephen J; Choo, Siew Woh

    2016-09-13

    Pangolins are scale-covered mammals comprising eight endangered species. Maintaining pangolins in captivity is a significant challenge, in part because little is known about their genetics. Here we provide the first large-scale sequencing of the critically endangered Manis javanica transcriptomes from eight different organs using Illumina HiSeq technology, yielding ~75 gigabases and 89,754 unigenes. We found some unigenes involved in the insect hormone biosynthesis pathway and also 747 lipid metabolism-related unigenes that may help in understanding the lipid metabolism system in pangolins. Comparative analysis between M. javanica and other mammals revealed many pangolin-specific genes significantly over-represented in stress-related processes, cell proliferation and external stimulus, probably reflecting the traits and adaptations of the analyzed pregnant female M. javanica. Our study provides an invaluable resource for future functional work that may be highly relevant for the conservation of pangolins.

  9. De Novo Transcriptome Sequencing and Analysis of the Cereal Cyst Nematode, Heterodera avenae

    PubMed Central

    Kumar, Mukesh; Gantasala, Nagavara Prasad; Roychowdhury, Tanmoy; Thakur, Prasoon Kumar; Banakar, Prakash; Shukla, Rohit N.; Jones, Michael G. K.; Rao, Uma

    2014-01-01

    The cereal cyst nematode (CCN, Heterodera avenae) is a major pest of wheat (Triticum spp.) that reduces crop yields in many countries. Cyst nematodes are obligate sedentary endoparasites that reproduce by amphimixis. Here, we report the first transcriptome analysis of two stages of H. avenae. After sequencing RNA extracted from the pre-parasitic infective juvenile and adult stages of the life cycle, 131 million high-quality Illumina paired-end reads were obtained, which were assembled into 27,765 contigs with an N50 of 1,028 base pairs, of which 10,452 were annotated. Comparative analyses were undertaken to compare H. avenae sequences with those of other plant, animal and free-living nematodes to identify differences in expressed genes. There were 4,431 transcripts common to H. avenae and the free-living nematode Caenorhabditis elegans, and 9,462 in common with the more closely related potato cyst nematode, Globodera pallida. Annotation of H. avenae carbohydrate-active enzymes (CAZy) revealed fewer glycoside hydrolases (GHs) but more glycosyl transferases (GTs) and carbohydrate esterases (CEs) when compared to Meloidogyne incognita. A total of 1,280 transcripts were found to have a secretory signature, that is, the presence of a signal peptide and the absence of a transmembrane domain. In a comparison of genes expressed in the pre-parasitic juvenile and feeding female stages, the expression levels of 30 genes with high RPKM (reads per kilobase per million mapped reads) values were analysed by qRT-PCR, which confirmed the observed differences in their expression levels. In addition, we have also developed a user-friendly resource, the Heterodera transcriptome database (HATdb), for public access to the data generated in this study. The new data provided on the transcriptome of H. avenae add to the genetic resources available to study plant-parasitic nematodes and provide an opportunity to seek new effectors that are specifically involved in the H. avenae-cereal host interaction. PMID:24802510

  10. De novo sequence assembly of Albugo candida reveals a small genome relative to other biotrophic oomycetes

    PubMed Central

    2011-01-01

    Background: Albugo candida is a biotrophic oomycete that parasitizes various species of Brassicaceae, causing a disease (white blister rust) with remarkable convergence in behaviour to unrelated rusts of basidiomycete fungi. Results: A recent genome analysis of the oomycete Hyaloperonospora arabidopsidis suggests that a reduction in the number of genes encoding secreted pathogenicity proteins, enzymes for assimilation of inorganic nitrogen and sulphur represent a genomic signature for the evolution of obligate biotrophy. Here, we report a draft reference genome of a major crop pathogen Albugo candida (another obligate biotrophic oomycete) with an estimated genome of 45.3 Mb. This is very similar to the genome size of a necrotrophic oomycete Pythium ultimum (43 Mb) but less than half that of H. arabidopsidis (99 Mb). Sequencing of A. candida transcripts from infected host tissue and zoosporangia combined with genome-wide annotation revealed 15,824 predicted genes. Most of the predicted genes lack significant similarity with sequences from other oomycetes. Most intriguingly, A. candida appears to have a much smaller repertoire of pathogenicity-related proteins than H. arabidopsidis including genes that encode RXLR effector proteins, CRINKLER-like genes, and elicitins. Necrosis and Ethylene inducing Peptides were not detected in the genome of A. candida. Putative orthologs of tat-C, a component of the twin arginine translocase system, were identified from multiple oomycete genera along with proteins containing putative tat-secretion signal peptides. Conclusion: Albugo candida has a comparatively small genome amongst oomycetes, retains motility of sporangial inoculum, and harbours a much smaller repertoire of candidate effectors than was recently reported for H. arabidopsidis. This minimal gene repertoire could indicate a lack of expansion, rather than a reduction, in the number of genes that signify the evolution of biotrophy in oomycetes. PMID:21995639

  11. Transcriptome sequencing and de novo annotation of the critically endangered Adriatic sturgeon

    PubMed Central

    2013-01-01

    Background: Sturgeons are a group of Chondrostean fish of very high evolutionary, economic and conservation interest. The eggs of these living fossils represent one of the most highly prized foods of animal origin. The intense fishing pressure on wild stocks to harvest caviar has caused, in recent decades, a dramatic decline in their distribution and abundance, leading the International Union for Conservation of Nature to list them as the most endangered group of species. As a direct consequence, world-wide efforts have been made to develop sturgeon aquaculture programmes for caviar production. In this context, the characterization of the genes involved in sex determination could provide relevant information for the selective farming of the more profitable females. Results: The 454 sequencing of two cDNA libraries from the gonads and brain of one male and one female full-sib A. naccarii yielded 182,066 and 167,776 reads respectively, which, after strict quality control, were iteratively assembled into more than 55,000 high-quality ESTs. The average per-base coverage reached by assembling the two libraries was 4X. The multi-step annotation process resulted in 16% of sequences being successfully annotated with GO terms. We screened the transcriptome for 32 sex-related genes and highlighted 7 genes that are potentially specifically expressed, 5 in males and 2 in females, at the first life stage at which sex is histologically identifiable. In addition, we identified 21,791 putative EST-linked SNPs and 5,295 SSRs. Conclusions: This study represents the first large-scale release of sturgeon transcriptome information, which we organized into the public database AnaccariiBase, freely available at http://compgen.bio.unipd.it/anaccariibase/. These transcriptomic data represent an important source of information for further studies on sturgeon species. The hundreds of putative EST-linked molecular markers discovered in this study will be invaluable for sturgeon reintroduction and

  12. Detection and phasing of single base de novo mutations in biopsies from human in vitro fertilized embryos by advanced whole-genome sequencing.

    PubMed

    Peters, Brock A; Kermani, Bahram G; Alferov, Oleg; Agarwal, Misha R; McElwain, Mark A; Gulbahce, Natali; Hayden, Daniel M; Tang, Y Tom; Zhang, Rebecca Yu; Tearle, Rick; Crain, Birgit; Prates, Renata; Berkeley, Alan; Munné, Santiago; Drmanac, Radoje

    2015-03-01

    Currently, the methods available for preimplantation genetic diagnosis (PGD) of in vitro fertilized (IVF) embryos do not detect de novo single-nucleotide and short indel mutations, which have been shown to cause a large fraction of genetic diseases. Detection of all these types of mutations requires whole-genome sequencing (WGS). In this study, advanced massively parallel WGS was performed on three 5- to 10-cell biopsies from two blastocyst-stage embryos. Both parents and paternal grandparents were also analyzed to allow for accurate measurements of false-positive and false-negative error rates. Overall, >95% of each genome was called. In the embryos, experimentally derived haplotypes and barcoded read data were used to detect and phase up to 82% of de novo single base mutations with a false-positive rate of about one error per Gb, resulting in fewer than 10 such errors per embryo. This represents a ∼ 100-fold lower error rate than previously published from 10 cells, and it is the first demonstration that advanced WGS can be used to accurately identify these de novo mutations in spite of the thousands of false-positive errors introduced by the extensive DNA amplification required for deep sequencing. Using haplotype information, we also demonstrate how small de novo deletions could be detected. These results suggest that phased WGS using barcoded DNA could be used in the future as part of the PGD process to maximize comprehensiveness in detecting disease-causing mutations and to reduce the incidence of genetic diseases.

  13. Batch-processing of imaging or liquid-chromatography mass spectrometry datasets and De Novo sequencing of polyketide siderophores.

    PubMed

    Novák, Jiří; Sokolová, Lucie; Lemr, Karel; Pluháček, Tomáš; Palyzová, Andrea; Havlíček, Vladimír

    2017-07-01

    The open-source and cross-platform software CycloBranch was utilized for dereplication of organic compounds from mass spectrometry imaging imzML datasets, and its functions were illustrated on microbial siderophores. The pixel-to-pixel batch-processing was analogous to the processing of liquid chromatography mass spectrometry data. Each data point, represented here by accurate m/z values and the corresponding ion intensities, was matched against integrated compound libraries. Fine isotopic structure matching was also embedded into the CycloBranch dereplication process. The characterization of siderophores from single-pixel mass spectra was further supported by their de novo sequencing. A new ketide building block library was utilized by CycloBranch to characterize the siderophores in images and mixtures, and a nomenclature of fragment ion series of linear and cyclic polyketide siderophores was proposed. The software is freely available at http://ms.biomed.cas.cz/cyclobranch. This article is part of a Special Issue entitled: MALDI Imaging, edited by Dr. Corinna Henkel and Prof. Peter Hoffmann. Copyright © 2016 Elsevier B.V. All rights reserved.

  14. De novo assembly, functional annotation, and marker development of Asian pear (Pyrus pyrifolia) fruit transcriptome through massively parallel sequencing.

    PubMed

    Li, J F; Gao, Z; Lou, Y S; Luo, M; Song, S R; Xu, W P; Wang, S P; Zhang, C X

    2015-12-28

    This study investigated the Asian pear transcriptome using an RNA-Seq normalized fruit cDNA library to create a transcriptomic resource for unigene and marker discovery. Following the removal of low-quality reads, 127,085,054 trimmed reads were assembled de novo to yield 37,649 non-redundant unigenes with an average length of 599 bp. Alternative splicing events were detected in 4121 contigs. A total of 30,560 single nucleotide polymorphisms (SNPs) and 7443 simple sequence repeat (SSR) markers were obtained. Approximately 21,449 (56.9%) unigenes were categorized into three gene ontology groups; 3682 (9.8%) were classified into 25 clusters of orthologous groups; and 10,451 (27.8%) were assigned to six Kyoto Encyclopedia of Genes and Genomes pathways. Differentially expressed genes were investigated using the reads per kilobase of exon model per million reads methodology. A total of 546 unigenes showed significant differences in expression levels at different fruit developmental stages. Gene ontology categories associated with various aspects, including carbohydrate metabolic processes, transmembrane transport, and signal transduction, were enriched with genes with divergent expression. These Pyrus pyrifolia transcriptome data provide a rich resource for the discovery and identification of new genes. Furthermore, the numerous putative SSRs and SNPs detected in this study will be important resources for the future development of a linkage map or of marker-assisted breeding programs for the Asian pear.
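
    The expression measure named in the abstract above, reads per kilobase of exon model per million reads (RPKM), can be illustrated with a short calculation. The following is a minimal sketch of the standard RPKM formula only; the function name and the example numbers are hypothetical and are not taken from the study.

```python
def rpkm(read_count: int, length_bp: int, total_mapped_reads: int) -> float:
    """Reads per kilobase of exon model per million mapped reads.

    RPKM = read_count * 1e9 / (length_bp * total_mapped_reads),
    i.e. the read count divided by the length in kilobases and by the
    library size in millions of mapped reads.
    """
    if length_bp <= 0 or total_mapped_reads <= 0:
        raise ValueError("length and library size must be positive")
    return read_count * 1e9 / (length_bp * total_mapped_reads)


# Hypothetical example: 500 reads on a 2,000 bp unigene in a library
# of 10 million mapped reads.
example = rpkm(500, 2_000, 10_000_000)
print(example)
```

    Because RPKM divides by both transcript length and library size, it allows rough comparison of expression between unigenes of different lengths and between libraries of different depths.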

  15. De novo transcriptome sequencing and gene expression profiling of spinach (Spinacia oleracea L.) leaves under heat stress

    PubMed Central

    Yan, Jun; Yu, Li; Xuan, Jiping; Lu, Ying; Lu, Shijun; Zhu, Weimin

    2016-01-01

    Spinach (Spinacia oleracea) is cold tolerant but heat sensitive. The spinach variety ‘Island’ is suitable for summer periods. There is a lack of molecular information on the response of spinach to heat stress. In this study, high-throughput de novo transcriptome sequencing and gene expression analyses were carried out on leaves of the spinach variety ‘Island’ (grown at 24 °C (control), or exposed to 35 °C for 30 min (S1) or 5 h (S2)). A total of 133,200,898 clean reads were assembled into 59,413 unigenes (average size 1259.55 bp), of which 33,573 matched entries in public databases. The numbers of differentially expressed genes (DEGs) were 986 for control vs S1, 1741 for control vs S2 and 1587 for S1 vs S2. Gene Ontology (GO) and pathway enrichment analysis indicated that a great many heat-responsive genes and other stress-responsive genes were present among these DEGs, suggesting that the heat stress may have induced an extensive abiotic stress effect. Comparative transcriptome analysis found 896 unique genes among the spinach heat-response transcripts. The expression patterns of 13 selected genes were verified by RT-qPCR (quantitative real-time PCR). Our study found a series of candidate genes and pathways that may be related to heat resistance in spinach. PMID:26857466

  16. De novo transcriptome sequencing and gene expression profiling of spinach (Spinacia oleracea L.) leaves under heat stress.

    PubMed

    Yan, Jun; Yu, Li; Xuan, Jiping; Lu, Ying; Lu, Shijun; Zhu, Weimin

    2016-02-09

    Spinach (Spinacia oleracea) is cold tolerant but heat sensitive. The spinach variety 'Island' is suitable for summer periods. There is a lack of molecular information on the response of spinach to heat stress. In this study, high-throughput de novo transcriptome sequencing and gene expression analyses were carried out on leaves of the spinach variety 'Island' (grown at 24 °C (control), or exposed to 35 °C for 30 min (S1) or 5 h (S2)). A total of 133,200,898 clean reads were assembled into 59,413 unigenes (average size 1259.55 bp), of which 33,573 matched entries in public databases. The numbers of differentially expressed genes (DEGs) were 986 for control vs S1, 1741 for control vs S2 and 1587 for S1 vs S2. Gene Ontology (GO) and pathway enrichment analysis indicated that a great many heat-responsive genes and other stress-responsive genes were present among these DEGs, suggesting that the heat stress may have induced an extensive abiotic stress effect. Comparative transcriptome analysis found 896 unique genes among the spinach heat-response transcripts. The expression patterns of 13 selected genes were verified by RT-qPCR (quantitative real-time PCR). Our study found a series of candidate genes and pathways that may be related to heat resistance in spinach.

  17. Whole Exome Sequencing Identifies De Novo Heterozygous CAV1 Mutations Associated with a Novel Neonatal Onset Lipodystrophy Syndrome

    PubMed Central

    Garg, Abhimanyu; Kircher, Martin; del Campo, Miguel; Amato, R. Stephen; Agarwal, Anil K.

    2016-01-01

    Despite remarkable progress in identifying causal genes for many types of genetic lipodystrophies in the last decade, the molecular basis of disease in many patients with extremely rare lipodystrophies and distinctive phenotypes remains unclear. We conducted whole exome sequencing of the parents and probands from six pedigrees with neonatal onset of generalized loss of subcutaneous fat with additional distinctive phenotypic features and report de novo heterozygous null mutations, c.424C>T (p.Q142*) and c.479_480delTT (p.F160*), in CAV1 in a 7-year-old male and a 3-year-old female of European origin, respectively. Both patients had generalized fat loss, thin mottled skin and progeroid features at birth. The male patient had cataracts requiring extraction at age 30 months and the female patient had pulmonary arterial hypertension. Dermal fibroblasts of the female patient revealed negligible CAV1 immunofluorescence staining compared to control, but there were no differences in the number and morphology of caveolae upon electron microscopy examination. Based upon the similarities in the clinical features of these two patients, previous reports of CAV1 mutations in patients with lipodystrophies and pulmonary hypertension, and similar features seen in CAV1 null mice, we conclude that these variants are the most likely cause of one subtype of neonatal onset generalized lipodystrophy syndrome. PMID:25898808

  18. A Cost-Effective Approach to Sequence Hundreds of Complete Mitochondrial Genomes

    PubMed Central

    Oleksiak, Marjorie F.

    2016-01-01

    We present a cost-effective approach to sequence whole mitochondrial genomes for hundreds of individuals. Our approach uses small reaction volumes and unmodified (non-phosphorylated) barcoded adaptors to minimize reagent costs. We demonstrate our approach by sequencing 383 Fundulus sp. mitochondrial genomes (192 F. heteroclitus and 191 F. majalis). Prior to sequencing, we amplified the mitochondrial genomes using 4–5 custom-made, overlapping primer pairs, and sequencing was performed on an Illumina HiSeq 2500 platform. After removing low quality and short sequences, 2.9 million and 2.8 million reads were generated for F. heteroclitus and F. majalis respectively. Individual genomes were assembled for each species by mapping barcoded reads to a reference genome. For F. majalis, the reference genome was built de novo. On average, individual consensus sequences had high coverage: 61-fold for F. heteroclitus and 57-fold for F. majalis. The approach discussed in this paper is optimized for sequencing mitochondrial genomes on an Illumina platform. However, with the proper modifications, this approach could be easily applied to other small genomes and sequencing platforms. PMID:27505419

  19. De novo Transcriptome Analysis of Chinese Citrus Fly, Bactrocera minax (Diptera: Tephritidae), by High-Throughput Illumina Sequencing

    PubMed Central

    Wang, Jia; Xiong, Ke-Cai; Liu, Ying-Hong

    2016-01-01

    The Chinese citrus fly, Bactrocera minax (Enderlein), is one of the most devastating pests of citrus in the temperate areas of Asia. So far, studies involving the molecular biology and physiology of B. minax are still scarce, partly because of the lack of genomic information and the inability to rear this insect in the laboratory. In this study, de novo assembly of a transcriptome was performed using Illumina sequencing technology. A total of 20,928,907 clean reads were obtained and assembled into 33,324 unigenes, with an average length of 908.44 bp. Unigenes were annotated by alignment against the NCBI non-redundant protein (Nr), Swiss-Prot, Clusters of Orthologous Groups (COG), Gene Ontology (GO), and Kyoto Encyclopedia of Genes and Genomes Pathway (KEGG) databases. Genes potentially involved in stress tolerance, including 20 heat shock protein (Hsp) genes, 26 glutathione S-transferase (GST) genes, and 2 ferritin subunit genes, were identified. These genes may play roles in stress tolerance during the diapause stage of B. minax. It has previously been found that 20E application on B. minax pupae could avert diapause, but the underlying mechanisms remain unknown. Thus, genes encoding enzymes in the 20E biosynthesis pathway, including Neverland, Spook, Phantom, Disembodied, Shadow, Shade, and Cyp18a1, and genes encoding 20E receptor proteins, ecdysone receptor (EcR) and ultraspiracle (USP), were identified. The expression patterns of 20E-related genes among developmental stages and between 20E-treated and untreated pupae demonstrated their roles in the diapause program. In addition, 1,909 simple sequence repeats (SSRs) were detected, which will contribute to molecular marker development. The findings in this study greatly improve our genetic understanding of B. minax, and lay the foundation for future studies on this species. PMID:27331903
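
    Simple sequence repeat (SSR) detection, mentioned at the end of the abstract above, amounts to scanning each unigene for short motifs repeated in tandem. The sketch below shows the general idea with a regular expression; it is not the tool or the settings used in the study, and the function name, the minimum-repeat thresholds and the test sequence are all hypothetical.

```python
import re

# Hypothetical thresholds: motif length -> minimum number of tandem copies.
MIN_REPEATS = {2: 6, 3: 5, 4: 5}


def is_primitive(unit: str) -> bool:
    """True if the motif is not itself a repeat of a shorter motif (e.g. ATAT or AAA)."""
    n = len(unit)
    return not any(
        n % d == 0 and unit == unit[:d] * (n // d) for d in range(1, n)
    )


def find_ssrs(seq: str, min_repeats=MIN_REPEATS):
    """Return sorted (start, motif, copies) tuples for tandem repeats in seq."""
    seq = seq.upper()
    hits = []
    for unit_len, copies in min_repeats.items():
        # One captured motif followed by at least (copies - 1) further copies.
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (unit_len, copies - 1))
        for match in pattern.finditer(seq):
            unit = match.group(1)
            if is_primitive(unit):
                hits.append((match.start(), unit, len(match.group(0)) // unit_len))
    return sorted(hits)


# Hypothetical sequence with an (AT)7 and a (CAG)5 repeat.
toy = "GG" + "AT" * 7 + "CCT" + "CAG" * 5 + "T"
print(find_ssrs(toy))
```

    Each reported repeat, together with its flanking sequence, is the kind of locus for which marker primers are typically designed.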

  20. De novo sequencing, assembly and analysis of salivary gland transcriptome of Haemaphysalis flava and identification of sialoprotein genes.

    PubMed

    Xu, Xing-Li; Cheng, Tian-Yin; Yang, Hu; Yan, Fen; Yang, Ya

    2015-06-01

    Saliva plays an important role in feeding and pathogen transmission, and the identification and analysis of tick salivary gland (SG) proteins is considered a hot spot in anti-tick research. Herein, we present the first description of the SG transcriptome of Haemaphysalis flava using next-generation sequencing (NGS). A total of over 143 million high-quality reads were assembled into 54,357 unigenes, of which 20,145 (37.06%) had significant similarities to proteins in the Swiss-Prot database. A total of 13,513 annotated sequences were associated with GO terms. Kyoto Encyclopedia of Genes and Genomes (KEGG) analysis showed that 14,280 unigenes were assigned to 279 KEGG pathways in total. Reads per kb per million reads (RPKM) analysis showed that there were 3035 down-regulated unigenes and 2260 up-regulated unigenes in the engorged ticks (ET) compared with the semi-engorged ticks (SET). Several important genes associated with blood feeding and ingestion as secreted salivary proteins, including cysteine, longipain, 4D8, calreticulin, metalloproteases, serine protease inhibitor, enolase, heat shock protein and AV422 in SG, were identified. The qRT-PCR results confirmed that the expression patterns of these genes (except for the longipain gene) were consistent with the RNA-seq results. This de novo assembly of the SG transcriptome of H. flava not only provides more opportunities for screening and cloning functional genes, but also forms a solid basis for further insight into the changes of salivary proteins during blood-feeding. Copyright © 2015 Elsevier B.V. All rights reserved.

  1. De novo sequencing of RCB-1 to -3: peptide biomarkers from the castor bean plant Ricinus communis.

    PubMed

    Ovenden, Simon P B; Fredriksson, Sten-Ake; Bagas, Christina K; Bergström, Tomas; Thomson, Stuart A; Nilsson, Calle; Bourne, David J

    2009-05-15

    Ricinus communis plants (also known as castor bean plants), whose forebears escaped from suburban gardens or commercial cultivation, grow wild in many countries. In temperate and tropical climates seeds will develop to maturity, and plants may be perennial. In Australia these plants have become widespread and are regarded as noxious weeds in many localities. The seeds of R. communis contain ricin, a protein toxin which can easily be extracted into an aqueous solution. Ricin is toxic by ingestion, inhalation, and injection. The history of terrorist and anarchist interest in the use of seeds from R. communis has driven the development of strategies for determination of cultivar and geographic location of the source of an extract of wild-grown castor bean seed. This forensic information is of considerable interest to law enforcement and intelligence organizations. During forensic studies of both the metabolome and proteome of extracts from eight specimens of six different cultivars of R. communis ("zanzibariensis" collected from Kenya and Tanzania, "gibsonii", "impala", "dehradun", "carmencita", and "sanguineus" collected from Spain and Tanzania), three peptide biomarkers (designated Ricinus communis biomarkers, or RCB) were identified in both the MALDI and electrospray LC-MS spectra. Two of these peptides (RCB-1 and RCB-2) were present in varying amounts in all cultivars, while RCB-3 was present only in the "carmencita" cultivar. The amino acid sequences of RCB-1 to -3 were determined using LC-MS(n) fragmentation and de novo sequencing on both the intact and the carbamidomethyl-modified peptides. The connectivity of the two disulfide bonds that were present in all three RCB was determined using a strategy of partial reduction and differential alkylation using tris-(2-carboxyethyl)phosphine with N-ethylmaleimide to reduce and alkylate the most accessible disulfide bond, followed by reduction and alkylation of the remaining disulfide bond with dithiothreitol and

  2. De novo transcriptome sequencing and analysis of male, pseudo-male and female yellow perch, Perca flavescens

    PubMed Central

    Li, Yan-He; Wang, Han-Ping; Yao, Hong; O’Bryant, Paul; Rapp, Dean; Guo, Liang; Waly, Eman A.

    2017-01-01

    Transcriptome sequencing could facilitate discovery of sex-biased genes, biological pathways and molecular markers, which could help clarify the molecular mechanism of sex determination and sexual dimorphism, and assist with selective breeding in aquaculture. Yellow perch has a unique gonad system and sexual dimorphism and is an alternative model for studying the mechanisms of sex determination, sexual dimorphism and sexual selection. In this study, we performed de novo assembly of the yellow perch gonad and muscle transcriptomes by high-throughput Illumina sequencing. A total of 212,180 contigs were obtained, ranging from 127 to 64,876 bp, with an N50 of 1,066 bp. The assembled RNA-Seq contigs (≥200 bp) were then used for subsequent analyses, including annotation, pathway analysis, and microsatellite discovery. No female- or pseudo-male-biased genes were involved in any pathways, while male-biased genes were involved in 29 pathways, and neuroactive ligand receptor interaction and the enzyme trypsin (enzyme code, EC: 3.4.21.4) were highly involved. Pyruvate kinase (enzyme code, EC: 2.7.1.40), which plays important roles in cell proliferation, was highly expressed in muscles. In addition, a total of 183,939 SNPs, 11,286 InDels and 41,479 microsatellites were identified. This study is the first report on transcriptome information in Percids, and provides rich resources for conducting further studies on understanding the molecular basis of sex determination, sexual dimorphism, and sexual selection in fish, and for population studies and marker-assisted selection in Percids. PMID:28158238
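
    The N50 statistic quoted for the contig set above is the length of the shortest contig among the longest contigs that together cover at least half of the total assembled bases. The following is a minimal sketch of that standard definition; the function name and the toy contig lengths are hypothetical and unrelated to the yellow perch assembly.

```python
def n50(lengths):
    """Return the N50 of a collection of contig lengths (in bp)."""
    if not lengths:
        raise ValueError("at least one contig length is required")
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        # The first contig that pushes the running sum to half of the
        # assembly size defines the N50.
        if running >= half_total:
            return length


# Hypothetical toy assembly: total 54 bp, so half is 27 bp.
# 10 + 9 + 8 = 27, which makes the N50 equal to 8.
toy_contigs = [2, 3, 4, 5, 6, 7, 8, 9, 10]
print(n50(toy_contigs))
```

    A higher N50 indicates that more of the assembly sits in long contigs, which is why it is commonly reported alongside the contig count and length range.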

  3. Motor Sequence Learning and Consolidation in Unilateral De Novo Patients with Parkinson’s Disease

    PubMed Central

    Dan, Xiaojuan; King, Bradley R.; Doyon, Julien; Chan, Piu

    2015-01-01

    Previous research investigating motor sequence learning (MSL) and consolidation in patients with Parkinson’s disease (PD) has predominantly included heterogeneous participant samples with early and advanced disease stages; thus, little is known about the onset of potential behavioral impairments. We employed a multisession MSL paradigm to investigate whether behavioral deficits in learning and consolidation appear immediately after or prior to the detection of clinical symptoms in the tested (left) hand. Specifically, our patient sample was limited to recently diagnosed patients with pure unilateral PD. The left hand symptomatic (LH-S) patients provided an assessment of performance following the onset of clinical symptoms in the tested hand. Conversely, right hand affected (left hand asymptomatic, LH-A) patients served to investigate whether MSL impairments appear before symptoms in the tested hand. LH-S patients demonstrated impaired learning during the initial training session and both LH-S and LH-A patients demonstrated decreased performance compared to controls during the next-day retest. Critically, the impairments in later learning stages in the LH-A patients were evident even before the appearance of traditional clinical symptoms in the tested hand. Results may be explained by the progression of disease-related alterations in relevant corticostriatal networks. PMID:26222151

  4. Motor Sequence Learning and Consolidation in Unilateral De Novo Patients with Parkinson's Disease.

    PubMed

    Dan, Xiaojuan; King, Bradley R; Doyon, Julien; Chan, Piu

    2015-01-01

    Previous research investigating motor sequence learning (MSL) and consolidation in patients with Parkinson's disease (PD) has predominantly included heterogeneous participant samples with early and advanced disease stages; thus, little is known about the onset of potential behavioral impairments. We employed a multisession MSL paradigm to investigate whether behavioral deficits in learning and consolidation appear immediately after or prior to the detection of clinical symptoms in the tested (left) hand. Specifically, our patient sample was limited to recently diagnosed patients with pure unilateral PD. The left hand symptomatic (LH-S) patients provided an assessment of performance following the onset of clinical symptoms in the tested hand. Conversely, right hand affected (left hand asymptomatic, LH-A) patients served to investigate whether MSL impairments appear before symptoms in the tested hand. LH-S patients demonstrated impaired learning during the initial training session and both LH-S and LH-A patients demonstrated decreased performance compared to controls during the next-day retest. Critically, the impairments in later learning stages in the LH-A patients were evident even before the appearance of traditional clinical symptoms in the tested hand. Results may be explained by the progression of disease-related alterations in relevant corticostriatal networks.

  5. Transcriptomic Analysis of Flower Blooming in Jasminum sambac through De Novo RNA Sequencing.

    PubMed

    Li, Yong-Hua; Zhang, Wei; Li, Yong

    2015-06-10

    Flower blooming is a critical and complicated plant developmental process in flowering plants. However, insufficient information is available about the complex network that regulates flower blooming in Jasminum sambac. In this study, we used the RNA-Seq platform to analyze the molecular regulation of flower blooming in J. sambac by comparing the transcript profiles at two flower developmental stages: budding and blooming. A total of 4577 differentially-expressed genes (DEGs) were identified between the two floral stages. The Gene Ontology and the Kyoto Encyclopedia of Genes and Genomes pathway enrichment analyses revealed that the DEGs in the "oxidation-reduction process", "extracellular region", "steroid biosynthesis", "glycosphingolipid biosynthesis", "plant hormone signal transduction" and "pentose and glucuronate interconversions" might be associated with flower development. A total of 103 and 92 unigenes exhibited sequence similarities to the known flower development and floral scent genes from other plants. Among these unigenes, five flower development and 19 floral scent unigenes exhibited at least four-fold differences in expression between the two stages. Our results provide abundant genetic resources for studying the flower blooming mechanisms and molecular breeding of J. sambac.
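
    The "at least four-fold" criterion mentioned above can be illustrated with a minimal sketch. It assumes a hypothetical table of normalized expression values per unigene at the two stages; the function name, the pseudocount, and the example values are illustrative assumptions and are not taken from the study.

```python
import math


def fold_change_filter(expression, min_fold=4.0, pseudocount=1.0):
    """Keep genes whose expression differs by at least min_fold between stages.

    expression maps a gene id to (budding, blooming) values. The values used
    below are made up; the pseudocount avoids division by zero and is an
    assumption of this sketch, not a detail from the abstract.
    """
    threshold = math.log2(min_fold)
    selected = {}
    for gene, (budding, blooming) in expression.items():
        log2fc = math.log2((blooming + pseudocount) / (budding + pseudocount))
        if abs(log2fc) >= threshold:
            selected[gene] = log2fc
    return selected


# Hypothetical toy values, for illustration only.
example = {
    "unigene_a": (10.0, 90.0),   # strongly up at blooming
    "unigene_b": (50.0, 60.0),   # nearly unchanged
    "unigene_c": (80.0, 5.0),    # strongly down at blooming
}
hits = fold_change_filter(example)
```

    Working on the log2 scale treats up- and down-regulation symmetrically, so one threshold covers both directions.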

  6. De novo transcriptome sequencing of Momordica cochinchinensis to identify genes involved in the carotenoid biosynthesis.

    PubMed

    Hyun, Tae Kyung; Rim, Yeonggil; Jang, Hui-Jeong; Kim, Cheol Hong; Park, Jongsun; Kumar, Ritesh; Lee, Sunghoon; Kim, Byung Chul; Bhak, Jong; Nguyen-Quoc, Binh; Kim, Seon-Won; Lee, Sang Yeol; Kim, Jae-Yean

    2012-07-01

    The ripe fruit of Momordica cochinchinensis Spreng, known as gac, is characterized by very high carotenoid content. Although this plant might be a good resource for carotenoid metabolic engineering, the genes involved in the carotenoid metabolic pathways in gac have so far been unidentified due to a lack of genomic information in the public database. In order to expedite the process of gene discovery, we have undertaken Illumina deep sequencing of mRNA prepared from aril of gac fruit. From 51,446,670 high-quality reads, we obtained 81,404 assembled unigenes with an average length of 388 base pairs. At the protein level, gac aril transcripts showed about 81.5% similarity with cucumber proteomes. In addition, 17,104 unigenes have been assigned to specific metabolic pathways in the Kyoto Encyclopedia of Genes and Genomes, and all known enzymes involved in the terpenoid backbone and carotenoid biosynthetic pathways were also identified in our library. To analyze the relationship between putative carotenoid biosynthesis genes and alteration of carotenoid content during fruit ripening, digital gene expression analysis was performed on three different ripening stages of aril. This study has revealed that putative phytoene synthase, 15-cis-phytoene desaturase, zeta-carotene desaturase, carotenoid isomerase and lycopene epsilon cyclase might be key factors for controlling carotenoid contents during aril ripening. Taken together, this study has also made a large gene database available. This unique information for gac gene discovery would be helpful to facilitate functional studies for improving carotenoid quantities.

  7. De novo Sequencing and Comparative Transcriptomics of Floral Development of the Distylous Species Lithospermum multiflorum

    PubMed Central

    Cohen, James I.

    2016-01-01

    Genes controlling the morphological, micromorphological, and physiological components of the breeding system distyly have been hypothesized, but many of the genes have not been investigated throughout development of the two floral morphs. To this end, the present study is an examination of comparative transcriptomes from three stages of development for the floral organs of the morphs of Lithospermum multiflorum. Transcriptomes of flowers of the two morphs, from various stages of development, were sequenced using an Illumina HiSeq 2000. The floral transcriptome of L. multiflorum was assembled, and differential gene expression (DE) was identified between morphs, throughout development. Additionally, Gene Ontology (GO) terms for DE genes were determined. Fewer genes were DE early in development compared to later in development, with more genes highly expressed in the gynoecium of the SS morph and the corolla and androecium of the LS morph. A reciprocal pattern was observed later in development, and many more genes were DE during this latter stage. During early development, DE genes appear to be involved in growth and floral development, and during later development, DE genes seem to affect physiological functions. Interestingly, many genes involved in response to stress were identified as DE between morphs. PMID:28066486

  8. Sequencing, De Novo Assembly and Annotation of the Colorado Potato Beetle, Leptinotarsa decemlineata, Transcriptome

    PubMed Central

    Kumar, Abhishek; Congiu, Leonardo; Lindström, Leena; Piiroinen, Saija; Vidotto, Michele; Grapputo, Alessandro

    2014-01-01

    Background The Colorado potato beetle (Leptinotarsa decemlineata) is a major pest and a serious threat to potato cultivation throughout the northern hemisphere. Despite its high importance for invasion biology, phenology and pest management, little is known about L. decemlineata from a genomic perspective. We subjected European L. decemlineata adult and larval transcriptome samples to 454-FLX massively-parallel DNA sequencing to characterize a basal set of genes from this species. We created a combined assembly of the adult and larval datasets including the publicly available midgut larval Roche 454 reads and provided basic annotation. We were particularly interested in diapause-specific genes and genes involved in pesticide and Bacillus thuringiensis (Bt) resistance. Results Using 454-FLX pyrosequencing, we obtained a total of 898,048 reads which, together with the publicly available 804,056 midgut larval reads, were assembled into 121,912 contigs. We established a repository of genes of interest, with 101 out of the 108 diapause-specific genes described in Drosophila montana; and 621 contigs involved in insecticide resistance, including 221 CYP450, 45 GSTs, 13 catalases, 15 superoxide dismutases, 22 glutathione peroxidases, 194 esterases, 3 ADAM metalloproteases, 10 cadherins and 98 calmodulins. We found 460 putative miRNAs and we predicted a significant number of single nucleotide polymorphisms (29,205) and microsatellite loci (17,284). Conclusions This report of the assembly and annotation of the transcriptome of L. decemlineata offers new insights into diapause-associated and insecticide-resistance-associated genes in this species and provides a foundation for comparative studies with other species of insects. The data will also open new avenues for researchers using L. decemlineata as a model species, and for pest management research. Our results provide the basis for performing future gene expression and functional analysis in L. decemlineata and improve our

  9. Pollen of common ragweed (Ambrosia artemisiifolia L.): Illumina-based de novo sequencing and differential transcript expression upon elevated NO2/O3.

    PubMed

    Zhao, Feng; Durner, Jörg; Winkler, J Barbro; Traidl-Hoffmann, Claudia; Strom, Tim-Matthias; Ernst, Dieter; Frank, Ulrike

    2017-05-01

    Common ragweed (Ambrosia artemisiifolia L.) is a highly allergenic annual ruderal plant and native to Northern America, but now also spreading across Europe. Air pollution and climate change will not only affect plant growth, pollen production and duration of the whole pollen season, but also the amount of allergenic encoding transcripts and proteins of the pollen. The objective of this study was to get a better understanding of transcriptional changes in ragweed pollen upon NO2 and O3 fumigation. This will also contribute to a systems biology approach to understand the reaction of the allergenic pollen to air pollution and climate change. Ragweed plants were grown in climate chambers under controlled conditions and fumigated with enhanced levels of NO2 and O3. Illumina sequencing and de novo assembly revealed significant differentially expressed transcripts, belonging to different gene ontology (GO) terms that were grouped into biological process and molecular function. Transcript levels of the known Amb a ragweed encoding allergens were clearly up-regulated under elevated NO2, whereas the amount of allergen encoding transcripts was more variable under elevated O3 conditions. Moreover transcripts encoding allergen known from other plants could be identified. The transcriptional changes in ragweed pollen upon elevated NO2 fumigation indicates that air pollution will alter the transcriptome of the pollen. The changed levels of allergenic encoding transcripts may have an influence on the total allergenic potential of ragweed pollen. Copyright © 2017 Elsevier Ltd. All rights reserved.

  10. BAL31-NGS approach for identification of telomeres de novo in large genomes.

    PubMed

    Peška, Vratislav; Sitová, Zdeňka; Fajkus, Petr; Fajkus, Jiří

    2017-02-01

    This article describes a novel method to identify as yet undiscovered telomere sequences, which combines next generation sequencing (NGS) with BAL31 digestion of high molecular weight DNA. The method was applied to two groups of plants: i) dicots, genus Cestrum, and ii) monocots, Allium species (e.g. A. ursinum and A. cepa). Both groups consist of species with large genomes (tens of Gb) and a low number of chromosomes (2n=14-16), full of repeat elements. Both genera lack typical telomeric repeats and multiple studies have attempted to characterize alternative telomeric sequences. However, despite interesting hypotheses and suggestions of alternative candidate telomeres (retrotransposons, rDNA, satellite repeats) these studies have not resolved the question. In a novel approach based on the two most general features of eukaryotic telomeres, their repetitive character and sensitivity to BAL31 nuclease digestion, we have taken advantage of the capacity and current affordability of NGS in combination with the robustness of classical BAL31 nuclease digestion of chromosomal termini. While representative samples of most repeat elements were ensured by low-coverage (less than 5%) genomic shot-gun NGS, candidate telomeres were identified as under-represented sequences in BAL31-treated samples.
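
    The final step described above, finding sequences that are under-represented in BAL31-treated samples, can be sketched as a k-mer frequency comparison. This is an illustrative stand-in, not the authors' pipeline: the function names, the ratio threshold, the minimum count, and the toy reads (including the repeat motif) are all assumptions.

```python
from collections import Counter


def kmer_counts(reads, k):
    """Count every k-mer across a list of read strings."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def underrepresented_kmers(control_reads, treated_reads, k=6,
                           max_ratio=0.5, min_control=2):
    """Return k-mers whose relative frequency drops after nuclease treatment.

    The ratio is (treated frequency) / (control frequency); k-mers seen fewer
    than min_control times in the control are ignored as too noisy.
    """
    control = kmer_counts(control_reads, k)
    treated = kmer_counts(treated_reads, k)
    control_total = sum(control.values()) or 1
    treated_total = sum(treated.values()) or 1
    candidates = {}
    for kmer, n in control.items():
        if n < min_control:
            continue
        ratio = (treated.get(kmer, 0) / treated_total) / (n / control_total)
        if ratio <= max_ratio:
            candidates[kmer] = ratio
    return candidates


# Toy reads: the second control read carries a made-up terminal repeat that
# is absent from the treated sample.
control_reads = ["ACGTACGTACGT", "TTAGGGTTAGGGTTAGGG"]
treated_reads = ["ACGTACGTACGT"]
candidates = underrepresented_kmers(control_reads, treated_reads)
```

    Normalizing by the total k-mer count in each sample keeps differences in sequencing depth from being mistaken for depletion.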

  11. Identifying driver mutations in sequenced cancer genomes: computational approaches to enable precision medicine

    PubMed Central

    2014-01-01

    High-throughput DNA sequencing is revolutionizing the study of cancer and enabling the measurement of the somatic mutations that drive cancer development. However, the resulting sequencing datasets are large and complex, obscuring the clinically important mutations in a background of errors, noise, and random mutations. Here, we review computational approaches to identify somatic mutations in cancer genome sequences and to distinguish the driver mutations that are responsible for cancer from random, passenger mutations. First, we describe approaches to detect somatic mutations from high-throughput DNA sequencing data, particularly for tumor samples that comprise heterogeneous populations of cells. Next, we review computational approaches that aim to predict driver mutations according to their frequency of occurrence in a cohort of samples, or according to their predicted functional impact on protein sequence or structure. Finally, we review techniques to identify recurrent combinations of somatic mutations, including approaches that examine mutations in known pathways or protein-interaction networks, as well as de novo approaches that identify combinations of mutations according to statistical patterns of mutual exclusivity. These techniques, coupled with advances in high-throughput DNA sequencing, are enabling precision medicine approaches to the diagnosis and treatment of cancer. PMID:24479672
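
    The idea of scoring mutual exclusivity between mutations can be illustrated with a generic one-sided hypergeometric test on the number of samples that carry both mutations. This is a textbook stand-in chosen for brevity, not any specific method from the review; the function name and the toy cohort are assumptions.

```python
from math import comb


def exclusivity_pvalue(samples_a, samples_b, n_samples):
    """Probability of seeing this few (or fewer) co-mutated samples by chance.

    samples_a and samples_b are sets of sample ids carrying mutation A and B.
    The null model assumes the two mutations occur independently across
    n_samples samples (hypergeometric distribution, lower tail).
    """
    a, b = len(samples_a), len(samples_b)
    both = len(samples_a & samples_b)
    total = comb(n_samples, b)
    tail = sum(comb(a, k) * comb(n_samples - a, b - k)
               for k in range(both + 1))
    return both, tail / total


# Hypothetical cohort of 10 samples with two perfectly exclusive mutations.
both, p_value = exclusivity_pvalue({0, 1, 2, 3, 4}, {5, 6, 7, 8, 9}, 10)
```

    A small lower-tail probability indicates less overlap than independence would predict, which is the statistical pattern the exclusivity-based approaches look for.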

  12. Analysis of de novo sequencing and transcriptome assembly and lignocellulolytic enzymes gene expression of Coriolopsis gallica HTC.

    PubMed

    Chen, Yuehong; Cao, Qinghua; Tao, Xiang; Shao, Huanhuan; Zhang, Kun; Zhang, Yizheng; Tan, Xuemei

    2017-03-01

    White-rot basidiomycete Coriolopsis gallica HTC is one of the main biodegraders of poplar. In our previous study, we have shown the strong capacity of C. gallica HTC to degrade lignocellulose. In this study, equal amounts of total RNA from C. gallica HTC cultures grown in different conditions were pooled together. Illumina paired-end RNA sequencing was performed, and 13.2 million 90-bp paired-end reads were generated. We chose the Merged Assembly of Oases data-set for the following blast searches and gene ontology analyses. The reads were assembled de novo into 28,034 transcripts (≥ 100 bp) using combined assembly strategy MAO. The transcripts were annotated using Blast2GO. In all, 18,810 transcripts (≥ 100 bp) achieved BLASTX hits, of which 7048 transcripts had GO terms and 2074 had ECs. The expression levels of 11 lignocellulolytic enzyme genes from the assembled C. gallica HTC transcriptome were detected by real-time quantitative polymerase chain reaction. The results showed that expression levels of these genes were affected by carbon source and nitrogen source at the level of transcription. The current abundant transcriptome data allowed the identification of many new transcripts in C. gallica HTC. Data provided here represent the most comprehensive and integrated genomic resources for cloning and identifying genes of interest from C. gallica HTC. Characterization of C. gallica HTC transcriptome provides an effective tool to understand mechanisms underlying cellular and molecular functions of C. gallica HTC.

  13. De novo transcriptome sequencing in Bixa orellana to identify genes involved in methylerythritol phosphate, carotenoid and bixin biosynthesis

    DOE PAGES

    Cárdenas-Conejo, Yair; Carballo-Uicab, Víctor; Lieberman, Meric; ...

    2015-10-28

    Bixin or annatto is a commercially important natural orange-red pigment derived from lycopene that is produced and stored in seeds of Bixa orellana L. An enzymatic pathway for bixin biosynthesis was inferred from homology of putative proteins encoded by differentially expressed seed cDNAs. Some activities were later validated in a heterologous system. Nevertheless, much of the pathway remains to be clarified. For example, it is essential to identify the methylerythritol phosphate (MEP) and carotenoid pathways genes. In order to investigate the MEP, carotenoid, and bixin pathways genes, total RNA from young leaves and two different developmental stages of seeds from B. orellana were used for the construction of indexed mRNA libraries, sequenced on the Illumina HiSeq 2500 platform and assembled de novo using Velvet, CLC Genomics Workbench and CAP3 software. A total of 52,549 contigs were obtained with average length of 1,924 bp. Two phylogenetic analyses of inferred proteins, in one case encoded by thirteen general, single-copy cDNAs, in the other from carotenoid and MEP cDNAs, indicated that B. orellana is closely related to sister Malvales species cacao and cotton. Using homology, we identified 7 and 14 core gene products from the MEP and carotenoid pathways, respectively. Surprisingly, previously defined bixin pathway cDNAs were not present in our transcriptome. Here we propose a new set of gene products involved in bixin pathway. In conclusion, the identification and qRT-PCR quantification of cDNAs involved in annatto production suggest a hypothetical model for bixin biosynthesis that involve coordinated activation of some MEP, carotenoid and bixin pathway genes. These findings provide a better understanding of the mechanisms regulating these pathways and will facilitate the genetic improvement of B. orellana.

  14. High-throughput sequencing and de novo transcriptome assembly of Swertia japonica to identify genes involved in the biosynthesis of therapeutic metabolites.

    PubMed

    Rai, Amit; Nakamura, Michimi; Takahashi, Hiroki; Suzuki, Hideyuki; Saito, Kazuki; Yamazaki, Mami

    2016-10-01

    Here, we report potential transcripts involved in the biosynthesis of therapeutic metabolites in Swertia japonica, the first report of transcriptome assembly, and characterization of the medicinal plant from Swertia genus. Swertia genus, representing over 170 plant species including herbs such as S. chirata, S. hookeri, S. longifolia, S. japonica, among others, have been used as the traditional medicine in China, India, Korea, and Japan for thousands of years. Due to the lack of genomic and transcriptomic resources, little is known about the molecular basis involved in the biosynthesis of characteristic key bioactive metabolites. Here, we performed deep-transcriptome sequencing for the aerial tissues and the roots of S. japonica, generating over 2 billion raw reads with an average length of 101 bps. Using a combined approach of three popular assemblers, de novo transcriptome assembly for S. japonica was obtained, yielding 81,729 unigenes having an average length of 884 bps and N50 value of 1452 bps, of which 46,963 unigenes were annotated based on the sequence similarity against NCBI-nr protein database. Annotation of transcriptome assembly resulted in the identification of putative genes encoding all enzymes from the key therapeutic metabolite biosynthesis pathways. Transcript abundance analysis, gene ontology enrichment analysis, and KEGG pathway enrichment analysis revealed metabolic processes being up-regulated in the aerial tissues with respect to the roots of S. japonica. We also identified 37 unigenes as potential candidates involved in the glycosylation of bioactive metabolites. Being the first report of transcriptome assembly and annotation for any of the Swertia species, this study will be a valuable resource for future investigations on the biosynthetic pathways of therapeutic metabolites and their regulations.
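
    The abstract above reports both an average unigene length and an N50 value. As a reminder of how the two statistics differ, here is a minimal sketch of the usual N50 definition (the length L such that sequences of length L or longer contain at least half of all assembled bases). The function names and the toy lengths are illustrative assumptions.

```python
def n50(lengths):
    """Return the N50 of a collection of contig or unigene lengths."""
    if not lengths:
        raise ValueError("lengths must not be empty")
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length


def mean_length(lengths):
    """Arithmetic mean length, for comparison with N50."""
    return sum(lengths) / len(lengths)


# Toy lengths, for illustration only.
toy_lengths = [100, 200, 300, 400, 500]
toy_n50 = n50(toy_lengths)
toy_mean = mean_length(toy_lengths)
```

    Because N50 is weighted by bases rather than by sequence count, it is typically larger than the mean when an assembly contains many short sequences.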

  15. An oligonucleotide hybridization approach to DNA sequencing.

    PubMed

    Khrapko, K R; Lysov YuP; Khorlyn, A A; Shick, V V; Florentiev, V L; Mirzabekov, A D

    1989-10-09

    We have proposed a DNA sequencing method based on hybridization of a DNA fragment to be sequenced with the complete set of fixed-length oligonucleotides (e.g., 4^8 = 65,536 possible 8-mers) immobilized individually as dots of a 2-D matrix [(1989) Dokl. Akad. Nauk SSSR 303, 1508-1511]. It was shown that the list of hybridizing octanucleotides is sufficient for the computer-assisted reconstruction of the structures for 80% of random-sequence fragments up to 200 bases long, based on the analysis of the octanucleotide overlapping. Here a refinement of the method and some experimental data are presented. We have performed hybridizations with oligonucleotides immobilized on a glass plate, and obtained their dissociation curves down to heptanucleotides. Other approaches, e.g., an additional hybridization of short oligonucleotides which continuously extend duplexes formed between the fragment and immobilized oligonucleotides, should considerably increase either the probability of unambiguous reconstruction, or the length of reconstructed sequences, or decrease the size of immobilized oligonucleotides.
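
    The reconstruction idea described above, rebuilding a fragment from the list of oligonucleotides that hybridize to it, can be sketched as a walk along overlapping k-mers. This is a simplified illustration rather than the authors' algorithm: it gives up (returns None) as soon as the overlaps branch, repeat, or fail to use every k-mer. All names and the toy sequences are assumptions.

```python
def kmers_of(sequence, k):
    """Set of k-mers present in a sequence (what an ideal chip would report)."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


def reconstruct_from_kmers(kmers):
    """Rebuild a sequence from its k-mer set when overlaps are unambiguous."""
    kmers = set(kmers)
    if not kmers:
        return None
    by_prefix = {}
    suffixes = set()
    for kmer in kmers:
        by_prefix.setdefault(kmer[:-1], []).append(kmer)
        suffixes.add(kmer[1:])
    # The first k-mer is the only one whose prefix is not another's suffix.
    starts = [kmer for kmer in kmers if kmer[:-1] not in suffixes]
    if len(starts) != 1:
        return None
    current = starts[0]
    sequence = current
    used = {current}
    while True:
        followers = by_prefix.get(current[1:], [])
        if not followers:
            break
        if len(followers) > 1:
            return None  # branching overlap: reconstruction is ambiguous
        current = followers[0]
        if current in used:
            return None  # repeated k-mer: reconstruction is ambiguous
        used.add(current)
        sequence += current[-1]
    return sequence if len(used) == len(kmers) else None


n_octamers = 4 ** 8  # size of the complete 8-mer set mentioned in the abstract
rebuilt = reconstruct_from_kmers(kmers_of("ATGCGTACCTGA", 4))
ambiguous = reconstruct_from_kmers(kmers_of("ATGATGC", 3))
```

    The second call shows how a repeated k-mer makes the overlap walk ambiguous, which is the kind of case that limits unambiguous reconstruction of longer fragments.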

  16. Evaluating Characteristics of De Novo Assembly Software on 454 Transcriptome Data: A Simulation Approach

    PubMed Central

    Mundry, Marvin; Bornberg-Bauer, Erich; Sammeth, Michael; Feulner, Philine G. D.

    2012-01-01

    Background The quantity of transcriptome data is rapidly increasing for non-model organisms. As sequencing technology advances, focus shifts towards solving bioinformatic challenges, of which sequence read assembly is the first task. Recent studies have compared the performance of different software to establish a best practice for transcriptome assembly. Here, we adapted a simulation approach to evaluate specific features of assembly programs on 454 data. The novelty of our study is that the simulation allows us to calculate a model assembly as reference point for comparison. Findings The simulation approach allows us to compare basic metrics of assemblies computed by different software applications (CAP3, MIRA, Newbler, and Oases) to a known optimal solution. We found MIRA and CAP3 are conservative in merging reads. This resulted in comparably high number of short contigs. In contrast, Newbler more readily merged reads into longer contigs, while Oases produced the overall shortest assembly. Due to the simulation approach, reads could be traced back to their correct placement within the transcriptome. Together with mapping reads onto the assembled contigs, we were able to evaluate ambiguity in the assemblies. This analysis further supported the conservative nature of MIRA and CAP3, which resulted in low proportions of chimeric contigs, but high redundancy. Newbler produced less redundancy, but the proportion of chimeric contigs was higher. Conclusion Our evaluation of four assemblers suggested that MIRA and Newbler slightly outperformed the other programs, while showing contrasting characteristics. Oases did not perform very well on the 454 reads. Our evaluation indicated that the software was either conservative (MIRA) or liberal (Newbler) about merging reads into contigs. This suggested that in choosing an assembly program researchers should carefully consider their follow up analysis and consequences of the chosen approach to gain an assembly. PMID:22384018

  17. Evaluating characteristics of de novo assembly software on 454 transcriptome data: a simulation approach.

    PubMed

    Mundry, Marvin; Bornberg-Bauer, Erich; Sammeth, Michael; Feulner, Philine G D

    2012-01-01

    The quantity of transcriptome data is rapidly increasing for non-model organisms. As sequencing technology advances, focus shifts towards solving bioinformatic challenges, of which sequence read assembly is the first task. Recent studies have compared the performance of different software to establish a best practice for transcriptome assembly. Here, we adapted a simulation approach to evaluate specific features of assembly programs on 454 data. The novelty of our study is that the simulation allows us to calculate a model assembly as reference point for comparison. The simulation approach allows us to compare basic metrics of assemblies computed by different software applications (CAP3, MIRA, Newbler, and Oases) to a known optimal solution. We found MIRA and CAP3 are conservative in merging reads. This resulted in comparably high number of short contigs. In contrast, Newbler more readily merged reads into longer contigs, while Oases produced the overall shortest assembly. Due to the simulation approach, reads could be traced back to their correct placement within the transcriptome. Together with mapping reads onto the assembled contigs, we were able to evaluate ambiguity in the assemblies. This analysis further supported the conservative nature of MIRA and CAP3, which resulted in low proportions of chimeric contigs, but high redundancy. Newbler produced less redundancy, but the proportion of chimeric contigs was higher. Our evaluation of four assemblers suggested that MIRA and Newbler slightly outperformed the other programs, while showing contrasting characteristics. Oases did not perform very well on the 454 reads. Our evaluation indicated that the software was either conservative (MIRA) or liberal (Newbler) about merging reads into contigs. This suggested that in choosing an assembly program researchers should carefully consider their follow up analysis and consequences of the chosen approach to gain an assembly.

  18. De novo sequence analysis and intact mass measurements for characterization of phycocyanin subunit isoforms from the blue-green alga Aphanizomenon flos-aquae.

    PubMed

    Rinalducci, Sara; Roepstorff, Peter; Zolla, Lello

    2009-04-01

    In this work, partial characterization of the primary structure of phycocyanin from the cyanobacterium Aphanizomenon flos-aquae (AFA) was achieved by mass spectrometry de novo sequencing with the aid of chemical derivatization. Combining N-terminal sulfonation of tryptic peptides by 4-sulfophenyl isothiocyanate (SPITC) and MALDI-TOF/TOF analyses, facilitated the acquisition of sequence information for AFA phycocyanin subunits. In fact, SPITC-derivatized peptides underwent facile fragmentation, predominantly resulting in y-series ions in the MS/MS spectra and often exhibiting uninterrupted sequences of 20 or more amino acid residues. This strategy allowed us to carry out peptide fragment fingerprinting and de novo sequencing of several peptides belonging to both alpha- and beta-phycocyanin polypeptides, obtaining a sequence coverage of 67% and 75%, respectively. The presence of different isoforms of phycocyanin subunits was also revealed; subsequently Intact Mass Measurements (IMMs) by both MALDI- and ESI-MS supported the detection of these protein isoforms. Finally, we discuss the evolutionary importance of phycocyanin isoforms in cyanobacteria, suggesting the possible use of the phycocyanin operon for a correct taxonomic identity of this species.
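
    Sequence coverage figures such as those quoted above are the percentage of residues in a protein that fall inside at least one identified peptide. A minimal sketch of that calculation follows; the function name, the toy protein string, and the peptides are made up for illustration and are not phycocyanin sequences.

```python
def sequence_coverage(protein, peptides):
    """Percentage of protein residues covered by at least one peptide."""
    if not protein:
        raise ValueError("protein must not be empty")
    covered = [False] * len(protein)
    for peptide in peptides:
        if not peptide:
            continue
        start = protein.find(peptide)
        while start != -1:
            for i in range(start, start + len(peptide)):
                covered[i] = True
            # Keep searching so repeated occurrences are also marked.
            start = protein.find(peptide, start + 1)
    return 100.0 * sum(covered) / len(protein)


# Hypothetical 20-residue protein and two hypothetical peptides.
toy_protein = "MKTAYIAKQRQISFVKSHFS"
toy_peptides = ["MKTAYIAK", "QISFVK"]
coverage = sequence_coverage(toy_protein, toy_peptides)
```

    Tracking coverage per residue rather than summing peptide lengths avoids double counting when peptides overlap.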

  19. Factors determining the performance of triple quadrupole, quadrupole ion trap and sector field mass spectrometer in electrospray ionization mass spectrometry. 2. Suitability for de novo sequencing.

    PubMed

    Premstaller, A; Huber, C G

    2001-01-01

    The sequence coverage by fragment ions resulting from collision-induced dissociation in a triple stage quadrupole (TSQ) and a quadrupole ion trap (QIT) mass spectrometer of 10-20-mer oligonucleotides was investigated. While (a-B) and w ion series were the most abundant on both instruments, additional ion series of sequence relevance were preferably formed in the TSQ. Thus, a total number of 83 fragment ions were used to deduce the complete sequence of a 10-mer oligonucleotide of mixed sequence from a tandem mass spectrum recorded on the TSQ. The complete sequence was also encoded in the 28 fragments that were obtained from the QIT under comparable fragmentation conditions. Spectrum complexity increased considerably at the cost of signal-to-noise ratio upon fragmentation of a 20-mer oligonucleotide in the TSQ, whereas spectrum interpretation with longer oligonucleotides was significantly more straightforward in spectra recorded on the QIT. The extent of fragmentation had to be optimized by appropriate setting of collision energy and choice of precursor ion charge state in order to obtain full sequence coverage by fragments for de novo sequencing. Moreover, full sequence information was also dependent on base sequence because of the low tendency of backbone cleavage at thymidines. Tandem mass spectrometry on the QIT yielded redundant information that was successfully utilized to deduce the complete sequence of 20-mer oligonucleotides with high confidence. Copyright 2001 John Wiley & Sons, Ltd.

  20. Large Scale Discovery and De Novo-Assisted Sequencing of Cationic Antimicrobial Peptides (CAMPs) by Microparticle Capture and Electron-Transfer Dissociation (ETD) Mass Spectrometry.

    PubMed

    Juba, Melanie L; Russo, Paul S; Devine, Megan; Barksdale, Stephanie; Rodriguez, Carlos; Vliet, Kent A; Schnur, Joel M; van Hoek, Monique L; Bishop, Barney M

    2015-10-02

    The identification and sequencing of novel cationic antimicrobial peptides (CAMPs) have proven challenging due to the limitations associated with traditional proteomics methods and difficulties sequencing peptides present in complex biomolecular mixtures. We present here a process for large-scale identification and de novo-assisted sequencing of newly discovered CAMPs using microparticle capture followed by tandem mass spectrometry equipped with electron-transfer dissociation (ETD). This process was initially evaluated and verified using known CAMPs with varying physicochemical properties. The effective parameters were then applied in the analysis of a complex mixture of peptides harvested from American alligator plasma using custom-made (Bioprospector) functionalized hydrogel particles. Here, we report the successful sequencing process for CAMPs that has led to the identification of 340 unique peptides and the discovery of five novel CAMPs from American alligator plasma.

  1. Frequency and Complexity of De Novo Structural Mutation in Autism.

    PubMed

    Brandler, William M; Antaki, Danny; Gujral, Madhusudan; Noor, Amina; Rosanio, Gabriel; Chapman, Timothy R; Barrera, Daniel J; Lin, Guan Ning; Malhotra, Dheeraj; Watts, Amanda C; Wong, Lawrence C; Estabillo, Jasper A; Gadomski, Therese E; Hong, Oanh; Fajardo, Karin V Fuentes; Bhandari, Abhishek; Owen, Renius; Baughn, Michael; Yuan, Jeffrey; Solomon, Terry; Moyzis, Alexandra G; Maile, Michelle S; Sanders, Stephan J; Reiner, Gail E; Vaux, Keith K; Strom, Charles M; Zhang, Kang; Muotri, Alysson R; Akshoomoff, Natacha; Leal, Suzanne M; Pierce, Karen; Courchesne, Eric; Iakoucheva, Lilia M; Corsello, Christina; Sebat, Jonathan

    2016-04-07

    Genetic studies of autism spectrum disorder (ASD) have established that de novo duplications and deletions contribute to risk. However, ascertainment of structural variants (SVs) has been restricted by the coarse resolution of current approaches. By applying a custom pipeline for SV discovery, genotyping, and de novo assembly to genome sequencing of 235 subjects (71 affected individuals, 26 healthy siblings, and their parents), we compiled an atlas of 29,719 SV loci (5,213/genome), comprising 11 different classes. We found a high diversity of de novo mutations, the majority of which were undetectable by previous methods. In addition, we observed complex mutation clusters where combinations of de novo SVs, nucleotide substitutions, and indels occurred as a single event. We estimate a high rate of structural mutation in humans (20%) and propose that genetic risk for ASD is attributable to an elevated frequency of gene-disrupting de novo SVs, but not an elevated rate of genome rearrangement.

  2. Frequency and Complexity of De Novo Structural Mutation in Autism

    PubMed Central

    Brandler, William M.; Antaki, Danny; Gujral, Madhusudan; Noor, Amina; Rosanio, Gabriel; Chapman, Timothy R.; Barrera, Daniel J.; Lin, Guan Ning; Malhotra, Dheeraj; Watts, Amanda C.; Wong, Lawrence C.; Estabillo, Jasper A.; Gadomski, Therese E.; Hong, Oanh; Fajardo, Karin V. Fuentes; Bhandari, Abhishek; Owen, Renius; Baughn, Michael; Yuan, Jeffrey; Solomon, Terry; Moyzis, Alexandra G.; Maile, Michelle S.; Sanders, Stephan J.; Reiner, Gail E.; Vaux, Keith K.; Strom, Charles M.; Zhang, Kang; Muotri, Alysson R.; Akshoomoff, Natacha; Leal, Suzanne M.; Pierce, Karen; Courchesne, Eric; Iakoucheva, Lilia M.; Corsello, Christina; Sebat, Jonathan

    2016-01-01

    Genetic studies of autism spectrum disorder (ASD) have established that de novo duplications and deletions contribute to risk. However, ascertainment of structural variants (SVs) has been restricted by the coarse resolution of current approaches. By applying a custom pipeline for SV discovery, genotyping, and de novo assembly to genome sequencing of 235 subjects (71 affected individuals, 26 healthy siblings, and their parents), we compiled an atlas of 29,719 SV loci (5,213/genome), comprising 11 different classes. We found a high diversity of de novo mutations, the majority of which were undetectable by previous methods. In addition, we observed complex mutation clusters where combinations of de novo SVs, nucleotide substitutions, and indels occurred as a single event. We estimate a high rate of structural mutation in humans (20%) and propose that genetic risk for ASD is attributable to an elevated frequency of gene-disrupting de novo SVs, but not an elevated rate of genome rearrangement. PMID:27018473

  3. Complete telomere-to-telomere de novo assembly of the Plasmodium falciparum genome through long-read (>11 kb), single molecule, real-time sequencing

    PubMed Central

    Vembar, Shruthi Sridhar; Seetin, Matthew; Lambert, Christine; Nattestad, Maria; Schatz, Michael C.; Baybayan, Primo; Scherf, Artur; Smith, Melissa Laird

    2016-01-01

    The application of next-generation sequencing to estimate genetic diversity of Plasmodium falciparum, the most lethal malaria parasite, has proved challenging due to the skewed AT-richness [∼80.6% (A + T)] of its genome and the lack of technology to assemble highly polymorphic subtelomeric regions that contain clonally variant, multigene virulence families (Ex: var and rifin). To address this, we performed amplification-free, single molecule, real-time sequencing of P. falciparum genomic DNA and generated reads of average length 12 kb, with 50% of the reads between 15.5 and 50 kb in length. Next, using the Hierarchical Genome Assembly Process, we assembled the P. falciparum genome de novo and successfully compiled all 14 nuclear chromosomes telomere-to-telomere. We also accurately resolved centromeres [∼90–99% (A + T)] and subtelomeric regions and identified large insertions and duplications that add extra var and rifin genes to the genome, along with smaller structural variants such as homopolymer tract expansions. Overall, we show that amplification-free, long-read sequencing combined with de novo assembly overcomes major challenges inherent to studying the P. falciparum genome. Indeed, this technology may not only identify the polymorphic and repetitive subtelomeric sequences of parasite populations from endemic areas but may also evaluate structural variation linked to virulence, drug resistance and disease transmission. PMID:27345719

  4. Complete telomere-to-telomere de novo assembly of the Plasmodium falciparum genome through long-read (>11 kb), single molecule, real-time sequencing.

    PubMed

    Vembar, Shruthi Sridhar; Seetin, Matthew; Lambert, Christine; Nattestad, Maria; Schatz, Michael C; Baybayan, Primo; Scherf, Artur; Smith, Melissa Laird

    2016-08-01

    The application of next-generation sequencing to estimate genetic diversity of Plasmodium falciparum, the most lethal malaria parasite, has proved challenging due to the skewed AT-richness [∼80.6% (A + T)] of its genome and the lack of technology to assemble highly polymorphic subtelomeric regions that contain clonally variant, multigene virulence families (Ex: var and rifin). To address this, we performed amplification-free, single molecule, real-time sequencing of P. falciparum genomic DNA and generated reads of average length 12 kb, with 50% of the reads between 15.5 and 50 kb in length. Next, using the Hierarchical Genome Assembly Process, we assembled the P. falciparum genome de novo and successfully compiled all 14 nuclear chromosomes telomere-to-telomere. We also accurately resolved centromeres [∼90-99% (A + T)] and subtelomeric regions and identified large insertions and duplications that add extra var and rifin genes to the genome, along with smaller structural variants such as homopolymer tract expansions. Overall, we show that amplification-free, long-read sequencing combined with de novo assembly overcomes major challenges inherent to studying the P. falciparum genome. Indeed, this technology may not only identify the polymorphic and repetitive subtelomeric sequences of parasite populations from endemic areas but may also evaluate structural variation linked to virulence, drug resistance and disease transmission.

  5. Genetic variation and the de novo assembly of human genomes

    PubMed Central

    Chaisson, Mark J. P.; Wilson, Richard K.; Eichler, Evan E.

    2016-01-01

    The discovery of genetic variation and the assembly of genome sequences are both inextricably linked to advances in DNA-sequencing technology. Short-read massively parallel sequencing has revolutionized our ability to discover genetic variation but is insufficient to generate high-quality genome assemblies or resolve most structural variation. Full resolution of variation is only guaranteed by complete de novo assembly of a genome. Here, we review approaches to genome assembly, the nature of gaps or missing sequences, and biases in the assembly process. We describe the challenges of generating a complete de novo genome assembly using current technologies and the impact that being able to perfectly sequence the genome would have on understanding human disease and evolution. Finally, we summarize recent technological advances that improve both contiguity and accuracy and emphasize the importance of complete de novo assembly as opposed to read mapping as the primary means to understanding the full range of human genetic variation. PMID:26442640

  6. De novo Transcriptome Sequencing and Development of Abscission Zone-Specific Microarray as a New Molecular Tool for Analysis of Tomato Organ Abscission

    PubMed Central

    Sundaresan, Srivignesh; Philosoph-Hadas, Sonia; Riov, Joseph; Mugasimangalam, Raja; Kuravadi, Nagesh A.; Kochanek, Bettina; Salim, Shoshana; Tucker, Mark L.; Meir, Shimon

    2016-01-01

    Abscission of flower pedicels and leaf petioles of tomato (Solanum lycopersicum) can be induced by flower removal or leaf deblading, respectively, which leads to auxin depletion, resulting in increased sensitivity of the abscission zone (AZ) to ethylene. However, the molecular mechanisms that drive the acquisition of abscission competence and its modulation by auxin gradients are not yet known. We used RNA-Sequencing (RNA-Seq) to obtain a comprehensive transcriptome of tomato flower AZ (FAZ) and leaf AZ (LAZ) during abscission. RNA-Seq was performed on a pool of total RNA extracted from tomato FAZ and LAZ, at different abscission stages, followed by de novo assembly. The assembled clusters contained transcripts that are already known in the Solanaceae (SOL) genomics and NCBI databases, and over 8823 identified novel tomato transcripts of varying sizes. An AZ-specific microarray, encompassing the novel transcripts identified in this study and all known transcripts from the SOL genomics and NCBI databases, was constructed to study the abscission process. Multiple probes for longer genes and key AZ-specific genes, including antisense probes for all transcripts, make this array a unique tool for studying abscission with a comprehensive set of transcripts, and for mining for naturally occurring antisense transcripts. We focused on comparing the global transcriptomes generated from the FAZ and the LAZ to establish the divergences and similarities in their transcriptional networks, and particularly to characterize the processes and transcriptional regulators enriched in gene clusters that are differentially regulated in these two AZs. This study is the first attempt to analyze the global gene expression in different AZs in tomato by combining the RNA-Seq technique with oligonucleotide microarrays. Our AZ-specific microarray chip provides a cost-effective approach for expression profiling and robust analysis of multiple samples in a rapid succession. PMID:26834766

  7. UVliPiD: A UVPD-Based Hierarchical Approach for De Novo Characterization of Lipid A Structures.

    PubMed

    Morrison, Lindsay J; Parker, W Ryan; Holden, Dustin D; Henderson, Jeremy C; Boll, Joseph M; Trent, M Stephen; Brodbelt, Jennifer S

    2016-02-02

    The lipid A domain of the endotoxic lipopolysaccharide layer of Gram-negative bacteria is comprised of a diglucosamine backbone to which a variable number of variable length fatty acyl chains are anchored. Traditional characterization of these tails and their linkages by nuclear magnetic resonance (NMR) or mass spectrometry is time-consuming and necessitates databases of pre-existing structures for structural assignment. Here, we introduce an automated de novo approach for characterization of lipid A structures that is completely database-independent. A hierarchical decision-tree MS(n) method is used in conjunction with a hybrid activation technique, UVPDCID, to acquire characteristic fragmentation patterns of lipid A variants from a number of Gram-negative bacteria. Structural assignments are derived from integration of key features from three to five spectra and automated interpretation is achieved in minutes without the need for pre-existing information or candidate structures. The utility of this strategy is demonstrated for a mixture of lipid A structures from an enzymatically modified E. coli lipid A variant. A total of 27 lipid A structures were discovered, many of which were isomeric, showcasing the need for a rapid de novo approach to lipid A characterization.

  8. Novel variation and de novo mutation rates in population-wide de novo assembled Danish trios

    PubMed Central

    Besenbacher, Søren; Liu, Siyang; Izarzugaza, José M. G.; Grove, Jakob; Belling, Kirstine; Bork-Jensen, Jette; Huang, Shujia; Als, Thomas D.; Li, Shengting; Yadav, Rachita; Rubio-García, Arcadio; Lescai, Francesco; Demontis, Ditte; Rao, Junhua; Ye, Weijian; Mailund, Thomas; Friborg, Rune M.; Pedersen, Christian N. S.; Xu, Ruiqi; Sun, Jihua; Liu, Hao; Wang, Ou; Cheng, Xiaofang; Flores, David; Rydza, Emil; Rapacki, Kristoffer; Damm Sørensen, John; Chmura, Piotr; Westergaard, David; Dworzynski, Piotr; Sørensen, Thorkild I. A.; Lund, Ole; Hansen, Torben; Xu, Xun; Li, Ning; Bolund, Lars; Pedersen, Oluf; Eiberg, Hans; Krogh, Anders; Børglum, Anders D.; Brunak, Søren; Kristiansen, Karsten; Schierup, Mikkel H.; Wang, Jun; Gupta, Ramneek; Villesen, Palle; Rasmussen, Simon

    2015-01-01

    Building a population-specific catalogue of single nucleotide variants (SNVs), indels and structural variants (SVs) with frequencies, termed a national pan-genome, is critical for further advancing clinical and public health genetics in large cohorts. Here we report a Danish pan-genome obtained from sequencing 10 trios to high depth (50×). We report 536k novel SNVs and 283k novel short indels from mapping approaches and develop a population-wide de novo assembly approach to identify 132k novel indels larger than 10 nucleotides with low false discovery rates. We identify a higher proportion of indels and SVs than previous efforts showing the merits of high coverage and de novo assembly approaches. In addition, we use trio information to identify de novo mutations and use a probabilistic method to provide direct estimates of 1.27e−8 and 1.5e−9 per nucleotide per generation for SNVs and indels, respectively. PMID:25597990
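
    As a rough worked example of what the per-nucleotide rates above imply, the sketch below converts them into an expected number of de novo events per offspring. The genome size (about 3.0e9 nucleotides per haploid copy) and the assumption that both copies are fully callable are illustrative assumptions, not figures from the abstract.

```python
# Rates reported in the abstract (per nucleotide per generation).
SNV_RATE = 1.27e-8
INDEL_RATE = 1.5e-9

# ASSUMPTION (not from the abstract): ~3.0e9 callable nucleotides per
# haploid genome, and two haploid copies per offspring.
HAPLOID_GENOME_SIZE = 3.0e9
PLOIDY = 2


def expected_de_novo(rate_per_nt, haploid_size=HAPLOID_GENOME_SIZE, ploidy=PLOIDY):
    """Expected de novo events per offspring = rate x nucleotides at risk."""
    return rate_per_nt * haploid_size * ploidy


expected_snvs = expected_de_novo(SNV_RATE)
expected_indels = expected_de_novo(INDEL_RATE)

print(f"expected de novo SNVs per offspring:   {expected_snvs:.1f}")
print(f"expected de novo indels per offspring: {expected_indels:.1f}")
```

    Under these assumptions the estimate scales linearly with the callable genome size, so a smaller callable fraction would lower both numbers proportionally.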

  9. Sequencing of sporadic Attention-Deficit Hyperactivity Disorder (ADHD) identifies novel and potentially pathogenic de novo variants and excludes overlap with genes associated with autism spectrum disorder.

    PubMed

    Kim, Daniel Seung; Burt, Amber A; Ranchalis, Jane E; Wilmot, Beth; Smith, Joshua D; Patterson, Karynne E; Coe, Bradley P; Li, Yatong K; Bamshad, Michael J; Nikolas, Molly; Eichler, Evan E; Swanson, James M; Nigg, Joel T; Nickerson, Deborah A; Jarvik, Gail P

    2017-03-22

    Attention-Deficit Hyperactivity Disorder (ADHD) has high heritability; however, studies of common variation account for <5% of ADHD variance. Using data from affected participants without a family history of ADHD, we sought to identify de novo variants that could account for sporadic ADHD. Considering a total of 128 families, two analyses were conducted in parallel: first, in 11 unaffected parent/affected proband trios (or quads with the addition of an unaffected sibling) we completed exome sequencing. Six de novo missense variants at highly conserved bases were identified and validated from four of the 11 families: the brain-expressed genes TBC1D9, DAGLA, QARS, CSMD2, TRPM2, and WDR83. Separately, in 117 unrelated probands with sporadic ADHD, we sequenced a panel of 26 genes implicated in intellectual disability (ID) and autism spectrum disorder (ASD) to evaluate whether variation in ASD/ID-associated genes was also present in participants with ADHD. Only one putative deleterious variant (Gln600STOP) in CHD1L was identified; this was found in a single proband. Notably, no other nonsense, splice, frameshift, or highly conserved missense variants in the 26 gene panel were identified and validated. These data suggest that de novo variant analysis in families with independently adjudicated sporadic ADHD diagnosis can identify novel genes implicated in ADHD pathogenesis. Moreover, that only one of the 128 cases (0.8%, 11 exome, and 117 MIP sequenced participants) had putative deleterious variants within our data in 26 genes related to ID and ASD suggests significant independence in the genetic pathogenesis of ADHD as compared to ASD and ID phenotypes. © 2017 Wiley Periodicals, Inc.

  10. Computational approaches for de novo design and redesign of metal-binding sites on proteins.

    PubMed

    Akcapinar, Gunseli Bayram; Sezerman, Osman Ugur

    2017-04-28

    Metal ions play pivotal roles in protein structure, function and stability. The functional and structural diversity of proteins in nature expanded with the incorporation of metal ions or clusters in proteins. Approximately one-third of these proteins in the databases contain metal ions. Many biological and chemical processes in nature involve metal ion-binding proteins, aka metalloproteins. Many cellular reactions that underpin life require metalloproteins. Most of the remarkable, complex chemical transformations are catalysed by metalloenzymes. Realization of the importance of metal-binding sites in a variety of cellular events led to the advancement of various computational methods for their prediction and characterization. Furthermore, as structural and functional knowledgebase about metalloproteins is expanding with advances in computational and experimental fields, the focus of the research is now shifting towards de novo design and redesign of metalloproteins to extend nature's own diversity beyond its limits. In this review, we will focus on the computational toolbox for prediction of metal ion-binding sites, de novo metalloprotein design and redesign. We will also give examples of tailor-made artificial metalloproteins designed with the computational toolbox.

  11. A pipeline for the de novo assembly of the Themira biloba (Sepsidae: Diptera) transcriptome using a multiple k-mer length approach.

    PubMed

    Melicher, Dacotah; Torson, Alex S; Dworkin, Ian; Bowsher, Julia H

    2014-03-12

    The Sepsidae family of flies is a model for investigating how sexual selection shapes courtship and sexual dimorphism in a comparative framework. However, like many non-model systems, there are few molecular resources available. Large-scale sequencing and assembly have not been performed in any sepsid, and the lack of a closely related genome makes investigation of gene expression challenging. Our goal was to develop an automated pipeline for de novo transcriptome assembly, and to use that pipeline to assemble and analyze the transcriptome of the sepsid Themira biloba. Our bioinformatics pipeline uses cloud computing services to assemble and analyze the transcriptome with off-site data management, processing, and backup. It uses a multiple k-mer length approach combined with a second meta-assembly to extend transcripts and recover more bases of transcript sequences than standard single k-mer assembly. We used 454 sequencing to generate 1.48 million reads from cDNA generated from embryo, larva, and pupae of T. biloba and assembled a transcriptome consisting of 24,495 contigs. Annotation identified 16,705 transcripts, including those involved in embryogenesis and limb patterning. We assembled transcriptomes from an additional three non-model organisms to demonstrate that our pipeline assembled a higher-quality transcriptome than single k-mer approaches across multiple species. The pipeline we have developed for assembly and analysis increases contig length, recovers unique transcripts, and assembles more base pairs than other methods through the use of a meta-assembly. The T. biloba transcriptome is a critical resource for performing large-scale RNA-Seq investigations of gene expression patterns, and is the first transcriptome sequenced in this Dipteran family.
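
    The multiple k-mer idea in the abstract above can be illustrated with a minimal sketch of its first step only: decomposing the same reads into k-mers at several lengths. The reads, the chosen k values, and the function names are hypothetical; the per-k assembly and the meta-assembly that merges the results are not shown and are not taken from the authors' pipeline.

```python
from collections import Counter


def kmer_counts(reads, k):
    """Count every length-k substring (k-mer) across all reads."""
    counts = Counter()
    for read in reads:
        # A read shorter than k contributes no k-mers (empty range).
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def multi_k_profiles(reads, k_values):
    """Build one k-mer count table per k; each would seed its own assembly."""
    return {k: kmer_counts(reads, k) for k in k_values}


# Toy reads invented for illustration only (each 10 nt long).
reads = ["ATGGCGTACG", "GCGTACGTTA", "TACGTTAGGC"]
profiles = multi_k_profiles(reads, k_values=(3, 5, 7))

for k, counts in profiles.items():
    print(f"k={k}: {len(counts)} distinct k-mers, {sum(counts.values())} total")
```

    Short k-mers tend to connect reads with small overlaps, while long k-mers are more specific in repeats, which is the usual motivation for combining several k values.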

  12. RNA sequencing and de novo assembly of the digestive gland transcriptome in Mytilus galloprovincialis fed with toxinogenic and non-toxic strains of Alexandrium minutum.

    PubMed

    Gerdol, Marco; De Moro, Gianluca; Manfrin, Chiara; Milandri, Anna; Riccardi, Elena; Beran, Alfred; Venier, Paola; Pallavicini, Alberto

    2014-10-14

    The Mediterranean mussel Mytilus galloprovincialis is a marine bivalve of considerable commercial importance as well as a key sentinel organism for the biomonitoring of environmental pollution. Here we report the RNA sequencing of the mussel digestive gland, performed with two aims: a) to produce a high quality de novo transcriptome assembly, thus improving the genetic and molecular knowledge of this organism, and b) to provide an initial assessment of the response to paralytic shellfish poisoning (PSP) on a molecular level, in order to identify possible molecular markers of toxin accumulation. The comprehensive de novo assembly and annotation of the transcriptome yielded a collection of 12,079 non-redundant consensus sequences with an average length of 958 bp, with a high percentage of full-length transcripts. The whole-transcriptome gene expression study indicated that the accumulation of paralytic toxins produced by the dinoflagellate Alexandrium minutum over a time span of 5 days scarcely affected gene expression, but the results need further validation with a greater number of biological samples and naturally contaminated specimens. The digestive gland reference transcriptome we produced significantly improves the data collected from previous sequencing efforts and provides a basic resource for expanding functional genomics investigations in M. galloprovincialis. Although not conclusive, the results of the RNA-seq gene expression analysis support the classification of mussels as bivalves refractory to paralytic shellfish poisoning and point out that the identification of molecular biomarkers of PSP in the digestive gland of this organism is problematic.

  13. Dose-dependent de novo germline mutations detected by whole-exome sequencing in progeny of ENU-treated male gpt delta mice.

    PubMed

    Masumura, Kenichi; Toyoda-Hokaiwado, Naomi; Ukai, Akiko; Gondo, Yoichi; Honma, Masamitsu; Nohmi, Takehiko

    2016-11-01

    Germline mutations are an important component of genetic toxicology; however, mutagenicity tests of germline cells are limited. Recent advances in sequencing technology can be used to detect mutations by direct sequencing of genomic DNA (gDNA). We previously reported induced de novo mutations detected using whole-exome sequencing in the offspring of N-ethyl-N-nitrosourea (ENU)-treated mice in a single-dose experiment (85 mg/kg, i.p., weekly on two occasions). In this study, two lower doses (10 and 30 mg/kg) were added, and dose-response of inherited germline mutations was analyzed. Male gpt delta transgenic mice treated with ENU in three dose groups were mated with untreated females 10 weeks after the last treatment, and offspring were obtained. The ENU-treated male mice showed dose-dependent increases in gpt mutant frequencies in their sperm, testis, and liver. gDNA of one family (parents and four offspring) from each dose group was used for whole-exome sequencing, and unique de novo mutations in the offspring were detected. Frequencies of inherited mutations increased with dosage more than 25-fold in the highest dose group. The mutation spectrum of the inherited mutations showed characteristics of ENU-induced mutations, such as A:T base substitutions. No confirmed mutations were observed in the control group. Filtering using the alternate reads ratio resulted in the mutation frequencies and spectra similar to those obtained by the Sanger sequencing confirmation. These results suggest that direct sequencing analysis may be a useful tool to investigate inherited germline mutations induced by environmental mutagens.
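
    The alternate reads ratio filter mentioned in the abstract above can be sketched as follows. The abstract does not give the thresholds or the data layout, so the `Call` record, the `min_ratio` and `min_depth` values, and the toy calls are all hypothetical and serve only to show the shape of such a filter.

```python
from dataclasses import dataclass


@dataclass
class Call:
    """One candidate de novo variant call (hypothetical layout)."""
    chrom: str
    pos: int
    ref_reads: int  # reads supporting the reference allele
    alt_reads: int  # reads supporting the alternate allele


def alt_ratio(call):
    """Fraction of reads at the site that support the alternate allele."""
    total = call.ref_reads + call.alt_reads
    return call.alt_reads / total if total else 0.0


def filter_calls(calls, min_ratio=0.3, min_depth=10):
    """Keep calls with enough depth and a plausible heterozygous alt ratio.

    ASSUMPTION: both thresholds are illustrative, not values from the study.
    """
    return [
        c for c in calls
        if c.ref_reads + c.alt_reads >= min_depth and alt_ratio(c) >= min_ratio
    ]


# Toy candidates invented for illustration only.
candidates = [
    Call("chr1", 1000, ref_reads=20, alt_reads=18),  # ratio ~0.47, kept
    Call("chr2", 2000, ref_reads=40, alt_reads=3),   # ratio ~0.07, dropped
    Call("chr3", 3000, ref_reads=2, alt_reads=3),    # depth 5, dropped
]
kept = filter_calls(candidates)
print([(c.chrom, c.pos) for c in kept])
```

    A true heterozygous germline variant is expected to show an alternate ratio near 0.5, so low-ratio calls are more likely sequencing or mapping artifacts.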

  14. PRO_LIGAND: An approach to de novo molecular design. 1. Application to the design of organic molecules

    NASA Astrophysics Data System (ADS)

    Clark, David E.; Frenkel, David; Levy, Stephen A.; Li, Jin; Murray, Christopher W.; Robson, Barry; Waszkowycz, Bohdan; Westhead, David R.

    1995-02-01

    An approach to de novo molecular design, PRO_LIGAND, has been developed that, in the environment of a large, integrated molecular design and simulation system, provides a unified framework for the generation of novel molecules which are either similar or complementary to a specified target. The approach is based on a methodology that has proved to be effective in other studies (placing molecular fragments upon target interaction sites) but incorporates many novel features such as the use of a rapid graph-theoretical algorithm for fragment placing, a generalised driver for structure generation which offers a large variety of fragment assembly strategies to the user and the pre-screening of library fragments. After a detailed description of the relevant modules of the package, PRO_LIGAND's efficacy in aiding rational drug design is demonstrated by its ability to design mimics of methotrexate and potential inhibitors for dihydrofolate reductase and HIV-1 protease.

  15. New genes from non-coding sequence: the role of de novo protein-coding genes in eukaryotic evolutionary innovation

    PubMed Central

    McLysaght, Aoife; Guerzoni, Daniele

    2015-01-01

    The origin of novel protein-coding genes de novo was once considered so improbable as to be impossible. In less than a decade, and especially in the last five years, this view has been overturned by extensive evidence from diverse eukaryotic lineages. There is now evidence that this mechanism has contributed a significant number of genes to genomes of organisms as diverse as Saccharomyces, Drosophila, Plasmodium, Arabidopsis and human. From simple beginnings, these genes have in some instances acquired complex structure, regulated expression and important functional roles. New genes are often thought of as dispensable late additions; however, some recent de novo genes in human can play a role in disease. Rather than an extremely rare occurrence, it is now evident that there is a relatively constant trickle of proto-genes released into the testing ground of natural selection. It is currently unknown whether de novo genes arise primarily through an ‘RNA-first’ or ‘ORF-first’ pathway. Either way, evolutionary tinkering with this pool of genetic potential may have been a significant player in the origins of lineage-specific traits and adaptations. PMID:26323763

  16. in silico Whole Genome Sequencer & Analyzer (iWGS): a computational pipeline to guide the design and analysis of de novo genome sequencing studies

    USDA-ARS's Scientific Manuscript database

    The availability of genomes across the tree of life is highly biased toward vertebrates, pathogens, human disease models, and organisms with relatively small and simple genomes. Recent progress in genomics has enabled the de novo decoding of the genome of virtually any organism, greatly expanding it...

  17. De novo transcriptome sequencing of the Octopus vulgaris hemocytes using Illumina RNA-Seq technology: response to the infection by the gastrointestinal parasite Aggregata octopiana.

    PubMed

    Castellanos-Martínez, Sheila; Arteta, David; Catarino, Susana; Gestal, Camino

    2014-01-01

    Octopus vulgaris is a highly valuable species of great commercial interest and excellent candidate for aquaculture diversification; however, the octopus' well-being is impaired by pathogens, of which the gastrointestinal coccidian parasite Aggregata octopiana is one of the most important. The knowledge of the molecular mechanisms of the immune response in cephalopods, especially in octopus is scarce. The transcriptome of the hemocytes of O. vulgaris was de novo sequenced using the high-throughput paired-end Illumina technology to identify genes involved in immune defense and to understand the molecular basis of octopus tolerance/resistance to coccidiosis. A bi-directional mRNA library was constructed from hemocytes of two groups of octopus according to the infection by A. octopiana, sick octopus, suffering coccidiosis, and healthy octopus, and reads were de novo assembled together. The differential expression of transcripts was analysed using the general assembly as a reference for mapping the reads from each condition. After sequencing, a total of 75,571,280 high quality reads were obtained from the sick octopus group and 74,731,646 from the healthy group. The general transcriptome of the O. vulgaris hemocytes was assembled in 254,506 contigs. A total of 48,225 contigs were successfully identified, and 538 transcripts exhibited differential expression between groups of infection. The general transcriptome revealed genes involved in pathways like NF-kB, TLR and Complement. Differential expression of TLR-2, PGRP, C1q and PRDX genes due to infection was validated using RT-qPCR. In sick octopuses, only TLR-2 was up-regulated in hemocytes, but all of them were up-regulated in caecum and gills. The transcriptome reported here de novo establishes the first molecular clues to understand how the octopus immune system works and interacts with a highly pathogenic coccidian. The data provided here will contribute to identification of biomarkers for octopus resistance against

  18. Deep sequencing for de novo construction of a marine fish (Sparus aurata) transcriptome database with a large coverage of protein-coding transcripts.

    PubMed

    Calduch-Giner, Josep A; Bermejo-Nogales, Azucena; Benedito-Palos, Laura; Estensoro, Itziar; Ballester-Lozano, Gabriel; Sitjà-Bobadilla, Ariadna; Pérez-Sánchez, Jaume

    2013-03-15

    The gilthead sea bream (Sparus aurata) is the main fish species cultured in the Mediterranean area and constitutes an interesting model of research. Nevertheless, transcriptomic and genomic data are still scarce for this highly valuable species. A transcriptome database was constructed by de novo assembly of gilthead sea bream sequences derived from public repositories of mRNA and collections of expressed sequence tags together with new high-quality reads from five cDNA 454 normalized libraries of skeletal muscle (1), intestine (1), head kidney (2) and blood (1). Sequencing of the new 454 normalized libraries produced 2,945,914 high-quality reads and the de novo global assembly yielded 125,263 unique sequences with an average length of 727 nt. Blast analysis directed to protein and nucleotide databases annotated 63,880 sequences encoding for 21,384 gene descriptions, that were curated for redundancies and frameshifting at the homopolymer regions of open reading frames, and hosted at http://www.nutrigroup-iats.org/seabreamdb. Among the annotated gene descriptions, 16,177 were mapped in the Ingenuity Pathway Analysis (IPA) database, and 10,899 were eligible for functional analysis with a representation in 341 out of 372 IPA canonical pathways. The high representation of randomly selected stickleback transcripts by Blast search in the nucleotide gilthead sea bream database evidenced its high coverage of protein-coding transcripts. The newly assembled gilthead sea bream transcriptome represents a progress in genomic resources for this species, as it probably contains more than 75% of actively transcribed genes, constituting a valuable tool to assist studies on functional genomics and future genome projects.

  19. Deep sequencing for de novo construction of a marine fish (Sparus aurata) transcriptome database with a large coverage of protein-coding transcripts

    PubMed Central

    2013-01-01

    Background: The gilthead sea bream (Sparus aurata) is the main fish species cultured in the Mediterranean area and constitutes an interesting model of research. Nevertheless, transcriptomic and genomic data are still scarce for this highly valuable species. A transcriptome database was constructed by de novo assembly of gilthead sea bream sequences derived from public repositories of mRNA and collections of expressed sequence tags together with new high-quality reads from five cDNA 454 normalized libraries of skeletal muscle (1), intestine (1), head kidney (2) and blood (1). Results: Sequencing of the new 454 normalized libraries produced 2,945,914 high-quality reads and the de novo global assembly yielded 125,263 unique sequences with an average length of 727 nt. Blast analysis directed to protein and nucleotide databases annotated 63,880 sequences encoding for 21,384 gene descriptions, that were curated for redundancies and frameshifting at the homopolymer regions of open reading frames, and hosted at http://www.nutrigroup-iats.org/seabreamdb. Among the annotated gene descriptions, 16,177 were mapped in the Ingenuity Pathway Analysis (IPA) database, and 10,899 were eligible for functional analysis with a representation in 341 out of 372 IPA canonical pathways. The high representation of randomly selected stickleback transcripts by Blast search in the nucleotide gilthead sea bream database evidenced its high coverage of protein-coding transcripts. Conclusions: The newly assembled gilthead sea bream transcriptome represents a progress in genomic resources for this species, as it probably contains more than 75% of actively transcribed genes, constituting a valuable tool to assist studies on functional genomics and future genome projects. PMID:23497320

  20. Detection of a Usp-like gene in Calotropis procera plant from the de novo assembled genome contigs of the high-throughput sequencing dataset.

    PubMed

    Shokry, Ahmed M; Al-Karim, Saleh; Ramadan, Ahmed; Gadallah, Nour; Al Attas, Sanaa G; Sabir, Jamal S M; Hassan, Sabah M; Madkour, Magdy A; Bressan, Ray; Mahfouz, Magdy; Bahieldin, Ahmed

    2014-02-01

    The wild plant species Calotropis procera (C. procera) has many potential applications and beneficial uses in medicine, industry and ornamental field. It also represents an excellent source of genes for drought and salt tolerance. Genes encoding proteins that contain the conserved universal stress protein (USP) domain are known to provide organisms like bacteria, archaea, fungi, protozoa and plants with the ability to respond to a plethora of environmental stresses. However, information on the possible occurrence of Usp in C. procera is not available. In this study, we uncovered and characterized a one-class A Usp-like (UspA-like, NCBI accession No. KC954274) gene in this medicinal plant from the de novo assembled genome contigs of the high-throughput sequencing dataset. A number of GenBank accessions for Usp sequences were blasted with the recovered de novo assembled contigs. Homology modelling of the deduced amino acids (NCBI accession No. AGT02387) was further carried out using Swiss-Model, accessible via the EXPASY. Superimposition of C. procera USPA-like full sequence model on Thermus thermophilus USP UniProt protein (PDB accession No. Q5SJV7) was constructed using RasMol and Deep-View programs. The functional domains of the novel USPA-like amino acids sequence were identified from the NCBI conserved domain database (CDD) that provide insights into sequence structure/function relationships, as well as domain models imported from a number of external source databases (Pfam, SMART, COG, PRK, TIGRFAM).

  1. Characterization of P5CS gene in Calotropis procera plant from the de novo assembled transcriptome contigs of the high-throughput sequencing dataset.

    PubMed

    Ramadan, Ahmed M; Hassanein, Sameh E

    2014-12-01

    The wild plant known as Calotropis procera is important in medicine, industry and ornamental fields. Due to spread in areas that suffer from environmental stress, it has a large number of tolerance genes to environmental stress such as drought and salinity. Proline is one of the most compatible solutes that accumulate widely in plants to tolerate unfavorable environmental conditions. Plant proline synthesis depends on Δ-pyrroline-5-carboxylate synthase (P5CS) gene. But information about this gene in C. procera is unavailable. In this study, we uncovered and characterized P5CS (P5CS, NCBI accession no. KJ020750) gene in this medicinal plant from the de novo assembled transcriptome contigs of the high-throughput sequencing dataset. A number of GenBank accessions for P5CS sequences were blasted with the recovered de novo assembled contigs. Homology modeling of the deduced amino acids (NCBI accession No. AHM25913) was further carried out using Swiss-Model, accessible via the EXPASY. Superimposition of C. procera P5CS-like full sequence model on Homo sapiens (P5CS_HUMAN, UniProt protein accession no. P54886) was constructed using RasMol and Deep-View programs. The functional domains of the novel P5CS amino acids sequence were identified from the NCBI conserved domain database (CDD) that provide insights into sequence structure/function relationships, as well as domain models imported from a number of external source databases (Pfam, SMART, COG, PRK, TIGRFAM).

  2. Characterization of the Genomic Diversity of Norovirus in Linked Patients Using a Metagenomic Deep Sequencing Approach

    PubMed Central

    Nasheri, Neda; Petronella, Nicholas; Ronholm, Jennifer; Bidawid, Sabah; Corneau, Nathalie

    2017-01-01

    Norovirus (NoV) is the leading cause of gastroenteritis worldwide. A robust cell culture system does not exist for NoV and therefore detailed characterization of outbreak and sporadic strains relies on molecular techniques. In this study, we employed a metagenomic approach that uses non-specific amplification followed by next-generation sequencing to whole genome sequence NoV genomes directly from clinical samples obtained from 8 linked patients. Enough sequencing depth was obtained for each sample to use a de novo assembly of near-complete genome sequences. The resultant consensus sequences were then used to identify inter-host nucleotide variations that occur after direct transmission, analyze amino acid variations in the major capsid protein, and provide evidence of recombination events. The analysis of intra-host quasispecies diversity was possible due to high coverage-depth. We also observed a linear relationship between NoV viral load in the clinical sample and the number of sequence reads that could be attributed to NoV. The method demonstrated here has the potential for future use in whole genome sequence analyses of other RNA viruses isolated from clinical, environmental, and food specimens. PMID:28197136

  3. Transcriptome analysis of colored calla lily (Zantedeschia rehmannii Engl.) by Illumina sequencing: de novo assembly, annotation and EST-SSR marker development

    PubMed Central

    Cui, Binbin; Zhang, Qixiang; Xiong, Min; Wang, Xian

    2016-01-01

    Colored calla lily is the short name for the species or hybrids in section Aestivae of genus Zantedeschia. It is currently one of the most popular flower plants in the world due to its beautiful flower spathe and long postharvest life. However, little genomic information and few molecular markers are available for its genetic improvement. Here, de novo transcriptome sequencing was performed to produce large transcript sequences for Z. rehmannii cv. ‘Rehmannii’ using an Illumina HiSeq 2000 instrument. More than 59.9 million cDNA sequence reads were obtained and assembled into 39,298 unigenes with an average length of 1,038 bp. Among these, 21,077 unigenes showed significant similarity to protein sequences in the non-redundant protein database (Nr) and in the Swiss-Prot, Gene Ontology (GO), Cluster of Orthologous Group (COG) and Kyoto Encyclopedia of Genes and Genomes (KEGG) databases. Moreover, a total of 117 unique transcripts were then defined that might regulate the flower spathe development of colored calla lily. Additionally, 9,933 simple sequence repeats (SSRs) and 7,162 single nucleotide polymorphisms (SNPs) were identified as putative molecular markers. High-quality primers for 200 SSR loci were designed and selected, of which 58 amplified reproducible amplicons were polymorphic among 21 accessions of colored calla lily. The sequence information and molecular markers in the present study will provide valuable resources for genetic diversity analysis, germplasm characterization and marker-assisted selection in the genus Zantedeschia. PMID:27635342

  4. De novo clustering methods outperform reference-based methods for assigning 16S rRNA gene sequences to operational taxonomic units.

    PubMed

    Westcott, Sarah L; Schloss, Patrick D

    2015-01-01

    Background. 16S rRNA gene sequences are routinely assigned to operational taxonomic units (OTUs) that are then used to analyze complex microbial communities. A number of methods have been employed to carry out the assignment of 16S rRNA gene sequences to OTUs, leading to confusion over which method is optimal. A recent study suggested that a clustering method should be selected based on its ability to generate stable OTU assignments that do not change as additional sequences are added to the dataset. In contrast, we contend that the quality of the OTU assignments, the ability of the method to properly represent the distances between the sequences, is more important. Methods. Our analysis implemented six de novo clustering algorithms (single linkage, complete linkage, average linkage, abundance-based greedy clustering, distance-based greedy clustering, and Swarm) as well as the open- and closed-reference methods. Using two previously published datasets, we used the Matthews correlation coefficient (MCC) to assess the stability and quality of OTU assignments. Results. The stability of OTU assignments did not reflect the quality of the assignments. Depending on the dataset being analyzed, the average linkage and the distance and abundance-based greedy clustering methods generated OTUs that were more likely to represent the actual distances between sequences than the open and closed-reference methods. We also demonstrated that for the greedy algorithms VSEARCH produced assignments that were comparable to those produced by USEARCH making VSEARCH a viable free and open source alternative to USEARCH. Further interrogation of the reference-based methods indicated that when USEARCH or VSEARCH were used to identify the closest reference, the OTU assignments were sensitive to the order of the reference sequences because the reference sequences can be identical over the region being considered. More troubling was the observation that while both USEARCH and VSEARCH have a
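
    The MCC-based quality measure mentioned in this abstract can be illustrated with a minimal sketch. It assumes (this is not stated in the abstract) that every pair of sequences is classified by two yes/no questions: are the sequences within a distance threshold, and are they in the same OTU. The function names, the dictionary layout, and the 0.03 threshold are hypothetical choices made for illustration only.

```python
import math
from itertools import combinations

def pairwise_confusion(otu_of, dist, threshold=0.03):
    """Count TP/TN/FP/FN over all sequence pairs.

    otu_of:    dict mapping sequence id -> OTU label
    dist:      dict mapping frozenset({a, b}) -> pairwise distance
    threshold: hypothetical distance cutoff (an assumption, not from the abstract)
    """
    tp = tn = fp = fn = 0
    for a, b in combinations(sorted(otu_of), 2):
        close = dist[frozenset((a, b))] <= threshold
        same = otu_of[a] == otu_of[b]
        if same and close:
            tp += 1          # similar sequences, clustered together
        elif same and not close:
            fp += 1          # dissimilar sequences, clustered together
        elif close:
            fn += 1          # similar sequences, split apart
        else:
            tn += 1          # dissimilar sequences, kept apart
    return tp, tn, fp, fn

def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient; returns 0.0 when the denominator is zero."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom

# Toy data: a/b are close, c/d are close, every other pair is distant.
ids = ["a", "b", "c", "d"]
dist = {frozenset(p): 0.10 for p in combinations(ids, 2)}
dist[frozenset(("a", "b"))] = 0.01
dist[frozenset(("c", "d"))] = 0.02

good = {"a": 1, "b": 1, "c": 2, "d": 2}   # OTUs follow the distances
lumped = {"a": 1, "b": 1, "c": 1, "d": 1} # everything in one OTU
```

    On the toy data, the assignment that follows the distances scores an MCC of 1.0, while lumping every sequence into one OTU scores 0.0 under the zero-denominator convention used here.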

  5. Next generation sequencing based approaches to epigenomics

    PubMed Central

    Marra, Marco A.

    2010-01-01

    Next generation sequencing has brought epigenomic studies to the forefront of current research. The power of massively parallel sequencing coupled to innovative molecular and computational techniques has allowed researchers to profile the epigenome at resolutions that were unimaginable only a few years ago. With early proof of concept studies published, the field is now moving into the next phase where the importance of method standardization and rigorous quality control are becoming paramount. In this review we will describe methodologies that have been developed to profile the epigenome using next generation sequencing platforms. We will discuss these in terms of library preparation, sequence platforms and analysis techniques. PMID:21266347

  6. PRO_LIGAND: An approach to de novo molecular design. 4. Application to the design of peptides

    NASA Astrophysics Data System (ADS)

    Frenkel, David; Clark, David E.; Li, Jin; Murray, Christopher W.; Robson, Barry; Waszkowycz, Bohdan; Westhead, David R.

    1995-06-01

    In some instances, peptides can play an important role in the discovery of lead compounds. This paper describes the peptide design facility of the de novo drug design package, PRO_LIGAND. The package provides a unified framework for the design of peptides that are similar or complementary to a specified target. The approach uses single amino acid residues, selected from preconstructed libraries of different residues and conformations, and places them on top of predefined target interaction sites. This approach is a well-tested methodology for the design of organics but has not been used for peptides before. Peptides represent a difficulty because of their great conformational flexibility and a study of the advantages and disadvantages of this simple approach is an important step in the development of design tools. After a description of our general approach, a more detailed discussion of its adaptation to peptides is given. The method is then applied to the design of peptide-based inhibitors to HIV-1 protease and the design of structural mimics of the surface region of lysozyme. The results are encouraging and point the way towards further development of interaction site-based approaches for peptide design.

  7. "De-novo" amino acid sequence elucidation of protein G'e by combined "Top-Down" and "Bottom-Up" mass spectrometry

    NASA Astrophysics Data System (ADS)

    Yefremova, Yelena; Al-Majdoub, Mahmoud; Opuni, Kwabena F. M.; Koy, Cornelia; Cui, Weidong; Yan, Yuetian; Gross, Michael L.; Glocker, Michael O.

    2015-03-01

    Mass spectrometric de-novo sequencing was applied to review the amino acid sequence of a commercially available recombinant protein G' with great scientific and economic importance. Substantial deviations to the published amino acid sequence (Uniprot Q54181) were found by the presence of 46 additional amino acids at the N-terminus, including a so-called "His-tag" as well as an N-terminal partial α-N-gluconoylation and α-N-phosphogluconoylation, respectively. The unexpected amino acid sequence of the commercial protein G' comprised 241 amino acids and resulted in a molecular mass of 25,998.9 ± 0.2 Da for the unmodified protein. Due to the higher mass that is caused by its extended amino acid sequence compared with the original protein G' (185 amino acids), we named this protein "protein G'e." By means of mass spectrometric peptide mapping, the suggested amino acid sequence, as well as the N-terminal partial α-N-gluconoylations, was confirmed with 100% sequence coverage. After the protein G'e sequence was determined, we were able to determine the expression vector pET-28b from Novagen with the Xho I restriction enzyme cleavage site as the best option that was used for cloning and expressing the recombinant protein G'e in E. coli. A dissociation constant (Kd) value of 9.4 nM for protein G'e was determined thermophoretically, showing that the N-terminal flanking sequence extension did not cause significant changes in the binding affinity to immunoglobulins.

  8. Exome sequencing identifies de novo gain of function missense mutation in KCND2 in identical twins with autism and seizures that slows potassium channel inactivation.

    PubMed

    Lee, Hane; Lin, Meng-chin A; Kornblum, Harley I; Papazian, Diane M; Nelson, Stanley F

    2014-07-01

    Numerous studies and case reports show comorbidity of autism and epilepsy, suggesting some common molecular underpinnings of the two phenotypes. However, the relationship between the two, on the molecular level, remains unclear. Here, whole exome sequencing was performed on a family with identical twins affected with autism and severe, intractable seizures. A de novo variant was identified in the KCND2 gene, which encodes the Kv4.2 potassium channel. Kv4.2 is a major pore-forming subunit in somatodendritic subthreshold A-type potassium current (ISA) channels. The de novo mutation p.Val404Met is novel and occurs at a highly conserved residue within the C-terminal end of the transmembrane helix S6 region of the ion permeation pathway. Functional analysis revealed the likely pathogenicity of the variant in that the p.Val404Met mutant construct showed significantly slowed inactivation, either by itself or after equimolar coexpression with the wild-type Kv4.2 channel construct consistent with a dominant effect. Further, the effect of the mutation on closed-state inactivation was evident in the presence of auxiliary subunits that associate with Kv4 subunits to form ISA channels in vivo. Discovery of a functionally relevant novel de novo variant, coupled with physiological evidence that the mutant protein disrupts potassium current inactivation, strongly supports KCND2 as the causal gene for epilepsy in this family. Interaction of KCND2 with other genes implicated in autism and the role of KCND2 in synaptic plasticity provide suggestive evidence of an etiological role in autism.

  9. Evaluation of Methods for de novo Genome assembly from High-throughput Sequencing Reads Reveals Dependencies that Affect the Quality of the Results

    USDA-ARS's Scientific Manuscript database

    Recent developments in high-throughput sequencing technology have made low-cost sequencing an attractive approach for many genome analysis tasks. Increasing read lengths, improving quality and the production of increasingly larger numbers of usable sequences per instrument-run continue to make whole...

  10. De novo assembly and characterization of bark transcriptome using Illumina sequencing and development of EST-SSR markers in rubber tree (Hevea brasiliensis Muell. Arg.)

    PubMed Central

    2012-01-01

    Background. In rubber tree, bark is one of the important agricultural and biological organs. However, the molecular mechanism involved in the bark formation and development in rubber tree remains largely unknown, which is at least partially due to lack of bark transcriptomic and genomic information. Therefore, it is necessary to carry out high-throughput transcriptome sequencing of rubber tree bark to generate enormous transcript sequences for the functional characterization and molecular marker development. Results. In this study, more than 30 million sequencing reads were generated using Illumina paired-end sequencing technology. In total, 22,756 unigenes with an average length of 485 bp were obtained with de novo assembly. The similarity search indicated that 16,520 and 12,558 unigenes showed significant similarities to known proteins from NCBI non-redundant and Swissprot protein databases, respectively. Among these annotated unigenes, 6,867 and 5,559 unigenes were assigned to Gene Ontology (GO) and Clusters of Orthologous Group (COG), respectively. When the 22,756 unigenes were searched against the Kyoto Encyclopedia of Genes and Genomes Pathway (KEGG) database, 12,097 unigenes were assigned to 5 main categories including 123 KEGG pathways. Among the main KEGG categories, metabolism was the biggest category (9,043, 74.75%), suggesting the active metabolic processes in rubber tree bark. In addition, a total of 39,257 EST-SSRs were identified from 22,756 unigenes, and the characterizations of EST-SSRs were further analyzed in rubber tree. 110 potential marker sites were randomly selected to validate the assembly quality and develop EST-SSR markers. Among 13 Hevea germplasms, the PCR success rate and polymorphism rate of the 110 markers were 96.36% and 55.45%, respectively. Conclusion. By assembling and analyzing de novo transcriptome sequencing data, we reported the comprehensive functional characterization of rubber tree bark. This research generated a substantial fraction

  11. De Novo variants in the KMT2A (MLL) gene causing atypical Wiedemann-Steiner syndrome in two unrelated individuals identified by clinical exome sequencing

    PubMed Central

    2014-01-01

    Background. Wiedemann-Steiner Syndrome (WSS) is characterized by short stature, a variety of dysmorphic facial and skeletal features, characteristic hypertrichosis cubiti (excessive hair on the elbows), mild-to-moderate developmental delay and intellectual disability [MIM#: 605130]. Here we report two unrelated children for whom clinical exome sequencing of parent-proband trios was performed at UCLA, resulting in a molecular diagnosis of WSS and atypical clinical presentation. Case presentation. For patient 1, clinical features at 9 years of age included developmental delay, craniofacial abnormalities, and multiple minor anomalies. Patient 2 presented at 1 year of age with developmental delay, microphthalmia, partial 3–4 left hand syndactyly, and craniofacial abnormalities. A de novo missense c.4342T>C variant and a de novo splice site c.4086+G>A variant were identified in the KMT2A gene in patients 1 and 2, respectively. Conclusions. Based on the clinical and molecular findings, both patients appear to have novel presentations of WSS. As the hallmark hypertrichosis cubiti was not initially appreciated in either case, this syndrome was not suspected during the clinical evaluation. This report expands the phenotypic spectrum of the clinical phenotypes and KMT2A variants associated with WSS. PMID:24886118

  12. An Evolution-Based Approach to De Novo Protein Design and Case Study on Mycobacterium tuberculosis

    PubMed Central

    Brender, Jeffrey R.; Czajka, Jeff; Marsh, David; Gray, Felicia; Cierpicki, Tomasz; Zhang, Yang

    2013-01-01

    Computational protein design is a reverse procedure of protein folding and structure prediction, where constructing structures from evolutionarily related proteins has been demonstrated to be the most reliable method for protein 3-dimensional structure prediction. Following this spirit, we developed a novel method to design new protein sequences based on evolutionarily related protein families. For a given target structure, a set of proteins having similar fold are identified from the PDB library by structural alignments. A structural profile is then constructed from the protein templates and used to guide the conformational search of amino acid sequence space, where physicochemical packing is accommodated by single-sequence based solvation, torsion angle, and secondary structure predictions. The method was tested on a computational folding experiment based on a large set of 87 protein structures covering different fold classes, which showed that the evolution-based design significantly enhances the foldability and biological functionality of the designed sequences compared to the traditional physics-based force field methods. Without using homologous proteins, the designed sequences can be folded with an average root-mean-square-deviation of 2.1 Å to the target. As a case study, the method is extended to redesign all 243 structurally resolved proteins in the pathogenic bacteria Mycobacterium tuberculosis, which is the second leading cause of death from infectious disease. On a smaller scale, five sequences were randomly selected from the design pool and subjected to experimental validation. The results showed that all the designed proteins are soluble with distinct secondary structure and three have well ordered tertiary structure, as demonstrated by circular dichroism and NMR spectroscopy. Together, these results demonstrate a new avenue in computational protein design that uses knowledge of evolutionary conservation from protein structural families to engineer
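
    The root-mean-square deviation quoted in this abstract (2.1 Å on average) can be illustrated with a minimal sketch. It assumes the two structures are already superimposed and that atoms are paired by position in the list; no optimal superposition is performed, and the abstract does not say which atoms or alignment procedure the authors used. The function name and the coordinates are hypothetical.

```python
import math

def rmsd(coords_a, coords_b):
    """Root-mean-square deviation between two equal-length lists of (x, y, z)
    coordinates, assuming the structures are already superimposed."""
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("coordinate lists must be non-empty and of equal length")
    total = 0.0
    for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b):
        # squared Euclidean distance between the paired atoms
        total += (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
    return math.sqrt(total / len(coords_a))

# Hypothetical coordinates in angstroms: every atom is displaced by 2 A along z.
model = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
target = [(0.0, 0.0, 2.0), (1.0, 0.0, 2.0)]
print(rmsd(model, target))
```

    Because every atom in the toy example is shifted by the same 2 Å, the RMSD equals 2.0; in real comparisons the structures would first be superimposed to minimize this value.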

  13. Exome Sequencing Reveals De Novo WDR45 Mutations Causing a Phenotypically Distinct, X-Linked Dominant Form of NBIA

    PubMed Central

    Haack, Tobias B.; Hogarth, Penelope; Kruer, Michael C.; Gregory, Allison; Wieland, Thomas; Schwarzmayr, Thomas; Graf, Elisabeth; Sanford, Lynn; Meyer, Esther; Kara, Eleanna; Cuno, Stephan M.; Harik, Sami I.; Dandu, Vasuki H.; Nardocci, Nardo; Zorzi, Giovanna; Dunaway, Todd; Tarnopolsky, Mark; Skinner, Steven; Frucht, Steven; Hanspal, Era; Schrander-Stumpel, Connie; Héron, Delphine; Mignot, Cyril; Garavaglia, Barbara; Bhatia, Kailash; Hardy, John; Strom, Tim M.; Boddaert, Nathalie; Houlden, Henry H.; Kurian, Manju A.; Meitinger, Thomas; Prokisch, Holger; Hayflick, Susan J.

    2012-01-01

    Neurodegeneration with brain iron accumulation (NBIA) is a group of genetic disorders characterized by abnormal iron deposition in the basal ganglia. We report that de novo mutations in WDR45, a gene located at Xp11.23 and encoding a beta-propeller scaffold protein with a putative role in autophagy, cause a distinctive NBIA phenotype. The clinical features include early-onset global developmental delay and further neurological deterioration (parkinsonism, dystonia, and dementia developing by early adulthood). Brain MRI revealed evidence of iron deposition in the substantia nigra and globus pallidus. Males and females are phenotypically similar, an observation that might be explained by somatic mosaicism in surviving males and germline or somatic mutations in females, as well as skewing of X chromosome inactivation. This clinically recognizable disorder is among the more common forms of NBIA, and we suggest that it be named accordingly as beta-propeller protein-associated neurodegeneration. PMID:23176820

  14. Use of targeted next-generation sequencing for molecular diagnosis of craniosynostosis: Identification of a novel de novo mutation of EFNB1.

    PubMed

    Yamamoto, Toshiyuki; Igarashi, Naru; Shimojima, Keiko; Sangu, Noriko; Sakamoto, Yuko; Shimoji, Kazuaki; Niijima, Shinichi

    2016-03-01

    Craniofrontonasal syndrome (CFNS; MIM#304110) is characterized by asymmetric facial features with hypertelorism and a broad bifid nose due to synostosis of the coronal suture. CFNS shows a unique X-linked inheritance pattern (most affected patients are female and obligate male carriers exhibit a mild manifestation or no typical features at all) associated with the ephrin-B1 gene (EFNB1) located in the Xq13.1 region. In this study, we performed targeted, massively parallel sequencing using a next-generation sequencer, and identified a novel EFNB1 mutation, c.270_271delCA, in a Japanese female patient with craniosynostosis. Because subsequent Sanger sequencing identified no mutation in either parent, this mutation was determined to be de novo in origin. After obtaining molecular diagnosis, a retrospective clinical evaluation confirmed the clinical diagnosis of CFNS in this patient. Comprehensive molecular diagnosis using a next-generation sequencer would be beneficial for early diagnosis of the patients with undiagnosed craniosynostosis.

  15. De novo computational identification of stress-related sequence motifs and microRNA target sites in untranslated regions of a plant translatome

    PubMed Central

    Munusamy, Prabhakaran; Zolotarov, Yevgen; Meteignier, Louis-Valentin; Moffett, Peter; Strömvik, Martina V.

    2017-01-01

    Gene regulation at the transcriptional and translational level leads to diversity in phenotypes and function in organisms. Regulatory DNA or RNA sequence motifs adjacent to the gene coding sequence act as binding sites for proteins that in turn enable or disable expression of the gene. Whereas the known DNA and RNA binding proteins range in the thousands, only a few motifs have been examined. In this study, we have predicted putative regulatory motifs in groups of untranslated regions from genes regulated at the translational level in Arabidopsis thaliana under normal and stressed conditions. The test group of sequences was divided into random subgroups and subjected to three de novo motif finding algorithms (Seeder, Weeder and MEME). In addition to identifying sequence motifs, using an in silico tool we have predicted microRNA target sites in the 3′ UTRs of the translationally regulated genes, as well as identified upstream open reading frames located in the 5′ UTRs. Our bioinformatics strategy and the knowledge generated contribute to understanding gene regulation during stress, and can be applied to disease and stress resistant plant development. PMID:28276452
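
    The identification of upstream open reading frames in 5′ UTRs mentioned in this abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' tool: a uORF is taken to be an ATG followed by an in-frame stop codon inside the UTR, and the function name and the `min_codons` parameter are hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def find_uorfs(utr5, min_codons=2):
    """Return (start, end) index pairs of uORFs fully contained in a 5' UTR.

    start is the index of the A in ATG; end is one past the stop codon.
    min_codons counts the ATG plus the sense codons before the stop
    (a hypothetical filter, not taken from the abstract).
    """
    seq = utr5.upper().replace("U", "T")  # accept RNA or DNA input
    uorfs = []
    for start in range(len(seq) - 2):
        if seq[start:start + 3] != "ATG":
            continue
        # walk codon by codon in the frame set by this ATG
        for pos in range(start + 3, len(seq) - 2, 3):
            if seq[pos:pos + 3] in STOP_CODONS:
                if (pos - start) // 3 >= min_codons:
                    uorfs.append((start, pos + 3))
                break  # first in-frame stop ends this candidate
    return uorfs

print(find_uorfs("CCATGAAATAGCC"))
```

    For the toy UTR `CCATGAAATAGCC`, the ATG at index 2 is followed in frame by AAA and the stop TAG, so a single uORF spanning indices 2 to 11 is reported. Start codons without an in-frame stop inside the UTR are ignored in this sketch.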

  16. General Approach in Computing Sums of Products of Binary Sequences

    DTIC Science & Technology

    2011-12-08

    General Approach in Computing Sums of Products of Binary Sequences E. Kiliç1, P. Stănică2 1TOBB Economics and Technology University, Mathematics...pstanica@nps.edu December 8, 2011 Abstract In this paper we find a general approach to find closed forms of sums of products of arbitrary sequences ...satisfying the same recurrence with different initial conditions. We apply successfully our technique to sums of products of such sequences with indices in

  17. Exome sequencing identifies de novo pathogenic variants in FBN1 and TRPS1 in a patient with a complex connective tissue phenotype

    PubMed Central

    Zastrow, Diane B.; Zornio, Patricia A.; Dries, Annika; Kohler, Jennefer; Fernandez, Liliana; Waggott, Daryl; Walkiewicz, Magdalena; Eng, Christine M.; Manning, Melanie A.; Farrelly, Ellyn; Fisher, Paul G.; Ashley, Euan A.; Bernstein, Jonathan A.

    2017-01-01

    Here we describe a patient who presented with a history of congenital diaphragmatic hernia, inguinal hernia, and recurrent umbilical hernia. She also has joint laxity, hypotonia, and dysmorphic features. A unifying diagnosis was not identified based on her clinical phenotype. As part of her evaluation through the Undiagnosed Diseases Network, trio whole-exome sequencing was performed. Pathogenic variants in FBN1 and TRPS1 were identified as causing two distinct autosomal dominant conditions, each with de novo inheritance. Fibrillin 1 (FBN1) mutations are associated with Marfan syndrome and a spectrum of similar phenotypes. TRPS1 mutations are associated with trichorhinophalangeal syndrome types I and III. Features of both conditions are evident in the patient reported here. Discrepant features of the conditions (e.g., stature) and the young age of the patient may have made a clinical diagnosis more difficult in the absence of exome-wide genetic testing. PMID:28050602

  18. Exome sequencing identifies de novo pathogenic variants in FBN1 and TRPS1 in a patient with a complex connective tissue phenotype.

    PubMed

    Zastrow, Diane B; Zornio, Patricia A; Dries, Annika; Kohler, Jennefer; Fernandez, Liliana; Waggott, Daryl; Walkiewicz, Magdalena; Eng, Christine M; Manning, Melanie A; Farrelly, Ellyn; Fisher, Paul G; Ashley, Euan A; Bernstein, Jonathan A; Wheeler, Matthew T

    2017-01-01

    Here we describe a patient who presented with a history of congenital diaphragmatic hernia, inguinal hernia, and recurrent umbilical hernia. She also has joint laxity, hypotonia, and dysmorphic features. A unifying diagnosis was not identified based on her clinical phenotype. As part of her evaluation through the Undiagnosed Diseases Network, trio whole-exome sequencing was performed. Pathogenic variants in FBN1 and TRPS1 were identified as causing two distinct autosomal dominant conditions, each with de novo inheritance. Fibrillin 1 (FBN1) mutations are associated with Marfan syndrome and a spectrum of similar phenotypes. TRPS1 mutations are associated with trichorhinophalangeal syndrome types I and III. Features of both conditions are evident in the patient reported here. Discrepant features of the conditions (e.g., stature) and the young age of the patient may have made a clinical diagnosis more difficult in the absence of exome-wide genetic testing.

  19. De Novo Transcriptome Sequencing of the Octopus vulgaris Hemocytes Using Illumina RNA-Seq Technology: Response to the Infection by the Gastrointestinal Parasite Aggregata octopiana

    PubMed Central

    Castellanos-Martínez, Sheila; Arteta, David; Catarino, Susana; Gestal, Camino

    2014-01-01

    Background. Octopus vulgaris is a highly valuable species of great commercial interest and an excellent candidate for aquaculture diversification; however, the octopus’ well-being is impaired by pathogens, of which the gastrointestinal coccidian parasite Aggregata octopiana is one of the most important. The knowledge of the molecular mechanisms of the immune response in cephalopods, especially in octopus, is scarce. The transcriptome of the hemocytes of O. vulgaris was de novo sequenced using the high-throughput paired-end Illumina technology to identify genes involved in immune defense and to understand the molecular basis of octopus tolerance/resistance to coccidiosis. Results. A bi-directional mRNA library was constructed from hemocytes of two groups of octopus according to the infection by A. octopiana, sick octopus, suffering coccidiosis, and healthy octopus, and reads were de novo assembled together. The differential expression of transcripts was analysed using the general assembly as a reference for mapping the reads from each condition. After sequencing, a total of 75,571,280 high quality reads were obtained from the sick octopus group and 74,731,646 from the healthy group. The general transcriptome of the O. vulgaris hemocytes was assembled in 254,506 contigs. A total of 48,225 contigs were successfully identified, and 538 transcripts exhibited differential expression between groups of infection. The general transcriptome revealed genes involved in pathways like NF-kB, TLR and Complement. Differential expression of TLR-2, PGRP, C1q and PRDX genes due to infection was validated using RT-qPCR. In sick octopuses, only TLR-2 was up-regulated in hemocytes, but all of them were up-regulated in caecum and gills. Conclusion. The transcriptome reported here de novo establishes the first molecular clues to understand how the octopus immune system works and interacts with a highly pathogenic coccidian. The data provided here will contribute to identification of biomarkers

  20. Sequencing and de novo assembly of visceral mass transcriptome of the critically endangered land snail Satsuma myomphala: Annotation and SSR discovery.

    PubMed

    Kang, Se Won; Patnaik, Bharat Bhusan; Hwang, Hee-Ju; Park, So Young; Chung, Jong Min; Song, Dae Kwon; Patnaik, Hongray Howrelia; Lee, Jae Bong; Kim, Changmu; Kim, Soonok; Park, Hong Seog; Park, Seung-Hwan; Park, Young-Su; Han, Yeon Soo; Lee, Jun Sang; Lee, Yong Seok

    2017-03-01

    Satsuma myomphala is critically endangered through loss of natural habitats, predation by natural enemies, and indiscriminate collection. It is a protected species in Korea but lacks genomic resources for an understanding of varied functional processes attributable to evolutionary success under natural habitats. For assessing the genetic information of S. myomphala, we performed for the first time, de novo transcriptome sequencing and functional annotation of expressed sequences using Illumina Next-Generation Sequencing (NGS) platform and bioinformatics analysis. We identified 103,774 unigenes of which 37,959, 12,890, and 17,699 were annotated in the PANM (Protostome DB), Unigene, and COG (Clusters of Orthologous Groups) databases, respectively. In addition, 14,451 unigenes were predicted under Gene Ontology functional categories, with 4581 assigned to a single category. Furthermore, 3369 sequences with 646 having Enzyme Commission (EC) numbers were mapped to 122 pathways in the Kyoto Encyclopedia of Genes and Genomes Pathway database. The prominent protein domains included the Zinc finger (C2H2-like), Reverse Transcriptase, Thioredoxin-like fold, and RNA recognition motif domain. Many unigenes with homology to immunity, defense, and reproduction-related genes were screened in the transcriptome. We also detected 3120 putative simple sequence repeats (SSRs) encompassing dinucleotide to hexanucleotide repeat motifs from >1kb unigene sequences. A list of PCR primers of SSR loci have been identified to study the genetic polymorphisms. The transcriptome data represents a valuable resource for further investigations on the species genome structure and biology. The unigenes information and microsatellites would provide an indispensable tool for conservation of the species in natural and adaptive environments. Copyright © 2016 The Authors. Published by Elsevier Inc. All rights reserved.
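
    The detection of simple sequence repeats with dinucleotide to hexanucleotide motifs, as described in this abstract, can be sketched with regular expressions. The abstract does not name the tool or the minimum repeat counts used, so the `MIN_REPEATS` thresholds and the function name below are hypothetical and serve only as an illustration.

```python
import re

# Hypothetical minimum number of tandem copies per motif length (an assumption).
MIN_REPEATS = {2: 6, 3: 5, 4: 4, 5: 4, 6: 4}

def find_ssrs(seq):
    """Return sorted (start, motif, copies) tuples for perfect tandem repeats."""
    seq = seq.upper()
    hits = []
    for unit_len, min_rep in MIN_REPEATS.items():
        # one motif of unit_len bases followed by at least min_rep - 1 more copies
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (unit_len, min_rep - 1))
        for m in pattern.finditer(seq):
            unit = m.group(1)
            # skip motifs that are repeats of a shorter motif, e.g. "AA" or "ATAT"
            if any(unit == unit[:k] * (unit_len // k)
                   for k in range(1, unit_len) if unit_len % k == 0):
                continue
            hits.append((m.start(), unit, len(m.group(0)) // unit_len))
    return sorted(hits)

toy = "GG" + "AT" * 7 + "CC" + "AAG" * 5 + "T"
print(find_ssrs(toy))
```

    On the toy sequence, the sketch reports an (AT)7 repeat starting at index 2 and an (AAG)5 repeat starting at index 18. Mononucleotide runs are deliberately excluded, in line with the dinucleotide-to-hexanucleotide range stated in the abstract.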

  1. Rational Structure-Based Rescaffolding Approach to De Novo Design of Interleukin 10 (IL-10) Receptor-1 Mimetics

    PubMed Central

    Philipp, Jenny; Künze, Georg; Wodtke, Robert; Löser, Reik; Fahmy, Karim; Pisabarro, M. Teresa

    2016-01-01

    Tackling protein interfaces with small molecules capable of modulating protein-protein interactions remains a challenge in structure-based ligand design. Particularly arduous are cases in which the epitopes involved in molecular recognition have a non-structured and discontinuous nature. Here, the basic strategy of translating continuous binding epitopes into mimetic scaffolds cannot be applied, and other innovative approaches are therefore required. We present a structure-based rational approach involving the use of a regular expression syntax inspired by the well-established PROSITE to define minimal descriptors of geometric and functional constraints signifying relevant functionalities for recognition in protein interfaces of non-continuous and unstructured nature. These descriptors feed a search engine that explores the currently available three-dimensional chemical space of the Protein Data Bank (PDB) in order to identify in a straightforward manner regular architectures containing the desired functionalities, which could be used as templates to guide the rational design of small natural-like scaffolds mimicking the targeted recognition site. The application of this rescaffolding strategy to the discovery of natural scaffolds incorporating a selection of functionalities of interleukin-10 receptor-1 (IL-10R1), which are relevant for its interaction with interleukin-10 (IL-10), has resulted in the de novo design of a new class of potent IL-10 peptidomimetic ligands. PMID:27123592
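
    For readers unfamiliar with the PROSITE syntax that inspired the descriptors in this abstract, the sketch below translates a subset of the standard PROSITE sequence-pattern notation into a Python regular expression. It does not reproduce the authors' geometric descriptor language, which the abstract does not specify; the function name is hypothetical, and only `x`, `[...]`, `{...}`, `(n)`, `(n,m)`, `<` and `>` are handled.

```python
import re

def prosite_to_regex(pattern):
    """Translate a PROSITE-style sequence pattern into a Python regex string."""
    pattern = pattern.rstrip(".")          # PROSITE patterns end with a period
    parts = []
    for element in pattern.split("-"):     # elements are separated by hyphens
        anchor_start = element.startswith("<")   # '<' anchors to the N-terminus
        anchor_end = element.endswith(">")       # '>' anchors to the C-terminus
        element = element.strip("<>")
        # split off an optional repetition suffix such as (3) or (2,4)
        m = re.fullmatch(r"(.+?)(?:\((\d+)(?:,(\d+))?\))?", element)
        core, low, high = m.groups()
        if core == "x":
            core = "."                           # x matches any residue
        elif core.startswith("{"):
            core = "[^" + core[1:-1] + "]"       # {ABC} excludes residues
        # [ABC] and single residue letters are already valid regex
        if low:
            core += "{" + low + ("," + high if high else "") + "}"
        parts.append(("^" if anchor_start else "") + core + ("$" if anchor_end else ""))
    return "".join(parts)

# PROSITE's N-glycosylation site pattern (PS00001).
regex = prosite_to_regex("N-{P}-[ST]-{P}.")
print(regex, re.search(regex, "AANGSAA").group(0))
```

    The N-glycosylation pattern becomes `N[^P][ST][^P]`, which matches `NGSA` in the toy sequence. A geometric descriptor language like the one in the abstract would need additional vocabulary for distances and functional groups, which is outside the scope of this sketch.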

  2. Transcriptome Profile of the Asian Giant Hornet (Vespa mandarinia) Using Illumina HiSeq 4000 Sequencing: De Novo Assembly, Functional Annotation, and Discovery of SSR Markers

    PubMed Central

    Park, So Young; Kang, Se Won; Hwang, Hee-Ju; Wang, Tae Hun; Park, Eun Bi; Chung, Jong Min; Song, Dae Kwon; Kim, Changmu; Kim, Soonok; Lee, Jae Bong; Jeong, Heon Cheon; Park, Hong Seog; Han, Yeon Soo; Lee, Yong Seok

    2016-01-01

    Vespa mandarinia found in the forests of East Asia, including Korea, occupies the highest rank in the arthropod food web within its geographical range. It serves as a source of nutrition in the form of Vespa amino acid mixture and is listed as a threatened species, although no conservation measures have been implemented. Here, we performed de novo assembly of the V. mandarinia transcriptome by Illumina HiSeq 4000 sequencing. Over 60 million raw reads and 59,184,811 clean reads were obtained. After assembly, a total of 66,837 unigenes were clustered, 40,887, 44,455, and 22,390 of which showed homologous matches against the PANM, Unigene, and KOG databases, respectively. A total of 15,675 unigenes were assigned to Gene Ontology terms, and 5,132 unigenes were mapped to 115 KEGG pathways. The zinc finger domain (C2H2-like), serine/threonine/dual specificity protein kinase domain, and RNA recognition motif domain were among the top InterProScan domains predicted for V. mandarinia sequences. Among the unigenes, we identified 534,922 cDNA simple sequence repeats as potential markers. This is the first transcriptomic analysis of the wasp V. mandarinia using Illumina HiSeq 4000. The obtained datasets should promote the search for new genes to understand the physiological attributes of this wasp. PMID:26881195

  3. Transcriptome Sequencing and De Novo Analysis of a Cytoplasmic Male Sterile Line and Its Near-Isogenic Restorer Line in Chili Pepper (Capsicum annuum L.)

    PubMed Central

    Wang, Ping-Yong; Fu, Nan; Shen, Huo-Lin

    2013-01-01

    Background The use of cytoplasmic male sterility (CMS) in F1 hybrid seed production of chili pepper is increasingly popular. However, the molecular mechanisms of cytoplasmic male sterility and fertility restoration remain poorly understood due to limited transcriptomic and genomic data. Therefore, we analyzed the difference between a CMS line, 121A, and its near-isogenic restorer line, 121C, at the transcriptome level using next generation sequencing technology (NGS), aiming to identify critical genes and pathways associated with male sterility. Results We generated approximately 53 million sequencing reads and assembled them de novo, yielding 85,144 high quality unigenes with an average length of 643 bp. Among these unigenes, 27,191 were identified as putative homologs of annotated sequences in the public protein databases, and 4,326 and 7,061 unigenes were found to be highly abundant in lines 121A and 121C, respectively. Many of the differentially expressed unigenes represent a set of potential candidate genes associated with the formation or abortion of pollen. Conclusions Our study profiled anther transcriptomes of a chili pepper CMS line and its restorer line. The results shed light on the occurrence and recovery of the disturbances in nuclear-mitochondrial interaction and provide clues for further investigations. PMID:23750245

  4. De Novo Assembly of Coding Sequences of the Mangrove Palm (Nypa fruticans) Using RNA-Seq and Discovery of Whole-Genome Duplications in the Ancestor of Palms

    PubMed Central

    Guo, Wuxia; Zhang, Ying; Zhou, Renchao; Shi, Suhua

    2015-01-01

    Nypa fruticans (Arecaceae) is the only monocot species of true mangroves. This species represents the earliest mangrove fossil recorded. How N. fruticans adapts to the harsh and unstable intertidal zone is an interesting question. However, the 60 gene segments deposited in NCBI are insufficient for solving this question. In this study, we sequenced, assembled and annotated the transcriptome of N. fruticans using next-generation sequencing technology. A total of 19,918,800 clean paired-end reads were de novo assembled into 45,368 unigenes with a N50 length of 1,096 bp. A total of 41.35% unigenes were functionally annotated using Blast2GO. Many genes annotated to “response to stress” and 15 putative positively selected genes were identified. Simple sequence repeats were identified and compared with other palms. The divergence time between N. fruticans and other palms was estimated at 75 million years ago using the genomic data, which is consistent with the fossil record. After calculating the synonymous substitution rate between paralogs, we found that two whole-genome duplication events were shared by N. fruticans and other palms. These duplication events provided a large amount of raw material for the more than 2,000 later speciation events in Arecaceae. This study provides a high quality resource for further functional and evolutionary studies of N. fruticans and palms in general. PMID:26684618
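
The N50 statistic quoted in this abstract (1,096 bp for 45,368 unigenes) is the length L such that sequences of length at least L account for at least half of the total assembled bases. A minimal sketch of that calculation follows; the function name `n50` and the toy lengths are illustrative and are not taken from the study.

```python
def n50(lengths):
    """Length L such that sequences of length >= L cover half the total bases."""
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half:
            return length
    raise ValueError("lengths must be non-empty")


if __name__ == "__main__":
    # Toy unigene lengths in bp (made up for illustration).
    print(n50([2, 2, 2, 3, 3, 4, 8, 8]))
```

Because N50 is weighted toward long sequences, it is usually larger than the mean length of the same assembly, which is why the two are not interchangeable when comparing the records in this list.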

  5. De Novo Assembly of Coding Sequences of the Mangrove Palm (Nypa fruticans) Using RNA-Seq and Discovery of Whole-Genome Duplications in the Ancestor of Palms.

    PubMed

    He, Ziwen; Zhang, Zhang; Guo, Wuxia; Zhang, Ying; Zhou, Renchao; Shi, Suhua

    2015-01-01

    Nypa fruticans (Arecaceae) is the only monocot species of true mangroves. This species represents the earliest mangrove fossil recorded. How N. fruticans adapts to the harsh and unstable intertidal zone is an interesting question. However, the 60 gene segments deposited in NCBI are insufficient for solving this question. In this study, we sequenced, assembled and annotated the transcriptome of N. fruticans using next-generation sequencing technology. A total of 19,918,800 clean paired-end reads were de novo assembled into 45,368 unigenes with a N50 length of 1,096 bp. A total of 41.35% unigenes were functionally annotated using Blast2GO. Many genes annotated to "response to stress" and 15 putative positively selected genes were identified. Simple sequence repeats were identified and compared with other palms. The divergence time between N. fruticans and other palms was estimated at 75 million years ago using the genomic data, which is consistent with the fossil record. After calculating the synonymous substitution rate between paralogs, we found that two whole-genome duplication events were shared by N. fruticans and other palms. These duplication events provided a large amount of raw material for the more than 2,000 later speciation events in Arecaceae. This study provides a high quality resource for further functional and evolutionary studies of N. fruticans and palms in general.

  6. De Novo Sequencing of Hypericum perforatum Transcriptome to Identify Potential Genes Involved in the Biosynthesis of Active Metabolites

    PubMed Central

    He, Miao; Wang, Ying; Hua, Wenping; Zhang, Yuan; Wang, Zhezhi

    2012-01-01

    Background Hypericum perforatum L. (St. John’s wort) is a medicinal plant with pharmacological properties that are antidepressant, anti-inflammatory, antiviral, anti-cancer, and antibacterial. Its major active metabolites are hypericins, hyperforins, and melatonin. However, little genetic information is available for this species, especially that concerning the biosynthetic pathways for active ingredients. Methodology/Principal Findings Using de novo transcriptome analysis, we obtained 59,184 unigenes covering the entire life cycle of these plants. In all, 40,813 unigenes (68.86%) were annotated and 2,359 were assigned to secondary metabolic pathways. Among them, 260 unigenes are involved in the production of hypericin, hyperforin, and melatonin. Another 2,291 unigenes are classified as potential Type III polyketide synthase. Our BlastX search against the AGRIS database reveals 1,772 unigenes that are homologous to 47 known Arabidopsis transcription factor families. Further analysis shows that 10.61% (6,277) of these unigenes contain 7,643 SSRs. Conclusion We have identified a set of putative genes involved in several secondary metabolism pathways, especially those related to the synthesis of its active ingredients. Our results will serve as an important platform for public information about gene expression, genomics, and functional genomics in H. perforatum. PMID:22860059

  7. De novo sequencing of Hypericum perforatum transcriptome to identify potential genes involved in the biosynthesis of active metabolites.

    PubMed

    He, Miao; Wang, Ying; Hua, Wenping; Zhang, Yuan; Wang, Zhezhi

    2012-01-01

    Hypericum perforatum L. (St. John's wort) is a medicinal plant with pharmacological properties that are antidepressant, anti-inflammatory, antiviral, anti-cancer, and antibacterial. Its major active metabolites are hypericins, hyperforins, and melatonin. However, little genetic information is available for this species, especially that concerning the biosynthetic pathways for active ingredients. Using de novo transcriptome analysis, we obtained 59,184 unigenes covering the entire life cycle of these plants. In all, 40,813 unigenes (68.86%) were annotated and 2,359 were assigned to secondary metabolic pathways. Among them, 260 unigenes are involved in the production of hypericin, hyperforin, and melatonin. Another 2,291 unigenes are classified as potential Type III polyketide synthase. Our BlastX search against the AGRIS database reveals 1,772 unigenes that are homologous to 47 known Arabidopsis transcription factor families. Further analysis shows that 10.61% (6,277) of these unigenes contain 7,643 SSRs. We have identified a set of putative genes involved in several secondary metabolism pathways, especially those related to the synthesis of its active ingredients. Our results will serve as an important platform for public information about gene expression, genomics, and functional genomics in H. perforatum.

  8. De novo sequencing and transcriptome analysis of a low temperature tolerant Saccharum spontaneum clone IND 00-1037.

    PubMed

    Dharshini, S; Chakravarthi, M; J, Ashwin Narayan; Manoj, V M; Naveenarani, M; Kumar, Ravinder; Meena, Minturam; Ram, Bakshi; Appunu, C

    2016-08-10

    Saccharum spontaneum L., a wild relative of sugarcane, is known for its adaptability to environmental stresses, particularly cold stress. In the present study, transcriptome profiling was carried out for the low temperature (10°C) tolerant S. spontaneum clone IND 00-1037, collected from high altitude regions of Arunachal Pradesh, North Eastern India. The Illumina Nextseq500 platform yielded a total of 47.63 and 48.18 million reads, corresponding to 4.7 and 4.8 gigabase pairs (Gb) of processed reads, for control and cold stressed (10°C for 24h) samples, respectively. These reads were de novo assembled into 214,611 unigenes with an average length of 801bp. Further, all unigenes were aligned to the GO, KEGG and COG databases in order to identify novel genes and pathways responsive to low temperature conditions. The differential gene expression analysis revealed that about 2583 genes were upregulated and 3302 genes were downregulated during the stress. This is perhaps the comprehensive transcriptome data of a low temperature tolerant clone of S. spontaneum. This study would aid in identifying novel genes and also in future genomic studies pertaining to sugarcane and its wild relatives.

  9. De novo sequencing of root transcriptome reveals complex cadmium-responsive regulatory networks in radish (Raphanus sativus L.).

    PubMed

    Xu, Liang; Wang, Yan; Liu, Wei; Wang, Jin; Zhu, Xianwen; Zhang, Keyun; Yu, Rugang; Wang, Ronghua; Xie, Yang; Zhang, Wei; Gong, Yiqin; Liu, Liwang

    2015-07-01

    Cadmium (Cd) is a nonessential metallic trace element that poses potential chronic toxicity to living organisms. To date, little is known about the Cd-responsive regulatory network in root vegetable crops including radish. In this study, 31,015 unigenes representing 66,552 assembled unique transcripts were isolated from radish root under Cd stress based on de novo transcriptome assembly. In all, 1496 differentially expressed genes (DEGs) consisting of 3579 transcripts were identified from Cd-free (CK) and Cd-treated (Cd200) libraries. Gene Ontology and pathway enrichment analysis indicated that the up- and down-regulated DEGs were predominately involved in glucosinolate biosynthesis as well as cysteine and methionine-related pathways, respectively. RT-qPCR showed that the expression profiles of DEGs were consistent with results from RNA-Seq analysis. Several candidate genes encoding phytochelatin synthase (PCS), metallothioneins (MTs), glutathione (GSH), zinc iron permease (ZIPs) and ABC transporter were responsible for Cd uptake, accumulation, translocation and detoxification in radish. A schematic model of the DEGs and microRNAs involved in the Cd-responsive regulatory network was proposed. This study represents a first comprehensive transcriptome-based characterization of Cd-responsive DEGs in radish. These results could provide fundamental insight into complex Cd-responsive regulatory networks and facilitate further genetic manipulation of Cd accumulation in root vegetable crops. Copyright © 2015 Elsevier Ireland Ltd. All rights reserved.

  10. Sequencing, De Novo Assembly, and Annotation of the Transcriptome of the Endangered Freshwater Pearl Bivalve, Cristaria plicata, Provides Novel Insights into Functional Genes and Marker Discovery.

    PubMed

    Patnaik, Bharat Bhusan; Wang, Tae Hun; Kang, Se Won; Hwang, Hee-Ju; Park, So Young; Park, Eun Bi; Chung, Jong Min; Song, Dae Kwon; Kim, Changmu; Kim, Soonok; Lee, Jun Sang; Han, Yeon Soo; Park, Hong Seog; Lee, Yong Seok

    2016-01-01

    The freshwater mussel Cristaria plicata (Bivalvia: Eulamellibranchia: Unionidae), is an economically important species in molluscan aquaculture due to its use in pearl farming. The species have been listed as endangered in South Korea due to the loss of natural habitats caused by anthropogenic activities. The decreasing population and a lack of genomic information on the species is concerning for environmentalists and conservationists. In this study, we conducted a de novo transcriptome sequencing and annotation analysis of C. plicata using Illumina HiSeq 2500 next-generation sequencing (NGS) technology, the Trinity assembler, and bioinformatics databases to prepare a sustainable resource for the identification of candidate genes involved in immunity, defense, and reproduction. The C. plicata transcriptome analysis included a total of 286,152,584 raw reads and 281,322,837 clean reads. The de novo assembly identified a total of 453,931 contigs and 374,794 non-redundant unigenes with average lengths of 731.2 and 737.1 bp, respectively. Furthermore, 100% coverage of C. plicata mitochondrial genes within two unigenes supported the quality of the assembler. In total, 84,274 unigenes showed homology to entries in at least one database, and 23,246 unigenes were allocated to one or more Gene Ontology (GO) terms. The most prominent GO biological process, cellular component, and molecular function categories (level 2) were cellular process, membrane, and binding, respectively. A total of 4,776 unigenes were mapped to 123 biological pathways in the KEGG database. Based on the GO terms and KEGG annotation, the unigenes were suggested to be involved in immunity, stress responses, sex-determination, and reproduction. A total of 17,251 cDNA simple sequence repeats (cSSRs) were identified from 61,141 unigenes (size of >1 kb) with the most abundant being dinucleotide repeats. This dataset represents the first transcriptome analysis of the endangered mollusc, C. plicata. The

  11. Sequencing, De Novo Assembly, and Annotation of the Transcriptome of the Endangered Freshwater Pearl Bivalve, Cristaria plicata, Provides Novel Insights into Functional Genes and Marker Discovery

    PubMed Central

    Kang, Se Won; Hwang, Hee-Ju; Park, So Young; Park, Eun Bi; Chung, Jong Min; Song, Dae Kwon; Kim, Changmu; Kim, Soonok; Lee, Jun Sang; Han, Yeon Soo; Park, Hong Seog; Lee, Yong Seok

    2016-01-01

    Background The freshwater mussel Cristaria plicata (Bivalvia: Eulamellibranchia: Unionidae), is an economically important species in molluscan aquaculture due to its use in pearl farming. The species have been listed as endangered in South Korea due to the loss of natural habitats caused by anthropogenic activities. The decreasing population and a lack of genomic information on the species is concerning for environmentalists and conservationists. In this study, we conducted a de novo transcriptome sequencing and annotation analysis of C. plicata using Illumina HiSeq 2500 next-generation sequencing (NGS) technology, the Trinity assembler, and bioinformatics databases to prepare a sustainable resource for the identification of candidate genes involved in immunity, defense, and reproduction. Results The C. plicata transcriptome analysis included a total of 286,152,584 raw reads and 281,322,837 clean reads. The de novo assembly identified a total of 453,931 contigs and 374,794 non-redundant unigenes with average lengths of 731.2 and 737.1 bp, respectively. Furthermore, 100% coverage of C. plicata mitochondrial genes within two unigenes supported the quality of the assembler. In total, 84,274 unigenes showed homology to entries in at least one database, and 23,246 unigenes were allocated to one or more Gene Ontology (GO) terms. The most prominent GO biological process, cellular component, and molecular function categories (level 2) were cellular process, membrane, and binding, respectively. A total of 4,776 unigenes were mapped to 123 biological pathways in the KEGG database. Based on the GO terms and KEGG annotation, the unigenes were suggested to be involved in immunity, stress responses, sex-determination, and reproduction. A total of 17,251 cDNA simple sequence repeats (cSSRs) were identified from 61,141 unigenes (size of >1 kb) with the most abundant being dinucleotide repeats. 
    Conclusions This dataset represents the first transcriptome analysis of the endangered

  12. High-Throughput Sequencing and De Novo Assembly of Brassica oleracea var. Capitata L. for Transcriptome Analysis

    PubMed Central

    Kim, Sangmi; Choe, Jun Kyoung; Jo, Sung-Hwan; Baek, Namkwon; Kwon, Suk-Yoon

    2014-01-01

    Background The cabbage, Brassica oleracea var. capitata L., has a distinguishable phenotype within the genus Brassica. Despite the economic and genetic importance of cabbage, there is little genomic data for cabbage, and most studies of Brassica are focused on other species or other B. oleracea subspecies. The lack of genomic data for cabbage, a non-model organism, hinders research on its molecular biology. Hence, the construction of reliable transcriptomic data based on high-throughput sequencing technologies is needed to enhance our understanding of cabbage and provide genomic information for future work. Methodology/Principal Findings We constructed cDNAs from total RNA isolated from the roots, leaves, flowers, seedlings, and calcium-limited seedling tissues of two cabbage genotypes: 102043 and 107140. We sequenced a total of six different samples using the Illumina HiSeq platform, producing 40.5 Gbp of sequence data comprising 401,454,986 short reads. We assembled 205,046 transcripts (≥ 200 bp) using the Velvet and Oases assembler and predicted 53,562 loci from the transcripts. We annotated 35,274 of the loci with 55,916 plant peptides in the Phytozome database. The average length of the annotated loci was 1,419 bp. We confirmed the reliability of the sequencing assembly using reverse-transcriptase PCR to identify tissue-specific gene candidates among the annotated loci. Conclusion Our study provides valuable transcriptome sequence data for B. oleracea var. capitata L., offering a new resource for studying B. oleracea and closely related species. Our transcriptomic sequences will enhance the quality of gene annotation and functional analysis of the cabbage genome and serve as a material basis for future genomic research on cabbage. The sequencing data from this study can be used to develop molecular markers and to identify the extreme differences among the phenotypes of different species in the genus Brassica. PMID:24682075

  13. De novo sequencing and analysis of the lily pollen transcriptome: an open access data source for an orphan plant species.

    PubMed

    Lang, Veronika; Usadel, Björn; Obermeyer, Gerhard

    2015-01-01

    Pollen grains of Lilium longiflorum are a long-established model system for pollen germination and tube tip growth. Due to their size, protein content and almost synchronous germination in synthetic media, they provide a simple system for physiological measurements as well as sufficient material for biochemical studies like protein purifications, enzyme assays, organelle isolation or determination of metabolites during germination and pollen tube elongation. Despite recent progress in molecular biology techniques, sequence information on expressed proteins or transcripts in lily pollen is still scarce. Using a next generation sequencing strategy (RNAseq), the lily pollen transcriptome was investigated, resulting in more than 50 million high quality reads with a length of 90 base pairs. Sequenced transcripts were assembled and annotated, and finally visualized with MAPMAN software tools and compared with other RNAseq or genome data including Arabidopsis pollen, Lilium vegetative tissues and the Amborella trichopoda genome. All lily pollen sequence data are provided as open access files with suitable tools to search sequences of interest.

  14. The power of single molecule real-time sequencing technology in the de novo assembly of a eukaryotic genome.

    PubMed

    Sakai, Hiroaki; Naito, Ken; Ogiso-Tanaka, Eri; Takahashi, Yu; Iseki, Kohtaro; Muto, Chiaki; Satou, Kazuhito; Teruya, Kuniko; Shiroma, Akino; Shimoji, Makiko; Hirano, Takashi; Itoh, Takeshi; Kaga, Akito; Tomooka, Norihiko

    2015-11-30

    Second-generation sequencers (SGS) have been game-changing, achieving cost-effective whole genome sequencing in many non-model organisms. However, a large portion of these genomes still remains unassembled. We reconstructed the azuki bean (Vigna angularis) genome using single molecule real-time (SMRT) sequencing technology and achieved the best contiguity and coverage among currently assembled legume crops. The SMRT-based assembly produced contigs 100 times longer, with a 100 times smaller amount of gaps, compared to the SGS-based assemblies. A detailed comparison between the assemblies revealed that the SMRT-based assembly enabled a more comprehensive gene annotation than the SGS-based assemblies, in which thousands of genes were missing or fragmented. A chromosome-scale assembly was generated based on the high-density genetic map, covering 86% of the azuki bean genome. We demonstrated that SMRT technology, though it still needed the support of SGS data, achieved a near-complete assembly of a eukaryotic genome.

  15. Whole exome sequencing is necessary to clarify ID/DD cases with de novo copy number variants of uncertain significance: Two proof-of-concept examples.

    PubMed

    Giorgio, Elisa; Ciolfi, Andrea; Biamino, Elisa; Caputo, Viviana; Di Gregorio, Eleonora; Belligni, Elga Fabia; Calcia, Alessandro; Gaidolfi, Elena; Bruselles, Alessandro; Mancini, Cecilia; Cavalieri, Simona; Molinatto, Cristina; Cirillo Silengo, Margherita; Ferrero, Giovanni Battista; Tartaglia, Marco; Brusco, Alfredo

    2016-07-01

    Whole exome sequencing (WES) is a powerful tool to identify clinically undefined forms of intellectual disability/developmental delay (ID/DD), especially in consanguineous families. Here we report the genetic definition of two sporadic cases, with syndromic ID/DD for whom array-Comparative Genomic Hybridization (aCGH) identified a de novo copy number variant (CNV) of uncertain significance. The phenotypes included microcephaly with brachycephaly and a distinctive facies in one proband, and hypotonia in the legs and mild ataxia in the other. WES allowed identification of a functionally relevant homozygous variant affecting a known disease gene for rare syndromic ID/DD in each proband, that is, c.1423C>T (p.Arg377*) in the Trafficking Protein Particle Complex 9 (TRAPPC9), and c.154T>C (p.Cys52Arg) in the Very Low Density Lipoprotein Receptor (VLDLR). Four mutations affecting TRAPPC9 have been previously reported, and the present finding further depicts this syndromic form of ID, which includes microcephaly with brachycephaly, corpus callosum hypoplasia, facial dysmorphism, and overweight. VLDLR-associated cerebellar hypoplasia (VLDLR-CH) is characterized by non-progressive congenital ataxia and moderate-to-profound intellectual disability. The c.154T>C (p.Cys52Arg) mutation was associated with a very mild form of ataxia, mild intellectual disability, and cerebellar hypoplasia without cortical gyri simplification. In conclusion, we report two novel cases with rare causes of autosomal recessive ID, which document how interpreting de novo array-CGH variants represents a challenge in consanguineous families; as such, clinical WES should be considered in diagnostic testing. © 2016 Wiley Periodicals, Inc.

  16. De novo assembly and characterization of root transcriptome using Illumina paired-end sequencing and development of cSSR markers in sweetpotato (Ipomoea batatas)

    PubMed Central

    2010-01-01

    Background The tuberous root of sweetpotato is an important agricultural and biological organ. There are not sufficient transcriptomic and genomic data in public databases for understanding of the molecular mechanism underlying the tuberous root formation and development. Thus, high throughput transcriptome sequencing is needed to generate enormous transcript sequences from sweetpotato root for gene discovery and molecular marker development. Results In this study, more than 59 million sequencing reads were generated using Illumina paired-end sequencing technology. De novo assembly yielded 56,516 unigenes with an average length of 581 bp. Based on sequence similarity search with known proteins, a total of 35,051 (62.02%) genes were identified. Out of these annotated unigenes, 5,046 and 11,983 unigenes were assigned to gene ontology and clusters of orthologous group, respectively. Searching against the Kyoto Encyclopedia of Genes and Genomes Pathway database (KEGG) indicated that 17,598 (31.14%) unigenes were mapped to 124 KEGG pathways, and 11,056 were assigned to metabolic pathways, which were well represented by carbohydrate metabolism and biosynthesis of secondary metabolite. In addition, 4,114 cDNA SSRs (cSSRs) were identified as potential molecular markers in our unigenes. One hundred pairs of PCR primers were designed and used for validation of the amplification and assessment of the polymorphism in genomic DNA pools. The result revealed that 92 primer pairs were successfully amplified in initial screening tests. Conclusion This study generated a substantial fraction of sweetpotato transcript sequences, which can be used to discover novel genes associated with tuberous root formation and development and will also make it possible to construct high density microarrays for further characterization of gene expression profiles during these processes. 
    Thousands of cSSR markers identified in the present study can enrich molecular markers and will facilitate marker
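
The percentages quoted in this abstract can be reproduced from the reported counts. The short sketch below is only an arithmetic check using the figures given in the abstract (56,516 unigenes, 35,051 annotated, 17,598 mapped to KEGG pathways, 92 of 100 primer pairs amplified); the helper name `percent` is illustrative.

```python
def percent(part, total):
    """Share of `part` in `total`, as a percentage rounded to two decimals."""
    return round(100 * part / total, 2)


TOTAL_UNIGENES = 56_516  # unigenes reported for the sweetpotato root assembly

annotated = percent(35_051, TOTAL_UNIGENES)    # similarity-search hits
kegg_mapped = percent(17_598, TOTAL_UNIGENES)  # unigenes mapped to KEGG
primer_success = percent(92, 100)              # primer pairs that amplified

if __name__ == "__main__":
    print(annotated, kegg_mapped, primer_success)
```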

  17. De novo sequencing and analysis of the cranberry fruit transcriptome to identify putative genes involved in flavonoid biosynthesis, transport and regulation.

    PubMed

    Sun, Haiyue; Liu, Yushan; Gai, Yuzhuo; Geng, Jinman; Chen, Li; Liu, Hongdi; Kang, Limin; Tian, Youwen; Li, Yadong

    2015-09-02

    Cranberries (Vaccinium macrocarpon Ait.), renowned for their excellent health benefits, are an important berry crop. Here, we performed transcriptome sequencing of one cranberry cultivar, from fruits at two different developmental stages, on the Illumina HiSeq 2000 platform. Our main goals were to identify putative genes for major metabolic pathways of bioactive compounds and compare the expression patterns between white fruit (W) and red fruit (R) in cranberry. In this study, two cDNA libraries of W and R were constructed. Approximately 119 million raw sequencing reads were generated and assembled de novo, yielding 57,331 high quality unigenes with an average length of 739 bp. Using BLASTx, 38,460 unigenes were identified as putative homologs of annotated sequences in public protein databases, including NCBI NR, NT, Swiss-Prot, KEGG, COG and GO. Of these, 21,898 unigenes mapped to 128 KEGG pathways, with the metabolic pathways, secondary metabolites, glycerophospholipid metabolism, ether lipid metabolism, starch and sucrose metabolism, purine metabolism, and pyrimidine metabolism being well represented. Among them, many candidate genes were involved in flavonoid biosynthesis, transport and regulation. Furthermore, digital gene expression (DEG) analysis identified 3,257 unigenes that were differentially expressed between the two fruit developmental stages. In addition, 14,473 simple sequence repeats (SSRs) were detected. Our results present comprehensive gene expression information about the cranberry fruit transcriptome that could facilitate our understanding of the molecular mechanisms of fruit development in cranberries. Although it will be necessary to validate the functions carried out by these genes, these results could be used to improve the quality of breeding programs for the cranberry and related species.

  18. Transcriptome de novo assembly from next-generation sequencing and comparative analyses in the hexaploid salt marsh species Spartina maritima and Spartina alterniflora (Poaceae).

    PubMed

    Ferreira de Carvalho, J; Poulain, J; Da Silva, C; Wincker, P; Michon-Coudouel, S; Dheilly, A; Naquin, D; Boutte, J; Salmon, A; Ainouche, M

    2013-02-01

    Spartina species have a critical ecological role in salt marshes and represent an excellent system to investigate recurrent polyploid speciation. Using the 454 GS-FLX pyrosequencer, we assembled and annotated the first reference transcriptome (from roots and leaves) for two related hexaploid Spartina species that hybridize in Western Europe, the East American invasive Spartina alterniflora and the Euro-African S. maritima. The de novo read assembly generated 38 478 consensus sequences and 99% found an annotation using Poaceae databases, representing a total of 16 753 non-redundant genes. Spartina expressed sequence tags were mapped onto the Sorghum bicolor genome, where they were distributed among the subtelomeric arms of the 10 S. bicolor chromosomes, with high gene density correlation. Normalization of the complementary DNA library improved the number of annotated genes. Ecologically relevant genes were identified among GO biological function categories in salt and heavy metal stress response, C4 photosynthesis and in lignin and cellulose metabolism. Expression of some of these genes had been found to be altered by hybridization and genome duplication in a previous microarray-based study in Spartina. As these species are hexaploid, up to three duplicated homoeologs may be expected per locus. When analyzing sequence polymorphism at four different loci in S. maritima and S. alterniflora, we found up to four haplotypes per locus, suggesting the presence of two expressed homoeologous sequences with one or two allelic variants each. This reference transcriptome will allow analysis of specific Spartina genes of ecological or evolutionary interest, estimation of homoeologous gene expression variation using RNA-seq and further gene expression evolution analyses in natural populations.

  19. Transcriptome de novo assembly from next-generation sequencing and comparative analyses in the hexaploid salt marsh species Spartina maritima and Spartina alterniflora (Poaceae)

    PubMed Central

    Ferreira de Carvalho, J; Poulain, J; Da Silva, C; Wincker, P; Michon-Coudouel, S; Dheilly, A; Naquin, D; Boutte, J; Salmon, A; Ainouche, M

    2013-01-01

    Spartina species have a critical ecological role in salt marshes and represent an excellent system to investigate recurrent polyploid speciation. Using the 454 GS-FLX pyrosequencer, we assembled and annotated the first reference transcriptome (from roots and leaves) for two related hexaploid Spartina species that hybridize in Western Europe, the East American invasive Spartina alterniflora and the Euro-African S. maritima. The de novo read assembly generated 38 478 consensus sequences and 99% found an annotation using Poaceae databases, representing a total of 16 753 non-redundant genes. Spartina expressed sequence tags were mapped onto the Sorghum bicolor genome, where they were distributed among the subtelomeric arms of the 10 S. bicolor chromosomes, with high gene density correlation. Normalization of the complementary DNA library improved the number of annotated genes. Ecologically relevant genes were identified among GO biological function categories in salt and heavy metal stress response, C4 photosynthesis and in lignin and cellulose metabolism. Expression of some of these genes had been found to be altered by hybridization and genome duplication in a previous microarray-based study in Spartina. As these species are hexaploid, up to three duplicated homoeologs may be expected per locus. When analyzing sequence polymorphism at four different loci in S. maritima and S. alterniflora, we found up to four haplotypes per locus, suggesting the presence of two expressed homoeologous sequences with one or two allelic variants each. This reference transcriptome will allow analysis of specific Spartina genes of ecological or evolutionary interest, estimation of homoeologous gene expression variation using RNA-seq and further gene expression evolution analyses in natural populations. PMID:23149455

  20. De Novo Sequencing-Based Transcriptome and Digital Gene Expression Analysis Reveals Insecticide Resistance-Relevant Genes in Propylaea japonica (Thunberg) (Coleoptea: Coccinellidae)

    PubMed Central

    Jin, Feng-Liang; Qiu, Bao-Li; Wu, Jian-Hui; Ren, Shun-Xiang

    2014-01-01

    The ladybird Propylaea japonica (Thunberg) is one of the most important natural enemies of aphids in China. This species is threatened by the extensive use of insecticides, but genomics-based information on the molecular mechanisms underlying insecticide resistance is limited. Hence, we analyzed the transcriptome and expression profile data of P. japonica in order to gain a deeper understanding of insecticide resistance in ladybirds. We performed de novo assembly of a transcriptome using Illumina's Solexa sequencing technology and short reads. A total of 27,243,552 reads were generated. These were assembled into 81,458 contigs and 33,647 unigenes (6,862 clusters and 26,785 singletons). Of the unigenes, 23,965 (71.22%) have putative homologues in the non-redundant (nr) protein database from NCBI, using BLASTX, with a cut-off E-value of 10^-5. We examined COG, GO and KEGG annotations to better understand the functions of these unigenes. Digital gene expression (DGE) libraries showed differences in gene expression profiles between two insecticide resistant strains. When compared with an insecticide susceptible profile, a total of 4,692 genes were significantly up- or down-regulated in a moderately resistant strain. Among these genes, 125 putative insecticide resistance genes were identified. To confirm the DGE results, 16 selected genes were validated using quantitative real time PCR (qRT-PCR). This study is the first to report genetic information on P. japonica and has greatly enriched the sequence data for ladybirds. The large number of gene sequences produced from the transcriptome and DGE sequencing will greatly improve our understanding of this important insect at the molecular level and could contribute to in-depth research into insecticide resistance mechanisms. PMID:24959827

  1. BG7: A New Approach for Bacterial Genome Annotation Designed for Next Generation Sequencing Data

    PubMed Central

    Pareja-Tobes, Pablo; Manrique, Marina; Pareja-Tobes, Eduardo; Pareja, Eduardo; Tobes, Raquel

    2012-01-01

    BG7 is a new system for de novo bacterial, archaeal and viral genome annotation based on a new approach specifically designed for annotating genomes sequenced with next generation sequencing technologies. The system is versatile and able to annotate genes even at the preliminary assembly stage of the genome. It is especially efficient at detecting unexpected genes horizontally acquired from distant bacterial or archaeal genomes, phages, plasmids, and mobile elements. From the initial phases of the gene annotation process, BG7 exploits the massive availability of annotated protein sequences in databases. BG7 predicts ORFs and infers their function based on protein similarity with a wide set of reference proteins, integrating ORF prediction and functional annotation phases in just one step. BG7 is especially tolerant to sequencing errors in start and stop codons, to frameshifts, and to assembly or scaffolding errors. The system is also tolerant to the high level of gene fragmentation which is frequently found in not fully assembled genomes. The current version of BG7, which is developed in Java, takes advantage of Amazon Web Services (AWS) cloud computing features, but it can also be run locally on any operating system. BG7 is a fast, automated and scalable system that can cope with the challenge of analyzing the huge number of genomes that are being sequenced with NGS technologies. Its capabilities and efficiency were demonstrated in the 2011 EHEC Germany outbreak, in which BG7 was used to obtain the first annotations the day after the first entero-hemorrhagic E. coli genome sequences were made publicly available. The suitability of BG7 for genome annotation has been proved for Illumina, 454, Ion Torrent, and PacBio sequencing technologies. Besides, thanks to its plasticity, our system could be very easily adapted to work with new technologies in the future. PMID:23185310

  2. De Novo Assembly of the Quorum-Sensing Pandoraea sp. Strain RB-44 Complete Genome Sequence Using PacBio Single-Molecule Real-Time Sequencing Technology.

    PubMed

    Ee, Robson; Lim, Yan-Lue; Yin, Wai-Fong; Chan, Kok-Gan

    2014-04-03

    We report the first complete genome sequence of Pandoraea sp. strain RB-44, which was found to possess quorum-sensing properties. To the best of our knowledge, this is the first documentation of both a complete genome sequence and quorum-sensing properties of a Pandoraea species.

  3. De novo Sequencing and Transcriptome Analysis of Pinellia ternata Identify the Candidate Genes Involved in the Biosynthesis of Benzoic Acid and Ephedrine

    PubMed Central

    Zhang, Guang-hui; Jiang, Ni-hao; Song, Wan-ling; Ma, Chun-hua; Yang, Sheng-chao; Chen, Jun-wen

    2016-01-01

    Background: The medicinal herb Pinellia ternata is purported to be an anti-emetic with analgesic and sedative effects. Alkaloids are the main biologically active compounds in P. ternata, especially ephedrine, a phenylpropylamino alkaloid specifically produced by Ephedra and Catha edulis. However, how ephedrine is synthesized in plants is uncertain. Only the phenylalanine ammonia lyase (PAL) and relevant genes in this pathway have been characterized. Genomic information of P. ternata is also unavailable. Results: We analyzed the transcriptome of the tuber of P. ternata with the Illumina HiSeq™ 2000 sequencing platform. A total of 66,813,052 high-quality reads were generated, and these reads were assembled de novo into 89,068 unigenes. Most known genes involved in benzoic acid biosynthesis were identified in the unigene dataset of P. ternata, and the expression patterns of some ephedrine biosynthesis-related genes were analyzed by reverse transcription quantitative real-time PCR (RT-qPCR). Also, 14,468 simple sequence repeats (SSRs) were identified from 12,000 unigenes. Twenty primer pairs for SSRs were randomly selected for the validation of their amplification effect. Conclusion: RNA-seq data were used for the first time to provide comprehensive gene information on P. ternata at the transcriptional level. These data will advance molecular genetics in this valuable medicinal plant. PMID:27579029
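
    The abstract above reports simple sequence repeats (SSRs) mined from assembled unigenes but does not name the tool or thresholds used. As a rough illustration only, the sketch below scans a sequence for tandem di- and trinucleotide repeats with a regular expression. The function name `find_ssrs` and the minimum copy numbers (6 for dinucleotides, 5 for trinucleotides) are assumptions made for this example, not values from the study.

```python
import re

# Assumed thresholds: motif length -> minimum number of tandem copies.
# These are illustrative defaults, not parameters reported in the abstract.
MIN_REPEATS = {2: 6, 3: 5}


def find_ssrs(sequence, min_repeats=MIN_REPEATS):
    """Return a sorted list of (start, motif, copies) for perfect tandem repeats."""
    sequence = sequence.upper()
    hits = []
    for unit_len, min_copies in min_repeats.items():
        # One captured motif followed by at least (min_copies - 1) more copies.
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (unit_len, min_copies - 1))
        for match in pattern.finditer(sequence):
            motif = match.group(1)
            if len(set(motif)) == 1:
                continue  # skip homopolymer runs such as AAAAAA
            copies = len(match.group(0)) // unit_len
            hits.append((match.start(), motif, copies))
    return sorted(hits)


if __name__ == "__main__":
    toy_unigene = "TTG" + "AC" * 7 + "GGT"  # made-up sequence
    for start, motif, copies in find_ssrs(toy_unigene):
        print(start, motif, copies)
```

    Dedicated SSR tools also handle compound and imperfect repeats, which this sketch deliberately ignores.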

  4. De novo transcriptome sequencing and analysis of freshwater snail (Radix balthica) to discover genes and pathways affected by exposure to oxazepam.

    PubMed

    Mazzitelli, Jean-Yves; Bonnafe, Elsa; Klopp, Christophe; Escudier, Frédéric; Geret, Florence

    2017-01-01

    Pharmaceuticals are increasingly found in aquatic ecosystems due to the inefficiency of waste water treatment plants. Therefore, aquatic organisms are frequently exposed to a broad diversity of pharmaceuticals. The freshwater snail Radix balthica was chosen as a model to study the effects of oxazepam (a psychotropic drug) on developmental stages ranging from trochophore to hatching. To provide global insight into these effects, deep transcriptome sequencing was performed on exposed embryos. Eighteen libraries were sequenced, six for each of three conditions: control, exposed to the lowest oxazepam concentration with a phenotypic effect (delayed hatching) (TA) and exposed to the oxazepam concentration found in freshwater (TB). A total of 39,759,772 filtered raw reads were assembled into 56,435 contigs having a mean length of 1579.68 bp and mean depth of 378.96 reads. 44.91% of the contigs have at least one annotation. The differential expression analysis between the control condition and the two exposure conditions revealed 146 differentially expressed contigs, of which 144 were for TA and two for TB. 34.0% were annotated with biological function. Four processes were mainly impacted: two cellular signalling systems (Notch and JNK) and two biosynthesis pathways (Polyamine and Catecholamine pathways). This work reports a large-scale analysis of differentially transcribed genes of R. balthica exposed to oxazepam during egg development until hatching. In addition, these results enrich the de novo database of potential ecotoxicological models.

  5. Resurrection of a Clinical Antibody: Template ProteoGenomic de novo Proteomic Sequencing and Reverse Engineering of an Anti-Lymphotoxin Alpha Antibody

    PubMed Central

    Castellana, Natalie E.; McCutcheon, Krista; Pham, Victoria C.; Harden, Kristin; Nguyen, Allen; Young, Judy; Adams, Camellia; Schroeder, Kurt; Arnott, David; Bafna, Vineet; Grogan, Jane L.; Lill, Jennie R.

    2011-01-01

    A mouse hybridoma antibody directed against a member of the TNF-superfamily, lymphotoxin alpha (LT-α), was isolated from stored mouse ascites and purified to homogeneity. After more than a decade of storage the genetic material was not available for cloning, however biochemical assays with the ascites showed this antibody against LT-α (LT-3F12) to be a pre-clinical candidate for the treatment of several inflammatory pathologies. We have successfully rescued the LT-3F12 antibody by performing mass spectrometric analysis, primary amino acid sequence determination by template proteogenomics, and synthesis of the corresponding recombinant DNA by reverse engineering. The resurrected antibody was expressed, purified and shown to demonstrate the desired specificity and binding properties in a panel of immuno-biochemical tests. The work described herein demonstrates the powerful combination of high throughput informatic proteomic de novo sequencing with reverse engineering to re-establish monoclonal antibody expressing cells from archived protein sample, exemplifying the development of novel therapeutics from cryptic protein sources. PMID:21268269

  6. Identification of a Novel De Novo Variant in the PAX3 Gene in Waardenburg Syndrome by Diagnostic Exome Sequencing: The First Molecular Diagnosis in Korea

    PubMed Central

    Jang, Mi-Ae; Lee, Taeheon; Lee, Junnam

    2015-01-01

    Waardenburg syndrome (WS) is a clinically and genetically heterogeneous hereditary auditory pigmentary disorder characterized by congenital sensorineural hearing loss and iris discoloration. Many genes have been linked to WS, including PAX3, MITF, SNAI2, EDNRB, EDN3, and SOX10, and many additional genes have been associated with disorders with phenotypic overlap with WS. To screen all possible genes associated with WS and congenital deafness simultaneously, we performed diagnostic exome sequencing (DES) in a male patient with clinical features consistent with WS. Using DES, we identified a novel missense variant (c.220C>G; p.Arg74Gly) in exon 2 of the PAX3 gene in the patient. Further analysis by Sanger sequencing of the patient and his parents revealed a de novo occurrence of the variant. Our findings show that DES can be a useful tool for the identification of pathogenic gene variants in WS patients and for differentiation between WS and similar disorders. To the best of our knowledge, this is the first report of genetically confirmed WS in Korea. PMID:25932447

  7. De novo next-generation sequencing, assembling and annotation of Arachis hypogaea L. Spanish botanical type whole plant transcriptome.

    PubMed

    Wu, Ning; Matand, Kanyand; Wu, Huijuan; Li, Baoming; Li, Yue; Zhang, Xiaoli; He, Zheng; Qian, Jialin; Liu, Xu; Conley, Stephan; Bailey, Marshall; Acquaah, George

    2013-05-01

    Peanut is a major agronomic crop within the legume family and an important source of plant oil, proteins, vitamins, and minerals for human consumption, as well as animal feed, bioenergy, and health products. Peanut genomic research lags that of other legumes of economic importance, mainly due to the shortage of essential genomic infrastructure, tools, and resources, and the complexity of the peanut genome. This is a pioneering study that explored the peanut Spanish Group whole plant transcriptome and culminated in developing a unigene database. The study applied modern technologies such as normalization and next-generation sequencing. Overall, it sequenced 8,308,655,800 nucleotides and generated 26,048 unigenes, of which 12,302 were annotated and 8,817 were characterized. The remainder, 13,746 (52.77%) unigenes, had unknown functions. These results will be applied as the reference transcriptome sequences for expanded transcriptome sequencing of the remaining three peanut botanical types (Valencia, Runner, and Virginia), which is currently in progress, and for RNA-seq, exome identification, and genomic marker development. It will also provide important tools and resources for genomic research in other legumes and plant species.

  8. De novo sequencing and analysis of the transcriptome of Panax ginseng in the leaf-expansion period.

    PubMed

    Liu, Shichao; Wang, Siming; Liu, Meichen; Yang, Fei; Zhang, Hui; Liu, Shiyang; Wang, Qun; Zhao, Yu

    2016-08-01

    Panax ginseng, a traditional Chinese medicine, is used worldwide for its variety of health benefits and its treatment efficacy. However, it is difficult to cultivate due to its vulnerability to environmental stresses. The present study provided the first report, to the best of our knowledge, of transcriptome analysis of ginseng at the leaf‑expansion stage. Using the Illumina sequencing platform, >40,000,000 high‑quality paired‑end reads were obtained and assembled into 100,533 unique sequences. When the sequences were searched against the publicly available National Center for Biotechnology Information protein database using The Basic Local Alignment Search Tool, 61,599 sequences exhibited similarity to known proteins. Functional annotation and classification, including use of the Gene Ontology, Clusters of Orthologous Groups, and Kyoto Encyclopedia of Genes and Genomes databases, revealed that the activated genes in ginseng were predominantly ribonuclease‑like storage genes, environmental stress genes, pathogenesis-related genes and other antioxidant genes. A number of candidate genes in environmental stress‑associated pathways were also identified. These novel data provide useful information on the growth and development stages of ginseng, and serve as an important public information platform for further understanding of the molecular mechanisms and functional genomics of ginseng.

  9. Sequencing and De Novo Assembly of the Complete Chloroplast Genome of the Peruvian Carrot (Arracacia xanthorrhiza Bancroft)

    PubMed Central

    Alvarado, Javier Santiago; López, Diane Hinojosa; Torres, Isaury Maldonado; Meléndez, María Margarita; Batista, Rosalinda Aybar; Raxwal, Vivek K.; Berríos, Juan A. Negrón

    2017-01-01

    ABSTRACT Arracacia xanthorrhiza is an important secondary food crop in South America and Puerto Rico. The lack of crop protection and improvement strategies leads to infections damaging the storage roots. Here, we report the annotated complete chloroplast genome sequence of A. xanthorrhiza as a step toward developing genomic resources for this crop. PMID:28209812

  10. Sequencing and De Novo Assembly of the Complete Chloroplast Genome of the Peruvian Carrot (Arracacia xanthorrhiza Bancroft).

    PubMed

    Alvarado, Javier Santiago; López, Diane Hinojosa; Torres, Isaury Maldonado; Meléndez, María Margarita; Batista, Rosalinda Aybar; Raxwal, Vivek K; Berríos, Juan A Negrón; Arun, Alok

    2017-02-16

    Arracacia xanthorrhiza is an important secondary food crop in South America and Puerto Rico. The lack of crop protection and improvement strategies leads to infections damaging the storage roots. Here, we report the annotated complete chloroplast genome sequence of A. xanthorrhiza as a step toward developing genomic resources for this crop.

  11. Is de novo stress incontinence after sacrocolpopexy related to anatomical changes and surgical approach?

    PubMed

    LeClaire, Edgar L; Mukati, Marium S; Juarez, Dianna; White, Dena; Quiroz, Lieschen H

    2014-09-01

    The objective was to investigate the relationship between new onset postoperative stress urinary incontinence (SUI) after sacrocolpopexy (SCP) and anatomical change/surgical approach. We analyzed a retrospective cohort of patients with negative preoperative testing for SUI who underwent SCP from 2005 to 2012. Our primary outcome was new onset postoperative SUI. Logistic regression was used to examine the relationship among anatomical change, defined as ΔAa, ΔBa, ΔC, and ΔTVL, and surgical approach, categorized as abdominal (ASCP) for open cases and minimally invasive (MISCP) for laparoscopic and robot-assisted cases, and postoperative SUI. Of 795 cases, 33 ASCP (43%) and 44 MISCP (57%) met the inclusion criteria for analysis. New onset SUI was demonstrated by 15 patients (45%) of the ASCP group and 7 patients (15%) of the MISCP group (p = 0.005). New onset SUI was significantly associated with route of SCP and ΔAa (p = 0.006 and p = 0.033 respectively). Controlling for ΔAa, the odds of new onset SUI were 4.4 times higher in the ASCP group compared with the MISCP group (OR 4.37, 95% CI 1.42, 13.48). Controlling for route of SCP, the odds of new onset SUI were 2.2 times higher with moderate ΔAa compared with low ΔAa (OR 2.16, 95% CI 1.07, 4.38). The odds of new onset SUI were 4.7 times higher in those with high ΔAa than in those with low ΔAa (OR 4.67, 95% CI 1.14, 19.22). ΔBa, ΔC, and ΔTVL were not associated with new onset SUI. Greater reduction in point Aa and abdominal surgical route are risk factors for new onset postoperative SUI after SCP.
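
    For readers less familiar with odds ratios, the short sketch below computes a crude (unadjusted) odds ratio from the group counts quoted in the abstract above (15 of 33 ASCP patients and 7 of 44 MISCP patients with new onset SUI). This is only an illustration: the odds ratios reported in the abstract come from logistic regression that controls for other variables, so the crude value need not match them exactly. The function name `odds_ratio` is a hypothetical helper.

```python
def odds_ratio(events_a, total_a, events_b, total_b):
    """Crude odds ratio of group A versus group B from event counts."""
    odds_a = events_a / (total_a - events_a)  # events : non-events in group A
    odds_b = events_b / (total_b - events_b)  # events : non-events in group B
    return odds_a / odds_b


# Counts taken from the abstract: new onset SUI in ASCP vs MISCP.
crude_or = odds_ratio(15, 33, 7, 44)

if __name__ == "__main__":
    print(round(crude_or, 2))
```

    The crude figure lands close to the adjusted estimate for surgical route, which is consistent with ΔAa changing that estimate only modestly.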

  12. De Novo proteome analysis of genetically modified tumor cells by a metabolic labeling/azide-alkyne cycloaddition approach.

    PubMed

    Ballikaya, Seda; Lee, Jennifer; Warnken, Uwe; Schnölzer, Martina; Gebert, Johannes; Kopitz, Jürgen

    2014-12-01

    Activin receptor type II (ACVR2) is a member of the transforming growth factor type II receptor family and controls cell growth and differentiation, thereby acting as a tumor suppressor. ACVR2 inactivation is known to drive colorectal tumorigenesis. We used an ACVR2-deficient microsatellite unstable colon cancer cell line (HCT116) to set up a novel experimental design for comprehensive analysis of proteomic changes associated with such functional loss of a tumor suppressor. To this end we combined two existing technologies. First, the ACVR2 gene was reconstituted in an ACVR2-deficient colorectal cancer (CRC) cell line by means of recombinase-mediated cassette exchange, resulting in the generation of an inducible expression system that allowed the regulation of ACVR2 gene expression in a doxycycline-dependent manner. Functional expression in the induced cells was explicitly proven. Second, we used the methionine analog azidohomoalanine for metabolic labeling of newly synthesized proteins in our cell line model. Labeled proteins were tagged with biotin via a Click-iT chemistry approach enabling specific extraction of labeled proteins by streptavidin-coated beads. Tryptic on-bead digestion of captured proteins and subsequent ultra-high-performance LC coupled to LTQ Orbitrap XL mass spectrometry identified 513 proteins, with 25 of them differentially expressed between ACVR2-deficient and -proficient cells. Among these, several candidates that had already been linked to colorectal cancer or were known to play a key role in cell growth or apoptosis control were identified, proving the utility of the presented experimental approach. In principle, this strategy can be adapted to analyze any gene of interest and its effect on the cellular de novo proteome.

  13. Identification of critical genes associated with lignin biosynthesis in radish (Raphanus sativus L.) by de novo transcriptome sequencing.

    PubMed

    Feng, Haiyang; Xu, Liang; Wang, Yan; Tang, Mingjia; Zhu, Xianwen; Zhang, Wei; Sun, Xiaochuan; Nie, Shanshan; Muleke, Everlyne M'mbone; Liu, Liwang

    2017-06-30

    Radish is an important root vegetable crop with high nutritional, economic, and medicinal value. Lignin is an important secondary metabolite with a great effect on plant growth and product quality. To date, lignin biosynthesis-related genes have been identified in some important plant species. However, little information on the characterization of critical genes involved in plant lignin biosynthesis is available in radish. In this study, a total of 71,148 transcript sequences were obtained from radish root, of which 66 assembled unigenes and ten candidate genes were identified to be involved in lignin monolignol biosynthesis. Full-length cDNA sequences of seven randomly selected genes were isolated and sequenced from radish root, and the assembled unigenes covered more than 80% of their corresponding cDNA sequences. Moreover, the lignin content gradually accumulated in leaf during the developmental stages; in root, it increased from the pre-cortex to the cortex splitting stage, decreased at the thickening stage and then increased again at the mature stage. RT-qPCR analysis revealed that all these genes except RsF5H exhibited relatively low expression levels in root at the thickening stage. The expression profiles of the Rs4CL5, RsCCoAOMT1, and RsCOMT genes were consistent with the changes in root lignin content, implying that these candidate genes may play important roles in lignin formation in radish root. These findings provide valuable information for the identification of lignin biosynthesis-related genes and facilitate dissection of the molecular mechanism underlying lignin biosynthesis in radish and other root vegetable crops.

  14. A domain sequence approach to pangenomics: applications to Escherichia coli

    PubMed Central

    Snipen, Lars-Gustav

    2013-01-01

    The study of microbial pangenomes relies on the computation of gene families, i.e. the clustering of coding sequences into groups of essentially similar genes. There is no standard approach to obtain such gene families. Ideally, the gene family computations should be robust against errors in the annotation of genes in various genomes. In an attempt to achieve this robustness, we propose to cluster sequences by their domain sequence, i.e. the ordered sequence of domains in their protein sequence. In a study of 347 genomes from Escherichia coli we find on average around 4500 proteins having hits in Pfam-A in every genome, clustering into around 2500 distinct domain sequence families in each genome. Across all genomes we find a total of 5724 such families. A binomial mixture model approach indicates this is around 95% of all domain sequences we would expect to see in E. coli in the future. A Heaps' law analysis indicates the population of domain sequences is larger, but this analysis is also very sensitive to small changes in the computation procedure. The resolution between strains is good despite the coarse grouping obtained by domain sequence families. Clustering sequences by their ordered domain content gives us domain sequence families, which are robust to errors in the gene prediction step. The computational load of the procedure scales linearly with the number of genomes, which is needed for the future explosion in the number of re-sequenced strains. The use of domain sequence families for a functional classification of strains clearly has some potential to be explored. PMID:24555018
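
    The grouping idea described above, clustering proteins by the ordered sequence of domains they contain, can be sketched in a few lines. This is a minimal illustration and not the author's implementation: it assumes domain hits have already been computed and ordered along each protein, and the protein identifiers, the domain labels ("domA", "domB") and the function name `domain_sequence_families` are all hypothetical.

```python
from collections import defaultdict


def domain_sequence_families(protein_domains):
    """Group protein IDs by their ordered tuple of domain labels.

    protein_domains maps a protein ID to the list of its domain hits,
    ordered from N-terminus to C-terminus.
    """
    families = defaultdict(list)
    for protein_id, domains in protein_domains.items():
        if domains:  # proteins without any domain hit are left out
            families[tuple(domains)].append(protein_id)
    return dict(families)


# Made-up toy input: two genomes, placeholder domain labels.
toy_hits = {
    "genomeA_p1": ["domA", "domB"],
    "genomeB_p7": ["domA", "domB"],
    "genomeB_p9": ["domB", "domA"],  # same domains, different order
    "genomeA_p3": [],                # no domain hit
}

if __name__ == "__main__":
    for family, members in domain_sequence_families(toy_hits).items():
        print(family, members)
```

    Because the key is the ordered tuple, two proteins with the same domains in a different order fall into different families, and each protein is processed once, which matches the linear scaling mentioned in the abstract.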

  15. De Novo Genome Assembly of the Economically Important Weed Horseweed Using Integrated Data from Multiple Sequencing Platforms

    PubMed Central

    Peng, Yanhui; Lai, Zhao; Lane, Thomas; Nageswara-Rao, Madhugiri; Okada, Miki; Jasieniuk, Marie; O’Geen, Henriette; Kim, Ryan W.; Sammons, R. Douglas; Rieseberg, Loren H.; Stewart, C. Neal

    2014-01-01

    Horseweed (Conyza canadensis), a member of the Compositae (Asteraceae) family, was the first broadleaf weed to evolve resistance to glyphosate. Horseweed, one of the most problematic weeds in the world, is a true diploid (2n = 2x = 18), with the smallest genome of any known agricultural weed (335 Mb). Thus, it is an appropriate candidate to help us understand the genetic and genomic bases of weediness. We undertook a draft de novo genome assembly of horseweed by combining data from multiple sequencing platforms (454 GS-FLX, Illumina HiSeq 2000, and PacBio RS) using various libraries with different insertion sizes (approximately 350 bp, 600 bp, 3 kb, and 10 kb) of a Tennessee-accessed, glyphosate-resistant horseweed biotype. From 116.3 Gb (approximately 350× coverage) of data, the genome was assembled into 13,966 scaffolds, with 50% of the assembly contained in scaffolds of at least 33,561 bp (the scaffold N50). The assembly covered 92.3% of the genome, including the complete chloroplast genome (approximately 153 kb) and a nearly complete mitochondrial genome (approximately 450 kb in 120 scaffolds). The nuclear genome is composed of 44,592 protein-coding genes. Genome resequencing of seven additional horseweed biotypes was performed. These sequence data were assembled and used to analyze genome variation. Simple sequence repeats and single-nucleotide polymorphisms were surveyed. Genomic patterns were detected that were associated with glyphosate-resistant or -susceptible biotypes. The draft genome will be useful to better understand weediness and the evolution of herbicide resistance and to devise new management strategies. The genome will also be useful as another reference genome in the Compositae. To our knowledge, this article represents the first published draft genome of an agricultural weed. PMID:25209985
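
    The "50% of the assembly" figure quoted in the abstract above is the statistic usually called N50: the scaffold length at which scaffolds of that length or longer together hold at least half of the assembled bases. The sketch below shows how it is commonly computed. The function name `n50` and the toy scaffold lengths are assumptions for illustration; the real assembly's scaffold lengths are not given in the abstract.

```python
def n50(lengths):
    """Return the N50 of a collection of scaffold (or contig) lengths."""
    if not lengths:
        raise ValueError("need at least one scaffold length")
    total = sum(lengths)
    running = 0
    # Walk from the longest scaffold down until half the bases are covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length


# Made-up scaffold lengths (bp), not data from the horseweed assembly.
toy_scaffolds = [100, 80, 60, 40, 20]

if __name__ == "__main__":
    print(n50(toy_scaffolds))  # prints 80: 100 + 80 covers at least half of 300
```

    Comparing `2 * running` with `total` keeps the arithmetic in integers and avoids rounding issues when the total is odd.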

  16. Sequencing, de novo assembly and characterization of the spotted scat Scatophagus argus (Linnaeus 1766) transcriptome for discovery of reproduction related genes and SSRs

    NASA Astrophysics Data System (ADS)

    Yang, Wei; Chen, Huapu; Cui, Xuefan; Zhang, Kewei; Jiang, Dongneng; Deng, Siping; Zhu, Chunhua; Li, Guangli

    2017-09-01

    Spotted scat (Scatophagus argus) is an economically important farmed fish, particularly in East and Southeast Asia. Because there has been little research on reproductive development and regulation in this species, the lack of a mature artificial reproduction technology remains a barrier for the sustainable development of the aquaculture industry. More genetic and genomic background knowledge is urgently needed for an in-depth understanding of the molecular mechanism of reproductive process and identification of functional genes related to sexual differentiation, gonad maturation and gametogenesis. For these reasons, we performed transcriptomic analysis on spotted scat using a multiple tissue sample mixing strategy. The Illumina RNA sequencing generated 118 510 486 raw reads. After trimming, de novo assembly was performed and yielded 99 888 unigenes with an average length of 905.75 bp. A total of 45 015 unigenes were successfully annotated to the Nr, Swiss-Prot, KOG and KEGG databases. Additionally, 23 783 and 27 183 annotated unigenes were assigned to 56 Gene Ontology (GO) functional groups and 228 KEGG pathways, respectively. Subsequently, 2 474 transcripts associated with reproduction were selected using GO term and KEGG pathway assignments, and a number of reproduction-related genes involved in sex differentiation, gonad development and gametogenesis were identified. Furthermore, 22 279 simple sequence repeat (SSR) loci were discovered and characterized. The comprehensive transcript dataset described here greatly increases the genetic information available for spotted scat and contributes valuable sequence resources for functional gene mining and analysis. Candidate transcripts involved in reproduction would make good starting points for future studies on reproductive mechanisms, and the putative sex differentiation-related genes will be helpful for sex-determining gene identification and sex-specific marker isolation. Lastly, the SSRs can serve as marker

  17. MIDDAS-M: Motif-Independent De Novo Detection of Secondary Metabolite Gene Clusters through the Integration of Genome Sequencing and Transcriptome Data

    PubMed Central

    Umemura, Myco; Koike, Hideaki; Nagano, Nozomi; Ishii, Tomoko; Kawano, Jin; Yamane, Noriko; Kozone, Ikuko; Horimoto, Katsuhisa; Shin-ya, Kazuo; Asai, Kiyoshi; Yu, Jiujiang; Bennett, Joan W.; Machida, Masayuki

    2013-01-01

    Many bioactive natural products are produced as “secondary metabolites” by plants, bacteria, and fungi. During the middle of the 20th century, several secondary metabolites from fungi revolutionized the pharmaceutical industry, for example, penicillin, lovastatin, and cyclosporine. They are generally biosynthesized by enzymes encoded by clusters of coordinately regulated genes, and several motif-based methods have been developed to detect secondary metabolite biosynthetic (SMB) gene clusters using the sequence information of typical SMB core genes such as polyketide synthases (PKS) and non-ribosomal peptide synthetases (NRPS). However, no detection method exists for SMB gene clusters that are functional and do not include core SMB genes at present. To advance the exploration of SMB gene clusters, especially those without known core genes, we developed MIDDAS-M, a motif-independent de novo detection algorithm for SMB gene clusters. We integrated virtual gene cluster generation in an annotated genome sequence with highly sensitive scoring of the cooperative transcriptional regulation of cluster member genes. MIDDAS-M accurately predicted 38 SMB gene clusters that have been experimentally confirmed and/or predicted by other motif-based methods in 3 fungal strains. MIDDAS-M further identified a new SMB gene cluster for ustiloxin B, which was experimentally validated. Sequence analysis of the cluster genes indicated a novel mechanism for peptide biosynthesis independent of NRPS. Because it is fully computational and independent of empirical knowledge about SMB core genes, MIDDAS-M allows a large-scale, comprehensive analysis of SMB gene clusters, including those with novel biosynthetic mechanisms that do not contain any functionally characterized genes. PMID:24391870

  18. De novo assembly and characterization of transcriptome using Illumina paired-end sequencing and identification of CesA gene in ramie (Boehmeria nivea L. Gaud)

    PubMed Central

    2013-01-01

    Background Ramie fiber, extracted from vegetative organ stem bast, is one of the most important natural fibers. Understanding the molecular mechanisms of the vegetative growth of the ramie and the formation and development of bast fiber is essential for improving the yield and quality of the ramie fiber. However, the 418 expressed sequence tags (ESTs) of ramie deposited in public databases are far from sufficient to understand the molecular mechanisms. Thus, high-throughput transcriptome sequencing is essential to generate enormous ramie transcript sequences for the purpose of gene discovery, especially genes such as the cellulose synthase (CesA) gene. Results Using Illumina paired-end sequencing, about 53 million sequencing reads were generated. De novo assembly yielded 43,990 unigenes with an average length of 824 bp. By sequence similarity searching for known proteins, a total of 34,192 (77.7%) genes were annotated for their function. Out of these annotated unigenes, 16,050 and 13,042 unigenes were assigned to gene ontology and clusters of orthologous group, respectively. Searching against the Kyoto Encyclopedia of Genes and Genomes Pathway database (KEGG) indicated that 19,846 unigenes were mapped to 126 KEGG pathways, and 565 genes were assigned to the starch and sucrose metabolic pathway, which is related to cellulose biosynthesis. Additionally, 51 CesA genes involved in cellulose biosynthesis were identified. Analysis of the tissue-specific expression pattern of the 51 CesA genes revealed that 36 genes had relatively high expression levels in the stem bark, which suggests that they are most likely responsible for the biosynthesis of bast fiber. Conclusion To the best of our knowledge, this study is the first to characterize the ramie transcriptome and the substantial amount of transcripts obtained will accelerate the understanding of the ramie vegetative growth and development mechanism. Moreover, discovery of the 36 CesA genes with

  19. De novo assembly and characterization of the transcriptome of seagrass Zostera marina using Illumina paired-end sequencing.

    PubMed

    Kong, Fanna; Li, Hong; Sun, Peipei; Zhou, Yang; Mao, Yunxiang

    2014-01-01

    The seagrass Zostera marina is a monocotyledonous angiosperm belonging to a polyphyletic group of plants that can live submerged in marine habitats. Zostera marina L. is one of the most common seagrasses and is considered a cornerstone of marine plant molecular ecology research and comparative studies. However, the mechanisms underlying its adaptation to the marine environment still remain poorly understood due to limited transcriptomic and genomic data. Here we explored the transcriptome of Z. marina leaves under different environmental conditions using Illumina paired-end sequencing. Approximately 55 million sequencing reads were obtained, representing 58,457 transcripts that correspond to 24,216 unigenes. A total of 14,389 (59.41%) unigenes were annotated by blast searches against the NCBI non-redundant protein database. 45.18% and 46.91% of the unigenes had significant similarity with proteins in the Swiss-Prot database and Pfam database, respectively. Among these, 13,897 unigenes were assigned to 57 Gene Ontology (GO) terms and 4,745 unigenes were identified and mapped to 233 pathways via functional annotation against the Kyoto Encyclopedia of Genes and Genomes pathway database (KEGG). We compared the orthologous gene family of the Z. marina transcriptome to Oryza sativa and Pyropia yezoensis and 11,667 orthologous gene families are specific to Z. marina. Furthermore, we identified the photoreceptors sensing red/far-red light and blue light. Also, we identified a large number of genes that are involved in ion transporters and channels including Na+ efflux, K+ uptake, Cl- channels, and H+ pumping. Our study contains an extensive sequencing and gene-annotation analysis of Z. marina. This information represents a genetic resource for the discovery of genes related to light sensing and salt tolerance in this species. Our transcriptome can be further utilized in future studies on molecular adaptation to abiotic stress in Z. marina.

  20. De Novo Assembly and Characterization of the Transcriptome of Seagrass Zostera marina Using Illumina Paired-End Sequencing

    PubMed Central

    Kong, Fanna; Li, Hong; Sun, Peipei; Zhou, Yang; Mao, Yunxiang

    2014-01-01

    Background The seagrass Zostera marina is a monocotyledonous angiosperm belonging to a polyphyletic group of plants that can live submerged in marine habitats. Zostera marina L. is one of the most common seagrasses and is considered a cornerstone of marine plant molecular ecology research and comparative studies. However, the mechanisms underlying its adaptation to the marine environment still remain poorly understood due to limited transcriptomic and genomic data. Principal Findings Here we explored the transcriptome of Z. marina leaves under different environmental conditions using Illumina paired-end sequencing. Approximately 55 million sequencing reads were obtained, representing 58,457 transcripts that correspond to 24,216 unigenes. A total of 14,389 (59.41%) unigenes were annotated by blast searches against the NCBI non-redundant protein database. 45.18% and 46.91% of the unigenes had significant similarity with proteins in the Swiss-Prot database and Pfam database, respectively. Among these, 13,897 unigenes were assigned to 57 Gene Ontology (GO) terms and 4,745 unigenes were identified and mapped to 233 pathways via functional annotation against the Kyoto Encyclopedia of Genes and Genomes pathway database (KEGG). We compared the orthologous gene family of the Z. marina transcriptome to Oryza sativa and Pyropia yezoensis and 11,667 orthologous gene families are specific to Z. marina. Furthermore, we identified the photoreceptors sensing red/far-red light and blue light. Also, we identified a large number of genes that are involved in ion transporters and channels including Na+ efflux, K+ uptake, Cl− channels, and H+ pumping. Conclusions Our study contains an extensive sequencing and gene-annotation analysis of Z. marina. This information represents a genetic resource for the discovery of genes related to light sensing and salt tolerance in this species. Our transcriptome can be further utilized in future studies on molecular adaptation to abiotic stress in

  1. Specific versus non-specific immune responses in an invertebrate species evidenced by a comparative de novo sequencing study.

    PubMed

    Deleury, Emeline; Dubreuil, Géraldine; Elangovan, Namasivayam; Wajnberg, Eric; Reichhart, Jean-Marc; Gourbal, Benjamin; Duval, David; Baron, Olga Lucia; Gouzy, Jérôme; Coustau, Christine

    2012-01-01

    Our present understanding of the functioning and evolutionary history of invertebrate innate immunity derives mostly from studies on a few model species belonging to ecdysozoa. In particular, the characterization of signaling pathways dedicated to specific responses towards fungi and Gram-positive or Gram-negative bacteria in Drosophila melanogaster challenged our original view of a non-specific immunity in invertebrates. However, much remains to be elucidated from lophotrochozoan species. To investigate the global specificity of the immune response in the fresh-water snail Biomphalaria glabrata, we used massive Illumina sequencing of 5'-end cDNAs to compare expression profiles after challenge by Gram-positive or Gram-negative bacteria or after a yeast challenge. 5'-end cDNA sequencing of the libraries yielded over 12 million high-quality reads. To link these short reads to expressed genes, we prepared a reference transcriptomic database through automatic assembly and annotation of the 758,510 redundant sequences (ESTs, mRNAs) of B. glabrata available in public databases. Computational analysis of Illumina reads followed by multivariate analyses allowed identification of 1685 candidate transcripts differentially expressed after an immune challenge, with a twofold ratio between transcripts showing a challenge-specific expression versus a lower or non-specific differential expression. Differential expression has been validated using quantitative PCR for a subset of randomly selected candidates. Predicted functions of annotated candidates (approx. 700 unisequences) belonged to a large extent to similar functional categories or protein types. This work significantly expands upon previous gene discovery and expression studies on B. glabrata and suggests that responses to various pathogens may involve similar immune processes or signaling pathways but different genes belonging to multigenic families. These results raise the question of the importance of gene

  2. De novo Taproot Transcriptome Sequencing and Analysis of Major Genes Involved in Sucrose Metabolism in Radish (Raphanus sativus L.)

    PubMed Central

    Yu, Rugang; Xu, Liang; Zhang, Wei; Wang, Yan; Luo, Xiaobo; Wang, Ronghua; Zhu, Xianwen; Xie, Yang; Karanja, Benard; Liu, Liwang

    2016-01-01

    Radish (Raphanus sativus L.) is an important annual or biennial root vegetable crop. The fleshy taproot comprises the main edible portion of the plant with high nutrition and medical value. Molecular biology research on radish began rather late, and public databases lack sufficient transcriptomic and genomic data for understanding the molecular mechanisms of radish taproot formation. To develop a comprehensive overview of the ‘NAU-YH’ root transcriptome, a cDNA library, prepared from three equally mixed RNA of taproots at different developmental stages including pre-cortex splitting stage, cortex splitting stage, and expanding stage was sequenced using high-throughput Illumina RNA sequencing. From approximately 51 million clean reads, a total of 70,168 unigenes with a total length of 50.28 Mb, an average length of 717 bp and an N50 of 994 bp were obtained. In total, 63,991 unigenes (about 91.20% of the assembled unigenes) were successfully annotated in five public databases including NR, GO, COG, KEGG, and Nt. GO analysis revealed that the majority of these unigenes were predominantly involved in basic physiological and metabolic processes, catalytic, binding, and cellular process. In addition, a total of 103 unigenes encoding eight enzymes involved in the sucrose metabolism related pathways were also identified by KEGG pathway analysis. Sucrose synthase (29 unigenes), invertase (17 unigenes), sucrose-phosphate synthase (16 unigenes), fructokinase (17 unigenes), and hexokinase (11 unigenes) ranked top five in these eight key enzymes. Of these, two genes (RsSuSy1, RsSPS1) were validated by T-A cloning and sequenced, while the expression of six unigenes was profiled with RT-qPCR analysis. These results will serve as an important public reference platform to identify the related key genes during taproot thickening and facilitate the dissection of molecular mechanisms underlying taproot formation in radish. PMID:27242808
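
    The N50 quoted above is a standard assembly summary statistic: the length L such that sequences of length L or longer contain at least half of all assembled bases. A minimal sketch of the calculation follows; the function name and the toy lengths are illustrative assumptions and are not taken from the study.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths.

    N50 is the largest length L such that sequences of length >= L
    together contain at least half of the total assembled bases.
    """
    if not lengths:
        raise ValueError("need at least one sequence length")
    half_total = sum(lengths) / 2
    running = 0
    # Walk from the longest sequence down, accumulating bases.
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= half_total:
            return length


# Toy unigene lengths in bp (hypothetical, for illustration only).
toy_lengths = [2, 3, 4, 5, 6]
toy_n50 = n50(toy_lengths)
```

    Because the statistic is weighted by bases rather than by sequence count, it can sit well above the mean length when a few long unigenes dominate the assembly.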

  3. De novo Taproot Transcriptome Sequencing and Analysis of Major Genes Involved in Sucrose Metabolism in Radish (Raphanus sativus L.).

    PubMed

    Yu, Rugang; Xu, Liang; Zhang, Wei; Wang, Yan; Luo, Xiaobo; Wang, Ronghua; Zhu, Xianwen; Xie, Yang; Karanja, Benard; Liu, Liwang

    2016-01-01

    Radish (Raphanus sativus L.) is an important annual or biennial root vegetable crop. The fleshy taproot comprises the main edible portion of the plant with high nutrition and medical value. Molecular biology research on radish began rather late, and public databases lack sufficient transcriptomic and genomic data for understanding the molecular mechanisms of radish taproot formation. To develop a comprehensive overview of the 'NAU-YH' root transcriptome, a cDNA library, prepared from three equally mixed RNA of taproots at different developmental stages including pre-cortex splitting stage, cortex splitting stage, and expanding stage was sequenced using high-throughput Illumina RNA sequencing. From approximately 51 million clean reads, a total of 70,168 unigenes with a total length of 50.28 Mb, an average length of 717 bp and an N50 of 994 bp were obtained. In total, 63,991 unigenes (about 91.20% of the assembled unigenes) were successfully annotated in five public databases including NR, GO, COG, KEGG, and Nt. GO analysis revealed that the majority of these unigenes were predominantly involved in basic physiological and metabolic processes, catalytic, binding, and cellular process. In addition, a total of 103 unigenes encoding eight enzymes involved in the sucrose metabolism related pathways were also identified by KEGG pathway analysis. Sucrose synthase (29 unigenes), invertase (17 unigenes), sucrose-phosphate synthase (16 unigenes), fructokinase (17 unigenes), and hexokinase (11 unigenes) ranked top five in these eight key enzymes. Of these, two genes (RsSuSy1, RsSPS1) were validated by T-A cloning and sequenced, while the expression of six unigenes was profiled with RT-qPCR analysis. These results will serve as an important public reference platform to identify the related key genes during taproot thickening and facilitate the dissection of molecular mechanisms underlying taproot formation in radish.

  4. An optimization approach and its application to compare DNA sequences

    NASA Astrophysics Data System (ADS)

    Liu, Liwei; Li, Chao; Bai, Fenglan; Zhao, Qi; Wang, Ying

    2015-02-01

    Studying the evolutionary relationships between biological sequences by comparing and analyzing gene sequences has become one of the main tasks in bioinformatics research. Many valid methods have been applied to DNA sequence alignment. In this paper, we propose a novel method based on the Lempel-Ziv (LZ) complexity to compare biological sequences. Moreover, we introduce a new distance measure and make use of the corresponding similarity matrix to construct a phylogenetic tree without multiple sequence alignment. Further, we construct phylogenetic trees for 24 species of Eutherian mammals and 48 countries of Hepatitis E virus (HEV) by an optimization approach. The results indicate that this new method improves the efficiency of sequence comparison and successfully constructs phylogenies.
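
    The abstract does not define its new distance measure, so the sketch below only illustrates the general idea of an alignment-free, LZ-complexity-based comparison. It counts components of the Lempel-Ziv (1976) exhaustive-history parsing and, as an assumption, plugs them into a normalized distance commonly used in earlier LZ-based phylogeny work; the function names and test sequences are hypothetical.

```python
def lz_complexity(s):
    """Number of components in the Lempel-Ziv (1976) parsing of s.

    Each component is the shortest substring starting at the current
    position that has not already occurred earlier in the sequence.
    """
    i, count, n = 0, 0, len(s)
    while i < n:
        length = 1
        # Extend while the candidate can be copied from earlier text
        # (the source may overlap the candidate, minus its last symbol).
        while i + length <= n and s[i:i + length] in s[:i + length - 1]:
            length += 1
        count += 1
        i += length
    return count


def lz_distance(s, q):
    """Assumed normalized LZ distance (not the paper's own measure).

    Measures how many extra components each sequence needs when it is
    appended to the other, relative to the larger individual complexity.
    """
    cs, cq = lz_complexity(s), lz_complexity(q)
    extra = max(lz_complexity(s + q) - cs, lz_complexity(q + s) - cq)
    return extra / max(cs, cq)


# Hypothetical toy sequences.
seq_a = "ACGTACGTACGT"
seq_b = "TTGGCCAATTGG"
d_same = lz_distance(seq_a, seq_a)
d_diff = lz_distance(seq_a, seq_b)
```

    A matrix of such pairwise distances can then be passed to any distance-based tree-building method, which is what makes the approach independent of multiple sequence alignment.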

  5. De novo assembly and characterization of farmed blue fox (Alopex lagopus) global transcriptome using Illumina paired-end sequencing.

    PubMed

    Guo, P C; Yan, S Q; Si, S; Bai, C Y; Zhao, Y; Zhang, Y; Yao, J Y; Li, Y M

    2016-03-28

    The blue fox (Alopex lagopus), a coat-color variant of the Arctic fox, is a domesticated fur-bearing mammal. In the present study, transcriptome data generated from a pool of nine different tissues were obtained with Illumina HiSeq2500 paired-end sequencing technology. After filtering from raw reads, 32,358,290 clean reads were assembled into 161,269 transcripts and 97,252 unigenes by the Trinity fragment assembly software. Of the assembled unigenes, 37,967 were annotated in the National Center for Biotechnology Information (NCBI) Non-Redundant (NR) protein database and 26,264 in the Swiss-Prot database. Among the annotated unigenes, 24,839 and 24,267 were assigned using the Gene Ontology (GO) and euKaryotic Orthologous Groups (KOG) databases, respectively. Altogether, 17,057 unigenes were mapped onto 227 pathways using the Kyoto Encyclopedia of Genes and Genomes database. In addition, 6394 simple sequence repeats were identified by examining 12,965 unigenes (>1 kb), which could contribute to the development of molecular markers. This study generated transcriptome data for the blue fox that will promote further progress in expression profiling studies, and provide a good annotation basis for genomic studies.

  6. De novo analysis of peptide tandem mass spectra by spectral graph partitioning.

    PubMed

    Bern, Marshall; Goldberg, David

    2006-03-01

    We report on a new de novo peptide sequencing algorithm that uses spectral graph partitioning. In this approach, relationships between m/z peaks are represented by attractive and repulsive springs, and the vibrational modes of the spring system are used to infer information about the peaks (such as "likely b-ion" or "likely y-ion"). We demonstrate the effectiveness of this approach by comparison with other de novo sequencers on test sets of ion-trap and QTOF spectra, including spectra of mixtures of peptides. On all datasets, we outperform the other sequencers. Along with spectral graph theory techniques, the new de novo sequencer EigenMS incorporates another improvement of independent interest: robust statistical methods for recalibration of time-of-flight mass measurements. Robust recalibration greatly outperforms simple least-squares recalibration, achieving about three times the accuracy for one QTOF dataset.
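
    As a rough illustration of spectral partitioning on a graph with attractive and repulsive edges, the toy sketch below splits nodes into two groups using the sign pattern of the dominant eigenvector of a signed weight matrix, found by power iteration. This is a simplified stand-in chosen for illustration, not the EigenMS formulation; the matrix, the function name, and the interpretation of the two groups are all assumptions.

```python
def spectral_bipartition(weights, iterations=200):
    """Split nodes into two groups from a symmetric signed weight matrix.

    Positive weights act like attractive springs (same group preferred),
    negative weights like repulsive springs. The sign of each entry of
    the dominant eigenvector gives the group label.
    """
    n = len(weights)
    # Shifting by the largest absolute row sum makes the matrix positive
    # semidefinite, so power iteration converges to the eigenvector with
    # the largest (not largest-magnitude) eigenvalue.
    shift = max(sum(abs(w) for w in row) for row in weights)
    if shift == 0:
        return [True] * n  # no edges: nothing to separate
    x = [1.0 + 0.1 * i for i in range(n)]  # deterministic start vector
    for _ in range(iterations):
        y = [sum(weights[i][j] * x[j] for j in range(n)) + shift * x[i]
             for i in range(n)]
        norm = sum(v * v for v in y) ** 0.5
        x = [v / norm for v in y]
    return [v >= 0 for v in x]


# Hypothetical five-peak example: peaks 0-2 attract each other,
# peaks 3-4 attract each other, and the two groups repel.
group = [1, 1, 1, -1, -1]
toy_weights = [[0 if i == j else group[i] * group[j] for j in range(5)]
               for i in range(5)]
labels = spectral_bipartition(toy_weights)
```

    In a peptide setting the two groups might be read as candidate b-ion and y-ion peaks, but that mapping depends on how the edge weights are derived from the spectrum, which the abstract does not specify.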

  7. Solving the Water Jugs Problem by an Integer Sequence Approach

    ERIC Educational Resources Information Center

    Man, Yiu-Kwong

    2012-01-01

    In this article, we present an integer sequence approach to solve the classic water jugs problem. The solution steps can be obtained easily by additions and subtractions only, which is suitable for manual calculation or programming by computer. This approach can be introduced to secondary and undergraduate students, and also to teachers and…
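
    One way to read "additions and subtractions only" is sketched below: repeatedly add the capacity of the jug being filled and subtract the capacity of the other jug whenever it overflows, until the target appears. This is a hedged illustration in the spirit of the abstract, not necessarily the article's exact formulation; the function name and the restriction that the target fits in the second jug are assumptions.

```python
from math import gcd


def jug_sequence(m, n, d):
    """Integer sequence of amounts leading to d litres in the n-litre jug.

    m is the capacity of the jug that is repeatedly filled and poured
    into the n-litre jug; the n-litre jug is emptied whenever it is full
    and water is still left over.
    """
    if not 0 < d <= n or d % gcd(m, n) != 0:
        raise ValueError("target cannot be measured with these jugs")
    amounts = [0]
    x = 0
    while x != d:
        x += m          # fill the m-jug and pour it into the n-jug
        while x > n:    # n-jug overflowed: empty it, keep the remainder
            x -= n
        amounts.append(x)
    return amounts


# Classic instance: measure 4 litres with 3- and 5-litre jugs.
steps = jug_sequence(3, 5, 4)
```

    Each value in the returned list is the amount held after one fill-and-pour round, so the list doubles as a recipe that can be followed by hand.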

  9. MANGO: a new approach to multiple sequence alignment.

    PubMed

    Zhang, Zefeng; Lin, Hao; Li, Ming

    2007-01-01

    Multiple sequence alignment is a classical and challenging task for biological sequence analysis. The problem is NP-hard. The full dynamic programming takes too much time. The progressive alignment heuristics adopted by most state-of-the-art multiple sequence alignment programs suffer from the 'once a gap, always a gap' phenomenon. Is there a radically new way to do multiple sequence alignment? This paper introduces a novel and orthogonal multiple sequence alignment method, using multiple optimized spaced seeds and new algorithms to handle these seeds efficiently. Our new algorithm processes information of all sequences as a whole, avoiding problems caused by the popular progressive approaches. Because the optimized spaced seeds are provably significantly more sensitive than the consecutive k-mers, the new approach promises to be more accurate and reliable. To validate our new approach, we have implemented MANGO: Multiple Alignment with N Gapped Oligos. Experiments were carried out on large 16S RNA benchmarks showing that MANGO compares favorably, in both accuracy and speed, against state-of-the-art multiple sequence alignment methods, including ClustalW 1.83, MUSCLE 3.6, MAFFT 5.861, ProbConsRNA 1.11, Dialign 2.2.1, DIALIGN-T 0.2.1, T-Coffee 4.85, POA 2.0 and Kalign 2.0.
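
    To make the contrast between spaced seeds and consecutive k-mers concrete, the sketch below indexes one sequence by the characters at the "1" positions of a seed pattern and looks up windows of a second sequence. The seed "1101" and the toy sequences are illustrative assumptions; they are not the optimized seeds used by MANGO.

```python
from collections import defaultdict


def seed_key(seq, pos, seed):
    """Characters of seq at the '1' (must-match) positions of the seed."""
    return "".join(seq[pos + i] for i, c in enumerate(seed) if c == "1")


def spaced_seed_hits(a, b, seed):
    """Return sorted (i, j) pairs where a[i:] and b[j:] match under the seed.

    Positions marked '0' in the seed are wildcards, so a single
    substitution there does not destroy the hit.
    """
    span = len(seed)
    index = defaultdict(list)
    for i in range(len(a) - span + 1):
        index[seed_key(a, i, seed)].append(i)
    hits = []
    for j in range(len(b) - span + 1):
        for i in index.get(seed_key(b, j, seed), []):
            hits.append((i, j))
    return sorted(hits)


# Hypothetical sequences differing by one substitution at position 2.
seq_a = "ACGTAC"
seq_b = "ACTTAC"
spaced = spaced_seed_hits(seq_a, seq_b, "1101")      # tolerates the mismatch
contiguous = spaced_seed_hits(seq_a, seq_b, "1111")  # plain 4-mers
```

    In this toy case the spaced seed still finds the anchor at (0, 0) while the contiguous 4-mer finds nothing, which is the kind of sensitivity gain the abstract refers to.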

  10. Microsatellites from Fosterella christophii (Bromeliaceae) by de novo transcriptome sequencing on the Pacific Biosciences RS platform

    PubMed Central

    Wöhrmann, Tina; Huettel, Bruno; Wagner, Natascha; Weising, Kurt

    2016-01-01

    Premise of the study: Microsatellite markers were developed in Fosterella christophii (Bromeliaceae) to investigate the genetic diversity and population structure within the F. micrantha group, comprising F. christophii, F. micrantha, and F. villosula. Methods and Results: Full-length cDNAs were isolated from F. christophii and sequenced on a Pacific Biosciences RS platform. A total of 1590 high-quality consensus isoforms were assembled into 971 unigenes containing 421 perfect microsatellites. Thirty primer sets were designed, of which 13 revealed a high level of polymorphism in three populations of F. christophii, with four to nine alleles per locus. Each of these 13 loci cross-amplified in the closely related species F. micrantha and F. villosula, with one to six and one to 11 alleles per locus, respectively. Conclusions: The new markers are promising tools to study the population genetics of F. christophii and to discover species boundaries within the F. micrantha group. PMID:26819858

  11. De novo sequencing of circulating miRNAs identifies novel markers predicting clinical outcome of locally advanced breast cancer

    PubMed Central

    2012-01-01

    Background MicroRNAs (miRNAs) have been recently detected in the circulation of cancer patients, where they are associated with clinical parameters. Discovery profiling of circulating small RNAs has not been reported in breast cancer (BC), and was carried out in this study to identify blood-based small RNA markers of BC clinical outcome. Methods The pre-treatment sera of 42 stage II-III locally advanced and inflammatory BC patients who received neoadjuvant chemotherapy (NCT) followed by surgical tumor resection were analyzed for marker identification by deep sequencing all circulating small RNAs. An independent validation cohort of 26 stage II-III BC patients was used to assess the power of identified miRNA markers. Results More than 800 miRNA species were detected in the circulation, and observed patterns showed association with histopathological profiles of BC. Groups of circulating miRNAs differentially associated with ER/PR/HER2 status and inflammatory BC were identified. The relative levels of selected miRNAs measured by PCR showed consistency with their abundance determined by deep sequencing. Two circulating miRNAs, miR-375 and miR-122, exhibited strong correlations with clinical outcomes, including NCT response and relapse with metastatic disease. In the validation cohort, higher levels of circulating miR-122 specifically predicted metastatic recurrence in stage II-III BC patients. Conclusions Our study indicates that certain miRNAs can serve as potential blood-based biomarkers for NCT response, and that miR-122 prevalence in the circulation predicts BC metastasis in early-stage patients. These results may allow optimized chemotherapy treatments and preventive anti-metastasis interventions in future clinical applications. PMID:22400902

  12. Transcriptome sequencing and de novo analysis of cytoplasmic male sterility and maintenance in JA-CMS cotton.

    PubMed

    Yang, Peng; Han, Jinfeng; Huang, Jinling

    2014-01-01

    Cytoplasmic male sterility (CMS) is the failure to produce functional pollen, which is inherited maternally. It is known that anther development is modulated through complicated interactions between nuclear and mitochondrial genes in sporophytic and gametophytic tissues. However, an unbiased transcriptome sequencing analysis of CMS in cotton is currently lacking in the literature. This study compared differentially expressed (DE) genes of floral buds at the sporogenous cells stage (SS) and microsporocyte stage (MS) (the two most important stages for pollen abortion in JA-CMS) between JA-CMS and its fertile maintainer line JB cotton plants, using the Illumina HiSeq 2000 sequencing platform. A total of 709 (1.8%) DE genes including 293 up-regulated and 416 down-regulated genes were identified in the JA-CMS line compared with its maintainer line at the SS stage, and 644 (1.6%) DE genes with 263 up-regulated and 381 down-regulated genes were detected at the MS stage. By comparing the two stages in the same material, there were 8 up-regulated and 9 down-regulated DE genes in the JA-CMS line and 29 up-regulated and 9 down-regulated DE genes in the JB maintainer line at the MS stage. Quantitative RT-PCR was used to validate 7 randomly selected DE genes. Bioinformatics analysis revealed that genes involved in reduction-oxidation reactions and alpha-linolenic acid metabolism were down-regulated, while genes pertaining to photosynthesis and flavonoid biosynthesis were up-regulated in JA-CMS floral buds compared with their JB counterparts at the SS and/or MS stages. All these four biological processes play important roles in reactive oxygen species (ROS) homeostasis, which may be an important factor contributing to the sterile trait of JA-CMS. Further experiments are warranted to elucidate molecular mechanisms of these genes that lead to CMS.

  13. De novo transcriptome assembly of Ipomoea nil using Illumina sequencing for gene discovery and SSR marker identification.

    PubMed

    Wei, Changhe; Tao, Xiang; Li, Ming; He, Bin; Yan, Lang; Tan, Xuemei; Zhang, Yizheng

    2015-10-01

    Ipomoea nil is widely used as an ornamental plant due to its abundance of flower color, but the limited transcriptome and genomic data hinder research on it. Using the Illumina platform, transcriptome profiling of I. nil was performed through high-throughput sequencing, which was proven to be a rapid and cost-effective means to characterize gene content. Our goal is to use the resulting information to facilitate the relevant research on flowering and flower color formation in I. nil. In total, 268 million unique Illumina RNA-Seq reads were produced and used in the transcriptome assembly. These reads were assembled into 220,117 contigs, of which 137,307 contigs were annotated using the GO and KEGG databases. Based on the result of functional annotations, a total of 89,781 contigs were assigned 455,335 GO term annotations. Meanwhile, 17,418 contigs were identified with pathway annotation and they were functionally assigned to 144 KEGG pathways. Our transcriptome revealed at least 55 contigs as probable flowering-related genes in I. nil, and we also identified 25 contigs that encode key enzymes in the phenylpropanoid biosynthesis pathway. Based on the analysis relating to gene expression profiles, in the phenylpropanoid biosynthesis pathway of I. nil, the repression of lignin biosynthesis might lead to the redirection of the metabolic flux into anthocyanin biosynthesis. This may be the most likely reason that I. nil has high anthocyanin content, especially in its flowers. Additionally, 15,537 simple sequence repeats (SSRs) were detected using the MISA software, and these SSRs will undoubtedly benefit future breeding work. Moreover, the information uncovered in this study will also serve as a valuable resource for understanding the flowering and flower color formation mechanisms in I. nil.
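
    For readers unfamiliar with SSR mining, the sketch below shows the basic idea of scanning a sequence for perfect tandem repeats of short motifs with a regular expression. It is not MISA and does not reproduce its options or output; the function name, the minimum repeat counts, and the test sequence are assumptions chosen for illustration.

```python
import re

# Assumed thresholds: motif length -> minimum number of tandem copies.
MIN_REPEATS = {2: 6, 3: 5, 4: 5}


def find_ssrs(seq, min_repeats=MIN_REPEATS):
    """Return sorted (start, motif, copies) tuples for perfect SSRs."""
    seq = seq.upper()
    hits = []
    for k, r in sorted(min_repeats.items()):
        # One motif of length k followed by at least r - 1 further copies.
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (k, r - 1))
        for match in pattern.finditer(seq):
            motif = match.group(1)
            # Skip motifs that are themselves repeats of a shorter unit
            # (for example "AA" or "AGAG").
            if any(motif == motif[:d] * (k // d)
                   for d in range(1, k) if k % d == 0):
                continue
            hits.append((match.start(), motif, len(match.group(0)) // k))
    return sorted(hits)


# Hypothetical contig with one dinucleotide and one trinucleotide SSR.
contig = "TTT" + "AG" * 7 + "CCC" + "GAT" * 5 + "T"
ssrs = find_ssrs(contig)
```

    Real pipelines usually add handling for compound and imperfect repeats and for ambiguous bases, which this sketch deliberately leaves out.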

  14. De novo assembly and characterization of the spleen transcriptome of common carp (Cyprinus carpio) using Illumina paired-end sequencing.

    PubMed

    Li, Guoxi; Zhao, Yinli; Liu, Zhonghu; Gao, Chunsheng; Yan, Fengbin; Liu, Bianzhi; Feng, Jianxin

    2015-06-01

    Common carp (Cyprinus carpio) is one of the most important aquacultured species of the family Cyprinidae, and breeding this species for disease resistance is becoming more and more important. However, at the genome or transcriptome levels, study of the immunogenetics of disease resistance in the common carp is lacking. In this study, 60,316,906 and 75,200,328 paired-end clean reads were obtained from two cDNA libraries of the common carp spleen by Illumina paired-end sequencing technology. Totally, 130,293 unique transcript fragments (unigenes) were assembled, with an average length of 1400.57 bp. Approximately 105,612 (81.06%) unigenes could be annotated according to their homology with matches in the Nr, Nt, Swiss-Prot, COG, GO, or KEGG databases, and they were found to represent 46,747 non-redundant genes. Comparative analysis showed that 59.82% of the unigenes have significant similarity to zebrafish Refseq proteins. Gene expression comparison revealed that 10,432 and 6889 annotated unigenes were, respectively, up- and down-regulated with at least twofold changes between two developmental stages of the common carp spleen. Gene ontology and KEGG analysis were performed to classify all unigenes into functional categories for understanding gene functions and regulation pathways. In addition, 46,847 simple sequence repeats (SSRs) were detected from 35,618 unigenes, and a large number of single nucleotide polymorphism (SNP) and insertion/deletion (INDEL) sites were identified in the spleen transcriptome of common carp. This study has characterized the spleen transcriptome of the common carp for the first time, providing a valuable resource for a better understanding of the common carp immune system and defense mechanisms. This knowledge will also facilitate future functional studies on common carp immunogenetics that may eventually be applied in breeding programs. Copyright © 2015 Elsevier Ltd. All rights reserved.

  15. Sequencing and de novo analysis of the hemocytes transcriptome in Litopenaeus vannamei response to white spot syndrome virus infection.

    PubMed

    Xue, Shuxia; Liu, Yichen; Zhang, Yichen; Sun, Yan; Geng, Xuyun; Sun, Jinsheng

    2013-01-01

    White spot syndrome virus (WSSV) is a causative pathogen found in most shrimp farming areas of the world and causes large economic losses to shrimp aquaculture. The mechanism underlying the molecular pathogenesis of the highly virulent WSSV remains unknown. To better understand the virus-host interactions at the molecular level, the transcriptome profiles in hemocytes of unchallenged and WSSV-challenged shrimp (Litopenaeus vannamei) were compared using a short-read deep sequencing method (Illumina). RNA-seq analysis generated more than 25.81 million clean paired-end (PE) reads, which were assembled into 52,073 unigenes (mean size = 520 bp). Based on sequence similarity searches, 23,568 (45.3%) genes were identified, among which 6,562 and 7,822 unigenes were assigned to gene ontology (GO) categories and clusters of orthologous groups (COG), respectively. Searches in the Kyoto Encyclopedia of Genes and Genomes Pathway database (KEGG) mapped 14,941 (63.4%) unigenes to 240 KEGG pathways. Among all the annotated unigenes, 1,179 were associated with immune-related genes. Digital gene expression (DGE) analysis revealed that the host transcriptome profile was slightly changed in the early infection (5 hours post injection) of the virus, while large transcriptional differences were identified in the late infection (48 hpi) of WSSV. The differentially expressed genes mainly comprised pattern recognition genes and some immune response factors. The results indicated that antiviral immune mechanisms were probably involved in the recognition of pathogen-associated molecular patterns. This study provided a global survey of host gene activities against virus infection in a non-model organism, Pacific white shrimp. Results can contribute to the in-depth study of candidate genes in white shrimp, and help to improve the current understanding of host-pathogen interactions.

  16. De novo identification of VRC01 class HIV-1-neutralizing antibodies by next-generation sequencing of B-cell transcripts.

    PubMed

    Zhu, Jiang; Wu, Xueling; Zhang, Baoshan; McKee, Krisha; O'Dell, Sijy; Soto, Cinque; Zhou, Tongqing; Casazza, Joseph P; Mullikin, James C; Kwong, Peter D; Mascola, John R; Shapiro, Lawrence

    2013-10-22

    Next-generation sequencing of antibody transcripts provides a wealth of data, but the ability to identify function-specific antibodies solely on the basis of sequence has remained elusive. We previously characterized the VRC01 class of antibodies, which target the CD4-binding site on gp120, appear in multiple donors, and broadly neutralize HIV-1. Antibodies of this class have developmental commonalities, but typically share only ∼50% amino acid sequence identity among different donors. Here we apply next-generation sequencing to identify VRC01 class antibodies in a new donor, C38, directly from B cell transcript sequences. We first tested a lineage rank approach, but this was unsuccessful, likely because VRC01 class antibody sequences were not highly prevalent in this donor. We next identified VRC01 class heavy chains through a phylogenetic analysis that included thousands of sequences from C38 and a few known VRC01 class sequences from other donors. This "cross-donor analysis" yielded heavy chains with little sequence homology to previously identified VRC01 class heavy chains. Nonetheless, when reconstituted with the light chain from VRC01, half of the heavy chain chimeric antibodies showed substantial neutralization potency and breadth. We then identified VRC01 class light chains through a five-amino-acid sequence motif necessary for VRC01 light chain recognition. From over a million light chain sequences, we identified 13 candidate VRC01 class members. Pairing of these light chains with the phylogenetically identified C38 heavy chains yielded functional antibodies that effectively neutralized HIV-1. Bioinformatics analysis can thus directly identify functional HIV-1-neutralizing antibodies of the VRC01 class from a sequenced antibody repertoire.

  17. De novo transcriptome sequence assembly from coconut leaves and seeds with a focus on factors involved in RNA-directed DNA methylation.

    PubMed

    Huang, Ya-Yi; Lee, Chueh-Pai; Fu, Jason L; Chang, Bill Chia-Han; Matzke, Antonius J M; Matzke, Marjori

    2014-09-04

    Coconut palm (Cocos nucifera) is a symbol of the tropics and a source of numerous edible and nonedible products of economic value. Despite its nutritional and industrial significance, coconut remains under-represented in public repositories for genomic and transcriptomic data. We report de novo transcript assembly from RNA-seq data and analysis of gene expression in seed tissues (embryo and endosperm) and leaves of a dwarf coconut variety. Assembly of 10 GB sequencing data for each tissue resulted in 58,211 total unigenes in embryo, 61,152 in endosperm, and 33,446 in leaf. Within each unigene pool, 24,857 could be annotated in embryo, 29,731 could be annotated in endosperm, and 26,064 could be annotated in leaf. A KEGG analysis identified 138, 138, and 139 pathways, respectively, in transcriptomes of embryo, endosperm, and leaf tissues. Given the extraordinarily large size of coconut seeds and the importance of small RNA-mediated epigenetic regulation during seed development in model plants, we used homology searches to identify putative homologs of factors required for RNA-directed DNA methylation in coconut. The findings suggest that RNA-directed DNA methylation is important during coconut seed development, particularly in maturing endosperm. This dataset will expand the genomics resources available for coconut and provide a foundation for more detailed analyses that may assist molecular breeding strategies aimed at improving this major tropical crop. Copyright © 2014 Huang et al.

  18. De Novo Transcriptome Sequence Assembly from Coconut Leaves and Seeds with a Focus on Factors Involved in RNA-Directed DNA Methylation

    PubMed Central

    Huang, Ya-Yi; Lee, Chueh-Pai; Fu, Jason L.; Chang, Bill Chia-Han; Matzke, Antonius J. M.; Matzke, Marjori

    2014-01-01

    Coconut palm (Cocos nucifera) is a symbol of the tropics and a source of numerous edible and nonedible products of economic value. Despite its nutritional and industrial significance, coconut remains under-represented in public repositories for genomic and transcriptomic data. We report de novo transcript assembly from RNA-seq data and analysis of gene expression in seed tissues (embryo and endosperm) and leaves of a dwarf coconut variety. Assembly of 10 GB sequencing data for each tissue resulted in 58,211 total unigenes in embryo, 61,152 in endosperm, and 33,446 in leaf. Within each unigene pool, 24,857 could be annotated in embryo, 29,731 could be annotated in endosperm, and 26,064 could be annotated in leaf. A KEGG analysis identified 138, 138, and 139 pathways, respectively, in transcriptomes of embryo, endosperm, and leaf tissues. Given the extraordinarily large size of coconut seeds and the importance of small RNA-mediated epigenetic regulation during seed development in model plants, we used homology searches to identify putative homologs of factors required for RNA-directed DNA methylation in coconut. The findings suggest that RNA-directed DNA methylation is important during coconut seed development, particularly in maturing endosperm. This dataset will expand the genomics resources available for coconut and provide a foundation for more detailed analyses that may assist molecular breeding strategies aimed at improving this major tropical crop. PMID:25193496

  19. Multiplexed next-generation sequencing and de novo assembly to obtain near full-length HIV-1 genome from plasma virus.

    PubMed

    Aralaguppe, Shambhu G; Siddik, Abu Bakar; Manickam, Ashokkumar; Ambikan, Anoop T; Kumar, Milner M; Fernandes, Sunjay Jude; Amogne, Wondwossen; Bangaruswamy, Dhinoth K; Hanna, Luke Elizabeth; Sonnerborg, Anders; Neogi, Ujjwal

    2016-10-01

    Analysing the HIV-1 near full-length genome (HIV-NFLG) facilitates new understanding of the diversity of virus population dynamics at the individual or population level. In this study we developed a simple but high-throughput next generation sequencing (NGS) protocol for HIV-NFLG using clinical specimens and validated the method against an external quality control (EQC) panel. Clinical specimens (n=105) were obtained from three cohorts from two highly conserved HIV-1C epidemics (India and Ethiopia) and one diverse epidemic (Sweden). Additionally an EQC panel (n=10) was used to validate the protocol. HIV-NFLG was performed by amplifying the HIV genome (Gag-to-nef) in two fragments. NGS was performed using the Illumina HiSeq2500 after multiplexing 24 samples, followed by de novo assembly in Iterative Virus Assembler or VICUNA. Subtyping was carried out using several bioinformatics tools. Amplification of HIV-NFLG had a 90% (95/105) success rate in clinical specimens. NGS was successful in all clinical specimens (n=45) and EQA samples (n=10) attempted. The mean error for mutations for the EQC panel viruses was <1%. Subtyping identified two as A1C recombinants. Our results demonstrate the feasibility of a simple NGS-based HIV-NFLG that can potentially be used in molecular surveillance for effective identification of subtypes and transmission clusters for operational public health intervention.

  20. The importance of de novo mutations for pediatric neurological disease--It is not all in utero or birth trauma.

    PubMed

    Erickson, Robert P

    2016-01-01

    The advent of next generation sequencing (NGS, which consists of massively parallel sequencing to perform TGS (total genome sequencing) or WES (whole exome sequencing)) has uncovered many causative mutations in patients with pediatric neurological disease. A surprisingly high number of these are de novo mutations which have not been inherited from either parent. For epilepsy, autism spectrum disorders (ASDs), and neuromotor disorders, including cerebral palsy, initial estimates put the frequency of causative de novo mutations at about 15%, and about 10% of these are somatic. Some mutated genes are shared among these three classes of disease. Studies of copy number variation by comparative genomic hybridization (CGH) preceded the NGS approaches, but they also detect de novo variation, which is especially important for ASDs. There are interesting differences between the mutated genes detected by CGH and NGS. In summary, de novo mutations cause a very significant proportion of pediatric neurological disease. Copyright © 2015 Elsevier B.V. All rights reserved.

  1. Development of an expressed gene catalogue and molecular markers from the de novo assembly of short sequence reads of the lentil (Lens culinaris Medik.) transcriptome.

    PubMed

    Verma, Priyanka; Shah, Niraj; Bhatia, Sabhyata

    2013-09-01

    Genomic resources such as ESTs, molecular markers and linkage maps are essential for crop improvement. However, these resources are still limited in important legumes such as lentil (Lens culinaris Medik.), which is valued worldwide as a rich source of dietary protein. In this study, the de novo transcriptome assembly of 119,855,798 short reads, generated by Illumina paired-end sequencing, was performed using various assembly programs. This resulted in 42,196 nonredundant high-quality transcripts with an average length of 810 bases, an N50 value of 1,432 and an average expression per transcript of 26.21 reads per kilobase per million (RPKM). Similarity search with the unigenes and protein sequences of other plants resulted in maximum similarity with soybean. A total of 20,009 nonredundant transcripts showed similarity with the UniProtKB database and of these, 18,064 transcripts were grouped into three main GO categories, that is, biological process (15,126), molecular function (15,505) and cellular component (9,434). Annotated transcripts were mapped to 289 predicted Kyoto Encyclopedia of Genes and Genomes (KEGG) pathways and 8,893 transcripts were classified into 24 functional categories based on Cluster of Orthologous Groups (COG) of proteins. Mining the data set for the presence of SSRs resulted in 8,722 SSRs with a frequency of one SSR per 3.92 kb. From these, 5,673 SSR primer pairs were designed, and a subset of these was utilized for diversity analysis. This study, which provides a large data set of annotated transcripts and gene-based SSR markers, would serve as a foundation for various applications in lentil breeding and genetics.
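
    The RPKM unit quoted above normalizes a transcript's read count by both transcript length and sequencing depth. The following is a minimal sketch of that standard definition, not the authors' pipeline; the function name and the example numbers are illustrative assumptions.

```python
def rpkm(read_count, transcript_length_bp, total_mapped_reads):
    """Reads per kilobase of transcript per million mapped reads.

    RPKM = reads * 1e9 / (transcript length in bases * total mapped reads)
    """
    if transcript_length_bp <= 0 or total_mapped_reads <= 0:
        raise ValueError("length and total mapped reads must be positive")
    return read_count * 1e9 / (transcript_length_bp * total_mapped_reads)


# Hypothetical transcript: 500 reads on an 810-base transcript,
# in a library of 20 million mapped reads (numbers chosen for illustration).
example = rpkm(500, 810, 20_000_000)
print(f"{example:.2f} RPKM")
```

    Dividing by length keeps long transcripts from looking more highly expressed simply because they collect more reads, and dividing by library size makes values comparable between sequencing runs.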

  2. De Novo RNA Sequencing and Transcriptome Analysis of Monascus purpureus and Analysis of Key Genes Involved in Monacolin K Biosynthesis.

    PubMed

    Zhang, Chan; Liang, Jian; Yang, Le; Sun, Baoguo; Wang, Chengtao

    2017-01-01

    Monascus purpureus is an important medicinal and edible microbial resource. To facilitate biological, biochemical, and molecular research on medicinal components of M. purpureus, we investigated the M. purpureus transcriptome by RNA sequencing (RNA-seq). An RNA-seq library was created using RNA extracted from a mixed sample of M. purpureus expressing different levels of monacolin K output. In total 29,713 unigenes were assembled from more than 60 million high-quality short reads. A BLAST search revealed hits for 21,331 unigenes in at least one of the protein or nucleotide databases used in this study. The 22,365 unigenes were categorized into 48 functional groups based on Gene Ontology classification. Owing to the economic and medicinal importance of M. purpureus, most studies on this organism have focused on the pharmacological activity of chemical components and the molecular function of genes involved in their biogenesis. In this study, we performed quantitative real-time PCR to detect the expression of genes related to monacolin K (mokA-mokI) at different phases (2, 5, 8, and 12 days) of M. purpureus M1 and M1-36. Our study found that mokF modulates monacolin K biogenesis in M. purpureus. Nine genes were suggested to be associated with the monacolin K biosynthesis. Studies on these genes could provide useful information on secondary metabolic processes in M. purpureus. These results indicate a detailed resource through genetic engineering of monacolin K biosynthesis in M. purpureus and related species.

  3. De novo sequencing and comprehensive analysis of the mutant transcriptome from purple sweet potato (Ipomoea batatas L.).

    PubMed

    Ma, Peiyong; Bian, Xiaofeng; Jia, Zhaodong; Guo, Xiaoding; Xie, Yizhi

    2016-01-10

    Purple sweet potatoes, rich in anthocyanin, have been widely favored in light of increasing awareness of health and food safety. In this study, a mutant of purple sweet potato (white peel and flesh) was used to study anthocyanin metabolism by high-throughput RNA sequencing and comparative analysis of the mutant and wild type transcriptomes. A total of 88,509 unigenes ranging from 200nt to 14,986nt with an average length of 849nt were obtained. Unigenes were assigned to Gene Ontology (GO), Clusters of Orthologous Group (COG) and Kyoto Encyclopedia of Genes and Genomes (KEGG). Functional enrichment using GO and KEGG annotations showed that 3828 of the differently expressed genes probably influenced many important biological and metabolic pathways, including anthocyanin biosynthesis. Most importantly, the structural and transcription factor genes that contribute to anthocyanin biosynthesis were downregulated in the mutant. The unigene dataset that was used to discover the anthocyanin candidate genes can serve as a comprehensive resource for molecular research in sweet potato. Copyright © 2015 Elsevier B.V. All rights reserved.

  4. De novo TUBB2B mutation causes fetal akinesia deformation sequence with microlissencephaly: An unusual presentation of tubulinopathy.

    PubMed

    Laquerriere, Annie; Gonzales, Marie; Saillour, Yoann; Cavallin, Mara; Joyé, Nicole; Quélin, Chloé; Bidat, Laurent; Dommergues, Marc; Plessis, Ghislaine; Encha-Razavi, Ferechte; Chelly, Jamel; Bahi-Buisson, Nadia; Poirier, Karine

    2016-04-01

    Tubulinopathies are increasingly emerging as major causes of complex cerebral malformations, particularly microlissencephaly, which is often associated with hypoplastic or absent corticospinal tracts. Fetal akinesia deformation sequence (FADS) refers to a clinically and genetically heterogeneous group of disorders with congenital malformations related to impaired fetal movement. We report an early fetal case with FADS and microlissencephaly due to a TUBB2B mutation. Neuropathological examination disclosed virtually absent cortical lamination, foci of neuronal overmigration into the leptomeningeal spaces, corpus callosum agenesis, cerebellar and brainstem hypoplasia and extremely severe hypoplasia of the spinal cord with no anterior and posterior horns and almost no motoneurons. At the cellular level, the p.Cys239Phe TUBB2B mutant leads to impaired tubulin heterodimerization, decreased ability to incorporate into the cytoskeleton, and altered microtubule dynamics, with an accelerated rate of depolymerization. To our knowledge, this is the first reported case of microlissencephaly presenting with such a severe and early form of FADS, highlighting the importance of tubulin mutation screening in the context of FADS with microlissencephaly.

  5. De novo transcriptome sequence and identification of major bast-related genes involved in cellulose biosynthesis in jute (Corchorus capsularis L.).

    PubMed

    Zhang, Liwu; Ming, Ray; Zhang, Jisen; Tao, Aifen; Fang, Pingping; Qi, Jianmin

    2015-12-15

    Jute fiber, extracted from stem bast, is called golden fiber. It is essential for fiber improvement to discover the genes associated with jute development at the vegetative growth stage. However, only 858 EST sequences of jute were deposited in the GenBank database. Obviously, the publicly available data are far from sufficient to understand the molecular mechanism of fiber biosynthesis. It is imperative to conduct transcriptome sequencing for jute, which can be used for the discovery of a number of new genes, especially genes involved in cellulose biosynthesis. A total of 79,754,600 clean reads (7.98 Gb) were generated using Illumina paired-end sequencing. De novo assembly yielded 48,914 unigenes with an average length of 903 bp. By sequence similarity searching for known proteins, 27,962 (57.16 %) unigenes were annotated for their function. Out of these annotated unigenes, 21,856 and 11,190 unigenes were assigned to gene ontology (GO) and euKaryotic Ortholog Groups (KOG), respectively. Searching against the Kyoto Encyclopedia of Genes and Genomes Pathway database (KEGG) indicated that 14,216 unigenes were mapped to 268 KEGG pathways. Moreover, 5 SuSy, 3 UGPase, 9 CesA, 18 CSL, 2 Kor (Korrigan), and 12 Cobra unigenes involved in cellulose biosynthesis were identified. Among these unigenes, comp11264_c0 (SuSy), comp24568_c0 (UGPase), comp11363_c0 (CesA), comp11363_c1 (CesA), comp24217_c0 (CesA), and comp23531_c0 (CesA) displayed relatively high expression levels in stem bast using FPKM and RT-qPCR, indicating that they may have potential value for dissecting the mechanism of cellulose biosynthesis in jute. In addition, a total of 12,518 putative gene-associated SNPs were called from these assembled unigenes. We characterized the transcriptome of jute, discovered a broad survey of unigenes associated with vegetative growth and development, developed large-scale SNPs, and analyzed the expression patterns of genes involved in cellulose biosynthesis for bast

  6. De Novo RNA Sequencing and Transcriptome Analysis of Monascus purpureus and Analysis of Key Genes Involved in Monacolin K Biosynthesis

    PubMed Central

    Zhang, Chan; Liang, Jian; Yang, Le; Sun, Baoguo; Wang, Chengtao

    2017-01-01

    Monascus purpureus is an important medicinal and edible microbial resource. To facilitate biological, biochemical, and molecular research on medicinal components of M. purpureus, we investigated the M. purpureus transcriptome by RNA sequencing (RNA-seq). An RNA-seq library was created using RNA extracted from a mixed sample of M. purpureus expressing different levels of monacolin K output. In total 29,713 unigenes were assembled from more than 60 million high-quality short reads. A BLAST search revealed hits for 21,331 unigenes in at least one of the protein or nucleotide databases used in this study. The 22,365 unigenes were categorized into 48 functional groups based on Gene Ontology classification. Owing to the economic and medicinal importance of M. purpureus, most studies on this organism have focused on the pharmacological activity of chemical components and the molecular function of genes involved in their biogenesis. In this study, we performed quantitative real-time PCR to detect the expression of genes related to monacolin K (mokA-mokI) at different phases (2, 5, 8, and 12 days) of M. purpureus M1 and M1-36. Our study found that mokF modulates monacolin K biogenesis in M. purpureus. Nine genes were suggested to be associated with the monacolin K biosynthesis. Studies on these genes could provide useful information on secondary metabolic processes in M. purpureus. These results indicate a detailed resource through genetic engineering of monacolin K biosynthesis in M. purpureus and related species. PMID:28114365

  7. De novo sequence assembly and characterisation of a partial transcriptome for an evolutionarily distinct reptile, the tuatara (Sphenodon punctatus)

    PubMed Central

    2012-01-01

    Background The tuatara (Sphenodon punctatus) is a species of extraordinary zoological interest, being the only surviving member of an entire order of reptiles which diverged early in amniote evolution. In addition to their unique phylogenetic placement, many aspects of tuatara biology, including temperature-dependent sex determination, cold adaptation and extreme longevity have the potential to inform studies of genome evolution and development. Despite increasing interest in the tuatara genome, genomic resources for the species are still very limited. We aimed to address this by assembling a transcriptome for tuatara from an early-stage embryo, which will provide a resource for genome annotation, molecular marker development and studies of development and adaptation in tuatara. Results We obtained 30 million paired-end 50 bp reads from an Illumina Genome Analyzer and assembled them with Velvet and Oases using a range of kmers. After removing redundancy and filtering out low quality transcripts, our transcriptome dataset contained 32911 transcripts, with an N50 of 675 and a mean length of 451 bp. Almost 50% (15965) of these transcripts could be annotated by comparison with the NCBI non-redundant (NR) protein database or the chicken, green anole and zebrafish UniGene sets. A scan of candidate genes and repetitive elements revealed genes involved in immune function, sex differentiation and temperature-sensitivity, as well as over 200 microsatellite markers. Conclusions This dataset represents a major increase in genomic resources for the tuatara, increasing the number of annotated gene sequences from just 60 to almost 16,000. This will facilitate future research in sex determination, genome evolution, local adaptation and population genetics of tuatara, as well as inform studies on amniote evolution. PMID:22938396
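
    The N50 reported above is a length-weighted summary of an assembly: it is the sequence length L such that sequences of length L or longer contain at least half of all assembled bases. A minimal sketch of that calculation follows; the function name and the toy lengths are illustrative assumptions, and assemblers or QC tools may differ in details such as minimum-length filters.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths.

    Sequences are sorted from longest to shortest and summed until the
    running total reaches half of the total assembled bases; the length
    of the sequence that crosses that threshold is the N50.
    """
    ordered = sorted(lengths, reverse=True)
    if not ordered:
        raise ValueError("at least one sequence length is required")
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length


# Toy assembly (illustrative lengths, not data from the study).
toy_lengths = [2, 3, 4, 5, 6, 7, 8, 9, 10]
print(n50(toy_lengths))
```

    Because long sequences carry more weight, N50 can exceed the mean length considerably, which is consistent with an N50 of 675 alongside a mean of 451 bp.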

  8. SNP Detection from De Novo Transcriptome Sequencing in the Bivalve Macoma balthica: Marker Development for Evolutionary Studies

    PubMed Central

    Becquet, Vanessa; Belkhir, Khalid; Bierne, Nicolas; Garcia, Pascale

    2012-01-01

    Hybrid zones are noteworthy systems for the study of environmental adaptation to fast-changing environments, as they constitute reservoirs of polymorphism and are key to the maintenance of biodiversity. They can move in relation to climate fluctuations, as temperature can affect both selection and migration, or remain trapped by environmental and physical barriers. There is therefore a very strong incentive to study the dynamics of hybrid zones subjected to climate variations. The infaunal bivalve Macoma balthica emerges as a noteworthy model species, as divergent lineages hybridize, and its native NE Atlantic range is currently contracting to the North. To investigate the dynamics and functioning of hybrid zones in M. balthica, we developed new molecular markers by sequencing the collective transcriptome of 30 individuals. Ten individuals were pooled for each of the three populations sampled at the margins of two hybrid zones. A single 454 run generated 277 Mb from which 17K SNPs were detected. SNP density averaged 1 polymorphic site every 14 to 19 bases, for mitochondrial and nuclear loci, respectively. An FST scan detected high genetic divergence among several hundred SNPs, some of them involved in energetic metabolism, cellular respiration and physiological stress. The high population differentiation, recorded for nuclear-encoded ATP synthase and NADH dehydrogenase as well as most mitochondrial loci, suggests cytonuclear genetic incompatibilities. Results from this study will help pave the way to a high-resolution study of hybrid zone dynamics in M. balthica, and the relative importance of endogenous and exogenous barriers to gene flow in this system. PMID:23300636

  9. A 454 sequencing approach to dipteran mitochondrial genome research.

    PubMed

    Ramakodi, Meganathan P; Singh, Baneshwar; Wells, Jeffrey D; Guerrero, Felix; Ray, David A

    2015-01-01

    The availability of complete mitochondrial genome (mtgenome) data for Diptera, one of the largest metazoan orders, in public databases is limited. The advent of high throughput sequencing technology provides the potential to generate mtgenomes for many species affordably and quickly. However, these technologies need to be validated for dipterans as the members of this clade play important economic and research roles. Illumina and 454 sequencing platforms are widely used in genomic research involving non-model organisms. The Illumina platform has already been utilized for generating mitochondrial genomes without using conventional long range PCR for insects whereas the power of 454 sequencing for generating mitochondrial genome drafts without PCR has not yet been validated for insects. Thus, this study examines the utility of 454 sequencing approach for dipteran mtgenomic research. We generated complete or nearly complete mitochondrial genomes for Cochliomyia hominivorax, Haematobia irritans, Phormia regina and Sarcophaga crassipalpis using a 454 sequencing approach. Comparisons between newly obtained and existing assemblies for C. hominivorax and H. irritans revealed no major discrepancies and verified the utility of 454 sequencing for dipteran mitochondrial genomes. We also report the complete mitochondrial sequences for two forensically important flies, P. regina and S. crassipalpis, which could be used to provide useful information to legal personnel. Comparative analyses revealed that dipterans follow similar codon usage and nucleotide biases that could be due to mutational and selection pressures. This study illustrates the utility of 454 sequencing to obtain complete mitochondrial genomes for dipterans without the aid of conventional molecular techniques such as PCR and cloning and validates this method of mtgenome sequencing in arthropods.

  10. The role of melanin pathways in extremotolerance and virulence of Fonsecaea revealed by de novo assembly transcriptomics using illumina paired-end sequencing.

    PubMed

    Li, X Q; Guo, B L; Cai, W Y; Zhang, J M; Huang, H Q; Zhan, P; Xi, L Y; Vicente, V A; Stielow, B; Sun, J F; de Hoog, G S

    2016-01-01

    Melanisation has been considered to be an important virulence factor of Fonsecaea monophora. However, the biosynthetic mechanisms of melanisation remain unknown. We therefore used next generation sequencing technology to investigate the transcriptome and digital gene expression data, which are valuable resources to better understand the molecular and biological mechanisms regulating melanisation in F. monophora. We performed de novo transcriptome assembly and digital gene expression (DGE) profiling analyses of parent (CBS 122845) and albino (CBS 125194) strains using the Illumina RNA-seq system. A total of 17 352 annotated unigenes were found by BLAST search of NR, Swiss-Prot, Gene Ontology, Clusters of Orthologous Groups and Kyoto Encyclopedia of Genes and Genomes (KEGG) (E-value <1e‒5). A total of 2 283 unigenes were judged to be differentially expressed between the two genotypes. We identified most of the genes coding for key enzymes involved in melanin biosynthesis pathways, including polyketide synthase (pks), multicopper oxidase (mco), laccase, tyrosinase and homogentisate 1,2-dioxygenase (hmgA). DEG analysis showed extensive down-regulation of key genes in the DHN pathway, while up-regulation was noted in the DOPA pathway of the albino mutant. The transcript levels of some of these genes were confirmed by real time RT-PCR, while the crucial role of key enzymes was confirmed by either inhibitor or substrate tests in vitro. Meanwhile, a number of genes involved in light sensing, cell wall synthesis, morphology and environmental stress were identified in the transcriptome of F. monophora. In addition, 3 353 SSR (Simple Sequence Repeat) markers were identified from 21 600 consensus sequences. Blocking of the DHN pathway is the most likely reason for melanin deficiency in the albino strain, while the production of pheomelanin and pyomelanin was probably regulated by unknown transcription factors upstream of both pathways. Most of genes involved in

  11. An ORFome assembly approach to metagenomics sequences analysis.

    PubMed

    Ye, Yuzhen; Tang, Haixu

    2009-06-01

    Metagenomics is an emerging methodology for the direct genomic analysis of a mixed community of uncultured microorganisms. The current analyses of metagenomics data largely rely on the computational tools originally designed for microbial genomics projects. The challenge of assembling metagenomic sequences arises mainly from the short reads and the high species complexity of the community. Alternatively, individual (short) reads can be searched directly against databases of known genes (or proteins) to identify homologous sequences. The latter approach may have low sensitivity and specificity in identifying homologous sequences, which may further bias the subsequent diversity analysis. In this paper, we present a novel approach to metagenomic data analysis, called Metagenomic ORFome Assembly (MetaORFA). The whole computational framework consists of three steps. Each read from a metagenomics project is first annotated with putative open reading frames (ORFs) that likely encode proteins. Next, the predicted ORFs are assembled into a collection of peptides using an EULER assembly method. Finally, the assembled peptides (i.e. ORFome) are used for database searching of homologs and subsequent diversity analysis. We applied the MetaORFA approach to several metagenomics datasets with low coverage short reads. The results show that MetaORFA can produce long peptides even when the sequence coverage of reads is extremely low. Hence, the ORFome assembly significantly increases the sensitivity of homology searching, and may potentially improve the diversity analysis of the metagenomic data. This improvement is especially useful for metagenomic projects when the genome assembly does not work because of the low sequence coverage.
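
    The first of the three steps above, annotating each read with putative ORFs, can be illustrated with a simple six-frame translation. This is only a sketch under stated assumptions: it treats any stop-free stretch of at least `min_aa` residues as a putative ORF, uses the standard genetic code, and does not reproduce MetaORFA's own ORF predictor or its EULER-based peptide assembly. All function names and the `min_aa` threshold are hypothetical.

```python
from itertools import product

# Standard genetic code, codons enumerated with bases in the order T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def translate(seq):
    """Translate complete codons; unknown codons (e.g. containing N) become X."""
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3)
    )


def putative_orfs(read, min_aa=5):
    """Collect stop-free peptide segments from all six reading frames."""
    read = read.upper()
    peptides = []
    for strand in (read, reverse_complement(read)):
        for frame in range(3):
            for segment in translate(strand[frame:]).split("*"):
                if len(segment) >= min_aa:
                    peptides.append(segment)
    return peptides


# Toy read (illustrative only).
print(putative_orfs("ATGGCCAAGTTTGGGTAA"))
```

    Working at the peptide level is what lets the later assembly and homology search tolerate synonymous nucleotide differences between reads from related organisms.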

  12. Cross-Curricular Sequence: An Approach for Teaching Business Communication.

    ERIC Educational Resources Information Center

    Clarke, Lillian W.; Franklin, Carl M.

    1985-01-01

    The Cross-Curricular Sequencing (CCS) approach to teaching business communications is explored. Its uses in word processing, principles of management, and business policy courses are discussed. Techniques for integrating materials from these courses into business communication classes are described. The implications of CCS for business…

  13. De novo transcriptome sequencing in Bixa orellana to identify genes involved in methylerythritol phosphate, carotenoid and bixin biosynthesis

    SciTech Connect

    Cárdenas-Conejo, Yair; Carballo-Uicab, Víctor; Lieberman, Meric; Aguilar-Espinosa, Margarita; Comai, Luca; Rivera-Madrid, Renata

    2015-10-28

    Bixin or annatto is a commercially important natural orange-red pigment derived from lycopene that is produced and stored in seeds of Bixa orellana L. An enzymatic pathway for bixin biosynthesis was inferred from homology of putative proteins encoded by differentially expressed seed cDNAs. Some activities were later validated in a heterologous system. Nevertheless, much of the pathway remains to be clarified. For example, it is essential to identify the methylerythritol phosphate (MEP) and carotenoid pathway genes. In order to investigate the MEP, carotenoid, and bixin pathway genes, total RNA from young leaves and two different developmental stages of seeds from B. orellana was used for the construction of indexed mRNA libraries, sequenced on the Illumina HiSeq 2500 platform and assembled de novo using Velvet, CLC Genomics Workbench and CAP3 software. A total of 52,549 contigs were obtained with an average length of 1,924 bp. Two phylogenetic analyses of inferred proteins, in one case encoded by thirteen general, single-copy cDNAs, in the other from carotenoid and MEP cDNAs, indicated that B. orellana is closely related to sister Malvales species cacao and cotton. Using homology, we identified 7 and 14 core gene products from the MEP and carotenoid pathways, respectively. Surprisingly, previously defined bixin pathway cDNAs were not present in our transcriptome. Here we propose a new set of gene products involved in the bixin pathway. In conclusion, the identification and qRT-PCR quantification of cDNAs involved in annatto production suggest a hypothetical model for bixin biosynthesis that involves coordinated activation of some MEP, carotenoid and bixin pathway genes. These findings provide a better understanding of the mechanisms regulating these pathways and will facilitate the genetic improvement of B. orellana.

  14. De novo assembly and characterization of the transcriptome of the pancreatic fluke Eurytrema pancreaticum (trematoda: Dicrocoeliidae) using Illumina paired-end sequencing.

    PubMed

    Liu, Guo-Hua; Xu, Min-Jun; Song, Hui-Qun; Wang, Chun-Ren; Zhu, Xing-Quan

    2016-01-15

    Eurytrema pancreaticum is one of the most common trematodes living in the pancreatic and bile ducts of ruminants and also occasionally infects humans, causing eurytremiasis. In spite of its economic and medical importance, very little is known about the genomic resources of this parasite. Herein, we performed de novo sequencing, assembly and characterization of the transcriptome of adult E. pancreaticum. Approximately 36.4 million high-quality clean reads were obtained, and the length of the transcript contigs ranged from 66 to 19,968 nt with mean length of 479 nt and N50 length of 1094 nt, and then 23,573 unigenes were assembled. Of these unigenes, 15,353 (65.1%) were annotated by blast searches against the NCBI non-redundant protein database. Among these, 15,267 (64.8%), 2732 (11.6%) and 10,354 (43.9%) of the unigenes had significant similarity with proteins in the NR, NT and Swiss-Prot databases, respectively. 5510 (23.4%) and 4567 (19.4%) unigenes were assigned to GO and COG, respectively. 8886 (37.7%) unigenes were identified and mapped onto 254 pathways in the KEGG Pathway database. Furthermore, we found that 105 (1.18%) unigenes were related to pancreatic secretion and 61 (0.7%) to pancreatic cancer. The present study represents the first transcriptome of any members of the family Dicrocoeliidae, which has little genomic information available in the public databases. The novel transcriptome of E. pancreaticum should provide a useful resource for designing new strategies against pancreatic flukes and other trematodes of human and animal health significance. Copyright © 2015 Elsevier B.V. All rights reserved.
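
    The percentages quoted above can be rechecked from the counts in the abstract. The short sketch below does only that arithmetic; the variable names are illustrative, and the choice of denominator for the last two figures is an inference from the numbers rather than something the abstract states. The database percentages reproduce against the 23,573 assembled unigenes, whereas 1.18% and 0.7% reproduce only if 105 and 61 are divided by the 8,886 KEGG-mapped unigenes.

```python
TOTAL_UNIGENES = 23_573   # assembled unigenes reported in the abstract
KEGG_MAPPED = 8_886       # unigenes mapped to KEGG pathways


def percent(part, whole, digits=1):
    """Percentage of `part` in `whole`, rounded to `digits` decimals."""
    return round(100 * part / whole, digits)


# Counts taken from the abstract, checked against all assembled unigenes.
annotation_counts = {
    "NR": 15_267,
    "NT": 2_732,
    "Swiss-Prot": 10_354,
    "GO": 5_510,
    "COG": 4_567,
    "KEGG": 8_886,
}
for database, count in annotation_counts.items():
    print(f"{database}: {percent(count, TOTAL_UNIGENES)}%")

# Pancreatic secretion (105) and pancreatic cancer (61) unigenes:
# compare the two candidate denominators.
print("105 of KEGG-mapped:", percent(105, KEGG_MAPPED, 2))
print("105 of all unigenes:", percent(105, TOTAL_UNIGENES, 2))
print("61 of KEGG-mapped:", percent(61, KEGG_MAPPED, 1))
```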

  15. Imparting functionality to biocatalysts via embedding enzymes into nanoporous materials by a de novo approach: size-selective sheltering of catalase in metal-organic framework microcrystals.

    PubMed

    Shieh, Fa-Kuen; Wang, Shao-Chun; Yen, Chia-I; Wu, Chang-Cheng; Dutta, Saikat; Chou, Lien-Yang; Morabito, Joseph V; Hu, Pan; Hsu, Ming-Hua; Wu, Kevin C-W; Tsung, Chia-Kuang

    2015-04-08

    We develop a new concept to impart new functions to biocatalysts by combining enzymes and metal-organic frameworks (MOFs). The proof-of-concept design is demonstrated by embedding catalase molecules into uniformly sized ZIF-90 crystals via a de novo approach. We have carried out electron microscopy, X-ray diffraction, nitrogen sorption, electrophoresis, thermogravimetric analysis, and confocal microscopy to confirm that the ~10 nm catalase molecules are embedded in 2 μm single-crystalline ZIF-90 crystals with ~5 wt % loading. Because catalase is immobilized and sheltered by the ZIF-90 crystals, the composites show activity in hydrogen peroxide degradation even in the presence of protease proteinase K.

  16. De novo sequencing and analysis of the Ulva linza transcriptome to discover putative mechanisms associated with its successful colonization of coastal ecosystems

    PubMed Central

    2012-01-01

    Background: The green algal genus Ulva Linnaeus (Ulvaceae, Ulvales, Chlorophyta) is well known for its wide distribution in marine, freshwater, and brackish environments throughout the world. The Ulva species are also highly tolerant of variations in salinity, temperature, and irradiance and are the main cause of green tides, which can have deleterious ecological effects. However, limited genomic information is currently available for this non-model and ecologically important genus. Ulva linza is a species that inhabits bedrock in the mid to low intertidal zone, and it is a major contributor to biofouling. Here, we present the global characterization of the U. linza transcriptome using the Roche GS FLX Titanium platform, with the aim of uncovering the genomic mechanisms underlying rapid and successful colonization of the coastal ecosystems. Results: De novo assembly of 382,884 reads generated 13,426 contigs with an average length of 1,000 bases. Contiguous sequences were further assembled into 10,784 isotigs with an average length of 1,515 bases. A total of 304,101 reads were nominally identified by BLAST; 4,368 isotigs were functionally annotated with 13,550 GO terms, and 2,404 isotigs having enzyme commission (EC) numbers were assigned to 262 KEGG pathways. When compared with four other fully sequenced green algae, 3,457 unique isotigs were found in U. linza and 18 were conserved in land plants. In addition, a specific photoprotective mechanism based on both LhcSR and PsbS proteins and a C4-like carbon-concentrating mechanism were found, which may help U. linza survive stress conditions. At least 19 transporters for essential inorganic nutrients (i.e., nitrogen, phosphorus, and sulphur) were responsible for its ability to take up inorganic nutrients, and at least 25 eukaryotic cytochrome P450s, which is a higher number than that found in other algae, may be related to their strong allelopathy. Multi-origination of the stress-related proteins, such as glutamate

  17. De novo sequencing and analysis of the Ulva linza transcriptome to discover putative mechanisms associated with its successful colonization of coastal ecosystems.

    PubMed

    Zhang, Xiaowen; Ye, Naihao; Liang, Chengwei; Mou, Shanli; Fan, Xiao; Xu, Jianfang; Xu, Dong; Zhuang, Zhimeng

    2012-10-25

    The green algal genus Ulva Linnaeus (Ulvaceae, Ulvales, Chlorophyta) is well known for its wide distribution in marine, freshwater, and brackish environments throughout the world. The Ulva species are also highly tolerant of variations in salinity, temperature, and irradiance and are the main cause of green tides, which can have deleterious ecological effects. However, limited genomic information is currently available for this non-model and ecologically important genus. Ulva linza is a species that inhabits bedrock in the mid to low intertidal zone, and it is a major contributor to biofouling. Here, we present the global characterization of the U. linza transcriptome using the Roche GS FLX Titanium platform, with the aim of uncovering the genomic mechanisms underlying rapid and successful colonization of the coastal ecosystems. De novo assembly of 382,884 reads generated 13,426 contigs with an average length of 1,000 bases. Contiguous sequences were further assembled into 10,784 isotigs with an average length of 1,515 bases. A total of 304,101 reads were nominally identified by BLAST; 4,368 isotigs were functionally annotated with 13,550 GO terms, and 2,404 isotigs having enzyme commission (EC) numbers were assigned to 262 KEGG pathways. When compared with four other fully sequenced green algae, 3,457 unique isotigs were found in U. linza and 18 were conserved in land plants. In addition, a specific photoprotective mechanism based on both LhcSR and PsbS proteins and a C4-like carbon-concentrating mechanism were found, which may help U. linza survive stress conditions. At least 19 transporters for essential inorganic nutrients (i.e., nitrogen, phosphorus, and sulphur) were responsible for its ability to take up inorganic nutrients, and at least 25 eukaryotic cytochrome P450s, which is a higher number than that found in other algae, may be related to their strong allelopathy. Multi-origination of the stress-related proteins, such as glutamate dehydrogenase, superoxide

  18. Tracing the evolutionary lineage of pattern recognition receptor homologues in vertebrates: An insight into reptilian immunity via de novo sequencing of the wall lizard splenic transcriptome.

    PubMed

    Priyam, Manisha; Tripathy, Mamta; Rai, Umesh; Ghorai, Soma Mondal

    2016-04-01

    Reptiles remain a deprived class in the area of genomic and molecular resources for the vertebrate classes. The transition of squamates from aquatic to