Science.gov

Sample records for novo sequencing approach

  1. A Hybrid Approach for de novo Human Genome Sequence Assembly and Phasing

    PubMed Central

    Mostovoy, Yulia; Levy-Sakin, Michal; Lam, Jessica; Lam, Ernest T; Hastie, Alex R; Marks, Patrick; Lee, Joyce; Chu, Catherine; Lin, Chin; Džakula, Željko; Cao, Han; Schlebusch, Stephen A.; Giorda, Kristina; Schnall-Levin, Michael; Wall, Jeffrey D.; Kwok, Pui-Yan

    2016-01-01

    Despite tremendous progress in genome sequencing, the basic goal of producing phased (haplotype-resolved) genome sequence with end-to-end contiguity for each chromosome at reasonable cost and effort is still unrealized. In this study, we describe a new approach to perform de novo genome assembly and experimental phasing by integrating the data from Illumina short-read sequencing, 10X Genomics Linked-Read sequencing, and BioNano Genomics genome mapping to yield a high-quality, phased, de novo assembled human genome. PMID:27159086

  2. A Machine Learning Based Approach to de novo Sequencing of Glycans from Tandem Mass Spectrometry Spectrum.

    PubMed

    Kumozaki, Shotaro; Sato, Kengo; Sakakibara, Yasubumi

    2015-01-01

    Recently, glycomics has been actively studied and various technologies for glycomics have been rapidly developed. Currently, tandem mass spectrometry (MS/MS) is one of the key experimental tools for identifying the structures of oligosaccharides. MS/MS reveals peaks of fragmented glycan ions, including cross-ring ions resulting from internal cleavages, which provide valuable information for inferring glycan structures. Thus, the aim of de novo sequencing of glycans is to find the most probable assignments of observed MS/MS peaks to glycan substructures without databases. However, there are few satisfactory algorithms for glycan de novo sequencing from MS/MS spectra. We present a machine learning based approach to de novo sequencing of glycans from MS/MS spectra. First, we build a suitable model for the fragmentation of glycans including cross-ring ions, and implement a solver that employs Lagrangian relaxation with a dynamic programming technique. Then, to optimize scores for the algorithm, we introduce a machine learning technique called structured support vector machines, which enables us to learn parameters, including scores for cross-ring ions, from training data, i.e., known glycan mass spectra. Furthermore, we implement additional constraints for core structures of well-known glycan types including N-linked glycans and O-linked glycans. This enables us to predict more accurate glycan structures if the glycan type of given spectra is known. Computational experiments show that our algorithm performs accurate de novo sequencing of glycans. The implementation of our algorithm and the datasets are available at http://glyfon.dna.bio.keio.ac.jp/.

  3. New Approaches and Technologies to Sequence de novo Plant reference Genomes (2013 DOE JGI Genomics of Energy and Environment 8th Annual User Meeting)

    SciTech Connect

    Schmutz, Jeremy

    2013-03-01

    Jeremy Schmutz of the HudsonAlpha Institute for Biotechnology on "New approaches and technologies to sequence de novo plant reference genomes" at the 8th Annual Genomics of Energy & Environment Meeting on March 27, 2013 in Walnut Creek, Calif.

  4. A combined de novo protein sequencing and cDNA library approach to the venomic analysis of Chinese spider Araneus ventricosus.

    PubMed

    Duan, Zhigui; Cao, Rui; Jiang, Liping; Liang, Songping

    2013-01-14

    In recent years, spider venoms have attracted increasing attention due to their extraordinary chemical and pharmacological diversity. Recently popularized proteomic methods have greatly improved our ability to analyze the proteins in venom. However, the lack of sequence information for isolated venom proteins severely limits the ability to identify venom proteins confidently. In the present paper, the venom from Araneus ventricosus was analyzed using two complementary approaches: 2-DE/Shotgun-LC-MS/MS coupled to MASCOT search and 2-DE/Shotgun-LC-MS/MS coupled to manual de novo sequencing followed by local venom protein database (LVPD) search. The LVPD was constructed with toxin-like protein sequences obtained from the analysis of a cDNA library from A. ventricosus venom glands. Our results indicate that a total of 130 toxin-like protein sequences were unambiguously identified by manual de novo sequencing coupled to LVPD search, accounting for 86.67% of all toxin-like proteins in LVPD. Thus, manual de novo sequencing coupled to LVPD search proved to be an extremely effective approach for the analysis of venom proteins. In addition, the approach shows a clear advantage in validating mutant positions of isoforms from the same toxin-like family. Intriguingly, methyl esterification of glutamic acid was discovered for the first time in animal venom proteins by manual de novo sequencing.

  5. De novo sequencing of unique sequence tags for discovery of post-translational modifications of proteins

    SciTech Connect

    Shen, Yufeng; Tolic, Nikola; Hixson, Kim K.; Purvine, Samuel O.; Anderson, Gordon A.; Smith, Richard D.

    2008-10-15

    De novo sequencing holds promise for discovering protein post-translational modifications; however, the approach is still in its infancy and is not widely applied in proteomics practice because of its limited reliability. In this work, we describe a de novo sequencing approach for discovery of protein modifications through identification of the UStags (Anal. Chem. 2008, 80, 1871-1882). The de novo information was obtained from Fourier-transform tandem mass spectrometry for peptides and polypeptides in a yeast lysate, and the de novo sequences obtained were filtered to define a more limited set of UStags. The DNA-predicted database protein sequences were then compared to the UStags, and the differences observed across or in the UStags (i.e., the UStags’ prefix and suffix sequences and the UStags themselves) were used to infer the possible sequence modifications. With this de novo-UStag approach, we uncovered some unexpected variances of yeast protein sequences due to amino acid mutations and/or multiple modifications to the predicted protein sequences. Random matching of the de novo sequences to the predicted sequences was examined using two random (false) databases, and false discovery rates of ~3% were estimated for the de novo-UStag approach. The factors affecting the reliability (e.g., existence of de novo sequencing noise residues and redundant sequences) and the sensitivity are described. The de novo-UStag approach complements the UStag method previously reported by enabling discovery of new protein modifications.
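
    The false discovery rate estimate mentioned in this abstract can be illustrated with a small sketch. It assumes one common convention (matches against random databases divided by matches against the real database); the abstract does not give the exact formula, and the function name and counts below are hypothetical.

```python
def estimate_fdr(target_matches, decoy_matches_per_db):
    """Estimate FDR as mean decoy matches divided by target matches.

    Assumption: this simple ratio stands in for whatever estimator the
    authors used; it is not taken from the paper.
    """
    if target_matches <= 0:
        raise ValueError("target_matches must be positive")
    if not decoy_matches_per_db:
        raise ValueError("need at least one decoy database")
    mean_decoy = sum(decoy_matches_per_db) / len(decoy_matches_per_db)
    return mean_decoy / target_matches


# Hypothetical counts chosen only for illustration (not from the study).
fdr = estimate_fdr(target_matches=1000, decoy_matches_per_db=[28, 32])
print(f"estimated FDR: {fdr:.1%}")
```

    Averaging over two random databases, as in the abstract, reduces the influence of any single unlucky decoy set.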

  6. Novor: Real-Time Peptide de Novo Sequencing Software

    NASA Astrophysics Data System (ADS)

    Ma, Bin

    2015-11-01

    De novo sequencing software has been widely used in proteomics to sequence new peptides from tandem mass spectrometry data. This study presents a new software tool, Novor, to greatly improve both the speed and accuracy of today's peptide de novo sequencing analyses. To improve the accuracy, Novor's scoring functions are based on two large decision trees built from a peptide spectral library with more than 300,000 spectra with machine learning. Important knowledge about peptide fragmentation is extracted automatically from the library and incorporated into the scoring functions. The decision tree model also enables efficient score calculation and contributes to the speed improvement. To further improve the speed, a two-stage algorithmic approach, namely dynamic programming and refinement, is used. The software program was also carefully optimized. On the testing datasets, Novor sequenced 7%-37% more correct residues than the state-of-the-art de novo sequencing tool, PEAKS, while being an order of magnitude faster. Novor can de novo sequence more than 300 MS/MS spectra per second on a laptop computer. The speed surpasses the acquisition speed of today's mass spectrometer and, therefore, opens a new possibility to de novo sequence in real time while the spectrometer is acquiring the spectral data.

  7. MRUniNovo: an efficient tool for de novo peptide sequencing utilizing the hadoop distributed computing framework.

    PubMed

    Li, Chuang; Chen, Tao; He, Qiang; Zhu, Yunping; Li, Kenli

    2016-12-19

    Tandem mass spectrometry-based de novo peptide sequencing is a complex and time-consuming process. The current algorithms for de novo peptide sequencing cannot rapidly and thoroughly process large mass spectrometry datasets. In this paper, we propose MRUniNovo, a novel tool for parallel de novo peptide sequencing. MRUniNovo parallelizes UniNovo based on the Hadoop compute platform. Our experimental results demonstrate that MRUniNovo significantly reduces the computation time of de novo peptide sequencing without sacrificing the correctness and accuracy of the results, and thus can process very large datasets that UniNovo cannot.

  8. NIPTL-Novo: Non-isobaric peptide termini labeling assisted peptide de novo sequencing.

    PubMed

    Zhang, Shen; Shan, Yichu; Zhang, Shurong; Sui, Zhigang; Zhang, Lihua; Liang, Zhen; Zhang, Yukui

    2017-02-10

    A simple and effective de novo sequencing strategy assisted by non-isobaric peptide termini labeling, NIPTL-Novo, was established. The y-series ions and b-series ions of peptides can be clearly distinguished according to the different mass tags incorporated at the N-terminus and C-terminus. This helps to improve the accuracy of peptide sequencing and to increase the sequencing speed. For the spectra commonly identified by both de novo sequencing and database searching software (Mascot or MaxQuant), NIPTL-Novo gave identical results for more than 85% of these spectra. Furthermore, quantitative profiling of the sample can be performed simultaneously with de novo sequencing. Finally, this strategy can be applied to discover peptides with potential mutation sites by combining it with mass-defect based isotopic labeling.
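
    As a rough illustration of why terminal mass tags separate the two ion series, the sketch below computes singly charged b- and y-ion m/z values from standard monoisotopic residue masses. The tag masses are hypothetical placeholders, not the labels used in the study, and `fragment_ions` is an illustrative helper rather than part of NIPTL-Novo.

```python
# Standard monoisotopic residue masses in daltons.
MONO = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
PROTON = 1.007276
WATER = 18.010565


def fragment_ions(peptide, n_tag=0.0, c_tag=0.0):
    """Return (b, y) lists of singly charged m/z values, one per cleavage site.

    b[i] holds the N-terminal fragment and y[i] the complementary
    C-terminal fragment for a cleavage after residue i + 1.
    n_tag and c_tag are hypothetical label masses added to each terminus.
    """
    masses = [MONO[aa] for aa in peptide]
    b, y = [], []
    for i in range(1, len(masses)):
        b.append(sum(masses[:i]) + n_tag + PROTON)
        y.append(sum(masses[i:]) + c_tag + WATER + PROTON)
    return b, y


plain_b, plain_y = fragment_ions("PEPTIDEK")
tagged_b, tagged_y = fragment_ions("PEPTIDEK", n_tag=10.0, c_tag=20.0)
# Every b ion shifts by the N-terminal tag and every y ion by the C-terminal tag.
print([round(t - p, 4) for p, t in zip(plain_b, tagged_b)])
print([round(t - p, 4) for p, t in zip(plain_y, tagged_y)])
```

    Because the two series shift by different amounts, a peak can be assigned to the b or y series by the shift it shows between labeled forms.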

  9. DeNovoID: a web-based tool for identifying peptides from sequence and mass tags deduced from de novo peptide sequencing by mass spectroscopy.

    PubMed

    Halligan, Brian D; Ruotti, Victor; Twigger, Simon N; Greene, Andrew S

    2005-07-01

    One of the core activities of high-throughput proteomics is the identification of peptides from mass spectra. Some peptides can be identified using spectral matching programs like Sequest or Mascot, but many spectra do not produce high quality database matches. De novo peptide sequencing is an approach to determine partial peptide sequences for some of the unidentified spectra. A drawback of de novo peptide sequencing is that it produces a series of ordered and disordered sequence tags and mass tags rather than a complete, non-degenerate peptide amino acid sequence. This incomplete data is difficult to use in conventional search programs such as BLAST or FASTA. DeNovoID is a program that has been specifically designed to use degenerate amino acid sequence and mass data derived from MS experiments to search a peptide database. Since the algorithm employed depends on the amino acid composition of the peptide and not its sequence, DeNovoID does not have to consider all possible sequences, but rather a smaller number of compositions consistent with a spectrum. DeNovoID also uses a geometric indexing scheme that reduces the number of calculations required to determine the best peptide match in the database. DeNovoID is available at http://proteomics.mcw.edu/denovoid.
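
    The idea of matching on amino acid composition rather than residue order can be sketched as follows. This is only a minimal illustration under simple assumptions: it ignores mass tags and the geometric indexing scheme described in the abstract, and the function names and peptide list are hypothetical.

```python
from collections import Counter


def composition_matches(tag_residues, peptide):
    """True if the peptide contains every residue in the tag, in any order."""
    need = Counter(tag_residues)
    have = Counter(peptide)
    return all(have[aa] >= count for aa, count in need.items())


def search(tag_residues, database):
    """Return database peptides whose composition is consistent with the tag."""
    return [p for p in database if composition_matches(tag_residues, p)]


# Hypothetical peptide database and a disordered tag (order of residues unknown).
database = ["PEPTIDEK", "SAMPLER", "DEEPTIK"]
print(search("TPE", database))
```

    Because only residue counts are compared, permutations of the same tag give the same result, which is the property that spares the search from enumerating all possible sequences.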

  10. Multiplex De Novo Sequencing of Peptide Antibiotics

    NASA Astrophysics Data System (ADS)

    Mohimani, Hosein; Liu, Wei-Ting; Yang, Yu-Liang; Gaudêncio, Susana P.; Fenical, William; Dorrestein, Pieter C.; Pevzner, Pavel A.

    Proliferation of drug-resistant diseases raises the challenge of searching for new, more efficient antibiotics. Currently, some of the most effective antibiotics (e.g., Vancomycin and Daptomycin) are cyclic peptides produced by non-ribosomal biosynthetic pathways. The isolation and sequencing of cyclic peptide antibiotics, unlike that of linear peptides, is time-consuming and error-prone. The dominant technique for sequencing cyclic peptides is NMR-based and requires large amounts (milligrams) of purified materials that, for most compounds, are not possible to obtain. Given these facts, there is a need for new tools to sequence cyclic NRPs using picograms of material. Since nearly all cyclic NRPs are produced along with related analogs, we develop a mass spectrometry approach for sequencing all related peptides at once (in contrast to the existing approach that analyzes individual peptides). Our results suggest that instead of attempting to isolate and NMR-sequence the most abundant compound, one should acquire spectra of many related compounds and sequence all of them simultaneously using tandem mass spectrometry. We illustrate applications of this approach by sequencing new variants of cyclic peptide antibiotics from Bacillus brevis, as well as sequencing a previously unknown family of cyclic NRPs produced by marine bacteria.

  11. Ameliorated de novo transcriptome assembly using Illumina paired end sequence data with Trinity Assembler

    PubMed Central

    Bankar, Kiran Gopinath; Todur, Vivek Nagaraj; Shukla, Rohit Nandan; Vasudevan, Madavan

    2015-01-01

    The advent of next-generation sequencing has made de novo transcriptome assembly possible for organisms that lack a complete genome sequence. Among the various sequencing platforms available, Illumina is the most widely used on the basis of data quality, quantity and cost. Various de novo transcriptome assemblers are also available today. In this study, we aimed to obtain an ameliorated de novo transcriptome assembly from sequence reads generated on the Illumina platform and assembled using the Trinity assembler. We found that the primary transcriptome assembly produced by Trinity can be ameliorated on the basis of transcript length, coverage, depth, and protein homology. Our approach is reproducible and could enhance the sensitivity and specificity of the assembled transcriptome, which could be critical for validation of the assembled transcripts and for planning various downstream biological assays. PMID:26484285

  12. Complete De Novo Assembly of Monoclonal Antibody Sequences

    PubMed Central

    Tran, Ngoc Hieu; Rahman, M. Ziaur; He, Lin; Xin, Lei; Shan, Baozhen; Li, Ming

    2016-01-01

    De novo protein sequencing is one of the key problems in mass spectrometry-based proteomics, especially for novel proteins such as monoclonal antibodies for which genome information is often limited or not available. However, due to limitations in peptide fragmentation and coverage, as well as ambiguities in spectra interpretation, complete de novo assembly of unknown protein sequences remains challenging. To address this problem, we propose an integrated system, ALPS, which for the first time can automatically assemble full-length monoclonal antibody sequences. Our system integrates de novo sequencing peptides, their quality scores and error-correction information from databases into a weighted de Bruijn graph to assemble protein sequences. We evaluated ALPS performance on two antibody data sets, each including a heavy chain and a light chain. The results show that ALPS was able to assemble three complete monoclonal antibody sequences of length 216–441 AA, at 100% coverage, and 96.64–100% accuracy. PMID:27562653
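
    A weighted de Bruijn graph over peptides can be sketched in a few lines. The example below is a toy under stated assumptions: edge weights are simply summed peptide scores, traversal is greedy, and the peptides and scores are invented for illustration. It is not the ALPS implementation and omits the database error-correction step.

```python
from collections import defaultdict


def build_graph(peptides, k):
    """Link (k-1)-mers; each edge weight sums the scores of supporting peptides."""
    graph = defaultdict(lambda: defaultdict(float))
    for seq, score in peptides:
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            graph[kmer[:-1]][kmer[1:]] += score
    return graph


def greedy_assemble(graph):
    """Start at a node with no incoming edge and follow the heaviest edge.

    Assumption: the graph has at least one such start node and the heaviest
    path is acyclic; traversal stops if an edge would be reused.
    """
    targets = {t for outs in graph.values() for t in outs}
    starts = [n for n in graph if n not in targets]
    if not starts:
        raise ValueError("no start node found (graph may be cyclic)")
    node = starts[0]
    seq, used = node, set()
    while node in graph:
        nxt = max(graph[node], key=graph[node].get)
        if (node, nxt) in used:
            break
        used.add((node, nxt))
        seq += nxt[-1]
        node = nxt
    return seq


# Hypothetical de novo peptides with confidence scores; the last one
# carries a low-scoring error at its final residue.
peptides = [
    ("EVQLVES", 0.9),
    ("LVESGGG", 0.8),
    ("SGGGLVQ", 0.95),
    ("EVQLVEA", 0.2),
]
print(greedy_assemble(build_graph(peptides, k=4)))
```

    Weighting edges by score lets well-supported k-mers outvote a low-confidence sequencing error at a branch point, which is the intuition behind combining quality scores with the graph.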

  13. De novo sequencing and variant calling with nanopores using PoreSeq.

    PubMed

    Szalay, Tamas; Golovchenko, Jene A

    2015-10-01

    The accuracy of sequencing single DNA molecules with nanopores is continually improving, but de novo genome sequencing and assembly using only nanopore data remain challenging. Here we describe PoreSeq, an algorithm that identifies and corrects errors in nanopore sequencing data and improves the accuracy of de novo genome assembly with increasing coverage depth. The approach relies on modeling the possible sources of uncertainty that occur as DNA transits through the nanopore and finds the sequence that best explains multiple reads of the same region. PoreSeq increases nanopore sequencing read accuracy of M13 bacteriophage DNA from 85% to 99% at 100× coverage. We also use the algorithm to assemble Escherichia coli with 30× coverage and the λ genome at a range of coverages from 3× to 50×. Additionally, we classify sequence variants at an order of magnitude lower coverage than is possible with existing methods.

  14. De novo assembly of a bell pepper endornavirus genome sequence using RNA sequencing data.

    PubMed

    Jo, Yeonhwa; Choi, Hoseng; Cho, Won Kyong

    2015-03-19

    The genus Endornavirus is a double-stranded RNA virus that infects a wide range of hosts. In this study, we report on the de novo assembly of a bell pepper endornavirus genome sequence by RNA sequencing (RNA-Seq). Our result demonstrates the successful application of RNA-Seq to obtain a complete viral genome sequence from the transcriptome data.

  15. Database Independent Protein Sequencing (DiPS) enables full-length de-novo protein and antibody sequence determination.

    PubMed

    Savidor, Alon; Barzilay, Rotem; Elinger, Dalia; Yarden, Yosef; Lindzen, Moshit; Gabashvili, Alexandra; Adiv Tal, Ophir; Levin, Yishai

    2017-03-27

    Traditional 'bottom-up' proteomics approaches use proteolytic digestion, LC-MS/MS and database searching to elucidate peptide identities and their parent proteins. Protein sequences absent from the database cannot be identified, and even if present in the database, complete sequence coverage is rarely achieved even for the most abundant proteins in the sample. Thus, sequencing of unknown proteins such as antibodies or constituents of metaproteomes remains a challenging problem. To date, there is no available method for full-length protein sequencing, independent of a reference database, in high throughput. Here we present Database Independent Protein Sequencing (DiPS), a method for unambiguous, rapid, database independent, full-length protein sequencing. The method is a novel combination of non-enzymatic, semi-random cleavage of the protein, LC-MS/MS analysis, peptide de novo sequencing, extraction of peptide tags, and their assembly into a consensus sequence using an algorithm named "Peptide Tag Assembler" (pTA). As proof-of-concept, the method was applied to samples of three known proteins representing three size classes and to a previously un-sequenced, clinically relevant, monoclonal antibody. Excluding leucine/isoleucine and glutamic-acid/deamidated glutamine ambiguities, end-to-end, full-length de novo sequencing was achieved with 99-100% accuracy for all benchmarking proteins and the antibody light chain. Accuracy of the sequenced antibody heavy chain, including the entire variable region, was also 100% but there was a 23 residue gap in the constant region sequence.

  16. Top-down analysis of protein samples by de novo sequencing techniques

    SciTech Connect

    Vyatkina, Kira; Wu, Si; Dekker, Lennard J. M.; VanDuijn, Martijn M.; Liu, Xiaowen; Tolić, Nikola; Luider, Theo M.; Paša-Tolić, Ljiljana; Pevzner, Pavel A.

    2016-05-14

    MOTIVATION: Recent technological advances have made high-resolution mass spectrometers affordable to many laboratories, thus boosting rapid development of top-down mass spectrometry and creating a need for efficient methods for analyzing this kind of data. RESULTS: We describe a method for analysis of protein samples from top-down tandem mass spectrometry data, which capitalizes on de novo sequencing of fragments of the proteins present in the sample. Our algorithm takes as input a set of de novo amino acid strings derived from the given mass spectra using the recently proposed Twister approach, and combines them into aggregated strings endowed with offsets. The former typically constitute accurate sequence fragments of sufficiently well-represented proteins from the sample being analyzed, while the latter indicate their location in the protein sequence, and also bear information on post-translational modifications and fragmentation patterns.

  17. Nucleotide-sequence-specific de novo methylation in a somatic murine cell line.

    PubMed Central

    Szyf, M; Schimmer, B P; Seidman, J G

    1989-01-01

    DNA fragments encoding the mouse steroid 21-hydroxylase (C21 or Cyp21A1) gene are de novo methylated when introduced into the mouse adrenocortical tumor cell line Y1 by DNA-mediated gene transfer. Although CCGG sequences within the C21 gene are de novo methylated, CCGG sites within flanking vector sequences, other mammalian gene sequences driven by the C21 promoter, and the neomycin-resistance gene, which was cotransfected with the C21 gene, do not become methylated. At least two separate signals for de novo methylation are encoded within the gene since three fragments derived from the C21 gene were methylated de novo. Specific de novo methylation of C21-derived sequences does not occur in L cells or Y1 kin8 cells; this suggests that the cellular factors needed for de novo methylation of the C21 gene are not ubiquitous. Most DNA sequences are not de novo methylated when introduced into somatic cells and DNA sequences other than the C21 gene are not de novo methylated when introduced into Y1 cells. Several groups have suggested that de novo methylation occurs in early embryonic cells and that somatic cells strictly maintain their methylation pattern by a semiconservative methyltransferase. Our results suggest that de novo methylation of specific nucleotide sequences can occur in some mammalian somatic cells. PMID:2789380

  18. In vitro, long-range sequence information for de novo genome assembly via transposase contiguity.

    PubMed

    Adey, Andrew; Kitzman, Jacob O; Burton, Joshua N; Daza, Riza; Kumar, Akash; Christiansen, Lena; Ronaghi, Mostafa; Amini, Sasan; Gunderson, Kevin L; Steemers, Frank J; Shendure, Jay

    2014-12-01

    We describe a method that exploits contiguity preserving transposase sequencing (CPT-seq) to facilitate the scaffolding of de novo genome assemblies. CPT-seq is an entirely in vitro means of generating libraries comprised of 9216 indexed pools, each of which contains thousands of sparsely sequenced long fragments ranging from 5 kilobases to > 1 megabase. These pools are "subhaploid," in that the lengths of fragments contained in each pool sums to ∼5% to 10% of the full genome. The scaffolding approach described here, termed fragScaff, leverages coincidences between the content of different pools as a source of contiguity information. Specifically, CPT-seq data is mapped to a de novo genome assembly, followed by the identification of pairs of contigs or scaffolds whose ends disproportionately co-occur in the same indexed pools, consistent with true adjacency in the genome. Such candidate "joins" are used to construct a graph, which is then resolved by a minimum spanning tree. As a proof-of-concept, we apply CPT-seq and fragScaff to substantially boost the contiguity of de novo assemblies of the human, mouse, and fly genomes, increasing the scaffold N50 of de novo assemblies by eight- to 57-fold with high accuracy. We also demonstrate that fragScaff is complementary to Hi-C-based contact probability maps, providing midrange contiguity to support robust, accurate chromosome-scale de novo genome assemblies without the need for laborious in vivo cloning steps. Finally, we demonstrate CPT-seq as a means of anchoring unplaced novel human contigs to the reference genome as well as for detecting misassembled sequences.
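
    The scaffolding idea described here (score pairs of contig ends by how often they share an indexed pool, then resolve the join graph with a spanning tree) can be sketched as below. This is a simplified illustration, not fragScaff itself: raw shared-pool counts stand in for its statistics, the `min_shared` threshold and all names and data are hypothetical, and Kruskal's algorithm is run on descending counts, which is equivalent to a minimum spanning tree on negated weights.

```python
from itertools import combinations


def shared_pool_scores(end_pools, min_shared=2):
    """Count shared pools for every pair of contig ends, keeping strong pairs."""
    scores = {}
    for a, b in combinations(sorted(end_pools), 2):
        shared = len(end_pools[a] & end_pools[b])
        if shared >= min_shared:
            scores[(a, b)] = shared
    return scores


def spanning_tree(scores):
    """Kruskal's algorithm with union-find, taking the strongest joins first."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    tree = []
    for (a, b), weight in sorted(scores.items(), key=lambda kv: -kv[1]):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:  # skip joins that would close a cycle
            parent[root_a] = root_b
            tree.append((a, b, weight))
    return tree


# Hypothetical data: pool indices in which reads hit each contig end.
end_pools = {
    "A_R": {1, 2, 3, 4},
    "B_L": {1, 2, 3, 9},
    "B_R": {5, 6, 7},
    "C_L": {4, 5, 6, 7},
    "C_R": {8},
}
print(spanning_tree(shared_pool_scores(end_pools)))
```

    In this toy input the retained joins place contig A next to B and B next to C, while the single shared pool between A_R and C_L falls below the threshold and is ignored.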

  19. In vitro, long-range sequence information for de novo genome assembly via transposase contiguity

    PubMed Central

    Adey, Andrew; Kitzman, Jacob O.; Burton, Joshua N.; Daza, Riza; Kumar, Akash; Christiansen, Lena; Ronaghi, Mostafa; Amini, Sasan; Gunderson, Kevin L.; Steemers, Frank J.

    2014-01-01

    We describe a method that exploits contiguity preserving transposase sequencing (CPT-seq) to facilitate the scaffolding of de novo genome assemblies. CPT-seq is an entirely in vitro means of generating libraries comprised of 9216 indexed pools, each of which contains thousands of sparsely sequenced long fragments ranging from 5 kilobases to >1 megabase. These pools are “subhaploid,” in that the lengths of fragments contained in each pool sums to ∼5% to 10% of the full genome. The scaffolding approach described here, termed fragScaff, leverages coincidences between the content of different pools as a source of contiguity information. Specifically, CPT-seq data is mapped to a de novo genome assembly, followed by the identification of pairs of contigs or scaffolds whose ends disproportionately co-occur in the same indexed pools, consistent with true adjacency in the genome. Such candidate “joins” are used to construct a graph, which is then resolved by a minimum spanning tree. As a proof-of-concept, we apply CPT-seq and fragScaff to substantially boost the contiguity of de novo assemblies of the human, mouse, and fly genomes, increasing the scaffold N50 of de novo assemblies by eight- to 57-fold with high accuracy. We also demonstrate that fragScaff is complementary to Hi-C-based contact probability maps, providing midrange contiguity to support robust, accurate chromosome-scale de novo genome assemblies without the need for laborious in vivo cloning steps. Finally, we demonstrate CPT-seq as a means of anchoring unplaced novel human contigs to the reference genome as well as for detecting misassembled sequences. PMID:25327137

  20. Bioinformatics challenges in de novo transcriptome assembly using short read sequences in the absence of a reference genome sequence.

    PubMed

    Góngora-Castillo, Elsa; Buell, C Robin

    2013-04-01

    Plant natural product research can be facilitated through genome and transcriptome sequencing approaches that generate informative sequence and expression datasets that enable characterization of biochemical pathways of interest. As the overwhelming majority of plant-derived natural products are derived from species with little, if any, sequence and/or genomic resources, the ability to perform whole genome shotgun sequencing and assembly has been and will continue to be transformative, as access to a genome sequence provides molecular resources and a context for discovery and characterization of biosynthetic pathways. Due to the reduced size and complexity of the transcriptome relative to the genome, transcriptome sequencing provides a rapid, inexpensive approach to access gene sequences, gene expression abundances, and gene expression patterns in any species, including those that lack a reference genome sequence. To date, successful applications of RNA sequencing in conjunction with de novo transcriptome assembly have enabled identification of new genes in an array of biochemical pathways in plants. While sequencing technologies are well developed, challenges remain in the handling and analysis of transcriptome sequences. In this Highlight article, we provide an overview of the bioinformatics challenges associated with transcriptome analyses using short read sequences and how to address these issues in plant species that lack a reference genome.

  1. LESSONS IN DE NOVO PEPTIDE SEQUENCING BY TANDEM MASS SPECTROMETRY

    PubMed Central

    Medzihradszky, Katalin F.; Chalkley, Robert J.

    2015-01-01

    Mass spectrometry has become the method of choice for the qualitative and quantitative characterization of protein mixtures isolated from all kinds of living organisms. The raw data in these studies are MS/MS spectra, usually of peptides produced by proteolytic digestion of a protein. These spectra are “translated” into peptide sequences, normally with the help of various search engines. Data acquisition and interpretation have both been automated, and most researchers look only at the summary of the identifications without ever viewing the underlying raw data used for assignments. Automated analysis of data is essential due to the volume produced. However, being familiar with the finer intricacies of peptide fragmentation processes, and experiencing the difficulties of manual data interpretation allow a researcher to be able to more critically evaluate key results, particularly because there are many known rules of peptide fragmentation that are not incorporated into search engine scoring. Since the most commonly used MS/MS activation method is collision-induced dissociation (CID), in this article we present a brief review of the history of peptide CID analysis. Next, we provide a detailed tutorial on how to determine peptide sequences from CID data. Although the focus of the tutorial is de novo sequencing, the lessons learned and resources supplied are useful for data interpretation in general. PMID:25667941

  2. De novo transcript sequence reconstruction from RNA-Seq: reference generation and analysis with Trinity

    PubMed Central

    Yassour, Moran; Grabherr, Manfred; Blood, Philip D.; Bowden, Joshua; Couger, Matthew Brian; Eccles, David; Li, Bo; Lieber, Matthias; MacManes, Matthew D.; Ott, Michael; Orvis, Joshua; Pochet, Nathalie; Strozzi, Francesco; Weeks, Nathan; Westerman, Rick; William, Thomas; Dewey, Colin N.; Henschel, Robert; LeDuc, Richard D.; Friedman, Nir; Regev, Aviv

    2013-01-01

    De novo assembly of RNA-Seq data allows us to study transcriptomes without the need for a genome sequence, such as in non-model organisms of ecological and evolutionary importance, cancer samples, or the microbiome. In this protocol, we describe the use of the Trinity platform for de novo transcriptome assembly from RNA-Seq data in non-model organisms. We also present Trinity’s supported companion utilities for downstream applications, including RSEM for transcript abundance estimation, R/Bioconductor packages for identifying differentially expressed transcripts across samples, and approaches to identify protein coding genes. In an included tutorial we provide a workflow for genome-independent transcriptome analysis leveraging the Trinity platform. The software, documentation and demonstrations are freely available from http://trinityrnaseq.sf.net. PMID:23845962

  3. Evaluation and validation of de novo and hybrid assembly techniques to derive high quality genome sequences

    SciTech Connect

    Utturkar, Sagar M.; Klingeman, Dawn Marie

    2014-06-14

    Our motivation with this work was to assess the potential of different types of sequence data combined with de novo and hybrid assembly approaches to improve existing draft genome sequences. Our results show Illumina, 454 and PacBio sequencing technologies were used to generate de novo and hybrid genome assemblies for four different bacteria, which were assessed for quality using summary statistics (e.g. number of contigs, N50) and in silico evaluation tools. Differences in predictions of multiple copies of rDNA operons for each respective bacterium were evaluated by PCR and Sanger sequencing, and then the validated results were applied as an additional criterion to rank assemblies. In general, assemblies using longer PacBio reads were better able to resolve repetitive regions. In this study, the combination of Illumina and PacBio sequence data assembled through the ALLPATHS-LG algorithm gave the best summary statistics and most accurate rDNA operon number predictions. This study will aid others looking to improve existing draft genome assemblies. Availability and implementation: all assembly tools except CLC Genomics Workbench are freely available under GNU General Public License.
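
    The N50 summary statistic named in this abstract has a standard definition: the length of the contig at which the running total of contig lengths, taken from longest to shortest, first reaches half of the total assembly length. A minimal sketch, with hypothetical contig lengths:

```python
def n50(contig_lengths):
    """Return the N50 of an assembly given its contig lengths."""
    lengths = sorted(contig_lengths, reverse=True)
    if not lengths:
        raise ValueError("no contigs given")
    half = sum(lengths) / 2
    running = 0
    for length in lengths:
        running += length
        if running >= half:
            return length


# Hypothetical contig lengths in kilobases; total 290, half 145.
print(n50([80, 70, 50, 40, 30, 20]))  # 80 + 70 = 150 >= 145, so N50 is 70
```

    A higher N50 means fewer, longer contigs cover half the assembly, which is why it is reported alongside the number of contigs when ranking assemblies.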

  4. Evaluation and validation of de novo and hybrid assembly techniques to derive high quality genome sequences

    DOE PAGES

    Utturkar, Sagar M.; Klingeman, Dawn Marie; Land, Miriam L.; ...

    2014-06-14

    Our motivation with this work was to assess the potential of different types of sequence data combined with de novo and hybrid assembly approaches to improve existing draft genome sequences. Illumina, 454 and PacBio sequencing technologies were used to generate de novo and hybrid genome assemblies for four different bacteria, which were assessed for quality using summary statistics (e.g. number of contigs, N50) and in silico evaluation tools. Differences in predictions of multiple copies of rDNA operons for each respective bacterium were evaluated by PCR and Sanger sequencing, and then the validated results were applied as an additional criterion to rank assemblies. In general, assemblies using longer PacBio reads were better able to resolve repetitive regions. In this study, the combination of Illumina and PacBio sequence data assembled through the ALLPATHS-LG algorithm gave the best summary statistics and most accurate rDNA operon number predictions. This study will aid others looking to improve existing draft genome assemblies. Availability and implementation: all assembly tools except CLC Genomics Workbench are freely available under GNU General Public License.

  5. The de novo assembly of mitochondrial genomes of the extinct passenger pigeon (Ectopistes migratorius) with next generation sequencing.

    PubMed

    Hung, Chih-Ming; Lin, Rong-Chien; Chu, Jui-Hua; Yeh, Chia-Fen; Yao, Chiou-Ju; Li, Shou-Hsien

    2013-01-01

    The information from ancient DNA (aDNA) provides an unparalleled opportunity to infer phylogenetic relationships and population history of extinct species and to investigate genetic evolution directly. However, the degraded and fragmented nature of aDNA has posed technical challenges for studies based on conventional PCR amplification. In this study, we present an approach based on next generation sequencing to efficiently sequence the complete mitochondrial genome (mitogenome) of two extinct passenger pigeons (Ectopistes migratorius) using de novo assembly of massive short (90 bp), paired-end or single-end reads. Although varying levels of human contamination and low levels of postmortem nucleotide lesion were observed, they did not impact sequencing accuracy. Our results demonstrated that the de novo assembly of shotgun sequence reads could be a potent approach to sequence mitogenomes, and offered an efficient way to infer evolutionary history of extinct species.

  6. High-definition De Novo Sequencing of Crustacean Hyperglycemic Hormone (CHH)-family Neuropeptides

    PubMed Central

    Jia, Chenxi; Hui, Limei; Cao, Weifeng; Lietz, Christopher B.; Jiang, Xiaoyue; Chen, Ruibing; Catherman, Adam D.; Thomas, Paul M.; Ge, Ying; Kelleher, Neil L.; Li, Lingjun

    2012-01-01

    A complete understanding of the biological functions of large signaling peptides (>4 kDa) requires comprehensive characterization of their amino acid sequences and post-translational modifications, which presents significant analytical challenges. In the past decade, there has been great success with mass spectrometry-based de novo sequencing of small neuropeptides. However, these approaches are less applicable to larger neuropeptides because of the inefficient fragmentation of peptides larger than 4 kDa and their lower endogenous abundance. The conventional proteomics approach focuses on large-scale determination of protein identities via database searching, lacking the ability for in-depth elucidation of individual amino acid residues. Here, we present a multifaceted MS approach for identification and characterization of large crustacean hyperglycemic hormone (CHH)-family neuropeptides, a class of peptide hormones that play central roles in the regulation of many important physiological processes of crustaceans. Six crustacean CHH-family neuropeptides (8–9.5 kDa), including two novel peptides with extensive disulfide linkages and PTMs, were fully sequenced without reference to genomic databases. High-definition de novo sequencing was achieved by a combination of bottom-up, off-line top-down, and on-line top-down tandem MS methods. Statistical evaluation indicated that these methods provided complementary information for sequence interpretation and increased the local identification confidence of each amino acid. Further investigations by MALDI imaging MS mapped the spatial distribution and colocalization patterns of various CHH-family neuropeptides in the neuroendocrine organs, revealing that two CHH-subfamilies are involved in distinct signaling pathways. PMID:23028060

  7. Personal genome sequencing: current approaches and challenges

    PubMed Central

    Snyder, Michael; Du, Jiang; Gerstein, Mark

    2010-01-01

    The revolution in DNA sequencing technologies has now made it feasible to determine the genome sequences of many individuals; i.e., “personal genomes.” Genome sequences of cells and tissues from both normal and disease states have been determined. Using current approaches, whole human genome sequences are not typically assembled and determined de novo, but, instead, variations relative to a reference sequence are identified. We discuss the current state of personal genome sequencing, the main steps involved in determining a genome sequence (i.e., identifying single-nucleotide polymorphisms [SNPs] and structural variations [SVs], assembling new sequences, and phasing haplotypes), and the challenges and performance metrics for evaluating the accuracy of the reconstruction. Finally, we consider the possible individual and societal benefits of personal genome sequences. PMID:20194435

  8. De Novo Transcriptome Sequencing in Anopheles funestus Using Illumina RNA-Seq Technology

    PubMed Central

    Crawford, Jacob E.; Guelbeogo, Wamdaogo M.; Sanou, Antoine; Traoré, Alphonse; Vernick, Kenneth D.; Sagnon, N'Fale; Lazzaro, Brian P.

    2010-01-01

    Background: Anopheles funestus is one of the primary vectors of human malaria, which causes a million deaths each year in sub-Saharan Africa. Few scientific resources are available to facilitate studies of this mosquito species and relatively little is known about its basic biology and evolution, making development and implementation of novel disease control efforts more difficult. The An. funestus genome has not been sequenced, so in order to facilitate genome-scale experimental biology, we have sequenced the adult female transcriptome of An. funestus from a newly founded colony in Burkina Faso, West Africa, using the Illumina GAIIx next generation sequencing platform. Methodology/Principal Findings: We assembled short Illumina reads de novo using a novel approach involving iterative de novo assemblies and “target-based” contig clustering. We then selected a conservative set of 15,527 contigs through comparisons to four Dipteran transcriptomes as well as multiple functional and conserved protein domain databases. Comparison to the Anopheles gambiae immune system identified 339 contigs as putative immune genes, thus identifying a large portion of the immune system that can form the basis for subsequent studies of this important malaria vector. We identified 5,434 1∶1 orthologues between An. funestus and An. gambiae and found that among these 1∶1 orthologues, the protein sequences of those with putative immune function were significantly more diverged than the transcriptome as a whole. Short read alignments to the contig set revealed almost 367,000 genetic polymorphisms segregating in the An. funestus colony and demonstrated the utility of the assembled transcriptome for use in RNA-seq based measurements of gene expression. Conclusions/Significance: We developed a pipeline that makes de novo transcriptome sequencing possible in virtually any organism at a very reasonable cost ($6,300 in sequencing costs in our case). We anticipate that our approach could be used

  9. Combining phage display with de novo protein sequencing for reverse engineering of monoclonal antibodies.

    PubMed

    Rickert, Keith W; Grinberg, Luba; Woods, Robert M; Wilson, Susan; Bowen, Michael A; Baca, Manuel

    2016-01-01

    The enormous diversity created by gene recombination and somatic hypermutation makes de novo protein sequencing of monoclonal antibodies a uniquely challenging problem. Modern mass spectrometry-based sequencing will rarely, if ever, provide a single unambiguous sequence for the variable domains. A more likely outcome is computation of an ensemble of highly similar sequences that can satisfy the experimental data. This outcome can result in the need for empirical testing of many candidate sequences, sometimes iteratively, to identify one which can replicate the activity of the parental antibody. Here we describe an improved approach to antibody protein sequencing by using phage display technology to generate a combinatorial library of sequences that satisfy the mass spectrometry data, and selecting for functional candidates that bind antigen. This approach was used to reverse engineer 2 commercially-obtained monoclonal antibodies against murine CD137. Proteomic data enabled us to assign the majority of the variable domain sequences, with the exception of 3-5% of the sequence located within or adjacent to complementarity-determining regions. To efficiently resolve the sequence in these regions, small phage-displayed libraries were generated and subjected to antigen binding selection. Following enrichment of antigen-binding clones, 2 clones were selected for each antibody and recombinantly expressed as antigen-binding fragments (Fabs). In both cases, the reverse-engineered Fabs exhibited identical antigen binding affinity, within error, as Fabs produced from the commercial IgGs. This combination of proteomic and protein engineering techniques provides a useful approach to simplifying the technically challenging process of reverse engineering monoclonal antibodies from protein material.

  10. Combining phage display with de novo protein sequencing for reverse engineering of monoclonal antibodies

    PubMed Central

    Rickert, Keith W.; Grinberg, Luba; Woods, Robert M.; Wilson, Susan; Bowen, Michael A.; Baca, Manuel

    2016-01-01

    The enormous diversity created by gene recombination and somatic hypermutation makes de novo protein sequencing of monoclonal antibodies a uniquely challenging problem. Modern mass spectrometry-based sequencing will rarely, if ever, provide a single unambiguous sequence for the variable domains. A more likely outcome is computation of an ensemble of highly similar sequences that can satisfy the experimental data. This outcome can result in the need for empirical testing of many candidate sequences, sometimes iteratively, to identify one which can replicate the activity of the parental antibody. Here we describe an improved approach to antibody protein sequencing by using phage display technology to generate a combinatorial library of sequences that satisfy the mass spectrometry data, and selecting for functional candidates that bind antigen. This approach was used to reverse engineer 2 commercially-obtained monoclonal antibodies against murine CD137. Proteomic data enabled us to assign the majority of the variable domain sequences, with the exception of 3–5% of the sequence located within or adjacent to complementarity-determining regions. To efficiently resolve the sequence in these regions, small phage-displayed libraries were generated and subjected to antigen binding selection. Following enrichment of antigen-binding clones, 2 clones were selected for each antibody and recombinantly expressed as antigen-binding fragments (Fabs). In both cases, the reverse-engineered Fabs exhibited identical antigen binding affinity, within error, as Fabs produced from the commercial IgGs. This combination of proteomic and protein engineering techniques provides a useful approach to simplifying the technically challenging process of reverse engineering monoclonal antibodies from protein material. PMID:26852694

  11. Automated Antibody De Novo Sequencing and Its Utility in Biopharmaceutical Discovery

    NASA Astrophysics Data System (ADS)

    Sen, K. Ilker; Tang, Wilfred H.; Nayak, Shruti; Kil, Yong J.; Bern, Marshall; Ozoglu, Berk; Ueberheide, Beatrix; Davis, Darryl; Becker, Christopher

    2017-01-01

    Applications of antibody de novo sequencing in the biopharmaceutical industry range from the discovery of new antibody drug candidates to identifying reagents for research and determining the primary structure of innovator products for biosimilar development. When murine, phage display, or patient-derived monoclonal antibodies against a target of interest are available, but the cDNA or the original cell line is not, de novo protein sequencing is required to humanize and recombinantly express these antibodies, followed by in vitro and in vivo testing for functional validation. Availability of fully automated software tools for monoclonal antibody de novo sequencing enables efficient and routine analysis. Here, we present a novel method to automatically de novo sequence antibodies using mass spectrometry and the Supernovo software. The robustness of the algorithm is demonstrated through a series of stress tests.

  12. Evaluating de novo sequencing in proteomics: already an accurate alternative to database-driven peptide identification?

    PubMed

    Muth, Thilo; Renard, Bernhard Y

    2017-03-21

    While peptide identifications in mass spectrometry (MS)-based shotgun proteomics are mostly obtained using database search methods, high-resolution spectrum data from modern MS instruments nowadays offer the prospect of improving the performance of computational de novo peptide sequencing. The major benefit of de novo sequencing is that it does not require a reference database to deduce full-length or partial tag-based peptide sequences directly from experimental tandem mass spectrometry spectra. Although various algorithms have been developed for automated de novo sequencing, the prediction accuracy of proposed solutions has been rarely evaluated in independent benchmarking studies. The main objective of this work is to provide a detailed evaluation on the performance of de novo sequencing algorithms on high-resolution data. For this purpose, we processed four experimental data sets acquired from different instrument types from collision-induced dissociation and higher energy collisional dissociation (HCD) fragmentation mode using the software packages Novor, PEAKS and PepNovo. Moreover, the accuracy of these algorithms is also tested on ground truth data based on simulated spectra generated from peak intensity prediction software. We found that Novor shows the overall best performance compared with PEAKS and PepNovo with respect to the accuracy of correct full peptide, tag-based and single-residue predictions. In addition, the same tool outpaced the commercial competitor PEAKS in terms of running time speedup by factors of around 12-17. Despite around 35% prediction accuracy for complete peptide sequences on HCD data sets, taken as a whole, the evaluated algorithms perform moderately on experimental data but show a significantly better performance on simulated data (up to 84% accuracy). Further, we describe the most frequently occurring de novo sequencing errors and evaluate the influence of missing fragment ion peaks and spectral noise on the accuracy. 
Finally

  13. Genomic Resources for Water Yam (Dioscorea alata L.): Analyses of EST-Sequences, De Novo Sequencing and GBS Libraries.

    PubMed

    Saski, Christopher A; Bhattacharjee, Ranjana; Scheffler, Brian E; Asiedu, Robert

    2015-01-01

    The falling cost and rapid progress in next-generation sequencing techniques coupled with high performance computational approaches have resulted in large-scale discovery of advanced genomic resources in several model and non-model plant species. Yam (Dioscorea spp.) is a major food and cash crop in many countries, but research efforts to understand its genetics and generate genomic information for the crop have been limited. The availability of a large number of genomic resources including genome-wide molecular markers will accelerate the breeding efforts and application of genomic selection in yams. In the present study, several methods including expressed sequence tags (EST)-sequencing, de novo sequencing, and genotyping-by-sequencing (GBS) profiles on two yam (Dioscorea alata L.) genotypes (TDa 95/00328 and TDa 95-310) were performed to generate genomic resources for use in its improvement programs. This includes a comprehensive set of EST-SSRs, genomic SSRs, whole genome SNPs, and reduced representation SNPs. A total of 1,152 EST-SSRs were developed from >40,000 EST-sequences generated from the two genotypes. A set of 388 EST-SSRs were validated as polymorphic showing a polymorphism rate of 34% when tested on two diverse parents targeted for anthracnose disease. In addition, approximately 40X de novo whole genome sequence coverage was generated for each of the two genotypes, and a total of 18,584 and 15,952 genomic SSRs were identified for TDa 95/00328 and TDa 95-310, respectively. A custom made pipeline resulted in the selection of 573 genomic SSRs common across the two genotypes, of which only eight failed, 478 being polymorphic and 62 monomorphic indicating a polymorphic rate of 83.5%. Additionally, 288,505 high quality SNPs were also identified between these two genotypes. Genotyping by sequencing reads on these two genotypes also revealed 36,790 overlapping SNP positions that are distributed throughout the genome. 
Our efforts in using different approaches

  14. Partial De Novo Sequencing and Unusual CID Fragmentation of a 7 kDa, Disulfide-Bridged Toxin

    NASA Astrophysics Data System (ADS)

    Medzihradszky, Katalin F.; Bohlen, Christopher J.

    2012-05-01

    A 7 kDa toxin isolated from the venom of the Texas coral snake (Micrurus tener tener) was subjected to collision-induced dissociation (CID) and electron-transfer dissociation (ETD) analyses both before and after reduction at low pH. Manual and automated approaches to de novo sequencing are compared in detail. Manual de novo sequencing utilizing the combination of high accuracy CID and ETD data and an acid-related cleavage yielded the N-terminal half of the sequence from the reduced species. The intact polypeptide, containing 3 disulfide bridges, produced a series of unusual fragments in ion trap CID experiments: abundant internal amino acid losses were detected, and also one of the disulfide-linkage positions could be determined from fragments formed by the cleavage of two bonds. In addition, internal and c-type fragments were also observed.

  15. Comprehensive de novo peptide sequencing from MS/MS pairs generated through complementary collision induced dissociation and 351 nm ultraviolet photodissociation.

    PubMed

    Horton, Andrew Pitchford; Robotham, Scott A; Cannon, Joe R; Holden, Dustin D; Marcotte, Edward M; Brodbelt, Jennifer S

    2017-02-24

    We describe a strategy for de novo peptide sequencing based on matched pairs of tandem mass spectra (MS/MS) obtained by collision induced dissociation (CID) and 351 nm ultraviolet photodissociation (UVPD). Each precursor ion is isolated twice with the mass spectrometer switching between CID and UVPD activation modes to obtain a complementary MS/MS pair. To interpret these paired spectra, we modified the UVnovo de novo sequencing software to automatically learn from and interpret fragmentation spectra, provided a representative set of training data. This machine learning procedure, using random forests, synthesizes information from one or multiple complementary spectra, such as the CID/UVPD pairs, into peptide fragmentation site predictions. In doing so, the burden of fragmentation model definition shifts from programmer to machine and opens up the model parameter space for inclusion of nonobvious features and interactions. This spectral synthesis also serves to transform distinct types of spectra into a common representation for subsequent activation-independent processing steps. Then, independent from precursor activation constraints, UVnovo's de novo sequencing procedure generates and scores sequence candidates for each precursor. We demonstrate the combined experimental and computational approach for de novo sequencing using whole cell E. coli lysate. In benchmarks on the CID/UVPD data, UVnovo assigned correct full-length sequences to 83% of the spectral pairs of doubly charged ions with high-confidence database identifications. Considering only top-ranked de novo predictions, 70% of the pairs were deciphered correctly. This de novo sequencing performance exceeds that of PEAKS and PepNovo on the CID spectra and that of UVnovo on CID or UVPD spectra alone. As presented here, the methods for paired CID/UVPD spectral acquisition and interpretation constitute a powerful workflow for high-throughput and accurate de novo peptide sequencing.

  16. Whole-genome sequencing for comparative genomics and de novo genome assembly.

    PubMed

    Benjak, Andrej; Sala, Claudia; Hartkoorn, Ruben C

    2015-01-01

    Next-generation sequencing technologies for whole-genome sequencing of mycobacteria are rapidly becoming an attractive alternative to more traditional sequencing methods. In particular this technology is proving useful for genome-wide identification of mutations in mycobacteria (comparative genomics) as well as for de novo assembly of whole genomes. Next-generation sequencing however generates a vast quantity of data that can only be transformed into a usable and comprehensible form using bioinformatics. Here we describe the methodology one would use to prepare libraries for whole-genome sequencing, and the basic bioinformatics to identify mutations in a genome following Illumina HiSeq or MiSeq sequencing, as well as de novo genome assembly following sequencing using Pacific Biosciences (PacBio).

  17. An ensemble strategy that significantly improves de novo assembly of microbial genomes from metagenomic next-generation sequencing data

    PubMed Central

    Deng, Xutao; Naccache, Samia N.; Ng, Terry; Federman, Scot; Li, Linlin; Chiu, Charles Y.; Delwart, Eric L.

    2015-01-01

    Next-generation sequencing (NGS) approaches rapidly produce millions to billions of short reads, which allow pathogen detection and discovery in human clinical, animal and environmental samples. A major limitation of sequence homology-based identification for highly divergent microorganisms is the short length of reads generated by most highly parallel sequencing technologies. Short reads require a high level of sequence similarities to annotated genes to confidently predict gene function or homology. Such recognition of highly divergent homologues can be improved by reference-free (de novo) assembly of short overlapping sequence reads into larger contigs. We describe an ensemble strategy that integrates the sequential use of various de Bruijn graph and overlap-layout-consensus assemblers with a novel partitioned sub-assembly approach. We also proposed new quality metrics that are suitable for evaluating metagenome de novo assembly. We demonstrate that this new ensemble strategy tested using in silico spike-in, clinical and environmental NGS datasets achieved significantly better contigs than current approaches. PMID:25586223
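
    The abstract refers to de Bruijn graph assemblers, which merge short overlapping reads into larger contigs. As a toy illustration only (it is not the ensemble strategy described in the paper, and real assemblers also handle sequencing errors, reverse complements and branching), the sketch below builds a de Bruijn graph whose nodes are (k-1)-mers and walks a non-branching path to spell out a contig. The reads, the value of k and the function names are assumptions made for this example.

```python
from collections import defaultdict


def build_de_bruijn(reads, k):
    """Map each (k-1)-mer prefix to the set of (k-1)-mer suffixes that follow it."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])  # edge: prefix node -> suffix node
    return graph


def extend_contig(graph, start):
    """Follow edges from `start` while exactly one successor exists."""
    contig = start
    node = start
    visited = {start}
    while len(graph.get(node, ())) == 1:
        (nxt,) = graph[node]
        if nxt in visited:  # stop on a cycle instead of looping forever
            break
        contig += nxt[-1]  # each step adds one new base
        visited.add(nxt)
        node = nxt
    return contig


if __name__ == "__main__":
    # Hypothetical overlapping reads sampled from one short sequence.
    reads = ["ATGGCGT", "GGCGTGC", "GTGCAAT"]
    graph = build_de_bruijn(reads, k=4)
    print(extend_contig(graph, "ATG"))
```

    The walk stops at the first node with zero or several successors, which is where repeats and sequence variation fragment real assemblies.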

  18. De Novo Sequencing of Peptides from Top-Down Tandem Mass Spectra

    SciTech Connect

    Vyatkina, Kira; Wu, Si; Dekker, Lennard J. M.; VanDuijn, Martijn M.; Liu, Xiaowen; Tolić, Nikola; Dvorkin, Mikhail; Alexandrova, Sonya; Luider, Theo M.; Paša-Tolić, Ljiljana; Pevzner, Pavel A.

    2015-11-06

    De novo sequencing of proteins and peptides is one of the most important problems in mass spectrometry-driven proteomics. A variety of methods have been developed to accomplish this task from a set of bottom-up tandem (MS/MS) mass spectra. However, a more recently emerged top-down technology, now gaining more and more popularity, opens new perspectives for protein analysis and characterization, implying a need for efficient algorithms for processing this kind of MS/MS data. Here we describe a method that retrieves long and accurate sequence fragments of the proteins contained in a sample from a set of top-down MS/MS spectra. To this end, we outline a strategy for generating high-quality sequence tags from top-down spectra, and introduce the concept of a T-Bruijn graph by adapting the notion of an A-Bruijn graph, widely used in genomics, to the case of tags. The output of the proposed approach represents the set of amino acid strings spelled out by optimal paths in the connected components of a T-Bruijn graph. We illustrate its performance on top-down datasets acquired from carbonic anhydrase 2 (CAH2) and the Fab region of alemtuzumab.
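
    A sequence tag is a short run of residues read directly from a spectrum. The sketch below shows only the generic idea behind tag reading, not the tag-generation strategy or the T-Bruijn graph of this paper: the mass difference between neighbouring peaks of one fragment-ion series is matched against monoisotopic amino acid residue masses. It assumes singly charged, deisotoped peaks from a single ion series; the peak list, the 0.01 Da tolerance and the function names are hypothetical.

```python
# Monoisotopic residue masses in Da. Leucine and isoleucine are isobaric,
# so "L" stands for either of them here.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "N": 114.04293, "D": 115.02694, "Q": 128.05858, "K": 128.09496,
    "E": 129.04259, "M": 131.04049, "H": 137.05891, "F": 147.06841,
    "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}


def residue_for(delta, tol=0.01):
    """Return the single residue whose mass matches `delta`, else None."""
    matches = [aa for aa, mass in RESIDUE_MASS.items() if abs(delta - mass) <= tol]
    return matches[0] if len(matches) == 1 else None


def read_tag(peaks, tol=0.01):
    """Spell a tag from consecutive peak differences; '?' marks an unexplained gap."""
    ordered = sorted(peaks)
    tag = []
    for low, high in zip(ordered, ordered[1:]):
        tag.append(residue_for(high - low, tol) or "?")
    return "".join(tag)


if __name__ == "__main__":
    # Hypothetical peak ladder, for illustration only.
    print(read_tag([100.0, 157.02146, 228.05857, 315.0906]))
```

    Gaps marked '?' correspond to missing or noisy fragment peaks, which is one reason real methods combine many tags rather than trusting a single ladder.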

  19. Strict de novo methylation of the 35S enhancer sequence in gentian.

    PubMed

    Mishiba, Kei-ichiro; Yamasaki, Satoshi; Nakatsuka, Takashi; Abe, Yoshiko; Daimon, Hiroyuki; Oda, Masayuki; Nishihara, Masahiro

    2010-03-23

    A novel transgene silencing phenomenon was found in the ornamental plant, gentian (Gentiana triflora x G. scabra), in which the introduced Cauliflower mosaic virus (CaMV) 35S promoter region was strictly methylated, irrespective of the transgene copy number and integrated loci. Transgenic tobacco having the same vector did not show the silencing behavior. Not only unmodified, but also modified 35S promoters containing a 35S enhancer sequence were found to be highly methylated in the single copy transgenic gentian lines. The 35S core promoter (-90)-introduced transgenic lines showed a small degree of methylation, implying that the 35S enhancer sequence was involved in the methylation machinery. The rigorous silencing phenomenon enabled us to analyze methylation in a number of the transgenic lines in parallel, which led to the discovery of a consensus target region for de novo methylation, which comprised an asymmetric cytosine (CpHpH; H is A, C or T) sequence. Consequently, distinct footprints of de novo methylation were detected in each (modified) 35S promoter sequence, and the enhancer region (-148 to -85) was identified as a crucial target for de novo methylation. Electrophoretic mobility shift assay (EMSA) showed that complexes formed in gentian nuclear extract with the -149 to -124 and -107 to -83 region probes were distinct from those of tobacco nuclear extracts, suggesting that the complexes might contribute to de novo methylation. Our results provide insights into the phenomenon of sequence- and species- specific gene silencing in higher plants.
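
    The abstract identifies asymmetric cytosines (CpHpH, where H is A, C or T) as the consensus target of de novo methylation. As a small sketch of how such sites can be located in a sequence, the code below classifies each cytosine on the given strand as CG, CHG or CHH from the two bases that follow it. The example sequence and function names are illustrative assumptions, and only the forward strand is examined.

```python
H_BASES = "ACT"  # IUPAC code H: any base except G


def cytosine_context(seq, i):
    """Classify the cytosine at index i as 'CG', 'CHG' or 'CHH'.

    Returns None when the context cannot be determined (too close to the
    3' end, or an ambiguous base such as N follows the cytosine).
    """
    seq = seq.upper()
    if seq[i] != "C":
        raise ValueError("position %d is not a cytosine" % i)
    following = seq[i + 1:i + 3]
    if following[:1] == "G":
        return "CG"
    if len(following) < 2 or following[0] not in H_BASES:
        return None
    if following[1] == "G":
        return "CHG"
    if following[1] in H_BASES:
        return "CHH"
    return None


def chh_sites(seq):
    """Return indices of all cytosines in the asymmetric CHH context."""
    return [
        i for i, base in enumerate(seq.upper())
        if base == "C" and cytosine_context(seq, i) == "CHH"
    ]


if __name__ == "__main__":
    # Hypothetical sequence, not taken from the 35S promoter.
    print(chh_sites("CCTACGCTGC"))
```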

  20. DIME: a novel framework for de novo metagenomic sequence assembly.

    PubMed

    Guo, Xuan; Yu, Ning; Ding, Xiaojun; Wang, Jianxin; Pan, Yi

    2015-02-01

    The recently developed next generation sequencing platforms not only decrease the cost for metagenomics data analysis, but also greatly enlarge the size of metagenomic sequence datasets. A common bottleneck of available assemblers is that the trade-off between the noise of the resulting contigs and the gain in sequence length for better annotation has not received enough attention for large-scale sequencing projects, especially for datasets with low coverage and a large number of nonoverlapping contigs. To address this limitation and promote both accuracy and efficiency, we develop a novel metagenomic sequence assembly framework, DIME, based on the DIvide, conquer, and MErge strategies. In addition, we give two MapReduce implementations of DIME, DIME-cap3 and DIME-genovo, on the Apache Hadoop platform. For a systematic comparison of the performance of the assembly tasks, we tested DIME and five other popular short read assembly programs, Cap3, Genovo, MetaVelvet, SOAPdenovo, and SPAdes, on four synthetic and three real metagenomic sequence datasets ranging from fifty thousand to a couple million reads in size. The experimental results demonstrate that our method not only partitions the sequence reads with an extremely high accuracy, but also reconstructs more bases, generates higher quality assembled consensus, and yields higher assembly scores, including corrected N50 and BLAST-score-per-base, than other tools with a nearly theoretical speed-up. Results indicate that DIME offers great improvement in assembly across a range of sequence abundances and thus is robust to decreasing coverage.

  1. De Novo Sequencing of Top-Down Tandem Mass Spectra: A Next Step towards Retrieving a Complete Protein Sequence

    PubMed Central

    Vyatkina, Kira

    2017-01-01

    De novo sequencing of tandem (MS/MS) mass spectra represents the only way to determine the sequence of proteins from organisms with unknown genomes, or the ones not directly inscribed in a genome—such as antibodies, or novel splice variants. Top-down mass spectrometry provides new opportunities for analyzing such proteins; however, retrieving a complete protein sequence from top-down MS/MS spectra still remains a distant goal. In this paper, we review the state-of-the-art on this subject, and enhance our previously developed Twister algorithm for de novo sequencing of peptides from top-down MS/MS spectra to derive longer sequence fragments of a target protein. PMID:28248257

  2. Identification of optimum sequencing depth especially for de novo genome assembly of small genomes using next generation sequencing data.

    PubMed

    Desai, Aarti; Marwah, Veer Singh; Yadav, Akshay; Jha, Vineet; Dhaygude, Kishor; Bangar, Ujwala; Kulkarni, Vivek; Jere, Abhay

    2013-01-01

    Next Generation Sequencing (NGS) is a disruptive technology that has found widespread acceptance in the life sciences research community. The high throughput and low cost of sequencing have encouraged researchers to undertake ambitious genomic projects, especially in de novo genome sequencing. Currently, NGS systems generate sequence data as short reads, and de novo genome assembly using these short reads is computationally very intensive. Due to lower cost of sequencing and higher throughput, NGS systems now provide the ability to sequence genomes at high depth. However, currently no report is available highlighting the impact of high sequence depth on genome assembly using real data sets and multiple assembly algorithms. Recently, some studies have evaluated the impact of sequence coverage, error rate and average read length on genome assembly using multiple assembly algorithms; however, these evaluations were performed using simulated datasets. One limitation of using simulated datasets is that variables such as error rates, read length and coverage, which are known to impact genome assembly, are carefully controlled. Hence, this study was undertaken to identify the minimum depth of sequencing required for de novo assembly for different sized genomes using graph based assembly algorithms and real datasets. Illumina reads for E. coli (4.6 MB), S. kudriavzevii (11.18 MB) and C. elegans (100 MB) were assembled using SOAPdenovo, Velvet, ABySS, Meraculous and IDBA-UD. Our analysis shows that 50X is the optimum read depth for assembling these genomes using all assemblers except Meraculous, which requires 100X read depth. Moreover, our analysis shows that de novo assembly from 50X read data requires only 6-40 GB RAM depending on the genome size and assembly algorithm used. We believe that this information can be extremely valuable for researchers in designing experiments and multiplexing, which will enable optimum utilization of sequencing as well as analysis resources.
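
    Read depth (coverage) is conventionally estimated as the number of reads multiplied by the read length and divided by the genome size. The sketch below applies that relation to the 50X figure reported in the abstract; the 100 bp read length is an assumption made for illustration, since the abstract does not state one, and the function names are hypothetical.

```python
def coverage(num_reads, read_length, genome_size):
    """Average depth: total sequenced bases divided by genome size."""
    return num_reads * read_length / genome_size


def reads_for_depth(depth, read_length, genome_size):
    """Smallest whole number of reads that reaches the requested depth."""
    total_bases = depth * genome_size
    return -(-total_bases // read_length)  # integer ceiling division


if __name__ == "__main__":
    ecoli_genome = 4_600_000  # 4.6 Mb, as given in the abstract
    read_length = 100         # assumed read length in bases
    needed = reads_for_depth(50, read_length, ecoli_genome)
    print(needed, coverage(needed, read_length, ecoli_genome))
```

    The same relation can be inverted when planning multiplexing: dividing the expected yield of a run by the bases needed per genome gives a rough upper bound on how many samples can share that run.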

  3. De Novo Structure Prediction of Globular Proteins Aided by Sequence Variation-Derived Contacts

    PubMed Central

    Kosciolek, Tomasz; Jones, David T.

    2014-01-01

    The advent of high accuracy residue-residue intra-protein contact prediction methods enabled a significant boost in the quality of de novo structure predictions. Here, we investigate the potential benefits of combining a well-established fragment-based folding algorithm – FRAGFOLD, with PSICOV, a contact prediction method which uses sparse inverse covariance estimation to identify co-varying sites in multiple sequence alignments. Using a comprehensive set of 150 diverse globular target proteins, up to 266 amino acids in length, we are able to address the effectiveness and some limitations of such approaches to globular proteins in practice. Overall we find that using fragment assembly with both statistical potentials and predicted contacts is significantly better than either statistical potentials or contacts alone. Results show up to nearly 80% of correct predictions (TM-score ≥0.5) within the analysed dataset and a mean TM-score of 0.54. Unsuccessful modelling cases emerged either from conformational sampling problems, or insufficient contact prediction accuracy. Nevertheless, a strong dependency of the quality of final models on the fraction of satisfied predicted long-range contacts was observed. This not only highlights the importance of these contacts in determining the protein fold, but also (combined with other ensemble-derived qualities) provides a powerful guide as to the choice of correct models and the global quality of the selected model. A proposed quality assessment scoring function achieves 0.93 precision and 0.77 recall for the discrimination of correct folds on our dataset of decoys. These findings suggest the approach is well-suited for blind predictions on a variety of globular proteins of unknown 3D structure, provided that enough homologous sequences are available to construct a large and accurate multiple sequence alignment for the initial contact prediction step. PMID:24637808
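
    The "fraction of satisfied predicted long-range contacts" mentioned above can be illustrated with a small calculation. This is only a sketch under stated assumptions, not the authors' scoring function: the 8 Å distance cutoff, the sequence separation of at least 24 residues for "long-range", and the function name `satisfied_fraction` are common conventions or hypothetical choices that the abstract does not specify.

```python
import math


def satisfied_fraction(coords, contacts, min_separation=24, cutoff=8.0):
    """Fraction of predicted long-range contacts that a model satisfies.

    coords: list of (x, y, z) tuples, one representative atom per residue.
    contacts: iterable of (i, j) predicted contact pairs (0-based indices).
    min_separation and cutoff are assumed conventions, not from the abstract.
    """
    long_range = [(i, j) for i, j in contacts if abs(i - j) >= min_separation]
    if not long_range:
        return 0.0
    hits = sum(
        1 for i, j in long_range if math.dist(coords[i], coords[j]) <= cutoff
    )
    return hits / len(long_range)


# Toy model: 30 residues on a line, 3.8 Å apart, with the last residue
# folded back next to the first one.
coords = [(3.8 * k, 0.0, 0.0) for k in range(30)]
coords[29] = (0.0, 5.0, 0.0)

# Two long-range predictions and one short-range prediction (ignored).
contacts = [(0, 29), (1, 28), (2, 5)]

print(satisfied_fraction(coords, contacts))
```

    In the toy model only one of the two long-range predictions is satisfied, so the fraction is one half; a model ranking scheme of the kind described would prefer decoys with a higher value.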

  4. De novo structure prediction of globular proteins aided by sequence variation-derived contacts.

    PubMed

    Kosciolek, Tomasz; Jones, David T

    2014-01-01

    The advent of high accuracy residue-residue intra-protein contact prediction methods enabled a significant boost in the quality of de novo structure predictions. Here, we investigate the potential benefits of combining a well-established fragment-based folding algorithm--FRAGFOLD, with PSICOV, a contact prediction method which uses sparse inverse covariance estimation to identify co-varying sites in multiple sequence alignments. Using a comprehensive set of 150 diverse globular target proteins, up to 266 amino acids in length, we are able to address the effectiveness and some limitations of such approaches to globular proteins in practice. Overall we find that using fragment assembly with both statistical potentials and predicted contacts is significantly better than either statistical potentials or contacts alone. Results show up to nearly 80% of correct predictions (TM-score ≥0.5) within the analysed dataset and a mean TM-score of 0.54. Unsuccessful modelling cases emerged either from conformational sampling problems, or insufficient contact prediction accuracy. Nevertheless, a strong dependency of the quality of final models on the fraction of satisfied predicted long-range contacts was observed. This not only highlights the importance of these contacts in determining the protein fold, but also (combined with other ensemble-derived qualities) provides a powerful guide as to the choice of correct models and the global quality of the selected model. A proposed quality assessment scoring function achieves 0.93 precision and 0.77 recall for the discrimination of correct folds on our dataset of decoys. These findings suggest the approach is well-suited for blind predictions on a variety of globular proteins of unknown 3D structure, provided that enough homologous sequences are available to construct a large and accurate multiple sequence alignment for the initial contact prediction step.

  5. Approaching marine bioprospecting in hexacorals by RNA deep sequencing.

    PubMed

    Johansen, Steinar D; Emblem, Ase; Karlsen, Bård Ove; Okkenhaug, Siri; Hansen, Hilde; Moum, Truls; Coucheron, Dag H; Seternes, Ole Morten

    2010-07-31

    RNA deep sequencing represents a new complementary approach in marine bioprospecting. Next-generation sequencing platforms have recently been developed for de novo whole transcriptome analysis, small RNA discovery and gene expression profiling. Deep sequencing transcriptomics (sequencing the complete set of cellular transcripts at a specific stage or condition) leads to sequential identification of all expressed genes in a sample. When combined with high-throughput bioinformatics and protein synthesis, RNA deep sequencing represents a powerful new approach in gene product discovery and bioprospecting. Here we summarize recent progress in the analyses of hexacoral transcriptomes with a focus on cold-water sea anemones and related organisms.

  6. De novo proteomic sequencing of a monoclonal antibody raised against OX40 ligand.

    PubMed

    Pham, Victoria; Henzel, William J; Arnott, David; Hymowitz, Sarah; Sandoval, Wendy N; Truong, Bao-Tran; Lowman, Henry; Lill, Jennie R

    2006-05-01

    De novo sequencing of a full-length monoclonal antibody raised against OX40 ligand is described. Using a combination of overlapping complementary proteolytic and chemical digestions, with analysis by mass spectrometry and Edman degradation, both the heavy and light chains were fully sequenced. Particular attention was paid to those modifications that could be susceptible to degradation in the complementarity determining region and Fc region. An overview of the protocol is described, and suggestions for improvements to aid in such sequencing projects in the future are discussed.

  7. Terminal sequence importance of de novo proteins from binary-patterned library: stable artificial proteins with 11- or 12-amino acid alphabet.

    PubMed

    Okura, Hiromichi; Takahashi, Tsuyoshi; Mihara, Hisakazu

    2012-06-01

    Successful approaches to de novo protein design suggest a great potential to create novel structural folds and to understand natural rules of protein folding. For these purposes, smaller and simpler de novo proteins have been developed. Here, we constructed smaller proteins by removing the terminal sequences from stable de novo vTAJ proteins and compared stabilities between mutant and original proteins. vTAJ proteins were screened from an α3β3 binary-patterned library which was designed with polar/nonpolar periodicities of α-helix and β-sheet. vTAJ proteins have additional terminal sequences due to the method of constructing the genetically repeated library sequences. By removing parts of these sequences, we successfully obtained stable, smaller de novo protein mutants with fewer amino acid alphabets than the originals. However, these mutants showed differences in ANS binding properties and in stability against denaturant and pH change. The terminal sequences, which were designed just as flexible linkers not as secondary structure units, sufficiently affected these physicochemical details. This study showed implications for adjusting protein stabilities by designing N- and C-terminal sequences.

  8. Feature-by-Feature – Evaluating De Novo Sequence Assembly

    PubMed Central

    Vezzi, Francesco; Narzisi, Giuseppe; Mishra, Bud

    2012-01-01

    The whole-genome sequence assembly (WGSA) problem is one of the most studied problems in computational biology. Despite the availability of a plethora of tools (i.e., assemblers), all claiming to have solved the WGSA problem, little has been done to systematically compare their accuracy and power. Traditional methods rely on standard metrics and read simulation: while on the one hand, metrics like N50 and number of contigs focus only on size without proportionately emphasizing the information about the correctness of the assembly, comparisons performed on simulated datasets, on the other hand, can be highly biased by the non-realistic assumptions in the underlying read generator. Recently the Feature Response Curve (FRC) method was proposed to assess the overall assembly quality and correctness: FRC transparently captures the trade-offs between contigs' quality and their sizes. Nevertheless, the relationship among the different features and their relative importance remains unknown. In particular, FRC cannot account for the correlation among the different features. We analyzed the correlation among different features in order to better describe their relationships and their importance in gauging assembly quality and correctness. In particular, using multivariate techniques like principal and independent component analysis we were able to estimate the “excess-dimensionality” of the feature space. Moreover, principal component analysis allowed us to show how poorly the acclaimed N50 metric describes the assembly quality. Applying independent component analysis we identified a subset of features that better describe the assemblers' performance. We demonstrated that by focusing on a reduced set of highly informative features we can use the FRC curve to better describe and compare the performance of different assemblers. Moreover, as a by-product of our analysis, we discovered how often evaluation based on simulated data, obtained with state of the art

  9. NxRepair: error correction in de novo sequence assembly using Nextera mate pairs.

    PubMed

    Murphy, Rebecca R; O'Connell, Jared; Cox, Anthony J; Schulz-Trieglaff, Ole

    2015-01-01

    Scaffolding errors and incorrect repeat disambiguation during de novo assembly can result in large scale misassemblies in draft genomes. Nextera mate pair sequencing data provide additional information to resolve assembly ambiguities during scaffolding. Here, we introduce NxRepair, an open source toolkit for error correction in de novo assemblies that uses Nextera mate pair libraries to identify and correct large-scale errors. We show that NxRepair can identify and correct large scaffolding errors, without use of a reference sequence, resulting in quantitative improvements in the assembly quality. NxRepair can be downloaded from GitHub or PyPI, the Python Package Index; a tutorial and user documentation are also available.

  10. Annotation and re-sequencing of genes from de novo transcriptome assembly of Abies alba (Pinaceae)

    PubMed Central

    Roschanski, Anna M.; Fady, Bruno; Ziegenhagen, Birgit; Liepelt, Sascha

    2013-01-01

    • Premise of the study: We present a protocol for the annotation of transcriptome sequence data and the identification of candidate genes therein using the example of the nonmodel conifer Abies alba. • Methods and Results: A normalized cDNA library was built from an A. alba seedling. The sequencing on a 454 platform yielded more than 1.5 million reads that were de novo assembled into 25149 contigs. Two complementary approaches were applied to annotate gene fragments that code for (1) well-known proteins and (2) proteins that are potentially adaptively relevant. Primer development and testing yielded 88 amplicons that could successfully be resequenced from genomic DNA. • Conclusions: The annotation workflow offers an efficient way to identify potential adaptively relevant genes from the large quantity of transcriptome sequence data. The primer set presented should be prioritized for single-nucleotide polymorphism detection in adaptively relevant genes in A. alba. PMID:25202477

  11. Proteomics of Soil and Sediment: Protein Identification by De Novo Sequencing of Mass Spectra Complements Traditional Database Searching

    NASA Astrophysics Data System (ADS)

    Miller, S.; Rizzo, A. I.; Waldbauer, J.

    2014-12-01

    Proteomics has the potential to elucidate the metabolic pathways and taxa responsible for in situ biogeochemical transformations. However, low rates of protein identification from high resolution mass spectra have been a barrier to the development of proteomics in complex environmental samples. Much of the difficulty lies in the computational challenge of linking mass spectra to their corresponding proteins. Traditional database search methods for matching peptide sequences to mass spectra are often inadequate due to the complexity of environmental proteomes and the large database search space, as we demonstrate with soil and sediment proteomes generated via a range of extraction methods. One alternative to traditional database searching is de novo sequencing, which identifies peptide sequences without the need for a database. BLAST can then be used to match de novo sequences to similar genetic sequences. Assigning confidence to putative identifications has been one hurdle for the implementation of de novo sequencing. We found that accurate de novo sequences can be screened by quality score and length. Screening criteria are verified by comparing the results of de novo sequencing and traditional database searching for well-characterized proteomes from simple biological systems. The BLAST hits of screened sequences are interrogated for taxonomic and functional information. We applied de novo sequencing to organic topsoil and marine sediment proteomes. Peak-rich proteomes, which can result from various extraction techniques, yield thousands of high-confidence protein identifications, an improvement over previous proteomic studies of soil and sediment. User-friendly software tools for de novo metaproteomics analysis have been developed. This "De Novo Analysis" Pipeline is also a faster method of data analysis than constructing a tailored sequence database for traditional database searching.
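
    The screening step described above (keeping de novo sequences by quality score and length before BLAST) can be sketched as a simple filter. This is an illustration only: the `DeNovoPeptide` record, the 0-100 score scale, and the thresholds of 80 and 8 residues are hypothetical placeholders, since the abstract does not report the criteria actually used.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DeNovoPeptide:
    sequence: str
    score: float  # quality score on an assumed 0-100 scale


def screen(peptides, min_score=80.0, min_length=8):
    """Keep peptides that meet both the score and the length threshold.

    Both thresholds are illustrative assumptions, not values from the study.
    """
    return [
        p for p in peptides
        if p.score >= min_score and len(p.sequence) >= min_length
    ]


peptides = [
    DeNovoPeptide("LVNELTEFAK", 92.5),   # passes both criteria
    DeNovoPeptide("PEPTK", 95.0),        # high score but too short
    DeNovoPeptide("AVLDSGESTYK", 61.0),  # long enough but low score
    DeNovoPeptide("TLSDYNIQK", 80.0),    # exactly at the score threshold
]

for peptide in screen(peptides):
    print(peptide.sequence, peptide.score)
```

    In a workflow like the one described, the thresholds would presumably be tuned by comparing de novo results against database searches on well-characterized proteomes, and only the surviving sequences would be passed on to BLAST.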

  12. Proteomics of Soil and Sediment: Protein Identification by De Novo Sequencing of Mass Spectra Complements Traditional Database Searching

    NASA Astrophysics Data System (ADS)

    Miller, S.; Rizzo, A. I.; Waldbauer, J.

    2015-12-01

    Proteomics has the potential to elucidate the metabolic pathways and taxa responsible for in situ biogeochemical transformations. However, low rates of protein identification from high resolution mass spectra have been a barrier to the development of proteomics in complex environmental samples. Much of the difficulty lies in the computational challenge of linking mass spectra to their corresponding proteins. Traditional database search methods for matching peptide sequences to mass spectra are often inadequate due to the complexity of environmental proteomes and the large database search space, as we demonstrate with soil and sediment proteomes generated via a range of extraction methods. One alternative to traditional database searching is de novo sequencing, which identifies peptide sequences without the need for a database. BLAST can then be used to match de novo sequences to similar genetic sequences. Assigning confidence to putative identifications has been one hurdle for the implementation of de novo sequencing. We found that accurate de novo sequences can be screened by quality score and length. Screening criteria are verified by comparing the results of de novo sequencing and traditional database searching for well-characterized proteomes from simple biological systems. The BLAST hits of screened sequences are interrogated for taxonomic and functional information. We applied de novo sequencing to organic topsoil and marine sediment proteomes. Peak-rich proteomes, which can result from various extraction techniques, yield thousands of high-confidence protein identifications, an improvement over previous proteomic studies of soil and sediment. User-friendly software tools for de novo metaproteomics analysis have been developed. This "De Novo Analysis" Pipeline is also a faster method of data analysis than constructing a tailored sequence database for traditional database searching.

  13. De novo assembly and characterization of the Trichuris trichiura adult worm transcriptome using Ion Torrent sequencing.

    PubMed

    Santos, Leonardo N; Silva, Eduardo S; Santos, André S; De Sá, Pablo H; Ramos, Rommel T; Silva, Artur; Cooper, Philip J; Barreto, Maurício L; Loureiro, Sebastião; Pinheiro, Carina S; Alcantara-Neves, Neuza M; Pacheco, Luis G C

    2016-07-01

    Infection with helminthic parasites, including the soil-transmitted helminth Trichuris trichiura (human whipworm), has been shown to modulate host immune responses and, consequently, to have an impact on the development and manifestation of chronic human inflammatory diseases. De novo derivation of helminth proteomes from sequencing of transcriptomes will provide valuable data to aid identification of parasite proteins that could be evaluated as potential immunotherapeutic molecules in the near future. Herein, we characterized the transcriptome of the adult stage of the human whipworm T. trichiura, using next-generation sequencing technology and a de novo assembly strategy. Nearly 17.6 million high-quality clean reads were assembled into 6414 contiguous sequences, with an N50 of 1606 bp. In total, 5673 protein-encoding sequences were confidently identified in the T. trichiura adult worm transcriptome; of these, 1013 sequences represent potential newly discovered proteins for the species, most of which present orthologs already annotated in the related species T. suis. A number of transcripts representing probable novel non-coding transcripts for the species T. trichiura were also identified. Among the most abundant transcripts, we found sequences that code for proteins involved in lipid transport, such as vitellogenins, and several chitin-binding proteins. Through a cross-species expression analysis of gene orthologs shared by T. trichiura and the closely related parasites T. suis and T. muris it was possible to find twenty-six protein-encoding genes that are consistently highly expressed in the adult stages of the three helminth species. Additionally, twenty transcripts could be identified that code for proteins previously detected by mass spectrometry analysis of protein fractions of the whipworm somatic extract that present immunomodulatory activities. Five of these transcripts were amongst the most highly expressed protein-encoding sequences in the T

  14. De novo sequencing of highly modified therapeutic oligonucleotides by hydrophobic tag sequencing coupled with LC-MS.

    PubMed

    Goto, R; Miyakawa, S; Inomata, E; Takami, T; Yamaura, J; Nakamura, Y

    2017-02-01

    Correct sequences are prerequisite for quality control of therapeutic oligonucleotides. However, there is no definitive method available for determining sequences of highly modified therapeutic RNAs, and thereby, most of the oligonucleotides have been used clinically without direct sequence determination. In this study, we developed a novel sequencing method called 'hydrophobic tag sequencing'. Highly modified oligonucleotides are sequenced by partially digesting oligonucleotides conjugated with a 5'-hydrophobic tag, followed by liquid chromatography-mass spectrometry analysis. 5'-Hydrophobic tag-printed fragments (5'-tag degradates) can be separated in order of their molecular masses from tag-free oligonucleotides by reversed-phase liquid chromatography. As models for the sequencing, the anti-VEGF aptamer (Macugen) and the highly modified 38-mer RNA sequences were analyzed under blind conditions. Most nucleotides were identified from the molecular weight of hydrophobic 5'-tag degradates calculated from monoisotopic mass in simple full mass data. When monoisotopic mass could not be assigned, the nucleotide was estimated using the molecular weight of the most abundant mass. The sequences of Macugen and 38-mer RNA perfectly matched the theoretical sequences. The hydrophobic tag sequencing worked well to obtain simple full mass data, resulting in accurate and clear sequencing. The present study provides for the first time a de novo sequencing technology for highly modified RNAs and contributes to quality control of therapeutic oligonucleotides. Copyright © 2016 John Wiley & Sons, Ltd.
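
    The principle of reading a sequence from a ladder of 5'-tagged fragments can be sketched as follows: consecutive fragments differ by one nucleotide, so each mass difference identifies one residue. This is a simplified illustration, not the published method. It assumes unmodified RNA residues (monoisotopic masses of nucleoside monophosphates minus water), a hypothetical tag mass of 500.0 Da, and hypothetical helper names; the highly modified nucleotides analysed in the study would each need their own entry in the mass table.

```python
# Monoisotopic masses (Da) of unmodified RNA residues, i.e. nucleoside
# monophosphate minus water. Modified nucleotides are NOT covered here.
RESIDUE_MASS = {"A": 329.0525, "C": 305.0413, "G": 345.0474, "U": 306.0253}


def build_ladder(sequence, tag_mass):
    """Hypothetical helper: masses of tag, tag+1 nt, tag+2 nt, ... (5' to 3')."""
    masses = [tag_mass]
    for nucleotide in sequence:
        masses.append(masses[-1] + RESIDUE_MASS[nucleotide])
    return masses


def read_ladder(masses, tolerance=0.01):
    """Infer residues from consecutive mass differences in the ladder.

    A difference that matches no residue, or more than one, is reported as '?'.
    """
    ordered = sorted(masses)
    sequence = []
    for lighter, heavier in zip(ordered, ordered[1:]):
        delta = heavier - lighter
        matches = [
            nt for nt, mass in RESIDUE_MASS.items()
            if abs(delta - mass) <= tolerance
        ]
        sequence.append(matches[0] if len(matches) == 1 else "?")
    return "".join(sequence)


ladder = build_ladder("GAUC", tag_mass=500.0)  # tag mass is an assumed value
print(read_ladder(ladder))
```

    Because C and U residues differ by less than 1 Da, the tolerance has to be well below that gap, which is consistent with the abstract's reliance on monoisotopic masses wherever they can be assigned.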

  15. De Novo Centromere Formation and Centromeric Sequence Expansion in Wheat and its Wide Hybrids

    PubMed Central

    Fu, Shulan; Wang, Jing; Zhang, Xiangqi; Hu, Zanmin; Han, Fangpu

    2016-01-01

    Centromeres typically contain tandem repeat sequences, but centromere function does not necessarily depend on these sequences. We identified functional centromeres with significant quantitative changes in the centromeric retrotransposons of wheat (CRW) contents in wheat aneuploids (Triticum aestivum) and the offspring of wheat wide hybrids. The CRW signals were strongly reduced or essentially lost in some wheat ditelosomic lines and in the addition lines from the wide hybrids. The total loss of the CRW sequences but the presence of CENH3 in these lines suggests that the centromeres were formed de novo. In wheat and its wide hybrids, which carry large complex genomes or no sequenced genome, we performed CENH3-ChIP-dot-blot methods alone or in combination with CENH3-ChIP-seq and identified the ectopic genomic sequences present at the new centromeres. In addition, the transcription of the identified DNA sequences was remarkably increased at the new centromere, suggesting that the transcription of the corresponding sequences may be associated with de novo centromere formation. Stable alien chromosomes with two and three regions containing CRW sequences induced by centromere breakage were observed in the wheat-Th. elongatum hybrid derivatives, but only one was a functional centromere. In wheat-rye (Secale cereale) hybrids, the rye centromere-specific sequences spread along the chromosome arms and may have caused centromere expansion. Frequent and significant quantitative alterations in the centromere sequence via chromosomal rearrangement have been systematically described in wheat wide hybridizations, which may affect the retention or loss of the alien chromosomes in the hybrids. Thus, the centromere behavior in wide crosses likely has an important impact on the generation of biodiversity, which ultimately has implications for speciation. PMID:27110907

  16. De novo transcriptome sequencing of axolotl blastema for identification of differentially expressed genes during limb regeneration

    PubMed Central

    2013-01-01

    Background: Salamanders are unique among vertebrates in their ability to completely regenerate amputated limbs through the mediation of blastema cells located at the stump ends. This regeneration is nerve-dependent because blastema formation and regeneration does not occur after limb denervation. To obtain the genomic information of blastema tissues, de novo transcriptomes from both blastema tissues and denervated stump ends of Ambystoma mexicanum (axolotls) 14 days post-amputation were sequenced and compared using Solexa DNA sequencing. Results: The sequencing done for this study produced 40,688,892 reads that were assembled into 307,345 transcribed sequences. The N50 of transcribed sequence length was 562 bases. A similarity search with known proteins identified 39,200 different genes to be expressed during limb regeneration with a cut-off E-value exceeding 10^-5. We annotated assembled sequences by using gene descriptions, gene ontology, and clusters of orthologous group terms. Targeted searches using these annotations showed that the majority of the genes were in the categories of essential metabolic pathways, transcription factors and conserved signaling pathways, and novel candidate genes for regenerative processes. We discovered and confirmed numerous sequences of the candidate genes by using quantitative polymerase chain reaction and in situ hybridization. Conclusion: The results of this study demonstrate that de novo transcriptome sequencing allows gene expression analysis in a species lacking genome information and provides the most comprehensive mRNA sequence resources for axolotls. The characterization of the axolotl transcriptome can help elucidate the molecular mechanisms underlying blastema formation during limb regeneration. PMID:23815514

  17. A Proteomic Workflow Using High-Throughput De Novo Sequencing Towards Complementation of Genome Information for Improved Comparative Crop Science.

    PubMed

    Turetschek, Reinhard; Lyon, David; Desalegn, Getinet; Kaul, Hans-Peter; Wienkoop, Stefanie

    2016-01-01

    The proteomic study of non-model organisms, such as many crop plants, is challenging due to the lack of comprehensive genome information. Changing environmental conditions require the study and selection of adapted cultivars. Mutations, inherent to cultivars, hamper protein identification and thus considerably complicate the qualitative and quantitative comparison in large-scale systems biology approaches. With this workflow, cultivar-specific mutations are detected from high-throughput comparative MS analyses, by extracting sequence polymorphisms with de novo sequencing. Stringent criteria are suggested to filter for high-confidence mutations. Subsequently, these polymorphisms complement the initially used database, which is ready to use with any preferred database search algorithm. In our example, we thereby identified 26 specific mutations in two cultivars of Pisum sativum and achieved an increased number (17%) of peptide spectrum matches.

  18. De novo sequencing and characterization of floral transcriptome in two species of buckwheat (Fagopyrum)

    PubMed Central

    2011-01-01

    Background: Transcriptome sequencing data has become an integral component of modern genetics, genomics and evolutionary biology. However, despite advances in the technologies of DNA sequencing, such data are lacking for many groups of living organisms, in particular, many plant taxa. We present here the results of transcriptome sequencing for two closely related plant species. These species, Fagopyrum esculentum and F. tataricum, belong to the order Caryophyllales - a large group of flowering plants with uncertain evolutionary relationships. F. esculentum (common buckwheat) is also an important food crop. Despite these practical and evolutionary considerations Fagopyrum species have not been the subject of large-scale sequencing projects. Results: Normalized cDNA corresponding to genes expressed in flowers and inflorescences of F. esculentum and F. tataricum was sequenced using the 454 pyrosequencing technology. This resulted in 267 thousand (F. esculentum) and 229 thousand (F. tataricum) reads with an average length of 341-349 nucleotides. De novo assembly of the reads produced about 25 thousand contigs for each species, with 7.5-8.2× coverage. Comparative analysis of the two transcriptomes demonstrated their overall similarity but also revealed genes that are presumably differentially expressed. Among them are retrotransposon genes and genes involved in sugar biosynthesis and metabolism. Thirteen single-copy genes were used for phylogenetic analysis; the resulting trees are largely consistent with those inferred from multigenic plastid datasets. The sister relationships of the Caryophyllales and asterids now gained high support from nuclear gene sequences. Conclusions: 454 transcriptome sequencing and de novo assembly were performed for two congeneric flowering plant species, F. esculentum and F. tataricum. As a result, a large set of cDNA sequences that represent orthologs of known plant genes as well as potential new genes was generated. PMID:21232141

  19. A framework for the detection of de novo mutations in family-based sequencing data

    PubMed Central

    Francioli, Laurent C; Cretu-Stancu, Mircea; Garimella, Kiran V; Fromer, Menachem; Kloosterman, Wigard P; Wijmenga, Cisca; Investigator, Principal; Swertz, Morris A; van Duijn, Cornelia M; Boomsma, Dorret I; Slagboom, PEline; van Ommen, Gertjan B; de Bakker, Paul IW; Swertz, Morris A; Francioli, Laurent C; van Dijk, Freerk; Menelaou, Androniki; Neerincx, Pieter BT; Pulit, Sara L; Deelen, Patrick; Elbers, Clara C; Francesco Palamara, Pier; Pe'er, Itsik; Abdellaoui, Abdel; Kloosterman, Wigard P; van Oven, Mannis; Vermaat, Martijn; Li, Mingkun; Laros, Jeroen FJ; Stoneking, Mark; de Knijff, Peter; Kayser, Manfred; Veldink, Jan H; van den Berg, Leonard H; Byelas, Heorhiy; den Dunnen, Johan T; Dijkstra, Martijn; Amin, Najaf; van der Velde, K Joeri; Hottenga, Jouke Jan; van Setten, Jessica; van Leeuwen, Elisabeth M; Kanterakis, Alexandros; Kattenberg, Mathijs; Karssen, Lennart C; van Schaik, Barbera DC; Bot, Jan; Nijman, Isaäc J; Renkens, Ivo; van Enckevort, David; Mei, Hailiang; Koval, Vyacheslav; Estrada, Karol; Medina-Gomez, Carolina; Ye, Kai; Lameijer, Eric-Wubbo; Moed, Matthijs H; Hehir-Kwa, Jayne Y; Handsaker, Robert E; McCarroll, Steven A; Sunyaev, Shamil R; Polak, Paz; Vuzman, Dana; Sohail, Mashaal; Hormozdiari, Fereydoun; Marschall, Tobias; Schönhuth, Alexander; Guryev, Victor; de Bakker, Paul IW; Slagboom, P Eline; Beekman, Marian B; de Craen, Anton JM; Suchiman, H Eka D; Hofman, Albert; van Duijn, Cornelia M; Oostra, Ben; Isaacs, Aaron; Amin, Najaf; Rivadeneira, Fernando; Uitterlinden, André G; Boomsma, Dorret I; Willemsen, Gonneke; Platteel, Mathieu; Pitts, Steven J; Potluri, Shobha; Sundar, Purnima; Cox, David R; Li, Qibin; Li, Yingrui; Du, Yuanping; Chen, Ruoyan; Cao, Hongzhi; Li, Ning; Cao, Sujie; Wang, Jun; Bovenberg, Jasper A; Brandsma, Margreet; Samocha, Kaitlin E; Neale, Benjamin M; Daly, Mark J; Banks, Eric; DePristo, Mark A; de Bakker, Paul IW

    2017-01-01

    Germline mutation detection from human DNA sequence data is challenging due to the rarity of such events relative to the intrinsic error rates of sequencing technologies and the uneven coverage across the genome. We developed PhaseByTransmission (PBT) to identify de novo single nucleotide variants and short insertions and deletions (indels) from sequence data collected in parent-offspring trios. We compute the joint probability of the data given the genotype likelihoods in the individual family members, the known familial relationships and a prior probability for the mutation rate. Candidate de novo mutations (DNMs) are reported along with their posterior probability, providing a systematic way to prioritize them for validation. Our tool is integrated in the Genome Analysis Toolkit and can be used together with the ReadBackedPhasing module to infer the parental origin of DNMs based on phase-informative reads. Using simulated data, we show that PBT outperforms existing tools, especially in low coverage data and on the X chromosome. We further show that PBT displays high validation rates on empirical parent-offspring sequencing data for whole-exome data from 104 trios and X-chromosome data from 249 parent-offspring families. Finally, we demonstrate an association between father's age at conception and the number of DNMs in female offspring's X chromosome, consistent with previous literature reports. PMID:27876817
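
    The idea of combining genotype likelihoods, familial relationships and a mutation-rate prior into a posterior probability can be illustrated with a toy trio calculation at a single biallelic site. This is a simplified sketch and not the PhaseByTransmission implementation: it assumes a uniform prior over parental genotypes, treats any Mendelian-inconsistent child genotype as a de novo event with total probability `mu`, and uses hypothetical function names throughout.

```python
from itertools import product

GENOTYPES = (0, 1, 2)  # number of alternate alleles carried


def mendel(child, mother, father):
    """P(child genotype | parental genotypes) under Mendelian transmission."""
    total = 0.0
    for a, b in product((0, 1), repeat=2):  # allele passed by mother, father
        if a + b != child:
            continue
        p_mother = mother / 2 if a else 1 - mother / 2
        p_father = father / 2 if b else 1 - father / 2
        total += p_mother * p_father
    return total


def transmission(child, mother, father, mu):
    """Mendelian transmission with a small probability mu of a violation."""
    violating = [g for g in GENOTYPES if mendel(g, mother, father) == 0.0]
    p = mendel(child, mother, father)
    if not violating:  # every child genotype is Mendelian-consistent
        return p
    return (1 - mu) * p if p > 0.0 else mu / len(violating)


def dnm_posterior(lik_mother, lik_father, lik_child, mu=1e-8):
    """Posterior probability that the trio is Mendelian-inconsistent.

    Each lik_* is a sequence of three genotype likelihoods, P(data | genotype).
    A uniform prior over parental genotypes is assumed for simplicity.
    """
    dnm = total = 0.0
    for m, f, c in product(GENOTYPES, repeat=3):
        weight = (
            lik_mother[m] * lik_father[f] * lik_child[c]
            * transmission(c, m, f, mu)
        )
        total += weight
        if mendel(c, m, f) == 0.0:
            dnm += weight
    return dnm / total


# Both parents confidently homozygous reference, child confidently heterozygous.
hom_ref = (1.0, 1e-12, 1e-12)
het = (1e-12, 1.0, 1e-12)
print(dnm_posterior(hom_ref, hom_ref, het))
```

    With a mutation prior this small, the posterior is high only when the parental likelihoods rule out a missed heterozygous call very strongly, which reflects the abstract's point that true events are rare relative to sequencing error.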

  20. Genome Report: Identification and Validation of Antigenic Proteins from Pajaroellobacter abortibovis Using De Novo Genome Sequence Assembly and Reverse Vaccinology

    PubMed Central

    Welly, Bryan T.; Miller, Michael R.; Stott, Jeffrey L.; Blanchard, Myra T.; Islas-Trejo, Alma D.; O’Rourke, Sean M.; Young, Amy E.; Medrano, Juan F.; Van Eenennaam, Alison L.

    2016-01-01

    Epizootic bovine abortion (EBA), or “foothill abortion,” is the leading cause of beef cattle abortion in California and has also been reported in Nevada and Oregon. In the 1970s, the soft-shelled tick Ornithodoros coriaceus, or “pajaroello tick,” was confirmed as the disease-transmitting vector. In 2005, a novel Deltaproteobacterium was discovered as the etiologic agent of EBA (aoEBA), recently named Pajaroellobacter abortibovis. This organism cannot be grown in culture using traditional microbiological techniques; it can only be grown in experimentally-infected severe combined immunodeficient (SCID) mice. The objectives of this study were to perform a de novo genome assembly for P. abortibovis and identify and validate potential antigenic proteins as candidates for future recombinant vaccine development. DNA and RNA were extracted from spleen tissue collected from experimentally-infected SCID mice following exposure to P. abortibovis. This combination of mouse and bacterial DNA was sequenced and aligned to the mouse genome. Mouse sequences were subtracted from the sequence pool and the remaining sequences were de novo assembled at 50x coverage into a 1.82 Mbp complete closed circular Deltaproteobacterial genome containing 2250 putative protein-coding sequences. Phylogenetic analysis of P. abortibovis predicts that this bacterium is most closely related to the organisms of the order Myxococcales, referred to as Myxobacteria. In silico prediction of vaccine candidates was performed using a reverse vaccinology approach resulting in the identification and ranking of the top 10 candidate proteins that are likely to be antigenic. Immunologic testing of these candidate proteins confirmed antigenicity of seven of the nine expressed protein candidates using serum from P. abortibovis immunized mice. PMID:28040777

  1. The genome of flax (Linum usitatissimum) assembled de novo from short shotgun sequence reads.

    PubMed

    Wang, Zhiwen; Hobson, Neil; Galindo, Leonardo; Zhu, Shilin; Shi, Daihu; McDill, Joshua; Yang, Linfeng; Hawkins, Simon; Neutelings, Godfrey; Datla, Raju; Lambert, Georgina; Galbraith, David W; Grassa, Christopher J; Geraldes, Armando; Cronk, Quentin C; Cullis, Christopher; Dash, Prasanta K; Kumar, Polumetla A; Cloutier, Sylvie; Sharpe, Andrew G; Wong, Gane K-S; Wang, Jun; Deyholos, Michael K

    2012-11-01

    Flax (Linum usitatissimum) is an ancient crop that is widely cultivated as a source of fiber, oil and medicinally relevant compounds. To accelerate crop improvement, we performed whole-genome shotgun sequencing of the nuclear genome of flax. Seven paired-end libraries ranging in size from 300 bp to 10 kb were sequenced using an Illumina genome analyzer. A de novo assembly, comprised exclusively of deep-coverage (approximately 94× raw, approximately 69× filtered) short-sequence reads (44-100 bp), produced a set of scaffolds with N(50) = 694 kb, including contigs with N(50) = 20.1 kb. The contig assembly contained 302 Mb of non-redundant sequence representing an estimated 81% genome coverage. Up to 96% of published flax ESTs aligned to the whole-genome shotgun scaffolds. However, comparisons with independently sequenced BACs and fosmids showed some mis-assembly of regions at the genome scale. A total of 43384 protein-coding genes were predicted in the whole-genome shotgun assembly, and up to 93% of published flax ESTs, and 86% of A. thaliana genes aligned to these predicted genes, indicating excellent coverage and accuracy at the gene level. Analysis of the synonymous substitution rates (K(s)) observed within duplicate gene pairs was consistent with a recent (5-9 MYA) whole-genome duplication in flax. Within the predicted proteome, we observed enrichment of many conserved domains (Pfam-A) that may contribute to the unique properties of this crop, including agglutinin proteins. Together these results show that de novo assembly, based solely on whole-genome shotgun short-sequence reads, is an efficient means of obtaining nearly complete genome sequence information for some plant species.
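
    The N(50) statistic quoted for the scaffolds and contigs can be illustrated with a short sketch. This is not the authors' code: the function name and the toy lengths are assumptions, and the definition used is the common one (the length L such that sequences of length L or longer hold at least half of the total assembled bases).

```python
def n50(lengths):
    """Return the N50 of a collection of contig or scaffold lengths.

    Sequences are sorted longest first; the N50 is the length of the
    sequence at which the running total first reaches half of the
    total assembly size.
    """
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length
    raise ValueError("no lengths given")


# Toy example with made-up lengths (not data from the flax assembly).
toy_contigs = [2, 2, 2, 3, 3, 4, 8, 8]
print(n50(toy_contigs))
```

    Because N50 depends only on the length distribution, it says nothing about correctness, which is why the abstract separately reports comparisons against BACs and fosmids.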

  2. Genome Calligrapher: A Web Tool for Refactoring Bacterial Genome Sequences for de Novo DNA Synthesis.

    PubMed

    Christen, Matthias; Deutsch, Samuel; Christen, Beat

    2015-08-21

    Recent advances in synthetic biology have resulted in an increasing demand for the de novo synthesis of large-scale DNA constructs. Any process improvement that enables fast and cost-effective streamlining of digitized genetic information into fabricable DNA sequences holds great promise to study, mine, and engineer genomes. Here, we present Genome Calligrapher, a computer-aided design web tool intended for whole genome refactoring of bacterial chromosomes for de novo DNA synthesis. By applying a neutral recoding algorithm, Genome Calligrapher optimizes GC content and removes obstructive DNA features known to interfere with the synthesis of double-stranded DNA and the higher order assembly into large DNA constructs. Subsequent bioinformatics analysis revealed that synthesis constraints are prevalent among bacterial genomes. However, a low level of codon replacement is sufficient for refactoring bacterial genomes into easy-to-synthesize DNA sequences. To test the algorithm, 168 kb of synthetic DNA comprising approximately 20 percent of the synthetic essential genome of the cell-cycle bacterium Caulobacter crescentus was streamlined and then ordered from a commercial supplier of low-cost de novo DNA synthesis. The successful assembly into eight 20 kb segments indicates that the Genome Calligrapher algorithm can be efficiently used to refactor difficult-to-synthesize DNA. Genome Calligrapher is broadly applicable to recode biosynthetic pathways, DNA sequences, and whole bacterial genomes, thus offering new opportunities to use synthetic biology tools to explore the functionality of microbial diversity. The Genome Calligrapher web tool can be accessed at https://christenlab.ethz.ch/GenomeCalligrapher.
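
    The general idea of neutral (synonymous) recoding to adjust GC content can be sketched as follows. This is only an illustration and not the Genome Calligrapher algorithm: the window size, the GC thresholds, and the leucine-only codon swap are assumptions chosen to keep the example small. The six leucine codons listed are those of the standard genetic code.

```python
def gc_fraction(seq):
    """Fraction of G and C bases in a DNA sequence."""
    seq = seq.upper()
    if not seq:
        raise ValueError("empty sequence")
    return sum(base in "GC" for base in seq) / len(seq)


def flag_windows(seq, window=20, low=0.3, high=0.7):
    """Return (start, gc) for sliding windows whose GC fraction is outside [low, high].

    The window size and thresholds are illustrative assumptions.
    """
    flagged = []
    for start in range(len(seq) - window + 1):
        gc = gc_fraction(seq[start:start + window])
        if gc < low or gc > high:
            flagged.append((start, gc))
    return flagged


# All six codons that encode leucine in the standard genetic code.
LEUCINE_CODONS = {"TTA", "TTG", "CTT", "CTC", "CTA", "CTG"}


def recode_leucine_low_gc(cds):
    """Swap every in-frame leucine codon for TTA, leaving other codons untouched.

    The encoded protein is unchanged because the swap is synonymous.
    """
    codons = [cds[i:i + 3] for i in range(0, len(cds), 3)]
    return "".join("TTA" if codon in LEUCINE_CODONS else codon for codon in codons)


original = "ATGCTGCTGCTGTAA"            # Met-Leu-Leu-Leu-stop (toy sequence)
recoded = recode_leucine_low_gc(original)
print(original, round(gc_fraction(original), 2))
print(recoded, round(gc_fraction(recoded), 2))
```

    A real recoding tool would choose among synonymous codons across all amino acids and check additional synthesis constraints; the sketch only shows why codon replacement can change sequence composition without altering the protein.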

  3. Whole Exome Sequencing Identifies de Novo Mutations in GATA6 Associated with Congenital Diaphragmatic Hernia

    PubMed Central

    Yu, Lan; Bennett, James T.; Wynn, Julia; Carvill, Gemma L.; Cheung, Yee Him; Shen, Yufeng; Mychaliska, George B.; Azarow, Kenneth S.; Crombleholme, Timothy M.; Chung, Dai H.; Potoka, Douglas; Warner, Brad W.; Bucher, Brian; Lim, Foong-Yen; Pietsch, John; Stolar, Charles; Aspelund, Gudrun; Arkovitz, Marc S.; Mefford, Heather; Chung, Wendy K.

    2014-01-01

    Background: Congenital diaphragmatic hernia (CDH) is a common birth defect affecting 1 in 3,000 births. It is characterized by herniation of abdominal viscera through an incompletely formed diaphragm. Although chromosomal anomalies and mutations in several genes have been implicated, the cause for most patients is unknown. Methods: We used whole exome sequencing in two families with CDH and congenital heart disease, and identified mutations in GATA6 in both. Results: In the first family, we identified a de novo missense mutation (c.1366C>T, p.R456C) in a sporadic CDH patient with tetralogy of Fallot. In the second, a nonsense mutation (c.712G>T, p.G238*) was identified in two siblings with CDH and a large ventricular septal defect. The G238* mutation was inherited from their mother, who was clinically affected with congenital absence of the pericardium, patent ductus arteriosus, and intestinal malrotation. Deep sequencing of blood and saliva derived DNA from the mother suggested somatic mosaicism as an explanation for her milder phenotype, with only approximately 15% mutant alleles. To determine the frequency of GATA6 mutations in CDH, we sequenced the gene in 378 patients with CDH. We identified one additional de novo mutation (c.1071delG, p.V358Cfs34*). Conclusions: Mutations in GATA6 have been previously associated with pancreatic agenesis and congenital heart disease. We conclude that, in addition to the heart and the pancreas, GATA6 is involved in development of two additional organs, the diaphragm and the pericardium. In addition we have shown that de novo mutations can contribute to the development of CDH, a common birth defect. PMID:24385578

  4. The sequence and de novo assembly of the giant panda genome.

    PubMed

    Li, Ruiqiang; Fan, Wei; Tian, Geng; Zhu, Hongmei; He, Lin; Cai, Jing; Huang, Quanfei; Cai, Qingle; Li, Bo; Bai, Yinqi; Zhang, Zhihe; Zhang, Yaping; Wang, Wen; Li, Jun; Wei, Fuwen; Li, Heng; Jian, Min; Li, Jianwen; Zhang, Zhaolei; Nielsen, Rasmus; Li, Dawei; Gu, Wanjun; Yang, Zhentao; Xuan, Zhaoling; Ryder, Oliver A; Leung, Frederick Chi-Ching; Zhou, Yan; Cao, Jianjun; Sun, Xiao; Fu, Yonggui; Fang, Xiaodong; Guo, Xiaosen; Wang, Bo; Hou, Rong; Shen, Fujun; Mu, Bo; Ni, Peixiang; Lin, Runmao; Qian, Wubin; Wang, Guodong; Yu, Chang; Nie, Wenhui; Wang, Jinhuan; Wu, Zhigang; Liang, Huiqing; Min, Jiumeng; Wu, Qi; Cheng, Shifeng; Ruan, Jue; Wang, Mingwei; Shi, Zhongbin; Wen, Ming; Liu, Binghang; Ren, Xiaoli; Zheng, Huisong; Dong, Dong; Cook, Kathleen; Shan, Gao; Zhang, Hao; Kosiol, Carolin; Xie, Xueying; Lu, Zuhong; Zheng, Hancheng; Li, Yingrui; Steiner, Cynthia C; Lam, Tommy Tsan-Yuk; Lin, Siyuan; Zhang, Qinghui; Li, Guoqing; Tian, Jing; Gong, Timing; Liu, Hongde; Zhang, Dejin; Fang, Lin; Ye, Chen; Zhang, Juanbin; Hu, Wenbo; Xu, Anlong; Ren, Yuanyuan; Zhang, Guojie; Bruford, Michael W; Li, Qibin; Ma, Lijia; Guo, Yiran; An, Na; Hu, Yujie; Zheng, Yang; Shi, Yongyong; Li, Zhiqiang; Liu, Qing; Chen, Yanling; Zhao, Jing; Qu, Ning; Zhao, Shancen; Tian, Feng; Wang, Xiaoling; Wang, Haiyin; Xu, Lizhi; Liu, Xiao; Vinar, Tomas; Wang, Yajun; Lam, Tak-Wah; Yiu, Siu-Ming; Liu, Shiping; Zhang, Hemin; Li, Desheng; Huang, Yan; Wang, Xia; Yang, Guohua; Jiang, Zhi; Wang, Junyi; Qin, Nan; Li, Li; Li, Jingxiang; Bolund, Lars; Kristiansen, Karsten; Wong, Gane Ka-Shu; Olson, Maynard; Zhang, Xiuqing; Li, Songgang; Yang, Huanming; Wang, Jian; Wang, Jun

    2010-01-21

    Using next-generation sequencing technology alone, we have successfully generated and assembled a draft sequence of the giant panda genome. The assembled contigs (2.25 gigabases (Gb)) cover approximately 94% of the whole genome, and the remaining gaps (0.05 Gb) seem to contain carnivore-specific repeats and tandem repeats. Comparisons with the dog and human showed that the panda genome has a lower divergence rate. The assessment of panda genes potentially underlying some of its unique traits indicated that its bamboo diet might be more dependent on its gut microbiome than its own genetic composition. We also identified more than 2.7 million heterozygous single nucleotide polymorphisms in the diploid genome. Our data and analyses provide a foundation for promoting mammalian genetic research, and demonstrate the feasibility for using next-generation sequencing technologies for accurate, cost-effective and rapid de novo assembly of large eukaryotic genomes.

  5. Whole-genome sequencing in autism identifies hot spots for de novo germline mutation.

    PubMed

    Michaelson, Jacob J; Shi, Yujian; Gujral, Madhusudan; Zheng, Hancheng; Malhotra, Dheeraj; Jin, Xin; Jian, Minghan; Liu, Guangming; Greer, Douglas; Bhandari, Abhishek; Wu, Wenting; Corominas, Roser; Peoples, Aine; Koren, Amnon; Gore, Athurva; Kang, Shuli; Lin, Guan Ning; Estabillo, Jasper; Gadomski, Therese; Singh, Balvindar; Zhang, Kun; Akshoomoff, Natacha; Corsello, Christina; McCarroll, Steven; Iakoucheva, Lilia M; Li, Yingrui; Wang, Jun; Sebat, Jonathan

    2012-12-21

    De novo mutation plays an important role in autism spectrum disorders (ASDs). Notably, pathogenic copy number variants (CNVs) are characterized by high mutation rates. We hypothesize that hypermutability is a property of ASD genes and may also include nucleotide-substitution hot spots. We investigated global patterns of germline mutation by whole-genome sequencing of monozygotic twins concordant for ASD and their parents. Mutation rates varied widely throughout the genome (by 100-fold) and could be explained by intrinsic characteristics of DNA sequence and chromatin structure. Dense clusters of mutations within individual genomes were attributable to compound mutation or gene conversion. Hypermutability was a characteristic of genes involved in ASD and other diseases. In addition, genes impacted by mutations in this study were associated with ASD in independent exome-sequencing data sets. Our findings suggest that regional hypermutation is a significant factor shaping patterns of genetic variation and disease risk in humans.

  6. Sequencing, de novo assembly and comparative analysis of Raphanus sativus transcriptome.

    PubMed

    Wu, Gang; Zhang, Libin; Yin, Yongtai; Wu, Jiangsheng; Yu, Longjiang; Zhou, Yanhong; Li, Maoteng

    2015-01-01

    Raphanus sativus is an important Brassicaceae plant and also an edible vegetable with great economic value. However, there is currently not enough transcriptome information on R. sativus tissues, which impedes further functional genomics research on R. sativus. In this study, RNA-seq technology was employed to characterize the transcriptome of leaf tissues. Approximately 70 million clean paired-end reads were obtained and used for de novo assembly with the Trinity program, which generated 68,086 unigenes with an average length of 576 bp. All the unigenes were annotated against the GO and KEGG databases. Meanwhile, we merged leaf sequencing data with existing root sequencing data and obtained a better de novo assembly of R. sativus using the Oases program. Accordingly, potential simple sequence repeats (SSRs), transcription factors (TFs) and enzyme codes were identified in R. sativus. Additionally, we detected a total of 3563 significantly differentially expressed genes (DEGs, P = 0.05) and tissue-specific biological processes between leaf and root tissues. Furthermore, a TFs-based regulation network was constructed using Cytoscape software. Taken together, these results not only provide a comprehensive genomic resource of R. sativus but also shed light on functional genomic and proteomic research on R. sativus in the future.
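
    The abstract does not say how the simple sequence repeats were detected. As a rough illustration of what SSR identification involves, the sketch below scans a sequence for perfect tandem repeats with a regular expression. The motif length range (2 to 6 bp) and the minimum of four repeats are assumptions, not parameters from the study.

```python
import re


def is_primitive(motif):
    """True if the motif is not itself a repeat of a shorter unit (e.g. 'ATAT')."""
    k = len(motif)
    for d in range(1, k):
        if k % d == 0 and motif[:d] * (k // d) == motif:
            return False
    return True


def find_ssrs(seq, min_motif=2, max_motif=6, min_repeats=4):
    """Return (start, motif, repeat_count) for perfect tandem repeats in seq.

    Thresholds are illustrative assumptions; dedicated SSR tools use
    motif-specific minimum repeat counts and also handle imperfect repeats.
    """
    seq = seq.upper()
    found = []
    for k in range(min_motif, max_motif + 1):
        # A k-mer followed by at least (min_repeats - 1) further copies of itself.
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (k, min_repeats - 1))
        for match in pattern.finditer(seq):
            motif = match.group(1)
            if is_primitive(motif):
                found.append((match.start(), motif, len(match.group(0)) // k))
    return sorted(found)


# Toy sequence (not from the R. sativus transcriptome).
print(find_ssrs("GGC" + "AT" * 6 + "CCG"))
```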

  7. RoboOligo: software for mass spectrometry data to support manual and de novo sequencing of post-transcriptionally modified ribonucleic acids.

    PubMed

    Sample, Paul J; Gaston, Kirk W; Alfonzo, Juan D; Limbach, Patrick A

    2015-05-26

    Ribosomal ribonucleic acid (RNA), transfer RNA and other biological or synthetic RNA polymers can contain nucleotides that have been modified by the addition of chemical groups. Traditional Sanger sequencing methods cannot establish the chemical nature and sequence of these modified-nucleotide containing oligomers. Mass spectrometry (MS) has become the conventional approach for determining the nucleotide composition, modification status and sequence of modified RNAs. Modified RNAs are analyzed by MS using collision-induced dissociation tandem mass spectrometry (CID MS/MS), which produces a complex dataset of oligomeric fragments that must be interpreted to identify and place modified nucleosides within the RNA sequence. Here we report the development of RoboOligo, an interactive software program for the robust analysis of data generated by CID MS/MS of RNA oligomers. There are three main functions of RoboOligo: (i) automated de novo sequencing via the local search paradigm; (ii) manual sequencing with real-time spectrum labeling and cumulative intensity scoring; and (iii) a hybrid approach, coined 'variable sequencing', which combines the user intuition of manual sequencing with the high-throughput sampling of automated de novo sequencing.

  8. in silico Whole Genome Sequencer & Analyzer (iWGS): A Computational Pipeline to Guide the Design and Analysis of de novo Genome Sequencing Studies.

    PubMed

    Zhou, Xiaofan; Peris, David; Kominek, Jacek; Kurtzman, Cletus P; Hittinger, Chris Todd; Rokas, Antonis

    2016-09-16

    The availability of genomes across the tree of life is highly biased toward vertebrates, pathogens, human disease models, and organisms with relatively small and simple genomes. Recent progress in genomics has enabled the de novo decoding of the genome of virtually any organism, greatly expanding its potential for understanding the biology and evolution of the full spectrum of biodiversity. The increasing diversity of sequencing technologies, assays, and de novo assembly algorithms has augmented the complexity of de novo genome sequencing projects in non-model organisms. To reduce the costs and challenges in de novo genome sequencing projects and streamline their experimental design and analysis, we developed iWGS (in silico Whole Genome Sequencer and Analyzer), an automated pipeline for guiding the choice of appropriate sequencing strategy and assembly protocols. iWGS seamlessly integrates the four key steps of a de novo genome sequencing project: data generation (through simulation), data quality control, de novo assembly, and assembly evaluation and validation. The last three steps can also be applied to the analysis of real data. iWGS is designed to enable the user to have great flexibility in testing the range of experimental designs available for genome sequencing projects, and supports all major sequencing technologies and popular assembly tools. Three case studies illustrate how iWGS can guide the design of de novo genome sequencing projects and evaluate the performance of a wide variety of user-specified sequencing strategies and assembly protocols on genomes of differing architectures. iWGS, along with detailed documentation, is freely available at https://github.com/zhouxiaofan1983/iWGS.

  9. In Silico Whole Genome Sequencer and Analyzer (iWGS): a Computational Pipeline to Guide the Design and Analysis of de novo Genome Sequencing Studies

    PubMed Central

    Zhou, Xiaofan; Peris, David; Kominek, Jacek; Kurtzman, Cletus P.; Hittinger, Chris Todd; Rokas, Antonis

    2016-01-01

    The availability of genomes across the tree of life is highly biased toward vertebrates, pathogens, human disease models, and organisms with relatively small and simple genomes. Recent progress in genomics has enabled the de novo decoding of the genome of virtually any organism, greatly expanding its potential for understanding the biology and evolution of the full spectrum of biodiversity. The increasing diversity of sequencing technologies, assays, and de novo assembly algorithms has augmented the complexity of de novo genome sequencing projects in nonmodel organisms. To reduce the costs and challenges in de novo genome sequencing projects and streamline their experimental design and analysis, we developed iWGS (in silico Whole Genome Sequencer and Analyzer), an automated pipeline for guiding the choice of appropriate sequencing strategy and assembly protocols. iWGS seamlessly integrates the four key steps of a de novo genome sequencing project: data generation (through simulation), data quality control, de novo assembly, and assembly evaluation and validation. The last three steps can also be applied to the analysis of real data. iWGS is designed to enable the user to have great flexibility in testing the range of experimental designs available for genome sequencing projects, and supports all major sequencing technologies and popular assembly tools. Three case studies illustrate how iWGS can guide the design of de novo genome sequencing projects, and evaluate the performance of a wide variety of user-specified sequencing strategies and assembly protocols on genomes of differing architectures. iWGS, along with detailed documentation, is freely available at https://github.com/zhouxiaofan1983/iWGS. PMID:27638685

  10. De novo sequences of Haloquadratum walsbyi from Lake Tyrrell, Australia, reveal a variable genomic landscape.

    PubMed

    Tully, Benjamin J; Emerson, Joanne B; Andrade, Karen; Brocks, Jochen J; Allen, Eric E; Banfield, Jillian F; Heidelberg, Karla B

    2015-01-01

    Hypersaline systems near salt saturation levels represent an extreme environment, in which organisms grow and survive near the limits of life. One of the abundant members of the microbial communities in hypersaline systems is the square archaeon, Haloquadratum walsbyi. Utilizing a short-read metagenome from Lake Tyrrell, a hypersaline ecosystem in Victoria, Australia, we performed a comparative genomic analysis of H. walsbyi to better understand the extent of variation between strains/subspecies. Results revealed that previously isolated strains/subspecies do not fully describe the complete repertoire of the genomic landscape present in H. walsbyi. Rearrangements, insertions, and deletions were observed for the Lake Tyrrell derived Haloquadratum genomes and were supported by environmental de novo sequences, including shifts in the dominant genomic landscape of the two most abundant strains. Analysis pertaining to halomucins indicated that homologs for this large protein are not a feature common for all species of Haloquadratum. Further, we analyzed ATP-binding cassette transporters (ABC-type transporters) for evidence of niche partitioning between different strains/subspecies. We were able to identify unique and variable transporter subunits from all five genomes analyzed and the de novo environmental sequences, suggesting that differences in nutrient and carbon source acquisition may play a role in maintaining distinct strains/subspecies.

  11. De Novo Sequences of Haloquadratum walsbyi from Lake Tyrrell, Australia, Reveal a Variable Genomic Landscape

    PubMed Central

    Tully, Benjamin J.; Emerson, Joanne B.; Andrade, Karen; Brocks, Jochen J.; Allen, Eric E.; Banfield, Jillian F.; Heidelberg, Karla B.

    2015-01-01

    Hypersaline systems near salt saturation levels represent an extreme environment, in which organisms grow and survive near the limits of life. One of the abundant members of the microbial communities in hypersaline systems is the square archaeon, Haloquadratum walsbyi. Utilizing a short-read metagenome from Lake Tyrrell, a hypersaline ecosystem in Victoria, Australia, we performed a comparative genomic analysis of H. walsbyi to better understand the extent of variation between strains/subspecies. Results revealed that previously isolated strains/subspecies do not fully describe the complete repertoire of the genomic landscape present in H. walsbyi. Rearrangements, insertions, and deletions were observed for the Lake Tyrrell derived Haloquadratum genomes and were supported by environmental de novo sequences, including shifts in the dominant genomic landscape of the two most abundant strains. Analysis pertaining to halomucins indicated that homologs for this large protein are not a feature common for all species of Haloquadratum. Further, we analyzed ATP-binding cassette transporters (ABC-type transporters) for evidence of niche partitioning between different strains/subspecies. We were able to identify unique and variable transporter subunits from all five genomes analyzed and the de novo environmental sequences, suggesting that differences in nutrient and carbon source acquisition may play a role in maintaining distinct strains/subspecies. PMID:25709557

  12. CycloBranch: De Novo Sequencing of Nonribosomal Peptides from Accurate Product Ion Mass Spectra

    NASA Astrophysics Data System (ADS)

    Novák, Jiří; Lemr, Karel; Schug, Kevin A.; Havlíček, Vladimír

    2015-07-01

    Nonribosomal peptides have a wide range of biological and medical applications. Their identification by tandem mass spectrometry remains a challenging task. A new open-source de novo peptide identification engine CycloBranch was developed and successfully applied in identification or detailed characterization of 11 linear, cyclic, branched, and branch-cyclic peptides. CycloBranch is based on annotated building block databases the size of which is defined by the user according to ribosomal or nonribosomal peptide origin. The current number of involved nonisobaric and isobaric building blocks is 287 and 521, respectively. Contrary to all other peptide sequencing tools utilizing either peptide libraries or peptide fragment libraries, CycloBranch represents a true de novo sequencing engine developed for accurate mass spectrometric data. It is a stand-alone and cross-platform application with a graphical and user-friendly interface; it supports mzML, mzXML, mgf, txt, and baf file formats and can be run in parallel on multiple threads. It can be downloaded for free from http://ms.biomed.cas.cz/cyclobranch/, where the User's manual and video tutorials can be found.

  13. A Real-Time de novo DNA Sequencing Assembly Platform Based on an FPGA Implementation.

    PubMed

    Hu, Yuanqi; Georgiou, Pantelis

    2016-01-01

    This paper presents an FPGA based DNA comparison platform which can be run concurrently with the sensing phase of DNA sequencing and shortens the overall time needed for de novo DNA assembly. A hybrid overlap searching algorithm is applied which is scalable and can deal with incremental detection of new bases. To handle the incomplete data set which gradually increases during sequencing time, all-against-all comparisons are broken down into successive window-against-window comparison phases and executed using a novel dynamic suffix comparison algorithm combined with a partitioned dynamic programming method. The complete system has been designed to facilitate parallel processing in hardware, which allows real-time comparison and full scalability as well as a decrease in the number of computations required. A base pair comparison rate of 51.2 G/s is achieved when implemented on an FPGA with successful DNA comparison when using data sets from real genomes.
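
    The decomposition of all-against-all read comparison into window-against-window phases can be sketched in software. This is not the paper's hybrid overlap search or its dynamic suffix comparison algorithm: the naive suffix-prefix check, the window size, the minimum overlap length, and all function names are assumptions used only to show that windowed comparison covers the same read pairs as a single all-against-all pass.

```python
from itertools import product


def longest_overlap(a, b, min_len=3):
    """Length of the longest suffix of a that equals a prefix of b, or 0 if shorter than min_len."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a[-k:] == b[:k]:
            return k
    return 0


def compare_windows(win_a, win_b, min_len=3):
    """Compare every read in win_a against every read in win_b.

    Each window is a list of (read_index, read_sequence) pairs.
    """
    hits = {}
    for (i, a), (j, b) in product(win_a, win_b):
        if i == j:
            continue  # a read is not compared with itself
        k = longest_overlap(a, b, min_len)
        if k:
            hits[(i, j)] = k
    return hits


def all_against_all(reads, window_size=2, min_len=3):
    """Cover all ordered read pairs through successive window-against-window phases."""
    indexed = list(enumerate(reads))
    windows = [indexed[s:s + window_size] for s in range(0, len(indexed), window_size)]
    hits = {}
    for win_a, win_b in product(windows, repeat=2):
        hits.update(compare_windows(win_a, win_b, min_len))
    return hits


# Toy reads (not from a real genome): read 0 overlaps read 1, which overlaps read 2.
reads = ["ACGTAC", "TACGGA", "GGATTT"]
print(all_against_all(reads))
```

    In this sketch each window pair is independent of the others, which is the property that makes such a decomposition amenable to parallel hardware and to data that arrive incrementally.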

  14. De novo assembly and validation of planaria transcriptome by massive parallel sequencing and shotgun proteomics.

    PubMed

    Adamidi, Catherine; Wang, Yongbo; Gruen, Dominic; Mastrobuoni, Guido; You, Xintian; Tolle, Dominic; Dodt, Matthias; Mackowiak, Sebastian D; Gogol-Doering, Andreas; Oenal, Pinar; Rybak, Agnieszka; Ross, Eric; Sánchez Alvarado, Alejandro; Kempa, Stefan; Dieterich, Christoph; Rajewsky, Nikolaus; Chen, Wei

    2011-07-01

    Freshwater planaria are a very attractive model system for stem cell biology, tissue homeostasis, and regeneration. The genome of the planarian Schmidtea mediterranea has recently been sequenced and is estimated to contain >20,000 protein-encoding genes. However, the characterization of its transcriptome is far from complete. Furthermore, not a single proteome of the entire phylum has been assayed on a genome-wide level. We devised an efficient sequencing strategy that allowed us to de novo assemble a major fraction of the S. mediterranea transcriptome. We then used independent assays and massive shotgun proteomics to validate the authenticity of transcripts. In total, our de novo assembly yielded 18,619 candidate transcripts with a mean length of 1118 nt after filtering. A total of 17,564 candidate transcripts could be mapped to 15,284 distinct loci on the current genome reference sequence. RACE confirmed complete or almost complete 5' and 3' ends for 22/24 transcripts. The frequencies of frame shifts, fusion, and fission events in the assembled transcripts were computationally estimated to be 4.2%-13%, 0%-3.7%, and 2.6%, respectively. Our shotgun proteomics produced 16,135 distinct peptides that validated 4200 transcripts (FDR ≤1%). The catalog of transcripts assembled in this study, together with the identified peptides, dramatically expands and refines planarian gene annotation, demonstrated by validation of several previously unknown transcripts with stem cell-dependent expression patterns. In addition, our robust transcriptome characterization pipeline could be applied to other organisms without genome assembly. All of our data, including homology annotation, are freely available at SmedGD, the S. mediterranea genome database.

  15. A De Novo Genome Sequence Assembly of the Arabidopsis thaliana Accession Niederzenz-1 Displays Presence/Absence Variation and Strong Synteny

    PubMed Central

    Pucker, Boas; Holtgräwe, Daniela; Rosleff Sörensen, Thomas; Stracke, Ralf; Viehöver, Prisca

    2016-01-01

    Arabidopsis thaliana is the most important model organism for fundamental plant biology. The genome diversity of different accessions of this species has been intensively studied, for example in the 1001 genome project which led to the identification of many single nucleotide polymorphisms (SNPs) and small insertions and deletions (InDels). In addition, presence/absence variation (PAV), copy number variation (CNV) and mobile genetic elements contribute to genomic differences between A. thaliana accessions. To address larger genome rearrangements between the A. thaliana reference accession Columbia-0 (Col-0) and another accession of about average distance to Col-0, we created a de novo next generation sequencing (NGS)-based assembly from the accession Niederzenz-1 (Nd-1). The result was evaluated with respect to assembly strategy and synteny to Col-0. We provide a high quality genome sequence of the A. thaliana accession (Nd-1, LXSY01000000). The assembly displays an N50 of 0.590 Mbp and covers 99% of the Col-0 reference sequence. Scaffolds from the de novo assembly were positioned on the basis of sequence similarity to the reference. Errors in this automatic scaffold anchoring were manually corrected based on analyzing reciprocal best BLAST hits (RBHs) of genes. Comparison of the final Nd-1 assembly to the reference revealed duplications and deletions (PAV). We identified 826 insertions and 746 deletions in Nd-1. Randomly selected candidates of PAV were experimentally validated. Our Nd-1 de novo assembly allowed reliable identification of larger genic and intergenic variants, which was difficult or error-prone by short read mapping approaches alone. While overall sequence similarity as well as synteny is very high, we detected short and larger (affecting more than 100 bp) differences between Col-0 and Nd-1 based on bi-directional comparisons. The de novo assembly provided here and additional assemblies that will certainly be published in the future will allow to
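
    The reciprocal best BLAST hit (RBH) criterion mentioned in this abstract can be illustrated with a minimal sketch. It assumes the hits from both search directions have already been parsed into dictionaries of (subject, bit score) pairs; the gene identifiers and scores are hypothetical, and the function is not taken from the authors' pipeline.

```python
def best_hit(hits):
    """Return the subject with the highest bit score from a list of (subject, score)."""
    return max(hits, key=lambda hit: hit[1])[0]


def reciprocal_best_hits(a_to_b, b_to_a):
    """Return sorted (gene_a, gene_b) pairs that are each other's best hit.

    a_to_b maps each gene in set A to its hits in set B; b_to_a is the
    reverse search. Queries without hits are ignored.
    """
    best_ab = {query: best_hit(hits) for query, hits in a_to_b.items() if hits}
    best_ba = {query: best_hit(hits) for query, hits in b_to_a.items() if hits}
    return sorted((a, b) for a, b in best_ab.items() if best_ba.get(b) == a)


# Hypothetical gene identifiers and bit scores, for illustration only.
nd1_vs_col0 = {
    "nd1_g1": [("col0_g1", 200.0), ("col0_g2", 50.0)],
    "nd1_g2": [("col0_g2", 80.0)],
    "nd1_g3": [("col0_g1", 150.0)],
}
col0_vs_nd1 = {
    "col0_g1": [("nd1_g1", 198.0), ("nd1_g3", 150.0)],
    "col0_g2": [("nd1_g1", 60.0), ("nd1_g2", 79.0)],
}
print(reciprocal_best_hits(nd1_vs_col0, col0_vs_nd1))
```

    In the toy data, nd1_g3 has col0_g1 as its best hit, but col0_g1 prefers nd1_g1, so nd1_g3 is left unpaired. Requiring agreement in both directions is what makes RBH pairs useful as conservative anchors.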

  16. De novo assembly and characterization of the garlic (Allium sativum) bud transcriptome by Illumina sequencing.

    PubMed

    Sun, Xiudong; Zhou, Shumei; Meng, Fanlu; Liu, Shiqi

    2012-10-01

    Garlic is widely used as a spice throughout the world for the culinary value of its flavor and aroma, which are created by the chemical transformation of a series of organic sulfur compounds. To analyze the transcriptome of Allium sativum and discover the genes involved in sulfur metabolism, cDNAs derived from the total RNA of Allium sativum buds were analyzed by Illumina sequencing. Approximately 26.67 million 90 bp paired-end clean reads were obtained in two libraries. A total of 127,933 unigenes were generated by de novo assembly and were compared with the sequences in public databases. Of these, 45,286 unigenes had significant hits to the sequences in the Nr database, 29,514 showed significant similarity to known proteins in the Swiss-Prot database, and 20,706 and 21,952 unigenes had significant similarity to existing sequences in the KEGG and COG databases, respectively. Moreover, genes involved in organic sulfur biosynthesis were identified. These unigene data will provide the foundation for research on gene expression, genomics and functional genomics in Allium sativum. Key message: The obtained unigenes will provide the foundation for research on functional genomics in Allium sativum and its closely related species, and fill the gap of the existing plant EST database.

  17. De Novo Sequencing and Analysis of Lemongrass Transcriptome Provide First Insights into the Essential Oil Biosynthesis of Aromatic Grasses

    PubMed Central

    Meena, Seema; Kumar, Sarma R.; Venkata Rao, D. K.; Dwivedi, Varun; Shilpashree, H. B.; Rastogi, Shubhra; Shasany, Ajit K.; Nagegowda, Dinesh A.

    2016-01-01

    Aromatic grasses of the genus Cymbopogon (Poaceae family) represent a unique group of plants that produce diverse compositions of monoterpene-rich essential oils, which have great value in flavor, fragrance, cosmetic, and aromatherapy industries. Despite the commercial importance of these natural aromatic oils, their biosynthesis at the molecular level remains unexplored. As the first step toward understanding the essential oil biosynthesis, we performed de novo transcriptome assembly and analysis of C. flexuosus (lemongrass) by employing Illumina sequencing. Mining of transcriptome data and subsequent phylogenetic analysis led to identification of terpene synthases, pyrophosphatases, alcohol dehydrogenases, aldo-keto reductases, carotenoid cleavage dioxygenases, alcohol acetyltransferases, and aldehyde dehydrogenases, which are potentially involved in essential oil biosynthesis. Comparative essential oil profiling and mRNA expression analysis in three Cymbopogon species (C. flexuosus, aldehyde type; C. martinii, alcohol type; and C. winterianus, intermediate type) with varying essential oil composition indicated the involvement of identified candidate genes in the formation of alcohols, aldehydes, and acetates. Molecular modeling and docking further supported the role of identified protein sequences in aroma formation in Cymbopogon. Also, simple sequence repeats were found in the transcriptome with many linked to terpene pathway genes including the genes potentially involved in aroma biosynthesis. This work provides the first insights into the essential oil biosynthesis of aromatic grasses, and the identified candidate genes and markers can be a great resource for biotechnological and molecular breeding approaches to modulate the essential oil composition. PMID:27516768

  18. De Novo Sequencing and Analysis of Lemongrass Transcriptome Provide First Insights into the Essential Oil Biosynthesis of Aromatic Grasses.

    PubMed

    Meena, Seema; Kumar, Sarma R; Venkata Rao, D K; Dwivedi, Varun; Shilpashree, H B; Rastogi, Shubhra; Shasany, Ajit K; Nagegowda, Dinesh A

    2016-01-01

    Aromatic grasses of the genus Cymbopogon (Poaceae family) represent a unique group of plants that produce monoterpene-rich essential oils of diverse composition, which have great value in the flavor, fragrance, cosmetic, and aromatherapy industries. Despite the commercial importance of these natural aromatic oils, their biosynthesis at the molecular level remains unexplored. As the first step toward understanding essential oil biosynthesis, we performed de novo transcriptome assembly and analysis of C. flexuosus (lemongrass) by employing Illumina sequencing. Mining of transcriptome data and subsequent phylogenetic analysis led to the identification of terpene synthases, pyrophosphatases, alcohol dehydrogenases, aldo-keto reductases, carotenoid cleavage dioxygenases, alcohol acetyltransferases, and aldehyde dehydrogenases, which are potentially involved in essential oil biosynthesis. Comparative essential oil profiling and mRNA expression analysis in three Cymbopogon species (C. flexuosus, aldehyde type; C. martinii, alcohol type; and C. winterianus, intermediate type) with varying essential oil composition indicated the involvement of the identified candidate genes in the formation of alcohols, aldehydes, and acetates. Molecular modeling and docking further supported the role of the identified protein sequences in aroma formation in Cymbopogon. In addition, simple sequence repeats were found in the transcriptome, many of them linked to terpene pathway genes, including genes potentially involved in aroma biosynthesis. This work provides the first insights into the essential oil biosynthesis of aromatic grasses, and the identified candidate genes and markers can be a valuable resource for biotechnological and molecular breeding approaches to modulate essential oil composition.

  19. Sequencing and De Novo Assembly of the Gonadal Transcriptome of the Endangered Chinese Sturgeon (Acipenser sinensis)

    PubMed Central

    Du, Hao; Zhang, Shuhuan; Wei, Qiwei

    2015-01-01

    Background The Chinese sturgeon (Acipenser sinensis) is endangered through anthropogenic activities including over-fishing, damming, shipping, and pollution. Controlled reproduction has been adopted and successfully conducted for conservation. However, little information is available on the reproductive regulation of the species. In this study, we conducted de novo transcriptome assembly of the gonad tissue to create a comprehensive dataset for A. sinensis. Results The Illumina sequencing platform was adopted to obtain 47,333,701 and 47,229,705 high quality reads from testis and ovary cDNA libraries generated from three-year-old A. sinensis. We identified 86,027 unigenes of which 30,268 were annotated in the NCBI non-redundant protein database and 28,281 were annotated in the Swiss-prot database. Among the annotated unigenes, 26,152 and 7,734 unigenes, respectively, were assigned to gene ontology categories and clusters of orthologous groups. In addition, 12,557 unigenes were mapped to 231 pathways in the Kyoto Encyclopedia of Genes and Genomes Pathway database. A total of 1,896 unigenes, potentially differentially expressed between the two gonad types, were found, with 1,894 predicted to be up-regulated in ovary and only two in testis. Fifty-five potential gametogenesis-related genes were screened in the transcriptome and 34 genes with significant matches were found. Besides, more paralogs of 11 genes in three gene families (sox, apolipoprotein and cyclin) were found in A. sinensis compared to their orthologs in the diploid Danio rerio. In addition, 12,151 putative simple sequence repeats (SSRs) were detected. Conclusions This study provides the first de novo transcriptome analysis currently available for A. sinensis. The transcriptomic data represents the fundamental resource for future research on the mechanism of early gametogenesis in sturgeons. The SSRs identified in this work will be valuable for assessment of genetic diversity of wild fish and genealogy

  20. De Novo Transcriptome Sequencing of Oryza officinalis Wall ex Watt to Identify Disease-Resistance Genes.

    PubMed

    He, Bin; Gu, Yinghong; Tao, Xiang; Cheng, Xiaojie; Wei, Changhe; Fu, Jian; Cheng, Zaiquan; Zhang, Yizheng

    2015-12-10

    Oryza officinalis Wall ex Watt is one of the most important wild relatives of cultivated rice and exhibits high resistance to many diseases. It has been used as a source of genes for introgression into cultivated rice. However, there are limited genomic resources and little genetic information publicly reported for this species. To better understand the pathways and factors involved in disease resistance and to accelerate rice breeding, we carried out de novo transcriptome sequencing of O. officinalis. De novo assembly of reads from leaves, stems and roots, sequenced on an Illumina HiSeq 2000 platform, yielded 137,229 contigs ranging from 200 to 19,214 bp with an N50 of 2,331 bp. Based on sequence similarity searches against a non-redundant protein database, a total of 88,249 contigs were annotated with gene descriptions and 75,589 transcripts were further assigned to GO terms. Candidate genes for plant-pathogen interaction and plant hormone regulation pathways involved in disease resistance were identified. Further analyses of gene expression profiles showed that the majority of genes related to disease resistance were expressed in all three tissues. In addition, two kinds of rice bacterial blight resistance genes were found in O. officinalis: two Xa1 genes and three Xa26 genes. Both Xa1 genes showed their highest expression level in stem, whereas one Xa26 gene was expressed predominantly in leaf and the other two Xa26 genes displayed low expression levels in all three tissues. This transcriptomic database provides an opportunity for identifying the genes involved in disease resistance and will provide a basis for studying the functional genomics of O. officinalis and the genetic improvement of cultivated rice in the future.
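
    The N50 statistic quoted in this record (and in several others in this list) is the length of the contig at which the running total of contig lengths, taken from longest to shortest, first reaches half of the total assembly length. A minimal sketch of that definition follows; the function name n50 and the toy contig lengths are illustrative assumptions and are not taken from any of the cited studies.

```python
def n50(lengths):
    """Return the N50 of a collection of contig lengths.

    Contigs are visited from longest to shortest; the N50 is the length
    of the contig at which the cumulative sum first reaches half of the
    total assembly length.
    """
    if not lengths:
        raise ValueError("no contig lengths given")
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length


# Toy example (hypothetical lengths, not data from the abstract).
toy_contigs = [2, 3, 4, 5, 6, 7, 8, 9, 10]
print(n50(toy_contigs))
```

    Because a few very long contigs can dominate the statistic, N50 is usually read alongside the contig count and the length range, as the abstract above does.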

  1. De novo Sequencing, Assembly and Characterization of Antennal Transcriptome of Anomala corpulenta Motschulsky (Coleoptera: Rutelidae)

    PubMed Central

    Chen, Haoliang; Lin, Lulu; Xie, Minghui; Zhang, Guangling; Su, Weihua

    2014-01-01

    Background Anomala corpulenta is an important insect pest and can cause enormous economic losses in agriculture, horticulture and forestry. It is widely distributed in China, and both larvae and adults can cause serious damage. It is difficult to control this pest because the larvae live underground. Any new control strategy should exploit alternatives to heavily and frequently used chemical insecticides. However, little genetic research has been carried out on A. corpulenta due to the lack of genomic resources. Genomic resources can be produced by next-generation sequencing technologies at low cost and in a short time. In this study, we performed de novo sequencing, assembly and characterization of the antennal transcriptome of A. corpulenta. Results Illumina sequencing technology was used to sequence the antennal transcriptome of A. corpulenta. Approximately 76.7 million total raw reads and about 68.9 million total clean reads were obtained, from which 35,656 unigenes were assembled. Of these unigenes, 21,463 could be annotated in the NCBI nr database, and, among the annotated unigenes, 11,154 and 6,625 unigenes could be assigned to GO and COG, respectively. Additionally, 16,350 unigenes could be annotated in the Swiss-Prot database, and 14,499 unigenes could be mapped onto 258 pathways in the KEGG Pathway database. We also found 24 unigenes related to OBPs, 6 to CSPs, and in total 167 unigenes related to chemodetection. We analyzed 4 OBP and 3 CSP sequences, and their RT-qPCR results agreed well with their FPKM values. Conclusion We produced the first large-scale antennal transcriptome of A. corpulenta, a species that has little genomic information in public databases. The identified chemodetection unigenes can promote the molecular mechanistic study of behavior in A. corpulenta. These findings provide a general sequence resource for molecular genetics research on A. corpulenta. PMID:25461610

  2. Mining Novel Allergens from Coconut Pollen Employing Manual De Novo Sequencing and Homology-Driven Proteomics.

    PubMed

    Saha, Bodhisattwa; Sircar, Gaurab; Pandey, Naren; Gupta Bhattacharya, Swati

    2015-11-06

    Coconut pollen, one of the major palm pollen grains is an important constituent among vectors of inhalant allergens in India and a major sensitizer for respiratory allergy in susceptible patients. To gain insight into its allergenic components, pollen proteins were analyzed by two-dimensional electrophoresis, immunoblotted with coconut pollen sensitive patient sera, followed by mass spectrometry of IgE reactive proteins. Coconut being largely unsequenced, a proteomic workflow has been devised that combines the conventional database-dependent analysis of tandem mass spectral data and manual de novo sequencing followed by a homology-based search for identifying the allergenic proteins. N-terminal acetylation helped to distinguish "b" ions from others, facilitating reliable sequencing. This led to the identification of 12 allergenic proteins. Cluster analysis with individual patient sera recognized vicilin-like protein as a major allergen, which was purified to assess its in vitro allergenicity and then partially sequenced. Other IgE-sensitive spots showed significant homology with well-known allergenic proteins such as 11S globulin, enolase, and isoflavone reductase along with a few which are reported as novel allergens. The allergens identified can be used as potential candidates to develop hypoallergenic vaccines, to design specific immunotherapy trials, and to enrich the repertoire of existing IgE reactive proteins.

  3. Oxford Nanopore sequencing, hybrid error correction, and de novo assembly of a eukaryotic genome

    PubMed Central

    Goodwin, Sara; Gurtowski, James; Ethe-Sayers, Scott; Deshpande, Panchajanya; Schatz, Michael C.; McCombie, W. Richard

    2015-01-01

    Monitoring the progress of DNA molecules through a membrane pore has been postulated as a method for sequencing DNA for several decades. Recently, a nanopore-based sequencing instrument, the Oxford Nanopore MinION, has become available, and we used this for sequencing the Saccharomyces cerevisiae genome. To make use of these data, we developed a novel open-source hybrid error correction algorithm Nanocorr specifically for Oxford Nanopore reads, because existing packages were incapable of assembling the long read lengths (5–50 kbp) at such high error rates (between ∼5% and 40% error). With this new method, we were able to perform a hybrid error correction of the nanopore reads using complementary MiSeq data and produce a de novo assembly that is highly contiguous and accurate: The contig N50 length is more than ten times greater than an Illumina-only assembly (678 kb versus 59.9 kbp) and has >99.88% consensus identity when compared to the reference. Furthermore, the assembly with the long nanopore reads presents a much more complete representation of the features of the genome and correctly assembles gene cassettes, rRNAs, transposable elements, and other genomic features that were almost entirely absent in the Illumina-only assembly. PMID:26447147

  4. Whole genome sequencing data and de novo draft assemblies for 66 teleost species

    PubMed Central

    Malmstrøm, Martin; Matschiner, Michael; Tørresen, Ole K.; Jakobsen, Kjetill S.; Jentoft, Sissel

    2017-01-01

    Teleost fishes comprise more than half of all vertebrate species, yet genomic data are only available for 0.2% of their diversity. Here, we present whole genome sequencing data for 66 new species of teleosts, vastly expanding the availability of genomic data for this important vertebrate group. We report on de novo assemblies based on low-coverage (9–39×) sequencing and present detailed methodology for all analyses. To facilitate further utilization of this data set, we present statistical analyses of the gene space completeness and verify the expected phylogenetic position of the sequenced genomes in a large mitogenomic context. We further present a nuclear marker set used for phylogenetic inference and evaluate each gene tree in relation to the species tree to test for homogeneity in the phylogenetic signal. Collectively, these analyses illustrate the robustness of this highly diverse data set and enable extensive reuse of the selected phylogenetic markers and the genomic data in general. This data set covers all major teleost lineages and provides unprecedented opportunities for comparative studies of teleosts. PMID:28094797

  5. Oxford Nanopore sequencing, hybrid error correction, and de novo assembly of a eukaryotic genome.

    PubMed

    Goodwin, Sara; Gurtowski, James; Ethe-Sayers, Scott; Deshpande, Panchajanya; Schatz, Michael C; McCombie, W Richard

    2015-11-01

    Monitoring the progress of DNA molecules through a membrane pore has been postulated as a method for sequencing DNA for several decades. Recently, a nanopore-based sequencing instrument, the Oxford Nanopore MinION, has become available, and we used this for sequencing the Saccharomyces cerevisiae genome. To make use of these data, we developed a novel open-source hybrid error correction algorithm Nanocorr specifically for Oxford Nanopore reads, because existing packages were incapable of assembling the long read lengths (5-50 kbp) at such high error rates (between ∼5% and 40% error). With this new method, we were able to perform a hybrid error correction of the nanopore reads using complementary MiSeq data and produce a de novo assembly that is highly contiguous and accurate: The contig N50 length is more than ten times greater than an Illumina-only assembly (678 kb versus 59.9 kbp) and has >99.88% consensus identity when compared to the reference. Furthermore, the assembly with the long nanopore reads presents a much more complete representation of the features of the genome and correctly assembles gene cassettes, rRNAs, transposable elements, and other genomic features that were almost entirely absent in the Illumina-only assembly.

  6. De Novo whole genome sequence of Xylella fastidiosa subsp. multiplex strain BB01 from blueberry in Georgia, USA

    Technology Transfer Automated Retrieval System (TEKTRAN)

    This study reports a de novo assembled draft genome sequence of Xylella fastidiosa subsp. multiplex strain BB01 causing blueberry bacterial leaf scorch in Georgia, USA. The BB01 genome is 2,517,579 bp with a G+C content of 51.8% and 2,943 open reading frames (ORFs) and 48 RNA genes....
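
    The G+C content reported above is the percentage of bases in the genome that are guanine or cytosine. A minimal sketch of the calculation for a sequence held as a string follows; the function name gc_content is an illustrative assumption, and ignoring ambiguous bases such as N is one possible convention rather than the one used in the study.

```python
def gc_content(seq):
    """Return the G+C percentage of a nucleotide sequence.

    Only unambiguous bases (A, C, G, T) are counted; ambiguity codes
    such as N are ignored (an assumed convention for this sketch).
    """
    counted = [base for base in seq.upper() if base in "ACGT"]
    if not counted:
        raise ValueError("sequence contains no A, C, G or T bases")
    gc = sum(1 for base in counted if base in "GC")
    return 100.0 * gc / len(counted)


# Toy sequence, not taken from the BB01 genome.
print(gc_content("ATGCGCNNAT"))
```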

  7. Overlapping Genes Produce Proteins with Unusual Sequence Properties and Offer Insight into De Novo Protein Creation

    PubMed Central

    Rancurel, Corinne; Khosravi, Mahvash; Dunker, A. Keith; Romero, Pedro R.; Karlin, David

    2009-01-01

    It is widely assumed that new proteins are created by duplication, fusion, or fission of existing coding sequences. Another mechanism of protein birth is provided by overlapping genes. They are created de novo by mutations within a coding sequence that lead to the expression of a novel protein in another reading frame, a process called “overprinting.” To investigate this mechanism, we have analyzed the sequences of the protein products of manually curated overlapping genes from 43 genera of unspliced RNA viruses infecting eukaryotes. Overlapping proteins have a sequence composition globally biased toward disorder-promoting amino acids and are predicted to contain significantly more structural disorder than nonoverlapping proteins. By analyzing the phylogenetic distribution of overlapping proteins, we were able to confirm that 17 of these had been created de novo and to study them individually. Most proteins created de novo are orphans (i.e., restricted to one species or genus). Almost all are accessory proteins that play a role in viral pathogenicity or spread, rather than proteins central to viral replication or structure. Most proteins created de novo are predicted to be fully disordered and have a highly unusual sequence composition. This suggests that some viral overlapping reading frames encoding hypothetical proteins with highly biased composition, often discarded as noncoding, might in fact encode proteins. Some proteins created de novo are predicted to be ordered, however, and whenever a three-dimensional structure of such a protein has been solved, it corresponds to a fold previously unobserved, suggesting that the study of these proteins could enhance our knowledge of protein space. PMID:19640978

  8. Bromine isotopic signature facilitates de novo sequencing of peptides in free-radical-initiated peptide sequencing (FRIPS) mass spectrometry.

    PubMed

    Nam, Jungjoo; Kwon, Hyuksu; Jang, Inae; Jeon, Aeran; Moon, Jingyu; Lee, Sun Young; Kang, Dukjin; Han, Sang Yun; Moon, Bongjin; Oh, Han Bin

    2015-02-01

    We recently showed that free-radical-initiated peptide sequencing mass spectrometry (FRIPS MS) assisted by the remarkable thermochemical stability of (2,2,6,6-tetramethyl-piperidin-1-yl)oxyl (TEMPO) is another attractive radical-driven peptide fragmentation MS tool. Facile homolytic cleavage of the bond between the benzylic carbon and the oxygen of the TEMPO moiety in o-TEMPO-Bz-C(O)-peptide and the high reactivity of the benzylic radical species generated in •Bz-C(O)-peptide are key elements leading to extensive radical-driven peptide backbone fragmentation. In the present study, we demonstrate that the incorporation of bromine into the benzene ring, i.e. o-TEMPO-Bz(Br)-C(O)-peptide, allows unambiguous distinction of the N-terminal peptide fragments from the C-terminal fragments through the unique bromine doublet isotopic signature. Furthermore, bromine substitution does not alter the overall radical-driven peptide backbone dissociation pathways of o-TEMPO-Bz-C(O)-peptide. From a practical perspective, the presence of the bromine isotopic signature in the N-terminal peptide fragments in TEMPO-assisted FRIPS MS represents a useful and cost-effective opportunity for de novo peptide sequencing.
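
    The bromine doublet mentioned above arises because bromine has two stable isotopes, 79Br and 81Br, of nearly equal natural abundance, so a singly brominated fragment appears as two peaks of similar height about 1.998 Da apart (divided by the charge state on an m/z axis). A minimal sketch of how such pairs could be flagged in a centroided peak list follows. The function name, the tolerance and the intensity-ratio window are illustrative assumptions, not parameters from the study, and real spectra would also need to account for overlapping 13C isotope peaks.

```python
# Mass difference between 81Br and 79Br in Da.
BR_ISOTOPE_SPACING = 80.9163 - 78.9183


def find_bromine_doublets(peaks, charge=1, mz_tol=0.01,
                          min_ratio=0.8, max_ratio=1.25):
    """Return (light_peak, heavy_peak) pairs that look like Br doublets.

    peaks is a list of (mz, intensity) tuples. Two peaks are paired when
    their m/z spacing matches the 79Br/81Br difference for the given
    charge within mz_tol, and their intensity ratio is close to 1:1.
    All thresholds are assumed values for illustration.
    """
    ordered = sorted(peaks)
    spacing = BR_ISOTOPE_SPACING / charge
    pairs = []
    for i, (mz_a, int_a) in enumerate(ordered):
        for mz_b, int_b in ordered[i + 1:]:
            delta = mz_b - mz_a
            if delta > spacing + mz_tol:
                break  # later peaks are even further away
            if abs(delta - spacing) <= mz_tol and int_a > 0:
                if min_ratio <= int_b / int_a <= max_ratio:
                    pairs.append(((mz_a, int_a), (mz_b, int_b)))
    return pairs


# Hypothetical peak list: one doublet near m/z 500 and two unrelated peaks.
toy_peaks = [(500.000, 100.0), (501.998, 97.0), (600.000, 50.0), (601.000, 10.0)]
for light, heavy in find_bromine_doublets(toy_peaks):
    print(light, heavy)
```

    In the scheme described in the abstract, fragments carrying the doublet would be assigned to the N-terminal series, since the brominated tag sits at the N-terminus.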

  9. Sequencing and De Novo Assembly of the Transcriptome of the Glassy-Winged Sharpshooter (Homalodisca vitripennis)

    PubMed Central

    Nandety, Raja Sekhar; Kamita, Shizuo G.; Hammock, Bruce D.; Falk, Bryce W.

    2013-01-01

    Background The glassy-winged sharpshooter Homalodisca vitripennis (Hemiptera: Cicadellidae) is a xylem-feeding leafhopper and an important vector of the bacterium Xylella fastidiosa, the causal agent of Pierce’s disease of grapevines. The functional complexity of the transcriptome of H. vitripennis has not been elucidated thus far. It is a necessary blueprint for an understanding of the development of H. vitripennis and for designing efficient biorational control strategies including those based on RNA interference. Results Here we elucidate and explore the transcriptome of adult H. vitripennis using high-throughput paired-end deep sequencing and de novo assembly. A total of 32,803,656 paired-end reads were obtained with an average transcript length of 624 nucleotides. We assembled 32.9 Mb of the transcriptome of H. vitripennis that spanned 47,265 loci and 52,708 transcripts. Comparison with the non-redundant database showed that 45% of the deduced proteins of H. vitripennis exhibit identity (e-value ≤ 1e-5) with known proteins. We assigned Gene Ontology (GO) terms, Kyoto Encyclopedia of Genes and Genomes (KEGG) annotations, and potential Pfam domains to each transcript isoform. In order to gain insight into the molecular basis of key regulatory genes of H. vitripennis, we characterized predicted proteins involved in the metabolism of juvenile hormone and the biogenesis of small RNAs (Dicer and Piwi sequences) from the transcriptomic sequences. Analysis of transposable element sequences of H. vitripennis indicated that the genome is less expanded in comparison to many other insects, with approximately 1% of the transcriptome carrying transposable elements. Conclusions Our data significantly enhance the molecular resources available for future study and control of this economically important hemipteran. This transcriptional information not only provides a more nuanced understanding of the underlying biological and physiological mechanisms that govern H

  10. De Novo Assembly and Transcriptome Characterization of Canine Retina Using High-Throughput Sequencing

    PubMed Central

    Reddy, Bhaskar; Patel, Amrutlal K.; Singh, Krishna M.; Patil, Deepak B.; Parikh, Pinesh V.; Kelawala, Divyesh N.; Koringa, Prakash G.; Bhatt, Vaibhav D.; Rao, Mandava V.; Joshi, Chaitanya G.

    2015-01-01

    We performed transcriptome sequencing of canine retinal tissue by 454 GS-FLX and Ion Torrent PGM platforms. RNA-Seq analysis by CLC Genomics Workbench mapped expression of 10,360 genes. Gene ontology analysis of retinal transcriptome revealed abundance of transcripts known to be involved in vision associated processes. The de novo assembly of the sequences using CAP3 generated 29,683 contigs with mean length of 560.9 and N50 of 619 bases. Further analysis of contigs predicted 3,827 full-length cDNAs and 29,481 (99%) open reading frames (ORFs). In addition, 3,782 contigs were assigned to 316 KEGG pathways which included melanogenesis, phototransduction, and retinol metabolism with 33, 15, and 11 contigs, respectively. Among the identified microsatellites, dinucleotide repeats were 68.84%, followed by trinucleotides, tetranucleotides, pentanucleotides, and hexanucleotides in proportions of 25.76, 9.40, 2.52, and 0.96%, respectively. This study will serve as a valuable resource for understanding the biology and function of canine retina. PMID:26788372
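
    Microsatellites (simple sequence repeats) such as the di- to hexanucleotide repeats counted above are short motifs repeated in tandem. A minimal regular-expression sketch of how they could be located in a contig follows. The minimum repeat counts per motif length are assumed thresholds chosen for illustration, the abstract does not state which tool or cut-offs were used, and the function name find_ssrs is hypothetical.

```python
import re


def find_ssrs(seq, min_repeats=None):
    """Return (start, motif, repeat_count) for tandem repeats in seq.

    min_repeats maps motif length to the minimum number of tandem
    copies required; the defaults are assumed values for illustration.
    """
    if min_repeats is None:
        min_repeats = {2: 6, 3: 5, 4: 4, 5: 4, 6: 4}
    seq = seq.upper()
    hits = []
    for unit_len, min_rep in sorted(min_repeats.items()):
        # A motif of unit_len bases followed by at least min_rep - 1 copies.
        pattern = re.compile(r"(([ACGT]{%d})\2{%d,})" % (unit_len, min_rep - 1))
        for match in pattern.finditer(seq):
            unit = match.group(2)
            # Skip motifs that are repeats of a shorter motif (e.g. AAAA,
            # ACAC) so that each repeat is reported once, at its shortest
            # motif length.
            periodic = any(
                unit == unit[:k] * (unit_len // k)
                for k in range(1, unit_len)
                if unit_len % k == 0
            )
            if periodic:
                continue
            hits.append((match.start(), unit, len(match.group(1)) // unit_len))
    return hits


# Toy contig: seven AC repeats and five CAG repeats (not real data).
toy_contig = "GGT" + "AC" * 7 + "TTA" + "CAG" * 5 + "T"
print(find_ssrs(toy_contig))
```

    Tallying the motif lengths of the hits across all contigs would give class proportions of the kind reported in the abstract.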

  11. Sequencing and de novo assembly of the red cusk-eel (Genypterus chilensis) transcriptome.

    PubMed

    Aedo, J E; Maldonado, J; Estrada, J M; Fuentes, E N; Silva, H; Gallardo-Escarate, C; Molina, A; Valdés, J A

    2014-12-01

    The red cusk-eel (Genypterus chilensis) is an endemic fish species distributed along the coasts of the Eastern South Pacific. Biological studies on this fish are scarce, and genomic information for G. chilensis is practically non-existent. Thus, transcriptome information for this species is an essential resource that will greatly enrich molecular information and benefit future studies of red cusk-eel biology. In this work, we obtained transcriptome information of G. chilensis using the Illumina platform. The RNA sequencing generated 66,307,362 and 59,925,554 paired-end reads from skeletal muscle and liver tissues, respectively. De novo assembly using the CLC Genomics Workbench version 7.0.3 produced 48,480 contigs and created a reference transcriptome with an N50 of 846 bp and average read coverage of 28.3×. By sequence similarity search for known proteins, a total of 21,272 (43.9%) contigs were annotated for their function. Among the GO annotation results for these annotated contigs, 33.5% corresponded to biological processes, 32.6% to cellular components, and 34.5% to molecular functions. This dataset represents the first transcriptomic resource for the red cusk-eel and for a member of the Ophidiimorpharia taxon.

  12. A case study of de novo sequence analysis of N-sulfonated peptides by MALDI TOF/TOF mass spectrometry.

    PubMed

    Samyn, Bart; Debyser, Griet; Sergeant, Kjell; Devreese, Bart; Van Beeumen, Jozef

    2004-12-01

    The simplicity and sensitivity of matrix-assisted laser desorption/ionization time-of-flight mass spectrometry have increased its application in recent years. The most common method of "peptide mass fingerprint" analysis often does not provide robust identification. Additional sequence information, obtained by post-source decay or collision induced dissociation, provides additional constraints for database searches. However, de novo sequencing by mass spectrometry is not yet common practice, most likely because of the difficulties associated with the interpretation of high and low energy CID spectra. Success with this type of sequencing requires full sequence coverage and demands better quality spectra than those typically used for data base searching. In this report we show that full-length de novo sequencing is possible using MALDI TOF/TOF analysis. The interpretation of MS/MS data is facilitated by N-terminal sulfonation after protection of lysine side chains (Keough et al., Proc. Natl. Acad. Sci. U.S.A. 1999, 96, 7131-7136). Reliable de novo sequence analysis has been obtained using sub-picomol quantities of peptides and peptide sequences of up to 16 amino acid residues in length have been determined. The simple, predictable fragmentation pattern allows routine de novo interpretation, either manually or using software. Characterization of the complete primary structure of a peptide is often hindered because of differences in fragmentation efficiencies and in specific fragmentation patterns for different peptides. These differences are controlled by various structural parameters including the nature of the residues present. The influence of the presence of internal Pro, acidic and basic residues on the TOF/TOF fragmentation pattern will be discussed, both for underivatized and guanidinated/sulfonated peptides.

  13. Sequencing and De Novo Assembly of the Western Tarnished Plant Bug (Lygus hesperus) Transcriptome

    PubMed Central

    Hull, J. Joe; Geib, Scott M.; Fabrick, Jeffrey A.; Brent, Colin S.

    2013-01-01

    Background Mirid plant bugs (Hemiptera: Miridae) are economically important insect pests of many crops worldwide. The western tarnished plant bug Lygus hesperus Knight is a pest of cotton, alfalfa, fruit and vegetable crops, and potentially of several emerging biofuel and natural product feedstocks in the western US. However, little is known about the underlying molecular genetics, biochemistry, or physiology of L. hesperus, including their ability to survive extreme environmental conditions. Methodology/Principal Findings We used 454 pyrosequencing of a normalized adult cDNA library and de novo assembly to obtain an adult L. hesperus transcriptome consisting of 1,429,818 transcriptomic reads representing 36,131 transcript isoforms (isotigs) that correspond to 19,742 genes. A search of the transcriptome against deposited L. hesperus protein sequences revealed that 86 out of 87 were represented. Comparison with the non-redundant database indicated that 54% of the transcriptome exhibited similarity (e-value ≤ 1e-5) with known proteins. In addition, Gene Ontology (GO) terms, Kyoto Encyclopedia of Genes and Genomes (KEGG) annotations, and potential Pfam domains were assigned to each transcript isoform. To gain insight into the molecular basis of the L. hesperus thermal stress response we used transcriptomic sequences to identify 52 potential heat shock protein (Hsp) homologs. A subset of these transcripts was sequence verified and their expression response to thermal stress monitored by semi-quantitative PCR. Potential homologs of Hsp70, Hsp40, and two small Hsps were found to be upregulated in the heat-challenged adults, suggesting a role in thermotolerance. Conclusions/Significance The L. hesperus transcriptome advances the underlying molecular understanding of this arthropod pest by significantly increasing the number of known genes, and provides the basis for further exploration and understanding of the fundamental mechanisms of abiotic stress responses. PMID

  14. Rationale-Based, De Novo Design of Dehydrophenylalanine-Containing Antibiotic Peptides and Systematic Modification in Sequence for Enhanced Potency

    PubMed Central

    Pathak, Sarika; Chauhan, Virander Singh

    2011-01-01

    Increased microbial drug resistance has generated a global requirement for new anti-infective agents. As part of an effort to develop new, low-molecular-mass peptide antibiotics, we used a rationale-based minimalist approach to design short, nonhemolytic, potent, and broad-spectrum antibiotic peptides with increased serum stability. These peptides were designed to attain an amphipathic structure in helical conformations. VS1 was used as the lead compound, and its properties were compared with three series of derivates obtained by (i) N-terminal amino acid addition, (ii) systematic Trp substitution, and (iii) peptide dendrimerization. The Trp substitution approach underlined the optimized sequence of VS2 in terms of potency, faster membrane permeation, and cost-effectiveness. VS2 (a variant of VS1 with two Trp substitutions) was found to exhibit good antimicrobial activity against both the Gram-negative Escherichia coli and the Gram-positive bacterium Staphylococcus aureus. It was also found to have noncytolytic activity and the ability to permeate and depolarize the bacterial membrane. Lysis of the bacterial cell wall and inner membrane by the peptide was confirmed by transmission electron microscopy. A combination of small size, the presence of unnatural amino acids, high antimicrobial activity, insignificant hemolysis, and proteolytic resistance provides fundamental information for the de novo design of an antimicrobial peptide useful for the management of infectious disease. PMID:21321136

  15. The First Illumina-Based De Novo Transcriptome Sequencing and Analysis of Safflower Flowers

    PubMed Central

    Lulin, Huang; Xiao, Yang; Pei, Sun; Wen, Tong; Shangqin, Hu

    2012-01-01

    Background The safflower, Carthamus tinctorius L., is a worldwide oil crop, and its flowers, which have a high flavonoid content, are an important medicinal resource against cardiovascular disease in traditional medicine. Because the safflower has a large and complex genome, the development of its genomic resources has been delayed. Second-generation Illumina sequencing is now an efficient route for generating an enormous volume of sequences that can represent a large number of genes and their expression levels. Methodology/Principal Findings To investigate the genes and pathways that might control flavonoids and other secondary metabolites in the safflower, we used Illumina sequencing to perform a de novo assembly of the safflower tubular flower tissue transcriptome. We obtained a total of 4.69 Gb in clean nucleotides comprising 52,119,104 clean sequencing reads, 195,320 contigs, and 120,778 unigenes. Based on similarity searches with known proteins, we annotated 70,342 of the unigenes (about 58% of the identified unigenes) with an E-value cut-off of 10^-5. In total, 21,943 of the safflower unigenes were found to have COG classifications, and BLAST2GO assigned 26,332 of the unigenes to 1,754 GO term annotations. In addition, we assigned 30,203 of the unigenes to 121 KEGG pathways. When we focused on genes identified as contributing to flavonoid biosynthesis and the biosynthesis of unsaturated fatty acids, which are important pathways that control flower and seed quality, respectively, we found that these genes were fairly well conserved in the safflower genome compared to those of other plants. Conclusions/Significance Our study provides abundant genomic data for Carthamus tinctorius L. and offers comprehensive sequence resources for studying the safflower. We believe that these transcriptome datasets will serve as an important public information platform to accelerate studies of the safflower genome, and may help us define the mechanisms of flower tissue

  16. De novo sequencing and characterization of Picrorhiza kurrooa transcriptome at two temperatures showed major transcriptome adjustments

    PubMed Central

    2012-01-01

    Background Picrorhiza kurrooa Royle ex Benth. is an endangered plant species of medicinal importance. The medicinal property is attributed to monoterpenoids picroside I and II, which are modulated by temperature. The transcriptome information of this species is limited with the availability of few hundreds of expressed sequence tags (ESTs) in the public databases. In order to gain insight into temperature mediated molecular changes, high throughput de novo transcriptome sequencing and analyses were carried out at 15°C and 25°C, the temperatures known to modulate picrosides content. Results Using paired-end (PE) Illumina sequencing technology, a total of 20,593,412 and 44,229,272 PE reads were obtained after quality filtering for 15°C and 25°C, respectively. Available (e.g., De-Bruijn/Eulerian graph) and in-house developed bioinformatics tools were used for assembly and annotation of transcriptome. A total of 74,336 assembled transcript sequences were obtained, with an average coverage of 76.6 and average length of 439.5. Guanine-cytosine (GC) content was observed to be 44.6%, while the transcriptome exhibited abundance of trinucleotide simple sequence repeat (SSR; 45.63%) markers. Large scale expression profiling through "read per exon kilobase per million (RPKM)", showed changes in several biological processes and metabolic pathways including cytochrome P450s (CYPs), UDP-glycosyltransferases (UGTs) and those associated with picrosides biosynthesis. RPKM data were validated by reverse transcriptase-polymerase chain reaction using a set of 19 genes, wherein 11 genes behaved in accordance with the two expression methods. Conclusions Study generated transcriptome of P. kurrooa at two different temperatures. Large scale expression profiling through RPKM showed major transcriptome changes in response to temperature reflecting alterations in major biological processes and metabolic pathways, and provided insight of GC content and SSR markers. Analysis also identified
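    The RPKM measure named in the record above normalizes a read count by transcript length and by sequencing depth. The following is a minimal sketch, assuming the commonly used definition RPKM = 10^9 * C / (N * L), where C is the number of reads mapped to a transcript, N the total number of mapped reads, and L the transcript length in bp. The function name and the toy numbers are illustrative and are not taken from the study.

```python
def rpkm(read_count, transcript_length_bp, total_mapped_reads):
    """Reads per kilobase of transcript per million mapped reads.

    Assumes the usual definition: 1e9 * C / (N * L).
    """
    if transcript_length_bp <= 0 or total_mapped_reads <= 0:
        raise ValueError("length and total read count must be positive")
    return read_count * 1e9 / (transcript_length_bp * total_mapped_reads)


# Hypothetical example: 500 reads on a 2,000 bp transcript, 20 million mapped reads.
example_value = rpkm(500, 2000, 20_000_000)
print(example_value)
```

    Because the value is divided by both length and depth, the same transcript can be compared between two libraries of different size, such as the two temperature conditions described above.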

  17. Transcriptome Sequencing and De Novo Assembly of Golden Cuttlefish Sepia esculenta Hoyle.

    PubMed

    Liu, Changlin; Zhao, Fazhen; Yan, Jingping; Liu, Chunsheng; Liu, Siwei; Chen, Siqing

    2016-10-22

    Golden cuttlefish Sepia esculenta Hoyle is an economically important cephalopod species. However, artificial hatching is currently challenged by low survival rate of larvae due to abnormal embryonic development. Dissecting the genetic foundation and regulatory mechanisms in embryonic development requires genomic background knowledge. Therefore, we carried out a transcriptome sequencing on Sepia embryos and larvae via mRNA-Seq. 32,597,241 raw reads were filtered and assembled into 98,615 unigenes (N50 length at 911 bp) which were annotated in NR database, GO and KEGG databases respectively. Digital gene expression analysis was carried out on cleavage stage embryos, healthy larvae and malformed larvae. Unigenes functioning in cell proliferation exhibited higher transcriptional levels at cleavage stage while those related to animal disease and organ development showed increased transcription in malformed larvae. Homologs of key genes in regulatory pathways related to early development of animals were identified in Sepia. Most of them exhibit higher transcriptional levels in cleavage stage than larvae, suggesting their potential roles in embryonic development of Sepia. The de novo assembly of Sepia transcriptome is fundamental genetic background for further exploration in Sepia research. Our demonstration on the transcriptional variations of genes in three developmental stages will provide new perspectives in understanding the molecular mechanisms in early embryonic development of cuttlefish.
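    The N50 reported above is a length-weighted summary of an assembly: the length of the shortest sequence in the smallest set of longest sequences that together cover at least half of all assembled bases. A minimal sketch of that calculation follows; the function name and the toy lengths are illustrative assumptions, not data from the study.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths."""
    if not lengths:
        raise ValueError("lengths must not be empty")
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    # Walk from the longest sequence down until half of all bases are covered.
    for length in ordered:
        running += length
        if running >= half_total:
            return length


# Hypothetical unigene lengths in bp.
toy_lengths = [200, 300, 500, 900, 1100, 2000]
print(n50(toy_lengths))
```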

  18. Transcriptome Sequencing and De Novo Assembly of Golden Cuttlefish Sepia esculenta Hoyle

    PubMed Central

    Liu, Changlin; Zhao, Fazhen; Yan, Jingping; Liu, Chunsheng; Liu, Siwei; Chen, Siqing

    2016-01-01

    Golden cuttlefish Sepia esculenta Hoyle is an economically important cephalopod species. However, artificial hatching is currently challenged by low survival rate of larvae due to abnormal embryonic development. Dissecting the genetic foundation and regulatory mechanisms in embryonic development requires genomic background knowledge. Therefore, we carried out a transcriptome sequencing on Sepia embryos and larvae via mRNA-Seq. 32,597,241 raw reads were filtered and assembled into 98,615 unigenes (N50 length at 911 bp) which were annotated in NR database, GO and KEGG databases respectively. Digital gene expression analysis was carried out on cleavage stage embryos, healthy larvae and malformed larvae. Unigenes functioning in cell proliferation exhibited higher transcriptional levels at cleavage stage while those related to animal disease and organ development showed increased transcription in malformed larvae. Homologs of key genes in regulatory pathways related to early development of animals were identified in Sepia. Most of them exhibit higher transcriptional levels in cleavage stage than larvae, suggesting their potential roles in embryonic development of Sepia. The de novo assembly of Sepia transcriptome is fundamental genetic background for further exploration in Sepia research. Our demonstration on the transcriptional variations of genes in three developmental stages will provide new perspectives in understanding the molecular mechanisms in early embryonic development of cuttlefish. PMID:27782082

  19. De novo prediction of RNA-protein interactions from sequence information.

    PubMed

    Wang, Ying; Chen, Xiaowei; Liu, Zhi-Ping; Huang, Qiang; Wang, Yong; Xu, Derong; Zhang, Xiang-Sun; Chen, Runsheng; Chen, Luonan

    2013-01-27

    Protein-RNA interactions are fundamentally important in understanding cellular processes. In particular, non-coding RNA-protein interactions play an important role to facilitate biological functions in signalling, transcriptional regulation, and even the progression of complex diseases. However, experimental determination of protein-RNA interactions remains time-consuming and labour-intensive. Here, we develop a novel extended naïve-Bayes-classifier for de novo prediction of protein-RNA interactions, only using protein and RNA sequence information. Specifically, we first collect a set of known protein-RNA interactions as gold-standard positives and extract sequence-based features to represent each protein-RNA pair. To fill the gap between high dimensional features and scarcity of gold-standard positives, we select effective features by cutting a likelihood ratio score, which not only reduces the computational complexity but also allows transparent feature integration during prediction. An extended naïve Bayes classifier is then constructed using these effective features to train a protein-RNA interaction prediction model. Numerical experiments show that our method can achieve the prediction accuracy of 0.77 even though only a small number of protein-RNA interaction data are available. In particular, we demonstrate that the extended naïve-Bayes-classifier is superior to the naïve-Bayes-classifier by fully considering the dependences among features. Importantly, we conduct ncRNA pull-down experiments to validate the predicted novel protein-RNA interactions and identify the interacting proteins of sbRNA CeN72 in C. elegans, which further demonstrates the effectiveness of our method.
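    The feature-selection idea described above (keep only features whose likelihood ratio passes a cut-off, then classify with the retained features) can be sketched with an ordinary naive Bayes classifier over binary features. This is a simplified illustration and not the authors' extended classifier, which additionally models dependences among features. All function names, the Laplace smoothing constant, the cut-off value, and the toy data are assumptions made for the example.

```python
import math


def fit_log_ratios(positives, negatives, alpha=1.0):
    """Per-feature log likelihood ratios for binary feature vectors.

    Returns, for each feature, a pair (log LR if present, log LR if absent),
    estimated with Laplace smoothing `alpha`.
    """
    n_features = len(positives[0])
    ratios = []
    for j in range(n_features):
        p_pos = (sum(x[j] for x in positives) + alpha) / (len(positives) + 2 * alpha)
        p_neg = (sum(x[j] for x in negatives) + alpha) / (len(negatives) + 2 * alpha)
        ratios.append((math.log(p_pos / p_neg),
                       math.log((1 - p_pos) / (1 - p_neg))))
    return ratios


def select_features(ratios, cutoff):
    """Keep features whose 'present' log likelihood ratio is at least `cutoff` in magnitude."""
    return [j for j, (present, _absent) in enumerate(ratios) if abs(present) >= cutoff]


def score(x, ratios, selected, log_prior_odds=0.0):
    """Naive Bayes log-odds that x is an interacting pair, using selected features only."""
    total = log_prior_odds
    for j in selected:
        present, absent = ratios[j]
        total += present if x[j] else absent
    return total


# Toy data: feature 0 separates the classes, features 1 and 2 carry no signal.
positives = [[1, 1, 0], [1, 0, 1], [1, 1, 1], [1, 0, 0]]
negatives = [[0, 1, 0], [0, 0, 1], [0, 1, 1], [0, 0, 0]]
ratios = fit_log_ratios(positives, negatives)
selected = select_features(ratios, cutoff=0.5)
print(selected, score([1, 0, 0], ratios, selected), score([0, 1, 1], ratios, selected))
```

    A positive score favours the interacting class and a negative score the non-interacting class. Dropping uninformative features reduces the number of terms summed at prediction time.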

  20. Stable isotope N-phosphorylation labeling for Peptide de novo sequencing and protein quantification based on organic phosphorus chemistry.

    PubMed

    Gao, Xiang; Wu, Hanzhi; Lee, Kim-Chung; Liu, Hongxia; Zhao, Yufen; Cai, Zongwei; Jiang, Yuyang

    2012-12-04

    In this paper, we describe the development of a novel stable isotope N-phosphorylation labeling (SIPL) strategy for peptide de novo sequencing and protein quantification based on organic phosphorus chemistry. The labeling reaction could be performed easily and completed within 40 min in a one-pot reaction without additional cleanup procedures. It was found that N-phosphorylation labeling reagents were activated in situ to form labeling intermediates with high reactivity targeting on N-terminus and ε-amino groups of lysine under mild reaction conditions. The introduction of N-terminal-labeled phosphoryl group not only improved the ionization efficiency of peptides and increased the protein sequence coverage for peptide mass fingerprints but also greatly enhanced the intensities of b ions, suppressed the internal fragments, and reduced the complexity of the tandem mass spectrometry (MS/MS) fragmentation patterns of peptides. By using nano liquid chromatography chip/time-of-flight mass spectrometry (nano LC-chip/TOF MS) for the protein quantification, the obtained results showed excellent correlation of the measured ratios to theoretical ratios with relative errors ranging from 0.5% to 6.7% and relative standard deviation of less than 10.6%, indicating that the developed method was reproducible and precise. The isotope effect was negligible because of the deuterium atoms were placed adjacent to the neutral phosphoryl group with high electrophilicity and moderately small size. Moreover, the SIPL approach used inexpensive reagents and was amenable to samples from various sources, including cell culture, biological fluids, and tissues. The method development based on organic phosphorus chemistry offered a new approach for quantitative proteomics by using novel stable isotope labeling reagents.

  1. De Novo Transcriptome Assembly of the Chinese Swamp Buffalo by RNA Sequencing and SSR Marker Discovery

    PubMed Central

    Lu, Xingrong; Zhu, Peng; Duan, Anqin; Tan, Zhengzhun; Huang, Jian; Li, Hui; Chen, Mingtan; Liang, Xianwei

    2016-01-01

    The Chinese swamp buffalo (Bubalus bubalis) is vital to the lives of small farmers and has tremendous economic importance. However, a lack of genomic information has hampered research on augmenting marker assisted breeding programs in this species. Thus, a high-throughput transcriptomic sequencing of B. bubalis was conducted to generate a transcriptomic sequence dataset for gene discovery and molecular marker development. Illumina paired-end sequencing generated a total of 54,109,173 raw reads. After trimming, de novo assembly was performed, which yielded 86,017 unigenes, with an average length of 972.41 bp, an N50 of 1,505 bp, and an average GC content of 49.92%. A total of 62,337 unigenes were successfully annotated. Among the annotated unigenes, 27,025 (43.35%) and 23,232 (37.27%) unigenes showed significant similarity to known proteins in NCBI non-redundant protein and Swiss-Prot databases (E-value < 1.0E-5), respectively. Of these annotated unigenes, 14,439 and 15,813 unigenes were assigned to the Gene Ontology (GO) categories and EuKaryotic Ortholog Group (KOG) cluster, respectively. In addition, a total of 14,167 unigenes were assigned to 331 Kyoto Encyclopedia of Genes and Genomes (KEGG) pathways. Furthermore, 17,401 simple sequence repeats (SSRs) were identified as potential molecular markers. One hundred and fifteen primer pairs were randomly selected for amplification to detect polymorphisms. The results revealed that 110 primer pairs (95.65%) yielded PCR amplicons and 69 primer pairs (60.00%) presented polymorphisms in 35 individual buffaloes. A phylogenetic analysis showed that the five swamp buffalo populations were clustered together, whereas two river buffalo breeds clustered separately. In the present study, the Illumina RNA-seq technology was utilized to perform transcriptome analysis and SSR marker discovery in the swamp buffalo without using a reference genome. Our findings will enrich the current SSR marker resources and help spearhead molecular

  2. Genomic resources for water yam (Dioscorea alata L.): analyses of EST-Sequences, De Novo sequencing and GBS libraries

    Technology Transfer Automated Retrieval System (TEKTRAN)

    The reducing cost and rapid progress in next-generation sequencing techniques coupled with high performance computational approaches have resulted in large-scale discovery of advanced genomic resources such as SSRs, SNPs and InDels in several model and non-model plant species. Yam (Dioscorea spp.) i...

  3. De novo transcriptome sequencing and analysis of the juvenile and adult stages of Fasciola gigantica.

    PubMed

    Zhang, Xiao-Xuan; Cong, Wei; Elsheikha, Hany M; Liu, Guo-Hua; Ma, Jian-Gang; Huang, Wei-Yi; Zhao, Quan; Zhu, Xing-Quan

    2017-03-09

    Fasciola gigantica is regarded as the major liver fluke causing fasciolosis in livestock in tropical countries. Despite the significant economic and public health impacts of F. gigantica there are few studies on the pathogenesis of this parasite and our understanding is further limited by the lack of genome and transcriptome information. In this study, de novo Illumina RNA sequencing (RNA-seq) was performed to obtain a comprehensive transcriptome profile of the juvenile (42 days post infection) and adult stages of F. gigantica. A total of 49,720 unigenes were produced from juvenile and adult stages of F. gigantica, with an average length of 1286 nucleotides (nt) and N50 of 2076 nt. A total of 27,862 (56.03%) unigenes were annotated by BLAST similarity searches against the NCBI non-redundant protein database. Because F. gigantica needs to feed on and/or digest host tissues, some proteases (including cysteine proteases and aspartic proteases), which play a role in the degradation of host tissues (protein), received particular attention in the present study. A total of 6511 distinct genes were found differentially expressed between juveniles and adults, of which 3993 genes were up-regulated and 2518 genes were down-regulated in adults versus juveniles. Moreover, stage-specific differentially expressed genes were identified in juvenile (17,009) and adult (6517) F. gigantica. The significantly divergent pathways of differentially expressed genes included cAMP signaling pathway (226; 4.12%), proteoglycans in cancer (256; 4.67%) and focal adhesion (199; 3.63%). The transcription pattern also revealed two egg-laying-associated pathways: cGMP-PKG signaling pathway and TGF-β signaling pathway. This study provides the first comparative transcriptomic data concerning juvenile and adult stages of F. gigantica that will be of great value for future research efforts into understanding parasite pathogenesis and developing vaccines against this important parasite.

  4. The first complete chloroplast genome sequences of Ulmus species by de novo sequencing: Genome comparative and taxonomic position analysis.

    PubMed

    Zuo, Li-Hui; Shang, Ai-Qin; Zhang, Shuang; Yu, Xiao-Yue; Ren, Ya-Chao; Yang, Min-Sheng; Wang, Jin-Mao

    2017-01-01

    Elm (Ulmus) has a long history of use as a high-quality heavy hardwood famous for its resistance to drought, cold, and salt. It grows in temperate, warm temperate, and subtropical regions. This is the first report of Ulmaceae chloroplast genomes by de novo sequencing. The Ulmus chloroplast genomes exhibited a typical quadripartite structure with two single-copy regions (long single copy [LSC] and short single copy [SSC] sections) separated by a pair of inverted repeats (IRs). The lengths of the chloroplast genomes from five Ulmus ranged from 158,953 to 159,453 bp, with the largest observed in Ulmus davidiana and the smallest in Ulmus laciniata. The genomes contained 137-145 protein-coding genes, of which Ulmus davidiana var. japonica and U. davidiana had the most and U. pumila had the fewest. The five Ulmus species exhibited different evolutionary routes, as some genes had been lost. In total, 18 genes contained introns, 13 of which (trnL-TAA+, trnL-TAA-, rpoC1-, rpl2-, ndhA-, ycf1, rps12-, rps12+, trnA-TGC+, trnA-TGC-, trnV-TAC-, trnI-GAT+, and trnI-GAT) were shared among all five species. The intron of ycf1 was the longest (5,675bp) while that of trnF-AAA was the smallest (53bp). All Ulmus species except U. davidiana exhibited the same degree of amplification in the IR region. To determine the phylogenetic positions of the Ulmus species, we performed phylogenetic analyses using common protein-coding genes in chloroplast sequences of 42 other species published in NCBI. The cluster results showed the closest plants to Ulmaceae were Moraceae and Cannabaceae, followed by Rosaceae. Ulmaceae and Moraceae both belonged to Urticales, and the chloroplast genome clustering results were consistent with their traditional taxonomy. The results strongly supported the position of Ulmaceae as a member of the order Urticales. In addition, we found a potential error in the traditional taxonomies of U. davidiana and U. davidiana var. japonica, which should be confirmed with a

  5. The first complete chloroplast genome sequences of Ulmus species by de novo sequencing: Genome comparative and taxonomic position analysis

    PubMed Central

    Zhang, Shuang; Yu, Xiao-Yue; Ren, Ya-Chao; Yang, Min-Sheng; Wang, Jin-Mao

    2017-01-01

    Elm (Ulmus) has a long history of use as a high-quality heavy hardwood famous for its resistance to drought, cold, and salt. It grows in temperate, warm temperate, and subtropical regions. This is the first report of Ulmaceae chloroplast genomes by de novo sequencing. The Ulmus chloroplast genomes exhibited a typical quadripartite structure with two single-copy regions (long single copy [LSC] and short single copy [SSC] sections) separated by a pair of inverted repeats (IRs). The lengths of the chloroplast genomes from five Ulmus ranged from 158,953 to 159,453 bp, with the largest observed in Ulmus davidiana and the smallest in Ulmus laciniata. The genomes contained 137–145 protein-coding genes, of which Ulmus davidiana var. japonica and U. davidiana had the most and U. pumila had the fewest. The five Ulmus species exhibited different evolutionary routes, as some genes had been lost. In total, 18 genes contained introns, 13 of which (trnL-TAA+, trnL-TAA−, rpoC1-, rpl2-, ndhA-, ycf1, rps12-, rps12+, trnA-TGC+, trnA-TGC-, trnV-TAC-, trnI-GAT+, and trnI-GAT) were shared among all five species. The intron of ycf1 was the longest (5,675bp) while that of trnF-AAA was the smallest (53bp). All Ulmus species except U. davidiana exhibited the same degree of amplification in the IR region. To determine the phylogenetic positions of the Ulmus species, we performed phylogenetic analyses using common protein-coding genes in chloroplast sequences of 42 other species published in NCBI. The cluster results showed the closest plants to Ulmaceae were Moraceae and Cannabaceae, followed by Rosaceae. Ulmaceae and Moraceae both belonged to Urticales, and the chloroplast genome clustering results were consistent with their traditional taxonomy. The results strongly supported the position of Ulmaceae as a member of the order Urticales. In addition, we found a potential error in the traditional taxonomies of U. davidiana and U. davidiana var. japonica, which should be confirmed with a

  6. MetaVelvet: An Extension of Velvet Assembler to de novo Metagenome Assembly from Short Sequence Reads (Metagenomics Informatics Challenges Workshop: 10K Genomes at a Time)

    ScienceCinema

    Sakakibara, Yasubumi [Keio University]

    2016-07-12

    Keio University's Yasubumi Sakakibara on "MetaVelvet: An Extension of Velvet Assembler to de novo Metagenome Assembly from Short Sequence Reads" at the Metagenomics Informatics Challenges Workshop held at the DOE JGI on October 12-13, 2011.

  7. Rapid Microsatellite Isolation from a Butterfly by De Novo Transcriptome Sequencing: Performance and a Comparison with AFLP-Derived Distances

    PubMed Central

    Mikheyev, Alexander S.; Vo, Tanya; Wee, Brian; Singer, Michael C.; Parmesan, Camille

    2010-01-01

    Background The isolation of microsatellite markers remains laborious and expensive. For some taxa, such as Lepidoptera, development of microsatellite markers has been particularly difficult, as many markers appear to be located in repetitive DNA and have nearly identical flanking regions. We attempted to circumvent this problem by bioinformatic mining of microsatellite sequences from a de novo-sequenced transcriptome of a butterfly (Euphydryas editha). Principal Findings By searching the assembled sequence data for perfect microsatellite repeats we found 10 polymorphic loci. Although, like many expressed sequence tag-derived microsatellites, our markers show strong deviations from Hardy-Weinberg equilibrium in many populations, and, in some cases, a high incidence of null alleles, we show that they nonetheless provide measures of population differentiation consistent with those obtained by amplified fragment length polymorphism analysis. Estimates of pairwise population differentiation between 23 populations were concordant between microsatellite-derived data and AFLP analysis of the same samples (r = 0.71, p<0.00001, 425 individuals from 23 populations). Significance De novo transcriptional sequencing appears to be a rapid and cost-effective tool for developing microsatellite markers for difficult genomes. PMID:20585453
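    Searching assembled sequence for perfect microsatellite repeats, as described above, can be sketched with a regular expression that looks for a short motif repeated back-to-back. The motif size range (2 to 6 bp), the minimum of five repeats, the function name, and the toy sequence are assumptions chosen for illustration; the record does not state the search parameters the authors used.

```python
import re


def find_perfect_ssrs(sequence, min_repeats=5, min_motif=2, max_motif=6):
    """Return (start, motif, repeat_count) for perfect tandem repeats.

    The lazy quantifier tries the shortest motif first, so a run such as
    ACACACACAC is reported with motif AC rather than ACAC.
    """
    pattern = re.compile(
        r"([ACGT]{%d,%d}?)\1{%d,}" % (min_motif, max_motif, min_repeats - 1)
    )
    hits = []
    for match in pattern.finditer(sequence.upper()):
        motif = match.group(1)
        if len(set(motif)) == 1:
            continue  # skip homopolymer runs such as AAAAAAAAAA
        hits.append((match.start(), motif, len(match.group(0)) // len(motif)))
    return hits


# Hypothetical contig fragment containing one dinucleotide and one trinucleotide repeat.
toy_contig = "GGTT" + "AC" * 6 + "TTGTT" + "ACG" * 5 + "TT"
ssr_hits = find_perfect_ssrs(toy_contig)
print(ssr_hits)
```

    Loci found this way still need flanking sequence for primer design and laboratory testing for polymorphism, as the record notes for its 10 polymorphic loci.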

  8. Single-molecule sequencing and chromatin conformation capture enable de novo reference assembly of the domestic goat genome.

    PubMed

    Bickhart, Derek M; Rosen, Benjamin D; Koren, Sergey; Sayre, Brian L; Hastie, Alex R; Chan, Saki; Lee, Joyce; Lam, Ernest T; Liachko, Ivan; Sullivan, Shawn T; Burton, Joshua N; Huson, Heather J; Nystrom, John C; Kelley, Christy M; Hutchison, Jana L; Zhou, Yang; Sun, Jiajie; Crisà, Alessandra; Ponce de León, F Abel; Schwartz, John C; Hammond, John A; Waldbieser, Geoffrey C; Schroeder, Steven G; Liu, George E; Dunham, Maitreya J; Shendure, Jay; Sonstegard, Tad S; Phillippy, Adam M; Van Tassell, Curtis P; Smith, Timothy P L

    2017-04-01

    The decrease in sequencing cost and increased sophistication of assembly algorithms for short-read platforms has resulted in a sharp increase in the number of species with genome assemblies. However, these assemblies are highly fragmented, with many gaps, ambiguities, and errors, impeding downstream applications. We demonstrate current state of the art for de novo assembly using the domestic goat (Capra hircus) based on long reads for contig formation, short reads for consensus validation, and scaffolding by optical and chromatin interaction mapping. These combined technologies produced what is, to our knowledge, the most continuous de novo mammalian assembly to date, with chromosome-length scaffolds and only 649 gaps. Our assembly represents a ∼400-fold improvement in continuity due to properly assembled gaps, compared to the previously published C. hircus assembly, and better resolves repetitive structures longer than 1 kb, representing the largest repeat family and immune gene complex yet produced for an individual of a ruminant species.

  9. Identification of a De Novo Heterozygous Missense FLNB Mutation in Lethal Atelosteogenesis Type I by Exome Sequencing

    PubMed Central

    Jeon, Ga Won; Lee, Mi-Na; Jung, Ji Mi; Hong, Seong Yeon; Kim, Young Nam; Sin, Jong Beom

    2014-01-01

    Background Atelosteogenesis type I (AO-I) is a rare lethal skeletal dysplastic disorder characterized by severe short-limbed dwarfism and dislocated hips, knees, and elbows. AO-I is caused by mutations in the filamin B (FLNB) gene; however, several other genes can cause AO-like lethal skeletal dysplasias. Methods In order to screen all possible genes associated with AO-like lethal skeletal dysplasias simultaneously, we performed whole-exome sequencing in a female newborn having clinical features of AO-I. Results Exome sequencing identified a novel missense variant (c.517G>A; p.Ala173Thr) in exon 2 of the FLNB gene in the patient. Sanger sequencing validated this variant, and genetic analysis of the patient's parents suggested a de novo occurrence of the variant. Conclusions This study shows that exome sequencing can be a useful tool for the identification of causative mutations in lethal skeletal dysplasia patients. PMID:24624349

  10. Somatic mutations and germline sequence variants in the expressed tyrosine kinase genes of patients with de novo acute myeloid leukemia

    PubMed Central

    Xiang, Zhifu; Walgren, Richard; Zhao, Yu; Kasai, Yumi; Miner, Tracie; Ries, Rhonda E.; Lubman, Olga; Fremont, Daved H.; McLellan, Michael D.; Payton, Jacqueline E.; Westervelt, Peter; DiPersio, John F.; Link, Daniel C.; Walter, Matthew J.; Graubert, Timothy A.; Watson, Mark; Baty, Jack; Heath, Sharon; Shannon, William D.; Nagarajan, Rakesh; Bloomfield, Clara D.; Mardis, Elaine R.; Wilson, Richard K.; Ley, Timothy J.

    2008-01-01

    Activating mutations in tyrosine kinase (TK) genes (eg, FLT3 and KIT) are found in more than 30% of patients with de novo acute myeloid leukemia (AML); many groups have speculated that mutations in other TK genes may be present in the remaining 70%. We performed high-throughput resequencing of the kinase domains of 26 TK genes (11 receptor TK; 15 cytoplasmic TK) expressed in most AML patients using genomic DNA from the bone marrow (tumor) and matched skin biopsy samples (“germline”) from 94 patients with de novo AML; sequence variants were validated in an additional 94 AML tumor samples (14.3 million base pairs of sequence were obtained and analyzed). We identified known somatic mutations in FLT3, KIT, and JAK2 TK genes at the expected frequencies and found 4 novel somatic mutations, JAK1 V623A, JAK1 T478S, DDR1 A803V, and NTRK1 S677N, once each in 4 respective patients of 188 tested. We also identified novel germline sequence changes encoding amino acid substitutions (ie, nonsynonymous changes) in 14 TK genes, including TYK2, which had the largest number of nonsynonymous sequence variants (11 total detected). Additional studies will be required to define the roles that these somatic and germline TK gene variants play in AML pathogenesis. PMID:18270328

  11. De Novo Designed Proteins from a Library of Artificial Sequences Function in Escherichia Coli and Enable Cell Growth

    PubMed Central

    Fisher, Michael A.; McKinley, Kara L.; Bradley, Luke H.; Viola, Sara R.; Hecht, Michael H.

    2011-01-01

    A central challenge of synthetic biology is to enable the growth of living systems using parts that are not derived from nature, but designed and synthesized in the laboratory. As an initial step toward achieving this goal, we probed the ability of a collection of >10^6 de novo designed proteins to provide biological functions necessary to sustain cell growth. Our collection of proteins was drawn from a combinatorial library of 102-residue sequences, designed by binary patterning of polar and nonpolar residues to fold into stable 4-helix bundles. We probed the capacity of proteins from this library to function in vivo by testing their abilities to rescue 27 different knockout strains of Escherichia coli, each deleted for a conditionally essential gene. Four different strains – ΔserB, ΔgltA, ΔilvA, and Δfes – were rescued by specific sequences from our library. Further experiments demonstrated that a strain simultaneously deleted for all four genes was rescued by co-expression of four novel sequences. Thus, cells deleted for ∼0.1% of the E. coli genome (and ∼1% of the genes required for growth under nutrient-poor conditions) can be sustained by sequences designed de novo. PMID:21245923

  12. Sequencing and de novo Analysis of Crassostrea angulata (Fujian Oyster) from 8 Different Developing Phases Using 454 GSFlx

    PubMed Central

    Chen, Jun; Zou, Quan; You, Weiwei; Ke, Caihuan

    2012-01-01

    Research on the mechanism for early development of shellfish, such as body plan, shell formation, settlement and metamorphosis is currently an active research field. However, studies were still limited and not deep enough because of the lack of genomic resources such as genome or transcriptome sequences. In the present research, de novo transcriptome sequencing was performed for Crassostrea angulata, the most economically important cultured oyster species in China, at eight early developmental stages using the 454 sequencing technology. A total of 555,215 reads were produced with an average length of 309 nucleotides that were then assembled into 10,462 contigs. As determined by GO annotation and KEGG pathway mapping, functional annotation of the unigenes recovered diverse biological functions and processes. Six unique sequences related to settlement, metamorphosis and growth were subsequently analyzed by real-time PCR. Given the lack of whole genome information for oysters, transcriptome and de novo analysis of C. angulata from the eight different developing phases will provide important and useful information on early development mechanism and help genetic breeding of shellfish. PMID:22952730

  13. PERGA: a paired-end read guided de novo assembler for extending contigs using SVM and look ahead approach.

    PubMed

    Zhu, Xiao; Leung, Henry C M; Chin, Francis Y L; Yiu, Siu Ming; Quan, Guangri; Liu, Bo; Wang, Yadong

    2014-01-01

    Since the read lengths of high throughput sequencing (HTS) technologies are short, de novo assembly, which plays significant roles in many applications, remains a great challenge. Most of the state-of-the-art approaches are based on the de Bruijn graph strategy or the overlap-layout strategy. However, these approaches, which depend on k-mers or read overlaps, do not fully utilize information from paired-end and single-end reads when resolving branches. Since they treat all single-end reads with an overlap length larger than a fixed threshold equally, they fail to favor the more reliable long-overlap reads during assembly and mix them with reads that have relatively short overlaps. Moreover, these approaches have not been specially designed for handling tandem repeats (repeats that occur adjacently in the genome), and they usually break contigs near the tandem repeats. We present PERGA (Paired-End Reads Guided Assembler), a novel sequence-reads-guided de novo assembly approach, which adopts a greedy-like prediction strategy for assembling reads into contigs and scaffolds using paired-end reads and different read overlap sizes ranging from Omax to Omin to resolve the gaps and branches. By constructing a decision model using a machine learning approach based on branch features, PERGA can determine the correct extension in 99.7% of cases. When the correct extension cannot be determined, PERGA will try to extend the contig by all feasible extensions and determine the correct extension by using a look-ahead approach. Many hard-to-resolve branches are due to tandem repeats that are close together in the genome. PERGA detects such different copies of the repeats to resolve the branches and make the extension much longer and more accurate. We evaluated PERGA on both real and simulated Illumina datasets ranging from small bacterial genomes to a large human chromosome, and it constructed longer and more accurate contigs and scaffolds than other state-of-the-art assemblers. PERGA can be freely downloaded at https://github.com/hitbio/PERGA.
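    The idea of extending a contig greedily while trying overlap sizes from Omax down to Omin can be illustrated with a toy extender. This sketch is not PERGA's algorithm: it uses no paired-end information, no SVM decision model, and no look-ahead, and it simply stops when reads disagree on the next base. The function name, parameters, and reads are hypothetical.

```python
def extend_contig(seed, reads, o_max, o_min, max_length=10_000):
    """Extend `seed` to the right one base at a time.

    For each step, the longest usable overlap is tried first. If all reads
    containing the contig suffix agree on the following base, that base is
    appended. If they disagree (a branch), extension stops.
    """
    contig = seed
    while len(contig) < max_length:
        next_base = None
        for overlap in range(min(o_max, len(contig)), o_min - 1, -1):
            suffix = contig[-overlap:]
            votes = set()
            for read in reads:
                # Collect the base following every occurrence of the suffix.
                start = read.find(suffix)
                while start != -1:
                    end = start + overlap
                    if end < len(read):
                        votes.add(read[end])
                    start = read.find(suffix, start + 1)
            if len(votes) == 1:
                next_base = votes.pop()
                break
            if len(votes) > 1:
                break  # branch: a real assembler would need extra evidence here
        if next_base is None:
            break
        contig += next_base
    return contig


# Hypothetical 8 bp reads tiled every 2 bp across a 16 bp sequence.
target = "ATGGCGTACCTTGAAC"
toy_reads = [target[i:i + 8] for i in range(0, 9, 2)]
assembled = extend_contig(toy_reads[0], toy_reads, o_max=6, o_min=4)
print(assembled)
```

    Trying long overlaps first reflects the point made in the abstract that long overlaps are more trustworthy than short ones. Shorter overlaps are only consulted when no read supports the longer suffix.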

  14. Restriction site associated DNA (RAD) for de novo sequencing and marker discovery in sugarcane borer, Diatraea saccharalis Fab. (Lepidoptera: Crambidae).

    PubMed

    Pavinato, V A C; Margarido, G R A; Wijeratne, A J; Wijeratne, S; Meulia, T; Souza, A P; Michel, A P; Zucchi, M I

    2016-08-30

    We present the development of a genomic library using RADseq (restriction site associated DNA sequencing) protocol for marker discovery that can be applied on evolutionary studies of the sugarcane borer Diatraea saccharalis, an important South American insect pest. A RADtag protocol combined with Illumina paired-end sequencing allowed de novo discovery of 12 811 SNPs and a high-quality assembly of 122.8M paired-end reads from six individuals, representing 40 Gb of sequencing data. Approximately 1.7 Mb of the sugarcane borer genome distributed over 5289 minicontigs were obtained upon assembly of second reads from first reads RADtag loci where at least one SNP was discovered and genotyped. Minicontig lengths ranged from 200 to 611 bp and were used for functional annotation and microsatellite discovery. These markers will be used in future studies to understand gene flow and adaptation to host plants and control tactics.

  15. De novo sequencing and comparative analysis of the blueberry transcriptome to discover putative genes related to antioxidants.

    PubMed

    Li, Xiaoyan; Sun, Haiyue; Pei, Jiabo; Dong, Yuanyuan; Wang, Fawei; Chen, Huan; Sun, Yepeng; Wang, Nan; Li, Haiyan; Li, Yadong

    2012-12-10

    Blueberry (Vaccinium spp.) is an important small fruit crop rich in antioxidants. However, tissue-specific transcriptome and genomic data in public databases are not sufficient for an understanding of the molecular mechanisms associated with antioxidants, especially the biosynthesis of anthocyanins. Here, we obtained more than 64 million sequencing reads from blueberry skin and pulp using Illumina sequencing technology. De novo assemblies yielded 34,464 unigenes, among which 1236 transcripts and 862 putative transcription factors involved in the main antioxidant biosynthesis pathway were identified. Comparative transcript profiling allowed the identification of 92 differentially expressed genes with potential relevance in regulating fruit metabolism and anthocyanin content during ripening. A series of qRT-PCR experiments confirmed the high expression level of the anthocyanin pathway genes in the skin of the blue fruit predicted by the in silico study. This sequence collection provides a significant resource for blueberry research and breeding work.

  16. Whole Genome Sequencing Reveals a De Novo SHANK3 Mutation in Familial Autism Spectrum Disorder

    PubMed Central

    Nemirovsky, Sergio I.; Córdoba, Marta; Zaiat, Jonathan J.; Completa, Sabrina P.; Vega, Patricia A.; González-Morón, Dolores; Medina, Nancy M.; Fabbro, Mónica; Romero, Soledad; Brun, Bianca; Revale, Santiago; Ogara, María Florencia; Pecci, Adali; Marti, Marcelo; Vazquez, Martin; Turjanski, Adrián; Kauffman, Marcelo A.

    2015-01-01

    Introduction: Clinical genomics promise to be especially suitable for the study of etiologically heterogeneous conditions such as Autism Spectrum Disorder (ASD). Here we present three siblings with ASD where we evaluated the usefulness of Whole Genome Sequencing (WGS) for the diagnostic approach to ASD. Methods: We identified a family segregating ASD in three siblings with an unidentified cause. We performed WGS in the three probands and used a state-of-the-art comprehensive bioinformatic analysis pipeline and prioritized the identified variants located in genes likely to be related to ASD. We validated the finding by Sanger sequencing in the probands and their parents. Results: Three male siblings presented a syndrome characterized by severe intellectual disability, absence of language, autism spectrum symptoms and epilepsy with negative family history for mental retardation, language disorders, ASD or other psychiatric disorders. We found germline mosaicism for a heterozygous deletion of a cytosine in the exon 21 of the SHANK3 gene, resulting in a missense sequence of 5 codons followed by a premature stop codon (NM_033517:c.3259_3259delC, p.Ser1088Profs*6). Conclusions: We reported an infrequent form of familial ASD where WGS proved useful in the clinic. We identified a mutation in SHANK3 that underscores its relevance in Autism Spectrum Disorder. PMID:25646853

  17. A Quantitative Tool to Distinguish Isobaric Leucine and Isoleucine Residues for Mass Spectrometry-Based De Novo Monoclonal Antibody Sequencing

    NASA Astrophysics Data System (ADS)

    Poston, Chloe N.; Higgs, Richard E.; You, Jinsam; Gelfanova, Valentina; Hale, John E.; Knierman, Michael D.; Siegel, Robert; Gutierrez, Jesus A.

    2014-07-01

    De novo sequencing by mass spectrometry (MS) allows for the determination of the complete amino acid (AA) sequence of a given protein based on the mass difference of detected ions from MS/MS fragmentation spectra. The technique relies on obtaining specific masses that can be attributed to characteristic theoretical masses of AAs. A major limitation of de novo sequencing by MS is the inability to distinguish between the isobaric residues leucine (Leu) and isoleucine (Ile). Incorrect identification of Ile as Leu or vice versa often results in loss of activity in recombinant antibodies. This functional ambiguity is commonly resolved with costly and time-consuming AA mutation and peptide sequencing experiments. Here, we describe a set of orthogonal biochemical protocols, which experimentally determine the identity of Ile or Leu residues in monoclonal antibodies (mAb) based on the selectivity that leucine aminopeptidase shows for N-terminal Leu residues and the cleavage preference for Leu by chymotrypsin. The resulting observations are combined with germline frequencies and incorporated into a logistic regression model, called Predictor for Xle Sites (PXleS), to provide a statistical likelihood for the identity of Leu at an ambiguous site. We demonstrate that PXleS can generate a probability for an Xle site in mAbs with 96% accuracy. The implementation of PXleS precludes the expression of several possible sequences and, therefore, reduces the overall time and resources required to go from spectra generation to a biologically active sequence for a mAb when an Ile or Leu residue is in question.
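
    The mass-difference principle, and the reason Leu and Ile remain ambiguous, can be shown with a small sketch. It assumes an idealized, complete ladder of singly charged fragment ions of one type and uses standard monoisotopic residue masses for a handful of amino acids; the function name and the tolerance are illustrative choices and are not part of PXleS.

```python
# Monoisotopic residue masses in daltons for a few amino acids.
# Leu and Ile have the same elemental composition and therefore the same mass,
# so a mass difference alone can only report the ambiguous "L/I" (Xle) call.
RESIDUE_MASS = {
    "G": 57.02146,
    "A": 71.03711,
    "S": 87.03203,
    "P": 97.05276,
    "V": 99.06841,
    "T": 101.04768,
    "L/I": 113.08406,
    "F": 147.06841,
}


def read_ladder(ion_masses, tol=0.01):
    """Call one residue per gap between consecutive fragment ions.

    `ion_masses` is an idealized ladder of same-type fragment ions;
    `tol` is a hypothetical mass tolerance in daltons.
    Gaps that match no residue in the table are reported as "?".
    """
    ions = sorted(ion_masses)
    calls = []
    for lighter, heavier in zip(ions, ions[1:]):
        delta = heavier - lighter
        matches = [aa for aa, mass in RESIDUE_MASS.items() if abs(mass - delta) <= tol]
        calls.append(matches[0] if matches else "?")
    return calls


if __name__ == "__main__":
    start = 100.0  # arbitrary offset for the first ion of the ladder
    ladder = [start]
    for residue in ("G", "A", "L/I"):
        ladder.append(ladder[-1] + RESIDUE_MASS[residue])
    print(read_ladder(ladder))
```

    The last call in the example comes back as "L/I" rather than a single residue, which is the ambiguity that the biochemical protocols and the logistic regression model described above are meant to resolve.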

  18. Pseudo-De Novo Assembly and Analysis of Unmapped Genome Sequence Reads in Wild Zebrafish Reveal Novel Gene Content.

    PubMed

    Faber-Hammond, Joshua J; Brown, Kim H

    2016-04-01

    Zebrafish represents the third vertebrate with an officially completed genome, yet the genome remains incomplete: additions and corrections are continuing, and the current release, GRCz10, leaves 13% of zebrafish cDNA sequences unmapped. This disparity may result from population differences, given that the genome reference was generated from clonal individuals with limited genetic diversity. This is supported by the recent analysis of a single wild zebrafish, which identified over 5.2 million SNPs and 1.6 million in/dels in the previous genome build, zv9. Re-examination of this sequence data set indicated that 13.8% of quality sequence reads failed to align to GRCz10. Using a novel bioinformatics de novo assembly pipeline on these unmappable reads, we identified 1,514,491 novel contigs covering ∼224 Mb of genomic sequence. Among these, 1083 contigs were found to contain a potential gene coding sequence. RNA-seq data comparison confirmed that 362 contigs contained a transcribed DNA sequence, suggesting that a large amount of functional genomic sequence remains unannotated in the zebrafish reference genome. By utilizing the bioinformatics pipeline developed in this study, the zebrafish genome will be bolstered as a model for human disease research. Adaptation of the pipeline described here also offers a cost-efficient and effective method to identify and map novel genetic content across any genome and will ultimately aid in the completion of additional genomes for a broad range of species.
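
    The first step of such a workflow, collecting the reads that failed to align, might look like the sketch below. It assumes the alignments are available as SAM text, which the abstract does not state; the function names are hypothetical. In the SAM format, bit 0x4 of the FLAG field marks a read as unmapped.

```python
def unmapped_read_names(sam_lines):
    """Return the names of reads whose SAM FLAG has the 0x4 (unmapped) bit set."""
    names = []
    for line in sam_lines:
        if line.startswith("@"):
            continue  # header line, not an alignment record
        fields = line.rstrip("\n").split("\t")
        flag = int(fields[1])  # FLAG is the second mandatory SAM column
        if flag & 0x4:
            names.append(fields[0])
    return names


def unmapped_fraction(sam_lines):
    """Fraction of alignment records that are flagged as unmapped."""
    records = [line for line in sam_lines if not line.startswith("@")]
    if not records:
        return 0.0
    return len(unmapped_read_names(records)) / len(records)


if __name__ == "__main__":
    sam = [
        "@HD\tVN:1.6",
        "r1\t0\tchr1\t100\t60\t4M\t*\t0\t0\tACGT\tIIII",
        "r2\t4\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII",
    ]
    print(unmapped_read_names(sam), unmapped_fraction(sam))
```

    The selected reads would then be passed to a de novo assembler; that step is not sketched here because the abstract does not describe it.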

  19. Rapid 'de novo' peptide sequencing by a combination of nanoelectrospray, isotopic labeling and a quadrupole/time-of-flight mass spectrometer.

    PubMed

    Shevchenko, A; Chernushevich, I; Ens, W; Standing, K G; Thomson, B; Wilm, M; Mann, M

    1997-01-01

    Protein microanalysis usually involves the sequencing of gel-separated proteins available in very small amounts. While mass spectrometry has become the method of choice for identifying proteins in databases, in almost all laboratories 'de novo' protein sequencing is still performed by Edman degradation. Here we show that a combination of the nanoelectrospray ion source, isotopic end labeling of peptides and a quadrupole/time-of-flight instrument allows facile read-out of the sequences of tryptic peptides. Isotopic labeling was performed by enzymatic digestion of proteins in 1:1 16O/18O water, eliminating the need for peptide derivatization. A quadrupole/time-of-flight mass spectrometer was constructed from a triple quadrupole and an electrospray time-of-flight instrument. Tandem mass spectra of peptides were obtained with better than 50 ppm mass accuracy and resolution routinely in excess of 5000. Unique and error tolerant identification of yeast proteins as well as the sequencing of a novel protein illustrate the potential of the approach. The high data quality in tandem mass spectra and the additional information provided by the isotopic end labeling of peptides enabled automated interpretation of the spectra via simple software algorithms. The technique demonstrated here removes one of the last obstacles to routine and high throughput protein sequencing by mass spectrometry.
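
    For readers unfamiliar with the two figures of merit quoted above, the sketch below computes mass error in parts per million and resolving power. The resolution formula assumes the common m/Δm definition with Δm taken as the peak width at half maximum; the abstract does not say which definition the authors used, and the function names are illustrative.

```python
def ppm_error(observed_mz, theoretical_mz):
    """Signed mass error in parts per million."""
    return (observed_mz - theoretical_mz) / theoretical_mz * 1e6


def resolution(mz, peak_width_fwhm):
    """Resolving power m / delta-m, with delta-m as the peak width at half maximum (assumed)."""
    return mz / peak_width_fwhm


if __name__ == "__main__":
    # An ion expected at m/z 1000.00 but observed at m/z 1000.05 is about 50 ppm off.
    print(round(ppm_error(1000.05, 1000.0), 3))
    # A peak at m/z 1000 that is 0.2 wide at half maximum has a resolution of about 5000.
    print(round(resolution(1000.0, 0.2), 3))
```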

  20. Knowledge-based approach to de novo design using reaction vectors.

    PubMed

    Patel, Hina; Bodkin, Michael J; Chen, Beining; Gillet, Valerie J

    2009-05-01

    A knowledge-based approach to the de novo design of synthetically feasible molecules is described. The method is based on reaction vectors which represent the structural changes that take place at the reaction center along with the environment in which the reaction occurs. The reaction vectors are derived automatically from a database of reactions which is not restricted by size or reaction complexity. A structure generation algorithm has been developed whereby reaction vectors can be applied to previously unseen starting materials in order to suggest novel syntheses. The approach has been implemented in KNIME and is validated by reproducing known synthetic routes. We then present applications of the method in different drug design scenarios including lead optimization and library enumeration. The method offers great potential for capturing and using the growing body of data on reactions that is becoming available through electronic laboratory notebooks.

  1. De Novo Transcriptome Sequencing of Desert Herbaceous Achnatherum splendens (Achnatherum) Seedlings and Identification of Salt Tolerance Genes

    PubMed Central

    Liu, Jiangtao; Zhou, Yuelong; Luo, Changxin; Xiang, Yun; An, Lizhe

    2016-01-01

    Achnatherum splendens is an important forage herb in Northwestern China. It has a high tolerance to salinity and is, thus, considered one of the most important constructive plants in saline and alkaline areas of land in Northwest China. However, the mechanisms of salt stress tolerance in A. splendens remain unknown. Next-generation sequencing (NGS) technologies can be used for global gene expression profiling. In this study, we examined sequence and transcript abundance data for the root/leaf transcriptome of A. splendens obtained using an Illumina HiSeq 2500. Over 35 million clean reads were obtained from the leaf and root libraries. All of the RNA sequencing (RNA-seq) reads were assembled de novo into a total of 126,235 unigenes and 36,511 coding DNA sequences (CDS). We further identified 1663 differentially-expressed genes (DEGs) between the salt stress treatment and control. Functional annotation of the DEGs by gene ontology (GO), using Arabidopsis and rice as references, revealed enrichment of salt stress-related GO categories, including “oxidation reduction”, “transcription factor activity”, and “ion channel transporter”. Thus, this global transcriptome analysis of A. splendens has provided an important genetic resource for the study of salt tolerance in this halophyte. The identified sequences and their putative functional data will facilitate future investigations of the tolerance of Achnatherum species to various types of abiotic stress. PMID:27023614

  2. Transcriptome Sequencing, De Novo Assembly and Differential Gene Expression Analysis of the Early Development of Acipenser baeri

    PubMed Central

    Song, Wei; Jiang, Keji; Zhang, Fengying; Lin, Yu; Ma, Lingbo

    2015-01-01

    The molecular mechanisms that drive the development of the endangered fossil fish species Acipenser baeri are difficult to study due to the lack of genomic data. Recent advances in sequencing technologies and the falling cost of sequencing offer exceptional opportunities for exploring important molecular mechanisms underlying specific biological processes. This manuscript describes the large-scale sequencing and analysis of mRNA from Acipenser baeri collected at five developmental time points using the Illumina HiSeq 2000 platform. The sequencing reads were de novo assembled and clustered into 278167 unigenes, of which 57346 (20.62%) had 45837 known homologous proteins in Uniprot protein databases while 11509 proteins matched with at least one sequence of assembled unigenes. The remaining 79.38% of unigenes could represent non-coding unigenes or unigenes specific to A. baeri. A total of 43062 unigenes were annotated into functional categories via Gene Ontology (GO) annotation, whereas 29526 unigenes were associated with 329 pathways by mapping to the KEGG database. Subsequently, 3479 differentially expressed genes were detected across developmental stages and clustered into 50 gene expression profiles. Genes preferentially expressed at each stage were also identified. Through GO and KEGG pathway enrichment analysis, relevant physiological variations during the early development of A. baeri could be better understood. Accordingly, the present study gives insights into the transcriptome profile of the early development of A. baeri, and the information contained in this large-scale transcriptome will provide substantial references for A. baeri developmental biology and promote its aquaculture research. PMID:26359664

  3. Complete genome sequence of novel carbon monoxide oxidizing bacteria Citrobacter amalonaticus Y19, assembled de novo.

    PubMed

    Ainala, Satish Kumar; Seol, Eunhee; Park, Sunghoon

    2015-10-10

    We report here the complete genome sequence of Citrobacter amalonaticus Y19 isolated from an anaerobic digester. PacBio single-molecule real-time (SMRT) sequencing was employed, resulting in a single scaffold of 5.58 Mb. The sequence of a megaplasmid of 291 kb is also presented.

  4. Single-Cell RNA Sequencing Reveals T Helper Cells Synthesizing Steroids De Novo to Contribute to Immune Homeostasis

    PubMed Central

    Mahata, Bidesh; Zhang, Xiuwei; Kolodziejczyk, Aleksandra A.; Proserpio, Valentina; Haim-Vilmovsky, Liora; Taylor, Angela E.; Hebenstreit, Daniel; Dingler, Felix A.; Moignard, Victoria; Göttgens, Berthold; Arlt, Wiebke; McKenzie, Andrew N.J.; Teichmann, Sarah A.

    2014-01-01

    Summary: T helper 2 (Th2) cells regulate helminth infections, allergic disorders, tumor immunity, and pregnancy by secreting various cytokines. It is likely that there are undiscovered Th2 signaling molecules. Although steroids are known to be immunoregulators, de novo steroid production from immune cells has not been previously characterized. Here, we demonstrate production of the steroid pregnenolone by Th2 cells in vitro and in vivo in a helminth infection model. Single-cell RNA sequencing and quantitative PCR analysis suggest that pregnenolone synthesis in Th2 cells is related to immunosuppression. In support of this, we show that pregnenolone inhibits Th cell proliferation and B cell immunoglobulin class switching. We also show that steroidogenic Th2 cells inhibit Th cell proliferation in a Cyp11a1 enzyme-dependent manner. We propose pregnenolone as a “lymphosteroid,” a steroid produced by lymphocytes. We speculate that this de novo steroid production may be an intrinsic phenomenon of Th2-mediated immune responses to actively restore immune homeostasis. PMID:24813893

  5. A Statistical Approach for Ambiguous Sequence Mappings

    Technology Transfer Automated Retrieval System (TEKTRAN)

    When attempting to map RNA sequences to a reference genome, high percentages of short sequence reads are often assigned to multiple genomic locations. One approach to handling these “ambiguous mappings” has been to discard them. This results in a loss of data, which can sometimes be as much as 45% o...

  6. Intellectual disability and non-compaction cardiomyopathy with a de novo NONO mutation identified by exome sequencing.

    PubMed

    Reinstein, Eyal; Tzur, Shay; Cohen, Rony; Bormans, Concetta; Behar, Doron M

    2016-11-01

    Pathogenic variants in the NONO gene have recently been implicated in X-linked intellectual disability syndrome. This observation has been supported by studies of NONO-deficient mice showing that NONO has an important role in regulating inhibitory synaptic activity. Thus far, the phenotypic spectrum of affected patients remains limited. We applied whole exome sequencing to members of a family in which the proband presented with a complex phenotype consisting of developmental delay, dysmorphism, and non-compaction cardiomyopathy. Exome analysis identified a novel de novo splice-site variant c.1171+1G>T in exon 11 of the NONO gene that is suspected to abolish the donor splicing site. Thus, we propose that the phenotypic spectrum of NONO-related disorder is much broader than described and that pathogenic variants in NONO cause a recognizable phenotype.

  7. Highly efficient de novo mutant identification in a sorghum bicolor tilling population using the ComSeq approach

    Technology Transfer Automated Retrieval System (TEKTRAN)

    Screening large populations for carriers of known or de novo rare SNPs is required both in Targeting induced local lesions IN genomes (TILLING) experiments in plants and analogously in screening human populations. We formerly suggested an approach that combines the celebrated mathematical field of c...

  8. The genome sequence of the outbreeding globe artichoke constructed de novo incorporating a phase-aware low-pass sequencing strategy of F1 progeny

    PubMed Central

    Scaglione, Davide; Reyes-Chin-Wo, Sebastian; Acquadro, Alberto; Froenicke, Lutz; Portis, Ezio; Beitel, Christopher; Tirone, Matteo; Mauro, Rosario; Lo Monaco, Antonino; Mauromicale, Giovanni; Faccioli, Primetta; Cattivelli, Luigi; Rieseberg, Loren; Michelmore, Richard; Lanteri, Sergio

    2016-01-01

    Globe artichoke (Cynara cardunculus var. scolymus) is an out-crossing, perennial, multi-use crop species that is grown worldwide and belongs to the Compositae, one of the most successful Angiosperm families. We describe the first genome sequence of globe artichoke. The assembly, comprising 13,588 scaffolds covering 725 of the 1,084 Mb genome, was generated using ~133-fold Illumina sequencing data and encodes 26,889 predicted genes. Re-sequencing (30×) of globe artichoke and cultivated cardoon (C. cardunculus var. altilis) parental genotypes and low-coverage (0.5 to 1×) genotyping-by-sequencing of 163 F1 individuals resulted in 73% of the assembled genome being anchored in 2,178 genetic bins ordered along 17 chromosomal pseudomolecules. This was achieved using a novel pipeline, SOILoCo (Scaffold Ordering by Imputation with Low Coverage), to detect heterozygous regions and assign parental haplotypes with low sequencing read depth and of unknown phase. SOILoCo provides a powerful tool for de novo genome analysis of outcrossing species. Our data will enable genome-scale analyses of evolutionary processes among crops, weeds, and wild species within and beyond the Compositae, and will facilitate the identification of economically important genes from related species. PMID:26786968

  9. The genome sequence of the outbreeding globe artichoke constructed de novo incorporating a phase-aware low-pass sequencing strategy of F1 progeny.

    PubMed

    Scaglione, Davide; Reyes-Chin-Wo, Sebastian; Acquadro, Alberto; Froenicke, Lutz; Portis, Ezio; Beitel, Christopher; Tirone, Matteo; Mauro, Rosario; Lo Monaco, Antonino; Mauromicale, Giovanni; Faccioli, Primetta; Cattivelli, Luigi; Rieseberg, Loren; Michelmore, Richard; Lanteri, Sergio

    2016-01-20

    Globe artichoke (Cynara cardunculus var. scolymus) is an out-crossing, perennial, multi-use crop species that is grown worldwide and belongs to the Compositae, one of the most successful Angiosperm families. We describe the first genome sequence of globe artichoke. The assembly, comprising 13,588 scaffolds covering 725 of the 1,084 Mb genome, was generated using ~133-fold Illumina sequencing data and encodes 26,889 predicted genes. Re-sequencing (30×) of globe artichoke and cultivated cardoon (C. cardunculus var. altilis) parental genotypes and low-coverage (0.5 to 1×) genotyping-by-sequencing of 163 F1 individuals resulted in 73% of the assembled genome being anchored in 2,178 genetic bins ordered along 17 chromosomal pseudomolecules. This was achieved using a novel pipeline, SOILoCo (Scaffold Ordering by Imputation with Low Coverage), to detect heterozygous regions and assign parental haplotypes with low sequencing read depth and of unknown phase. SOILoCo provides a powerful tool for de novo genome analysis of outcrossing species. Our data will enable genome-scale analyses of evolutionary processes among crops, weeds, and wild species within and beyond the Compositae, and will facilitate the identification of economically important genes from related species.

  10. De novo Assembly and Characterization of the Global Transcriptome for Rhyacionia leptotubula Using Illumina Paired-End Sequencing

    PubMed Central

    Zhu, Jia-Ying; Li, Yong-He; Yang, Song; Li, Qin-Wen

    2013-01-01

    Background: The pine tip moth, Rhyacionia leptotubula (Lepidoptera: Tortricidae), is one of the most destructive forestry pests in Yunnan Province, China. Despite its importance, little is known about most aspects of this pest. Understanding its genetic information is essential for exploring its specific traits at the molecular level. Thus, we here sequenced the transcriptome of R. leptotubula with high-throughput Illumina sequencing. Methodology/Principal Findings: In a single run, more than 60 million sequencing reads were generated. De novo assembly was performed to generate a collection of 46,910 unigenes with a mean length of 642 bp. Based on a Blastx search with an E-value cut-off of 10^-5, 22,581 unigenes showed significant similarities to known proteins from the National Center for Biotechnology Information (NCBI) non-redundant (Nr) protein database. Of these annotated unigenes, 10,360, 6,937 and 13,894 were assigned to Gene Ontology (GO), Clusters of Orthologous Group (COG), and Kyoto Encyclopedia of Genes and Genomes (KEGG) databases, respectively. A total of 5,926 unigenes were annotated with domain similarity derived functional information, of which 55 and 39 unigenes encoded the insecticide resistance related enzymes cytochrome P450 and carboxylesterase, respectively. Using the transcriptome data, 47 unigenes belonging to the typical “stress” genes of the heat shock protein (Hsp) family were retrieved. Furthermore, 1,450 simple sequence repeats (SSRs) were detected; 3.09% of the unigenes contained SSRs. Large numbers of SSR primer pairs were designed, and 80% of the randomly verified primer pairs successfully yielded amplicons. Conclusions/Significance: A large number of putative R. leptotubula transcript sequences have been obtained from the deep sequencing, which extensively increases the comprehensive and integrated genomic resources of this pest. This large-scale transcriptome dataset will be an important information platform for promoting our

  11. De novo sequencing analysis of the Rosa roxburghii fruit transcriptome reveals putative ascorbate biosynthetic genes and EST-SSR markers.

    PubMed

    Yan, Xiuqin; Zhang, Xue; Lu, Min; He, Yong; An, Huaming

    2015-04-25

    Rosa roxburghii Tratt. is a well-known ornamental rose species native to China. In addition, the fruits of this species are valued for their nutritional and medicinal characteristics, especially their high ascorbic acid (AsA) levels. Nevertheless, AsA biosynthesis in R. roxburghii fruit has not been explored in detail because of a lack of genomic resources for this species. High-throughput transcriptomic sequencing generating large volumes of transcript sequence data can aid in gene discovery and molecular marker development. In this study, we generated more than 53 million clean reads using Illumina paired-end sequencing technology. De novo assembly yielded 106,590 unigenes, with an average length of 343 bp. On the basis of sequence similarity to known proteins, 9301 and 2393 unigenes were classified into Gene Ontology and Clusters of Orthologous Group categories, respectively. There were 7480 unigenes assigned to 124 pathways in the Kyoto Encyclopedia of Genes and Genomes pathway database. BLASTx searches identified 498 unique putative transcripts encoding various transcription factors, some known to regulate fruit development. qRT-PCR validated the expression of most of the genes encoding the main enzymes involved in ascorbate biosynthesis. In addition, 9131 potential simple sequence repeat (SSR) loci were identified among the unigenes. One hundred and two primer pairs were synthesized and 71 pairs produced an amplification product during initial screening. Among the amplified products, 30 were polymorphic in the 16 R. roxburghii germplasms tested. Our study was the first to produce a large volume of transcriptome data from R. roxburghii. The resulting sequence collection is a valuable resource for gene discovery and marker-assisted selective breeding in this rose species.
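
    A simple way to scan assembled unigenes for candidate SSR loci is a regular expression with a back-reference, sketched below. The abstract does not name the tool or the thresholds used in the study, so the motif length range, the minimum repeat count, and the function name are all assumptions for illustration.

```python
import re


def find_ssrs(seq, min_repeats=5, min_motif=2, max_motif=6):
    """Find perfect tandem repeats of short motifs in a DNA sequence.

    Returns a list of (start, motif, repeat_count) tuples. The thresholds are
    hypothetical defaults. Runs of a single base are skipped so that
    homopolymers are not reported as di- to hexanucleotide repeats.
    """
    # ([ACGT]{2,6}?) lazily captures the shortest candidate motif;
    # \1{n,} then requires at least n further consecutive copies of it.
    pattern = re.compile(
        r"([ACGT]{%d,%d}?)\1{%d,}" % (min_motif, max_motif, min_repeats - 1)
    )
    hits = []
    for match in pattern.finditer(seq.upper()):
        motif = match.group(1)
        if len(set(motif)) == 1:
            continue  # homopolymer run such as AAAAAAAAAA
        repeat_count = len(match.group(0)) // len(motif)
        hits.append((match.start(), motif, repeat_count))
    return hits


if __name__ == "__main__":
    unigene = "TTGC" + "AG" * 6 + "CCT"
    print(find_ssrs(unigene))
```

    Primer design and polymorphism screening, as described in the abstract, would follow on the flanking sequence of each hit and are outside the scope of this sketch.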

  12. De novo assembly and next-generation sequencing to analyse full-length gene variants from codon-barcoded libraries

    PubMed Central

    Cho, Namjin; Hwang, Byungjin; Yoon, Jung-ki; Park, Sangun; Lee, Joongoo; Seo, Han Na; Lee, Jeewon; Huh, Sunghoon; Chung, Jinsoo; Bang, Duhee

    2015-01-01

    Interpreting epistatic interactions is crucial for understanding evolutionary dynamics of complex genetic systems and unveiling structure and function of genetic pathways. Although high resolution mapping of en masse variant libraries enables molecular biologists to address genotype-phenotype relationships, long-read sequencing technology remains indispensable to assess functional relationships between mutations that lie far apart. Here, we introduce JigsawSeq for multiplexed sequence identification of pooled gene variant libraries by combining a codon-based molecular barcoding strategy and de novo assembly of short-read data. We first validated JigsawSeq on small sub-pools and observed high precision and recall at various experimental settings. With extensive simulations, we then applied JigsawSeq to large-scale gene variant libraries to show that our method can be reliably scaled using next-generation sequencing. JigsawSeq may serve as a rapid screening tool for functional genomics and offer the opportunity to explore evolutionary trajectories of protein variants. PMID:26387459

  13. High-Throughput Sequencing and De Novo Assembly of Red and Green Forms of the Perilla frutescens var. crispa Transcriptome

    PubMed Central

    Fukushima, Atsushi; Nakamura, Michimi; Suzuki, Hideyuki; Saito, Kazuki; Yamazaki, Mami

    2015-01-01

    Perilla frutescens var. crispa (Labiatae) has two chemo-varietal forms, i.e. red and green forms of perilla, that differ in the production of anthocyanins. To facilitate molecular biological and biochemical studies in perilla-specialized metabolism we used Illumina RNA-sequencing technology in our comprehensive comparison of the transcriptome map of the leaves of red and green forms of perilla. Sequencing generated over 1.2 billion short reads with an average length of 101 nt. De novo transcriptome assembly yielded 47,788 and 47,840 unigenes in the red and green forms of perilla plants, respectively. Comparison of the assembled unigenes and existing perilla cDNA sequences showed highly reliable alignment. All unigenes were annotated with gene ontology (GO) and Enzyme Commission numbers and entered into the Kyoto Encyclopedia of Genes and Genomes. We identified 68 differentially expressed genes (DEGs) in red and green forms of perilla. GO enrichment analysis of the DEGs showed that genes involved in the anthocyanin metabolic process were enriched. Differential expression analysis revealed that the transcript level of anthocyanin biosynthetic unigenes encoding flavonoid 3’-hydroxylase, dihydroflavonol 4-reductase, and anthocyanidin synthase was significantly higher in red perilla, while the transcript level of unigenes encoding limonene synthase was significantly higher in green perilla. Our data serve as a basis for future research on perilla bio-engineering and provide a shortcut for the characterization of new functional genes in P. frutescens. PMID:26070213

  14. High-Throughput Sequencing and De Novo Assembly of Red and Green Forms of the Perilla frutescens var. crispa Transcriptome.

    PubMed

    Fukushima, Atsushi; Nakamura, Michimi; Suzuki, Hideyuki; Saito, Kazuki; Yamazaki, Mami

    2015-01-01

    Perilla frutescens var. crispa (Labiatae) has two chemo-varietal forms, i.e. red and green forms of perilla, that differ in the production of anthocyanins. To facilitate molecular biological and biochemical studies in perilla-specialized metabolism we used Illumina RNA-sequencing technology in our comprehensive comparison of the transcriptome map of the leaves of red and green forms of perilla. Sequencing generated over 1.2 billion short reads with an average length of 101 nt. De novo transcriptome assembly yielded 47,788 and 47,840 unigenes in the red and green forms of perilla plants, respectively. Comparison of the assembled unigenes and existing perilla cDNA sequences showed highly reliable alignment. All unigenes were annotated with gene ontology (GO) and Enzyme Commission numbers and entered into the Kyoto Encyclopedia of Genes and Genomes. We identified 68 differentially expressed genes (DEGs) in red and green forms of perilla. GO enrichment analysis of the DEGs showed that genes involved in the anthocyanin metabolic process were enriched. Differential expression analysis revealed that the transcript level of anthocyanin biosynthetic unigenes encoding flavonoid 3'-hydroxylase, dihydroflavonol 4-reductase, and anthocyanidin synthase was significantly higher in red perilla, while the transcript level of unigenes encoding limonene synthase was significantly higher in green perilla. Our data serve as a basis for future research on perilla bio-engineering and provide a shortcut for the characterization of new functional genes in P. frutescens.

  15. High throughput de novo RNA sequencing elucidates novel responses in Penicillium chrysogenum under microgravity.

    PubMed

    Sathishkumar, Yesupatham; Krishnaraj, Chandran; Rajagopal, Kalyanaraman; Sen, Dwaipayan; Lee, Yang Soo

    2016-02-01

    In this study, the transcriptional alterations in Penicillium chrysogenum under simulated microgravity conditions were analyzed for the first time using an RNA-Seq method. The increasing plethora of eukaryotic microbial flora inside spacecraft demands a basic understanding of fungal biology in the absence of the gravity vector. Penicillium species are the second most dominant fungal contaminants on the International Space Station. Penicillium chrysogenum, an industrially important organism, also has the potential to emerge as an opportunistic pathogen for astronauts during long-term space missions. To date, however, the cellular mechanisms underlying the survival and adaptation of Penicillium chrysogenum to microgravity conditions have not been clearly elucidated. A reference genome for Penicillium chrysogenum is not yet available in the NCBI database. Hence, we performed comparative de novo transcriptome analysis of Penicillium chrysogenum grown under microgravity versus normal gravity. In addition, the changes due to microgravity are documented at the molecular level. Increased response to the environmental stimulus and changes in the cell wall component ABC transporter/MFS transporters are noteworthy. Interestingly, a sustained increase in the expression of Acyl-coenzyme A: isopenicillin N acyltransferase (Acyltransferase) under microgravity revealed the significance of gravity in penicillin production, which could be exploited industrially.

  16. De Novo transcriptome sequencing reveals important molecular networks and metabolic pathways of the plant, Chlorophytum borivilianum.

    PubMed

    Kalra, Shikha; Puniya, Bhanwar Lal; Kulshreshtha, Deepika; Kumar, Sunil; Kaur, Jagdeep; Ramachandran, Srinivasan; Singh, Kashmir

    2013-01-01

    Chlorophytum borivilianum, an endangered medicinal plant species, is highly recognized for its aphrodisiac properties provided by saponins present in the plant. The transcriptome information of this species is limited and only a few hundred expressed sequence tags (ESTs) are available in the public databases. To gain molecular insight into this plant, high-throughput transcriptome sequencing of leaf RNA was carried out using Illumina's HiSeq 2000 sequencing platform. A total of 22,161,444 single-end reads were retrieved after quality filtering. Available (e.g., De-Bruijn/Eulerian graph) and in-house developed bioinformatics tools were used for assembly and annotation of the transcriptome. A total of 101,141 assembled transcripts were obtained, with a coverage size of 22.42 Mb and an average length of 221 bp. Guanine-cytosine (GC) content was found to be 44%. Bioinformatics analysis, using non-redundant proteins, gene ontology (GO), enzyme commission (EC) and Kyoto Encyclopedia of Genes and Genomes (KEGG) databases, extracted all the known enzymes involved in saponin and flavonoid biosynthesis. A few genes of alkaloid biosynthesis, along with anticancer and plant defense genes, were also discovered. Additionally, several cytochrome P450 (CYP450) and glycosyltransferase unique sequences were found. We identified simple sequence repeat motifs in transcripts with an abundance of di-nucleotide simple sequence repeat (SSR; 43.1%) markers. Large-scale expression profiling through Reads per Kilobase per Million mapped reads (RPKM) showed major genes involved in different metabolic pathways of the plant. Genes, expressed sequence tags (ESTs) and unique sequences from this study provide an important resource for the scientific community interested in the molecular genetics and functional genomics of C. borivilianum.
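
    The RPKM measure named in this abstract normalizes a transcript's read count by its length (in kilobases) and by the sequencing depth (in millions of mapped reads). The following is a minimal Python sketch of that standard formula; the function name `rpkm` and the example numbers are illustrative assumptions and are not taken from the study.

```python
def rpkm(read_count, transcript_length_bp, total_mapped_reads):
    """Reads per kilobase of transcript per million mapped reads.

    RPKM = reads * 1e9 / (transcript length in bp * total mapped reads)
    """
    if transcript_length_bp <= 0 or total_mapped_reads <= 0:
        raise ValueError("length and total mapped reads must be positive")
    return read_count * 1e9 / (transcript_length_bp * total_mapped_reads)


# Hypothetical transcript: 500 reads on a 2,000 bp transcript
# in a library of 20 million mapped reads.
example_value = rpkm(500, 2000, 20_000_000)
print(example_value)
```

    Because the value is divided by both length and library size, transcripts of different lengths and samples of different depths can be compared on one scale.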

  17. De Novo Transcriptome Sequencing Reveals Important Molecular Networks and Metabolic Pathways of the Plant, Chlorophytum borivilianum

    PubMed Central

    Kalra, Shikha; Puniya, Bhanwar Lal; Kulshreshtha, Deepika; Kumar, Sunil; Kaur, Jagdeep; Ramachandran, Srinivasan; Singh, Kashmir

    2013-01-01

    Chlorophytum borivilianum, an endangered medicinal plant species, is highly recognized for its aphrodisiac properties provided by saponins present in the plant. The transcriptome information of this species is limited and only a few hundred expressed sequence tags (ESTs) are available in the public databases. To gain molecular insight into this plant, high-throughput transcriptome sequencing of leaf RNA was carried out using Illumina's HiSeq 2000 sequencing platform. A total of 22,161,444 single-end reads were retrieved after quality filtering. Available (e.g., De-Bruijn/Eulerian graph) and in-house developed bioinformatics tools were used for assembly and annotation of the transcriptome. A total of 101,141 assembled transcripts were obtained, with a coverage size of 22.42 Mb and an average length of 221 bp. Guanine-cytosine (GC) content was found to be 44%. Bioinformatics analysis, using non-redundant proteins, gene ontology (GO), enzyme commission (EC) and Kyoto Encyclopedia of Genes and Genomes (KEGG) databases, extracted all the known enzymes involved in saponin and flavonoid biosynthesis. A few genes of alkaloid biosynthesis, along with anticancer and plant defense genes, were also discovered. Additionally, several cytochrome P450 (CYP450) and glycosyltransferase unique sequences were found. We identified simple sequence repeat motifs in transcripts with an abundance of di-nucleotide simple sequence repeat (SSR; 43.1%) markers. Large-scale expression profiling through Reads per Kilobase per Million mapped reads (RPKM) showed major genes involved in different metabolic pathways of the plant. Genes, expressed sequence tags (ESTs) and unique sequences from this study provide an important resource for the scientific community interested in the molecular genetics and functional genomics of C. borivilianum. PMID:24376689
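
    The abstract reports di-nucleotide simple sequence repeats (SSRs) but does not say which tool or thresholds were used to find them. As a rough illustration only, the Python sketch below scans a sequence for tandem repeats with a regular expression; the function name `find_ssrs`, the default of at least six repeat units, and the toy sequence are all assumptions rather than the authors' method.

```python
import re


def find_ssrs(seq, unit_len=2, min_repeats=6):
    """Return (start, unit, repeat_count) for each tandem repeat found.

    unit_len=2 targets di-nucleotide SSRs; min_repeats is an assumed cutoff.
    """
    # ([ACGT]{n}) captures one repeat unit; \1{m,} requires m further copies.
    pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (unit_len, min_repeats - 1))
    hits = []
    for match in pattern.finditer(seq.upper()):
        unit = match.group(1)
        if len(set(unit)) == 1:
            continue  # skip homopolymer runs such as AAAA...
        hits.append((match.start(), unit, len(match.group(0)) // unit_len))
    return hits


# Toy sequence (not from the study): six copies of AT flanked by other bases.
toy = "GGC" + "AT" * 6 + "CC"
print(find_ssrs(toy))
```

    Dedicated SSR-mining programs apply motif-specific thresholds and handle compound repeats, which this sketch does not attempt.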

  18. Sequencing, De novo Assembly, Functional Annotation and Analysis of Phyllanthus amarus Leaf Transcriptome Using the Illumina Platform

    PubMed Central

    Bose Mazumdar, Aparupa; Chattopadhyay, Sharmila

    2016-01-01

    Phyllanthus amarus Schum. and Thonn., a widely distributed annual medicinal herb, has a history of use in traditional medicine spanning more than 2000 years. However, the lack of genomic data for P. amarus, a non-model organism, hinders research at the molecular level. In the present study, high-throughput sequencing technology has been employed to improve understanding of this herb and provide comprehensive genomic information for future work. Here the P. amarus leaf transcriptome was sequenced using the Illumina MiSeq platform. We assembled 85,927 non-redundant (nr) “unitranscript” sequences with an average length of 1548 bp, from 18,060,997 raw reads. Sequence similarity analyses and annotation of these unitranscripts were performed against databases including the green plants nr protein database, Gene Ontology (GO), Clusters of Orthologous Groups (COG), PlnTFDB, and KEGG. As a result, 69,394 GO terms, 583 enzyme codes (EC), 134 KEGG maps, and 59 Transcription Factor (TF) families were generated. Functional and comparative analyses of assembled unitranscripts were also performed with the most closely related species, such as Populus trichocarpa and Ricinus communis, using TRAPID. KEGG analysis showed that a number of assembled unitranscripts were involved in secondary metabolite pathways, mainly phenylpropanoid, flavonoid, terpenoid, alkaloid, and lignan biosynthetic pathways, that have significant medicinal attributes. Further, Fragments Per Kilobase of transcript per Million mapped reads (FPKM) values of the identified secondary metabolite pathway genes were determined and Reverse Transcription PCR (RT-PCR) of a few of these genes was performed to validate the de novo assembled leaf transcriptome dataset. In addition, 65,273 simple sequence repeats (SSRs) were also identified. To the best of our knowledge, this is the first transcriptomic dataset of P. amarus to date. Our study provides the largest genetic resource that will lead to drug development and pave

  19. A de novo next generation genomic sequence assembler based on string graph and MapReduce cloud computing framework

    PubMed Central

    2012-01-01

    Background State-of-the-art high-throughput sequencers, e.g., the Illumina HiSeq series, generate sequencing reads longer than 150 bp, with up to 600 Gbp of data per run. These sequencers generate longer reads with greater sequencing depth than previous technologies. Two major challenges exist in using the high-throughput technology for de novo assembly of genomes. First, the amount of physical memory may be insufficient to store the data structure of the assembly algorithm, even for high-end multicore processors. Second, the graph-theoretical model used to capture intersection relationships of the reads may contain structural defects that are not well managed by existing assembly algorithms. Results We developed a distributed genome assembler based on string graphs and the MapReduce framework, known as CloudBrush. The assembler includes a novel edge-adjustment algorithm to detect structural defects by examining the neighboring reads of a specific read for sequencing errors and adjusting the edges of the string graph, if necessary. CloudBrush is evaluated against GAGE benchmarks to compare its assembly quality with that of other assemblers. The results show that our assemblies have a moderate N50, a low misassembly rate of misjoins, and indels of > 5 bp. In addition, we have introduced two measures, known as precision and recall, to address the issues of faithfully aligned contigs to target genomes. Compared with the assembly tools used in the GAGE benchmarks, CloudBrush is shown to produce contigs with high precision and recall. We also verified the effectiveness of the edge-adjustment algorithm using simulated datasets and ran CloudBrush on a nematode dataset using a commercial cloud. The CloudBrush assembler is available at https://github.com/ice91/CloudBrush. PMID:23282094

  20. De Novo Whole-Genome Sequence of Xylella fastidiosa subsp. multiplex Strain BB01 Isolated from a Blueberry in Georgia, USA

    PubMed Central

    Van Horn, Christopher; Chang, Chung-Jan

    2017-01-01

    ABSTRACT This study reports a de novo-assembled draft genome sequence of Xylella fastidiosa subsp. multiplex strain BB01 causing blueberry bacterial leaf scorch in Georgia, USA. The BB01 genome is 2,517,579 bp, with a G+C content of 51.8%, 2,943 open reading frames (ORFs), and 48 RNA genes. PMID:28183766
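
    The G+C content quoted in this record is the percentage of guanine and cytosine bases among the called bases of the genome. A minimal Python sketch of that calculation is given below; the function name `gc_content` and the toy sequence are illustrative assumptions, and the sketch ignores ambiguous bases such as N, which is one common convention but not necessarily the one used for the BB01 genome.

```python
def gc_content(seq):
    """Percentage of G and C among unambiguous A/C/G/T bases."""
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    acgt = gc + seq.count("A") + seq.count("T")
    if acgt == 0:
        raise ValueError("sequence contains no A/C/G/T bases")
    return 100.0 * gc / acgt


# Toy sequence (not from the BB01 genome): 4 of 8 bases are G or C.
print(gc_content("GGCCAATT"))
```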

  1. Comparative Transcriptomic Approaches Exploring Contamination Stress Tolerance in Salix sp. Reveal the Importance for a Metaorganismal de Novo Assembly Approach for Nonmodel Plants

    PubMed Central

    Brereton, Nicholas J. B.; Marleau, Julie; Nissim, Werther Guidi; Labrecque, Michel; Joly, Simon; Pitre, Frederic E.

    2016-01-01

    Metatranscriptomic study of nonmodel organisms requires strategies that retain the highly resolved genetic information generated from model organisms while allowing for identification of the unexpected. A real-world biological application of phytoremediation, the field growth of 10 Salix cultivars on polluted soils, was used as an exemplar nonmodel and multifaceted crop response well-disposed to the study of gene expression. Sequence reads were assembled de novo to create 10 independent transcriptomes, a global transcriptome, and were mapped against the Salix purpurea 94006 reference genome. Annotation of assembled contigs was performed without a priori assumption of the originating organism. Global transcriptome construction from 3.03 billion paired-end reads revealed 606,880 unique contigs annotated from 1588 species, often common in all 10 cultivars. Comparisons between transcriptomic and metatranscriptomic methodologies provide clear evidence that nonnative RNA can mistakenly map to reference genomes, especially to conserved regions of common housekeeping genes, such as actin, α/β-tubulin, and elongation factor 1-α. In Salix, Rubisco activase transcripts were down-regulated in contaminated trees across all 10 cultivars, whereas thiamine thiazole synthase and CP12, a Calvin Cycle master regulator, were uniformly up-regulated. De novo assembly approaches, with unconstrained annotation, can improve data quality; care should be taken when exploring such plant genetics to reduce de facto data exclusion by mapping to a single reference genome alone. Salix gene expression patterns strongly suggest cultivar-wide alteration of specific photosynthetic apparatus and protection of the antenna complexes from oxidation damage in contaminated trees, providing an insight into common stress tolerance strategies in a real-world phytoremediation system. PMID:27002060

  2. De novo Assembly of a 40 Mb Eukaryotic Genome from Short Sequence Reads: Sordaria macrospora, a Model Organism for Fungal Morphogenesis

    PubMed Central

    Nowrousian, Minou; Stajich, Jason E.; Chu, Meiling; Engh, Ines; Espagne, Eric; Halliday, Karen; Kamerewerd, Jens; Kempken, Frank; Knab, Birgit; Kuo, Hsiao-Che; Osiewacz, Heinz D.; Pöggeler, Stefanie; Read, Nick D.; Seiler, Stephan; Smith, Kristina M.; Zickler, Denise; Kück, Ulrich; Freitag, Michael

    2010-01-01

    Filamentous fungi are of great importance in ecology, agriculture, medicine, and biotechnology. Thus, it is not surprising that genomes for more than 100 filamentous fungi have been sequenced, most of them by Sanger sequencing. While next-generation sequencing techniques have revolutionized genome resequencing, e.g. for strain comparisons, genetic mapping, or transcriptome and ChIP analyses, de novo assembly of eukaryotic genomes still presents significant hurdles, because of their large size and stretches of repetitive sequences. Filamentous fungi contain few repetitive regions in their 30–90 Mb genomes and thus are suitable candidates to test de novo genome assembly from short sequence reads. Here, we present a high-quality draft sequence of the Sordaria macrospora genome that was obtained by a combination of Illumina/Solexa and Roche/454 sequencing. Paired-end Solexa sequencing of genomic DNA to 85-fold coverage and an additional 10-fold coverage by single-end 454 sequencing resulted in ∼4 Gb of DNA sequence. Reads were assembled to a 40 Mb draft version (N50 of 117 kb) with the Velvet assembler. Comparative analysis with Neurospora genomes increased the N50 to 498 kb. The S. macrospora genome contains even fewer repeat regions than its closest sequenced relative, Neurospora crassa. Comparison with genomes of other fungi showed that S. macrospora, a model organism for morphogenesis and meiosis, harbors duplications of several genes involved in self/nonself-recognition. Furthermore, S. macrospora contains more polyketide biosynthesis genes than N. crassa. Phylogenetic analyses suggest that some of these genes may have been acquired by horizontal gene transfer from a distantly related ascomycete group. Our study shows that, for typical filamentous fungi, de novo assembly of genomes from short sequence reads alone is feasible, that a mixture of Solexa and 454 sequencing substantially improves the assembly, and that the resulting data can be used for comparative

  3. Sequencing and de novo draft assemblies of a fathead minnow (Pimephales promelas) reference genome.

    PubMed

    Burns, Frank R; Cogburn, Amarin L; Ankley, Gerald T; Villeneuve, Daniel L; Waits, Eric; Chang, Yun-Juan; Llaca, Victor; Deschamps, Stephane D; Jackson, Raymond E; Hoke, Robert Alan

    2016-01-01

    The present study was undertaken to provide the foundation for development of genome-scale resources for the fathead minnow (Pimephales promelas), an important model organism widely used in both aquatic toxicology research and regulatory testing. The authors report on the first sequencing and 2 draft assemblies for the reference genome of this species. Approximately 120× sequence coverage was achieved via Illumina sequencing of a combination of paired-end, mate-pair, and fosmid libraries. Evaluation and comparison of these assemblies demonstrate that they are of sufficient quality to be useful for genome-enabled studies, with 418 of 458 (91%) conserved eukaryotic genes mapping to at least 1 of the assemblies. In addition to its immediate utility, the present work provides a strong foundation on which to build further refinements of a reference genome for the fathead minnow.

  4. Disease-targeted sequencing of ion channel genes identifies de novo mutations in patients with non-familial Brugada syndrome.

    PubMed

    Juang, Jyh-Ming Jimmy; Lu, Tzu-Pin; Lai, Liang-Chuan; Ho, Chia-Chuan; Liu, Yen-Bin; Tsai, Chia-Ti; Lin, Lian-Yu; Yu, Chih-Chieh; Chen, Wen-Jone; Chiang, Fu-Tien; Yeh, Shih-Fan Sherri; Lai, Ling-Ping; Chuang, Eric Y; Lin, Jiunn-Lee

    2014-10-23

    Brugada syndrome (BrS) is one of the ion channelopathies associated with sudden cardiac death (SCD). The most common BrS-associated gene (SCN5A) only accounts for approximately 20-25% of BrS patients. This study aims to identify novel mutations across human ion channels in non-familial BrS patients without SCN5A variants through disease-targeted sequencing. We performed disease-targeted multi-gene sequencing across 133 human ion channel genes and 12 reported BrS-associated genes in 15 unrelated, non-familial BrS patients without SCN5A variants. Candidate variants were validated by mass spectrometry and Sanger sequencing. Five de novo mutations were identified in four genes (SCNN1A, KCNJ16, KCNB2, and KCNT1) in three BrS patients (20%). Two of the three patients presented with SCD and one had syncope. Interestingly, the two patients who presented with SCD had compound mutations (SCNN1A:Arg350Gln and KCNB2:Glu522Lys; SCNN1A:Arg597* and KCNJ16:Ser261Gly). Importantly, two SCNN1A mutations were identified from different families. The KCNT1:Arg1106Gln mutation was identified in a patient with syncope. Bioinformatics algorithms predicted severe functional interruptions in these four mutation loci, suggesting their pivotal roles in BrS. This study identified four novel BrS-associated genes and indicated the effectiveness of this disease-targeted sequencing across ion channel genes for non-familial BrS patients without SCN5A variants.

  5. De Novo Assembly, Gene Annotation, and Marker Discovery in Stored-Product Pest Liposcelis entomophila (Enderlein) Using Transcriptome Sequences

    PubMed Central

    Wei, Dan-Dan; Chen, Er-Hu; Ding, Tian-Bo; Chen, Shi-Chun; Dou, Wei; Wang, Jin-Jun

    2013-01-01

    Background As a major stored-product pest insect, Liposcelis entomophila has developed high levels of resistance to various insecticides in grain storage systems. However, the molecular mechanisms underlying resistance and environmental stress have not been characterized. To date, there is a lack of genomic information for this species. Therefore, studies aimed at profiling the L. entomophila transcriptome would provide a better understanding of the biological functions at the molecular level. Methodology/Principal Findings We applied Illumina sequencing technology to sequence the transcriptome of L. entomophila. A total of 54,406,328 clean reads were obtained and de novo assembled into 54,220 unigenes, with an average length of 571 bp. Through a similarity search, 33,404 (61.61%) unigenes were matched to known proteins in the NCBI non-redundant (Nr) protein database. These unigenes were further functionally annotated with gene ontology (GO), cluster of orthologous groups of proteins (COG), and Kyoto Encyclopedia of Genes and Genomes (KEGG) databases. A large number of genes potentially involved in insecticide resistance were manually curated, including 68 putative cytochrome P450 genes, 37 putative glutathione S-transferase (GST) genes, 19 putative carboxyl/cholinesterase (CCE) genes, and 126 other transcripts containing target site sequences or encoding detoxification genes representing eight types of resistance enzymes. Furthermore, to gain insight into the molecular basis of the L. entomophila response to thermal stresses, 25 heat shock protein (Hsp) genes were identified. In addition, 1,100 SSRs and 57,757 SNPs were detected and 231 pairs of SSR primers were designed for investigating genetic diversity in the future. Conclusions/Significance We developed a comprehensive transcriptomic database for L. entomophila. These sequences and putative molecular markers would further promote our understanding of the molecular mechanisms underlying insecticide resistance

  6. De Novo Transcriptome Sequencing of Olea europaea L. to Identify Genes Involved in the Development of the Pollen Tube.

    PubMed

    Iaria, Domenico; Chiappetta, Adriana; Muzzalupo, Innocenzo

    2016-01-01

    In olive (Olea europaea L.), the processes controlling self-incompatibility are still unclear and the molecular basis underlying them is still not fully characterized. In order to determine compatibility relationships, using next-generation sequencing techniques and a de novo transcriptome assembly strategy, we show that pollen tubes from different olive plants, grown in vitro in a medium containing its own pistil and in combination pollen/pistil from self-sterile and self-fertile cultivars, have a distinct gene expression profile and many of the differentially expressed sequences between the samples fall within gene families involved in the development of the pollen tube, such as lipase, carboxylesterase, pectinesterase, pectin methylesterase, and callose synthase. Moreover, different genes involved in signal transduction, transcription, and growth are overrepresented. The analysis also allowed us to identify members of the actin, actin depolymerization factor, and fibrin gene families and a member of the Ca(2+)-binding gene family related to the development and polarization of the pollen apical tip. The whole transcriptomic analysis, through the identification of the differentially expressed transcript set and an extended functional annotation analysis, will lead to a better understanding of the mechanisms of pollen germination and pollen tube growth in the olive.

  7. De Novo Assembly of Bitter Gourd Transcriptomes: Gene Expression and Sequence Variations in Gynoecious and Monoecious Lines.

    PubMed

    Shukla, Anjali; Singh, V K; Bharadwaj, D R; Kumar, Rajesh; Rai, Ashutosh; Rai, A K; Mugasimangalam, Raja; Parameswaran, Sriram; Singh, Major; Naik, P S

    2015-01-01

    Bitter gourd (Momordica charantia L.) is a nutritious vegetable crop of Asian origin, used as a medicinal herb in Indian and Chinese traditional medicine. Molecular breeding in bitter gourd is in its infancy, due to limited molecular resources, particularly functional markers for traits such as gynoecy. We performed de novo transcriptome sequencing of bitter gourd using an Illumina next-generation sequencer, from root, flower bud, stem and leaf samples of a gynoecious line (Gy323) and a monoecious line (DRAR1). A total of 65,540 transcripts for Gy323 and 61,490 for DRAR1 were obtained. Comparisons revealed SNP and SSR variations between these lines and allowed the identification of gene classes. Based on available transcripts we identified 80 WRKY transcription factors, several of which have been reported in responses to biotic and abiotic stresses, and 56 ARF genes, which play a pivotal role in auxin-regulated gene expression and development. The data presented will be useful in both functional studies and breeding programs in bitter gourd.

  8. Novel proline-hydroxyproline glycopeptides from the dandelion (Taraxacum officinale Wigg.) flowers: de novo sequencing and biological activity.

    PubMed

    Astafieva, Alexandra A; Enyenihi, Atim A; Rogozhin, Eugene A; Kozlov, Sergey A; Grishin, Eugene V; Odintsova, Tatyana I; Zubarev, Roman A; Egorov, Tsezi A

    2015-09-01

    Two novel homologous peptides named ToHyp1 and ToHyp2 that show no similarity to any known proteins were isolated from Taraxacum officinale Wigg. flowers by multidimensional liquid chromatography. Amino acid and mass spectrometry analyses demonstrated that the peptides have an unusual structure: they are cysteine-free, proline-hydroxyproline-rich and post-translationally glycosylated by pentoses, with 5 carbohydrates in ToHyp2 and 10 in ToHyp1. The ToHyp2 peptide, with a monoisotopic molecular mass of 4350.3 Da, was completely sequenced by a combination of Edman degradation and de novo sequencing via top-down multistage collision induced dissociation (CID) and higher energy dissociation (HCD) tandem mass spectrometry (MS(n)). ToHyp2 consists of 35 amino acids and contains 18 proline residues, 8 of which are hydroxylated. The peptide displays antifungal activity and inhibits growth of Gram-positive and Gram-negative bacteria. We further showed that carbohydrate moieties have no significant impact on the peptide structure, but are important, although not absolutely necessary, for antifungal activity. The deglycosylated ToHyp2 peptide was less active against the susceptible fungus Bipolaris sorokiniana than the native peptide. Unique structural features of the ToHyp2 peptide place it into a new family of plant defense peptides. The discovery of ToHyp peptides in T. officinale flowers expands the repertoire of molecules of plant origin with practical applications.

  9. De novo sequencing, assembly and analysis of salivary gland transcriptome of Haemaphysalis flava and identification of sialoprotein genes.

    PubMed

    Xu, Xing-Li; Cheng, Tian-Yin; Yang, Hu; Yan, Fen; Yang, Ya

    2015-06-01

    Saliva plays an important role in feeding and pathogen transmission, and the identification and analysis of tick salivary gland (SG) proteins is considered a hot spot in anti-tick research. Herein, we present the first description of the SG transcriptome of Haemaphysalis flava using next-generation sequencing (NGS). A total of over 143 million high-quality reads were assembled into 54,357 unigenes, of which 20,145 (37.06%) had significant similarities to proteins in the Swiss-Prot database. A total of 13,513 annotated sequences were associated with GO terms. Kyoto Encyclopedia of Genes and Genomes (KEGG) analysis showed that 14,280 unigenes were assigned to 279 KEGG pathways in total. Reads per kb per million reads (RPKM) analysis showed that there were 3035 down-regulated unigenes and 2260 up-regulated unigenes in the engorged ticks (ET) compared with the semi-engorged ones (SET). Several important genes associated with blood feeding and ingestion as secreted salivary proteins, including cysteine, longipain, 4D8, calreticulin, metalloproteases, serine protease inhibitor, enolase, heat shock protein and AV422 in SG, were identified. The qRT-PCR results confirmed that the expression patterns of these genes (except for the longipain gene) were consistent with RNA-seq results. This de novo assembly of the SG transcriptome of H. flava not only provides more opportunities for screening and cloning functional genes, but also forms a solid basis for further insight into the changes of salivary proteins during blood-feeding.

  10. Sequencing and De novo Draft Assemblies of the Fathead Minnow (Pimephales promelas) Reference Genome

    EPA Science Inventory

    This study was undertaken to develop genome-scale resources for the fathead minnow (Pimephales promelas), an important model organism widely used in both aquatic ecotoxicology research and in regulatory toxicity testing. We report on the first sequencing and two draft assemblies fo...

  11. De Novo Sequencing and Resurrection of a Human Astrovirus-Neutralizing Antibody

    PubMed Central

    2016-01-01

    Monoclonal antibody (mAb) therapeutics targeting cancer, autoimmune diseases, inflammatory diseases, and infectious diseases are growing exponentially. Although numerous panels of mAbs targeting infectious disease agents have been developed, their progression into clinically useful mAbs is often hindered by the lack of sequence information and/or loss of hybridoma cells that produce them. Here we combine the power of crystallography and mass spectrometry to determine the amino acid sequence and glycosylation modification of the Fab fragment of a potent human astrovirus-neutralizing mAb. We used this information to engineer a recombinant antibody single-chain variable fragment that has the same specificity as the parent monoclonal antibody to bind to the astrovirus capsid protein. This antibody can now potentially be developed as a therapeutic and diagnostic agent. PMID:27213181

  12. Sequencing and de novo transcriptome assembly of Anthopleura dowii Verrill (1869), from Mexico.

    PubMed

    Ayala-Sumuano, Jorge-Tonatiuh; Licea-Navarro, Alexei; Rudiño-Piñera, Enrique; Rodríguez, Estefanía; Rodríguez-Almazán, Claudia

    2017-03-01

    Next-generation technologies for determining genomic and transcriptomic composition have a wide range of applications. Moreover, the development of tools for big data set analysis has allowed the identification of molecules and networks involved in metabolism, evolution or behavior. In their natural habitats, aquatic organisms have implemented molecular strategies for survival, including the production and secretion of compounds toxic to their predators; therefore these organisms are possible sources of proteins or peptides with potential biotechnological application. In the last decade anthozoans, mainly octocorals but also sea anemones, have been proven to be a source of natural products. Members of the genus Anthopleura are among the best known and most studied sea anemones because they are common constituents of rocky intertidal communities and show interesting ecological and biological phenomena (e.g. intraspecific competition, symbiosis, etc.); however, many aspects of these taxa remain to be analyzed. This work describes the transcriptome sequencing of Anthopleura dowii Verrill, 1869 (Cnidaria: Anthozoa: Actiniaria); this is the first report of this kind for this species. The data set used to construct the transcriptome has been deposited in NCBI's database. Illumina sequence reads are available under BioProject accession number PRJNA329297 and Sequence Read Archive under accession number SRP078992.

  13. De novo genome assembly of the economically important weed horseweed using integrated data from multiple sequencing platforms.

    PubMed

    Peng, Yanhui; Lai, Zhao; Lane, Thomas; Nageswara-Rao, Madhugiri; Okada, Miki; Jasieniuk, Marie; O'Geen, Henriette; Kim, Ryan W; Sammons, R Douglas; Rieseberg, Loren H; Stewart, C Neal

    2014-11-01

    Horseweed (Conyza canadensis), a member of the Compositae (Asteraceae) family, was the first broadleaf weed to evolve resistance to glyphosate. Horseweed, one of the most problematic weeds in the world, is a true diploid (2n = 2x = 18), with the smallest genome of any known agricultural weed (335 Mb). Thus, it is an appropriate candidate to help us understand the genetic and genomic bases of weediness. We undertook a draft de novo genome assembly of horseweed by combining data from multiple sequencing platforms (454 GS-FLX, Illumina HiSeq 2000, and PacBio RS) using various libraries with different insertion sizes (approximately 350 bp, 600 bp, 3 kb, and 10 kb) of a Tennessee-accessed, glyphosate-resistant horseweed biotype. From 116.3 Gb (approximately 350× coverage) of data, the genome was assembled into 13,966 scaffolds with 50% of the assembly = 33,561 bp. The assembly covered 92.3% of the genome, including the complete chloroplast genome (approximately 153 kb) and a nearly complete mitochondrial genome (approximately 450 kb in 120 scaffolds). The nuclear genome is composed of 44,592 protein-coding genes. Genome resequencing of seven additional horseweed biotypes was performed. These sequence data were assembled and used to analyze genome variation. Simple sequence repeat and single-nucleotide polymorphisms were surveyed. Genomic patterns were detected that associated with glyphosate-resistant or -susceptible biotypes. The draft genome will be useful to better understand weediness and the evolution of herbicide resistance and to devise new management strategies. The genome will also be useful as another reference genome in the Compositae. To our knowledge, this article represents the first published draft genome of an agricultural weed.
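
    Two figures in this abstract lend themselves to a short worked example. Fold coverage is the total number of sequenced bases divided by the genome size, so 116.3 Gb over a 335 Mb genome gives roughly 347×, in line with the stated "approximately 350×". The phrase "50% of the assembly = 33,561 bp" reads like the scaffold N50 statistic, although that interpretation is an assumption. The Python sketch below shows both calculations; the function names and the toy scaffold lengths are hypothetical and not from the study.

```python
def fold_coverage(total_bases, genome_size):
    """Average sequencing depth: total sequenced bases / genome size."""
    return total_bases / genome_size


def n50(lengths):
    """Length L such that sequences of length >= L hold at least half the total."""
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length
    raise ValueError("no sequence lengths supplied")


# Figures quoted in the abstract: 116.3 Gb of data, 335 Mb genome.
print(round(fold_coverage(116.3e9, 335e6)))

# Toy scaffold lengths (hypothetical), total 20; the two longest reach 11 >= 10.
print(n50([2, 3, 4, 5, 6]))
```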

  14. A comparison across non-model animals suggests an optimal sequencing depth for de novo transcriptome assembly

    PubMed Central

    2013-01-01

    Background The lack of genomic resources can present challenges for studies of non-model organisms. Transcriptome sequencing offers an attractive method to gather information about genes and gene expression without the need for a reference genome. However, it is unclear what sequencing depth is adequate to assemble the transcriptome de novo for these purposes. Results We assembled transcriptomes of animals from six different phyla (Annelids, Arthropods, Chordates, Cnidarians, Ctenophores, and Molluscs) at regular increments of reads using Velvet/Oases and Trinity to determine how read count affects the assembly. This included an assembly of mouse heart reads because we could compare those against the reference genome that is available. We found qualitative differences in the assemblies of whole-animals versus tissues. With increasing reads, whole-animal assemblies show rapid increase of transcripts and discovery of conserved genes, while single-tissue assemblies show a slower discovery of conserved genes though the assembled transcripts were often longer. A deeper examination of the mouse assemblies shows that with more reads, assembly errors become more frequent but such errors can be mitigated with more stringent assembly parameters. Conclusions These assembly trends suggest that representative assemblies are generated with as few as 20 million reads for tissue samples and 30 million reads for whole-animals for RNA-level coverage. These depths provide a good balance between coverage and noise. Beyond 60 million reads, the discovery of new genes is low and sequencing errors of highly-expressed genes are likely to accumulate. Finally, siphonophores (polymorphic Cnidarians) are an exception and possibly require alternate assembly strategies. PMID:23496952

  15. De novo sequencing and comparative analysis of testicular transcriptome from different reproductive phases in freshwater spotted snakehead Channa punctatus

    PubMed Central

    Roy, Alivia; Basak, Reetuparna

    2017-01-01

    The spotted snakehead Channa punctatus is a seasonally breeding teleost widely distributed in the Indian subcontinent and economically important due to high nutritional value. The declining population of C. punctatus prompted us to focus on genetic regulation of its reproduction. The present study carried out de novo testicular transcriptome sequencing during the four reproductive phases and correlated differential expression of transcripts with various testicular events in C. punctatus. The Illumina paired-end sequencing of testicular transcriptome from resting, preparatory, spawning and postspawning phases generated 41.94, 47.51, 61.81 and 44.45 million reads, and 105526, 105169, 122964 and 106544 transcripts, respectively. Transcripts annotated using Rattus norvegicus reference protein sequences and classified under various subcategories of biological process, molecular function and cellular component showed that the majority of the subcategories had highest number of transcripts during spawning phase. In addition, analysis of transcripts exhibiting differential expression during the four phases revealed an appreciable increase in upregulated transcripts of biological processes such as cell proliferation and differentiation, cytoskeleton organization, response to vitamin A, transcription and translation, regulation of angiogenesis and response to hypoxia during spermatogenically active phases. The study also identified significant differential expression of transcripts relevant to spermatogenesis (mgat3, nqo1, hes2, rgs4, cxcl2, alcam, agmat), steroidogenesis (star, tkt, gipc3), cell proliferation (eef1a2, btg3, pif1, myo16, grik3, trim39, plbd1), cytoskeletal organization (espn, wipf3, cd276), sperm development (klhl10, mast1, hspa1a, slc6a1, ros1, foxj1, hipk1), and sperm transport and motility (hint1, muc13). Analysis of functional annotation and differential expression of testicular transcripts depending on reproductive phases of C. punctatus helped in

  16. De Novo Sequencing and Transcriptome Analysis of Wolfiporia cocos to Reveal Genes Related to Biosynthesis of Triterpenoids

    PubMed Central

    Shu, Shaohua; Chen, Bei; Zhou, Mengchun; Zhao, Xinmei; Xia, Haiyang; Wang, Mo

    2013-01-01

    Wolfiporia cocos Ryvarden et Gilbertson is a saprophytic fungus in the Basidiomycetes. Its dried sclerotium is widely used as a traditional crude drug in East Asia. Especially in China, the dried sclerotium is regarded as the silver of the Chinese traditional drugs, not only for its white color, but also its medicinal value. Furthermore, triterpenoids from W. cocos are the main active compounds with antitumor and anti-inflammatory activity. Biosynthesis of the triterpenoids has rarely been researched. In this study, de novo sequencing of the mycelia and sclerotia of W. cocos was carried out by Illumina HiSeq 2000. A total of 3,484,996,740 bp from 38,722,186 sequence reads of mycelia, and 3,573,921,960 bp from 39,710,244 high quality sequence reads of sclerotium were obtained. These raw data were assembled into 60,354 contigs and 40,939 singletons, and 56,938 contigs and 37,220 singletons for mycelia and sclerotia, respectively. The transcriptomic data clearly showed that terpenoid biosynthesis was only via the MVA pathway in W. cocos. The production of total triterpenoids and pachymic acid was examined in the dry mycelia and sclerotia. The content of total triterpenoids was 5.36% and 1.43% in mycelia and sclerotia, respectively, and the content of pachymic acid was 0.458% and 0.174%. Some genes involved in the triterpenoid biosynthetic pathway were chosen to be verified by qRT-PCR. The unigenes encoding diphosphomevalonate decarboxylase (Unigene 20430), farnesyl diphosphate synthase (Unigene 14106 and 21656), hydroxymethylglutaryl-CoA reductase (NADPH) (Unigene 6395_All) and lanosterol synthase (Unigene28001_All) were upregulated in the mycelia stage. It is likely that expression of these genes influences the biosynthesis of triterpenoids in the mycelia stage. PMID:23967197

  17. De novo sequencing and comparative analysis of testicular transcriptome from different reproductive phases in freshwater spotted snakehead Channa punctatus.

    PubMed

    Roy, Alivia; Basak, Reetuparna; Rai, Umesh

    2017-01-01

    The spotted snakehead Channa punctatus is a seasonally breeding teleost widely distributed in the Indian subcontinent and economically important due to high nutritional value. The declining population of C. punctatus prompted us to focus on genetic regulation of its reproduction. The present study carried out de novo testicular transcriptome sequencing during the four reproductive phases and correlated differential expression of transcripts with various testicular events in C. punctatus. The Illumina paired-end sequencing of testicular transcriptome from resting, preparatory, spawning and postspawning phases generated 41.94, 47.51, 61.81 and 44.45 million reads, and 105526, 105169, 122964 and 106544 transcripts, respectively. Transcripts annotated using Rattus norvegicus reference protein sequences and classified under various subcategories of biological process, molecular function and cellular component showed that the majority of the subcategories had highest number of transcripts during spawning phase. In addition, analysis of transcripts exhibiting differential expression during the four phases revealed an appreciable increase in upregulated transcripts of biological processes such as cell proliferation and differentiation, cytoskeleton organization, response to vitamin A, transcription and translation, regulation of angiogenesis and response to hypoxia during spermatogenically active phases. The study also identified significant differential expression of transcripts relevant to spermatogenesis (mgat3, nqo1, hes2, rgs4, cxcl2, alcam, agmat), steroidogenesis (star, tkt, gipc3), cell proliferation (eef1a2, btg3, pif1, myo16, grik3, trim39, plbd1), cytoskeletal organization (espn, wipf3, cd276), sperm development (klhl10, mast1, hspa1a, slc6a1, ros1, foxj1, hipk1), and sperm transport and motility (hint1, muc13). Analysis of functional annotation and differential expression of testicular transcripts depending on reproductive phases of C. punctatus helped in

  18. De Novo Sequencing and Comparative Analysis of Schima superba Seedlings to Explore the Response to Drought Stress

    PubMed Central

    Han, Bao-cai; Wei, Wei; Mi, Xiang-cheng; Ma, Ke-ping

    2016-01-01

    Schima superba is an important dominant species in subtropical evergreen broadleaved forests of China, and plays a vital role in community structure and dynamics. However, the survival rate of its seedlings in the field is low, and water shortage could be a factor that limits its regeneration. In order to better understand the response of its seedlings to drought stress on a functional genomics scale, RNA-seq technology was utilized in this study to perform a large-scale transcriptome sequencing of the S. superba seedlings under drought stress. More than 320 million clean reads were generated and 72218 unique transcripts were obtained through de novo assembly. These unigenes were further annotated by blasting with different public databases and a total of 53300 unique transcripts were annotated. A total of 31586 simple sequence repeat (SSR) loci were presented. Through gene expression profiling analysis between drought treatment and control, 11038 genes were found to be significantly enriched in drought-stressed seedlings. Based on these differentially expressed genes (DEGs), Gene Ontology (GO) terms enrichment and Kyoto Encyclopedia of Genes and Genomes pathway (KEGG) enrichment analysis indicated that drought stress caused a number of changes in the types of sugars, enzymes, secondary mechanisms, and light responses, and induced some potential physical protection mechanisms. In addition, the expression patterns of 18 transcripts induced by drought, as determined by quantitative real-time PCR, were consistent with their transcript abundance changes, as identified by RNA-seq. This transcriptome study provides a rapid method for understanding the response of S. superba seedlings to drought stress and provides a number of gene sequences available for further functional genomics studies. PMID:27930677

  19. De novo assembly of transcriptome sequencing in Caragana korshinskii Kom. and characterization of EST-SSR markers.

    PubMed

    Long, Yan; Wang, Yanyan; Wu, Shanshan; Wang, Jiao; Tian, Xinjie; Pei, Xinwu

    2015-01-01

    Caragana korshinskii Kom. is widely distributed in various habitats, including gravel desert, clay desert, fixed and semi-fixed sand, and saline land in the Asian and African deserts. To date, no genomic information or EST-SSR marker has been reported in the genus Caragana Fabr. In this study, more than two billion bases of high-quality sequence of C. korshinskii were generated using Illumina sequencing technology, demonstrating the de novo assembly and annotation of genes without prior genome information. These reads were assembled into 86,265 unigenes (mean length = 709 bp). The similarity search indicated that 33,955 and 21,978 unigenes showed significant similarities to known proteins from NCBI non-redundant and Swissprot protein databases, respectively. Among these annotated unigenes, 26,232 unigenes were separately assigned to the Gene Ontology (GO) database. When 22,756 unigenes were searched against the Kyoto Encyclopedia of Genes and Genomes Pathway (KEGG) database, 5,598 unigenes were assigned to 5 main categories including 32 KEGG pathways. Among the main KEGG categories, metabolism was the biggest category (2,862, 43.7%), suggesting active metabolic processes in the desert tree. In addition, a total of 19,150 EST-SSRs were identified from 15,484 unigenes, and the characteristics of the EST-SSRs were further compared with four other species in Fabaceae. 126 potential marker sites were randomly selected to validate the assembly quality and develop EST-SSR markers. Among the 9 germplasms in the genus Caragana Fabr., the PCR success rate was 93.7% and the phylogenetic tree was constructed based on the genotypic data. This research generated a substantial fraction of transcriptome sequences, which are very useful resources for gene annotation and discovery, molecular marker development, genome assembly and annotation. The EST-SSR markers identified and developed in this study will facilitate marker-assisted selection breeding.
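    EST-SSR surveys like the one in this record scan assembled sequences for short motifs repeated in tandem. The abstract does not state the search criteria or the tool used, so the sketch below is only an illustration under assumed thresholds (motifs of 2 to 6 bp, at least 5 tandem copies); the function name `find_ssrs` and its parameters are hypothetical:

```python
import re


def find_ssrs(sequence, min_repeats=5, min_motif=2, max_motif=6):
    """Return (start, motif, copies) for each tandem repeat found.

    Thresholds are illustrative assumptions, not the paper's settings.
    Note: a homopolymer run such as AAAAAAAAAA is reported here as an
    'AA' repeat; dedicated SSR tools handle that case separately.
    """
    # Lazy motif group followed by at least (min_repeats - 1) more copies.
    pattern = re.compile(
        r"([ACGT]{%d,%d}?)\1{%d,}" % (min_motif, max_motif, min_repeats - 1)
    )
    hits = []
    for match in pattern.finditer(sequence.upper()):
        motif = match.group(1)
        copies = len(match.group(0)) // len(motif)
        hits.append((match.start(), motif, copies))
    return hits


toy_hits = find_ssrs("TTGACACACACACAGG")
print(toy_hits)
```

    On the toy sequence the scan reports one dinucleotide repeat, the motif AC copied five times starting at offset 3.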

  20. Discovery of Novel Antimicrobial Peptides from Varanus komodoensis (Komodo Dragon) by Large-Scale Analyses and De-Novo-Assisted Sequencing Using Electron-Transfer Dissociation Mass Spectrometry.

    PubMed

    Bishop, Barney M; Juba, Melanie L; Russo, Paul S; Devine, Megan; Barksdale, Stephanie M; Scott, Shaylyn; Settlage, Robert; Michalak, Pawel; Gupta, Kajal; Vliet, Kent; Schnur, Joel M; van Hoek, Monique L

    2017-04-07

    Komodo dragons are the largest living lizards and are the apex predators in their environs. They endure numerous strains of pathogenic bacteria in their saliva and recover from wounds inflicted by other dragons, reflecting the inherent robustness of their innate immune defense. We have employed a custom bioprospecting approach combining partial de novo peptide sequencing with transcriptome assembly to identify cationic antimicrobial peptides from Komodo dragon plasma. Through these analyses, we identified 48 novel potential cationic antimicrobial peptides. All but one of the identified peptides were derived from histone proteins. The antimicrobial effectiveness of eight of these peptides was evaluated against Pseudomonas aeruginosa (ATCC 9027) and Staphylococcus aureus (ATCC 25923), with seven peptides exhibiting antimicrobial activity against both microbes and one only showing significant potency against P. aeruginosa. This study demonstrates the power and promise of our bioprospecting approach to cationic antimicrobial peptide discovery, and it reveals the presence of a plethora of novel histone-derived antimicrobial peptides in the plasma of the Komodo dragon. These findings may have broader implications regarding the role that intact histones and histone-derived peptides play in defending the host from infection. Data are available via ProteomeXChange with identifier PXD005043.

  1. De Novo Sequencing and Assembly Analysis of the Pseudostellaria heterophylla Transcriptome

    PubMed Central

    Li, Jun; Zhen, Wei; Long, Dengkai; Ding, Ling; Gong, Anhui; Xiao, Chenghong; Jiang, Weike; Liu, Xiaoqing; Zhou, Tao; Huang, Luqi

    2016-01-01

    Pseudostellaria heterophylla (Miq.) Pax is a mild tonic herb widely cultivated in the Southern part of China. The tuberous roots of P. heterophylla accumulate high levels of secondary metabolism products of medicinal value such as saponins, flavonoids, and isoquinoline alkaloids. Despite numerous studies on the pharmacological importance and purification of these compounds in P. heterophylla, their biosynthesis is not well understood. In the present study, we used Illumina HiSeq 4000 sequencing platform to sequence the RNA from flowers, leaves, stem, root cortex and xylem tissues of P. heterophylla. We obtained 616,413,316 clean reads that we assembled into 127,334 unique sequences with an N50 length of 951 bp. Among these unigenes, 53,184 unigenes (41.76%) were annotated in a public database and 39,795 unigenes were assigned to 356 KEGG pathways; 23,714 unigenes (8.82%) had high homology with the genes from Beta vulgaris. We discovered 32,095 DEGs in different tissues and performed GO and KEGG enrichment analysis. The most enriched KEGG pathway of secondary metabolism showed up-regulated expression in tuberous roots as compared with the ground parts of P. heterophylla. Moreover, we identified 72 candidate genes involved in triterpenoids saponins biosynthesis in P. heterophylla. The expression profiles of 11 candidate unigenes were analyzed by quantitative real-time PCR (RT-qPCR). Our study established a global transcriptome database of P. heterophylla for gene identification and regulation. We also identified the candidate unigenes involved in triterpenoids saponins biosynthesis. Our results provide an invaluable resource for the secondary metabolites and physiological processes in different tissues of P. heterophylla. PMID:27764127
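    Several records in this list report an N50 length as a contiguity measure. N50 is the length of the shortest sequence in the smallest set of longest sequences that together cover at least half of the total assembly length. A minimal sketch of that standard definition; the function name `n50` and the toy lengths are illustrative and do not come from any of the studies:

```python
def n50(lengths):
    """Return the N50 of a collection of contig or unigene lengths."""
    if not lengths:
        raise ValueError("lengths must not be empty")
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        # The first length that pushes the running sum to at least half
        # of the total assembly size is the N50.
        if running >= half_total:
            return length


toy_n50 = n50([100, 200, 300, 400, 500])
print(toy_n50)
```

    For the toy lengths the total is 1,500 bp; the two longest sequences (500 + 400) already exceed 750 bp, so the N50 is 400.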

  2. De novo assembly and characterization of germinating lettuce seed transcriptome using Illumina paired-end sequencing.

    PubMed

    Liu, Shu-Jun; Song, Shun-Hua; Wang, Wei-Qing; Song, Song-Quan

    2015-11-01

    At supraoptimal temperature, germination of lettuce (Lactuca sativa L.) seeds exhibits a typical germination thermoinhibition, which can be alleviated by sodium nitroprusside (SNP) in a nitric oxide-dependent manner. However, the molecular mechanism of seed germination thermoinhibition and its alleviation by SNP are poorly understood. In the present study, lettuce seeds imbibed at optimal temperature in water or at supraoptimal temperature with or without 100 μM SNP for different periods of time were used as experimental materials. Total RNA was extracted and sequenced, yielding 147,271,347 raw reads by Illumina paired-end sequencing, and the transcriptome of germinating lettuce seeds was assembled. A total of 51,792 unigenes with a mean length of 849 nucleotides were obtained. Of these unigenes, a total of 29,542 unigenes were annotated by sequence similarity searching in four databases: NCBI non-redundant protein database, SwissProt protein database, euKaryotic Ortholog Groups database, and NCBI nucleotide database. Among the annotated unigenes, 22,276 unigenes were assigned to the Gene Ontology database. When all the annotated unigenes were searched against the Kyoto Encyclopedia of Genes and Genomes Pathway database, a total of 8,810 unigenes were mapped to 5 main categories including 260 pathways. We obtained, for the first time in lettuce, many unigenes encoding proteins involved in abscisic acid (ABA) signaling, including 11 ABA receptors, 94 protein phosphatase 2Cs and 16 sucrose non-fermenting 1-related protein kinases. These results will help us to better understand the molecular mechanism of seed germination, thermoinhibition of seed germination and its alleviation by SNP.

  3. A Cost-Effective Approach to Sequence Hundreds of Complete Mitochondrial Genomes

    PubMed Central

    Oleksiak, Marjorie F.

    2016-01-01

    We present a cost-effective approach to sequence whole mitochondrial genomes for hundreds of individuals. Our approach uses small reaction volumes and unmodified (non-phosphorylated) barcoded adaptors to minimize reagent costs. We demonstrate our approach by sequencing 383 Fundulus sp. mitochondrial genomes (192 F. heteroclitus and 191 F. majalis). Prior to sequencing, we amplified the mitochondrial genomes using 4–5 custom-made, overlapping primer pairs, and sequencing was performed on an Illumina HiSeq 2500 platform. After removing low quality and short sequences, 2.9 million and 2.8 million reads were generated for F. heteroclitus and F. majalis respectively. Individual genomes were assembled for each species by mapping barcoded reads to a reference genome. For F. majalis, the reference genome was built de novo. On average, individual consensus sequences had high coverage: 61-fold for F. heteroclitus and 57-fold for F. majalis. The approach discussed in this paper is optimized for sequencing mitochondrial genomes on an Illumina platform. However, with the proper modifications, this approach could be easily applied to other small genomes and sequencing platforms. PMID:27505419

  4. De novo sequencing, assembly and analysis of eight different transcriptomes from the Malayan pangolin.

    PubMed

    Mohamed Yusoff, Aini; Tan, Tze King; Hari, Ranjeev; Koepfli, Klaus-Peter; Wee, Wei Yee; Antunes, Agostinho; Sitam, Frankie Thomas; Rovie-Ryan, Jeffrine Japning; Karuppannan, Kayal Vizi; Wong, Guat Jah; Lipovich, Leonard; Warren, Wesley C; O'Brien, Stephen J; Choo, Siew Woh

    2016-09-13

    Pangolins are scale-covered mammals, containing eight endangered species. Maintaining pangolins in captivity is a significant challenge, in part because little is known about their genetics. Here we provide the first large-scale sequencing of the critically endangered Manis javanica transcriptomes from eight different organs using Illumina HiSeq technology, yielding ~75 Giga bases and 89,754 unigenes. We found some unigenes involved in the insect hormone biosynthesis pathway and also 747 lipids metabolism-related unigenes that may be insightful to understand the lipid metabolism system in pangolins. Comparative analysis between M. javanica and other mammals revealed many pangolin-specific genes significantly over-represented in stress-related processes, cell proliferation and external stimulus, probably reflecting the traits and adaptations of the analyzed pregnant female M. javanica. Our study provides an invaluable resource for future functional works that may be highly relevant for the conservation of pangolins.

  5. De novo sequencing, assembly and analysis of eight different transcriptomes from the Malayan pangolin

    PubMed Central

    Mohamed Yusoff, Aini; Tan, Tze King; Hari, Ranjeev; Koepfli, Klaus-Peter; Wee, Wei Yee; Antunes, Agostinho; Sitam, Frankie Thomas; Rovie-Ryan, Jeffrine Japning; Karuppannan, Kayal Vizi; Wong, Guat Jah; Lipovich, Leonard; Warren, Wesley C.; O’Brien, Stephen J.; Choo, Siew Woh

    2016-01-01

    Pangolins are scale-covered mammals, containing eight endangered species. Maintaining pangolins in captivity is a significant challenge, in part because little is known about their genetics. Here we provide the first large-scale sequencing of the critically endangered Manis javanica transcriptomes from eight different organs using Illumina HiSeq technology, yielding ~75 Giga bases and 89,754 unigenes. We found some unigenes involved in the insect hormone biosynthesis pathway and also 747 lipids metabolism-related unigenes that may be insightful to understand the lipid metabolism system in pangolins. Comparative analysis between M. javanica and other mammals revealed many pangolin-specific genes significantly over-represented in stress-related processes, cell proliferation and external stimulus, probably reflecting the traits and adaptations of the analyzed pregnant female M. javanica. Our study provides an invaluable resource for future functional works that may be highly relevant for the conservation of pangolins. PMID:27618997

  6. De Novo Transcriptome Sequencing and Analysis of the Cereal Cyst Nematode, Heterodera avenae

    PubMed Central

    Kumar, Mukesh; Gantasala, Nagavara Prasad; Roychowdhury, Tanmoy; Thakur, Prasoon Kumar; Banakar, Prakash; Shukla, Rohit N.; Jones, Michael G. K.; Rao, Uma

    2014-01-01

    The cereal cyst nematode (CCN, Heterodera avenae) is a major pest of wheat (Triticum spp.) that reduces crop yields in many countries. Cyst nematodes are obligate sedentary endoparasites that reproduce by amphimixis. Here, we report the first transcriptome analysis of two stages of H. avenae. After sequencing extracted RNA from pre-parasitic infective juvenile and adult stages of the life cycle, 131 million Illumina high quality paired end reads were obtained which generated 27,765 contigs with N50 of 1,028 base pairs, of which 10,452 were annotated. Comparative analyses were undertaken to evaluate H. avenae sequences with those of other plant, animal and free living nematodes to identify differences in expressed genes. There were 4,431 transcripts common to H. avenae and the free living nematode Caenorhabditis elegans, and 9,462 in common with more closely related potato cyst nematode, Globodera pallida. Annotation of H. avenae carbohydrate active enzymes (CAZy) revealed fewer glycoside hydrolases (GHs) but more glycosyl transferases (GTs) and carbohydrate esterases (CEs) when compared to M. incognita. 1,280 transcripts were found to have secretory signature, presence of signal peptide and absence of transmembrane. In a comparison of genes expressed in the pre-parasitic juvenile and feeding female stages, expression levels of 30 genes with high RPKM (reads per kilobase per million mapped reads) values were analysed by qRT-PCR, which confirmed the observed differences in their expression levels. In addition, we have also developed a user-friendly resource, Heterodera transcriptome database (HATdb) for public access of the data generated in this study. The new data provided on the transcriptome of H. avenae adds to the genetic resources available to study plant parasitic nematodes and provides an opportunity to seek new effectors that are specifically involved in the H. avenae-cereal host interaction. PMID:24802510
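    RPKM, the expression measure mentioned in this record, is conventionally defined as reads per kilobase of transcript per million mapped reads. A minimal sketch of that normalization follows; the function name `rpkm` and the example numbers are illustrative assumptions and are not values from the study:

```python
def rpkm(read_count, transcript_length_bp, total_mapped_reads):
    """Reads per kilobase of transcript per million mapped reads.

    RPKM = read_count * 1e9 / (transcript_length_bp * total_mapped_reads)
    """
    if transcript_length_bp <= 0 or total_mapped_reads <= 0:
        raise ValueError("length and total mapped reads must be positive")
    return read_count * 1e9 / (transcript_length_bp * total_mapped_reads)


# Hypothetical transcript: 500 reads on a 2,000 bp contig,
# in a library with 10 million mapped reads.
example_rpkm = rpkm(500, 2000, 10_000_000)
print(example_rpkm)
```

    Dividing by transcript length and library size is what makes values comparable across contigs of different lengths and across libraries of different depths.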

  7. Detection and phasing of single base de novo mutations in biopsies from human in vitro fertilized embryos by advanced whole-genome sequencing.

    PubMed

    Peters, Brock A; Kermani, Bahram G; Alferov, Oleg; Agarwal, Misha R; McElwain, Mark A; Gulbahce, Natali; Hayden, Daniel M; Tang, Y Tom; Zhang, Rebecca Yu; Tearle, Rick; Crain, Birgit; Prates, Renata; Berkeley, Alan; Munné, Santiago; Drmanac, Radoje

    2015-03-01

    Currently, the methods available for preimplantation genetic diagnosis (PGD) of in vitro fertilized (IVF) embryos do not detect de novo single-nucleotide and short indel mutations, which have been shown to cause a large fraction of genetic diseases. Detection of all these types of mutations requires whole-genome sequencing (WGS). In this study, advanced massively parallel WGS was performed on three 5- to 10-cell biopsies from two blastocyst-stage embryos. Both parents and paternal grandparents were also analyzed to allow for accurate measurements of false-positive and false-negative error rates. Overall, >95% of each genome was called. In the embryos, experimentally derived haplotypes and barcoded read data were used to detect and phase up to 82% of de novo single base mutations with a false-positive rate of about one error per Gb, resulting in fewer than 10 such errors per embryo. This represents a ∼ 100-fold lower error rate than previously published from 10 cells, and it is the first demonstration that advanced WGS can be used to accurately identify these de novo mutations in spite of the thousands of false-positive errors introduced by the extensive DNA amplification required for deep sequencing. Using haplotype information, we also demonstrate how small de novo deletions could be detected. These results suggest that phased WGS using barcoded DNA could be used in the future as part of the PGD process to maximize comprehensiveness in detecting disease-causing mutations and to reduce the incidence of genetic diseases.

  8. De novo transcriptome sequencing and gene expression profiling of spinach (Spinacia oleracea L.) leaves under heat stress

    PubMed Central

    Yan, Jun; Yu, Li; Xuan, Jiping; Lu, Ying; Lu, Shijun; Zhu, Weimin

    2016-01-01

    Spinach (Spinacia oleracea) is cold tolerant but heat sensitive. The spinach variety ‘Island’ is suitable for summer periods. There is a lack of molecular information available for spinach in response to heat stress. In this study, high throughput de novo transcriptome sequencing and gene expression analyses were carried out on leaves of the spinach variety ‘Island’ (grown at 24 °C (control), exposed to 35 °C for 30 min (S1), and 5 h (S2)). A total of 133,200,898 clean reads were assembled into 59,413 unigenes (average size 1259.55 bp). 33,573 unigenes could match to public databases. The number of DEGs was 986 for control vs S1, 1741 for control vs S2, and 1587 for S1 vs S2. Gene Ontology (GO) and pathway enrichment analysis indicated that a great deal of heat-responsive genes and other stress-responsive genes were identified in these DEGs, suggesting that the heat stress may have induced an extensive abiotic stress effect. Comparative transcriptome analysis found 896 unique genes in spinach heat response transcript. The expression patterns of 13 selected genes were verified by RT-qPCR (quantitative real-time PCR). Our study found a series of candidate genes and pathways that may be related to heat resistance in spinach. PMID:26857466

  9. De novo transcriptome sequencing and gene expression profiling of spinach (Spinacia oleracea L.) leaves under heat stress.

    PubMed

    Yan, Jun; Yu, Li; Xuan, Jiping; Lu, Ying; Lu, Shijun; Zhu, Weimin

    2016-02-09

    Spinach (Spinacia oleracea) is cold tolerant but heat sensitive. The spinach variety 'Island' is suitable for summer periods. There is a lack of molecular information available for spinach in response to heat stress. In this study, high throughput de novo transcriptome sequencing and gene expression analyses were carried out on leaves of the spinach variety 'Island' (grown at 24 °C (control), exposed to 35 °C for 30 min (S1), and 5 h (S2)). A total of 133,200,898 clean reads were assembled into 59,413 unigenes (average size 1259.55 bp). 33,573 unigenes could match to public databases. The number of DEGs was 986 for control vs S1, 1741 for control vs S2, and 1587 for S1 vs S2. Gene Ontology (GO) and pathway enrichment analysis indicated that a great deal of heat-responsive genes and other stress-responsive genes were identified in these DEGs, suggesting that the heat stress may have induced an extensive abiotic stress effect. Comparative transcriptome analysis found 896 unique genes in spinach heat response transcript. The expression patterns of 13 selected genes were verified by RT-qPCR (quantitative real-time PCR). Our study found a series of candidate genes and pathways that may be related to heat resistance in spinach.

  10. Whole Exome Sequencing Identifies De Novo Heterozygous CAV1 Mutations Associated with a Novel Neonatal Onset Lipodystrophy Syndrome

    PubMed Central

    Garg, Abhimanyu; Kircher, Martin; del Campo, Miguel; Amato, R. Stephen; Agarwal, Anil K.

    2016-01-01

    Despite remarkable progress in identifying causal genes for many types of genetic lipodystrophies in the last decade, the molecular basis of many extremely rare lipodystrophy patients with distinctive phenotypes remains unclear. We conducted whole exome sequencing of the parents and probands from six pedigrees with neonatal onset of generalized loss of subcutaneous fat with additional distinctive phenotypic features and report de novo heterozygous null mutations, c.424C>T (p. Q142*) and c.479_480delTT (p.F160*), in CAV1 in a 7-year-old male and a 3-year-old female of European origin, respectively. Both the patients had generalized fat loss, thin mottled skin and progeroid features at birth. The male patient had cataracts requiring extraction at age 30 months and the female patient had pulmonary arterial hypertension. Dermal fibroblasts of the female patient revealed negligible CAV1 immunofluorescence staining compared to control but there were no differences in the number and morphology of caveolae upon electron microscopy examination. Based upon the similarities in the clinical features of these two patients, previous reports of CAV1 mutations in patients with lipodystrophies and pulmonary hypertension, and similar features seen in CAV1 null mice, we conclude that these variants are the most likely cause of one subtype of neonatal onset generalized lipodystrophy syndrome. PMID:25898808

  11. De novo assembly, functional annotation, and marker development of Asian pear (Pyrus pyrifolia) fruit transcriptome through massively parallel sequencing.

    PubMed

    Li, J F; Gao, Z; Lou, Y S; Luo, M; Song, S R; Xu, W P; Wang, S P; Zhang, C X

    2015-12-28

    This study investigated the Asian pear transcriptome using the RNA-Seq normalized fruit cDNA library to create a transcriptomic resource for unigene and marker discovery. Following the removal of low-quality reads, 127,085,054 trimmed reads were assembled de novo to yield 37,649 non-redundant unigenes with an average length of 599 bp. Alternative splicing events were detected in 4121 contigs. A total of 30,560 single nucleotide polymorphisms (SNPs) and 7443 simple sequence repeat (SSR) markers were obtained. Approximately 21,449 (56.9%) unigenes were categorized into three gene ontology groups; 3682 (9.8%) were classified into 25 cluster of orthologous groups; and 10,451 (27.8%) were assigned to six Kyoto Encyclopedia of Genes and Genomes pathways. Differentially expressed genes were investigated using the reads per kilobase of the exon model per million reads methodology. A total of 546 unigenes showed significant differences in expression levels at different fruit developmental stages. Gene ontology categories associated with various aspects, including carbohydrate metabolic processes, transmembrane transport, and signal transduction, were enriched with genes with divergent expressions. These Pyrus pyrifolia transcriptome data provide a rich resource for the discovery and identification of new genes. Furthermore, the numerous putative SSRs and SNPs detected in this study will be important resources for the future development of a linkage map or of marker-assisted breeding programs for the Asian pear.

  12. De novo Transcriptome Analysis of Chinese Citrus Fly, Bactrocera minax (Diptera: Tephritidae), by High-Throughput Illumina Sequencing

    PubMed Central

    Wang, Jia; Xiong, Ke-Cai; Liu, Ying-Hong

    2016-01-01

    The Chinese citrus fly, Bactrocera minax (Enderlein), is one of the most devastating pests of citrus in the temperate areas of Asia. So far, studies involving molecular biology and physiology of B. minax are still scarce, partly because of the lack of genomic information and inability to rear this insect in laboratory. In this study, de novo assembly of a transcriptome was performed using Illumina sequencing technology. A total of 20,928,907 clean reads were obtained and assembled into 33,324 unigenes, with an average length of 908.44 bp. Unigenes were annotated by alignment against NCBI non-redundant protein (Nr), Swiss-Prot, Clusters of Orthologous Groups (COG), Gene Ontology (GO), and Kyoto Encyclopedia of Genes and Genomes Pathway (KEGG) database. Genes potentially involved in stress tolerance, including 20 heat shock protein (Hsps) genes, 26 glutathione S-transferases (GSTs) genes, and 2 ferritin subunit genes, were identified. These genes may play roles in stress tolerance in B. minax diapause stage. It has previously been found that 20E application on B. minax pupae could avert diapause, but the underlying mechanisms remain unknown. Thus, genes encoding enzymes in 20E biosynthesis pathway, including Neverland, Spook, Phantom, Disembodied, Shadow, Shade, and Cyp18a1, and genes encoding 20E receptor proteins, ecdysone receptor (EcR) and ultraspiracle (USP), were identified. The expression patterns of 20E-related genes among developmental stages and between 20E-treated and untreated pupae demonstrated their roles in diapause program. In addition, 1,909 simple sequence repeats (SSRs) were detected, which will contribute to molecular marker development. The findings in this study greatly improve our genetic understanding of B. minax, and lay the foundation for future studies on this species. PMID:27331903

  13. De novo transcriptome sequencing and analysis of male, pseudo-male and female yellow perch, Perca flavescens

    PubMed Central

    Li, Yan-He; Wang, Han-Ping; Yao, Hong; O’Bryant, Paul; Rapp, Dean; Guo, Liang; Waly, Eman A.

    2017-01-01

    Transcriptome sequencing could facilitate discovery of sex-biased genes, biological pathways and molecular markers, which could help clarify the molecular mechanism of sex determination and sexual dimorphism, and assist with selective breeding in aquaculture. Yellow perch has a unique gonad system and sexual dimorphism and is an alternative model for studying the mechanisms of sex determination, sexual dimorphism and sexual selection. In this study, we performed the de novo assembly of yellow perch gonad and muscle transcriptomes by high-throughput Illumina sequencing. A total of 212,180 contigs were obtained, ranging from 127 to 64,876 bp, with an N50 of 1,066 bp. The assembled RNA-Seq contigs (≥200 bp) were then used for subsequent analyses, including annotation, pathway analysis, and microsatellite discovery. No female- or pseudo-male-biased genes were involved in any pathways, while male-biased genes were involved in 29 pathways, among which the neuroactive ligand-receptor interaction pathway and the enzyme trypsin (enzyme code, EC: 3.4.21.4) were highly involved. Pyruvate kinase (enzyme code, EC: 2.7.1.40), which plays important roles in cell proliferation, was highly expressed in muscles. In addition, a total of 183,939 SNPs, 11,286 InDels and 41,479 microsatellites were identified. This study is the first report on transcriptome information in Percids, and provides rich resources for conducting further studies on understanding the molecular basis of sex determination, sexual dimorphism, and sexual selection in fish, and for population studies and marker-assisted selection in Percids. PMID:28158238
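
    The N50 quoted above is a standard contiguity statistic: the contig length at which contigs of that length or longer contain at least half of the assembled bases. The following is a minimal sketch of that definition, not the authors' code; the function name `n50` and the toy contig lengths are assumptions made for illustration.

```python
def n50(lengths):
    """Return the N50 of a collection of contig lengths.

    N50 is the length L such that contigs of length >= L together
    contain at least half of all assembled bases.
    """
    if not lengths:
        raise ValueError("need at least one contig length")
    total = sum(lengths)
    running = 0
    # Walk from the longest contig down, accumulating bases.
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length


# Toy contig lengths (hypothetical, not data from the study).
contig_lengths = [2, 2, 2, 3, 3, 4, 8, 8]
print(n50(contig_lengths))  # prints 8: the two 8 bp contigs hold 16 of 32 bases
```

    Because N50 depends only on the length distribution, it says nothing about whether the contigs are correct, which is why abstracts such as this one report it alongside annotation results.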

  14. De novo sequencing of RCB-1 to -3: peptide biomarkers from the castor bean plant Ricinus communis.

    PubMed

    Ovenden, Simon P B; Fredriksson, Sten-Ake; Bagas, Christina K; Bergström, Tomas; Thomson, Stuart A; Nilsson, Calle; Bourne, David J

    2009-05-15

    Ricinus communis (also known as the castor bean plant), whose forebears escaped from suburban gardens or commercial cultivation, grows wild in many countries. In temperate and tropical climates seeds will develop to maturity, and plants may be perennial. In Australia these plants have become widespread and are regarded as noxious weeds in many localities. The seeds of R. communis contain ricin, a protein toxin which can easily be extracted into an aqueous solution. Ricin is toxic by ingestion, inhalation, and injection. The history of terrorist and anarchist interest in the use of seeds from R. communis has driven the development of strategies for determination of cultivar and geographic location of the source of an extract of wild-grown castor bean seed. This forensic information is of considerable interest to law enforcement and intelligence organizations. During forensic studies of both the metabolome and proteome of extracts from eight specimens of six different cultivars of R. communis ("zanzibariensis" collected from Kenya and Tanzania, "gibsonii", "impala", "dehradun", "carmencita", and "sanguineus" collected from Spain and Tanzania), three peptide biomarkers (designated Ricinus communis biomarkers, or RCB) were identified in both the MALDI and electrospray LC-MS spectra. Two of these peptides (RCB-1 and RCB-2) were present in varying amounts in all cultivars, while RCB-3 was present only in the "carmencita" cultivar. The amino acid sequences of RCB-1 to -3 were determined using LC-MS(n) fragmentation and de novo sequencing on both the intact and the carbamidomethyl modified peptides. The connectivity of the two disulfide bonds that were present in all three RCB was determined using a strategy of partial reduction and differential alkylation using tris-(2-carboxyethyl)phosphine with N-ethylmaleimide to reduce and alkylate the most accessible disulfide bond, followed by reduction and alkylation of the remaining disulfide bond with dithiothreitol and

  15. De novo Sequencing and Comparative Transcriptomics of Floral Development of the Distylous Species Lithospermum multiflorum

    PubMed Central

    Cohen, James I.

    2016-01-01

    Genes controlling the morphological, micromorphological, and physiological components of the breeding system distyly have been hypothesized, but many of the genes have not been investigated throughout development of the two floral morphs. To this end, the present study is an examination of comparative transcriptomes from three stages of development for the floral organs of the morphs of Lithospermum multiflorum. Transcriptomes of flowers of the two morphs, from various stages of development, were sequenced using an Illumina HiSeq 2000. The floral transcriptome of L. multiflorum was assembled, and differential gene expression (DE) was identified between morphs, throughout development. Additionally, Gene Ontology (GO) terms for DE genes were determined. Fewer genes were DE early in development compared to later in development, with more genes highly expressed in the gynoecium of the SS morph and the corolla and androecium of the LS morph. A reciprocal pattern was observed later in development, and many more genes were DE during this latter stage. During early development, DE genes appear to be involved in growth and floral development, and during later development, DE genes seem to affect physiological functions. Interestingly, many genes involved in response to stress were identified as DE between morphs. PMID:28066486

  16. Transcriptomic Analysis of Flower Blooming in Jasminum sambac through De Novo RNA Sequencing.

    PubMed

    Li, Yong-Hua; Zhang, Wei; Li, Yong

    2015-06-10

    Flower blooming is a critical and complicated plant developmental process in flowering plants. However, insufficient information is available about the complex network that regulates flower blooming in Jasminum sambac. In this study, we used the RNA-Seq platform to analyze the molecular regulation of flower blooming in J. sambac by comparing the transcript profiles at two flower developmental stages: budding and blooming. A total of 4577 differentially-expressed genes (DEGs) were identified between the two floral stages. The Gene Ontology and the Kyoto Encyclopedia of Genes and Genomes pathway enrichment analyses revealed that the DEGs in the "oxidation-reduction process", "extracellular region", "steroid biosynthesis", "glycosphingolipid biosynthesis", "plant hormone signal transduction" and "pentose and glucuronate interconversions" might be associated with flower development. A total of 103 and 92 unigenes exhibited sequence similarities to the known flower development and floral scent genes from other plants. Among these unigenes, five flower development and 19 floral scent unigenes exhibited at least four-fold differences in expression between the two stages. Our results provide abundant genetic resources for studying the flower blooming mechanisms and molecular breeding of J. sambac.

  17. Sequencing, De Novo Assembly and Annotation of the Colorado Potato Beetle, Leptinotarsa decemlineata, Transcriptome

    PubMed Central

    Kumar, Abhishek; Congiu, Leonardo; Lindström, Leena; Piiroinen, Saija; Vidotto, Michele; Grapputo, Alessandro

    2014-01-01

    Background: The Colorado potato beetle (Leptinotarsa decemlineata) is a major pest and a serious threat to potato cultivation throughout the northern hemisphere. Despite its high importance for invasion biology, phenology and pest management, little is known about L. decemlineata from a genomic perspective. We subjected European L. decemlineata adult and larval transcriptome samples to 454-FLX massively-parallel DNA sequencing to characterize a basal set of genes from this species. We created a combined assembly of the adult and larval datasets including the publicly available midgut larval Roche 454 reads and provided basic annotation. We were particularly interested in diapause-specific genes and genes involved in pesticide and Bacillus thuringiensis (Bt) resistance. Results: Using 454-FLX pyrosequencing, we obtained a total of 898,048 reads which, together with the publicly available 804,056 midgut larval reads, were assembled into 121,912 contigs. We established a repository of genes of interest, with 101 out of the 108 diapause-specific genes described in Drosophila montana; and 621 contigs involved in insecticide resistance, including 221 CYP450, 45 GSTs, 13 catalases, 15 superoxide dismutases, 22 glutathione peroxidases, 194 esterases, 3 ADAM metalloproteases, 10 cadherins and 98 calmodulins. We found 460 putative miRNAs and we predicted a significant number of single nucleotide polymorphisms (29,205) and microsatellite loci (17,284). Conclusions: This report of the assembly and annotation of the transcriptome of L. decemlineata offers new insights into diapause-associated and insecticide-resistance-associated genes in this species and provides a foundation for comparative studies with other species of insects. The data will also open new avenues for researchers using L. decemlineata as a model species, and for pest management research. Our results provide the basis for performing future gene expression and functional analysis in L. decemlineata and improve our

  18. BAL31-NGS approach for identification of telomeres de novo in large genomes.

    PubMed

    Peška, Vratislav; Sitová, Zdeňka; Fajkus, Petr; Fajkus, Jiří

    2017-02-01

    This article describes a novel method to identify as yet undiscovered telomere sequences, which combines next generation sequencing (NGS) with BAL31 digestion of high molecular weight DNA. The method was applied to two groups of plants: i) dicots, genus Cestrum, and ii) monocots, Allium species (e.g. A. ursinum and A. cepa). Both groups consist of species with large genomes (tens of Gb) and a low number of chromosomes (2n=14-16), full of repeat elements. Both genera lack typical telomeric repeats and multiple studies have attempted to characterize alternative telomeric sequences. However, despite interesting hypotheses and suggestions of alternative candidate telomeres (retrotransposons, rDNA, satellite repeats) these studies have not resolved the question. In a novel approach based on the two most general features of eukaryotic telomeres, their repetitive character and sensitivity to BAL31 nuclease digestion, we have taken advantage of the capacity and current affordability of NGS in combination with the robustness of classical BAL31 nuclease digestion of chromosomal termini. While representative samples of most repeat elements were ensured by low-coverage (less than 5%) genomic shot-gun NGS, candidate telomeres were identified as under-represented sequences in BAL31-treated samples.
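
    The selection criterion in the last sentence, sequences that become under-represented after BAL31 digestion, can be illustrated with a simple k-mer comparison. The sketch below is a deliberately simplified stand-in and not the authors' pipeline: the function names, the choice of k, the `max_ratio` threshold, and the toy reads are all assumptions.

```python
from collections import Counter


def kmer_freqs(reads, k):
    """Return the relative frequency of every k-mer found in the reads."""
    counts = Counter(
        read[i:i + k] for read in reads for i in range(len(read) - k + 1)
    )
    total = sum(counts.values())
    return {kmer: n / total for kmer, n in counts.items()}


def depleted_kmers(control_reads, treated_reads, k, max_ratio=0.5):
    """Return k-mers whose frequency drops after nuclease treatment.

    A k-mer is reported when (treated frequency / control frequency)
    is at most max_ratio; the ratio is stored as the value.
    """
    control = kmer_freqs(control_reads, k)
    treated = kmer_freqs(treated_reads, k)
    result = {}
    for kmer, freq in control.items():
        ratio = treated.get(kmer, 0.0) / freq
        if ratio <= max_ratio:
            result[kmer] = ratio
    return result


# Toy reads (hypothetical): a terminal repeat present only before digestion.
control = ["TTAGGGTTAGGG", "ACGTACGTACGT"]
treated = ["ACGTACGTACGT", "ACGTACGTACGT"]
candidates = depleted_kmers(control, treated, k=6)
print(sorted(candidates))
```

    In the toy data every k-mer from the first control read is absent after treatment and is therefore reported, while the internal ACGT repeat is not. Real data would additionally need normalization for sequencing depth and a filter for repetitiveness, since the abstract relies on both features of telomeres.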

  19. Identifying driver mutations in sequenced cancer genomes: computational approaches to enable precision medicine

    PubMed Central

    2014-01-01

    High-throughput DNA sequencing is revolutionizing the study of cancer and enabling the measurement of the somatic mutations that drive cancer development. However, the resulting sequencing datasets are large and complex, obscuring the clinically important mutations in a background of errors, noise, and random mutations. Here, we review computational approaches to identify somatic mutations in cancer genome sequences and to distinguish the driver mutations that are responsible for cancer from random, passenger mutations. First, we describe approaches to detect somatic mutations from high-throughput DNA sequencing data, particularly for tumor samples that comprise heterogeneous populations of cells. Next, we review computational approaches that aim to predict driver mutations according to their frequency of occurrence in a cohort of samples, or according to their predicted functional impact on protein sequence or structure. Finally, we review techniques to identify recurrent combinations of somatic mutations, including approaches that examine mutations in known pathways or protein-interaction networks, as well as de novo approaches that identify combinations of mutations according to statistical patterns of mutual exclusivity. These techniques, coupled with advances in high-throughput DNA sequencing, are enabling precision medicine approaches to the diagnosis and treatment of cancer. PMID:24479672

  20. Analysis of de novo sequencing and transcriptome assembly and lignocellulolytic enzymes gene expression of Coriolopsis gallica HTC.

    PubMed

    Chen, Yuehong; Cao, Qinghua; Tao, Xiang; Shao, Huanhuan; Zhang, Kun; Zhang, Yizheng; Tan, Xuemei

    2017-03-01

    White-rot basidiomycete Coriolopsis gallica HTC is one of the main biodegraders of poplar. In our previous study, we have shown the strong capacity of C. gallica HTC to degrade lignocellulose. In this study, equal amounts of total RNA from C. gallica HTC cultures grown in different conditions were pooled together. Illumina paired-end RNA sequencing was performed, and 13.2 million 90-bp paired-end reads were generated. We chose the Merged Assembly of Oases data-set for the following BLAST searches and gene ontology analyses. The reads were assembled de novo into 28,034 transcripts (≥100 bp) using the combined assembly strategy MAO. The transcripts were annotated using Blast2GO. In all, 18,810 transcripts (≥100 bp) achieved BLASTX hits, of which 7048 transcripts had GO terms and 2074 had ECs. The expression levels of 11 lignocellulolytic enzyme genes from the assembled C. gallica HTC transcriptome were detected by real-time quantitative polymerase chain reaction. The results showed that expression levels of these genes were affected by carbon source and nitrogen source at the level of transcription. The current abundant transcriptome data allowed the identification of many new transcripts in C. gallica HTC. Data provided here represent the most comprehensive and integrated genomic resources for cloning and identifying genes of interest from C. gallica HTC. Characterization of the C. gallica HTC transcriptome provides an effective tool to understand mechanisms underlying cellular and molecular functions of C. gallica HTC.

  1. De novo transcriptome sequencing in Bixa orellana to identify genes involved in methylerythritol phosphate, carotenoid and bixin biosynthesis

    DOE PAGES

    Cárdenas-Conejo, Yair; Carballo-Uicab, Víctor; Lieberman, Meric; ...

    2015-10-28

    Bixin or annatto is a commercially important natural orange-red pigment derived from lycopene that is produced and stored in seeds of Bixa orellana L. An enzymatic pathway for bixin biosynthesis was inferred from homology of putative proteins encoded by differentially expressed seed cDNAs. Some activities were later validated in a heterologous system. Nevertheless, much of the pathway remains to be clarified. For example, it is essential to identify the methylerythritol phosphate (MEP) and carotenoid pathway genes. In order to investigate the MEP, carotenoid, and bixin pathway genes, total RNA from young leaves and two different developmental stages of seeds from B. orellana was used for the construction of indexed mRNA libraries, sequenced on the Illumina HiSeq 2500 platform and assembled de novo using Velvet, CLC Genomics Workbench and CAP3 software. A total of 52,549 contigs were obtained with an average length of 1,924 bp. Two phylogenetic analyses of inferred proteins, in one case encoded by thirteen general, single-copy cDNAs, in the other from carotenoid and MEP cDNAs, indicated that B. orellana is closely related to sister Malvales species cacao and cotton. Using homology, we identified 7 and 14 core gene products from the MEP and carotenoid pathways, respectively. Surprisingly, previously defined bixin pathway cDNAs were not present in our transcriptome. Here we propose a new set of gene products involved in the bixin pathway. In conclusion, the identification and qRT-PCR quantification of cDNAs involved in annatto production suggest a hypothetical model for bixin biosynthesis that involves coordinated activation of some MEP, carotenoid and bixin pathway genes. These findings provide a better understanding of the mechanisms regulating these pathways and will facilitate the genetic improvement of B. orellana.

  2. An oligonucleotide hybridization approach to DNA sequencing.

    PubMed

    Khrapko, K R; Lysov YuP; Khorlyn, A A; Shick, V V; Florentiev, V L; Mirzabekov, A D

    1989-10-09

    We have proposed a DNA sequencing method based on hybridization of a DNA fragment to be sequenced with the complete set of fixed-length oligonucleotides (e.g., 4^8 = 65,536 possible 8-mers) immobilized individually as dots of a 2-D matrix [(1989) Dokl. Akad. Nauk SSSR 303, 1508-1511]. It was shown that the list of hybridizing octanucleotides is sufficient for the computer-assisted reconstruction of the structures for 80% of random-sequence fragments up to 200 bases long, based on the analysis of the octanucleotide overlapping. Here a refinement of the method and some experimental data are presented. We have performed hybridizations with oligonucleotides immobilized on a glass plate, and obtained their dissociation curves down to heptanucleotides. Other approaches, e.g., an additional hybridization of short oligonucleotides which continuously extend duplexes formed between the fragment and immobilized oligonucleotides, should considerably increase either the probability of unambiguous reconstruction, or the length of reconstructed sequences, or decrease the size of immobilized oligonucleotides.
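
    The reconstruction idea described above, recovering a fragment from the set of oligonucleotides it hybridizes to by following their overlaps, can be sketched in a few lines. This is an illustrative toy and not the authors' algorithm: it assumes an error-free spectrum, a known starting k-mer and a known fragment length, and k of at least 2. The names `spectrum` and `reconstruct` and the example sequences are hypothetical.

```python
def spectrum(seq, k):
    """Return the set of all k-mers in seq (an idealized hybridization result)."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def reconstruct(kmers, start, length):
    """Extend `start` one base at a time using (k-1)-base overlaps.

    Returns the reconstructed sequence, or None as soon as the next base
    is ambiguous (several k-mers fit) or missing (no k-mer fits).
    """
    k = len(start)
    seq = start
    while len(seq) < length:
        suffix = seq[-(k - 1):]
        candidates = [suffix + base for base in "ACGT" if suffix + base in kmers]
        if len(candidates) != 1:
            return None
        seq += candidates[0][-1]
    return seq


print(4 ** 8)  # prints 65536, the number of possible 8-mers

fragment = "ATGCGTACCTGA"  # toy fragment, not from the paper
print(reconstruct(spectrum(fragment, 4), fragment[:4], len(fragment)))

# A repeated (k-1)-mer makes extension ambiguous, so reconstruction fails.
print(reconstruct(spectrum("ATGATGC", 3), "ATG", 7))  # prints None
```

    The failure case shows why the abstract reports success for only a fraction of fragments: once a (k-1)-base overlap occurs more than once in the fragment, the overlap information alone no longer determines a unique sequence, and longer probes or additional hybridizations are needed.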

  3. Evaluating Characteristics of De Novo Assembly Software on 454 Transcriptome Data: A Simulation Approach

    PubMed Central

    Mundry, Marvin; Bornberg-Bauer, Erich; Sammeth, Michael; Feulner, Philine G. D.

    2012-01-01

    Background: The quantity of transcriptome data is rapidly increasing for non-model organisms. As sequencing technology advances, focus shifts towards solving bioinformatic challenges, of which sequence read assembly is the first task. Recent studies have compared the performance of different software to establish a best practice for transcriptome assembly. Here, we adapted a simulation approach to evaluate specific features of assembly programs on 454 data. The novelty of our study is that the simulation allows us to calculate a model assembly as reference point for comparison. Findings: The simulation approach allows us to compare basic metrics of assemblies computed by different software applications (CAP3, MIRA, Newbler, and Oases) to a known optimal solution. We found MIRA and CAP3 are conservative in merging reads. This resulted in a comparably high number of short contigs. In contrast, Newbler more readily merged reads into longer contigs, while Oases produced the overall shortest assembly. Due to the simulation approach, reads could be traced back to their correct placement within the transcriptome. Together with mapping reads onto the assembled contigs, we were able to evaluate ambiguity in the assemblies. This analysis further supported the conservative nature of MIRA and CAP3, which resulted in low proportions of chimeric contigs, but high redundancy. Newbler produced less redundancy, but the proportion of chimeric contigs was higher. Conclusion: Our evaluation of four assemblers suggested that MIRA and Newbler slightly outperformed the other programs, while showing contrasting characteristics. Oases did not perform very well on the 454 reads. Our evaluation indicated that the software was either conservative (MIRA) or liberal (Newbler) about merging reads into contigs. This suggested that in choosing an assembly program researchers should carefully consider their follow-up analysis and the consequences of the approach chosen to obtain an assembly. PMID:22384018

  4. De novo sequence analysis and intact mass measurements for characterization of phycocyanin subunit isoforms from the blue-green alga Aphanizomenon flos-aquae.

    PubMed

    Rinalducci, Sara; Roepstorff, Peter; Zolla, Lello

    2009-04-01

    In this work, partial characterization of the primary structure of phycocyanin from the cyanobacterium Aphanizomenon flos-aquae (AFA) was achieved by mass spectrometry de novo sequencing with the aid of chemical derivatization. Combining N-terminal sulfonation of tryptic peptides by 4-sulfophenyl isothiocyanate (SPITC) and MALDI-TOF/TOF analyses facilitated the acquisition of sequence information for AFA phycocyanin subunits. In fact, SPITC-derivatized peptides underwent facile fragmentation, predominantly resulting in y-series ions in the MS/MS spectra and often exhibiting uninterrupted sequences of 20 or more amino acid residues. This strategy allowed us to carry out peptide fragment fingerprinting and de novo sequencing of several peptides belonging to both alpha- and beta-phycocyanin polypeptides, obtaining a sequence coverage of 67% and 75%, respectively. The presence of different isoforms of phycocyanin subunits was also revealed; subsequently, Intact Mass Measurements (IMMs) by both MALDI- and ESI-MS supported the detection of these protein isoforms. Finally, we discuss the evolutionary importance of phycocyanin isoforms in cyanobacteria, suggesting the possible use of the phycocyanin operon for a correct taxonomic identity of this species.

  5. Frequency and Complexity of De Novo Structural Mutation in Autism

    PubMed Central

    Brandler, William M.; Antaki, Danny; Gujral, Madhusudan; Noor, Amina; Rosanio, Gabriel; Chapman, Timothy R.; Barrera, Daniel J.; Lin, Guan Ning; Malhotra, Dheeraj; Watts, Amanda C.; Wong, Lawrence C.; Estabillo, Jasper A.; Gadomski, Therese E.; Hong, Oanh; Fajardo, Karin V. Fuentes; Bhandari, Abhishek; Owen, Renius; Baughn, Michael; Yuan, Jeffrey; Solomon, Terry; Moyzis, Alexandra G.; Maile, Michelle S.; Sanders, Stephan J.; Reiner, Gail E.; Vaux, Keith K.; Strom, Charles M.; Zhang, Kang; Muotri, Alysson R.; Akshoomoff, Natacha; Leal, Suzanne M.; Pierce, Karen; Courchesne, Eric; Iakoucheva, Lilia M.; Corsello, Christina; Sebat, Jonathan

    2016-01-01

    Genetic studies of autism spectrum disorder (ASD) have established that de novo duplications and deletions contribute to risk. However, ascertainment of structural variants (SVs) has been restricted by the coarse resolution of current approaches. By applying a custom pipeline for SV discovery, genotyping, and de novo assembly to genome sequencing of 235 subjects (71 affected individuals, 26 healthy siblings, and their parents), we compiled an atlas of 29,719 SV loci (5,213/genome), comprising 11 different classes. We found a high diversity of de novo mutations, the majority of which were undetectable by previous methods. In addition, we observed complex mutation clusters where combinations of de novo SVs, nucleotide substitutions, and indels occurred as a single event. We estimate a high rate of structural mutation in humans (20%) and propose that genetic risk for ASD is attributable to an elevated frequency of gene-disrupting de novo SVs, but not an elevated rate of genome rearrangement. PMID:27018473

  6. Frequency and Complexity of De Novo Structural Mutation in Autism.

    PubMed

    Brandler, William M; Antaki, Danny; Gujral, Madhusudan; Noor, Amina; Rosanio, Gabriel; Chapman, Timothy R; Barrera, Daniel J; Lin, Guan Ning; Malhotra, Dheeraj; Watts, Amanda C; Wong, Lawrence C; Estabillo, Jasper A; Gadomski, Therese E; Hong, Oanh; Fajardo, Karin V Fuentes; Bhandari, Abhishek; Owen, Renius; Baughn, Michael; Yuan, Jeffrey; Solomon, Terry; Moyzis, Alexandra G; Maile, Michelle S; Sanders, Stephan J; Reiner, Gail E; Vaux, Keith K; Strom, Charles M; Zhang, Kang; Muotri, Alysson R; Akshoomoff, Natacha; Leal, Suzanne M; Pierce, Karen; Courchesne, Eric; Iakoucheva, Lilia M; Corsello, Christina; Sebat, Jonathan

    2016-04-07

    Genetic studies of autism spectrum disorder (ASD) have established that de novo duplications and deletions contribute to risk. However, ascertainment of structural variants (SVs) has been restricted by the coarse resolution of current approaches. By applying a custom pipeline for SV discovery, genotyping, and de novo assembly to genome sequencing of 235 subjects (71 affected individuals, 26 healthy siblings, and their parents), we compiled an atlas of 29,719 SV loci (5,213/genome), comprising 11 different classes. We found a high diversity of de novo mutations, the majority of which were undetectable by previous methods. In addition, we observed complex mutation clusters where combinations of de novo SVs, nucleotide substitutions, and indels occurred as a single event. We estimate a high rate of structural mutation in humans (20%) and propose that genetic risk for ASD is attributable to an elevated frequency of gene-disrupting de novo SVs, but not an elevated rate of genome rearrangement.

  7. Large Scale Discovery and De Novo-Assisted Sequencing of Cationic Antimicrobial Peptides (CAMPs) by Microparticle Capture and Electron-Transfer Dissociation (ETD) Mass Spectrometry.

    PubMed

    Juba, Melanie L; Russo, Paul S; Devine, Megan; Barksdale, Stephanie; Rodriguez, Carlos; Vliet, Kent A; Schnur, Joel M; van Hoek, Monique L; Bishop, Barney M

    2015-10-02

    The identification and sequencing of novel cationic antimicrobial peptides (CAMPs) have proven challenging due to the limitations associated with traditional proteomics methods and difficulties sequencing peptides present in complex biomolecular mixtures. We present here a process for large-scale identification and de novo-assisted sequencing of newly discovered CAMPs using microparticle capture followed by tandem mass spectrometry equipped with electron-transfer dissociation (ETD). This process was initially evaluated and verified using known CAMPs with varying physicochemical properties. The effective parameters were then applied in the analysis of a complex mixture of peptides harvested from American alligator plasma using custom-made (Bioprospector) functionalized hydrogel particles. Here, we report the successful sequencing process for CAMPs that has led to the identification of 340 unique peptides and the discovery of five novel CAMPs from American alligator plasma.

  8. Complete telomere-to-telomere de novo assembly of the Plasmodium falciparum genome through long-read (>11 kb), single molecule, real-time sequencing.

    PubMed

    Vembar, Shruthi Sridhar; Seetin, Matthew; Lambert, Christine; Nattestad, Maria; Schatz, Michael C; Baybayan, Primo; Scherf, Artur; Smith, Melissa Laird

    2016-08-01

    The application of next-generation sequencing to estimate genetic diversity of Plasmodium falciparum, the most lethal malaria parasite, has proved challenging due to the skewed AT-richness [∼80.6% (A + T)] of its genome and the lack of technology to assemble highly polymorphic subtelomeric regions that contain clonally variant, multigene virulence families (Ex: var and rifin). To address this, we performed amplification-free, single molecule, real-time sequencing of P. falciparum genomic DNA and generated reads of average length 12 kb, with 50% of the reads between 15.5 and 50 kb in length. Next, using the Hierarchical Genome Assembly Process, we assembled the P. falciparum genome de novo and successfully compiled all 14 nuclear chromosomes telomere-to-telomere. We also accurately resolved centromeres [∼90-99% (A + T)] and subtelomeric regions and identified large insertions and duplications that add extra var and rifin genes to the genome, along with smaller structural variants such as homopolymer tract expansions. Overall, we show that amplification-free, long-read sequencing combined with de novo assembly overcomes major challenges inherent to studying the P. falciparum genome. Indeed, this technology may not only identify the polymorphic and repetitive subtelomeric sequences of parasite populations from endemic areas but may also evaluate structural variation linked to virulence, drug resistance and disease transmission.

  9. Genetic variation and the de novo assembly of human genomes

    PubMed Central

    Chaisson, Mark J. P.; Wilson, Richard K.; Eichler, Evan E.

    2016-01-01

    The discovery of genetic variation and the assembly of genome sequences are both inextricably linked to advances in DNA-sequencing technology. Short-read massively parallel sequencing has revolutionized our ability to discover genetic variation but is insufficient to generate high-quality genome assemblies or resolve most structural variation. Full resolution of variation is only guaranteed by complete de novo assembly of a genome. Here, we review approaches to genome assembly, the nature of gaps or missing sequences, and biases in the assembly process. We describe the challenges of generating a complete de novo genome assembly using current technologies and the impact that being able to perfectly sequence the genome would have on understanding human disease and evolution. Finally, we summarize recent technological advances that improve both contiguity and accuracy and emphasize the importance of complete de novo assembly as opposed to read mapping as the primary means to understanding the full range of human genetic variation. PMID:26442640

  10. De novo Transcriptome Sequencing and Development of Abscission Zone-Specific Microarray as a New Molecular Tool for Analysis of Tomato Organ Abscission

    PubMed Central

    Sundaresan, Srivignesh; Philosoph-Hadas, Sonia; Riov, Joseph; Mugasimangalam, Raja; Kuravadi, Nagesh A.; Kochanek, Bettina; Salim, Shoshana; Tucker, Mark L.; Meir, Shimon

    2016-01-01

    Abscission of flower pedicels and leaf petioles of tomato (Solanum lycopersicum) can be induced by flower removal or leaf deblading, respectively, which leads to auxin depletion, resulting in increased sensitivity of the abscission zone (AZ) to ethylene. However, the molecular mechanisms that drive the acquisition of abscission competence and its modulation by auxin gradients are not yet known. We used RNA-Sequencing (RNA-Seq) to obtain a comprehensive transcriptome of tomato flower AZ (FAZ) and leaf AZ (LAZ) during abscission. RNA-Seq was performed on a pool of total RNA extracted from tomato FAZ and LAZ, at different abscission stages, followed by de novo assembly. The assembled clusters contained transcripts that are already known in the Solanaceae (SOL) genomics and NCBI databases, and over 8823 identified novel tomato transcripts of varying sizes. An AZ-specific microarray, encompassing the novel transcripts identified in this study and all known transcripts from the SOL genomics and NCBI databases, was constructed to study the abscission process. Multiple probes for longer genes and key AZ-specific genes, including antisense probes for all transcripts, make this array a unique tool for studying abscission with a comprehensive set of transcripts, and for mining for naturally occurring antisense transcripts. We focused on comparing the global transcriptomes generated from the FAZ and the LAZ to establish the divergences and similarities in their transcriptional networks, and particularly to characterize the processes and transcriptional regulators enriched in gene clusters that are differentially regulated in these two AZs. This study is the first attempt to analyze the global gene expression in different AZs in tomato by combining the RNA-Seq technique with oligonucleotide microarrays. Our AZ-specific microarray chip provides a cost-effective approach for expression profiling and robust analysis of multiple samples in a rapid succession. PMID:26834766

  11. UVliPiD: A UVPD-Based Hierarchical Approach for De Novo Characterization of Lipid A Structures.

    PubMed

    Morrison, Lindsay J; Parker, W Ryan; Holden, Dustin D; Henderson, Jeremy C; Boll, Joseph M; Trent, M Stephen; Brodbelt, Jennifer S

    2016-02-02

    The lipid A domain of the endotoxic lipopolysaccharide layer of Gram-negative bacteria is comprised of a diglucosamine backbone to which a variable number of variable length fatty acyl chains are anchored. Traditional characterization of these tails and their linkages by nuclear magnetic resonance (NMR) or mass spectrometry is time-consuming and necessitates databases of pre-existing structures for structural assignment. Here, we introduce an automated de novo approach for characterization of lipid A structures that is completely database-independent. A hierarchical decision-tree MS(n) method is used in conjunction with a hybrid activation technique, UVPDCID, to acquire characteristic fragmentation patterns of lipid A variants from a number of Gram-negative bacteria. Structural assignments are derived from integration of key features from three to five spectra and automated interpretation is achieved in minutes without the need for pre-existing information or candidate structures. The utility of this strategy is demonstrated for a mixture of lipid A structures from an enzymatically modified E. coli lipid A variant. A total of 27 lipid A structures were discovered, many of which were isomeric, showcasing the need for a rapid de novo approach to lipid A characterization.

  12. Novel variation and de novo mutation rates in population-wide de novo assembled Danish trios

    PubMed Central

    Besenbacher, Søren; Liu, Siyang; Izarzugaza, José M. G.; Grove, Jakob; Belling, Kirstine; Bork-Jensen, Jette; Huang, Shujia; Als, Thomas D.; Li, Shengting; Yadav, Rachita; Rubio-García, Arcadio; Lescai, Francesco; Demontis, Ditte; Rao, Junhua; Ye, Weijian; Mailund, Thomas; Friborg, Rune M.; Pedersen, Christian N. S.; Xu, Ruiqi; Sun, Jihua; Liu, Hao; Wang, Ou; Cheng, Xiaofang; Flores, David; Rydza, Emil; Rapacki, Kristoffer; Damm Sørensen, John; Chmura, Piotr; Westergaard, David; Dworzynski, Piotr; Sørensen, Thorkild I. A.; Lund, Ole; Hansen, Torben; Xu, Xun; Li, Ning; Bolund, Lars; Pedersen, Oluf; Eiberg, Hans; Krogh, Anders; Børglum, Anders D.; Brunak, Søren; Kristiansen, Karsten; Schierup, Mikkel H.; Wang, Jun; Gupta, Ramneek; Villesen, Palle; Rasmussen, Simon

    2015-01-01

    Building a population-specific catalogue of single nucleotide variants (SNVs), indels and structural variants (SVs) with frequencies, termed a national pan-genome, is critical for further advancing clinical and public health genetics in large cohorts. Here we report a Danish pan-genome obtained from sequencing 10 trios to high depth (50×). We report 536k novel SNVs and 283k novel short indels from mapping approaches and develop a population-wide de novo assembly approach to identify 132k novel indels larger than 10 nucleotides with low false discovery rates. We identify a higher proportion of indels and SVs than previous efforts showing the merits of high coverage and de novo assembly approaches. In addition, we use trio information to identify de novo mutations and use a probabilistic method to provide direct estimates of 1.27e−8 and 1.5e−9 per nucleotide per generation for SNVs and indels, respectively. PMID:25597990
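
    As a rough worked example of what these per-nucleotide rates imply, the sketch below multiplies them by a genome size. The haploid genome size of about 3e9 nucleotides and the factor of two for a diploid genome are assumptions made here for illustration; neither figure comes from the abstract.

```python
# Per-nucleotide, per-generation rates reported in the abstract.
SNV_RATE = 1.27e-8
INDEL_RATE = 1.5e-9

# Assumption for illustration only: ~3e9 nucleotides per haploid genome.
HAPLOID_GENOME_SIZE = 3.0e9


def expected_de_novo(rate, haploid_size=HAPLOID_GENOME_SIZE, ploidy=2):
    """Expected number of new mutations per offspring genome."""
    return rate * haploid_size * ploidy


snvs = expected_de_novo(SNV_RATE)
indels = expected_de_novo(INDEL_RATE)
print(f"expected de novo SNVs per child:   {snvs:.1f}")
print(f"expected de novo indels per child: {indels:.1f}")
```

    Under these assumptions the rates correspond to tens of new SNVs and a handful of new indels per child; a different genome-size assumption scales both numbers linearly.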

  13. Computational approaches for de novo design and redesign of metal-binding sites on proteins.

    PubMed

    Akcapinar, Gunseli Bayram; Sezerman, Osman Ugur

    2017-04-28

    Metal ions play pivotal roles in protein structure, function and stability. The functional and structural diversity of proteins in nature expanded with the incorporation of metal ions or clusters in proteins. Approximately one-third of these proteins in the databases contain metal ions. Many biological and chemical processes in nature involve metal ion-binding proteins, aka metalloproteins. Many cellular reactions that underpin life require metalloproteins. Most of the remarkable, complex chemical transformations are catalysed by metalloenzymes. Realization of the importance of metal-binding sites in a variety of cellular events led to the advancement of various computational methods for their prediction and characterization. Furthermore, as structural and functional knowledgebase about metalloproteins is expanding with advances in computational and experimental fields, the focus of the research is now shifting towards de novo design and redesign of metalloproteins to extend nature's own diversity beyond its limits. In this review, we will focus on the computational toolbox for prediction of metal ion-binding sites, de novo metalloprotein design and redesign. We will also give examples of tailor-made artificial metalloproteins designed with the computational toolbox.

  14. Sequencing of sporadic Attention-Deficit Hyperactivity Disorder (ADHD) identifies novel and potentially pathogenic de novo variants and excludes overlap with genes associated with autism spectrum disorder.

    PubMed

    Kim, Daniel Seung; Burt, Amber A; Ranchalis, Jane E; Wilmot, Beth; Smith, Joshua D; Patterson, Karynne E; Coe, Bradley P; Li, Yatong K; Bamshad, Michael J; Nikolas, Molly; Eichler, Evan E; Swanson, James M; Nigg, Joel T; Nickerson, Deborah A; Jarvik, Gail P

    2017-03-22

    Attention-Deficit Hyperactivity Disorder (ADHD) has high heritability; however, studies of common variation account for <5% of ADHD variance. Using data from affected participants without a family history of ADHD, we sought to identify de novo variants that could account for sporadic ADHD. Considering a total of 128 families, two analyses were conducted in parallel: first, in 11 unaffected parent/affected proband trios (or quads with the addition of an unaffected sibling) we completed exome sequencing. Six de novo missense variants at highly conserved bases were identified and validated from four of the 11 families: the brain-expressed genes TBC1D9, DAGLA, QARS, CSMD2, TRPM2, and WDR83. Separately, in 117 unrelated probands with sporadic ADHD, we sequenced a panel of 26 genes implicated in intellectual disability (ID) and autism spectrum disorder (ASD) to evaluate whether variation in ASD/ID-associated genes were also present in participants with ADHD. Only one putative deleterious variant (Gln600STOP) in CHD1L was identified; this was found in a single proband. Notably, no other nonsense, splice, frameshift, or highly conserved missense variants in the 26 gene panel were identified and validated. These data suggest that de novo variant analysis in families with independently adjudicated sporadic ADHD diagnosis can identify novel genes implicated in ADHD pathogenesis. Moreover, that only one of the 128 cases (0.8%, 11 exome, and 117 MIP sequenced participants) had putative deleterious variants within our data in 26 genes related to ID and ASD suggests significant independence in the genetic pathogenesis of ADHD as compared to ASD and ID phenotypes. © 2017 Wiley Periodicals, Inc.

  15. PRO_LIGAND: An approach to de novo molecular design. 1. Application to the design of organic molecules

    NASA Astrophysics Data System (ADS)

    Clark, David E.; Frenkel, David; Levy, Stephen A.; Li, Jin; Murray, Christopher W.; Robson, Barry; Waszkowycz, Bohdan; Westhead, David R.

    1995-02-01

    An approach to de novo molecular design, PRO_LIGAND, has been developed that, in the environment of a large, integrated molecular design and simulation system, provides a unified framework for the generation of novel molecules which are either similar or complementary to a specified target. The approach is based on a methodology that has proved to be effective in other studies-placing molecular fragments upon target interaction sites-but incorporates many novel features such as the use of a rapid graph-theoretical algorithm for fragment placing, a generalised driver for structure generation which offers a large variety of fragment assembly strategies to the user and the pre-screening of library fragments. After a detailed description of the relevant modules of the package, PRO_LIGAND's efficacy in aiding rational drug design is demonstrated by its ability to design mimics of methotrexate and potential inhibitors for dihydrofolate reductase and HIV-1 protease.

  16. Dose-dependent de novo germline mutations detected by whole-exome sequencing in progeny of ENU-treated male gpt delta mice.

    PubMed

    Masumura, Kenichi; Toyoda-Hokaiwado, Naomi; Ukai, Akiko; Gondo, Yoichi; Honma, Masamitsu; Nohmi, Takehiko

    2016-11-01

    Germline mutations are an important component of genetic toxicology; however, mutagenicity tests of germline cells are limited. Recent advances in sequencing technology can be used to detect mutations by direct sequencing of genomic DNA (gDNA). We previously reported induced de novo mutations detected using whole-exome sequencing in the offspring of N-ethyl-N-nitrosourea (ENU)-treated mice in a single-dose experiment (85mg/kg, i.p., weekly on two occasions). In this study, two lower doses (10 and 30mg/kg) were added, and dose-response of inherited germline mutations was analyzed. Male gpt delta transgenic mice treated with ENU in three dose groups were mated with untreated females 10 weeks after the last treatment, and offspring were obtained. The ENU-treated male mice showed dose-dependent increases in gpt mutant frequencies in their sperm, testis, and liver. gDNA of one family (parents and four offspring) from each dose group was used for whole-exome sequencing, and unique de novo mutations in the offspring were detected. Frequencies of inherited mutations increased with dosage more than 25-fold in the highest dose group. The mutation spectrum of the inherited mutations showed characteristics of ENU-induced mutations, such as A:T base substitutions. No confirmed mutations were observed in the control group. Filtering using the alternate reads ratio resulted in the mutation frequencies and spectra similar to those obtained by the Sanger sequencing confirmation. These results suggest that direct sequencing analysis may be a useful tool to investigate inherited germline mutations induced by environmental mutagens.

  17. New genes from non-coding sequence: the role of de novo protein-coding genes in eukaryotic evolutionary innovation

    PubMed Central

    McLysaght, Aoife; Guerzoni, Daniele

    2015-01-01

    The origin of novel protein-coding genes de novo was once considered so improbable as to be impossible. In less than a decade, and especially in the last five years, this view has been overturned by extensive evidence from diverse eukaryotic lineages. There is now evidence that this mechanism has contributed a significant number of genes to genomes of organisms as diverse as Saccharomyces, Drosophila, Plasmodium, Arabidopsis and human. From simple beginnings, these genes have in some instances acquired complex structure, regulated expression and important functional roles. New genes are often thought of as dispensable late additions; however, some recent de novo genes in human can play a role in disease. Rather than an extremely rare occurrence, it is now evident that there is a relatively constant trickle of proto-genes released into the testing ground of natural selection. It is currently unknown whether de novo genes arise primarily through an ‘RNA-first’ or ‘ORF-first’ pathway. Either way, evolutionary tinkering with this pool of genetic potential may have been a significant player in the origins of lineage-specific traits and adaptations. PMID:26323763

  18. in silico Whole Genome Sequencer & Analyzer (iWGS): a computational pipeline to guide the design and analysis of de novo genome sequencing studies

    Technology Transfer Automated Retrieval System (TEKTRAN)

    The availability of genomes across the tree of life is highly biased toward vertebrates, pathogens, human disease models, and organisms with relatively small and simple genomes. Recent progress in genomics has enabled the de novo decoding of the genome of virtually any organism, greatly expanding it...

  19. Deep sequencing for de novo construction of a marine fish (Sparus aurata) transcriptome database with a large coverage of protein-coding transcripts

    PubMed Central

    2013-01-01

    Background: The gilthead sea bream (Sparus aurata) is the main fish species cultured in the Mediterranean area and constitutes an interesting model of research. Nevertheless, transcriptomic and genomic data are still scarce for this highly valuable species. A transcriptome database was constructed by de novo assembly of gilthead sea bream sequences derived from public repositories of mRNA and collections of expressed sequence tags together with new high-quality reads from five cDNA 454 normalized libraries of skeletal muscle (1), intestine (1), head kidney (2) and blood (1). Results: Sequencing of the new 454 normalized libraries produced 2,945,914 high-quality reads and the de novo global assembly yielded 125,263 unique sequences with an average length of 727 nt. Blast analysis directed to protein and nucleotide databases annotated 63,880 sequences encoding for 21,384 gene descriptions, that were curated for redundancies and frameshifting at the homopolymer regions of open reading frames, and hosted at http://www.nutrigroup-iats.org/seabreamdb. Among the annotated gene descriptions, 16,177 were mapped in the Ingenuity Pathway Analysis (IPA) database, and 10,899 were eligible for functional analysis with a representation in 341 out of 372 IPA canonical pathways. The high representation of randomly selected stickleback transcripts by Blast search in the nucleotide gilthead sea bream database evidenced its high coverage of protein-coding transcripts. Conclusions: The newly assembled gilthead sea bream transcriptome represents progress in genomic resources for this species, as it probably contains more than 75% of actively transcribed genes, constituting a valuable tool to assist studies on functional genomics and future genome projects. PMID:23497320

  20. Detection of a Usp-like gene in Calotropis procera plant from the de novo assembled genome contigs of the high-throughput sequencing dataset.

    PubMed

    Shokry, Ahmed M; Al-Karim, Saleh; Ramadan, Ahmed; Gadallah, Nour; Al Attas, Sanaa G; Sabir, Jamal S M; Hassan, Sabah M; Madkour, Magdy A; Bressan, Ray; Mahfouz, Magdy; Bahieldin, Ahmed

    2014-02-01

    The wild plant species Calotropis procera (C. procera) has many potential applications and beneficial uses in medicine, industry and ornamental field. It also represents an excellent source of genes for drought and salt tolerance. Genes encoding proteins that contain the conserved universal stress protein (USP) domain are known to provide organisms like bacteria, archaea, fungi, protozoa and plants with the ability to respond to a plethora of environmental stresses. However, information on the possible occurrence of Usp in C. procera is not available. In this study, we uncovered and characterized a one-class A Usp-like (UspA-like, NCBI accession No. KC954274) gene in this medicinal plant from the de novo assembled genome contigs of the high-throughput sequencing dataset. A number of GenBank accessions for Usp sequences were blasted with the recovered de novo assembled contigs. Homology modelling of the deduced amino acids (NCBI accession No. AGT02387) was further carried out using Swiss-Model, accessible via the EXPASY. Superimposition of C. procera USPA-like full sequence model on Thermus thermophilus USP UniProt protein (PDB accession No. Q5SJV7) was constructed using RasMol and Deep-View programs. The functional domains of the novel USPA-like amino acids sequence were identified from the NCBI conserved domain database (CDD) that provide insights into sequence structure/function relationships, as well as domain models imported from a number of external source databases (Pfam, SMART, COG, PRK, TIGRFAM).

  1. Characterization of the Genomic Diversity of Norovirus in Linked Patients Using a Metagenomic Deep Sequencing Approach

    PubMed Central

    Nasheri, Neda; Petronella, Nicholas; Ronholm, Jennifer; Bidawid, Sabah; Corneau, Nathalie

    2017-01-01

    Norovirus (NoV) is the leading cause of gastroenteritis worldwide. A robust cell culture system does not exist for NoV and therefore detailed characterization of outbreak and sporadic strains relies on molecular techniques. In this study, we employed a metagenomic approach that uses non-specific amplification followed by next-generation sequencing to whole genome sequence NoV genomes directly from clinical samples obtained from 8 linked patients. Enough sequencing depth was obtained for each sample to use a de novo assembly of near-complete genome sequences. The resultant consensus sequences were then used to identify inter-host nucleotide variations that occur after direct transmission, analyze amino acid variations in the major capsid protein, and provide evidence of recombination events. The analysis of intra-host quasispecies diversity was possible due to high coverage-depth. We also observed a linear relationship between NoV viral load in the clinical sample and the number of sequence reads that could be attributed to NoV. The method demonstrated here has the potential for future use in whole genome sequence analyses of other RNA viruses isolated from clinical, environmental, and food specimens. PMID:28197136

  2. Transcriptome analysis of colored calla lily (Zantedeschia rehmannii Engl.) by Illumina sequencing: de novo assembly, annotation and EST-SSR marker development

    PubMed Central

    Cui, Binbin; Zhang, Qixiang; Xiong, Min; Wang, Xian

    2016-01-01

    Colored calla lily is the short name for the species or hybrids in section Aestivae of genus Zantedeschia. It is currently one of the most popular flower plants in the world due to its beautiful flower spathe and long postharvest life. However, little genomic information and few molecular markers are available for its genetic improvement. Here, de novo transcriptome sequencing was performed to produce large transcript sequences for Z. rehmannii cv. ‘Rehmannii’ using an Illumina HiSeq 2000 instrument. More than 59.9 million cDNA sequence reads were obtained and assembled into 39,298 unigenes with an average length of 1,038 bp. Among these, 21,077 unigenes showed significant similarity to protein sequences in the non-redundant protein database (Nr) and in the Swiss-Prot, Gene Ontology (GO), Cluster of Orthologous Group (COG) and Kyoto Encyclopedia of Genes and Genomes (KEGG) databases. Moreover, a total of 117 unique transcripts were then defined that might regulate the flower spathe development of colored calla lily. Additionally, 9,933 simple sequence repeats (SSRs) and 7,162 single nucleotide polymorphisms (SNPs) were identified as putative molecular markers. High-quality primers for 200 SSR loci were designed and selected, of which 58 amplified reproducible amplicons were polymorphic among 21 accessions of colored calla lily. The sequence information and molecular markers in the present study will provide valuable resources for genetic diversity analysis, germplasm characterization and marker-assisted selection in the genus Zantedeschia. PMID:27635342
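
    The abstract does not say which tool was used to call SSRs. As a minimal sketch of the general idea, assuming perfect tandem repeats and hypothetical minimum repeat counts per motif length, a regular-expression scan could look like this (`find_ssrs` and `MIN_REPEATS` are illustrative names, not part of the study):

```python
import re

# Hypothetical minimum repeat counts per motif length (not from the abstract).
MIN_REPEATS = {2: 6, 3: 5, 4: 4, 5: 4, 6: 4}


def find_ssrs(seq, min_repeats=MIN_REPEATS):
    """Return (start, motif, repeat_count) for perfect tandem repeats."""
    seq = seq.upper()
    hits = []
    for k, n in sorted(min_repeats.items()):
        # ([ACGT]{k}) captures a motif; \1{n-1,} then requires at least
        # n copies of it in total.
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (k, n - 1))
        for m in pattern.finditer(seq):
            motif = m.group(1)
            # Skip motifs that are repeats of a shorter unit ("AAAA", "ATAT").
            if any(motif == motif[:d] * (k // d)
                   for d in range(1, k) if k % d == 0):
                continue
            hits.append((m.start(), motif, len(m.group(0)) // k))
    return hits


# Toy sequence with one dinucleotide and one trinucleotide repeat.
example = "GGCT" + "AG" * 7 + "CCTA" + "GAT" * 5 + "TTC"
for start, motif, count in find_ssrs(example):
    print(start, motif, count)
```

    Real SSR callers also handle compound and imperfect repeats and design flanking primers; this sketch covers only the detection of perfect repeats.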

  3. Next generation sequencing based approaches to epigenomics

    PubMed Central

    Marra, Marco A.

    2010-01-01

    Next generation sequencing has brought epigenomic studies to the forefront of current research. The power of massively parallel sequencing coupled to innovative molecular and computational techniques has allowed researchers to profile the epigenome at resolutions that were unimaginable only a few years ago. With early proof of concept studies published, the field is now moving into the next phase where the importance of method standardization and rigorous quality control are becoming paramount. In this review we will describe methodologies that have been developed to profile the epigenome using next generation sequencing platforms. We will discuss these in terms of library preparation, sequence platforms and analysis techniques. PMID:21266347

  4. "De-novo" amino acid sequence elucidation of protein G'e by combined "Top-Down" and "Bottom-Up" mass spectrometry

    NASA Astrophysics Data System (ADS)

    Yefremova, Yelena; Al-Majdoub, Mahmoud; Opuni, Kwabena F. M.; Koy, Cornelia; Cui, Weidong; Yan, Yuetian; Gross, Michael L.; Glocker, Michael O.

    2015-03-01

    Mass spectrometric de-novo sequencing was applied to review the amino acid sequence of a commercially available recombinant protein G' with great scientific and economic importance. Substantial deviations to the published amino acid sequence (Uniprot Q54181) were found by the presence of 46 additional amino acids at the N-terminus, including a so-called "His-tag" as well as an N-terminal partial α-N-gluconoylation and α-N-phosphogluconoylation, respectively. The unexpected amino acid sequence of the commercial protein G' comprised 241 amino acids and resulted in a molecular mass of 25,998.9 ± 0.2 Da for the unmodified protein. Due to the higher mass that is caused by its extended amino acid sequence compared with the original protein G' (185 amino acids), we named this protein "protein G'e." By means of mass spectrometric peptide mapping, the suggested amino acid sequence, as well as the N-terminal partial α-N-gluconoylations, was confirmed with 100% sequence coverage. After the protein G'e sequence was determined, we were able to determine the expression vector pET-28b from Novagen with the Xho I restriction enzyme cleavage site as the best option that was used for cloning and expressing the recombinant protein G'e in E. coli. A dissociation constant (Kd) value of 9.4 nM for protein G'e was determined thermophoretically, showing that the N-terminal flanking sequence extension did not cause significant changes in the binding affinity to immunoglobulins.

  5. Exome sequencing identifies de novo gain of function missense mutation in KCND2 in identical twins with autism and seizures that slows potassium channel inactivation.

    PubMed

    Lee, Hane; Lin, Meng-chin A; Kornblum, Harley I; Papazian, Diane M; Nelson, Stanley F

    2014-07-01

    Numerous studies and case reports show comorbidity of autism and epilepsy, suggesting some common molecular underpinnings of the two phenotypes. However, the relationship between the two, on the molecular level, remains unclear. Here, whole exome sequencing was performed on a family with identical twins affected with autism and severe, intractable seizures. A de novo variant was identified in the KCND2 gene, which encodes the Kv4.2 potassium channel. Kv4.2 is a major pore-forming subunit in somatodendritic subthreshold A-type potassium current (ISA) channels. The de novo mutation p.Val404Met is novel and occurs at a highly conserved residue within the C-terminal end of the transmembrane helix S6 region of the ion permeation pathway. Functional analysis revealed the likely pathogenicity of the variant in that the p.Val404Met mutant construct showed significantly slowed inactivation, either by itself or after equimolar coexpression with the wild-type Kv4.2 channel construct consistent with a dominant effect. Further, the effect of the mutation on closed-state inactivation was evident in the presence of auxiliary subunits that associate with Kv4 subunits to form ISA channels in vivo. Discovery of a functionally relevant novel de novo variant, coupled with physiological evidence that the mutant protein disrupts potassium current inactivation, strongly supports KCND2 as the causal gene for epilepsy in this family. Interaction of KCND2 with other genes implicated in autism and the role of KCND2 in synaptic plasticity provide suggestive evidence of an etiological role in autism.

  6. Evaluation of Methods for de novo Genome assembly from High-throughput Sequencing Reads Reveals Dependencies that Affect the Quality of the Results

    Technology Transfer Automated Retrieval System (TEKTRAN)

    Recent developments in high-throughput sequencing technology have made low-cost sequencing an attractive approach for many genome analysis tasks. Increasing read lengths, improving quality and the production of increasingly larger numbers of usable sequences per instrument-run continue to make whole...

  7. De novo assembly and characterization of bark transcriptome using Illumina sequencing and development of EST-SSR markers in rubber tree (Hevea brasiliensis Muell. Arg.)

    PubMed Central

    2012-01-01

    Background: In rubber tree, bark is one of the important agricultural and biological organs. However, the molecular mechanism involved in the bark formation and development in rubber tree remains largely unknown, which is at least partially due to lack of bark transcriptomic and genomic information. Therefore, it is necessary to carry out high-throughput transcriptome sequencing of rubber tree bark to generate enormous transcript sequences for the functional characterization and molecular marker development. Results: In this study, more than 30 million sequencing reads were generated using Illumina paired-end sequencing technology. In total, 22,756 unigenes with an average length of 485 bp were obtained with de novo assembly. The similarity search indicated that 16,520 and 12,558 unigenes showed significant similarities to known proteins from NCBI non-redundant and Swissprot protein databases, respectively. Among these annotated unigenes, 6,867 and 5,559 unigenes were separately assigned to Gene Ontology (GO) and Clusters of Orthologous Group (COG). When the 22,756 unigenes were searched against the Kyoto Encyclopedia of Genes and Genomes Pathway (KEGG) database, 12,097 unigenes were assigned to 5 main categories including 123 KEGG pathways. Among the main KEGG categories, metabolism was the biggest category (9,043, 74.75%), suggesting the active metabolic processes in rubber tree bark. In addition, a total of 39,257 EST-SSRs were identified from 22,756 unigenes, and the characterizations of EST-SSRs were further analyzed in rubber tree. 110 potential marker sites were randomly selected to validate the assembly quality and develop EST-SSR markers. Among 13 Hevea germplasms, PCR success rate and polymorphism rate of 110 markers were 96.36% and 55.45%, respectively, in this study. Conclusion: By assembling and analyzing de novo transcriptome sequencing data, we reported the comprehensive functional characterization of rubber tree bark. This research generated a substantial fraction

  8. De novo computational identification of stress-related sequence motifs and microRNA target sites in untranslated regions of a plant translatome

    PubMed Central

    Munusamy, Prabhakaran; Zolotarov, Yevgen; Meteignier, Louis-Valentin; Moffett, Peter; Strömvik, Martina V.

    2017-01-01

    Gene regulation at the transcriptional and translational level leads to diversity in phenotypes and function in organisms. Regulatory DNA or RNA sequence motifs adjacent to the gene coding sequence act as binding sites for proteins that in turn enable or disable expression of the gene. Whereas the known DNA and RNA binding proteins range in the thousands, only a few motifs have been examined. In this study, we have predicted putative regulatory motifs in groups of untranslated regions from genes regulated at the translational level in Arabidopsis thaliana under normal and stressed conditions. The test group of sequences was divided into random subgroups and subjected to three de novo motif finding algorithms (Seeder, Weeder and MEME). In addition to identifying sequence motifs, using an in silico tool we have predicted microRNA target sites in the 3′ UTRs of the translationally regulated genes, as well as identified upstream open reading frames located in the 5′ UTRs. Our bioinformatics strategy and the knowledge generated contribute to understanding gene regulation during stress, and can be applied to disease and stress resistant plant development. PMID:28276452
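
    The abstract does not name the in silico tool used to identify upstream open reading frames (uORFs). As a minimal sketch of that step, assuming a uORF is an ATG followed by an in-frame stop codon that lies entirely within the 5′ UTR, a scan could look like this (`find_uorfs` and its `min_codons` parameter are hypothetical):

```python
import re

STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_uorfs(utr5, min_codons=2):
    """Return (start, end, sequence) for ORFs fully contained in a 5' UTR.

    start and end are 0-based and end-exclusive. min_codons counts the
    start codon plus sense codons and excludes the stop codon.
    """
    utr5 = utr5.upper().replace("U", "T")
    uorfs = []
    # The lookahead finds every ATG, including overlapping occurrences.
    for m in re.finditer(r"(?=ATG)", utr5):
        start = m.start()
        # Walk codon by codon until the first in-frame stop.
        for pos in range(start + 3, len(utr5) - 2, 3):
            if utr5[pos:pos + 3] in STOP_CODONS:
                if (pos - start) // 3 >= min_codons:
                    uorfs.append((start, pos + 3, utr5[start:pos + 3]))
                break
    return uorfs


# Toy UTR: the first ATG has an in-frame TAA; the second ATG has no stop.
toy_utr = "ccaATGgcttcaTAAggATGc"
print(find_uorfs(toy_utr))
```

    uORFs that run past the end of the UTR and overlap the main coding sequence are deliberately ignored here; handling them would require the downstream sequence as well.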

  9. Use of targeted next-generation sequencing for molecular diagnosis of craniosynostosis: Identification of a novel de novo mutation of EFNB1.

    PubMed

    Yamamoto, Toshiyuki; Igarashi, Naru; Shimojima, Keiko; Sangu, Noriko; Sakamoto, Yuko; Shimoji, Kazuaki; Niijima, Shinichi

    2016-03-01

    Craniofrontonasal syndrome (CFNS; MIM#304110) is characterized by asymmetric facial features with hypertelorism and a broad bifid nose due to synostosis of the coronal suture. CFNS shows a unique X-linked inheritance pattern (most affected patients are female and obligate male carriers exhibit a mild manifestation or no typical features at all) associated with the ephrin-B1 gene (EFNB1) located in the Xq13.1 region. In this study, we performed targeted, massively parallel sequencing using a next-generation sequencer, and identified a novel EFNB1 mutation, c.270_271delCA, in a Japanese female patient with craniosynostosis. Because subsequent Sanger sequencing identified no mutation in either parent, this mutation was determined to be de novo in origin. After obtaining molecular diagnosis, a retrospective clinical evaluation confirmed the clinical diagnosis of CFNS in this patient. Comprehensive molecular diagnosis using a next-generation sequencer would be beneficial for early diagnosis of the patients with undiagnosed craniosynostosis.

  10. General Approach in Computing Sums of Products of Binary Sequences

    DTIC Science & Technology

    2011-12-08

    General Approach in Computing Sums of Products of Binary Sequences E. Kiliç1, P. Stănică2 1TOBB Economics and Technology University, Mathematics...pstanica@nps.edu December 8, 2011 Abstract In this paper we find a general approach to find closed forms of sums of products of arbitrary sequences ...satisfying the same recurrence with different initial conditions. We apply successfully our technique to sums of products of such sequences with indices in

  11. Exome sequencing identifies de novo pathogenic variants in FBN1 and TRPS1 in a patient with a complex connective tissue phenotype

    PubMed Central

    Zastrow, Diane B.; Zornio, Patricia A.; Dries, Annika; Kohler, Jennefer; Fernandez, Liliana; Waggott, Daryl; Walkiewicz, Magdalena; Eng, Christine M.; Manning, Melanie A.; Farrelly, Ellyn; Fisher, Paul G.; Ashley, Euan A.; Bernstein, Jonathan A.

    2017-01-01

    Here we describe a patient who presented with a history of congenital diaphragmatic hernia, inguinal hernia, and recurrent umbilical hernia. She also has joint laxity, hypotonia, and dysmorphic features. A unifying diagnosis was not identified based on her clinical phenotype. As part of her evaluation through the Undiagnosed Diseases Network, trio whole-exome sequencing was performed. Pathogenic variants in FBN1 and TRPS1 were identified as causing two distinct autosomal dominant conditions, each with de novo inheritance. Fibrillin 1 (FBN1) mutations are associated with Marfan syndrome and a spectrum of similar phenotypes. TRPS1 mutations are associated with trichorhinophalangeal syndrome types I and III. Features of both conditions are evident in the patient reported here. Discrepant features of the conditions (e.g., stature) and the young age of the patient may have made a clinical diagnosis more difficult in the absence of exome-wide genetic testing. PMID:28050602

  12. Exome sequencing identifies de novo pathogenic variants in FBN1 and TRPS1 in a patient with a complex connective tissue phenotype.

    PubMed

    Zastrow, Diane B; Zornio, Patricia A; Dries, Annika; Kohler, Jennefer; Fernandez, Liliana; Waggott, Daryl; Walkiewicz, Magdalena; Eng, Christine M; Manning, Melanie A; Farrelly, Ellyn; Fisher, Paul G; Ashley, Euan A; Bernstein, Jonathan A; Wheeler, Matthew T

    2017-01-01

    Here we describe a patient who presented with a history of congenital diaphragmatic hernia, inguinal hernia, and recurrent umbilical hernia. She also has joint laxity, hypotonia, and dysmorphic features. A unifying diagnosis was not identified based on her clinical phenotype. As part of her evaluation through the Undiagnosed Diseases Network, trio whole-exome sequencing was performed. Pathogenic variants in FBN1 and TRPS1 were identified as causing two distinct autosomal dominant conditions, each with de novo inheritance. Fibrillin 1 (FBN1) mutations are associated with Marfan syndrome and a spectrum of similar phenotypes. TRPS1 mutations are associated with trichorhinophalangeal syndrome types I and III. Features of both conditions are evident in the patient reported here. Discrepant features of the conditions (e.g., stature) and the young age of the patient may have made a clinical diagnosis more difficult in the absence of exome-wide genetic testing.

  13. Rational Structure-Based Rescaffolding Approach to De Novo Design of Interleukin 10 (IL-10) Receptor-1 Mimetics

    PubMed Central

    Philipp, Jenny; Künze, Georg; Wodtke, Robert; Löser, Reik; Fahmy, Karim; Pisabarro, M. Teresa

    2016-01-01

    Tackling protein interfaces with small molecules capable of modulating protein-protein interactions remains a challenge in structure-based ligand design. Particularly arduous are cases in which the epitopes involved in molecular recognition have a non-structured and discontinuous nature. Here, the basic strategy of translating continuous binding epitopes into mimetic scaffolds cannot be applied, and other innovative approaches are therefore required. We present a structure-based rational approach involving the use of a regular expression syntax inspired in the well established PROSITE to define minimal descriptors of geometric and functional constraints signifying relevant functionalities for recognition in protein interfaces of non-continuous and unstructured nature. These descriptors feed a search engine that explores the currently available three-dimensional chemical space of the Protein Data Bank (PDB) in order to identify in a straightforward manner regular architectures containing the desired functionalities, which could be used as templates to guide the rational design of small natural-like scaffolds mimicking the targeted recognition site. The application of this rescaffolding strategy to the discovery of natural scaffolds incorporating a selection of functionalities of interleukin-10 receptor-1 (IL-10R1), which are relevant for its interaction with interleukin-10 (IL-10) has resulted in the de novo design of a new class of potent IL-10 peptidomimetic ligands. PMID:27123592
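    The abstract above mentions a regular expression syntax inspired by PROSITE. As a rough illustration of what standard PROSITE sequence-pattern syntax looks like (not the authors' geometric descriptor language, which the abstract does not specify), the sketch below translates a simplified subset of PROSITE patterns into Python regular expressions. The function name prosite_to_regex and the test sequence are hypothetical; only the basic elements (x, [..], {..}, repetition counts, and the < and > anchors) are handled.

```python
import re

def prosite_to_regex(pattern: str) -> str:
    """Translate a simplified PROSITE-style pattern into a Python regex string."""
    pattern = pattern.rstrip(".")  # PROSITE patterns conventionally end with a period
    parts = []
    for element in pattern.split("-"):
        # Optional anchors: '<' pins the N-terminus, '>' pins the C-terminus.
        prefix = "^" if element.startswith("<") else ""
        suffix = "$" if element.endswith(">") else ""
        element = element.lstrip("<").rstrip(">")
        # Split off a repetition count such as x(2) or x(2,4).
        m = re.fullmatch(r"(.+?)(?:\((\d+(?:,\d+)?)\))?", element)
        core, count = m.group(1), m.group(2)
        if core == "x":
            core = "."                       # any residue
        elif core.startswith("{") and core.endswith("}"):
            core = "[^" + core[1:-1] + "]"   # any residue except those listed
        # '[ST]' and single residue letters are already valid regex syntax.
        repeat = "{" + count + "}" if count else ""
        parts.append(prefix + core + repeat + suffix)
    return "".join(parts)

# The N-glycosylation site pattern, applied to a made-up sequence.
regex = prosite_to_regex("N-{P}-[ST]-{P}")
match = re.search(regex, "MKNGSAC")
print(regex, match.group(0) if match else None)
```

    A descriptor of three-dimensional geometric constraints, as described in the abstract, would need more than this string-level translation; the sketch only shows the pattern notation the approach is said to be modeled on.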

  14. De Novo Transcriptome Sequencing of the Octopus vulgaris Hemocytes Using Illumina RNA-Seq Technology: Response to the Infection by the Gastrointestinal Parasite Aggregata octopiana

    PubMed Central

    Castellanos-Martínez, Sheila; Arteta, David; Catarino, Susana; Gestal, Camino

    2014-01-01

    Background Octopus vulgaris is a highly valuable species of great commercial interest and excellent candidate for aquaculture diversification; however, the octopus’ well-being is impaired by pathogens, of which the gastrointestinal coccidian parasite Aggregata octopiana is one of the most important. The knowledge of the molecular mechanisms of the immune response in cephalopods, especially in octopus is scarce. The transcriptome of the hemocytes of O. vulgaris was de novo sequenced using the high-throughput paired-end Illumina technology to identify genes involved in immune defense and to understand the molecular basis of octopus tolerance/resistance to coccidiosis. Results A bi-directional mRNA library was constructed from hemocytes of two groups of octopus according to the infection by A. octopiana, sick octopus, suffering coccidiosis, and healthy octopus, and reads were de novo assembled together. The differential expression of transcripts was analysed using the general assembly as a reference for mapping the reads from each condition. After sequencing, a total of 75,571,280 high quality reads were obtained from the sick octopus group and 74,731,646 from the healthy group. The general transcriptome of the O. vulgaris hemocytes was assembled in 254,506 contigs. A total of 48,225 contigs were successfully identified, and 538 transcripts exhibited differential expression between groups of infection. The general transcriptome revealed genes involved in pathways like NF-kB, TLR and Complement. Differential expression of TLR-2, PGRP, C1q and PRDX genes due to infection was validated using RT-qPCR. In sick octopuses, only TLR-2 was up-regulated in hemocytes, but all of them were up-regulated in caecum and gills. Conclusion The transcriptome reported here de novo establishes the first molecular clues to understand how the octopus immune system works and interacts with a highly pathogenic coccidian. 
The data provided here will contribute to identification of biomarkers

  15. Transcriptome Profile of the Asian Giant Hornet (Vespa mandarinia) Using Illumina HiSeq 4000 Sequencing: De Novo Assembly, Functional Annotation, and Discovery of SSR Markers

    PubMed Central

    Park, So Young; Kang, Se Won; Hwang, Hee-Ju; Wang, Tae Hun; Park, Eun Bi; Chung, Jong Min; Song, Dae Kwon; Kim, Changmu; Kim, Soonok; Lee, Jae Bong; Jeong, Heon Cheon; Park, Hong Seog; Han, Yeon Soo; Lee, Yong Seok

    2016-01-01

    Vespa mandarinia found in the forests of East Asia, including Korea, occupies the highest rank in the arthropod food web within its geographical range. It serves as a source of nutrition in the form of Vespa amino acid mixture and is listed as a threatened species, although no conservation measures have been implemented. Here, we performed de novo assembly of the V. mandarinia transcriptome by Illumina HiSeq 4000 sequencing. Over 60 million raw reads and 59,184,811 clean reads were obtained. After assembly, a total of 66,837 unigenes were clustered, 40,887, 44,455, and 22,390 of which showed homologous matches against the PANM, Unigene, and KOG databases, respectively. A total of 15,675 unigenes were assigned to Gene Ontology terms, and 5,132 unigenes were mapped to 115 KEGG pathways. The zinc finger domain (C2H2-like), serine/threonine/dual specificity protein kinase domain, and RNA recognition motif domain were among the top InterProScan domains predicted for V. mandarinia sequences. Among the unigenes, we identified 534,922 cDNA simple sequence repeats as potential markers. This is the first transcriptomic analysis of the wasp V. mandarinia using Illumina HiSeq 4000. The obtained datasets should promote the search for new genes to understand the physiological attributes of this wasp. PMID:26881195

  16. De Novo Assembly of Coding Sequences of the Mangrove Palm (Nypa fruticans) Using RNA-Seq and Discovery of Whole-Genome Duplications in the Ancestor of Palms.

    PubMed

    He, Ziwen; Zhang, Zhang; Guo, Wuxia; Zhang, Ying; Zhou, Renchao; Shi, Suhua

    2015-01-01

    Nypa fruticans (Arecaceae) is the only monocot species of true mangroves. This species represents the earliest mangrove fossil recorded. How N. fruticans adapts to the harsh and unstable intertidal zone is an interesting question. However, the 60 gene segments deposited in NCBI are insufficient for solving this question. In this study, we sequenced, assembled and annotated the transcriptome of N. fruticans using next-generation sequencing technology. A total of 19,918,800 clean paired-end reads were de novo assembled into 45,368 unigenes with a N50 length of 1,096 bp. A total of 41.35% unigenes were functionally annotated using Blast2GO. Many genes annotated to "response to stress" and 15 putative positively selected genes were identified. Simple sequence repeats were identified and compared with other palms. The divergence time between N. fruticans and other palms was estimated at 75 million years ago using the genomic data, which is consistent with the fossil record. After calculating the synonymous substitution rate between paralogs, we found that two whole-genome duplication events were shared by N. fruticans and other palms. These duplication events provided a large amount of raw material for the more than 2,000 later speciation events in Arecaceae. This study provides a high quality resource for further functional and evolutionary studies of N. fruticans and palms in general.
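    The N50 length quoted in the abstract above is a standard assembly summary statistic: the length of the shortest sequence in the smallest set of longest sequences that together cover at least half of the total assembled bases. A minimal sketch of that calculation follows; the function name n50 and the example lengths are illustrative and are not taken from the study.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths."""
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        # The first length at which the running sum reaches half of the
        # total assembled bases is the N50.
        if running >= half_total:
            return length
    raise ValueError("no sequences given")

# Toy example with made-up unigene lengths (bp).
print(n50([100, 200, 300, 400]))
```

    Because N50 weights long sequences heavily, it is usually read alongside the total number of unigenes and their mean length rather than on its own.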

  17. De novo sequencing and transcriptome analysis of a low temperature tolerant Saccharum spontaneum clone IND 00-1037.

    PubMed

    Dharshini, S; Chakravarthi, M; J, Ashwin Narayan; Manoj, V M; Naveenarani, M; Kumar, Ravinder; Meena, Minturam; Ram, Bakshi; Appunu, C

    2016-08-10

    Saccharum spontaneum L., a wild relative of sugarcane, is known for its adaptability to environmental stresses, particularly cold stress. In the present study, an attempt was made for transcriptome profiling of the low temperature (10°C) tolerant S. spontaneum clone IND 00-1037 collected from high altitude regions of Arunachal Pradesh, North Eastern India. The Illumina Nextseq500 platform yielded a total of 47.63 and 48.18 million reads corresponding to 4.7 and 4.8 gigabase pairs (Gb) of processed reads for control and cold stressed (10°C for 24h) samples, respectively. These reads were de novo assembled into 214,611 unigenes with an average length of 801bp. Further, all unigenes were aligned to GO, KEGG and COG databases in order to identify novel genes and pathways responsive upon low temperature conditions. The differential gene expression analysis revealed that about 2583 genes were upregulated and 3302 genes were down regulated during the stress. This is perhaps the comprehensive transcriptome data of a low temperature tolerant clone of S. spontaneum. This study would aid in identifying novel genes and also in future genomic studies pertaining to sugarcane and its wild relatives.

  18. De novo sequencing of root transcriptome reveals complex cadmium-responsive regulatory networks in radish (Raphanus sativus L.).

    PubMed

    Xu, Liang; Wang, Yan; Liu, Wei; Wang, Jin; Zhu, Xianwen; Zhang, Keyun; Yu, Rugang; Wang, Ronghua; Xie, Yang; Zhang, Wei; Gong, Yiqin; Liu, Liwang

    2015-07-01

    Cadmium (Cd) is a nonessential metallic trace element that poses potential chronic toxicity to living organisms. To date, little is known about the Cd-responsive regulatory network in root vegetable crops including radish. In this study, 31,015 unigenes representing 66,552 assembled unique transcripts were isolated from radish root under Cd stress based on de novo transcriptome assembly. In all, 1496 differentially expressed genes (DEGs) consisting of 3579 transcripts were identified from Cd-free (CK) and Cd-treated (Cd200) libraries. Gene Ontology and pathway enrichment analysis indicated that the up- and down-regulated DEGs were predominately involved in glucosinolate biosynthesis as well as cysteine and methionine-related pathways, respectively. RT-qPCR showed that the expression profiles of DEGs were consistent with results from RNA-Seq analysis. Several candidate genes encoding phytochelatin synthase (PCS), metallothioneins (MTs), glutathione (GSH), zinc iron permease (ZIPs) and ABC transporter were responsible for Cd uptake, accumulation, translocation and detoxification in radish. A schematic model of the DEGs and microRNAs involved in the Cd-responsive regulatory network was proposed. This study represents a first comprehensive transcriptome-based characterization of Cd-responsive DEGs in radish. These results could provide fundamental insight into complex Cd-responsive regulatory networks and facilitate further genetic manipulation of Cd accumulation in root vegetable crops.

  19. Sequencing, De Novo Assembly, and Annotation of the Transcriptome of the Endangered Freshwater Pearl Bivalve, Cristaria plicata, Provides Novel Insights into Functional Genes and Marker Discovery

    PubMed Central

    Kang, Se Won; Hwang, Hee-Ju; Park, So Young; Park, Eun Bi; Chung, Jong Min; Song, Dae Kwon; Kim, Changmu; Kim, Soonok; Lee, Jun Sang; Han, Yeon Soo; Park, Hong Seog; Lee, Yong Seok

    2016-01-01

    Background The freshwater mussel Cristaria plicata (Bivalvia: Eulamellibranchia: Unionidae), is an economically important species in molluscan aquaculture due to its use in pearl farming. The species have been listed as endangered in South Korea due to the loss of natural habitats caused by anthropogenic activities. The decreasing population and a lack of genomic information on the species is concerning for environmentalists and conservationists. In this study, we conducted a de novo transcriptome sequencing and annotation analysis of C. plicata using Illumina HiSeq 2500 next-generation sequencing (NGS) technology, the Trinity assembler, and bioinformatics databases to prepare a sustainable resource for the identification of candidate genes involved in immunity, defense, and reproduction. Results The C. plicata transcriptome analysis included a total of 286,152,584 raw reads and 281,322,837 clean reads. The de novo assembly identified a total of 453,931 contigs and 374,794 non-redundant unigenes with average lengths of 731.2 and 737.1 bp, respectively. Furthermore, 100% coverage of C. plicata mitochondrial genes within two unigenes supported the quality of the assembler. In total, 84,274 unigenes showed homology to entries in at least one database, and 23,246 unigenes were allocated to one or more Gene Ontology (GO) terms. The most prominent GO biological process, cellular component, and molecular function categories (level 2) were cellular process, membrane, and binding, respectively. A total of 4,776 unigenes were mapped to 123 biological pathways in the KEGG database. Based on the GO terms and KEGG annotation, the unigenes were suggested to be involved in immunity, stress responses, sex-determination, and reproduction. A total of 17,251 cDNA simple sequence repeats (cSSRs) were identified from 61,141 unigenes (size of >1 kb) with the most abundant being dinucleotide repeats. 
Conclusions This dataset represents the first transcriptome analysis of the endangered

  20. High-Throughput Sequencing and De Novo Assembly of Brassica oleracea var. Capitata L. for Transcriptome Analysis

    PubMed Central

    Kim, Sangmi; Choe, Jun Kyoung; Jo, Sung-Hwan; Baek, Namkwon; Kwon, Suk-Yoon

    2014-01-01

    Background The cabbage, Brassica oleracea var. capitata L., has a distinguishable phenotype within the genus Brassica. Despite the economic and genetic importance of cabbage, there is little genomic data for cabbage, and most studies of Brassica are focused on other species or other B. oleracea subspecies. The lack of genomic data for cabbage, a non-model organism, hinders research on its molecular biology. Hence, the construction of reliable transcriptomic data based on high-throughput sequencing technologies is needed to enhance our understanding of cabbage and provide genomic information for future work. Methodology/Principal Findings We constructed cDNAs from total RNA isolated from the roots, leaves, flowers, seedlings, and calcium-limited seedling tissues of two cabbage genotypes: 102043 and 107140. We sequenced a total of six different samples using the Illumina HiSeq platform, producing 40.5 Gbp of sequence data comprising 401,454,986 short reads. We assembled 205,046 transcripts (≥ 200 bp) using the Velvet and Oases assembler and predicted 53,562 loci from the transcripts. We annotated 35,274 of the loci with 55,916 plant peptides in the Phytozome database. The average length of the annotated loci was 1,419 bp. We confirmed the reliability of the sequencing assembly using reverse-transcriptase PCR to identify tissue-specific gene candidates among the annotated loci. Conclusion Our study provides valuable transcriptome sequence data for B. oleracea var. capitata L., offering a new resource for studying B. oleracea and closely related species. Our transcriptomic sequences will enhance the quality of gene annotation and functional analysis of the cabbage genome and serve as a material basis for future genomic research on cabbage. The sequencing data from this study can be used to develop molecular markers and to identify the extreme differences among the phenotypes of different species in the genus Brassica. PMID:24682075

  1. De novo sequencing and analysis of the lily pollen transcriptome: an open access data source for an orphan plant species.

    PubMed

    Lang, Veronika; Usadel, Björn; Obermeyer, Gerhard

    2015-01-01

    Pollen grains of Lilium longiflorum are a long-established model system for pollen germination and tube tip growth. Due to their size, protein content and almost synchronous germination in synthetic media, they provide a simple system for physiological measurements as well as sufficient material for biochemical studies like protein purifications, enzyme assays, organelle isolation or determination of metabolites during germination and pollen tube elongation. Despite recent progress in molecular biology techniques, sequence information on expressed proteins or transcripts in lily pollen is still scarce. Using a next generation sequencing strategy (RNAseq), the lily pollen transcriptome was investigated, resulting in more than 50 million high quality reads with a length of 90 base pairs. Sequenced transcripts were assembled and annotated, and finally visualized with MAPMAN software tools and compared with other RNAseq or genome data including Arabidopsis pollen, Lilium vegetative tissues and the Amborella trichopoda genome. All lily pollen sequence data are provided as open access files with suitable tools to search sequences of interest.

  2. Whole exome sequencing is necessary to clarify ID/DD cases with de novo copy number variants of uncertain significance: Two proof-of-concept examples.

    PubMed

    Giorgio, Elisa; Ciolfi, Andrea; Biamino, Elisa; Caputo, Viviana; Di Gregorio, Eleonora; Belligni, Elga Fabia; Calcia, Alessandro; Gaidolfi, Elena; Bruselles, Alessandro; Mancini, Cecilia; Cavalieri, Simona; Molinatto, Cristina; Cirillo Silengo, Margherita; Ferrero, Giovanni Battista; Tartaglia, Marco; Brusco, Alfredo

    2016-07-01

    Whole exome sequencing (WES) is a powerful tool to identify clinically undefined forms of intellectual disability/developmental delay (ID/DD), especially in consanguineous families. Here we report the genetic definition of two sporadic cases, with syndromic ID/DD for whom array-Comparative Genomic Hybridization (aCGH) identified a de novo copy number variant (CNV) of uncertain significance. The phenotypes included microcephaly with brachycephaly and a distinctive facies in one proband, and hypotonia in the legs and mild ataxia in the other. WES allowed identification of a functionally relevant homozygous variant affecting a known disease gene for rare syndromic ID/DD in each proband, that is, c.1423C>T (p.Arg377*) in the Trafficking Protein Particle Complex 9 (TRAPPC9), and c.154T>C (p.Cys52Arg) in the Very Low Density Lipoprotein Receptor (VLDLR). Four mutations affecting TRAPPC9 have been previously reported, and the present finding further depicts this syndromic form of ID, which includes microcephaly with brachycephaly, corpus callosum hypoplasia, facial dysmorphism, and overweight. VLDLR-associated cerebellar hypoplasia (VLDLR-CH) is characterized by non-progressive congenital ataxia and moderate-to-profound intellectual disability. The c.154T>C (p.Cys52Arg) mutation was associated with a very mild form of ataxia, mild intellectual disability, and cerebellar hypoplasia without cortical gyri simplification. In conclusion, we report two novel cases with rare causes of autosomal recessive ID, which document how interpreting de novo array-CGH variants represents a challenge in consanguineous families; as such, clinical WES should be considered in diagnostic testing. © 2016 Wiley Periodicals, Inc.

  3. The power of single molecule real-time sequencing technology in the de novo assembly of a eukaryotic genome.

    PubMed

    Sakai, Hiroaki; Naito, Ken; Ogiso-Tanaka, Eri; Takahashi, Yu; Iseki, Kohtaro; Muto, Chiaki; Satou, Kazuhito; Teruya, Kuniko; Shiroma, Akino; Shimoji, Makiko; Hirano, Takashi; Itoh, Takeshi; Kaga, Akito; Tomooka, Norihiko

    2015-11-30

    Second-generation sequencers (SGS) have been game-changing, achieving cost-effective whole genome sequencing in many non-model organisms. However, a large portion of the genomes still remains unassembled. We reconstructed the azuki bean (Vigna angularis) genome using single molecule real-time (SMRT) sequencing technology and achieved the best contiguity and coverage among currently assembled legume crops. The SMRT-based assembly produced 100 times longer contigs with 100 times smaller amount of gaps compared to the SGS-based assemblies. A detailed comparison between the assemblies revealed that the SMRT-based assembly enabled a more comprehensive gene annotation than the SGS-based assemblies, in which thousands of genes were missing or fragmented. A chromosome-scale assembly was generated based on the high-density genetic map, covering 86% of the azuki bean genome. We demonstrated that SMRT technology, though it still needed the support of SGS data, achieved a near-complete assembly of a eukaryotic genome.

  4. De Novo proteome analysis of genetically modified tumor cells by a metabolic labeling/azide-alkyne cycloaddition approach.

    PubMed

    Ballikaya, Seda; Lee, Jennifer; Warnken, Uwe; Schnölzer, Martina; Gebert, Johannes; Kopitz, Jürgen

    2014-12-01

    Activin receptor type II (ACVR2) is a member of the transforming growth factor type II receptor family and controls cell growth and differentiation, thereby acting as a tumor suppressor. ACVR2 inactivation is known to drive colorectal tumorigenesis. We used an ACVR2-deficient microsatellite unstable colon cancer cell line (HCT116) to set up a novel experimental design for comprehensive analysis of proteomic changes associated with such functional loss of a tumor suppressor. To this end we combined two existing technologies. First, the ACVR2 gene was reconstituted in an ACVR2-deficient colorectal cancer (CRC) cell line by means of recombinase-mediated cassette exchange, resulting in the generation of an inducible expression system that allowed the regulation of ACVR2 gene expression in a doxycycline-dependent manner. Functional expression in the induced cells was explicitly proven. Second, we used the methionine analog azidohomoalanine for metabolic labeling of newly synthesized proteins in our cell line model. Labeled proteins were tagged with biotin via a Click-iT chemistry approach enabling specific extraction of labeled proteins by streptavidin-coated beads. Tryptic on-bead digestion of captured proteins and subsequent ultra-high-performance LC coupled to LTQ Orbitrap XL mass spectrometry identified 513 proteins, with 25 of them differentially expressed between ACVR2-deficient and -proficient cells. Among these, several candidates that had already been linked to colorectal cancer or were known to play a key role in cell growth or apoptosis control were identified, proving the utility of the presented experimental approach. In principle, this strategy can be adapted to analyze any gene of interest and its effect on the cellular de novo proteome.

  5. Transcriptome de novo assembly from next-generation sequencing and comparative analyses in the hexaploid salt marsh species Spartina maritima and Spartina alterniflora (Poaceae)

    PubMed Central

    Ferreira de Carvalho, J; Poulain, J; Da Silva, C; Wincker, P; Michon-Coudouel, S; Dheilly, A; Naquin, D; Boutte, J; Salmon, A; Ainouche, M

    2013-01-01

    Spartina species have a critical ecological role in salt marshes and represent an excellent system to investigate recurrent polyploid speciation. Using the 454 GS-FLX pyrosequencer, we assembled and annotated the first reference transcriptome (from roots and leaves) for two related hexaploid Spartina species that hybridize in Western Europe, the East American invasive Spartina alterniflora and the Euro-African S. maritima. The de novo read assembly generated 38 478 consensus sequences and 99% found an annotation using Poaceae databases, representing a total of 16 753 non-redundant genes. Spartina expressed sequence tags were mapped onto the Sorghum bicolor genome, where they were distributed among the subtelomeric arms of the 10 S. bicolor chromosomes, with high gene density correlation. Normalization of the complementary DNA library improved the number of annotated genes. Ecologically relevant genes were identified among GO biological function categories in salt and heavy metal stress response, C4 photosynthesis and in lignin and cellulose metabolism. Expression of some of these genes had been found to be altered by hybridization and genome duplication in a previous microarray-based study in Spartina. As these species are hexaploid, up to three duplicated homoeologs may be expected per locus. When analyzing sequence polymorphism at four different loci in S. maritima and S. alterniflora, we found up to four haplotypes per locus, suggesting the presence of two expressed homoeologous sequences with one or two allelic variants each. This reference transcriptome will allow analysis of specific Spartina genes of ecological or evolutionary interest, estimation of homoeologous gene expression variation using RNA-seq and further gene expression evolution analyses in natural populations. PMID:23149455

  6. De Novo Sequencing-Based Transcriptome and Digital Gene Expression Analysis Reveals Insecticide Resistance-Relevant Genes in Propylaea japonica (Thunberg) (Coleoptea: Coccinellidae)

    PubMed Central

    Jin, Feng-Liang; Qiu, Bao-Li; Wu, Jian-Hui; Ren, Shun-Xiang

    2014-01-01

    The ladybird Propylaea japonica (Thunberg) is one of the most important natural enemies of aphids in China. This species is threatened by the extensive use of insecticides but genomics-based information on the molecular mechanisms underlying insecticide resistance is limited. Hence, we analyzed the transcriptome and expression profile data of P. japonica in order to gain a deeper understanding of insecticide resistance in ladybirds. We performed de novo assembly of a transcriptome using Illumina's Solexa sequencing technology and short reads. A total of 27,243,552 reads were generated. These were assembled into 81,458 contigs and 33,647 unigenes (6,862 clusters and 26,785 singletons). Of the unigenes, 23,965 (71.22%) have putative homologues in the non-redundant (nr) protein database from NCBI, using BLASTX, with a cut-off E-value of 10^-5. We examined COG, GO and KEGG annotations to better understand the functions of these unigenes. Digital gene expression (DGE) libraries showed differences in gene expression profiles between two insecticide resistant strains. When compared with an insecticide susceptible profile, a total of 4,692 genes were significantly up- or down-regulated in a moderately resistant strain. Among these genes, 125 putative insecticide resistance genes were identified. To confirm the DGE results, 16 selected genes were validated using quantitative real time PCR (qRT-PCR). This study is the first to report genetic information on P. japonica and has greatly enriched the sequence data for ladybirds. The large number of gene sequences produced from the transcriptome and DGE sequencing will greatly improve our understanding of this important insect, at the molecular level, and could contribute to the in-depth research into insecticide resistance mechanisms. PMID:24959827
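    The annotation step in the abstract above (counting unigenes with a BLASTX hit below an E-value cut-off) can be sketched as a filter over BLAST tabular output. This is a minimal illustration, assuming the default 12-column tabular layout in which the E-value is the 11th field; the function name queries_with_hits, the sample rows, and the choice of "less than or equal to" for the cut-off are assumptions, not details from the study.

```python
EVALUE_COLUMN = 10  # 0-based index of 'evalue' in the default 12-column tabular layout

def queries_with_hits(lines, max_evalue=1e-5):
    """Return the set of query IDs having at least one hit at or below max_evalue."""
    hits = set()
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment headers
        fields = line.split("\t")
        if float(fields[EVALUE_COLUMN]) <= max_evalue:
            hits.add(fields[0])
    return hits

# Made-up rows: qseqid sseqid pident length mismatch gapopen
#               qstart qend sstart send evalue bitscore
sample = [
    "unigene1\tprotA\t85.0\t200\t30\t0\t1\t600\t1\t200\t1e-50\t350",
    "unigene2\tprotB\t30.0\t50\t35\t0\t1\t150\t1\t50\t0.5\t28",
    "unigene1\tprotC\t60.0\t180\t72\t0\t1\t540\t1\t180\t2e-20\t150",
]
annotated = queries_with_hits(sample)
total_unigenes = 2
print(f"{len(annotated)} of {total_unigenes} unigenes annotated "
      f"({100 * len(annotated) / total_unigenes:.2f}%)")
```

    The same percentage arithmetic reproduces the figure reported in the abstract: 23,965 of 33,647 unigenes is 71.22%, and the 6,862 clusters plus 26,785 singletons sum to the 33,647 unigenes.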

  7. Identification of a Novel De Novo Variant in the PAX3 Gene in Waardenburg Syndrome by Diagnostic Exome Sequencing: The First Molecular Diagnosis in Korea

    PubMed Central

    Jang, Mi-Ae; Lee, Taeheon; Lee, Junnam

    2015-01-01

    Waardenburg syndrome (WS) is a clinically and genetically heterogeneous hereditary auditory pigmentary disorder characterized by congenital sensorineural hearing loss and iris discoloration. Many genes have been linked to WS, including PAX3, MITF, SNAI2, EDNRB, EDN3, and SOX10, and many additional genes have been associated with disorders with phenotypic overlap with WS. To screen all possible genes associated with WS and congenital deafness simultaneously, we performed diagnostic exome sequencing (DES) in a male patient with clinical features consistent with WS. Using DES, we identified a novel missense variant (c.220C>G; p.Arg74Gly) in exon 2 of the PAX3 gene in the patient. Further analysis by Sanger sequencing of the patient and his parents revealed a de novo occurrence of the variant. Our findings show that DES can be a useful tool for the identification of pathogenic gene variants in WS patients and for differentiation between WS and similar disorders. To the best of our knowledge, this is the first report of genetically confirmed WS in Korea. PMID:25932447

  8. De novo Sequencing and Transcriptome Analysis of Pinellia ternata Identify the Candidate Genes Involved in the Biosynthesis of Benzoic Acid and Ephedrine

    PubMed Central

    Zhang, Guang-hui; Jiang, Ni-hao; Song, Wan-ling; Ma, Chun-hua; Yang, Sheng-chao; Chen, Jun-wen

    2016-01-01

    Background: The medicinal herb, Pinellia ternata, is purported to be an anti-emetic with analgesic and sedative effects. Alkaloids are the main biologically active compounds in P. ternata, especially ephedrine that is a phenylpropylamino alkaloid specifically produced by Ephedra and Catha edulis. However, how ephedrine is synthesized in plants is uncertain. Only the phenylalanine ammonia lyase (PAL) and relevant genes in this pathway have been characterized. Genomic information of P. ternata is also unavailable. Results: We analyzed the transcriptome of the tuber of P. ternata with the Illumina HiSeq™ 2000 sequencing platform. 66,813,052 high-quality reads were generated, and these reads were assembled de novo into 89,068 unigenes. Most known genes involved in benzoic acid biosynthesis were identified in the unigene dataset of P. ternata, and the expression patterns of some ephedrine biosynthesis-related genes were analyzed by reverse transcription quantitative real-time PCR (RT-qPCR). Also, 14,468 simple sequence repeats (SSRs) were identified from 12,000 unigenes. Twenty primer pairs for SSRs were randomly selected for the validation of their amplification effect. Conclusion: RNA-seq data was used for the first time to provide a comprehensive gene information on P. ternata at the transcriptional level. These data will advance molecular genetics in this valuable medicinal plant. PMID:27579029
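    Simple sequence repeats (SSRs) such as those counted in the abstract above are usually detected by scanning each unigene for short motifs repeated in tandem. The sketch below finds perfect di- and trinucleotide repeats with a regular expression. The function name find_ssrs and the minimum repeat counts (6 for dinucleotides, 5 for trinucleotides) are assumptions chosen for illustration; the abstract does not state which tool or thresholds the authors used.

```python
import re

# Assumed thresholds: motif length -> minimum number of tandem copies.
DEFAULT_MIN_REPEATS = {2: 6, 3: 5}

def find_ssrs(seq, min_repeats=None):
    """Return sorted (start, motif, copies) tuples for perfect tandem repeats."""
    min_repeats = min_repeats or DEFAULT_MIN_REPEATS
    seq = seq.upper()
    found = []
    for unit_len, min_rep in min_repeats.items():
        # A motif of unit_len bases followed by at least (min_rep - 1) more copies.
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (unit_len, min_rep - 1))
        for m in pattern.finditer(seq):
            motif = m.group(1)
            if len(set(motif)) == 1:
                continue  # skip homopolymer runs such as AAAAAA...
            found.append((m.start(), motif, len(m.group(0)) // unit_len))
    return sorted(found)

# Made-up sequence containing (AT)7 and (CAG)5.
example = "GGC" + "AT" * 7 + "CCT" + "CAG" * 5 + "TT"
print(find_ssrs(example))
```

    Primer pairs for validation, as described in the abstract, would then be designed against the sequence flanking each reported start position.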

  9. De novo transcriptome sequencing and analysis of freshwater snail (Radix balthica) to discover genes and pathways affected by exposure to oxazepam.

    PubMed

    Mazzitelli, Jean-Yves; Bonnafe, Elsa; Klopp, Christophe; Escudier, Frédéric; Geret, Florence

    2017-01-01

    Pharmaceuticals are increasingly found in aquatic ecosystems due to the inefficiency of waste water treatment plants. Therefore, aquatic organisms are frequently exposed to a broad diversity of pharmaceuticals. The freshwater snail Radix balthica has been chosen as a model to study the effects of oxazepam (a psychotropic drug) on developmental stages ranging from trochophore to hatching. In order to provide a global insight into these effects, transcriptome deep sequencing was performed on exposed embryos. Eighteen libraries were sequenced, six libraries for each of three conditions: control, exposed to the lowest oxazepam concentration with a phenotypic effect (delayed hatching) (TA), and exposed to the oxazepam concentration found in freshwater (TB). A total of 39,759,772 filtered raw reads were assembled into 56,435 contigs having a mean length of 1579.68 bp and a mean depth of 378.96 reads. 44.91% of the contigs have at least one annotation. The differential expression analysis between the control condition and the two exposure conditions revealed 146 differentially expressed contigs, of which 144 were for TA and two for TB. 34.0% were annotated with biological function. Four processes were mainly impacted: two cellular signalling systems (Notch and JNK) and two biosynthesis pathways (Polyamine and Catecholamine pathways). This work reports a large-scale analysis of differentially transcribed genes of R. balthica exposed to oxazepam during egg development until hatching. In addition, these results enriched the de novo database of potential ecotoxicological models.

  10. Sequencing and De Novo Assembly of the Complete Chloroplast Genome of the Peruvian Carrot (Arracacia xanthorrhiza Bancroft)

    PubMed Central

    Alvarado, Javier Santiago; López, Diane Hinojosa; Torres, Isaury Maldonado; Meléndez, María Margarita; Batista, Rosalinda Aybar; Raxwal, Vivek K.; Berríos, Juan A. Negrón

    2017-01-01

    Arracacia xanthorrhiza is an important secondary food crop in South America and Puerto Rico. The lack of crop protection and improvement strategies leads to infections damaging the storage roots. Here, we report the annotated complete chloroplast genome sequence of A. xanthorrhiza as a step toward developing genomic resources for this crop. PMID:28209812

  11. De novo next-generation sequencing, assembling and annotation of Arachis hypogaea L. Spanish botanical type whole plant transcriptome.

    PubMed

    Wu, Ning; Matand, Kanyand; Wu, Huijuan; Li, Baoming; Li, Yue; Zhang, Xiaoli; He, Zheng; Qian, Jialin; Liu, Xu; Conley, Stephan; Bailey, Marshall; Acquaah, George

    2013-05-01

    Peanut is a major agronomic crop within the legume family and an important source of plant oil, proteins, vitamins, and minerals for human consumption, as well as animal feed, bioenergy, and health products. Peanut genomic research lags behind that of other legumes of economic importance, mainly due to the shortage of essential genomic infrastructure, tools, and resources, and the complexity of the peanut genome. This is a pioneering study that explored the peanut Spanish Group whole plant transcriptome and culminated in the development of a unigene database. The study applied modern technologies such as normalization and next-generation sequencing. Overall, it sequenced 8,308,655,800 nucleotides and generated 26,048 unigenes, among which 12,302 were annotated and 8,817 were characterized. The remainder, 13,746 (52.77%) unigenes, had unknown functions. These results will be applied as the reference transcriptome sequences for expanded transcriptome sequencing of the remaining three peanut botanical types (Valencia, Runner, and Virginia), which is currently in progress, as well as for RNA-seq, exome identification, and genomic marker development. They will also provide important tools and resources for genomic research in other legumes and plant species.

  12. Sequencing and De Novo Assembly of the Complete Chloroplast Genome of the Peruvian Carrot (Arracacia xanthorrhiza Bancroft).

    PubMed

    Alvarado, Javier Santiago; López, Diane Hinojosa; Torres, Isaury Maldonado; Meléndez, María Margarita; Batista, Rosalinda Aybar; Raxwal, Vivek K; Berríos, Juan A Negrón; Arun, Alok

    2017-02-16

    Arracacia xanthorrhiza is an important secondary food crop in South America and Puerto Rico. The lack of crop protection and improvement strategies leads to infections damaging the storage roots. Here, we report the annotated complete chloroplast genome sequence of A. xanthorrhiza as a step toward developing genomic resources for this crop.

  13. De Novo Genome Assembly of the Economically Important Weed Horseweed Using Integrated Data from Multiple Sequencing Platforms

    PubMed Central

    Peng, Yanhui; Lai, Zhao; Lane, Thomas; Nageswara-Rao, Madhugiri; Okada, Miki; Jasieniuk, Marie; O’Geen, Henriette; Kim, Ryan W.; Sammons, R. Douglas; Rieseberg, Loren H.; Stewart, C. Neal

    2014-01-01

    Horseweed (Conyza canadensis), a member of the Compositae (Asteraceae) family, was the first broadleaf weed to evolve resistance to glyphosate. Horseweed, one of the most problematic weeds in the world, is a true diploid (2n = 2x = 18), with the smallest genome of any known agricultural weed (335 Mb). Thus, it is an appropriate candidate to help us understand the genetic and genomic bases of weediness. We undertook a draft de novo genome assembly of horseweed by combining data from multiple sequencing platforms (454 GS-FLX, Illumina HiSeq 2000, and PacBio RS) using various libraries with different insertion sizes (approximately 350 bp, 600 bp, 3 kb, and 10 kb) of a Tennessee-accessed, glyphosate-resistant horseweed biotype. From 116.3 Gb (approximately 350× coverage) of data, the genome was assembled into 13,966 scaffolds, with 50% of the assembly contained in scaffolds of at least 33,561 bp (the N50 length). The assembly covered 92.3% of the genome, including the complete chloroplast genome (approximately 153 kb) and a nearly complete mitochondrial genome (approximately 450 kb in 120 scaffolds). The nuclear genome is composed of 44,592 protein-coding genes. Genome resequencing of seven additional horseweed biotypes was performed. These sequence data were assembled and used to analyze genome variation. Simple sequence repeats and single-nucleotide polymorphisms were surveyed. Genomic patterns were detected that were associated with glyphosate-resistant or -susceptible biotypes. The draft genome will be useful to better understand weediness and the evolution of herbicide resistance and to devise new management strategies. The genome will also be useful as another reference genome in the Compositae. To our knowledge, this article represents the first published draft genome of an agricultural weed. PMID:25209985
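
    The "50% of the assembly" figure in the abstract above reads like an N50-style contiguity statistic. As a minimal sketch, assuming the usual definition (the length L such that scaffolds of length at least L hold half of the assembled bases), it could be computed as follows. The function name and the toy lengths are illustrative and do not come from the horseweed assembly.

```python
def n50(lengths):
    """Return the N50 of a list of scaffold/contig lengths.

    Assumed definition: the largest length L such that scaffolds of
    length >= L together cover at least half of the total assembly.
    """
    total = sum(lengths)
    running = 0
    # Walk from the longest scaffold down, accumulating covered bases.
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0  # empty input


# Toy example (hypothetical lengths): total = 20, half = 10.
# 6 covers 6 bases, 6 + 5 covers 11 >= 10, so the N50 is 5.
print(n50([2, 3, 4, 5, 6]))
```

    A single long scaffold dominates the statistic, which is why N50 is normally reported next to the scaffold count and total assembly size, as the abstract does.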

  14. An optimization approach and its application to compare DNA sequences

    NASA Astrophysics Data System (ADS)

    Liu, Liwei; Li, Chao; Bai, Fenglan; Zhao, Qi; Wang, Ying

    2015-02-01

    Studying the evolutionary relationship between biological sequences by comparing and analyzing gene sequences has become one of the main tasks in bioinformatics research. Many valid methods have been applied to DNA sequence alignment. In this paper, we propose a novel method based on the Lempel-Ziv (LZ) complexity to compare biological sequences. Moreover, we introduce a new distance measure and make use of the corresponding similarity matrix to construct a phylogenetic tree without multiple sequence alignment. Further, we construct phylogenetic trees for 24 species of Eutherian mammals and 48 countries of Hepatitis E virus (HEV) by an optimization approach. The results indicate that this new method improves the efficiency of sequence comparison and successfully constructs phylogenies.
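
    To make the idea of an alignment-free, LZ-based comparison concrete, the sketch below counts Lempel-Ziv (1976-style) parsing components and derives a simple normalized distance from them. The distance formula is an assumption chosen for illustration. The abstract does not state the authors' new measure, and this is not it.

```python
def lz_complexity(s):
    """Number of components in the LZ76 exhaustive-history parsing of s.

    Each component is the shortest substring starting at the current
    position that does not already occur in the text preceding its
    last character.
    """
    i, count, n = 0, 0, len(s)
    while i < n:
        k = 1
        # Extend the candidate while it can be copied from earlier text.
        while i + k <= n and s[i:i + k] in s[:i + k - 1]:
            k += 1
        count += 1
        i += k
    return count


def lz_distance(a, b):
    """Illustrative normalized distance (an assumption, not the paper's).

    Measures how many extra components one sequence needs once the
    other is already known, scaled by the larger complexity.
    """
    ca, cb = lz_complexity(a), lz_complexity(b)
    extra = max(lz_complexity(a + b) - ca, lz_complexity(b + a) - cb)
    return extra / max(ca, cb)


print(lz_complexity("0001101001000101"))  # parses as 0|001|10|100|1000|101
print(lz_distance("ACGT", "ACGT"))
```

    A matrix of such pairwise distances can be passed to any distance-based tree-building method. The quadratic substring search here is for clarity only and would be too slow for genome-scale input.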

  15. De Novo Assembly and Characterization of the Transcriptome of Seagrass Zostera marina Using Illumina Paired-End Sequencing

    PubMed Central

    Kong, Fanna; Li, Hong; Sun, Peipei; Zhou, Yang; Mao, Yunxiang

    2014-01-01

    Background: The seagrass Zostera marina is a monocotyledonous angiosperm belonging to a polyphyletic group of plants that can live submerged in marine habitats. Zostera marina L. is one of the most common seagrasses and is considered a cornerstone of marine plant molecular ecology research and comparative studies. However, the mechanisms underlying its adaptation to the marine environment still remain poorly understood due to limited transcriptomic and genomic data. Principal Findings: Here we explored the transcriptome of Z. marina leaves under different environmental conditions using Illumina paired-end sequencing. Approximately 55 million sequencing reads were obtained, representing 58,457 transcripts that correspond to 24,216 unigenes. A total of 14,389 (59.41%) unigenes were annotated by BLAST searches against the NCBI non-redundant protein database. 45.18% and 46.91% of the unigenes had significant similarity with proteins in the Swiss-Prot database and Pfam database, respectively. Among these, 13,897 unigenes were assigned to 57 Gene Ontology (GO) terms and 4,745 unigenes were identified and mapped to 233 pathways via functional annotation against the Kyoto Encyclopedia of Genes and Genomes pathway database (KEGG). We compared the orthologous gene families of the Z. marina transcriptome to Oryza sativa and Pyropia yezoensis and found that 11,667 orthologous gene families are specific to Z. marina. Furthermore, we identified the photoreceptors sensing red/far-red light and blue light. Also, we identified a large number of genes that are involved in ion transporters and channels including Na+ efflux, K+ uptake, Cl− channels, and H+ pumping. Conclusions: Our study contains an extensive sequencing and gene-annotation analysis of Z. marina. This information represents a genetic resource for the discovery of genes related to light sensing and salt tolerance in this species. Our transcriptome can be further utilized in future studies on molecular adaptation to abiotic stress in

  16. Specific versus non-specific immune responses in an invertebrate species evidenced by a comparative de novo sequencing study.

    PubMed

    Deleury, Emeline; Dubreuil, Géraldine; Elangovan, Namasivayam; Wajnberg, Eric; Reichhart, Jean-Marc; Gourbal, Benjamin; Duval, David; Baron, Olga Lucia; Gouzy, Jérôme; Coustau, Christine

    2012-01-01

    Our present understanding of the functioning and evolutionary history of invertebrate innate immunity derives mostly from studies on a few model species belonging to ecdysozoa. In particular, the characterization of signaling pathways dedicated to specific responses towards fungi and Gram-positive or Gram-negative bacteria in Drosophila melanogaster challenged our original view of a non-specific immunity in invertebrates. However, much remains to be elucidated from lophotrochozoan species. To investigate the global specificity of the immune response in the fresh-water snail Biomphalaria glabrata, we used massive Illumina sequencing of 5'-end cDNAs to compare expression profiles after challenge by Gram-positive or Gram-negative bacteria or after a yeast challenge. 5'-end cDNA sequencing of the libraries yielded over 12 million high-quality reads. To link these short reads to expressed genes, we prepared a reference transcriptomic database through automatic assembly and annotation of the 758,510 redundant sequences (ESTs, mRNAs) of B. glabrata available in public databases. Computational analysis of Illumina reads followed by multivariate analyses allowed identification of 1685 candidate transcripts differentially expressed after an immune challenge, with a twofold ratio between transcripts showing a challenge-specific expression versus a lower or non-specific differential expression. Differential expression has been validated using quantitative PCR for a subset of randomly selected candidates. Predicted functions of annotated candidates (approx. 700 unisequences) belonged to a large extent to similar functional categories or protein types. This work significantly expands upon previous gene discovery and expression studies on B. glabrata and suggests that responses to various pathogens may involve similar immune processes or signaling pathways but different genes belonging to multigenic families. These results raise the question of the importance of gene

  17. De novo analysis of peptide tandem mass spectra by spectral graph partitioning.

    PubMed

    Bern, Marshall; Goldberg, David

    2006-03-01

    We report on a new de novo peptide sequencing algorithm that uses spectral graph partitioning. In this approach, relationships between m/z peaks are represented by attractive and repulsive springs, and the vibrational modes of the spring system are used to infer information about the peaks (such as "likely b-ion" or "likely y-ion"). We demonstrate the effectiveness of this approach by comparison with other de novo sequencers on test sets of ion-trap and QTOF spectra, including spectra of mixtures of peptides. On all datasets, we outperform the other sequencers. Along with spectral graph theory techniques, the new de novo sequencer EigenMS incorporates another improvement of independent interest: robust statistical methods for recalibration of time-of-flight mass measurements. Robust recalibration greatly outperforms simple least-squares recalibration, achieving about three times the accuracy for one QTOF dataset.
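
    The spring picture above can be illustrated with a generic signed-graph spectral partition. In this minimal sketch, attractive springs are positive weights, repulsive springs are negative weights, and the lowest vibrational mode of the signed Laplacian splits the peaks into two groups by sign. This is a textbook-style construction under stated assumptions, not the EigenMS implementation. The function name, the weights and the power-iteration solver are all illustrative.

```python
def signed_partition(n, springs, iters=500):
    """Split n nodes into two groups (+1 / -1) from signed springs.

    springs: list of (i, j, w); w > 0 attracts i and j into the same
    group, w < 0 pushes them into opposite groups.
    """
    # Signed Laplacian L = D - W with D_ii = sum_j |w_ij|.
    L = [[0.0] * n for _ in range(n)]
    for i, j, w in springs:
        L[i][i] += abs(w)
        L[j][j] += abs(w)
        L[i][j] -= w
        L[j][i] -= w
    # Eigenvalues of L lie in [0, c], so power iteration on (c*I - L)
    # converges to the mode of L with the smallest eigenvalue.
    c = 2.0 * max(L[i][i] for i in range(n))
    x = [1.0 / (i + 1) for i in range(n)]  # arbitrary non-zero start
    for _ in range(iters):
        y = [c * x[i] - sum(L[i][j] * x[j] for j in range(n)) for i in range(n)]
        norm = sum(v * v for v in y) ** 0.5
        x = [v / norm for v in y]
    return [1 if v >= 0 else -1 for v in x]


# Four hypothetical peaks: 0-1 and 2-3 attract, 0-2 repel.
labels = signed_partition(4, [(0, 1, 1.0), (2, 3, 1.0), (0, 2, -1.0)])
print(labels)
```

    In a de novo sequencing setting the two groups would correspond to labels such as "likely b-ion" and "likely y-ion". How the spring weights are derived from m/z relationships is not described in the abstract and is left open here.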

  18. Solving the Water Jugs Problem by an Integer Sequence Approach

    ERIC Educational Resources Information Center

    Man, Yiu-Kwong

    2012-01-01

    In this article, we present an integer sequence approach to solve the classic water jugs problem. The solution steps can be obtained easily by additions and subtractions only, which is suitable for manual calculation or programming by computer. This approach can be introduced to secondary and undergraduate students, and also to teachers and…
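
    One way to realize a procedure that uses additions and subtractions only is sketched below. It tracks the total amount of water held in both jugs, adding the capacity m when the larger jug still has room and subtracting the capacity n when it is full. This is an illustrative reading of the idea; the article's exact sequence is not given in the abstract, and the function name is hypothetical.

```python
from math import gcd


def jug_sequence(m, n, d):
    """Totals of water held after each step when measuring d litres.

    Jugs have capacities m and n. Each step either fills the m-jug
    (total + m) when the n-jug is not yet full, or empties the full
    n-jug (total - n). Returns None when d cannot be measured in the
    n-jug this way.
    """
    if not (0 < d <= n) or d % gcd(m, n) != 0:
        return None
    seq, total = [], 0
    while total != d:
        total = total + m if total < n else total - n
        seq.append(total)
    return seq


print(jug_sequence(3, 5, 4))  # 3- and 5-litre jugs, target 4 litres
```

    Each step is the map total -> (total + m) mod (m + n), so the loop reaches every multiple of gcd(m, n) below m + n and always terminates for a valid target.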

  19. De novo assembly and characterization of farmed blue fox (Alopex lagopus) global transcriptome using Illumina paired-end sequencing.

    PubMed

    Guo, P C; Yan, S Q; Si, S; Bai, C Y; Zhao, Y; Zhang, Y; Yao, J Y; Li, Y M

    2016-03-28

    The blue fox (Alopex lagopus), a coat-color variant of the Arctic fox, is a domesticated fur-bearing mammal. In the present study, transcriptome data generated from a pool of nine different tissues were obtained with Illumina HiSeq2500 paired-end sequencing technology. After filtering from raw reads, 32,358,290 clean reads were assembled into 161,269 transcripts and 97,252 unigenes by the Trinity fragment assembly software. Of the assembled unigenes, 37,967 were annotated in the National Center for Biotechnology Information (NCBI) Non-Redundant (NR) protein database and 26,264 in the Swiss-Prot database. Among the annotated unigenes, 24,839 and 24,267 were assigned using the Gene Ontology (GO) and euKaryotic Orthologous Groups (KOG) databases, respectively. Altogether, 17,057 unigenes were mapped onto 227 pathways using the Kyoto Encyclopedia of Genes and Genomes database. In addition, 6394 simple sequence repeats were identified by examining 12,965 unigenes (>1 kb), which could contribute to the development of molecular markers. This study generated transcriptome data for the blue fox that will promote further progress in expression profiling studies, and provide a good annotation basis for genomic studies.

  20. MANGO: a new approach to multiple sequence alignment.

    PubMed

    Zhang, Zefeng; Lin, Hao; Li, Ming

    2007-01-01

    Multiple sequence alignment is a classical and challenging task for biological sequence analysis. The problem is NP-hard. The full dynamic programming takes too much time. The progressive alignment heuristics adopted by most state of the art multiple sequence alignment programs suffer from the 'once a gap, always a gap' phenomenon. Is there a radically new way to do multiple sequence alignment? This paper introduces a novel and orthogonal multiple sequence alignment method, using multiple optimized spaced seeds and new algorithms to handle these seeds efficiently. Our new algorithm processes information of all sequences as a whole, avoiding problems caused by the popular progressive approaches. Because the optimized spaced seeds are provably significantly more sensitive than the consecutive k-mers, the new approach promises to be more accurate and reliable. To validate our new approach, we have implemented MANGO: Multiple Alignment with N Gapped Oligos. Experiments were carried out on large 16S RNA benchmarks showing that MANGO compares favorably, in both accuracy and speed, against state-of-art multiple sequence alignment methods, including ClustalW 1.83, MUSCLE 3.6, MAFFT 5.861, Prob-ConsRNA 1.11, Dialign 2.2.1, DIALIGN-T 0.2.1, T-Coffee 4.85, POA 2.0 and Kalign 2.0.
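
    The difference between a spaced seed and a consecutive k-mer can be shown with a small sketch. A seed is written as a string of 1s (positions that must match) and 0s (positions that are ignored), and two sequences share a hit wherever their windows agree on all the 1 positions. The seed pattern "1101" and the function names are illustrative assumptions. They are not MANGO's optimized seeds or its algorithms.

```python
def spaced_key(seq, pos, seed):
    """Characters of the window at pos that fall on the seed's '1' positions."""
    return "".join(seq[pos + i] for i, c in enumerate(seed) if c == "1")


def seed_hits(a, b, seed):
    """All (i, j) such that windows a[i:] and b[j:] match under the seed."""
    span = len(seed)
    # Index every window of a by its spaced key.
    index = {}
    for i in range(len(a) - span + 1):
        index.setdefault(spaced_key(a, i, seed), []).append(i)
    # Look up every window of b in that index.
    hits = []
    for j in range(len(b) - span + 1):
        for i in index.get(spaced_key(b, j, seed), []):
            hits.append((i, j))
    return hits


a, b = "ACGTAC", "ACTTAC"       # differ only at the third base
print(seed_hits(a, b, "1101"))  # spaced seed tolerates that mismatch
print(seed_hits(a, b, "111"))   # consecutive 3-mer cannot anchor at (0, 0)
```

    Both seeds require three matching positions, but the spaced pattern can anchor across the mismatch, which is the intuition behind the sensitivity claim in the abstract.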

  1. Microsatellites from Fosterella christophii (Bromeliaceae) by de novo transcriptome sequencing on the Pacific Biosciences RS platform

    PubMed Central

    Wöhrmann, Tina; Huettel, Bruno; Wagner, Natascha; Weising, Kurt

    2016-01-01

    Premise of the study: Microsatellite markers were developed in Fosterella christophii (Bromeliaceae) to investigate the genetic diversity and population structure within the F. micrantha group, comprising F. christophii, F. micrantha, and F. villosula. Methods and Results: Full-length cDNAs were isolated from F. christophii and sequenced on a Pacific Biosciences RS platform. A total of 1590 high-quality consensus isoforms were assembled into 971 unigenes containing 421 perfect microsatellites. Thirty primer sets were designed, of which 13 revealed a high level of polymorphism in three populations of F. christophii, with four to nine alleles per locus. Each of these 13 loci cross-amplified in the closely related species F. micrantha and F. villosula, with one to six and one to 11 alleles per locus, respectively. Conclusions: The new markers are promising tools to study the population genetics of F. christophii and to discover species boundaries within the F. micrantha group. PMID:26819858

  2. Transcriptome sequencing and de novo analysis of cytoplasmic male sterility and maintenance in JA-CMS cotton.

    PubMed

    Yang, Peng; Han, Jinfeng; Huang, Jinling

    2014-01-01

    Cytoplasmic male sterility (CMS) is the failure to produce functional pollen, which is inherited maternally. It is known that anther development is modulated through complicated interactions between nuclear and mitochondrial genes in sporophytic and gametophytic tissues. However, an unbiased transcriptome sequencing analysis of CMS in cotton is currently lacking in the literature. This study compared differentially expressed (DE) genes of floral buds at the sporogenous cells stage (SS) and microsporocyte stage (MS) (the two most important stages for pollen abortion in JA-CMS) between JA-CMS and its fertile maintainer line JB cotton plants, using the Illumina HiSeq 2000 sequencing platform. A total of 709 (1.8%) DE genes, including 293 up-regulated and 416 down-regulated genes, were identified in the JA-CMS line compared with its maintainer line at the SS stage, and 644 (1.6%) DE genes, with 263 up-regulated and 381 down-regulated genes, were detected at the MS stage. By comparing the two stages in the same material, there were 8 up-regulated and 9 down-regulated DE genes in the JA-CMS line and 29 up-regulated and 9 down-regulated DE genes in the JB maintainer line at the MS stage. Quantitative RT-PCR was used to validate 7 randomly selected DE genes. Bioinformatics analysis revealed that genes involved in reduction-oxidation reactions and alpha-linolenic acid metabolism were down-regulated, while genes pertaining to photosynthesis and flavonoid biosynthesis were up-regulated in JA-CMS floral buds compared with their JB counterparts at the SS and/or MS stages. All four of these biological processes play important roles in reactive oxygen species (ROS) homeostasis, which may be an important factor contributing to the sterile trait of JA-CMS. Further experiments are warranted to elucidate the molecular mechanisms by which these genes lead to CMS.

  3. De novo transcriptome assembly of Ipomoea nil using Illumina sequencing for gene discovery and SSR marker identification.

    PubMed

    Wei, Changhe; Tao, Xiang; Li, Ming; He, Bin; Yan, Lang; Tan, Xuemei; Zhang, Yizheng

    2015-10-01

    Ipomoea nil is widely used as an ornamental plant due to its abundance of flower colors, but the limited transcriptome and genomic data hinder research on it. Using the Illumina platform, transcriptome profiling of I. nil was performed through high-throughput sequencing, which has proven to be a rapid and cost-effective means to characterize gene content. Our goal is to use the resulting information to facilitate research on flowering and flower color formation in I. nil. In total, 268 million unique Illumina RNA-Seq reads were produced and used in the transcriptome assembly. These reads were assembled into 220,117 contigs, of which 137,307 contigs were annotated using the GO and KEGG databases. Based on the result of functional annotations, a total of 89,781 contigs were assigned 455,335 GO term annotations. Meanwhile, 17,418 contigs were identified with pathway annotation and they were functionally assigned to 144 KEGG pathways. Our transcriptome revealed at least 55 contigs as probable flowering-related genes in I. nil, and we also identified 25 contigs that encode key enzymes in the phenylpropanoid biosynthesis pathway. Based on the analysis of gene expression profiles, in the phenylpropanoid biosynthesis pathway of I. nil, the repression of lignin biosynthesis might lead to the redirection of the metabolic flux into anthocyanin biosynthesis. This may be the most likely reason that I. nil has a high anthocyanin content, especially in its flowers. Additionally, 15,537 simple sequence repeats (SSRs) were detected using the MISA software, and these SSRs will benefit future breeding work. Moreover, the information uncovered in this study will also serve as a valuable resource for understanding the flowering and flower color formation mechanisms in I. nil.
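
    For readers unfamiliar with SSR detection, the following minimal sketch finds perfect repeats with a regular expression. It is not MISA and does not reproduce MISA's parameters. The motif-length range, the minimum repeat count and the function name are assumptions chosen for illustration.

```python
import re


def find_ssrs(seq, min_repeats=5, max_motif=6):
    """Find perfect simple sequence repeats in a DNA string.

    Returns (start, motif, repeat_count) for every run in which a motif
    of 2..max_motif bases is repeated at least min_repeats times.
    Thresholds are hypothetical defaults, not taken from any tool.
    """
    # Lazy motif match so the shortest repeating unit is reported,
    # followed by at least (min_repeats - 1) further copies of it.
    pattern = re.compile(r"([ACGT]{2,%d}?)\1{%d,}" % (max_motif, min_repeats - 1))
    hits = []
    for m in pattern.finditer(seq.upper()):
        motif = m.group(1)
        if len(set(motif)) == 1:
            continue  # skip homopolymer runs reported as "AA", "CC", ...
        hits.append((m.start(), motif, len(m.group(0)) // len(motif)))
    return hits


print(find_ssrs("GGATATATATATCC"))  # one (AT)5 repeat starting at index 2
```

    Dedicated tools also handle mononucleotide and compound repeats and use motif-specific thresholds, so counts from a sketch like this would not be comparable to the figures reported in the abstract.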

  4. De novo assembly and characterization of the spleen transcriptome of common carp (Cyprinus carpio) using Illumina paired-end sequencing.

    PubMed

    Li, Guoxi; Zhao, Yinli; Liu, Zhonghu; Gao, Chunsheng; Yan, Fengbin; Liu, Bianzhi; Feng, Jianxin

    2015-06-01

    Common carp (Cyprinus carpio) is one of the most important aquacultured species of the family Cyprinidae, and breeding this species for disease resistance is becoming more and more important. However, at the genome or transcriptome levels, study of the immunogenetics of disease resistance in the common carp is lacking. In this study, 60,316,906 and 75,200,328 paired-end clean reads were obtained from two cDNA libraries of the common carp spleen by Illumina paired-end sequencing technology. Totally, 130,293 unique transcript fragments (unigenes) were assembled, with an average length of 1400.57 bp. Approximately 105,612 (81.06%) unigenes could be annotated according to their homology with matches in the Nr, Nt, Swiss-Prot, COG, GO, or KEGG databases, and they were found to represent 46,747 non-redundant genes. Comparative analysis showed that 59.82% of the unigenes have significant similarity to zebrafish Refseq proteins. Gene expression comparison revealed that 10,432 and 6889 annotated unigenes were, respectively, up- and down-regulated with at least twofold changes between two developmental stages of the common carp spleen. Gene ontology and KEGG analysis were performed to classify all unigenes into functional categories for understanding gene functions and regulation pathways. In addition, 46,847 simple sequence repeats (SSRs) were detected from 35,618 unigenes, and a large number of single nucleotide polymorphism (SNP) and insertion/deletion (INDEL) sites were identified in the spleen transcriptome of common carp. This study has characterized the spleen transcriptome of the common carp for the first time, providing a valuable resource for a better understanding of the common carp immune system and defense mechanisms. This knowledge will also facilitate future functional studies on common carp immunogenetics that may eventually be applied in breeding programs.

  5. De novo sequencing of circulating miRNAs identifies novel markers predicting clinical outcome of locally advanced breast cancer

    PubMed Central

    2012-01-01

    Background: MicroRNAs (miRNAs) have been recently detected in the circulation of cancer patients, where they are associated with clinical parameters. Discovery profiling of circulating small RNAs has not been reported in breast cancer (BC), and was carried out in this study to identify blood-based small RNA markers of BC clinical outcome. Methods: The pre-treatment sera of 42 stage II-III locally advanced and inflammatory BC patients who received neoadjuvant chemotherapy (NCT) followed by surgical tumor resection were analyzed for marker identification by deep sequencing all circulating small RNAs. An independent validation cohort of 26 stage II-III BC patients was used to assess the power of identified miRNA markers. Results: More than 800 miRNA species were detected in the circulation, and observed patterns showed association with histopathological profiles of BC. Groups of circulating miRNAs differentially associated with ER/PR/HER2 status and inflammatory BC were identified. The relative levels of selected miRNAs measured by PCR showed consistency with their abundance determined by deep sequencing. Two circulating miRNAs, miR-375 and miR-122, exhibited strong correlations with clinical outcomes, including NCT response and relapse with metastatic disease. In the validation cohort, higher levels of circulating miR-122 specifically predicted metastatic recurrence in stage II-III BC patients. Conclusions: Our study indicates that certain miRNAs can serve as potential blood-based biomarkers for NCT response, and that miR-122 prevalence in the circulation predicts BC metastasis in early-stage patients. These results may allow optimized chemotherapy treatments and preventive anti-metastasis interventions in future clinical applications. PMID:22400902

  6. De novo identification of VRC01 class HIV-1-neutralizing antibodies by next-generation sequencing of B-cell transcripts.

    PubMed

    Zhu, Jiang; Wu, Xueling; Zhang, Baoshan; McKee, Krisha; O'Dell, Sijy; Soto, Cinque; Zhou, Tongqing; Casazza, Joseph P; Mullikin, James C; Kwong, Peter D; Mascola, John R; Shapiro, Lawrence

    2013-10-22

    Next-generation sequencing of antibody transcripts provides a wealth of data, but the ability to identify function-specific antibodies solely on the basis of sequence has remained elusive. We previously characterized the VRC01 class of antibodies, which target the CD4-binding site on gp120, appear in multiple donors, and broadly neutralize HIV-1. Antibodies of this class have developmental commonalities, but typically share only ∼50% amino acid sequence identity among different donors. Here we apply next-generation sequencing to identify VRC01 class antibodies in a new donor, C38, directly from B cell transcript sequences. We first tested a lineage rank approach, but this was unsuccessful, likely because VRC01 class antibody sequences were not highly prevalent in this donor. We next identified VRC01 class heavy chains through a phylogenetic analysis that included thousands of sequences from C38 and a few known VRC01 class sequences from other donors. This "cross-donor analysis" yielded heavy chains with little sequence homology to previously identified VRC01 class heavy chains. Nonetheless, when reconstituted with the light chain from VRC01, half of the heavy chain chimeric antibodies showed substantial neutralization potency and breadth. We then identified VRC01 class light chains through a five-amino-acid sequence motif necessary for VRC01 light chain recognition. From over a million light chain sequences, we identified 13 candidate VRC01 class members. Pairing of these light chains with the phylogenetically identified C38 heavy chains yielded functional antibodies that effectively neutralized HIV-1. Bioinformatics analysis can thus directly identify functional HIV-1-neutralizing antibodies of the VRC01 class from a sequenced antibody repertoire.

  7. De novo transcriptome sequence assembly from coconut leaves and seeds with a focus on factors involved in RNA-directed DNA methylation.

    PubMed

    Huang, Ya-Yi; Lee, Chueh-Pai; Fu, Jason L; Chang, Bill Chia-Han; Matzke, Antonius J M; Matzke, Marjori

    2014-09-04

    Coconut palm (Cocos nucifera) is a symbol of the tropics and a source of numerous edible and nonedible products of economic value. Despite its nutritional and industrial significance, coconut remains under-represented in public repositories for genomic and transcriptomic data. We report de novo transcript assembly from RNA-seq data and analysis of gene expression in seed tissues (embryo and endosperm) and leaves of a dwarf coconut variety. Assembly of 10 GB sequencing data for each tissue resulted in 58,211 total unigenes in embryo, 61,152 in endosperm, and 33,446 in leaf. Within each unigene pool, 24,857 could be annotated in embryo, 29,731 could be annotated in endosperm, and 26,064 could be annotated in leaf. A KEGG analysis identified 138, 138, and 139 pathways, respectively, in transcriptomes of embryo, endosperm, and leaf tissues. Given the extraordinarily large size of coconut seeds and the importance of small RNA-mediated epigenetic regulation during seed development in model plants, we used homology searches to identify putative homologs of factors required for RNA-directed DNA methylation in coconut. The findings suggest that RNA-directed DNA methylation is important during coconut seed development, particularly in maturing endosperm. This dataset will expand the genomics resources available for coconut and provide a foundation for more detailed analyses that may assist molecular breeding strategies aimed at improving this major tropical crop.

  8. De Novo Transcriptome Sequence Assembly from Coconut Leaves and Seeds with a Focus on Factors Involved in RNA-Directed DNA Methylation

    PubMed Central

    Huang, Ya-Yi; Lee, Chueh-Pai; Fu, Jason L.; Chang, Bill Chia-Han; Matzke, Antonius J. M.; Matzke, Marjori

    2014-01-01

    Coconut palm (Cocos nucifera) is a symbol of the tropics and a source of numerous edible and nonedible products of economic value. Despite its nutritional and industrial significance, coconut remains under-represented in public repositories for genomic and transcriptomic data. We report de novo transcript assembly from RNA-seq data and analysis of gene expression in seed tissues (embryo and endosperm) and leaves of a dwarf coconut variety. Assembly of 10 GB sequencing data for each tissue resulted in 58,211 total unigenes in embryo, 61,152 in endosperm, and 33,446 in leaf. Within each unigene pool, 24,857 could be annotated in embryo, 29,731 could be annotated in endosperm, and 26,064 could be annotated in leaf. A KEGG analysis identified 138, 138, and 139 pathways, respectively, in transcriptomes of embryo, endosperm, and leaf tissues. Given the extraordinarily large size of coconut seeds and the importance of small RNA-mediated epigenetic regulation during seed development in model plants, we used homology searches to identify putative homologs of factors required for RNA-directed DNA methylation in coconut. The findings suggest that RNA-directed DNA methylation is important during coconut seed development, particularly in maturing endosperm. This dataset will expand the genomics resources available for coconut and provide a foundation for more detailed analyses that may assist molecular breeding strategies aimed at improving this major tropical crop. PMID:25193496

  9. Multiplexed next-generation sequencing and de novo assembly to obtain near full-length HIV-1 genome from plasma virus.

    PubMed

    Aralaguppe, Shambhu G; Siddik, Abu Bakar; Manickam, Ashokkumar; Ambikan, Anoop T; Kumar, Milner M; Fernandes, Sunjay Jude; Amogne, Wondwossen; Bangaruswamy, Dhinoth K; Hanna, Luke Elizabeth; Sonnerborg, Anders; Neogi, Ujjwal

    2016-10-01

    Analysing the HIV-1 near full-length genome (HIV-NFLG) provides new insight into the diversity of virus population dynamics at the individual or population level. In this study we developed a simple but high-throughput next generation sequencing (NGS) protocol for HIV-NFLG using clinical specimens and validated the method against an external quality control (EQC) panel. Clinical specimens (n=105) were obtained from three cohorts from two highly conserved HIV-1C epidemics (India and Ethiopia) and one diverse epidemic (Sweden). Additionally an EQC panel (n=10) was used to validate the protocol. HIV-NFLG was performed by amplifying the HIV genome (gag-to-nef) in two fragments. NGS was performed using the Illumina HiSeq2500 after multiplexing 24 samples, followed by de novo assembly in Iterative Virus Assembler or VICUNA. Subtyping was carried out using several bioinformatics tools. Amplification of HIV-NFLG had a 90% (95/105) success rate in clinical specimens. NGS was successful in all clinical specimens (n=45) and EQA samples (n=10) attempted. The mean error for mutations for the EQC panel viruses was <1%. Subtyping identified two as A1C recombinants. Our results demonstrate the feasibility of a simple NGS-based HIV-NFLG that can potentially be used in molecular surveillance for effective identification of subtypes and transmission clusters for operational public health intervention.

  10. Development of an expressed gene catalogue and molecular markers from the de novo assembly of short sequence reads of the lentil (Lens culinaris Medik.) transcriptome.

    PubMed

    Verma, Priyanka; Shah, Niraj; Bhatia, Sabhyata

    2013-09-01

    Genomic resources such as ESTs, molecular markers and linkage maps are essential for crop improvement. However, these resources are still limited in important legumes such as lentil (Lens culinaris Medik.), which is valued worldwide as a rich source of dietary protein. In this study, the de novo transcriptome assembly of 119,855,798 short reads, generated by Illumina paired-end sequencing, was performed using various assembly programs. This resulted in 42,196 nonredundant high-quality transcripts of average length 810 bases, N50 value of 1,432 and an average expression per transcript of 26.21 reads per kilobase per million (RPKM). Similarity search with the unigenes and protein sequences of other plants resulted in maximum similarity with soybean. A total of 20,009 nonredundant transcripts showed similarity with the UniProtKB database and of these, 18,064 transcripts were grouped into three main GO categories, that is, biological process (15,126), molecular function (15,505) and cellular component (9,434). Annotated transcripts were mapped to 289 predicted Kyoto Encyclopedia of Genes and Genomes (KEGG) pathways and 8,893 transcripts were classified into 24 functional categories based on Cluster of Orthologous Groups (COG) of proteins. Mining the data set for the presence of SSRs resulted in 8,722 SSRs with a frequency of one SSR per 3.92 kb. From these, 5,673 SSR primer pairs were designed, and a subset of these was utilized for diversity analysis. This study, which provides a large data set of annotated transcripts and gene-based SSR markers, would serve as a foundation for various applications in lentil breeding and genetics.
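
    Two of the summary statistics quoted in this abstract follow from simple arithmetic, sketched below. The function names are hypothetical and the RPKM inputs are made-up illustrative numbers, not values from the study; the SSR spacing check assumes total assembly length can be approximated as transcript count times the (rounded) mean length.

```python
def rpkm(read_count: int, transcript_length_bp: int, total_mapped_reads: int) -> float:
    """Reads per kilobase of transcript per million mapped reads (RPKM)."""
    if transcript_length_bp <= 0 or total_mapped_reads <= 0:
        raise ValueError("length and total mapped reads must be positive")
    # (reads / (length / 1e3)) / (total / 1e6) simplifies to reads * 1e9 / (length * total)
    return read_count * 1e9 / (transcript_length_bp * total_mapped_reads)


def kb_per_ssr(total_length_bp: int, ssr_count: int) -> float:
    """Average spacing between SSRs, in kilobases of assembled sequence."""
    if ssr_count <= 0:
        raise ValueError("ssr_count must be positive")
    return (total_length_bp / 1000) / ssr_count


if __name__ == "__main__":
    # Hypothetical transcript: 500 reads, 2,000 bp long, 10 million mapped reads in the library.
    print(rpkm(500, 2_000, 10_000_000))

    # Approximate assembly size from the abstract: 42,196 transcripts x 810 bases on average.
    approx_total_bp = 42_196 * 810
    print(round(kb_per_ssr(approx_total_bp, 8_722), 2))
```

    Under that approximation the spacing comes out close to the reported one SSR per 3.92 kb, which suggests the figure was derived from the total assembled length.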

  11. De novo TUBB2B mutation causes fetal akinesia deformation sequence with microlissencephaly: An unusual presentation of tubulinopathy.

    PubMed

    Laquerriere, Annie; Gonzales, Marie; Saillour, Yoann; Cavallin, Mara; Joyē, Nicole; Quēlin, Chloé; Bidat, Laurent; Dommergues, Marc; Plessis, Ghislaine; Encha-Razavi, Ferechte; Chelly, Jamel; Bahi-Buisson, Nadia; Poirier, Karine

    2016-04-01

    Tubulinopathies are increasingly emerging as major causes underlying complex cerebral malformations, particularly in case of microlissencephaly often associated with hypoplastic or absent corticospinal tracts. Fetal akinesia deformation sequence (FADS) refers to a clinically and genetically heterogeneous group of disorders with congenital malformations related to impaired fetal movement. We report on an early fetal case with FADS and microlissencephaly due to TUBB2B mutation. Neuropathological examination disclosed virtually absent cortical lamination, foci of neuronal overmigration into the leptomeningeal spaces, corpus callosum agenesis, cerebellar and brainstem hypoplasia and extremely severe hypoplasia of the spinal cord with no anterior and posterior horns and almost no motoneurons. At the cellular level, the p.Cys239Phe TUBB2B mutant leads to tubulin heterodimerization impairment, decreased ability to incorporate into the cytoskeleton, microtubule dynamics alteration, with an accelerated rate of depolymerization. To our knowledge, this is the first case of microlissencephaly to be reported presenting with such a severe and early form of FADS, highlighting the importance of tubulin mutation screening in the context of FADS with microlissencephaly.

  12. De novo sequencing and comprehensive analysis of the mutant transcriptome from purple sweet potato (Ipomoea batatas L.).

    PubMed

    Ma, Peiyong; Bian, Xiaofeng; Jia, Zhaodong; Guo, Xiaoding; Xie, Yizhi

    2016-01-10

    Purple sweet potatoes, rich in anthocyanin, have been widely favored in light of increasing awareness of health and food safety. In this study, a mutant of purple sweet potato (white peel and flesh) was used to study anthocyanin metabolism by high-throughput RNA sequencing and comparative analysis of the mutant and wild type transcriptomes. A total of 88,509 unigenes ranging from 200nt to 14,986nt with an average length of 849nt were obtained. Unigenes were assigned to Gene Ontology (GO), Clusters of Orthologous Group (COG) and Kyoto Encyclopedia of Genes and Genomes (KEGG). Functional enrichment using GO and KEGG annotations showed that 3828 of the differently expressed genes probably influenced many important biological and metabolic pathways, including anthocyanin biosynthesis. Most importantly, the structural and transcription factor genes that contribute to anthocyanin biosynthesis were downregulated in the mutant. The unigene dataset that was used to discover the anthocyanin candidate genes can serve as a comprehensive resource for molecular research in sweet potato.

  13. SNP Detection from De Novo Transcriptome Sequencing in the Bivalve Macoma balthica: Marker Development for Evolutionary Studies

    PubMed Central

    Becquet, Vanessa; Belkhir, Khalid; Bierne, Nicolas; Garcia, Pascale

    2012-01-01

    Hybrid zones are noteworthy systems for the study of environmental adaptation to fast-changing environments, as they constitute reservoirs of polymorphism and are key to the maintenance of biodiversity. They can move in relation to climate fluctuations, as temperature can affect both selection and migration, or remain trapped by environmental and physical barriers. There is therefore a very strong incentive to study the dynamics of hybrid zones subjected to climate variations. The infaunal bivalve Macoma balthica emerges as a noteworthy model species, as divergent lineages hybridize, and its native NE Atlantic range is currently contracting to the North. To investigate the dynamics and functioning of hybrid zones in M. balthica, we developed new molecular markers by sequencing the collective transcriptome of 30 individuals. Ten individuals were pooled for each of the three populations sampled at the margins of two hybrid zones. A single 454 run generated 277 Mb from which 17K SNPs were detected. SNP density averaged 1 polymorphic site every 14 to 19 bases, for mitochondrial and nuclear loci, respectively. An FST scan detected high genetic divergence among several hundred SNPs, some of them involved in energetic metabolism, cellular respiration and physiological stress. The high population differentiation, recorded for nuclear-encoded ATP synthase and NADH dehydrogenase as well as most mitochondrial loci, suggests cytonuclear genetic incompatibilities. Results from this study will help pave the way to a high-resolution study of hybrid zone dynamics in M. balthica, and the relative importance of endogenous and exogenous barriers to gene flow in this system. PMID:23300636

  14. De Novo RNA Sequencing and Transcriptome Analysis of Monascus purpureus and Analysis of Key Genes Involved in Monacolin K Biosynthesis

    PubMed Central

    Zhang, Chan; Liang, Jian; Yang, Le; Sun, Baoguo; Wang, Chengtao

    2017-01-01

    Monascus purpureus is an important medicinal and edible microbial resource. To facilitate biological, biochemical, and molecular research on medicinal components of M. purpureus, we investigated the M. purpureus transcriptome by RNA sequencing (RNA-seq). An RNA-seq library was created using RNA extracted from a mixed sample of M. purpureus expressing different levels of monacolin K output. In total 29,713 unigenes were assembled from more than 60 million high-quality short reads. A BLAST search revealed hits for 21,331 unigenes in at least one of the protein or nucleotide databases used in this study. The 22,365 unigenes were categorized into 48 functional groups based on Gene Ontology classification. Owing to the economic and medicinal importance of M. purpureus, most studies on this organism have focused on the pharmacological activity of chemical components and the molecular function of genes involved in their biogenesis. In this study, we performed quantitative real-time PCR to detect the expression of genes related to monacolin K (mokA-mokI) at different phases (2, 5, 8, and 12 days) of M. purpureus M1 and M1-36. Our study found that mokF modulates monacolin K biogenesis in M. purpureus. Nine genes were suggested to be associated with the monacolin K biosynthesis. Studies on these genes could provide useful information on secondary metabolic processes in M. purpureus. These results indicate a detailed resource through genetic engineering of monacolin K biosynthesis in M. purpureus and related species. PMID:28114365

  15. De novo sequence assembly and characterisation of a partial transcriptome for an evolutionarily distinct reptile, the tuatara (Sphenodon punctatus)

    PubMed Central

    2012-01-01

    Background: The tuatara (Sphenodon punctatus) is a species of extraordinary zoological interest, being the only surviving member of an entire order of reptiles which diverged early in amniote evolution. In addition to their unique phylogenetic placement, many aspects of tuatara biology, including temperature-dependent sex determination, cold adaptation and extreme longevity have the potential to inform studies of genome evolution and development. Despite increasing interest in the tuatara genome, genomic resources for the species are still very limited. We aimed to address this by assembling a transcriptome for tuatara from an early-stage embryo, which will provide a resource for genome annotation, molecular marker development and studies of development and adaptation in tuatara. Results: We obtained 30 million paired-end 50 bp reads from an Illumina Genome Analyzer and assembled them with Velvet and Oases using a range of kmers. After removing redundancy and filtering out low quality transcripts, our transcriptome dataset contained 32911 transcripts, with an N50 of 675 and a mean length of 451 bp. Almost 50% (15965) of these transcripts could be annotated by comparison with the NCBI non-redundant (NR) protein database or the chicken, green anole and zebrafish UniGene sets. A scan of candidate genes and repetitive elements revealed genes involved in immune function, sex differentiation and temperature-sensitivity, as well as over 200 microsatellite markers. Conclusions: This dataset represents a major increase in genomic resources for the tuatara, increasing the number of annotated gene sequences from just 60 to almost 16,000. This will facilitate future research in sex determination, genome evolution, local adaptation and population genetics of tuatara, as well as inform studies on amniote evolution. PMID:22938396
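
    The N50 quoted in this abstract is a standard contiguity statistic: the length of the shortest transcript in the smallest set of longest transcripts that together cover at least half of the total assembled bases. A minimal sketch of that definition follows; the function name and the toy lengths are illustrative assumptions, not data from the study.

```python
from typing import Iterable


def n50(lengths: Iterable[int]) -> int:
    """Return the N50 of a collection of contig or transcript lengths."""
    ordered = sorted(lengths, reverse=True)
    if not ordered:
        raise ValueError("at least one length is required")
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        # The first length at which the running sum reaches half the total is the N50.
        if running >= half_total:
            return length
    raise AssertionError("unreachable: running sum always reaches half the total")


if __name__ == "__main__":
    # Toy assembly: total 1,500 bp, so the halfway point is 750 bp.
    print(n50([100, 200, 300, 400, 500]))
```

    Because N50 weights long sequences more heavily, it can exceed the mean length, as it does here (N50 of 675 against a mean of 451 bp).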

  16. A 454 sequencing approach to dipteran mitochondrial genome research.

    PubMed

    Ramakodi, Meganathan P; Singh, Baneshwar; Wells, Jeffrey D; Guerrero, Felix; Ray, David A

    2015-01-01

    The availability of complete mitochondrial genome (mtgenome) data for Diptera, one of the largest metazoan orders, in public databases is limited. The advent of high throughput sequencing technology provides the potential to generate mtgenomes for many species affordably and quickly. However, these technologies need to be validated for dipterans as the members of this clade play important economic and research roles. Illumina and 454 sequencing platforms are widely used in genomic research involving non-model organisms. The Illumina platform has already been utilized for generating mitochondrial genomes without using conventional long range PCR for insects whereas the power of 454 sequencing for generating mitochondrial genome drafts without PCR has not yet been validated for insects. Thus, this study examines the utility of 454 sequencing approach for dipteran mtgenomic research. We generated complete or nearly complete mitochondrial genomes for Cochliomyia hominivorax, Haematobia irritans, Phormia regina and Sarcophaga crassipalpis using a 454 sequencing approach. Comparisons between newly obtained and existing assemblies for C. hominivorax and H. irritans revealed no major discrepancies and verified the utility of 454 sequencing for dipteran mitochondrial genomes. We also report the complete mitochondrial sequences for two forensically important flies, P. regina and S. crassipalpis, which could be used to provide useful information to legal personnel. Comparative analyses revealed that dipterans follow similar codon usage and nucleotide biases that could be due to mutational and selection pressures. This study illustrates the utility of 454 sequencing to obtain complete mitochondrial genomes for dipterans without the aid of conventional molecular techniques such as PCR and cloning and validates this method of mtgenome sequencing in arthropods.

  17. An ORFome assembly approach to metagenomics sequences analysis.

    PubMed

    Ye, Yuzhen; Tang, Haixu

    2009-06-01

    Metagenomics is an emerging methodology for the direct genomic analysis of a mixed community of uncultured microorganisms. The current analyses of metagenomics data largely rely on the computational tools originally designed for microbial genomics projects. The challenge of assembling metagenomic sequences arises mainly from the short reads and the high species complexity of the community. Alternatively, individual (short) reads will be searched directly against databases of known genes (or proteins) to identify homologous sequences. The latter approach may have low sensitivity and specificity in identifying homologous sequences, which may further bias the subsequent diversity analysis. In this paper, we present a novel approach to metagenomic data analysis, called Metagenomic ORFome Assembly (MetaORFA). The whole computational framework consists of three steps. Each read from a metagenomics project will first be annotated with putative open reading frames (ORFs) that likely encode proteins. Next, the predicted ORFs are assembled into a collection of peptides using an EULER assembly method. Finally, the assembled peptides (i.e. ORFome) are used for database searching of homologs and subsequent diversity analysis. We applied MetaORFA approach to several metagenomics datasets with low coverage short reads. The results show that MetaORFA can produce long peptides even when the sequence coverage of reads is extremely low. Hence, the ORFome assembly significantly increases the sensitivity of homology searching, and may potentially improve the diversity analysis of the metagenomic data. This improvement is especially useful for metagenomic projects when the genome assembly does not work because of the low sequence coverage.
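
    The first step described above, annotating each read with putative ORFs, can be illustrated with a simple six-frame translation. This is only a sketch under stated assumptions: it uses the standard genetic code, treats any stop-free stretch of at least `min_aa` residues as a putative ORF fragment (short reads rarely contain both a start and a stop codon), and is not the ORF predictor used by MetaORFA. All names are hypothetical.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order at each position.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Reverse complement of an uppercase DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def translate(seq: str) -> str:
    """Translate in frame 0; codons with ambiguous bases become 'X'."""
    return "".join(CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3))


def putative_orfs(read: str, min_aa: int = 10) -> list[str]:
    """Collect stop-free peptide stretches from all six reading frames of a read."""
    read = read.upper()
    peptides = []
    for strand in (read, reverse_complement(read)):
        for frame in range(3):
            for segment in translate(strand[frame:]).split("*"):
                if len(segment) >= min_aa:
                    peptides.append(segment)
    return peptides


if __name__ == "__main__":
    # Toy read; a real pipeline would iterate over millions of reads.
    print(putative_orfs("ATGGCCAAATTTGGGCCC", min_aa=6))
```

    The peptides produced this way would then feed the assembly and homology-search steps; those steps are not sketched here.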

  18. The role of melanin pathways in extremotolerance and virulence of Fonsecaea revealed by de novo assembly transcriptomics using illumina paired-end sequencing.

    PubMed

    Li, X Q; Guo, B L; Cai, W Y; Zhang, J M; Huang, H Q; Zhan, P; Xi, L Y; Vicente, V A; Stielow, B; Sun, J F; de Hoog, G S

    2016-01-01

    Melanisation has been considered to be an important virulence factor of Fonsecaea monophora. However, the biosynthetic mechanisms of melanisation remain unknown. We therefore used next generation sequencing technology to investigate the transcriptome and digital gene expression data, which are valuable resources to better understand the molecular and biological mechanisms regulating melanisation in F. monophora. We performed de novo transcriptome assembly and digital gene expression (DGE) profiling analyses of parent (CBS 122845) and albino (CBS 125194) strains using the Illumina RNA-seq system. A total of 17 352 annotated unigenes were found by BLAST search of NR, Swiss-Prot, Gene Ontology, Clusters of Orthologous Groups and Kyoto Encyclopedia of Genes and Genomes (KEGG) (E-value <1e‒5). A total of 2 283 unigenes were judged to be differentially expressed between the two genotypes. We identified most of the genes coding for key enzymes involved in melanin biosynthesis pathways, including polyketide synthase (pks), multicopper oxidase (mco), laccase, tyrosinase and homogentisate 1,2-dioxygenase (hmgA). DEG analysis showed extensive down-regulation of key genes in the DHN pathway, while up-regulation was noted in the DOPA pathway of the albino mutant. The transcript levels of partial genes were confirmed by real time RT-PCR, while the crucial role of key enzymes was confirmed by either inhibitor or substrate tests in vitro. Meanwhile, a number of genes involved in light sensing, cell wall synthesis, morphology and environmental stress were identified in the transcriptome of F. monophora. In addition, 3 353 SSR (Simple Sequence Repeat) markers were identified from 21 600 consensus sequences. Blocking of the DHN pathway is the most likely reason for melanin deficiency in the albino strain, while the production of pheomelanin and pyomelanin was probably regulated by unknown transcription factors upstream of both pathways. Most of genes involved in

  19. Cross-Curricular Sequence: An Approach for Teaching Business Communication.

    ERIC Educational Resources Information Center

    Clarke, Lillian W.; Franklin, Carl M.

    1985-01-01

    The Cross-Curricular Sequencing (CCS) approach to teaching business communications is explored. Its uses in word processing, principles of management, and business policy courses are discussed. Techniques for integrating materials from these courses into business communication classes are described. The implications of CCS for business…

  20. De novo assembly and characterization of the transcriptome of the pancreatic fluke Eurytrema pancreaticum (trematoda: Dicrocoeliidae) using Illumina paired-end sequencing.

    PubMed

    Liu, Guo-Hua; Xu, Min-Jun; Song, Hui-Qun; Wang, Chun-Ren; Zhu, Xing-Quan

    2016-01-15

    Eurytrema pancreaticum is one of the most common trematodes living in the pancreatic and bile ducts of ruminants and also occasionally infects humans, causing eurytremiasis. In spite of its economic and medical importance, very little is known about the genomic resources of this parasite. Herein, we performed de novo sequencing, assembly and characterization of the transcriptome of adult E. pancreaticum. Approximately 36.4 million high-quality clean reads were obtained, and the length of the transcript contigs ranged from 66 to 19,968 nt with mean length of 479 nt and N50 length of 1094 nt, and then 23,573 unigenes were assembled. Of these unigenes, 15,353 (65.1%) were annotated by blast searches against the NCBI non-redundant protein database. Among these, 15,267 (64.8%), 2732 (11.6%) and 10,354 (43.9%) of the unigenes had significant similarity with proteins in the NR, NT and Swiss-Prot databases, respectively. 5510 (23.4%) and 4567 (19.4%) unigenes were assigned to GO and COG, respectively. 8886 (37.7%) unigenes were identified and mapped onto 254 pathways in the KEGG Pathway database. Furthermore, we found that 105 (1.18%) unigenes were related to pancreatic secretion and 61 (0.7%) to pancreatic cancer. The present study represents the first transcriptome of any members of the family Dicrocoeliidae, which has little genomic information available in the public databases. The novel transcriptome of E. pancreaticum should provide a useful resource for designing new strategies against pancreatic flukes and other trematodes of human and animal health significance.
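
    The percentages in this abstract can be re-derived from the counts it reports, which also shows which denominator each one uses. The short check below is an illustrative sketch with a hypothetical helper; it assumes the annotation percentages are fractions of the 23,573 unigenes and that the two pancreas-related percentages are fractions of the 8,886 KEGG-mapped unigenes, which is what the arithmetic suggests.

```python
TOTAL_UNIGENES = 23_573
KEGG_MAPPED = 8_886


def percent(part: int, whole: int, digits: int = 1) -> float:
    """Share of `part` in `whole`, as a percentage rounded to `digits` decimals."""
    if whole <= 0:
        raise ValueError("whole must be positive")
    return round(100 * part / whole, digits)


if __name__ == "__main__":
    # Annotation counts relative to all assembled unigenes.
    for label, count in [("annotated", 15_353), ("NR", 15_267), ("NT", 2_732),
                         ("Swiss-Prot", 10_354), ("GO", 5_510), ("COG", 4_567),
                         ("KEGG", KEGG_MAPPED)]:
        print(label, percent(count, TOTAL_UNIGENES))

    # Pathway-specific counts relative to the KEGG-mapped subset.
    print("pancreatic secretion", percent(105, KEGG_MAPPED, 2))
    print("pancreatic cancer", percent(61, KEGG_MAPPED))
```

    Read this way, the 1.18% and 0.7% figures describe shares of the KEGG-mapped unigenes rather than of the whole transcriptome.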

  1. De novo transcriptome sequencing in Bixa orellana to identify genes involved in methylerythritol phosphate, carotenoid and bixin biosynthesis

    SciTech Connect

    Cárdenas-Conejo, Yair; Carballo-Uicab, Víctor; Lieberman, Meric; Aguilar-Espinosa, Margarita; Comai, Luca; Rivera-Madrid, Renata

    2015-10-28

    Bixin or annatto is a commercially important natural orange-red pigment derived from lycopene that is produced and stored in seeds of Bixa orellana L. An enzymatic pathway for bixin biosynthesis was inferred from homology of putative proteins encoded by differentially expressed seed cDNAs. Some activities were later validated in a heterologous system. Nevertheless, much of the pathway remains to be clarified. For example, it is essential to identify the methylerythritol phosphate (MEP) and carotenoid pathway genes. In order to investigate the MEP, carotenoid, and bixin pathway genes, total RNA from young leaves and two different developmental stages of seeds from B. orellana was used for the construction of indexed mRNA libraries, sequenced on the Illumina HiSeq 2500 platform and assembled de novo using Velvet, CLC Genomics Workbench and CAP3 software. A total of 52,549 contigs were obtained with average length of 1,924 bp. Two phylogenetic analyses of inferred proteins, in one case encoded by thirteen general, single-copy cDNAs, in the other from carotenoid and MEP cDNAs, indicated that B. orellana is closely related to sister Malvales species cacao and cotton. Using homology, we identified 7 and 14 core gene products from the MEP and carotenoid pathways, respectively. Surprisingly, previously defined bixin pathway cDNAs were not present in our transcriptome. Here we propose a new set of gene products involved in the bixin pathway. In conclusion, the identification and qRT-PCR quantification of cDNAs involved in annatto production suggest a hypothetical model for bixin biosynthesis that involves coordinated activation of some MEP, carotenoid and bixin pathway genes. These findings provide a better understanding of the mechanisms regulating these pathways and will facilitate the genetic improvement of B. orellana.

  2. Imparting functionality to biocatalysts via embedding enzymes into nanoporous materials by a de novo approach: size-selective sheltering of catalase in metal-organic framework microcrystals.

    PubMed

    Shieh, Fa-Kuen; Wang, Shao-Chun; Yen, Chia-I; Wu, Chang-Cheng; Dutta, Saikat; Chou, Lien-Yang; Morabito, Joseph V; Hu, Pan; Hsu, Ming-Hua; Wu, Kevin C-W; Tsung, Chia-Kuang

    2015-04-08

    We develop a new concept to impart new functions to biocatalysts by combining enzymes and metal-organic frameworks (MOFs). The proof-of-concept design is demonstrated by embedding catalase molecules into uniformly sized ZIF-90 crystals via a de novo approach. We have carried out electron microscopy, X-ray diffraction, nitrogen sorption, electrophoresis, thermogravimetric analysis, and confocal microscopy to confirm that the ~10 nm catalase molecules are embedded in 2 μm single-crystalline ZIF-90 crystals with ~5 wt % loading. Because catalase is immobilized and sheltered by the ZIF-90 crystals, the composites show activity in hydrogen peroxide degradation even in the presence of protease proteinase K.

  3. De novo sequencing and analysis of the Ulva linza transcriptome to discover putative mechanisms associated with its successful colonization of coastal ecosystems

    PubMed Central

    2012-01-01

    Background: The green algal genus Ulva Linnaeus (Ulvaceae, Ulvales, Chlorophyta) is well known for its wide distribution in marine, freshwater, and brackish environments throughout the world. The Ulva species are also highly tolerant of variations in salinity, temperature, and irradiance and are the main cause of green tides, which can have deleterious ecological effects. However, limited genomic information is currently available in this non-model and ecologically important species. Ulva linza is a species that inhabits bedrock in the mid to low intertidal zone, and it is a major contributor to biofouling. Here, we presented the global characterization of the U. linza transcriptome using the Roche GS FLX Titanium platform, with the aim of uncovering the genomic mechanisms underlying rapid and successful colonization of the coastal ecosystems. Results: De novo assembly of 382,884 reads generated 13,426 contigs with an average length of 1,000 bases. Contiguous sequences were further assembled into 10,784 isotigs with an average length of 1,515 bases. A total of 304,101 reads were nominally identified by BLAST; 4,368 isotigs were functionally annotated with 13,550 GO terms, and 2,404 isotigs having enzyme commission (EC) numbers were assigned to 262 KEGG pathways. When compared with four other full sequenced green algae, 3,457 unique isotigs were found in U. linza and 18 conserved in land plants. In addition, a specific photoprotective mechanism based on both LhcSR and PsbS proteins and a C4-like carbon-concentrating mechanism were found, which may help U. linza survive stress conditions. At least 19 transporters for essential inorganic nutrients (i.e., nitrogen, phosphorus, and sulphur) were responsible for its ability to take up inorganic nutrients, and at least 25 eukaryotic cytochrome P450s, which is a higher number than that found in other algae, may be related to their strong allelopathy. Multi-origination of the stress related proteins, such as glutamate

  4. Molecular Characterization of Transgenic Events Using Next Generation Sequencing Approach

    PubMed Central

    Mammadov, Jafar; Ye, Liang; Soe, Khaing; Richey, Kimberly; Cruse, James; Zhuang, Meibao; Gao, Zhifang; Evans, Clive; Rounsley, Steve; Kumpatla, Siva P.

    2016-01-01

    Demand for the commercial use of genetically modified (GM) crops has been increasing in light of the projected growth of world population to nine billion by 2050. A prerequisite of paramount importance for regulatory submissions is the rigorous safety assessment of GM crops. One of the components of safety assessment is molecular characterization at the DNA level, which helps to determine the copy number, integrity and stability of a transgene; characterize the integration site within a host genome; and confirm the absence of vector DNA. Historically, molecular characterization has been carried out using Southern blot analysis coupled with Sanger sequencing. While this is a robust approach to characterize the transgenic crops, it is both time- and resource-consuming. The emergence of next-generation sequencing (NGS) technologies has provided a highly sensitive and cost- and labor-effective alternative for molecular characterization compared to traditional Southern blot analysis. Herein, we have demonstrated the successful application of both whole genome sequencing and target capture sequencing approaches for the characterization of single and stacked transgenic events and compared the results and inferences with the traditional method with respect to key criteria required for regulatory submissions. PMID:26908260

  5. Identification of genes required for de novo DNA methylation in Arabidopsis

    PubMed Central

    Greenberg, Maxim VC; Ausin, Israel; Chan, Simon WL; Cokus, Shawn J; Cuperus, Josh T; Feng, Suhua; Law, Julie A; Chu, Carolyn; Pellegrini, Matteo; Carrington, James C

    2011-01-01

    De novo DNA methylation in Arabidopsis thaliana is catalyzed by the methyltransferase DRM2, a homolog of the mammalian de novo methyltransferase DNMT3. DRM2 is targeted to DNA by small interfering RNAs (siRNAs) in a process known as RNA-directed DNA Methylation (RdDM). While several components of the RdDM pathway are known, a functional understanding of the underlying mechanism is far from complete. We employed both forward and reverse genetic approaches to identify factors involved in de novo methylation. We utilized the FWA transgene, which is methylated and silenced when transformed into wild-type plants, but unmethylated and expressed when transformed into de novo methylation mutants. Expression of FWA is marked by a late-flowering phenotype, which is easily scored in mutant versus wild-type plants. By reverse genetics we discovered the requirement for known RdDM effectors AGO6 and NRPE5a for efficient de novo methylation. A forward genetic approach uncovered alleles of several components of the RdDM pathway, including alleles of clsy1, ktf1 and nrpd/e2, which have not been previously shown to be required for the initial establishment of DNA methylation. Mutations were mapped and genes cloned by both traditional and whole genome sequencing approaches. The methodologies and the mutant alleles discovered will be instrumental in further studies of de novo DNA methylation. PMID:21150311

  6. Induction of robust de novo centrosome amplification, high-grade spindle multipolarity and metaphase catastrophe: a novel chemotherapeutic approach.

    PubMed

    Pannu, V; Rida, P C G; Ogden, A; Clewley, R; Cheng, A; Karna, P; Lopus, M; Mishra, R C; Zhou, J; Aneja, R

    2012-07-12

    Centrosome amplification (CA) and resultant chromosomal instability have long been associated with tumorigenesis. However, exacerbation of CA and relentless centrosome declustering engender robust spindle multipolarity (SM) during mitosis and may induce cell death. Recently, we demonstrated that a noscapinoid member, reduced bromonoscapine, (S)-3-(R)-9-bromo-5-(4,5-dimethoxy-1,3-dihydroisobenzofuran-1-yl)-4-methoxy-6-methyl-5,6,7,8-tetrahydro-[1,3]dioxolo-[4,5-g]isoquinoline (Red-Br-nos), induces reactive oxygen species (ROS)-mediated autophagy and caspase-independent death in prostate cancer PC-3 cells. Herein, we show that Red-Br-nos induces ROS-dependent DNA damage that resulted in high-grade CA and SM in PC-3 cells. Unlike doxorubicin, which causes double-stranded DNA breaks and chronic G2 arrest accompanied by 'templated' CA, Red-Br-nos-mediated DNA damage elicits de novo CA during a transient S/G2 stall, followed by checkpoint abrogation and mitotic entry to form aberrant mitotic figures with supernumerary spindle poles. Attenuation of multipolar phenotype in the presence of tiron, a ROS inhibitor, indicated that ROS-mediated DNA damage was partly responsible for driving CA and SM. Although a few cells (∼5%) yielded to aberrant cytokinesis following an 'anaphase catastrophe', most mitotically arrested cells (∼70%) succumbed to 'metaphase catastrophe,' which was caspase-independent. This report is the first documentation of rapid de novo centrosome formation in the presence of parent centrosome by a noscapinoid family member, which triggers death-inducing SM via a unique mechanism that distinguishes it from other ROS-inducers, conventional DNA-damaging agents, as well as other microtubule-binding drugs.

  7. De Novo Sequencing, Assembly, and Analysis of the Root Transcriptome of Persea americana (Mill.) in Response to Phytophthora cinnamomi and Flooding

    PubMed Central

    Reeksting, Bianca J.; Coetzer, Nanette; Mahomed, Waheed; Engelbrecht, Juanita; van den Berg, Noëlani

    2014-01-01

    Avocado is a diploid angiosperm containing 24 chromosomes with a genome estimated to be around 920 Mb. It is an important fruit crop worldwide but is susceptible to a root rot caused by the ubiquitous oomycete Phytophthora cinnamomi. Phytophthora root rot (PRR) causes damage to the feeder roots of trees, causing necrosis. This leads to branch-dieback and eventual tree death, resulting in severe losses in production. Control strategies are limited and at present an integrated approach involving the use of phosphite, tolerant rootstocks, and proper nursery management has shown the best results. Disease progression of PRR is accelerated under high soil moisture or flooding conditions. In addition, avocado is highly susceptible to flooding, with even short periods of flooding causing significant losses. Despite the commercial importance of avocado, limited genomic resources are available. Next generation sequencing has provided the means to generate sequence data at a relatively low cost, making this an attractive option for non-model organisms such as avocado. The aims of this study were to generate sequence data for the avocado root transcriptome and identify stress-related genes. Tissue was isolated from avocado infected with P. cinnamomi, avocado exposed to flooding and avocado exposed to a combination of these two stresses. Three separate sequencing runs were performed on the Roche 454 platform and produced approximately 124 Mb of data. This was assembled into 7685 contigs, with 106 448 sequences remaining as singletons. Genes involved in defence pathways such as the salicylic acid and jasmonic acid pathways as well as genes associated with the response to low oxygen caused by flooding, were identified. This is the most comprehensive study of transcripts derived from root tissue of avocado to date and will provide a useful resource for future studies. PMID:24563685

  8. De novo sequencing, assembly, and analysis of the root transcriptome of Persea americana (Mill.) in response to Phytophthora cinnamomi and flooding.

    PubMed

    Reeksting, Bianca J; Coetzer, Nanette; Mahomed, Waheed; Engelbrecht, Juanita; van den Berg, Noëlani

    2014-01-01

    Avocado is a diploid angiosperm containing 24 chromosomes with a genome estimated to be around 920 Mb. It is an important fruit crop worldwide but is susceptible to a root rot caused by the ubiquitous oomycete Phytophthora cinnamomi. Phytophthora root rot (PRR) causes damage to the feeder roots of trees, causing necrosis. This leads to branch-dieback and eventual tree death, resulting in severe losses in production. Control strategies are limited and at present an integrated approach involving the use of phosphite, tolerant rootstocks, and proper nursery management has shown the best results. Disease progression of PRR is accelerated under high soil moisture or flooding conditions. In addition, avocado is highly susceptible to flooding, with even short periods of flooding causing significant losses. Despite the commercial importance of avocado, limited genomic resources are available. Next generation sequencing has provided the means to generate sequence data at a relatively low cost, making this an attractive option for non-model organisms such as avocado. The aims of this study were to generate sequence data for the avocado root transcriptome and identify stress-related genes. Tissue was isolated from avocado infected with P. cinnamomi, avocado exposed to flooding and avocado exposed to a combination of these two stresses. Three separate sequencing runs were performed on the Roche 454 platform and produced approximately 124 Mb of data. This was assembled into 7685 contigs, with 106 448 sequences remaining as singletons. Genes involved in defence pathways such as the salicylic acid and jasmonic acid pathways as well as genes associated with the response to low oxygen caused by flooding, were identified. This is the most comprehensive study of transcripts derived from root tissue of avocado to date and will provide a useful resource for future studies.

  9. Comprehensively identifying and characterizing the missing gene sequences in human reference genome with integrated analytic approaches.

    PubMed

    Chen, Geng; Wang, Charles; Shi, Leming; Tong, Weida; Qu, Xiongfei; Chen, Jiwei; Yang, Jianmin; Shi, Caiping; Chen, Long; Zhou, Peiying; Lu, Bingxin; Shi, Tieliu

    2013-08-01

    The human reference genome is still incomplete and a number of gene sequences are missing from it. The approaches to uncover them, the reasons for their absence and their functions remain little explored. Here, we comprehensively identified and characterized the missing genes of the human reference genome with RNA-Seq data from 16 different human tissues. By using a combined approach of genome-guided transcriptome reconstruction coupled with genome-wide comparison, we uncovered 3.78 and 2.37 Mb of transcribed regions in the human genome assemblies of Celera and HuRef either missed from their homologous chromosomes of NCBI human reference genome build 37.2 or partially or entirely absent from the reference. We further identified a significant number of novel transcript contigs in each tissue from de novo transcriptome assembly that are unalignable to NCBI build 37.2 but can be aligned to at least one of the genomes from Celera, HuRef, chimpanzee, macaque or mouse. Our analyses indicate that the missing genes could result from genome misassembly, transposition, copy number variation, translocation and other structural variations. Moreover, our results further suggest that a large portion of these missing genes are conserved between human and other mammals, implying their important biological functions. In total, 1,233 functional protein domains were detected in these missing genes. Collectively, our study not only provides approaches for uncovering the missing genes of a genome, but also proposes potential reasons why genes are missing from the genome and highlights the importance of uncovering the missing genes of incomplete genomes.

  10. Deep sequencing approach for investigating infectious agents causing fever.

    PubMed

    Susilawati, T N; Jex, A R; Cantacessi, C; Pearson, M; Navarro, S; Susianto, A; Loukas, A C; McBride, W J H

    2016-07-01

    Acute undifferentiated fever (AUF) poses a diagnostic challenge due to the variety of possible aetiologies. While the majority of AUFs resolve spontaneously, some cases become prolonged and cause significant morbidity and mortality, necessitating improved diagnostic methods. This study evaluated the utility of deep sequencing in fever investigation. DNA and RNA were isolated from plasma/sera of AUF cases being investigated at Cairns Hospital in northern Australia, including eight control samples from patients with a confirmed diagnosis. Following isolation, DNA and RNA were bulk amplified and RNA was reverse transcribed to cDNA. The resulting DNA and cDNA amplicons were subjected to deep sequencing on an Illumina HiSeq 2000 platform. Bioinformatics analysis was performed using the program Kraken and the CLC assembly-alignment pipeline. The results were compared with the outcomes of clinical tests. We generated between 4 and 20 million reads per sample. The results of Kraken and CLC analyses concurred with diagnoses obtained by other means in 87.5 % (7/8) and 25 % (2/8) of control samples, respectively. Some plausible causes of fever were identified in ten patients who remained undiagnosed following routine hospital investigations, including Escherichia coli bacteraemia and scrub typhus that eluded conventional tests. Achromobacter xylosoxidans, Alteromonas macleodii and Enterobacteria phage were prevalent in all samples. A deep sequencing approach of patient plasma/serum samples led to the identification of aetiological agents putatively implicated in AUFs and enabled the study of microbial diversity in human blood. The application of this approach in hospital practice is currently limited by sequencing input requirements and complicated data analysis.

  11. De novo sequencing and characterization of a novel Bowman-Birk inhibitor from Lathyrus sativus L. seeds by electrospray mass spectrometry.

    PubMed

    Tamburino, Rachele; Severino, Valeria; Sandomenico, Annamaria; Ruvo, Menotti; Parente, Augusto; Chambery, Angela; Di Maro, Antimo

    2012-10-30

    Bowman-Birk serine protease inhibitors (BBIs) from legume seeds are small proteins showing a two-head structure with distinct reactive site loops, which inhibit two molecules of the same enzyme or two different proteases. Purification and characterization of new BBIs is of broad interest for understanding the basic molecular mechanisms underlying natural defence against the action of proteolytic enzymes. In this study, two novel acidic BBIs (LSI-1a and LSI-2a) were isolated from L. sativus seeds using classical biochemical techniques and characterized for their inhibitory activity. In addition, the N-terminal sequencing of LSI-1a was performed by Edman degradation up to residue 10 and the complete primary structure of the most abundant form (LSI-2a) was determined by using a combination of mass spectrometry approaches, including MALDI-TOF MS, tandem MS and Electron Transfer Dissociation coupled with Proton Transfer Reaction (ETD/PTR) top-down sequencing of N- and C-termini. Furthermore, the LSI-2a dimerization surface has also been investigated by a combination of gel filtration, electrophoretic techniques and homology modelling. Knowing the structure of small proteins inhibiting proteolytic enzymes is of general importance for understanding the defence mechanisms against degradation for their use in biological applications as well as for designing artificial inhibitors.

  12. Fine mapping of de novo CMT1A and HNPP rearrangements within CMT1A-REPs evidences two distinct sex-dependent mechanisms and candidate sequences involved in recombination.

    PubMed

    Lopes, J; Ravisé, N; Vandenberghe, A; Palau, F; Ionasescu, V; Mayer, M; Lévy, N; Wood, N; Tachi, N; Bouche, P; Latour, P; Ruberg, M; Brice, A; LeGuern, E

    1998-01-01

    The molecular mechanism resulting in the duplication or deletion of a 1.5 Mb region of 17p11.2-p12, associated, respectively, with Charcot-Marie-Tooth type 1A (CMT1A) and hereditary neuropathy with liability to pressure palsies (HNPP), has been proposed to be an unequal crossing-over during meiosis between the two chromosome 17 homologues generated by misalignment of the proximal and distal CMT1A-REP repeats, two homologous sequences flanking the 1.5 Mb CMT1A/HNPP monomer unit. In a recent study of a large series of de novo cases of CMT1A and HNPP, two distinct sex-dependent mechanisms were identified. Rearrangements of paternal origin, essentially duplications, were indeed generated by unequal meiotic crossing-over between the two chromosome 17 homologues, but duplications and deletions of maternal origin resulted from an intrachromosomal process, either unequal sister chromatid exchange or, in the case of deletion, excision of an intrachromatidal loop. In order to determine how these recombinations occur, 24 de novo crossover breakpoints were localized within the 1.7 kb rearrangement hot spot by comparing the sequences of the parental CMT1A-REPs with the chimeric copy in affected offspring. Nineteen out of 21 paternal crossovers were found in a 741 bp hot spot. All the breakpoints of maternal origin (n = 3), however, were located outside this interval, but in closely flanking sequences, supporting the hypothesis that two distinct sex-dependent mechanisms are involved. Several putative recombination promoting sequences in the hot spot, which are rare or absent in the surrounding 7.8 kb, were identified.

  13. Graph-based sequence annotation using a data integration approach.

    PubMed

    Pesch, Robert; Lysenko, Artem; Hindle, Matthew; Hassani-Pak, Keywan; Thiele, Ralf; Rawlings, Christopher; Köhler, Jacob; Taubert, Jan

    2008-08-25

    The automated annotation of data from high throughput sequencing and genomics experiments is a significant challenge for bioinformatics. Most current approaches rely on sequential pipelines of gene finding and gene function prediction methods that annotate a gene with information from different reference data sources. Each function prediction method contributes evidence supporting a functional assignment. Such approaches generally ignore the links between the information in the reference datasets. These links, however, are valuable for assessing the plausibility of a function assignment and can be used to evaluate the confidence in a prediction. We are working towards a novel annotation system that uses the network of information supporting the function assignment to enrich the annotation process for use by expert curators and predicting the function of previously unannotated genes. In this paper we describe our success in the first stages of this development. We present the data integration steps that are needed to create the core database of integrated reference databases (UniProt, PFAM, PDB, GO and the pathway database Ara-Cyc) which has been established in the ONDEX data integration system. We also present a comparison between different methods for integration of GO terms as part of the function assignment pipeline and discuss the consequences of this analysis for improving the accuracy of gene function annotation. The methods and algorithms presented in this publication are an integral part of the ONDEX system which is freely available from http://ondex.sf.net/.

  14. Development of an Electrochemistry Teaching Sequence using a Phenomenographic Approach

    NASA Astrophysics Data System (ADS)

    Rodriguez-Velazquez, Sorangel

    the core concepts from discipline-specific models and theories serve as visual tools to describe reversible redox half-reactions at equilibrium, predict the spontaneity of the electrochemical process and explain interfacial equilibrium between redox species and electrodes in solution. The integration of physics concepts into electrochemistry instruction facilitated describing the interactions between the chemical system (e.g., redox species) and the external circuit (e.g., voltmeter). The "Two worlds" theoretical framework was chosen to anchor a robust educational design where the world of objects and events is deliberately connected to the world of theories and models. The core concepts in Marcus theory and density of states (DOS) provided the scientific foundations to connect both worlds. The design of this teaching sequence involved three phases; the selection of the content to be taught, the determination of a coherent and explicit connection among concepts and the development of educational activities to engage students in the learning process. The reduction-oxidation and electrochemistry chapters of three of the most popular general chemistry textbooks were revised in order to identify potential gaps during instruction, taking into consideration learning and teaching difficulties. The electrochemistry curriculum was decomposed into manageable sections contained in modules. Thirteen modules were developed and each module addresses specific conceptions with regard to terminology, redox reactions in electrochemical cells, and the function of the external circuit in electrochemical process. The electrochemistry teaching sequence was evaluated using a phenomenographic approach. This approach allows describing the qualitative variation in instructors' consciousness about the teaching of electrochemistry. A phenomenographic analysis revealed that the most relevant aspect of variation came from instructors' expertise. Participant A expertise (electrochemist) promoted in

  15. De novo sequencing of tryptic peptides sulfonated by 4-sulfophenyl isothiocyanate for unambiguous protein identification using post-source decay matrix-assisted laser desorption/ionization mass spectrometry.

    PubMed

    Chen, Ping; Nie, Song; Mi, Wei; Wang, Xian-Chun; Liang, Song-Ping

    2004-01-01

    A simple method of solid-phase derivatization and sequencing of tryptic peptides has been developed for rapid and unambiguous identification of spots on two-dimensional gels using post-source decay (PSD) matrix-assisted laser desorption/ionization (MALDI) mass spectrometry. The proteolytic digests of proteins are chemically modified by 4-sulfophenyl isothiocyanate. The derivatization reaction introduces a negative sulfonic acid group at the N-terminus of a peptide, which can increase the efficiency of PSD fragmentation and enable the selective detection of only a single series of fragment ions (y-ions). This chemically assisted method avoids the limitation of high background normally observed in MALDI-PSD spectra, makes the spectra easier to interpret and facilitates de novo sequencing of internal fragments. The modification reaction is conducted in C18 microZipTips to decrease the background and to enhance the signal-to-noise ratio. Derivatization procedures were optimized for MALDI-PSD to increase the structural information and to obtain a complete peptide sequence even in critical cases. The MALDI-PSD mass spectra of two model peptides and their sulfonated derivatives are compared. For some proteins unambiguous identification could be achieved by MALDI-PSD sequencing of derivatized peptides obtained from in-gel digests of phosphorylase B and proteins of hepatic satellite cells (HSC).

  16. Suggested Involvement of PP1/PP2A Activity and De Novo Gene Expression in Anhydrobiotic Survival in a Tardigrade, Hypsibius dujardini, by Chemical Genetic Approach.

    PubMed

    Kondo, Koyuki; Kubo, Takeo; Kunieda, Takekazu

    2015-01-01

    Upon desiccation, some tardigrades enter an ametabolic dehydrated state called anhydrobiosis and can survive a desiccated environment in this state. For successful transition to anhydrobiosis, some anhydrobiotic tardigrades require pre-incubation under high humidity conditions, a process called preconditioning, prior to exposure to severe desiccation. Although tardigrades are thought to prepare for transition to anhydrobiosis during preconditioning, the molecular mechanisms governing such processes remain unknown. In this study, we used chemical genetic approaches to elucidate the regulatory mechanisms of anhydrobiosis in the anhydrobiotic tardigrade, Hypsibius dujardini. We first demonstrated that inhibition of transcription or translation drastically impaired anhydrobiotic survival, suggesting that de novo gene expression is required for successful transition to anhydrobiosis in this tardigrade. We then screened 81 chemicals and identified 5 chemicals that significantly impaired anhydrobiotic survival after severe desiccation, in contrast to little or no effect on survival after high humidity exposure only. In particular, cantharidic acid, a selective inhibitor of protein phosphatase (PP) 1 and PP2A, exhibited the most profound inhibitory effects. Another PP1/PP2A inhibitor, okadaic acid, also significantly and specifically impaired anhydrobiotic survival, suggesting that PP1/PP2A activity plays an important role for anhydrobiosis in this species. This is, to our knowledge, the first report of the required activities of signaling molecules for desiccation tolerance in tardigrades. The identified inhibitory chemicals could provide novel clues to elucidate the regulatory mechanisms underlying anhydrobiosis in tardigrades.

  17. Suggested Involvement of PP1/PP2A Activity and De Novo Gene Expression in Anhydrobiotic Survival in a Tardigrade, Hypsibius dujardini, by Chemical Genetic Approach

    PubMed Central

    Kondo, Koyuki; Kubo, Takeo; Kunieda, Takekazu

    2015-01-01

    Upon desiccation, some tardigrades enter an ametabolic dehydrated state called anhydrobiosis and can survive a desiccated environment in this state. For successful transition to anhydrobiosis, some anhydrobiotic tardigrades require pre-incubation under high humidity conditions, a process called preconditioning, prior to exposure to severe desiccation. Although tardigrades are thought to prepare for transition to anhydrobiosis during preconditioning, the molecular mechanisms governing such processes remain unknown. In this study, we used chemical genetic approaches to elucidate the regulatory mechanisms of anhydrobiosis in the anhydrobiotic tardigrade, Hypsibius dujardini. We first demonstrated that inhibition of transcription or translation drastically impaired anhydrobiotic survival, suggesting that de novo gene expression is required for successful transition to anhydrobiosis in this tardigrade. We then screened 81 chemicals and identified 5 chemicals that significantly impaired anhydrobiotic survival after severe desiccation, in contrast to little or no effect on survival after high humidity exposure only. In particular, cantharidic acid, a selective inhibitor of protein phosphatase (PP) 1 and PP2A, exhibited the most profound inhibitory effects. Another PP1/PP2A inhibitor, okadaic acid, also significantly and specifically impaired anhydrobiotic survival, suggesting that PP1/PP2A activity plays an important role for anhydrobiosis in this species. This is, to our knowledge, the first report of the required activities of signaling molecules for desiccation tolerance in tardigrades. The identified inhibitory chemicals could provide novel clues to elucidate the regulatory mechanisms underlying anhydrobiosis in tardigrades. PMID:26690982

  18. Data compression of discrete sequence: A tree based approach using dynamic programming

    NASA Technical Reports Server (NTRS)

    Shivaram, Gurusrasad; Seetharaman, Guna; Rao, T. R. N.

    1994-01-01

    A dynamic programming based approach for data compression of a 1D sequence is presented. The compression of an input sequence of size N to that of a smaller size k is achieved by dividing the input sequence into k subsequences and replacing the subsequences by their respective average values. The partitioning of the input sequence is carried out with the intention of reducing the mean squared error in the reconstructed sequence. The complexity involved in finding the partitions which would result in such an optimal compressed sequence is reduced by using the dynamic programming approach, which is presented.
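
    The partitioning idea described in this abstract can be sketched in Python as follows. This is a minimal illustration, not the report's implementation: it assumes the k subsequences are contiguous and that the objective is the total squared error (which, for fixed N, is minimised by the same partition as the mean squared error). The function name `compress_sequence` and its return format are hypothetical.

```python
from typing import List, Tuple


def compress_sequence(seq: List[float], k: int) -> Tuple[List[float], List[int], float]:
    """Split seq into k contiguous segments and replace each by its mean.

    Returns (segment means, segment start indices, total squared error).
    Assumption: contiguous segments, squared-error objective.
    """
    n = len(seq)
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= len(seq)")

    # Prefix sums of values and squares give O(1) segment costs.
    s = [0.0] * (n + 1)
    q = [0.0] * (n + 1)
    for idx, x in enumerate(seq):
        s[idx + 1] = s[idx] + x
        q[idx + 1] = q[idx] + x * x

    def sse(i: int, j: int) -> float:
        """Squared error of replacing seq[i:j] by its mean."""
        total = s[j] - s[i]
        return (q[j] - q[i]) - total * total / (j - i)

    inf = float("inf")
    # cost[m][j]: best error for the first j items using m segments.
    cost = [[inf] * (n + 1) for _ in range(k + 1)]
    back = [[0] * (n + 1) for _ in range(k + 1)]
    cost[0][0] = 0.0
    for m in range(1, k + 1):
        for j in range(m, n + 1):
            for i in range(m - 1, j):
                c = cost[m - 1][i] + sse(i, j)
                if c < cost[m][j]:
                    cost[m][j] = c
                    back[m][j] = i

    # Walk the back-pointers to recover the segment boundaries.
    bounds = []
    j = n
    for m in range(k, 0, -1):
        i = back[m][j]
        bounds.append((i, j))
        j = i
    bounds.reverse()

    means = [(s[b] - s[a]) / (b - a) for a, b in bounds]
    starts = [a for a, _ in bounds]
    return means, starts, cost[k][n]
```

    This straightforward formulation runs in O(k N^2) time, compared with the exponential number of partitions an exhaustive search would have to examine.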

  19. Sequencing and de novo analysis of the Chinese Sika deer antler-tip transcriptome during the ossification stage using Illumina RNA-Seq technology.

    PubMed

    Yao, Baojin; Zhao, Yu; Zhang, Haishan; Zhang, Mei; Liu, Meichen; Liu, Hailong; Li, Juan

    2012-05-01

    Deer antlers are the only mammalian appendages capable of repeated rounds of regeneration. Every year, deer antlers are shed and regrown from blastema into large branched structures of cartilage and bone. Little is known about the genes involved in antler development particularly during the later stages of ossification. We have produced more than 39 million sequencing reads in a single run using the Illumina sequencing platform. These were assembled into 138,642 unique sequences (mean size: 405 bp) representing 50 times the number of Sika deer sequences previously available in the NCBI database (as of Nov 2, 2011). Based on a similarity search of a database of known proteins, we identified 43,937 sequences with a cut-off E-value of 10^-5. Assembled sequences were annotated using Gene Ontology terms, Clusters of Orthologous Groups classifications and Kyoto Encyclopedia of Genes and Genomes pathways. A number of highly expressed genes involved in the regulation of Sika deer antler ossification, including growth factors, transcription factors and extracellular matrix components were found. This is the most comprehensive sequence resource available for the deer antler and provides a basis for the molecular genetics and functional genomics of deer antler.

  20. Solving the Curriculum Sequencing Problem with DNA Computing Approach

    ERIC Educational Resources Information Center

    Debbah, Amina; Ben Ali, Yamina Mohamed

    2014-01-01

    In e-learning systems, a learning path is a sequence of learning materials linked to each other to help learners achieve their learning goals. As it is impossible to have the same learning path that suits different learners, the Curriculum Sequencing problem (CS) consists of the generation of a personalized learning path for each…

  1. Social and behavioral research in genomic sequencing: approaches from the Clinical Sequencing Exploratory Research Consortium Outcomes and Measures Working Group.

    PubMed

    Gray, Stacy W; Martins, Yolanda; Feuerman, Lindsay Z; Bernhardt, Barbara A; Biesecker, Barbara B; Christensen, Kurt D; Joffe, Steven; Rini, Christine; Veenstra, David; McGuire, Amy L

    2014-10-01

    The routine use of genomic sequencing in clinical medicine has the potential to dramatically alter patient care and medical outcomes. To fully understand the psychosocial and behavioral impact of sequencing integration into clinical practice, it is imperative that we identify the factors that influence sequencing-related decision making and patient outcomes. In an effort to develop a collaborative and conceptually grounded approach to studying sequencing adoption, members of the National Human Genome Research Institute's Clinical Sequencing Exploratory Research Consortium formed the Outcomes and Measures Working Group. Here we highlight the priority areas of investigation and psychosocial and behavioral outcomes identified by the Working Group. We also review some of the anticipated challenges to measurement in social and behavioral research related to genomic sequencing; opportunities for instrument development; and the importance of qualitative, quantitative, and mixed-method approaches. This work represents the early, shared efforts of multiple research teams as we strive to understand individuals' experiences with genomic sequencing. The resulting body of knowledge will guide recommendations for the optimal use of sequencing in clinical practice.

  2. Poisson approach to clustering analysis of regulatory sequences.

    PubMed

    Wang, Haiying; Zheng, Huiru; Hu, Jinglu

    2008-01-01

    The presence of similar patterns in regulatory sequences may aid users in identifying co-regulated genes or inferring regulatory modules. By modelling pattern occurrences in regulatory regions with Poisson statistics, this paper presents a log likelihood ratio statistics-based distance measure to calculate pair-wise similarities between regulatory sequences. We employed it within three clustering algorithms: hierarchical clustering, Self-Organising Map, and a self-adaptive neural network. The results indicate that, in comparison to traditional clustering algorithms, the incorporation of the log likelihood ratio statistics-based distance into the learning process may offer considerable improvements in the process of regulatory sequence-based classification of genes.

  3. BAC-pool 454-sequencing: A rapid and efficient approach to sequence complex tetraploid cotton genomes

    Technology Transfer Automated Retrieval System (TEKTRAN)

    New and emerging next generation sequencing technologies have been promising in reducing sequencing costs, but not significantly for complex polyploid plant genomes such as cotton. The large and highly repetitive genome of G. hirsutum (~2.5 Gb) is less amenable to, and cost-intensive with, traditional BAC-by...

  4. De novo assembly of a haplotype-resolved human genome.

    PubMed

    Cao, Hongzhi; Wu, Honglong; Luo, Ruibang; Huang, Shujia; Sun, Yuhui; Tong, Xin; Xie, Yinlong; Liu, Binghang; Yang, Hailong; Zheng, Hancheng; Li, Jian; Li, Bo; Wang, Yu; Yang, Fang; Sun, Peng; Liu, Siyang; Gao, Peng; Huang, Haodong; Sun, Jing; Chen, Dan; He, Guangzhu; Huang, Weihua; Huang, Zheng; Li, Yue; Tellier, Laurent C A M; Liu, Xiao; Feng, Qiang; Xu, Xun; Zhang, Xiuqing; Bolund, Lars; Krogh, Anders; Kristiansen, Karsten; Drmanac, Radoje; Drmanac, Snezana; Nielsen, Rasmus; Li, Songgang; Wang, Jian; Yang, Huanming; Li, Yingrui; Wong, Gane Ka-Shu; Wang, Jun

    2015-06-01

    The human genome is diploid, and knowledge of the variants on each chromosome is important for the interpretation of genomic information. Here we report the assembly of a haplotype-resolved diploid genome without using a reference genome. Our pipeline relies on fosmid pooling together with whole-genome shotgun strategies, based solely on next-generation sequencing and hierarchical assembly methods. We applied our sequencing method to the genome of an Asian individual and generated a 5.15-Gb assembled genome with a haplotype N50 of 484 kb. Our analysis identified previously undetected indels and 7.49 Mb of novel coding sequences that could not be aligned to the human reference genome, which include at least six predicted genes. This haplotype-resolved genome represents the most complete de novo human genome assembly to date. Application of our approach to identify individual haplotype differences should aid in translating genotypes to phenotypes for the development of personalized medicine.
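
    The abstract above reports a haplotype N50 of 484 kb. As a reminder of what that statistic measures, here is a minimal Python sketch of the standard N50 definition: the length L such that blocks of length L or longer contain at least half of the total length. The function name `n50` and the example lengths are illustrative and are not data from the study; for a haplotype N50 the inputs would be the lengths of the phased blocks.

```python
from typing import Iterable


def n50(lengths: Iterable[int]) -> int:
    """Return the N50 of a collection of block or contig lengths."""
    ordered = sorted(lengths, reverse=True)
    if not ordered:
        raise ValueError("at least one length is required")
    total = sum(ordered)
    running = 0
    for length in ordered:
        running += length
        # Stop at the first block that brings the running sum to half the total.
        if 2 * running >= total:
            return length
    raise AssertionError("unreachable for non-empty input")


# Illustrative lengths only (not taken from the paper).
example_n50 = n50([2, 3, 4, 5, 6])
```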

  5. De novo assembly and characterization of global transcriptome of coconut palm (Cocos nucifera L.) embryogenic calli using Illumina paired-end sequencing.

    PubMed

    Rajesh, M K; Fayas, T P; Naganeeswaran, S; Rachana, K E; Bhavyashree, U; Sajini, K K; Karun, Anitha

    2016-05-01

    Production and supply of quality planting material is significant to coconut cultivation but is one of the major constraints in coconut productivity. Rapid multiplication of coconut through in vitro techniques, therefore, is of paramount importance. Although somatic embryogenesis in coconut is a promising technique that will allow for the mass production of high quality palms, coconut is highly recalcitrant to in vitro culture. In order to overcome the bottlenecks in coconut somatic embryogenesis and to develop a repeatable protocol, it is imperative to understand, identify, and characterize molecular events involved in the coconut somatic embryogenesis pathway. Transcriptome analysis (RNA-Seq) of coconut embryogenic calli, derived from plumular explants of the West Coast Tall cultivar, was undertaken on an Illumina HiSeq 2000 platform. After de novo transcriptome assembly and functional annotation, we obtained 40,367 transcripts which showed significant BLASTx matches with similarity greater than 40 % and an E-value of ≤10^-5. Fourteen genes known to be involved in somatic embryogenesis were identified. Quantitative real-time PCR (qRT-PCR) analyses of these 14 genes were carried out in six developmental stages. The results showed that CLV was upregulated in the initial stage of callogenesis. Transcripts GLP, GST, PKL, WUS, and WRKY were expressed more in the somatic embryo stage. The expression of SERK, MAPK, AP2, SAUR, ECP, AGP, LEA, and ANT was higher in the embryogenic callus stage compared to the initial culture and somatic embryo stages. This study provides the first insights into the gene expression patterns during somatic embryogenesis in coconut.

  6. De Novo Transcriptome Sequencing Analysis of cDNA Library and Large-Scale Unigene Assembly in Japanese Red Pine (Pinus densiflora).

    PubMed

    Liu, Le; Zhang, Shijie; Lian, Chunlan

    2015-12-04

    Japanese red pine (Pinus densiflora) is extensively cultivated in Japan, Korea, China, and Russia and is harvested for timber, pulpwood, garden, and paper markets. However, genetic information and molecular markers were very scarce for this species. In this study, over 51 million sequencing clean reads from P. densiflora mRNA were produced using Illumina paired-end sequencing technology. This yielded 83,913 unigenes with a mean length of 751 bp, of which 54,530 (64.98%) unigenes showed similarity to sequences in the NCBI database. Among these, the best matches in the NCBI Nr database were Picea sitchensis (41.60%), Amborella trichopoda (9.83%), and Pinus taeda (4.15%). A total of 1953 putative microsatellites were identified in 1784 unigenes using MISA (MicroSAtellite) software, of which the tri-nucleotide repeats were most abundant (50.18%), and 629 EST-SSR (expressed sequence tag-simple sequence repeat) primer pairs were successfully designed. Among 20 EST-SSR primer pairs randomly chosen, 17 markers yielded amplification products of the expected size in P. densiflora. Our results will provide a valuable resource for gene-function analysis, germplasm identification, molecular marker-assisted breeding and resistance-related gene(s) mapping for P. densiflora.

  7. De Novo Transcriptome Sequencing Analysis of cDNA Library and Large-Scale Unigene Assembly in Japanese Red Pine (Pinus densiflora)

    PubMed Central

    Liu, Le; Zhang, Shijie; Lian, Chunlan

    2015-01-01

    Japanese red pine (Pinus densiflora) is extensively cultivated in Japan, Korea, China, and Russia and is harvested for timber, pulpwood, garden, and paper markets. However, genetic information and molecular markers were very scarce for this species. In this study, over 51 million sequencing clean reads from P. densiflora mRNA were produced using Illumina paired-end sequencing technology. This yielded 83,913 unigenes with a mean length of 751 bp, of which 54,530 (64.98%) unigenes showed similarity to sequences in the NCBI database. Among these, the best matches in the NCBI Nr database were Picea sitchensis (41.60%), Amborella trichopoda (9.83%), and Pinus taeda (4.15%). A total of 1953 putative microsatellites were identified in 1784 unigenes using MISA (MicroSAtellite) software, of which the tri-nucleotide repeats were most abundant (50.18%), and 629 EST-SSR (expressed sequence tag-simple sequence repeat) primer pairs were successfully designed. Among 20 EST-SSR primer pairs randomly chosen, 17 markers yielded amplification products of the expected size in P. densiflora. Our results will provide a valuable resource for gene-function analysis, germplasm identification, molecular marker-assisted breeding and resistance-related gene(s) mapping for P. densiflora. PMID:26690126

  8. De novo assembly and characterization of the leaf, bud, and fruit transcriptome from the vulnerable tree Juglans mandshurica for the development of 20 new microsatellite markers using Illumina sequencing.

    PubMed

    Hu, Zhuang; Zhang, Tian; Gao, Xiao-Xiao; Wang, Yang; Zhang, Qiang; Zhou, Hui-Juan; Zhao, Gui-Fang; Wang, Ma-Li; Woeste, Keith E; Zhao, Peng

    2016-04-01

    Manchurian walnut (Juglans mandshurica Maxim.) is a vulnerable, temperate deciduous tree valued for its wood and nut, but transcriptomic and genomic data for the species are very limited. Next generation sequencing (NGS) has made it possible to develop molecular markers for this species rapidly and efficiently. Our goal is to use transcriptome information from RNA-Seq to understand development in J. mandshurica and develop polymorphic simple sequence repeats (SSRs, microsatellites) to understand the species' population genetics. In this study, more than 47.7 million clean reads were generated using Illumina sequencing technology. De novo assembly yielded 99,869 unigenes with an average length of 747 bp. Based on sequence similarity search with known proteins, a total of 39,708 (42.32%) genes were identified. Searching against the Kyoto Encyclopedia of Genes and Genomes Pathway database (KEGG) identified 15,903 (16.9%) unigenes. Further, we identified and characterized 63 new transcriptome-derived microsatellite markers. By testing the markers on 4 to 14 individuals from four populations, we found that 20 were polymorphic and easily amplified. The number of alleles per locus ranged from 2 to 8. The observed and expected heterozygosity per locus ranged from 0.209 to 0.813 and 0.335 to 0.842, respectively. These twenty microsatellite markers will be useful for studies of population genetics, diversity, and genetic structure, and they will undoubtedly benefit future breeding studies of this walnut species. Moreover, the information uncovered in this research will also serve as a useful genetic resource for understanding the transcriptome and development of J. mandshurica and other Juglans species.
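
    As a reading aid for the heterozygosity figures above, the sketch below shows how observed and expected heterozygosity are commonly computed for one locus from diploid genotypes. The genotype data and function names are invented for illustration, and the abstract does not say which estimator the authors used; this is the simple uncorrected form, He = 1 - sum(p_i^2).

```python
from collections import Counter


def observed_heterozygosity(genotypes):
    """Fraction of individuals carrying two different alleles at the locus."""
    heterozygotes = sum(1 for a, b in genotypes if a != b)
    return heterozygotes / len(genotypes)


def expected_heterozygosity(genotypes):
    """1 - sum of squared allele frequencies (no small-sample correction)."""
    counts = Counter(allele for genotype in genotypes for allele in genotype)
    total = sum(counts.values())
    return 1.0 - sum((c / total) ** 2 for c in counts.values())


if __name__ == "__main__":
    # Hypothetical allele sizes (bp) for four individuals at one SSR locus.
    locus = [(150, 150), (150, 154), (154, 154), (150, 154)]
    print(observed_heterozygosity(locus), expected_heterozygosity(locus))
```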

  9. De novo sequencing, assembly and analysis of the genome of the laboratory strain Saccharomyces cerevisiae CEN.PK113-7D, a model for modern industrial biotechnology.

    PubMed

    Nijkamp, Jurgen F; van den Broek, Marcel; Datema, Erwin; de Kok, Stefan; Bosman, Lizanne; Luttik, Marijke A; Daran-Lapujade, Pascale; Vongsangnak, Wanwipa; Nielsen, Jens; Heijne, Wilbert H M; Klaassen, Paul; Paddon, Chris J; Platt, Darren; Kötter, Peter; van Ham, Roeland C; Reinders, Marcel J T; Pronk, Jack T; de Ridder, Dick; Daran, Jean-Marc

    2012-03-26

    Saccharomyces cerevisiae CEN.PK 113-7D is widely used for metabolic engineering and systems biology research in industry and academia. We sequenced, assembled, annotated and analyzed its genome. Single-nucleotide variations (SNV), insertions/deletions (indels) and differences in genome organization compared to the reference strain S. cerevisiae S288C were analyzed. In addition to a few large deletions and duplications, nearly 3000 indels were identified in the CEN.PK113-7D genome relative to S288C. These differences were overrepresented in genes whose functions are related to transcriptional regulation and chromatin remodelling. Some of these variations were caused by unstable tandem repeats, suggesting an innate evolvability of the corresponding genes. Besides a previously characterized mutation in adenylate cyclase, the CEN.PK113-7D genome sequence revealed a significant enrichment of non-synonymous mutations in genes encoding for components of the cAMP signalling pathway. Some phenotypic characteristics of the CEN.PK113-7D strains were explained by the presence of additional specific metabolic genes relative to S288C. In particular, the presence of the BIO1 and BIO6 genes correlated with a biotin prototrophy of CEN.PK113-7D. Furthermore, the copy number, chromosomal location and sequences of the MAL loci were resolved. The assembled sequence reveals that CEN.PK113-7D has a mosaic genome that combines characteristics of laboratory strains and wild-industrial strains.

  10. De novo sequencing, assembly and analysis of the genome of the laboratory strain Saccharomyces cerevisiae CEN.PK113-7D, a model for modern industrial biotechnology

    PubMed Central

    2012-01-01

    Saccharomyces cerevisiae CEN.PK 113-7D is widely used for metabolic engineering and systems biology research in industry and academia. We sequenced, assembled, annotated and analyzed its genome. Single-nucleotide variations (SNV), insertions/deletions (indels) and differences in genome organization compared to the reference strain S. cerevisiae S288C were analyzed. In addition to a few large deletions and duplications, nearly 3000 indels were identified in the CEN.PK113-7D genome relative to S288C. These differences were overrepresented in genes whose functions are related to transcriptional regulation and chromatin remodelling. Some of these variations were caused by unstable tandem repeats, suggesting an innate evolvability of the corresponding genes. Besides a previously characterized mutation in adenylate cyclase, the CEN.PK113-7D genome sequence revealed a significant enrichment of non-synonymous mutations in genes encoding for components of the cAMP signalling pathway. Some phenotypic characteristics of the CEN.PK113-7D strains were explained by the presence of additional specific metabolic genes relative to S288C. In particular, the presence of the BIO1 and BIO6 genes correlated with a biotin prototrophy of CEN.PK113-7D. Furthermore, the copy number, chromosomal location and sequences of the MAL loci were resolved. The assembled sequence reveals that CEN.PK113-7D has a mosaic genome that combines characteristics of laboratory strains and wild-industrial strains. PMID:22448915

  11. From DNA sequence to transcriptional behaviour: a quantitative approach.

    PubMed

    Segal, Eran; Widom, Jonathan

    2009-07-01

    Complex transcriptional behaviours are encoded in the DNA sequences of gene regulatory regions. Advances in our understanding of these behaviours have been recently gained through quantitative models that describe how molecules such as transcription factors and nucleosomes interact with genomic sequences. An emerging view is that every regulatory sequence is associated with a unique binding affinity landscape for each molecule and, consequently, with a unique set of molecule-binding configurations and transcriptional outputs. We present a quantitative framework based on existing methods that unifies these ideas. This framework explains many experimental observations regarding the binding patterns of factors and nucleosomes and the dynamics of transcriptional activation. It can also be used to model more complex phenomena such as transcriptional noise and the evolution of transcriptional regulation.

  12. Next-Generation Technologies for Multiomics Approaches Including Interactome Sequencing

    PubMed Central

    Ohashi, Hiroyuki; Miyamoto-Sato, Etsuko

    2015-01-01

    The development of high-speed analytical techniques such as next-generation sequencing and microarrays allows high-throughput analysis of biological information at a low cost. These techniques contribute to medical and bioscience advancements and provide new avenues for scientific research. Here, we outline a variety of new innovative techniques and discuss their use in omics research (e.g., genomics, transcriptomics, metabolomics, proteomics, and interactomics). We also discuss the possible applications of these methods, including an interactome sequencing technology that we developed, in future medical and life science research. PMID:25649523

  13. Targeted Amplicon Sequencing (TAS): A Scalable Next-Gen Approach to Multilocus, Multitaxa Phylogenetics

    PubMed Central

    Bybee, Seth M.; Bracken-Grissom, Heather; Haynes, Benjamin D.; Hermansen, Russell A.; Byers, Robert L.; Clement, Mark J.; Udall, Joshua A.; Wilcox, Edward R.; Crandall, Keith A.

    2011-01-01

    Next-gen sequencing technologies have revolutionized data collection in genetic studies and advanced genome biology to novel frontiers. However, to date, next-gen technologies have been used principally for whole genome sequencing and transcriptome sequencing. Yet many questions in population genetics and systematics rely on sequencing specific genes of known function or diversity levels. Here, we describe a targeted amplicon sequencing (TAS) approach capitalizing on next-gen capacity to sequence large numbers of targeted gene regions from a large number of samples. Our TAS approach is easily scalable, simple in execution, neither time- nor labor-intensive, relatively inexpensive, and can be applied to a broad diversity of organisms and/or genes. Our TAS approach includes a bioinformatic application, BarcodeCrucher, to take raw next-gen sequence reads and perform quality control checks and convert the data into FASTA format organized by gene and sample, ready for phylogenetic analyses. We demonstrate our approach by sequencing targeted genes of known phylogenetic utility to estimate a phylogeny for the Pancrustacea. We generated data from 44 taxa using 68 different 10-bp multiplexing identifiers. The overall quality of data produced was robust and was informative for phylogeny estimation. The potential for this method to produce copious amounts of data from a single 454 plate (e.g., 325 taxa for 24 loci) significantly reduces sequencing expenses incurred from traditional Sanger sequencing. We further discuss the advantages and disadvantages of this method, while offering suggestions to enhance the approach. PMID:22002916
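
    The step of sorting reads by multiplexing identifier and writing them out as FASTA can be sketched as follows. This is not the authors' application: the function names, the MID-to-sample table, and the assumption that the 10-bp identifier sits at the start of each read are all illustrative.

```python
from collections import defaultdict

MID_LENGTH = 10  # the abstract reports 10-bp multiplexing identifiers


def demultiplex(reads, mid_to_sample):
    """Group reads by sample via their leading MID; strip the MID from assigned reads."""
    by_sample = defaultdict(list)
    for read in reads:
        mid, insert = read[:MID_LENGTH], read[MID_LENGTH:]
        sample = mid_to_sample.get(mid)
        if sample is None:
            by_sample["unassigned"].append(read)  # keep the full read for inspection
        else:
            by_sample[sample].append(insert)
    return dict(by_sample)


def to_fasta(sample, seqs):
    """Format one sample's sequences as FASTA records named <sample>_<n>."""
    return "".join(f">{sample}_{i}\n{seq}\n" for i, seq in enumerate(seqs, 1))


if __name__ == "__main__":
    mids = {"ACGTACGTAC": "taxonA", "TTGGCCAATT": "taxonB"}  # hypothetical table
    reads = ["ACGTACGTACGGGG", "TTGGCCAATTCCCC", "AAAAAAAAAAGGGG"]
    for sample, seqs in demultiplex(reads, mids).items():
        print(to_fasta(sample, seqs), end="")
```

    A production tool would also tolerate sequencing errors in the identifier and separate reads by gene-specific primer, which this sketch leaves out.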

  14. De Novo Identification and Biophysical Characterization of Transcription Factor Binding Sites with Microfluidic Affinity Analysis

    PubMed Central

    Fordyce, Polly M.; Gerber, Doron; Tran, Danh; Zheng, Jiashun; Li, Hao; DeRisi, Joseph L.; Quake, Stephen R.

    2010-01-01

    Gene expression is regulated in part by protein transcription factors (TFs) that bind target regulatory DNA sequences. Predicting DNA binding sites and affinities from transcription factor sequence or structure is difficult; therefore, experimental data are required to link TFs to target sequences. We present a microfluidics-based approach for de novo discovery and quantitative biophysical characterization of DNA target sequences. We validated our technique by measuring sequence preferences for 28 S. cerevisiae TFs with a variety of DNA binding domains, including several that have proven difficult to study via other techniques. For each TF, we measured relative binding affinities to oligonucleotides covering all possible 8-bp DNA sequences to create a comprehensive map of sequence preferences; for 4 TFs, we also determined absolute affinities. We anticipate that these data and future use of this technique will provide information essential for understanding TF specificity, improving identification of regulatory sites, and reconstructing regulatory interactions. PMID:20802496
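
    To give a sense of the sequence space covered by "all possible 8-bp DNA sequences", the sketch below enumerates every 8-mer and counts how many remain distinct once a sequence and its reverse complement are treated as the same double-stranded site. The function names are illustrative, and the code says nothing about how the study's oligonucleotide library was actually designed.

```python
from itertools import product

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def all_kmers(k):
    """Yield every DNA sequence of length k over the alphabet A, C, G, T."""
    return ("".join(p) for p in product("ACGT", repeat=k))


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def canonical_kmers(k):
    """Set of k-mers after merging each sequence with its reverse complement."""
    return {min(kmer, reverse_complement(kmer)) for kmer in all_kmers(k)}


if __name__ == "__main__":
    print(sum(1 for _ in all_kmers(8)))  # 4**8 = 65536
    print(len(canonical_kmers(8)))       # (4**8 + 4**4) // 2 = 32896
```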

  15. A Probabilistic Approach for Improved Sequence Mapping in Metatranscriptomic Studies

    Technology Transfer Automated Retrieval System (TEKTRAN)

    Mapping millions of short DNA sequences to a reference genome is a necessary step in many experiments designed to investigate the expression of genes involved in disease resistance. This is a difficult task in which several challenges often arise resulting in a suboptimal mapping. This mapping process ...

  16. Correlation approach to identify coding regions in DNA sequences

    NASA Technical Reports Server (NTRS)

    Ossadnik, S. M.; Buldyrev, S. V.; Goldberger, A. L.; Havlin, S.; Mantegna, R. N.; Peng, C. K.; Simons, M.; Stanley, H. E.

    1994-01-01

    Recently, it was observed that noncoding regions of DNA sequences possess long-range power-law correlations, whereas coding regions typically display only short-range correlations. We develop an algorithm based on this finding that enables investigators to perform a statistical analysis on long DNA sequences to locate possible coding regions. The algorithm is particularly successful in predicting the location of lengthy coding regions. For example, for the complete genome of yeast chromosome III (315,344 nucleotides), at least 82% of the predictions correspond to putative coding regions; the algorithm correctly identified all coding regions larger than 3000 nucleotides, 92% of coding regions between 2000 and 3000 nucleotides long, and 79% of coding regions between 1000 and 2000 nucleotides. The predictive ability of this new algorithm supports the claim that there is a fundamental difference in the correlation property between coding and noncoding sequences. This algorithm, which is not species-dependent, can be implemented with other techniques for rapidly and accurately locating relatively long coding regions in genomic sequences.

  17. Strategic Cognitive Sequencing: A Computational Cognitive Neuroscience Approach

    PubMed Central

    Herd, Seth A.; Krueger, Kai A.; Kriete, Trenton E.; Huang, Tsung-Ren; Hazy, Thomas E.; O'Reilly, Randall C.

    2013-01-01

    We address strategic cognitive sequencing, the “outer loop” of human cognition: how the brain decides what cognitive process to apply at a given moment to solve complex, multistep cognitive tasks. We argue that this topic has been neglected relative to its importance for systematic reasons but that recent work on how individual brain systems accomplish their computations has set the stage for productively addressing how brain regions coordinate over time to accomplish our most impressive thinking. We present four preliminary neural network models. The first addresses how the prefrontal cortex (PFC) and basal ganglia (BG) cooperate to perform trial-and-error learning of short sequences; the next, how several areas of PFC learn to make predictions of likely reward, and how this contributes to the BG making decisions at the level of strategies. The third model addresses how PFC, BG, parietal cortex, and hippocampus can work together to memorize sequences of cognitive actions from instruction (or “self-instruction”). The last shows how a constraint satisfaction process can find useful plans. The PFC maintains current and goal states and associates from both of these to find a “bridging” state, an abstract plan. We discuss how these processes could work together to produce strategic cognitive sequencing and discuss future directions in this area. PMID:23935605

  18. de novo analysis and functional classification of the transcriptome of the root lesion nematode, Pratylenchus thornei, after 454 GS FLX sequencing.

    PubMed

    Nicol, Paul; Gill, Reetinder; Fosu-Nyarko, John; Jones, Michael G K

    2012-01-01

    The migratory endoparasitic root lesion nematode Pratylenchus thornei is a major pest of the cereals wheat and barley. In what we believe to be the first global transcriptome analysis for P. thornei, using Roche GS FLX sequencing, 787,275 reads were assembled into 34,312 contigs using two assembly programs, to yield 6,989 contigs common to both. These contigs were annotated, resulting in functional assignments for 3,048. Specific transcripts studied in more detail included carbohydrate active enzymes potentially involved in cell wall degradation, neuropeptides, putative plant nematode parasitism genes, and transcripts that could be secreted by the nematode. Transcripts for cell wall degrading enzymes were similar to bacterial genes, suggesting that they were acquired by horizontal gene transfer. Contigs matching 14 parasitism genes found in sedentary endoparasitic nematodes were identified. These genes are thought to function in suppression of host defenses and in feeding site development, but their function in P. thornei may differ. Comparison of the common contigs from P. thornei with other nematodes showed that 2,039 were common to sequences of the Heteroderidae, 1,947 to the Meloidogynidae, 1,218 to Radopholus similis, 1,209 matched expressed sequence tags (ESTs) of Pratylenchus penetrans and Pratylenchus vulnus, and 2,940 to contigs of Pratylenchus coffeae. There were 2,014 contigs common to Caenorhabditis elegans, with 15.9% being common to all three groups. Twelve percent of contigs with matches to the Heteroderidae and the Meloidogynidae had no homology to any C. elegans protein. Fifty-seven percent of the contigs did not match known sequences and some could be unique to P. thornei. These data provide substantial new information on the transcriptome of P. thornei, those genes common to migratory and sedentary endoparasitic nematodes, and provide additional understanding of genes required for different forms of parasitism. The data can also be used to

  19. De novo transcriptome sequence assembly and identification of AP2/ERF transcription factor related to abiotic stress in parsley (Petroselinum crispum).

    PubMed

    Li, Meng-Yao; Tan, Hua-Wei; Wang, Feng; Jiang, Qian; Xu, Zhi-Sheng; Tian, Chang; Xiong, Ai-Sheng

    2014-01-01

    Parsley is an important biennial Apiaceae species that is widely cultivated as herb, spice, and vegetable. Previous studies on parsley principally focused on its physiological and biochemical properties, including phenolic compound and volatile oil contents. However, little is known about the molecular and genetic properties of parsley. In this study, 23,686,707 high-quality reads were obtained and assembled into 81,852 transcripts and 50,161 unigenes for the first time. Functional annotation showed that 30,516 unigenes had sequence similarity to known genes. In addition, 3,244 putative simple sequence repeats were detected in curly parsley. Finally, 1,569 of the identified unigenes belonged to 58 transcription factor families. Various abiotic stresses have a strong detrimental effect on the yield and quality of parsley. AP2/ERF transcription factors have important functions in plant development, hormonal regulation, and abiotic response. A total of 88 putative AP2/ERF factors were identified from the transcriptome sequence of parsley. Seven AP2/ERF transcription factors were selected in this study to analyze the expression profiles of parsley under different abiotic stresses. Our data provide a potentially valuable resource that can be used for intensive parsley research.

  20. Approaches to sequence analysis of 125I-labeled RNA.

    PubMed Central

    Dickson, E; Pape, L K; Robertson, H D

    1979-01-01

    A method is described for the initial steps of sequence analysis of RNase T1- and pancreatic RNase-resistant oligonucleotides of RNA containing cytidylate residues labeled in vitro with 125I. In many cases an oligonucleotide sequence can be deduced from a consideration of (i) its relative position in the two-dimensional fingerprint (with DEAE thin layer homochromatographic second dimension), (ii) its electrophoretic mobility on DEAE paper at pH 1.9, and (iii) identification of its products of further enzymatic digestion by comparison with a set of marker oligonucleotides. Additional methods including analysis of oligonucleotides following chemical blocking of uridylate residues with CMCT and analysis of products of incomplete enzymatic digestion are also discussed. PMID:106369

  1. Single cell sequencing approaches for complex biological systems.

    PubMed

    Baslan, Timour; Hicks, James

    2014-06-01

    Biological phenotype is the output of complex interactions between heterogeneous cells within a specified niche. These interactions are tightly governed and regulated by the genetic, epigenetic, and transcriptional states of single cells, with deregulation of these states resulting in disease. As such, genome wide single cell investigations are bound to enhance our knowledge of the underlying principles that govern biological systems. Recent technological advances have enabled such investigations in the form of single-cell sequencing. Here, we review the most recent developments in genome wide profiling of single cells, discuss some of the novel biological observations gleaned by such investigations, and touch upon the promise of single cell sequencing in unraveling biological systems.

  2. A Structural Approach to the Validation of Hierarchical Training Sequences

    DTIC Science & Technology

    1981-06-01

    Assessment; The Current Lack of Validated Hierarchies; Advances in Statistics that Make a Practical Technology for Hierarchy Validation Possible; New...learner’s current level. The statistical models employed in the second year of research could be used to develop tests that could not only sequence...flawed and are extremely time consuming and may not be suitable for broad scale application. Advances in Statistics that Make a Practical Technology for

  3. A glance at quality score: implication for de novo transcriptome reconstruction of Illumina reads.

    PubMed

    Mbandi, Stanley Kimbung; Hesse, Uljana; Rees, D Jasper G; Christoffels, Alan

    2014-01-01

    Downstream analyses of short reads from next-generation sequencing platforms are often preceded by a pre-processing step that removes uncalled and wrongly called bases. Standard approaches rely on their associated base quality scores to retain the read or a portion of it when the score is above a predefined threshold. It is difficult to differentiate sequencing error from biological variation without a reference using quality scores. The effects of quality score based trimming have not been systematically studied in de novo transcriptome assembly. Using RNA-Seq data produced from Illumina, we teased out the effects of quality score based filtering or trimming on de novo transcriptome reconstruction. We showed that assemblies produced from reads subjected to different quality score thresholds contain truncated and missing transfrags when compared to those from untrimmed reads. Our data support the fact that de novo assembly of untrimmed data is challenging for de Bruijn graph assemblers. However, our results indicate that comparing the assemblies from untrimmed and trimmed read subsets can suggest appropriate filtering parameters and enable selection of the optimum de novo transcriptome assembly in non-model organisms.
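
    Quality-score-based trimming of the kind discussed above can be illustrated with a minimal sketch. It assumes Phred+33 encoded quality strings and a simple rule that removes trailing bases below a threshold; the function names and the default threshold of 20 are illustrative and are not the parameters evaluated in the study.

```python
def phred33_scores(quality):
    """Decode a Phred+33 quality string into integer scores."""
    return [ord(ch) - 33 for ch in quality]


def trim_3prime(seq, quality, threshold=20):
    """Remove trailing bases whose quality score is below the threshold."""
    scores = phred33_scores(quality)
    end = len(scores)
    while end > 0 and scores[end - 1] < threshold:
        end -= 1
    return seq[:end], quality[:end]


if __name__ == "__main__":
    # 'I' encodes Q40 and '#' encodes Q2 under Phred+33.
    print(trim_3prime("ACGTACGT", "IIIII###"))
```

    Running the same reads through several thresholds and assembling each subset is one way to make the comparison the abstract recommends.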

  4. De novo Transcriptome Generation and Annotation for Two Korean Endemic Land Snails, Aegista chejuensis and Aegista quelpartensis, Using Illumina Paired-End Sequencing Technology

    PubMed Central

    Kang, Se Won; Patnaik, Bharat Bhusan; Hwang, Hee-Ju; Park, So Young; Wang, Tae Hun; Park, Eun Bi; Chung, Jong Min; Song, Dae Kwon; Patnaik, Hongray Howrelia; Lee, Jae Bong; Kim, Changmu; Kim, Soonok; Park, Hong Seog; Lee, Jun Sang; Han, Yeon Soo; Lee, Yong Seok

    2016-01-01

    Aegista chejuensis and Aegista quelpartensis (Family-Bradybaenidae) are endemic to Korea, and are considered vulnerable due to declines in their population. The limited genetic resources for these species restrict the ability to prioritize conservation efforts. We sequenced the transcriptomes of these species using Illumina paired-end technology. Approximately 257 and 240 million reads were obtained and assembled into 198,531 and 230,497 unigenes for A. chejuensis and A. quelpartensis, respectively. The average and N50 unigene lengths were 735.4 and 1073 bp, respectively, for A. chejuensis, and 705.6 and 1001 bp, respectively, for A. quelpartensis. In total, 68,484 (34.5%) and 77,745 (33.73%) unigenes for A. chejuensis and A. quelpartensis, respectively, were annotated to databases. Gene Ontology terms were assigned to 23,778 (11.98%) and 26,396 (11.45%) unigenes for A. chejuensis and A. quelpartensis, respectively, while 5050 and 5838 unigenes were mapped to 117 and 124 pathways in the Kyoto Encyclopedia of Genes and Genomes database. In addition, we identified and annotated 9542 and 10,395 putative simple sequence repeats (SSRs) in unigenes from A. chejuensis and A. quelpartensis, respectively. We designed a list of PCR primers flanking the putative SSR regions. These microsatellites may be utilized for future phylogenetics and conservation initiatives. PMID:26999110
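
    The average and N50 unigene lengths reported above are standard assembly summary statistics. The sketch below shows the usual definition of N50 (the length of the shortest sequence in the smallest set of longest sequences that together cover at least half of the total assembled bases); the function names and the example lengths are made up for illustration.

```python
def mean_length(lengths):
    """Arithmetic mean of sequence lengths."""
    return sum(lengths) / len(lengths)


def n50(lengths):
    """Length L such that sequences of length >= L hold at least half of all bases."""
    if not lengths:
        raise ValueError("n50 requires at least one sequence length")
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length


if __name__ == "__main__":
    toy_lengths = [2, 3, 4, 5, 6, 7, 8, 9, 10]  # hypothetical unigene lengths
    print(mean_length(toy_lengths), n50(toy_lengths))
```

    Because N50 is weighted toward long sequences, it is typically larger than the mean length, as in the figures quoted in the abstract.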

  5. De novo sequencing and comparative analysis of leaf transcriptomes of diverse condensed tannin-containing lines of underutilized Psophocarpus tetragonolobus (L.) DC

    PubMed Central

    Singh, Vinayak; Goel, Ridhi; Pande, Veena; Asif, Mehar Hasan; Mohanty, Chandra Sekhar

    2017-01-01

    Condensed tannins (CTs), or proanthocyanidins (PAs), are a unique group of high-molecular-weight phenolic metabolites with a specific structure. High CT content in legumes is reported to adversely affect the plant's nutrients and to impair digestibility when consumed by animals. Winged bean (Psophocarpus tetragonolobus (L.) DC.) is a promising underutilized legume with high protein and oil content. One reason for its underutilization is the presence of CT. Transcriptome sequencing of leaves of two diverse CT-containing lines of P. tetragonolobus was carried out on an Illumina NextSeq 500 sequencer to identify the underlying genes and contigs responsible for CT biosynthesis. RNA-Seq data generated 102586 and 88433 contigs for the high-CT (HCTW) and low-CT (LCTW) lines of P. tetragonolobus, respectively. Similarity searches against the Gene Ontology (GO) and Kyoto Encyclopedia of Genes and Genomes (KEGG) databases revealed 5210 contigs involved in 229 different pathways. A total of 1235 contigs were differentially expressed between the HCTW and LCTW lines. This study and its findings will help provide information for functional and comparative genomic analysis of condensed tannin biosynthesis in this plant in particular and in legumes in general. PMID:28322296

  6. De-novo RNA sequencing and metabolite profiling to identify genes involved in anthocyanin biosynthesis in Korean black raspberry (Rubus coreanus Miquel).

    PubMed

    Hyun, Tae Kyung; Lee, Sarah; Rim, Yeonggil; Kumar, Ritesh; Han, Xiao; Lee, Sang Yeol; Lee, Choong Hwan; Kim, Jae-Yean

    2014-01-01

    The Korean black raspberry (Rubus coreanus Miquel, KB) on ripening is usually consumed as fresh fruit, whereas the unripe KB has been widely used as a source of traditional herbal medicine. Such a stage specific utilization of KB has been assumed due to the changing metabolite profile during fruit ripening process, but so far molecular and biochemical changes during its fruit maturation are poorly understood. To analyze biochemical changes during fruit ripening process at molecular level, firstly, we have sequenced, assembled, and annotated the transcriptome of KB fruits. Over 4.86 Gb of normalized cDNA prepared from fruits was sequenced using Illumina HiSeq™ 2000, and assembled into 43,723 unigenes. Secondly, we have reported that alterations in anthocyanins and proanthocyanidins are the major factors facilitating variations in these stages of fruits. In addition, up-regulation of F3'H1, DFR4 and LDOX1 resulted in the accumulation of cyanidin derivatives during the ripening process of KB, indicating the positive relationship between the expression of anthocyanin biosynthetic genes and the anthocyanin accumulation. Furthermore, the ability of RcMCHI2 (R. coreanus Miquel chalcone flavanone isomerase 2) gene to complement Arabidopsis transparent testa 5 mutant supported the feasibility of our transcriptome library to provide the gene resources for improving plant nutrition and pigmentation. Taken together, these datasets obtained from transcriptome library and metabolic profiling would be helpful to define the gene-metabolite relationships in this non-model plant.

  7. De-novo RNA Sequencing and Metabolite Profiling to Identify Genes Involved in Anthocyanin Biosynthesis in Korean Black Raspberry (Rubus coreanus Miquel)

    PubMed Central

    Rim, Yeonggil; Kumar, Ritesh; Han, Xiao; Lee, Sang Yeol; Lee, Choong Hwan; Kim, Jae-Yean

    2014-01-01

    The Korean black raspberry (Rubus coreanus Miquel, KB) on ripening is usually consumed as fresh fruit, whereas the unripe KB has been widely used as a source of traditional herbal medicine. Such a stage specific utilization of KB has been assumed due to the changing metabolite profile during fruit ripening process, but so far molecular and biochemical changes during its fruit maturation are poorly understood. To analyze biochemical changes during fruit ripening process at molecular level, firstly, we have sequenced, assembled, and annotated the transcriptome of KB fruits. Over 4.86 Gb of normalized cDNA prepared from fruits was sequenced using Illumina HiSeq™ 2000, and assembled into 43,723 unigenes. Secondly, we have reported that alterations in anthocyanins and proanthocyanidins are the major factors facilitating variations in these stages of fruits. In addition, up-regulation of F3′H1, DFR4 and LDOX1 resulted in the accumulation of cyanidin derivatives during the ripening process of KB, indicating the positive relationship between the expression of anthocyanin biosynthetic genes and the anthocyanin accumulation. Furthermore, the ability of RcMCHI2 (R. coreanus Miquel chalcone flavanone isomerase 2) gene to complement Arabidopsis transparent testa 5 mutant supported the feasibility of our transcriptome library to provide the gene resources for improving plant nutrition and pigmentation. Taken together, these datasets obtained from transcriptome library and metabolic profiling would be helpful to define the gene-metabolite relationships in this non-model plant. PMID:24505466

  8. A novel approach to multiple sequence alignment using hadoop data grids.

    PubMed

    Sudha Sadasivam, G; Baktavatchalam, G

    2010-01-01

    Multiple alignment of protein sequences helps to determine evolutionary linkage and to predict molecular structures. The factors to be considered while aligning multiple sequences are speed and accuracy of alignment. Although dynamic programming algorithms produce accurate alignments, they are computationally intensive. In this paper we propose a time-efficient approach to sequence alignment that also produces quality alignments. The dynamic nature of the algorithm coupled with the data and computational parallelism of Hadoop data grids improves the accuracy and speed of sequence alignment. The principle of block splitting in Hadoop coupled with its scalability facilitates alignment of very large sequences.

  9. Robin sequence: what the multidisciplinary approach can do

    PubMed Central

    Cohen, Stephanie M; Greathouse, S Travis; Rabbani, Cyrus C; O’Neil, Joseph; Kardatzke, Matthew A; Hall, Tasha E; Bennett, William E; Daftary, Ameet S; Matt, Bruce H; Tholpady, Sunil S

    2017-01-01

    Robin sequence (RS) is a commonly encountered triad of micrognathia, glossoptosis, and airway obstruction, with or without a cleft palate. The management of airway obstruction is of paramount importance, and multiple reviews and retrospective series outline the diagnosis and treatment of RS. This article focuses on the multidisciplinary nature of RS and the specialists’ contributions and thought processes regarding the management of the RS child from birth to skeletal maturity. This review demonstrates that the care of these children extends far beyond the acute airway obstruction and that thorough monitoring and appropriate intervention are required to help them achieve optimal outcomes. PMID:28392703

  10. Two patients with overlapping de novo duplications of the long arm of chromosome 9, including one case with Di George sequence.

    PubMed

    Lindgren, V; Rosinsky, B; Chin, J; Berry-Kravis, E

    1994-01-01

    Duplications of chromosome 9q are rare. We describe the cytogenetic and phenotypic findings in 2 patients, one with a large duplication covering most of 9q (q12-q33.2) and one with a smaller duplication (q21.12-q22.1) who had Di George sequence (DGS). The chromosome 9 origin of the extra material in the second case was confirmed by fluorescence in situ hybridization (FISH) analysis with a whole chromosome 9 paint. Microdeletions of chromosome 22 are common in DGS and have been reported in CHARGE association. This is the first report of an association of a chromosome 9 abnormality with DGS in the absence of a chromosome 22 abnormality and the seventh report of a patient with a duplication of a large portion of 9q (q11-q13 to q32-q33).

  11. Two patients with overlapping de novo duplications of the long arm of chromosome 9, including one case with Di George sequence

    SciTech Connect

    Lindgren, V.; Rosinsky, B.; Chin, J.; Berry-Kravis, E.

    1994-01-01

    Duplications of chromosome 9q are rare. The authors describe the cytogenetic and phenotypic findings in 2 patients, one with a large duplication covering most of 9q (q12-q33.2) and one with a smaller duplication (q21.12-q22.1) who had Di George sequence (DGS). The chromosome 9 origin of the extra material in the second case was confirmed by fluorescence in situ hybridization (FISH) analysis with a whole chromosome 9 paint. Microdeletions of chromosome 22 are common in DGS and have been reported in CHARGE association. This is the first report of an association of a chromosome 9 abnormality with DGS in the absence of a chromosome 22 abnormality and the seventh report of a patient with a duplication of a large portion of 9q (q11-q13 to q32-q33). 31 refs., 4 figs., 1 tab.

  12. De Novo Transcriptome Analysis of Oncomelania hupensis after Molluscicide Treatment by Next-Generation Sequencing: Implications for Biology and Future Snail Interventions

    PubMed Central

    Zhao, Qin Ping; Xiong, Tao; Xu, Xing Jian; Jiang, Ming Sen; Dong, Hui Fen

    2015-01-01

    The freshwater snail Oncomelania hupensis is the only intermediate host of Schistosoma japonicum, which causes schistosomiasis. This disease is endemic in the Far East, especially in mainland China. Because niclosamide is the only molluscicide recommended by the World Health Organization, 50% wettable powder of niclosamide ethanolamine salt (WPN), the only chemical molluscicide available in China, has been widely used as the main snail control method for over two decades. Recently, a novel molluscicide derived from niclosamide, the salt of quinoid-2',5-dichloro-4'-nitro-salicylanilide (Liu Dai Shui Yang An, LDS), has been developed and proven to have the same molluscicidal effect as WPN, with lower cost and significantly lower toxicity to fish than WPN. The mechanism by which these molluscicides cause snail death is not known. Here, we report the next-generation transcriptome sequencing of O. hupensis; 145,008,667 clean reads were generated and assembled into 254,286 unigenes. Using GO and KEGG databases, 14,860 unigenes were assigned GO annotations and 4,686 unigenes were mapped to 250 KEGG pathways. Many sequences involved in key processes associated with biological regulation and innate immunity have been identified. After the snails were exposed to LDS and WPN, 254 unigenes showed significant differential expression. These genes were shown to be involved in cell structure defects and the inhibition of neurohumoral transmission and energy metabolism, which may cause snail death. Gene expression patterns differed after exposure to LDS and WPN, and these differences must be elucidated by the identification and annotation of these unknown unigenes. We believe that this first large-scale transcriptome dataset for O. hupensis will provide an opportunity for the in-depth analysis of this biomedically important freshwater snail at the molecular level and accelerate studies of the O. hupensis genome. The data elucidating the molluscicidal mechanism will be of great

  13. De Novo Sequencing and Analysis of the Safflower Transcriptome to Discover Putative Genes Associated with Safflor Yellow in Carthamus tinctorius L.

    PubMed

    Liu, Xiuming; Dong, Yuanyuan; Yao, Na; Zhang, Yu; Wang, Nan; Cui, Xiyan; Li, Xiaowei; Wang, Yanfang; Wang, Fawei; Yang, Jing; Guan, Lili; Du, Linna; Li, Haiyan; Li, Xiaokun

    2015-10-26

    Safflower (Carthamus tinctorius L.), an important traditional Chinese medicine, is cultured widely for its pharmacological effects, but little is known regarding the genes related to the metabolic regulation of the safflower's yellow pigment. To investigate genes related to safflor yellow biosynthesis, 454 pyrosequencing of flower RNA at different developmental stages was performed, generating large databases. In this study, we analyzed 454 sequencing data from different flowering stages in safflower. In total, 1,151,324 raw reads and 1,140,594 clean reads were produced, which were assembled into 51,591 unigenes with an average length of 679 bp and a maximum length of 5109 bp. Among the unigenes, 40,139 were in the early group, 39,768 were obtained from the full group and 28,316 were detected in both samples. With the threshold of "log2 ratio ≥ 1", there were 34,464 differentially expressed genes, of which 18,043 were up-regulated and 16,421 were down-regulated in the early flower library. Based on the annotations of the unigenes, 281 pathways were predicted. We selected 12 putative genes and analyzed their expression levels using quantitative real time-PCR. The results were consistent with the 454 sequencing results. In addition, the expression of chalcone synthase, chalcone isomerase and anthocyanidin synthase, which are involved in safflor yellow biosynthesis and safflower yellow pigment (SYP) content, were analyzed in different flowering periods, indicating that their expression levels were related to SYP synthesis. Moreover, to further confirm the results of the 454 pyrosequencing, full-length cDNA of chalcone isomerase (CHI) and anthocyanidin synthase (ANS) were cloned from safflower petal by RACE (Rapid-amplification of cDNA ends) method according to fragment of the transcriptome.
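
    The "log2 ratio ≥ 1" threshold mentioned above can be illustrated with a minimal sketch. It assumes normalized expression values for one unigene in the early and full libraries and applies the threshold symmetrically (a ratio of at least 1 counts as up-regulated, at most -1 as down-regulated). The function name, the pseudocount, and the symmetric reading of the threshold are assumptions for illustration, not details taken from the study.

```python
import math

def classify_by_log2_ratio(early, full, threshold=1.0, pseudocount=1e-9):
    """Label a unigene by log2(early / full) against a symmetric threshold.

    `early` and `full` are hypothetical normalized expression values;
    the pseudocount only guards against division by zero.
    """
    ratio = math.log2((early + pseudocount) / (full + pseudocount))
    if ratio >= threshold:
        return "up"         # higher in the early flower library
    if ratio <= -threshold:
        return "down"       # lower in the early flower library
    return "unchanged"      # below the fold-change threshold
```

    A log2 ratio of 1 corresponds to a two-fold change, so this threshold keeps only unigenes whose expression at least doubles or halves between the two libraries.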

  14. Whole Genome Duplication and Enrichment of Metal Cation Transporters Revealed by De Novo Genome Sequencing of Extremely Halotolerant Black Yeast Hortaea werneckii

    PubMed Central

    Jackman, Shaun; Turk, Martina; Sadowski, Ivan; Nislow, Corey; Jones, Steven; Birol, Inanc; Cimerman, Nina Gunde; Plemenitaš, Ana

    2013-01-01

    Hortaea werneckii, ascomycetous yeast from the order Capnodiales, shows an exceptional adaptability to osmotically stressful conditions. To investigate this unusual phenotype we obtained a draft genomic sequence of a H. werneckii strain isolated from hypersaline water of solar saltern. Two of its most striking characteristics that may be associated with a halotolerant lifestyle are the large genetic redundancy and the expansion of genes encoding metal cation transporters. Although no sexual state of H. werneckii has yet been described, a mating locus with characteristics of heterothallic fungi was found. The total assembly size of the genome is 51.6 Mb, larger than most phylogenetically related fungi, coding for almost twice the usual number of predicted genes (23333). The genome appears to have experienced a relatively recent whole genome duplication, and contains two highly identical gene copies of almost every protein. This is consistent with some previous studies that reported increases in genomic DNA content triggered by exposure to salt stress. In hypersaline conditions transmembrane ion transport is of utmost importance. The analysis of predicted metal cation transporters showed that most types of transporters experienced several gene duplications at various points during their evolution. Consequently they are present in much higher numbers than expected. The resulting diversity of transporters presents interesting biotechnological opportunities for improvement of halotolerance of salt-sensitive species. The involvement of plasma P-type H+ ATPases in adaptation to different concentrations of salt was indicated by their salt dependent transcription. This was not the case with vacuolar H+ ATPases, which were transcribed constitutively. The availability of this genomic sequence is expected to promote the research of H. werneckii. Studying its extreme halotolerance will not only contribute to our understanding of life in hypersaline environments, but should also

  15. De-novo assembly and characterization of Chlorella minutissima UTEX2341 transcriptome by paired-end sequencing and the identification of genes related to the biosynthesis of lipids for biodiesel.

    PubMed

    Yu, Mingjia; Yang, Shanjun; Lin, Xiangzhi

    2016-02-01

    Chlorella minutissima is considered to be one of the promising feedstocks for biofuels in the future. In this study, the transcriptome from the oil-rich strain UTEX2341 of C. minutissima was generated based on Illumina paired-end sequencing. Through de-novo assembly, a total of 14,905 isogenes were obtained and compacted into 6216 unigenes. A total of 80% of the unigenes were assigned with GO terms and were further subdivided into 55 sub-categories. KEGG analysis demonstrated that 37.2% of the unigenes could be accessed and mapped into 278 pathways. Interestingly, the genes that encoded key enzymes that are involved in the biosynthesis, elongation, and metabolism of fatty acids were identified, including malonyl-CoA-ACP transacylase, 3-ketoacyl-ACP synthase, 3-ketoacyl-ACP reductase, and others. Moreover, the genes that are involved in triacylglycerol (TAG) biosynthesis and metabolism were also observed. Therefore, the transcriptome analysis of C. minutissima UTEX2341 not only supplies comprehensive insight into the molecular pathway that is involved in the biosynthesis of biofuel precursors but also provides substantial valuable genomic resources to accelerate the further development and utilization of biofuels.

  16. Whale song analyses using bioinformatics sequence analysis approaches

    NASA Astrophysics Data System (ADS)

    Chen, Yian A.; Almeida, Jonas S.; Chou, Lien-Siang

    2005-04-01

    Animal songs are frequently analyzed using discrete hierarchical units, such as units, themes and songs. Because animal songs and bio-sequences may be understood as analogous, bioinformatics analysis tools (DNA/protein sequence alignment and alignment-free methods) are proposed to quantify the theme similarities of the songs of false killer whales recorded off northeast Taiwan. The eighteen themes with discrete units that were identified in an earlier study [Y. A. Chen, master's thesis, University of Charleston, 2001] were compared quantitatively using several distance metrics. These metrics included the scores calculated using the Smith-Waterman algorithm with the repeated procedure, the standardized Euclidean distance, and the angle metrics based on word frequencies. The theme classifications based on different metrics were summarized and compared in dendrograms using cluster analyses. The results agree qualitatively with earlier classifications derived by human observation. These methods further quantify the similarities among themes. These methods could be applied to the analyses of other animal songs on a larger scale. For instance, these techniques could be used to investigate song evolution and cultural transmission by quantifying the dissimilarities of humpback whale songs across different seasons, years, populations, and geographic regions. [Work supported by SC Sea Grant, and Ilan County Government, Taiwan.]
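
    As a rough illustration of an alignment-free angle metric based on word frequencies, the sketch below counts overlapping "words" of k consecutive song units in each theme and returns the angle between the two count vectors. The word length k, the unit labels, and the function names are hypothetical; the abstract does not specify how the authors built their word-frequency vectors.

```python
import math
from collections import Counter

def word_counts(units, k=2):
    """Count overlapping words made of k consecutive units in one theme."""
    return Counter(tuple(units[i:i + k]) for i in range(len(units) - k + 1))

def angle_between(theme_a, theme_b, k=2):
    """Angle in radians between the word-count vectors of two themes."""
    a, b = word_counts(theme_a, k), word_counts(theme_b, k)
    # A Counter returns 0 for words that are absent from the other theme.
    dot = sum(a[w] * b[w] for w in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("each theme needs at least k units")
    # Clamp to [-1, 1] to absorb floating-point rounding before acos.
    cosine = max(-1.0, min(1.0, dot / (norm_a * norm_b)))
    return math.acos(cosine)
```

    Themes with identical word frequencies give an angle near 0, and themes that share no words give an angle of pi/2, so the value can serve directly as a dissimilarity for cluster analysis.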

  17. An Approach to the Design of Mathematical Task Sequences: Conceptual Learning as Abstraction

    ERIC Educational Resources Information Center

    Simon, Martin A.

    2016-01-01

    This paper describes an emerging approach to the design of task sequences and the theory that undergirds it. The approach aims at promoting particular mathematical concepts, understood as the result of reflective abstraction. Central to this approach is the identification of available student activities from which students can abstract the…

  18. Exploring amyloid formation by a de novo design.

    PubMed

    Kammerer, Richard A; Kostrewa, Dirk; Zurdo, Jesús; Detken, Andreas; García-Echeverría, Carlos; Green, Janelle D; Müller, Shirley A; Meier, Beat H; Winkler, Fritz K; Dobson, Christopher M; Steinmetz, Michel O

    2004-03-30

    Protein deposition as amyloid fibrils underlies many debilitating human disorders. The complexity and size of disease-related polypeptides, however, often hinders a detailed rational approach to study effects that contribute to the process of amyloid formation. We report here a simplified peptide sequence successfully designed de novo to fold into a coiled-coil conformation under ambient conditions but to transform into amyloid fibrils at elevated temperatures. We have determined the crystal structure of the coiled-coil form and propose a detailed molecular model for the peptide in its fibrillar state. The relative stabilities of the two structural forms and the kinetics of their interconversion were found to be highly sensitive to small sequence changes. The results reveal the importance of specific packing interactions on the kinetics of amyloid formation and show the potential of this exceptionally favorable system for probing details of the molecular origins of amyloid disease.

  19. A knowledge engineering approach to recognizing and extracting sequences of nucleic acids from scientific literature.

    PubMed

    García-Remesal, Miguel; Maojo, Victor; Crespo, José

    2010-01-01

    In this paper we present a knowledge engineering approach to automatically recognize and extract genetic sequences from scientific articles. To carry out this task, we use a preliminary recognizer based on a finite state machine to extract all candidate DNA/RNA sequences. The latter are then fed into a knowledge-based system that automatically discards false positives and refines noisy and incorrectly merged sequences. We created the knowledge base by manually analyzing different manuscripts containing genetic sequences. Our approach was evaluated using a test set of 211 full-text articles in PDF format containing 3134 genetic sequences. For this set, we achieved 87.76% precision and 97.70% recall. This method can facilitate different research tasks. These include text mining, information extraction, and information retrieval research dealing with large collections of documents containing genetic sequences.

  20. Evaluation of sequencing approaches for high-throughput ...

    EPA Pesticide Factsheets

    Whole-genome in vitro transcriptomics has shown the capability to identify mechanisms of action and estimates of potency for chemical-mediated effects in a toxicological framework, but with limited throughput and high cost. We present the evaluation of three toxicogenomics platforms for potential application to high-throughput screening: 1. TempO-Seq utilizing custom designed paired probes per gene; 2. Targeted sequencing (TSQ) utilizing Illumina’s TruSeq RNA Access Library Prep Kit containing tiled exon-specific probe sets; 3. Low coverage whole transcriptome sequencing (LSQ) using Illumina’s TruSeq Stranded mRNA Kit. Each platform was required to cover the ~20,000 genes of the full transcriptome, operate directly with cell lysates, and be automatable with 384-well plates. Technical reproducibility was assessed using MAQC control RNA samples A and B, while functional utility for chemical screening was evaluated using six treatments at a single concentration after 6 hr in MCF7 breast cancer cells: 10 µM chlorpromazine, 10 µM ciclopirox, 10 µM genistein, 100 nM sirolimus, 1 µM tanespimycin, and 1 µM trichostatin A. All RNA samples and chemical treatments were run with 5 technical replicates. The three platforms achieved different read depths, with the TempO-Seq having ~34M mapped reads per sample, while TSQ and LSQ averaged 20M and 11M aligned reads per sample, respectively. Inter-replicate correlation averaged ≥0.95 for raw log2 expression values i

  1. De novo sequence analysis of cytochrome P450 1-3 genes expressed in ostrich liver with highest expression of CYP2G19.

    PubMed

    Kawai, Yusuke K; Watanabe, Kensuke P; Ishii, Akihiro; Ohnuma, Aiko; Sawa, Hirofumi; Ikenaka, Yoshinori; Ishizuka, Mayumi

    2013-09-01

    The cytochrome P450 (CYP) 1-3 families are involved in xenobiotic metabolism, and are expressed primarily in the liver. Ostriches (Struthio camelus) are members of Palaeognathae with the earliest divergence from other bird lineages. An understanding of genes coding for ostrich xenobiotic metabolizing enzyme contributes to knowledge regarding the xenobiotic metabolisms of other Palaeognathae birds. We investigated CYP1-3 genes expressed in female ostrich liver using a next-generation sequencer. We detected 10 CYP genes: CYP1A5, CYP2C23, CYP2C45, CYP2D49, CYP2G19, CYP2W2, CYP2AC1, CYP2AC2, CYP2AF1, and CYP3A37. We compared the gene expression levels of CYP1A5, CYP2C23, CYP2C45, CYP2D49, CYP2G19, CYP2AF1, and CYP3A37 in ostrich liver and determined that CYP2G19 exhibited the highest expression level. The mRNA expression level of CYP2G19 was approximately 2-10 times higher than those of other CYP genes. The other CYP genes displayed similar expression levels. Our results suggest that CYP2G19, which has not been a focus of previous bird studies, has an important role in ostrich xenobiotic metabolism.

  2. Qualitative de novo analysis of full length cDNA and quantitative analysis of gene expression for common marmoset (Callithrix jacchus) transcriptomes using parallel long-read technology and short-read sequencing.

    PubMed

    Shimizu, Makiko; Iwano, Shunsuke; Uno, Yasuhiro; Uehara, Shotaro; Inoue, Takashi; Murayama, Norie; Onodera, Jun; Sasaki, Erika; Yamazaki, Hiroshi

    2014-01-01

    The common marmoset (Callithrix jacchus) is a non-human primate that could prove useful as human pharmacokinetic and biomedical research models. The cytochromes P450 (P450s) are a superfamily of enzymes that have critical roles in drug metabolism and disposition via monooxygenation of a broad range of xenobiotics; however, information on some marmoset P450s is currently limited. Therefore, identification and quantitative analysis of tissue-specific mRNA transcripts, including those of P450s and flavin-containing monooxygenases (FMO, another monooxygenase family), need to be carried out in detail before the marmoset can be used as an animal model in drug development. De novo assembly and expression analysis of marmoset transcripts were conducted with pooled liver, intestine, kidney, and brain samples from three male and three female marmosets. After unique sequences were automatically aligned by assembling software, the mean contig length was 718 bp (with a standard deviation of 457 bp) among a total of 47,883 transcripts. Approximately 30% of the total transcripts were matched to known marmoset sequences. Gene expression in 18 marmoset P450- and 4 FMO-like genes displayed some tissue-specific patterns. Of these, the three most highly expressed in marmoset liver were P450 2D-, 2E-, and 3A-like genes. In extrahepatic tissues, including brain, gene expressions of these monooxygenases were lower than those in liver, although P450 3A4 (previously P450 3A21) in intestine and P450 4A11- and FMO1-like genes in kidney were relatively highly expressed. By means of massive parallel long-read sequencing and short-read technology applied to marmoset liver, intestine, kidney, and brain, the combined next-generation sequencing analyses reported here were able to identify novel marmoset drug-metabolizing P450 transcripts that have until now been little reported. These results provide a foundation for mechanistic studies and pave the way for the use of marmosets as model animals

  3. Qualitative De Novo Analysis of Full Length cDNA and Quantitative Analysis of Gene Expression for Common Marmoset (Callithrix jacchus) Transcriptomes Using Parallel Long-Read Technology and Short-Read Sequencing

    PubMed Central

    Uno, Yasuhiro; Uehara, Shotaro; Inoue, Takashi; Murayama, Norie; Onodera, Jun; Sasaki, Erika; Yamazaki, Hiroshi

    2014-01-01

    The common marmoset (Callithrix jacchus) is a non-human primate that could prove useful as human pharmacokinetic and biomedical research models. The cytochromes P450 (P450s) are a superfamily of enzymes that have critical roles in drug metabolism and disposition via monooxygenation of a broad range of xenobiotics; however, information on some marmoset P450s is currently limited. Therefore, identification and quantitative analysis of tissue-specific mRNA transcripts, including those of P450s and flavin-containing monooxygenases (FMO, another monooxygenase family), need to be carried out in detail before the marmoset can be used as an animal model in drug development. De novo assembly and expression analysis of marmoset transcripts were conducted with pooled liver, intestine, kidney, and brain samples from three male and three female marmosets. After unique sequences were automatically aligned by assembling software, the mean contig length was 718 bp (with a standard deviation of 457 bp) among a total of 47,883 transcripts. Approximately 30% of the total transcripts were matched to known marmoset sequences. Gene expression in 18 marmoset P450- and 4 FMO-like genes displayed some tissue-specific patterns. Of these, the three most highly expressed in marmoset liver were P450 2D-, 2E-, and 3A-like genes. In extrahepatic tissues, including brain, gene expressions of these monooxygenases were lower than those in liver, although P450 3A4 (previously P450 3A21) in intestine and P450 4A11- and FMO1-like genes in kidney were relatively highly expressed. By means of massive parallel long-read sequencing and short-read technology applied to marmoset liver, intestine, kidney, and brain, the combined next-generation sequencing analyses reported here were able to identify novel marmoset drug-metabolizing P450 transcripts that have until now been little reported. These results provide a foundation for mechanistic studies and pave the way for the use of marmosets as model animals

  4. De novo computer-aided design of novel antiviral agents.

    PubMed

    Massarotti, Alberto; Coluccia, Antonio; Sorba, Giovanni; Silvestri, Romano; Brancale, Andrea

    2012-01-01

    Computer-aided drug design techniques have become an integral part of the drug discovery process. In particular, de novo methodologies can be useful to identify putative ligands for a specific target relying only on the structural information of the target itself. Here we discuss the basic de novo approaches available and their application in antiviral drug design.

  5. De novo transcriptome sequencing of Acer palmatum and comprehensive analysis of differentially expressed genes under salt stress in two contrasting genotypes.

    PubMed

    Rong, Liping; Li, Qianzhong; Li, Shushun; Tang, Ling; Wen, Jing

    2016-04-01

    Maple (Acer palmatum) is an important species for landscape planting worldwide. Salt stress affects the normal growth of the Maple leaf directly, leading to loss of esthetic value. However, the limited availability of Maple genomic information has hindered research on the mechanisms underlying this tolerance. In this study, we performed comprehensive analyses of the salt tolerance in two genotypes of Maple using RNA-seq. Approximately 146.4 million paired-end reads, representing 181,769 unigenes, were obtained. The N50 length of the unigenes was 738 bp, and their total length over 102.66 Mb. 14,090 simple sequence repeats and over 500,000 single nucleotide polymorphisms were identified, which represent useful resources for marker development. Importantly, 181,769 genes were detected in at least one library, and 303 differentially expressed genes (DEGs) were identified between salt-sensitive and salt-tolerant genotypes. Among these DEGs, 125 were upregulated and 178 were downregulated genes. Two MYB-related proteins and one LEA protein were detected among the first 10 most downregulated genes. Moreover, a methyltransferase-related gene was detected among the first 10 most upregulated genes. The three most significantly enriched pathways were plant hormone signal transduction, arginine and proline metabolism, and photosynthesis. The transcriptome analysis provided a rich genetic resource for gene discovery related to salt tolerance in Maple, and in closely related species. The data will serve as an important public information platform to further our understanding of the molecular mechanisms involved in salt tolerance in Maple.
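
    The N50 statistic quoted above can be computed as in the following minimal sketch: sort the sequence lengths from longest to shortest and return the length at which the running total first reaches half of all assembled bases. The function name and the example lengths are illustrative only and are not taken from the study.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths.

    N50 is the length L such that sequences of length >= L together
    contain at least half of the total number of bases.
    """
    if not lengths:
        raise ValueError("lengths must not be empty")
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:  # integer-safe test for "at least half"
            return length
```

    For example, for unigene lengths of 2, 3, 4, 5 and 6 bp (20 bp in total), the two longest sequences already cover 11 bp, so the N50 is 5.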

  6. Next-Generation Phylogeography: A Targeted Approach for Multilocus Sequencing of Non-Model Organisms

    PubMed Central

    Puritz, Jonathan B.; Addison, Jason A.; Toonen, Robert J.

    2012-01-01

    The field of phylogeography has long since realized the need and utility of incorporating nuclear DNA (nDNA) sequences into analyses. However, the use of nDNA sequence data, at the population level, has been hindered by technical laboratory difficulty, sequencing costs, and problematic analytical methods dealing with genotypic sequence data, especially in non-model organisms. Here, we present a method utilizing the 454 GS-FLX Titanium pyrosequencing platform with the capacity to simultaneously sequence two species of sea star (Meridiastra calcar and Parvulastra exigua) at five different nDNA loci across 16 different populations of 20 individuals each per species. We compare results from 3 populations with traditional Sanger sequencing based methods, and demonstrate that this next-generation sequencing platform is more time and cost effective and more sensitive to rare variants than Sanger based sequencing. A crucial advantage is that the high coverage of clonally amplified sequences simplifies haplotype determination, even in highly polymorphic species. This targeted next-generation approach can greatly increase the use of nDNA sequence loci in phylogeographic and population genetic studies by mitigating many of the time, cost, and analytical issues associated with highly polymorphic, diploid sequence markers. PMID:22470543

  7. An Effective Approach for Analyzing “Prefinished” Genomic Sequence Data

    PubMed Central

    Kuehl, Peter M.; Weisemann, Jane M.; Touchman, Jeffrey W.; Green, Eric D.; Boguski, Mark S.

    1999-01-01

    Ongoing efforts to sequence the human genome are already generating large amounts of data, with substantial increases anticipated over the next few years. In most cases, a shotgun sequencing strategy is being used, which rapidly yields most of the primary sequence in incompletely assembled sequence contigs (“prefinished” sequence) and more slowly produces the final, completely assembled sequence (“finished” sequence). Thus, in general, prefinished sequence is produced in excess of finished sequence, and this trend is certain to continue and even accelerate over the next few years. Even at a prefinished stage, genomic sequence represents a rich source of important biological information that is of great interest to many investigators. However, analyzing such data is a challenging and daunting task, both because of its sheer volume and because it can change on a day-by-day basis. To facilitate the discovery and characterization of genes and other important elements within prefinished sequence, we have developed an analytical strategy and system that uses readily available software tools in new combinations. Implementation of this strategy for the analysis of prefinished sequence data from human chromosome 7 has demonstrated that this is a convenient, inexpensive, and extensible solution to the problem of analyzing the large amounts of preliminary data being produced by large-scale sequencing efforts. Our approach is accessible to any investigator who wishes to assimilate additional information about particular sequence data en route to developing richer annotations of a finished sequence. [Our software system is available via an extensive web supplement to this article at http://www.ncbi.nlm.nih.gov/Kuehl/prefinished.] PMID:10022984

  8. A NGS approach to the encrusting Mediterranean sponge Crella elegans (Porifera, Demospongiae, Poecilosclerida): transcriptome sequencing, characterization and overview of the gene expression along three life cycle stages.

    PubMed

    Pérez-Porro, A R; Navarro-Gómez, D; Uriz, M J; Giribet, G

    2013-05-01

    Sponges can be dominant organisms in many marine and freshwater habitats where they play essential ecological roles. They also represent a key group to address important questions in early metazoan evolution. Recent approaches for improving knowledge on sponge biological and ecological functions as well as on animal evolution have focused on the genetic toolkits involved in ecological responses to environmental changes (biotic and abiotic), development and reproduction. These approaches are possible thanks to newly available, massive sequencing technologies, such as the Illumina platform, which facilitate genome and transcriptome sequencing in a cost-effective manner. Here we present the first NGS (next-generation sequencing) approach to understanding the life cycle of an encrusting marine sponge. For this we sequenced libraries of three different life cycle stages of the Mediterranean sponge Crella elegans and generated de novo transcriptome assemblies. Three assemblies were based on sponge tissue of a particular life cycle stage, including non-reproductive tissue, tissue with sperm cysts and tissue with larvae. The fourth assembly pooled the data from all three stages. By aggregating data from all the different life cycle stages we obtained a higher total number of contigs, contigs with blast hit and annotated contigs than from one stage-based assemblies. In that multi-stage assembly we obtained a larger number of the developmental regulatory genes known for metazoans than in any other assembly. We also advance the differential expression of selected genes in the three life cycle stages to explore the potential of RNA-seq for improving knowledge on functional processes along the sponge life cycle.

  9. Whole genome sequencing and integrative genomic analysis approach on two 22q11.2 deletion syndrome family trios for genotype to phenotype correlations

    PubMed Central

    Chung, Jonathan H.; Cai, Jinlu; Suskin, Barrie G.; Zhang, Zhengdong; Coleman, Karlene

    2015-01-01

    The 22q11.2 deletion syndrome (22q11DS) affects 1:4000 live births and presents with highly variable phenotype expressivity. In this study, we developed an analytical approach utilizing whole genome sequencing and integrative analysis to discover genetic modifiers. Our pipeline combined available tools in order to prioritize rare, predicted deleterious, coding and non-coding single nucleotide variants (SNVs) and insertion/deletions (INDELs) from whole genome sequencing (WGS). We sequenced two unrelated probands with 22q11DS, with contrasting clinical findings, and their unaffected parents. Proband P1 had cognitive impairment, psychotic episodes, anxiety, and tetralogy of Fallot (TOF); while proband P2 had juvenile rheumatoid arthritis but no other major clinical findings. In P1, we identified common variants in COMT and PRODH on 22q11.2 as well as rare potentially deleterious DNA variants in other behavioral/neurocognitive genes. We also identified a de novo SNV in ADNP2 (NM_014913.3:c.2243G>C), encoding a neuroprotective protein that may be involved in behavioral disorders. In P2, we identified a novel non-synonymous SNV in ZFPM2 (NM_012082.3:c.1576C>T), a known causative gene for TOF, which may act as a protective variant downstream of TBX1, haploinsufficiency of which is responsible for congenital heart disease in individuals with 22q11DS. PMID:25981510

  10. A systematic screening to identify de novo mutations causing sporadic early-onset Parkinson's disease

    PubMed Central

    Kun-Rodrigues, Celia; Ganos, Christos; Guerreiro, Rita; Schneider, Susanne A.; Schulte, Claudia; Lesage, Suzanne; Darwent, Lee; Holmans, Peter; Singleton, Andrew; Bhatia, Kailash; Bras, Jose

    2015-01-01

    Despite the many advances in our understanding of the genetic basis of Mendelian forms of Parkinson's disease (PD), a large number of early-onset cases still remain to be explained. Many of these cases present with a form of disease that is identical to that underlain by genetic causes, but do not have mutations in any of the currently known disease-causing genes. Here, we hypothesized that de novo mutations may account for a proportion of these early-onset, sporadic cases. We performed exome sequencing in full parent–child trios where the proband presents with typical PD to unequivocally identify de novo mutations. This approach allows us to test all genes in the genome in an unbiased manner. We have identified and confirmed 20 coding de novo mutations in 21 trios. We have used publicly available population genetic data to compare variant frequencies and our independent in-house dataset of exome sequencing in PD (with over 1200 cases) to identify additional variants in the same genes. Of the genes identified to carry de novo mutations, PTEN, VAPB and ASNA1 are supported by various sources of data to be involved in PD. We show that these genes are reported to be within a protein–protein interaction network with PD genes and that they contain additional rare, case-specific, mutations in our independent cohort of PD cases. Our results support the involvement of these three genes in PD and suggest that testing for de novo mutations in sporadic disease may aid in the identification of novel disease-causing genes. PMID:26362251
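
    The core idea of trio-based de novo detection can be sketched as a set difference: a variant called in the proband but in neither parent is a de novo candidate. The sketch below assumes each variant is a hypothetical (chromosome, position, reference allele, alternate allele) tuple. It deliberately ignores the coverage, genotype-quality and confirmation steps that a real analysis needs, and it is not the authors' pipeline.

```python
def candidate_de_novo(child, mother, father):
    """Return variants present in the child but absent from both parents.

    Each variant is a (chrom, pos, ref, alt) tuple; the result is sorted
    so that the output order is deterministic.
    """
    return sorted(set(child) - set(mother) - set(father))


# Hypothetical calls for one parent-child trio.
child_calls = [("1", 100, "A", "G"), ("2", 200, "C", "T"), ("3", 300, "G", "A")]
mother_calls = [("1", 100, "A", "G")]
father_calls = [("2", 200, "C", "T")]
candidates = candidate_de_novo(child_calls, mother_calls, father_calls)
```

    Here only the variant on chromosome 3 survives the filter. In practice such candidates would still need independent confirmation, as the abstract notes for the 20 mutations reported.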

  11. Sequenced Integration and the Identification of a Problem-Solving Approach through a Learning Process

    ERIC Educational Resources Information Center

    Cormas, Peter C.

    2016-01-01

    Preservice teachers (N = 27) in two sections of a sequenced, methodological and process integrated mathematics/science course solved a levers problem with three similar learning processes and a problem-solving approach, and identified a problem-solving approach through one different learning process. Similar learning processes used included:…

  12. The GeneOptimizer Algorithm: using a sliding window approach to cope with the vast sequence space in multiparameter DNA sequence optimization

    PubMed Central

    Raab, David; Graf, Marcus; Notka, Frank; Schödl, Thomas

    2010-01-01

    One of the main advantages of de novo gene synthesis is the fact that it frees the researcher from any limitations imposed by the use of natural templates. To make the most out of this opportunity, efficient algorithms are needed to calculate a coding sequence, combining different requirements, such as adapted codon usage or avoidance of restriction sites, in the best possible way. We present an algorithm where a “variation window” covering several amino acid positions slides along the coding sequence. Candidate sequences are built comprising the already optimized part of the complete sequence and all possible combinations of synonymous codons representing the amino acids within the window. The candidate sequences are assessed with a quality function, and the first codon of the best candidates’ variation window is fixed. Subsequently the window is shifted by one codon position. As an example of a freely accessible software implementing the algorithm, we present the Mr. Gene web-application. Additionally two experimental applications of the algorithm are shown. PMID:21189842
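
    The sliding "variation window" procedure described above can be sketched in Python. This is a minimal illustration, not the GeneOptimizer implementation: the partial codon table `SYNONYMOUS_CODONS`, the toy `quality` function (GC content near 50% plus avoidance of one restriction site), the window size, and the function names are all assumptions made for the example.

```python
from itertools import product

# Hypothetical, partial codon table used only for this illustration.
SYNONYMOUS_CODONS = {
    "M": ["ATG"],
    "K": ["AAA", "AAG"],
    "E": ["GAA", "GAG"],
    "F": ["TTT", "TTC"],
    "N": ["AAT", "AAC"],
}

FORBIDDEN_SITE = "GAATTC"  # EcoRI recognition site, chosen as an example


def quality(dna):
    """Toy quality function: prefer ~50% GC, heavily penalize the forbidden site."""
    gc_fraction = sum(base in "GC" for base in dna) / len(dna)
    score = -abs(gc_fraction - 0.5)
    if FORBIDDEN_SITE in dna:
        score -= 10.0
    return score


def optimize(protein, window=3):
    """Fix one codon per step using a variation window that slides by one position."""
    fixed = ""  # already optimized part of the coding sequence
    for pos in range(len(protein)):
        residues = protein[pos:pos + window]
        best_score, best_first_codon = None, None
        # Enumerate every combination of synonymous codons inside the window.
        for combo in product(*(SYNONYMOUS_CODONS[aa] for aa in residues)):
            candidate = fixed + "".join(combo)
            score = quality(candidate)
            if best_score is None or score > best_score:
                best_score, best_first_codon = score, combo[0]
        # Only the first codon of the best candidate's window is fixed.
        fixed += best_first_codon
    return fixed


print(optimize("MEFK"))
```

    Because only one codon is fixed per step, the number of candidates evaluated at each position depends on the window size rather than on the full sequence length, which is how a windowed search keeps the sequence space manageable.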

  13. De novo assembly and characterization of skin transcriptome using RNAseq in sheep (Ovis aries).

    PubMed

    Yue, Y J; Liu, J B; Yang, M; Han, J L; Guo, T T; Guo, J; Feng, R L; Yang, B H

    2015-02-13

    Wool is produced via synthetic processes of wool follicles, which are embedded in the skin of sheep. The development of new-generation sequencing and RNA sequencing provides new approaches that may elucidate the molecular regulation mechanism of wool follicle development and facilitate enhanced selection for wool traits through gene-assisted selection or targeted gene manipulation. We performed de novo transcriptome sequencing of skin using the Illumina Hiseq 2000 sequencing system in sheep (Ovis aries). Transcriptome de novo assembly was carried out via short-read assembly programs, including SOAPdenovo and ESTScan. The protein function, clusters of orthologous group function, gene ontology function, metabolic pathway analysis, and protein coding region prediction of unigenes were annotated by BLASTx, BLAST2GO, and ESTScan. More than 26,266,670 clean reads were collected and assembled into 79,741 unigene sequences, with a final assembly length of 35,447,962 nucleotides. A total of 22,164 unigenes were annotated, accounting for 36.27% of the total number of unigenes, which were divided into 25 classes belonging to 218 signaling pathways. Among them, there were 17 signal paths related to hair follicle development. Based on mass sequencing data of sheepskin obtained by RNA-Seq, many unigenes were identified and annotated, which provides an excellent platform for future sheep genetic and functional genomic research. The data could be used for improving wool quality and as a model for human hair follicle development or disease prevention.

  14. Automated de novo phasing and model building of coiled-coil proteins.

    PubMed

    Rämisch, Sebastian; Lizatović, Robert; André, Ingemar

    2015-03-01

    Models generated by de novo structure prediction can be very useful starting points for molecular replacement for systems where suitable structural homologues cannot be readily identified. Protein-protein complexes and de novo-designed proteins are examples of systems that can be challenging to phase. In this study, the potential of de novo models of protein complexes for use as starting points for molecular replacement is investigated. The approach is demonstrated using homomeric coiled-coil proteins, which are excellent model systems for oligomeric systems. Despite the stereotypical fold of coiled coils, initial phase estimation can be difficult and many structures have to be solved with experimental phasing. A method was developed for automatic structure determination of homomeric coiled coils from X-ray diffraction data. In a benchmark set of 24 coiled coils, ranging from dimers to pentamers with resolutions down to 2.5 Å, 22 systems were automatically solved, 11 of which had previously been solved by experimental phasing. The generated models contained 71-103% of the residues present in the deposited structures, had the correct sequence and had free R values that deviated on average by 0.01 from those of the respective reference structures. The electron-density maps were of sufficient quality that only minor manual editing was necessary to produce final structures. The method, named CCsolve, combines methods for de novo structure prediction, initial phase estimation and automated model building into one pipeline. CCsolve is robust against errors in the initial models and can readily be modified to make use of alternative crystallographic software. The results demonstrate the feasibility of de novo phasing of protein-protein complexes, an approach that could also be employed for other small systems beyond coiled coils.

  15. [Recent progress in gene mapping through high-throughput sequencing technology and forward genetic approaches].

    PubMed

    Lu, Cairui; Zou, Changsong; Song, Guoli

    2015-08-01

    Traditional gene mapping using forward genetic approaches is conducted primarily through construction of a genetic linkage map, the process of which is tedious and time-consuming, and often results in low accuracy of mapping and large mapping intervals. With the rapid development of high-throughput sequencing technology and decreasing cost of sequencing, a variety of simple and quick methods of gene mapping through sequencing have been developed, including direct sequencing of the mutant genome, sequencing of selective mutant DNA pooling, genetic map construction through sequencing of individuals in population, as well as sequencing of transcriptome and partial genome. These methods can be used to identify mutations at the nucleotide level and have been applied in complex genetic backgrounds. Recent reports have shown that sequencing-based mapping can even be done without a reference genome sequence, hybridization, or genetic linkage information, which makes it possible to perform forward genetic studies in many non-model species. In this review, we summarize these new technologies and their application in gene mapping.

  16. A distributed coding approach for stereo sequences in the tree structured Haar transform domain

    NASA Astrophysics Data System (ADS)

    Cancellaro, M.; Carli, M.; Neri, A.

    2009-02-01

    In this contribution, a novel method for distributed video coding for stereo sequences is proposed. The system encodes independently the left and right frames of the stereoscopic sequence. The decoder exploits the side information to achieve the best reconstruction of the correlated video streams. In particular, a syndrome coder approach based on a lifted Tree Structured Haar wavelet scheme has been adopted. The experimental results show the effectiveness of the proposed scheme.

  17. Biologically inspired multilevel approach for multiple moving targets detection from airborne forward-looking infrared sequences.

    PubMed

    Li, Yansheng; Tan, Yihua; Li, Hang; Li, Tao; Tian, Jinwen

    2014-04-01

    In this paper, a biologically inspired multilevel approach for simultaneously detecting multiple independently moving targets from airborne forward-looking infrared (FLIR) sequences is proposed. Due to the moving platform, low contrast infrared images, and nonrepeatability of the target signature, moving targets detection from FLIR sequences is still an open problem. Avoiding six parameter affine or eight parameter planar projective transformation matrix estimation of two adjacent frames, which are utilized by existing moving targets detection approaches to cope with the moving infrared camera and have become the bottleneck for the further elevation of the moving targets detection performance, the proposed moving targets detection approach comprises three sequential modules: motion perception for efficiently extracting motion cues, attended motion views extraction for coarsely localizing moving targets, and appearance perception in the local attended motion views for accurately detecting moving targets. Experimental results demonstrate that the proposed approach is efficient and outperforms the compared state-of-the-art approaches.

  18. Sequence-Based Pronunciation Variation Modeling for Spontaneous ASR Using a Noisy Channel Approach

    NASA Astrophysics Data System (ADS)

    Hofmann, Hansjörg; Sakti, Sakriani; Hori, Chiori; Kashioka, Hideki; Nakamura, Satoshi; Minker, Wolfgang

    The performance of English automatic speech recognition systems decreases when recognizing spontaneous speech mainly due to multiple pronunciation variants in the utterances. Previous approaches address this problem by modeling the alteration of the pronunciation on a phoneme to phoneme level. However, the phonetic transformation effects induced by the pronunciation of the whole sentence have not yet been considered. In this article, the sequence-based pronunciation variation is modeled using a noisy channel approach where the spontaneous phoneme sequence is considered as a “noisy” string and the goal is to recover the “clean” string of the word sequence. Hereby, the whole word sequence and its effect on the alteration of the phonemes will be taken into consideration. Moreover, the system not only learns the phoneme transformation but also the mapping from the phoneme to the word directly. In this study, first the phonemes will be recognized with the present recognition system and afterwards the pronunciation variation model based on the noisy channel approach will map from the phoneme to the word level. Two well-known natural language processing approaches are adopted and derived from the noisy channel model theory: Joint-sequence models and statistical machine translation. Both of them are applied and various experiments are conducted using microphone and telephone of spontaneous speech.
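
    The noisy channel formulation referred to above can be written in its standard textbook form. The symbols are assumptions introduced here for illustration (W is a candidate "clean" word sequence, P is the observed "noisy" spontaneous phoneme sequence); the abstract does not give the authors' own notation.

```latex
\hat{W} \;=\; \arg\max_{W} \Pr(W \mid P)
        \;=\; \arg\max_{W} \frac{\Pr(P \mid W)\,\Pr(W)}{\Pr(P)}
        \;=\; \arg\max_{W} \Pr(P \mid W)\,\Pr(W)
```

    The last step drops Pr(P) because it does not depend on W. Pr(P | W) plays the role of the channel (pronunciation variation) model and Pr(W) the role of the language model.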

  19. De novo design of functional proteins: Toward artificial hydrogenases.

    PubMed

    Faiella, Marina; Roy, Anindya; Sommer, Dayn; Ghirlanda, Giovanna

    2013-11-01

    Over the last 25 years, de novo design has proven to be a valid approach to generate novel, well-folded proteins, and most recently, functional proteins. In response to societal needs, this approach is being used increasingly to design functional proteins developed with an eye toward sustainable fuel production. This review surveys recent examples of bioinspired de novo designed peptide-based catalysts, focusing in particular on artificial hydrogenases.

  20. De novo TBR1 mutations in sporadic autism disrupt protein functions

    PubMed Central

    Deriziotis, Pelagia; O’Roak, Brian J.; Graham, Sarah A.; Estruch, Sara B.; Dimitropoulou, Danai; Bernier, Raphael A.; Gerdts, Jennifer; Shendure, Jay; Eichler, Evan E.; Fisher, Simon E.

    2014-01-01

    Next-generation sequencing recently revealed that recurrent disruptive mutations in a few genes may account for 1% of sporadic autism cases. Coupling these novel genetic data to empirical assays of protein function can illuminate crucial molecular networks. Here we demonstrate the power of the approach, performing the first functional analyses of TBR1 variants identified in sporadic autism. De novo truncating and missense mutations disrupt multiple aspects of TBR1 function, including subcellular localization, interactions with co-regulators and transcriptional repression. Missense mutations inherited from unaffected parents did not disturb function in our assays. We show that TBR1 homodimerizes, that it interacts with FOXP2, a transcription factor implicated in speech/language disorders, and that this interaction is disrupted by pathogenic mutations affecting either protein. These findings support the hypothesis that de novo mutations in sporadic autism have severe functional consequences. Moreover, they uncover neurogenetic mechanisms that bridge different neurodevelopmental disorders involving language deficits. PMID:25232744

  1. A proposal for the reference-based annotation of de novo transposable element insertions.

    PubMed

    Bergman, Casey M

    2012-01-01

    Understanding the causes and consequences of transposable element (TE) activity in the genomic era requires sophisticated bioinformatics approaches to accurately identify individual insertion sites. Next-generation sequencing technology now makes it possible to rapidly identify new TE insertions using resequencing data, opening up new possibilities to study the nature of TE-induced mutation and the target site preferences of different TE families. While the identification of new TE insertion sites is seemingly a simple task, the mechanisms of transposition present unique challenges for the annotation of de novo transposable element insertions mapped to a reference genome. Here I discuss these challenges and propose a framework for the annotation of de novo TE insertions that accommodates known mechanisms of TE insertion and established coordinate systems for genome annotation.

  2. De novo transcriptome of the hemimetabolous German cockroach (Blattella germanica)

    Technology Transfer Automated Retrieval System (TEKTRAN)

    A total of 1,365,609 raw reads with an average length of 529 bp were generated and de novo assembled into 48,800 contigs and 3,961 singletons, for a total of 52,761 high-quality unique sequences. These sequences are annotated in terms of GO and KEGG, and the results reveal putative genes of va...

  3. Reassociation kinetics-based approach for partial genome sequencing of the cattle tick, Rhipicephalus (Boophilus) microplus

    PubMed Central

    2010-01-01

    Background: The size and repetitive nature of the Rhipicephalus microplus genome make obtaining a full genome sequence fiscally and technically problematic. To selectively obtain gene-enriched regions of this tick's genome, Cot filtration was performed, and Cot-filtered DNA was sequenced via 454 FLX pyrosequencing. Results: The sequenced Cot-filtered genomic DNA was assembled with an EST-based gene index of 14,586 unique entries where each EST served as a potential "seed" for scaffold formation. The new sequence assembly extended the lengths of 3,913 of the 14,586 gene index entries. Over half of the extensions corresponded to extensions of over 30 amino acids. To survey the repetitive elements in the tick genome, the complete sequences of five BAC clones were determined. Both Class I and II transposable elements were found. Comparison of the BAC and Cot filtration data indicates that Cot filtration was highly successful in filtering repetitive DNA out of the genomic DNA used in 454 sequencing. Conclusion: Cot filtration is a very useful strategy to incorporate into genome sequencing projects on organisms that have large genomes containing high percentages of repetitive, difficult-to-assemble genomic DNA. Combining the Cot selection approach with 454 sequencing and assembly with a pre-existing EST database as seeds resulted in extensions of 27% of the members of the EST database. PMID:20540747

  4. Deep sequencing approach for genetic stability evaluation of influenza A viruses.

    PubMed

    Bidzhieva, Bella; Zagorodnyaya, Tatiana; Karagiannis, Konstantinos; Simonyan, Vahan; Laassri, Majid; Chumakov, Konstantin

    2014-04-01

    Assessment of genetic stability of viruses could be used to monitor the manufacturing process of both live and inactivated viral vaccines. Until recently such studies were limited by the difficulty of detecting and quantifying mutations in heterogeneous viral populations. High-throughput sequencing technologies (deep sequencing) can generate massive amounts of genetic information and could be used to reveal and quantify mutations. Comparison of different approaches for deep sequencing of the complete influenza A genome was performed to determine the best way to detect and quantify mutants in attenuated influenza reassortant strain A/Brisbane/59/2007 (H1N1) and its passages in different cell substrates. Full-length amplicons of influenza A virus segments as well as multiple overlapping amplicons covering the entire viral genome were subjected to several ways of DNA library preparation followed by deep sequencing using Solexa (Illumina) and pyrosequencing (454 Life Science) technologies. Sequencing coverage (the number of times each nucleotide was determined) of mutational profiles generated after 454-pyrosequencing of individually synthesized overlapping amplicons was relatively low and insufficiently uniform. Amplification of the entire genome of influenza virus followed by its enzymatic fragmentation, library construction, and Illumina sequencing resulted in high and uniform sequencing coverage enabling sensitive quantitation of mutations. A new bioinformatic procedure was developed to improve the post-alignment quality control for deep-sequencing data analysis.
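
    Sequencing coverage, defined above as the number of times each nucleotide was determined, and its uniformity can be computed from read alignments roughly as follows. This is a sketch under stated assumptions: the function names, the 0-based half-open (start, end) alignment intervals, and the use of the coefficient of variation as a uniformity measure are choices made here, not details taken from the study.

```python
from statistics import mean, pstdev


def per_base_coverage(genome_length, alignments):
    """Depth at every position, given (start, end) intervals (0-based, end exclusive)."""
    diff = [0] * (genome_length + 1)
    for start, end in alignments:
        diff[start] += 1   # a read begins covering here
        diff[end] -= 1     # and stops covering here
    coverage, depth = [], 0
    for i in range(genome_length):
        depth += diff[i]
        coverage.append(depth)
    return coverage


def coverage_summary(coverage):
    """Mean depth and coefficient of variation (lower means more uniform)."""
    average = mean(coverage)
    cv = pstdev(coverage) / average if average else float("inf")
    return average, cv


# Hypothetical toy alignments on a 10-nt segment (assumption for illustration).
cov = per_base_coverage(10, [(0, 5), (3, 10), (4, 6)])
print(cov, coverage_summary(cov))
```

    The difference-array approach touches each read only twice, so it scales to deep-sequencing read counts better than incrementing every covered position per read.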

  5. Whole-Genome Sequencing and Integrative Genomic Analysis Approach on Two 22q11.2 Deletion Syndrome Family Trios for Genotype to Phenotype Correlations.

    PubMed

    Chung, Jonathan H; Cai, Jinlu; Suskin, Barrie G; Zhang, Zhengdong; Coleman, Karlene; Morrow, Bernice E

    2015-08-01

    The 22q11.2 deletion syndrome (22q11DS) affects 1:4,000 live births and presents with highly variable phenotype expressivity. In this study, we developed an analytical approach utilizing whole-genome sequencing (WGS) and integrative analysis to discover genetic modifiers. Our pipeline combined available tools in order to prioritize rare, predicted deleterious, coding and noncoding single-nucleotide variants (SNVs), and insertion/deletions from WGS. We sequenced two unrelated probands with 22q11DS, with contrasting clinical findings, and their unaffected parents. Proband P1 had cognitive impairment, psychotic episodes, anxiety, and tetralogy of Fallot (TOF), whereas proband P2 had juvenile rheumatoid arthritis but no other major clinical findings. In P1, we identified common variants in COMT and PRODH on 22q11.2 as well as rare potentially deleterious DNA variants in other behavioral/neurocognitive genes. We also identified a de novo SNV in ADNP2 (NM_014913.3:c.2243G>C), encoding a neuroprotective protein that may be involved in behavioral disorders. In P2, we identified a novel nonsynonymous SNV in ZFPM2 (NM_012082.3:c.1576C>T), a known causative gene for TOF, which may act as a protective variant downstream of TBX1, haploinsufficiency of which is responsible for congenital heart disease in individuals with 22q11DS.

  6. De Novo Assembly of Human Herpes Virus Type 1 (HHV-1) Genome, Mining of Non-Canonical Structures and Detection of Novel Drug-Resistance Mutations Using Short- and Long-Read Next Generation Sequencing Technologies

    PubMed Central

    Karamitros, Timokratis; Piorkowska, Renata; Katzourakis, Aris; Magiorkinis, Gkikas; Mbisa, Jean Lutamyo

    2016-01-01

    Human herpesvirus type 1 (HHV-1) has a large double-stranded DNA genome of approximately 152 kbp that is structurally complex and GC-rich. This makes the assembly of HHV-1 whole genomes from short-read sequencing data technically challenging. To improve the assembly of HHV-1 genomes we have employed a hybrid genome assembly protocol using data from two sequencing technologies: the short-read Roche 454 and the long-read Oxford Nanopore MinION sequencers. We sequenced 18 HHV-1 cell culture-isolated clinical specimens collected from immunocompromised patients undergoing antiviral therapy. The susceptibility of the samples to several antivirals was determined by plaque reduction assay. Hybrid genome assembly resulted in a decrease in the number of contigs in 6 out of 7 samples and an increase in N(G)50 and N(G)75 of all 7 samples sequenced by both technologies. The approach also enhanced the detection of non-canonical contigs including a rearrangement between the unique (UL) and repeat (T/IRL) sequence regions of one sample that was not detectable by assembly of 454 reads alone. We detected several known and novel resistance-associated mutations in UL23 and UL30 genes. Genome-wide genetic variability ranged from <1% to 53% of amino acids in each gene exhibiting at least one substitution within the pool of samples. The UL23 gene had one of the highest genetic variabilities at 35.2% in keeping with its role in development of drug resistance. The assembly of accurate, full-length HHV-1 genomes will be useful in determining genetic determinants of drug resistance, virulence, pathogenesis and viral evolution. The numerous, complex repeat regions of the HHV-1 genome currently remain a barrier towards this goal. PMID:27309375
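
    The N(G)50 and N(G)75 contiguity statistics mentioned above can be computed as in the sketch below. N50 is the contig length at which the cumulative length of contigs, sorted longest first, reaches half of the total assembly length; NG50 uses the expected genome size as the denominator instead. The function name `nx_metric` and the toy contig lengths are assumptions for illustration.

```python
def nx_metric(contig_lengths, fraction=0.5, genome_size=None):
    """Return Nx (genome_size=None) or NGx (genome_size given) for a set of contigs.

    Returns 0 when the contigs never reach the requested fraction, which can
    happen for NGx if the assembly is much smaller than the genome.
    """
    total = genome_size if genome_size is not None else sum(contig_lengths)
    threshold = total * fraction
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if running >= threshold:
            return length
    return 0


# Hypothetical contig lengths in kbp; 152 kbp is the approximate HHV-1 genome size
# quoted in the abstract above.
contigs = [80, 40, 20, 10]
print(nx_metric(contigs, 0.5), nx_metric(contigs, 0.75))
print(nx_metric(contigs, 0.5, genome_size=152), nx_metric(contigs, 0.75, genome_size=152))
```

    Merging contigs, as a hybrid assembly tends to do, lowers the contig count and raises these values, which matches the direction of change reported in the abstract.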

  7. Strategies for the effective identification of remotely related sequences in multiple PSSM search approach.

    PubMed

    Gowri, V S; Tina, K G; Krishnadev, O; Srinivasan, N

    2007-06-01

    Searches using position specific scoring matrices (PSSMs) have been commonly used in remote homology detection procedures such as PSI-BLAST and RPS-BLAST. A PSSM is generated typically using one of the sequences of a family as the reference sequence. In the case of PSI-BLAST searches the reference sequence is the same as the query. Recently we have shown that searches against the database of multiple family-profiles, with each one of the members of the family used as a reference sequence, are more effective than searches against the classical database of single family-profiles. Despite a relatively better overall performance when compared with common sequence-profile matching procedures, searches against the multiple family-profiles database result in a few false positives and false negatives. Here we show that profile length and divergence of sequences used in the construction of a PSSM have a major influence on the performance of the multiple profile based search approach. We also identify that a simple parameter defined by the number of PSSMs corresponding to a family that is hit, for a query, divided by the total number of PSSMs in the family can distinguish effectively the true positives from the false positives in the multiple profiles search approach.
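
    The "simple parameter" described above (PSSMs of a family hit by a query, divided by the total number of PSSMs in that family) can be computed as in this sketch. The function name, the (family, pssm_id) hit tuples, and the example family sizes are hypothetical; the abstract does not state a cutoff value, so none is assumed here.

```python
from collections import Counter


def family_hit_fractions(hits, family_sizes):
    """Fraction of each family's PSSMs hit by a single query.

    hits: iterable of (family, pssm_id) pairs reported for the query.
    family_sizes: dict mapping family name to its total number of PSSMs.
    Repeated hits to the same PSSM are counted once.
    """
    unique_hits = set(hits)
    counts = Counter(family for family, _ in unique_hits)
    return {family: counts[family] / family_sizes[family] for family in counts}


# Hypothetical search output for one query (assumption for illustration).
hits = [("kinase", "k1"), ("kinase", "k2"), ("kinase", "k2"), ("globin", "g1")]
family_sizes = {"kinase": 4, "globin": 10}
print(family_hit_fractions(hits, family_sizes))
```

    A family whose profiles are hit only sporadically gets a low fraction, which is the signal the abstract reports as separating false positives from true positives.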

  8. An improved and validated RNA HLA class I SBT approach for obtaining full length coding sequences.

    PubMed

    Gerritsen, K E H; Olieslagers, T I; Groeneweg, M; Voorter, C E M; Tilanus, M G J

    2014-11-01

    The functional relevance of human leukocyte antigen (HLA) class I allele polymorphism beyond exons 2 and 3 is difficult to address because more than 70% of the HLA class I alleles are defined by exons 2 and 3 sequences only. For routine application on clinical samples we improved and validated the HLA sequence-based typing (SBT) approach based on RNA templates, using either a single locus-specific or two overlapping group-specific polymerase chain reaction (PCR) amplifications, with three forward and three reverse sequencing reactions for full length sequencing. Locus-specific HLA typing with RNA SBT of a reference panel, representing the major antigen groups, showed identical results compared to DNA SBT typing. Alleles encountered with unknown exons in the IMGT/HLA database and three samples, two with Null and one with a Low expressed allele, have been addressed by the group-specific RNA SBT approach to obtain full length coding sequences. This RNA SBT approach has proven its value in our routine full length definition of alleles.

  9. Environmental barcoding: a next-generation sequencing approach for biomonitoring applications using river benthos.

    PubMed

    Hajibabaei, Mehrdad; Shokralla, Shadi; Zhou, Xin; Singer, Gregory A C; Baird, Donald J

    2011-04-13

    Timely and accurate biodiversity analysis poses an ongoing challenge for the success of biomonitoring programs. Morphology-based identification of bioindicator taxa is time consuming, and rarely supports species-level resolution especially for immature life stages. Much work has been done in the past decade to develop alternative approaches for biodiversity analysis using DNA sequence-based approaches such as molecular phylogenetics and DNA barcoding. On-going assembly of DNA barcode reference libraries will provide the basis for a DNA-based identification system. The use of recently introduced next-generation sequencing (NGS) approaches in biodiversity science has the potential to further extend the application of DNA information for routine biomonitoring applications to an unprecedented scale. Here we demonstrate the feasibility of using 454 massively parallel pyrosequencing for species-level analysis of freshwater benthic macroinvertebrate taxa commonly used for biomonitoring. We designed our experiments in order to directly compare morphology-based, Sanger sequencing DNA barcoding, and next-generation environmental barcoding approaches. Our results show the ability of 454 pyrosequencing of mini-barcodes to accurately identify all species with more than 1% abundance in the pooled mixture. Although the approach failed to identify 6 rare species in the mixture, the presence of sequences from 9 species that were not represented by individuals in the mixture provides evidence that DNA based analysis may yet provide a valuable approach in finding rare species in bulk environmental samples. We further demonstrate the application of the environmental barcoding approach by comparing benthic macroinvertebrates from an urban region to those obtained from a conservation area. Although considerable effort will be required to robustly optimize NGS tools to identify species from bulk environmental samples, our results indicate the potential of an environmental barcoding

  10. De novo mutations in the classic epileptic encephalopathies

    PubMed Central

    2013-01-01

    Epileptic encephalopathies (EE) are a devastating group of severe childhood epilepsy disorders for which the cause is often unknown. Here, we report a screen for de novo mutations in patients with two classical EE: infantile spasms (IS, n=149) and Lennox-Gastaut Syndrome (LGS, n=115). We sequenced the exomes of 264 probands, and their parents, and confirmed 329 de novo mutations. A likelihood analysis showed a significant excess of de novo mutations in the ~4,000 genes that are the most intolerant to functional genetic variation in the human population (p=2.9 × 10^−3). Among these are GABRB3 with de novo mutations in four patients and ALG13 with the same de novo mutation in two patients; both genes show clear statistical evidence of association. Given the relevant site-specific mutation rates, the probabilities of these outcomes occurring by chance are p=4.1 × 10^−10 and p=7.8 × 10^−12, respectively. Other genes with de novo mutations in this cohort include: CACNA1A, CHD2, FLNA, GABRA1, GRIN1, GRIN2B, HDAC4, HNRNPU, IQSEC2, MTOR, and NEDD4L. Finally, we show that the de novo mutations observed are enriched in specific gene sets including genes regulated by the Fragile X protein (p<10^−8), as was reported for autism spectrum disorders (ASD). PMID:23934111

  11. A parallel approach of COFFEE objective function to multiple sequence alignment

    NASA Astrophysics Data System (ADS)

    Zafalon, G. F. D.; Visotaky, J. M. V.; Amorim, A. R.; Valêncio, C. R.; Neves, L. A.; de Souza, R. C. G.; Machado, J. M.

    2015-09-01

    Computational tools to assist genomic analyses are increasingly necessary because of the rapid growth in the amount of available data. Given the high computational cost of deterministic algorithms for sequence alignment, many works concentrate on developing heuristic approaches to multiple sequence alignment. However, selecting an approach that offers solutions with good biological significance in a feasible execution time is a great challenge. This work therefore presents the parallelization of the processing steps of the MSA-GA tool, using the multithread paradigm in the execution of the COFFEE objective function. The standard objective function implemented in the tool is the Weighted Sum of Pairs (WSP), which produces some distortions in the final alignments when sets of sequences with low similarity are aligned. In previous studies, we implemented the COFFEE objective function in the tool to smooth these distortions. Although the nature of the COFFEE objective function implies an increase in execution time, the approach contains steps that can be executed in parallel. With the improvements implemented in this work, we verified that the new approach is 24% faster than the sequential approach with COFFEE. Moreover, the multithreaded COFFEE approach is more efficient than WSP because, besides being slightly faster, it gives better biological results.
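
    How the pairwise terms of a COFFEE-style objective can be split across threads is sketched below. This is a simplified illustration, not the MSA-GA code: as commonly described, COFFEE scores each pair of sequences in the multiple alignment by how many of its aligned residue pairs also occur in a library of pairwise alignments, weighted per pair and normalized by alignment length. The names `aligned_pairs`, `pair_term`, `coffee_like_score`, and the `library` and `weights` dictionaries are hypothetical. In CPython, threads share the global interpreter lock, so this pure-Python version shows the task decomposition rather than a real speedup.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import combinations


def aligned_pairs(row_a, row_b):
    """Residue index pairs (index in a, index in b) placed in the same column."""
    pairs, ia, ib = set(), 0, 0
    for ca, cb in zip(row_a, row_b):
        if ca != "-" and cb != "-":
            pairs.add((ia, ib))
        ia += ca != "-"  # advance residue counters past non-gap characters
        ib += cb != "-"
    return pairs


def pair_term(task):
    """Weighted numerator and denominator contributed by one sequence pair."""
    msa, library, weights, i, j = task
    shared = len(aligned_pairs(msa[i], msa[j]) & library[(i, j)])
    return weights[(i, j)] * shared, weights[(i, j)] * len(msa[i])


def coffee_like_score(msa, library, weights, workers=4):
    """Each sequence pair is independent, so the terms are mapped over a thread pool."""
    tasks = [(msa, library, weights, i, j)
             for i, j in combinations(range(len(msa)), 2)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        terms = list(pool.map(pair_term, tasks))
    numerator = sum(term[0] for term in terms)
    denominator = sum(term[1] for term in terms)
    return numerator / denominator if denominator else 0.0


# Toy alignment; the library is built from the alignment itself and all
# weights are 1 (assumptions for illustration only).
msa = ["AC-G", "A-TG", "ACTG"]
index_pairs = list(combinations(range(len(msa)), 2))
library = {(i, j): aligned_pairs(msa[i], msa[j]) for i, j in index_pairs}
weights = {pair: 1.0 for pair in index_pairs}
print(coffee_like_score(msa, library, weights))
```

    The pairwise terms do not depend on each other, which is the property that makes this objective function a natural target for multithreaded execution.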

  12. Next-generation sequencing approach for connecting secondary metabolites to biosynthetic gene clusters in fungi

    PubMed Central

    Cacho, Ralph A.; Tang, Yi; Chooi, Yit-Heng

    2015-01-01

    Genomics has revolutionized the research on fungal secondary metabolite (SM) biosynthesis. To elucidate the molecular and enzymatic mechanisms underlying the biosynthesis of a specific SM compound, the important first step is often to find the genes that are responsible for its synthesis. The accessibility of fungal genome sequences allows the cumbersome traditional library construction and screening approach to be bypassed. Advances in next-generation sequencing (NGS) technologies have further improved the speed and reduced the cost of microbial genome sequencing in the past few years, which has accelerated the research in this field. Here, we will present an example workflow for identifying the gene cluster encoding the biosynthesis of SMs of interest using an NGS approach. We will also review the different strategies that can be employed to pinpoint the targeted gene clusters rapidly by giving several examples stemming from our work. PMID:25642215

  13. The potential cost-effectiveness of the Diamondback 360® Coronary Orbital Atherectomy System for treating de novo, severely calcified coronary lesions: an economic modeling approach

    PubMed Central

    Chambers, Jeffrey; Généreux, Philippe; Lee, Arthur; Lewin, Jack; Young, Christopher; Crittendon, Janna; Mann, Marita; Garrison, Louis P.

    2015-01-01

    Background: Patients who undergo percutaneous coronary intervention (PCI) for severely calcified coronary lesions have long been known to have worse clinical and economic outcomes than patients with no or mildly calcified lesions. We sought to assess the likely cost-effectiveness of using the Diamondback 360® Orbital Atherectomy System (OAS) in the treatment of de novo, severely calcified lesions from a health-system perspective. Methods and results: In the absence of a head-to-head trial and long-term follow up, cost-effectiveness was based on a modeled synthesis of clinical and economic data. A cost-effectiveness model was used to project the likely economic impact. To estimate the net cost impact, the cost of using the OAS technology in elderly (⩾ 65 years) Medicare patients with de novo severely calcified lesions was compared with cost offsets. Elderly OAS patients from the ORBIT II trial (Evaluate the Safety and Efficacy of OAS in Treating Severely Calcified Coronary Lesions) [ClinicalTrials.gov identifier: NCT01092426] were indirectly compared with similar patients using observational data. For the index procedure, the comparison was with Medicare data, and for both revascularization and cardiac death in the following year, the comparison was with a pooled analysis of the Harmonizing Outcomes with Revascularization and Stents in Acute Myocardial Infarction (HORIZONS-AMI)/Acute Catheterization and Urgent Intervention Triage Strategy (ACUITY) trials. After adjusting for differences in age, gender, and comorbidities, the ORBIT II mean index procedure costs were 17% (p < 0.001) lower, approximately US$2700. Estimated mean revascularization costs were lower by US$1240 in the base case. These cost offsets in the first year, on average, fully cover the cost of the device with an additional 1.2% cost savings. Even in the low-value scenario, the use of the OAS is cost-effective with a cost per life-year gained of US$11,895. Conclusions: Based on economic modeling

  14. A Time Sequence-Oriented Concept Map Approach to Developing Educational Computer Games for History Courses

    ERIC Educational Resources Information Center

    Chu, Hui-Chun; Yang, Kai-Hsiang; Chen, Jing-Hong

    2015-01-01

    Concept maps have been recognized as an effective tool for students to organize their knowledge; however, in history courses, it is important for students to learn and organize historical events according to the time of their occurrence. Therefore, in this study, a time sequence-oriented concept map approach is proposed for developing a game-based…

  15. Developing Scope and Sequence for the Gifted Learner: A Comprehensive Approach.

    ERIC Educational Resources Information Center

    VanTassel-Baska, Joyce; Campbell, Myrtle

    1988-01-01

    A comprehensive curriculum-development program which covers grades K-12 can ensure a meaningful scope and sequence of experiences for gifted learners. The experience of the Gary Community School Corporation and other Indiana communities with such an approach is described. Eight steps from needs assessment to implementing the model are presented.…

  16. Magnetism Teaching Sequences Based on an Inductive Approach for First-Year Thai University Science Students

    ERIC Educational Resources Information Center

    Narjaikaew, Pattawan; Emarat, Narumon; Arayathanitkul, Kwan; Cowie, Bronwen

    2010-01-01

    The study investigated the impact on student motivation and understanding of magnetism of teaching sequences based on an inductive approach. The study was conducted in large lecture classes. A pre- and post-Conceptual Survey of Electricity and Magnetism was conducted with just fewer than 700 Thai undergraduate science students, before and after…

  17. Identification of molecular motors in the Woods Hole squid, Loligo pealei: an expressed sequence tag approach.

    PubMed

    DeGiorgis, Joseph A; Cavaliere, Kimberly R; Burbach, J Peter H

    2011-10-01

    The squid giant axon and synapse are unique systems for studying neuronal function. While a few nucleotide and amino acid sequences have been obtained from squid, large scale genetic and proteomic information is lacking. We have been particularly interested in motors present in axons and their roles in transport processes. Here, to obtain genetic data and to identify motors expressed in squid, we initiated an expressed sequence tag project by single-pass sequencing mRNAs isolated from the stellate ganglia of the Woods Hole Squid, Loligo pealei. A total of 22,689 high quality expressed sequence tag (EST) sequences were obtained and subjected to basic local alignment search tool analysis. Seventy-six percent of these sequences matched genes in the National Center for Bioinformatics databases. By CAP3 analysis this library contained 2459 contigs and 7568 singletons. Mining for motors successfully identified six kinesins, six myosins, a single dynein heavy chain, as well as components of the dynactin complex, and motor light chains and accessory proteins. This initiative demonstrates that EST projects represent an effective approach to obtain sequences of interest.

  18. denovo-db: a compendium of human de novo variants

    PubMed Central

    Turner, Tychele N.; Yi, Qian; Krumm, Niklas; Huddleston, John; Hoekzema, Kendra; F. Stessman, Holly A.; Doebley, Anna-Lisa; Bernier, Raphael A.; Nickerson, Deborah A.; Eichler, Evan E.

    2017-01-01

    Whole-exome and whole-genome sequencing have facilitated the large-scale discovery of de novo variants in human disease. To date, most de novo discovery through next-generation sequencing focused on congenital heart disease and neurodevelopmental disorders (NDDs). Currently, de novo variants are one of the most significant risk factors for NDDs with a substantial overlap of genes involved in more than one NDD. To facilitate better usage of published data, provide standardization of annotation, and improve accessibility, we created denovo-db (http://denovo-db.gs.washington.edu), a database for human de novo variants. As of July 2016, denovo-db contained 40 different studies and 32,991 de novo variants from 23,098 trios. Database features include basic variant information (chromosome location, change, type); detailed annotation at the transcript and protein levels; severity scores; frequency; validation status; and, most importantly, the phenotype of the individual with the variant. We included a feature on our browsable website to download any query result, including a downloadable file of the full database with additional variant details. denovo-db provides necessary information for researchers to compare their data to other individuals with the same phenotype and also to controls allowing for a better understanding of the biology of de novo variants and their contribution to disease. PMID:27907889

  19. A Convex Atomic-Norm Approach to Multiple Sequence Alignment and Motif Discovery

    PubMed Central

    Yen, Ian E. H.; Lin, Xin; Zhang, Jiong; Ravikumar, Pradeep; Dhillon, Inderjit S.

    2016-01-01

    Multiple Sequence Alignment and Motif Discovery, both known to be NP-hard, are two fundamental tasks in bioinformatics. Existing approaches to these two problems are based on local search methods such as Expectation Maximization (EM) and Gibbs Sampling, or on greedy heuristic methods. In this work, we develop a convex relaxation approach to both problems based on the recent concept of atomic norm and develop a new algorithm, termed Greedy Direction Method of Multiplier, for solving the convex relaxation with two convex atomic constraints. Experiments show that our convex relaxation approach produces solutions of higher quality than the standard tools widely used in the bioinformatics community on the Multiple Sequence Alignment and Motif Discovery problems. PMID:27559428

  20. Continuous intensity map optimization (CIMO): a novel approach to leaf sequencing in step and shoot IMRT.

    PubMed

    Cao, Daliang; Earl, Matthew A; Luan, Shuang; Shepard, David M

    2006-04-01

    A new leaf-sequencing approach has been developed that is designed to reduce the number of required beam segments for step-and-shoot intensity modulated radiation therapy (IMRT). This approach to leaf sequencing is called continuous-intensity-map-optimization (CIMO). Using a simulated annealing algorithm, CIMO seeks to minimize differences between the optimized and sequenced intensity maps. Two distinguishing features of the CIMO algorithm are (1) CIMO does not require that each optimized intensity map be clustered into discrete levels and (2) CIMO is not rule-based but rather simultaneously optimizes both the aperture shapes and weights. To test the CIMO algorithm, ten IMRT patient cases were selected (four head-and-neck, two pancreas, two prostate, one brain, and one pelvis). For each case, the optimized intensity maps were extracted from the Pinnacle3 treatment planning system. The CIMO algorithm was applied, and the optimized aperture shapes and weights were loaded back into Pinnacle. A final dose calculation was performed using Pinnacle's convolution/superposition based dose calculation. On average, the CIMO algorithm provided a 54% reduction in the number of beam segments as compared with Pinnacle's leaf sequencer. The plans sequenced using the CIMO algorithm also provided improved target dose uniformity and a reduced discrepancy between the optimized and sequenced intensity maps. For ten clinical intensity maps, comparisons were performed between the CIMO algorithm and the power-of-two reduction algorithm of Xia and Verhey [Med. Phys. 25(8), 1424-1434 (1998)]. When the constraints of a Varian Millennium multileaf collimator were applied, the CIMO algorithm resulted in a 26% reduction in the number of segments. For an Elekta multileaf collimator, the CIMO algorithm resulted in a 67% reduction in the number of segments. An average leaf sequencing time of less than one minute per beam was observed.

  1. De novo assembly and phasing of a Korean human genome.

    PubMed

    Seo, Jeong-Sun; Rhie, Arang; Kim, Junsoo; Lee, Sangjin; Sohn, Min-Hwan; Kim, Chang-Uk; Hastie, Alex; Cao, Han; Yun, Ji-Young; Kim, Jihye; Kuk, Junho; Park, Gun Hwa; Kim, Juhyeok; Ryu, Hanna; Kim, Jongbum; Roh, Mira; Baek, Jeonghun; Hunkapiller, Michael W; Korlach, Jonas; Shin, Jong-Yeon; Kim, Changhoon

    2016-10-13

    Advances in genome assembly and phasing provide an opportunity to investigate the diploid architecture of the human genome and reveal the full range of structural variation across population groups. Here we report the de novo assembly and haplotype phasing of the Korean individual AK1 (ref. 1) using single-molecule real-time sequencing, next-generation mapping, microfluidics-based linked reads, and bacterial artificial chromosome (BAC) sequencing approaches. Single-molecule sequencing coupled with next-generation mapping generated a highly contiguous assembly, with a contig N50 size of 17.9 Mb and a scaffold N50 size of 44.8 Mb, resolving 8 chromosomal arms into single scaffolds. The de novo assembly, along with local assemblies and spanning long reads, closes 105 and extends into 72 out of 190 euchromatic gaps in the reference genome, adding 1.03 Mb of previously intractable sequence. High concordance between the assembly and paired-end sequences from 62,758 BAC clones provides strong support for the robustness of the assembly. We identify 18,210 structural variants by direct comparison of the assembly with the human reference, identifying thousands of breakpoints that, to our knowledge, have not been reported before. Many of the insertions are reflected in the transcriptome and are shared across the Asian population. We performed haplotype phasing of the assembly with short reads, long reads and linked reads from whole-genome sequencing and with short reads from 31,719 BAC clones, thereby achieving phased blocks with an N50 size of 11.6 Mb. Haplotigs assembled from single-molecule real-time reads assigned to haplotypes on phased blocks covered 89% of genes. The haplotigs accurately characterized the hypervariable major histocompatibility complex region as well as demonstrating allele configuration in clinically relevant genes such as CYP2D6. This work presents the most contiguous diploid human genome assembly so far, with extensive investigation of
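
    The abstract above reports contig, scaffold, and phased-block sizes as N50 values. As a minimal sketch of how that statistic is commonly computed (the function name and the toy lengths are illustrative assumptions, not data or code from the study), N50 is the length L such that sequences of length L or longer together cover at least half of the total bases:

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths.

    Sort lengths from longest to shortest and accumulate them; the N50 is
    the length of the sequence at which the running sum first reaches
    half of the total.
    """
    if not lengths:
        raise ValueError("need at least one sequence length")
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length


# Toy contig lengths (hypothetical values, not AK1 data).
toy_contigs = [80, 70, 50, 40, 30, 20, 10]
print(n50(toy_contigs))
```

    With these toy lengths the total is 300, and the two longest contigs (80 + 70) already reach half of it, so the N50 is 70.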

  2. Exploring single-sample SNP and INDEL calling with whole-genome de novo assembly

    PubMed Central

    Li, Heng

    2012-01-01

    Motivation: Eugene Myers in his string graph paper suggested that in a string graph or equivalently a unitig graph, any path spells a valid assembly. As a string/unitig graph also encodes every valid assembly of reads, such a graph, provided that it can be constructed correctly, is in fact a lossless representation of reads. In principle, every analysis based on whole-genome shotgun sequencing (WGS) data, such as SNP and insertion/deletion (INDEL) calling, can also be achieved with unitigs. Results: To explore the feasibility of using de novo assembly in the context of resequencing, we developed a de novo assembler, fermi, that assembles Illumina short reads into unitigs while preserving most of the information of the input reads. SNPs and INDELs can be called by mapping the unitigs against a reference genome. By applying the method on 35-fold human resequencing data, we showed that in comparison to the standard pipeline, our approach yields similar accuracy for SNP calling and better results for INDEL calling. It has higher sensitivity than other de novo assembly based methods for variant calling. Our work suggests that variant calling with de novo assembly can be a beneficial complement to the standard variant calling pipeline for whole-genome resequencing. In terms of methodology, we propose the FMD-index for forward–backward extension of DNA sequences, a fast algorithm for finding all super-maximal exact matches and one-pass construction of unitigs from an FMD-index. Availability: http://github.com/lh3/fermi Contact: hengli@broadinstitute.org PMID:22569178

  3. De-novo protein function prediction using DNA binding and RNA binding proteins as a test case

    PubMed Central

    Peled, Sapir; Leiderman, Olga; Charar, Rotem; Efroni, Gilat; Shav-Tal, Yaron; Ofran, Yanay

    2016-01-01

    Of the currently identified protein sequences, 99.6% have never been observed in the laboratory as proteins and their molecular function has not been established experimentally. Predicting the function of such proteins relies mostly on annotated homologs. However, this has resulted in some erroneous annotations, and many proteins have no annotated homologs. Here we propose a de-novo function prediction approach based on identifying biophysical features that underlie function. Using our approach, we discover DNA and RNA binding proteins that cannot be identified based on homology and validate these predictions experimentally. For example, FGF14, which belongs to a family of secreted growth factors was predicted to bind DNA. We verify this experimentally and also show that FGF14 is localized to the nucleus. Mutating the predicted binding site on FGF14 abrogated DNA binding. These results demonstrate the feasibility of automated de-novo function prediction based on identifying function-related biophysical features. PMID:27869118

  4. De-novo protein function prediction using DNA binding and RNA binding proteins as a test case.

    PubMed

    Peled, Sapir; Leiderman, Olga; Charar, Rotem; Efroni, Gilat; Shav-Tal, Yaron; Ofran, Yanay

    2016-11-21

    Of the currently identified protein sequences, 99.6% have never been observed in the laboratory as proteins and their molecular function has not been established experimentally. Predicting the function of such proteins relies mostly on annotated homologs. However, this has resulted in some erroneous annotations, and many proteins have no annotated homologs. Here we propose a de-novo function prediction approach based on identifying biophysical features that underlie function. Using our approach, we discover DNA and RNA binding proteins that cannot be identified based on homology and validate these predictions experimentally. For example, FGF14, which belongs to a family of secreted growth factors was predicted to bind DNA. We verify this experimentally and also show that FGF14 is localized to the nucleus. Mutating the predicted binding site on FGF14 abrogated DNA binding. These results demonstrate the feasibility of automated de-novo function prediction based on identifying function-related biophysical features.

  5. A long PCR–based approach for DNA enrichment prior to next-generation sequencing for systematic studies

    PubMed Central

    Uribe-Convers, Simon; Duke, Justin R.; Moore, Michael J.; Tank, David C.

    2014-01-01

    • Premise of the study: We present an alternative approach for molecular systematic studies that combines long PCR and next-generation sequencing. Our approach can be used to generate templates from any DNA source for next-generation sequencing. Here we test our approach by amplifying complete chloroplast genomes, and we present a set of 58 potentially universal primers for angiosperms to do so. Additionally, this approach is likely to be particularly useful for nuclear and mitochondrial regions. • Methods and Results: Chloroplast genomes of 30 species across angiosperms were amplified to test our approach. Amplification success varied depending on whether PCR conditions were optimized for a given taxon. To further test our approach, some amplicons were sequenced on an Illumina HiSeq 2000. • Conclusions: Although here we tested this approach by sequencing plastomes, long PCR amplicons could be generated using DNA from any genome, expanding the possibilities of this approach for molecular systematic studies. PMID:25202592

  6. Targeted sequencing approach to identify genetic mutations in Nasu-Hakola disease

    PubMed Central

    Satoh, Jun-ichi; Yanaizu, Motoaki; Tosaki, Youhei; Sakai, Kenji; Kino, Yoshihiro

    2016-01-01

    Summary: Nasu-Hakola disease (NHD) is a rare autosomal recessive disorder characterized by sclerosing leukoencephalopathy and multifocal bone cysts, caused by a loss-of-function mutation of either TYROBP (DAP12) or TREM2. TREM2 and DAP12 constitute a receptor/adaptor signaling complex expressed exclusively on osteoclasts, dendritic cells, macrophages, and microglia. Premortem molecular diagnosis of NHD requires genetic analysis of both TYROBP and TREM2, in which 20 distinct NHD-causing mutations have been reported. Due to genetic heterogeneity, it is often difficult to identify the exact mutation responsible for NHD. Recently, the revolution of the next-generation sequencing (NGS) technology has greatly advanced the field of genome research. A targeted sequencing approach allows us to investigate a selected set of disease-causing genes and mutations in a number of samples within several days. By targeted sequencing using the TruSight One Sequencing Panel, we resequenced genetic mutations of seven NHD cases with known molecular diagnosis and two control subjects. We identified homozygous variants of TYROBP or TREM2 in all NHD cases, composed of a frameshift mutation of c.141delG in exon 3 of TYROBP in four cases, a missense mutation of c.2T>C in exon 1 of TYROBP in two cases, or a splicing mutation of c.482+2T>C in intron 3 of TREM2 in one case. The results of targeted resequencing corresponded to those of Sanger sequencing. In contrast, causative variants were not detected in control subjects. These results indicate that targeted sequencing is a useful approach to precisely identify genetic mutations responsible for NHD in a comprehensive manner. PMID:27904822

  7. A De novo Transcriptomic Approach to Identify Flavonoids and Anthocyanins “Switch-Off” in Olive (Olea europaea L.) Drupes at Different Stages of Maturation

    PubMed Central

    Iaria, Domenico L.; Chiappetta, Adriana; Muzzalupo, Innocenzo

    2016-01-01

    Highlights: A de novo transcriptome reconstruction of olive drupes was performed in two genotypes. Gene expression was monitored during drupe development in two olive cultivars. Transcripts involved in flavonoid and anthocyanin pathways were analyzed in Cassanese and Leucocarpa cultivars. Both cultivar and developmental stage impact gene expression in Olea europaea fruits. During ripening, the fruits of the olive tree (Olea europaea L.) undergo a progressive chromatic change characterized by the formation of a red-brown “spot” which gradually extends on the epidermis and in the innermost part of the mesocarp. This event finds an exception in the Leucocarpa cultivar, in which we observe a destabilized equilibrium between the metabolisms of chlorophyll and other pigments, particularly the anthocyanins whose switch-off during maturation promotes the white coloration of fruits. Despite its importance, genomic information on the olive tree is still lacking. Different RNA-seq libraries were generated from drupes of “Leucocarpa” and “Cassanese” olive genotypes, sampled at 100 and 130 days after flowering (DAF), and were used in order to identify transcripts involved in the main phenotypic changes of fruits during maturation and their corresponding expression patterns. A total of 103,359 transcripts were obtained and 3792 and 3064 were differentially expressed in “Leucocarpa” and “Cassanese” genotypes, respectively, during 100–130 DAF transition. Among them, flavonoid- and anthocyanin-related transcripts such as phenylalanine ammonia lyase (PAL), cinnamate 4-hydroxylase (C4H), 4-coumarate-CoA ligase (4CL), chalcone synthase (CHS), chalcone isomerase (CHI), flavanone 3-hydroxylase (F3H), flavonol 3′-hydrogenase (F3′H), flavonol 3′5′-hydrogenase (F3′5′H), flavonol synthase (FLS), dihydroflavonol 4-reductase (DFR), anthocyanidin synthase (ANS), UDP-glucose:anthocyanidin:flavonoid glucosyltransferase (UFGT) were identified. These results contribute

  8. A novel conceptual approach to read-filtering in high-throughput amplicon sequencing studies.

    PubMed

    Puente-Sánchez, Fernando; Aguirre, Jacobo; Parro, Víctor

    2016-02-29

    Adequate read filtering is critical when processing high-throughput data in marker-gene-based studies. Sequencing errors can cause the mis-clustering of otherwise similar reads, artificially increasing the number of retrieved Operational Taxonomic Units (OTUs) and therefore leading to the overestimation of microbial diversity. Sequencing errors will also result in OTUs that are not accurate reconstructions of the original biological sequences. Herein we present the Poisson binomial filtering algorithm (PBF), which minimizes both problems by calculating the error-probability distribution of a sequence from its quality scores. In order to validate our method, we quality-filtered 37 publicly available datasets obtained by sequencing mock and environmental microbial communities with the Roche 454, Illumina MiSeq and IonTorrent PGM platforms, and compared our results to those obtained with previous approaches such as the ones included in mothur, QIIME and USEARCH. Our algorithm retained substantially more reads than its predecessors, while resulting in fewer and more accurate OTUs. This improved sensitivity produced more faithful representations, both quantitatively and qualitatively, of the true microbial diversity present in the studied samples. Furthermore, the method introduced in this work is computationally inexpensive and can be readily applied in conjunction with any existing analysis pipeline.
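
    The abstract above describes deriving an error-probability distribution for a read from its quality scores. The following is a minimal sketch of that idea, not the published PBF implementation: it assumes Phred-scaled qualities (error probability 10^(-Q/10)) and independent per-base errors, so the number of errors in a read follows a Poisson binomial distribution. The function names, the `max_errors` and `min_confidence` parameters, and the keep/discard rule are hypothetical choices made for illustration.

```python
def phred_to_error_prob(q):
    """Convert a Phred quality score to a per-base error probability."""
    return 10 ** (-q / 10)


def error_count_pmf(quals):
    """Poisson binomial PMF of the number of erroneous bases in a read.

    pmf[k] is the probability of exactly k errors, assuming each base is
    an independent Bernoulli trial with its own error probability.
    """
    pmf = [1.0]  # zero bases processed: zero errors with certainty
    for q in quals:
        p = phred_to_error_prob(q)
        nxt = [0.0] * (len(pmf) + 1)
        for k, mass in enumerate(pmf):
            nxt[k] += mass * (1 - p)   # this base is correct
            nxt[k + 1] += mass * p     # this base is an error
        pmf = nxt
    return pmf


def keep_read(quals, max_errors=1, min_confidence=0.95):
    """Hypothetical rule: keep a read if P(errors <= max_errors) is high enough."""
    pmf = error_count_pmf(quals)
    return sum(pmf[: max_errors + 1]) >= min_confidence


print(error_count_pmf([10, 10]))          # two bases, each with error probability 0.1
print(keep_read([40] * 100, max_errors=0))
print(keep_read([10] * 100, max_errors=1))
```

    The dynamic program runs in time quadratic in read length, which is small for typical amplicon reads; a real pipeline would choose the error tolerance and confidence threshold to suit its clustering step.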

  9. A novel conceptual approach to read-filtering in high-throughput amplicon sequencing studies

    PubMed Central

    Puente-Sánchez, Fernando; Aguirre, Jacobo; Parro, Víctor

    2016-01-01

    Adequate read filtering is critical when processing high-throughput data in marker-gene-based studies. Sequencing errors can cause the mis-clustering of otherwise similar reads, artificially increasing the number of retrieved Operational Taxonomic Units (OTUs) and therefore leading to the overestimation of microbial diversity. Sequencing errors will also result in OTUs that are not accurate reconstructions of the original biological sequences. Herein we present the Poisson binomial filtering algorithm (PBF), which minimizes both problems by calculating the error-probability distribution of a sequence from its quality scores. In order to validate our method, we quality-filtered 37 publicly available datasets obtained by sequencing mock and environmental microbial communities with the Roche 454, Illumina MiSeq and IonTorrent PGM platforms, and compared our results to those obtained with previous approaches such as the ones included in mothur, QIIME and USEARCH. Our algorithm retained substantially more reads than its predecessors, while resulting in fewer and more accurate OTUs. This improved sensitivity produced more faithful representations, both quantitatively and qualitatively, of the true microbial diversity present in the studied samples. Furthermore, the method introduced in this work is computationally inexpensive and can be readily applied in conjunction with any existing analysis pipeline. PMID:26553806

  10. A work stealing based approach for enabling scalable optimal sequence homology detection

    SciTech Connect

    Daily, Jeffrey A.; Kalyanaraman, Anantharaman; Krishnamoorthy, Sriram; Vishnu, Abhinav

    2015-05-01

    Sequence homology detection is central to a number of bioinformatics applications including genome sequencing and protein family characterization. Given millions of sequences, the goal is to identify all pairs of sequences that are highly similar (or “homologous”) on the basis of alignment criteria. While there are optimal alignment algorithms to compute pairwise homology, their deployment at large scale is currently not feasible; instead, heuristic methods are used at the expense of quality. Here, we present the design and evaluation of a parallel implementation for conducting optimal homology detection on distributed memory supercomputers. Our approach uses a combination of techniques from asynchronous load balancing (viz. work stealing, dynamic task counters), data replication, and exact-matching filters to achieve homology detection at scale. Results for 2.56M sequences on up to 8K cores show parallel efficiencies of ~ 75-100%, a time-to-solution of 33s, and a rate of ~ 2.0M alignments per second.
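
    The abstract above contrasts optimal pairwise alignment with heuristic methods but does not name the alignment algorithm or scoring scheme used. As an illustration only, the sketch below computes an optimal local alignment score with the classic Smith-Waterman recurrence and a linear gap penalty; the function name and the `match`, `mismatch`, and `gap` values are assumptions, not parameters from the study.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Return the best local alignment score between sequences a and b.

    Uses the Smith-Waterman dynamic program with a linear gap penalty and
    keeps only two rows of the score matrix, so memory is O(len(b)).
    """
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            substitution = match if a[i - 1] == b[j - 1] else mismatch
            curr[j] = max(
                0,                        # start a new local alignment
                prev[j - 1] + substitution,  # align a[i-1] with b[j-1]
                prev[j] + gap,            # gap in b
                curr[j - 1] + gap,        # gap in a
            )
            best = max(best, curr[j])
        prev = curr
    return best


print(smith_waterman_score("ACGT", "ACGT"))
print(smith_waterman_score("AAAA", "TTTT"))
```

    Each pairwise score costs time proportional to the product of the two sequence lengths, which is why scoring all pairs among millions of sequences calls for the filtering and load-balancing techniques the abstract describes.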

  11. DECIPHER, a search-based approach to chimera identification for 16S rRNA sequences.

    PubMed

    Wright, Erik S; Yilmaz, L Safak; Noguera, Daniel R

    2012-02-01

    DECIPHER is a new method for finding 16S rRNA chimeric sequences by the use of a search-based approach. The method is based upon detecting short fragments that are uncommon in the phylogenetic group where a query sequence is classified but frequently found in another phylogenetic group. The algorithm was calibrated for full sequences (fs_DECIPHER) and short sequences (ss_DECIPHER) and benchmarked against WigeoN (Pintail), ChimeraSlayer, and Uchime using artificially generated chimeras. Overall, ss_DECIPHER and Uchime provided the highest chimera detection for sequences 100 to 600 nucleotides long (79% and 81%, respectively), but Uchime's performance deteriorated for longer sequences, while ss_DECIPHER maintained a high detection rate (89%). Both methods had low false-positive rates (1.3% and 1.6%). The more conservative fs_DECIPHER, benchmarked only for sequences longer than 600 nucleotides, had an overall detection rate lower than that of ss_DECIPHER (75%) but higher than those of the other programs. In addition, fs_DECIPHER had the lowest false-positive rate among all the benchmarked programs (<0.20%). DECIPHER was outperformed only by ChimeraSlayer and Uchime when chimeras were formed from closely related parents (less than 10% divergence). Given the differences in the programs, it was possible to detect over 89% of all chimeras with just the combination of ss_DECIPHER and Uchime. Using fs_DECIPHER, we detected between 1% and 2% additional chimeras in the RDP, SILVA, and Greengenes databases from which chimeras had already been removed with Pintail or Bellerophon. DECIPHER was implemented in the R programming language and is directly accessible through a webpage or by downloading the program as an R package (http://DECIPHER.cee.wisc.edu).

  12. Identification of purple sea urchin telomerase RNA using a next-generation sequencing based approach.

    PubMed

    Li, Yang; Podlevsky, Joshua D; Marz, Manja; Qi, Xiaodong; Hoffmann, Steve; Stadler, Peter F; Chen, Julian J-L

    2013-06-01

    Telomerase is a ribonucleoprotein (RNP) enzyme essential for telomere maintenance and chromosome stability. While the catalytic telomerase reverse transcriptase (TERT) protein is well conserved across eukaryotes, telomerase RNA (TR) is extensively divergent in size, sequence, and structure. This diversity prohibits TR identification from many important organisms. Here we report a novel approach for TR discovery that combines in vitro TR enrichment from total RNA, next-generation sequencing, and a computational screening pipeline. With this approach, we have successfully identified TR from Strongylocentrotus purpuratus (purple sea urchin) from the phylum Echinodermata. Reconstitution of activity in vitro confirmed that this RNA is an integral component of sea urchin telomerase. Comparative phylogenetic analysis against vertebrate TR sequences revealed that the purple sea urchin TR contains vertebrate-like template-pseudoknot and H/ACA domains. While lacking a vertebrate-like CR4/5 domain, sea urchin TR has a unique central domain critical for telomerase activity. This is the first TR identified from the previously unexplored invertebrate clade and provides the first glimpse of TR evolution in the deuterostome lineage. Moreover, our TR discovery approach is a significant step toward the comprehensive understanding of telomerase RNP evolution.

  13. Full-length HLA-DRB1 coding sequences generated by a hemizygous RNA-SBT approach.

    PubMed

    Gerritsen, K E H; Groeneweg, M; Meertens, C M H; Voorter, C E M; Tilanus, M G J

    2015-11-01

    Currently 1582 HLA-DRB1 alleles have been identified in the IMGT/HLA database (v3.18). Among those alleles, more than 90% have incomplete allele sequences, which complicates the analysis of the functional relevance of polymorphism beyond exon 2. The polymorphic index of each individual exon of the currently known allele sequences shows that polymorphism is present in all exons, albeit not equally abundant. Full-length HLA-DRB1 RNA sequencing identifies polymorphism of the complete coding region. Here we describe a hemizygous full-length RNA sequence-based typing (SBT) approach based on group-specific HLA-DRB1 amplification and subsequent sequencing. RNA full-length sequences can easily be accessed because of the short amplicon length (801 bp). The RNA-SBT approach was successfully validated on a panel of DRB1 alleles that have fully known coding sequences according to the IMGT/HLA database and cover all serological equivalents. Subsequently, the approach was applied to a panel of 54 alleles with incomplete allele sequences, resulting in full-length coding sequences and the identification of one new and one corrected allele. This study shows the universal applicability of the RNA-based sequencing approach to identify full-length coding sequences and to define the polymorphic content of HLA-DRB1 alleles.

  14. Plasticity in Dnmt3L-dependent and -independent modes of de novo methylation in the developing mouse embryo.

    PubMed

    Guenatri, Mounia; Duffié, Rachel; Iranzo, Julian; Fauque, Patricia; Bourc'his, Déborah

    2013-02-01

    A stimulatory DNA methyltransferase co-factor, Dnmt3L, has evolved in mammals to assist the process of de novo methylation, as genetically demonstrated in the germline. The function of Dnmt3L in the early embryo remains unresolved. By combining developmental and genetic approaches, we find that mouse embryos begin development with a maternal store of Dnmt3L, which is rapidly degraded and does not participate in embryonic de novo methylation. A zygotic-specific promoter of Dnmt3l is activated following gametic methylation loss and the potential recruitment of pluripotency factors just before implantation. Importantly, we find that zygotic Dnmt3L deficiency slows down the rate of de novo methylation in the embryo by affecting methylation density at some, but not all, genomic sequences. Dnmt3L is not strictly required, however, as methylation patterns are eventually established in its absence, in the context of increased Dnmt3A protein availability. This study proves that the postimplantation embryo is more plastic than the germline in terms of DNA methylation mechanistic choices and, importantly, that de novo methylation can be achieved in vivo without Dnmt3L.

  15. Strategies for complete plastid genome sequencing.

    PubMed

    Twyford, Alex D; Ness, Rob W

    2016-10-28

    Plastid sequencing is an essential tool in the study of plant evolution. This high-copy organelle is one of the most technically accessible regions of the genome, and its sequence conservation makes it a valuable region for comparative genome evolution, phylogenetic analysis and population studies. Here, we discuss recent innovations and approaches for de novo plastid assembly that harness genomic tools. We focus on technical developments including low-cost sequence library preparation approaches for genome skimming, enrichment via hybrid baits and methylation-sensitive capture, sequence platforms with higher read outputs and longer read lengths, and automated tools for assembly. These developments allow for a much more streamlined assembly than via conventional short-range PCR. Although newer methods make complete plastid sequencing possible for any land plant or green alga, there are still challenges for producing finished plastomes particularly from herbarium material or from structurally divergent plastids such as those of parasitic plants.

  16. A combinatorial approach for analyzing intra-tumor heterogeneity from high-throughput sequencing data

    PubMed Central

    Hajirasouliha, Iman; Mahmoody, Ahmad; Raphael, Benjamin J.

    2014-01-01

    Motivation: High-throughput sequencing of tumor samples has shown that most tumors exhibit extensive intra-tumor heterogeneity, with multiple subpopulations of tumor cells containing different somatic mutations. Recent studies have quantified this intra-tumor heterogeneity by clustering mutations into subpopulations according to the observed counts of DNA sequencing reads containing the variant allele. However, these clustering approaches do not consider that the population frequencies of different tumor subpopulations are correlated by their shared ancestry in the same population of cells. Results: We introduce the binary tree partition (BTP), a novel combinatorial formulation of the problem of constructing the subpopulations of tumor cells from the variant allele frequencies of somatic mutations. We show that finding a BTP is an NP-complete problem; derive an approximation algorithm for an optimization version of the problem; and present a recursive algorithm to find a BTP with errors in the input. We show that the resulting algorithm outperforms existing clustering approaches on simulated and real sequencing data. Availability and implementation: Python and MATLAB implementations of our method are available at http://compbio.cs.brown.edu/software/ Contact: braphael@cs.brown.edu Supplementary information: Supplementary data are available at Bioinformatics online. PMID:24932008
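    The clustering approaches mentioned in this abstract start from the variant allele frequency (VAF) of each somatic mutation, that is, the fraction of reads at a locus that carry the variant allele. The following is a minimal sketch of that first step only, not the authors' binary tree partition algorithm; the function names and the example read counts are hypothetical.

```python
from typing import Dict, Tuple


def variant_allele_frequency(variant_reads: int, total_reads: int) -> float:
    """Return the fraction of reads at a locus that carry the variant allele."""
    if total_reads <= 0:
        raise ValueError("total_reads must be positive")
    if not 0 <= variant_reads <= total_reads:
        raise ValueError("variant_reads must lie between 0 and total_reads")
    return variant_reads / total_reads


def vafs_for_mutations(counts: Dict[str, Tuple[int, int]]) -> Dict[str, float]:
    """Map each mutation name to its VAF, given (variant_reads, total_reads)."""
    return {
        name: variant_allele_frequency(variant, total)
        for name, (variant, total) in counts.items()
    }


# Hypothetical read counts for three mutations (illustrative values only).
example_counts = {"mut_a": (25, 100), "mut_b": (10, 40), "mut_c": (5, 50)}
example_vafs = vafs_for_mutations(example_counts)
```

    Mutations with similar VAFs (here `mut_a` and `mut_b`) are the ones a clustering method would tend to place in the same subpopulation.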

  17. Sensitivity of the NMR density matrix to pulse sequence parameters: a simplified analytic approach.

    PubMed

    Momot, Konstantin I; Takegoshi, K

    2012-08-01

    We present a formalism for the analysis of sensitivity of nuclear magnetic resonance pulse sequences to variations of pulse sequence parameters, such as radiofrequency pulses, gradient pulses or evolution delays. The formalism enables the calculation of compact, analytic expressions for the derivatives of the density matrix and the observed signal with respect to the parameters varied. The analysis is based on two constructs computed in the course of modified density-matrix simulations: the error interrogation operators and error commutators. The approach presented is consequently named the Error Commutator Formalism (ECF). It is used to evaluate the sensitivity of the density matrix to parameter variation based on the simulations carried out for the ideal parameters, obviating the need for finite-difference calculations of signal errors. The ECF analysis therefore carries a computational cost comparable to a single density-matrix or product-operator simulation. Its application is illustrated using a number of examples from basic NMR spectroscopy. We show that the strength of the ECF is its ability to provide analytic insights into the propagation of errors through pulse sequences and the behaviour of signal errors under phase cycling. Furthermore, the approach is algorithmic and easily amenable to implementation in the form of a programming code. It is envisaged that it could be incorporated into standard NMR product-operator simulation packages.

  18. Multiplex parallel pair-end-ditag sequencing approaches in system biology.

    PubMed

    Ruan, Yijun; Wei, Chia-Lin

    2010-01-01

    Characterization of all the functional components of the human genome relies on our ability to completely elucidate the genetic/epigenetic regulatory networks, chromatin states, nuclear architectures, and genome variations. Such endeavors demand the development of robust and effective genomic technologies. In the past few years, the availability of disruptive next-generation DNA sequencing technologies has offered new promise for whole-genome interrogation. However, despite their massively parallel and ultra-high-throughput capacity, the short read lengths common to these platforms limit their application to many types of whole-genome analyses. To overcome this constraint, the paired-end ditag (PET) sequencing concept was conceived as an immediate solution to expand the information content and extend the linear coverage. By sequencing paired-end signatures from any desired DNA fragment and mapping them to the reference genome, the PET strategy allows the accurate demarcation of target DNA boundaries and defines their locations on the genomic landscape. Furthermore, the ability to delineate the relationship between the two ends of a DNA molecule enables the full-scale discovery of unconventional gene products, genome rearrangements, and chromatin interactions. Coupled with massively parallel and ultra-high-throughput sequencing platforms, these unique features of the PET strategy have the potential to revolutionize the approaches used to decipher regulatory networks in systems biology, define genome organization, and characterize genome variations, which should ultimately lead to the development of strategies for personalized medicine.

  19. Identifying natural substrates for chaperonins using a sequence-based approach

    PubMed Central

    Stan, George; Brooks, Bernard R.; Lorimer, George H.; Thirumalai, D.

    2005-01-01

    The Escherichia coli chaperonin machinery, GroEL, assists the folding of a number of proteins. We describe a sequence-based approach to identify the natural substrate proteins (SPs) for GroEL. Our method is based on the hypothesis that natural SPs are those that contain patterns of residues similar to those found in either GroES mobile loop and/or strongly binding peptide in complex with GroEL. The method is validated by comparing the predicted results with experimentally determined natural SPs for GroEL. We have searched for such patterns in five genomes. In the E. coli genome, we identify 1422 (about one-third) sequences that are putative natural SPs. In Saccharomyces cerevisiae, 2885 (32%) of sequences can be natural substrates for Hsp60, which is the analog of GroEL. The precise number of natural SPs is shown to be a function of the number of contacts an SP makes with the apical domain (NC) and the number of binding sites (NB) in the oligomer with which it interacts. For known SPs for GroEL, we find ~4 < NC < 5 and 2 ≤ NB ≤ 4. A limited analysis of the predicted binding sequences shows that they do not adopt any preferred secondary structure. Our method also predicts the putative binding regions in the identified SPs. The results of our study show that a variety of SPs, associated with diverse functions, can interact with GroEL. PMID:15576562

  20. A Bayesian Approach to Joint Modeling of Protein-DNA Binding, Gene Expression and Sequence Data

    PubMed Central

    Xie, Yang; Pan, Wei; Jeong, Kyeong S.; Xiao, Guanghua; Khodursky, Arkady B.

    2012-01-01

    Genome-wide DNA-protein binding data, DNA sequence data and gene expression data represent complementary means of deciphering global and local transcriptional regulatory circuits. Combining these different types of data can not only improve the statistical power, but also provide a more comprehensive picture of gene regulation. In this paper, we propose a novel statistical model to augment protein-DNA binding data with gene expression and DNA sequence data when available. We specify a hierarchical Bayes model and use Markov chain Monte Carlo simulations to draw inferences. Both simulation studies and an analysis of an experimental dataset show that the proposed joint modeling method can significantly improve the specificity and sensitivity of identifying target genes as compared to conventional approaches relying on a single data source. PMID:20049751

  1. Exome sequencing from nanogram amounts of starting DNA: comparing three approaches.

    PubMed

    Rykalina, Vera N; Shadrin, Alexey A; Amstislavskiy, Vyacheslav S; Rogaev, Evgeny I; Lehrach, Hans; Borodina, Tatiana A

    2014-01-01

    Hybridization-based target enrichment protocols require relatively large starting amounts of genomic DNA, which is not always available. Here, we tested three approaches to pre-capture library preparation starting from 10 ng of genomic DNA: (i and ii) whole-genome amplification of DNA samples with REPLI-g (Qiagen) and GenomePlex (Sigma) kits followed by standard library preparation, and (iii) library construction with a low input oriented ThruPLEX kit (Rubicon Genomics). Exome capture with Agilent SureSelectXT2 Human AllExon v4+UTRs capture probes, and HiSeq2000 sequencing were performed for test libraries along with the control library prepared from 1 µg of starting DNA. Tested protocols were characterized in terms of mapping efficiency, enrichment ratio, coverage of the target region, and reliability of SNP genotyping. REPLI-g- and ThruPLEX-FD-based protocols seem to be adequate solutions for exome sequencing of low input samples.

  2. A long-term target detection approach in infrared image sequence

    NASA Astrophysics Data System (ADS)

    Li, Hang; Zhang, Qi; Wang, Xin; Hu, Chao

    2016-10-01

    An automatic target detection method for long-term infrared (IR) image sequences captured from a moving platform is proposed. First, target candidates are iteratively segmented based on the principle of maximum entropy (POME). Then the real target is captured via two different selection approaches. At the beginning of the image sequence, the genuine target, which has little texture, is discriminated from other candidates by using a contrast-based confidence measure. When the target becomes larger, we apply an online EM method to estimate and update the distributions of the target's size and position based on the prior detection results, and then recognize the genuine target as the one that satisfies both the size and position constraints. Experimental results demonstrate that the presented method is accurate, robust and efficient.

  3. A long-term target detection approach in infrared image sequence

    NASA Astrophysics Data System (ADS)

    Li, Hang; Zhang, Qi; Li, Yuanyuan; Wang, Liqiang

    2015-12-01

    An automatic target detection method for long-term infrared (IR) image sequences captured from a moving platform is proposed. First, based on non-linear histogram equalization, target candidates are segmented in a coarse-to-fine manner by using two self-adaptive thresholds generated in the intensity space. Then the real target is captured via two different selection approaches. At the beginning of the image sequence, the genuine target, which has little texture, is discriminated from other candidates by using a contrast-based confidence measure. When the target becomes larger, we apply an online EM method to iteratively estimate and update the distributions of the target's size and position based on the prior detection results, and then recognize the genuine target as the one that satisfies both the size and position constraints. Experimental results demonstrate that the presented method is accurate, robust and efficient.

  4. Sequence-based discrimination of protein-RNA interacting residues using a probabilistic approach.

    PubMed

    Pai, Priyadarshini P; Dash, Tirtharaj; Mondal, Sukanta

    2017-04-07

    Protein interactions with ribonucleic acids (RNA) are well known to be crucial for a wide range of cellular processes such as transcriptional regulation, protein synthesis or translation, and post-translational modifications. Identification of the RNA-interacting residues can provide insights into these processes and aid in relevant biotechnological manipulations. Owing to their eventual potential in combating diseases and in industrial production, several computational attempts have been made over the years using sequence- and structure-based information. Recent comparative studies suggest that, despite these developments, many problems remain with respect to the usability, prerequisites, and accessibility of various tools, thereby calling for an alternative approach and a supplementary perspective in the prediction scenario. With this motivation, in this paper, we propose the use of a simple yet efficient conditional probabilistic approach, based on the local occurrence of amino acids in the interacting region in a non-numeric sequence feature space, for discriminating between RNA-interacting and non-interacting residues. The proposed method has been meticulously tested for robustness using a cross-estimation method, showing an MCC of 0.341 and an F-measure of 66.84%. On large-scale applications using the benchmark datasets available to date, this approach showed encouraging performance comparable with the state of the art. The software is available at https://github.com/ABCgrp/DORAEMON.
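    The two performance figures quoted in this abstract, the Matthews correlation coefficient (MCC) and the F-measure, are both computed from a binary confusion matrix. The sketch below uses the standard textbook definitions; it is not taken from the DORAEMON software, and the function names are hypothetical.

```python
import math


def matthews_corrcoef(tp: int, tn: int, fp: int, fn: int) -> float:
    """MCC from confusion-matrix counts; returns 0.0 when it is undefined."""
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        return 0.0
    return (tp * tn - fp * fn) / denominator


def f_measure(tp: int, fp: int, fn: int) -> float:
    """F1 score, the harmonic mean of precision and recall."""
    denominator = 2 * tp + fp + fn
    if denominator == 0:
        return 0.0
    return 2 * tp / denominator
```

    MCC ranges from -1 to 1 and accounts for true negatives, which makes it informative when interacting residues are much rarer than non-interacting ones; the F-measure ignores true negatives.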

  5. Comparative analysis of de novo transcriptome assembly.

    PubMed

    Clarke, Kaitlin; Yang, Yi; Marsh, Ronald; Xie, Linglin; Zhang, Ke K

    2013-02-01

    The fast development of next-generation sequencing technology presents a major computational challenge for data processing and analysis. A fast algorithm based on the de Bruijn graph has been successfully used for de novo assembly of genomic DNA; nevertheless, its performance for transcriptome assembly is unclear. In this study, we used both simulated and real RNA-Seq data, from either artificial RNA templates or human transcripts, to evaluate five de novo assemblers: ABySS, Mira, Trinity, Velvet and Oases. Of these assemblers, ABySS, Trinity, Velvet and Oases are all based on the de Bruijn graph, and Mira uses an overlap graph algorithm. Various numbers of short RNA reads were selected from the External RNA Control Consortium (ERCC) data and human chromosome 22. A number of statistics were then calculated for the resulting contigs from each assembler. Each experiment was repeated multiple times to obtain the mean statistics and standard error estimate. Trinity had relatively good performance for both ERCC and human data, but it may not consistently generate full-length transcripts. ABySS was the fastest method but its assembly quality was low. Mira gave a good rate for mapping its contigs onto human chromosome 22, but its computational speed is not satisfactory. Our results suggest that transcript assembly remains a challenging problem for the bioinformatics community. Therefore, a novel assembler is needed for assembling transcriptome data generated by next-generation sequencing techniques.
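    Four of the five assemblers compared in this abstract are based on the de Bruijn graph. As a rough illustration of that data structure, the sketch below builds a graph whose nodes are (k-1)-mers and whose edges are the k-mers observed in the reads. It is a teaching example under simplifying assumptions (error-free reads, a single strand, no graph simplification) and does not reproduce the internals of any of the named assemblers; the function name is hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def build_de_bruijn_graph(reads: Iterable[str], k: int) -> Dict[str, List[str]]:
    """Map each (k-1)-mer prefix to the (k-1)-mer suffixes that follow it.

    Every k-mer found in a read contributes one edge from its first
    k-1 bases to its last k-1 bases. Repeated k-mers yield repeated edges.
    """
    if k < 2:
        raise ValueError("k must be at least 2")
    graph: Dict[str, List[str]] = defaultdict(list)
    for read in reads:
        # Reads shorter than k contribute no k-mers (empty range).
        for start in range(len(read) - k + 1):
            kmer = read[start:start + k]
            graph[kmer[:-1]].append(kmer[1:])
    return dict(graph)


# Two overlapping toy reads (illustrative sequences only).
toy_graph = build_de_bruijn_graph(["ACGT", "CGTA"], k=3)
```

    An assembler would then look for paths through such a graph to spell out contigs; for transcriptomes, alternative splicing and uneven expression make that path-finding step considerably harder than for a single genome.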

  6. PCR Strategies for Complete Allele Calling in Multigene Families Using High-Throughput Sequencing Approaches.

    PubMed

    Marmesat, Elena; Soriano, Laura; Mazzoni, Camila J; Sommer, Simone; Godoy, José A

    2016-01-01

    The characterization of multigene families with high copy number variation is often approached through PCR amplification with highly degenerate primers to account for all expected variants flanking the region of interest. Such an approach often introduces PCR biases that result in an unbalanced representation of targets in high-throughput sequencing libraries that eventually results in incomplete detection of the targeted alleles. Here we confirm this result and propose two different amplification strategies to alleviate this problem. The first strategy (called pooled-PCRs) targets different subsets of alleles in multiple independent PCRs using different moderately degenerate primer pairs, whereas the second approach (called pooled-primers) uses a custom-made pool of non-degenerate primers in a single PCR. We compare their performance to the common use of a single PCR with highly degenerate primers using the MHC class I of the Iberian lynx as a model. We found both novel approaches to work similarly well and better than the conventional approach. They significantly scored more alleles per individual (11.33 ± 1.38 and 11.72 ± 0.89 vs 7.94 ± 1.95), yielded more complete allelic profiles (96.28 ± 8.46 and 99.50 ± 2.12 vs 63.76 ± 15.43), and revealed more alleles at a population level (13 vs 12). Finally, we could link each allele's amplification efficiency with the primer-mismatches in its flanking sequences and show that ultra-deep coverage offered by high-throughput technologies does not fully compensate for such biases, especially as real alleles may reach lower coverage than artefacts. Adopting either of the proposed amplification methods provides the opportunity to attain more complete allelic profiles at lower coverages, improving confidence over the downstream analyses and subsequent applications.

  7. PCR Strategies for Complete Allele Calling in Multigene Families Using High-Throughput Sequencing Approaches

    PubMed Central

    Marmesat, Elena; Soriano, Laura; Mazzoni, Camila J.; Sommer, Simone

    2016-01-01

    The characterization of multigene families with high copy number variation is often approached through PCR amplification with highly degenerate primers to account for all expected variants flanking the region of interest. Such an approach often introduces PCR biases that result in an unbalanced representation of targets in high-throughput sequencing libraries that eventually results in incomplete detection of the targeted alleles. Here we confirm this result and propose two different amplification strategies to alleviate this problem. The first strategy (called pooled-PCRs) targets different subsets of alleles in multiple independent PCRs using different moderately degenerate primer pairs, whereas the second approach (called pooled-primers) uses a custom-made pool of non-degenerate primers in a single PCR. We compare their performance to the common use of a single PCR with highly degenerate primers using the MHC class I of the Iberian lynx as a model. We found both novel approaches to work similarly well and better than the conventional approach. They significantly scored more alleles per individual (11.33 ± 1.38 and 11.72 ± 0.89 vs 7.94 ± 1.95), yielded more complete allelic profiles (96.28 ± 8.46 and 99.50 ± 2.12 vs 63.76 ± 15.43), and revealed more alleles at a population level (13 vs 12). Finally, we could link each allele’s amplification efficiency with the primer-mismatches in its flanking sequences and show that ultra-deep coverage offered by high-throughput technologies does not fully compensate for such biases, especially as real alleles may reach lower coverage than artefacts. Adopting either of the proposed amplification methods provides the opportunity to attain more complete allelic profiles at lower coverages, improving confidence over the downstream analyses and subsequent applications. PMID:27294261

  8. Genovo: De Novo Assembly for Metagenomes

    NASA Astrophysics Data System (ADS)

    Laserson, Jonathan; Jojic, Vladimir; Koller, Daphne

    Next-generation sequencing technologies produce a large number of noisy reads from the DNA in a sample. Metagenomics and population sequencing aim to recover the genomic sequences of the species in the sample, which could be of high diversity. Methods geared towards single sequence reconstruction are not sensitive enough when applied in this setting. We introduce a generative probabilistic model of read generation from environmental samples and present Genovo, a novel de novo sequence assembler that discovers likely sequence reconstructions under the model. A Chinese restaurant process prior accounts for the unknown number of genomes in the sample. Inference is made by applying a series of hill-climbing steps iteratively until convergence. We compare the performance of Genovo to three other short read assembly programs across one synthetic dataset and eight metagenomic datasets created using the 454 platform, the largest of which has 311k reads. Genovo's reconstructions cover more bases and recover more genes than the other methods, and yield a higher assembly score.
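    The Chinese restaurant process (CRP) prior mentioned in this abstract lets the number of clusters (here, genomes) grow with the data instead of being fixed in advance. The following minimal sketch samples a partition from a standard CRP with concentration parameter `alpha`; it illustrates the prior only and is not Genovo's full generative model or its inference procedure. The function name and the default seed are assumptions made for the example.

```python
import random
from typing import List, Tuple


def sample_crp(n_items: int, alpha: float, seed: int = 0) -> Tuple[List[int], List[int]]:
    """Sample cluster assignments for n_items from a Chinese restaurant process.

    Item i joins an existing cluster with probability proportional to that
    cluster's current size, or opens a new cluster with probability
    proportional to alpha.
    """
    if alpha <= 0:
        raise ValueError("alpha must be positive")
    rng = random.Random(seed)
    assignments: List[int] = []
    cluster_sizes: List[int] = []
    for _ in range(n_items):
        # The last weight corresponds to opening a new cluster.
        weights = cluster_sizes + [alpha]
        choice = rng.choices(range(len(weights)), weights=weights)[0]
        if choice == len(cluster_sizes):
            cluster_sizes.append(1)
        else:
            cluster_sizes[choice] += 1
        assignments.append(choice)
    return assignments, cluster_sizes


# 100 hypothetical reads assigned to an unknown number of genomes.
read_assignments, genome_sizes = sample_crp(100, alpha=1.0)
```

    Larger values of `alpha` favor more clusters; the expected number of clusters grows roughly logarithmically with the number of items.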

  9. TBro: visualization and management of de novo transcriptomes.

    PubMed

    Ankenbrand, Markus J; Weber, Lorenz; Becker, Dirk; Förster, Frank; Bemm, Felix

    2016-01-01

    RNA sequencing (RNA-seq) has become a powerful tool to understand molecular mechanisms and/or developmental programs. It provides a fast, reliable and cost-effective method to access sets of expressed elements in a qualitative and quantitative manner. Especially for non-model organisms and in the absence of a reference genome, RNA-seq data is used to reconstruct and quantify transcriptomes at the same time. Even SNPs, InDels, and alternative splicing events are predicted directly from the data without having a reference genome at hand. A key challenge, especially for non-computational personnel, is the management of the resulting datasets, consisting of different data types and formats. Here, we present TBro, a flexible de novo transcriptome browser, tackling this challenge. TBro aggregates sequences, their annotation, expression levels as well as differential testing results. It provides an easy-to-use interface to mine the aggregated data and generate publication-ready visualizations. Additionally, it supports users with an intuitive cart system that helps in collecting and analysing biologically meaningful sets of transcripts. TBro's modular architecture allows easy extension of its functionalities in the future. In particular, the integration of new data types such as proteomic quantifications or array-based gene expression data is straightforward. Thus, TBro is a fully featured yet flexible transcriptome browser that supports approaching complex biological questions and enhances collaboration of numerous researchers. Database URL: tbro.carnivorom.com.

  10. TBro: visualization and management of de novo transcriptomes

    PubMed Central

    Ankenbrand, Markus J.; Weber, Lorenz; Becker, Dirk; Förster, Frank; Bemm, Felix

    2016-01-01

    RNA sequencing (RNA-seq) has become a powerful tool to understand molecular mechanisms and/or developmental programs. It provides a fast, reliable and cost-effective method to access sets of expressed elements in a qualitative and quantitative manner. Especially for non-model organisms and in the absence of a reference genome, RNA-seq data is used to reconstruct and quantify transcriptomes at the same time. Even SNPs, InDels, and alternative splicing events are predicted directly from the data without having a reference genome at hand. A key challenge, especially for non-computational personnel, is the management of the resulting datasets, consisting of different data types and formats. Here, we present TBro, a flexible de novo transcriptome browser, tackling this challenge. TBro aggregates sequences, their annotation, expression levels as well as differential testing results. It provides an easy-to-use interface to mine the aggregated data and generate publication-ready visualizations. Additionally, it supports users with an intuitive cart system that helps in collecting and analysing biologically meaningful sets of transcripts. TBro’s modular architecture allows easy extension of its functionalities in the future. In particular, the integration of new data types such as proteomic quantifications or array-based gene expression data is straightforward. Thus, TBro is a fully featured yet flexible transcriptome browser that supports approaching complex biological questions and enhances collaboration of numerous researchers. Database URL: tbro.carnivorom.com PMID:28025338

  11. A 454 sequencing approach for large scale phylogenomic analysis of the common emperor scorpion (Pandinus imperator).

    PubMed

    Roeding, Falko; Borner, Janus; Kube, Michael; Klages, Sven; Reinhardt, Richard; Burmester, Thorsten

    2009-12-01

    In recent years, phylogenetic tree reconstructions that rely on multiple gene alignments deduced from expressed sequence tags (ESTs) have become a popular method in molecular systematics. Here, we present a 454 pyrosequencing approach to infer the transcriptome of the Emperor scorpion Pandinus imperator. We obtained 428,844 high-quality reads (mean length = 223 ± 50 b) from total cDNA, which were assembled into 8334 contigs (mean length 422 ± 313 bp) and 26,147 singletons. About 1200 contigs were successfully annotated by BLAST and orthology search. Specific analyses of eight distinct hemocyanin sequences provided further proof for the quality of the 454 reads and the assembly process. The P. imperator sequences were included in a concatenated alignment of 149 orthologous genes of 67 metazoan taxa that covers 39,842 amino acids. After removal of low-quality regions, 11,168 positions were employed for phylogenetic reconstructions. Using Bayesian and maximum likelihood methods, we obtained strongly supported monophyletic Ecdysozoa, Arthropoda (excluding Tardigrada), Euarthropoda, Pancrustacea and Hexapoda. We also recovered the Myriochelata (Chelicerata+Myriapoda). Within the chelicerates, Pycnogonida form the sister group of Euchelicerata. However, Arachnida were found paraphyletic because the Acari (mites and ticks) were recovered as sister group of a clade comprising Xiphosura, Scorpiones and Araneae. In summary, we have shown that 454 pyrosequencing is a cost-effective method that provides sufficient data and coverage depth for gene detection and multigene-based phylogenetic analyses.

  12. Installing hydrolytic activity into a completely de novo protein framework

    NASA Astrophysics Data System (ADS)

    Burton, Antony J.; Thomson, Andrew R.; Dawson, William M.; Brady, R. Leo; Woolfson, Derek N.

    2016-09-01

    The design of enzyme-like catalysts tests our understanding of sequence-to-structure/function relationships in proteins. Here we install hydrolytic activity predictably into a completely de novo and thermostable α-helical barrel, which comprises seven helices arranged around an accessible channel. We show that the lumen of the barrel accepts 21 mutations to functional polar residues. The resulting variant, which has cysteine-histidine-glutamic acid triads on each helix, hydrolyses p-nitrophenyl acetate with catalytic efficiencies that match the most-efficient redesigned hydrolases based on natural protein scaffolds. This is the first report of a functional catalytic triad engineered into a de novo protein framework. The flexibility of our system also allows the facile incorporation of unnatural side chains to improve activity and probe the catalytic mechanism. Such a predictable and robust construction of truly de novo biocatalysts holds promise for applications in chemical and biochemical synthesis.

  13. Next-Generation Sequencing Workflow for NSCLC Critical Samples Using a Targeted Sequencing Approach by Ion Torrent PGM™ Platform

    PubMed Central

    Vanni, Irene; Coco, Simona; Truini, Anna; Rusmini, Marta; Dal Bello, Maria Giovanna; Alama, Angela; Banelli, Barbara; Mora, Marco; Rijavec, Erika; Barletta, Giulia; Genova, Carlo; Biello, Federica; Maggioni, Claudia; Grossi, Francesco

    2015-01-01

    Next-generation sequencing (NGS) is a cost-effective technology capable of screening several genes simultaneously; however, its application in a clinical context requires an established workflow to acquire reliable sequencing results. Here, we report an optimized NGS workflow analyzing 22 lung cancer-related genes to sequence critical samples such as DNA from formalin-fixed paraffin-embedded (FFPE) blocks and circulating free DNA (cfDNA). Snap frozen and matched FFPE gDNA from 12 non-small cell lung cancer (NSCLC) patients, whose gDNA fragmentation status was previously evaluated using a multiplex PCR-based quality control, were successfully sequenced with Ion Torrent PGM™. The robust bioinformatic pipeline allowed us to correctly call both Single Nucleotide Variants (SNVs) and indels with a detection limit of 5%, achieving 100% specificity and 96% sensitivity. This workflow was also validated in 13 FFPE NSCLC biopsies. Furthermore, a specific protocol for low input gDNA capable of producing good sequencing data with high coverage, high uniformity, and a low error rate was also optimized. In conclusion, we demonstrate the feasibility of obtaining gDNA from FFPE samples suitable for NGS by performing appropriate quality controls. The optimized workflow, capable of screening low input gDNA, highlights NGS as a potential tool in the detection, disease monitoring, and treatment of NSCLC. PMID:26633390

  14. Approach for moving small target detection in infrared image sequence based on reinforcement learning

    NASA Astrophysics Data System (ADS)

    Wang, Chuanyun; Qin, Shiyin

    2016-09-01

    To address the problems that background clutter and target size variation over time pose for moving small target detection in infrared image sequences, an approach is proposed under a pipeline framework with an optimization strategy based on reinforcement learning. The pipeline framework is composed of pipeline establishment, target-background image separation, and target confirmation. The pipeline is established by designating several successive images with a temporal sliding window; target-background image separation is handled by low-rank and sparse matrix decomposition via robust principal component analysis; and target confirmation is achieved by employing a voting mechanism over more than one separated target image of the same input image. For continual optimization of target-background image separation, the weighting parameter of the low-rank and sparse matrix decomposition is dynamically regulated by reinforcement learning during consecutive detection, integrating a complexity evaluation of the sequential infrared images with an assessment of the moving small target detection results. The experimental results over four infrared small target image sequences with different cloudy sky backgrounds demonstrate the effectiveness and advantages of the proposed approach in both background clutter suppression and small target detection.

  15. Approaches to the detection of recessive effects using next generation sequencing data from outbred populations.

    PubMed

    Curtis, David

    2013-01-01

    Conventional methods to analyze genome-wide association studies and whole exome or whole genome sequencing studies would be prone to overlook variants which might exert a recessive effect on risk of disease, either as homozygotes or compound heterozygotes. It is plausible that such effects may be common even in outbred populations. An approach is described which is based on identifying a set of variants in a gene as being potentially of interest and then testing whether there is an excess of cases who are either homozygotes or complex heterozygotes for these variants. Methods based on departure from Hardy-Weinberg equilibrium are more powerful than those which compare cases to controls. However, linkage disequilibrium between variants can be difficult to deal with if phase is unknown. A simple approach for discarding variants apparently in strong linkage disequilibrium with others is proposed. The procedure is simple and quick to apply so can be used in the context of whole genome or exome sequencing studies and is implemented in the SCOREASSOC program.
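    The idea of testing cases for departure from Hardy-Weinberg equilibrium can be illustrated with a small calculation: if the variants of interest in a gene have a combined allele frequency q among cases, Hardy-Weinberg equilibrium predicts that a fraction q² of cases carry two variant alleles, so an observed count well above that expectation suggests a recessive effect. The sketch below is a simplified illustration under the assumption that the variants are pooled into a single "variant" allele; it is not the statistic implemented in SCOREASSOC, and the function names and example numbers are hypothetical.

```python
def expected_two_copy_carriers(variant_allele_count: int, n_subjects: int) -> float:
    """Expected number of subjects carrying two variant alleles under HWE."""
    if n_subjects <= 0:
        raise ValueError("n_subjects must be positive")
    if not 0 <= variant_allele_count <= 2 * n_subjects:
        raise ValueError("variant_allele_count must lie between 0 and 2 * n_subjects")
    # Pooled variant allele frequency among the 2N chromosomes sampled.
    q = variant_allele_count / (2 * n_subjects)
    return n_subjects * q * q


def two_copy_excess(observed_two_copy: int, variant_allele_count: int, n_subjects: int) -> float:
    """Observed minus expected count of two-copy carriers (positive = excess)."""
    return observed_two_copy - expected_two_copy_carriers(variant_allele_count, n_subjects)


# Hypothetical example: 100 cases, 20 variant alleles in total, and
# 6 cases observed to carry two variant alleles.
excess = two_copy_excess(observed_two_copy=6, variant_allele_count=20, n_subjects=100)
```

    In this illustrative example about one two-copy carrier is expected, so six observed carriers represent a marked excess; a real analysis would attach a significance test to that difference and deal with linkage disequilibrium between variants, as the abstract notes.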

  16. "Polymeromics": Mass spectrometry based strategies in polymer science toward complete sequencing approaches: a review.

    PubMed

    Altuntaş, Esra; Schubert, Ulrich S

    2014-01-15

    Mass spectrometry (MS) is the most versatile and comprehensive method in "OMICS" sciences (i.e. in proteomics, genomics, metabolomics and lipidomics). The applications of MS and tandem MS (MS/MS or MS(n)) provide sequence information on the full complement of biological samples in order to understand the importance of the sequences for their precise and specific functions. Nowadays, the control of polymer sequences and their accurate characterization is one of the significant challenges of current polymer science. Therefore, a similar approach can be very beneficial for characterizing and understanding the complex structures of synthetic macromolecules. MS-based strategies allow a relatively precise examination of polymeric structures (e.g. their molar mass distributions, monomer units, side chain substituents, end-group functionalities, and copolymer compositions). Moreover, tandem MS offers accurate structural information on intricate macromolecular structures; however, it produces vast amounts of data to interpret. In "OMICS" sciences, software for interpreting the obtained data has developed satisfactorily (e.g. in proteomics), because it is not possible to manually handle the amount of data acquired via (tandem) MS studies on biological samples. It can be expected that special software tools will improve the interpretation of (tandem) MS output from investigations of synthetic polymers as well. Eventually, the MS/MS field will also open up for polymer scientists who are not MS specialists. In this review, we dissect the overall framework of the MS and MS/MS analysis of synthetic polymers into its key components. We discuss the fundamentals of polymer analyses as well as recent advances in the areas of tandem mass spectrometry, software developments, and the overall future perspectives on the way to polymer sequencing, one of the last Holy Grails in polymer science.

  17. Imputation approach for deducing a complete mitogenome sequence from low-depth-coverage next-generation sequencing data: application to ancient remains from the Moon Pyramid, Mexico.

    PubMed

    Mizuno, Fuzuki; Kumagai, Masahiko; Kurosaki, Kunihiko; Hayashi, Michiko; Sugiyama, Saburo; Ueda, Shintaroh; Wang, Li

    2017-02-16

    It is generally considered that a depth of coverage of more than 15 is necessary for next-generation sequencing (NGS) data to yield reliable complete nucleotide sequences of the mitogenome. However, it is difficult to satisfy this requirement for all nucleotide positions because of problems obtaining a uniform depth of coverage for poorly preserved materials. Thus, we propose an imputation approach that allows a complete mitogenome sequence to be deduced from low-depth-coverage NGS data. We used different types of mitogenome data files as panels for imputation: a worldwide panel comprising all the major haplogroups, a worldwide panel comprising sequences belonging to the estimated haplogroup alone, a panel comprising sequences from the population most closely related to an individual under investigation, and a panel comprising sequences belonging to the estimated haplogroup from the population most closely related to an individual under investigation. The number of missing nucleotides was drastically reduced in all the panels, but the contents obtained by imputation were quite different among the panels. The efficiency of the imputation method differed according to the panels used. The missing nucleotides were most credibly imputed using sequences of the estimated haplogroup from the population most closely related to the individual under investigation as a panel. Journal of Human Genetics advance online publication, 16 February 2017; doi:10.1038/jhg.2017.14.

  18. Reconstructing Networks from Profit Sequences in Evolutionary Games via a Multiobjective Optimization Approach with Lasso Initialization

    NASA Astrophysics Data System (ADS)

    Wu, Kai; Liu, Jing; Wang, Shuai

    2016-11-01

    Evolutionary games (EG) model a common type of interactions in various complex, networked, natural and social systems. Given such a system with only profit sequences being available, reconstructing the interacting structure of EG networks is fundamental to understand and control its collective dynamics. Existing approaches used to handle this problem, such as the lasso, a convex optimization method, need a user-defined constant to control the tradeoff between the natural sparsity of networks and measurement error (the difference between observed data and simulated data). However, a shortcoming of these approaches is that it is not easy to determine these key parameters which can maximize the performance. In contrast to these approaches, we first model the EG network reconstruction problem as a multiobjective optimization problem (MOP), and then develop a framework which involves multiobjective evolutionary algorithm (MOEA), followed by solution selection based on knee regions, termed as MOEANet, to solve this MOP. We also design an effective initialization operator based on the lasso for MOEA. We apply the proposed method to reconstruct various types of synthetic and real-world networks, and the results show that our approach is effective to avoid the above parameter selecting problem and can reconstruct EG networks with high accuracy.
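
    The abstract above mentions selecting a solution from the Pareto front "based on knee regions" but does not give the procedure. The following is only a minimal sketch of one common heuristic, assuming two objectives that are both minimized (for example, network density and measurement error): it picks the point farthest from the straight line joining the two extreme solutions. The function name and the sample values are hypothetical and are not taken from MOEANet.

```python
import math


def knee_point(front):
    """Pick a knee of a 2-objective Pareto front (both objectives minimized).

    Heuristic: return the point with the largest perpendicular distance
    from the line joining the two extreme points of the front.
    """
    pts = sorted(front)
    (x1, y1), (x2, y2) = pts[0], pts[-1]
    norm = math.hypot(x2 - x1, y2 - y1)
    if norm == 0:
        # Degenerate front: all points coincide with the single extreme.
        return pts[0]

    def distance_to_chord(p):
        # Point-to-line distance via the 2-D cross product.
        return abs((y2 - y1) * (p[0] - x1) - (x2 - x1) * (p[1] - y1)) / norm

    return max(pts, key=distance_to_chord)


# Hypothetical front: (sparsity term, measurement error) pairs.
example_front = [(0.0, 10.0), (1.0, 2.0), (3.0, 1.0), (10.0, 0.0)]
print(knee_point(example_front))
```

    The appeal of a knee-based choice is that it needs no user-defined tradeoff constant, which is the shortcoming of the lasso that the abstract points out.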

  19. Reconstructing Networks from Profit Sequences in Evolutionary Games via a Multiobjective Optimization Approach with Lasso Initialization

    PubMed Central

    Wu, Kai; Liu, Jing; Wang, Shuai

    2016-01-01

    Evolutionary games (EG) model a common type of interactions in various complex, networked, natural and social systems. Given such a system with only profit sequences being available, reconstructing the interacting structure of EG networks is fundamental to understand and control its collective dynamics. Existing approaches used to handle this problem, such as the lasso, a convex optimization method, need a user-defined constant to control the tradeoff between the natural sparsity of networks and measurement error (the difference between observed data and simulated data). However, a shortcoming of these approaches is that it is not easy to determine these key parameters which can maximize the performance. In contrast to these approaches, we first model the EG network reconstruction problem as a multiobjective optimization problem (MOP), and then develop a framework which involves multiobjective evolutionary algorithm (MOEA), followed by solution selection based on knee regions, termed as MOEANet, to solve this MOP. We also design an effective initialization operator based on the lasso for MOEA. We apply the proposed method to reconstruct various types of synthetic and real-world networks, and the results show that our approach is effective to avoid the above parameter selecting problem and can reconstruct EG networks with high accuracy. PMID:27886244

  20. Illumina TruSeq synthetic long-reads empower de novo assembly and resolve complex, highly-repetitive transposable elements.

    PubMed

    McCoy, Rajiv C; Taylor, Ryan W; Blauwkamp, Timothy A; Kelley, Joanna L; Kertesz, Michael; Pushkarev, Dmitry; Petrov, Dmitri A; Fiston-Lavier, Anna-Sophie

    2014-01-01

    High-throughput DNA sequencing technologies have revolutionized genomic analysis, including the de novo assembly of whole genomes. Nevertheless, assembly of complex genomes remains challenging, in part due to the presence of dispersed repeats which introduce ambiguity during genome reconstruction. Transposable elements (TEs) can be particularly problematic, especially for TE families exhibiting high sequence identity, high copy number, or complex genomic arrangements. While TEs strongly affect genome function and evolution, most current de novo assembly approaches cannot resolve long, identical, and abundant families of TEs. Here, we applied a novel Illumina technology called TruSeq synthetic long-reads, which are generated through highly-parallel library preparation and local assembly of short read data and which achieve lengths of 1.5-18.5 Kbp with an extremely low error rate (∼0.03% per base). To test the utility of this technology, we sequenced and assembled the genome of the model organism Drosophila melanogaster (reference genome strain y; cn, bw, sp) achieving an N50 contig size of 69.7 Kbp and covering 96.9% of the euchromatic chromosome arms of the current reference genome. TruSeq synthetic long-read technology enables placement of individual TE copies in their proper genomic locations as well as accurate reconstruction of TE sequences. We entirely recovered and accurately placed 4,229 (77.8%) of the 5,434 annotated transposable elements with perfect identity to the current reference genome. As TEs are ubiquitous features of genomes of many species, TruSeq synthetic long-reads, and likely other methods that generate long-reads, offer a powerful approach to improve de novo assemblies of whole genomes.
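
    The N50 contig size quoted above is a standard assembly statistic: the length of the shortest contig in the smallest set of longest contigs that together cover at least half of the total assembled length. A minimal sketch of the computation follows; the function name and the toy contig lengths are illustrative and unrelated to the Drosophila assembly.

```python
def n50(contig_lengths):
    """Return the N50 of a collection of contig lengths (0 if empty).

    Contigs are visited from longest to shortest; N50 is the length of the
    contig at which the running total first reaches half of the assembly.
    """
    total = sum(contig_lengths)
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0


# Toy example: total length 20, and the two longest contigs (6 + 5) pass 10.
print(n50([2, 3, 4, 5, 6]))
```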

  1. Enzyme-like replication de novo in a microcontroller environment.

    PubMed

    Tangen, Uwe

    2010-01-01

    The desire to start evolution from scratch inside a computer memory is as old as computing. Here we demonstrate how viable computer programs can be established de novo in a Precambrian environment without supplying any specific instantiation, just starting with random bit sequences. These programs are not self-replicators, but act much more like catalysts. The microcontrollers used in the end are the result of a long series of simplifications. The objective of this simplification process was to produce universal machines with a human-readable interface, allowing software and/or hardware evolution to be studied. The power of the instruction set can be modified by introducing a secondary structure-folding mechanism, which is a state machine, allowing nontrivial replication to emerge with an instruction width of only a few bits. This state-machine approach not only attenuates the problems of brittleness and encoding functionality (too few bits available for coding, and too many instructions needed); it also enables the study of hardware evolution as such. Furthermore, the instruction set is sufficiently powerful to permit external signals to be processed. This information-theoretic approach forms one vertex of a triangle alongside artificial cell research and experimental research on the creation of life. Hopefully this work helps develop an understanding of how information—in a similar sense to the account of functional information described by Hazen et al.—is created by evolution and how this information interacts with or is embedded in its physico-chemical environment.

  2. Protein folding and de novo protein design for biotechnological applications

    PubMed Central

    Khoury, George A.; Smadbeck, James; Kieslich, Chris A.; Floudas, Christodoulos A.

    2014-01-01

    In the post-genomic era, the medical/biological fields are advancing faster than ever. However, before the power of full-genome sequencing can be fully realized, the connection between amino acid sequence and protein structure, known as the protein folding problem, needs to be elucidated. The protein folding problem remains elusive, with significant difficulties still arising when modeling amino acid sequences lacking an identifiable template. Understanding protein folding will allow for unforeseen advances in protein design, often referred as the inverse protein folding problem. Despite challenges in protein folding, de novo protein design has recently demonstrated significant success via computational techniques. We review advances and challenges in protein structure prediction and de novo protein design, and highlight their interplay in successful biotechnological applications. PMID:24268901

  3. Characterization of Squamate Olfactory Receptor Genes and Their Transcripts by the High-Throughput Sequencing Approach

    PubMed Central

    Dehara, Yuki; Hashiguchi, Yasuyuki; Matsubara, Kazumi; Yanai, Tokuma; Kubo, Masahito; Kumazawa, Yoshinori

    2012-01-01

    The olfactory receptor (OR) genes represent the largest multigene family in the genome of terrestrial vertebrates. Here, the high-throughput next-generation sequencing (NGS) approach was applied to characterization of OR gene repertoires in the green anole lizard Anolis carolinensis and the Japanese four-lined ratsnake Elaphe quadrivirgata. Tagged polymerase chain reaction (PCR) products amplified from either genomic DNA or cDNA of the two species were used for parallel pyrosequencing, assembling, and screening for errors in PCR and pyrosequencing. Starting from the lizard genomic DNA, we accurately identified 56 of 136 OR genes that were identified from its draft genome sequence. These recovered genes were broadly distributed in the phylogenetic tree of vertebrate OR genes without severe biases toward particular OR families. Ninety-six OR genes were identified from the ratsnake genomic DNA, implying that the snake has more OR gene loci than the anole lizard in response to an increased need for the acuity of olfaction. This view is supported by the estimated number of OR genes in the Burmese python's draft genome (∼280), although squamates may generally have fewer OR genes than terrestrial mammals and amphibians. The OR gene repertoire of the python seems unique in that many class I OR genes are retained. The NGS approach also allowed us to identify candidates of highly expressed and silent OR gene copies in the lizard's olfactory epithelium. The approach will facilitate efficient and parallel characterization of considerable unbiased proportions of multigene family members and their transcripts from nonmodel organisms. PMID:22511035

  4. Extraction of high-molecular-weight genomic DNA for long-read sequencing of single molecules.

    PubMed

    Mayjonade, Baptiste; Gouzy, Jérôme; Donnadieu, Cécile; Pouilly, Nicolas; Marande, William; Callot, Caroline; Langlade, Nicolas; Muños, Stéphane

    2016-10-01

    De novo sequencing of complex genomes is one of the main challenges for researchers seeking high-quality reference sequences. Many de novo assemblies are based on short reads, producing fragmented genome sequences. Third-generation sequencing, with read lengths >10 kb, will improve the assembly of complex genomes, but these techniques require high-molecular-weight genomic DNA (gDNA), and gDNA extraction protocols used for obtaining smaller fragments for short-read sequencing are not suitable for this purpose. Methods of preparing gDNA for bacterial artificial chromosome (BAC) libraries could be adapted, but these approaches are time-consuming, and commercial kits for these methods are expensive. Here, we present a protocol for rapid, inexpensive extraction of high-molecular-weight gDNA from bacteria, plants, and animals. Our technique was validated using sunflower leaf samples, producing a mean read length of 12.6 kb and a maximum read length of 80 kb.

  5. A protein constructed de novo enables cell growth by altering gene regulation

    PubMed Central

    Digianantonio, Katherine M.; Hecht, Michael H.

    2016-01-01

    Recent advances in protein design rely on rational and computational approaches to create novel sequences that fold and function. In contrast, natural systems selected functional proteins without any design a priori. In an attempt to mimic nature, we used large libraries of novel sequences and selected for functional proteins that rescue Escherichia coli cells in which a conditionally essential gene has been deleted. In this way, the de novo protein SynSerB3 was selected as a rescuer of cells in which serB, which encodes phosphoserine phosphatase, an enzyme essential for serine biosynthesis, was deleted. However, SynSerB3 does not rescue the deleted activity by catalyzing hydrolysis of phosphoserine. Instead, SynSerB3 up-regulates hisB, a gene encoding histidinol phosphate phosphatase. This endogenous E. coli phosphatase has promiscuous activity that, when overexpressed, compensates for the deletion of phosphoserine phosphatase. Thus, the de novo protein SynSerB3 rescues the deletion of serB by altering the natural regulation of the His operon. PMID:26884172

  6. Evidence of radius inflation in stars approaching the slow-rotator sequence

    NASA Astrophysics Data System (ADS)

    Lanzafame, A. C.; Spada, F.; Distefano, E.

    2017-01-01

    Context: Average stellar radii in open clusters can be estimated from rotation periods and projected rotational velocities under the assumption that the spin axis has a random orientation. These estimates are independent of distance, interstellar absorption, and models, but their validity can be limited by lacking data (truncation) or data that only represent upper or lower limits (censoring). Aims: We present a new statistical analysis method to estimate average stellar radii in the presence of censoring and truncation. Methods: We used theoretical distribution functions of the projected stellar radius R sin i to define a likelihood function in the presence of censoring and truncation. Average stellar radii in magnitude bins were then obtained by a maximum likelihood parametric estimation procedure. Results: This method is capable of recovering the average stellar radius within a few percent with as few as about ten measurements. Here we apply this for the first time to the dataset available for the Pleiades. We find an agreement better than ≈10 percent between the observed R vs. MK relationship and current standard stellar models for 1.2 ≥ M/M⊙ ≥ 0.85 with no evident bias. Evidence of a systematic deviation at the 2σ level is found for stars with 0.8 ≥ M/M⊙ ≥ 0.6 that approach the slow-rotator sequence. Fast rotators (P < 2 d) agree with standard models within 15 percent with no systematic deviations in the whole 1.2 ≳ M/M⊙ ≳ 0.5 range. Conclusions: The evidence of a possible radius inflation just below the lower mass limit of the slow-rotator sequence indicates a possible connection with the transition from the fast- to the slow-rotator sequence. Full Table 1 is only available at the CDS via anonymous ftp to http://cdsarc.u-strasbg.fr (http://130.79.128.5) or via http://cdsarc.u-strasbg.fr/viz-bin/qcat?J/A+A/597/A63
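
    The projected radius used above follows from the rotation period P and the projected velocity v sin i through R sin i = P (v sin i) / (2π). For randomly oriented spin axes the expected value of sin i is π/4, so a naive average radius is (4/π) times the mean R sin i. The sketch below shows only this simple relation; it is not the maximum likelihood procedure of the paper and it ignores censoring and truncation. The function names and unit choices (days, km/s, solar radii) are assumptions made for illustration.

```python
import math

R_SUN_KM = 695_700.0        # nominal solar radius in km
SECONDS_PER_DAY = 86_400.0


def projected_radius(period_days, vsini_kms):
    """R sin i in solar radii from rotation period (days) and v sin i (km/s)."""
    projected_circumference_km = period_days * SECONDS_PER_DAY * vsini_kms
    return projected_circumference_km / (2.0 * math.pi * R_SUN_KM)


def naive_mean_radius(projected_radii):
    """Mean radius for random spin-axis orientation, using <sin i> = pi/4.

    No treatment of upper limits or missing data is attempted here.
    """
    mean_projected = sum(projected_radii) / len(projected_radii)
    return mean_projected * 4.0 / math.pi


# A star with P = 5 d and v sin i = 8 km/s (made-up values).
print(round(projected_radius(5.0, 8.0), 3))
```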

  7. Taxonomic Assessment of Rumen Microbiota Using Total RNA and Targeted Amplicon Sequencing Approaches

    PubMed Central

    Li, Fuyong; Henderson, Gemma; Sun, Xu; Cox, Faith; Janssen, Peter H.; Guan, Le Luo

    2016-01-01

    Taxonomic characterization of active gastrointestinal microbiota is essential to detect shifts in microbial communities and functions under various conditions. This study aimed to identify and quantify potentially active rumen microbiota using total RNA sequencing and to compare the outcomes of this approach with the widely used targeted RNA/DNA amplicon sequencing technique. Total RNA isolated from rumen digesta samples from five beef steers was subjected to Illumina paired-end sequencing (RNA-seq), and bacterial and archaeal amplicons of partial 16S rRNA/rDNA were subjected to 454 pyrosequencing (RNA/DNA Amplicon-seq). Taxonomic assessments of the RNA-seq, RNA Amplicon-seq, and DNA Amplicon-seq datasets were performed using a pipeline developed in house. The detected major microbial phylotypes were common among the three datasets, with seven bacterial phyla, fifteen bacterial families, and five archaeal taxa commonly identified across all datasets. There were also unique microbial taxa detected in each dataset. Elusimicrobia and Verrucomicrobia phyla; Desulfovibrionaceae, Elusimicrobiaceae, and Sphaerochaetaceae families; and Methanobrevibacter woesei were only detected in the RNA-Seq and RNA Amplicon-seq datasets, whereas Streptococcaceae was only detected in the DNA Amplicon-seq dataset. In addition, the relative abundances of four bacterial phyla, eight bacterial families and one archaeal taxon were different among the three datasets. This is the first study to compare the outcomes of rumen microbiota profiling between RNA-seq and RNA/DNA Amplicon-seq datasets. Our results illustrate the differences between these methods in characterizing microbiota both qualitatively and quantitatively for the same sample, and so caution must be exercised when comparing data. PMID:27446027

  8. Congenital Corneal Endothelial Dystrophies Resulting from Novel De Novo Mutations

    PubMed Central

    Cunnusamy, Khrishen; Bowman, Charles B.; Beebe, Walter; Gong, Xin; Hogan, R. Nick; Mootha, V. Vinod

    2015-01-01

    Purpose: To describe two cases of congenital corneal endothelial edema resulting from novel de novo mutations. Methods: The Case A patient was a 15-month-old Caucasian infant and the Case B patient was a 3-year-old Hispanic child, both presenting with bilateral cloudy corneas since birth. Clinicopathological findings are presented. DNA samples were screened for mutations in candidate genes by Sanger sequencing. Results: Slit-lamp examination of the Case A patient revealed stromal edema and haze. Histology of the keratoplasty button showed stromal thickening with loss of endothelium and a thin Descemet’s membrane. Sanger sequencing established the diagnosis of congenital hereditary endothelial dystrophy (CHED) by detection of a compound heterozygous mutation in SLC4A11. The proband displayed a novel de novo frameshift mutation in one SLC4A11 allele, p.(Pro817Argfs*32), in conjunction with a maternally inherited missense mutation in SLC4A11, p.(Arg869His). The Case B patient similarly presented with stromal edema and stromal haze. Histopathological analysis revealed a spongy epithelium, focal discontinuities in Bowman’s layer, stromal thickening with areas of compacted posterior stroma, variable thickness of Descemet’s membrane, and regional multilayered endothelium. Sanger sequencing found a novel de novo nonsense mutation in the first exon of ZEB1, p.(Cys7*). Conclusions: To our knowledge, we present the earliest clinical presentation of posterior polymorphous corneal dystrophy resulting from a de novo mutation in ZEB1. Additionally, we present a CHED case with a thin Descemet’s membrane with a novel compound heterozygous SLC4A11 mutation. In the absence of a family history or consanguinity, de novo mutations may result in congenital corneal endothelial dystrophies. PMID:26619383

  9. A Restricted Repertoire of De Novo Mutations in ITPR1 Cause Gillespie Syndrome with Evidence for Dominant-Negative Effect

    PubMed Central

    McEntagart, Meriel; Williamson, Kathleen A.; Rainger, Jacqueline K.; Wheeler, Ann; Seawright, Anne; De Baere, Elfride; Verdin, Hannah; Bergendahl, L. Therese; Quigley, Alan; Rainger, Joe; Dixit, Abhijit; Sarkar, Ajoy; López Laso, Eduardo; Sanchez-Carpintero, Rocio; Barrio, Jesus; Bitoun, Pierre; Prescott, Trine; Riise, Ruth; McKee, Shane; Cook, Jackie; McKie, Lisa; Ceulemans, Berten; Meire, Françoise; Temple, I. Karen; Prieur, Fabienne; Williams, Jonathan; Clouston, Penny; Németh, Andrea H.; Banka, Siddharth; Bengani, Hemant; Handley, Mark; Freyer, Elisabeth; Ross, Allyson; van Heyningen, Veronica; Marsh, Joseph A.; Elmslie, Frances; FitzPatrick, David R.

    2016-01-01

    Gillespie syndrome (GS) is characterized by bilateral iris hypoplasia, congenital hypotonia, non-progressive ataxia, and progressive cerebellar atrophy. Trio-based exome sequencing identified de novo mutations in ITPR1 in three unrelated individuals with GS recruited to the Deciphering Developmental Disorders study. Whole-exome or targeted sequence analysis identified plausible disease-causing ITPR1 mutations in 10/10 additional GS-affected individuals. These ultra-rare protein-altering variants affected only three residues in ITPR1: Glu2094 missense (one de novo, one co-segregating), Gly2539 missense (five de novo, one inheritance uncertain), and Lys2596 in-frame deletion (four de novo). No clinical or radiological differences were evident between individuals with different mutations. ITPR1 encodes an inositol 1,4,5-triphosphate-responsive calcium channel. The homo-tetrameric structure has been solved by cryoelectron microscopy. Using estimations of the degree of structural change induced by known recessive- and dominant-negative mutations in other disease-associated multimeric channels, we developed a generalizable computational approach to indicate the likely mutational mechanism. This analysis supports a dominant-negative mechanism for GS variants in ITPR1. In GS-derived lymphoblastoid cell lines (LCLs), the proportion of ITPR1-positive cells using immunofluorescence was significantly higher in mutant than control LCLs, consistent with an abnormality of nuclear calcium signaling feedback control. Super-resolution imaging supports the existence of an ITPR1-lined nucleoplasmic reticulum. Mice with Itpr1 heterozygous null mutations showed no major iris defects. Purkinje cells of the cerebellum appear to be the most sensitive to impaired ITPR1 function in humans. Iris hypoplasia is likely to result from either complete loss of ITPR1 activity or structure-specific disruption of multimeric interactions. PMID:27108798

  10. An integrated approach for analyzing clinical genomic variant data from next-generation sequencing.

    PubMed

    Crowgey, Erin L; Stabley, Deborah L; Chen, Chuming; Huang, Hongzhan; Robbins, Katherine M; Polson, Shawn W; Sol-Church, Katia; Wu, Cathy H

    2015-04-01

    Next-generation sequencing (NGS) technologies provide the potential for developing high-throughput and low-cost platforms for clinical diagnostics. A limiting factor to clinical applications of genomic NGS is downstream bioinformatics analysis for data interpretation. We have developed an integrated approach for end-to-end clinical NGS data analysis from variant detection to functional profiling. Robust bioinformatics pipelines were implemented for genome alignment, single nucleotide polymorphism (SNP), small insertion/deletion (InDel), and copy number variation (CNV) detection of whole exome sequencing (WES) data from the Illumina platform. Quality-control metrics were analyzed at each step of the pipeline by use of a validated training dataset to ensure data integrity for clinical applications. We annotate the variants with data regarding the disease population and variant impact. Custom algorithms were developed to filter variants based on criteria, such as quality of variant, inheritance pattern, and impact of variant on protein function. The developed clinical variant pipeline links the identified rare variants to Integrated Genome Viewer for visualization in a genomic context and to the Protein Information Resource's iProXpress for rich protein and disease information. With the application of our system of annotations, prioritizations, inheritance filters, and functional profiling and analysis, we have created a unique methodology for downstream variant filtering that empowers clinicians and researchers to interpret more effectively the relevance of genomic alterations within a rare genetic disease.
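
    The abstract above describes filtering variants on criteria such as variant quality, inheritance pattern, and impact on protein function, without giving thresholds or data formats. The following is a rough sketch of what such a filter could look like for one inheritance pattern (de novo in a parent-child trio). Every field name, threshold, and impact label here is a hypothetical placeholder, not part of the published pipeline.

```python
from dataclasses import dataclass


@dataclass
class Variant:
    """Hypothetical, simplified record for one annotated variant call."""
    gene: str
    quality: float               # caller-reported quality score
    population_frequency: float  # allele frequency in a reference population
    impact: str                  # e.g. "HIGH", "MODERATE", "LOW"
    in_mother: bool
    in_father: bool


def candidate_de_novo(variants, min_quality=30.0, max_frequency=0.01,
                      impacts=("HIGH", "MODERATE")):
    """Keep rare, well-supported, protein-affecting variants absent from both parents."""
    return [
        v for v in variants
        if v.quality >= min_quality
        and v.population_frequency <= max_frequency
        and v.impact in impacts
        and not v.in_mother
        and not v.in_father
    ]


# Made-up calls: only the first passes every filter.
calls = [
    Variant("GENE_A", 55.0, 0.0001, "HIGH", False, False),
    Variant("GENE_B", 12.0, 0.0001, "HIGH", False, False),   # low quality
    Variant("GENE_C", 60.0, 0.2000, "MODERATE", False, False),  # common
    Variant("GENE_D", 60.0, 0.0001, "HIGH", True, False),    # inherited
]
print([v.gene for v in candidate_de_novo(calls)])
```

    Keeping each criterion as a separate condition makes it straightforward to swap in a different inheritance model (for example recessive) without touching the quality or impact checks.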

  11. A machine-learning approach for predicting palmitoylation sites from integrated sequence-based features.

    PubMed

    Li, Liqi; Luo, Qifa; Xiao, Weidong; Li, Jinhui; Zhou, Shiwen; Li, Yongsheng; Zheng, Xiaoqi; Yang, Hua

    2017-02-01

    Palmitoylation is the covalent attachment of lipids to amino acid residues in proteins. As an important form of protein posttranslational modification, it increases the hydrophobicity of proteins, which contributes to protein transportation, organelle localization, and function, and it therefore plays an important role in a variety of cell biological processes. Identification of palmitoylation sites is necessary for understanding protein-protein interaction, protein stability, and activity. Since conventional experimental techniques to determine palmitoylation sites in proteins are both labor intensive and costly, a fast and accurate computational approach to predict palmitoylation sites from protein sequences is urgently needed. In this study, a support vector machine (SVM)-based method was proposed through integrating PSI-BLAST profile, physicochemical properties, [Formula: see text]-mer amino acid compositions (AACs), and [Formula: see text]-mer pseudo AACs into the principal feature vector. A recursive feature selection scheme was subsequently implemented to single out the most discriminative features. Finally, an SVM method was implemented to predict palmitoylation sites in proteins based on the optimal features. The proposed method achieved an accuracy of 99.41% and a Matthews Correlation Coefficient of 0.9773 for a benchmark dataset. The result indicates the efficiency and accuracy of our method in prediction of palmitoylation sites based on protein sequences.
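
    The two performance figures quoted above, accuracy and the Matthews Correlation Coefficient (MCC), are both computed from the counts of a binary confusion matrix. A short sketch of the standard formulas follows; the counts in the example are invented and do not reproduce the benchmark result.

```python
import math


def accuracy(tp, tn, fp, fn):
    """Fraction of all predictions that are correct."""
    return (tp + tn) / (tp + tn + fp + fn)


def matthews_corrcoef(tp, tn, fp, fn):
    """MCC in [-1, 1]; returns 0.0 when any marginal total is zero."""
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        return 0.0
    return (tp * tn - fp * fn) / denominator


# Invented counts: 45 true positives, 45 true negatives, 5 errors of each kind.
print(accuracy(45, 45, 5, 5), matthews_corrcoef(45, 45, 5, 5))
```

    Unlike accuracy, MCC stays informative when positive and negative sites are heavily imbalanced, which is why the two are often reported together.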

  12. Affective Visual Stimuli: Characterization of the Picture Sequences Impacts by Means of Nonlinear Approaches

    PubMed Central

    Goshvarpour, Ateke; Abbasi, Ataollah; Goshvarpour, Atefeh

    2015-01-01

    Introduction: The main objective of the present study was to investigate the effect of a preceding pictorial stimulus on the emotional autonomic responses to the subsequent one. Methods: To this end, physiological signals, including Electrocardiogram (ECG), Pulse Rate (PR), and Galvanic Skin Response (GSR), were collected. As these signals have a random and chaotic nature, their nonlinear dynamics were evaluated with the methods of nonlinear system theory. Considering the hypothesis that emotional responses are usually associated with previous experiences of a subject, the subjective ratings of 4 emotional states were also evaluated. Four nonlinear characteristics (including Detrended Fluctuation Analysis (DFA)-based parameters, Lyapunov exponent, and approximate entropy) were implemented. Nine standard features (including mean, standard deviation, minimum, maximum, median, mode, and the second, third, and fourth moments) were also extracted. Results: To evaluate the ability of the features to discriminate different types of emotions, several classification approaches were appraised; of these, the Probabilistic Neural Network (PNN) led to the best classification rate of 100%. The results show that, considering the emotional sequences, GSR is the best candidate for the representation of the physiological changes. Discussion: Lower discrimination was attained when the sequence occurred in the diagonal line of valence-arousal coordinates (for instance, positive valence and positive arousal versus negative valence and negative arousal). By employing self-assessment ranks, no obvious improvement was achieved. PMID:26649159

  13. Predicting sequences and structures of MHC-binding peptides: a computational combinatorial approach

    NASA Astrophysics Data System (ADS)

    Zeng, Jun; Treutlein, Herbert R.; Rudy, George B.

    2001-06-01

    Peptides bound to MHC molecules on the surface of cells convey critical information about the cellular milieu to immune system T cells. Predicting which peptides can bind an MHC molecule, and understanding their modes of binding, are important in order to design better diagnostic and therapeutic agents for infectious and autoimmune diseases. Due to the difficulty of obtaining sufficient experimental binding data for each human MHC molecule, computational modeling of MHC peptide-binding properties is necessary. This paper describes a computational combinatorial design approach to the prediction of peptides that bind an MHC molecule of known X-ray crystallographic or NMR-determined structure. The procedure uses chemical fragments as models for amino acid residues and produces a set of sequences for peptides predicted to bind in the MHC peptide-binding groove. The probabilities for specific amino acids occurring at each position of the peptide are calculated based on these sequences, and these probabilities show a good agreement with amino acid distributions derived from a MHC-binding peptide database. The method also enables prediction of the three-dimensional structure of MHC-peptide complexes. Docking, linking, and optimization procedures were performed with the XPLOR program [1].
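
    The abstract above describes calculating, from a set of predicted binding sequences, the probability of each amino acid occurring at each peptide position. A minimal sketch of that counting step is shown below, assuming equal-length peptides and simple relative frequencies with no pseudocounts or weighting. The function name and the example peptides are made up for illustration.

```python
from collections import Counter


def position_probabilities(peptides):
    """Per-position amino acid frequencies for equal-length peptide strings.

    Returns one dict per position mapping residue letter -> relative frequency.
    """
    if not peptides:
        return []
    length = len(peptides[0])
    if any(len(p) != length for p in peptides):
        raise ValueError("all peptides must have the same length")
    n = len(peptides)
    # zip(*peptides) walks the alignment column by column.
    return [
        {residue: count / n for residue, count in Counter(column).items()}
        for column in zip(*peptides)
    ]


# Four invented 3-mers.
for position, probs in enumerate(position_probabilities(["ALK", "AVK", "GLK", "ALR"]), 1):
    print(position, probs)
```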

  14. Stratification approach for 3-D euclidean reconstruction of nonrigid objects from uncalibrated image sequences.

    PubMed

    Wang, Guanghui; Wu, Q M Jonathan

    2008-02-01

    This paper addresses the problem of 3-D reconstruction of nonrigid objects from uncalibrated image sequences. Under the assumptions of an affine camera and of a nonrigid object composed of a rigid part and a deformation part, we propose a stratification approach to recover the structure of nonrigid objects by first reconstructing the structure in affine space and then upgrading it to the Euclidean space. The novelty and main features of the method lie in several aspects. First, we propose a deformation weight constraint to the problem and prove the invariability between the recovered structure and shape bases under this constraint. The constraint was not observed by previous studies. Second, we propose a constrained power factorization algorithm to recover the deformation structure in affine space. The algorithm overcomes some limitations of a previous singular-value-decomposition-based method. It can even work with missing data in the tracking matrix. Third, we propose to separate the rigid features from the deformation ones in 3-D affine space, which makes the detection more accurate and robust. The stratification matrix is estimated from the rigid features, which may relax the influence of large tracking errors in the deformation part. Extensive experiments on synthetic data and real sequences validate the proposed method and show improvements over existing solutions.

  15. Next-generation sequencing approach to epigenetic-based tissue source attribution.

    PubMed

    Bartling, Craig M; Hester, Mark E; Bartz, Julianne; Heizer, Esley; Faith, Seth A

    2014-11-01

    The ability to determine the tissue source of biological materials from evidence samples can be highly informative for interpreting forensic data. In this study, a previously published CE-based method to probe locus-specific DNA methylation was modified to accommodate detection using next-generation sequencing (NGS) to perform tissue source attribution. DNA samples (1 ng) from each of four different tissue types were digested with the methylation-sensitive restriction endonuclease HhaI, and PCR was used to amplify an optimized subset of ten methylated loci, including positive and negative control loci. The products were prepared as NGS libraries, pooled in a multiplex assay with sample-specific barcodes, sequenced with an Illumina MiSeq, and analyzed using a k-Nearest Neighbor algorithm. With this initial effort, a concordance rate of 15/16 was demonstrated from samples of varying types: semen, saliva, skin epidermis, and blood. This method was also designed to be compatible with the workflows published to date for NGS of STRs. Thus, the methylation approach described here is highly accurate and, upon further validation and testing, may potentially be used in practice as a confirmatory test in conjunction with other NGS protocols used in forensic laboratories.
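
    The k-Nearest Neighbor classification step can be sketched as follows. The abstract does not state the distance metric, the value of k, or how the per-locus read counts were encoded, so the Euclidean distance, k = 3, and the three-locus methylation fractions below are assumptions made purely for illustration.

```python
import math
from collections import Counter


def knn_predict(training, query, k=3):
    """Label a query profile by majority vote among its k nearest neighbors.

    training: list of (profile, tissue_label) pairs, where each profile is a
    tuple of per-locus values (hypothetical encoding).
    """
    nearest = sorted(training, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Made-up reference profiles (fraction of reads surviving digestion per locus).
reference = [
    ((0.9, 0.1, 0.2), "blood"),
    ((0.8, 0.2, 0.1), "blood"),
    ((0.2, 0.9, 0.1), "saliva"),
    ((0.1, 0.8, 0.2), "saliva"),
    ((0.1, 0.2, 0.9), "semen"),
    ((0.2, 0.1, 0.8), "semen"),
]

print(knn_predict(reference, (0.85, 0.15, 0.15)))
```

    In a real assay the reference set would come from samples of known tissue origin, and k and the metric would be chosen during validation.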

  16. An Integrated Approach for Analyzing Clinical Genomic Variant Data from Next-Generation Sequencing

    PubMed Central

    Stabley, Deborah L.; Chen, Chuming; Huang, Hongzhan; Robbins, Katherine M.; Polson, Shawn W.; Sol-Church, Katia; Wu, Cathy H.

    2015-01-01

    Next-generation sequencing (NGS) technologies provide the potential for developing high-throughput and low-cost platforms for clinical diagnostics. A limiting factor to clinical applications of genomic NGS is downstream bioinformatics analysis for data interpretation. We have developed an integrated approach for end-to-end clinical NGS data analysis from variant detection to functional profiling. Robust bioinformatics pipelines were implemented for genome alignment, single nucleotide polymorphism (SNP), small insertion/deletion (InDel), and copy number variation (CNV) detection of whole exome sequencing (WES) data from the Illumina platform. Quality-control metrics were analyzed at each step of the pipeline by use of a validated training dataset to ensure data integrity for clinical applications. We annotate the variants with data regarding the disease population and variant impact. Custom algorithms were developed to filter variants based on criteria, such as quality of variant, inheritance pattern, and impact of variant on protein function. The developed clinical variant pipeline links the identified rare variants to Integrated Genome Viewer for visualization in a genomic context and to the Protein Information Resource’s iProXpress for rich protein and disease information. With the application of our system of annotations, prioritizations, inheritance filters, and functional profiling and analysis, we have created a unique methodology for downstream variant filtering that empowers clinicians and researchers to interpret more effectively the relevance of genomic alterations within a rare genetic disease. PMID:25649353

  17. De Novo Reconstruction of Consensus Master Genomes of Plant RNA and DNA Viruses from siRNAs

    PubMed Central

    Seguin, Jonathan; Rajeswaran, Rajendran; Malpica-López, Nachelli; Martin, Robert R.; Kasschau, Kristin; Dolja, Valerian V.; Otten, Patricia; Farinelli, Laurent; Pooggin, Mikhail M.

    2014-01-01

    Virus-infected plants accumulate abundant, 21–24 nucleotide viral siRNAs which are generated by the evolutionary conserved RNA interference (RNAi) machinery that regulates gene expression and defends against invasive nucleic acids. Here we show that, similar to RNA viruses, the entire genome sequences of DNA viruses are densely covered with siRNAs in both sense and antisense orientations. This implies pervasive transcription of both coding and non-coding viral DNA in the nucleus, which generates double-stranded RNA precursors of viral siRNAs. Consistent with our finding and hypothesis, we demonstrate that the complete genomes of DNA viruses from Caulimoviridae and Geminiviridae families can be reconstructed by deep sequencing and de novo assembly of viral siRNAs using bioinformatics tools. Furthermore, we prove that this ‘siRNA omics’ approach can be used for reliable identification of the consensus master genome and its microvariants in viral quasispecies. Finally, we utilized this approach to reconstruct an emerging DNA virus and two viroids associated with economically-important red blotch disease of grapevine, and to rapidly generate a biologically-active clone representing the wild type master genome of Oilseed rape mosaic virus. Our findings show that deep siRNA sequencing allows for de novo reconstruction of any DNA or RNA virus genome and its microvariants, making it suitable for universal characterization of evolving viral quasispecies as well as for studying the mechanisms of siRNA biogenesis and RNAi-based antiviral defense. PMID:24523907

  18. Effective de novo assembly of fish genome using haploid larvae.

    PubMed

    Iwasaki, Yuki; Nishiki, Issei; Nakamura, Yoji; Yasuike, Motoshige; Kai, Wataru; Nomura, Kazuharu; Yoshida, Kazunori; Nomura, Yousuke; Fujiwara, Atushi; Kobayashi, Takanori; Ototake, Mitsuru

    2016-02-01

    Recent improvements in next-generation sequencing technology have made it possible to perform whole-genome sequencing even on non-model eukaryotic species with no available reference genomes. However, de novo assembly of diploid genomes is still a big challenge because of allelic variation. The aim of this study was to determine the feasibility of utilizing the genome of haploid fish larvae for de novo assembly of whole-genome sequences. We compared the efficiency of assembly using the haploid genome of yellowtail (Seriola quinqueradiata) with that using the diploid genome obtained from the dam. De novo assembly from the haploid and the diploid sequence reads (100 million reads per dataset) generated by the Ion Proton sequencer (200 bp) was done under two different assembly algorithms, namely overlap-layout-consensus (OLC) and de Bruijn graph (DBG). This revealed that the assembly of the haploid genome significantly reduced (approximately 22% for OLC, 9% for DBG) the total number of contigs (with longer average and N50 contig lengths) when compared to the diploid genome assembly. The haploid assembly also improved the quality of the scaffolds by reducing the number of regions with unassigned nucleotides (Ns) (total length of Ns: 45,331,916 bp for haploids and 67,724,360 bp for diploids) in OLC-based assemblies. It appears clear that the haploid genome assembly is better because the allelic variation in the diploid genome disrupts the extension of contigs during the assembly process. Our results indicate that utilizing the genome of haploid larvae leads to a significant improvement in the de novo assembly process, thus providing a novel strategy for the construction of reference genomes from non-model diploid organisms such as fish.
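
    The N50 contig length used above as an assembly metric is the length L such that contigs of length L or longer together cover at least half of the total assembly. A minimal sketch of the calculation follows; the contig lengths are toy values chosen for illustration, not data from the study.

```python
def n50(contig_lengths):
    """Return the N50 of a collection of contig lengths."""
    if not contig_lengths:
        raise ValueError("need at least one contig length")
    ordered = sorted(contig_lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    # Walk from the longest contig down until half the assembly is covered.
    for length in ordered:
        running += length
        if running >= half_total:
            return length


# Two toy assemblies with the same total length (2000 bp).
fragmented = [100, 100, 200, 200, 300, 300, 400, 400]
contiguous = [200, 400, 600, 800]

print(n50(fragmented), n50(contiguous))
```

    With the same total length, the assembly made of fewer, longer contigs has the higher N50, which is the direction of improvement the abstract reports for the haploid assembly.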

  19. In planta Identification of Putative Pathogenicity Factors from the Chickpea Pathogen Ascochyta rabiei by De novo Transcriptome Sequencing Using RNA-Seq and Massive Analysis of cDNA Ends

    PubMed Central

    Fondevilla, Sara; Krezdorn, Nicolas; Rotter, Björn; Kahl, Guenter; Winter, Peter

    2015-01-01

    The most important foliar diseases in legumes worldwide are ascochyta blights. Up to now, in the Ascochyta-legume pathosystem most studies focused on the identification of resistance genes in the host, while very little is known about the pathogenicity factors of the fungal pathogen. Moreover, available data were often obtained from fungi growing under artificial conditions. Therefore, in this study we aimed at the identification of the pathogenicity factors of Ascochyta rabiei, causing ascochyta blight in chickpea. To identify potential fungal pathogenicity factors, we employed RNA-seq and Massive Analysis of cDNA Ends (MACE) to produce comprehensive expression profiles of A. rabiei genes isolated either from the fungus growing in absence of its host or from fungi infecting chickpea leaves. We further provide a comprehensive de novo assembly of the A. rabiei transcriptome comprising 22,725 contigs with an average length of 1178 bp. Since pathogenicity factors are usually secreted, we predicted the A. rabiei secretome, yielding 550 putatively secreted proteins. MACE identified 596 transcripts that were up-regulated during infection. An analysis of these genes identified a collection of candidate pathogenicity factors and unraveled the pathogen's strategy for infecting its host. PMID:26648917

  20. De Novo Transcriptome Assembly in Polyploid Species.

    PubMed

    Gutierrez-Gonzalez, Juan J; Garvin, David F

    2017-01-01

    In the absence of a reference genome, the ultimate goal of a de novo transcriptome assembly is to accurately and comprehensively reconstruct the set of messenger RNA transcripts represented in the sample. Non-reference assembly of the transcriptome of polyploid species poses a particular challenge because of the presence of homeologs that are difficult to disentangle at the sequence level. This is especially true for hexaploid oats, which have three highly similar subgenomes, two of which are thought to be nearly identical. Under these circumstances, most software packages and established pipelines encounter difficulties in rendering an accurate transcriptome because they are typically developed, refined, and tested for diploid organisms. We present a protocol for transcriptome assembly in oats that can be extended both to other polyploids and species with highly duplicated genomes.

  1. Quasispecies structure, cornerstone of hepatitis B virus infection: mass sequencing approach.

    PubMed

    Rodriguez-Frias, Francisco; Buti, Maria; Tabernero, David; Homs, Maria

    2013-11-07

    Hepatitis B virus (HBV) is a DNA virus with complex replication, and high replication and mutation rates, leading to a heterogeneous viral population. The population is comprised of genomes that are closely related, but not identical; hence, HBV is considered a viral quasispecies. Quasispecies variability may be somewhat limited by the high degree of overlapping between the HBV coding regions, which is especially important in the P and S gene overlapping regions, but is less significant in the X and preCore/Core genes. Despite this restriction, several clinically and pathologically relevant variants have been characterized along the viral genome. Next-generation sequencing (NGS) approaches enable high-throughput analysis of thousands of clonally amplified regions and are powerful tools for characterizing genetic diversity in viral strains. In the present review, we update the information regarding HBV variability and present a summary of the various NGS approaches available for research in this virus. In addition, we provide an analysis of the clinical implications of HBV variants and their study by NGS.

  2. Direct Chloroplast Sequencing: Comparison of Sequencing Platforms and Analysis Tools for Whole Chloroplast Barcoding

    PubMed Central

    Brozynska, Marta; Furtado, Agnelo; Henry, Robert James

    2014-01-01

    Direct sequencing of total plant DNA using next generation sequencing technologies generates a whole chloroplast genome sequence that has the potential to provide a barcode for use in plant and food identification. Advances in DNA sequencing platforms may make this an attractive approach for routine plant identification. The HiSeq (Illumina) and Ion Torrent (Life Technology) sequencing platforms were used to sequence total DNA from rice to identify polymorphisms in the whole chloroplast genome sequence of a wild rice plant relative to cultivated rice (cv. Nipponbare). Consensus chloroplast sequences were produced by mapping sequence reads to the reference rice chloroplast genome or by de novo assembly and mapping of the resulting contigs to the reference sequence. A total of 122 polymorphisms (SNPs and indels) between the wild and cultivated rice chloroplasts were predicted by these different sequencing and analysis methods. Of these, a total of 102 polymorphisms including 90 SNPs were predicted by both platforms. Indels were more variable with different sequencing methods, with almost all discrepancies found in homopolymers. The Ion Torrent platform gave no apparent false SNP but was less reliable for indels. The methods should be suitable for routine barcoding using appropriate combinations of sequencing platform and data analysis. PMID:25329378

  3. De Novo Kidney Regeneration with Stem Cells

    PubMed Central

    Yokote, Shinya; Yamanaka, Shuichiro; Yokoo, Takashi

    2012-01-01

    Recent studies have reported on techniques to mobilize and activate endogenous stem cells in injured kidneys or to introduce exogenous stem cells for tissue repair. Despite many recent advances in renal regenerative therapy, chronic kidney disease (CKD) remains a major cause of morbidity and mortality, and the number of CKD patients has been increasing. When the sophisticated structure of the kidneys is totally disrupted by end stage renal disease (ESRD), traditional stem cell-based therapy is unable to completely regenerate the damaged tissue. This suggests that whole organ regeneration may be a promising therapeutic approach for patients with uncured CKD. We summarize here the potential of stem-cell-based therapy for injured tissue repair and de novo whole kidney regeneration. In addition, we describe the hurdles that must be overcome and possible applications of this approach in kidney regeneration. PMID:23251079

  4. Characterization of rainbow trout gonad, brain and gill deep cDNA repertoires using a Roche 454-Titanium sequencing approach.

    PubMed

    Le Cam, Aurélie; Bobe, Julien; Bouchez, Olivier; Cabau, Cédric; Kah, Olivier; Klopp, Christophe; Lareyre, Jean-Jacques; Le Guen, Isabelle; Lluch, Jérôme; Montfort, Jérôme; Moreews, Francois; Nicol, Barbara; Prunet, Patrick; Rescan, Pierre-Yves; Servili, Arianna; Guiguen, Yann

    2012-05-25

    Rainbow trout, Oncorhynchus mykiss, is an important aquaculture species worldwide and, in addition to being of commercial interest, it is also a research model organism of considerable scientific importance. Because this species lacks a whole genome sequence, transcriptomic analyses have often been hindered. Using next-generation sequencing (NGS) technologies, we sought to fill these informational gaps. Here, using Roche 454-Titanium technology, we provide new tissue-specific cDNA repertoires from several rainbow trout tissues. Non-normalized cDNA libraries were constructed from testis, ovary, brain and gill rainbow trout tissue samples, and these different libraries were sequenced in 10 separate half-runs of 454-Titanium. Overall, we produced a total of 3 million quality sequences with an average size of 328 bp, representing more than 1 Gb of expressed sequence information. These sequences have been combined with all publicly available rainbow trout sequences, resulting in a total of 242,187 clusters of putative transcript groups and 22,373 singletons. To identify the predominantly expressed genes in different tissues of interest, we developed a Digital Differential Display (DDD) approach. This approach allowed us to characterize the genes that are predominantly expressed within each tissue of interest. Of these genes, some were already known to be tissue-specific, thereby validating our approach. Many others, however, were novel candidates, demonstrating the usefulness of our strategy and of such tissue-specific resources. This new sequence information, acquired using NGS 454-Titanium technology, deeply enriched our current knowledge of the expressed genes in rainbow trout through the identification of an increased number of tissue-specific sequences. This identification allowed a precise cDNA tissue repertoire to be characterized in several important rainbow trout tissues. The rainbow trout contig browser can be accessed at the following

  5. Simple quantitative PCR approach to reveal naturally occurring and mutation-induced repetitive sequence variation on the Drosophila Y chromosome.

    PubMed

    Aldrich, John C; Maggert, Keith A

    2014-01-01

    Heterochromatin is a significant component of the human genome and the genomes of most model organisms. Although heterochromatin is thought to be largely non-coding, it is clear that it plays an important role in chromosome structure and gene regulation. Despite a growing awareness of its functional significance, the repetitive sequences underlying some heterochromatin remain relatively uncharacterized. We have developed a real-time quantitative PCR-based method for quantifying simple repetitive satellite sequences and have used this technique to characterize the heterochromatic Y chromosome of Drosophila melanogaster. In this report, we validate the approach, identify previously unknown satellite sequence copy number polymorphisms in Y chromosomes from different geographic sources, and show that a defect in heterochromatin formation can induce similar copy number polymorphisms in a laboratory strain. These findings provide a simple method to investigate the dynamic nature of repetitive sequences and characterize conditions which might give rise to long-lasting alterations in DNA sequence.

  6. Simple Quantitative PCR Approach to Reveal Naturally Occurring and Mutation-Induced Repetitive Sequence Variation on the Drosophila Y Chromosome

    PubMed Central

    Aldrich, John C.; Maggert, Keith A.

    2014-01-01

    Heterochromatin is a significant component of the human genome and the genomes of most model organisms. Although heterochromatin is thought to be largely non-coding, it is clear that it plays an important role in chromosome structure and gene regulation. Despite a growing awareness of its functional significance, the repetitive sequences underlying some heterochromatin remain relatively uncharacterized. We have developed a real-time quantitative PCR-based method for quantifying simple repetitive satellite sequences and have used this technique to characterize the heterochromatic Y chromosome of Drosophila melanogaster. In this report, we validate the approach, identify previously unknown satellite sequence copy number polymorphisms in Y chromosomes from different geographic sources, and show that a defect in heterochromatin formation can induce similar copy number polymorphisms in a laboratory strain. These findings provide a simple method to investigate the dynamic nature of repetitive sequences and characterize conditions which might give rise to long-lasting alterations in DNA sequence. PMID:25285439

  7. Whole-genome sequencing in newborn screening? A statement on the continued importance of targeted approaches in newborn screening programmes

    PubMed Central

    Howard, Heidi Carmen; Knoppers, Bartha Maria; Cornel, Martina C; Wright Clayton, Ellen; Sénécal, Karine; Borry, Pascal

    2015-01-01

    The advent and refinement of sequencing technologies has resulted in a decrease in both the cost and time needed to generate data on the entire sequence of the human genome. This has increased the accessibility of using whole-genome sequencing and whole-exome sequencing approaches for analysis in both the research and clinical contexts. The expectation is that more services based on these and other high-throughput technologies will become available to patients and the wider population. Some authors predict that sequencing will be performed once in a lifetime, namely, shortly after birth. The Public and Professional Policy Committee of the European Society of Human Genetics, the Human Genome Organisation Committee on Ethics, Law and Society, the PHG Foundation and the P3G International Paediatric Platform address herein the important issues and challenges surrounding the potential use of sequencing technologies in publicly funded newborn screening (NBS) programmes. This statement presents the relevant issues and culminates in a set of recommendations to help inform and guide scientists and clinicians, as well as policy makers regarding the necessary considerations for the use of genome sequencing technologies and approaches in NBS programmes. The primary objective of NBS should be the targeted analysis and identification of gene variants conferring a high risk of preventable or treatable conditions, for which treatment has to start in the newborn period or in early childhood. PMID:25626707

  8. De novo assembly of the transcriptome of the non-model plant Streptocarpus rexii employing a novel heuristic to recover locus-specific transcript clusters.

    PubMed

    Chiara, Matteo; Horner, David S; Spada, Alberto

    2013-01-01

    De novo transcriptome characterization from Next Generation Sequencing data has become an important approach in the study of non-model plants. Despite notable advances in the assembly of short reads, the clustering of transcripts into unigene-like (locus-specific) clusters remains a somewhat neglected subject. Indeed, closely related paralogous transcripts are often merged into single clusters by current approaches. Here, a novel heuristic method for locus-specific clustering is compared to that implemented in the de novo assembler Oases, using the same initial transcript collections, derived from Arabidopsis thaliana and the developmental model Streptocarpus rexii. We show that the proposed approach improves cluster specificity in the A. thaliana dataset for which the reference genome is available. Furthermore, for the S. rexii data our filtered transcript collection matches a larger number of distinct annotated loci in reference genomes than the Oases set, while containing a reduced overall number of loci. A detailed discussion of advantages and limitations of our approach in processing de novo transcriptome reconstructions is presented. The proposed method should be widely applicable to other organisms, irrespective of the transcript assembly method employed. The S. rexii transcriptome is available as a sophisticated and augmented publicly available online database.

  9. De Novo Assembly of the Transcriptome of the Non-Model Plant Streptocarpus rexii Employing a Novel Heuristic to Recover Locus-Specific Transcript Clusters

    PubMed Central

    Chiara, Matteo; Horner, David S.; Spada, Alberto

    2013-01-01

    De novo transcriptome characterization from Next Generation Sequencing data has become an important approach in the study of non-model plants. Despite notable advances in the assembly of short reads, the clustering of transcripts into unigene-like (locus-specific) clusters remains a somewhat neglected subject. Indeed, closely related paralogous transcripts are often merged into single clusters by current approaches. Here, a novel heuristic method for locus-specific clustering is compared to that implemented in the de novo assembler Oases, using the same initial transcript collections, derived from Arabidopsis thaliana and the developmental model Streptocarpus rexii. We show that the proposed approach improves cluster specificity in the A. thaliana dataset for which the reference genome is available. Furthermore, for the S. rexii data our filtered transcript collection matches a larger number of distinct annotated loci in reference genomes than the Oases set, while containing a reduced overall number of loci. A detailed discussion of advantages and limitations of our approach in processing de novo transcriptome reconstructions is presented. The proposed method should be widely applicable to other organisms, irrespective of the transcript assembly method employed. The S. rexii transcriptome is available as a sophisticated and augmented publicly available online database. PMID:24324652

  10. Mixing Bandt-Pompe and Lempel-Ziv approaches: another way to analyze the complexity of continuous-state sequences

    NASA Astrophysics Data System (ADS)

    Zozor, S.; Mateos, D.; Lamberti, P. W.

    2014-05-01

    In this paper, we propose to mix the approach underlying Bandt-Pompe permutation entropy with Lempel-Ziv complexity, to design what we call Lempel-Ziv permutation complexity. The principle consists of two steps: (i) transformation of a continuous-state series that is intrinsically multivariate or arises from embedding into a sequence of permutation vectors, where the components are the positions of the components of the initial vector when re-arranged; (ii) performing the Lempel-Ziv complexity for this series of `symbols', as part of a discrete finite-size alphabet. On the one hand, the permutation entropy of Bandt-Pompe aims at the study of the entropy of such a sequence; i.e., the entropy of patterns in a sequence (e.g., local increases or decreases). On the other hand, the Lempel-Ziv complexity of a discrete-state sequence aims at the study of the temporal organization of the symbols (i.e., the rate of compressibility of the sequence). Thus, the Lempel-Ziv permutation complexity aims to take advantage of both of these methods. The potential from such a combined approach - of a permutation procedure and a complexity analysis - is evaluated through the illustration of some simulated data and some real data. In both cases, we compare the individual approaches and the combined approach.
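
    The two steps described above can be sketched in a few lines. This is an illustrative reading of the abstract, not the authors' code: it assumes a scalar series embedded with unit delay, encodes each window by the index order that sorts it, and counts phrases with the 1976 Lempel-Ziv exhaustive parsing. The function names and the embedding dimension of 3 are hypothetical choices.

```python
def ordinal_patterns(series, dim=3):
    """Step (i): map each length-`dim` window to the permutation sorting it."""
    patterns = []
    for t in range(len(series) - dim + 1):
        window = series[t:t + dim]
        # Indices of the window in increasing order of value (ties: first wins).
        patterns.append(tuple(sorted(range(dim), key=window.__getitem__)))
    return patterns


def _contains(haystack, needle):
    """True if the list `needle` occurs as a contiguous run in `haystack`."""
    m = len(needle)
    return any(haystack[j:j + m] == needle for j in range(len(haystack) - m + 1))


def lempel_ziv_complexity(symbols):
    """Step (ii): number of phrases in the Lempel-Ziv (1976) parsing."""
    seq = list(symbols)
    n, i, phrases = len(seq), 0, 0
    while i < n:
        length = 1
        # Grow the phrase while it can be copied from the preceding history.
        while i + length <= n and _contains(seq[:i + length - 1], seq[i:i + length]):
            length += 1
        phrases += 1
        i += length
    return phrases


def lz_permutation_complexity(series, dim=3):
    """Lempel-Ziv complexity of the ordinal-pattern sequence of `series`."""
    return lempel_ziv_complexity(ordinal_patterns(series, dim))


print(lz_permutation_complexity(list(range(20))))       # monotone ramp
print(lz_permutation_complexity([3, 1, 4, 1, 5, 9, 2, 6, 5, 3, 5, 8]))
```

    A monotone series yields a single repeated pattern and therefore a very low phrase count, while irregular series produce more distinct phrases; in practice the count is usually normalized by sequence length and alphabet size before series are compared.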

  11. Sequencing technologies and genome sequencing.

    PubMed

    Pareek, Chandra Shekhar; Smoczynski, Rafal; Tretyn, Andrzej

    2011-11-01

    High-throughput next-generation sequencing (HT-NGS) technologies are currently the hottest topic in human and animal genomics research; they can produce over 100 times more data than the most sophisticated capillary sequencers based on the Sanger method. With the ongoing development of high-throughput sequencing machines and the advancement of modern bioinformatics tools at an unprecedented pace, the goal of sequencing individual genomes of living organisms at a cost of $1,000 each seems realistically feasible in the near future. In the relatively short time frame since 2005, HT-NGS technologies have been revolutionizing human and animal genome research through analysis of chromatin immunoprecipitation coupled to DNA microarray (ChIP-chip) or sequencing (ChIP-seq), RNA sequencing (RNA-seq), whole-genome genotyping, genome-wide structural variation, de novo assembly and re-assembly of genomes, mutation detection and carrier screening, detection of inherited disorders and complex human diseases, DNA library preparation, paired ends and genomic captures, sequencing of mitochondrial genomes, and personal genomics. In this review, we address the important features of HT-NGS, including first-generation DNA sequencers; the birth of HT-NGS; second-generation HT-NGS platforms; third-generation HT-NGS platforms, including single-molecule Heliscope™, SMRT™ and RNAP sequencers, and Nanopore; the Archon Genomics X PRIZE foundation; a comparison of second- and third-generation HT-NGS platforms; and the applications, advances and future perspectives of sequencing technologies in human and animal genome research.

  12. Wide spectrum mutational analysis of metastatic renal cell cancer: a retrospective next generation sequencing approach.

    PubMed

    Fiorentino, Michelangelo; Gruppioni, Elisa; Massari, Francesco; Giunchi, Francesca; Altimari, Annalisa; Ciccarese, Chiara; Bimbatti, Davide; Scarpa, Aldo; Iacovelli, Roberto; Porta, Camillo; Virinder, Sarhadi; Tortora, Giampaolo; Artibani, Walter; Schiavina, Riccardo; Ardizzoni, Andrea; Brunelli, Matteo; Knuutila, Sakari; Martignoni, Guido

    2017-01-31

    Renal cell cancer (RCC) is characterized by histological and molecular heterogeneity that may account for variable response to targeted therapies. Using a next generation sequencing (NGS) approach with a pre-designed cancer panel, we retrospectively evaluated the mutation burden of 32 lesions from 22 metastatic RCC patients treated with at least one tyrosine kinase or mTOR inhibitor. We identified mutations in the VHL, PTEN, JAK3, MET, ERBB4, APC, CDKN2A, FGFR3, EGFR, RB1, TP53 genes. Somatic alterations were correlated with response to therapy. Most mutations hit VHL1 (31.8%) followed by PTEN (13.6%), JAK3, FGFR and TP53 (9% each). Eight (36%) patients were wild-type at least for the genes included in the panel. A genotype concordance between primary RCC and its secondary lesion was found in 3/6 cases. Patients were treated with Sorafenib, Sunitinib and Temsirolimus with partial responses in 4 (18.2%) and disease stabilization in 7 (31.8%). Among the 4 partial responders, 1 (25%) was wild-type and 3 (75%) harbored different VHL1 variants. Among the 7 patients with disease stabilization, 2 (29%) were wild-type, 2 (29%) PTEN mutated, and single patients (14% each) displayed mutations in VHL1, JAK3 and APC/CDKN2A. Among the 11 non-responders, 7 (64%) were wild-type, 2 (18%) were p53 mutated and 2 (18%) VHL1 mutated. No significant associations were found among RCC histotype, mutation variants and response to therapies. In the absence of predictive biomarkers for metastatic RCC treatment, an NGS approach may direct individual patients to basket clinical trials according to specific actionable molecular alterations.

  13. CAMELOT: A machine learning approach for coarse-grained simulations of aggregation of block-copolymeric protein sequences

    SciTech Connect

    Ruff, Kiersten M.; Harmon, Tyler S.; Pappu, Rohit V.

    2015-12-28

    We report the development and deployment of a coarse-graining method that is well suited for computer simulations of aggregation and phase separation of protein sequences with block-copolymeric architectures. Our algorithm, named CAMELOT for Coarse-grained simulations Aided by MachinE Learning Optimization and Training, leverages information from converged all-atom simulations that is used to determine a suitable resolution and parameterize the coarse-grained model. To parameterize a system-specific coarse-grained model, we use a combination of Boltzmann inversion, non-linear regression, and a Gaussian process Bayesian optimization approach. The accuracy of the coarse-grained model is demonstrated through direct comparisons to results from all-atom simulations. We demonstrate the utility of our coarse-graining approach using the block-copolymeric sequence from the exon 1 encoded sequence of the huntingtin protein. This sequence comprises 17 residues from the N-terminal end of huntingtin (N17) followed by a polyglutamine (polyQ) tract. Simulations based on the CAMELOT approach are used to show that the adsorption and unfolding of the wild type N17 and its sequence variants on the surface of polyQ tracts engender a patchy colloid-like architecture that promotes the formation of linear aggregates. These results provide a plausible explanation for experimental observations, which show that N17 accelerates the formation of linear aggregates in block-copolymeric N17-polyQ sequences. The CAMELOT approach is versatile and is generalizable for simulating the aggregation and phase behavior of a range of block-copolymeric protein sequences.
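
    Of the parameterization ingredients named in this abstract, Boltzmann inversion is the simplest to illustrate: a sampled probability distribution P(x) along some coordinate is converted into an effective potential U(x) = -k_B T ln P(x). The sketch below is a generic illustration of that formula, not the CAMELOT implementation; the function name, the temperature, the choice of kcal/mol units, and the toy histogram are assumptions, and any Jacobian correction needed for radial coordinates is omitted.

```python
import math

K_B = 0.0019872041  # Boltzmann constant in kcal/(mol*K)


def boltzmann_inversion(probabilities, temperature=300.0):
    """Convert a sampled distribution into a potential with its minimum at 0.

    Bins that were never sampled are assigned an infinite energy.
    """
    kt = K_B * temperature
    p_max = max(probabilities)
    return [
        -kt * math.log(p / p_max) if p > 0 else math.inf
        for p in probabilities
    ]


# Toy histogram over three bins of some coarse-grained coordinate.
toy_distribution = [0.1, 0.6, 0.3]
potential = boltzmann_inversion(toy_distribution)
print(potential)
```

    The most populated bin gets zero energy and rarer bins get higher energies; in a workflow like the one described, such potentials would serve as starting points that the regression and optimization steps then refine.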

  14. CAMELOT: A machine learning approach for coarse-grained simulations of aggregation of block-copolymeric protein sequences

    PubMed Central

    Ruff, Kiersten M.; Harmon, Tyler S.; Pappu, Rohit V.

    2015-01-01

    We report the development and deployment of a coarse-graining method that is well suited for computer simulations of aggregation and phase separation of protein sequences with block-copolymeric architectures. Our algorithm, named CAMELOT for Coarse-grained simulations Aided by MachinE Learning Optimization and Training, leverages information from converged all-atom simulations that is used to determine a suitable resolution and parameterize the coarse-grained model. To parameterize a system-specific coarse-grained model, we use a combination of Boltzmann inversion, non-linear regression, and a Gaussian process Bayesian optimization approach. The accuracy of the coarse-grained model is demonstrated through direct comparisons to results from all-atom simulations. We demonstrate the utility of our coarse-graining approach using the block-copolymeric sequence from the exon 1 encoded sequence of the huntingtin protein. This sequence comprises 17 residues from the N-terminal end of huntingtin (N17) followed by a polyglutamine (polyQ) tract. Simulations based on the CAMELOT approach are used to show that the adsorption and unfolding of the wild type N17 and its sequence variants on the surface of polyQ tracts engender a patchy colloid-like architecture that promotes the formation of linear aggregates. These results provide a plausible explanation for experimental observations, which show that N17 accelerates the formation of linear aggregates in block-copolymeric N17-polyQ sequences. The CAMELOT approach is versatile and is generalizable for simulating the aggregation and phase behavior of a range of block-copolymeric protein sequences. PMID:26723608

  15. CAMELOT: A machine learning approach for coarse-grained simulations of aggregation of block-copolymeric protein sequences

    NASA Astrophysics Data System (ADS)

    Ruff, Kiersten M.; Harmon, Tyler S.; Pappu, Rohit V.

    2015-12-01

    We report the development and deployment of a coarse-graining method that is well suited for computer simulations of aggregation and phase separation of protein sequences with block-copolymeric architectures. Our algorithm, named CAMELOT for Coarse-grained simulations Aided by MachinE Learning Optimization and Training, leverages information from converged all atom simulations that is used to determine a suitable resolution and parameterize the coarse-grained model. To parameterize a system-specific coarse-grained model, we use a combination of Boltzmann inversion, non-linear regression, a