novo sequencing approach: Topics by Science.gov

Sample records for novo sequencing approach

De novo peptide sequencing by deep learning

PubMed Central

Tran, Ngoc Hieu; Zhang, Xianglilan; Xin, Lei; Shan, Baozhen; Li, Ming

2017-01-01

De novo peptide sequencing from tandem MS data is the key technology in proteomics for the characterization of proteins, especially for new sequences, such as mAbs. In this study, we propose a deep neural network model, DeepNovo, for de novo peptide sequencing. DeepNovo architecture combines recent advances in convolutional neural networks and recurrent neural networks to learn features of tandem mass spectra, fragment ions, and sequence patterns of peptides. The networks are further integrated with local dynamic programming to solve the complex optimization task of de novo sequencing. We evaluated the method on a wide variety of species and found that DeepNovo considerably outperformed state of the art methods, achieving 7.7–22.9% higher accuracy at the amino acid level and 38.1–64.0% higher accuracy at the peptide level. We further used DeepNovo to automatically reconstruct the complete sequences of antibody light and heavy chains of mouse, achieving 97.5–100% coverage and 97.2–99.5% accuracy, without assisting databases. Moreover, DeepNovo is retrainable to adapt to any sources of data and provides a complete end-to-end training and prediction solution to the de novo sequencing problem. Not only does our study extend the deep learning revolution to a new field, but it also shows an innovative approach in solving optimization problems by using deep learning and dynamic programming. PMID:28720701
Mass spectrometry-based protein identification by integrating de novo sequencing with database searching.

PubMed

Wang, Penghao; Wilson, Susan R

2013-01-01

Mass spectrometry-based protein identification is a very challenging task. The main identification approaches include de novo sequencing and database searching. Both approaches have shortcomings, so an integrative approach has been developed. The integrative approach firstly infers partial peptide sequences, known as tags, directly from tandem spectra through de novo sequencing, and then puts these sequences into a database search to see if a close peptide match can be found. However the current implementation of this integrative approach has several limitations. Firstly, simplistic de novo sequencing is applied and only very short sequence tags are used. Secondly, most integrative methods apply an algorithm similar to BLAST to search for exact sequence matches and do not accommodate sequence errors well. Thirdly, by applying these methods the integrated de novo sequencing makes a limited contribution to the scoring model which is still largely based on database searching. We have developed a new integrative protein identification method which can integrate de novo sequencing more efficiently into database searching. Evaluated on large real datasets, our method outperforms popular identification methods.
Open-pNovo: De Novo Peptide Sequencing with Thousands of Protein Modifications.

PubMed

Yang, Hao; Chi, Hao; Zhou, Wen-Jing; Zeng, Wen-Feng; He, Kun; Liu, Chao; Sun, Rui-Xiang; He, Si-Min

2017-02-03

De novo peptide sequencing has improved remarkably, but sequencing full-length peptides with unexpected modifications is still a challenging problem. Here we present an open de novo sequencing tool, Open-pNovo, for de novo sequencing of peptides with arbitrary types of modifications. Although the search space increases by ∼300 times, Open-pNovo is close to or even ∼10-times faster than the other three proposed algorithms. Furthermore, considering top-1 candidates on three MS/MS data sets, Open-pNovo can recall over 90% of the results obtained by any one traditional algorithm and report 5-87% more peptides, including 14-250% more modified peptides. On a high-quality simulated data set, ∼85% peptides with arbitrary modifications can be recalled by Open-pNovo, while hardly any results can be recalled by others. In summary, Open-pNovo is an excellent tool for open de novo sequencing and has great potential for discovering unexpected modifications in the real biological applications.
A combined de novo protein sequencing and cDNA library approach to the venomic analysis of Chinese spider Araneus ventricosus.

PubMed

Duan, Zhigui; Cao, Rui; Jiang, Liping; Liang, Songping

2013-01-14

In past years, spider venoms have attracted increasing attention due to their extraordinary chemical and pharmacological diversity. The recently popularized proteomic method highly improved our ability to analyze the proteins in the venom. However, the lack of information about isolated venom proteins sequences dramatically limits the ability to confidently identify venom proteins. In the present paper, the venom from Araneus ventricosus was analyzed using two complementary approaches: 2-DE/Shotgun-LC-MS/MS coupled to MASCOT search and 2-DE/Shotgun-LC-MS/MS coupled to manual de novo sequencing followed by local venom protein database (LVPD) search. The LVPD was constructed with toxin-like protein sequences obtained from the analysis of cDNA library from A. ventricosus venom glands. Our results indicate that a total of 130 toxin-like protein sequences were unambiguously identified by manual de novo sequencing coupled to LVPD search, accounting for 86.67% of all toxin-like proteins in LVPD. Thus manual de novo sequencing coupled to LVPD search was proved an extremely effective approach for the analysis of venom proteins. In addition, the approach displays impeccable advantage in validating mutant positions of isoforms from the same toxin-like family. Intriguingly, methyl esterifcation of glutamic acid was discovered for the first time in animal venom proteins by manual de novo sequencing. Crown Copyright © 2012. Published by Elsevier B.V. All rights reserved.
UniNovo: a universal tool for de novo peptide sequencing.

PubMed

Jeong, Kyowon; Kim, Sangtae; Pevzner, Pavel A

2013-08-15

Mass spectrometry (MS) instruments and experimental protocols are rapidly advancing, but de novo peptide sequencing algorithms to analyze tandem mass (MS/MS) spectra are lagging behind. Although existing de novo sequencing tools perform well on certain types of spectra [e.g. Collision Induced Dissociation (CID) spectra of tryptic peptides], their performance often deteriorates on other types of spectra, such as Electron Transfer Dissociation (ETD), Higher-energy Collisional Dissociation (HCD) spectra or spectra of non-tryptic digests. Thus, rather than developing a new algorithm for each type of spectra, we develop a universal de novo sequencing algorithm called UniNovo that works well for all types of spectra or even for spectral pairs (e.g. CID/ETD spectral pairs). UniNovo uses an improved scoring function that captures the dependences between different ion types, where such dependencies are learned automatically using a modified offset frequency function. The performance of UniNovo is compared with PepNovo+, PEAKS and pNovo using various types of spectra. The results show that the performance of UniNovo is superior to other tools for ETD spectra and superior or comparable with others for CID and HCD spectra. UniNovo also estimates the probability that each reported reconstruction is correct, using simple statistics that are readily obtained from a small training dataset. We demonstrate that the estimation is accurate for all tested types of spectra (including CID, HCD, ETD, CID/ETD and HCD/ETD spectra of trypsin, LysC or AspN digested peptides). UniNovo is implemented in JAVA and tested on Windows, Ubuntu and OS X machines. UniNovo is available at http://proteomics.ucsd.edu/Software/UniNovo.html along with the manual.
A gradient-boosting approach for filtering de novo mutations in parent-offspring trios.

PubMed

Liu, Yongzhuang; Li, Bingshan; Tan, Renjie; Zhu, Xiaolin; Wang, Yadong

2014-07-01

Whole-genome and -exome sequencing on parent-offspring trios is a powerful approach to identifying disease-associated genes by detecting de novo mutations in patients. Accurate detection of de novo mutations from sequencing data is a critical step in trio-based genetic studies. Existing bioinformatic approaches usually yield high error rates due to sequencing artifacts and alignment issues, which may either miss true de novo mutations or call too many false ones, making downstream validation and analysis difficult. In particular, current approaches have much worse specificity than sensitivity, and developing effective filters to discriminate genuine from spurious de novo mutations remains an unsolved challenge. In this article, we curated 59 sequence features in whole genome and exome alignment context which are considered to be relevant to discriminating true de novo mutations from artifacts, and then employed a machine-learning approach to classify candidates as true or false de novo mutations. Specifically, we built a classifier, named De Novo Mutation Filter (DNMFilter), using gradient boosting as the classification algorithm. We built the training set using experimentally validated true and false de novo mutations as well as collected false de novo mutations from an in-house large-scale exome-sequencing project. We evaluated DNMFilter's theoretical performance and investigated relative importance of different sequence features on the classification accuracy. Finally, we applied DNMFilter on our in-house whole exome trios and one CEU trio from the 1000 Genomes Project and found that DNMFilter could be coupled with commonly used de novo mutation detection approaches as an effective filtering approach to significantly reduce false discovery rate without sacrificing sensitivity. The software DNMFilter implemented using a combination of Java and R is freely available from the website at http://humangenome.duke.edu/software. © The Author 2014. Published by Oxford
Combining De Novo Peptide Sequencing Algorithms, A Synergistic Approach to Boost Both Identifications and Confidence in Bottom-up Proteomics.

PubMed

Blank-Landeshammer, Bernhard; Kollipara, Laxmikanth; Biß, Karsten; Pfenninger, Markus; Malchow, Sebastian; Shuvaev, Konstantin; Zahedi, René P; Sickmann, Albert

2017-09-01

Complex mass spectrometry based proteomics data sets are mostly analyzed by protein database searches. While this approach performs considerably well for sequenced organisms, direct inference of peptide sequences from tandem mass spectra, i.e., de novo peptide sequencing, oftentimes is the only way to obtain information when protein databases are absent. However, available algorithms suffer from drawbacks such as lack of validation and often high rates of false positive hits (FP). Here we present a simple method of combining results from commonly available de novo peptide sequencing algorithms, which in conjunction with minor tweaks in data acquisition ensues lower empirical FDR compared to the analysis using single algorithms. Results were validated using state-of-the art database search algorithms as well specifically synthesized reference peptides. Thus, we could increase the number of PSMs meeting a stringent FDR of 5% more than 3-fold compared to the single best de novo sequencing algorithm alone, accounting for an average of 11 120 PSMs (combined) instead of 3476 PSMs (alone) in triplicate 2 h LC-MS runs of tryptic HeLa digestion.
MRUniNovo: an efficient tool for de novo peptide sequencing utilizing the hadoop distributed computing framework.

PubMed

Li, Chuang; Chen, Tao; He, Qiang; Zhu, Yunping; Li, Kenli

2017-03-15

Tandem mass spectrometry-based de novo peptide sequencing is a complex and time-consuming process. The current algorithms for de novo peptide sequencing cannot rapidly and thoroughly process large mass spectrometry datasets. In this paper, we propose MRUniNovo, a novel tool for parallel de novo peptide sequencing. MRUniNovo parallelizes UniNovo based on the Hadoop compute platform. Our experimental results demonstrate that MRUniNovo significantly reduces the computation time of de novo peptide sequencing without sacrificing the correctness and accuracy of the results, and thus can process very large datasets that UniNovo cannot. MRUniNovo is an open source software tool implemented in java. The source code and the parameter settings are available at http://bioinfo.hupo.org.cn/MRUniNovo/index.php. s131020002@hnu.edu.cn ; taochen1019@163.com. Supplementary data are available at Bioinformatics online. © The Author 2016. Published by Oxford University Press. All rights reserved. For Permissions, please e-mail: journals.permissions@oup.com
Reference-guided de novo assembly approach improves genome reconstruction for related species.

PubMed

Lischer, Heidi E L; Shimizu, Kentaro K

2017-11-10

The development of next-generation sequencing has made it possible to sequence whole genomes at a relatively low cost. However, de novo genome assemblies remain challenging due to short read length, missing data, repetitive regions, polymorphisms and sequencing errors. As more and more genomes are sequenced, reference-guided assembly approaches can be used to assist the assembly process. However, previous methods mostly focused on the assembly of other genotypes within the same species. We adapted and extended a reference-guided de novo assembly approach, which enables the usage of a related reference sequence to guide the genome assembly. In order to compare and evaluate de novo and our reference-guided de novo assembly approaches, we used a simulated data set of a repetitive and heterozygotic plant genome. The extended reference-guided de novo assembly approach almost always outperforms the corresponding de novo assembly program even when a reference of a different species is used. Similar improvements can be observed in high and low coverage situations. In addition, we show that a single evaluation metric, like the widely used N50 length, is not enough to properly rate assemblies as it not always points to the best assembly evaluated with other criteria. Therefore, we used the summed z-scores of 36 different statistics to evaluate the assemblies. The combination of reference mapping and de novo assembly provides a powerful tool to improve genome reconstruction by integrating information of a related genome. Our extension of the reference-guided de novo assembly approach enables the application of this strategy not only within but also between related species. Finally, the evaluation of genome assemblies is often not straight forward, as the truth is not known. Thus one should always use a combination of evaluation metrics, which not only try to assess the continuity but also the accuracy of an assembly.
DeNovoGUI: An Open Source Graphical User Interface for de Novo Sequencing of Tandem Mass Spectra

PubMed Central

2013-01-01

De novo sequencing is a popular technique in proteomics for identifying peptides from tandem mass spectra without having to rely on a protein sequence database. Despite the strong potential of de novo sequencing algorithms, their adoption threshold remains quite high. We here present a user-friendly and lightweight graphical user interface called DeNovoGUI for running parallelized versions of the freely available de novo sequencing software PepNovo+, greatly simplifying the use of de novo sequencing in proteomics. Our platform-independent software is freely available under the permissible Apache2 open source license. Source code, binaries, and additional documentation are available at http://denovogui.googlecode.com. PMID:24295440
DeNovoGUI: an open source graphical user interface for de novo sequencing of tandem mass spectra.

PubMed

Muth, Thilo; Weilnböck, Lisa; Rapp, Erdmann; Huber, Christian G; Martens, Lennart; Vaudel, Marc; Barsnes, Harald

2014-02-07

De novo sequencing is a popular technique in proteomics for identifying peptides from tandem mass spectra without having to rely on a protein sequence database. Despite the strong potential of de novo sequencing algorithms, their adoption threshold remains quite high. We here present a user-friendly and lightweight graphical user interface called DeNovoGUI for running parallelized versions of the freely available de novo sequencing software PepNovo+, greatly simplifying the use of de novo sequencing in proteomics. Our platform-independent software is freely available under the permissible Apache2 open source license. Source code, binaries, and additional documentation are available at http://denovogui.googlecode.com .
Peptide de novo sequencing of mixture tandem mass spectra.

PubMed

Gorshkov, Vladimir; Hotta, Stéphanie Yuki Kolbeck; Verano-Braga, Thiago; Kjeldsen, Frank

2016-09-01

The impact of mixture spectra deconvolution on the performance of four popular de novo sequencing programs was tested using artificially constructed mixture spectra as well as experimental proteomics data. Mixture fragmentation spectra are recognized as a limitation in proteomics because they decrease the identification performance using database search engines. De novo sequencing approaches are expected to be even more sensitive to the reduction in mass spectrum quality resulting from peptide precursor co-isolation and thus prone to false identifications. The deconvolution approach matched complementary b-, y-ions to each precursor peptide mass, which allowed the creation of virtual spectra containing sequence specific fragment ions of each co-isolated peptide. Deconvolution processing resulted in equally efficient identification rates but increased the absolute number of correctly sequenced peptides. The improvement was in the range of 20-35% additional peptide identifications for a HeLa lysate sample. Some correct sequences were identified only using unprocessed spectra; however, the number of these was lower than those where improvement was obtained by mass spectral deconvolution. Tight candidate peptide score distribution and high sensitivity to small changes in the mass spectrum introduced by the employed deconvolution method could explain some of the missing peptide identifications. © 2016 WILEY-VCH Verlag GmbH & Co. KGaA, Weinheim.
Peptide de novo sequencing of mixture tandem mass spectra

PubMed Central

Hotta, Stéphanie Yuki Kolbeck; Verano‐Braga, Thiago; Kjeldsen, Frank

2016-01-01

The impact of mixture spectra deconvolution on the performance of four popular de novo sequencing programs was tested using artificially constructed mixture spectra as well as experimental proteomics data. Mixture fragmentation spectra are recognized as a limitation in proteomics because they decrease the identification performance using database search engines. De novo sequencing approaches are expected to be even more sensitive to the reduction in mass spectrum quality resulting from peptide precursor co‐isolation and thus prone to false identifications. The deconvolution approach matched complementary b‐, y‐ions to each precursor peptide mass, which allowed the creation of virtual spectra containing sequence specific fragment ions of each co‐isolated peptide. Deconvolution processing resulted in equally efficient identification rates but increased the absolute number of correctly sequenced peptides. The improvement was in the range of 20–35% additional peptide identifications for a HeLa lysate sample. Some correct sequences were identified only using unprocessed spectra; however, the number of these was lower than those where improvement was obtained by mass spectral deconvolution. Tight candidate peptide score distribution and high sensitivity to small changes in the mass spectrum introduced by the employed deconvolution method could explain some of the missing peptide identifications. PMID:27329701
De Novo Peptide Sequencing: Deep Mining of High-Resolution Mass Spectrometry Data.

PubMed

Islam, Mohammad Tawhidul; Mohamedali, Abidali; Fernandes, Criselda Santan; Baker, Mark S; Ranganathan, Shoba

2017-01-01

High resolution mass spectrometry has revolutionized proteomics over the past decade, resulting in tremendous amounts of data in the form of mass spectra, being generated in a relatively short span of time. The mining of this spectral data for analysis and interpretation though has lagged behind such that potentially valuable data is being overlooked because it does not fit into the mold of traditional database searching methodologies. Although the analysis of spectra by de novo sequences removes such biases and has been available for a long period of time, its uptake has been slow or almost nonexistent within the scientific community. In this chapter, we propose a methodology to integrate de novo peptide sequencing using three commonly available software solutions in tandem, complemented by homology searching, and manual validation of spectra. This simplified method would allow greater use of de novo sequencing approaches and potentially greatly increase proteome coverage leading to the unearthing of valuable insights into protein biology, especially of organisms whose genomes have been recently sequenced or are poorly annotated.
Top-down analysis of protein samples by de novo sequencing techniques

DOE Office of Scientific and Technical Information (OSTI.GOV)

Vyatkina, Kira; Wu, Si; Dekker, Lennard J. M.

MOTIVATION: Recent technological advances have made high-resolution mass spectrometers affordable to many laboratories, thus boosting rapid development of top-down mass spectrometry, and implying a need in efficient methods for analyzing this kind of data. RESULTS: We describe a method for analysis of protein samples from top-down tandem mass spectrometry data, which capitalizes on de novo sequencing of fragments of the proteins present in the sample. Our algorithm takes as input a set of de novo amino acid strings derived from the given mass spectra using the recently proposed Twister approach, and combines them into aggregated strings endowed with offsets. Themore » former typically constitute accurate sequence fragments of sufficiently well-represented proteins from the sample being analyzed, while the latter indicate their location in the protein sequence, and also bear information on post-translational modifications and fragmentation patterns.« less
Automated Antibody De Novo Sequencing and Its Utility in Biopharmaceutical Discovery

NASA Astrophysics Data System (ADS)

Sen, K. Ilker; Tang, Wilfred H.; Nayak, Shruti; Kil, Yong J.; Bern, Marshall; Ozoglu, Berk; Ueberheide, Beatrix; Davis, Darryl; Becker, Christopher

2017-05-01

Applications of antibody de novo sequencing in the biopharmaceutical industry range from the discovery of new antibody drug candidates to identifying reagents for research and determining the primary structure of innovator products for biosimilar development. When murine, phage display, or patient-derived monoclonal antibodies against a target of interest are available, but the cDNA or the original cell line is not, de novo protein sequencing is required to humanize and recombinantly express these antibodies, followed by in vitro and in vivo testing for functional validation. Availability of fully automated software tools for monoclonal antibody de novo sequencing enables efficient and routine analysis. Here, we present a novel method to automatically de novo sequence antibodies using mass spectrometry and the Supernovo software. The robustness of the algorithm is demonstrated through a series of stress tests.
Spectra library assisted de novo peptide sequencing for HCD and ETD spectra pairs.

PubMed

Yan, Yan; Zhang, Kaizhong

2016-12-23

De novo peptide sequencing via tandem mass spectrometry (MS/MS) has been developed rapidly in recent years. With the use of spectra pairs from the same peptide under different fragmentation modes, performance of de novo sequencing is greatly improved. Currently, with large amount of spectra sequenced everyday, spectra libraries containing tens of thousands of annotated experimental MS/MS spectra become available. These libraries provide information of the spectra properties, thus have the potential to be used with de novo sequencing to improve its performance. In this study, an improved de novo sequencing method assisted with spectra library is proposed. It uses spectra libraries as training datasets and introduces significant scores of the features used in our previous de novo sequencing method for HCD and ETD spectra pairs. Two pairs of HCD and ETD spectral datasets were used to test the performance of the proposed method and our previous method. The results show that this proposed method achieves better sequencing accuracy with higher ranked correct sequences and less computational time. This paper proposed an advanced de novo sequencing method for HCD and ETD spectra pair and used information from spectra libraries and significant improved previous similar methods.
Database-independent Protein Sequencing (DiPS) Enables Full-length de Novo Protein and Antibody Sequence Determination.

PubMed

Savidor, Alon; Barzilay, Rotem; Elinger, Dalia; Yarden, Yosef; Lindzen, Moshit; Gabashvili, Alexandra; Adiv Tal, Ophir; Levin, Yishai

2017-06-01

Traditional "bottom-up" proteomic approaches use proteolytic digestion, LC-MS/MS, and database searching to elucidate peptide identities and their parent proteins. Protein sequences absent from the database cannot be identified, and even if present in the database, complete sequence coverage is rarely achieved even for the most abundant proteins in the sample. Thus, sequencing of unknown proteins such as antibodies or constituents of metaproteomes remains a challenging problem. To date, there is no available method for full-length protein sequencing, independent of a reference database, in high throughput. Here, we present Database-independent Protein Sequencing, a method for unambiguous, rapid, database-independent, full-length protein sequencing. The method is a novel combination of non-enzymatic, semi-random cleavage of the protein, LC-MS/MS analysis, peptide de novo sequencing, extraction of peptide tags, and their assembly into a consensus sequence using an algorithm named "Peptide Tag Assembler." As proof-of-concept, the method was applied to samples of three known proteins representing three size classes and to a previously un-sequenced, clinically relevant monoclonal antibody. Excluding leucine/isoleucine and glutamic acid/deamidated glutamine ambiguities, end-to-end full-length de novo sequencing was achieved with 99-100% accuracy for all benchmarking proteins and the antibody light chain. Accuracy of the sequenced antibody heavy chain, including the entire variable region, was also 100%, but there was a 23-residue gap in the constant region sequence. © 2017 by The American Society for Biochemistry and Molecular Biology, Inc.
New Approaches and Technologies to Sequence de novo Plant reference Genomes (2013 DOE JGI Genomics of Energy and Environment 8th Annual User Meeting)

ScienceCinema

Schmutz, Jeremy

2018-02-01

Jeremy Schmutz of the HudsonAlpha Institute for Biotechnology on New approaches and technologies to sequence de novo plant reference genomes at the 8th Annual Genomics of Energy Environment Meeting on March 27, 2013 in Walnut Creek, CA.
New Approaches and Technologies to Sequence de novo Plant reference Genomes (2013 DOE JGI Genomics of Energy and Environment 8th Annual User Meeting)

DOE Office of Scientific and Technical Information (OSTI.GOV)

Schmutz, Jeremy

2013-03-01

Jeremy Schmutz of the HudsonAlpha Institute for Biotechnology on New approaches and technologies to sequence de novo plant reference genomes at the 8th Annual Genomics of Energy Environment Meeting on March 27, 2013 in Walnut Creek, CA.

De novo assembly of human genomes with massively parallel short read sequencing.

PubMed

Li, Ruiqiang; Zhu, Hongmei; Ruan, Jue; Qian, Wubin; Fang, Xiaodong; Shi, Zhongbin; Li, Yingrui; Li, Shengting; Shan, Gao; Kristiansen, Karsten; Li, Songgang; Yang, Huanming; Wang, Jian; Wang, Jun

2010-02-01

Next-generation massively parallel DNA sequencing technologies provide ultrahigh throughput at a substantially lower unit data cost; however, the data are very short read length sequences, making de novo assembly extremely challenging. Here, we describe a novel method for de novo assembly of large genomes from short read sequences. We successfully assembled both the Asian and African human genome sequences, achieving an N50 contig size of 7.4 and 5.9 kilobases (kb) and scaffold of 446.3 and 61.9 kb, respectively. The development of this de novo short read assembly method creates new opportunities for building reference sequences and carrying out accurate analyses of unexplored genomes in a cost-effective way.
Whole-genome sequencing for comparative genomics and de novo genome assembly.

PubMed

Benjak, Andrej; Sala, Claudia; Hartkoorn, Ruben C

2015-01-01

Next-generation sequencing technologies for whole-genome sequencing of mycobacteria are rapidly becoming an attractive alternative to more traditional sequencing methods. In particular this technology is proving useful for genome-wide identification of mutations in mycobacteria (comparative genomics) as well as for de novo assembly of whole genomes. Next-generation sequencing however generates a vast quantity of data that can only be transformed into a usable and comprehensible form using bioinformatics. Here we describe the methodology one would use to prepare libraries for whole-genome sequencing, and the basic bioinformatics to identify mutations in a genome following Illumina HiSeq or MiSeq sequencing, as well as de novo genome assembly following sequencing using Pacific Biosciences (PacBio).
Top-down analysis of protein samples by de novo sequencing techniques.

PubMed

Vyatkina, Kira; Wu, Si; Dekker, Lennard J M; VanDuijn, Martijn M; Liu, Xiaowen; Tolić, Nikola; Luider, Theo M; Paša-Tolić, Ljiljana; Pevzner, Pavel A

2016-09-15

Recent technological advances have made high-resolution mass spectrometers affordable to many laboratories, thus boosting rapid development of top-down mass spectrometry, and implying a need in efficient methods for analyzing this kind of data. We describe a method for analysis of protein samples from top-down tandem mass spectrometry data, which capitalizes on de novo sequencing of fragments of the proteins present in the sample. Our algorithm takes as input a set of de novo amino acid strings derived from the given mass spectra using the recently proposed Twister approach, and combines them into aggregated strings endowed with offsets. The former typically constitute accurate sequence fragments of sufficiently well-represented proteins from the sample being analyzed, while the latter indicate their location in the protein sequence, and also bear information on post-translational modifications and fragmentation patterns. Freely available on the web at http://bioinf.spbau.ru/en/twister vyatkina@spbau.ru or ppevzner@ucsd.edu Supplementary data are available at Bioinformatics online. © The Author 2016. Published by Oxford University Press. All rights reserved. For Permissions, please e-mail: journals.permissions@oup.com.
Discovery, genotyping and characterization of structural variation and novel sequence at single nucleotide resolution from de novo genome assemblies on a population scale.

PubMed

Liu, Siyang; Huang, Shujia; Rao, Junhua; Ye, Weijian; Krogh, Anders; Wang, Jun

2015-01-01

Comprehensive recognition of genomic variation in one individual is important for understanding disease and developing personalized medication and treatment. Many tools based on DNA re-sequencing exist for identification of single nucleotide polymorphisms, small insertions and deletions (indels) as well as large deletions. However, these approaches consistently display a substantial bias against the recovery of complex structural variants and novel sequence in individual genomes and do not provide interpretation information such as the annotation of ancestral state and formation mechanism. We present a novel approach implemented in a single software package, AsmVar, to discover, genotype and characterize different forms of structural variation and novel sequence from population-scale de novo genome assemblies up to nucleotide resolution. Application of AsmVar to several human de novo genome assemblies captures a wide spectrum of structural variants and novel sequences present in the human population in high sensitivity and specificity. Our method provides a direct solution for investigating structural variants and novel sequences from de novo genome assemblies, facilitating the construction of population-scale pan-genomes. Our study also highlights the usefulness of the de novo assembly strategy for definition of genome structure.
Optimizing and benchmarking de novo transcriptome sequencing: from library preparation to assembly evaluation.

PubMed

Hara, Yuichiro; Tatsumi, Kaori; Yoshida, Michio; Kajikawa, Eriko; Kiyonari, Hiroshi; Kuraku, Shigehiro

2015-11-18

RNA-seq enables gene expression profiling in selected spatiotemporal windows and yields massive sequence information with relatively low cost and time investment, even for non-model species. However, there remains a large room for optimizing its workflow, in order to take full advantage of continuously developing sequencing capacity. Transcriptome sequencing for three embryonic stages of Madagascar ground gecko (Paroedura picta) was performed with the Illumina platform. The output reads were assembled de novo for reconstructing transcript sequences. In order to evaluate the completeness of transcriptome assemblies, we prepared a reference gene set consisting of vertebrate one-to-one orthologs. To take advantage of increased read length of >150 nt, we demonstrated shortened RNA fragmentation time, which resulted in a dramatic shift of insert size distribution. To evaluate products of multiple de novo assembly runs incorporating reads with different RNA sources, read lengths, and insert sizes, we introduce a new reference gene set, core vertebrate genes (CVG), consisting of 233 genes that are shared as one-to-one orthologs by all vertebrate genomes examined (29 species)., The completeness assessment performed by the computational pipelines CEGMA and BUSCO referring to CVG, demonstrated higher accuracy and resolution than with the gene set previously established for this purpose. As a result of the assessment with CVG, we have derived the most comprehensive transcript sequence set of the Madagascar ground gecko by means of assembling individual libraries followed by clustering the assembled sequences based on their overall similarities. Our results provide several insights into optimizing de novo RNA-seq workflow, including the coordination between library insert size and read length, which manifested in improved connectivity of assemblies. The approach and assembly assessment with CVG demonstrated here would be applicable to transcriptome analysis of other species as
Terminal sequence importance of de novo proteins from binary-patterned library: stable artificial proteins with 11- or 12-amino acid alphabet.

PubMed

Okura, Hiromichi; Takahashi, Tsuyoshi; Mihara, Hisakazu

2012-06-01

Successful approaches of de novo protein design suggest a great potential to create novel structural folds and to understand natural rules of protein folding. For these purposes, smaller and simpler de novo proteins have been developed. Here, we constructed smaller proteins by removing the terminal sequences from stable de novo vTAJ proteins and compared stabilities between mutant and original proteins. vTAJ proteins were screened from an α3β3 binary-patterned library which was designed with polar/ nonpolar periodicities of α-helix and β-sheet. vTAJ proteins have the additional terminal sequences due to the method of constructing the genetically repeated library sequences. By removing the parts of the sequences, we successfully obtained the stable smaller de novo protein mutants with fewer amino acid alphabets than the originals. However, these mutants showed the differences on ANS binding properties and stabilities against denaturant and pH change. The terminal sequences, which were designed just as flexible linkers not as secondary structure units, sufficiently affected these physicochemical details. This study showed implications for adjusting protein stabilities by designing N- and C-terminal sequences.
in silico Whole Genome Sequencer & Analyzer (iWGS): A Computational Pipeline to Guide the Design and Analysis of de novo Genome Sequencing Studies

DOE Office of Scientific and Technical Information (OSTI.GOV)

Zhou, Xiaofan; Peris, David; Kominek, Jacek

The availability of genomes across the tree of life is highly biased toward vertebrates, pathogens, human disease models, and organisms with relatively small and simple genomes. Recent progress in genomics has enabled the de novo decoding of the genome of virtually any organism, greatly expanding its potential for understanding the biology and evolution of the full spectrum of biodiversity. The increasing diversity of sequencing technologies, assays, and de novo assembly algorithms have augmented the complexity of de novo genome sequencing projects in nonmodel organisms. To reduce the costs and challenges in de novo genome sequencing projects and streamline their experimentalmore » design and analysis, we developed iWGS (in silico Whole Genome Sequencer and Analyzer), an automated pipeline for guiding the choice of appropriate sequencing strategy and assembly protocols. iWGS seamlessly integrates the four key steps of a de novo genome sequencing project: data generation (through simulation), data quality control, de novo assembly, and assembly evaluation and validation. The last three steps can also be applied to the analysis of real data. iWGS is designed to enable the user to have great flexibility in testing the range of experimental designs available for genome sequencing projects, and supports all major sequencing technologies and popular assembly tools. Three case studies illustrate how iWGS can guide the design of de novo genome sequencing projects, and evaluate the performance of a wide variety of user-specified sequencing strategies and assembly protocols on genomes of differing architectures. iWGS, along with a detailed documentation, is freely available at https://github.com/zhouxiaofan1983/iWGS.« less
in silico Whole Genome Sequencer & Analyzer (iWGS): A Computational Pipeline to Guide the Design and Analysis of de novo Genome Sequencing Studies

DOE PAGES

Zhou, Xiaofan; Peris, David; Kominek, Jacek; ...

2016-09-16

The availability of genomes across the tree of life is highly biased toward vertebrates, pathogens, human disease models, and organisms with relatively small and simple genomes. Recent progress in genomics has enabled the de novo decoding of the genome of virtually any organism, greatly expanding its potential for understanding the biology and evolution of the full spectrum of biodiversity. The increasing diversity of sequencing technologies, assays, and de novo assembly algorithms have augmented the complexity of de novo genome sequencing projects in nonmodel organisms. To reduce the costs and challenges in de novo genome sequencing projects and streamline their experimentalmore » design and analysis, we developed iWGS (in silico Whole Genome Sequencer and Analyzer), an automated pipeline for guiding the choice of appropriate sequencing strategy and assembly protocols. iWGS seamlessly integrates the four key steps of a de novo genome sequencing project: data generation (through simulation), data quality control, de novo assembly, and assembly evaluation and validation. The last three steps can also be applied to the analysis of real data. iWGS is designed to enable the user to have great flexibility in testing the range of experimental designs available for genome sequencing projects, and supports all major sequencing technologies and popular assembly tools. Three case studies illustrate how iWGS can guide the design of de novo genome sequencing projects, and evaluate the performance of a wide variety of user-specified sequencing strategies and assembly protocols on genomes of differing architectures. iWGS, along with a detailed documentation, is freely available at https://github.com/zhouxiaofan1983/iWGS.« less
De novo protein sequencing by combining top-down and bottom-up tandem mass spectra.

PubMed

Liu, Xiaowen; Dekker, Lennard J M; Wu, Si; Vanduijn, Martijn M; Luider, Theo M; Tolić, Nikola; Kou, Qiang; Dvorkin, Mikhail; Alexandrova, Sonya; Vyatkina, Kira; Paša-Tolić, Ljiljana; Pevzner, Pavel A

2014-07-03

There are two approaches for de novo protein sequencing: Edman degradation and mass spectrometry (MS). Existing MS-based methods characterize a novel protein by assembling tandem mass spectra of overlapping peptides generated from multiple proteolytic digestions of the protein. Because each tandem mass spectrum covers only a short peptide of the target protein, the key to high coverage protein sequencing is to find spectral pairs from overlapping peptides in order to assemble tandem mass spectra to long ones. However, overlapping regions of peptides may be too short to be confidently identified. High-resolution mass spectrometers have become accessible to many laboratories. These mass spectrometers are capable of analyzing molecules of large mass values, boosting the development of top-down MS. Top-down tandem mass spectra cover whole proteins. However, top-down tandem mass spectra, even combined, rarely provide full ion fragmentation coverage of a protein. We propose an algorithm, TBNovo, for de novo protein sequencing by combining top-down and bottom-up MS. In TBNovo, a top-down tandem mass spectrum is utilized as a scaffold, and bottom-up tandem mass spectra are aligned to the scaffold to increase sequence coverage. Experiments on data sets of two proteins showed that TBNovo achieved high sequence coverage and high sequence accuracy.
Partial De Novo Sequencing and Unusual CID Fragmentation of a 7 kDa, Disulfide-Bridged Toxin

NASA Astrophysics Data System (ADS)

Medzihradszky, Katalin F.; Bohlen, Christopher J.

2012-05-01

A 7 kDa toxin isolated from the venom of the Texas coral snake ( Micrurus tener tener) was subjected to collision-induced dissociation (CID) and electron-transfer dissociation (ETD) analyses both before and after reduction at low pH. Manual and automated approaches to de novo sequencing are compared in detail. Manual de novo sequencing utilizing the combination of high accuracy CID and ETD data and an acid-related cleavage yielded the N-terminal half of the sequence from the reduced species. The intact polypeptide, containing 3 disulfide bridges produced a series of unusual fragments in ion trap CID experiments: abundant internal amino acid losses were detected, and also one of the disulfide-linkage positions could be determined from fragments formed by the cleavage of two bonds. In addition, internal and c-type fragments were also observed.
An Approach for Peptide Identification by De Novo Sequencing of Mixture Spectra.

PubMed

Liu, Yi; Ma, Bin; Zhang, Kaizhong; Lajoie, Gilles

2017-01-01

Mixture spectra occur quite frequently in a typical wet-lab mass spectrometry experiment, which result from the concurrent fragmentation of multiple precursors. The ability to efficiently and confidently identify mixture spectra is essential to alleviate the existent bottleneck of low mass spectra identification rate. However, most of the traditional computational methods are not suitable for interpreting mixture spectra, because they still take the assumption that the acquired spectra come from the fragmentation of a single precursor. In this manuscript, we formulate the mixture spectra de novo sequencing problem mathematically, and propose a dynamic programming algorithm for the problem. Additionally, we use both simulated and real mixture spectra data sets to verify the merits of the proposed algorithm.
Long-read sequencing and de novo assembly of a Chinese genome

USDA-ARS?s Scientific Manuscript database

Short-read sequencing has enabled the de novo assembly of several individual human genomes, but with inherent limitations in characterizing repeat elements. Here we sequence a Chinese individual HX1 by single-molecule real-time (SMRT) long-read sequencing, construct a physical map by NanoChannel arr...
Improving de novo sequence assembly using machine learning and comparative genomics for overlap correction.

PubMed

Palmer, Lance E; Dejori, Mathaeus; Bolanos, Randall; Fasulo, Daniel

2010-01-15

With the rapid expansion of DNA sequencing databases, it is now feasible to identify relevant information from prior sequencing projects and completed genomes and apply it to de novo sequencing of new organisms. As an example, this paper demonstrates how such extra information can be used to improve de novo assemblies by augmenting the overlapping step. Finding all pairs of overlapping reads is a key task in many genome assemblers, and to this end, highly efficient algorithms have been developed to find alignments in large collections of sequences. It is well known that due to repeated sequences, many aligned pairs of reads nevertheless do not overlap. But no overlapping algorithm to date takes a rigorous approach to separating aligned but non-overlapping read pairs from true overlaps. We present an approach that extends the Minimus assembler by a data driven step to classify overlaps as true or false prior to contig construction. We trained several different classification models within the Weka framework using various statistics derived from overlaps of reads available from prior sequencing projects. These statistics included percent mismatch and k-mer frequencies within the overlaps as well as a comparative genomics score derived from mapping reads to multiple reference genomes. We show that in real whole-genome sequencing data from the E. coli and S. aureus genomes, by providing a curated set of overlaps to the contigging phase of the assembler, we nearly doubled the median contig length (N50) without sacrificing coverage of the genome or increasing the number of mis-assemblies. Machine learning methods that use comparative and non-comparative features to classify overlaps as true or false can be used to improve the quality of a sequence assembly.
Identification of optimum sequencing depth especially for de novo genome assembly of small genomes using next generation sequencing data.

PubMed

Desai, Aarti; Marwah, Veer Singh; Yadav, Akshay; Jha, Vineet; Dhaygude, Kishor; Bangar, Ujwala; Kulkarni, Vivek; Jere, Abhay

2013-01-01

Next Generation Sequencing (NGS) is a disruptive technology that has found widespread acceptance in the life sciences research community. The high throughput and low cost of sequencing has encouraged researchers to undertake ambitious genomic projects, especially in de novo genome sequencing. Currently, NGS systems generate sequence data as short reads and de novo genome assembly using these short reads is computationally very intensive. Due to lower cost of sequencing and higher throughput, NGS systems now provide the ability to sequence genomes at high depth. However, currently no report is available highlighting the impact of high sequence depth on genome assembly using real data sets and multiple assembly algorithms. Recently, some studies have evaluated the impact of sequence coverage, error rate and average read length on genome assembly using multiple assembly algorithms, however, these evaluations were performed using simulated datasets. One limitation of using simulated datasets is that variables such as error rates, read length and coverage which are known to impact genome assembly are carefully controlled. Hence, this study was undertaken to identify the minimum depth of sequencing required for de novo assembly for different sized genomes using graph based assembly algorithms and real datasets. Illumina reads for E.coli (4.6 MB) S.kudriavzevii (11.18 MB) and C.elegans (100 MB) were assembled using SOAPdenovo, Velvet, ABySS, Meraculous and IDBA-UD. Our analysis shows that 50X is the optimum read depth for assembling these genomes using all assemblers except Meraculous which requires 100X read depth. Moreover, our analysis shows that de novo assembly from 50X read data requires only 6-40 GB RAM depending on the genome size and assembly algorithm used. We believe that this information can be extremely valuable for researchers in designing experiments and multiplexing which will enable optimum utilization of sequencing as well as analysis resources.
RoboOligo: software for mass spectrometry data to support manual and de novo sequencing of post-transcriptionally modified ribonucleic acids

PubMed Central

Sample, Paul J.; Gaston, Kirk W.; Alfonzo, Juan D.; Limbach, Patrick A.

2015-01-01

Ribosomal ribonucleic acid (RNA), transfer RNA and other biological or synthetic RNA polymers can contain nucleotides that have been modified by the addition of chemical groups. Traditional Sanger sequencing methods cannot establish the chemical nature and sequence of these modified-nucleotide containing oligomers. Mass spectrometry (MS) has become the conventional approach for determining the nucleotide composition, modification status and sequence of modified RNAs. Modified RNAs are analyzed by MS using collision-induced dissociation tandem mass spectrometry (CID MS/MS), which produces a complex dataset of oligomeric fragments that must be interpreted to identify and place modified nucleosides within the RNA sequence. Here we report the development of RoboOligo, an interactive software program for the robust analysis of data generated by CID MS/MS of RNA oligomers. There are three main functions of RoboOligo: (i) automated de novo sequencing via the local search paradigm. (ii) Manual sequencing with real-time spectrum labeling and cumulative intensity scoring. (iii) A hybrid approach, coined ‘variable sequencing’, which combines the user intuition of manual sequencing with the high-throughput sampling of automated de novo sequencing. PMID:25820423
RNA-Seq Analysis of Cocos nucifera: Transcriptome Sequencing and De Novo Assembly for Subsequent Functional Genomics Approaches

PubMed Central

Xia, Wei; Mason, Annaliese S.; Xia, Zhihui; Qiao, Fei; Zhao, Songlin; Tang, Haoru

2013-01-01

Background Cocos nucifera (coconut), a member of the Arecaceae family, is an economically important woody palm grown in tropical regions. Despite its agronomic importance, previous germplasm assessment studies have relied solely on morphological and agronomical traits. Molecular biology techniques have been scarcely used in assessment of genetic resources and for improvement of important agronomic and quality traits in Cocos nucifera, mostly due to the absence of available sequence information. Methodology/Principal Findings To provide basic information for molecular breeding and further molecular biological analysis in Cocos nucifera, we applied RNA-seq technology and de novo assembly to gain a global overview of the Cocos nucifera transcriptome from mixed tissue samples. Using Illumina sequencing, we obtained 54.9 million short reads and conducted de novo assembly to obtain 57,304 unigenes with an average length of 752 base pairs. Sequence comparison between assembled unigenes and released cDNA sequences of Cocos nucifera and Elaeis guineensis indicated that the assembled sequences were of high quality. Approximately 99.9% of unigenes were novel compared to the released coconut EST sequences. Using BLASTX, 68.2% of unigenes were successfully annotated based on the Genbank non-redundant (Nr) protein database. The annotated unigenes were then further classified using the Gene Ontology (GO), Clusters of Orthologous Groups (COG) and Kyoto Encyclopedia of Genes and Genomes (KEGG) databases. Conclusions/Significance Our study provides a large quantity of novel genetic information for Cocos nucifera. This information will act as a valuable resource for further molecular genetic studies and breeding in coconut, as well as for isolation and characterization of functional genes involved in different biochemical pathways in this important tropical crop species. PMID:23555859
RNA-Seq analysis of Cocos nucifera: transcriptome sequencing and de novo assembly for subsequent functional genomics approaches.

PubMed

Fan, Haikuo; Xiao, Yong; Yang, Yaodong; Xia, Wei; Mason, Annaliese S; Xia, Zhihui; Qiao, Fei; Zhao, Songlin; Tang, Haoru

2013-01-01

Cocos nucifera (coconut), a member of the Arecaceae family, is an economically important woody palm grown in tropical regions. Despite its agronomic importance, previous germplasm assessment studies have relied solely on morphological and agronomical traits. Molecular biology techniques have been scarcely used in assessment of genetic resources and for improvement of important agronomic and quality traits in Cocos nucifera, mostly due to the absence of available sequence information. To provide basic information for molecular breeding and further molecular biological analysis in Cocos nucifera, we applied RNA-seq technology and de novo assembly to gain a global overview of the Cocos nucifera transcriptome from mixed tissue samples. Using Illumina sequencing, we obtained 54.9 million short reads and conducted de novo assembly to obtain 57,304 unigenes with an average length of 752 base pairs. Sequence comparison between assembled unigenes and released cDNA sequences of Cocos nucifera and Elaeis guineensis indicated that the assembled sequences were of high quality. Approximately 99.9% of unigenes were novel compared to the released coconut EST sequences. Using BLASTX, 68.2% of unigenes were successfully annotated based on the Genbank non-redundant (Nr) protein database. The annotated unigenes were then further classified using the Gene Ontology (GO), Clusters of Orthologous Groups (COG) and Kyoto Encyclopedia of Genes and Genomes (KEGG) databases. Our study provides a large quantity of novel genetic information for Cocos nucifera. This information will act as a valuable resource for further molecular genetic studies and breeding in coconut, as well as for isolation and characterization of functional genes involved in different biochemical pathways in this important tropical crop species.
mPUMA: a computational approach to microbiota analysis by de novo assembly of operational taxonomic units based on protein-coding barcode sequences.

PubMed

Links, Matthew G; Chaban, Bonnie; Hemmingsen, Sean M; Muirhead, Kevin; Hill, Janet E

2013-08-15

Formation of operational taxonomic units (OTU) is a common approach to data aggregation in microbial ecology studies based on amplification and sequencing of individual gene targets. The de novo assembly of OTU sequences has been recently demonstrated as an alternative to widely used clustering methods, providing robust information from experimental data alone, without any reliance on an external reference database. Here we introduce mPUMA (microbial Profiling Using Metagenomic Assembly, http://mpuma.sourceforge.net), a software package for identification and analysis of protein-coding barcode sequence data. It was developed originally for Cpn60 universal target sequences (also known as GroEL or Hsp60). Using an unattended process that is independent of external reference sequences, mPUMA forms OTUs by DNA sequence assembly and is capable of tracking OTU abundance. mPUMA processes microbial profiles both in terms of the direct DNA sequence as well as in the translated amino acid sequence for protein coding barcodes. By forming OTUs and calculating abundance through an assembly approach, mPUMA is capable of generating inputs for several popular microbiota analysis tools. Using SFF data from sequencing of a synthetic community of Cpn60 sequences derived from the human vaginal microbiome, we demonstrate that mPUMA can faithfully reconstruct all expected OTU sequences and produce compositional profiles consistent with actual community structure. mPUMA enables analysis of microbial communities while empowering the discovery of novel organisms through OTU assembly.
High quality de novo sequencing and assembly of the Saccharomyces arboricolus genome

PubMed Central

2013-01-01

Background Comparative genomics is a formidable tool to identify functional elements throughout a genome. In the past ten years, studies in the budding yeast Saccharomyces cerevisiae and a set of closely related species have been instrumental in showing the benefit of analyzing patterns of sequence conservation. Increasing the number of closely related genome sequences makes the comparative genomics approach more powerful and accurate. Results Here, we report the genome sequence and analysis of Saccharomyces arboricolus, a yeast species recently isolated in China, that is closely related to S. cerevisiae. We obtained high quality de novo sequence and assemblies using a combination of next generation sequencing technologies, established the phylogenetic position of this species and considered its phenotypic profile under multiple environmental conditions in the light of its gene content and phylogeny. Conclusions We suggest that the genome of S. arboricolus will be useful in future comparative genomics analysis of the Saccharomyces sensu stricto yeasts. PMID:23368932
De novo peptide sequencing using CID and HCD spectra pairs.

PubMed

Yan, Yan; Kusalik, Anthony J; Wu, Fang-Xiang

2016-10-01

In tandem mass spectrometry (MS/MS), there are several different fragmentation techniques possible, including, collision-induced dissociation (CID) higher energy collisional dissociation (HCD), electron-capture dissociation (ECD), and electron transfer dissociation (ETD). When using pairs of spectra for de novo peptide sequencing, the most popular methods are designed for CID (or HCD) and ECD (or ETD) spectra because of the complementarity between them. Less attention has been paid to the use of CID and HCD spectra pairs. In this study, a new de novo peptide sequencing method is proposed for these spectra pairs. This method includes a CID and HCD spectra merging criterion and a parent mass correction step, along with improvements to our previously proposed algorithm for sequencing merged spectra. Three pairs of spectral datasets were used to investigate and compare the performance of the proposed method with other existing methods designed for single spectrum (HCD or CID) sequencing. Experimental results showed that full-length peptide sequencing accuracy was increased significantly by using spectra pairs in the proposed method, with the highest accuracy reaching 81.31%. © 2016 WILEY-VCH Verlag GmbH & Co. KGaA, Weinheim.

Evaluation and validation of de novo and hybrid assembly techniques to derive high quality genome sequences

DOE PAGES

Utturkar, Sagar M.; Klingeman, Dawn Marie; Land, Miriam L.; ...

2014-06-14

Our motivation with this work was to assess the potential of different types of sequence data combined with de novo and hybrid assembly approaches to improve existing draft genome sequences. Our results show Illumina, 454 and PacBio sequencing technologies were used to generate de novo and hybrid genome assemblies for four different bacteria, which were assessed for quality using summary statistics (e.g. number of contigs, N50) and in silico evaluation tools. Differences in predictions of multiple copies of rDNA operons for each respective bacterium were evaluated by PCR and Sanger sequencing, and then the validated results were applied as anmore » additional criterion to rank assemblies. In general, assemblies using longer PacBio reads were better able to resolve repetitive regions. In this study, the combination of Illumina and PacBio sequence data assembled through the ALLPATHS-LG algorithm gave the best summary statistics and most accurate rDNA operon number predictions. This study will aid others looking to improve existing draft genome assemblies. As to availability and implementation–all assembly tools except CLC Genomics Workbench are freely available under GNU General Public License.« less
De novo transcript sequence reconstruction from RNA-Seq: reference generation and analysis with Trinity

PubMed Central

Yassour, Moran; Grabherr, Manfred; Blood, Philip D.; Bowden, Joshua; Couger, Matthew Brian; Eccles, David; Li, Bo; Lieber, Matthias; MacManes, Matthew D.; Ott, Michael; Orvis, Joshua; Pochet, Nathalie; Strozzi, Francesco; Weeks, Nathan; Westerman, Rick; William, Thomas; Dewey, Colin N.; Henschel, Robert; LeDuc, Richard D.; Friedman, Nir; Regev, Aviv

2013-01-01

De novo assembly of RNA-Seq data allows us to study transcriptomes without the need for a genome sequence, such as in non-model organisms of ecological and evolutionary importance, cancer samples, or the microbiome. In this protocol, we describe the use of the Trinity platform for de novo transcriptome assembly from RNA-Seq data in non-model organisms. We also present Trinity’s supported companion utilities for downstream applications, including RSEM for transcript abundance estimation, R/Bioconductor packages for identifying differentially expressed transcripts across samples, and approaches to identify protein coding genes. In an included tutorial we provide a workflow for genome-independent transcriptome analysis leveraging the Trinity platform. The software, documentation and demonstrations are freely available from http://trinityrnaseq.sf.net. PMID:23845962
A family-based probabilistic method for capturing de novo mutations from high-throughput short-read sequencing data.

PubMed

Cartwright, Reed A; Hussin, Julie; Keebler, Jonathan E M; Stone, Eric A; Awadalla, Philip

2012-01-06

Recent advances in high-throughput DNA sequencing technologies and associated statistical analyses have enabled in-depth analysis of whole-genome sequences. As this technology is applied to a growing number of individual human genomes, entire families are now being sequenced. Information contained within the pedigree of a sequenced family can be leveraged when inferring the donors' genotypes. The presence of a de novo mutation within the pedigree is indicated by a violation of Mendelian inheritance laws. Here, we present a method for probabilistically inferring genotypes across a pedigree using high-throughput sequencing data and producing the posterior probability of de novo mutation at each genomic site examined. This framework can be used to disentangle the effects of germline and somatic mutational processes and to simultaneously estimate the effect of sequencing error and the initial genetic variation in the population from which the founders of the pedigree arise. This approach is examined in detail through simulations and areas for method improvement are noted. By applying this method to data from members of a well-defined nuclear family with accurate pedigree information, the stage is set to make the most direct estimates of the human mutation rate to date.
Exome sequencing supports a de novo mutational paradigm for schizophrenia

PubMed Central

Xu, Bin; Roos, J. Louw; Dexheimer, Phillip; Boone, Braden; Plummer, Brooks; Levy, Shawn; Gogos, Joseph A.; Karayiorgou, Maria

2011-01-01

Despite high heritability, a large fraction of cases with schizophrenia do not have a family history of the disease (sporadic cases). Here, we examine the possibility that rare de novo protein-altering mutations contribute to the genetic component of schizophrenia by sequencing the exome of 53 sporadic cases, 22 unaffected controls and their parents. We identified 40 de novo mutations in 27 patients affecting 40 genes including a potentially disruptive mutation in DGCR2, a gene removed by the recurrent schizophrenia-predisposing 22q11.2 microdeletion. Comparison to rare inherited variants revealed that the identified de novo mutations show a large excess of nonsynonymous changes in cases, as well as a greater potential to affect protein structure and function. Our analysis reveals a major role of de novo mutations in schizophrenia and also a large mutational target, which together provide a plausible explanation for the high global incidence and persistence of the disease. PMID:21822266
RNA-seq analysis of Rubus idaeus cv. Nova: transcriptome sequencing and de novo assembly for subsequent functional genomics approaches.

PubMed

Hyun, Tae Kyung; Lee, Sarah; Kumar, Dhinesh; Rim, Yeonggil; Kumar, Ritesh; Lee, Sang Yeol; Lee, Choong Hwan; Kim, Jae-Yean

2014-10-01

Using Illumina sequencing technology, we have generated the large-scale transcriptome sequencing data containing abundant information on genes involved in the metabolic pathways in R. idaeus cv. Nova fruits. Rubus idaeus (Red raspberry) is one of the important economical crops that possess numerous nutrients, micronutrients and phytochemicals with essential health benefits to human. The molecular mechanism underlying the ripening process and phytochemical biosynthesis in red raspberry is attributed to the changes in gene expression, but very limited transcriptomic and genomic information in public databases is available. To address this issue, we generated more than 51 million sequencing reads from R. idaeus cv. Nova fruit using Illumina RNA-Seq technology. After de novo assembly, we obtained 42,604 unigenes with an average length of 812 bp. At the protein level, Nova fruit transcriptome showed 77 and 68 % sequence similarities with Rubus coreanus and Fragaria versa, respectively, indicating the evolutionary relationship between them. In addition, 69 % of assembled unigenes were annotated using public databases including NCBI non-redundant, Cluster of Orthologous Groups and Gene ontology database, suggesting that our transcriptome dataset provides a valuable resource for investigating metabolic processes in red raspberry. To analyze the relationship between several novel transcripts and the amounts of metabolites such as γ-aminobutyric acid and anthocyanins, real-time PCR and target metabolite analysis were performed on two different ripening stages of Nova. This is the first attempt using Illumina sequencing platform for RNA sequencing and de novo assembly of Nova fruit without reference genome. Our data provide the most comprehensive transcriptome resource available for Rubus fruits, and will be useful for understanding the ripening process and for breeding R. idaeus cultivars with improved fruit quality.
Combining phage display with de novo protein sequencing for reverse engineering of monoclonal antibodies

PubMed Central

Rickert, Keith W.; Grinberg, Luba; Woods, Robert M.; Wilson, Susan; Bowen, Michael A.; Baca, Manuel

2016-01-01

ABSTRACT The enormous diversity created by gene recombination and somatic hypermutation makes de novo protein sequencing of monoclonal antibodies a uniquely challenging problem. Modern mass spectrometry-based sequencing will rarely, if ever, provide a single unambiguous sequence for the variable domains. A more likely outcome is computation of an ensemble of highly similar sequences that can satisfy the experimental data. This outcome can result in the need for empirical testing of many candidate sequences, sometimes iteratively, to identity one which can replicate the activity of the parental antibody. Here we describe an improved approach to antibody protein sequencing by using phage display technology to generate a combinatorial library of sequences that satisfy the mass spectrometry data, and selecting for functional candidates that bind antigen. This approach was used to reverse engineer 2 commercially-obtained monoclonal antibodies against murine CD137. Proteomic data enabled us to assign the majority of the variable domain sequences, with the exception of 3–5% of the sequence located within or adjacent to complementarity-determining regions. To efficiently resolve the sequence in these regions, small phage-displayed libraries were generated and subjected to antigen binding selection. Following enrichment of antigen-binding clones, 2 clones were selected for each antibody and recombinantly expressed as antigen-binding fragments (Fabs). In both cases, the reverse-engineered Fabs exhibited identical antigen binding affinity, within error, as Fabs produced from the commercial IgGs. This combination of proteomic and protein engineering techniques provides a useful approach to simplifying the technically challenging process of reverse engineering monoclonal antibodies from protein material. PMID:26852694
Combining phage display with de novo protein sequencing for reverse engineering of monoclonal antibodies.

PubMed

Rickert, Keith W; Grinberg, Luba; Woods, Robert M; Wilson, Susan; Bowen, Michael A; Baca, Manuel

2016-01-01

The enormous diversity created by gene recombination and somatic hypermutation makes de novo protein sequencing of monoclonal antibodies a uniquely challenging problem. Modern mass spectrometry-based sequencing will rarely, if ever, provide a single unambiguous sequence for the variable domains. A more likely outcome is computation of an ensemble of highly similar sequences that can satisfy the experimental data. This outcome can result in the need for empirical testing of many candidate sequences, sometimes iteratively, to identity one which can replicate the activity of the parental antibody. Here we describe an improved approach to antibody protein sequencing by using phage display technology to generate a combinatorial library of sequences that satisfy the mass spectrometry data, and selecting for functional candidates that bind antigen. This approach was used to reverse engineer 2 commercially-obtained monoclonal antibodies against murine CD137. Proteomic data enabled us to assign the majority of the variable domain sequences, with the exception of 3-5% of the sequence located within or adjacent to complementarity-determining regions. To efficiently resolve the sequence in these regions, small phage-displayed libraries were generated and subjected to antigen binding selection. Following enrichment of antigen-binding clones, 2 clones were selected for each antibody and recombinantly expressed as antigen-binding fragments (Fabs). In both cases, the reverse-engineered Fabs exhibited identical antigen binding affinity, within error, as Fabs produced from the commercial IgGs. This combination of proteomic and protein engineering techniques provides a useful approach to simplifying the technically challenging process of reverse engineering monoclonal antibodies from protein material.
Distilled single-cell genome sequencing and de novo assembly for sparse microbial communities.

PubMed

Taghavi, Zeinab; Movahedi, Narjes S; Draghici, Sorin; Chitsaz, Hamidreza

2013-10-01

Identification of every single genome present in a microbial sample is an important and challenging task with crucial applications. It is challenging because there are typically millions of cells in a microbial sample, the vast majority of which elude cultivation. The most accurate method to date is exhaustive single-cell sequencing using multiple displacement amplification, which is simply intractable for a large number of cells. However, there is hope for breaking this barrier, as the number of different cell types with distinct genome sequences is usually much smaller than the number of cells. Here, we present a novel divide and conquer method to sequence and de novo assemble all distinct genomes present in a microbial sample with a sequencing cost and computational complexity proportional to the number of genome types, rather than the number of cells. The method is implemented in a tool called Squeezambler. We evaluated Squeezambler on simulated data. The proposed divide and conquer method successfully reduces the cost of sequencing in comparison with the naïve exhaustive approach. Squeezambler and datasets are available at http://compbio.cs.wayne.edu/software/squeezambler/.
Precise detection of de novo single nucleotide variants in human genomes.

PubMed

Gómez-Romero, Laura; Palacios-Flores, Kim; Reyes, José; García, Delfino; Boege, Margareta; Dávila, Guillermo; Flores, Margarita; Schatz, Michael C; Palacios, Rafael

2018-05-22

The precise determination of de novo genetic variants has enormous implications across different fields of biology and medicine, particularly personalized medicine. Currently, de novo variations are identified by mapping sample reads from a parent-offspring trio to a reference genome, allowing for a certain degree of differences. While widely used, this approach often introduces false-positive (FP) results due to misaligned reads and mischaracterized sequencing errors. In a previous study, we developed an alternative approach to accurately identify single nucleotide variants (SNVs) using only perfect matches. However, this approach could be applied only to haploid regions of the genome and was computationally intensive. In this study, we present a unique approach, coverage-based single nucleotide variant identification (COBASI), which allows the exploration of the entire genome using second-generation short sequence reads without extensive computing requirements. COBASI identifies SNVs using changes in coverage of exactly matching unique substrings, and is particularly suited for pinpointing de novo SNVs. Unlike other approaches that require population frequencies across hundreds of samples to filter out any methodological biases, COBASI can be applied to detect de novo SNVs within isolated families. We demonstrate this capability through extensive simulation studies and by studying a parent-offspring trio we sequenced using short reads. Experimental validation of all 58 candidate de novo SNVs and a selection of non-de novo SNVs found in the trio confirmed zero FP calls. COBASI is available as open source at https://github.com/Laura-Gomez/COBASI for any researcher to use. Copyright © 2018 the Author(s). Published by PNAS.
First de novo whole genome sequencing and assembly of the pink-footed goose.

PubMed

Pujolar, J M; Dalén, L; Olsen, R A; Hansen, M M; Madsen, J

2018-03-01

Annotated genomes can provide new perspectives on the biology of species. We present the first de novo whole genome sequencing for the pink-footed goose. In order to obtain a high-quality de novo assembly the strategy used was to combine one short insert paired-end library with two mate-pair libraries. The pink-footed goose genome was assembled de novo using three different assemblers and an assembly evaluation was subsequently performed in order to choose the best assembler. For our data, ALLPATHS-LG performed the best, since the assembly produced covers most of the genome, while introducing the fewest errors. A total of 26,134 genes were annotated, with bird species accounting for virtually all BLAST hits. We also estimated the substitution rate in the pink-footed goose, which can be of use in future demographic studies, by using a comparative approach with the genome of the chicken, the mallard and the swan goose. A substitution rate of 1.38×10 -7 per nucleotide per generation was obtained when comparing the genomes of the two closely-related goose species (the pink-footed and the swan goose). Altogether, we provide a valuable tool for future genomic studies aiming at particular genes and regions of the pink-footed goose genome as well as other bird species. Copyright © 2017 Elsevier Inc. All rights reserved.
De novo characterization of Lentinula edodes C(91-3) transcriptome by deep Solexa sequencing.

PubMed

Zhong, Mintao; Liu, Ben; Wang, Xiaoli; Liu, Lei; Lun, Yongzhi; Li, Xingyun; Ning, Anhong; Cao, Jing; Huang, Min

2013-02-01

Lentinula edodes, has been utilized as food, as well as, in popular medicine, moreover, its extract isolated from its mycelium and fruiting body have shown several therapeutic properties. Yet little is understood about its genes involved in these properties, and the absence of L.edodes genomes has been a barrier to the development of functional genomics research. However, high throughput sequencing technologies are now being widely applied to non-model species. To facilitate research on L.edodes, we leveraged Solexa sequencing technology in de novo assembly of L.edodes C(91-3) transcriptome. In a single run, we produced more than 57 million sequencing reads. These reads were assembled into 28,923 unigene sequences (mean size=689bp) including 18,120 unigenes with coding sequence (CDS). Based on similarity search with known proteins, assembled unigene sequences were annotated with gene descriptions, gene ontology (GO) and clusters of orthologous group (COG) terms. Our data provides the first comprehensive sequence resource available for functional genomics studies in L.edodes, and demonstrates the utility of Illumina/Solexa sequencing for de novo transcriptome characterization and gene discovery in a non-model mushroom. Copyright © 2012 Elsevier Inc. All rights reserved.
De novo transcript sequence reconstruction from RNA-seq using the Trinity platform for reference generation and analysis.

PubMed

Haas, Brian J; Papanicolaou, Alexie; Yassour, Moran; Grabherr, Manfred; Blood, Philip D; Bowden, Joshua; Couger, Matthew Brian; Eccles, David; Li, Bo; Lieber, Matthias; MacManes, Matthew D; Ott, Michael; Orvis, Joshua; Pochet, Nathalie; Strozzi, Francesco; Weeks, Nathan; Westerman, Rick; William, Thomas; Dewey, Colin N; Henschel, Robert; LeDuc, Richard D; Friedman, Nir; Regev, Aviv

2013-08-01

De novo assembly of RNA-seq data enables researchers to study transcriptomes without the need for a genome sequence; this approach can be usefully applied, for instance, in research on 'non-model organisms' of ecological and evolutionary importance, cancer samples or the microbiome. In this protocol we describe the use of the Trinity platform for de novo transcriptome assembly from RNA-seq data in non-model organisms. We also present Trinity-supported companion utilities for downstream applications, including RSEM for transcript abundance estimation, R/Bioconductor packages for identifying differentially expressed transcripts across samples and approaches to identify protein-coding genes. In the procedure, we provide a workflow for genome-independent transcriptome analysis leveraging the Trinity platform. The software, documentation and demonstrations are freely available from http://trinityrnaseq.sourceforge.net. The run time of this protocol is highly dependent on the size and complexity of data to be analyzed. The example data set analyzed in the procedure detailed herein can be processed in less than 5 h.
UVnovo: A De Novo Sequencing Algorithm Using Single Series of Fragment Ions via Chromophore Tagging and 351 nm Ultraviolet Photodissociation Mass Spectrometry

PubMed Central

Robotham, Scott A.; Horton, Andrew P.; Cannon, Joe R.; Cotham, Victoria C.; Marcotte, Edward M.; Brodbelt, Jennifer S.

2016-01-01

De novo peptide sequencing by mass spectrometry represents an important strategy for characterizing novel peptides and proteins, in which a peptide’s amino acid sequence is inferred directly from the precursor peptide mass and tandem mass spectrum (MS/MS or MS3) fragment ions, without comparison to a reference proteome. This method is ideal for organisms or samples lacking a complete or well-annotated reference sequence set. One of the major barriers to de novo spectral interpretation arises from confusion of N- and C-terminal ion series due to the symmetry between b and y ion pairs created by collisional activation methods (or c, z ions for electron-based activation methods). This is known as the ‘antisymmetric path problem’ and leads to inverted amino acid subsequences within a de novo reconstruction. Here, we combine several key strategies for de novo peptide sequencing into a single high-throughput pipeline: high efficiency carbamylation blocks lysine side chains, and subsequent tryptic digestion and N-terminal peptide derivatization with the ultraviolet chromophore AMCA yields peptides susceptible to 351 nm ultraviolet photodissociation (UVPD). UVPD-MS/MS of the AMCA-modified peptides then predominantly produces y ions in the MS/MS spectra, specifically addressing the antisymmetric path problem. Finally, the program UVnovo applies a random forest algorithm to automatically learn from and then interpret UVPD mass spectra, passing results to a hidden Markov model for de novo sequence prediction and scoring. We show this combined strategy provides high performance de novo peptide sequencing, enabling the de novo sequencing of thousands of peptides from an E. coli lysate at high confidence. PMID:26938041
Memetic algorithms for de novo motif-finding in biomedical sequences.

PubMed

Bi, Chengpeng

2012-09-01

The objectives of this study are to design and implement a new memetic algorithm for de novo motif discovery, which is then applied to detect important signals hidden in various biomedical molecular sequences. In this paper, memetic algorithms are developed and tested in de novo motif-finding problems. Several strategies in the algorithm design are employed that are to not only efficiently explore the multiple sequence local alignment space, but also effectively uncover the molecular signals. As a result, there are a number of key features in the implementation of the memetic motif-finding algorithm (MaMotif), including a chromosome replacement operator, a chromosome alteration-aware local search operator, a truncated local search strategy, and a stochastic operation of local search imposed on individual learning. To test the new algorithm, we compare MaMotif with a few of other similar algorithms using simulated and experimental data including genomic DNA, primary microRNA sequences (let-7 family), and transmembrane protein sequences. The new memetic motif-finding algorithm is successfully implemented in C++, and exhaustively tested with various simulated and real biological sequences. In the simulation, it shows that MaMotif is the most time-efficient algorithm compared with others, that is, it runs 2 times faster than the expectation maximization (EM) method and 16 times faster than the genetic algorithm-based EM hybrid. In both simulated and experimental testing, results show that the new algorithm is compared favorably or superior to other algorithms. Notably, MaMotif is able to successfully discover the transcription factors' binding sites in the chromatin immunoprecipitation followed by massively parallel sequencing (ChIP-Seq) data, correctly uncover the RNA splicing signals in gene expression, and precisely find the highly conserved helix motif in the transmembrane protein sequences, as well as rightly detect the palindromic segments in the primary micro
The long reads ahead: de novo genome assembly using the MinION

PubMed Central

de Lannoy, Carlos; de Ridder, Dick; Risse, Judith

2017-01-01

Nanopore technology provides a novel approach to DNA sequencing that yields long, label-free reads of constant quality. The first commercial implementation of this approach, the MinION, has shown promise in various sequencing applications. This review gives an up-to-date overview of the MinION's utility as a de novo sequencing device. It is argued that the MinION may allow for portable and affordable de novo sequencing of even complex genomes in the near future, despite the currently error-prone nature of its reads. Through continuous updates to the MinION hardware and the development of new assembly pipelines, both sequencing accuracy and assembly quality have already risen rapidly. However, this fast pace of development has also lead to a lack of overview of the expanding landscape of analysis tools, as performance evaluations are outdated quickly. As the MinION is approaching a state of maturity, its user community would benefit from a thorough comparative benchmarking effort of de novo assembly pipelines in the near future. An earlier version of this article can be found on bioRxiv. PMID:29375809
Cost-effective sequencing of full-length cDNA clones powered by a de novo-reference hybrid assembly.

PubMed

Kuroshu, Reginaldo M; Watanabe, Junichi; Sugano, Sumio; Morishita, Shinichi; Suzuki, Yutaka; Kasahara, Masahiro

2010-05-07

Sequencing full-length cDNA clones is important to determine gene structures including alternative splice forms, and provides valuable resources for experimental analyses to reveal the biological functions of coded proteins. However, previous approaches for sequencing cDNA clones were expensive or time-consuming, and therefore, a fast and efficient sequencing approach was demanded. We developed a program, MuSICA 2, that assembles millions of short (36-nucleotide) reads collected from a single flow cell lane of Illumina Genome Analyzer to shotgun-sequence approximately 800 human full-length cDNA clones. MuSICA 2 performs a hybrid assembly in which an external de novo assembler is run first and the result is then improved by reference alignment of shotgun reads. We compared the MuSICA 2 assembly with 200 pooled full-length cDNA clones finished independently by the conventional primer-walking using Sanger sequencers. The exon-intron structure of the coding sequence was correct for more than 95% of the clones with coding sequence annotation when we excluded cDNA clones insufficiently represented in the shotgun library due to PCR failure (42 out of 200 clones excluded), and the nucleotide-level accuracy of coding sequences of those correct clones was over 99.99%. We also applied MuSICA 2 to full-length cDNA clones from Toxoplasma gondii, to confirm that its ability was competent even for non-human species. The entire sequencing and shotgun assembly takes less than 1 week and the consumables cost only approximately US$3 per clone, demonstrating a significant advantage over previous approaches.
When less is more: 'slicing' sequencing data improves read decoding accuracy and de novo assembly quality.

PubMed

Lonardi, Stefano; Mirebrahim, Hamid; Wanamaker, Steve; Alpert, Matthew; Ciardo, Gianfranco; Duma, Denisa; Close, Timothy J

2015-09-15

As the invention of DNA sequencing in the 70s, computational biologists have had to deal with the problem of de novo genome assembly with limited (or insufficient) depth of sequencing. In this work, we investigate the opposite problem, that is, the challenge of dealing with excessive depth of sequencing. We explore the effect of ultra-deep sequencing data in two domains: (i) the problem of decoding reads to bacterial artificial chromosome (BAC) clones (in the context of the combinatorial pooling design we have recently proposed), and (ii) the problem of de novo assembly of BAC clones. Using real ultra-deep sequencing data, we show that when the depth of sequencing increases over a certain threshold, sequencing errors make these two problems harder and harder (instead of easier, as one would expect with error-free data), and as a consequence the quality of the solution degrades with more and more data. For the first problem, we propose an effective solution based on 'divide and conquer': we 'slice' a large dataset into smaller samples of optimal size, decode each slice independently, and then merge the results. Experimental results on over 15 000 barley BACs and over 4000 cowpea BACs demonstrate a significant improvement in the quality of the decoding and the final assembly. For the second problem, we show for the first time that modern de novo assemblers cannot take advantage of ultra-deep sequencing data. Python scripts to process slices and resolve decoding conflicts are available from http://goo.gl/YXgdHT; software Hashfilter can be downloaded from http://goo.gl/MIyZHs stelo@cs.ucr.edu or timothy.close@ucr.edu Supplementary data are available at Bioinformatics online. © The Author 2015. Published by Oxford University Press. All rights reserved. For Permissions, please e-mail: journals.permissions@oup.com.
Genome sequencing of bacteria: sequencing, de novo assembly and rapid analysis using open source tools.

PubMed

Kisand, Veljo; Lettieri, Teresa

2013-04-01

De novo genome sequencing of previously uncharacterized microorganisms has the potential to open up new frontiers in microbial genomics by providing insight into both functional capabilities and biodiversity. Until recently, Roche 454 pyrosequencing was the NGS method of choice for de novo assembly because it generates hundreds of thousands of long reads (<450 bps), which are presumed to aid in the analysis of uncharacterized genomes. The array of tools for processing NGS data are increasingly free and open source and are often adopted for both their high quality and role in promoting academic freedom. The error rate of pyrosequencing the Alcanivorax borkumensis genome was such that thousands of insertions and deletions were artificially introduced into the finished genome. Despite a high coverage (~30 fold), it did not allow the reference genome to be fully mapped. Reads from regions with errors had low quality, low coverage, or were missing. The main defect of the reference mapping was the introduction of artificial indels into contigs through lower than 100% consensus and distracting gene calling due to artificial stop codons. No assembler was able to perform de novo assembly comparable to reference mapping. Automated annotation tools performed similarly on reference mapped and de novo draft genomes, and annotated most CDSs in the de novo assembled draft genomes. Free and open source software (FOSS) tools for assembly and annotation of NGS data are being developed rapidly to provide accurate results with less computational effort. Usability is not high priority and these tools currently do not allow the data to be processed without manual intervention. Despite this, genome assemblers now readily assemble medium short reads into long contigs (>97-98% genome coverage). A notable gap in pyrosequencing technology is the quality of base pair calling and conflicting base pairs between single reads at the same nucleotide position. Regardless, using draft whole genomes
Cost-Effective Sequencing of Full-Length cDNA Clones Powered by a De Novo-Reference Hybrid Assembly

PubMed Central

Sugano, Sumio; Morishita, Shinichi; Suzuki, Yutaka

2010-01-01

Background Sequencing full-length cDNA clones is important to determine gene structures including alternative splice forms, and provides valuable resources for experimental analyses to reveal the biological functions of coded proteins. However, previous approaches for sequencing cDNA clones were expensive or time-consuming, and therefore, a fast and efficient sequencing approach was demanded. Methodology We developed a program, MuSICA 2, that assembles millions of short (36-nucleotide) reads collected from a single flow cell lane of Illumina Genome Analyzer to shotgun-sequence ∼800 human full-length cDNA clones. MuSICA 2 performs a hybrid assembly in which an external de novo assembler is run first and the result is then improved by reference alignment of shotgun reads. We compared the MuSICA 2 assembly with 200 pooled full-length cDNA clones finished independently by the conventional primer-walking using Sanger sequencers. The exon-intron structure of the coding sequence was correct for more than 95% of the clones with coding sequence annotation when we excluded cDNA clones insufficiently represented in the shotgun library due to PCR failure (42 out of 200 clones excluded), and the nucleotide-level accuracy of coding sequences of those correct clones was over 99.99%. We also applied MuSICA 2 to full-length cDNA clones from Toxoplasma gondii, to confirm that its ability was competent even for non-human species. Conclusions The entire sequencing and shotgun assembly takes less than 1 week and the consumables cost only ∼US$3 per clone, demonstrating a significant advantage over previous approaches. PMID:20479877
De Novo Centromere Formation and Centromeric Sequence Expansion in Wheat and its Wide Hybrids.

PubMed

Guo, Xiang; Su, Handong; Shi, Qinghua; Fu, Shulan; Wang, Jing; Zhang, Xiangqi; Hu, Zanmin; Han, Fangpu

2016-04-01

Centromeres typically contain tandem repeat sequences, but centromere function does not necessarily depend on these sequences. We identified functional centromeres with significant quantitative changes in the centromeric retrotransposons of wheat (CRW) contents in wheat aneuploids (Triticum aestivum) and the offspring of wheat wide hybrids. The CRW signals were strongly reduced or essentially lost in some wheat ditelosomic lines and in the addition lines from the wide hybrids. The total loss of the CRW sequences but the presence of CENH3 in these lines suggests that the centromeres were formed de novo. In wheat and its wide hybrids, which carry large complex genomes or no sequenced genome, we performed CENH3-ChIP-dot-blot methods alone or in combination with CENH3-ChIP-seq and identified the ectopic genomic sequences present at the new centromeres. In adcdition, the transcription of the identified DNA sequences was remarkably increased at the new centromere, suggesting that the transcription of the corresponding sequences may be associated with de novo centromere formation. Stable alien chromosomes with two and three regions containing CRW sequences induced by centromere breakage were observed in the wheat-Th. elongatum hybrid derivatives, but only one was a functional centromere. In wheat-rye (Secale cereale) hybrids, the rye centromere-specific sequences spread along the chromosome arms and may have caused centromere expansion. Frequent and significant quantitative alterations in the centromere sequence via chromosomal rearrangement have been systematically described in wheat wide hybridizations, which may affect the retention or loss of the alien chromosomes in the hybrids. Thus, the centromere behavior in wide crosses likely has an important impact on the generation of biodiversity, which ultimately has implications for speciation.

De Novo Centromere Formation and Centromeric Sequence Expansion in Wheat and its Wide Hybrids

PubMed Central

Fu, Shulan; Wang, Jing; Zhang, Xiangqi; Hu, Zanmin; Han, Fangpu

2016-01-01

Centromeres typically contain tandem repeat sequences, but centromere function does not necessarily depend on these sequences. We identified functional centromeres with significant quantitative changes in the centromeric retrotransposons of wheat (CRW) contents in wheat aneuploids (Triticum aestivum) and the offspring of wheat wide hybrids. The CRW signals were strongly reduced or essentially lost in some wheat ditelosomic lines and in the addition lines from the wide hybrids. The total loss of the CRW sequences but the presence of CENH3 in these lines suggests that the centromeres were formed de novo. In wheat and its wide hybrids, which carry large complex genomes or no sequenced genome, we performed CENH3-ChIP-dot-blot methods alone or in combination with CENH3-ChIP-seq and identified the ectopic genomic sequences present at the new centromeres. In adcdition, the transcription of the identified DNA sequences was remarkably increased at the new centromere, suggesting that the transcription of the corresponding sequences may be associated with de novo centromere formation. Stable alien chromosomes with two and three regions containing CRW sequences induced by centromere breakage were observed in the wheat-Th. elongatum hybrid derivatives, but only one was a functional centromere. In wheat-rye (Secale cereale) hybrids, the rye centromere-specific sequences spread along the chromosome arms and may have caused centromere expansion. Frequent and significant quantitative alterations in the centromere sequence via chromosomal rearrangement have been systematically described in wheat wide hybridizations, which may affect the retention or loss of the alien chromosomes in the hybrids. Thus, the centromere behavior in wide crosses likely has an important impact on the generation of biodiversity, which ultimately has implications for speciation. PMID:27110907
Multiplex De Novo Sequencing of Peptide Antibiotics

NASA Astrophysics Data System (ADS)

Mohimani, Hosein; Liu, Wei-Ting; Yang, Yu-Liang; Gaudêncio, Susana P.; Fenical, William; Dorrestein, Pieter C.; Pevzner, Pavel A.

Proliferation of drug-resistant diseases raises the challenge of searching for new, more efficient antibiotics. Currently, some of the most effective antibiotics (i.e., Vancomycin and Daptomycin) are cyclic peptides produced by non-ribosomal biosynthetic pathways. The isolation and sequencing of cyclic peptide antibiotics, unlike the same activity with linear peptides, is time-consuming and error-prone. The dominant technique for sequencing cyclic peptides is NMR-based and requires large amounts (milligrams) of purified materials that, for most compounds, are not possible to obtain. Given these facts, there is a need for new tools to sequence cyclic NRPs using picograms of material. Since nearly all cyclic NRPs are produced along with related analogs, we develop a mass spectrometry approach for sequencing all related peptides at once (in contrast to the existing approach that analyzes individual peptides). Our results suggest that instead of attempting to isolate and NMR-sequence the most abundant compound, one should acquire spectra of many related compounds and sequence all of them simultaneously using tandem mass spectrometry. We illustrate applications of this approach by sequencing new variants of cyclic peptide antibiotics from Bacillus brevis, as well as sequencing a previously unknown familiy of cyclic NRPs produced by marine bacteria.
The sequence and de novo assembly of the giant panda genome

PubMed Central

Li, Ruiqiang; Fan, Wei; Tian, Geng; Zhu, Hongmei; He, Lin; Cai, Jing; Huang, Quanfei; Cai, Qingle; Li, Bo; Bai, Yinqi; Zhang, Zhihe; Zhang, Yaping; Wang, Wen; Li, Jun; Wei, Fuwen; Li, Heng; Jian, Min; Li, Jianwen; Zhang, Zhaolei; Nielsen, Rasmus; Li, Dawei; Gu, Wanjun; Yang, Zhentao; Xuan, Zhaoling; Ryder, Oliver A.; Leung, Frederick Chi-Ching; Zhou, Yan; Cao, Jianjun; Sun, Xiao; Fu, Yonggui; Fang, Xiaodong; Guo, Xiaosen; Wang, Bo; Hou, Rong; Shen, Fujun; Mu, Bo; Ni, Peixiang; Lin, Runmao; Qian, Wubin; Wang, Guodong; Yu, Chang; Nie, Wenhui; Wang, Jinhuan; Wu, Zhigang; Liang, Huiqing; Min, Jiumeng; Wu, Qi; Cheng, Shifeng; Ruan, Jue; Wang, Mingwei; Shi, Zhongbin; Wen, Ming; Liu, Binghang; Ren, Xiaoli; Zheng, Huisong; Dong, Dong; Cook, Kathleen; Shan, Gao; Zhang, Hao; Kosiol, Carolin; Xie, Xueying; Lu, Zuhong; Zheng, Hancheng; Li, Yingrui; Steiner, Cynthia C.; Lam, Tommy Tsan-Yuk; Lin, Siyuan; Zhang, Qinghui; Li, Guoqing; Tian, Jing; Gong, Timing; Liu, Hongde; Zhang, Dejin; Fang, Lin; Ye, Chen; Zhang, Juanbin; Hu, Wenbo; Xu, Anlong; Ren, Yuanyuan; Zhang, Guojie; Bruford, Michael W.; Li, Qibin; Ma, Lijia; Guo, Yiran; An, Na; Hu, Yujie; Zheng, Yang; Shi, Yongyong; Li, Zhiqiang; Liu, Qing; Chen, Yanling; Zhao, Jing; Qu, Ning; Zhao, Shancen; Tian, Feng; Wang, Xiaoling; Wang, Haiyin; Xu, Lizhi; Liu, Xiao; Vinar, Tomas; Wang, Yajun; Lam, Tak-Wah; Yiu, Siu-Ming; Liu, Shiping; Zhang, Hemin; Li, Desheng; Huang, Yan; Wang, Xia; Yang, Guohua; Jiang, Zhi; Wang, Junyi; Qin, Nan; Li, Li; Li, Jingxiang; Bolund, Lars; Kristiansen, Karsten; Wong, Gane Ka-Shu; Olson, Maynard; Zhang, Xiuqing; Li, Songgang; Yang, Huanming; Wang, Jian; Wang, Jun

2013-01-01

Using next-generation sequencing technology alone, we have successfully generated and assembled a draft sequence of the giant panda genome. The assembled contigs (2.25 gigabases (Gb)) cover approximately 94% of the whole genome, and the remaining gaps (0.05 Gb) seem to contain carnivore-specific repeats and tandem repeats. Comparisons with the dog and human showed that the panda genome has a lower divergence rate. The assessment of panda genes potentially underlying some of its unique traits indicated that its bamboo diet might be more dependent on its gut microbiome than its own genetic composition. We also identified more than 2.7 million heterozygous single nucleotide polymorphisms in the diploid genome. Our data and analyses provide a foundation for promoting mammalian genetic research, and demonstrate the feasibility for using next-generation sequencing technologies for accurate, cost-effective and rapid de novo assembly of large eukaryotic genomes. PMID:20010809
De novo assembly and characterization of the Trichuris trichiura adult worm transcriptome using Ion Torrent sequencing.

PubMed

Santos, Leonardo N; Silva, Eduardo S; Santos, André S; De Sá, Pablo H; Ramos, Rommel T; Silva, Artur; Cooper, Philip J; Barreto, Maurício L; Loureiro, Sebastião; Pinheiro, Carina S; Alcantara-Neves, Neuza M; Pacheco, Luis G C

2016-07-01

Infection with helminthic parasites, including the soil-transmitted helminth Trichuris trichiura (human whipworm), has been shown to modulate host immune responses and, consequently, to have an impact on the development and manifestation of chronic human inflammatory diseases. De novo derivation of helminth proteomes from sequencing of transcriptomes will provide valuable data to aid identification of parasite proteins that could be evaluated as potential immunotherapeutic molecules in near future. Herein, we characterized the transcriptome of the adult stage of the human whipworm T. trichiura, using next-generation sequencing technology and a de novo assembly strategy. Nearly 17.6 million high-quality clean reads were assembled into 6414 contiguous sequences, with an N50 of 1606bp. In total, 5673 protein-encoding sequences were confidentially identified in the T. trichiura adult worm transcriptome; of these, 1013 sequences represent potential newly discovered proteins for the species, most of which presenting orthologs already annotated in the related species T. suis. A number of transcripts representing probable novel non-coding transcripts for the species T. trichiura were also identified. Among the most abundant transcripts, we found sequences that code for proteins involved in lipid transport, such as vitellogenins, and several chitin-binding proteins. Through a cross-species expression analysis of gene orthologs shared by T. trichiura and the closely related parasites T. suis and T. muris it was possible to find twenty-six protein-encoding genes that are consistently highly expressed in the adult stages of the three helminth species. Additionally, twenty transcripts could be identified that code for proteins previously detected by mass spectrometry analysis of protein fractions of the whipworm somatic extract that present immunomodulatory activities. Five of these transcripts were amongst the most highly expressed protein-encoding sequences in the T
De novo sequencing and characterization of floral transcriptome in two species of buckwheat (Fagopyrum)

PubMed Central

2011-01-01

Background Transcriptome sequencing data has become an integral component of modern genetics, genomics and evolutionary biology. However, despite advances in the technologies of DNA sequencing, such data are lacking for many groups of living organisms, in particular, many plant taxa. We present here the results of transcriptome sequencing for two closely related plant species. These species, Fagopyrum esculentum and F. tataricum, belong to the order Caryophyllales - a large group of flowering plants with uncertain evolutionary relationships. F. esculentum (common buckwheat) is also an important food crop. Despite these practical and evolutionary considerations Fagopyrum species have not been the subject of large-scale sequencing projects. Results Normalized cDNA corresponding to genes expressed in flowers and inflorescences of F. esculentum and F. tataricum was sequenced using the 454 pyrosequencing technology. This resulted in 267 (for F. esculentum) and 229 (F. tataricum) thousands of reads with average length of 341-349 nucleotides. De novo assembly of the reads produced about 25 thousands of contigs for each species, with 7.5-8.2× coverage. Comparative analysis of two transcriptomes demonstrated their overall similarity but also revealed genes that are presumably differentially expressed. Among them are retrotransposon genes and genes involved in sugar biosynthesis and metabolism. Thirteen single-copy genes were used for phylogenetic analysis; the resulting trees are largely consistent with those inferred from multigenic plastid datasets. The sister relationships of the Caryophyllales and asterids now gained high support from nuclear gene sequences. Conclusions 454 transcriptome sequencing and de novo assembly was performed for two congeneric flowering plant species, F. esculentum and F. tataricum. As a result, a large set of cDNA sequences that represent orthologs of known plant genes as well as potential new genes was generated. PMID:21232141
Shotgun Protein Sequencing with Meta-contig Assembly*

PubMed Central

Guthals, Adrian; Clauser, Karl R.; Bandeira, Nuno

2012-01-01

Full-length de novo sequencing from tandem mass (MS/MS) spectra of unknown proteins such as antibodies or proteins from organisms with unsequenced genomes remains a challenging open problem. Conventional algorithms designed to individually sequence each MS/MS spectrum are limited by incomplete peptide fragmentation or low signal to noise ratios and tend to result in short de novo sequences at low sequencing accuracy. Our shotgun protein sequencing (SPS) approach was developed to ameliorate these limitations by first finding groups of unidentified spectra from the same peptides (contigs) and then deriving a consensus de novo sequence for each assembled set of spectra (contig sequences). But whereas SPS enables much more accurate reconstruction of de novo sequences longer than can be recovered from individual MS/MS spectra, it still requires error-tolerant matching to homologous proteins to group smaller contig sequences into full-length protein sequences, thus limiting its effectiveness on sequences from poorly annotated proteins. Using low and high resolution CID and high resolution HCD MS/MS spectra, we address this limitation with a Meta-SPS algorithm designed to overlap and further assemble SPS contigs into Meta-SPS de novo contig sequences extending as long as 100 amino acids at over 97% accuracy without requiring any knowledge of homologous protein sequences. We demonstrate Meta-SPS using distinct MS/MS data sets obtained with separate enzymatic digestions and discuss how the remaining de novo sequencing limitations relate to MS/MS acquisition settings. PMID:22798278
Shotgun protein sequencing with meta-contig assembly.

PubMed

Guthals, Adrian; Clauser, Karl R; Bandeira, Nuno

2012-10-01

Full-length de novo sequencing from tandem mass (MS/MS) spectra of unknown proteins such as antibodies or proteins from organisms with unsequenced genomes remains a challenging open problem. Conventional algorithms designed to individually sequence each MS/MS spectrum are limited by incomplete peptide fragmentation or low signal to noise ratios and tend to result in short de novo sequences at low sequencing accuracy. Our shotgun protein sequencing (SPS) approach was developed to ameliorate these limitations by first finding groups of unidentified spectra from the same peptides (contigs) and then deriving a consensus de novo sequence for each assembled set of spectra (contig sequences). But whereas SPS enables much more accurate reconstruction of de novo sequences longer than can be recovered from individual MS/MS spectra, it still requires error-tolerant matching to homologous proteins to group smaller contig sequences into full-length protein sequences, thus limiting its effectiveness on sequences from poorly annotated proteins. Using low and high resolution CID and high resolution HCD MS/MS spectra, we address this limitation with a Meta-SPS algorithm designed to overlap and further assemble SPS contigs into Meta-SPS de novo contig sequences extending as long as 100 amino acids at over 97% accuracy without requiring any knowledge of homologous protein sequences. We demonstrate Meta-SPS using distinct MS/MS data sets obtained with separate enzymatic digestions and discuss how the remaining de novo sequencing limitations relate to MS/MS acquisition settings.
Genome Calligrapher: A Web Tool for Refactoring Bacterial Genome Sequences for de Novo DNA Synthesis.

PubMed

Christen, Matthias; Deutsch, Samuel; Christen, Beat

2015-08-21

Recent advances in synthetic biology have resulted in an increasing demand for the de novo synthesis of large-scale DNA constructs. Any process improvement that enables fast and cost-effective streamlining of digitized genetic information into fabricable DNA sequences holds great promise to study, mine, and engineer genomes. Here, we present Genome Calligrapher, a computer-aided design web tool intended for whole genome refactoring of bacterial chromosomes for de novo DNA synthesis. By applying a neutral recoding algorithm, Genome Calligrapher optimizes GC content and removes obstructive DNA features known to interfere with the synthesis of double-stranded DNA and the higher order assembly into large DNA constructs. Subsequent bioinformatics analysis revealed that synthesis constraints are prevalent among bacterial genomes. However, a low level of codon replacement is sufficient for refactoring bacterial genomes into easy-to-synthesize DNA sequences. To test the algorithm, 168 kb of synthetic DNA comprising approximately 20 percent of the synthetic essential genome of the cell-cycle bacterium Caulobacter crescentus was streamlined and then ordered from a commercial supplier of low-cost de novo DNA synthesis. The successful assembly into eight 20 kb segments indicates that Genome Calligrapher algorithm can be efficiently used to refactor difficult-to-synthesize DNA. Genome Calligrapher is broadly applicable to recode biosynthetic pathways, DNA sequences, and whole bacterial genomes, thus offering new opportunities to use synthetic biology tools to explore the functionality of microbial diversity. The Genome Calligrapher web tool can be accessed at https://christenlab.ethz.ch/GenomeCalligrapher  .
TrawlerWeb: an online de novo motif discovery tool for next-generation sequencing datasets.

PubMed

Dang, Louis T; Tondl, Markus; Chiu, Man Ho H; Revote, Jerico; Paten, Benedict; Tano, Vincent; Tokolyi, Alex; Besse, Florence; Quaife-Ryan, Greg; Cumming, Helen; Drvodelic, Mark J; Eichenlaub, Michael P; Hallab, Jeannette C; Stolper, Julian S; Rossello, Fernando J; Bogoyevitch, Marie A; Jans, David A; Nim, Hieu T; Porrello, Enzo R; Hudson, James E; Ramialison, Mirana

2018-04-05

A strong focus of the post-genomic era is mining of the non-coding regulatory genome in order to unravel the function of regulatory elements that coordinate gene expression (Nat 489:57-74, 2012; Nat 507:462-70, 2014; Nat 507:455-61, 2014; Nat 518:317-30, 2015). Whole-genome approaches based on next-generation sequencing (NGS) have provided insight into the genomic location of regulatory elements throughout different cell types, organs and organisms. These technologies are now widespread and commonly used in laboratories from various fields of research. This highlights the need for fast and user-friendly software tools dedicated to extracting cis-regulatory information contained in these regulatory regions; for instance transcription factor binding site (TFBS) composition. Ideally, such tools should not require prior programming knowledge to ensure they are accessible for all users. We present TrawlerWeb, a web-based version of the Trawler_standalone tool (Nat Methods 4:563-5, 2007; Nat Protoc 5:323-34, 2010), to allow for the identification of enriched motifs in DNA sequences obtained from next-generation sequencing experiments in order to predict their TFBS composition. TrawlerWeb is designed for online queries with standard options common to web-based motif discovery tools. In addition, TrawlerWeb provides three unique new features: 1) TrawlerWeb allows the input of BED files directly generated from NGS experiments, 2) it automatically generates an input-matched biologically relevant background, and 3) it displays resulting conservation scores for each instance of the motif found in the input sequences, which assists the researcher in prioritising the motifs to validate experimentally. Finally, to date, this web-based version of Trawler_standalone remains the fastest online de novo motif discovery tool compared to other popular web-based software, while generating predictions with high accuracy. TrawlerWeb provides users with a fast, simple and easy-to-use web
The genome of flax (Linum usitatissimum) assembled de novo from short shotgun sequence reads.

PubMed

Wang, Zhiwen; Hobson, Neil; Galindo, Leonardo; Zhu, Shilin; Shi, Daihu; McDill, Joshua; Yang, Linfeng; Hawkins, Simon; Neutelings, Godfrey; Datla, Raju; Lambert, Georgina; Galbraith, David W; Grassa, Christopher J; Geraldes, Armando; Cronk, Quentin C; Cullis, Christopher; Dash, Prasanta K; Kumar, Polumetla A; Cloutier, Sylvie; Sharpe, Andrew G; Wong, Gane K-S; Wang, Jun; Deyholos, Michael K

2012-11-01

Flax (Linum usitatissimum) is an ancient crop that is widely cultivated as a source of fiber, oil and medicinally relevant compounds. To accelerate crop improvement, we performed whole-genome shotgun sequencing of the nuclear genome of flax. Seven paired-end libraries ranging in size from 300 bp to 10 kb were sequenced using an Illumina genome analyzer. A de novo assembly, comprised exclusively of deep-coverage (approximately 94× raw, approximately 69× filtered) short-sequence reads (44-100 bp), produced a set of scaffolds with N(50) =694 kb, including contigs with N(50)=20.1 kb. The contig assembly contained 302 Mb of non-redundant sequence representing an estimated 81% genome coverage. Up to 96% of published flax ESTs aligned to the whole-genome shotgun scaffolds. However, comparisons with independently sequenced BACs and fosmids showed some mis-assembly of regions at the genome scale. A total of 43384 protein-coding genes were predicted in the whole-genome shotgun assembly, and up to 93% of published flax ESTs, and 86% of A. thaliana genes aligned to these predicted genes, indicating excellent coverage and accuracy at the gene level. Analysis of the synonymous substitution rates (K(s) ) observed within duplicate gene pairs was consistent with a recent (5-9 MYA) whole-genome duplication in flax. Within the predicted proteome, we observed enrichment of many conserved domains (Pfam-A) that may contribute to the unique properties of this crop, including agglutinin proteins. Together these results show that de novo assembly, based solely on whole-genome shotgun short-sequence reads, is an efficient means of obtaining nearly complete genome sequence information for some plant species. © 2012 The Authors. The Plant Journal © 2012 Blackwell Publishing Ltd.
De novo transcriptome sequencing of axolotl blastema for identification of differentially expressed genes during limb regeneration

PubMed Central

2013-01-01

Background Salamanders are unique among vertebrates in their ability to completely regenerate amputated limbs through the mediation of blastema cells located at the stump ends. This regeneration is nerve-dependent because blastema formation and regeneration does not occur after limb denervation. To obtain the genomic information of blastema tissues, de novo transcriptomes from both blastema tissues and denervated stump ends of Ambystoma mexicanum (axolotls) 14 days post-amputation were sequenced and compared using Solexa DNA sequencing. Results The sequencing done for this study produced 40,688,892 reads that were assembled into 307,345 transcribed sequences. The N50 of transcribed sequence length was 562 bases. A similarity search with known proteins identified 39,200 different genes to be expressed during limb regeneration with a cut-off E-value exceeding 10-5. We annotated assembled sequences by using gene descriptions, gene ontology, and clusters of orthologous group terms. Targeted searches using these annotations showed that the majority of the genes were in the categories of essential metabolic pathways, transcription factors and conserved signaling pathways, and novel candidate genes for regenerative processes. We discovered and confirmed numerous sequences of the candidate genes by using quantitative polymerase chain reaction and in situ hybridization. Conclusion The results of this study demonstrate that de novo transcriptome sequencing allows gene expression analysis in a species lacking genome information and provides the most comprehensive mRNA sequence resources for axolotls. The characterization of the axolotl transcriptome can help elucidate the molecular mechanisms underlying blastema formation during limb regeneration. PMID:23815514
Whole Exome Sequencing Identifies de Novo Mutations in GATA6 Associated with Congenital Diaphragmatic Hernia

PubMed Central

Yu, Lan; Bennett, James T.; Wynn, Julia; Carvill, Gemma L.; Cheung, Yee Him; Shen, Yufeng; Mychaliska, George B.; Azarow, Kenneth S.; Crombleholme, Timothy M.; Chung, Dai H.; Potoka, Douglas; Warner, Brad W.; Bucher, Brian; Lim, Foong-Yen; Pietsch, John; Stolar, Charles; Aspelund, Gudrun; Arkovitz, Marc S.; Mefford, Heather; Chung, Wendy K.

2014-01-01

Background Congenital diaphragmatic hernia (CDH) is a common birth defect affecting 1 in 3,000 births. It is characterized by herniation of abdominal viscera through an incompletely formed diaphragm. Although chromosomal anomalies and mutations in several genes have been implicated, the cause for most patients is unknown. Methods We used whole exome sequencing in two families with CDH and congenital heart disease, and identified mutations in GATA6 in both. Results In the first family, we identified a de novo missense mutation (c.1366C>T, p.R456C) in a sporadic CDH patient with tetralogy of Fallot. In the second, a nonsense mutation (c.712G>T, p.G238*) was identified in two siblings with CDH and a large ventricular septal defect. The G238* mutation was inherited from their mother, who was clinically affected with congenital absence of the pericardium, patent ductus arteriosus, and intestinal malrotation. Deep sequencing of blood and saliva derived DNA from the mother suggested somatic mosaicism as an explanation for her milder phenotype, with only approximately 15% mutant alleles. To determine the frequency of GATA6 mutations in CDH, we sequenced the gene in 378 patients with CDH. We identified one additional de novo mutation (c.1071delG, p.V358Cfs34*). Conclusions Mutations in GATA6 have been previously associated with pancreatic agenesis and congenital heart disease. We conclude that, in addition to the heart and the pancreas, GATA6 is involved in development of two additional organs, the diaphragm and the pericardium. In addition we have shown that de novo mutations can contribute to the development of CDH, a common birth defect. PMID:24385578
The present and future of de novo whole-genome assembly.

PubMed

Sohn, Jang-Il; Nam, Jin-Wu

2018-01-01

As the advent of next-generation sequencing (NGS) technology, various de novo assembly algorithms based on the de Bruijn graph have been developed to construct chromosome-level sequences. However, numerous technical or computational challenges in de novo assembly still remain, although many bright ideas and heuristics have been suggested to tackle the challenges in both experimental and computational settings. In this review, we categorize de novo assemblers on the basis of the type of de Bruijn graphs (Hamiltonian and Eulerian) and discuss the challenges of de novo assembly for short NGS reads regarding computational complexity and assembly ambiguity. Then, we discuss how the limitations of the short reads can be overcome by using a single-molecule sequencing platform that generates long reads of up to several kilobases. In fact, the long read assembly has caused a paradigm shift in whole-genome assembly in terms of algorithms and supporting steps. We also summarize (i) hybrid assemblies using both short and long reads and (ii) overlap-based assemblies for long reads and discuss their challenges and future prospects. This review provides guidelines to determine the optimal approach for a given input data type, computational budget or genome. © The Author 2016. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oup.com.
De Novo ORFs in Drosophila Are Important to Organismal Fitness and Evolved Rapidly from Previously Non-coding Sequences

PubMed Central

Reinhardt, Josephine A.; Wanjiru, Betty M.; Brant, Alicia T.; Saelao, Perot; Begun, David J.; Jones, Corbin D.

2013-01-01

How non-coding DNA gives rise to new protein-coding genes (de novo genes) is not well understood. Recent work has revealed the origins and functions of a few de novo genes, but common principles governing the evolution or biological roles of these genes are unknown. To better define these principles, we performed a parallel analysis of the evolution and function of six putatively protein-coding de novo genes described in Drosophila melanogaster. Reconstruction of the transcriptional history of de novo genes shows that two de novo genes emerged from novel long non-coding RNAs that arose at least 5 MY prior to evolution of an open reading frame. In contrast, four other de novo genes evolved a translated open reading frame and transcription within the same evolutionary interval suggesting that nascent open reading frames (proto-ORFs), while not required, can contribute to the emergence of a new de novo gene. However, none of the genes arose from proto-ORFs that existed long before expression evolved. Sequence and structural evolution of de novo genes was rapid compared to nearby genes and the structural complexity of de novo genes steadily increases over evolutionary time. Despite the fact that these genes are transcribed at a higher level in males than females, and are most strongly expressed in testes, RNAi experiments show that most of these genes are essential in both sexes during metamorphosis. This lethality suggests that protein coding de novo genes in Drosophila quickly become functionally important. PMID:24146629
A Proteomic Workflow Using High-Throughput De Novo Sequencing Towards Complementation of Genome Information for Improved Comparative Crop Science.

PubMed

Turetschek, Reinhard; Lyon, David; Desalegn, Getinet; Kaul, Hans-Peter; Wienkoop, Stefanie

2016-01-01

The proteomic study of non-model organisms, such as many crop plants, is challenging due to the lack of comprehensive genome information. Changing environmental conditions require the study and selection of adapted cultivars. Mutations, inherent to cultivars, hamper protein identification and thus considerably complicate the qualitative and quantitative comparison in large-scale systems biology approaches. With this workflow, cultivar-specific mutations are detected from high-throughput comparative MS analyses, by extracting sequence polymorphisms with de novo sequencing. Stringent criteria are suggested to filter for confidential mutations. Subsequently, these polymorphisms complement the initially used database, which is ready to use with any preferred database search algorithm. In our example, we thereby identified 26 specific mutations in two cultivars of Pisum sativum and achieved an increased number (17 %) of peptide spectrum matches.
Compression of next-generation sequencing reads aided by highly efficient de novo assembly

PubMed Central

Jones, Daniel C.; Ruzzo, Walter L.; Peng, Xinxia

2012-01-01

We present Quip, a lossless compression algorithm for next-generation sequencing data in the FASTQ and SAM/BAM formats. In addition to implementing reference-based compression, we have developed, to our knowledge, the first assembly-based compressor, using a novel de novo assembly algorithm. A probabilistic data structure is used to dramatically reduce the memory required by traditional de Bruijn graph assemblers, allowing millions of reads to be assembled very efficiently. Read sequences are then stored as positions within the assembled contigs. This is combined with statistical compression of read identifiers, quality scores, alignment information and sequences, effectively collapsing very large data sets to <15% of their original size with no loss of information. Availability: Quip is freely available under the 3-clause BSD license from http://cs.washington.edu/homes/dcjones/quip. PMID:22904078
GapFiller: a de novo assembly approach to fill the gap within paired reads

PubMed Central

2012-01-01

/deletions detection pipelines, pre-processing routines on datasets for de novo assembly pipelines, or in any hierarchical approach designed to assemble, analyse or validate pools of sequences. PMID:23095524
Reliable Detection of Herpes Simplex Virus Sequence Variation by High-Throughput Resequencing.

PubMed

Morse, Alison M; Calabro, Kaitlyn R; Fear, Justin M; Bloom, David C; McIntyre, Lauren M

2017-08-16

High-throughput sequencing (HTS) has resulted in data for a number of herpes simplex virus (HSV) laboratory strains and clinical isolates. The knowledge of these sequences has been critical for investigating viral pathogenicity. However, the assembly of complete herpesviral genomes, including HSV, is complicated due to the existence of large repeat regions and arrays of smaller reiterated sequences that are commonly found in these genomes. In addition, the inherent genetic variation in populations of isolates for viruses and other microorganisms presents an additional challenge to many existing HTS sequence assembly pipelines. Here, we evaluate two approaches for the identification of genetic variants in HSV1 strains using Illumina short read sequencing data. The first, a reference-based approach, identifies variants from reads aligned to a reference sequence and the second, a de novo assembly approach, identifies variants from reads aligned to de novo assembled consensus sequences. Of critical importance for both approaches is the reduction in the number of low complexity regions through the construction of a non-redundant reference genome. We compared variants identified in the two methods. Our results indicate that approximately 85% of variants are identified regardless of the approach. The reference-based approach to variant discovery captures an additional 15% representing variants divergent from the HSV1 reference possibly due to viral passage. Reference-based approaches are significantly less labor-intensive and identify variants across the genome where de novo assembly-based approaches are limited to regions where contigs have been successfully assembled. In addition, regions of poor quality assembly can lead to false variant identification in de novo consensus sequences. For viruses with a well-assembled reference genome, a reference-based approach is recommended.
De Novo Coding Variants Are Strongly Associated with Tourette Disorder

PubMed Central

Willsey, A. Jeremy; Fernandez, Thomas V.; Yu, Dongmei; King, Robert A.; Dietrich, Andrea; Xing, Jinchuan; Sanders, Stephan J.; Mandell, Jeffrey D.; Huang, Alden Y.; Richer, Petra; Smith, Louw; Dong, Shan; Samocha, Kaitlin E.; Neale, Benjamin M.; Coppola, Giovanni; Mathews, Carol A.; Tischfield, Jay A.; Scharf, Jeremiah M.; State, Matthew W.; Heiman, Gary A.

2017-01-01

SUMMARY Whole-exome sequencing (WES) and de novo variant detection have proven a powerful approach to gene discovery in complex neurodevelopmental disorders. We have completed WES of 325 Tourette disorder trios from the Tourette International Collaborative Genetics cohort and a replication sample of 186 trios from the Tourette Syndrome Association International Consortium on Genetics (511 total). We observe strong and consistent evidence for the contribution of de novo likely gene-disrupting (LGD) variants (rate ratio [RR] 2.32, p = 0.002). Additionally, de novo damaging variants (LGD and probably damaging missense) are overrepresented in probands (RR 1.37, p = 0.003). We identify four likely risk genes with multiple de novo damaging variants in unrelated probands: WWC1 (WW and C2 domain containing 1), CELSR3 (Cadherin EGF LAG seven-pass G-type receptor 3), NIPBL (Nipped-B-like), and FN1 (fibronectin 1). Overall, we estimate that de novo damaging variants in approximately 400 genes contribute risk in 12% of clinical cases. PMID:28472652
Sequencing, De Novo Assembly, and Annotation of the Complete Genome of a New Thraustochytrid Species, Strain CCAP_4062/3

PubMed Central

Seddiki, Khawla; Godart, François; Aiese Cigliano, Riccardo; Sanseverino, Walter; Barakat, Mohamed; Ortet, Philippe; Rébeillé, Fabrice; Maréchal, Eric

2018-01-01

ABSTRACT Thraustochytrids are ecologically and biotechnologically relevant marine species. We report here the de novo assembly and annotation of the whole-genome sequence of a new thraustochytrid strain, CCAP_4062/3. The genome size was estimated at 38.7 Mb with 11,853 predicted coding sequences, and the GC content was scored at 57%. PMID:29545303

De novo Assembly of a 40 Mb Eukaryotic Genome from Short Sequence Reads: Sordaria macrospora, a Model Organism for Fungal Morphogenesis

PubMed Central

Nowrousian, Minou; Stajich, Jason E.; Chu, Meiling; Engh, Ines; Espagne, Eric; Halliday, Karen; Kamerewerd, Jens; Kempken, Frank; Knab, Birgit; Kuo, Hsiao-Che; Osiewacz, Heinz D.; Pöggeler, Stefanie; Read, Nick D.; Seiler, Stephan; Smith, Kristina M.; Zickler, Denise; Kück, Ulrich; Freitag, Michael

2010-01-01

Filamentous fungi are of great importance in ecology, agriculture, medicine, and biotechnology. Thus, it is not surprising that genomes for more than 100 filamentous fungi have been sequenced, most of them by Sanger sequencing. While next-generation sequencing techniques have revolutionized genome resequencing, e.g. for strain comparisons, genetic mapping, or transcriptome and ChIP analyses, de novo assembly of eukaryotic genomes still presents significant hurdles, because of their large size and stretches of repetitive sequences. Filamentous fungi contain few repetitive regions in their 30–90 Mb genomes and thus are suitable candidates to test de novo genome assembly from short sequence reads. Here, we present a high-quality draft sequence of the Sordaria macrospora genome that was obtained by a combination of Illumina/Solexa and Roche/454 sequencing. Paired-end Solexa sequencing of genomic DNA to 85-fold coverage and an additional 10-fold coverage by single-end 454 sequencing resulted in ∼4 Gb of DNA sequence. Reads were assembled to a 40 Mb draft version (N50 of 117 kb) with the Velvet assembler. Comparative analysis with Neurospora genomes increased the N50 to 498 kb. The S. macrospora genome contains even fewer repeat regions than its closest sequenced relative, Neurospora crassa. Comparison with genomes of other fungi showed that S. macrospora, a model organism for morphogenesis and meiosis, harbors duplications of several genes involved in self/nonself-recognition. Furthermore, S. macrospora contains more polyketide biosynthesis genes than N. crassa. Phylogenetic analyses suggest that some of these genes may have been acquired by horizontal gene transfer from a distantly related ascomycete group. Our study shows that, for typical filamentous fungi, de novo assembly of genomes from short sequence reads alone is feasible, that a mixture of Solexa and 454 sequencing substantially improves the assembly, and that the resulting data can be used for comparative
De novo assembly of a 40 Mb eukaryotic genome from short sequence reads: Sordaria macrospora, a model organism for fungal morphogenesis.

PubMed

Nowrousian, Minou; Stajich, Jason E; Chu, Meiling; Engh, Ines; Espagne, Eric; Halliday, Karen; Kamerewerd, Jens; Kempken, Frank; Knab, Birgit; Kuo, Hsiao-Che; Osiewacz, Heinz D; Pöggeler, Stefanie; Read, Nick D; Seiler, Stephan; Smith, Kristina M; Zickler, Denise; Kück, Ulrich; Freitag, Michael

2010-04-08

Filamentous fungi are of great importance in ecology, agriculture, medicine, and biotechnology. Thus, it is not surprising that genomes for more than 100 filamentous fungi have been sequenced, most of them by Sanger sequencing. While next-generation sequencing techniques have revolutionized genome resequencing, e.g. for strain comparisons, genetic mapping, or transcriptome and ChIP analyses, de novo assembly of eukaryotic genomes still presents significant hurdles, because of their large size and stretches of repetitive sequences. Filamentous fungi contain few repetitive regions in their 30-90 Mb genomes and thus are suitable candidates to test de novo genome assembly from short sequence reads. Here, we present a high-quality draft sequence of the Sordaria macrospora genome that was obtained by a combination of Illumina/Solexa and Roche/454 sequencing. Paired-end Solexa sequencing of genomic DNA to 85-fold coverage and an additional 10-fold coverage by single-end 454 sequencing resulted in approximately 4 Gb of DNA sequence. Reads were assembled to a 40 Mb draft version (N50 of 117 kb) with the Velvet assembler. Comparative analysis with Neurospora genomes increased the N50 to 498 kb. The S. macrospora genome contains even fewer repeat regions than its closest sequenced relative, Neurospora crassa. Comparison with genomes of other fungi showed that S. macrospora, a model organism for morphogenesis and meiosis, harbors duplications of several genes involved in self/nonself-recognition. Furthermore, S. macrospora contains more polyketide biosynthesis genes than N. crassa. Phylogenetic analyses suggest that some of these genes may have been acquired by horizontal gene transfer from a distantly related ascomycete group. Our study shows that, for typical filamentous fungi, de novo assembly of genomes from short sequence reads alone is feasible, that a mixture of Solexa and 454 sequencing substantially improves the assembly, and that the resulting data can be used for
Zseq: An Approach for Preprocessing Next-Generation Sequencing Data.

PubMed

Alkhateeb, Abedalrhman; Rueda, Luis

2017-08-01

Next-generation sequencing technology generates a huge number of reads (short sequences), which contain a vast amount of genomic data. The sequencing process, however, comes with artifacts. Preprocessing of sequences is mandatory for further downstream analysis. We present Zseq, a linear method that identifies the most informative genomic sequences and reduces the number of biased sequences, sequence duplications, and ambiguous nucleotides. Zseq finds the complexity of the sequences by counting the number of unique k-mers in each sequence as its corresponding score and also takes into the account other factors such as ambiguous nucleotides or high GC-content percentage in k-mers. Based on a z-score threshold, Zseq sweeps through the sequences again and filters those with a z-score less than the user-defined threshold. Zseq algorithm is able to provide a better mapping rate; it reduces the number of ambiguous bases significantly in comparison with other methods. Evaluation of the filtered reads has been conducted by aligning the reads and assembling the transcripts using the reference genome as well as de novo assembly. The assembled transcripts show a better discriminative ability to separate cancer and normal samples in comparison with another state-of-the-art method. Moreover, de novo assembled transcripts from the reads filtered by Zseq have longer genomic sequences than other tested methods. Estimating the threshold of the cutoff point is introduced using labeling rules with optimistic results.
The importance of de novo mutations for pediatric neurological disease--It is not all in utero or birth trauma.

PubMed

Erickson, Robert P

2016-01-01

The advent of next generation sequencing (NGS, which consists of massively parallel sequencing to perform TGS (total genome sequencing) or WES (whole exome sequencing)) has abundantly discovered many causative mutations in patients with pediatric neurological disease. A surprisingly high number of these are de novo mutations which have not been inherited from either parent. For epilepsy, autism spectrum disorders, and neuromotor disorders, including cerebral palsy, initial estimates put the frequency of causative de novo mutations at about 15% and about 10% of these are somatic. There are some shared mutated genes between these three classes of disease. Studies of copy number variation by comparative genomic hybridization (CGH) proceded the NGS approaches but they also detect de novo variation which is especially important for ASDs. There are interesting differences between the mutated genes detected by CGS and NGS. In summary, de novo mutations cause a very significant proportion of pediatric neurological disease. Copyright © 2015 Elsevier B.V. All rights reserved.
Increased frequency of de novo copy number variants in congenital heart disease by integrative analysis of single nucleotide polymorphism array and exome sequence data.

PubMed

Glessner, Joseph T; Bick, Alexander G; Ito, Kaoru; Homsy, Jason; Rodriguez-Murillo, Laura; Fromer, Menachem; Mazaika, Erica; Vardarajan, Badri; Italia, Michael; Leipzig, Jeremy; DePalma, Steven R; Golhar, Ryan; Sanders, Stephan J; Yamrom, Boris; Ronemus, Michael; Iossifov, Ivan; Willsey, A Jeremy; State, Matthew W; Kaltman, Jonathan R; White, Peter S; Shen, Yufeng; Warburton, Dorothy; Brueckner, Martina; Seidman, Christine; Goldmuntz, Elizabeth; Gelb, Bruce D; Lifton, Richard; Seidman, Jonathan; Hakonarson, Hakon; Chung, Wendy K

2014-10-24

Congenital heart disease (CHD) is among the most common birth defects. Most cases are of unknown pathogenesis. To determine the contribution of de novo copy number variants (CNVs) in the pathogenesis of sporadic CHD. We studied 538 CHD trios using genome-wide dense single nucleotide polymorphism arrays and whole exome sequencing. Results were experimentally validated using digital droplet polymerase chain reaction. We compared validated CNVs in CHD cases with CNVs in 1301 healthy control trios. The 2 complementary high-resolution technologies identified 63 validated de novo CNVs in 51 CHD cases. A significant increase in CNV burden was observed when comparing CHD trios with healthy trios, using either single nucleotide polymorphism array (P=7×10(-5); odds ratio, 4.6) or whole exome sequencing data (P=6×10(-4); odds ratio, 3.5) and remained after removing 16% of de novo CNV loci previously reported as pathogenic (P=0.02; odds ratio, 2.7). We observed recurrent de novo CNVs on 15q11.2 encompassing CYFIP1, NIPA1, and NIPA2 and single de novo CNVs encompassing DUSP1, JUN, JUP, MED15, MED9, PTPRE SREBF1, TOP2A, and ZEB2, genes that interact with established CHD proteins NKX2-5 and GATA4. Integrating de novo variants in whole exome sequencing and CNV data suggests that ETS1 is the pathogenic gene altered by 11q24.2-q25 deletions in Jacobsen syndrome and that CTBP2 is the pathogenic gene in 10q subtelomeric deletions. We demonstrate a significantly increased frequency of rare de novo CNVs in CHD patients compared with healthy controls and suggest several novel genetic loci for CHD. © 2014 American Heart Association, Inc.
LESSONS IN DE NOVO PEPTIDE SEQUENCING BY TANDEM MASS SPECTROMETRY

PubMed Central

Medzihradszky, Katalin F.; Chalkley, Robert J.

2015-01-01

Mass spectrometry has become the method of choice for the qualitative and quantitative characterization of protein mixtures isolated from all kinds of living organisms. The raw data in these studies are MS/MS spectra, usually of peptides produced by proteolytic digestion of a protein. These spectra are “translated” into peptide sequences, normally with the help of various search engines. Data acquisition and interpretation have both been automated, and most researchers look only at the summary of the identifications without ever viewing the underlying raw data used for assignments. Automated analysis of data is essential due to the volume produced. However, being familiar with the finer intricacies of peptide fragmentation processes, and experiencing the difficulties of manual data interpretation allow a researcher to be able to more critically evaluate key results, particularly because there are many known rules of peptide fragmentation that are not incorporated into search engine scoring. Since the most commonly used MS/MS activation method is collision-induced dissociation (CID), in this article we present a brief review of the history of peptide CID analysis. Next, we provide a detailed tutorial on how to determine peptide sequences from CID data. Although the focus of the tutorial is de novo sequencing, the lessons learned and resources supplied are useful for data interpretation in general. PMID:25667941
de novo assembly and population genomic survey of natural yeast isolates with the Oxford Nanopore MinION sequencer

PubMed Central

Istace, Benjamin; Friedrich, Anne; d'Agata, Léo; Faye, Sébastien; Payen, Emilie; Beluche, Odette; Caradec, Claudia; Davidas, Sabrina; Cruaud, Corinne; Liti, Gianni; Lemainque, Arnaud; Engelen, Stefan; Wincker, Patrick; Schacherer, Joseph

2017-01-01

Abstract Background: Oxford Nanopore Technologies Ltd (Oxford, UK) have recently commercialized MinION, a small single-molecule nanopore sequencer, that offers the possibility of sequencing long DNA fragments from small genomes in a matter of seconds. The Oxford Nanopore technology is truly disruptive; it has the potential to revolutionize genomic applications due to its portability, low cost, and ease of use compared with existing long reads sequencing technologies. The MinION sequencer enables the rapid sequencing of small eukaryotic genomes, such as the yeast genome. Combined with existing assembler algorithms, near complete genome assemblies can be generated and comprehensive population genomic analyses can be performed. Results: Here, we resequenced the genome of the Saccharomyces cerevisiae S288C strain to evaluate the performance of nanopore-only assemblers. Then we de novo sequenced and assembled the genomes of 21 isolates representative of the S. cerevisiae genetic diversity using the MinION platform. The contiguity of our assemblies was 14 times higher than the Illumina-only assemblies and we obtained one or two long contigs for 65 % of the chromosomes. This high contiguity allowed us to accurately detect large structural variations across the 21 studied genomes. Conclusion: Because of the high completeness of the nanopore assemblies, we were able to produce a complete cartography of transposable elements insertions and inspect structural variants that are generally missed using a short-read sequencing strategy. Our analyses show that the Oxford Nanopore technology is already usable for de novo sequencing and assembly; however, non-random errors in homopolymers require polishing the consensus using an alternate sequencing technology. PMID:28369459
de novo assembly and population genomic survey of natural yeast isolates with the Oxford Nanopore MinION sequencer.

PubMed

Istace, Benjamin; Friedrich, Anne; d'Agata, Léo; Faye, Sébastien; Payen, Emilie; Beluche, Odette; Caradec, Claudia; Davidas, Sabrina; Cruaud, Corinne; Liti, Gianni; Lemainque, Arnaud; Engelen, Stefan; Wincker, Patrick; Schacherer, Joseph; Aury, Jean-Marc

2017-02-01

Oxford Nanopore Technologies Ltd (Oxford, UK) have recently commercialized MinION, a small single-molecule nanopore sequencer, that offers the possibility of sequencing long DNA fragments from small genomes in a matter of seconds. The Oxford Nanopore technology is truly disruptive; it has the potential to revolutionize genomic applications due to its portability, low cost, and ease of use compared with existing long reads sequencing technologies. The MinION sequencer enables the rapid sequencing of small eukaryotic genomes, such as the yeast genome. Combined with existing assembler algorithms, near complete genome assemblies can be generated and comprehensive population genomic analyses can be performed. Here, we resequenced the genome of the Saccharomyces cerevisiae S288C strain to evaluate the performance of nanopore-only assemblers. Then we de novo sequenced and assembled the genomes of 21 isolates representative of the S. cerevisiae genetic diversity using the MinION platform. The contiguity of our assemblies was 14 times higher than the Illumina-only assemblies and we obtained one or two long contigs for 65 % of the chromosomes. This high contiguity allowed us to accurately detect large structural variations across the 21 studied genomes. Because of the high completeness of the nanopore assemblies, we were able to produce a complete cartography of transposable elements insertions and inspect structural variants that are generally missed using a short-read sequencing strategy. Our analyses show that the Oxford Nanopore technology is already usable for de novo sequencing and assembly; however, non-random errors in homopolymers require polishing the consensus using an alternate sequencing technology. © The Author 2017. Published by Oxford University Press.
De Novo Coding Variants Are Strongly Associated with Tourette Disorder.

PubMed

Willsey, A Jeremy; Fernandez, Thomas V; Yu, Dongmei; King, Robert A; Dietrich, Andrea; Xing, Jinchuan; Sanders, Stephan J; Mandell, Jeffrey D; Huang, Alden Y; Richer, Petra; Smith, Louw; Dong, Shan; Samocha, Kaitlin E; Neale, Benjamin M; Coppola, Giovanni; Mathews, Carol A; Tischfield, Jay A; Scharf, Jeremiah M; State, Matthew W; Heiman, Gary A

2017-05-03

Whole-exome sequencing (WES) and de novo variant detection have proven a powerful approach to gene discovery in complex neurodevelopmental disorders. We have completed WES of 325 Tourette disorder trios from the Tourette International Collaborative Genetics cohort and a replication sample of 186 trios from the Tourette Syndrome Association International Consortium on Genetics (511 total). We observe strong and consistent evidence for the contribution of de novo likely gene-disrupting (LGD) variants (rate ratio [RR] 2.32, p = 0.002). Additionally, de novo damaging variants (LGD and probably damaging missense) are overrepresented in probands (RR 1.37, p = 0.003). We identify four likely risk genes with multiple de novo damaging variants in unrelated probands: WWC1 (WW and C2 domain containing 1), CELSR3 (Cadherin EGF LAG seven-pass G-type receptor 3), NIPBL (Nipped-B-like), and FN1 (fibronectin 1). Overall, we estimate that de novo damaging variants in approximately 400 genes contribute risk in 12% of clinical cases. VIDEO ABSTRACT. Copyright © 2017 Elsevier Inc. All rights reserved.
Sequencing and De Novo Assembly of the Toxicodendron radicans (Poison Ivy) Transcriptome

PubMed Central

Kim, Gunjune

2017-01-01

Contact with poison ivy plants is widely dreaded because they produce a natural product called urushiol that is responsible for allergenic contact delayed-dermatitis symptoms lasting for weeks. For this reason, the catchphrase most associated with poison ivy is “leaves of three, let it be”, which serves the purpose of both identification and an appeal for avoidance. Ironically, despite this notoriety, there is a dearth of specific knowledge about nearly all other aspects of poison ivy physiology and ecology. As a means of gaining a more molecular-oriented understanding of poison ivy physiology and ecology, Next Generation DNA sequencing technology was used to develop poison ivy root and leaf RNA-seq transcriptome resources. De novo assembled transcriptomes were analyzed to generate a core set of high quality expressed transcripts present in poison ivy tissue. The predicted protein sequences were evaluated for similarity to SwissProt homologs and InterProScan domains, as well as assigned both GO terms and KEGG annotations. Over 23,000 simple sequence repeats were identified in the transcriptome, and corresponding oligo nucleotide primer pairs were designed. A pan-transcriptome analysis of existing Anacardiaceae transcriptomes revealed conserved and unique transcripts among these species. PMID:29125533
Sequencing and De Novo Assembly of the Toxicodendron radicans (Poison Ivy) Transcriptome.

PubMed

Weisberg, Alexandra J; Kim, Gunjune; Westwood, James H; Jelesko, John G

2017-11-10

Contact with poison ivy plants is widely dreaded because they produce a natural product called urushiol that is responsible for allergenic contact delayed-dermatitis symptoms lasting for weeks. For this reason, the catchphrase most associated with poison ivy is "leaves of three, let it be", which serves the purpose of both identification and an appeal for avoidance. Ironically, despite this notoriety, there is a dearth of specific knowledge about nearly all other aspects of poison ivy physiology and ecology. As a means of gaining a more molecular-oriented understanding of poison ivy physiology and ecology, Next Generation DNA sequencing technology was used to develop poison ivy root and leaf RNA-seq transcriptome resources. De novo assembled transcriptomes were analyzed to generate a core set of high quality expressed transcripts present in poison ivy tissue. The predicted protein sequences were evaluated for similarity to SwissProt homologs and InterProScan domains, as well as assigned both GO terms and KEGG annotations. Over 23,000 simple sequence repeats were identified in the transcriptome, and corresponding oligo nucleotide primer pairs were designed. A pan-transcriptome analysis of existing Anacardiaceae transcriptomes revealed conserved and unique transcripts among these species.
Whole genome sequencing data and de novo draft assemblies for 66 teleost species

PubMed Central

Malmstrøm, Martin; Matschiner, Michael; Tørresen, Ole K.; Jakobsen, Kjetill S.; Jentoft, Sissel

2017-01-01

Teleost fishes comprise more than half of all vertebrate species, yet genomic data are only available for 0.2% of their diversity. Here, we present whole genome sequencing data for 66 new species of teleosts, vastly expanding the availability of genomic data for this important vertebrate group. We report on de novo assemblies based on low-coverage (9–39×) sequencing and present detailed methodology for all analyses. To facilitate further utilization of this data set, we present statistical analyses of the gene space completeness and verify the expected phylogenetic position of the sequenced genomes in a large mitogenomic context. We further present a nuclear marker set used for phylogenetic inference and evaluate each gene tree in relation to the species tree to test for homogeneity in the phylogenetic signal. Collectively, these analyses illustrate the robustness of this highly diverse data set and enable extensive reuse of the selected phylogenetic markers and the genomic data in general. This data set covers all major teleost lineages and provides unprecedented opportunities for comparative studies of teleosts. PMID:28094797
The BaMM web server for de-novo motif discovery and regulatory sequence analysis.

PubMed

Kiesel, Anja; Roth, Christian; Ge, Wanwan; Wess, Maximilian; Meier, Markus; Söding, Johannes

2018-05-28

The BaMM web server offers four tools: (i) de-novo discovery of enriched motifs in a set of nucleotide sequences, (ii) scanning a set of nucleotide sequences with motifs to find motif occurrences, (iii) searching with an input motif for similar motifs in our BaMM database with motifs for >1000 transcription factors, trained from the GTRD ChIP-seq database and (iv) browsing and keyword searching the motif database. In contrast to most other servers, we represent sequence motifs not by position weight matrices (PWMs) but by Bayesian Markov Models (BaMMs) of order 4, which we showed previously to perform substantially better in ROC analyses than PWMs or first order models. To address the inadequacy of P- and E-values as measures of motif quality, we introduce the AvRec score, the average recall over the TP-to-FP ratio between 1 and 100. The BaMM server is freely accessible without registration at https://bammmotif.mpibpc.mpg.de.
De Novo whole genome sequence of Xylella fastidiosa subsp. multiplex strain BB01 from blueberry in Georgia, USA

USDA-ARS?s Scientific Manuscript database

This study reports a de novo assembled draft genome sequence of Xylella fastidiosa subsp. multiplex strain BB01 causing blueberry bacterial leaf scorch in Georgia, USA. The BB01 genome is 2,517,579 bp with a G+C content of 51.8% and 2,943 open reading frames (ORFs) and 48 RNA genes....
A Quantitative Tool to Distinguish Isobaric Leucine and Isoleucine Residues for Mass Spectrometry-Based De Novo Monoclonal Antibody Sequencing

NASA Astrophysics Data System (ADS)

Poston, Chloe N.; Higgs, Richard E.; You, Jinsam; Gelfanova, Valentina; Hale, John E.; Knierman, Michael D.; Siegel, Robert; Gutierrez, Jesus A.

2014-07-01

De novo sequencing by mass spectrometry (MS) allows for the determination of the complete amino acid (AA) sequence of a given protein based on the mass difference of detected ions from MS/MS fragmentation spectra. The technique relies on obtaining specific masses that can be attributed to characteristic theoretical masses of AAs. A major limitation of de novo sequencing by MS is the inability to distinguish between the isobaric residues leucine (Leu) and isoleucine (Ile). Incorrect identification of Ile as Leu or vice versa often results in loss of activity in recombinant antibodies. This functional ambiguity is commonly resolved with costly and time-consuming AA mutation and peptide sequencing experiments. Here, we describe a set of orthogonal biochemical protocols, which experimentally determine the identity of Ile or Leu residues in monoclonal antibodies (mAb) based on the selectivity that leucine aminopeptidase shows for n-terminal Leu residues and the cleavage preference for Leu by chymotrypsin. The resulting observations are combined with germline frequencies and incorporated into a logistic regression model, called Predictor for Xle Sites (PXleS) to provide a statistical likelihood for the identity of Leu at an ambiguous site. We demonstrate that PXleS can generate a probability for an Xle site in mAbs with 96% accuracy. The implementation of PXleS precludes the expression of several possible sequences and, therefore, reduces the overall time and resources required to go from spectra generation to a biologically active sequence for a mAb when an Ile or Leu residue is in question.
A quantitative tool to distinguish isobaric leucine and isoleucine residues for mass spectrometry-based de novo monoclonal antibody sequencing.

PubMed

Poston, Chloe N; Higgs, Richard E; You, Jinsam; Gelfanova, Valentina; Hale, John E; Knierman, Michael D; Siegel, Robert; Gutierrez, Jesus A

2014-07-01

De novo sequencing by mass spectrometry (MS) allows for the determination of the complete amino acid (AA) sequence of a given protein based on the mass difference of detected ions from MS/MS fragmentation spectra. The technique relies on obtaining specific masses that can be attributed to characteristic theoretical masses of AAs. A major limitation of de novo sequencing by MS is the inability to distinguish between the isobaric residues leucine (Leu) and isoleucine (Ile). Incorrect identification of Ile as Leu or vice versa often results in loss of activity in recombinant antibodies. This functional ambiguity is commonly resolved with costly and time-consuming AA mutation and peptide sequencing experiments. Here, we describe a set of orthogonal biochemical protocols, which experimentally determine the identity of Ile or Leu residues in monoclonal antibodies (mAb) based on the selectivity that leucine aminopeptidase shows for n-terminal Leu residues and the cleavage preference for Leu by chymotrypsin. The resulting observations are combined with germline frequencies and incorporated into a logistic regression model, called Predictor for Xle Sites (PXleS) to provide a statistical likelihood for the identity of Leu at an ambiguous site. We demonstrate that PXleS can generate a probability for an Xle site in mAbs with 96% accuracy. The implementation of PXleS precludes the expression of several possible sequences and, therefore, reduces the overall time and resources required to go from spectra generation to a biologically active sequence for a mAb when an Ile or Leu residue is in question.
De novo generation of plant centromeres at tandem repeats.

PubMed

Teo, Chee How; Lermontova, Inna; Houben, Andreas; Mette, Michael Florian; Schubert, Ingo

2013-06-01

Artificial minichromosomes are highly desirable tools for basic research, breeding, and biotechnology purposes. We present an option to generate plant artificial minichromosomes via de novo engineering of plant centromeres in Arabidopsis thaliana by targeting kinetochore proteins to tandem repeat arrays at non-centromeric positions. We employed the bacterial lactose repressor/lactose operator system to guide derivatives of the centromeric histone H3 variant cenH3 to LacO operator sequences. Tethering of cenH3 to non-centromeric loci led to de novo assembly of kinetochore proteins and to dicentric carrier chromosomes which potentially form anaphase bridges. This approach will be further developed and may contribute to generating minichromosomes from preselected genomic regions, potentially even in a diploid background.
A Molecular Portrait of De Novo Genes in Yeasts.

PubMed

Vakirlis, Nikolaos; Hebert, Alex S; Opulente, Dana A; Achaz, Guillaume; Hittinger, Chris Todd; Fischer, Gilles; Coon, Joshua J; Lafontaine, Ingrid

2018-03-01

New genes, with novel protein functions, can evolve "from scratch" out of intergenic sequences. These de novo genes can integrate the cell's genetic network and drive important phenotypic innovations. Therefore, identifying de novo genes and understanding how the transition from noncoding to coding occurs are key problems in evolutionary biology. However, identifying de novo genes is a difficult task, hampered by the presence of remote homologs, fast evolving sequences and erroneously annotated protein coding genes. To overcome these limitations, we developed a procedure that handles the usual pitfalls in de novo gene identification and predicted the emergence of 703 de novo gene candidates in 15 yeast species from 2 genera whose phylogeny spans at least 100 million years of evolution. We validated 85 candidates by proteomic data, providing new translation evidence for 25 of them through mass spectrometry experiments. We also unambiguously identified the mutations that enabled the transition from noncoding to coding for 30 Saccharomyces de novo genes. We established that de novo gene origination is a widespread phenomenon in yeasts, only a few being ultimately maintained by selection. We also found that de novo genes preferentially emerge next to divergent promoters in GC-rich intergenic regions where the probability of finding a fortuitous and transcribed ORF is the highest. Finally, we found a more than 3-fold enrichment of de novo genes at recombination hot spots, which are GC-rich and nucleosome-free regions, suggesting that meiotic recombination contributes to de novo gene emergence in yeasts.
De Novo Generation and Characterization of New Zika Virus Isolate Using Sequence Data from a Microcephaly Case

PubMed Central

Setoh, Yin Xiang; Prow, Natalie A.; Peng, Nias; Hugo, Leon E.; Devine, Gregor; Hazlewood, Jessamine E.

2017-01-01

ABSTRACT Zika virus (ZIKV) has recently emerged and is the etiological agent of congenital Zika syndrome (CZS), a spectrum of congenital abnormalities arising from neural tissue infections in utero. Herein, we describe the de novo generation of a new ZIKV isolate, ZIKVNatal, using a modified circular polymerase extension reaction protocol and sequence data obtained from a ZIKV-infected fetus with microcephaly. ZIKVNatal thus has no laboratory passage history and is unequivocally associated with CZS. ZIKVNatal could be used to establish a fetal brain infection model in IFNAR−/− mice (including intrauterine growth restriction) without causing symptomatic infections in dams. ZIKVNatal was also able to be transmitted by Aedes aegypti mosquitoes. ZIKVNatal thus retains key aspects of circulating pathogenic ZIKVs and illustrates a novel methodology for obtaining an authentic functional viral isolate by using data from deep sequencing of infected tissues. IMPORTANCE The major complications of an ongoing Zika virus outbreak in the Americas and Asia are congenital defects caused by the virus’s ability to cross the placenta and infect the fetal brain. The ability to generate molecular tools to analyze viral isolates from the current outbreak is essential for furthering our understanding of how these viruses cause congenital defects. The majority of existing viral isolates and infectious cDNA clones generated from them have undergone various numbers of passages in cell culture and/or suckling mice, which is likely to result in the accumulation of adaptive mutations that may affect viral properties. The approach described herein allows rapid generation of new, fully functional Zika virus isolates directly from deep sequencing data from virus-infected tissues without the need for prior virus passaging and for the generation and propagation of full-length cDNA clones. The approach should be applicable to other medically important flaviviruses and perhaps other positive
De Novo Generation and Characterization of New Zika Virus Isolate Using Sequence Data from a Microcephaly Case.

PubMed

Setoh, Yin Xiang; Prow, Natalie A; Peng, Nias; Hugo, Leon E; Devine, Gregor; Hazlewood, Jessamine E; Suhrbier, Andreas; Khromykh, Alexander A

2017-01-01

Zika virus (ZIKV) has recently emerged and is the etiological agent of congenital Zika syndrome (CZS), a spectrum of congenital abnormalities arising from neural tissue infections in utero . Herein, we describe the de novo generation of a new ZIKV isolate, ZIKV Natal , using a modified circular polymerase extension reaction protocol and sequence data obtained from a ZIKV-infected fetus with microcephaly. ZIKV Natal thus has no laboratory passage history and is unequivocally associated with CZS. ZIKV Natal could be used to establish a fetal brain infection model in IFNAR -/- mice (including intrauterine growth restriction) without causing symptomatic infections in dams. ZIKV Natal was also able to be transmitted by Aedes aegypti mosquitoes. ZIKV Natal thus retains key aspects of circulating pathogenic ZIKVs and illustrates a novel methodology for obtaining an authentic functional viral isolate by using data from deep sequencing of infected tissues. IMPORTANCE The major complications of an ongoing Zika virus outbreak in the Americas and Asia are congenital defects caused by the virus's ability to cross the placenta and infect the fetal brain. The ability to generate molecular tools to analyze viral isolates from the current outbreak is essential for furthering our understanding of how these viruses cause congenital defects. The majority of existing viral isolates and infectious cDNA clones generated from them have undergone various numbers of passages in cell culture and/or suckling mice, which is likely to result in the accumulation of adaptive mutations that may affect viral properties. The approach described herein allows rapid generation of new, fully functional Zika virus isolates directly from deep sequencing data from virus-infected tissues without the need for prior virus passaging and for the generation and propagation of full-length cDNA clones. The approach should be applicable to other medically important flaviviruses and perhaps other positive-strand RNA

Neurodevelopmental disease-associated de novo mutations and rare sequence variants affect TRIO GDP/GTP exchange factor activity.

PubMed

Katrancha, Sara M; Wu, Yi; Zhu, Minsheng; Eipper, Betty A; Koleske, Anthony J; Mains, Richard E

2017-12-01

Bipolar disorder, schizophrenia, autism and intellectual disability are complex neurodevelopmental disorders, debilitating millions of people. Therapeutic progress is limited by poor understanding of underlying molecular pathways. Using a targeted search, we identified an enrichment of de novo mutations in the gene encoding the 330-kDa triple functional domain (TRIO) protein associated with neurodevelopmental disorders. By generating multiple TRIO antibodies, we show that the smaller TRIO9 isoform is the major brain protein product, and its levels decrease after birth. TRIO9 contains two guanine nucleotide exchange factor (GEF) domains with distinct specificities: GEF1 activates both Rac1 and RhoG; GEF2 activates RhoA. To understand the impact of disease-associated de novo mutations and other rare sequence variants on TRIO function, we utilized two FRET-based biosensors: a Rac1 biosensor to study mutations in TRIO (T)GEF1, and a RhoA biosensor to study mutations in TGEF2. We discovered that one autism-associated de novo mutation in TGEF1 (K1431M), at the TGEF1/Rac1 interface, markedly decreased its overall activity toward Rac1. A schizophrenia-associated rare sequence variant in TGEF1 (F1538Intron) was substantially less active, normalized to protein level and expressed poorly. Overall, mutations in TGEF1 decreased GEF1 activity toward Rac1. One bipolar disorder-associated rare variant (M2145T) in TGEF2 impaired inhibition by the TGEF2 pleckstrin-homology domain, resulting in dramatically increased TGEF2 activity. Overall, genetic damage to both TGEF domains altered TRIO catalytic activity, decreasing TGEF1 activity and increasing TGEF2 activity. Importantly, both GEF changes are expected to decrease neurite outgrowth, perhaps consistent with their association with neurodevelopmental disorders. © The Author 2017. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oup.com.
De Novo Sequencing and Characterization of the Floral Transcriptome of Dendrocalamus latiflorus (Poaceae: Bambusoideae)

PubMed Central

Li, De-Zhu; Guo, Zhen-Hua

2012-01-01

Background Transcriptome sequencing can be used to determine gene sequences and transcript abundance in non-model species, and the advent of next-generation sequencing (NGS) technologies has greatly decreased the cost and time required for this process. Transcriptome data are especially desirable in bamboo species, as certain members constitute an economically and culturally important group of mostly semelparous plants with remarkable flowering features, yet little bamboo genomic research has been performed. Here we present, for the first time, extensive sequence and transcript abundance data for the floral transcriptome of a key bamboo species, Dendrocalamus latiflorus, obtained using the Illumina GAII sequencing platform. Our further goal was to identify patterns of gene expression during bamboo flower development. Results Approximately 96 million sequencing reads were generated and assembled de novo, yielding 146,395 high quality unigenes with an average length of 461 bp. Of these, 80,418 were identified as putative homologs of annotated sequences in the public protein databases, of which 290 were associated with the floral transition and 47 were related to flower development. Digital abundance analysis identified 26,529 transcripts differentially enriched between two developmental stages, young flower buds and older developing flowers. Unigenes found at each stage were categorized according to their putative functional categories. These sequence and putative function data comprise a resource for future investigation of the floral transition and flower development in bamboo species. Conclusions Our results present the first broad survey of a bamboo floral transcriptome. Although it will be necessary to validate the functions carried out by these genes, these results represent a starting point for future functional research on D. latiflorus and related species. PMID:22916120
Bromine isotopic signature facilitates de novo sequencing of peptides in free-radical-initiated peptide sequencing (FRIPS) mass spectrometry.

PubMed

Nam, Jungjoo; Kwon, Hyuksu; Jang, Inae; Jeon, Aeran; Moon, Jingyu; Lee, Sun Young; Kang, Dukjin; Han, Sang Yun; Moon, Bongjin; Oh, Han Bin

2015-02-01

We recently showed that free-radical-initiated peptide sequencing mass spectrometry (FRIPS MS) assisted by the remarkable thermochemical stability of (2,2,6,6-tetramethyl-piperidin-1-yl)oxyl (TEMPO) is another attractive radical-driven peptide fragmentation MS tool. Facile homolytic cleavage of the bond between the benzylic carbon and the oxygen of the TEMPO moiety in o-TEMPO-Bz-C(O)-peptide and the high reactivity of the benzylic radical species generated in •Bz-C(O)-peptide are key elements leading to extensive radical-driven peptide backbone fragmentation. In the present study, we demonstrate that the incorporation of bromine into the benzene ring, i.e. o-TEMPO-Bz(Br)-C(O)-peptide, allows unambiguous distinction of the N-terminal peptide fragments from the C-terminal fragments through the unique bromine doublet isotopic signature. Furthermore, bromine substitution does not alter the overall radical-driven peptide backbone dissociation pathways of o-TEMPO-Bz-C(O)-peptide. From a practical perspective, the presence of the bromine isotopic signature in the N-terminal peptide fragments in TEMPO-assisted FRIPS MS represents a useful and cost-effective opportunity for de novo peptide sequencing. Copyright © 2015 John Wiley & Sons, Ltd.
Detection of de novo single nucleotide variants in offspring of atomic-bomb survivors close to the hypocenter by whole-genome sequencing.

PubMed

Horai, Makiko; Mishima, Hiroyuki; Hayashida, Chisa; Kinoshita, Akira; Nakane, Yoshibumi; Matsuo, Tatsuki; Tsuruda, Kazuto; Yanagihara, Katsunori; Sato, Shinya; Imanishi, Daisuke; Imaizumi, Yoshitaka; Hata, Tomoko; Miyazaki, Yasushi; Yoshiura, Koh-Ichiro

2018-03-01

Ionizing radiation released by the atomic bombs at Hiroshima and Nagasaki, Japan, in 1945 caused many long-term illnesses, including increased risks of malignancies such as leukemia and solid tumours. Radiation has demonstrated genetic effects in animal models, leading to concerns over the potential hereditary effects of atomic bomb-related radiation. However, no direct analyses of whole DNA have yet been reported. We therefore investigated de novo variants in offspring of atomic-bomb survivors by whole-genome sequencing (WGS). We collected peripheral blood from three trios, each comprising a father (atomic-bomb survivor with acute radiation symptoms), a non-exposed mother, and their child, none of whom had any past history of haematological disorders. One trio of non-exposed individuals was included as a control. DNA was extracted and the numbers of de novo single nucleotide variants in the children were counted by WGS with sequencing confirmation. Gross structural variants were also analysed. Written informed consent was obtained from all participants prior to the study. There were 62, 81, and 42 de novo single nucleotide variants in the children of atomic-bomb survivors, compared with 48 in the control trio. There were no gross structural variants in any trio. These findings are in accord with previously published results that also showed no significant genetic effects of atomic-bomb radiation on second-generation survivors.
Detecting false positive sequence homology: a machine learning approach.

PubMed

Fujimoto, M Stanley; Suvorov, Anton; Jensen, Nicholas O; Clement, Mark J; Bybee, Seth M

2016-02-24

Accurate detection of homologous relationships of biological sequences (DNA or amino acid) amongst organisms is an important and often difficult task that is essential to various evolutionary studies, ranging from building phylogenies to predicting functional gene annotations. There are many existing heuristic tools, most commonly based on bidirectional BLAST searches that are used to identify homologous genes and combine them into two fundamentally distinct classes: orthologs and paralogs. Due to only using heuristic filtering based on significance score cutoffs and having no cluster post-processing tools available, these methods can often produce multiple clusters constituting unrelated (non-homologous) sequences. Therefore sequencing data extracted from incomplete genome/transcriptome assemblies originated from low coverage sequencing or produced by de novo processes without a reference genome are susceptible to high false positive rates of homology detection. In this paper we develop biologically informative features that can be extracted from multiple sequence alignments of putative homologous genes (orthologs and paralogs) and further utilized in context of guided experimentation to verify false positive outcomes. We demonstrate that our machine learning method trained on both known homology clusters obtained from OrthoDB and randomly generated sequence alignments (non-homologs), successfully determines apparent false positives inferred by heuristic algorithms especially among proteomes recovered from low-coverage RNA-seq data. Almost ~42 % and ~25 % of predicted putative homologies by InParanoid and HaMStR respectively were classified as false positives on experimental data set. Our process increases the quality of output from other clustering algorithms by providing a novel post-processing method that is both fast and efficient at removing low quality clusters of putative homologous genes recovered by heuristic-based approaches.
Single-molecule sequencing and chromatin conformation capture enable de novo reference assembly of the domestic goat genome.

PubMed

Bickhart, Derek M; Rosen, Benjamin D; Koren, Sergey; Sayre, Brian L; Hastie, Alex R; Chan, Saki; Lee, Joyce; Lam, Ernest T; Liachko, Ivan; Sullivan, Shawn T; Burton, Joshua N; Huson, Heather J; Nystrom, John C; Kelley, Christy M; Hutchison, Jana L; Zhou, Yang; Sun, Jiajie; Crisà, Alessandra; Ponce de León, F Abel; Schwartz, John C; Hammond, John A; Waldbieser, Geoffrey C; Schroeder, Steven G; Liu, George E; Dunham, Maitreya J; Shendure, Jay; Sonstegard, Tad S; Phillippy, Adam M; Van Tassell, Curtis P; Smith, Timothy P L

2017-04-01

The decrease in sequencing cost and increased sophistication of assembly algorithms for short-read platforms has resulted in a sharp increase in the number of species with genome assemblies. However, these assemblies are highly fragmented, with many gaps, ambiguities, and errors, impeding downstream applications. We demonstrate current state of the art for de novo assembly using the domestic goat (Capra hircus) based on long reads for contig formation, short reads for consensus validation, and scaffolding by optical and chromatin interaction mapping. These combined technologies produced what is, to our knowledge, the most continuous de novo mammalian assembly to date, with chromosome-length scaffolds and only 649 gaps. Our assembly represents a ∼400-fold improvement in continuity due to properly assembled gaps, compared to the previously published C. hircus assembly, and better resolves repetitive structures longer than 1 kb, representing the largest repeat family and immune gene complex yet produced for an individual of a ruminant species.
Single-molecule sequencing and chromatin conformation capture enable de novo reference assembly of the domestic goat genome

PubMed Central

Bickhart, Derek M.; Rosen, Benjamin D.; Koren, Sergey; Sayre, Brian L.; Hastie, Alex R.; Chan, Saki; Lee, Joyce; Lam, Ernest T.; Liachko, Ivan; Sullivan, Shawn T.; Burton, Joshua N.; Huson, Heather J.; Nystrom, John C.; Kelley, Christy M.; Hutchison, Jana L.; Zhou, Yang; Sun, Jiajie; Crisà, Alessandra; de León, F. Abel Ponce; Schwartz, John C.; Hammond, John A.; Waldbieser, Geoffrey C.; Schroeder, Steven G.; Liu, George E.; Dunham, Maitreya J.; Shendure, Jay; Sonstegard, Tad S.; Phillippy, Adam M.; Van Tassell, Curtis P.; Smith, Timothy P.L.

2018-01-01

The decrease in sequencing cost and increased sophistication of assembly algorithms for short-read platforms has resulted in a sharp increase in the number of species with genome assemblies. However, these assemblies are highly fragmented, with many gaps, ambiguities, and errors, impeding downstream applications. We demonstrate current state of the art for de novo assembly using the domestic goat (Capra hircus), based on long reads for contig formation, short reads for consensus validation, and scaffolding by optical and chromatin interaction mapping. These combined technologies produced the most continuous de novo mammalian assembly to date, with chromosome-length scaffolds and only 649 gaps. Our assembly represents a ~400-fold improvement in continuity due to properly assembled gaps compared to the previously published C. hircus assembly, and better resolves repetitive structures longer than 1 kb, representing the largest repeat family and immune gene complex ever produced for an individual of a ruminant species. PMID:28263316
Identification of single amino acid substitutions (SAAS) in neuraminidase from influenza a virus (H1N1) via mass spectrometry analysis coupled with de novo peptide sequencing.

PubMed

Peng, Qisheng; Wang, Zijian; Wu, Donglin; Li, Xiaoou; Liu, Xiaofeng; Sun, Wanchun; Liu, Ning

2016-08-01

Amino acid substitutions in the neuraminidase of the influenza virus are the main cause of the emergence of resistance to zanamivir or oseltamivir during seasonal influenza treatment; they are the result of non-synonymous mutations in the viral genome that can be successfully detected by polymer chain reaction (PCR)-based approaches. There is always an urgent need to detect variation in amino acid sequences directly at the protein level. Mass spectrometry coupled with de novo sequencing has been explored as an alternative and straightforward strategy for detecting amino acid substitutions, as well - this approach is the primary focus of the present study. Influenza virus (A/Puerto Rico/8/1934 H1N1) propagated in embryonated chicken eggs was purified by ultracentrifugation, followed by PNGase F treatment. The deglycosylated virion was lysed and separated by sodium dodecyl sulfate polyacrylamide gel electrophoresis (SDS-PAGE). The gel band corresponding to neuraminidase was picked up and subjected to liquid chromatography tandem mass spectrometry (LC-MS/MS) analysis. LC-MS/MS analyses, coupled with manual de novo sequencing, allowed the determination of three amino acid substitutions: R346K, S349 N, and S370I/L, in the neuraminidase from the influenza virus (A/Puerto Rico/8/1934 H1N1), which were located in three mutated peptides of the neuraminidase: YGNGVWIGK, TKNHSSR, and PNGWTETDI/LK, respectively. We found that the amino acid substitutions in the proteins of RNA viruses (including influenza A virus) resulting from non-synonymous gene mutations can indeed be directly analyzed via mass spectrometry, and that manual interpretation of the MS/MS data may be beneficial. Copyright © 2016 John Wiley & Sons, Ltd. Copyright © 2016 John Wiley & Sons, Ltd.
A Comprehensive Analysis of Transcript-Supported De Novo Genes in Saccharomyces sensu stricto Yeasts

PubMed Central

Lu, Tzu-Chiao; Leu, Jun-Yi; Lin, Wen-Chang

2017-01-01

Abstract Novel genes arising from random DNA sequences (de novo genes) have been suggested to be widespread in the genomes of different organisms. However, our knowledge about the origin and evolution of de novo genes is still limited. To systematically understand the general features of de novo genes, we established a robust pipeline to analyze >20,000 transcript-supported coding sequences (CDSs) from the budding yeast Saccharomyces cerevisiae. Our analysis pipeline combined phylogeny, synteny, and sequence alignment information to identify possible orthologs across 20 Saccharomycetaceae yeasts and discovered 4,340 S. cerevisiae-specific de novo genes and 8,871 S. sensu stricto-specific de novo genes. We further combine information on CDS positions and transcript structures to show that >65% of de novo genes arose from transcript isoforms of ancient genes, especially in the upstream and internal regions of ancient genes. Fourteen identified de novo genes with high transcript levels were chosen to verify their protein expressions. Ten of them, including eight transcript isoform-associated CDSs, showed translation signals and five proteins exhibited specific cytosolic localizations. Our results suggest that de novo genes frequently arise in the S. sensu stricto complex and have the potential to be quickly integrated into ancient cellular network. PMID:28981695
Transcriptome sequencing and de novo analysis of the copepod Calanus sinicus using 454 GS FLX.

PubMed

Ning, Juan; Wang, Minxiao; Li, Chaolun; Sun, Song

2013-01-01

Despite their species abundance and primary economic importance, genomic information about copepods is still limited. In particular, genomic resources are lacking for the copepod Calanus sinicus, which is a dominant species in the coastal waters of East Asia. In this study, we performed de novo transcriptome sequencing to produce a large number of expressed sequence tags for the copepod C. sinicus. Copepodid larvae and adults were used as the basic material for transcriptome sequencing. Using 454 pyrosequencing, a total of 1,470,799 reads were obtained, which were assembled into 56,809 high quality expressed sequence tags. Based on their sequence similarity to known proteins, about 14,000 different genes were identified, including members of all major conserved signaling pathways. Transcripts that were putatively involved with growth, lipid metabolism, molting, and diapause were also identified among these genes. Differentially expressed genes related to several processes were found in C. sinicus copepodid larvae and adults. We detected 284,154 single nucleotide polymorphisms (SNPs) that provide a resource for gene function studies. Our data provide the most comprehensive transcriptome resource available for C. sinicus. This resource allowed us to identify genes associated with primary physiological processes and SNPs in coding regions, which facilitated the quantitative analysis of differential gene expression. These data should provide foundation for future genetic and genomic studies of this and related species.
De novo assembly of highly polymorphic metagenomic data using in situ generated reference sequences and a novel BLAST-based assembly pipeline.

PubMed

Lin, You-Yu; Hsieh, Chia-Hung; Chen, Jiun-Hong; Lu, Xuemei; Kao, Jia-Horng; Chen, Pei-Jer; Chen, Ding-Shinn; Wang, Hurng-Yi

2017-04-26

The accuracy of metagenomic assembly is usually compromised by high levels of polymorphism due to divergent reads from the same genomic region recognized as different loci when sequenced and assembled together. A viral quasispecies is a group of abundant and diversified genetically related viruses found in a single carrier. Current mainstream assembly methods, such as Velvet and SOAPdenovo, were not originally intended for the assembly of such metagenomics data, and therefore demands for new methods to provide accurate and informative assembly results for metagenomic data. In this study, we present a hybrid method for assembling highly polymorphic data combining the partial de novo-reference assembly (PDR) strategy and the BLAST-based assembly pipeline (BBAP). The PDR strategy generates in situ reference sequences through de novo assembly of a randomly extracted partial data set which is subsequently used for the reference assembly for the full data set. BBAP employs a greedy algorithm to assemble polymorphic reads. We used 12 hepatitis B virus quasispecies NGS data sets from a previous study to assess and compare the performance of both PDR and BBAP. Analyses suggest the high polymorphism of a full metagenomic data set leads to fragmentized de novo assembly results, whereas the biased or limited representation of external reference sequences included fewer reads into the assembly with lower assembly accuracy and variation sensitivity. In comparison, the PDR generated in situ reference sequence incorporated more reads into the final PDR assembly of the full metagenomics data set along with greater accuracy and higher variation sensitivity. BBAP assembly results also suggest higher assembly efficiency and accuracy compared to other assembly methods. Additionally, BBAP assembly recovered HBV structural variants that were not observed amongst assembly results of other methods. Together, PDR/BBAP assembly results were significantly better than other compared methods
De novo assembly and characterization of the garlic (Allium sativum) bud transcriptome by Illumina sequencing.

PubMed

Sun, Xiudong; Zhou, Shumei; Meng, Fanlu; Liu, Shiqi

2012-10-01

Garlic is widely used as a spice throughout the world for the culinary value of its flavor and aroma, which are created by the chemical transformation of a series of organic sulfur compounds. To analyze the transcriptome of Allium sativum and discover the genes involved in sulfur metabolism, cDNAs derived from the total RNA of Allium sativum buds were analyzed by Illumina sequencing. Approximately 26.67 million 90 bp paired-end clean reads were achieved in two libraries. A total of 127,933 unigenes were generated by de novo assembly and were compared with the sequences in public databases. Of these, 45,286 unigenes had significant hits to the sequences in the Nr database, 29,514 showed significant similarity to known proteins in the Swiss-Prot database and, 20,706 and 21,952 unigenes had significant similarity to existing sequences in the KEGG and COG databases, respectively. Moreover, genes involved in organic sulfur biosynthesis were identified. These unigenes data will provide the foundation for research on gene expression, genomics and functional genomics in Allium sativum. Key message The obtained unigenes will provide the foundation for research on functional genomics in Allium sativum and its closely related species, and fill the gap of the existing plant EST database.
The genome sequence of the outbreeding globe artichoke constructed de novo incorporating a phase-aware low-pass sequencing strategy of F1 progeny

PubMed Central

Scaglione, Davide; Reyes-Chin-Wo, Sebastian; Acquadro, Alberto; Froenicke, Lutz; Portis, Ezio; Beitel, Christopher; Tirone, Matteo; Mauro, Rosario; Lo Monaco, Antonino; Mauromicale, Giovanni; Faccioli, Primetta; Cattivelli, Luigi; Rieseberg, Loren; Michelmore, Richard; Lanteri, Sergio

2016-01-01

Globe artichoke (Cynara cardunculus var. scolymus) is an out-crossing, perennial, multi-use crop species that is grown worldwide and belongs to the Compositae, one of the most successful Angiosperm families. We describe the first genome sequence of globe artichoke. The assembly, comprising of 13,588 scaffolds covering 725 of the 1,084 Mb genome, was generated using ~133-fold Illumina sequencing data and encodes 26,889 predicted genes. Re-sequencing (30×) of globe artichoke and cultivated cardoon (C. cardunculus var. altilis) parental genotypes and low-coverage (0.5 to 1×) genotyping-by-sequencing of 163 F1 individuals resulted in 73% of the assembled genome being anchored in 2,178 genetic bins ordered along 17 chromosomal pseudomolecules. This was achieved using a novel pipeline, SOILoCo (Scaffold Ordering by Imputation with Low Coverage), to detect heterozygous regions and assign parental haplotypes with low sequencing read depth and of unknown phase. SOILoCo provides a powerful tool for de novo genome analysis of outcrossing species. Our data will enable genome-scale analyses of evolutionary processes among crops, weeds, and wild species within and beyond the Compositae, and will facilitate the identification of economically important genes from related species. PMID:26786968
De novo-based transcriptome profiling of male-sterile and fertile watermelon lines

PubMed Central

Seo, Minseok; Jang, Yoon Jeong; Sim, Tae Yong; Cho, Seoae; Han, Sang-Wook

2017-01-01

The whole-genome sequence of watermelon (Citrullus lanatus (Thunb.) Matsum. & Nakai), a valuable horticultural crop worldwide, was released in 2013. Here, we compared a de novo-based approach (DBA) to a reference-based approach (RBA) using RNA-seq data, to aid in efforts to improve the annotation of the watermelon reference genome and to obtain biological insight into male-sterility in watermelon. We applied these techniques to available data from two watermelon lines: the male-sterile line DAH3615-MS and the male-fertile line DAH3615. Using DBA, we newly annotated 855 watermelon transcripts, and found gene functional clusters predicted to be related to stimulus responses, nucleic acid binding, transmembrane transport, homeostasis, and Golgi/vesicles. Among the DBA-annotated transcripts, 138 de novo-exclusive differentially-expressed genes (DEDEGs) related to male sterility were detected. Out of 33 randomly selected newly annotated transcripts and DEDEGs, 32 were validated by RT-qPCR. This study demonstrates the usefulness and reliability of the de novo transcriptome assembly in watermelon, and provides new insights for researchers exploring transcriptional blueprints with regard to the male sterility. PMID:29095876
De novo assembly and next-generation sequencing to analyse full-length gene variants from codon-barcoded libraries.

PubMed

Cho, Namjin; Hwang, Byungjin; Yoon, Jung-ki; Park, Sangun; Lee, Joongoo; Seo, Han Na; Lee, Jeewon; Huh, Sunghoon; Chung, Jinsoo; Bang, Duhee

2015-09-21

Interpreting epistatic interactions is crucial for understanding evolutionary dynamics of complex genetic systems and unveiling structure and function of genetic pathways. Although high resolution mapping of en masse variant libraries renders molecular biologists to address genotype-phenotype relationships, long-read sequencing technology remains indispensable to assess functional relationship between mutations that lie far apart. Here, we introduce JigsawSeq for multiplexed sequence identification of pooled gene variant libraries by combining a codon-based molecular barcoding strategy and de novo assembly of short-read data. We first validate JigsawSeq on small sub-pools and observed high precision and recall at various experimental settings. With extensive simulations, we then apply JigsawSeq to large-scale gene variant libraries to show that our method can be reliably scaled using next-generation sequencing. JigsawSeq may serve as a rapid screening tool for functional genomics and offer the opportunity to explore evolutionary trajectories of protein variants.
De novo assembled expressed gene catalog of a fast-growing Eucalyptus tree produced by Illumina mRNA-Seq

PubMed Central

2010-01-01

Background De novo assembly of transcript sequences produced by short-read DNA sequencing technologies offers a rapid approach to obtain expressed gene catalogs for non-model organisms. A draft genome sequence will be produced in 2010 for a Eucalyptus tree species (E. grandis) representing the most important hardwood fibre crop in the world. Genome annotation of this valuable woody plant and genetic dissection of its superior growth and productivity will be greatly facilitated by the availability of a comprehensive collection of expressed gene sequences from multiple tissues and organs. Results We present an extensive expressed gene catalog for a commercially grown E. grandis × E. urophylla hybrid clone constructed using only Illumina mRNA-Seq technology and de novo assembly. A total of 18,894 transcript-derived contigs, a large proportion of which represent full-length protein coding genes were assembled and annotated. Analysis of assembly quality, length and diversity show that this dataset represent the most comprehensive expressed gene catalog for any Eucalyptus tree. mRNA-Seq analysis furthermore allowed digital expression profiling of all of the assembled transcripts across diverse xylogenic and non-xylogenic tissues, which is invaluable for ascribing putative gene functions. Conclusions De novo assembly of Illumina mRNA-Seq reads is an efficient approach for transcriptome sequencing and profiling in Eucalyptus and other non-model organisms. The transcriptome resource (Eucspresso, http://eucspresso.bi.up.ac.za/) generated by this study will be of value for genomic analysis of woody biomass production in Eucalyptus and for comparative genomic analysis of growth and development in woody and herbaceous plants. PMID:21122097
De Novo Sequencing and Analysis of Lemongrass Transcriptome Provide First Insights into the Essential Oil Biosynthesis of Aromatic Grasses.

PubMed

Meena, Seema; Kumar, Sarma R; Venkata Rao, D K; Dwivedi, Varun; Shilpashree, H B; Rastogi, Shubhra; Shasany, Ajit K; Nagegowda, Dinesh A

2016-01-01

Aromatic grasses of the genus Cymbopogon (Poaceae family) represent unique group of plants that produce diverse composition of monoterpene rich essential oils, which have great value in flavor, fragrance, cosmetic, and aromatherapy industries. Despite the commercial importance of these natural aromatic oils, their biosynthesis at the molecular level remains unexplored. As the first step toward understanding the essential oil biosynthesis, we performed de novo transcriptome assembly and analysis of C. flexuosus (lemongrass) by employing Illumina sequencing. Mining of transcriptome data and subsequent phylogenetic analysis led to identification of terpene synthases, pyrophosphatases, alcohol dehydrogenases, aldo-keto reductases, carotenoid cleavage dioxygenases, alcohol acetyltransferases, and aldehyde dehydrogenases, which are potentially involved in essential oil biosynthesis. Comparative essential oil profiling and mRNA expression analysis in three Cymbopogon species (C. flexuosus, aldehyde type; C. martinii, alcohol type; and C. winterianus, intermediate type) with varying essential oil composition indicated the involvement of identified candidate genes in the formation of alcohols, aldehydes, and acetates. Molecular modeling and docking further supported the role of identified protein sequences in aroma formation in Cymbopogon. Also, simple sequence repeats were found in the transcriptome with many linked to terpene pathway genes including the genes potentially involved in aroma biosynthesis. This work provides the first insights into the essential oil biosynthesis of aromatic grasses, and the identified candidate genes and markers can be a great resource for biotechnological and molecular breeding approaches to modulate the essential oil composition.
De Novo Sequencing and Analysis of Lemongrass Transcriptome Provide First Insights into the Essential Oil Biosynthesis of Aromatic Grasses

PubMed Central

Meena, Seema; Kumar, Sarma R.; Venkata Rao, D. K.; Dwivedi, Varun; Shilpashree, H. B.; Rastogi, Shubhra; Shasany, Ajit K.; Nagegowda, Dinesh A.

2016-01-01

Aromatic grasses of the genus Cymbopogon (Poaceae family) represent unique group of plants that produce diverse composition of monoterpene rich essential oils, which have great value in flavor, fragrance, cosmetic, and aromatherapy industries. Despite the commercial importance of these natural aromatic oils, their biosynthesis at the molecular level remains unexplored. As the first step toward understanding the essential oil biosynthesis, we performed de novo transcriptome assembly and analysis of C. flexuosus (lemongrass) by employing Illumina sequencing. Mining of transcriptome data and subsequent phylogenetic analysis led to identification of terpene synthases, pyrophosphatases, alcohol dehydrogenases, aldo-keto reductases, carotenoid cleavage dioxygenases, alcohol acetyltransferases, and aldehyde dehydrogenases, which are potentially involved in essential oil biosynthesis. Comparative essential oil profiling and mRNA expression analysis in three Cymbopogon species (C. flexuosus, aldehyde type; C. martinii, alcohol type; and C. winterianus, intermediate type) with varying essential oil composition indicated the involvement of identified candidate genes in the formation of alcohols, aldehydes, and acetates. Molecular modeling and docking further supported the role of identified protein sequences in aroma formation in Cymbopogon. Also, simple sequence repeats were found in the transcriptome with many linked to terpene pathway genes including the genes potentially involved in aroma biosynthesis. This work provides the first insights into the essential oil biosynthesis of aromatic grasses, and the identified candidate genes and markers can be a great resource for biotechnological and molecular breeding approaches to modulate the essential oil composition. PMID:27516768
Increased Frequency of De Novo Copy Number Variations in Congenital Heart Disease by Integrative Analysis of SNP Array and Exome Sequence Data

PubMed Central

Rodriguez-Murillo, Laura; Fromer, Menachem; Mazaika, Erica; Vardarajan, Badri; Italia, Michael; Leipzig, Jeremy; DePalma, Steven R.; Golhar, Ryan; Sanders, Stephan J.; Yamrom, Boris; Ronemus, Michael; Iossifov, Ivan; Willsey, A. Jeremy; State, Matthew W.; Kaltman, Jonathan R.; White, Peter S.; Shen, Yufeng; Warburton, Dorothy; Brueckner, Martina; Seidman, Christine; Goldmuntz, Elizabeth; Gelb, Bruce D.; Lifton, Richard; Seidman, Jonathan; Hakonarson, Hakon; Chung, Wendy K.

2014-01-01

Rationale Congenital heart disease (CHD) is among the most common birth defects. Most cases are of unknown etiology. Objective To determine the contribution of de novo copy number variants (CNVs) in the etiology of sporadic CHD. Methods and Results We studied 538 CHD trios using genome-wide dense single nucleotide polymorphism (SNP) arrays and/or whole exome sequencing (WES). Results were experimentally validated using digital droplet PCR. We compared validated CNVs in CHD cases to CNVs in 1,301 healthy control trios. The two complementary high-resolution technologies identified 63 validated de novo CNVs in 51 CHD cases. A significant increase in CNV burden was observed when comparing CHD trios with healthy trios, using either SNP array (p=7x10−5, Odds Ratio (OR)=4.6) or WES data (p=6x10−4, OR=3.5) and remained after removing 16% of de novo CNV loci previously reported as pathogenic (p=0.02, OR=2.7). We observed recurrent de novo CNVs on 15q11.2 encompassing CYFIP1, NIPA1, and NIPA2 and single de novo CNVs encompassing DUSP1, JUN, JUP, MED15, MED9, PTPRE SREBF1, TOP2A, and ZEB2, genes that interact with established CHD proteins NKX2-5 and GATA4. Integrating de novo variants in WES and CNV data suggests that ETS1 is the pathogenic gene altered by 11q24.2-q25 deletions in Jacobsen syndrome and that CTBP2 is the pathogenic gene in 10q sub-telomeric deletions. Conclusions We demonstrate a significantly increased frequency of rare de novo CNVs in CHD patients compared with healthy controls and suggest several novel genetic loci for CHD. PMID:25205790
Capturing chloroplast variation for molecular ecology studies: a simple next generation sequencing approach applied to a rainforest tree

PubMed Central

2013-01-01

Background With high quantity and quality data production and low cost, next generation sequencing has the potential to provide new opportunities for plant phylogeographic studies on single and multiple species. Here we present an approach for in silicio chloroplast DNA assembly and single nucleotide polymorphism detection from short-read shotgun sequencing. The approach is simple and effective and can be implemented using standard bioinformatic tools. Results The chloroplast genome of Toona ciliata (Meliaceae), 159,514 base pairs long, was assembled from shotgun sequencing on the Illumina platform using de novo assembly of contigs. To evaluate its practicality, value and quality, we compared the short read assembly with an assembly completed using 454 data obtained after chloroplast DNA isolation. Sanger sequence verifications indicated that the Illumina dataset outperformed the longer read 454 data. Pooling of several individuals during preparation of the shotgun library enabled detection of informative chloroplast SNP markers. Following validation, we used the identified SNPs for a preliminary phylogeographic study of T. ciliata in Australia and to confirm low diversity across the distribution. Conclusions Our approach provides a simple method for construction of whole chloroplast genomes from shotgun sequencing of whole genomic DNA using short-read data and no available closely related reference genome (e.g. from the same species or genus). The high coverage of Illumina sequence data also renders this method appropriate for multiplexing and SNP discovery and therefore a useful approach for landscape level studies of evolutionary ecology. PMID:23497206

Library Design-Facilitated High-Throughput Sequencing of Synthetic Peptide Libraries.

PubMed

Vinogradov, Alexander A; Gates, Zachary P; Zhang, Chi; Quartararo, Anthony J; Halloran, Kathryn H; Pentelute, Bradley L

2017-11-13

A methodology to achieve high-throughput de novo sequencing of synthetic peptide mixtures is reported. The approach leverages shotgun nanoliquid chromatography coupled with tandem mass spectrometry-based de novo sequencing of library mixtures (up to 2000 peptides) as well as automated data analysis protocols to filter away incorrect assignments, noise, and synthetic side-products. For increasing the confidence in the sequencing results, mass spectrometry-friendly library designs were developed that enabled unambiguous decoding of up to 600 peptide sequences per hour while maintaining greater than 85% sequence identification rates in most cases. The reliability of the reported decoding strategy was additionally confirmed by matching fragmentation spectra for select authentic peptides identified from library sequencing samples. The methods reported here are directly applicable to screening techniques that yield mixtures of active compounds, including particle sorting of one-bead one-compound libraries and affinity enrichment of synthetic library mixtures performed in solution.
Two low coverage bird genomes and a comparison of reference-guided versus de novo genome assemblies.

PubMed

Card, Daren C; Schield, Drew R; Reyes-Velasco, Jacobo; Fujita, Matthew K; Andrew, Audra L; Oyler-McCance, Sara J; Fike, Jennifer A; Tomback, Diana F; Ruggiero, Robert P; Castoe, Todd A

2014-01-01

As a greater number and diversity of high-quality vertebrate reference genomes become available, it is increasingly feasible to use these references to guide new draft assemblies for related species. Reference-guided assembly approaches may substantially increase the contiguity and completeness of a new genome using only low levels of genome coverage that might otherwise be insufficient for de novo genome assembly. We used low-coverage (∼3.5-5.5x) Illumina paired-end sequencing to assemble draft genomes of two bird species (the Gunnison Sage-Grouse, Centrocercus minimus, and the Clark's Nutcracker, Nucifraga columbiana). We used these data to estimate de novo genome assemblies and reference-guided assemblies, and compared the information content and completeness of these assemblies by comparing CEGMA gene set representation, repeat element content, simple sequence repeat content, and GC isochore structure among assemblies. Our results demonstrate that even lower-coverage genome sequencing projects are capable of producing informative and useful genomic resources, particularly through the use of reference-guided assemblies.
Two low coverage bird genomes and a comparison of reference-guided versus de novo genome assemblies

USGS Publications Warehouse

Card, Daren C.; Schield, Drew R.; Reyes-Velasco, Jacobo; Fujita, Matthre K.; Andrew, Audra L.; Oyler-McCance, Sara J.; Fike, Jennifer A.; Tomback, Diana F.; Ruggiero, Robert P.; Castoe, Todd A.

2014-01-01

As a greater number and diversity of high-quality vertebrate reference genomes become available, it is increasingly feasible to use these references to guide new draft assemblies for related species. Reference-guided assembly approaches may substantially increase the contiguity and completeness of a new genome using only low levels of genome coverage that might otherwise be insufficient for de novo genome assembly. We used low-coverage (~3.5–5.5x) Illumina paired-end sequencing to assemble draft genomes of two bird species (the Gunnison Sage-Grouse, Centrocercus minimus, and the Clark's Nutcracker, Nucifraga columbiana). We used these data to estimate de novo genome assemblies and reference-guided assemblies, and compared the information content and completeness of these assemblies by comparing CEGMA gene set representation, repeat element content, simple sequence repeat content, and GC isochore structure among assemblies. Our results demonstrate that even lower-coverage genome sequencing projects are capable of producing informative and useful genomic resources, particularly through the use of reference-guided assemblies.
Single-Cell RNA Sequencing Reveals T Helper Cells Synthesizing Steroids De Novo to Contribute to Immune Homeostasis

PubMed Central

Mahata, Bidesh; Zhang, Xiuwei; Kolodziejczyk, Aleksandra A.; Proserpio, Valentina; Haim-Vilmovsky, Liora; Taylor, Angela E.; Hebenstreit, Daniel; Dingler, Felix A.; Moignard, Victoria; Göttgens, Berthold; Arlt, Wiebke; McKenzie, Andrew N.J.; Teichmann, Sarah A.

2014-01-01

Summary T helper 2 (Th2) cells regulate helminth infections, allergic disorders, tumor immunity, and pregnancy by secreting various cytokines. It is likely that there are undiscovered Th2 signaling molecules. Although steroids are known to be immunoregulators, de novo steroid production from immune cells has not been previously characterized. Here, we demonstrate production of the steroid pregnenolone by Th2 cells in vitro and in vivo in a helminth infection model. Single-cell RNA sequencing and quantitative PCR analysis suggest that pregnenolone synthesis in Th2 cells is related to immunosuppression. In support of this, we show that pregnenolone inhibits Th cell proliferation and B cell immunoglobulin class switching. We also show that steroidogenic Th2 cells inhibit Th cell proliferation in a Cyp11a1 enzyme-dependent manner. We propose pregnenolone as a “lymphosteroid,” a steroid produced by lymphocytes. We speculate that this de novo steroid production may be an intrinsic phenomenon of Th2-mediated immune responses to actively restore immune homeostasis. PMID:24813893
De Novo Proteins with Life-Sustaining Functions Are Structurally Dynamic.

PubMed

Murphy, Grant S; Greisman, Jack B; Hecht, Michael H

2016-01-29

Designing and producing novel proteins that fold into stable structures and provide essential biological functions are key goals in synthetic biology. In initial steps toward achieving these goals, we constructed a combinatorial library of de novo proteins designed to fold into 4-helix bundles. As described previously, screening this library for sequences that function in vivo to rescue conditionally lethal mutants of Escherichia coli (auxotrophs) yielded several de novo sequences, termed SynRescue proteins, which rescued four different E. coli auxotrophs. In an effort to understand the structural requirements necessary for auxotroph rescue, we investigated the biophysical properties of the SynRescue proteins, using both computational and experimental approaches. Results from circular dichroism, size-exclusion chromatography, and NMR demonstrate that the SynRescue proteins are α-helical and relatively stable. Surprisingly, however, they do not form well-ordered structures. Instead, they form dynamic structures that fluctuate between monomeric and dimeric states. These findings show that a well-ordered structure is not a prerequisite for life-sustaining functions, and suggests that dynamic structures may have been important in the early evolution of protein function. Copyright © 2015 Elsevier Ltd. All rights reserved.
Complete telomere-to-telomere de novo assembly of the Plasmodium falciparum genome through long-read (>11 kb), single molecule, real-time sequencing

PubMed Central

Vembar, Shruthi Sridhar; Seetin, Matthew; Lambert, Christine; Nattestad, Maria; Schatz, Michael C.; Baybayan, Primo; Scherf, Artur; Smith, Melissa Laird

2016-01-01

The application of next-generation sequencing to estimate genetic diversity of Plasmodium falciparum, the most lethal malaria parasite, has proved challenging due to the skewed AT-richness [∼80.6% (A + T)] of its genome and the lack of technology to assemble highly polymorphic subtelomeric regions that contain clonally variant, multigene virulence families (Ex: var and rifin). To address this, we performed amplification-free, single molecule, real-time sequencing of P. falciparum genomic DNA and generated reads of average length 12 kb, with 50% of the reads between 15.5 and 50 kb in length. Next, using the Hierarchical Genome Assembly Process, we assembled the P. falciparum genome de novo and successfully compiled all 14 nuclear chromosomes telomere-to-telomere. We also accurately resolved centromeres [∼90–99% (A + T)] and subtelomeric regions and identified large insertions and duplications that add extra var and rifin genes to the genome, along with smaller structural variants such as homopolymer tract expansions. Overall, we show that amplification-free, long-read sequencing combined with de novo assembly overcomes major challenges inherent to studying the P. falciparum genome. Indeed, this technology may not only identify the polymorphic and repetitive subtelomeric sequences of parasite populations from endemic areas but may also evaluate structural variation linked to virulence, drug resistance and disease transmission. PMID:27345719
Automated de novo phasing and model building of coiled-coil proteins.

PubMed

Rämisch, Sebastian; Lizatović, Robert; André, Ingemar

2015-03-01

Models generated by de novo structure prediction can be very useful starting points for molecular replacement for systems where suitable structural homologues cannot be readily identified. Protein-protein complexes and de novo-designed proteins are examples of systems that can be challenging to phase. In this study, the potential of de novo models of protein complexes for use as starting points for molecular replacement is investigated. The approach is demonstrated using homomeric coiled-coil proteins, which are excellent model systems for oligomeric systems. Despite the stereotypical fold of coiled coils, initial phase estimation can be difficult and many structures have to be solved with experimental phasing. A method was developed for automatic structure determination of homomeric coiled coils from X-ray diffraction data. In a benchmark set of 24 coiled coils, ranging from dimers to pentamers with resolutions down to 2.5 Å, 22 systems were automatically solved, 11 of which had previously been solved by experimental phasing. The generated models contained 71-103% of the residues present in the deposited structures, had the correct sequence and had free R values that deviated on average by 0.01 from those of the respective reference structures. The electron-density maps were of sufficient quality that only minor manual editing was necessary to produce final structures. The method, named CCsolve, combines methods for de novo structure prediction, initial phase estimation and automated model building into one pipeline. CCsolve is robust against errors in the initial models and can readily be modified to make use of alternative crystallographic software. The results demonstrate the feasibility of de novo phasing of protein-protein complexes, an approach that could also be employed for other small systems beyond coiled coils.
CABS-fold: Server for the de novo and consensus-based prediction of protein structure.

PubMed

Blaszczyk, Maciej; Jamroz, Michal; Kmiecik, Sebastian; Kolinski, Andrzej

2013-07-01

The CABS-fold web server provides tools for protein structure prediction from sequence only (de novo modeling) and also using alternative templates (consensus modeling). The web server is based on the CABS modeling procedures ranked in previous Critical Assessment of techniques for protein Structure Prediction competitions as one of the leading approaches for de novo and template-based modeling. Except for template data, fragmentary distance restraints can also be incorporated into the modeling process. The web server output is a coarse-grained trajectory of generated conformations, its Jmol representation and predicted models in all-atom resolution (together with accompanying analysis). CABS-fold can be freely accessed at http://biocomp.chem.uw.edu.pl/CABSfold.
CABS-fold: server for the de novo and consensus-based prediction of protein structure

PubMed Central

Blaszczyk, Maciej; Jamroz, Michal; Kmiecik, Sebastian; Kolinski, Andrzej

2013-01-01

The CABS-fold web server provides tools for protein structure prediction from sequence only (de novo modeling) and also using alternative templates (consensus modeling). The web server is based on the CABS modeling procedures ranked in previous Critical Assessment of techniques for protein Structure Prediction competitions as one of the leading approaches for de novo and template-based modeling. Except for template data, fragmentary distance restraints can also be incorporated into the modeling process. The web server output is a coarse-grained trajectory of generated conformations, its Jmol representation and predicted models in all-atom resolution (together with accompanying analysis). CABS-fold can be freely accessed at http://biocomp.chem.uw.edu.pl/CABSfold. PMID:23748950
MetaVelvet: An Extension of Velvet Assembler to de novo Metagenome Assembly from Short Sequence Reads (Metagenomics Informatics Challenges Workshop: 10K Genomes at a Time)

ScienceCinema

Sakakibara, Yasumbumi

2018-02-13

Keio University's Yasumbumi Sakakibara on "MetaVelvet: An Extension of Velvet Assembler to de novo Metagenome Assembly from Short Sequence Reads" at the Metagenomics Informatics Challenges Workshop held at the DOE JGI on October 12-13, 2011.
MetaVelvet: An Extension of Velvet Assembler to de novo Metagenome Assembly from Short Sequence Reads (Metagenomics Informatics Challenges Workshop: 10K Genomes at a Time)

DOE Office of Scientific and Technical Information (OSTI.GOV)

Sakakibara, Yasumbumi

2011-10-13

Keio University's Yasumbumi Sakakibara on "MetaVelvet: An Extension of Velvet Assembler to de novo Metagenome Assembly from Short Sequence Reads" at the Metagenomics Informatics Challenges Workshop held at the DOE JGI on October 12-13, 2011.
Non-invasive prenatal diagnosis of achondroplasia and thanatophoric dysplasia: next-generation sequencing allows for a safer, more accurate, and comprehensive approach

PubMed Central

Chitty, Lyn S; Mason, Sarah; Barrett, Angela N; McKay, Fiona; Lench, Nicholas; Daley, Rebecca; Jenkins, Lucy A

2015-01-01

Abstract Objective Accurate prenatal diagnosis of genetic conditions can be challenging and usually requires invasive testing. Here, we demonstrate the potential of next-generation sequencing (NGS) for the analysis of cell-free DNA in maternal blood to transform prenatal diagnosis of monogenic disorders. Methods Analysis of cell-free DNA using a PCR and restriction enzyme digest (PCR–RED) was compared with a novel NGS assay in pregnancies at risk of achondroplasia and thanatophoric dysplasia. Results PCR–RED was performed in 72 cases and was correct in 88.6%, inconclusive in 7% with one false negative. NGS was performed in 47 cases and was accurate in 96.2% with no inconclusives. Both approaches were used in 27 cases, with NGS giving the correct result in the two cases inconclusive with PCR–RED. Conclusion NGS provides an accurate, flexible approach to non-invasive prenatal diagnosis of de novo and paternally inherited mutations. It is more sensitive than PCR–RED and is ideal when screening a gene with multiple potential pathogenic mutations. These findings highlight the value of NGS in the development of non-invasive prenatal diagnosis for other monogenic disorders. © 2015 The Authors. Prenatal Diagnosis published by John Wiley & Sons, Ltd. What's already known about this topic? Non-invasive prenatal diagnosis (NIPD) using PCR-based methods has been reported for the detection or exclusion of individual paternally inherited or de novo alleles in maternal plasma. What does this study add? NIPD using next generation sequencing provides an accurate, more sensitive approach which can be used to detect multiple mutations in a single assay and so is ideal when screening a gene with multiple potential pathogenic mutations. Next generation sequencing thus provides a flexible approach to non-invasive prenatal diagnosis ideal for use in a busy service laboratory. PMID:25728633
Hybrid error correction and de novo assembly of single-molecule sequencing reads

PubMed Central

Koren, Sergey; Schatz, Michael C.; Walenz, Brian P.; Martin, Jeffrey; Howard, Jason; Ganapathy, Ganeshkumar; Wang, Zhong; Rasko, David A.; McCombie, W. Richard; Jarvis, Erich D.; Phillippy, Adam M.

2012-01-01

Emerging single-molecule sequencing instruments can generate multi-kilobase sequences with the potential to dramatically improve genome and transcriptome assembly. However, the high error rate of single-molecule reads is challenging, and has limited their use to resequencing bacteria. To address this limitation, we introduce a novel correction algorithm and assembly strategy that utilizes shorter, high-identity sequences to correct the error in single-molecule sequences. We demonstrate the utility of this approach on Pacbio RS reads of phage, prokaryotic, and eukaryotic whole genomes, including the novel genome of the parrot Melopsittacus undulatus, as well as for RNA-seq reads of the corn (Zea mays) transcriptome. Our approach achieves over 99.9% read correction accuracy and produces substantially better assemblies than current sequencing strategies: in the best example, quintupling the median contig size relative to high-coverage, second-generation assemblies. Greater gains are predicted if read lengths continue to increase, including the prospect of single-contig bacterial chromosome assembly. PMID:22750884
Improving prokaryotic transposable elements identification using a combination of de novo and profile HMM methods.

PubMed

Kamoun, Choumouss; Payen, Thibaut; Hua-Van, Aurélie; Filée, Jonathan

2013-10-11

Insertion Sequences (ISs) and their non-autonomous derivatives (MITEs) are important components of prokaryotic genomes inducing duplication, deletion, rearrangement or lateral gene transfers. Although ISs and MITEs are relatively simple and basic genetic elements, their detection remains a difficult task due to their remarkable sequence diversity. With the advent of high-throughput genome and metagenome sequencing technologies, the development of fast, reliable and sensitive methods of ISs and MITEs detection become an important challenge. So far, almost all studies dealing with prokaryotic transposons have used classical BLAST-based detection methods against reference libraries. Here we introduce alternative methods of detection either taking advantages of the structural properties of the elements (de novo methods) or using an additional library-based method using profile HMM searches. In this study, we have developed three different work flows dedicated to ISs and MITEs detection: the first two use de novo methods detecting either repeated sequences or presence of Inverted Repeats; the third one use 28 in-house transposase alignment profiles with HMM search methods. We have compared the respective performances of each method using a reference dataset of 30 archaeal and 30 bacterial genomes in addition to simulated and real metagenomes. Compared to a BLAST-based method using ISFinder as library, de novo methods significantly improve ISs and MITEs detection. For example, in the 30 archaeal genomes, we discovered 30 new elements (+20%) in addition to the 141 multi-copies elements already detected by the BLAST approach. Many of the new elements correspond to ISs belonging to unknown or highly divergent families. The total number of MITEs has even doubled with the discovery of elements displaying very limited sequence similarities with their respective autonomous partners (mainly in the Inverted Repeats of the elements). Concerning metagenomes, with the exception of
Software for pre-processing Illumina next-generation sequencing short read sequences

PubMed Central

2014-01-01

Background When compared to Sanger sequencing technology, next-generation sequencing (NGS) technologies are hindered by shorter sequence read length, higher base-call error rate, non-uniform coverage, and platform-specific sequencing artifacts. These characteristics lower the quality of their downstream analyses, e.g. de novo and reference-based assembly, by introducing sequencing artifacts and errors that may contribute to incorrect interpretation of data. Although many tools have been developed for quality control and pre-processing of NGS data, none of them provide flexible and comprehensive trimming options in conjunction with parallel processing to expedite pre-processing of large NGS datasets. Methods We developed ngsShoRT (next-generation sequencing Short Reads Trimmer), a flexible and comprehensive open-source software package written in Perl that provides a set of algorithms commonly used for pre-processing NGS short read sequences. We compared the features and performance of ngsShoRT with existing tools: CutAdapt, NGS QC Toolkit and Trimmomatic. We also compared the effects of using pre-processed short read sequences generated by different algorithms on de novo and reference-based assembly for three different genomes: Caenorhabditis elegans, Saccharomyces cerevisiae S288c, and Escherichia coli O157 H7. Results Several combinations of ngsShoRT algorithms were tested on publicly available Illumina GA II, HiSeq 2000, and MiSeq eukaryotic and bacteria genomic short read sequences with the focus on removing sequencing artifacts and low-quality reads and/or bases. Our results show that across three organisms and three sequencing platforms, trimming improved the mean quality scores of trimmed sequences. Using trimmed sequences for de novo and reference-based assembly improved assembly quality as well as assembler performance. In general, ngsShoRT outperformed comparable trimming tools in terms of trimming speed and improvement of de novo and reference
Two Low Coverage Bird Genomes and a Comparison of Reference-Guided versus De Novo Genome Assemblies

PubMed Central

Card, Daren C.; Schield, Drew R.; Reyes-Velasco, Jacobo; Fujita, Matthew K.; Andrew, Audra L.; Oyler-McCance, Sara J.; Fike, Jennifer A.; Tomback, Diana F.; Ruggiero, Robert P.; Castoe, Todd A.

2014-01-01

As a greater number and diversity of high-quality vertebrate reference genomes become available, it is increasingly feasible to use these references to guide new draft assemblies for related species. Reference-guided assembly approaches may substantially increase the contiguity and completeness of a new genome using only low levels of genome coverage that might otherwise be insufficient for de novo genome assembly. We used low-coverage (∼3.5–5.5x) Illumina paired-end sequencing to assemble draft genomes of two bird species (the Gunnison Sage-Grouse, Centrocercus minimus, and the Clark's Nutcracker, Nucifraga columbiana). We used these data to estimate de novo genome assemblies and reference-guided assemblies, and compared the information content and completeness of these assemblies by comparing CEGMA gene set representation, repeat element content, simple sequence repeat content, and GC isochore structure among assemblies. Our results demonstrate that even lower-coverage genome sequencing projects are capable of producing informative and useful genomic resources, particularly through the use of reference-guided assemblies. PMID:25192061
Effective de novo assembly of fish genome using haploid larvae.

PubMed

Iwasaki, Yuki; Nishiki, Issei; Nakamura, Yoji; Yasuike, Motoshige; Kai, Wataru; Nomura, Kazuharu; Yoshida, Kazunori; Nomura, Yousuke; Fujiwara, Atushi; Kobayashi, Takanori; Ototake, Mitsuru

2016-02-01

Recent improvements in next-generation sequencing technology have made it possible to do whole genome sequencing, on even non-model eukaryote species with no available reference genomes. However, de novo assembly of diploid genomes is still a big challenge because of allelic variation. The aim of this study was to determine the feasibility of utilizing the genome of haploid fish larvae for de novo assembly of whole-genome sequences. We compared the efficiency of assembly using the haploid genome of yellowtail (Seriola quinqueradiata) with that using the diploid genome obtained from the dam. De novo assembly from the haploid and the diploid sequence reads (100 million reads per each datasets) generated by the Ion Proton sequencer (200 bp) was done under two different assembly algorithms, namely overlap-layout-consensus (OLC) and de Bruijn graph (DBG). This revealed that the assembly of the haploid genome significantly reduced (approximately 22% for OLC, 9% for DBG) the total number of contigs (with longer average and N50 contig lengths) when compared to the diploid genome assembly. The haploid assembly also improved the quality of the scaffolds by reducing the number of regions with unassigned nucleotides (Ns) (total length of Ns; 45,331,916 bp for haploids and 67,724,360 bp for diploids) in OLC-based assemblies. It appears clear that the haploid genome assembly is better because the allelic variation in the diploid genome disrupts the extension of contigs during the assembly process. Our results indicate that utilizing the genome of haploid larvae leads to a significant improvement in the de novo assembly process, thus providing a novel strategy for the construction of reference genomes from non-model diploid organisms such as fish. Copyright © 2015 The Authors. Published by Elsevier B.V. All rights reserved.
A new strategy for genome assembly using short sequence reads and reduced representation libraries.

PubMed

Young, Andrew L; Abaan, Hatice Ozel; Zerbino, Daniel; Mullikin, James C; Birney, Ewan; Margulies, Elliott H

2010-02-01

We have developed a novel approach for using massively parallel short-read sequencing to generate fast and inexpensive de novo genomic assemblies comparable to those generated by capillary-based methods. The ultrashort (<100 base) sequences generated by this technology pose specific biological and computational challenges for de novo assembly of large genomes. To account for this, we devised a method for experimentally partitioning the genome using reduced representation (RR) libraries prior to assembly. We use two restriction enzymes independently to create a series of overlapping fragment libraries, each containing a tractable subset of the genome. Together, these libraries allow us to reassemble the entire genome without the need of a reference sequence. As proof of concept, we applied this approach to sequence and assembled the majority of the 125-Mb Drosophila melanogaster genome. We subsequently demonstrate the accuracy of our assembly method with meaningful comparisons against the current available D. melanogaster reference genome (dm3). The ease of assembly and accuracy for comparative genomics suggest that our approach will scale to future mammalian genome-sequencing efforts, saving both time and money without sacrificing quality.
De novo assembly of a haplotype-resolved human genome.

PubMed

Cao, Hongzhi; Wu, Honglong; Luo, Ruibang; Huang, Shujia; Sun, Yuhui; Tong, Xin; Xie, Yinlong; Liu, Binghang; Yang, Hailong; Zheng, Hancheng; Li, Jian; Li, Bo; Wang, Yu; Yang, Fang; Sun, Peng; Liu, Siyang; Gao, Peng; Huang, Haodong; Sun, Jing; Chen, Dan; He, Guangzhu; Huang, Weihua; Huang, Zheng; Li, Yue; Tellier, Laurent C A M; Liu, Xiao; Feng, Qiang; Xu, Xun; Zhang, Xiuqing; Bolund, Lars; Krogh, Anders; Kristiansen, Karsten; Drmanac, Radoje; Drmanac, Snezana; Nielsen, Rasmus; Li, Songgang; Wang, Jian; Yang, Huanming; Li, Yingrui; Wong, Gane Ka-Shu; Wang, Jun

2015-06-01

The human genome is diploid, and knowledge of the variants on each chromosome is important for the interpretation of genomic information. Here we report the assembly of a haplotype-resolved diploid genome without using a reference genome. Our pipeline relies on fosmid pooling together with whole-genome shotgun strategies, based solely on next-generation sequencing and hierarchical assembly methods. We applied our sequencing method to the genome of an Asian individual and generated a 5.15-Gb assembled genome with a haplotype N50 of 484 kb. Our analysis identified previously undetected indels and 7.49 Mb of novel coding sequences that could not be aligned to the human reference genome, which include at least six predicted genes. This haplotype-resolved genome represents the most complete de novo human genome assembly to date. Application of our approach to identify individual haplotype differences should aid in translating genotypes to phenotypes for the development of personalized medicine.
High-fidelity target sequencing of individual molecules identified using barcode sequences: de novo detection and absolute quantitation of mutations in plasma cell-free DNA from cancer patients.

PubMed

Kukita, Yoji; Matoba, Ryo; Uchida, Junji; Hamakawa, Takuya; Doki, Yuichiro; Imamura, Fumio; Kato, Kikuya

2015-08-01

Circulating tumour DNA (ctDNA) is an emerging field of cancer research. However, current ctDNA analysis is usually restricted to one or a few mutation sites due to technical limitations. In the case of massively parallel DNA sequencers, the number of false positives caused by a high read error rate is a major problem. In addition, the final sequence reads do not represent the original DNA population due to the global amplification step during the template preparation. We established a high-fidelity target sequencing system of individual molecules identified in plasma cell-free DNA using barcode sequences; this system consists of the following two steps. (i) A novel target sequencing method that adds barcode sequences by adaptor ligation. This method uses linear amplification to eliminate the errors introduced during the early cycles of polymerase chain reaction. (ii) The monitoring and removal of erroneous barcode tags. This process involves the identification of individual molecules that have been sequenced and for which the number of mutations have been absolute quantitated. Using plasma cell-free DNA from patients with gastric or lung cancer, we demonstrated that the system achieved near complete elimination of false positives and enabled de novo detection and absolute quantitation of mutations in plasma cell-free DNA. © The Author 2015. Published by Oxford University Press on behalf of Kazusa DNA Research Institute.

De novo selection of oncogenes.

PubMed

Chacón, Kelly M; Petti, Lisa M; Scheideman, Elizabeth H; Pirazzoli, Valentina; Politi, Katerina; DiMaio, Daniel

2014-01-07

All cellular proteins are derived from preexisting ones by natural selection. Because of the random nature of this process, many potentially useful protein structures never arose or were discarded during evolution. Here, we used a single round of genetic selection in mouse cells to isolate chemically simple, biologically active transmembrane proteins that do not contain any amino acid sequences from preexisting proteins. We screened a retroviral library expressing hundreds of thousands of proteins consisting of hydrophobic amino acids in random order to isolate four 29-aa proteins that induced focus formation in mouse and human fibroblasts and tumors in mice. These proteins share no amino acid sequences with known cellular or viral proteins, and the simplest of them contains only seven different amino acids. They transformed cells by forming a stable complex with the platelet-derived growth factor β receptor transmembrane domain and causing ligand-independent receptor activation. We term this approach de novo selection and suggest that it can be used to generate structures and activities not observed in nature, create prototypes for novel research reagents and therapeutics, and provide insight into cell biology, transmembrane protein-protein interactions, and possibly virus evolution and the origin of life.
SearchGUI: A Highly Adaptable Common Interface for Proteomics Search and de Novo Engines.

PubMed

Barsnes, Harald; Vaudel, Marc

2018-05-25

Mass-spectrometry-based proteomics has become the standard approach for identifying and quantifying proteins. A vital step consists of analyzing experimentally generated mass spectra to identify the underlying peptide sequences for later mapping to the originating proteins. We here present the latest developments in SearchGUI, a common open-source interface for the most frequently used freely available proteomics search and de novo engines that has evolved into a central component in numerous bioinformatics workflows.
GlycoDeNovo - an Efficient Algorithm for Accurate de novo Glycan Topology Reconstruction from Tandem Mass Spectra

NASA Astrophysics Data System (ADS)

Hong, Pengyu; Sun, Hui; Sha, Long; Pu, Yi; Khatri, Kshitij; Yu, Xiang; Tang, Yang; Lin, Cheng

2017-08-01

A major challenge in glycomics is the characterization of complex glycan structures that are essential for understanding their diverse roles in many biological processes. We present a novel efficient computational approach, named GlycoDeNovo, for accurate elucidation of the glycan topologies from their tandem mass spectra. Given a spectrum, GlycoDeNovo first builds an interpretation-graph specifying how to interpret each peak using preceding interpreted peaks. It then reconstructs the topologies of peaks that contribute to interpreting the precursor ion. We theoretically prove that GlycoDeNovo is highly efficient. A major innovative feature added to GlycoDeNovo is a data-driven IonClassifier which can be used to effectively rank candidate topologies. IonClassifier is automatically learned from experimental spectra of known glycans to distinguish B- and C-type ions from all other ion types. Our results showed that GlycoDeNovo is robust and accurate for topology reconstruction of glycans from their tandem mass spectra. [Figure not available: see fulltext.
De novo mutations in regulatory elements in neurodevelopmental disorders

PubMed Central

Short, Patrick J.; McRae, Jeremy F.; Gallone, Giuseppe; Sifrim, Alejandro; Won, Hyejung; Geschwind, Daniel H.; Wright, Caroline F.; Firth, Helen V; FitzPatrick, David R.; Barrett, Jeffrey C.; Hurles, Matthew E.

2018-01-01

We previously estimated that 42% of patients with severe developmental disorders carry pathogenic de novo mutations in coding sequences. The role of de novo mutations in regulatory elements affecting genes associated with developmental disorders, or other genes, has been essentially unexplored. We identified de novo mutations in three classes of putative regulatory elements in almost 8,000 patients with developmental disorders. Here we show that de novo mutations in highly evolutionarily conserved fetal brain-active elements are significantly and specifically enriched in neurodevelopmental disorders. We identified a significant twofold enrichment of recurrently mutated elements. We estimate that, genome-wide, 1-3% of patients without a diagnostic coding variant carry pathogenic de novo mutations in fetal brain-active regulatory elements and that only 0.15% of all possible mutations within highly conserved fetal brain-active elements cause neurodevelopmental disorders with a dominant mechanism. Our findings represent a robust estimate of the contribution of de novo mutations in regulatory elements to this genetically heterogeneous set of disorders, and emphasize the importance of combining functional and evolutionary evidence to identify regulatory causes of genetic disorders. PMID:29562236
De novo gene mutations highlight patterns of genetic and neural complexity in schizophrenia

PubMed Central

Xu, Bin; Ionita-Laza, Iuliana; Roos, J. Louw; Boone, Braden; Woodrick, Scarlet; Sun, Yan; Levy, Shawn; Gogos, Joseph A.; Karayiorgou, Maria

2013-01-01

To evaluate evidence for de novo etiologies in schizophrenia, we sequenced at high coverage the exomes of families recruited from two populations with distinct demographic structure and history. We sequenced a total of 795 exomes from 231 parent-proband trios enriched for sporadic schizophrenia cases, as well as 34 unaffected trios. We observed in cases an excess of non-synonymous single nucleotide variants as well as a higher prevalence of gene-disruptive de novo mutations. We found four genes (LAMA2, DPYD, TRRAP and VPS39) affected by recurrent de novo events within or across the two populations, a finding unlikely to have occurred by chance. We show that de novo mutations affect genes with diverse functions and developmental profiles but we also find a substantial contribution of mutations in genes with higher expression in early fetal life. Our results help define the pattern of genomic and neural architecture of schizophrenia. PMID:23042115
Transcriptome de novo assembly sequencing and analysis of the toxic dinoflagellate Alexandrium catenella using the Illumina platform.

PubMed

Zhang, Shu; Sui, Zhenghong; Chang, Lianpeng; Kang, Kyoungho; Ma, Jinhua; Kong, Fanna; Zhou, Wei; Wang, Jinguo; Guo, Liliang; Geng, Huili; Zhong, Jie; Ma, Qingxia

2014-03-10

In this article, high-throughput de novo transcriptomic sequencing was performed in Alexandrium catenella, which provided the first view of the gene repertoire in this dinoflagellate based on next-generation sequencing (NGS) technologies. A total of 118,304 unigenes were identified with an average length of 673bp (base pair). Of these unigenes, 77,936 (65.9%) were annotated with known proteins based on sequence similarities, among which 24,149 and 22,956 unigenes were assigned to gene ontology categories (GO) and clusters of orthologous groups (COGs), respectively. Furthermore, 16,467 unigenes were mapped onto 322 pathways using the Kyoto Encyclopedia of Genes and Genomes Pathway database (KEGG). We also detected 1143 simple sequence repeats (SSRs), in which the tri-nucleotide repeat motif (69.3%) was the most abundant. The genetic facts and significance derived from the transcriptome dataset were suggested and discussed. All four core nucleosomal histones and linker histones were detected, in addition to the unigenes involved in histone modifications.190 unigenes were identified as being involved in the endocytosis pathway, and clathrin-dependent endocytosis was suggested to play a role in the heterotrophy of A. catenella. A conserved 22-nt spliced leader (SL) was identified in 21 unigenes which suggested the existence of trans-splicing processing of mRNA in A. catenella. Crown Copyright © 2013. Published by Elsevier B.V. All rights reserved.
Application of the whole-transcriptome shotgun sequencing approach to the study of Philadelphia-positive acute lymphoblastic leukemia

PubMed Central

Iacobucci, I; Ferrarini, A; Sazzini, M; Giacomelli, E; Lonetti, A; Xumerle, L; Ferrari, A; Papayannidis, C; Malerba, G; Luiselli, D; Boattini, A; Garagnani, P; Vitale, A; Soverini, S; Pane, F; Baccarani, M; Delledonne, M; Martinelli, G

2012-01-01

Although the pathogenesis of BCR–ABL1-positive acute lymphoblastic leukemia (ALL) is mainly related to the expression of the BCR–ABL1 fusion transcript, additional cooperating genetic lesions are supposed to be involved in its development and progression. Therefore, in an attempt to investigate the complex landscape of mutations, changes in expression profiles and alternative splicing (AS) events that can be observed in such disease, the leukemia transcriptome of a BCR–ABL1-positive ALL patient at diagnosis and at relapse was sequenced using a whole-transcriptome shotgun sequencing (RNA-Seq) approach. A total of 13.9 and 15.8 million sequence reads was generated from de novo and relapsed samples, respectively, and aligned to the human genome reference sequence. This led to the identification of five validated missense mutations in genes involved in metabolic processes (DPEP1, TMEM46), transport (MVP), cell cycle regulation (ABL1) and catalytic activity (CTSZ), two of which resulted in acquired relapse variants. In all, 6390 and 4671 putative AS events were also detected, as well as expression levels for 18 315 and 18 795 genes, 28% of which were differentially expressed in the two disease phases. These data demonstrate that RNA-Seq is a suitable approach for identifying a wide spectrum of genetic alterations potentially involved in ALL. PMID:22829256
A pipeline for the de novo assembly of the Themira biloba (Sepsidae: Diptera) transcriptome using a multiple k-mer length approach.

PubMed

Melicher, Dacotah; Torson, Alex S; Dworkin, Ian; Bowsher, Julia H

2014-03-12

The Sepsidae family of flies is a model for investigating how sexual selection shapes courtship and sexual dimorphism in a comparative framework. However, like many non-model systems, there are few molecular resources available. Large-scale sequencing and assembly have not been performed in any sepsid, and the lack of a closely related genome makes investigation of gene expression challenging. Our goal was to develop an automated pipeline for de novo transcriptome assembly, and to use that pipeline to assemble and analyze the transcriptome of the sepsid Themira biloba. Our bioinformatics pipeline uses cloud computing services to assemble and analyze the transcriptome with off-site data management, processing, and backup. It uses a multiple k-mer length approach combined with a second meta-assembly to extend transcripts and recover more bases of transcript sequences than standard single k-mer assembly. We used 454 sequencing to generate 1.48 million reads from cDNA generated from embryo, larva, and pupae of T. biloba and assembled a transcriptome consisting of 24,495 contigs. Annotation identified 16,705 transcripts, including those involved in embryogenesis and limb patterning. We assembled transcriptomes from an additional three non-model organisms to demonstrate that our pipeline assembled a higher-quality transcriptome than single k-mer approaches across multiple species. The pipeline we have developed for assembly and analysis increases contig length, recovers unique transcripts, and assembles more base pairs than other methods through the use of a meta-assembly. The T. biloba transcriptome is a critical resource for performing large-scale RNA-Seq investigations of gene expression patterns, and is the first transcriptome sequenced in this Dipteran family.
Sequencing of sporadic Attention-Deficit Hyperactivity Disorder (ADHD) identifies novel and potentially pathogenic de novo variants and excludes overlap with genes associated with autism spectrum disorder.

PubMed

Kim, Daniel Seung; Burt, Amber A; Ranchalis, Jane E; Wilmot, Beth; Smith, Joshua D; Patterson, Karynne E; Coe, Bradley P; Li, Yatong K; Bamshad, Michael J; Nikolas, Molly; Eichler, Evan E; Swanson, James M; Nigg, Joel T; Nickerson, Deborah A; Jarvik, Gail P

2017-06-01

Attention-Deficit Hyperactivity Disorder (ADHD) has high heritability; however, studies of common variation account for <5% of ADHD variance. Using data from affected participants without a family history of ADHD, we sought to identify de novo variants that could account for sporadic ADHD. Considering a total of 128 families, two analyses were conducted in parallel: first, in 11 unaffected parent/affected proband trios (or quads with the addition of an unaffected sibling) we completed exome sequencing. Six de novo missense variants at highly conserved bases were identified and validated from four of the 11 families: the brain-expressed genes TBC1D9, DAGLA, QARS, CSMD2, TRPM2, and WDR83. Separately, in 117 unrelated probands with sporadic ADHD, we sequenced a panel of 26 genes implicated in intellectual disability (ID) and autism spectrum disorder (ASD) to evaluate whether variation in ASD/ID-associated genes were also present in participants with ADHD. Only one putative deleterious variant (Gln600STOP) in CHD1L was identified; this was found in a single proband. Notably, no other nonsense, splice, frameshift, or highly conserved missense variants in the 26 gene panel were identified and validated. These data suggest that de novo variant analysis in families with independently adjudicated sporadic ADHD diagnosis can identify novel genes implicated in ADHD pathogenesis. Moreover, that only one of the 128 cases (0.8%, 11 exome, and 117 MIP sequenced participants) had putative deleterious variants within our data in 26 genes related to ID and ASD suggests significant independence in the genetic pathogenesis of ADHD as compared to ASD and ID phenotypes. © 2017 Wiley Periodicals, Inc. © 2017 Wiley Periodicals, Inc.
De novo assembly, characterization and functional annotation of pineapple fruit transcriptome through massively parallel sequencing.

PubMed

Ong, Wen Dee; Voo, Lok-Yung Christopher; Kumar, Vijay Subbiah

2012-01-01

Pineapple (Ananas comosus var. comosus), is an important tropical non-climacteric fruit with high commercial potential. Understanding the mechanism and processes underlying fruit ripening would enable scientists to enhance the improvement of quality traits such as, flavor, texture, appearance and fruit sweetness. Although, the pineapple is an important fruit, there is insufficient transcriptomic or genomic information that is available in public databases. Application of high throughput transcriptome sequencing to profile the pineapple fruit transcripts is therefore needed. To facilitate this, we have performed transcriptome sequencing of ripe yellow pineapple fruit flesh using Illumina technology. About 4.7 millions Illumina paired-end reads were generated and assembled using the Velvet de novo assembler. The assembly produced 28,728 unique transcripts with a mean length of approximately 200 bp. Sequence similarity search against non-redundant NCBI database identified a total of 16,932 unique transcripts (58.93%) with significant hits. Out of these, 15,507 unique transcripts were assigned to gene ontology terms. Functional annotation against Kyoto Encyclopedia of Genes and Genomes pathway database identified 13,598 unique transcripts (47.33%) which were mapped to 126 pathways. The assembly revealed many transcripts that were previously unknown. The unique transcripts derived from this work have rapidly increased of the number of the pineapple fruit mRNA transcripts as it is now available in public databases. This information can be further utilized in gene expression, genomics and other functional genomics studies in pineapple.
De Novo Assembly, Characterization and Functional Annotation of Pineapple Fruit Transcriptome through Massively Parallel Sequencing

PubMed Central

Ong, Wen Dee; Voo, Lok-Yung Christopher; Kumar, Vijay Subbiah

2012-01-01

Background Pineapple (Ananas comosus var. comosus), is an important tropical non-climacteric fruit with high commercial potential. Understanding the mechanism and processes underlying fruit ripening would enable scientists to enhance the improvement of quality traits such as, flavor, texture, appearance and fruit sweetness. Although, the pineapple is an important fruit, there is insufficient transcriptomic or genomic information that is available in public databases. Application of high throughput transcriptome sequencing to profile the pineapple fruit transcripts is therefore needed. Methodology/Principal Findings To facilitate this, we have performed transcriptome sequencing of ripe yellow pineapple fruit flesh using Illumina technology. About 4.7 millions Illumina paired-end reads were generated and assembled using the Velvet de novo assembler. The assembly produced 28,728 unique transcripts with a mean length of approximately 200 bp. Sequence similarity search against non-redundant NCBI database identified a total of 16,932 unique transcripts (58.93%) with significant hits. Out of these, 15,507 unique transcripts were assigned to gene ontology terms. Functional annotation against Kyoto Encyclopedia of Genes and Genomes pathway database identified 13,598 unique transcripts (47.33%) which were mapped to 126 pathways. The assembly revealed many transcripts that were previously unknown. Conclusions The unique transcripts derived from this work have rapidly increased of the number of the pineapple fruit mRNA transcripts as it is now available in public databases. This information can be further utilized in gene expression, genomics and other functional genomics studies in pineapple. PMID:23091603
Nonhybrid, finished microbial genome assemblies from long-read SMRT sequencing data.

PubMed

Chin, Chen-Shan; Alexander, David H; Marks, Patrick; Klammer, Aaron A; Drake, James; Heiner, Cheryl; Clum, Alicia; Copeland, Alex; Huddleston, John; Eichler, Evan E; Turner, Stephen W; Korlach, Jonas

2013-06-01

We present a hierarchical genome-assembly process (HGAP) for high-quality de novo microbial genome assemblies using only a single, long-insert shotgun DNA library in conjunction with Single Molecule, Real-Time (SMRT) DNA sequencing. Our method uses the longest reads as seeds to recruit all other reads for construction of highly accurate preassembled reads through a directed acyclic graph-based consensus procedure, which we follow with assembly using off-the-shelf long-read assemblers. In contrast to hybrid approaches, HGAP does not require highly accurate raw reads for error correction. We demonstrate efficient genome assembly for several microorganisms using as few as three SMRT Cell zero-mode waveguide arrays of sequencing and for BACs using just one SMRT Cell. Long repeat regions can be successfully resolved with this workflow. We also describe a consensus algorithm that incorporates SMRT sequencing primary quality values to produce de novo genome sequence exceeding 99.999% accuracy.
"De-novo" amino acid sequence elucidation of protein G'e by combined "top-down" and "bottom-up" mass spectrometry.

PubMed

Yefremova, Yelena; Al-Majdoub, Mahmoud; Opuni, Kwabena F M; Koy, Cornelia; Cui, Weidong; Yan, Yuetian; Gross, Michael L; Glocker, Michael O

2015-03-01

Mass spectrometric de-novo sequencing was applied to review the amino acid sequence of a commercially available recombinant protein G´ with great scientific and economic importance. Substantial deviations to the published amino acid sequence (Uniprot Q54181) were found by the presence of 46 additional amino acids at the N-terminus, including a so-called "His-tag" as well as an N-terminal partial α-N-gluconoylation and α-N-phosphogluconoylation, respectively. The unexpected amino acid sequence of the commercial protein G' comprised 241 amino acids and resulted in a molecular mass of 25,998.9 ± 0.2 Da for the unmodified protein. Due to the higher mass that is caused by its extended amino acid sequence compared with the original protein G' (185 amino acids), we named this protein "protein G'e." By means of mass spectrometric peptide mapping, the suggested amino acid sequence, as well as the N-terminal partial α-N-gluconoylations, was confirmed with 100% sequence coverage. After the protein G'e sequence was determined, we were able to determine the expression vector pET-28b from Novagen with the Xho I restriction enzyme cleavage site as the best option that was used for cloning and expressing the recombinant protein G'e in E. coli. A dissociation constant (K(d)) value of 9.4 nM for protein G'e was determined thermophoretically, showing that the N-terminal flanking sequence extension did not cause significant changes in the binding affinity to immunoglobulins.
Sequential de novo centromere formation and inactivation on a chromosomal fragment in maize.

PubMed

Liu, Yalin; Su, Handong; Pang, Junling; Gao, Zhi; Wang, Xiu-Jie; Birchler, James A; Han, Fangpu

2015-03-17

The ability of centromeres to alternate between active and inactive states indicates significant epigenetic aspects controlling centromere assembly and function. In maize (Zea mays), misdivision of the B chromosome centromere on a translocation with the short arm of chromosome 9 (TB-9Sb) can produce many variants with varying centromere sizes and centromeric DNA sequences. In such derivatives of TB-9Sb, we found a de novo centromere on chromosome derivative 3-3, which has no canonical centromeric repeat sequences. This centromere is derived from a 288-kb region on the short arm of chromosome 9, and is 19 megabases (Mb) removed from the translocation breakpoint of chromosome 9 in TB-9Sb. The functional B centromere in progenitor telo2-2 is deleted from derivative 3-3, but some B-repeat sequences remain. The de novo centromere of derivative 3-3 becomes inactive in three further derivatives with new centromeres being formed elsewhere on each chromosome. Our results suggest that de novo centromere initiation is quite common and can persist on chromosomal fragments without a canonical centromere. However, we hypothesize that when de novo centromeres are initiated in opposition to a larger normal centromere, they are cleared from the chromosome by inactivation, thus maintaining karyotype integrity.
Sequential de novo centromere formation and inactivation on a chromosomal fragment in maize

PubMed Central

Liu, Yalin; Su, Handong; Pang, Junling; Gao, Zhi; Wang, Xiu-Jie; Birchler, James A.; Han, Fangpu

2015-01-01

The ability of centromeres to alternate between active and inactive states indicates significant epigenetic aspects controlling centromere assembly and function. In maize (Zea mays), misdivision of the B chromosome centromere on a translocation with the short arm of chromosome 9 (TB-9Sb) can produce many variants with varying centromere sizes and centromeric DNA sequences. In such derivatives of TB-9Sb, we found a de novo centromere on chromosome derivative 3-3, which has no canonical centromeric repeat sequences. This centromere is derived from a 288-kb region on the short arm of chromosome 9, and is 19 megabases (Mb) removed from the translocation breakpoint of chromosome 9 in TB-9Sb. The functional B centromere in progenitor telo2-2 is deleted from derivative 3-3, but some B-repeat sequences remain. The de novo centromere of derivative 3-3 becomes inactive in three further derivatives with new centromeres being formed elsewhere on each chromosome. Our results suggest that de novo centromere initiation is quite common and can persist on chromosomal fragments without a canonical centromere. However, we hypothesize that when de novo centromeres are initiated in opposition to a larger normal centromere, they are cleared from the chromosome by inactivation, thus maintaining karyotype integrity. PMID:25733907
A population-based evolutionary search approach to the multiple minima problem in de novo protein structure prediction

PubMed Central

2013-01-01

Background Elucidating the native structure of a protein molecule from its sequence of amino acids, a problem known as de novo structure prediction, is a long standing challenge in computational structural biology. Difficulties in silico arise due to the high dimensionality of the protein conformational space and the ruggedness of the associated energy surface. The issue of multiple minima is a particularly troublesome hallmark of energy surfaces probed with current energy functions. In contrast to the true energy surface, these surfaces are weakly-funneled and rich in comparably deep minima populated by non-native structures. For this reason, many algorithms seek to be inclusive and obtain a broad view of the low-energy regions through an ensemble of low-energy (decoy) conformations. Conformational diversity in this ensemble is key to increasing the likelihood that the native structure has been captured. Methods We propose an evolutionary search approach to address the multiple-minima problem in decoy sampling for de novo structure prediction. Two population-based evolutionary search algorithms are presented that follow the basic approach of treating conformations as individuals in an evolving population. Coarse graining and molecular fragment replacement are used to efficiently obtain protein-like child conformations from parents. Potential energy is used both to bias parent selection and determine which subset of parents and children will be retained in the evolving population. The effect on the decoy ensemble of sampling minima directly is measured by additionally mapping a conformation to its nearest local minimum before considering it for retainment. The resulting memetic algorithm thus evolves not just a population of conformations but a population of local minima. Results and conclusions Results show that both algorithms are effective in terms of sampling conformations in proximity of the known native structure. The additional minimization is shown to be
De novo assembly and phasing of a Korean human genome.

PubMed

Seo, Jeong-Sun; Rhie, Arang; Kim, Junsoo; Lee, Sangjin; Sohn, Min-Hwan; Kim, Chang-Uk; Hastie, Alex; Cao, Han; Yun, Ji-Young; Kim, Jihye; Kuk, Junho; Park, Gun Hwa; Kim, Juhyeok; Ryu, Hanna; Kim, Jongbum; Roh, Mira; Baek, Jeonghun; Hunkapiller, Michael W; Korlach, Jonas; Shin, Jong-Yeon; Kim, Changhoon

2016-10-13

Advances in genome assembly and phasing provide an opportunity to investigate the diploid architecture of the human genome and reveal the full range of structural variation across population groups. Here we report the de novo assembly and haplotype phasing of the Korean individual AK1 (ref. 1) using single-molecule real-time sequencing, next-generation mapping, microfluidics-based linked reads, and bacterial artificial chromosome (BAC) sequencing approaches. Single-molecule sequencing coupled with next-generation mapping generated a highly contiguous assembly, with a contig N50 size of 17.9 Mb and a scaffold N50 size of 44.8 Mb, resolving 8 chromosomal arms into single scaffolds. The de novo assembly, along with local assemblies and spanning long reads, closes 105 and extends into 72 out of 190 euchromatic gaps in the reference genome, adding 1.03 Mb of previously intractable sequence. High concordance between the assembly and paired-end sequences from 62,758 BAC clones provides strong support for the robustness of the assembly. We identify 18,210 structural variants by direct comparison of the assembly with the human reference, identifying thousands of breakpoints that, to our knowledge, have not been reported before. Many of the insertions are reflected in the transcriptome and are shared across the Asian population. We performed haplotype phasing of the assembly with short reads, long reads and linked reads from whole-genome sequencing and with short reads from 31,719 BAC clones, thereby achieving phased blocks with an N50 size of 11.6 Mb. Haplotigs assembled from single-molecule real-time reads assigned to haplotypes on phased blocks covered 89% of genes. The haplotigs accurately characterized the hypervariable major histocompatability complex region as well as demonstrating allele configuration in clinically relevant genes such as CYP2D6. This work presents the most contiguous diploid human genome assembly so far, with extensive investigation of
Highly efficient de novo mutant identification in a sorghum bicolor tilling population using the ComSeq approach

USDA-ARS?s Scientific Manuscript database

Screening large populations for carriers of known or de novo rare SNPs is required both in Targeting induced local lesions IN genomes (TILLING) experiments in plants and analogously in screening human populations. We formerly suggested an approach that combines the celebrated mathematical field of c...
Foldability of a Natural De Novo Evolved Protein.

PubMed

Bungard, Dixie; Copple, Jacob S; Yan, Jing; Chhun, Jimmy J; Kumirov, Vlad K; Foy, Scott G; Masel, Joanna; Wysocki, Vicki H; Cordes, Matthew H J

2017-11-07

The de novo evolution of protein-coding genes from noncoding DNA is emerging as a source of molecular innovation in biology. Studies of random sequence libraries, however, suggest that young de novo proteins will not fold into compact, specific structures typical of native globular proteins. Here we show that Bsc4, a functional, natural de novo protein encoded by a gene that evolved recently from noncoding DNA in the yeast S. cerevisiae, folds to a partially specific three-dimensional structure. Bsc4 forms soluble, compact oligomers with high β sheet content and a hydrophobic core, and undergoes cooperative, reversible denaturation. Bsc4 lacks a specific quaternary state, however, existing instead as a continuous distribution of oligomer sizes, and binds dyes indicative of amyloid oligomers or molten globules. The combination of native-like and non-native-like properties suggests a rudimentary fold that could potentially act as a functional intermediate in the emergence of new folded proteins de novo. Copyright © 2017 Elsevier Ltd. All rights reserved.
Extreme-Scale De Novo Genome Assembly

DOE Office of Scientific and Technical Information (OSTI.GOV)

Georganas, Evangelos; Hofmeyr, Steven; Egan, Rob

De novo whole genome assembly reconstructs genomic sequence from short, overlapping, and potentially erroneous DNA segments and is one of the most important computations in modern genomics. This work presents HipMER, a high-quality end-to-end de novo assembler designed for extreme scale analysis, via efficient parallelization of the Meraculous code. Genome assembly software has many components, each of which stresses different components of a computer system. This chapter explains the computational challenges involved in each step of the HipMer pipeline, the key distributed data structures, and communication costs in detail. We present performance results of assembling the human genome and themore » large hexaploid wheat genome on large supercomputers up to tens of thousands of cores.« less

RNA sequencing, de novo assembly and differential analysis of the gill transcriptome of freshwater climbing perch Anabas testudineus after six days of seawater exposure.

PubMed

Chen, X L; Lui, E Y; Ip, Y Kwong; Lam, S H

2018-06-21

To obtain transcriptomic insights into branchial responses to salinity challenge in Anabas testudineus, this study employed RNA sequencing (RNA-Seq) to analyse the gill transcriptome of A. testudineus exposed to seawater (SW) for 6 days compared with the freshwater (FW) control group. A combined FW and SW gill transcriptome was de novo assembled from 169.9 million 101 bp paired-end reads. In silico validation employing 17 A. testudineus Sanger full-length coding sequences showed that 15/17 of them had greater than 80% of their sequences aligned to the de novo assembled contigs where 5/17 had their full-length (100%) aligned and 9/17 had greater than 90% of their sequences aligned. The combined FW and SW gill transcriptome was mapped to 13780 unique human identifiers at E-value < 1.0E-20 while 952 and 886 identifiers were determined as up and down-regulated by 1.5 fold, respectively, in the gills of A. testudineus in SW when compared with FW. These genes were found to be associated with at least 23 biological processes. A larger proportion of genes encoding enzymes and transporters associated with molecular transport, energy production, metabolisms were up-regulated, while a larger proportion of genes encoding transmembrane receptors, G-protein coupled receptors, kinases and transcription regulators associated with cell cycle, growth, development, signalling, morphology and gene expression were relatively lower in the gills of A. testudineus in SW when compared with FW. High correlation (R = 0.99) was observed between RNA-Seq data and real-time quantitative PCR validation for 13 selected genes. The transcriptomic sequence information will facilitate development of molecular resources and tools while the findings will provide insights for future studies into branchial iono-osmoregulation and related cellular processes in A. testudineus. This article is protected by copyright. All rights reserved. This article is protected by copyright. All rights reserved.
BayesMotif: de novo protein sorting motif discovery from impure datasets.

PubMed

Hu, Jianjun; Zhang, Fan

2010-01-18

Protein sorting is the process that newly synthesized proteins are transported to their target locations within or outside of the cell. This process is precisely regulated by protein sorting signals in different forms. A major category of sorting signals are amino acid sub-sequences usually located at the N-terminals or C-terminals of protein sequences. Genome-wide experimental identification of protein sorting signals is extremely time-consuming and costly. Effective computational algorithms for de novo discovery of protein sorting signals is needed to improve the understanding of protein sorting mechanisms. We formulated the protein sorting motif discovery problem as a classification problem and proposed a Bayesian classifier based algorithm (BayesMotif) for de novo identification of a common type of protein sorting motifs in which a highly conserved anchor is present along with a less conserved motif regions. A false positive removal procedure is developed to iteratively remove sequences that are unlikely to contain true motifs so that the algorithm can identify motifs from impure input sequences. Experiments on both implanted motif datasets and real-world datasets showed that the enhanced BayesMotif algorithm can identify anchored sorting motifs from pure or impure protein sequence dataset. It also shows that the false positive removal procedure can help to identify true motifs even when there is only 20% of the input sequences containing true motif instances. We proposed BayesMotif, a novel Bayesian classification based algorithm for de novo discovery of a special category of anchored protein sorting motifs from impure datasets. Compared to conventional motif discovery algorithms such as MEME, our algorithm can find less-conserved motifs with short highly conserved anchors. Our algorithm also has the advantage of easy incorporation of additional meta-sequence features such as hydrophobicity or charge of the motifs which may help to overcome the limitations of
Ultrahigh-resolution Fourier transform ion cyclotron resonance mass spectrometry and tandem mass spectrometry for peptide de novo amino acid sequencing for a seven-protein mixture by paired single-residue transposed Lys-N and Lys-C digestion.

PubMed

Guan, Xiaoyan; Brownstein, Naomi C; Young, Nicolas L; Marshall, Alan G

2017-01-30

Bottom-up tandem mass spectrometry (MS/MS) is regularly used in proteomics to identify proteins from a sequence database. De novo sequencing is also available for sequencing peptides with relatively short sequence lengths. We recently showed that paired Lys-C and Lys-N proteases produce peptides of identical mass and similar retention time, but different tandem mass spectra. Such parallel experiments provide complementary information, and allow for up to 100% MS/MS sequence coverage. Here, we report digestion by paired Lys-C and Lys-N proteases of a seven-protein mixture: human hemoglobin alpha, bovine carbonic anhydrase 2, horse skeletal muscle myoglobin, hen egg white lysozyme, bovine pancreatic ribonuclease, bovine rhodanese, and bovine serum albumin, followed by reversed-phase nanoflow liquid chromatography, collision-induced dissociation, and 14.5 T Fourier transform ion cyclotron resonance mass spectrometry. Matched pairs of product peptide ions of equal precursor mass and similar retention times from each digestion are compared, leveraging single-residue transposed information with independent interferences to confidently identify fragment ion types, residues, and peptides. Selected pairs of product ion mass spectra for de novo sequenced protein segments from each member of the mixture are presented. Pairs of the transposed product ions as well as complementary information from the parallel experiments allow for both high MS/MS coverage for long peptide sequences and high confidence in the amino acid identification. Moreover, the parallel experiments in the de novo sequencing reduce false-positive matches of product ions from the single-residue transposed peptides from the same segment, and thereby further improve the confidence in protein identification. Copyright © 2016 John Wiley & Sons, Ltd. Copyright © 2016 John Wiley & Sons, Ltd.
De novo assembly of the transcriptome of the non-model plant Streptocarpus rexii employing a novel heuristic to recover locus-specific transcript clusters.

PubMed

Chiara, Matteo; Horner, David S; Spada, Alberto

2013-01-01

De novo transcriptome characterization from Next Generation Sequencing data has become an important approach in the study of non-model plants. Despite notable advances in the assembly of short reads, the clustering of transcripts into unigene-like (locus-specific) clusters remains a somewhat neglected subject. Indeed, closely related paralogous transcripts are often merged into single clusters by current approaches. Here, a novel heuristic method for locus-specific clustering is compared to that implemented in the de novo assembler Oases, using the same initial transcript collections, derived from Arabidopsis thaliana and the developmental model Streptocarpus rexii. We show that the proposed approach improves cluster specificity in the A. thaliana dataset for which the reference genome is available. Furthermore, for the S. rexii data our filtered transcript collection matches a larger number of distinct annotated loci in reference genomes than the Oases set, while containing a reduced overall number of loci. A detailed discussion of advantages and limitations of our approach in processing de novo transcriptome reconstructions is presented. The proposed method should be widely applicable to other organisms, irrespective of the transcript assembly method employed. The S. rexii transcriptome is available as a sophisticated and augmented publicly available online database.
Endogenous Hot Spots of De Novo Telomere Addition in the Yeast Genome Contain Proximal Enhancers That Bind Cdc13

PubMed Central

Obodo, Udochukwu C.; Epum, Esther A.; Platts, Margaret H.; Seloff, Jacob; Dahlson, Nicole A.; Velkovsky, Stoycho M.; Paul, Shira R.

2016-01-01

DNA double-strand breaks (DSBs) pose a threat to genome stability and are repaired through multiple mechanisms. Rarely, telomerase, the enzyme that maintains telomeres, acts upon a DSB in a mutagenic process termed telomere healing. The probability of telomere addition is increased at specific genomic sequences termed sites of repair-associated telomere addition (SiRTAs). By monitoring repair of an induced DSB, we show that SiRTAs on chromosomes V and IX share a bipartite structure in which a core sequence (Core) is directly targeted by telomerase, while a proximal sequence (Stim) enhances the probability of de novo telomere formation. The Stim and Core sequences are sufficient to confer a high frequency of telomere addition to an ectopic site. Cdc13, a single-stranded DNA binding protein that recruits telomerase to endogenous telomeres, is known to stimulate de novo telomere addition when artificially recruited to an induced DSB. Here we show that the ability of the Stim sequence to enhance de novo telomere addition correlates with its ability to bind Cdc13, indicating that natural sites at which telomere addition occurs at high frequency require binding by Cdc13 to a sequence 20 to 100 bp internal from the site at which telomerase acts to initiate de novo telomere addition. PMID:27044869
Informative priors based on transcription factor structural class improve de novo motif discovery.

PubMed

Narlikar, Leelavati; Gordân, Raluca; Ohler, Uwe; Hartemink, Alexander J

2006-07-15

An important problem in molecular biology is to identify the locations at which a transcription factor (TF) binds to DNA, given a set of DNA sequences believed to be bound by that TF. In previous work, we showed that information in the DNA sequence of a binding site is sufficient to predict the structural class of the TF that binds it. In particular, this suggests that we can predict which locations in any DNA sequence are more likely to be bound by certain classes of TFs than others. Here, we argue that traditional methods for de novo motif finding can be significantly improved by adopting an informative prior probability that a TF binding site occurs at each sequence location. To demonstrate the utility of such an approach, we present priority, a powerful new de novo motif finding algorithm. Using data from TRANSFAC, we train three classifiers to recognize binding sites of basic leucine zipper, forkhead, and basic helix loop helix TFs. These classifiers are used to equip priority with three class-specific priors, in addition to a default prior to handle TFs of other classes. We apply priority and a number of popular motif finding programs to sets of yeast intergenic regions that are reported by ChIP-chip to be bound by particular TFs. priority identifies motifs the other methods fail to identify, and correctly predicts the structural class of the TF recognizing the identified binding sites. Supplementary material and code can be found at http://www.cs.duke.edu/~amink/.
Massively parallel de novo protein design for targeted therapeutics.

PubMed

Chevalier, Aaron; Silva, Daniel-Adriano; Rocklin, Gabriel J; Hicks, Derrick R; Vergara, Renan; Murapa, Patience; Bernard, Steffen M; Zhang, Lu; Lam, Kwok-Ho; Yao, Guorui; Bahl, Christopher D; Miyashita, Shin-Ichiro; Goreshnik, Inna; Fuller, James T; Koday, Merika T; Jenkins, Cody M; Colvin, Tom; Carter, Lauren; Bohn, Alan; Bryan, Cassie M; Fernández-Velasco, D Alejandro; Stewart, Lance; Dong, Min; Huang, Xuhui; Jin, Rongsheng; Wilson, Ian A; Fuller, Deborah H; Baker, David

2017-10-05

De novo protein design holds promise for creating small stable proteins with shapes customized to bind therapeutic targets. We describe a massively parallel approach for designing, manufacturing and screening mini-protein binders, integrating large-scale computational design, oligonucleotide synthesis, yeast display screening and next-generation sequencing. We designed and tested 22,660 mini-proteins of 37-43 residues that target influenza haemagglutinin and botulinum neurotoxin B, along with 6,286 control sequences to probe contributions to folding and binding, and identified 2,618 high-affinity binders. Comparison of the binding and non-binding design sets, which are two orders of magnitude larger than any previously investigated, enabled the evaluation and improvement of the computational model. Biophysical characterization of a subset of the binder designs showed that they are extremely stable and, unlike antibodies, do not lose activity after exposure to high temperatures. The designs elicit little or no immune response and provide potent prophylactic and therapeutic protection against influenza, even after extensive repeated dosing.
Massively parallel de novo protein design for targeted therapeutics

NASA Astrophysics Data System (ADS)

Chevalier, Aaron; Silva, Daniel-Adriano; Rocklin, Gabriel J.; Hicks, Derrick R.; Vergara, Renan; Murapa, Patience; Bernard, Steffen M.; Zhang, Lu; Lam, Kwok-Ho; Yao, Guorui; Bahl, Christopher D.; Miyashita, Shin-Ichiro; Goreshnik, Inna; Fuller, James T.; Koday, Merika T.; Jenkins, Cody M.; Colvin, Tom; Carter, Lauren; Bohn, Alan; Bryan, Cassie M.; Fernández-Velasco, D. Alejandro; Stewart, Lance; Dong, Min; Huang, Xuhui; Jin, Rongsheng; Wilson, Ian A.; Fuller, Deborah H.; Baker, David

2017-10-01

De novo protein design holds promise for creating small stable proteins with shapes customized to bind therapeutic targets. We describe a massively parallel approach for designing, manufacturing and screening mini-protein binders, integrating large-scale computational design, oligonucleotide synthesis, yeast display screening and next-generation sequencing. We designed and tested 22,660 mini-proteins of 37-43 residues that target influenza haemagglutinin and botulinum neurotoxin B, along with 6,286 control sequences to probe contributions to folding and binding, and identified 2,618 high-affinity binders. Comparison of the binding and non-binding design sets, which are two orders of magnitude larger than any previously investigated, enabled the evaluation and improvement of the computational model. Biophysical characterization of a subset of the binder designs showed that they are extremely stable and, unlike antibodies, do not lose activity after exposure to high temperatures. The designs elicit little or no immune response and provide potent prophylactic and therapeutic protection against influenza, even after extensive repeated dosing.
Massively parallel de novo protein design for targeted therapeutics

PubMed Central

Chevalier, Aaron; Silva, Daniel-Adriano; Rocklin, Gabriel J.; Hicks, Derrick R.; Vergara, Renan; Murapa, Patience; Bernard, Steffen M.; Zhang, Lu; Lam, Kwok-Ho; Yao, Guorui; Bahl, Christopher D.; Miyashita, Shin-Ichiro; Goreshnik, Inna; Fuller, James T.; Koday, Merika T.; Jenkins, Cody M.; Colvin, Tom; Carter, Lauren; Bohn, Alan; Bryan, Cassie M.; Fernández-Velasco, D. Alejandro; Stewart, Lance; Dong, Min; Huang, Xuhui; Jin, Rongsheng; Wilson, Ian A.; Fuller, Deborah H.; Baker, David

2018-01-01

De novo protein design holds promise for creating small stable proteins with shapes customized to bind therapeutic targets. We describe a massively parallel approach for designing, manufacturing and screening mini-protein binders, integrating large-scale computational design, oligonucleotide synthesis, yeast display screening and next-generation sequencing. We designed and tested 22,660 mini-proteins of 37–43 residues that target influenza haemagglutinin and botulinum neurotoxin B, along with 6,286 control sequences to probe contributions to folding and binding, and identified 2,618 high-affinity binders. Comparison of the binding and non-binding design sets, which are two orders of magnitude larger than any previously investigated, enabled the evaluation and improvement of the computational model. Biophysical characterization of a subset of the binder designs showed that they are extremely stable and, unlike antibodies, do not lose activity after exposure to high temperatures. The designs elicit little or no immune response and provide potent prophylactic and therapeutic protection against influenza, even after extensive repeated dosing. PMID:28953867
WebMOTIFS: automated discovery, filtering and scoring of DNA sequence motifs using multiple programs and Bayesian approaches

PubMed Central

Romer, Katherine A.; Kayombya, Guy-Richard; Fraenkel, Ernest

2007-01-01

WebMOTIFS provides a web interface that facilitates the discovery and analysis of DNA-sequence motifs. Several studies have shown that the accuracy of motif discovery can be significantly improved by using multiple de novo motif discovery programs and using randomized control calculations to identify the most significant motifs or by using Bayesian approaches. WebMOTIFS makes it easy to apply these strategies. Using a single submission form, users can run several motif discovery programs and score, cluster and visualize the results. In addition, the Bayesian motif discovery program THEME can be used to determine the class of transcription factors that is most likely to regulate a set of sequences. Input can be provided as a list of gene or probe identifiers. Used with the default settings, WebMOTIFS accurately identifies biologically relevant motifs from diverse data in several species. WebMOTIFS is freely available at http://fraenkel.mit.edu/webmotifs. PMID:17584794
Genovo: De Novo Assembly for Metagenomes

NASA Astrophysics Data System (ADS)

Laserson, Jonathan; Jojic, Vladimir; Koller, Daphne

Next-generation sequencing technologies produce a large number of noisy reads from the DNA in a sample. Metagenomics and population sequencing aim to recover the genomic sequences of the species in the sample, which could be of high diversity. Methods geared towards single sequence reconstruction are not sensitive enough when applied in this setting. We introduce a generative probabilistic model of read generation from environmental samples and present Genovo, a novel de novo sequence assembler that discovers likely sequence reconstructions under the model. A Chinese restaurant process prior accounts for the unknown number of genomes in the sample. Inference is made by applying a series of hill-climbing steps iteratively until convergence. We compare the performance of Genovo to three other short read assembly programs across one synthetic dataset and eight metagenomic datasets created using the 454 platform, the largest of which has 311k reads. Genovo's reconstructions cover more bases and recover more genes than the other methods, and yield a higher assembly score.
Comparative Transcriptomic Approaches Exploring Contamination Stress Tolerance in Salix sp. Reveal the Importance for a Metaorganismal de Novo Assembly Approach for Nonmodel Plants1[OPEN

PubMed Central

Brereton, Nicholas J. B.; Marleau, Julie; Nissim, Werther Guidi; Labrecque, Michel; Joly, Simon; Pitre, Frederic E.

2016-01-01

Metatranscriptomic study of nonmodel organisms requires strategies that retain the highly resolved genetic information generated from model organisms while allowing for identification of the unexpected. A real-world biological application of phytoremediation, the field growth of 10 Salix cultivars on polluted soils, was used as an exemplar nonmodel and multifaceted crop response well-disposed to the study of gene expression. Sequence reads were assembled de novo to create 10 independent transcriptomes, a global transcriptome, and were mapped against the Salix purpurea 94006 reference genome. Annotation of assembled contigs was performed without a priori assumption of the originating organism. Global transcriptome construction from 3.03 billion paired-end reads revealed 606,880 unique contigs annotated from 1588 species, often common in all 10 cultivars. Comparisons between transcriptomic and metatranscriptomic methodologies provide clear evidence that nonnative RNA can mistakenly map to reference genomes, especially to conserved regions of common housekeeping genes, such as actin, α/β-tubulin, and elongation factor 1-α. In Salix, Rubisco activase transcripts were down-regulated in contaminated trees across all 10 cultivars, whereas thiamine thizole synthase and CP12, a Calvin Cycle master regulator, were uniformly up-regulated. De novo assembly approaches, with unconstrained annotation, can improve data quality; care should be taken when exploring such plant genetics to reduce de facto data exclusion by mapping to a single reference genome alone. Salix gene expression patterns strongly suggest cultivar-wide alteration of specific photosynthetic apparatus and protection of the antenna complexes from oxidation damage in contaminated trees, providing an insight into common stress tolerance strategies in a real-world phytoremediation system. PMID:27002060
Sequencing proteins with transverse ionic transport in nanochannels.

PubMed

Boynton, Paul; Di Ventra, Massimiliano

2016-05-03

De novo protein sequencing is essential for understanding cellular processes that govern the function of living organisms and all sequence modifications that occur after a protein has been constructed from its corresponding DNA code. By obtaining the order of the amino acids that compose a given protein one can then determine both its secondary and tertiary structures through structure prediction, which is used to create models for protein aggregation diseases such as Alzheimer's Disease. Here, we propose a new technique for de novo protein sequencing that involves translocating a polypeptide through a synthetic nanochannel and measuring the ionic current of each amino acid through an intersecting perpendicular nanochannel. We find that the distribution of ionic currents for each of the 20 proteinogenic amino acids encoded by eukaryotic genes is statistically distinct, showing this technique's potential for de novo protein sequencing.
Draft Sequencing of the Heterozygous Diploid Genome of Satsuma (Citrus unshiu Marc.) Using a Hybrid Assembly Approach

PubMed Central

Shimizu, Tokurou; Tanizawa, Yasuhiro; Mochizuki, Takako; Nagasaki, Hideki; Yoshioka, Terutaka; Toyoda, Atsushi; Fujiyama, Asao; Kaminuma, Eli; Nakamura, Yasukazu

2017-01-01

Satsuma (Citrus unshiu Marc.) is one of the most abundantly produced mandarin varieties of citrus, known for its seedless fruit production and as a breeding parent of citrus. De novo assembly of the heterozygous diploid genome of Satsuma (“Miyagawa Wase”) was conducted by a hybrid assembly approach using short-read sequences, three mate-pair libraries, and a long-read sequence of PacBio by the PLATANUS assembler. The assembled sequence, with a total size of 359.7 Mb at the N50 length of 386,404 bp, consisted of 20,876 scaffolds. Pseudomolecules of Satsuma constructed by aligning the scaffolds to three genetic maps showed genome-wide synteny to the genomes of Clementine, pummelo, and sweet orange. Gene prediction by modeling with MAKER-P proposed 29,024 genes and 37,970 mRNA; additionally, gene prediction analysis found candidates for novel genes in several biosynthesis pathways for gibberellin and violaxanthin catabolism. BUSCO scores for the assembled scaffold and predicted transcripts, and another analysis by BAC end sequence mapping indicated the assembled genome consistency was close to those of the haploid Clementine, pummel, and sweet orange genomes. The number of repeat elements and long terminal repeat retrotransposon were comparable to those of the seven citrus genomes; this suggested no significant failure in the assembly at the repeat region. A resequencing application using the assembled sequence confirmed that both kunenbo-A and Satsuma are offsprings of Kishu, and Satsuma is a back-crossed offspring of Kishu. These results illustrated the performance of the hybrid assembly approach and its ability to construct an accurate heterozygous diploid genome. PMID:29259619
Draft Sequencing of the Heterozygous Diploid Genome of Satsuma (Citrus unshiu Marc.) Using a Hybrid Assembly Approach.

PubMed

Shimizu, Tokurou; Tanizawa, Yasuhiro; Mochizuki, Takako; Nagasaki, Hideki; Yoshioka, Terutaka; Toyoda, Atsushi; Fujiyama, Asao; Kaminuma, Eli; Nakamura, Yasukazu

2017-01-01

Satsuma ( Citrus unshiu Marc.) is one of the most abundantly produced mandarin varieties of citrus, known for its seedless fruit production and as a breeding parent of citrus. De novo assembly of the heterozygous diploid genome of Satsuma ("Miyagawa Wase") was conducted by a hybrid assembly approach using short-read sequences, three mate-pair libraries, and a long-read sequence of PacBio by the PLATANUS assembler. The assembled sequence, with a total size of 359.7 Mb at the N 50 length of 386,404 bp, consisted of 20,876 scaffolds. Pseudomolecules of Satsuma constructed by aligning the scaffolds to three genetic maps showed genome-wide synteny to the genomes of Clementine, pummelo, and sweet orange. Gene prediction by modeling with MAKER-P proposed 29,024 genes and 37,970 mRNA; additionally, gene prediction analysis found candidates for novel genes in several biosynthesis pathways for gibberellin and violaxanthin catabolism. BUSCO scores for the assembled scaffold and predicted transcripts, and another analysis by BAC end sequence mapping indicated the assembled genome consistency was close to those of the haploid Clementine, pummel, and sweet orange genomes. The number of repeat elements and long terminal repeat retrotransposon were comparable to those of the seven citrus genomes; this suggested no significant failure in the assembly at the repeat region. A resequencing application using the assembled sequence confirmed that both kunenbo-A and Satsuma are offsprings of Kishu, and Satsuma is a back-crossed offspring of Kishu. These results illustrated the performance of the hybrid assembly approach and its ability to construct an accurate heterozygous diploid genome.
A comparison across non-model animals suggests an optimal sequencing depth for de novo transcriptome assembly

PubMed Central

2013-01-01

Background The lack of genomic resources can present challenges for studies of non-model organisms. Transcriptome sequencing offers an attractive method to gather information about genes and gene expression without the need for a reference genome. However, it is unclear what sequencing depth is adequate to assemble the transcriptome de novo for these purposes. Results We assembled transcriptomes of animals from six different phyla (Annelids, Arthropods, Chordates, Cnidarians, Ctenophores, and Molluscs) at regular increments of reads using Velvet/Oases and Trinity to determine how read count affects the assembly. This included an assembly of mouse heart reads because we could compare those against the reference genome that is available. We found qualitative differences in the assemblies of whole-animals versus tissues. With increasing reads, whole-animal assemblies show rapid increase of transcripts and discovery of conserved genes, while single-tissue assemblies show a slower discovery of conserved genes though the assembled transcripts were often longer. A deeper examination of the mouse assemblies shows that with more reads, assembly errors become more frequent but such errors can be mitigated with more stringent assembly parameters. Conclusions These assembly trends suggest that representative assemblies are generated with as few as 20 million reads for tissue samples and 30 million reads for whole-animals for RNA-level coverage. These depths provide a good balance between coverage and noise. Beyond 60 million reads, the discovery of new genes is low and sequencing errors of highly-expressed genes are likely to accumulate. Finally, siphonophores (polymorphic Cnidarians) are an exception and possibly require alternate assembly strategies. PMID:23496952
A comparison across non-model animals suggests an optimal sequencing depth for de novo transcriptome assembly.

PubMed

Francis, Warren R; Christianson, Lynne M; Kiko, Rainer; Powers, Meghan L; Shaner, Nathan C; Haddock, Steven H D

2013-03-12

The lack of genomic resources can present challenges for studies of non-model organisms. Transcriptome sequencing offers an attractive method to gather information about genes and gene expression without the need for a reference genome. However, it is unclear what sequencing depth is adequate to assemble the transcriptome de novo for these purposes. We assembled transcriptomes of animals from six different phyla (Annelids, Arthropods, Chordates, Cnidarians, Ctenophores, and Molluscs) at regular increments of reads using Velvet/Oases and Trinity to determine how read count affects the assembly. This included an assembly of mouse heart reads because we could compare those against the reference genome that is available. We found qualitative differences in the assemblies of whole-animals versus tissues. With increasing reads, whole-animal assemblies show rapid increase of transcripts and discovery of conserved genes, while single-tissue assemblies show a slower discovery of conserved genes though the assembled transcripts were often longer. A deeper examination of the mouse assemblies shows that with more reads, assembly errors become more frequent but such errors can be mitigated with more stringent assembly parameters. These assembly trends suggest that representative assemblies are generated with as few as 20 million reads for tissue samples and 30 million reads for whole-animals for RNA-level coverage. These depths provide a good balance between coverage and noise. Beyond 60 million reads, the discovery of new genes is low and sequencing errors of highly-expressed genes are likely to accumulate. Finally, siphonophores (polymorphic Cnidarians) are an exception and possibly require alternate assembly strategies.
Harnessing Whole Genome Sequencing in Medical Mycology.

PubMed

Cuomo, Christina A

2017-01-01

Comparative genome sequencing studies of human fungal pathogens enable identification of genes and variants associated with virulence and drug resistance. This review describes current approaches, resources, and advances in applying whole genome sequencing to study clinically important fungal pathogens. Genomes for some important fungal pathogens were only recently assembled, revealing gene family expansions in many species and extreme gene loss in one obligate species. The scale and scope of species sequenced is rapidly expanding, leveraging technological advances to assemble and annotate genomes with higher precision. By using iteratively improved reference assemblies or those generated de novo for new species, recent studies have compared the sequence of isolates representing populations or clinical cohorts. Whole genome approaches provide the resolution necessary for comparison of closely related isolates, for example, in the analysis of outbreaks or sampled across time within a single host. Genomic analysis of fungal pathogens has enabled both basic research and diagnostic studies. The increased scale of sequencing can be applied across populations, and new metagenomic methods allow direct analysis of complex samples.
Data-dependent bucketing improves reference-free compression of sequencing reads.

PubMed

Patro, Rob; Kingsford, Carl

2015-09-01

The storage and transmission of high-throughput sequencing data consumes significant resources. As our capacity to produce such data continues to increase, this burden will only grow. One approach to reduce storage and transmission requirements is to compress this sequencing data. We present a novel technique to boost the compression of sequencing that is based on the concept of bucketing similar reads so that they appear nearby in the file. We demonstrate that, by adopting a data-dependent bucketing scheme and employing a number of encoding ideas, we can achieve substantially better compression ratios than existing de novo sequence compression tools, including other bucketing and reordering schemes. Our method, Mince, achieves up to a 45% reduction in file sizes (28% on average) compared with existing state-of-the-art de novo compression schemes. Mince is written in C++11, is open source and has been made available under the GPLv3 license. It is available at http://www.cs.cmu.edu/∼ckingsf/software/mince. carlk@cs.cmu.edu Supplementary data are available at Bioinformatics online. © The Author 2015. Published by Oxford University Press.
Discovery of Novel Antimicrobial Peptides from Varanus komodoensis (Komodo Dragon) by Large-Scale Analyses and De-Novo-Assisted Sequencing Using Electron-Transfer Dissociation Mass Spectrometry.

PubMed

Bishop, Barney M; Juba, Melanie L; Russo, Paul S; Devine, Megan; Barksdale, Stephanie M; Scott, Shaylyn; Settlage, Robert; Michalak, Pawel; Gupta, Kajal; Vliet, Kent; Schnur, Joel M; van Hoek, Monique L

2017-04-07

Komodo dragons are the largest living lizards and are the apex predators in their environs. They endure numerous strains of pathogenic bacteria in their saliva and recover from wounds inflicted by other dragons, reflecting the inherent robustness of their innate immune defense. We have employed a custom bioprospecting approach combining partial de novo peptide sequencing with transcriptome assembly to identify cationic antimicrobial peptides from Komodo dragon plasma. Through these analyses, we identified 48 novel potential cationic antimicrobial peptides. All but one of the identified peptides were derived from histone proteins. The antimicrobial effectiveness of eight of these peptides was evaluated against Pseudomonas aeruginosa (ATCC 9027) and Staphylococcus aureus (ATCC 25923), with seven peptides exhibiting antimicrobial activity against both microbes and one only showing significant potency against P. aeruginosa. This study demonstrates the power and promise of our bioprospecting approach to cationic antimicrobial peptide discovery, and it reveals the presence of a plethora of novel histone-derived antimicrobial peptides in the plasma of the Komodo dragon. These findings may have broader implications regarding the role that intact histones and histone-derived peptides play in defending the host from infection. Data are available via ProteomeXChange with identifier PXD005043.

The Goddard and Saturn Genes Are Essential for Drosophila Male Fertility and May Have Arisen De Novo

PubMed Central

Gubala, Anna M.; Schmitz, Jonathan F.; Kearns, Michael J.; Vinh, Tery T.; Bornberg-Bauer, Erich; Wolfner, Mariana F.

2017-01-01

New genes arise through a variety of mechanisms, including the duplication of existing genes and the de novo birth of genes from noncoding DNA sequences. While there are numerous examples of duplicated genes with important functional roles, the functions of de novo genes remain largely unexplored. Many newly evolved genes are expressed in the male reproductive tract, suggesting that these evolutionary innovations may provide advantages to males experiencing sexual selection. Using testis-specific RNA interference, we screened 11 putative de novo genes in Drosophila melanogaster for effects on male fertility and identified two, goddard and saturn, that are essential for spermatogenesis and sperm function. Goddard knockdown (KD) males fail to produce mature sperm, while saturn KD males produce few sperm, and these function inefficiently once transferred to females. Consistent with a de novo origin, both genes are identifiable only in Drosophila and are predicted to encode proteins with no sequence similarity to any annotated protein. However, since high levels of divergence prevented the unambiguous identification of the noncoding sequences from which each gene arose, we consider goddard and saturn to be putative de novo genes. Within Drosophila, both genes have been lost in certain lineages, but show conserved, male-specific patterns of expression in the species in which they are found. Goddard is consistently found in single-copy and evolves under purifying selection. In contrast, saturn has diversified through gene duplication and positive selection. These data suggest that de novo genes can acquire essential roles in male reproduction. PMID:28104747
MeCorS: Metagenome-enabled error correction of single cell sequencing reads

DOE PAGES

Bremges, Andreas; Singer, Esther; Woyke, Tanja; ...

2016-03-15

Here we present a new tool, MeCorS, to correct chimeric reads and sequencing errors in Illumina data generated from single amplified genomes (SAGs). It uses sequence information derived from accompanying metagenome sequencing to accurately correct errors in SAG reads, even from ultra-low coverage regions. In evaluations on real data, we show that MeCorS outperforms BayesHammer, the most widely used state-of-the-art approach. MeCorS performs particularly well in correcting chimeric reads, which greatly improves both accuracy and contiguity of de novo SAG assemblies.
De novo transcriptome sequencing reveals a considerable bias in the incidence of simple sequence repeats towards the downstream of 'Pre-miRNAs' of black pepper.

PubMed

Joy, Nisha; Asha, Srinivasan; Mallika, Vijayan; Soniya, Eppurathu Vasudevan

2013-01-01

Next generation sequencing has an advantageon transformational development of species with limited available sequence data as it helps to decode the genome and transcriptome. We carried out the de novo sequencing using illuminaHiSeq™ 2000 to generate the first leaf transcriptome of black pepper (Piper nigrum L.), an important spice variety native to South India and also grown in other tropical regions. Despite the economic and biochemical importance of pepper, a scientifically rigorous study at the molecular level is far from complete due to lack of sufficient sequence information and cytological complexity of its genome. The 55 million raw reads obtained, when assembled using Trinity program generated 2,23,386 contigs and 1,28,157 unigenes. Reports suggest that the repeat-rich genomic regions give rise to small non-coding functional RNAs. MicroRNAs (miRNAs) are the most abundant type of non-coding regulatory RNAs. In spite of the widespread research on miRNAs, little is known about the hair-pin precursors of miRNAs bearing Simple Sequence Repeats (SSRs). We used the array of transcripts generated, for the in silico prediction and detection of '43 pre-miRNA candidates bearing different types of SSR motifs'. The analysis identified 3913 different types of SSR motifs with an average of one SSR per 3.04 MB of thetranscriptome. About 0.033% of the transcriptome constituted 'pre-miRNA candidates bearing SSRs'. The abundance, type and distribution of SSR motifs studied across the hair-pin miRNA precursors, showed a significant bias in the position of SSRs towards the downstream of predicted 'pre-miRNA candidates'. The catalogue of transcripts identified, together with the demonstration of reliable existence of SSRs in the miRNA precursors, permits future opportunities for understanding the genetic mechanism of black pepper and likely functions of 'tandem repeats' in miRNAs.
De Novo Assembly and Characterization of Four Anthozoan (Phylum Cnidaria) Transcriptomes.

PubMed

Kitchen, Sheila A; Crowder, Camerron M; Poole, Angela Z; Weis, Virginia M; Meyer, Eli

2015-09-17

Many nonmodel species exemplify important biological questions but lack the sequence resources required to study the genes and genomic regions underlying traits of interest. Reef-building corals are famously sensitive to rising seawater temperatures, motivating ongoing research into their stress responses and long-term prospects in a changing climate. A comprehensive understanding of these processes will require extending beyond the sequenced coral genome (Acropora digitifera) to encompass diverse coral species and related anthozoans. Toward that end, we have assembled and annotated reference transcriptomes to develop catalogs of gene sequences for three scleractinian corals (Fungia scutaria, Montastraea cavernosa, Seriatopora hystrix) and a temperate anemone (Anthopleura elegantissima). High-throughput sequencing of cDNA libraries produced ~20-30 million reads per sample, and de novo assembly of these reads produced ~75,000-110,000 transcripts from each sample with size distributions (mean ~1.4 kb, N50 ~2 kb), comparable to the distribution of gene models from the coral genome (mean ~1.7 kb, N50 ~2.2 kb). Each assembly includes matches for more than half the gene models from A. digitifera (54-67%) and many reasonably complete transcripts (~5300-6700) spanning nearly the entire gene (ortholog hit ratios ≥0.75). The catalogs of gene sequences developed in this study made it possible to identify hundreds to thousands of orthologs across diverse scleractinian species and related taxa. We used these sequences for phylogenetic inference, recovering known relationships and demonstrating superior performance over phylogenetic trees constructed using single mitochondrial loci. The resources developed in this study provide gene sequences and genetic markers for several anthozoan species. To enhance the utility of these resources for the research community, we developed searchable databases enabling researchers to rapidly recover sequences for genes of interest. Our
De novo point mutations in patients diagnosed with ataxic cerebral palsy

PubMed Central

Parolin Schnekenberg, Ricardo; Perkins, Emma M.; Miller, Jack W.; Davies, Wayne I. L.; D’Adamo, Maria Cristina; Pessia, Mauro; Fawcett, Katherine A.; Sims, David; Gillard, Elodie; Hudspith, Karl; Skehel, Paul; Williams, Jonathan; O’Regan, Mary; Jayawant, Sandeep; Jefferson, Rosalind; Hughes, Sarah; Lustenberger, Andrea; Ragoussis, Jiannis

2015-01-01

Cerebral palsy is a sporadic disorder with multiple likely aetiologies, but frequently considered to be caused by birth asphyxia. Genetic investigations are rarely performed in patients with cerebral palsy and there is little proven evidence of genetic causes. As part of a large project investigating children with ataxia, we identified four patients in our cohort with a diagnosis of ataxic cerebral palsy. They were investigated using either targeted next generation sequencing or trio-based exome sequencing and were found to have mutations in three different genes, KCNC3, ITPR1 and SPTBN2. All the mutations were de novo and associated with increased paternal age. The mutations were shown to be pathogenic using a combination of bioinformatics analysis and in vitro model systems. This work is the first to report that the ataxic subtype of cerebral palsy can be caused by de novo dominant point mutations, which explains the sporadic nature of these cases. We conclude that at least some subtypes of cerebral palsy may be caused by de novo genetic mutations and patients with a clinical diagnosis of cerebral palsy should be genetically investigated before causation is ascribed to perinatal asphyxia or other aetiologies. PMID:25981959
Combining independent de novo assemblies optimizes the coding transcriptome for nonconventional model eukaryotic organisms.

PubMed

Cerveau, Nicolas; Jackson, Daniel J

2016-12-09

Next-generation sequencing (NGS) technologies are arguably the most revolutionary technical development to join the list of tools available to molecular biologists since PCR. For researchers working with nonconventional model organisms one major problem with the currently dominant NGS platform (Illumina) stems from the obligatory fragmentation of nucleic acid material that occurs prior to sequencing during library preparation. This step creates a significant bioinformatic challenge for accurate de novo assembly of novel transcriptome data. This challenge becomes apparent when a variety of modern assembly tools (of which there is no shortage) are applied to the same raw NGS dataset. With the same assembly parameters these tools can generate markedly different assembly outputs. In this study we present an approach that generates an optimized consensus de novo assembly of eukaryotic coding transcriptomes. This approach does not represent a new assembler, rather it combines the outputs of a variety of established assembly packages, and removes redundancy via a series of clustering steps. We test and validate our approach using Illumina datasets from six phylogenetically diverse eukaryotes (three metazoans, two plants and a yeast) and two simulated datasets derived from metazoan reference genome annotations. All of these datasets were assembled using three currently popular assembly packages (CLC, Trinity and IDBA-tran). In addition, we experimentally demonstrate that transcripts unique to one particular assembly package are likely to be bioinformatic artefacts. For all eight datasets our pipeline generates more concise transcriptomes that in fact possess more unique annotatable protein domains than any of the three individual assemblers we employed. Another measure of assembly completeness (using the purpose built BUSCO databases) also confirmed that our approach yields more information. Our approach yields coding transcriptome assemblies that are more likely to be
Identification of a Novel De Novo Heterozygous Deletion in the SOX10 Gene in Waardenburg Syndrome Type II Using Next-Generation Sequencing.

PubMed

Li, Haonan; Jin, Peng; Hao, Qian; Zhu, Wei; Chen, Xia; Wang, Ping

2017-11-01

Waardenburg syndrome (WS) is a rare autosomal dominant disorder associated with pigmentation abnormalities and sensorineural hearing loss. In this study, we investigated the genetic cause of WSII in a patient and evaluated the reliability of the targeted next-generation exome sequencing method for the genetic diagnosis of WS. Clinical evaluations were conducted on the patient and targeted next-generation sequencing (NGS) was used to identify the candidate genes responsible for WSII. Multiplex ligation-dependent probe amplification (MLPA) and real-time quantitative polymerase chain reaction (qPCR) were performed to confirm the targeted NGS results. Targeted NGS detected the entire deletion of the coding sequence (CDS) of the SOX10 gene in the WSII patient. MLPA results indicated that all exons of the SOX10 heterozygous deletion were detected; no aberrant copy number in the PAX3 and microphthalmia-associated transcription factor (MITF) genes was found. Real-time qPCR results identified the mutation as a de novo heterozygous deletion. This is the first report of using a targeted NGS method for WS candidate gene sequencing; its accuracy was verified by using the MLPA and qPCR methods. Our research provides a valuable method for the genetic diagnosis of WS.
BG7: A New Approach for Bacterial Genome Annotation Designed for Next Generation Sequencing Data

PubMed Central

Pareja-Tobes, Pablo; Manrique, Marina; Pareja-Tobes, Eduardo; Pareja, Eduardo; Tobes, Raquel

2012-01-01

BG7 is a new system for de novo bacterial, archaeal and viral genome annotation based on a new approach specifically designed for annotating genomes sequenced with next generation sequencing technologies. The system is versatile and able to annotate genes even in the step of preliminary assembly of the genome. It is especially efficient detecting unexpected genes horizontally acquired from bacterial or archaeal distant genomes, phages, plasmids, and mobile elements. From the initial phases of the gene annotation process, BG7 exploits the massive availability of annotated protein sequences in databases. BG7 predicts ORFs and infers their function based on protein similarity with a wide set of reference proteins, integrating ORF prediction and functional annotation phases in just one step. BG7 is especially tolerant to sequencing errors in start and stop codons, to frameshifts, and to assembly or scaffolding errors. The system is also tolerant to the high level of gene fragmentation which is frequently found in not fully assembled genomes. BG7 current version – which is developed in Java, takes advantage of Amazon Web Services (AWS) cloud computing features, but it can also be run locally in any operating system. BG7 is a fast, automated and scalable system that can cope with the challenge of analyzing the huge amount of genomes that are being sequenced with NGS technologies. Its capabilities and efficiency were demonstrated in the 2011 EHEC Germany outbreak in which BG7 was used to get the first annotations right the next day after the first entero-hemorrhagic E. coli genome sequences were made publicly available. The suitability of BG7 for genome annotation has been proved for Illumina, 454, Ion Torrent, and PacBio sequencing technologies. Besides, thanks to its plasticity, our system could be very easily adapted to work with new technologies in the future. PMID:23185310
The Goddard and Saturn Genes Are Essential for Drosophila Male Fertility and May Have Arisen De Novo.

PubMed

Gubala, Anna M; Schmitz, Jonathan F; Kearns, Michael J; Vinh, Tery T; Bornberg-Bauer, Erich; Wolfner, Mariana F; Findlay, Geoffrey D

2017-05-01

New genes arise through a variety of mechanisms, including the duplication of existing genes and the de novo birth of genes from noncoding DNA sequences. While there are numerous examples of duplicated genes with important functional roles, the functions of de novo genes remain largely unexplored. Many newly evolved genes are expressed in the male reproductive tract, suggesting that these evolutionary innovations may provide advantages to males experiencing sexual selection. Using testis-specific RNA interference, we screened 11 putative de novo genes in Drosophila melanogaster for effects on male fertility and identified two, goddard and saturn, that are essential for spermatogenesis and sperm function. Goddard knockdown (KD) males fail to produce mature sperm, while saturn KD males produce few sperm, and these function inefficiently once transferred to females. Consistent with a de novo origin, both genes are identifiable only in Drosophila and are predicted to encode proteins with no sequence similarity to any annotated protein. However, since high levels of divergence prevented the unambiguous identification of the noncoding sequences from which each gene arose, we consider goddard and saturn to be putative de novo genes. Within Drosophila, both genes have been lost in certain lineages, but show conserved, male-specific patterns of expression in the species in which they are found. Goddard is consistently found in single-copy and evolves under purifying selection. In contrast, saturn has diversified through gene duplication and positive selection. These data suggest that de novo genes can acquire essential roles in male reproduction. © The Author 2017. Published by Oxford University Press on behalf of the Society for Molecular Biology and Evolution. All rights reserved. For permissions, please e-mail: journals.permissions@oup.com.
A multi-method approach toward de novo glycan characterization: a Man-5 case study.

PubMed

Prien, Justin M; Prater, Bradley D; Cockrill, Steven L

2010-05-01

Regulatory agencies' expectations for biotherapeutic approval are becoming more stringent with regard to product characterization, where minor species as low as 0.1% of a given profile are typically identified. The mission of this manuscript is to demonstrate a multi-method approach toward de novo glycan characterization and quantitation, including minor species at or approaching the 0.1% benchmark. Recently, unexpected isomers of the Man(5)GlcNAc(2) (M(5)) were reported (Prien JM, Ashline DJ, Lapadula AJ, Zhang H, Reinhold VN. 2009. The high mannose glycans from bovine ribonuclease B isomer characterization by ion trap mass spectrometry (MS). J Am Soc Mass Spectrom. 20:539-556). In the current study, quantitative analysis of these isomers found in commercial M(5) standard demonstrated that they are in low abundance (<1% of the total) and therefore an exemplary "litmus test" for minor species characterization. A simple workflow devised around three core well-established analytical procedures: (1) fluorescence derivatization; (2) online rapid resolution reversed-phase separation coupled with negative-mode sequential mass spectrometry (RRRP-(-)-MS(n)); and (3) permethylation derivatization with nanospray sequential mass spectrometry (NSI-MS(n)) provides comprehensive glycan structural determination. All methods have limitations; however, a multi-method workflow is an at-line stopgap/solution which mitigates each method's individual shortcoming(s) providing greater opportunity for more comprehensive characterization. This manuscript is the first to demonstrate quantitative chromatographic separation of the M(5) isomers and the use of a commercially available stable isotope variant of 2-aminobenzoic acid to detect and chromatographically resolve multiple M(5) isomers in bovine ribonuclease B. With this multi-method approach, we have the capabilities to comprehensively characterize a biotherapeutic's glycan array in a de novo manner, including structural isomers at >/=0
Efficient generation of complete sequences of MDR-encoding plasmids by rapid assembly of MinION barcoding sequencing data.

PubMed

Li, Ruichao; Xie, Miaomiao; Dong, Ning; Lin, Dachuan; Yang, Xuemei; Wong, Marcus Ho Yin; Chan, Edward Wai-Chi; Chen, Sheng

2018-03-01

Multidrug resistance (MDR)-encoding plasmids are considered major molecular vehicles responsible for transmission of antibiotic resistance genes among bacteria of the same or different species. Delineating the complete sequences of such plasmids could provide valuable insight into the evolution and transmission mechanisms underlying bacterial antibiotic resistance development. However, due to the presence of multiple repeats of mobile elements, complete sequencing of MDR plasmids remains technically complicated, expensive, and time-consuming. Here, we demonstrate a rapid and efficient approach to obtaining multiple MDR plasmid sequences through the use of the MinION nanopore sequencing platform, which is incorporated in a portable device. By assembling the long sequencing reads generated by a single MinION run according to a rapid barcoding sequencing protocol, we obtained the complete sequences of 20 plasmids harbored by multiple bacterial strains. Importantly, single long reads covering a plasmid end-to-end were recorded, indicating that de novo assembly may be unnecessary if the single reads exhibit high accuracy. This workflow represents a convenient and cost-effective approach for systematic assessment of MDR plasmids responsible for treatment failure of bacterial infections, offering the opportunity to perform detailed molecular epidemiological studies to probe the evolutionary and transmission mechanisms of MDR-encoding elements.
De novo Transcriptome Sequencing Reveals a Considerable Bias in the Incidence of Simple Sequence Repeats towards the Downstream of ‘Pre-miRNAs’ of Black Pepper

PubMed Central

Joy, Nisha; Asha, Srinivasan; Mallika, Vijayan; Soniya, Eppurathu Vasudevan

2013-01-01

Next generation sequencing has an advantageon transformational development of species with limited available sequence data as it helps to decode the genome and transcriptome. We carried out the de novo sequencing using illuminaHiSeq™ 2000 to generate the first leaf transcriptome of black pepper (Piper nigrum L.), an important spice variety native to South India and also grown in other tropical regions. Despite the economic and biochemical importance of pepper, a scientifically rigorous study at the molecular level is far from complete due to lack of sufficient sequence information and cytological complexity of its genome. The 55 million raw reads obtained, when assembled using Trinity program generated 2,23,386 contigs and 1,28,157 unigenes. Reports suggest that the repeat-rich genomic regions give rise to small non-coding functional RNAs. MicroRNAs (miRNAs) are the most abundant type of non-coding regulatory RNAs. In spite of the widespread research on miRNAs, little is known about the hair-pin precursors of miRNAs bearing Simple Sequence Repeats (SSRs). We used the array of transcripts generated, for the in silico prediction and detection of ‘43 pre-miRNA candidates bearing different types of SSR motifs’. The analysis identified 3913 different types of SSR motifs with an average of one SSR per 3.04 MB of thetranscriptome. About 0.033% of the transcriptome constituted ‘pre-miRNA candidates bearing SSRs’. The abundance, type and distribution of SSR motifs studied across the hair-pin miRNA precursors, showed a significant bias in the position of SSRs towards the downstream of predicted ‘pre-miRNA candidates’. The catalogue of transcripts identified, together with the demonstration of reliable existence of SSRs in the miRNA precursors, permits future opportunities for understanding the genetic mechanism of black pepper and likely functions of ‘tandem repeats’ in miRNAs. PMID:23469176
Single molecule sequencing-guided scaffolding and correction of draft assemblies.

PubMed

Zhu, Shenglong; Chen, Danny Z; Emrich, Scott J

2017-12-06

Although single molecule sequencing is still improving, the lengths of the generated sequences are inevitably an advantage in genome assembly. Prior work that utilizes long reads to conduct genome assembly has mostly focused on correcting sequencing errors and improving contiguity of de novo assemblies. We propose a disassembling-reassembling approach for both correcting structural errors in the draft assembly and scaffolding a target assembly based on error-corrected single molecule sequences. To achieve this goal, we formulate a maximum alternating path cover problem. We prove that this problem is NP-hard, and solve it by a 2-approximation algorithm. Our experimental results show that our approach can improve the structural correctness of target assemblies in the cost of some contiguity, even with smaller amounts of long reads. In addition, our reassembling process can also serve as a competitive scaffolder relative to well-established assembly benchmarks.
Next Generation Sequence Assembly with AMOS

PubMed Central

Treangen, Todd J; Sommer, Dan D; Angly, Florent E; Koren, Sergey; Pop, Mihai

2011-01-01

A Modular Open-Source Assembler (AMOS) was designed to offer a modular approach to genome assembly. AMOS includes a wide range of tools for assembly, including lightweight de novo assemblers Minimus and Minimo, and Bambus 2, a robust scaffolder able to handle metagenomic and polymorphic data. This protocol describes how to configure and use AMOS for the assembly of Next Generation sequence data. Additionally, we provide three tutorial examples that include bacterial, viral, and metagenomic datasets with specific tips for improving assembly quality. PMID:21400694
De novo mutations in histone modifying genes in congenital heart disease

PubMed Central

Zaidi, Samir; Choi, Murim; Wakimoto, Hiroko; Ma, Lijiang; Jiang, Jianming; Overton, John D.; Romano-Adesman, Angela; Bjornson, Robert D.; Breitbart, Roger E.; Brown, Kerry K.; Carriero, Nicholas J.; Cheung, Yee Him; Deanfield, John; DePalma, Steve; Fakhro, Khalid A.; Glessner, Joseph; Hakonarson, Hakon; Italia, Michael; Kaltman, Jonathan R.; Kaski, Juan; Kim, Richard; Kline, Jennie K.; Lee, Teresa; Leipzig, Jeremy; Lopez, Alexander; Mane, Shrikant M.; Mitchell, Laura E.; Newburger, Jane W.; Parfenov, Michael; Pe'er, Itsik; Porter, George; Roberts, Amy; Sachidanandam, Ravi; Sanders, Stephan J.; Seiden, Howard S.; State, Mathew W.; Subramanian, Sailakshmi; Tikhonova, Irina R.; Wang, Wei; Warburton, Dorothy; White, Peter S.; Williams, Ismee A.; Zhao, Hongyu; Seidman, Jonathan G.; Brueckner, Martina; Chung, Wendy K.; Gelb, Bruce D.; Goldmuntz, Elizabeth; Seidman, Christine E.; Lifton, Richard P.

2013-01-01

Congenital heart disease (CHD) is the most frequent birth defect, affecting 0.8% of live births1. Many cases occur sporadically and impair reproductive fitness, suggesting a role for de novo mutations. By analysis of exome sequencing of parent-offspring trios, we compared the incidence of de novo mutations in 362 severe CHD cases and 264 controls. CHD cases showed a significant excess of protein-altering de novo mutations in genes expressed in the developing heart, with an odds ratio of 7.5 for damaging mutations. Similar odds ratios were seen across major classes of severe CHD. We found a marked excess of de novo mutations in genes involved in production, removal or reading of H3K4 methylation (H3K4me), or ubiquitination of H2BK120, which is required for H3K4 methylation2–4. There were also two de novo mutations in SMAD2; SMAD2 signaling in the embryonic left-right organizer induces demethylation of H3K27me5. H3K4me and H3K27me mark `poised' promoters and enhancers that regulate expression of key developmental genes6. These findings implicate de novo point mutations in several hundred genes that collectively contribute to ~10% of severe CHD. PMID:23665959
De novo assembly and characterization of bark transcriptome using Illumina sequencing and development of EST-SSR markers in rubber tree (Hevea brasiliensis Muell. Arg.)

PubMed Central

2012-01-01

Background In rubber tree, bark is one of important agricultural and biological organs. However, the molecular mechanism involved in the bark formation and development in rubber tree remains largely unknown, which is at least partially due to lack of bark transcriptomic and genomic information. Therefore, it is necessary to carried out high-throughput transcriptome sequencing of rubber tree bark to generate enormous transcript sequences for the functional characterization and molecular marker development. Results In this study, more than 30 million sequencing reads were generated using Illumina paired-end sequencing technology. In total, 22,756 unigenes with an average length of 485 bp were obtained with de novo assembly. The similarity search indicated that 16,520 and 12,558 unigenes showed significant similarities to known proteins from NCBI non-redundant and Swissprot protein databases, respectively. Among these annotated unigenes, 6,867 and 5,559 unigenes were separately assigned to Gene Ontology (GO) and Clusters of Orthologous Group (COG). When 22,756 unigenes searched against the Kyoto Encyclopedia of Genes and Genomes Pathway (KEGG) database, 12,097 unigenes were assigned to 5 main categories including 123 KEGG pathways. Among the main KEGG categories, metabolism was the biggest category (9,043, 74.75%), suggesting the active metabolic processes in rubber tree bark. In addition, a total of 39,257 EST-SSRs were identified from 22,756 unigenes, and the characterizations of EST-SSRs were further analyzed in rubber tree. 110 potential marker sites were randomly selected to validate the assembly quality and develop EST-SSR markers. Among 13 Hevea germplasms, PCR success rate and polymorphism rate of 110 markers were separately 96.36% and 55.45% in this study. Conclusion By assembling and analyzing de novo transcriptome sequencing data, we reported the comprehensive functional characterization of rubber tree bark. This research generated a substantial fraction
De novo sequencing analysis of the Rosa roxburghii fruit transcriptome reveals putative ascorbate biosynthetic genes and EST-SSR markers.

PubMed

Yan, Xiuqin; Zhang, Xue; Lu, Min; He, Yong; An, Huaming

2015-04-25

Rosa roxburghii Tratt. is a well-known ornamental rose species native to China. In addition, the fruits of this species are valued for their nutritional and medicinal characteristics, especially their high ascorbic acid (AsA) levels. Nevertheless, AsA biosynthesis in R. roxburghii fruit has not been explored in detail because of a lack of genomic resources for this species. High-throughput transcriptomic sequencing generating large volumes of transcript sequence data can aid in gene discovery and molecular marker development. In this study, we generated more than 53 million clean reads using Illumina paired-end sequencing technology. De novo assembly yielded 106,590 unigenes, with an average length of 343 bp. On the basis of sequence similarity to known proteins, 9301 and 2393 unigenes were classified into Gene Ontology and Clusters of Orthologous Group categories, respectively. There were 7480 unigenes assigned to 124 pathways in the Kyoto Encyclopedia of Gene and Genome pathway database. BLASTx searches identified 498 unique putative transcripts encoding various transcription factors, some known to regulate fruit development. qRT-PCR validated the expressions of most of the genes encoding the main enzymes involved in ascorbate biosynthesis. In addition, 9131 potential simple sequence repeat (SSR) loci were identified among the unigenes. One hundred and two primer pairs were synthesized and 71 pairs produced an amplification product during initial screening. Among the amplified products, 30 were polymorphic in the 16 R. roxburghii germplasms tested. Our study was the first to produce a large volume of transcriptome data from R. roxburghii. The resulting sequence collection is a valuable resource for gene discovery and marker-assisted selective breeding in this rose species. Copyright © 2015 Elsevier B.V. All rights reserved.
Complete Taiwanese Macaque (Macaca cyclopis) Mitochondrial Genome: Reference-Assisted de novo Assembly with Multiple k-mer Strategy.

PubMed

Huang, Yu-Feng; Midha, Mohit; Chen, Tzu-Han; Wang, Yu-Tai; Smith, David Glenn; Pei, Kurtis Jai-Chyi; Chiu, Kuo Ping

2015-01-01

The Taiwanese (Formosan) macaque (Macaca cyclopis) is the only nonhuman primate endemic to Taiwan. This primate species is valuable for evolutionary studies and as subjects in medical research. However, only partial fragments of the mitochondrial genome (mitogenome) of this primate species have been sequenced, not mentioning its nuclear genome. We employed next-generation sequencing to generate 2 x 90 bp paired-end reads, followed by reference-assisted de novo assembly with multiple k-mer strategy to characterize the M. cyclopis mitogenome. We compared the assembled mitogenome with that of other macaque species for phylogenetic analysis. Our results show that, the M. cyclopis mitogenome consists of 16,563 nucleotides encoding for 13 protein-coding genes, 2 ribosomal RNAs and 22 transfer RNAs. Phylogenetic analysis indicates that M. cyclopis is most closely related to M. mulatta lasiota (Chinese rhesus macaque), supporting the notion of Asia-continental origin of M. cyclopis proposed in previous studies based on partial mitochondrial sequences. Our work presents a novel approach for assembling a mitogenome that utilizes the capabilities of de novo genome assembly with assistance of a reference genome. The availability of the complete Taiwanese macaque mitogenome will facilitate the study of primate evolution and the characterization of genetic variations for the potential usage of this species as a non-human primate model for medical research.
LTRsift: a graphical user interface for semi-automatic classification and postprocessing of de novo detected LTR retrotransposons.

PubMed

Steinbiss, Sascha; Kastens, Sascha; Kurtz, Stefan

2012-11-07

Long terminal repeat (LTR) retrotransposons are a class of eukaryotic mobile elements characterized by a distinctive sequence similarity-based structure. Hence they are well suited for computational identification. Current software allows for a comprehensive genome-wide de novo detection of such elements. The obvious next step is the classification of newly detected candidates resulting in (super-)families. Such a de novo classification approach based on sequence-based clustering of transposon features has been proposed before, resulting in a preliminary assignment of candidates to families as a basis for subsequent manual refinement. However, such a classification workflow is typically split across a heterogeneous set of glue scripts and generic software (for example, spreadsheets), making it tedious for a human expert to inspect, curate and export the putative families produced by the workflow. We have developed LTRsift, an interactive graphical software tool for semi-automatic postprocessing of de novo predicted LTR retrotransposon annotations. Its user-friendly interface offers customizable filtering and classification functionality, displaying the putative candidate groups, their members and their internal structure in a hierarchical fashion. To ease manual work, it also supports graphical user interface-driven reassignment, splitting and further annotation of candidates. Export of grouped candidate sets in standard formats is possible. In two case studies, we demonstrate how LTRsift can be employed in the context of a genome-wide LTR retrotransposon survey effort. LTRsift is a useful and convenient tool for semi-automated classification of newly detected LTR retrotransposons based on their internal features. Its efficient implementation allows for convenient and seamless filtering and classification in an integrated environment. Developed for life scientists, it is helpful in postprocessing and refining the output of software for predicting LTR
Exome sequencing reveals a de novo POLD1 mutation causing phenotypic variability in mandibular hypoplasia, deafness, progeroid features, and lipodystrophy syndrome (MDPL).

PubMed

Elouej, Sahar; Beleza-Meireles, Ana; Caswell, Richard; Colclough, Kevin; Ellard, Sian; Desvignes, Jean Pierre; Béroud, Christophe; Lévy, Nicolas; Mohammed, Shehla; De Sandre-Giovannoli, Annachiara

2017-06-01

Mandibular hypoplasia, deafness, progeroid features, and lipodystrophy syndrome (MDPL) is an autosomal dominant systemic disorder characterized by prominent loss of subcutaneous fat, a characteristic facial appearance and metabolic abnormalities. This syndrome is caused by heterozygous de novo mutations in the POLD1 gene. To date, 19 patients with MDPL have been reported in the literature and among them 14 patients have been characterized at the molecular level. Twelve unrelated patients carried a recurrent in-frame deletion of a single codon (p.Ser605del) and two other patients carried a novel heterozygous mutation in exon 13 (p.Arg507Cys). Additionally and interestingly, germline mutations of the same gene have been involved in familial polyposis and colorectal cancer (CRC) predisposition. We describe a male and a female patient with MDPL respectively affected with mild and severe phenotypes. Both of them showed mandibular hypoplasia, a beaked nose with bird-like facies, prominent eyes, a small mouth, growth retardation, muscle and skin atrophy, but the female patient showed such a severe and early phenotype that a first working diagnosis of Hutchinson-Gilford Progeria was made. The exploration was performed by direct sequencing of POLD1 gene exon 15 in the male patient with a classical MDPL phenotype and by whole exome sequencing in the female patient and her unaffected parents. Exome sequencing identified in the latter patient a de novo heterozygous undescribed mutation in the POLD1 gene (NM_002691.3: c.3209T>A), predicted to cause the missense change p.Ile1070Asn in the ZnF2 (Zinc Finger 2) domain of the protein. This mutation was not reported in the 1000 Genome Project, dbSNP and Exome sequencing databases. Furthermore, the Isoleucine1070 residue of POLD1 is highly conserved among various species, suggesting that this substitution may cause a major impairment of POLD1 activity. For the second patient, affected with a typical MDPL phenotype, direct sequencing

Installing hydrolytic activity into a completely de novo protein framework

NASA Astrophysics Data System (ADS)

Burton, Antony J.; Thomson, Andrew R.; Dawson, William M.; Brady, R. Leo; Woolfson, Derek N.

2016-09-01

The design of enzyme-like catalysts tests our understanding of sequence-to-structure/function relationships in proteins. Here we install hydrolytic activity predictably into a completely de novo and thermostable α-helical barrel, which comprises seven helices arranged around an accessible channel. We show that the lumen of the barrel accepts 21 mutations to functional polar residues. The resulting variant, which has cysteine-histidine-glutamic acid triads on each helix, hydrolyses p-nitrophenyl acetate with catalytic efficiencies that match the most-efficient redesigned hydrolases based on natural protein scaffolds. This is the first report of a functional catalytic triad engineered into a de novo protein framework. The flexibility of our system also allows the facile incorporation of unnatural side chains to improve activity and probe the catalytic mechanism. Such a predictable and robust construction of truly de novo biocatalysts holds promise for applications in chemical and biochemical synthesis.
De novo transcriptome sequencing of the Octopus vulgaris hemocytes using Illumina RNA-Seq technology: response to the infection by the gastrointestinal parasite Aggregata octopiana.

PubMed

Castellanos-Martínez, Sheila; Arteta, David; Catarino, Susana; Gestal, Camino

2014-01-01

Octopus vulgaris is a highly valuable species of great commercial interest and excellent candidate for aquaculture diversification; however, the octopus' well-being is impaired by pathogens, of which the gastrointestinal coccidian parasite Aggregata octopiana is one of the most important. The knowledge of the molecular mechanisms of the immune response in cephalopods, especially in octopus is scarce. The transcriptome of the hemocytes of O. vulgaris was de novo sequenced using the high-throughput paired-end Illumina technology to identify genes involved in immune defense and to understand the molecular basis of octopus tolerance/resistance to coccidiosis. A bi-directional mRNA library was constructed from hemocytes of two groups of octopus according to the infection by A. octopiana, sick octopus, suffering coccidiosis, and healthy octopus, and reads were de novo assembled together. The differential expression of transcripts was analysed using the general assembly as a reference for mapping the reads from each condition. After sequencing, a total of 75,571,280 high quality reads were obtained from the sick octopus group and 74,731,646 from the healthy group. The general transcriptome of the O. vulgaris hemocytes was assembled in 254,506 contigs. A total of 48,225 contigs were successfully identified, and 538 transcripts exhibited differential expression between groups of infection. The general transcriptome revealed genes involved in pathways like NF-kB, TLR and Complement. Differential expression of TLR-2, PGRP, C1q and PRDX genes due to infection was validated using RT-qPCR. In sick octopuses, only TLR-2 was up-regulated in hemocytes, but all of them were up-regulated in caecum and gills. The transcriptome reported here de novo establishes the first molecular clues to understand how the octopus immune system works and interacts with a highly pathogenic coccidian. The data provided here will contribute to identification of biomarkers for octopus resistance against
De Novo Transcriptome Sequencing of the Octopus vulgaris Hemocytes Using Illumina RNA-Seq Technology: Response to the Infection by the Gastrointestinal Parasite Aggregata octopiana

PubMed Central

Castellanos-Martínez, Sheila; Arteta, David; Catarino, Susana; Gestal, Camino

2014-01-01

Background Octopus vulgaris is a highly valuable species of great commercial interest and excellent candidate for aquaculture diversification; however, the octopus’ well-being is impaired by pathogens, of which the gastrointestinal coccidian parasite Aggregata octopiana is one of the most important. The knowledge of the molecular mechanisms of the immune response in cephalopods, especially in octopus is scarce. The transcriptome of the hemocytes of O. vulgaris was de novo sequenced using the high-throughput paired-end Illumina technology to identify genes involved in immune defense and to understand the molecular basis of octopus tolerance/resistance to coccidiosis. Results A bi-directional mRNA library was constructed from hemocytes of two groups of octopus according to the infection by A. octopiana, sick octopus, suffering coccidiosis, and healthy octopus, and reads were de novo assembled together. The differential expression of transcripts was analysed using the general assembly as a reference for mapping the reads from each condition. After sequencing, a total of 75,571,280 high quality reads were obtained from the sick octopus group and 74,731,646 from the healthy group. The general transcriptome of the O. vulgaris hemocytes was assembled in 254,506 contigs. A total of 48,225 contigs were successfully identified, and 538 transcripts exhibited differential expression between groups of infection. The general transcriptome revealed genes involved in pathways like NF-kB, TLR and Complement. Differential expression of TLR-2, PGRP, C1q and PRDX genes due to infection was validated using RT-qPCR. In sick octopuses, only TLR-2 was up-regulated in hemocytes, but all of them were up-regulated in caecum and gills. Conclusion The transcriptome reported here de novo establishes the first molecular clues to understand how the octopus immune system works and interacts with a highly pathogenic coccidian. The data provided here will contribute to identification of biomarkers
De Novo Design and Experimental Characterization of Ultrashort Self-Associating Peptides

PubMed Central

Xue, Bo; Robinson, Robert C.; Hauser, Charlotte A. E.; Floudas, Christodoulos A.

2014-01-01

Self-association is a common phenomenon in biology and one that can have positive and negative impacts, from the construction of the architectural cytoskeleton of cells to the formation of fibrils in amyloid diseases. Understanding the nature and mechanisms of self-association is important for modulating these systems and in creating biologically-inspired materials. Here, we present a two-stage de novo peptide design framework that can generate novel self-associating peptide systems. The first stage uses a simulated multimeric template structure as input into the optimization-based Sequence Selection to generate low potential energy sequences. The second stage is a computational validation procedure that calculates Fold Specificity and/or Approximate Association Affinity (K*association) based on metrics that we have devised for multimeric systems. This framework was applied to the design of self-associating tripeptides using the known self-associating tripeptide, Ac-IVD, as a structural template. Six computationally predicted tripeptides (Ac-LVE, Ac-YYD, Ac-LLE, Ac-YLD, Ac-MYD, Ac-VIE) were chosen for experimental validation in order to illustrate the self-association outcomes predicted by the three metrics. Self-association and electron microscopy studies revealed that Ac-LLE formed bead-like microstructures, Ac-LVE and Ac-YYD formed fibrillar aggregates, Ac-VIE and Ac-MYD formed hydrogels, and Ac-YLD crystallized under ambient conditions. An X-ray crystallographic study was carried out on a single crystal of Ac-YLD, which revealed that each molecule adopts a β-strand conformation that stack together to form parallel β-sheets. As an additional validation of the approach, the hydrogel-forming sequences of Ac-MYD and Ac-VIE were shuffled. The shuffled sequences were computationally predicted to have lower K*association values and were experimentally verified to not form hydrogels. This illustrates the robustness of the framework in predicting self
The Rickettsia Endosymbiont of Ixodes pacificus Contains All the Genes of De Novo Folate Biosynthesis

PubMed Central

Bodnar, James; Mortazavi, Bobak; Laurent, Timothy; Deason, Jeff; Thephavongsa, Khanhkeo; Zhong, Jianmin

2015-01-01

Ticks and other arthropods often are hosts to nutrient providing bacterial endosymbionts, which contribute to their host’s fitness by supplying nutrients such as vitamins and amino acids. It has been detected, in our lab, that Ixodes pacificus is host to Rickettsia species phylotype G021. This endosymbiont is predominantly present, and 100% maternally transmitted in I. pacificus. To study roles of phylotype G021 in I. pacificus, bioinformatic and molecular approaches were carried out. MUMmer genome alignments of whole genome sequence of I. scapularis, a close relative to I. pacificus, against completely sequenced genomes of R. bellii OSU85-389, R. conorii, and R. felis, identified 8,190 unique sequences that are homologous to Rickettsia sequences in the NCBI Trace Archive. MetaCyc metabolic reconstructions revealed that all folate gene orthologues (folA, folC, folE, folKP, ptpS) required for de novo folate biosynthesis are present in the genome of Rickettsia buchneri in I. scapularis. To examine the metabolic capability of phylotype G021 in I. pacificus, genes of the folate biosynthesis pathway of the bacterium were PCR amplified using degenerate primers. BLAST searches identified that nucleotide sequences of the folA, folC, folE, folKP, and ptpS genes possess 98.6%, 98.8%, 98.9%, 98.5% and 99.0% identity respectively to the corresponding genes of Rickettsia buchneri. Phylogenetic tree constructions show that the folate genes of phylotype G021 and homologous genes from various Rickettsia species are monophyletic. This study has shown that all folate genes exist in the genome of Rickettsia species phylotype G021 and that this bacterium has the genetic capability for de novo folate synthesis. PMID:26650541
De novo point mutations in patients diagnosed with ataxic cerebral palsy.

PubMed

Parolin Schnekenberg, Ricardo; Perkins, Emma M; Miller, Jack W; Davies, Wayne I L; D'Adamo, Maria Cristina; Pessia, Mauro; Fawcett, Katherine A; Sims, David; Gillard, Elodie; Hudspith, Karl; Skehel, Paul; Williams, Jonathan; O'Regan, Mary; Jayawant, Sandeep; Jefferson, Rosalind; Hughes, Sarah; Lustenberger, Andrea; Ragoussis, Jiannis; Jackson, Mandy; Tucker, Stephen J; Németh, Andrea H

2015-07-01

Cerebral palsy is a sporadic disorder with multiple likely aetiologies, but frequently considered to be caused by birth asphyxia. Genetic investigations are rarely performed in patients with cerebral palsy and there is little proven evidence of genetic causes. As part of a large project investigating children with ataxia, we identified four patients in our cohort with a diagnosis of ataxic cerebral palsy. They were investigated using either targeted next generation sequencing or trio-based exome sequencing and were found to have mutations in three different genes, KCNC3, ITPR1 and SPTBN2. All the mutations were de novo and associated with increased paternal age. The mutations were shown to be pathogenic using a combination of bioinformatics analysis and in vitro model systems. This work is the first to report that the ataxic subtype of cerebral palsy can be caused by de novo dominant point mutations, which explains the sporadic nature of these cases. We conclude that at least some subtypes of cerebral palsy may be caused by de novo genetic mutations and patients with a clinical diagnosis of cerebral palsy should be genetically investigated before causation is ascribed to perinatal asphyxia or other aetiologies. © The Author (2015). Published by Oxford University Press on behalf of the Guarantors of Brain.
RNAbrowse: RNA-Seq de novo assembly results browser.

PubMed

Mariette, Jérôme; Noirot, Céline; Nabihoudine, Ibounyamine; Bardou, Philippe; Hoede, Claire; Djari, Anis; Cabau, Cédric; Klopp, Christophe

2014-01-01

Transcriptome analysis based on a de novo assembly of next generation RNA sequences is now performed routinely in many laboratories. The generated results, including contig sequences, quantification figures, functional annotations and variation discovery outputs are usually bulky and quite diverse. This article presents a user oriented storage and visualisation environment permitting to explore the data in a top-down manner, going from general graphical views to all possible details. The software package is based on biomart, easy to install and populate with local data. The software package is available under the GNU General Public License (GPL) at http://bioinfo.genotoul.fr/RNAbrowse.
Rapid hybrid de novo assembly of a microbial genome using only short reads: Corynebacterium pseudotuberculosis I19 as a case study.

PubMed

Cerdeira, Louise Teixeira; Carneiro, Adriana Ribeiro; Ramos, Rommel Thiago Jucá; de Almeida, Sintia Silva; D'Afonseca, Vivian; Schneider, Maria Paula Cruz; Baumbach, Jan; Tauch, Andreas; McCulloch, John Anthony; Azevedo, Vasco Ariston Carvalho; Silva, Artur

2011-08-01

Due to the advent of the so-called Next-Generation Sequencing (NGS) technologies the amount of monetary and temporal resources for whole-genome sequencing has been reduced by several orders of magnitude. Sequence reads can be assembled either by anchoring them directly onto an available reference genome (classical reference assembly), or can be concatenated by overlap (de novo assembly). The latter strategy is preferable because it tends to maintain the architecture of the genome sequence the however, depending on the NGS platform used, the shortness of read lengths cause tremendous problems the in the subsequent genome assembly phase, impeding closing of the entire genome sequence. To address the problem, we developed a multi-pronged hybrid de novo strategy combining De Bruijn graph and Overlap-Layout-Consensus methods, which was used to assemble from short reads the entire genome of Corynebacterium pseudotuberculosis strain I19, a bacterium with immense importance in veterinary medicine that causes Caseous Lymphadenitis in ruminants, principally ovines and caprines. Briefly, contigs were assembled de novo from the short reads and were only oriented using a reference genome by anchoring. Remaining gaps were closed using iterative anchoring of short reads by craning to gap flanks. Finally, we compare the genome sequence assembled using our hybrid strategy to a classical reference assembly using the same data as input and show that with the availability of a reference genome, it pays off to use the hybrid de novo strategy, rather than a classical reference assembly, because more genome sequences are preserved using the former. Copyright © 2011 Elsevier B.V. All rights reserved.
De Novo Transcriptome Sequencing of Olea europaea L. to Identify Genes Involved in the Development of the Pollen Tube.

PubMed

Iaria, Domenico; Chiappetta, Adriana; Muzzalupo, Innocenzo

2016-01-01

In olive (Olea europaea L.), the processes controlling self-incompatibility are still unclear and the molecular basis underlying this process are still not fully characterized. In order to determine compatibility relationships, using next-generation sequencing techniques and a de novo transcriptome assembly strategy, we show that pollen tubes from different olive plants, grown in vitro in a medium containing its own pistil and in combination pollen/pistil from self-sterile and self-fertile cultivars, have a distinct gene expression profile and many of the differentially expressed sequences between the samples fall within gene families involved in the development of the pollen tube, such as lipase, carboxylesterase, pectinesterase, pectin methylesterase, and callose synthase. Moreover, different genes involved in signal transduction, transcription, and growth are overrepresented. The analysis also allowed us to identify members in actin and actin depolymerization factor and fibrin gene family and member of the Ca(2+) binding gene family related to the development and polarization of pollen apical tip. The whole transcriptomic analysis, through the identification of the differentially expressed transcripts set and an extended functional annotation analysis, will lead to a better understanding of the mechanisms of pollen germination and pollen tube growth in the olive.
Next generation sequence assembly with AMOS.

PubMed

Treangen, Todd J; Sommer, Dan D; Angly, Florent E; Koren, Sergey; Pop, Mihai

2011-03-01

A Modular Open-Source Assembler (AMOS) was designed to offer a modular approach to genome assembly. AMOS includes a wide range of tools for assembly, including the lightweight de novo assemblers Minimus and Minimo, and Bambus 2, a robust scaffolder able to handle metagenomic and polymorphic data. This protocol describes how to configure and use AMOS for the assembly of Next Generation sequence data. Additionally, we provide three tutorial examples that include bacterial, viral, and metagenomic datasets with specific tips for improving assembly quality. © 2011 by John Wiley & Sons, Inc.
Optimization of De Novo Short Read Assembly of Seabuckthorn (Hippophae rhamnoides L.) Transcriptome

PubMed Central

Ghangal, Rajesh; Chaudhary, Saurabh; Jain, Mukesh; Purty, Ram Singh; Chand Sharma, Prakash

2013-01-01

Seabuckthorn ( Hippophae rhamnoides L.) is known for its medicinal, nutritional and environmental importance since ancient times. However, very limited efforts have been made to characterize the genome and transcriptome of this wonder plant. Here, we report the use of next generation massive parallel sequencing technology (Illumina platform) and de novo assembly to gain a comprehensive view of the seabuckthorn transcriptome. We assembled 86,253,874 high quality short reads using six assembly tools. At our hand, assembly of non-redundant short reads following a two-step procedure was found to be the best considering various assembly quality parameters. Initially, ABySS tool was used following an additive k-mer approach. The assembled transcripts were subsequently subjected to TGICL suite. Finally, de novo short read assembly yielded 88,297 transcripts (> 100 bp), representing about 53 Mb of seabuckthorn transcriptome. The average length of transcripts was 610 bp, N50 length 1198 BP and 91% of the short reads uniquely mapped back to seabuckthorn transcriptome. A total of 41,340 (46.8%) transcripts showed significant similarity with sequences present in nr protein databases of NCBI (E-value < 1E-06). We also screened the assembled transcripts for the presence of transcription factors and simple sequence repeats. Our strategy involving the use of short read assembler (ABySS) followed by TGICL will be useful for the researchers working with a non-model organism’s transcriptome in terms of saving time and reducing complexity in data management. The seabuckthorn transcriptome data generated here provide a valuable resource for gene discovery and development of functional molecular markers. PMID:23991119
De novo transcriptome assembly databases for the butterfly orchid Phalaenopsis equestris

PubMed Central

Niu, Shan-Ce; Xu, Qing; Zhang, Guo-Qiang; Zhang, Yong-Qiang; Tsai, Wen-Chieh; Hsu, Jui-Ling; Liang, Chieh-Kai; Luo, Yi-Bo; Liu, Zhong-Jian

2016-01-01

Orchids are renowned for their spectacular flowers and ecological adaptations. After the sequencing of the genome of the tropical epiphytic orchid Phalaenopsis equestris, we combined Illumina HiSeq2000 for RNA-Seq and Trinity for de novo assembly to characterize the transcriptomes for 11 diverse P. equestris tissues representing the root, stem, leaf, flower buds, column, lip, petal, sepal and three developmental stages of seeds. Our aims were to contribute to a better understanding of the molecular mechanisms driving the analysed tissue characteristics and to enrich the available data for P. equestris. Here, we present three databases. The first dataset is the RNA-Seq raw reads, which can be used to execute new experiments with different analysis approaches. The other two datasets allow different types of searches for candidate homologues. The second dataset includes the sets of assembled unigenes and predicted coding sequences and proteins, enabling a sequence-based search. The third dataset consists of the annotation results of the aligned unigenes versus the Nonredundant (Nr) protein database, Kyoto Encyclopaedia of Genes and Genomes (KEGG) and Clusters of Orthologous Groups (COG) databases with low e-values, enabling a name-based search. PMID:27673730
De novo identification of highly diverged protein repeats by probabilistic consistency.

PubMed

Biegert, A; Söding, J

2008-03-15

An estimated 25% of all eukaryotic proteins contain repeats, which underlines the importance of duplication for evolving new protein functions. Internal repeats often correspond to structural or functional units in proteins. Methods capable of identifying diverged repeated segments or domains at the sequence level can therefore assist in predicting domain structures, inferring hypotheses about function and mechanism, and investigating the evolution of proteins from smaller fragments. We present HHrepID, a method for the de novo identification of repeats in protein sequences. It is able to detect the sequence signature of structural repeats in many proteins that have not yet been known to possess internal sequence symmetry, such as outer membrane beta-barrels. HHrepID uses HMM-HMM comparison to exploit evolutionary information in the form of multiple sequence alignments of homologs. In contrast to a previous method, the new method (1) generates a multiple alignment of repeats; (2) utilizes the transitive nature of homology through a novel merging procedure with fully probabilistic treatment of alignments; (3) improves alignment quality through an algorithm that maximizes the expected accuracy; (4) is able to identify different kinds of repeats within complex architectures by a probabilistic domain boundary detection method and (5) improves sensitivity through a new approach to assess statistical significance. Server: http://toolkit.tuebingen.mpg.de/hhrepid; Executables: ftp://ftp.tuebingen.mpg.de/pub/protevo/HHrepID
Rate of de novo mutations and the importance of father's age to disease risk.

PubMed

Kong, Augustine; Frigge, Michael L; Masson, Gisli; Besenbacher, Soren; Sulem, Patrick; Magnusson, Gisli; Gudjonsson, Sigurjon A; Sigurdsson, Asgeir; Jonasdottir, Aslaug; Jonasdottir, Adalbjorg; Wong, Wendy S W; Sigurdsson, Gunnar; Walters, G Bragi; Steinberg, Stacy; Helgason, Hannes; Thorleifsson, Gudmar; Gudbjartsson, Daniel F; Helgason, Agnar; Magnusson, Olafur Th; Thorsteinsdottir, Unnur; Stefansson, Kari

2012-08-23

Mutations generate sequence diversity and provide a substrate for selection. The rate of de novo mutations is therefore of major importance to evolution. Here we conduct a study of genome-wide mutation rates by sequencing the entire genomes of 78 Icelandic parent-offspring trios at high coverage. We show that in our samples, with an average father's age of 29.7, the average de novo mutation rate is 1.20 × 10(-8) per nucleotide per generation. Most notably, the diversity in mutation rate of single nucleotide polymorphisms is dominated by the age of the father at conception of the child. The effect is an increase of about two mutations per year. An exponential model estimates paternal mutations doubling every 16.5 years. After accounting for random Poisson variation, father's age is estimated to explain nearly all of the remaining variation in the de novo mutation counts. These observations shed light on the importance of the father's age on the risk of diseases such as schizophrenia and autism.
Simple Approach for De Novo Structural Identification of Mannose Trisaccharides

NASA Astrophysics Data System (ADS)

Hsu, Hsu Chen; Liew, Chia Yen; Huang, Shih-Pei; Tsai, Shang-Ting; Ni, Chi-Kung

2018-03-01

Oligosaccharides have diverse functions in biological systems. However, the structural determination of oligosaccharides remains difficult and has created a bottleneck in carbohydrate research. In this study, a new approach for the de novo structural determination of underivatized oligosaccharides is demonstrated. A low-energy collision-induced dissociation (CID) of sodium ion adducts was used to facilitate the cleavage of desired chemical bonds during the dissociation. The selection of fragments for the subsequent CID was guided using a procedure that we built from the understanding of the saccharide dissociation mechanism. The linkages, anomeric configurations, and branch locations of oligosaccharides were determined by comparing the CID spectra of oligosaccharide with the fragmentation patterns based on the dissociation mechanism and our specially prepared disaccharide CID spectrum database. The usefulness of this method was demonstrated to determine the structures of several mannose trisaccharides. This method can also be applied in the structural determination of oligosaccharides larger than trisaccharides and containing hexose other than mannose if authentic standards are available. [Figure not available: see fulltext.
Simple Approach for De Novo Structural Identification of Mannose Trisaccharides

NASA Astrophysics Data System (ADS)

Hsu, Hsu Chen; Liew, Chia Yen; Huang, Shih-Pei; Tsai, Shang-Ting; Ni, Chi-Kung

2017-12-01

Oligosaccharides have diverse functions in biological systems. However, the structural determination of oligosaccharides remains difficult and has created a bottleneck in carbohydrate research. In this study, a new approach for the de novo structural determination of underivatized oligosaccharides is demonstrated. A low-energy collision-induced dissociation (CID) of sodium ion adducts was used to facilitate the cleavage of desired chemical bonds during the dissociation. The selection of fragments for the subsequent CID was guided using a procedure that we built from the understanding of the saccharide dissociation mechanism. The linkages, anomeric configurations, and branch locations of oligosaccharides were determined by comparing the CID spectra of oligosaccharide with the fragmentation patterns based on the dissociation mechanism and our specially prepared disaccharide CID spectrum database. The usefulness of this method was demonstrated to determine the structures of several mannose trisaccharides. This method can also be applied in the structural determination of oligosaccharides larger than trisaccharides and containing hexose other than mannose if authentic standards are available. [Figure not available: see fulltext.
Improved annotation with de novo transcriptome assembly in four social amoeba species.

PubMed

Singh, Reema; Lawal, Hajara M; Schilde, Christina; Glöckner, Gernot; Barton, Geoffrey J; Schaap, Pauline; Cole, Christian

2017-01-31

Annotation of gene models and transcripts is a fundamental step in genome sequencing projects. Often this is performed with automated prediction pipelines, which can miss complex and atypical genes or transcripts. RNA sequencing (RNA-seq) data can aid the annotation with empirical data. Here we present de novo transcriptome assemblies generated from RNA-seq data in four Dictyostelid species: D. discoideum, P. pallidum, D. fasciculatum and D. lacteum. The assemblies were incorporated with existing gene models to determine corrections and improvement on a whole-genome scale. This is the first time this has been performed in these eukaryotic species. An initial de novo transcriptome assembly was generated by Trinity for each species and then refined with Program to Assemble Spliced Alignments (PASA). The completeness and quality were assessed with the Benchmarking Universal Single-Copy Orthologs (BUSCO) and Transrate tools at each stage of the assemblies. The final datasets of 11,315-12,849 transcripts contained 5,610-7,712 updates and corrections to >50% of existing gene models including changes to hundreds or thousands of protein products. Putative novel genes are also identified and alternative splice isoforms were observed for the first time in P. pallidum, D. lacteum and D. fasciculatum. In taking a whole transcriptome approach to genome annotation with empirical data we have been able to enrich the annotations of four existing genome sequencing projects. In doing so we have identified updates to the majority of the gene annotations across all four species under study and found putative novel genes and transcripts which could be worthy for follow-up. The new transcriptome data we present here will be a valuable resource for genome curators in the Dictyostelia and we propose this effective methodology for use in other genome annotation projects.
Substituted 1H-1,2,3-Triazol-4-yl-1H-pyrrolo[2,3-b]pyridines by De Novo One-Pot Ring Forming Coupling-Cyclization-Desilylation-CuAAC-Sequence.

PubMed

Müller, Thomas J J; Lessing, Timo; van Mark, Hauke

2018-05-04

Substituted 1H-1,2,3-triazol-4-yl-pyrrolo[2,3-b]pyridines are efficiently prepared by a one-pot coupling-cyclization-desilylation-CuAAC-sequence in the sense of a consecutive three-component fashion. The key feature of this novel de novo formation of azole and triazole anellation is the sequentially Pd/Cu-catalyzed process employing tri(iso-propyl)silylbutadiyne (TIPS-butadiyne) as a four-carbon building block. In addition, the sequence can be expanded in a four-component fashion also employing the in situ formation of the require azides. © 2018 WILEY-VCH Verlag GmbH & Co. KGaA, Weinheim.
Dynamic chromatin changes associated with de novo centromere formation in maize euchromatin.

PubMed

Su, Handong; Liu, Yalin; Liu, Yong-Xin; Lv, Zhenling; Li, Hongyao; Xie, Shaojun; Gao, Zhi; Pang, Junling; Wang, Xiu-Jie; Lai, Jinsheng; Birchler, James A; Han, Fangpu

2016-12-01

The inheritance and function of centromeres are not strictly dependent on any specific DNA sequence, but involve an epigenetic component in most species. CENH3, a centromere histone H3 variant, is one of the best-described epigenetic factors in centromere identity, but the chromatin features required during centromere formation have not yet been revealed. We previously identified two de novo centromeres on Zea mays (maize) minichromosomes derived from euchromatic sites with high-density gene distributions but low-density transposon distributions. The distribution of gene location and gene expression in these sites indicates that transcriptionally active regions can initiate de novo centromere formation, and CENH3 seeding shows a preference for gene-free regions or regions with no gene expression. The locations of the expressed genes detected were at relatively hypomethylated loci, and the altered gene expression resulted from de novo centromere formation, but not from the additional copy of the minichromosome. The initial overall DNA methylation level of the two de novo regions was at a low level, but increased substantially to that of native centromeres after centromere formation. These results illustrate the dynamic chromatin changes during euchromatin-originated de novo centromere formation, which provides insight into the mechanism of de novo centromere formation and regulation of subsequent consequences. © 2016 The Authors The Plant Journal © 2016 John Wiley & Sons Ltd.
A new approach for annotation of transposable elements using small RNA mapping

PubMed Central

El Baidouri, Moaine; Kim, Kyung Do; Abernathy, Brian; Arikit, Siwaret; Maumus, Florian; Panaud, Olivier; Meyers, Blake C.; Jackson, Scott A.

2015-01-01

Transposable elements (TEs) are mobile genomic DNA sequences found in most organisms. They so densely populate the genomes of many eukaryotic species that they are often the major constituents. With the rapid generation of many plant genome sequencing projects over the past few decades, there is an urgent need for improved TE annotation as a prerequisite for genome-wide studies. Analogous to the use of RNA-seq for gene annotation, we propose a new method for de novo TE annotation that uses as a guide 24 nt-siRNAs that are a part of TE silencing pathways. We use this new approach, called TASR (for Transposon Annotation using Small RNAs), for de novo annotation of TEs in Arabidopsis, rice and soybean and demonstrate that this strategy can be successfully applied for de novo TE annotation in plants. Executable PERL is available for download from: http://tasr-pipeline.sourceforge.net/ PMID:25813049

A Restricted Repertoire of De Novo Mutations in ITPR1 Cause Gillespie Syndrome with Evidence for Dominant-Negative Effect

PubMed Central

McEntagart, Meriel; Williamson, Kathleen A.; Rainger, Jacqueline K.; Wheeler, Ann; Seawright, Anne; De Baere, Elfride; Verdin, Hannah; Bergendahl, L. Therese; Quigley, Alan; Rainger, Joe; Dixit, Abhijit; Sarkar, Ajoy; López Laso, Eduardo; Sanchez-Carpintero, Rocio; Barrio, Jesus; Bitoun, Pierre; Prescott, Trine; Riise, Ruth; McKee, Shane; Cook, Jackie; McKie, Lisa; Ceulemans, Berten; Meire, Françoise; Temple, I. Karen; Prieur, Fabienne; Williams, Jonathan; Clouston, Penny; Németh, Andrea H.; Banka, Siddharth; Bengani, Hemant; Handley, Mark; Freyer, Elisabeth; Ross, Allyson; van Heyningen, Veronica; Marsh, Joseph A.; Elmslie, Frances; FitzPatrick, David R.

2016-01-01

Gillespie syndrome (GS) is characterized by bilateral iris hypoplasia, congenital hypotonia, non-progressive ataxia, and progressive cerebellar atrophy. Trio-based exome sequencing identified de novo mutations in ITPR1 in three unrelated individuals with GS recruited to the Deciphering Developmental Disorders study. Whole-exome or targeted sequence analysis identified plausible disease-causing ITPR1 mutations in 10/10 additional GS-affected individuals. These ultra-rare protein-altering variants affected only three residues in ITPR1: Glu2094 missense (one de novo, one co-segregating), Gly2539 missense (five de novo, one inheritance uncertain), and Lys2596 in-frame deletion (four de novo). No clinical or radiological differences were evident between individuals with different mutations. ITPR1 encodes an inositol 1,4,5-triphosphate-responsive calcium channel. The homo-tetrameric structure has been solved by cryoelectron microscopy. Using estimations of the degree of structural change induced by known recessive- and dominant-negative mutations in other disease-associated multimeric channels, we developed a generalizable computational approach to indicate the likely mutational mechanism. This analysis supports a dominant-negative mechanism for GS variants in ITPR1. In GS-derived lymphoblastoid cell lines (LCLs), the proportion of ITPR1-positive cells using immunofluorescence was significantly higher in mutant than control LCLs, consistent with an abnormality of nuclear calcium signaling feedback control. Super-resolution imaging supports the existence of an ITPR1-lined nucleoplasmic reticulum. Mice with Itpr1 heterozygous null mutations showed no major iris defects. Purkinje cells of the cerebellum appear to be the most sensitive to impaired ITPR1 function in humans. Iris hypoplasia is likely to result from either complete loss of ITPR1 activity or structure-specific disruption of multimeric interactions. PMID:27108798
A De-Novo Genome Analysis Pipeline (DeNoGAP) for large-scale comparative prokaryotic genomics studies.

PubMed

Thakur, Shalabh; Guttman, David S

2016-06-30

Comparative analysis of whole genome sequence data from closely related prokaryotic species or strains is becoming an increasingly important and accessible approach for addressing both fundamental and applied biological questions. While there are number of excellent tools developed for performing this task, most scale poorly when faced with hundreds of genome sequences, and many require extensive manual curation. We have developed a de-novo genome analysis pipeline (DeNoGAP) for the automated, iterative and high-throughput analysis of data from comparative genomics projects involving hundreds of whole genome sequences. The pipeline is designed to perform reference-assisted and de novo gene prediction, homolog protein family assignment, ortholog prediction, functional annotation, and pan-genome analysis using a range of proven tools and databases. While most existing methods scale quadratically with the number of genomes since they rely on pairwise comparisons among predicted protein sequences, DeNoGAP scales linearly since the homology assignment is based on iteratively refined hidden Markov models. This iterative clustering strategy enables DeNoGAP to handle a very large number of genomes using minimal computational resources. Moreover, the modular structure of the pipeline permits easy updates as new analysis programs become available. DeNoGAP integrates bioinformatics tools and databases for comparative analysis of a large number of genomes. The pipeline offers tools and algorithms for annotation and analysis of completed and draft genome sequences. The pipeline is developed using Perl, BioPerl and SQLite on Ubuntu Linux version 12.04 LTS. Currently, the software package accompanies script for automated installation of necessary external programs on Ubuntu Linux; however, the pipeline should be also compatible with other Linux and Unix systems after necessary external programs are installed. DeNoGAP is freely available at https://sourceforge.net/projects/denogap/ .
SCOPE: a web server for practical de novo motif discovery.

PubMed

Carlson, Jonathan M; Chakravarty, Arijit; DeZiel, Charles E; Gross, Robert H

2007-07-01

SCOPE is a novel parameter-free method for the de novo identification of potential regulatory motifs in sets of coordinately regulated genes. The SCOPE algorithm combines the output of three component algorithms, each designed to identify a particular class of motifs. Using an ensemble learning approach, SCOPE identifies the best candidate motifs from its component algorithms. In tests on experimentally determined datasets, SCOPE identified motifs with a significantly higher level of accuracy than a number of other web-based motif finders run with their default parameters. Because SCOPE has no adjustable parameters, the web server has an intuitive interface, requiring only a set of gene names or FASTA sequences and a choice of species. The most significant motifs found by SCOPE are displayed graphically on the main results page with a table containing summary statistics for each motif. Detailed motif information, including the sequence logo, PWM, consensus sequence and specific matching sites can be viewed through a single click on a motif. SCOPE's efficient, parameter-free search strategy has enabled the development of a web server that is readily accessible to the practising biologist while providing results that compare favorably with those of other motif finders. The SCOPE web server is at .
De novo transcriptome sequencing and discovery of genes related to copper tolerance in Paeonia ostii.

PubMed

Wang, Yanjie; Dong, Chunlan; Xue, Zeyun; Jin, Qijiang; Xu, Yingchun

2016-01-15

Paeonia ostii, an important ornamental and medicinal plant, grows normally on copper (Cu) mines with widespread Cu contamination of soils, and it has the ability to lower Cu contents in the Cu-contaminated soils. However, very little molecular information concerned with Cu resistance of P. ostii is available. In this study, high-throughput de novo transcriptome sequencing was carried out for P. ostii with and without Cu treatment using Illumina HiSeq 2000 platform. A total of 77,704 All-unigenes were obtained with a mean length of 710 bp. Of these unigenes, 47,461 were annotated with public databases based on sequence similarities. Comparative transcript profiling allowed the discovery of 4324 differentially expressed genes (DEGs), with 2207 up-regulated and 2117 down-regulated unigenes in Cu-treated library as compared to the control counterpart. Based on these DEGs, Gene Ontology (GO) enrichment analysis indicated Cu stress-relevant terms, such as 'membrane' and 'antioxidant activity'. Meanwhile, Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway enrichment analysis uncovered some important pathways, including 'biosynthesis of secondary metabolites' and 'metabolic pathways'. In addition, expression patterns of 12 selected DEGs derived from quantitative real-time polymerase chain reaction (qRT-PCR) were consistent with their transcript abundance changes obtained by transcriptomic analyses, suggesting that all the 12 genes were authentically involved in Cu tolerance in P. ostii. This is the first report to identify genes related to Cu stress responses in P. ostii, which could offer valuable information on the molecular mechanisms of Cu resistance, and provide a basis for further genomics research on this and related ornamental species for phytoremediation. Copyright © 2015 Elsevier B.V. All rights reserved.
Sequencing and de novo assembly of visceral mass transcriptome of the critically endangered land snail Satsuma myomphala: Annotation and SSR discovery.

PubMed

Kang, Se Won; Patnaik, Bharat Bhusan; Hwang, Hee-Ju; Park, So Young; Chung, Jong Min; Song, Dae Kwon; Patnaik, Hongray Howrelia; Lee, Jae Bong; Kim, Changmu; Kim, Soonok; Park, Hong Seog; Park, Seung-Hwan; Park, Young-Su; Han, Yeon Soo; Lee, Jun Sang; Lee, Yong Seok

2017-03-01

Satsuma myomphala is critically endangered through loss of natural habitats, predation by natural enemies, and indiscriminate collection. It is a protected species in Korea but lacks genomic resources for an understanding of varied functional processes attributable to evolutionary success under natural habitats. For assessing the genetic information of S. myomphala, we performed for the first time, de novo transcriptome sequencing and functional annotation of expressed sequences using Illumina Next-Generation Sequencing (NGS) platform and bioinformatics analysis. We identified 103,774 unigenes of which 37,959, 12,890, and 17,699 were annotated in the PANM (Protostome DB), Unigene, and COG (Clusters of Orthologous Groups) databases, respectively. In addition, 14,451 unigenes were predicted under Gene Ontology functional categories, with 4581 assigned to a single category. Furthermore, 3369 sequences with 646 having Enzyme Commission (EC) numbers were mapped to 122 pathways in the Kyoto Encyclopedia of Genes and Genomes Pathway database. The prominent protein domains included the Zinc finger (C2H2-like), Reverse Transcriptase, Thioredoxin-like fold, and RNA recognition motif domain. Many unigenes with homology to immunity, defense, and reproduction-related genes were screened in the transcriptome. We also detected 3120 putative simple sequence repeats (SSRs) encompassing dinucleotide to hexanucleotide repeat motifs from >1kb unigene sequences. A list of PCR primers of SSR loci have been identified to study the genetic polymorphisms. The transcriptome data represents a valuable resource for further investigations on the species genome structure and biology. The unigenes information and microsatellites would provide an indispensable tool for conservation of the species in natural and adaptive environments. Copyright © 2016 The Authors. Published by Elsevier Inc. All rights reserved.
Resolving the Complexity of Human Skin Metagenomes Using Single-Molecule Sequencing

PubMed Central

Tsai, Yu-Chih; Deming, Clayton; Segre, Julia A.; Kong, Heidi H.; Korlach, Jonas

2016-01-01

ABSTRACT Deep metagenomic shotgun sequencing has emerged as a powerful tool to interrogate composition and function of complex microbial communities. Computational approaches to assemble genome fragments have been demonstrated to be an effective tool for de novo reconstruction of genomes from these communities. However, the resultant “genomes” are typically fragmented and incomplete due to the limited ability of short-read sequence data to assemble complex or low-coverage regions. Here, we use single-molecule, real-time (SMRT) sequencing to reconstruct a high-quality, closed genome of a previously uncharacterized Corynebacterium simulans and its companion bacteriophage from a skin metagenomic sample. Considerable improvement in assembly quality occurs in hybrid approaches incorporating short-read data, with even relatively small amounts of long-read data being sufficient to improve metagenome reconstruction. Using short-read data to evaluate strain variation of this C. simulans in its skin community at single-nucleotide resolution, we observed a dominant C. simulans strain with moderate allelic heterozygosity throughout the population. We demonstrate the utility of SMRT sequencing and hybrid approaches in metagenome quantitation, reconstruction, and annotation. PMID:26861018
RNAbrowse: RNA-Seq De Novo Assembly Results Browser

PubMed Central

Mariette, Jérôme; Noirot, Céline; Nabihoudine, Ibounyamine; Bardou, Philippe; Hoede, Claire; Djari, Anis; Cabau, Cédric; Klopp, Christophe

2014-01-01

Transcriptome analysis based on a de novo assembly of next generation RNA sequences is now performed routinely in many laboratories. The generated results, including contig sequences, quantification figures, functional annotations and variation discovery outputs are usually bulky and quite diverse. This article presents a user oriented storage and visualisation environment permitting to explore the data in a top-down manner, going from general graphical views to all possible details. The software package is based on biomart, easy to install and populate with local data. The software package is available under the GNU General Public License (GPL) at http://bioinfo.genotoul.fr/RNAbrowse. PMID:24823498
MANGO: a new approach to multiple sequence alignment.

PubMed

Zhang, Zefeng; Lin, Hao; Li, Ming

2007-01-01

Multiple sequence alignment is a classical and challenging task for biological sequence analysis. The problem is NP-hard. The full dynamic programming takes too much time. The progressive alignment heuristics adopted by most state of the art multiple sequence alignment programs suffer from the 'once a gap, always a gap' phenomenon. Is there a radically new way to do multiple sequence alignment? This paper introduces a novel and orthogonal multiple sequence alignment method, using multiple optimized spaced seeds and new algorithms to handle these seeds efficiently. Our new algorithm processes information of all sequences as a whole, avoiding problems caused by the popular progressive approaches. Because the optimized spaced seeds are provably significantly more sensitive than the consecutive k-mers, the new approach promises to be more accurate and reliable. To validate our new approach, we have implemented MANGO: Multiple Alignment with N Gapped Oligos. Experiments were carried out on large 16S RNA benchmarks showing that MANGO compares favorably, in both accuracy and speed, against state-of-art multiple sequence alignment methods, including ClustalW 1.83, MUSCLE 3.6, MAFFT 5.861, Prob-ConsRNA 1.11, Dialign 2.2.1, DIALIGN-T 0.2.1, T-Coffee 4.85, POA 2.0 and Kalign 2.0.
Neutrality and evolvability of designed protein sequences

NASA Astrophysics Data System (ADS)

Bhattacherjee, Arnab; Biswas, Parbati

2010-07-01

The effect of foldability on protein’s evolvability is analyzed by a two-prong approach consisting of a self-consistent mean-field theory and Monte Carlo simulations. Theory and simulation models representing protein sequences with binary patterning of amino acid residues compatible with a particular foldability criteria are used. This generalized foldability criterion is derived using the high temperature cumulant expansion approximating the free energy of folding. The effect of cumulative point mutations on these designed proteins is studied under neutral condition. The robustness, protein’s ability to tolerate random point mutations is determined with a selective pressure of stability (ΔΔG) for the theory designed sequences, which are found to be more robust than that of Monte Carlo and mean-field-biased Monte Carlo generated sequences. The results show that this foldability criterion selects viable protein sequences more effectively compared to the Monte Carlo method, which has a marked effect on how the selective pressure shapes the evolutionary sequence space. These observations may impact de novo sequence design and its applications in protein engineering.
Generative Recurrent Networks for De Novo Drug Design.

PubMed

Gupta, Anvita; Müller, Alex T; Huisman, Berend J H; Fuchs, Jens A; Schneider, Petra; Schneider, Gisbert

2018-01-01

Generative artificial intelligence models present a fresh approach to chemogenomics and de novo drug design, as they provide researchers with the ability to narrow down their search of the chemical space and focus on regions of interest. We present a method for molecular de novo design that utilizes generative recurrent neural networks (RNN) containing long short-term memory (LSTM) cells. This computational model captured the syntax of molecular representation in terms of SMILES strings with close to perfect accuracy. The learned pattern probabilities can be used for de novo SMILES generation. This molecular design concept eliminates the need for virtual compound library enumeration. By employing transfer learning, we fine-tuned the RNN's predictions for specific molecular targets. This approach enables virtual compound design without requiring secondary or external activity prediction, which could introduce error or unwanted bias. The results obtained advocate this generative RNN-LSTM system for high-impact use cases, such as low-data drug discovery, fragment based molecular design, and hit-to-lead optimization for diverse drug targets. © 2017 The Authors. Published by Wiley-VCH Verlag GmbH & Co. KGaA.
Second generation noninvasive fetal genome analysis reveals de novo mutations, single-base parental inheritance, and preferred DNA ends

PubMed Central

Chan, K. C. Allen; Jiang, Peiyong; Sun, Kun; Cheng, Yvonne K. Y.; Tong, Yu K.; Cheng, Suk Hang; Wong, Ada I. C.; Hudecova, Irena; Leung, Tak Y.; Chiu, Rossa W. K.; Lo, Yuk Ming Dennis

2016-01-01

Plasma DNA obtained from a pregnant woman was sequenced to a depth of 270× haploid genome coverage. Comparing the maternal plasma DNA sequencing data with the parental genomic DNA data and using a series of bioinformatics filters, fetal de novo mutations were detected at a sensitivity of 85% and a positive predictive value of 74%. These results represent a 169-fold improvement in the positive predictive value over previous attempts. Improvements in the interpretation of the sequence information of every base position in the genome allowed us to interrogate the maternal inheritance of the fetus for 618,271 of 656,676 (94.2%) heterozygous SNPs within the maternal genome. The fetal genotype at each of these sites was deduced individually, unlike previously, where the inheritance was determined for a collection of sites within a haplotype. These results represent a 90-fold enhancement in the resolution in determining the fetus’s maternal inheritance. Selected genomic locations were more likely to be found at the ends of plasma DNA molecules. We found that a subset of such preferred ends exhibited selectivity for fetal- or maternal-derived DNA in maternal plasma. The ratio of the number of maternal plasma DNA molecules with fetal preferred ends to those with maternal preferred ends showed a correlation with the fetal DNA fraction. Finally, this second generation approach for noninvasive fetal whole-genome analysis was validated in a pregnancy diagnosed with cardiofaciocutaneous syndrome with maternal plasma DNA sequenced to 195× coverage. The causative de novo BRAF mutation was successfully detected through the maternal plasma DNA analysis. PMID:27799561
De novo design of a four-fold symmetric TIM-barrel protein with atomic-level accuracy

PubMed Central

Parmeggiani, Fabio; Velasco, D. Alejandro Fernandez; Höcker, Birte; Baker, David

2015-01-01

Despite efforts for over 25 years, de novo protein design has not succeeded in achieving the TIM-barrel fold. Here we describe the computational design of 4-fold symmetrical (β/α)8-barrels guided by geometrical and chemical principles. Experimental characterization of 33 designs revealed the importance of sidechain-backbone hydrogen bonding for defining the strand register between repeat units. The X-ray crystal structure of a designed thermostable 184-residue protein is nearly identical with the designed TIM-barrel model. PSI-BLAST searches do not identify sequence similarities to known TIM-barrel proteins, and sensitive profile-profile searches indicate that the design sequence is distant from other naturally occurring TIM-barrel superfamilies, suggesting that Nature has only sampled a subset of the sequence space available to the TIM-barrel fold. The ability to de novo design TIM-barrels opens new possibilities for custom-made enzymes. PMID:26595462
De novo transcriptome analysis of an imminent biofuel crop, Camelina sativa L. using Illumina GAIIX sequencing platform and identification of SSR markers.

PubMed

Mudalkar, Shalini; Golla, Ramesh; Ghatty, Sreenivas; Reddy, Attipalli Ramachandra

2014-01-01

Camelina sativa L. is an emerging biofuel crop with potential applications in industry, medicine, cosmetics and human nutrition. The crop is unexploited owing to very limited availability of transcriptome and genomic data. In order to analyse the various metabolic pathways, we performed de novo assembly of the transcriptome on Illumina GAIIX platform with paired end sequencing for obtaining short reads. The sequencing output generated a FastQ file size of 2.97 GB with 10.83 million reads having a maximum read length of 101 nucleotides. The number of contigs generated was 53,854 with maximum and minimum lengths of 10,086 and 200 nucleotides respectively. These trancripts were annotated using BLAST search against the Aracyc, Swiss-Prot, TrEMBL, gene ontology and clusters of orthologous groups (KOG) databases. The genes involved in lipid metabolism were studied and the transcription factors were identified. Sequence similarity studies of Camelina with the other related organisms indicated the close relatedness of Camelina with Arabidopsis. In addition, bioinformatics analysis revealed the presence of a total of 19,379 simple sequence repeats. This is the first report on Camelina sativa L., where the transcriptome of the entire plant, including seedlings, seed, root, leaves and stem was done. Our data established an excellent resource for gene discovery and provide useful information for functional and comparative genomic studies in this promising biofuel crop.
Novel proline-hydroxyproline glycopeptides from the dandelion (Taraxacum officinale Wigg.) flowers: de novo sequencing and biological activity.

PubMed

Astafieva, Alexandra A; Enyenihi, Atim A; Rogozhin, Eugene A; Kozlov, Sergey A; Grishin, Eugene V; Odintsova, Tatyana I; Zubarev, Roman A; Egorov, Tsezi A

2015-09-01

Two novel homologous peptides named ToHyp1 and ToHyp2 that show no similarity to any known proteins were isolated from Taraxacum officinale Wigg. flowers by multidimensional liquid chromatography. Amino acid and mass spectrometry analyses demonstrated that the peptides have unusual structure: they are cysteine-free, proline-hydroxyproline-rich and post-translationally glycosylated by pentoses, with 5 carbohydrates in ToHyp2 and 10 in ToHyp1. The ToHyp2 peptide with a monoisotopic molecular mass of 4350.3Da was completely sequenced by a combination of Edman degradation and de novo sequencing via top down multistage collision induced dissociation (CID) and higher energy dissociation (HCD) tandem mass spectrometry (MS(n)). ToHyp2 consists of 35 amino acids, contains eighteen proline residues, of which 8 prolines are hydroxylated. The peptide displays antifungal activity and inhibits growth of Gram-positive and Gram-negative bacteria. We further showed that carbohydrate moieties have no significant impact on the peptide structure, but are important for antifungal activity although not absolutely necessary. The deglycosylated ToHyp2 peptide was less active against the susceptible fungus Bipolaris sorokiniana than the native peptide. Unique structural features of the ToHyp2 peptide place it into a new family of plant defense peptides. The discovery of ToHyp peptides in T. officinale flowers expands the repertoire of molecules of plant origin with practical applications. Copyright © 2015 Elsevier Ireland Ltd. All rights reserved.
De Novo Assembly of Auricularia polytricha Transcriptome Using Illumina Sequencing for Gene Discovery and SSR Marker Identification

PubMed Central

Zhou, Yan; Chen, Lianfu; Fan, Xiuzhi; Bian, Yinbing

2014-01-01

Auricularia polytricha (Mont.) Sacc., a type of edible black-brown mushroom with a gelatinous and modality-specific fruiting body, is in high demand in Asia due to its nutritional and medicinal properties. Illumina Solexa sequenceing technology was used to generate very large transcript sequences from the mycelium and the mature fruiting body of A. polytricha for gene discovery and molecular marker development. De novo assembly generated 36,483 ESTs with an N50 length of 636 bp. A total of 28,108 ESTs demonstrated significant hits with known proteins in the nr database, and 94.03% of the annotated ESTs showed the greatest similarity to A. delicata, a related species of A. polytricha. Functional categorization of the Gene Ontology (GO), Clusters of Orthologous Groups (COG) and Kyoto Encyclopedia of Genes and Genomes (KEGG) metabolic pathways revealed the conservation of genes involved in various biological processes in A. polytricha. Gene expression profile analysis indicated that a total of 2,057 ESTs were differentially expressed, including 1,020 ESTs that were up-regulated in the mycelium and 1,037 up-regulated in the fruiting body. Functional enrichment showed that the ESTs associated with biosynthesis, metabolism and assembly of proteins were more active in fruiting body development. The expression patterns of homologous transcription factors indicated that the molecular mechanisms of fruiting body formation and development were not exactly the same as for other agarics. Interestingly, an EST encoding tyrosinase was significantly up-regulated in the fruiting body, indicating that melanins accumulated during the processes of the formation of the black-brown color of the fruiting body in A. polytricha development. In addition, a total of 1,715 potential SSRs were detected in this transcriptome. The transcriptome analysis of A. polytricha provides valuable sequence resources and numerous molecular markers to facilitate further functional genomics studies and
WebPrInSeS: automated full-length clone sequence identification and verification using high-throughput sequencing data.

PubMed

Massouras, Andreas; Decouttere, Frederik; Hens, Korneel; Deplancke, Bart

2010-07-01

High-throughput sequencing (HTS) is revolutionizing our ability to obtain cheap, fast and reliable sequence information. Many experimental approaches are expected to benefit from the incorporation of such sequencing features in their pipeline. Consequently, software tools that facilitate such an incorporation should be of great interest. In this context, we developed WebPrInSeS, a web server tool allowing automated full-length clone sequence identification and verification using HTS data. WebPrInSeS encompasses two separate software applications. The first is WebPrInSeS-C which performs automated sequence verification of user-defined open-reading frame (ORF) clone libraries. The second is WebPrInSeS-E, which identifies positive hits in cDNA or ORF-based library screening experiments such as yeast one- or two-hybrid assays. Both tools perform de novo assembly using HTS data from any of the three major sequencing platforms. Thus, WebPrInSeS provides a highly integrated, cost-effective and efficient way to sequence-verify or identify clones of interest. WebPrInSeS is available at http://webprinses.epfl.ch/ and is open to all users.
WebPrInSeS: automated full-length clone sequence identification and verification using high-throughput sequencing data

PubMed Central

Massouras, Andreas; Decouttere, Frederik; Hens, Korneel; Deplancke, Bart

2010-01-01

High-throughput sequencing (HTS) is revolutionizing our ability to obtain cheap, fast and reliable sequence information. Many experimental approaches are expected to benefit from the incorporation of such sequencing features in their pipeline. Consequently, software tools that facilitate such an incorporation should be of great interest. In this context, we developed WebPrInSeS, a web server tool allowing automated full-length clone sequence identification and verification using HTS data. WebPrInSeS encompasses two separate software applications. The first is WebPrInSeS-C which performs automated sequence verification of user-defined open-reading frame (ORF) clone libraries. The second is WebPrInSeS-E, which identifies positive hits in cDNA or ORF-based library screening experiments such as yeast one- or two-hybrid assays. Both tools perform de novo assembly using HTS data from any of the three major sequencing platforms. Thus, WebPrInSeS provides a highly integrated, cost-effective and efficient way to sequence-verify or identify clones of interest. WebPrInSeS is available at http://webprinses.epfl.ch/ and is open to all users. PMID:20501601
LTRsift: a graphical user interface for semi-automatic classification and postprocessing of de novo detected LTR retrotransposons

PubMed Central

2012-01-01

Background Long terminal repeat (LTR) retrotransposons are a class of eukaryotic mobile elements characterized by a distinctive sequence similarity-based structure. Hence they are well suited for computational identification. Current software allows for a comprehensive genome-wide de novo detection of such elements. The obvious next step is the classification of newly detected candidates resulting in (super-)families. Such a de novo classification approach based on sequence-based clustering of transposon features has been proposed before, resulting in a preliminary assignment of candidates to families as a basis for subsequent manual refinement. However, such a classification workflow is typically split across a heterogeneous set of glue scripts and generic software (for example, spreadsheets), making it tedious for a human expert to inspect, curate and export the putative families produced by the workflow. Results We have developed LTRsift, an interactive graphical software tool for semi-automatic postprocessing of de novo predicted LTR retrotransposon annotations. Its user-friendly interface offers customizable filtering and classification functionality, displaying the putative candidate groups, their members and their internal structure in a hierarchical fashion. To ease manual work, it also supports graphical user interface-driven reassignment, splitting and further annotation of candidates. Export of grouped candidate sets in standard formats is possible. In two case studies, we demonstrate how LTRsift can be employed in the context of a genome-wide LTR retrotransposon survey effort. Conclusions LTRsift is a useful and convenient tool for semi-automated classification of newly detected LTR retrotransposons based on their internal features. Its efficient implementation allows for convenient and seamless filtering and classification in an integrated environment. Developed for life scientists, it is helpful in postprocessing and refining the output of software
Sporadic autism exomes reveal a highly interconnected protein network of de novo mutations.

PubMed

O'Roak, Brian J; Vives, Laura; Girirajan, Santhosh; Karakoc, Emre; Krumm, Niklas; Coe, Bradley P; Levy, Roie; Ko, Arthur; Lee, Choli; Smith, Joshua D; Turner, Emily H; Stanaway, Ian B; Vernot, Benjamin; Malig, Maika; Baker, Carl; Reilly, Beau; Akey, Joshua M; Borenstein, Elhanan; Rieder, Mark J; Nickerson, Deborah A; Bernier, Raphael; Shendure, Jay; Eichler, Evan E

2012-04-04

It is well established that autism spectrum disorders (ASD) have a strong genetic component; however, for at least 70% of cases, the underlying genetic cause is unknown. Under the hypothesis that de novo mutations underlie a substantial fraction of the risk for developing ASD in families with no previous history of ASD or related phenotypes--so-called sporadic or simplex families--we sequenced all coding regions of the genome (the exome) for parent-child trios exhibiting sporadic ASD, including 189 new trios and 20 that were previously reported. Additionally, we also sequenced the exomes of 50 unaffected siblings corresponding to these new (n = 31) and previously reported trios (n = 19), for a total of 677 individual exomes from 209 families. Here we show that de novo point mutations are overwhelmingly paternal in origin (4:1 bias) and positively correlated with paternal age, consistent with the modest increased risk for children of older fathers to develop ASD. Moreover, 39% (49 of 126) of the most severe or disruptive de novo mutations map to a highly interconnected β-catenin/chromatin remodelling protein network ranked significantly for autism candidate genes. In proband exomes, recurrent protein-altering mutations were observed in two genes: CHD8 and NTNG1. Mutation screening of six candidate genes in 1,703 ASD probands identified additional de novo, protein-altering mutations in GRIN2B, LAMC3 and SCN1A. Combined with copy number variant (CNV) data, these results indicate extreme locus heterogeneity but also provide a target for future discovery, diagnostics and therapeutics.
Genomic treasure troves: complete genome sequencing of herbarium and insect museum specimens.

PubMed

Staats, Martijn; Erkens, Roy H J; van de Vossenberg, Bart; Wieringa, Jan J; Kraaijeveld, Ken; Stielow, Benjamin; Geml, József; Richardson, James E; Bakker, Freek T

2013-01-01

Unlocking the vast genomic diversity stored in natural history collections would create unprecedented opportunities for genome-scale evolutionary, phylogenetic, domestication and population genomic studies. Many researchers have been discouraged from using historical specimens in molecular studies because of both generally limited success of DNA extraction and the challenges associated with PCR-amplifying highly degraded DNA. In today's next-generation sequencing (NGS) world, opportunities and prospects for historical DNA have changed dramatically, as most NGS methods are actually designed for taking short fragmented DNA molecules as templates. Here we show that using a standard multiplex and paired-end Illumina sequencing approach, genome-scale sequence data can be generated reliably from dry-preserved plant, fungal and insect specimens collected up to 115 years ago, and with minimal destructive sampling. Using a reference-based assembly approach, we were able to produce the entire nuclear genome of a 43-year-old Arabidopsis thaliana (Brassicaceae) herbarium specimen with high and uniform sequence coverage. Nuclear genome sequences of three fungal specimens of 22-82 years of age (Agaricus bisporus, Laccaria bicolor, Pleurotus ostreatus) were generated with 81.4-97.9% exome coverage. Complete organellar genome sequences were assembled for all specimens. Using de novo assembly we retrieved between 16.2-71.0% of coding sequence regions, and hence remain somewhat cautious about prospects for de novo genome assembly from historical specimens. Non-target sequence contaminations were observed in 2 of our insect museum specimens. We anticipate that future museum genomics projects will perhaps not generate entire genome sequences in all cases (our specimens contained relatively small and low-complexity genomes), but at least generating vital comparative genomic data for testing (phylo)genetic, demographic and genetic hypotheses, that become increasingly more horizontal

The chaperonin-60 universal target is a barcode for bacteria that enables de novo assembly of metagenomic sequence data.

PubMed

Links, Matthew G; Dumonceaux, Tim J; Hemmingsen, Sean M; Hill, Janet E

2012-01-01

Barcoding with molecular sequences is widely used to catalogue eukaryotic biodiversity. Studies investigating the community dynamics of microbes have relied heavily on gene-centric metagenomic profiling using two genes (16S rRNA and cpn60) to identify and track Bacteria. While there have been criteria formalized for barcoding of eukaryotes, these criteria have not been used to evaluate gene targets for other domains of life. Using the framework of the International Barcode of Life we evaluated DNA barcodes for Bacteria. Candidates from the 16S rRNA gene and the protein coding cpn60 gene were evaluated. Within complete bacterial genomes in the public domain representing 983 species from 21 phyla, the largest difference between median pairwise inter- and intra-specific distances ("barcode gap") was found from cpn60. Distribution of sequence diversity along the ∼555 bp cpn60 target region was remarkably uniform. The barcode gap of the cpn60 universal target facilitated the faithful de novo assembly of full-length operational taxonomic units from pyrosequencing data from a synthetic microbial community. Analysis supported the recognition of both 16S rRNA and cpn60 as DNA barcodes for Bacteria. The cpn60 universal target was found to have a much larger barcode gap than 16S rRNA suggesting cpn60 as a preferred barcode for Bacteria. A large barcode gap for cpn60 provided a robust target for species-level characterization of data. The assembly of consensus sequences for barcodes was shown to be a reliable method for the identification and tracking of novel microbes in metagenomic studies.
Exome Sequencing Fails to Identify the Genetic Cause of Aicardi Syndrome.

PubMed

Lund, Caroline; Striano, Pasquale; Sorte, Hanne Sørmo; Parisi, Pasquale; Iacomino, Michele; Sheng, Ying; Vigeland, Magnus D; Øye, Anne-Marte; Møller, Rikke Steensbjerre; Selmer, Kaja K; Zara, Federico

2016-09-01

Aicardi syndrome (AS) is a well-characterized neurodevelopmental disorder with an unknown etiology. In this study, we performed whole-exome sequencing in 11 female patients with the diagnosis of AS, in order to identify the disease-causing gene. In particular, we focused on detecting variants in the X chromosome, including the analysis of variants with a low number of sequencing reads, in case of somatic mosaicism. For 2 of the patients, we also sequenced the exome of the parents to search for de novo mutations. We did not identify any genetic variants likely to be damaging. Only one single missense variant was identified by the de novo analyses of the 2 trios, and this was considered benign. The failure to identify a disease gene in this study may be due to technical limitations of our study design, including the possibility that the genetic aberration leading to AS is situated in a non-exonic region or that the mutation is somatic and not detectable by our approach. Alternatively, it is possible that AS is genetically heterogeneous and that 11 patients are not sufficient to reveal the causative genes. Future studies of AS should consider designs where also non-exonic regions are explored and apply a sequencing depth so that also low-grade somatic mosaicism can be detected.
De novo centromere formation on a chromosome fragment in maize.

PubMed

Fu, Shulan; Lv, Zhenling; Gao, Zhi; Wu, Huajun; Pang, Junling; Zhang, Bing; Dong, Qianhua; Guo, Xiang; Wang, Xiu-Jie; Birchler, James A; Han, Fangpu

2013-04-09

The centromere is the part of the chromosome that organizes the kinetochore, which mediates chromosome movement during mitosis and meiosis. A small fragment from chromosome 3, named Duplication 3a (Dp3a), was described from UV-irradiated materials by Stadler and Roman in the 1940s [Stadler LJ, Roman H (1948) Genetics 33(3):273-303]. The genetic behavior of Dp3a is reminiscent of a ring chromosome, but fluoresecent in situ hybridization detected telomeres at both ends, suggesting a linear structure. This small chromosome has no detectable canonical centromeric sequences, but contains a site with protein features of functional centromeres such as CENH3, the centromere specific H3 histone variant, and CENP-C, a foundational kinetochore protein, suggesting the de novo formation of a centromere on the chromatin fragment. To examine the sequences associated with CENH3, chromatin immunoprecipitation was carried out with anti-CENH3 antibodies using material from young seedlings with and without the Dp3a chromosome. A novel peak was detected from the ChIP-Sequencing reads of the Dp3a sample. The peak spanned 350 kb within the long arm of chromosome 3 covering 22 genes. Collectively, these results define the behavior and molecular features of de novo centromere formation in the Dp3a chromosome, which may shed light on the initiation of new centromere sites during evolution.
De Novo RNA Sequencing and Transcriptome Analysis of Colletotrichum gloeosporioides ES026 Reveal Genes Related to Biosynthesis of Huperzine A

PubMed Central

Zhang, Xiangmei; Xia, Qianqian; Zhao, Xinmei; Ahn, Youngjoon; Ahmed, Nevin; Cosoveanu, Andreea; Wang, Mo; Wang, Jialu; Shu, Shaohua

2015-01-01

Huperzine A is important in the treatment of Alzheimer’s disease. There are major challenges for the mass production of huperzine A from plants due to the limited number of huperzine-A-producing plants, as well as the low content of huperzine A in these plants. Various endophytic fungi produce huperzine A. Colletotrichum gloeosporioides ES026 was previously isolated from a huperzine-A-producing plant Huperzia serrata, and this fungus also produces huperzine A. In this study, de novo RNA sequencing of C. gloeosporioides ES026 was carried out with an Illumina HiSeq2000. A total of 4,324,299,051 bp from 50,442,617 high-quality sequence reads of ES026 were obtained. These raw data were assembled into 24,998 unigenes, 40,536,684 residues and 19,790 genes. The majority of the unique sequences were assigned to corresponding putative functions based on BLAST searches of public databases. The molecular functions, biological processes and biochemical pathways of these unique sequences were determined using gene ontology (GO) and Kyoto Encyclopedia of Genes and Genomes (KEGG) assignments. A gene encoding copper amine oxidase (CAO) (unigene 9322) was annotated for the conversion of cadaverine to 5-aminopentanal in the biosynthesis of huperzine A. This gene was also detected in the root, stem and leaf of H. serrata. Furthermore, a close relationship was observed between expression of the CAO gene (unigene 9322) and quantity of crude huperzine A extracted from ES026. Therefore, CAO might be involved in the biosynthesis of huperzine A and it most likely plays a key role in regulating the content of huperzine A in ES026. PMID:25799531
A Hybrid Parallel Strategy Based on String Graph Theory to Improve De Novo DNA Assembly on the TianHe-2 Supercomputer.

PubMed

Zhang, Feng; Liao, Xiangke; Peng, Shaoliang; Cui, Yingbo; Wang, Bingqiang; Zhu, Xiaoqian; Liu, Jie

2016-06-01

' The de novo assembly of DNA sequences is increasingly important for biological researches in the genomic era. After more than one decade since the Human Genome Project, some challenges still exist and new solutions are being explored to improve de novo assembly of genomes. String graph assembler (SGA), based on the string graph theory, is a new method/tool developed to address the challenges. In this paper, based on an in-depth analysis of SGA we prove that the SGA-based sequence de novo assembly is an NP-complete problem. According to our analysis, SGA outperforms other similar methods/tools in memory consumption, but costs much more time, of which 60-70 % is spent on the index construction. Upon this analysis, we introduce a hybrid parallel optimization algorithm and implement this algorithm in the TianHe-2's parallel framework. Simulations are performed with different datasets. For data of small size the optimized solution is 3.06 times faster than before, and for data of middle size it's 1.60 times. The results demonstrate an evident performance improvement, with the linear scalability for parallel FM-index construction. This results thus contribute significantly to improving the efficiency of de novo assembly of DNA sequences.
[The principle and application of the single-molecule real-time sequencing technology].

PubMed

Yanhu, Liu; Lu, Wang; Li, Yu

2015-03-01

Last decade witnessed the explosive development of the third-generation sequencing strategy, including single-molecule real-time sequencing (SMRT), true single-molecule sequencing (tSMSTM) and the single-molecule nanopore DNA sequencing. In this review, we summarize the principle, performance and application of the SMRT sequencing technology. Compared with the traditional Sanger method and the next-generation sequencing (NGS) technologies, the SMRT approach has several advantages, including long read length, high speed, PCR-free and the capability of direct detection of epigenetic modiﬁcations. However, the disadvantage of its low accuracy, most of which resulted from insertions and deletions, is also notable. So, the raw sequence data need to be corrected before assembly. Up to now, the SMRT is a good fit for applications in the de novo genomic sequencing and the high-quality assemblies of small genomes. In the future, it is expected to play an important role in epigenetics, transcriptomic sequencing, and assemblies of large genomes.
De novo FBXO11 mutations are associated with intellectual disability and behavioural anomalies.

PubMed

Fritzen, Daniel; Kuechler, Alma; Grimmel, Mona; Becker, Jessica; Peters, Sophia; Sturm, Marc; Hundertmark, Hela; Schmidt, Axel; Kreiß, Martina; Strom, Tim M; Wieczorek, Dagmar; Haack, Tobias B; Beck-Wödl, Stefanie; Cremer, Kirsten; Engels, Hartmut

2018-05-01

Intellectual disability (ID) has an estimated prevalence of 1.5-2%. In most affected individuals, its genetic basis remains unclear. Whole exome sequencing (WES) studies have identified a multitude of novel causative gene defects and have shown that a large proportion of sporadic ID cases results from de novo mutations. Here, we present two unrelated individuals with similar clinical features and deleterious de novo variants in FBXO11 detected by WES. Individual 1, a 14-year-old boy, has mild ID as well as mild microcephaly, corrected cleft lip and alveolus, hyperkinetic disorder, mild brain atrophy and minor facial dysmorphism. WES detected a heterozygous de novo 1 bp insertion in the splice donor site of exon 3. Individual 2, a 3-year-old boy, showed ID and pre- and postnatal growth retardation, postnatal mild microcephaly, hyperkinetic and restless behaviour, as well as mild dysmorphism. WES detected a heterozygous de novo frameshift mutation. While ten individuals with ID and de novo variants in FBXO11 have been reported as part of larger studies, only one of the reports has some additional clinical data. Interestingly, the latter individual carries the identical mutation as our individual 2 and also displays ID, intrauterine growth retardation, microcephaly, behavioural anomalies, and dysmorphisms. Thus, we confirm deleterious de novo mutations in FBXO11 as a cause of ID and start the delineation of the associated clinical picture which may also comprise postnatal microcephaly or borderline small head size and behavioural anomalies.
De Novo Assembly, Gene Annotation, and Marker Discovery in Stored-Product Pest Liposcelis entomophila (Enderlein) Using Transcriptome Sequences

PubMed Central

Wei, Dan-Dan; Chen, Er-Hu; Ding, Tian-Bo; Chen, Shi-Chun; Dou, Wei; Wang, Jin-Jun

2013-01-01

Background As a major stored-product pest insect, Liposcelis entomophila has developed high levels of resistance to various insecticides in grain storage systems. However, the molecular mechanisms underlying resistance and environmental stress have not been characterized. To date, there is a lack of genomic information for this species. Therefore, studies aimed at profiling the L. entomophila transcriptome would provide a better understanding of the biological functions at the molecular levels. Methodology/Principal Findings We applied Illumina sequencing technology to sequence the transcriptome of L. entomophila. A total of 54,406,328 clean reads were obtained and that de novo assembled into 54,220 unigenes, with an average length of 571 bp. Through a similarity search, 33,404 (61.61%) unigenes were matched to known proteins in the NCBI non-redundant (Nr) protein database. These unigenes were further functionally annotated with gene ontology (GO), cluster of orthologous groups of proteins (COG), and Kyoto Encyclopedia of Genes and Genomes (KEGG) databases. A large number of genes potentially involved in insecticide resistance were manually curated, including 68 putative cytochrome P450 genes, 37 putative glutathione S-transferase (GST) genes, 19 putative carboxyl/cholinesterase (CCE) genes, and other 126 transcripts to contain target site sequences or encoding detoxification genes representing eight types of resistance enzymes. Furthermore, to gain insight into the molecular basis of the L. entomophila toward thermal stresses, 25 heat shock protein (Hsp) genes were identified. In addition, 1,100 SSRs and 57,757 SNPs were detected and 231 pairs of SSR primes were designed for investigating the genetic diversity in future. Conclusions/Significance We developed a comprehensive transcriptomic database for L. entomophila. These sequences and putative molecular markers would further promote our understanding of the molecular mechanisms underlying insecticide resistance
De novo Assembly of the Indo-Pacific Humpback Dolphin Leucocyte Transcriptome to Identify Putative Genes Involved in the Aquatic Adaptation and Immune Response

PubMed Central

Xia, Jia; Yang, Lili; Chen, Jialin; Wu, Yuping; Yi, Meisheng

2013-01-01

Background The Indo-Pacific humpback dolphin (Sousa chinensis), a marine mammal species inhabited in the waters of Southeast Asia, South Africa and Australia, has attracted much attention because of the dramatic decline in population size in the past decades, which raises the concern of extinction. So far, this species is poorly characterized at molecular level due to little sequence information available in public databases. Recent advances in large-scale RNA sequencing provide an efficient approach to generate abundant sequences for functional genomic analyses in the species with un-sequenced genomes. Principal Findings We performed a de novo assembly of the Indo-Pacific humpback dolphin leucocyte transcriptome by Illumina sequencing. 108,751 high quality sequences from 47,840,388 paired-end reads were generated, and 48,868 and 46,587 unigenes were functionally annotated by BLAST search against the NCBI non-redundant and Swiss-Prot protein databases (E-value<10−5), respectively. In total, 16,467 unigenes were clustered into 25 functional categories by searching against the COG database, and BLAST2GO search assigned 37,976 unigenes to 61 GO terms. In addition, 36,345 unigenes were grouped into 258 KEGG pathways. We also identified 9,906 simple sequence repeats and 3,681 putative single nucleotide polymorphisms as potential molecular markers in our assembled sequences. A large number of unigenes were predicted to be involved in immune response, and many genes were predicted to be relevant to adaptive evolution and cetacean-specific traits. Conclusion This study represented the first transcriptome analysis of the Indo-Pacific humpback dolphin, an endangered species. The de novo transcriptome analysis of the unique transcripts will provide valuable sequence information for discovery of new genes, characterization of gene expression, investigation of various pathways and adaptive evolution, as well as identification of genetic markers. PMID:24015242
De novo assembly of the Indo-Pacific humpback dolphin leucocyte transcriptome to identify putative genes involved in the aquatic adaptation and immune response.

PubMed

Gui, Duan; Jia, Kuntong; Xia, Jia; Yang, Lili; Chen, Jialin; Wu, Yuping; Yi, Meisheng

2013-01-01

The Indo-Pacific humpback dolphin (Sousa chinensis), a marine mammal species inhabited in the waters of Southeast Asia, South Africa and Australia, has attracted much attention because of the dramatic decline in population size in the past decades, which raises the concern of extinction. So far, this species is poorly characterized at molecular level due to little sequence information available in public databases. Recent advances in large-scale RNA sequencing provide an efficient approach to generate abundant sequences for functional genomic analyses in the species with un-sequenced genomes. We performed a de novo assembly of the Indo-Pacific humpback dolphin leucocyte transcriptome by Illumina sequencing. 108,751 high quality sequences from 47,840,388 paired-end reads were generated, and 48,868 and 46,587 unigenes were functionally annotated by BLAST search against the NCBI non-redundant and Swiss-Prot protein databases (E-value<10(-5)), respectively. In total, 16,467 unigenes were clustered into 25 functional categories by searching against the COG database, and BLAST2GO search assigned 37,976 unigenes to 61 GO terms. In addition, 36,345 unigenes were grouped into 258 KEGG pathways. We also identified 9,906 simple sequence repeats and 3,681 putative single nucleotide polymorphisms as potential molecular markers in our assembled sequences. A large number of unigenes were predicted to be involved in immune response, and many genes were predicted to be relevant to adaptive evolution and cetacean-specific traits. This study represented the first transcriptome analysis of the Indo-Pacific humpback dolphin, an endangered species. The de novo transcriptome analysis of the unique transcripts will provide valuable sequence information for discovery of new genes, characterization of gene expression, investigation of various pathways and adaptive evolution, as well as identification of genetic markers.
Eculizumab for drug-induced de novo posttransplantation thrombotic microangiopathy: A case report.

PubMed

Safa, Kassem; Logan, Merranda S; Batal, Ibrahim; Gabardi, Steven; Rennke, Helmut G; Abdi, Reza

2015-02-01

De novo thrombotic microangiopathy (TMA) following renal transplantation is a severe complication associated with high rates of allograft failure. Several immunosuppressive agents are associated with TMA. Conventional approaches to managing this entity, such as withdrawal of the offending agent and/or plasmapheresis, often offer limited help, with high rates of treatment failure and graft loss. We herein report a case of drug induced de novo TMA successfully treated using the C5a inhibitor eculizumab in a renal transplant patient. This report highlights a potentially important role for eculizumab in settings where drug-induced de novo TMA is refractory to conventional therapies.
A new hybrid approach for MHC genotyping: high-throughput NGS and long read MinION nanopore sequencing, with application to the non-model vertebrate Alpine chamois (Rupicapra rupicapra).

PubMed

Fuselli, S; Baptista, R P; Panziera, A; Magi, A; Guglielmi, S; Tonin, R; Benazzo, A; Bauzer, L G; Mazzoni, C J; Bertorelle, G

2018-03-24

The major histocompatibility complex (MHC) acts as an interface between the immune system and infectious diseases. Accurate characterization and genotyping of the extremely variable MHC loci are challenging especially without a reference sequence. We designed a combination of long-range PCR, Illumina short-reads, and Oxford Nanopore MinION long-reads approaches to capture the genetic variation of the MHC II DRB locus in an Italian population of the Alpine chamois (Rupicapra rupicapra). We utilized long-range PCR to generate a 9 Kb fragment of the DRB locus. Amplicons from six different individuals were fragmented, tagged, and simultaneously sequenced with Illumina MiSeq. One of these amplicons was sequenced with the MinION device, which produced long reads covering the entire amplified fragment. A pipeline that combines short and long reads resolved several short tandem repeats and homopolymers and produced a de novo reference, which was then used to map and genotype the short reads from all individuals. The assembled DRB locus showed a high level of polymorphism and the presence of a recombination breakpoint. Our results suggest that an amplicon-based NGS approach coupled with single-molecule MinION nanopore sequencing can efficiently achieve both the assembly and the genotyping of complex genomic regions in multiple individuals in the absence of a reference sequence.
Assessing the utility of the Oxford Nanopore MinION for snake venom gland cDNA sequencing.

PubMed

Hargreaves, Adam D; Mulley, John F

2015-01-01

Portable DNA sequencers such as the Oxford Nanopore MinION device have the potential to be truly disruptive technologies, facilitating new approaches and analyses and, in some cases, taking sequencing out of the lab and into the field. However, the capabilities of these technologies are still being revealed. Here we show that single-molecule cDNA sequencing using the MinION accurately characterises venom toxin-encoding genes in the painted saw-scaled viper, Echis coloratus. We find the raw sequencing error rate to be around 12%, improved to 0-2% with hybrid error correction and 3% with de novo error correction. Our corrected data provides full coding sequences and 5' and 3' UTRs for 29 of 33 candidate venom toxins detected, far superior to Illumina data (13/40 complete) and Sanger-based ESTs (15/29). We suggest that, should the current pace of improvement continue, the MinION will become the default approach for cDNA sequencing in a variety of species.
Assessing the utility of the Oxford Nanopore MinION for snake venom gland cDNA sequencing

PubMed Central

Hargreaves, Adam D.

2015-01-01

Portable DNA sequencers such as the Oxford Nanopore MinION device have the potential to be truly disruptive technologies, facilitating new approaches and analyses and, in some cases, taking sequencing out of the lab and into the field. However, the capabilities of these technologies are still being revealed. Here we show that single-molecule cDNA sequencing using the MinION accurately characterises venom toxin-encoding genes in the painted saw-scaled viper, Echis coloratus. We find the raw sequencing error rate to be around 12%, improved to 0–2% with hybrid error correction and 3% with de novo error correction. Our corrected data provides full coding sequences and 5′ and 3′ UTRs for 29 of 33 candidate venom toxins detected, far superior to Illumina data (13/40 complete) and Sanger-based ESTs (15/29). We suggest that, should the current pace of improvement continue, the MinION will become the default approach for cDNA sequencing in a variety of species. PMID:26623194
Algorithm for selection of optimized EPR distance restraints for de novo protein structure determination

PubMed Central

Kazmier, Kelli; Alexander, Nathan S.; Meiler, Jens; Mchaourab, Hassane S.

2010-01-01

A hybrid protein structure determination approach combining sparse Electron Paramagnetic Resonance (EPR) distance restraints and Rosetta de novo protein folding has been previously demonstrated to yield high quality models (Alexander et al., 2008). However, widespread application of this methodology to proteins of unknown structures is hindered by the lack of a general strategy to place spin label pairs in the primary sequence. In this work, we report the development of an algorithm that optimally selects spin labeling positions for the purpose of distance measurements by EPR. For the α-helical subdomain of T4 lysozyme (T4L), simulated restraints that maximize sequence separation between the two spin labels while simultaneously ensuring pairwise connectivity of secondary structure elements yielded vastly improved models by Rosetta folding. 50% of all these models have the correct fold compared to only 21% and 8% correctly folded models when randomly placed restraints or no restraints are used, respectively. Moreover, the improvements in model quality require a limited number of optimized restraints, the number of which is determined by the pairwise connectivities of T4L α-helices. The predicted improvement in Rosetta model quality was verified by experimental determination of distances between spin labels pairs selected by the algorithm. Overall, our results reinforce the rationale for the combined use of sparse EPR distance restraints and de novo folding. By alleviating the experimental bottleneck associated with restraint selection, this algorithm sets the stage for extending computational structure determination to larger, traditionally elusive protein topologies of critical structural and biochemical importance. PMID:21074624
Katome: de novo DNA assembler implemented in rust

NASA Astrophysics Data System (ADS)

Neumann, Łukasz; Nowak, Robert M.; Kuśmirek, Wiktor

2017-08-01

Katome is a new de novo sequence assembler written in the Rust programming language, designed with respect to future parallelization of the algorithms, run time and memory usage optimization. The application uses new algorithms for the correct assembly of repetitive sequences. Performance and quality tests were performed on various data, comparing the new application to `dnaasm', `ABySS' and `Velvet' genome assemblers. Quality tests indicate that the new assembler creates more contigs than well-established solutions, but the contigs have better quality with regard to mismatches per 100kbp and indels per 100kbp. Additionally, benchmarks indicate that the Rust-based implementation outperforms `dnaasm', `ABySS' and `Velvet' assemblers, written in C++, in terms of assembly time. Lower memory usage in comparison to `dnaasm' is observed.
CATCh, an Ensemble Classifier for Chimera Detection in 16S rRNA Sequencing Studies

PubMed Central

Mysara, Mohamed; Saeys, Yvan; Leys, Natalie; Raes, Jeroen

2014-01-01

In ecological studies, microbial diversity is nowadays mostly assessed via the detection of phylogenetic marker genes, such as 16S rRNA. However, PCR amplification of these marker genes produces a significant amount of artificial sequences, often referred to as chimeras. Different algorithms have been developed to remove these chimeras, but efforts to combine different methodologies are limited. Therefore, two machine learning classifiers (reference-based and de novo CATCh) were developed by integrating the output of existing chimera detection tools into a new, more powerful method. When comparing our classifiers with existing tools in either the reference-based or de novo mode, a higher performance of our ensemble method was observed on a wide range of sequencing data, including simulated, 454 pyrosequencing, and Illumina MiSeq data sets. Since our algorithm combines the advantages of different individual chimera detection tools, our approach produces more robust results when challenged with chimeric sequences having a low parent divergence, short length of the chimeric range, and various numbers of parents. Additionally, it could be shown that integrating CATCh in the preprocessing pipeline has a beneficial effect on the quality of the clustering in operational taxonomic units. PMID:25527546
Mining and Development of Novel SSR Markers Using Next Generation Sequencing (NGS) Data in Plants.

PubMed

Taheri, Sima; Lee Abdullah, Thohirah; Yusop, Mohd Rafii; Hanafi, Mohamed Musa; Sahebi, Mahbod; Azizi, Parisa; Shamshiri, Redmond Ramin

2018-02-13

Microsatellites, or simple sequence repeats (SSRs), are one of the most informative and multi-purpose genetic markers exploited in plant functional genomics. However, the discovery of SSRs and development using traditional methods are laborious, time-consuming, and costly. Recently, the availability of high-throughput sequencing technologies has enabled researchers to identify a substantial number of microsatellites at less cost and effort than traditional approaches. Illumina is a noteworthy transcriptome sequencing technology that is currently used in SSR marker development. Although 454 pyrosequencing datasets can be used for SSR development, this type of sequencing is no longer supported. This review aims to present an overview of the next generation sequencing, with a focus on the efficient use of de novo transcriptome sequencing (RNA-Seq) and related tools for mining and development of microsatellites in plants.
The De Novo Design of Protein-Protein Interfaces

DTIC Science & Technology

it was our intention to add to this body by engineering de novo (from scratch) protein/protein complexes. Using this inverse approach we have furthered...key physical features needed to drive specific protein/protein interactions. It is considered inverse because, instead of studying natural complexes
PRO_LIGAND: An approach to de novo molecular design. 4. Application to the design of peptides

NASA Astrophysics Data System (ADS)

Frenkel, David; Clark, David E.; Li, Jin; Murray, Christopher W.; Robson, Barry; Waszkowycz, Bohdan; Westhead, David R.

1995-06-01

In some instances, peptides can play an important role in the discovery of lead compounds. This paper describes the peptide design facility of the de novo drug design package, PRO_LIGAND. The package provides a unified framework for the design of peptides that are similar or complementary to a specified target. The approach uses single amino acid residues, selected from preconstructed libraries of different residues and conformations, and places them on top of predefined target interaction sites. This approach is a well-tested methodology for the design of organics but has not been used for peptides before. Peptides represent a difficulty because of their great conformational flexibility and a study of the advantages and disavantages of this simple approach is an important step in the development of design tools. After a description of our general approach, a more detailed discussion of its adaptation to peptides is given. The method is then applied to the design of peptide-based inhibitors to HIV-1 protease and the design of structural mimics of the surface region of lysozyme. The results are encouraging and point the way towards further development of interaction site-based approaches for peptide design.

De novo sequencing and analysis of the cranberry fruit transcriptome to identify putative genes involved in flavonoid biosynthesis, transport and regulation.

PubMed

Sun, Haiyue; Liu, Yushan; Gai, Yuzhuo; Geng, Jinman; Chen, Li; Liu, Hongdi; Kang, Limin; Tian, Youwen; Li, Yadong

2015-09-02

Cranberries (Vaccinium macrocarpon Ait.), renowned for their excellent health benefits, are an important berry crop. Here, we performed transcriptome sequencing of one cranberry cultivar, from fruits at two different developmental stages, on the Illumina HiSeq 2000 platform. Our main goals were to identify putative genes for major metabolic pathways of bioactive compounds and compare the expression patterns between white fruit (W) and red fruit (R) in cranberry. In this study, two cDNA libraries of W and R were constructed. Approximately 119 million raw sequencing reads were generated and assembled de novo, yielding 57,331 high quality unigenes with an average length of 739 bp. Using BLASTx, 38,460 unigenes were identified as putative homologs of annotated sequences in public protein databases, including NCBI NR, NT, Swiss-Prot, KEGG, COG and GO. Of these, 21,898 unigenes mapped to 128 KEGG pathways, with the metabolic pathways, secondary metabolites, glycerophospholipid metabolism, ether lipid metabolism, starch and sucrose metabolism, purine metabolism, and pyrimidine metabolism being well represented. Among them, many candidate genes were involved in flavonoid biosynthesis, transport and regulation. Furthermore, digital gene expression (DEG) analysis identified 3,257 unigenes that were differentially expressed between the two fruit developmental stages. In addition, 14,473 simple sequence repeats (SSRs) were detected. Our results present comprehensive gene expression information about the cranberry fruit transcriptome that could facilitate our understanding of the molecular mechanisms of fruit development in cranberries. Although it will be necessary to validate the functions carried out by these genes, these results could be used to improve the quality of breeding programs for the cranberry and related species.
De novo truncating mutations in ASXL3 are associated with a novel clinical phenotype with similarities to Bohring-Opitz syndrome

PubMed Central

2013-01-01

Background Molecular diagnostics can resolve locus heterogeneity underlying clinical phenotypes that may otherwise be co-assigned as a specific syndrome based on shared clinical features, and can associate phenotypically diverse diseases to a single locus through allelic affinity. Here we describe an apparently novel syndrome, likely caused by de novo truncating mutations in ASXL3, which shares characteristics with Bohring-Opitz syndrome, a disease associated with de novo truncating mutations in ASXL1. Methods We used whole-genome and whole-exome sequencing to interrogate the genomes of four subjects with an undiagnosed syndrome. Results Using genome-wide sequencing, we identified heterozygous, de novo truncating mutations in ASXL3, a transcriptional repressor related to ASXL1, in four unrelated probands. We found that these probands shared similar phenotypes, including severe feeding difficulties, failure to thrive, and neurologic abnormalities with significant developmental delay. Further, they showed less phenotypic overlap with patients who had de novo truncating mutations in ASXL1. Conclusion We have identified truncating mutations in ASXL3 as the likely cause of a novel syndrome with phenotypic overlap with Bohring-Opitz syndrome. PMID:23383720
A de novo redesign of the WW domain.

PubMed

Kraemer-Pecore, Christina M; Lecomte, Juliette T J; Desjarlais, John R

2003-10-01

We have used a sequence prediction algorithm and a novel sampling method to design protein sequences for the WW domain, a small beta-sheet motif. The procedure, referred to as SPANS, designs sequences to be compatible with an ensemble of closely related polypeptide backbones, mimicking the inherent flexibility of proteins. Two designed sequences (termed SPANS-WW1 and SPANS-WW2), using only naturally occurring L-amino acids, were selected for study and the corresponding polypeptides were prepared in Escherichia coli. Circular dichroism data suggested that both purified polypeptides adopted secondary structure features related to those of the target without the aid of disulfide bridges or bound cofactors. The structure exhibited by SPANS-WW2 melted cooperatively by raising the temperature of the solution. Further analysis of this polypeptide by proton nuclear magnetic resonance spectroscopy demonstrated that at 5 degrees C, it folds into a structure closely resembling a natural WW domain. This achievement constitutes one of a small number of successful de novo protein designs through fully automated computational methods and highlights the feasibility of including backbone flexibility in the design strategy.
De novo nonsense and frameshift variants of TCF20 in individuals with intellectual disability and postnatal overgrowth.

PubMed

Schäfgen, Johanna; Cremer, Kirsten; Becker, Jessica; Wieland, Thomas; Zink, Alexander M; Kim, Sarah; Windheuser, Isabelle C; Kreiß, Martina; Aretz, Stefan; Strom, Tim M; Wieczorek, Dagmar; Engels, Hartmut

2016-12-01

Recently, germline variants of the transcriptional co-regulator gene TCF20 have been implicated in the aetiology of autism spectrum disorders (ASD). However, the knowledge about the associated clinical picture remains fragmentary. In this study, two individuals with de novo TCF20 sequence variants were identified in a cohort of 313 individuals with intellectual disability of unknown aetiology, which was analysed by whole exome sequencing using a child-parent trio design. Both detected variants - one nonsense and one frameshift variant - were truncating. A comprehensive clinical characterisation of the patients yielded mild intellectual disability, postnatal tall stature and macrocephaly, obesity and muscular hypotonia as common clinical signs while ASD was only present in one proband. The present report begins to establish the clinical picture of individuals with de novo nonsense and frameshift variants of TCF20 which includes features such as proportionate overgrowth and muscular hypotonia. Furthermore, intellectual disability/developmental delay seems to be fully penetrant amongst known individuals with de novo nonsense and frameshift variants of TCF20, whereas ASD is shown to be incompletely penetrant. The transcriptional co-regulator gene TCF20 is hereby added to the growing number of genes implicated in the aetiology of both ASD and intellectual disability. Furthermore, such de novo variants of TCF20 may represent a novel differential diagnosis in the overgrowth syndrome spectrum.
De-novo discovery of differentially abundant transcription factor binding sites including their positional preference.

PubMed

Keilwagen, Jens; Grau, Jan; Paponov, Ivan A; Posch, Stefan; Strickert, Marc; Grosse, Ivo

2011-02-10

Transcription factors are a main component of gene regulation as they activate or repress gene expression by binding to specific binding sites in promoters. The de-novo discovery of transcription factor binding sites in target regions obtained by wet-lab experiments is a challenging problem in computational biology, which has not been fully solved yet. Here, we present a de-novo motif discovery tool called Dispom for finding differentially abundant transcription factor binding sites that models existing positional preferences of binding sites and adjusts the length of the motif in the learning process. Evaluating Dispom, we find that its prediction performance is superior to existing tools for de-novo motif discovery for 18 benchmark data sets with planted binding sites, and for a metazoan compendium based on experimental data from micro-array, ChIP-chip, ChIP-DSL, and DamID as well as Gene Ontology data. Finally, we apply Dispom to find binding sites differentially abundant in promoters of auxin-responsive genes extracted from Arabidopsis thaliana microarray data, and we find a motif that can be interpreted as a refined auxin responsive element predominately positioned in the 250-bp region upstream of the transcription start site. Using an independent data set of auxin-responsive genes, we find in genome-wide predictions that the refined motif is more specific for auxin-responsive genes than the canonical auxin-responsive element. In general, Dispom can be used to find differentially abundant motifs in sequences of any origin. However, the positional distribution learned by Dispom is especially beneficial if all sequences are aligned to some anchor point like the transcription start site in case of promoter sequences. We demonstrate that the combination of searching for differentially abundant motifs and inferring a position distribution from the data is beneficial for de-novo motif discovery. Hence, we make the tool freely available as a component of the open
De Novo Paternal FBN1 Mutation Detected in Embryos Before Implantation.

PubMed

Wang, Shuling; Niu, Ziru; Wang, Hui; Ma, Minyue; Zhang, Wei; Fang Wang, Shu; Wang, Jun; Yan, Hong; Liu, Yifan; Duan, Na; Zhang, Xiandong; Yao, Yuanqing

2017-06-26

BACKGROUND Marfan syndrome (MFS) is an autosomal dominant disease caused by mutations in the Fibrillin (FBN)1 gene and characterized by disorders in the cardiovascular, skeletal, and visual systems. The diversity of mutations and phenotypic heterogeneity of MFS make prenatal molecular diagnoses difficult. In this study, we used pre-implantation genetic diagnosis (PGD) to identify the pathogenic mutation in a male patient with MFS and to determine whether his offspring would be free of the disease. MATERIAL AND METHODS The history and pedigree of the proband were analyzed. Mutation analysis was performed on the couple and immediate family members. The couple chose IVF treatment and 4 blastocysts were biopsied. PGD was carried out by targeted high-throughput sequencing of the FBN1 gene in the embryos, along with single-nucleotide polymorphism haplotyping. Sanger sequencing was used to confirm the causative mutation. RESULTS c.2647T>C (p.Trp883Arg) was identified as the de novo likely pathogenic mutation in the proband. Whole-genome amplification and sequencing of the 3 embryos revealed that they did not carry the mutation, and 1 blastocyst was transferred back to the uterus. The amniocentesis test result analyzed by Sanger sequencing confirmed the PGD. A premature but healthy infant free of heart malformations was born. CONCLUSIONS The de novo mutation c.2647T>C (p.Trp883Arg) in FBN1 was identified in a Chinese patient with MFS. Embryos without the mutation were identified by PGD and resulted in a successful pregnancy.
SNP discovery through de novo deep sequencing using the next generation of DNA sequencers

USDA-ARS?s Scientific Manuscript database

The production of high volumes of DNA sequence data using new technologies has permitted more efficient identification of single nucleotide polymorphisms in vertebrate genomes. This chapter presented practical methodology for production and analysis of DNA sequence data for SNP discovery....
Kinetics and risk of de novo hepatitis B infection in HBsAg-negative patients undergoing cytotoxic chemotherapy.

PubMed

Hui, Chee-Kin; Cheung, Winnie W W; Zhang, Hai-Ying; Au, Wing-Yan; Yueng, Yui-Hung; Leung, Anskar Y H; Leung, Nancy; Luk, John M; Lie, Albert K W; Kwong, Yok-Lam; Liang, Raymond; Lau, George K K

2006-07-01

De novo hepatitis B virus (HBV)-related hepatitis after chemotherapy results in high morbidity and mortality. We evaluate the clinical course of de novo HBV-related hepatitis after chemotherapy. Two hundred forty-four consecutive hepatitis B surface antigen (HBsAg)-negative lymphoma patients treated with chemotherapy were followed up for a median of 12.4 (range, 0.1-65.0) months. Serially collected serum samples were analyzed for hepatitis, serum HBV DNA, and HBsAg seroreversion. Eight of the 244 patients (3.3%) developed de novo HBV-related hepatitis. A 100-fold increase in serum HBV DNA preceded de novo HBV-related hepatitis by a median of 18.5 (range, 12-28) weeks. All 8 patients had normal serum alanine aminotransaminase level when the 100-fold increase in serum HBV DNA occurred. Patients with de novo HBV-related hepatitis were more likely to have occult HBV infection before chemotherapy. Direct sequencing results showed that these 8 patients had de novo HBV-related hepatitis from reactivation of occult HBV infection. Three of the 8 patients with de novo HBV-related hepatitis compared with 6 of the 236 patients without de novo HBV-related hepatitis developed fulminant hepatic failure (37.5% vs 2.5%, respectively, P < .001). On multivariate Cox analysis, de novo HBV-related hepatitis was independently associated with a higher risk of fulminant hepatic failure (relative risk, 29.854; 95% confidence interval: 4.844-183.980; P < .001). Close surveillance for a 100-fold increase in HBV DNA is recommended for HBsAg-negative patients treated with chemotherapy so that early commencement of antiviral therapy can be initiated before the occurrence of de novo HBV-related hepatitis.
Transposon Insertion Finder (TIF): a novel program for detection of de novo transpositions of transposable elements.

PubMed

Nakagome, Mariko; Solovieva, Elena; Takahashi, Akira; Yasue, Hiroshi; Hirochika, Hirohiko; Miyao, Akio

2014-03-14

Transposition event detection of transposable element (TE) in the genome using short reads from the next-generation sequence (NGS) was difficult, because the nucleotide sequence of TE itself is repetitive, making it difficult to identify locations of its insertions by alignment programs for NGS. We have developed a program with a new algorithm to detect the transpositions from NGS data. In the process of tool development, we used next-generation sequence (NGS) data of derivative lines (ttm2 and ttm5) of japonica rice cv. Nipponbare, regenerated through cell culture. The new program, called a transposon insertion finder (TIF), was applied to detect the de novo transpositions of Tos17 in the regenerated lines. TIF searched 300 million reads of a line within 20 min, identifying 4 and 12 de novo transposition in ttm2 and ttm5 lines, respectively. All of the transpositions were confirmed by PCR/electrophoresis and sequencing. Using the program, we also detected new transposon insertions of P-element from NGS data of Drosophila melanogaster. TIF operates to find the transposition of any elements provided that target site duplications (TSDs) are generated by their transpositions.
Comparative modeling without implicit sequence alignments.

PubMed

Kolinski, Andrzej; Gront, Dominik

2007-10-01

The number of known protein sequences is about thousand times larger than the number of experimentally solved 3D structures. For more than half of the protein sequences a close or distant structural analog could be identified. The key starting point in a classical comparative modeling is to generate the best possible sequence alignment with a template or templates. With decreasing sequence similarity, the number of errors in the alignments increases and these errors are the main causes of the decreasing accuracy of the molecular models generated. Here we propose a new approach to comparative modeling, which does not require the implicit alignment - the model building phase explores geometric, evolutionary and physical properties of a template (or templates). The proposed method requires prior identification of a template, although the initial sequence alignment is ignored. The model is built using a very efficient reduced representation search engine CABS to find the best possible superposition of the query protein onto the template represented as a 3D multi-featured scaffold. The criteria used include: sequence similarity, predicted secondary structure consistency, local geometric features and hydrophobicity profile. For more difficult cases, the new method qualitatively outperforms existing schemes of comparative modeling. The algorithm unifies de novo modeling, 3D threading and sequence-based methods. The main idea is general and could be easily combined with other efficient modeling tools as Rosetta, UNRES and others.
De novo sequencing, assembly and analysis of salivary gland transcriptome of Haemaphysalis flava and identification of sialoprotein genes.

PubMed

Xu, Xing-Li; Cheng, Tian-Yin; Yang, Hu; Yan, Fen; Yang, Ya

2015-06-01

Saliva plays an important role in feeding and pathogen transmission, identification and analysis of tick salivary gland (SG) proteins is considered as a hot spot in anti-tick researching area. Herein, we present the first description of SG transcriptome of Haemaphysalis flava using next-generation sequencing (NGS). A total of over 143 million high-quality reads were assembled into 54,357 unigenes, of which 20,145 (37.06%) had significant similarities to proteins in the Swiss-Prot database. 13,513 annotated sequences were associated with GO terms. Kyoto Encyclopedia of Genes and Genomes (KEGG) analysis showed that 14,280 unigenes were assigned to 279 KEGG pathways in total. Reads per kb per million reads (RPKM) analysis showed that there were 3035 down-regulated unigenes and 2260 up-regulated unigenes in the engorged ticks (ET) compared with the semi-engorged one (SET). Several important genes are associated with blood feeding and ingestion as secreted salivary proteins, concluding cysteine, longipain, 4D8, calreticulin, metalloproteases, serine protease inhibitor, enolase, heat shock protein and AV422 in SG, were identified. The qRT-PCR results confirmed that patterns of these genes (except for the longipain gene) expression were consistent with RNA-seq results. This de novo assembly of SG transcriptome of H. flava not only provides more chance for screening and cloning functional genes, but also forms a solid basis for further insight into the changes of salivary proteins during blood-feeding. Copyright © 2015 Elsevier B.V. All rights reserved.
Sequencing of Oligourea Foldamers by Tandem Mass Spectrometry

NASA Astrophysics Data System (ADS)

Bathany, Katell; Owens, Neil W.; Guichard, Gilles; Schmitter, Jean-Marie

2013-03-01

This study is focused on sequence analysis of peptidomimetic helical oligoureas by means of tandem mass spectrometry, to build a basis for de novo sequencing for future high-throughput combinatorial library screening of oligourea foldamers. After the evaluation of MS/MS spectra obtained for model compounds with either MALDI or ESI sources, we found that the MALDI-TOF-TOF instrument gave more satisfactory results. MS/MS spectra of oligoureas generated by decay of singly charged precursor ions show major ion series corresponding to fragmentation across both CO-NH and N'H-CO urea bonds. Oligourea backbones fragment to produce a pattern of a, x, b, and y type fragment ions. De novo decoding of spectral information is facilitated by the occurrence of low mass reporter ions, representative of constitutive monomers, in an analogous manner to the use of immonium ions for peptide sequencing.
Evolution of the alternative AQP2 gene: Acquisition of a novel protein-coding sequence in dolphins.

PubMed

Kishida, Takushi; Suzuki, Miwa; Takayama, Asuka

2018-01-01

Taxon-specific de novo protein-coding sequences are thought to be important for taxon-specific environmental adaptation. A recent study revealed that bottlenose dolphins acquired a novel isoform of aquaporin 2 generated by alternative splicing (alternative AQP2), which helps dolphins to live in hyperosmotic seawater. The AQP2 gene consists of four exons, but the alternative AQP2 gene lacks the fourth exon and instead has a longer third exon that includes the original third exon and a part of the original third intron. Here, we show that the latter half of the third exon of the alternative AQP2 arose from a non-protein-coding sequence. Intact ORF of this de novo sequence is shared not by all cetaceans, but only by delphinoids. However, this sequence is conservative in all modern cetaceans, implying that this de novo sequence potentially plays important roles for marine adaptation in cetaceans. Copyright © 2017 Elsevier Inc. All rights reserved.
Shotgun protein sequencing: assembly of peptide tandem mass spectra from mixtures of modified proteins.

PubMed

Bandeira, Nuno; Clauser, Karl R; Pevzner, Pavel A

2007-07-01

Despite significant advances in the identification of known proteins, the analysis of unknown proteins by MS/MS still remains a challenging open problem. Although Klaus Biemann recognized the potential of MS/MS for sequencing of unknown proteins in the 1980s, low throughput Edman degradation followed by cloning still remains the main method to sequence unknown proteins. The automated interpretation of MS/MS spectra has been limited by a focus on individual spectra and has not capitalized on the information contained in spectra of overlapping peptides. Indeed the powerful shotgun DNA sequencing strategies have not been extended to automated protein sequencing. We demonstrate, for the first time, the feasibility of automated shotgun protein sequencing of protein mixtures by utilizing MS/MS spectra of overlapping and possibly modified peptides generated via multiple proteases of different specificities. We validate this approach by generating highly accurate de novo reconstructions of multiple regions of various proteins in western diamondback rattlesnake venom. We further argue that shotgun protein sequencing has the potential to overcome the limitations of current protein sequencing approaches and thus catalyze the otherwise impractical applications of proteomics methodologies in studies of unknown proteins.
MALDI Top-Down sequencing: calling N- and C-terminal protein sequences with high confidence and speed.

PubMed

Suckau, Detlev; Resemann, Anja

2009-12-01

The ability to match Top-Down protein sequencing (TDS) results by MALDI-TOF to protein sequences by classical protein database searching was evaluated in this work. Resulting from these analyses were the protein identity, the simultaneous assignment of the N- and C-termini and protein sequences of up to 70 residues from either terminus. In combination with de novo sequencing using the MALDI-TDS data, even fusion proteins were assigned and the detailed sequence around the fusion site was elucidated. MALDI-TDS allowed to efficiently match protein sequences quickly and to validate recombinant protein structures-in particular, protein termini-on the level of undigested proteins.
Anchoring and ordering NGS contig assemblies by population sequencing (POPSEQ)

PubMed Central

Mascher, Martin; Muehlbauer, Gary J; Rokhsar, Daniel S; Chapman, Jarrod; Schmutz, Jeremy; Barry, Kerrie; Muñoz-Amatriaín, María; Close, Timothy J; Wise, Roger P; Schulman, Alan H; Himmelbach, Axel; Mayer, Klaus FX; Scholz, Uwe; Poland, Jesse A; Stein, Nils; Waugh, Robbie

2013-01-01

Next-generation whole-genome shotgun assemblies of complex genomes are highly useful, but fail to link nearby sequence contigs with each other or provide a linear order of contigs along individual chromosomes. Here, we introduce a strategy based on sequencing progeny of a segregating population that allows de novo production of a genetically anchored linear assembly of the gene space of an organism. We demonstrate the power of the approach by reconstructing the chromosomal organization of the gene space of barley, a large, complex and highly repetitive 5.1 Gb genome. We evaluate the robustness of the new assembly by comparison to a recently released physical and genetic framework of the barley genome, and to various genetically ordered sequence-based genotypic datasets. The method is independent of the need for any prior sequence resources, and will enable rapid and cost-efficient establishment of powerful genomic information for many species. PMID:23998490
Evaluation of the impact of RNA preservation methods of spiders for de novo transcriptome assembly.

PubMed

Kono, Nobuaki; Nakamura, Hiroyuki; Ito, Yusuke; Tomita, Masaru; Arakawa, Kazuharu

2016-05-01

With advances in high-throughput sequencing technologies, de novo transcriptome sequencing and assembly has become a cost-effective method to obtain comprehensive genetic information of a species of interest, especially in nonmodel species with large genomes such as spiders. However, high-quality RNA is essential for successful sequencing, and sample preservation conditions require careful consideration for the effective storage of field-collected samples. To this end, we report a streamlined feasibility study of various storage conditions and their effects on de novo transcriptome assembly results. The storage parameters considered include temperatures ranging from room temperature to -80°C; preservatives, including ethanol, RNAlater, TRIzol and RNAlater-ICE; and sample submersion states. As a result, intact RNA was extracted and assembly was successful when samples were preserved at low temperatures regardless of the type of preservative used. The assemblies as well as the gene expression profiles were shown to be robust to RNA degradation, when 30 million 150-bp paired-end reads are obtained. The parameters for sample storage, RNA extraction, library preparation, sequencing and in silico assembly considered in this work provide a guideline for the study of field-collected samples of spiders. © 2015 John Wiley & Sons Ltd.
An analytical framework for whole-genome sequence association studies and its implications for autism spectrum disorder.

PubMed

Werling, Donna M; Brand, Harrison; An, Joon-Yong; Stone, Matthew R; Zhu, Lingxue; Glessner, Joseph T; Collins, Ryan L; Dong, Shan; Layer, Ryan M; Markenscoff-Papadimitriou, Eirene; Farrell, Andrew; Schwartz, Grace B; Wang, Harold Z; Currall, Benjamin B; Zhao, Xuefang; Dea, Jeanselle; Duhn, Clif; Erdman, Carolyn A; Gilson, Michael C; Yadav, Rachita; Handsaker, Robert E; Kashin, Seva; Klei, Lambertus; Mandell, Jeffrey D; Nowakowski, Tomasz J; Liu, Yuwen; Pochareddy, Sirisha; Smith, Louw; Walker, Michael F; Waterman, Matthew J; He, Xin; Kriegstein, Arnold R; Rubenstein, John L; Sestan, Nenad; McCarroll, Steven A; Neale, Benjamin M; Coon, Hilary; Willsey, A Jeremy; Buxbaum, Joseph D; Daly, Mark J; State, Matthew W; Quinlan, Aaron R; Marth, Gabor T; Roeder, Kathryn; Devlin, Bernie; Talkowski, Michael E; Sanders, Stephan J

2018-05-01

Genomic association studies of common or rare protein-coding variation have established robust statistical approaches to account for multiple testing. Here we present a comparable framework to evaluate rare and de novo noncoding single-nucleotide variants, insertion/deletions, and all classes of structural variation from whole-genome sequencing (WGS). Integrating genomic annotations at the level of nucleotides, genes, and regulatory regions, we define 51,801 annotation categories. Analyses of 519 autism spectrum disorder families did not identify association with any categories after correction for 4,123 effective tests. Without appropriate correction, biologically plausible associations are observed in both cases and controls. Despite excluding previously identified gene-disrupting mutations, coding regions still exhibited the strongest associations. Thus, in autism, the contribution of de novo noncoding variation is probably modest in comparison to that of de novo coding variants. Robust results from future WGS studies will require large cohorts and comprehensive analytical strategies that consider the substantial multiple-testing burden.
De novo assembly of mitochondrial genomes provides insights into genetic diversity and molecular evolution in wild boars and domestic pigs.

PubMed

Ni, Pan; Bhuiyan, Ali Akbar; Chen, Jian-Hai; Li, Jingjin; Zhang, Cheng; Zhao, Shuhong; Du, Xiaoyong; Li, Hua; Yu, Hui; Liu, Xiangdong; Li, Kui

2018-06-01

Up to date, the scarcity of publicly available complete mitochondrial sequences for European wild pigs hampers deeper understanding about the genetic changes following domestication. Here, we have assembled 26 de novo mtDNA sequences of European wild boars from next generation sequencing (NGS) data and downloaded 174 complete mtDNA sequences to assess the genetic relationship, nucleotide diversity, and selection. The Bayesian consensus tree reveals the clear divergence between the European and Asian clade and a very small portion (10 out of 200 samples) of maternal introgression. The overall nucleotides diversities of the mtDNA sequences have been reduced following domestication. Interestingly, the selection efficiencies in both European and Asian domestic pigs are reduced, probably caused by changes in both selection constraints and maternal population size following domestication. This study suggests that de novo assembled mitogenomes can be a great boon to uncover the genetic turnover following domestication. Further investigation is warranted to include more samples from the ever-increasing amounts of NGS data to help us to better understand the process of domestication.
Inferring transposons activity chronology by TRANScendence - TEs database and de-novo mining tool.

PubMed

Startek, Michał Piotr; Nogły, Jakub; Gromadka, Agnieszka; Grzebelus, Dariusz; Gambin, Anna

2017-10-16

The constant progress in sequencing technology leads to ever increasing amounts of genomic data. In the light of current evidence transposable elements (TEs for short) are becoming useful tools for learning about the evolution of host genome. Therefore the software for genome-wide detection and analysis of TEs is of great interest. Here we describe the computational tool for mining, classifying and storing TEs from newly sequenced genomes. This is an online, web-based, user-friendly service, enabling users to upload their own genomic data, and perform de-novo searches for TEs. The detected TEs are automatically analyzed, compared to reference databases, annotated, clustered into families, and stored in TEs repository. Also, the genome-wide nesting structure of found elements are detected and analyzed by new method for inferring evolutionary history of TEs. We illustrate the functionality of our tool by performing a full-scale analyses of TE landscape in Medicago truncatula genome. TRANScendence is an effective tool for the de-novo annotation and classification of transposable elements in newly-acquired genomes. Its streamlined interface makes it well-suited for evolutionary studies.

De novo mutations in ATP1A3 cause alternating hemiplegia of childhood

PubMed Central

Heinzen, Erin L.; Swoboda, Kathryn J.; Hitomi, Yuki; Gurrieri, Fiorella; Nicole, Sophie; de Vries, Boukje; Tiziano, F. Danilo; Fontaine, Bertrand; Walley, Nicole M.; Heavin, Sinéad; Panagiotakaki, Eleni; Fiori, Stefania; Abiusi, Emanuela; Di Pietro, Lorena; Sweney, Matthew T.; Newcomb, Tara M.; Viollet, Louis; Huff, Chad; Jorde, Lynn B.; Reyna, Sandra P.; Murphy, Kelley J.; Shianna, Kevin V.; Gumbs, Curtis E.; Little, Latasha; Silver, Kenneth; Ptác̆ek, Louis J.; Haan, Joost; Ferrari, Michel D.; Bye, Ann M.; Herkes, Geoffrey K.; Whitelaw, Charlotte M.; Webb, David; Lynch, Bryan J.; Uldall, Peter; King, Mary D.; Scheffer, Ingrid E.; Neri, Giovanni; Arzimanoglou, Alexis; van den Maagdenberg, Arn M.J.M.; Sisodiya, Sanjay M.; Mikati, Mohamad A.; Goldstein, David B.; Nicole, Sophie; Gurrieri, Fiorella; Neri, Giovanni; de Vries, Boukje; Koelewijn, Stephany; Kamphorst, Jessica; Geilenkirchen, Marije; Pelzer, Nadine; Laan, Laura; Haan, Joost; Ferrari, Michel; van den Maagdenberg, Arn; Zucca, Claudio; Bassi, Maria Teresa; Franchini, Filippo; Vavassori, Rosaria; Giannotta, Melania; Gobbi, Giuseppe; Granata, Tiziana; Nardocci, Nardo; De Grandis, Elisa; Veneselli, Edvige; Stagnaro, Michela; Gurrieri, Fiorella; Neri, Giovanni; Vigevano, Federico; Panagiotakaki, Eleni; Oechsler, Claudia; Arzimanoglou, Alexis; Nicole, Sophie; Giannotta, Melania; Gobbi, Giuseppe; Ninan, Miriam; Neville, Brian; Ebinger, Friedrich; Fons, Carmen; Campistol, Jaume; Kemlink, David; Nevsimalova, Sona; Laan, Laura; Peeters-Scholte, Cacha; van den Maagdenberg, Arn; Casaer, Paul; Casari, Giorgio; Sange, Guenter; Spiel, Georg; Boneschi, Filippo Martinelli; Zucca, Claudio; Bassi, Maria Teresa; Schyns, Tsveta; Crawley, Francis; Poncelin, Dominique; Vavassori, Rosaria

2012-01-01

Alternating hemiplegia of childhood (AHC) is a rare, severe neurodevelopmental syndrome characterized by recurrent hemiplegic episodes and distinct neurologic manifestations. AHC is usually a sporadic disorder with unknown etiology. Using exome sequencing of seven patients with AHC, and their unaffected parents, we identified de novo nonsynonymous mutations in ATP1A3 in all seven AHC patients. Subsequent sequence analysis of ATP1A3 in 98 additional patients revealed that 78% of AHC cases have a likely causal ATP1A3 mutation, including one inherited mutation in a familial case of AHC. Remarkably, six ATP1A3 mutations explain the majority of patients, including one observed in 36 patients. Unlike ATP1A3 mutations that cause rapid-onset-dystonia-parkinsonism, AHC-causing mutations revealed consistent reductions in ATPase activity without effects on protein expression. This work identifies de novo ATP1A3 mutations as the primary cause of AHC, and offers insight into disease pathophysiology by expanding the spectrum of phenotypes associated with mutations in this gene. PMID:22842232
Transposon fingerprinting using low coverage whole genome shotgun sequencing in Cacao (Theobroma cacao L.) and related species

PubMed Central

2013-01-01

Background Transposable elements (TEs) and other repetitive elements are a large and dynamically evolving part of eukaryotic genomes, especially in plants where they can account for a significant proportion of genome size. Their dynamic nature gives them the potential for use in identifying and characterizing crop germplasm. However, their repetitive nature makes them challenging to study using conventional methods of molecular biology. Next generation sequencing and new computational tools have greatly facilitated the investigation of TE variation within species and among closely related species. Results (i) We generated low-coverage Illumina whole genome shotgun sequencing reads for multiple individuals of cacao (Theobroma cacao) and related species. These reads were analysed using both an alignment/mapping approach and a de novo (graph based clustering) approach. (ii) A standard set of ultra-conserved orthologous sequences (UCOS) standardized TE data between samples and provided phylogenetic information on the relatedness of samples. (iii) The mapping approach proved highly effective within the reference species but underestimated TE abundance in interspecific comparisons relative to the de novo methods. (iv) Individual T. cacao accessions have unique patterns of TE abundance indicating that the TE composition of the genome is evolving actively within this species. (v) LTR/Gypsy elements are the most abundant, comprising c.10% of the genome. (vi) Within T. cacao the retroelement families show an order of magnitude greater sequence variability than the DNA transposon families. (vii) Theobroma grandiflorum has a similar TE composition to T. cacao, but the related genus Herrania is rather different, with LTRs making up a lower proportion of the genome, perhaps because of a massive presence (c. 20%) of distinctive low complexity satellite-like repeats in this genome. Conclusions (i) Short read alignment/mapping to reference TE contigs provides a simple and effective
Transposon fingerprinting using low coverage whole genome shotgun sequencing in cacao (Theobroma cacao L.) and related species.

PubMed

Sveinsson, Saemundur; Gill, Navdeep; Kane, Nolan C; Cronk, Quentin

2013-07-24

Transposable elements (TEs) and other repetitive elements are a large and dynamically evolving part of eukaryotic genomes, especially in plants where they can account for a significant proportion of genome size. Their dynamic nature gives them the potential for use in identifying and characterizing crop germplasm. However, their repetitive nature makes them challenging to study using conventional methods of molecular biology. Next generation sequencing and new computational tools have greatly facilitated the investigation of TE variation within species and among closely related species. (i) We generated low-coverage Illumina whole genome shotgun sequencing reads for multiple individuals of cacao (Theobroma cacao) and related species. These reads were analysed using both an alignment/mapping approach and a de novo (graph based clustering) approach. (ii) A standard set of ultra-conserved orthologous sequences (UCOS) standardized TE data between samples and provided phylogenetic information on the relatedness of samples. (iii) The mapping approach proved highly effective within the reference species but underestimated TE abundance in interspecific comparisons relative to the de novo methods. (iv) Individual T. cacao accessions have unique patterns of TE abundance indicating that the TE composition of the genome is evolving actively within this species. (v) LTR/Gypsy elements are the most abundant, comprising c.10% of the genome. (vi) Within T. cacao the retroelement families show an order of magnitude greater sequence variability than the DNA transposon families. (vii) Theobroma grandiflorum has a similar TE composition to T. cacao, but the related genus Herrania is rather different, with LTRs making up a lower proportion of the genome, perhaps because of a massive presence (c. 20%) of distinctive low complexity satellite-like repeats in this genome. (i) Short read alignment/mapping to reference TE contigs provides a simple and effective method of investigating
New genes from non-coding sequence: the role of de novo protein-coding genes in eukaryotic evolutionary innovation.

PubMed

McLysaght, Aoife; Guerzoni, Daniele

2015-09-26

The origin of novel protein-coding genes de novo was once considered so improbable as to be impossible. In less than a decade, and especially in the last five years, this view has been overturned by extensive evidence from diverse eukaryotic lineages. There is now evidence that this mechanism has contributed a significant number of genes to genomes of organisms as diverse as Saccharomyces, Drosophila, Plasmodium, Arabidopisis and human. From simple beginnings, these genes have in some instances acquired complex structure, regulated expression and important functional roles. New genes are often thought of as dispensable late additions; however, some recent de novo genes in human can play a role in disease. Rather than an extremely rare occurrence, it is now evident that there is a relatively constant trickle of proto-genes released into the testing ground of natural selection. It is currently unknown whether de novo genes arise primarily through an 'RNA-first' or 'ORF-first' pathway. Either way, evolutionary tinkering with this pool of genetic potential may have been a significant player in the origins of lineage-specific traits and adaptations. © 2015 The Authors.
A de novo FOXP1 variant in a patient with autism, intellectual disability and severe speech and language impairment.

PubMed

Lozano, Reymundo; Vino, Arianna; Lozano, Cristina; Fisher, Simon E; Deriziotis, Pelagia

2015-12-01

FOXP1 (forkhead box protein P1) is a transcription factor involved in the development of several tissues, including the brain. An emerging phenotype of patients with protein-disrupting FOXP1 variants includes global developmental delay, intellectual disability and mild to severe speech/language deficits. We report on a female child with a history of severe hypotonia, autism spectrum disorder and mild intellectual disability with severe speech/language impairment. Clinical exome sequencing identified a heterozygous de novo FOXP1 variant c.1267_1268delGT (p.V423Hfs*37). Functional analyses using cellular models show that the variant disrupts multiple aspects of FOXP1 activity, including subcellular localization and transcriptional repression properties. Our findings highlight the importance of performing functional characterization to help uncover the biological significance of variants identified by genomics approaches, thereby providing insight into pathways underlying complex neurodevelopmental disorders. Moreover, our data support the hypothesis that de novo variants represent significant causal factors in severe sporadic disorders and extend the phenotype seen in individuals with FOXP1 haploinsufficiency.
in silico Whole Genome Sequencer & Analyzer (iWGS): a computational pipeline to guide the design and analysis of de novo genome sequencing studies

USDA-ARS?s Scientific Manuscript database

The availability of genomes across the tree of life is highly biased toward vertebrates, pathogens, human disease models, and organisms with relatively small and simple genomes. Recent progress in genomics has enabled the de novo decoding of the genome of virtually any organism, greatly expanding it...
A de novo Mutation in KMT2A (MLL) in monozygotic twins with Wiedemann-Steiner syndrome.

PubMed

Dunkerton, Sophie; Field, Matthew; Cho, Vicki; Bertram, Edward; Whittle, Belinda; Groves, Alexandra; Goel, Himanshu

2015-09-01

Growth deficiency, psychomotor delay, and facial dysmorphism was originally described in a male patient in 1989 by Wiedemann et al. and later in 2000 by Steiner et al. Wiedemann-Steiner syndrome (WSS) has since been described only a few times in the literature, with the phenotypic spectrum both expanding and becoming more delineated with each patient reported. We report on the clinical and molecular features of monozygotic twins with a de novo mutation in KMT2A. Single nucleotide polymorphism (SNP) microarray was done on both twins and whole-exome sequencing was done using both parents and one of the affected twins. SNP microarray confirmed that they were monozygotic twins. A de novo heterozygous variant (p. Arg1083*) in the KMT2A gene was identified through whole-exome sequencing, confirming the diagnosis of WSS. In this study, we have identified a de novo mutation in KMT2A associated with psychomotor developmental delay, facial dysmorphism, short stature, hypertrichosis cubiti, and small kidneys. This finding in monozygotic twins gives specificity to the WSS. The description of more cases of WSS is needed for further delineation of this condition. Small kidneys with normal function have not been described in this condition in the medical literature before. © 2015 Wiley Periodicals, Inc.
A de novo redesign of the WW domain

PubMed Central

Kraemer-Pecore, Christina M.; Lecomte, Juliette T.J.; Desjarlais, John R.

2003-01-01

We have used a sequence prediction algorithm and a novel sampling method to design protein sequences for the WW domain, a small β-sheet motif. The procedure, referred to as SPANS, designs sequences to be compatible with an ensemble of closely related polypeptide backbones, mimicking the inherent flexibility of proteins. Two designed sequences (termed SPANS-WW1 and SPANS-WW2), using only naturally occurring l-amino acids, were selected for study and the corresponding polypeptides were prepared in Escherichia coli. Circular dichroism data suggested that both purified polypeptides adopted secondary structure features related to those of the target without the aid of disulfide bridges or bound cofactors. The structure exhibited by SPANS-WW2 melted cooperatively by raising the temperature of the solution. Further analysis of this polypeptide by proton nuclear magnetic resonance spectroscopy demonstrated that at 5°C, it folds into a structure closely resembling a natural WW domain. This achievement constitutes one of a small number of successful de novo protein designs through fully automated computational methods and highlights the feasibility of including backbone flexibility in the design strategy. PMID:14500877
Identification of a Novel De Novo Variant in the PAX3 Gene in Waardenburg Syndrome by Diagnostic Exome Sequencing: The First Molecular Diagnosis in Korea.

PubMed

Jang, Mi-Ae; Lee, Taeheon; Lee, Junnam; Cho, Eun-Hae; Ki, Chang-Seok

2015-05-01

Waardenburg syndrome (WS) is a clinically and genetically heterogeneous hereditary auditory pigmentary disorder characterized by congenital sensorineural hearing loss and iris discoloration. Many genes have been linked to WS, including PAX3, MITF, SNAI2, EDNRB, EDN3, and SOX10, and many additional genes have been associated with disorders with phenotypic overlap with WS. To screen all possible genes associated with WS and congenital deafness simultaneously, we performed diagnostic exome sequencing (DES) in a male patient with clinical features consistent with WS. Using DES, we identified a novel missense variant (c.220C>G; p.Arg74Gly) in exon 2 of the PAX3 gene in the patient. Further analysis by Sanger sequencing of the patient and his parents revealed a de novo occurrence of the variant. Our findings show that DES can be a useful tool for the identification of pathogenic gene variants in WS patients and for differentiation between WS and similar disorders. To the best of our knowledge, this is the first report of genetically confirmed WS in Korea.
Identification of a Novel De Novo Variant in the PAX3 Gene in Waardenburg Syndrome by Diagnostic Exome Sequencing: The First Molecular Diagnosis in Korea

PubMed Central

Jang, Mi-Ae; Lee, Taeheon; Lee, Junnam

2015-01-01

Waardenburg syndrome (WS) is a clinically and genetically heterogeneous hereditary auditory pigmentary disorder characterized by congenital sensorineural hearing loss and iris discoloration. Many genes have been linked to WS, including PAX3, MITF, SNAI2, EDNRB, EDN3, and SOX10, and many additional genes have been associated with disorders with phenotypic overlap with WS. To screen all possible genes associated with WS and congenital deafness simultaneously, we performed diagnostic exome sequencing (DES) in a male patient with clinical features consistent with WS. Using DES, we identified a novel missense variant (c.220C>G; p.Arg74Gly) in exon 2 of the PAX3 gene in the patient. Further analysis by Sanger sequencing of the patient and his parents revealed a de novo occurrence of the variant. Our findings show that DES can be a useful tool for the identification of pathogenic gene variants in WS patients and for differentiation between WS and similar disorders. To the best of our knowledge, this is the first report of genetically confirmed WS in Korea. PMID:25932447
A NGS approach to the encrusting Mediterranean sponge Crella elegans (Porifera, Demospongiae, Poecilosclerida): transcriptome sequencing, characterization and overview of the gene expression along three life cycle stages.

PubMed

Pérez-Porro, A R; Navarro-Gómez, D; Uriz, M J; Giribet, G

2013-05-01

Sponges can be dominant organisms in many marine and freshwater habitats where they play essential ecological roles. They also represent a key group to address important questions in early metazoan evolution. Recent approaches for improving knowledge on sponge biological and ecological functions as well as on animal evolution have focused on the genetic toolkits involved in ecological responses to environmental changes (biotic and abiotic), development and reproduction. These approaches are possible thanks to newly available, massive sequencing technologies-such as the Illumina platform, which facilitate genome and transcriptome sequencing in a cost-effective manner. Here we present the first NGS (next-generation sequencing) approach to understanding the life cycle of an encrusting marine sponge. For this we sequenced libraries of three different life cycle stages of the Mediterranean sponge Crella elegans and generated de novo transcriptome assemblies. Three assemblies were based on sponge tissue of a particular life cycle stage, including non-reproductive tissue, tissue with sperm cysts and tissue with larvae. The fourth assembly pooled the data from all three stages. By aggregating data from all the different life cycle stages we obtained a higher total number of contigs, contigs with blast hit and annotated contigs than from one stage-based assemblies. In that multi-stage assembly we obtained a larger number of the developmental regulatory genes known for metazoans than in any other assembly. We also advance the differential expression of selected genes in the three life cycle stages to explore the potential of RNA-seq for improving knowledge on functional processes along the sponge life cycle. © 2013 Blackwell Publishing Ltd.
Wegener's granulomatosis occurring de novo during pregnancy.

PubMed

Alfhaily, F; Watts, R; Leather, A

2009-01-01

Wegener's granulomatosis (WG) is rarely diagnosed during the reproductive years and uncommonly manifests for the first time during pregnancy. We report a case of de novo WG presenting at 30 weeks gestation with classical symptoms of WG (ENT, pulmonary). The diagnosis was confirmed by radiological, laboratory, and histological investigations. With a multidisciplinary approach, she had a successful vaginal delivery of a healthy baby. She was treated successfully by a combination of steroids, azathioprine and intravenous immunoglobulin in the active phase of disease for induction of remission and by azathioprine and steroids for maintenance of remission. The significant improvement in her symptoms allowed us to continue her pregnancy to 37 weeks when delivery was electively induced. Transplacental transmission of PR3-ANCA occurred but the neonate remained well. This case of de novo WG during pregnancy highlights the seriousness of this disease and the challenge in management of such patients.
Data compression of discrete sequence: A tree based approach using dynamic programming

NASA Technical Reports Server (NTRS)

Shivaram, Gurusrasad; Seetharaman, Guna; Rao, T. R. N.

1994-01-01

A dynamic programming based approach for data compression of a ID sequence is presented. The compression of an input sequence of size N to that of a smaller size k is achieved by dividing the input sequence into k subsequences and replacing the subsequences by their respective average values. The partitioning of the input sequence is carried with the intention of reducing the mean squared error in the reconstructed sequence. The complexity involved in finding the partitions which would result in such an optimal compressed sequence is reduced by using the dynamic programming approach, which is presented.
A recombinant isoform of the Ole e 7 olive pollen allergen assembled by de novo mass spectrometry retains the allergenic ability of the natural allergen.

PubMed

Oeo-Santos, Carmen; Mas, Salvador; Benedé, Sara; López-Lucendo, María; Quiralte, Joaquín; Blanca, Miguel; Mayorga, Cristobalina; Villalba, Mayte; Barderas, Rodrigo

2018-06-05

The allergenic non-specific lipid transfer protein Ole e 7 from olive pollen is a major allergen associated with severe symptoms in areas with high olive pollen levels. Despite its clinical importance, its cloning and recombinant production has been unable by classical approaches. This study aimed at determining by mass-spectrometry based proteomics its complete amino acid sequence for its subsequent expression and characterization. To this end, the natural protein was in-2D-gel tryptic digested, and CID and HCD fragmentation spectra obtained by nLC-MS/MS analyzed using PEAKS software. Thirteen out of the 457 de novo sequenced peptides obtained allowed assembling its full-length amino acid sequence. Then, Ole e 7-encoding cDNA was synthesized and cloned in pPICZαA vector for its expression in Pichia pastoris yeast. The analyses by Circular Dichroism, and WB, ELISA and cell-based tests using sera and blood from olive pollen-sensitized patients showed that rOle e 7 mostly retained the structural, allergenic and antigenic properties of the natural allergen. In summary, rOle e 7 allergen assembled by de novo peptide sequencing by MS behaved immunologically similar to the natural allergen scarcely isolated from pollen. Olive pollen is an important cause of allergy. The non-specific lipid binding protein Ole e 7 is a major allergen with a high incidence and a phenotype associated to severe clinical symptoms. Despite its relevance, its cloning and recombinant expression has been unable by classical techniques. Here, we have inferred the primary amino acid sequence of Ole e 7 by mass-spectrometry. We separated Ole e 7 isolated from pollen by 2DE. After in-gel digestion with trypsin and a direct analysis by nLC-MS/MS in an LTQ-Orbitrap Velos, we got the complete de novo sequenced peptides repertoire that allowed the assembling of the primary sequence of Ole e 7. After its protein expression, purification to homogeneity, and structural and immunological characterization
The contribution of de novo coding mutations to autism spectrum disorder.

PubMed

Iossifov, Ivan; O'Roak, Brian J; Sanders, Stephan J; Ronemus, Michael; Krumm, Niklas; Levy, Dan; Stessman, Holly A; Witherspoon, Kali T; Vives, Laura; Patterson, Karynne E; Smith, Joshua D; Paeper, Bryan; Nickerson, Deborah A; Dea, Jeanselle; Dong, Shan; Gonzalez, Luis E; Mandell, Jeffrey D; Mane, Shrikant M; Murtha, Michael T; Sullivan, Catherine A; Walker, Michael F; Waqar, Zainulabedin; Wei, Liping; Willsey, A Jeremy; Yamrom, Boris; Lee, Yoon-ha; Grabowska, Ewa; Dalkic, Ertugrul; Wang, Zihua; Marks, Steven; Andrews, Peter; Leotta, Anthony; Kendall, Jude; Hakker, Inessa; Rosenbaum, Julie; Ma, Beicong; Rodgers, Linda; Troge, Jennifer; Narzisi, Giuseppe; Yoon, Seungtai; Schatz, Michael C; Ye, Kenny; McCombie, W Richard; Shendure, Jay; Eichler, Evan E; State, Matthew W; Wigler, Michael

2014-11-13

Whole exome sequencing has proven to be a powerful tool for understanding the genetic architecture of human disease. Here we apply it to more than 2,500 simplex families, each having a child with an autistic spectrum disorder. By comparing affected to unaffected siblings, we show that 13% of de novo missense mutations and 43% of de novo likely gene-disrupting (LGD) mutations contribute to 12% and 9% of diagnoses, respectively. Including copy number variants, coding de novo mutations contribute to about 30% of all simplex and 45% of female diagnoses. Almost all LGD mutations occur opposite wild-type alleles. LGD targets in affected females significantly overlap the targets in males of lower intelligence quotient (IQ), but neither overlaps significantly with targets in males of higher IQ. We estimate that LGD mutation in about 400 genes can contribute to the joint class of affected females and males of lower IQ, with an overlapping and similar number of genes vulnerable to contributory missense mutation. LGD targets in the joint class overlap with published targets for intellectual disability and schizophrenia, and are enriched for chromatin modifiers, FMRP-associated genes and embryonically expressed genes. Most of the significance for the latter comes from affected females.
De novo transcriptome of the muga silkworm, Antheraea assamensis (Helfer).

PubMed

Chetia, Hasnahana; Kabiraj, Debajyoti; Singh, Deepika; Mosahari, Ponnala Vimal; Das, Suradip; Sharma, Pragya; Neog, Kartik; Sharma, Swagata; Jayaprakash, P; Bora, Utpal

2017-05-05

Antheraea assamensis (Lepidoptera: Saturniidae), is a semi-domesticated silkworm known to be endemic to Assam and the adjoining hilly areas of Northeast India. It is the only producer of a unique, commercially important variety of golden silk called "muga silk". Herein, we report the de novo transcriptome of A. assamensis reared on Machilus bombycina leaves for the first time. Short reads generated by high throughput sequencing of cDNA libraries from multiple tissues, viz. alimentary canal, silk gland and residual body of the 5 th instar of muga silkworm were assembled into transcripts via a de novo assembly pipeline followed by functional annotation and classification. A total of 1,21,433 transcripts were generated from ~231 million raw reads of which ~74% (89,583) were either allocated a functional annotation or categorized under Pfam/COG/KEGG categories. Identification of differentially expressed transcripts and their comparative sequence analysis revealed candidate genes related to silk synthesis, viz. silk gland factor-1 and 3, sericin-like transcript, etc. with conserved forkhead, homeo- and POU domains. Several candidate anti-microbial peptides which may have potential anti-bacterial, anti-fungal or anti-parasitic activity in A. assamensis were also identified. T/A and AT/TA were predicted to be the most abundant mono- and di-nucleotide simple sequence repeat markers in the transcriptome. Transcriptome validation was carried out by quantitative real-time PCR (qPCR) amplification of eight transcripts. The resources generated by this study will expand the periphery of existing genomic data on A. assamensis facilitating future in-depth studies on its unknown aspects. Copyright © 2017 Elsevier B.V. All rights reserved.
Transcriptome de novo assembly from next-generation sequencing and comparative analyses in the hexaploid salt marsh species Spartina maritima and Spartina alterniflora (Poaceae)

PubMed Central

Ferreira de Carvalho, J; Poulain, J; Da Silva, C; Wincker, P; Michon-Coudouel, S; Dheilly, A; Naquin, D; Boutte, J; Salmon, A; Ainouche, M

2013-01-01

Spartina species have a critical ecological role in salt marshes and represent an excellent system to investigate recurrent polyploid speciation. Using the 454 GS-FLX pyrosequencer, we assembled and annotated the first reference transcriptome (from roots and leaves) for two related hexaploid Spartina species that hybridize in Western Europe, the East American invasive Spartina alterniflora and the Euro-African S. maritima. The de novo read assembly generated 38 478 consensus sequences and 99% found an annotation using Poaceae databases, representing a total of 16 753 non-redundant genes. Spartina expressed sequence tags were mapped onto the Sorghum bicolor genome, where they were distributed among the subtelomeric arms of the 10 S. bicolor chromosomes, with high gene density correlation. Normalization of the complementary DNA library improved the number of annotated genes. Ecologically relevant genes were identified among GO biological function categories in salt and heavy metal stress response, C4 photosynthesis and in lignin and cellulose metabolism. Expression of some of these genes had been found to be altered by hybridization and genome duplication in a previous microarray-based study in Spartina. As these species are hexaploid, up to three duplicated homoeologs may be expected per locus. When analyzing sequence polymorphism at four different loci in S. maritima and S. alterniflora, we found up to four haplotypes per locus, suggesting the presence of two expressed homoeologous sequences with one or two allelic variants each. This reference transcriptome will allow analysis of specific Spartina genes of ecological or evolutionary interest, estimation of homoeologous gene expression variation using RNA-seq and further gene expression evolution analyses in natural populations. PMID:23149455
De novo activation of HBV with escape mutations from hepatitis B surface antibody after living donor liver transplantation.

PubMed

Ueda, Yoshihide; Marusawa, Hiroyuki; Egawa, Hiroto; Okamoto, Shinya; Ogura, Yasuhiro; Oike, Fumitaka; Nishijima, Norihiro; Takada, Yasutsugu; Uemoto, Shinji; Chiba, Tsutomu

2011-01-01

De novo activation of HBV occurs after liver transplantation from hepatitis B surface antigen (HBsAg)-negative and hepatitis B core antibody (anti-HBc)-positive donors, even under hepatitis B immunoglobulin (HBIG) prophylaxis. One reason for the activation of HBV is the emergence of HBV with escape mutations from hepatitis B surface antibody (anti-HBs). The aim of this study is to clarify the clinical features for de novo activation of HBV with anti-HBs escape mutations after liver transplantation. Clinical features of 75 patients who received HBIG prophylaxis >6 months after liver transplantation with liver grafts from anti-HBc-positive donors were retrospectively analysed. Among the 75 recipients, 19 (25%) developed de novo activation of HBV. Of the 19 recipients, the emergence of HBV with anti-HBs escape mutations was confirmed in 7 patients. The rate of de novo activation of HBV with anti-HBs escape mutations was 12% at 5 years. Sequence analysis revealed mutations in the common 'a' determinant region of the surface gene, including G145R, G145A and Q129P, in HBsAg. Administration of entecavir immediately after the occurrence of de novo HBV activation resolved hepatitis and induced clearance of serum HBsAg and HBV DNA in all four patients receiving entecavir. Escape mutations from anti-HBs caused de novo activation of HBV under HBIG prophylaxis after liver transplantation. Early administration of entecavir was effective on de novo activation of HBV with anti-HBs escape mutations.
Hybrid-denovo: a de novo OTU-picking pipeline integrating single-end and paired-end 16S sequence tags.

PubMed

Chen, Xianfeng; Johnson, Stephen; Jeraldo, Patricio; Wang, Junwen; Chia, Nicholas; Kocher, Jean-Pierre A; Chen, Jun

2018-03-01

Illumina paired-end sequencing has been increasingly popular for 16S rRNA gene-based microbiota profiling. It provides higher phylogenetic resolution than single-end reads due to a longer read length. However, the reverse read (R2) often has significant low base quality, and a large proportion of R2s will be discarded after quality control, resulting in a mixture of paired-end and single-end reads. A typical 16S analysis pipeline usually processes either paired-end or single-end reads but not a mixture. Thus, the quantification accuracy and statistical power will be reduced due to the loss of a large amount of reads. As a result, rare taxa may not be detectable with the paired-end approach, or low taxonomic resolution will result in a single-end approach. To have both the higher phylogenetic resolution provided by paired-end reads and the higher sequence coverage by single-end reads, we propose a novel OTU-picking pipeline, hybrid-denovo, that can process a hybrid of single-end and paired-end reads. Using high-quality paired-end reads as a gold standard, we show that hybrid-denovo achieved the highest correlation with the gold standard and performed better than the approaches based on paired-end or single-end reads in terms of quantifying the microbial diversity and taxonomic abundances. By applying our method to a rheumatoid arthritis (RA) data set, we demonstrated that hybrid-denovo captured more microbial diversity and identified more RA-associated taxa than a paired-end or single-end approach. Hybrid-denovo utilizes both paired-end and single-end 16S sequencing reads and is recommended for 16S rRNA gene targeted paired-end sequencing data.
PMS2 gene mutational analysis: direct cDNA sequencing to circumvent pseudogene interference.

PubMed

Wimmer, Katharina; Wernstedt, Annekatrin

2014-01-01

The presence of highly homologous pseudocopies can compromise the mutation analysis of a gene of interest. In particular, when using PCR-based strategies, pseudogene co-amplification has to be effectively prevented. This is often achieved by using primers designed to be parental gene specific according to the reference sequence and by applying stringent PCR conditions. However, there are cases in which this approach is of limited utility. For example, it has been shown that the PMS2 gene exchanges sequences with one of its pseudogenes, named PMS2CL. This results in functional PMS2 alleles containing pseudogene-derived sequences at their 3'-end and in nonfunctional PMS2CL pseudogene alleles that contain gene-derived sequences. Hence, the paralogues cannot be distinguished according to the reference sequence. This shortcoming can be effectively circumvented by using direct cDNA sequencing. This approach is based on the selective amplification of PMS2 transcripts in two overlapping 1.6-kb RT-PCR products. In addition to avoiding pseudogene co-amplification and allele dropout, this method has also the advantage that it allows to effectively identify deletions, splice mutations, and de novo retrotransposon insertions that escape the detection of most DNA-based mutation analysis protocols.

Approaches for in silico finishing of microbial genome sequences

PubMed Central

Kremer, Frederico Schmitt; McBride, Alan John Alexander; Pinto, Luciano da Silva

2017-01-01

Abstract The introduction of next-generation sequencing (NGS) had a significant effect on the availability of genomic information, leading to an increase in the number of sequenced genomes from a large spectrum of organisms. Unfortunately, due to the limitations implied by the short-read sequencing platforms, most of these newly sequenced genomes remained as “drafts”, incomplete representations of the whole genetic content. The previous genome sequencing studies indicated that finishing a genome sequenced by NGS, even bacteria, may require additional sequencing to fill the gaps, making the entire process very expensive. As such, several in silico approaches have been developed to optimize the genome assemblies and facilitate the finishing process. The present review aims to explore some free (open source, in many cases) tools that are available to facilitate genome finishing. PMID:28898352
Approaches for in silico finishing of microbial genome sequences.

PubMed

Kremer, Frederico Schmitt; McBride, Alan John Alexander; Pinto, Luciano da Silva

The introduction of next-generation sequencing (NGS) had a significant effect on the availability of genomic information, leading to an increase in the number of sequenced genomes from a large spectrum of organisms. Unfortunately, due to the limitations implied by the short-read sequencing platforms, most of these newly sequenced genomes remained as "drafts", incomplete representations of the whole genetic content. The previous genome sequencing studies indicated that finishing a genome sequenced by NGS, even bacteria, may require additional sequencing to fill the gaps, making the entire process very expensive. As such, several in silico approaches have been developed to optimize the genome assemblies and facilitate the finishing process. The present review aims to explore some free (open source, in many cases) tools that are available to facilitate genome finishing.
Malan syndrome: Sotos-like overgrowth with de novo NFIX sequence variants and deletions in six new patients and a review of the literature.

PubMed

Klaassens, Merel; Morrogh, Deborah; Rosser, Elisabeth M; Jaffer, Fatima; Vreeburg, Maaike; Bok, Levinus A; Segboer, Tim; van Belzen, Martine; Quinlivan, Ros M; Kumar, Ajith; Hurst, Jane A; Scott, Richard H

2015-05-01

De novo monoallelic variants in NFIX cause two distinct syndromes. Whole gene deletions, nonsense variants and missense variants affecting the DNA-binding domain have been seen in association with a Sotos-like phenotype that we propose is referred to as Malan syndrome. Frameshift and splice-site variants thought to avoid nonsense-mediated RNA decay have been seen in Marshall-Smith syndrome. We report six additional patients with Malan syndrome and de novo NFIX deletions or sequence variants and review the 20 patients now reported. The phenotype is characterised by moderate postnatal overgrowth and macrocephaly. Median height and head circumference in childhood are 2.0 and 2.3 standard deviations (SD) above the mean, respectively. There is overlap of the facial phenotype with NSD1-positive Sotos syndrome in some cases including a prominent forehead, high anterior hairline, downslanting palpebral fissures and prominent chin. Neonatal feeding difficulties and/or hypotonia have been reported in 30% of patients. Developmental delay/learning disability have been reported in all cases and are typically moderate. Ocular phenotypes are common, including strabismus (65%), nystagmus (25% ) and optic disc pallor/hypoplasia (25%). Other recurrent features include pectus excavatum (40%) and scoliosis (25%). Eight reported patients have a deletion also encompassing CACNA1A, haploinsufficiency of which causes episodic ataxia type 2 or familial hemiplegic migraine. One previous case had episodic ataxia and one case we report has had cyclical vomiting responsive to pizotifen. In individuals with this contiguous gene deletion syndrome, awareness of possible later neurological manifestations is important, although their penetrance is not yet clear.
A de novo mutation in the AGXT gene causing primary hyperoxaluria type 1.

PubMed

Williams, Emma L; Kemper, Markus J; Rumsby, Gill

2006-09-01

Primary hyperoxaluria type 1 is caused by mutations in the alanine-glyoxylate aminotransferase (AGXT) gene. In cases in which no mutation was identified, linkage analysis can be used to confirm or exclude the diagnosis in other siblings. We present a family in which a sibling of the index case predicted to have primary hyperoxaluria type 1 by means of linkage analysis failed to show hyperoxaluria during the following 7 years, putting the diagnosis into question. Whole-gene sequence analysis identified 2 causative mutations in the index case, of which only 1, c.646A (Gly216Arg), was inherited. The other sequence change, c.33_34insC, was a de novo mutation occurring on the paternal allele. This particular mutation is a relatively common cause of primary hyperoxaluria type 1. It occurs in a run of 8 cytosines and therefore potentially is susceptible to polymerase slippage. This case illustrates 2 important points. First, biochemical confirmation of a genetic diagnosis should always be made in siblings diagnosed by using genetic tests. Second, de novo mutations should be considered as a potential, albeit rare, cause of primary hyperoxaluria type 1.
Targeted amplicon sequencing (TAS): a scalable next-gen approach to multilocus, multitaxa phylogenetics.

PubMed

Bybee, Seth M; Bracken-Grissom, Heather; Haynes, Benjamin D; Hermansen, Russell A; Byers, Robert L; Clement, Mark J; Udall, Joshua A; Wilcox, Edward R; Crandall, Keith A

2011-01-01

Next-gen sequencing technologies have revolutionized data collection in genetic studies and advanced genome biology to novel frontiers. However, to date, next-gen technologies have been used principally for whole genome sequencing and transcriptome sequencing. Yet many questions in population genetics and systematics rely on sequencing specific genes of known function or diversity levels. Here, we describe a targeted amplicon sequencing (TAS) approach capitalizing on next-gen capacity to sequence large numbers of targeted gene regions from a large number of samples. Our TAS approach is easily scalable, simple in execution, neither time-nor labor-intensive, relatively inexpensive, and can be applied to a broad diversity of organisms and/or genes. Our TAS approach includes a bioinformatic application, BarcodeCrucher, to take raw next-gen sequence reads and perform quality control checks and convert the data into FASTA format organized by gene and sample, ready for phylogenetic analyses. We demonstrate our approach by sequencing targeted genes of known phylogenetic utility to estimate a phylogeny for the Pancrustacea. We generated data from 44 taxa using 68 different 10-bp multiplexing identifiers. The overall quality of data produced was robust and was informative for phylogeny estimation. The potential for this method to produce copious amounts of data from a single 454 plate (e.g., 325 taxa for 24 loci) significantly reduces sequencing expenses incurred from traditional Sanger sequencing. We further discuss the advantages and disadvantages of this method, while offering suggestions to enhance the approach.
Targeted Amplicon Sequencing (TAS): A Scalable Next-Gen Approach to Multilocus, Multitaxa Phylogenetics

PubMed Central

Bybee, Seth M.; Bracken-Grissom, Heather; Haynes, Benjamin D.; Hermansen, Russell A.; Byers, Robert L.; Clement, Mark J.; Udall, Joshua A.; Wilcox, Edward R.; Crandall, Keith A.

2011-01-01

Next-gen sequencing technologies have revolutionized data collection in genetic studies and advanced genome biology to novel frontiers. However, to date, next-gen technologies have been used principally for whole genome sequencing and transcriptome sequencing. Yet many questions in population genetics and systematics rely on sequencing specific genes of known function or diversity levels. Here, we describe a targeted amplicon sequencing (TAS) approach capitalizing on next-gen capacity to sequence large numbers of targeted gene regions from a large number of samples. Our TAS approach is easily scalable, simple in execution, neither time-nor labor-intensive, relatively inexpensive, and can be applied to a broad diversity of organisms and/or genes. Our TAS approach includes a bioinformatic application, BarcodeCrucher, to take raw next-gen sequence reads and perform quality control checks and convert the data into FASTA format organized by gene and sample, ready for phylogenetic analyses. We demonstrate our approach by sequencing targeted genes of known phylogenetic utility to estimate a phylogeny for the Pancrustacea. We generated data from 44 taxa using 68 different 10-bp multiplexing identifiers. The overall quality of data produced was robust and was informative for phylogeny estimation. The potential for this method to produce copious amounts of data from a single 454 plate (e.g., 325 taxa for 24 loci) significantly reduces sequencing expenses incurred from traditional Sanger sequencing. We further discuss the advantages and disadvantages of this method, while offering suggestions to enhance the approach. PMID:22002916
A novel approach using long‐read sequencing and ddPCR to investigate gonadal mosaicism and estimate recurrence risk in two families with developmental disorders

PubMed Central

Gudmundsson, Sanna; Johansson, Josefin; Ameur, Adam; Stattin, Eva‐Lena; Annerén, Göran; Malmgren, Helena; Frykholm, Carina

2017-01-01

Abstract Objective De novo mutations contribute significantly to severe early‐onset genetic disorders. Even if the mutation is apparently de novo, there is a recurrence risk due to parental germ line mosaicism, depending on in which gonadal generation the mutation occurred. Methods We demonstrate the power of using SMRT sequencing and ddPCR to determine parental origin and allele frequencies of de novo mutations in germ cells in two families whom had undergone assisted reproduction. Results In the first family, a TCOF1 variant c.3156C>T was identified in the proband with Treacher Collins syndrome. The variant affects splicing and was determined to be of paternal origin. It was present in <1% of the paternal germ cells, suggesting a very low recurrence risk. In the second family, the couple had undergone several unsuccessful pregnancies where a de novo mutation PTPN11 c.923A>C causing Noonan syndrome was identified. The variant was present in 40% of the paternal germ cells suggesting a high recurrence risk. Conclusions Our findings highlight a successful strategy to identify the parental origin of mutations and to investigate the recurrence risk in couples that have undergone assisted reproduction with an unknown donor or in couples with gonadal mosaicism that will undergo preimplantation genetic diagnosis. PMID:28921562
Mapping and phasing of structural variation in patient genomes using nanopore sequencing.

PubMed

Cretu Stancu, Mircea; van Roosmalen, Markus J; Renkens, Ivo; Nieboer, Marleen M; Middelkamp, Sjors; de Ligt, Joep; Pregno, Giulia; Giachino, Daniela; Mandrile, Giorgia; Espejo Valle-Inclan, Jose; Korzelius, Jerome; de Bruijn, Ewart; Cuppen, Edwin; Talkowski, Michael E; Marschall, Tobias; de Ridder, Jeroen; Kloosterman, Wigard P

2017-11-06

Despite improvements in genomics technology, the detection of structural variants (SVs) from short-read sequencing still poses challenges, particularly for complex variation. Here we analyse the genomes of two patients with congenital abnormalities using the MinION nanopore sequencer and a novel computational pipeline-NanoSV. We demonstrate that nanopore long reads are superior to short reads with regard to detection of de novo chromothripsis rearrangements. The long reads also enable efficient phasing of genetic variations, which we leveraged to determine the parental origin of all de novo chromothripsis breakpoints and to resolve the structure of these complex rearrangements. Additionally, genome-wide surveillance of inherited SVs reveals novel variants, missed in short-read data sets, a large proportion of which are retrotransposon insertions. We provide a first exploration of patient genome sequencing with a nanopore sequencer and demonstrate the value of long-read sequencing in mapping and phasing of SVs for both clinical and research applications.
Evaluation of Methods for de novo Genome assembly from High-throughput Sequencing Reads Reveals Dependencies that Affect the Quality of the Results

USDA-ARS?s Scientific Manuscript database

Recent developments in high-throughput sequencing technology have made low-cost sequencing an attractive approach for many genome analysis tasks. Increasing read lengths, improving quality and the production of increasingly larger numbers of usable sequences per instrument-run continue to make whole...
Neurofibromatosis-1 gene deletions and mutations in de novo adult acute myeloid leukemia.

PubMed

Boudry-Labis, Elise; Roche-Lestienne, Catherine; Nibourel, Olivier; Boissel, Nicolas; Terre, Christine; Perot, Christine; Eclache, Virginie; Gachard, Nathalie; Tigaud, Isabelle; Plessis, Ghislaine; Cuccuini, Wendy; Geffroy, Sandrine; Villenet, Céline; Figeac, Martin; Leprêtre, Frederic; Renneville, Aline; Cheok, Meyling; Soulier, Jean; Dombret, Hervé; Preudhomme, Claude

2013-04-01

Germline heterozygous alterations of the tumor-suppressor gene neurofibromatosis-1 (NF1) lead to neurofibromatosis type 1, a genetic disorder characterized by a higher risk to develop juvenile myelomonocytic leukemia and/or acute myeloid leukemia (AML). More recently, somatic 17q11 deletions encompassing NF1 have been described in many adult myeloid malignancies. In this context, we aimed to define NF1 involvement in AML. We screened a total of 488 previously untreated de novo AML patients for the NF1 deletion using either array comparative genomic hybridization (aCGH) or real-time quantitative PCR/fluorescence in situ hybridization approaches. We also applied massively parallel sequencing for in depth mutation analysis of NF1 in 20 patients including five NF1-deleted patients. We defined a small ∼0.3 Mb minimal deleted region involving NF1 by aCGH and an overall frequency of NF1 deletion of 3.5% (17/485). NF1 deletion is significantly associated with unfavorable cytogenetics and with monosomal karyotype notably. We discovered six NF1 variants of unknown significance in 7/20 patients of which only one out of four disappeared in corresponding complete remission sample. In addition, only one out of five NF1-deleted patients has an acquired coding mutation in the remaining allele. In conclusion, direct NF1 inactivation is infrequent in de novo AML and may be a secondary event probably involved in leukemic progression. Copyright © 2013 Wiley Periodicals, Inc.
Non-invasive prenatal diagnosis of achondroplasia and thanatophoric dysplasia: next-generation sequencing allows for a safer, more accurate, and comprehensive approach.

PubMed

Chitty, Lyn S; Mason, Sarah; Barrett, Angela N; McKay, Fiona; Lench, Nicholas; Daley, Rebecca; Jenkins, Lucy A

2015-07-01

Accurate prenatal diagnosis of genetic conditions can be challenging and usually requires invasive testing. Here, we demonstrate the potential of next-generation sequencing (NGS) for the analysis of cell-free DNA in maternal blood to transform prenatal diagnosis of monogenic disorders. Analysis of cell-free DNA using a PCR and restriction enzyme digest (PCR-RED) was compared with a novel NGS assay in pregnancies at risk of achondroplasia and thanatophoric dysplasia. PCR-RED was performed in 72 cases and was correct in 88.6%, inconclusive in 7% with one false negative. NGS was performed in 47 cases and was accurate in 96.2% with no inconclusives. Both approaches were used in 27 cases, with NGS giving the correct result in the two cases inconclusive with PCR-RED. NGS provides an accurate, flexible approach to non-invasive prenatal diagnosis of de novo and paternally inherited mutations. It is more sensitive than PCR-RED and is ideal when screening a gene with multiple potential pathogenic mutations. These findings highlight the value of NGS in the development of non-invasive prenatal diagnosis for other monogenic disorders. © 2015 John Wiley & Sons, Ltd.
BAC sequencing using pooled methods.

PubMed

Saski, Christopher A; Feltus, F Alex; Parida, Laxmi; Haiminen, Niina

2015-01-01

Shotgun sequencing and assembly of a large, complex genome can be both expensive and challenging to accurately reconstruct the true genome sequence. Repetitive DNA arrays, paralogous sequences, polyploidy, and heterozygosity are main factors that plague de novo genome sequencing projects that typically result in highly fragmented assemblies and are difficult to extract biological meaning. Targeted, sub-genomic sequencing offers complexity reduction by removing distal segments of the genome and a systematic mechanism for exploring prioritized genomic content through BAC sequencing. If one isolates and sequences the genome fraction that encodes the relevant biological information, then it is possible to reduce overall sequencing costs and efforts that target a genomic segment. This chapter describes the sub-genome assembly protocol for an organism based upon a BAC tiling path derived from a genome-scale physical map or from fine mapping using BACs to target sub-genomic regions. Methods that are described include BAC isolation and mapping, DNA sequencing, and sequence assembly.
New observations on maternal age effect on germline de novo mutations.

PubMed

Wong, Wendy S W; Solomon, Benjamin D; Bodian, Dale L; Kothiyal, Prachi; Eley, Greg; Huddleston, Kathi C; Baker, Robin; Thach, Dzung C; Iyer, Ramaswamy K; Vockley, Joseph G; Niederhuber, John E

2016-01-19

Germline mutations are the source of evolution and contribute substantially to many health-related processes. Here we use whole-genome deep sequencing data from 693 parents-offspring trios to examine the de novo point mutations (DNMs) in the offspring. Our estimate for the mutation rate per base pair per generation is 1.05 × 10(-8), well within the range of previous studies. We show that maternal age has a small but significant correlation with the total number of DNMs in the offspring after controlling for paternal age (0.51 additional mutations per year, 95% CI: 0.29, 0.73), which was not detectable in the smaller and younger parental cohorts of earlier studies. Furthermore, while the total number of DNMs increases at a constant rate for paternal age, the contribution from the mother increases at an accelerated rate with age.These observations have implications related to the incidence of de novo mutations relating to maternal age.
De novo mutations in HCN1 cause early infantile epileptic encephalopathy.

PubMed

Nava, Caroline; Dalle, Carine; Rastetter, Agnès; Striano, Pasquale; de Kovel, Carolien G F; Nabbout, Rima; Cancès, Claude; Ville, Dorothée; Brilstra, Eva H; Gobbi, Giuseppe; Raffo, Emmanuel; Bouteiller, Delphine; Marie, Yannick; Trouillard, Oriane; Robbiano, Angela; Keren, Boris; Agher, Dahbia; Roze, Emmanuel; Lesage, Suzanne; Nicolas, Aude; Brice, Alexis; Baulac, Michel; Vogt, Cornelia; El Hajj, Nady; Schneider, Eberhard; Suls, Arvid; Weckhuysen, Sarah; Gormley, Padhraig; Lehesjoki, Anna-Elina; De Jonghe, Peter; Helbig, Ingo; Baulac, Stéphanie; Zara, Federico; Koeleman, Bobby P C; Haaf, Thomas; LeGuern, Eric; Depienne, Christel

2014-06-01

Hyperpolarization-activated, cyclic nucleotide-gated (HCN) channels contribute to cationic Ih current in neurons and regulate the excitability of neuronal networks. Studies in rat models have shown that the Hcn1 gene has a key role in epilepsy, but clinical evidence implicating HCN1 mutations in human epilepsy is lacking. We carried out exome sequencing for parent-offspring trios with fever-sensitive, intractable epileptic encephalopathy, leading to the discovery of two de novo missense HCN1 mutations. Screening of follow-up cohorts comprising 157 cases in total identified 4 additional amino acid substitutions. Patch-clamp recordings of Ih currents in cells expressing wild-type or mutant human HCN1 channels showed that the mutations had striking but divergent effects on homomeric channels. Individuals with mutations had clinical features resembling those of Dravet syndrome with progression toward atypical absences, intellectual disability and autistic traits. These findings provide clear evidence that de novo HCN1 point mutations cause a recognizable early-onset epileptic encephalopathy in humans.
Using whole-exome sequencing to identify variants inherited from mosaic parents

PubMed Central

Rios, Jonathan J; Delgado, Mauricio R

2015-01-01

Whole-exome sequencing (WES) has allowed the discovery of genes and variants causing rare human disease. This is often achieved by comparing nonsynonymous variants between unrelated patients, and particularly for sporadic or recessive disease, often identifies a single or few candidate genes for further consideration. However, despite the potential for this approach to elucidate the genetic cause of rare human disease, a majority of patients fail to realize a genetic diagnosis using standard exome analysis methods. Although genetic heterogeneity contributes to the difficulty of exome sequence analysis between patients, it remains plausible that rare human disease is not caused by de novo or recessive variants. Multiple human disorders have been described for which the variant was inherited from a phenotypically normal mosaic parent. Here we highlight the potential for exome sequencing to identify a reasonable number of candidate genes when dominant disease variants are inherited from a mosaic parent. We show the power of WES to identify a limited number of candidate genes using this disease model and how sequence coverage affects identification of mosaic variants by WES. We propose this analysis as an alternative to discover genetic causes of rare human disorders for which typical WES approaches fail to identify likely pathogenic variants. PMID:24986828
De novo Assembly and Characterization of Cajanus scarabaeoides (L.) Thouars Transcriptome by Paired-End Sequencing

PubMed Central

Nigam, Deepti; Saxena, Swati; Ramakrishna, G.; Singh, Archana; Singh, N. K.; Gaikwad, Kishor

2017-01-01

Pigeonpea [Cajanus cajan (L.) Millsp.] is a heat and drought resilient legume crop grown mostly in Asia and Africa. Pigeonpea is affected by various biotic (diseases and insect pests) and abiotic stresses (salinity and water logging) which limit the yield potential of this crop. However, resistance to all these constraints is not readily available in the cultivated genotypes and some of the wild relatives have been found to withstand these resistances. Thus, the utilization of crop wild relatives (CWR) in pigeonpea breeding has been effective in conferring resistance, quality and breeding efficiency traits to this crop. Bud and leaf tissue of Cajanus scarabaeoides, a wild relative of pigeon pea were used for transcriptome profiling. Approximately 30 million clean reads filtered from raw reads by removal of adaptors, ambiguous reads and low-quality reads (3.02 gigabase pairs) were generated by Illumina paired-end RNA-seq technology. All of these clean reads were pooled and assembled de novo into 1,17,007 transcripts using the Trinity. Finally, a total of 98,664 unigenes were derived with mean length of 396 bp and N50 values of 1393. The assembly produced significant mapping results (73.68%) in BLASTN searches of the Glycine max CDS sequence database (Ensembl). Further, uniprot database of Viridiplantae was used for unigene annotation; 81,799 of 98,664 (82.90%) unigenes were finally annotated with gene descriptions or conserved protein domains. Further, a total of 23,475 SSRs were identified in 27,321 unigenes. This data will provide useful information for mining of functionally important genes and SSR markers for pigeonpea improvement. PMID:28748187
Whole genome sequencing and integrative genomic analysis approach on two 22q11.2 deletion syndrome family trios for genotype to phenotype correlations

PubMed Central

Chung, Jonathan H.; Cai, Jinlu; Suskin, Barrie G.; Zhang, Zhengdong; Coleman, Karlene

2015-01-01

The 22q11.2 deletion syndrome (22q11DS) affects 1:4000 live births and presents with highly variable phenotype expressivity. In this study, we developed an analytical approach utilizing whole genome sequencing and integrative analysis to discover genetic modifiers. Our pipeline combined available tools in order to prioritize rare, predicted deleterious, coding and non-coding single nucleotide variants (SNVs) and insertion/deletions (INDELs) from whole genome sequencing (WGS). We sequenced two unrelated probands with 22q11DS, with contrasting clinical findings, and their unaffected parents. Proband P1 had cognitive impairment, psychotic episodes, anxiety, and tetralogy of Fallot (TOF); while proband P2 had juvenile rheumatoid arthritis but no other major clinical findings. In P1, we identified common variants in COMT and PRODH on 22q11.2 as well as rare potentially deleterious DNA variants in other behavioral/neurocognitive genes. We also identified a de novo SNV in ADNP2 (NM_014913.3:c.2243G>C), encoding a neuroprotective protein that may be involved in behavioral disorders. In P2, we identified a novel non-synonymous SNV in ZFPM2 (NM_012082.3:c.1576C>T), a known causative gene for TOF, which may act as a protective variant downstream of TBX1, haploinsufficiency of which is responsible for congenital heart disease in individuals with 22q11DS. PMID:25981510
Integrating multiple genomic data to predict disease-causing nonsynonymous single nucleotide variants in exome sequencing studies.

PubMed

Wu, Jiaxin; Li, Yanda; Jiang, Rui

2014-03-01

Exome sequencing has been widely used in detecting pathogenic nonsynonymous single nucleotide variants (SNVs) for human inherited diseases. However, traditional statistical genetics methods are ineffective in analyzing exome sequencing data, due to such facts as the large number of sequenced variants, the presence of non-negligible fraction of pathogenic rare variants or de novo mutations, and the limited size of affected and normal populations. Indeed, prevalent applications of exome sequencing have been appealing for an effective computational method for identifying causative nonsynonymous SNVs from a large number of sequenced variants. Here, we propose a bioinformatics approach called SPRING (Snv PRioritization via the INtegration of Genomic data) for identifying pathogenic nonsynonymous SNVs for a given query disease. Based on six functional effect scores calculated by existing methods (SIFT, PolyPhen2, LRT, MutationTaster, GERP and PhyloP) and five association scores derived from a variety of genomic data sources (gene ontology, protein-protein interactions, protein sequences, protein domain annotations and gene pathway annotations), SPRING calculates the statistical significance that an SNV is causative for a query disease and hence provides a means of prioritizing candidate SNVs. With a series of comprehensive validation experiments, we demonstrate that SPRING is valid for diseases whose genetic bases are either partly known or completely unknown and effective for diseases with a variety of inheritance styles. In applications of our method to real exome sequencing data sets, we show the capability of SPRING in detecting causative de novo mutations for autism, epileptic encephalopathies and intellectual disability. We further provide an online service, the standalone software and genome-wide predictions of causative SNVs for 5,080 diseases at http://bioinfo.au.tsinghua.edu.cn/spring.
The genetic landscape of paediatric de novo acute myeloid leukaemia as defined by single nucleotide polymorphism array and exon sequencing of 100 candidate genes.

PubMed

Olsson, Linda; Zettermark, Sofia; Biloglav, Andrea; Castor, Anders; Behrendtz, Mikael; Forestier, Erik; Paulsson, Kajsa; Johansson, Bertil

2016-07-01

Cytogenetic analyses of a consecutive series of 67 paediatric (median age 8 years; range 0-17) de novo acute myeloid leukaemia (AML) patients revealed aberrations in 55 (82%) cases. The most common subgroups were KMT2A rearrangement (29%), normal karyotype (15%), RUNX1-RUNX1T1 (10%), deletions of 5q, 7q and/or 17p (9%), myeloid leukaemia associated with Down syndrome (7%), PML-RARA (7%) and CBFB-MYH11 (5%). Single nucleotide polymorphism array (SNP-A) analysis and exon sequencing of 100 genes, performed in 52 and 40 cases, respectively (39 overlapping), revealed ≥1 aberration in 89%; when adding cytogenetic data, this frequency increased to 98%. Uniparental isodisomies (UPIDs) were detected in 13% and copy number aberrations (CNAs) in 63% (median 2/case); three UPIDs and 22 CNAs were recurrent. Twenty-two genes were targeted by focal CNAs, including AEBP2 and PHF6 deletions and genes involved in AML-associated gene fusions. Deep sequencing identified mutations in 65% of cases (median 1/case). In total, 60 mutations were found in 30 genes, primarily those encoding signalling proteins (47%), transcription factors (25%), or epigenetic modifiers (13%). Twelve genes (BCOR, CEBPA, FLT3, GATA1, KIT, KRAS, NOTCH1, NPM1, NRAS, PTPN11, SMC3 and TP53) were recurrently mutated. We conclude that SNP-A and deep sequencing analyses complement the cytogenetic diagnosis of paediatric AML. © 2016 John Wiley & Sons Ltd.
T-IDBA: A de novo Iterative de Bruijn Graph Assembler for Transcriptome

NASA Astrophysics Data System (ADS)

Peng, Yu; Leung, Henry C. M.; Yiu, S. M.; Chin, Francis Y. L.

RNA-seq data produced by next-generation sequencing technology is a useful tool for analyzing transcriptomes. However, existing de novo transcriptome assemblers do not fully utilize the properties of transcriptomes and may result in short contigs because of the splicing nature (shared exons) of the genes. We propose the T-IDBA algorithm to reconstruct expressed isoforms without reference genome. By using pair-end information to solve the problem of long repeats in different genes and branching in the same gene due to alternative splicing, the graph can be decomposed into small components, each corresponds to a gene. The most possible isoforms with sufficient support from the pair-end reads will be found heuristically. In practice, our de novo transcriptome assembler, T-IDBA, outperforms Abyss substantially in terms of sensitivity and precision for both simulated and real data. T-IDBA is available at http://www.cs.hku.hk/~alse/tidba/

Enhancing genome assemblies by integrating non-sequence based data

PubMed Central

2011-01-01

Introduction Many genome projects were underway before the advent of high-throughput sequencing and have thus been supported by a wealth of genome information from other technologies. Such information frequently takes the form of linkage and physical maps, both of which can provide a substantial amount of data useful in de novo sequencing projects. Furthermore, the recent abundance of genome resources enables the use of conserved synteny maps identified in related species to further enhance genome assemblies. Methods The tammar wallaby (Macropus eugenii) is a model marsupial mammal with a low coverage genome. However, we have access to extensive comparative maps containing over 14,000 markers constructed through the physical mapping of conserved loci, chromosome painting and comprehensive linkage maps. Using a custom Bioperl pipeline, information from the maps was aligned to assembled tammar wallaby contigs using BLAT. This data was used to construct pseudo paired-end libraries with intervals ranging from 5-10 MB. We then used Bambus (a program designed to scaffold eukaryotic genomes by ordering and orienting contigs through the use of paired-end data) to scaffold our libraries. To determine how map data compares to sequence based approaches to enhance assemblies, we repeated the experiment using a 0.5× coverage of unique reads from 4 KB and 8 KB Illumina paired-end libraries. Finally, we combined both the sequence and non-sequence-based data to determine how a combined approach could further enhance the quality of the low coverage de novo reconstruction of the tammar wallaby genome. Results Using the map data alone, we were able order 2.2% of the initial contigs into scaffolds, and increase the N50 scaffold size to 39 KB (36 KB in the original assembly). Using only the 0.5× paired-end sequence based data, 53% of the initial contigs were assigned to scaffolds. Combining both data sets resulted in a further 2% increase in the number of initial contigs integrated
Enhancing genome assemblies by integrating non-sequence based data.

PubMed

Heider, Thomas N; Lindsay, James; Wang, Chenwei; O'Neill, Rachel J; Pask, Andrew J

2011-05-28

Many genome projects were underway before the advent of high-throughput sequencing and have thus been supported by a wealth of genome information from other technologies. Such information frequently takes the form of linkage and physical maps, both of which can provide a substantial amount of data useful in de novo sequencing projects. Furthermore, the recent abundance of genome resources enables the use of conserved synteny maps identified in related species to further enhance genome assemblies. The tammar wallaby (Macropus eugenii) is a model marsupial mammal with a low coverage genome. However, we have access to extensive comparative maps containing over 14,000 markers constructed through the physical mapping of conserved loci, chromosome painting and comprehensive linkage maps. Using a custom Bioperl pipeline, information from the maps was aligned to assembled tammar wallaby contigs using BLAT. This data was used to construct pseudo paired-end libraries with intervals ranging from 5-10 MB. We then used Bambus (a program designed to scaffold eukaryotic genomes by ordering and orienting contigs through the use of paired-end data) to scaffold our libraries. To determine how map data compares to sequence based approaches to enhance assemblies, we repeated the experiment using a 0.5× coverage of unique reads from 4 KB and 8 KB Illumina paired-end libraries. Finally, we combined both the sequence and non-sequence-based data to determine how a combined approach could further enhance the quality of the low coverage de novo reconstruction of the tammar wallaby genome. Using the map data alone, we were able order 2.2% of the initial contigs into scaffolds, and increase the N50 scaffold size to 39 KB (36 KB in the original assembly). Using only the 0.5× paired-end sequence based data, 53% of the initial contigs were assigned to scaffolds. Combining both data sets resulted in a further 2% increase in the number of initial contigs integrated into a scaffold (55% total
Characterization of GM events by insert knowledge adapted re-sequencing approaches

PubMed Central

Yang, Litao; Wang, Congmao; Holst-Jensen, Arne; Morisset, Dany; Lin, Yongjun; Zhang, Dabing

2013-01-01

Detection methods and data from molecular characterization of genetically modified (GM) events are needed by stakeholders of public risk assessors and regulators. Generally, the molecular characteristics of GM events are incomprehensively revealed by current approaches and biased towards detecting transformation vector derived sequences. GM events are classified based on available knowledge of the sequences of vectors and inserts (insert knowledge). Herein we present three insert knowledge-adapted approaches for characterization GM events (TT51-1 and T1c-19 rice as examples) based on paired-end re-sequencing with the advantages of comprehensiveness, accuracy, and automation. The comprehensive molecular characteristics of two rice events were revealed with additional unintended insertions comparing with the results from PCR and Southern blotting. Comprehensive transgene characterization of TT51-1 and T1c-19 is shown to be independent of a priori knowledge of the insert and vector sequences employing the developed approaches. This provides an opportunity to identify and characterize also unknown GM events. PMID:24088728
Characterization of GM events by insert knowledge adapted re-sequencing approaches.

PubMed

Yang, Litao; Wang, Congmao; Holst-Jensen, Arne; Morisset, Dany; Lin, Yongjun; Zhang, Dabing

2013-10-03

Detection methods and data from molecular characterization of genetically modified (GM) events are needed by stakeholders of public risk assessors and regulators. Generally, the molecular characteristics of GM events are incomprehensively revealed by current approaches and biased towards detecting transformation vector derived sequences. GM events are classified based on available knowledge of the sequences of vectors and inserts (insert knowledge). Herein we present three insert knowledge-adapted approaches for characterization GM events (TT51-1 and T1c-19 rice as examples) based on paired-end re-sequencing with the advantages of comprehensiveness, accuracy, and automation. The comprehensive molecular characteristics of two rice events were revealed with additional unintended insertions comparing with the results from PCR and Southern blotting. Comprehensive transgene characterization of TT51-1 and T1c-19 is shown to be independent of a priori knowledge of the insert and vector sequences employing the developed approaches. This provides an opportunity to identify and characterize also unknown GM events.
Prenatal diagnosis for a Chinese family with a de novo DMD gene mutation

PubMed Central

Li, Tao; Zhang, Zhao-jing; Ma, Xin; Lv, Xue; Xiao, Hai; Guo, Qian-nan; Liu, Hong-yan; Wang, Hong-dan; Wu, Dong; Lou, Gui-yu; Wang, Xin; Zhang, Chao-yang; Liao, Shi-xiu

2017-01-01

Abstract Background: Patients with Duchenne muscular dystrophy (DMD) usually have severe and fatal symptoms. At present, there is no effective treatment for DMD, thus it is very important to avoid the birth of children with DMD by effective prenatal diagnosis. We identified a de novo DMD gene mutation in a Chinese family, and make a prenatal diagnosis. Methods: First, multiplex ligation-dependent probe amplification (MLPA) was applied to analyze DMD gene exon deletion/duplication in all family members. The coding sequences of 79 exons in DMD gene were analyzed by Sanger sequencing in the patient; and then according to DMD gene exon mutation in the patient, DMD gene sequencing was performed in the family members. On the basis of results above, the pathogenic mutation in DMD gene was identified. Results: MLPA showed no DMD gene exon deletion/duplication in all family members. Sanger sequencing revealed c.2767_2767delT [p.Ser923LeufsX26] mutation in DMD gene of the patient. Heterozygous deletion mutation (T/-) at this locus was observed in the pregnant woman and her mother and younger sister. The analyses of amniotic fluid samples indicated negative Y chromosome sex-determining gene, no DMD gene exon deletion/duplication, no mutations at c.2767 locus, and the inherited maternal X chromosome different from that of the patient. Conclusion: The pathogenic mutation in DMD gene, c.2767_2767delT [p.Ser923LeufsX26], identified in this family is a de novo mutation. On the basis of specific conditions, it is necessary to select suitable methods to make prenatal diagnosis more effective, accurate, and economic. PMID:29390271
De novo transcriptome sequencing and digital gene expression analysis predict biosynthetic pathway of rhynchophylline and isorhynchophylline from Uncaria rhynchophylla, a non-model plant with potent anti-alzheimer's properties.

PubMed

Guo, Qianqian; Ma, Xiaojun; Wei, Shugen; Qiu, Deyou; Wilson, Iain W; Wu, Peng; Tang, Qi; Liu, Lijun; Dong, Shoukun; Zu, Wei

2014-08-12

The major medicinal alkaloids isolated from Uncaria rhynchophylla (gouteng in chinese) capsules are rhynchophylline (RIN) and isorhynchophylline (IRN). Extracts containing these terpene indole alkaloids (TIAs) can inhibit the formation and destabilize preformed fibrils of amyloid β protein (a pathological marker of Alzheimer's disease), and have been shown to improve the cognitive function of mice with Alzheimer-like symptoms. The biosynthetic pathways of RIN and IRN are largely unknown. In this study, RNA-sequencing of pooled Uncaria capsules RNA samples taken at three developmental stages that accumulate different amount of RIN and IRN was performed. More than 50 million high-quality reads from a cDNA library were generated and de novo assembled. Sequences for all of the known enzymes involved in TIAs synthesis were identified. Additionally, 193 cytochrome P450 (CYP450), 280 methyltransferase and 144 isomerase genes were identified, that are potential candidates for enzymes involved in RIN and IRN synthesis. Digital gene expression profile (DGE) analysis was performed on the three capsule developmental stages, and based on genes possessing expression profiles consistent with RIN and IRN levels; four CYP450s, three methyltransferases and three isomerases were identified as the candidates most likely to be involved in the later steps of RIN and IRN biosynthesis. A combination of de novo transcriptome assembly and DGE analysis was shown to be a powerful method for identifying genes encoding enzymes potentially involved in the biosynthesis of important secondary metabolites in a non-model plant. The transcriptome data from this study provides an important resource for understanding the formation of major bioactive constituents in the capsule extract from Uncaria, and provides information that may aid in metabolic engineering to increase yields of these important alkaloids.
A novel approach using long-read sequencing and ddPCR to investigate gonadal mosaicism and estimate recurrence risk in two families with developmental disorders.

PubMed

Wilbe, Maria; Gudmundsson, Sanna; Johansson, Josefin; Ameur, Adam; Stattin, Eva-Lena; Annerén, Göran; Malmgren, Helena; Frykholm, Carina; Bondeson, Marie-Louise

2017-11-01

De novo mutations contribute significantly to severe early-onset genetic disorders. Even if the mutation is apparently de novo, there is a recurrence risk due to parental germ line mosaicism, depending on in which gonadal generation the mutation occurred. We demonstrate the power of using SMRT sequencing and ddPCR to determine parental origin and allele frequencies of de novo mutations in germ cells in two families whom had undergone assisted reproduction. In the first family, a TCOF1 variant c.3156C>T was identified in the proband with Treacher Collins syndrome. The variant affects splicing and was determined to be of paternal origin. It was present in <1% of the paternal germ cells, suggesting a very low recurrence risk. In the second family, the couple had undergone several unsuccessful pregnancies where a de novo mutation PTPN11 c.923A>C causing Noonan syndrome was identified. The variant was present in 40% of the paternal germ cells suggesting a high recurrence risk. Our findings highlight a successful strategy to identify the parental origin of mutations and to investigate the recurrence risk in couples that have undergone assisted reproduction with an unknown donor or in couples with gonadal mosaicism that will undergo preimplantation genetic diagnosis. © 2017 The Authors Prenatal Diagnosis published by John Wiley & Sons Ltd.
Analysis of de novo sequencing and transcriptome assembly and lignocellulolytic enzymes gene expression of Coriolopsis gallica HTC.

PubMed

Chen, Yuehong; Cao, Qinghua; Tao, Xiang; Shao, Huanhuan; Zhang, Kun; Zhang, Yizheng; Tan, Xuemei

2017-03-01

White-rot basidiomycete Coriolopsis gallica HTC is one of the main biodegraders of poplar. In our previous study, we have shown the strong capacity of C. gallica HTC to degrade lignocellulose. In this study, equal amounts of total RNA fromC. Gallica HTC cultures grown in different conditions were pooled together. Illumina paired-end RNA sequencing was performed, and 13.2 million 90-bp paired-end reads were generated. We chose the Merged Assembly of Oases data-set for the following blast searches and gene ontology analyses. The reads were assembled de novo into 28,034 transcripts (≥ 100 bp) using combined assembly strategy MAO. The transcripts were annotated using Blast2GO. In all, 18,810 transcripts (≥100 bp) achieved BLASTX hits, of which, 7048 transcripts had GO term and 2074 had ECs. The expression level of 11 lignocellulolytic enzyme genes from the assembled C. gallica HTC transcriptome were detected by real-time quantitative polymerase chain reaction. The results showed that expression levels of these genes were affected by carbon source and nitrogen source at the level of transcription. The current abundant transcriptome data allowed the identification of many new transcripts in C. gallica HTC. Data provided here represent the most comprehensive and integrated genomic resources for cloning and identifying genes of interest from C. gallica HTC. Characterization of C. gallica HTC transcriptome provides an effective tool to understand mechanisms underlying cellular and molecular functions of C. gallica HTC.
Simultaneously learning DNA motif along with its position and sequence rank preferences through expectation maximization algorithm.

PubMed

Zhang, ZhiZhuo; Chang, Cheng Wei; Hugo, Willy; Cheung, Edwin; Sung, Wing-Kin

2013-03-01

Although de novo motifs can be discovered through mining over-represented sequence patterns, this approach misses some real motifs and generates many false positives. To improve accuracy, one solution is to consider some additional binding features (i.e., position preference and sequence rank preference). This information is usually required from the user. This article presents a de novo motif discovery algorithm called SEME (sampling with expectation maximization for motif elicitation), which uses pure probabilistic mixture model to model the motif's binding features and uses expectation maximization (EM) algorithms to simultaneously learn the sequence motif, position, and sequence rank preferences without asking for any prior knowledge from the user. SEME is both efficient and accurate thanks to two important techniques: the variable motif length extension and importance sampling. Using 75 large-scale synthetic datasets, 32 metazoan compendium benchmark datasets, and 164 chromatin immunoprecipitation sequencing (ChIP-Seq) libraries, we demonstrated the superior performance of SEME over existing programs in finding transcription factor (TF) binding sites. SEME is further applied to a more difficult problem of finding the co-regulated TF (coTF) motifs in 15 ChIP-Seq libraries. It identified significantly more correct coTF motifs and, at the same time, predicted coTF motifs with better matching to the known motifs. Finally, we show that the learned position and sequence rank preferences of each coTF reveals potential interaction mechanisms between the primary TF and the coTF within these sites. Some of these findings were further validated by the ChIP-Seq experiments of the coTFs. The application is available online.
De novo malignancy after pancreas transplantation in Japan.

PubMed

Tomimaru, Y; Ito, T; Marubashi, S; Kawamoto, K; Tomokuni, A; Asaoka, T; Wada, H; Eguchi, H; Mori, M; Doki, Y; Nagano, H

2015-04-01

Long-term immunosuppression is associated with an increased risk of cancer. Especially, the immunosuppression in pancreas transplantation is more intensive than that in other organ transplantation because of its strong immunogenicity. Therefore, it suggests that the risk of post-transplant de novo malignancy might increase in pancreas transplantation. However, there have been few studies of de novo malignancy after pancreas transplantation. The aim of this study was to analyze the incidence of de novo malignancy after pancreas transplantation in Japan. Post-transplant patients with de novo malignancy were surveyed and characterized in Japan. Among 107 cases receiving pancreas transplantation in Japan between 2001 and 2010, de novo malignancy developed in 9 cases (8.4%): post-transplant lymphoproliferative disorders in 6 cases, colon cancer in 1 case, renal cancer in 1 case, and brain tumor in 1 case. We clarified the incidence of de novo malignancy after pancreas transplantation in Japan. Copyright © 2015 Elsevier Inc. All rights reserved.
Optimizing de novo transcriptome assembly and extending genomic resources for striped catfish (Pangasianodon hypophthalmus).

PubMed

Thanh, Nguyen Minh; Jung, Hyungtaek; Lyons, Russell E; Njaci, Isaac; Yoon, Byoung-Ha; Chand, Vincent; Tuan, Nguyen Viet; Thu, Vo Thi Minh; Mather, Peter

2015-10-01

Striped catfish (Pangasianodon hypophthalmus) is a commercially important freshwater fish used in inland aquaculture in the Mekong Delta, Vietnam. The culture industry is facing a significant challenge however from saltwater intrusion into many low topographical coastal provinces across the Mekong Delta as a result of predicted climate change impacts. Developing genomic resources for this species can facilitate the production of improved culture lines that can withstand raised salinity conditions, and so we have applied high-throughput Ion Torrent sequencing of transcriptome libraries from six target osmoregulatory organs from striped catfish as a genomic resource for use in future selection strategies. We obtained 12,177,770 reads after trimming and processing with an average length of 97bp. De novo assemblies were generated using CLC Genomic Workbench, Trinity and Velvet/Oases with the best overall contig performance resulting from the CLC assembly. De novo assembly using CLC yielded 66,451 contigs with an average length of 478bp and N50 length of 506bp. A total of 37,969 contigs (57%) possessed significant similarity with proteins in the non-redundant database. Comparative analyses revealed that a significant number of contigs matched sequences reported in other teleost fishes, ranging in similarity from 45.2% with Atlantic cod to 52% with zebrafish. In addition, 28,879 simple sequence repeats (SSRs) and 55,721 single nucleotide polymorphisms (SNPs) were detected in the striped catfish transcriptome. The sequence collection generated in the current study represents the most comprehensive genomic resource for P. hypophthalmus available to date. Our results illustrate the utility of next-generation sequencing as an efficient tool for constructing a large genomic database for marker development in non-model species. Copyright © 2015 Elsevier B.V. All rights reserved.
De novo and inherited private variants in MAP1B in periventricular nodular heterotopia.

PubMed

Heinzen, Erin L; O'Neill, Adam C; Zhu, Xiaolin; Allen, Andrew S; Bahlo, Melanie; Chelly, Jamel; Chen, Ming Hui; Dobyns, William B; Freytag, Saskia; Guerrini, Renzo; Leventer, Richard J; Poduri, Annapurna; Robertson, Stephen P; Walsh, Christopher A; Zhang, Mengqi

2018-05-01

Periventricular nodular heterotopia (PVNH) is a malformation of cortical development commonly associated with epilepsy. We exome sequenced 202 individuals with sporadic PVNH to identify novel genetic risk loci. We first performed a trio-based analysis and identified 219 de novo variants. Although no novel genes were implicated in this initial analysis, PVNH cases were found overall to have a significant excess of nonsynonymous de novo variants in intolerant genes (p = 3.27x10-7), suggesting a role for rare new alleles in genes yet to be associated with the condition. Using a gene-level collapsing analysis comparing cases and controls, we identified a genome-wide significant signal driven by four ultra-rare loss-of-function heterozygous variants in MAP1B, including one de novo variant. In at least one instance, the MAP1B variant was inherited from a parent with previously undiagnosed PVNH. The PVNH was frontally predominant and associated with perisylvian polymicrogyria. These results implicate MAP1B in PVNH. More broadly, our findings suggest that detrimental mutations likely arising in immediately preceding generations with incomplete penetrance may also be responsible for some apparently sporadic diseases.
A Fast Alignment-Free Approach for De Novo Detection of Protein Conserved Regions

PubMed Central

Abnousi, Armen; Broschat, Shira L.; Kalyanaraman, Ananth

2016-01-01

Background Identifying conserved regions in protein sequences is a fundamental operation, occurring in numerous sequence-driven analysis pipelines. It is used as a way to decode domain-rich regions within proteins, to compute protein clusters, to annotate sequence function, and to compute evolutionary relationships among protein sequences. A number of approaches exist for identifying and characterizing protein families based on their domains, and because domains represent conserved portions of a protein sequence, the primary computation involved in protein family characterization is identification of such conserved regions. However, identifying conserved regions from large collections (millions) of protein sequences presents significant challenges. Methods In this paper we present a new, alignment-free method for detecting conserved regions in protein sequences called NADDA (No-Alignment Domain Detection Algorithm). Our method exploits the abundance of exact matching short subsequences (k-mers) to quickly detect conserved regions, and the power of machine learning is used to improve the prediction accuracy of detection. We present a parallel implementation of NADDA using the MapReduce framework and show that our method is highly scalable. Results We have compared NADDA with Pfam and InterPro databases. For known domains annotated by Pfam, accuracy is 83%, sensitivity 96%, and specificity 44%. For sequences with new domains not present in the training set an average accuracy of 63% is achieved when compared to Pfam. A boost in results in comparison with InterPro demonstrates the ability of NADDA to capture conserved regions beyond those present in Pfam. We have also compared NADDA with ADDA and MKDOM2, assuming Pfam as ground-truth. On average NADDA shows comparable accuracy, more balanced sensitivity and specificity, and being alignment-free, is significantly faster. Excluding the one-time cost of training, runtimes on a single processor were 49s, 10,566s, and 456s
Enzyme-like replication de novo in a microcontroller environment.

PubMed

Tangen, Uwe

2010-01-01

The desire to start evolution from scratch inside a computer memory is as old as computing. Here we demonstrate how viable computer programs can be established de novo in a Precambrian environment without supplying any specific instantiation, just starting with random bit sequences. These programs are not self-replicators, but act much more like catalysts. The microcontrollers used in the end are the result of a long series of simplifications. The objective of this simplification process was to produce universal machines with a human-readable interface, allowing software and/or hardware evolution to be studied. The power of the instruction set can be modified by introducing a secondary structure-folding mechanism, which is a state machine, allowing nontrivial replication to emerge with an instruction width of only a few bits. This state-machine approach not only attenuates the problems of brittleness and encoding functionality (too few bits available for coding, and too many instructions needed); it also enables the study of hardware evolution as such. Furthermore, the instruction set is sufficiently powerful to permit external signals to be processed. This information-theoretic approach forms one vertex of a triangle alongside artificial cell research and experimental research on the creation of life. Hopefully this work helps develop an understanding of how information—in a similar sense to the account of functional information described by Hazen et al.—is created by evolution and how this information interacts with or is embedded in its physico-chemical environment.
Analysis of SNP rs16754 of WT1 gene in a series of de novo acute myeloid leukemia patients.

PubMed

Luna, Irene; Such, Esperanza; Cervera, Jose; Barragán, Eva; Jiménez-Velasco, Antonio; Dolz, Sandra; Ibáñez, Mariam; Gómez-Seguí, Inés; López-Pavía, María; Llop, Marta; Fuster, Óscar; Oltra, Silvestre; Moscardó, Federico; Martínez-Cuadrón, David; Senent, M Leonor; Gascón, Adriana; Montesinos, Pau; Martín, Guillermo; Bolufer, Pascual; Sanz, Miguel A

2012-12-01

The single nucleotide polymorphism (SNP) rs16754 of the WT1 gene has been previously described as a possible prognostic marker in normal karyotype acute myeloid leukemia (AML) patients. Nevertheless, the findings in this field are not always reproducible in different series. One hundred and seventy-five adult de novo AML patients were screened with two different methods for the detection of SNP rs16754: high-resolution melting (HRM) and FRET hybridization probes. Direct sequencing was used to validate both techniques. The SNP was detected in 52 out of 175 patients (30 %), both by HRM and hybridization probes. Direct sequencing confirmed that every positive sample in the screening methods had a variation in the DNA sequence. Patients with the wild-type genotype (WT1(AA)) for the SNP rs16754 were significantly younger than those with the heterozygous WT1(AG) genotype. No other difference was observed for baseline characteristic or outcome between patients with or without the SNP. Both techniques are equally reliable and reproducible as screening methods for the detection of the SNP rs16754, allowing for the selection of those samples that will need to be sequenced. We were unable to confirm the suggested favorable outcome of SNP rs16754 in de novo AML.
Social and behavioral research in genomic sequencing: approaches from the Clinical Sequencing Exploratory Research Consortium Outcomes and Measures Working Group.

PubMed

Gray, Stacy W; Martins, Yolanda; Feuerman, Lindsay Z; Bernhardt, Barbara A; Biesecker, Barbara B; Christensen, Kurt D; Joffe, Steven; Rini, Christine; Veenstra, David; McGuire, Amy L

2014-10-01

The routine use of genomic sequencing in clinical medicine has the potential to dramatically alter patient care and medical outcomes. To fully understand the psychosocial and behavioral impact of sequencing integration into clinical practice, it is imperative that we identify the factors that influence sequencing-related decision making and patient outcomes. In an effort to develop a collaborative and conceptually grounded approach to studying sequencing adoption, members of the National Human Genome Research Institute's Clinical Sequencing Exploratory Research Consortium formed the Outcomes and Measures Working Group. Here we highlight the priority areas of investigation and psychosocial and behavioral outcomes identified by the Working Group. We also review some of the anticipated challenges to measurement in social and behavioral research related to genomic sequencing; opportunities for instrument development; and the importance of qualitative, quantitative, and mixed-method approaches. This work represents the early, shared efforts of multiple research teams as we strive to understand individuals' experiences with genomic sequencing. The resulting body of knowledge will guide recommendations for the optimal use of sequencing in clinical practice.
Whole genome sequencing reveals a de novo SHANK3 mutation in familial autism spectrum disorder.

PubMed

Nemirovsky, Sergio I; Córdoba, Marta; Zaiat, Jonathan J; Completa, Sabrina P; Vega, Patricia A; González-Morón, Dolores; Medina, Nancy M; Fabbro, Mónica; Romero, Soledad; Brun, Bianca; Revale, Santiago; Ogara, María Florencia; Pecci, Adali; Marti, Marcelo; Vazquez, Martin; Turjanski, Adrián; Kauffman, Marcelo A

2015-01-01

Clinical genomics promise to be especially suitable for the study of etiologically heterogeneous conditions such as Autism Spectrum Disorder (ASD). Here we present three siblings with ASD where we evaluated the usefulness of Whole Genome Sequencing (WGS) for the diagnostic approach to ASD. We identified a family segregating ASD in three siblings with an unidentified cause. We performed WGS in the three probands and used a state-of-the-art comprehensive bioinformatic analysis pipeline and prioritized the identified variants located in genes likely to be related to ASD. We validated the finding by Sanger sequencing in the probands and their parents. Three male siblings presented a syndrome characterized by severe intellectual disability, absence of language, autism spectrum symptoms and epilepsy with negative family history for mental retardation, language disorders, ASD or other psychiatric disorders. We found germline mosaicism for a heterozygous deletion of a cytosine in the exon 21 of the SHANK3 gene, resulting in a missense sequence of 5 codons followed by a premature stop codon (NM_033517:c.3259_3259delC, p.Ser1088Profs*6). We reported an infrequent form of familial ASD where WGS proved useful in the clinic. We identified a mutation in SHANK3 that underscores its relevance in Autism Spectrum Disorder.
Ray Meta: scalable de novo metagenome assembly and profiling

PubMed Central

2012-01-01

Voluminous parallel sequencing datasets, especially metagenomic experiments, require distributed computing for de novo assembly and taxonomic profiling. Ray Meta is a massively distributed metagenome assembler that is coupled with Ray Communities, which profiles microbiomes based on uniquely-colored k-mers. It can accurately assemble and profile a three billion read metagenomic experiment representing 1,000 bacterial genomes of uneven proportions in 15 hours with 1,024 processor cores, using only 1.5 GB per core. The software will facilitate the processing of large and complex datasets, and will help in generating biological insights for specific environments. Ray Meta is open source and available at http://denovoassembler.sf.net. PMID:23259615
Random Amplification and Pyrosequencing for Identification of Novel Viral Genome Sequences

PubMed Central

Hang, Jun; Forshey, Brett M.; Kochel, Tadeusz J.; Li, Tao; Solórzano, Víctor Fiestas; Halsey, Eric S.; Kuschner, Robert A.

2012-01-01

ssRNA viruses have high levels of genomic divergence, which can lead to difficulty in genomic characterization of new viruses using traditional PCR amplification and sequencing methods. In this study, random reverse transcription, anchored random PCR amplification, and high-throughput pyrosequencing were used to identify orthobunyavirus sequences from total RNA extracted from viral cultures of acute febrile illness specimens. Draft genome sequence for the orthobunyavirus L segment was assembled and sequentially extended using de novo assembly contigs from pyrosequencing reads and orthobunyavirus sequences in GenBank as guidance. Accuracy and continuous coverage were achieved by mapping all reads to the L segment draft sequence. Subsequently, RT-PCR and Sanger sequencing were used to complete the genome sequence. The complete L segment was found to be 6936 bases in length, encoding a 2248-aa putative RNA polymerase. The identified L segment was distinct from previously published South American orthobunyaviruses, sharing 63% and 54% identity at the nucleotide and amino acid level, respectively, with the complete Oropouche virus L segment and 73% and 81% identity at the nucleotide and amino acid level, respectively, with a partial Caraparu virus L segment. The result demonstrated the effectiveness of a sequence-independent amplification and next-generation sequencing approach for obtaining complete viral genomes from total nucleic acid extracts and its use in pathogen discovery. PMID:22468136
Establishing gene models from the Pinus pinaster genome using gene capture and BAC sequencing.

PubMed

Seoane-Zonjic, Pedro; Cañas, Rafael A; Bautista, Rocío; Gómez-Maldonado, Josefa; Arrillaga, Isabel; Fernández-Pozo, Noé; Claros, M Gonzalo; Cánovas, Francisco M; Ávila, Concepción

2016-02-27

In the era of DNA throughput sequencing, assembling and understanding gymnosperm mega-genomes remains a challenge. Although drafts of three conifer genomes have recently been published, this number is too low to understand the full complexity of conifer genomes. Using techniques focused on specific genes, gene models can be established that can aid in the assembly of gene-rich regions, and this information can be used to compare genomes and understand functional evolution. In this study, gene capture technology combined with BAC isolation and sequencing was used as an experimental approach to establish de novo gene structures without a reference genome. Probes were designed for 866 maritime pine transcripts to sequence genes captured from genomic DNA. The gene models were constructed using GeneAssembler, a new bioinformatic pipeline, which reconstructed over 82% of the gene structures, and a high proportion (85%) of the captured gene models contained sequences from the promoter regulatory region. In a parallel experiment, the P. pinaster BAC library was screened to isolate clones containing genes whose cDNA sequence were already available. BAC clones containing the asparagine synthetase, sucrose synthase and xyloglucan endotransglycosylase gene sequences were isolated and used in this study. The gene models derived from the gene capture approach were compared with the genomic sequences derived from the BAC clones. This combined approach is a particularly efficient way to capture the genomic structures of gene families with a small number of members. The experimental approach used in this study is a valuable combined technique to study genomic gene structures in species for which a reference genome is unavailable. It can be used to establish exon/intron boundaries in unknown gene structures, to reconstruct incomplete genes and to obtain promoter sequences that can be used for transcriptional studies. A bioinformatics algorithm (GeneAssembler) is also provided as a

Quantitative phenotyping via deep barcode sequencing.

PubMed

Smith, Andrew M; Heisler, Lawrence E; Mellor, Joseph; Kaper, Fiona; Thompson, Michael J; Chee, Mark; Roth, Frederick P; Giaever, Guri; Nislow, Corey

2009-10-01

Next-generation DNA sequencing technologies have revolutionized diverse genomics applications, including de novo genome sequencing, SNP detection, chromatin immunoprecipitation, and transcriptome analysis. Here we apply deep sequencing to genome-scale fitness profiling to evaluate yeast strain collections in parallel. This method, Barcode analysis by Sequencing, or "Bar-seq," outperforms the current benchmark barcode microarray assay in terms of both dynamic range and throughput. When applied to a complex chemogenomic assay, Bar-seq quantitatively identifies drug targets, with performance superior to the benchmark microarray assay. We also show that Bar-seq is well-suited for a multiplex format. We completely re-sequenced and re-annotated the yeast deletion collection using deep sequencing, found that approximately 20% of the barcodes and common priming sequences varied from expectation, and used this revised list of barcode sequences to improve data quality. Together, this new assay and analysis routine provide a deep-sequencing-based toolkit for identifying gene-environment interactions on a genome-wide scale.
Quantitative phenotyping via deep barcode sequencing

PubMed Central

Smith, Andrew M.; Heisler, Lawrence E.; Mellor, Joseph; Kaper, Fiona; Thompson, Michael J.; Chee, Mark; Roth, Frederick P.; Giaever, Guri; Nislow, Corey

2009-01-01

Next-generation DNA sequencing technologies have revolutionized diverse genomics applications, including de novo genome sequencing, SNP detection, chromatin immunoprecipitation, and transcriptome analysis. Here we apply deep sequencing to genome-scale fitness profiling to evaluate yeast strain collections in parallel. This method, Barcode analysis by Sequencing, or “Bar-seq,” outperforms the current benchmark barcode microarray assay in terms of both dynamic range and throughput. When applied to a complex chemogenomic assay, Bar-seq quantitatively identifies drug targets, with performance superior to the benchmark microarray assay. We also show that Bar-seq is well-suited for a multiplex format. We completely re-sequenced and re-annotated the yeast deletion collection using deep sequencing, found that ∼20% of the barcodes and common priming sequences varied from expectation, and used this revised list of barcode sequences to improve data quality. Together, this new assay and analysis routine provide a deep-sequencing-based toolkit for identifying gene–environment interactions on a genome-wide scale. PMID:19622793
Whole exome sequencing is necessary to clarify ID/DD cases with de novo copy number variants of uncertain significance: Two proof-of-concept examples.

PubMed

Giorgio, Elisa; Ciolfi, Andrea; Biamino, Elisa; Caputo, Viviana; Di Gregorio, Eleonora; Belligni, Elga Fabia; Calcia, Alessandro; Gaidolfi, Elena; Bruselles, Alessandro; Mancini, Cecilia; Cavalieri, Simona; Molinatto, Cristina; Cirillo Silengo, Margherita; Ferrero, Giovanni Battista; Tartaglia, Marco; Brusco, Alfredo

2016-07-01

Whole exome sequencing (WES) is a powerful tool to identify clinically undefined forms of intellectual disability/developmental delay (ID/DD), especially in consanguineous families. Here we report the genetic definition of two sporadic cases, with syndromic ID/DD for whom array-Comparative Genomic Hybridization (aCGH) identified a de novo copy number variant (CNV) of uncertain significance. The phenotypes included microcephaly with brachycephaly and a distinctive facies in one proband, and hypotonia in the legs and mild ataxia in the other. WES allowed identification of a functionally relevant homozygous variant affecting a known disease gene for rare syndromic ID/DD in each proband, that is, c.1423C>T (p.Arg377*) in the Trafficking Protein Particle Complex 9 (TRAPPC9), and c.154T>C (p.Cys52Arg) in the Very Low Density Lipoprotein Receptor (VLDLR). Four mutations affecting TRAPPC9 have been previously reported, and the present finding further depicts this syndromic form of ID, which includes microcephaly with brachycephaly, corpus callosum hypoplasia, facial dysmorphism, and overweight. VLDLR-associated cerebellar hypoplasia (VLDLR-CH) is characterized by non-progressive congenital ataxia and moderate-to-profound intellectual disability. The c.154T>C (p.Cys52Arg) mutation was associated with a very mild form of ataxia, mild intellectual disability, and cerebellar hypoplasia without cortical gyri simplification. In conclusion, we report two novel cases with rare causes of autosomal recessive ID, which document how interpreting de novo array-CGH variants represents a challenge in consanguineous families; as such, clinical WES should be considered in diagnostic testing. © 2016 Wiley Periodicals, Inc. © 2016 Wiley Periodicals, Inc.
A parallel approach of COFFEE objective function to multiple sequence alignment

NASA Astrophysics Data System (ADS)

Zafalon, G. F. D.; Visotaky, J. M. V.; Amorim, A. R.; Valêncio, C. R.; Neves, L. A.; de Souza, R. C. G.; Machado, J. M.

2015-09-01

The computational tools to assist genomic analyzes show even more necessary due to fast increasing of data amount available. With high computational costs of deterministic algorithms for sequence alignments, many works concentrate their efforts in the development of heuristic approaches to multiple sequence alignments. However, the selection of an approach, which offers solutions with good biological significance and feasible execution time, is a great challenge. Thus, this work aims to show the parallelization of the processing steps of MSA-GA tool using multithread paradigm in the execution of COFFEE objective function. The standard objective function implemented in the tool is the Weighted Sum of Pairs (WSP), which produces some distortions in the final alignments when sequences sets with low similarity are aligned. Then, in studies previously performed we implemented the COFFEE objective function in the tool to smooth these distortions. Although the nature of COFFEE objective function implies in the increasing of execution time, this approach presents points, which can be executed in parallel. With the improvements implemented in this work, we can verify the execution time of new approach is 24% faster than the sequential approach with COFFEE. Moreover, the COFFEE multithreaded approach is more efficient than WSP, because besides it is slightly fast, its biological results are better.
De novo mutations in the genome organizer CTCF cause intellectual disability.

PubMed

Gregor, Anne; Oti, Martin; Kouwenhoven, Evelyn N; Hoyer, Juliane; Sticht, Heinrich; Ekici, Arif B; Kjaergaard, Susanne; Rauch, Anita; Stunnenberg, Hendrik G; Uebe, Steffen; Vasileiou, Georgia; Reis, André; Zhou, Huiqing; Zweier, Christiane

2013-07-11

An increasing number of genes involved in chromatin structure and epigenetic regulation has been implicated in a variety of developmental disorders, often including intellectual disability. By trio exome sequencing and subsequent mutational screening we now identified two de novo frameshift mutations and one de novo missense mutation in CTCF in individuals with intellectual disability, microcephaly, and growth retardation. Furthermore, an individual with a larger deletion including CTCF was identified. CTCF (CCCTC-binding factor) is one of the most important chromatin organizers in vertebrates and is involved in various chromatin regulation processes such as higher order of chromatin organization, enhancer function, and maintenance of three-dimensional chromatin structure. Transcriptome analyses in all three individuals with point mutations revealed deregulation of genes involved in signal transduction and emphasized the role of CTCF in enhancer-driven expression of genes. Our findings indicate that haploinsufficiency of CTCF affects genomic interaction of enhancers and their regulated gene promoters that drive developmental processes and cognition. Copyright © 2013 The American Society of Human Genetics. Published by Elsevier Inc. All rights reserved.
Rapid-Onset Obesity with Hypothalamic Dysfunction, Hypoventilation, and Autonomic Dysregulation (ROHHAD): exome sequencing of trios, monozygotic twins and tumours.

PubMed

Barclay, Sarah F; Rand, Casey M; Borch, Lauren A; Nguyen, Lisa; Gray, Paul A; Gibson, William T; Wilson, Richard J A; Gordon, Paul M K; Aung, Zaw; Berry-Kravis, Elizabeth M; Ize-Ludlow, Diego; Weese-Mayer, Debra E; Bech-Hansen, N Torben

2015-08-25

Rapid-onset Obesity with Hypothalamic Dysfunction, Hypoventilation, and Autonomic Dysregulation (ROHHAD) is thought to be a genetic disease caused by de novo mutations, though causative mutations have yet to be identified. We searched for de novo coding mutations among a carefully-diagnosed and clinically homogeneous cohort of 35 ROHHAD patients. We sequenced the exomes of seven ROHHAD trios, plus tumours from four of these patients and the unaffected monozygotic (MZ) twin of one (discovery cohort), to identify constitutional and somatic de novo sequence variants. We further analyzed this exome data to search for candidate genes under autosomal dominant and recessive models, and to identify structural variations. Candidate genes were tested by exome or Sanger sequencing in a replication cohort of 28 ROHHAD singletons. The analysis of the trio-based exomes found 13 de novo variants. However, no two patients had de novo variants in the same gene, and additional patient exomes and mutation analysis in the replication cohort did not provide strong genetic evidence to implicate any of these sequence variants in ROHHAD. Somatic comparisons revealed no coding differences between any blood and tumour samples, or between the two discordant MZ twins. Neither autosomal dominant nor recessive analysis yielded candidate genes for ROHHAD, and we did not identify any potentially causative structural variations. Clinical exome sequencing is highly unlikely to be a useful diagnostic test in patients with true ROHHAD. As ROHHAD has a high risk for fatality if not properly managed, it remains imperative to expand the search for non-exomic genetic risk factors, as well as to investigate other possible mechanisms of disease. In so doing, we will be able to confirm objectively the ROHHAD diagnosis and to contribute to our understanding of obesity, respiratory control, hypothalamic function, and autonomic regulation.
The first genome sequence of a metatherian herpesvirus: Macropodid herpesvirus 1.

PubMed

Vaz, Paola K; Mahony, Timothy J; Hartley, Carol A; Fowler, Elizabeth V; Ficorilli, Nino; Lee, Sang W; Gilkerson, James R; Browning, Glenn F; Devlin, Joanne M

2016-01-22

While many placental herpesvirus genomes have been fully sequenced, the complete genome of a marsupial herpesvirus has not been described. Here we present the first genome sequence of a metatherian herpesvirus, Macropodid herpesvirus 1 (MaHV-1). The MaHV-1 viral genome was sequenced using an Illumina MiSeq sequencer, de novo assembly was performed and the genome was annotated. The MaHV-1 genome was 140 kbp in length and clustered phylogenetically with the primate simplexviruses, sharing 67% nucleotide sequence identity with Human herpesviruses 1 and 2. The MaHV-1 genome contained 66 predicted open reading frames (ORFs) homologous to those in other herpesvirus genomes, but lacked homologues of UL3, UL4, UL56 and glycoprotein J. This is the first alphaherpesvirus genome that has been found to lack the UL3 and UL4 homologues. We identified six novel ORFs and confirmed their transcription by RT-PCR. This is the first genome sequence of a herpesvirus that infects metatherians, a taxonomically unique mammalian clade. Members of the Simplexvirus genus are remarkably conserved, so the absence of ORFs otherwise retained in eutherian and avian alphaherpesviruses contributes to our understanding of the Alphaherpesvirinae. Further study of metatherian herpesvirus genetics and pathogenesis provides a unique approach to understanding herpesvirus-mammalian interactions.
Phenotypic consequences of gene disruption by a balanced de novo translocation involving SLC6A1 and NAA15.

PubMed

Pesz, Karolina; Pienkowski, Victor Murcia; Pollak, Agnieszka; Gasperowicz, Piotr; Sykulski, Maciej; Kosińska, Joanna; Kiszko, Magdalena; Krzykwa, Bogusława; Bartnik-Głaska, Magdalena; Nowakowska, Beata; Rydzanicz, Małgorzata; Sasiadek, Maria Małgorzata; Płoski, Rafał

2018-04-03

Mapping of de novo balanced chromosomal translocations (BCTs) in patients with sporadic poorly characterized disease(s) is an unbiased method of finding candidate gene(s) responsible for the observed symptoms. We present a paediatric patient suffering from epilepsy, developmental delay (DD) and atrial septal defect IIº (ASD) requiring surgery. Karyotyping indicated an apparently balanced de novo reciprocal translocation 46,XX,t(3;4)(p25.3;q31.1), whereas aCGH did not reveal any copy number changes. Using shallow mate-pair whole genome sequencing and direct Sanger sequencing of breakpoint regions we found that translocation disrupted SLC6A1 and NAA15 genes. Our results confirm two previous reports indicating that loss of function of a single allele of SLC6A1 causes epilepsy. In addition, we extend existing evidence that disruption of NAA15 is associated with DD and with congenital heart defects. Copyright © 2018 Elsevier Masson SAS. All rights reserved.
Fathead minnow genome sequencing and assembly

EPA Pesticide Factsheets

The dataset provides the URLs for accessing the genome sequence data and two draft assemblies as well as fathead minnow genotyping data associated with estimating the heterozygosity of the in-bred line.This dataset is associated with the following publication:Burns, F., L. Cogburn, G. Ankley , D. Villeneuve , E. Waits , Y. Chang, V. Llaca, S. Deschamps, R. Jackson, and R. Hoke. Sequencing and De novo Draft Assemblies of the Fathead Minnow (Pimphales promelas)Reference Genome. ENVIRONMENTAL TOXICOLOGY AND CHEMISTRY. Society of Environmental Toxicology and Chemistry, Pensacola, FL, USA, 35(1): 212-217, (2016).
G-Anchor: a novel approach for whole-genome comparative mapping utilizing evolutionary conserved DNA sequences.

PubMed

Lenis, Vasileios Panagiotis E; Swain, Martin; Larkin, Denis M

2018-05-01

Cross-species whole-genome sequence alignment is a critical first step for genome comparative analyses, ranging from the detection of sequence variants to studies of chromosome evolution. Animal genomes are large and complex, and whole-genome alignment is a computationally intense process, requiring expensive high-performance computing systems due to the need to explore extensive local alignments. With hundreds of sequenced animal genomes available from multiple projects, there is an increasing demand for genome comparative analyses. Here, we introduce G-Anchor, a new, fast, and efficient pipeline that uses a strictly limited but highly effective set of local sequence alignments to anchor (or map) an animal genome to another species' reference genome. G-Anchor makes novel use of a databank of highly conserved DNA sequence elements. We demonstrate how these elements may be aligned to a pair of genomes, creating anchors. These anchors enable the rapid mapping of scaffolds from a de novo assembled genome to chromosome assemblies of a reference species. Our results demonstrate that G-Anchor can successfully anchor a vertebrate genome onto a phylogenetically related reference species genome using a desktop or laptop computer within a few hours and with comparable accuracy to that achieved by a highly accurate whole-genome alignment tool such as LASTZ. G-Anchor thus makes whole-genome comparisons accessible to researchers with limited computational resources. G-Anchor is a ready-to-use tool for anchoring a pair of vertebrate genomes. It may be used with large genomes that contain a significant fraction of evolutionally conserved DNA sequences and that are not highly repetitive, polypoid, or excessively fragmented. G-Anchor is not a substitute for whole-genome aligning software but can be used for fast and accurate initial genome comparisons. G-Anchor is freely available and a ready-to-use tool for the pairwise comparison of two genomes.
Integrative structural annotation of de novo RNA-Seq provides an accurate reference gene set of the enormous genome of the onion (Allium cepa L.)

PubMed Central

Kim, Seungill; Kim, Myung-Shin; Kim, Yong-Min; Yeom, Seon-In; Cheong, Kyeongchae; Kim, Ki-Tae; Jeon, Jongbum; Kim, Sunggil; Kim, Do-Sun; Sohn, Seong-Han; Lee, Yong-Hwan; Choi, Doil

2015-01-01

The onion (Allium cepa L.) is one of the most widely cultivated and consumed vegetable crops in the world. Although a considerable amount of onion transcriptome data has been deposited into public databases, the sequences of the protein-coding genes are not accurate enough to be used, owing to non-coding sequences intermixed with the coding sequences. We generated a high-quality, annotated onion transcriptome from de novo sequence assembly and intensive structural annotation using the integrated structural gene annotation pipeline (ISGAP), which identified 54,165 protein-coding genes among 165,179 assembled transcripts totalling 203.0 Mb by eliminating the intron sequences. ISGAP performed reliable annotation, recognizing accurate gene structures based on reference proteins, and ab initio gene models of the assembled transcripts. Integrative functional annotation and gene-based SNP analysis revealed a whole biological repertoire of genes and transcriptomic variation in the onion. The method developed in this study provides a powerful tool for the construction of reference gene sets for organisms based solely on de novo transcriptome data. Furthermore, the reference genes and their variation described here for the onion represent essential tools for molecular breeding and gene cloning in Allium spp. PMID:25362073
De novo transcriptome assembly of drought tolerant CAM plants, Agave deserti and Agave tequilana.

PubMed

Gross, Stephen M; Martin, Jeffrey A; Simpson, June; Abraham-Juarez, María Jazmín; Wang, Zhong; Visel, Axel

2013-08-19

Agaves are succulent monocotyledonous plants native to xeric environments of North America. Because of their adaptations to their environment, including crassulacean acid metabolism (CAM, a water-efficient form of photosynthesis), and existing technologies for ethanol production, agaves have gained attention both as potential lignocellulosic bioenergy feedstocks and models for exploring plant responses to abiotic stress. However, the lack of comprehensive Agave sequence datasets limits the scope of investigations into the molecular-genetic basis of Agave traits. Here, we present comprehensive, high quality de novo transcriptome assemblies of two Agave species, A. tequilana and A. deserti, built from short-read RNA-seq data. Our analyses support completeness and accuracy of the de novo transcriptome assemblies, with each species having a minimum of approximately 35,000 protein-coding genes. Comparison of agave proteomes to those of additional plant species identifies biological functions of gene families displaying sequence divergence in agave species. Additionally, a focus on the transcriptomics of the A. deserti juvenile leaf confirms evolutionary conservation of monocotyledonous leaf physiology and development along the proximal-distal axis. Our work presents a comprehensive transcriptome resource for two Agave species and provides insight into their biology and physiology. These resources are a foundation for further investigation of agave biology and their improvement for bioenergy development.
Hotspot-Centric De Novo Design of Protein Binders

PubMed Central

Fleishman, Sarel J.; Corn, Jacob E.; Strauch, Eva-Maria; Whitehead, Timothy A.; Karanicolas, John; Baker, David

2014-01-01

Protein–protein interactions play critical roles in biology, and computational design of interactions could be useful in a range of applications. We describe in detail a general approach to de novo design of protein interactions based on computed, energetically optimized interaction hotspots, which was recently used to produce high-affinity binders of influenza hemagglutinin. We present several alternative approaches to identify and build the key hotspot interactions within both core secondary structural elements and variable loop regions and evaluate the method's performance in natural-interface recapitulation. We show that the method generates binding surfaces that are more conformationally restricted than previous design methods, reducing opportunities for off-target interactions. PMID:21945116
Global developmental delay and intellectual disability associated with a de novo TOP2B mutation.

PubMed

Lam, Ching-Wan; Yeung, Wai-Lan; Law, Chun-Yiu

2017-06-01

More than 100 genes had been identified for autism spectrum disorder (ASD). With the advancement of whole-exome/genome sequencing (WES/WGS), disease-causing gene in ASD can be identified in a holistic and unbiased approach. The identification of new ASD genes can further explore the molecular basis of ASD. We report a 15yo girl with developmental delay, intellectual disability, hypotonia, microcephaly and autistic feature. She first presented at 6months old with primitive response to noise. Physical examination showed the patient was hypotonic despite normal muscle power and reflexes. She also had progressive microcephaly. Developmental assessment at 6y showed the patient had a corresponding functional age of 1y. The patient also had autistic feature. The patient had no abnormal biochemical or radiological findings. To investigate the molecular basis of the clinical presentation, we applied clinical whole-exome sequencing (WES) for the proband and the family, and we identified a novel de novo heterozygous missense pathogenic variant, TOP2B: NM_001068.2:c.172C>T; NP_001059.2:p.His58Tyr. TOP2B encodes for the enzyme, topoisomerase II isoenzyme beta which is abundant in both developing and adult brain. Defect of topoisomerase is also known to cause ASD. Using clinical WES, we were able to identify the disease-causing gene for this patient in a holistic approach and end the diagnostic odyssey with a therapeutic impact. Copyright © 2017 Elsevier B.V. All rights reserved.
A novel approach to multiple sequence alignment using hadoop data grids.

PubMed

Sudha Sadasivam, G; Baktavatchalam, G

2010-01-01

Multiple alignment of protein sequences helps to determine evolutionary linkage and to predict molecular structures. The factors to be considered while aligning multiple sequences are speed and accuracy of alignment. Although dynamic programming algorithms produce accurate alignments, they are computation intensive. In this paper we propose a time efficient approach to sequence alignment that also produces quality alignment. The dynamic nature of the algorithm coupled with data and computational parallelism of hadoop data grids improves the accuracy and speed of sequence alignment. The principle of block splitting in hadoop coupled with its scalability facilitates alignment of very large sequences.
Implementing targeted region capture sequencing for the clinical detection of Alagille syndrome: An efficient and cost‑effective method.

PubMed

Huang, Tianhong; Yang, Guilin; Dang, Xiao; Ao, Feijian; Li, Jiankang; He, Yizhou; Tang, Qiyuan; He, Qing

2017-11-01

Alagille syndrome (AGS) is a highly variable, autosomal dominant disease that affects multiple structures including the liver, heart, eyes, bones and face. Targeted region capture sequencing focuses on a panel of known pathogenic genes and provides a rapid, cost‑effective and accurate method for molecular diagnosis. In a Chinese family, this method was used on the proband and Sanger sequencing was applied to validate the candidate mutation. A de novo heterozygous mutation (c.3254_3255insT p.Leu1085PhefsX24) of the jagged 1 gene was identified as the potential disease‑causing gene mutation. In conclusion, the present study suggested that target region capture sequencing is an efficient, reliable and accurate approach for the clinical diagnosis of AGS. Furthermore, these results expand on the understanding of the pathogenesis of AGS.
Whole Genome Sequencing Reveals a De Novo SHANK3 Mutation in Familial Autism Spectrum Disorder

PubMed Central

Nemirovsky, Sergio I.; Córdoba, Marta; Zaiat, Jonathan J.; Completa, Sabrina P.; Vega, Patricia A.; González-Morón, Dolores; Medina, Nancy M.; Fabbro, Mónica; Romero, Soledad; Brun, Bianca; Revale, Santiago; Ogara, María Florencia; Pecci, Adali; Marti, Marcelo; Vazquez, Martin; Turjanski, Adrián; Kauffman, Marcelo A.

2015-01-01

Introduction Clinical genomics promise to be especially suitable for the study of etiologically heterogeneous conditions such as Autism Spectrum Disorder (ASD). Here we present three siblings with ASD where we evaluated the usefulness of Whole Genome Sequencing (WGS) for the diagnostic approach to ASD. Methods We identified a family segregating ASD in three siblings with an unidentified cause. We performed WGS in the three probands and used a state-of-the-art comprehensive bioinformatic analysis pipeline and prioritized the identified variants located in genes likely to be related to ASD. We validated the finding by Sanger sequencing in the probands and their parents. Results Three male siblings presented a syndrome characterized by severe intellectual disability, absence of language, autism spectrum symptoms and epilepsy with negative family history for mental retardation, language disorders, ASD or other psychiatric disorders. We found germline mosaicism for a heterozygous deletion of a cytosine in the exon 21 of the SHANK3 gene, resulting in a missense sequence of 5 codons followed by a premature stop codon (NM_033517:c.3259_3259delC, p.Ser1088Profs*6). Conclusions We reported an infrequent form of familial ASD where WGS proved useful in the clinic. We identified a mutation in SHANK3 that underscores its relevance in Autism Spectrum Disorder. PMID:25646853
Incorporating metals into de novo proteins.

PubMed

Peacock, Anna F A

2013-12-01

The de novo design of artificial metalloproteins from first-principles is a powerful strategy with which to establish the minimum structure required for function, as well as to identify the important design features for tuning the chemistry of the coordinated metal ion. Herein we describe recent contributions to this field, covering metallo-porphyrin, mononuclear and multinuclear metal ion sites engineered into de novo proteins. Using miniature artificial scaffolds these examples demonstrate that complex natural protein folds are not required to mimic naturally occurring metal ion sites in proteins. More importantly progress is being made to engineer de novo metalloproteins capable of performing functions not in the repertoire of biology. Copyright © 2013 Elsevier Ltd. All rights reserved.
Metagenomic ventures into outer sequence space.

PubMed

Dutilh, Bas E

Sequencing DNA or RNA directly from the environment often results in many sequencing reads that have no homologs in the database. These are referred to as "unknowns," and reflect the vast unexplored microbial sequence space of our biosphere, also known as "biological dark matter." However, unknowns also exist because metagenomic datasets are not optimally mined. There is a pressure on researchers to publish and move on, and the unknown sequences are often left for what they are, and conclusions drawn based on reads with annotated homologs. This can cause abundant and widespread genomes to be overlooked, such as the recently discovered human gut bacteriophage crAssphage. The unknowns may be enriched for bacteriophage sequences, the most abundant and genetically diverse component of the biosphere and of sequence space. However, it remains an open question, what is the actual size of biological sequence space? The de novo assembly of shotgun metagenomes is the most powerful tool to address this question.
Sequencing, de novo assembly and characterization of the spotted scat Scatophagus argus (Linnaeus 1766) transcriptome for discovery of reproduction related genes and SSRs

NASA Astrophysics Data System (ADS)

Yang, Wei; Chen, Huapu; Cui, Xuefan; Zhang, Kewei; Jiang, Dongneng; Deng, Siping; Zhu, Chunhua; Li, Guangli

2017-09-01

Spotted scat (Scatophagus argus) is an economically important farmed fish, particularly in East and Southeast Asia. Because there has been little research on reproductive development and regulation in this species, the lack of a mature artificial reproduction technology remains a barrier for the sustainable development of the aquaculture industry. More genetic and genomic background knowledge is urgently needed for an in-depth understanding of the molecular mechanism of reproductive process and identification of functional genes related to sexual differentiation, gonad maturation and gametogenesis. For these reasons, we performed transcriptomic analysis on spotted scat using a multiple tissue sample mixing strategy. The Illumina RNA sequencing generated 118 510 486 raw reads. After trimming, de novo assembly was performed and yielded 99 888 unigenes with an average length of 905.75 bp. A total of 45 015 unigenes were successfully annotated to the Nr, Swiss-Prot, KOG and KEGG databases. Additionally, 23 783 and 27 183 annotated unigenes were assigned to 56 Gene Ontology (GO) functional groups and 228 KEGG pathways, respectively. Subsequently, 2 474 transcripts associated with reproduction were selected using GO term and KEGG pathway assignments, and a number of reproduction-related genes involved in sex differentiation, gonad development and gametogenesis were identified. Furthermore, 22 279 simple sequence repeat (SSR) loci were discovered and characterized. The comprehensive transcript dataset described here greatly increases the genetic information available for spotted scat and contributes valuable sequence resources for functional gene mining and analysis. Candidate transcripts involved in reproduction would make good starting points for future studies on reproductive mechanisms, and the putative sex differentiation-related genes will be helpful for sex-determining gene identification and sex-specific marker isolation. Lastly, the SSRs can serve as marker

Molecular Characterization of Transgenic Events Using Next Generation Sequencing Approach.

PubMed

Guttikonda, Satish K; Marri, Pradeep; Mammadov, Jafar; Ye, Liang; Soe, Khaing; Richey, Kimberly; Cruse, James; Zhuang, Meibao; Gao, Zhifang; Evans, Clive; Rounsley, Steve; Kumpatla, Siva P

2016-01-01

Demand for the commercial use of genetically modified (GM) crops has been increasing in light of the projected growth of world population to nine billion by 2050. A prerequisite of paramount importance for regulatory submissions is the rigorous safety assessment of GM crops. One of the components of safety assessment is molecular characterization at DNA level which helps to determine the copy number, integrity and stability of a transgene; characterize the integration site within a host genome; and confirm the absence of vector DNA. Historically, molecular characterization has been carried out using Southern blot analysis coupled with Sanger sequencing. While this is a robust approach to characterize the transgenic crops, it is both time- and resource-consuming. The emergence of next-generation sequencing (NGS) technologies has provided highly sensitive and cost- and labor-effective alternative for molecular characterization compared to traditional Southern blot analysis. Herein, we have demonstrated the successful application of both whole genome sequencing and target capture sequencing approaches for the characterization of single and stacked transgenic events and compared the results and inferences with traditional method with respect to key criteria required for regulatory submissions.
Rare causes of early-onset dystonia-parkinsonism with cognitive impairment: a de novo PSEN-1 mutation.

PubMed

Carecchio, Miryam; Picillo, Marina; Valletta, Lorella; Elia, Antonio E; Haack, Tobias B; Cozzolino, Autilia; Vitale, Annalisa; Garavaglia, Barbara; Iuso, Arcangela; Bagella, Caterina F; Pappatà, Sabina; Barone, Paolo; Prokisch, Holger; Romito, Luigi; Tiranti, Valeria

2017-07-01

Mutations in PSEN1 are responsible for familial Alzheimer's disease (FAD) inherited as autosomal dominant trait, but also de novo mutations have been rarely reported in sporadic early-onset dementia cases. Parkinsonism in FAD has been mainly described in advanced disease stages. We characterized a patient presenting with early-onset dystonia-parkinsonism later complicated by dementia and myoclonus. Brain MRI showed signs of iron accumulation in the basal ganglia mimicking neurodegeneration with brain iron accumulation (NBIA) as well as fronto-temporal atrophy. Whole exome sequencing revealed a novel PSEN1 mutation and segregation within the family demonstrated the mutation arose de novo.We suggest considering PSEN1 mutations in cases of dystonia-parkinsonism with positive DAT-Scan, later complicated by progressive cognitive decline and cortical myoclonus even without a dominant family history.
De novo FGF12 mutation in 2 patients with neonatal-onset epilepsy

PubMed Central

Guella, Ilaria; Huh, Linda; McKenzie, Marna B.; Toyota, Eric B.; Bebin, E. Martina; Thompson, Michelle L.; Cooper, Gregory M.; Evans, Daniel M.; Buerki, Sarah E.; Adam, Shelin; Van Allen, Margot I.; Nelson, Tanya N.; Connolly, Mary B.; Farrer, Matthew J.

2016-01-01

Objective: We describe 2 additional patients with early-onset epilepsy with a de novo FGF12 mutation. Methods: Whole-exome sequencing was performed in 2 unrelated patients with early-onset epilepsy and their unaffected parents. Genetic variants were assessed by comparative trio analysis. Clinical evolution, EEG, and neuroimaging are described. The phenotype and response to treatment was reviewed and compared to affected siblings in the original report. Results: We identified the same FGF12 de novo mutation reported previously (c.G155A, p.R52H) in 2 additional patients with early-onset epilepsy. Similar to the original brothers described, both presented with tonic seizures in the first month of life. In the first patient, seizures responded to sodium channel blockers and her development was normal at 11 months. Patient 2 is a 15-year-old girl with treatment-resistant focal epilepsy, moderate intellectual disability, and autism. Carbamazepine (sodium channel blocker) was tried later in her course but not continued due to an allergic reaction. Conclusions: The identification of a recurrent de novo mutation in 2 additional unrelated probands with early-onset epilepsy supports the role of FGF12 p.R52H in disease pathogenesis. Affected carriers presented with similar early clinical phenotypes; however, this report expands the phenotype associated with this mutation which contrasts with the progressive course and early mortality of the siblings in the original report. PMID:27872899
Computational approaches for de novo design and redesign of metal-binding sites on proteins.

PubMed

Akcapinar, Gunseli Bayram; Sezerman, Osman Ugur

2017-04-28

Metal ions play pivotal roles in protein structure, function and stability. The functional and structural diversity of proteins in nature expanded with the incorporation of metal ions or clusters in proteins. Approximately one-third of these proteins in the databases contain metal ions. Many biological and chemical processes in nature involve metal ion-binding proteins, aka metalloproteins. Many cellular reactions that underpin life require metalloproteins. Most of the remarkable, complex chemical transformations are catalysed by metalloenzymes. Realization of the importance of metal-binding sites in a variety of cellular events led to the advancement of various computational methods for their prediction and characterization. Furthermore, as structural and functional knowledgebase about metalloproteins is expanding with advances in computational and experimental fields, the focus of the research is now shifting towards de novo design and redesign of metalloproteins to extend nature's own diversity beyond its limits. In this review, we will focus on the computational toolbox for prediction of metal ion-binding sites, de novo metalloprotein design and redesign. We will also give examples of tailor-made artificial metalloproteins designed with the computational toolbox. © 2017 The Author(s).
Complete genome sequence of Clavibacter michiganensis subsp. insidiosus R1-1 using PacBio single-molecule real-time technology

USDA-ARS?s Scientific Manuscript database

We report the complete genome sequence of Clavibacter michiganensis subsp. insidiosus R1-1 isolated in Minnesota, USA. The R1-1 genome, generated by de novo assembly of PacBio sequencing data, is the first complete genome sequence available for this subspecies....
Building a Better Fragment Library for De Novo Protein Structure Prediction

PubMed Central

de Oliveira, Saulo H. P.; Shi, Jiye; Deane, Charlotte M.

2015-01-01

Fragment-based approaches are the current standard for de novo protein structure prediction. These approaches rely on accurate and reliable fragment libraries to generate good structural models. In this work, we describe a novel method for structure fragment library generation and its application in fragment-based de novo protein structure prediction. The importance of correct testing procedures in assessing the quality of fragment libraries is demonstrated. In particular, the exclusion of homologs to the target from the libraries to correctly simulate a de novo protein structure prediction scenario, something which surprisingly is not always done. We demonstrate that fragments presenting different predominant predicted secondary structures should be treated differently during the fragment library generation step and that exhaustive and random search strategies should both be used. This information was used to develop a novel method, Flib. On a validation set of 41 structurally diverse proteins, Flib libraries presents both a higher precision and coverage than two of the state-of-the-art methods, NNMake and HHFrag. Flib also achieves better precision and coverage on the set of 275 protein domains used in the two previous experiments of the the Critical Assessment of Structure Prediction (CASP9 and CASP10). We compared Flib libraries against NNMake libraries in a structure prediction context. Of the 13 cases in which a correct answer was generated, Flib models were more accurate than NNMake models for 10. “Flib is available for download at: http://www.stats.ox.ac.uk/research/proteins/resources”. PMID:25901595
Comparative analyses of two Geraniaceae transcriptomes using next-generation sequencing.

PubMed

Zhang, Jin; Ruhlman, Tracey A; Mower, Jeffrey P; Jansen, Robert K

2013-12-29

Organelle genomes of Geraniaceae exhibit several unusual evolutionary phenomena compared to other angiosperm families including accelerated nucleotide substitution rates, widespread gene loss, reduced RNA editing, and extensive genomic rearrangements. Since most organelle-encoded proteins function in multi-subunit complexes that also contain nuclear-encoded proteins, it is likely that the atypical organellar phenomena affect the evolution of nuclear genes encoding organellar proteins. To begin to unravel the complex co-evolutionary interplay between organellar and nuclear genomes in this family, we sequenced nuclear transcriptomes of two species, Geranium maderense and Pelargonium x hortorum. Normalized cDNA libraries of G. maderense and P. x hortorum were used for transcriptome sequencing. Five assemblers (MIRA, Newbler, SOAPdenovo, SOAPdenovo-trans [SOAPtrans], Trinity) and two next-generation technologies (454 and Illumina) were compared to determine the optimal transcriptome sequencing approach. Trinity provided the highest quality assembly of Illumina data with the deepest transcriptome coverage. An analysis to determine the amount of sequencing needed for de novo assembly revealed diminishing returns of coverage and quality with data sets larger than sixty million Illumina paired end reads for both species. The G. maderense and P. x hortorum transcriptomes contained fewer transcripts encoding the PLS subclass of PPR proteins relative to other angiosperms, consistent with reduced mitochondrial RNA editing activity in Geraniaceae. In addition, transcripts for all six plastid targeted sigma factors were identified in both transcriptomes, suggesting that one of the highly divergent rpoA-like ORFs in the P. x hortorum plastid genome is functional. The findings support the use of the Illumina platform and assemblers optimized for transcriptome assembly, such as Trinity or SOAPtrans, to generate high-quality de novo transcriptomes with broad coverage. In addition
Multi-locus and long amplicon sequencing approach to study microbial diversity at species level using the MinION™ portable nanopore sequencer

PubMed Central

Sanz, Yolanda

2017-01-01

Abstract The miniaturized and portable DNA sequencer MinION™ has demonstrated great potential in different analyses such as genome-wide sequencing, pathogen outbreak detection and surveillance, human genome variability, and microbial diversity. In this study, we tested the ability of the MinION™ platform to perform long amplicon sequencing in order to design new approaches to study microbial diversity using a multi-locus approach. After compiling a robust database by parsing and extracting the rrn bacterial region from more than 67000 complete or draft bacterial genomes, we demonstrated that the data obtained during sequencing of the long amplicon in the MinION™ device using R9 and R9.4 chemistries were sufficient to study 2 mock microbial communities in a multiplex manner and to almost completely reconstruct the microbial diversity contained in the HM782D and D6305 mock communities. Although nanopore-based sequencing produces reads with lower per-base accuracy compared with other platforms, we presented a novel approach consisting of multi-locus and long amplicon sequencing using the MinION™ MkIb DNA sequencer and R9 and R9.4 chemistries that help to overcome the main disadvantage of this portable sequencing platform. Furthermore, the nanopore sequencing library, constructed with the last releases of pore chemistry (R9.4) and sequencing kit (SQK-LSK108), permitted the retrieval of the higher level of 1D read accuracy sufficient to characterize the microbial species present in each mock community analysed. Improvements in nanopore chemistry, such as minimizing base-calling errors and new library protocols able to produce rapid 1D libraries, will provide more reliable information in the near future. Such data will be useful for more comprehensive and faster specific detection of microbial species and strains in complex ecosystems. PMID:28605506
Solving the Water Jugs Problem by an Integer Sequence Approach

ERIC Educational Resources Information Center

Man, Yiu-Kwong

2012-01-01

In this article, we present an integer sequence approach to solve the classic water jugs problem. The solution steps can be obtained easily by additions and subtractions only, which is suitable for manual calculation or programming by computer. This approach can be introduced to secondary and undergraduate students, and also to teachers and…
High-throughput illumina strand-specific RNA sequencing library preparation

USDA-ARS?s Scientific Manuscript database

Conventional Illumina RNA-Seq does not have the resolution to decode the complex eukaryote transcriptome due to the lack of RNA polarity information. Strand-specific RNA sequencing (ssRNA-Seq) can overcome these limitations and as such is better suited for genome annotation, de novo transcriptome as...
De novo transcriptome sequencing and analysis of the juvenile and adult stages of Fasciola gigantica.

PubMed

Zhang, Xiao-Xuan; Cong, Wei; Elsheikha, Hany M; Liu, Guo-Hua; Ma, Jian-Gang; Huang, Wei-Yi; Zhao, Quan; Zhu, Xing-Quan

2017-07-01

Fasciola gigantica is regarded as the major liver fluke causing fasciolosis in livestock in tropical countries. Despite the significant economic and public health impacts of F. gigantica there are few studies on the pathogenesis of this parasite and our understanding is further limited by the lack of genome and transcriptome information. In this study, de novo Illumina RNA sequencing (RNA-seq) was performed to obtain a comprehensive transcriptome profile of the juvenile (42days post infection) and adult stages of F. gigantica. A total of 49,720 unigenes were produced from juvenile and adult stages of F. gigantica, with an average length of 1286 nucleotides (nt) and N50 of 2076nt. A total of 27,862 (56.03%) unigenes were annotated by BLAST similarity searches against the NCBI non-redundant protein database. Because F. gigantica needs to feed and/or digest host tissues, some proteases (including cysteine proteases and aspartic proteases), which play a role in the degradation of host tissues (protein), have been paid more attention in the present study. A total of 6511 distinct genes were found differentially expressed between juveniles and adults, of which 3993 genes were up-regulated and 2518 genes were down-regulated in adults versus juveniles, respectively. Moreover, stage-specific differentially expressed genes were identified in juvenile (17,009) and adult (6517) F. gigantica. The significantly divergent pathways of differentially expressed genes included cAMP signaling pathway (226; 4.12%), proteoglycans in cancer (256; 4.67%) and focal adhesion (199; 3.63%). The transcription pattern also revealed two egg-laying-associated pathways: cGMP-PKG signaling pathway and TGF-β signaling pathway. This study provides the first comparative transcriptomic data concerning juvenile and adult stages of F. gigantica that will be of great value for future research efforts into understanding parasite pathogenesis and developing vaccines against this important parasite
De Novo Transcriptome Assembly and Analyses of Gene Expression during Photomorphogenesis in Diploid Wheat Triticum monococcum

PubMed Central

Naithani, Sushma; Sullivan, Chris; Preece, Justin; Tiwari, Vijay K.; Elser, Justin; Leonard, Jeffrey M.; Sage, Abigail; Gresham, Cathy; Kerhornou, Arnaud; Bolser, Dan; McCarthy, Fiona; Kersey, Paul; Lazo, Gerard R.; Jaiswal, Pankaj

2014-01-01

Background Triticum monococcum (2n) is a close ancestor of T. urartu, the A-genome progenitor of cultivated hexaploid wheat, and is therefore a useful model for the study of components regulating photomorphogenesis in diploid wheat. In order to develop genetic and genomic resources for such a study, we constructed genome-wide transcriptomes of two Triticum monococcum subspecies, the wild winter wheat T. monococcum ssp. aegilopoides (accession G3116) and the domesticated spring wheat T. monococcum ssp. monococcum (accession DV92) by generating de novo assemblies of RNA-Seq data derived from both etiolated and green seedlings. Principal Findings The de novo transcriptome assemblies of DV92 and G3116 represent 120,911 and 117,969 transcripts, respectively. We successfully mapped ∼90% of these transcripts from each accession to barley and ∼95% of the transcripts to T. urartu genomes. However, only ∼77% transcripts mapped to the annotated barley genes and ∼85% transcripts mapped to the annotated T. urartu genes. Differential gene expression analyses revealed 22% more light up-regulated and 35% more light down-regulated transcripts in the G3116 transcriptome compared to DV92. The DV92 and G3116 mRNA sequence reads aligned against the reference barley genome led to the identification of ∼500,000 single nucleotide polymorphism (SNP) and ∼22,000 simple sequence repeat (SSR) sites. Conclusions De novo transcriptome assemblies of two accessions of the diploid wheat T. monococcum provide new empirical transcriptome references for improving Triticeae genome annotations, and insights into transcriptional programming during photomorphogenesis. The SNP and SSR sites identified in our analysis provide additional resources for the development of molecular markers. PMID:24821410
MinION™ nanopore sequencing of environmental metagenomes: a synthetic approach

PubMed Central

Watson, Mick; Minot, Samuel S.; Rivera, Maria C.; Franklin, Rima B.

2017-01-01

Abstract Background: Environmental metagenomic analysis is typically accomplished by assigning taxonomy and/or function from whole genome sequencing or 16S amplicon sequences. Both of these approaches are limited, however, by read length, among other technical and biological factors. A nanopore-based sequencing platform, MinION™, produces reads that are ≥1 × 104 bp in length, potentially providing for more precise assignment, thereby alleviating some of the limitations inherent in determining metagenome composition from short reads. We tested the ability of sequence data produced by MinION (R7.3 flow cells) to correctly assign taxonomy in single bacterial species runs and in three types of low-complexity synthetic communities: a mixture of DNA using equal mass from four species, a community with one relatively rare (1%) and three abundant (33% each) components, and a mixture of genomic DNA from 20 bacterial strains of staggered representation. Taxonomic composition of the low-complexity communities was assessed by analyzing the MinION sequence data with three different bioinformatic approaches: Kraken, MG-RAST, and One Codex. Results: Long read sequences generated from libraries prepared from single strains using the version 5 kit and chemistry, run on the original MinION device, yielded as few as 224 to as many as 3497 bidirectional high-quality (2D) reads with an average overall study length of 6000 bp. For the single-strain analyses, assignment of reads to the correct genus by different methods ranged from 53.1% to 99.5%, assignment to the correct species ranged from 23.9% to 99.5%, and the majority of misassigned reads were to closely related organisms. A synthetic metagenome sequenced with the same setup yielded 714 high quality 2D reads of approximately 5500 bp that were up to 98% correctly assigned to the species level. Synthetic metagenome MinION libraries generated using version 6 kit and chemistry yielded from 899 to 3497 2D reads with lengths
Integrative structural annotation of de novo RNA-Seq provides an accurate reference gene set of the enormous genome of the onion (Allium cepa L.).

PubMed

Kim, Seungill; Kim, Myung-Shin; Kim, Yong-Min; Yeom, Seon-In; Cheong, Kyeongchae; Kim, Ki-Tae; Jeon, Jongbum; Kim, Sunggil; Kim, Do-Sun; Sohn, Seong-Han; Lee, Yong-Hwan; Choi, Doil

2015-02-01

The onion (Allium cepa L.) is one of the most widely cultivated and consumed vegetable crops in the world. Although a considerable amount of onion transcriptome data has been deposited into public databases, the sequences of the protein-coding genes are not accurate enough to be used, owing to non-coding sequences intermixed with the coding sequences. We generated a high-quality, annotated onion transcriptome from de novo sequence assembly and intensive structural annotation using the integrated structural gene annotation pipeline (ISGAP), which identified 54,165 protein-coding genes among 165,179 assembled transcripts totalling 203.0 Mb by eliminating the intron sequences. ISGAP performed reliable annotation, recognizing accurate gene structures based on reference proteins, and ab initio gene models of the assembled transcripts. Integrative functional annotation and gene-based SNP analysis revealed a whole biological repertoire of genes and transcriptomic variation in the onion. The method developed in this study provides a powerful tool for the construction of reference gene sets for organisms based solely on de novo transcriptome data. Furthermore, the reference genes and their variation described here for the onion represent essential tools for molecular breeding and gene cloning in Allium spp. © The Author 2014. Published by Oxford University Press on behalf of Kazusa DNA Research Institute.
Environmental risk factors for autism: do they help cause de novo genetic mutations that contribute to the disorder?

PubMed

Kinney, Dennis K; Barch, Daniel H; Chayka, Bogdan; Napoleon, Siena; Munir, Kerim M

2010-01-01

Recent research has discovered that a number of genetic risk factors for autism are de novo mutations. Advanced parental age at the time of conception is associated with increased risk for both autism and de novo mutations. We investigated the hypothesis that other environmental factors associated with increased risk for autism might also be mutagenic and contribute to autism by causing de novo mutations. A survey of the research literature identified 9 environmental factors for which increased pre-conceptual exposure appears to be associated with increased risk for autism. Five of these factors--mercury, cadmium, nickel, trichloroethylene, and vinyl chloride--are established mutagens. Another four--including residence in regions that are urbanized, located at higher latitudes, or experience high levels of precipitation--are associated with decreased sun exposure and increased risk for vitamin D deficiency. Vitamin D plays important roles in repairing DNA damage and protecting against oxidative stress--a key cause of DNA damage. Factors associated with vitamin D deficiency will thus contribute to higher mutation rates and impaired repair of DNA. We note how de novo mutations may also help explain why the concordance rate for autism is so markedly higher in monozygotic than dizygotic twins. De novo mutations may also explain in part why the prevalence of autism is so remarkably high, given the evidence for a strong role of genetic factors and the low fertility of individuals with autism--and resultant selection pressure against autism susceptibility genes. These several lines of evidence provide support for the hypothesis, and warrant new research approaches--which we suggest--to address limitations in existing studies. The hypothesis has implications for understanding possible etiologic roles of de novo mutations in autism, and it suggests possible approaches to primary prevention of the disorder, such as addressing widespread vitamin D deficiency and exposure to
Developing a de novo targeted knock-in method based on in utero electroporation into the mammalian brain.

PubMed

Tsunekawa, Yuji; Terhune, Raymond Kunikane; Fujita, Ikumi; Shitamukai, Atsunori; Suetsugu, Taeko; Matsuzaki, Fumio

2016-09-01

Genome-editing technology has revolutionized the field of biology. Here, we report a novel de novo gene-targeting method mediated by in utero electroporation into the developing mammalian brain. Electroporation of donor DNA with the CRISPR/Cas9 system vectors successfully leads to knock-in of the donor sequence, such as EGFP, to the target site via the homology-directed repair mechanism. We developed a targeting vector system optimized to prevent anomalous leaky expression of the donor gene from the plasmid, which otherwise often occurs depending on the donor sequence. The knock-in efficiency of the electroporated progenitors reached up to 40% in the early stage and 20% in the late stage of the developing mouse brain. Furthermore, we inserted different fluorescent markers into the target gene in each homologous chromosome, successfully distinguishing homozygous knock-in cells by color. We also applied this de novo gene targeting to the ferret model for the study of complex mammalian brains. Our results demonstrate that this technique is widely applicable for monitoring gene expression, visualizing protein localization, lineage analysis and gene knockout, all at the single-cell level, in developmental tissues. © 2016. Published by The Company of Biologists Ltd.
De novo transcriptome assembly of drought tolerant CAM plants, Agave deserti and Agave tequilana

PubMed Central

2013-01-01

Background Agaves are succulent monocotyledonous plants native to xeric environments of North America. Because of their adaptations to their environment, including crassulacean acid metabolism (CAM, a water-efficient form of photosynthesis), and existing technologies for ethanol production, agaves have gained attention both as potential lignocellulosic bioenergy feedstocks and models for exploring plant responses to abiotic stress. However, the lack of comprehensive Agave sequence datasets limits the scope of investigations into the molecular-genetic basis of Agave traits. Results Here, we present comprehensive, high quality de novo transcriptome assemblies of two Agave species, A. tequilana and A. deserti, built from short-read RNA-seq data. Our analyses support completeness and accuracy of the de novo transcriptome assemblies, with each species having a minimum of approximately 35,000 protein-coding genes. Comparison of agave proteomes to those of additional plant species identifies biological functions of gene families displaying sequence divergence in agave species. Additionally, a focus on the transcriptomics of the A. deserti juvenile leaf confirms evolutionary conservation of monocotyledonous leaf physiology and development along the proximal-distal axis. Conclusions Our work presents a comprehensive transcriptome resource for two Agave species and provides insight into their biology and physiology. These resources are a foundation for further investigation of agave biology and their improvement for bioenergy development. PMID:23957668
Evaluation of GRCh38 and de novo haploid genome assemblies demonstrates the enduring quality of the reference assembly

PubMed Central

Schneider, Valerie A.; Graves-Lindsay, Tina; Howe, Kerstin; Bouk, Nathan; Chen, Hsiu-Chuan; Kitts, Paul A.; Murphy, Terence D.; Pruitt, Kim D.; Thibaud-Nissen, Françoise; Albracht, Derek; Fulton, Robert S.; Kremitzki, Milinn; Magrini, Vincent; Markovic, Chris; McGrath, Sean; Steinberg, Karyn Meltz; Auger, Kate; Chow, William; Collins, Joanna; Harden, Glenn; Hubbard, Timothy; Pelan, Sarah; Simpson, Jared T.; Threadgold, Glen; Torrance, James; Wood, Jonathan M.; Clarke, Laura; Koren, Sergey; Boitano, Matthew; Peluso, Paul; Li, Heng; Chin, Chen-Shan; Phillippy, Adam M.; Durbin, Richard; Wilson, Richard K.; Flicek, Paul; Eichler, Evan E.; Church, Deanna M.

2017-01-01

The human reference genome assembly plays a central role in nearly all aspects of today's basic and clinical research. GRCh38 is the first coordinate-changing assembly update since 2009; it reflects the resolution of roughly 1000 issues and encompasses modifications ranging from thousands of single base changes to megabase-scale path reorganizations, gap closures, and localization of previously orphaned sequences. We developed a new approach to sequence generation for targeted base updates and used data from new genome mapping technologies and single haplotype resources to identify and resolve larger assembly issues. For the first time, the reference assembly contains sequence-based representations for the centromeres. We also expanded the number of alternate loci to create a reference that provides a more robust representation of human population variation. We demonstrate that the updates render the reference an improved annotation substrate, alter read alignments in unchanged regions, and impact variant interpretation at clinically relevant loci. We additionally evaluated a collection of new de novo long-read haploid assemblies and conclude that although the new assemblies compare favorably to the reference with respect to continuity, error rate, and gene completeness, the reference still provides the best representation for complex genomic regions and coding sequences. We assert that the collected updates in GRCh38 make the newer assembly a more robust substrate for comprehensive analyses that will promote our understanding of human biology and advance our efforts to improve health. PMID:28396521
De Novo Origin of Human Protein-Coding Genes

PubMed Central

Wu, Dong-Dong; Irwin, David M.; Zhang, Ya-Ping

2011-01-01

The de novo origin of a new protein-coding gene from non-coding DNA is considered to be a very rare occurrence in genomes. Here we identify 60 new protein-coding genes that originated de novo on the human lineage since divergence from the chimpanzee. The functionality of these genes is supported by both transcriptional and proteomic evidence. RNA–seq data indicate that these genes have their highest expression levels in the cerebral cortex and testes, which might suggest that these genes contribute to phenotypic traits that are unique to humans, such as improved cognitive ability. Our results are inconsistent with the traditional view that the de novo origin of new genes is very rare, thus there should be greater appreciation of the importance of the de novo origination of genes. PMID:22102831
Combined "de novo" and "ex novo" lipid fermentation in a mix-medium of corncob acid hydrolysate and soybean oil by Trichosporon dermatis.

PubMed

Huang, Chao; Luo, Mu-Tan; Chen, Xue-Fang; Qi, Gao-Xiang; Xiong, Lian; Lin, Xiao-Qing; Wang, Can; Li, Hai-Long; Chen, Xin-De

2017-01-01

Microbial oil is one important bio-product for its important function in energy, chemical, and food industry. Finding suitable substrates is one key issue for its industrial application. Both hydrophilic and hydrophobic substrates can be utilized by oleaginous microorganisms with two different bio-pathways (" de novo " lipid fermentation and " ex novo " lipid fermentation). To date, most of the research on lipid fermentation has focused mainly on only one fermentation pathway and little work was carried out on both " de novo " and " ex novo " lipid fermentation simultaneously; thus, the advantages of both lipid fermentation cannot be fulfilled comprehensively. In this study, corncob acid hydrolysate with soybean oil was used as a mix-medium for combined " de novo " and " ex novo " lipid fermentation by oleaginous yeast Trichosporon dermatis . Both hydrophilic and hydrophobic substrates (sugars and soybean oil) in the medium can be utilized simultaneously and efficiently by T. dermatis . Different fermentation modes were compared and the batch mode was the most suitable for the combined fermentation. The influence of soybean oil concentration, inoculum size, and initial pH on the lipid fermentation was evaluated and 20 g/L soybean oil, 5% inoculum size, and initial pH 6.0 were suitable for this bioprocess. By this technology, the lipid composition of extracellular hydrophobic substrate (soybean oil) can be modified. Although adding emulsifier showed little beneficial effect on lipid production, it can modify the intracellular lipid composition of T. dermatis . The present study proves the potential and possibility of combined " de novo " and " ex novo " lipid fermentation. This technology can use hydrophilic and hydrophobic sustainable bio-resources to generate lipid feedstock for the production of biodiesel or other lipid-based chemical compounds and to treat some special wastes such as oil-containing wastewater.

Complete Genome Sequence of Clavibacter michiganensis subsp. insidiosus R1-1 Using PacBio Single-Molecule Real-Time Technology

PubMed Central

Lu, You; Samac, Deborah A.; Glazebrook, Jane

2015-01-01

We report here the complete genome sequence of Clavibacter michiganensis subsp. insidiosus R1-1, isolated in Minnesota, USA. The R1-1 genome, generated by a de novo assembly of PacBio sequencing data, is the first complete genome sequence available for this subspecies. PMID:25953184
Genes from scratch--the evolutionary fate of de novo genes.

PubMed

Schlötterer, Christian

2015-04-01

Although considered an extremely unlikely event, many genes emerge from previously noncoding genomic regions. This review covers the entire life cycle of such de novo genes. Two competing hypotheses about the process of de novo gene birth are discussed as well as the high death rate of de novo genes. Despite the high death rate, some de novo genes are retained and remain functional, even in distantly related species, through their integration into gene networks. Further studies combining gene expression with ribosome profiling in multiple populations across different species will be instrumental for an improved understanding of the evolutionary processes operating on de novo genes. Copyright © 2015 The Author. Published by Elsevier Ltd.. All rights reserved.
Sequencing, annotation and comparative analysis of nine BACs of giant panda (Ailuropoda melanoleuca).

PubMed

Zheng, Yang; Cai, Jing; Li, JianWen; Li, Bo; Lin, Runmao; Tian, Feng; Wang, XiaoLing; Wang, Jun

2010-01-01

A 10-fold BAC library for giant panda was constructed and nine BACs were selected to generate finish sequences. These BACs could be used as a validation resource for the de novo assembly accuracy of the whole genome shotgun sequencing reads of giant panda newly generated by the Illumina GA sequencing technology. Complete sanger sequencing, assembly, annotation and comparative analysis were carried out on the selected BACs of a joint length 878 kb. Homologue search and de novo prediction methods were used to annotate genes and repeats. Twelve protein coding genes were predicted, seven of which could be functionally annotated. The seven genes have an average gene size of about 41 kb, an average coding size of about 1.2 kb and an average exon number of 6 per gene. Besides, seven tRNA genes were found. About 27 percent of the BAC sequence is composed of repeats. A phylogenetic tree was constructed using neighbor-join algorithm across five species, including giant panda, human, dog, cat and mouse, which reconfirms dog as the most related species to giant panda. Our results provide detailed sequence and structure information for new genes and repeats of giant panda, which will be helpful for further studies on the giant panda.
Cross-Curricular Sequence: An Approach for Teaching Business Communication.

ERIC Educational Resources Information Center

Clarke, Lillian W.; Franklin, Carl M.

1985-01-01

The Cross-Curricular Sequencing (CCS) approach to teaching business communications is explored. Its uses in word processing, principles of management, and business policy courses are discussed. Techniques for integrating materials from these courses into business communication classes are described. The implications of CCS for business…
De Novo Enzyme Design Using Rosetta3

PubMed Central

Richter, Florian; Leaver-Fay, Andrew; Khare, Sagar D.; Bjelic, Sinisa; Baker, David

2011-01-01

The Rosetta de novo enzyme design protocol has been used to design enzyme catalysts for a variety of chemical reactions, and in principle can be applied to any arbitrary chemical reaction of interest, The process has four stages: 1) choice of a catalytic mechanism and corresponding minimal model active site, 2) identification of sites in a set of scaffold proteins where this minimal active site can be realized, 3) optimization of the identities of the surrounding residues for stabilizing interactions with the transition state and primary catalytic residues, and 4) evaluation and ranking the resulting designed sequences. Stages two through four of this process can be carried out with the Rosetta package, while stage one needs to be done externally. Here, we demonstrate how to carry out the Rosetta enzyme design protocol from start to end in detail using for illustration the triosephosphate isomerase reaction. PMID:21603656
De Novo Mutations of the Gene Encoding the Histone Acetyltransferase KAT6B Cause Genitopatellar Syndrome

PubMed Central

Simpson, Michael A.; Deshpande, Charu; Dafou, Dimitra; Vissers, Lisenka E.L.M.; Woollard, Wesley J.; Holder, Susan E.; Gillessen-Kaesbach, Gabriele; Derks, Ronny; White, Susan M.; Cohen-Snuijf, Ruthy; Kant, Sarina G.; Hoefsloot, Lies H.; Reardon, Willie; Brunner, Han G.; Bongers, Ernie M.H.F.; Trembath, Richard C.

2012-01-01

Genitopatellar syndrome (GPS) is a rare disorder in which patellar aplasia or hypoplasia is associated with external genital anomalies and severe intellectual disability. Using an exome-sequencing approach, we identified de novo mutations of KAT6B in five individuals with GPS; a single nonsense variant and three frameshift indels, including a 4 bp deletion observed in two cases. All identified mutations are located within the terminal exon of the gene and are predicted to generate a truncated protein product lacking evolutionarily conserved domains. KAT6B encodes a member of the MYST family of histone acetyltranferases. We demonstrate a reduced level of both histone H3 and H4 acetylation in patient-derived cells suggesting that dysregulation of histone acetylation is a direct functional consequence of GPS alleles. These findings define the genetic basis of GPS and illustrate the complex role of the regulation of histone acetylation during development. PMID:22265017
De novo prediction of human chromosome structures: Epigenetic marking patterns encode genome architecture.

PubMed

Di Pierro, Michele; Cheng, Ryan R; Lieberman Aiden, Erez; Wolynes, Peter G; Onuchic, José N

2017-11-14

Inside the cell nucleus, genomes fold into organized structures that are characteristic of cell type. Here, we show that this chromatin architecture can be predicted de novo using epigenetic data derived from chromatin immunoprecipitation-sequencing (ChIP-Seq). We exploit the idea that chromosomes encode a 1D sequence of chromatin structural types. Interactions between these chromatin types determine the 3D structural ensemble of chromosomes through a process similar to phase separation. First, a neural network is used to infer the relation between the epigenetic marks present at a locus, as assayed by ChIP-Seq, and the genomic compartment in which those loci reside, as measured by DNA-DNA proximity ligation (Hi-C). Next, types inferred from this neural network are used as an input to an energy landscape model for chromatin organization [Minimal Chromatin Model (MiChroM)] to generate an ensemble of 3D chromosome conformations at a resolution of 50 kilobases (kb). After training the model, dubbed Maximum Entropy Genomic Annotation from Biomarkers Associated to Structural Ensembles (MEGABASE), on odd-numbered chromosomes, we predict the sequences of chromatin types and the subsequent 3D conformational ensembles for the even chromosomes. We validate these structural ensembles by using ChIP-Seq tracks alone to predict Hi-C maps, as well as distances measured using 3D fluorescence in situ hybridization (FISH) experiments. Both sets of experiments support the hypothesis of phase separation being the driving process behind compartmentalization. These findings strongly suggest that epigenetic marking patterns encode sufficient information to determine the global architecture of chromosomes and that de novo structure prediction for whole genomes may be increasingly possible. Copyright © 2017 the Author(s). Published by PNAS.
De novo assembly and characterization of the leaf, bud, and fruit transcriptome from the vulnerable tree Juglans mandshurica for the development of 20 new microsatellite markers using Illumina sequencing.

PubMed

Hu, Zhuang; Zhang, Tian; Gao, Xiao-Xiao; Wang, Yang; Zhang, Qiang; Zhou, Hui-Juan; Zhao, Gui-Fang; Wang, Ma-Li; Woeste, Keith E; Zhao, Peng

2016-04-01

Manchurian walnut (Juglans mandshurica Maxim.) is a vulnerable, temperate deciduous tree valued for its wood and nut, but transcriptomic and genomic data for the species are very limited. Next generation sequencing (NGS) has made it possible to develop molecular markers for this species rapidly and efficiently. Our goal is to use transcriptome information from RNA-Seq to understand development in J. mandshurica and develop polymorphic simple sequence repeats (SSRs, microsatellites) to understand the species' population genetics. In this study, more than 47.7 million clean reads were generated using Illumina sequencing technology. De novo assembly yielded 99,869 unigenes with an average length of 747 bp. Based on sequence similarity search with known proteins, a total of 39,708 (42.32 %) genes were identified. Searching against the Kyoto Encyclopedia of Genes and Genomes Pathway database (KEGG) identified 15,903 (16.9 %) unigenes. Further, we identified and characterized 63 new transcriptome-derived microsatellite markers. By testing the markers on 4 to 14 individuals from four populations, we found that 20 were polymorphic and easily amplified. The number of alleles per locus ranged from 2 to 8. The observed and expected heterozygosity per locus ranged from 0.209 to 0.813 and 0.335 to 0.842, respectively. These twenty microsatellite markers will be useful for studies of population genetics, diversity, and genetic structure, and they will undoubtedly benefit future breeding studies of this walnut species. Moreover, the information uncovered in this research will also serve as a useful genetic resource for understanding the transcriptome and development of J. mandshurica and other Juglans species.
Sequential search leads to faster, more efficient fragment-based de novo protein structure prediction.

PubMed

de Oliveira, Saulo H P; Law, Eleanor C; Shi, Jiye; Deane, Charlotte M

2018-04-01

Most current de novo structure prediction methods randomly sample protein conformations and thus require large amounts of computational resource. Here, we consider a sequential sampling strategy, building on ideas from recent experimental work which shows that many proteins fold cotranslationally. We have investigated whether a pseudo-greedy search approach, which begins sequentially from one of the termini, can improve the performance and accuracy of de novo protein structure prediction. We observed that our sequential approach converges when fewer than 20 000 decoys have been produced, fewer than commonly expected. Using our software, SAINT2, we also compared the run time and quality of models produced in a sequential fashion against a standard, non-sequential approach. Sequential prediction produces an individual decoy 1.5-2.5 times faster than non-sequential prediction. When considering the quality of the best model, sequential prediction led to a better model being produced for 31 out of 41 soluble protein validation cases and for 18 out of 24 transmembrane protein cases. Correct models (TM-Score > 0.5) were produced for 29 of these cases by the sequential mode and for only 22 by the non-sequential mode. Our comparison reveals that a sequential search strategy can be used to drastically reduce computational time of de novo protein structure prediction and improve accuracy. Data are available for download from: http://opig.stats.ox.ac.uk/resources. SAINT2 is available for download from: https://github.com/sauloho/SAINT2. saulo.deoliveira@dtc.ox.ac.uk. Supplementary data are available at Bioinformatics online.
All-atom 3D structure prediction of transmembrane β-barrel proteins from sequences.

PubMed

Hayat, Sikander; Sander, Chris; Marks, Debora S; Elofsson, Arne

2015-04-28

Transmembrane β-barrels (TMBs) carry out major functions in substrate transport and protein biogenesis but experimental determination of their 3D structure is challenging. Encouraged by successful de novo 3D structure prediction of globular and α-helical membrane proteins from sequence alignments alone, we developed an approach to predict the 3D structure of TMBs. The approach combines the maximum-entropy evolutionary coupling method for predicting residue contacts (EVfold) with a machine-learning approach (boctopus2) for predicting β-strands in the barrel. In a blinded test for 19 TMB proteins of known structure that have a sufficient number of diverse homologous sequences available, this combined method (EVfold_bb) predicts hydrogen-bonded residue pairs between adjacent β-strands at an accuracy of ∼70%. This accuracy is sufficient for the generation of all-atom 3D models. In the transmembrane barrel region, the average 3D structure accuracy [template-modeling (TM) score] of top-ranked models is 0.54 (ranging from 0.36 to 0.85), with a higher (44%) number of residue pairs in correct strand-strand registration than in earlier methods (18%). Although the nonbarrel regions are predicted less accurately overall, the evolutionary couplings identify some highly constrained loop residues and, for FecA protein, the barrel including the structure of a plug domain can be accurately modeled (TM score = 0.68). Lower prediction accuracy tends to be associated with insufficient sequence information and we therefore expect increasing numbers of β-barrel families to become accessible to accurate 3D structure prediction as the number of available sequences increases.
Rare De Novo Copy Number Variants in Patients with Congenital Pulmonary Atresia

PubMed Central

Xie, Li; Chen, Jin-Lan; Zhang, Wei-Zhi; Wang, Shou-Zheng; Zhao, Tian-Li; Huang, Can; Wang, Jian; Yang, Jin-Fu; Yang, Yi-Feng; Tan, Zhi-Ping

2014-01-01

Background Ongoing studies using genomic microarrays and next-generation sequencing have demonstrated that the genetic contributions to cardiovascular diseases have been significantly ignored in the past. The aim of this study was to identify rare copy number variants in individuals with congenital pulmonary atresia (PA). Methods and Results Based on the hypothesis that rare structural variants encompassing key genes play an important role in heart development in PA patients, we performed high-resolution genome-wide microarrays for copy number variations (CNVs) in 82 PA patient-parent trios and 189 controls with an Illumina SNP array platform. CNVs were identified in 17/82 patients (20.7%), and eight of these CNVs (9.8%) are considered potentially pathogenic. Five de novo CNVs occurred at two known congenital heart disease (CHD) loci (16p13.1 and 22q11.2). Two de novo CNVs that may affect folate and vitamin B12 metabolism were identified for the first time. A de novo 1-Mb deletion at 17p13.2 may represent a rare genomic disorder that involves mild intellectual disability and associated facial features. Conclusions Rare CNVs contribute to the pathogenesis of PA (9.8%), suggesting that the causes of PA are heterogeneous and pleiotropic. Together with previous data from animal models, our results might help identify a link between CHD and folate-mediated one-carbon metabolism (FOCM). With the accumulation of high-resolution SNP array data, these previously undescribed rare CNVs may help reveal critical gene(s) in CHD and may provide novel insights about CHD pathogenesis. PMID:24826987
Rare de novo copy number variants in patients with congenital pulmonary atresia.

PubMed

Xie, Li; Chen, Jin-Lan; Zhang, Wei-Zhi; Wang, Shou-Zheng; Zhao, Tian-Li; Huang, Can; Wang, Jian; Yang, Jin-Fu; Yang, Yi-Feng; Tan, Zhi-Ping

2014-01-01

Ongoing studies using genomic microarrays and next-generation sequencing have demonstrated that the genetic contributions to cardiovascular diseases have been significantly ignored in the past. The aim of this study was to identify rare copy number variants in individuals with congenital pulmonary atresia (PA). Based on the hypothesis that rare structural variants encompassing key genes play an important role in heart development in PA patients, we performed high-resolution genome-wide microarrays for copy number variations (CNVs) in 82 PA patient-parent trios and 189 controls with an Illumina SNP array platform. CNVs were identified in 17/82 patients (20.7%), and eight of these CNVs (9.8%) are considered potentially pathogenic. Five de novo CNVs occurred at two known congenital heart disease (CHD) loci (16p13.1 and 22q11.2). Two de novo CNVs that may affect folate and vitamin B12 metabolism were identified for the first time. A de novo 1-Mb deletion at 17p13.2 may represent a rare genomic disorder that involves mild intellectual disability and associated facial features. Rare CNVs contribute to the pathogenesis of PA (9.8%), suggesting that the causes of PA are heterogeneous and pleiotropic. Together with previous data from animal models, our results might help identify a link between CHD and folate-mediated one-carbon metabolism (FOCM). With the accumulation of high-resolution SNP array data, these previously undescribed rare CNVs may help reveal critical gene(s) in CHD and may provide novel insights about CHD pathogenesis.
Data of first de-novo transcriptome assembly of a non-model species, hawksbill sea turtle, Eretmochelys imbricate, nesting of the Colombian Caribean.

PubMed

Hernández-Fernández, Javier

2017-12-01

The hawksbill sea turtle, Eretmochelys imbricata, is an endangered species of the Caribbean Colombian coast due to anthropic and natural factors that have decreased their population levels. Little is known about the genes that are involved in their immune system, sex determination, aging and others important functions. The data generated represents RNA sequencing and the first de-novo assembly of transcripts expressed in the blood of the hawksbill sea turtle. The raw FASTQ files were deposited in the NCBI SRA database with accession number SRX2653641. A total of 5.7 Gb raw sequence data were obtained, corresponding to 47,555,108 raw reads. Trinity was used to perform a first de-novo assembly, and we were able to identify 47,586 transcripts of the female hawksbill turtle transcriptome with an N50 of 1100 bp. The obtained transcriptome data will be useful for further studies of the physiology, biochemistry and evolution in this species.
Sequence specificity of single-stranded DNA-binding proteins: a novel DNA microarray approach

PubMed Central

Morgan, Hugh P.; Estibeiro, Peter; Wear, Martin A.; Max, Klaas E.A.; Heinemann, Udo; Cubeddu, Liza; Gallagher, Maurice P.; Sadler, Peter J.; Walkinshaw, Malcolm D.

2007-01-01

We have developed a novel DNA microarray-based approach for identification of the sequence-specificity of single-stranded nucleic-acid-binding proteins (SNABPs). For verification, we have shown that the major cold shock protein (CspB) from Bacillus subtilis binds with high affinity to pyrimidine-rich sequences, with a binding preference for the consensus sequence, 5′-GTCTTTG/T-3′. The sequence was modelled onto the known structure of CspB and a cytosine-binding pocket was identified, which explains the strong preference for a cytosine base at position 3. This microarray method offers a rapid high-throughput approach for determining the specificity and strength of ss DNA–protein interactions. Further screening of this newly emerging family of transcription factors will help provide an insight into their cellular function. PMID:17488853
De Novo Kidney Regeneration with Stem Cells

PubMed Central

Yokote, Shinya; Yamanaka, Shuichiro; Yokoo, Takashi

2012-01-01

Recent studies have reported on techniques to mobilize and activate endogenous stem-cells in injured kidneys or to introduce exogenous stem cells for tissue repair. Despite many recent advantages in renal regenerative therapy, chronic kidney disease (CKD) remains a major cause of morbidity and mortality and the number of CKD patients has been increasing. When the sophisticated structure of the kidneys is totally disrupted by end stage renal disease (ESRD), traditional stem cell-based therapy is unable to completely regenerate the damaged tissue. This suggests that whole organ regeneration may be a promising therapeutic approach to alleviate patients with uncured CKD. We summarize here the potential of stem-cell-based therapy for injured tissue repair and de novo whole kidney regeneration. In addition, we describe the hurdles that must be overcome and possible applications of this approach in kidney regeneration. PMID:23251079
Next-generation sequencing strategies enable routine detection of balanced chromosome rearrangements for clinical diagnostics and genetic research.

PubMed

Talkowski, Michael E; Ernst, Carl; Heilbut, Adrian; Chiang, Colby; Hanscom, Carrie; Lindgren, Amelia; Kirby, Andrew; Liu, Shangtao; Muddukrishna, Bhavana; Ohsumi, Toshiro K; Shen, Yiping; Borowsky, Mark; Daly, Mark J; Morton, Cynthia C; Gusella, James F

2011-04-08

The contribution of balanced chromosomal rearrangements to complex disorders remains unclear because they are not detected routinely by genome-wide microarrays and clinical localization is imprecise. Failure to consider these events bypasses a potentially powerful complement to single nucleotide polymorphism and copy-number association approaches to complex disorders, where much of the heritability remains unexplained. To capitalize on this genetic resource, we have applied optimized sequencing and analysis strategies to test whether these potentially high-impact variants can be mapped at reasonable cost and throughput. By using a whole-genome multiplexing strategy, rearrangement breakpoints could be delineated at a fraction of the cost of standard sequencing. For rearrangements already mapped regionally by karyotyping and fluorescence in situ hybridization, a targeted approach enabled capture and sequencing of multiple breakpoints simultaneously. Importantly, this strategy permitted capture and unique alignment of up to 97% of repeat-masked sequences in the targeted regions. Genome-wide analyses estimate that only 3.7% of bases should be routinely omitted from genomic DNA capture experiments. Illustrating the power of these approaches, the rearrangement breakpoints were rapidly defined to base pair resolution and revealed unexpected sequence complexity, such as co-occurrence of inversion and translocation as an underlying feature of karyotypically balanced alterations. These findings have implications ranging from genome annotation to de novo assemblies and could enable sequencing screens for structural variations at a cost comparable to that of microarrays in standard clinical practice. Copyright © 2011 The American Society of Human Genetics. Published by Elsevier Inc. All rights reserved.
Higher criticism approach to detect rare variants using whole genome sequencing data

PubMed Central

2014-01-01

Because of low statistical power of single-variant tests for whole genome sequencing (WGS) data, the association test for variant groups is a key approach for genetic mapping. To address the features of sparse and weak genetic effects to be detected, the higher criticism (HC) approach has been proposed and theoretically has proven optimal for detecting sparse and weak genetic effects. Here we develop a strategy to apply the HC approach to WGS data that contains rare variants as the majority. By using Genetic Analysis Workshop 18 "dose" genetic data with simulated phenotypes, we assess the performance of HC under a variety of strategies for grouping variants and collapsing rare variants. The HC approach is compared with the minimal p-value method and the sequence kernel association test. The results show that the HC approach is preferred for detecting weak genetic effects. PMID:25519367
Solving the water jugs problem by an integer sequence approach

NASA Astrophysics Data System (ADS)

Man, Yiu-Kwong

2012-01-01

In this article, we present an integer sequence approach to solve the classic water jugs problem. The solution steps can be obtained easily by additions and subtractions only, which is suitable for manual calculation or programming by computer. This approach can be introduced to secondary and undergraduate students, and also to teachers and lecturers involved in teaching mathematical problem solving, recreational mathematics, or elementary number theory.
Complete Genome Sequence of Clavibacter michiganensis subsp. insidiosus R1-1 Using PacBio Single-Molecule Real-Time Technology.

PubMed

Lu, You; Samac, Deborah A; Glazebrook, Jane; Ishimaru, Carol A

2015-05-07

We report here the complete genome sequence of Clavibacter michiganensis subsp. insidiosus R1-1, isolated in Minnesota, USA. The R1-1 genome, generated by a de novo assembly of PacBio sequencing data, is the first complete genome sequence available for this subspecies. Copyright © 2015 Lu et al.
Phylogenetic Placement of Exact Amplicon Sequences Improves Associations with Clinical Information

PubMed Central

McDonald, Daniel; Gonzalez, Antonio; Navas-Molina, Jose A.; Jiang, Lingjing; Xu, Zhenjiang Zech; Winker, Kevin; Kado, Deborah M.; Orwoll, Eric; Manary, Mark; Mirarab, Siavash

2018-01-01

ABSTRACT Recent algorithmic advances in amplicon-based microbiome studies enable the inference of exact amplicon sequence fragments. These new methods enable the investigation of sub-operational taxonomic units (sOTU) by removing erroneous sequences. However, short (e.g., 150-nucleotide [nt]) DNA sequence fragments do not contain sufficient phylogenetic signal to reproduce a reasonable tree, introducing a barrier in the utilization of critical phylogenetically aware metrics such as Faith’s PD or UniFrac. Although fragment insertion methods do exist, those methods have not been tested for sOTUs from high-throughput amplicon studies in insertions against a broad reference phylogeny. We benchmarked the SATé-enabled phylogenetic placement (SEPP) technique explicitly against 16S V4 sequence fragments and showed that it outperforms the conceptually problematic but often-used practice of reconstructing de novo phylogenies. In addition, we provide a BSD-licensed QIIME2 plugin (https://github.com/biocore/q2-fragment-insertion) for SEPP and integration into the microbial study management platform QIITA. IMPORTANCE The move from OTU-based to sOTU-based analysis, while providing additional resolution, also introduces computational challenges. We demonstrate that one popular method of dealing with sOTUs (building a de novo tree from the short sequences) can provide incorrect results in human gut metagenomic studies and show that phylogenetic placement of the new sequences with SEPP resolves this problem while also yielding other benefits over existing methods. PMID:29719869

Effects of short read quality and quantity on a de novo vertebrate transcriptome assembly.

PubMed

Garcia, T I; Shen, Y; Catchen, J; Amores, A; Schartl, M; Postlethwait, J; Walter, R B

2012-01-01

For many researchers, next generation sequencing data holds the key to answering a category of questions previously unassailable. One of the important and challenging steps in achieving these goals is accurately assembling the massive quantity of short sequencing reads into full nucleic acid sequences. For research groups working with non-model or wild systems, short read assembly can pose a significant challenge due to the lack of pre-existing EST or genome reference libraries. While many publications describe the overall process of sequencing and assembly, few address the topic of how many and what types of reads are best for assembly. The goal of this project was use real world data to explore the effects of read quantity and short read quality scores on the resulting de novo assemblies. Using several samples of short reads of various sizes and qualities we produced many assemblies in an automated manner. We observe how the properties of read length, read quality, and read quantity affect the resulting assemblies and provide some general recommendations based on our real-world data set. Published by Elsevier Inc.
MinION™ nanopore sequencing of environmental metagenomes: a synthetic approach.

PubMed

Brown, Bonnie L; Watson, Mick; Minot, Samuel S; Rivera, Maria C; Franklin, Rima B

2017-03-01

Environmental metagenomic analysis is typically accomplished by assigning taxonomy and/or function from whole genome sequencing or 16S amplicon sequences. Both of these approaches are limited, however, by read length, among other technical and biological factors. A nanopore-based sequencing platform, MinION™, produces reads that are ≥1 × 104 bp in length, potentially providing for more precise assignment, thereby alleviating some of the limitations inherent in determining metagenome composition from short reads. We tested the ability of sequence data produced by MinION (R7.3 flow cells) to correctly assign taxonomy in single bacterial species runs and in three types of low-complexity synthetic communities: a mixture of DNA using equal mass from four species, a community with one relatively rare (1%) and three abundant (33% each) components, and a mixture of genomic DNA from 20 bacterial strains of staggered representation. Taxonomic composition of the low-complexity communities was assessed by analyzing the MinION sequence data with three different bioinformatic approaches: Kraken, MG-RAST, and One Codex. Results: Long read sequences generated from libraries prepared from single strains using the version 5 kit and chemistry, run on the original MinION device, yielded as few as 224 to as many as 3497 bidirectional high-quality (2D) reads with an average overall study length of 6000 bp. For the single-strain analyses, assignment of reads to the correct genus by different methods ranged from 53.1% to 99.5%, assignment to the correct species ranged from 23.9% to 99.5%, and the majority of misassigned reads were to closely related organisms. A synthetic metagenome sequenced with the same setup yielded 714 high quality 2D reads of approximately 5500 bp that were up to 98% correctly assigned to the species level. Synthetic metagenome MinION libraries generated using version 6 kit and chemistry yielded from 899 to 3497 2D reads with lengths averaging 5700 bp with up
Environmental Barcoding: A Next-Generation Sequencing Approach for Biomonitoring Applications Using River Benthos

PubMed Central

Hajibabaei, Mehrdad; Shokralla, Shadi; Zhou, Xin; Singer, Gregory A. C.; Baird, Donald J.

2011-01-01

Timely and accurate biodiversity analysis poses an ongoing challenge for the success of biomonitoring programs. Morphology-based identification of bioindicator taxa is time consuming, and rarely supports species-level resolution especially for immature life stages. Much work has been done in the past decade to develop alternative approaches for biodiversity analysis using DNA sequence-based approaches such as molecular phylogenetics and DNA barcoding. On-going assembly of DNA barcode reference libraries will provide the basis for a DNA-based identification system. The use of recently introduced next-generation sequencing (NGS) approaches in biodiversity science has the potential to further extend the application of DNA information for routine biomonitoring applications to an unprecedented scale. Here we demonstrate the feasibility of using 454 massively parallel pyrosequencing for species-level analysis of freshwater benthic macroinvertebrate taxa commonly used for biomonitoring. We designed our experiments in order to directly compare morphology-based, Sanger sequencing DNA barcoding, and next-generation environmental barcoding approaches. Our results show the ability of 454 pyrosequencing of mini-barcodes to accurately identify all species with more than 1% abundance in the pooled mixture. Although the approach failed to identify 6 rare species in the mixture, the presence of sequences from 9 species that were not represented by individuals in the mixture provides evidence that DNA based analysis may yet provide a valuable approach in finding rare species in bulk environmental samples. We further demonstrate the application of the environmental barcoding approach by comparing benthic macroinvertebrates from an urban region to those obtained from a conservation area. Although considerable effort will be required to robustly optimize NGS tools to identify species from bulk environmental samples, our results indicate the potential of an environmental barcoding
12 CFR 1263.14 - De novo insured depository institution applicants.

Code of Federal Regulations, 2011 CFR

2011-01-01

... 12 Banks and Banking 7 2011-01-01 2011-01-01 false De novo insured depository institution... MEMBERS OF THE BANKS Eligibility Requirements § 1263.14 De novo insured depository institution applicants... (de novo applicant) is deemed to meet the requirements of §§ 1263.7, 1263.8, 1263.11 and 1263.12. (b...
12 CFR 925.14 - De novo insured depository institution applicants.

Code of Federal Regulations, 2010 CFR

2010-01-01

... 12 Banks and Banking 7 2010-01-01 2010-01-01 false De novo insured depository institution... MEMBERS AND HOUSING ASSOCIATES MEMBERS OF THE BANKS Eligibility Requirements § 925.14 De novo insured... application for membership in the Bank (de novo applicant) is deemed to meet the requirements of §§ 925.7, 925...
Antimicrobial peptide capsids of de novo design.

PubMed

De Santis, Emiliana; Alkassem, Hasan; Lamarre, Baptiste; Faruqui, Nilofar; Bella, Angelo; Noble, James E; Micale, Nicola; Ray, Santanu; Burns, Jonathan R; Yon, Alexander R; Hoogenboom, Bart W; Ryadnov, Maxim G

2017-12-22

The spread of bacterial resistance to antibiotics poses the need for antimicrobial discovery. With traditional search paradigms being exhausted, approaches that are altogether different from antibiotics may offer promising and creative solutions. Here, we introduce a de novo peptide topology that-by emulating the virus architecture-assembles into discrete antimicrobial capsids. Using the combination of high-resolution and real-time imaging, we demonstrate that these artificial capsids assemble as 20-nm hollow shells that attack bacterial membranes and upon landing on phospholipid bilayers instantaneously (seconds) convert into rapidly expanding pores causing membrane lysis (minutes). The designed capsids show broad antimicrobial activities, thus executing one primary function-they destroy bacteria on contact.
Characterization of X Chromosome Inactivation Using Integrated Analysis of Whole-Exome and mRNA Sequencing

PubMed Central

Szelinger, Szabolcs; Malenica, Ivana; Corneveaux, Jason J.; Siniard, Ashley L.; Kurdoglu, Ahmet A.; Ramsey, Keri M.; Schrauwen, Isabelle; Trent, Jeffrey M.; Narayanan, Vinodh; Huentelman, Matthew J.; Craig, David W.

2014-01-01

In females, X chromosome inactivation (XCI) is an epigenetic, gene dosage compensatory mechanism by inactivation of one copy of X in cells. Random XCI of one of the parental chromosomes results in an approximately equal proportion of cells expressing alleles from either the maternally or paternally inherited active X, and is defined by the XCI ratio. Skewed XCI ratio is suggestive of non-random inactivation, which can play an important role in X-linked genetic conditions. Current methods rely on indirect, semi-quantitative DNA methylation-based assay to estimate XCI ratio. Here we report a direct approach to estimate XCI ratio by integrated, family-trio based whole-exome and mRNA sequencing using phase-by-transmission of alleles coupled with allele-specific expression analysis. We applied this method to in silico data and to a clinical patient with mild cognitive impairment but no clear diagnosis or understanding molecular mechanism underlying the phenotype. Simulation showed that phased and unphased heterozygous allele expression can be used to estimate XCI ratio. Segregation analysis of the patient's exome uncovered a de novo, interstitial, 1.7 Mb deletion on Xp22.31 that originated on the paternally inherited X and previously been associated with heterogeneous, neurological phenotype. Phased, allelic expression data suggested an 83∶20 moderately skewed XCI that favored the expression of the maternally inherited, cytogenetically normal X and suggested that the deleterious affect of the de novo event on the paternal copy may be offset by skewed XCI that favors expression of the wild-type X. This study shows the utility of integrated sequencing approach in XCI ratio estimation. PMID:25503791
Integrating De Novo Transcriptome Assembly and Cloning to Obtain Chicken Ovocleidin-17 Full-Length cDNA

PubMed Central

Ning, ZhongHua; Hincke, Maxwell T.; Yang, Ning; Hou, ZhuoCheng

2014-01-01

Efficiently obtaining full-length cDNA for a target gene is the key step for functional studies and probing genetic variations. However, almost all sequenced domestic animal genomes are not ‘finished’. Many functionally important genes are located in these gapped regions. It can be difficult to obtain full-length cDNA for which only partial amino acid/EST sequences exist. In this study we report a general pipeline to obtain full-length cDNA, and illustrate this approach for one important gene (Ovocleidin-17, OC-17) that is associated with chicken eggshell biomineralization. Chicken OC-17 is one of the best candidates to control and regulate the deposition of calcium carbonate in the calcified eggshell layer. OC-17 protein has been purified, sequenced, and has had its three-dimensional structure solved. However, researchers still cannot conduct OC-17 mRNA related studies because the mRNA sequence is unknown and the gene is absent from the current chicken genome. We used RNA-Seq to obtain the entire transcriptome of the adult hen uterus, and then conducted de novo transcriptome assembling with bioinformatics analysis to obtain candidate OC-17 transcripts. Based on this sequence, we used RACE and PCR cloning methods to successfully obtain the full-length OC-17 cDNA. Temporal and spatial OC-17 mRNA expression analyses were also performed to demonstrate that OC-17 is predominantly expressed in the adult hen uterus during the laying cycle and barely at immature developmental stages. Differential uterine expression of OC-17 was observed in hens laying eggs with weak versus strong eggshell, confirming its important role in the regulation of eggshell mineralization and providing a new tool for genetic selection for eggshell quality parameters. This study is the first one to report the full-length OC-17 cDNA sequence, and builds a foundation for OC-17 mRNA related studies. We provide a general method for biologists experiencing difficulty in obtaining candidate gene full
Integrating de novo transcriptome assembly and cloning to obtain chicken Ovocleidin-17 full-length cDNA.

PubMed

Zhang, Quan; Liu, Long; Zhu, Feng; Ning, ZhongHua; Hincke, Maxwell T; Yang, Ning; Hou, ZhuoCheng

2014-01-01

Efficiently obtaining full-length cDNA for a target gene is the key step for functional studies and probing genetic variations. However, almost all sequenced domestic animal genomes are not 'finished'. Many functionally important genes are located in these gapped regions. It can be difficult to obtain full-length cDNA for which only partial amino acid/EST sequences exist. In this study we report a general pipeline to obtain full-length cDNA, and illustrate this approach for one important gene (Ovocleidin-17, OC-17) that is associated with chicken eggshell biomineralization. Chicken OC-17 is one of the best candidates to control and regulate the deposition of calcium carbonate in the calcified eggshell layer. OC-17 protein has been purified, sequenced, and has had its three-dimensional structure solved. However, researchers still cannot conduct OC-17 mRNA related studies because the mRNA sequence is unknown and the gene is absent from the current chicken genome. We used RNA-Seq to obtain the entire transcriptome of the adult hen uterus, and then conducted de novo transcriptome assembling with bioinformatics analysis to obtain candidate OC-17 transcripts. Based on this sequence, we used RACE and PCR cloning methods to successfully obtain the full-length OC-17 cDNA. Temporal and spatial OC-17 mRNA expression analyses were also performed to demonstrate that OC-17 is predominantly expressed in the adult hen uterus during the laying cycle and barely at immature developmental stages. Differential uterine expression of OC-17 was observed in hens laying eggs with weak versus strong eggshell, confirming its important role in the regulation of eggshell mineralization and providing a new tool for genetic selection for eggshell quality parameters. This study is the first one to report the full-length OC-17 cDNA sequence, and builds a foundation for OC-17 mRNA related studies. We provide a general method for biologists experiencing difficulty in obtaining candidate gene full
De novo transcriptome analysis of rose-scented geranium provides insights into the metabolic specificity of terpene and tartaric acid biosynthesis.

PubMed

Narnoliya, Lokesh K; Kaushal, Girija; Singh, Sudhir P; Sangwan, Rajender S

2017-01-13

Rose-scented geranium (Pelargonium sp.) is a perennial herb that produces a high value essential oil of fragrant significance due to the characteristic compositional blend of rose-oxide and acyclic monoterpenoids in foliage. Recently, the plant has also been shown to produce tartaric acid in leaf tissues. Rose-scented geranium represents top-tier cash crop in terms of economic returns and significance of the plant and plant products. However, there has hardly been any study on its metabolism and functional genomics, nor any genomic expression dataset resource is available in public domain. Therefore, to begin the gains in molecular understanding of specialized metabolic pathways of the plant, de novo sequencing of rose-scented geranium leaf transcriptome, transcript assembly, annotation, expression profiling as well as their validation were carried out. De novo transcriptome analysis resulted a total of 78,943 unique contigs (average length: 623 bp, and N50 length: 752 bp) from 15.44 million high quality raw reads. In silico functional annotation led to the identification of several putative genes representing terpene, ascorbic acid and tartaric acid biosynthetic pathways, hormone metabolism, and transcription factors. Additionally, a total of 6,040 simple sequence repeat (SSR) motifs were identified in 6.8% of the expressed transcripts. The highest frequency of SSR was of tri-nucleotides (50%). Further, transcriptome assembly was validated for randomly selected putative genes by standard PCR-based approach. In silico expression profile of assembled contigs were validated by real-time PCR analysis of selected transcripts. Being the first report on transcriptome analysis of rose-scented geranium the data sets and the leads and directions reflected in this investigation will serve as a foundation for pursuing and understanding molecular aspects of its biology, and specialized metabolic pathways, metabolic engineering, genetic diversity as well as molecular breeding.
Icarus: visualizer for de novo assembly evaluation.

PubMed

Mikheenko, Alla; Valin, Gleb; Prjibelski, Andrey; Saveliev, Vladislav; Gurevich, Alexey

2016-11-01

: Data visualization plays an increasingly important role in NGS data analysis. With advances in both sequencing and computational technologies, it has become a new bottleneck in genomics studies. Indeed, evaluation of de novo genome assemblies is one of the areas that can benefit from the visualization. However, even though multiple quality assessment methods are now available, existing visualization tools are hardly suitable for this purpose. Here, we present Icarus-a novel genome visualizer for accurate assessment and analysis of genomic draft assemblies, which is based on the tool QUAST. Icarus can be used in studies where a related reference genome is available, as well as for non-model organisms. The tool is available online and as a standalone application. http://cab.spbu.ru/software/icarus CONTACT: aleksey.gurevich@spbu.ruSupplementary information: Supplementary data are available at Bioinformatics online. © The Author 2016. Published by Oxford University Press. All rights reserved. For Permissions, please e-mail: journals.permissions@oup.com.
De Novo Assembly of the Donkey White Blood Cell Transcriptome and a Comparative Analysis of Phenotype-Associated Genes between Donkeys and Horses.

PubMed

Xie, Feng-Yun; Feng, Yu-Long; Wang, Hong-Hui; Ma, Yun-Feng; Yang, Yang; Wang, Yin-Chao; Shen, Wei; Pan, Qing-Jie; Yin, Shen; Sun, Yu-Jiang; Ma, Jun-Yu

2015-01-01

Prior to the mechanization of agriculture and labor-intensive tasks, humans used donkeys (Equus africanus asinus) for farm work and packing. However, as mechanization increased, donkeys have been increasingly raised for meat, milk, and fur in China. To maintain the development of the donkey industry, breeding programs should focus on traits related to these new uses. Compared to conventional marker-assisted breeding plans, genome- and transcriptome-based selection methods are more efficient and effective. To analyze the coding genes of the donkey genome, we assembled the transcriptome of donkey white blood cells de novo. Using transcriptomic deep-sequencing data, we identified 264,714 distinct donkey unigenes and predicted 38,949 protein fragments. We annotated the donkey unigenes by BLAST searches against the non-redundant (NR) protein database. We also compared the donkey protein sequences with those of the horse (E. caballus) and wild horse (E. przewalskii), and linked the donkey protein fragments with mammalian phenotypes. As the outer ear size of donkeys and horses are obviously different, we compared the outer ear size-associated proteins in donkeys and horses. We identified three ear size-associated proteins, HIC1, PRKRA, and KMT2A, with sequence differences among the donkey, horse, and wild horse loci. Since the donkey genome sequence has not been released, the de novo assembled donkey transcriptome is helpful for preliminary investigations of donkey cultivars and for genetic improvement.
A long PCR–based approach for DNA enrichment prior to next-generation sequencing for systematic studies1

PubMed Central

Uribe-Convers, Simon; Duke, Justin R.; Moore, Michael J.; Tank, David C.

2014-01-01

• Premise of the study: We present an alternative approach for molecular systematic studies that combines long PCR and next-generation sequencing. Our approach can be used to generate templates from any DNA source for next-generation sequencing. Here we test our approach by amplifying complete chloroplast genomes, and we present a set of 58 potentially universal primers for angiosperms to do so. Additionally, this approach is likely to be particularly useful for nuclear and mitochondrial regions. • Methods and Results: Chloroplast genomes of 30 species across angiosperms were amplified to test our approach. Amplification success varied depending on whether PCR conditions were optimized for a given taxon. To further test our approach, some amplicons were sequenced on an Illumina HiSeq 2000. • Conclusions: Although here we tested this approach by sequencing plastomes, long PCR amplicons could be generated using DNA from any genome, expanding the possibilities of this approach for molecular systematic studies. PMID:25202592
Design of nucleic acid sequences for DNA computing based on a thermodynamic approach

PubMed Central

Tanaka, Fumiaki; Kameda, Atsushi; Yamamoto, Masahito; Ohuchi, Azuma

2005-01-01

We have developed an algorithm for designing multiple sequences of nucleic acids that have a uniform melting temperature between the sequence and its complement and that do not hybridize non-specifically with each other based on the minimum free energy (ΔGmin). Sequences that satisfy these constraints can be utilized in computations, various engineering applications such as microarrays, and nano-fabrications. Our algorithm is a random generate-and-test algorithm: it generates a candidate sequence randomly and tests whether the sequence satisfies the constraints. The novelty of our algorithm is that the filtering method uses a greedy search to calculate ΔGmin. This effectively excludes inappropriate sequences before ΔGmin is calculated, thereby reducing computation time drastically when compared with an algorithm without the filtering. Experimental results in silico showed the superiority of the greedy search over the traditional approach based on the hamming distance. In addition, experimental results in vitro demonstrated that the experimental free energy (ΔGexp) of 126 sequences correlated well with ΔGmin (|R| = 0.90) than with the hamming distance (|R| = 0.80). These results validate the rationality of a thermodynamic approach. We implemented our algorithm in a graphic user interface-based program written in Java. PMID:15701762
Origins and Structural Properties of Novel and De Novo Protein Domains During Insect Evolution.

PubMed

Klasberg, Steffen; Bitard-Feildel, Tristan; Callebaut, Isabelle; Bornberg-Bauer, Erich

2018-05-26

Over long time scales, protein evolution is characterised by modular rearrangements of protein domains. Such rearrangements are mainly caused by gene duplication, fusion and terminal losses. To better understand domain emergence mechanisms we investigated 32 insect genomes covering a speciation gradient ranging from ~ 2 to ~ 390 my. We use established domain models and foldable domains delineated by Hydrophobic-Cluster-Analysis (HCA), which does not require homologous sequences, to also identify domains which have likely arisen de novo, i.e. from previously non-coding DNA. Our results indicate that most novel domains emerge terminally as they originate from ORF extensions while fewer arise in middle arrangements, resulting from exonisation of intronic or intergenic regions. Many novel domains rapidly migrate between terminal or middle positions and single- and multi-domain arrangements. Young domains, such as most HCA defined domains, are under strong selection pressure as they show signals of purifying selection. De novo domains, linked to ancient domains or defined by HCA, have higher degrees of intrinsic disorder and disorder-to-order transition upon binding than ancient domains. However, the corresponding DNA sequences of the novel domains of denovo origins could only rarely be found in sister genomes. We conclude that novel domains are often recruited by other proteins and undergo important structural modifications shortly after their emergence, but evolve too fast to be characterised by cross-species comparisons alone. This article is protected by copyright. All rights reserved. This article is protected by copyright. All rights reserved.
Influence of active site location on catalytic activity in de novo-designed zinc metalloenzymes.

PubMed

Zastrow, Melissa L; Pecoraro, Vincent L

2013-04-17

While metalloprotein design has now yielded a number of successful metal-bound and even catalytically active constructs, the question of where to put a metal site along a linear, repetitive sequence has not been thoroughly addressed. Often several possibilities in a given sequence may exist that would appear equivalent but may in fact differ for metal affinity, substrate access, or protein dynamics. We present a systematic variation of active site location for a hydrolytically active ZnHis3O site contained within a de novo-designed three-stranded coiled coil. We find that the maximal rate, substrate access, and metal-binding affinity are dependent on the selected position, while catalytic efficiency for p-nitrophenyl acetate hydrolysis can be retained regardless of the location of the active site. This achievement demonstrates how efficient, tailor-made enzymes which control rate, pKa, substrate and solvent access (and selectivity), and metal-binding affinity may be realized. These findings may be applied to the more advanced de novo design of constructs containing secondary interactions, such as hydrogen-bonding channels. We are now confident that changes to location for accommodating such channels can be achieved without location-dependent loss of catalytic efficiency. These findings bring us closer to our ultimate goal of incorporating the secondary interactions we believe will be necessary in order to improve both active site properties and the catalytic efficiency to be competitive with the native enzyme, carbonic anhydrase.
The central nervous system transcriptome of the weakly electric brown ghost knifefish (Apteronotus leptorhynchus): de novo assembly, annotation, and proteomics validation.

PubMed

Salisbury, Joseph P; Sîrbulescu, Ruxandra F; Moran, Benjamin M; Auclair, Jared R; Zupanc, Günther K H; Agar, Jeffrey N

2015-03-11

The brown ghost knifefish (Apteronotus leptorhynchus) is a weakly electric teleost fish of particular interest as a versatile model system for a variety of research areas in neuroscience and biology. The comprehensive information available on the neurophysiology and neuroanatomy of this organism has enabled significant advances in such areas as the study of the neural basis of behavior, the development of adult-born neurons in the central nervous system and their involvement in the regeneration of nervous tissue, as well as brain aging and senescence. Despite substantial scientific interest in this species, no genomic resources are currently available. Here, we report the de novo assembly and annotation of the A. leptorhynchus transcriptome. After evaluating several trimming and transcript reconstruction strategies, de novo assembly using Trinity uncovered 42,459 unique contigs containing at least a partial protein-coding sequence based on alignment to a reference set of known Actinopterygii sequences. As many as 11,847 of these contigs contained full or near-full length protein sequences, providing broad coverage of the proteome. A variety of non-coding RNA sequences were also identified and annotated, including conserved long intergenic non-coding RNA and other long non-coding RNA observed previously to be expressed in adult zebrafish (Danio rerio) brain, as well as a variety of miRNA, snRNA, and snoRNA. Shotgun proteomics confirmed translation of open reading frames from over 2,000 transcripts, including alternative splice variants. Assignment of tandem mass spectra was greatly improved by use of the assembly compared to databases of sequences from closely related organisms. The assembly and raw reads have been deposited at DDBJ/EMBL/GenBank under the accession number GBKR00000000. Tandem mass spectrometry data is available via ProteomeXchange with identifier PXD001285. Presented here is the first release of an annotated de novo transcriptome assembly from
Prevalence and architecture of de novo mutations in developmental disorders.

PubMed

2017-02-23

The genomes of individuals with severe, undiagnosed developmental disorders are enriched in damaging de novo mutations (DNMs) in developmentally important genes. Here we have sequenced the exomes of 4,293 families containing individuals with developmental disorders, and meta-analysed these data with data from another 3,287 individuals with similar disorders. We show that the most important factors influencing the diagnostic yield of DNMs are the sex of the affected individual, the relatedness of their parents, whether close relatives are affected and the parental ages. We identified 94 genes enriched in damaging DNMs, including 14 that previously lacked compelling evidence of involvement in developmental disorders. We have also characterized the phenotypic diversity among these disorders. We estimate that 42% of our cohort carry pathogenic DNMs in coding sequences; approximately half of these DNMs disrupt gene function and the remainder result in altered protein function. We estimate that developmental disorders caused by DNMs have an average prevalence of 1 in 213 to 1 in 448 births, depending on parental age. Given current global demographics, this equates to almost 400,000 children born per year.
QSRA: a quality-value guided de novo short read assembler.

PubMed

Bryant, Douglas W; Wong, Weng-Keen; Mockler, Todd C

2009-02-24

New rapid high-throughput sequencing technologies have sparked the creation of a new class of assembler. Since all high-throughput sequencing platforms incorporate errors in their output, short-read assemblers must be designed to account for this error while utilizing all available data. We have designed and implemented an assembler, Quality-value guided Short Read Assembler, created to take advantage of quality-value scores as a further method of dealing with error. Compared to previous published algorithms, our assembler shows significant improvements not only in speed but also in output quality. QSRA generally produced the highest genomic coverage, while being faster than VCAKE. QSRA is extremely competitive in its longest contig and N50/N80 contig lengths, producing results of similar quality to those of EDENA and VELVET. QSRA provides a step closer to the goal of de novo assembly of complex genomes, improving upon the original VCAKE algorithm by not only drastically reducing runtimes but also increasing the viability of the assembly algorithm through further error handling capabilities.
An approach for identification of unknown viruses using sequencing-by-hybridization.

PubMed

Katoski, Sarah E; Meyer, Hermann; Ibrahim, Sofi

2015-09-01

Accurate identification of biological threat agents, especially RNA viruses, in clinical or environmental samples can be challenging because the concentration of viral genomic material in a given sample is usually low, viral genomic RNA is liable to degradation, and RNA viruses are extremely diverse. A two-tiered approach was used for initial identification, then full genomic characterization of 199 RNA viruses belonging to virus families Arenaviridae, Bunyaviridae, Filoviridae, Flaviviridae, and Togaviridae. A Sequencing-by-hybridization (SBH) microarray was used to tentatively identify a viral pathogen then, the identity is confirmed by guided next-generation sequencing (NGS). After optimization and evaluation of the SBH and NGS methodologies with various virus species and strains, the approach was used to test the ability to identify viruses in blinded samples. The SBH correctly identified two Ebola viruses in the blinded samples within 24 hr, and by using guided amplicon sequencing with 454 GS FLX, the identities of the viruses in both samples were confirmed. SBH provides at relatively low-cost screening of biological samples against a panel of viral pathogens that can be custom-designed on a microarray. Once the identity of virus is deduced from the highest hybridization signal on the SBH microarray, guided (amplicon) NGS sequencing can be used not only to confirm the identity of the virus but also to provide further information about the strain or isolate, including a potential genetic manipulation. This approach can be useful in situations where natural or deliberate biological threat incidents might occur and a rapid response is required. © 2015 Wiley Periodicals, Inc.

MIG-seq: an effective PCR-based method for genome-wide single-nucleotide polymorphism genotyping using the next-generation sequencing platform

PubMed Central

Suyama, Yoshihisa; Matsuki, Yu

2015-01-01

Restriction-enzyme (RE)-based next-generation sequencing methods have revolutionized marker-assisted genetic studies; however, the use of REs has limited their widespread adoption, especially in field samples with low-quality DNA and/or small quantities of DNA. Here, we developed a PCR-based procedure to construct reduced representation libraries without RE digestion steps, representing de novo single-nucleotide polymorphism discovery, and its genotyping using next-generation sequencing. Using multiplexed inter-simple sequence repeat (ISSR) primers, thousands of genome-wide regions were amplified effectively from a wide variety of genomes, without prior genetic information. We demonstrated: 1) Mendelian gametic segregation of the discovered variants; 2) reproducibility of genotyping by checking its applicability for individual identification; and 3) applicability in a wide variety of species by checking standard population genetic analysis. This approach, called multiplexed ISSR genotyping by sequencing, should be applicable to many marker-assisted genetic studies with a wide range of DNA qualities and quantities. PMID:26593239
Sequence comparison alignment-free approach based on suffix tree and L-words frequency.

PubMed

Soares, Inês; Goios, Ana; Amorim, António

2012-01-01

The vast majority of methods available for sequence comparison rely on a first sequence alignment step, which requires a number of assumptions on evolutionary history and is sometimes very difficult or impossible to perform due to the abundance of gaps (insertions/deletions). In such cases, an alternative alignment-free method would prove valuable. Our method starts by a computation of a generalized suffix tree of all sequences, which is completed in linear time. Using this tree, the frequency of all possible words with a preset length L-L-words--in each sequence is rapidly calculated. Based on the L-words frequency profile of each sequence, a pairwise standard Euclidean distance is then computed producing a symmetric genetic distance matrix, which can be used to generate a neighbor joining dendrogram or a multidimensional scaling graph. We present an improvement to word counting alignment-free approaches for sequence comparison, by determining a single optimal word length and combining suffix tree structures to the word counting tasks. Our approach is, thus, a fast and simple application that proved to be efficient and powerful when applied to mitochondrial genomes. The algorithm was implemented in Python language and is freely available on the web.
De novo mutation in the dopamine transporter gene associates dopamine dysfunction with autism spectrum disorder.

PubMed

Hamilton, P J; Campbell, N G; Sharma, S; Erreger, K; Herborg Hansen, F; Saunders, C; Belovich, A N; Sahai, M A; Cook, E H; Gether, U; McHaourab, H S; Matthies, H J G; Sutcliffe, J S; Galli, A

2013-12-01

De novo genetic variation is an important class of risk factors for autism spectrum disorder (ASD). Recently, whole-exome sequencing of ASD families has identified a novel de novo missense mutation in the human dopamine (DA) transporter (hDAT) gene, which results in a Thr to Met substitution at site 356 (hDAT T356M). The dopamine transporter (DAT) is a presynaptic membrane protein that regulates dopaminergic tone in the central nervous system by mediating the high-affinity reuptake of synaptically released DA, making it a crucial regulator of DA homeostasis. Here, we report the first functional, structural and behavioral characterization of an ASD-associated de novo mutation in the hDAT. We demonstrate that the hDAT T356M displays anomalous function, characterized as a persistent reverse transport of DA (substrate efflux). Importantly, in the bacterial homolog leucine transporter, substitution of A289 (the homologous site to T356) with a Met promotes an outward-facing conformation upon substrate binding. In the substrate-bound state, an outward-facing transporter conformation is required for substrate efflux. In Drosophila melanogaster, the expression of hDAT T356M in DA neurons-lacking Drosophila DAT leads to hyperlocomotion, a trait associated with DA dysfunction and ASD. Taken together, our findings demonstrate that alterations in DA homeostasis, mediated by aberrant DAT function, may confer risk for ASD and related neuropsychiatric conditions.
High-throughput sequencing of the chloroplast and mitochondrion of Chlamydomonas reinhardtii to generate improved de novo assemblies, analyze expression patterns and transcript speciation, and evaluate diversity among laboratory strains and wild isolates

DOE Office of Scientific and Technical Information (OSTI.GOV)

Gallaher, Sean D.; Fitz-Gibbon, Sorel T.; Strenkert, Daniela

Chlamydomonas reinhardtii is a unicellular chlorophyte alga that is widely studied as a reference organism for understanding photosynthesis, sensory and motile cilia, and for development of an algal-based platform for producing biofuels and bio-products. Its highly repetitive, ~205-kbp circular chloroplast genome and ~15.8-kbp linear mitochondrial genome were sequenced prior to the advent of high-throughput sequencing technologies. Here, high coverage shotgun sequencing was used to assemble both organellar genomes de novo. These new genomes correct dozens of errors in the prior genome sequences and annotations. Gen-ome sequencing coverage indicates that each cell contains on average 83 copies of the chloroplast genomemore » and 130 copies of the mitochondrial genome. Using protocols and analyses optimized for organellar tran-scripts, RNA-Seq was used to quantify their relative abundances across 12 different growth conditions. Forty-six percent of total cellular mRNA is attributable to high expression from a few dozen chloroplast genes. RNA-Seq data were used to guide gene annotation, to demonstrate polycistronic gene expression, and to quantify splicing of psaA and psbA introns. In contrast to a conclusion from a recent study, we found that chloroplast transcripts are not edited. Unexpectedly, cytosine-rich polynucleotide tails were observed at the 3’-end of all mitochondrial transcripts. A comparative genomics analysis of eight laboratory strains and 11 wild isolates of C. reinhardtii identified 2658 variants in the organellargenomes, which is 1/10th as much genetic diversity as is found in the nucleus.« less
Origins of De Novo Genes in Human and Chimpanzee.

PubMed

Ruiz-Orera, Jorge; Hernandez-Rodriguez, Jessica; Chiva, Cristina; Sabidó, Eduard; Kondova, Ivanela; Bontrop, Ronald; Marqués-Bonet, Tomàs; Albà, M Mar

2015-12-01

The birth of new genes is an important motor of evolutionary innovation. Whereas many new genes arise by gene duplication, others originate at genomic regions that did not contain any genes or gene copies. Some of these newly expressed genes may acquire coding or non-coding functions and be preserved by natural selection. However, it is yet unclear which is the prevalence and underlying mechanisms of de novo gene emergence. In order to obtain a comprehensive view of this process, we have performed in-depth sequencing of the transcriptomes of four mammalian species--human, chimpanzee, macaque, and mouse--and subsequently compared the assembled transcripts and the corresponding syntenic genomic regions. This has resulted in the identification of over five thousand new multiexonic transcriptional events in human and/or chimpanzee that are not observed in the rest of species. Using comparative genomics, we show that the expression of these transcripts is associated with the gain of regulatory motifs upstream of the transcription start site (TSS) and of U1 snRNP sites downstream of the TSS. In general, these transcripts show little evidence of purifying selection, suggesting that many of them are not functional. However, we find signatures of selection in a subset of de novo genes which have evidence of protein translation. Taken together, the data support a model in which frequently-occurring new transcriptional events in the genome provide the raw material for the evolution of new proteins.
Origins of De Novo Genes in Human and Chimpanzee

PubMed Central

Ruiz-Orera, Jorge; Hernandez-Rodriguez, Jessica; Chiva, Cristina; Sabidó, Eduard; Kondova, Ivanela; Bontrop, Ronald; Marqués-Bonet, Tomàs; Albà, M.Mar

2015-01-01

The birth of new genes is an important motor of evolutionary innovation. Whereas many new genes arise by gene duplication, others originate at genomic regions that did not contain any genes or gene copies. Some of these newly expressed genes may acquire coding or non-coding functions and be preserved by natural selection. However, it is yet unclear which is the prevalence and underlying mechanisms of de novo gene emergence. In order to obtain a comprehensive view of this process, we have performed in-depth sequencing of the transcriptomes of four mammalian species—human, chimpanzee, macaque, and mouse—and subsequently compared the assembled transcripts and the corresponding syntenic genomic regions. This has resulted in the identification of over five thousand new multiexonic transcriptional events in human and/or chimpanzee that are not observed in the rest of species. Using comparative genomics, we show that the expression of these transcripts is associated with the gain of regulatory motifs upstream of the transcription start site (TSS) and of U1 snRNP sites downstream of the TSS. In general, these transcripts show little evidence of purifying selection, suggesting that many of them are not functional. However, we find signatures of selection in a subset of de novo genes which have evidence of protein translation. Taken together, the data support a model in which frequently-occurring new transcriptional events in the genome provide the raw material for the evolution of new proteins. PMID:26720152
De novo mutations of GCK, HNF1A and HNF4A may be more frequent in MODY than previously assumed.

PubMed

Stanik, Juraj; Dusatkova, Petra; Cinek, Ondrej; Valentinova, Lucia; Huckova, Miroslava; Skopkova, Martina; Dusatkova, Lenka; Stanikova, Daniela; Pura, Mikulas; Klimes, Iwar; Lebl, Jan; Gasperikova, Daniela; Pruhova, Stepanka

2014-03-01

MODY is mainly characterised by an early onset of diabetes and a positive family history of diabetes with an autosomal dominant mode of inheritance. However, de novo mutations have been reported anecdotally. The aim of this study was to systematically revisit a large collection of MODY patients to determine the minimum prevalence of de novo mutations in the most prevalent MODY genes (i.e. GCK, HNF1A, HNF4A). Analysis of 922 patients from two national MODY centres (Slovakia and the Czech Republic) identified 150 probands (16%) who came from pedigrees that did not fulfil the criterion of two generations with diabetes but did fulfil the remaining criteria. The GCK, HNF1A and HNF4A genes were analysed by direct sequencing. Mutations in GCK, HNF1A or HNF4A genes were detected in 58 of 150 individuals. Parents of 28 probands were unavailable for further analysis, and in 19 probands the mutation was inherited from an asymptomatic parent. In 11 probands the mutations arose de novo. In our cohort of MODY patients from two national centres the de novo mutations in GCK, HNF1A and HNF4A were present in 7.3% of the 150 families without a history of diabetes and 1.2% of all of the referrals for MODY testing. This is the largest collection of de novo MODY mutations to date, and our findings indicate a much higher frequency of de novo mutations than previously assumed. Therefore, genetic testing of MODY could be considered for carefully selected individuals without a family history of diabetes.
The draft genome sequence and annotation of the desert woodrat Neotoma lepida.

PubMed

Campbell, Michael; Oakeson, Kelly F; Yandell, Mark; Halpert, James R; Dearing, Denise

2016-09-01

We present the de novo draft genome sequence for a vertebrate mammalian herbivore, the desert woodrat (Neotoma lepida). This species is of ecological and evolutionary interest with respect to ingestion, microbial detoxification and hepatic metabolism of toxic plant secondary compounds from the highly toxic creosote bush (Larrea tridentata) and the juniper shrub (Juniperus monosperma). The draft genome sequence and annotation have been deposited at GenBank under the accession LZPO01000000.
De Novo Transcriptome Sequence Assembly from Coconut Leaves and Seeds with a Focus on Factors Involved in RNA-Directed DNA Methylation

PubMed Central

Huang, Ya-Yi; Lee, Chueh-Pai; Fu, Jason L.; Chang, Bill Chia-Han; Matzke, Antonius J. M.; Matzke, Marjori

2014-01-01

Coconut palm (Cocos nucifera) is a symbol of the tropics and a source of numerous edible and nonedible products of economic value. Despite its nutritional and industrial significance, coconut remains under-represented in public repositories for genomic and transcriptomic data. We report de novo transcript assembly from RNA-seq data and analysis of gene expression in seed tissues (embryo and endosperm) and leaves of a dwarf coconut variety. Assembly of 10 GB sequencing data for each tissue resulted in 58,211 total unigenes in embryo, 61,152 in endosperm, and 33,446 in leaf. Within each unigene pool, 24,857 could be annotated in embryo, 29,731 could be annotated in endosperm, and 26,064 could be annotated in leaf. A KEGG analysis identified 138, 138, and 139 pathways, respectively, in transcriptomes of embryo, endosperm, and leaf tissues. Given the extraordinarily large size of coconut seeds and the importance of small RNA-mediated epigenetic regulation during seed development in model plants, we used homology searches to identify putative homologs of factors required for RNA-directed DNA methylation in coconut. The findings suggest that RNA-directed DNA methylation is important during coconut seed development, particularly in maturing endosperm. This dataset will expand the genomics resources available for coconut and provide a foundation for more detailed analyses that may assist molecular breeding strategies aimed at improving this major tropical crop. PMID:25193496
De Novo Transcriptomic Analysis of an Oleaginous Microalga: Pathway Description and Gene Discovery for Production of Next-Generation Biofuels

PubMed Central

Wan, LingLin; Han, Juan; Sang, Min; Li, AiFen; Wu, Hong; Yin, ShunJi; Zhang, ChengWu

2012-01-01

Background Eustigmatos cf. polyphem is a yellow-green unicellular soil microalga belonging to the eustimatophyte with high biomass and considerable production of triacylglycerols (TAGs) for biofuels, which is thus referred to as an oleaginous microalga. The paucity of microalgae genome sequences, however, limits development of gene-based biofuel feedstock optimization studies. Here we describe the sequencing and de novo transcriptome assembly for a non-model microalgae species, E. cf. polyphem, and identify pathways and genes of importance related to biofuel production. Results We performed the de novo assembly of E. cf. polyphem transcriptome using Illumina paired-end sequencing technology. In a single run, we produced 29,199,432 sequencing reads corresponding to 2.33 Gb total nucleotides. These reads were assembled into 75,632 unigenes with a mean size of 503 bp and an N50 of 663 bp, ranging from 100 bp to >3,000 bp. Assembled unigenes were subjected to BLAST similarity searches and annotated with Gene Ontology (GO) and Kyoto Encyclopedia of Genes and Genomes (KEGG) orthology identifiers. These analyses identified the majority of carbohydrate, fatty acids, TAG and carotenoids biosynthesis and catabolism pathways in E. cf. polyphem. Conclusions Our data provides the construction of metabolic pathways involved in the biosynthesis and catabolism of carbohydrate, fatty acids, TAG and carotenoids in E. cf. polyphem and provides a foundation for the molecular genetics and functional genomics required to direct metabolic engineering efforts that seek to enhance the quantity and character of microalgae-based biofuel feedstock. PMID:22536352
Evaluation of GRCh38 and de novo haploid genome assemblies demonstrates the enduring quality of the reference assembly.

PubMed

Schneider, Valerie A; Graves-Lindsay, Tina; Howe, Kerstin; Bouk, Nathan; Chen, Hsiu-Chuan; Kitts, Paul A; Murphy, Terence D; Pruitt, Kim D; Thibaud-Nissen, Françoise; Albracht, Derek; Fulton, Robert S; Kremitzki, Milinn; Magrini, Vincent; Markovic, Chris; McGrath, Sean; Steinberg, Karyn Meltz; Auger, Kate; Chow, William; Collins, Joanna; Harden, Glenn; Hubbard, Timothy; Pelan, Sarah; Simpson, Jared T; Threadgold, Glen; Torrance, James; Wood, Jonathan M; Clarke, Laura; Koren, Sergey; Boitano, Matthew; Peluso, Paul; Li, Heng; Chin, Chen-Shan; Phillippy, Adam M; Durbin, Richard; Wilson, Richard K; Flicek, Paul; Eichler, Evan E; Church, Deanna M

2017-05-01

The human reference genome assembly plays a central role in nearly all aspects of today's basic and clinical research. GRCh38 is the first coordinate-changing assembly update since 2009; it reflects the resolution of roughly 1000 issues and encompasses modifications ranging from thousands of single base changes to megabase-scale path reorganizations, gap closures, and localization of previously orphaned sequences. We developed a new approach to sequence generation for targeted base updates and used data from new genome mapping technologies and single haplotype resources to identify and resolve larger assembly issues. For the first time, the reference assembly contains sequence-based representations for the centromeres. We also expanded the number of alternate loci to create a reference that provides a more robust representation of human population variation. We demonstrate that the updates render the reference an improved annotation substrate, alter read alignments in unchanged regions, and impact variant interpretation at clinically relevant loci. We additionally evaluated a collection of new de novo long-read haploid assemblies and conclude that although the new assemblies compare favorably to the reference with respect to continuity, error rate, and gene completeness, the reference still provides the best representation for complex genomic regions and coding sequences. We assert that the collected updates in GRCh38 make the newer assembly a more robust substrate for comprehensive analyses that will promote our understanding of human biology and advance our efforts to improve health. © 2017 Schneider et al.; Published by Cold Spring Harbor Laboratory Press.
Identification of QTLs for 14 Agronomically Important Traits in Setaria italica Based on SNPs Generated from High-Throughput Sequencing

PubMed Central

Zhang, Kai; Fan, Guangyu; Zhang, Xinxin; Zhao, Fang; Wei, Wei; Du, Guohua; Feng, Xiaolei; Wang, Xiaoming; Wang, Feng; Song, Guoliang; Zou, Hongfeng; Zhang, Xiaolei; Li, Shuangdong; Ni, Xuemei; Zhang, Gengyun; Zhao, Zhihai

2017-01-01

Foxtail millet (Setaria italica) is an important crop possessing C4 photosynthesis capability. The S. italica genome was de novo sequenced in 2012, but the sequence lacked high-density genetic maps with agronomic and yield trait linkages. In the present study, we resequenced a foxtail millet population of 439 recombinant inbred lines (RILs) and developed high-resolution bin map and high-density SNP markers, which could provide an effective approach for gene identification. A total of 59 QTL for 14 agronomic traits in plants grown under long- and short-day photoperiods were identified. The phenotypic variation explained ranged from 4.9 to 43.94%. In addition, we suggested that there may be segregation distortion on chromosome 6 that is significantly distorted toward Zhang gu. The newly identified QTL will provide a platform for sequence-based research on the S. italica genome, and for molecular marker-assisted breeding. PMID:28364039
Efficient high-throughput sequencing of a laser microdissected chromosome arm

PubMed Central

2013-01-01

Background Genomic sequence assemblies are key tools for a broad range of gene function and evolutionary studies. The diploid amphibian Xenopus tropicalis plays a pivotal role in these fields due to its combination of experimental flexibility, diploid genome, and early-branching tetrapod taxonomic position, having diverged from the amniote lineage ~360 million years ago. A genome assembly and a genetic linkage map have recently been made available. Unfortunately, large gaps in the linkage map attenuate long-range integrity of the genome assembly. Results We laser dissected the short arm of X. tropicalis chromosome 7 for next generation sequencing and computational mapping to the reference genome. This arm is of particular interest as it encodes the sex determination locus, but its genetic map contains large gaps which undermine available genome assemblies. Whole genome amplification of 15 laser-microdissected 7p arms followed by next generation sequencing yielded ~35 million reads, over four million of which uniquely mapped to the X. tropicalis genome. Our analysis placed more than 200 previously unmapped scaffolds on the analyzed chromosome arm, providing valuable low-resolution physical map information for de novo genome assembly. Conclusion We present a new approach for improving and validating genetic maps and sequence assemblies. Whole genome amplification of 15 microdissected chromosome arms provided sufficient high-quality material for localizing previously unmapped scaffolds and genes as well as recognizing mislocalized scaffolds. PMID:23714049
Modeling fructose-load-induced hepatic de-novo lipogenesis by model simplification.

PubMed

Allen, Richard J; Musante, Cynthia J

2017-01-01

Hepatic de-novo lipogenesis is a metabolic process implemented in the pathogenesis of type 2 diabetes. Clinically, the rate of this process can be ascertained by use of labeled acetate and stimulation by fructose administration. A systems pharmacology model of this process is desirable because it facilitates the description, analysis, and prediction of this experiment. Due to the multiple enzymes involved in de-novo lipogenesis, and the limited data, it is desirable to use single functional expressions to encapsulate the flux between multiple enzymes. To accomplish this we developed a novel simplification technique which uses the available information about the properties of the individual enzymes to bound the parameters of a single governing 'transfer function'. This method should be applicable to any model with linear chains of enzymes that are well stimulated. We validated this approach with computational simulations and analytical justification in a limiting case. Using this technique we generated a simple model of hepatic de-novo lipogenesis in these experimental conditions that matched prior data. This model can be used to assess pharmacological intervention at specific points on this pathway. We have demonstrated this with prospective simulation of acetyl-CoA carboxylase inhibition. This simplification technique suggests how the constituent properties of an enzymatic chain of reactions gives rise to the sensitivity (to substrate) of the pathway as a whole.
Modeling fructose-load-induced hepatic de-novo lipogenesis by model simplification

PubMed Central

Allen, Richard J; Musante, Cynthia J

2017-01-01

Hepatic de-novo lipogenesis is a metabolic process implemented in the pathogenesis of type 2 diabetes. Clinically, the rate of this process can be ascertained by use of labeled acetate and stimulation by fructose administration. A systems pharmacology model of this process is desirable because it facilitates the description, analysis, and prediction of this experiment. Due to the multiple enzymes involved in de-novo lipogenesis, and the limited data, it is desirable to use single functional expressions to encapsulate the flux between multiple enzymes. To accomplish this we developed a novel simplification technique which uses the available information about the properties of the individual enzymes to bound the parameters of a single governing ‘transfer function’. This method should be applicable to any model with linear chains of enzymes that are well stimulated. We validated this approach with computational simulations and analytical justification in a limiting case. Using this technique we generated a simple model of hepatic de-novo lipogenesis in these experimental conditions that matched prior data. This model can be used to assess pharmacological intervention at specific points on this pathway. We have demonstrated this with prospective simulation of acetyl-CoA carboxylase inhibition. This simplification technique suggests how the constituent properties of an enzymatic chain of reactions gives rise to the sensitivity (to substrate) of the pathway as a whole. PMID:28469410
De novo transcriptome assembly in chili pepper (Capsicum frutescens) to identify genes involved in the biosynthesis of capsaicinoids.

PubMed

Liu, Shaoqun; Li, Wanshun; Wu, Yimin; Chen, Changming; Lei, Jianjun

2013-01-01

The capsaicinoids are a group of compounds produced by chili pepper fruits and are used widely in many fields, especially in medical purposes. The capsaicinoid biosynthetic pathway has not yet been established clearly. To understand more knowledge in biosynthesis of capsaicinoids, we applied RNA-seq for the mixture of placenta and pericarp of pungent pepper (Capsicum frutescens L.). We have assessed the effect of various assembly parameters using different assembly software, and obtained one of the best strategies for de novo assembly of transcriptome data. We obtained a total 54,045 high-quality unigenes (transcripts) using Trinity software. About 92.65% of unigenes showed similarity to the public protein sequences, genome of potato and tomato and pepper (C. annuum) ESTs databases. Our results predicted 3 new structural genes (DHAD, TD, PAT), which filled gaps of the capsaicinoid biosynthetic pathway predicted by Mazourek, and revealed new candidate genes involved in capsaicinoid biosynthesis based on KEGG (Kyoto Encyclopedia of Genes and Genomes) analysis. A significant number of SSR (Simple Sequence Repeat) and SNP (Single Nucleotide Polymorphism) markers were predicted in C. frutescens and C. annuum sequences, which will be helpful in the identification of polymorphisms within chili pepper populations. These data will provide new insights to the pathway of capsaicinoid biosynthesis and subsequent research of chili peppers. In addition, our strategy of de novo transcriptome assembly is applicable to a wide range of similar studies.
De Novo Transcriptome Assembly in Chili Pepper (Capsicum frutescens) to Identify Genes Involved in the Biosynthesis of Capsaicinoids

PubMed Central

Liu, Shaoqun; Li, Wanshun; Wu, Yimin; Chen, Changming; Lei, Jianjun

2013-01-01

The capsaicinoids are a group of compounds produced by chili pepper fruits and are used widely in many fields, especially in medical purposes. The capsaicinoid biosynthetic pathway has not yet been established clearly. To understand more knowledge in biosynthesis of capsaicinoids, we applied RNA-seq for the mixture of placenta and pericarp of pungent pepper (Capsicum frutescens L.). We have assessed the effect of various assembly parameters using different assembly software, and obtained one of the best strategies for de novo assembly of transcriptome data. We obtained a total 54,045 high-quality unigenes (transcripts) using Trinity software. About 92.65% of unigenes showed similarity to the public protein sequences, genome of potato and tomato and pepper (C. annuum) ESTs databases. Our results predicted 3 new structural genes (DHAD, TD, PAT), which filled gaps of the capsaicinoid biosynthetic pathway predicted by Mazourek, and revealed new candidate genes involved in capsaicinoid biosynthesis based on KEGG (Kyoto Encyclopedia of Genes and Genomes) analysis. A significant number of SSR (Simple Sequence Repeat) and SNP (Single Nucleotide Polymorphism) markers were predicted in C. frutescens and C. annuum sequences, which will be helpful in the identification of polymorphisms within chili pepper populations. These data will provide new insights to the pathway of capsaicinoid biosynthesis and subsequent research of chili peppers. In addition, our strategy of de novo transcriptome assembly is applicable to a wide range of similar studies. PMID:23349661
De Novo Transcriptome of the Hemimetabolous German Cockroach (Blattella germanica)

PubMed Central

Zhou, Xiaojie; Qian, Kun; Tong, Ying; Zhu, Junwei Jerry; Qiu, Xinghui; Zeng, Xiaopeng

2014-01-01

Background The German cockroach, Blattella germanica, is an important insect pest that transmits various pathogens mechanically and causes severe allergic diseases. This insect has long served as a model system for studies of insect biology, physiology and ecology. However, the lack of genome or transcriptome information heavily hinder our further understanding about the German cockroach in every aspect at a molecular level and on a genome-wide scale. To explore the transcriptome and identify unique sequences of interest, we subjected the B. germanica transcriptome to massively parallel pyrosequencing and generated the first reference transcriptome for B. germanica. Methodology/Principal Findings A total of 1,365,609 raw reads with an average length of 529 bp were generated via pyrosequencing the mixed cDNA library from different life stages of German cockroach including maturing oothecae, nymphs, adult females and males. The raw reads were de novo assembled to 48,800 contigs and 3,961 singletons with high-quality unique sequences. These sequences were annotated and classified functionally in terms of BLAST, GO and KEGG, and the genes putatively coding detoxification enzyme systems, insecticide targets, key components in systematic RNA interference, immunity and chemoreception pathways were identified. A total of 3,601 SSRs (Simple Sequence Repeats) loci were also predicted. Conclusions/Significance The whole transcriptome pyrosequencing data from this study provides a usable genetic resource for future identification of potential functional genes involved in various biological processes. PMID:25265537
De novo sequencing and resurrection of a human astrovirus-neutralizing antibody

DOE Office of Scientific and Technical Information (OSTI.GOV)

Bogdanoff, Walter A.; Morgenstern, David; Bern, Marshall

Monoclonal antibody (mAb) therapeutics targeting cancer, autoimmune diseases, inflammatory diseases, and infectious diseases are growing exponentially. Although numerous panels of mAbs targeting infectious disease agents have been developed, their progression into clinically useful mAbs is often hindered by the lack of sequence information and/or loss of hybridoma cells that produce them. Here we combine the power of crystallography and mass spectrometry to determine the amino acid sequence and glycosylation modification of the Fab fragment of a potent human astrovirus-neutralizing mAb. We used this information to engineer a recombinant antibody single-chain variable fragment that has the same specificity as the parentmore » monoclonal antibody to bind to the astrovirus capsid protein. Furthermore, this antibody can now potentially be developed as a therapeutic and diagnostic agent.« less
De novo sequencing and resurrection of a human astrovirus-neutralizing antibody

DOE PAGES

Bogdanoff, Walter A.; Morgenstern, David; Bern, Marshall; ...

2016-03-14

Monoclonal antibody (mAb) therapeutics targeting cancer, autoimmune diseases, inflammatory diseases, and infectious diseases are growing exponentially. Although numerous panels of mAbs targeting infectious disease agents have been developed, their progression into clinically useful mAbs is often hindered by the lack of sequence information and/or loss of hybridoma cells that produce them. Here we combine the power of crystallography and mass spectrometry to determine the amino acid sequence and glycosylation modification of the Fab fragment of a potent human astrovirus-neutralizing mAb. We used this information to engineer a recombinant antibody single-chain variable fragment that has the same specificity as the parentmore » monoclonal antibody to bind to the astrovirus capsid protein. Furthermore, this antibody can now potentially be developed as a therapeutic and diagnostic agent.« less

A knowledge engineering approach to recognizing and extracting sequences of nucleic acids from scientific literature.

PubMed

García-Remesal, Miguel; Maojo, Victor; Crespo, José

2010-01-01

In this paper we present a knowledge engineering approach to automatically recognize and extract genetic sequences from scientific articles. To carry out this task, we use a preliminary recognizer based on a finite state machine to extract all candidate DNA/RNA sequences. The latter are then fed into a knowledge-based system that automatically discards false positives and refines noisy and incorrectly merged sequences. We created the knowledge base by manually analyzing different manuscripts containing genetic sequences. Our approach was evaluated using a test set of 211 full-text articles in PDF format containing 3134 genetic sequences. For such set, we achieved 87.76% precision and 97.70% recall respectively. This method can facilitate different research tasks. These include text mining, information extraction, and information retrieval research dealing with large collections of documents containing genetic sequences.
PhyloTreePruner: A Phylogenetic Tree-Based Approach for Selection of Orthologous Sequences for Phylogenomics.

PubMed

Kocot, Kevin M; Citarella, Mathew R; Moroz, Leonid L; Halanych, Kenneth M

2013-01-01

Molecular phylogenetics relies on accurate identification of orthologous sequences among the taxa of interest. Most orthology inference programs available for use in phylogenomics rely on small sets of pre-defined orthologs from model organisms or phenetic approaches such as all-versus-all sequence comparisons followed by Markov graph-based clustering. Such approaches have high sensitivity but may erroneously include paralogous sequences. We developed PhyloTreePruner, a software utility that uses a phylogenetic approach to refine orthology inferences made using phenetic methods. PhyloTreePruner checks single-gene trees for evidence of paralogy and generates a new alignment for each group containing only sequences inferred to be orthologs. Importantly, PhyloTreePruner takes into account support values on the tree and avoids unnecessarily deleting sequences in cases where a weakly supported tree topology incorrectly indicates paralogy. A test of PhyloTreePruner on a dataset generated from 11 completely sequenced arthropod genomes identified 2,027 orthologous groups sampled for all taxa. Phylogenetic analysis of the concatenated supermatrix yielded a generally well-supported topology that was consistent with the current understanding of arthropod phylogeny. PhyloTreePruner is freely available from http://sourceforge.net/projects/phylotreepruner/.
A flexible and economical barcoding approach for highly multiplexed amplicon sequencing of diverse target genes

PubMed Central

Herbold, Craig W.; Pelikan, Claus; Kuzyk, Orest; Hausmann, Bela; Angel, Roey; Berry, David; Loy, Alexander

2015-01-01

High throughput sequencing of phylogenetic and functional gene amplicons provides tremendous insight into the structure and functional potential of complex microbial communities. Here, we introduce a highly adaptable and economical PCR approach to barcoding and pooling libraries of numerous target genes. In this approach, we replace gene- and sequencing platform-specific fusion primers with general, interchangeable barcoding primers, enabling nearly limitless customized barcode-primer combinations. Compared to barcoding with long fusion primers, our multiple-target gene approach is more economical because it overall requires lower number of primers and is based on short primers with generally lower synthesis and purification costs. To highlight our approach, we pooled over 900 different small-subunit rRNA and functional gene amplicon libraries obtained from various environmental or host-associated microbial community samples into a single, paired-end Illumina MiSeq run. Although the amplicon regions ranged in size from approximately 290 to 720 bp, we found no significant systematic sequencing bias related to amplicon length or gene target. Our results indicate that this flexible multiplexing approach produces large, diverse, and high quality sets of amplicon sequence data for modern studies in microbial ecology. PMID:26236305
Risks and Recommendations in Prenatally Detected De Novo Balanced Chromosomal Rearrangements from Assessment of Long-Term Outcomes.

PubMed

Halgren, Christina; Nielsen, Nete M; Nazaryan-Petersen, Lusine; Silahtaroglu, Asli; Collins, Ryan L; Lowther, Chelsea; Kjaergaard, Susanne; Frisch, Morten; Kirchhoff, Maria; Brøndum-Nielsen, Karen; Lind-Thomsen, Allan; Mang, Yuan; El-Schich, Zahra; Boring, Claire A; Mehrjouy, Mana M; Jensen, Peter K A; Fagerberg, Christina; Krogh, Lotte N; Hansen, Jan; Bryndorf, Thue; Hansen, Claus; Talkowski, Michael E; Bak, Mads; Tommerup, Niels; Bache, Iben

2018-06-07

The 6%-9% risk of an untoward outcome previously established by Warburton for prenatally detected de novo balanced chromosomal rearrangements (BCRs) does not account for long-term morbidity. We performed long-term follow-up (mean 17 years) of a registry-based nationwide cohort of 41 individuals carrying a prenatally detected de novo BCR with normal first trimester screening/ultrasound scan. We observed a significantly higher frequency of neurodevelopmental and/or neuropsychiatric disorders than in a matched control group (19.5% versus 8.3%, p = 0.04), which was increased to 26.8% upon clinical follow-up. Chromosomal microarray of 32 carriers revealed no pathogenic imbalances, illustrating a low prognostic value when fetal ultrasound scan is normal. In contrast, mate-pair sequencing revealed disrupted genes (ARID1B, NPAS3, CELF4), regulatory domains of known developmental genes (ZEB2, HOXC), and complex BCRs associated with adverse outcomes. Seven unmappable autosomal-autosomal BCRs with breakpoints involving pericentromeric/heterochromatic regions may represent a low-risk group. We performed independent phenotype-aware and blinded interpretation, which accurately predicted benign outcomes (specificity = 100%) but demonstrated relatively low sensitivity for prediction of the clinical outcome in affected carriers (sensitivity = 45%-55%). This sensitivity emphasizes the challenges associated with prenatal risk prediction for long-term morbidity in the absence of phenotypic data given the still immature annotation of the morbidity genome and poorly understood long-range regulatory mechanisms. In conclusion, we upwardly revise the previous estimates of Warburton to a morbidity risk of 27% and recommend sequencing of the chromosomal breakpoints as the first-tier diagnostic test in pregnancies with a de novo BCR. Copyright © 2018 American Society of Human Genetics. Published by Elsevier Inc. All rights reserved.
High-Throughput Sequencing and De Novo Assembly of the Isatis indigotica Transcriptome

PubMed Central

Tang, Xiaoqing; Xiao, Yunhua; Lv, Tingting; Wang, Fangquan; Zhu, QianHao; Zheng, Tianqing; Yang, Jie

2014-01-01

Background Isatis indigotica, the source of the traditional Chinese medicine Radix isatidis (Ban-Lan-Gen), is an extremely important economical crop in China. To facilitate biological, biochemical and molecular research on the medicinal chemicals in I. indigotica, here we report the first I. indigotica transcriptome generated by RNA sequencing (RNA-seq). Results RNA-seq library was created using RNA extracted from a mixed sample including leaf and root. A total of 33,238 unigenes were assembled from more than 28 million of high quality short reads. The quality of the assembly was experimentally examined by cDNA sequencing of seven randomly selected unigenes. Based on blast search 28,184 unigenes had a hit in at least one of the protein and nucleotide databases used in this study, and 8 unigenes were found to be associated with biosynthesis of indole and its derivatives. According to Gene Ontology classification, 22,365 unigenes were categorized into 48 functional groups. Furthermore, Clusters of Orthologous Group and Swiss-Port annotation were assigned for 7,707 and 18,679 unigenes, respectively. Analysis of repeat motifs identified 6,400 simple sequence repeat markers in 4,509 unigenes. Conclusion Our data provide a comprehensive sequence resource for molecular study of I. indigotica. Our results will facilitate studies on the functions of genes involved in the indole alkaloid biosynthesis pathway and on metabolism of nitrogen and indole alkaloids in I. indigotica and its related species. PMID:25259890
Comparative analyses of two Geraniaceae transcriptomes using next-generation sequencing

PubMed Central

2013-01-01

Background Organelle genomes of Geraniaceae exhibit several unusual evolutionary phenomena compared to other angiosperm families including accelerated nucleotide substitution rates, widespread gene loss, reduced RNA editing, and extensive genomic rearrangements. Since most organelle-encoded proteins function in multi-subunit complexes that also contain nuclear-encoded proteins, it is likely that the atypical organellar phenomena affect the evolution of nuclear genes encoding organellar proteins. To begin to unravel the complex co-evolutionary interplay between organellar and nuclear genomes in this family, we sequenced nuclear transcriptomes of two species, Geranium maderense and Pelargonium x hortorum. Results Normalized cDNA libraries of G. maderense and P. x hortorum were used for transcriptome sequencing. Five assemblers (MIRA, Newbler, SOAPdenovo, SOAPdenovo-trans [SOAPtrans], Trinity) and two next-generation technologies (454 and Illumina) were compared to determine the optimal transcriptome sequencing approach. Trinity provided the highest quality assembly of Illumina data with the deepest transcriptome coverage. An analysis to determine the amount of sequencing needed for de novo assembly revealed diminishing returns of coverage and quality with data sets larger than sixty million Illumina paired end reads for both species. The G. maderense and P. x hortorum transcriptomes contained fewer transcripts encoding the PLS subclass of PPR proteins relative to other angiosperms, consistent with reduced mitochondrial RNA editing activity in Geraniaceae. In addition, transcripts for all six plastid targeted sigma factors were identified in both transcriptomes, suggesting that one of the highly divergent rpoA-like ORFs in the P. x hortorum plastid genome is functional. Conclusions The findings support the use of the Illumina platform and assemblers optimized for transcriptome assembly, such as Trinity or SOAPtrans, to generate high-quality de novo transcriptomes with
Short read Illumina data for the de novo assembly of a non-model snail species transcriptome (Radix balthica, Basommatophora, Pulmonata), and a comparison of assembler performance

PubMed Central

2011-01-01

Background Until recently, read lengths on the Solexa/Illumina system were too short to reliably assemble transcriptomes without a reference sequence, especially for non-model organisms. However, with read lengths up to 100 nucleotides available in the current version, an assembly without reference genome should be possible. For this study we created an EST data set for the common pond snail Radix balthica by Illumina sequencing of a normalized transcriptome. Performance of three different short read assemblers was compared with respect to: the number of contigs, their length, depth of coverage, their quality in various BLAST searches and the alignment to mitochondrial genes. Results A single sequencing run of a normalized RNA pool resulted in 16,923,850 paired end reads with median read length of 61 bases. The assemblies generated by VELVET, OASES, and SeqMan NGEN differed in the total number of contigs, contig length, the number and quality of gene hits obtained by BLAST searches against various databases, and contig performance in the mt genome comparison. While VELVET produced the highest overall number of contigs, a large fraction of these were of small size (< 200bp), and gave redundant hits in BLAST searches and the mt genome alignment. The best overall contig performance resulted from the NGEN assembly. It produced the second largest number of contigs, which on average were comparable to the OASES contigs but gave the highest number of gene hits in two out of four BLAST searches against different reference databases. A subsequent meta-assembly of the four contig sets resulted in larger contigs, less redundancy and a higher number of BLAST hits. Conclusion Our results document the first de novo transcriptome assembly of a non-model species using Illumina sequencing data. We show that de novo transcriptome assembly using this approach yields results useful for downstream applications, in particular if a meta-assembly of contig sets is used to increase contig
Whole-Genome Sequencing and Assembly with High-Throughput, Short-Read Technologies

PubMed Central

Sundquist, Andreas; Ronaghi, Mostafa; Tang, Haixu; Pevzner, Pavel; Batzoglou, Serafim

2007-01-01

While recently developed short-read sequencing technologies may dramatically reduce the sequencing cost and eventually achieve the $1000 goal for re-sequencing, their limitations prevent the de novo sequencing of eukaryotic genomes with the standard shotgun sequencing protocol. We present SHRAP (SHort Read Assembly Protocol), a sequencing protocol and assembly methodology that utilizes high-throughput short-read technologies. We describe a variation on hierarchical sequencing with two crucial differences: (1) we select a clone library from the genome randomly rather than as a tiling path and (2) we sample clones from the genome at high coverage and reads from the clones at low coverage. We assume that 200 bp read lengths with a 1% error rate and inexpensive random fragment cloning on whole mammalian genomes is feasible. Our assembly methodology is based on first ordering the clones and subsequently performing read assembly in three stages: (1) local assemblies of regions significantly smaller than a clone size, (2) clone-sized assemblies of the results of stage 1, and (3) chromosome-sized assemblies. By aggressively localizing the assembly problem during the first stage, our method succeeds in assembling short, unpaired reads sampled from repetitive genomes. We tested our assembler using simulated reads from D. melanogaster and human chromosomes 1, 11, and 21, and produced assemblies with large sets of contiguous sequence and a misassembly rate comparable to other draft assemblies. Tested on D. melanogaster and the entire human genome, our clone-ordering method produces accurate maps, thereby localizing fragment assembly and enabling the parallelization of the subsequent steps of our pipeline. Thus, we have demonstrated that truly inexpensive de novo sequencing of mammalian genomes will soon be possible with high-throughput, short-read technologies using our methodology. PMID:17534434
De Novo duplication in Charcot-Marie-Tooth Type 1A

DOE Office of Scientific and Technical Information (OSTI.GOV)

Mandich, P.; Bellone, E.; Ajmar, F.

1996-09-01

We read with interest the paper on {open_quotes}Prevalence and Origin of De Novo Duplications in Charcot-Marie-Tooth Disease Type 1A: First Report of a De Novo Duplication with a Maternal Origin,{close_quotes}. They reported their experience with 10 sporadic cases of Charcot-Marie-Tooth type 1A (CMT1A) in which it was demonstrated that the disease had arisen as the result of a de novo duplication. They analyzed the de novo-duplication families by using microsatellite markers and identified the parental origin of the duplication in eight cases. In one family the duplication was of maternal origin, whereas in the remaining seven cases it was ofmore » paternal origin. The authors concluded that their report was the first evidence of a de novo duplication of maternal origin, suggesting that this is not a phenomenon associated solely with male meiosis. 7 refs.« less
ABACAS: algorithm-based automatic contiguation of assembled sequences

PubMed Central

Assefa, Samuel; Keane, Thomas M.; Otto, Thomas D.; Newbold, Chris; Berriman, Matthew

2009-01-01

Summary: Due to the availability of new sequencing technologies, we are now increasingly interested in sequencing closely related strains of existing finished genomes. Recently a number of de novo and mapping-based assemblers have been developed to produce high quality draft genomes from new sequencing technology reads. New tools are necessary to take contigs from a draft assembly through to a fully contiguated genome sequence. ABACAS is intended as a tool to rapidly contiguate (align, order, orientate), visualize and design primers to close gaps on shotgun assembled contigs based on a reference sequence. The input to ABACAS is a set of contigs which will be aligned to the reference genome, ordered and orientated, visualized in the ACT comparative browser, and optimal primer sequences are automatically generated. Availability and Implementation: ABACAS is implemented in Perl and is freely available for download from http://abacas.sourceforge.net Contact: sa4@sanger.ac.uk PMID:19497936
Genome sequencing in microfabricated high-density picolitre reactors.

PubMed

Margulies, Marcel; Egholm, Michael; Altman, William E; Attiya, Said; Bader, Joel S; Bemben, Lisa A; Berka, Jan; Braverman, Michael S; Chen, Yi-Ju; Chen, Zhoutao; Dewell, Scott B; Du, Lei; Fierro, Joseph M; Gomes, Xavier V; Godwin, Brian C; He, Wen; Helgesen, Scott; Ho, Chun Heen; Ho, Chun He; Irzyk, Gerard P; Jando, Szilveszter C; Alenquer, Maria L I; Jarvie, Thomas P; Jirage, Kshama B; Kim, Jong-Bum; Knight, James R; Lanza, Janna R; Leamon, John H; Lefkowitz, Steven M; Lei, Ming; Li, Jing; Lohman, Kenton L; Lu, Hong; Makhijani, Vinod B; McDade, Keith E; McKenna, Michael P; Myers, Eugene W; Nickerson, Elizabeth; Nobile, John R; Plant, Ramona; Puc, Bernard P; Ronan, Michael T; Roth, George T; Sarkis, Gary J; Simons, Jan Fredrik; Simpson, John W; Srinivasan, Maithreyan; Tartaro, Karrie R; Tomasz, Alexander; Vogt, Kari A; Volkmer, Greg A; Wang, Shally H; Wang, Yong; Weiner, Michael P; Yu, Pengguang; Begley, Richard F; Rothberg, Jonathan M

2005-09-15

The proliferation of large-scale DNA-sequencing projects in recent years has driven a search for alternative methods to reduce time and cost. Here we describe a scalable, highly parallel sequencing system with raw throughput significantly greater than that of state-of-the-art capillary electrophoresis instruments. The apparatus uses a novel fibre-optic slide of individual wells and is able to sequence 25 million bases, at 99% or better accuracy, in one four-hour run. To achieve an approximately 100-fold increase in throughput over current Sanger sequencing technology, we have developed an emulsion method for DNA amplification and an instrument for sequencing by synthesis using a pyrosequencing protocol optimized for solid support and picolitre-scale volumes. Here we show the utility, throughput, accuracy and robustness of this system by shotgun sequencing and de novo assembly of the Mycoplasma genitalium genome with 96% coverage at 99.96% accuracy in one run of the machine.
Particulated articular cartilage: CAIS and DeNovo NT.

PubMed

Farr, Jack; Cole, Brian J; Sherman, Seth; Karas, Vasili

2012-03-01

Cartilage Autograft Implantation System (CAIS; DePuy/Mitek, Raynham, MA) and DeNovo Natural Tissue (NT; ISTO, St. Louis, MO) are novel treatment options for focal articular cartilage defects in the knee. These methods involve the implantation of particulated articular cartilage from either autograft or juvenile allograft donor, respectively. In the laboratory and in animal models, both CAIS and DeNovo NT have demonstrated the ability of the transplanted cartilage cells to "escape" from the extracellular matrix, migrate, multiply, and form a new hyaline-like cartilage tissue matrix that integrates with the surrounding host tissue. In clinical practice, the technique for both CAIS and DeNovo NT is straightforward, requiring only a single surgery to affect cartilage repair. Clinical experience is limited, with short-term studies demonstrating both procedures to be safe, feasible, and effective, with improvements in subjective patient scores, and with magnetic resonance imaging evidence of good defect fill. While these treatment options appear promising, prospective randomized controlled studies are necessary to refine the indications and contraindications for both CAIS and DeNovo NT.
Next-Generation Sequencing of the Chrysanthemum nankingense (Asteraceae) Transcriptome Permits Large-Scale Unigene Assembly and SSR Marker Discovery

PubMed Central

Wang, Haibin; Jiang, Jiafu; Chen, Sumei; Qi, Xiangyu; Peng, Hui; Li, Pirui; Song, Aiping; Guan, Zhiyong; Fang, Weimin; Liao, Yuan; Chen, Fadi

2013-01-01

Background Simple sequence repeats (SSRs) are ubiquitous in eukaryotic genomes. Chrysanthemum is one of the largest genera in the Asteraceae family. Only few Chrysanthemum expressed sequence tag (EST) sequences have been acquired to date, so the number of available EST-SSR markers is very low. Methodology/Principal Findings Illumina paired-end sequencing technology produced over 53 million sequencing reads from C. nankingense mRNA. The subsequent de novo assembly yielded 70,895 unigenes, of which 45,789 (64.59%) unigenes showed similarity to the sequences in NCBI database. Out of 45,789 sequences, 107 have hits to the Chrysanthemum Nr protein database; 679 and 277 sequences have hits to the database of Helianthus and Lactuca species, respectively. MISA software identified a large number of putative EST-SSRs, allowing 1,788 primer pairs to be designed from the de novo transcriptome sequence and a further 363 from archival EST sequence. Among 100 primer pairs randomly chosen, 81 markers have amplicons and 20 are polymorphic for genotypes analysis in Chrysanthemum. The results showed that most (but not all) of the assays were transferable across species and that they exposed a significant amount of allelic diversity. Conclusions/Significance SSR markers acquired by transcriptome sequencing are potentially useful for marker-assisted breeding and genetic analysis in the genus Chrysanthemum and its related genera. PMID:23626799
De novo inference of protein function from coarse-grained dynamics.

PubMed

Bhadra, Pratiti; Pal, Debnath

2014-10-01

Inference of molecular function of proteins is the fundamental task in the quest for understanding cellular processes. The task is getting increasingly difficult with thousands of new proteins discovered each day. The difficulty arises primarily due to lack of high-throughput experimental technique for assessing protein molecular function, a lacunae that computational approaches are trying hard to fill. The latter too faces a major bottleneck in absence of clear evidence based on evolutionary information. Here we propose a de novo approach to annotate protein molecular function through structural dynamics match for a pair of segments from two dissimilar proteins, which may share even <10% sequence identity. To screen these matches, corresponding 1 µs coarse-grained (CG) molecular dynamics trajectories were used to compute normalized root-mean-square-fluctuation graphs and select mobile segments, which were, thereafter, matched for all pairs using unweighted three-dimensional autocorrelation vectors. Our in-house custom-built forcefield (FF), extensively validated against dynamics information obtained from experimental nuclear magnetic resonance data, was specifically used to generate the CG dynamics trajectories. The test for correspondence of dynamics-signature of protein segments and function revealed 87% true positive rate and 93.5% true negative rate, on a dataset of 60 experimentally validated proteins, including moonlighting proteins and those with novel functional motifs. A random test against 315 unique fold/function proteins for a negative test gave >99% true recall. A blind prediction on a novel protein appears consistent with additional evidences retrieved therein. This is the first proof-of-principle of generalized use of structural dynamics for inferring protein molecular function leveraging our custom-made CG FF, useful to all. © 2014 Wiley Periodicals, Inc.
ParDRe: faster parallel duplicated reads removal tool for sequencing studies.

PubMed

González-Domínguez, Jorge; Schmidt, Bertil

2016-05-15

Current next generation sequencing technologies often generate duplicated or near-duplicated reads that (depending on the application scenario) do not provide any interesting biological information but can increase memory requirements and computational time of downstream analysis. In this work we present ParDRe, a de novo parallel tool to remove duplicated and near-duplicated reads through the clustering of Single-End or Paired-End sequences from fasta or fastq files. It uses a novel bitwise approach to compare the suffixes of DNA strings and employs hybrid MPI/multithreading to reduce runtime on multicore systems. We show that ParDRe is up to 27.29 times faster than Fulcrum (a representative state-of-the-art tool) on a platform with two 8-core Sandy-Bridge processors. Source code in C ++ and MPI running on Linux systems as well as a reference manual are available at https://sourceforge.net/projects/pardre/ jgonzalezd@udc.es. © The Author 2016. Published by Oxford University Press. All rights reserved. For Permissions, please e-mail: journals.permissions@oup.com.
SACCHARIS: an automated pipeline to streamline discovery of carbohydrate active enzyme activities within polyspecific families and de novo sequence datasets.

PubMed

Jones, Darryl R; Thomas, Dallas; Alger, Nicholas; Ghavidel, Ata; Inglis, G Douglas; Abbott, D Wade

2018-01-01

have performed a CAZome analysis of an in-house sequenced bacterial genome and a comparative analysis of B. thetaiotaomicron VPI-5482 and B. thetaiotaomicron 7330, to demonstrate that SACCHARIS can generate "CAZome fingerprints", which differentiate between the saccharolytic potential of two related strains in silico. Establishing sequence-function and sequence-structure relationships in polyspecific CAZyme families are promising approaches for streamlining enzyme discovery. SACCHARIS facilitates this process by embedding CAZyme and CBM family trees generated from biochemically to structurally characterized sequences, with protein sequences that have unknown functions. In addition, these trees can be integrated with user-defined datasets (e.g., genomics, metagenomics, and transcriptomics) to inform experimental characterization of new CAZymes or CBMs not currently curated, and for researchers to compare differential sequence patterns between entire CAZomes. In this light, SACCHARIS provides an in silico tool that can be tailored for enzyme bioprospecting in datasets of increasing complexity and for diverse applications in glycobiotechnology.
High-quality de novo assembly of the apple genome and methylome dynamics of early fruit development.

PubMed

Daccord, Nicolas; Celton, Jean-Marc; Linsmith, Gareth; Becker, Claude; Choisne, Nathalie; Schijlen, Elio; van de Geest, Henri; Bianco, Luca; Micheletti, Diego; Velasco, Riccardo; Di Pierro, Erica Adele; Gouzy, Jérôme; Rees, D Jasper G; Guérif, Philippe; Muranty, Hélène; Durel, Charles-Eric; Laurens, François; Lespinasse, Yves; Gaillard, Sylvain; Aubourg, Sébastien; Quesneville, Hadi; Weigel, Detlef; van de Weg, Eric; Troggio, Michela; Bucher, Etienne

2017-07-01

Using the latest sequencing and optical mapping technologies, we have produced a high-quality de novo assembly of the apple (Malus domestica Borkh.) genome. Repeat sequences, which represented over half of the assembly, provided an unprecedented opportunity to investigate the uncharacterized regions of a tree genome; we identified a new hyper-repetitive retrotransposon sequence that was over-represented in heterochromatic regions and estimated that a major burst of different transposable elements (TEs) occurred 21 million years ago. Notably, the timing of this TE burst coincided with the uplift of the Tian Shan mountains, which is thought to be the center of the location where the apple originated, suggesting that TEs and associated processes may have contributed to the diversification of the apple ancestor and possibly to its divergence from pear. Finally, genome-wide DNA methylation data suggest that epigenetic marks may contribute to agronomically relevant aspects, such as apple fruit development.
De novo transcriptome sequencing in Bixa orellana to identify genes involved in methylerythritol phosphate, carotenoid and bixin biosynthesis

DOE PAGES

Cárdenas-Conejo, Yair; Carballo-Uicab, Víctor; Lieberman, Meric; ...

2015-10-28

Bixin or annatto is a commercially important natural orange-red pigment derived from lycopene that is produced and stored in seeds of Bixa orellana L. An enzymatic pathway for bixin biosynthesis was inferred from homology of putative proteins encoded by differentially expressed seed cDNAs. Some activities were later validated in a heterologous system. Nevertheless, much of the pathway remains to be clarified. For example, it is essential to identify the methylerythritol phosphate (MEP) and carotenoid pathways genes. In order to investigate the MEP, carotenoid, and bixin pathways genes, total RNA from young leaves and two different developmental stages of seeds frommore » B. orellana were used for the construction of indexed mRNA libraries, sequenced on the Illumina HiSeq 2500 platform and assembled de novo using Velvet, CLC Genomics Workbench and CAP3 software. A total of 52,549 contigs were obtained with average length of 1,924 bp. Two phylogenetic analyses of inferred proteins, in one case encoded by thirteen general, single-copy cDNAs, in the other from carotenoid and MEP cDNAs, indicated that B. orellana is closely related to sister Malvales species cacao and cotton. Using homology, we identified 7 and 14 core gene products from the MEP and carotenoid pathways, respectively. Surprisingly, previously defined bixin pathway cDNAs were not present in our transcriptome. Here we propose a new set of gene products involved in bixin pathway. In conclusion, the identification and qRT-PCR quantification of cDNAs involved in annatto production suggest a hypothetical model for bixin biosynthesis that involve coordinated activation of some MEP, carotenoid and bixin pathway genes. These findings provide a better understanding of the mechanisms regulating these pathways and will facilitate the genetic improvement of B. orellana.« less
De novo transcriptome sequencing in Bixa orellana to identify genes involved in methylerythritol phosphate, carotenoid and bixin biosynthesis

DOE Office of Scientific and Technical Information (OSTI.GOV)

Cárdenas-Conejo, Yair; Carballo-Uicab, Víctor; Lieberman, Meric

Bixin or annatto is a commercially important natural orange-red pigment derived from lycopene that is produced and stored in seeds of Bixa orellana L. An enzymatic pathway for bixin biosynthesis was inferred from homology of putative proteins encoded by differentially expressed seed cDNAs. Some activities were later validated in a heterologous system. Nevertheless, much of the pathway remains to be clarified. For example, it is essential to identify the methylerythritol phosphate (MEP) and carotenoid pathways genes. In order to investigate the MEP, carotenoid, and bixin pathways genes, total RNA from young leaves and two different developmental stages of seeds frommore » B. orellana were used for the construction of indexed mRNA libraries, sequenced on the Illumina HiSeq 2500 platform and assembled de novo using Velvet, CLC Genomics Workbench and CAP3 software. A total of 52,549 contigs were obtained with average length of 1,924 bp. Two phylogenetic analyses of inferred proteins, in one case encoded by thirteen general, single-copy cDNAs, in the other from carotenoid and MEP cDNAs, indicated that B. orellana is closely related to sister Malvales species cacao and cotton. Using homology, we identified 7 and 14 core gene products from the MEP and carotenoid pathways, respectively. Surprisingly, previously defined bixin pathway cDNAs were not present in our transcriptome. Here we propose a new set of gene products involved in bixin pathway. In conclusion, the identification and qRT-PCR quantification of cDNAs involved in annatto production suggest a hypothetical model for bixin biosynthesis that involve coordinated activation of some MEP, carotenoid and bixin pathway genes. These findings provide a better understanding of the mechanisms regulating these pathways and will facilitate the genetic improvement of B. orellana.« less
Detecting authorized and unauthorized genetically modified organisms containing vip3A by real-time PCR and next-generation sequencing.

PubMed

Liang, Chanjuan; van Dijk, Jeroen P; Scholtens, Ingrid M J; Staats, Martijn; Prins, Theo W; Voorhuijzen, Marleen M; da Silva, Andrea M; Arisi, Ana Carolina Maisonnave; den Dunnen, Johan T; Kok, Esther J

2014-04-01

The growing number of biotech crops with novel genetic elements increasingly complicates the detection of genetically modified organisms (GMOs) in food and feed samples using conventional screening methods. Unauthorized GMOs (UGMOs) in food and feed are currently identified through combining GMO element screening with sequencing the DNA flanking these elements. In this study, a specific and sensitive qPCR assay was developed for vip3A element detection based on the vip3Aa20 coding sequences of the recently marketed MIR162 maize and COT102 cotton. Furthermore, SiteFinding-PCR in combination with Sanger, Illumina or Pacific BioSciences (PacBio) sequencing was performed targeting the flanking DNA of the vip3Aa20 element in MIR162. De novo assembly and Basic Local Alignment Search Tool searches were used to mimic UGMO identification. PacBio data resulted in relatively long contigs in the upstream (1,326 nucleotides (nt); 95 % identity) and downstream (1,135 nt; 92 % identity) regions, whereas Illumina data resulted in two smaller contigs of 858 and 1,038 nt with higher sequence identity (>99 % identity). Both approaches outperformed Sanger sequencing, underlining the potential for next-generation sequencing in UGMO identification.

De Novo Assembly of the Donkey White Blood Cell Transcriptome and a Comparative Analysis of Phenotype-Associated Genes between Donkeys and Horses

PubMed Central

Xie, Feng-Yun; Feng, Yu-Long; Wang, Hong-Hui; Ma, Yun-Feng; Yang, Yang; Wang, Yin-Chao; Shen, Wei; Pan, Qing-Jie; Yin, Shen; Sun, Yu-Jiang; Ma, Jun-Yu

2015-01-01

Prior to the mechanization of agriculture and labor-intensive tasks, humans used donkeys (Equus africanus asinus) for farm work and packing. However, as mechanization increased, donkeys have been increasingly raised for meat, milk, and fur in China. To maintain the development of the donkey industry, breeding programs should focus on traits related to these new uses. Compared to conventional marker-assisted breeding plans, genome- and transcriptome-based selection methods are more efficient and effective. To analyze the coding genes of the donkey genome, we assembled the transcriptome of donkey white blood cells de novo. Using transcriptomic deep-sequencing data, we identified 264,714 distinct donkey unigenes and predicted 38,949 protein fragments. We annotated the donkey unigenes by BLAST searches against the non-redundant (NR) protein database. We also compared the donkey protein sequences with those of the horse (E. caballus) and wild horse (E. przewalskii), and linked the donkey protein fragments with mammalian phenotypes. As the outer ear size of donkeys and horses are obviously different, we compared the outer ear size-associated proteins in donkeys and horses. We identified three ear size-associated proteins, HIC1, PRKRA, and KMT2A, with sequence differences among the donkey, horse, and wild horse loci. Since the donkey genome sequence has not been released, the de novo assembled donkey transcriptome is helpful for preliminary investigations of donkey cultivars and for genetic improvement. PMID:26208029
SMRT sequencing of the Vitis vinifera cv. ‘Flame seedless’ genome using a SMRTbell-free library preparation from Swift Biosciences

USDA-ARS?s Scientific Manuscript database

Single Molecule Real-Time (SMRT) sequencing provides advantages to the sequencing of complex genomes. The long reads generated are superior for resolving complex genomic regions and provide highly contiguous de novo assemblies. Current SMRTbell libraries generate average read lengths of 10-15kb. How...
GRADE Evidence to Decision (EtD) frameworks for adoption, adaptation, and de novo development of trustworthy recommendations: GRADE-ADOLOPMENT.

PubMed

Schünemann, Holger J; Wiercioch, Wojtek; Brozek, Jan; Etxeandia-Ikobaltzeta, Itziar; Mustafa, Reem A; Manja, Veena; Brignardello-Petersen, Romina; Neumann, Ignacio; Falavigna, Maicon; Alhazzani, Waleed; Santesso, Nancy; Zhang, Yuan; Meerpohl, Jörg J; Morgan, Rebecca L; Rochwerg, Bram; Darzi, Andrea; Rojas, Maria Ximenas; Carrasco-Labra, Alonso; Adi, Yaser; AlRayees, Zulfa; Riva, John; Bollig, Claudia; Moore, Ainsley; Yepes-Nuñez, Juan José; Cuello, Carlos; Waziry, Reem; Akl, Elie A

2017-01-01

Guideline developers can: (1) adopt existing recommendations from others; (2) adapt existing recommendations to their own context; or (3) create recommendations de novo. Monetary and nonmonetary resources, credibility, maximization of uptake, as well as logical arguments should guide the choice of the approach and processes. To describe a potentially efficient model for guideline production based on adoption, adaptation, and/or de novo development of recommendations utilizing the Grading of Recommendations Assessment, Development and Evaluation (GRADE) Evidence to Decision (EtD) frameworks. We applied the model in a new national guideline program producing 22 practice guidelines. We searched for relevant evidence that informs the direction and strength of a recommendation. We then produced GRADE EtDs for guideline panels to develop recommendations. We produced a total of 80 EtD frameworks in approximately 4 months and 146 EtDs in approximately 6 months in two waves. Use of the EtD frameworks allowed panel members understand judgments of others about the criteria that bear on guideline recommendations and then make their own judgments about those criteria in a systematic approach. The "GRADE-ADOLOPMENT" approach to guideline production combines adoption, adaptation, and, as needed, de novo development of recommendations. If developers of guidelines follow EtD criteria more widely and make their work publically available, this approach should prove even more useful. Copyright © 2016 The Author(s). Published by Elsevier Inc. All rights reserved.
A de novo whole gene deletion of XIAP detected by exome sequencing analysis in very early onset inflammatory bowel disease: a case report.

PubMed

Kelsen, Judith R; Dawany, Noor; Martinez, Alejandro; Martinez, Alejuandro; Grochowski, Christopher M; Maurer, Kelly; Rappaport, Eric; Piccoli, David A; Baldassano, Robert N; Mamula, Petar; Sullivan, Kathleen E; Devoto, Marcella

2015-11-18

Children with very early-onset inflammatory bowel disease (VEO-IBD), those diagnosed at less than 5 years of age, are a unique population. A subset of these patients present with a distinct phenotype and more severe disease than older children and adults. Host genetics is thought to play a more prominent role in this young population, and monogenic defects in genes related to primary immunodeficiencies are responsible for the disease in a small subset of patients with VEO-IBD. We report a child who presented at 3 weeks of life with very early-onset inflammatory bowel disease (VEO-IBD). He had a complicated disease course and remained unresponsive to medical and surgical therapy. The refractory nature of his disease, together with his young age of presentation, prompted utilization of whole exome sequencing (WES) to detect an underlying monogenic primary immunodeficiency and potentially target therapy to the identified defect. Copy number variation analysis (CNV) was performed using the eXome-Hidden Markov Model. Whole exome sequencing revealed 1,380 nonsense and missense variants in the patient. Plausible candidate variants were not detected following analysis of filtered variants, therefore, we performed CNV analysis of the WES data, which led us to identify a de novo whole gene deletion in XIAP. This is the first reported whole gene deletion in XIAP, the causal gene responsible for XLP2 (X-linked lymphoproliferative Disease 2). XLP2 is a syndrome resulting in VEO-IBD and can increase susceptibility to hemophagocytic lymphohistocytosis (HLH). This identification allowed the patient to be referred for bone marrow transplantation, potentially curative for his disease and critical to prevent the catastrophic sequela of HLH. This illustrates the unique etiology of VEO-IBD, and the subsequent effects on therapeutic options. This cohort requires careful and thorough evaluation for monogenic defects and primary immunodeficiencies.
Visualizing the origins of selfish de novo mutations in individual seminiferous tubules of human testes

PubMed Central

Maher, Geoffrey J.; McGowan, Simon J.; Giannoulatou, Eleni; Verrill, Clare; Goriely, Anne; Wilkie, Andrew O. M.

2016-01-01

De novo point mutations arise predominantly in the male germline and increase in frequency with age, but it has not previously been possible to locate specific, identifiable mutations directly within the seminiferous tubules of human testes. Using microdissection of tubules exhibiting altered expression of the spermatogonial markers MAGEA4, FGFR3, and phospho-AKT, whole genome amplification, and DNA sequencing, we establish an in situ strategy for discovery and analysis of pathogenic de novo mutations. In 14 testes from men aged 39–90 y, we identified 11 distinct gain-of-function mutations in five genes (fibroblast growth factor receptors FGFR2 and FGFR3, tyrosine phosphatase PTPN11, and RAS oncogene homologs HRAS and KRAS) from 16 of 22 tubules analyzed; all mutations have known associations with severe diseases, ranging from congenital or perinatal lethal disorders to somatically acquired cancers. These results support proposed selfish selection of spermatogonial mutations affecting growth factor receptor-RAS signaling, highlight its prevalence in older men, and enable direct visualization of the microscopic anatomy of elongated mutant clones. PMID:26858415
Visualizing the origins of selfish de novo mutations in individual seminiferous tubules of human testes.

PubMed

Maher, Geoffrey J; McGowan, Simon J; Giannoulatou, Eleni; Verrill, Clare; Goriely, Anne; Wilkie, Andrew O M

2016-03-01

De novo point mutations arise predominantly in the male germline and increase in frequency with age, but it has not previously been possible to locate specific, identifiable mutations directly within the seminiferous tubules of human testes. Using microdissection of tubules exhibiting altered expression of the spermatogonial markers MAGEA4, FGFR3, and phospho-AKT, whole genome amplification, and DNA sequencing, we establish an in situ strategy for discovery and analysis of pathogenic de novo mutations. In 14 testes from men aged 39-90 y, we identified 11 distinct gain-of-function mutations in five genes (fibroblast growth factor receptors FGFR2 and FGFR3, tyrosine phosphatase PTPN11, and RAS oncogene homologs HRAS and KRAS) from 16 of 22 tubules analyzed; all mutations have known associations with severe diseases, ranging from congenital or perinatal lethal disorders to somatically acquired cancers. These results support proposed selfish selection of spermatogonial mutations affecting growth factor receptor-RAS signaling, highlight its prevalence in older men, and enable direct visualization of the microscopic anatomy of elongated mutant clones.
Protein Engineering Approaches in the Post-Genomic Era.

PubMed

Singh, Raushan K; Lee, Jung-Kul; Selvaraj, Chandrabose; Singh, Ranjitha; Li, Jinglin; Kim, Sang-Yong; Kalia, Vipin C

2018-01-01

Proteins are one of the most multifaceted macromolecules in living systems. Proteins have evolved to function under physiological conditions and, therefore, are not usually tolerant of harsh experimental and environmental conditions. The growing use of proteins in industrial processes as a greener alternative to chemical catalysts often demands constant innovation to improve their performance. Protein engineering aims to design new proteins or modify the sequence of a protein to create proteins with new or desirable functions. With the emergence of structural and functional genomics, protein engineering has been invigorated in the post-genomic era. The three-dimensional structures of proteins with known functions facilitate protein engineering approaches to design variants with desired properties. There are three major approaches of protein engineering research, namely, directed evolution, rational design, and de novo design. Rational design is an effective method of protein engineering when the threedimensional structure and mechanism of the protein is well known. In contrast, directed evolution does not require extensive information and a three-dimensional structure of the protein of interest. Instead, it involves random mutagenesis and selection to screen enzymes with desired properties. De novo design uses computational protein design algorithms to tailor synthetic proteins by using the three-dimensional structures of natural proteins and their folding rules. The present review highlights and summarizes recent protein engineering approaches, and their challenges and limitations in the post-genomic era. Copyright© Bentham Science Publishers; For any queries, please email at epub@benthamscience.org.
Metavisitor, a Suite of Galaxy Tools for Simple and Rapid Detection and Discovery of Viruses in Deep Sequence Data

PubMed Central

Vernick, Kenneth D.

2017-01-01

Metavisitor is a software package that allows biologists and clinicians without specialized bioinformatics expertise to detect and assemble viral genomes from deep sequence datasets. The package is composed of a set of modular bioinformatic tools and workflows that are implemented in the Galaxy framework. Using the graphical Galaxy workflow editor, users with minimal computational skills can use existing Metavisitor workflows or adapt them to suit specific needs by adding or modifying analysis modules. Metavisitor works with DNA, RNA or small RNA sequencing data over a range of read lengths and can use a combination of de novo and guided approaches to assemble genomes from sequencing reads. We show that the software has the potential for quick diagnosis as well as discovery of viruses from a vast array of organisms. Importantly, we provide here executable Metavisitor use cases, which increase the accessibility and transparency of the software, ultimately enabling biologists or clinicians to focus on biological or medical questions. PMID:28045932
Identification of QTLs for 14 Agronomically Important Traits in Setaria italica Based on SNPs Generated from High-Throughput Sequencing.

PubMed

Zhang, Kai; Fan, Guangyu; Zhang, Xinxin; Zhao, Fang; Wei, Wei; Du, Guohua; Feng, Xiaolei; Wang, Xiaoming; Wang, Feng; Song, Guoliang; Zou, Hongfeng; Zhang, Xiaolei; Li, Shuangdong; Ni, Xuemei; Zhang, Gengyun; Zhao, Zhihai

2017-05-05

Foxtail millet ( Setaria italica ) is an important crop possessing C4 photosynthesis capability. The S. italica genome was de novo sequenced in 2012, but the sequence lacked high-density genetic maps with agronomic and yield trait linkages. In the present study, we resequenced a foxtail millet population of 439 recombinant inbred lines (RILs) and developed high-resolution bin map and high-density SNP markers, which could provide an effective approach for gene identification. A total of 59 QTL for 14 agronomic traits in plants grown under long- and short-day photoperiods were identified. The phenotypic variation explained ranged from 4.9 to 43.94%. In addition, we suggested that there may be segregation distortion on chromosome 6 that is significantly distorted toward Zhang gu. The newly identified QTL will provide a platform for sequence-based research on the S. italica genome, and for molecular marker-assisted breeding. Copyright © 2017 Zhang et al.
Fast de novo discovery of low-energy protein loop conformations.

PubMed

Wong, Samuel W K; Liu, Jun S; Kou, S C

2017-08-01

In the prediction of protein structure from amino acid sequence, loops are challenging regions for computational methods. Since loops are often located on the protein surface, they can have significant roles in determining protein functions and binding properties. Loop prediction without the aid of a structural template requires extensive conformational sampling and energy minimization, which are computationally difficult. In this article we present a new de novo loop sampling method, the Parallely filtered Energy Targeted All-atom Loop Sampler (PETALS) to rapidly locate low energy conformations. PETALS explores both backbone and side-chain positions of the loop region simultaneously according to the energy function selected by the user, and constructs a nonredundant ensemble of low energy loop conformations using filtering criteria. The method is illustrated with the DFIRE potential and DiSGro energy function for loops, and shown to be highly effective at discovering conformations with near-native (or better) energy. Using the same energy function as the DiSGro algorithm, PETALS samples conformations with both lower RMSDs and lower energies. PETALS is also useful for assessing the accuracy of different energy functions. PETALS runs rapidly, requiring an average time cost of 10 minutes for a length 12 loop on a single 3.2 GHz processor core, comparable to the fastest existing de novo methods for generating an ensemble of conformations. Proteins 2017; 85:1402-1412. © 2017 Wiley Periodicals, Inc. © 2017 Wiley Periodicals, Inc.
De novo transcriptome sequence assembly from coconut leaves and seeds with a focus on factors involved in RNA-directed DNA methylation.

PubMed

Huang, Ya-Yi; Lee, Chueh-Pai; Fu, Jason L; Chang, Bill Chia-Han; Matzke, Antonius J M; Matzke, Marjori

2014-09-04

Coconut palm (Cocos nucifera) is a symbol of the tropics and a source of numerous edible and nonedible products of economic value. Despite its nutritional and industrial significance, coconut remains under-represented in public repositories for genomic and transcriptomic data. We report de novo transcript assembly from RNA-seq data and analysis of gene expression in seed tissues (embryo and endosperm) and leaves of a dwarf coconut variety. Assembly of 10 GB sequencing data for each tissue resulted in 58,211 total unigenes in embryo, 61,152 in endosperm, and 33,446 in leaf. Within each unigene pool, 24,857 could be annotated in embryo, 29,731 could be annotated in endosperm, and 26,064 could be annotated in leaf. A KEGG analysis identified 138, 138, and 139 pathways, respectively, in transcriptomes of embryo, endosperm, and leaf tissues. Given the extraordinarily large size of coconut seeds and the importance of small RNA-mediated epigenetic regulation during seed development in model plants, we used homology searches to identify putative homologs of factors required for RNA-directed DNA methylation in coconut. The findings suggest that RNA-directed DNA methylation is important during coconut seed development, particularly in maturing endosperm. This dataset will expand the genomics resources available for coconut and provide a foundation for more detailed analyses that may assist molecular breeding strategies aimed at improving this major tropical crop. Copyright © 2014 Huang et al.
A girl with West syndrome and autistic features harboring a de novo TBL1XR1 mutation.

PubMed

Saitsu, Hirotomo; Tohyama, Jun; Walsh, Tom; Kato, Mitsuhiro; Kobayashi, Yu; Lee, Ming; Tsurusaki, Yoshinori; Miyake, Noriko; Goto, Yu-Ichi; Nishino, Ichizo; Ohtake, Akira; King, Mary-Claire; Matsumoto, Naomichi

2014-10-01

Recently, de novo mutations in TBL1XR1 were found in two patients with autism spectrum disorders. Here, we report on a Japanese girl presenting with West syndrome, Rett syndrome-like and autistic features. Her initial development was normal until she developed a series of spasms at 5 months of age. Electroencephalogram at 7 months showed a pattern of hypsarrhythmia, which led to a diagnosis of West syndrome. Stereotypic hand movements appeared at 8 months of age, and autistic features such as deficits in communication, hyperactivity and excitability were observed later, at 4 years and 9 months. Whole exome sequencing of the patient and her parents revealed a de novo TBL1XR1 mutation [c.209 G>A (p.Gly70Asp)] occurring at an evolutionarily conserved amino acid in an F-box-like domain. Our report expands the clinical spectrum of TBL1XR1 mutations to West syndrome with Rett-like features, together with autistic features.
Draft De Novo Transcriptome of the Rat Kangaroo Potorous tridactylus as a Tool for Cell Biology

PubMed Central

Udy, Dylan B.; Voorhies, Mark; Chan, Patricia P.; Lowe, Todd M.; Dumont, Sophie

2015-01-01

The rat kangaroo (long-nosed potoroo, Potorous tridactylus) is a marsupial native to Australia. Cultured rat kangaroo kidney epithelial cells (PtK) are commonly used to study cell biological processes. These mammalian cells are large, adherent, and flat, and contain large and few chromosomes—and are thus ideal for imaging intra-cellular dynamics such as those of mitosis. Despite this, neither the rat kangaroo genome nor transcriptome have been sequenced, creating a challenge for probing the molecular basis of these cellular dynamics. Here, we present the sequencing, assembly and annotation of the draft rat kangaroo de novo transcriptome. We sequenced 679 million reads that mapped to 347,323 Trinity transcripts and 20,079 Unigenes. We present statistics emerging from transcriptome-wide analyses, and analyses suggesting that the transcriptome covers full-length sequences of most genes, many with multiple isoforms. We also validate our findings with a proof-of-concept gene knockdown experiment. We expect that this high quality transcriptome will make rat kangaroo cells a more tractable system for linking molecular-scale function and cellular-scale dynamics. PMID:26252667
Draft De Novo Transcriptome of the Rat Kangaroo Potorous tridactylus as a Tool for Cell Biology.

PubMed

Udy, Dylan B; Voorhies, Mark; Chan, Patricia P; Lowe, Todd M; Dumont, Sophie

2015-01-01

The rat kangaroo (long-nosed potoroo, Potorous tridactylus) is a marsupial native to Australia. Cultured rat kangaroo kidney epithelial cells (PtK) are commonly used to study cell biological processes. These mammalian cells are large, adherent, and flat, and contain large and few chromosomes-and are thus ideal for imaging intra-cellular dynamics such as those of mitosis. Despite this, neither the rat kangaroo genome nor transcriptome have been sequenced, creating a challenge for probing the molecular basis of these cellular dynamics. Here, we present the sequencing, assembly and annotation of the draft rat kangaroo de novo transcriptome. We sequenced 679 million reads that mapped to 347,323 Trinity transcripts and 20,079 Unigenes. We present statistics emerging from transcriptome-wide analyses, and analyses suggesting that the transcriptome covers full-length sequences of most genes, many with multiple isoforms. We also validate our findings with a proof-of-concept gene knockdown experiment. We expect that this high quality transcriptome will make rat kangaroo cells a more tractable system for linking molecular-scale function and cellular-scale dynamics.
Single-molecule, full-length transcript sequencing provides insight into the extreme metabolism of the ruby-throated hummingbird Archilochus colubris.

PubMed

Workman, Rachael E; Myrka, Alexander M; Wong, G William; Tseng, Elizabeth; Welch, Kenneth C; Timp, Winston

2018-03-01

Hummingbirds oxidize ingested nectar sugars directly to fuel foraging but cannot sustain this fuel use during fasting periods, such as during the night or during long-distance migratory flights. Instead, fasting hummingbirds switch to oxidizing stored lipids that are derived from ingested sugars. The hummingbird liver plays a key role in moderating energy homeostasis and this remarkable capacity for fuel switching. Additionally, liver is the principle location of de novo lipogenesis, which can occur at exceptionally high rates, such as during premigratory fattening. Yet understanding how this tissue and whole organism moderates energy turnover is hampered by a lack of information regarding how relevant enzymes differ in sequence, expression, and regulation. We generated a de novo transcriptome of the hummingbird liver using PacBio full-length cDNA sequencing (Iso-Seq), yielding 8.6Gb of sequencing data, or 2.6M reads from 4 different size fractions. We analyzed data using the SMRTAnalysis v3.1 Iso-Seq pipeline, then clustered isoforms into gene families to generate de novo gene contigs using Cogent. We performed orthology analysis to identify closely related sequences between our transcriptome and other avian and human gene sets. Finally, we closely examined homology of critical lipid metabolism genes between our transcriptome data and avian and human genomes. We confirmed high levels of sequence divergence within hummingbird lipogenic enzymes, suggesting a high probability of adaptive divergent function in the hepatic lipogenic pathways. Our results leverage cutting-edge technology and a novel bioinformatics pipeline to provide a first direct look at the transcriptome of this incredible organism.
Genetic study of congenital bile-duct dilatation identifies de novo and inherited variants in functionally related genes.

PubMed

Wong, John K L; Campbell, Desmond; Ngo, Ngoc Diem; Yeung, Fanny; Cheng, Guo; Tang, Clara S M; Chung, Patrick H Y; Tran, Ngoc Son; So, Man-Ting; Cherny, Stacey S; Sham, Pak C; Tam, Paul K; Garcia-Barcelo, Maria-Mercè

2016-12-12

Congenital dilatation of the bile-duct (CDD) is a rare, mostly sporadic, disorder that results in bile retention with severe associated complications. CDD affects mainly Asians. To our knowledge, no genetic study has ever been conducted. We aim to identify genetic risk factors by a "trio-based" exome-sequencing approach, whereby 31 CDD probands and their unaffected parents were exome-sequenced. Seven-hundred controls from the local population were used to detect gene-sets significantly enriched with rare variants in CDD patients. Twenty-one predicted damaging de novo variants (DNVs; 4 protein truncating and 17 missense) were identified in several evolutionarily constrained genes (p < 0.01). Six genes carrying DNVs were associated with human developmental disorders involving epithelial, connective or bone morphologies (PXDN, RTEL1, ANKRD11, MAP2K1, CYLD, ACAN) and four linked with cholangio- and hepatocellular carcinomas (PIK3CA, TLN1 CYLD, MAP2K1). Importantly, CDD patients have an excess of DNVs in cancer-related genes (p < 0.025). Thirteen genes were recurrently mutated at different sites, forming compound heterozygotes or functionally related complexes within patients. Our data supports a strong genetic basis for CDD and show that CDD is not only genetically heterogeneous but also non-monogenic, requiring mutations in more than one genes for the disease to develop. The data is consistent with the rarity and sporadic presentation of CDD.
Brute-Force Approach for Mass Spectrometry-Based Variant Peptide Identification in Proteogenomics without Personalized Genomic Data

NASA Astrophysics Data System (ADS)

Ivanov, Mark V.; Lobas, Anna A.; Levitsky, Lev I.; Moshkovskii, Sergei A.; Gorshkov, Mikhail V.

2018-02-01

In a proteogenomic approach based on tandem mass spectrometry analysis of proteolytic peptide mixtures, customized exome or RNA-seq databases are employed for identifying protein sequence variants. However, the problem of variant peptide identification without personalized genomic data is important for a variety of applications. Following the recent proposal by Chick et al. (Nat. Biotechnol. 33, 743-749, 2015) on the feasibility of such variant peptide search, we evaluated two available approaches based on the previously suggested "open" search and the "brute-force" strategy. To improve the efficiency of these approaches, we propose an algorithm for exclusion of false variant identifications from the search results involving analysis of modifications mimicking single amino acid substitutions. Also, we propose a de novo based scoring scheme for assessment of identified point mutations. In the scheme, the search engine analyzes y-type fragment ions in MS/MS spectra to confirm the location of the mutation in the variant peptide sequence.
Comprehensive evaluation of non-hybrid genome assembly tools for third-generation PacBio long-read sequence data.

PubMed

Jayakumar, Vasanthan; Sakakibara, Yasubumi

2017-11-03

Long reads obtained from third-generation sequencing platforms can help overcome the long-standing challenge of the de novo assembly of sequences for the genomic analysis of non-model eukaryotic organisms. Numerous long-read-aided de novo assemblies have been published recently, which exhibited superior quality of the assembled genomes in comparison with those achieved using earlier second-generation sequencing technologies. Evaluating assemblies is important in guiding the appropriate choice for specific research needs. In this study, we evaluated 10 long-read assemblers using a variety of metrics on Pacific Biosciences (PacBio) data sets from different taxonomic categories with considerable differences in genome size. The results allowed us to narrow down the list to a few assemblers that can be effectively applied to eukaryotic assembly projects. Moreover, we highlight how best to use limited genomic resources for effectively evaluating the genome assemblies of non-model organisms. © The Author 2017. Published by Oxford University Press.
A rare case of de novo gigantic ovarian abscess within an endometrioma.

PubMed

Hameed, Aisha; Mehta, Vaishali; Sinha, Prabha

2010-06-01

We are reporting a rare case of de novo ovarian abscess in an endometrioma. Ovarian abscess within an endometrioma is a rare gynecological problem, but de novo abscess in the endometrioma is even rarer. Most of the ovarian abscesses develop in the endometriomas following interventions, e.g., aspiration, pelvic surgery, and oocyte retrieval. We are presenting a case of a spontaneous giant abscess in a large ovarian cyst in a nulliparous woman who presented with acute abdomen. Patient was treated in a district general hospital with multidisciplinary approach. Thirteen liters of the pus were drained. She has had a sub total (supra cervical) hysterectomy and bilateral salpingo-oophorectomy (BSO) performed. Histology of the abscess wall confirmed endometriotic nature of the cyst. Patient made an uneventful recovery and was discharged home on the 14th postoperative day. This case highlights that endometrioma and its complication can present as a surgical emergency and should be dealt as one.
A Rare Case of De Novo Gigantic Ovarian Abscess within an Endometrioma

PubMed Central

Hameed, Aisha; Mehta, Vaishali; Sinha, Prabha

2010-01-01

We are reporting a rare case of de novo ovarian abscess in an endometrioma. Ovarian abscess within an endometrioma is a rare gynecological problem, but de novo abscess in the endometrioma is even rarer. Most of the ovarian abscesses develop in the endometriomas following interventions, e.g., aspiration, pelvic surgery, and oocyte retrieval. We are presenting a case of a spontaneous giant abscess in a large ovarian cyst in a nulliparous woman who presented with acute abdomen. Patient was treated in a district general hospital with multidisciplinary approach. Thirteen liters of the pus were drained. She has had a sub total (supra cervical) hysterectomy and bilateral salpingo-oophorectomy (BSO) performed. Histology of the abscess wall confirmed endometriotic nature of the cyst. Patient made an uneventful recovery and was discharged home on the 14th postoperative day. This case highlights that endometrioma and its complication can present as a surgical emergency and should be dealt as one. PMID:20589187

CAMELOT: A machine learning approach for coarse-grained simulations of aggregation of block-copolymeric protein sequences

PubMed Central

Ruff, Kiersten M.; Harmon, Tyler S.; Pappu, Rohit V.

2015-01-01

We report the development and deployment of a coarse-graining method that is well suited for computer simulations of aggregation and phase separation of protein sequences with block-copolymeric architectures. Our algorithm, named CAMELOT for Coarse-grained simulations Aided by MachinE Learning Optimization and Training, leverages information from converged all atom simulations that is used to determine a suitable resolution and parameterize the coarse-grained model. To parameterize a system-specific coarse-grained model, we use a combination of Boltzmann inversion, non-linear regression, and a Gaussian process Bayesian optimization approach. The accuracy of the coarse-grained model is demonstrated through direct comparisons to results from all atom simulations. We demonstrate the utility of our coarse-graining approach using the block-copolymeric sequence from the exon 1 encoded sequence of the huntingtin protein. This sequence comprises of 17 residues from the N-terminal end of huntingtin (N17) followed by a polyglutamine (polyQ) tract. Simulations based on the CAMELOT approach are used to show that the adsorption and unfolding of the wild type N17 and its sequence variants on the surface of polyQ tracts engender a patchy colloid like architecture that promotes the formation of linear aggregates. These results provide a plausible explanation for experimental observations, which show that N17 accelerates the formation of linear aggregates in block-copolymeric N17-polyQ sequences. The CAMELOT approach is versatile and is generalizable for simulating the aggregation and phase behavior of a range of block-copolymeric protein sequences. PMID:26723608
Identification of Putative Nuclear Receptors and Steroidogenic Enzymes in Murray-Darling Rainbowfish (Melanotaenia fluviatilis) Using RNA-Seq and De Novo Transcriptome Assembly.

PubMed

Bain, Peter A; Papanicolaou, Alexie; Kumar, Anupama

2015-01-01

Murray-Darling rainbowfish (Melanotaenia fluviatilis [Castelnau, 1878]; Atheriniformes: Melanotaeniidae) is a small-bodied teleost currently under development in Australasia as a test species for aquatic toxicological studies. To date, efforts towards the development of molecular biomarkers of contaminant exposure have been hindered by the lack of available sequence data. To address this, we sequenced messenger RNA from brain, liver and gonads of mature male and female fish and generated a high-quality draft transcriptome using a de novo assembly approach. 149,742 clusters of putative transcripts were obtained, encompassing 43,841 non-redundant protein-coding regions. Deduced amino acid sequences were annotated by functional inference based on similarity with sequences from manually curated protein sequence databases. The draft assembly contained protein-coding regions homologous to 95.7% of the complete cohort of predicted proteins from the taxonomically related species, Oryzias latipes (Japanese medaka). The mean length of rainbowfish protein-coding sequences relative to their medaka homologues was 92.1%, indicating that despite the limited number of tissues sampled a large proportion of the total expected number of protein-coding genes was captured in the study. Because of our interest in the effects of environmental contaminants on endocrine pathways, we manually curated subsets of coding regions for putative nuclear receptors and steroidogenic enzymes in the rainbowfish transcriptome, revealing 61 candidate nuclear receptors encompassing all known subfamilies, and 41 putative steroidogenic enzymes representing all major steroidogenic enzymes occurring in teleosts. The transcriptome presented here will be a valuable resource for researchers interested in biomarker development, protein structure and function, and contaminant-response genomics in Murray-Darling rainbowfish.
Identification of interleukin genes in Pogona vitticeps using a de novo transcriptome assembly from RNA-seq data.

PubMed

Livernois, Alexandra; Hardy, Kristine; Domaschenz, Renae; Papanicolaou, Alexie; Georges, Arthur; Sarre, Stephen D; Rao, Sudha; Ezaz, Tariq; Deakin, Janine E

2016-10-01

Interleukins are a group of cytokines with complex immunomodulatory functions that are important for regulating immunity in vertebrate species. Reptiles and mammals last shared a common ancestor more than 350 million years ago, so it is not surprising that low sequence identity has prevented divergent interleukin genes from being identified in the central bearded dragon lizard, Pogona vitticeps, in its genome assembly. To determine the complete nucleotide sequences of key interleukin genes, we constructed full-length transcripts, using the Trinity platform, from short paired-end read RNA sequences from stimulated spleen cells. De novo transcript reconstruction and analysis allowed us to identify interleukin genes that are missing from the published P. vitticeps assembly. Identification of key cytokines in P. vitticeps will provide insight into the essential molecular mechanisms and evolution of interleukin gene families and allow for characterization of the immune response in a lizard for comparison with mammals.
An improved and validated RNA HLA class I SBT approach for obtaining full length coding sequences.

PubMed

Gerritsen, K E H; Olieslagers, T I; Groeneweg, M; Voorter, C E M; Tilanus, M G J

2014-11-01

The functional relevance of human leukocyte antigen (HLA) class I allele polymorphism beyond exons 2 and 3 is difficult to address because more than 70% of the HLA class I alleles are defined by exons 2 and 3 sequences only. For routine application on clinical samples we improved and validated the HLA sequence-based typing (SBT) approach based on RNA templates, using either a single locus-specific or two overlapping group-specific polymerase chain reaction (PCR) amplifications, with three forward and three reverse sequencing reactions for full length sequencing. Locus-specific HLA typing with RNA SBT of a reference panel, representing the major antigen groups, showed identical results compared to DNA SBT typing. Alleles encountered with unknown exons in the IMGT/HLA database and three samples, two with Null and one with a Low expressed allele, have been addressed by the group-specific RNA SBT approach to obtain full length coding sequences. This RNA SBT approach has proven its value in our routine full length definition of alleles. © 2014 John Wiley & Sons A/S. Published by John Wiley & Sons Ltd.
Extensive characterization of Tupaia belangeri neuropeptidome using an integrated mass spectrometric approach.

PubMed

Petruzziello, Filomena; Fouillen, Laetitia; Wadensten, Henrik; Kretz, Robert; Andren, Per E; Rainer, Gregor; Zhang, Xiaozhe

2012-02-03

Neuropeptidomics is used to characterize endogenous peptides in the brain of tree shrews (Tupaia belangeri). Tree shrews are small animals similar to rodents in size but close relatives of primates, and are excellent models for brain research. Currently, tree shrews have no complete proteome information available on which direct database search can be allowed for neuropeptide identification. To increase the capability in the identification of neuropeptides in tree shrews, we developed an integrated mass spectrometry (MS)-based approach that combines methods including data-dependent, directed, and targeted liquid chromatography (LC)-Fourier transform (FT)-tandem MS (MS/MS) analysis, database construction, de novo sequencing, precursor protein search, and homology analysis. Using this integrated approach, we identified 107 endogenous peptides that have sequences identical or similar to those from other mammalian species. High accuracy MS and tandem MS information, with BLAST analysis and chromatographic characteristics were used to confirm the sequences of all the identified peptides. Interestingly, further sequence homology analysis demonstrated that tree shrew peptides have a significantly higher degree of homology to equivalent sequences in humans than those in mice or rats, consistent with the close phylogenetic relationship between tree shrews and primates. Our results provide the first extensive characterization of the peptidome in tree shrews, which now permits characterization of their function in nervous and endocrine system. As the approach developed fully used the conservative properties of neuropeptides in evolution and the advantage of high accuracy MS, it can be portable for identification of neuropeptides in other species for which the fully sequenced genomes or proteomes are not available.
De novo assembly of maritime pine transcriptome: implications for forest breeding and biotechnology.

PubMed

Canales, Javier; Bautista, Rocio; Label, Philippe; Gómez-Maldonado, Josefa; Lesur, Isabelle; Fernández-Pozo, Noe; Rueda-López, Marina; Guerrero-Fernández, Dario; Castro-Rodríguez, Vanessa; Benzekri, Hicham; Cañas, Rafael A; Guevara, María-Angeles; Rodrigues, Andreia; Seoane, Pedro; Teyssier, Caroline; Morel, Alexandre; Ehrenmann, François; Le Provost, Grégoire; Lalanne, Céline; Noirot, Céline; Klopp, Christophe; Reymond, Isabelle; García-Gutiérrez, Angel; Trontin, Jean-François; Lelu-Walter, Marie-Anne; Miguel, Celia; Cervera, María Teresa; Cantón, Francisco R; Plomion, Christophe; Harvengt, Luc; Avila, Concepción; Gonzalo Claros, M; Cánovas, Francisco M

2014-04-01

Maritime pine (Pinus pinasterAit.) is a widely distributed conifer species in Southwestern Europe and one of the most advanced models for conifer research. In the current work, comprehensive characterization of the maritime pine transcriptome was performed using a combination of two different next-generation sequencing platforms, 454 and Illumina. De novo assembly of the transcriptome provided a catalogue of 26 020 unique transcripts in maritime pine trees and a collection of 9641 full-length cDNAs. Quality of the transcriptome assembly was validated by RT-PCR amplification of selected transcripts for structural and regulatory genes. Transcription factors and enzyme-encoding transcripts were annotated. Furthermore, the available sequencing data permitted the identification of polymorphisms and the establishment of robust single nucleotide polymorphism (SNP) and simple-sequence repeat (SSR) databases for genotyping applications and integration of translational genomics in maritime pine breeding programmes. All our data are freely available at SustainpineDB, the P. pinaster expressional database. Results reported here on the maritime pine transcriptome represent a valuable resource for future basic and applied studies on this ecological and economically important pine species. © 2013 Society for Experimental Biology, Association of Applied Biologists and John Wiley & Sons Ltd.
Facilitated sequence counting and assembly by template mutagenesis

PubMed Central

Levy, Dan; Wigler, Michael

2014-01-01

Presently, inferring the long-range structure of the DNA templates is limited by short read lengths. Accurate template counts suffer from distortions occurring during PCR amplification. We explore the utility of introducing random mutations in identical or nearly identical templates to create distinguishable patterns that are inherited during subsequent copying. We simulate the applications of this process under assumptions of error-free sequencing and perfect mapping, using cytosine deamination as a model for mutation. The simulations demonstrate that within readily achievable conditions of nucleotide conversion and sequence coverage, we can accurately count the number of otherwise identical molecules as well as connect variants separated by long spans of identical sequence. We discuss many potential applications, such as transcript profiling, isoform assembly, haplotype phasing, and de novo genome assembly. PMID:25313059
A Feature-Based Approach to Modeling Protein–DNA Interactions

PubMed Central

Segal, Eran

2008-01-01

Transcription factor (TF) binding to its DNA target site is a fundamental regulatory interaction. The most common model used to represent TF binding specificities is a position specific scoring matrix (PSSM), which assumes independence between binding positions. However, in many cases, this simplifying assumption does not hold. Here, we present feature motif models (FMMs), a novel probabilistic method for modeling TF–DNA interactions, based on log-linear models. Our approach uses sequence features to represent TF binding specificities, where each feature may span multiple positions. We develop the mathematical formulation of our model and devise an algorithm for learning its structural features from binding site data. We also developed a discriminative motif finder, which discovers de novo FMMs that are enriched in target sets of sequences compared to background sets. We evaluate our approach on synthetic data and on the widely used TF chromatin immunoprecipitation (ChIP) dataset of Harbison et al. We then apply our algorithm to high-throughput TF ChIP data from mouse and human, reveal sequence features that are present in the binding specificities of mouse and human TFs, and show that FMMs explain TF binding significantly better than PSSMs. Our FMM learning and motif finder software are available at http://genie.weizmann.ac.il/. PMID:18725950
In Silico Identification of Protein Disulfide Isomerase Gene Families in the De Novo Assembled Transcriptomes of Four Different Species of the Genus Conus.

PubMed

Figueroa-Montiel, Andrea; Ramos, Marco A; Mares, Rosa E; Dueñas, Salvador; Pimienta, Genaro; Ortiz, Ernesto; Possani, Lourival D; Licea-Navarro, Alexei F

2016-01-01

Small peptides isolated from the venom of the marine snails belonging to the genus Conus have been largely studied because of their therapeutic value. These peptides can be classified in two groups. The largest one is composed by peptides rich in disulfide bonds, and referred to as conotoxins. Despite the importance of conotoxins given their pharmacology value, little is known about the protein disulfide isomerase (PDI) enzymes that are required to catalyze their correct folding. To discover the PDIs that may participate in the folding and structural maturation of conotoxins, the transcriptomes of the venom duct of four different species of Conus from the peninsula of Baja California (Mexico) were assembled. Complementary DNA (cDNA) libraries were constructed for each species and sequenced using a Genome Analyzer Illumina platform. The raw RNA-seq data was converted into transcript sequences using Trinity, a de novo assembler that allows the grouping of reads into contigs without a reference genome. An N50 value of 605 was established as a reference for future assemblies of Conus transcriptomes using this software. Transdecoder was used to extract likely coding sequences from Trinity transcripts, and PDI-specific sequence motif "APWCGHCK" was used to capture potential PDIs. An in silico analysis was performed to characterize the group of PDI protein sequences encoded by the duct-transcriptome of each species. The computational approach entailed a structural homology characterization, based on the presence of functional Thioredoxin-like domains. Four different PDI families were characterized, which are constituted by a total of 41 different gene sequences. The sequences had an average of 65% identity with other PDIs. Using MODELLER 9.14, the homology-based three-dimensional structure prediction of a subset of the sequences reported, showed the expected thioredoxin fold which was confirmed by a "simulated annealing" method.
Evaluation and Validation of Assembling Corrected PacBio Long Reads for Microbial Genome Completion via Hybrid Approaches.

PubMed

Lin, Hsin-Hung; Liao, Yu-Chieh

2015-01-01

Despite the ever-increasing output of next-generation sequencing data along with developing assemblers, dozens to hundreds of gaps still exist in de novo microbial assemblies due to uneven coverage and large genomic repeats. Third-generation single-molecule, real-time (SMRT) sequencing technology avoids amplification artifacts and generates kilobase-long reads with the potential to complete microbial genome assembly. However, due to the low accuracy (~85%) of third-generation sequences, a considerable amount of long reads (>50X) are required for self-correction and for subsequent de novo assembly. Recently-developed hybrid approaches, using next-generation sequencing data and as few as 5X long reads, have been proposed to improve the completeness of microbial assembly. In this study we have evaluated the contemporary hybrid approaches and demonstrated that assembling corrected long reads (by runCA) produced the best assembly compared to long-read scaffolding (e.g., AHA, Cerulean and SSPACE-LongRead) and gap-filling (SPAdes). For generating corrected long reads, we further examined long-read correction tools, such as ECTools, LSC, LoRDEC, PBcR pipeline and proovread. We have demonstrated that three microbial genomes including Escherichia coli K12 MG1655, Meiothermus ruber DSM1279 and Pdeobacter heparinus DSM2366 were successfully hybrid assembled by runCA into near-perfect assemblies using ECTools-corrected long reads. In addition, we developed a tool, Patch, which implements corrected long reads and pre-assembled contigs as inputs, to enhance microbial genome assemblies. With the additional 20X long reads, short reads of S. cerevisiae W303 were hybrid assembled into 115 contigs using the verified strategy, ECTools + runCA. Patch was subsequently applied to upgrade the assembly to a 35-contig draft genome. Our evaluation of the hybrid approaches shows that assembling the ECTools-corrected long reads via runCA generates near complete microbial genomes, suggesting
Comparison of next generation sequencing technologies for transcriptome characterization

PubMed Central

2009-01-01

Background We have developed a simulation approach to help determine the optimal mixture of sequencing methods for most complete and cost effective transcriptome sequencing. We compared simulation results for traditional capillary sequencing with "Next Generation" (NG) ultra high-throughput technologies. The simulation model was parameterized using mappings of 130,000 cDNA sequence reads to the Arabidopsis genome (NCBI Accession SRA008180.19). We also generated 454-GS20 sequences and de novo assemblies for the basal eudicot California poppy (Eschscholzia californica) and the magnoliid avocado (Persea americana) using a variety of methods for cDNA synthesis. Results The Arabidopsis reads tagged more than 15,000 genes, including new splice variants and extended UTR regions. Of the total 134,791 reads (13.8 MB), 119,518 (88.7%) mapped exactly to known exons, while 1,117 (0.8%) mapped to introns, 11,524 (8.6%) spanned annotated intron/exon boundaries, and 3,066 (2.3%) extended beyond the end of annotated UTRs. Sequence-based inference of relative gene expression levels correlated significantly with microarray data. As expected, NG sequencing of normalized libraries tagged more genes than non-normalized libraries, although non-normalized libraries yielded more full-length cDNA sequences. The Arabidopsis data were used to simulate additional rounds of NG and traditional EST sequencing, and various combinations of each. Our simulations suggest a combination of FLX and Solexa sequencing for optimal transcriptome coverage at modest cost. We have also developed ESTcalc http://fgp.huck.psu.edu/NG_Sims/ngsim.pl, an online webtool, which allows users to explore the results of this study by specifying individualized costs and sequencing characteristics. Conclusion NG sequencing technologies are a highly flexible set of platforms that can be scaled to suit different project goals. In terms of sequence coverage alone, the NG sequencing is a dramatic advance over capillary
Increased co-expression of genes harboring the damaging de novo mutations in Chinese schizophrenic patients during prenatal development.

PubMed

Wang, Qiang; Li, Miaoxin; Yang, Zhenxing; Hu, Xun; Wu, Hei-Man; Ni, Peiyan; Ren, Hongyan; Deng, Wei; Li, Mingli; Ma, Xiaohong; Guo, Wanjun; Zhao, Liansheng; Wang, Yingcheng; Xiang, Bo; Lei, Wei; Sham, Pak C; Li, Tao

2015-12-15

Schizophrenia is a heritable, heterogeneous common psychiatric disorder. In this study, we evaluated the hypothesis that de novo variants (DNVs) contribute to the pathogenesis of schizophrenia. We performed exome sequencing in Chinese patients (N = 45) with schizophrenia and their unaffected parents (N = 90). Forty genes were found to contain DNVs. These genes had enriched transcriptional co-expression profile in prenatal frontal cortex (Bonferroni corrected p < 9.1 × 10(-3)), and in prenatal temporal and parietal regions (Bonferroni corrected p < 0.03). Also, four prenatal anatomical subregions (VCF, MFC, OFC and ITC) have shown significant enrichment of connectedness in co-expression networks. Moreover, four genes (LRP1, MACF1, DICER1 and ABCA2) harboring the damaging de novo mutations are strongly prioritized as susceptibility genes by multiple evidences. Our findings in Chinese schizophrenic patients indicate the pathogenic role of DNVs, supporting the hypothesis that schizophrenia is a neurodevelopmental disease.
Increased co-expression of genes harboring the damaging de novo mutations in Chinese schizophrenic patients during prenatal development

PubMed Central

Wang, Qiang; Li, Miaoxin; Yang, Zhenxing; Hu, Xun; Wu, Hei-Man; Ni, Peiyan; Ren, Hongyan; Deng, Wei; Li, Mingli; Ma, Xiaohong; Guo, Wanjun; Zhao, Liansheng; Wang, Yingcheng; Xiang, Bo; Lei, Wei; Sham, Pak C; Li, Tao

2015-01-01

Schizophrenia is a heritable, heterogeneous common psychiatric disorder. In this study, we evaluated the hypothesis that de novo variants (DNVs) contribute to the pathogenesis of schizophrenia. We performed exome sequencing in Chinese patients (N = 45) with schizophrenia and their unaffected parents (N = 90). Forty genes were found to contain DNVs. These genes had enriched transcriptional co-expression profile in prenatal frontal cortex (Bonferroni corrected p < 9.1 × 10−3), and in prenatal temporal and parietal regions (Bonferroni corrected p < 0.03). Also, four prenatal anatomical subregions (VCF, MFC, OFC and ITC) have shown significant enrichment of connectedness in co-expression networks. Moreover, four genes (LRP1, MACF1, DICER1 and ABCA2) harboring the damaging de novo mutations are strongly prioritized as susceptibility genes by multiple evidences. Our findings in Chinese schizophrenic patients indicate the pathogenic role of DNVs, supporting the hypothesis that schizophrenia is a neurodevelopmental disease. PMID:26666178
SMARTIV: combined sequence and structure de-novo motif discovery for in-vivo RNA binding data.

PubMed

Polishchuk, Maya; Paz, Inbal; Yakhini, Zohar; Mandel-Gutfreund, Yael

2018-05-25

Gene expression regulation is highly dependent on binding of RNA-binding proteins (RBPs) to their RNA targets. Growing evidence supports the notion that both RNA primary sequence and its local secondary structure play a role in specific Protein-RNA recognition and binding. Despite the great advance in high-throughput experimental methods for identifying sequence targets of RBPs, predicting the specific sequence and structure binding preferences of RBPs remains a major challenge. We present a novel webserver, SMARTIV, designed for discovering and visualizing combined RNA sequence and structure motifs from high-throughput RNA-binding data, generated from in-vivo experiments. The uniqueness of SMARTIV is that it predicts motifs from enriched k-mers that combine information from ranked RNA sequences and their predicted secondary structure, obtained using various folding methods. Consequently, SMARTIV generates Position Weight Matrices (PWMs) in a combined sequence and structure alphabet with assigned P-values. SMARTIV concisely represents the sequence and structure motif content as a single graphical logo, which is informative and easy for visual perception. SMARTIV was examined extensively on a variety of high-throughput binding experiments for RBPs from different families, generated from different technologies, showing consistent and accurate results. Finally, SMARTIV is a user-friendly webserver, highly efficient in run-time and freely accessible via http://smartiv.technion.ac.il/.
Whole transcriptome analysis using next-generation sequencing of model species Setaria viridis to support C4 photosynthesis research.

PubMed

Xu, Jiajia; Li, Yuanyuan; Ma, Xiuling; Ding, Jianfeng; Wang, Kai; Wang, Sisi; Tian, Ye; Zhang, Hui; Zhu, Xin-Guang

2013-09-01

Setaria viridis is an emerging model species for genetic studies of C4 photosynthesis. Many basic molecular resources need to be developed to support for this species. In this paper, we performed a comprehensive transcriptome analysis from multiple developmental stages and tissues of S. viridis using next-generation sequencing technologies. Sequencing of the transcriptome from multiple tissues across three developmental stages (seed germination, vegetative growth, and reproduction) yielded a total of 71 million single end 100 bp long reads. Reference-based assembly using Setaria italica genome as a reference generated 42,754 transcripts. De novo assembly generated 60,751 transcripts. In addition, 9,576 and 7,056 potential simple sequence repeats (SSRs) covering S. viridis genome were identified when using the reference based assembled transcripts and the de novo assembled transcripts, respectively. This identified transcripts and SSR provided by this study can be used for both reverse and forward genetic studies based on S. viridis.
Imparting functionality to biocatalysts via embedding enzymes into nanoporous materials by a de novo approach: size-selective sheltering of catalase in metal-organic framework microcrystals.

PubMed

Shieh, Fa-Kuen; Wang, Shao-Chun; Yen, Chia-I; Wu, Chang-Cheng; Dutta, Saikat; Chou, Lien-Yang; Morabito, Joseph V; Hu, Pan; Hsu, Ming-Hua; Wu, Kevin C-W; Tsung, Chia-Kuang

2015-04-08

We develop a new concept to impart new functions to biocatalysts by combining enzymes and metal-organic frameworks (MOFs). The proof-of-concept design is demonstrated by embedding catalase molecules into uniformly sized ZIF-90 crystals via a de novo approach. We have carried out electron microscopy, X-ray diffraction, nitrogen sorption, electrophoresis, thermogravimetric analysis, and confocal microscopy to confirm that the ~10 nm catalase molecules are embedded in 2 μm single-crystalline ZIF-90 crystals with ~5 wt % loading. Because catalase is immobilized and sheltered by the ZIF-90 crystals, the composites show activity in hydrogen peroxide degradation even in the presence of protease proteinase K.
De Novo Design of Skin-Penetrating Peptides for Enhanced Transdermal Delivery of Peptide Drugs.

PubMed

Menegatti, Stefano; Zakrewsky, Michael; Kumar, Sunny; De Oliveira, Joshua Sanchez; Muraski, John A; Mitragotri, Samir

2016-03-09

Skin-penetrating peptides (SPPs) are attracting increasing attention as a non-invasive strategy for transdermal delivery of therapeutics. The identification of SPP sequences, however, currently performed by experimental screening of peptide libraries, is very laborious. Recent studies have shown that, to be effective enhancers, SPPs must possess affinity for both skin keratin and the drug of interest. We therefore developed a computational process for generating and screening virtual libraries of disulfide-cyclic peptides against keratin and cyclosporine A (CsA) to identify SPPs capable of enhancing transdermal CsA delivery. The selected sequences were experimentally tested and found to bind both CsA and keratin, as determined by mass spectrometry and affinity chromatography, and enhance transdermal permeation of CsA. Four heptameric sequences that emerged as leading candidates (ACSATLQHSCG, ACSLTVNWNCG, ACTSTGRNACG, and ACSASTNHNCG) were tested and yielded CsA permeation on par with previously identified SPP SPACE (TM) . An octameric peptide (ACNAHQARSTCG) yielded significantly higher delivery of CsA compared to heptameric SPPs. The safety profile of the selected sequences was also validated by incubation with skin keratinocytes. This method thus represents an effective procedure for the de novo design of skin-penetrating peptides for the delivery of desired therapeutic or cosmetic agents. © 2016 WILEY-VCH Verlag GmbH & Co. KGaA, Weinheim.
De novo mutations in inhibitors of Wnt, BMP, and Ras/ERK signaling pathways in non-syndromic midline craniosynostosis.

PubMed

Timberlake, Andrew T; Furey, Charuta G; Choi, Jungmin; Nelson-Williams, Carol; Loring, Erin; Galm, Amy; Kahle, Kristopher T; Steinbacher, Derek M; Larysz, Dawid; Persing, John A; Lifton, Richard P

2017-08-29

Non-syndromic craniosynostosis (NSC) is a frequent congenital malformation in which one or more cranial sutures fuse prematurely. Mutations causing rare syndromic craniosynostoses in humans and engineered mouse models commonly increase signaling of the Wnt, bone morphogenetic protein (BMP), or Ras/ERK pathways, converging on shared nuclear targets that promote bone formation. In contrast, the genetics of NSC is largely unexplored. More than 95% of NSC is sporadic, suggesting a role for de novo mutations. Exome sequencing of 291 parent-offspring trios with midline NSC revealed 15 probands with heterozygous damaging de novo mutations in 12 negative regulators of Wnt, BMP, and Ras/ERK signaling (10.9-fold enrichment, P = 2.4 × 10 -11 ). SMAD6 had 4 de novo and 14 transmitted mutations; no other gene had more than 1. Four familial NSC kindreds had mutations in genes previously implicated in syndromic disease. Collectively, these mutations contribute to 10% of probands. Mutations are predominantly loss-of-function, implicating haploinsufficiency as a frequent mechanism. A common risk variant near BMP2 increased the penetrance of SMAD6 mutations and was overtransmitted to patients with de novo mutations in other genes in these pathways, supporting a frequent two-locus pathogenesis. These findings implicate new genes in NSC and demonstrate related pathophysiology of common non-syndromic and rare syndromic craniosynostoses. These findings have implications for diagnosis, risk of recurrence, and risk of adverse neurodevelopmental outcomes. Finally, the use of pathways identified in rare syndromic disease to find genes accounting for non-syndromic cases may prove broadly relevant to understanding other congenital disorders featuring high locus heterogeneity.
De Novo Transcriptome Analysis of Medicinally Important Plantago ovata Using RNA-Seq

PubMed Central

Kotwal, Shivanjali; Kaul, Sanjana; Sharma, Pooja; Gupta, Mehak; Shankar, Rama; Jain, Mukesh; Dhar, Manoj K.

2016-01-01

Plantago ovata is an economically and medicinally important plant of the family Plantaginaceae. It is used extensively for the production of seed husk for its application in pharmaceutical, food and cosmetic industries. In the present study, the transcriptome of P. ovata ovary was sequenced using Illumina Genome Analyzer platform to characterize the mucilage biosynthesis pathway in the plant. De novo assembly was carried out using Oases followed by velvet. A total of 46,955 non-redundant transcripts (≥100 bp) using ~29 million high-quality paired end reads were generated. Functional categorization of these transcripts revealed the presence of several genes involved in various biological processes like metabolic pathways, mucilage biosynthesis, biosynthesis of secondary metabolites and antioxidants. In addition, simple sequence-repeat motifs, non-coding RNAs and transcription factors were also identified. Expression profiling of some genes involved in mucilage biosynthetic pathway was performed in different tissues of P. ovata using Real time PCR analysis. The study has resulted in a valuable resource for further studies on gene expression, genomics and functional genomics in P. ovata. PMID:26943165
Quantitative chemical proteomics profiling of de novo protein synthesis during starvation-mediated autophagy

PubMed Central

Wang, Jigang; Zhang, Jianbin; Lee, Yew-Mun; Koh, Pin-Lang; Ng, Shukie; Bao, Feichao; Lin, Qingsong; Shen, Han-Ming

2016-01-01

ABSTRACT Autophagy is an intracellular degradation mechanism in response to nutrient starvation. Via autophagy, some nonessential cellular constituents are degraded in a lysosome-dependent manner to generate biomolecules that can be utilized for maintaining the metabolic homeostasis. Although it is known that under starvation the global protein synthesis is significantly reduced mainly due to suppression of MTOR (mechanistic target of rapamycin serine/threonine kinase), emerging evidence demonstrates that de novo protein synthesis is involved in the autophagic process. However, characterizing these de novo proteins has been an issue with current techniques. Here, we developed a novel method to identify newly synthesized proteins during starvation-mediated autophagy by combining bio-orthogonal noncanonical amino acid tagging (BONCAT) and isobaric tags for relative and absolute quantitation (iTRAQTM). Using bio-orthogonal metabolic tagging, L-azidohomoalanine (AHA) was incorporated into newly synthesized proteins which were then enriched with avidin beads after a click reaction between alkyne-bearing biotin and AHA's bio-orthogonal azide moiety. The enriched proteins were subjected to iTRAQ labeling for protein identification and quantification using liquid chromatography-tandem mass spectrometry (LC-MS/MS). Via the above approach, we identified and quantified a total of 1176 proteins and among them 711 proteins were found to meet our defined criteria as de novo synthesized proteins during starvation-mediated autophagy. The characterized functional profiles of the 711 newly synthesized proteins by bioinformatics analysis suggest their roles in ensuring the prosurvival outcome of autophagy. Finally, we performed validation assays for some selected proteins and found that knockdown of some genes has a significant impact on starvation-induced autophagy. Thus, we think that the BONCAT-iTRAQ approach is effective in the identification of newly synthesized proteins and

Some links on this page may take you to non-federal websites. Their policies may differ from this site.