DeepBase: annotation and discovery of microRNAs and other noncoding RNAs from deep-sequencing data.
Yang, Jian-Hua; Qu, Liang-Hu
2012-01-01
Recent advances in high-throughput deep-sequencing technology have produced large numbers of short and long RNA sequences and enabled the detection and profiling of known and novel microRNAs (miRNAs) and other noncoding RNAs (ncRNAs) at unprecedented sensitivity and depth. In this chapter, we describe the use of deepBase, a database that we have developed to integrate all public deep-sequencing data and to facilitate the comprehensive annotation and discovery of miRNAs and other ncRNAs from these data. deepBase provides an integrative, interactive, and versatile web graphical interface to evaluate miRBase-annotated miRNA genes and other known ncRNAs, explores the expression patterns of miRNAs and other ncRNAs, and discovers novel miRNAs and other ncRNAs from deep-sequencing data. deepBase also provides a deepView genome browser to comparatively analyze these data at multiple levels. deepBase is available at http://deepbase.sysu.edu.cn/.
BLSSpeller: exhaustive comparative discovery of conserved cis-regulatory elements.
De Witte, Dieter; Van de Velde, Jan; Decap, Dries; Van Bel, Michiel; Audenaert, Pieter; Demeester, Piet; Dhoedt, Bart; Vandepoele, Klaas; Fostier, Jan
2015-12-01
The accurate discovery and annotation of regulatory elements remains a challenging problem. The growing number of sequenced genomes creates new opportunities for comparative approaches to motif discovery. Putative binding sites are then considered to be functional if they are conserved in orthologous promoter sequences of multiple related species. Existing methods for comparative motif discovery usually rely on pregenerated multiple sequence alignments, which are difficult to obtain for more diverged species such as plants. As a consequence, misaligned regulatory elements often remain undetected. We present a novel algorithm that supports both alignment-free and alignment-based motif discovery in the promoter sequences of related species. Putative motifs are exhaustively enumerated as words over the IUPAC alphabet and screened for conservation using the branch length score. Additionally, a confidence score is established in a genome-wide fashion. In order to take advantage of a cloud computing infrastructure, the MapReduce programming model is adopted. The method is applied to four monocotyledon plant species and it is shown that high-scoring motifs are significantly enriched for open chromatin regions in Oryza sativa and for transcription factor binding sites inferred through protein-binding microarrays in O.sativa and Zea mays. Furthermore, the method is shown to recover experimentally profiled ga2ox1-like KN1 binding sites in Z.mays. BLSSpeller was written in Java. Source code and manual are available at http://bioinformatics.intec.ugent.be/blsspeller Klaas.Vandepoele@psb.vib-ugent.be or jan.fostier@intec.ugent.be. Supplementary data are available at Bioinformatics online. © The Author 2015. Published by Oxford University Press.
BLSSpeller: exhaustive comparative discovery of conserved cis-regulatory elements
De Witte, Dieter; Van de Velde, Jan; Decap, Dries; Van Bel, Michiel; Audenaert, Pieter; Demeester, Piet; Dhoedt, Bart; Vandepoele, Klaas; Fostier, Jan
2015-01-01
Motivation: The accurate discovery and annotation of regulatory elements remains a challenging problem. The growing number of sequenced genomes creates new opportunities for comparative approaches to motif discovery. Putative binding sites are then considered to be functional if they are conserved in orthologous promoter sequences of multiple related species. Existing methods for comparative motif discovery usually rely on pregenerated multiple sequence alignments, which are difficult to obtain for more diverged species such as plants. As a consequence, misaligned regulatory elements often remain undetected. Results: We present a novel algorithm that supports both alignment-free and alignment-based motif discovery in the promoter sequences of related species. Putative motifs are exhaustively enumerated as words over the IUPAC alphabet and screened for conservation using the branch length score. Additionally, a confidence score is established in a genome-wide fashion. In order to take advantage of a cloud computing infrastructure, the MapReduce programming model is adopted. The method is applied to four monocotyledon plant species and it is shown that high-scoring motifs are significantly enriched for open chromatin regions in Oryza sativa and for transcription factor binding sites inferred through protein-binding microarrays in O.sativa and Zea mays. Furthermore, the method is shown to recover experimentally profiled ga2ox1-like KN1 binding sites in Z.mays. Availability and implementation: BLSSpeller was written in Java. Source code and manual are available at http://bioinformatics.intec.ugent.be/blsspeller Contact: Klaas.Vandepoele@psb.vib-ugent.be or jan.fostier@intec.ugent.be Supplementary information: Supplementary data are available at Bioinformatics online. PMID:26254488
Covington, Brett C; McLean, John A; Bachmann, Brian O
2017-01-04
Covering: 2000 to 2016The labor-intensive process of microbial natural product discovery is contingent upon identifying discrete secondary metabolites of interest within complex biological extracts, which contain inventories of all extractable small molecules produced by an organism or consortium. Historically, compound isolation prioritization has been driven by observed biological activity and/or relative metabolite abundance and followed by dereplication via accurate mass analysis. Decades of discovery using variants of these methods has generated the natural pharmacopeia but also contributes to recent high rediscovery rates. However, genomic sequencing reveals substantial untapped potential in previously mined organisms, and can provide useful prescience of potentially new secondary metabolites that ultimately enables isolation. Recently, advances in comparative metabolomics analyses have been coupled to secondary metabolic predictions to accelerate bioactivity and abundance-independent discovery work flows. In this review we will discuss the various analytical and computational techniques that enable MS-based metabolomic applications to natural product discovery and discuss the future prospects for comparative metabolomics in natural product discovery.
Argo_CUDA: Exhaustive GPU based approach for motif discovery in large DNA datasets.
Vishnevsky, Oleg V; Bocharnikov, Andrey V; Kolchanov, Nikolay A
2018-02-01
The development of chromatin immunoprecipitation sequencing (ChIP-seq) technology has revolutionized the genetic analysis of the basic mechanisms underlying transcription regulation and led to accumulation of information about a huge amount of DNA sequences. There are a lot of web services which are currently available for de novo motif discovery in datasets containing information about DNA/protein binding. An enormous motif diversity makes their finding challenging. In order to avoid the difficulties, researchers use different stochastic approaches. Unfortunately, the efficiency of the motif discovery programs dramatically declines with the query set size increase. This leads to the fact that only a fraction of top "peak" ChIP-Seq segments can be analyzed or the area of analysis should be narrowed. Thus, the motif discovery in massive datasets remains a challenging issue. Argo_Compute Unified Device Architecture (CUDA) web service is designed to process the massive DNA data. It is a program for the detection of degenerate oligonucleotide motifs of fixed length written in 15-letter IUPAC code. Argo_CUDA is a full-exhaustive approach based on the high-performance GPU technologies. Compared with the existing motif discovery web services, Argo_CUDA shows good prediction quality on simulated sets. The analysis of ChIP-Seq sequences revealed the motifs which correspond to known transcription factor binding sites.
Guiding principles for peptide nanotechnology through directed discovery.
Lampel, A; Ulijn, R V; Tuttle, T
2018-05-21
Life's diverse molecular functions are largely based on only a small number of highly conserved building blocks - the twenty canonical amino acids. These building blocks are chemically simple, but when they are organized in three-dimensional structures of tremendous complexity, new properties emerge. This review explores recent efforts in the directed discovery of functional nanoscale systems and materials based on these same amino acids, but that are not guided by copying or editing biological systems. The review summarises insights obtained using three complementary approaches of searching the sequence space to explore sequence-structure relationships for assembly, reactivity and complexation, namely: (i) strategic editing of short peptide sequences; (ii) computational approaches to predicting and comparing assembly behaviours; (iii) dynamic peptide libraries that explore the free energy landscape. These approaches give rise to guiding principles on controlling order/disorder, complexation and reactivity by peptide sequence design.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Duncan, Katherine R.; Crüsemann, Max; Lechner, Anna
Genome sequencing has revealed that bacteria contain many more biosynthetic gene clusters than predicted based on the number of secondary metabolites discovered to date. While this biosynthetic reservoir has fostered interest in new tools for natural product discovery, there remains a gap between gene cluster detection and compound discovery. In this paper, we apply molecular networking and the new concept of pattern-based genome mining to 35 Salinispora strains, including 30 for which draft genome sequences were either available or obtained for this study. The results provide a method to simultaneously compare large numbers of complex microbial extracts, which facilitated themore » identification of media components, known compounds and their derivatives, and new compounds that could be prioritized for structure elucidation. Finally, these efforts revealed considerable metabolite diversity and led to several molecular family-gene cluster pairings, of which the quinomycin-type depsipeptide retimycin A was characterized and linked to gene cluster NRPS40 using pattern-based bioinformatic approaches.« less
Duncan, Katherine R.; Crüsemann, Max; Lechner, Anna; ...
2015-04-09
Genome sequencing has revealed that bacteria contain many more biosynthetic gene clusters than predicted based on the number of secondary metabolites discovered to date. While this biosynthetic reservoir has fostered interest in new tools for natural product discovery, there remains a gap between gene cluster detection and compound discovery. In this paper, we apply molecular networking and the new concept of pattern-based genome mining to 35 Salinispora strains, including 30 for which draft genome sequences were either available or obtained for this study. The results provide a method to simultaneously compare large numbers of complex microbial extracts, which facilitated themore » identification of media components, known compounds and their derivatives, and new compounds that could be prioritized for structure elucidation. Finally, these efforts revealed considerable metabolite diversity and led to several molecular family-gene cluster pairings, of which the quinomycin-type depsipeptide retimycin A was characterized and linked to gene cluster NRPS40 using pattern-based bioinformatic approaches.« less
Duncan, Katherine R.; Crüsemann, Max; Lechner, Anna; Sarkar, Anindita; Li, Jie; Ziemert, Nadine; Wang, Mingxun; Bandeira, Nuno; Moore, Bradley S.; Dorrestein, Pieter C.; Jensen, Paul R.
2015-01-01
Summary Genome sequencing has revealed that bacteria contain many more biosynthetic gene clusters than predicted based on the number of secondary metabolites discovered to date. While this biosynthetic reservoir has fostered interest in new tools for natural product discovery, there remains a gap between gene cluster detection and compound discovery. Here we apply molecular networking and the new concept of pattern-based genome mining to 35 Salinispora strains including 30 for which draft genome sequences were either available or obtained for this study. The results provide a method to simultaneously compare large numbers of complex microbial extracts, which facilitated the identification of media components, known compounds and their derivatives, and new compounds that could be prioritized for structure elucidation. These efforts revealed considerable metabolite diversity and led to several molecular family-gene cluster pairings, of which the quinomycin-type depsipeptide retimycin A was characterized and linked to gene cluster NRPS40 using pattern-based bioinformatic approaches. PMID:25865308
BayesMotif: de novo protein sorting motif discovery from impure datasets.
Hu, Jianjun; Zhang, Fan
2010-01-18
Protein sorting is the process that newly synthesized proteins are transported to their target locations within or outside of the cell. This process is precisely regulated by protein sorting signals in different forms. A major category of sorting signals are amino acid sub-sequences usually located at the N-terminals or C-terminals of protein sequences. Genome-wide experimental identification of protein sorting signals is extremely time-consuming and costly. Effective computational algorithms for de novo discovery of protein sorting signals is needed to improve the understanding of protein sorting mechanisms. We formulated the protein sorting motif discovery problem as a classification problem and proposed a Bayesian classifier based algorithm (BayesMotif) for de novo identification of a common type of protein sorting motifs in which a highly conserved anchor is present along with a less conserved motif regions. A false positive removal procedure is developed to iteratively remove sequences that are unlikely to contain true motifs so that the algorithm can identify motifs from impure input sequences. Experiments on both implanted motif datasets and real-world datasets showed that the enhanced BayesMotif algorithm can identify anchored sorting motifs from pure or impure protein sequence dataset. It also shows that the false positive removal procedure can help to identify true motifs even when there is only 20% of the input sequences containing true motif instances. We proposed BayesMotif, a novel Bayesian classification based algorithm for de novo discovery of a special category of anchored protein sorting motifs from impure datasets. Compared to conventional motif discovery algorithms such as MEME, our algorithm can find less-conserved motifs with short highly conserved anchors. Our algorithm also has the advantage of easy incorporation of additional meta-sequence features such as hydrophobicity or charge of the motifs which may help to overcome the limitations of PWM (position weight matrix) motif model.
Shafiee, Mohammad Javad; Chung, Audrey G; Khalvati, Farzad; Haider, Masoom A; Wong, Alexander
2017-10-01
While lung cancer is the second most diagnosed form of cancer in men and women, a sufficiently early diagnosis can be pivotal in patient survival rates. Imaging-based, or radiomics-driven, detection methods have been developed to aid diagnosticians, but largely rely on hand-crafted features that may not fully encapsulate the differences between cancerous and healthy tissue. Recently, the concept of discovery radiomics was introduced, where custom abstract features are discovered from readily available imaging data. We propose an evolutionary deep radiomic sequencer discovery approach based on evolutionary deep intelligence. Motivated by patient privacy concerns and the idea of operational artificial intelligence, the evolutionary deep radiomic sequencer discovery approach organically evolves increasingly more efficient deep radiomic sequencers that produce significantly more compact yet similarly descriptive radiomic sequences over multiple generations. As a result, this framework improves operational efficiency and enables diagnosis to be run locally at the radiologist's computer while maintaining detection accuracy. We evaluated the evolved deep radiomic sequencer (EDRS) discovered via the proposed evolutionary deep radiomic sequencer discovery framework against state-of-the-art radiomics-driven and discovery radiomics methods using clinical lung CT data with pathologically proven diagnostic data from the LIDC-IDRI dataset. The EDRS shows improved sensitivity (93.42%), specificity (82.39%), and diagnostic accuracy (88.78%) relative to previous radiomics approaches.
ADEPT, a dynamic next generation sequencing data error-detection program with trimming
DOE Office of Scientific and Technical Information (OSTI.GOV)
Feng, Shihai; Lo, Chien-Chi; Li, Po-E
Illumina is the most widely used next generation sequencing technology and produces millions of short reads that contain errors. These sequencing errors constitute a major problem in applications such as de novo genome assembly, metagenomics analysis and single nucleotide polymorphism discovery. In this study, we present ADEPT, a dynamic error detection method, based on the quality scores of each nucleotide and its neighboring nucleotides, together with their positions within the read and compares this to the position-specific quality score distribution of all bases within the sequencing run. This method greatly improves upon other available methods in terms of the truemore » positive rate of error discovery without affecting the false positive rate, particularly within the middle of reads. We conclude that ADEPT is the only tool to date that dynamically assesses errors within reads by comparing position-specific and neighboring base quality scores with the distribution of quality scores for the dataset being analyzed. The result is a method that is less prone to position-dependent under-prediction, which is one of the most prominent issues in error prediction. The outcome is that ADEPT improves upon prior efforts in identifying true errors, primarily within the middle of reads, while reducing the false positive rate.« less
ADEPT, a dynamic next generation sequencing data error-detection program with trimming
Feng, Shihai; Lo, Chien-Chi; Li, Po-E; ...
2016-02-29
Illumina is the most widely used next generation sequencing technology and produces millions of short reads that contain errors. These sequencing errors constitute a major problem in applications such as de novo genome assembly, metagenomics analysis and single nucleotide polymorphism discovery. In this study, we present ADEPT, a dynamic error detection method, based on the quality scores of each nucleotide and its neighboring nucleotides, together with their positions within the read and compares this to the position-specific quality score distribution of all bases within the sequencing run. This method greatly improves upon other available methods in terms of the truemore » positive rate of error discovery without affecting the false positive rate, particularly within the middle of reads. We conclude that ADEPT is the only tool to date that dynamically assesses errors within reads by comparing position-specific and neighboring base quality scores with the distribution of quality scores for the dataset being analyzed. The result is a method that is less prone to position-dependent under-prediction, which is one of the most prominent issues in error prediction. The outcome is that ADEPT improves upon prior efforts in identifying true errors, primarily within the middle of reads, while reducing the false positive rate.« less
Emerging pathogens in the fish farming industry and sequencing-based pathogen discovery.
Tengs, Torstein; Rimstad, Espen
2017-10-01
The use of large scale DNA/RNA sequencing has become an integral part of biomedical research. Reduced sequencing costs and the availability of efficient computational resources has led to a revolution in how problems concerning genomics and transcriptomics are addressed. Sequencing-based pathogen discovery represents one example of how genetic data can now be used in ways that were previously considered infeasible. Emerging pathogens affect both human and animal health due to a multitude of factors, including globalization, a shifting environment and an increasing human population. Fish farming represents a relevant, interesting and challenging system to study emerging pathogens. This review summarizes recent progress in pathogen discovery using sequence data, with particular emphasis on viruses in Atlantic salmon (Salmo salar). Copyright © 2017 Elsevier Ltd. All rights reserved.
Discovery sequence and the nature of low permeability gas accumulations
Attanasi, E.D.
2005-01-01
There is an ongoing discussion regarding the geologic nature of accumulations that host gas in low-permeability sandstone environments. This note examines the discovery sequence of the accumulations in low permeability sandstone plays that were classified as continuous-type by the U.S. Geological Survey for the 1995 National Oil and Gas Assessment. It compares the statistical character of historical discovery sequences of accumulations associated with continuous-type sandstone gas plays to those of conventional plays. The seven sandstone plays with sufficient data exhibit declining size with sequence order, on average, and in three of the seven the trend is statistically significant. Simulation experiments show that both a skewed endowment size distribution and a discovery process that mimics sampling proportional to size are necessary to generate a discovery sequence that consistently produces a statistically significant negative size order relationship. The empirical findings suggest that discovery sequence could be used to constrain assessed gas in untested areas. The plays examined represent 134 of the 265 trillion cubic feet of recoverable gas assessed in undeveloped areas of continuous-type gas plays in low permeability sandstone environments reported in the 1995 National Assessment. ?? 2005 International Association for Mathematical Geology.
Liu, Jun-Jun; Xiang, Yu
2011-01-01
WRKY transcription factors are key regulators of numerous biological processes in plant growth and development, as well as plant responses to abiotic and biotic stresses. Research on biological functions of plant WRKY genes has focused in the past on model plant species or species with largely characterized transcriptomes. However, a variety of non-model plants, such as forest conifers, are essential as feed, biofuel, and wood or for sustainable ecosystems. Identification of WRKY genes in these non-model plants is equally important for understanding the evolutionary and function-adaptive processes of this transcription factor family. Because of limited genomic information, the rarity of regulatory gene mRNAs in transcriptomes, and the sequence divergence to model organism genes, identification of transcription factors in non-model plants using methods similar to those generally used for model plants is difficult. This chapter describes a gene family discovery strategy for identification of WRKY transcription factors in conifers by a combination of in silico-based prediction and PCR-based experimental approaches. Compared to traditional cDNA library screening or EST sequencing at transcriptome scales, this integrated gene discovery strategy provides fast, simple, reliable, and specific methods to unveil the WRKY gene family at both genome and transcriptome levels in non-model plants.
Next-generation sequencing in clinical virology: Discovery of new viruses.
Datta, Sibnarayan; Budhauliya, Raghvendra; Das, Bidisha; Chatterjee, Soumya; Vanlalhmuaka; Veer, Vijay
2015-08-12
Viruses are a cause of significant health problem worldwide, especially in the developing nations. Due to different anthropological activities, human populations are exposed to different viral pathogens, many of which emerge as outbreaks. In such situations, discovery of novel viruses is utmost important for deciding prevention and treatment strategies. Since last century, a number of different virus discovery methods, based on cell culture inoculation, sequence-independent PCR have been used for identification of a variety of viruses. However, the recent emergence and commercial availability of next-generation sequencers (NGS) has entirely changed the field of virus discovery. These massively parallel sequencing platforms can sequence a mixture of genetic materials from a very heterogeneous mix, with high sensitivity. Moreover, these platforms work in a sequence-independent manner, making them ideal tools for virus discovery. However, for their application in clinics, sample preparation or enrichment is necessary to detect low abundance virus populations. A number of techniques have also been developed for enrichment or viral nucleic acids. In this manuscript, we review the evolution of sequencing; NGS technologies available today as well as widely used virus enrichment technologies. We also discuss the challenges associated with their applications in the clinical virus discovery.
Neptune: a bioinformatics tool for rapid discovery of genomic variation in bacterial populations
Marinier, Eric; Zaheer, Rahat; Berry, Chrystal; Weedmark, Kelly A.; Domaratzki, Michael; Mabon, Philip; Knox, Natalie C.; Reimer, Aleisha R.; Graham, Morag R.; Chui, Linda; Patterson-Fortin, Laura; Zhang, Jian; Pagotto, Franco; Farber, Jeff; Mahony, Jim; Seyer, Karine; Bekal, Sadjia; Tremblay, Cécile; Isaac-Renton, Judy; Prystajecky, Natalie; Chen, Jessica; Slade, Peter
2017-01-01
Abstract The ready availability of vast amounts of genomic sequence data has created the need to rethink comparative genomics algorithms using ‘big data’ approaches. Neptune is an efficient system for rapidly locating differentially abundant genomic content in bacterial populations using an exact k-mer matching strategy, while accommodating k-mer mismatches. Neptune’s loci discovery process identifies sequences that are sufficiently common to a group of target sequences and sufficiently absent from non-targets using probabilistic models. Neptune uses parallel computing to efficiently identify and extract these loci from draft genome assemblies without requiring multiple sequence alignments or other computationally expensive comparative sequence analyses. Tests on simulated and real datasets showed that Neptune rapidly identifies regions that are both sensitive and specific. We demonstrate that this system can identify trait-specific loci from different bacterial lineages. Neptune is broadly applicable for comparative bacterial analyses, yet will particularly benefit pathogenomic applications, owing to efficient and sensitive discovery of differentially abundant genomic loci. The software is available for download at: http://github.com/phac-nml/neptune. PMID:29048594
Jones, Darryl R; Thomas, Dallas; Alger, Nicholas; Ghavidel, Ata; Inglis, G Douglas; Abbott, D Wade
2018-01-01
Deposition of new genetic sequences in online databases is expanding at an unprecedented rate. As a result, sequence identification continues to outpace functional characterization of carbohydrate active enzymes (CAZymes). In this paradigm, the discovery of enzymes with novel functions is often hindered by high volumes of uncharacterized sequences particularly when the enzyme sequence belongs to a family that exhibits diverse functional specificities (i.e., polyspecificity). Therefore, to direct sequence-based discovery and characterization of new enzyme activities we have developed an automated in silico pipeline entitled: Sequence Analysis and Clustering of CarboHydrate Active enzymes for Rapid Informed prediction of Specificity (SACCHARIS). This pipeline streamlines the selection of uncharacterized sequences for discovery of new CAZyme or CBM specificity from families currently maintained on the CAZy website or within user-defined datasets. SACCHARIS was used to generate a phylogenetic tree of a GH43, a CAZyme family with defined subfamily designations. This analysis confirmed that large datasets can be organized into sequence clusters of manageable sizes that possess related functions. Seeding this tree with a GH43 sequence from Bacteroides dorei DSM 17855 (BdGH43b, revealed it partitioned as a single sequence within the tree. This pattern was consistent with it possessing a unique enzyme activity for GH43 as BdGH43b is the first described α-glucanase described for this family. The capacity of SACCHARIS to extract and cluster characterized carbohydrate binding module sequences was demonstrated using family 6 CBMs (i.e., CBM6s). This CBM family displays a polyspecific ligand binding profile and contains many structurally determined members. Using SACCHARIS to identify a cluster of divergent sequences, a CBM6 sequence from a unique clade was demonstrated to bind yeast mannan, which represents the first description of an α-mannan binding CBM. Additionally, we have performed a CAZome analysis of an in-house sequenced bacterial genome and a comparative analysis of B. thetaiotaomicron VPI-5482 and B. thetaiotaomicron 7330, to demonstrate that SACCHARIS can generate "CAZome fingerprints", which differentiate between the saccharolytic potential of two related strains in silico. Establishing sequence-function and sequence-structure relationships in polyspecific CAZyme families are promising approaches for streamlining enzyme discovery. SACCHARIS facilitates this process by embedding CAZyme and CBM family trees generated from biochemically to structurally characterized sequences, with protein sequences that have unknown functions. In addition, these trees can be integrated with user-defined datasets (e.g., genomics, metagenomics, and transcriptomics) to inform experimental characterization of new CAZymes or CBMs not currently curated, and for researchers to compare differential sequence patterns between entire CAZomes. In this light, SACCHARIS provides an in silico tool that can be tailored for enzyme bioprospecting in datasets of increasing complexity and for diverse applications in glycobiotechnology.
TrawlerWeb: an online de novo motif discovery tool for next-generation sequencing datasets.
Dang, Louis T; Tondl, Markus; Chiu, Man Ho H; Revote, Jerico; Paten, Benedict; Tano, Vincent; Tokolyi, Alex; Besse, Florence; Quaife-Ryan, Greg; Cumming, Helen; Drvodelic, Mark J; Eichenlaub, Michael P; Hallab, Jeannette C; Stolper, Julian S; Rossello, Fernando J; Bogoyevitch, Marie A; Jans, David A; Nim, Hieu T; Porrello, Enzo R; Hudson, James E; Ramialison, Mirana
2018-04-05
A strong focus of the post-genomic era is mining of the non-coding regulatory genome in order to unravel the function of regulatory elements that coordinate gene expression (Nat 489:57-74, 2012; Nat 507:462-70, 2014; Nat 507:455-61, 2014; Nat 518:317-30, 2015). Whole-genome approaches based on next-generation sequencing (NGS) have provided insight into the genomic location of regulatory elements throughout different cell types, organs and organisms. These technologies are now widespread and commonly used in laboratories from various fields of research. This highlights the need for fast and user-friendly software tools dedicated to extracting cis-regulatory information contained in these regulatory regions; for instance transcription factor binding site (TFBS) composition. Ideally, such tools should not require prior programming knowledge to ensure they are accessible for all users. We present TrawlerWeb, a web-based version of the Trawler_standalone tool (Nat Methods 4:563-5, 2007; Nat Protoc 5:323-34, 2010), to allow for the identification of enriched motifs in DNA sequences obtained from next-generation sequencing experiments in order to predict their TFBS composition. TrawlerWeb is designed for online queries with standard options common to web-based motif discovery tools. In addition, TrawlerWeb provides three unique new features: 1) TrawlerWeb allows the input of BED files directly generated from NGS experiments, 2) it automatically generates an input-matched biologically relevant background, and 3) it displays resulting conservation scores for each instance of the motif found in the input sequences, which assists the researcher in prioritising the motifs to validate experimentally. Finally, to date, this web-based version of Trawler_standalone remains the fastest online de novo motif discovery tool compared to other popular web-based software, while generating predictions with high accuracy. TrawlerWeb provides users with a fast, simple and easy-to-use web interface for de novo motif discovery. This will assist in rapidly analysing NGS datasets that are now being routinely generated. TrawlerWeb is freely available and accessible at: http://trawler.erc.monash.edu.au .
Application of industrial scale genomics to discovery of therapeutic targets in heart failure.
Mehraban, F; Tomlinson, J E
2001-12-01
In recent years intense activity in both academic and industrial sectors has provided a wealth of information on the human genome with an associated impressive increase in the number of novel gene sequences deposited in sequence data repositories and patent applications. This genomic industrial revolution has transformed the way in which drug target discovery is now approached. In this article we discuss how various differential gene expression (DGE) technologies are being utilized for cardiovascular disease (CVD) drug target discovery. Other approaches such as sequencing cDNA from cardiovascular derived tissues and cells coupled with bioinformatic sequence analysis are used with the aim of identifying novel gene sequences that may be exploited towards target discovery. Additional leverage from gene sequence information is obtained through identification of polymorphisms that may confer disease susceptibility and/or affect drug responsiveness. Pharmacogenomic studies are described wherein gene expression-based techniques are used to evaluate drug response and/or efficacy. Industrial-scale genomics supports and addresses not only novel target gene discovery but also the burgeoning issues in pharmaceutical and clinical cardiovascular medicine relative to polymorphic gene responses.
Chen, Xin; Wu, Qiong; Sun, Ruimin; Zhang, Louxin
2012-01-01
The discovery of single-nucleotide polymorphisms (SNPs) has important implications in a variety of genetic studies on human diseases and biological functions. One valuable approach proposed for SNP discovery is based on base-specific cleavage and mass spectrometry. However, it is still very challenging to achieve the full potential of this SNP discovery approach. In this study, we formulate two new combinatorial optimization problems. While both problems are aimed at reconstructing the sample sequence that would attain the minimum number of SNPs, they search over different candidate sequence spaces. The first problem, denoted as SNP - MSP, limits its search to sequences whose in silico predicted mass spectra have all their signals contained in the measured mass spectra. In contrast, the second problem, denoted as SNP - MSQ, limits its search to sequences whose in silico predicted mass spectra instead contain all the signals of the measured mass spectra. We present an exact dynamic programming algorithm for solving the SNP - MSP problem and also show that the SNP - MSQ problem is NP-hard by a reduction from a restricted variation of the 3-partition problem. We believe that an efficient solution to either problem above could offer a seamless integration of information in four complementary base-specific cleavage reactions, thereby improving the capability of the underlying biotechnology for sensitive and accurate SNP discovery.
2011-01-01
Background Many plants have large and complex genomes with an abundance of repeated sequences. Many plants are also polyploid. Both of these attributes typify the genome architecture in the tribe Triticeae, whose members include economically important wheat, rye and barley. Large genome sizes, an abundance of repeated sequences, and polyploidy present challenges to genome-wide SNP discovery using next-generation sequencing (NGS) of total genomic DNA by making alignment and clustering of short reads generated by the NGS platforms difficult, particularly in the absence of a reference genome sequence. Results An annotation-based, genome-wide SNP discovery pipeline is reported using NGS data for large and complex genomes without a reference genome sequence. Roche 454 shotgun reads with low genome coverage of one genotype are annotated in order to distinguish single-copy sequences and repeat junctions from repetitive sequences and sequences shared by paralogous genes. Multiple genome equivalents of shotgun reads of another genotype generated with SOLiD or Solexa are then mapped to the annotated Roche 454 reads to identify putative SNPs. A pipeline program package, AGSNP, was developed and used for genome-wide SNP discovery in Aegilops tauschii-the diploid source of the wheat D genome, and with a genome size of 4.02 Gb, of which 90% is repetitive sequences. Genomic DNA of Ae. tauschii accession AL8/78 was sequenced with the Roche 454 NGS platform. Genomic DNA and cDNA of Ae. tauschii accession AS75 was sequenced primarily with SOLiD, although some Solexa and Roche 454 genomic sequences were also generated. A total of 195,631 putative SNPs were discovered in gene sequences, 155,580 putative SNPs were discovered in uncharacterized single-copy regions, and another 145,907 putative SNPs were discovered in repeat junctions. These SNPs were dispersed across the entire Ae. tauschii genome. To assess the false positive SNP discovery rate, DNA containing putative SNPs was amplified by PCR from AL8/78 and AS75 and resequenced with the ABI 3730 xl. In a sample of 302 randomly selected putative SNPs, 84.0% in gene regions, 88.0% in repeat junctions, and 81.3% in uncharacterized regions were validated. Conclusion An annotation-based genome-wide SNP discovery pipeline for NGS platforms was developed. The pipeline is suitable for SNP discovery in genomic libraries of complex genomes and does not require a reference genome sequence. The pipeline is applicable to all current NGS platforms, provided that at least one such platform generates relatively long reads. The pipeline package, AGSNP, and the discovered 497,118 Ae. tauschii SNPs can be accessed at (http://avena.pw.usda.gov/wheatD/agsnp.shtml). PMID:21266061
Overview Article: Identifying transcriptional cis-regulatory modules in animal genomes
Suryamohan, Kushal; Halfon, Marc S.
2014-01-01
Gene expression is regulated through the activity of transcription factors and chromatin modifying proteins acting on specific DNA sequences, referred to as cis-regulatory elements. These include promoters, located at the transcription initiation sites of genes, and a variety of distal cis-regulatory modules (CRMs), the most common of which are transcriptional enhancers. Because regulated gene expression is fundamental to cell differentiation and acquisition of new cell fates, identifying, characterizing, and understanding the mechanisms of action of CRMs is critical for understanding development. CRM discovery has historically been challenging, as CRMs can be located far from the genes they regulate, have few readily-identifiable sequence characteristics, and for many years were not amenable to high-throughput discovery methods. However, the recent availability of complete genome sequences and the development of next-generation sequencing methods has led to an explosion of both computational and empirical methods for CRM discovery in model and non-model organisms alike. Experimentally, CRMs can be identified through chromatin immunoprecipitation directed against transcription factors or histone post-translational modifications, identification of nucleosome-depleted “open” chromatin regions, or sequencing-based high-throughput functional screening. Computational methods include comparative genomics, clustering of known or predicted transcription factor binding sites, and supervised machine-learning approaches trained on known CRMs. All of these methods have proven effective for CRM discovery, but each has its own considerations and limitations, and each is subject to a greater or lesser number of false-positive identifications. Experimental confirmation of predictions is essential, although shortcomings in current methods suggest that additional means of validation need to be developed. PMID:25704908
Computational functional genomics-based approaches in analgesic drug discovery and repurposing.
Lippmann, Catharina; Kringel, Dario; Ultsch, Alfred; Lötsch, Jörn
2018-06-01
Persistent pain is a major healthcare problem affecting a fifth of adults worldwide with still limited treatment options. The search for new analgesics increasingly includes the novel research area of functional genomics, which combines data derived from various processes related to DNA sequence, gene expression or protein function and uses advanced methods of data mining and knowledge discovery with the goal of understanding the relationship between the genome and the phenotype. Its use in drug discovery and repurposing for analgesic indications has so far been performed using knowledge discovery in gene function and drug target-related databases; next-generation sequencing; and functional proteomics-based approaches. Here, we discuss recent efforts in functional genomics-based approaches to analgesic drug discovery and repurposing and highlight the potential of computational functional genomics in this field including a demonstration of the workflow using a novel R library 'dbtORA'.
Hwang, Kyu-Baek; Lee, In-Hee; Park, Jin-Ho; Hambuch, Tina; Choe, Yongjoon; Kim, MinHyeok; Lee, Kyungjoon; Song, Taemin; Neu, Matthew B; Gupta, Neha; Kohane, Isaac S; Green, Robert C; Kong, Sek Won
2014-08-01
As whole genome sequencing (WGS) uncovers variants associated with rare and common diseases, an immediate challenge is to minimize false-positive findings due to sequencing and variant calling errors. False positives can be reduced by combining results from orthogonal sequencing methods, but costly. Here, we present variant filtering approaches using logistic regression (LR) and ensemble genotyping to minimize false positives without sacrificing sensitivity. We evaluated the methods using paired WGS datasets of an extended family prepared using two sequencing platforms and a validated set of variants in NA12878. Using LR or ensemble genotyping based filtering, false-negative rates were significantly reduced by 1.1- to 17.8-fold at the same levels of false discovery rates (5.4% for heterozygous and 4.5% for homozygous single nucleotide variants (SNVs); 30.0% for heterozygous and 18.7% for homozygous insertions; 25.2% for heterozygous and 16.6% for homozygous deletions) compared to the filtering based on genotype quality scores. Moreover, ensemble genotyping excluded > 98% (105,080 of 107,167) of false positives while retaining > 95% (897 of 937) of true positives in de novo mutation (DNM) discovery in NA12878, and performed better than a consensus method using two sequencing platforms. Our proposed methods were effective in prioritizing phenotype-associated variants, and an ensemble genotyping would be essential to minimize false-positive DNM candidates. © 2014 WILEY PERIODICALS, INC.
Hwang, Kyu-Baek; Lee, In-Hee; Park, Jin-Ho; Hambuch, Tina; Choi, Yongjoon; Kim, MinHyeok; Lee, Kyungjoon; Song, Taemin; Neu, Matthew B.; Gupta, Neha; Kohane, Isaac S.; Green, Robert C.; Kong, Sek Won
2014-01-01
As whole genome sequencing (WGS) uncovers variants associated with rare and common diseases, an immediate challenge is to minimize false positive findings due to sequencing and variant calling errors. False positives can be reduced by combining results from orthogonal sequencing methods, but costly. Here we present variant filtering approaches using logistic regression (LR) and ensemble genotyping to minimize false positives without sacrificing sensitivity. We evaluated the methods using paired WGS datasets of an extended family prepared using two sequencing platforms and a validated set of variants in NA12878. Using LR or ensemble genotyping based filtering, false negative rates were significantly reduced by 1.1- to 17.8-fold at the same levels of false discovery rates (5.4% for heterozygous and 4.5% for homozygous SNVs; 30.0% for heterozygous and 18.7% for homozygous insertions; 25.2% for heterozygous and 16.6% for homozygous deletions) compared to the filtering based on genotype quality scores. Moreover, ensemble genotyping excluded > 98% (105,080 of 107,167) of false positives while retaining > 95% (897 of 937) of true positives in de novo mutation (DNM) discovery, and performed better than a consensus method using two sequencing platforms. Our proposed methods were effective in prioritizing phenotype-associated variants, and ensemble genotyping would be essential to minimize false positive DNM candidates. PMID:24829188
Classification and assessment tools for structural motif discovery algorithms.
Badr, Ghada; Al-Turaiki, Isra; Mathkour, Hassan
2013-01-01
Motif discovery is the problem of finding recurring patterns in biological data. Patterns can be sequential, mainly when discovered in DNA sequences. They can also be structural (e.g. when discovering RNA motifs). Finding common structural patterns helps to gain a better understanding of the mechanism of action (e.g. post-transcriptional regulation). Unlike DNA motifs, which are sequentially conserved, RNA motifs exhibit conservation in structure, which may be common even if the sequences are different. Over the past few years, hundreds of algorithms have been developed to solve the sequential motif discovery problem, while less work has been done for the structural case. In this paper, we survey, classify, and compare different algorithms that solve the structural motif discovery problem, where the underlying sequences may be different. We highlight their strengths and weaknesses. We start by proposing a benchmark dataset and a measurement tool that can be used to evaluate different motif discovery approaches. Then, we proceed by proposing our experimental setup. Finally, results are obtained using the proposed benchmark to compare available tools. To the best of our knowledge, this is the first attempt to compare tools solely designed for structural motif discovery. Results show that the accuracy of discovered motifs is relatively low. The results also suggest a complementary behavior among tools where some tools perform well on simple structures, while other tools are better for complex structures. We have classified and evaluated the performance of available structural motif discovery tools. In addition, we have proposed a benchmark dataset with tools that can be used to evaluate newly developed tools.
Limitations and potentials of current motif discovery algorithms
Hu, Jianjun; Li, Bin; Kihara, Daisuke
2005-01-01
Computational methods for de novo identification of gene regulation elements, such as transcription factor binding sites, have proved to be useful for deciphering genetic regulatory networks. However, despite the availability of a large number of algorithms, their strengths and weaknesses are not sufficiently understood. Here, we designed a comprehensive set of performance measures and benchmarked five modern sequence-based motif discovery algorithms using large datasets generated from Escherichia coli RegulonDB. Factors that affect the prediction accuracy, scalability and reliability are characterized. It is revealed that the nucleotide and the binding site level accuracy are very low, while the motif level accuracy is relatively high, which indicates that the algorithms can usually capture at least one correct motif in an input sequence. To exploit diverse predictions from multiple runs of one or more algorithms, a consensus ensemble algorithm has been developed, which achieved 6–45% improvement over the base algorithms by increasing both the sensitivity and specificity. Our study illustrates limitations and potentials of existing sequence-based motif discovery algorithms. Taking advantage of the revealed potentials, several promising directions for further improvements are discussed. Since the sequence-based algorithms are the baseline of most of the modern motif discovery algorithms, this paper suggests substantial improvements would be possible for them. PMID:16284194
Comparative Omics-Driven Genome Annotation Refinement: Application across Yersiniae
DOE Office of Scientific and Technical Information (OSTI.GOV)
Rutledge, Alexandra C.; Jones, Marcus B.; Chauhan, Sadhana
2012-03-27
Genome sequencing continues to be a rapidly evolving technology, yet most downstream aspects of genome annotation pipelines remain relatively stable or are even being abandoned. To date, the perceived value of manual curation for genome annotations is not offset by the real cost and time associated with the process. In order to balance the large number of sequences generated, the annotation process is now performed almost exclusively in an automated fashion for most genome sequencing projects. One possible way to reduce errors inherent to automated computational annotations is to apply data from 'omics' measurements (i.e. transcriptional and proteomic) to themore » un-annotated genome with a proteogenomic-based approach. This approach does require additional experimental and bioinformatics methods to include omics technologies; however, the approach is readily automatable and can benefit from rapid developments occurring in those research domains as well. The annotation process can be improved by experimental validation of transcription and translation and aid in the discovery of annotation errors. Here the concept of annotation refinement has been extended to include a comparative assessment of genomes across closely related species, as is becoming common in sequencing efforts. Transcriptomic and proteomic data derived from three highly similar pathogenic Yersiniae (Y. pestis CO92, Y. pestis pestoides F, and Y. pseudotuberculosis PB1/+) was used to demonstrate a comprehensive comparative omic-based annotation methodology. Peptide and oligo measurements experimentally validated the expression of nearly 40% of each strain's predicted proteome and revealed the identification of 28 novel and 68 previously incorrect protein-coding sequences (e.g., observed frameshifts, extended start sites, and translated pseudogenes) within the three current Yersinia genome annotations. Gene loss is presumed to play a major role in Y. pestis acquiring its niche as a virulent pathogen, thus the discovery of many translated pseudogenes underscores a need for functional analyses to investigate hypotheses related to divergence. Refinements included the discovery of a seemingly essential ribosomal protein, several virulence-associated factors, and a transcriptional regulator, among other proteins, most of which are annotated as hypothetical, that were missed during annotation.« less
[Current applications of high-throughput DNA sequencing technology in antibody drug research].
Yu, Xin; Liu, Qi-Gang; Wang, Ming-Rong
2012-03-01
Since the publication of a high-throughput DNA sequencing technology based on PCR reaction was carried out in oil emulsions in 2005, high-throughput DNA sequencing platforms have been evolved to a robust technology in sequencing genomes and diverse DNA libraries. Antibody libraries with vast numbers of members currently serve as a foundation of discovering novel antibody drugs, and high-throughput DNA sequencing technology makes it possible to rapidly identify functional antibody variants with desired properties. Herein we present a review of current applications of high-throughput DNA sequencing technology in the analysis of antibody library diversity, sequencing of CDR3 regions, identification of potent antibodies based on sequence frequency, discovery of functional genes, and combination with various display technologies, so as to provide an alternative approach of discovery and development of antibody drugs.
Speth, Daan R; Lagkouvardos, Ilias; Wang, Yong; Qian, Pei-Yuan; Dutilh, Bas E; Jetten, Mike S M
2017-07-01
Several recent studies have indicated that members of the phylum Planctomycetes are abundantly present at the brine-seawater interface (BSI) above multiple brine pools in the Red Sea. Planctomycetes include bacteria capable of anaerobic ammonium oxidation (anammox). Here, we investigated the possibility of anammox at BSI sites using metagenomic shotgun sequencing of DNA obtained from the BSI above the Discovery Deep brine pool. Analysis of sequencing reads matching the 16S rRNA and hzsA genes confirmed presence of anammox bacteria of the genus Scalindua. Phylogenetic analysis of the 16S rRNA gene indicated that this Scalindua sp. belongs to a distinct group, separate from the anammox bacteria in the seawater column, that contains mostly sequences retrieved from high-salt environments. Using coverage- and composition-based binning, we extracted and assembled the draft genome of the dominant anammox bacterium. Comparative genomic analysis indicated that this Scalindua species uses compatible solutes for osmoadaptation, in contrast to other marine anammox bacteria that likely use a salt-in strategy. We propose the name Candidatus Scalindua rubra for this novel species, alluding to its discovery in the Red Sea.
Crane, Paul K; Foroud, Tatiana; Montine, Thomas J; Larson, Eric B
2017-12-01
The Alzheimer's Disease Sequencing Project (ADSP) used different criteria for assigning case and control status from the discovery and replication phases of the project. We considered data from a community-based prospective cohort study with autopsy follow-up where participants could be categorized as case, control, or neither by both definitions and compared the two sets of criteria. We used data from the Adult Changes in Thought (ACT) study including Diagnostic and Statistical Manual-IV criteria for dementia status, McKhann et al. criteria for clinical Alzheimer's disease, and Braak and Consortium to Establish a Registry for AD findings on neurofibrillary tangles and neuritic plaques to categorize the 621 ACT participants of European ancestry who died and came to autopsy. We applied ADSP discovery and replication definitions to identify controls, cases, and people who were neither controls nor cases. There was some agreement between the discovery and replication definitions. Major areas of discrepancy included the finding that only 40% of the discovery sample controls had sufficiently low levels of neurofibrillary tangles and neuritic plaques to be considered controls by the replication criteria and the finding that 16% of the replication phase cases were diagnosed with non-AD dementia during life and thus were excluded as cases for the discovery phase. These findings should inform interpretation of genetic association findings from the ADSP. Differences in genetic association findings between the two phases of the study may reflect these different phenotype definitions from the discovery and replication phase of the ADSP. Copyright © 2017 the Alzheimer's Association. Published by Elsevier Inc. All rights reserved.
Ye, Kai; Kosters, Walter A; Ijzerman, Adriaan P
2007-03-15
Pattern discovery in protein sequences is often based on multiple sequence alignments (MSA). The procedure can be computationally intensive and often requires manual adjustment, which may be particularly difficult for a set of deviating sequences. In contrast, two algorithms, PRATT2 (http//www.ebi.ac.uk/pratt/) and TEIRESIAS (http://cbcsrv.watson.ibm.com/) are used to directly identify frequent patterns from unaligned biological sequences without an attempt to align them. Here we propose a new algorithm with more efficiency and more functionality than both PRATT2 and TEIRESIAS, and discuss some of its applications to G protein-coupled receptors, a protein family of important drug targets. In this study, we designed and implemented six algorithms to mine three different pattern types from either one or two datasets using a pattern growth approach. We compared our approach to PRATT2 and TEIRESIAS in efficiency, completeness and the diversity of pattern types. Compared to PRATT2, our approach is faster, capable of processing large datasets and able to identify the so-called type III patterns. Our approach is comparable to TEIRESIAS in the discovery of the so-called type I patterns but has additional functionality such as mining the so-called type II and type III patterns and finding discriminating patterns between two datasets. The source code for pattern growth algorithms and their pseudo-code are available at http://www.liacs.nl/home/kosters/pg/.
Ou, Hong-Yu; He, Xinyi; Harrison, Ewan M.; Kulasekara, Bridget R.; Thani, Ali Bin; Kadioglu, Aras; Lory, Stephen; Hinton, Jay C. D.; Barer, Michael R.; Rajakumar, Kumar
2007-01-01
MobilomeFINDER (http://mml.sjtu.edu.cn/MobilomeFINDER) is an interactive online tool that facilitates bacterial genomic island or ‘mobile genome’ (mobilome) discovery; it integrates the ArrayOme and tRNAcc software packages. ArrayOme utilizes a microarray-derived comparative genomic hybridization input data set to generate ‘inferred contigs’ produced by merging adjacent genes classified as ‘present’. Collectively these ‘fragments’ represent a hypothetical ‘microarray-visualized genome (MVG)’. ArrayOme permits recognition of discordances between physical genome and MVG sizes, thereby enabling identification of strains rich in microarray-elusive novel genes. Individual tRNAcc tools facilitate automated identification of genomic islands by comparative analysis of the contents and contexts of tRNA sites and other integration hotspots in closely related sequenced genomes. Accessory tools facilitate design of hotspot-flanking primers for in silico and/or wet-science-based interrogation of cognate loci in unsequenced strains and analysis of islands for features suggestive of foreign origins; island-specific and genome-contextual features are tabulated and represented in schematic and graphical forms. To date we have used MobilomeFINDER to analyse several Enterobacteriaceae, Pseudomonas aeruginosa and Streptococcus suis genomes. MobilomeFINDER enables high-throughput island identification and characterization through increased exploitation of emerging sequence data and PCR-based profiling of unsequenced test strains; subsequent targeted yeast recombination-based capture permits full-length sequencing and detailed functional studies of novel genomic islands. PMID:17537813
RAD tag sequencing as a source of SNP markers in Cynara cardunculus L
2012-01-01
Background The globe artichoke (Cynara cardunculus L. var. scolymus) genome is relatively poorly explored, especially compared to those of the other major Asteraceae crops sunflower and lettuce. No SNP markers are in the public domain. We have combined the recently developed restriction-site associated DNA (RAD) approach with the Illumina DNA sequencing platform to effect the rapid and mass discovery of SNP markers for C. cardunculus. Results RAD tags were sequenced from the genomic DNA of three C. cardunculus mapping population parents, generating 9.7 million reads, corresponding to ~1 Gbp of sequence. An assembly based on paired ends produced ~6.0 Mbp of genomic sequence, separated into ~19,000 contigs (mean length 312 bp), of which ~21% were fragments of putative coding sequence. The shared sequences allowed for the discovery of ~34,000 SNPs and nearly 800 indels, equivalent to a SNP frequency of 5.6 per 1,000 nt, and an indel frequency of 0.2 per 1,000 nt. A sample of heterozygous SNP loci was mapped by CAPS assays and this exercise provided validation of our mining criteria. The repetitive fraction of the genome had a high representation of retrotransposon sequence, followed by simple repeats, AT-low complexity regions and mobile DNA elements. The genomic k-mers distribution and CpG rate of C. cardunculus, compared with data derived from three whole genome-sequenced dicots species, provided a further evidence of the random representation of the C. cardunculus genome generated by RAD sampling. Conclusion The RAD tag sequencing approach is a cost-effective and rapid method to develop SNP markers in a highly heterozygous species. Our approach permitted to generate a large and robust SNP datasets by the adoption of optimized filtering criteria. PMID:22214349
Sequence-Based Genotyping for Marker Discovery and Co-Dominant Scoring in Germplasm and Populations
Truong, Hoa T.; Ramos, A. Marcos; Yalcin, Feyruz; de Ruiter, Marjo; van der Poel, Hein J. A.; Huvenaars, Koen H. J.; Hogers, René C. J.; van Enckevort, Leonora. J. G.; Janssen, Antoine; van Orsouw, Nathalie J.; van Eijk, Michiel J. T.
2012-01-01
Conventional marker-based genotyping platforms are widely available, but not without their limitations. In this context, we developed Sequence-Based Genotyping (SBG), a technology for simultaneous marker discovery and co-dominant scoring, using next-generation sequencing. SBG offers users several advantages including a generic sample preparation method, a highly robust genome complexity reduction strategy to facilitate de novo marker discovery across entire genomes, and a uniform bioinformatics workflow strategy to achieve genotyping goals tailored to individual species, regardless of the availability of a reference sequence. The most distinguishing features of this technology are the ability to genotype any population structure, regardless whether parental data is included, and the ability to co-dominantly score SNP markers segregating in populations. To demonstrate the capabilities of SBG, we performed marker discovery and genotyping in Arabidopsis thaliana and lettuce, two plant species of diverse genetic complexity and backgrounds. Initially we obtained 1,409 SNPs for arabidopsis, and 5,583 SNPs for lettuce. Further filtering of the SNP dataset produced over 1,000 high quality SNP markers for each species. We obtained a genotyping rate of 201.2 genotypes/SNP and 58.3 genotypes/SNP for arabidopsis (n = 222 samples) and lettuce (n = 87 samples), respectively. Linkage mapping using these SNPs resulted in stable map configurations. We have therefore shown that the SBG approach presented provides users with the utmost flexibility in garnering high quality markers that can be directly used for genotyping and downstream applications. Until advances and costs will allow for routine whole-genome sequencing of populations, we expect that sequence-based genotyping technologies such as SBG will be essential for genotyping of model and non-model genomes alike. PMID:22662172
Namouchi, Amine; Cimino, Mena; Favre-Rochex, Sandrine; Charles, Patricia; Gicquel, Brigitte
2017-07-13
Tuberculosis (TB) is caused by Mycobacterium tuberculosis and represents one of the major challenges facing drug discovery initiatives worldwide. The considerable rise in bacterial drug resistance in recent years has led to the need of new drugs and drug regimens. Model systems are regularly used to speed-up the drug discovery process and circumvent biosafety issues associated with manipulating M. tuberculosis. These include the use of strains such as Mycobacterium smegmatis and Mycobacterium marinum that can be handled in biosafety level 2 facilities, making high-throughput screening feasible. However, each of these model species have their own limitations. We report and describe the first complete genome sequence of Mycobacterium aurum ATCC23366, an environmental mycobacterium that can also grow in the gut of humans and animals as part of the microbiota. This species shows a comparable resistance profile to that of M. tuberculosis for several anti-TB drugs. The aims of this study were to (i) determine the drug resistance profile of a recently proposed model species, Mycobacterium aurum, strain ATCC23366, for anti-TB drug discovery as well as Mycobacterium smegmatis and Mycobacterium marinum (ii) sequence and annotate the complete genome sequence of this species obtained using Pacific Bioscience technology (iii) perform comparative genomics analyses of the various surrogate strains with M. tuberculosis (iv) discuss how the choice of the surrogate model used for drug screening can affect the drug discovery process. We describe the complete genome sequence of M. aurum, a surrogate model for anti-tuberculosis drug discovery. Most of the genes already reported to be associated with drug resistance are shared between all the surrogate strains and M. tuberculosis. We consider that M. aurum might be used in high-throughput screening for tuberculosis drug discovery. We also highly recommend the use of different model species during the drug discovery screening process.
Day-Williams, Aaron G.; McLay, Kirsten; Drury, Eleanor; Edkins, Sarah; Coffey, Alison J.; Palotie, Aarno; Zeggini, Eleftheria
2011-01-01
Pooled sequencing can be a cost-effective approach to disease variant discovery, but its applicability in association studies remains unclear. We compare sequence enrichment methods coupled to next-generation sequencing in non-indexed pools of 1, 2, 10, 20 and 50 individuals and assess their ability to discover variants and to estimate their allele frequencies. We find that pooled resequencing is most usefully applied as a variant discovery tool due to limitations in estimating allele frequency with high enough accuracy for association studies, and that in-solution hybrid-capture performs best among the enrichment methods examined regardless of pool size. PMID:22069447
Oduru, Sreedhar; Campbell, Janee L; Karri, SriTulasi; Hendry, William J; Khan, Shafiq A; Williams, Simon C
2003-01-01
Background Complete genome annotation will likely be achieved through a combination of computer-based analysis of available genome sequences combined with direct experimental characterization of expressed regions of individual genomes. We have utilized a comparative genomics approach involving the sequencing of randomly selected hamster testis cDNAs to begin to identify genes not previously annotated on the human, mouse, rat and Fugu (pufferfish) genomes. Results 735 distinct sequences were analyzed for their relatedness to known sequences in public databases. Eight of these sequences were derived from previously unidentified genes and expression of these genes in testis was confirmed by Northern blotting. The genomic locations of each sequence were mapped in human, mouse, rat and pufferfish, where applicable, and the structure of their cognate genes was derived using computer-based predictions, genomic comparisons and analysis of uncharacterized cDNA sequences from human and macaque. Conclusion The use of a comparative genomics approach resulted in the identification of eight cDNAs that correspond to previously uncharacterized genes in the human genome. The proteins encoded by these genes included a new member of the kinesin superfamily, a SET/MYND-domain protein, and six proteins for which no specific function could be predicted. Each gene was expressed primarily in testis, suggesting that they may play roles in the development and/or function of testicular cells. PMID:12783626
Vatanparast, Mohammad; Powell, Adrian; Doyle, Jeff J; Egan, Ashley N
2018-03-01
The development of pipelines for locus discovery has spurred the use of target enrichment for plant phylogenomics. However, few studies have compared pipelines from locus discovery and bait design, through validation, to tree inference. We compared three methods within Leguminosae (Fabaceae) and present a workflow for future efforts. Using 30 transcriptomes, we compared Hyb-Seq, MarkerMiner, and the Yang and Smith (Y&S) pipelines for locus discovery, validated 7501 baits targeting 507 loci across 25 genera via Illumina sequencing, and inferred gene and species trees via concatenation- and coalescent-based methods. Hyb-Seq discovered loci with the longest mean length. MarkerMiner discovered the most conserved loci with the least flagged as paralogous. Y&S offered the most parsimony-informative sites and putative orthologs. Target recovery averaged 93% across taxa. We optimized our targeted locus set based on a workflow designed to minimize paralog/ortholog conflation and thus present 423 loci for legume phylogenomics. Methods differed across criteria important for phylogenetic marker development. We recommend Hyb-Seq as a method that may be useful for most phylogenomic projects. Our targeted locus set is a resource for future, community-driven efforts to reconstruct the legume tree of life.
Roden, Suzanne E; Dutton, Peter H; Morin, Phillip A
2009-01-01
The green sea turtle, Chelonia mydas, was used as a case study for single nucleotide polymorphism (SNP) discovery in a species that has little genetic sequence information available. As green turtles have a complex population structure, additional nuclear markers other than microsatellites could add to our understanding of their complex life history. Amplified fragment length polymorphism technique was used to generate sets of random fragments of genomic DNA, which were then electrophoretically separated with precast gels, stained with SYBR green, excised, and directly sequenced. It was possible to perform this method without the use of polyacrylamide gels, radioactive or fluorescent labeled primers, or hybridization methods, reducing the time, expense, and safety hazards of SNP discovery. Within 13 loci, 2547 base pairs were screened, resulting in the discovery of 35 SNPs. Using this method, it was possible to yield a sufficient number of loci to screen for SNP markers without the availability of prior sequence information.
PhAST: pharmacophore alignment search tool.
Hähnke, Volker; Hofmann, Bettina; Grgat, Tomislav; Proschak, Ewgenij; Steinhilber, Dieter; Schneider, Gisbert
2009-04-15
We present a ligand-based virtual screening technique (PhAST) for rapid hit and lead structure searching in large compound databases. Molecules are represented as strings encoding the distribution of pharmacophoric features on the molecular graph. In contrast to other text-based methods using SMILES strings, we introduce a new form of text representation that describes the pharmacophore of molecules. This string representation opens the opportunity for revealing functional similarity between molecules by sequence alignment techniques in analogy to homology searching in protein or nucleic acid sequence databases. We favorably compared PhAST with other current ligand-based virtual screening methods in a retrospective analysis using the BEDROC metric. In a prospective application, PhAST identified two novel inhibitors of 5-lipoxygenase product formation with minimal experimental effort. This outcome demonstrates the applicability of PhAST to drug discovery projects and provides an innovative concept of sequence-based compound screening with substantial scaffold hopping potential. 2008 Wiley Periodicals, Inc.
Karas, Vlad O; Sinnott-Armstrong, Nicholas A; Varghese, Vici; Shafer, Robert W; Greenleaf, William J; Sherlock, Gavin
2018-01-01
Abstract Much of the within species genetic variation is in the form of single nucleotide polymorphisms (SNPs), typically detected by whole genome sequencing (WGS) or microarray-based technologies. However, WGS produces mostly uninformative reads that perfectly match the reference, while microarrays require genome-specific reagents. We have developed Diff-seq, a sequencing-based mismatch detection assay for SNP discovery without the requirement for specialized nucleic-acid reagents. Diff-seq leverages the Surveyor endonuclease to cleave mismatched DNA molecules that are generated after cross-annealing of a complex pool of DNA fragments. Sequencing libraries enriched for Surveyor-cleaved molecules result in increased coverage at the variant sites. Diff-seq detected all mismatches present in an initial test substrate, with specific enrichment dependent on the identity and context of the variation. Application to viral sequences resulted in increased observation of variant alleles in a biologically relevant context. Diff-Seq has the potential to increase the sensitivity and efficiency of high-throughput sequencing in the detection of variation. PMID:29361139
Haplotag: Software for Haplotype-Based Genotyping-by-Sequencing Analysis
Tinker, Nicholas A.; Bekele, Wubishet A.; Hattori, Jiro
2016-01-01
Genotyping-by-sequencing (GBS), and related methods, are based on high-throughput short-read sequencing of genomic complexity reductions followed by discovery of single nucleotide polymorphisms (SNPs) within sequence tags. This provides a powerful and economical approach to whole-genome genotyping, facilitating applications in genomics, diversity analysis, and molecular breeding. However, due to the complexity of analyzing large data sets, applications of GBS may require substantial time, expertise, and computational resources. Haplotag, the novel GBS software described here, is freely available, and operates with minimal user-investment on widely available computer platforms. Haplotag is unique in fulfilling the following set of criteria: (1) operates without a reference genome; (2) can be used in a polyploid species; (3) provides a discovery mode, and a production mode; (4) discovers polymorphisms based on a model of tag-level haplotypes within sequenced tags; (5) reports SNPs as well as haplotype-based genotypes; and (6) provides an intuitive visual “passport” for each inferred locus. Haplotag is optimized for use in a self-pollinating plant species. PMID:26818073
An optimized protocol for generation and analysis of Ion Proton sequencing reads for RNA-Seq.
Yuan, Yongxian; Xu, Huaiqian; Leung, Ross Ka-Kit
2016-05-26
Previous studies compared running cost, time and other performance measures of popular sequencing platforms. However, comprehensive assessment of library construction and analysis protocols for Proton sequencing platform remains unexplored. Unlike Illumina sequencing platforms, Proton reads are heterogeneous in length and quality. When sequencing data from different platforms are combined, this can result in reads with various read length. Whether the performance of the commonly used software for handling such kind of data is satisfactory is unknown. By using universal human reference RNA as the initial material, RNaseIII and chemical fragmentation methods in library construction showed similar result in gene and junction discovery number and expression level estimated accuracy. In contrast, sequencing quality, read length and the choice of software affected mapping rate to a much larger extent. Unspliced aligner TMAP attained the highest mapping rate (97.27 % to genome, 86.46 % to transcriptome), though 47.83 % of mapped reads were clipped. Long reads could paradoxically reduce mapping in junctions. With reference annotation guide, the mapping rate of TopHat2 significantly increased from 75.79 to 92.09 %, especially for long (>150 bp) reads. Sailfish, a k-mer based gene expression quantifier attained highly consistent results with that of TaqMan array and highest sensitivity. We provided for the first time, the reference statistics of library preparation methods, gene detection and quantification and junction discovery for RNA-Seq by the Ion Proton platform. Chemical fragmentation performed equally well with the enzyme-based one. The optimal Ion Proton sequencing options and analysis software have been evaluated.
From genomics to functional markers in the era of next-generation sequencing.
Salgotra, R K; Gupta, B B; Stewart, C N
2014-03-01
The availability of complete genome sequences, along with other genomic resources for Arabidopsis, rice, pigeon pea, soybean and other crops, has revolutionized our understanding of the genetic make-up of plants. Next-generation DNA sequencing (NGS) has facilitated single nucleotide polymorphism discovery in plants. Functionally-characterized sequences can be identified and functional markers (FMs) for important traits can be developed at an ever-increasing ease. FMs are derived from sequence polymorphisms found in allelic variants of a functional gene. Linkage disequilibrium-based association mapping and homologous recombinants have been developed for identification of "perfect" markers for their use in crop improvement practices. Compared with many other molecular markers, FMs derived from the functionally characterized sequence genes using NGS techniques and their use provide opportunities to develop high-yielding plant genotypes resistant to various stresses at a fast pace.
GWATCH: a web platform for automated gene association discovery analysis.
Svitin, Anton; Malov, Sergey; Cherkasov, Nikolay; Geerts, Paul; Rotkevich, Mikhail; Dobrynin, Pavel; Shevchenko, Andrey; Guan, Li; Troyer, Jennifer; Hendrickson, Sher; Dilks, Holli Hutcheson; Oleksyk, Taras K; Donfield, Sharyne; Gomperts, Edward; Jabs, Douglas A; Sezgin, Efe; Van Natta, Mark; Harrigan, P Richard; Brumme, Zabrina L; O'Brien, Stephen J
2014-01-01
As genome-wide sequence analyses for complex human disease determinants are expanding, it is increasingly necessary to develop strategies to promote discovery and validation of potential disease-gene associations. Here we present a dynamic web-based platform - GWATCH - that automates and facilitates four steps in genetic epidemiological discovery: 1) Rapid gene association search and discovery analysis of large genome-wide datasets; 2) Expanded visual display of gene associations for genome-wide variants (SNPs, indels, CNVs), including Manhattan plots, 2D and 3D snapshots of any gene region, and a dynamic genome browser illustrating gene association chromosomal regions; 3) Real-time validation/replication of candidate or putative genes suggested from other sources, limiting Bonferroni genome-wide association study (GWAS) penalties; 4) Open data release and sharing by eliminating privacy constraints (The National Human Genome Research Institute (NHGRI) Institutional Review Board (IRB), informed consent, The Health Insurance Portability and Accountability Act (HIPAA) of 1996 etc.) on unabridged results, which allows for open access comparative and meta-analysis. GWATCH is suitable for both GWAS and whole genome sequence association datasets. We illustrate the utility of GWATCH with three large genome-wide association studies for HIV-AIDS resistance genes screened in large multicenter cohorts; however, association datasets from any study can be uploaded and analyzed by GWATCH.
Naghdi, Mohammad Reza; Smail, Katia; Wang, Joy X; Wade, Fallou; Breaker, Ronald R; Perreault, Jonathan
2017-03-15
The discovery of noncoding RNAs (ncRNAs) and their importance for gene regulation led us to develop bioinformatics tools to pursue the discovery of novel ncRNAs. Finding ncRNAs de novo is challenging, first due to the difficulty of retrieving large numbers of sequences for given gene activities, and second due to exponential demands on calculation needed for comparative genomics on a large scale. Recently, several tools for the prediction of conserved RNA secondary structure were developed, but many of them are not designed to uncover new ncRNAs, or are too slow for conducting analyses on a large scale. Here we present various approaches using the database RiboGap as a primary tool for finding known ncRNAs and for uncovering simple sequence motifs with regulatory roles. This database also can be used to easily extract intergenic sequences of eubacteria and archaea to find conserved RNA structures upstream of given genes. We also show how to extend analysis further to choose the best candidate ncRNAs for experimental validation. Copyright © 2017 Elsevier Inc. All rights reserved.
Buschmann, Dominik; Haberberger, Anna; Kirchner, Benedikt; Spornraft, Melanie; Riedmaier, Irmgard; Schelling, Gustav; Pfaffl, Michael W.
2016-01-01
Small RNA-Seq has emerged as a powerful tool in transcriptomics, gene expression profiling and biomarker discovery. Sequencing cell-free nucleic acids, particularly microRNA (miRNA), from liquid biopsies additionally provides exciting possibilities for molecular diagnostics, and might help establish disease-specific biomarker signatures. The complexity of the small RNA-Seq workflow, however, bears challenges and biases that researchers need to be aware of in order to generate high-quality data. Rigorous standardization and extensive validation are required to guarantee reliability, reproducibility and comparability of research findings. Hypotheses based on flawed experimental conditions can be inconsistent and even misleading. Comparable to the well-established MIQE guidelines for qPCR experiments, this work aims at establishing guidelines for experimental design and pre-analytical sample processing, standardization of library preparation and sequencing reactions, as well as facilitating data analysis. We highlight bottlenecks in small RNA-Seq experiments, point out the importance of stringent quality control and validation, and provide a primer for differential expression analysis and biomarker discovery. Following our recommendations will encourage better sequencing practice, increase experimental transparency and lead to more reproducible small RNA-Seq results. This will ultimately enhance the validity of biomarker signatures, and allow reliable and robust clinical predictions. PMID:27317696
Thakar, Sambhaji B; Ghorpade, Pradnya N; Kale, Manisha V; Sonawane, Kailas D
2015-01-01
Fern plants are known for their ethnomedicinal applications. Huge amount of fern medicinal plants information is scattered in the form of text. Hence, database development would be an appropriate endeavor to cope with the situation. So by looking at the importance of medicinally useful fern plants, we developed a web based database which contains information about several group of ferns, their medicinal uses, chemical constituents as well as protein/enzyme sequences isolated from different fern plants. Fern ethnomedicinal plant database is an all-embracing, content management web-based database system, used to retrieve collection of factual knowledge related to the ethnomedicinal fern species. Most of the protein/enzyme sequences have been extracted from NCBI Protein sequence database. The fern species, family name, identification, taxonomy ID from NCBI, geographical occurrence, trial for, plant parts used, ethnomedicinal importance, morphological characteristics, collected from various scientific literatures and journals available in the text form. NCBI's BLAST, InterPro, phylogeny, Clustal W web source has also been provided for the future comparative studies. So users can get information related to fern plants and their medicinal applications at one place. This Fern ethnomedicinal plant database includes information of 100 fern medicinal species. This web based database would be an advantageous to derive information specifically for computational drug discovery, botanists or botanical interested persons, pharmacologists, researchers, biochemists, plant biotechnologists, ayurvedic practitioners, doctors/pharmacists, traditional medicinal users, farmers, agricultural students and teachers from universities as well as colleges and finally fern plant lovers. This effort would be useful to provide essential knowledge for the users about the adventitious applications for drug discovery, applications, conservation of fern species around the world and finally to create social awareness.
USDA-ARS?s Scientific Manuscript database
Background: Due to a relatively high level of codominant inheritance and transferability within and among taxonomic groups, simple sequence repeat (SSR) markers are important elements in comparative mapping and delineation of genomic regions associated with traits of economic importance. Expressed S...
USDA-ARS?s Scientific Manuscript database
Single-nucleotide Polymorphism (SNP) markers are by far the most common form of DNA polymorphism in a genome. The objectives of this study were to discover SNPs in common bean comparing sequences from coding and non-coding regions obtained from Genbank and genomic DNA and to compare sequencing resu...
Accurate read-based metagenome characterization using a hierarchical suite of unique signatures
Freitas, Tracey Allen K.; Li, Po-E; Scholz, Matthew B.; Chain, Patrick S. G.
2015-01-01
A major challenge in the field of shotgun metagenomics is the accurate identification of organisms present within a microbial community, based on classification of short sequence reads. Though existing microbial community profiling methods have attempted to rapidly classify the millions of reads output from modern sequencers, the combination of incomplete databases, similarity among otherwise divergent genomes, errors and biases in sequencing technologies, and the large volumes of sequencing data required for metagenome sequencing has led to unacceptably high false discovery rates (FDR). Here, we present the application of a novel, gene-independent and signature-based metagenomic taxonomic profiling method with significantly and consistently smaller FDR than any other available method. Our algorithm circumvents false positives using a series of non-redundant signature databases and examines Genomic Origins Through Taxonomic CHAllenge (GOTTCHA). GOTTCHA was tested and validated on 20 synthetic and mock datasets ranging in community composition and complexity, was applied successfully to data generated from spiked environmental and clinical samples, and robustly demonstrates superior performance compared with other available tools. PMID:25765641
The promises and pitfalls of RNA-interference-based therapeutics
Castanotto, Daniela; Rossi, John J.
2009-01-01
The discovery that gene expression can be controlled by the Watson–Crick base-pairing of small RNAs with messenger RNAs containing complementary sequence — a process known as RNA interference — has markedly advanced our understanding of eukaryotic gene regulation and function. The ability of short RNA sequences to modulate gene expression has provided a powerful tool with which to study gene function and is set to revolutionize the treatment of disease. Remarkably, despite being just one decade from its discovery, the phenomenon is already being used therapeutically in human clinical trials, and biotechnology companies that focus on RNA-interference-based therapeutics are already publicly traded. PMID:19158789
Loeffler 4.0: Diagnostic Metagenomics.
Höper, Dirk; Wylezich, Claudia; Beer, Martin
2017-01-01
A new world of possibilities for "virus discovery" was opened up with high-throughput sequencing becoming available in the last decade. While scientifically metagenomic analysis was established before the start of the era of high-throughput sequencing, the availability of the first second-generation sequencers was the kick-off for diagnosticians to use sequencing for the detection of novel pathogens. Today, diagnostic metagenomics is becoming the standard procedure for the detection and genetic characterization of new viruses or novel virus variants. Here, we provide an overview about technical considerations of high-throughput sequencing-based diagnostic metagenomics together with selected examples of "virus discovery" for animal diseases or zoonoses and metagenomics for food safety or basic veterinary research. © 2017 Elsevier Inc. All rights reserved.
VIP: an integrated pipeline for metagenomics of virus identification and discovery
Li, Yang; Wang, Hao; Nie, Kai; Zhang, Chen; Zhang, Yi; Wang, Ji; Niu, Peihua; Ma, Xuejun
2016-01-01
Identification and discovery of viruses using next-generation sequencing technology is a fast-developing area with potential wide application in clinical diagnostics, public health monitoring and novel virus discovery. However, tremendous sequence data from NGS study has posed great challenge both in accuracy and velocity for application of NGS study. Here we describe VIP (“Virus Identification Pipeline”), a one-touch computational pipeline for virus identification and discovery from metagenomic NGS data. VIP performs the following steps to achieve its goal: (i) map and filter out background-related reads, (ii) extensive classification of reads on the basis of nucleotide and remote amino acid homology, (iii) multiple k-mer based de novo assembly and phylogenetic analysis to provide evolutionary insight. We validated the feasibility and veracity of this pipeline with sequencing results of various types of clinical samples and public datasets. VIP has also contributed to timely virus diagnosis (~10 min) in acutely ill patients, demonstrating its potential in the performance of unbiased NGS-based clinical studies with demand of short turnaround time. VIP is released under GPLv3 and is available for free download at: https://github.com/keylabivdc/VIP. PMID:27026381
Advancements in Aptamer Discovery Technologies.
Gotrik, Michael R; Feagin, Trevor A; Csordas, Andrew T; Nakamoto, Margaret A; Soh, H Tom
2016-09-20
Affinity reagents that specifically bind to their target molecules are invaluable tools in nearly every field of modern biomedicine. Nucleic acid-based aptamers offer many advantages in this domain, because they are chemically synthesized, stable, and economical. Despite these compelling features, aptamers are currently not widely used in comparison to antibodies. This is primarily because conventional aptamer-discovery techniques such as SELEX are time-consuming and labor-intensive and often fail to produce aptamers with comparable binding performance to antibodies. This Account describes a body of work from our laboratory in developing advanced methods for consistently producing high-performance aptamers with higher efficiency, fewer resources, and, most importantly, a greater probability of success. We describe our efforts in systematically transforming each major step of the aptamer discovery process: selection, analysis, and characterization. To improve selection, we have developed microfluidic devices (M-SELEX) that enable discovery of high-affinity aptamers after a minimal number of selection rounds by precisely controlling the target concentration and washing stringency. In terms of improving aptamer pool analysis, our group was the first to use high-throughput sequencing (HTS) for the discovery of new aptamers. We showed that tracking the enrichment trajectory of individual aptamer sequences enables the identification of high-performing aptamers without requiring full convergence of the selected aptamer pool. HTS is now widely used for aptamer discovery, and open-source software has become available to facilitate analysis. To improve binding characterization, we used HTS data to design custom aptamer arrays to measure the affinity and specificity of up to ∼10(4) DNA aptamers in parallel as a means to rapidly discover high-quality aptamers. Most recently, our efforts have culminated in the invention of the "particle display" (PD) screening system, which transforms solution-phase aptamers into "aptamer particles" that can be individually screened at high-throughput via fluorescence-activated cell sorting. Using PD, we have shown the feasibility of rapidly generating aptamers with exceptional affinities, even for proteins that have previously proven intractable to aptamer discovery. We are confident that these advanced aptamer-discovery methods will accelerate the discovery of aptamer reagents with excellent affinities and specificities, perhaps even exceeding those of the best monoclonal antibodies. Since aptamers are reproducible, renewable, stable, and can be distributed as sequence information, we anticipate that these affinity reagents will become even more valuable tools for both research and clinical applications.
DOE Office of Scientific and Technical Information (OSTI.GOV)
With the flood of whole genome finished and draft microbial sequences, we need faster, more scalable bioinformatics tools for sequence comparison. An algorithm is described to find single nucleotide polymorphisms (SNPs) in whole genome data. It scales to hundreds of bacterial or viral genomes, and can be used for finished and/or draft genomes available as unassembled contigs or raw, unassembled reads. The method is fast to compute, finding SNPs and building a SNP phylogeny in minutes to hours, depending on the size and diversity of the input sequences. The SNP-based trees that result are consistent with known taxonomy and treesmore » determined in other studies. The approach we describe can handle many gigabases of sequence in a single run. The algorithm is based on k-mer analysis.« less
Toma, Tudor; Bosman, Robert-Jan; Siebes, Arno; Peek, Niels; Abu-Hanna, Ameen
2010-08-01
An important problem in the Intensive Care is how to predict on a given day of stay the eventual hospital mortality for a specific patient. A recent approach to solve this problem suggested the use of frequent temporal sequences (FTSs) as predictors. Methods following this approach were evaluated in the past by inducing a model from a training set and validating the prognostic performance on an independent test set. Although this evaluative approach addresses the validity of the specific models induced in an experiment, it falls short of evaluating the inductive method itself. To achieve this, one must account for the inherent sources of variation in the experimental design. The main aim of this work is to demonstrate a procedure based on bootstrapping, specifically the .632 bootstrap procedure, for evaluating inductive methods that discover patterns, such as FTSs. A second aim is to apply this approach to find out whether a recently suggested inductive method that discovers FTSs of organ functioning status is superior over a traditional method that does not use temporal sequences when compared on each successive day of stay at the Intensive Care Unit. The use of bootstrapping with logistic regression using pre-specified covariates is known in the statistical literature. Using inductive methods of prognostic models based on temporal sequence discovery within the bootstrap procedure is however novel at least in predictive models in the Intensive Care. Our results of applying the bootstrap-based evaluative procedure demonstrate the superiority of the FTS-based inductive method over the traditional method in terms of discrimination as well as accuracy. In addition we illustrate the insights gained by the analyst into the discovered FTSs from the bootstrap samples. Copyright 2010 Elsevier Inc. All rights reserved.
The third annual BRDS on research and development of nucleic acid-based nanomedicines
Chaudhary, Amit Kumar
2017-01-01
The completion of human genome project, decrease in the sequencing cost, and correlation of genome sequencing data with specific diseases led to the exponential rise in the nucleic acid-based therapeutic approaches. In the third annual Biopharmaceutical Research and Development Symposium (BRDS) held at the Center for Drug Discovery and Lozier Center for Pharmacy Sciences and Education at the University of Nebraska Medical Center (UNMC), we highlighted the remarkable features of the nucleic acid-based nanomedicines, their significance, NIH funding opportunities on nanomedicines and gene therapy research, challenges and opportunities in the clinical translation of nucleic acids into therapeutics, and the role of intellectual property (IP) in drug discovery and development. PMID:27848223
Discovering discovery patterns with Predication-based Semantic Indexing.
Cohen, Trevor; Widdows, Dominic; Schvaneveldt, Roger W; Davies, Peter; Rindflesch, Thomas C
2012-12-01
In this paper we utilize methods of hyperdimensional computing to mediate the identification of therapeutically useful connections for the purpose of literature-based discovery. Our approach, named Predication-based Semantic Indexing, is utilized to identify empirically sequences of relationships known as "discovery patterns", such as "drug x INHIBITS substance y, substance y CAUSES disease z" that link pharmaceutical substances to diseases they are known to treat. These sequences are derived from semantic predications extracted from the biomedical literature by the SemRep system, and subsequently utilized to direct the search for known treatments for a held out set of diseases. Rapid and efficient inference is accomplished through the application of geometric operators in PSI space, allowing for both the derivation of discovery patterns from a large set of known TREATS relationships, and the application of these discovered patterns to constrain search for therapeutic relationships at scale. Our results include the rediscovery of discovery patterns that have been constructed manually by other authors in previous research, as well as the discovery of a set of previously unrecognized patterns. The application of these patterns to direct search through PSI space results in better recovery of therapeutic relationships than is accomplished with models based on distributional statistics alone. These results demonstrate the utility of efficient approximate inference in geometric space as a means to identify therapeutic relationships, suggesting a role of these methods in drug repurposing efforts. In addition, the results provide strong support for the utility of the discovery pattern approach pioneered by Hristovski and his colleagues. Copyright © 2012 Elsevier Inc. All rights reserved.
Discovering discovery patterns with predication-based Semantic Indexing
Cohen, Trevor; Widdows, Dominic; Schvaneveldt, Roger W.; Davies, Peter; Rindflesch, Thomas C.
2012-01-01
In this paper we utilize methods of hyperdimensional computing to mediate the identification of therapeutically useful connections for the purpose of literature-based discovery. Our approach, named Predication-based Semantic Indexing, is utilized to identify empirically sequences of relationships known as “discovery patterns”, such as “drug x INHIBITS substance y, substance y CAUSES disease z” that link pharmaceutical substances to diseases they are known to treat. These sequences are derived from semantic predications extracted from the biomedical literature by the SemRep system, and subsequently utilized to direct the search for known treatments for a held out set of diseases. Rapid and efficient inference is accomplished through the application of geometric operators in PSI space, allowing for both the derivation of discovery patterns from a large set of known TREATS relationships, and the application of these discovered patterns to constrain search for therapeutic relationships at scale. Our results include the rediscovery of discovery patterns that have been constructed manually by other authors in previous research, as well as the discovery of a set of previously unrecognized patterns. The application of these patterns to direct search through PSI space results in better recovery of therapeutic relationships than is accomplished with models based on distributional statistics alone. These results demonstrate the utility of efficient approximate inference in geometric space as a means to identify therapeutic relationships, suggesting a role of these methods in drug repurposing efforts. In addition, the results provide strong support for the utility of the discovery pattern approach pioneered by Hristovski and his colleagues. PMID:22841748
CoryneBase: Corynebacterium Genomic Resources and Analysis Tools at Your Fingertips
Tan, Mui Fern; Jakubovics, Nick S.; Wee, Wei Yee; Mutha, Naresh V. R.; Wong, Guat Jah; Ang, Mia Yang; Yazdi, Amir Hessam; Choo, Siew Woh
2014-01-01
Corynebacteria are used for a wide variety of industrial purposes but some species are associated with human diseases. With increasing number of corynebacterial genomes having been sequenced, comparative analysis of these strains may provide better understanding of their biology, phylogeny, virulence and taxonomy that may lead to the discoveries of beneficial industrial strains or contribute to better management of diseases. To facilitate the ongoing research of corynebacteria, a specialized central repository and analysis platform for the corynebacterial research community is needed to host the fast-growing amount of genomic data and facilitate the analysis of these data. Here we present CoryneBase, a genomic database for Corynebacterium with diverse functionality for the analysis of genomes aimed to provide: (1) annotated genome sequences of Corynebacterium where 165,918 coding sequences and 4,180 RNAs can be found in 27 species; (2) access to comprehensive Corynebacterium data through the use of advanced web technologies for interactive web interfaces; and (3) advanced bioinformatic analysis tools consisting of standard BLAST for homology search, VFDB BLAST for sequence homology search against the Virulence Factor Database (VFDB), Pairwise Genome Comparison (PGC) tool for comparative genomic analysis, and a newly designed Pathogenomics Profiling Tool (PathoProT) for comparative pathogenomic analysis. CoryneBase offers the access of a range of Corynebacterium genomic resources as well as analysis tools for comparative genomics and pathogenomics. It is publicly available at http://corynebacterium.um.edu.my/. PMID:24466021
Counting of oligomers in sequences generated by markov chains for DNA motif discovery.
Shan, Gao; Zheng, Wei-Mou
2009-02-01
By means of the technique of the imbedded Markov chain, an efficient algorithm is proposed to exactly calculate first, second moments of word counts and the probability for a word to occur at least once in random texts generated by a Markov chain. A generating function is introduced directly from the imbedded Markov chain to derive asymptotic approximations for the problem. Two Z-scores, one based on the number of sequences with hits and the other on the total number of word hits in a set of sequences, are examined for discovery of motifs on a set of promoter sequences extracted from A. thaliana genome. Source code is available at http://www.itp.ac.cn/zheng/oligo.c.
Ou-Yang, Si-sheng; Lu, Jun-yan; Kong, Xiang-qian; Liang, Zhong-jie; Luo, Cheng; Jiang, Hualiang
2012-01-01
Computational drug discovery is an effective strategy for accelerating and economizing drug discovery and development process. Because of the dramatic increase in the availability of biological macromolecule and small molecule information, the applicability of computational drug discovery has been extended and broadly applied to nearly every stage in the drug discovery and development workflow, including target identification and validation, lead discovery and optimization and preclinical tests. Over the past decades, computational drug discovery methods such as molecular docking, pharmacophore modeling and mapping, de novo design, molecular similarity calculation and sequence-based virtual screening have been greatly improved. In this review, we present an overview of these important computational methods, platforms and successful applications in this field. PMID:22922346
An Integrated Microfluidic Processor for DNA-Encoded Combinatorial Library Functional Screening
2017-01-01
DNA-encoded synthesis is rekindling interest in combinatorial compound libraries for drug discovery and in technology for automated and quantitative library screening. Here, we disclose a microfluidic circuit that enables functional screens of DNA-encoded compound beads. The device carries out library bead distribution into picoliter-scale assay reagent droplets, photochemical cleavage of compound from the bead, assay incubation, laser-induced fluorescence-based assay detection, and fluorescence-activated droplet sorting to isolate hits. DNA-encoded compound beads (10-μm diameter) displaying a photocleavable positive control inhibitor pepstatin A were mixed (1920 beads, 729 encoding sequences) with negative control beads (58 000 beads, 1728 encoding sequences) and screened for cathepsin D inhibition using a biochemical enzyme activity assay. The circuit sorted 1518 hit droplets for collection following 18 min incubation over a 240 min analysis. Visual inspection of a subset of droplets (1188 droplets) yielded a 24% false discovery rate (1166 pepstatin A beads; 366 negative control beads). Using template barcoding strategies, it was possible to count hit collection beads (1863) using next-generation sequencing data. Bead-specific barcodes enabled replicate counting, and the false discovery rate was reduced to 2.6% by only considering hit-encoding sequences that were observed on >2 beads. This work represents a complete distributable small molecule discovery platform, from microfluidic miniaturized automation to ultrahigh-throughput hit deconvolution by sequencing. PMID:28199790
An Integrated Microfluidic Processor for DNA-Encoded Combinatorial Library Functional Screening.
MacConnell, Andrew B; Price, Alexander K; Paegel, Brian M
2017-03-13
DNA-encoded synthesis is rekindling interest in combinatorial compound libraries for drug discovery and in technology for automated and quantitative library screening. Here, we disclose a microfluidic circuit that enables functional screens of DNA-encoded compound beads. The device carries out library bead distribution into picoliter-scale assay reagent droplets, photochemical cleavage of compound from the bead, assay incubation, laser-induced fluorescence-based assay detection, and fluorescence-activated droplet sorting to isolate hits. DNA-encoded compound beads (10-μm diameter) displaying a photocleavable positive control inhibitor pepstatin A were mixed (1920 beads, 729 encoding sequences) with negative control beads (58 000 beads, 1728 encoding sequences) and screened for cathepsin D inhibition using a biochemical enzyme activity assay. The circuit sorted 1518 hit droplets for collection following 18 min incubation over a 240 min analysis. Visual inspection of a subset of droplets (1188 droplets) yielded a 24% false discovery rate (1166 pepstatin A beads; 366 negative control beads). Using template barcoding strategies, it was possible to count hit collection beads (1863) using next-generation sequencing data. Bead-specific barcodes enabled replicate counting, and the false discovery rate was reduced to 2.6% by only considering hit-encoding sequences that were observed on >2 beads. This work represents a complete distributable small molecule discovery platform, from microfluidic miniaturized automation to ultrahigh-throughput hit deconvolution by sequencing.
Directional genomic hybridization for chromosomal inversion discovery and detection.
Ray, F Andrew; Zimmerman, Erin; Robinson, Bruce; Cornforth, Michael N; Bedford, Joel S; Goodwin, Edwin H; Bailey, Susan M
2013-04-01
Chromosomal rearrangements are a source of structural variation within the genome that figure prominently in human disease, where the importance of translocations and deletions is well recognized. In principle, inversions-reversals in the orientation of DNA sequences within a chromosome-should have similar detrimental potential. However, the study of inversions has been hampered by traditional approaches used for their detection, which are not particularly robust. Even with significant advances in whole genome approaches, changes in the absolute orientation of DNA remain difficult to detect routinely. Consequently, our understanding of inversions is still surprisingly limited, as is our appreciation for their frequency and involvement in human disease. Here, we introduce the directional genomic hybridization methodology of chromatid painting-a whole new way of looking at structural features of the genome-that can be employed with high resolution on a cell-by-cell basis, and demonstrate its basic capabilities for genome-wide discovery and targeted detection of inversions. Bioinformatics enabled development of sequence- and strand-specific directional probe sets, which when coupled with single-stranded hybridization, greatly improved the resolution and ease of inversion detection. We highlight examples of the far-ranging applicability of this cytogenomics-based approach, which include confirmation of the alignment of the human genome database and evidence that individuals themselves share similar sequence directionality, as well as use in comparative and evolutionary studies for any species whose genome has been sequenced. In addition to applications related to basic mechanistic studies, the information obtainable with strand-specific hybridization strategies may ultimately enable novel gene discovery, thereby benefitting the diagnosis and treatment of a variety of human disease states and disorders including cancer, autism, and idiopathic infertility.
Update on Genomic Databases and Resources at the National Center for Biotechnology Information.
Tatusova, Tatiana
2016-01-01
The National Center for Biotechnology Information (NCBI), as a primary public repository of genomic sequence data, collects and maintains enormous amounts of heterogeneous data. Data for genomes, genes, gene expressions, gene variation, gene families, proteins, and protein domains are integrated with the analytical, search, and retrieval resources through the NCBI website, text-based search and retrieval system, provides a fast and easy way to navigate across diverse biological databases.Comparative genome analysis tools lead to further understanding of evolution processes quickening the pace of discovery. Recent technological innovations have ignited an explosion in genome sequencing that has fundamentally changed our understanding of the biology of living organisms. This huge increase in DNA sequence data presents new challenges for the information management system and the visualization tools. New strategies have been designed to bring an order to this genome sequence shockwave and improve the usability of associated data.
Molecular biology and immunology of head and neck cancer.
Guo, Theresa; Califano, Joseph A
2015-07-01
In recent years, our knowledge and understanding of head and neck squamous cell carcinoma (HNSCC) has expanded dramatically. New high-throughput sequencing technologies have accelerated these discoveries since the first reports of whole-exome sequencing of HNSCC tumors in 2011. In addition, the discovery of human papillomavirus in relationship with oropharyngeal squamous cell carcinoma has shifted our molecular understanding of the disease. New investigation into the role of immune evasion in HNSCC has also led to potential novel therapies based on immune-specific systemic therapies. Copyright © 2015 Elsevier Inc. All rights reserved.
2010-01-01
Background Suppression subtractive hybridization is a popular technique for gene discovery from non-model organisms without an annotated genome sequence, such as cowpea (Vigna unguiculata (L.) Walp). We aimed to use this method to enrich for genes expressed during drought stress in a drought tolerant cowpea line. However, current methods were inefficient in screening libraries and management of the sequence data, and thus there was a need to develop software tools to facilitate the process. Results Forward and reverse cDNA libraries enriched for cowpea drought response genes were screened on microarrays, and the R software package SSHscreen 2.0.1 was developed (i) to normalize the data effectively using spike-in control spot normalization, and (ii) to select clones for sequencing based on the calculation of enrichment ratios with associated statistics. Enrichment ratio 3 values for each clone showed that 62% of the forward library and 34% of the reverse library clones were significantly differentially expressed by drought stress (adjusted p value < 0.05). Enrichment ratio 2 calculations showed that > 88% of the clones in both libraries were derived from rare transcripts in the original tester samples, thus supporting the notion that suppression subtractive hybridization enriches for rare transcripts. A set of 118 clones were chosen for sequencing, and drought-induced cowpea genes were identified, the most interesting encoding a late embryogenesis abundant Lea5 protein, a glutathione S-transferase, a thaumatin, a universal stress protein, and a wound induced protein. A lipid transfer protein and several components of photosynthesis were down-regulated by the drought stress. Reverse transcriptase quantitative PCR confirmed the enrichment ratio values for the selected cowpea genes. SSHdb, a web-accessible database, was developed to manage the clone sequences and combine the SSHscreen data with sequence annotations derived from BLAST and Blast2GO. The self-BLAST function within SSHdb grouped redundant clones together and illustrated that the SSHscreen plots are a useful tool for choosing anonymous clones for sequencing, since redundant clones cluster together on the enrichment ratio plots. Conclusions We developed the SSHscreen-SSHdb software pipeline, which greatly facilitates gene discovery using suppression subtractive hybridization by improving the selection of clones for sequencing after screening the library on a small number of microarrays. Annotation of the sequence information and collaboration was further enhanced through a web-based SSHdb database, and we illustrated this through identification of drought responsive genes from cowpea, which can now be investigated in gene function studies. SSH is a popular and powerful gene discovery tool, and therefore this pipeline will have application for gene discovery in any biological system, particularly non-model organisms. SSHscreen 2.0.1 and a link to SSHdb are available from http://microarray.up.ac.za/SSHscreen. PMID:20359330
Efficient exact motif discovery.
Marschall, Tobias; Rahmann, Sven
2009-06-15
The motif discovery problem consists of finding over-represented patterns in a collection of biosequences. It is one of the classical sequence analysis problems, but still has not been satisfactorily solved in an exact and efficient manner. This is partly due to the large number of possibilities of defining the motif search space and the notion of over-representation. Even for well-defined formalizations, the problem is frequently solved in an ad hoc manner with heuristics that do not guarantee to find the best motif. We show how to solve the motif discovery problem (almost) exactly on a practically relevant space of IUPAC generalized string patterns, using the p-value with respect to an i.i.d. model or a Markov model as the measure of over-representation. In particular, (i) we use a highly accurate compound Poisson approximation for the null distribution of the number of motif occurrences. We show how to compute the exact clump size distribution using a recently introduced device called probabilistic arithmetic automaton (PAA). (ii) We define two p-value scores for over-representation, the first one based on the total number of motif occurrences, the second one based on the number of sequences in a collection with at least one occurrence. (iii) We describe an algorithm to discover the optimal pattern with respect to either of the scores. The method exploits monotonicity properties of the compound Poisson approximation and is by orders of magnitude faster than exhaustive enumeration of IUPAC strings (11.8 h compared with an extrapolated runtime of 4.8 years). (iv) We justify the use of the proposed scores for motif discovery by showing our method to outperform other motif discovery algorithms (e.g. MEME, Weeder) on benchmark datasets. We also propose new motifs on Mycobacterium tuberculosis. The method has been implemented in Java. It can be obtained from http://ls11-www.cs.tu-dortmund.de/people/marschal/paa_md/.
RSAT: regulatory sequence analysis tools.
Thomas-Chollier, Morgane; Sand, Olivier; Turatsinze, Jean-Valéry; Janky, Rekin's; Defrance, Matthieu; Vervisch, Eric; Brohée, Sylvain; van Helden, Jacques
2008-07-01
The regulatory sequence analysis tools (RSAT, http://rsat.ulb.ac.be/rsat/) is a software suite that integrates a wide collection of modular tools for the detection of cis-regulatory elements in genome sequences. The suite includes programs for sequence retrieval, pattern discovery, phylogenetic footprint detection, pattern matching, genome scanning and feature map drawing. Random controls can be performed with random gene selections or by generating random sequences according to a variety of background models (Bernoulli, Markov). Beyond the original word-based pattern-discovery tools (oligo-analysis and dyad-analysis), we recently added a battery of tools for matrix-based detection of cis-acting elements, with some original features (adaptive background models, Markov-chain estimation of P-values) that do not exist in other matrix-based scanning tools. The web server offers an intuitive interface, where each program can be accessed either separately or connected to the other tools. In addition, the tools are now available as web services, enabling their integration in programmatic workflows. Genomes are regularly updated from various genome repositories (NCBI and EnsEMBL) and 682 organisms are currently supported. Since 1998, the tools have been used by several hundreds of researchers from all over the world. Several predictions made with RSAT were validated experimentally and published.
Indel variant analysis of short-read sequencing data with Scalpel
Fang, Han; Bergmann, Ewa A; Arora, Kanika; Vacic, Vladimir; Zody, Michael C; Iossifov, Ivan; O’Rawe, Jason A; Wu, Yiyang; Barron, Laura T Jimenez; Rosenbaum, Julie; Ronemus, Michael; Lee, Yoon-ha; Wang, Zihua; Dikoglu, Esra; Jobanputra, Vaidehi; Lyon, Gholson J; Wigler, Michael; Schatz, Michael C; Narzisi, Giuseppe
2017-01-01
As the second most common type of variation in the human genome, insertions and deletions (indels) have been linked to many diseases, but the discovery of indels of more than a few bases in size from short-read sequencing data remains challenging. Scalpel (http://scalpel.sourceforge.net) is an open-source software for reliable indel detection based on the microassembly technique. It has been successfully used to discover mutations in novel candidate genes for autism, and it is extensively used in other large-scale studies of human diseases. This protocol gives an overview of the algorithm and describes how to use Scalpel to perform highly accurate indel calling from whole-genome and whole-exome sequencing data. We provide detailed instructions for an exemplary family-based de novo study, but we also characterize the other two supported modes of operation: single-sample and somatic analysis. Indel normalization, visualization and annotation of the mutations are also illustrated. Using a standard server, indel discovery and characterization in the exonic regions of the example sequencing data can be completed in ~5 h after read mapping. PMID:27854363
Jazaeri Farsani, Seyed Mohammad; Deijs, Martin; Dijkman, Ronald; Molenkamp, Richard; Jeeninga, Rienk E; Ieven, Margareta; Goossens, Herman; van der Hoek, Lia
2015-01-01
Background Currently, virus discovery is mainly based on molecular techniques. Here, we propose a method that relies on virus culturing combined with state-of-the-art sequencing techniques. The most natural ex vivo culture system was used to enable replication of respiratory viruses. Method Three respiratory clinical samples were tested on well-differentiated pseudostratified tracheobronchial human airway epithelial (HAE) cultures grown at an air–liquid interface, which resemble the airway epithelium. Cells were stained with convalescent serum of the patients to identify infected cells and apical washes were analyzed by VIDISCA-454, a next-generation sequencing virus discovery technique. Results Infected cells were observed for all three samples. Sequencing subsequently indicated that the cells were infected by either human coronavirus OC43, influenzavirus B, or influenzavirus A. The sequence reads covered a large part of the genome (52%, 82%, and 57%, respectively). Conclusion We present here a new method for virus discovery that requires a virus culture on primary cells and an antibody detection. The virus in the harvest can be used to characterize the viral genome sequence and cell tropism, but also provides progeny virus to initiate experiments to fulfill the Koch's postulates. PMID:25482367
SSRPrimer and SSR Taxonomy Tree: Biome SSR discovery
Jewell, Erica; Robinson, Andrew; Savage, David; Erwin, Tim; Love, Christopher G.; Lim, Geraldine A. C.; Li, Xi; Batley, Jacqueline; Spangenberg, German C.; Edwards, David
2006-01-01
Simple sequence repeat (SSR) molecular genetic markers have become important tools for a broad range of applications such as genome mapping and genetic diversity studies. SSRs are readily identified within DNA sequence data and PCR primers can be designed for their amplification. These PCR primers frequently cross amplify within related species. We report a web-based tool, SSR Primer, that integrates SPUTNIK, an SSR repeat finder, with Primer3, a primer design program, within one pipeline. On submission of multiple FASTA formatted sequences, the script screens each sequence for SSRs using SPUTNIK. Results are then parsed to Primer3 for locus specific primer design. We have applied this tool for the discovery of SSRs within the complete GenBank database, and have designed PCR amplification primers for over 13 million SSRs. The SSR Taxonomy Tree server provides web-based searching and browsing of species and taxa for the visualisation and download of these SSR amplification primers. These tools are available at . PMID:16845092
SSRPrimer and SSR Taxonomy Tree: Biome SSR discovery.
Jewell, Erica; Robinson, Andrew; Savage, David; Erwin, Tim; Love, Christopher G; Lim, Geraldine A C; Li, Xi; Batley, Jacqueline; Spangenberg, German C; Edwards, David
2006-07-01
Simple sequence repeat (SSR) molecular genetic markers have become important tools for a broad range of applications such as genome mapping and genetic diversity studies. SSRs are readily identified within DNA sequence data and PCR primers can be designed for their amplification. These PCR primers frequently cross amplify within related species. We report a web-based tool, SSR Primer, that integrates SPUTNIK, an SSR repeat finder, with Primer3, a primer design program, within one pipeline. On submission of multiple FASTA formatted sequences, the script screens each sequence for SSRs using SPUTNIK. Results are then parsed to Primer3 for locus specific primer design. We have applied this tool for the discovery of SSRs within the complete GenBank database, and have designed PCR amplification primers for over 13 million SSRs. The SSR Taxonomy Tree server provides web-based searching and browsing of species and taxa for the visualisation and download of these SSR amplification primers. These tools are available at http://bioinformatics.pbcbasc.latrobe.edu.au/ssrdiscovery.html.
Comprehensive discovery of noncoding RNAs in acute myeloid leukemia cell transcriptomes.
Zhang, Jin; Griffith, Malachi; Miller, Christopher A; Griffith, Obi L; Spencer, David H; Walker, Jason R; Magrini, Vincent; McGrath, Sean D; Ly, Amy; Helton, Nichole M; Trissal, Maria; Link, Daniel C; Dang, Ha X; Larson, David E; Kulkarni, Shashikant; Cordes, Matthew G; Fronick, Catrina C; Fulton, Robert S; Klco, Jeffery M; Mardis, Elaine R; Ley, Timothy J; Wilson, Richard K; Maher, Christopher A
2017-11-01
To detect diverse and novel RNA species comprehensively, we compared deep small RNA and RNA sequencing (RNA-seq) methods applied to a primary acute myeloid leukemia (AML) sample. We were able to discover previously unannotated small RNAs using deep sequencing of a library method using broader insert size selection. We analyzed the long noncoding RNA (lncRNA) landscape in AML by comparing deep sequencing from multiple RNA-seq library construction methods for the sample that we studied and then integrating RNA-seq data from 179 AML cases. This identified lncRNAs that are completely novel, differentially expressed, and associated with specific AML subtypes. Our study revealed the complexity of the noncoding RNA transcriptome through a combined strategy of strand-specific small RNA and total RNA-seq. This dataset will serve as an invaluable resource for future RNA-based analyses. Copyright © 2017 ISEH – Society for Hematology and Stem Cells. Published by Elsevier Inc. All rights reserved.
Modelling and enhanced molecular dynamics to steer structure-based drug discovery.
Kalyaanamoorthy, Subha; Chen, Yi-Ping Phoebe
2014-05-01
The ever-increasing gap between the availabilities of the genome sequences and the crystal structures of proteins remains one of the significant challenges to the modern drug discovery efforts. The knowledge of structure-dynamics-functionalities of proteins is important in order to understand several key aspects of structure-based drug discovery, such as drug-protein interactions, drug binding and unbinding mechanisms and protein-protein interactions. This review presents a brief overview on the different state of the art computational approaches that are applied for protein structure modelling and molecular dynamics simulations of biological systems. We give an essence of how different enhanced sampling molecular dynamics approaches, together with regular molecular dynamics methods, assist in steering the structure based drug discovery processes. Copyright © 2013 Elsevier Ltd. All rights reserved.
Wang, Yanjie; Dong, Chunlan; Xue, Zeyun; Jin, Qijiang; Xu, Yingchun
2016-01-15
Paeonia ostii, an important ornamental and medicinal plant, grows normally on copper (Cu) mines with widespread Cu contamination of soils, and it has the ability to lower Cu contents in the Cu-contaminated soils. However, very little molecular information concerned with Cu resistance of P. ostii is available. In this study, high-throughput de novo transcriptome sequencing was carried out for P. ostii with and without Cu treatment using Illumina HiSeq 2000 platform. A total of 77,704 All-unigenes were obtained with a mean length of 710 bp. Of these unigenes, 47,461 were annotated with public databases based on sequence similarities. Comparative transcript profiling allowed the discovery of 4324 differentially expressed genes (DEGs), with 2207 up-regulated and 2117 down-regulated unigenes in Cu-treated library as compared to the control counterpart. Based on these DEGs, Gene Ontology (GO) enrichment analysis indicated Cu stress-relevant terms, such as 'membrane' and 'antioxidant activity'. Meanwhile, Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway enrichment analysis uncovered some important pathways, including 'biosynthesis of secondary metabolites' and 'metabolic pathways'. In addition, expression patterns of 12 selected DEGs derived from quantitative real-time polymerase chain reaction (qRT-PCR) were consistent with their transcript abundance changes obtained by transcriptomic analyses, suggesting that all the 12 genes were authentically involved in Cu tolerance in P. ostii. This is the first report to identify genes related to Cu stress responses in P. ostii, which could offer valuable information on the molecular mechanisms of Cu resistance, and provide a basis for further genomics research on this and related ornamental species for phytoremediation. Copyright © 2015 Elsevier B.V. All rights reserved.
Li, Dongmei; Le Pape, Marc A; Parikh, Nisha I; Chen, Will X; Dye, Timothy D
2013-01-01
Microarrays are widely used for examining differential gene expression, identifying single nucleotide polymorphisms, and detecting methylation loci. Multiple testing methods in microarray data analysis aim at controlling both Type I and Type II error rates; however, real microarray data do not always fit their distribution assumptions. Smyth's ubiquitous parametric method, for example, inadequately accommodates violations of normality assumptions, resulting in inflated Type I error rates. The Significance Analysis of Microarrays, another widely used microarray data analysis method, is based on a permutation test and is robust to non-normally distributed data; however, the Significance Analysis of Microarrays method fold change criteria are problematic, and can critically alter the conclusion of a study, as a result of compositional changes of the control data set in the analysis. We propose a novel approach, combining resampling with empirical Bayes methods: the Resampling-based empirical Bayes Methods. This approach not only reduces false discovery rates for non-normally distributed microarray data, but it is also impervious to fold change threshold since no control data set selection is needed. Through simulation studies, sensitivities, specificities, total rejections, and false discovery rates are compared across the Smyth's parametric method, the Significance Analysis of Microarrays, and the Resampling-based empirical Bayes Methods. Differences in false discovery rates controls between each approach are illustrated through a preterm delivery methylation study. The results show that the Resampling-based empirical Bayes Methods offer significantly higher specificity and lower false discovery rates compared to Smyth's parametric method when data are not normally distributed. The Resampling-based empirical Bayes Methods also offers higher statistical power than the Significance Analysis of Microarrays method when the proportion of significantly differentially expressed genes is large for both normally and non-normally distributed data. Finally, the Resampling-based empirical Bayes Methods are generalizable to next generation sequencing RNA-seq data analysis.
DNA barcode goes two-dimensions: DNA QR code web server.
Liu, Chang; Shi, Linchun; Xu, Xiaolan; Li, Huan; Xing, Hang; Liang, Dong; Jiang, Kun; Pang, Xiaohui; Song, Jingyuan; Chen, Shilin
2012-01-01
The DNA barcoding technology uses a standard region of DNA sequence for species identification and discovery. At present, "DNA barcode" actually refers to DNA sequences, which are not amenable to information storage, recognition, and retrieval. Our aim is to identify the best symbology that can represent DNA barcode sequences in practical applications. A comprehensive set of sequences for five DNA barcode markers ITS2, rbcL, matK, psbA-trnH, and CO1 was used as the test data. Fifty-three different types of one-dimensional and ten two-dimensional barcode symbologies were compared based on different criteria, such as coding capacity, compression efficiency, and error detection ability. The quick response (QR) code was found to have the largest coding capacity and relatively high compression ratio. To facilitate the further usage of QR code-based DNA barcodes, a web server was developed and is accessible at http://qrfordna.dnsalias.org. The web server allows users to retrieve the QR code for a species of interests, convert a DNA sequence to and from a QR code, and perform species identification based on local and global sequence similarities. In summary, the first comprehensive evaluation of various barcode symbologies has been carried out. The QR code has been found to be the most appropriate symbology for DNA barcode sequences. A web server has also been constructed to allow biologists to utilize QR codes in practical DNA barcoding applications.
Transcription profile of boar spermatozoa as revealed by RNA-sequencing
USDA-ARS?s Scientific Manuscript database
High-throughput RNA sequencing (RNA-Seq) overcomes the limitations of the current hybridization-based techniques to detect the actual pool of RNA transcripts in spermatozoa. The application of this technology in livestock can speed the discovery of potential predictors of male fertility. As a first ...
USDA-ARS?s Scientific Manuscript database
Background: Vertebrate immune systems generate diverse repertoires of antibodies capable of mediating response to a variety of antigens. Next generation sequencing methods provide unique approaches to a number of immuno-based research areas including antibody discovery and engineering, disease surve...
Next generation sequencing applications for microRNA biomarker discovery in toxicological studies
Next Generation Sequencing (NGS) technology will be reviewed for its base pair resolution, wide dynamic range, and insights into the genome and transcriptome, with special focus upon the biomarker potential of microRNAs (miRNAs). The first part of this presentation reviews commo...
A DNA barcode for land plants.
2009-08-04
DNA barcoding involves sequencing a standard region of DNA as a tool for species identification. However, there has been no agreement on which region(s) should be used for barcoding land plants. To provide a community recommendation on a standard plant barcode, we have compared the performance of 7 leading candidate plastid DNA regions (atpF-atpH spacer, matK gene, rbcL gene, rpoB gene, rpoC1 gene, psbK-psbI spacer, and trnH-psbA spacer). Based on assessments of recoverability, sequence quality, and levels of species discrimination, we recommend the 2-locus combination of rbcL+matK as the plant barcode. This core 2-locus barcode will provide a universal framework for the routine use of DNA sequence data to identify specimens and contribute toward the discovery of overlooked species of land plants.
Hollingsworth, Peter M.; Forrest, Laura L.; Spouge, John L.; Hajibabaei, Mehrdad; Ratnasingham, Sujeevan; van der Bank, Michelle; Chase, Mark W.; Cowan, Robyn S.; Erickson, David L.; Fazekas, Aron J.; Graham, Sean W.; James, Karen E.; Kim, Ki-Joong; Kress, W. John; Schneider, Harald; van AlphenStahl, Jonathan; Barrett, Spencer C.H.; van den Berg, Cassio; Bogarin, Diego; Burgess, Kevin S.; Cameron, Kenneth M.; Carine, Mark; Chacón, Juliana; Clark, Alexandra; Clarkson, James J.; Conrad, Ferozah; Devey, Dion S.; Ford, Caroline S.; Hedderson, Terry A.J.; Hollingsworth, Michelle L.; Husband, Brian C.; Kelly, Laura J.; Kesanakurti, Prasad R.; Kim, Jung Sung; Kim, Young-Dong; Lahaye, Renaud; Lee, Hae-Lim; Long, David G.; Madriñán, Santiago; Maurin, Olivier; Meusnier, Isabelle; Newmaster, Steven G.; Park, Chong-Wook; Percy, Diana M.; Petersen, Gitte; Richardson, James E.; Salazar, Gerardo A.; Savolainen, Vincent; Seberg, Ole; Wilkinson, Michael J.; Yi, Dong-Keun; Little, Damon P.
2009-01-01
DNA barcoding involves sequencing a standard region of DNA as a tool for species identification. However, there has been no agreement on which region(s) should be used for barcoding land plants. To provide a community recommendation on a standard plant barcode, we have compared the performance of 7 leading candidate plastid DNA regions (atpF–atpH spacer, matK gene, rbcL gene, rpoB gene, rpoC1 gene, psbK–psbI spacer, and trnH–psbA spacer). Based on assessments of recoverability, sequence quality, and levels of species discrimination, we recommend the 2-locus combination of rbcL+matK as the plant barcode. This core 2-locus barcode will provide a universal framework for the routine use of DNA sequence data to identify specimens and contribute toward the discovery of overlooked species of land plants. PMID:19666622
SNPServer: a real-time SNP discovery tool.
Savage, David; Batley, Jacqueline; Erwin, Tim; Logan, Erica; Love, Christopher G; Lim, Geraldine A C; Mongin, Emmanuel; Barker, Gary; Spangenberg, German C; Edwards, David
2005-07-01
SNPServer is a real-time flexible tool for the discovery of SNPs (single nucleotide polymorphisms) within DNA sequence data. The program uses BLAST, to identify related sequences, and CAP3, to cluster and align these sequences. The alignments are parsed to the SNP discovery software autoSNP, a program that detects SNPs and insertion/deletion polymorphisms (indels). Alternatively, lists of related sequences or pre-assembled sequences may be entered for SNP discovery. SNPServer and autoSNP use redundancy to differentiate between candidate SNPs and sequence errors. For each candidate SNP, two measures of confidence are calculated, the redundancy of the polymorphism at a SNP locus and the co-segregation of the candidate SNP with other SNPs in the alignment. SNPServer is available at http://hornbill.cspp.latrobe.edu.au/snpdiscovery.html.
MOSAIK: a hash-based algorithm for accurate next-generation sequencing short-read mapping.
Lee, Wan-Ping; Stromberg, Michael P; Ward, Alistair; Stewart, Chip; Garrison, Erik P; Marth, Gabor T
2014-01-01
MOSAIK is a stable, sensitive and open-source program for mapping second and third-generation sequencing reads to a reference genome. Uniquely among current mapping tools, MOSAIK can align reads generated by all the major sequencing technologies, including Illumina, Applied Biosystems SOLiD, Roche 454, Ion Torrent and Pacific BioSciences SMRT. Indeed, MOSAIK was the only aligner to provide consistent mappings for all the generated data (sequencing technologies, low-coverage and exome) in the 1000 Genomes Project. To provide highly accurate alignments, MOSAIK employs a hash clustering strategy coupled with the Smith-Waterman algorithm. This method is well-suited to capture mismatches as well as short insertions and deletions. To support the growing interest in larger structural variant (SV) discovery, MOSAIK provides explicit support for handling known-sequence SVs, e.g. mobile element insertions (MEIs) as well as generating outputs tailored to aid in SV discovery. All variant discovery benefits from an accurate description of the read placement confidence. To this end, MOSAIK uses a neural-network based training scheme to provide well-calibrated mapping quality scores, demonstrated by a correlation coefficient between MOSAIK assigned and actual mapping qualities greater than 0.98. In order to ensure that studies of any genome are supported, a training pipeline is provided to ensure optimal mapping quality scores for the genome under investigation. MOSAIK is multi-threaded, open source, and incorporated into our command and pipeline launcher system GKNO (http://gkno.me).
MOSAIK: A Hash-Based Algorithm for Accurate Next-Generation Sequencing Short-Read Mapping
Lee, Wan-Ping; Stromberg, Michael P.; Ward, Alistair; Stewart, Chip; Garrison, Erik P.; Marth, Gabor T.
2014-01-01
MOSAIK is a stable, sensitive and open-source program for mapping second and third-generation sequencing reads to a reference genome. Uniquely among current mapping tools, MOSAIK can align reads generated by all the major sequencing technologies, including Illumina, Applied Biosystems SOLiD, Roche 454, Ion Torrent and Pacific BioSciences SMRT. Indeed, MOSAIK was the only aligner to provide consistent mappings for all the generated data (sequencing technologies, low-coverage and exome) in the 1000 Genomes Project. To provide highly accurate alignments, MOSAIK employs a hash clustering strategy coupled with the Smith-Waterman algorithm. This method is well-suited to capture mismatches as well as short insertions and deletions. To support the growing interest in larger structural variant (SV) discovery, MOSAIK provides explicit support for handling known-sequence SVs, e.g. mobile element insertions (MEIs) as well as generating outputs tailored to aid in SV discovery. All variant discovery benefits from an accurate description of the read placement confidence. To this end, MOSAIK uses a neural-network based training scheme to provide well-calibrated mapping quality scores, demonstrated by a correlation coefficient between MOSAIK assigned and actual mapping qualities greater than 0.98. In order to ensure that studies of any genome are supported, a training pipeline is provided to ensure optimal mapping quality scores for the genome under investigation. MOSAIK is multi-threaded, open source, and incorporated into our command and pipeline launcher system GKNO (http://gkno.me). PMID:24599324
Molecular Mapping of Restriction-Site Associated DNA Markers In Allotetraploid Upland Cotton.
Wang, Yangkun; Ning, Zhiyuan; Hu, Yan; Chen, Jiedan; Zhao, Rui; Chen, Hong; Ai, Nijiang; Guo, Wangzhen; Zhang, Tianzhen
2015-01-01
Upland cotton (Gossypium hirsutum L., 2n = 52, AADD) is an allotetraploid, therefore the discovery of single nucleotide polymorphism (SNP) markers is difficult. The recent emergence of genome complexity reduction technologies based on the next-generation sequencing (NGS) platform has greatly expedited SNP discovery in crops with highly repetitive and complex genomes. Here we applied restriction-site associated DNA (RAD) sequencing technology for de novo SNP discovery in allotetraploid cotton. We identified 21,109 SNPs between the two parents and used these for genotyping of 161 recombinant inbred lines (RILs). Finally, a high dense linkage map comprising 4,153 loci over 3500-cM was developed based on the previous result. Using this map quantitative trait locus (QTLs) conferring fiber strength and Verticillium Wilt (VW) resistance were mapped to a more accurate region in comparison to the 1576-cM interval determined using the simple sequence repeat (SSR) genetic map. This suggests that the newly constructed map has more power and resolution than the previous SSR map. It will pave the way for the rapid identification of the marker-assisted selection in cotton breeding and cloning of QTL of interest traits.
Buschmann, Tilo; Zhang, Rong; Brash, Douglas E; Bystrykh, Leonid V
2014-08-07
DNA barcodes are short unique sequences used to label DNA or RNA-derived samples in multiplexed deep sequencing experiments. During the demultiplexing step, barcodes must be detected and their position identified. In some cases (e.g., with PacBio SMRT), the position of the barcode and DNA context is not well defined. Many reads start inside the genomic insert so that adjacent primers might be missed. The matter is further complicated by coincidental similarities between barcode sequences and reference DNA. Therefore, a robust strategy is required in order to detect barcoded reads and avoid a large number of false positives or negatives.For mass inference problems such as this one, false discovery rate (FDR) methods are powerful and balanced solutions. Since existing FDR methods cannot be applied to this particular problem, we present an adapted FDR method that is suitable for the detection of barcoded reads as well as suggest possible improvements. In our analysis, barcode sequences showed high rates of coincidental similarities with the Mus musculus reference DNA. This problem became more acute when the length of the barcode sequence decreased and the number of barcodes in the set increased. The method presented in this paper controls the tail area-based false discovery rate to distinguish between barcoded and unbarcoded reads. This method helps to establish the highest acceptable minimal distance between reads and barcode sequences. In a proof of concept experiment we correctly detected barcodes in 83% of the reads with a precision of 89%. Sensitivity improved to 99% at 99% precision when the adjacent primer sequence was incorporated in the analysis. The analysis was further improved using a paired end strategy. Following an analysis of the data for sequence variants induced in the Atp1a1 gene of C57BL/6 murine melanocytes by ultraviolet light and conferring resistance to ouabain, we found no evidence of cross-contamination of DNA material between samples. Our method offers a proper quantitative treatment of the problem of detecting barcoded reads in a noisy sequencing environment. It is based on the false discovery rate statistics that allows a proper trade-off between sensitivity and precision to be chosen.
Yuan, Shuai; Johnston, H. Richard; Zhang, Guosheng; Li, Yun; Hu, Yi-Juan; Qin, Zhaohui S.
2015-01-01
With rapid decline of the sequencing cost, researchers today rush to embrace whole genome sequencing (WGS), or whole exome sequencing (WES) approach as the next powerful tool for relating genetic variants to human diseases and phenotypes. A fundamental step in analyzing WGS and WES data is mapping short sequencing reads back to the reference genome. This is an important issue because incorrectly mapped reads affect the downstream variant discovery, genotype calling and association analysis. Although many read mapping algorithms have been developed, the majority of them uses the universal reference genome and do not take sequence variants into consideration. Given that genetic variants are ubiquitous, it is highly desirable if they can be factored into the read mapping procedure. In this work, we developed a novel strategy that utilizes genotypes obtained a priori to customize the universal haploid reference genome into a personalized diploid reference genome. The new strategy is implemented in a program named RefEditor. When applying RefEditor to real data, we achieved encouraging improvements in read mapping, variant discovery and genotype calling. Compared to standard approaches, RefEditor can significantly increase genotype calling consistency (from 43% to 61% at 4X coverage; from 82% to 92% at 20X coverage) and reduce Mendelian inconsistency across various sequencing depths. Because many WGS and WES studies are conducted on cohorts that have been genotyped using array-based genotyping platforms previously or concurrently, we believe the proposed strategy will be of high value in practice, which can also be applied to the scenario where multiple NGS experiments are conducted on the same cohort. The RefEditor sources are available at https://github.com/superyuan/refeditor. PMID:26267278
Motif-based analysis of large nucleotide data sets using MEME-ChIP
Ma, Wenxiu; Noble, William S; Bailey, Timothy L
2014-01-01
MEME-ChIP is a web-based tool for analyzing motifs in large DNA or RNA data sets. It can analyze peak regions identified by ChIP-seq, cross-linking sites identified by cLIP-seq and related assays, as well as sets of genomic regions selected using other criteria. MEME-ChIP performs de novo motif discovery, motif enrichment analysis, motif location analysis and motif clustering, providing a comprehensive picture of the DNA or RNA motifs that are enriched in the input sequences. MEME-ChIP performs two complementary types of de novo motif discovery: weight matrix–based discovery for high accuracy; and word-based discovery for high sensitivity. Motif enrichment analysis using DNA or RNA motifs from human, mouse, worm, fly and other model organisms provides even greater sensitivity. MEME-ChIP’s interactive HTML output groups and aligns significant motifs to ease interpretation. this protocol takes less than 3 h, and it provides motif discovery approaches that are distinct and complementary to other online methods. PMID:24853928
McAlpine, James B
2009-03-27
Over the past decade major changes have occurred in the access to genome sequences that encode the enzymes responsible for the biosynthesis of secondary metabolites, knowledge of how those sequences translate into the final structure of the metabolite, and the ability to alter the sequence to obtain predicted products via both homologous and heterologous expression. Novel genera have been discovered leading to new chemotypes, but more surprisingly several instances have been uncovered where the apparently general rules of modular translation have not applied. Several new biosynthetic pathways have been unearthed, and our general knowledge grows rapidly. This review aims to highlight some of the more striking discoveries and advances of the decade.
A fortran program for Monte Carlo simulation of oil-field discovery sequences
Bohling, Geoffrey C.; Davis, J.C.
1993-01-01
We have developed a program for performing Monte Carlo simulation of oil-field discovery histories. A synthetic parent population of fields is generated as a finite sample from a distribution of specified form. The discovery sequence then is simulated by sampling without replacement from this parent population in accordance with a probabilistic discovery process model. The program computes a chi-squared deviation between synthetic and actual discovery sequences as a function of the parameters of the discovery process model, the number of fields in the parent population, and the distributional parameters of the parent population. The program employs the three-parameter log gamma model for the distribution of field sizes and employs a two-parameter discovery process model, allowing the simulation of a wide range of scenarios. ?? 1993.
Fu, Shuyue; Liu, Xiang; Luo, Maochao; Xie, Ke; Nice, Edouard C; Zhang, Haiyuan; Huang, Canhua
2017-04-01
Chemoresistance is a major obstacle for current cancer treatment. Proteogenomics is a powerful multi-omics research field that uses customized protein sequence databases generated by genomic and transcriptomic information to identify novel genes (e.g. noncoding, mutation and fusion genes) from mass spectrometry-based proteomic data. By identifying aberrations that are differentially expressed between tumor and normal pairs, this approach can also be applied to validate protein variants in cancer, which may reveal the response to drug treatment. Areas covered: In this review, we will present recent advances in proteogenomic investigations of cancer drug resistance with an emphasis on integrative proteogenomic pipelines and the biomarker discovery which contributes to achieving the goal of using precision/personalized medicine for cancer treatment. Expert commentary: The discovery and comprehensive understanding of potential biomarkers help identify the cohort of patients who may benefit from particular treatments, and will assist real-time clinical decision-making to maximize therapeutic efficacy and minimize adverse effects. With the development of MS-based proteomics and NGS-based sequencing, a growing number of proteogenomic tools are being developed specifically to investigate cancer drug resistance.
Martin, Guillaume E.; Rousseau-Gueutin, Mathieu; Cordonnier, Solenn; Lima, Oscar; Michon-Coudouel, Sophie; Naquin, Delphine; de Carvalho, Julie Ferreira; Aïnouche, Malika; Salmon, Armel; Aïnouche, Abdelkader
2014-01-01
Background and Aims To date chloroplast genomes are available only for members of the non-protein amino acid-accumulating clade (NPAAA) Papilionoid lineages in the legume family (i.e. Millettioids, Robinoids and the ‘inverted repeat-lacking clade’, IRLC). It is thus very important to sequence plastomes from other lineages in order to better understand the unusual evolution observed in this model flowering plant family. To this end, the plastome of a lupine species, Lupinus luteus, was sequenced to represent the Genistoid lineage, a noteworthy but poorly studied legume group. Methods The plastome of L. luteus was reconstructed using Roche-454 and Illumina next-generation sequencing. Its structure, repetitive sequences, gene content and sequence divergence were compared with those of other Fabaceae plastomes. PCR screening and sequencing were performed in other allied legumes in order to determine the origin of a large inversion identified in L. luteus. Key Results The first sequenced Genistoid plastome (L. luteus: 155 894 bp) resulted in the discovery of a 36-kb inversion, embedded within the already known 50-kb inversion in the large single-copy (LSC) region of the Papilionoideae. This inversion occurs at the base or soon after the Genistoid emergence, and most probably resulted from a flip–flop recombination between identical 29-bp inverted repeats within two trnS genes. Comparative analyses of the chloroplast gene content of L. luteus vs. Fabaceae and extra-Fabales plastomes revealed the loss of the plastid rpl22 gene, and its functional relocation to the nucleus was verified using lupine transcriptomic data. An investigation into the evolutionary rate of coding and non-coding sequences among legume plastomes resulted in the identification of remarkably variable regions. Conclusions This study resulted in the discovery of a novel, major 36-kb inversion, specific to the Genistoids. Chloroplast mutational hotspots were also identified, which contain novel and potentially informative regions for molecular evolutionary studies at various taxonomic levels in the legumes. Taken together, the results provide new insights into the evolutionary landscape of the legume plastome. PMID:24769537
Reliable Detection of Herpes Simplex Virus Sequence Variation by High-Throughput Resequencing.
Morse, Alison M; Calabro, Kaitlyn R; Fear, Justin M; Bloom, David C; McIntyre, Lauren M
2017-08-16
High-throughput sequencing (HTS) has resulted in data for a number of herpes simplex virus (HSV) laboratory strains and clinical isolates. The knowledge of these sequences has been critical for investigating viral pathogenicity. However, the assembly of complete herpesviral genomes, including HSV, is complicated due to the existence of large repeat regions and arrays of smaller reiterated sequences that are commonly found in these genomes. In addition, the inherent genetic variation in populations of isolates for viruses and other microorganisms presents an additional challenge to many existing HTS sequence assembly pipelines. Here, we evaluate two approaches for the identification of genetic variants in HSV1 strains using Illumina short read sequencing data. The first, a reference-based approach, identifies variants from reads aligned to a reference sequence and the second, a de novo assembly approach, identifies variants from reads aligned to de novo assembled consensus sequences. Of critical importance for both approaches is the reduction in the number of low complexity regions through the construction of a non-redundant reference genome. We compared variants identified in the two methods. Our results indicate that approximately 85% of variants are identified regardless of the approach. The reference-based approach to variant discovery captures an additional 15% representing variants divergent from the HSV1 reference possibly due to viral passage. Reference-based approaches are significantly less labor-intensive and identify variants across the genome where de novo assembly-based approaches are limited to regions where contigs have been successfully assembled. In addition, regions of poor quality assembly can lead to false variant identification in de novo consensus sequences. For viruses with a well-assembled reference genome, a reference-based approach is recommended.
SNP discovery through de novo deep sequencing using the next generation of DNA sequencers
USDA-ARS?s Scientific Manuscript database
The production of high volumes of DNA sequence data using new technologies has permitted more efficient identification of single nucleotide polymorphisms in vertebrate genomes. This chapter presented practical methodology for production and analysis of DNA sequence data for SNP discovery....
Iquebal, M A; Jaiswal, Sarika; Mahato, Ajay Kumar; Jayaswal, Pawan K; Angadi, U B; Kumar, Neeraj; Sharma, Nimisha; Singh, Anand K; Srivastav, Manish; Prakash, Jai; Singh, S K; Khan, Kasim; Mishra, Rupesh K; Rajan, Shailendra; Bajpai, Anju; Sandhya, B S; Nischita, Puttaraju; Ravishankar, K V; Dinesh, M R; Rai, Anil; Kumar, Dinesh; Sharma, Tilak R; Singh, Nagendra K
2017-11-02
Mango is one of the most important fruits of tropical ecological region of the world, well known for its nutritive value, aroma and taste. Its world production is >45MT worth >200 billion US dollars. Genomic resources are required for improvement in productivity and management of mango germplasm. There is no web-based genomic resources available for mango. Hence rapid and cost-effective high throughput putative marker discovery is required to develop such resources. RAD-based marker discovery can cater this urgent need till whole genome sequence of mango becomes available. Using a panel of 84 mango varieties, a total of 28.6 Gb data was generated by ddRAD-Seq approach on Illumina HiSeq 2000 platform. A total of 1.25 million SNPs were discovered. Phylogenetic tree using 749 common SNPs across these varieties revealed three major lineages which was compared with geographical locations. A web genomic resources MiSNPDb, available at http://webtom.cabgrid.res.in/mangosnps/ is based on 3-tier architecture, developed using PHP, MySQL and Javascript. This web genomic resources can be of immense use in the development of high density linkage map, QTL discovery, varietal differentiation, traceability, genome finishing and SNP chip development for future GWAS in genomic selection program. We report here world's first web-based genomic resources for genetic improvement and germplasm management of mango.
Phylo-VISTA: Interactive visualization of multiple DNA sequence alignments
DOE Office of Scientific and Technical Information (OSTI.GOV)
Shah, Nameeta; Couronne, Olivier; Pennacchio, Len A.
The power of multi-sequence comparison for biological discovery is well established. The need for new capabilities to visualize and compare cross-species alignment data is intensified by the growing number of genomic sequence datasets being generated for an ever-increasing number of organisms. To be efficient these visualization algorithms must support the ability to accommodate consistently a wide range of evolutionary distances in a comparison framework based upon phylogenetic relationships. Results: We have developed Phylo-VISTA, an interactive tool for analyzing multiple alignments by visualizing a similarity measure for multiple DNA sequences. The complexity of visual presentation is effectively organized using a frameworkmore » based upon interspecies phylogenetic relationships. The phylogenetic organization supports rapid, user-guided interspecies comparison. To aid in navigation through large sequence datasets, Phylo-VISTA leverages concepts from VISTA that provide a user with the ability to select and view data at varying resolutions. The combination of multiresolution data visualization and analysis, combined with the phylogenetic framework for interspecies comparison, produces a highly flexible and powerful tool for visual data analysis of multiple sequence alignments. Availability: Phylo-VISTA is available at http://www-gsd.lbl. gov/phylovista. It requires an Internet browser with Java Plugin 1.4.2 and it is integrated into the global alignment program LAGAN at http://lagan.stanford.edu« less
Anonymization of electronic medical records for validating genome-wide association studies
Loukides, Grigorios; Gkoulalas-Divanis, Aris; Malin, Bradley
2010-01-01
Genome-wide association studies (GWAS) facilitate the discovery of genotype–phenotype relations from population-based sequence databases, which is an integral facet of personalized medicine. The increasing adoption of electronic medical records allows large amounts of patients’ standardized clinical features to be combined with the genomic sequences of these patients and shared to support validation of GWAS findings and to enable novel discoveries. However, disseminating these data “as is” may lead to patient reidentification when genomic sequences are linked to resources that contain the corresponding patients’ identity information based on standardized clinical features. This work proposes an approach that provably prevents this type of data linkage and furnishes a result that helps support GWAS. Our approach automatically extracts potentially linkable clinical features and modifies them in a way that they can no longer be used to link a genomic sequence to a small number of patients, while preserving the associations between genomic sequences and specific sets of clinical features corresponding to GWAS-related diseases. Extensive experiments with real patient data derived from the Vanderbilt's University Medical Center verify that our approach generates data that eliminate the threat of individual reidentification, while supporting GWAS validation and clinical case analysis tasks. PMID:20385806
USDA-ARS?s Scientific Manuscript database
In recent years, next generation sequencing (NGS) based bulked segregant analysis (BSA) has become a powerful approach for allele discovery in non-model plant species. However, challenges remain, particular for out-crossing species with complex genomes. Here, the genetic control of a weeping bran...
Zhang, Hanyuan; Vieira Resende E Silva, Bruno; Cui, Juan
2018-05-01
Small RNA sequencing is the most widely used tool for microRNA (miRNA) discovery, and shows great potential for the efficient study of miRNA cross-species transport, i.e., by detecting the presence of exogenous miRNA sequences in the host species. Because of the increased appreciation of dietary miRNAs and their far-reaching implication in human health, research interests are currently growing with regard to exogenous miRNAs bioavailability, mechanisms of cross-species transport and miRNA function in cellular biological processes. In this article, we present microRNA Discovery (miRDis), a new small RNA sequencing data analysis pipeline for both endogenous and exogenous miRNA detection. Specifically, we developed and deployed a Web service that supports the annotation and expression profiling data of known host miRNAs and the detection of novel miRNAs, other noncoding RNAs, and the exogenous miRNAs from dietary species. As a proof-of-concept, we analyzed a set of human plasma sequencing data from a milk-feeding study where 225 human miRNAs were detected in the plasma samples and 44 show elevated expression after milk intake. By examining the bovine-specific sequences, data indicate that three bovine miRNAs (bta-miR-378, -181* and -150) are present in human plasma possibly because of the dietary uptake. Further evaluation based on different sets of public data demonstrates that miRDis outperforms other state-of-the-art tools in both detection and quantification of miRNA from either animal or plant sources. The miRDis Web server is available at: http://sbbi.unl.edu/miRDis/index.php.
Alu repeat discovery and characterization within human genomes
Hormozdiari, Fereydoun; Alkan, Can; Ventura, Mario; Hajirasouliha, Iman; Malig, Maika; Hach, Faraz; Yorukoglu, Deniz; Dao, Phuong; Bakhshi, Marzieh; Sahinalp, S. Cenk; Eichler, Evan E.
2011-01-01
Human genomes are now being rapidly sequenced, but not all forms of genetic variation are routinely characterized. In this study, we focus on Alu retrotransposition events and seek to characterize differences in the pattern of mobile insertion between individuals based on the analysis of eight human genomes sequenced using next-generation sequencing. Applying a rapid read-pair analysis algorithm, we discover 4342 Alu insertions not found in the human reference genome and show that 98% of a selected subset (63/64) experimentally validate. Of these new insertions, 89% correspond to AluY elements, suggesting that they arose by retrotransposition. Eighty percent of the Alu insertions have not been previously reported and more novel events were detected in Africans when compared with non-African samples (76% vs. 69%). Using these data, we develop an experimental and computational screen to identify ancestry informative Alu retrotransposition events among different human populations. PMID:21131385
Smith, J K; Parry, J D; Day, J G; Smith, R J
1998-10-01
The use of primers based on the Hip1 sequence as a typing technique for cyanobacteria has been investigated. The discovery of short repetitive sequence structures in bacterial DNA during the last decade has led to the development of PCR-based methods for typing, i.e., distinguishing and identifying, bacterial species and strains. An octameric palindromic sequence known as Hip1 has been shown to be present in the chromosomal DNA of many species of cyanobacteria as a highly repetitious interspersed sequence. PCR primers were constructed that extended the Hip1 sequence at the 3' end by two bases. Five of the 16 possible extended primers were tested. Each of the five primers produced a different set of products when used to prime PCR from cyanobacterial genomic DNA. Each primer produced a distinct set of products for each of the 15 cyanobacterial species tested. The ability of Hip1-based PCR to resolve taxonomic differences was assessed by analysis of independent isolates of Anabaena flos-aquae and Nostoc ellipsosporum obtained from the CCAP (Culture Collection of Algae and Protozoa, IFE, Cumbria, UK). A PCR-based RFLP analysis of products amplified from the 23S-16S rDNA intergenic region was used to characterize the isolates and to compare with the Hip1 typing data. The RFLP and Hip1 typing yielded similar results and both techniques were able to distinguish different strains. On the basis of these results it is suggested that the Hip1 PCR technique may assist in distinguishing cyanobacterial species and strains.
Zhou, Wen-Zhao; Zhang, Yan-Mei; Lu, Jun-Ying; Li, Jun-Feng
2012-01-01
To provide a resource of sisal-specific expressed sequence data and facilitate this powerful approach in new gene research, the preparation of normalized cDNA libraries enriched with full-length sequences is necessary. Four libraries were produced with RNA pooled from Agave sisalana multiple tissues to increase efficiency of normalization and maximize the number of independent genes by SMART™ method and the duplex-specific nuclease (DSN). This procedure kept the proportion of full-length cDNAs in the subtracted/normalized libraries and dramatically enhanced the discovery of new genes. Sequencing of 3875 cDNA clones of libraries revealed 3320 unigenes with an average insert length about 1.2 kb, indicating that the non-redundancy of libraries was about 85.7%. These unigene functions were predicted by comparing their sequences to functional domain databases and extensively annotated with Gene Ontology (GO) terms. Comparative analysis of sisal unigenes and other plant genomes revealed that four putative MADS-box genes and knotted-like homeobox (knox) gene were obtained from a total of 1162 full-length transcripts. Furthermore, real-time PCR showed that the characteristics of their transcripts mainly depended on the tight expression regulation of a number of genes during the leaf and flower development. Analysis of individual library sequence data indicated that the pooled-tissue approach was highly effective in discovering new genes and preparing libraries for efficient deep sequencing. PMID:23202944
Scaling up discovery of hidden diversity in fungi: impacts of barcoding approaches.
Yahr, Rebecca; Schoch, Conrad L; Dentinger, Bryn T M
2016-09-05
The fungal kingdom is a hyperdiverse group of multicellular eukaryotes with profound impacts on human society and ecosystem function. The challenge of documenting and describing fungal diversity is exacerbated by their typically cryptic nature, their ability to produce seemingly unrelated morphologies from a single individual and their similarity in appearance to distantly related taxa. This multiplicity of hurdles resulted in the early adoption of DNA-based comparisons to study fungal diversity, including linking curated DNA sequence data to expertly identified voucher specimens. DNA-barcoding approaches in fungi were first applied in specimen-based studies for identification and discovery of taxonomic diversity, but are now widely deployed for community characterization based on sequencing of environmental samples. Collectively, fungal barcoding approaches have yielded important advances across biological scales and research applications, from taxonomic, ecological, industrial and health perspectives. A major outstanding issue is the growing problem of 'sequences without names' that are somewhat uncoupled from the traditional framework of fungal classification based on morphology and preserved specimens. This review summarizes some of the most significant impacts of fungal barcoding, its limitations, and progress towards the challenge of effective utilization of the exponentially growing volume of data gathered from high-throughput sequencing technologies.This article is part of the themed issue 'From DNA barcodes to biomes'. © 2016 The Authors.
Harris, Katherine E; Aldred, Shelley Force; Davison, Laura M; Ogana, Heather Anne N; Boudreau, Andrew; Brüggemann, Marianne; Osborn, Michael; Ma, Biao; Buelow, Benjamin; Clarke, Starlynn C; Dang, Kevin H; Iyer, Suhasini; Jorgensen, Brett; Pham, Duy T; Pratap, Payal P; Rangaswamy, Udaya S; Schellenberger, Ute; van Schooten, Wim C; Ugamraj, Harshad S; Vafa, Omid; Buelow, Roland; Trinklein, Nathan D
2018-01-01
We created a novel transgenic rat that expresses human antibodies comprising a diverse repertoire of heavy chains with a single common rearranged kappa light chain (IgKV3-15-JK1). This fixed light chain animal, called OmniFlic, presents a unique system for human therapeutic antibody discovery and a model to study heavy chain repertoire diversity in the context of a constant light chain. The purpose of this study was to analyze heavy chain variable gene usage, clonotype diversity, and to describe the sequence characteristics of antigen-specific monoclonal antibodies (mAbs) isolated from immunized OmniFlic animals. Using next-generation sequencing antibody repertoire analysis, we measured heavy chain variable gene usage and the diversity of clonotypes present in the lymph node germinal centers of 75 OmniFlic rats immunized with 9 different protein antigens. Furthermore, we expressed 2,560 unique heavy chain sequences sampled from a diverse set of clonotypes as fixed light chain antibody proteins and measured their binding to antigen by ELISA. Finally, we measured patterns and overall levels of somatic hypermutation in the full B-cell repertoire and in the 2,560 mAbs tested for binding. The results demonstrate that OmniFlic animals produce an abundance of antigen-specific antibodies with heavy chain clonotype diversity that is similar to what has been described with unrestricted light chain use in mammals. In addition, we show that sequence-based discovery is a highly effective and efficient way to identify a large number of diverse monoclonal antibodies to a protein target of interest.
DNA Barcode Goes Two-Dimensions: DNA QR Code Web Server
Li, Huan; Xing, Hang; Liang, Dong; Jiang, Kun; Pang, Xiaohui; Song, Jingyuan; Chen, Shilin
2012-01-01
The DNA barcoding technology uses a standard region of DNA sequence for species identification and discovery. At present, “DNA barcode” actually refers to DNA sequences, which are not amenable to information storage, recognition, and retrieval. Our aim is to identify the best symbology that can represent DNA barcode sequences in practical applications. A comprehensive set of sequences for five DNA barcode markers ITS2, rbcL, matK, psbA-trnH, and CO1 was used as the test data. Fifty-three different types of one-dimensional and ten two-dimensional barcode symbologies were compared based on different criteria, such as coding capacity, compression efficiency, and error detection ability. The quick response (QR) code was found to have the largest coding capacity and relatively high compression ratio. To facilitate the further usage of QR code-based DNA barcodes, a web server was developed and is accessible at http://qrfordna.dnsalias.org. The web server allows users to retrieve the QR code for a species of interests, convert a DNA sequence to and from a QR code, and perform species identification based on local and global sequence similarities. In summary, the first comprehensive evaluation of various barcode symbologies has been carried out. The QR code has been found to be the most appropriate symbology for DNA barcode sequences. A web server has also been constructed to allow biologists to utilize QR codes in practical DNA barcoding applications. PMID:22574113
Swain, Martin T.; Larkin, Denis M.; Caffrey, Conor R.; Davies, Stephen J.; Loukas, Alex; Skelly, Patrick J.; Hoffmann, Karl F.
2011-01-01
Schistosoma genomes provide a comprehensive resource for identifying the molecular processes that shape parasite evolution and for discovering novel chemotherapeutic or immunoprophylactic targets. Here, we demonstrate how intra- and intergenus comparative genomics can be used to drive these investigations forward, illustrate the advantages and limitations of these approaches and review how post genomic technologies offer complementary strategies for genome characterisation. While sequencing and functional characterisation of other schistosome/platyhelminth genomes continues to expedite anthelmintic discovery, we contend that future priorities should equally focus on improving assembly quality, and chromosomal assignment, of existing schistosome/platyhelminth genomes. PMID:22024648
Exploring Dance Movement Data Using Sequence Alignment Methods
Chavoshi, Seyed Hossein; De Baets, Bernard; Neutens, Tijs; De Tré, Guy; Van de Weghe, Nico
2015-01-01
Despite the abundance of research on knowledge discovery from moving object databases, only a limited number of studies have examined the interaction between moving point objects in space over time. This paper describes a novel approach for measuring similarity in the interaction between moving objects. The proposed approach consists of three steps. First, we transform movement data into sequences of successive qualitative relations based on the Qualitative Trajectory Calculus (QTC). Second, sequence alignment methods are applied to measure the similarity between movement sequences. Finally, movement sequences are grouped based on similarity by means of an agglomerative hierarchical clustering method. The applicability of this approach is tested using movement data from samba and tango dancers. PMID:26181435
Computational methods in drug discovery
Leelananda, Sumudu P
2016-01-01
The process for drug discovery and development is challenging, time consuming and expensive. Computer-aided drug discovery (CADD) tools can act as a virtual shortcut, assisting in the expedition of this long process and potentially reducing the cost of research and development. Today CADD has become an effective and indispensable tool in therapeutic development. The human genome project has made available a substantial amount of sequence data that can be used in various drug discovery projects. Additionally, increasing knowledge of biological structures, as well as increasing computer power have made it possible to use computational methods effectively in various phases of the drug discovery and development pipeline. The importance of in silico tools is greater than ever before and has advanced pharmaceutical research. Here we present an overview of computational methods used in different facets of drug discovery and highlight some of the recent successes. In this review, both structure-based and ligand-based drug discovery methods are discussed. Advances in virtual high-throughput screening, protein structure prediction methods, protein–ligand docking, pharmacophore modeling and QSAR techniques are reviewed. PMID:28144341
Computational methods in drug discovery.
Leelananda, Sumudu P; Lindert, Steffen
2016-01-01
The process for drug discovery and development is challenging, time consuming and expensive. Computer-aided drug discovery (CADD) tools can act as a virtual shortcut, assisting in the expedition of this long process and potentially reducing the cost of research and development. Today CADD has become an effective and indispensable tool in therapeutic development. The human genome project has made available a substantial amount of sequence data that can be used in various drug discovery projects. Additionally, increasing knowledge of biological structures, as well as increasing computer power have made it possible to use computational methods effectively in various phases of the drug discovery and development pipeline. The importance of in silico tools is greater than ever before and has advanced pharmaceutical research. Here we present an overview of computational methods used in different facets of drug discovery and highlight some of the recent successes. In this review, both structure-based and ligand-based drug discovery methods are discussed. Advances in virtual high-throughput screening, protein structure prediction methods, protein-ligand docking, pharmacophore modeling and QSAR techniques are reviewed.
Biomarker Discovery and Mechanistic Studies of Prostate Cancer using Targeted Proteomic Approaches
2012-07-01
basigin in Drosophila ) tightly regulates cytoskeleton rearrangement in Drosophila melanogaster [23]. Based on the present results and the existing...from OligoEngine according to the manufac- turer’s instruction. Plasmids were amplified in DH5a cell and confirmed by sequencing . Subconfluent cell...electrophoresis and the results are shown in Figure 1 (Panel C). The RT-PCR products were cloned and subjected to DNA sequenc - ing. The sequencing
A biological compression model and its applications.
Cao, Minh Duc; Dix, Trevor I; Allison, Lloyd
2011-01-01
A biological compression model, expert model, is presented which is superior to existing compression algorithms in both compression performance and speed. The model is able to compress whole eukaryotic genomes. Most importantly, the model provides a framework for knowledge discovery from biological data. It can be used for repeat element discovery, sequence alignment and phylogenetic analysis. We demonstrate that the model can handle statistically biased sequences and distantly related sequences where conventional knowledge discovery tools often fail.
NASA Astrophysics Data System (ADS)
Yang, Rong-Sheng; Tang, Weijuan; Sheng, Huaming; Meng, Fanyu
2018-01-01
Discovery of novel insulin analogs as therapeutics has remained an active area of research. Compared with native human insulin, insulin analog molecules normally incorporate either covalent modifications or amino acid sequence variations. From the drug discovery and development perspective, methods for efficient and detailed characterization of these primary structural changes are very important. In this report, we demonstrate that proteinase K digestion coupled with UPLC-ESI-MS analysis provides a simple and rapid approach to characterize the modifications and sequence variations of insulin molecules. A commercially available proteinase K digestion kit was used to process recombinant human insulin (RHI), insulin glargine, and fluorescein isothiocynate-labeled recombinant human insulin (FITC-RHI) samples. The LC-MS data clearly showed that RHI and insulin glargine samples can be differentiated, and the FITC modifications in all three amine sites of the RHI molecule are well characterized. The end-to-end experiment and data interpretation was achieved within 60 min. This approach is fast and simple, and can be easily implemented in early drug discovery laboratories to facilitate research on more advanced insulin therapeutics. [Figure not available: see fulltext.
NASA Astrophysics Data System (ADS)
Yang, Rong-Sheng; Tang, Weijuan; Sheng, Huaming; Meng, Fanyu
2018-05-01
Discovery of novel insulin analogs as therapeutics has remained an active area of research. Compared with native human insulin, insulin analog molecules normally incorporate either covalent modifications or amino acid sequence variations. From the drug discovery and development perspective, methods for efficient and detailed characterization of these primary structural changes are very important. In this report, we demonstrate that proteinase K digestion coupled with UPLC-ESI-MS analysis provides a simple and rapid approach to characterize the modifications and sequence variations of insulin molecules. A commercially available proteinase K digestion kit was used to process recombinant human insulin (RHI), insulin glargine, and fluorescein isothiocynate-labeled recombinant human insulin (FITC-RHI) samples. The LC-MS data clearly showed that RHI and insulin glargine samples can be differentiated, and the FITC modifications in all three amine sites of the RHI molecule are well characterized. The end-to-end experiment and data interpretation was achieved within 60 min. This approach is fast and simple, and can be easily implemented in early drug discovery laboratories to facilitate research on more advanced insulin therapeutics. [Figure not available: see fulltext.
Agrawal, Neeraj J; Dykstra, Andrew; Yang, Jane; Yue, Hai; Nguyen, Xichdao; Kolvenbach, Carl; Angell, Nicolas
2018-05-01
Methionine oxidation in therapeutic antibodies can impact the product's stability, clinical efficacy, and safety and hence it is desirable to address the methionine oxidation liability during antibody discovery and development phase. Although the current experimental approaches can identify the oxidation-labile methionine residues, their application is limited mostly to the development phase. We demonstrate an in silico method that can be used to predict oxidation-labile residues based solely on the antibody sequence and structure information. Since antibody sequence information is available in the discovery phase, the in silico method can be applied very early on to identify the oxidation-labile methionine residues and subsequently address the oxidation liability. We believe that the in silico method for methionine oxidation liability assessment can aid in antibody discovery and development phase to address the liability in a more rational way. Copyright © 2018 American Pharmacists Association®. Published by Elsevier Inc. All rights reserved.
Maurer-Stroh, Sebastian; Gao, He; Han, Hao; Baeten, Lies; Schymkowitz, Joost; Rousseau, Frederic; Zhang, Louxin; Eisenhaber, Frank
2013-02-01
Data mining in protein databases, derivatives from more fundamental protein 3D structure and sequence databases, has considerable unearthed potential for the discovery of sequence motif--structural motif--function relationships as the finding of the U-shape (Huf-Zinc) motif, originally a small student's project, exemplifies. The metal ion zinc is critically involved in universal biological processes, ranging from protein-DNA complexes and transcription regulation to enzymatic catalysis and metabolic pathways. Proteins have evolved a series of motifs to specifically recognize and bind zinc ions. Many of these, so called zinc fingers, are structurally independent globular domains with discontinuous binding motifs made up of residues mostly far apart in sequence. Through a systematic approach starting from the BRIX structure fragment database, we discovered that there exists another predictable subset of zinc-binding motifs that not only have a conserved continuous sequence pattern but also share a characteristic local conformation, despite being included in totally different overall folds. While this does not allow general prediction of all Zn binding motifs, a HMM-based web server, Huf-Zinc, is available for prediction of these novel, as well as conventional, zinc finger motifs in protein sequences. The Huf-Zinc webserver can be freely accessed through this URL (http://mendel.bii.a-star.edu.sg/METHODS/hufzinc/).
NASA Astrophysics Data System (ADS)
Sarkes, Deborah A.; Hurley, Margaret M.; Coppock, Matthew B.; Farrell, Mikella E.; Pellegrino, Paul M.; Stratis-Cullum, Dimitra N.
2016-05-01
Peptides have emerged as viable alternatives to antibodies for molecular-based sensing due to their similarity in recognition ability despite their relative structural simplicity. Various methods for peptide capture reagent discovery exist, including phage display, yeast display, and bacterial display. One of the primary advantages of peptide discovery by bacterial display technology is the speed to candidate peptide capture agent, due to both rapid growth of bacteria and direct utilization of the sorted cells displaying each individual peptide for the subsequent round of biopanning. We have previously isolated peptide affinity reagents towards protective antigen of Bacillus anthracis using a commercially available automated magnetic sorting platform with improved enrichment as compared to manual magnetic sorting. In this work, we focus on adapting our automated biopanning method to a more challenging sort, to demonstrate the specificity possible with peptide capture agents. This was achieved using non-toxic, recombinant variants of ricin and abrin, RiVax and abrax, respectively, which are structurally similar Type II ribosomal inactivating proteins with significant sequence homology. After only two rounds of biopanning, enrichment of peptide capture candidates binding abrax but not RiVax was achieved as demonstrated by Fluorescence Activated Cell Sorting (FACS) studies. Further sorting optimization included negative sorting against RiVax, proper selection of autoMACS programs for specific sorting rounds, and using freshly made buffer and freshly thawed protein target for each round of biopanning for continued enrichment over all four rounds. Most of the resulting candidates from biopanning for abrax binding peptides were able to bind abrax but not RiVax, demonstrating that short peptide sequences can be highly specific even at this early discovery stage.
Couldrey, C; Keehan, M; Johnson, T; Tiplady, K; Winkelman, A; Littlejohn, M D; Scott, A; Kemper, K E; Hayes, B; Davis, S R; Spelman, R J
2017-07-01
Single nucleotide polymorphisms have been the DNA variant of choice for genomic prediction, largely because of the ease of single nucleotide polymorphism genotype collection. In contrast, structural variants (SV), which include copy number variants (CNV), translocations, insertions, and inversions, have eluded easy detection and characterization, particularly in nonhuman species. However, evidence increasingly shows that SV not only contribute a substantial proportion of genetic variation but also have significant influence on phenotypes. Here we present the discovery of CNV in a prominent New Zealand dairy bull using long-read PacBio (Pacific Biosciences, Menlo Park, CA) sequencing technology and the Sniffles SV discovery tool (version 0.0.1; https://github.com/fritzsedlazeck/Sniffles). The CNV identified from long reads were compared with CNV discovered in the same bull from Illumina sequencing using CNVnator (read depth-based tool; Illumina Inc., San Diego, CA) as a means of validation. Subsequently, further validation was undertaken using whole-genome Illumina sequencing of 556 cattle representing the wider New Zealand dairy cattle population. Very limited overlap was observed in CNV discovered from the 2 sequencing platforms, in part because of the differences in size of CNV detected. Only a few CNV were therefore able to be validated using this approach. However, the ability to use CNVnator to genotype the 557 cattle for copy number across all regions identified as putative CNV allowed a genome-wide assessment of transmission level of copy number based on pedigree. The more highly transmissible a putative CNV region was observed to be, the more likely the distribution of copy number was multimodal across the 557 sequenced animals. Furthermore, visual assessment of highly transmissible CNV regions provided evidence supporting the presence of CNV across the sequenced animals. This transmission-based approach was able to confirm a subset of CNV that segregates in the New Zealand dairy cattle population. Genome-wide identification and validation of CNV is an important step toward their inclusion in genomic selection strategies. The Authors. Published by the Federation of Animal Science Societies and Elsevier Inc. on behalf of the American Dairy Science Association®. This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/3.0/).
A statistical method for the detection of variants from next-generation resequencing of DNA pools.
Bansal, Vikas
2010-06-15
Next-generation sequencing technologies have enabled the sequencing of several human genomes in their entirety. However, the routine resequencing of complete genomes remains infeasible. The massive capacity of next-generation sequencers can be harnessed for sequencing specific genomic regions in hundreds to thousands of individuals. Sequencing-based association studies are currently limited by the low level of multiplexing offered by sequencing platforms. Pooled sequencing represents a cost-effective approach for studying rare variants in large populations. To utilize the power of DNA pooling, it is important to accurately identify sequence variants from pooled sequencing data. Detection of rare variants from pooled sequencing represents a different challenge than detection of variants from individual sequencing. We describe a novel statistical approach, CRISP [Comprehensive Read analysis for Identification of Single Nucleotide Polymorphisms (SNPs) from Pooled sequencing] that is able to identify both rare and common variants by using two approaches: (i) comparing the distribution of allele counts across multiple pools using contingency tables and (ii) evaluating the probability of observing multiple non-reference base calls due to sequencing errors alone. Information about the distribution of reads between the forward and reverse strands and the size of the pools is also incorporated within this framework to filter out false variants. Validation of CRISP on two separate pooled sequencing datasets generated using the Illumina Genome Analyzer demonstrates that it can detect 80-85% of SNPs identified using individual sequencing while achieving a low false discovery rate (3-5%). Comparison with previous methods for pooled SNP detection demonstrates the significantly lower false positive and false negative rates for CRISP. Implementation of this method is available at http://polymorphism.scripps.edu/~vbansal/software/CRISP/.
2012-01-01
Background Bread wheat, one of the world’s staple food crops, has the largest, highly repetitive and polyploid genome among the cereal crops. The wheat genome holds the key to crop genetic improvement against challenges such as climate change, environmental degradation, and water scarcity. To unravel the complex wheat genome, the International Wheat Genome Sequencing Consortium (IWGSC) is pursuing a chromosome- and chromosome arm-based approach to physical mapping and sequencing. Here we report on the use of a BAC library made from flow-sorted telosomic chromosome 3A short arm (t3AS) for marker development and analysis of sequence composition and comparative evolution of homoeologous genomes of hexaploid wheat. Results The end-sequencing of 9,984 random BACs from a chromosome arm 3AS-specific library (TaaCsp3AShA) generated 11,014,359 bp of high quality sequence from 17,591 BAC-ends with an average length of 626 bp. The sequence represents 3.2% of t3AS with an average DNA sequence read every 19 kb. Overall, 79% of the sequence consisted of repetitive elements, 1.38% as coding regions (estimated 2,850 genes) and another 19% of unknown origin. Comparative sequence analysis suggested that 70-77% of the genes present in both 3A and 3B were syntenic with model species. Among the transposable elements, gypsy/sabrina (12.4%) was the most abundant repeat and was significantly more frequent in 3A compared to homoeologous chromosome 3B. Twenty novel repetitive sequences were also identified using de novo repeat identification. BESs were screened to identify simple sequence repeats (SSR) and transposable element junctions. A total of 1,057 SSRs were identified with a density of one per 10.4 kb, and 7,928 junctions between transposable elements (TE) and other sequences were identified with a density of one per 1.39 kb. With the objective of enhancing the marker density of chromosome 3AS, oligonucleotide primers were successfully designed from 758 SSRs and 695 Insertion Site Based Polymorphisms (ISBPs). Of the 96 ISBP primer pairs tested, 28 (29%) were 3A-specific and compared to 17 (18%) for 96 SSRs. Conclusion This work reports on the use of wheat chromosome arm 3AS-specific BAC library for the targeted generation of sequence data from a particular region of the huge genome of wheat. A large quantity of sequences were generated from the A genome of hexaploid wheat for comparative genome analysis with homoeologous B and D genomes and other model grass genomes. Hundreds of molecular markers were developed from the 3AS arm-specific sequences; these and other sequences will be useful in gene discovery and physical mapping. PMID:22559868
The sequence of sequencers: The history of sequencing DNA
Heather, James M.; Chain, Benjamin
2016-01-01
Determining the order of nucleic acid residues in biological samples is an integral component of a wide variety of research applications. Over the last fifty years large numbers of researchers have applied themselves to the production of techniques and technologies to facilitate this feat, sequencing DNA and RNA molecules. This time-scale has witnessed tremendous changes, moving from sequencing short oligonucleotides to millions of bases, from struggling towards the deduction of the coding sequence of a single gene to rapid and widely available whole genome sequencing. This article traverses those years, iterating through the different generations of sequencing technology, highlighting some of the key discoveries, researchers, and sequences along the way. PMID:26554401
Schmidt, Olga; Hausmann, Axel; Cancian de Araujo, Bruno; Sutrisno, Hari; Peggie, Djunijanti; Schmidt, Stefan
2017-01-01
Here we present a general collecting and preparation protocol for DNA barcoding of Lepidoptera as part of large-scale rapid biodiversity assessment projects, and a comparison with alternative preserving and vouchering methods. About 98% of the sequenced specimens processed using the present collecting and preparation protocol yielded sequences with more than 500 base pairs. The study is based on the first outcomes of the Indonesian Biodiversity Discovery and Information System (IndoBioSys). IndoBioSys is a German-Indonesian research project that is conducted by the Museum für Naturkunde in Berlin and the Zoologische Staatssammlung München, in close cooperation with the Research Center for Biology - Indonesian Institute of Sciences (RCB-LIPI, Bogor).
Tian, Xin-Jie; Long, Yan; Wang, Jiao; Zhang, Jing-Wen; Wang, Yan-Yan; Li, Wei-Min; Peng, Yu-Fa; Yuan, Qian-Hua; Pei, Xin-Wu
2015-01-01
The perennial O. rufipogon (common wild rice), which is considered to be the ancestor of Asian cultivated rice species, contains many useful genetic resources, including drought resistance genes. However, few studies have identified the drought resistance and tissue-specific genes in common wild rice. In this study, transcriptome sequencing libraries were constructed, including drought-treated roots (DR) and control leaves (CL) and roots (CR). Using Illumina sequencing technology, we generated 16.75 million bases of high-quality sequence data for common wild rice and conducted de novo assembly and annotation of genes without prior genome information. These reads were assembled into 119,332 unigenes with an average length of 715 bp. A total of 88,813 distinct sequences (74.42% of unigenes) significantly matched known genes in the NCBI NT database. Differentially expressed gene (DEG) analysis showed that 3617 genes were up-regulated and 4171 genes were down-regulated in the CR library compared with the CL library. Among the DEGs, 535 genes were expressed in roots but not in shoots. A similar comparison between the DR and CR libraries showed that 1393 genes were up-regulated and 315 genes were down-regulated in the DR library compared with the CR library. Finally, 37 genes that were specifically expressed in roots were screened after comparing the DEGs identified in the above-described analyses. This study provides a transcriptome sequence resource for common wild rice plants and establishes a digital gene expression profile of wild rice plants under drought conditions using the assembled transcriptome data as a reference. Several tissue-specific and drought-stress-related candidate genes were identified, representing a fully characterized transcriptome and providing a valuable resource for genetic and genomic studies in plants.
A New Milky Way Satellite Discovered in the Subaru/Hyper Suprime-Cam Survey
NASA Astrophysics Data System (ADS)
Homma, Daisuke; Chiba, Masashi; Okamoto, Sakurako; Komiyama, Yutaka; Tanaka, Masayuki; Tanaka, Mikito; Ishigaki, Miho N.; Akiyama, Masayuki; Arimoto, Nobuo; Garmilla, José A.; Lupton, Robert H.; Strauss, Michael A.; Furusawa, Hisanori; Miyazaki, Satoshi; Murayama, Hitoshi; Nishizawa, Atsushi J.; Takada, Masahiro; Usuda, Tomonori; Wang, Shiang-Yu
2016-11-01
We report the discovery of a new ultra-faint dwarf satellite companion of the Milky Way (MW) based on the early survey data from the Hyper Suprime-Cam Subaru Strategic Program. This new satellite, Virgo I, which is located in the constellation of Virgo, has been identified as a statistically significant (5.5σ) spatial overdensity of star-like objects with a well-defined main sequence and red giant branch in the color-magnitude diagram. The significance of this overdensity increases to 10.8σ when the relevant isochrone filter is adopted for the search. Based on the distribution of the stars around the likely main-sequence turnoff at r ˜ 24 mag, the distance to Virgo I is estimated as 87 kpc, and its most likely absolute magnitude calculated from a Monte Carlo analysis is M V = -0.8 ± 0.9 mag. This stellar system has an extended spatial distribution with a half-light radius of {38}-11+12 pc, which clearly distinguishes it from a globular cluster with comparable luminosity. Thus, Virgo I is one of the faintest dwarf satellites known and is located beyond the reach of the Sloan Digital Sky Survey. This demonstrates the power of this survey program to identify very faint dwarf satellites. This discovery of Virgo I is based only on about 100 square degrees of data, thus a large number of faint dwarf satellites are likely to exist in the outer halo of the MW.
info-gibbs: a motif discovery algorithm that directly optimizes information content during sampling.
Defrance, Matthieu; van Helden, Jacques
2009-10-15
Discovering cis-regulatory elements in genome sequence remains a challenging issue. Several methods rely on the optimization of some target scoring function. The information content (IC) or relative entropy of the motif has proven to be a good estimator of transcription factor DNA binding affinity. However, these information-based metrics are usually used as a posteriori statistics rather than during the motif search process itself. We introduce here info-gibbs, a Gibbs sampling algorithm that efficiently optimizes the IC or the log-likelihood ratio (LLR) of the motif while keeping computation time low. The method compares well with existing methods like MEME, BioProspector, Gibbs or GAME on both synthetic and biological datasets. Our study shows that motif discovery techniques can be enhanced by directly focusing the search on the motif IC or the motif LLR. http://rsat.ulb.ac.be/rsat/info-gibbs
RNA Interference for Functional Genomics and Improvement of Cotton (Gossypium sp.)
Abdurakhmonov, Ibrokhim Y.; Ayubov, Mirzakamol S.; Ubaydullaeva, Khurshida A.; Buriev, Zabardast T.; Shermatov, Shukhrat E.; Ruziboev, Haydarali S.; Shapulatov, Umid M.; Saha, Sukumar; Ulloa, Mauricio; Yu, John Z.; Percy, Richard G.; Devor, Eric J.; Sharma, Govind C.; Sripathi, Venkateswara R.; Kumpatla, Siva P.; van der Krol, Alexander; Kater, Hake D.; Khamidov, Khakimdjan; Salikhov, Shavkat I.; Jenkins, Johnie N.; Abdukarimov, Abdusattor; Pepper, Alan E.
2016-01-01
RNA interference (RNAi), is a powerful new technology in the discovery of genetic sequence functions, and has become a valuable tool for functional genomics of cotton (Gossypium sp.). The rapid adoption of RNAi has replaced previous antisense technology. RNAi has aided in the discovery of function and biological roles of many key cotton genes involved in fiber development, fertility and somatic embryogenesis, resistance to important biotic and abiotic stresses, and oil and seed quality improvements as well as the key agronomic traits including yield and maturity. Here, we have comparatively reviewed seminal research efforts in previously used antisense approaches and currently applied breakthrough RNAi studies in cotton, analyzing developed RNAi methodologies, achievements, limitations, and future needs in functional characterizations of cotton genes. We also highlighted needed efforts in the development of RNAi-based cotton cultivars, and their safety and risk assessment, small and large-scale field trials, and commercialization. PMID:26941765
RNA Interference for Functional Genomics and Improvement of Cotton (Gossypium sp.).
Abdurakhmonov, Ibrokhim Y; Ayubov, Mirzakamol S; Ubaydullaeva, Khurshida A; Buriev, Zabardast T; Shermatov, Shukhrat E; Ruziboev, Haydarali S; Shapulatov, Umid M; Saha, Sukumar; Ulloa, Mauricio; Yu, John Z; Percy, Richard G; Devor, Eric J; Sharma, Govind C; Sripathi, Venkateswara R; Kumpatla, Siva P; van der Krol, Alexander; Kater, Hake D; Khamidov, Khakimdjan; Salikhov, Shavkat I; Jenkins, Johnie N; Abdukarimov, Abdusattor; Pepper, Alan E
2016-01-01
RNA interference (RNAi), is a powerful new technology in the discovery of genetic sequence functions, and has become a valuable tool for functional genomics of cotton (Gossypium sp.). The rapid adoption of RNAi has replaced previous antisense technology. RNAi has aided in the discovery of function and biological roles of many key cotton genes involved in fiber development, fertility and somatic embryogenesis, resistance to important biotic and abiotic stresses, and oil and seed quality improvements as well as the key agronomic traits including yield and maturity. Here, we have comparatively reviewed seminal research efforts in previously used antisense approaches and currently applied breakthrough RNAi studies in cotton, analyzing developed RNAi methodologies, achievements, limitations, and future needs in functional characterizations of cotton genes. We also highlighted needed efforts in the development of RNAi-based cotton cultivars, and their safety and risk assessment, small and large-scale field trials, and commercialization.
Martin, Guillaume E; Rousseau-Gueutin, Mathieu; Cordonnier, Solenn; Lima, Oscar; Michon-Coudouel, Sophie; Naquin, Delphine; de Carvalho, Julie Ferreira; Aïnouche, Malika; Salmon, Armel; Aïnouche, Abdelkader
2014-06-01
To date chloroplast genomes are available only for members of the non-protein amino acid-accumulating clade (NPAAA) Papilionoid lineages in the legume family (i.e. Millettioids, Robinoids and the 'inverted repeat-lacking clade', IRLC). It is thus very important to sequence plastomes from other lineages in order to better understand the unusual evolution observed in this model flowering plant family. To this end, the plastome of a lupine species, Lupinus luteus, was sequenced to represent the Genistoid lineage, a noteworthy but poorly studied legume group. The plastome of L. luteus was reconstructed using Roche-454 and Illumina next-generation sequencing. Its structure, repetitive sequences, gene content and sequence divergence were compared with those of other Fabaceae plastomes. PCR screening and sequencing were performed in other allied legumes in order to determine the origin of a large inversion identified in L. luteus. The first sequenced Genistoid plastome (L. luteus: 155 894 bp) resulted in the discovery of a 36-kb inversion, embedded within the already known 50-kb inversion in the large single-copy (LSC) region of the Papilionoideae. This inversion occurs at the base or soon after the Genistoid emergence, and most probably resulted from a flip-flop recombination between identical 29-bp inverted repeats within two trnS genes. Comparative analyses of the chloroplast gene content of L. luteus vs. Fabaceae and extra-Fabales plastomes revealed the loss of the plastid rpl22 gene, and its functional relocation to the nucleus was verified using lupine transcriptomic data. An investigation into the evolutionary rate of coding and non-coding sequences among legume plastomes resulted in the identification of remarkably variable regions. This study resulted in the discovery of a novel, major 36-kb inversion, specific to the Genistoids. Chloroplast mutational hotspots were also identified, which contain novel and potentially informative regions for molecular evolutionary studies at various taxonomic levels in the legumes. Taken together, the results provide new insights into the evolutionary landscape of the legume plastome. © The Author 2014. Published by Oxford University Press on behalf of the Annals of Botany Company. All rights reserved. For Permissions, please email: journals.permissions@oup.com.
2012-01-01
Background Discovery of functionally significant short, statistically overrepresented subsequence patterns (motifs) in a set of sequences is a challenging problem in bioinformatics. Oftentimes, not all sequences in the set contain a motif. These non-motif-containing sequences complicate the algorithmic discovery of motifs. Filtering the non-motif-containing sequences from the larger set of sequences while simultaneously determining the identity of the motif is, therefore, desirable and a non-trivial problem in motif discovery research. Results We describe MotifCatcher, a framework that extends the sensitivity of existing motif-finding tools by employing random sampling to effectively remove non-motif-containing sequences from the motif search. We developed two implementations of our algorithm; each built around a commonly used motif-finding tool, and applied our algorithm to three diverse chromatin immunoprecipitation (ChIP) data sets. In each case, the motif finder with the MotifCatcher extension demonstrated improved sensitivity over the motif finder alone. Our approach organizes candidate functionally significant discovered motifs into a tree, which allowed us to make additional insights. In all cases, we were able to support our findings with experimental work from the literature. Conclusions Our framework demonstrates that additional processing at the sequence entry level can significantly improve the performance of existing motif-finding tools. For each biological data set tested, we were able to propose novel biological hypotheses supported by experimental work from the literature. Specifically, in Escherichia coli, we suggested binding site motifs for 6 non-traditional LexA protein binding sites; in Saccharomyces cerevisiae, we hypothesize 2 disparate mechanisms for novel binding sites of the Cse4p protein; and in Halobacterium sp. NRC-1, we discoverd subtle differences in a general transcription factor (GTF) binding site motif across several data sets. We suggest that small differences in our discovered motif could confer specificity for one or more homologous GTF proteins. We offer a free implementation of the MotifCatcher software package at http://www.bme.ucdavis.edu/facciotti/resources_data/software/. PMID:23181585
SNP discovery in the bovine milk transcriptome using RNA-Seq technology.
Cánovas, Angela; Rincon, Gonzalo; Islas-Trejo, Alma; Wickramasinghe, Saumya; Medrano, Juan F
2010-12-01
High-throughput sequencing of RNA (RNA-Seq) was developed primarily to analyze global gene expression in different tissues. However, it also is an efficient way to discover coding SNPs. The objective of this study was to perform a SNP discovery analysis in the milk transcriptome using RNA-Seq. Seven milk samples from Holstein cows were analyzed by sequencing cDNAs using the Illumina Genome Analyzer system. We detected 19,175 genes expressed in milk samples corresponding to approximately 70% of the total number of genes analyzed. The SNP detection analysis revealed 100,734 SNPs in Holstein samples, and a large number of those corresponded to differences between the Holstein breed and the Hereford bovine genome assembly Btau4.0. The number of polymorphic SNPs within Holstein cows was 33,045. The accuracy of RNA-Seq SNP discovery was tested by comparing SNPs detected in a set of 42 candidate genes expressed in milk that had been resequenced earlier using Sanger sequencing technology. Seventy of 86 SNPs were detected using both RNA-Seq and Sanger sequencing technologies. The KASPar Genotyping System was used to validate unique SNPs found by RNA-Seq but not observed by Sanger technology. Our results confirm that analyzing the transcriptome using RNA-Seq technology is an efficient and cost-effective method to identify SNPs in transcribed regions. This study creates guidelines to maximize the accuracy of SNP discovery and prevention of false-positive SNP detection, and provides more than 33,000 SNPs located in coding regions of genes expressed during lactation that can be used to develop genotyping platforms to perform marker-trait association studies in Holstein cattle.
Lim, Chun Shen; Brown, Chris M
2017-01-01
Structured RNA elements may control virus replication, transcription and translation, and their distinct features are being exploited by novel antiviral strategies. Viral RNA elements continue to be discovered using combinations of experimental and computational analyses. However, the wealth of sequence data, notably from deep viral RNA sequencing, viromes, and metagenomes, necessitates computational approaches being used as an essential discovery tool. In this review, we describe practical approaches being used to discover functional RNA elements in viral genomes. In addition to success stories in new and emerging viruses, these approaches have revealed some surprising new features of well-studied viruses e.g., human immunodeficiency virus, hepatitis C virus, influenza, and dengue viruses. Some notable discoveries were facilitated by new comparative analyses of diverse viral genome alignments. Importantly, comparative approaches for finding RNA elements embedded in coding and non-coding regions differ. With the exponential growth of computer power we have progressed from stem-loop prediction on single sequences to cutting edge 3D prediction, and from command line to user friendly web interfaces. Despite these advances, many powerful, user friendly prediction tools and resources are underutilized by the virology community.
Lim, Chun Shen; Brown, Chris M.
2018-01-01
Structured RNA elements may control virus replication, transcription and translation, and their distinct features are being exploited by novel antiviral strategies. Viral RNA elements continue to be discovered using combinations of experimental and computational analyses. However, the wealth of sequence data, notably from deep viral RNA sequencing, viromes, and metagenomes, necessitates computational approaches being used as an essential discovery tool. In this review, we describe practical approaches being used to discover functional RNA elements in viral genomes. In addition to success stories in new and emerging viruses, these approaches have revealed some surprising new features of well-studied viruses e.g., human immunodeficiency virus, hepatitis C virus, influenza, and dengue viruses. Some notable discoveries were facilitated by new comparative analyses of diverse viral genome alignments. Importantly, comparative approaches for finding RNA elements embedded in coding and non-coding regions differ. With the exponential growth of computer power we have progressed from stem-loop prediction on single sequences to cutting edge 3D prediction, and from command line to user friendly web interfaces. Despite these advances, many powerful, user friendly prediction tools and resources are underutilized by the virology community. PMID:29354101
He, Ji; Dai, Xinbin; Zhao, Xuechun
2007-02-09
BLAST searches are widely used for sequence alignment. The search results are commonly adopted for various functional and comparative genomics tasks such as annotating unknown sequences, investigating gene models and comparing two sequence sets. Advances in sequencing technologies pose challenges for high-throughput analysis of large-scale sequence data. A number of programs and hardware solutions exist for efficient BLAST searching, but there is a lack of generic software solutions for mining and personalized management of the results. Systematically reviewing the results and identifying information of interest remains tedious and time-consuming. Personal BLAST Navigator (PLAN) is a versatile web platform that helps users to carry out various personalized pre- and post-BLAST tasks, including: (1) query and target sequence database management, (2) automated high-throughput BLAST searching, (3) indexing and searching of results, (4) filtering results online, (5) managing results of personal interest in favorite categories, (6) automated sequence annotation (such as NCBI NR and ontology-based annotation). PLAN integrates, by default, the Decypher hardware-based BLAST solution provided by Active Motif Inc. with a greatly improved efficiency over conventional BLAST software. BLAST results are visualized by spreadsheets and graphs and are full-text searchable. BLAST results and sequence annotations can be exported, in part or in full, in various formats including Microsoft Excel and FASTA. Sequences and BLAST results are organized in projects, the data publication levels of which are controlled by the registered project owners. In addition, all analytical functions are provided to public users without registration. PLAN has proved a valuable addition to the community for automated high-throughput BLAST searches, and, more importantly, for knowledge discovery, management and sharing based on sequence alignment results. The PLAN web interface is platform-independent, easily configurable and capable of comprehensive expansion, and user-intuitive. PLAN is freely available to academic users at http://bioinfo.noble.org/plan/. The source code for local deployment is provided under free license. Full support on system utilization, installation, configuration and customization are provided to academic users.
He, Ji; Dai, Xinbin; Zhao, Xuechun
2007-01-01
Background BLAST searches are widely used for sequence alignment. The search results are commonly adopted for various functional and comparative genomics tasks such as annotating unknown sequences, investigating gene models and comparing two sequence sets. Advances in sequencing technologies pose challenges for high-throughput analysis of large-scale sequence data. A number of programs and hardware solutions exist for efficient BLAST searching, but there is a lack of generic software solutions for mining and personalized management of the results. Systematically reviewing the results and identifying information of interest remains tedious and time-consuming. Results Personal BLAST Navigator (PLAN) is a versatile web platform that helps users to carry out various personalized pre- and post-BLAST tasks, including: (1) query and target sequence database management, (2) automated high-throughput BLAST searching, (3) indexing and searching of results, (4) filtering results online, (5) managing results of personal interest in favorite categories, (6) automated sequence annotation (such as NCBI NR and ontology-based annotation). PLAN integrates, by default, the Decypher hardware-based BLAST solution provided by Active Motif Inc. with a greatly improved efficiency over conventional BLAST software. BLAST results are visualized by spreadsheets and graphs and are full-text searchable. BLAST results and sequence annotations can be exported, in part or in full, in various formats including Microsoft Excel and FASTA. Sequences and BLAST results are organized in projects, the data publication levels of which are controlled by the registered project owners. In addition, all analytical functions are provided to public users without registration. Conclusion PLAN has proved a valuable addition to the community for automated high-throughput BLAST searches, and, more importantly, for knowledge discovery, management and sharing based on sequence alignment results. The PLAN web interface is platform-independent, easily configurable and capable of comprehensive expansion, and user-intuitive. PLAN is freely available to academic users at . The source code for local deployment is provided under free license. Full support on system utilization, installation, configuration and customization are provided to academic users. PMID:17291345
Faggionato, Davide; Serb, Jeanne M
2017-08-01
The rise of high-throughput RNA sequencing (RNA-seq) and de novo transcriptome assembly has had a transformative impact on how we identify and study genes in the phototransduction cascade of non-model organisms. But the advantage provided by the nearly automated annotation of RNA-seq transcriptomes may at the same time hinder the possibility for gene discovery and the discovery of new gene functions. For example, standard functional annotation based on domain homology to known protein families can only confirm group membership, not identify the emergence of new biochemical function. In this study, we show the importance of developing a strategy that circumvents the limitations of semiautomated annotation and apply this workflow to photosensitivity as a means to discover non-opsin photoreceptors. We hypothesize that non-opsin G-protein-coupled receptor (GPCR) proteins may have chromophore-binding lysines in locations that differ from opsin. Here, we provide the first case study describing non-opsin light-sensitive GPCRs based on tissue-specific RNA-seq data of the common bay scallop Argopecten irradians (Lamarck, 1819). Using a combination of sequence analysis and three-dimensional protein modeling, we identified two candidate proteins. We tested their photochemical properties and provide evidence showing that these two proteins incorporate 11-cis and/or all-trans retinal and react to light photochemically. Based on this case study, we demonstrate that there is potential for the discovery of new light-sensitive GPCRs, and we have developed a workflow that starts from RNA-seq assemblies to the discovery of new non-opsin, GPCR-based photopigments.
Devlin, Joseph C; Battaglia, Thomas; Blaser, Martin J; Ruggles, Kelly V
2018-06-25
Exploration of large data sets, such as shotgun metagenomic sequence or expression data, by biomedical experts and medical professionals remains as a major bottleneck in the scientific discovery process. Although tools for this purpose exist for 16S ribosomal RNA sequencing analysis, there is a growing but still insufficient number of user-friendly interactive visualization workflows for easy data exploration and figure generation. The development of such platforms for this purpose is necessary to accelerate and streamline microbiome laboratory research. We developed the Workflow Hub for Automated Metagenomic Exploration (WHAM!) as a web-based interactive tool capable of user-directed data visualization and statistical analysis of annotated shotgun metagenomic and metatranscriptomic data sets. WHAM! includes exploratory and hypothesis-based gene and taxa search modules for visualizing differences in microbial taxa and gene family expression across experimental groups, and for creating publication quality figures without the need for command line interface or in-house bioinformatics. WHAM! is an interactive and customizable tool for downstream metagenomic and metatranscriptomic analysis providing a user-friendly interface allowing for easy data exploration by microbiome and ecological experts to facilitate discovery in multi-dimensional and large-scale data sets.
VarDetect: a nucleotide sequence variation exploratory tool
Ngamphiw, Chumpol; Kulawonganunchai, Supasak; Assawamakin, Anunchai; Jenwitheesuk, Ekachai; Tongsima, Sissades
2008-01-01
Background Single nucleotide polymorphisms (SNPs) are the most commonly studied units of genetic variation. The discovery of such variation may help to identify causative gene mutations in monogenic diseases and SNPs associated with predisposing genes in complex diseases. Accurate detection of SNPs requires software that can correctly interpret chromatogram signals to nucleotides. Results We present VarDetect, a stand-alone nucleotide variation exploratory tool that automatically detects nucleotide variation from fluorescence based chromatogram traces. Accurate SNP base-calling is achieved using pre-calculated peak content ratios, and is enhanced by rules which account for common sequence reading artifacts. The proposed software tool is benchmarked against four other well-known SNP discovery software tools (PolyPhred, novoSNP, Genalys and Mutation Surveyor) using fluorescence based chromatograms from 15 human genes. These chromatograms were obtained from sequencing 16 two-pooled DNA samples; a total of 32 individual DNA samples. In this comparison of automatic SNP detection tools, VarDetect achieved the highest detection efficiency. Availability VarDetect is compatible with most major operating systems such as Microsoft Windows, Linux, and Mac OSX. The current version of VarDetect is freely available at . PMID:19091032
Brody, Thomas; Yavatkar, Amarendra S; Kuzin, Alexander; Kundu, Mukta; Tyson, Leonard J; Ross, Jermaine; Lin, Tzu-Yang; Lee, Chi-Hon; Awasaki, Takeshi; Lee, Tzumin; Odenwald, Ward F
2012-01-01
Background: Phylogenetic footprinting has revealed that cis-regulatory enhancers consist of conserved DNA sequence clusters (CSCs). Currently, there is no systematic approach for enhancer discovery and analysis that takes full-advantage of the sequence information within enhancer CSCs. Results: We have generated a Drosophila genome-wide database of conserved DNA consisting of >100,000 CSCs derived from EvoPrints spanning over 90% of the genome. cis-Decoder database search and alignment algorithms enable the discovery of functionally related enhancers. The program first identifies conserved repeat elements within an input enhancer and then searches the database for CSCs that score highly against the input CSC. Scoring is based on shared repeats as well as uniquely shared matches, and includes measures of the balance of shared elements, a diagnostic that has proven to be useful in predicting cis-regulatory function. To demonstrate the utility of these tools, a temporally-restricted CNS neuroblast enhancer was used to identify other functionally related enhancers and analyze their structural organization. Conclusions: cis-Decoder reveals that co-regulating enhancers consist of combinations of overlapping shared sequence elements, providing insights into the mode of integration of multiple regulating transcription factors. The database and accompanying algorithms should prove useful in the discovery and analysis of enhancers involved in any developmental process. Developmental Dynamics 241:169–189, 2012. © 2011 Wiley Periodicals, Inc. Key findings A genome-wide catalog of Drosophila conserved DNA sequence clusters. cis-Decoder discovers functionally related enhancers. Functionally related enhancers share balanced sequence element copy numbers. Many enhancers function during multiple phases of development. PMID:22174086
Castro, Rosario; Navelsaker, Sofie; Krasnov, Aleksei; Du Pasquier, Louis; Boudinot, Pierre
2017-10-01
During the last decades, gene and cDNA cloning identified TCR and Ig genes across vertebrates; genome sequencing of TCR and Ig loci in many species revealed the different organizations selected during evolution under the pressure of generating diverse repertoires of Ag receptors. By detecting clonotypes over a wide range of frequency, deep sequencing of Ig and TCR transcripts provides a new way to compare the structure of expressed repertoires in species of various sizes, at different stages of development, with different physiologies, and displaying multiple adaptations to the environment. In this review, we provide a short overview of the technologies currently used to produce global description of immune repertoires, describe how they have already been used in comparative immunology, and we discuss the future potential of such approaches. The development of these methodologies in new species holds promise for new discoveries concerning particular adaptations. As an example, understanding the development of adaptive immunity across metamorphosis in frogs has been made possible by such approaches. Repertoire sequencing is now widely used, not only in basic research but also in the context of immunotherapy and vaccination. Analysis of fish responses to pathogens and vaccines has already benefited from these methods. Finally, we also discuss potential advances based on repertoire sequencing of multigene families of immune sensors and effectors in invertebrates. Copyright © 2017 Elsevier Ltd. All rights reserved.
The sequence of sequencers: The history of sequencing DNA.
Heather, James M; Chain, Benjamin
2016-01-01
Determining the order of nucleic acid residues in biological samples is an integral component of a wide variety of research applications. Over the last fifty years large numbers of researchers have applied themselves to the production of techniques and technologies to facilitate this feat, sequencing DNA and RNA molecules. This time-scale has witnessed tremendous changes, moving from sequencing short oligonucleotides to millions of bases, from struggling towards the deduction of the coding sequence of a single gene to rapid and widely available whole genome sequencing. This article traverses those years, iterating through the different generations of sequencing technology, highlighting some of the key discoveries, researchers, and sequences along the way. Copyright © 2015 The Authors. Published by Elsevier Inc. All rights reserved.
2011-01-01
Background Rust fungi are biotrophic basidiomycete plant pathogens that cause major diseases on plants and trees world-wide, affecting agriculture and forestry. Their biotrophic nature precludes many established molecular genetic manipulations and lines of research. The generation of genomic resources for these microbes is leading to novel insights into biology such as interactions with the hosts and guiding directions for breakthrough research in plant pathology. Results To support gene discovery and gene model verification in the genome of the wheat leaf rust fungus, Puccinia triticina (Pt), we have generated Expressed Sequence Tags (ESTs) by sampling several life cycle stages. We focused on several spore stages and isolated haustorial structures from infected wheat, generating 17,684 ESTs. We produced sequences from both the sexual (pycniospores, aeciospores and teliospores) and asexual (germinated urediniospores) stages of the life cycle. From pycniospores and aeciospores, produced by infecting the alternate host, meadow rue (Thalictrum speciosissimum), 4,869 and 1,292 reads were generated, respectively. We generated 3,703 ESTs from teliospores produced on the senescent primary wheat host. Finally, we generated 6,817 reads from haustoria isolated from infected wheat as well as 1,003 sequences from germinated urediniospores. Along with 25,558 previously generated ESTs, we compiled a database of 13,328 non-redundant sequences (4,506 singlets and 8,822 contigs). Fungal genes were predicted using the EST version of the self-training GeneMarkS algorithm. To refine the EST database, we compared EST sequences by BLASTN to a set of 454 pyrosequencing-generated contigs and Sanger BAC-end sequences derived both from the Pt genome, and to ESTs and genome reads from wheat. A collection of 6,308 fungal genes was identified and compared to sequences of the cereal rusts, Puccinia graminis f. sp. tritici (Pgt) and stripe rust, P. striiformis f. sp. tritici (Pst), and poplar leaf rust Melampsora species, and the corn smut fungus, Ustilago maydis (Um). While extensive homologies were found, many genes appeared novel and species-specific; over 40% of genes did not match any known sequence in existing databases. Focusing on spore stages, direct comparison to Um identified potential functional homologs, possibly allowing heterologous functional analysis in that model fungus. Many potentially secreted protein genes were identified by similarity searches against genes and proteins of Pgt and Melampsora spp., revealing apparent orthologs. Conclusions The current set of Pt unigenes contributes to gene discovery in this major cereal pathogen and will be invaluable for gene model verification in the genome sequence. PMID:21435244
Romer, Katherine A.; Kayombya, Guy-Richard; Fraenkel, Ernest
2007-01-01
WebMOTIFS provides a web interface that facilitates the discovery and analysis of DNA-sequence motifs. Several studies have shown that the accuracy of motif discovery can be significantly improved by using multiple de novo motif discovery programs and using randomized control calculations to identify the most significant motifs or by using Bayesian approaches. WebMOTIFS makes it easy to apply these strategies. Using a single submission form, users can run several motif discovery programs and score, cluster and visualize the results. In addition, the Bayesian motif discovery program THEME can be used to determine the class of transcription factors that is most likely to regulate a set of sequences. Input can be provided as a list of gene or probe identifiers. Used with the default settings, WebMOTIFS accurately identifies biologically relevant motifs from diverse data in several species. WebMOTIFS is freely available at http://fraenkel.mit.edu/webmotifs. PMID:17584794
Hausmann, Axel; Cancian de Araujo, Bruno; Sutrisno, Hari; Peggie, Djunijanti; Schmidt, Stefan
2017-01-01
Abstract Here we present a general collecting and preparation protocol for DNA barcoding of Lepidoptera as part of large-scale rapid biodiversity assessment projects, and a comparison with alternative preserving and vouchering methods. About 98% of the sequenced specimens processed using the present collecting and preparation protocol yielded sequences with more than 500 base pairs. The study is based on the first outcomes of the Indonesian Biodiversity Discovery and Information System (IndoBioSys). IndoBioSys is a German-Indonesian research project that is conducted by the Museum für Naturkunde in Berlin and the Zoologische Staatssammlung München, in close cooperation with the Research Center for Biology – Indonesian Institute of Sciences (RCB-LIPI, Bogor). PMID:29134041
A multi-model approach to nucleic acid-based drug development.
Gautherot, Isabelle; Sodoyer, Regís
2004-01-01
With the advent of functional genomics and the shift of interest towards sequence-based therapeutics, the past decades have witnessed intense research efforts on nucleic acid-mediated gene regulation technologies. Today, RNA interference is emerging as a groundbreaking discovery, holding promise for development of genetic modulators of unprecedented potency. Twenty-five years after the discovery of antisense RNA and ribozymes, gene control therapeutics are still facing developmental difficulties, with only one US FDA-approved antisense drug currently available in the clinic. Limited predictability of target site selection models is recognized as one major stumbling block that is shared by all of the so-called complementary technologies, slowing the progress towards a commercial product. Currently employed in vitro systems for target site selection include RNAse H-based mapping, antisense oligonucleotide microarrays, and functional screening approaches using libraries of catalysts with randomized target-binding arms to identify optimal ribozyme/DNAzyme cleavage sites. Individually, each strategy has its drawbacks from a drug development perspective. Utilization of message-modulating sequences as therapeutic agents requires that their action on a given target transcript meets criteria of potency and selectivity in the natural physiological environment. In addition to sequence-dependent characteristics, other factors will influence annealing reactions and duplex stability, as well as nucleic acid-mediated catalysis. Parallel consideration of physiological selection systems thus appears essential for screening for nucleic acid compounds proposed for therapeutic applications. Cellular message-targeting studies face issues relating to efficient nucleic acid delivery and appropriate analysis of response. For reliability and simplicity, prokaryotic systems can provide a rapid and cost-effective means of studying message targeting under pseudo-cellular conditions, but such approaches also have limitations. To streamline nucleic acid drug discovery, we propose a multi-model strategy integrating high-throughput-adapted bacterial screening, followed by reporter-based and/or natural cellular models and potentially also in vitro assays for characterization of the most promising candidate sequences, before final in vivo testing.
Regulatory sequence analysis tools.
van Helden, Jacques
2003-07-01
The web resource Regulatory Sequence Analysis Tools (RSAT) (http://rsat.ulb.ac.be/rsat) offers a collection of software tools dedicated to the prediction of regulatory sites in non-coding DNA sequences. These tools include sequence retrieval, pattern discovery, pattern matching, genome-scale pattern matching, feature-map drawing, random sequence generation and other utilities. Alternative formats are supported for the representation of regulatory motifs (strings or position-specific scoring matrices) and several algorithms are proposed for pattern discovery. RSAT currently holds >100 fully sequenced genomes and these data are regularly updated from GenBank.
Gao, Guangtu; Nome, Torfinn; Pearse, Devon E; Moen, Thomas; Naish, Kerry A; Thorgaard, Gary H; Lien, Sigbjørn; Palti, Yniv
2018-01-01
Single-nucleotide polymorphisms (SNPs) are highly abundant markers, which are broadly distributed in animal genomes. For rainbow trout ( Oncorhynchus mykiss ), SNP discovery has been previously done through sequencing of restriction-site associated DNA (RAD) libraries, reduced representation libraries (RRL) and RNA sequencing. Recently we have performed high coverage whole genome resequencing with 61 unrelated samples, representing a wide range of rainbow trout and steelhead populations, with 49 new samples added to 12 aquaculture samples from AquaGen (Norway) that we previously used for SNP discovery. Of the 49 new samples, 11 were double-haploid lines from Washington State University (WSU) and 38 represented wild and hatchery populations from a wide range of geographic distribution and with divergent migratory phenotypes. We then mapped the sequences to the new rainbow trout reference genome assembly (GCA_002163495.1) which is based on the Swanson YY doubled haploid line. Variant calling was conducted with FreeBayes and SAMtools mpileup , followed by filtering of SNPs based on quality score, sequence complexity, read depth on the locus, and number of genotyped samples. Results from the two variant calling programs were compared and genotypes of the double haploid samples were used for detecting and filtering putative paralogous sequence variants (PSVs) and multi-sequence variants (MSVs). Overall, 30,302,087 SNPs were identified on the rainbow trout genome 29 chromosomes and 1,139,018 on unplaced scaffolds, with 4,042,723 SNPs having high minor allele frequency (MAF > 0.25). The average SNP density on the chromosomes was one SNP per 64 bp, or 15.6 SNPs per 1 kb. Results from the phylogenetic analysis that we conducted indicate that the SNP markers contain enough population-specific polymorphisms for recovering population relationships despite the small sample size used. Intra-Population polymorphism assessment revealed high level of polymorphism and heterozygosity within each population. We also provide functional annotation based on the genome position of each SNP and evaluate the use of clonal lines for filtering of PSVs and MSVs. These SNPs form a new database, which provides an important resource for a new high density SNP array design and for other SNP genotyping platforms used for genetic and genomics studies of this iconic salmonid fish species.
Zhu, Zhikai; Su, Xiaomeng; Go, Eden P; Desaire, Heather
2014-09-16
Glycoproteins are biologically significant large molecules that participate in numerous cellular activities. In order to obtain site-specific protein glycosylation information, intact glycopeptides, with the glycan attached to the peptide sequence, are characterized by tandem mass spectrometry (MS/MS) methods such as collision-induced dissociation (CID) and electron transfer dissociation (ETD). While several emerging automated tools are developed, no consensus is present in the field about the best way to determine the reliability of the tools and/or provide the false discovery rate (FDR). A common approach to calculate FDRs for glycopeptide analysis, adopted from the target-decoy strategy in proteomics, employs a decoy database that is created based on the target protein sequence database. Nonetheless, this approach is not optimal in measuring the confidence of N-linked glycopeptide matches, because the glycopeptide data set is considerably smaller compared to that of peptides, and the requirement of a consensus sequence for N-glycosylation further limits the number of possible decoy glycopeptides tested in a database search. To address the need to accurately determine FDRs for automated glycopeptide assignments, we developed GlycoPep Evaluator (GPE), a tool that helps to measure FDRs in identifying glycopeptides without using a decoy database. GPE generates decoy glycopeptides de novo for every target glycopeptide, in a 1:20 target-to-decoy ratio. The decoys, along with target glycopeptides, are scored against the ETD data, from which FDRs can be calculated accurately based on the number of decoy matches and the ratio of the number of targets to decoys, for small data sets. GPE is freely accessible for download and can work with any search engine that interprets ETD data of N-linked glycopeptides. The software is provided at https://desairegroup.ku.edu/research.
The impact of genetics on future drug discovery in schizophrenia.
Matsumoto, Mitsuyuki; Walton, Noah M; Yamada, Hiroshi; Kondo, Yuji; Marek, Gerard J; Tajinda, Katsunori
2017-07-01
Failures of investigational new drugs (INDs) for schizophrenia have left huge unmet medical needs for patients. Given the recent lackluster results, it is imperative that new drug discovery approaches (and resultant drug candidates) target pathophysiological alterations that are shared in specific, stratified patient populations that are selected based on pre-identified biological signatures. One path to implementing this paradigm is achievable by leveraging recent advances in genetic information and technologies. Genome-wide exome sequencing and meta-analysis of single nucleotide polymorphism (SNP)-based association studies have already revealed rare deleterious variants and SNPs in patient populations. Areas covered: Herein, the authors review the impact that genetics have on the future of schizophrenia drug discovery. The high polygenicity of schizophrenia strongly indicates that this disease is biologically heterogeneous so the identification of unique subgroups (by patient stratification) is becoming increasingly necessary for future investigational new drugs. Expert opinion: The authors propose a pathophysiology-based stratification of genetically-defined subgroups that share deficits in particular biological pathways. Existing tools, including lower-cost genomic sequencing and advanced gene-editing technology render this strategy ever more feasible. Genetically complex psychiatric disorders such as schizophrenia may also benefit from synergistic research with simpler monogenic disorders that share perturbations in similar biological pathways.
Zhao, Zhenqing; Gu, Honghui; Sheng, Xiaoguang; Yu, Huifang; Wang, Jiansheng; Huang, Long; Wang, Dan
2016-01-01
Molecular markers and genetic maps play an important role in plant genomics and breeding studies. Cauliflower is an important and distinctive vegetable; however, very few molecular resources have been reported for this species. In this study, a novel, specific-locus amplified fragment (SLAF) sequencing strategy was employed for large-scale single nucleotide polymorphism (SNP) discovery and high-density genetic map construction in a double-haploid, segregating population of cauliflower. A total of 12.47 Gb raw data containing 77.92 M pair-end reads were obtained after processing and 6815 polymorphic SLAFs between the two parents were detected. The average sequencing depths reached 52.66-fold for the female parent and 49.35-fold for the male parent. Subsequently, these polymorphic SLAFs were used to genotype the population and further filtered based on several criteria to construct a genetic linkage map of cauliflower. Finally, 1776 high-quality SLAF markers, including 2741 SNPs, constituted the linkage map with average data integrity of 95.68%. The final map spanned a total genetic length of 890.01 cM with an average marker interval of 0.50 cM, and covered 364.9 Mb of the reference genome. The markers and genetic map developed in this study could provide an important foundation not only for comparative genomics studies within Brassica oleracea species but also for quantitative trait loci identification and molecular breeding of cauliflower. PMID:27047515
Kwon, Andrew T.; Chou, Alice Yi; Arenillas, David J.; Wasserman, Wyeth W.
2011-01-01
We performed a genome-wide scan for muscle-specific cis-regulatory modules (CRMs) using three computational prediction programs. Based on the predictions, 339 candidate CRMs were tested in cell culture with NIH3T3 fibroblasts and C2C12 myoblasts for capacity to direct selective reporter gene expression to differentiated C2C12 myotubes. A subset of 19 CRMs validated as functional in the assay. The rate of predictive success reveals striking limitations of computational regulatory sequence analysis methods for CRM discovery. Motif-based methods performed no better than predictions based only on sequence conservation. Analysis of the properties of the functional sequences relative to inactive sequences identifies nucleotide sequence composition can be an important characteristic to incorporate in future methods for improved predictive specificity. Muscle-related TFBSs predicted within the functional sequences display greater sequence conservation than non-TFBS flanking regions. Comparison with recent MyoD and histone modification ChIP-Seq data supports the validity of the functional regions. PMID:22144875
Discovery of functional elements in 12 Drosophila genomes using evolutionary signatures
Stark, Alexander; Lin, Michael F.; Kheradpour, Pouya; Pedersen, Jakob S.; Parts, Leopold; Carlson, Joseph W.; Crosby, Madeline A.; Rasmussen, Matthew D.; Roy, Sushmita; Deoras, Ameya N.; Ruby, J. Graham; Brennecke, Julius; Hodges, Emily; Hinrichs, Angie S.; Caspi, Anat; Paten, Benedict; Park, Seung-Won; Han, Mira V.; Maeder, Morgan L.; Polansky, Benjamin J.; Robson, Bryanne E.; Aerts, Stein; van Helden, Jacques; Hassan, Bassem; Gilbert, Donald G.; Eastman, Deborah A.; Rice, Michael; Weir, Michael; Hahn, Matthew W.; Park, Yongkyu; Dewey, Colin N.; Pachter, Lior; Kent, W. James; Haussler, David; Lai, Eric C.; Bartel, David P.; Hannon, Gregory J.; Kaufman, Thomas C.; Eisen, Michael B.; Clark, Andrew G.; Smith, Douglas; Celniker, Susan E.; Gelbart, William M.; Kellis, Manolis
2008-01-01
Sequencing of multiple related species followed by comparative genomics analysis constitutes a powerful approach for the systematic understanding of any genome. Here, we use the genomes of 12 Drosophila species for the de novo discovery of functional elements in the fly. Each type of functional element shows characteristic patterns of change, or ‘evolutionary signatures’, dictated by its precise selective constraints. Such signatures enable recognition of new protein-coding genes and exons, spurious and incorrect gene annotations, and numerous unusual gene structures, including abundant stop-codon readthrough. Similarly, we predict non-protein-coding RNA genes and structures, and new microRNA (miRNA) genes. We provide evidence of miRNA processing and functionality from both hairpin arms and both DNA strands. We identify several classes of pre- and post-transcriptional regulatory motifs, and predict individual motif instances with high confidence. We also study how discovery power scales with the divergence and number of species compared, and we provide general guidelines for comparative studies. PMID:17994088
Kaplan, Oktay I; Berber, Burak; Hekim, Nezih; Doluca, Osman
2016-11-02
Many studies show that short non-coding sequences are widely conserved among regulatory elements. More and more conserved sequences are being discovered since the development of next generation sequencing technology. A common approach to identify conserved sequences with regulatory roles relies on topological changes such as hairpin formation at the DNA or RNA level. G-quadruplexes, non-canonical nucleic acid topologies with little established biological roles, are increasingly considered for conserved regulatory element discovery. Since the tertiary structure of G-quadruplexes is strongly dependent on the loop sequence which is disregarded by the generally accepted algorithm, we hypothesized that G-quadruplexes with similar topology and, indirectly, similar interaction patterns, can be determined using phylogenetic clustering based on differences in the loop sequences. Phylogenetic analysis of 52 G-quadruplex forming sequences in the Escherichia coli genome revealed two conserved G-quadruplex motifs with a potential regulatory role. Further analysis revealed that both motifs tend to form hairpins and G quadruplexes, as supported by circular dichroism studies. The phylogenetic analysis as described in this work can greatly improve the discovery of functional G-quadruplex structures and may explain unknown regulatory patterns. © The Author(s) 2016. Published by Oxford University Press on behalf of Nucleic Acids Research.
Verhoeven, Joost Theo Petra; Canuti, Marta; Munro, Hannah J; Dufour, Suzanne C; Lang, Andrew S
2018-04-19
High-throughput sequencing (HTS) technologies are becoming increasingly important within microbiology research, but aspects of library preparation, such as high cost per sample or strict input requirements, make HTS difficult to implement in some niche applications and for research groups on a budget. To answer these necessities, we developed ViDiT, a customizable, PCR-based, extremely low-cost (<5 US dollars per sample) and versatile library preparation method, and CACTUS, an analysis pipeline designed to rely on cloud computing power to generate high-quality data from ViDiT-based experiments without the need of expensive servers. We demonstrate here the versatility and utility of these methods within three fields of microbiology: virus discovery, amplicon-based viral genome sequencing and microbiome profiling. ViDiT-CACTUS allowed the identification of viral fragments from 25 different viral families from 36 oropharyngeal-cloacal swabs collected from wild birds, the sequencing of three almost complete genomes of avian influenza A viruses (>90% coverage), and the characterization and functional profiling of the complete microbial diversity (bacteria, archaea, viruses) within a deep-sea carnivorous sponge. ViDiT-CACTUS demonstrated its validity in a wide range of microbiology applications and its simplicity and modularity make it easily implementable in any molecular biology laboratory, towards various research goals.
Discovery of a bovine enterovirus in alpaca.
McClenahan, Shasta D; Scherba, Gail; Borst, Luke; Fredrickson, Richard L; Krause, Philip R; Uhlenhaut, Christine
2013-01-01
A cytopathic virus was isolated using Madin-Darby bovine kidney (MDBK) cells from lung tissue of alpaca that died of a severe respiratory infection. To identify the virus, the infected cell culture supernatant was enriched for virus particles and a generic, PCR-based method was used to amplify potential viral sequences. Genomic sequence data of the alpaca isolate was obtained and compared with sequences of known viruses. The new alpaca virus sequence was most similar to recently designated Enterovirus species F, previously bovine enterovirus (BEVs), viruses that are globally prevalent in cattle, although they appear not to cause significant disease. Because bovine enteroviruses have not been previously reported in U.S. alpaca, we suspect that this type of infection is fairly rare, and in this case appeared not to spread beyond the original outbreak. The capsid sequence of the detected virus had greatest homology to Enterovirus F type 1 (indicating that the virus should be considered a member of serotype 1), but the virus had greater homology in 2A protease sequence to type 3, suggesting that it may have been a recombinant. Identifying pathogens that infect a new host species for the first time can be challenging. As the disease in a new host species may be quite different from that in the original or natural host, the pathogen may not be suspected based on the clinical presentation, delaying diagnosis. Although this virus replicated in MDBK cells, existing standard culture and molecular methods could not identify it. In this case, a highly sensitive generic PCR-based pathogen-detection method was used to identify this pathogen.
Discovery of a Bovine Enterovirus in Alpaca
McClenahan, Shasta D.; Scherba, Gail; Borst, Luke; Fredrickson, Richard L.; Krause, Philip R.; Uhlenhaut, Christine
2013-01-01
A cytopathic virus was isolated using Madin-Darby bovine kidney (MDBK) cells from lung tissue of alpaca that died of a severe respiratory infection. To identify the virus, the infected cell culture supernatant was enriched for virus particles and a generic, PCR-based method was used to amplify potential viral sequences. Genomic sequence data of the alpaca isolate was obtained and compared with sequences of known viruses. The new alpaca virus sequence was most similar to recently designated Enterovirus species F, previously bovine enterovirus (BEVs), viruses that are globally prevalent in cattle, although they appear not to cause significant disease. Because bovine enteroviruses have not been previously reported in U.S. alpaca, we suspect that this type of infection is fairly rare, and in this case appeared not to spread beyond the original outbreak. The capsid sequence of the detected virus had greatest homology to Enterovirus F type 1 (indicating that the virus should be considered a member of serotype 1), but the virus had greater homology in 2A protease sequence to type 3, suggesting that it may have been a recombinant. Identifying pathogens that infect a new host species for the first time can be challenging. As the disease in a new host species may be quite different from that in the original or natural host, the pathogen may not be suspected based on the clinical presentation, delaying diagnosis. Although this virus replicated in MDBK cells, existing standard culture and molecular methods could not identify it. In this case, a highly sensitive generic PCR-based pathogen-detection method was used to identify this pathogen. PMID:23950875
Discovery and mapping of single feature polymorphisms in wheat using Affymetrix arrays
Bernardo, Amy N; Bradbury, Peter J; Ma, Hongxiang; Hu, Shengwa; Bowden, Robert L; Buckler, Edward S; Bai, Guihua
2009-01-01
Background Wheat (Triticum aestivum L.) is a staple food crop worldwide. The wheat genome has not yet been sequenced due to its huge genome size (~17,000 Mb) and high levels of repetitive sequences; the whole genome sequence may not be expected in the near future. Available linkage maps have low marker density due to limitation in available markers; therefore new technologies that detect genome-wide polymorphisms are still needed to discover a large number of new markers for construction of high-resolution maps. A high-resolution map is a critical tool for gene isolation, molecular breeding and genomic research. Single feature polymorphism (SFP) is a new microarray-based type of marker that is detected by hybridization of DNA or cRNA to oligonucleotide probes. This study was conducted to explore the feasibility of using the Affymetrix GeneChip to discover and map SFPs in the large hexaploid wheat genome. Results Six wheat varieties of diverse origins (Ning 7840, Clark, Jagger, Encruzilhada, Chinese Spring, and Opata 85) were analyzed for significant probe by variety interactions and 396 probe sets with SFPs were identified. A subset of 164 unigenes was sequenced and 54% showed polymorphism within probes. Microarray analysis of 71 recombinant inbred lines from the cross Ning 7840/Clark identified 955 SFPs and 877 of them were mapped together with 269 simple sequence repeat markers. The SFPs were randomly distributed within a chromosome but were unevenly distributed among different genomes. The B genome had the most SFPs, and the D genome had the least. Map positions of a selected set of SFPs were validated by mapping single nucleotide polymorphism using SNaPshot and comparing with expressed sequence tags mapping data. Conclusion The Affymetrix array is a cost-effective platform for SFP discovery and SFP mapping in wheat. The new high-density map constructed in this study will be a useful tool for genetic and genomic research in wheat. PMID:19480702
Manivannan, Abinaya; Kim, Jin-Hee; Yang, Eun-Young; Ahn, Yul-Kyun; Lee, Eun-Su; Choi, Sena; Kim, Do-Sun
2018-01-01
Pepper is an economically important horticultural plant that has been widely used for its pungency and spicy taste in worldwide cuisines. Therefore, the domestication of pepper has been carried out since antiquity. Owing to meet the growing demand for pepper with high quality, organoleptic property, nutraceutical contents, and disease tolerance, genomics assisted breeding techniques can be incorporated to develop novel pepper varieties with desired traits. The application of next-generation sequencing (NGS) approaches has reformed the plant breeding technology especially in the area of molecular marker assisted breeding. The availability of genomic information aids in the deeper understanding of several molecular mechanisms behind the vital physiological processes. In addition, the NGS methods facilitate the genome-wide discovery of DNA based markers linked to key genes involved in important biological phenomenon. Among the molecular markers, single nucleotide polymorphism (SNP) indulges various benefits in comparison with other existing DNA based markers. The present review concentrates on the impact of NGS approaches in the discovery of useful SNP markers associated with pungency and disease resistance in pepper. The information provided in the current endeavor can be utilized for the betterment of pepper breeding in future.
Promoter Sequences Prediction Using Relational Association Rule Mining
Czibula, Gabriela; Bocicor, Maria-Iuliana; Czibula, Istvan Gergely
2012-01-01
In this paper we are approaching, from a computational perspective, the problem of promoter sequences prediction, an important problem within the field of bioinformatics. As the conditions for a DNA sequence to function as a promoter are not known, machine learning based classification models are still developed to approach the problem of promoter identification in the DNA. We are proposing a classification model based on relational association rules mining. Relational association rules are a particular type of association rules and describe numerical orderings between attributes that commonly occur over a data set. Our classifier is based on the discovery of relational association rules for predicting if a DNA sequence contains or not a promoter region. An experimental evaluation of the proposed model and comparison with similar existing approaches is provided. The obtained results show that our classifier overperforms the existing techniques for identifying promoter sequences, confirming the potential of our proposal. PMID:22563233
Systematic and fully automated identification of protein sequence patterns.
Hart, R K; Royyuru, A K; Stolovitzky, G; Califano, A
2000-01-01
We present an efficient algorithm to systematically and automatically identify patterns in protein sequence families. The procedure is based on the Splash deterministic pattern discovery algorithm and on a framework to assess the statistical significance of patterns. We demonstrate its application to the fully automated discovery of patterns in 974 PROSITE families (the complete subset of PROSITE families which are defined by patterns and contain DR records). Splash generates patterns with better specificity and undiminished sensitivity, or vice versa, in 28% of the families; identical statistics were obtained in 48% of the families, worse statistics in 15%, and mixed behavior in the remaining 9%. In about 75% of the cases, Splash patterns identify sequence sites that overlap more than 50% with the corresponding PROSITE pattern. The procedure is sufficiently rapid to enable its use for daily curation of existing motif and profile databases. Third, our results show that the statistical significance of discovered patterns correlates well with their biological significance. The trypsin subfamily of serine proteases is used to illustrate this method's ability to exhaustively discover all motifs in a family that are statistically and biologically significant. Finally, we discuss applications of sequence patterns to multiple sequence alignment and the training of more sensitive score-based motif models, akin to the procedure used by PSI-BLAST. All results are available at httpl//www.research.ibm.com/spat/.
Hall, Richard J; Wang, Jing; Todd, Angela K; Bissielo, Ange B; Yen, Seiha; Strydom, Hugo; Moore, Nicole E; Ren, Xiaoyun; Huang, Q Sue; Carter, Philip E; Peacey, Matthew
2014-01-01
The discovery of new or divergent viruses using metagenomics and high-throughput sequencing has become more commonplace. The preparation of a sample is known to have an effect on the representation of virus sequences within the metagenomic dataset yet comparatively little attention has been given to this. Physical enrichment techniques are often applied to samples to increase the number of viral sequences and therefore enhance the probability of detection. With the exception of virus ecology studies, there is a paucity of information available to researchers on the type of sample preparation required for a viral metagenomic study that seeks to identify an aetiological virus in an animal or human diagnostic sample. A review of published virus discovery studies revealed the most commonly used enrichment methods, that were usually quick and simple to implement, namely low-speed centrifugation, filtration, nuclease-treatment (or combinations of these) which have been routinely used but often without justification. These were applied to a simple and well-characterised artificial sample composed of bacterial and human cells, as well as DNA (adenovirus) and RNA viruses (influenza A and human enterovirus), being either non-enveloped capsid or enveloped viruses. The effect of the enrichment method was assessed by both quantitative real-time PCR and metagenomic analysis that incorporated an amplification step. Reductions in the absolute quantities of bacteria and human cells were observed for each method as determined by qPCR, but the relative abundance of viral sequences in the metagenomic dataset remained largely unchanged. A 3-step method of centrifugation, filtration and nuclease-treatment showed the greatest increase in the proportion of viral sequences. This study provides a starting point for the selection of a purification method in future virus discovery studies, and highlights the need for more data to validate the effect of enrichment methods on different sample types, amplification, bioinformatics approaches and sequencing platforms. This study also highlights the potential risks that may attend selection of a virus enrichment method without any consideration for the sample type being investigated. Copyright © 2013 The Authors. Published by Elsevier B.V. All rights reserved.
Introducing Aliphatic Substitution with a Discovery Experiment Using Competing Electrophiles
ERIC Educational Resources Information Center
Curran, Timothy P.; Mostovoy, Amelia J.; Curran, Margaret E.; Berger, Clara
2016-01-01
A facile, discovery-based experiment is described that introduces aliphatic substitution in an introductory undergraduate organic chemistry curriculum. Unlike other discovery-based experiments that examine substitution using two competing nucleophiles with a single electrophile, this experiment compares two isomeric, competing electrophiles…
2011-01-01
Background BAC-based physical maps provide for sequencing across an entire genome or a selected sub-genomic region of biological interest. Such a region can be approached with next-generation whole-genome sequencing and assembly as if it were an independent small genome. Using the minimum tiling path as a guide, specific BAC clones representing the prioritized genomic interval are selected, pooled, and used to prepare a sequencing library. Results This pooled BAC approach was taken to sequence and assemble a QTL-rich region, of ~3 Mbp and represented by twenty-seven BACs, on linkage group 5 of the Theobroma cacao cv. Matina 1-6 genome. Using various mixtures of read coverages from paired-end and linear 454 libraries, multiple assemblies of varied quality were generated. Quality was assessed by comparing the assembly of 454 reads with a subset of ten BACs individually sequenced and assembled using Sanger reads. A mixture of reads optimal for assembly was identified. We found, furthermore, that a quality assembly suitable for serving as a reference genome template could be obtained even with a reduced depth of sequencing coverage. Annotation of the resulting assembly revealed several genes potentially responsible for three T. cacao traits: black pod disease resistance, bean shape index, and pod weight. Conclusions Our results, as with other pooled BAC sequencing reports, suggest that pooling portions of a minimum tiling path derived from a BAC-based physical map is an effective method to target sub-genomic regions for sequencing. While we focused on a single QTL region, other QTL regions of importance could be similarly sequenced allowing for biological discovery to take place before a high quality whole-genome assembly is completed. PMID:21794110
Feltus, Frank A; Saski, Christopher A; Mockaitis, Keithanne; Haiminen, Niina; Parida, Laxmi; Smith, Zachary; Ford, James; Staton, Margaret E; Ficklin, Stephen P; Blackmon, Barbara P; Cheng, Chun-Huai; Schnell, Raymond J; Kuhn, David N; Motamayor, Juan-Carlos
2011-07-27
BAC-based physical maps provide for sequencing across an entire genome or a selected sub-genomic region of biological interest. Such a region can be approached with next-generation whole-genome sequencing and assembly as if it were an independent small genome. Using the minimum tiling path as a guide, specific BAC clones representing the prioritized genomic interval are selected, pooled, and used to prepare a sequencing library. This pooled BAC approach was taken to sequence and assemble a QTL-rich region, of ~3 Mbp and represented by twenty-seven BACs, on linkage group 5 of the Theobroma cacao cv. Matina 1-6 genome. Using various mixtures of read coverages from paired-end and linear 454 libraries, multiple assemblies of varied quality were generated. Quality was assessed by comparing the assembly of 454 reads with a subset of ten BACs individually sequenced and assembled using Sanger reads. A mixture of reads optimal for assembly was identified. We found, furthermore, that a quality assembly suitable for serving as a reference genome template could be obtained even with a reduced depth of sequencing coverage. Annotation of the resulting assembly revealed several genes potentially responsible for three T. cacao traits: black pod disease resistance, bean shape index, and pod weight. Our results, as with other pooled BAC sequencing reports, suggest that pooling portions of a minimum tiling path derived from a BAC-based physical map is an effective method to target sub-genomic regions for sequencing. While we focused on a single QTL region, other QTL regions of importance could be similarly sequenced allowing for biological discovery to take place before a high quality whole-genome assembly is completed.
2014-01-01
Background Recent advancements in next-generation sequencing technology have enabled cost-effective sequencing of whole or partial genomes, permitting the discovery and characterization of molecular polymorphisms. Double-digest restriction-site associated DNA sequencing (ddRAD-seq) is a powerful and inexpensive approach to developing numerous single nucleotide polymorphism (SNP) markers and constructing a high-density genetic map. To enrich genomic resources for Japanese eel (Anguilla japonica), we constructed a ddRAD-based genetic map using an Ion Torrent Personal Genome Machine and anchored scaffolds of the current genome assembly to 19 linkage groups of the Japanese eel. Furthermore, we compared the Japanese eel genome with genomes of model fishes to infer the history of genome evolution after the teleost-specific genome duplication. Results We generated the ddRAD-based linkage map of the Japanese eel, where the maps for female and male spanned 1748.8 cM and 1294.5 cM, respectively, and were arranged into 19 linkage groups. A total of 2,672 SNP markers and 115 Simple Sequence Repeat markers provide anchor points to 1,252 scaffolds covering 151 Mb (13%) of the current genome assembly of the Japanese eel. Comparisons among the Japanese eel, medaka, zebrafish and spotted gar genomes showed highly conserved synteny among teleosts and revealed part of the eight major chromosomal rearrangement events that occurred soon after the teleost-specific genome duplication. Conclusions The ddRAD-seq approach combined with the Ion Torrent Personal Genome Machine sequencing allowed us to conduct efficient and flexible SNP genotyping. The integration of the genetic map and the assembled sequence provides a valuable resource for fine mapping and positional cloning of quantitative trait loci associated with economically important traits and for investigating comparative genomics of the Japanese eel. PMID:24669946
Kai, Wataru; Nomura, Kazuharu; Fujiwara, Atushi; Nakamura, Yoji; Yasuike, Motoshige; Ojima, Nobuhiko; Masaoka, Tetsuji; Ozaki, Akiyuki; Kazeto, Yukinori; Gen, Koichiro; Nagao, Jiro; Tanaka, Hideki; Kobayashi, Takanori; Ototake, Mitsuru
2014-03-26
Recent advancements in next-generation sequencing technology have enabled cost-effective sequencing of whole or partial genomes, permitting the discovery and characterization of molecular polymorphisms. Double-digest restriction-site associated DNA sequencing (ddRAD-seq) is a powerful and inexpensive approach to developing numerous single nucleotide polymorphism (SNP) markers and constructing a high-density genetic map. To enrich genomic resources for Japanese eel (Anguilla japonica), we constructed a ddRAD-based genetic map using an Ion Torrent Personal Genome Machine and anchored scaffolds of the current genome assembly to 19 linkage groups of the Japanese eel. Furthermore, we compared the Japanese eel genome with genomes of model fishes to infer the history of genome evolution after the teleost-specific genome duplication. We generated the ddRAD-based linkage map of the Japanese eel, where the maps for female and male spanned 1748.8 cM and 1294.5 cM, respectively, and were arranged into 19 linkage groups. A total of 2,672 SNP markers and 115 Simple Sequence Repeat markers provide anchor points to 1,252 scaffolds covering 151 Mb (13%) of the current genome assembly of the Japanese eel. Comparisons among the Japanese eel, medaka, zebrafish and spotted gar genomes showed highly conserved synteny among teleosts and revealed part of the eight major chromosomal rearrangement events that occurred soon after the teleost-specific genome duplication. The ddRAD-seq approach combined with the Ion Torrent Personal Genome Machine sequencing allowed us to conduct efficient and flexible SNP genotyping. The integration of the genetic map and the assembled sequence provides a valuable resource for fine mapping and positional cloning of quantitative trait loci associated with economically important traits and for investigating comparative genomics of the Japanese eel.
novPTMenzy: a database for enzymes involved in novel post-translational modifications
Khater, Shradha; Mohanty, Debasisa
2015-01-01
With the recent discoveries of novel post-translational modifications (PTMs) which play important roles in signaling and biosynthetic pathways, identification of such PTM catalyzing enzymes by genome mining has been an area of major interest. Unlike well-known PTMs like phosphorylation, glycosylation, SUMOylation, no bioinformatics resources are available for enzymes associated with novel and unusual PTMs. Therefore, we have developed the novPTMenzy database which catalogs information on the sequence, structure, active site and genomic neighborhood of experimentally characterized enzymes involved in five novel PTMs, namely AMPylation, Eliminylation, Sulfation, Hydroxylation and Deamidation. Based on a comprehensive analysis of the sequence and structural features of these known PTM catalyzing enzymes, we have created Hidden Markov Model profiles for the identification of similar PTM catalyzing enzymatic domains in genomic sequences. We have also created predictive rules for grouping them into functional subfamilies and deciphering their mechanistic details by structure-based analysis of their active site pockets. These analytical modules have been made available as user friendly search interfaces of novPTMenzy database. It also has a specialized analysis interface for some PTMs like AMPylation and Eliminylation. The novPTMenzy database is a unique resource that can aid in discovery of unusual PTM catalyzing enzymes in newly sequenced genomes. Database URL: http://www.nii.ac.in/novptmenzy.html PMID:25931459
Moschetti, Tommaso; Sharpe, Timothy; Fischer, Gerhard; Marsh, May E; Ng, Hong Kin; Morgan, Matthew; Scott, Duncan E; Blundell, Tom L; R Venkitaraman, Ashok; Skidmore, John; Abell, Chris; Hyvönen, Marko
2016-11-20
Protein-protein interactions (PPIs) are increasingly important targets for drug discovery. Efficient fragment-based drug discovery approaches to tackle PPIs are often stymied by difficulties in the production of stable, unliganded target proteins. Here, we report an approach that exploits protein engineering to "humanise" thermophilic archeal surrogate proteins as targets for small-molecule inhibitor discovery and to exemplify this approach in the development of inhibitors against the PPI between the recombinase RAD51 and tumour suppressor BRCA2. As human RAD51 has proved impossible to produce in a form that is compatible with the requirements of fragment-based drug discovery, we have developed a surrogate protein system using RadA from Pyrococcus furiosus. Using a monomerised RadA as our starting point, we have adopted two parallel and mutually instructive approaches to mimic the human enzyme: firstly by mutating RadA to increase sequence identity with RAD51 in the BRC repeat binding sites, and secondly by generating a chimeric archaeal human protein. Both approaches generate proteins that interact with a fourth BRC repeat with affinity and stoichiometry comparable to human RAD51. Stepwise humanisation has also allowed us to elucidate the determinants of RAD51 binding to BRC repeats and the contributions of key interacting residues to this interaction. These surrogate proteins have enabled the development of biochemical and biophysical assays in our ongoing fragment-based small-molecule inhibitor programme and they have allowed us to determine hundreds of liganded structures in support of our structure-guided design process, demonstrating the feasibility and advantages of using archeal surrogates to overcome difficulties in handling human proteins. Copyright © 2016 The Authors. Published by Elsevier Ltd.. All rights reserved.
A chemogenomic analysis of the human proteome: application to enzyme families.
Bernasconi, Paul; Chen, Min; Galasinski, Scott; Popa-Burke, Ioana; Bobasheva, Anna; Coudurier, Louis; Birkos, Steve; Hallam, Rhonda; Janzen, William P
2007-10-01
Sequence-based phylogenies (SBP) are well-established tools for describing relationships between proteins. They have been used extensively to predict the behavior and sensitivity toward inhibitors of enzymes within a family. The utility of this approach diminishes when comparing proteins with little sequence homology. Even within an enzyme family, SBPs must be complemented by an orthogonal method that is independent of sequence to better predict enzymatic behavior. A chemogenomic approach is demonstrated here that uses the inhibition profile of a 130,000 diverse molecule library to uncover relationships within a set of enzymes. The profile is used to construct a semimetric additive distance matrix. This matrix, in turn, defines a sequence-independent phylogeny (SIP). The method was applied to 97 enzymes (kinases, proteases, and phosphatases). SIP does not use structural information from the molecules used for establishing the profile, thus providing a more heuristic method than the current approaches, which require knowledge of the specific inhibitor's structure. Within enzyme families, SIP shows a good overall correlation with SBP. More interestingly, SIP uncovers distances within families that are not recognizable by sequence-based methods. In addition, SIP allows the determination of distance between enzymes with no sequence homology, thus uncovering novel relationships not predicted by SBP. This chemogenomic approach, used in conjunction with SBP, should prove to be a powerful tool for choosing target combinations for drug discovery programs as well as for guiding the selection of profiling and liability targets.
Koschmann, Jeannette; Machens, Fabian; Becker, Marlies; Niemeyer, Julia; Schulze, Jutta; Bülow, Lorenz; Stahl, Dietmar J.; Hehl, Reinhard
2012-01-01
A combination of bioinformatic tools, high-throughput gene expression profiles, and the use of synthetic promoters is a powerful approach to discover and evaluate novel cis-sequences in response to specific stimuli. With Arabidopsis (Arabidopsis thaliana) microarray data annotated to the PathoPlant database, 732 different queries with a focus on fungal and oomycete pathogens were performed, leading to 510 up-regulated gene groups. Using the binding site estimation suite of tools, BEST, 407 conserved sequence motifs were identified in promoter regions of these coregulated gene sets. Motif similarities were determined with STAMP, classifying the 407 sequence motifs into 37 families. A comparative analysis of these 37 families with the AthaMap, PLACE, and AGRIS databases revealed similarities to known cis-elements but also led to the discovery of cis-sequences not yet implicated in pathogen response. Using a parsley (Petroselinum crispum) protoplast system and a modified reporter gene vector with an internal transformation control, 25 elicitor-responsive cis-sequences from 10 different motif families were identified. Many of the elicitor-responsive cis-sequences also drive reporter gene expression in an Agrobacterium tumefaciens infection assay in Nicotiana benthamiana. This work significantly increases the number of known elicitor-responsive cis-sequences and demonstrates the successful integration of a diverse set of bioinformatic resources combined with synthetic promoter analysis for data mining and functional screening in plant-pathogen interaction. PMID:22744985
[GNU Pattern: open source pattern hunter for biological sequences based on SPLASH algorithm].
Xu, Ying; Li, Yi-xue; Kong, Xiang-yin
2005-06-01
To construct a high performance open source software engine based on IBM SPLASH algorithm for later research on pattern discovery. Gpat, which is based on SPLASH algorithm, was developed by using open source software. GNU Pattern (Gpat) software was developped, which efficiently implemented the core part of SPLASH algorithm. Full source code of Gpat was also available for other researchers to modify the program under the GNU license. Gpat is a successful implementation of SPLASH algorithm and can be used as a basic framework for later research on pattern recognition in biological sequences.
[The nineteenth century roots of the contemporary biological revolution].
Swynghedauw, Bernard
2006-01-01
The recent publication of the human genomic sequence is the most important progress in biology. It originates from four major watersheds between 1860-1865, namely the biological evolution by Darwin in 1858, the Mendel laws of heredity in 1865, the basis of physiology established by Claude Bernard also in 1865, and the discoveries of microbacteria by Louis Pasteur around 1857. Before 1860, biology did not exist as a science. After 1860, the Darwin's theory progressively became a law after the discovery of the DNA polymorphism and that of the mechanisms of genetic mixing. So far the Mendel's laws were confirmed in parallel with the development of molecular genetics after the discovery of DNA structure and genetic code. The discovery of hormones is one example, amongst several on how integrative physiology applies to Claude Bernard's basis. Finally, based on Pasteur's discovery and Pasteur Institutes, microbiology became a tool for molecular biologists.
The web server of IBM's Bioinformatics and Pattern Discovery group: 2004 update.
Huynh, Tien; Rigoutsos, Isidore
2004-07-01
In this report, we provide an update on the services and content which are available on the web server of IBM's Bioinformatics and Pattern Discovery group. The server, which is operational around the clock, provides access to a large number of methods that have been developed and published by the group's members. There is an increasing number of problems that these tools can help tackle; these problems range from the discovery of patterns in streams of events and the computation of multiple sequence alignments, to the discovery of genes in nucleic acid sequences, the identification--directly from sequence--of structural deviations from alpha-helicity and the annotation of amino acid sequences for antimicrobial activity. Additionally, annotations for more than 130 archaeal, bacterial, eukaryotic and viral genomes are now available on-line and can be searched interactively. The tools and code bundles continue to be accessible from http://cbcsrv.watson.ibm.com/Tspd.html whereas the genomics annotations are available at http://cbcsrv.watson.ibm.com/Annotations/.
Swaney, Danielle L; Wenger, Craig D; Thomson, James A; Coon, Joshua J
2009-01-27
Protein phosphorylation is central to the understanding of cellular signaling, and cellular signaling is suggested to play a major role in the regulation of human embryonic stem (ES) cell pluripotency. Here, we describe the use of conventional tandem mass spectrometry-based sequencing technology--collision-activated dissociation (CAD)--and the more recently developed method electron transfer dissociation (ETD) to characterize the human ES cell phosphoproteome. In total, these experiments resulted in the identification of 11,995 unique phosphopeptides, corresponding to 10,844 nonredundant phosphorylation sites, at a 1% false discovery rate (FDR). Among these phosphorylation sites are 5 localized to 2 pluripotency critical transcription factors--OCT4 and SOX2. From these experiments, we conclude that ETD identifies a larger number of unique phosphopeptides than CAD (8,087 to 3,868), more frequently localizes the phosphorylation site to a specific residue (49.8% compared with 29.6%), and sequences whole classes of phosphopeptides previously unobserved.
Optimization of the genotyping-by-sequencing strategy for population genomic analysis in conifers.
Pan, Jin; Wang, Baosheng; Pei, Zhi-Yong; Zhao, Wei; Gao, Jie; Mao, Jian-Feng; Wang, Xiao-Ru
2015-07-01
Flexibility and low cost make genotyping-by-sequencing (GBS) an ideal tool for population genomic studies of nonmodel species. However, to utilize the potential of the method fully, many parameters affecting library quality and single nucleotide polymorphism (SNP) discovery require optimization, especially for conifer genomes with a high repetitive DNA content. In this study, we explored strategies for effective GBS analysis in pine species. We constructed GBS libraries using HpaII, PstI and EcoRI-MseI digestions with different multiplexing levels and examined the effect of restriction enzymes on library complexity and the impact of sequencing depth and size selection of restriction fragments on sequence coverage bias. We tested and compared UNEAK, Stacks and GATK pipelines for the GBS data, and then developed a reference-free SNP calling strategy for haploid pine genomes. Our GBS procedure proved to be effective in SNP discovery, producing 7000-11 000 and 14 751 SNPs within and among three pine species, respectively, from a PstI library. This investigation provides guidance for the design and analysis of GBS experiments, particularly for organisms for which genomic information is lacking. © 2014 John Wiley & Sons Ltd.
Transcriptome assembly and digital gene expression atlas of the rainbow trout
USDA-ARS?s Scientific Manuscript database
Background: Transcriptome analysis is a preferred method for gene discovery, marker development and gene expression profiling in non-model organisms. Previously, we sequenced a transcriptome reference using Sanger-based and 454-pyrosequencing, however, a transcriptome assembly is still incomplete an...
Extracting DNA words based on the sequence features: non-uniform distribution and integrity.
Li, Zhi; Cao, Hongyan; Cui, Yuehua; Zhang, Yanbo
2016-01-25
DNA sequence can be viewed as an unknown language with words as its functional units. Given that most sequence alignment algorithms such as the motif discovery algorithms depend on the quality of background information about sequences, it is necessary to develop an ab initio algorithm for extracting the "words" based only on the DNA sequences. We considered that non-uniform distribution and integrity were two important features of a word, based on which we developed an ab initio algorithm to extract "DNA words" that have potential functional meaning. A Kolmogorov-Smirnov test was used for consistency test of uniform distribution of DNA sequences, and the integrity was judged by the sequence and position alignment. Two random base sequences were adopted as negative control, and an English book was used as positive control to verify our algorithm. We applied our algorithm to the genomes of Saccharomyces cerevisiae and 10 strains of Escherichia coli to show the utility of the methods. The results provide strong evidences that the algorithm is a promising tool for ab initio building a DNA dictionary. Our method provides a fast way for large scale screening of important DNA elements and offers potential insights into the understanding of a genome.
The web server of IBM's Bioinformatics and Pattern Discovery group.
Huynh, Tien; Rigoutsos, Isidore; Parida, Laxmi; Platt, Daniel; Shibuya, Tetsuo
2003-07-01
We herein present and discuss the services and content which are available on the web server of IBM's Bioinformatics and Pattern Discovery group. The server is operational around the clock and provides access to a variety of methods that have been published by the group's members and collaborators. The available tools correspond to applications ranging from the discovery of patterns in streams of events and the computation of multiple sequence alignments, to the discovery of genes in nucleic acid sequences and the interactive annotation of amino acid sequences. Additionally, annotations for more than 70 archaeal, bacterial, eukaryotic and viral genomes are available on-line and can be searched interactively. The tools and code bundles can be accessed beginning at http://cbcsrv.watson.ibm.com/Tspd.html whereas the genomics annotations are available at http://cbcsrv.watson.ibm.com/Annotations/.
The web server of IBM's Bioinformatics and Pattern Discovery group
Huynh, Tien; Rigoutsos, Isidore; Parida, Laxmi; Platt, Daniel; Shibuya, Tetsuo
2003-01-01
We herein present and discuss the services and content which are available on the web server of IBM's Bioinformatics and Pattern Discovery group. The server is operational around the clock and provides access to a variety of methods that have been published by the group's members and collaborators. The available tools correspond to applications ranging from the discovery of patterns in streams of events and the computation of multiple sequence alignments, to the discovery of genes in nucleic acid sequences and the interactive annotation of amino acid sequences. Additionally, annotations for more than 70 archaeal, bacterial, eukaryotic and viral genomes are available on-line and can be searched interactively. The tools and code bundles can be accessed beginning at http://cbcsrv.watson.ibm.com/Tspd.html whereas the genomics annotations are available at http://cbcsrv.watson.ibm.com/Annotations/. PMID:12824385
Strain Prioritization for Natural Product Discovery by a High-Throughput Real-Time PCR Method
2015-01-01
Natural products offer unmatched chemical and structural diversity compared to other small-molecule libraries, but traditional natural product discovery programs are not sustainable, demanding too much time, effort, and resources. Here we report a strain prioritization method for natural product discovery. Central to the method is the application of real-time PCR, targeting genes characteristic to the biosynthetic machinery of natural products with distinct scaffolds in a high-throughput format. The practicality and effectiveness of the method were showcased by prioritizing 1911 actinomycete strains for diterpenoid discovery. A total of 488 potential diterpenoid producers were identified, among which six were confirmed as platensimycin and platencin dual producers and one as a viguiepinol and oxaloterpin producer. While the method as described is most appropriate to prioritize strains for discovering specific natural products, variations of this method should be applicable to the discovery of other classes of natural products. Applications of genome sequencing and genome mining to the high-priority strains could essentially eliminate the chance elements from traditional discovery programs and fundamentally change how natural products are discovered. PMID:25238028
Brancaccio, Rosario N; Robitaille, Alexis; Dutta, Sankhadeep; Cuenin, Cyrille; Santare, Daiga; Skenders, Girts; Leja, Marcis; Fischer, Nicole; Giuliano, Anna R; Rollison, Dana E; Grundhoff, Adam; Tommasino, Massimo; Gheit, Tarik
2018-05-07
With the advent of new molecular tools, the discovery of new papillomaviruses (PVs) has accelerated during the past decade, enabling the expansion of knowledge about the viral populations that inhabit the human body. Human PVs (HPVs) are etiologically linked to benign or malignant lesions of the skin and mucosa. The detection of HPV types can vary widely, depending mainly on the methodology and the quality of the biological sample. Next-generation sequencing is one of the most powerful tools, enabling the discovery of novel viruses in a wide range of biological material. Here, we report a novel protocol for the detection of known and unknown HPV types in human skin and oral gargle samples using improved PCR protocols combined with next-generation sequencing. We identified 105 putative new PV types in addition to 296 known types, thus providing important information about the viral distribution in the oral cavity and skin. Copyright © 2018. Published by Elsevier Inc.
Ma, Chun-Lei; Jin, Ji-Qiang; Li, Chun-Fang; Wang, Rong-Kai; Zheng, Hong-Kun; Yao, Ming-Zhe; Chen, Liang
2015-01-01
Genetic maps are important tools in plant genomics and breeding. The present study reports the large-scale discovery of single nucleotide polymorphisms (SNPs) for genetic map construction in tea plant. We developed a total of 6,042 valid SNP markers using specific-locus amplified fragment sequencing (SLAF-seq), and subsequently mapped them into the previous framework map. The final map contained 6,448 molecular markers, distributing on fifteen linkage groups corresponding to the number of tea plant chromosomes. The total map length was 3,965 cM, with an average inter-locus distance of 1.0 cM. This map is the first SNP-based reference map of tea plant, as well as the most saturated one developed to date. The SNP markers and map resources generated in this study provide a wealth of genetic information that can serve as a foundation for downstream genetic analyses, such as the fine mapping of quantitative trait loci (QTL), map-based cloning, marker-assisted selection, and anchoring of scaffolds to facilitate the process of whole genome sequencing projects for tea plant. PMID:26035838
Metagenomics and novel gene discovery
Culligan, Eamonn P; Sleator, Roy D; Marchesi, Julian R; Hill, Colin
2014-01-01
Metagenomics provides a means of assessing the total genetic pool of all the microbes in a particular environment, in a culture-independent manner. It has revealed unprecedented diversity in microbial community composition, which is further reflected in the encoded functional diversity of the genomes, a large proportion of which consists of novel genes. Herein, we review both sequence-based and functional metagenomic methods to uncover novel genes and outline some of the associated problems of each type of approach, as well as potential solutions. Furthermore, we discuss the potential for metagenomic biotherapeutic discovery, with a particular focus on the human gut microbiome and finally, we outline how the discovery of novel genes may be used to create bioengineered probiotics. PMID:24317337
Analyzing Student Inquiry Data Using Process Discovery and Sequence Classification
ERIC Educational Resources Information Center
Emond, Bruno; Buffett, Scott
2015-01-01
This paper reports on results of applying process discovery mining and sequence classification mining techniques to a data set of semi-structured learning activities. The main research objective is to advance educational data mining to model and support self-regulated learning in heterogeneous environments of learning content, activities, and…
2013-01-01
Background The lack of genomic resources can present challenges for studies of non-model organisms. Transcriptome sequencing offers an attractive method to gather information about genes and gene expression without the need for a reference genome. However, it is unclear what sequencing depth is adequate to assemble the transcriptome de novo for these purposes. Results We assembled transcriptomes of animals from six different phyla (Annelids, Arthropods, Chordates, Cnidarians, Ctenophores, and Molluscs) at regular increments of reads using Velvet/Oases and Trinity to determine how read count affects the assembly. This included an assembly of mouse heart reads because we could compare those against the reference genome that is available. We found qualitative differences in the assemblies of whole-animals versus tissues. With increasing reads, whole-animal assemblies show rapid increase of transcripts and discovery of conserved genes, while single-tissue assemblies show a slower discovery of conserved genes though the assembled transcripts were often longer. A deeper examination of the mouse assemblies shows that with more reads, assembly errors become more frequent but such errors can be mitigated with more stringent assembly parameters. Conclusions These assembly trends suggest that representative assemblies are generated with as few as 20 million reads for tissue samples and 30 million reads for whole-animals for RNA-level coverage. These depths provide a good balance between coverage and noise. Beyond 60 million reads, the discovery of new genes is low and sequencing errors of highly-expressed genes are likely to accumulate. Finally, siphonophores (polymorphic Cnidarians) are an exception and possibly require alternate assembly strategies. PMID:23496952
Francis, Warren R; Christianson, Lynne M; Kiko, Rainer; Powers, Meghan L; Shaner, Nathan C; Haddock, Steven H D
2013-03-12
The lack of genomic resources can present challenges for studies of non-model organisms. Transcriptome sequencing offers an attractive method to gather information about genes and gene expression without the need for a reference genome. However, it is unclear what sequencing depth is adequate to assemble the transcriptome de novo for these purposes. We assembled transcriptomes of animals from six different phyla (Annelids, Arthropods, Chordates, Cnidarians, Ctenophores, and Molluscs) at regular increments of reads using Velvet/Oases and Trinity to determine how read count affects the assembly. This included an assembly of mouse heart reads because we could compare those against the reference genome that is available. We found qualitative differences in the assemblies of whole-animals versus tissues. With increasing reads, whole-animal assemblies show rapid increase of transcripts and discovery of conserved genes, while single-tissue assemblies show a slower discovery of conserved genes though the assembled transcripts were often longer. A deeper examination of the mouse assemblies shows that with more reads, assembly errors become more frequent but such errors can be mitigated with more stringent assembly parameters. These assembly trends suggest that representative assemblies are generated with as few as 20 million reads for tissue samples and 30 million reads for whole-animals for RNA-level coverage. These depths provide a good balance between coverage and noise. Beyond 60 million reads, the discovery of new genes is low and sequencing errors of highly-expressed genes are likely to accumulate. Finally, siphonophores (polymorphic Cnidarians) are an exception and possibly require alternate assembly strategies.
Sarmady, Mahdi; Dampier, William; Tozeren, Aydin
2011-01-01
Virus proteins alter protein pathways of the host toward the synthesis of viral particles by breaking and making edges via binding to host proteins. In this study, we developed a computational approach to predict viral sequence hotspots for binding to host proteins based on sequences of viral and host proteins and literature-curated virus-host protein interactome data. We use a motif discovery algorithm repeatedly on collections of sequences of viral proteins and immediate binding partners of their host targets and choose only those motifs that are conserved on viral sequences and highly statistically enriched among binding partners of virus protein targeted host proteins. Our results match experimental data on binding sites of Nef to host proteins such as MAPK1, VAV1, LCK, HCK, HLA-A, CD4, FYN, and GNB2L1 with high statistical significance but is a poor predictor of Nef binding sites on highly flexible, hoop-like regions. Predicted hotspots recapture CD8 cell epitopes of HIV Nef highlighting their importance in modulating virus-host interactions. Host proteins potentially targeted or outcompeted by Nef appear crowding the T cell receptor, natural killer cell mediated cytotoxicity, and neurotrophin signaling pathways. Scanning of HIV Nef motifs on multiple alignments of hepatitis C protein NS5A produces results consistent with literature, indicating the potential value of the hotspot discovery in advancing our understanding of virus-host crosstalk. PMID:21738584
Memetic algorithms for de novo motif-finding in biomedical sequences.
Bi, Chengpeng
2012-09-01
The objectives of this study are to design and implement a new memetic algorithm for de novo motif discovery, which is then applied to detect important signals hidden in various biomedical molecular sequences. In this paper, memetic algorithms are developed and tested in de novo motif-finding problems. Several strategies in the algorithm design are employed that are to not only efficiently explore the multiple sequence local alignment space, but also effectively uncover the molecular signals. As a result, there are a number of key features in the implementation of the memetic motif-finding algorithm (MaMotif), including a chromosome replacement operator, a chromosome alteration-aware local search operator, a truncated local search strategy, and a stochastic operation of local search imposed on individual learning. To test the new algorithm, we compare MaMotif with a few of other similar algorithms using simulated and experimental data including genomic DNA, primary microRNA sequences (let-7 family), and transmembrane protein sequences. The new memetic motif-finding algorithm is successfully implemented in C++, and exhaustively tested with various simulated and real biological sequences. In the simulation, it shows that MaMotif is the most time-efficient algorithm compared with others, that is, it runs 2 times faster than the expectation maximization (EM) method and 16 times faster than the genetic algorithm-based EM hybrid. In both simulated and experimental testing, results show that the new algorithm is compared favorably or superior to other algorithms. Notably, MaMotif is able to successfully discover the transcription factors' binding sites in the chromatin immunoprecipitation followed by massively parallel sequencing (ChIP-Seq) data, correctly uncover the RNA splicing signals in gene expression, and precisely find the highly conserved helix motif in the transmembrane protein sequences, as well as rightly detect the palindromic segments in the primary microRNA sequences. The memetic motif-finding algorithm is effectively designed and implemented, and its applications demonstrate it is not only time-efficient, but also exhibits excellent performance while compared with other popular algorithms. Copyright © 2012 Elsevier B.V. All rights reserved.
RSAT 2018: regulatory sequence analysis tools 20th anniversary.
Nguyen, Nga Thi Thuy; Contreras-Moreira, Bruno; Castro-Mondragon, Jaime A; Santana-Garcia, Walter; Ossio, Raul; Robles-Espinoza, Carla Daniela; Bahin, Mathieu; Collombet, Samuel; Vincens, Pierre; Thieffry, Denis; van Helden, Jacques; Medina-Rivera, Alejandra; Thomas-Chollier, Morgane
2018-05-02
RSAT (Regulatory Sequence Analysis Tools) is a suite of modular tools for the detection and the analysis of cis-regulatory elements in genome sequences. Its main applications are (i) motif discovery, including from genome-wide datasets like ChIP-seq/ATAC-seq, (ii) motif scanning, (iii) motif analysis (quality assessment, comparisons and clustering), (iv) analysis of regulatory variations, (v) comparative genomics. Six public servers jointly support 10 000 genomes from all kingdoms. Six novel or refactored programs have been added since the 2015 NAR Web Software Issue, including updated programs to analyse regulatory variants (retrieve-variation-seq, variation-scan, convert-variations), along with tools to extract sequences from a list of coordinates (retrieve-seq-bed), to select motifs from motif collections (retrieve-matrix), and to extract orthologs based on Ensembl Compara (get-orthologs-compara). Three use cases illustrate the integration of new and refactored tools to the suite. This Anniversary update gives a 20-year perspective on the software suite. RSAT is well-documented and available through Web sites, SOAP/WSDL (Simple Object Access Protocol/Web Services Description Language) web services, virtual machines and stand-alone programs at http://www.rsat.eu/.
Cancer genomics: technology, discovery, and translation.
Tran, Ben; Dancey, Janet E; Kamel-Reid, Suzanne; McPherson, John D; Bedard, Philippe L; Brown, Andrew M K; Zhang, Tong; Shaw, Patricia; Onetto, Nicole; Stein, Lincoln; Hudson, Thomas J; Neel, Benjamin G; Siu, Lillian L
2012-02-20
In recent years, the increasing awareness that somatic mutations and other genetic aberrations drive human malignancies has led us within reach of personalized cancer medicine (PCM). The implementation of PCM is based on the following premises: genetic aberrations exist in human malignancies; a subset of these aberrations drive oncogenesis and tumor biology; these aberrations are actionable (defined as having the potential to affect management recommendations based on diagnostic, prognostic, and/or predictive implications); and there are highly specific anticancer agents available that effectively modulate these targets. This article highlights the technology underlying cancer genomics and examines the early results of genome sequencing and the challenges met in the discovery of new genetic aberrations. Finally, drawing from experiences gained in a feasibility study of somatic mutation genotyping and targeted exome sequencing led by Princess Margaret Hospital-University Health Network and the Ontario Institute for Cancer Research, the processes, challenges, and issues involved in the translation of cancer genomics to the clinic are discussed.
Aslam, Luqman; Beal, Kathryn; Ann Blomberg, Le; Bouffard, Pascal; Burt, David W.; Crasta, Oswald; Crooijmans, Richard P. M. A.; Cooper, Kristal; Coulombe, Roger A.; De, Supriyo; Delany, Mary E.; Dodgson, Jerry B.; Dong, Jennifer J.; Evans, Clive; Frederickson, Karin M.; Flicek, Paul; Florea, Liliana; Folkerts, Otto; Groenen, Martien A. M.; Harkins, Tim T.; Herrero, Javier; Hoffmann, Steve; Megens, Hendrik-Jan; Jiang, Andrew; de Jong, Pieter; Kaiser, Pete; Kim, Heebal; Kim, Kyu-Won; Kim, Sungwon; Langenberger, David; Lee, Mi-Kyung; Lee, Taeheon; Mane, Shrinivasrao; Marcais, Guillaume; Marz, Manja; McElroy, Audrey P.; Modise, Thero; Nefedov, Mikhail; Notredame, Cédric; Paton, Ian R.; Payne, William S.; Pertea, Geo; Prickett, Dennis; Puiu, Daniela; Qioa, Dan; Raineri, Emanuele; Ruffier, Magali; Salzberg, Steven L.; Schatz, Michael C.; Scheuring, Chantel; Schmidt, Carl J.; Schroeder, Steven; Searle, Stephen M. J.; Smith, Edward J.; Smith, Jacqueline; Sonstegard, Tad S.; Stadler, Peter F.; Tafer, Hakim; Tu, Zhijian (Jake); Van Tassell, Curtis P.; Vilella, Albert J.; Williams, Kelly P.; Yorke, James A.; Zhang, Liqing; Zhang, Hong-Bin; Zhang, Xiaojun; Zhang, Yang; Reed, Kent M.
2010-01-01
A synergistic combination of two next-generation sequencing platforms with a detailed comparative BAC physical contig map provided a cost-effective assembly of the genome sequence of the domestic turkey (Meleagris gallopavo). Heterozygosity of the sequenced source genome allowed discovery of more than 600,000 high quality single nucleotide variants. Despite this heterozygosity, the current genome assembly (∼1.1 Gb) includes 917 Mb of sequence assigned to specific turkey chromosomes. Annotation identified nearly 16,000 genes, with 15,093 recognized as protein coding and 611 as non-coding RNA genes. Comparative analysis of the turkey, chicken, and zebra finch genomes, and comparing avian to mammalian species, supports the characteristic stability of avian genomes and identifies genes unique to the avian lineage. Clear differences are seen in number and variety of genes of the avian immune system where expansions and novel genes are less frequent than examples of gene loss. The turkey genome sequence provides resources to further understand the evolution of vertebrate genomes and genetic variation underlying economically important quantitative traits in poultry. This integrated approach may be a model for providing both gene and chromosome level assemblies of other species with agricultural, ecological, and evolutionary interest. PMID:20838655
Single-Cell Sequencing for Drug Discovery and Drug Development.
Wu, Hongjin; Wang, Charles; Wu, Shixiu
2017-01-01
Next-generation sequencing (NGS), particularly single-cell sequencing, has revolutionized the scale and scope of genomic and biomedical research. Recent technological advances in NGS and singlecell studies have made the deep whole-genome (DNA-seq), whole epigenome and whole-transcriptome sequencing (RNA-seq) at single-cell level feasible. NGS at the single-cell level expands our view of genome, epigenome and transcriptome and allows the genome, epigenome and transcriptome of any organism to be explored without a priori assumptions and with unprecedented throughput. And it does so with single-nucleotide resolution. NGS is also a very powerful tool for drug discovery and drug development. In this review, we describe the current state of single-cell sequencing techniques, which can provide a new, more powerful and precise approach for analyzing effects of drugs on treated cells and tissues. Our review discusses single-cell whole genome/exome sequencing (scWGS/scWES), single-cell transcriptome sequencing (scRNA-seq), single-cell bisulfite sequencing (scBS), and multiple omics of single-cell sequencing. We also highlight the advantages and challenges of each of these approaches. Finally, we describe, elaborate and speculate the potential applications of single-cell sequencing for drug discovery and drug development. Copyright© Bentham Science Publishers; For any queries, please email at epub@benthamscience.org.
Dissecting enzyme function with microfluidic-based deep mutational scanning.
Romero, Philip A; Tran, Tuan M; Abate, Adam R
2015-06-09
Natural enzymes are incredibly proficient catalysts, but engineering them to have new or improved functions is challenging due to the complexity of how an enzyme's sequence relates to its biochemical properties. Here, we present an ultrahigh-throughput method for mapping enzyme sequence-function relationships that combines droplet microfluidic screening with next-generation DNA sequencing. We apply our method to map the activity of millions of glycosidase sequence variants. Microfluidic-based deep mutational scanning provides a comprehensive and unbiased view of the enzyme function landscape. The mapping displays expected patterns of mutational tolerance and a strong correspondence to sequence variation within the enzyme family, but also reveals previously unreported sites that are crucial for glycosidase function. We modified the screening protocol to include a high-temperature incubation step, and the resulting thermotolerance landscape allowed the discovery of mutations that enhance enzyme thermostability. Droplet microfluidics provides a general platform for enzyme screening that, when combined with DNA-sequencing technologies, enables high-throughput mapping of enzyme sequence space.
Pharmacological screening technologies for venom peptide discovery.
Prashanth, Jutty Rajan; Hasaballah, Nojod; Vetter, Irina
2017-12-01
Venomous animals occupy one of the most successful evolutionary niches and occur on nearly every continent. They deliver venoms via biting and stinging apparatuses with the aim to rapidly incapacitate prey and deter predators. This has led to the evolution of venom components that act at a number of biological targets - including ion channels, G-protein coupled receptors, transporters and enzymes - with exquisite selectivity and potency, making venom-derived components attractive pharmacological tool compounds and drug leads. In recent years, plate-based pharmacological screening approaches have been introduced to accelerate venom-derived drug discovery. A range of assays are amenable to this purpose, including high-throughput electrophysiology, fluorescence-based functional and binding assays. However, despite these technological advances, the traditional activity-guided fractionation approach is time-consuming and resource-intensive. The combination of screening techniques suitable for miniaturization with sequence-based discovery approaches - supported by advanced proteomics, mass spectrometry, chromatography as well as synthesis and expression techniques - promises to further improve venom peptide discovery. Here, we discuss practical aspects of establishing a pipeline for venom peptide drug discovery with a particular emphasis on pharmacology and pharmacological screening approaches. This article is part of the Special Issue entitled 'Venom-derived Peptides as Pharmacological Tools.' Copyright © 2017 Elsevier Ltd. All rights reserved.
Genome survey sequencing of red swamp crayfish Procambarus clarkii.
Shi, Linlin; Yi, Shaokui; Li, Yanhe
2018-06-21
Red swamp crayfish, Procambarus clarkii, presently is an important aquatic commercial species in China. The crayfish is a hot area of research focus, and its genetic improvement is quite urgent for the crayfish aquaculture in China. However, the knowledge of its genomic landscape is limited. In this study, a survey of P. clarkii genome was investigated based on Illumina's Solexa sequencing platform. Meanwhile, its genome size was estimated using flow cytometry. Interestingly, the genome size estimated is about 8.50 Gb by flow cytometry and 1.86 Gb with genome survey sequencing. Based on the assembled genome sequences, total of 136,962 genes and 152,268 exons were predicted, and the predicted genes ranged from 150 to 12,807 bp in length. The survey sequences could help accelerate the progress of gene discovery involved in genetic diversity and evolutionary analysis, even though it could not successfully applied for estimation of P. clarkii genome size.
Suyama, Yoshihisa; Matsuki, Yu
2015-01-01
Restriction-enzyme (RE)-based next-generation sequencing methods have revolutionized marker-assisted genetic studies; however, the use of REs has limited their widespread adoption, especially in field samples with low-quality DNA and/or small quantities of DNA. Here, we developed a PCR-based procedure to construct reduced representation libraries without RE digestion steps, representing de novo single-nucleotide polymorphism discovery, and its genotyping using next-generation sequencing. Using multiplexed inter-simple sequence repeat (ISSR) primers, thousands of genome-wide regions were amplified effectively from a wide variety of genomes, without prior genetic information. We demonstrated: 1) Mendelian gametic segregation of the discovered variants; 2) reproducibility of genotyping by checking its applicability for individual identification; and 3) applicability in a wide variety of species by checking standard population genetic analysis. This approach, called multiplexed ISSR genotyping by sequencing, should be applicable to many marker-assisted genetic studies with a wide range of DNA qualities and quantities. PMID:26593239
VaDiR: an integrated approach to Variant Detection in RNA.
Neums, Lisa; Suenaga, Seiji; Beyerlein, Peter; Anders, Sara; Koestler, Devin; Mariani, Andrea; Chien, Jeremy
2018-02-01
Advances in next-generation DNA sequencing technologies are now enabling detailed characterization of sequence variations in cancer genomes. With whole-genome sequencing, variations in coding and non-coding sequences can be discovered. But the cost associated with it is currently limiting its general use in research. Whole-exome sequencing is used to characterize sequence variations in coding regions, but the cost associated with capture reagents and biases in capture rate limit its full use in research. Additional limitations include uncertainty in assigning the functional significance of the mutations when these mutations are observed in the non-coding region or in genes that are not expressed in cancer tissue. We investigated the feasibility of uncovering mutations from expressed genes using RNA sequencing datasets with a method called Variant Detection in RNA(VaDiR) that integrates 3 variant callers, namely: SNPiR, RVBoost, and MuTect2. The combination of all 3 methods, which we called Tier 1 variants, produced the highest precision with true positive mutations from RNA-seq that could be validated at the DNA level. We also found that the integration of Tier 1 variants with those called by MuTect2 and SNPiR produced the highest recall with acceptable precision. Finally, we observed a higher rate of mutation discovery in genes that are expressed at higher levels. Our method, VaDiR, provides a possibility of uncovering mutations from RNA sequencing datasets that could be useful in further functional analysis. In addition, our approach allows orthogonal validation of DNA-based mutation discovery by providing complementary sequence variation analysis from paired RNA/DNA sequencing datasets.
The Enigmatic Origin of Papillomavirus Protein Domains
Kirsip, Heleri; Gaston, Kevin
2017-01-01
Almost a century has passed since the discovery of papillomaviruses. A few decades of research have given a wealth of information on the molecular biology of papillomaviruses. Several excellent studies have been performed looking at the long- and short-term evolution of these viruses. However, when and how papillomaviruses originate is still a mystery. In this study, we systematically searched the (sequenced) biosphere to find distant homologs of papillomaviral protein domains. Our data show that, even including structural information, which allows us to find deeper evolutionary relationships compared to sequence-only based methods, only half of the protein domains in papillomaviruses have relatives in the rest of the biosphere. We show that the major capsid protein L1 and the replication protein E1 have relatives in several viral families, sharing three protein domains with Polyomaviridae and Parvoviridae. However, only the E1 replication protein has connections with cellular organisms. Most likely, the papillomavirus ancestor is of marine origin, a biotope that is not very well sequenced at the present time. Nevertheless, there is no evidence as to how papillomaviruses originated and how they became vertebrate and epithelium specific. PMID:28832519
The Enigmatic Origin of Papillomavirus Protein Domains.
Puustusmaa, Mikk; Kirsip, Heleri; Gaston, Kevin; Abroi, Aare
2017-08-23
Almost a century has passed since the discovery of papillomaviruses. A few decades of research have given a wealth of information on the molecular biology of papillomaviruses. Several excellent studies have been performed looking at the long- and short-term evolution of these viruses. However, when and how papillomaviruses originate is still a mystery. In this study, we systematically searched the (sequenced) biosphere to find distant homologs of papillomaviral protein domains. Our data show that, even including structural information, which allows us to find deeper evolutionary relationships compared to sequence-only based methods, only half of the protein domains in papillomaviruses have relatives in the rest of the biosphere. We show that the major capsid protein L1 and the replication protein E1 have relatives in several viral families, sharing three protein domains with Polyomaviridae and Parvoviridae . However, only the E1 replication protein has connections with cellular organisms. Most likely, the papillomavirus ancestor is of marine origin, a biotope that is not very well sequenced at the present time. Nevertheless, there is no evidence as to how papillomaviruses originated and how they became vertebrate and epithelium specific.
Termite hindguts and the ecology of microbial communities in the sequencing age.
Tai, Vera; Keeling, Patrick J
2013-01-01
Advances in high-throughput nucleic acid sequencing have improved our understanding of microbial communities in a number of ways. Deeper sequence coverage provides the means to assess diversity at the resolution necessary to recover ecological and biogeographic patterns, and at the same time single-cell genomics provides detailed information about the interactions between members of a microbial community. Given the vastness and complexity of microbial ecosystems, such analyses remain challenging for most environments, so greater insight can also be drawn from analysing less dynamic ecosystems. Here, we outline the advantages of one such environment, the wood-digesting hindgut communities of termites and cockroaches, and how it is a model to examine and compare both protist and bacterial communities. Beyond the analysis of diversity, our understanding of protist community ecology will depend on using statistically sound sampling regimes at biologically relevant scales, transitioning from discovery-based to experimental ecology, incorporating single-cell microbiology and other data sources, and continued development of analytical tools. © 2013 The Author(s) Journal of Eukaryotic Microbiology © 2013 International Society of Protistologists.
VirusDetect: An automated pipeline for efficient virus discovery using deep sequencing of small RNAs
USDA-ARS?s Scientific Manuscript database
Accurate detection of viruses in plants and animals is critical for agriculture production and human health. Deep sequencing and assembly of virus-derived siRNAs has proven to be a highly efficient approach for virus discovery. However, to date no computational tools specifically designed for both k...
Paisitkriangkrai, Sakrapee; Quek, Kelly; Nievergall, Eva; Jabbour, Anissa; Zannettino, Andrew; Kok, Chung Hoow
2018-06-07
Recurrent oncogenic fusion genes play a critical role in the development of various cancers and diseases and provide, in some cases, excellent therapeutic targets. To date, analysis tools that can identify and compare recurrent fusion genes across multiple samples have not been available to researchers. To address this deficiency, we developed Co-occurrence Fusion (Co-fuse), a new and easy to use software tool that enables biologists to merge RNA-seq information, allowing them to identify recurrent fusion genes, without the need for exhaustive data processing. Notably, Co-fuse is based on pattern mining and statistical analysis which enables the identification of hidden patterns of recurrent fusion genes. In this report, we show that Co-fuse can be used to identify 2 distinct groups within a set of 49 leukemic cell lines based on their recurrent fusion genes: a multiple myeloma (MM) samples-enriched cluster and an acute myeloid leukemia (AML) samples-enriched cluster. Our experimental results further demonstrate that Co-fuse can identify known driver fusion genes (e.g., IGH-MYC, IGH-WHSC1) in MM, when compared to AML samples, indicating the potential of Co-fuse to aid the discovery of yet unknown driver fusion genes through cohort comparisons. Additionally, using a 272 primary glioma sample RNA-seq dataset, Co-fuse was able to validate recurrent fusion genes, further demonstrating the power of this analysis tool to identify recurrent fusion genes. Taken together, Co-fuse is a powerful new analysis tool that can be readily applied to large RNA-seq datasets, and may lead to the discovery of new disease subgroups and potentially new driver genes, for which, targeted therapies could be developed. The Co-fuse R source code is publicly available at https://github.com/sakrapee/co-fuse .
Réfega, Susana; Girard-Misguich, Fabienne; Bourdieu, Christiane; Péry, Pierre; Labbé, Marie
2003-04-02
Specific antibodies were produced ex vivo from intestinal culture of Eimeria tenella infected chickens. The specificity of these intestinal antibodies was tested against different parasite stages. These antibodies were used to immunoscreen first generation schizont and sporozoite cDNA libraries permitting the identification of new E. tenella antigens. We obtained a total of 119 cDNA clones which were subjected to sequence analysis. The sequences coding for the proteins inducing local immune responses were compared with nucleotide or protein databases and with expressed sequence tags (ESTs) databases. We identified new Eimeria genes coding for heat shock proteins, a ribosomal protein, a pyruvate kinase and a pyridoxine kinase. Specific features of other sequences are discussed.
DOE Office of Scientific and Technical Information (OSTI.GOV)
DeAngelis, K.M.; Gladden, J.G.; Allgaier, M.
2010-03-01
Producing cellulosic biofuels from plant material has recently emerged as a key U.S. Department of Energy goal. For this technology to be commercially viable on a large scale, it is critical to make production cost efficient by streamlining both the deconstruction of lignocellulosic biomass and fuel production. Many natural ecosystems efficiently degrade lignocellulosic biomass and harbor enzymes that, when identified, could be used to increase the efficiency of commercial biomass deconstruction. However, ecosystems most likely to yield relevant enzymes, such as tropical rain forest soil in Puerto Rico, are often too complex for enzyme discovery using current metagenomic sequencing technologies.more » One potential strategy to overcome this problem is to selectively cultivate the microbial communities from these complex ecosystems on biomass under defined conditions, generating less complex biomass-degrading microbial populations. To test this premise, we cultivated microbes from Puerto Rican soil or green waste compost under precisely defined conditions in the presence dried ground switchgrass (Panicum virgatum L.) or lignin, respectively, as the sole carbon source. Phylogenetic profiling of the two feedstock-adapted communities using SSU rRNA gene amplicon pyrosequencing or phylogenetic microarray analysis revealed that the adapted communities were significantly simplified compared to the natural communities from which they were derived. Several members of the lignin-adapted and switchgrass-adapted consortia are related to organisms previously characterized as biomass degraders, while others were from less well-characterized phyla. The decrease in complexity of these communities make them good candidates for metagenomic sequencing and will likely enable the reconstruction of a greater number of full length genes, leading to the discovery of novel lignocellulose-degrading enzymes adapted to feedstocks and conditions of interest.« less
Yam, Alice Wei Yee; Colmant, Agathe M. G.; McLean, Breeanna J.; Prow, Natalie A.; Watterson, Daniel; Hall-Mendelin, Sonja; Warrilow, David; Ng, Mah-Lee; Khromykh, Alexander A.; Hall, Roy A.
2015-01-01
Mosquito-borne viruses encompass a range of virus families, comprising a number of significant human pathogens (e.g., dengue viruses, West Nile virus, Chikungunya virus). Virulent strains of these viruses are continually evolving and expanding their geographic range, thus rapid and sensitive screening assays are required to detect emerging viruses and monitor their prevalence and spread in mosquito populations. Double-stranded RNA (dsRNA) is produced during the replication of many of these viruses as either an intermediate in RNA replication (e.g., flaviviruses, togaviruses) or the double-stranded RNA genome (e.g., reoviruses). Detection and discovery of novel viruses from field and clinical samples usually relies on recognition of antigens or nucleotide sequences conserved within a virus genus or family. However, due to the wide antigenic and genetic variation within and between viral families, many novel or divergent species can be overlooked by these approaches. We have developed two monoclonal antibodies (mAbs) which show co-localised staining with proteins involved in viral RNA replication in immunofluorescence assay (IFA), suggesting specific reactivity to viral dsRNA. By assessing binding against a panel of synthetic dsRNA molecules, we have shown that these mAbs recognise dsRNA greater than 30 base pairs in length in a sequence-independent manner. IFA and enzyme-linked immunosorbent assay (ELISA) were employed to demonstrate detection of a panel of RNA viruses from several families, in a range of cell types. These mAbs, termed monoclonal antibodies to viral RNA intermediates in cells (MAVRIC), have now been incorporated into a high-throughput, economical ELISA-based screening system for the detection and discovery of viruses from mosquito populations. Our results have demonstrated that this simple system enables the efficient detection and isolation of a range of known and novel viruses in cells inoculated with field-caught mosquito samples, and represents a rapid, sequence-independent, and cost-effective approach to virus discovery. PMID:25799391
kpLogo: positional k-mer analysis reveals hidden specificity in biological sequences
2017-01-01
Abstract Motifs of only 1–4 letters can play important roles when present at key locations within macromolecules. Because existing motif-discovery tools typically miss these position-specific short motifs, we developed kpLogo, a probability-based logo tool for integrated detection and visualization of position-specific ultra-short motifs from a set of aligned sequences. kpLogo also overcomes the limitations of conventional motif-visualization tools in handling positional interdependencies and utilizing ranked or weighted sequences increasingly available from high-throughput assays. kpLogo can be found at http://kplogo.wi.mit.edu/. PMID:28460012
RNAbrowse: RNA-Seq de novo assembly results browser.
Mariette, Jérôme; Noirot, Céline; Nabihoudine, Ibounyamine; Bardou, Philippe; Hoede, Claire; Djari, Anis; Cabau, Cédric; Klopp, Christophe
2014-01-01
Transcriptome analysis based on a de novo assembly of next generation RNA sequences is now performed routinely in many laboratories. The generated results, including contig sequences, quantification figures, functional annotations and variation discovery outputs are usually bulky and quite diverse. This article presents a user oriented storage and visualisation environment permitting to explore the data in a top-down manner, going from general graphical views to all possible details. The software package is based on biomart, easy to install and populate with local data. The software package is available under the GNU General Public License (GPL) at http://bioinfo.genotoul.fr/RNAbrowse.
Open discovery: An integrated live Linux platform of Bioinformatics tools.
Vetrivel, Umashankar; Pilla, Kalabharath
2008-01-01
Historically, live linux distributions for Bioinformatics have paved way for portability of Bioinformatics workbench in a platform independent manner. Moreover, most of the existing live Linux distributions limit their usage to sequence analysis and basic molecular visualization programs and are devoid of data persistence. Hence, open discovery - a live linux distribution has been developed with the capability to perform complex tasks like molecular modeling, docking and molecular dynamics in a swift manner. Furthermore, it is also equipped with complete sequence analysis environment and is capable of running windows executable programs in Linux environment. Open discovery portrays the advanced customizable configuration of fedora, with data persistency accessible via USB drive or DVD. The Open Discovery is distributed free under Academic Free License (AFL) and can be downloaded from http://www.OpenDiscovery.org.in.
Validating regulatory predictions from diverse bacteria with mutant fitness data
Sagawa, Shiori; Price, Morgan N.; Deutschbauer, Adam M.; ...
2017-05-24
Although transcriptional regulation is fundamental to understanding bacterial physiology, the targets of most bacterial transcription factors are not known. Comparative genomics has been used to identify likely targets of some of these transcription factors, but these predictions typically lack experimental support. Here, we used mutant fitness data, which measures the importance of each gene for a bacterium's growth across many conditions, to test regulatory predictions from RegPrecise, a curated collection of comparative genomics predictions. Because characterized transcription factors often have correlated fitness with one of their targets (either positively or negatively), correlated fitness patterns provide support for the comparative genomicsmore » predictions. At a false discovery rate of 3%, we identified significant cofitness for at least one target of 158 TFs in 107 ortholog groups and from 24 bacteria. Thus, high-throughput genetics can be used to identify a high-confidence subset of the sequence-based regulatory predictions.« less
Validating regulatory predictions from diverse bacteria with mutant fitness data
DOE Office of Scientific and Technical Information (OSTI.GOV)
Sagawa, Shiori; Price, Morgan N.; Deutschbauer, Adam M.
Although transcriptional regulation is fundamental to understanding bacterial physiology, the targets of most bacterial transcription factors are not known. Comparative genomics has been used to identify likely targets of some of these transcription factors, but these predictions typically lack experimental support. Here, we used mutant fitness data, which measures the importance of each gene for a bacterium's growth across many conditions, to test regulatory predictions from RegPrecise, a curated collection of comparative genomics predictions. Because characterized transcription factors often have correlated fitness with one of their targets (either positively or negatively), correlated fitness patterns provide support for the comparative genomicsmore » predictions. At a false discovery rate of 3%, we identified significant cofitness for at least one target of 158 TFs in 107 ortholog groups and from 24 bacteria. Thus, high-throughput genetics can be used to identify a high-confidence subset of the sequence-based regulatory predictions.« less
Chaitankar, Vijender; Karakülah, Gökhan; Ratnapriya, Rinki; Giuste, Felipe O.; Brooks, Matthew J.; Swaroop, Anand
2016-01-01
The advent of high throughput next generation sequencing (NGS) has accelerated the pace of discovery of disease-associated genetic variants and genomewide profiling of expressed sequences and epigenetic marks, thereby permitting systems-based analyses of ocular development and disease. Rapid evolution of NGS and associated methodologies presents significant challenges in acquisition, management, and analysis of large data sets and for extracting biologically or clinically relevant information. Here we illustrate the basic design of commonly used NGS-based methods, specifically whole exome sequencing, transcriptome, and epigenome profiling, and provide recommendations for data analyses. We briefly discuss systems biology approaches for integrating multiple data sets to elucidate gene regulatory or disease networks. While we provide examples from the retina, the NGS guidelines reviewed here are applicable to other tissues/cell types as well. PMID:27297499
USDA-ARS?s Scientific Manuscript database
Background: Rust fungi are biotrophic basidiomycete plant pathogens that cause major diseases on plants and trees world-wide, affecting agriculture and forestry. Their biotrophic nature precludes many established molecular genetic manipulations and lines of research. The generation of genomic resour...
A Sensitive Assay for Virus Discovery in Respiratory Clinical Samples
de Vries, Michel; Deijs, Martin; Canuti, Marta; van Schaik, Barbera D. C.; Faria, Nuno R.; van de Garde, Martijn D. B.; Jachimowski, Loes C. M.; Jebbink, Maarten F.; Jakobs, Marja; Luyf, Angela C. M.; Coenjaerts, Frank E. J.; Claas, Eric C. J.; Molenkamp, Richard; Koekkoek, Sylvie M.; Lammens, Christine; Leus, Frank; Goossens, Herman; Ieven, Margareta; Baas, Frank; van der Hoek, Lia
2011-01-01
In 5–40% of respiratory infections in children, the diagnostics remain negative, suggesting that the patients might be infected with a yet unknown pathogen. Virus discovery cDNA-AFLP (VIDISCA) is a virus discovery method based on recognition of restriction enzyme cleavage sites, ligation of adaptors and subsequent amplification by PCR. However, direct discovery of unknown pathogens in nasopharyngeal swabs is difficult due to the high concentration of ribosomal RNA (rRNA) that acts as competitor. In the current study we optimized VIDISCA by adjusting the reverse transcription enzymes and decreasing rRNA amplification in the reverse transcription, using hexamer oligonucleotides that do not anneal to rRNA. Residual cDNA synthesis on rRNA templates was further reduced with oligonucleotides that anneal to rRNA but can not be extended due to 3′-dideoxy-C6-modification. With these modifications >90% reduction of rRNA amplification was established. Further improvement of the VIDISCA sensitivity was obtained by high throughput sequencing (VIDISCA-454). Eighteen nasopharyngeal swabs were analysed, all containing known respiratory viruses. We could identify the proper virus in the majority of samples tested (11/18). The median load in the VIDISCA-454 positive samples was 7.2 E5 viral genome copies/ml (ranging from 1.4 E3–7.7 E6). Our results show that optimization of VIDISCA and subsequent high-throughput-sequencing enhances sensitivity drastically and provides the opportunity to perform virus discovery directly in patient material. PMID:21283679
Li, Li; Brunk, Brian P.; Kissinger, Jessica C.; Pape, Deana; Tang, Keliang; Cole, Robert H.; Martin, John; Wylie, Todd; Dante, Mike; Fogarty, Steven J.; Howe, Daniel K.; Liberator, Paul; Diaz, Carmen; Anderson, Jennifer; White, Michael; Jerome, Maria E.; Johnson, Emily A.; Radke, Jay A.; Stoeckert, Christian J.; Waterston, Robert H.; Clifton, Sandra W.; Roos, David S.; Sibley, L. David
2003-01-01
Large-scale EST sequencing projects for several important parasites within the phylum Apicomplexa were undertaken for the purpose of gene discovery. Included were several parasites of medical importance (Plasmodium falciparum, Toxoplasma gondii) and others of veterinary importance (Eimeria tenella, Sarcocystis neurona, and Neospora caninum). A total of 55,192 ESTs, deposited into dbEST/GenBank, were included in the analyses. The resulting sequences have been clustered into nonredundant gene assemblies and deposited into a relational database that supports a variety of sequence and text searches. This database has been used to compare the gene assemblies using BLAST similarity comparisons to the public protein databases to identify putative genes. Of these new entries, ∼15%–20% represent putative homologs with a conservative cutoff of p < 10−9, thus identifying many conserved genes that are likely to share common functions with other well-studied organisms. Gene assemblies were also used to identify strain polymorphisms, examine stage-specific expression, and identify gene families. An interesting class of genes that are confined to members of this phylum and not shared by plants, animals, or fungi, was identified. These genes likely mediate the novel biological features of members of the Apicomplexa and hence offer great potential for biological investigation and as possible therapeutic targets. [The sequence data from this study have been submitted to dbEST division of GenBank under accession nos.: Toxoplasma gondii: –, –, –, –, – , –, –, –, –. Plasmodium falciparum: –, –, –, –. Sarcocystis neurona: , , , , , , , , , , , , , –, –, –, –, –. Eimeria tenella: –, –, –, –, –, –, –, –, – , –, –, –, –, –, –, –, –, –, –, –. Neospora caninum: –, –, , – , –, –.] PMID:12618375
Discovery and problem solving: Triangulation as a weak heuristic
NASA Technical Reports Server (NTRS)
Rochowiak, Daniel
1987-01-01
Recently the artificial intelligence community has turned its attention to the process of discovery and found that the history of science is a fertile source for what Darden has called compiled hindsight. Such hindsight generates weak heuristics for discovery that do not guarantee that discoveries will be made but do have proven worth in leading to discoveries. Triangulation is one such heuristic that is grounded in historical hindsight. This heuristic is explored within the general framework of the BACON, GLAUBER, STAHL, DALTON, and SUTTON programs. In triangulation different bases of information are compared in an effort to identify gaps between the bases. Thus, assuming that the bases of information are relevantly related, the gaps that are identified should be good locations for discovery and robust analysis.
Comparative Analysis and Distribution of Omega-3 lcPUFA Biosynthesis Genes in Marine Molluscs
Surm, Joachim M.; Prentis, Peter J.; Pavasovic, Ana
2015-01-01
Recent research has identified marine molluscs as an excellent source of omega-3 long-chain polyunsaturated fatty acids (lcPUFAs), based on their potential for endogenous synthesis of lcPUFAs. In this study we generated a representative list of fatty acyl desaturase (Fad) and elongation of very long-chain fatty acid (Elovl) genes from major orders of Phylum Mollusca, through the interrogation of transcriptome and genome sequences, and various publicly available databases. We have identified novel and uncharacterised Fad and Elovl sequences in the following species: Anadara trapezia, Nerita albicilla, Nerita melanotragus, Crassostrea gigas, Lottia gigantea, Aplysia californica, Loligo pealeii and Chlamys farreri. Based on alignments of translated protein sequences of Fad and Elovl genes, the haeme binding motif and histidine boxes of Fad proteins, and the histidine box and seventeen important amino acids in Elovl proteins, were highly conserved. Phylogenetic analysis of aligned reference sequences was used to reconstruct the evolutionary relationships for Fad and Elovl genes separately. Multiple, well resolved clades for both the Fad and Elovl sequences were observed, suggesting that repeated rounds of gene duplication best explain the distribution of Fad and Elovl proteins across the major orders of molluscs. For Elovl sequences, one clade contained the functionally characterised Elovl5 proteins, while another clade contained proteins hypothesised to have Elovl4 function. Additional well resolved clades consisted only of uncharacterised Elovl sequences. One clade from the Fad phylogeny contained only uncharacterised proteins, while the other clade contained functionally characterised delta-5 desaturase proteins. The discovery of an uncharacterised Fad clade is particularly interesting as these divergent proteins may have novel functions. Overall, this paper presents a number of novel Fad and Elovl genes suggesting that many mollusc groups possess most of the required enzymes for the synthesis of lcPUFAs. PMID:26308548
Fast discovery and visualization of conserved regions in DNA sequences using quasi-alignment
2013-01-01
Background Next Generation Sequencing techniques are producing enormous amounts of biological sequence data and analysis becomes a major computational problem. Currently, most analysis, especially the identification of conserved regions, relies heavily on Multiple Sequence Alignment and its various heuristics such as progressive alignment, whose run time grows with the square of the number and the length of the aligned sequences and requires significant computational resources. In this work, we present a method to efficiently discover regions of high similarity across multiple sequences without performing expensive sequence alignment. The method is based on approximating edit distance between segments of sequences using p-mer frequency counts. Then, efficient high-throughput data stream clustering is used to group highly similar segments into so called quasi-alignments. Quasi-alignments have numerous applications such as identifying species and their taxonomic class from sequences, comparing sequences for similarities, and, as in this paper, discovering conserved regions across related sequences. Results In this paper, we show that quasi-alignments can be used to discover highly similar segments across multiple sequences from related or different genomes efficiently and accurately. Experiments on a large number of unaligned 16S rRNA sequences obtained from the Greengenes database show that the method is able to identify conserved regions which agree with known hypervariable regions in 16S rRNA. Furthermore, the experiments show that the proposed method scales well for large data sets with a run time that grows only linearly with the number and length of sequences, whereas for existing multiple sequence alignment heuristics the run time grows super-linearly. Conclusion Quasi-alignment-based algorithms can detect highly similar regions and conserved areas across multiple sequences. Since the run time is linear and the sequences are converted into a compact clustering model, we are able to identify conserved regions fast or even interactively using a standard PC. Our method has many potential applications such as finding characteristic signature sequences for families of organisms and studying conserved and variable regions in, for example, 16S rRNA. PMID:24564200
Fast discovery and visualization of conserved regions in DNA sequences using quasi-alignment.
Nagar, Anurag; Hahsler, Michael
2013-01-01
Next Generation Sequencing techniques are producing enormous amounts of biological sequence data and analysis becomes a major computational problem. Currently, most analysis, especially the identification of conserved regions, relies heavily on Multiple Sequence Alignment and its various heuristics such as progressive alignment, whose run time grows with the square of the number and the length of the aligned sequences and requires significant computational resources. In this work, we present a method to efficiently discover regions of high similarity across multiple sequences without performing expensive sequence alignment. The method is based on approximating edit distance between segments of sequences using p-mer frequency counts. Then, efficient high-throughput data stream clustering is used to group highly similar segments into so called quasi-alignments. Quasi-alignments have numerous applications such as identifying species and their taxonomic class from sequences, comparing sequences for similarities, and, as in this paper, discovering conserved regions across related sequences. In this paper, we show that quasi-alignments can be used to discover highly similar segments across multiple sequences from related or different genomes efficiently and accurately. Experiments on a large number of unaligned 16S rRNA sequences obtained from the Greengenes database show that the method is able to identify conserved regions which agree with known hypervariable regions in 16S rRNA. Furthermore, the experiments show that the proposed method scales well for large data sets with a run time that grows only linearly with the number and length of sequences, whereas for existing multiple sequence alignment heuristics the run time grows super-linearly. Quasi-alignment-based algorithms can detect highly similar regions and conserved areas across multiple sequences. Since the run time is linear and the sequences are converted into a compact clustering model, we are able to identify conserved regions fast or even interactively using a standard PC. Our method has many potential applications such as finding characteristic signature sequences for families of organisms and studying conserved and variable regions in, for example, 16S rRNA.
Phylogenetic shadowing of primate sequences to find functional regions of the human genome.
Boffelli, Dario; McAuliffe, Jon; Ovcharenko, Dmitriy; Lewis, Keith D; Ovcharenko, Ivan; Pachter, Lior; Rubin, Edward M
2003-02-28
Nonhuman primates represent the most relevant model organisms to understand the biology of Homo sapiens. The recent divergence and associated overall sequence conservation between individual members of this taxon have nonetheless largely precluded the use of primates in comparative sequence studies. We used sequence comparisons of an extensive set of Old World and New World monkeys and hominoids to identify functional regions in the human genome. Analysis of these data enabled the discovery of primate-specific gene regulatory elements and the demarcation of the exons of multiple genes. Much of the information content of the comprehensive primate sequence comparisons could be captured with a small subset of phylogenetically close primates. These results demonstrate the utility of intraprimate sequence comparisons to discover common mammalian as well as primate-specific functional elements in the human genome, which are unattainable through the evaluation of more evolutionarily distant species.
Culture-independent discovery of natural products from soil metagenomes.
Katz, Micah; Hover, Bradley M; Brady, Sean F
2016-03-01
Bacterial natural products have proven to be invaluable starting points in the development of many currently used therapeutic agents. Unfortunately, traditional culture-based methods for natural product discovery have been deemphasized by pharmaceutical companies due in large part to high rediscovery rates. Culture-independent, or "metagenomic," methods, which rely on the heterologous expression of DNA extracted directly from environmental samples (eDNA), have the potential to provide access to metabolites encoded by a large fraction of the earth's microbial biosynthetic diversity. As soil is both ubiquitous and rich in bacterial diversity, it is an appealing starting point for culture-independent natural product discovery efforts. This review provides an overview of the history of soil metagenome-driven natural product discovery studies and elaborates on the recent development of new tools for sequence-based, high-throughput profiling of environmental samples used in discovering novel natural product biosynthetic gene clusters. We conclude with several examples of these new tools being employed to facilitate the recovery of novel secondary metabolite encoding gene clusters from soil metagenomes and the subsequent heterologous expression of these clusters to produce bioactive small molecules.
Song, Yuhyun; Leman, Scotland; Monteil, Caroline L.; Heath, Lenwood S.; Vinatzer, Boris A.
2014-01-01
A broadly accepted and stable biological classification system is a prerequisite for biological sciences. It provides the means to describe and communicate about life without ambiguity. Current biological classification and nomenclature use the species as the basic unit and require lengthy and laborious species descriptions before newly discovered organisms can be assigned to a species and be named. The current system is thus inadequate to classify and name the immense genetic diversity within species that is now being revealed by genome sequencing on a daily basis. To address this lack of a general intra-species classification and naming system adequate for today’s speed of discovery of new diversity, we propose a classification and naming system that is exclusively based on genome similarity and that is suitable for automatic assignment of codes to any genome-sequenced organism without requiring any phenotypic or phylogenetic analysis. We provide examples demonstrating that genome similarity-based codes largely align with current taxonomic groups at many different levels in bacteria, animals, humans, plants, and viruses. Importantly, the proposed approach is only slightly affected by the order of code assignment and can thus provide codes that reflect similarity between organisms and that do not need to be revised upon discovery of new diversity. We envision genome similarity-based codes to complement current biological nomenclature and to provide a universal means to communicate unambiguously about any genome-sequenced organism in fields as diverse as biodiversity research, infectious disease control, human and microbial forensics, animal breed and plant cultivar certification, and human ancestry research. PMID:24586551
SNP discovery by high-throughput sequencing in soybean
2010-01-01
Background With the advance of new massively parallel genotyping technologies, quantitative trait loci (QTL) fine mapping and map-based cloning become more achievable in identifying genes for important and complex traits. Development of high-density genetic markers in the QTL regions of specific mapping populations is essential for fine-mapping and map-based cloning of economically important genes. Single nucleotide polymorphisms (SNPs) are the most abundant form of genetic variation existing between any diverse genotypes that are usually used for QTL mapping studies. The massively parallel sequencing technologies (Roche GS/454, Illumina GA/Solexa, and ABI/SOLiD), have been widely applied to identify genome-wide sequence variations. However, it is still remains unclear whether sequence data at a low sequencing depth are enough to detect the variations existing in any QTL regions of interest in a crop genome, and how to prepare sequencing samples for a complex genome such as soybean. Therefore, with the aims of identifying SNP markers in a cost effective way for fine-mapping several QTL regions, and testing the validation rate of the putative SNPs predicted with Solexa short sequence reads at a low sequencing depth, we evaluated a pooled DNA fragment reduced representation library and SNP detection methods applied to short read sequences generated by Solexa high-throughput sequencing technology. Results A total of 39,022 putative SNPs were identified by the Illumina/Solexa sequencing system using a reduced representation DNA library of two parental lines of a mapping population. The validation rates of these putative SNPs predicted with low and high stringency were 72% and 85%, respectively. One hundred sixty four SNP markers resulted from the validation of putative SNPs and have been selectively chosen to target a known QTL, thereby increasing the marker density of the targeted region to one marker per 42 K bp. Conclusions We have demonstrated how to quickly identify large numbers of SNPs for fine mapping of QTL regions by applying massively parallel sequencing combined with genome complexity reduction techniques. This SNP discovery approach is more efficient for targeting multiple QTL regions in a same genetic population, which can be applied to other crops. PMID:20701770
Accurate multiple sequence-structure alignment of RNA sequences using combinatorial optimization.
Bauer, Markus; Klau, Gunnar W; Reinert, Knut
2007-07-27
The discovery of functional non-coding RNA sequences has led to an increasing interest in algorithms related to RNA analysis. Traditional sequence alignment algorithms, however, fail at computing reliable alignments of low-homology RNA sequences. The spatial conformation of RNA sequences largely determines their function, and therefore RNA alignment algorithms have to take structural information into account. We present a graph-based representation for sequence-structure alignments, which we model as an integer linear program (ILP). We sketch how we compute an optimal or near-optimal solution to the ILP using methods from combinatorial optimization, and present results on a recently published benchmark set for RNA alignments. The implementation of our algorithm yields better alignments in terms of two published scores than the other programs that we tested: This is especially the case with an increasing number of input sequences. Our program LARA is freely available for academic purposes from http://www.planet-lisa.net.
Sasaki, Katsutomo; Mitsuda, Nobutaka; Nashima, Kenji; Kishimoto, Kyutaro; Katayose, Yuichi; Kanamori, Hiroyuki; Ohmiya, Akemi
2017-09-04
Chrysanthemum morifolium is one of the most economically valuable ornamental plants worldwide. Chrysanthemum is an allohexaploid plant with a large genome that is commercially propagated by vegetative reproduction. New cultivars with different floral traits, such as color, morphology, and scent, have been generated mainly by classical cross-breeding and mutation breeding. However, only limited genetic resources and their genome information are available for the generation of new floral traits. To obtain useful information about molecular bases for floral traits of chrysanthemums, we read expressed sequence tags (ESTs) of chrysanthemums by high-throughput sequencing using the 454 pyrosequencing technology. We constructed normalized cDNA libraries, consisting of full-length, 3'-UTR, and 5'-UTR cDNAs derived from various tissues of chrysanthemums. These libraries produced a total number of 3,772,677 high-quality reads, which were assembled into 213,204 contigs. By comparing the data obtained with those of full genome-sequenced species, we confirmed that our chrysanthemum contig set contained the majority of all expressed genes, which was sufficient for further molecular analysis in chrysanthemums. We confirmed that our chrysanthemum EST set (contigs) contained a number of contigs that encoded transcription factors and enzymes involved in pigment and aroma compound metabolism that was comparable to that of other species. This information can serve as an informative resource for identifying genes involved in various biological processes in chrysanthemums. Moreover, the findings of our study will contribute to a better understanding of the floral characteristics of chrysanthemums including the myriad cultivars at the molecular level.
Drewes, Stephan; Straková, Petra; Drexler, Jan F; Jacob, Jens; Ulrich, Rainer G
2017-01-01
Rodents are distributed throughout the world and interact with humans in many ways. They provide vital ecosystem services, some species are useful models in biomedical research and some are held as pet animals. However, many rodent species can have adverse effects such as damage to crops and stored produce, and they are of health concern because of the transmission of pathogens to humans and livestock. The first rodent viruses were discovered by isolation approaches and resulted in break-through knowledge in immunology, molecular and cell biology, and cancer research. In addition to rodent-specific viruses, rodent-borne viruses are causing a large number of zoonotic diseases. Most prominent examples are reemerging outbreaks of human hemorrhagic fever disease cases caused by arena- and hantaviruses. In addition, rodents are reservoirs for vector-borne pathogens, such as tick-borne encephalitis virus and Borrelia spp., and may carry human pathogenic agents, but likely are not involved in their transmission to human. In our days, next-generation sequencing or high-throughput sequencing (HTS) is revolutionizing the speed of the discovery of novel viruses, but other molecular approaches, such as generic RT-PCR/PCR and rolling circle amplification techniques, contribute significantly to the rapidly ongoing process. However, the current knowledge still represents only the tip of the iceberg, when comparing the known human viruses to those known for rodents, the mammalian taxon with the largest species number. The diagnostic potential of HTS-based metagenomic approaches is illustrated by their use in the discovery and complete genome determination of novel borna- and adenoviruses as causative disease agents in squirrels. In conclusion, HTS, in combination with conventional RT-PCR/PCR-based approaches, resulted in a drastically increased knowledge of the diversity of rodent viruses. Future improvements of the used workflows, including bioinformatics analysis, will further enhance our knowledge and preparedness in case of the emergence of novel viruses. Classical virological and additional molecular approaches are needed for genome annotation and functional characterization of novel viruses, discovered by these technologies, and evaluation of their zoonotic potential. © 2017 Elsevier Inc. All rights reserved.
Pérez, Germán M; Salomón, Luis A; Montero-Cabrera, Luis A; de la Vega, José M García; Mascini, Marcello
2016-05-01
A novel heuristic using an iterative select-and-purge strategy is proposed. It combines statistical techniques for sampling and classification by rigid molecular docking through an inverse virtual screening scheme. This approach aims to the de novo discovery of short peptides that may act as docking receptors for small target molecules when there are no data available about known association complexes between them. The algorithm performs an unbiased stochastic exploration of the sample space, acting as a binary classifier when analyzing the entire peptides population. It uses a novel and effective criterion for weighting the likelihood of a given peptide to form an association complex with a particular ligand molecule based on amino acid sequences. The exploratory analysis relies on chemical information of peptides composition, sequence patterns, and association free energies (docking scores) in order to converge to those peptides forming the association complexes with higher affinities. Statistical estimations support these results providing an association probability by improving predictions accuracy even in cases where only a fraction of all possible combinations are sampled. False positives/false negatives ratio was also improved with this method. A simple rigid-body docking approach together with the proper information about amino acid sequences was used. The methodology was applied in a retrospective docking study to all 8000 possible tripeptide combinations using the 20 natural amino acids, screened against a training set of 77 different ligands with diverse functional groups. Afterward, all tripeptides were screened against a test set of 82 ligands, also containing different functional groups. Results show that our integrated methodology is capable of finding a representative group of the top-scoring tripeptides. The associated probability of identifying the best receptor or a group of the top-ranked receptors is more than double and about 10 times higher, respectively, when compared to classical random sampling methods.
Open discovery: An integrated live Linux platform of Bioinformatics tools
Vetrivel, Umashankar; Pilla, Kalabharath
2008-01-01
Historically, live linux distributions for Bioinformatics have paved way for portability of Bioinformatics workbench in a platform independent manner. Moreover, most of the existing live Linux distributions limit their usage to sequence analysis and basic molecular visualization programs and are devoid of data persistence. Hence, open discovery ‐ a live linux distribution has been developed with the capability to perform complex tasks like molecular modeling, docking and molecular dynamics in a swift manner. Furthermore, it is also equipped with complete sequence analysis environment and is capable of running windows executable programs in Linux environment. Open discovery portrays the advanced customizable configuration of fedora, with data persistency accessible via USB drive or DVD. Availability The Open Discovery is distributed free under Academic Free License (AFL) and can be downloaded from http://www.OpenDiscovery.org.in PMID:19238235
2013-01-01
Background With high quantity and quality data production and low cost, next generation sequencing has the potential to provide new opportunities for plant phylogeographic studies on single and multiple species. Here we present an approach for in silicio chloroplast DNA assembly and single nucleotide polymorphism detection from short-read shotgun sequencing. The approach is simple and effective and can be implemented using standard bioinformatic tools. Results The chloroplast genome of Toona ciliata (Meliaceae), 159,514 base pairs long, was assembled from shotgun sequencing on the Illumina platform using de novo assembly of contigs. To evaluate its practicality, value and quality, we compared the short read assembly with an assembly completed using 454 data obtained after chloroplast DNA isolation. Sanger sequence verifications indicated that the Illumina dataset outperformed the longer read 454 data. Pooling of several individuals during preparation of the shotgun library enabled detection of informative chloroplast SNP markers. Following validation, we used the identified SNPs for a preliminary phylogeographic study of T. ciliata in Australia and to confirm low diversity across the distribution. Conclusions Our approach provides a simple method for construction of whole chloroplast genomes from shotgun sequencing of whole genomic DNA using short-read data and no available closely related reference genome (e.g. from the same species or genus). The high coverage of Illumina sequence data also renders this method appropriate for multiplexing and SNP discovery and therefore a useful approach for landscape level studies of evolutionary ecology. PMID:23497206
Kamoun, Choumouss; Payen, Thibaut; Hua-Van, Aurélie; Filée, Jonathan
2013-10-11
Insertion Sequences (ISs) and their non-autonomous derivatives (MITEs) are important components of prokaryotic genomes inducing duplication, deletion, rearrangement or lateral gene transfers. Although ISs and MITEs are relatively simple and basic genetic elements, their detection remains a difficult task due to their remarkable sequence diversity. With the advent of high-throughput genome and metagenome sequencing technologies, the development of fast, reliable and sensitive methods of ISs and MITEs detection become an important challenge. So far, almost all studies dealing with prokaryotic transposons have used classical BLAST-based detection methods against reference libraries. Here we introduce alternative methods of detection either taking advantages of the structural properties of the elements (de novo methods) or using an additional library-based method using profile HMM searches. In this study, we have developed three different work flows dedicated to ISs and MITEs detection: the first two use de novo methods detecting either repeated sequences or presence of Inverted Repeats; the third one use 28 in-house transposase alignment profiles with HMM search methods. We have compared the respective performances of each method using a reference dataset of 30 archaeal and 30 bacterial genomes in addition to simulated and real metagenomes. Compared to a BLAST-based method using ISFinder as library, de novo methods significantly improve ISs and MITEs detection. For example, in the 30 archaeal genomes, we discovered 30 new elements (+20%) in addition to the 141 multi-copies elements already detected by the BLAST approach. Many of the new elements correspond to ISs belonging to unknown or highly divergent families. The total number of MITEs has even doubled with the discovery of elements displaying very limited sequence similarities with their respective autonomous partners (mainly in the Inverted Repeats of the elements). Concerning metagenomes, with the exception of short reads data (<300 bp) for which both techniques seem equally limited, profile HMM searches considerably ameliorate the detection of transposase encoding genes (up to +50%) generating low level of false positives compare to BLAST-based methods. Compared to classical BLAST-based methods, the sensitivity of de novo and profile HMM methods developed in this study allow a better and more reliable detection of transposons in prokaryotic genomes and metagenomes. We believed that future studies implying ISs and MITEs identification in genomic data should combine at least one de novo and one library-based method, with optimal results obtained by running the two de novo methods in addition to a library-based search. For metagenomic data, profile HMM search should be favored, a BLAST-based step is only useful to the final annotation into groups and families.
Rosenow, Matthew; Xiao, Nick; Spetzler, David
2018-01-01
ABSTRACT Extracellular vesicle (EV)-based liquid biopsies have been proposed to be a readily obtainable biological substrate recently for both profiling and diagnostics purposes. Development of a fast and reliable preparation protocol to enrich such small particles could accelerate the discovery of informative, disease-related biomarkers. Though multiple EV enrichment protocols are available, in terms of efficiency, reproducibility and simplicity, precipitation-based methods are most amenable to studies with large numbers of subjects. However, the selectivity of the precipitation becomes critical. Here, we present a simple plasma EV enrichment protocol based on pluronic block copolymer. The enriched plasma EV was able to be verified by multiple platforms. Our results showed that the particles enriched from plasma by the copolymer were EV size vesicles with membrane structure; proteomic profiling showed that EV-related proteins were significantly enriched, while high-abundant plasma proteins were significantly reduced in comparison to other precipitation-based enrichment methods. Next-generation sequencing confirmed the existence of various RNA species that have been observed in EVs from previous studies. Small RNA sequencing showed enriched species compared to the corresponding plasma. Moreover, plasma EVs enriched from 20 advanced breast cancer patients and 20 age-matched non-cancer controls were profiled by semi-quantitative mass spectrometry. Protein features were further screened by EV proteomic profiles generated from four breast cancer cell lines, and then selected in cross-validation models. A total of 60 protein features that highly contributed in model prediction were identified. Interestingly, a large portion of these features were associated with breast cancer aggression, metastasis as well as invasion, consistent with the advanced clinical stage of the patients. In summary, we have developed a plasma EV enrichment method with improved precipitation selectivity and it might be suitable for larger-scale discovery studies. PMID:29696079
The web server of IBM's Bioinformatics and Pattern Discovery group: 2004 update
Huynh, Tien; Rigoutsos, Isidore
2004-01-01
In this report, we provide an update on the services and content which are available on the web server of IBM's Bioinformatics and Pattern Discovery group. The server, which is operational around the clock, provides access to a large number of methods that have been developed and published by the group's members. There is an increasing number of problems that these tools can help tackle; these problems range from the discovery of patterns in streams of events and the computation of multiple sequence alignments, to the discovery of genes in nucleic acid sequences, the identification—directly from sequence—of structural deviations from α-helicity and the annotation of amino acid sequences for antimicrobial activity. Additionally, annotations for more than 130 archaeal, bacterial, eukaryotic and viral genomes are now available on-line and can be searched interactively. The tools and code bundles continue to be accessible from http://cbcsrv.watson.ibm.com/Tspd.html whereas the genomics annotations are available at http://cbcsrv.watson.ibm.com/Annotations/. PMID:15215340
Sequencing Data Discovery and Integration for Earth System Science with MetaSeek
NASA Astrophysics Data System (ADS)
Hoarfrost, A.; Brown, N.; Arnosti, C.
2017-12-01
Microbial communities play a central role in biogeochemical cycles. Sequencing data resources from environmental sources have grown exponentially in recent years, and represent a singular opportunity to investigate microbial interactions with Earth system processes. Carrying out such meta-analyses depends on our ability to discover and curate sequencing data into large-scale integrated datasets. However, such integration efforts are currently challenging and time-consuming, with sequencing data scattered across multiple repositories and metadata that is not easily or comprehensively searchable. MetaSeek is a sequencing data discovery tool that integrates sequencing metadata from all the major data repositories, allowing the user to search and filter on datasets in a lightweight application with an intuitive, easy-to-use web-based interface. Users can save and share curated datasets, while other users can browse these data integrations or use them as a jumping off point for their own curation. Missing and/or erroneous metadata are inferred automatically where possible, and where not possible, users are prompted to contribute to the improvement of the sequencing metadata pool by correcting and amending metadata errors. Once an integrated dataset has been curated, users can follow simple instructions to download their raw data and quickly begin their investigations. In addition to the online interface, the MetaSeek database is easily queryable via an open API, further enabling users and facilitating integrations of MetaSeek with other data curation tools. This tool lowers the barriers to curation and integration of environmental sequencing data, clearing the path forward to illuminating the ecosystem-scale interactions between biological and abiotic processes.
Kotschote, Stefan; Bonin, Michael
2018-01-01
ABSTRACT Extracellular vesicles (EVs) are intercellular communicators with key functions in physiological and pathological processes and have recently garnered interest because of their diagnostic and therapeutic potential. The past decade has brought about the development and commercialization of a wide array of methods to isolate EVs from serum. Which subpopulations of EVs are captured strongly depends on the isolation method, which in turn determines how suitable resulting samples are for various downstream applications. To help clinicians and scientists choose the most appropriate approach for their experiments, isolation methods need to be comparatively characterized. Few attempts have been made to comprehensively analyse vesicular microRNAs (miRNAs) in patient biofluids for biomarker studies. To address this discrepancy, we set out to benchmark the performance of several isolation principles for serum EVs in healthy individuals and critically ill patients. Here, we compared five different methods of EV isolation in combination with two RNA extraction methods regarding their suitability for biomarker discovery-focused miRNA sequencing as well as biological characteristics of captured vesicles. Our findings reveal striking method-specific differences in both the properties of isolated vesicles and the ability of associated miRNAs to serve in biomarker research. While isolation by precipitation and membrane affinity was highly suitable for miRNA-based biomarker discovery, methods based on size-exclusion chromatography failed to separate patients from healthy volunteers. Isolated vesicles differed in size, quantity, purity and composition, indicating that each method captured distinctive populations of EVs as well as additional contaminants. Even though the focus of this work was on transcriptomic profiling of EV-miRNAs, our insights also apply to additional areas of research. We provide guidance for navigating the multitude of EV isolation methods available today and help researchers and clinicians make an informed choice about which strategy to use for experiments involving critically ill patients. PMID:29887978
Buschmann, Dominik; Kirchner, Benedikt; Hermann, Stefanie; Märte, Melanie; Wurmser, Christine; Brandes, Florian; Kotschote, Stefan; Bonin, Michael; Steinlein, Ortrud K; Pfaffl, Michael W; Schelling, Gustav; Reithmair, Marlene
2018-01-01
Extracellular vesicles (EVs) are intercellular communicators with key functions in physiological and pathological processes and have recently garnered interest because of their diagnostic and therapeutic potential. The past decade has brought about the development and commercialization of a wide array of methods to isolate EVs from serum. Which subpopulations of EVs are captured strongly depends on the isolation method, which in turn determines how suitable resulting samples are for various downstream applications. To help clinicians and scientists choose the most appropriate approach for their experiments, isolation methods need to be comparatively characterized. Few attempts have been made to comprehensively analyse vesicular microRNAs (miRNAs) in patient biofluids for biomarker studies. To address this discrepancy, we set out to benchmark the performance of several isolation principles for serum EVs in healthy individuals and critically ill patients. Here, we compared five different methods of EV isolation in combination with two RNA extraction methods regarding their suitability for biomarker discovery-focused miRNA sequencing as well as biological characteristics of captured vesicles. Our findings reveal striking method-specific differences in both the properties of isolated vesicles and the ability of associated miRNAs to serve in biomarker research. While isolation by precipitation and membrane affinity was highly suitable for miRNA-based biomarker discovery, methods based on size-exclusion chromatography failed to separate patients from healthy volunteers. Isolated vesicles differed in size, quantity, purity and composition, indicating that each method captured distinctive populations of EVs as well as additional contaminants. Even though the focus of this work was on transcriptomic profiling of EV-miRNAs, our insights also apply to additional areas of research. We provide guidance for navigating the multitude of EV isolation methods available today and help researchers and clinicians make an informed choice about which strategy to use for experiments involving critically ill patients.
Effects of Four Instructional Sequences on Application and Transfer. IDD&E Working Paper No. 12
ERIC Educational Resources Information Center
Chao, Chun-I; And Others
Using the Component Display Theory as an analyzing tool, this study compared the effects of expository and discovery methods of instruction on two learning outcomes, application and transfer. One hundred ninth grade students in each of four earth science classes were randomly assigned to five groups--four experimental groups designed to test four…
Next-generation libraries for robust RNA interference-based genome-wide screens
Kampmann, Martin; Horlbeck, Max A.; Chen, Yuwen; Tsai, Jordan C.; Bassik, Michael C.; Gilbert, Luke A.; Villalta, Jacqueline E.; Kwon, S. Chul; Chang, Hyeshik; Kim, V. Narry; Weissman, Jonathan S.
2015-01-01
Genetic screening based on loss-of-function phenotypes is a powerful discovery tool in biology. Although the recent development of clustered regularly interspaced short palindromic repeats (CRISPR)-based screening approaches in mammalian cell culture has enormous potential, RNA interference (RNAi)-based screening remains the method of choice in several biological contexts. We previously demonstrated that ultracomplex pooled short-hairpin RNA (shRNA) libraries can largely overcome the problem of RNAi off-target effects in genome-wide screens. Here, we systematically optimize several aspects of our shRNA library, including the promoter and microRNA context for shRNA expression, selection of guide strands, and features relevant for postscreen sample preparation for deep sequencing. We present next-generation high-complexity libraries targeting human and mouse protein-coding genes, which we grouped into 12 sublibraries based on biological function. A pilot screen suggests that our next-generation RNAi library performs comparably to current CRISPR interference (CRISPRi)-based approaches and can yield complementary results with high sensitivity and high specificity. PMID:26080438
Robustness of disaggregate oil and gas discovery forecasting models
Attanasi, E.D.; Schuenemeyer, J.H.
1989-01-01
The trend in forecasting oil and gas discoveries has been to develop and use models that allow forecasts of the size distribution of future discoveries. From such forecasts, exploration and development costs can more readily be computed. Two classes of these forecasting models are the Arps-Roberts type models and the 'creaming method' models. This paper examines the robustness of the forecasts made by these models when the historical data on which the models are based have been subject to economic upheavals or when historical discovery data are aggregated from areas having widely differing economic structures. Model performance is examined in the context of forecasting discoveries for offshore Texas State and Federal areas. The analysis shows how the model forecasts are limited by information contained in the historical discovery data. Because the Arps-Roberts type models require more regularity in discovery sequence than the creaming models, prior information had to be introduced into the Arps-Roberts models to accommodate the influence of economic changes. The creaming methods captured the overall decline in discovery size but did not easily allow introduction of exogenous information to compensate for incomplete historical data. Moreover, the predictive log normal distribution associated with the creaming model methods appears to understate the importance of the potential contribution of small fields. ?? 1989.
Predicting discovery rates of genomic features.
Gravel, Simon
2014-06-01
Successful sequencing experiments require judicious sample selection. However, this selection must often be performed on the basis of limited preliminary data. Predicting the statistical properties of the final sample based on preliminary data can be challenging, because numerous uncertain model assumptions may be involved. Here, we ask whether we can predict "omics" variation across many samples by sequencing only a fraction of them. In the infinite-genome limit, we find that a pilot study sequencing 5% of a population is sufficient to predict the number of genetic variants in the entire population within 6% of the correct value, using an estimator agnostic to demography, selection, or population structure. To reach similar accuracy in a finite genome with millions of polymorphisms, the pilot study would require ∼15% of the population. We present computationally efficient jackknife and linear programming methods that exhibit substantially less bias than the state of the art when applied to simulated data and subsampled 1000 Genomes Project data. Extrapolating based on the National Heart, Lung, and Blood Institute Exome Sequencing Project data, we predict that 7.2% of sites in the capture region would be variable in a sample of 50,000 African Americans and 8.8% in a European sample of equal size. Finally, we show how the linear programming method can also predict discovery rates of various genomic features, such as the number of transcription factor binding sites across different cell types. Copyright © 2014 by the Genetics Society of America.
Using the TIGR gene index databases for biological discovery.
Lee, Yuandan; Quackenbush, John
2003-11-01
The TIGR Gene Index web pages provide access to analyses of ESTs and gene sequences for nearly 60 species, as well as a number of resources derived from these. Each species-specific database is presented using a common format with a homepage. A variety of methods exist that allow users to search each species-specific database. Methods implemented currently include nucleotide or protein sequence queries using WU-BLAST, text-based searches using various sequence identifiers, searches by gene, tissue and library name, and searches using functional classes through Gene Ontology assignments. This protocol provides guidance for using the Gene Index Databases to extract information.
Carnivore-specific SINEs (Can-SINEs): distribution, evolution, and genomic impact.
Walters-Conte, Kathryn B; Johnson, Diana L E; Allard, Marc W; Pecon-Slattery, Jill
2011-01-01
Short interspersed nuclear elements (SINEs) are a type of class 1 transposable element (retrotransposon) with features that allow investigators to resolve evolutionary relationships between populations and species while providing insight into genome composition and function. Characterization of a Carnivora-specific SINE family, Can-SINEs, has, has aided comparative genomic studies by providing rare genomic changes, and neutral sequence variants often needed to resolve difficult evolutionary questions. In addition, Can-SINEs constitute a significant source of functional diversity with Carnivora. Publication of the whole-genome sequence of domestic dog, domestic cat, and giant panda serves as a valuable resource in comparative genomic inferences gleaned from Can-SINEs. In anticipation of forthcoming studies bolstered by new genomic data, this review describes the discovery and characterization of Can-SINE motifs as well as describes composition, distribution, and effect on genome function. As the contribution of noncoding sequences to genomic diversity becomes more apparent, SINEs and other transposable elements will play an increasingly large role in mammalian comparative genomics.
Carnivore-Specific SINEs (Can-SINEs): Distribution, Evolution, and Genomic Impact
Johnson, Diana L.E.; Allard, Marc W.; Pecon-Slattery, Jill
2011-01-01
Short interspersed nuclear elements (SINEs) are a type of class 1 transposable element (retrotransposon) with features that allow investigators to resolve evolutionary relationships between populations and species while providing insight into genome composition and function. Characterization of a Carnivora-specific SINE family, Can-SINEs, has, has aided comparative genomic studies by providing rare genomic changes, and neutral sequence variants often needed to resolve difficult evolutionary questions. In addition, Can-SINEs constitute a significant source of functional diversity with Carnivora. Publication of the whole-genome sequence of domestic dog, domestic cat, and giant panda serves as a valuable resource in comparative genomic inferences gleaned from Can-SINEs. In anticipation of forthcoming studies bolstered by new genomic data, this review describes the discovery and characterization of Can-SINE motifs as well as describes composition, distribution, and effect on genome function. As the contribution of noncoding sequences to genomic diversity becomes more apparent, SINEs and other transposable elements will play an increasingly large role in mammalian comparative genomics. PMID:21846743
Neo-sex Chromosomes in the Monarch Butterfly, Danaus plexippus
Mongue, Andrew J.; Nguyen, Petr; Voleníková, Anna; Walters, James R.
2017-01-01
We report the discovery of a neo-sex chromosome in the monarch butterfly, Danaus plexippus, and several of its close relatives. Z-linked scaffolds in the D. plexippus genome assembly were identified via sex-specific differences in Illumina sequencing coverage. Additionally, a majority of the D. plexippus genome assembly was assigned to chromosomes based on counts of one-to-one orthologs relative to the butterfly Melitaea cinxia (with replication using two other lepidopteran species), in which genome scaffolds have been mapped to linkage groups. Sequencing coverage-based assessments of Z linkage combined with homology-based chromosomal assignments provided strong evidence for a Z-autosome fusion in the Danaus lineage, involving the autosome homologous to chromosome 21 in M. cinxia. Coverage analysis also identified three notable assembly errors resulting in chimeric Z-autosome scaffolds. Cytogenetic analysis further revealed a large W chromosome that is partially euchromatic, consistent with being a neo-W chromosome. The discovery of a neo-Z and the provisional assignment of chromosome linkage for >90% of D. plexippus genes lays the foundation for novel insights concerning sex chromosome evolution in this female-heterogametic model species for functional and evolutionary genomics. PMID:28839116
Do, Hongdo; Molania, Ramyar
2017-01-01
The identification of genomic rearrangements with high sensitivity and specificity using massively parallel sequencing remains a major challenge, particularly in precision medicine and cancer research. Here, we describe a new method for detecting rearrangements, GRIDSS (Genome Rearrangement IDentification Software Suite). GRIDSS is a multithreaded structural variant (SV) caller that performs efficient genome-wide break-end assembly prior to variant calling using a novel positional de Bruijn graph-based assembler. By combining assembly, split read, and read pair evidence using a probabilistic scoring, GRIDSS achieves high sensitivity and specificity on simulated, cell line, and patient tumor data, recently winning SV subchallenge #5 of the ICGC-TCGA DREAM8.5 Somatic Mutation Calling Challenge. On human cell line data, GRIDSS halves the false discovery rate compared to other recent methods while matching or exceeding their sensitivity. GRIDSS identifies nontemplate sequence insertions, microhomologies, and large imperfect homologies, estimates a quality score for each breakpoint, stratifies calls into high or low confidence, and supports multisample analysis. PMID:29097403
Swaney, Danielle L.; Wenger, Craig D.; Thomson, James A.; Coon, Joshua J.
2009-01-01
Protein phosphorylation is central to the understanding of cellular signaling, and cellular signaling is suggested to play a major role in the regulation of human embryonic stem (ES) cell pluripotency. Here, we describe the use of conventional tandem mass spectrometry-based sequencing technology—collision-activated dissociation (CAD)—and the more recently developed method electron transfer dissociation (ETD) to characterize the human ES cell phosphoproteome. In total, these experiments resulted in the identification of 11,995 unique phosphopeptides, corresponding to 10,844 nonredundant phosphorylation sites, at a 1% false discovery rate (FDR). Among these phosphorylation sites are 5 localized to 2 pluripotency critical transcription factors—OCT4 and SOX2. From these experiments, we conclude that ETD identifies a larger number of unique phosphopeptides than CAD (8,087 to 3,868), more frequently localizes the phosphorylation site to a specific residue (49.8% compared with 29.6%), and sequences whole classes of phosphopeptides previously unobserved. PMID:19144917
Smeele, Zoe E; Ainley, David G; Varsani, Arvind
2018-01-02
The Antarctic, sub-Antarctic islands and surrounding sea-ice provide a unique environment for the existence of organisms. Nonetheless, birds and seals of a variety of species inhabit them, particularly during their breeding seasons. Early research on Antarctic wildlife health, using serology-based assays, showed exposure to viruses in the families Birnaviridae, Flaviviridae, Herpesviridae, Orthomyxoviridae and Paramyxoviridae circulating in seals (Phocidae), penguins (Spheniscidae), petrels (Procellariidae) and skuas (Stercorariidae). It is only during the last decade or so that polymerase chain reaction-based assays have been used to characterize viruses associated with Antarctic animals. Furthermore, it is only during the last five years that full/whole genomes of viruses (adenoviruses, anelloviruses, orthomyxoviruses, a papillomavirus, paramyoviruses, polyomaviruses and a togavirus) have been sequenced using Sanger sequencing or high throughput sequencing (HTS) approaches. This review summaries the knowledge of animal Antarctic virology and discusses potential future directions with the advent of HTS in virus discovery and ecology. Copyright © 2017 Elsevier B.V. All rights reserved.
RNAbrowse: RNA-Seq De Novo Assembly Results Browser
Mariette, Jérôme; Noirot, Céline; Nabihoudine, Ibounyamine; Bardou, Philippe; Hoede, Claire; Djari, Anis; Cabau, Cédric; Klopp, Christophe
2014-01-01
Transcriptome analysis based on a de novo assembly of next generation RNA sequences is now performed routinely in many laboratories. The generated results, including contig sequences, quantification figures, functional annotations and variation discovery outputs are usually bulky and quite diverse. This article presents a user oriented storage and visualisation environment permitting to explore the data in a top-down manner, going from general graphical views to all possible details. The software package is based on biomart, easy to install and populate with local data. The software package is available under the GNU General Public License (GPL) at http://bioinfo.genotoul.fr/RNAbrowse. PMID:24823498
Pine Gene Discovery Project - Final Report - 08/31/1997 - 02/28/2001
DOE Office of Scientific and Technical Information (OSTI.GOV)
Whetten, R. W.; Sederoff, R. R.; Kinlaw, C.
2001-04-30
Integration of pines into the large scope of plant biology research depends on study of pines in parallel with study of annual plants, and on availability of research materials from pine to plant biologists interested in comparing pine with annual plant systems. The objectives of the Pine Gene Discovery Project were to obtain 10,000 partial DNA sequences of genes expressed in loblolly pine, to determine which of those pine genes were similar to known genes from other organisms, and to make the DNA sequences and isolated pine genes available to plant researchers to stimulate integration of pines into the widermore » scope of plant biology research. Those objectives have been completed, and the results are available to the public. Requests for pine genes have been received from a number of laboratories that would otherwise not have included pine in their research, indicating that progress is being made toward the goal of integrating pine research into the larger molecular biology research community.« less
Andreani, Julien; Verneau, Jonathan; Raoult, Didier; Levasseur, Anthony; La Scola, Bernard
2018-04-10
Nucleo-cytoplasmic large DNA viruses are doubled stranded DNA viruses capable of infecting eukaryotic cells. Since the discovery of Mimivirus and Pandoravirus, there has been no doubt about their extraordinary features compared to "classic" viruses. Recently, we reported the expansion of the proposed family Pithoviridae, with the description of Cedratvirus and Orpheovirus, two new viruses related to Pithoviruses. Studying the major capsid protein of Orpheovirus, we detected a homologous sequence in a mine drainage metagenome. The in-depth exploration of this metagenome, using the MG-Digger program, enabled us to retrieve up to 10 contigs with clear evidence of viral sequences. Moreover, phylogenetic analyses further extended our screening with the discovery in another marine metagenome of a second virus closely related to Orpheovirus IHUMI-LCC2. This virus is a misidentified virus confused with and annotated as a Rickettsiales bacterium. It presents a partial genome size of about 170 kbp.
Sequence evaluation of four specific cDNA libraries for developmental genomics of sunflower.
Tamborindeguy, C; Ben, C; Liboz, T; Gentzbittel, L
2004-04-01
Four different cDNA libraries were constructed from sunflower protoplasts growing under embryogenic and non-embryogenic conditions: one standard library from each condition and two subtractive libraries in opposite sense. A total of 22,876 cDNA clones were obtained and 4800 ESTs were sequenced, giving rise to 2479 high quality ESTs representing an unigene set of 1502 sequences. This set was compared with ESTs represented in public databases using the programs BLASTN and BLASTX, and its members were classified according to putative function using the catalog in the Kyoto Encyclopedia of Genes and Genomes (KEGG). Some 33% of sequences failed to align with existing plant ESTs and therefore represent putative novel genes. The libraries show a low level of redundancy and, on average, 50% of the present ESTs have not been previously reported for sunflower. Several potentially interesting genes were identified, based on their homology with genes involved in animal zygotic division or plant embryogenesis. We also identified two ESTs that show significantly different levels of expression under embryogenic and non-embryogenic conditions. The libraries described here represent an original and valuable resource for the discovery of yet unknown genes putatively involved in dicot embryogenesis and improving our knowledge of the mechanisms involved in polarity acquisition by plant embryos.
Villanova, Federica; Di Meglio, Paola; Inokuma, Margaret; Aghaeepour, Nima; Perucha, Esperanza; Mollon, Jennifer; Nomura, Laurel; Hernandez-Fuentes, Maria; Cope, Andrew; Prevost, A Toby; Heck, Susanne; Maino, Vernon; Lord, Graham; Brinkman, Ryan R; Nestle, Frank O
2013-01-01
Discovery of novel immune biomarkers for monitoring of disease prognosis and response to therapy in immune-mediated inflammatory diseases is an important unmet clinical need. Here, we establish a novel framework for immunological biomarker discovery, comparing a conventional (liquid) flow cytometry platform (CFP) and a unique lyoplate-based flow cytometry platform (LFP) in combination with advanced computational data analysis. We demonstrate that LFP had higher sensitivity compared to CFP, with increased detection of cytokines (IFN-γ and IL-10) and activation markers (Foxp3 and CD25). Fluorescent intensity of cells stained with lyophilized antibodies was increased compared to cells stained with liquid antibodies. LFP, using a plate loader, allowed medium-throughput processing of samples with comparable intra- and inter-assay variability between platforms. Automated computational analysis identified novel immunophenotypes that were not detected with manual analysis. Our results establish a new flow cytometry platform for standardized and rapid immunological biomarker discovery with wide application to immune-mediated diseases.
Villanova, Federica; Di Meglio, Paola; Inokuma, Margaret; Aghaeepour, Nima; Perucha, Esperanza; Mollon, Jennifer; Nomura, Laurel; Hernandez-Fuentes, Maria; Cope, Andrew; Prevost, A. Toby; Heck, Susanne; Maino, Vernon; Lord, Graham; Brinkman, Ryan R.; Nestle, Frank O.
2013-01-01
Discovery of novel immune biomarkers for monitoring of disease prognosis and response to therapy in immune-mediated inflammatory diseases is an important unmet clinical need. Here, we establish a novel framework for immunological biomarker discovery, comparing a conventional (liquid) flow cytometry platform (CFP) and a unique lyoplate-based flow cytometry platform (LFP) in combination with advanced computational data analysis. We demonstrate that LFP had higher sensitivity compared to CFP, with increased detection of cytokines (IFN-γ and IL-10) and activation markers (Foxp3 and CD25). Fluorescent intensity of cells stained with lyophilized antibodies was increased compared to cells stained with liquid antibodies. LFP, using a plate loader, allowed medium-throughput processing of samples with comparable intra- and inter-assay variability between platforms. Automated computational analysis identified novel immunophenotypes that were not detected with manual analysis. Our results establish a new flow cytometry platform for standardized and rapid immunological biomarker discovery with wide application to immune-mediated diseases. PMID:23843942
Yu, Dongliang; Meng, Yijun; Zuo, Ziwei; Xue, Jie; Wang, Huizhong
2016-01-01
Nat-siRNAs (small interfering RNAs originated from natural antisense transcripts) are a class of functional small RNA (sRNA) species discovered in both plants and animals. These siRNAs are highly enriched within the annealed regions of the NAT (natural antisense transcript) pairs. To date, great research efforts have been taken for systematical identification of the NATs in various organisms. However, developing a freely available and easy-to-use program for NAT prediction is strongly demanded by researchers. Here, we proposed an integrative pipeline named NATpipe for systematical discovery of NATs from de novo assembled transcriptomes. By utilizing sRNA sequencing data, the pipeline also allowed users to search for phase-distributed nat-siRNAs within the perfectly annealed regions of the NAT pairs. Additionally, more reliable nat-siRNA loci could be identified based on degradome sequencing data. A case study on the non-model plant Dendrobium officinale was performed to illustrate the utility of NATpipe. Finally, we hope that NATpipe would be a useful tool for NAT prediction, nat-siRNA discovery, and related functional studies. NATpipe is available at www.bioinfolab.cn/NATpipe/NATpipe.zip. PMID:26858106
Pappalardo, Matteo; Rayan, Mahmoud; Abu-Lafi, Saleh; Leonardi, Martha E; Milardi, Danilo; Guccione, Salvatore; Rayan, Anwar
2017-08-01
Modeling G-Protein Coupled Receptors (GPCRs) is an emergent field of research, since utility of high-quality models in receptor structure-based strategies might facilitate the discovery of interesting drug candidates. The findings from a quantitative analysis of eighteen resolved structures of rhodopsin family "A" receptors crystallized with antagonists and 153 pairs of structures are described. A strategy termed endeca-amino acids fragmentation was used to analyze the structures models aiming to detect the relationship between sequence identity and Root Mean Square Deviation (RMSD) at each trans-membrane-domain. Moreover, we have applied the leave-one-out strategy to study the shiftiness likelihood of the helices. The type of correlation between sequence identity and RMSD was studied using the aforementioned set receptors as representatives of membrane proteins and 98 serine proteases with 4753 pairs of structures as representatives of globular proteins. Data analysis using fragmentation strategy revealed that there is some extent of correlation between sequence identity and global RMSD of 11AA width windows. However, spatial conservation is not always close to the endoplasmic side as was reported before. A comparative study with globular proteins shows that GPCRs have higher standard deviation and higher slope in the graph with correlation between sequence identity and RMSD. The extracted information disclosed in this paper could be incorporated in the modeling protocols while using technique for model optimization and refinement. © 2017 Wiley-VCH Verlag GmbH & Co. KGaA, Weinheim.
Gene discovery using next-generation pyrosequencing to develop ESTs for Phalaenopsis orchids
2011-01-01
Background Orchids are one of the most diversified angiosperms, but few genomic resources are available for these non-model plants. In addition to the ecological significance, Phalaenopsis has been considered as an economically important floriculture industry worldwide. We aimed to use massively parallel 454 pyrosequencing for a global characterization of the Phalaenopsis transcriptome. Results To maximize sequence diversity, we pooled RNA from 10 samples of different tissues, various developmental stages, and biotic- or abiotic-stressed plants. We obtained 206,960 expressed sequence tags (ESTs) with an average read length of 228 bp. These reads were assembled into 8,233 contigs and 34,630 singletons. The unigenes were searched against the NCBI non-redundant (NR) protein database. Based on sequence similarity with known proteins, these analyses identified 22,234 different genes (E-value cutoff, e-7). Assembled sequences were annotated with Gene Ontology, Gene Family and Kyoto Encyclopedia of Genes and Genomes (KEGG) pathways. Among these annotations, over 780 unigenes encoding putative transcription factors were identified. Conclusion Pyrosequencing was effective in identifying a large set of unigenes from Phalaenopsis. The informative EST dataset we developed constitutes a much-needed resource for discovery of genes involved in various biological processes in Phalaenopsis and other orchid species. These transcribed sequences will narrow the gap between study of model organisms with many genomic resources and species that are important for ecological and evolutionary studies. PMID:21749684
Simple sequence repeat marker loci discovery using SSR primer.
Robinson, Andrew J; Love, Christopher G; Batley, Jacqueline; Barker, Gary; Edwards, David
2004-06-12
Simple sequence repeats (SSRs) have become important molecular markers for a broad range of applications, such as genome mapping and characterization, phenotype mapping, marker assisted selection of crop plants and a range of molecular ecology and diversity studies. With the increase in the availability of DNA sequence information, an automated process to identify and design PCR primers for amplification of SSR loci would be a useful tool in plant breeding programs. We report an application that integrates SPUTNIK, an SSR repeat finder, with Primer3, a PCR primer design program, into one pipeline tool, SSR Primer. On submission of multiple FASTA formatted sequences, the script screens each sequence for SSRs using SPUTNIK. The results are parsed to Primer3 for locus-specific primer design. The script makes use of a Web-based interface, enabling remote use. This program has been written in PERL and is freely available for non-commercial users by request from the authors. The Web-based version may be accessed at http://hornbill.cspp.latrobe.edu.au/
Effect of Next-Generation Exome Sequencing Depth for Discovery of Diagnostic Variants.
Kim, Kyung; Seong, Moon-Woo; Chung, Won-Hyong; Park, Sung Sup; Leem, Sangseob; Park, Won; Kim, Jihyun; Lee, KiYoung; Park, Rae Woong; Kim, Namshin
2015-06-01
Sequencing depth, which is directly related to the cost and time required for the generation, processing, and maintenance of next-generation sequencing data, is an important factor in the practical utilization of such data in clinical fields. Unfortunately, identifying an exome sequencing depth adequate for clinical use is a challenge that has not been addressed extensively. Here, we investigate the effect of exome sequencing depth on the discovery of sequence variants for clinical use. Toward this, we sequenced ten germ-line blood samples from breast cancer patients on the Illumina platform GAII(x) at a high depth of ~200×. We observed that most function-related diverse variants in the human exonic regions could be detected at a sequencing depth of 120×. Furthermore, investigation using a diagnostic gene set showed that the number of clinical variants identified using exome sequencing reached a plateau at an average sequencing depth of about 120×. Moreover, the phenomena were consistent across the breast cancer samples.
ERIC Educational Resources Information Center
Thumm, Walter
1975-01-01
Relates the story of Wilhelm Conrad Rontgen and presents one view of the extent to which the discovery of the x-ray was an accident. Reconstructs the sequence of events that led to the discovery and includes photographs of the lab where he worked and replicas of apparatus used. (GS)
flyDIVaS: A Comparative Genomics Resource for Drosophila Divergence and Selection
Stanley, Craig E.; Kulathinal, Rob J.
2016-01-01
With arguably the best finished and expertly annotated genome assembly, Drosophila melanogaster is a formidable genetics model to study all aspects of biology. Nearly a decade ago, the 12 Drosophila genomes project expanded D. melanogaster’s breadth as a comparative model through the community-development of an unprecedented genus- and genome-wide comparative resource. However, since its inception, these datasets for evolutionary inference and biological discovery have become increasingly outdated, outmoded, and inaccessible. Here, we provide an updated and upgradable comparative genomics resource of Drosophila divergence and selection, flyDIVaS, based on the latest genomic assemblies, curated FlyBase annotations, and recent OrthoDB orthology calls. flyDIVaS is an online database containing D. melanogaster-centric orthologous gene sets, CDS and protein alignments, divergence statistics (% gaps, dN, dS, dN/dS), and codon-based tests of positive Darwinian selection. Out of 13,920 protein-coding D. melanogaster genes, ∼80% have one aligned ortholog in the closely related species, D. simulans, and ∼50% have 1–1 12-way alignments in the original 12 sequenced species that span over 80 million yr of divergence. Genes and their orthologs can be chosen from four different taxonomic datasets differing in phylogenetic depth and coverage density, and visualized via interactive alignments and phylogenetic trees. Users can also batch download entire comparative datasets. A functional survey finds conserved mitotic and neural genes, highly diverged immune and reproduction-related genes, more conspicuous signals of divergence across tissue-specific genes, and an enrichment of positive selection among highly diverged genes. flyDIVaS will be regularly updated and can be freely accessed at www.flydivas.info. We encourage researchers to regularly use this resource as a tool for biological inference and discovery, and in their classrooms to help train the next generation of biologists to creatively use such genomic big data resources in an integrative manner. PMID:27226167
flyDIVaS: A Comparative Genomics Resource for Drosophila Divergence and Selection.
Stanley, Craig E; Kulathinal, Rob J
2016-08-09
With arguably the best finished and expertly annotated genome assembly, Drosophila melanogaster is a formidable genetics model to study all aspects of biology. Nearly a decade ago, the 12 Drosophila genomes project expanded D. melanogaster's breadth as a comparative model through the community-development of an unprecedented genus- and genome-wide comparative resource. However, since its inception, these datasets for evolutionary inference and biological discovery have become increasingly outdated, outmoded, and inaccessible. Here, we provide an updated and upgradable comparative genomics resource of Drosophila divergence and selection, flyDIVaS, based on the latest genomic assemblies, curated FlyBase annotations, and recent OrthoDB orthology calls. flyDIVaS is an online database containing D. melanogaster-centric orthologous gene sets, CDS and protein alignments, divergence statistics (% gaps, dN, dS, dN/dS), and codon-based tests of positive Darwinian selection. Out of 13,920 protein-coding D. melanogaster genes, ∼80% have one aligned ortholog in the closely related species, D. simulans, and ∼50% have 1-1 12-way alignments in the original 12 sequenced species that span over 80 million yr of divergence. Genes and their orthologs can be chosen from four different taxonomic datasets differing in phylogenetic depth and coverage density, and visualized via interactive alignments and phylogenetic trees. Users can also batch download entire comparative datasets. A functional survey finds conserved mitotic and neural genes, highly diverged immune and reproduction-related genes, more conspicuous signals of divergence across tissue-specific genes, and an enrichment of positive selection among highly diverged genes. flyDIVaS will be regularly updated and can be freely accessed at www.flydivas.info We encourage researchers to regularly use this resource as a tool for biological inference and discovery, and in their classrooms to help train the next generation of biologists to creatively use such genomic big data resources in an integrative manner. Copyright © 2016 Stanley and Kulathinal.
Dereeper, Alexis; Nicolas, Stéphane; Le Cunff, Loïc; Bacilieri, Roberto; Doligez, Agnès; Peros, Jean-Pierre; Ruiz, Manuel; This, Patrice
2011-05-05
High-throughput re-sequencing, new genotyping technologies and the availability of reference genomes allow the extensive characterization of Single Nucleotide Polymorphisms (SNPs) and insertion/deletion events (indels) in many plant species. The rapidly increasing amount of re-sequencing and genotyping data generated by large-scale genetic diversity projects requires the development of integrated bioinformatics tools able to efficiently manage, analyze, and combine these genetic data with genome structure and external data. In this context, we developed SNiPlay, a flexible, user-friendly and integrative web-based tool dedicated to polymorphism discovery and analysis. It integrates:1) a pipeline, freely accessible through the internet, combining existing softwares with new tools to detect SNPs and to compute different types of statistical indices and graphical layouts for SNP data. From standard sequence alignments, genotyping data or Sanger sequencing traces given as input, SNiPlay detects SNPs and indels events and outputs submission files for the design of Illumina's SNP chips. Subsequently, it sends sequences and genotyping data into a series of modules in charge of various processes: physical mapping to a reference genome, annotation (genomic position, intron/exon location, synonymous/non-synonymous substitutions), SNP frequency determination in user-defined groups, haplotype reconstruction and network, linkage disequilibrium evaluation, and diversity analysis (Pi, Watterson's Theta, Tajima's D).Furthermore, the pipeline allows the use of external data (such as phenotype, geographic origin, taxa, stratification) to define groups and compare statistical indices.2) a database storing polymorphisms, genotyping data and grapevine sequences released by public and private projects. It allows the user to retrieve SNPs using various filters (such as genomic position, missing data, polymorphism type, allele frequency), to compare SNP patterns between populations, and to export genotyping data or sequences in various formats. Our experiments on grapevine genetic projects showed that SNiPlay allows geneticists to rapidly obtain advanced results in several key research areas of plant genetic diversity. Both the management and treatment of large amounts of SNP data are rendered considerably easier for end-users through automation and integration. Current developments are taking into account new advances in high-throughput technologies.SNiPlay is available at: http://sniplay.cirad.fr/.
Zwaenepoel, Arthur; Diels, Tim; Amar, David; Van Parys, Thomas; Shamir, Ron; Van de Peer, Yves; Tzfadia, Oren
2018-01-01
Recent times have seen an enormous growth of "omics" data, of which high-throughput gene expression data are arguably the most important from a functional perspective. Despite huge improvements in computational techniques for the functional classification of gene sequences, common similarity-based methods often fall short of providing full and reliable functional information. Recently, the combination of comparative genomics with approaches in functional genomics has received considerable interest for gene function analysis, leveraging both gene expression based guilt-by-association methods and annotation efforts in closely related model organisms. Besides the identification of missing genes in pathways, these methods also typically enable the discovery of biological regulators (i.e., transcription factors or signaling genes). A previously built guilt-by-association method is MORPH, which was proven to be an efficient algorithm that performs particularly well in identifying and prioritizing missing genes in plant metabolic pathways. Here, we present MorphDB, a resource where MORPH-based candidate genes for large-scale functional annotations (Gene Ontology, MapMan bins) are integrated across multiple plant species. Besides a gene centric query utility, we present a comparative network approach that enables researchers to efficiently browse MORPH predictions across functional gene sets and species, facilitating efficient gene discovery and candidate gene prioritization. MorphDB is available at http://bioinformatics.psb.ugent.be/webtools/morphdb/morphDB/index/. We also provide a toolkit, named "MORPH bulk" (https://github.com/arzwa/morph-bulk), for running MORPH in bulk mode on novel data sets, enabling researchers to apply MORPH to their own species of interest.
Natural product discovery: past, present, and future.
Katz, Leonard; Baltz, Richard H
2016-03-01
Microorganisms have provided abundant sources of natural products which have been developed as commercial products for human medicine, animal health, and plant crop protection. In the early years of natural product discovery from microorganisms (The Golden Age), new antibiotics were found with relative ease from low-throughput fermentation and whole cell screening methods. Later, molecular genetic and medicinal chemistry approaches were applied to modify and improve the activities of important chemical scaffolds, and more sophisticated screening methods were directed at target disease states. In the 1990s, the pharmaceutical industry moved to high-throughput screening of synthetic chemical libraries against many potential therapeutic targets, including new targets identified from the human genome sequencing project, largely to the exclusion of natural products, and discovery rates dropped dramatically. Nonetheless, natural products continued to provide key scaffolds for drug development. In the current millennium, it was discovered from genome sequencing that microbes with large genomes have the capacity to produce about ten times as many secondary metabolites as was previously recognized. Indeed, the most gifted actinomycetes have the capacity to produce around 30-50 secondary metabolites. With the precipitous drop in cost for genome sequencing, it is now feasible to sequence thousands of actinomycete genomes to identify the "biosynthetic dark matter" as sources for the discovery of new and novel secondary metabolites. Advances in bioinformatics, mass spectrometry, proteomics, transcriptomics, metabolomics and gene expression are driving the new field of microbial genome mining for applications in natural product discovery and development.
Comparative genetics of longevity and cancer: insights from long-lived rodents
Gorbunova, Vera; Seluanov, Andrei; Zhang, Zhengdong; Gladyshev, Vadim N.; Vijg, Jan
2015-01-01
Mammals have evolved a dramatic diversity of aging rates. Within the single order of Rodentia maximum lifespans differ from four years in mice to 32 years in naked mole rats. Cancer rates also differ significantly, from cancer-prone mice to virtually cancer-proof naked and blind mole rats. Recent progress in rodent comparative biology, in combination with the emergence of whole genome sequence information, has opened opportunities for the discovery of genetic factors controlling longevity and cancer susceptibility. PMID:24981598
Chopra, Ratan; Burow, Gloria; Farmer, Andrew; Mudge, Joann; Simpson, Charles E; Wilkins, Thea A; Baring, Michael R; Puppala, Naveen; Chamberlin, Kelly D; Burow, Mark D
2015-06-01
Single-nucleotide polymorphisms, which can be identified in the thousands or millions from comparisons of transcriptome or genome sequences, are ideally suited for making high-resolution genetic maps, investigating population evolutionary history, and discovering marker-trait linkages. Despite significant results from their use in human genetics, progress in identification and use in plants, and particularly polyploid plants, has lagged. As part of a long-term project to identify and use SNPs suitable for these purposes in cultivated peanut, which is tetraploid, we generated transcriptome sequences of four peanut cultivars, namely OLin, New Mexico Valencia C, Tamrun OL07 and Jupiter, which represent the four major market classes of peanut grown in the world, and which are important economically to the US southwest peanut growing region. CopyDNA libraries of each genotype were used to generate 2 × 54 paired-end reads using an Illumina GAIIx sequencer. Raw reads were mapped to a custom reference consisting of Tifrunner 454 sequences plus peanut ESTs in GenBank, compromising 43,108 contigs; 263,840 SNP and indel variants were identified among four genotypes compared to the reference. A subset of 6 variants was assayed across 24 genotypes representing four market types using KASP chemistry to assess the criteria for SNP selection. Results demonstrated that transcriptome sequencing can identify SNPs usable as selectable DNA-based markers in complex polyploid species such as peanut. Criteria for effective use of SNPs as markers are discussed in this context.
Gerlt, John A
2017-08-22
The exponentially increasing number of protein and nucleic acid sequences provides opportunities to discover novel enzymes, metabolic pathways, and metabolites/natural products, thereby adding to our knowledge of biochemistry and biology. The challenge has evolved from generating sequence information to mining the databases to integrating and leveraging the available information, i.e., the availability of "genomic enzymology" web tools. Web tools that allow identification of biosynthetic gene clusters are widely used by the natural products/synthetic biology community, thereby facilitating the discovery of novel natural products and the enzymes responsible for their biosynthesis. However, many novel enzymes with interesting mechanisms participate in uncharacterized small-molecule metabolic pathways; their discovery and functional characterization also can be accomplished by leveraging information in protein and nucleic acid databases. This Perspective focuses on two genomic enzymology web tools that assist the discovery novel metabolic pathways: (1) Enzyme Function Initiative-Enzyme Similarity Tool (EFI-EST) for generating sequence similarity networks to visualize and analyze sequence-function space in protein families and (2) Enzyme Function Initiative-Genome Neighborhood Tool (EFI-GNT) for generating genome neighborhood networks to visualize and analyze the genome context in microbial and fungal genomes. Both tools have been adapted to other applications to facilitate target selection for enzyme discovery and functional characterization. As the natural products community has demonstrated, the enzymology community needs to embrace the essential role of web tools that allow the protein and genome sequence databases to be leveraged for novel insights into enzymological problems.
Transcriptome and gene expression analysis during flower blooming in Rosa chinensis 'Pallida'.
Yan, Huijun; Zhang, Hao; Chen, Min; Jian, Hongying; Baudino, Sylvie; Caissard, Jean-Claude; Bendahmane, Mohammed; Li, Shubin; Zhang, Ting; Zhou, Ningning; Qiu, Xianqin; Wang, Qigang; Tang, Kaixue
2014-04-25
Rosa chinensis 'Pallida' (Rosa L.) is one of the most important ancient rose cultivars originating from China. It contributed the 'tea scent' trait to modern roses. However, little information is available on the gene regulatory networks involved in scent biosynthesis and metabolism in Rosa. In this study, the transcriptome of R. chinensis 'Pallida' petals at different developmental stages, from flower buds to senescent flowers, was investigated using Illumina sequencing technology. De novo assembly generated 89,614 clusters with an average length of 428bp. Based on sequence similarity search with known proteins, 62.9% of total clusters were annotated. Out of these annotated transcripts, 25,705 and 37,159 sequences were assigned to gene ontology and clusters of orthologous groups, respectively. The dataset provides information on transcripts putatively associated with known scent metabolic pathways. Digital gene expression (DGE) was obtained using RNA samples from flower bud, open flower and senescent flower stages. Comparative DGE and quantitative real time PCR permitted the identification of five transcripts encoding proteins putatively associated with scent biosynthesis in roses. The study provides a foundation for scent-related gene discovery in roses. Copyright © 2014. Published by Elsevier B.V.
Farlora, Rodolfo; Araya-Garay, José; Gallardo-Escárate, Cristian
2014-06-01
Understanding the molecular underpinnings involved in the reproduction of the salmon louse is critical for designing novel strategies of pest management for this ectoparasite. However, genomic information on sex-related genes is still limited. In the present work, sex-specific gene transcription was revealed in the salmon louse Caligus rogercresseyi using high-throughput Illumina sequencing. A total of 30,191,914 and 32,292,250 high quality reads were generated for females and males, and these were de novo assembled into 32,173 and 38,177 contigs, respectively. Gene ontology analysis showed a pattern of higher expression in the female as compared to the male transcriptome. Based on our sequence analysis and known sex-related proteins, several genes putatively involved in sex differentiation, including Dmrt3, FOXL2, VASA, and FEM1, and other potentially significant candidate genes in C. rogercresseyi, were identified for the first time. In addition, the occurrence of SNPs in several differentially expressed contigs annotating for sex-related genes was found. This transcriptome dataset provides a useful resource for future functional analyses, opening new opportunities for sea lice pest control. Copyright © 2014 Elsevier B.V. All rights reserved.
Hart-Smith, Gene; Yagoub, Daniel; Tay, Aidan P.; Pickford, Russell; Wilkins, Marc R.
2016-01-01
All large scale LC-MS/MS post-translational methylation site discovery experiments require methylpeptide spectrum matches (methyl-PSMs) to be identified at acceptably low false discovery rates (FDRs). To meet estimated methyl-PSM FDRs, methyl-PSM filtering criteria are often determined using the target-decoy approach. The efficacy of this methyl-PSM filtering approach has, however, yet to be thoroughly evaluated. Here, we conduct a systematic analysis of methyl-PSM FDRs across a range of sample preparation workflows (each differing in their exposure to the alcohols methanol and isopropyl alcohol) and mass spectrometric instrument platforms (each employing a different mode of MS/MS dissociation). Through 13CD3-methionine labeling (heavy-methyl SILAC) of Saccharomyces cerevisiae cells and in-depth manual data inspection, accurate lists of true positive methyl-PSMs were determined, allowing methyl-PSM FDRs to be compared with target-decoy approach-derived methyl-PSM FDR estimates. These results show that global FDR estimates produce extremely unreliable methyl-PSM filtering criteria; we demonstrate that this is an unavoidable consequence of the high number of amino acid combinations capable of producing peptide sequences that are isobaric to methylated peptides of a different sequence. Separate methyl-PSM FDR estimates were also found to be unreliable due to prevalent sources of false positive methyl-PSMs that produce high peptide identity score distributions. Incorrect methylation site localizations, peptides containing cysteinyl-S-β-propionamide, and methylated glutamic or aspartic acid residues can partially, but not wholly, account for these false positive methyl-PSMs. Together, these results indicate that the target-decoy approach is an unreliable means of estimating methyl-PSM FDRs and methyl-PSM filtering criteria. We suggest that orthogonal methylpeptide validation (e.g. heavy-methyl SILAC or its offshoots) should be considered a prerequisite for obtaining high confidence methyl-PSMs in large scale LC-MS/MS methylation site discovery experiments and make recommendations on how to reduce methyl-PSM FDRs in samples not amenable to heavy isotope labeling. Data are available via ProteomeXchange with the data identifier PXD002857. PMID:26699799
Kaya, Hilal Betul; Cetin, Oznur; Kaya, Hulya; Sahin, Mustafa; Sefer, Filiz; Kahraman, Abdullah; Tanyolac, Bahattin
2013-01-01
Background The olive tree (Olea europaea L.) is a diploid (2n = 2x = 46) outcrossing species mainly grown in the Mediterranean area, where it is the most important oil-producing crop. Because of its economic, cultural and ecological importance, various DNA markers have been used in the olive to characterize and elucidate homonyms, synonyms and unknown accessions. However, a comprehensive characterization and a full sequence of its transcriptome are unavailable, leading to the importance of an efficient large-scale single nucleotide polymorphism (SNP) discovery in olive. The objectives of this study were (1) to discover olive SNPs using next-generation sequencing and to identify SNP primers for cultivar identification and (2) to characterize 96 olive genotypes originating from different regions of Turkey. Methodology/Principal Findings Next-generation sequencing technology was used with five distinct olive genotypes and generated cDNA, producing 126,542,413 reads using an Illumina Genome Analyzer IIx. Following quality and size trimming, the high-quality reads were assembled into 22,052 contigs with an average length of 1,321 bases and 45 singletons. The SNPs were filtered and 2,987 high-quality putative SNP primers were identified. The assembled sequences and singletons were subjected to BLAST similarity searches and annotated with a Gene Ontology identifier. To identify the 96 olive genotypes, these SNP primers were applied to the genotypes in combination with amplified fragment length polymorphism (AFLP) and simple sequence repeats (SSR) markers. Conclusions/Significance This study marks the highest number of SNP markers discovered to date from olive genotypes using transcriptome sequencing. The developed SNP markers will provide a useful source for molecular genetic studies, such as genetic diversity and characterization, high density quantitative trait locus (QTL) analysis, association mapping and map-based gene cloning in the olive. High levels of genetic variation among Turkish olive genotypes revealed by SNPs, AFLPs and SSRs allowed us to characterize the Turkish olive genotype. PMID:24058483
USDA-ARS?s Scientific Manuscript database
Next-generation sequencing technologies were used to rapidly and efficiently sequence the genome of the domestic turkey (Meleagris gallopavo). The current genome assembly (~1.1 Gb) includes 917 Mb of sequence assigned to chromosomes. Innate heterozygosity of the sequenced bird allowed discovery of...
Rodrigues, Jorge L. M.; Serres, Margrethe H.; Tiedje, James M.
2011-01-01
The use of comparative genomics for the study of different microbiological species has increased substantially as sequence technologies become more affordable. However, efforts to fully link a genotype to its phenotype remain limited to the development of one mutant at a time. In this study, we provided a high-throughput alternative to this limiting step by coupling comparative genomics to the use of phenotype arrays for five sequenced Shewanella strains. Positive phenotypes were obtained for 441 nutrients (C, N, P, and S sources), with N-based compounds being the most utilized for all strains. Many genes and pathways predicted by genome analyses were confirmed with the comparative phenotype assay, and three degradation pathways believed to be missing in Shewanella were confirmed as missing. A number of previously unknown gene products were predicted to be parts of pathways or to have a function, expanding the number of gene targets for future genetic analyses. Ecologically, the comparative high-throughput phenotype analysis provided insights into niche specialization among the five different strains. For example, Shewanella amazonensis strain SB2B, isolated from the Amazon River delta, was capable of utilizing 60 C compounds, whereas Shewanella sp. strain W3-18-1, isolated from deep marine sediment, utilized only 25 of them. In spite of the large number of nutrient sources yielding positive results, our study indicated that except for the N sources, they were not sufficiently informative to predict growth phenotypes from increasing evolutionary distances. Our results indicate the importance of phenotypic evaluation for confirming genome predictions. This strategy will accelerate the functional discovery of genes and provide an ecological framework for microbial genome sequencing projects. PMID:21642407
Huang, Chen; Leung, Ross Ka-Kit; Guo, Min; Tuo, Li; Guo, Lin; Yew, Wing Wai; Lou, Inchio; Lee, Simon Ming Yuen; Sun, Chenghang
2016-01-01
Microbial secondary metabolites are valuable resources for novel drug discovery. In particular, actinomycetes expressed a range of antibiotics against a spectrum of bacteria. In genus level, strain Allosalinactinospora lopnorensis CA15-2T is the first new actinomycete isolated from the Lop Nor region, China. Antimicrobial assays revealed that the strain could inhibit the growth of certain types of bacteria, including Acinetobacter baumannii and Staphylococcus aureus, highlighting its clinical significance. Here we report the 5,894,259 base pairs genome of the strain, containing 5,662 predicted genes, and 832 of them cannot be detected by sequence similarity-based methods, suggesting the new species may carry a novel gene pool. Furthermore, our genome-mining investigation reveals that A. lopnorensis CA15-2T contains 17 gene clusters coding for known or novel secondary metabolites. Meanwhile, at least six secondary metabolites were disclosed from ethyl acetate (EA) extract of the fermentation broth of the strain by high-resolution UPLC-MS. Compared with reported clusters of other species, many new genes were found in clusters, and the physical chromosomal location and order of genes in the clusters are distinct. This study presents evidence in support of A. lopnorensis CA15-2T as a potent natural products source for drug discovery. PMID:26864220
Huszar, Tunde I; Jobling, Mark A; Wetton, Jon H
2018-04-12
Short tandem repeats on the male-specific region of the Y chromosome (Y-STRs) are permanently linked as haplotypes, and therefore Y-STR sequence diversity can be considered within the robust framework of a phylogeny of haplogroups defined by single nucleotide polymorphisms (SNPs). Here we use massively parallel sequencing (MPS) to analyse the 23 Y-STRs in Promega's prototype PowerSeq™ Auto/Mito/Y System kit (containing the markers of the PowerPlex® Y23 [PPY23] System) in a set of 100 diverse Y chromosomes whose phylogenetic relationships are known from previous megabase-scale resequencing. Including allele duplications and alleles resulting from likely somatic mutation, we characterised 2311 alleles, demonstrating 99.83% concordance with capillary electrophoresis (CE) data on the same sample set. The set contains 267 distinct sequence-based alleles (an increase of 58% compared to the 169 detectable by CE), including 60 novel Y-STR variants phased with their flanking sequences which have not been reported previously to our knowledge. Variation includes 46 distinct alleles containing non-reference variants of SNPs/indels in both repeat and flanking regions, and 145 distinct alleles containing repeat pattern variants (RPV). For DYS385a,b, DYS481 and DYS390 we observed repeat count variation in short flanking segments previously considered invariable, and suggest new MPS-based structural designations based on these. We considered the observed variation in the context of the Y phylogeny: several specific haplogroup associations were observed for SNPs and indels, reflecting the low mutation rates of such variant types; however, RPVs showed less phylogenetic coherence and more recurrence, reflecting their relatively high mutation rates. In conclusion, our study reveals considerable additional diversity at the Y-STRs of the PPY23 set via MPS analysis, demonstrates high concordance with CE data, facilitates nomenclature standardisation, and places Y-STR sequence variants in their phylogenetic context. Copyright © 2018 The Authors. Published by Elsevier B.V. All rights reserved.
Quantifying the Level of Inquiry in a Reformed Introductory Geology Lab Course
ERIC Educational Resources Information Center
Moss, Elizabeth; Cervato, Cinzia
2016-01-01
As part of a campus-wide effort to transform introductory science courses to be more engaging and more accurately convey the excitement of discovery in science, the curriculum of an introductory physical geology lab course was redesigned. What had been a series of ''cookbook'' lab activities was transformed into a sequence of activities based on…
Clowne Science Scheme--A Method Based Course for the Early Years in Secondary Schools
ERIC Educational Resources Information Center
Burden, I. J.; And Others
1975-01-01
Describes a two-year course sequence that is team taught and theme centered. Themes include the earth, the senses, time, and rate of change. The teaching method is the discovery approach and the role of the teacher is outlined. Explains student assessment and outlines problems and observations related to the program. (GS)
Bromberg, Yana; Yachdav, Guy; Ofran, Yanay; Schneider, Reinhard; Rost, Burkhard
2009-05-01
The rapidly increasing quantity of protein sequence data continues to widen the gap between available sequences and annotations. Comparative modeling suggests some aspects of the 3D structures of approximately half of all known proteins; homology- and network-based inferences annotate some aspect of function for a similar fraction of the proteome. For most known protein sequences, however, there is detailed knowledge about neither their function nor their structure. Comprehensive efforts towards the expert curation of sequence annotations have failed to meet the demand of the rapidly increasing number of available sequences. Only the automated prediction of protein function in the absence of homology can close the gap between available sequences and annotations in the foreseeable future. This review focuses on two novel methods for automated annotation, and briefly presents an outlook on how modern web software may revolutionize the field of protein sequence annotation. First, predictions of protein binding sites and functional hotspots, and the evolution of these into the most successful type of prediction of protein function from sequence will be discussed. Second, a new tool, comprehensive in silico mutagenesis, which contributes important novel predictions of function and at the same time prepares for the onset of the next sequencing revolution, will be described. While these two new sub-fields of protein prediction represent the breakthroughs that have been achieved methodologically, it will then be argued that a different development might further change the way biomedical researchers benefit from annotations: modern web software can connect the worldwide web in any browser with the 'Deep Web' (ie, proprietary data resources). The availability of this direct connection, and the resulting access to a wealth of data, may impact drug discovery and development more than any existing method that contributes to protein annotation.
NASA Astrophysics Data System (ADS)
Sabale, Pramod M.; George, Jerrin Thomas; Srivatsan, Seergazhi G.
2014-08-01
Given the biological and therapeutic significance of telomeres and other G-quadruplex forming sequences in human genome, it is highly desirable to develop simple methods to study these structures, which can also be implemented in screening formats for the discovery of G-quadruplex binders. The majority of telomere detection methods developed so far are laborious and use elaborate assay and instrumental setups, and hence, are not amenable to discovery platforms. Here, we describe the development of a simple homogeneous fluorescence turn-on method, which uses a unique combination of an environment-sensitive fluorescent nucleobase analogue, the superior base pairing property of PNA, and DNA-binding and fluorescence quenching properties of graphene oxide, to detect human telomeric DNA repeats of varying lengths. Our results demonstrate that this method, which does not involve a rigorous assay setup, would provide new opportunities to study G-quadruplex structures.Given the biological and therapeutic significance of telomeres and other G-quadruplex forming sequences in human genome, it is highly desirable to develop simple methods to study these structures, which can also be implemented in screening formats for the discovery of G-quadruplex binders. The majority of telomere detection methods developed so far are laborious and use elaborate assay and instrumental setups, and hence, are not amenable to discovery platforms. Here, we describe the development of a simple homogeneous fluorescence turn-on method, which uses a unique combination of an environment-sensitive fluorescent nucleobase analogue, the superior base pairing property of PNA, and DNA-binding and fluorescence quenching properties of graphene oxide, to detect human telomeric DNA repeats of varying lengths. Our results demonstrate that this method, which does not involve a rigorous assay setup, would provide new opportunities to study G-quadruplex structures. Electronic supplementary information (ESI) available. Figures, tables, experimental procedures and NMR spectra. See DOI: 10.1039/c4nr00878b
RNA 3D Modules in Genome-Wide Predictions of RNA 2D Structure
Theis, Corinna; Zirbel, Craig L.; zu Siederdissen, Christian Höner; Anthon, Christian; Hofacker, Ivo L.; Nielsen, Henrik; Gorodkin, Jan
2015-01-01
Recent experimental and computational progress has revealed a large potential for RNA structure in the genome. This has been driven by computational strategies that exploit multiple genomes of related organisms to identify common sequences and secondary structures. However, these computational approaches have two main challenges: they are computationally expensive and they have a relatively high false discovery rate (FDR). Simultaneously, RNA 3D structure analysis has revealed modules composed of non-canonical base pairs which occur in non-homologous positions, apparently by independent evolution. These modules can, for example, occur inside structural elements which in RNA 2D predictions appear as internal loops. Hence one question is if the use of such RNA 3D information can improve the prediction accuracy of RNA secondary structure at a genome-wide level. Here, we use RNAz in combination with 3D module prediction tools and apply them on a 13-way vertebrate sequence-based alignment. We find that RNA 3D modules predicted by metaRNAmodules and JAR3D are significantly enriched in the screened windows compared to their shuffled counterparts. The initially estimated FDR of 47.0% is lowered to below 25% when certain 3D module predictions are present in the window of the 2D prediction. We discuss the implications and prospects for further development of computational strategies for detection of RNA 2D structure in genomic sequence. PMID:26509713
MAGIC database and interfaces: an integrated package for gene discovery and expression.
Cordonnier-Pratt, Marie-Michèle; Liang, Chun; Wang, Haiming; Kolychev, Dmitri S; Sun, Feng; Freeman, Robert; Sullivan, Robert; Pratt, Lee H
2004-01-01
The rapidly increasing rate at which biological data is being produced requires a corresponding growth in relational databases and associated tools that can help laboratories contend with that data. With this need in mind, we describe here a Modular Approach to a Genomic, Integrated and Comprehensive (MAGIC) Database. This Oracle 9i database derives from an initial focus in our laboratory on gene discovery via production and analysis of expressed sequence tags (ESTs), and subsequently on gene expression as assessed by both EST clustering and microarrays. The MAGIC Gene Discovery portion of the database focuses on information derived from DNA sequences and on its biological relevance. In addition to MAGIC SEQ-LIMS, which is designed to support activities in the laboratory, it contains several additional subschemas. The latter include MAGIC Admin for database administration, MAGIC Sequence for sequence processing as well as sequence and clone attributes, MAGIC Cluster for the results of EST clustering, MAGIC Polymorphism in support of microsatellite and single-nucleotide-polymorphism discovery, and MAGIC Annotation for electronic annotation by BLAST and BLAT. The MAGIC Microarray portion is a MIAME-compliant database with two components at present. These are MAGIC Array-LIMS, which makes possible remote entry of all information into the database, and MAGIC Array Analysis, which provides data mining and visualization. Because all aspects of interaction with the MAGIC Database are via a web browser, it is ideally suited not only for individual research laboratories but also for core facilities that serve clients at any distance.
RSAT 2015: Regulatory Sequence Analysis Tools
Medina-Rivera, Alejandra; Defrance, Matthieu; Sand, Olivier; Herrmann, Carl; Castro-Mondragon, Jaime A.; Delerce, Jeremy; Jaeger, Sébastien; Blanchet, Christophe; Vincens, Pierre; Caron, Christophe; Staines, Daniel M.; Contreras-Moreira, Bruno; Artufel, Marie; Charbonnier-Khamvongsa, Lucie; Hernandez, Céline; Thieffry, Denis; Thomas-Chollier, Morgane; van Helden, Jacques
2015-01-01
RSAT (Regulatory Sequence Analysis Tools) is a modular software suite for the analysis of cis-regulatory elements in genome sequences. Its main applications are (i) motif discovery, appropriate to genome-wide data sets like ChIP-seq, (ii) transcription factor binding motif analysis (quality assessment, comparisons and clustering), (iii) comparative genomics and (iv) analysis of regulatory variations. Nine new programs have been added to the 43 described in the 2011 NAR Web Software Issue, including a tool to extract sequences from a list of coordinates (fetch-sequences from UCSC), novel programs dedicated to the analysis of regulatory variants from GWAS or population genomics (retrieve-variation-seq and variation-scan), a program to cluster motifs and visualize the similarities as trees (matrix-clustering). To deal with the drastic increase of sequenced genomes, RSAT public sites have been reorganized into taxon-specific servers. The suite is well-documented with tutorials and published protocols. The software suite is available through Web sites, SOAP/WSDL Web services, virtual machines and stand-alone programs at http://www.rsat.eu/. PMID:25904632
Buxbaum, Joseph D; Daly, Mark J; Devlin, Bernie; Lehner, Thomas; Roeder, Kathryn; State, Matthew W
2012-12-20
Research during the past decade has seen significant progress in the understanding of the genetic architecture of autism spectrum disorders (ASDs), with gene discovery accelerating as the characterization of genomic variation has become increasingly comprehensive. At the same time, this research has highlighted ongoing challenges. Here we address the enormous impact of high-throughput sequencing (HTS) on ASD gene discovery, outline a consensus view for leveraging this technology, and describe a large multisite collaboration developed to accomplish these goals. Similar approaches could prove effective for severe neurodevelopmental disorders more broadly. Copyright © 2012 Elsevier Inc. All rights reserved.
Vicini, P; Fields, O; Lai, E; Litwack, E D; Martin, A-M; Morgan, T M; Pacanowski, M A; Papaluca, M; Perez, O D; Ringel, M S; Robson, M; Sakul, H; Vockley, J; Zaks, T; Dolsten, M; Søgaard, M
2016-02-01
High throughput molecular and functional profiling of patients is a key driver of precision medicine. DNA and RNA characterization has been enabled at unprecedented cost and scale through rapid, disruptive progress in sequencing technology, but challenges persist in data management and interpretation. We analyze the state-of-the-art of large-scale unbiased sequencing in drug discovery and development, including technology, application, ethical, regulatory, policy and commercial considerations, and discuss issues of LUS implementation in clinical and regulatory practice. © 2015 American Society for Clinical Pharmacology and Therapeutics.
DLocalMotif: a discriminative approach for discovering local motifs in protein sequences.
Mehdi, Ahmed M; Sehgal, Muhammad Shoaib B; Kobe, Bostjan; Bailey, Timothy L; Bodén, Mikael
2013-01-01
Local motifs are patterns of DNA or protein sequences that occur within a sequence interval relative to a biologically defined anchor or landmark. Current protein motif discovery methods do not adequately consider such constraints to identify biologically significant motifs that are only weakly over-represented but spatially confined. Using negatives, i.e. sequences known to not contain a local motif, can further increase the specificity of their discovery. This article introduces the method DLocalMotif that makes use of positional information and negative data for local motif discovery in protein sequences. DLocalMotif combines three scoring functions, measuring degrees of motif over-representation, entropy and spatial confinement, specifically designed to discriminatively exploit the availability of negative data. The method is shown to outperform current methods that use only a subset of these motif characteristics. We apply the method to several biological datasets. The analysis of peroxisomal targeting signals uncovers several novel motifs that occur immediately upstream of the dominant peroxisomal targeting signal-1 signal. The analysis of proline-tyrosine nuclear localization signals uncovers multiple novel motifs that overlap with C2H2 zinc finger domains. We also evaluate the method on classical nuclear localization signals and endoplasmic reticulum retention signals and find that DLocalMotif successfully recovers biologically relevant sequence properties. http://bioinf.scmb.uq.edu.au/dlocalmotif/
Microbial Metagenomics: Beyond the Genome
NASA Astrophysics Data System (ADS)
Gilbert, Jack A.; Dupont, Christopher L.
2011-01-01
Metagenomics literally means “beyond the genome.” Marine microbial metagenomic databases presently comprise ˜400 billion base pairs of DNA, only ˜3% of that found in 1 ml of seawater. Very soon a trillion-base-pair sequence run will be feasible, so it is time to reflect on what we have learned from metagenomics. We review the impact of metagenomics on our understanding of marine microbial communities. We consider the studies facilitated by data generated through the Global Ocean Sampling expedition, as well as the revolution wrought at the individual laboratory level through next generation sequencing technologies. We review recent studies and discoveries since 2008, provide a discussion of bioinformatic analyses, including conceptual pipelines and sequence annotation and predict the future of metagenomics, with suggestions of collaborative community studies tailored toward answering some of the fundamental questions in marine microbial ecology.
Screening for SNPs with Allele-Specific Methylation based on Next-Generation Sequencing Data.
Hu, Bo; Ji, Yuan; Xu, Yaomin; Ting, Angela H
2013-05-01
Allele-specific methylation (ASM) has long been studied but mainly documented in the context of genomic imprinting and X chromosome inactivation. Taking advantage of the next-generation sequencing technology, we conduct a high-throughput sequencing experiment with four prostate cell lines to survey the whole genome and identify single nucleotide polymorphisms (SNPs) with ASM. A Bayesian approach is proposed to model the counts of short reads for each SNP conditional on its genotypes of multiple subjects, leading to a posterior probability of ASM. We flag SNPs with high posterior probabilities of ASM by accounting for multiple comparisons based on posterior false discovery rates. Applying the Bayesian approach to the in-house prostate cell line data, we identify 269 SNPs as candidates of ASM. A simulation study is carried out to demonstrate the quantitative performance of the proposed approach.
Cheng, Chia-Ying; Tsai, Chia-Feng; Chen, Yu-Ju; Sung, Ting-Yi; Hsu, Wen-Lian
2013-05-03
As spectral library searching has received increasing attention for peptide identification, constructing good decoy spectra from the target spectra is the key to correctly estimating the false discovery rate in searching against the concatenated target-decoy spectral library. Several methods have been proposed to construct decoy spectral libraries. Most of them construct decoy peptide sequences and then generate theoretical spectra accordingly. In this paper, we propose a method, called precursor-swap, which directly constructs decoy spectral libraries directly at the "spectrum level" without generating decoy peptide sequences by swapping the precursors of two spectra selected according to a very simple rule. Our spectrum-based method does not require additional efforts to deal with ion types (e.g., a, b or c ions), fragment mechanism (e.g., CID, or ETD), or unannotated peaks, but preserves many spectral properties. The precursor-swap method is evaluated on different spectral libraries and the results of obtained decoy ratios show that it is comparable to other methods. Notably, it is efficient in time and memory usage for constructing decoy libraries. A software tool called Precursor-Swap-Decoy-Generation (PSDG) is publicly available for download at http://ms.iis.sinica.edu.tw/PSDG/.
Brown, Margaret E; Walker, Mark C; Nakashige, Toshiki G; Iavarone, Anthony T; Chang, Michelle C Y
2011-11-16
Bacteria and other living organisms offer a potentially unlimited resource for the discovery of new chemical catalysts, but many interesting reaction phenotypes observed at the whole organism level remain difficult to elucidate down to the molecular level. A key challenge in the discovery process is the identification of discrete molecular players involved in complex biological transformations because multiple cryptic genetic components often work in concert to elicit an overall chemical phenotype. We now report a rapid pipeline for the discovery of new enzymes of interest from unsequenced bacterial hosts based on laboratory-scale methods for the de novo assembly of bacterial genome sequences using short reads. We have applied this approach to the biomass-degrading soil bacterium Amycolatopsis sp. 75iv2 ATCC 39116 (formerly Streptomyces setonii and S. griseus 75vi2) to discover and biochemically characterize two new heme proteins comprising the most abundant members of the extracellular oxidative system under lignin-reactive growth conditions.
The Importance of Biological Databases in Biological Discovery.
Baxevanis, Andreas D; Bateman, Alex
2015-06-19
Biological databases play a central role in bioinformatics. They offer scientists the opportunity to access a wide variety of biologically relevant data, including the genomic sequences of an increasingly broad range of organisms. This unit provides a brief overview of major sequence databases and portals, such as GenBank, the UCSC Genome Browser, and Ensembl. Model organism databases, including WormBase, The Arabidopsis Information Resource (TAIR), and those made available through the Mouse Genome Informatics (MGI) resource, are also covered. Non-sequence-centric databases, such as Online Mendelian Inheritance in Man (OMIM), the Protein Data Bank (PDB), MetaCyc, and the Kyoto Encyclopedia of Genes and Genomes (KEGG), are also discussed. Copyright © 2015 John Wiley & Sons, Inc.
Structure and evolution of cereal genomes.
Paterson, Andrew H; Bowers, John E; Peterson, Daniel G; Estill, James C; Chapman, Brad A
2003-12-01
The cereal species, of central importance to our diet, began to diverge 50-70 million years ago. For the past few thousand years, these species have undergone largely parallel selection regimes associated with domestication and improvement. The rice genome sequence provides a platform for organizing information about diverse cereals, and together with genetic maps and sequence samples from other cereals is yielding new insights into both the shared and the independent dimensions of cereal evolution. New data and population-based approaches are identifying genes that have been involved in cereal improvement. Reduced-representation sequencing promises to accelerate gene discovery in many large-genome cereals, and to better link the under-explored genomes of 'orphan' cereals with state-of-the-art knowledge.
Efficient Mining of Interesting Patterns in Large Biological Sequences
Rashid, Md. Mamunur; Karim, Md. Rezaul; Jeong, Byeong-Soo
2012-01-01
Pattern discovery in biological sequences (e.g., DNA sequences) is one of the most challenging tasks in computational biology and bioinformatics. So far, in most approaches, the number of occurrences is a major measure of determining whether a pattern is interesting or not. In computational biology, however, a pattern that is not frequent may still be considered very informative if its actual support frequency exceeds the prior expectation by a large margin. In this paper, we propose a new interesting measure that can provide meaningful biological information. We also propose an efficient index-based method for mining such interesting patterns. Experimental results show that our approach can find interesting patterns within an acceptable computation time. PMID:23105928
Efficient mining of interesting patterns in large biological sequences.
Rashid, Md Mamunur; Karim, Md Rezaul; Jeong, Byeong-Soo; Choi, Ho-Jin
2012-03-01
Pattern discovery in biological sequences (e.g., DNA sequences) is one of the most challenging tasks in computational biology and bioinformatics. So far, in most approaches, the number of occurrences is a major measure of determining whether a pattern is interesting or not. In computational biology, however, a pattern that is not frequent may still be considered very informative if its actual support frequency exceeds the prior expectation by a large margin. In this paper, we propose a new interesting measure that can provide meaningful biological information. We also propose an efficient index-based method for mining such interesting patterns. Experimental results show that our approach can find interesting patterns within an acceptable computation time.
Translational bioinformatics: linking the molecular world to the clinical world.
Altman, R B
2012-06-01
Translational bioinformatics represents the union of translational medicine and bioinformatics. Translational medicine moves basic biological discoveries from the research bench into the patient-care setting and uses clinical observations to inform basic biology. It focuses on patient care, including the creation of new diagnostics, prognostics, prevention strategies, and therapies based on biological discoveries. Bioinformatics involves algorithms to represent, store, and analyze basic biological data, including DNA sequence, RNA expression, and protein and small-molecule abundance within cells. Translational bioinformatics spans these two fields; it involves the development of algorithms to analyze basic molecular and cellular data with an explicit goal of affecting clinical care.
10 CFR 2.705 - Discovery-additional methods.
Code of Federal Regulations, 2013 CFR
2013-01-01
... 10 CFR 73.22(b)(3) or 73.23(b)(3), as applicable, by submitting fingerprints to the NRC Office of... fingerprints. However, before a final adverse determination by the NRC Office of Administration on an... discovery may be used in any sequence and the fact that a party is conducting discovery, whether by...
10 CFR 2.705 - Discovery-additional methods.
Code of Federal Regulations, 2011 CFR
2011-01-01
... 10 CFR 73.22(b)(3) or 73.23(b)(3), as applicable, by submitting fingerprints to the NRC Office of... fingerprints. However, before a final adverse determination by the NRC Office of Administration on an... discovery may be used in any sequence and the fact that a party is conducting discovery, whether by...
10 CFR 2.705 - Discovery-additional methods.
Code of Federal Regulations, 2014 CFR
2014-01-01
... 10 CFR 73.22(b)(3) or 73.23(b)(3), as applicable, by submitting fingerprints to the NRC Office of... fingerprints. However, before a final adverse determination by the NRC Office of Administration on an... discovery may be used in any sequence and the fact that a party is conducting discovery, whether by...
10 CFR 2.705 - Discovery-additional methods.
Code of Federal Regulations, 2012 CFR
2012-01-01
... 10 CFR 73.22(b)(3) or 73.23(b)(3), as applicable, by submitting fingerprints to the NRC Office of... fingerprints. However, before a final adverse determination by the NRC Office of Administration on an... discovery may be used in any sequence and the fact that a party is conducting discovery, whether by...
Shim, Hongseok; Kim, Ji Hyun; Kim, Chan Yeong; Hwang, Sohyun; Kim, Hyojin; Yang, Sunmo; Lee, Ji Eun; Lee, Insuk
2016-11-16
Whole exome sequencing (WES) accelerates disease gene discovery using rare genetic variants, but further statistical and functional evidence is required to avoid false-discovery. To complement variant-driven disease gene discovery, here we present function-driven disease gene discovery in zebrafish (Danio rerio), a promising human disease model owing to its high anatomical and genomic similarity to humans. To facilitate zebrafish-based function-driven disease gene discovery, we developed a genome-scale co-functional network of zebrafish genes, DanioNet (www.inetbio.org/danionet), which was constructed by Bayesian integration of genomics big data. Rigorous statistical assessment confirmed the high prediction capacity of DanioNet for a wide variety of human diseases. To demonstrate the feasibility of the function-driven disease gene discovery using DanioNet, we predicted genes for ciliopathies and performed experimental validation for eight candidate genes. We also validated the existence of heterozygous rare variants in the candidate genes of individuals with ciliopathies yet not in controls derived from the UK10K consortium, suggesting that these variants are potentially involved in enhancing the risk of ciliopathies. These results showed that an integrated genomics big data for a model animal of diseases can expand our opportunity for harnessing WES data in disease gene discovery. © The Author(s) 2016. Published by Oxford University Press on behalf of Nucleic Acids Research.
A Safer, Discovery-Based Nucleophilic Substitution Experiment
ERIC Educational Resources Information Center
Horowitz, Gail
2009-01-01
A discovery-based nucleophilic substitution experiment is described in which students compare the reactivity of chloride and iodide ions in an S[subscript N]2 reaction. This experiment improves upon the well-known "Competing Nucleophiles" experiment in that it does not involve the generation of hydrogen halide gas. The experiment also introduces…
Floral gene resources from basal angiosperms for comparative genomics research
Albert, Victor A; Soltis, Douglas E; Carlson, John E; Farmerie, William G; Wall, P Kerr; Ilut, Daniel C; Solow, Teri M; Mueller, Lukas A; Landherr, Lena L; Hu, Yi; Buzgo, Matyas; Kim, Sangtae; Yoo, Mi-Jeong; Frohlich, Michael W; Perl-Treves, Rafael; Schlarbaum, Scott E; Bliss, Barbara J; Zhang, Xiaohong; Tanksley, Steven D; Oppenheimer, David G; Soltis, Pamela S; Ma, Hong; dePamphilis, Claude W; Leebens-Mack, James H
2005-01-01
Background The Floral Genome Project was initiated to bridge the genomic gap between the most broadly studied plant model systems. Arabidopsis and rice, although now completely sequenced and under intensive comparative genomic investigation, are separated by at least 125 million years of evolutionary time, and cannot in isolation provide a comprehensive perspective on structural and functional aspects of flowering plant genome dynamics. Here we discuss new genomic resources available to the scientific community, comprising cDNA libraries and Expressed Sequence Tag (EST) sequences for a suite of phylogenetically basal angiosperms specifically selected to bridge the evolutionary gaps between model plants and provide insights into gene content and genome structure in the earliest flowering plants. Results Random sequencing of cDNAs from representatives of phylogenetically important eudicot, non-grass monocot, and gymnosperm lineages has so far (as of 12/1/04) generated 70,514 ESTs and 48,170 assembled unigenes. Efficient sorting of EST sequences into putative gene families based on whole Arabidopsis/rice proteome comparison has permitted ready identification of cDNA clones for finished sequencing. Preliminarily, (i) proportions of functional categories among sequenced floral genes seem representative of the entire Arabidopsis transcriptome, (ii) many known floral gene homologues have been captured, and (iii) phylogenetic analyses of ESTs are providing new insights into the process of gene family evolution in relation to the origin and diversification of the angiosperms. Conclusion Initial comparisons illustrate the utility of the EST data sets toward discovery of the basic floral transcriptome. These first findings also afford the opportunity to address a number of conspicuous evolutionary genomic questions, including reproductive organ transcriptome overlap between angiosperms and gymnosperms, genome-wide duplication history, lineage-specific gene duplication and functional divergence, and analyses of adaptive molecular evolution. Since not all genes in the floral transcriptome will be associated with flowering, these EST resources will also be of interest to plant scientists working on other functions, such as photosynthesis, signal transduction, and metabolic pathways. PMID:15799777
Arthropods as a source of new RNA viruses.
Bichaud, L; de Lamballerie, X; Alkan, C; Izri, A; Gould, E A; Charrel, R N
2014-12-01
The discovery and development of methods for isolation, characterisation and taxonomy of viruses represents an important milestone in the study, treatment and control of virus diseases during the 20th century. Indeed, by the late-1950s, it was becoming common belief that most human and veterinary pathogenic viruses had been discovered. However, at that time, knowledge of the impact of improved commercial transportation, urbanisation and deforestation, on disease emergence, was in its infancy. From the late 1960s onwards viruses, such as hepatitis virus (A, B and C) hantavirus, HIV, Marburg virus, Ebola virus and many others began to emerge and it became apparent that the world was changing, at least in terms of virus epidemiology, largely due to the influence of anthropological activities. Subsequently, with the improvement of molecular biotechnologies, for amplification of viral RNA, genome sequencing and proteomic analysis the arsenal of available tools for virus discovery and genetic characterization opened up new and exciting possibilities for virological discovery. Many recently identified but "unclassified" viruses are now being allocated to existing genera or families based on whole genome sequencing, bioinformatic and phylogenetic analysis. New species, genera and families are also being created following the guidelines of the International Committee for the Taxonomy of Viruses. Many of these newly discovered viruses are vectored by arthropods (arboviruses) and possess an RNA genome. This brief review will focus largely on the discovery of new arthropod-borne viruses. Copyright © 2014 Elsevier Ltd. All rights reserved.
Brusniak, Mi-Youn; Bodenmiller, Bernd; Campbell, David; Cooke, Kelly; Eddes, James; Garbutt, Andrew; Lau, Hollis; Letarte, Simon; Mueller, Lukas N; Sharma, Vagisha; Vitek, Olga; Zhang, Ning; Aebersold, Ruedi; Watts, Julian D
2008-01-01
Background Quantitative proteomics holds great promise for identifying proteins that are differentially abundant between populations representing different physiological or disease states. A range of computational tools is now available for both isotopically labeled and label-free liquid chromatography mass spectrometry (LC-MS) based quantitative proteomics. However, they are generally not comparable to each other in terms of functionality, user interfaces, information input/output, and do not readily facilitate appropriate statistical data analysis. These limitations, along with the array of choices, present a daunting prospect for biologists, and other researchers not trained in bioinformatics, who wish to use LC-MS-based quantitative proteomics. Results We have developed Corra, a computational framework and tools for discovery-based LC-MS proteomics. Corra extends and adapts existing algorithms used for LC-MS-based proteomics, and statistical algorithms, originally developed for microarray data analyses, appropriate for LC-MS data analysis. Corra also adapts software engineering technologies (e.g. Google Web Toolkit, distributed processing) so that computationally intense data processing and statistical analyses can run on a remote server, while the user controls and manages the process from their own computer via a simple web interface. Corra also allows the user to output significantly differentially abundant LC-MS-detected peptide features in a form compatible with subsequent sequence identification via tandem mass spectrometry (MS/MS). We present two case studies to illustrate the application of Corra to commonly performed LC-MS-based biological workflows: a pilot biomarker discovery study of glycoproteins isolated from human plasma samples relevant to type 2 diabetes, and a study in yeast to identify in vivo targets of the protein kinase Ark1 via phosphopeptide profiling. Conclusion The Corra computational framework leverages computational innovation to enable biologists or other researchers to process, analyze and visualize LC-MS data with what would otherwise be a complex and not user-friendly suite of tools. Corra enables appropriate statistical analyses, with controlled false-discovery rates, ultimately to inform subsequent targeted identification of differentially abundant peptides by MS/MS. For the user not trained in bioinformatics, Corra represents a complete, customizable, free and open source computational platform enabling LC-MS-based proteomic workflows, and as such, addresses an unmet need in the LC-MS proteomics field. PMID:19087345
Computational approaches to predict bacteriophage–host relationships
Edwards, Robert A.; McNair, Katelyn; Faust, Karoline; Raes, Jeroen; Dutilh, Bas E.
2015-01-01
Metagenomics has changed the face of virus discovery by enabling the accurate identification of viral genome sequences without requiring isolation of the viruses. As a result, metagenomic virus discovery leaves the first and most fundamental question about any novel virus unanswered: What host does the virus infect? The diversity of the global virosphere and the volumes of data obtained in metagenomic sequencing projects demand computational tools for virus–host prediction. We focus on bacteriophages (phages, viruses that infect bacteria), the most abundant and diverse group of viruses found in environmental metagenomes. By analyzing 820 phages with annotated hosts, we review and assess the predictive power of in silico phage–host signals. Sequence homology approaches are the most effective at identifying known phage–host pairs. Compositional and abundance-based methods contain significant signal for phage–host classification, providing opportunities for analyzing the unknowns in viral metagenomes. Together, these computational approaches further our knowledge of the interactions between phages and their hosts. Importantly, we find that all reviewed signals significantly link phages to their hosts, illustrating how current knowledge and insights about the interaction mechanisms and ecology of coevolving phages and bacteria can be exploited to predict phage–host relationships, with potential relevance for medical and industrial applications. PMID:26657537
Ramond, J-B; Makhalanyane, T P; Tuffin, M I; Cowan, D A
2015-04-01
Normalization is a procedure classically employed to detect rare sequences in cellular expression profiles (i.e. cDNA libraries). Here, we present a normalization protocol involving the direct treatment of extracted environmental metagenomic DNA with S1 nuclease, referred to as normalization of metagenomic DNA: NmDNA. We demonstrate that NmDNA, prior to post hoc PCR-based experiments (16S rRNA gene T-RFLP fingerprinting and clone library), increased the diversity of sequences retrieved from environmental microbial communities by detection of rarer sequences. This approach could be used to enhance the resolution of detection of ecologically relevant rare members in environmental microbial assemblages and therefore is promising in enabling a better understanding of ecosystem functioning. This study is the first testing 'normalization' on environmental metagenomic DNA (mDNA). The aim of this procedure was to improve the identification of rare phylotypes in environmental communities. Using hypoliths as model systems, we present evidence that this post-mDNA extraction molecular procedure substantially enhances the detection of less common phylotypes and could even lead to the discovery of novel microbial genotypes within a given environment. © 2014 The Society for Applied Microbiology.
A generic motif discovery algorithm for sequential data.
Jensen, Kyle L; Styczynski, Mark P; Rigoutsos, Isidore; Stephanopoulos, Gregory N
2006-01-01
Motif discovery in sequential data is a problem of great interest and with many applications. However, previous methods have been unable to combine exhaustive search with complex motif representations and are each typically only applicable to a certain class of problems. Here we present a generic motif discovery algorithm (Gemoda) for sequential data. Gemoda can be applied to any dataset with a sequential character, including both categorical and real-valued data. As we show, Gemoda deterministically discovers motifs that are maximal in composition and length. As well, the algorithm allows any choice of similarity metric for finding motifs. Finally, Gemoda's output motifs are representation-agnostic: they can be represented using regular expressions, position weight matrices or any number of other models for any type of sequential data. We demonstrate a number of applications of the algorithm, including the discovery of motifs in amino acids sequences, a new solution to the (l,d)-motif problem in DNA sequences and the discovery of conserved protein substructures. Gemoda is freely available at http://web.mit.edu/bamel/gemoda
NASA Astrophysics Data System (ADS)
Lestari, D.; Bustamam, A.; Novianti, T.; Ardaneswari, G.
2017-07-01
DNA sequence can be defined as a succession of letters, representing the order of nucleotides within DNA, using a permutation of four DNA base codes including adenine (A), guanine (G), cytosine (C), and thymine (T). The precise code of the sequences is determined using DNA sequencing methods and technologies, which have been developed since the 1970s and currently become highly developed, advanced and highly throughput sequencing technologies. So far, DNA sequencing has greatly accelerated biological and medical research and discovery. However, in some cases DNA sequencing could produce any ambiguous and not clear enough sequencing results that make them quite difficult to be determined whether these codes are A, T, G, or C. To solve these problems, in this study we can introduce other representation of DNA codes namely Quaternion Q = (PA, PT, PG, PC), where PA, PT, PG, PC are the probability of A, T, G, C bases that could appear in Q and PA + PT + PG + PC = 1. Furthermore, using Quaternion representations we are able to construct the improved scoring matrix for global sequence alignment processes, by applying a dot product method. Moreover, this scoring matrix produces better and higher quality of the match and mismatch score between two DNA base codes. In implementation, we applied the Needleman-Wunsch global sequence alignment algorithm using Octave, to analyze our target sequence which contains some ambiguous sequence data. The subject sequences are the DNA sequences of Streptococcus pneumoniae families obtained from the Genebank, meanwhile the target DNA sequence are received from our collaborator database. As the results we found the Quaternion representations improve the quality of the sequence alignment score and we can conclude that DNA sequence target has maximum similarity with Streptococcus pneumoniae.
Cho, Jin-Young; Lee, Hyoung-Joo; Jeong, Seul-Ki; Paik, Young-Ki
2017-12-01
Mass spectrometry (MS) is a widely used proteome analysis tool for biomedical science. In an MS-based bottom-up proteomic approach to protein identification, sequence database (DB) searching has been routinely used because of its simplicity and convenience. However, searching a sequence DB with multiple variable modification options can increase processing time, false-positive errors in large and complicated MS data sets. Spectral library searching is an alternative solution, avoiding the limitations of sequence DB searching and allowing the detection of more peptides with high sensitivity. Unfortunately, this technique has less proteome coverage, resulting in limitations in the detection of novel and whole peptide sequences in biological samples. To solve these problems, we previously developed the "Combo-Spec Search" method, which uses manually multiple references and simulated spectral library searching to analyze whole proteomes in a biological sample. In this study, we have developed a new analytical interface tool called "Epsilon-Q" to enhance the functions of both the Combo-Spec Search method and label-free protein quantification. Epsilon-Q performs automatically multiple spectral library searching, class-specific false-discovery rate control, and result integration. It has a user-friendly graphical interface and demonstrates good performance in identifying and quantifying proteins by supporting standard MS data formats and spectrum-to-spectrum matching powered by SpectraST. Furthermore, when the Epsilon-Q interface is combined with the Combo-Spec search method, called the Epsilon-Q system, it shows a synergistic function by outperforming other sequence DB search engines for identifying and quantifying low-abundance proteins in biological samples. The Epsilon-Q system can be a versatile tool for comparative proteome analysis based on multiple spectral libraries and label-free quantification.
A statistical view of FMRFamide neuropeptide diversity.
Espinoza, E; Carrigan, M; Thomas, S G; Shaw, G; Edison, A S
2000-01-01
FMRFamide-like peptide (FLP) amino acid sequences have been collected and statistically analyzed. FLP amino acid composition as a function of position in the peptide is graphically presented for several major phyla. Results of total amino acid composition and frequencies of pairs of FLP amino acids have been computed and compared with corresponding values from the entire GenBank protein sequence database. The data for pairwise distributions of amino acids should help in future structure-function studies of FLPs. To aid in future peptide discovery, a computer program and search protocol was developed to identify FLPs from the GenBank protein database without the use of keywords.
Yu, Yang; Wei, Jiankai; Zhang, Xiaojun; Liu, Jingwen; Liu, Chengzhang; Li, Fuhua; Xiang, Jianhai
2014-01-01
The application of next generation sequencing technology has greatly facilitated high throughput single nucleotide polymorphism (SNP) discovery and genotyping in genetic research. In the present study, SNPs were discovered based on two transcriptomes of Litopenaeus vannamei (L. vannamei) generated from Illumina sequencing platform HiSeq 2000. One transcriptome of L. vannamei was obtained through sequencing on the RNA from larvae at mysis stage and its reference sequence was de novo assembled. The data from another transcriptome were downloaded from NCBI and the reads of the two transcriptomes were mapped separately to the assembled reference by BWA. SNP calling was performed using SAMtools. A total of 58,717 and 36,277 SNPs with high quality were predicted from the two transcriptomes, respectively. SNP calling was also performed using the reads of two transcriptomes together, and a total of 96,040 SNPs with high quality were predicted. Among these 96,040 SNPs, 5,242 and 29,129 were predicted as non-synonymous and synonymous SNPs respectively. Characterization analysis of the predicted SNPs in L. vannamei showed that the estimated SNP frequency was 0.21% (one SNP per 476 bp) and the estimated ratio for transition to transversion was 2.0. Fifty SNPs were randomly selected for validation by Sanger sequencing after PCR amplification and 76% of SNPs were confirmed, which indicated that the SNPs predicted in this study were reliable. These SNPs will be very useful for genetic study in L. vannamei, especially for the high density linkage map construction and genome-wide association studies. PMID:24498047
Rasmussen, L. D.; Zawadsky, C.; Binnerup, S. J.; Øregaard, G.; Sørensen, S. J.; Kroer, N.
2008-01-01
Mercury-resistant bacteria may be important players in mercury biogeochemistry. To assess the potential for mercury reduction by two subsurface microbial communities, resistant subpopulations and their merA genes were characterized by a combined molecular and cultivation-dependent approach. The cultivation method simulated natural conditions by using polycarbonate membranes as a growth support and a nonsterile soil slurry as a culture medium. Resistant bacteria were pregrown to microcolony-forming units (mCFU) before being plated on standard medium. Compared to direct plating, culturability was increased up to 2,800 times and numbers of mCFU were similar to the total number of mercury-resistant bacteria in the soils. Denaturing gradient gel electrophoresis analysis of DNA extracted from membranes suggested stimulation of growth of hard-to-culture bacteria during the preincubation. A total of 25 different 16S rRNA gene sequences were observed, including Alpha-, Beta-, and Gammaproteobacteria; Actinobacteria; Firmicutes; and Bacteroidetes. The diversity of isolates obtained by direct plating included eight different 16S rRNA gene sequences (Alpha- and Betaproteobacteria and Actinobacteria). Partial sequencing of merA of selected isolates led to the discovery of new merA sequences. With phylum-specific merA primers, PCR products were obtained for Alpha- and Betaproteobacteria and Actinobacteria but not for Bacteroidetes and Firmicutes. The similarity to known sequences ranged between 89 and 95%. One of the sequences did not result in a match in the BLAST search. The results illustrate the power of integrating advanced cultivation methodology with molecular techniques for the characterization of the diversity of mercury-resistant populations and assessing the potential for mercury reduction in contaminated environments. PMID:18441111
Synthetic biology to access and expand nature’s chemical diversity
Smanski, Michael J.; Zhou, Hui; Claesen, Jan; Shen, Ben; Fischbach, Michael; Voigt, Christopher A.
2016-01-01
Bacterial genomes encode the biosynthetic potential to produce hundreds of thousands of complex molecules with diverse applications, from medicine to agriculture and materials. Economically accessing the potential encoded within sequenced genomes promises to reinvigorate waning drug discovery pipelines and provide novel routes to intricate chemicals. This is a tremendous undertaking, as the pathways often comprise dozens of genes spanning as much as 100+ kiliobases of DNA, are controlled by complex regulatory networks, and the most interesting molecules are made by non-model organisms. Advances in synthetic biology address these issues, including DNA construction technologies, genetic parts for precision expression control, synthetic regulatory circuits, computer aided design, and multiplexed genome engineering. Collectively, these technologies are moving towards an era when chemicals can be accessed en mass based on sequence information alone. This will enable the harnessing of metagenomic data and massive strain banks for high-throughput molecular discovery and, ultimately, the ability to forward design pathways to complex chemicals not found in nature. PMID:26876034
Analysis Commons, A Team Approach to Discovery in a Big-Data Environment for Genetic Epidemiology
Brody, Jennifer A.; Morrison, Alanna C.; Bis, Joshua C.; O'Connell, Jeffrey R.; Brown, Michael R.; Huffman, Jennifer E.; Ames, Darren C.; Carroll, Andrew; Conomos, Matthew P.; Gabriel, Stacey; Gibbs, Richard A.; Gogarten, Stephanie M.; Gupta, Namrata; Jaquish, Cashell E.; Johnson, Andrew D.; Lewis, Joshua P.; Liu, Xiaoming; Manning, Alisa K.; Papanicolaou, George J.; Pitsillides, Achilleas N.; Rice, Kenneth M.; Salerno, William; Sitlani, Colleen M.; Smith, Nicholas L.; Heckbert, Susan R.; Laurie, Cathy C.; Mitchell, Braxton D.; Vasan, Ramachandran S.; Rich, Stephen S.; Rotter, Jerome I.; Wilson, James G.; Boerwinkle, Eric; Psaty, Bruce M.; Cupples, L. Adrienne
2017-01-01
Summary paragraph The exploding volume of whole-genome sequence (WGS) and multi-omics data requires new approaches for analysis. As one solution, we have created a cloud-based Analysis Commons, which brings together genotype and phenotype data from multiple studies in a setting that is accessible by multiple investigators. This framework addresses many of the challenges of multi-center WGS analyses, including data sharing mechanisms, phenotype harmonization, integrated multi-omics analyses, annotation, and computational flexibility. In this setting, the computational pipeline facilitates a sequence-to-discovery analysis workflow illustrated here by an analysis of plasma fibrinogen levels in 3996 individuals from the National Heart, Lung, and Blood Institute (NHLBI) Trans-Omics for Precision Medicine (TOPMed) WGS program. The Analysis Commons represents a novel model for transforming WGS resources from a massive quantity of phenotypic and genomic data into knowledge of the determinants of health and disease risk in diverse human populations. PMID:29074945
Landscape of genomic diversity and trait discovery in soybean.
Valliyodan, Babu; Dan Qiu; Patil, Gunvant; Zeng, Peng; Huang, Jiaying; Dai, Lu; Chen, Chengxuan; Li, Yanjun; Joshi, Trupti; Song, Li; Vuong, Tri D; Musket, Theresa A; Xu, Dong; Shannon, J Grover; Shifeng, Cheng; Liu, Xin; Nguyen, Henry T
2016-03-31
Cultivated soybean [Glycine max (L.) Merr.] is a primary source of vegetable oil and protein. We report a landscape analysis of genome-wide genetic variation and an association study of major domestication and agronomic traits in soybean. A total of 106 soybean genomes representing wild, landraces, and elite lines were re-sequenced at an average of 17x depth with a 97.5% coverage. Over 10 million high-quality SNPs were discovered, and 35.34% of these have not been previously reported. Additionally, 159 putative domestication sweeps were identified, which includes 54.34 Mbp (4.9%) and 4,414 genes; 146 regions were involved in artificial selection during domestication. A genome-wide association study of major traits including oil and protein content, salinity, and domestication traits resulted in the discovery of novel alleles. Genomic information from this study provides a valuable resource for understanding soybean genome structure and evolution, and can also facilitate trait dissection leading to sequencing-based molecular breeding.
Landscape of genomic diversity and trait discovery in soybean
Valliyodan, Babu; Dan Qiu; Patil, Gunvant; Zeng, Peng; Huang, Jiaying; Dai, Lu; Chen, Chengxuan; Li, Yanjun; Joshi, Trupti; Song, Li; Vuong, Tri D.; Musket, Theresa A.; Xu, Dong; Shannon, J. Grover; Shifeng, Cheng; Liu, Xin; Nguyen, Henry T.
2016-01-01
Cultivated soybean [Glycine max (L.) Merr.] is a primary source of vegetable oil and protein. We report a landscape analysis of genome-wide genetic variation and an association study of major domestication and agronomic traits in soybean. A total of 106 soybean genomes representing wild, landraces, and elite lines were re-sequenced at an average of 17x depth with a 97.5% coverage. Over 10 million high-quality SNPs were discovered, and 35.34% of these have not been previously reported. Additionally, 159 putative domestication sweeps were identified, which includes 54.34 Mbp (4.9%) and 4,414 genes; 146 regions were involved in artificial selection during domestication. A genome-wide association study of major traits including oil and protein content, salinity, and domestication traits resulted in the discovery of novel alleles. Genomic information from this study provides a valuable resource for understanding soybean genome structure and evolution, and can also facilitate trait dissection leading to sequencing-based molecular breeding. PMID:27029319
Assessing the Impact of Assemblers on Virus Detection in a De Novo Metagenomic Analysis Pipeline.
White, Daniel J; Wang, Jing; Hall, Richard J
2017-09-01
Applying high-throughput sequencing to pathogen discovery is a relatively new field, the objective of which is to find disease-causing agents when little or no background information on disease is available. Key steps in the process are the generation of millions of sequence reads from an infected tissue sample, followed by assembly of these reads into longer, contiguous stretches of nucleotide sequences, and then identification of the contigs by matching them to known databases, such as those stored at GenBank or Ensembl. This technique, that is, de novo metagenomics, is particularly useful when the pathogen is viral and strong discriminatory power can be achieved. However, recently, we found that striking differences in results can be achieved when different assemblers were used. In this study, we test formally the impact of five popular assemblers (MIRA, VELVET, METAVELVET, SPADES, and OMEGA) on the detection of a novel virus and assembly of its whole genome in a data set for which we have confirmed the presence of the virus by empirical laboratory techniques, and compare the overall performance between assemblers. Our results show that if results from only one assembler are considered, biologically important reads can easily be overlooked. The impacts of these results on the field of pathogen discovery are considered.
Giese, Sven H; Zickmann, Franziska; Renard, Bernhard Y
2014-01-01
Accurate estimation, comparison and evaluation of read mapping error rates is a crucial step in the processing of next-generation sequencing data, as further analysis steps and interpretation assume the correctness of the mapping results. Current approaches are either focused on sensitivity estimation and thereby disregard specificity or are based on read simulations. Although continuously improving, read simulations are still prone to introduce a bias into the mapping error quantitation and cannot capture all characteristics of an individual dataset. We introduce ARDEN (artificial reference driven estimation of false positives in next-generation sequencing data), a novel benchmark method that estimates error rates of read mappers based on real experimental reads, using an additionally generated artificial reference genome. It allows a dataset-specific computation of error rates and the construction of a receiver operating characteristic curve. Thereby, it can be used for optimization of parameters for read mappers, selection of read mappers for a specific problem or for filtering alignments based on quality estimation. The use of ARDEN is demonstrated in a general read mapper comparison, a parameter optimization for one read mapper and an application example in single-nucleotide polymorphism discovery with a significant reduction in the number of false positive identifications. The ARDEN source code is freely available at http://sourceforge.net/projects/arden/.
Exarchos, Konstantinos P; Exarchos, Themis P; Rigas, Georgios; Papaloukas, Costas; Fotiadis, Dimitrios I
2011-05-10
In peptides and proteins, only a small percentile of peptide bonds adopts the cis configuration. Especially in the case of amide peptide bonds, the amount of cis conformations is quite limited thus hampering systematic studies, until recently. However, lately the emerging population of databases with more 3D structures of proteins has produced a considerable number of sequences containing non-proline cis formations (cis-nonPro). In our work, we extract regular expression-type patterns that are descriptive of regions surrounding the cis-nonPro formations. For this purpose, three types of pattern discovery are performed: i) exact pattern discovery, ii) pattern discovery using a chemical equivalency set, and iii) pattern discovery using a structural equivalency set. Afterwards, using each pattern as predicate, we search the Eukaryotic Linear Motif (ELM) resource to identify potential functional implications of regions with cis-nonPro peptide bonds. The patterns extracted from each type of pattern discovery are further employed, in order to formulate a pattern-based classifier, which is used to discriminate between cis-nonPro and trans-nonPro formations. In terms of functional implications, we observe a significant association of cis-nonPro peptide bonds towards ligand/binding functionalities. As for the pattern-based classification scheme, the highest results were obtained using the structural equivalency set, which yielded 70% accuracy, 77% sensitivity and 63% specificity.
Automated Antibody De Novo Sequencing and Its Utility in Biopharmaceutical Discovery
NASA Astrophysics Data System (ADS)
Sen, K. Ilker; Tang, Wilfred H.; Nayak, Shruti; Kil, Yong J.; Bern, Marshall; Ozoglu, Berk; Ueberheide, Beatrix; Davis, Darryl; Becker, Christopher
2017-05-01
Applications of antibody de novo sequencing in the biopharmaceutical industry range from the discovery of new antibody drug candidates to identifying reagents for research and determining the primary structure of innovator products for biosimilar development. When murine, phage display, or patient-derived monoclonal antibodies against a target of interest are available, but the cDNA or the original cell line is not, de novo protein sequencing is required to humanize and recombinantly express these antibodies, followed by in vitro and in vivo testing for functional validation. Availability of fully automated software tools for monoclonal antibody de novo sequencing enables efficient and routine analysis. Here, we present a novel method to automatically de novo sequence antibodies using mass spectrometry and the Supernovo software. The robustness of the algorithm is demonstrated through a series of stress tests.
Keilwagen, Jens; Grau, Jan; Paponov, Ivan A; Posch, Stefan; Strickert, Marc; Grosse, Ivo
2011-02-10
Transcription factors are a main component of gene regulation as they activate or repress gene expression by binding to specific binding sites in promoters. The de-novo discovery of transcription factor binding sites in target regions obtained by wet-lab experiments is a challenging problem in computational biology, which has not been fully solved yet. Here, we present a de-novo motif discovery tool called Dispom for finding differentially abundant transcription factor binding sites that models existing positional preferences of binding sites and adjusts the length of the motif in the learning process. Evaluating Dispom, we find that its prediction performance is superior to existing tools for de-novo motif discovery for 18 benchmark data sets with planted binding sites, and for a metazoan compendium based on experimental data from micro-array, ChIP-chip, ChIP-DSL, and DamID as well as Gene Ontology data. Finally, we apply Dispom to find binding sites differentially abundant in promoters of auxin-responsive genes extracted from Arabidopsis thaliana microarray data, and we find a motif that can be interpreted as a refined auxin responsive element predominately positioned in the 250-bp region upstream of the transcription start site. Using an independent data set of auxin-responsive genes, we find in genome-wide predictions that the refined motif is more specific for auxin-responsive genes than the canonical auxin-responsive element. In general, Dispom can be used to find differentially abundant motifs in sequences of any origin. However, the positional distribution learned by Dispom is especially beneficial if all sequences are aligned to some anchor point like the transcription start site in case of promoter sequences. We demonstrate that the combination of searching for differentially abundant motifs and inferring a position distribution from the data is beneficial for de-novo motif discovery. Hence, we make the tool freely available as a component of the open-source Java framework Jstacs and as a stand-alone application at http://www.jstacs.de/index.php/Dispom.
Ethnobotany and Medicinal Plant Biotechnology: From Tradition to Modern Aspects of Drug Development.
Kayser, Oliver
2018-05-24
Secondary natural products from plants are important drug leads for the development of new drug candidates for rational clinical therapy and exhibit a variety of biological activities in experimental pharmacology and serve as structural template in medicinal chemistry. The exploration of plants and discovery of natural compounds based on ethnopharmacology in combination with high sophisticated analytics is still today an important drug discovery to characterize and validate potential leads. Due to structural complexity, low abundance in biological material, and high costs in chemical synthesis, alternative ways in production like plant cell cultures, heterologous biosynthesis, and synthetic biotechnology are applied. The basis for any biotechnological process is deep knowledge in genetic regulation of pathways and protein expression with regard to todays "omics" technologies. The high number genetic techniques allowed the implementation of combinatorial biosynthesis and wide genome sequencing. Consequently, genetics allowed functional expression of biosynthetic cascades from plants and to reconstitute low-performing pathways in more productive heterologous microorganisms. Thus, de novo biosynthesis in heterologous hosts requires fundamental understanding of pathway reconstruction and multitude of genes in a foreign organism. Here, actual concepts and strategies are discussed for pathway reconstruction and genome sequencing techniques cloning tools to bridge the gap between ethnopharmaceutical drug discovery to industrial biotechnology. Georg Thieme Verlag KG Stuttgart · New York.
Iyer, Pravin S.; Fodor, Matthew D.; Coleman, Claire M.; Twining, Leslie A.; Mitasev, Branko
2012-01-01
The solution phase synthesis of a discovery library of 178 tricyclic pyrrole-2-carboxamides was accomplished in nine steps and seven purifications starting with three benzoyl protected amino acid methyl esters. Further diversity was introduced by two glyoxaldehydes and forty-one primary amines. The combination of Pauson-Khand, Stetter and microwave assisted Paal Knorr reactions was applied as a key sequence. The discovery library was designed with the help of QikProp 2.1 and physicochemical data are presented for all pyrroles. Library members were synthesized and purified in parallel and analyzed by LC-MS. Selected compounds were fully characterized. PMID:16677007
Using History and Philosophy of Science to Promote Students' Argumentation
ERIC Educational Resources Information Center
Archila, Pablo Antonio
2015-01-01
This article describes the effect of a teaching-learning sequence (TLS) based on the discovery of oxygen in promoting students' argumentation. It examines the written and oral arguments produced by 63 high school students (24 females and 39 males, 16-17 years old) in France during a complete TLS supervised by the same teacher. The data used in…
You, Yanqin; Sun, Yan; Li, Xuchao; Li, Yali; Wei, Xiaoming; Chen, Fang; Ge, Huijuan; Lan, Zhangzhang; Zhu, Qian; Tang, Ying; Wang, Shujuan; Gao, Ya; Jiang, Fuman; Song, Jiaping; Shi, Quan; Zhu, Xuan; Mu, Feng; Dong, Wei; Gao, Vince; Jiang, Hui; Yi, Xin; Wang, Wei; Gao, Zhiying
2014-08-01
This article demonstrates a prominent noninvasive prenatal approach to assist the clinical diagnosis of a single-gene disorder disease, maple syrup urine disease, using targeted sequencing knowledge from the affected family. The method reported here combines novel mutant discovery in known genes by targeted massively parallel sequencing with noninvasive prenatal testing. By applying this new strategy, we successfully revealed novel mutations in the gene BCKDHA (Ex2_4dup and c.392A>G) in this Chinese family and developed a prenatal haplotype-assisted approach to noninvasively detect the genotype of the fetus (transmitted from both parents). This is the first report of integration of targeted sequencing and noninvasive prenatal testing into clinical practice. Our study has demonstrated that this massively parallel sequencing-based strategy can potentially be used for single-gene disorder diagnosis in the future.
Eikrem, Oystein S; Strauss, Philipp; Beisland, Christian; Scherer, Andreas; Landolt, Lea; Flatberg, Arnar; Leh, Sabine; Beisvag, Vidar; Skogstrand, Trude; Hjelle, Karin; Shresta, Anjana; Marti, Hans-Peter
2016-12-01
A previous study by this group demonstrated the feasibility of RNA sequencing (RNAseq) technology for capturing disease biology of clear cell renal cell carcinoma (ccRCC), and presented initial results for carbonic anhydrase-9 (CA9) and tumor necrosis factor-α-induced protein-6 (TNFAIP6) as possible biomarkers of ccRCC (discovery set) [Eikrem et al. PLoS One 2016;11:e0149743]. To confirm these results, the previous study is expanded, and RNAseq data from additional matched ccRCC and normal renal biopsies are analyzed (confirmation set). Two core biopsies from patients (n = 12) undergoing partial or full nephrectomy were obtained with a 16 g needle. RNA sequencing libraries were generated with the Illumina TruSeq ® Access library preparation protocol. Comparative analysis was done using linear modeling (voom/Limma; R Bioconductor). The formalin-fixed and paraffin-embedded discovery and confirmation data yielded 8957 and 11,047 detected transcripts, respectively. The two data sets shared 1193 of differentially expressed genes with each other. The average expression and the log 2 -fold changes of differentially expressed transcripts in both data sets correlated, with R² = .95 and R² = .94, respectively. Among transcripts with the highest fold changes were CA9, neuronal pentraxin-2 and uromodulin. Epithelial-mesenchymal transition was highlighted by differential expression of, for example, transforming growth factor-β 1 and delta-like ligand-4. The diagnostic accuracy of CA9 was 100% and 93.9% when using the discovery set as the training set and the confirmation data as the test set, and vice versa, respectively. These data further support TNFAIP6 as a novel biomarker of ccRCC. TNFAIP6 had combined accuracy of 98.5% in the two data sets. This study provides confirmatory data on the potential use of CA9 and TNFAIP6 as biomarkers of ccRCC. Thus, next-generation sequencing expands the clinical application of tissue analyses.
Ishino, Yoshizumi; Krupovic, Mart; Forterre, Patrick
2018-04-01
Clustered regularly interspaced short palindromic repeat (CRISPR)-Cas systems are well-known acquired immunity systems that are widespread in archaea and bacteria. The RNA-guided nucleases from CRISPR-Cas systems are currently regarded as the most reliable tools for genome editing and engineering. The first hint of their existence came in 1987, when an unusual repetitive DNA sequence, which subsequently was defined as a CRISPR, was discovered in the Escherichia coli genome during an analysis of genes involved in phosphate metabolism. Similar sequence patterns were then reported in a range of other bacteria as well as in halophilic archaea, suggesting an important role for such evolutionarily conserved clusters of repeated sequences. A critical step toward functional characterization of the CRISPR-Cas systems was the recognition of a link between CRISPRs and the associated Cas proteins, which were initially hypothesized to be involved in DNA repair in hyperthermophilic archaea. Comparative genomics, structural biology, and advanced biochemistry could then work hand in hand, not only culminating in the explosion of genome editing tools based on CRISPR-Cas9 and other class II CRISPR-Cas systems but also providing insights into the origin and evolution of this system from mobile genetic elements denoted casposons. To celebrate the 30th anniversary of the discovery of CRISPR, this minireview briefly discusses the fascinating history of CRISPR-Cas systems, from the original observation of an enigmatic sequence in E. coli to genome editing in humans. Copyright © 2018 American Society for Microbiology.
Accurate phylogenetic classification of DNA fragments based onsequence composition
DOE Office of Scientific and Technical Information (OSTI.GOV)
McHardy, Alice C.; Garcia Martin, Hector; Tsirigos, Aristotelis
2006-05-01
Metagenome studies have retrieved vast amounts of sequenceout of a variety of environments, leading to novel discoveries and greatinsights into the uncultured microbial world. Except for very simplecommunities, diversity makes sequence assembly and analysis a verychallenging problem. To understand the structure a 5 nd function ofmicrobial communities, a taxonomic characterization of the obtainedsequence fragments is highly desirable, yet currently limited mostly tothose sequences that contain phylogenetic marker genes. We show that forclades at the rank of domain down to genus, sequence composition allowsthe very accurate phylogenetic 10 characterization of genomic sequence.We developed a composition-based classifier, PhyloPythia, for de novophylogenetic sequencemore » characterization and have trained it on adata setof 340 genomes. By extensive evaluation experiments we show that themethodis accurate across all taxonomic ranks considered, even forsequences that originate fromnovel organisms and are as short as 1kb.Application to two metagenome datasets 15 obtained from samples ofphosphorus-removing sludge showed that the method allows the accurateclassification at genus level of most sequence fragments from thedominant populations, while at the same time correctly characterizingeven larger parts of the samples at higher taxonomic levels.« less
From Fibonacci Numbers To Geometric Sequences and the Binet Formula by Way of the Golden Ratio!
ERIC Educational Resources Information Center
DiDomenico, Angelo S.
1997-01-01
Provides activities that deal with Fibonacci-like sequences and guide students' thinking as they explore mathematical induction. Investigation leads to a discovery of an interesting relation that involves all Fibonacci-like sequences. (DDR)
NASA Astrophysics Data System (ADS)
Sherwood, R.; Mutz, D.; Estlin, T.; Chien, S.; Backes, P.; Norris, J.; Tran, D.; Cooper, B.; Rabideau, G.; Mishkin, A.; Maxwell, S.
2001-07-01
This article discusses a proof-of-concept prototype for ground-based automatic generation of validated rover command sequences from high-level science and engineering activities. This prototype is based on ASPEN, the Automated Scheduling and Planning Environment. This artificial intelligence (AI)-based planning and scheduling system will automatically generate a command sequence that will execute within resource constraints and satisfy flight rules. An automated planning and scheduling system encodes rover design knowledge and uses search and reasoning techniques to automatically generate low-level command sequences while respecting rover operability constraints, science and engineering preferences, environmental predictions, and also adhering to hard temporal constraints. This prototype planning system has been field-tested using the Rocky 7 rover at JPL and will be field-tested on more complex rovers to prove its effectiveness before transferring the technology to flight operations for an upcoming NASA mission. Enabling goal-driven commanding of planetary rovers greatly reduces the requirements for highly skilled rover engineering personnel. This in turn greatly reduces mission operations costs. In addition, goal-driven commanding permits a faster response to changes in rover state (e.g., faults) or science discoveries by removing the time-consuming manual sequence validation process, allowing rapid "what-if" analyses, and thus reducing overall cycle times.
Lu, Ruipeng; Mucaki, Eliseos J; Rogan, Peter K
2017-03-17
Data from ChIP-seq experiments can derive the genome-wide binding specificities of transcription factors (TFs) and other regulatory proteins. We analyzed 765 ENCODE ChIP-seq peak datasets of 207 human TFs with a novel motif discovery pipeline based on recursive, thresholded entropy minimization. This approach, while obviating the need to compensate for skewed nucleotide composition, distinguishes true binding motifs from noise, quantifies the strengths of individual binding sites based on computed affinity and detects adjacent cofactor binding sites that coordinate with the targets of primary, immunoprecipitated TFs. We obtained contiguous and bipartite information theory-based position weight matrices (iPWMs) for 93 sequence-specific TFs, discovered 23 cofactor motifs for 127 TFs and revealed six high-confidence novel motifs. The reliability and accuracy of these iPWMs were determined via four independent validation methods, including the detection of experimentally proven binding sites, explanation of effects of characterized SNPs, comparison with previously published motifs and statistical analyses. We also predict previously unreported TF coregulatory interactions (e.g. TF complexes). These iPWMs constitute a powerful tool for predicting the effects of sequence variants in known binding sites, performing mutation analysis on regulatory SNPs and predicting previously unrecognized binding sites and target genes. © The Author(s) 2016. Published by Oxford University Press on behalf of Nucleic Acids Research.
The Discovery of a Low-Mass Binary Companion to HD130948
NASA Astrophysics Data System (ADS)
Potter, D. E.; Cushing, M. C.; Neuhauser, R.
2003-10-01
We report the discovery of a low mass binary companion to the nearby (17.9 pc) main sequence star HD130948 (HR5534, HIP 72567) using the Hokupa'a adaptive optics instrument mounted on the Gemini North 8 meter telescope. Both companions have the same common proper motion as the primary star as seen over a 4 month baseline. The JHK' photometry of the companions, when placed on a near-IR color-magnitude diagram and compared with theoretical models places them at the bottom of the M-dwarf sequence. Preliminary near IR spectra have been obtained with SpeX mounted on the NASA IRTF 3 meter telescope are consistent with the photometric results and show carbon monoxide bandheads and water absorption features indicative of an early L-late M spectral type. The X-ray activity and Lithium abundance of the primary star indicate that the system is probably less than 1 Gyr old. Assuming a young age, these objects are less than 80 Mjupiter. With further astrometric observations carried out over an estimated orbital period of 10-20 years, a dynamical mass will be obtained.
Advances in QTL Mapping in Pigs
Rothschild, Max F.; Hu, Zhi-liang; Jiang, Zhihua
2007-01-01
Over the past 15 years advances in the porcine genetic linkage map and discovery of useful candidate genes have led to valuable gene and trait information being discovered. Early use of exotic breed crosses and now commercial breed crosses for quantitative trait loci (QTL) scans and candidate gene analyses have led to 110 publications which have identified 1,675 QTL. Additionally, these studies continue to identify genes associated with economically important traits such as growth rate, leanness, feed intake, meat quality, litter size, and disease resistance. A well developed QTL database called PigQTLdb is now as a valuable tool for summarizing and pinpointing in silico regions of interest to researchers. The commercial pig industry is actively incorporating these markers in marker-assisted selection along with traditional performance information to improve traits of economic performance. The long awaited sequencing efforts are also now beginning to provide sequence available for both comparative genomics and large scale single nucleotide polymorphism (SNP) association studies. While these advances are all positive, development of useful new trait families and measurement of new or underlying traits still limits future discoveries. A review of these developments is presented. PMID:17384738
Screening for SNPs with Allele-Specific Methylation based on Next-Generation Sequencing Data
Hu, Bo; Xu, Yaomin
2013-01-01
Allele-specific methylation (ASM) has long been studied but mainly documented in the context of genomic imprinting and X chromosome inactivation. Taking advantage of the next-generation sequencing technology, we conduct a high-throughput sequencing experiment with four prostate cell lines to survey the whole genome and identify single nucleotide polymorphisms (SNPs) with ASM. A Bayesian approach is proposed to model the counts of short reads for each SNP conditional on its genotypes of multiple subjects, leading to a posterior probability of ASM. We flag SNPs with high posterior probabilities of ASM by accounting for multiple comparisons based on posterior false discovery rates. Applying the Bayesian approach to the in-house prostate cell line data, we identify 269 SNPs as candidates of ASM. A simulation study is carried out to demonstrate the quantitative performance of the proposed approach. PMID:23710259
High-field MRS in clinical drug development.
Ross, Brian D
2013-07-01
Magnetic resonance spectroscopy (MRS) will continue to play an ever increasing role in drug discovery because MRS does readily define biomarkers for several hundreds of clinically distinct diseases. Published evidence based medicine (EBM) surveys, which generally conclude the opposite, are seriously flawed and do a disservice to the field of drug discovery. This article presents MRS and how it has guided several hundreds of practical human 'drug discovery' endeavors since its development. Specifically, the author looks at the process of 'reverse-translation' and its influence in the expansion of the number of preclinical drug discoveries from in vivo MRS. The author also provides a structured approach of eight criteria, including EBM acceptance, which could potentially re-open the field of MRS for productive exploration of existing and repurposed drugs and cost-effective drug-discovery. MRS-guided drug discovery is poised for future expansion. The cost of clinical trials has escalated and the use of biomarkers has become increasingly useful in improving patient selection for drug trials. Clinical MRS has uncovered a treasure-trove of novel biomarkers and clinical MRS itself has become better standardized and more widely available on 'routine' clinical MRI scanners. When combined with available new MRI sequences, MRS can provide a 'one stop shop' with multiple potential outcome measures for the disease and the drug in question.
A non-parametric peak calling algorithm for DamID-Seq.
Li, Renhua; Hempel, Leonie U; Jiang, Tingbo
2015-01-01
Protein-DNA interactions play a significant role in gene regulation and expression. In order to identify transcription factor binding sites (TFBS) of double sex (DSX)-an important transcription factor in sex determination, we applied the DNA adenine methylation identification (DamID) technology to the fat body tissue of Drosophila, followed by deep sequencing (DamID-Seq). One feature of DamID-Seq data is that induced adenine methylation signals are not assured to be symmetrically distributed at TFBS, which renders the existing peak calling algorithms for ChIP-Seq, including SPP and MACS, inappropriate for DamID-Seq data. This challenged us to develop a new algorithm for peak calling. A challenge in peaking calling based on sequence data is estimating the averaged behavior of background signals. We applied a bootstrap resampling method to short sequence reads in the control (Dam only). After data quality check and mapping reads to a reference genome, the peaking calling procedure compromises the following steps: 1) reads resampling; 2) reads scaling (normalization) and computing signal-to-noise fold changes; 3) filtering; 4) Calling peaks based on a statistically significant threshold. This is a non-parametric method for peak calling (NPPC). We also used irreproducible discovery rate (IDR) analysis, as well as ChIP-Seq data to compare the peaks called by the NPPC. We identified approximately 6,000 peaks for DSX, which point to 1,225 genes related to the fat body tissue difference between female and male Drosophila. Statistical evidence from IDR analysis indicated that these peaks are reproducible across biological replicates. In addition, these peaks are comparable to those identified by use of ChIP-Seq on S2 cells, in terms of peak number, location, and peaks width.
Anton, Brian P; Mongodin, Emmanuel F; Agrawal, Sonia; Fomenkov, Alexey; Byrd, Devon R; Roberts, Richard J; Raleigh, Elisabeth A
2015-01-01
We report the complete sequence of ER2796, a laboratory strain of Escherichia coli K-12 that is completely defective in DNA methylation. Because of its lack of any native methylation, it is extremely useful as a host into which heterologous DNA methyltransferase genes can be cloned and the recognition sequences of their products deduced by Pacific Biosciences Single-Molecule Real Time (SMRT) sequencing. The genome was itself sequenced from a long-insert library using the SMRT platform, resulting in a single closed contig devoid of methylated bases. Comparison with K-12 MG1655, the first E. coli K-12 strain to be sequenced, shows an essentially co-linear relationship with no major rearrangements despite many generations of laboratory manipulation. The comparison revealed a total of 41 insertions and deletions, and 228 single base pair substitutions. In addition, the long-read approach facilitated the surprising discovery of four gene conversion events, three involving rRNA operons and one between two cryptic prophages. Such events thus contribute both to genomic homogenization and to bacteriophage diversification. As one of relatively few laboratory strains of E. coli to be sequenced, the genome also reveals the sequence changes underlying a number of classical mutant alleles including those affecting the various native DNA methylation systems.
Anton, Brian P.; Mongodin, Emmanuel F.; Agrawal, Sonia; Fomenkov, Alexey; Byrd, Devon R.; Roberts, Richard J.; Raleigh, Elisabeth A.
2015-01-01
We report the complete sequence of ER2796, a laboratory strain of Escherichia coli K-12 that is completely defective in DNA methylation. Because of its lack of any native methylation, it is extremely useful as a host into which heterologous DNA methyltransferase genes can be cloned and the recognition sequences of their products deduced by Pacific Biosciences Single-Molecule Real Time (SMRT) sequencing. The genome was itself sequenced from a long-insert library using the SMRT platform, resulting in a single closed contig devoid of methylated bases. Comparison with K-12 MG1655, the first E. coli K-12 strain to be sequenced, shows an essentially co-linear relationship with no major rearrangements despite many generations of laboratory manipulation. The comparison revealed a total of 41 insertions and deletions, and 228 single base pair substitutions. In addition, the long-read approach facilitated the surprising discovery of four gene conversion events, three involving rRNA operons and one between two cryptic prophages. Such events thus contribute both to genomic homogenization and to bacteriophage diversification. As one of relatively few laboratory strains of E. coli to be sequenced, the genome also reveals the sequence changes underlying a number of classical mutant alleles including those affecting the various native DNA methylation systems. PMID:26010885
Towards Discovery and Targeted Peptide Biomarker Detection Using nanoESI-TIMS-TOF MS
DOE Office of Scientific and Technical Information (OSTI.GOV)
Garabedian, Alyssa; Benigni, Paolo; Ramirez, Cesar E.
Abstract. In the present work, the potential of trapped ion mobility spectrometry coupled to TOF mass spectrometry (TIMS-TOF MS) for discovery and targeted monitoring of peptide biomarkers from human-in-mouse xenograft tumor tissue was evaluated. In particular, a TIMS-MS workflow was developed for the detection and quantification of peptide biomarkers using internal heavy analogs, taking advantage of the high mobility resolution (R = 150–250) prior to mass analysis. Five peptide biomarkers were separated, identified, and quantified using offline nanoESI-TIMSCID- TOF MS; the results were in good agreement with measurements using a traditional LC-ESI-MS/MS proteomics workflow. The TIMS-TOF MS analysis permitted peptidemore » biomarker detection based on accurate mobility, mass measurements, and high sequence coverage for concentrations in the 10–200 nM range, while simultaneously achieving discovery measurements« less
Agyei, Dominic; Tsopmo, Apollinaire; Udenigwe, Chibuike C
2018-06-01
There are emerging advancements in the strategies used for the discovery and development of food-derived bioactive peptides because of their multiple food and health applications. Bioinformatics and peptidomics are two computational and analytical techniques that have the potential to speed up the development of bioactive peptides from bench to market. Structure-activity relationships observed in peptides form the basis for bioinformatics and in silico prediction of bioactive sequences encrypted in food proteins. Peptidomics, on the other hand, relies on "hyphenated" (liquid chromatography-mass spectrometry-based) techniques for the detection, profiling, and quantitation of peptides. Together, bioinformatics and peptidomics approaches provide a low-cost and effective means of predicting, profiling, and screening bioactive protein hydrolysates and peptides from food. This article discuses the basis, strengths, and limitations of bioinformatics and peptidomics approaches currently used for the discovery and analysis of food-derived bioactive peptides.
Novel Insights into Tree Biology and Genome Evolution as Revealed Through Genomics.
Neale, David B; Martínez-García, Pedro J; De La Torre, Amanda R; Montanari, Sara; Wei, Xiao-Xin
2017-04-28
Reference genome sequences are the key to the discovery of genes and gene families that determine traits of interest. Recent progress in sequencing technologies has enabled a rapid increase in genome sequencing of tree species, allowing the dissection of complex characters of economic importance, such as fruit and wood quality and resistance to biotic and abiotic stresses. Although the number of reference genome sequences for trees lags behind those for other plant species, it is not too early to gain insight into the unique features that distinguish trees from nontree plants. Our review of the published data suggests that, although many gene families are conserved among herbaceous and tree species, some gene families, such as those involved in resistance to biotic and abiotic stresses and in the synthesis and transport of sugars, are often expanded in tree genomes. As the genomes of more tree species are sequenced, comparative genomics will further elucidate the complexity of tree genomes and how this relates to traits unique to trees.
Computational Analysis of Mouse piRNA Sequence and Biogenesis
Betel, Doron; Sheridan, Robert; Marks, Debora S; Sander, Chris
2007-01-01
The recent discovery of a new class of 30-nucleotide long RNAs in mammalian testes, called PIWI-interacting RNA (piRNA), with similarities to microRNAs and repeat-associated small interfering RNAs (rasiRNAs), has raised puzzling questions regarding their biogenesis and function. We report a comparative analysis of currently available piRNA sequence data from the pachytene stage of mouse spermatogenesis that sheds light on their sequence diversity and mechanism of biogenesis. We conclude that (i) there are at least four times as many piRNAs in mouse testes than currently known; (ii) piRNAs, which originate from long precursor transcripts, are generated by quasi-random enzymatic processing that is guided by a weak sequence signature at the piRNA 5′ends resulting in a large number of distinct sequences; and (iii) many of the piRNA clusters contain inverted repeats segments capable of forming double-strand RNA fold-back segments that may initiate piRNA processing analogous to transposon silencing. PMID:17997596
Sánchez, Cecilia Castaño; Smith, Timothy P L; Wiedmann, Ralph T; Vallejo, Roger L; Salem, Mohamed; Yao, Jianbo; Rexroad, Caird E
2009-11-25
To enhance capabilities for genomic analyses in rainbow trout, such as genomic selection, a large suite of polymorphic markers that are amenable to high-throughput genotyping protocols must be identified. Expressed Sequence Tags (ESTs) have been used for single nucleotide polymorphism (SNP) discovery in salmonids. In those strategies, the salmonid semi-tetraploid genomes often led to assemblies of paralogous sequences and therefore resulted in a high rate of false positive SNP identification. Sequencing genomic DNA using primers identified from ESTs proved to be an effective but time consuming methodology of SNP identification in rainbow trout, therefore not suitable for high throughput SNP discovery. In this study, we employed a high-throughput strategy that used pyrosequencing technology to generate data from a reduced representation library constructed with genomic DNA pooled from 96 unrelated rainbow trout that represent the National Center for Cool and Cold Water Aquaculture (NCCCWA) broodstock population. The reduced representation library consisted of 440 bp fragments resulting from complete digestion with the restriction enzyme HaeIII; sequencing produced 2,000,000 reads providing an average 6 fold coverage of the estimated 150,000 unique genomic restriction fragments (300,000 fragment ends). Three independent data analyses identified 22,022 to 47,128 putative SNPs on 13,140 to 24,627 independent contigs. A set of 384 putative SNPs, randomly selected from the sets produced by the three analyses were genotyped on individual fish to determine the validation rate of putative SNPs among analyses, distinguish apparent SNPs that actually represent paralogous loci in the tetraploid genome, examine Mendelian segregation, and place the validated SNPs on the rainbow trout linkage map. Approximately 48% (183) of the putative SNPs were validated; 167 markers were successfully incorporated into the rainbow trout linkage map. In addition, 2% of the sequences from the validated markers were associated with rainbow trout transcripts. The use of reduced representation libraries and pyrosequencing technology proved to be an effective strategy for the discovery of a high number of putative SNPs in rainbow trout; however, modifications to the technique to decrease the false discovery rate resulting from the evolutionary recent genome duplication would be desirable.
Metagenomics: Probing pollutant fate in natural and engineered ecosystems.
Bouhajja, Emna; Agathos, Spiros N; George, Isabelle F
2016-12-01
Polluted environments are a reservoir of microbial species able to degrade or to convert pollutants to harmless compounds. The proper management of microbial resources requires a comprehensive characterization of their genetic pool to assess the fate of contaminants and increase the efficiency of bioremediation processes. Metagenomics offers appropriate tools to describe microbial communities in their whole complexity without lab-based cultivation of individual strains. After a decade of use of metagenomics to study microbiomes, the scientific community has made significant progress in this field. In this review, we survey the main steps of metagenomics applied to environments contaminated with organic compounds or heavy metals. We emphasize technical solutions proposed to overcome encountered obstacles. We then compare two metagenomic approaches, i.e. library-based targeted metagenomics and direct sequencing of metagenomes. In the former, environmental DNA is cloned inside a host, and then clones of interest are selected based on (i) their expression of biodegradative functions or (ii) sequence homology with probes and primers designed from relevant, already known sequences. The highest score for the discovery of novel genes and degradation pathways has been achieved so far by functional screening of large clone libraries. On the other hand, direct sequencing of metagenomes without a cloning step has been more often applied to polluted environments for characterization of the taxonomic and functional composition of microbial communities and their dynamics. In this case, the analysis has focused on 16S rRNA genes and marker genes of biodegradation. Advances in next generation sequencing and in bioinformatic analysis of sequencing data have opened up new opportunities for assessing the potential of biodegradation by microbes, but annotation of collected genes is still hampered by a limited number of available reference sequences in databases. Although metagenomics is still facing technical and computational challenges, our review of the recent literature highlights its value as an aid to efficiently monitor the clean-up of contaminated environments and develop successful strategies to mitigate the impact of pollutants on ecosystems. Copyright © 2016 Elsevier Inc. All rights reserved.
Chao, Tianle; Wang, Guizhi; Wang, Jianmin; Liu, Zhaohua; Ji, Zhibin; Hou, Lei; Zhang, Chunlan
2016-01-01
High-throughput mRNA sequencing enables the discovery of new transcripts and additional parts of incompletely annotated transcripts. Compared with the human and cow genomes, the reference annotation level of the sheep genome is still low. An investigation of new transcripts in sheep skeletal muscle will improve our understanding of muscle development. Therefore, applying high-throughput sequencing, two cDNA libraries from the biceps brachii of small-tailed Han sheep and Dorper sheep were constructed, and whole-transcriptome analysis was performed to determine the unknown transcript catalogue of this tissue. In this study, 40,129 transcripts were finally mapped to the sheep genome. Among them, 3,467 transcripts were determined to be unannotated in the current reference sheep genome and were defined as new transcripts. Based on protein-coding capacity prediction and comparative analysis of sequence similarity, 246 transcripts were classified as portions of unannotated genes or incompletely annotated genes. Another 1,520 transcripts were predicted with high confidence to be long non-coding RNAs. Our analysis also revealed 334 new transcripts that displayed specific expression in ruminants and uncovered a number of new transcripts without intergenus homology but with specific expression in sheep skeletal muscle. The results confirmed a complex transcript pattern of coding and non-coding RNA in sheep skeletal muscle. This study provided important information concerning the sheep genome and transcriptome annotation, which could provide a basis for further study.
Pattern Discovery and Change Detection of Online Music Query Streams
NASA Astrophysics Data System (ADS)
Li, Hua-Fu
In this paper, an efficient stream mining algorithm, called FTP-stream (Frequent Temporal Pattern mining of streams), is proposed to find the frequent temporal patterns over melody sequence streams. In the framework of our proposed algorithm, an effective bit-sequence representation is used to reduce the time and memory needed to slide the windows. The FTP-stream algorithm can calculate the support threshold in only a single pass based on the concept of bit-sequence representation. It takes the advantage of "left" and "and" operations of the representation. Experiments show that the proposed algorithm only scans the music query stream once, and runs significant faster and consumes less memory than existing algorithms, such as SWFI-stream and Moment.
MendeLIMS: a web-based laboratory information management system for clinical genome sequencing.
Grimes, Susan M; Ji, Hanlee P
2014-08-27
Large clinical genomics studies using next generation DNA sequencing require the ability to select and track samples from a large population of patients through many experimental steps. With the number of clinical genome sequencing studies increasing, it is critical to maintain adequate laboratory information management systems to manage the thousands of patient samples that are subject to this type of genetic analysis. To meet the needs of clinical population studies using genome sequencing, we developed a web-based laboratory information management system (LIMS) with a flexible configuration that is adaptable to continuously evolving experimental protocols of next generation DNA sequencing technologies. Our system is referred to as MendeLIMS, is easily implemented with open source tools and is also highly configurable and extensible. MendeLIMS has been invaluable in the management of our clinical genome sequencing studies. We maintain a publicly available demonstration version of the application for evaluation purposes at http://mendelims.stanford.edu. MendeLIMS is programmed in Ruby on Rails (RoR) and accesses data stored in SQL-compliant relational databases. Software is freely available for non-commercial use at http://dna-discovery.stanford.edu/software/mendelims/.
Assessment of composite motif discovery methods.
Klepper, Kjetil; Sandve, Geir K; Abul, Osman; Johansen, Jostein; Drablos, Finn
2008-02-26
Computational discovery of regulatory elements is an important area of bioinformatics research and more than a hundred motif discovery methods have been published. Traditionally, most of these methods have addressed the problem of single motif discovery - discovering binding motifs for individual transcription factors. In higher organisms, however, transcription factors usually act in combination with nearby bound factors to induce specific regulatory behaviours. Hence, recent focus has shifted from single motifs to the discovery of sets of motifs bound by multiple cooperating transcription factors, so called composite motifs or cis-regulatory modules. Given the large number and diversity of methods available, independent assessment of methods becomes important. Although there have been several benchmark studies of single motif discovery, no similar studies have previously been conducted concerning composite motif discovery. We have developed a benchmarking framework for composite motif discovery and used it to evaluate the performance of eight published module discovery tools. Benchmark datasets were constructed based on real genomic sequences containing experimentally verified regulatory modules, and the module discovery programs were asked to predict both the locations of these modules and to specify the single motifs involved. To aid the programs in their search, we provided position weight matrices corresponding to the binding motifs of the transcription factors involved. In addition, selections of decoy matrices were mixed with the genuine matrices on one dataset to test the response of programs to varying levels of noise. Although some of the methods tested tended to score somewhat better than others overall, there were still large variations between individual datasets and no single method performed consistently better than the rest in all situations. The variation in performance on individual datasets also shows that the new benchmark datasets represents a suitable variety of challenges to most methods for module discovery.
Discovery of novel plant interaction determinants from the genomes of 163 root nodule bacteria
Seshadri, Rekha; Reeve, Wayne G.; Ardley, Julie K.; ...
2015-11-20
Root nodule bacteria (RNB) or “rhizobia” are a type of plant growth promoting bacteria, typified by their ability to fix nitrogen for their plant host, fixing nearly 65% of the nitrogen currently utilized in sustainable agricultural production of legume crops and pastures. In this study, we sequenced the genomes of 110 RNB from diverse hosts and biogeographical regions, and undertook a global exploration of all available RNB genera with the aim of identifying novel genetic determinants of symbiotic association and plant growth promotion. Specifically, we performed a subtractive comparative analysis with non-RNB genomes, employed relevant transcriptomic data, and leveraged phylogeneticmore » distribution patterns and sequence signatures based on known precepts of symbioticand host-microbe interactions. A total of 184 protein families were delineated, including known factors for nodulation and nitrogen fixation, and candidates with previously unexplored functions, for which a role in host-interaction, -regulation, biocontrol, and more, could be posited. Lastly, these analyses expand our knowledge of the RNB purview and provide novel targets for strain improvement in the ultimate quest to enhance plant productivity and agricultural sustainability.« less
Tool-assisted rhythmic drumming in palm cockatoos shares key elements of human instrumental music
Heinsohn, Robert; Zdenek, Christina N.; Cunningham, Ross B.; Endler, John A.; Langmore, Naomi E.
2017-01-01
All human societies have music with a rhythmic “beat,” typically produced with percussive instruments such as drums. The set of capacities that allows humans to produce and perceive music appears to be deeply rooted in human biology, but an understanding of its evolutionary origins requires cross-taxa comparisons. We show that drumming by palm cockatoos (Probosciger aterrimus) shares the key rudiments of human instrumental music, including manufacture of a sound tool, performance in a consistent context, regular beat production, repeated components, and individual styles. Over 131 drumming sequences produced by 18 males, the beats occurred at nonrandom, regular intervals, yet individual males differed significantly in the shape parameters describing the distribution of their beat patterns, indicating individual drumming styles. Autocorrelation analyses of the longest drumming sequences further showed that they were highly regular and predictable like human music. These discoveries provide a rare comparative perspective on the evolution of rhythmicity and instrumental music in our own species, and show that a preference for a regular beat can have other origins before being co-opted into group-based music and dance. PMID:28782005
Nearing saturation of cancer driver gene discovery.
Hsiehchen, David; Hsieh, Antony
2018-06-15
Extensive sequencing efforts of cancer genomes such as The Cancer Genome Atlas (TCGA) have been undertaken to uncover bona fide cancer driver genes which has enhanced our understanding of cancer and revealed therapeutic targets. However, the number of driver gene mutations is bounded, indicating that there must be a point when further sequencing efforts will be excessive. We found that there was a significant positive correlation between sample size and identified driver gene mutations across 33 cancers sequenced by the TCGA, which is expected if additional sequencing is still leading to the identification of more driver genes. However, the rate of new cancer driver genes being discovered with larger samples is declining rapidly. Our analysis provides a general guide for determining which cancer types would likely benefit from additional sequencing efforts, particularly those with relatively high rates of cancer driver gene discovery. Our results argue that past strategies of indiscriminately sequencing as many specimens as possible for all cancer types is becoming inefficient. In addition, without significant investments into applying our knowledge of cancer genomes, we risk sequencing more cancer genomes for the sake of sequencing rather than meaningful patient benefit.
The discovery of low-mass pre-main-sequence stars in Cepheus OB3b
NASA Astrophysics Data System (ADS)
Pozzo, M.; Naylor, T.; Jeffries, R. D.; Drew, J. E.
2003-05-01
We report the discovery of a low-mass pre-main-sequence (PMS) stellar population in the younger subgroup of the Cepheus OB3 association, Cep OB3b, using UBVI CCD photometry and follow-up spectroscopy. The optical survey covers approximately 1300 arcmin2 on the sky and gives a global photometric and astrometric catalogue for more than 7000 objects. The location of a PMS population is well defined in a V versus (V-I) colour-magnitude diagram. Multifibre spectroscopic results for optically selected PMS candidates confirm the T Tauri nature for 10 objects, with equal numbers of classical TTS (CTTS) and weak-line TTS (WTTS). There are six other objects that we classify as possible PMS stars. The newly discovered TTS stars have masses in the range ~0.9-3.0 Msolar and ages from <1 to nearly 10 Myr, based on the Siess, Dufour & Forestini isochrones. Their location close to the O and B stars of the association (especially the O7n star) demonstrates that low-mass star formation is indeed possible in such an apparently hostile environment dominated by early-type stars and that the latter must have been less effective in eroding the circumstellar discs of their lower-mass siblings compared with other OB associations (e.g. λ-Ori). We attribute this to the nature of the local environment, speculating that the bulk of molecular material, which shielded low-mass stars from the ionizing radiation of their early-type siblings, has only recently been removed.
Comparative analysis of the feline immunoglobulin repertoire.
Steiniger, Sebastian C J; Glanville, Jacob; Harris, Douglas W; Wilson, Thomas L; Ippolito, Gregory C; Dunham, Steven A
2017-03-01
Next-Generation Sequencing combined with bioinformatics is a powerful tool for analyzing the large number of DNA sequences present in the expressed antibody repertoire and these data sets can be used to advance a number of research areas including antibody discovery and engineering. The accurate measurement of the immune repertoire sequence composition, diversity and abundance is important for understanding the repertoire response in infections, vaccinations and cancer immunology and could also be useful for elucidating novel molecular targets. In this study 4 individual domestic cats (Felis catus) were subjected to antibody repertoire sequencing with total number of sequences generated 1079863 for VH for IgG, 1050824 VH for IgM, 569518 for VK and 450195 for VL. Our analysis suggests that a similar VDJ expression patterns exists across all cats. Similar to the canine repertoire, the feline repertoire is dominated by a single subgroup, namely VH3. The antibody paratope of felines showed similar amino acid variation when compared to human, mouse and canine counterparts. All animals show a similarly skewed VH CDR-H3 profile and, when compared to canine, human and mouse, distinct differences are observed. Our study represents the first attempt to characterize sequence diversity in the expressed feline antibody repertoire and this demonstrates the utility of using NGS to elucidate entire antibody repertoires from individual animals. These data provide significant insight into understanding the feline immune system function. Copyright © 2017 International Alliance for Biological Standardization. Published by Elsevier Ltd. All rights reserved.
The Papillomavirus Episteme: a major update to the papillomavirus sequence database.
Van Doorslaer, Koenraad; Li, Zhiwen; Xirasagar, Sandhya; Maes, Piet; Kaminsky, David; Liou, David; Sun, Qiang; Kaur, Ramandeep; Huyen, Yentram; McBride, Alison A
2017-01-04
The Papillomavirus Episteme (PaVE) is a database of curated papillomavirus genomic sequences, accompanied by web-based sequence analysis tools. This update describes the addition of major new features. The papillomavirus genomes within PaVE have been further annotated, and now includes the major spliced mRNA transcripts. Viral genes and transcripts can be visualized on both linear and circular genome browsers. Evolutionary relationships among PaVE reference protein sequences can be analysed using multiple sequence alignments and phylogenetic trees. To assist in viral discovery, PaVE offers a typing tool; a simplified algorithm to determine whether a newly sequenced virus is novel. PaVE also now contains an image library containing gross clinical and histopathological images of papillomavirus infected lesions. Database URL: https://pave.niaid.nih.gov/. Published by Oxford University Press on behalf of Nucleic Acids Research 2016. This work is written by (a) US Government employee(s) and is in the public domain in the US.
ProDeGe: A computational protocol for fully automated decontamination of genomes
Tennessen, Kristin; Andersen, Evan; Clingenpeel, Scott; ...
2015-06-09
Single amplified genomes and genomes assembled from metagenomes have enabled the exploration of uncultured microorganisms at an unprecedented scale. However, both these types of products are plagued by contamination. Since these genomes are now being generated in a high-throughput manner and sequences from them are propagating into public databases to drive novel scientific discoveries, rigorous quality controls and decontamination protocols are urgently needed. Here, we present ProDeGe (Protocol for fully automated Decontamination of Genomes), the first computational protocol for fully automated decontamination of draft genomes. ProDeGe classifies sequences into two classes—clean and contaminant—using a combination of homology and feature-based methodologies.more » On average, 84% of sequence from the non-target organism is removed from the data set (specificity) and 84% of the sequence from the target organism is retained (sensitivity). Lastly, the procedure operates successfully at a rate of ~0.30 CPU core hours per megabase of sequence and can be applied to any type of genome sequence.« less
ProDeGe: A computational protocol for fully automated decontamination of genomes
DOE Office of Scientific and Technical Information (OSTI.GOV)
Tennessen, Kristin; Andersen, Evan; Clingenpeel, Scott
Single amplified genomes and genomes assembled from metagenomes have enabled the exploration of uncultured microorganisms at an unprecedented scale. However, both these types of products are plagued by contamination. Since these genomes are now being generated in a high-throughput manner and sequences from them are propagating into public databases to drive novel scientific discoveries, rigorous quality controls and decontamination protocols are urgently needed. Here, we present ProDeGe (Protocol for fully automated Decontamination of Genomes), the first computational protocol for fully automated decontamination of draft genomes. ProDeGe classifies sequences into two classes—clean and contaminant—using a combination of homology and feature-based methodologies.more » On average, 84% of sequence from the non-target organism is removed from the data set (specificity) and 84% of the sequence from the target organism is retained (sensitivity). Lastly, the procedure operates successfully at a rate of ~0.30 CPU core hours per megabase of sequence and can be applied to any type of genome sequence.« less
Shchetynsky, Klementy; Diaz-Gallo, Lina-Marcella; Folkersen, Lasse; Hensvold, Aase Haj; Catrina, Anca Irinel; Berg, Louise; Klareskog, Lars; Padyukov, Leonid
2017-02-02
Here we integrate verified signals from previous genetic association studies with gene expression and pathway analysis for discovery of new candidate genes and signaling networks, relevant for rheumatoid arthritis (RA). RNA-sequencing-(RNA-seq)-based expression analysis of 377 genes from previously verified RA-associated loci was performed in blood cells from 5 newly diagnosed, non-treated patients with RA, 7 patients with treated RA and 12 healthy controls. Differentially expressed genes sharing a similar expression pattern in treated and untreated RA sub-groups were selected for pathway analysis. A set of "connector" genes derived from pathway analysis was tested for differential expression in the initial discovery cohort and validated in blood cells from 73 patients with RA and in 35 healthy controls. There were 11 qualifying genes selected for pathway analysis and these were grouped into two evidence-based functional networks, containing 29 and 27 additional connector molecules. The expression of genes, corresponding to connector molecules was then tested in the initial RNA-seq data. Differences in the expression of ERBB2, TP53 and THOP1 were similar in both treated and non-treated patients with RA and an additional nine genes were differentially expressed in at least one group of patients compared to healthy controls. The ERBB2, TP53. THOP1 expression profile was successfully replicated in RNA-seq data from peripheral blood mononuclear cells from healthy controls and non-treated patients with RA, in an independent collection of samples. Integration of RNA-seq data with findings from association studies, and consequent pathway analysis implicate new candidate genes, ERBB2, TP53 and THOP1 in the pathogenesis of RA.
Discovery of a new family of amphibians from northeast India with ancient links to Africa
Kamei, Rachunliu G.; Mauro, Diego San; Gower, David J.; Van Bocxlaer, Ines; Sherratt, Emma; Thomas, Ashish; Babu, Suresh; Bossuyt, Franky; Wilkinson, Mark; Biju, S. D.
2012-01-01
The limbless, primarily soil-dwelling and tropical caecilian amphibians (Gymnophiona) comprise the least known order of tetrapods. On the basis of unprecedented extensive fieldwork, we report the discovery of a previously overlooked, ancient lineage and radiation of caecilians from threatened habitats in the underexplored states of northeast India. Molecular phylogenetic analyses of mitogenomic and nuclear DNA sequences, and comparative cranial anatomy indicate an unexpected sister-group relationship with the exclusively African family Herpelidae. Relaxed molecular clock analyses indicate that these lineages diverged in the Early Cretaceous, about 140 Ma. The discovery adds a major branch to the amphibian tree of life and sheds light on both the evolution and biogeography of caecilians and the biotic history of northeast India—an area generally interpreted as a gateway between biodiversity hotspots rather than a distinct biogeographic unit with its own ancient endemics. Because of its distinctive morphology, inferred age and phylogenetic relationships, we recognize the newly discovered caecilian radiation as a new family of modern amphibians. PMID:22357266
Tiered Human Integrated Sequence Search Databases for Shotgun Proteomics.
Deutsch, Eric W; Sun, Zhi; Campbell, David S; Binz, Pierre-Alain; Farrah, Terry; Shteynberg, David; Mendoza, Luis; Omenn, Gilbert S; Moritz, Robert L
2016-11-04
The results of analysis of shotgun proteomics mass spectrometry data can be greatly affected by the selection of the reference protein sequence database against which the spectra are matched. For many species there are multiple sources from which somewhat different sequence sets can be obtained. This can lead to confusion about which database is best in which circumstances-a problem especially acute in human sample analysis. All sequence databases are genome-based, with sequences for the predicted gene and their protein translation products compiled. Our goal is to create a set of primary sequence databases that comprise the union of sequences from many of the different available sources and make the result easily available to the community. We have compiled a set of four sequence databases of varying sizes, from a small database consisting of only the ∼20,000 primary isoforms plus contaminants to a very large database that includes almost all nonredundant protein sequences from several sources. This set of tiered, increasingly complete human protein sequence databases suitable for mass spectrometry proteomics sequence database searching is called the Tiered Human Integrated Search Proteome set. In order to evaluate the utility of these databases, we have analyzed two different data sets, one from the HeLa cell line and the other from normal human liver tissue, with each of the four tiers of database complexity. The result is that approximately 0.8%, 1.1%, and 1.5% additional peptides can be identified for Tiers 2, 3, and 4, respectively, as compared with the Tier 1 database, at substantially increasing computational cost. This increase in computational cost may be worth bearing if the identification of sequence variants or the discovery of sequences that are not present in the reviewed knowledge base entries is an important goal of the study. We find that it is useful to search a data set against a simpler database, and then check the uniqueness of the discovered peptides against a more complex database. We have set up an automated system that downloads all the source databases on the first of each month and automatically generates a new set of search databases and makes them available for download at http://www.peptideatlas.org/thisp/ .
Tiered Human Integrated Sequence Search Databases for Shotgun Proteomics
Deutsch, Eric W.; Sun, Zhi; Campbell, David S.; Binz, Pierre-Alain; Farrah, Terry; Shteynberg, David; Mendoza, Luis; Omenn, Gilbert S.; Moritz, Robert L.
2016-01-01
The results of analysis of shotgun proteomics mass spectrometry data can be greatly affected by the selection of the reference protein sequence database against which the spectra are matched. For many species there are multiple sources from which somewhat different sequence sets can be obtained. This can lead to confusion about which database is best in which circumstances – a problem especially acute in human sample analysis. All sequence databases are genome-based, with sequences for the predicted gene and their protein translation products compiled. Our goal is to create a set of primary sequence databases that comprise the union of sequences from many of the different available sources and make the result easily available to the community. We have compiled a set of four sequence databases of varying sizes, from a small database consisting of only the ~20,000 primary isoforms plus contaminants to a very large database that includes almost all non-redundant protein sequences from several sources. This set of tiered, increasingly complete human protein sequence databases suitable for mass spectrometry proteomics sequence database searching is called the Tiered Human Integrated Search Proteome set. In order to evaluate the utility of these databases, we have analyzed two different data sets, one from the HeLa cell line and the other from normal human liver tissue, with each of the four tiers of database complexity. The result is that approximately 0.8%, 1.1%, and 1.5% additional peptides can be identified for Tiers 2, 3, and 4, respectively, as compared with the Tier 1 database, at substantially increasing computational cost. This increase in computational cost may be worth bearing if the identification of sequence variants or the discovery of sequences that are not present in the reviewed knowledge base entries is an important goal of the study. We find that it is useful to search a data set against a simpler database, and then check the uniqueness of the discovered peptides against a more complex database. We have set up an automated system that downloads all the source databases on the first of each month and automatically generates a new set of search databases and makes them available for download at http://www.peptideatlas.org/thisp/. PMID:27577934
SCOPE: a web server for practical de novo motif discovery.
Carlson, Jonathan M; Chakravarty, Arijit; DeZiel, Charles E; Gross, Robert H
2007-07-01
SCOPE is a novel parameter-free method for the de novo identification of potential regulatory motifs in sets of coordinately regulated genes. The SCOPE algorithm combines the output of three component algorithms, each designed to identify a particular class of motifs. Using an ensemble learning approach, SCOPE identifies the best candidate motifs from its component algorithms. In tests on experimentally determined datasets, SCOPE identified motifs with a significantly higher level of accuracy than a number of other web-based motif finders run with their default parameters. Because SCOPE has no adjustable parameters, the web server has an intuitive interface, requiring only a set of gene names or FASTA sequences and a choice of species. The most significant motifs found by SCOPE are displayed graphically on the main results page with a table containing summary statistics for each motif. Detailed motif information, including the sequence logo, PWM, consensus sequence and specific matching sites can be viewed through a single click on a motif. SCOPE's efficient, parameter-free search strategy has enabled the development of a web server that is readily accessible to the practising biologist while providing results that compare favorably with those of other motif finders. The SCOPE web server is at
Reanalysis of RNA-Sequencing Data Reveals Several Additional Fusion Genes with Multiple Isoforms
Kangaspeska, Sara; Hultsch, Susanne; Edgren, Henrik; Nicorici, Daniel; Murumägi, Astrid; Kallioniemi, Olli
2012-01-01
RNA-sequencing and tailored bioinformatic methodologies have paved the way for identification of expressed fusion genes from the chaotic genomes of solid tumors. We have recently successfully exploited RNA-sequencing for the discovery of 24 novel fusion genes in breast cancer. Here, we demonstrate the importance of continuous optimization of the bioinformatic methodology for this purpose, and report the discovery and experimental validation of 13 additional fusion genes from the same samples. Integration of copy number profiling with the RNA-sequencing results revealed that the majority of the gene fusions were promoter-donating events that occurred at copy number transition points or involved high-level DNA-amplifications. Sequencing of genomic fusion break points confirmed that DNA-level rearrangements underlie selected fusion transcripts. Furthermore, a significant portion (>60%) of the fusion genes were alternatively spliced. This illustrates the importance of reanalyzing sequencing data as gene definitions change and bioinformatic methods improve, and highlights the previously unforeseen isoform diversity among fusion transcripts. PMID:23119097
Reanalysis of RNA-sequencing data reveals several additional fusion genes with multiple isoforms.
Kangaspeska, Sara; Hultsch, Susanne; Edgren, Henrik; Nicorici, Daniel; Murumägi, Astrid; Kallioniemi, Olli
2012-01-01
RNA-sequencing and tailored bioinformatic methodologies have paved the way for identification of expressed fusion genes from the chaotic genomes of solid tumors. We have recently successfully exploited RNA-sequencing for the discovery of 24 novel fusion genes in breast cancer. Here, we demonstrate the importance of continuous optimization of the bioinformatic methodology for this purpose, and report the discovery and experimental validation of 13 additional fusion genes from the same samples. Integration of copy number profiling with the RNA-sequencing results revealed that the majority of the gene fusions were promoter-donating events that occurred at copy number transition points or involved high-level DNA-amplifications. Sequencing of genomic fusion break points confirmed that DNA-level rearrangements underlie selected fusion transcripts. Furthermore, a significant portion (>60%) of the fusion genes were alternatively spliced. This illustrates the importance of reanalyzing sequencing data as gene definitions change and bioinformatic methods improve, and highlights the previously unforeseen isoform diversity among fusion transcripts.
Morph-X-Select: Morphology-based tissue aptamer selection for ovarian cancer biomarker discovery
Wang, Hongyu; Li, Xin; Volk, David E.; Lokesh, Ganesh L.-R.; Elizondo-Riojas, Miguel-Angel; Li, Li; Nick, Alpa M.; Sood, Anil K.; Rosenblatt, Kevin P.; Gorenstein, David G.
2016-01-01
High affinity aptamer-based biomarker discovery has the advantage of simultaneously discovering an aptamer affinity reagent and its target biomarker protein. Here, we demonstrate a morphology-based tissue aptamer selection method that enables us to use tissue sections from individual patients and identify high-affinity aptamers and their associated target proteins in a systematic and accurate way. We created a combinatorial DNA aptamer library that has been modified with thiophosphate substitutions of the phosphate ester backbone at selected 5′dA positions for enhanced nuclease resistance and targeting. Based on morphological assessment, we used image-directed laser microdissection (LMD) to dissect regions of interest bound with the thioaptamer (TA) library and further identified target proteins for the selected TAs. We have successfully identified and characterized the lead candidate TA, V5, as a vimentin-specific sequence that has shown specific binding to tumor vasculature of human ovarian tissue and human microvascular endothelial cells. This new Morph-X-Select method allows us to select high-affinity aptamers and their associated target proteins in a specific and accurate way, and could be used for personalized biomarker discovery to improve medical decision-making and to facilitate the development of targeted therapies to achieve more favorable outcomes. PMID:27839510
A framework genetic map for Miscanthus sinensis from RNAseq-based markers shows recent tetraploidy
2012-01-01
Background Miscanthus (subtribe Saccharinae, tribe Andropogoneae, family Poaceae) is a genus of temperate perennial C4 grasses whose high biomass production makes it, along with its close relatives sugarcane and sorghum, attractive as a biofuel feedstock. The base chromosome number of Miscanthus (x = 19) is different from that of other Saccharinae and approximately twice that of the related Sorghum bicolor (x = 10), suggesting large-scale duplications may have occurred in recent ancestors of Miscanthus. Owing to the complexity of the Miscanthus genome and the complications of self-incompatibility, a complete genetic map with a high density of markers has not yet been developed. Results We used deep transcriptome sequencing (RNAseq) from two M. sinensis accessions to define 1536 single nucleotide variants (SNVs) for a GoldenGate™ genotyping array, and found that simple sequence repeat (SSR) markers defined in sugarcane are often informative in M. sinensis. A total of 658 SNP and 210 SSR markers were validated via segregation in a full sibling F1 mapping population. Using 221 progeny from this mapping population, we constructed a genetic map for M. sinensis that resolves into 19 linkage groups, the haploid chromosome number expected from cytological evidence. Comparative genomic analysis documents a genome-wide duplication in Miscanthus relative to Sorghum bicolor, with subsequent insertional fusion of a pair of chromosomes. The utility of the map is confirmed by the identification of two paralogous C4-pyruvate, phosphate dikinase (C4-PPDK) loci in Miscanthus, at positions syntenic to the single orthologous gene in Sorghum. Conclusions The genus Miscanthus experienced an ancestral tetraploidy and chromosome fusion prior to its diversification, but after its divergence from the closely related sugarcane clade. The recent timing of this tetraploidy complicates discovery and mapping of genetic markers for Miscanthus species, since alleles and fixed differences between paralogs are comparable. These difficulties can be overcome by careful analysis of segregation patterns in a mapping population and genotyping of doubled haploids. The genetic map for Miscanthus will be useful in biological discovery and breeding efforts to improve this emerging biofuel crop, and also provide a valuable resource for understanding genomic responses to tetraploidy and chromosome fusion. PMID:22524439
Links, Matthew G; Chaban, Bonnie; Hemmingsen, Sean M; Muirhead, Kevin; Hill, Janet E
2013-08-15
Formation of operational taxonomic units (OTU) is a common approach to data aggregation in microbial ecology studies based on amplification and sequencing of individual gene targets. The de novo assembly of OTU sequences has been recently demonstrated as an alternative to widely used clustering methods, providing robust information from experimental data alone, without any reliance on an external reference database. Here we introduce mPUMA (microbial Profiling Using Metagenomic Assembly, http://mpuma.sourceforge.net), a software package for identification and analysis of protein-coding barcode sequence data. It was developed originally for Cpn60 universal target sequences (also known as GroEL or Hsp60). Using an unattended process that is independent of external reference sequences, mPUMA forms OTUs by DNA sequence assembly and is capable of tracking OTU abundance. mPUMA processes microbial profiles both in terms of the direct DNA sequence as well as in the translated amino acid sequence for protein coding barcodes. By forming OTUs and calculating abundance through an assembly approach, mPUMA is capable of generating inputs for several popular microbiota analysis tools. Using SFF data from sequencing of a synthetic community of Cpn60 sequences derived from the human vaginal microbiome, we demonstrate that mPUMA can faithfully reconstruct all expected OTU sequences and produce compositional profiles consistent with actual community structure. mPUMA enables analysis of microbial communities while empowering the discovery of novel organisms through OTU assembly.
Engel, Stacia R.; Cherry, J. Michael
2013-01-01
The first completed eukaryotic genome sequence was that of the yeast Saccharomyces cerevisiae, and the Saccharomyces Genome Database (SGD; http://www.yeastgenome.org/) is the original model organism database. SGD remains the authoritative community resource for the S. cerevisiae reference genome sequence and its annotation, and continues to provide comprehensive biological information correlated with S. cerevisiae genes and their products. A diverse set of yeast strains have been sequenced to explore commercial and laboratory applications, and a brief history of those strains is provided. The publication of these new genomes has motivated the creation of new tools, and SGD will annotate and provide comparative analyses of these sequences, correlating changes with variations in strain phenotypes and protein function. We are entering a new era at SGD, as we incorporate these new sequences and make them accessible to the scientific community, all in an effort to continue in our mission of educating researchers and facilitating discovery. Database URL: http://www.yeastgenome.org/ PMID:23487186
RSAT 2015: Regulatory Sequence Analysis Tools.
Medina-Rivera, Alejandra; Defrance, Matthieu; Sand, Olivier; Herrmann, Carl; Castro-Mondragon, Jaime A; Delerce, Jeremy; Jaeger, Sébastien; Blanchet, Christophe; Vincens, Pierre; Caron, Christophe; Staines, Daniel M; Contreras-Moreira, Bruno; Artufel, Marie; Charbonnier-Khamvongsa, Lucie; Hernandez, Céline; Thieffry, Denis; Thomas-Chollier, Morgane; van Helden, Jacques
2015-07-01
RSAT (Regulatory Sequence Analysis Tools) is a modular software suite for the analysis of cis-regulatory elements in genome sequences. Its main applications are (i) motif discovery, appropriate to genome-wide data sets like ChIP-seq, (ii) transcription factor binding motif analysis (quality assessment, comparisons and clustering), (iii) comparative genomics and (iv) analysis of regulatory variations. Nine new programs have been added to the 43 described in the 2011 NAR Web Software Issue, including a tool to extract sequences from a list of coordinates (fetch-sequences from UCSC), novel programs dedicated to the analysis of regulatory variants from GWAS or population genomics (retrieve-variation-seq and variation-scan), a program to cluster motifs and visualize the similarities as trees (matrix-clustering). To deal with the drastic increase of sequenced genomes, RSAT public sites have been reorganized into taxon-specific servers. The suite is well-documented with tutorials and published protocols. The software suite is available through Web sites, SOAP/WSDL Web services, virtual machines and stand-alone programs at http://www.rsat.eu/. © The Author(s) 2015. Published by Oxford University Press on behalf of Nucleic Acids Research.
MicroRNA Biomarkers of Toxicity in Biological Matrices ...
Biomarker measurements that reliably correlate with tissue injury and can be measured from sampling accessible biofluids offer enormous benefits in terms of cost, time, and convenience when assessing environmental and drug-induced toxicity in model systems or human cohorts. MicroRNAs (miRNAs) have emerged in recent years as a promising new type of biomarker for monitoring toxicity. Recent enthusiasm for miRNA biomarker research has been fueled by discoveries that certain miRNA species are cell-type specific and released during injury, thus raising the possibility of using biofluid-based miRNAs as a “liquid biopsy” that may be obtained by sampling extracellular fluids. As biomarkers, miRNAs demonstrate improved stability as compared to many protein markers and sequences are largely conserved across species, simplifying analytical techniques. Recent efforts have sought to identify miRNAs that are released into accessible biofluids following xenobiotic exposure, using compounds that target specific organs. While still early in the discovery phase, miRNA biomarkers will have an increasingly important role in the assessment of adverse effects of both environmental chemicals and pharmaceutical drugs. Here, we review the current findings of biofluid-based miRNAs, as well as highlight technical challenges in assessing toxicologic pathology using these biomarkers. MicroRNAs (miRNAs) are small, non-coding RNA species that selectively bind mRNA molecules and alter thei
Defining the wheat gluten peptide fingerprint via a discovery and targeted proteomics approach.
Martínez-Esteso, María José; Nørgaard, Jørgen; Brohée, Marcel; Haraszi, Reka; Maquet, Alain; O'Connor, Gavin
2016-09-16
Accurate, reliable and sensitive detection methods for gluten are required to support current EU regulations. The enforcement of legislative levels requires that measurement results are comparable over time and between methods. This is not a trivial task for gluten which comprises a large number of protein targets. This paper describes a strategy for defining a set of specific analytical targets for wheat gluten. A comprehensive proteomic approach was applied by fractionating wheat gluten using RP-HPLC (reversed phase high performance liquid chromatography) followed by a multi-enzymatic digestion (LysC, trypsin and chymotrypsin) with subsequent mass spectrometric analysis. This approach identified 434 peptide sequences from gluten. Peptides were grouped based on two criteria: unique to a single gluten protein sequence; contained known immunogenic and toxic sequences in the context of coeliac disease. An LC-MS/MS method based on selected reaction monitoring (SRM) was developed on a triple quadrupole mass spectrometer for the specific detection of the target peptides. The SRM based screening approach was applied to gluten containing cereals (wheat, rye, barley and oats) and non-gluten containing flours (corn, soy and rice). A unique set of wheat gluten marker peptides were identified and are proposed as wheat specific markers. The measurement of gluten in processed food products in support of regulatory limits is performed routinely. Mass spectrometry is emerging as a viable alternative to ELISA based methods. Here we outline a set of peptide markers that are representative of gluten and consider the end user's needs in protecting those with coeliac disease. The approach taken has been applied to wheat but can be easily extended to include other species potentially enabling the MS quantification of different gluten containing species from the identified markers. Copyright © 2016 The Authors. Published by Elsevier B.V. All rights reserved.
Haplotype-Based Genotyping in Polyploids.
Clevenger, Josh P; Korani, Walid; Ozias-Akins, Peggy; Jackson, Scott
2018-01-01
Accurate identification of polymorphisms from sequence data is crucial to unlocking the potential of high throughput sequencing for genomics. Single nucleotide polymorphisms (SNPs) are difficult to accurately identify in polyploid crops due to the duplicative nature of polyploid genomes leading to low confidence in the true alignment of short reads. Implementing a haplotype-based method in contrasting subgenome-specific sequences leads to higher accuracy of SNP identification in polyploids. To test this method, a large-scale 48K SNP array (Axiom Arachis2) was developed for Arachis hypogaea (peanut), an allotetraploid, in which 1,674 haplotype-based SNPs were included. Results of the array show that 74% of the haplotype-based SNP markers could be validated, which is considerably higher than previous methods used for peanut. The haplotype method has been implemented in a standalone program, HAPLOSWEEP, which takes as input bam files and a vcf file and identifies haplotype-based markers. Haplotype discovery can be made within single reads or span paired reads, and can leverage long read technology by targeting any length of haplotype. Haplotype-based genotyping is applicable in all allopolyploid genomes and provides confidence in marker identification and in silico-based genotyping for polyploid genomics.
Comparative genome analysis of Basidiomycete fungi
DOE Office of Scientific and Technical Information (OSTI.GOV)
Riley, Robert; Salamov, Asaf; Henrissat, Bernard
Fungi of the phylum Basidiomycota (basidiomycetes), make up some 37percent of the described fungi, and are important in forestry, agriculture, medicine, and bioenergy. This diverse phylum includes symbionts, pathogens, and saprotrophs including the majority of wood decaying and ectomycorrhizal species. To better understand the genetic diversity of this phylum we compared the genomes of 35 basidiomycetes including 6 newly sequenced genomes. These genomes span extremes of genome size, gene number, and repeat content. Analysis of core genes reveals that some 48percent of basidiomycete proteins are unique to the phylum with nearly half of those (22percent) found in only one organism.more » Correlations between lifestyle and certain gene families are evident. Phylogenetic patterns of plant biomass-degrading genes in Agaricomycotina suggest a continuum rather than a dichotomy between the white rot and brown rot modes of wood decay. Based on phylogenetically-informed PCA analysis of wood decay genes, we predict that that Botryobasidium botryosum and Jaapia argillacea have properties similar to white rot species, although neither has typical ligninolytic class II fungal peroxidases (PODs). This prediction is supported by growth assays in which both fungi exhibit wood decay with white rot-like characteristics. Based on this, we suggest that the white/brown rot dichotomy may be inadequate to describe the full range of wood decaying fungi. Analysis of the rate of discovery of proteins with no or few homologs suggests the value of continued sequencing of basidiomycete fungi.« less
A novel algorithm for validating peptide identification from a shotgun proteomics search engine.
Jian, Ling; Niu, Xinnan; Xia, Zhonghang; Samir, Parimal; Sumanasekera, Chiranthani; Mu, Zheng; Jennings, Jennifer L; Hoek, Kristen L; Allos, Tara; Howard, Leigh M; Edwards, Kathryn M; Weil, P Anthony; Link, Andrew J
2013-03-01
Liquid chromatography coupled with tandem mass spectrometry (LC-MS/MS) has revolutionized the proteomics analysis of complexes, cells, and tissues. In a typical proteomic analysis, the tandem mass spectra from a LC-MS/MS experiment are assigned to a peptide by a search engine that compares the experimental MS/MS peptide data to theoretical peptide sequences in a protein database. The peptide spectra matches are then used to infer a list of identified proteins in the original sample. However, the search engines often fail to distinguish between correct and incorrect peptides assignments. In this study, we designed and implemented a novel algorithm called De-Noise to reduce the number of incorrect peptide matches and maximize the number of correct peptides at a fixed false discovery rate using a minimal number of scoring outputs from the SEQUEST search engine. The novel algorithm uses a three-step process: data cleaning, data refining through a SVM-based decision function, and a final data refining step based on proteolytic peptide patterns. Using proteomics data generated on different types of mass spectrometers, we optimized the De-Noise algorithm on the basis of the resolution and mass accuracy of the mass spectrometer employed in the LC-MS/MS experiment. Our results demonstrate De-Noise improves peptide identification compared to other methods used to process the peptide sequence matches assigned by SEQUEST. Because De-Noise uses a limited number of scoring attributes, it can be easily implemented with other search engines.
Rector, Annabel; Tachezy, Ruth; Van Ranst, Marc
2004-01-01
The discovery of novel viruses has often been accomplished by using hybridization-based methods that necessitate the availability of a previously characterized virus genome probe or knowledge of the viral nucleotide sequence to construct consensus or degenerate PCR primers. In their natural replication cycle, certain viruses employ a rolling-circle mechanism to propagate their circular genomes, and multiply primed rolling-circle amplification (RCA) with φ29 DNA polymerase has recently been applied in the amplification of circular plasmid vectors used in cloning. We employed an isothermal RCA protocol that uses random hexamer primers to amplify the complete genomes of papillomaviruses without the need for prior knowledge of their DNA sequences. We optimized this RCA technique with extracted human papillomavirus type 16 (HPV-16) DNA from W12 cells, using a real-time quantitative PCR assay to determine amplification efficiency, and obtained a 2.4 × 104-fold increase in HPV-16 DNA concentration. We were able to clone the complete HPV-16 genome from this multiply primed RCA product. The optimized protocol was subsequently applied to a bovine fibropapillomatous wart tissue sample. Whereas no papillomavirus DNA could be detected by restriction enzyme digestion of the original sample, multiply primed RCA enabled us to obtain a sufficient amount of papillomavirus DNA for restriction enzyme analysis, cloning, and subsequent sequencing of a novel variant of bovine papillomavirus type 1. The multiply primed RCA method allows the discovery of previously unknown papillomaviruses, and possibly also other circular DNA viruses, without a priori sequence information. PMID:15113879
NASA Astrophysics Data System (ADS)
Zhang, Xiao-Yong; Wang, Guang-Hua; Xu, Xin-Ya; Nong, Xu-Hua; Wang, Jie; Amin, Muhammad; Qi, Shu-Hua
2016-10-01
The present study investigated the fungal diversity in four different deep-sea sediments from Okinawa Trough using high-throughput Illumina sequencing of the nuclear ribosomal internal transcribed spacer-1 (ITS1). A total of 40,297 fungal ITS1 sequences clustered into 420 operational taxonomic units (OTUs) with 97% sequence similarity and 170 taxa were recovered from these sediments. Most ITS1 sequences (78%) belonged to the phylum Ascomycota, followed by Basidiomycota (17.3%), Zygomycota (1.5%) and Chytridiomycota (0.8%), and a small proportion (2.4%) belonged to unassigned fungal phyla. Compared with previous studies on fungal diversity of sediments from deep-sea environments by culture-dependent approach and clone library analysis, the present result suggested that Illumina sequencing had been dramatically accelerating the discovery of fungal community of deep-sea sediments. Furthermore, our results revealed that Sordariomycetes was the most diverse and abundant fungal class in this study, challenging the traditional view that the diversity of Sordariomycetes phylotypes was low in the deep-sea environments. In addition, more than 12 taxa accounted for 21.5% sequences were found to be rarely reported as deep-sea fungi, suggesting the deep-sea sediments from Okinawa Trough harbored a plethora of different fungal communities compared with other deep-sea environments. To our knowledge, this study is the first exploration of the fungal diversity in deep-sea sediments from Okinawa Trough using high-throughput Illumina sequencing.
USDA-ARS?s Scientific Manuscript database
Single-nucleotide polymorphisms (SNPs) are highly abundant markers, which are broadly distributed in animal genomes. For rainbow trout, SNP discovery has been done through sequencing of restriction-site associated DNA (RAD) libraries, reduced representation libraries (RRL), RNA sequencing, and whole...
Next-generation sequencing for targeted discovery of rare mutations in rice
USDA-ARS?s Scientific Manuscript database
Advances in DNA sequencing (i.e., next-generation sequencing, NGS) have greatly increased the power and efficiency of detecting rare mutations in large mutant populations. Targeting Induced Local Lesions in Genomes (TILLING) is a reverse genetics approach for identifying gene mutations resulting fro...
Dalpé, Gratien; Joly, Yann
2014-09-01
Healthcare-related bioinformatics databases are increasingly offering the possibility to maintain, organize, and distribute DNA sequencing data. Different national and international institutions are currently hosting such databases that offer researchers website platforms where they can obtain sequencing data on which they can perform different types of analysis. Until recently, this process remained mostly one-dimensional, with most analysis concentrated on a limited amount of data. However, newer genome sequencing technology is producing a huge amount of data that current computer facilities are unable to handle. An alternative approach has been to start adopting cloud computing services for combining the information embedded in genomic and model system biology data, patient healthcare records, and clinical trials' data. In this new technological paradigm, researchers use virtual space and computing power from existing commercial or not-for-profit cloud service providers to access, store, and analyze data via different application programming interfaces. Cloud services are an alternative to the need of larger data storage; however, they raise different ethical, legal, and social issues. The purpose of this Commentary is to summarize how cloud computing can contribute to bioinformatics-based drug discovery and to highlight some of the outstanding legal, ethical, and social issues that are inherent in the use of cloud services. © 2014 Wiley Periodicals, Inc.
Investigating ChaMPlane X-Ray Sources in the Galactic Bulge with Magellan LDSS2 Spectra
NASA Astrophysics Data System (ADS)
Koenig, Xavier; Grindlay, Jonathan E.; van den Berg, Maureen; Laycock, Silas; Zhao, Ping; Hong, JaeSub; Schlegel, Eric M.
2008-09-01
We have carried out optical and X-ray spectral analyses on a sample of 136 candidate optical counterparts of X-ray sources found in five Galactic bulge fields included in our Chandra Multiwavelength Plane Survey. We use a combination of optical spectral fitting and quantile X-ray analysis to obtain the hydrogen column density toward each object, and a three-dimensional dust model of the Galaxy to estimate the most probable distance in each case. We present the discovery of a population of stellar coronal emission sources, likely consisting of pre-main-sequence, young main-sequence, and main-sequence stars, as well as a component of active binaries of RS CVn or BY Dra type. We identify one candidate quiescent low-mass X-ray binary with a subgiant companion; we note that this object may also be an RS CVn system. We report the discovery of three new X-ray-detected cataclysmic variables (CVs) in the direction of the Galactic center (at distances lesssim2 kpc). This number is in excess of predictions made with a simple CV model based on a local CV space density of lesssim10-5 pc-3, and a scale height ~200 pc. We discuss several possible reasons for this observed excess.
Exome Sequencing in Suspected Monogenic Dyslipidemias
Stitziel, Nathan O.; Peloso, Gina M.; Abifadel, Marianne; Cefalu, Angelo B.; Fouchier, Sigrid; Motazacker, M. Mahdi; Tada, Hayato; Larach, Daniel B.; Awan, Zuhier; Haller, Jorge F.; Pullinger, Clive R.; Varret, Mathilde; Rabès, Jean-Pierre; Noto, Davide; Tarugi, Patrizia; Kawashiri, Masa-aki; Nohara, Atsushi; Yamagishi, Masakazu; Risman, Marjorie; Deo, Rahul; Ruel, Isabelle; Shendure, Jay; Nickerson, Deborah A.; Wilson, James G.; Rich, Stephen S.; Gupta, Namrata; Farlow, Deborah N.; Neale, Benjamin M.; Daly, Mark J.; Kane, John P.; Freeman, Mason W.; Genest, Jacques; Rader, Daniel J.; Mabuchi, Hiroshi; Kastelein, John J.P.; Hovingh, G. Kees; Averna, Maurizio R.; Gabriel, Stacey; Boileau, Catherine; Kathiresan, Sekar
2015-01-01
Background Exome sequencing is a promising tool for gene mapping in Mendelian disorders. We utilized this technique in an attempt to identify novel genes underlying monogenic dyslipidemias. Methods and Results We performed exome sequencing on 213 selected family members from 41 kindreds with suspected Mendelian inheritance of extreme levels of low-density lipoprotein (LDL) cholesterol (after candidate gene sequencing excluded known genetic causes for high LDL cholesterol families) or high-density lipoprotein (HDL) cholesterol. We used standard analytic approaches to identify candidate variants and also assigned a polygenic score to each individual in order to account for their burden of common genetic variants known to influence lipid levels. In nine families, we identified likely pathogenic variants in known lipid genes (ABCA1, APOB, APOE, LDLR, LIPA, and PCSK9); however, we were unable to identify obvious genetic etiologies in the remaining 32 families despite follow-up analyses. We identified three factors that limited novel gene discovery: (1) imperfect sequencing coverage across the exome hid potentially causal variants; (2) large numbers of shared rare alleles within families obfuscated causal variant identification; and (3) individuals from 15% of families carried a significant burden of common lipid-related alleles, suggesting complex inheritance can masquerade as monogenic disease. Conclusions We identified the genetic basis of disease in nine of 41 families; however, none of these represented novel gene discoveries. Our results highlight the promise and limitations of exome sequencing as a discovery technique in suspected monogenic dyslipidemias. Considering the confounders identified may inform the design of future exome sequencing studies. PMID:25632026
Phenotypic mutant library: potential for gene discovery
USDA-ARS?s Scientific Manuscript database
The rapid development of high throughput and affordable Next- Generation Sequencing (NGS) techniques has renewed interest in gene discovery using forward genetics. The conventional forward genetic approach starts with isolation of mutants with a phenotype of interest, mapping the mutation within a s...
Discovery of rare mutations in populations: TILLING by sequencing
USDA-ARS?s Scientific Manuscript database
Discovery of rare mutations in populations requires methods for processing and analyzing in parallel many individuals. Previous TILLING methods employed enzymatic or physical discrimination of heteroduplexed from homoduplexed target DNA. We used mutant populations of rice and wheat to develop a meth...
Improving mapping and SNP-calling performance in multiplexed targeted next-generation sequencing
2012-01-01
Background Compared to classical genotyping, targeted next-generation sequencing (tNGS) can be custom-designed to interrogate entire genomic regions of interest, in order to detect novel as well as known variants. To bring down the per-sample cost, one approach is to pool barcoded NGS libraries before sample enrichment. Still, we lack a complete understanding of how this multiplexed tNGS approach and the varying performance of the ever-evolving analytical tools can affect the quality of variant discovery. Therefore, we evaluated the impact of different software tools and analytical approaches on the discovery of single nucleotide polymorphisms (SNPs) in multiplexed tNGS data. To generate our own test model, we combined a sequence capture method with NGS in three experimental stages of increasing complexity (E. coli genes, multiplexed E. coli, and multiplexed HapMap BRCA1/2 regions). Results We successfully enriched barcoded NGS libraries instead of genomic DNA, achieving reproducible coverage profiles (Pearson correlation coefficients of up to 0.99) across multiplexed samples, with <10% strand bias. However, the SNP calling quality was substantially affected by the choice of tools and mapping strategy. With the aim of reducing computational requirements, we compared conventional whole-genome mapping and SNP-calling with a new faster approach: target-region mapping with subsequent ‘read-backmapping’ to the whole genome to reduce the false detection rate. Consequently, we developed a combined mapping pipeline, which includes standard tools (BWA, SAMtools, etc.), and tested it on public HiSeq2000 exome data from the 1000 Genomes Project. Our pipeline saved 12 hours of run time per Hiseq2000 exome sample and detected ~5% more SNPs than the conventional whole genome approach. This suggests that more potential novel SNPs may be discovered using both approaches than with just the conventional approach. Conclusions We recommend applying our general ‘two-step’ mapping approach for more efficient SNP discovery in tNGS. Our study has also shown the benefit of computing inter-sample SNP-concordances and inspecting read alignments in order to attain more confident results. PMID:22913592
BONSAI Garden: Parallel knowledge discovery system for amino acid sequences
DOE Office of Scientific and Technical Information (OSTI.GOV)
Shoudai, T.; Miyano, S.; Shinohara, A.
1995-12-31
We have developed a machine discovery system BON-SAI which receives positive and negative examples as inputs and produces as a hypothesis a pair of a decision tree over regular patterns and an alphabet indexing. This system has succeeded in discovering reasonable knowledge on transmembrane domain sequences and signal peptide sequences by computer experiments. However, when several kinds of sequences axe mixed in the data, it does not seem reasonable for a single BONSAI system to find a hypothesis of a reasonably small size with high accuracy. For this purpose, we have designed a system BONSAI Garden, in which several BONSAI`smore » and a program called Gardener run over a network in parallel, to partition the data into some number of classes together with hypotheses explaining these classes accurately.« less
Goettel, Wolfgang; Xia, Eric; Upchurch, Robert; Wang, Ming-Li; Chen, Pengyin; An, Yong-Qiang Charles
2014-04-23
Variation in seed oil composition and content among soybean varieties is largely attributed to differences in transcript sequences and/or transcript accumulation of oil production related genes in seeds. Discovery and analysis of sequence and expression variations in these genes will accelerate soybean oil quality improvement. In an effort to identify these variations, we sequenced the transcriptomes of soybean seeds from nine lines varying in oil composition and/or total oil content. Our results showed that 69,338 distinct transcripts from 32,885 annotated genes were expressed in seeds. A total of 8,037 transcript expression polymorphisms and 50,485 transcript sequence polymorphisms (48,792 SNPs and 1,693 small Indels) were identified among the lines. Effects of the transcript polymorphisms on their encoded protein sequences and functions were predicted. The studies also provided independent evidence that the lack of FAD2-1A gene activity and a non-synonymous SNP in the coding sequence of FAB2C caused elevated oleic acid and stearic acid levels in soybean lines M23 and FAM94-41, respectively. As a proof-of-concept, we developed an integrated RNA-seq and bioinformatics approach to identify and functionally annotate transcript polymorphisms, and demonstrated its high effectiveness for discovery of genetic and transcript variations that result in altered oil quality traits. The collection of transcript polymorphisms coupled with their predicted functional effects will be a valuable asset for further discovery of genes, gene variants, and functional markers to improve soybean oil quality.
Peroxidase gene discovery from the horseradish transcriptome.
Näätsaari, Laura; Krainer, Florian W; Schubert, Michael; Glieder, Anton; Thallinger, Gerhard G
2014-03-24
Horseradish peroxidases (HRPs) from Armoracia rusticana have long been utilized as reporters in various diagnostic assays and histochemical stainings. Regardless of their increasing importance in the field of life sciences and suggested uses in medical applications, chemical synthesis and other industrial applications, the HRP isoenzymes, their substrate specificities and enzymatic properties are poorly characterized. Due to lacking sequence information of natural isoenzymes and the low levels of HRP expression in heterologous hosts, commercially available HRP is still extracted as a mixture of isoenzymes from the roots of A. rusticana. In this study, a normalized, size-selected A. rusticana transcriptome library was sequenced using 454 Titanium technology. The resulting reads were assembled into 14871 isotigs with an average length of 1133 bp. Sequence databases, ORF finding and ORF characterization were utilized to identify peroxidase genes from the 14871 isotigs generated by de novo assembly. The sequences were manually reviewed and verified with Sanger sequencing of PCR amplified genomic fragments, resulting in the discovery of 28 secretory peroxidases, 23 of them previously unknown. A total of 22 isoenzymes including allelic variants were successfully expressed in Pichia pastoris and showed peroxidase activity with at least one of the substrates tested, thus enabling their development into commercial pure isoenzymes. This study demonstrates that transcriptome sequencing combined with sequence motif search is a powerful concept for the discovery and quick supply of new enzymes and isoenzymes from any plant or other eukaryotic organisms. Identification and manual verification of the sequences of 28 HRP isoenzymes do not only contribute a set of peroxidases for industrial, biological and biomedical applications, but also provide valuable information on the reliability of the approach in identifying and characterizing a large group of isoenzymes.
Peroxidase gene discovery from the horseradish transcriptome
2014-01-01
Background Horseradish peroxidases (HRPs) from Armoracia rusticana have long been utilized as reporters in various diagnostic assays and histochemical stainings. Regardless of their increasing importance in the field of life sciences and suggested uses in medical applications, chemical synthesis and other industrial applications, the HRP isoenzymes, their substrate specificities and enzymatic properties are poorly characterized. Due to lacking sequence information of natural isoenzymes and the low levels of HRP expression in heterologous hosts, commercially available HRP is still extracted as a mixture of isoenzymes from the roots of A. rusticana. Results In this study, a normalized, size-selected A. rusticana transcriptome library was sequenced using 454 Titanium technology. The resulting reads were assembled into 14871 isotigs with an average length of 1133 bp. Sequence databases, ORF finding and ORF characterization were utilized to identify peroxidase genes from the 14871 isotigs generated by de novo assembly. The sequences were manually reviewed and verified with Sanger sequencing of PCR amplified genomic fragments, resulting in the discovery of 28 secretory peroxidases, 23 of them previously unknown. A total of 22 isoenzymes including allelic variants were successfully expressed in Pichia pastoris and showed peroxidase activity with at least one of the substrates tested, thus enabling their development into commercial pure isoenzymes. Conclusions This study demonstrates that transcriptome sequencing combined with sequence motif search is a powerful concept for the discovery and quick supply of new enzymes and isoenzymes from any plant or other eukaryotic organisms. Identification and manual verification of the sequences of 28 HRP isoenzymes do not only contribute a set of peroxidases for industrial, biological and biomedical applications, but also provide valuable information on the reliability of the approach in identifying and characterizing a large group of isoenzymes. PMID:24666710
Discovery of Escherichia coli CRISPR sequences in an undergraduate laboratory.
Militello, Kevin T; Lazatin, Justine C
2017-05-01
Clustered regularly interspaced short palindromic repeats (CRISPRs) represent a novel type of adaptive immune system found in eubacteria and archaebacteria. CRISPRs have recently generated a lot of attention due to their unique ability to catalog foreign nucleic acids, their ability to destroy foreign nucleic acids in a mechanism that shares some similarity to RNA interference, and the ability to utilize reconstituted CRISPR systems for genome editing in numerous organisms. In order to introduce CRISPR biology into an undergraduate upper-level laboratory, a five-week set of exercises was designed to allow students to examine the CRISPR status of uncharacterized Escherichia coli strains and to allow the discovery of new repeats and spacers. Students started the project by isolating genomic DNA from E. coli and amplifying the iap CRISPR locus using the polymerase chain reaction (PCR). The PCR products were analyzed by Sanger DNA sequencing, and the sequences were examined for the presence of CRISPR repeat sequences. The regions between the repeats, the spacers, were extracted and analyzed with BLASTN searches. Overall, CRISPR loci were sequenced from several previously uncharacterized E. coli strains and one E. coli K-12 strain. Sanger DNA sequencing resulted in the discovery of 36 spacer sequences and their corresponding surrounding repeat sequences. Five of the spacers were homologous to foreign (non-E. coli) DNA. Assessment of the laboratory indicates that improvements were made in the ability of students to answer questions relating to the structure and function of CRISPRs. Future directions of the laboratory are presented and discussed. © 2016 by The International Union of Biochemistry and Molecular Biology, 45(3):262-269, 2017. © 2016 The International Union of Biochemistry and Molecular Biology.
Practice-Based Knowledge Discovery for Comparative Effectiveness Research: An Organizing Framework
Lucero, Robert J.; Bakken, Suzanne
2014-01-01
Electronic health information systems can increase the ability of health-care organizations to investigate the effects of clinical interventions. The authors present an organizing framework that integrates outcomes and informatics research paradigms to guide knowledge discovery in electronic clinical databases. They illustrate its application using the example of hospital acquired pressure ulcers (HAPU). The Knowledge Discovery through Informatics for Comparative Effectiveness Research (KDI-CER) framework was conceived as a heuristic to conceptualize study designs and address potential methodological limitations imposed by using a single research perspective. Advances in informatics research can play a complementary role in advancing the field of outcomes research including CER. The KDI-CER framework can be used to facilitate knowledge discovery from routinely collected electronic clinical data. PMID:25278645
Gardner, Shea N.; Hall, Barry G.
2013-01-01
Effective use of rapid and inexpensive whole genome sequencing for microbes requires fast, memory efficient bioinformatics tools for sequence comparison. The kSNP v2 software finds single nucleotide polymorphisms (SNPs) in whole genome data. kSNP v2 has numerous improvements over kSNP v1 including SNP gene annotation; better scaling for draft genomes available as assembled contigs or raw, unassembled reads; a tool to identify the optimal value of k; distribution of packages of executables for Linux and Mac OS X for ease of installation and user-friendly use; and a detailed User Guide. SNP discovery is based on k-mer analysis, and requires no multiple sequence alignment or the selection of a single reference genome. Most target sets with hundreds of genomes complete in minutes to hours. SNP phylogenies are built by maximum likelihood, parsimony, and distance, based on all SNPs, only core SNPs, or SNPs present in some intermediate user-specified fraction of targets. The SNP-based trees that result are consistent with known taxonomy. kSNP v2 can handle many gigabases of sequence in a single run, and if one or more annotated genomes are included in the target set, SNPs are annotated with protein coding and other information (UTRs, etc.) from Genbank file(s). We demonstrate application of kSNP v2 on sets of viral and bacterial genomes, and discuss in detail analysis of a set of 68 finished E. coli and Shigella genomes and a set of the same genomes to which have been added 47 assemblies and four “raw read” genomes of H104:H4 strains from the recent European E. coli outbreak that resulted in both bloody diarrhea and hemolytic uremic syndrome (HUS), and caused at least 50 deaths. PMID:24349125
Gardner, Shea N; Hall, Barry G
2013-01-01
Effective use of rapid and inexpensive whole genome sequencing for microbes requires fast, memory efficient bioinformatics tools for sequence comparison. The kSNP v2 software finds single nucleotide polymorphisms (SNPs) in whole genome data. kSNP v2 has numerous improvements over kSNP v1 including SNP gene annotation; better scaling for draft genomes available as assembled contigs or raw, unassembled reads; a tool to identify the optimal value of k; distribution of packages of executables for Linux and Mac OS X for ease of installation and user-friendly use; and a detailed User Guide. SNP discovery is based on k-mer analysis, and requires no multiple sequence alignment or the selection of a single reference genome. Most target sets with hundreds of genomes complete in minutes to hours. SNP phylogenies are built by maximum likelihood, parsimony, and distance, based on all SNPs, only core SNPs, or SNPs present in some intermediate user-specified fraction of targets. The SNP-based trees that result are consistent with known taxonomy. kSNP v2 can handle many gigabases of sequence in a single run, and if one or more annotated genomes are included in the target set, SNPs are annotated with protein coding and other information (UTRs, etc.) from Genbank file(s). We demonstrate application of kSNP v2 on sets of viral and bacterial genomes, and discuss in detail analysis of a set of 68 finished E. coli and Shigella genomes and a set of the same genomes to which have been added 47 assemblies and four "raw read" genomes of H104:H4 strains from the recent European E. coli outbreak that resulted in both bloody diarrhea and hemolytic uremic syndrome (HUS), and caused at least 50 deaths.
NASA Astrophysics Data System (ADS)
Tumewu, Widya Anjelia; Wulan, Ana Ratna; Sanjaya, Yayan
2017-05-01
The purpose of this study was to know comparing the effectiveness of learning using Project-based learning (PjBL) and Discovery Learning (DL) toward students metacognitive strategies on global warming concept. A quasi-experimental research design with a The Matching-Only Pretest-Posttest Control Group Design was used in this study. The subjects were students of two classes 7th grade of one of junior high school in Bandung City, West Java of 2015/2016 academic year. The study was conducted on two experimental class, that were project-based learning treatment on the experimental class I and discovery learning treatment was done on the experimental class II. The data was collected through questionnaire to know students metacognitive strategies. The statistical analysis showed that there were statistically significant differences in students metacognitive strategies between project-based learning and discovery learning.
Gupta, Radhey S
2016-07-01
Analyses of genome sequences, by some approaches, suggest that the widespread occurrence of horizontal gene transfers (HGTs) in prokaryotes disguises their evolutionary relationships and have led to questioning of the Darwinian model of evolution for prokaryotes. These inferences are critically examined in the light of comparative genome analysis, characteristic synapomorphies, phylogenetic trees and Darwin's views on examining evolutionary relationships. Genome sequences are enabling discovery of numerous molecular markers (synapomorphies) such as conserved signature indels (CSIs) and conserved signature proteins (CSPs), which are distinctive characteristics of different prokaryotic taxa. Based on these molecular markers, exhibiting high degree of specificity and predictive ability, numerous prokaryotic taxa of different ranks, currently identified based on the 16S rRNA gene trees, can now be reliably demarcated in molecular terms. Within all studied groups, multiple CSIs and CSPs have been identified for successive nested clades providing reliable information regarding their hierarchical relationships and these inferences are not affected by HGTs. These results strongly support Darwin's views on evolution and classification and supplement the current phylogenetic framework based on 16S rRNA in important respects. The identified molecular markers provide important means for developing novel diagnostics, therapeutics and for functional studies providing important insights regarding prokaryotic taxa. © FEMS 2016. All rights reserved. For permissions, please e-mail: journals.permissions@oup.com.
Inferring Higher Functional Information for RIKEN Mouse Full-Length cDNA Clones With FACTS
Nagashima, Takeshi; Silva, Diego G.; Petrovsky, Nikolai; Socha, Luis A.; Suzuki, Harukazu; Saito, Rintaro; Kasukawa, Takeya; Kurochkin, Igor V.; Konagaya, Akihiko; Schönbach, Christian
2003-01-01
FACTS (Functional Association/Annotation of cDNA Clones from Text/Sequence Sources) is a semiautomated knowledge discovery and annotation system that integrates molecular function information derived from sequence analysis results (sequence inferred) with functional information extracted from text. Text-inferred information was extracted from keyword-based retrievals of MEDLINE abstracts and by matching of gene or protein names to OMIM, BIND, and DIP database entries. Using FACTS, we found that 47.5% of the 60,770 RIKEN mouse cDNA FANTOM2 clone annotations were informative for text searches. MEDLINE queries yielded molecular interaction-containing sentences for 23.1% of the clones. When disease MeSH and GO terms were matched with retrieved abstracts, 22.7% of clones were associated with potential diseases, and 32.5% with GO identifiers. A significant number (23.5%) of disease MeSH-associated clones were also found to have a hereditary disease association (OMIM Morbidmap). Inferred neoplastic and nervous system disease represented 49.6% and 36.0% of disease MeSH-associated clones, respectively. A comparison of sequence-based GO assignments with informative text-based GO assignments revealed that for 78.2% of clones, identical GO assignments were provided for that clone by either method, whereas for 21.8% of clones, the assignments differed. In contrast, for OMIM assignments, only 28.5% of clones had identical sequence-based and text-based OMIM assignments. Sequence, sentence, and term-based functional associations are included in the FACTS database (http://facts.gsc.riken.go.jp/), which permits results to be annotated and explored through web-accessible keyword and sequence search interfaces. The FACTS database will be a critical tool for investigating the functional complexity of the mouse transcriptome, cDNA-inferred interactome (molecular interactions), and pathome (pathologies). PMID:12819151
Parallel human genome analysis: microarray-based expression monitoring of 1000 genes.
Schena, M; Shalon, D; Heller, R; Chai, A; Brown, P O; Davis, R W
1996-01-01
Microarrays containing 1046 human cDNAs of unknown sequence were printed on glass with high-speed robotics. These 1.0-cm2 DNA "chips" were used to quantitatively monitor differential expression of the cognate human genes using a highly sensitive two-color hybridization assay. Array elements that displayed differential expression patterns under given experimental conditions were characterized by sequencing. The identification of known and novel heat shock and phorbol ester-regulated genes in human T cells demonstrates the sensitivity of the assay. Parallel gene analysis with microarrays provides a rapid and efficient method for large-scale human gene discovery. Images Fig. 1 Fig. 2 Fig. 3 PMID:8855227
Sequenced sorghum mutant library- an efficient platform for discovery of causal gene mutations
USDA-ARS?s Scientific Manuscript database
Ethyl methanesulfonate (EMS) efficiently generates high-density mutations in genomes. We applied whole-genome sequencing to 256 phenotyped mutant lines of sorghum (Sorghum bicolor L. Moench) to 16x coverage. Comparisons with the reference sequence revealed >1.8 million canonical EMS-induced G/C to A...
USDA-ARS?s Scientific Manuscript database
The Indianmeal moth, Plodia interpunctella (Lepidoptera: Pyralidae), is a common pest of stored goods with a worldwide distribution. The complete genome sequence for a larval pathogen of this moth, the baculovirus Plodia interpunctella granulovirus (PiGV), was determined by next-generation sequenci...
454-pyrosequencing: A tool for discovery and biomarker development
USDA-ARS?s Scientific Manuscript database
The Roche GS-FLX (454) sequencer has made possible what was thought impossible just a few years ago: sequence >1 million high-quality nucleotide reads (mean 400 bp) in less than 12 h. This technology provides valuable species-specific sequence information, and is a valuable tool to discover and und...
Insights into transcriptomes of Big and Low sagebrush
Mark D. Huynh; Justin T. Page; Bryce A. Richardson; Joshua A. Udall
2015-01-01
We report the sequencing and assembly of three transcriptomes from Big (Artemisia tridentatassp. wyomingensis and A. tridentatassp. tridentata) and Low (A. arbuscula ssp. arbuscula) sagebrush. The sequence reads are available in the Sequence Read Archive of NCBI. We demonstrate the utilities of these transcriptomes for gene discovery and phylogenomic analysis. An...
Chen, Jianchi; Civerolo, Edwin L; Jarret, Robert L; Van Sluys, Marie-Anne; de Oliveira, Mariana C
2005-02-01
Xylella fastidiosa causes many important plant diseases including Pierce's disease (PD) in grape and almond leaf scorch disease (ALSD). DNA-based methodologies, such as randomly amplified polymorphic DNA (RAPD) analysis, have been playing key roles in genetic information collection of the bacterium. This study further analyzed the nucleotide sequences of selected RAPDs from X. fastidiosa strains in conjunction with the available genome sequence databases and unveiled several previously unknown novel genetic traits. These include a sequence highly similar to those in the phage family of Podoviridae. Genome comparisons among X. fastidiosa strains suggested that the "phage" is currently active. Two other RAPDs were also related to horizontal gene transfer: one was part of a broadly distributed cryptic plasmid and the other was associated with conjugal transfer. One RAPD inferred a genomic rearrangement event among X. fastidiosa PD strains and another identified a single nucleotide polymorphism of evolutionary value.
RNA catalysis and the origins of life
NASA Technical Reports Server (NTRS)
Orgel, Leslie E.
1986-01-01
The role of RNA catalysis in the origins of life is considered in connection with the discovery of riboszymes, which are RNA molecules that catalyze sequence-specific hydrolysis and transesterification reactions of RNA substrates. Due to this discovery, theories positing protein-free replication as preceding the appearance of the genetic code are more plausible. The scope of RNA catalysis in biology and chemistry is discussed, and it is noted that the development of methods to select (or predict) RNA sequences with preassigned catalytic functions would be a major contribution to the study of life's origins.
Meng, Yijun; Yu, Dongliang; Xue, Jie; Lu, Jiangjie; Feng, Shangguo; Shen, Chenjia; Wang, Huizhong
2016-01-01
Dendrobium officinale is an important traditional Chinese herb. Here, we did a transcriptome-wide, organ-specific study on this valuable plant by combining RNA, small RNA (sRNA) and degradome sequencing. RNA sequencing of four organs (flower, root, leaf and stem) of Dendrobium officinale enabled us to obtain 536,558 assembled transcripts, from which 2,645, 256, 42 and 54 were identified to be highly expressed in the four organs respectively. Based on sRNA sequencing, 2,038, 2, 21 and 24 sRNAs were identified to be specifically accumulated in the four organs respectively. A total of 1,047 mature microRNA (miRNA) candidates were detected. Based on secondary structure predictions and sequencing, tens of potential miRNA precursors were identified from the assembled transcripts. Interestingly, phase-distributed sRNAs with degradome-based processing evidences were discovered on the long-stem structures of two precursors. Target identification was performed for the 1,047 miRNA candidates, resulting in the discovery of 1,257 miRNA--target pairs. Finally, some biological meaningful subnetworks involving hormone signaling, development, secondary metabolism and Argonaute 1-related regulation were established. All of the sequencing data sets are available at NCBI Sequence Read Archive (http://www.ncbi.nlm.nih.gov/sra/). Summarily, our study provides a valuable resource for the in-depth molecular and functional studies on this important Chinese orchid herb. PMID:26732614
Gini, Beatrice; Mischel, Paul S
2014-08-01
Single-cell sequencing approaches are needed to characterize the genomic diversity of complex tumors, shedding light on their evolutionary paths and potentially suggesting more effective therapies. In this issue of Cancer Discovery, Francis and colleagues develop a novel integrative approach to identify distinct tumor subpopulations based on joint detection of clonal and subclonal events from bulk tumor and single-nucleus whole-genome sequencing, allowing them to infer a subclonal architecture. Surprisingly, the authors identify convergent evolution of multiple, mutually exclusive, independent EGFR gain-of-function variants in a single tumor. This study demonstrates the value of integrative single-cell genomics and highlights the biologic primacy of EGFR as an actionable target in glioblastoma. ©2014 American Association for Cancer Research.
Information resources at the National Center for Biotechnology Information.
Woodsmall, R M; Benson, D A
1993-01-01
The National Center for Biotechnology Information (NCBI), part of the National Library of Medicine, was established in 1988 to perform basic research in the field of computational molecular biology as well as build and distribute molecular biology databases. The basic research has led to new algorithms and analysis tools for interpreting genomic data and has been instrumental in the discovery of human disease genes for neurofibromatosis and Kallmann syndrome. The principal database responsibility is the National Institutes of Health (NIH) genetic sequence database, GenBank. NCBI, in collaboration with international partners, builds, distributes, and provides online and CD-ROM access to over 112,000 DNA sequences. Another major program is the integration of multiple sequences databases and related bibliographic information and the development of network-based retrieval systems for Internet access. PMID:8374583
Karakülah, Gökhan
2017-06-28
Novel transcript discovery through RNA sequencing has substantially improved our understanding of the transcriptome dynamics of biological systems. Endogenous target mimicry (eTM) transcripts, a novel class of regulatory molecules, bind to their target microRNAs (miRNAs) by base pairing and block their biological activity. The objective of this study was to provide a computational analysis framework for the prediction of putative eTM sequences in plants, and as an example, to discover previously un-annotated eTMs in Prunus persica (peach) transcriptome. Therefore, two public peach transcriptome libraries downloaded from Sequence Read Archive (SRA) and a previously published set of long non-coding RNAs (lncRNAs) were investigated with multi-step analysis pipeline, and 44 putative eTMs were found. Additionally, an eTM-miRNA-mRNA regulatory network module associated with peach fruit organ development was built via integration of the miRNA target information and predicted eTM-miRNA interactions. My findings suggest that one of the most widely expressed miRNA families among diverse plant species, miR156, might be potentially sponged by seven putative eTMs. Besides, the study indicates eTMs potentially play roles in the regulation of development processes in peach fruit via targeting specific miRNAs. In conclusion, by following the step-by step instructions provided in this study, novel eTMs can be identified and annotated effectively in public plant transcriptome libraries.
Methods and Applications of CRISPR-Mediated Base Editing in Eukaryotic Genomes.
Hess, Gaelen T; Tycko, Josh; Yao, David; Bassik, Michael C
2017-10-05
The past several years have seen an explosion in development of applications for the CRISPR-Cas9 system, from efficient genome editing, to high-throughput screening, to recruitment of a range of DNA and chromatin-modifying enzymes. While homology-directed repair (HDR) coupled with Cas9 nuclease cleavage has been used with great success to repair and re-write genomes, recently developed base-editing systems present a useful orthogonal strategy to engineer nucleotide substitutions. Base editing relies on recruitment of cytidine deaminases to introduce changes (rather than double-stranded breaks and donor templates) and offers potential improvements in efficiency while limiting damage and simplifying the delivery of editing machinery. At the same time, these systems enable novel mutagenesis strategies to introduce sequence diversity for engineering and discovery. Here, we review the different base-editing platforms, including their deaminase recruitment strategies and editing outcomes, and compare them to other CRISPR genome-editing technologies. Additionally, we discuss how these systems have been applied in therapeutic, engineering, and research settings. Lastly, we explore future directions of this emerging technology. Copyright © 2017 Elsevier Inc. All rights reserved.
Zhang, Chuanliang; Qu, Yanyan; Wu, Xiaoqing; Song, Dunlun; Ling, Yun; Yang, Xinling
2015-05-13
Insect kinin neuropeptides are pleiotropic peptides that are involved in the regulation of hindgut contraction, diuresis, and digestive enzyme release. They share a common C-terminal pentapeptide sequence of Phe(1)-Xaa(2)-Yaa(3)-Trp(4)-Gly(5)-NH2 (where Xaa(2) = His, Asn, Phe, Ser, or Tyr; Yaa(3) = Pro, Ser, or Ala). Recently, the aphicidal activity of insect kinin analogues has attracted the attention of researchers. Our previous work demonstrated that the sequence-simplified insect kinin pentapeptide analogue Phe-Phe-[Aib]-Trp-Gly-NH2 could retain good aphicidal activity and be the lead compound for the further discovery of eco-friendly insecticides which encompassed a broad array of biochemicals derived from micro-organisms and other natural sources. Using the peptidomimetics strategy, we chose Phe-Phe-[Aib]-Trp-Gly-NH2 as the lead compound, and we designed and synthesized three series, including 31 novel insect kinin analogues. The aphicidal activity of the new analogues against soybean aphid was determined. The results showed that all of the analogues exhibited aphicidal activity. Of particular interest was the analogue II-1, which exhibited improved aphicidal activity with an LC50 of 0.019 mmol/L compared with the lead compound (LC50 = 0.045 mmol/L) or the commercial insecticide pymetrozine (LC50 = 0.034 mmol/L). This suggests that the analogue II-1 could be used as a new lead for the discovery of potential eco-friendly insecticides.
Synthetic biology approaches: Towards sustainable exploitation of marine bioactive molecules.
Seghal Kiran, G; Ramasamy, Pasiyappazham; Sekar, Sivasankari; Ramu, Meenatchi; Hassan, Saqib; Ninawe, A S; Selvin, Joseph
2018-06-01
The discovery of genes responsible for the production of bioactive metabolites via metabolic pathways combined with the advances in synthetic biology tools, has allowed the establishment of numerous microbial cell factories, for instance the yeast cell factories, for the manufacture of highly useful metabolites from renewable biomass. Genome mining and metagenomics are two platforms provide base-line data for reconstruction of genomes and metabolomes which is based in the development of synthetic/semi-synthetic genomes for marine natural products discovery. Engineered biofilms are being innovated on synthetic biology platform using genetic circuits and cell signalling systems as represillators controlling biofilm formation. Recombineering is a process of homologous recombination mediated genetic engineering, includes insertion, deletion or modification of any sequence specifically. Although this discipline considered new to the scientific domain, this field has now developed as promising endeavor on the accomplishment of sustainable exploitation of marine natural products. Copyright © 2018 Elsevier B.V. All rights reserved.
Assessing pooled BAC and whole genome shotgun strategies for assembly of complex genomes.
Haiminen, Niina; Feltus, F Alex; Parida, Laxmi
2011-04-15
We investigate if pooling BAC clones and sequencing the pools can provide for more accurate assembly of genome sequences than the "whole genome shotgun" (WGS) approach. Furthermore, we quantify this accuracy increase. We compare the pooled BAC and WGS approaches using in silico simulations. Standard measures of assembly quality focus on assembly size and fragmentation, which are desirable for large whole genome assemblies. We propose additional measures enabling easy and visual comparison of assembly quality, such as rearrangements and redundant sequence content, relative to the known target sequence. The best assembly quality scores were obtained using 454 coverage of 15× linear and 5× paired (3kb insert size) reads (15L-5P) on Arabidopsis. This regime gave similarly good results on four additional plant genomes of very different GC and repeat contents. BAC pooling improved assembly scores over WGS assembly, coverage and redundancy scores improving the most. BAC pooling works better than WGS, however, both require a physical map to order the scaffolds. Pool sizes up to 12Mbp work well, suggesting this pooling density to be effective in medium-scale re-sequencing applications such as targeted sequencing of QTL intervals for candidate gene discovery. Assuming the current Roche/454 Titanium sequencing limitations, a 12 Mbp region could be re-sequenced with a full plate of linear reads and a half plate of paired-end reads, yielding 15L-5P coverage after read pre-processing. Our simulation suggests that massively over-sequencing may not improve accuracy. Our scoring measures can be used generally to evaluate and compare results of simulated genome assemblies.
Diller, Kyle I; Bayden, Alexander S; Audie, Joseph; Diller, David J
2018-01-01
There is growing interest in peptide-based drug design and discovery. Due to their relatively large size, polymeric nature, and chemical complexity, the design of peptide-based drugs presents an interesting "big data" challenge. Here, we describe an interactive computational environment, PeptideNavigator, for naturally exploring the tremendous amount of information generated during a peptide drug design project. The purpose of PeptideNavigator is the presentation of large and complex experimental and computational data sets, particularly 3D data, so as to enable multidisciplinary scientists to make optimal decisions during a peptide drug discovery project. PeptideNavigator provides users with numerous viewing options, such as scatter plots, sequence views, and sequence frequency diagrams. These views allow for the collective visualization and exploration of many peptides and their properties, ultimately enabling the user to focus on a small number of peptides of interest. To drill down into the details of individual peptides, PeptideNavigator provides users with a Ramachandran plot viewer and a fully featured 3D visualization tool. Each view is linked, allowing the user to seamlessly navigate from collective views of large peptide data sets to the details of individual peptides with promising property profiles. Two case studies, based on MHC-1A activating peptides and MDM2 scaffold design, are presented to demonstrate the utility of PeptideNavigator in the context of disparate peptide-design projects. Copyright © 2017 Elsevier Ltd. All rights reserved.
Ramharack, Pritika; Soliman, Mahmoud E S
2018-06-01
Originally developed for the analysis of biological sequences, bioinformatics has advanced into one of the most widely recognized domains in the scientific community. Despite this technological evolution, there is still an urgent need for nontoxic and efficient drugs. The onus now falls on the 'omics domain to meet this need by implementing bioinformatics techniques that will allow for the introduction of pioneering approaches in the rational drug design process. Here, we categorize an updated list of informatics tools and explore the capabilities of integrative bioinformatics in disease control. We believe that our review will serve as a comprehensive guide toward bioinformatics-oriented disease and drug discovery research. Copyright © 2018 Elsevier Ltd. All rights reserved.
Genetic and epigenetic control of gene expression by CRISPR–Cas systems
Lo, Albert; Qi, Lei
2017-01-01
The discovery and adaption of bacterial clustered regularly interspaced short palindromic repeats (CRISPR)–CRISPR-associated (Cas) systems has revolutionized the way researchers edit genomes. Engineering of catalytically inactivated Cas variants (nuclease-deficient or nuclease-deactivated [dCas]) combined with transcriptional repressors, activators, or epigenetic modifiers enable sequence-specific regulation of gene expression and chromatin state. These CRISPR–Cas-based technologies have contributed to the rapid development of disease models and functional genomics screening approaches, which can facilitate genetic target identification and drug discovery. In this short review, we will cover recent advances of CRISPR–dCas9 systems and their use for transcriptional repression and activation, epigenome editing, and engineered synthetic circuits for complex control of the mammalian genome. PMID:28649363
Gomulski, Ludvik M; Dimopoulos, George; Xi, Zhiyong; Soares, Marcelo B; Bonaldo, Maria F; Malacrida, Anna R; Gasperi, Giuliano
2008-01-01
Background The medfly, Ceratitis capitata, is a highly invasive agricultural pest that has become a model insect for the development of biological control programs. Despite research into the behavior and classical and population genetics of this organism, the quantity of sequence data available is limited. We have utilized an expressed sequence tag (EST) approach to obtain detailed information on transcriptome signatures that relate to a variety of physiological systems in the medfly; this information emphasizes on reproduction, sex determination, and chemosensory perception, since the study was based on normalized cDNA libraries from embryos and adult heads. Results A total of 21,253 high-quality ESTs were obtained from the embryo and head libraries. Clustering analyses performed separately for each library resulted in 5201 embryo and 6684 head transcripts. Considering an estimated 19% overlap in the transcriptomes of the two libraries, they represent about 9614 unique transcripts involved in a wide range of biological processes and molecular functions. Of particular interest are the sequences that share homology with Drosophila genes involved in sex determination, olfaction, and reproductive behavior. The medfly transformer2 (tra2) homolog was identified among the embryonic sequences, and its genomic organization and expression were characterized. Conclusion The sequences obtained in this study represent the first major dataset of expressed genes in a tephritid species of agricultural importance. This resource provides essential information to support the investigation of numerous questions regarding the biology of the medfly and other related species and also constitutes an invaluable tool for the annotation of complete genome sequences. Our study has revealed intriguing findings regarding the transcript regulation of tra2 and other sex determination genes, as well as insights into the comparative genomics of genes implicated in chemosensory reception and reproduction. PMID:18500975
CRISPR/Cas9: From Genome Engineering to Cancer Drug Discovery
Luo, Ji
2016-01-01
Advances in translational research are often driven by new technologies. The advent of microarrays, next-generation sequencing, proteomics and RNA interference (RNAi) have led to breakthroughs in our understanding of the mechanisms of cancer and the discovery of new cancer drug targets. The discovery of the bacterial clustered regularly interspaced palindromic repeat (CRISPR) system and its subsequent adaptation as a tool for mammalian genome engineering has opened up new avenues for functional genomics studies. This review will focus on the utility of CRISPR in the context of cancer drug target discovery. PMID:28603775
De novo characterization of Lentinula edodes C(91-3) transcriptome by deep Solexa sequencing.
Zhong, Mintao; Liu, Ben; Wang, Xiaoli; Liu, Lei; Lun, Yongzhi; Li, Xingyun; Ning, Anhong; Cao, Jing; Huang, Min
2013-02-01
Lentinula edodes, has been utilized as food, as well as, in popular medicine, moreover, its extract isolated from its mycelium and fruiting body have shown several therapeutic properties. Yet little is understood about its genes involved in these properties, and the absence of L.edodes genomes has been a barrier to the development of functional genomics research. However, high throughput sequencing technologies are now being widely applied to non-model species. To facilitate research on L.edodes, we leveraged Solexa sequencing technology in de novo assembly of L.edodes C(91-3) transcriptome. In a single run, we produced more than 57 million sequencing reads. These reads were assembled into 28,923 unigene sequences (mean size=689bp) including 18,120 unigenes with coding sequence (CDS). Based on similarity search with known proteins, assembled unigene sequences were annotated with gene descriptions, gene ontology (GO) and clusters of orthologous group (COG) terms. Our data provides the first comprehensive sequence resource available for functional genomics studies in L.edodes, and demonstrates the utility of Illumina/Solexa sequencing for de novo transcriptome characterization and gene discovery in a non-model mushroom. Copyright © 2012 Elsevier Inc. All rights reserved.
Angeloni, Debora; ter Elst, Arja; Wei, Ming Hui; van der Veen, Anneke Y; Braga, Eleonora A; Klimov, Eugene A; Timmer, Tineke; Korobeinikova, Luba; Lerman, Michael I; Buys, Charles H C M
2006-07-01
Homozygous deletions or loss of heterozygosity (LOH) at human chromosome band 3p12 are consistent features of lung and other malignancies, suggesting the presence of a tumor suppressor gene(s) (TSG) at this location. Only one gene has been cloned thus far from the overlapping region deleted in lung and breast cancer cell lines U2020, NCI H2198, and HCC38. It is DUTT1 (Deleted in U Twenty Twenty), also known as ROBO1, FLJ21882, and SAX3, according to HUGO. DUTT1, the human ortholog of the fly gene ROBO, has homology with NCAM proteins. Extensive analyses of DUTT1 in lung cancer have not revealed any mutations, suggesting that another gene(s) at this location could be of importance in lung cancer initiation and progression. Here, we report the discovery of a new, small, homozygous deletion in the small cell lung cancer (SCLC) cell line GLC20, nested in the overlapping, critical region. The deletion was delineated using several polymorphic markers and three overlapping P1 phage clones. Fiber-FISH experiments revealed the deletion was approximately 130 kb. Comparative genomic sequence analysis uncovered short sequence elements highly conserved among mammalian genomes and the chicken genome. The discovery of two EST clusters within the deleted region led to the isolation of two noncoding RNA (ncRNA) genes. These were subsequently found differentially expressed in various tumors when compared to their normal tissues. The ncRNA and other highly conserved sequence elements in the deleted region may represent miRNA targets of importance in cancer initiation or progression. Published 2006 Wiley-Liss, Inc.
Kelly, Benjamin J; Fitch, James R; Hu, Yangqiu; Corsmeier, Donald J; Zhong, Huachun; Wetzel, Amy N; Nordquist, Russell D; Newsom, David L; White, Peter
2015-01-20
While advances in genome sequencing technology make population-scale genomics a possibility, current approaches for analysis of these data rely upon parallelization strategies that have limited scalability, complex implementation and lack reproducibility. Churchill, a balanced regional parallelization strategy, overcomes these challenges, fully automating the multiple steps required to go from raw sequencing reads to variant discovery. Through implementation of novel deterministic parallelization techniques, Churchill allows computationally efficient analysis of a high-depth whole genome sample in less than two hours. The method is highly scalable, enabling full analysis of the 1000 Genomes raw sequence dataset in a week using cloud resources. http://churchill.nchri.org/.
Defrance, Matthieu; Janky, Rekin's; Sand, Olivier; van Helden, Jacques
2008-01-01
This protocol explains how to discover functional signals in genomic sequences by detecting over- or under-represented oligonucleotides (words) or spaced pairs thereof (dyads) with the Regulatory Sequence Analysis Tools (http://rsat.ulb.ac.be/rsat/). Two typical applications are presented: (i) predicting transcription factor-binding motifs in promoters of coregulated genes and (ii) discovering phylogenetic footprints in promoters of orthologous genes. The steps of this protocol include purging genomic sequences to discard redundant fragments, discovering over-represented patterns and assembling them to obtain degenerate motifs, scanning sequences and drawing feature maps. The main strength of the method is its statistical ground: the binomial significance provides an efficient control on the rate of false positives. In contrast with optimization-based pattern discovery algorithms, the method supports the detection of under- as well as over-represented motifs. Computation times vary from seconds (gene clusters) to minutes (whole genomes). The execution of the whole protocol should take approximately 1 h.
Characterizing ncRNAs in Human Pathogenic Protists Using High-Throughput Sequencing Technology
Collins, Lesley Joan
2011-01-01
ncRNAs are key genes in many human diseases including cancer and viral infection, as well as providing critical functions in pathogenic organisms such as fungi, bacteria, viruses, and protists. Until now the identification and characterization of ncRNAs associated with disease has been slow or inaccurate requiring many years of testing to understand complicated RNA and protein gene relationships. High-throughput sequencing now offers the opportunity to characterize miRNAs, siRNAs, small nucleolar RNAs (snoRNAs), and long ncRNAs on a genomic scale, making it faster and easier to clarify how these ncRNAs contribute to the disease state. However, this technology is still relatively new, and ncRNA discovery is not an application of high priority for streamlined bioinformatics. Here we summarize background concepts and practical approaches for ncRNA analysis using high-throughput sequencing, and how it relates to understanding human disease. As a case study, we focus on the parasitic protists Giardia lamblia and Trichomonas vaginalis, where large evolutionary distance has meant difficulties in comparing ncRNAs with those from model eukaryotes. A combination of biological, computational, and sequencing approaches has enabled easier classification of ncRNA classes such as snoRNAs, but has also aided the identification of novel classes. It is hoped that a higher level of understanding of ncRNA expression and interaction may aid in the development of less harsh treatment for protist-based diseases. PMID:22303390
Trellis Tone Modulation Multiple-Access for Peer Discovery in D2D Networks
Lim, Chiwoo; Kim, Sang-Hyo
2018-01-01
In this paper, a new non-orthogonal multiple-access scheme, trellis tone modulation multiple-access (TTMMA), is proposed for peer discovery of distributed device-to-device (D2D) communication. The range and capacity of discovery are important performance metrics in peer discovery. The proposed trellis tone modulation uses single-tone transmission and achieves a long discovery range due to its low Peak-to-Average Power Ratio (PAPR). The TTMMA also exploits non-orthogonal resource assignment to increase the discovery capacity. For the multi-user detection of superposed multiple-access signals, a message-passing algorithm with supplementary schemes are proposed. With TTMMA and its message-passing demodulation, approximately 1.5 times the number of devices are discovered compared to the conventional frequency division multiple-access (FDMA)-based discovery. PMID:29673167
Trellis Tone Modulation Multiple-Access for Peer Discovery in D2D Networks.
Lim, Chiwoo; Jang, Min; Kim, Sang-Hyo
2018-04-17
In this paper, a new non-orthogonal multiple-access scheme, trellis tone modulation multiple-access (TTMMA), is proposed for peer discovery of distributed device-to-device (D2D) communication. The range and capacity of discovery are important performance metrics in peer discovery. The proposed trellis tone modulation uses single-tone transmission and achieves a long discovery range due to its low Peak-to-Average Power Ratio (PAPR). The TTMMA also exploits non-orthogonal resource assignment to increase the discovery capacity. For the multi-user detection of superposed multiple-access signals, a message-passing algorithm with supplementary schemes are proposed. With TTMMA and its message-passing demodulation, approximately 1.5 times the number of devices are discovered compared to the conventional frequency division multiple-access (FDMA)-based discovery.
Jarecki, Jessica L.; Frey, Brian L.; Smith, Lloyd M.; Stretton, Antony O.
2011-01-01
Liquid chromatography coupled with tandem mass spectrometry (LC-MS/MS) was used to discover peptides in extracts of the large parasitic nematode Ascaris suum. This required the assembly of a new database of known and predicted peptides. In addition to those already sequenced, peptides were either previously predicted to be processed from precursor proteins identified in an A. suum library of expressed sequence tags (ESTs), or newly predicted from a library of A. suum genome survey sequences (GSSs). The predicted MS/MS fragmentation patterns of this collection of real and putative peptides were compared with the actual fragmentation patterns found in the MS/MS spectra of peptides fractionated by MS; this enabled individual peptides to be sequenced. Many previously identified peptides were found, and 21 novel peptides were discovered. Thus, this approach is very useful, despite the fact that the available GSS database is still preliminary, having only 1X coverage. PMID:21524146
Comparative immunogenomics of molluscs.
Schultz, Jonathan H; Adema, Coen M
2017-10-01
Comparative immunology, studying both vertebrates and invertebrates, provided the earliest descriptions of phagocytosis as a general immune mechanism. However, the large scale of animal diversity challenges all-inclusive investigations and the field of immunology has developed by mostly emphasizing study of a few vertebrate species. In addressing the lack of comprehensive understanding of animal immunity, especially that of invertebrates, comparative immunology helps toward management of invertebrates that are food sources, agricultural pests, pathogens, or transmit diseases, and helps interpret the evolution of animal immunity. Initial studies showed that the Mollusca (second largest animal phylum), and invertebrates in general, possess innate defenses but lack the lymphocytic immune system that characterizes vertebrate immunology. Recognizing the reality of both common and taxon-specific immune features, and applying up-to-date cell and molecular research capabilities, in-depth studies of a select number of bivalve and gastropod species continue to reveal novel aspects of molluscan immunity. The genomics era heralded a new stage of comparative immunology; large-scale efforts yielded an initial set of full molluscan genome sequences that is available for analyses of full complements of immune genes and regulatory sequences. Next-generation sequencing (NGS), due to lower cost and effort required, allows individual researchers to generate large sequence datasets for growing numbers of molluscs. RNAseq provides expression profiles that enable discovery of immune genes and genome sequences reveal distribution and diversity of immune factors across molluscan phylogeny. Although computational de novo sequence assembly will benefit from continued development and automated annotation may require some experimental validation, NGS is a powerful tool for comparative immunology, especially increasing coverage of the extensive molluscan diversity. To date, immunogenomics revealed new levels of complexity of molluscan defense by indicating sequence heterogeneity in individual snails and bivalves, and members of expanded immune gene families are expressed differentially to generate pathogen-specific defense responses. Copyright © 2017 Elsevier Ltd. All rights reserved.
New approaches to antimicrobial discovery.
Lewis, Kim
2017-06-15
The spread of resistant organisms is producing a human health crisis, as we are witnessing the emergence of pathogens resistant to all available antibiotics. An increase in chronic infections presents an additional challenge - these diseases are difficult to treat due to antibiotic-tolerant persister cells. Overmining of soil Actinomycetes ended the golden era of antibiotic discovery in the 60s, and efforts to replace this source by screening synthetic compound libraries was not successful. Bacteria have an efficient permeability barrier, preventing penetration of most synthetic compounds. Empirically establishing rules of penetration for antimicrobials will form the knowledge base to produce libraries tailored to antibiotic discovery, and will revive rational drug design. Two untapped sources of natural products hold the promise of reviving natural product discovery. Most bacterial species, over 99%, are uncultured, and methods to grow these organisms have been developed, and the first promising compounds are in development. Genome sequencing shows that known producers harbor many more operons coding for secondary metabolites than we can account for, providing an additional rich source of antibiotics. Revival of natural product discovery will require high-throughput identification of novel compounds within a large background of known substances. This could be achieved by rapid acquisition of transcription profiles from active extracts that will point to potentially novel compounds. Copyright © 2016 Elsevier Inc. All rights reserved.
Development of Genetic Markers in Eucalyptus Species by Target Enrichment and Exome Sequencing
Dasgupta, Modhumita Ghosh; Dharanishanthi, Veeramuthu; Agarwal, Ishangi; Krutovsky, Konstantin V.
2015-01-01
The advent of next-generation sequencing has facilitated large-scale discovery, validation and assessment of genetic markers for high density genotyping. The present study was undertaken to identify markers in genes supposedly related to wood property traits in three Eucalyptus species. Ninety four genes involved in xylogenesis were selected for hybridization probe based nuclear genomic DNA target enrichment and exome sequencing. Genomic DNA was isolated from the leaf tissues and used for on-array probe hybridization followed by Illumina sequencing. The raw sequence reads were trimmed and high-quality reads were mapped to the E. grandis reference sequence and the presence of single nucleotide variants (SNVs) and insertions/ deletions (InDels) were identified across the three species. The average read coverage was 216X and a total of 2294 SNVs and 479 InDels were discovered in E. camaldulensis, 2383 SNVs and 518 InDels in E. tereticornis, and 1228 SNVs and 409 InDels in E. grandis. Additionally, SNV calling and InDel detection were conducted in pair-wise comparisons of E. tereticornis vs. E. grandis, E. camaldulensis vs. E. tereticornis and E. camaldulensis vs. E. grandis. This study presents an efficient and high throughput method on development of genetic markers for family– based QTL and association analysis in Eucalyptus. PMID:25602379
2017-01-01
The exponentially increasing number of protein and nucleic acid sequences provides opportunities to discover novel enzymes, metabolic pathways, and metabolites/natural products, thereby adding to our knowledge of biochemistry and biology. The challenge has evolved from generating sequence information to mining the databases to integrating and leveraging the available information, i.e., the availability of “genomic enzymology” web tools. Web tools that allow identification of biosynthetic gene clusters are widely used by the natural products/synthetic biology community, thereby facilitating the discovery of novel natural products and the enzymes responsible for their biosynthesis. However, many novel enzymes with interesting mechanisms participate in uncharacterized small-molecule metabolic pathways; their discovery and functional characterization also can be accomplished by leveraging information in protein and nucleic acid databases. This Perspective focuses on two genomic enzymology web tools that assist the discovery novel metabolic pathways: (1) Enzyme Function Initiative-Enzyme Similarity Tool (EFI-EST) for generating sequence similarity networks to visualize and analyze sequence–function space in protein families and (2) Enzyme Function Initiative-Genome Neighborhood Tool (EFI-GNT) for generating genome neighborhood networks to visualize and analyze the genome context in microbial and fungal genomes. Both tools have been adapted to other applications to facilitate target selection for enzyme discovery and functional characterization. As the natural products community has demonstrated, the enzymology community needs to embrace the essential role of web tools that allow the protein and genome sequence databases to be leveraged for novel insights into enzymological problems. PMID:28826221
Liseron-Monfils, Christophe; Lewis, Tim; Ashlock, Daniel; McNicholas, Paul D; Fauteux, François; Strömvik, Martina; Raizada, Manish N
2013-03-15
The discovery of genetic networks and cis-acting DNA motifs underlying their regulation is a major objective of transcriptome studies. The recent release of the maize genome (Zea mays L.) has facilitated in silico searches for regulatory motifs. Several algorithms exist to predict cis-acting elements, but none have been adapted for maize. A benchmark data set was used to evaluate the accuracy of three motif discovery programs: BioProspector, Weeder and MEME. Analysis showed that each motif discovery tool had limited accuracy and appeared to retrieve a distinct set of motifs. Therefore, using the benchmark, statistical filters were optimized to reduce the false discovery ratio, and then remaining motifs from all programs were combined to improve motif prediction. These principles were integrated into a user-friendly pipeline for motif discovery in maize called Promzea, available at http://www.promzea.org and on the Discovery Environment of the iPlant Collaborative website. Promzea was subsequently expanded to include rice and Arabidopsis. Within Promzea, a user enters cDNA sequences or gene IDs; corresponding upstream sequences are retrieved from the maize genome. Predicted motifs are filtered, combined and ranked. Promzea searches the chosen plant genome for genes containing each candidate motif, providing the user with the gene list and corresponding gene annotations. Promzea was validated in silico using a benchmark data set: the Promzea pipeline showed a 22% increase in nucleotide sensitivity compared to the best standalone program tool, Weeder, with equivalent nucleotide specificity. Promzea was also validated by its ability to retrieve the experimentally defined binding sites of transcription factors that regulate the maize anthocyanin and phlobaphene biosynthetic pathways. Promzea predicted additional promoter motifs, and genome-wide motif searches by Promzea identified 127 non-anthocyanin/phlobaphene genes that each contained all five predicted promoter motifs in their promoters, perhaps uncovering a broader co-regulated gene network. Promzea was also tested against tissue-specific microarray data from maize. An online tool customized for promoter motif discovery in plants has been generated called Promzea. Promzea was validated in silico by its ability to retrieve benchmark motifs and experimentally defined motifs and was tested using tissue-specific microarray data. Promzea predicted broader networks of gene regulation associated with the historic anthocyanin and phlobaphene biosynthetic pathways. Promzea is a new bioinformatics tool for understanding transcriptional gene regulation in maize and has been expanded to include rice and Arabidopsis.
Constructing a Graph Database for Semantic Literature-Based Discovery.
Hristovski, Dimitar; Kastrin, Andrej; Dinevski, Dejan; Rindflesch, Thomas C
2015-01-01
Literature-based discovery (LBD) generates discoveries, or hypotheses, by combining what is already known in the literature. Potential discoveries have the form of relations between biomedical concepts; for example, a drug may be determined to treat a disease other than the one for which it was intended. LBD views the knowledge in a domain as a network; a set of concepts along with the relations between them. As a starting point, we used SemMedDB, a database of semantic relations between biomedical concepts extracted with SemRep from Medline. SemMedDB is distributed as a MySQL relational database, which has some problems when dealing with network data. We transformed and uploaded SemMedDB into the Neo4j graph database, and implemented the basic LBD discovery algorithms with the Cypher query language. We conclude that storing the data needed for semantic LBD is more natural in a graph database. Also, implementing LBD discovery algorithms is conceptually simpler with a graph query language when compared with standard SQL.
Whole-genome sequencing of a pandoravirus isolated from keratitis-inducing acanthamoeba.
Antwerpen, M H; Georgi, E; Zoeller, L; Woelfel, R; Stoecker, K; Scheid, P
2015-03-26
Following the recent discovery of two Pandoravirus species in 2013, a previously described endocytobiont isolated from the inflamed eye of a patient with keratitis was subjected to whole-genome sequencing (WGS). Here, we present the complete genome sequence of a new Pandoravirus isolate. Copyright © 2015 Antwerpen et al.
New in-depth rainbow trout transcriptome reference and digital atlas of gene expression
USDA-ARS?s Scientific Manuscript database
Sequencing the rainbow trout genome is underway and a transcriptome reference sequence is required to help in genome assembly and gene discovery. Previously, we reported a transcriptome reference sequence using a 19X coverage of 454-pyrosequencing data. Although this work added a great wealth of ann...
Transcriptome sequencing and annotation of the halophytic microalga Dunaliella salina * #
Hong, Ling; Liu, Jun-li; Midoun, Samira Z.; Miller, Philip C.
2017-01-01
The unicellular green alga Dunaliella salina is well adapted to salt stress and contains compounds (including β-carotene and vitamins) with potential commercial value. A large transcriptome database of D. salina during the adjustment, exponential and stationary growth phases was generated using a high throughput sequencing platform. We characterized the metabolic processes in D. salina with a focus on valuable metabolites, with the aim of manipulating D. salina to achieve greater economic value in large-scale production through a bioengineering strategy. Gene expression profiles under salt stress verified using quantitative polymerase chain reaction (qPCR) implied that salt can regulate the expression of key genes. This study generated a substantial fraction of D. salina transcriptional sequences for the entire growth cycle, providing a basis for the discovery of novel genes. This first full-scale transcriptome study of D. salina establishes a foundation for further comparative genomic studies. PMID:28990374
Krämer, Andreas; Shah, Sohela; Rebres, Robert Anthony; Tang, Susan; Richards, Daniel Rene
2017-08-11
Next-generation sequencing is widely used to identify disease-causing variants in patients with rare genetic disorders. Identifying those variants from whole-genome or exome data can be both scientifically challenging and time consuming. A significant amount of time is spent on variant annotation, and interpretation. Fully or partly automated solutions are therefore needed to streamline and scale this process. We describe Phenotype Driven Ranking (PDR), an algorithm integrated into Ingenuity Variant Analysis, that uses observed patient phenotypes to prioritize diseases and genes in order to expedite causal-variant discovery. Our method is based on a network of phenotype-disease-gene relationships derived from the QIAGEN Knowledge Base, which allows for efficient computational association of phenotypes to implicated diseases, and also enables scoring and ranking. We have demonstrated the utility and performance of PDR by applying it to a number of clinical rare-disease cases, where the true causal gene was known beforehand. It is also shown that PDR compares favorably to a representative alternative tool.
Yamagata, Koichi; Yamanishi, Ayako; Kokubu, Chikara; Takeda, Junji; Sese, Jun
2016-01-01
An important challenge in cancer genomics is precise detection of structural variations (SVs) by high-throughput short-read sequencing, which is hampered by the high false discovery rates of existing analysis tools. Here, we propose an accurate SV detection method named COSMOS, which compares the statistics of the mapped read pairs in tumor samples with isogenic normal control samples in a distinct asymmetric manner. COSMOS also prioritizes the candidate SVs using strand-specific read-depth information. Performance tests on modeled tumor genomes revealed that COSMOS outperformed existing methods in terms of F-measure. We also applied COSMOS to an experimental mouse cell-based model, in which SVs were induced by genome engineering and gamma-ray irradiation, followed by polymerase chain reaction-based confirmation. The precision of COSMOS was 84.5%, while the next best existing method was 70.4%. Moreover, the sensitivity of COSMOS was the highest, indicating that COSMOS has great potential for cancer genome analysis. PMID:26833260
Naz, Sadia; Ngo, Tony; Farooq, Umar
2017-01-01
Background The rapid increase in antibiotic resistance by various bacterial pathogens underlies the significance of developing new therapies and exploring different drug targets. A fraction of bacterial pathogens abbreviated as ESKAPE by the European Center for Disease Prevention and Control have been considered a major threat due to the rise in nosocomial infections. Here, we compared putative drug binding pockets of twelve essential and mostly conserved metabolic enzymes in numerous bacterial pathogens including those of the ESKAPE group and Mycobacterium tuberculosis. The comparative analysis will provide guidelines for the likelihood of transferability of the inhibitors from one species to another. Methods Nine bacterial species including six ESKAPE pathogens, Mycobacterium tuberculosis along with Mycobacterium smegmatis and Eschershia coli, two non-pathogenic bacteria, have been selected for drug binding pocket analysis of twelve essential enzymes. The amino acid sequences were obtained from Uniprot, aligned using ICM v3.8-4a and matched against the Pocketome encyclopedia. We used known co-crystal structures of selected target enzyme orthologs to evaluate the location of their active sites and binding pockets and to calculate a matrix of pairwise sequence identities across each target enzyme across the different species. This was used to generate sequence maps. Results High sequence identity of enzyme binding pockets, derived from experimentally determined co-crystallized structures, was observed among various species. Comparison at both full sequence level and for drug binding pockets of key metabolic enzymes showed that binding pockets are highly conserved (sequence similarity up to 100%) among various ESKAPE pathogens as well as Mycobacterium tuberculosis. Enzymes orthologs having conserved binding sites may have potential to interact with inhibitors in similar way and might be helpful for design of similar class of inhibitors for a particular species. The derived pocket alignments and distance-based maps provide guidelines for drug discovery and repurposing. In addition they also provide recommendations for the relevant model bacteria that may be used for initial drug testing. Discussion Comparing ligand binding sites through sequence identity calculation could be an effective approach to identify conserved orthologs as drug binding pockets have shown higher level of conservation among various species. By using this approach we could avoid the problems associated with full sequence comparison. We identified essential metabolic enzymes among ESKAPE pathogens that share high sequence identity in their putative drug binding pockets (up to 100%), of which known inhibitors can potentially antagonize these identical pockets in the various species in a similar manner. PMID:28948099
Naz, Sadia; Ngo, Tony; Farooq, Umar; Abagyan, Ruben
2017-01-01
The rapid increase in antibiotic resistance by various bacterial pathogens underlies the significance of developing new therapies and exploring different drug targets. A fraction of bacterial pathogens abbreviated as ESKAPE by the European Center for Disease Prevention and Control have been considered a major threat due to the rise in nosocomial infections. Here, we compared putative drug binding pockets of twelve essential and mostly conserved metabolic enzymes in numerous bacterial pathogens including those of the ESKAPE group and Mycobacterium tuberculosis . The comparative analysis will provide guidelines for the likelihood of transferability of the inhibitors from one species to another. Nine bacterial species including six ESKAPE pathogens, Mycobacterium tuberculosis along with Mycobacterium smegmatis and Eschershia coli , two non-pathogenic bacteria, have been selected for drug binding pocket analysis of twelve essential enzymes. The amino acid sequences were obtained from Uniprot, aligned using ICM v3.8-4a and matched against the Pocketome encyclopedia. We used known co-crystal structures of selected target enzyme orthologs to evaluate the location of their active sites and binding pockets and to calculate a matrix of pairwise sequence identities across each target enzyme across the different species. This was used to generate sequence maps. High sequence identity of enzyme binding pockets, derived from experimentally determined co-crystallized structures, was observed among various species. Comparison at both full sequence level and for drug binding pockets of key metabolic enzymes showed that binding pockets are highly conserved (sequence similarity up to 100%) among various ESKAPE pathogens as well as Mycobacterium tuberculosis . Enzymes orthologs having conserved binding sites may have potential to interact with inhibitors in similar way and might be helpful for design of similar class of inhibitors for a particular species. The derived pocket alignments and distance-based maps provide guidelines for drug discovery and repurposing. In addition they also provide recommendations for the relevant model bacteria that may be used for initial drug testing. Comparing ligand binding sites through sequence identity calculation could be an effective approach to identify conserved orthologs as drug binding pockets have shown higher level of conservation among various species. By using this approach we could avoid the problems associated with full sequence comparison. We identified essential metabolic enzymes among ESKAPE pathogens that share high sequence identity in their putative drug binding pockets (up to 100%), of which known inhibitors can potentially antagonize these identical pockets in the various species in a similar manner.
2009-01-01
Background Chickpea (Cicer arietinum L.), an important grain legume crop of the world is seriously challenged by terminal drought and salinity stresses. However, very limited number of molecular markers and candidate genes are available for undertaking molecular breeding in chickpea to tackle these stresses. This study reports generation and analysis of comprehensive resource of drought- and salinity-responsive expressed sequence tags (ESTs) and gene-based markers. Results A total of 20,162 (18,435 high quality) drought- and salinity- responsive ESTs were generated from ten different root tissue cDNA libraries of chickpea. Sequence editing, clustering and assembly analysis resulted in 6,404 unigenes (1,590 contigs and 4,814 singletons). Functional annotation of unigenes based on BLASTX analysis showed that 46.3% (2,965) had significant similarity (≤1E-05) to sequences in the non-redundant UniProt database. BLASTN analysis of unique sequences with ESTs of four legume species (Medicago, Lotus, soybean and groundnut) and three model plant species (rice, Arabidopsis and poplar) provided insights on conserved genes across legumes as well as novel transcripts for chickpea. Of 2,965 (46.3%) significant unigenes, only 2,071 (32.3%) unigenes could be functionally categorised according to Gene Ontology (GO) descriptions. A total of 2,029 sequences containing 3,728 simple sequence repeats (SSRs) were identified and 177 new EST-SSR markers were developed. Experimental validation of a set of 77 SSR markers on 24 genotypes revealed 230 alleles with an average of 4.6 alleles per marker and average polymorphism information content (PIC) value of 0.43. Besides SSR markers, 21,405 high confidence single nucleotide polymorphisms (SNPs) in 742 contigs (with ≥ 5 ESTs) were also identified. Recognition sites for restriction enzymes were identified for 7,884 SNPs in 240 contigs. Hierarchical clustering of 105 selected contigs provided clues about stress- responsive candidate genes and their expression profile showed predominance in specific stress-challenged libraries. Conclusion Generated set of chickpea ESTs serves as a resource of high quality transcripts for gene discovery and development of functional markers associated with abiotic stress tolerance that will be helpful to facilitate chickpea breeding. Mapping of gene-based markers in chickpea will also add more anchoring points to align genomes of chickpea and other legume species. PMID:19912666
Comparative Analysis of 35 Basidiomycete Genomes Reveals Diversity and Uniqueness of the Phylum
DOE Office of Scientific and Technical Information (OSTI.GOV)
Riley, Robert; Salamov, Asaf; Otillar, Robert
Fungi of the phylum Basidiomycota (basidiomycetes), make up some 37percent of the described fungi, and are important in forestry, agriculture, medicine, and bioenergy. This diverse phylum includes symbionts, pathogens, and saprobes including wood decaying fungi. To better understand the diversity of this phylum we compared the genomes of 35 basidiomycete fungi including 6 newly sequenced genomes. The genomes of basidiomycetes span extremes of genome size, gene number, and repeat content. A phylogenetic tree of Basidiomycota was generated using the Phyldog software, which uses all available protein sequence data to simultaneously infer gene and species trees. Analysis of core genes revealsmore » that some 48percent of basidiomycete proteins are unique to the phylum with nearly half of those (22percent) comprising proteins found in only one organism. Phylogenetic patterns of plant biomass-degrading genes suggest a continuum rather than a sharp dichotomy between the white rot and brown rot modes of wood decay among the members of Agaricomycotina subphylum. There is a correlation of the profile of certain gene families to nutritional mode in Agaricomycotina. Based on phylogenetically-informed PCA analysis of such profiles, we predict that that Botryobasidium botryosum and Jaapia argillacea have properties similar to white rot species, although neither has liginolytic class II fungal peroxidases. Furthermore, we find that both fungi exhibit wood decay with white rot-like characteristics in growth assays. Analysis of the rate of discovery of proteins with no or few homologs suggests the high value of continued sequencing of basidiomycete fungi.« less
Louis, Ed
2011-01-01
In the early days of the yeast genome sequencing project, gene annotation was in its infancy and suffered the problem of many false positive annotations as well as missed genes. The lack of other sequences for comparison also prevented the annotation of conserved, functional sequences that were not coding. We are now in an era of comparative genomics where many closely related as well as more distantly related genomes are available for direct sequence and synteny comparisons allowing for more probable predictions of genes and other functional sequences due to conservation. We also have a plethora of functional genomics data which helps inform gene annotation for previously uncharacterised open reading frames (ORFs)/genes. For Saccharomyces cerevisiae this has resulted in a continuous updating of the gene and functional sequence annotations in the reference genome helping it retain its position as the best characterized eukaryotic organism's genome. A single reference genome for a species does not accurately describe the species and this is quite clear in the case of S. cerevisiae where the reference strain is not ideal for brewing or baking due to missing genes. Recent surveys of numerous isolates, from a variety of sources, using a variety of technologies have revealed a great deal of variation amongst isolates with genome sequence surveys providing information on novel genes, undetectable by other means. We now have a better understanding of the extant variation in S. cerevisiae as a species as well as some idea of how much we are missing from this understanding. As with gene annotation, comparative genomics enhances the discovery and description of genome variation and is providing us with the tools for understanding genome evolution, adaptation and selection, and underlying genetics of complex traits.
Mapping copy number variation by population-scale genome sequencing.
Mills, Ryan E; Walter, Klaudia; Stewart, Chip; Handsaker, Robert E; Chen, Ken; Alkan, Can; Abyzov, Alexej; Yoon, Seungtai Chris; Ye, Kai; Cheetham, R Keira; Chinwalla, Asif; Conrad, Donald F; Fu, Yutao; Grubert, Fabian; Hajirasouliha, Iman; Hormozdiari, Fereydoun; Iakoucheva, Lilia M; Iqbal, Zamin; Kang, Shuli; Kidd, Jeffrey M; Konkel, Miriam K; Korn, Joshua; Khurana, Ekta; Kural, Deniz; Lam, Hugo Y K; Leng, Jing; Li, Ruiqiang; Li, Yingrui; Lin, Chang-Yun; Luo, Ruibang; Mu, Xinmeng Jasmine; Nemesh, James; Peckham, Heather E; Rausch, Tobias; Scally, Aylwyn; Shi, Xinghua; Stromberg, Michael P; Stütz, Adrian M; Urban, Alexander Eckehart; Walker, Jerilyn A; Wu, Jiantao; Zhang, Yujun; Zhang, Zhengdong D; Batzer, Mark A; Ding, Li; Marth, Gabor T; McVean, Gil; Sebat, Jonathan; Snyder, Michael; Wang, Jun; Ye, Kenny; Eichler, Evan E; Gerstein, Mark B; Hurles, Matthew E; Lee, Charles; McCarroll, Steven A; Korbel, Jan O
2011-02-03
Genomic structural variants (SVs) are abundant in humans, differing from other forms of variation in extent, origin and functional impact. Despite progress in SV characterization, the nucleotide resolution architecture of most SVs remains unknown. We constructed a map of unbalanced SVs (that is, copy number variants) based on whole genome DNA sequencing data from 185 human genomes, integrating evidence from complementary SV discovery approaches with extensive experimental validations. Our map encompassed 22,025 deletions and 6,000 additional SVs, including insertions and tandem duplications. Most SVs (53%) were mapped to nucleotide resolution, which facilitated analysing their origin and functional impact. We examined numerous whole and partial gene deletions with a genotyping approach and observed a depletion of gene disruptions amongst high frequency deletions. Furthermore, we observed differences in the size spectra of SVs originating from distinct formation mechanisms, and constructed a map of SV hotspots formed by common mechanisms. Our analytical framework and SV map serves as a resource for sequencing-based association studies.
A taxonomy update for the family Polyomaviridae.
Calvignac-Spencer, Sébastien; Feltkamp, Mariet C W; Daugherty, Matthew D; Moens, Ugo; Ramqvist, Torbjörn; Johne, Reimar; Ehlers, Bernhard
2016-06-01
Many distinct polyomaviruses infecting a variety of vertebrate hosts have recently been discovered, and their complete genome sequence could often be determined. To accommodate this fast-growing diversity, the International Committee on Taxonomy of Viruses (ICTV) Polyomaviridae Study Group designed a host- and sequence-based rationale for an updated taxonomy of the family Polyomaviridae. Applying this resulted in numerous recommendations of taxonomical revisions, which were accepted by the Executive Committee of the ICTV in December 2015. New criteria for definition and creation of polyomavirus species were established that were based on the observed distance between large T antigen coding sequences. Four genera (Alpha-, Beta, Gamma- and Deltapolyomavirus) were delineated that together include 73 species. Species naming was made as systematic as possible - most species names now consist of the binomial name of the host species followed by polyomavirus and a number reflecting the order of discovery. It is hoped that this important update of the family taxonomy will serve as a stable basis for future taxonomical developments.
2012-01-01
Background To discover a compound inhibiting multiple proteins (i.e. polypharmacological targets) is a new paradigm for the complex diseases (e.g. cancers and diabetes). In general, the polypharmacological proteins often share similar local binding environments and motifs. As the exponential growth of the number of protein structures, to find the similar structural binding motifs (pharma-motifs) is an emergency task for drug discovery (e.g. side effects and new uses for old drugs) and protein functions. Results We have developed a Space-Related Pharmamotifs (called SRPmotif) method to recognize the binding motifs by searching against protein structure database. SRPmotif is able to recognize conserved binding environments containing spatially discontinuous pharma-motifs which are often short conserved peptides with specific physico-chemical properties for protein functions. Among 356 pharma-motifs, 56.5% interacting residues are highly conserved. Experimental results indicate that 81.1% and 92.7% polypharmacological targets of each protein-ligand complex are annotated with same biological process (BP) and molecular function (MF) terms, respectively, based on Gene Ontology (GO). Our experimental results show that the identified pharma-motifs often consist of key residues in functional (active) sites and play the key roles for protein functions. The SRPmotif is available at http://gemdock.life.nctu.edu.tw/SRP/. Conclusions SRPmotif is able to identify similar pharma-interfaces and pharma-motifs sharing similar binding environments for polypharmacological targets by rapidly searching against the protein structure database. Pharma-motifs describe the conservations of binding environments for drug discovery and protein functions. Additionally, these pharma-motifs provide the clues for discovering new sequence-based motifs to predict protein functions from protein sequence databases. We believe that SRPmotif is useful for elucidating protein functions and drug discovery. PMID:23281852
Chiu, Yi-Yuan; Lin, Chun-Yu; Lin, Chih-Ta; Hsu, Kai-Cheng; Chang, Li-Zen; Yang, Jinn-Moon
2012-01-01
To discover a compound inhibiting multiple proteins (i.e. polypharmacological targets) is a new paradigm for the complex diseases (e.g. cancers and diabetes). In general, the polypharmacological proteins often share similar local binding environments and motifs. As the exponential growth of the number of protein structures, to find the similar structural binding motifs (pharma-motifs) is an emergency task for drug discovery (e.g. side effects and new uses for old drugs) and protein functions. We have developed a Space-Related Pharmamotifs (called SRPmotif) method to recognize the binding motifs by searching against protein structure database. SRPmotif is able to recognize conserved binding environments containing spatially discontinuous pharma-motifs which are often short conserved peptides with specific physico-chemical properties for protein functions. Among 356 pharma-motifs, 56.5% interacting residues are highly conserved. Experimental results indicate that 81.1% and 92.7% polypharmacological targets of each protein-ligand complex are annotated with same biological process (BP) and molecular function (MF) terms, respectively, based on Gene Ontology (GO). Our experimental results show that the identified pharma-motifs often consist of key residues in functional (active) sites and play the key roles for protein functions. The SRPmotif is available at http://gemdock.life.nctu.edu.tw/SRP/. SRPmotif is able to identify similar pharma-interfaces and pharma-motifs sharing similar binding environments for polypharmacological targets by rapidly searching against the protein structure database. Pharma-motifs describe the conservations of binding environments for drug discovery and protein functions. Additionally, these pharma-motifs provide the clues for discovering new sequence-based motifs to predict protein functions from protein sequence databases. We believe that SRPmotif is useful for elucidating protein functions and drug discovery.
Discovery of phosphonic acid natural products by mining the genomes of 10,000 actinomycetes.
Ju, Kou-San; Gao, Jiangtao; Doroghazi, James R; Wang, Kwo-Kwang A; Thibodeaux, Christopher J; Li, Steven; Metzger, Emily; Fudala, John; Su, Joleen; Zhang, Jun Kai; Lee, Jaeheon; Cioni, Joel P; Evans, Bradley S; Hirota, Ryuichi; Labeda, David P; van der Donk, Wilfred A; Metcalf, William W
2015-09-29
Although natural products have been a particularly rich source of human medicines, activity-based screening results in a very high rate of rediscovery of known molecules. Based on the large number of natural product biosynthetic genes in microbial genomes, many have proposed "genome mining" as an alternative approach for discovery efforts; however, this idea has yet to be performed experimentally on a large scale. Here, we demonstrate the feasibility of large-scale, high-throughput genome mining by screening a collection of over 10,000 actinomycetes for the genetic potential to make phosphonic acids, a class of natural products with diverse and useful bioactivities. Genome sequencing identified a diverse collection of phosphonate biosynthetic gene clusters within 278 strains. These clusters were classified into 64 distinct groups, of which 55 are likely to direct the synthesis of unknown compounds. Characterization of strains within five of these groups resulted in the discovery of a new archetypical pathway for phosphonate biosynthesis, the first (to our knowledge) dedicated pathway for H-phosphinates, and 11 previously undescribed phosphonic acid natural products. Among these compounds are argolaphos, a broad-spectrum antibacterial phosphonopeptide composed of aminomethylphosphonate in peptide linkage to a rare amino acid N(5)-hydroxyarginine; valinophos, an N-acetyl l-Val ester of 2,3-dihydroxypropylphosphonate; and phosphonocystoximate, an unusual thiohydroximate-containing molecule representing a new chemotype of sulfur-containing phosphonate natural products. Analysis of the genome sequences from the remaining strains suggests that the majority of the phosphonate biosynthetic repertoire of Actinobacteria has been captured at the gene level. This dereplicated strain collection now provides a reservoir of numerous, as yet undiscovered, phosphonate natural products.
Discovery of phosphonic acid natural products by mining the genomes of 10,000 actinomycetes
Ju, Kou-San; Gao, Jiangtao; Doroghazi, James R.; Wang, Kwo-Kwang A.; Thibodeaux, Christopher J.; Li, Steven; Metzger, Emily; Fudala, John; Su, Joleen; Zhang, Jun Kai; Lee, Jaeheon; Cioni, Joel P.; Evans, Bradley S.; Hirota, Ryuichi; Labeda, David P.; van der Donk, Wilfred A.; Metcalf, William W.
2015-01-01
Although natural products have been a particularly rich source of human medicines, activity-based screening results in a very high rate of rediscovery of known molecules. Based on the large number of natural product biosynthetic genes in microbial genomes, many have proposed “genome mining” as an alternative approach for discovery efforts; however, this idea has yet to be performed experimentally on a large scale. Here, we demonstrate the feasibility of large-scale, high-throughput genome mining by screening a collection of over 10,000 actinomycetes for the genetic potential to make phosphonic acids, a class of natural products with diverse and useful bioactivities. Genome sequencing identified a diverse collection of phosphonate biosynthetic gene clusters within 278 strains. These clusters were classified into 64 distinct groups, of which 55 are likely to direct the synthesis of unknown compounds. Characterization of strains within five of these groups resulted in the discovery of a new archetypical pathway for phosphonate biosynthesis, the first (to our knowledge) dedicated pathway for H-phosphinates, and 11 previously undescribed phosphonic acid natural products. Among these compounds are argolaphos, a broad-spectrum antibacterial phosphonopeptide composed of aminomethylphosphonate in peptide linkage to a rare amino acid N5-hydroxyarginine; valinophos, an N-acetyl l-Val ester of 2,3-dihydroxypropylphosphonate; and phosphonocystoximate, an unusual thiohydroximate-containing molecule representing a new chemotype of sulfur-containing phosphonate natural products. Analysis of the genome sequences from the remaining strains suggests that the majority of the phosphonate biosynthetic repertoire of Actinobacteria has been captured at the gene level. This dereplicated strain collection now provides a reservoir of numerous, as yet undiscovered, phosphonate natural products. PMID:26324907
Towards Discovery and Targeted Peptide Biomarker Detection Using nanoESI-TIMS-TOF MS
NASA Astrophysics Data System (ADS)
Garabedian, Alyssa; Benigni, Paolo; Ramirez, Cesar E.; Baker, Erin S.; Liu, Tao; Smith, Richard D.; Fernandez-Lima, Francisco
2018-05-01
In the present work, the potential of trapped ion mobility spectrometry coupled to TOF mass spectrometry (TIMS-TOF MS) for discovery and targeted monitoring of peptide biomarkers from human-in-mouse xenograft tumor tissue was evaluated. In particular, a TIMS-MS workflow was developed for the detection and quantification of peptide biomarkers using internal heavy analogs, taking advantage of the high mobility resolution (R = 150-250) prior to mass analysis. Five peptide biomarkers were separated, identified, and quantified using offline nanoESI-TIMS-CID-TOF MS; the results were in good agreement with measurements using a traditional LC-ESI-MS/MS proteomics workflow. The TIMS-TOF MS analysis permitted peptide biomarker detection based on accurate mobility, mass measurements, and high sequence coverage for concentrations in the 10-200 nM range, while simultaneously achieving discovery measurements of not initially targeted peptides as markers from the same proteins and, eventually, other proteins. [Figure not available: see fulltext.
An algorithm of discovering signatures from DNA databases on a computer cluster.
Lee, Hsiao Ping; Sheu, Tzu-Fang
2014-10-05
Signatures are short sequences that are unique and not similar to any other sequence in a database that can be used as the basis to identify different species. Even though several signature discovery algorithms have been proposed in the past, these algorithms require the entirety of databases to be loaded in the memory, thus restricting the amount of data that they can process. It makes those algorithms unable to process databases with large amounts of data. Also, those algorithms use sequential models and have slower discovery speeds, meaning that the efficiency can be improved. In this research, we are debuting the utilization of a divide-and-conquer strategy in signature discovery and have proposed a parallel signature discovery algorithm on a computer cluster. The algorithm applies the divide-and-conquer strategy to solve the problem posed to the existing algorithms where they are unable to process large databases and uses a parallel computing mechanism to effectively improve the efficiency of signature discovery. Even when run with just the memory of regular personal computers, the algorithm can still process large databases such as the human whole-genome EST database which were previously unable to be processed by the existing algorithms. The algorithm proposed in this research is not limited by the amount of usable memory and can rapidly find signatures in large databases, making it useful in applications such as Next Generation Sequencing and other large database analysis and processing. The implementation of the proposed algorithm is available at http://www.cs.pu.edu.tw/~fang/DDCSDPrograms/DDCSD.htm.
NASA Astrophysics Data System (ADS)
Miatun, A.; Muntazhimah
2018-01-01
The aim of this research was to determine the effect of learning models on mathematics achievement viewed from student’s self-regulated learning. The learning model compared were discovery learning and problem-based learning. The population was all students at the grade VIII of Junior High School in Boyolali regency. The samples were students of SMPN 4 Boyolali, SMPN 6 Boyolali, and SMPN 4 Mojosongo. The instruments used were mathematics achievement tests and self-regulated learning questionnaire. The data were analyzed using unbalanced two-ways Anova. The conclusion was as follows: (1) discovery learning gives better achievement than problem-based learning. (2) Achievement of students who have high self-regulated learning was better than students who have medium and low self-regulated learning. (3) For discovery learning, achievement of students who have high self-regulated learning was better than students who have medium and low self-regulated learning. For problem-based learning, students who have high and medium self-regulated learning have the same achievement. (4) For students who have high self-regulated learning, discovery learning gives better achievement than problem-based learning. Students who have medium and low self-regulated learning, both learning models give the same achievement.
Zhao, Chao; Chu, Yanan; Li, Yanhong; Yang, Chengfeng; Chen, Yuqing; Wang, Xumin; Liu, Bin
2017-01-01
To analyze the microbial diversity and gene content of a thermophilic cellulose-degrading consortium from hot springs in Xiamen, China using 454 pyrosequencing for discovering cellulolytic enzyme resources. A thermophilic cellulose-degrading consortium, XM70 that was isolated from a hot spring, used sugarcane bagasse as sole carbon and energy source. DNA sequencing of the XM70 sample resulted in 349,978 reads with an average read length of 380 bases, accounting for 133,896,867 bases of sequence information. The characterization of sequencing reads and assembled contigs revealed that most microbes were derived from four phyla: Geobacillus (Firmicutes), Thermus, Bacillus, and Anoxybacillus. Twenty-eight homologous genes belonging to 15 glycoside hydrolase families were detected, including several cellulase genes. A novel hot spring metagenome-derived thermophilic cellulase was expressed and characterized. The application value of thermostable sugarcane bagasse-degrading enzymes is shown for production of cellulosic biofuel. The practical power of using a short-read-based metagenomic approach for harvesting novel microbial genes is also demonstrated.
RNA interference for functional genomics and improvement of cotton (Gossypium species)
USDA-ARS?s Scientific Manuscript database
RNA interference (RNAi), is a powerful new technology in the discovery of genetic sequence functions, and has become a valuable tool for functional genomics of cotton (Gossypium ssp.). The rapid adoption of RNAi has replaced previous antisense technology. RNAi has aided in the discovery of function ...
USDA-ARS?s Scientific Manuscript database
The next generation sequencing (NGS) technologies have opened a wealth of opportunities for plant breeding and genomics research, and changed the paradigms of marker detection, genotyping, and gene discovery. Abundant genomic resources have been generated using a whole genome resequencing (WGR) str...
2014-01-01
Background Variation in seed oil composition and content among soybean varieties is largely attributed to differences in transcript sequences and/or transcript accumulation of oil production related genes in seeds. Discovery and analysis of sequence and expression variations in these genes will accelerate soybean oil quality improvement. Results In an effort to identify these variations, we sequenced the transcriptomes of soybean seeds from nine lines varying in oil composition and/or total oil content. Our results showed that 69,338 distinct transcripts from 32,885 annotated genes were expressed in seeds. A total of 8,037 transcript expression polymorphisms and 50,485 transcript sequence polymorphisms (48,792 SNPs and 1,693 small Indels) were identified among the lines. Effects of the transcript polymorphisms on their encoded protein sequences and functions were predicted. The studies also provided independent evidence that the lack of FAD2-1A gene activity and a non-synonymous SNP in the coding sequence of FAB2C caused elevated oleic acid and stearic acid levels in soybean lines M23 and FAM94-41, respectively. Conclusions As a proof-of-concept, we developed an integrated RNA-seq and bioinformatics approach to identify and functionally annotate transcript polymorphisms, and demonstrated its high effectiveness for discovery of genetic and transcript variations that result in altered oil quality traits. The collection of transcript polymorphisms coupled with their predicted functional effects will be a valuable asset for further discovery of genes, gene variants, and functional markers to improve soybean oil quality. PMID:24755115
Westbrook, Jared W.; Chhatre, Vikram E.; Wu, Le-Shin; Chamala, Srikar; Neves, Leandro Gomide; Muñoz, Patricio; Martínez-García, Pedro J.; Neale, David B.; Kirst, Matias; Mockaitis, Keithanne; Nelson, C. Dana; Peter, Gary F.; Echt, Craig S.
2015-01-01
A consensus genetic map for Pinus taeda (loblolly pine) and Pinus elliottii (slash pine) was constructed by merging three previously published P. taeda maps with a map from a pseudo-backcross between P. elliottii and P. taeda. The consensus map positioned 3856 markers via genotyping of 1251 individuals from four pedigrees. It is the densest linkage map for a conifer to date. Average marker spacing was 0.6 cM and total map length was 2305 cM. Functional predictions of mapped genes were improved by aligning expressed sequence tags used for marker discovery to full-length P. taeda transcripts. Alignments to the P. taeda genome mapped 3305 scaffold sequences onto 12 linkage groups. The consensus genetic map was used to compare the genome-wide linkage disequilibrium in a population of distantly related P. taeda individuals (ADEPT2) used for association genetic studies and a multiple-family pedigree used for genomic selection (CCLONES). The prevalence and extent of LD was greater in CCLONES as compared to ADEPT2; however, extended LD with LGs or between LGs was rare in both populations. The average squared correlations, r2, between SNP alleles less than 1 cM apart were less than 0.05 in both populations and r2 did not decay substantially with genetic distance. The consensus map and analysis of linkage disequilibrium establish a foundation for comparative association mapping and genomic selection in P. taeda and P. elliottii. PMID:26068575
Lijavetzky, Diego; Cabezas, José Antonio; Ibáñez, Ana; Rodríguez, Virginia; Martínez-Zapater, José M
2007-01-01
Background Single-nucleotide polymorphisms (SNPs) are the most abundant type of DNA sequence polymorphisms. Their higher availability and stability when compared to simple sequence repeats (SSRs) provide enhanced possibilities for genetic and breeding applications such as cultivar identification, construction of genetic maps, the assessment of genetic diversity, the detection of genotype/phenotype associations, or marker-assisted breeding. In addition, the efficiency of these activities can be improved thanks to the ease with which SNP genotyping can be automated. Expressed sequence tags (EST) sequencing projects in grapevine are allowing for the in silico detection of multiple putative sequence polymorphisms within and among a reduced number of cultivars. In parallel, the sequence of the grapevine cultivar Pinot Noir is also providing thousands of polymorphisms present in this highly heterozygous genome. Still the general application of those SNPs requires further validation since their use could be restricted to those specific genotypes. Results In order to develop a large SNP set of wide application in grapevine we followed a systematic re-sequencing approach in a group of 11 grape genotypes corresponding to ancient unrelated cultivars as well as wild plants. Using this approach, we have sequenced 230 gene fragments, what represents the analysis of over 1 Mb of grape DNA sequence. This analysis has allowed the discovery of 1573 SNPs with an average of one SNP every 64 bp (one SNP every 47 bp in non-coding regions and every 69 bp in coding regions). Nucleotide diversity in grape (π = 0.0051) was found to be similar to values observed in highly polymorphic plant species such as maize. The average number of haplotypes per gene sequence was estimated as six, with three haplotypes representing over 83% of the analyzed sequences. Short-range linkage disequilibrium (LD) studies within the analyzed sequences indicate the existence of a rapid decay of LD within the selected grapevine genotypes. To validate the use of the detected polymorphisms in genetic mapping, cultivar identification and genetic diversity studies we have used the SNPlex™ genotyping technology in a sample of grapevine genotypes and segregating progenies. Conclusion These results provide accurate values for nucleotide diversity in coding sequences and a first estimate of short-range LD in grapevine. Using SNPlex™ genotyping we have shown the application of a set of discovered SNPs as molecular markers for cultivar identification, linkage mapping and genetic diversity studies. Thus, the combination a highly efficient re-sequencing approach and the SNPlex™ high throughput genotyping technology provide a powerful tool for grapevine genetic analysis. PMID:18021442
Proposal and Evaluation of BLE Discovery Process Based on New Features of Bluetooth 5.0.
Hernández-Solana, Ángela; Perez-Diaz-de-Cerio, David; Valdovinos, Antonio; Valenzuela, Jose Luis
2017-08-30
The device discovery process is one of the most crucial aspects in real deployments of sensor networks. Recently, several works have analyzed the topic of Bluetooth Low Energy (BLE) device discovery through analytical or simulation models limited to version 4.x. Non-connectable and non-scannable undirected advertising has been shown to be a reliable alternative for discovering a high number of devices in a relatively short time period. However, new features of Bluetooth 5.0 allow us to define a variant on the device discovery process, based on BLE scannable undirected advertising events, which results in higher discovering capacities and also lower power consumption. In order to characterize this new device discovery process, we experimentally model the real device behavior of BLE scannable undirected advertising events. Non-detection packet probability, discovery probability, and discovery latency for a varying number of devices and parameters are compared by simulations and experimental measurements. We demonstrate that our proposal outperforms previous works, diminishing the discovery time and increasing the potential user device density. A mathematical model is also developed in order to easily obtain a measure of the potential capacity in high density scenarios.
Proposal and Evaluation of BLE Discovery Process Based on New Features of Bluetooth 5.0
2017-01-01
The device discovery process is one of the most crucial aspects in real deployments of sensor networks. Recently, several works have analyzed the topic of Bluetooth Low Energy (BLE) device discovery through analytical or simulation models limited to version 4.x. Non-connectable and non-scannable undirected advertising has been shown to be a reliable alternative for discovering a high number of devices in a relatively short time period. However, new features of Bluetooth 5.0 allow us to define a variant on the device discovery process, based on BLE scannable undirected advertising events, which results in higher discovering capacities and also lower power consumption. In order to characterize this new device discovery process, we experimentally model the real device behavior of BLE scannable undirected advertising events. Non-detection packet probability, discovery probability, and discovery latency for a varying number of devices and parameters are compared by simulations and experimental measurements. We demonstrate that our proposal outperforms previous works, diminishing the discovery time and increasing the potential user device density. A mathematical model is also developed in order to easily obtain a measure of the potential capacity in high density scenarios. PMID:28867786
Johnston, Chad W; Skinnider, Michael A; Wyatt, Morgan A; Li, Xiang; Ranieri, Michael R M; Yang, Lian; Zechel, David L; Ma, Bin; Magarvey, Nathan A
2015-09-28
Bacterial natural products are a diverse and valuable group of small molecules, and genome sequencing indicates that the vast majority remain undiscovered. The prediction of natural product structures from biosynthetic assembly lines can facilitate their discovery, but highly automated, accurate, and integrated systems are required to mine the broad spectrum of sequenced bacterial genomes. Here we present a genome-guided natural products discovery tool to automatically predict, combinatorialize and identify polyketides and nonribosomal peptides from biosynthetic assembly lines using LC-MS/MS data of crude extracts in a high-throughput manner. We detail the directed identification and isolation of six genetically predicted polyketides and nonribosomal peptides using our Genome-to-Natural Products platform. This highly automated, user-friendly programme provides a means of realizing the potential of genetically encoded natural products.
Di Maro, Antimo; Chambery, Angela; Carafa, Vincenzo; Costantini, Susan; Colonna, Giovanni; Parente, Augusto
2009-03-01
The amino acid sequence and glycan structure of PD-L1, PD-L2 and PD-L3, type 1 ribosome-inactivating proteins isolated from Phytolacca dioica L. leaves, were determined using a combined approach based on peptide mapping, Edman degradation and ESI-Q-TOF MS in precursor ion discovery mode. The comparative analysis of the 261 amino acid residue sequences showed that PD-L1 and PD-L2 have identical primary structure, as it is the case of PD-L3 and PD-L4. Furthermore, the primary structure of PD-Ls 1-2 and PD-Ls 3-4 have 81.6% identity (85.1% similarity). The ESI-Q-TOF MS analysis confirmed that PD-Ls 1-3 were glycosylated at different sites. In particular, PD-L1 contained three glycidic chains with the well known paucidomannosidic structure (Man)(3) (GlcNAc)(2) (Fuc)(1) (Xyl)(1) linked to Asn10, Asn43 and Asn255. PD-L2 was glycosylated at Asn10 and Asn43, and PD-L3 was glycosylated only at Asn10. PD-L4 was confirmed to be not glycosylated. Despite an overall high structural similarity, the comparative modeling of PD-L1, PD-L2, PD-L3 and PD-L4 has shown potential influences of the glycidic chains on their adenine polynucleotide glycosylase activity on different substrates.
Melendrez, Melanie C.; Lange, Rachel K.; Cohan, Frederick M.; Ward, David M.
2011-01-01
Previous research has shown that sequences of 16S rRNA genes and 16S-23S rRNA internal transcribed spacer regions may not have enough genetic resolution to define all ecologically distinct Synechococcus populations (ecotypes) inhabiting alkaline, siliceous hot spring microbial mats. To achieve higher molecular resolution, we studied sequence variation in three protein-encoding loci sampled by PCR from 60°C and 65°C sites in the Mushroom Spring mat (Yellowstone National Park, WY). Sequences were analyzed using the ecotype simulation (ES) and AdaptML algorithms to identify putative ecotypes. Between 4 and 14 times more putative ecotypes were predicted from variation in protein-encoding locus sequences than from variation in 16S rRNA and 16S-23S rRNA internal transcribed spacer sequences. The number of putative ecotypes predicted depended on the number of sequences sampled and the molecular resolution of the locus. Chao estimates of diversity indicated that few rare ecotypes were missed. Many ecotypes hypothesized by sequence analyses were different in their habitat specificities, suggesting different adaptations to temperature or other parameters that vary along the flow channel. PMID:21169433
Jain, Mukesh; Chevala, V V S Narayana; Garg, Rohini
2014-11-01
MicroRNAs (miRNAs) are essential components of complex gene regulatory networks that orchestrate plant development. Although several genomic resources have been developed for the legume crop chickpea, miRNAs have not been discovered until now. For genome-wide discovery of miRNAs in chickpea (Cicer arietinum), we sequenced the small RNA content from seven major tissues/organs employing Illumina technology. About 154 million reads were generated, which represented more than 20 million distinct small RNA sequences. We identified a total of 440 conserved miRNAs in chickpea based on sequence similarity with known miRNAs in other plants. In addition, 178 novel miRNAs were identified using a miRDeep pipeline with plant-specific scoring. Some of the conserved and novel miRNAs with significant sequence similarity were grouped into families. The chickpea miRNAs targeted a wide range of mRNAs involved in diverse cellular processes, including transcriptional regulation (transcription factors), protein modification and turnover, signal transduction, and metabolism. Our analysis revealed several miRNAs with differential spatial expression. Many of the chickpea miRNAs were expressed in a tissue-specific manner. The conserved and differential expression of members of the same miRNA family in different tissues was also observed. Some of the same family members were predicted to target different chickpea mRNAs, which suggested the specificity and complexity of miRNA-mediated developmental regulation. This study, for the first time, reveals a comprehensive set of conserved and novel miRNAs along with their expression patterns and putative targets in chickpea, and provides a framework for understanding regulation of developmental processes in legumes. © The Author 2014. Published by Oxford University Press on behalf of the Society for Experimental Biology.
Scaglione, Davide; Lanteri, Sergio; Acquadro, Alberto; Lai, Zhao; Knapp, Steven J; Rieseberg, Loren; Portis, Ezio
2012-10-01
Cynara cardunculus (2n = 2× = 34) is a member of the Asteraceae family that contributes significantly to the agricultural economy of the Mediterranean basin. The species includes two cultivated varieties, globe artichoke and cardoon, which are grown mainly for food. Cynara cardunculus is an orphan crop species whose genome/transcriptome has been relatively unexplored, especially in comparison to other Asteraceae crops. Hence, there is a significant need to improve its genomic resources through the identification of novel genes and sequence-based markers, to design new breeding schemes aimed at increasing quality and crop productivity. We report the outcome of cDNA sequencing and assembly for eleven accessions of C. cardunculus. Sequencing of three mapping parental genotypes using Roche 454-Titanium technology generated 1.7 × 10⁶ reads, which were assembled into 38,726 reference transcripts covering 32 Mbp. Putative enzyme-encoding genes were annotated using the KEGG-database. Transcription factors and candidate resistance genes were surveyed as well. Paired-end sequencing was done for cDNA libraries of eight other representative C. cardunculus accessions on an Illumina Genome Analyzer IIx, generating 46 × 10⁶ reads. Alignment of the IGA and 454 reads to reference transcripts led to the identification of 195,400 SNPs with a Bayesian probability exceeding 95%; a validation rate of 90% was obtained by Sanger-sequencing of a subset of contigs. These results demonstrate that the integration of data from different NGS platforms enables large-scale transcriptome characterization, along with massive SNP discovery. This information will contribute to the dissection of key agricultural traits in C. cardunculus and facilitate the implementation of marker-assisted selection programs. © 2012 The Authors. Plant Biotechnology Journal © 2012 Society for Experimental Biology, Association of Applied Biologists and Blackwell Publishing Ltd.
Poisson Statistics of Combinatorial Library Sampling Predict False Discovery Rates of Screening
2017-01-01
Microfluidic droplet-based screening of DNA-encoded one-bead-one-compound combinatorial libraries is a miniaturized, potentially widely distributable approach to small molecule discovery. In these screens, a microfluidic circuit distributes library beads into droplets of activity assay reagent, photochemically cleaves the compound from the bead, then incubates and sorts the droplets based on assay result for subsequent DNA sequencing-based hit compound structure elucidation. Pilot experimental studies revealed that Poisson statistics describe nearly all aspects of such screens, prompting the development of simulations to understand system behavior. Monte Carlo screening simulation data showed that increasing mean library sampling (ε), mean droplet occupancy, or library hit rate all increase the false discovery rate (FDR). Compounds identified as hits on k > 1 beads (the replicate k class) were much more likely to be authentic hits than singletons (k = 1), in agreement with previous findings. Here, we explain this observation by deriving an equation for authenticity, which reduces to the product of a library sampling bias term (exponential in k) and a sampling saturation term (exponential in ε) setting a threshold that the k-dependent bias must overcome. The equation thus quantitatively describes why each hit structure’s FDR is based on its k class, and further predicts the feasibility of intentionally populating droplets with multiple library beads, assaying the micromixtures for function, and identifying the active members by statistical deconvolution. PMID:28682059
Gulati, Ashima; Somlo, Stefan
2018-05-01
The genesis of whole exome sequencing as a powerful tool for detailing the protein coding sequence of the human genome was conceptualized based on the availability of next-generation sequencing technology and knowledge of the human reference genome. The field of pediatric nephrology enriched with molecularly unsolved phenotypes is allowing the clinical and research application of whole exome sequencing to enable novel gene discovery and provide amendment of phenotypic misclassification. Recent studies in the field have informed us that newer high-throughput sequencing techniques are likely to be of high yield when applied in conjunction with conventional genomic approaches such as linkage analysis and other strategies used to focus subsequent analysis. They have also emphasized the need for the validation of novel genetic findings in large collaborative cohorts and the production of robust corroborative biological data. The well-structured application of comprehensive genomic testing in clinical and research arenas will hopefully continue to advance patient care and precision medicine, but does call for attention to be paid to its integrated challenges.
Nagaki, Kiyotaka; Shibata, Fukashi; Kanatani, Asaka; Kashihara, Kazunari; Murata, Minoru
2012-04-01
The centromere is a multi-functional complex comprising centromeric DNA and a number of proteins. To isolate unidentified centromeric DNA sequences, centromere-specific histone H3 variants (CENH3) and chromatin immunoprecipitation (ChIP) have been utilized in some plant species. However, anti-CENH3 antibody for ChIP must be raised in each species because of its species specificity. Production of the antibodies is time-consuming and costly, and it is not easy to produce ChIP-grade antibodies. In this study, we applied a HaloTag7-based chromatin affinity purification system to isolate centromeric DNA sequences in tobacco. This system required no specific antibody, and made it possible to apply a highly stringent wash to remove contaminated DNA. As a result, we succeeded in isolating five tandem repetitive DNA sequences in addition to the centromeric retrotransposons that were previously identified by ChIP. Three of the tandem repeats were centromere-specific sequences located on different chromosomes. These results confirm the validity of the HaloTag7-based chromatin affinity purification system as an alternative method to ChIP for isolating unknown centromeric DNA sequences. The discovery of more than two chromosome-specific centromeric DNA sequences indicates the mosaic structure of tobacco centromeres. © Springer-Verlag 2011
Identifying Group-Specific Sequences for Microbial Communities Using Long k-mer Sequence Signatures
Wang, Ying; Fu, Lei; Ren, Jie; Yu, Zhaoxia; Chen, Ting; Sun, Fengzhu
2018-01-01
Comparing metagenomic samples is crucial for understanding microbial communities. For different groups of microbial communities, such as human gut metagenomic samples from patients with a certain disease and healthy controls, identifying group-specific sequences offers essential information for potential biomarker discovery. A sequence that is present, or rich, in one group, but absent, or scarce, in another group is considered “group-specific” in our study. Our main purpose is to discover group-specific sequence regions between control and case groups as disease-associated markers. We developed a long k-mer (k ≥ 30 bps)-based computational pipeline to detect group-specific sequences at strain resolution free from reference sequences, sequence alignments, and metagenome-wide de novo assembly. We called our method MetaGO: Group-specific oligonucleotide analysis for metagenomic samples. An open-source pipeline on Apache Spark was developed with parallel computing. We applied MetaGO to one simulated and three real metagenomic datasets to evaluate the discriminative capability of identified group-specific markers. In the simulated dataset, 99.11% of group-specific logical 40-mers covered 98.89% disease-specific regions from the disease-associated strain. In addition, 97.90% of group-specific numerical 40-mers covered 99.61 and 96.39% of differentially abundant genome and regions between two groups, respectively. For a large-scale metagenomic liver cirrhosis (LC)-associated dataset, we identified 37,647 group-specific 40-mer features. Any one of the features can predict disease status of the training samples with the average of sensitivity and specificity higher than 0.8. The random forests classification using the top 10 group-specific features yielded a higher AUC (from ∼0.8 to ∼0.9) than that of previous studies. All group-specific 40-mers were present in LC patients, but not healthy controls. All the assembled 11 LC-specific sequences can be mapped to two strains of Veillonella parvula: UTDB1-3 and DSM2008. The experiments on the other two real datasets related to Inflammatory Bowel Disease and Type 2 Diabetes in Women consistently demonstrated that MetaGO achieved better prediction accuracy with fewer features compared to previous studies. The experiments showed that MetaGO is a powerful tool for identifying group-specific k-mers, which would be clinically applicable for disease prediction. MetaGO is available at https://github.com/VVsmileyx/MetaGO. PMID:29774017
Development of self-compressing BLSOM for comprehensive analysis of big sequence data.
Kikuchi, Akihito; Ikemura, Toshimichi; Abe, Takashi
2015-01-01
With the remarkable increase in genomic sequence data from various organisms, novel tools are needed for comprehensive analyses of available big sequence data. We previously developed a Batch-Learning Self-Organizing Map (BLSOM), which can cluster genomic fragment sequences according to phylotype solely dependent on oligonucleotide composition and applied to genome and metagenomic studies. BLSOM is suitable for high-performance parallel-computing and can analyze big data simultaneously, but a large-scale BLSOM needs a large computational resource. We have developed Self-Compressing BLSOM (SC-BLSOM) for reduction of computation time, which allows us to carry out comprehensive analysis of big sequence data without the use of high-performance supercomputers. The strategy of SC-BLSOM is to hierarchically construct BLSOMs according to data class, such as phylotype. The first-layer BLSOM was constructed with each of the divided input data pieces that represents the data subclass, such as phylotype division, resulting in compression of the number of data pieces. The second BLSOM was constructed with a total of weight vectors obtained in the first-layer BLSOMs. We compared SC-BLSOM with the conventional BLSOM by analyzing bacterial genome sequences. SC-BLSOM could be constructed faster than BLSOM and cluster the sequences according to phylotype with high accuracy, showing the method's suitability for efficient knowledge discovery from big sequence data.
Activity-based protein profiling: from enzyme chemistry to proteomic chemistry.
Cravatt, Benjamin F; Wright, Aaron T; Kozarich, John W
2008-01-01
Genome sequencing projects have provided researchers with a complete inventory of the predicted proteins produced by eukaryotic and prokaryotic organisms. Assignment of functions to these proteins represents one of the principal challenges for the field of proteomics. Activity-based protein profiling (ABPP) has emerged as a powerful chemical proteomic strategy to characterize enzyme function directly in native biological systems on a global scale. Here, we review the basic technology of ABPP, the enzyme classes addressable by this method, and the biological discoveries attributable to its application.
Sequence Bundles: a novel method for visualising, discovering and exploring sequence motifs
2014-01-01
Background We introduce Sequence Bundles--a novel data visualisation method for representing multiple sequence alignments (MSAs). We identify and address key limitations of the existing bioinformatics data visualisation methods (i.e. the Sequence Logo) by enabling Sequence Bundles to give salient visual expression to sequence motifs and other data features, which would otherwise remain hidden. Methods For the development of Sequence Bundles we employed research-led information design methodologies. Sequences are encoded as uninterrupted, semi-opaque lines plotted on a 2-dimensional reconfigurable grid. Each line represents a single sequence. The thickness and opacity of the stack at each residue in each position indicates the level of conservation and the lines' curved paths expose patterns in correlation and functionality. Several MSAs can be visualised in a composite image. The Sequence Bundles method is designed to favour a tangible, continuous and intuitive display of information. Results We have developed a software demonstration application for generating a Sequence Bundles visualisation of MSAs provided for the BioVis 2013 redesign contest. A subsequent exploration of the visualised line patterns allowed for the discovery of a number of interesting features in the dataset. Reported features include the extreme conservation of sequences displaying a specific residue and bifurcations of the consensus sequence. Conclusions Sequence Bundles is a novel method for visualisation of MSAs and the discovery of sequence motifs. It can aid in generating new insight and hypothesis making. Sequence Bundles is well disposed for future implementation as an interactive visual analytics software, which can complement existing visualisation tools. PMID:25237395
2012-01-01
Introduction Traditionally, genomic or transcriptomic data have been restricted to a few model or emerging model organisms, and to a handful of species of medical and/or environmental importance. Next-generation sequencing techniques have the capability of yielding massive amounts of gene sequence data for virtually any species at a modest cost. Here we provide a comparative analysis of de novo assembled transcriptomic data for ten non-model species of previously understudied animal taxa. Results cDNA libraries of ten species belonging to five animal phyla (2 Annelida [including Sipuncula], 2 Arthropoda, 2 Mollusca, 2 Nemertea, and 2 Porifera) were sequenced in different batches with an Illumina Genome Analyzer II (read length 100 or 150 bp), rendering between ca. 25 and 52 million reads per species. Read thinning, trimming, and de novo assembly were performed under different parameters to optimize output. Between 67,423 and 207,559 contigs were obtained across the ten species, post-optimization. Of those, 9,069 to 25,681 contigs retrieved blast hits against the NCBI non-redundant database, and approximately 50% of these were assigned with Gene Ontology terms, covering all major categories, and with similar percentages in all species. Local blasts against our datasets, using selected genes from major signaling pathways and housekeeping genes, revealed high efficiency in gene recovery compared to available genomes of closely related species. Intriguingly, our transcriptomic datasets detected multiple paralogues in all phyla and in nearly all gene pathways, including housekeeping genes that are traditionally used in phylogenetic applications for their purported single-copy nature. Conclusions We generated the first study of comparative transcriptomics across multiple animal phyla (comparing two species per phylum in most cases), established the first Illumina-based transcriptomic datasets for sponge, nemertean, and sipunculan species, and generated a tractable catalogue of annotated genes (or gene fragments) and protein families for ten newly sequenced non-model organisms, some of commercial importance (i.e., Octopus vulgaris). These comprehensive sets of genes can be readily used for phylogenetic analysis, gene expression profiling, developmental analysis, and can also be a powerful resource for gene discovery. The characterization of the transcriptomes of such a diverse array of animal species permitted the comparison of sequencing depth, functional annotation, and efficiency of genomic sampling using the same pipelines, which proved to be similar for all considered species. In addition, the datasets revealed their potential as a resource for paralogue detection, a recurrent concern in various aspects of biological inquiry, including phylogenetics, molecular evolution, development, and cellular biochemistry. PMID:23190771
Pavkov-Keller, Tea; Strohmeier, Gernot A.; Diepold, Matthias; Peeters, Wilco; Smeets, Natascha; Schürmann, Martin; Gruber, Karl; Schwab, Helmut; Steiner, Kerstin
2016-01-01
Transaminases are useful biocatalysts for the production of amino acids and chiral amines as intermediates for a broad range of drugs and fine chemicals. Here, we describe the discovery and characterisation of new transaminases from microorganisms which were enriched in selective media containing (R)-amines as sole nitrogen source. While most of the candidate proteins were clearly assigned to known subgroups of the fold IV family of PLP-dependent enzymes by sequence analysis and characterisation of their substrate specificity, some of them did not fit to any of these groups. The structure of one of these enzymes from Curtobacterium pusillum, which can convert d-amino acids and various (R)-amines with high enantioselectivity, was solved at a resolution of 2.4 Å. It shows significant differences especially in the active site compared to other transaminases of the fold IV family and thus indicates the existence of a new subgroup within this family. Although the discovered transaminases were not able to convert ketones in a reasonable time frame, overall, the enrichment-based approach was successful, as we identified two amine transaminases, which convert (R)-amines with high enantioselectivity, and can be used for a kinetic resolution of 1-phenylethylamine and analogues to obtain the (S)-amines with e.e.s >99%. PMID:27905516
2011-01-01
Background Biodiesel or ethanol derived from lipids or starch produced by microalgae may overcome many of the sustainability challenges previously ascribed to petroleum-based fuels and first generation plant-based biofuels. The paucity of microalgae genome sequences, however, limits gene-based biofuel feedstock optimization studies. Here we describe the sequencing and de novo transcriptome assembly for the non-model microalgae species, Dunaliella tertiolecta, and identify pathways and genes of importance related to biofuel production. Results Next generation DNA pyrosequencing technology applied to D. tertiolecta transcripts produced 1,363,336 high quality reads with an average length of 400 bases. Following quality and size trimming, ~ 45% of the high quality reads were assembled into 33,307 isotigs with a 31-fold coverage and 376,482 singletons. Assembled sequences and singletons were subjected to BLAST similarity searches and annotated with Gene Ontology (GO) and Kyoto Encyclopedia of Genes and Genomes (KEGG) orthology (KO) identifiers. These analyses identified the majority of lipid and starch biosynthesis and catabolism pathways in D. tertiolecta. Conclusions The construction of metabolic pathways involved in the biosynthesis and catabolism of fatty acids, triacylglycrols, and starch in D. tertiolecta as well as the assembled transcriptome provide a foundation for the molecular genetics and functional genomics required to direct metabolic engineering efforts that seek to enhance the quantity and character of microalgae-based biofuel feedstock. PMID:21401935
Chaves, Francisco A.; Lee, Alvin H.; Nayak, Jennifer; Richards, Katherine A.; Sant, Andrea J.
2012-01-01
The ability to track CD4 T cells elicited in response to pathogen infection or vaccination is critical because of the role these cells play in protective immunity. Coupled with advances in genome sequencing of pathogenic organisms, there is considerable appeal for implementation of computer-based algorithms to predict peptides that bind to the class II molecules, forming the complex recognized by CD4 T cells. Despite recent progress in this area, there is a paucity of data regarding their success in identifying actual pathogen-derived epitopes. In this study, we sought to rigorously evaluate the performance of multiple web-available algorithms by comparing their predictions and our results using purely empirical methods for epitope discovery in influenza that utilized overlapping peptides and cytokine Elispots, for three independent class II molecules. We analyzed the data in different ways, trying to anticipate how an investigator might use these computational tools for epitope discovery. We come to the conclusion that currently available algorithms can indeed facilitate epitope discovery, but all shared a high degree of false positive and false negative predictions. Therefore, efficiencies were low. We also found dramatic disparities among algorithms and between predicted IC50 values and true dissociation rates of peptide:MHC class II complexes. We suggest that improved success of predictive algorithms will depend less on changes in computational methods or increased data sets and more on changes in parameters used to “train” the algorithms that factor in elements of T cell repertoire and peptide acquisition by class II molecules. PMID:22467652
Fragment-based drug discovery and molecular docking in drug design.
Wang, Tao; Wu, Mian-Bin; Chen, Zheng-Jie; Chen, Hua; Lin, Jian-Ping; Yang, Li-Rong
2015-01-01
Fragment-based drug discovery (FBDD) has caused a revolution in the process of drug discovery and design, with many FBDD leads being developed into clinical trials or approved in the past few years. Compared with traditional high-throughput screening, it displays obvious advantages such as efficiently covering chemical space, achieving higher hit rates, and so forth. In this review, we focus on the most recent developments of FBDD for improving drug discovery, illustrating the process and the importance of FBDD. In particular, the computational strategies applied in the process of FBDD and molecular-docking programs are highlighted elaborately. In most cases, docking is used for predicting the ligand-receptor interaction modes and hit identification by structurebased virtual screening. The successful cases of typical significance and the hits identified most recently are discussed.
The discovery of Halictivirus resolves the Sinaivirus phylogeny.
Bigot, Diane; Dalmon, Anne; Roy, Bronwen; Hou, Chunsheng; Germain, Michèle; Romary, Manon; Deng, Shuai; Diao, Qingyun; Weinert, Lucy A; Cook, James M; Herniou, Elisabeth A; Gayral, Philippe
2017-11-01
By providing pollination services, bees are among the most important insects, both in ecological and economical terms. Combined next-generation and classical sequencing approaches were applied to discover and study new insect viruses potentially harmful to bees. A bioinformatics virus discovery pipeline was used on individual Illumina transcriptomes of 13 wild bees from three species from the genus Halictus and 30 ants from six species of the genera Messor and Aphaenogaster. This allowed the discovery and description of three sequences of a new virus termed Halictus scabiosae Adlikon virus (HsAV). Phylogenetic analyses of ORF1, RNA-dependent RNA-polymerase (RdRp) and capsid genes showed that HsAV is closely related to (+)ssRNA viruses of the unassigned Sinaivirus genus but distant enough to belong to a different new genus we called Halictivirus. In addition, our study of ant transcriptomes revealed the first four sinaivirus sequences from ants (Messor barbarus, M. capitatus and M. concolor). Maximum likelihood phylogenetic analyses were performed on a 594 nt fragment of the ORF1/RdRp region from 84 sinaivirus sequences, including 31 new Lake Sinai viruses (LSVs) from honey bees collected in five countries across the globe and the four ant viral sequences. The phylogeny revealed four main clades potentially representing different viral species infecting honey bees. Moreover, the ant viruses belonged to the LSV4 clade, suggesting a possible cross-species transmission between bees and ants. Lastly, wide honey bee screening showed that all four LSV clades have worldwide distributions with no obvious geographical segregation.
Daily life activity routine discovery in hemiparetic rehabilitation patients using topic models.
Seiter, J; Derungs, A; Schuster-Amft, C; Amft, O; Tröster, G
2015-01-01
Monitoring natural behavior and activity routines of hemiparetic rehabilitation patients across the day can provide valuable progress information for therapists and patients and contribute to an optimized rehabilitation process. In particular, continuous patient monitoring could add type, frequency and duration of daily life activity routines and hence complement standard clinical scores that are assessed for particular tasks only. Machine learning methods have been applied to infer activity routines from sensor data. However, supervised methods require activity annotations to build recognition models and thus require extensive patient supervision. Discovery methods, including topic models could provide patient routine information and deal with variability in activity and movement performance across patients. Topic models have been used to discover characteristic activity routine patterns of healthy individuals using activity primitives recognized from supervised sensor data. Yet, the applicability of topic models for hemiparetic rehabilitation patients and techniques to derive activity primitives without supervision needs to be addressed. We investigate, 1) whether a topic model-based activity routine discovery framework can infer activity routines of rehabilitation patients from wearable motion sensor data. 2) We compare the performance of our topic model-based activity routine discovery using rule-based and clustering-based activity vocabulary. We analyze the activity routine discovery in a dataset recorded with 11 hemiparetic rehabilitation patients during up to ten full recording days per individual in an ambulatory daycare rehabilitation center using wearable motion sensors attached to both wrists and the non-affected thigh. We introduce and compare rule-based and clustering-based activity vocabulary to process statistical and frequency acceleration features to activity words. Activity words were used for activity routine pattern discovery using topic models based on Latent Dirichlet Allocation. Discovered activity routine patterns were then mapped to six categorized activity routines. Using the rule-based approach, activity routines could be discovered with an average accuracy of 76% across all patients. The rule-based approach outperformed clustering by 10% and showed less confusions for predicted activity routines. Topic models are suitable to discover daily life activity routines in hemiparetic rehabilitation patients without trained classifiers and activity annotations. Activity routines show characteristic patterns regarding activity primitives including body and extremity postures and movement. A patient-independent rule set can be derived. Including expert knowledge supports successful activity routine discovery over completely data-driven clustering.
Next-generation sequencing of cancer genomes: back to the future
Walter, Matthew J; Graubert, Timothy A; DiPersio, John F; Mardis, Elaine R; Wilson, Richard K; Ley, Timothy J
2010-01-01
The systematic karyotyping of bone marrow cells was the first genomic approach used to personalize therapy for patients with leukemia. The paradigm established by cytogenetic studies in leukemia (from gene discovery to therapeutic intervention) now has the potential to be rapidly extended with the use of whole-genome sequencing approaches for cancer, which are now possible. We are now entering a period of exponential growth in cancer gene discovery that will provide many novel therapeutic targets for a large number of cancer types. Establishing the pathogenetic relevance of individual mutations is a major challenge that must be solved. However, after thousands of cancer genomes have been sequenced, the genetic rules of cancer will become known and new approaches for diagnosis, risk stratification and individualized treatment of cancer patients will surely follow. PMID:20161678
Bowerman, Bruce
2011-10-01
Molecular genetic investigation of the early Caenorhabditis elegans embryo has contributed substantially to the discovery and general understanding of the genes, pathways, and mechanisms that regulate and execute developmental and cell biological processes. Initially, worm geneticists relied exclusively on a classical genetics approach, isolating mutants with interesting phenotypes after mutagenesis and then determining the identity of the affected genes. Subsequently, the discovery of RNA interference (RNAi) led to a much greater reliance on a reverse genetics approach: reducing the function of known genes with RNAi and then observing the phenotypic consequences. Now the advent of next-generation DNA sequencing technologies and the ensuing ease and affordability of whole-genome sequencing are reviving the use of classical genetics to investigate early C. elegans embryogenesis.
NASA Astrophysics Data System (ADS)
Zhang, Hui; Zhai, Yuxiu; Yao, Lin; Jiang, Yanhua; Li, Fengling
2017-05-01
Chlamys farreri is an economically important mollusk that can accumulate excessive amounts of cadmium (Cd). Studying the molecular mechanism of Cd accumulation in bivalves is difficult because of the lack of genome background. Transcriptomic analysis based on high-throughput RNA sequencing has been shown to be an efficient and powerful method for the discovery of relevant genes in non-model and genome reference-free organisms. Here, we constructed two cDNA libraries (control and Cd exposure groups) from the digestive gland of C. farreri and compared the transcriptomic data between them. A total of 227 673 transcripts were assembled into 105 071 unigenes, most of which shared high similarity with sequences in the NCBI non-redundant protein database. For functional classification, 24 493 unigenes were assigned to Gene Ontology terms. Additionally, EuKaryotic Ortholog Groups and Kyoto Encyclopedia of Genes and Genomes analyses assigned 12 028 unigenes to 26 categories and 7 849 unigenes to five pathways, respectively. Comparative transcriptomics analysis identified 3 800 unigenes that were differentially expressed in the Cd-treated group compared with the control group. Among them, genes associated with heavy metal accumulation were screened, including metallothionein, divalent metal transporter, and metal tolerance protein. The functional genes and predicted pathways identified in our study will contribute to a better understanding of the metabolic and immune system in the digestive gland of C. farreri. In addition, the transcriptomic data will provide a comprehensive resource that may contribute to the understanding of molecular mechanisms that respond to marine pollutants in bivalves.
Iwasaki, Yuki; Abe, Takashi; Wada, Kennosuke; Wada, Yoshiko; Ikemura, Toshimichi
2013-11-20
With the remarkable increase of genomic sequence data of microorganisms, novel tools are needed for comprehensive analyses of the big sequence data available. The self-organizing map (SOM) is an effective tool for clustering and visualizing high-dimensional data, such as oligonucleotide composition on one map. By modifying the conventional SOM, we developed batch-learning SOM (BLSOM), which allowed classification of sequence fragments (e.g., 1 kb) according to phylotypes, solely depending on oligonucleotide composition. Metagenomics studies of uncultivable microorganisms in clinical and environmental samples should allow extensive surveys of genes important in life sciences. BLSOM is most suitable for phylogenetic assignment of metagenomic sequences, because fragmental sequences can be clustered according to phylotypes, solely depending on oligonucleotide composition. We first constructed oligonucleotide BLSOMs for all available sequences from genomes of known species, and by mapping metagenomic sequences on these large-scale BLSOMs, we can predict phylotypes of individual metagenomic sequences, revealing a microbial community structure of uncultured microorganisms, including viruses. BLSOM has shown that influenza viruses isolated from humans and birds clearly differ in oligonucleotide composition. Based on this host-dependent oligonucleotide composition, we have proposed strategies for predicting directional changes of virus sequences and for surveilling potentially hazardous strains when introduced into humans from non-human sources.
tRNADB-CE: tRNA gene database well-timed in the era of big sequence data.
Abe, Takashi; Inokuchi, Hachiro; Yamada, Yuko; Muto, Akira; Iwasaki, Yuki; Ikemura, Toshimichi
2014-01-01
The tRNA gene data base curated by experts "tRNADB-CE" (http://trna.ie.niigata-u.ac.jp) was constructed by analyzing 1,966 complete and 5,272 draft genomes of prokaryotes, 171 viruses', 121 chloroplasts', and 12 eukaryotes' genomes plus fragment sequences obtained by metagenome studies of environmental samples. 595,115 tRNA genes in total, and thus two times of genes compiled previously, have been registered, for which sequence, clover-leaf structure, and results of sequence-similarity and oligonucleotide-pattern searches can be browsed. To provide collective knowledge with help from experts in tRNA researches, we added a column for enregistering comments to each tRNA. By grouping bacterial tRNAs with an identical sequence, we have found high phylogenetic preservation of tRNA sequences, especially at the phylum level. Since many species-unknown tRNAs from metagenomic sequences have sequences identical to those found in species-known prokaryotes, the identical sequence group (ISG) can provide phylogenetic markers to investigate the microbial community in an environmental ecosystem. This strategy can be applied to a huge amount of short sequences obtained from next-generation sequencers, as showing that tRNADB-CE is a well-timed database in the era of big sequence data. It is also discussed that batch-learning self-organizing-map with oligonucleotide composition is useful for efficient knowledge discovery from big sequence data.
Assessing pooled BAC and whole genome shotgun strategies for assembly of complex genomes
2011-01-01
Background We investigate if pooling BAC clones and sequencing the pools can provide for more accurate assembly of genome sequences than the "whole genome shotgun" (WGS) approach. Furthermore, we quantify this accuracy increase. We compare the pooled BAC and WGS approaches using in silico simulations. Standard measures of assembly quality focus on assembly size and fragmentation, which are desirable for large whole genome assemblies. We propose additional measures enabling easy and visual comparison of assembly quality, such as rearrangements and redundant sequence content, relative to the known target sequence. Results The best assembly quality scores were obtained using 454 coverage of 15× linear and 5× paired (3kb insert size) reads (15L-5P) on Arabidopsis. This regime gave similarly good results on four additional plant genomes of very different GC and repeat contents. BAC pooling improved assembly scores over WGS assembly, coverage and redundancy scores improving the most. Conclusions BAC pooling works better than WGS, however, both require a physical map to order the scaffolds. Pool sizes up to 12Mbp work well, suggesting this pooling density to be effective in medium-scale re-sequencing applications such as targeted sequencing of QTL intervals for candidate gene discovery. Assuming the current Roche/454 Titanium sequencing limitations, a 12 Mbp region could be re-sequenced with a full plate of linear reads and a half plate of paired-end reads, yielding 15L-5P coverage after read pre-processing. Our simulation suggests that massively over-sequencing may not improve accuracy. Our scoring measures can be used generally to evaluate and compare results of simulated genome assemblies. PMID:21496274
Bernard, Guillaume; Pathmanathan, Jananan S; Lannes, Romain; Lopez, Philippe; Bapteste, Eric
2018-01-01
Abstract Microbes are the oldest and most widespread, phylogenetically and metabolically diverse life forms on Earth. However, they have been discovered only 334 years ago, and their diversity started to become seriously investigated even later. For these reasons, microbial studies that unveil novel microbial lineages and processes affecting or involving microbes deeply (and repeatedly) transform knowledge in biology. Considering the quantitative prevalence of taxonomically and functionally unassigned sequences in environmental genomics data sets, and that of uncultured microbes on the planet, we propose that unraveling the microbial dark matter should be identified as a central priority for biologists. Based on former empirical findings of microbial studies, we sketch a logic of discovery with the potential to further highlight the microbial unknowns. PMID:29420719
New strategies in drug discovery.
Ohlstein, Eliot H; Johnson, Anthony G; Elliott, John D; Romanic, Anne M
2006-01-01
Gene identification followed by determination of the expression of genes in a given disease and understanding of the function of the gene products is central to the drug discovery process. The ability to associate a specific gene with a disease can be attributed primarily to the extraordinary progress that has been made in the areas of gene sequencing and information technologies. Selection and validation of novel molecular targets have become of great importance in light of the abundance of new potential therapeutic drug targets that have emerged from human gene sequencing. In response to this revolution within the pharmaceutical industry, the development of high-throughput methods in both biology and chemistry has been necessitated. Further, the successful translation of basic scientific discoveries into clinical experimental medicine and novel therapeutics is an increasing challenge. As such, a new paradigm for drug discovery has emerged. This process involves the integration of clinical, genetic, genomic, and molecular phenotype data partnered with cheminformatics. Central to this process, the data generated are managed, collated, and interpreted with the use of informatics. This review addresses the use of new technologies that have arisen to deal with this new paradigm.
Gene Patents and Personalized Cancer Care: Impact of the Myriad Case on Clinical Oncology
Offit, Kenneth; Bradbury, Angela; Storm, Courtney; Merz, Jon F.; Noonan, Kevin E.; Spence, Rebecca
2013-01-01
Genomic discoveries have transformed the practice of oncology and cancer prevention. Diagnostic and therapeutic advances based on cancer genomics developed during a time when it was possible to patent genes. A case before the Supreme Court, Association for Molecular Pathology v Myriad Genetics, Inc seeks to overturn patents on isolated genes. Although the outcomes are uncertain, it is suggested here that the Supreme Court decision will have few immediate effects on oncology practice or research but may have more significant long-term impact. The Federal Circuit court has already rejected Myriad's broad diagnostic methods claims, and this is not affected by the Supreme Court decision. Isolated DNA patents were already becoming obsolete on scientific grounds, in an era when human DNA sequence is public knowledge and because modern methods of next-generation sequencing need not involve isolated DNA. The Association for Molecular Pathology v Myriad Supreme Court decision will have limited impact on new drug development, as new drug patents usually involve cellular methods. A nuanced Supreme Court decision acknowledging the scientific distinction between synthetic cDNA and genomic DNA will further mitigate any adverse impact. A Supreme Court decision to include or exclude all types of DNA from patent eligibility could impact future incentives for genomic discovery as well as the future delivery of medical care. Whatever the outcome of this important case, it is important that judicial and legislative actions in this area maximize genomic discovery while also ensuring patients' access to personalized cancer care. PMID:23766521
Huang, Chao-Wei; Lin, Yu-Tsung; Ding, Shih-Torng; Lo, Ling-Ling; Wang, Pei-Hwa; Lin, En-Chung; Liu, Fang-Wei; Lu, Yen-Wen
2015-01-01
The genetic markers associated with economic traits have been widely explored for animal breeding. Among these markers, single-nucleotide polymorphism (SNPs) are gradually becoming a prevalent and effective evaluation tool. Since SNPs only focus on the genetic sequences of interest, it thereby reduces the evaluation time and cost. Compared to traditional approaches, SNP genotyping techniques incorporate informative genetic background, improve the breeding prediction accuracy and acquiesce breeding quality on the farm. This article therefore reviews the typical procedures of animal breeding using SNPs and the current status of related techniques. The associated SNP information and genotyping techniques, including microarray and Lab-on-a-Chip based platforms, along with their potential are highlighted. Examples in pig and poultry with different SNP loci linked to high economic trait values are given. The recommendations for utilizing SNP genotyping in nimal breeding are summarized. PMID:27600241
Changes in Abundance of Oral Microbiota Associated with Oral Cancer
Schmidt, Brian L.; Kuczynski, Justin; Bhattacharya, Aditi; Huey, Bing; Corby, Patricia M.; Queiroz, Erica L. S.; Nightingale, Kira; Kerr, A. Ross; DeLacure, Mark D.; Veeramachaneni, Ratna; Olshen, Adam B.; Albertson, Donna G.
2014-01-01
Individual bacteria and shifts in the composition of the microbiome have been associated with human diseases including cancer. To investigate changes in the microbiome associated with oral cancers, we profiled cancers and anatomically matched contralateral normal tissue from the same patient by sequencing 16S rDNA hypervariable region amplicons. In cancer samples from both a discovery and a subsequent confirmation cohort, abundance of Firmicutes (especially Streptococcus) and Actinobacteria (especially Rothia) was significantly decreased relative to contralateral normal samples from the same patient. Significant decreases in abundance of these phyla were observed for pre-cancers, but not when comparing samples from contralateral sites (tongue and floor of mouth) from healthy individuals. Weighted UniFrac principal coordinates analysis based on 12 taxa separated most cancers from other samples with greatest separation of node positive cases. These studies begin to develop a framework for exploiting the oral microbiome for monitoring oral cancer development, progression and recurrence. PMID:24887397
Alignment-free genetic sequence comparisons: a review of recent approaches by word analysis.
Bonham-Carter, Oliver; Steele, Joe; Bastola, Dhundy
2014-11-01
Modern sequencing and genome assembly technologies have provided a wealth of data, which will soon require an analysis by comparison for discovery. Sequence alignment, a fundamental task in bioinformatics research, may be used but with some caveats. Seminal techniques and methods from dynamic programming are proving ineffective for this work owing to their inherent computational expense when processing large amounts of sequence data. These methods are prone to giving misleading information because of genetic recombination, genetic shuffling and other inherent biological events. New approaches from information theory, frequency analysis and data compression are available and provide powerful alternatives to dynamic programming. These new methods are often preferred, as their algorithms are simpler and are not affected by synteny-related problems. In this review, we provide a detailed discussion of computational tools, which stem from alignment-free methods based on statistical analysis from word frequencies. We provide several clear examples to demonstrate applications and the interpretations over several different areas of alignment-free analysis such as base-base correlations, feature frequency profiles, compositional vectors, an improved string composition and the D2 statistic metric. Additionally, we provide detailed discussion and an example of analysis by Lempel-Ziv techniques from data compression. © The Author 2013. Published by Oxford University Press. For Permissions, please email: journals.permissions@oup.com.
Li, Xihong; Cui, Zhaoxia; Liu, Yuan; Song, Chengwen; Shi, Guohui
2013-01-01
Background The Chinese mitten crab Eriocheir sinensis is an important economic crustacean and has been seriously attacked by various diseases, which requires more and more information for immune relevant genes on genome background. Recently, high-throughput RNA sequencing (RNA-seq) technology provides a powerful and efficient method for transcript analysis and immune gene discovery. Methods/Principal Findings A cDNA library from hepatopancreas of E. sinensis challenged by a mixture of three pathogen strains (Gram-positive bacteria Micrococcus luteus, Gram-negative bacteria Vibrio alginolyticus and fungi Pichia pastoris; 108 cfu·mL−1) was constructed and randomly sequenced using Illumina technique. Totally 39.76 million clean reads were assembled to 70,300 unigenes. After ruling out short-length and low-quality sequences, 52,074 non-redundant unigenes were compared to public databases for homology searching and 17,617 of them showed high similarity to sequences in NCBI non-redundant protein (Nr) database. For function classification and pathway assignment, 18,734 (36.00%) unigenes were categorized to three Gene Ontology (GO) categories, 12,243 (23.51%) were classified to 25 Clusters of Orthologous Groups (COG), and 8,983 (17.25%) were assigned to six Kyoto Encyclopedia of Genes and Genomes (KEGG) pathways. Potentially, 24, 14, 47 and 132 unigenes were characterized to be involved in Toll, IMD, JAK-STAT and MAPK pathways, respectively. Conclusions/Significance This is the first systematical transcriptome analysis of components relating to innate immune pathways in E. sinensis. Functional genes and putative pathways identified here will contribute to better understand immune system and prevent various diseases in crab. PMID:23874555
Clinical analysis of genome next-generation sequencing data using the Omicia platform
Coonrod, Emily M; Margraf, Rebecca L; Russell, Archie; Voelkerding, Karl V; Reese, Martin G
2013-01-01
Aims Next-generation sequencing is being implemented in the clinical laboratory environment for the purposes of candidate causal variant discovery in patients affected with a variety of genetic disorders. The successful implementation of this technology for diagnosing genetic disorders requires a rapid, user-friendly method to annotate variants and generate short lists of clinically relevant variants of interest. This report describes Omicia’s Opal platform, a new software tool designed for variant discovery and interpretation in a clinical laboratory environment. The software allows clinical scientists to process, analyze, interpret and report on personal genome files. Materials & Methods To demonstrate the software, the authors describe the interactive use of the system for the rapid discovery of disease-causing variants using three cases. Results & Conclusion Here, the authors show the features of the Opal system and their use in uncovering variants of clinical significance. PMID:23895124
Genome network medicine: innovation to overcome huge challenges in cancer therapy.
Roukos, Dimitrios H
2014-01-01
The post-ENCODE era shapes now a new biomedical research direction for understanding transcriptional and signaling networks driving gene expression and core cellular processes such as cell fate, survival, and apoptosis. Over the past half century, the Francis Crick 'central dogma' of single n gene/protein-phenotype (trait/disease) has defined biology, human physiology, disease, diagnostics, and drugs discovery. However, the ENCODE project and several other genomic studies using high-throughput sequencing technologies, computational strategies, and imaging techniques to visualize regulatory networks, provide evidence that transcriptional process and gene expression are regulated by highly complex dynamic molecular and signaling networks. This Focus article describes the linear experimentation-based limitations of diagnostics and therapeutics to cure advanced cancer and the need to move on from reductionist to network-based approaches. With evident a wide genomic heterogeneity, the power and challenges of next-generation sequencing (NGS) technologies to identify a patient's personal mutational landscape for tailoring the best target drugs in the individual patient are discussed. However, the available drugs are not capable of targeting aberrant signaling networks and research on functional transcriptional heterogeneity and functional genome organization is poorly understood. Therefore, the future clinical genome network medicine aiming at overcoming multiple problems in the new fields of regulatory DNA mapping, noncoding RNA, enhancer RNAs, and dynamic complexity of transcriptional circuitry are also discussed expecting in new innovation technology and strong appreciation of clinical data and evidence-based medicine. The problematic and potential solutions in the discovery of next-generation, molecular, and signaling circuitry-based biomarkers and drugs are explored. © 2013 Wiley Periodicals, Inc.
ITS1: a DNA barcode better than ITS2 in eukaryotes?
Wang, Xin-Cun; Liu, Chang; Huang, Liang; Bengtsson-Palme, Johan; Chen, Haimei; Zhang, Jian-Hui; Cai, Dayong; Li, Jian-Qin
2015-05-01
A DNA barcode is a short piece of DNA sequence used for species determination and discovery. The internal transcribed spacer (ITS/ITS2) region has been proposed as the standard DNA barcode for fungi and seed plants and has been widely used in DNA barcoding analyses for other biological groups, for example algae, protists and animals. The ITS region consists of both ITS1 and ITS2 regions. Here, a large-scale meta-analysis was carried out to compare ITS1 and ITS2 from three aspects: PCR amplification, DNA sequencing and species discrimination, in terms of the presence of DNA barcoding gaps, species discrimination efficiency, sequence length distribution, GC content distribution and primer universality. In total, 85 345 sequence pairs in 10 major groups of eukaryotes, including ascomycetes, basidiomycetes, liverworts, mosses, ferns, gymnosperms, monocotyledons, eudicotyledons, insects and fishes, covering 611 families, 3694 genera, and 19 060 species, were analysed. Using similarity-based methods, we calculated species discrimination efficiencies for ITS1 and ITS2 in all major groups, families and genera. Using Fisher's exact test, we found that ITS1 has significantly higher efficiencies than ITS2 in 17 of the 47 families and 20 of the 49 genera, which are sample-rich. By in silico PCR amplification evaluation, primer universality of the extensively applied ITS1 primers was found superior to that of ITS2 primers. Additionally, shorter length of amplification product and lower GC content was discovered to be two other advantages of ITS1 for sequencing. In summary, ITS1 represents a better DNA barcode than ITS2 for eukaryotic species. © 2014 John Wiley & Sons Ltd.
False positives complicate ancient pathogen identifications using high-throughput shotgun sequencing
2014-01-01
Background Identification of historic pathogens is challenging since false positives and negatives are a serious risk. Environmental non-pathogenic contaminants are ubiquitous. Furthermore, public genetic databases contain limited information regarding these species. High-throughput sequencing may help reliably detect and identify historic pathogens. Results We shotgun-sequenced 8 16th-century Mixtec individuals from the site of Teposcolula Yucundaa (Oaxaca, Mexico) who are reported to have died from the huey cocoliztli (‘Great Pestilence’ in Nahautl), an unknown disease that decimated native Mexican populations during the Spanish colonial period, in order to identify the pathogen. Comparison of these sequences with those deriving from the surrounding soil and from 4 precontact individuals from the site found a wide variety of contaminant organisms that confounded analyses. Without the comparative sequence data from the precontact individuals and soil, false positives for Yersinia pestis and rickettsiosis could have been reported. Conclusions False positives and negatives remain problematic in ancient DNA analyses despite the application of high-throughput sequencing. Our results suggest that several studies claiming the discovery of ancient pathogens may need further verification. Additionally, true single molecule sequencing’s short read lengths, inability to sequence through DNA lesions, and limited ancient-DNA-specific technical development hinder its application to palaeopathology. PMID:24568097
Next-generation sequencing: the future of molecular genetics in poultry production and food safety.
Diaz-Sanchez, S; Hanning, I; Pendleton, Sean; D'Souza, Doris
2013-02-01
The era of molecular biology and automation of the Sanger chain-terminator sequencing method has led to discovery and advances in diagnostics and biotechnology. The Sanger methodology dominated research for over 2 decades, leading to significant accomplishments and technological improvements in DNA sequencing. Next-generation high-throughput sequencing (HT-NGS) technologies were developed subsequently to overcome the limitations of this first generation technology that include higher speed, less labor, and lowered cost. Various platforms developed include sequencing-by-synthesis 454 Life Sciences, Illumina (Solexa) sequencing, SOLiD sequencing (among others), and the Ion Torrent semiconductor sequencing technologies that use different detection principles. As technology advances, progress made toward third generation sequencing technologies are being reported, which include Nanopore Sequencing and real-time monitoring of PCR activity through fluorescent resonant energy transfer. The advantages of these technologies include scalability, simplicity, with increasing DNA polymerase performance and yields, being less error prone, and even more economically feasible with the eventual goal of obtaining real-time results. These technologies can be directly applied to improve poultry production and enhance food safety. For example, sequence-based (determination of the gut microbial community, genes for metabolic pathways, or presence of plasmids) and function-based (screening for function such as antibiotic resistance, or vitamin production) metagenomic analysis can be carried out. Gut microbialflora/communities of poultry can be sequenced to determine the changes that affect health and disease along with efficacy of methods to control pathogenic growth. Thus, the purpose of this review is to provide an overview of the principles of these current technologies and their potential application to improve poultry production and food safety as well as public health.
Feltus, F A; Singh, H P; Lohithaswa, H C; Schulze, S R; Silva, T D; Paterson, A H
2006-04-01
Completed genome sequences provide templates for the design of genome analysis tools in orphan species lacking sequence information. To demonstrate this principle, we designed 384 PCR primer pairs to conserved exonic regions flanking introns, using Sorghum/Pennisetum expressed sequence tag alignments to the Oryza genome. Conserved-intron scanning primers (CISPs) amplified single-copy loci at 37% to 80% success rates in taxa that sample much of the approximately 50-million years of Poaceae divergence. While the conserved nature of exons fostered cross-taxon amplification, the lesser evolutionary constraints on introns enhanced single-nucleotide polymorphism detection. For example, in eight rice (Oryza sativa) genotypes, polymorphism averaged 12.1 per kb in introns but only 3.6 per kb in exons. Curiously, among 124 CISPs evaluated across Oryza, Sorghum, Pennisetum, Cynodon, Eragrostis, Zea, Triticum, and Hordeum, 23 (18.5%) seemed to be subject to rigid intron size constraints that were independent of per-nucleotide DNA sequence variation. Furthermore, we identified 487 conserved-noncoding sequence motifs in 129 CISP loci. A large CISP set (6,062 primer pairs, amplifying introns from 1,676 genes) designed using an automated pipeline showed generally higher abundance in recombinogenic than in nonrecombinogenic regions of the rice genome, thus providing relatively even distribution along genetic maps. CISPs are an effective means to explore poorly characterized genomes for both DNA polymorphism and noncoding sequence conservation on a genome-wide or candidate gene basis, and also provide anchor points for comparative genomics across a diverse range of species.
Feltus, F.A.; Singh, H.P.; Lohithaswa, H.C.; Schulze, S.R.; Silva, T.D.; Paterson, A.H.
2006-01-01
Completed genome sequences provide templates for the design of genome analysis tools in orphan species lacking sequence information. To demonstrate this principle, we designed 384 PCR primer pairs to conserved exonic regions flanking introns, using Sorghum/Pennisetum expressed sequence tag alignments to the Oryza genome. Conserved-intron scanning primers (CISPs) amplified single-copy loci at 37% to 80% success rates in taxa that sample much of the approximately 50-million years of Poaceae divergence. While the conserved nature of exons fostered cross-taxon amplification, the lesser evolutionary constraints on introns enhanced single-nucleotide polymorphism detection. For example, in eight rice (Oryza sativa) genotypes, polymorphism averaged 12.1 per kb in introns but only 3.6 per kb in exons. Curiously, among 124 CISPs evaluated across Oryza, Sorghum, Pennisetum, Cynodon, Eragrostis, Zea, Triticum, and Hordeum, 23 (18.5%) seemed to be subject to rigid intron size constraints that were independent of per-nucleotide DNA sequence variation. Furthermore, we identified 487 conserved-noncoding sequence motifs in 129 CISP loci. A large CISP set (6,062 primer pairs, amplifying introns from 1,676 genes) designed using an automated pipeline showed generally higher abundance in recombinogenic than in nonrecombinogenic regions of the rice genome, thus providing relatively even distribution along genetic maps. CISPs are an effective means to explore poorly characterized genomes for both DNA polymorphism and noncoding sequence conservation on a genome-wide or candidate gene basis, and also provide anchor points for comparative genomics across a diverse range of species. PMID:16607031
Peterson, K A; Yoshigi, M; Hazel, M W; Delker, D A; Lin, E; Krishnamurthy, C; Consiglio, N; Robson, J; Yandell, M; Clayton, F
2018-06-04
Although current American guidelines distinguish proton pump inhibitor-responsive oesophageal eosinophilia (PPI-REE) from eosinophilic oesophagitis (EoE), these entities are broadly similar. While two microarray studies showed that they have similar transcriptomes, more extensive RNA sequencing studies have not been done previously. To determine whether RNA sequencing identifies genetic markers distinguishing PPI-REE from EoE. We retrospectively examined 13 PPI-REE and 14 EoE biopsies, matched for tissue eosinophil content, and 14 normal controls. Patients and controls were not PPI-treated at the time of biopsy. We did RNA sequencing on formalin-fixed, paraffin-embedded tissue, with differential expression confirmation by quantitative polymerase chain reaction (PCR). We validated the use of formalin-fixed, paraffin-embedded vs RNAlater-preserved tissue, and compared our formalin-fixed, paraffin-embedded EoE results to a prior EoE study. By RNA sequencing, no genes were differentially expressed between the EoE and PPI-REE groups at the false discovery rate (FDR) ≤0.01 level. Compared to normal controls, 1996 genes were differentially expressed in the PPI-REE group and 1306 genes in the EoE group. By less stringent criteria, only MAPK8IP2 was differentially expressed between PPI-REE and EoE (FDR = 0.029, 2.2-fold less in EoE than in PPI-REE), with similar results by PCR. KCNJ2, which was differentially expressed in a prior study, was similar in the EoE and PPI-REE groups by both RNA sequencing and real-time PCR. Eosinophilic oesophagitis and PPI-REE have comparable transcriptomes, confirming that they are part of the same disease continuum. © 2018 John Wiley & Sons Ltd.
Identification of missing variants by combining multiple analytic pipelines.
Ren, Yingxue; Reddy, Joseph S; Pottier, Cyril; Sarangi, Vivekananda; Tian, Shulan; Sinnwell, Jason P; McDonnell, Shannon K; Biernacka, Joanna M; Carrasquillo, Minerva M; Ross, Owen A; Ertekin-Taner, Nilüfer; Rademakers, Rosa; Hudson, Matthew; Mainzer, Liudmila Sergeevna; Asmann, Yan W
2018-04-16
After decades of identifying risk factors using array-based genome-wide association studies (GWAS), genetic research of complex diseases has shifted to sequencing-based rare variants discovery. This requires large sample sizes for statistical power and has brought up questions about whether the current variant calling practices are adequate for large cohorts. It is well-known that there are discrepancies between variants called by different pipelines, and that using a single pipeline always misses true variants exclusively identifiable by other pipelines. Nonetheless, it is common practice today to call variants by one pipeline due to computational cost and assume that false negative calls are a small percent of total. We analyzed 10,000 exomes from the Alzheimer's Disease Sequencing Project (ADSP) using multiple analytic pipelines consisting of different read aligners and variant calling strategies. We compared variants identified by using two aligners in 50,100, 200, 500, 1000, and 1952 samples; and compared variants identified by adding single-sample genotyping to the default multi-sample joint genotyping in 50,100, 500, 2000, 5000 and 10,000 samples. We found that using a single pipeline missed increasing numbers of high-quality variants correlated with sample sizes. By combining two read aligners and two variant calling strategies, we rescued 30% of pass-QC variants at sample size of 2000, and 56% at 10,000 samples. The rescued variants had higher proportions of low frequency (minor allele frequency [MAF] 1-5%) and rare (MAF < 1%) variants, which are the very type of variants of interest. In 660 Alzheimer's disease cases with earlier onset ages of ≤65, 4 out of 13 (31%) previously-published rare pathogenic and protective mutations in APP, PSEN1, and PSEN2 genes were undetected by the default one-pipeline approach but recovered by the multi-pipeline approach. Identification of the complete variant set from sequencing data is the prerequisite of genetic association analyses. The current analytic practice of calling genetic variants from sequencing data using a single bioinformatics pipeline is no longer adequate with the increasingly large projects. The number and percentage of quality variants that passed quality filters but are missed by the one-pipeline approach rapidly increased with sample size.
Periasamy, Vengadesh; Rizan, Nastaran; Al-Ta’ii, Hassan Maktuff Jaber; Tan, Yee Shin; Tajuddin, Hairul Annuar; Iwamoto, Mitsumasa
2016-01-01
The discovery of semiconducting behavior of deoxyribonucleic acid (DNA) has resulted in a large number of literatures in the study of DNA electronics. Sequence-specific electronic response provides a platform towards understanding charge transfer mechanism and therefore the electronic properties of DNA. It is possible to utilize these characteristic properties to identify/detect DNA. In this current work, we demonstrate a novel method of DNA-based identification of basidiomycetes using current-voltage (I-V) profiles obtained from DNA-specific Schottky barrier diodes. Electronic properties such as ideality factor, barrier height, shunt resistance, series resistance, turn-on voltage, knee-voltage, breakdown voltage and breakdown current were calculated and used to quantify the identification process as compared to morphological and molecular characterization techniques. The use of these techniques is necessary in order to study biodiversity, but sometimes it can be misleading and unreliable and is not sufficiently useful for the identification of fungi genera. Many of these methods have failed when it comes to identification of closely related species of certain genus like Pleurotus. Our electronics profiles, both in the negative and positive bias regions were however found to be highly characteristic according to the base-pair sequences. We believe that this simple, low-cost and practical method could be useful towards identifying and detecting DNA in biotechnology and pathology. PMID:27435636
NASA Astrophysics Data System (ADS)
Periasamy, Vengadesh; Rizan, Nastaran; Al-Ta'Ii, Hassan Maktuff Jaber; Tan, Yee Shin; Tajuddin, Hairul Annuar; Iwamoto, Mitsumasa
2016-07-01
The discovery of semiconducting behavior of deoxyribonucleic acid (DNA) has resulted in a large number of literatures in the study of DNA electronics. Sequence-specific electronic response provides a platform towards understanding charge transfer mechanism and therefore the electronic properties of DNA. It is possible to utilize these characteristic properties to identify/detect DNA. In this current work, we demonstrate a novel method of DNA-based identification of basidiomycetes using current-voltage (I-V) profiles obtained from DNA-specific Schottky barrier diodes. Electronic properties such as ideality factor, barrier height, shunt resistance, series resistance, turn-on voltage, knee-voltage, breakdown voltage and breakdown current were calculated and used to quantify the identification process as compared to morphological and molecular characterization techniques. The use of these techniques is necessary in order to study biodiversity, but sometimes it can be misleading and unreliable and is not sufficiently useful for the identification of fungi genera. Many of these methods have failed when it comes to identification of closely related species of certain genus like Pleurotus. Our electronics profiles, both in the negative and positive bias regions were however found to be highly characteristic according to the base-pair sequences. We believe that this simple, low-cost and practical method could be useful towards identifying and detecting DNA in biotechnology and pathology.