Sample records for sequence datasets reveals

  1. Major soybean maturity gene haplotypes revealed by SNPViz analysis of 72 sequenced soybean genomes

    USDA-ARS's Scientific Manuscript database

    In this Genomics Era, vast amounts of next generation sequencing data have become publicly-available for multiple genomes across hundreds of species. Analysis of these large-scale datasets can become cumbersome, especially when comparing nucleotide polymorphisms across many samples within a dataset...

  2. fCCAC: functional canonical correlation analysis to evaluate covariance between nucleic acid sequencing datasets.

    PubMed

    Madrigal, Pedro

    2017-03-01

    Computational evaluation of variability across DNA or RNA sequencing datasets is a crucial step in genomic science, as it allows one both to evaluate the reproducibility of biological or technical replicates and to compare different datasets to identify their potential correlations. Here we present fCCAC, an application of functional canonical correlation analysis to assess covariance of nucleic acid sequencing datasets such as chromatin immunoprecipitation followed by deep sequencing (ChIP-seq). We show how this method differs from other measures of correlation, and exemplify how it can reveal shared covariance between histone modifications and DNA binding proteins, such as the relationship between the H3K4me3 chromatin mark and its epigenetic writers and readers. An R/Bioconductor package is available at http://bioconductor.org/packages/fCCAC/. Contact: pmb59@cam.ac.uk. Supplementary data are available at Bioinformatics online. © The Author 2016. Published by Oxford University Press.

  3. Analysis of plant-derived miRNAs in animal small RNA datasets

    PubMed Central

    2012-01-01

    Background: Plants contain significant quantities of small RNAs (sRNAs) derived from various sRNA biogenesis pathways. Many of these sRNAs play regulatory roles in plants. Previous analysis revealed that numerous sRNAs in corn, rice and soybean seeds have high sequence similarity to animal genes. However, exogenous RNA is considered to be unstable within the gastrointestinal tract of many animals, thus limiting the potential for any adverse effects from consumption of dietary RNA. A recent paper reported that putative plant miRNAs were detected in animal plasma and serum, presumably acquired through ingestion, and may have a functional impact in the consuming organisms. Results: To address the question of how common this phenomenon could be, we searched for plant miRNA sequences in public sRNA datasets from various tissues of mammals, chicken and insects. Our analyses revealed that plant miRNAs were present in the animal sRNA datasets and, significantly, that miR168 was extremely over-represented. Furthermore, all or nearly all (>96%) miR168 sequences were monocot derived for most datasets, including datasets for two insects reared on dicot plants in their respective experiments. To investigate whether plant-derived miRNAs, including miR168, could accumulate and move systemically in insects, we conducted insect feeding studies for three insects including corn rootworm, which has been shown to be responsive to plant-produced long double-stranded RNAs. Conclusions: Our analyses suggest that the observed plant miRNAs in animal sRNA datasets can originate in the process of sequencing, and that accumulation of plant miRNAs via dietary exposure is not universal in animals. PMID:22873950

  4. MIPE: A metagenome-based community structure explorer and SSU primer evaluation tool

    PubMed Central

    Zhou, Quan

    2017-01-01

    An understanding of microbial community structure is an important issue in the field of molecular ecology. The traditional molecular method involves amplification of small subunit ribosomal RNA (SSU rRNA) genes by polymerase chain reaction (PCR). However, PCR-based amplicon approaches are affected by primer bias and chimeras. With the development of high-throughput sequencing technology, unbiased SSU rRNA gene sequences can be mined from shotgun sequencing-based metagenomic or metatranscriptomic datasets to obtain a reflection of the microbial community structure in specific types of environment and to evaluate SSU primers. However, the use of short reads obtained through next-generation sequencing for primer evaluation has not been well resolved. The software MIPE (MIcrobiota metagenome Primer Explorer) was developed to adapt numerous short reads from metagenomes and metatranscriptomes. Using metagenomic or metatranscriptomic datasets as input, MIPE extracts and aligns rRNA to reveal detailed information on microbial composition and evaluate SSU rRNA primers. A mock dataset, a real Metagenomics Rapid Annotation using Subsystem Technology (MG-RAST) test dataset, two PrimerProspector test datasets and a real metatranscriptomic dataset were used to validate MIPE. The software calls Mothur (v1.33.3) and the SILVA database (v119) for the alignment and classification of rRNA genes from a metagenome or metatranscriptome. MIPE can effectively extract shotgun rRNA reads from a metagenome or metatranscriptome and is capable of classifying these sequences and exhibiting sensitivity to different SSU rRNA PCR primers. Therefore, MIPE can be used to guide primer design for specific environmental samples. PMID:28350876

  5. MOCCS: Clarifying DNA-binding motif ambiguity using ChIP-Seq data.

    PubMed

    Ozaki, Haruka; Iwasaki, Wataru

    2016-08-01

    As a key mechanism of gene regulation, transcription factors (TFs) bind to DNA by recognizing specific short sequence patterns that are called DNA-binding motifs. A single TF can accept ambiguity within its DNA-binding motifs, which comprise both canonical (typical) and non-canonical motifs. Clarification of such DNA-binding motif ambiguity is crucial for revealing gene regulatory networks and evaluating mutations in cis-regulatory elements. Although chromatin immunoprecipitation sequencing (ChIP-seq) now provides abundant data on the genomic sequences to which a given TF binds, existing motif discovery methods are unable to directly answer whether a given TF can bind to a specific DNA-binding motif. Here, we report a method for clarifying the DNA-binding motif ambiguity, MOCCS. Given ChIP-Seq data of any TF, MOCCS comprehensively analyzes and describes every k-mer to which that TF binds. Analysis of simulated datasets revealed that MOCCS is applicable to various ChIP-Seq datasets, requiring only a few minutes per dataset. Application to the ENCODE ChIP-Seq datasets proved that MOCCS directly evaluates whether a given TF binds to each DNA-binding motif, even if known position weight matrix models do not provide sufficient information on DNA-binding motif ambiguity. Furthermore, users are not required to provide numerous parameters or background genomic sequence models that are typically unavailable. MOCCS is implemented in Perl and R and is freely available via https://github.com/yuifu/moccs. By complementing existing motif-discovery software, MOCCS will contribute to the basic understanding of how the genome controls diverse cellular processes via DNA-protein interactions. Copyright © 2016 Elsevier Ltd. All rights reserved.
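
    The abstract above describes analyzing every k-mer in ChIP-Seq sequences. As a rough illustration of that first step only, the sketch below enumerates and counts k-mers in a set of sequences. It is not the MOCCS scoring method; the function name `kmer_counts` and the example sequences are hypothetical.

```python
from collections import Counter


def kmer_counts(sequences, k):
    """Count every k-mer (over A/C/G/T only) found in the given sequences."""
    if k <= 0:
        raise ValueError("k must be positive")
    counts = Counter()
    for seq in sequences:
        seq = seq.upper()
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            # Skip windows containing ambiguous bases such as N.
            if set(kmer) <= set("ACGT"):
                counts[kmer] += 1
    return counts


# Hypothetical sequences standing in for regions around ChIP-Seq peaks.
example = kmer_counts(["ACGTAC", "acgn"], 3)
```

    A real analysis would also track where each k-mer occurs relative to the peak position; that part is deliberately left out here.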

  6. Human action classification using procrustes shape theory

    NASA Astrophysics Data System (ADS)

    Cho, Wanhyun; Kim, Sangkyoon; Park, Soonyoung; Lee, Myungeun

    2015-02-01

    In this paper, we propose a new method that classifies a human action using Procrustes shape theory. First, we extract a pre-shape configuration vector of landmarks from each frame of an image sequence representing an arbitrary human action, and then derive the Procrustes fit vector for the pre-shape configuration vector. Second, we extract a set of pre-shape vectors from training samples stored in a database and compute a Procrustes mean shape vector for these pre-shape vectors. Third, we extract a sequence of pre-shape vectors from the input video and project this sequence onto the tangent space, taking as the pole the sequence of mean shape vectors corresponding to a target video. We then calculate the Procrustes distance between the sequence of projected pre-shape vectors on the tangent space and the sequence of mean shape vectors. Finally, we classify the input video into the human action class with the minimum Procrustes distance. We assess the performance of the proposed method on one public dataset, the Weizmann human action dataset. Experimental results reveal that the proposed method performs very well on this dataset.
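
    The minimum-distance classification idea in this abstract can be sketched for a single frame of planar landmarks. The code below is a simplified illustration, not the authors' implementation: it skips the tangent-space projection and the frame sequences, and uses the standard full Procrustes distance for 2-D shapes written with complex numbers. The names `preshape`, `procrustes_distance`, `classify`, and the example shapes are assumptions made for illustration.

```python
import math


def preshape(landmarks):
    """Center 2-D landmarks and scale them to unit size (complex form)."""
    z = [complex(x, y) for x, y in landmarks]
    centroid = sum(z) / len(z)
    centered = [p - centroid for p in z]
    size = math.sqrt(sum(abs(p) ** 2 for p in centered))
    if size == 0:
        raise ValueError("landmarks must not all coincide")
    return [p / size for p in centered]


def procrustes_distance(a, b):
    """Full Procrustes distance between two planar landmark configurations.

    Invariant to translation, scale, and rotation of either configuration.
    """
    if len(a) != len(b):
        raise ValueError("configurations need the same number of landmarks")
    za, zb = preshape(a), preshape(b)
    inner = sum(p.conjugate() * q for p, q in zip(za, zb))
    return math.sqrt(max(0.0, 1.0 - abs(inner) ** 2))


def classify(shape, class_means):
    """Return the label whose mean shape is closest to `shape`."""
    return min(class_means,
               key=lambda label: procrustes_distance(shape, class_means[label]))


# Hypothetical mean shapes for two classes, four landmarks each.
means = {
    "square": [(0, 0), (1, 0), (1, 1), (0, 1)],
    "bar": [(0, 0), (4, 0), (4, 1), (0, 1)],
}
# A square that has been rotated by 90 degrees, doubled in size, and shifted.
query = [(5, 5), (5, 7), (3, 7), (3, 5)]
label = classify(query, means)
```

    Because the distance ignores position, scale, and rotation, the transformed square is still assigned to the "square" class.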

  7. Construction of a large collection of small genome variations in French dairy and beef breeds using whole-genome sequences.

    PubMed

    Boussaha, Mekki; Michot, Pauline; Letaief, Rabia; Hozé, Chris; Fritz, Sébastien; Grohs, Cécile; Esquerré, Diane; Duchesne, Amandine; Philippe, Romain; Blanquet, Véronique; Phocas, Florence; Floriot, Sandrine; Rocha, Dominique; Klopp, Christophe; Capitan, Aurélien; Boichard, Didier

    2016-11-15

    In recent years, several bovine genome sequencing projects were carried out with the aim of developing genomic tools to improve dairy and beef production efficiency and sustainability. In this study, we describe the first French cattle genome variation dataset obtained by sequencing 274 whole genomes representing several major dairy and beef breeds. This dataset contains over 28 million single nucleotide polymorphisms (SNPs) and small insertions and deletions. Comparisons between sequencing results and SNP array genotypes revealed a very high genotype concordance rate, which indicates the good quality of our data. To our knowledge, this is the first large-scale catalog of small genomic variations in French dairy and beef cattle. This resource will contribute to the study of gene functions and population structure and also help to improve traits through genotype-guided selection.
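
    A genotype concordance rate of the kind mentioned above is commonly computed as the fraction of markers, genotyped by both methods, for which the two calls agree. The sketch below assumes that definition; the abstract does not spell out the exact procedure used in the study, and the function name, data layout, and marker names are hypothetical.

```python
def genotype_concordance(seq_calls, array_calls):
    """Fraction of shared, non-missing markers with identical genotypes.

    Genotypes are strings of alleles such as "AG"; allele order is ignored,
    so "AG" and "GA" count as the same call. Missing calls are None.
    """
    compared = matched = 0
    for marker, seq_gt in seq_calls.items():
        array_gt = array_calls.get(marker)
        if seq_gt is None or array_gt is None:
            continue  # only markers called by both methods are compared
        compared += 1
        if sorted(seq_gt) == sorted(array_gt):
            matched += 1
    if compared == 0:
        raise ValueError("no markers were called by both methods")
    return matched / compared


# Hypothetical calls for one animal.
sequencing = {"snp1": "AG", "snp2": "CC", "snp3": "TT", "snp4": None}
array = {"snp1": "GA", "snp2": "CT", "snp3": "TT", "snp5": "AA"}
rate = genotype_concordance(sequencing, array)
```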

  8. Recovering complete and draft population genomes from metagenome datasets

    DOE PAGES

    Sangwan, Naseer; Xia, Fangfang; Gilbert, Jack A.

    2016-03-08

    Assembly of metagenomic sequence data into microbial genomes is of fundamental value to improving our understanding of microbial ecology and metabolism by elucidating the functional potential of hard-to-culture microorganisms. Here, we provide a synthesis of available methods to bin metagenomic contigs into species-level groups and highlight how genetic diversity, sequencing depth, and coverage influence binning success. Despite the computational cost on application to deeply sequenced complex metagenomes (e.g., soil), covarying patterns of contig coverage across multiple datasets significantly improve the binning process. We also discuss and compare current genome validation methods and reveal how these methods tackle the problem of chimeric genome bins, i.e., sequences from multiple species. Finally, we explore how population genome assembly can be used to uncover biogeographic trends and to characterize the effect of in situ functional constraints on the genome-wide evolution.

  9. Recovering complete and draft population genomes from metagenome datasets

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Sangwan, Naseer; Xia, Fangfang; Gilbert, Jack A.

    Assembly of metagenomic sequence data into microbial genomes is of fundamental value to improving our understanding of microbial ecology and metabolism by elucidating the functional potential of hard-to-culture microorganisms. Here, we provide a synthesis of available methods to bin metagenomic contigs into species-level groups and highlight how genetic diversity, sequencing depth, and coverage influence binning success. Despite the computational cost on application to deeply sequenced complex metagenomes (e.g., soil), covarying patterns of contig coverage across multiple datasets significantly improve the binning process. We also discuss and compare current genome validation methods and reveal how these methods tackle the problem of chimeric genome bins, i.e., sequences from multiple species. Finally, we explore how population genome assembly can be used to uncover biogeographic trends and to characterize the effect of in situ functional constraints on the genome-wide evolution.

  10. A Comprehensive, Automatically Updated Fungal ITS Sequence Dataset for Reference-Based Chimera Control in Environmental Sequencing Efforts.

    PubMed

    Nilsson, R Henrik; Tedersoo, Leho; Ryberg, Martin; Kristiansson, Erik; Hartmann, Martin; Unterseher, Martin; Porter, Teresita M; Bengtsson-Palme, Johan; Walker, Donald M; de Sousa, Filipe; Gamper, Hannes Andres; Larsson, Ellen; Larsson, Karl-Henrik; Kõljalg, Urmas; Edgar, Robert C; Abarenkov, Kessy

    2015-01-01

    The nuclear ribosomal internal transcribed spacer (ITS) region is the most commonly chosen genetic marker for the molecular identification of fungi in environmental sequencing and molecular ecology studies. Several analytical issues complicate such efforts, one of which is the formation of chimeric (artificially joined) DNA sequences during PCR amplification or sequence assembly. Several software tools are currently available for chimera detection, but rely to various degrees on the presence of a chimera-free reference dataset for optimal performance. However, no such dataset is available for use with the fungal ITS region. This study introduces a comprehensive, automatically updated reference dataset for fungal ITS sequences based on the UNITE database for the molecular identification of fungi. This dataset supports chimera detection throughout the fungal kingdom and for full-length ITS sequences as well as partial (ITS1 or ITS2 only) datasets. The performance of the dataset on a large set of artificial chimeras was above 99.5%, and we subsequently used the dataset to remove nearly 1,000 compromised fungal ITS sequences from public circulation. The dataset is available at http://unite.ut.ee/repository.php and is subject to web-based third-party curation.

  11. A Comprehensive, Automatically Updated Fungal ITS Sequence Dataset for Reference-Based Chimera Control in Environmental Sequencing Efforts

    PubMed Central

    Nilsson, R. Henrik; Tedersoo, Leho; Ryberg, Martin; Kristiansson, Erik; Hartmann, Martin; Unterseher, Martin; Porter, Teresita M.; Bengtsson-Palme, Johan; Walker, Donald M.; de Sousa, Filipe; Gamper, Hannes Andres; Larsson, Ellen; Larsson, Karl-Henrik; Kõljalg, Urmas; Edgar, Robert C.; Abarenkov, Kessy

    2015-01-01

    The nuclear ribosomal internal transcribed spacer (ITS) region is the most commonly chosen genetic marker for the molecular identification of fungi in environmental sequencing and molecular ecology studies. Several analytical issues complicate such efforts, one of which is the formation of chimeric—artificially joined—DNA sequences during PCR amplification or sequence assembly. Several software tools are currently available for chimera detection, but rely to various degrees on the presence of a chimera-free reference dataset for optimal performance. However, no such dataset is available for use with the fungal ITS region. This study introduces a comprehensive, automatically updated reference dataset for fungal ITS sequences based on the UNITE database for the molecular identification of fungi. This dataset supports chimera detection throughout the fungal kingdom and for full-length ITS sequences as well as partial (ITS1 or ITS2 only) datasets. The performance of the dataset on a large set of artificial chimeras was above 99.5%, and we subsequently used the dataset to remove nearly 1,000 compromised fungal ITS sequences from public circulation. The dataset is available at http://unite.ut.ee/repository.php and is subject to web-based third-party curation. PMID:25786896

  12. Analyses of mitochondrial amino acid sequence datasets support the proposal that specimens of Hypodontus macropi from three species of macropodid hosts represent distinct species

    PubMed Central

    2013-01-01

    Background: Hypodontus macropi is a common intestinal nematode of a range of kangaroos and wallabies (macropodid marsupials). Based on previous multilocus enzyme electrophoresis (MEE) and nuclear ribosomal DNA sequence data sets, H. macropi has been proposed to be a complex of species. To test this proposal using independent molecular data, we sequenced the whole mitochondrial (mt) genomes of individuals of H. macropi from three different species of hosts (Macropus robustus robustus, Thylogale billardierii and Macropus [Wallabia] bicolor) as well as that of Macropicola ocydromi (a related nematode), and undertook a comparative analysis of the amino acid sequence datasets derived from these genomes. Results: The mt genomes sequenced by next-generation (454) technology from H. macropi from the three host species varied from 13,634 bp to 13,699 bp in size. Pairwise comparisons of the amino acid sequences predicted from these three mt genomes revealed differences of 5.8% to 18%. Phylogenetic analysis of the amino acid sequence data sets using Bayesian Inference (BI) showed that H. macropi from the three different host species formed distinct, well-supported clades. In addition, sliding window analysis of the mt genomes defined variable regions for future population genetic studies of H. macropi in different macropodid hosts and geographical regions around Australia. Conclusions: The present analyses of inferred mt protein sequence datasets clearly supported the hypothesis that H. macropi from M. robustus robustus, M. bicolor and T. billardierii represent distinct species. PMID:24261823
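
    A pairwise percentage difference between amino acid sequences, like the 5.8% to 18% range reported above, is often computed as the share of aligned positions at which two sequences differ. The sketch below assumes the sequences are already aligned to equal length and that gapped positions are ignored; the abstract does not state how the study handled gaps, and the function name and example sequences are hypothetical.

```python
def percent_difference(seq_a, seq_b, gap="-"):
    """Percentage of aligned, ungapped positions at which two sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    compared = differing = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a == gap or b == gap:
            continue  # skip positions with a gap in either sequence
        compared += 1
        if a != b:
            differing += 1
    if compared == 0:
        raise ValueError("no ungapped positions to compare")
    return 100.0 * differing / compared


# Two short hypothetical aligned protein fragments.
diff = percent_difference("MKTAYIAK", "MKSAYLAK")
```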

  13. Breaking barriers and halting rupture: the 2016 Amatrice-Visso-Castelluccio earthquake sequence, central Italy

    NASA Astrophysics Data System (ADS)

    Gregory, L. C.; Walters, R. J.; Wedmore, L. N. J.; Craig, T. J.; McCaffrey, K. J. W.; Wilkinson, M. W.; Livio, F.; Michetti, A.; Goodall, H.; Li, Z.; Chen, J.; De Martini, P. M.

    2017-12-01

    In 2016 the Central Italian Apennines was struck by a sequence of normal faulting earthquakes that ruptured in three separate events on the 24th August (Mw 6.2), the 26th Oct (Mw 6.1), and the 30th Oct (Mw 6.6). We reveal the complex nature of the individual events and the time-evolution of the sequence using multiple datasets. We will present an overview of the results from field geology, satellite geodesy, GNSS (including low-cost short baseline installations), and terrestrial laser scanning (TLS). Sequences of earthquakes of mid to high magnitude 6 are common in historical and seismological records in Italy and other similar tectonic settings globally. Multi-fault rupture during these sequences can occur in seconds, as in the M 6.9 1980 Irpinia earthquake, or can span days, months, or years (e.g. the 1703 Norcia-L'Aquila sequence). It is critical to determine why the causative faults in the 2016 sequence did not rupture simultaneously, and how this relates to fault segmentation and structural barriers. This is the first sequence of this kind to be observed using modern geodetic techniques, and only with all of the datasets combined can we begin to understand how and why the sequence evolved in time and space. We show that earthquake rupture both broke through structural barriers that were thought to exist, but was also inhibited by a previously unknown structure. We will also discuss the logistical challenges in generating datasets on the time-evolving sequence, and show how rapid response and international collaboration within the Open EMERGEO Working Group was critical for gaining a complete picture of the ongoing activity.

  14. Deep sequencing of the Camellia sinensis transcriptome revealed candidate genes for major metabolic pathways of tea-specific compounds

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Shi, CY; Yang, H; Wei, CL

    Tea is one of the most popular non-alcoholic beverages worldwide. However, the tea plant, Camellia sinensis, is difficult to culture in vitro, to transform, and has a large genome, rendering little genomic information available. Recent advances in large-scale RNA sequencing (RNA-seq) provide a fast, cost-effective, and reliable approach to generate large expression datasets for functional genomic analysis, which is especially suitable for non-model species with un-sequenced genomes. Using high-throughput Illumina RNA-seq, the transcriptome from poly(A)+ RNA of C. sinensis was analyzed at an unprecedented depth (2.59 gigabase pairs). Approximately 34.5 million reads were obtained, trimmed, and assembled into 127,094 unigenes, with an average length of 355 bp and an N50 of 506 bp, which consisted of 788 contig clusters and 126,306 singletons. This number of unigenes was 10-fold higher than existing C. sinensis sequences deposited in GenBank (as of August 2010). Sequence similarity analyses against six public databases (Uniprot, NR and COGs at NCBI, Pfam, InterPro and KEGG) found 55,088 unigenes that could be annotated with gene descriptions, conserved protein domains, or gene ontology terms. Some of the unigenes were assigned to putative metabolic pathways. Targeted searches using these annotations identified the majority of genes associated with several primary metabolic pathways and natural product pathways that are important to tea quality, such as flavonoid, theanine and caffeine biosynthesis pathways. Novel candidate genes of these secondary pathways were discovered. Comparisons with four previously prepared cDNA libraries revealed that this transcriptome dataset has both a high degree of consistency with previous EST data and an approximately 20-fold increase in coverage. Thirteen unigenes related to theanine and flavonoid synthesis were validated. Their expression patterns in different organs of the tea plant were analyzed by RT-PCR and quantitative real time PCR (qRT-PCR). An extensive transcriptome dataset has been obtained from the deep sequencing of tea plant. The coverage of the transcriptome is comprehensive enough to discover all known genes of several major metabolic pathways. This transcriptome dataset can serve as an important public information platform for gene expression, genomics, and functional genomic studies in C. sinensis.

  15. Deep sequencing of the Camellia sinensis transcriptome revealed candidate genes for major metabolic pathways of tea-specific compounds

    PubMed Central

    2011-01-01

    Background: Tea is one of the most popular non-alcoholic beverages worldwide. However, the tea plant, Camellia sinensis, is difficult to culture in vitro, to transform, and has a large genome, rendering little genomic information available. Recent advances in large-scale RNA sequencing (RNA-seq) provide a fast, cost-effective, and reliable approach to generate large expression datasets for functional genomic analysis, which is especially suitable for non-model species with un-sequenced genomes. Results: Using high-throughput Illumina RNA-seq, the transcriptome from poly(A)+ RNA of C. sinensis was analyzed at an unprecedented depth (2.59 gigabase pairs). Approximately 34.5 million reads were obtained, trimmed, and assembled into 127,094 unigenes, with an average length of 355 bp and an N50 of 506 bp, which consisted of 788 contig clusters and 126,306 singletons. This number of unigenes was 10-fold higher than existing C. sinensis sequences deposited in GenBank (as of August 2010). Sequence similarity analyses against six public databases (Uniprot, NR and COGs at NCBI, Pfam, InterPro and KEGG) found 55,088 unigenes that could be annotated with gene descriptions, conserved protein domains, or gene ontology terms. Some of the unigenes were assigned to putative metabolic pathways. Targeted searches using these annotations identified the majority of genes associated with several primary metabolic pathways and natural product pathways that are important to tea quality, such as flavonoid, theanine and caffeine biosynthesis pathways. Novel candidate genes of these secondary pathways were discovered. Comparisons with four previously prepared cDNA libraries revealed that this transcriptome dataset has both a high degree of consistency with previous EST data and an approximately 20-fold increase in coverage. Thirteen unigenes related to theanine and flavonoid synthesis were validated. Their expression patterns in different organs of the tea plant were analyzed by RT-PCR and quantitative real time PCR (qRT-PCR). Conclusions: An extensive transcriptome dataset has been obtained from the deep sequencing of tea plant. The coverage of the transcriptome is comprehensive enough to discover all known genes of several major metabolic pathways. This transcriptome dataset can serve as an important public information platform for gene expression, genomics, and functional genomic studies in C. sinensis. PMID:21356090
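
    The N50 statistic quoted above (506 bp) has a standard definition: the length L such that sequences of length L or longer together account for at least half of the total assembled length. A minimal sketch of that calculation follows; the function name `n50` and the example lengths are illustrative and not taken from the study.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths."""
    if not lengths:
        raise ValueError("at least one length is required")
    total = sum(lengths)
    running = 0
    # Walk from the longest sequence down until half the total is covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length


# Hypothetical unigene lengths in bp.
value = n50([200, 300, 400, 500, 600])
```

    Unlike the mean length, N50 is weighted toward longer sequences, which is why an assembly can report an N50 well above its average length.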

  16. The microbiome of Brazilian mangrove sediments as revealed by metagenomics.

    PubMed

    Andreote, Fernando Dini; Jiménez, Diego Javier; Chaves, Diego; Dias, Armando Cavalcante Franco; Luvizotto, Danice Mazzer; Dini-Andreote, Francisco; Fasanella, Cristiane Cipola; Lopez, Maryeimy Varon; Baena, Sandra; Taketani, Rodrigo Gouvêa; de Melo, Itamar Soares

    2012-01-01

    Here we embark on a deep metagenomic survey that revealed the taxonomic and potential metabolic pathway aspects of mangrove sediment microbiology. The extraction of DNA from sediment samples and the direct application of pyrosequencing resulted in approximately 215 Mb of data from four distinct mangrove areas (BrMgv01 to 04) in Brazil. The taxonomic approaches applied revealed the dominance of Deltaproteobacteria and Gammaproteobacteria in the samples. Paired statistical analysis showed higher proportions of specific taxonomic groups in each dataset. The metabolic reconstruction indicated the possible occurrence of processes modulated by the prevailing conditions found in mangrove sediments. In terms of carbon cycling, the sequences indicated the prevalence of genes involved in the metabolism of methane, formaldehyde, and carbon dioxide. With respect to the nitrogen cycle, evidence for sequences associated with dissimilatory reduction of nitrate, nitrogen immobilization, and denitrification was detected. Sequences related to the production of adenylsulfate, sulfite, and H2S were relevant to the sulphur cycle. These data indicate that the microbial core involved in methane, nitrogen, and sulphur metabolism consists mainly of Burkholderiaceae, Planctomycetaceae, Rhodobacteraceae, and Desulfobacteraceae. Comparison of our data to datasets from soil and sea samples resulted in the allotment of the mangrove sediments between those samples. The results of this study add valuable data about the composition of microbial communities in mangroves and also shed light on possible transformations promoted by microbial organisms in mangrove sediments.

  17. The Microbiome of Brazilian Mangrove Sediments as Revealed by Metagenomics

    PubMed Central

    Andreote, Fernando Dini; Jiménez, Diego Javier; Chaves, Diego; Dias, Armando Cavalcante Franco; Luvizotto, Danice Mazzer; Dini-Andreote, Francisco; Fasanella, Cristiane Cipola; Lopez, Maryeimy Varon; Baena, Sandra; Taketani, Rodrigo Gouvêa; de Melo, Itamar Soares

    2012-01-01

    Here we embark on a deep metagenomic survey that revealed the taxonomic and potential metabolic pathway aspects of mangrove sediment microbiology. The extraction of DNA from sediment samples and the direct application of pyrosequencing resulted in approximately 215 Mb of data from four distinct mangrove areas (BrMgv01 to 04) in Brazil. The taxonomic approaches applied revealed the dominance of Deltaproteobacteria and Gammaproteobacteria in the samples. Paired statistical analysis showed higher proportions of specific taxonomic groups in each dataset. The metabolic reconstruction indicated the possible occurrence of processes modulated by the prevailing conditions found in mangrove sediments. In terms of carbon cycling, the sequences indicated the prevalence of genes involved in the metabolism of methane, formaldehyde, and carbon dioxide. With respect to the nitrogen cycle, evidence for sequences associated with dissimilatory reduction of nitrate, nitrogen immobilization, and denitrification was detected. Sequences related to the production of adenylsulfate, sulfite, and H2S were relevant to the sulphur cycle. These data indicate that the microbial core involved in methane, nitrogen, and sulphur metabolism consists mainly of Burkholderiaceae, Planctomycetaceae, Rhodobacteraceae, and Desulfobacteraceae. Comparison of our data to datasets from soil and sea samples resulted in the allotment of the mangrove sediments between those samples. The results of this study add valuable data about the composition of microbial communities in mangroves and also shed light on possible transformations promoted by microbial organisms in mangrove sediments. PMID:22737213

  18. SACCHARIS: an automated pipeline to streamline discovery of carbohydrate active enzyme activities within polyspecific families and de novo sequence datasets.

    PubMed

    Jones, Darryl R; Thomas, Dallas; Alger, Nicholas; Ghavidel, Ata; Inglis, G Douglas; Abbott, D Wade

    2018-01-01

    Deposition of new genetic sequences in online databases is expanding at an unprecedented rate. As a result, sequence identification continues to outpace functional characterization of carbohydrate active enzymes (CAZymes). In this paradigm, the discovery of enzymes with novel functions is often hindered by high volumes of uncharacterized sequences, particularly when the enzyme sequence belongs to a family that exhibits diverse functional specificities (i.e., polyspecificity). Therefore, to direct sequence-based discovery and characterization of new enzyme activities we have developed an automated in silico pipeline entitled Sequence Analysis and Clustering of CarboHydrate Active enzymes for Rapid Informed prediction of Specificity (SACCHARIS). This pipeline streamlines the selection of uncharacterized sequences for discovery of new CAZyme or CBM specificity from families currently maintained on the CAZy website or within user-defined datasets. SACCHARIS was used to generate a phylogenetic tree of GH43, a CAZyme family with defined subfamily designations. This analysis confirmed that large datasets can be organized into sequence clusters of manageable sizes that possess related functions. Seeding this tree with a GH43 sequence from Bacteroides dorei DSM 17855 (BdGH43b) revealed it partitioned as a single sequence within the tree. This pattern was consistent with it possessing a unique enzyme activity for GH43, as BdGH43b is the first α-glucanase described for this family. The capacity of SACCHARIS to extract and cluster characterized carbohydrate binding module sequences was demonstrated using family 6 CBMs (i.e., CBM6s). This CBM family displays a polyspecific ligand binding profile and contains many structurally determined members. Using SACCHARIS to identify a cluster of divergent sequences, a CBM6 sequence from a unique clade was demonstrated to bind yeast mannan, which represents the first description of an α-mannan binding CBM. Additionally, we have performed a CAZome analysis of an in-house sequenced bacterial genome and a comparative analysis of B. thetaiotaomicron VPI-5482 and B. thetaiotaomicron 7330, to demonstrate that SACCHARIS can generate "CAZome fingerprints", which differentiate between the saccharolytic potential of two related strains in silico. Establishing sequence-function and sequence-structure relationships in polyspecific CAZyme families is a promising approach for streamlining enzyme discovery. SACCHARIS facilitates this process by embedding CAZyme and CBM family trees generated from biochemically to structurally characterized sequences, with protein sequences that have unknown functions. In addition, these trees can be integrated with user-defined datasets (e.g., genomics, metagenomics, and transcriptomics) to inform experimental characterization of new CAZymes or CBMs not currently curated, and for researchers to compare differential sequence patterns between entire CAZomes. In this light, SACCHARIS provides an in silico tool that can be tailored for enzyme bioprospecting in datasets of increasing complexity and for diverse applications in glycobiotechnology.

  19. Ultra Deep Sequencing of Listeria monocytogenes sRNA Transcriptome Revealed New Antisense RNAs

    PubMed Central

    Behrens, Sebastian; Widder, Stefanie; Mannala, Gopala Krishna; Qing, Xiaoxing; Madhugiri, Ramakanth; Kefer, Nathalie; Mraheil, Mobarak Abu; Rattei, Thomas; Hain, Torsten

    2014-01-01

    Listeria monocytogenes, a gram-positive pathogen and the causative agent of listeriosis, has become a widely used model organism for intracellular infections. Recent studies have identified small non-coding RNAs (sRNAs) as important factors for regulating gene expression and pathogenicity of L. monocytogenes. Increased speed and reduced costs of high throughput sequencing (HTS) techniques have made RNA sequencing (RNA-Seq) the state-of-the-art method to study bacterial transcriptomes. We created a large transcriptome dataset of L. monocytogenes containing a total of 21 million reads, using the SOLiD sequencing technology. The dataset contained cDNA sequences generated from L. monocytogenes RNA collected under intracellular and extracellular conditions and was additionally size-fractionated into three size ranges: <40 nt, 40–150 nt and >150 nt. We report here the identification of nine new sRNA candidates of L. monocytogenes and a reevaluation of known sRNAs of L. monocytogenes EGD-e. Automatic comparison to known sRNAs revealed a high recovery rate of 55%, which was increased to 90% by manual revision of the data. Moreover, thorough classification of known sRNAs shed further light on their possible biological functions. Interestingly, among the newly identified sRNA candidates are antisense RNAs (asRNAs) associated with the housekeeping genes purA, fumC and pgi and potentially their regulation, emphasizing the significance of sRNAs for metabolic adaptation in L. monocytogenes. PMID:24498259

  20. Statistical Methods for Detecting Differentially Abundant Features in Clinical Metagenomic Samples

    PubMed Central

    White, James Robert; Nagarajan, Niranjan; Pop, Mihai

    2009-01-01

    Numerous studies are currently underway to characterize the microbial communities inhabiting our world. These studies aim to dramatically expand our understanding of the microbial biosphere and, more importantly, hope to reveal the secrets of the complex symbiotic relationship between us and our commensal bacterial microflora. An important prerequisite for such discoveries are computational tools that are able to rapidly and accurately compare large datasets generated from complex bacterial communities to identify features that distinguish them. We present a statistical method for comparing clinical metagenomic samples from two treatment populations on the basis of count data (e.g. as obtained through sequencing) to detect differentially abundant features. Our method, Metastats, employs the false discovery rate to improve specificity in high-complexity environments, and separately handles sparsely-sampled features using Fisher's exact test. Under a variety of simulations, we show that Metastats performs well compared to previously used methods, and significantly outperforms other methods for features with sparse counts. We demonstrate the utility of our method on several datasets including a 16S rRNA survey of obese and lean human gut microbiomes, COG functional profiles of infant and mature gut microbiomes, and bacterial and viral metabolic subsystem data inferred from random sequencing of 85 metagenomes. The application of our method to the obesity dataset reveals differences between obese and lean subjects not reported in the original study. For the COG and subsystem datasets, we provide the first statistically rigorous assessment of the differences between these populations. The methods described in this paper are the first to address clinical metagenomic datasets comprising samples from multiple subjects. Our methods are robust across datasets of varied complexity and sampling level. 
While designed for metagenomic applications, our software can also be applied to digital gene expression studies (e.g. SAGE). A web server implementation of our methods and freely available source code can be found at http://metastats.cbcb.umd.edu/. PMID:19360128
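    The two statistical ingredients named in the abstract above, Fisher's exact test for sparsely sampled features and false discovery rate control, can be sketched in a few lines. This is an illustration only and not the Metastats code: the 2x2 table layout (feature counts versus all other counts in each population), the function names, and the use of the Benjamini-Hochberg procedure as the FDR step are all assumptions, since the abstract does not specify them.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of seeing x in the top-left cell.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum the probabilities of all tables no more likely than the observed one.
    total = sum(p for p in (prob(x) for x in range(lo, hi + 1))
                if p <= p_obs * (1 + 1e-9))
    return min(1.0, total)


def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p-values (assumed FDR procedure)."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def sparse_feature_pvalues(counts_group1, counts_group2):
    """Hypothetical helper: one Fisher test per feature on pooled counts."""
    total1, total2 = sum(counts_group1), sum(counts_group2)
    return [fisher_exact_two_sided(f1, f2, total1 - f1, total2 - f2)
            for f1, f2 in zip(counts_group1, counts_group2)]
```

    Calling `benjamini_hochberg(sparse_feature_pvalues(group1, group2))` on two lists of pooled per-feature counts gives one adjusted p-value per feature; features below a chosen threshold would be the candidates for differential abundance.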

  1. iASeq: integrative analysis of allele-specificity of protein-DNA interactions in multiple ChIP-seq datasets

    PubMed Central

    2012-01-01

    Background ChIP-seq provides new opportunities to study allele-specific protein-DNA binding (ASB). However, detecting allelic imbalance from a single ChIP-seq dataset often has low statistical power since only sequence reads mapped to heterozygote SNPs are informative for discriminating two alleles. Results We develop a new method iASeq to address this issue by jointly analyzing multiple ChIP-seq datasets. iASeq uses a Bayesian hierarchical mixture model to learn correlation patterns of allele-specificity among multiple proteins. Using the discovered correlation patterns, the model allows one to borrow information across datasets to improve detection of allelic imbalance. Application of iASeq to 77 ChIP-seq samples from 40 ENCODE datasets and 1 genomic DNA sample in GM12878 cells reveals that allele-specificity of multiple proteins are highly correlated, and demonstrates the ability of iASeq to improve allelic inference compared to analyzing each individual dataset separately. Conclusions iASeq illustrates the value of integrating multiple datasets in the allele-specificity inference and offers a new tool to better analyze ASB. PMID:23194258

  2. Argo_CUDA: Exhaustive GPU based approach for motif discovery in large DNA datasets.

    PubMed

    Vishnevsky, Oleg V; Bocharnikov, Andrey V; Kolchanov, Nikolay A

    2018-02-01

    The development of chromatin immunoprecipitation sequencing (ChIP-seq) technology has revolutionized the genetic analysis of the basic mechanisms underlying transcription regulation and led to the accumulation of a huge number of DNA sequences. Many web services are currently available for de novo motif discovery in datasets containing information about DNA/protein binding. The enormous diversity of motifs makes finding them challenging. To avoid these difficulties, researchers use different stochastic approaches. Unfortunately, the efficiency of motif discovery programs declines dramatically as the query set size increases. As a result, only a fraction of the top "peak" ChIP-Seq segments can be analyzed, or the area of analysis has to be narrowed. Thus, motif discovery in massive datasets remains a challenging issue. The Argo_CUDA (Compute Unified Device Architecture) web service is designed to process massive DNA data. It is a program for the detection of degenerate oligonucleotide motifs of fixed length written in the 15-letter IUPAC code. Argo_CUDA is a fully exhaustive approach based on high-performance GPU technologies. Compared with the existing motif discovery web services, Argo_CUDA shows good prediction quality on simulated sets. The analysis of ChIP-Seq sequences revealed motifs that correspond to known transcription factor binding sites.
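    To make the notion of a degenerate motif in the 15-letter IUPAC code concrete, the sketch below scans a DNA sequence for every position where a fixed-length IUPAC motif matches. It runs on the CPU and shows only the matching rule; it is not the Argo_CUDA search algorithm, and the function names are hypothetical.

```python
# The 15-letter IUPAC nucleotide code: each symbol stands for a set of bases.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG",
    "N": "ACGT",
}


def matches(motif, window):
    """True if every base in the window is allowed by the motif symbol."""
    return len(motif) == len(window) and all(
        base in IUPAC[symbol] for symbol, base in zip(motif, window)
    )


def find_motif(motif, sequence):
    """Return the 0-based start positions where the motif matches."""
    k = len(motif)
    return [i for i in range(len(sequence) - k + 1)
            if matches(motif, sequence[i:i + k])]
```

    An exhaustive search in this spirit would enumerate candidate motifs and score each one by how many input sequences contain a match, which is the part that benefits from GPU parallelism.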

  3. Geoseq: a tool for dissecting deep-sequencing datasets.

    PubMed

    Gurtowski, James; Cancio, Anthony; Shah, Hardik; Levovitz, Chaya; George, Ajish; Homann, Robert; Sachidanandam, Ravi

    2010-10-12

    Datasets generated on deep-sequencing platforms have been deposited in various public repositories such as the Gene Expression Omnibus (GEO), the Sequence Read Archive (SRA) hosted by the NCBI, or the DNA Data Bank of Japan (DDBJ). Despite being rich data sources, they have not been used much due to the difficulty in locating and analyzing datasets of interest. Geoseq (http://geoseq.mssm.edu) provides a new method of analyzing short reads from deep sequencing experiments. Instead of mapping the reads to reference genomes or sequences, Geoseq maps a reference sequence against the sequencing data. It is web-based, and holds pre-computed data from public libraries. The analysis reduces the input sequence to tiles and measures the coverage of each tile in a sequence library through the use of suffix arrays. The user can upload custom target sequences or use gene/miRNA names for the search and get back results as plots and spreadsheet files. Geoseq organizes the public sequencing data using a controlled vocabulary, allowing identification of relevant libraries by organism, tissue and type of experiment. Analysis of small sets of sequences against deep-sequencing datasets, as well as identification of public datasets of interest, is simplified by Geoseq. We applied Geoseq to (a) identify differential isoform expression in mRNA-seq datasets, (b) identify miRNAs (microRNAs) in libraries, along with mature and star sequences in miRNAs, and (c) identify potentially mis-annotated miRNAs. The ease of using Geoseq for these analyses suggests its utility and uniqueness as an analysis tool.
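    The tiling idea described above, cutting the reference into tiles and counting how often each tile occurs in a read library through a suffix array, can be sketched as follows. This is a toy version under stated assumptions: the suffix array is built naively (adequate only for small inputs), tiles are taken as non-overlapping by default, reads are joined with a `$` separator that is assumed not to occur in any tile, and all function names are hypothetical rather than taken from Geoseq.

```python
def build_suffix_array(text):
    """Naive suffix array: start positions of all suffixes in sorted order."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def count_occurrences(text, suffix_array, pattern):
    """Count occurrences of pattern via binary search over the suffix array."""
    k = len(pattern)

    def bound(strict):
        lo, hi = 0, len(suffix_array)
        while lo < hi:
            mid = (lo + hi) // 2
            prefix = text[suffix_array[mid]:suffix_array[mid] + k]
            if prefix < pattern or (strict and prefix == pattern):
                lo = mid + 1
            else:
                hi = mid
        return lo

    # Suffixes starting with the pattern form one contiguous block.
    return bound(True) - bound(False)


def tile_coverage(reference, reads, tile_size, step=None):
    """Return (tile start, occurrence count) for each tile of the reference."""
    step = step or tile_size
    text = "$".join(reads)  # separator keeps matches from spanning two reads
    suffix_array = build_suffix_array(text)
    return [
        (start, count_occurrences(text, suffix_array,
                                  reference[start:start + tile_size]))
        for start in range(0, len(reference) - tile_size + 1, step)
    ]
```

    The suffix array is built once per library, so querying many reference sequences against the same library reuses the index, which fits the abstract's description of holding pre-computed data for public libraries.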

  4. Comparative description of ten transcriptomes of newly sequenced invertebrates and efficiency estimation of genomic sampling in non-model taxa

    PubMed Central

    2012-01-01

    Introduction Traditionally, genomic or transcriptomic data have been restricted to a few model or emerging model organisms, and to a handful of species of medical and/or environmental importance. Next-generation sequencing techniques have the capability of yielding massive amounts of gene sequence data for virtually any species at a modest cost. Here we provide a comparative analysis of de novo assembled transcriptomic data for ten non-model species of previously understudied animal taxa. Results cDNA libraries of ten species belonging to five animal phyla (2 Annelida [including Sipuncula], 2 Arthropoda, 2 Mollusca, 2 Nemertea, and 2 Porifera) were sequenced in different batches with an Illumina Genome Analyzer II (read length 100 or 150 bp), rendering between ca. 25 and 52 million reads per species. Read thinning, trimming, and de novo assembly were performed under different parameters to optimize output. Between 67,423 and 207,559 contigs were obtained across the ten species, post-optimization. Of those, 9,069 to 25,681 contigs retrieved blast hits against the NCBI non-redundant database, and approximately 50% of these were assigned with Gene Ontology terms, covering all major categories, and with similar percentages in all species. Local blasts against our datasets, using selected genes from major signaling pathways and housekeeping genes, revealed high efficiency in gene recovery compared to available genomes of closely related species. Intriguingly, our transcriptomic datasets detected multiple paralogues in all phyla and in nearly all gene pathways, including housekeeping genes that are traditionally used in phylogenetic applications for their purported single-copy nature. 
Conclusions We generated the first study of comparative transcriptomics across multiple animal phyla (comparing two species per phylum in most cases), established the first Illumina-based transcriptomic datasets for sponge, nemertean, and sipunculan species, and generated a tractable catalogue of annotated genes (or gene fragments) and protein families for ten newly sequenced non-model organisms, some of commercial importance (i.e., Octopus vulgaris). These comprehensive sets of genes can be readily used for phylogenetic analysis, gene expression profiling, developmental analysis, and can also be a powerful resource for gene discovery. The characterization of the transcriptomes of such a diverse array of animal species permitted the comparison of sequencing depth, functional annotation, and efficiency of genomic sampling using the same pipelines, which proved to be similar for all considered species. In addition, the datasets revealed their potential as a resource for paralogue detection, a recurrent concern in various aspects of biological inquiry, including phylogenetics, molecular evolution, development, and cellular biochemistry. PMID:23190771

  5. Multi-Donor Longitudinal Antibody Repertoire Sequencing Reveals the Existence of Public Antibody Clonotypes in HIV-1 Infection.

    PubMed

    Setliff, Ian; McDonnell, Wyatt J; Raju, Nagarajan; Bombardi, Robin G; Murji, Amyn A; Scheepers, Cathrine; Ziki, Rutendo; Mynhardt, Charissa; Shepherd, Bryan E; Mamchak, Alusha A; Garrett, Nigel; Karim, Salim Abdool; Mallal, Simon A; Crowe, James E; Morris, Lynn; Georgiev, Ivelin S

    2018-06-13

    Characterization of single antibody lineages within infected individuals has provided insights into the development of Env-specific antibodies. However, a systems-level understanding of the humoral response against HIV-1 is limited. Here, we interrogated the antibody repertoires of multiple HIV-infected donors from an infection-naive state through acute and chronic infection using next-generation sequencing. This analysis revealed the existence of "public" antibody clonotypes that were shared among multiple HIV-infected individuals. The HIV-1 reactivity for representative antibodies from an identified public clonotype shared by three donors was confirmed. Furthermore, a meta-analysis of publicly available antibody repertoire sequencing datasets revealed antibodies with high sequence identity to known HIV-reactive antibodies, even in repertoires that were reported to be HIV naive. The discovery of public antibody clonotypes in HIV-infected individuals represents an avenue of significant potential for better understanding antibody responses to HIV-1 infection, as well as for clonotype-specific vaccine development. Copyright © 2018 The Authors. Published by Elsevier Inc. All rights reserved.

  6. RIEMS: a software pipeline for sensitive and comprehensive taxonomic classification of reads from metagenomics datasets.

    PubMed

    Scheuch, Matthias; Höper, Dirk; Beer, Martin

    2015-03-03

    Fuelled by the advent and subsequent development of next generation sequencing technologies, metagenomics became a powerful tool for the analysis of microbial communities both scientifically and diagnostically. The biggest challenge is the extraction of relevant information from the huge sequence datasets generated for metagenomics studies. Although a plethora of tools are available, data analysis is still a bottleneck. To overcome the bottleneck of data analysis, we developed an automated computational workflow called RIEMS - Reliable Information Extraction from Metagenomic Sequence datasets. RIEMS assigns every individual read sequence within a dataset taxonomically by cascading different sequence analyses with decreasing stringency of the assignments using various software applications. After completion of the analyses, the results are summarised in a clearly structured result protocol organised taxonomically. The high accuracy and performance of RIEMS analyses were proven in comparison with other tools for metagenomics data analysis using simulated sequencing read datasets. RIEMS has the potential to fill the gap that still exists with regard to data analysis for metagenomics studies. The usefulness and power of RIEMS for the analysis of genuine sequencing datasets was demonstrated with an early version of RIEMS in 2011 when it was used to detect the orthobunyavirus sequences leading to the discovery of Schmallenberg virus.

  7. Independent studies using deep sequencing resolve the same set of core bacterial species dominating gut communities of honey bees.

    PubMed

    Sabree, Zakee L; Hansen, Allison K; Moran, Nancy A

    2012-01-01

    Starting in 2003, numerous studies using culture-independent methodologies to characterize the gut microbiota of honey bees have retrieved a consistent and distinctive set of eight bacterial species, based on near identity of the 16S rRNA gene sequences. A recent study [Mattila HR, Rios D, Walker-Sperling VE, Roeselers G, Newton ILG (2012) Characterization of the active microbiotas associated with honey bees reveals healthier and broader communities when colonies are genetically diverse. PLoS ONE 7(3): e32962], using pyrosequencing of the V1-V2 hypervariable region of the 16S rRNA gene, reported finding entirely novel bacterial species in honey bee guts, and used taxonomic assignments from these reads to predict metabolic activities based on known metabolisms of cultivable species. To better understand this discrepancy, we analyzed the Mattila et al. pyrotag dataset. In contrast to the conclusions of Mattila et al., we found that the large majority of pyrotag sequences belonged to clusters for which representative sequences were identical to sequences from previously identified core species of the bee microbiota. On average, they represent 95% of the bacteria in each worker bee in the Mattila et al. dataset, a slightly lower value than that found in other studies. Some colonies contain small proportions of other bacteria, mostly species of Enterobacteriaceae. Reanalysis of the Mattila et al. dataset also did not support a relationship between abundances of Bifidobacterium and of putative pathogens or a significant difference in gut communities between colonies from queens that were singly or multiply mated. Additionally, consistent with previous studies, the dataset supports the occurrence of considerable strain variation within core species, even within single colonies. The roles of these bacteria within bees, or the implications of the strain variation, are not yet clear.

  8. Benchmark Dataset for Whole Genome Sequence Compression.

    PubMed

    C L, Biji; S Nair, Achuthsankar

    2017-01-01

    Research in DNA data compression lacks a standard dataset for testing compression tools specific to DNA. This paper argues that the current state of achievement in DNA compression cannot be benchmarked in the absence of such a scientifically compiled whole genome sequence dataset, and proposes a benchmark dataset built using a multistage sampling procedure. Considering the genome sequences of organisms available at the National Center for Biotechnology Information (NCBI) as the universe, the proposed dataset selects 1,105 prokaryotes, 200 plasmids, 164 viruses, and 65 eukaryotes. This paper reports the results of using three established tools on the newly compiled dataset and shows that their strengths and weaknesses are evident only with a comparison based on the scientifically compiled benchmark dataset. The sample dataset and the respective links are available at https://sourceforge.net/projects/benchmarkdnacompressiondataset/.

  9. Harnessing NGS and Big Data Optimally: Comparison of miRNA Prediction from Assembled versus Non-assembled Sequencing Data--The Case of the Grass Aegilops tauschii Complex Genome.

    PubMed

    Budak, Hikmet; Kantar, Melda

    2015-07-01

    MicroRNAs (miRNAs) are small, endogenous, non-coding RNA molecules that regulate gene expression at the post-transcriptional level. As high-throughput next generation sequencing (NGS) and Big Data rapidly accumulate for various species, efforts for in silico identification of miRNAs intensify. Surprisingly, the effect of the input genomics sequence on the robustness of miRNA prediction was not evaluated in detail to date. In the present study, we performed a homology-based miRNA and isomiRNA prediction of the 5D chromosome of bread wheat progenitor, Aegilops tauschii, using two distinct sequence data sets as input: (1) raw sequence reads obtained from 454-GS FLX Titanium sequencing platform and (2) an assembly constructed from these reads. We also compared this method with a number of available plant sequence datasets. We report here the identification of 62 and 22 miRNAs from raw reads and the assembly, respectively, of which 16 were predicted with high confidence from both datasets. While raw reads promoted sensitivity with the high number of miRNAs predicted, 55% (12 out of 22) of the assembly-based predictions were supported by previous observations, bringing specificity forward compared to the read-based predictions, of which only 37% were supported. Importantly, raw reads could identify several repeat-related miRNAs that could not be detected with the assembly. However, raw reads could not capture 6 miRNAs, for which the stem-loops could only be covered by the relatively longer sequences from the assembly. In summary, the comparison of miRNA datasets obtained by these two strategies revealed that utilization of raw reads, as well as assemblies for in silico prediction, have distinct advantages and disadvantages. Consideration of these important nuances can benefit future miRNA identification efforts in the current age of NGS and Big Data driven life sciences innovation.

  10. Prospecting Biotechnologically-Relevant Monooxygenases from Cold Sediment Metagenomes: An In Silico Approach

    DOE PAGES

    Musumeci, Matias A.; Lozada, Mariana; Rial, Daniela V.; ...

    2017-04-09

    The goal of this work was to identify sequences encoding monooxygenase biocatalysts with novel features by in silico mining an assembled metagenomic dataset of polar and subpolar marine sediments. The targeted enzyme sequences were Baeyer-Villiger and bacterial cytochrome P450 monooxygenases (CYP153). These enzymes have wide-ranging applications, from the synthesis of steroids, antibiotics, mycotoxins and pheromones to the synthesis of monomers for polymerization and anticancer precursors, due to their extraordinary enantio-, regio-, and chemo- selectivity that are valuable features for organic synthesis. Phylogenetic analyses were used to select the most divergent sequences affiliated to these enzyme families among the 264 putative monooxygenases recovered from the ~14 million protein-coding sequences in the assembled metagenome dataset. Three-dimensional structure modeling and docking analysis suggested features useful in biotechnological applications in five metagenomic sequences, such as wide substrate range, novel substrate specificity or regioselectivity. Further analysis revealed structural features associated with psychrophilic enzymes, such as broader substrate accessibility, larger catalytic pockets or low domain interactions, suggesting that they could be applied in biooxidations at room or low temperatures, saving costs inherent to energy consumption. As a result, this work allowed the identification of putative enzyme candidates with promising features from metagenomes, providing a suitable starting point for further developments.

  11. Prospecting Biotechnologically-Relevant Monooxygenases from Cold Sediment Metagenomes: An In Silico Approach

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Musumeci, Matias A.; Lozada, Mariana; Rial, Daniela V.

    The goal of this work was to identify sequences encoding monooxygenase biocatalysts with novel features by in silico mining an assembled metagenomic dataset of polar and subpolar marine sediments. The targeted enzyme sequences were Baeyer-Villiger and bacterial cytochrome P450 monooxygenases (CYP153). These enzymes have wide-ranging applications, from the synthesis of steroids, antibiotics, mycotoxins and pheromones to the synthesis of monomers for polymerization and anticancer precursors, due to their extraordinary enantio-, regio-, and chemo- selectivity that are valuable features for organic synthesis. Phylogenetic analyses were used to select the most divergent sequences affiliated to these enzyme families among the 264 putative monooxygenases recovered from the ~14 million protein-coding sequences in the assembled metagenome dataset. Three-dimensional structure modeling and docking analysis suggested features useful in biotechnological applications in five metagenomic sequences, such as wide substrate range, novel substrate specificity or regioselectivity. Further analysis revealed structural features associated with psychrophilic enzymes, such as broader substrate accessibility, larger catalytic pockets or low domain interactions, suggesting that they could be applied in biooxidations at room or low temperatures, saving costs inherent to energy consumption. As a result, this work allowed the identification of putative enzyme candidates with promising features from metagenomes, providing a suitable starting point for further developments.

  12. Prospecting Biotechnologically-Relevant Monooxygenases from Cold Sediment Metagenomes: An In Silico Approach.

    PubMed

    Musumeci, Matías A; Lozada, Mariana; Rial, Daniela V; Mac Cormack, Walter P; Jansson, Janet K; Sjöling, Sara; Carroll, JoLynn; Dionisi, Hebe M

    2017-04-09

    The goal of this work was to identify sequences encoding monooxygenase biocatalysts with novel features by in silico mining an assembled metagenomic dataset of polar and subpolar marine sediments. The targeted enzyme sequences were Baeyer-Villiger and bacterial cytochrome P450 monooxygenases (CYP153). These enzymes have wide-ranging applications, from the synthesis of steroids, antibiotics, mycotoxins and pheromones to the synthesis of monomers for polymerization and anticancer precursors, due to their extraordinary enantio-, regio-, and chemo- selectivity that are valuable features for organic synthesis. Phylogenetic analyses were used to select the most divergent sequences affiliated to these enzyme families among the 264 putative monooxygenases recovered from the ~14 million protein-coding sequences in the assembled metagenome dataset. Three-dimensional structure modeling and docking analysis suggested features useful in biotechnological applications in five metagenomic sequences, such as wide substrate range, novel substrate specificity or regioselectivity. Further analysis revealed structural features associated with psychrophilic enzymes, such as broader substrate accessibility, larger catalytic pockets or low domain interactions, suggesting that they could be applied in biooxidations at room or low temperatures, saving costs inherent to energy consumption. This work allowed the identification of putative enzyme candidates with promising features from metagenomes, providing a suitable starting point for further developments.

  13. Prospecting Biotechnologically-Relevant Monooxygenases from Cold Sediment Metagenomes: An In Silico Approach

    PubMed Central

    Musumeci, Matías A.; Lozada, Mariana; Rial, Daniela V.; Mac Cormack, Walter P.; Jansson, Janet K.; Sjöling, Sara; Carroll, JoLynn; Dionisi, Hebe M.

    2017-01-01

    The goal of this work was to identify sequences encoding monooxygenase biocatalysts with novel features by in silico mining an assembled metagenomic dataset of polar and subpolar marine sediments. The targeted enzyme sequences were Baeyer–Villiger and bacterial cytochrome P450 monooxygenases (CYP153). These enzymes have wide-ranging applications, from the synthesis of steroids, antibiotics, mycotoxins and pheromones to the synthesis of monomers for polymerization and anticancer precursors, due to their extraordinary enantio-, regio-, and chemo- selectivity that are valuable features for organic synthesis. Phylogenetic analyses were used to select the most divergent sequences affiliated to these enzyme families among the 264 putative monooxygenases recovered from the ~14 million protein-coding sequences in the assembled metagenome dataset. Three-dimensional structure modeling and docking analysis suggested features useful in biotechnological applications in five metagenomic sequences, such as wide substrate range, novel substrate specificity or regioselectivity. Further analysis revealed structural features associated with psychrophilic enzymes, such as broader substrate accessibility, larger catalytic pockets or low domain interactions, suggesting that they could be applied in biooxidations at room or low temperatures, saving costs inherent to energy consumption. This work allowed the identification of putative enzyme candidates with promising features from metagenomes, providing a suitable starting point for further developments. PMID:28397770

  14. New FeFe-hydrogenase genes identified in a metagenomic fosmid library from a municipal wastewater treatment plant as revealed by high-throughput sequencing.

    PubMed

    Tomazetto, Geizecler; Wibberg, Daniel; Schlüter, Andreas; Oliveira, Valéria M

    2015-01-01

    A fosmid metagenomic library was constructed with total community DNA obtained from a municipal wastewater treatment plant (MWWTP), with the aim of identifying new FeFe-hydrogenase genes encoding the enzymes most important for hydrogen metabolism. The dataset generated by pyrosequencing of a fosmid library was mined to identify environmental gene tags (EGTs) assigned to FeFe-hydrogenase. The majority of EGTs representing FeFe-hydrogenase genes were affiliated with the class Clostridia, suggesting that this group is the main hydrogen producer in the MWWTP analyzed. Based on assembled sequences, three FeFe-hydrogenase genes were predicted based on detection of the L2 motif (MPCxxKxxE) in the encoded gene product, confirming true FeFe-hydrogenase sequences. These sequences were used to design specific primers to detect fosmids encoding FeFe-hydrogenase genes predicted from the dataset. Three identified fosmids were completely sequenced. The cloned genomic fragments within these fosmids are closely related to members of the Spirochaetaceae, Bacteroidales and Firmicutes, and their FeFe-hydrogenase sequences are characterized by the structure type M3, which is common to clostridial enzymes. FeFe-hydrogenase sequences found in this study represent hitherto undetected sequences, indicating the high genetic diversity regarding these enzymes in MWWTP. Results suggest that MWWTP have to be considered as reservoirs for new FeFe-hydrogenase genes. Copyright © 2014 Institut Pasteur. Published by Elsevier Masson SAS. All rights reserved.
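    The L2 motif check mentioned above (MPCxxKxxE, where x stands for any residue) amounts to a simple pattern search over a predicted protein sequence. The sketch below shows one way to express it with a regular expression; the helper names and the example sequence in the usage note are made up for illustration and are not taken from the study.

```python
import re


def motif_to_regex(motif):
    """Translate a motif such as 'MPCxxKxxE' into a compiled regex.

    Lowercase 'x' is read as 'any single residue'; other characters
    are matched literally.
    """
    return re.compile(
        "".join("." if ch == "x" else re.escape(ch) for ch in motif)
    )


def find_motif(protein, motif="MPCxxKxxE"):
    """Return (0-based start, matched segment) for each motif occurrence."""
    pattern = motif_to_regex(motif)
    return [(m.start(), m.group()) for m in pattern.finditer(protein)]
```

    For a hypothetical fragment such as `"AAMPCTSKQAEGG"`, `find_motif` reports a single hit beginning at the methionine. A hit only flags a candidate; the abstract describes further confirmation by primer design and complete fosmid sequencing.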

  15. Phylogenetic reassessment of tribe Anemoneae (Ranunculaceae): Non-monophyly of Anemone s.l. revealed by plastid datasets

    PubMed Central

    Yang, Jun-Bo; Zhang, Shu-Dong; Guan, Kai-Yun; Tan, Yun-Hong

    2017-01-01

    Morphological and molecular evidence strongly supported the monophyly of tribe Anemoneae DC.; however, phylogenetic relationships among genera of this tribe have still not been fully resolved. In this study, we sampled 120 specimens representing 82 taxa of tribe Anemoneae. One nuclear ribosomal internal transcribed spacer (nrITS) and six plastid markers (atpB-rbcL, matK, psbA-trnQ, rpoB-trnC, rbcL and rps16) were amplified and sequenced. Both maximum likelihood and Bayesian inference methods were used to reconstruct phylogenies for this tribe. Individual datasets supported all traditional genera as monophyletic, except Anemone and Clematis, which were polyphyletic and paraphyletic, respectively, and revealed that the seven single-gene datasets can be split into two groups, i.e. nrITS + atpB-rbcL and the remaining five plastid markers. The combined nrITS + atpB-rbcL dataset recovered monophyly of subtribes Anemoninae (i.e. Anemone s.l.) and Clematidinae (including Anemoclema), respectively. However, the concatenated plastid dataset placed one group of subtribe Anemoninae (Hepatica and Anemone spp. from subgenus Anemonidium) close to the clade Clematis s.l. + Anemoclema. Our results strongly supported a close relationship between Anemoclema and Clematis s.l., which included Archiclematis and Naravelia. The non-monophyly of Anemone s.l. in the plastid dataset suggests revising it into two genera: a new Anemone s.l. (including Pulsatilla, Barneoudia, Oreithales and Knowltonia) and Hepatica (corresponding to Anemone subgenus Anemonidium). PMID:28362811

  16. Phylogenetic reassessment of tribe Anemoneae (Ranunculaceae): Non-monophyly of Anemone s.l. revealed by plastid datasets.

    PubMed

    Jiang, Nan; Zhou, Zhuang; Yang, Jun-Bo; Zhang, Shu-Dong; Guan, Kai-Yun; Tan, Yun-Hong; Yu, Wen-Bin

    2017-01-01

    Morphological and molecular evidence strongly supported the monophyly of tribe Anemoneae DC.; however, phylogenetic relationships among genera of this tribe have still not been fully resolved. In this study, we sampled 120 specimens representing 82 taxa of tribe Anemoneae. One nuclear ribosomal internal transcribed spacer (nrITS) and six plastid markers (atpB-rbcL, matK, psbA-trnQ, rpoB-trnC, rbcL and rps16) were amplified and sequenced. Both maximum likelihood and Bayesian inference methods were used to reconstruct phylogenies for this tribe. Individual datasets supported all traditional genera as monophyletic, except Anemone and Clematis, which were polyphyletic and paraphyletic, respectively, and revealed that the seven single-gene datasets can be split into two groups, i.e. nrITS + atpB-rbcL and the remaining five plastid markers. The combined nrITS + atpB-rbcL dataset recovered monophyly of subtribes Anemoninae (i.e. Anemone s.l.) and Clematidinae (including Anemoclema), respectively. However, the concatenated plastid dataset showed that one group of subtribe Anemoninae (Hepatica and Anemone spp. from subgenus Anemonidium) is close to the clade Clematis s.l. + Anemoclema. Our results strongly supported a close relationship between Anemoclema and Clematis s.l., which included Archiclematis and Naravelia. Non-monophyly of Anemone s.l. in the plastid dataset indicates that it should be revised as two genera: a new Anemone s.l. (including Pulsatilla, Barneoudia, Oreithales and Knowltonia) and Hepatica (corresponding to Anemone subgenus Anemonidium).

  17. A clone-free, single molecule map of the domestic cow (Bos taurus) genome.

    PubMed

    Zhou, Shiguo; Goldstein, Steve; Place, Michael; Bechner, Michael; Patino, Diego; Potamousis, Konstantinos; Ravindran, Prabu; Pape, Louise; Rincon, Gonzalo; Hernandez-Ortiz, Juan; Medrano, Juan F; Schwartz, David C

    2015-08-28

    The cattle (Bos taurus) genome was originally selected for sequencing due to its economic importance and unique biology as a model organism for understanding other ruminants, or mammals. Currently, there are two cattle genome sequence assemblies (UMD3.1 and Btau4.6) from groups using dissimilar assembly algorithms, which were complemented by genetic and physical map resources. However, past comparisons between these assemblies revealed substantial differences. Consequently, such discordances have engendered ambiguities when using reference sequence data, impacting genomic studies in cattle and motivating construction of a new optical map resource--BtOM1.0--to guide comparisons and improvements to the current sequence builds. Accordingly, our comprehensive comparisons of BtOM1.0 against the UMD3.1 and Btau4.6 sequence builds tabulate large-to-immediate scale discordances requiring mediation. The optical map, BtOM1.0, spanning the B. taurus genome (Hereford breed, L1 Dominette 01449) was assembled from an optical map dataset consisting of 2,973,315 (439 X; raw dataset size before assembly) single molecule optical maps (Rmaps; 1 Rmap = 1 restriction mapped DNA molecule) generated by the Optical Mapping System. The BamHI map spans 2,575.30 Mb and comprises 78 optical contigs assembled by a combination of iterative (using the reference sequence: UMD3.1) and de novo assembly techniques. BtOM1.0 is a high-resolution physical map featuring an average restriction fragment size of 8.91 Kb. Comparisons of BtOM1.0 vs. UMD3.1, or Btau4.6, revealed that Btau4.6 presented far more discordances (7,463) vs. UMD3.1 (4,754). Overall, we found that Btau4.6 presented almost double the number of discordances than UMD3.1 across most of the 6 categories of sequence vs. map discrepancies, which are: COMPLEX (misassembly), DELs (extraneous sequences), INSs (missing sequences), ITs (Inverted/Translocated sequences), ECs (extra restriction cuts) and MCs (missing restriction cuts). 
Alignments of UMD3.1 and Btau4.6 to BtOM1.0 reveal discordances commensurate with previous reports, and affirm the NCBI's current designation of UMD3.1 sequence assembly as the "reference assembly" and the Btau4.6 as the "alternate assembly." The cattle genome optical map, BtOM1.0, when used as a comprehensive and largely independent guide, will greatly assist improvements to existing sequence builds, and later serve as an accurate physical scaffold for studies concerning the comparative genomics of cattle breeds.

  18. Understanding the complex evolution of rapidly mutating viruses with deep sequencing: Beyond the analysis of viral diversity.

    PubMed

    Leung, Preston; Eltahla, Auda A; Lloyd, Andrew R; Bull, Rowena A; Luciani, Fabio

    2017-07-15

    With the advent of affordable deep sequencing technologies, detection of low frequency variants within genetically diverse viral populations can now be achieved with unprecedented depth and efficiency. The high-resolution data provided by next generation sequencing technologies is currently recognised as the gold standard in estimation of viral diversity. In the analysis of rapidly mutating viruses, longitudinal deep sequencing datasets from viral genomes during individual infection episodes, as well as at the epidemiological level during outbreaks, now allow for more sophisticated analyses such as statistical estimates of the impact of complex mutation patterns on the evolution of the viral populations both within and between hosts. These analyses are revealing more accurate descriptions of the evolutionary dynamics that underpin the rapid adaptation of these viruses to the host response, and to drug therapies. This review assesses recent developments in methods and provides informative research examples using deep sequencing data generated from rapidly mutating viruses infecting humans, particularly hepatitis C virus (HCV), human immunodeficiency virus (HIV), Ebola virus and influenza virus, to understand the evolution of viral genomes and to explore the relationship between viral mutations and the host adaptive immune response. Finally, we discuss limitations in current technologies, and future directions that take advantage of publicly available large deep sequencing datasets. Copyright © 2016 Elsevier B.V. All rights reserved.

  19. A novel on-line spatial-temporal k-anonymity method for location privacy protection from sequence rules-based inference attacks.

    PubMed

    Zhang, Haitao; Wu, Chenxue; Chen, Zewei; Liu, Zhao; Zhu, Yunhong

    2017-01-01

    Analyzing large-scale spatial-temporal k-anonymity datasets recorded in location-based service (LBS) application servers can benefit some LBS applications. However, such analyses can allow adversaries to make inference attacks that cannot be handled by spatial-temporal k-anonymity methods or other methods for protecting sensitive knowledge. In response to this challenge, first we defined a destination location prediction attack model based on privacy-sensitive sequence rules mined from large scale anonymity datasets. Then we proposed a novel on-line spatial-temporal k-anonymity method that can resist such inference attacks. Our anti-attack technique generates new anonymity datasets with awareness of privacy-sensitive sequence rules. The new datasets extend the original sequence database of anonymity datasets to hide the privacy-sensitive rules progressively. The process includes two phases: off-line analysis and on-line application. In the off-line phase, sequence rules are mined from an original sequence database of anonymity datasets, and privacy-sensitive sequence rules are developed by correlating privacy-sensitive spatial regions with spatial grid cells among the sequence rules. In the on-line phase, new anonymity datasets are generated upon LBS requests by adopting specific generalization and avoidance principles to hide the privacy-sensitive sequence rules progressively from the extended sequence anonymity datasets database. We conducted extensive experiments to test the performance of the proposed method, and to explore the influence of the parameter K value. The results demonstrated that our proposed approach is faster and more effective for hiding privacy-sensitive sequence rules in terms of hiding sensitive rules ratios to eliminate inference attacks. 
Our method also had fewer side effects in terms of generating new sensitive rules ratios than the traditional spatial-temporal k-anonymity method, and had basically the same side effects in terms of non-sensitive rules variation ratios with the traditional spatial-temporal k-anonymity method. Furthermore, we also found the performance variation tendency from the parameter K value, which can help achieve the goal of hiding the maximum number of original sensitive rules while generating a minimum of new sensitive rules and affecting a minimum number of non-sensitive rules.

  20. A novel on-line spatial-temporal k-anonymity method for location privacy protection from sequence rules-based inference attacks

    PubMed Central

    Wu, Chenxue; Liu, Zhao; Zhu, Yunhong

    2017-01-01

    Analyzing large-scale spatial-temporal k-anonymity datasets recorded in location-based service (LBS) application servers can benefit some LBS applications. However, such analyses can allow adversaries to make inference attacks that cannot be handled by spatial-temporal k-anonymity methods or other methods for protecting sensitive knowledge. In response to this challenge, first we defined a destination location prediction attack model based on privacy-sensitive sequence rules mined from large scale anonymity datasets. Then we proposed a novel on-line spatial-temporal k-anonymity method that can resist such inference attacks. Our anti-attack technique generates new anonymity datasets with awareness of privacy-sensitive sequence rules. The new datasets extend the original sequence database of anonymity datasets to hide the privacy-sensitive rules progressively. The process includes two phases: off-line analysis and on-line application. In the off-line phase, sequence rules are mined from an original sequence database of anonymity datasets, and privacy-sensitive sequence rules are developed by correlating privacy-sensitive spatial regions with spatial grid cells among the sequence rules. In the on-line phase, new anonymity datasets are generated upon LBS requests by adopting specific generalization and avoidance principles to hide the privacy-sensitive sequence rules progressively from the extended sequence anonymity datasets database. We conducted extensive experiments to test the performance of the proposed method, and to explore the influence of the parameter K value. The results demonstrated that our proposed approach is faster and more effective for hiding privacy-sensitive sequence rules in terms of hiding sensitive rules ratios to eliminate inference attacks. 
Our method also had fewer side effects in terms of generating new sensitive rules ratios than the traditional spatial-temporal k-anonymity method, and had basically the same side effects in terms of non-sensitive rules variation ratios with the traditional spatial-temporal k-anonymity method. Furthermore, we also found the performance variation tendency from the parameter K value, which can help achieve the goal of hiding the maximum number of original sensitive rules while generating a minimum of new sensitive rules and affecting a minimum number of non-sensitive rules. PMID:28767687

  1. SCMPSP: Prediction and characterization of photosynthetic proteins based on a scoring card method.

    PubMed

    Vasylenko, Tamara; Liou, Yi-Fan; Chen, Hong-An; Charoenkwan, Phasit; Huang, Hui-Ling; Ho, Shinn-Ying

    2015-01-01

    Photosynthetic proteins (PSPs) greatly differ in their structure and function as they are involved in numerous subprocesses that take place inside an organelle called a chloroplast. Few studies predict PSPs from sequences due to their high variety of sequences and structures. This work aims to predict and characterize PSPs by establishing the datasets of PSP and non-PSP sequences and developing prediction methods. A novel bioinformatics method of predicting and characterizing PSPs based on scoring card method (SCMPSP) was used. First, a dataset consisting of 649 PSPs was established by using a Gene Ontology term GO:0015979 and 649 non-PSPs from the SwissProt database with sequence identity <= 25%. Several prediction methods are presented based on support vector machine (SVM), decision tree J48, Bayes, BLAST, and SCM. The SVM method using dipeptide features performed well and yielded a test accuracy of 72.31%. The SCMPSP method uses the estimated propensity scores of 400 dipeptides as PSPs and has a test accuracy of 71.54%, which is comparable to that of the SVM method. The derived propensity scores of 20 amino acids were further used to identify informative physicochemical properties for characterizing PSPs. The analytical results reveal the following four characteristics of PSPs: 1) PSPs favour hydrophobic side chain amino acids; 2) PSPs are composed of the amino acids prone to form helices in membrane environments; 3) PSPs have low interaction with water; and 4) PSPs prefer to be composed of the amino acids of electron-reactive side chains. The SCMPSP method not only estimates the propensity of a sequence to be PSPs, it also discovers characteristics that further improve understanding of PSPs. The SCMPSP source code and the datasets used in this study are available at http://iclab.life.nctu.edu.tw/SCMPSP/.

  2. Genomics dataset of unidentified disclosed isolates.

    PubMed

    Rekadwad, Bhagwan N

    2016-09-01

    Analysis of DNA sequences is necessary for higher hierarchical classification of organisms. It gives clues about the characteristics of organisms and their taxonomic position. This dataset was chosen to find complexities in the unidentified DNA in the disclosed patents. A total of 17 unidentified DNA sequences were thoroughly analyzed. Quick response (QR) codes were generated, and the AT/GC content of the DNA sequences was analyzed. The QR codes are helpful for quick identification of isolates. AT/GC content is helpful for studying their stability at different temperatures. Additionally, a dataset on the cleavage codes and enzyme codes studied in the restriction digestion study, which is helpful for performing studies using short DNA sequences, was reported. The dataset disclosed here is new data for exploration of unique DNA sequences for evaluation, identification, comparison and analysis.
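
The AT/GC content mentioned in the abstract above can be illustrated with a minimal Python sketch. It assumes plain A/C/G/T strings and ignores any other symbol; the function name and the example sequence are hypothetical and are not taken from the dataset.

```python
def gc_content(seq: str) -> float:
    """Return the fraction of G and C bases among the A/C/G/T bases in seq."""
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    total = gc + at
    if total == 0:
        raise ValueError("sequence contains no A/C/G/T bases")
    return gc / total


# Hypothetical example sequence (not one of the 17 sequences in the record).
example = "ATGCGC"
gc = gc_content(example)
print(f"GC fraction = {gc:.3f}, AT fraction = {1 - gc:.3f}")
```

The AT fraction is simply one minus the GC fraction under this definition, because ambiguous symbols are excluded from the denominator.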

  3. A reference human genome dataset of the BGISEQ-500 sequencer.

    PubMed

    Huang, Jie; Liang, Xinming; Xuan, Yuankai; Geng, Chunyu; Li, Yuxiang; Lu, Haorong; Qu, Shoufang; Mei, Xianglin; Chen, Hongbo; Yu, Ting; Sun, Nan; Rao, Junhua; Wang, Jiahao; Zhang, Wenwei; Chen, Ying; Liao, Sha; Jiang, Hui; Liu, Xin; Yang, Zhaopeng; Mu, Feng; Gao, Shangxian

    2017-05-01

    BGISEQ-500 is a new desktop sequencer developed by BGI. Using DNA nanoball and combinational probe anchor synthesis developed from Complete Genomics™ sequencing technologies, it generates short reads at a large scale. Here, we present the first human whole-genome sequencing dataset of BGISEQ-500. The dataset was generated by sequencing the widely used cell line HG001 (NA12878) in two sequencing runs of paired-end 50 bp (PE50) and two sequencing runs of paired-end 100 bp (PE100). We also include examples of the raw images from the sequencer for reference. Finally, we identified variations using this dataset, estimated the accuracy of the variations, and compared it to that of the variations identified from similar amounts of publicly available HiSeq2500 data. We found similar single nucleotide polymorphism (SNP) detection accuracy for the BGISEQ-500 PE100 data (false positive rate [FPR] = 0.00020%, sensitivity = 96.20%) compared to the PE150 HiSeq2500 data (FPR = 0.00017%, sensitivity = 96.60%), and better SNP detection accuracy than the PE50 data (FPR = 0.0006%, sensitivity = 94.15%). But for insertions and deletions (indels), we found lower accuracy for BGISEQ-500 data (FPR = 0.00069% and 0.00067% for PE100 and PE50 respectively, sensitivity = 88.52% and 70.93%) than the HiSeq2500 data (FPR = 0.00032%, sensitivity = 96.28%). Our dataset can serve as the reference dataset, providing basic information not just for future development, but also for all research and applications based on the new sequencing platform. © The Authors 2017. Published by Oxford University Press.
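
For readers unfamiliar with the two accuracy figures quoted in the abstract above, the following Python sketch shows the standard textbook definitions of sensitivity and false positive rate. It is an assumption that the study used exactly these definitions, and the counts in the example are hypothetical, not values from the paper.

```python
def sensitivity(tp: int, fn: int) -> float:
    """Fraction of true variants that were called: TP / (TP + FN)."""
    return tp / (tp + fn)


def false_positive_rate(fp: int, tn: int) -> float:
    """Fraction of non-variant sites wrongly called as variants: FP / (FP + TN)."""
    return fp / (fp + tn)


# Hypothetical confusion-matrix counts, chosen only for illustration.
tp, fn, fp, tn = 962, 38, 2, 999_998
print(f"sensitivity = {sensitivity(tp, fn):.2%}")
print(f"FPR = {false_positive_rate(fp, tn):.5%}")
```

Because non-variant sites vastly outnumber variant sites in a genome, the FPR denominator is very large, which is why the reported FPR values are tiny percentages.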

  4. Large-scale benchmarking reveals false discoveries and count transformation sensitivity in 16S rRNA gene amplicon data analysis methods used in microbiome studies.

    PubMed

    Thorsen, Jonathan; Brejnrod, Asker; Mortensen, Martin; Rasmussen, Morten A; Stokholm, Jakob; Al-Soud, Waleed Abu; Sørensen, Søren; Bisgaard, Hans; Waage, Johannes

    2016-11-25

    There is an immense scientific interest in the human microbiome and its effects on human physiology, health, and disease. A common approach for examining bacterial communities is high-throughput sequencing of 16S rRNA gene hypervariable regions, aggregating sequence-similar amplicons into operational taxonomic units (OTUs). Strategies for detecting differential relative abundance of OTUs between sample conditions include classical statistical approaches as well as a plethora of newer methods, many borrowing from the related field of RNA-seq analysis. This effort is complicated by unique data characteristics, including sparsity, sequencing depth variation, and nonconformity of read counts to theoretical distributions, which is often exacerbated by exploratory and/or unbalanced study designs. Here, we assess the robustness of available methods for (1) inference in differential relative abundance analysis and (2) beta-diversity-based sample separation, using a rigorous benchmarking framework based on large clinical 16S microbiome datasets from different sources. Running more than 380,000 full differential relative abundance tests on real datasets with permuted case/control assignments and in silico-spiked OTUs, we identify large differences in method performance on a range of parameters, including false positive rates, sensitivity to sparsity and case/control balances, and spike-in retrieval rate. In large datasets, methods with the highest false positive rates also tend to have the best detection power. For beta-diversity-based sample separation, we show that library size normalization has very little effect and that the distance metric is the most important factor in terms of separation power. Our results, generalizable to datasets from different sequencing platforms, demonstrate how the choice of method considerably affects analysis outcome. 
Here, we give recommendations for tools that exhibit low false positive rates, have good retrieval power across effect sizes and case/control proportions, and have low sparsity bias. Result output from some commonly used methods should be interpreted with caution. We provide an easily extensible framework for benchmarking of new methods and future microbiome datasets.

  5. DAMe: a toolkit for the initial processing of datasets with PCR replicates of double-tagged amplicons for DNA metabarcoding analyses.

    PubMed

    Zepeda-Mendoza, Marie Lisandra; Bohmann, Kristine; Carmona Baez, Aldo; Gilbert, M Thomas P

    2016-05-03

    DNA metabarcoding is an approach for identifying multiple taxa in an environmental sample using specific genetic loci and taxa-specific primers. When combined with high-throughput sequencing it enables the taxonomic characterization of large numbers of samples in a relatively time- and cost-efficient manner. One recent laboratory development is the addition of 5'-nucleotide tags to both primers producing double-tagged amplicons and the use of multiple PCR replicates to filter erroneous sequences. However, there is currently no available toolkit for the straightforward analysis of datasets produced in this way. We present DAMe, a toolkit for the processing of datasets generated by double-tagged amplicons from multiple PCR replicates derived from an unlimited number of samples. Specifically, DAMe can be used to (i) sort amplicons by tag combination, (ii) evaluate PCR replicates dissimilarity, and (iii) filter sequences derived from sequencing/PCR errors, chimeras, and contamination. This is attained by calculating the following parameters: (i) sequence content similarity between the PCR replicates from each sample, (ii) reproducibility of each unique sequence across the PCR replicates, and (iii) copy number of the unique sequences in each PCR replicate. We showcase the insights that can be obtained using DAMe prior to taxonomic assignment, by applying it to two real datasets that vary in their complexity regarding number of samples, sequencing libraries, PCR replicates, and used tag combinations. Finally, we use a third mock dataset to demonstrate the impact and importance of filtering the sequences with DAMe. DAMe allows the user-friendly manipulation of amplicons derived from multiple samples with PCR replicates built in a single or multiple sequencing libraries. 
It allows the user to: (i) collapse amplicons into unique sequences and sort them by tag combination while retaining the sample identifier and copy number information, (ii) identify sequences carrying unused tag combinations, (iii) evaluate the comparability of PCR replicates of the same sample, and (iv) filter tagged amplicons from a number of PCR replicates using parameters of minimum length, copy number, and reproducibility across the PCR replicates. This enables an efficient analysis of complex datasets, and ultimately increases the ease of handling datasets from large-scale studies.
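
The filtering idea described in the abstract above (minimum length, copy number per PCR replicate, and reproducibility across replicates) can be sketched in Python as follows. This is a simplified illustration, not DAMe's actual code or interface; the function name, parameter names and example reads are hypothetical.

```python
from collections import Counter


def filter_sequences(replicates, min_length, min_copies, min_replicates):
    """Keep sequences that are long enough and reproducible across replicates.

    replicates: list of lists, one list of amplicon sequences per PCR replicate.
    Returns a dict mapping each kept sequence to its total copy number.
    """
    # Collapse each replicate into unique sequences with copy numbers.
    counts = [Counter(rep) for rep in replicates]
    kept = {}
    for seq in set().union(*counts):
        if len(seq) < min_length:
            continue
        # Number of replicates in which the sequence reaches min_copies.
        support = sum(1 for c in counts if c[seq] >= min_copies)
        if support >= min_replicates:
            # Total copies over all replicates, including weakly supported ones.
            kept[seq] = sum(c[seq] for c in counts)
    return kept


# Hypothetical reads from three PCR replicates of one sample.
reps = [
    ["ACGT", "ACGT", "AC", "TTTT"],
    ["ACGT", "ACGT", "TTTT", "TTTT"],
    ["ACGT", "GGGG", "GGGG"],
]
print(filter_sequences(reps, min_length=4, min_copies=2, min_replicates=2))
```

In this toy input only `ACGT` reaches two copies in at least two replicates, so sequences that appear strongly in a single replicate (a typical signature of PCR or sequencing error) are dropped.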

  6. Visual exploration of parameter influence on phylogenetic trees.

    PubMed

    Hess, Martin; Bremm, Sebastian; Weissgraeber, Stephanie; Hamacher, Kay; Goesele, Michael; Wiemeyer, Josef; von Landesberger, Tatiana

    2014-01-01

    Evolutionary relationships between organisms are frequently derived as phylogenetic trees inferred from multiple sequence alignments (MSAs). The MSA parameter space is exponentially large, so tens of thousands of potential trees can emerge for each dataset. A proposed visual-analytics approach can reveal the parameters' impact on the trees. Given input trees created with different parameter settings, it hierarchically clusters the trees according to their structural similarity. The most important clusters of similar trees are shown together with their parameters. This view offers interactive parameter exploration and automatic identification of relevant parameters. Biologists applied this approach to real data of 16S ribosomal RNA and protein sequences of ion channels. It revealed which parameters affected the tree structures. This led to a more reliable selection of the best trees.

  7. Can natural proteins designed with 'inverted' peptide sequences adopt native-like protein folds?

    PubMed

    Sridhar, Settu; Guruprasad, Kunchur

    2014-01-01

    We have carried out a systematic computational analysis on a representative dataset of proteins of known three-dimensional structure, in order to evaluate whether it would be possible to 'swap' certain short peptide sequences in naturally occurring proteins with their corresponding 'inverted' peptides and generate 'artificial' proteins that are predicted to retain native-like protein fold. The analysis of 3,967 representative proteins from the Protein Data Bank revealed 102,677 unique identical inverted peptide sequence pairs that vary in sequence length between 5-12 and 18 amino acid residues. Our analysis illustrates with examples that such 'artificial' proteins may be generated by identifying peptides with 'similar structural environment' and by using comparative protein modeling and validation studies. Our analysis suggests that natural proteins may be tolerant to accommodating such peptides.

  8. Phylodynamic Analysis Revealed That Epidemic of CRF07_BC Strain in Men Who Have Sex with Men Drove Its Second Spreading Wave in China.

    PubMed

    Zhang, Min; Jia, Dijing; Li, Hanping; Gui, Tao; Jia, Lei; Wang, Xiaolin; Li, Tianyi; Liu, Yongjian; Bao, Zuoyi; Liu, Siyang; Zhuang, Daomin; Li, Jingyun; Li, Lin

    2017-10-01

    CRF07_BC was originally formed in Yunnan province of China in the 1980s and spread quickly in injecting drug users (IDUs). In recent years, it has been introduced into men who have sex with men (MSM) and has become the most dominant strain in China. In this study, we performed a comprehensive phylodynamic analysis of CRF07_BC sequences from China. All CRF07_BC sequences identified in China were retrieved from database. More sequences obtained in our laboratory were added to make the dataset more representative. A maximum-likelihood (ML) tree was constructed with PhyML3.0. Maximum clade credibility (MCC) tree and effective population size were predicted by using Markov Chains Monte Carlo sampling method with Beast software. A total of 610 CRF07_BC sequences covering 1,473 bp of the gag gene (from 817 to 2,289 according to HXB2 calculator) were included in the dataset. Three epidemic clusters were identified; two clusters comprised sequences from IDUs, while one cluster mainly contained sequences from MSMs. The time of the most recent common ancestor of the cluster composed of sequences from MSMs was estimated to be in 2000. Two rapid spreading waves of effective population size of CRF07_BC infections were identified in the skyline plot. The second wave coincided with the expansion of the MSM cluster. The results indicated that the control of CRF07_BC infections in MSMs would help to decrease its epidemic in China.

  9. Simulating Next-Generation Sequencing Datasets from Empirical Mutation and Sequencing Models

    PubMed Central

    Stephens, Zachary D.; Hudson, Matthew E.; Mainzer, Liudmila S.; Taschuk, Morgan; Weber, Matthew R.; Iyer, Ravishankar K.

    2016-01-01

    An obstacle to validating and benchmarking methods for genome analysis is that there are few reference datasets available for which the “ground truth” about the mutational landscape of the sample genome is known and fully validated. Additionally, the free and public availability of real human genome datasets is incompatible with the preservation of donor privacy. In order to better analyze and understand genomic data, we need test datasets that model all variants, reflecting known biology as well as sequencing artifacts. Read simulators can fulfill this requirement, but are often criticized for limited resemblance to true data and overall inflexibility. We present NEAT (NExt-generation sequencing Analysis Toolkit), a set of tools that not only includes an easy-to-use read simulator, but also scripts to facilitate variant comparison and tool evaluation. NEAT has a wide variety of tunable parameters which can be set manually on the default model or parameterized using real datasets. The software is freely available at github.com/zstephens/neat-genreads. PMID:27893777

  10. High-throughput engineering of a mammalian genome reveals building principles of methylation states at CG rich regions.

    PubMed

    Krebs, Arnaud R; Dessus-Babus, Sophie; Burger, Lukas; Schübeler, Dirk

    2014-09-26

    The majority of mammalian promoters are CpG islands; regions of high CG density that require protection from DNA methylation to be functional. Importantly, how sequence architecture mediates this unmethylated state remains unclear. To address this question in a comprehensive manner, we developed a method to interrogate methylation states of hundreds of sequence variants inserted at the same genomic site in mouse embryonic stem cells. Using this assay, we were able to quantify the contribution of various sequence motifs towards the resulting DNA methylation state. Modeling of this comprehensive dataset revealed that CG density alone is a minor determinant of their unmethylated state. Instead, these data argue for a principal role for transcription factor binding sites, a prediction confirmed by testing synthetic mutant libraries. Taken together, these findings establish the hierarchy between the two cis-encoded mechanisms that define the DNA methylation state and thus the transcriptional competence of CpG islands.

  11. SamSelect: a sample sequence selection algorithm for quorum planted motif search on large DNA datasets.

    PubMed

    Yu, Qiang; Wei, Dingbang; Huo, Hongwei

    2018-06-18

    Given a set of t n-length DNA sequences, q satisfying 0 < q ≤ 1, and l and d satisfying 0 ≤ d < l < n, the quorum planted motif search (qPMS) finds l-length strings that occur in at least qt input sequences with up to d mismatches and is mainly used to locate transcription factor binding sites in DNA sequences. Existing qPMS algorithms have been able to efficiently process small standard datasets (e.g., t = 20 and n = 600), but they are too time consuming to process large DNA datasets, such as ChIP-seq datasets that contain thousands of sequences or more. We analyze the effects of t and q on the time performance of qPMS algorithms and find that a large t or a small q causes a longer computation time. Based on this information, we improve the time performance of existing qPMS algorithms by selecting a sample sequence set D' with a small t and a large q from the large input dataset D and then executing qPMS algorithms on D'. A sample sequence selection algorithm named SamSelect is proposed. The experimental results on both simulated and real data show (1) that SamSelect can select D' efficiently and (2) that the qPMS algorithms executed on D' can find implanted or real motifs in a significantly shorter time than when executed on D. We improve the ability of existing qPMS algorithms to process large DNA datasets from the perspective of selecting high-quality sample sequence sets so that the qPMS algorithms can find motifs in a short time in the selected sample sequence set D', rather than take an unfeasibly long time to search the original sequence set D. Our motif discovery method is an approximate algorithm.
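
The problem definition in the abstract above can be made concrete with a brute-force Python sketch. It enumerates all 4^l candidate strings, so it is only usable for very small l, and it is neither SamSelect nor any of the qPMS algorithms the record refers to; the function names and toy sequences are hypothetical.

```python
from itertools import product


def hamming(a: str, b: str) -> int:
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def occurs(motif: str, seq: str, d: int) -> bool:
    """True if some l-length window of seq is within d mismatches of motif."""
    l = len(motif)
    return any(hamming(motif, seq[i:i + l]) <= d for i in range(len(seq) - l + 1))


def qpms_bruteforce(seqs, l: int, d: int, q: float):
    """Return all l-length strings occurring in at least q*t of the t sequences."""
    t = len(seqs)
    found = []
    for letters in product("ACGT", repeat=l):
        motif = "".join(letters)
        hits = sum(occurs(motif, s, d) for s in seqs)
        if hits >= q * t:
            found.append(motif)
    return found


# Hypothetical toy input: t = 4 sequences, exact matches (d = 0), quorum q = 0.5.
toy = ["ACGTAC", "TTACGA", "GGGGGG", "CCCCCC"]
print(qpms_bruteforce(toy, l=3, d=0, q=0.5))
```

The cost grows with both the number of sequences t and the size of the candidate space, which is consistent with the abstract's motivation for running the search on a smaller selected sample D' instead of the full dataset D.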

  12. Detailed analysis of metagenome datasets obtained from biogas-producing microbial communities residing in biogas reactors does not indicate the presence of putative pathogenic microorganisms

    PubMed Central

    2013-01-01

    Background In recent years biogas plants in Germany have been supposed to be involved in amplification and dissemination of pathogenic bacteria causing severe infections in humans and animals. In particular, biogas plants are discussed to contribute to the spreading of Escherichia coli infections in humans or chronic botulism in cattle caused by Clostridium botulinum. Metagenome datasets of microbial communities from an agricultural biogas plant as well as from anaerobic lab-scale digesters operating at different temperatures and conditions were analyzed for the presence of putative pathogenic bacteria and virulence determinants by various bioinformatic approaches. Results All datasets featured a low abundance of reads that were taxonomically assigned to the genus Escherichia or further selected genera comprising pathogenic species. Higher numbers of reads were taxonomically assigned to the genus Clostridium. However, only very few sequences were predicted to originate from pathogenic clostridial species. Moreover, mapping of metagenome reads to complete genome sequences of selected pathogenic bacteria revealed that not the pathogenic species itself, but only species that are more or less related to pathogenic ones are present in the fermentation samples analyzed. Likewise, known virulence determinants could hardly be detected. Only a marginal number of reads showed similarity to sequences described in the Microbial Virulence Database MvirDB such as those encoding protein toxins, virulence proteins or antibiotic resistance determinants. Conclusions Findings of this first study of metagenomic sequence reads of biogas producing microbial communities suggest that the risk of dissemination of pathogenic bacteria by application of digestates from biogas fermentations as fertilizers is low, because obtained results do not indicate the presence of putative pathogenic microorganisms in the samples analyzed. PMID:23557021

  13. High-resolution phylogeography of zoonotic tapeworm Echinococcus granulosus sensu stricto genotype G1 with an emphasis on its distribution in Turkey, Italy and Spain.

    PubMed

    Kinkar, Liina; Laurimäe, Teivi; Simsek, Sami; Balkaya, Ibrahim; Casulli, Adriano; Manfredi, Maria Teresa; Ponce-Gordo, Francisco; Varcasia, Antonio; Lavikainen, Antti; González, Luis Miguel; Rehbein, Steffen; van der Giessen, Joke; Sprong, Hein; Saarma, Urmas

    2016-11-01

    Echinococcus granulosus is the causative agent of cystic echinococcosis. The disease is a significant global public health concern and human infections are most commonly associated with E. granulosus sensu stricto (s. s.) genotype G1. The objectives of this study were to: (i) analyse the genetic variation and phylogeography of E. granulosus s. s. G1 in part of its main distribution range in Europe using 8274 bp of mtDNA; (ii) compare the results with those derived from previously used shorter mtDNA sequences and highlight the major differences. We sequenced a total of 91 E. granulosus s. s. G1 isolates from six different intermediate host species, including humans. The isolates originated from seven countries representing primarily Turkey, Italy and Spain. A few samples also came from Albania, Greece, Romania and from a patient originating from Algeria, but diagnosed in Finland. The analysed 91 sequences were divided into 83 haplotypes, revealing complex phylogeography and high genetic variation of E. granulosus s. s. G1 in Europe, particularly in the high-diversity domestication centre of western Asia. Comparisons with shorter mtDNA datasets revealed that 8274 bp sequences provided significantly higher phylogenetic resolution and thus more power to reveal the genetic relations between different haplotypes.
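
    The step of dividing sequences into haplotypes can be illustrated with a minimal Python sketch. It assumes that a haplotype is simply a set of isolates with identical aligned sequences; the function name and the toy isolate labels are hypothetical, and the record does not describe how the authors handled gaps or ambiguous bases.

```python
from collections import defaultdict


def collapse_haplotypes(sequences):
    """Group isolate names that share an identical aligned sequence.

    sequences: dict mapping isolate name -> aligned sequence string.
    Returns a list of groups (lists of isolate names), one per haplotype.
    Assumption: identity is exact string equality after upper-casing;
    gaps and ambiguity codes are not treated specially.
    """
    groups = defaultdict(list)
    for name, seq in sequences.items():
        groups[seq.upper()].append(name)
    return list(groups.values())


# Toy input (invented, not data from the study).
isolates = {"TR1": "ACGT", "IT1": "acgt", "ES1": "ACGA"}
haplotypes = collapse_haplotypes(isolates)
print(len(isolates), "sequences ->", len(haplotypes), "haplotypes")
```

    With longer sequences more variable sites are compared, so fewer isolates collapse into the same group, which is consistent with the higher resolution reported for the 8274 bp dataset.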

  14. Transcriptome-Assisted Label-Free Quantitative Proteomics Analysis Reveals Novel Insights into Piper nigrum—Phytophthora capsici Phytopathosystem

    PubMed Central

    Mahadevan, Chidambareswaren; Krishnan, Anu; Saraswathy, Gayathri G.; Surendran, Arun; Jaleel, Abdul; Sakuntala, Manjula

    2016-01-01

    Black pepper (Piper nigrum L.), a tropical spice crop of global acclaim, is susceptible to Phytophthora capsici, an oomycete pathogen which causes the highly destructive foot rot disease. A systematic understanding of this phytopathosystem has not been possible owing to lack of genome or proteome information. In this study, we explain an integrated transcriptome-assisted label-free quantitative proteomics pipeline to study the basal immune components of black pepper when challenged with P. capsici. We report a global identification of 532 novel leaf proteins from black pepper, of which 518 proteins were functionally annotated using BLAST2GO tool. A label-free quantitation of the protein datasets revealed 194 proteins common to diseased and control protein datasets of which 22 proteins showed significant up-regulation and 134 showed significant down-regulation. Ninety-three proteins were identified exclusively on P. capsici infected leaf tissues and 245 were expressed only in mock (control) infected samples. In-depth analysis of our data gives novel insights into the regulatory pathways of black pepper which are compromised during the infection. Differential down-regulation was observed in a number of critical pathways like carbon fixation in photosynthetic organism, cyano-amino acid metabolism, fructose, and mannose metabolism, glutathione metabolism, and phenylpropanoid biosynthesis. The proteomics results were validated with real-time qRT-PCR analysis. We were also able to identify the complete coding sequences for all the proteins of which few selected genes were cloned and sequence characterized for further confirmation. Our study is the first report of a quantitative proteomics dataset in black pepper which provides convincing evidence on the effectiveness of a transcriptome-based label-free proteomics approach for elucidating the host response to biotic stress in a non-model spice crop like P. nigrum, for which genome information is unavailable. 
    Our dataset will serve as a useful resource for future studies in this plant. Data are available via ProteomeXchange with identifier PXD003887. PMID:27379110

  15. Transcriptome-Assisted Label-Free Quantitative Proteomics Analysis Reveals Novel Insights into Piper nigrum-Phytophthora capsici Phytopathosystem.

    PubMed

    Mahadevan, Chidambareswaren; Krishnan, Anu; Saraswathy, Gayathri G; Surendran, Arun; Jaleel, Abdul; Sakuntala, Manjula

    2016-01-01

    Black pepper (Piper nigrum L.), a tropical spice crop of global acclaim, is susceptible to Phytophthora capsici, an oomycete pathogen which causes the highly destructive foot rot disease. A systematic understanding of this phytopathosystem has not been possible owing to lack of genome or proteome information. In this study, we explain an integrated transcriptome-assisted label-free quantitative proteomics pipeline to study the basal immune components of black pepper when challenged with P. capsici. We report a global identification of 532 novel leaf proteins from black pepper, of which 518 proteins were functionally annotated using BLAST2GO tool. A label-free quantitation of the protein datasets revealed 194 proteins common to diseased and control protein datasets of which 22 proteins showed significant up-regulation and 134 showed significant down-regulation. Ninety-three proteins were identified exclusively on P. capsici infected leaf tissues and 245 were expressed only in mock (control) infected samples. In-depth analysis of our data gives novel insights into the regulatory pathways of black pepper which are compromised during the infection. Differential down-regulation was observed in a number of critical pathways like carbon fixation in photosynthetic organism, cyano-amino acid metabolism, fructose, and mannose metabolism, glutathione metabolism, and phenylpropanoid biosynthesis. The proteomics results were validated with real-time qRT-PCR analysis. We were also able to identify the complete coding sequences for all the proteins of which few selected genes were cloned and sequence characterized for further confirmation. Our study is the first report of a quantitative proteomics dataset in black pepper which provides convincing evidence on the effectiveness of a transcriptome-based label-free proteomics approach for elucidating the host response to biotic stress in a non-model spice crop like P. nigrum, for which genome information is unavailable. 
    Our dataset will serve as a useful resource for future studies in this plant. Data are available via ProteomeXchange with identifier PXD003887.

  16. Compression-based distance (CBD): a simple, rapid, and accurate method for microbiota composition comparison

    PubMed Central

    2013-01-01

    Background Perturbations in intestinal microbiota composition have been associated with a variety of gastrointestinal tract-related diseases. The alleviation of symptoms has been achieved using treatments that alter the gastrointestinal tract microbiota toward that of healthy individuals. Identifying differences in microbiota composition through the use of 16S rRNA gene hypervariable tag sequencing has profound health implications. Current computational methods for comparing microbial communities are usually based on multiple alignments and phylogenetic inference, making them time consuming and requiring exceptional expertise and computational resources. As sequencing data rapidly grows in size, simpler analysis methods are needed to meet the growing computational burdens of microbiota comparisons. Thus, we have developed a simple, rapid, and accurate method, independent of multiple alignments and phylogenetic inference, to support microbiota comparisons. Results We create a metric, called compression-based distance (CBD) for quantifying the degree of similarity between microbial communities. CBD uses the repetitive nature of hypervariable tag datasets and well-established compression algorithms to approximate the total information shared between two datasets. Three published microbiota datasets were used as test cases for CBD as an applicable tool. Our study revealed that CBD recaptured 100% of the statistically significant conclusions reported in the previous studies, while achieving a decrease in computational time required when compared to similar tools without expert user intervention. Conclusion CBD provides a simple, rapid, and accurate method for assessing distances between gastrointestinal tract microbiota 16S hypervariable tag datasets. PMID:23617892

  17. Compression-based distance (CBD): a simple, rapid, and accurate method for microbiota composition comparison.

    PubMed

    Yang, Fang; Chia, Nicholas; White, Bryan A; Schook, Lawrence B

    2013-04-23

    Perturbations in intestinal microbiota composition have been associated with a variety of gastrointestinal tract-related diseases. The alleviation of symptoms has been achieved using treatments that alter the gastrointestinal tract microbiota toward that of healthy individuals. Identifying differences in microbiota composition through the use of 16S rRNA gene hypervariable tag sequencing has profound health implications. Current computational methods for comparing microbial communities are usually based on multiple alignments and phylogenetic inference, making them time consuming and requiring exceptional expertise and computational resources. As sequencing data rapidly grows in size, simpler analysis methods are needed to meet the growing computational burdens of microbiota comparisons. Thus, we have developed a simple, rapid, and accurate method, independent of multiple alignments and phylogenetic inference, to support microbiota comparisons. We create a metric, called compression-based distance (CBD) for quantifying the degree of similarity between microbial communities. CBD uses the repetitive nature of hypervariable tag datasets and well-established compression algorithms to approximate the total information shared between two datasets. Three published microbiota datasets were used as test cases for CBD as an applicable tool. Our study revealed that CBD recaptured 100% of the statistically significant conclusions reported in the previous studies, while achieving a decrease in computational time required when compared to similar tools without expert user intervention. CBD provides a simple, rapid, and accurate method for assessing distances between gastrointestinal tract microbiota 16S hypervariable tag datasets.
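
    The idea of using a compressor to approximate the information shared between two datasets can be sketched in Python as follows. This is a minimal sketch, not the published CBD implementation: the abstract does not give the exact formula, so the generic normalized compression distance with zlib is used as a stand-in, and the function names are hypothetical.

```python
import zlib


def compressed_size(data: bytes) -> int:
    """Length in bytes of the zlib-compressed input (maximum compression)."""
    return len(zlib.compress(data, 9))


def compression_distance(reads_a, reads_b):
    """Normalized compression distance between two collections of reads.

    Assumption: this is the generic NCD formula,
    (C(ab) - min(C(a), C(b))) / max(C(a), C(b)),
    standing in for the CBD metric, whose definition is not in the abstract.
    """
    a = "\n".join(reads_a).encode()
    b = "\n".join(reads_b).encode()
    size_a = compressed_size(a)
    size_b = compressed_size(b)
    size_ab = compressed_size(a + b"\n" + b)
    return (size_ab - min(size_a, size_b)) / max(size_a, size_b)
```

    Datasets that share many tag sequences compress well together, so their distance is small; unrelated datasets give values closer to 1. Note that zlib only finds repeats within a 32 KB window, so a sketch like this is suitable for small inputs only.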

  18. Metatranscriptomic analysis of diverse microbial communities reveals core metabolic pathways and microbiome-specific functionality.

    PubMed

    Jiang, Yue; Xiong, Xuejian; Danska, Jayne; Parkinson, John

    2016-01-12

    Metatranscriptomics is emerging as a powerful technology for the functional characterization of complex microbial communities (microbiomes). Use of unbiased RNA-sequencing can reveal both the taxonomic composition and active biochemical functions of a complex microbial community. However, the lack of established reference genomes, computational tools and pipelines make analysis and interpretation of these datasets challenging. Systematic studies that compare data across microbiomes are needed to demonstrate the ability of such pipelines to deliver biologically meaningful insights on microbiome function. Here, we apply a standardized analytical pipeline to perform a comparative analysis of metatranscriptomic data from diverse microbial communities derived from mouse large intestine, cow rumen, kimchi culture, deep-sea thermal vent and permafrost. Sequence similarity searches allowed annotation of 19 to 76% of putative messenger RNA (mRNA) reads, with the highest frequency in the kimchi dataset due to its relatively low complexity and availability of closely related reference genomes. Metatranscriptomic datasets exhibited distinct taxonomic and functional signatures. From a metabolic perspective, we identified a common core of enzymes involved in amino acid, energy and nucleotide metabolism and also identified microbiome-specific pathways such as phosphonate metabolism (deep sea) and glycan degradation pathways (cow rumen). Integrating taxonomic and functional annotations within a novel visualization framework revealed the contribution of different taxa to metabolic pathways, allowing the identification of taxa that contribute unique functions. The application of a single, standard pipeline confirms that the rich taxonomic and functional diversity observed across microbiomes is not simply an artefact of different analysis pipelines but instead reflects distinct environmental influences. 
    At the same time, our findings show how microbiome complexity and availability of reference genomes can impact comprehensive annotation of metatranscriptomes. Consequently, beyond the application of standardized pipelines, additional caution must be taken when interpreting their output and performing downstream, microbiome-specific, analyses. The pipeline used in these analyses along with a tutorial has been made freely available for download from our project website: http://www.compsysbio.org/microbiome.

  19. Biclustering as a method for RNA local multiple sequence alignment.

    PubMed

    Wang, Shu; Gutell, Robin R; Miranker, Daniel P

    2007-12-15

    Biclustering is a clustering method that simultaneously clusters both the domain and range of a relation. A challenge in multiple sequence alignment (MSA) is that the alignment of sequences is often intended to reveal groups of conserved functional subsequences. Simultaneously, the grouping of the sequences can impact the alignment; precisely the kind of dual situation biclustering is intended to address. We define a representation of the MSA problem enabling the application of biclustering algorithms. We develop a computer program for local MSA, BlockMSA, that combines biclustering with divide-and-conquer. BlockMSA simultaneously finds groups of similar sequences and locally aligns subsequences within them. Further alignment is accomplished by dividing both the set of sequences and their contents. The net result is both a multiple sequence alignment and a hierarchical clustering of the sequences. BlockMSA was tested on the subsets of the BRAliBase 2.1 benchmark suite that display high variability and on an extension to that suite to larger problem sizes. Also, alignments were evaluated of two large datasets of current biological interest, T box sequences and Group IC1 Introns. The results were compared with alignments computed by ClustalW, MAFFT, MUSCLE and PROBCONS alignment programs using Sum of Pairs (SPS) and Consensus Count. Results for the benchmark suite are sensitive to problem size. On problems of 15 or more sequences, BlockMSA is consistently the best. On none of the problems in the test suite are there appreciable differences in scores among BlockMSA, MAFFT and PROBCONS. On the T box sequences, BlockMSA does the most faithful job of reproducing known annotations. MAFFT and PROBCONS do not. On the Intron sequences, BlockMSA, MAFFT and MUSCLE are comparable at identifying conserved regions. BlockMSA is implemented in Java. Source code and supplementary datasets are available at http://aug.csres.utexas.edu/msa/

  20. Metaxa: a software tool for automated detection and discrimination among ribosomal small subunit (12S/16S/18S) sequences of archaea, bacteria, eukaryotes, mitochondria, and chloroplasts in metagenomes and environmental sequencing datasets.

    PubMed

    Bengtsson, Johan; Eriksson, K Martin; Hartmann, Martin; Wang, Zheng; Shenoy, Belle Damodara; Grelet, Gwen-Aëlle; Abarenkov, Kessy; Petri, Anna; Rosenblad, Magnus Alm; Nilsson, R Henrik

    2011-10-01

    The ribosomal small subunit (SSU) rRNA gene has emerged as an important genetic marker for taxonomic identification in environmental sequencing datasets. In addition to being present in the nucleus of eukaryotes and the core genome of prokaryotes, the gene is also found in the mitochondria of eukaryotes and in the chloroplasts of photosynthetic eukaryotes. These three sets of genes are conceptually paralogous and should in most situations not be aligned and analyzed jointly. To identify the origin of SSU sequences in complex sequence datasets has hitherto been a time-consuming and largely manual undertaking. However, the present study introduces Metaxa (http://microbiology.se/software/metaxa/), an automated software tool to extract full-length and partial SSU sequences from larger sequence datasets and assign them to an archaeal, bacterial, nuclear eukaryote, mitochondrial, or chloroplast origin. Using data from reference databases and from full-length organelle and organism genomes, we show that Metaxa detects and scores SSU sequences for origin with very low proportions of false positives and negatives. We believe that this tool will be useful in microbial and evolutionary ecology as well as in metagenomics.

  1. Novel antigenic shift in HA sequences of H1N1 viruses detected by big data analysis.

    PubMed

    Zhang, Ruiying; Xu, Chongfeng; Duan, Ziyuan

    2017-07-01

    The influenza virus H1N1 has been prevalent all over the world for nearly a century. Many studies on its evolutionary history, substitution rate and antigenicity-associated sites have been done with small datasets. To have a complete view, we analysed 3171 full-length HA sequences from human H1N1 viruses sampled from 1918 to 2016, and discovered a new clade has formed with sequences isolated in Iran. Based on genetic distance calculations, we revealed an uneven evolutionary rate among sequences isolated in different years. We also found that the HA1 fragment of the new clade is like that of viruses that existed in the 1930s, while the HA2 fragment is closely associated with strains isolated after the 2009 pandemic. This new, "mixed" HA sequence indicates a cryptic antigenic shift event occurred, and it should draw more attention to the new clade identified from sequences from Iran. Copyright © 2017. Published by Elsevier B.V.

  2. AMP: Assembly Matching Pursuit.

    PubMed

    Biswas, S; Jojic, V

    2013-01-01

    Metagenomics, the study of the total genetic material isolated from a biological host, promises to reveal host-microbe or microbe-microbe interactions that may help to personalize medicine or improve agronomic practice. We introduce a method that discovers metagenomic units (MGUs) relevant for phenotype prediction through sequence-based dictionary learning. The method aggregates patient-specific dictionaries and estimates MGU abundances in order to summarize a whole population and yield universally predictive biomarkers. We analyze the impact of Gaussian, Poisson, and Negative Binomial read count models in guiding dictionary construction by examining classification efficiency on a number of synthetic datasets and a real dataset from Ref. 1. Each outperforms standard methods of dictionary composition, such as random projection and orthogonal matching pursuit. Additionally, the predictive MGUs they recover are biologically relevant.

  3. Reference datasets for 2-treatment, 2-sequence, 2-period bioequivalence studies.

    PubMed

    Schütz, Helmut; Labes, Detlew; Fuglsang, Anders

    2014-11-01

    It is difficult to validate statistical software used to assess bioequivalence since very few datasets with known results are in the public domain, and the few that are published are of moderate size and balanced. The purpose of this paper is therefore to introduce reference datasets of varying complexity in terms of dataset size and characteristics (balance, range, outlier presence, residual error distribution) for 2-treatment, 2-period, 2-sequence bioequivalence studies and to report their point estimates and 90% confidence intervals which companies can use to validate their installations. The results for these datasets were calculated using the commercial packages EquivTest, Kinetica, SAS and WinNonlin, and the non-commercial package R. The results of three of these packages mostly agree, but imbalance between sequences seems to provoke questionable results with one package, which illustrates well the need for proper software validation.

  4. Genometa--a fast and accurate classifier for short metagenomic shotgun reads.

    PubMed

    Davenport, Colin F; Neugebauer, Jens; Beckmann, Nils; Friedrich, Benedikt; Kameri, Burim; Kokott, Svea; Paetow, Malte; Siekmann, Björn; Wieding-Drewes, Matthias; Wienhöfer, Markus; Wolf, Stefan; Tümmler, Burkhard; Ahlers, Volker; Sprengel, Frauke

    2012-01-01

    Metagenomic studies use high-throughput sequence data to investigate microbial communities in situ. However, considerable challenges remain in the analysis of these data, particularly with regard to speed and reliable analysis of microbial species as opposed to higher level taxa such as phyla. We here present Genometa, a computationally undemanding graphical user interface program that enables identification of bacterial species and gene content from datasets generated by inexpensive high-throughput short read sequencing technologies. Our approach was first verified on two simulated metagenomic short read datasets, detecting 100% and 94% of the bacterial species included with few false positives or false negatives. Subsequent comparative benchmarking analysis against three popular metagenomic algorithms on an Illumina human gut dataset revealed Genometa to attribute the most reads to bacteria at species level (i.e. including all strains of that species) and demonstrate similar or better accuracy than the other programs. Lastly, speed was demonstrated to be many times that of BLAST due to the use of modern short read aligners. Our method is highly accurate if bacteria in the sample are represented by genomes in the reference sequence but cannot find species absent from the reference. This method is one of the most user-friendly and resource efficient approaches and is thus feasible for rapidly analysing millions of short reads on a personal computer. The Genometa program, a step by step tutorial and Java source code are freely available from http://genomics1.mh-hannover.de/genometa/ and on http://code.google.com/p/genometa/. This program has been tested on Ubuntu Linux and Windows XP/7.

  5. mirVAFC: A Web Server for Prioritizations of Pathogenic Sequence Variants from Exome Sequencing Data via Classifications.

    PubMed

    Li, Zhongshan; Liu, Zhenwei; Jiang, Yi; Chen, Denghui; Ran, Xia; Sun, Zhong Sheng; Wu, Jinyu

    2017-01-01

    Exome sequencing has been widely used to identify the genetic variants underlying human genetic disorders for clinical diagnoses, but the identification of pathogenic sequence variants among the huge amounts of benign ones is complicated and challenging. Here, we describe a new Web server named mirVAFC for pathogenic sequence variants prioritizations from clinical exome sequencing (CES) variant data of single individual or family. The mirVAFC is able to comprehensively annotate sequence variants, filter out most irrelevant variants using custom criteria, classify variants into different categories as for estimated pathogenicity, and lastly provide pathogenic variants prioritizations based on classifications and mutation effects. Case studies using different types of datasets for different diseases from publication and our in-house data have revealed that mirVAFC can efficiently identify the right pathogenic candidates as in original work in each case. Overall, the Web server mirVAFC is specifically developed for pathogenic sequence variant identifications from family-based CES variants using classification-based prioritizations. The mirVAFC Web server is freely accessible at https://www.wzgenomics.cn/mirVAFC/. © 2016 WILEY PERIODICALS, INC.

  6. Simultaneously learning DNA motif along with its position and sequence rank preferences through expectation maximization algorithm.

    PubMed

    Zhang, ZhiZhuo; Chang, Cheng Wei; Hugo, Willy; Cheung, Edwin; Sung, Wing-Kin

    2013-03-01

    Although de novo motifs can be discovered through mining over-represented sequence patterns, this approach misses some real motifs and generates many false positives. To improve accuracy, one solution is to consider some additional binding features (i.e., position preference and sequence rank preference). This information is usually required from the user. This article presents a de novo motif discovery algorithm called SEME (sampling with expectation maximization for motif elicitation), which uses pure probabilistic mixture model to model the motif's binding features and uses expectation maximization (EM) algorithms to simultaneously learn the sequence motif, position, and sequence rank preferences without asking for any prior knowledge from the user. SEME is both efficient and accurate thanks to two important techniques: the variable motif length extension and importance sampling. Using 75 large-scale synthetic datasets, 32 metazoan compendium benchmark datasets, and 164 chromatin immunoprecipitation sequencing (ChIP-Seq) libraries, we demonstrated the superior performance of SEME over existing programs in finding transcription factor (TF) binding sites. SEME is further applied to a more difficult problem of finding the co-regulated TF (coTF) motifs in 15 ChIP-Seq libraries. It identified significantly more correct coTF motifs and, at the same time, predicted coTF motifs with better matching to the known motifs. Finally, we show that the learned position and sequence rank preferences of each coTF reveals potential interaction mechanisms between the primary TF and the coTF within these sites. Some of these findings were further validated by the ChIP-Seq experiments of the coTFs. The application is available online.

  7. Whole Genome Sequencing of Danish Staphylococcus argenteus Reveals a Genetically Diverse Collection with Clear Separation from Staphylococcus aureus.

    PubMed

    Hansen, Thomas A; Bartels, Mette D; Høgh, Silje V; Dons, Lone E; Pedersen, Michael; Jensen, Thøger G; Kemp, Michael; Skov, Marianne N; Gumpert, Heidi; Worning, Peder; Westh, Henrik

    2017-01-01

    Staphylococcus argenteus (S. argenteus) is a newly identified Staphylococcus species that has been misidentified as Staphylococcus aureus (S. aureus) and is clinically relevant. We identified 25 S. argenteus genomes in our collection of whole genome sequenced S. aureus. These genomes were compared to publicly available genomes and a phylogeny revealed seven clusters corresponding to seven clonal complexes. The genome of S. argenteus was found to be different from the genome of S. aureus and a core genome analysis showed that ~33% of the total gene pool was shared between the two species, at 90% homology level. An assessment of mobile elements shows flow of SCCmec cassettes, plasmids, phages, and pathogenicity islands, between S. argenteus and S. aureus. This dataset emphasizes that S. argenteus and S. aureus are two separate species that share genetic material.

  8. Data Portal | Office of Cancer Clinical Proteomics Research

    Cancer.gov

    The CPTAC Data Portal is a centralized repository for the public dissemination of proteomic sequence datasets collected by CPTAC, along with corresponding genomic sequence datasets. Also available are analyses of CPTAC's raw mass spectrometry-based data files (mapping of spectra to peptide sequences and protein identification) by individual investigators from CPTAC and by a Common Data Analysis Pipeline.

  9. Degenerate Pax2 and Senseless binding motifs improve detection of low-affinity sites required for enhancer specificity

    PubMed Central

    Zandvakili, Arya; Campbell, Ian; Weirauch, Matthew T.

    2018-01-01

    Cells use thousands of regulatory sequences to recruit transcription factors (TFs) and produce specific transcriptional outcomes. Since TFs bind degenerate DNA sequences, discriminating functional TF binding sites (TFBSs) from background sequences represents a significant challenge. Here, we show that a Drosophila regulatory element that activates Epidermal Growth Factor signaling requires overlapping, low-affinity TFBSs for competing TFs (Pax2 and Senseless) to ensure cell- and segment-specific activity. Testing available TF binding models for Pax2 and Senseless, however, revealed variable accuracy in predicting such low-affinity TFBSs. To better define parameters that increase accuracy, we developed a method that systematically selects subsets of TFBSs based on predicted affinity to generate hundreds of position-weight matrices (PWMs). Counterintuitively, we found that degenerate PWMs produced from datasets depleted of high-affinity sequences were more accurate in identifying both low- and high-affinity TFBSs for the Pax2 and Senseless TFs. Taken together, these findings reveal how TFBS arrangement can be constrained by competition rather than cooperativity and that degenerate models of TF binding preferences can improve identification of biologically relevant low affinity TFBSs. PMID:29617378
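
    How a position-weight matrix is derived from a set of binding sites and then used to score candidate sequences can be sketched in Python. This is a generic log-odds PWM under stated assumptions (uniform background, a pseudocount of 1, invented example sites); it is not the authors' subset-selection method, and all names are hypothetical.

```python
import math
from collections import Counter

BASES = "ACGT"


def build_pwm(sites, pseudocount=1.0, background=0.25):
    """Log-odds position-weight matrix from equal-length binding sites."""
    length = len(sites[0])
    if any(len(site) != length for site in sites):
        raise ValueError("all sites must have the same length")
    total = len(sites) + 4 * pseudocount
    pwm = []
    for i in range(length):
        counts = Counter(site[i] for site in sites)
        pwm.append({
            base: math.log2((counts[base] + pseudocount) / total / background)
            for base in BASES
        })
    return pwm


def score(pwm, seq):
    """Sum of per-position log-odds scores for a sequence of PWM width."""
    return sum(column[base] for column, base in zip(pwm, seq))


def best_hit(pwm, sequence):
    """Return (score, start) of the highest-scoring window in sequence."""
    width = len(pwm)
    return max(
        (score(pwm, sequence[i:i + width]), i)
        for i in range(len(sequence) - width + 1)
    )
```

    Building the matrix from a subset of sites, for example only those below some predicted-affinity cutoff, yields a flatter and more degenerate PWM, which is the kind of model the abstract reports as more accurate for low-affinity sites.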

  10. Identification and validation of differentially expressed transcripts by RNA-sequencing of formalin-fixed, paraffin-embedded (FFPE) lung tissue from patients with Idiopathic Pulmonary Fibrosis.

    PubMed

    Vukmirovic, Milica; Herazo-Maya, Jose D; Blackmon, John; Skodric-Trifunovic, Vesna; Jovanovic, Dragana; Pavlovic, Sonja; Stojsic, Jelena; Zeljkovic, Vesna; Yan, Xiting; Homer, Robert; Stefanovic, Branko; Kaminski, Naftali

    2017-01-12

    Idiopathic Pulmonary Fibrosis (IPF) is a lethal lung disease of unknown etiology. A major limitation in transcriptomic profiling of lung tissue in IPF has been a dependence on snap-frozen fresh tissues (FF). In this project we sought to determine whether genome scale transcript profiling using RNA Sequencing (RNA-Seq) could be applied to archived Formalin-Fixed Paraffin-Embedded (FFPE) IPF tissues. We isolated total RNA from 7 IPF and 5 control FFPE lung tissues and performed 50 base pair paired-end sequencing on Illumina HiSeq 2000. TopHat2 was used to map sequencing reads to the human genome. On average ~62 million reads (53.4% of ~116 million reads) were mapped per sample. 4,131 genes were differentially expressed between IPF and controls (1,920 increased and 2,211 decreased; FDR < 0.05). We compared our results to differentially expressed genes calculated from a previously published dataset generated from FF tissues analyzed on Agilent microarrays (GSE47460). The overlap of differentially expressed genes was very high (760 increased and 1,413 decreased, FDR < 0.05). Only 92 differentially expressed genes changed in opposite directions. Pathway enrichment analysis performed using MetaCore confirmed numerous IPF relevant genes and pathways including extracellular remodeling, TGF-beta, and WNT. Gene network analysis of MMP7, a highly differentially expressed gene in both datasets, revealed the same canonical pathways and gene network candidates in RNA-Seq and microarray data. For validation by NanoString nCounter® we selected 35 genes that had a fold change of 2 in at least one dataset (10 discordant, 10 significantly differentially expressed in one dataset only and 15 concordant genes). High concordance of fold change and FDR was observed for each type of the samples (FF vs FFPE) with both microarrays (r = 0.92) and RNA-Seq (r = 0.90) and the number of discordant genes was reduced to four. 
    Our results demonstrate that RNA sequencing of RNA obtained from archived FFPE lung tissues is feasible. The results obtained from FFPE tissue are highly comparable to FF tissues. The ability to perform RNA-Seq on archived FFPE IPF tissues should greatly enhance the availability of tissue biopsies for research in IPF.

  11. SHARAKU: an algorithm for aligning and clustering read mapping profiles of deep sequencing in non-coding RNA processing.

    PubMed

    Tsuchiya, Mariko; Amano, Kojiro; Abe, Masaya; Seki, Misato; Hase, Sumitaka; Sato, Kengo; Sakakibara, Yasubumi

    2016-06-15

    Deep sequencing of the transcripts of regulatory non-coding RNA generates footprints of post-transcriptional processes. After obtaining sequence reads, the short reads are mapped to a reference genome, and specific mapping patterns, called read mapping profiles, can be detected; these are distinct from random non-functional degradation patterns. These patterns reflect the maturation processes that lead to the production of shorter RNA sequences. Recent next-generation sequencing studies have revealed not only the typical maturation process of miRNAs but also the various processing mechanisms of small RNAs derived from tRNAs and snoRNAs. We developed an algorithm termed SHARAKU to align two read mapping profiles of next-generation sequencing outputs for non-coding RNAs. In contrast with previous work, SHARAKU incorporates the primary and secondary sequence structures into an alignment of read mapping profiles to allow for the detection of common processing patterns. Using a benchmark simulated dataset, SHARAKU exhibited superior performance to previous methods for correctly clustering the read mapping profiles with respect to 5'-end processing and 3'-end processing from degradation patterns and in detecting similar processing patterns in deriving the shorter RNAs. Further, using experimental data of small RNA sequencing for the common marmoset brain, SHARAKU succeeded in identifying the significant clusters of read mapping profiles for similar processing patterns of small derived RNA families expressed in the brain. The source code of our program SHARAKU is available at http://www.dna.bio.keio.ac.jp/sharaku/, and the simulated dataset used in this work is available at the same link. Accession code: The sequence data from the whole RNA transcripts in the hippocampus of the left brain used in this work is available from the DNA DataBank of Japan (DDBJ) Sequence Read Archive (DRA) under the accession number DRA004502. Contact: yasu@bio.keio.ac.jp. Supplementary data are available at Bioinformatics online. © The Author 2016. Published by Oxford University Press.

  12. SPAR: small RNA-seq portal for analysis of sequencing experiments.

    PubMed

    Kuksa, Pavel P; Amlie-Wolf, Alexandre; Katanic, Živadin; Valladares, Otto; Wang, Li-San; Leung, Yuk Yee

    2018-05-04

    The introduction of new high-throughput small RNA sequencing protocols that generate large-scale genomics datasets along with increasing evidence of the significant regulatory roles of small non-coding RNAs (sncRNAs) have highlighted the urgent need for tools to analyze and interpret large amounts of small RNA sequencing data. However, it remains challenging to systematically and comprehensively discover and characterize sncRNA genes and specifically-processed sncRNA products from these datasets. To fill this gap, we present Small RNA-seq Portal for Analysis of sequencing expeRiments (SPAR), a user-friendly web server for interactive processing, analysis, annotation and visualization of small RNA sequencing data. SPAR supports sequencing data generated from various experimental protocols, including smRNA-seq, short total RNA sequencing, microRNA-seq, and single-cell small RNA-seq. Additionally, SPAR includes publicly available reference sncRNA datasets from our DASHR database and from ENCODE across 185 human tissues and cell types to produce highly informative small RNA annotations across all major small RNA types and other features such as co-localization with various genomic features, precursor transcript cleavage patterns, and conservation. SPAR allows the user to compare the input experiment against reference ENCODE/DASHR datasets. SPAR currently supports analyses of human (hg19, hg38) and mouse (mm10) sequencing data. SPAR is freely available at https://www.lisanwanglab.org/SPAR.

  13. QQ-SNV: single nucleotide variant detection at low frequency by comparing the quality quantiles.

    PubMed

    Van der Borght, Koen; Thys, Kim; Wetzels, Yves; Clement, Lieven; Verbist, Bie; Reumers, Joke; van Vlijmen, Herman; Aerssens, Jeroen

    2015-11-10

    Next generation sequencing enables studying heterogeneous populations of viral infections. When the sequencing is done at high coverage depth ("deep sequencing"), low frequency variants can be detected. Here we present QQ-SNV (http://sourceforge.net/projects/qqsnv), a logistic regression classifier model developed for the Illumina sequencing platforms that uses the quantiles of the quality scores to distinguish true single nucleotide variants from sequencing errors based on the estimated SNV probability. To train the model, we created a dataset of an in silico mixture of five HIV-1 plasmids. Testing of our method in comparison to the existing methods LoFreq, ShoRAH, and V-Phaser 2 was performed on two HIV and four HCV plasmid mixture datasets and one influenza H1N1 clinical dataset. For default application of QQ-SNV, variants were called using an SNV probability cutoff of 0.5 (QQ-SNV(D)). To improve the sensitivity we used an SNV probability cutoff of 0.0001 (QQ-SNV(HS)). To also increase specificity, SNVs called were overruled when their frequency was below the 80th percentile calculated on the distribution of error frequencies (QQ-SNV(HS-P80)). When comparing QQ-SNV versus the other methods on the plasmid mixture test sets, QQ-SNV(D) performed similarly to the existing approaches. QQ-SNV(HS) was more sensitive on all test sets but with more false positives. QQ-SNV(HS-P80) was found to be the most accurate method over all test sets by balancing sensitivity and specificity. When applied to a paired-end HCV sequencing study, with lowest spiked-in true frequency of 0.5%, QQ-SNV(HS-P80) revealed a sensitivity of 100% (vs. 40-60% for the existing methods) and a specificity of 100% (vs. 98.0-99.7% for the existing methods). In addition, QQ-SNV required the least overall computation time to process the test sets. Finally, when testing on a clinical sample, four putative true variants with frequency below 0.5% were consistently detected by QQ-SNV(HS-P80) from different generations of Illumina sequencers. We developed and successfully evaluated a novel method, called QQ-SNV, for highly efficient single nucleotide variant calling on Illumina deep sequencing virology data.
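
    The three calling modes described in this abstract (D, HS, HS-P80) can be illustrated with a minimal Python sketch. It is not the QQ-SNV implementation: the function names, the tuple layout of a candidate, the use of ">=" at each cutoff, and the nearest-rank percentile convention are all assumptions made for illustration. Only the cutoffs 0.5 and 0.0001 and the 80th-percentile rule come from the abstract, and the SNV probabilities are taken as given rather than produced by the logistic regression model.

```python
import math


def percentile(values, q):
    """Nearest-rank percentile for q in (0, 100]; the convention is an assumption."""
    if not values:
        raise ValueError("values must be non-empty")
    ordered = sorted(values)
    rank = max(1, math.ceil(q * len(ordered) / 100))
    return ordered[rank - 1]


def call_snvs(candidates, mode="D", error_freqs=None):
    """Filter candidate SNVs given as (position, frequency, snv_probability).

    mode "D"      : keep candidates with probability >= 0.5
    mode "HS"     : keep candidates with probability >= 0.0001
    mode "HS-P80" : as "HS", then drop calls whose frequency is below the
                    80th percentile of the supplied error frequencies
    """
    if mode not in ("D", "HS", "HS-P80"):
        raise ValueError("unknown mode: " + mode)
    cutoff = 0.5 if mode == "D" else 0.0001
    called = [c for c in candidates if c[2] >= cutoff]
    if mode == "HS-P80":
        if not error_freqs:
            raise ValueError("HS-P80 needs a distribution of error frequencies")
        floor = percentile(error_freqs, 80)
        called = [c for c in called if c[1] >= floor]
    return called


# Hypothetical candidates and error frequencies, for illustration only.
candidates = [(10, 0.02, 0.9), (20, 0.004, 0.3), (30, 0.0005, 0.01), (40, 0.01, 0.00001)]
error_freqs = [0.0001, 0.0002, 0.0005, 0.001, 0.002]
for mode in ("D", "HS", "HS-P80"):
    print(mode, [pos for pos, _, _ in call_snvs(candidates, mode, error_freqs)])
```

    The sketch shows the trade-off the abstract reports: lowering the probability cutoff admits more candidates, and the frequency floor then removes the calls that sit within the range of typical error frequencies.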

  14. An efficient and scalable graph modeling approach for capturing information at different levels in next generation sequencing reads

    PubMed Central

    2013-01-01

    Background Next generation sequencing technologies have greatly advanced many research areas of the biomedical sciences through their capability to generate massive amounts of genetic information at unprecedented rates. The advent of next generation sequencing has led to the development of numerous computational tools to analyze and assemble the millions to billions of short sequencing reads produced by these technologies. While these tools filled an important gap, current approaches for storing, processing, and analyzing short read datasets generally have remained simple and lack the complexity needed to efficiently model the produced reads and assemble them correctly. Results Previously, we presented an overlap graph coarsening scheme for modeling read overlap relationships on multiple levels. Most current read assembly and analysis approaches use a single graph or set of clusters to represent the relationships among a read dataset. Instead, we use a series of graphs to represent the reads and their overlap relationships across a spectrum of information granularity. At each information level our algorithm is capable of generating clusters of reads from the reduced graph, forming an integrated graph modeling and clustering approach for read analysis and assembly. Previously we applied our algorithm to simulated and real 454 datasets to assess its ability to efficiently model and cluster next generation sequencing data. In this paper we extend our algorithm to large simulated and real Illumina datasets to demonstrate that our algorithm is practical for both sequencing technologies. Conclusions Our overlap graph theoretic algorithm is able to model next generation sequencing reads at various levels of granularity through the process of graph coarsening. Additionally, our model allows for efficient representation of the read overlap relationships, is scalable for large datasets, and is practical for both Illumina and 454 sequencing technologies. PMID:24564333

  15. Spatially resolved RNA-sequencing of the embryonic heart identifies a role for Wnt/β-catenin signaling in autonomic control of heart rate

    PubMed Central

    Burkhard, Silja Barbara

    2018-01-01

    Development of specialized cells and structures in the heart is regulated by spatially restricted molecular pathways. Disruptions in these pathways can cause severe congenital cardiac malformations or functional defects. To better understand these pathways and how they regulate cardiac development we used tomo-seq, combining high-throughput RNA-sequencing with tissue-sectioning, to establish a genome-wide expression dataset with high spatial resolution for the developing zebrafish heart. Analysis of the dataset revealed over 1100 genes differentially expressed in sub-compartments. Pacemaker cells in the sinoatrial region induce heart contractions, but little is known about the mechanisms underlying their development. Using our transcriptome map, we identified spatially restricted Wnt/β-catenin signaling activity in pacemaker cells, which was controlled by Islet-1 activity. Moreover, Wnt/β-catenin signaling controls heart rate by regulating pacemaker cellular response to parasympathetic stimuli. Thus, this high-resolution transcriptome map incorporating all cell types in the embryonic heart can expose spatially restricted molecular pathways critical for specific cardiac functions. PMID:29400650

  16. A computational method for estimating the PCR duplication rate in DNA and RNA-seq experiments.

    PubMed

    Bansal, Vikas

    2017-03-14

    PCR amplification is an important step in the preparation of DNA sequencing libraries prior to high-throughput sequencing. PCR amplification introduces redundant reads in the sequence data and estimating the PCR duplication rate is important to assess the frequency of such reads. Existing computational methods do not distinguish PCR duplicates from "natural" read duplicates that represent independent DNA fragments and therefore, over-estimate the PCR duplication rate for DNA-seq and RNA-seq experiments. In this paper, we present a computational method to estimate the average PCR duplication rate of high-throughput sequence datasets that accounts for natural read duplicates by leveraging heterozygous variants in an individual genome. Analysis of simulated data and exome sequence data from the 1000 Genomes project demonstrated that our method can accurately estimate the PCR duplication rate on paired-end as well as single-end read datasets which contain a high proportion of natural read duplicates. Further, analysis of exome datasets prepared using the Nextera library preparation method indicated that 45-50% of read duplicates correspond to natural read duplicates likely due to fragmentation bias. Finally, analysis of RNA-seq datasets from individuals in the 1000 Genomes project demonstrated that 70-95% of read duplicates observed in such datasets correspond to natural duplicates sampled from genes with high expression and identified outlier samples with a 2-fold greater PCR duplication rate than other samples. The method described here is a useful tool for estimating the PCR duplication rate of high-throughput sequence datasets and for assessing the fraction of read duplicates that correspond to natural read duplicates. An implementation of the method is available at https://github.com/vibansal/PCRduplicates .
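
    The idea of using heterozygous variants to separate natural from PCR duplicates can be illustrated with a deliberately simplified Python sketch. It is not the estimator from the paper. The assumptions are: each observation is a pair of duplicate reads covering one heterozygous site, PCR duplicates always carry the same allele, and two independent fragments carry different alleles with probability 0.5 (no allelic bias, no sequencing error). Under those assumptions the natural fraction is about twice the fraction of allele-discordant pairs. The function names and input layout are hypothetical.

```python
def natural_duplicate_fraction(allele_pairs):
    """Estimate the fraction of duplicate pairs that are natural duplicates.

    allele_pairs: list of (allele_a, allele_b), one tuple per duplicate read
    pair observed at a heterozygous site.
    """
    if not allele_pairs:
        raise ValueError("need at least one duplicate pair at a heterozygous site")
    discordant = sum(1 for a, b in allele_pairs if a != b)
    # Independent fragments disagree half of the time, so double the observed rate.
    return min(1.0, 2 * discordant / len(allele_pairs))


def pcr_duplicate_fraction(allele_pairs):
    """Complement of the natural fraction under the same assumptions."""
    return 1.0 - natural_duplicate_fraction(allele_pairs)


# Hypothetical data: 25 of 100 duplicate pairs carry different alleles.
pairs = [("A", "G")] * 25 + [("A", "A")] * 75
print(natural_duplicate_fraction(pairs), pcr_duplicate_fraction(pairs))
```

    A coordinate-only duplicate count would label all 100 pairs in this toy input as PCR duplicates, which is the over-estimation the abstract describes.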

  17. Unique attributes of cyanobacterial metabolism revealed by improved genome-scale metabolic modeling and essential gene analysis

    DOE PAGES

    Broddrick, Jared T.; Rubin, Benjamin E.; Welkie, David G.; ...

    2016-12-20

    The model cyanobacterium, Synechococcus elongatus PCC 7942, is a genetically tractable obligate phototroph that is being developed for the bioproduction of high-value chemicals. Genome-scale models (GEMs) have been successfully used to assess and engineer cellular metabolism; however, GEMs of phototrophic metabolism have been limited by the lack of experimental datasets for model validation and the challenges of incorporating photon uptake. In this paper, we develop a GEM of metabolism in S. elongatus using random barcode transposon site sequencing (RB-TnSeq) essential gene and physiological data specific to photoautotrophic metabolism. The model explicitly describes photon absorption and accounts for shading, resulting in the characteristic linear growth curve of photoautotrophs. GEM predictions of gene essentiality were compared with data obtained from recent dense-transposon mutagenesis experiments. This dataset allowed major improvements to the accuracy of the model. Furthermore, discrepancies between GEM predictions and the in vivo dataset revealed biological characteristics, such as the importance of a truncated, linear TCA pathway, low flux toward amino acid synthesis from photorespiration, and knowledge gaps within nucleotide metabolism. Finally, coupling of strong experimental support and photoautotrophic modeling methods thus resulted in a highly accurate model of S. elongatus metabolism that highlights previously unknown areas of S. elongatus biology.

  18. Unique attributes of cyanobacterial metabolism revealed by improved genome-scale metabolic modeling and essential gene analysis

    PubMed Central

    Broddrick, Jared T.; Rubin, Benjamin E.; Welkie, David G.; Du, Niu; Mih, Nathan; Diamond, Spencer; Lee, Jenny J.; Golden, Susan S.; Palsson, Bernhard O.

    2016-01-01

    The model cyanobacterium, Synechococcus elongatus PCC 7942, is a genetically tractable obligate phototroph that is being developed for the bioproduction of high-value chemicals. Genome-scale models (GEMs) have been successfully used to assess and engineer cellular metabolism; however, GEMs of phototrophic metabolism have been limited by the lack of experimental datasets for model validation and the challenges of incorporating photon uptake. Here, we develop a GEM of metabolism in S. elongatus using random barcode transposon site sequencing (RB-TnSeq) essential gene and physiological data specific to photoautotrophic metabolism. The model explicitly describes photon absorption and accounts for shading, resulting in the characteristic linear growth curve of photoautotrophs. GEM predictions of gene essentiality were compared with data obtained from recent dense-transposon mutagenesis experiments. This dataset allowed major improvements to the accuracy of the model. Furthermore, discrepancies between GEM predictions and the in vivo dataset revealed biological characteristics, such as the importance of a truncated, linear TCA pathway, low flux toward amino acid synthesis from photorespiration, and knowledge gaps within nucleotide metabolism. Coupling of strong experimental support and photoautotrophic modeling methods thus resulted in a highly accurate model of S. elongatus metabolism that highlights previously unknown areas of S. elongatus biology. PMID:27911809

  19. Unique attributes of cyanobacterial metabolism revealed by improved genome-scale metabolic modeling and essential gene analysis

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Broddrick, Jared T.; Rubin, Benjamin E.; Welkie, David G.

    The model cyanobacterium, Synechococcus elongatus PCC 7942, is a genetically tractable obligate phototroph that is being developed for the bioproduction of high-value chemicals. Genome-scale models (GEMs) have been successfully used to assess and engineer cellular metabolism; however, GEMs of phototrophic metabolism have been limited by the lack of experimental datasets for model validation and the challenges of incorporating photon uptake. In this paper, we develop a GEM of metabolism in S. elongatus using random barcode transposon site sequencing (RB-TnSeq) essential gene and physiological data specific to photoautotrophic metabolism. The model explicitly describes photon absorption and accounts for shading, resulting in the characteristic linear growth curve of photoautotrophs. GEM predictions of gene essentiality were compared with data obtained from recent dense-transposon mutagenesis experiments. This dataset allowed major improvements to the accuracy of the model. Furthermore, discrepancies between GEM predictions and the in vivo dataset revealed biological characteristics, such as the importance of a truncated, linear TCA pathway, low flux toward amino acid synthesis from photorespiration, and knowledge gaps within nucleotide metabolism. Finally, coupling of strong experimental support and photoautotrophic modeling methods thus resulted in a highly accurate model of S. elongatus metabolism that highlights previously unknown areas of S. elongatus biology.

  20. A communal catalogue reveals Earth’s multiscale microbial diversity

    DOE PAGES

    Thompson, Luke R.; Sanders, Jon G.; McDonald, Daniel; ...

    2017-11-01

    Our growing awareness of the importance and diversity of the microbial world contrasts starkly with our limited understanding of its fundamental structure. Despite remarkable advances in DNA sequence generation, a lack of standardized protocols and common analytical framework impedes useful comparison between studies, hindering development of global inferences about microbial life on Earth. Here, we show that with coordinated protocols, exact microbial 16S rRNA gene sequences can be followed across scores of individual studies, revealing patterns of diversity, community structure, and life history strategy at a planetary scale. Using 27,751 crowdsourced environmental samples comprising more than 2.2 billion reads, we find sharp divides between host-associated and free-living communities. We show that the distribution of taxonomic and sequence diversity follows consistent trends across sample types and along gradients of environmental parameters, highlighting some of the global evolutionary patterns and ecological principles that underpin Earth’s microbiome. Here, this dataset provides the most complete environmental survey of our microbial world to date, and serves as a growing reference to provide immediate global context to future microbial surveys.

  1. A communal catalogue reveals Earth’s multiscale microbial diversity

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Thompson, Luke R.; Sanders, Jon G.; McDonald, Daniel

    Our growing awareness of the importance and diversity of the microbial world contrasts starkly with our limited understanding of its fundamental structure. Despite remarkable advances in DNA sequence generation, a lack of standardized protocols and common analytical framework impedes useful comparison between studies, hindering development of global inferences about microbial life on Earth. Here, we show that with coordinated protocols, exact microbial 16S rRNA gene sequences can be followed across scores of individual studies, revealing patterns of diversity, community structure, and life history strategy at a planetary scale. Using 27,751 crowdsourced environmental samples comprising more than 2.2 billion reads, we find sharp divides between host-associated and free-living communities. We show that the distribution of taxonomic and sequence diversity follows consistent trends across sample types and along gradients of environmental parameters, highlighting some of the global evolutionary patterns and ecological principles that underpin Earth’s microbiome. Here, this dataset provides the most complete environmental survey of our microbial world to date, and serves as a growing reference to provide immediate global context to future microbial surveys.

  2. ESTuber db: an online database for Tuber borchii EST sequences.

    PubMed

    Lazzari, Barbara; Caprera, Andrea; Cosentino, Cristian; Stella, Alessandra; Milanesi, Luciano; Viotti, Angelo

    2007-03-08

    The ESTuber database (http://www.itb.cnr.it/estuber) includes 3,271 Tuber borchii expressed sequence tags (EST). The dataset consists of 2,389 sequences from an in-house prepared cDNA library from truffle vegetative hyphae, and 882 sequences downloaded from GenBank and representing four libraries from white truffle mycelia and ascocarps at different developmental stages. An automated pipeline was prepared to process EST sequences using public software integrated by in-house developed Perl scripts. Data were collected in a MySQL database, which can be queried via a php-based web interface. Sequences included in the ESTuber db were clustered and annotated against three databases: the GenBank nr database, the UniProtKB database and a third in-house prepared database of fungi genomic sequences. An algorithm was implemented to infer statistical classification among Gene Ontology categories from the ontology occurrences deduced from the annotation procedure against the UniProtKB database. Ontologies were also deduced from the annotation of more than 130,000 EST sequences from five filamentous fungi, for intra-species comparison purposes. Further analyses were performed on the ESTuber db dataset, including tandem repeats search and comparison of the putative protein dataset inferred from the EST sequences to the PROSITE database for protein patterns identification. All the analyses were performed both on the complete sequence dataset and on the contig consensus sequences generated by the EST assembly procedure. The resulting web site is a resource of data and links related to truffle expressed genes. The Sequence Report and Contig Report pages are the web interface core structures which, together with the Text search utility and the Blast utility, allow easy access to the data stored in the database.

  3. TriageTools: tools for partitioning and prioritizing analysis of high-throughput sequencing data.

    PubMed

    Fimereli, Danai; Detours, Vincent; Konopka, Tomasz

    2013-04-01

    High-throughput sequencing is becoming a popular research tool but carries with it considerable costs in terms of computation time, data storage and bandwidth. Meanwhile, some research applications focusing on individual genes or pathways do not necessitate processing of a full sequencing dataset. Thus, it is desirable to partition a large dataset into smaller, manageable, but relevant pieces. We present a toolkit for partitioning raw sequencing data that includes a method for extracting reads that are likely to map onto pre-defined regions of interest. We show the method can be used to extract information about genes of interest from DNA or RNA sequencing samples in a fraction of the time and disk space required to process and store a full dataset. We report speedup factors between 2.6 and 96, depending on settings and samples used. The software is available at http://www.sourceforge.net/projects/triagetools/.
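
    The notion of extracting reads that are likely to map onto pre-defined regions of interest can be illustrated with a short Python sketch. It is not the TriageTools algorithm, whose details the abstract does not give. The sketch assumes a shared k-mer screen: a read is kept when it shares at least min_hits k-mers with a region of interest on either strand. The function names and the defaults k=8 and min_hits=2 are hypothetical choices.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def revcomp(seq):
    """Reverse complement of an uppercase DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def kmers(seq, k):
    """Set of all length-k substrings of seq (empty if seq is shorter than k)."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def triage(reads, regions, k=8, min_hits=2):
    """Return the reads that share at least min_hits k-mers with any region."""
    index = set()
    for region in regions:
        index |= kmers(region, k)
        index |= kmers(revcomp(region), k)
    return [read for read in reads if len(kmers(read, k) & index) >= min_hits]


# Hypothetical region and reads, for illustration only.
region = "ACGTACGTTAGCCTAGGATCCA"
on_target = "CGTTAGCCTAGG"          # substring of the region
off_target = "TTTTTTTTTTTT"
reads = [on_target, off_target, revcomp(on_target)]
print(triage(reads, [region]))
```

    Because the screen only needs set lookups rather than full alignment, the retained subset can be passed to a conventional mapper at a fraction of the cost of processing the whole dataset, which is the motivation stated in the abstract.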

  4. Virome Assembly and Annotation: A Surprise in the Namib Desert

    PubMed Central

    Hesse, Uljana; van Heusden, Peter; Kirby, Bronwyn M.; Olonade, Israel; van Zyl, Leonardo J.; Trindade, Marla

    2017-01-01

    Sequencing, assembly, and annotation of environmental virome samples is challenging. Methodological biases and differences in species abundance result in fragmentary read coverage; sequence reconstruction is further complicated by the mosaic nature of viral genomes. In this paper, we focus on biocomputational aspects of virome analysis, emphasizing latent pitfalls in sequence annotation. Using simulated viromes that mimic environmental data challenges we assessed the performance of five assemblers (CLC-Workbench, IDBA-UD, SPAdes, RayMeta, ABySS). Individual analyses of relevant scaffold length fractions revealed shortcomings of some programs in reconstruction of viral genomes with excessive read coverage (IDBA-UD, RayMeta), and in accurate assembly of scaffolds ≥50 kb (SPAdes, RayMeta, ABySS). The CLC-Workbench assembler performed best in terms of genome recovery (including highly covered genomes) and correct reconstruction of large scaffolds; and was used to assemble a virome from a copper rich site in the Namib Desert. We found that scaffold network analysis and cluster-specific read reassembly improved reconstruction of sequences with excessive read coverage, and that strict data filtering for non-viral sequences prior to downstream analyses was essential. In this study we describe novel viral genomes identified in the Namib Desert copper site virome. Taxonomic affiliations of diverse proteins in the dataset and phylogenetic analyses of circovirus-like proteins indicated links to the marine habitat. Considering additional evidence from this dataset we hypothesize that viruses may have been carried from the Atlantic Ocean into the Namib Desert by fog and wind, highlighting the impact of the extended environment on an investigated niche in metagenome studies. PMID:28167933

  5. Seqenv: linking sequences to environments through text mining.

    PubMed

    Sinclair, Lucas; Ijaz, Umer Z; Jensen, Lars Juhl; Coolen, Marco J L; Gubry-Rangin, Cecile; Chroňáková, Alica; Oulas, Anastasis; Pavloudi, Christina; Schnetzer, Julia; Weimann, Aaron; Ijaz, Ali; Eiler, Alexander; Quince, Christopher; Pafilis, Evangelos

    2016-01-01

    Understanding the distribution of taxa and associated traits across different environments is one of the central questions in microbial ecology. High-throughput sequencing (HTS) studies are presently generating huge volumes of data to address this biogeographical topic. However, these studies are often focused on specific environment types or processes leading to the production of individual, unconnected datasets. The large amounts of legacy sequence data with associated metadata that exist can be harnessed to better place the genetic information found in these surveys into a wider environmental context. Here we introduce a software program, seqenv, to carry out precisely such a task. It automatically performs similarity searches of short sequences against the "nt" nucleotide database provided by NCBI and, out of every hit, extracts (if it is available) the textual metadata field. After collecting all the isolation sources from all the search results, we run a text mining algorithm to identify and parse words that are associated with the Environmental Ontology (EnvO) controlled vocabulary. This, in turn, enables us to determine both in which environments individual sequences or taxa have previously been observed and, by weighted summation of those results, to summarize complete samples. We present two demonstrative applications of seqenv to a survey of ammonia oxidizing archaea as well as to a plankton paleome dataset from the Black Sea. These demonstrate the ability of the tool to reveal novel patterns in HTS and its utility in the fields of environmental source tracking, paleontology, and studies of microbial biogeography. To install seqenv, go to: https://github.com/xapple/seqenv.

  6. PASTA for Proteins.

    PubMed

    Collins, Kodi; Warnow, Tandy

    2018-06-19

    PASTA is a multiple sequence alignment method that uses divide-and-conquer plus iteration to enable base alignment methods to scale with high accuracy to large sequence datasets. By default, PASTA included MAFFT L-INS-i; our new extension of PASTA enables the use of MAFFT G-INS-i, MAFFT Homologs, CONTRAlign, and ProbCons. We analyzed the performance of each base method and PASTA using these base methods on 224 datasets from BAliBASE 4 with at least 50 sequences. We show that PASTA enables the most accurate base methods to scale to larger datasets at reduced computational effort, and generally improves alignment and tree accuracy on the largest BAliBASE datasets. PASTA is available at https://github.com/kodicollins/pasta and has also been integrated into the original PASTA repository at https://github.com/smirarab/pasta. Supplementary data are available at Bioinformatics online.

  7. An efficient annotation and gene-expression derivation tool for Illumina Solexa datasets.

    PubMed

    Hosseini, Parsa; Tremblay, Arianne; Matthews, Benjamin F; Alkharouf, Nadim W

    2010-07-02

    An Illumina flow cell with all eight lanes occupied produces well over a terabyte worth of images, with gigabytes of reads following sequence alignment. The ability to translate such reads into meaningful annotation is therefore of great concern and importance. Very easily, one can get flooded with such a great volume of textual, unannotated data irrespective of read quality or size. CASAVA, an optional analysis tool for Illumina sequencing experiments, supports INDEL detection, SNP information, and allele calling. To not only extract from such analysis a measure of gene expression in the form of tag-counts, but furthermore to annotate such reads, is therefore of significant value. We developed TASE (Tag counting and Analysis of Solexa Experiments), a rapid tag-counting and annotation software tool specifically designed for Illumina CASAVA sequencing datasets. Developed in Java and deployed using jTDS JDBC driver and a SQL Server backend, TASE provides an extremely fast means of calculating gene expression through tag-counts while annotating sequenced reads with the gene's presumed function, from any given CASAVA-build. Such a build is generated for both DNA and RNA sequencing. Analysis is broken into two distinct components: DNA sequence or read concatenation, followed by tag-counting and annotation. The end result produces output containing the homology-based functional annotation and respective gene expression measure signifying how many times sequenced reads were found within the genomic ranges of functional annotations. TASE is a powerful tool to facilitate the process of annotating a given Illumina Solexa sequencing dataset. Our results indicate that both homology-based annotation and tag-count analysis are achieved in very efficient times, allowing researchers to delve deep into a given CASAVA-build and maximize information extraction from a sequencing dataset. TASE is specially designed to translate sequence data in a CASAVA-build into functional annotations while producing corresponding gene expression measurements. Such analysis is executed in an ultrafast and highly efficient manner, whether the analysis be a single-read or paired-end sequencing experiment. TASE is a user-friendly and freely available application, allowing rapid analysis and annotation of any given Illumina Solexa sequencing dataset with ease.
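
    The tag-count measure described in this abstract (how many reads fall within the genomic range of each annotation) can be illustrated with a minimal Python sketch. It is not TASE, which the abstract describes as a Java application with a SQL Server backend. The function name, the tuple layouts, and the treatment of ranges as inclusive on both ends are assumptions, and the gene names and coordinates are hypothetical.

```python
from collections import Counter


def tag_counts(read_positions, annotations):
    """Count reads falling inside each annotated range.

    read_positions: iterable of (chrom, position)
    annotations:    list of (gene, chrom, start, end), with start and end inclusive
    A read inside overlapping annotations is counted once for each of them.
    """
    counts = Counter({gene: 0 for gene, _, _, _ in annotations})
    for chrom, pos in read_positions:
        for gene, a_chrom, start, end in annotations:
            if chrom == a_chrom and start <= pos <= end:
                counts[gene] += 1
    return dict(counts)


# Hypothetical annotations and read positions, for illustration only.
annotations = [("geneA", "chr1", 100, 200), ("geneB", "chr1", 150, 300), ("geneC", "chr2", 1, 50)]
reads = [("chr1", 120), ("chr1", 160), ("chr1", 250), ("chr2", 60)]
print(tag_counts(reads, annotations))
```

    The nested loop is quadratic and only suitable for small inputs; a production tool would typically use an interval index or a database query, as the SQL backend mentioned in the abstract suggests.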

  8. Next-generation sequencing of translocation renal cell carcinoma reveals novel RNA splicing partners and frequent mutations of chromatin-remodeling genes.

    PubMed

    Malouf, Gabriel G; Su, Xiaoping; Yao, Hui; Gao, Jianjun; Xiong, Liangwen; He, Qiuming; Compérat, Eva; Couturier, Jérôme; Molinié, Vincent; Escudier, Bernard; Camparo, Philippe; Doss, Denaha J; Thompson, Erika J; Khayat, David; Wood, Christopher G; Yu, Willie; Teh, Bin T; Weinstein, John; Tannir, Nizar M

    2014-08-01

    MITF/TFE translocation renal cell carcinoma (TRCC) is a rare subtype of kidney cancer. Its incidence and the genome-wide characterization of its genetic origin have not been fully elucidated. We performed RNA and exome sequencing on an exploratory set of TRCC (n = 7), and validated our findings using The Cancer Genome Atlas (TCGA) clear-cell RCC (ccRCC) dataset (n = 460). Using the TCGA dataset, we identified seven TRCC (1.5%) cases and determined their genomic profile. We discovered three novel partners of MITF/TFE (LUC7L3, KHSRP, and KHDRBS2) that are involved in RNA splicing. TRCC displayed a unique gene expression signature as compared with other RCC types, and showed activation of MITF, the transforming growth factor β1 and the PI3K complex targets. Genes differentially spliced between TRCC and other RCC types were enriched for MITF and ID2 targets. Exome sequencing of TRCC revealed a distinct mutational spectrum as compared with ccRCC, with frequent mutations in chromatin-remodeling genes (six of eight cases, three of which were from the TCGA). In two cases, we identified mutations in INO80D, an ATP-dependent chromatin-remodeling gene, previously shown to control the amplitude of the S phase. Knockdown of INO80D decreased cell proliferation in a novel cell line bearing LUC7L3-TFE3 translocation. This genome-wide study defines the incidence of TRCC within a ccRCC-directed project and expands the genomic spectrum of TRCC by identifying novel MITF/TFE partners involved in RNA splicing and frequent mutations in chromatin-remodeling genes. ©2014 American Association for Cancer Research.

  9. Leveraging genome-wide datasets to quantify the functional role of the anti-Shine-Dalgarno sequence in regulating translation efficiency.

    PubMed

    Hockenberry, Adam J; Pah, Adam R; Jewett, Michael C; Amaral, Luís A N

    2017-01-01

    Studies dating back to the 1970s established that sequence complementarity between the anti-Shine-Dalgarno (aSD) sequence on prokaryotic ribosomes and the 5' untranslated region of mRNAs helps to facilitate translation initiation. The optimal location of aSD sequence binding relative to the start codon, the full extent of the aSD sequence and the functional form of the relationship between aSD sequence complementarity and translation efficiency have not been fully resolved. Here, we investigate these relationships by leveraging the sequence diversity of endogenous genes and recently available genome-wide estimates of translation efficiency. We show that, after accounting for predicted mRNA structure, aSD sequence complementarity increases the translation of endogenous mRNAs by roughly 50%. Further, we observe that this relationship is nonlinear, with translation efficiency maximized for mRNAs with intermediate levels of aSD sequence complementarity. The mechanistic insights that we observe are highly robust: we find nearly identical results in multiple datasets spanning three distantly related bacteria. Further, we verify our main conclusions by re-analysing a controlled experimental dataset. © 2017 The Authors.

  10. Computational Approaches for Decoding Select Odorant-Olfactory Receptor Interactions Using Mini-Virtual Screening

    PubMed Central

    Harini, K.; Sowdhamini, Ramanathan

    2015-01-01

    Olfactory receptors (ORs) belong to the class A G-Protein Coupled Receptor superfamily of proteins. Unlike other G-Protein Coupled Receptors, ORs exhibit a combinatorial response to odors/ligands. ORs display an affinity towards a range of odor molecules rather than binding to a specific set of ligands, and conversely a single odorant molecule may bind to a number of olfactory receptors with varying affinities. The diversity in odor recognition is linked to the highly variable transmembrane domains of these receptors. The purpose of this study is to decode the odor-olfactory receptor interactions using in silico docking studies. In this study, a ligand (odor molecule) dataset of 125 molecules was used to carry out in silico docking using the GLIDE docking tool (SCHRODINGER Inc Pvt LTD). Previous studies, with smaller datasets of ligands, have shown that orthologous olfactory receptors respond to similarly-tuned ligands, but are dramatically different in their efficacy and potency. Ligand docking results were applied to homologous pairs (with varying sequence identity) of ORs from the human and mouse genomes; the ligand-binding residues and the ligand profile differed among such related olfactory receptor sequences. This study revealed that homologous sequences with high sequence identity need not bind to the same or similar ligands with a given affinity. A ligand profile has been obtained for each of the 20 receptors in this analysis, which will be useful for expression and mutation studies on these receptors. PMID:26221959

  11. Isolation and genetic characterization of Aurantimonas and Methylobacterium strains from stems of hypernodulated soybeans.

    PubMed

    Anda, Mizue; Ikeda, Seishi; Eda, Shima; Okubo, Takashi; Sato, Shusei; Tabata, Satoshi; Mitsui, Hisayuki; Minamisawa, Kiwamu

    2011-01-01

    The aims of this study were to isolate Aurantimonas and Methylobacterium strains that responded to soybean nodulation phenotypes and nitrogen fertilization rates in a previous culture-independent analysis (Ikeda et al. ISME J. 4:315-326, 2010). Two strategies were adopted for isolation from enriched bacterial cells prepared from stems of field-grown, hypernodulated soybeans: PCR-assisted isolation for Aurantimonas and selective cultivation for Methylobacterium. Thirteen of 768 isolates cultivated on Nutrient Agar medium were identified as Aurantimonas by colony PCR specific for Aurantimonas and 16S rRNA gene sequencing. Meanwhile, among 187 isolates on methanol-containing agar media, 126 were identified by 16S rRNA gene sequences as Methylobacterium. A clustering analysis (>99% identity) of the 16S rRNA gene sequences for the combined datasets of the present and previous studies revealed 4 and 8 operational taxonomic units (OTUs) for Aurantimonas and Methylobacterium, respectively, and showed the successful isolation of target bacteria for these two groups. ERIC- and BOX-PCR showed the genomic uniformity of the target isolates. In addition, phylogenetic analyses of Aurantimonas revealed a phyllosphere-specific cluster in the genus. The isolates obtained in the present study will be useful for revealing unknown legume-microbe interactions in relation to the autoregulation of nodulation.

  12. Transcriptogenomics identification and characterization of RNA editing sites in human primary monocytes using high-depth next generation sequencing data.

    PubMed

    Leong, Wai-Mun; Ripen, Adiratna Mat; Mirsafian, Hoda; Mohamad, Saharuddin Bin; Merican, Amir Feisal

    2018-06-07

    High-depth next generation sequencing data provide valuable insights into the number and distribution of RNA editing events. Here, we report the RNA editing events at the cellular level in human primary monocytes using high-depth whole genomic and transcriptomic sequencing data. We identified over ten thousand putative RNA editing sites, and 69% of the sites were A-to-I editing sites. The sites were enriched in repetitive sequences and intronic regions. High-depth sequencing datasets revealed that 90% of the canonical sites were edited at lower frequencies (<0.7). Single and multiple human monocyte and brain tissue samples were analyzed through a genome sequence-independent approach. The latter approach was observed to identify more editing sites. Monocytes were observed to contain more C-to-U editing sites than brain tissues. Our results establish a comparable pipeline that can address current limitations and demonstrate the potential for highly sensitive detection of RNA editing events in a single cell type. Copyright © 2018 Elsevier Inc. All rights reserved.

  13. DNApod: DNA polymorphism annotation database from next-generation sequence read archives.

    PubMed

    Mochizuki, Takako; Tanizawa, Yasuhiro; Fujisawa, Takatomo; Ohta, Tazro; Nikoh, Naruo; Shimizu, Tokurou; Toyoda, Atsushi; Fujiyama, Asao; Kurata, Nori; Nagasaki, Hideki; Kaminuma, Eli; Nakamura, Yasukazu

    2017-01-01

    With the rapid advances in next-generation sequencing (NGS), datasets for DNA polymorphisms among various species and strains have been produced, stored, and distributed. However, reliability varies among these datasets because the experimental and analytical conditions used differ among assays. Furthermore, such datasets have been frequently distributed from the websites of individual sequencing projects. It is desirable to integrate DNA polymorphism data into one database featuring uniform quality control that is distributed from a single platform at a single place. DNA polymorphism annotation database (DNApod; http://tga.nig.ac.jp/dnapod/) is an integrated database that stores genome-wide DNA polymorphism datasets acquired under uniform analytical conditions, and this includes uniformity in the quality of the raw data, the reference genome version, and evaluation algorithms. DNApod genotypic data are re-analyzed whole-genome shotgun datasets extracted from sequence read archives, and DNApod distributes genome-wide DNA polymorphism datasets and known-gene annotations for each DNA polymorphism. This new database was developed for storing genome-wide DNA polymorphism datasets of plants, with crops being the first priority. Here, we describe our analyzed data for 679, 404, and 66 strains of rice, maize, and sorghum, respectively. The analytical methods are available as a DNApod workflow in an NGS annotation system of the DNA Data Bank of Japan and a virtual machine image. Furthermore, DNApod provides tables of links of identifiers between DNApod genotypic data and public phenotypic data. To advance the sharing of organism knowledge, DNApod offers basic and ubiquitous functions for multiple alignment and phylogenetic tree construction by using orthologous gene information.

  14. DNApod: DNA polymorphism annotation database from next-generation sequence read archives

    PubMed Central

    Mochizuki, Takako; Tanizawa, Yasuhiro; Fujisawa, Takatomo; Ohta, Tazro; Nikoh, Naruo; Shimizu, Tokurou; Toyoda, Atsushi; Fujiyama, Asao; Kurata, Nori; Nagasaki, Hideki; Kaminuma, Eli; Nakamura, Yasukazu

    2017-01-01

    With the rapid advances in next-generation sequencing (NGS), datasets for DNA polymorphisms among various species and strains have been produced, stored, and distributed. However, reliability varies among these datasets because the experimental and analytical conditions used differ among assays. Furthermore, such datasets have been frequently distributed from the websites of individual sequencing projects. It is desirable to integrate DNA polymorphism data into one database featuring uniform quality control that is distributed from a single platform at a single place. DNA polymorphism annotation database (DNApod; http://tga.nig.ac.jp/dnapod/) is an integrated database that stores genome-wide DNA polymorphism datasets acquired under uniform analytical conditions, and this includes uniformity in the quality of the raw data, the reference genome version, and evaluation algorithms. DNApod genotypic data are re-analyzed whole-genome shotgun datasets extracted from sequence read archives, and DNApod distributes genome-wide DNA polymorphism datasets and known-gene annotations for each DNA polymorphism. This new database was developed for storing genome-wide DNA polymorphism datasets of plants, with crops being the first priority. Here, we describe our analyzed data for 679, 404, and 66 strains of rice, maize, and sorghum, respectively. The analytical methods are available as a DNApod workflow in an NGS annotation system of the DNA Data Bank of Japan and a virtual machine image. Furthermore, DNApod provides tables of links of identifiers between DNApod genotypic data and public phenotypic data. To advance the sharing of organism knowledge, DNApod offers basic and ubiquitous functions for multiple alignment and phylogenetic tree construction by using orthologous gene information. PMID:28234924

  15. TaxI: a software tool for DNA barcoding using distance methods

    PubMed Central

    Steinke, Dirk; Vences, Miguel; Salzburger, Walter; Meyer, Axel

    2005-01-01

    DNA barcoding is a promising approach to the diagnosis of biological diversity in which DNA sequences serve as the primary key for information retrieval. Most existing software for evolutionary analysis of DNA sequences was designed for phylogenetic analyses and, hence, those algorithms do not offer appropriate solutions for the rapid but precise analyses needed for DNA barcoding, and are also unable to process the often large comparative datasets. We developed a flexible software tool for DNA taxonomy, named TaxI. This program calculates sequence divergences between a query sequence (taxon to be barcoded) and each sequence of a dataset of reference sequences defined by the user. Because the analysis is based on separate pairwise alignments, this software is also able to work with sequences characterized by multiple insertions and deletions that are difficult to align in large sequence sets (i.e. thousands of sequences) by multiple alignment algorithms because of computational restrictions. Here, we demonstrate the utility of this approach with two datasets, of fish larvae and juveniles from Lake Constance and of juvenile land snails, under different models of sequence evolution. Sets of ribosomal 16S rRNA sequences, characterized by multiple indels, performed as well as or better than cox1 sequence sets in assigning sequences to species, demonstrating the suitability of rRNA genes for DNA barcoding. PMID:16214755
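
    The query-versus-reference comparison described in this abstract can be illustrated with a minimal Python sketch. It is not TaxI's implementation: the Needleman-Wunsch scoring values, the uncorrected p-distance (no model of sequence evolution), and the function names global_align, p_distance and closest_reference are all assumptions chosen for illustration. Each reference is aligned to the query separately, and gapped columns are ignored when computing divergence, which is why indel-rich sequences remain comparable.

```python
def global_align(a, b, match=1, mismatch=-1, gap=-2):
    """Needleman-Wunsch global alignment (assumed scoring); returns two gapped strings."""
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return "".join(reversed(out_a)), "".join(reversed(out_b))


def p_distance(aligned_a, aligned_b):
    """Fraction of differing sites among columns where neither sequence has a gap."""
    compared = differing = 0
    for x, y in zip(aligned_a, aligned_b):
        if x == "-" or y == "-":
            continue
        compared += 1
        differing += x != y
    if compared == 0:
        raise ValueError("no aligned positions to compare")
    return differing / compared


def closest_reference(query, references):
    """Align the query to every reference separately; return (name, distance) of the nearest."""
    distances = {name: p_distance(*global_align(query, seq))
                 for name, seq in references.items()}
    best = min(distances, key=distances.get)
    return best, distances[best]


if __name__ == "__main__":
    refs = {"sp_A": "ACGTTCGT", "sp_B": "ACGACGT"}  # hypothetical reference set
    print(closest_reference("ACGTACGT", refs))
```

    Because every comparison is an independent pairwise alignment, the cost grows linearly with the number of reference sequences, and no multiple alignment is ever built.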

  16. Processing and population genetic analysis of multigenic datasets with ProSeq3 software.

    PubMed

    Filatov, Dmitry A

    2009-12-01

    The current tendency in molecular population genetics is to use increasing numbers of genes in the analysis. Here I describe a program for handling and population genetic analysis of DNA polymorphism data collected from multiple genes. The program includes a sequence/alignment editor and an internal relational database that simplify the preparation and manipulation of multigenic DNA polymorphism datasets. The most commonly used DNA polymorphism analyses are implemented in ProSeq3, facilitating population genetic analysis of large multigenic datasets. Extensive input/output options make ProSeq3 a convenient hub for sequence data processing and analysis. The program is available free of charge from http://dps.plants.ox.ac.uk/sequencing/proseq.htm.

  17. Comparative Evaluation of Background Subtraction Algorithms in Remote Scene Videos Captured by MWIR Sensors

    PubMed Central

    Yao, Guangle; Lei, Tao; Zhong, Jiandan; Jiang, Ping; Jia, Wenwu

    2017-01-01

    Background subtraction (BS) is one of the most commonly encountered tasks in video analysis and tracking systems. It distinguishes the foreground (moving objects) from the background in video sequences captured by static imaging sensors. Background subtraction in remote scene infrared (IR) video is important and common to many fields. This paper provides a Remote Scene IR Dataset captured by our designed medium-wave infrared (MWIR) sensor. Each video sequence in this dataset is identified with specific BS challenges, and the pixel-wise ground truth of foreground (FG) for each frame is also provided. A series of experiments was conducted to evaluate BS algorithms on this proposed dataset. The overall performance of BS algorithms and the processor/memory requirements were compared. Proper evaluation metrics or criteria were employed to evaluate the capability of each BS algorithm to handle different kinds of BS challenges represented in this dataset. The results and conclusions in this paper provide valid references for developing new BS algorithms for remote scene IR video sequences, and some of them are not limited to remote scene or IR video sequences but are also generic for background subtraction. The Remote Scene IR dataset and the foreground masks detected by each evaluated BS algorithm are available online: https://github.com/JerryYaoGl/BSEvaluationRemoteSceneIR. PMID:28837112
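
    For readers unfamiliar with the task, a generic running-average background model is sketched below in plain Python. It is an illustration only and is not claimed to be one of the algorithms evaluated in the paper; the function name running_average_bs and the alpha and threshold defaults are assumptions. Frames are 2-D lists of grayscale intensities, and the output is one binary foreground mask per frame after the first.

```python
def running_average_bs(frames, alpha=0.1, threshold=25):
    """Return binary foreground masks (1 = foreground) for every frame after the first.

    The first frame initialises the background model. A pixel is foreground when it
    differs from the model by more than `threshold`; only background pixels update
    the model, so a moving object is not absorbed into the background immediately.
    """
    frames = iter(frames)
    background = [[float(p) for p in row] for row in next(frames)]
    masks = []
    for frame in frames:
        mask = []
        for y, row in enumerate(frame):
            mask_row = []
            for x, pixel in enumerate(row):
                is_foreground = abs(pixel - background[y][x]) > threshold
                mask_row.append(1 if is_foreground else 0)
                if not is_foreground:
                    # Exponential moving average of the background intensity.
                    background[y][x] = (1 - alpha) * background[y][x] + alpha * pixel
            mask.append(mask_row)
        masks.append(mask)
    return masks


if __name__ == "__main__":
    still = [[10, 10], [10, 10]]
    moving = [[10, 200], [10, 10]]  # one bright pixel appears
    print(running_average_bs([still, moving, still]))
```

    Comparing such masks pixel by pixel against a ground-truth foreground mask is the kind of evaluation that a dataset with per-frame ground truth makes possible.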

  18. CLUSTOM-CLOUD: In-Memory Data Grid-Based Software for Clustering 16S rRNA Sequence Data in the Cloud Environment.

    PubMed

    Oh, Jeongsu; Choi, Chi-Hwan; Park, Min-Kyu; Kim, Byung Kwon; Hwang, Kyuin; Lee, Sang-Heon; Hong, Soon Gyu; Nasir, Arshan; Cho, Wan-Sup; Kim, Kyung Mo

    2016-01-01

    High-throughput sequencing can produce hundreds of thousands of 16S rRNA sequence reads corresponding to different organisms present in the environmental samples. Typically, analysis of microbial diversity in bioinformatics starts from pre-processing followed by clustering 16S rRNA reads into relatively fewer operational taxonomic units (OTUs). The OTUs are reliable indicators of microbial diversity and greatly accelerate the downstream analysis time. However, existing hierarchical clustering algorithms, which are generally more accurate than greedy heuristic algorithms, struggle with large sequence datasets. To keep pace with the rapid rise in sequencing data, we present CLUSTOM-CLOUD, which is the first distributed sequence clustering program based on In-Memory Data Grid (IMDG) technology, a distributed data structure that stores all data in the main memory of multiple computing nodes. The IMDG technology helps CLUSTOM-CLOUD to enhance both its capability of handling larger datasets and its computational scalability beyond those of its ancestor, CLUSTOM, while maintaining high accuracy. Clustering speed of CLUSTOM-CLOUD was evaluated on published 16S rRNA human microbiome sequence datasets using a small laboratory cluster (10 nodes) and under the Amazon EC2 cloud-computing environment. Under the laboratory environment, it required only ~3 hours to process a dataset of 200 K reads regardless of the complexity of the human microbiome data. In turn, one million reads were processed in approximately 20, 14, and 11 hours when utilizing 20, 30, and 40 nodes on the Amazon EC2 cloud-computing environment. The running time evaluation indicates that CLUSTOM-CLOUD can handle much larger sequence datasets than CLUSTOM and is also a scalable distributed processing system. The comparative accuracy test using 16S rRNA pyrosequences of a mock community shows that CLUSTOM-CLOUD achieves higher accuracy than DOTUR, mothur, ESPRIT-Tree, UCLUST and Swarm. CLUSTOM-CLOUD is written in JAVA and is freely available at http://clustomcloud.kopri.re.kr.
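
    The scalability claim can be checked with a little arithmetic on the runtimes quoted in the abstract (approximately 20, 14, and 11 hours on 20, 30, and 40 nodes for one million reads). The short Python sketch below computes speedup and parallel efficiency relative to the 20-node run; the choice of the smallest run as baseline and the function name scaling_table are assumptions, and because the reported hours are approximate the results are rough.

```python
# Runtimes reported in the abstract for one million reads on Amazon EC2 (nodes -> hours).
runtimes_hours = {20: 20.0, 30: 14.0, 40: 11.0}


def scaling_table(runtimes):
    """Return (nodes, speedup, efficiency) rows relative to the smallest node count."""
    base_nodes = min(runtimes)
    base_time = runtimes[base_nodes]
    rows = []
    for nodes in sorted(runtimes):
        speedup = base_time / runtimes[nodes]   # how much faster than the baseline
        ideal = nodes / base_nodes              # speedup under perfect linear scaling
        rows.append((nodes, speedup, speedup / ideal))
    return rows


if __name__ == "__main__":
    for nodes, speedup, efficiency in scaling_table(runtimes_hours):
        print(f"{nodes} nodes: speedup {speedup:.2f}, efficiency {efficiency:.0%}")
```

    Under these assumptions, going from 20 to 40 nodes gives a speedup of about 1.8 rather than the ideal 2.0, that is, a parallel efficiency of roughly 91%.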

  19. CLUSTOM-CLOUD: In-Memory Data Grid-Based Software for Clustering 16S rRNA Sequence Data in the Cloud Environment

    PubMed Central

    Park, Min-Kyu; Kim, Byung Kwon; Hwang, Kyuin; Lee, Sang-Heon; Hong, Soon Gyu; Nasir, Arshan; Cho, Wan-Sup; Kim, Kyung Mo

    2016-01-01

    High-throughput sequencing can produce hundreds of thousands of 16S rRNA sequence reads corresponding to different organisms present in the environmental samples. Typically, analysis of microbial diversity in bioinformatics starts from pre-processing followed by clustering 16S rRNA reads into relatively fewer operational taxonomic units (OTUs). The OTUs are reliable indicators of microbial diversity and greatly accelerate the downstream analysis time. However, existing hierarchical clustering algorithms, which are generally more accurate than greedy heuristic algorithms, struggle with large sequence datasets. To keep pace with the rapid rise in sequencing data, we present CLUSTOM-CLOUD, which is the first distributed sequence clustering program based on In-Memory Data Grid (IMDG) technology, a distributed data structure that stores all data in the main memory of multiple computing nodes. The IMDG technology helps CLUSTOM-CLOUD to enhance both its capability of handling larger datasets and its computational scalability beyond those of its ancestor, CLUSTOM, while maintaining high accuracy. Clustering speed of CLUSTOM-CLOUD was evaluated on published 16S rRNA human microbiome sequence datasets using a small laboratory cluster (10 nodes) and under the Amazon EC2 cloud-computing environment. Under the laboratory environment, it required only ~3 hours to process a dataset of 200 K reads regardless of the complexity of the human microbiome data. In turn, one million reads were processed in approximately 20, 14, and 11 hours when utilizing 20, 30, and 40 nodes on the Amazon EC2 cloud-computing environment. The running time evaluation indicates that CLUSTOM-CLOUD can handle much larger sequence datasets than CLUSTOM and is also a scalable distributed processing system. The comparative accuracy test using 16S rRNA pyrosequences of a mock community shows that CLUSTOM-CLOUD achieves higher accuracy than DOTUR, mothur, ESPRIT-Tree, UCLUST and Swarm. CLUSTOM-CLOUD is written in JAVA and is freely available at http://clustomcloud.kopri.re.kr. PMID:26954507

  20. Limitations and potentials of current motif discovery algorithms

    PubMed Central

    Hu, Jianjun; Li, Bin; Kihara, Daisuke

    2005-01-01

    Computational methods for de novo identification of gene regulation elements, such as transcription factor binding sites, have proved to be useful for deciphering genetic regulatory networks. However, despite the availability of a large number of algorithms, their strengths and weaknesses are not sufficiently understood. Here, we designed a comprehensive set of performance measures and benchmarked five modern sequence-based motif discovery algorithms using large datasets generated from Escherichia coli RegulonDB. Factors that affect the prediction accuracy, scalability and reliability are characterized. It is revealed that the nucleotide and the binding site level accuracy are very low, while the motif level accuracy is relatively high, which indicates that the algorithms can usually capture at least one correct motif in an input sequence. To exploit diverse predictions from multiple runs of one or more algorithms, a consensus ensemble algorithm has been developed, which achieved 6–45% improvement over the base algorithms by increasing both the sensitivity and specificity. Our study illustrates limitations and potentials of existing sequence-based motif discovery algorithms. Taking advantage of the revealed potentials, several promising directions for further improvements are discussed. Since the sequence-based algorithms are the baseline of most of the modern motif discovery algorithms, this paper suggests substantial improvements would be possible for them. PMID:16284194
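
    To make the nucleotide-level measures and the idea of combining predictions concrete, here is a small Python sketch. It is not the authors' consensus ensemble algorithm: the majority-style position voting in vote_ensemble, the half-open (start, end) site representation, and all function names are assumptions made for illustration. Sensitivity and specificity are computed per nucleotide position.

```python
from collections import Counter


def covered_positions(sites):
    """Expand half-open (start, end) site intervals into a set of nucleotide positions."""
    return {pos for start, end in sites for pos in range(start, end)}


def nucleotide_metrics(true_sites, predicted_sites, seq_len):
    """Nucleotide-level sensitivity and specificity for one sequence of length seq_len."""
    truth = covered_positions(true_sites)
    predicted = covered_positions(predicted_sites)
    tp = len(truth & predicted)
    fp = len(predicted - truth)
    fn = len(truth - predicted)
    tn = seq_len - tp - fp - fn
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    return sensitivity, specificity


def vote_ensemble(predictions, min_votes):
    """Keep positions predicted by at least min_votes of the runs (hypothetical scheme)."""
    votes = Counter()
    for sites in predictions:
        votes.update(covered_positions(sites))
    return {pos for pos, count in votes.items() if count >= min_votes}


if __name__ == "__main__":
    runs = [[(10, 20)], [(12, 22)], [(50, 60)]]  # three hypothetical algorithm outputs
    print(sorted(vote_ensemble(runs, min_votes=2)))
    print(nucleotide_metrics([(10, 20)], [(15, 25)], seq_len=100))
```

    A prediction that overlaps the true site but is shifted by a few bases scores poorly at the nucleotide level while still counting as a hit at the motif level, which is consistent with the gap between the two accuracy levels that the abstract reports.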

  1. Analyses of Evolutionary Characteristics of the Hemagglutinin-Esterase Gene of Influenza C Virus during a Period of 68 Years Reveals Evolutionary Patterns Different from Influenza A and B Viruses.

    PubMed

    Furuse, Yuki; Matsuzaki, Yoko; Nishimura, Hidekazu; Oshitani, Hitoshi

    2016-11-26

    Infections with the influenza C virus causing respiratory symptoms are common, particularly among children. Since isolation and detection of the virus are rarely performed, compared with influenza A and B viruses, the small number of available sequences of the virus makes it difficult to analyze its evolutionary dynamics. Recently, we reported the full genome sequence of 102 strains of the virus. Here, we exploited the data to elucidate the evolutionary characteristics and phylodynamics of the virus compared with influenza A and B viruses. Along with our data, we obtained public sequence data of the hemagglutinin-esterase gene of the virus; the dataset consists of 218 unique sequences of the virus collected from 14 countries between 1947 and 2014. Informatics analyses revealed that (1) multiple lineages have been circulating globally; (2) there have been weak and infrequent selective bottlenecks; (3) the evolutionary rate is low because of weak positive selection and a low capability to induce mutations; and (4) there is no significant positive selection although a few mutations affecting its antigenicity have been induced. The unique evolutionary dynamics of the influenza C virus must be shaped by multiple factors, including virological, immunological, and epidemiological characteristics.

  2. Analyses of Evolutionary Characteristics of the Hemagglutinin-Esterase Gene of Influenza C Virus during a Period of 68 Years Reveals Evolutionary Patterns Different from Influenza A and B Viruses

    PubMed Central

    Furuse, Yuki; Matsuzaki, Yoko; Nishimura, Hidekazu; Oshitani, Hitoshi

    2016-01-01

    Infections with the influenza C virus causing respiratory symptoms are common, particularly among children. Since isolation and detection of the virus are rarely performed, compared with influenza A and B viruses, the small number of available sequences of the virus makes it difficult to analyze its evolutionary dynamics. Recently, we reported the full genome sequence of 102 strains of the virus. Here, we exploited the data to elucidate the evolutionary characteristics and phylodynamics of the virus compared with influenza A and B viruses. Along with our data, we obtained public sequence data of the hemagglutinin-esterase gene of the virus; the dataset consists of 218 unique sequences of the virus collected from 14 countries between 1947 and 2014. Informatics analyses revealed that (1) multiple lineages have been circulating globally; (2) there have been weak and infrequent selective bottlenecks; (3) the evolutionary rate is low because of weak positive selection and a low capability to induce mutations; and (4) there is no significant positive selection although a few mutations affecting its antigenicity have been induced. The unique evolutionary dynamics of the influenza C virus must be shaped by multiple factors, including virological, immunological, and epidemiological characteristics. PMID:27898037

  3. Data-Driven Sequence of Changes to Anatomical Brain Connectivity in Sporadic Alzheimer's Disease.

    PubMed

    Oxtoby, Neil P; Garbarino, Sara; Firth, Nicholas C; Warren, Jason D; Schott, Jonathan M; Alexander, Daniel C

    2017-01-01

    Model-based investigations of transneuronal spreading mechanisms in neurodegenerative diseases relate the pattern of pathology severity to the brain's connectivity matrix, which reveals information about how pathology propagates through the connectivity network. Such network models typically use networks based on functional or structural connectivity in young and healthy individuals, and only end-stage patterns of pathology, thereby ignoring/excluding the effects of normal aging and disease progression. Here, we examine the sequence of changes in the elderly brain's anatomical connectivity over the course of a neurodegenerative disease. We do this in a data-driven manner that is not dependent upon clinical disease stage, by using event-based disease progression modeling. Using data from the Alzheimer's Disease Neuroimaging Initiative dataset, we sequence the progressive decline of anatomical connectivity, as quantified by graph-theory metrics, in the Alzheimer's disease brain. Ours is the first single model to contribute to understanding all three of the nature, the location, and the sequence of changes to anatomical connectivity in the human brain due to Alzheimer's disease. Our experimental results reveal new insights into Alzheimer's disease: that degeneration of anatomical connectivity in the brain may be a viable, even early, biomarker and should be considered when studying such neurodegenerative diseases.

  4. Discovery of parvovirus-related sequences in an unexpected broad range of animals.

    PubMed

    François, S; Filloux, D; Roumagnac, P; Bigot, D; Gayral, P; Martin, D P; Froissart, R; Ogliastro, M

    2016-09-07

    Our knowledge of the genetic diversity and host ranges of viruses is fragmentary. This is particularly true for the Parvoviridae family. Genetic diversity studies of single stranded DNA viruses within this family have been largely focused on arthropod- and vertebrate-infecting species that cause diseases of humans and our domesticated animals: a focus that has biased our perception of parvovirus diversity. While metagenomics approaches could help rectify this bias, so too could transcriptomics studies. Large amounts of transcriptomic data are available for a diverse array of animal species and whenever this data has inadvertently been gathered from virus-infected individuals, it could contain detectable viral transcripts. We therefore performed a systematic search for parvovirus-related sequences (PRSs) within publicly available transcript, genome and protein databases and eleven new transcriptome datasets. This revealed 463 PRSs in the transcript databases of 118 animals. At least 41 of these PRSs are likely integrated within animal genomes in that they were also found within genomic sequence databases. Besides illuminating the ubiquity of parvoviruses, the number of parvoviral sequences discovered within public databases revealed numerous previously unknown parvovirus-host combinations; particularly in invertebrates. Our findings suggest that the host-ranges of extant parvoviruses might span the entire animal kingdom.

  5. Sequencing Data Discovery and Integration for Earth System Science with MetaSeek

    NASA Astrophysics Data System (ADS)

    Hoarfrost, A.; Brown, N.; Arnosti, C.

    2017-12-01

    Microbial communities play a central role in biogeochemical cycles. Sequencing data resources from environmental sources have grown exponentially in recent years, and represent a singular opportunity to investigate microbial interactions with Earth system processes. Carrying out such meta-analyses depends on our ability to discover and curate sequencing data into large-scale integrated datasets. However, such integration efforts are currently challenging and time-consuming, with sequencing data scattered across multiple repositories and metadata that is not easily or comprehensively searchable. MetaSeek is a sequencing data discovery tool that integrates sequencing metadata from all the major data repositories, allowing the user to search and filter on datasets in a lightweight application with an intuitive, easy-to-use web-based interface. Users can save and share curated datasets, while other users can browse these data integrations or use them as a jumping off point for their own curation. Missing and/or erroneous metadata are inferred automatically where possible, and where not possible, users are prompted to contribute to the improvement of the sequencing metadata pool by correcting and amending metadata errors. Once an integrated dataset has been curated, users can follow simple instructions to download their raw data and quickly begin their investigations. In addition to the online interface, the MetaSeek database is easily queryable via an open API, further enabling users and facilitating integrations of MetaSeek with other data curation tools. This tool lowers the barriers to curation and integration of environmental sequencing data, clearing the path forward to illuminating the ecosystem-scale interactions between biological and abiotic processes.

  6. High-Throughput Single-Cell RNA Sequencing and Data Analysis.

    PubMed

    Sagar; Herman, Josip Stefan; Pospisilik, John Andrew; Grün, Dominic

    2018-01-01

    Understanding biological systems at a single cell resolution may reveal several novel insights which remain masked by the conventional population-based techniques providing an average readout of the behavior of cells. Single-cell transcriptome sequencing holds the potential to identify novel cell types and characterize the cellular composition of any organ or tissue in health and disease. Here, we describe a customized high-throughput protocol for single-cell RNA-sequencing (scRNA-seq) combining flow cytometry and a nanoliter-scale robotic system. Since scRNA-seq requires amplification of a low amount of endogenous cellular RNA, leading to substantial technical noise in the dataset, downstream data filtering and analysis require special care. Therefore, we also briefly describe in-house state-of-the-art data analysis algorithms developed to identify cellular subpopulations including rare cell types as well as to derive lineage trees by ordering the identified subpopulations of cells along the inferred differentiation trajectories.

  7. MetaCAA: A clustering-aided methodology for efficient assembly of metagenomic datasets.

    PubMed

    Reddy, Rachamalla Maheedhar; Mohammed, Monzoorul Haque; Mande, Sharmila S

    2014-01-01

    A key challenge in analyzing metagenomics data pertains to assembly of sequenced DNA fragments (i.e. reads) originating from various microbes in a given environmental sample. Several existing methodologies can assemble reads originating from a single genome. However, these methodologies cannot be applied for efficient assembly of metagenomic sequence datasets. In this study, we present MetaCAA - a clustering-aided methodology which helps in improving the quality of metagenomic sequence assembly. MetaCAA initially groups sequences constituting a given metagenome into smaller clusters. Subsequently, sequences in each cluster are independently assembled using CAP3, an existing single genome assembly program. Contigs formed in each of the clusters along with the unassembled reads are then subjected to another round of assembly for generating the final set of contigs. Validation using simulated and real-world metagenomic datasets indicates that MetaCAA aids in improving the overall quality of assembly. A software implementation of MetaCAA is available at https://metagenomics.atc.tcs.com/MetaCAA. Copyright © 2014 Elsevier Inc. All rights reserved.
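
    The two-round workflow described in this abstract (cluster, assemble each cluster, then assemble contigs together with leftover reads) can be outlined as a Python skeleton. The clustering method and the assembler are deliberately left as caller-supplied functions, since the abstract does not give those details; two_stage_assembly, cluster_fn and assemble_fn are hypothetical names, and no real CAP3 invocation is shown.

```python
def two_stage_assembly(reads, cluster_fn, assemble_fn):
    """Outline of a clustering-aided assembly.

    cluster_fn(reads)   -> list of clusters, each a list of reads
    assemble_fn(seqs)   -> (contigs, unassembled), each a list of sequences
    Both callables are placeholders for real tools (the paper uses CAP3 as assembler).
    """
    stage1_contigs, leftovers = [], []
    # Round 1: assemble every cluster independently.
    for cluster in cluster_fn(reads):
        contigs, unassembled = assemble_fn(cluster)
        stage1_contigs.extend(contigs)
        leftovers.extend(unassembled)
    # Round 2: assemble first-round contigs together with reads left unassembled.
    return assemble_fn(stage1_contigs + leftovers)


if __name__ == "__main__":
    # Toy stand-ins, only to exercise the control flow; they are not real methods.
    def by_first_base(reads):
        return [[r for r in reads if r[0] == base] for base in sorted({r[0] for r in reads})]

    def join_if_many(seqs):
        return (["".join(seqs)], []) if len(seqs) > 1 else ([], list(seqs))

    print(two_stage_assembly(["AAC", "AAG", "CCT", "GGA"], by_first_base, join_if_many))
```

    Keeping the assembler behind a plain function boundary mirrors the design the abstract describes, in which an existing single-genome assembler is reused unchanged on smaller, more homogeneous inputs.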

  8. Identification of fungi in shotgun metagenomics datasets

    PubMed Central

    Donovan, Paul D.; Gonzalez, Gabriel; Higgins, Desmond G.

    2018-01-01

    Metagenomics uses nucleic acid sequencing to characterize species diversity in different niches such as environmental biomes or the human microbiome. Most studies have used 16S rRNA amplicon sequencing to identify bacteria. However, the decreasing cost of sequencing has resulted in a gradual shift away from amplicon analyses and towards shotgun metagenomic sequencing. Shotgun metagenomic data can be used to identify a wide range of species, but have rarely been applied to fungal identification. Here, we develop a sequence classification pipeline, FindFungi, and use it to identify fungal sequences in public metagenome datasets. We focus primarily on animal metagenomes, especially those from pig and mouse microbiomes. We identified fungi in 39 of 70 datasets comprising 71 fungal species. At least 11 pathogenic species with zoonotic potential were identified, including Candida tropicalis. We identified Pseudogymnoascus species from 13 Antarctic soil samples initially analyzed for the presence of bacteria capable of degrading diesel oil. We also show that Candida tropicalis and Candida loboi are likely the same species. In addition, we identify several examples where contaminating DNA was erroneously included in fungal genome assemblies. PMID:29444186

  9. An efficient annotation and gene-expression derivation tool for Illumina Solexa datasets

    PubMed Central

    2010-01-01

    Background An Illumina flow cell with all eight lanes occupied produces well over a terabyte of images, with gigabytes of reads following sequence alignment. The ability to translate such reads into meaningful annotation is therefore of great concern and importance. One can very easily be flooded with a great volume of textual, unannotated data irrespective of read quality or size. CASAVA, an optional analysis tool for Illumina sequencing experiments, provides INDEL detection, SNP information, and allele calling. It is therefore of significant value not only to extract from such an analysis a measure of gene expression in the form of tag-counts, but also to annotate the reads. Findings We developed TASE (Tag counting and Analysis of Solexa Experiments), a rapid tag-counting and annotation software tool specifically designed for Illumina CASAVA sequencing datasets. Developed in Java and deployed using the jTDS JDBC driver and a SQL Server backend, TASE provides an extremely fast means of calculating gene expression through tag-counts while annotating sequenced reads with the gene's presumed function, from any given CASAVA-build. Such a build is generated for both DNA and RNA sequencing. Analysis is broken into two distinct components: DNA sequence or read concatenation, followed by tag-counting and annotation. The output contains the homology-based functional annotation and the respective gene expression measure, which signifies how many times sequenced reads were found within the genomic ranges of functional annotations. Conclusions TASE is a powerful tool to facilitate the process of annotating a given Illumina Solexa sequencing dataset. Our results indicate that both homology-based annotation and tag-count analysis are achieved very efficiently, enabling researchers to delve deep into a given CASAVA-build and maximize information extraction from a sequencing dataset. 
TASE is specially designed to translate sequence data in a CASAVA-build into functional annotations while producing corresponding gene expression measurements. The analysis runs in an ultrafast and highly efficient manner, whether for a single-read or a paired-end sequencing experiment. TASE is a user-friendly and freely available application, allowing rapid analysis and annotation of any given Illumina Solexa sequencing dataset with ease. PMID:20598141
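
    The tag-count idea described above (how many reads fall within the genomic range of each annotation) might be sketched as below. This is a hedged illustration, not TASE itself, which the abstract says is a Java application with a SQL Server backend: the count_tags function, its tuple layout, and the assumption of non-overlapping, inclusive ranges on a single reference sequence are all hypothetical.

```python
from bisect import bisect_right


def count_tags(annotations, read_positions):
    """Count reads per annotated range.

    annotations: list of (name, start, end) tuples with inclusive,
                 non-overlapping coordinates (an assumption of this sketch).
    read_positions: list of integer alignment positions, one per read.
    Returns a dict mapping annotation name to its tag-count.
    """
    ordered = sorted(annotations, key=lambda a: a[1])
    starts = [a[1] for a in ordered]
    counts = {a[0]: 0 for a in ordered}
    for pos in read_positions:
        # Find the last annotation whose start is <= pos.
        idx = bisect_right(starts, pos) - 1
        if idx >= 0 and pos <= ordered[idx][2]:
            counts[ordered[idx][0]] += 1
    return counts
```

    Reads that fall outside every annotated range are simply ignored; sorting the range starts once keeps each lookup logarithmic in the number of annotations.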

  10. What is the phylogenetic signal limit from mitogenomes? The reconciliation between mitochondrial and nuclear data in the Insecta class phylogeny

    PubMed Central

    2011-01-01

    Background Efforts to solve higher-level evolutionary relationships within the class Insecta by using mitochondrial genomic data are hindered by the fast sequence evolution of several groups, most notably Hymenoptera, Strepsiptera, Phthiraptera, Hemiptera and Thysanoptera. Accelerated rates of substitution in their sequences have been shown to have negative consequences in phylogenetic inference. In this study, we tested several methodological approaches to recover phylogenetic signal from whole mitochondrial genomes. As a model, we used two classical problems in insect phylogenetics: the relationships within Paraneoptera and within Holometabola. Moreover, we assessed the mitochondrial phylogenetic signal limits in the deeper Eumetabola dataset, and we studied the contribution of individual genes. Results Long-branch attraction (LBA) artefacts were detected in all the datasets. Methods using Bayesian inference outperformed maximum likelihood approaches, and LBA was avoided in Paraneoptera and Holometabola when using protein sequences and the site-heterogeneous mixture model CAT. The better performance of this method was evidenced by resulting topologies matching generally accepted hypotheses based on nuclear and/or morphological data, and was confirmed by cross-validation and simulation analyses. Using the CAT model, the order Strepsiptera was recovered as sister to Coleoptera for the first time using mitochondrial sequences, in agreement with recent results based on large nuclear and morphological datasets. The Hymenoptera-Mecopterida association was also obtained, leaving Coleoptera and Strepsiptera as the basal groups of the holometabolan insects, which coincides with one of the two main competing hypotheses. For the Paraneoptera, the currently accepted non-monophyly of Homoptera was documented as a phylogenetic novelty for mitochondrial data. 
However, results were not satisfactory when exploring the entire Eumetabola, revealing the limits of the phylogenetic signal that can be extracted from Insecta mitogenomes. Based on the combined use of the five best topology-performing genes we obtained comparable results to whole mitogenomes, highlighting the important role of data quality. Conclusion We show for the first time that mitogenomic data agrees with nuclear and morphological data for several of the most controversial insect evolutionary relationships, adding a new independent source of evidence to study relationships among insect orders. We propose that deeper divergences cannot be inferred with the current available methods due to sequence saturation and compositional bias inconsistencies. Our exploratory analysis indicates that the CAT model is the best dealing with LBA and it could be useful for other groups and datasets with similar phylogenetic difficulties. PMID:22032248

  11. Multi-Data Approach for remote sensing-based regional crop rotation mapping: A case study for the Rur catchment, Germany

    NASA Astrophysics Data System (ADS)

    Waldhoff, Guido; Lussem, Ulrike; Bareth, Georg

    2017-09-01

    Spatial land use information is one of the key input parameters for regional agro-ecosystem modeling. Furthermore, to assess the crop-specific management in a spatio-temporal context accurately, parcel-related crop rotation information is additionally needed. Such data is scarcely available for a regional scale, so that only modeled crop rotations can be incorporated instead. However, the spectrum of the occurring multiannual land use patterns on arable land remains unknown. Thus, this contribution focuses on the mapping of the actually practiced crop rotations in the Rur catchment, located in the western part of Germany. We addressed this by combining multitemporal multispectral remote sensing data, ancillary information and expert-knowledge on crop phenology in a GIS-based Multi-Data Approach (MDA). At first, a methodology for the enhanced differentiation of the major crop types on an annual basis was developed. Key aspects are (i) the usage of physical block data to separate arable land from other land use types, (ii) the classification of remote sensing scenes of specific time periods, which are most favorable for the differentiation of certain crop types, and (iii) the combination of the multitemporal classification results in a sequential analysis strategy. Annual crop maps of eight consecutive years (2008-2015) were combined to a crop sequence dataset to have a profound data basis for the mapping of crop rotations. In most years, the remote sensing data basis was highly fragmented. Nevertheless, our method enabled satisfying crop mapping results. As an example for the annual crop mapping workflow, the procedure and the result of 2015 are illustrated. For the generation of the crop sequence dataset, the eight annual crop maps were geometrically smoothened and integrated into a single vector data layer. The resulting dataset informs about the occurring crop sequence for individual areas on arable land, so that crop rotation schemes can be derived. 
The resulting dataset reveals that the spectrum of the practiced crop rotations is extremely heterogeneous and contains a large amount of crop sequences, which strongly diverge from model crop rotations. Consequently, the integration of remote sensing-based crop rotation data can considerably reduce uncertainties regarding the management in regional agro-ecosystem modeling. Finally, the developed methods and the results are discussed in detail.

  12. Lokiarchaea are close relatives of Euryarchaeota, not bridging the gap between prokaryotes and eukaryotes

    PubMed Central

    Forterre, Patrick

    2017-01-01

    The eocyte hypothesis, in which Eukarya emerged from within Archaea, has been boosted by the description of a new candidate archaeal phylum, “Lokiarchaeota”, from metagenomic data. Eukarya branch within Lokiarchaeota in a tree reconstructed from the concatenation of 36 universal proteins. However, individual phylogenies revealed that lokiarchaeal protein sequences have different evolutionary histories. The individual marker phylogenies revealed at least two subsets of proteins, supporting either the Woese or the Eocyte tree of life. Strikingly, removal of a single protein, the elongation factor EF2, is sufficient to break the Eukaryotes-Lokiarchaea affiliation. Our analysis suggests that the three lokiarchaeal EF2 proteins have a chimeric organization that could be due to contamination and/or homologous recombination with patches of eukaryotic sequences. A robust phylogenetic analysis of RNA polymerases with a new dataset indicates that Lokiarchaeota and related phyla of the Asgard superphylum are a sister group to Euryarchaeota, not to Eukarya, and supports the monophyly of Archaea with their rooting in the branch leading to Thaumarchaeota. PMID:28604769

  13. Defining the Estimated Core Genome of Bacterial Populations Using a Bayesian Decision Model

    PubMed Central

    van Tonder, Andries J.; Mistry, Shilan; Bray, James E.; Hill, Dorothea M. C.; Cody, Alison J.; Farmer, Chris L.; Klugman, Keith P.; von Gottberg, Anne; Bentley, Stephen D.; Parkhill, Julian; Jolley, Keith A.; Maiden, Martin C. J.; Brueggemann, Angela B.

    2014-01-01

    The bacterial core genome is of intense interest and the volume of whole genome sequence data in the public domain available to investigate it has increased dramatically. The aim of our study was to develop a model to estimate the bacterial core genome from next-generation whole genome sequencing data and use this model to identify novel genes associated with important biological functions. Five bacterial datasets were analysed, comprising 2096 genomes in total. We developed a Bayesian decision model to estimate the number of core genes, calculated pairwise evolutionary distances (p-distances) based on nucleotide sequence diversity, and plotted the median p-distance for each core gene relative to its genome location. We designed visually-informative genome diagrams to depict areas of interest in genomes. Case studies demonstrated how the model could identify areas for further study, e.g. 25% of the core genes with higher sequence diversity in the Campylobacter jejuni and Neisseria meningitidis genomes encoded hypothetical proteins. The core gene with the highest p-distance value in C. jejuni was annotated in the reference genome as a putative hydrolase, but further work revealed that it shared sequence homology with beta-lactamase/metallo-beta-lactamases (enzymes that provide resistance to a range of broad-spectrum antibiotics) and thioredoxin reductase genes (which reduce oxidative stress and are essential for DNA replication) in other C. jejuni genomes. Our Bayesian model of estimating the core genome is principled, easy to use and can be applied to large genome datasets. This study also highlighted the lack of knowledge currently available for many core genes in bacterial genomes of significant global public health importance. PMID:25144616
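
    The pairwise p-distance and per-gene median mentioned in this abstract can be illustrated with a short sketch. The function names and the choice to skip gap characters are assumptions made here for illustration; the abstract does not describe how the authors handled gaps or which software they used.

```python
from itertools import combinations
from statistics import median


def p_distance(seq_a, seq_b):
    """Proportion of differing sites between two aligned sequences.
    Sites where either sequence has a gap ('-') are skipped (assumption)."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    compared = 0
    differences = 0
    for a, b in zip(seq_a, seq_b):
        if a == "-" or b == "-":
            continue
        compared += 1
        if a != b:
            differences += 1
    return differences / compared if compared else 0.0


def median_p_distance(aligned_alleles):
    """Median of all pairwise p-distances for one gene's aligned alleles.
    Requires at least two sequences."""
    return median(p_distance(a, b) for a, b in combinations(aligned_alleles, 2))
```

    Computing median_p_distance for each core gene and plotting the values against genome position would give a diversity profile of the kind the abstract describes.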

  14. Biogeography of the ecosystems of the healthy human body.

    PubMed

    Zhou, Yanjiao; Gao, Hongyu; Mihindukulasuriya, Kathie A; La Rosa, Patricio S; Wylie, Kristine M; Vishnivetskaya, Tatiana; Podar, Mircea; Warner, Barb; Tarr, Phillip I; Nelson, David E; Fortenberry, J Dennis; Holland, Martin J; Burr, Sarah E; Shannon, William D; Sodergren, Erica; Weinstock, George M

    2013-01-14

    Characterizing the biogeography of the microbiome of healthy humans is essential for understanding microbial associated diseases. Previous studies mainly focused on a single body habitat from a limited set of subjects. Here, we analyzed one of the largest microbiome datasets to date and generated a biogeographical map that annotates the biodiversity, spatial relationships, and temporal stability of 22 habitats from 279 healthy humans. We identified 929 genera from more than 24 million 16S rRNA gene sequences of 22 habitats, and we provide a baseline of inter-subject variation for healthy adults. The oral habitat has the most stable microbiota with the highest alpha diversity, while the skin and vaginal microbiota are less stable and show lower alpha diversity. The level of biodiversity in one habitat is independent of the biodiversity of other habitats in the same individual. The abundances of a given genus at a body site in which it dominates do not correlate with the abundances at body sites where it is not dominant. Additionally, we observed the human microbiota exhibit both cosmopolitan and endemic features. Finally, comparing datasets of different projects revealed a project-based clustering pattern, emphasizing the significance of standardization of metagenomic studies. The data presented here extend the definition of the human microbiome by providing a more complete and accurate picture of human microbiome biogeography, addressing questions best answered by a large dataset of subjects and body sites that are deeply sampled by sequencing.
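
    Alpha diversity, as compared across habitats above, is often summarized with an index computed from per-taxon counts. The abstract does not say which index was used, so the Shannon index below is an assumption chosen for illustration, and shannon_index is a hypothetical helper name.

```python
import math


def shannon_index(counts):
    """Shannon diversity H = -sum(p_i * ln p_i) over taxa with nonzero counts.
    counts: iterable of non-negative read counts, one per taxon (e.g. genus)."""
    counts = [c for c in counts if c > 0]
    total = sum(counts)
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log(c / total) for c in counts)
```

    A sample dominated by one genus gives a value near zero, while an even spread over many genera gives a higher value.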

  15. PSOFuzzySVM-TMH: identification of transmembrane helix segments using ensemble feature space by incorporated fuzzy support vector machine.

    PubMed

    Hayat, Maqsood; Tahir, Muhammad

    2015-08-01

    Membrane proteins are central components of the cell that manage intra- and extracellular processes. They execute a diversity of functions that are vital for the survival of organisms. The topology of a transmembrane protein describes the number of transmembrane (TM) helix segments and their orientation. However, owing to the lack of recognized structures, the identification of TM helices and their topology through experimental methods is laborious and low-throughput. In order to identify TM helix segments reliably, accurately, and effectively from topogenic sequences, we propose the PSOFuzzySVM-TMH model. In this model, evolutionary information (the position-specific scoring matrix) and discrete information (the 6-letter exchange group) are used to represent transmembrane protein sequences. Noisy and extraneous attributes are removed from both feature spaces using an optimization-based selection technique, particle swarm optimization. Finally, the selected feature spaces are combined to form an ensemble feature space. A fuzzy support vector machine is utilized as the classification algorithm. Two benchmark datasets, a low-resolution and a high-resolution dataset, are used. At various levels, the performance of the PSOFuzzySVM-TMH model is assessed through a 10-fold cross-validation test. The empirical results reveal that the proposed PSOFuzzySVM-TMH framework achieves superior classification performance on the examined datasets. The proposed model might be a useful, high-throughput tool for academia and the research community for further structural and functional studies on transmembrane proteins.

  16. Biogeography of the ecosystems of the healthy human body

    PubMed Central

    2013-01-01

    Background Characterizing the biogeography of the microbiome of healthy humans is essential for understanding microbial associated diseases. Previous studies mainly focused on a single body habitat from a limited set of subjects. Here, we analyzed one of the largest microbiome datasets to date and generated a biogeographical map that annotates the biodiversity, spatial relationships, and temporal stability of 22 habitats from 279 healthy humans. Results We identified 929 genera from more than 24 million 16S rRNA gene sequences of 22 habitats, and we provide a baseline of inter-subject variation for healthy adults. The oral habitat has the most stable microbiota with the highest alpha diversity, while the skin and vaginal microbiota are less stable and show lower alpha diversity. The level of biodiversity in one habitat is independent of the biodiversity of other habitats in the same individual. The abundances of a given genus at a body site in which it dominates do not correlate with the abundances at body sites where it is not dominant. Additionally, we observed the human microbiota exhibit both cosmopolitan and endemic features. Finally, comparing datasets of different projects revealed a project-based clustering pattern, emphasizing the significance of standardization of metagenomic studies. Conclusions The data presented here extend the definition of the human microbiome by providing a more complete and accurate picture of human microbiome biogeography, addressing questions best answered by a large dataset of subjects and body sites that are deeply sampled by sequencing. PMID:23316946

  17. Privacy-preserving microbiome analysis using secure computation.

    PubMed

    Wagner, Justin; Paulson, Joseph N; Wang, Xiao; Bhattacharjee, Bobby; Corrada Bravo, Héctor

    2016-06-15

    Developing targeted therapeutics and identifying biomarkers relies on large amounts of research participant data. Beyond human DNA, scientists now investigate the DNA of micro-organisms inhabiting the human body. Recent work shows that an individual's collection of microbial DNA consistently identifies that person and could be used to link a real-world identity to a sensitive attribute in a research dataset. Unfortunately, the current suite of DNA-specific privacy-preserving analysis tools does not meet the requirements for microbiome sequencing studies. To address privacy concerns around microbiome sequencing, we implement metagenomic analyses using secure computation. Our implementation allows comparative analysis over combined data without revealing the feature counts for any individual sample. We focus on three analyses and perform an evaluation on datasets currently used by the microbiome research community. We use our implementation to simulate sharing data between four policy-domains. Additionally, we describe an application of our implementation for patients to combine data that allows drug developers to query against and compensate patients for the analysis. The software is freely available for download at: http://cbcb.umd.edu/∼hcorrada/projects/secureseq.html Supplementary data are available at Bioinformatics online. hcorrada@umiacs.umd.edu. © The Author 2016. Published by Oxford University Press.

  18. Animal Viruses Probe dataset (AVPDS) for microarray-based diagnosis and identification of viruses.

    PubMed

    Yadav, Brijesh S; Pokhriyal, Mayank; Vasishtha, Dinesh P; Sharma, Bhaskar

    2014-03-01

    AVPDS (Animal Viruses Probe dataset) is a dataset of virus-specific and conserved oligonucleotides for identification and diagnosis of viruses infecting animals. The current dataset contains 20,619 virus-specific probes for 833 viruses and their subtypes and 3,988 conserved probes for 146 viral genera. The table of virus-specific probes has two fields, namely virus name and probe sequence. Similarly, the table of conserved probes for virus genera has the fields genus, subgroup within genus, and probe sequence. The subgroups within a genus are artificial divisions with no taxonomic significance; each contains probes which identify the viruses in that specific subgroup of the genus. Using this dataset we have successfully diagnosed the first case of Newcastle disease virus in sheep and reported a mixed infection of Bovine viral diarrhea and Bovine herpesvirus in cattle. The dataset also contains probes which cross-react across species experimentally, though computationally they meet specifications. These probes have been marked. We hope that this dataset will be useful in microarray-based detection of viruses. The dataset can be accessed through the link https://dl.dropboxusercontent.com/u/94060831/avpds/HOME.html.

  19. IMG/M: integrated genome and metagenome comparative data analysis system

    DOE PAGES

    Chen, I-Min A.; Markowitz, Victor M.; Chu, Ken; ...

    2016-10-13

    The Integrated Microbial Genomes with Microbiome Samples (IMG/M: https://img.jgi.doe.gov/m/) system contains annotated DNA and RNA sequence data of (i) archaeal, bacterial, eukaryotic and viral genomes from cultured organisms, (ii) single cell genomes (SCG) and genomes from metagenomes (GFM) from uncultured archaea, bacteria and viruses and (iii) metagenomes from environmental, host associated and engineered microbiome samples. Sequence data are generated by DOE's Joint Genome Institute (JGI), submitted by individual scientists, or collected from public sequence data archives. Structural and functional annotation is carried out by JGI's genome and metagenome annotation pipelines. A variety of analytical and visualization tools provide support for examining and comparing IMG/M's datasets. IMG/M allows open access interactive analysis of publicly available datasets, while manual curation, submission and access to private datasets and computationally intensive workspace-based analysis require login/password access to its expert review (ER) companion system (IMG/M ER: https://img.jgi.doe.gov/mer/). Since the last report published in the 2014 NAR Database Issue, IMG/M's dataset content has tripled in terms of number of datasets and overall protein coding genes, while its analysis tools have been extended to cope with the rapid growth in the number and size of datasets handled by the system.

  20. IMG/M: integrated genome and metagenome comparative data analysis system

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Chen, I-Min A.; Markowitz, Victor M.; Chu, Ken

    The Integrated Microbial Genomes with Microbiome Samples (IMG/M: https://img.jgi.doe.gov/m/) system contains annotated DNA and RNA sequence data of (i) archaeal, bacterial, eukaryotic and viral genomes from cultured organisms, (ii) single cell genomes (SCG) and genomes from metagenomes (GFM) from uncultured archaea, bacteria and viruses and (iii) metagenomes from environmental, host associated and engineered microbiome samples. Sequence data are generated by DOE's Joint Genome Institute (JGI), submitted by individual scientists, or collected from public sequence data archives. Structural and functional annotation is carried out by JGI's genome and metagenome annotation pipelines. A variety of analytical and visualization tools provide support for examining and comparing IMG/M's datasets. IMG/M allows open access interactive analysis of publicly available datasets, while manual curation, submission and access to private datasets and computationally intensive workspace-based analysis require login/password access to its expert review (ER) companion system (IMG/M ER: https://img.jgi.doe.gov/mer/). Since the last report published in the 2014 NAR Database Issue, IMG/M's dataset content has tripled in terms of number of datasets and overall protein coding genes, while its analysis tools have been extended to cope with the rapid growth in the number and size of datasets handled by the system.

  1. IMG/M: integrated genome and metagenome comparative data analysis system

    PubMed Central

    Chen, I-Min A.; Markowitz, Victor M.; Chu, Ken; Palaniappan, Krishna; Szeto, Ernest; Pillay, Manoj; Ratner, Anna; Huang, Jinghua; Andersen, Evan; Huntemann, Marcel; Varghese, Neha; Hadjithomas, Michalis; Tennessen, Kristin; Nielsen, Torben; Ivanova, Natalia N.; Kyrpides, Nikos C.

    2017-01-01

    The Integrated Microbial Genomes with Microbiome Samples (IMG/M: https://img.jgi.doe.gov/m/) system contains annotated DNA and RNA sequence data of (i) archaeal, bacterial, eukaryotic and viral genomes from cultured organisms, (ii) single cell genomes (SCG) and genomes from metagenomes (GFM) from uncultured archaea, bacteria and viruses and (iii) metagenomes from environmental, host associated and engineered microbiome samples. Sequence data are generated by DOE's Joint Genome Institute (JGI), submitted by individual scientists, or collected from public sequence data archives. Structural and functional annotation is carried out by JGI's genome and metagenome annotation pipelines. A variety of analytical and visualization tools provide support for examining and comparing IMG/M's datasets. IMG/M allows open access interactive analysis of publicly available datasets, while manual curation, submission and access to private datasets and computationally intensive workspace-based analysis require login/password access to its expert review (ER) companion system (IMG/M ER: https://img.jgi.doe.gov/mer/). Since the last report published in the 2014 NAR Database Issue, IMG/M's dataset content has tripled in terms of number of datasets and overall protein coding genes, while its analysis tools have been extended to cope with the rapid growth in the number and size of datasets handled by the system. PMID:27738135

  2. Ecology of Alpine Macrofungi - Combining Historical with Recent Data

    PubMed Central

    Brunner, Ivano; Frey, Beat; Hartmann, Martin; Zimmermann, Stephan; Graf, Frank; Suz, Laura M.; Niskanen, Tuula; Bidartondo, Martin I.; Senn-Irlet, Beatrice

    2017-01-01

    Historical datasets of living communities are important because they can be used to document creeping shifts in species compositions. Such a historical data set exists for alpine fungi. From 1941 to 1953, the Swiss geologist Jules Favre visited yearly the region of the Swiss National Park and recorded the occurring fruiting bodies of fungi >1 mm (so-called “macrofungi”) in the alpine zone. Favre can be regarded as one of the pioneers of alpine fungal ecology not least because he noted location, elevation, geology, and associated plants during his numerous excursions. However, some relevant information is only available in his unpublished field-book. Overall, Favre listed 204 fungal species in 26 sampling sites, with 46 species being previously unknown. The analysis of his data revealed that the macrofungi recorded belong to two major ecological groups, either they are symbiotrophs and live in ectomycorrhizal associations with alpine plant hosts, or they are saprotrophs and decompose plant litter and soil organic matter. The most frequent fungi were members of Inocybe and Cortinarius, which form ectomycorrhizas with Dryas octopetala or the dwarf alpine Salix species. The scope of the present study was to combine Favre's historical dataset with more recent data, either with the “SwissFungi” database or with data from major studies of the French and German Alps, and with the data from novel high-throughput DNA sequencing techniques of soils from the Swiss Alps. Results of the latter application revealed, that problems associated with these new techniques are manifold and species determination remains often unclear. At this point, the fungal taxa collected by Favre and deposited as exsiccata at the “Conservatoire et Jardin Botaniques de la Ville de Genève” could be used as a reference sequence dataset for alpine fungal studies. 
In conclusion, it can be postulated that new improved databases are urgently necessary for the near future, particularly, with regard to investigating fungal communities from alpine regions using new techniques. PMID:29123508

  3. Ecology of Alpine Macrofungi - Combining Historical with Recent Data.

    PubMed

    Brunner, Ivano; Frey, Beat; Hartmann, Martin; Zimmermann, Stephan; Graf, Frank; Suz, Laura M; Niskanen, Tuula; Bidartondo, Martin I; Senn-Irlet, Beatrice

    2017-01-01

    Historical datasets of living communities are important because they can be used to document creeping shifts in species compositions. Such a historical data set exists for alpine fungi. From 1941 to 1953, the Swiss geologist Jules Favre visited yearly the region of the Swiss National Park and recorded the occurring fruiting bodies of fungi >1 mm (so-called "macrofungi") in the alpine zone. Favre can be regarded as one of the pioneers of alpine fungal ecology not least because he noted location, elevation, geology, and associated plants during his numerous excursions. However, some relevant information is only available in his unpublished field-book. Overall, Favre listed 204 fungal species in 26 sampling sites, with 46 species being previously unknown. The analysis of his data revealed that the macrofungi recorded belong to two major ecological groups, either they are symbiotrophs and live in ectomycorrhizal associations with alpine plant hosts, or they are saprotrophs and decompose plant litter and soil organic matter. The most frequent fungi were members of Inocybe and Cortinarius, which form ectomycorrhizas with Dryas octopetala or the dwarf alpine Salix species. The scope of the present study was to combine Favre's historical dataset with more recent data, either with the "SwissFungi" database or with data from major studies of the French and German Alps, and with the data from novel high-throughput DNA sequencing techniques of soils from the Swiss Alps. Results of the latter application revealed that problems associated with these new techniques are manifold and species determination often remains unclear. At this point, the fungal taxa collected by Favre and deposited as exsiccata at the "Conservatoire et Jardin Botaniques de la Ville de Genève" could be used as a reference sequence dataset for alpine fungal studies. 
In conclusion, it can be postulated that new improved databases are urgently necessary for the near future, particularly, with regard to investigating fungal communities from alpine regions using new techniques.

  4. Anomalous Diffusion Measured by a Twice-Refocused Spin Echo Pulse Sequence: Analysis Using Fractional Order Calculus

    PubMed Central

    2011-01-01

    Purpose To theoretically develop and experimentally validate a formalism based on a fractional order calculus (FC) diffusion model to characterize anomalous diffusion in brain tissues measured with a twice-refocused spin-echo (TRSE) pulse sequence. Materials and Methods The FC diffusion model is the fractional order generalization of the Bloch-Torrey equation. Using this model, an analytical expression was derived to describe the diffusion-induced signal attenuation in a TRSE pulse sequence. To experimentally validate this expression, a set of diffusion-weighted (DW) images was acquired at 3 Tesla from healthy human brains using a TRSE sequence with twelve b-values ranging from 0 to 2,600 s/mm². For comparison, DW images were also acquired using a Stejskal-Tanner diffusion gradient in a single-shot spin-echo echo planar sequence. For both datasets, a Levenberg-Marquardt fitting algorithm was used to extract three parameters: diffusion coefficient D, fractional order derivative in space β, and a spatial parameter μ (in units of μm). Using adjusted R-squared values and standard deviations, D, β and μ values and the goodness-of-fit in three specific regions of interest (ROI) in white matter, gray matter, and cerebrospinal fluid were evaluated for each of the two datasets. In addition, spatially resolved parametric maps were assessed qualitatively. Results The analytical expression for the TRSE sequence, derived from the FC diffusion model, accurately characterized the diffusion-induced signal loss in brain tissues at high b-values. In the selected ROIs, the goodness-of-fit and standard deviations for the TRSE dataset were comparable with the results obtained from the Stejskal-Tanner dataset, demonstrating the robustness of the FC model across multiple data acquisition strategies. Qualitatively, the D, β, and μ maps from the TRSE dataset exhibited fewer artifacts, reflecting the improved immunity to eddy currents. 
Conclusion The diffusion-induced signal attenuation in a TRSE pulse sequence can be described by an FC diffusion model at high b-values. This model performs equally well for data acquired from the human brain tissues with a TRSE pulse sequence or a conventional Stejskal-Tanner sequence. PMID:21509877

  5. Anomalous diffusion measured by a twice-refocused spin echo pulse sequence: analysis using fractional order calculus.

    PubMed

    Gao, Qing; Srinivasan, Girish; Magin, Richard L; Zhou, Xiaohong Joe

    2011-05-01

    To theoretically develop and experimentally validate a formalism based on a fractional order calculus (FC) diffusion model to characterize anomalous diffusion in brain tissues measured with a twice-refocused spin-echo (TRSE) pulse sequence. The FC diffusion model is the fractional order generalization of the Bloch-Torrey equation. Using this model, an analytical expression was derived to describe the diffusion-induced signal attenuation in a TRSE pulse sequence. To experimentally validate this expression, a set of diffusion-weighted (DW) images was acquired at 3 Tesla from healthy human brains using a TRSE sequence with twelve b-values ranging from 0 to 2600 s/mm². For comparison, DW images were also acquired using a Stejskal-Tanner diffusion gradient in a single-shot spin-echo echo planar sequence. For both datasets, a Levenberg-Marquardt fitting algorithm was used to extract three parameters: diffusion coefficient D, fractional order derivative in space β, and a spatial parameter μ (in units of μm). Using adjusted R-squared values and standard deviations, D, β, and μ values and the goodness-of-fit in three specific regions of interest (ROIs) in white matter, gray matter, and cerebrospinal fluid, respectively, were evaluated for each of the two datasets. In addition, spatially resolved parametric maps were assessed qualitatively. The analytical expression for the TRSE sequence, derived from the FC diffusion model, accurately characterized the diffusion-induced signal loss in brain tissues at high b-values. In the selected ROIs, the goodness-of-fit and standard deviations for the TRSE dataset were comparable with the results obtained from the Stejskal-Tanner dataset, demonstrating the robustness of the FC model across multiple data acquisition strategies. Qualitatively, the D, β, and μ maps from the TRSE dataset exhibited fewer artifacts, reflecting the improved immunity to eddy currents. The diffusion-induced signal attenuation in a TRSE pulse sequence can be described by an FC diffusion model at high b-values. This model performs equally well for data acquired from human brain tissues with a TRSE pulse sequence or a conventional Stejskal-Tanner sequence. Copyright © 2011 Wiley-Liss, Inc.

  6. Mitochondrial genomes reveal the extinct Hippidion as an outgroup to all living equids.

    PubMed

    Der Sarkissian, Clio; Vilstrup, Julia T; Schubert, Mikkel; Seguin-Orlando, Andaine; Eme, David; Weinstock, Jacobo; Alberdi, Maria Teresa; Martin, Fabiana; Lopez, Patricio M; Prado, Jose L; Prieto, Alfredo; Douady, Christophe J; Stafford, Tom W; Willerslev, Eske; Orlando, Ludovic

    2015-03-01

    Hippidions were equids with very distinctive anatomical features. They lived in South America from 2.5 million years ago (Ma) until their extinction approximately 10 000 years ago. The evolutionary origin of the three known Hippidion morphospecies is still disputed. Based on palaeontological data, Hippidion could have diverged from the lineage leading to modern equids before 10 Ma. In contrast, a much later divergence date, with Hippidion nesting within modern equids, was indicated by partial ancient mitochondrial DNA sequences. Here, we characterized eight Hippidion complete mitochondrial genomes at 3.4-386.3-fold coverage using target-enrichment capture and next-generation sequencing. Our dataset reveals that the two morphospecies sequenced (H. saldiasi and H. principale) formed a monophyletic clade, basal to extant and extinct Equus lineages. This contrasts with previous genetic analyses and supports Hippidion as a distinct genus, in agreement with palaeontological models. We date the Hippidion split from Equus at 5.6-6.5 Ma, suggesting an early divergence in North America prior to the colonization of South America, after the formation of the Panamanian Isthmus 3.5 Ma and the Great American Biotic Interchange. © 2015 The Author(s) Published by the Royal Society. All rights reserved.

  7. Mitochondrial genomes reveal the extinct Hippidion as an outgroup to all living equids

    PubMed Central

    Der Sarkissian, Clio; Vilstrup, Julia T.; Schubert, Mikkel; Seguin-Orlando, Andaine; Eme, David; Weinstock, Jacobo; Alberdi, Maria Teresa; Martin, Fabiana; Lopez, Patricio M.; Prado, Jose L.; Prieto, Alfredo; Douady, Christophe J.; Stafford, Tom W.; Willerslev, Eske; Orlando, Ludovic

    2015-01-01

    Hippidions were equids with very distinctive anatomical features. They lived in South America from 2.5 million years ago (Ma) until their extinction approximately 10 000 years ago. The evolutionary origin of the three known Hippidion morphospecies is still disputed. Based on palaeontological data, Hippidion could have diverged from the lineage leading to modern equids before 10 Ma. In contrast, a much later divergence date, with Hippidion nesting within modern equids, was indicated by partial ancient mitochondrial DNA sequences. Here, we characterized eight Hippidion complete mitochondrial genomes at 3.4–386.3-fold coverage using target-enrichment capture and next-generation sequencing. Our dataset reveals that the two morphospecies sequenced (H. saldiasi and H. principale) formed a monophyletic clade, basal to extant and extinct Equus lineages. This contrasts with previous genetic analyses and supports Hippidion as a distinct genus, in agreement with palaeontological models. We date the Hippidion split from Equus at 5.6–6.5 Ma, suggesting an early divergence in North America prior to the colonization of South America, after the formation of the Panamanian Isthmus 3.5 Ma and the Great American Biotic Interchange. PMID:25762573

  8. Genome-wide characterization of centromeric satellites from multiple mammalian genomes.

    PubMed

    Alkan, Can; Cardone, Maria Francesca; Catacchio, Claudia Rita; Antonacci, Francesca; O'Brien, Stephen J; Ryder, Oliver A; Purgato, Stefania; Zoli, Monica; Della Valle, Giuliano; Eichler, Evan E; Ventura, Mario

    2011-01-01

    Despite its importance in cell biology and evolution, the centromere has remained the final frontier in genome assembly and annotation due to its complex repeat structure. However, isolation and characterization of the centromeric repeats from newly sequenced species are necessary for a complete understanding of genome evolution and function. In recent years, various genomes have been sequenced, but the characterization of the corresponding centromeric DNA has lagged behind. Here, we present a computational method (RepeatNet) to systematically identify higher-order repeat structures from unassembled whole-genome shotgun sequence and test whether these sequence elements correspond to functional centromeric sequences. We analyzed genome datasets from six species of mammals representing the diversity of the mammalian lineage, namely, horse, dog, elephant, armadillo, opossum, and platypus. We define candidate monomer satellite repeats and demonstrate centromeric localization for five of the six genomes. Our analysis revealed the greatest diversity of centromeric sequences in horse and dog, in contrast to elephant and armadillo, which showed high centromeric sequence homogeneity. We could not isolate centromeric sequences within the platypus genome, suggesting that centromeres in platypus are not enriched in satellite DNA. Our method can be applied to the characterization of thousands of other vertebrate genomes anticipated for sequencing in the near future, providing an important tool for annotation of centromeres.

  9. TranslatomeDB: a comprehensive database and cloud-based analysis platform for translatome sequencing data

    PubMed Central

    Liu, Wanting; Xiang, Lunping; Zheng, Tingkai; Jin, Jingjie

    2018-01-01

    Translation is a key regulatory step, linking transcriptome and proteome. Two major methods of translatome investigation are RNC-seq (sequencing of translating mRNA) and Ribo-seq (ribosome profiling). To facilitate the investigation of translation, we built a comprehensive database, TranslatomeDB (http://www.translatomedb.net/), which provides collection and integrated analysis of published and user-generated translatome sequencing data. The current version includes 2453 Ribo-seq, 10 RNC-seq and their 1394 corresponding mRNA-seq datasets in 13 species. The database emphasizes the analysis functions in addition to the dataset collections. Differential gene expression (DGE) analysis can be performed between any two datasets of the same species and type, on both transcriptome and translatome levels. The translation indices (translation ratios, elongation velocity index and translational efficiency) can be calculated to quantitatively evaluate translational initiation efficiency and elongation velocity. All datasets were analyzed using a unified, robust, accurate and experimentally verifiable pipeline based on the FANSe3 mapping algorithm and edgeR for DGE analyses. TranslatomeDB also allows users to upload their own datasets and use the identical unified pipeline to analyze their data. We believe that TranslatomeDB is a comprehensive platform and knowledgebase for translatome and proteome research, freeing biologists from complex searching, analysis and comparison of huge sequencing datasets without needing local computational power. PMID:29106630
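
    As a rough illustration of the kind of translation index mentioned above, the sketch below computes a log2 ratio of translatome abundance to mRNA abundance per gene. The abstract does not give TranslatomeDB's exact definitions, so the formula, the pseudocount, the function name and the toy abundance values are all assumptions made for illustration only.

```python
import math


def log2_translation_ratio(translatome_abundance, mrna_abundance, pseudocount=1.0):
    """Hypothetical index: log2 of translatome over mRNA abundance.

    The pseudocount (an assumption) avoids division by zero for
    genes with no detected mRNA reads.
    """
    return math.log2(
        (translatome_abundance + pseudocount) / (mrna_abundance + pseudocount)
    )


# Toy (translatome, mRNA) abundances; not real data.
toy_genes = {
    "geneA": (7.0, 3.0),  # relatively more translated than transcribed
    "geneB": (5.0, 5.0),  # balanced
    "geneC": (3.0, 7.0),  # relatively less translated
}

ratios = {
    gene: log2_translation_ratio(translatome, mrna)
    for gene, (translatome, mrna) in toy_genes.items()
}

for gene, value in ratios.items():
    print(gene, value)
```

    Working on a log2 scale makes an excess and a deficit of translation symmetric around zero, which is a common convention for ratio-type indices.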

  10. Using nearly full-genome HIV sequence data improves phylogeny reconstruction in a simulated epidemic

    PubMed Central

    Yebra, Gonzalo; Hodcroft, Emma B.; Ragonnet-Cronin, Manon L.; Pillay, Deenan; Brown, Andrew J. Leigh; Fraser, Christophe; Kellam, Paul; de Oliveira, Tulio; Dennis, Ann; Hoppe, Anne; Kityo, Cissy; Frampton, Dan; Ssemwanga, Deogratius; Tanser, Frank; Keshani, Jagoda; Lingappa, Jairam; Herbeck, Joshua; Wawer, Maria; Essex, Max; Cohen, Myron S.; Paton, Nicholas; Ratmann, Oliver; Kaleebu, Pontiano; Hayes, Richard; Fidler, Sarah; Quinn, Thomas; Novitsky, Vladimir; Haywards, Andrew; Nastouli, Eleni; Morris, Steven; Clark, Duncan; Kozlakidis, Zisis

    2016-01-01

    HIV molecular epidemiology studies analyse viral pol gene sequences due to their availability, but whole genome sequencing allows the use of other genes. We aimed to determine what gene(s) provide(s) the best approximation to the real phylogeny by analysing a simulated epidemic (created as part of the PANGEA_HIV project) with a known transmission tree. We sub-sampled a simulated dataset of 4662 sequences into different combinations of genes (gag-pol-env, gag-pol, gag, pol, env and partial pol) and sampling depths (100%, 60%, 20% and 5%), generating 100 replicates for each case. We built maximum-likelihood trees for each combination using RAxML (GTR + Γ), and compared their topologies to the corresponding true tree’s using CompareTree. The accuracy of the trees was significantly proportional to the length of the sequences used, with the gag-pol-env datasets showing the best performance and gag and partial pol sequences showing the worst. The lowest sampling depths (20% and 5%) greatly reduced the accuracy of tree reconstruction and showed high variability among replicates, especially when using the shortest gene datasets. In conclusion, using longer sequences derived from nearly whole genomes will improve the reliability of phylogenetic reconstruction. With low sample coverage, results can be highly variable, particularly when based on short sequences. PMID:28008945
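
    The sub-sampling design described above (several sampling depths, many replicates per depth) can be sketched as follows. This is only a minimal illustration under stated assumptions: the sequence identifiers are placeholders, sampling is uniform without replacement, and the function name and seed are hypothetical. The abstract does not say how the authors drew their replicates.

```python
import random


def subsample_replicates(sequence_ids, depths=(1.0, 0.6, 0.2, 0.05),
                         n_replicates=100, seed=0):
    """Return {depth: [replicate, ...]}; each replicate is a sorted list of ids.

    Uniform sampling without replacement is an assumption of this sketch.
    """
    rng = random.Random(seed)  # fixed seed so the replicates are reproducible
    replicates = {}
    for depth in depths:
        sample_size = max(1, round(len(sequence_ids) * depth))
        replicates[depth] = [
            sorted(rng.sample(sequence_ids, sample_size))
            for _ in range(n_replicates)
        ]
    return replicates


# Placeholder identifiers standing in for the 4662 simulated sequences.
ids = [f"seq{i:04d}" for i in range(4662)]
replicates = subsample_replicates(ids, n_replicates=3)

for depth, reps in replicates.items():
    print(depth, len(reps), len(reps[0]))
```

    Each replicate would then be restricted to the chosen gene region before tree building, a step this sketch leaves out.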

  11. Using nearly full-genome HIV sequence data improves phylogeny reconstruction in a simulated epidemic.

    PubMed

    Yebra, Gonzalo; Hodcroft, Emma B; Ragonnet-Cronin, Manon L; Pillay, Deenan; Brown, Andrew J Leigh

    2016-12-23

    HIV molecular epidemiology studies analyse viral pol gene sequences due to their availability, but whole genome sequencing allows the use of other genes. We aimed to determine what gene(s) provide(s) the best approximation to the real phylogeny by analysing a simulated epidemic (created as part of the PANGEA_HIV project) with a known transmission tree. We sub-sampled a simulated dataset of 4662 sequences into different combinations of genes (gag-pol-env, gag-pol, gag, pol, env and partial pol) and sampling depths (100%, 60%, 20% and 5%), generating 100 replicates for each case. We built maximum-likelihood trees for each combination using RAxML (GTR + Γ), and compared their topologies to the corresponding true tree's using CompareTree. The accuracy of the trees was significantly proportional to the length of the sequences used, with the gag-pol-env datasets showing the best performance and gag and partial pol sequences showing the worst. The lowest sampling depths (20% and 5%) greatly reduced the accuracy of tree reconstruction and showed high variability among replicates, especially when using the shortest gene datasets. In conclusion, using longer sequences derived from nearly whole genomes will improve the reliability of phylogenetic reconstruction. With low sample coverage, results can be highly variable, particularly when based on short sequences.

  12. A phylogenetic overview of the antrodia clade (Basidiomycota, Polyporales)

    Treesearch

    Beatriz Ortiz-Santana; Daniel L. Lindner; Otto Miettinen; Alfredo Justo; David S. Hibbett

    2013-01-01

    Phylogenetic relationships among members of the antrodia clade were investigated with molecular data from two nuclear ribosomal DNA regions, LSU and ITS. A total of 123 species representing 26 genera producing a brown rot were included in the present study. Three DNA datasets (combined LSU-ITS dataset, LSU dataset, ITS dataset) comprising sequences of 449 isolates were...

  13. Assessment of phylogenetic sensitivity for reconstructing HIV-1 epidemiological relationships.

    PubMed

    Beloukas, Apostolos; Magiorkinis, Emmanouil; Magiorkinis, Gkikas; Zavitsanou, Asimina; Karamitros, Timokratis; Hatzakis, Angelos; Paraskevis, Dimitrios

    2012-06-01

    Phylogenetic analysis has been extensively used as a tool for the reconstruction of epidemiological relations for research or for forensic purposes. It was our objective to assess the sensitivity of different phylogenetic methods and various phylogenetic programs to reconstruct epidemiological links among HIV-1 infected patients, that is, the probability of revealing a true transmission relationship. Multiple datasets (90) were prepared consisting of HIV-1 sequences in protease (PR) and partial reverse transcriptase (RT) sampled from patients with a documented epidemiological relationship (target population), and from unrelated individuals (control population) belonging to the same HIV-1 subtype as the target population. Each dataset varied regarding the number, the geographic origin and the transmission risk groups of the sequences among the control population. Phylogenetic trees were inferred by neighbor-joining (NJ), maximum likelihood heuristics (hML) and Bayesian methods. All clusters of sequences belonging to the target population were correctly reconstructed by NJ and Bayesian methods receiving high bootstrap and posterior probability (PP) support, respectively. On the other hand, TreePuzzle failed to reconstruct or provide significant support for several clusters; high puzzling step support was associated with the inclusion of control sequences from the same geographic area as the target population. In contrast, all clusters were correctly reconstructed by hML as implemented in PhyML 3.0 receiving high bootstrap support. We report that under the conditions of our study, hML using PhyML, NJ and Bayesian methods were the most sensitive for the reconstruction of epidemiological links mostly from sexually infected individuals. Copyright © 2012 Elsevier B.V. All rights reserved.

  14. BayesMotif: de novo protein sorting motif discovery from impure datasets.

    PubMed

    Hu, Jianjun; Zhang, Fan

    2010-01-18

    Protein sorting is the process by which newly synthesized proteins are transported to their target locations within or outside of the cell. This process is precisely regulated by protein sorting signals in different forms. A major category of sorting signals are amino acid sub-sequences usually located at the N-terminals or C-terminals of protein sequences. Genome-wide experimental identification of protein sorting signals is extremely time-consuming and costly. Effective computational algorithms for de novo discovery of protein sorting signals are needed to improve the understanding of protein sorting mechanisms. We formulated the protein sorting motif discovery problem as a classification problem and proposed a Bayesian classifier based algorithm (BayesMotif) for de novo identification of a common type of protein sorting motif in which a highly conserved anchor is present along with a less conserved motif region. A false positive removal procedure is developed to iteratively remove sequences that are unlikely to contain true motifs so that the algorithm can identify motifs from impure input sequences. Experiments on both implanted motif datasets and real-world datasets showed that the enhanced BayesMotif algorithm can identify anchored sorting motifs from pure or impure protein sequence datasets. They also showed that the false positive removal procedure can help to identify true motifs even when only 20% of the input sequences contain true motif instances. We proposed BayesMotif, a novel Bayesian classification based algorithm for de novo discovery of a special category of anchored protein sorting motifs from impure datasets. Compared to conventional motif discovery algorithms such as MEME, our algorithm can find less-conserved motifs with short highly conserved anchors. Our algorithm also has the advantage of easy incorporation of additional meta-sequence features such as hydrophobicity or charge of the motifs, which may help to overcome the limitations of the PWM (position weight matrix) motif model.

  15. Pooled assembly of marine metagenomic datasets: enriching annotation through chimerism.

    PubMed

    Magasin, Jonathan D; Gerloff, Dietlind L

    2015-02-01

    Despite advances in high-throughput sequencing, marine metagenomic samples remain largely opaque. A typical sample contains billions of microbial organisms from thousands of genomes and quadrillions of DNA base pairs. Its derived metagenomic dataset underrepresents this complexity by orders of magnitude because of the sparseness and shortness of sequencing reads. Read shortness and sequencing errors pose a major challenge to accurate species and functional annotation. This includes distinguishing known from novel species. Often the majority of reads cannot be annotated and thus cannot help our interpretation of the sample. Here, we demonstrate quantitatively how careful assembly of marine metagenomic reads within, but also across, datasets can alleviate this problem. For 10 simulated datasets, each with species complexity modeled on a real counterpart, chimerism remained within the same species for most contigs (97%). For 42 real pyrosequencing ('454') datasets, assembly increased the proportion of annotated reads, and even more so when datasets were pooled, by on average 1.6% (max 6.6%) for species, 9.0% (max 28.7%) for Pfam protein domains and 9.4% (max 22.9%) for PANTHER gene families. Our results outline exciting prospects for data sharing in the metagenomics community. While chimeric sequences should be avoided in other areas of metagenomics (e.g. biodiversity analyses), conservative pooled assembly is advantageous for annotation specificity and sensitivity. Intriguingly, our experiment also found potential prospects for (low-cost) discovery of new species in 'old' data. Contact: dgerloff@ffame.org. Supplementary data are available at Bioinformatics online. © The Author 2014. Published by Oxford University Press. All rights reserved.

  16. The Sorcerer II Global Ocean Sampling expedition: expanding the universe of protein families.

    PubMed

    Yooseph, Shibu; Sutton, Granger; Rusch, Douglas B; Halpern, Aaron L; Williamson, Shannon J; Remington, Karin; Eisen, Jonathan A; Heidelberg, Karla B; Manning, Gerard; Li, Weizhong; Jaroszewski, Lukasz; Cieplak, Piotr; Miller, Christopher S; Li, Huiying; Mashiyama, Susan T; Joachimiak, Marcin P; van Belle, Christopher; Chandonia, John-Marc; Soergel, David A; Zhai, Yufeng; Natarajan, Kannan; Lee, Shaun; Raphael, Benjamin J; Bafna, Vineet; Friedman, Robert; Brenner, Steven E; Godzik, Adam; Eisenberg, David; Dixon, Jack E; Taylor, Susan S; Strausberg, Robert L; Frazier, Marvin; Venter, J Craig

    2007-03-01

    Metagenomics projects based on shotgun sequencing of populations of micro-organisms yield insight into protein families. We used sequence similarity clustering to explore proteins with a comprehensive dataset consisting of sequences from available databases together with 6.12 million proteins predicted from an assembly of 7.7 million Global Ocean Sampling (GOS) sequences. The GOS dataset covers nearly all known prokaryotic protein families. A total of 3,995 medium- and large-sized clusters consisting of only GOS sequences are identified, out of which 1,700 have no detectable homology to known families. The GOS-only clusters contain a higher than expected proportion of sequences of viral origin, thus reflecting a poor sampling of viral diversity until now. Protein domain distributions in the GOS dataset and current protein databases show distinct biases. Several protein domains that were previously categorized as kingdom specific are shown to have GOS examples in other kingdoms. About 6,000 sequences (ORFans) from the literature that heretofore lacked similarity to known proteins have matches in the GOS data. The GOS dataset is also used to improve remote homology detection. Overall, besides nearly doubling the number of current proteins, the predicted GOS proteins also add a great deal of diversity to known protein families and shed light on their evolution. These observations are illustrated using several protein families, including phosphatases, proteases, ultraviolet-irradiation DNA damage repair enzymes, glutamine synthetase, and RuBisCO. The diversity added by GOS data has implications for choosing targets for experimental structure characterization as part of structural genomics efforts. Our analysis indicates that new families are being discovered at a rate that is linear or almost linear with the addition of new sequences, implying that we are still far from discovering all protein families in nature.

  17. Multiview human activity recognition system based on spatiotemporal template for video surveillance system

    NASA Astrophysics Data System (ADS)

    Kushwaha, Alok Kumar Singh; Srivastava, Rajeev

    2015-09-01

    An efficient view-invariant framework for the recognition of human activities from an input video sequence is presented. The proposed framework is composed of three consecutive modules: (i) detection and localization of people by background subtraction, (ii) creation of view-invariant spatiotemporal templates for different activities, and (iii) template matching for view-invariant activity recognition. The foreground objects present in a scene are extracted using change detection and background modeling. The view-invariant templates are constructed using the motion history images and object shape information for different human activities in a video sequence. For matching the spatiotemporal templates for various activities, the moment invariants and Mahalanobis distance are used. The proposed approach is tested successfully on our own viewpoint dataset, KTH action recognition dataset, i3DPost multiview dataset, MSR viewpoint action dataset, VideoWeb multiview dataset, and WVU multiview human action recognition dataset. From the experimental results and analysis over the chosen datasets, it is observed that the proposed framework is robust, flexible, and efficient with respect to multiview activity recognition, scale, and phase variations.
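
    The matching step named above, nearest template under the Mahalanobis distance, can be sketched in a few lines. This is an illustrative sketch rather than the authors' implementation: the feature vectors are reduced to two dimensions, and the template means, covariances and activity labels are invented placeholders standing in for moment-invariant features.

```python
import math


def mahalanobis_2d(x, mean, cov):
    """Mahalanobis distance of 2-D point x from a class with given mean and covariance."""
    (a, b), (c, d) = cov
    det = a * d - b * c
    if det == 0:
        raise ValueError("covariance matrix is singular")
    # Explicit inverse of the 2x2 covariance matrix.
    inv = ((d / det, -b / det), (-c / det, a / det))
    dx = (x[0] - mean[0], x[1] - mean[1])
    quad = (dx[0] * (inv[0][0] * dx[0] + inv[0][1] * dx[1])
            + dx[1] * (inv[1][0] * dx[0] + inv[1][1] * dx[1]))
    return math.sqrt(quad)


# Hypothetical per-activity templates (mean feature vector and covariance).
TEMPLATES = {
    "walking": {"mean": (0.0, 0.0), "cov": ((1.0, 0.0), (0.0, 1.0))},
    "waving": {"mean": (4.0, 4.0), "cov": ((4.0, 0.0), (0.0, 4.0))},
}


def classify(features):
    """Return the activity whose template is closest in Mahalanobis distance."""
    return min(
        TEMPLATES,
        key=lambda name: mahalanobis_2d(
            features, TEMPLATES[name]["mean"], TEMPLATES[name]["cov"]
        ),
    )


print(classify((0.5, 0.5)))
print(classify((3.0, 3.0)))
```

    Unlike the Euclidean distance, this measure scales each direction by the template's spread, so a point can be assigned to a broad class even when a tighter class has a nearer mean.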

  18. Comparative immunogenomics of molluscs.

    PubMed

    Schultz, Jonathan H; Adema, Coen M

    2017-10-01

    Comparative immunology, studying both vertebrates and invertebrates, provided the earliest descriptions of phagocytosis as a general immune mechanism. However, the large scale of animal diversity challenges all-inclusive investigations and the field of immunology has developed by mostly emphasizing study of a few vertebrate species. In addressing the lack of comprehensive understanding of animal immunity, especially that of invertebrates, comparative immunology helps toward management of invertebrates that are food sources, agricultural pests, pathogens, or transmit diseases, and helps interpret the evolution of animal immunity. Initial studies showed that the Mollusca (second largest animal phylum), and invertebrates in general, possess innate defenses but lack the lymphocytic immune system that characterizes vertebrate immunology. Recognizing the reality of both common and taxon-specific immune features, and applying up-to-date cell and molecular research capabilities, in-depth studies of a select number of bivalve and gastropod species continue to reveal novel aspects of molluscan immunity. The genomics era heralded a new stage of comparative immunology; large-scale efforts yielded an initial set of full molluscan genome sequences that is available for analyses of full complements of immune genes and regulatory sequences. Next-generation sequencing (NGS), due to lower cost and effort required, allows individual researchers to generate large sequence datasets for growing numbers of molluscs. RNAseq provides expression profiles that enable discovery of immune genes and genome sequences reveal distribution and diversity of immune factors across molluscan phylogeny. Although computational de novo sequence assembly will benefit from continued development and automated annotation may require some experimental validation, NGS is a powerful tool for comparative immunology, especially increasing coverage of the extensive molluscan diversity. To date, immunogenomics revealed new levels of complexity of molluscan defense by indicating sequence heterogeneity in individual snails and bivalves, and members of expanded immune gene families are expressed differentially to generate pathogen-specific defense responses. Copyright © 2017 Elsevier Ltd. All rights reserved.

  19. SEXCMD: Development and validation of sex marker sequences for whole-exome/genome and RNA sequencing.

    PubMed

    Jeong, Seongmun; Kim, Jiwoong; Park, Won; Jeon, Hongmin; Kim, Namshin

    2017-01-01

    Over the last decade, a large number of nucleotide sequences have been generated by next-generation sequencing technologies and deposited to public databases. However, most of these datasets do not specify the sex of individuals sampled because researchers typically ignore or hide this information. Male and female genomes in many species have distinctive sex chromosomes, XX/XY and ZW/ZZ, and expression levels of many sex-related genes differ between the sexes. Herein, we describe how to develop sex marker sequences from syntenic regions of sex chromosomes and use them to quickly identify the sex of individuals being analyzed. Array-based technologies routinely use either known sex markers or the B-allele frequency of X or Z chromosomes to deduce the sex of an individual. The same strategy has been used with whole-exome/genome sequence data; however, all reads must be aligned onto a reference genome to determine the B-allele frequency of the X or Z chromosomes. SEXCMD is a pipeline that can extract sex marker sequences from reference sex chromosomes and rapidly identify the sex of individuals from whole-exome/genome and RNA sequencing after training with a known dataset through a simple machine learning approach. The pipeline counts total numbers of hits from sex-specific marker sequences and identifies the sex of the individuals sampled based on the fact that XX/ZZ samples do not have Y or W chromosome hits. We have successfully validated our pipeline with mammalian (Homo sapiens; XY) and avian (Gallus gallus; ZW) genomes. Typical calculation time when applying SEXCMD to human whole-exome or RNA sequencing datasets is a few minutes, and analyzing human whole-genome datasets takes about 10 minutes. Another important application of SEXCMD is as a quality control measure to avoid mixing samples before bioinformatics analysis. SEXCMD comprises simple Python and R scripts and is freely available at https://github.com/lovemun/SEXCMD.
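
    The decision rule described above (count hits to sex-specific markers, then rely on XX samples lacking Y hits) can be illustrated with the toy sketch below. It is not the SEXCMD pipeline: exact substring matching stands in for read alignment, the fixed `min_y_fraction` threshold stands in for the training step on a known dataset, and the marker and read sequences are invented for the example.

```python
def count_marker_hits(reads, markers):
    """Count reads containing at least one marker as an exact substring."""
    return sum(1 for read in reads if any(marker in read for marker in markers))


def call_sex(reads, x_markers, y_markers, min_y_fraction=0.1):
    """Call XX or XY from the share of marker hits that fall on Y markers.

    min_y_fraction is a hypothetical cut-off; a real pipeline would learn
    it from samples of known sex.
    """
    x_hits = count_marker_hits(reads, x_markers)
    y_hits = count_marker_hits(reads, y_markers)
    total = x_hits + y_hits
    if total == 0:
        return "undetermined"  # no marker evidence at all
    return "XY" if y_hits / total >= min_y_fraction else "XX"


# Invented marker sequences and reads, for illustration only.
X_MARKERS = ["ACGTACGTAA"]
Y_MARKERS = ["TTGACCGGTA"]

female_reads = ["GGACGTACGTAACC", "CCACGTACGTAATT", "GGGGGGGGGGGG"]
male_reads = ["GGACGTACGTAACC", "AATTGACCGGTACC", "GGGGGGGGGGGG"]

print(call_sex(female_reads, X_MARKERS, Y_MARKERS))
print(call_sex(male_reads, X_MARKERS, Y_MARKERS))
```

    For a ZW system the same logic would apply with the roles of the two marker sets swapped, since ZZ samples are the ones without W hits.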

  20. Quantum annealing versus classical machine learning applied to a simplified computational biology problem

    PubMed Central

    Li, Richard Y.; Di Felice, Rosa; Rohs, Remo; Lidar, Daniel A.

    2018-01-01

    Transcription factors regulate gene expression, but how these proteins recognize and specifically bind to their DNA targets is still debated. Machine learning models are effective means to reveal interaction mechanisms. Here we studied the ability of a quantum machine learning approach to predict binding specificity. Using simplified datasets of a small number of DNA sequences derived from actual binding affinity experiments, we trained a commercially available quantum annealer to classify and rank transcription factor binding. The results were compared to state-of-the-art classical approaches for the same simplified datasets, including simulated annealing, simulated quantum annealing, multiple linear regression, LASSO, and extreme gradient boosting. Despite technological limitations, we find a slight advantage in classification performance and nearly equal ranking performance using the quantum annealer for these fairly small training data sets. Thus, we propose that quantum annealing might be an effective method to implement machine learning for certain computational biology problems. PMID:29652405

  1. Genome-wide assessment of differential translations with ribosome profiling data.

    PubMed

    Xiao, Zhengtao; Zou, Qin; Liu, Yu; Yang, Xuerui

    2016-04-04

    The closely regulated process of mRNA translation is crucial for precise control of protein abundance and quality. Ribosome profiling, a combination of ribosome foot-printing and RNA deep sequencing, has been used in a large variety of studies to quantify genome-wide mRNA translation. Here, we developed Xtail, an analysis pipeline tailored for ribosome profiling data that comprehensively and accurately identifies differentially translated genes in pairwise comparisons. Applied to simulated and real datasets, Xtail exhibits high sensitivity with minimal false-positive rates, outperforming existing methods in the accuracy of quantifying differential translations. With published ribosome profiling datasets, Xtail not only reveals differentially translated genes that make biological sense, but also uncovers new events of differential translation in human cancer cells on mTOR signalling perturbation and in human primary macrophages on interferon gamma (IFN-γ) treatment. This demonstrates the value of Xtail in providing novel insights into the molecular mechanisms that involve translational dysregulations.

  2. The Landscape of long non-coding RNA classification

    PubMed Central

    St Laurent, Georges; Wahlestedt, Claes; Kapranov, Philipp

    2015-01-01

    Advances in the depth and quality of transcriptome sequencing have revealed many new classes of long non-coding RNAs (lncRNAs). lncRNA classification has mushroomed to accommodate these new findings, even though the real dimensions and complexity of the non-coding transcriptome remain unknown. Although evidence of functionality of specific lncRNAs continues to accumulate, conflicting, confusing, and overlapping terminology has fostered ambiguity and lack of clarity in the field in general. The lack of a fundamental, conceptually unambiguous classification framework results in a number of challenges in the annotation and interpretation of non-coding transcriptome data. It also might undermine integration of the new genomic methods and datasets in an effort to unravel the function of lncRNAs. Here, we review existing lncRNA classifications, nomenclature, and terminology. Then we describe the conceptual guidelines that have emerged for their classification and functional annotation based on expanding and more comprehensive use of large systems biology-based datasets. PMID:25869999

  3. Fast and Sensitive Alignment of Microbial Whole Genome Sequencing Reads to Large Sequence Datasets on a Desktop PC: Application to Metagenomic Datasets and Pathogen Identification

    PubMed Central

    2014-01-01

    Next generation sequencing (NGS) of metagenomic samples is becoming a standard approach to detect individual species or pathogenic strains of microorganisms. Computer programs used in the NGS community have to balance between speed and sensitivity and as a result, species or strain level identification is often inaccurate and low abundance pathogens can sometimes be missed. We have developed Taxoner, an open source, taxon assignment pipeline that includes a fast aligner (e.g. Bowtie2) and a comprehensive DNA sequence database. We tested the program on simulated datasets as well as experimental data from Illumina, IonTorrent, and Roche 454 sequencing platforms. We found that Taxoner performs as well as, and often better than, BLAST, but requires two orders of magnitude less running time, meaning that it can be run on desktop or laptop computers. Taxoner is slower than the approaches that use small marker databases but is more sensitive due to the comprehensive reference database. In addition, it can be easily tuned to specific applications using small tailored databases. When applied to metagenomic datasets, Taxoner can provide a functional summary of the genes mapped and can provide strain level identification. Taxoner is written in C for Linux operating systems. The code and documentation are available for research applications at http://code.google.com/p/taxoner. PMID:25077800

  4. Characterization and prediction of residues determining protein functional specificity.

    PubMed

    Capra, John A; Singh, Mona

    2008-07-01

    Within a homologous protein family, proteins may be grouped into subtypes that share specific functions that are not common to the entire family. Often, the amino acids present in a small number of sequence positions determine each protein's particular functional specificity. Knowledge of these specificity determining positions (SDPs) aids in protein function prediction, drug design and experimental analysis. A number of sequence-based computational methods have been introduced for identifying SDPs; however, their further development and evaluation have been hindered by the limited number of known experimentally determined SDPs. We combine several bioinformatics resources to automate a process, typically undertaken manually, to build a dataset of SDPs. The resulting large dataset, which consists of SDPs in enzymes, enables us to characterize SDPs in terms of their physicochemical and evolutionary properties. It also facilitates the large-scale evaluation of sequence-based SDP prediction methods. We present a simple sequence-based SDP prediction method, GroupSim, and show that, surprisingly, it is competitive with a representative set of current methods. We also describe ConsWin, a heuristic that considers sequence conservation of neighboring amino acids, and demonstrate that it improves the performance of all methods tested on our large dataset of enzyme SDPs. Datasets and GroupSim code are available online at http://compbio.cs.princeton.edu/specificity/. Supplementary data are available at Bioinformatics online.
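
    The ConsWin idea described above (letting the conservation of neighboring positions inform a position's score) can be illustrated with a minimal sketch. The abstract does not give the formula, so the linear blend, the `window` size, the `lam` weight, and the function name `window_smooth` are all assumptions made for illustration, not the authors' implementation.

```python
def window_smooth(scores, conservation, window=3, lam=0.7):
    """Blend each position's score with the mean conservation of its neighbors.

    scores        per-column specificity scores (e.g. from some SDP predictor)
    conservation  per-column conservation values, same length as scores
    window        number of neighbors considered on each side (assumed value)
    lam           weight given to the position's own score (assumed value)
    """
    if len(scores) != len(conservation):
        raise ValueError("scores and conservation must have the same length")
    n = len(scores)
    smoothed = []
    for i in range(n):
        lo, hi = max(0, i - window), min(n, i + window + 1)
        # Neighbors exclude the position itself.
        neighbors = [conservation[j] for j in range(lo, hi) if j != i]
        neighbor_mean = sum(neighbors) / len(neighbors) if neighbors else 0.0
        smoothed.append(lam * scores[i] + (1 - lam) * neighbor_mean)
    return smoothed
```

    With `lam=1.0` the input scores are returned unchanged, which makes it easy to compare a predictor with and without the neighborhood term.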

  5. Fast and sensitive alignment of microbial whole genome sequencing reads to large sequence datasets on a desktop PC: application to metagenomic datasets and pathogen identification.

    PubMed

    Pongor, Lőrinc S; Vera, Roberto; Ligeti, Balázs

    2014-01-01

    Next generation sequencing (NGS) of metagenomic samples is becoming a standard approach to detect individual species or pathogenic strains of microorganisms. Computer programs used in the NGS community have to balance between speed and sensitivity and as a result, species or strain level identification is often inaccurate and low abundance pathogens can sometimes be missed. We have developed Taxoner, an open source, taxon assignment pipeline that includes a fast aligner (e.g. Bowtie2) and a comprehensive DNA sequence database. We tested the program on simulated datasets as well as experimental data from Illumina, IonTorrent, and Roche 454 sequencing platforms. We found that Taxoner performs as well as, and often better than, BLAST, but requires two orders of magnitude less running time, meaning that it can be run on desktop or laptop computers. Taxoner is slower than the approaches that use small marker databases but is more sensitive due to the comprehensive reference database. In addition, it can be easily tuned to specific applications using small tailored databases. When applied to metagenomic datasets, Taxoner can provide a functional summary of the genes mapped and can provide strain level identification. Taxoner is written in C for Linux operating systems. The code and documentation are available for research applications at http://code.google.com/p/taxoner.

  6. Fathead minnow genome sequencing and assembly

    EPA Pesticide Factsheets

    The dataset provides the URLs for accessing the genome sequence data and two draft assemblies as well as fathead minnow genotyping data associated with estimating the heterozygosity of the in-bred line. This dataset is associated with the following publication: Burns, F., L. Cogburn, G. Ankley, D. Villeneuve, E. Waits, Y. Chang, V. Llaca, S. Deschamps, R. Jackson, and R. Hoke. Sequencing and De novo Draft Assemblies of the Fathead Minnow (Pimephales promelas) Reference Genome. ENVIRONMENTAL TOXICOLOGY AND CHEMISTRY. Society of Environmental Toxicology and Chemistry, Pensacola, FL, USA, 35(1): 212-217, (2016).

  7. Impact of sequencing depth and read length on single cell RNA sequencing data of T cells.

    PubMed

    Rizzetto, Simone; Eltahla, Auda A; Lin, Peijie; Bull, Rowena; Lloyd, Andrew R; Ho, Joshua W K; Venturi, Vanessa; Luciani, Fabio

    2017-10-06

    Single cell RNA sequencing (scRNA-seq) provides great potential in measuring the gene expression profiles of heterogeneous cell populations. In immunology, scRNA-seq allowed the characterisation of transcript sequence diversity of functionally relevant T cell subsets, and the identification of the full length T cell receptor (TCRαβ), which defines the specificity against cognate antigens. Several factors, e.g. RNA library capture, cell quality, and sequencing output, affect the quality of scRNA-seq data. We studied the effects of read length and sequencing depth on the quality of gene expression profiles, cell type identification, and TCRαβ reconstruction, utilising 1,305 single cells from 8 publicly available scRNA-seq datasets, and simulation-based analyses. Gene expression was characterised by an increased number of unique genes identified with short read lengths (<50 bp), but these featured higher technical variability compared to profiles from longer reads. Successful TCRαβ reconstruction was achieved for 6 datasets (81% - 100%) with at least 0.25 million paired-end (PE) reads of length >50 bp, while it failed for datasets with <30 bp reads. Sufficient read length and sequencing depth can control technical noise to enable accurate identification of TCRαβ and gene expression profiles from scRNA-seq data of T cells.

  8. Identification of Two Novel Amalgaviruses in the Common Eelgrass (Zostera marina) and in Silico Analysis of the Amalgavirus +1 Programmed Ribosomal Frameshifting Sites.

    PubMed

    Park, Dongbin; Goh, Chul Jun; Kim, Hyein; Hahn, Yoonsoo

    2018-04-01

    The genome sequences of two novel monopartite RNA viruses were identified in a common eelgrass (Zostera marina) transcriptome dataset. Sequence comparison and phylogenetic analyses revealed that these two novel viruses belong to the genus Amalgavirus in the family Amalgaviridae. They were named Zostera marina amalgavirus 1 (ZmAV1) and Zostera marina amalgavirus 2 (ZmAV2). Genomes of both ZmAV1 and ZmAV2 contain two overlapping open reading frames (ORFs). ORF1 encodes a putative replication factory matrix-like protein, while ORF2 encodes an RNA-dependent RNA polymerase (RdRp) domain. The fusion protein (ORF1+2) of ORF1 and ORF2, which mediates RNA replication, was produced using the +1 programmed ribosomal frameshifting (PRF) mechanism. The +1 PRF motif sequence, UUU_CGN, which is highly conserved among known amalgaviruses, was also found in ZmAV1 and ZmAV2. Multiple sequence alignment of the ORF1+2 fusion proteins from 24 amalgaviruses revealed that +1 PRF occurred only at three different positions within the 13-amino acid-long segment, which was surrounded by highly conserved regions on both sides. This suggested that the +1 PRF may be constrained by the structure of fusion proteins. Genome sequences of ZmAV1 and ZmAV2, which are the first viruses to be identified in common eelgrass, will serve as useful resources for studying evolution and diversity of amalgaviruses.
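
    As an illustration of how candidate +1 PRF sites matching the UUU_CGN motif could be located in a sequence, the sketch below scans an RNA (or DNA) string with a regular expression. It assumes the underscore in the motif only marks a codon boundary and that N stands for any of the four nucleotides; the function name `find_prf_motifs` is hypothetical, and the scan ignores reading frame, which a real analysis would need to check.

```python
import re

# UUU_CGN with the underscore dropped and N expanded to any nucleotide.
# The lookahead lets overlapping candidates be reported as well.
PRF_MOTIF = re.compile(r"(?=(UUUCG[ACGU]))")


def find_prf_motifs(sequence):
    """Return (0-based start, matched hexamer) for every UUUCGN occurrence."""
    rna = sequence.upper().replace("T", "U")  # accept DNA-style input too
    return [(m.start(), m.group(1)) for m in PRF_MOTIF.finditer(rna)]
```

    Calling `find_prf_motifs` on a genome string returns positions only; deciding whether a hit lies in the ORF1 frame near the ORF1/ORF2 overlap is left to the caller.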

  9. Identification of Two Novel Amalgaviruses in the Common Eelgrass (Zostera marina) and in Silico Analysis of the Amalgavirus +1 Programmed Ribosomal Frameshifting Sites

    PubMed Central

    Park, Dongbin; Goh, Chul Jun; Kim, Hyein; Hahn, Yoonsoo

    2018-01-01

    The genome sequences of two novel monopartite RNA viruses were identified in a common eelgrass (Zostera marina) transcriptome dataset. Sequence comparison and phylogenetic analyses revealed that these two novel viruses belong to the genus Amalgavirus in the family Amalgaviridae. They were named Zostera marina amalgavirus 1 (ZmAV1) and Zostera marina amalgavirus 2 (ZmAV2). Genomes of both ZmAV1 and ZmAV2 contain two overlapping open reading frames (ORFs). ORF1 encodes a putative replication factory matrix-like protein, while ORF2 encodes an RNA-dependent RNA polymerase (RdRp) domain. The fusion protein (ORF1+2) of ORF1 and ORF2, which mediates RNA replication, was produced using the +1 programmed ribosomal frameshifting (PRF) mechanism. The +1 PRF motif sequence, UUU_CGN, which is highly conserved among known amalgaviruses, was also found in ZmAV1 and ZmAV2. Multiple sequence alignment of the ORF1+2 fusion proteins from 24 amalgaviruses revealed that +1 PRF occurred only at three different positions within the 13-amino acid-long segment, which was surrounded by highly conserved regions on both sides. This suggested that the +1 PRF may be constrained by the structure of fusion proteins. Genome sequences of ZmAV1 and ZmAV2, which are the first viruses to be identified in common eelgrass, will serve as useful resources for studying evolution and diversity of amalgaviruses. PMID:29628822

  10. How B-Cell Receptor Repertoire Sequencing Can Be Enriched with Structural Antibody Data

    PubMed Central

    Kovaltsuk, Aleksandr; Krawczyk, Konrad; Galson, Jacob D.; Kelly, Dominic F.; Deane, Charlotte M.; Trück, Johannes

    2017-01-01

    Next-generation sequencing of immunoglobulin gene repertoires (Ig-seq) allows the investigation of large-scale antibody dynamics at a sequence level. However, structural information, a crucial descriptor of antibody binding capability, is not collected in Ig-seq protocols. Developing systematic relationships between the antibody sequence information gathered from Ig-seq and low-throughput techniques such as X-ray crystallography could radically improve our understanding of antibodies. The mapping of Ig-seq datasets to known antibody structures can indicate structurally, and perhaps functionally, uncharted areas. Furthermore, contrasting naïve and antigenically challenged datasets using structural antibody descriptors should provide insights into antibody maturation. As the number of antibody structures steadily increases and more and more Ig-seq datasets become available, the opportunities that arise from combining the two types of information increase as well. Here, we review how these data types enrich one another and show potential for advancing our knowledge of the immune system and improving antibody engineering. PMID:29276518

  11. Treetrimmer: a method for phylogenetic dataset size reduction.

    PubMed

    Maruyama, Shinichiro; Eveleigh, Robert J M; Archibald, John M

    2013-04-12

    With rapid advances in genome sequencing and bioinformatics, it is now possible to generate phylogenetic trees containing thousands of operational taxonomic units (OTUs) from a wide range of organisms. However, use of rigorous tree-building methods on such large datasets is prohibitive and manual 'pruning' of sequence alignments is time consuming and raises concerns over reproducibility. There is a need for bioinformatic tools with which to objectively carry out such pruning procedures. Here we present 'TreeTrimmer', a bioinformatics procedure that removes unnecessary redundancy in large phylogenetic datasets, alleviating the size effect on more rigorous downstream analyses. The method identifies and removes user-defined 'redundant' sequences, e.g., orthologous sequences from closely related organisms and 'recently' evolved lineage-specific paralogs. Representative OTUs are retained for more rigorous re-analysis. TreeTrimmer reduces the OTU density of phylogenetic trees without sacrificing taxonomic diversity while retaining the original tree topology, thereby speeding up downstream computer-intensive analyses, e.g., Bayesian and maximum likelihood tree reconstructions, in a reproducible fashion.

  12. Relationships between palaeogeography and opal occurrence in Australia: A data-mining approach

    NASA Astrophysics Data System (ADS)

    Landgrebe, T. C. W.; Merdith, A.; Dutkiewicz, A.; Müller, R. D.

    2013-07-01

    Age-coded multi-layered geological datasets are becoming increasingly prevalent with the surge in open-access geodata, yet there are few methodologies for extracting geological information and knowledge from these data. We present a novel methodology, based on the open-source GPlates software in which age-coded digital palaeogeographic maps are used to “data-mine” spatio-temporal patterns related to the occurrence of Australian opal. Our aim is to test the concept that only a particular sequence of depositional/erosional environments may lead to conditions suitable for the formation of gem quality sedimentary opal. Time-varying geographic environment properties are extracted from a digital palaeogeographic dataset of the eastern Australian Great Artesian Basin (GAB) at 1036 opal localities. We obtain a total of 52 independent ordinal sequences sampling 19 time slices from the Early Cretaceous to the present-day. We find that 95% of the known opal deposits are tied to only 27 sequences all comprising fluvial and shallow marine depositional sequences followed by a prolonged phase of erosion. We then map the total area of the GAB that matches these 27 opal-specific sequences, resulting in an opal-prospective region of only about 10% of the total area of the basin. The key patterns underlying this association involve only a small number of key environmental transitions. We demonstrate that these key associations are generally absent at arbitrary locations in the basin. This new methodology allows for the simplification of a complex time-varying geological dataset into a single map view, enabling straightforward application for opal exploration and for future co-assessment with other datasets/geological criteria. This approach may help unravel the poorly understood opal formation process using an empirical spatio-temporal data-mining methodology and readily available datasets to aid hypothesis testing.

  13. Metabarcoding of marine nematodes – evaluation of reference datasets used in tree-based taxonomy assignment approach

    PubMed Central

    2016-01-01

    Abstract Background Metabarcoding is becoming a common tool used to assess and compare diversity of organisms in environmental samples. Identification of OTUs is one of the critical steps in the process and several taxonomy assignment methods were proposed to accomplish this task. This publication evaluates the quality of reference datasets, along with several alignment and phylogeny inference methods used in one of the taxonomy assignment methods, called the tree-based approach. This approach assigns anonymous OTUs to taxonomic categories based on the relative placements of OTUs and reference sequences on the cladogram and the support that these placements receive. New information In the tree-based taxonomy assignment approach, reliable identification of anonymous OTUs is based on their placement in monophyletic and highly supported clades together with identified reference taxa. Therefore, it requires a high quality reference dataset. Resolution of phylogenetic trees is strongly affected by the presence of erroneous sequences as well as by the alignment and phylogeny inference methods used in the process. Two preparation steps are essential for the successful application of the tree-based taxonomy assignment approach. Curated collections of genetic information do include erroneous sequences. These sequences have a detrimental effect on the resolution of cladograms used in the tree-based approach. They must be identified and excluded from the reference dataset beforehand. Various combinations of multiple sequence alignment and phylogeny inference methods provide cladograms with different topology and bootstrap support. These combinations of methods need to be tested in order to determine the one that gives the highest resolution for the particular reference dataset. Completing the above mentioned preparation steps is expected to decrease the number of unassigned OTUs and thus improve the results of the tree-based taxonomy assignment approach. PMID:27932919

  14. Metabarcoding of marine nematodes - evaluation of reference datasets used in tree-based taxonomy assignment approach.

    PubMed

    Holovachov, Oleksandr

    2016-01-01

    Metabarcoding is becoming a common tool used to assess and compare diversity of organisms in environmental samples. Identification of OTUs is one of the critical steps in the process and several taxonomy assignment methods were proposed to accomplish this task. This publication evaluates the quality of reference datasets, along with several alignment and phylogeny inference methods used in one of the taxonomy assignment methods, called the tree-based approach. This approach assigns anonymous OTUs to taxonomic categories based on the relative placements of OTUs and reference sequences on the cladogram and the support that these placements receive. In the tree-based taxonomy assignment approach, reliable identification of anonymous OTUs is based on their placement in monophyletic and highly supported clades together with identified reference taxa. Therefore, it requires a high quality reference dataset. Resolution of phylogenetic trees is strongly affected by the presence of erroneous sequences as well as by the alignment and phylogeny inference methods used in the process. Two preparation steps are essential for the successful application of the tree-based taxonomy assignment approach. Curated collections of genetic information do include erroneous sequences. These sequences have a detrimental effect on the resolution of cladograms used in the tree-based approach. They must be identified and excluded from the reference dataset beforehand. Various combinations of multiple sequence alignment and phylogeny inference methods provide cladograms with different topology and bootstrap support. These combinations of methods need to be tested in order to determine the one that gives the highest resolution for the particular reference dataset. Completing the above mentioned preparation steps is expected to decrease the number of unassigned OTUs and thus improve the results of the tree-based taxonomy assignment approach.

  15. Microbial Communities on Seafloor Basalts at Dorado Outcrop Reflect Level of Alteration and Highlight Global Lithic Clades

    PubMed Central

    Lee, Michael D.; Walworth, Nathan G.; Sylvan, Jason B.; Edwards, Katrina J.; Orcutt, Beth N.

    2015-01-01

    Areas of exposed basalt along mid-ocean ridges and at seafloor outcrops serve as conduits of fluid flux into and out of a subsurface ocean, and microbe–mineral interactions can influence alteration reactions at the rock–water interface. Located on the eastern flank of the East Pacific Rise, Dorado Outcrop is a site of low-temperature (<20°C) hydrothermal venting and represents a new end-member in the current survey of seafloor basalt biomes. Consistent with prior studies, a survey of 16S rRNA gene sequence diversity using universal primers targeting the V4 hypervariable region revealed much greater richness and diversity on the seafloor rocks than in surrounding seawater. Overall, Gamma-, Alpha-, and Deltaproteobacteria, and Thaumarchaeota dominated the sequenced communities, together making up over half of the observed diversity, though bacterial sequences were more abundant than archaeal in all samples. The most abundant bacterial reads were closely related to the obligate chemolithoautotrophic, sulfur-oxidizing Thioprofundum lithotrophicum, suggesting carbon and sulfur cycling as dominant metabolic pathways in this system. Representatives of Thaumarchaeota were detected in relatively high abundance on the basalts in comparison to bottom water, possibly indicating ammonia oxidation. In comparison to other sequence datasets from globally distributed seafloor basalts, this study reveals many overlapping and cosmopolitan phylogenetic groups and also suggests that substrate age correlates with community structure. PMID:26779122

  16. The Transcriptome Analysis of Strongyloides stercoralis L3i Larvae Reveals Targets for Intervention in a Neglected Disease

    PubMed Central

    Marcilla, Antonio; Garg, Gagan; Bernal, Dolores; Ranganathan, Shoba; Forment, Javier; Ortiz, Javier; Muñoz-Antolí, Carla; Dominguez, M. Victoria; Pedrola, Laia; Martinez-Blanch, Juan; Sotillo, Javier; Trelis, Maria; Toledo, Rafael; Esteban, J. Guillermo

    2012-01-01

    Background Strongyloidiasis is one of the most neglected diseases distributed worldwide with endemic areas in developed countries, where chronic infections are life threatening. Despite its impact, very little is known about the molecular biology of the parasite involved and its interplay with its hosts. Next generation sequencing technologies now provide unique opportunities to rapidly address these questions. Principal Findings Here we present the first transcriptome of the third larval stage of S. stercoralis using 454 sequencing coupled with semi-automated bioinformatic analyses. 253,266 raw sequence reads were assembled into 11,250 contiguous sequences, most of which were novel. 8037 putative proteins were characterized based on homology, gene ontology and/or biochemical pathways. Comparison of the transcriptome of S. stercoralis with those of other nematodes, including S. ratti, revealed similarities in transcription of molecules inferred to have key roles in parasite-host interactions. Enzymatic proteins, like kinases and proteases, were abundant. 1213 putative excretory/secretory proteins were compiled using a new pipeline which included non-classical secretory proteins. Potential drug targets were also identified. Conclusions Overall, the present dataset should provide a solid foundation for future fundamental genomic, proteomic and metabolomic explorations of S. stercoralis, as well as a basis for applied outcomes, such as the development of novel methods of intervention against this neglected parasite. PMID:22389732

  17. TranslatomeDB: a comprehensive database and cloud-based analysis platform for translatome sequencing data.

    PubMed

    Liu, Wanting; Xiang, Lunping; Zheng, Tingkai; Jin, Jingjie; Zhang, Gong

    2018-01-04

    Translation is a key regulatory step, linking transcriptome and proteome. Two major methods of translatome investigations are RNC-seq (sequencing of translating mRNA) and Ribo-seq (ribosome profiling). To facilitate the investigation of translation, we built a comprehensive database TranslatomeDB (http://www.translatomedb.net/) which provides collection and integrated analysis of published and user-generated translatome sequencing data. The current version includes 2453 Ribo-seq, 10 RNC-seq and their 1394 corresponding mRNA-seq datasets in 13 species. The database emphasizes the analysis functions in addition to the dataset collections. Differential gene expression (DGE) analysis can be performed between any two datasets of same species and type, both on transcriptome and translatome levels. The translation indices (translation ratio, elongation velocity index and translational efficiency) can be calculated to quantitatively evaluate translational initiation efficiency and elongation velocity. All datasets were analyzed using a unified, robust, accurate and experimentally-verifiable pipeline based on the FANSe3 mapping algorithm and edgeR for DGE analyses. TranslatomeDB also allows users to upload their own datasets and utilize the identical unified pipeline to analyze their data. We believe that our TranslatomeDB is a comprehensive platform and knowledgebase on translatome and proteome research, releasing the biologists from complex searching, analyzing and comparing huge sequencing data without needing local computational power. © The Author(s) 2017. Published by Oxford University Press on behalf of Nucleic Acids Research.

  18. Functional metagenomics of oil-impacted mangrove sediments reveals high abundance of hydrolases of biotechnological interest.

    PubMed

    Ottoni, Júlia Ronzella; Cabral, Lucélia; de Sousa, Sanderson Tarciso Pereira; Júnior, Gileno Vieira Lacerda; Domingos, Daniela Ferreira; Soares Junior, Fábio Lino; da Silva, Mylenne Calciolari Pinheiro; Marcon, Joelma; Dias, Armando Cavalcante Franco; de Melo, Itamar Soares; de Souza, Anete Pereira; Andreote, Fernando Dini; de Oliveira, Valéria Maia

    2017-07-01

    Mangroves are located in coastal wetlands and are susceptible to the consequences of oil spills, which may threaten the diversity of microorganisms responsible for nutrient cycling and the consequent ecosystem functioning. Previous reports show that high concentration of oil favors the incidence of epoxide hydrolases and haloalkane dehalogenases in mangroves. This finding has guided the goals of this study in an attempt to broaden the analysis to other hydrolases and thereby verify whether oil contamination interferes with the prevalence of particular hydrolases and their assigned microorganisms. For this, an in-depth survey of the taxonomic and functional microbial diversity recovered in a fosmid library (Library_Oil Mgv) constructed from oil-impacted Brazilian mangrove sediment was carried out. Fosmid DNA of the whole library was extracted and submitted to Illumina HiSeq sequencing. The resulting Library Oil_Mgv dataset was further compared with those obtained by direct sequencing of environmental DNA from Brazilian mangroves (from distinct regions and affected by distinct sources of contamination), focusing on hydrolases with potential use in biotechnological processes. The most abundant hydrolases found were proteases, esterases and amylases, with a similar occurrence profile in all datasets. The main microbial groups harboring such hydrolase-encoding genes were distinct in each mangrove, and in the fosmid library these enzymes were mainly assigned to Chloroflexaceae (for amylases), Planctomycetaceae (for esterases) and Bradyrhizobiaceae (for proteases). Assembly and analysis of Library_Oil Mgv reads revealed three potentially novel enzymes, one epoxide hydrolase, one xylanase and one amylase, to be further investigated via heterologous expression assays.

  19. An Alignment-Free Algorithm in Comparing the Similarity of Protein Sequences Based on Pseudo-Markov Transition Probabilities among Amino Acids

    PubMed Central

    Li, Yushuang; Yang, Jiasheng; Zhang, Yi

    2016-01-01

    In this paper, we have proposed a novel alignment-free method for comparing the similarity of protein sequences. We first encode a protein sequence into a 440-dimensional feature vector consisting of a 400-dimensional Pseudo-Markov transition probability vector among the 20 amino acids, a 20-dimensional content ratio vector, and a 20-dimensional position ratio vector of the amino acids in the sequence. By evaluating the Euclidean distances among the representing vectors, we compare the similarity of protein sequences. We then apply this method to the ND5 dataset consisting of the ND5 protein sequences of 9 species, and the F10 and G11 datasets representing two of the xylanase-containing glycoside hydrolase families, i.e., families 10 and 11. As a result, our method achieves a correlation coefficient of 0.962 with the canonical protein sequence aligner ClustalW in the ND5 dataset, much higher than those of 5 other popular alignment-free methods. In addition, we successfully separate the xylanase sequences in the F10 family and the G11 family and illustrate that the F10 family is more heat stable than the G11 family, consistent with a few previous studies. Moreover, we prove mathematically an identity equation involving the Pseudo-Markov transition probability vector and the amino acids content ratio vector. PMID:27918587
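
    The 440-dimensional encoding described above can be sketched roughly as follows. The abstract names the three components but not their formulas, so every definition here is an assumption: the transition block uses row-normalized counts of adjacent residue pairs, the content ratio is each residue's frequency, and the position ratio is each residue's mean 1-based position divided by sequence length. The names `feature_vector` and `euclidean` are hypothetical.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def feature_vector(sequence):
    """Encode a protein sequence as 400 + 20 + 20 = 440 numbers (assumed definitions)."""
    seq = [aa for aa in sequence.upper() if aa in INDEX]  # drop non-standard symbols
    n = len(seq)

    # 400 entries: adjacent-pair counts, each row normalized to sum to 1 (assumption).
    counts = [[0] * 20 for _ in range(20)]
    for a, b in zip(seq, seq[1:]):
        counts[INDEX[a]][INDEX[b]] += 1
    transition = []
    for row in counts:
        total = sum(row)
        transition.extend(c / total if total else 0.0 for c in row)

    # 20 + 20 entries: residue frequency and mean relative position (assumption).
    content = [0.0] * 20
    position = [0.0] * 20
    for pos, aa in enumerate(seq, start=1):
        content[INDEX[aa]] += 1
        position[INDEX[aa]] += pos
    for i in range(20):
        if content[i]:
            position[i] = position[i] / (content[i] * n)
            content[i] = content[i] / n
    return transition + content + position


def euclidean(u, v):
    """Euclidean distance between two feature vectors."""
    return math.dist(u, v)
```

    Pairwise distances between such vectors can then be fed to any distance-based clustering or tree-building step; smaller distances are read as higher similarity.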

  20. The Transcriptome Analysis and Comparison Explorer--T-ACE: a platform-independent, graphical tool to process large RNAseq datasets of non-model organisms.

    PubMed

    Philipp, E E R; Kraemer, L; Mountfort, D; Schilhabel, M; Schreiber, S; Rosenstiel, P

    2012-03-15

    Next generation sequencing (NGS) technologies allow a rapid and cost-effective compilation of large RNA sequence datasets in model and non-model organisms. However, the storage and analysis of transcriptome information from different NGS platforms is still a significant bottleneck, leading to a delay in data dissemination and subsequent biological understanding. In particular, database interfaces with transcriptome analysis modules going beyond mere read counts are missing. Here, we present the Transcriptome Analysis and Comparison Explorer (T-ACE), a tool designed for the organization and analysis of large sequence datasets, and especially suited for transcriptome projects of non-model organisms with little or no a priori sequence information. T-ACE offers a TCL-based interface, which accesses a PostgreSQL database via a php-script. Within T-ACE, information belonging to single sequences or contigs, such as annotation or read coverage, is linked to the respective sequence and immediately accessible. Sequences and assigned information can be searched via keyword or BLAST search. Additionally, T-ACE provides within- and between-transcriptome analysis modules on the level of expression, GO terms, KEGG pathways and protein domains. Results are visualized and can be easily exported for external analysis. We developed T-ACE for laboratory environments that have only a limited amount of bioinformatics support, and for collaborative projects in which different partners work on the same dataset from different locations or platforms (Windows/Linux/MacOS). For laboratories with some experience in bioinformatics and programming, the low complexity of the database structure and open-source code provides a framework that can be customized according to the different needs of the user and transcriptome project.

  1. Benchmark datasets for phylogenomic pipeline validation, applications for foodborne pathogen surveillance

    PubMed Central

    Rand, Hugh; Shumway, Martin; Trees, Eija K.; Simmons, Mustafa; Agarwala, Richa; Davis, Steven; Tillman, Glenn E.; Defibaugh-Chavez, Stephanie; Carleton, Heather A.; Klimke, William A.; Katz, Lee S.

    2017-01-01

    Background As next generation sequence technology has advanced, there have been parallel advances in genome-scale analysis programs for determining evolutionary relationships as proxies for epidemiological relationship in public health. Most new programs skip traditional steps of ortholog determination and multi-gene alignment, instead identifying variants across a set of genomes, then summarizing results in a matrix of single-nucleotide polymorphisms or alleles for standard phylogenetic analysis. However, public health authorities need to document the performance of these methods with appropriate and comprehensive datasets so they can be validated for specific purposes, e.g., outbreak surveillance. Here we propose a set of benchmark datasets to be used for comparison and validation of phylogenomic pipelines. Methods We identified four well-documented foodborne pathogen events in which the epidemiology was concordant with routine phylogenomic analyses (reference-based SNP and wgMLST approaches). These are ideal benchmark datasets, as the trees, WGS data, and epidemiological data for each are all in agreement. We have placed these sequence data, sample metadata, and “known” phylogenetic trees in publicly-accessible databases and developed a standard descriptive spreadsheet format describing each dataset. To facilitate easy downloading of these benchmarks, we developed an automated script that uses the standard descriptive spreadsheet format. Results Our “outbreak” benchmark datasets represent the four major foodborne bacterial pathogens (Listeria monocytogenes, Salmonella enterica, Escherichia coli, and Campylobacter jejuni) and one simulated dataset where the “known tree” can be accurately called the “true tree”. The downloading script and associated table files are available on GitHub: https://github.com/WGS-standards-and-analysis/datasets. Discussion These five benchmark datasets will help standardize comparison of current and future phylogenomic pipelines, and facilitate important cross-institutional collaborations. Our work is part of a global effort to provide collaborative infrastructure for sequence data and analytic tools; we welcome additional benchmark datasets in our recommended format, and, if relevant, we will add these on our GitHub site. Together, these datasets, dataset format, and the underlying GitHub infrastructure present a recommended path for worldwide standardization of phylogenomic pipelines. PMID:29372115

  2. Benchmark datasets for phylogenomic pipeline validation, applications for foodborne pathogen surveillance.

    PubMed

    Timme, Ruth E; Rand, Hugh; Shumway, Martin; Trees, Eija K; Simmons, Mustafa; Agarwala, Richa; Davis, Steven; Tillman, Glenn E; Defibaugh-Chavez, Stephanie; Carleton, Heather A; Klimke, William A; Katz, Lee S

    2017-01-01

    As next generation sequence technology has advanced, there have been parallel advances in genome-scale analysis programs for determining evolutionary relationships as proxies for epidemiological relationship in public health. Most new programs skip traditional steps of ortholog determination and multi-gene alignment, instead identifying variants across a set of genomes, then summarizing results in a matrix of single-nucleotide polymorphisms or alleles for standard phylogenetic analysis. However, public health authorities need to document the performance of these methods with appropriate and comprehensive datasets so they can be validated for specific purposes, e.g., outbreak surveillance. Here we propose a set of benchmark datasets to be used for comparison and validation of phylogenomic pipelines. We identified four well-documented foodborne pathogen events in which the epidemiology was concordant with routine phylogenomic analyses (reference-based SNP and wgMLST approaches). These are ideal benchmark datasets, as the trees, WGS data, and epidemiological data for each are all in agreement. We have placed these sequence data, sample metadata, and "known" phylogenetic trees in publicly-accessible databases and developed a standard descriptive spreadsheet format describing each dataset. To facilitate easy downloading of these benchmarks, we developed an automated script that uses the standard descriptive spreadsheet format. Our "outbreak" benchmark datasets represent the four major foodborne bacterial pathogens (Listeria monocytogenes, Salmonella enterica, Escherichia coli, and Campylobacter jejuni) and one simulated dataset where the "known tree" can be accurately called the "true tree". The downloading script and associated table files are available on GitHub: https://github.com/WGS-standards-and-analysis/datasets.
These five benchmark datasets will help standardize comparison of current and future phylogenomic pipelines, and facilitate important cross-institutional collaborations. Our work is part of a global effort to provide collaborative infrastructure for sequence data and analytic tools; we welcome additional benchmark datasets in our recommended format, and, if relevant, we will add these on our GitHub site. Together, these datasets, dataset format, and the underlying GitHub infrastructure present a recommended path for worldwide standardization of phylogenomic pipelines.
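
    The abstract above mentions pipelines that summarize variants "in a matrix of single-nucleotide polymorphisms". As a rough illustration only, the sketch below computes a pairwise SNP distance matrix from an already-aligned set of sequences. It is not taken from any of the benchmarked pipelines; the isolate names, the toy sequences, and the choice to skip gaps and ambiguous bases are all assumptions made for the example.

```python
from itertools import combinations


def snp_distance(seq_a, seq_b, valid="ACGT"):
    """Count aligned positions where both bases are unambiguous and differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(
        1
        for a, b in zip(seq_a.upper(), seq_b.upper())
        if a in valid and b in valid and a != b
    )


def snp_matrix(alignment):
    """Build a symmetric dict-of-dicts distance matrix from {name: sequence}."""
    names = sorted(alignment)
    matrix = {n: {m: 0 for m in names} for n in names}
    for n, m in combinations(names, 2):
        d = snp_distance(alignment[n], alignment[m])
        matrix[n][m] = matrix[m][n] = d
    return matrix


# Hypothetical toy alignment: three isolates over ten variant sites.
toy_alignment = {
    "isolate_A": "ACGTACGTAC",
    "isolate_B": "ACGTACGTAT",
    "isolate_C": "ACCTANGTTT",
}
matrix = snp_matrix(toy_alignment)
for name in sorted(matrix):
    print(name, [matrix[name][other] for other in sorted(matrix)])
```

    A matrix like this is the kind of intermediate result that could be compared between pipelines run on the same benchmark dataset, before any tree is built.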

  3. Transcriptome survey of the anhydrobiotic tardigrade Milnesium tardigradum in comparison with Hypsibius dujardini and Richtersius coronifer

    PubMed Central

    2010-01-01

    Background: The phenomenon of desiccation tolerance, also called anhydrobiosis, involves the ability of an organism to survive the loss of almost all cellular water without sustaining irreversible damage. Although there are several physiological, morphological and ecological studies on tardigrades, only limited DNA sequence information is available. Therefore, we explored the transcriptome in the active and anhydrobiotic state of the tardigrade Milnesium tardigradum which has extraordinary tolerance to desiccation and freezing. In this study, we present the first overview of the transcriptome of M. tardigradum and its response to desiccation and discuss potential parallels to stress responses in other organisms. Results: We sequenced a total of 9984 expressed sequence tags (ESTs) from two cDNA libraries from the eutardigrade M. tardigradum in its active and inactive, anhydrobiotic (tun) stage. Assembly of these ESTs resulted in 3283 putative unique transcripts, whereof ~50% showed significant sequence similarity to known genes. The resulting unigenes were functionally annotated using the Gene Ontology (GO) vocabulary. A GO term enrichment analysis revealed several GOs that were significantly underrepresented in the inactive stage. Furthermore we compared the putative unigenes of M. tardigradum with ESTs from two other eutardigrade species that are available from public sequence databases, namely Richtersius coronifer and Hypsibius dujardini. The processed sequences of the three tardigrade species revealed similar functional content and the M. tardigradum dataset contained additional sequences from tardigrades not present in the other two. Conclusions: This study describes novel sequence data from the tardigrade M. tardigradum, which significantly contributes to the available tardigrade sequence data and will help to establish this extraordinary tardigrade as a model for studying anhydrobiosis.
Functional comparison of active and anhydrobiotic tardigrades revealed a differential distribution of Gene Ontology terms associated with chromatin structure and the translation machinery, which are underrepresented in the inactive animals. These findings imply a widespread metabolic response of the animals on dehydration. The collective tardigrade transcriptome data will serve as a reference for further studies and support the identification and characterization of genes involved in the anhydrobiotic response. PMID:20226016

  4. Transcriptome survey of the anhydrobiotic tardigrade Milnesium tardigradum in comparison with Hypsibius dujardini and Richtersius coronifer.

    PubMed

    Mali, Brahim; Grohme, Markus A; Förster, Frank; Dandekar, Thomas; Schnölzer, Martina; Reuter, Dirk; Wełnicz, Weronika; Schill, Ralph O; Frohme, Marcus

    2010-03-12

    The phenomenon of desiccation tolerance, also called anhydrobiosis, involves the ability of an organism to survive the loss of almost all cellular water without sustaining irreversible damage. Although there are several physiological, morphological and ecological studies on tardigrades, only limited DNA sequence information is available. Therefore, we explored the transcriptome in the active and anhydrobiotic state of the tardigrade Milnesium tardigradum which has extraordinary tolerance to desiccation and freezing. In this study, we present the first overview of the transcriptome of M. tardigradum and its response to desiccation and discuss potential parallels to stress responses in other organisms. We sequenced a total of 9984 expressed sequence tags (ESTs) from two cDNA libraries from the eutardigrade M. tardigradum in its active and inactive, anhydrobiotic (tun) stage. Assembly of these ESTs resulted in 3283 putative unique transcripts, whereof approximately 50% showed significant sequence similarity to known genes. The resulting unigenes were functionally annotated using the Gene Ontology (GO) vocabulary. A GO term enrichment analysis revealed several GOs that were significantly underrepresented in the inactive stage. Furthermore we compared the putative unigenes of M. tardigradum with ESTs from two other eutardigrade species that are available from public sequence databases, namely Richtersius coronifer and Hypsibius dujardini. The processed sequences of the three tardigrade species revealed similar functional content and the M. tardigradum dataset contained additional sequences from tardigrades not present in the other two. This study describes novel sequence data from the tardigrade M. tardigradum, which significantly contributes to the available tardigrade sequence data and will help to establish this extraordinary tardigrade as a model for studying anhydrobiosis. 
Functional comparison of active and anhydrobiotic tardigrades revealed a differential distribution of Gene Ontology terms associated with chromatin structure and the translation machinery, which are underrepresented in the inactive animals. These findings imply a widespread metabolic response of the animals on dehydration. The collective tardigrade transcriptome data will serve as a reference for further studies and support the identification and characterization of genes involved in the anhydrobiotic response.

  5. Modified RNA-seq method for microbial community and diversity analysis using rRNA in different types of environmental samples

    PubMed Central

    Yan, Yong-Wei; Zou, Bin; Zhu, Ting; Hozzein, Wael N.

    2017-01-01

    RNA-seq-based SSU (small subunit) rRNA (ribosomal RNA) analysis has provided a better understanding of potentially active microbial community within environments. However, for RNA-seq library construction, high quantities of purified RNA are typically required. We propose a modified RNA-seq method for SSU rRNA-based microbial community analysis that depends on the direct ligation of a 5’ adaptor to RNA before reverse-transcription. The method requires only a low-input quantity of RNA (10–100 ng) and does not require a DNA removal step. The method was initially tested on three mock communities synthesized with enriched SSU rRNA of archaeal, bacterial and fungal isolates at different ratios, and was subsequently used for environmental samples of high or low biomass. For high-biomass salt-marsh sediments, enriched SSU rRNA and total nucleic acid-derived RNA-seq datasets revealed highly consistent community compositions for all of the SSU rRNA sequences, and as much as 46.4%-59.5% of 16S rRNA sequences were suitable for OTU (operational taxonomic unit)-based community and diversity analyses with complete coverage of V1-V2 regions. OTU-based community structures for the two datasets were also highly consistent with those determined by all of the 16S rRNA reads. For low-biomass samples, total nucleic acid-derived RNA-seq datasets were analyzed, and highly active bacterial taxa were also identified by the OTU-based method, notably including members of the previously underestimated genus Nitrospira and phylum Acidobacteria in tap water, members of the phylum Actinobacteria on a shower curtain, and members of the phylum Cyanobacteria on leaf surfaces. More than half of the bacterial 16S rRNA sequences covered the complete region of primer 8F, and non-coverage rates as high as 38.7% were obtained for phylum-unclassified sequences, providing many opportunities to identify novel bacterial taxa. 
This modified RNA-seq method will provide a better snapshot of diverse microbial communities, most notably by OTU-based analysis, even communities with low-biomass samples. PMID:29016661

  6. Profiling mRNAs of Two Cuscuta Species Reveals Possible Candidate Transcripts Shared by Parasitic Plants

    PubMed Central

    Wijeratne, Saranga; Fraga, Martina; Meulia, Tea; Doohan, Doug; Li, Zhaohu; Qu, Feng

    2013-01-01

    Dodders are among the most important parasitic plants that cause serious yield losses in crop plants. In this report, we sought to unveil the genetic basis of dodder parasitism by profiling the transcriptomes of Cuscuta pentagona and C. suaveolens, two of the most common dodder species, using a next-generation RNA sequencing platform. De novo assembly of the sequence reads resulted in more than 46,000 isotigs and contigs (collectively referred to as expressed sequence tags or ESTs) for each species, with more than half of them predicted to encode proteins that share significant sequence similarities with known proteins of non-parasitic plants. Comparing our datasets with transcriptomes of 12 other fully sequenced plant species confirmed a close evolutionary relationship between dodder and tomato. Using a rigorous set of filtering parameters, we were able to identify seven pairs of ESTs that appear to be shared exclusively by parasitic plants, thus providing targets for tailored management approaches. In addition, we also discovered ESTs with sequence similarities to known plant viruses, including cryptic viruses, in the dodder sequence assemblies. Together, this study represents the first comprehensive transcriptome profiling of parasitic plants in the Cuscuta genus, and is expected to contribute to our understanding of the molecular mechanisms of parasitic plant-host plant interactions. PMID:24312295

  7. Profiling mRNAs of two Cuscuta species reveals possible candidate transcripts shared by parasitic plants.

    PubMed

    Jiang, Linjian; Wijeratne, Asela J; Wijeratne, Saranga; Fraga, Martina; Meulia, Tea; Doohan, Doug; Li, Zhaohu; Qu, Feng

    2013-01-01

    Dodders are among the most important parasitic plants that cause serious yield losses in crop plants. In this report, we sought to unveil the genetic basis of dodder parasitism by profiling the transcriptomes of Cuscuta pentagona and C. suaveolens, two of the most common dodder species, using a next-generation RNA sequencing platform. De novo assembly of the sequence reads resulted in more than 46,000 isotigs and contigs (collectively referred to as expressed sequence tags or ESTs) for each species, with more than half of them predicted to encode proteins that share significant sequence similarities with known proteins of non-parasitic plants. Comparing our datasets with transcriptomes of 12 other fully sequenced plant species confirmed a close evolutionary relationship between dodder and tomato. Using a rigorous set of filtering parameters, we were able to identify seven pairs of ESTs that appear to be shared exclusively by parasitic plants, thus providing targets for tailored management approaches. In addition, we also discovered ESTs with sequence similarities to known plant viruses, including cryptic viruses, in the dodder sequence assemblies. Together, this study represents the first comprehensive transcriptome profiling of parasitic plants in the Cuscuta genus, and is expected to contribute to our understanding of the molecular mechanisms of parasitic plant-host plant interactions.

  8. Characterization of mango (Mangifera indica L.) transcriptome and chloroplast genome.

    PubMed

    Azim, M Kamran; Khan, Ishtaiq A; Zhang, Yong

    2014-05-01

    We characterized the mango leaf transcriptome and chloroplast genome using next generation DNA sequencing. The RNA-seq output of the mango transcriptome generated >12 million reads (total nucleotides sequenced >1 Gb). De novo transcriptome assembly generated 30,509 unigenes with lengths in the range of 300 to ≥3,000 nt and 67× depth of coverage. Blast searching against nonredundant nucleotide databases and several Viridiplantae genomic datasets annotated 24,593 mango unigenes (80% of total) and identified Citrus sinensis as the closest neighbor of mango with 9,141 (37%) matched sequences. The annotation with gene ontology and Clusters of Orthologous Group terms categorized unigene sequences into 57 and 25 classes, respectively. More than 13,500 unigenes were assigned to 293 KEGG pathways. Besides major plant biology related pathways, KEGG based gene annotation pointed out active presence of an array of biochemical pathways involved in (a) biosynthesis of bioactive flavonoids, flavones and flavonols, (b) biosynthesis of terpenoids and lignins and (c) plant hormone signal transduction. The mango transcriptome sequences revealed 235 proteases belonging to five catalytic classes of proteolytic enzymes. The draft genome of mango chloroplast (cp) was obtained by a combination of Sanger and next generation sequencing. The draft mango cp genome size is 151,173 bp with a pair of inverted repeats of 27,093 bp separated by small and large single copy regions, respectively. Out of 139 genes in the mango cp genome, 91 were found to be protein coding. Sequence analysis revealed the cp genome of C. sinensis as the closest neighbor of mango. We found 51 short repeats in the mango cp genome that are presumed to be associated with extensive rearrangements. This is the first report of transcriptome and chloroplast genome analysis of any Anacardiaceae family member.
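
    The figures quoted in this abstract can be cross-checked with a few lines of arithmetic. The sketch below uses only the numbers stated above; the variable names are illustrative, and reading the 37% as a share of the annotated unigenes (rather than of all unigenes) is an inference from the arithmetic, not something the abstract states.

```python
# Figures quoted in the abstract.
total_unigenes = 30_509
annotated_unigenes = 24_593
citrus_matches = 9_141
cp_genome_bp = 151_173
inverted_repeat_bp = 27_093  # length of each of the two inverted repeats
cp_genes = 139
cp_protein_coding = 91

# Share of unigenes that received an annotation.
annotated_pct = 100 * annotated_unigenes / total_unigenes

# Citrus sinensis matches as a share of annotated vs. all unigenes.
citrus_pct_of_annotated = 100 * citrus_matches / annotated_unigenes
citrus_pct_of_total = 100 * citrus_matches / total_unigenes

# Chloroplast sequence left for the two single-copy regions combined.
single_copy_bp = cp_genome_bp - 2 * inverted_repeat_bp

# Chloroplast genes not reported as protein coding.
other_cp_genes = cp_genes - cp_protein_coding

print(f"annotated: {annotated_pct:.1f}%")
print(f"C. sinensis matches: {citrus_pct_of_annotated:.1f}% of annotated, "
      f"{citrus_pct_of_total:.1f}% of all unigenes")
print(f"single-copy regions combined: {single_copy_bp} bp")
print(f"non-protein-coding cp genes: {other_cp_genes}")
```

    Under these figures, the reported 37% is consistent with the annotated set as the denominator, and the two single-copy regions together account for 96,987 bp.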

  9. Distinct profiles of expressed sequence tags during intestinal regeneration in the sea cucumber Holothuria glaberrima

    PubMed Central

    Rojas-Cartagena, Carmencita; Ortíz-Pineda, Pablo; Ramírez-Gómez, Francisco; Suárez-Castillo, Edna C.; Matos-Cruz, Vanessa; Rodríguez, Carlos; Ortíz-Zuazaga, Humberto; García-Arrarás, José E.

    2010-01-01

    Repair and regeneration are key processes for tissue maintenance, and their disruption may lead to disease states. Little is known about the molecular mechanisms that underlie the repair and regeneration of the digestive tract. The sea cucumber Holothuria glaberrima represents an excellent model to dissect and characterize the molecular events during intestinal regeneration. To study the gene expression profile, cDNA libraries were constructed from normal, 3-day, and 7-day regenerating intestines of H. glaberrima. Clones were randomly sequenced and queried against the nonredundant protein database at the National Center for Biotechnology Information. RT-PCR analyses were made of several genes to determine their expression profile during intestinal regeneration. A total of 5,173 sequences from three cDNA libraries were obtained. About 46.2, 35.6, and 26.2% of the sequences for the normal, 3-day, and 7-day cDNA libraries, respectively, shared significant similarity with known sequences in the protein database of GenBank but showed only 10% similarity among them. Analysis of the libraries in terms of functional processes, protein domains, and most common sequences suggests that a differential expression profile is taking place during the regeneration process. Further examination of the expressed sequence tag dataset revealed that 12 putative genes are differentially expressed at a significant level (R > 6). Experimental validation by RT-PCR analysis reveals that at least three genes (unknown C-4677-1, melanotransferrin, and centaurin) present a differential expression during regeneration. These findings strongly suggest that the gene expression profile varies among regeneration stages and provide evidence for the existence of differential gene expression. PMID:17579180

  10. A phylogenetic framework for root lesion nematodes of the genus Pratylenchus (Nematoda): Evidence from 18S and D2-D3 expansion segments of 28S ribosomal RNA genes and morphological characters.

    PubMed

    Subbotin, Sergei A; Ragsdale, Erik J; Mullens, Teresa; Roberts, Philip A; Mundo-Ocampo, Manuel; Baldwin, James G

    2008-08-01

    The root lesion nematodes of the genus Pratylenchus Filipjev, 1936 are migratory endoparasites of plant roots, considered among the most widespread and important nematode parasites in a variety of crops. We obtained gene sequences from the D2 and D3 expansion segments of 28S rRNA partial and 18S rRNA from 31 populations belonging to 11 valid and two unidentified species of root lesion nematodes and five outgroup taxa. These datasets were analyzed using maximum parsimony and Bayesian inference. The alignments were generated using the secondary structure models for these molecules and analyzed with Bayesian inference under the standard models and the complex model, considering helices under the doublet model and loops and bulges under the general time reversible model. The phylogenetic informativeness of morphological characters is tested by reconstruction of their histories on rRNA based trees using parallel parsimony and Bayesian approaches. Phylogenetic and sequence analyses of the 28S D2-D3 dataset with 145 accessions for 28 species and 18S dataset with 68 accessions for 15 species confirmed among large numbers of geographical diverse isolates that most classical morphospecies are monophyletic. Phylogenetic analyses revealed at least six distinct major clades of examined Pratylenchus species and these clades are generally congruent with those defined by characters derived from lip patterns, numbers of lip annules, and spermatheca shape. Morphological results suggest the need for sophisticated character discovery and analysis for morphology based phylogenetics in nematodes.

  11. Metagenomics Reveals Pervasive Bacterial Populations and Reduced Community Diversity across the Alaska Tundra Ecosystem

    DOE PAGES

    Johnston, Eric R.; Rodriguez-R, Luis M.; Luo, Chengwei; ...

    2016-04-25

    How soil microbial communities contrast with respect to taxonomic and functional composition within and between ecosystems remains an unresolved question that is central to predicting how global anthropogenic change will affect soil functioning and services. In particular, it remains unclear how small-scale observations of soil communities based on the typical volume sampled (1-2 g) are generalizable to ecosystem-scale responses and processes. This is especially relevant for remote, northern latitude soils, which are challenging to sample and are also thought to be more vulnerable to climate change compared to temperate soils. Here, we employed well-replicated shotgun metagenome and 16S rRNA gene amplicon sequencing to characterize community composition and metabolic potential in Alaskan tundra soils, combining our own datasets with those publically available from distant tundra and temperate grassland and agriculture habitats. We found that the abundance of many taxa and metabolic functions differed substantially between tundra soil metagenomes relative to those from temperate soils, and that a high degree of OTU-sharing exists between tundra locations. Tundra soils were an order of magnitude less complex than their temperate counterparts, allowing for near-complete coverage of microbial community richness (~92% breadth) by sequencing, and the recovery of 27 high-quality, almost complete (>80% completeness) population bins. These population bins, collectively, made up to ~10% of the metagenomic datasets, and represented diverse taxonomic groups and metabolic lifestyles tuned toward sulfur cycling, hydrogen metabolism, methanotrophy, and organic matter oxidation. Several population bins, including members of Acidobacteria, Actinobacteria, and Proteobacteria, were also present in geographically distant (~100-530 km apart) tundra habitats (full genome representation and up to 99.6% genome-derived average nucleotide identity).
Collectively, our results revealed that Alaska tundra microbial communities are less diverse and more homogenous across spatial scales than previously anticipated, and provided DNA sequences of abundant populations and genes that would be relevant for future studies of the effects of environmental change on tundra ecosystems.

  12. Metagenomics Reveals Pervasive Bacterial Populations and Reduced Community Diversity across the Alaska Tundra Ecosystem.

    PubMed

    Johnston, Eric R; Rodriguez-R, Luis M; Luo, Chengwei; Yuan, Mengting M; Wu, Liyou; He, Zhili; Schuur, Edward A G; Luo, Yiqi; Tiedje, James M; Zhou, Jizhong; Konstantinidis, Konstantinos T

    2016-01-01

    How soil microbial communities contrast with respect to taxonomic and functional composition within and between ecosystems remains an unresolved question that is central to predicting how global anthropogenic change will affect soil functioning and services. In particular, it remains unclear how small-scale observations of soil communities based on the typical volume sampled (1-2 g) are generalizable to ecosystem-scale responses and processes. This is especially relevant for remote, northern latitude soils, which are challenging to sample and are also thought to be more vulnerable to climate change compared to temperate soils. Here, we employed well-replicated shotgun metagenome and 16S rRNA gene amplicon sequencing to characterize community composition and metabolic potential in Alaskan tundra soils, combining our own datasets with those publically available from distant tundra and temperate grassland and agriculture habitats. We found that the abundance of many taxa and metabolic functions differed substantially between tundra soil metagenomes relative to those from temperate soils, and that a high degree of OTU-sharing exists between tundra locations. Tundra soils were an order of magnitude less complex than their temperate counterparts, allowing for near-complete coverage of microbial community richness (~92% breadth) by sequencing, and the recovery of 27 high-quality, almost complete (>80% completeness) population bins. These population bins, collectively, made up to ~10% of the metagenomic datasets, and represented diverse taxonomic groups and metabolic lifestyles tuned toward sulfur cycling, hydrogen metabolism, methanotrophy, and organic matter oxidation. Several population bins, including members of Acidobacteria, Actinobacteria, and Proteobacteria, were also present in geographically distant (~100-530 km apart) tundra habitats (full genome representation and up to 99.6% genome-derived average nucleotide identity). 
Collectively, our results revealed that Alaska tundra microbial communities are less diverse and more homogenous across spatial scales than previously anticipated, and provided DNA sequences of abundant populations and genes that would be relevant for future studies of the effects of environmental change on tundra ecosystems.

  13. Metagenomics Reveals Pervasive Bacterial Populations and Reduced Community Diversity across the Alaska Tundra Ecosystem

    PubMed Central

    Johnston, Eric R.; Rodriguez-R, Luis M.; Luo, Chengwei; Yuan, Mengting M.; Wu, Liyou; He, Zhili; Schuur, Edward A. G.; Luo, Yiqi; Tiedje, James M.; Zhou, Jizhong; Konstantinidis, Konstantinos T.

    2016-01-01

    How soil microbial communities contrast with respect to taxonomic and functional composition within and between ecosystems remains an unresolved question that is central to predicting how global anthropogenic change will affect soil functioning and services. In particular, it remains unclear how small-scale observations of soil communities based on the typical volume sampled (1–2 g) are generalizable to ecosystem-scale responses and processes. This is especially relevant for remote, northern latitude soils, which are challenging to sample and are also thought to be more vulnerable to climate change compared to temperate soils. Here, we employed well-replicated shotgun metagenome and 16S rRNA gene amplicon sequencing to characterize community composition and metabolic potential in Alaskan tundra soils, combining our own datasets with those publically available from distant tundra and temperate grassland and agriculture habitats. We found that the abundance of many taxa and metabolic functions differed substantially between tundra soil metagenomes relative to those from temperate soils, and that a high degree of OTU-sharing exists between tundra locations. Tundra soils were an order of magnitude less complex than their temperate counterparts, allowing for near-complete coverage of microbial community richness (~92% breadth) by sequencing, and the recovery of 27 high-quality, almost complete (>80% completeness) population bins. These population bins, collectively, made up to ~10% of the metagenomic datasets, and represented diverse taxonomic groups and metabolic lifestyles tuned toward sulfur cycling, hydrogen metabolism, methanotrophy, and organic matter oxidation. Several population bins, including members of Acidobacteria, Actinobacteria, and Proteobacteria, were also present in geographically distant (~100–530 km apart) tundra habitats (full genome representation and up to 99.6% genome-derived average nucleotide identity). 
Collectively, our results revealed that Alaska tundra microbial communities are less diverse and more homogenous across spatial scales than previously anticipated, and provided DNA sequences of abundant populations and genes that would be relevant for future studies of the effects of environmental change on tundra ecosystems. PMID:27199914

  14. ChIP-seq Accurately Predicts Tissue-Specific Activity of Enhancers

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Visel, Axel; Blow, Matthew J.; Li, Zirong

    2009-02-01

    A major yet unresolved quest in decoding the human genome is the identification of the regulatory sequences that control the spatial and temporal expression of genes. Distant-acting transcriptional enhancers are particularly challenging to uncover since they are scattered amongst the vast non-coding portion of the genome. Evolutionary sequence constraint can facilitate the discovery of enhancers, but fails to predict when and where they are active in vivo. Here, we performed chromatin immunoprecipitation with the enhancer-associated protein p300, followed by massively-parallel sequencing, to map several thousand in vivo binding sites of p300 in mouse embryonic forebrain, midbrain, and limb tissue. We tested 86 of these sequences in a transgenic mouse assay, which in nearly all cases revealed reproducible enhancer activity in those tissues predicted by p300 binding. Our results indicate that in vivo mapping of p300 binding is a highly accurate means for identifying enhancers and their associated activities and suggest that such datasets will be useful to study the role of tissue-specific enhancers in human biology and disease on a genome-wide scale.

  15. Use of the Fluidigm C1 platform for RNA sequencing of single mouse pancreatic islet cells.

    PubMed

    Xin, Yurong; Kim, Jinrang; Ni, Min; Wei, Yi; Okamoto, Haruka; Lee, Joseph; Adler, Christina; Cavino, Katie; Murphy, Andrew J; Yancopoulos, George D; Lin, Hsin Chieh; Gromada, Jesper

    2016-03-22

    This study provides an assessment of the Fluidigm C1 platform for RNA sequencing of single mouse pancreatic islet cells. The system combines microfluidic technology and nanoliter-scale reactions. We sequenced 622 cells, allowing identification of 341 islet cells with high-quality gene expression profiles. The cells clustered into populations of α-cells (5%), β-cells (92%), δ-cells (1%), and pancreatic polypeptide cells (2%). We identified cell-type-specific transcription factors and pathways primarily involved in nutrient sensing and oxidation and cell signaling. Unexpectedly, 281 cells had to be removed from the analysis due to low viability, low sequencing quality, or contamination resulting in the detection of more than one islet hormone. Collectively, we provide a resource for identification of high-quality gene expression datasets to help expand insights into genes and pathways characterizing islet cell types. We reveal limitations in the C1 Fluidigm cell capture process resulting in contaminated cells with altered gene expression patterns. This calls for caution when interpreting single-cell transcriptomics data using the C1 Fluidigm system.
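
    The abstract describes removing cells in which more than one islet hormone was detected. The sketch below shows one way such a filter could look. It is not the authors' procedure: the marker gene names, the expression threshold, and the toy values are all assumptions chosen for illustration. The last lines simply check the cell counts quoted above.

```python
# Hypothetical marker genes for the four islet hormones (names are assumptions).
HORMONE_MARKERS = {
    "Gcg": "alpha",
    "Ins2": "beta",
    "Sst": "delta",
    "Ppy": "PP",
}


def classify_cell(expression, threshold=100.0):
    """Label one cell from its hormone expression values.

    Returns the cell type if exactly one hormone reaches the threshold,
    "contaminated" if more than one does, and "unassigned" if none does.
    The threshold is an arbitrary illustrative cutoff.
    """
    detected = [
        gene for gene in HORMONE_MARKERS
        if expression.get(gene, 0.0) >= threshold
    ]
    if len(detected) == 1:
        return HORMONE_MARKERS[detected[0]]
    return "contaminated" if detected else "unassigned"


# Toy expression values (arbitrary units), invented for illustration.
toy_cells = {
    "cell_1": {"Ins2": 5000.0, "Gcg": 2.0},
    "cell_2": {"Gcg": 3200.0},
    "cell_3": {"Ins2": 4100.0, "Sst": 900.0},
    "cell_4": {"Ppy": 10.0},
}
labels = {cell_id: classify_cell(expr) for cell_id, expr in toy_cells.items()}
print(labels)

# Bookkeeping from the abstract: cells sequenced, removed, and retained.
sequenced, removed = 622, 281
retained = sequenced - removed
print(f"retained {retained} of {sequenced} cells ({100 * retained / sequenced:.0f}%)")
```

    With the abstract's numbers, 341 of 622 cells (about 55%) remain after filtering, which matches the count of high-quality islet cells reported.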

  16. cGRNB: a web server for building combinatorial gene regulatory networks through integrated engineering of seed-matching sequence information and gene expression datasets.

    PubMed

    Xu, Huayong; Yu, Hui; Tu, Kang; Shi, Qianqian; Wei, Chaochun; Li, Yuan-Yuan; Li, Yi-Xue

    2013-01-01

    We are witnessing rapid progress in the development of methodologies for building the combinatorial gene regulatory networks involving both TFs (Transcription Factors) and miRNAs (microRNAs). There are a few tools available to do these jobs but most of them are not easy to use and not accessible online. A web server is especially needed in order to allow users to upload experimental expression datasets and build combinatorial regulatory networks corresponding to their particular contexts. In this work, we compiled putative TF-gene, miRNA-gene and TF-miRNA regulatory relationships from forward-engineering pipelines and curated them as built-in data libraries. We streamlined the R codes of our two separate forward-and-reverse engineering algorithms for combinatorial gene regulatory network construction and formalized them as two major functional modules. As a result, we released the cGRNB (combinatorial Gene Regulatory Networks Builder): a web server for constructing combinatorial gene regulatory networks through integrated engineering of seed-matching sequence information and gene expression datasets. The cGRNB enables two major network-building modules, one for MPGE (miRNA-perturbed gene expression) datasets and the other for parallel miRNA/mRNA expression datasets. A miRNA-centered two-layer combinatorial regulatory cascade is the output of the first module and a comprehensive genome-wide network involving all three types of combinatorial regulations (TF-gene, TF-miRNA, and miRNA-gene) are the output of the second module. In this article we propose cGRNB, a web server for building combinatorial gene regulatory networks through integrated engineering of seed-matching sequence information and gene expression datasets. 
Because parallel miRNA/mRNA expression datasets are accumulating rapidly with advances in next-generation sequencing techniques, cGRNB will be a very useful tool for researchers to build combinatorial gene regulatory networks based on expression datasets. The cGRNB web server is free and available online at http://www.scbit.org/cgrnb.

  17. Bellerophon: A program to detect chimeric sequences in multiple sequence alignments

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Huber, Thomas; Faulkner, Geoffrey; Hugenholtz, Philip

    2003-12-23

    Bellerophon is a program for detecting chimeric sequences in multiple sequence datasets by an adaptation of partial treeing analysis. Bellerophon was specifically developed to detect 16S rRNA gene chimeras in PCR-clone libraries of environmental samples but can be applied to other nucleotide sequence alignments.

  18. Comprehensive discovery of noncoding RNAs in acute myeloid leukemia cell transcriptomes.

    PubMed

    Zhang, Jin; Griffith, Malachi; Miller, Christopher A; Griffith, Obi L; Spencer, David H; Walker, Jason R; Magrini, Vincent; McGrath, Sean D; Ly, Amy; Helton, Nichole M; Trissal, Maria; Link, Daniel C; Dang, Ha X; Larson, David E; Kulkarni, Shashikant; Cordes, Matthew G; Fronick, Catrina C; Fulton, Robert S; Klco, Jeffery M; Mardis, Elaine R; Ley, Timothy J; Wilson, Richard K; Maher, Christopher A

    2017-11-01

    To detect diverse and novel RNA species comprehensively, we compared deep small RNA and RNA sequencing (RNA-seq) methods applied to a primary acute myeloid leukemia (AML) sample. We were able to discover previously unannotated small RNAs using deep sequencing of a library method using broader insert size selection. We analyzed the long noncoding RNA (lncRNA) landscape in AML by comparing deep sequencing from multiple RNA-seq library construction methods for the sample that we studied and then integrating RNA-seq data from 179 AML cases. This identified lncRNAs that are completely novel, differentially expressed, and associated with specific AML subtypes. Our study revealed the complexity of the noncoding RNA transcriptome through a combined strategy of strand-specific small RNA and total RNA-seq. This dataset will serve as an invaluable resource for future RNA-based analyses. Copyright © 2017 ISEH – Society for Hematology and Stem Cells. Published by Elsevier Inc. All rights reserved.

  19. Molecular diversification of Trichuris spp. from Sigmodontinae (Cricetidae) rodents from Argentina based on mitochondrial DNA sequences.

    PubMed

    Callejón, Rocío; Robles, María Del Rosario; Panei, Carlos Javier; Cutillas, Cristina

    2016-08-01

    A molecular phylogenetic hypothesis is presented for the genus Trichuris based on sequence data from mitochondrial cytochrome c oxidase 1 (cox1) and cytochrome b (cob). The taxa consisted of nine populations of whipworm from five species of Sigmodontinae rodents from Argentina. Bayesian Inference, Maximum Parsimony, and Maximum Likelihood methods were used to infer phylogenies for each gene separately and also for the combined mitochondrial data and the combined mitochondrial and nuclear dataset. Phylogenetic results based on cox1 and cob mitochondrial DNA (mtDNA) revealed three strongly resolved clades corresponding to three different species (Trichuris navonae, Trichuris bainae, and Trichuris pardinasi) showing phylogeographic variation, but relationships among Trichuris species were poorly resolved. Phylogenetic reconstruction based on concatenated sequences had greater phylogenetic resolution for delimiting species and intraspecific populations of Trichuris than those based on partitioned genes. Thus, populations of T. bainae and T. pardinasi could be affected by geographical factors and parasite-host co-divergence.

  20. LFQC: a lossless compression algorithm for FASTQ files

    PubMed Central

    Nicolae, Marius; Pathak, Sudipta; Rajasekaran, Sanguthevar

    2015-01-01

    Motivation: Next Generation Sequencing (NGS) technologies have revolutionized genomic research by reducing the cost of whole genome sequencing. One of the biggest challenges posed by modern sequencing technology is economic storage of NGS data. Storing raw data is infeasible because of its enormous size and high redundancy. In this article, we address the problem of storage and transmission of large FASTQ files using innovative compression techniques. Results: We introduce a new lossless non-reference based FASTQ compression algorithm named Lossless FASTQ Compressor. We have compared our algorithm with other state of the art big data compression algorithms namely gzip, bzip2, fastqz (Bonfield and Mahoney, 2013), fqzcomp (Bonfield and Mahoney, 2013), Quip (Jones et al., 2012), DSRC2 (Roguski and Deorowicz, 2014). This comparison reveals that our algorithm achieves better compression ratios on LS454 and SOLiD datasets. Availability and implementation: The implementations are freely available for non-commercial purposes. They can be downloaded from http://engr.uconn.edu/rajasek/lfqc-v1.1.zip. Contact: rajasek@engr.uconn.edu PMID:26093148

  1. PlasFlow: predicting plasmid sequences in metagenomic data using genome signatures

    PubMed Central

    Lipinski, Leszek; Dziembowski, Andrzej

    2018-01-01

    Plasmids are mobile genetic elements that play an important role in the environmental adaptation of microorganisms. Although plasmids are usually analyzed in cultured microorganisms, there is a need for methods that allow for the analysis of pools of plasmids (plasmidomes) in environmental samples. To that end, several molecular biology and bioinformatics methods have been developed; however, they are limited to environments with low diversity and cannot recover large plasmids. Here, we present PlasFlow, a novel tool based on genomic signatures that employs a neural network approach for identification of bacterial plasmid sequences in environmental samples. PlasFlow can recover plasmid sequences from assembled metagenomes without any prior knowledge of the taxonomical or functional composition of samples with an accuracy up to 96%. It can also recover sequences of both circular and linear plasmids and can perform initial taxonomical classification of sequences. Compared to other currently available tools, PlasFlow demonstrated significantly better performance on test datasets. Analysis of two samples from heavy metal-contaminated microbial mats revealed that plasmids may constitute an important fraction of their metagenomes and carry genes involved in heavy-metal homeostasis, proving the pivotal role of plasmids in microorganism adaptation to environmental conditions. PMID:29346586

  2. A statistical approach to detection of copy number variations in PCR-enriched targeted sequencing data.

    PubMed

    Demidov, German; Simakova, Tamara; Vnuchkova, Julia; Bragin, Anton

    2016-10-22

    Multiplex polymerase chain reaction (PCR) is a common enrichment technique for targeted massive parallel sequencing (MPS) protocols. MPS is widely used in biomedical research and clinical diagnostics as a fast and accurate tool for the detection of short genetic variations. However, identification of larger variations such as structural variants and copy number variations (CNVs) remains a challenge for targeted MPS. Some approaches and tools for structural variant detection have been proposed, but they have limitations and often require datasets of a certain type, size and expected number of amplicons affected by CNVs. In this paper, we describe a novel algorithm for high-resolution germinal CNV detection in PCR-enriched targeted sequencing data and present an accompanying tool. We have developed a machine learning algorithm for the detection of large duplications and deletions in targeted sequencing data generated with a PCR-based enrichment step. We have performed verification studies and established the algorithm's sensitivity and specificity. We have compared the developed tool with other available methods applicable to the described data and found that it performs better. We showed that our method has high specificity and sensitivity for high-resolution copy number detection in targeted sequencing data using a large cohort of samples.

  3. XPAT: a toolkit to conduct cross-platform association studies with heterogeneous sequencing datasets.

    PubMed

    Yu, Yao; Hu, Hao; Bohlender, Ryan J; Hu, Fulan; Chen, Jiun-Sheng; Holt, Carson; Fowler, Jerry; Guthery, Stephen L; Scheet, Paul; Hildebrandt, Michelle A T; Yandell, Mark; Huff, Chad D

    2018-04-06

    High-throughput sequencing data are increasingly being made available to the research community for secondary analyses, providing new opportunities for large-scale association studies. However, heterogeneity in target capture and sequencing technologies often introduces strong technological stratification biases that overwhelm subtle signals of association in studies of complex traits. Here, we introduce the Cross-Platform Association Toolkit, XPAT, which provides a suite of tools designed to support and conduct large-scale association studies with heterogeneous sequencing datasets. XPAT includes tools to support cross-platform aware variant calling, quality control filtering, gene-based association testing and rare variant effect size estimation. To evaluate the performance of XPAT, we conducted case-control association studies for three diseases, including 783 breast cancer cases, 272 ovarian cancer cases, 205 Crohn disease cases and 3507 shared controls (including 1722 females) using sequencing data from multiple sources. XPAT greatly reduced Type I error inflation in the case-control analyses, while replicating many previously identified disease-gene associations. We also show that association tests conducted with XPAT using cross-platform data have comparable performance to tests using matched platform data. XPAT enables new association studies that combine existing sequencing datasets to identify genetic loci associated with common diseases and other complex traits.

  4. VaDiR: an integrated approach to Variant Detection in RNA.

    PubMed

    Neums, Lisa; Suenaga, Seiji; Beyerlein, Peter; Anders, Sara; Koestler, Devin; Mariani, Andrea; Chien, Jeremy

    2018-02-01

    Advances in next-generation DNA sequencing technologies are now enabling detailed characterization of sequence variations in cancer genomes. With whole-genome sequencing, variations in coding and non-coding sequences can be discovered, but its cost currently limits its general use in research. Whole-exome sequencing is used to characterize sequence variations in coding regions, but the cost associated with capture reagents and biases in capture rate limit its full use in research. Additional limitations include uncertainty in assigning the functional significance of the mutations when these mutations are observed in the non-coding region or in genes that are not expressed in cancer tissue. We investigated the feasibility of uncovering mutations from expressed genes using RNA sequencing datasets with a method called Variant Detection in RNA (VaDiR) that integrates 3 variant callers, namely SNPiR, RVBoost, and MuTect2. The combination of all 3 methods, which we called Tier 1 variants, produced the highest precision with true positive mutations from RNA-seq that could be validated at the DNA level. We also found that the integration of Tier 1 variants with those called by MuTect2 and SNPiR produced the highest recall with acceptable precision. Finally, we observed a higher rate of mutation discovery in genes that are expressed at higher levels. Our method, VaDiR, provides a possibility of uncovering mutations from RNA sequencing datasets that could be useful in further functional analysis. In addition, our approach allows orthogonal validation of DNA-based mutation discovery by providing complementary sequence variation analysis from paired RNA/DNA sequencing datasets.
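
    The tiering described in this abstract amounts to set operations on per-caller variant lists. A minimal Python sketch follows, assuming each caller's output has already been reduced to a set of (chromosome, position, ref, alt) tuples. The function name `tier_variants`, the dictionary keys, and the reading of the higher-recall set as Tier 1 plus the variants shared by MuTect2 and SNPiR are illustrative assumptions, not VaDiR's actual implementation.

```python
from typing import Dict, Set, Tuple

# Hypothetical variant key: (chromosome, position, reference allele, alternate allele).
Variant = Tuple[str, int, str, str]


def tier_variants(
    snpir: Set[Variant], rvboost: Set[Variant], mutect2: Set[Variant]
) -> Dict[str, Set[Variant]]:
    """Combine three callers' variant sets into consensus tiers (illustrative only)."""
    # Tier 1: variants reported by all three callers.
    tier1 = snpir & rvboost & mutect2
    # Assumed reading of the higher-recall set: Tier 1 plus variants that
    # MuTect2 and SNPiR both report, even when RVBoost does not.
    higher_recall = tier1 | (snpir & mutect2)
    return {"tier1": tier1, "higher_recall": higher_recall}


if __name__ == "__main__":
    a = ("chr1", 100, "A", "G")
    b = ("chr2", 200, "C", "T")
    c = ("chr3", 300, "G", "A")
    tiers = tier_variants(snpir={a, b}, rvboost={a, c}, mutect2={a, b, c})
    print(sorted(tiers["tier1"]))
    print(sorted(tiers["higher_recall"]))
```

    Under this reading Tier 1 is always a subset of the MuTect2/SNPiR intersection, so the higher-recall set can only add calls, which matches the precision/recall trade-off the abstract reports.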

  5. Characterization and Pathogenicity of Alternaria burnsii from Seeds of Cucurbita maxima (Cucurbitaceae) in Bangladesh.

    PubMed

    Paul, Narayan Chandra; Deng, Jian Xin; Lee, Hyang Burm; Yu, Seung-Hun

    2015-12-01

    In the course of a survey of endophytic fungi from Bangladesh pumpkin seeds in 2011~2012, two strains (CNU111042 and CNU111043) with similar colony characteristics were isolated and characterized by their morphology and by molecular phylogenetic analysis of the internal transcribed spacer, glyceraldehyde-3-phosphate dehydrogenase (gpd), and Alternaria allergen a1 (Alt a1) sequences. Phylogenetic analysis of all three sequences and their combined dataset revealed that the fungus formed a subclade within the A. alternata clade, matching A. burnsii and showing differences from its other closely related Alternaria species, such as A. longipes, A. tomato, and A. tomaticola. Long ellipsoid, obclavate or ovoid beakless conidia, shorter and thinner conidial size (16~60 [90] × 6.5~14 [~16] µm) distinguish this fungus from other related species. These isolates showed more transverse septation (2~11) and less longitudinal septation (0~3) than did other related species. Moreover, the isolate did not produce any diffusible pigment on media. Therefore, our results reveal that the newly recorded fungus from a new host, Cucurbita maxima, is Alternaria burnsii Uppal, Patel & Kamat.

  6. CscoreTool: fast Hi-C compartment analysis at high resolution.

    PubMed

    Zheng, Xiaobin; Zheng, Yixian

    2018-05-01

    Genome-wide chromosome conformation capture (Hi-C) has revealed that the eukaryotic genome can be partitioned into A and B compartments that have distinctive chromatin and transcription features. Current Principal Component Analysis (PCA)-based methods for A/B compartment prediction from Hi-C data require substantial CPU time and memory. We report the development of a method, CscoreTool, which enables fast and memory-efficient determination of A/B compartments at high resolution even in datasets with low sequencing depth. https://github.com/scoutzxb/CscoreTool. xzheng@carnegiescience.edu. Supplementary data are available at Bioinformatics online.

  7. Genomic Datasets for Cancer Research

    Cancer.gov

    A variety of datasets from genome-wide association studies of cancer and other genotype-phenotype studies, including sequencing and molecular diagnostic assays, are available to approved investigators through the Extramural National Cancer Institute Data Access Committee.

  8. Effectiveness of CID, HCD, and ETD with FT MS/MS for degradomic-peptidomic analysis: comparison of peptide identification methods

    PubMed Central

    Shen, Yufeng; Tolić, Nikola; Xie, Fang; Zhao, Rui; Purvine, Samuel O.; Schepmoes, Athena A.; Moore, Ronald J.; Anderson, Gordon A.; Smith, Richard D.

    2011-01-01

    We report on the effectiveness of CID, HCD, and ETD for LC-FT MS/MS analysis of peptides using a tandem linear ion trap-Orbitrap mass spectrometer. A range of software tools and analysis parameters were employed to explore the use of CID, HCD, and ETD to identify peptides isolated from human blood plasma without the use of specific “enzyme rules”. In the evaluation of an FDR-controlled SEQUEST scoring method, the use of accurate masses for fragments increased the numbers of identified peptides (by ~50%) compared to the use of conventional low accuracy fragment mass information, and CID provided the largest contribution to the identified peptide datasets compared to HCD and ETD. The FDR-controlled Mascot scoring method provided significantly fewer peptide identifications than with SEQUEST (by 1.3–2.3 fold) at the same confidence levels, and CID, HCD, and ETD provided similar contributions to identified peptides. Evaluation of de novo sequencing and the UStags method for more intense fragment ions revealed that HCD afforded more sequence consecutive residues (e.g., ≥7 amino acids) than either CID or ETD. Both the FDR-controlled SEQUEST and Mascot scoring methods provided peptide datasets that were affected by the decoy database and mass tolerances applied (e.g., the identical peptides between the datasets could be limited to ~70%), while the UStags method provided the most consistent peptide datasets (>90% overlap) with extremely low (near zero) numbers of false positive identifications. The m/z ranges in which CID, HCD, and ETD contributed the largest number of peptide identifications were substantially overlapping. This work suggests that the three peptide ion fragmentation methods are complementary, and that maximizing the number of peptide identifications benefits significantly from a careful match with the informatics tools and methods applied. These results also suggest that the decoy strategy may inaccurately estimate identification FDRs. PMID:21678914

  9. Molecular exploration of hidden diversity in the Indo-West Pacific sciaenid clade

    PubMed Central

    Lo, Pei-Chun; Liu, Shu-Hui; Nor, Siti Azizah Mohd

    2017-01-01

    The family Sciaenidae, known as croakers or drums, is one of the largest perciform fish families. A recent multi-gene based study investigating the phylogeny and biogeography of global sciaenids revealed that the origin and early diversification of this family occurred in tropical America during the Late Oligocene—Early Miocene before undergoing range expansions to other seas including the Indo-West Pacific, where high species richness is observed. Despite this clarification of the overall evolutionary history of the family, knowledge of the taxonomy and phylogeny of sciaenid genera endemic to the Indo-West Pacific is still limited due to lack of a thorough survey of all taxa. In this study, we used DNA-based approaches to investigate the evolutionary relationships, to explore the species diversity, and to elucidate the taxonomic status of sciaenid species/genera within the Indo-West Pacific clade. Three datasets were herein built for the above objectives: the combined dataset (248 samples from 45 currently recognized species) from one nuclear gene (RAG1) and one mitochondrial gene (COI); the dataset with only RAG1 gene sequences (245 samples from 44 currently recognized species); and the dataset with only COI gene sequences (308 samples from 51 currently recognized species). The latter was primarily used for our biodiversity exploration with two different species delimitation methods (Automatic Barcode Gap Discovery, ABGD and Generalized Mixed Yule Coalescent, GMYC). The results were further evaluated with help of four supplementary criteria for species delimitation (genetic similarity, monophyly inferred from individual gene and combined data trees, geographic distribution, and morphology). Our final results confirmed the validity of 32 currently recognized species and identified several potential new species waiting for formal descriptions. 
We also reexamined the taxonomic status of the genera, Larimichthys, Nibea, Protonibea and Megalonibea, and suggested a revision of Nibea and proposed a new genus Pseudolarimichthys. PMID:28453569

  10. Molecular exploration of hidden diversity in the Indo-West Pacific sciaenid clade.

    PubMed

    Lo, Pei-Chun; Liu, Shu-Hui; Nor, Siti Azizah Mohd; Chen, Wei-Jen

    2017-01-01

    The family Sciaenidae, known as croakers or drums, is one of the largest perciform fish families. A recent multi-gene based study investigating the phylogeny and biogeography of global sciaenids revealed that the origin and early diversification of this family occurred in tropical America during the Late Oligocene-Early Miocene before undergoing range expansions to other seas including the Indo-West Pacific, where high species richness is observed. Despite this clarification of the overall evolutionary history of the family, knowledge of the taxonomy and phylogeny of sciaenid genera endemic to the Indo-West Pacific is still limited due to lack of a thorough survey of all taxa. In this study, we used DNA-based approaches to investigate the evolutionary relationships, to explore the species diversity, and to elucidate the taxonomic status of sciaenid species/genera within the Indo-West Pacific clade. Three datasets were herein built for the above objectives: the combined dataset (248 samples from 45 currently recognized species) from one nuclear gene (RAG1) and one mitochondrial gene (COI); the dataset with only RAG1 gene sequences (245 samples from 44 currently recognized species); and the dataset with only COI gene sequences (308 samples from 51 currently recognized species). The latter was primarily used for our biodiversity exploration with two different species delimitation methods (Automatic Barcode Gap Discovery, ABGD and Generalized Mixed Yule Coalescent, GMYC). The results were further evaluated with help of four supplementary criteria for species delimitation (genetic similarity, monophyly inferred from individual gene and combined data trees, geographic distribution, and morphology). Our final results confirmed the validity of 32 currently recognized species and identified several potential new species waiting for formal descriptions. 
We also reexamined the taxonomic status of the genera, Larimichthys, Nibea, Protonibea and Megalonibea, and suggested a revision of Nibea and proposed a new genus Pseudolarimichthys.

  11. Prospecting for Natural Gas Hydrate in the Orca & Choctaw Basins in the Northern Gulf of Mexico

    NASA Astrophysics Data System (ADS)

    Cook, A.; Hillman, J. I. T.; Sawyer, D.; Frye, M.; Palmes, S.; Shedd, W. W.

    2016-12-01

    The Orca and Choctaw salt bounded mini-basins, which occur in 1.5 to 2.5 km water depth on the northern Gulf of Mexico slope, are currently under consideration as an IODP scientific drilling location for coarse-grained natural gas hydrate systems. We use a 3D seismic dataset for gas hydrate prospecting that covers parts of eleven lease blocks ( 200 km2) in the Walker Ridge protraction area. The study area includes the southern section of the Orca Basin and a smaller section of the northern Choctaw Basin. We have mapped a discontinuous bottom-simulating reflection (BSR) over nearly 30% of our seismic dataset, which varies significantly in both amplitude and depth throughout the area. The southeastern section of our dataset contains three positive impedance amplitude horizons with possible phase reversals at the BSR. Detailed mapping in the area also reveals at the base of gas hydrate stability, a complicated intercalation of an east-west trending fault system and an amalgamated deepwater depositional system comprising channel levee deposits and turbidite sheet sands. Three industry wells drilled in the southwestern section of our study area indicate that the sedimentary sequence infilling the basins consists of predominantly mud rich units with interbedded turbidite sands, forming a 2 km thick supra-salt sequence of late Miocene to Pleistocene sediments. Two of the industry wells have strong evidence for natural gas hydrate in clay-rich sediment, with moderate resistivity (between 2-10 Ωm) increases above background resistivity in zones that exceed 60 m thick. Additionally, the electromagnetic resistivity curves in these wells separate suggesting that the gas hydrate occurs in high-angle fractures. We will present our seismic dataset, our continuing analysis and selected drill sites in the Orca and Choctaw basins. 
Furthermore, our analysis in the southeastern section of the study area underscores the importance of interpreting faults when considering phase reversals in hydrate systems.

  12. Dimensions of biodiversity in the Earth mycobiome.

    PubMed

    Peay, Kabir G; Kennedy, Peter G; Talbot, Jennifer M

    2016-07-01

    Fungi represent a large proportion of the genetic diversity on Earth and fungal activity influences the structure of plant and animal communities, as well as rates of ecosystem processes. Large-scale DNA-sequencing datasets are beginning to reveal the dimensions of fungal biodiversity, which seem to be fundamentally different to bacteria, plants and animals. In this Review, we describe the patterns of fungal biodiversity that have been revealed by molecular-based studies. Furthermore, we consider the evidence that supports the roles of different candidate drivers of fungal diversity at a range of spatial scales, as well as the role of dispersal limitation in maintaining regional endemism and influencing local community assembly. Finally, we discuss the ecological mechanisms that are likely to be responsible for the high heterogeneity that is observed in fungal communities at local scales.

  13. Identification of characteristic oligonucleotides in the bacterial 16S ribosomal RNA sequence dataset

    NASA Technical Reports Server (NTRS)

    Zhang, Zhengdong; Willson, Richard C.; Fox, George E.

    2002-01-01

    MOTIVATION: The phylogenetic structure of the bacterial world has been intensively studied by comparing sequences of 16S ribosomal RNA (16S rRNA). This database of sequences is now widely used to design probes for the detection of specific bacteria or groups of bacteria one at a time. The success of such methods reflects the fact that there are local sequence segments that are highly characteristic of particular organisms or groups of organisms. It is not clear, however, the extent to which such signature sequences exist in the 16S rRNA dataset. A better understanding of the numbers and distribution of highly informative oligonucleotide sequences may facilitate the design of hybridization arrays that can characterize the phylogenetic position of an unknown organism or serve as the basis for the development of novel approaches for use in bacterial identification. RESULTS: A computer-based algorithm that characterizes the extent to which any individual oligonucleotide sequence in 16S rRNA is characteristic of any particular bacterial grouping was developed. A measure of signature quality, Q(s), was formulated and subsequently calculated for every individual oligonucleotide sequence in the size range of 5-11 nucleotides and for 15mers with reference to each cluster and subcluster in a 929 organism representative phylogenetic tree. Subsequently, the perfect signature sequences were compared to the full set of 7322 sequences to see how common false positives were. The work completed here establishes beyond any doubt that highly characteristic oligonucleotides exist in the bacterial 16S rRNA sequence dataset in large numbers. Over 16,000 15mers were identified that might be useful as signatures. Signature oligonucleotides are available for over 80% of the nodes in the representative tree.
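
    The idea of an oligonucleotide being characteristic of a grouping can be illustrated with a small Python sketch. It does not reproduce the paper's Q(s) measure, whose definition is not given in this abstract. Instead it reports, for each k-mer, the fraction of in-group and out-group sequences that contain it, and treats a k-mer found in every in-group sequence and no out-group sequence as a perfect signature. All function names and the toy sequences are hypothetical.

```python
from typing import Dict, List, Set, Tuple


def kmers(seq: str, k: int) -> Set[str]:
    """Return the set of all overlapping length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def signature_coverage(
    in_group: List[str], out_group: List[str], k: int
) -> Dict[str, Tuple[float, float]]:
    """Map each k-mer seen in the group to (in-group fraction, out-group fraction)."""
    in_sets = [kmers(s, k) for s in in_group]
    out_sets = [kmers(s, k) for s in out_group]
    candidates: Set[str] = set().union(*in_sets) if in_sets else set()
    coverage = {}
    for oligo in candidates:
        in_frac = sum(oligo in s for s in in_sets) / len(in_sets)
        # With no out-group sequences there can be no false positives.
        out_frac = sum(oligo in s for s in out_sets) / len(out_sets) if out_sets else 0.0
        coverage[oligo] = (in_frac, out_frac)
    return coverage


def perfect_signatures(in_group: List[str], out_group: List[str], k: int) -> List[str]:
    """k-mers present in every in-group sequence and absent from the out-group."""
    cov = signature_coverage(in_group, out_group, k)
    return sorted(o for o, (inside, outside) in cov.items() if inside == 1.0 and outside == 0.0)


if __name__ == "__main__":
    group = ["ACGUACGG", "UUACGUAA"]      # toy in-group sequences
    others = ["GGGGCCCC", "ACGGAAAA"]     # toy out-group sequences
    print(perfect_signatures(group, others, 5))
```

    Checking candidates against a larger out-group, as the abstract does with the full set of 7322 sequences, corresponds to growing `out_group` and watching the out-group fraction of each candidate.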

  14. kmer-SVM: a web server for identifying predictive regulatory sequence features in genomic data sets

    PubMed Central

    Fletez-Brant, Christopher; Lee, Dongwon; McCallion, Andrew S.; Beer, Michael A.

    2013-01-01

    Massively parallel sequencing technologies have made the generation of genomic data sets a routine component of many biological investigations. For example, Chromatin immunoprecipitation followed by sequence assays detect genomic regions bound (directly or indirectly) by specific factors, and DNase-seq identifies regions of open chromatin. A major bottleneck in the interpretation of these data is the identification of the underlying DNA sequence code that defines, and ultimately facilitates prediction of, these transcription factor (TF) bound or open chromatin regions. We have recently developed a novel computational methodology, which uses a support vector machine (SVM) with kmer sequence features (kmer-SVM) to identify predictive combinations of short transcription factor-binding sites, which determine the tissue specificity of these genomic assays (Lee, Karchin and Beer, Discriminative prediction of mammalian enhancers from DNA sequence. Genome Res. 2011; 21:2167–80). This regulatory information can (i) give confidence in genomic experiments by recovering previously known binding sites, and (ii) reveal novel sequence features for subsequent experimental testing of cooperative mechanisms. Here, we describe the development and implementation of a web server to allow the broader research community to independently apply our kmer-SVM to analyze and interpret their genomic datasets. We analyze five recently published data sets and demonstrate how this tool identifies accessory factors and repressive sequence elements. kmer-SVM is available at http://kmersvm.beerlab.org. PMID:23771147

  15. kmer-SVM: a web server for identifying predictive regulatory sequence features in genomic data sets.

    PubMed

    Fletez-Brant, Christopher; Lee, Dongwon; McCallion, Andrew S; Beer, Michael A

    2013-07-01

    Massively parallel sequencing technologies have made the generation of genomic data sets a routine component of many biological investigations. For example, Chromatin immunoprecipitation followed by sequence assays detect genomic regions bound (directly or indirectly) by specific factors, and DNase-seq identifies regions of open chromatin. A major bottleneck in the interpretation of these data is the identification of the underlying DNA sequence code that defines, and ultimately facilitates prediction of, these transcription factor (TF) bound or open chromatin regions. We have recently developed a novel computational methodology, which uses a support vector machine (SVM) with kmer sequence features (kmer-SVM) to identify predictive combinations of short transcription factor-binding sites, which determine the tissue specificity of these genomic assays (Lee, Karchin and Beer, Discriminative prediction of mammalian enhancers from DNA sequence. Genome Res. 2011; 21:2167-80). This regulatory information can (i) give confidence in genomic experiments by recovering previously known binding sites, and (ii) reveal novel sequence features for subsequent experimental testing of cooperative mechanisms. Here, we describe the development and implementation of a web server to allow the broader research community to independently apply our kmer-SVM to analyze and interpret their genomic datasets. We analyze five recently published data sets and demonstrate how this tool identifies accessory factors and repressive sequence elements. kmer-SVM is available at http://kmersvm.beerlab.org.
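
    The k-mer sequence features mentioned in this abstract can be sketched as a fixed-length count vector per sequence. The Python below is a minimal illustration under stated assumptions: it counts overlapping k-mers, skips windows containing characters outside the alphabet, and scales the vector to unit length. The function name `kmer_feature_vector` is hypothetical, and the SVM training step and any handling of reverse complements used by the actual kmer-SVM server are not shown.

```python
from collections import Counter
from itertools import product
from math import sqrt
from typing import List


def kmer_feature_vector(seq: str, k: int, alphabet: str = "ACGT") -> List[float]:
    """Count overlapping k-mers in seq and return a unit-length feature vector.

    Features are ordered lexicographically over all len(alphabet)**k possible
    k-mers; windows with characters outside the alphabet (e.g. N) are ignored.
    """
    vocab = ["".join(p) for p in product(alphabet, repeat=k)]
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    raw = [float(counts.get(kmer, 0)) for kmer in vocab]
    norm = sqrt(sum(x * x for x in raw))
    # Leave an all-zero vector untouched to avoid dividing by zero.
    return [x / norm for x in raw] if norm > 0 else raw


if __name__ == "__main__":
    vec = kmer_feature_vector("ACGTACGT", k=2)
    print(len(vec))                        # one feature per possible 2-mer
    print([round(x, 3) for x in vec if x > 0])
```

    Vectors like these, computed for bound and unbound regions, are the kind of input a linear or kernel SVM could be trained on; the feature dimension grows as 4**k, so k is usually kept small.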

  16. Simultaneous bright‐ and black‐blood whole‐heart MRI for noncontrast enhanced coronary lumen and thrombus visualization

    PubMed Central

    Neji, Radhouene; Phinikaridou, Alkystis; Whitaker, John; Botnar, René M.; Prieto, Claudia

    2017-01-01

    Purpose To develop a 3D whole‐heart Bright‐blood and black‐blOOd phase SensiTive (BOOST) inversion recovery sequence for simultaneous noncontrast enhanced coronary lumen and thrombus/hemorrhage visualization. Methods The proposed sequence alternates the acquisition of two bright‐blood datasets preceded by different preparatory pulses to obtain variations in blood/myocardium contrast, which then are combined in a phase‐sensitive inversion recovery (PSIR)‐like reconstruction to obtain a third, coregistered, black‐blood dataset. The bright‐blood datasets are used for both visualization of the coronary lumen and motion estimation, whereas the complementary black‐blood dataset potentially allows for thrombus/hemorrhage visualization. Furthermore, integration with 2D image‐based navigation enables 100% scan efficiency and predictable scan times. The proposed sequence was compared to conventional coronary MR angiography (CMRA) and PSIR sequences in a standardized phantom and in healthy subjects. Feasibility for thrombus depiction was tested ex vivo. Results With BOOST, the coronary lumen is visualized with significantly higher (P < 0.05) contrast‐to‐noise ratio and vessel sharpness when compared to conventional CMRA. Furthermore, BOOST showed effective blood signal suppression as well as feasibility for thrombus visualization ex vivo. Conclusion A new PSIR sequence for noncontrast enhanced simultaneous coronary lumen and thrombus/hemorrhage detection was developed. The sequence provided improved coronary lumen depiction and showed potential for thrombus visualization. Magn Reson Med 79:1460–1472, 2018. © 2017 International Society for Magnetic Resonance in Medicine. This is an open access article under the terms of the Creative Commons Attribution License, which permits use, distribution and reproduction in any medium, provided the original work is properly cited. PMID:28722267

  17. Bicycle: a bioinformatics pipeline to analyze bisulfite sequencing data.

    PubMed

    Graña, Osvaldo; López-Fernández, Hugo; Fdez-Riverola, Florentino; González Pisano, David; Glez-Peña, Daniel

    2018-04-15

    High-throughput sequencing of bisulfite-converted DNA is a technique used to measure DNA methylation levels. Although a considerable number of computational pipelines have been developed to analyze such data, none of them tackles all the peculiarities of the analysis together, revealing limitations that can force the user to manually perform additional steps needed for a complete processing of the data. This article presents bicycle, an integrated, flexible analysis pipeline for bisulfite sequencing data. Bicycle analyzes whole genome bisulfite sequencing data, targeted bisulfite sequencing data and hydroxymethylation data. To show how bicycle overtakes other available pipelines, we compared them on a defined number of features that are summarized in a table. We also tested bicycle with both simulated and real datasets, to show its level of performance, and compared it to different state-of-the-art methylation analysis pipelines. Bicycle is publicly available under GNU LGPL v3.0 license at http://www.sing-group.org/bicycle. Users can also download a customized Ubuntu LiveCD including bicycle and other bisulfite sequencing data pipelines compared here. In addition, a docker image with bicycle and its dependencies, which allows a straightforward use of bicycle in any platform (e.g. Linux, OS X or Windows), is also available. Contact: ograna@cnio.es or dgpena@uvigo.es. Supplementary data are available at Bioinformatics online.

  18. A multitask clustering approach for single-cell RNA-seq analysis in Recessive Dystrophic Epidermolysis Bullosa

    PubMed Central

    Petegrosso, Raphael; Tolar, Jakub

    2018-01-01

    Single-cell RNA sequencing (scRNA-seq) has been widely applied to discover new cell types by detecting sub-populations in a heterogeneous group of cells. Since scRNA-seq experiments have lower read coverage/tag counts and introduce more technical biases compared to bulk RNA-seq experiments, the limited number of sampled cells combined with the experimental biases and other dataset specific variations presents a challenge to cross-dataset analysis and discovery of relevant biological variations across multiple cell populations. In this paper, we introduce a method of variance-driven multitask clustering of single-cell RNA-seq data (scVDMC) that utilizes multiple single-cell populations from biological replicates or different samples. scVDMC clusters single cells in multiple scRNA-seq experiments of similar cell types and markers but varying expression patterns such that the scRNA-seq data are better integrated than typical pooled analyses which only increase the sample size. By controlling the variance among the cell clusters within each dataset and across all the datasets, scVDMC detects cell sub-populations in each individual experiment with shared cell-type markers but varying cluster centers among all the experiments. Applied to two real scRNA-seq datasets with several replicates and one large-scale droplet-based dataset on three patient samples, scVDMC more accurately detected cell populations and known cell markers than pooled clustering and other recently proposed scRNA-seq clustering methods. In the case study applied to in-house Recessive Dystrophic Epidermolysis Bullosa (RDEB) scRNA-seq data, scVDMC revealed several new cell types and unknown markers validated by flow cytometry. MATLAB/Octave code available at https://github.com/kuanglab/scVDMC. PMID:29630593

  19. A high level interface to SCOP and ASTRAL implemented in python.

    PubMed

    Casbon, James A; Crooks, Gavin E; Saqi, Mansoor A S

    2006-01-10

    Benchmarking algorithms in structural bioinformatics often involves the construction of datasets of proteins with given sequence and structural properties. The SCOP database is a manually curated structural classification which groups together proteins on the basis of structural similarity. The ASTRAL compendium provides non redundant subsets of SCOP domains on the basis of sequence similarity such that no two domains in a given subset share more than a defined degree of sequence similarity. Taken together these two resources provide a 'ground truth' for assessing structural bioinformatics algorithms. We present a small and easy to use API written in python to enable construction of datasets from these resources. We have designed a set of python modules to provide an abstraction of the SCOP and ASTRAL databases. The modules are designed to work as part of the Biopython distribution. Python users can now manipulate and use the SCOP hierarchy from within python programs, and use ASTRAL to return sequences of domains in SCOP, as well as clustered representations of SCOP from ASTRAL. The modules make the analysis and generation of datasets for use in structural genomics easier and more principled.

  20. Study of infectious diseases in archaeological bone material - A dataset.

    PubMed

    Pucu, Elisa; Cascardo, Paula; Chame, Marcia; Felice, Gisele; Guidon, Niéde; Cleonice Vergne, Maria; Campos, Guadalupe; Roberto Machado-Silva, José; Leles, Daniela

    2017-08-01

    Bones of human and ground sloth remains were analyzed for presence of Trypanosoma cruzi by conventional PCR using primers TC, TC1 and TC2. Sequence results amplified a fragment with the same product size as the primers (300 and 350 bp). Amplified PCR product was sequenced and analyzed on GenBank, using Blast. Although these sequences did not match with these parasites they showed high amplification with species of bacteria. This article presents the methodology used and the alignment of the sequences. The display of this dataset will allow further analysis of our results and discussion presented in the manuscript "Finding the unexpected: a critical view on molecular diagnosis of infectious diseases in archaeological samples" (Pucu et al. 2017) [1].

  1. Conserved Features in the Structure, Mechanism, and Biogenesis of the Inverse Autotransporter Protein Family

    PubMed Central

    Heinz, Eva; Stubenrauch, Christopher J.; Grinter, Rhys; Croft, Nathan P.; Purcell, Anthony W.; Strugnell, Richard A.; Dougan, Gordon; Lithgow, Trevor

    2016-01-01

    The bacterial cell surface proteins intimin and invasin are virulence factors that share a common domain structure and bind selectively to host cell receptors in the course of bacterial pathogenesis. The β-barrel domains of intimin and invasin show significant sequence and structural similarities. Conversely, a variety of proteins with sometimes limited sequence similarity have also been annotated as “intimin-like” and “invasin” in genome datasets, while other recent work on apparently unrelated virulence-associated proteins ultimately revealed similarities to intimin and invasin. Here we characterize the sequence and structural relationships across this complex protein family. Surprisingly, intimins and invasins represent a very small minority of the sequence diversity in what has been previously the “intimin/invasin protein family”. Analysis of the assembly pathway for expression of the classic intimin, EaeA, and a characteristic example of the most prevalent members of the group, FdeC, revealed a dependence on the translocation and assembly module as a common feature for both these proteins. While the majority of the sequences in the grouping are most similar to FdeC, a further and widespread group is two-partner secretion systems that use the β-barrel domain as the delivery device for secretion of a variety of virulence factors. This comprehensive analysis supports the adoption of the “inverse autotransporter protein family” as the most accurate nomenclature for the family and, in turn, has important consequences for our overall understanding of the Type V secretion systems of bacterial pathogens. PMID:27190006

  2. LS³: A Method for Improving Phylogenomic Inferences When Evolutionary Rates Are Heterogeneous among Taxa

    PubMed Central

    Rivera-Rivera, Carlos J.; Montoya-Burgos, Juan I.

    2016-01-01

    Phylogenetic inference artifacts can occur when sequence evolution deviates from assumptions made by the models used to analyze them. The combination of strong model assumption violations and highly heterogeneous lineage evolutionary rates can become problematic in phylogenetic inference, and lead to the well-described long-branch attraction (LBA) artifact. Here, we define an objective criterion for assessing lineage evolutionary rate heterogeneity among predefined lineages: the result of a likelihood ratio test between a model in which the lineages evolve at the same rate (homogeneous model) and a model in which different lineage rates are allowed (heterogeneous model). We implement this criterion in the algorithm Locus Specific Sequence Subsampling (LS³), aimed at reducing the effects of LBA in multi-gene datasets. For each gene, LS³ sequentially removes the fastest-evolving taxon of the ingroup and tests for lineage rate homogeneity until all lineages have uniform evolutionary rates. The sequences excluded from the homogeneously evolving taxon subset are flagged as potentially problematic. The software implementation provides the user with the possibility to remove the flagged sequences for generating a new concatenated alignment. We tested LS³ with simulations and two real datasets containing LBA artifacts: a nucleotide dataset regarding the position of Glires within mammals and an amino-acid dataset concerning the position of nematodes within bilaterians. The initially incorrect phylogenies were corrected in all cases upon removing data flagged by LS³. PMID:26912812
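The per-gene removal loop described in the abstract above can be sketched roughly as follows. This is a minimal illustration, not the LS³ implementation: the function name `ls3_subsample`, the `is_homogeneous` callback, and the toy rates are all hypothetical. In particular, the callback stands in for the likelihood ratio test between the homogeneous and heterogeneous rate models, and a real analysis would presumably re-estimate rates after each removal.

```python
def ls3_subsample(rates, is_homogeneous, min_taxa=2):
    """Sequentially drop the fastest-evolving ingroup taxon for one gene.

    rates          -- dict mapping taxon name -> estimated evolutionary rate
                      (hypothetical input; LS³ works from a fitted tree)
    is_homogeneous -- callback standing in for the likelihood ratio test;
                      returns True when the remaining taxa pass
    Returns (kept taxa, flagged taxa in order of removal).
    """
    kept = dict(rates)
    flagged = []
    while len(kept) > min_taxa and not is_homogeneous(kept):
        fastest = max(kept, key=kept.get)   # fastest-evolving remaining taxon
        flagged.append(fastest)             # flagged as potentially problematic
        del kept[fastest]
    return sorted(kept), flagged


# Toy stand-in for the test: accept when the fastest and slowest rates
# differ by less than a factor of 1.5 (an arbitrary illustrative threshold).
def toy_test(kept):
    return max(kept.values()) / min(kept.values()) < 1.5


toy_rates = {"A": 1.0, "B": 1.1, "C": 0.9, "D": 3.5, "E": 2.4}
kept, flagged = ls3_subsample(toy_rates, toy_test)
print(kept, flagged)
```

With these toy values the two fastest taxa are flagged and the remaining three pass the stand-in test; the flagged sequences are the ones a user could then remove before building a new concatenated alignment.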

  3. A Snapshot of a Coral “Holobiont”: A Transcriptome Assembly of the Scleractinian Coral, Porites, Captures a Wide Variety of Genes from Both the Host and Symbiotic Zooxanthellae

    PubMed Central

    Shinzato, Chuya; Inoue, Mayuri; Kusakabe, Makoto

    2014-01-01

    Massive scleractinian corals of the genus Porites are important reef builders in the Indo-Pacific, and they are more resistant to thermal stress than other stony corals, such as the genus Acropora. Because coral health and survival largely depend on the interaction between a coral host and its symbionts, it is important to understand the molecular interactions of an entire “coral holobiont”. We simultaneously sequenced transcriptomes of Porites australiensis and its symbionts using the Illumina Hiseq2000 platform. We obtained 14.3 Gbp of sequencing data and assembled it into 74,997 contigs (average: 1,263 bp, N50 size: 2,037 bp). We successfully distinguished contigs originating from the host (Porites) and the symbiont (Symbiodinium) by aligning nucleotide sequences with the decoded Acropora digitifera and Symbiodinium minutum genomes. In contrast to previous coral transcriptome studies, at least 35% of the sequences were found to have originated from the symbionts, indicating that it is possible to analyze both host and symbiont transcriptomes simultaneously. Conserved protein domain and KEGG analyses showed that the dataset contains broad gene repertoires of both Porites and Symbiodinium. Effective utilization of sequence reads revealed that the polymorphism rate in P. australiensis is 1.0% and identified the major symbiotic Symbiodinium as Type C15. Analyses of amino acid biosynthetic pathways suggested that this Porites holobiont is probably able to synthesize most of the common amino acids and that Symbiodinium is potentially able to provide essential amino acids to its host. We believe this to be the first molecular evidence of complementarity in amino acid metabolism between coral hosts and their symbionts. We successfully assembled genes originating from both the host coral and the symbiotic Symbiodinium to create a snapshot of the coral holobiont transcriptome. 
This dataset will facilitate a deeper understanding of molecular mechanisms of coral symbioses and stress responses. PMID:24454815
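For readers unfamiliar with the N50 statistic quoted in the abstract above, the following minimal sketch shows how it is commonly computed: contigs are sorted from longest to shortest, and N50 is the length of the contig at which the running total first reaches half of the assembly size. The function name and the toy contig lengths are illustrative assumptions, not data from the study.

```python
def n50(contig_lengths):
    """Return the N50 of a list of contig lengths (in bp)."""
    if not contig_lengths:
        raise ValueError("need at least one contig length")
    total = sum(contig_lengths)
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if 2 * running >= total:   # reached half of the total assembly size
            return length


# Toy example: total = 54, so the halfway point is 27 (10 + 9 + 8).
toy_lengths = [2, 3, 4, 5, 6, 7, 8, 9, 10]
print(n50(toy_lengths))
```

Because N50 is weighted by length, it can sit well above the mean contig length, which is consistent with the abstract reporting a larger N50 than average.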

  4. A snapshot of a coral "holobiont": a transcriptome assembly of the scleractinian coral, porites, captures a wide variety of genes from both the host and symbiotic zooxanthellae.

    PubMed

    Shinzato, Chuya; Inoue, Mayuri; Kusakabe, Makoto

    2014-01-01

    Massive scleractinian corals of the genus Porites are important reef builders in the Indo-Pacific, and they are more resistant to thermal stress than other stony corals, such as the genus Acropora. Because coral health and survival largely depend on the interaction between a coral host and its symbionts, it is important to understand the molecular interactions of an entire "coral holobiont". We simultaneously sequenced transcriptomes of Porites australiensis and its symbionts using the Illumina Hiseq2000 platform. We obtained 14.3 Gbp of sequencing data and assembled it into 74,997 contigs (average: 1,263 bp, N50 size: 2,037 bp). We successfully distinguished contigs originating from the host (Porites) and the symbiont (Symbiodinium) by aligning nucleotide sequences with the decoded Acropora digitifera and Symbiodinium minutum genomes. In contrast to previous coral transcriptome studies, at least 35% of the sequences were found to have originated from the symbionts, indicating that it is possible to analyze both host and symbiont transcriptomes simultaneously. Conserved protein domain and KEGG analyses showed that the dataset contains broad gene repertoires of both Porites and Symbiodinium. Effective utilization of sequence reads revealed that the polymorphism rate in P. australiensis is 1.0% and identified the major symbiotic Symbiodinium as Type C15. Analyses of amino acid biosynthetic pathways suggested that this Porites holobiont is probably able to synthesize most of the common amino acids and that Symbiodinium is potentially able to provide essential amino acids to its host. We believe this to be the first molecular evidence of complementarity in amino acid metabolism between coral hosts and their symbionts. We successfully assembled genes originating from both the host coral and the symbiotic Symbiodinium to create a snapshot of the coral holobiont transcriptome. 
This dataset will facilitate a deeper understanding of molecular mechanisms of coral symbioses and stress responses.

  5. Identifying Differentially Abundant Metabolic Pathways in Metagenomic Datasets

    NASA Astrophysics Data System (ADS)

    Liu, Bo; Pop, Mihai

    Enabled by rapid advances in sequencing technology, metagenomic studies aim to characterize entire communities of microbes bypassing the need for culturing individual bacterial members. One major goal of such studies is to identify specific functional adaptations of microbial communities to their habitats. Here we describe a powerful analytical method (MetaPath) that can identify differentially abundant pathways in metagenomic data-sets, relying on a combination of metagenomic sequence data and prior metabolic pathway knowledge. We show that MetaPath outperforms other common approaches when evaluated on simulated datasets. We also demonstrate the power of our methods in analyzing two, publicly available, metagenomic datasets: a comparison of the gut microbiome of obese and lean twins; and a comparison of the gut microbiome of infant and adult subjects. We demonstrate that the subpathways identified by our method provide valuable insights into the biological activities of the microbiome.

  6. Simultaneous mutation and copy number variation (CNV) detection by multiplex PCR-based GS-FLX sequencing.

    PubMed

    Goossens, Dirk; Moens, Lotte N; Nelis, Eva; Lenaerts, An-Sofie; Glassee, Wim; Kalbe, Andreas; Frey, Bruno; Kopal, Guido; De Jonghe, Peter; De Rijk, Peter; Del-Favero, Jurgen

    2009-03-01

    We evaluated multiplex PCR amplification as a front-end for high-throughput sequencing, to widen the applicability of massive parallel sequencers for the detailed analysis of complex genomes. Using multiplex PCR reactions, we sequenced the complete coding regions of seven genes implicated in peripheral neuropathies in 40 individuals on a GS-FLX genome sequencer (Roche). The resulting dataset showed highly specific and uniform amplification. Comparison of the GS-FLX sequencing data with the dataset generated by Sanger sequencing confirmed the detection of all variants present and proved the sensitivity of the method for mutation detection. In addition, we showed that we could exploit the multiplexed PCR amplicons to determine individual copy number variation (CNV), increasing the spectrum of detected variations to both genetic and genomic variants. We conclude that our straightforward procedure substantially expands the applicability of the massive parallel sequencers for sequencing projects of a moderate number of amplicons (50-500) with typical applications in resequencing exons in positional or functional candidate regions and molecular genetic diagnostics. © 2008 Wiley-Liss, Inc.

  7. Morphological Identification and Single-Cell Genomics of Marine Diplonemids.

    PubMed

    Gawryluk, Ryan M R; Del Campo, Javier; Okamoto, Noriko; Strassert, Jürgen F H; Lukeš, Julius; Richards, Thomas A; Worden, Alexandra Z; Santoro, Alyson E; Keeling, Patrick J

    2016-11-21

    Recent global surveys of marine biodiversity have revealed that a group of organisms known as "marine diplonemids" constitutes one of the most abundant and diverse planktonic lineages [1]. Though discovered over a decade ago [2, 3], their potential importance was unrecognized, and our knowledge remains restricted to a single gene amplified from environmental DNA, the 18S rRNA gene (small subunit [SSU]). Here, we use single-cell genomics (SCG) and microscopy to characterize ten marine diplonemids, isolated from a range of depths in the eastern North Pacific Ocean. Phylogenetic analysis confirms that the isolates reflect the entire range of marine diplonemid diversity, and comparisons to environmental SSU surveys show that sequences from the isolates range from rare to superabundant, including the single most common marine diplonemid known. SCG generated a total of ∼915 Mbp of assembled sequence across all ten cells and ∼4,000 protein-coding genes with homologs in the Kyoto Encyclopedia of Genes and Genomes (KEGG) orthology database, distributed across categories expected for heterotrophic protists. Models of highly conserved genes indicate a high density of non-canonical introns, lacking conventional GT-AG splice sites. Mapping metagenomic datasets [4] to SCG assemblies reveals virtually no overlap, suggesting that nuclear genomic diversity is too great for representative SCG data to provide meaningful phylogenetic context to metagenomic datasets. This work provides an entry point to the future identification, isolation, and cultivation of these elusive yet ecologically important cells. The high density of nonconventional introns, however, also portends difficulty in generating accurate gene models and highlights the need for the establishment of stable cultures and transcriptomic analyses. Copyright © 2016 Elsevier Ltd. All rights reserved.

  8. De novo transcriptome assembly databases for the butterfly orchid Phalaenopsis equestris

    PubMed Central

    Niu, Shan-Ce; Xu, Qing; Zhang, Guo-Qiang; Zhang, Yong-Qiang; Tsai, Wen-Chieh; Hsu, Jui-Ling; Liang, Chieh-Kai; Luo, Yi-Bo; Liu, Zhong-Jian

    2016-01-01

    Orchids are renowned for their spectacular flowers and ecological adaptations. After the sequencing of the genome of the tropical epiphytic orchid Phalaenopsis equestris, we combined Illumina HiSeq2000 for RNA-Seq and Trinity for de novo assembly to characterize the transcriptomes for 11 diverse P. equestris tissues representing the root, stem, leaf, flower buds, column, lip, petal, sepal and three developmental stages of seeds. Our aims were to contribute to a better understanding of the molecular mechanisms driving the analysed tissue characteristics and to enrich the available data for P. equestris. Here, we present three databases. The first dataset is the RNA-Seq raw reads, which can be used to execute new experiments with different analysis approaches. The other two datasets allow different types of searches for candidate homologues. The second dataset includes the sets of assembled unigenes and predicted coding sequences and proteins, enabling a sequence-based search. The third dataset consists of the annotation results of the aligned unigenes versus the Nonredundant (Nr) protein database, Kyoto Encyclopaedia of Genes and Genomes (KEGG) and Clusters of Orthologous Groups (COG) databases with low e-values, enabling a name-based search. PMID:27673730

  9. A high HIV-1 strain variability in London, UK, revealed by full-genome analysis: Results from the ICONIC project.

    PubMed

    Yebra, Gonzalo; Frampton, Dan; Gallo Cassarino, Tiziano; Raffle, Jade; Hubb, Jonathan; Ferns, R Bridget; Waters, Laura; Tong, C Y William; Kozlakidis, Zisis; Hayward, Andrew; Kellam, Paul; Pillay, Deenan; Clark, Duncan; Nastouli, Eleni; Leigh Brown, Andrew J

    2018-01-01

    The ICONIC project has developed an automated high-throughput pipeline to generate HIV nearly full-length genomes (NFLG, i.e. from gag to nef) from next-generation sequencing (NGS) data. The pipeline was applied to 420 HIV samples collected at University College London Hospitals NHS Trust and Barts Health NHS Trust (London) and sequenced using an Illumina MiSeq at the Wellcome Trust Sanger Institute (Cambridge). Consensus genomes were generated and subtyped using COMET, and unique recombinants were studied with jpHMM and SimPlot. Maximum-likelihood phylogenetic trees were constructed using RAxML to identify transmission networks using the Cluster Picker. The pipeline generated sequences of at least 1Kb of length (median = 7.46Kb, IQR = 4.01Kb) for 375 out of the 420 samples (89%), with 174 (46.4%) being NFLG. A total of 365 sequences (169 of them NFLG) corresponded to unique subjects and were included in the down-stream analyses. The most frequent HIV subtypes were B (n = 149, 40.8%) and C (n = 77, 21.1%) and the circulating recombinant form CRF02_AG (n = 32, 8.8%). We found 14 different CRFs (n = 66, 18.1%) and multiple URFs (n = 32, 8.8%) that involved recombination between 12 different subtypes/CRFs. The most frequent URFs were B/CRF01_AE (4 cases) and A1/D, B/C, and B/CRF02_AG (3 cases each). Most URFs (19/26, 73%) lacked breakpoints in the PR+RT pol region, rendering them undetectable if only that was sequenced. Twelve (37.5%) of the URFs could have emerged within the UK, whereas the rest were probably imported from sub-Saharan Africa, South East Asia and South America. For 2 URFs we found highly similar pol sequences circulating in the UK. We detected 31 phylogenetic clusters using the full dataset: 25 pairs (mostly subtypes B and C), 4 triplets and 2 quadruplets. Some of these were not consistent across different genes due to inter- and intra-subtype recombination. Clusters involved 70 sequences, 19.2% of the dataset. 
The initial analysis of genome sequences detected substantial hidden variability in the London HIV epidemic. Analysing full genome sequences, as opposed to only PR+RT, identified previously undetected recombinants. It provided a more reliable description of CRFs (that would be otherwise misclassified) and transmission clusters.

  10. Fast tandem mass spectra-based protein identification regardless of the number of spectra or potential modifications examined.

    PubMed

    Falkner, Jayson; Andrews, Philip

    2005-05-15

    Comparing tandem mass spectra (MSMS) against a known dataset of protein sequences is a common method for identifying unknown proteins; however, the processing of MSMS by current software often limits certain applications, including comprehensive coverage of post-translational modifications, non-specific searches and real-time searches to allow result-dependent instrument control. This problem deserves attention as new mass spectrometers provide the ability for higher throughput and as known protein datasets rapidly grow in size. New software algorithms need to be devised in order to address the performance issues of conventional MSMS protein dataset-based protein identification. This paper describes a novel algorithm based on converting a collection of monoisotopic, centroided spectra to a new data structure, named 'peptide finite state machine' (PFSM), which may be used to rapidly search a known dataset of protein sequences, regardless of the number of spectra searched or the number of potential modifications examined. The algorithm is verified using a set of commercially available tryptic digest protein standards analyzed using an ABI 4700 MALDI TOFTOF mass spectrometer, and a free, open source PFSM implementation. It is illustrated that a PFSM can accurately search large collections of spectra against large datasets of protein sequences (e.g. NCBI nr) using a regular desktop PC; however, this paper only details the method for identifying peptide and subsequently protein candidates from a dataset of known protein sequences. The concept of using a PFSM as a peptide pre-screening technique for MSMS-based search engines is validated by using PFSM with Mascot and XTandem. Complete source code, documentation and examples for the reference PFSM implementation are freely available at the Proteome Commons, http://www.proteomecommons.org and source code may be used both commercially and non-commercially as long as the original authors are credited for their work.

  11. Genome-wide characterization of differential transcript usage in Arabidopsis thaliana.

    PubMed

    Vaneechoutte, Dries; Estrada, April R; Lin, Ying-Chen; Loraine, Ann E; Vandepoele, Klaas

    2017-12-01

    Alternative splicing and the usage of alternate transcription start or stop sites allow a single gene to produce multiple transcript isoforms. Most plant genes express certain isoforms at a significantly higher level than others, but under specific conditions this expression dominance can change, resulting in a different set of dominant isoforms. These events of differential transcript usage (DTU) have been observed for thousands of Arabidopsis thaliana, Zea mays and Vitis vinifera genes, and have been linked to development and stress response. However, neither the characteristics of these genes, nor the implications of DTU on their protein coding sequences or functions, are currently well understood. Here we present a dataset of isoform dominance and DTU for all genes in the AtRTD2 reference transcriptome based on a protocol that was benchmarked on simulated data and validated through comparison with a published reverse transcriptase-polymerase chain reaction panel. We report DTU events for 8148 genes across 206 public RNA-Seq samples, and find that protein sequences are affected in 22% of the cases. The observed DTU events show high consistency across replicates, and reveal reproducible patterns in response to treatment and development. We also demonstrate that genes with different evolutionary ages, expression breadths and functions show large differences in the frequency at which they undergo DTU, and in the effect that these events have on their protein sequences. Finally, we showcase how the generated dataset can be used to explore DTU events for genes of interest or to find genes with specific DTU in samples of interest. © 2017 The Authors The Plant Journal © 2017 John Wiley & Sons Ltd.

  12. Achieving high confidence protein annotations in a sea of unknowns

    NASA Astrophysics Data System (ADS)

    Timmins-Schiffman, E.; May, D. H.; Noble, W. S.; Nunn, B. L.; Mikan, M.; Harvey, H. R.

    2016-02-01

    Increased sensitivity of mass spectrometry (MS) technology allows deep and broad insight into community functional analyses. Metaproteomics holds the promise to reveal functional responses of natural microbial communities, whereas metagenomics alone can only hint at potential functions. The complex datasets resulting from ocean MS have the potential to inform diverse realms of the biological, chemical, and physical ocean sciences, yet the extent of bacterial functional diversity and redundancy has not been fully explored. To take advantage of these impressive datasets, we need a clear bioinformatics pipeline for metaproteomics peptide identification and annotation with a database that can provide confident identifications. Researchers must consider whether it is sufficient to leverage the vast quantities of available ocean sequence data or if they must invest in site-specific metagenomic sequencing. We have sequenced, to our knowledge, the first western arctic metagenomes from the Bering Strait and the Chukchi Sea. We have addressed the long standing question: Is a metagenome required to accurately complete metaproteomics and assess the biological distribution of metabolic functions controlling nutrient acquisition in the ocean? Two different protein databases were constructed from 1) a site-specific metagenome and 2) subarctic/arctic groups available in NCBI's non-redundant database. Multiple proteomic search strategies were employed, against each individual database and against both databases combined, to determine the algorithm and approach that yielded the balance of high sensitivity and confident identification. Results yielded over 8200 confidently identified proteins. Our comparison of these results allows us to quantify the utility of investing resources in a metagenome versus using the constantly expanding and immediately available public databases for metaproteomic studies.

  13. Analyzing the relationship between sequence divergence and nodal support using Bayesian phylogenetic analyses.

    PubMed

    Makowsky, Robert; Cox, Christian L; Roelke, Corey; Chippindale, Paul T

    2010-11-01

    Determining the appropriate gene for phylogeny reconstruction can be a difficult process. Rapidly evolving genes tend to resolve recent relationships, but suffer from alignment issues and increased homoplasy among distantly related species. Conversely, slowly evolving genes generally perform best for deeper relationships, but lack sufficient variation to resolve recent relationships. We determine the relationship between sequence divergence and Bayesian phylogenetic reconstruction ability using both natural and simulated datasets. The natural data are based on 28 well-supported relationships within the subphylum Vertebrata. Sequences of 12 genes were acquired and Bayesian analyses were used to determine phylogenetic support for correct relationships. Simulated datasets were designed to determine whether an optimal range of sequence divergence exists across extreme phylogenetic conditions. Across all genes we found that an optimal range of divergence for resolving the correct relationships does exist, although this level of divergence expectedly depends on the distance metric. Simulated datasets show that an optimal range of sequence divergence exists across diverse topologies and models of evolution. We determine that a simple-to-measure property of genetic sequences (genetic distance) is related to phylogenetic reconstruction ability in Bayesian analyses. This information should be useful for selecting the most informative gene to resolve any relationships, especially those that are difficult to resolve, as well as minimizing both cost and confounding information during project design. Copyright © 2010. Published by Elsevier Inc.

  14. A user's guide to quantitative and comparative analysis of metagenomic datasets.

    PubMed

    Luo, Chengwei; Rodriguez-R, Luis M; Konstantinidis, Konstantinos T

    2013-01-01

    Metagenomics has revolutionized microbiological studies during the past decade and provided new insights into the diversity, dynamics, and metabolic potential of natural microbial communities. However, metagenomics still represents a field in development, and standardized tools and approaches to handle and compare metagenomes have not been established yet. An important reason accounting for the latter is the continuous changes in the type of sequencing data available, for example, long versus short sequencing reads. Here, we provide a guide to bioinformatic pipelines developed to accomplish the following tasks, focusing primarily on those developed by our team: (i) assemble a metagenomic dataset; (ii) determine the level of sequence coverage obtained and the amount of sequencing required to obtain complete coverage; (iii) identify the taxonomic affiliation of a metagenomic read or assembled contig; and (iv) determine differentially abundant genes, pathways, and species between different datasets. Most of these pipelines do not depend on the type of sequences available or can be easily adjusted to fit different types of sequences, and are freely available (for instance, through our lab Web site: http://www.enve-omics.gatech.edu/). The limitations of current approaches, as well as the computational aspects that can be further improved, will also be briefly discussed. The work presented here provides practical guidelines on how to perform metagenomic analysis of microbial communities characterized by varied levels of diversity and establishes approaches to handle the resulting data, independent of the sequencing platform employed. © 2013 Elsevier Inc. All rights reserved.

  15. An improved filtering algorithm for big read datasets and its application to single-cell assembly.

    PubMed

    Wedemeyer, Axel; Kliemann, Lasse; Srivastav, Anand; Schielke, Christian; Reusch, Thorsten B; Rosenstiel, Philip

    2017-07-03

    For single-cell or metagenomic sequencing projects, it is necessary to sequence with a very high mean coverage in order to make sure that all parts of the sample DNA get covered by the reads produced. This leads to huge datasets with lots of redundant data. A filtering of this data prior to assembly is advisable. Brown et al. (2012) presented the algorithm Diginorm for this purpose, which filters reads based on the abundance of their k-mers. We present Bignorm, a faster and quality-conscious read filtering algorithm. An important new algorithmic feature is the use of phred quality scores together with a detailed analysis of the k-mer counts to decide which reads to keep. We qualify and recommend parameters for our new read filtering algorithm. Guided by these parameters, we remove in terms of median 97.15% of the reads while keeping the mean phred score of the filtered dataset high. Using the SPAdes assembler, we produce assemblies of high quality from these filtered datasets in a fraction of the time needed for an assembly from the datasets filtered with Diginorm. We conclude that read filtering is a practical and efficient method for reducing read data and for speeding up the assembly process. This applies not only to single-cell assembly, as shown in this paper, but also to other projects with high mean coverage datasets like metagenomic sequencing projects. Our Bignorm algorithm allows assemblies of competitive quality in comparison to Diginorm, while being much faster. Bignorm is available for download at https://git.informatik.uni-kiel.de/axw/Bignorm .
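
    The k-mer abundance filtering idea mentioned in this abstract can be illustrated with a minimal Python sketch. It follows the general Diginorm-style rule (keep a read only while the median count of its k-mers is still low) and is not the Bignorm implementation: it ignores phred scores, uses exact counts in a Counter, and the function names, k = 4, and cutoff = 2 are assumptions chosen for a toy example.

```python
from collections import Counter
from statistics import median

def kmers(seq, k):
    """Return all overlapping k-mers of a read."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def normalize_reads(reads, k=4, cutoff=2):
    """Keep a read only if the median count of its k-mers is below cutoff.

    Hypothetical helper: exact counting, no quality scores.
    """
    counts = Counter()
    kept = []
    for read in reads:
        read_kmers = kmers(read, k)
        if not read_kmers:
            continue  # read shorter than k: nothing to count
        if median(counts[km] for km in read_kmers) < cutoff:
            kept.append(read)
            counts.update(read_kmers)  # only kept reads raise the counts
    return kept

# Five copies of a redundant read plus one read from a rare region.
reads = ["ACGTACGT"] * 5 + ["TTTTGGGG"]
filtered = normalize_reads(reads, k=4, cutoff=2)
print(filtered)
```

    In this toy input, most copies of the redundant read are dropped once its k-mers are well covered, while the read from the rare region is retained.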

  16. PARRoT- a homology-based strategy to quantify and compare RNA-sequencing from non-model organisms.

    PubMed

    Gan, Ruei-Chi; Chen, Ting-Wen; Wu, Timothy H; Huang, Po-Jung; Lee, Chi-Ching; Yeh, Yuan-Ming; Chiu, Cheng-Hsun; Huang, Hsien-Da; Tang, Petrus

    2016-12-22

    Next-generation sequencing promises the de novo genomic and transcriptomic analysis of samples of interest. However, there are only a few organisms having reference genomic sequences and even fewer having well-defined or curated annotations. For transcriptome studies focusing on organisms lacking proper reference genomes, the common strategy is de novo assembly followed by functional annotation. However, things become even more complicated when multiple transcriptomes are compared. Here, we propose a new analysis strategy and quantification methods for expression levels that not only generate a virtual reference from sequencing data, but also provide comparisons between transcriptomes. First, all reads from the transcriptome datasets are pooled together for de novo assembly. The assembled contigs are searched against NCBI NR databases to find potential homolog sequences. Based on the search results, a set of virtual transcripts is generated and serves as a reference transcriptome. By using the same reference, normalized quantification values including RC (read counts), eRPKM (estimated RPKM) and eTPM (estimated TPM) can be obtained that are comparable across transcriptome datasets. In order to demonstrate the feasibility of our strategy, we implement it in the web service PARRoT. PARRoT stands for Pipeline for Analyzing RNA Reads of Transcriptomes. It analyzes gene expression profiles for two transcriptome sequencing datasets. For better understanding of the biological meaning from the comparison among transcriptomes, PARRoT further provides linkage between these virtual transcripts and their potential function by showing best hits in SwissProt and the NR database and by assigning GO terms. Our demo datasets showed that PARRoT can analyze two paired-end transcriptomic datasets of approximately 100 million reads within just three hours. In this study, we proposed and implemented a strategy to analyze transcriptomes from non-reference organisms which offers the opportunity to quantify and compare transcriptome profiles through a homolog-based virtual transcriptome reference. By using the homolog-based reference, our strategy effectively avoids the problems that may arise from inconsistencies among transcriptomes. This strategy will shed light on the field of comparative genomics for non-model organisms. We have implemented PARRoT as a web service which is freely available at http://parrot.cgu.edu.tw .
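
    For readers unfamiliar with the normalized values named above, the following Python sketch shows the standard RPKM and TPM formulas applied to read counts against a shared reference. It is an illustration only: the abstract does not specify how the "estimated" variants (eRPKM, eTPM) are computed, and the transcript names, counts, and lengths are hypothetical.

```python
def rpkm(counts, lengths):
    """Reads per kilobase of transcript per million mapped reads."""
    total = sum(counts.values())
    return {t: counts[t] * 1e9 / (lengths[t] * total) for t in counts}

def tpm(counts, lengths):
    """Transcripts per million: length-normalize first, then scale to 1e6."""
    rates = {t: counts[t] / lengths[t] for t in counts}
    scale = sum(rates.values())
    return {t: rates[t] / scale * 1e6 for t in counts}

# Hypothetical virtual transcripts: read counts and lengths in bases.
counts = {"vt_A": 100, "vt_B": 300}
lengths = {"vt_A": 1000, "vt_B": 3000}

rpkm_values = rpkm(counts, lengths)
tpm_values = tpm(counts, lengths)
print(rpkm_values, tpm_values)
```

    Because both transcripts have the same reads-per-base rate here, they receive equal RPKM and equal TPM values even though their raw counts differ, which is the point of length normalization. The sketch assumes non-zero totals and lengths.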

  17. Inadequate Reference Datasets Biased toward Short Non-epitopes Confound B-cell Epitope Prediction

    PubMed Central

    Rahman, Kh. Shamsur; Chowdhury, Erfan Ullah; Sachse, Konrad; Kaltenboeck, Bernhard

    2016-01-01

    X-ray crystallography has shown that an antibody paratope typically binds 15–22 amino acids (aa) of an epitope, of which 2–5 randomly distributed amino acids contribute most of the binding energy. In contrast, researchers typically choose for B-cell epitope mapping short peptide antigens in antibody binding assays. Furthermore, short 6–11-aa epitopes, and in particular non-epitopes, are over-represented in published B-cell epitope datasets that are commonly used for development of B-cell epitope prediction approaches from protein antigen sequences. We hypothesized that such suboptimal length peptides result in weak antibody binding and cause false-negative results. We tested the influence of peptide antigen length on antibody binding by analyzing data on more than 900 peptides used for B-cell epitope mapping of immunodominant proteins of Chlamydia spp. We demonstrate that short 7–12-aa peptides of B-cell epitopes bind antibodies poorly; thus, epitope mapping with short peptide antigens falsely classifies many B-cell epitopes as non-epitopes. We also show in published datasets of confirmed epitopes and non-epitopes a direct correlation between length of peptide antigens and antibody binding. Elimination of short, ≤11-aa epitope/non-epitope sequences improved datasets for evaluation of in silico B-cell epitope prediction. Achieving up to 86% accuracy, protein disorder tendency is the best indicator of B-cell epitope regions for chlamydial and published datasets. For B-cell epitope prediction, the most effective approach is plotting disorder of protein sequences with the IUPred-L scale, followed by antibody reactivity testing of 16–30-aa peptides from peak regions. This strategy overcomes the well known inaccuracy of in silico B-cell epitope prediction from primary protein sequences. PMID:27189949

  18. Putative archaeal viruses from the mesopelagic ocean.

    PubMed

    Vik, Dean R; Roux, Simon; Brum, Jennifer R; Bolduc, Ben; Emerson, Joanne B; Padilla, Cory C; Stewart, Frank J; Sullivan, Matthew B

    2017-01-01

    Oceanic viruses that infect bacteria, or phages, are known to modulate host diversity, metabolisms, and biogeochemical cycling, while the viruses that infect marine Archaea remain understudied despite the critical ecosystem roles played by their hosts. Here we introduce "MArVD", for Metagenomic Archaeal Virus Detector, an annotation tool designed to identify putative archaeal virus contigs in metagenomic datasets. MArVD is made publicly available through the online iVirus analytical platform. Benchmarking analysis of MArVD showed it to be >99% accurate and 100% sensitive in identifying the 127 known archaeal viruses among the 12,499 viruses in the VirSorter curated dataset. Application of MArVD to 10 viral metagenomes from two depth profiles in the Eastern Tropical North Pacific (ETNP) oxygen minimum zone revealed 43 new putative archaeal virus genomes and large genome fragments ranging in size from 10 to 31 kb. Network-based classifications, which were consistent with marker gene phylogenies where available, suggested that these putative archaeal virus contigs represented six novel candidate genera. Ecological analyses, via fragment recruitment and ordination, revealed that the diversity and relative abundances of these putative archaeal viruses were correlated with oxygen concentration and temperature along two OMZ-spanning depth profiles, presumably due to structuring of the host Archaea community. Peak viral diversity and abundances were found in surface waters, where Thermoplasmata 16S rRNA genes are prevalent, suggesting these archaea as hosts in the surface habitats. Together these findings provide a baseline for identifying archaeal viruses in sequence datasets, and an initial picture of the ecology of such viruses in non-extreme environments.

  19. Genome-wide assessment of differential translations with ribosome profiling data

    PubMed Central

    Xiao, Zhengtao; Zou, Qin; Liu, Yu; Yang, Xuerui

    2016-01-01

    The closely regulated process of mRNA translation is crucial for precise control of protein abundance and quality. Ribosome profiling, a combination of ribosome foot-printing and RNA deep sequencing, has been used in a large variety of studies to quantify genome-wide mRNA translation. Here, we developed Xtail, an analysis pipeline tailored for ribosome profiling data that comprehensively and accurately identifies differentially translated genes in pairwise comparisons. Applied to simulated and real datasets, Xtail exhibits high sensitivity with minimal false-positive rates, outperforming existing methods in the accuracy of quantifying differential translations. With published ribosome profiling datasets, Xtail not only reveals differentially translated genes that make biological sense, but also uncovers new events of differential translation in human cancer cells on mTOR signalling perturbation and in human primary macrophages on interferon gamma (IFN-γ) treatment. This demonstrates the value of Xtail in providing novel insights into the molecular mechanisms that involve translational dysregulations. PMID:27041671

  20. HMMER Cut-off Threshold Tool (HMMERCTTER): Supervised classification of superfamily protein sequences with a reliable cut-off threshold.

    PubMed

    Pagnuco, Inti Anabela; Revuelta, María Victoria; Bondino, Hernán Gabriel; Brun, Marcel; Ten Have, Arjen

    2018-01-01

    Protein superfamilies can be divided into subfamilies of proteins with different functional characteristics. Their sequences can be classified hierarchically, which is part of sequence function assignation. Typically, there are no clear subfamily hallmarks that would allow pattern-based function assignation by which this task is mostly achieved based on the similarity principle. This is hampered by the lack of a score cut-off that is both sensitive and specific. HMMER Cut-off Threshold Tool (HMMERCTTER) adds a reliable cut-off threshold to the popular HMMER. Using a high quality superfamily phylogeny, it clusters a set of training sequences such that the cluster-specific HMMER profiles show cluster or subfamily member detection with 100% precision and recall (P&R), thereby generating a specific threshold as inclusion cut-off. Profiles and thresholds are then used as classifiers to screen a target dataset. Iterative inclusion of novel sequences to groups and the corresponding HMMER profiles results in high sensitivity while specificity is maintained by imposing 100% P&R self detection. In three presented case studies of protein superfamilies, classification of large datasets with 100% precision was achieved with over 95% recall. Limits and caveats are presented and explained. HMMERCTTER is a promising protein superfamily sequence classifier provided high quality training datasets are used. It provides a decision support system that aids in the difficult task of sequence function assignation in the twilight zone of sequence similarity. All relevant data and source codes are available from the Github repository at the following URL: https://github.com/BBCMdP/HMMERCTTER.

  1. HMMER Cut-off Threshold Tool (HMMERCTTER): Supervised classification of superfamily protein sequences with a reliable cut-off threshold

    PubMed Central

    Pagnuco, Inti Anabela; Revuelta, María Victoria; Bondino, Hernán Gabriel; Brun, Marcel

    2018-01-01

    Background Protein superfamilies can be divided into subfamilies of proteins with different functional characteristics. Their sequences can be classified hierarchically, which is part of sequence function assignation. Typically, there are no clear subfamily hallmarks that would allow pattern-based function assignation by which this task is mostly achieved based on the similarity principle. This is hampered by the lack of a score cut-off that is both sensitive and specific. Results HMMER Cut-off Threshold Tool (HMMERCTTER) adds a reliable cut-off threshold to the popular HMMER. Using a high quality superfamily phylogeny, it clusters a set of training sequences such that the cluster-specific HMMER profiles show cluster or subfamily member detection with 100% precision and recall (P&R), thereby generating a specific threshold as inclusion cut-off. Profiles and thresholds are then used as classifiers to screen a target dataset. Iterative inclusion of novel sequences to groups and the corresponding HMMER profiles results in high sensitivity while specificity is maintained by imposing 100% P&R self detection. In three presented case studies of protein superfamilies, classification of large datasets with 100% precision was achieved with over 95% recall. Limits and caveats are presented and explained. Conclusions HMMERCTTER is a promising protein superfamily sequence classifier provided high quality training datasets are used. It provides a decision support system that aids in the difficult task of sequence function assignation in the twilight zone of sequence similarity. All relevant data and source codes are available from the Github repository at the following URL: https://github.com/BBCMdP/HMMERCTTER. PMID:29579071

  2. The sponge microbiome project.

    PubMed

    Moitinho-Silva, Lucas; Nielsen, Shaun; Amir, Amnon; Gonzalez, Antonio; Ackermann, Gail L; Cerrano, Carlo; Astudillo-Garcia, Carmen; Easson, Cole; Sipkema, Detmer; Liu, Fang; Steinert, Georg; Kotoulas, Giorgos; McCormack, Grace P; Feng, Guofang; Bell, James J; Vicente, Jan; Björk, Johannes R; Montoya, Jose M; Olson, Julie B; Reveillaud, Julie; Steindler, Laura; Pineda, Mari-Carmen; Marra, Maria V; Ilan, Micha; Taylor, Michael W; Polymenakou, Paraskevi; Erwin, Patrick M; Schupp, Peter J; Simister, Rachel L; Knight, Rob; Thacker, Robert W; Costa, Rodrigo; Hill, Russell T; Lopez-Legentil, Susanna; Dailianis, Thanos; Ravasi, Timothy; Hentschel, Ute; Li, Zhiyong; Webster, Nicole S; Thomas, Torsten

    2017-10-01

    Marine sponges (phylum Porifera) are a diverse, phylogenetically deep-branching clade known for forming intimate partnerships with complex communities of microorganisms. To date, 16S rRNA gene sequencing studies have largely utilised different extraction and amplification methodologies to target the microbial communities of a limited number of sponge species, severely limiting comparative analyses of sponge microbial diversity and structure. Here, we provide an extensive and standardised dataset that will facilitate sponge microbiome comparisons across large spatial, temporal, and environmental scales. Samples from marine sponges (n = 3569 specimens), seawater (n = 370), marine sediments (n = 65) and other environments (n = 29) were collected from different locations across the globe. This dataset incorporates at least 268 different sponge species, including several yet unidentified taxa. The V4 region of the 16S rRNA gene was amplified and sequenced from extracted DNA using standardised procedures. Raw sequences (total of 1.1 billion sequences) were processed and clustered with (i) a standard protocol using QIIME closed-reference picking resulting in 39 543 operational taxonomic units (OTU) at 97% sequence identity, (ii) a de novo clustering using Mothur resulting in 518 246 OTUs, and (iii) a new high-resolution Deblur protocol resulting in 83 908 unique bacterial sequences. Abundance tables, representative sequences, taxonomic classifications, and metadata are provided. This dataset represents a comprehensive resource of sponge-associated microbial communities based on 16S rRNA gene sequences that can be used to address overarching hypotheses regarding host-associated prokaryotes, including host specificity, convergent evolution, environmental drivers of microbiome structure, and the sponge-associated rare biosphere. © The Authors 2017. Published by Oxford University Press.

  3. Chameleon sequences in neurodegenerative diseases.

    PubMed

    Bahramali, Golnaz; Goliaei, Bahram; Minuchehr, Zarrin; Salari, Ali

    2016-03-25

    Chameleon sequences can adopt an alpha helix, a beta sheet, or a coil conformation. Defining chameleon sequences in the PDB (Protein Data Bank) may yield insight into defining peptides and proteins responsible for neurodegeneration. In this research, we benefitted from the large PDB and performed a sequence analysis on Chameleons, where we developed an algorithm to extract peptide segments with identical sequences, but different structures. In order to find new chameleon sequences, we extracted a set of 8315 non-redundant protein sequences from the PDB with an identity less than 25%. Our data were classified into "helix to strand (HE)", "helix to coil (HC)" and "strand to coil (CE)" alterations. We also analyzed the occurrence of singlet and doublet amino acids and the solvent accessibility in the chameleon sequences; we then sorted out the proteins with the largest number of chameleon sequences and named them Chameleon Flexible Proteins (CFPs) in our dataset. Our data revealed that Gly, Val, Ile, Tyr and Phe are the major amino acids in Chameleons. We also found that there are proteins such as Insulin Degrading Enzyme (IDE) and GTP-binding nuclear protein Ran (RAN) with the largest number of chameleons (640 and 405 respectively). These proteins have known roles in neurodegenerative diseases. Therefore it can be inferred that other CFPs can serve as key proteins in neurodegeneration, and a study on them can shed light on curing and preventing neurodegenerative diseases. Copyright © 2016 Elsevier Inc. All rights reserved.
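
    The extraction step described above (peptide segments with identical sequences but different structures) can be sketched in Python as follows. This is an assumed, simplified reading of the idea, not the authors' algorithm: it uses a fixed window length k = 5, requires a window to have one uniform secondary-structure state (H = helix, E = strand, C = coil), and the two toy proteins are invented.

```python
from collections import defaultdict

def find_chameleons(proteins, k=5):
    """Map each k-residue peptide to the set of uniform structure states
    it adopts, and return only peptides seen in more than one state.

    proteins: iterable of (sequence, secondary_structure) string pairs
    of equal length. Hypothetical helper for illustration.
    """
    seen = defaultdict(set)
    for seq, ss in proteins:
        for i in range(len(seq) - k + 1):
            window_ss = ss[i:i + k]
            if len(set(window_ss)) == 1:  # uniform structure only
                seen[seq[i:i + k]].add(window_ss[0])
    return {pep: states for pep, states in seen.items() if len(states) > 1}

# Invented example: the peptide VLAAG is helical in one protein
# and a strand in the other.
proteins = [
    ("MKVLAAGIT", "CCHHHHHCC"),
    ("TTVLAAGPQ", "CEEEEEECC"),
]
chameleons = find_chameleons(proteins, k=5)
print(chameleons)
```

    A pair of states such as {H, E} corresponds to the "helix to strand" class named in the abstract; the other classes would follow from {H, C} and {E, C}.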

  4. Chameleon sequences in neurodegenerative diseases

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Bahramali, Golnaz; Goliaei, Bahram, E-mail: goliaei@ut.ac.ir; Minuchehr, Zarrin, E-mail: minuchehr@nigeb.ac.ir

    2016-03-25

    Chameleon sequences can adopt an alpha helix, a beta sheet, or a coil conformation. Defining chameleon sequences in the PDB (Protein Data Bank) may yield insight into defining peptides and proteins responsible for neurodegeneration. In this research, we benefitted from the large PDB and performed a sequence analysis on Chameleons, where we developed an algorithm to extract peptide segments with identical sequences, but different structures. In order to find new chameleon sequences, we extracted a set of 8315 non-redundant protein sequences from the PDB with an identity less than 25%. Our data were classified into “helix to strand (HE)”, “helix to coil (HC)” and “strand to coil (CE)” alterations. We also analyzed the occurrence of singlet and doublet amino acids and the solvent accessibility in the chameleon sequences; we then sorted out the proteins with the largest number of chameleon sequences and named them Chameleon Flexible Proteins (CFPs) in our dataset. Our data revealed that Gly, Val, Ile, Tyr and Phe are the major amino acids in Chameleons. We also found that there are proteins such as Insulin Degrading Enzyme (IDE) and GTP-binding nuclear protein Ran (RAN) with the largest number of chameleons (640 and 405 respectively). These proteins have known roles in neurodegenerative diseases. Therefore it can be inferred that other CFPs can serve as key proteins in neurodegeneration, and a study on them can shed light on curing and preventing neurodegenerative diseases.

  5. A novel method based on new adaptive LVQ neural network for predicting protein-protein interactions from protein sequences.

    PubMed

    Yousef, Abdulaziz; Moghadam Charkari, Nasrollah

    2013-11-07

    Protein-protein interaction (PPI) data are among the most important data for understanding cellular processes. Many interesting methods have been proposed in order to predict PPIs. However, the methods which are based on the sequence of proteins as prior knowledge are more universal. In this paper, a sequence-based, fast, and adaptive PPI prediction method is introduced to assign two proteins to an interaction class (yes, no). First, in order to improve the representation of the sequences, twelve physicochemical properties of amino acids have been used by different representation methods to transform the sequences of protein pairs into different feature vectors. Then, for speeding up the learning process and reducing the effect of noisy PPI data, principal component analysis (PCA) is carried out as a proper feature extraction algorithm. Finally, a new and adaptive Learning Vector Quantization (LVQ) predictor is designed to deal with different models of datasets that are classified into balanced and imbalanced datasets. Accuracies of 93.88%, 90.03%, and 89.72% were obtained on S. cerevisiae, H. pylori, and independent datasets, respectively. The results of various experiments indicate the efficiency and validity of the method. © 2013 Published by Elsevier Ltd.

  6. Skeleton-based human action recognition using multiple sequence alignment

    NASA Astrophysics Data System (ADS)

    Ding, Wenwen; Liu, Kai; Cheng, Fei; Zhang, Jin; Li, YunSong

    2015-05-01

    Human action recognition and analysis has been an active research topic in computer vision for many years. This paper presents a method to represent human actions based on trajectories consisting of 3D joint positions. This method first decomposes an action into a sequence of meaningful atomic actions (actionlets), and then labels actionlets with letters of the English alphabet according to the Davies-Bouldin index value. Therefore, an action can be represented using a sequence of actionlet symbols, which will preserve the temporal order of occurrence of each of the actionlets. Finally, we employ sequence comparison to classify multiple actions using a string matching algorithm (Needleman-Wunsch). The effectiveness of the proposed method is evaluated on datasets captured by commodity depth cameras. Experiments of the proposed method on three challenging 3D action datasets show promising results.
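
    The string-matching step can be illustrated with a short Python sketch of Needleman-Wunsch global alignment scoring, followed by nearest-template classification. The scoring values (match +1, mismatch -1, gap -1), the action names, and the actionlet strings are assumptions for illustration; the abstract does not give the parameters the authors used.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Return the optimal global alignment score of strings a and b."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    # Aligning a prefix against the empty string costs one gap per symbol.
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score[i][j] = max(diag, score[i - 1][j] + gap, score[i][j - 1] + gap)
    return score[-1][-1]

# Hypothetical actionlet strings: each letter stands for one atomic action.
templates = {"wave": "ABCBA", "clap": "DEDED"}
query = "ABCCA"

# Classify the query as the template with the highest alignment score.
label = max(templates, key=lambda name: needleman_wunsch(query, templates[name]))
print(label)
```

    Because alignment tolerates substitutions and gaps, a query that differs from a template by one actionlet can still be assigned to the right action.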

  7. ESTminer: a Web interface for mining EST contig and cluster databases.

    PubMed

    Huang, Yecheng; Pumphrey, Janie; Gingle, Alan R

    2005-03-01

    ESTminer is a Web application and database schema for interactive mining of expressed sequence tag (EST) contig and cluster datasets. The Web interface contains a query frame that allows the selection of contigs/clusters with specific cDNA library makeup or a threshold number of members. The results are displayed as color-coded tree nodes, where the color indicates the fractional size of each cDNA library component. The nodes are expandable, revealing library statistics as well as EST or contig members, with links to sequence data, GenBank records or user configurable links. Also, the interface allows 'queries within queries' where the result set of a query is further filtered by the subsequent query. ESTminer is implemented in Java/JSP and the package, including MySQL and Oracle schema creation scripts, is available from http://cggc.agtec.uga.edu/Data/download.asp . Contact: agingle@uga.edu.

  8. Unassigned MS/MS Spectra: Who Am I?

    PubMed

    Pathan, Mohashin; Samuel, Monisha; Keerthikumar, Shivakumar; Mathivanan, Suresh

    2017-01-01

    Recent advances in high resolution tandem mass spectrometry (MS) have resulted in the accumulation of high quality data. In parallel with these advances in instrumentation, bioinformatics software has been developed to analyze such quality datasets. In spite of these advances, data analysis in mass spectrometry still remains critical for protein identification. In addition, the complexity of the generated MS/MS spectra, the unpredictable nature of peptide fragmentation, sequence annotation errors, and posttranslational modifications have impeded the protein identification process. In a typical MS data analysis, about 60% of the MS/MS spectra remain unassigned. While some of these could be attributed to the low quality of the MS/MS spectra, a proportion can be classified as high quality. Further analysis may reveal how many of the unassigned MS spectra can be attributed to search space, sequence annotation errors, mutations, and/or posttranslational modifications. In this chapter, the tools used to identify proteins and ways to assign unassigned tandem MS spectra are discussed.

  9. Enhancing Hi-C data resolution with deep convolutional neural network HiCPlus.

    PubMed

    Zhang, Yan; An, Lin; Xu, Jie; Zhang, Bo; Zheng, W Jim; Hu, Ming; Tang, Jijun; Yue, Feng

    2018-02-21

    Although Hi-C technology is one of the most popular tools for studying 3D genome organization, due to sequencing cost, the resolution of most Hi-C datasets is coarse and cannot be used to link distal regulatory elements to their target genes. Here we develop HiCPlus, a computational approach based on a deep convolutional neural network, to infer high-resolution Hi-C interaction matrices from low-resolution Hi-C data. We demonstrate that HiCPlus can impute interaction matrices highly similar to the original ones, while only using 1/16 of the original sequencing reads. We show that the models learned from one cell type can be applied to make predictions in other cell or tissue types. Our work not only provides a computational framework to enhance Hi-C data resolution but also reveals features underlying the formation of 3D chromatin interactions.

  10. MIPS bacterial genomes functional annotation benchmark dataset.

    PubMed

    Tetko, Igor V; Brauner, Barbara; Dunger-Kaltenbach, Irmtraud; Frishman, Goar; Montrone, Corinna; Fobo, Gisela; Ruepp, Andreas; Antonov, Alexey V; Surmeli, Dimitrij; Mewes, Hans-Werner

    2005-05-15

    Any development of new methods for automatic functional annotation of proteins according to their sequences requires high-quality data (as benchmark) as well as tedious preparatory work to generate sequence parameters required as input data for the machine learning methods. Different program settings and incompatible protocols make a comparison of the analyzed methods difficult. The MIPS Bacterial Functional Annotation Benchmark dataset (MIPS-BFAB) is a new, high-quality resource comprising four bacterial genomes manually annotated according to the MIPS functional catalogue (FunCat). These resources include precalculated sequence parameters, such as sequence similarity scores, InterPro domain composition and other parameters that could be used to develop and benchmark methods for functional annotation of bacterial protein sequences. These data are provided in XML format and can be used by scientists who are not necessarily experts in genome annotation. BFAB is available at http://mips.gsf.de/proj/bfab

  11. Predicting protein-binding regions in RNA using nucleotide profiles and compositions.

    PubMed

    Choi, Daesik; Park, Byungkyu; Chae, Hanju; Lee, Wook; Han, Kyungsook

    2017-03-14

    Motivated by the increased amount of data on protein-RNA interactions and the availability of complete genome sequences of several organisms, many computational methods have been proposed to predict binding sites in protein-RNA interactions. However, most computational methods are limited to finding RNA-binding sites in proteins instead of protein-binding sites in RNAs. Predicting protein-binding sites in RNA is more challenging than predicting RNA-binding sites in proteins. Recent computational methods for finding protein-binding sites in RNAs have several drawbacks for practical use. We developed a new support vector machine (SVM) model for predicting protein-binding regions in mRNA sequences. The model uses sequence profiles constructed from log-odds scores of mono- and di-nucleotides and nucleotide compositions. The model was evaluated by standard 10-fold cross validation, leave-one-protein-out (LOPO) cross validation and independent testing. Since actual mRNA sequences have more non-binding regions than protein-binding regions, we tested the model on several datasets with different ratios of protein-binding regions to non-binding regions. The best performance of the model was obtained in a balanced dataset of positive and negative instances. 10-fold cross validation with a balanced dataset achieved a sensitivity of 91.6%, a specificity of 92.4%, an accuracy of 92.0%, a positive predictive value (PPV) of 91.7%, a negative predictive value (NPV) of 92.3% and a Matthews correlation coefficient (MCC) of 0.840. LOPO cross validation showed a lower performance than the 10-fold cross validation, but the performance remains high (87.6% accuracy and 0.752 MCC). In testing the model on independent datasets, it achieved an accuracy of 82.2% and an MCC of 0.656. Testing of our model and other state-of-the-art methods on the same dataset showed that our model is better than the others. 
Sequence profiles of log-odds scores of mono- and di-nucleotides were much more powerful features than nucleotide compositions in finding protein-binding regions in RNA sequences. However, a slight performance gain was obtained when using the sequence profiles along with nucleotide compositions. These are preliminary results of ongoing research, but demonstrate the potential of our approach as a powerful predictor of protein-binding regions in RNA. The program and supporting data are available at http://bclab.inha.ac.kr/RBPbinding .
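
    The evaluation measures reported in this abstract all derive from the four cells of a binary confusion matrix. The Python sketch below shows the standard definitions; the counts used in the example are invented and are not taken from the study.

```python
from math import sqrt

def binary_metrics(tp, fp, tn, fn):
    """Standard binary classification measures from confusion-matrix counts."""
    mcc_denominator = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "sensitivity": tp / (tp + fn),          # true positive rate
        "specificity": tn / (tn + fp),          # true negative rate
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "ppv": tp / (tp + fp),                  # positive predictive value
        "npv": tn / (tn + fn),                  # negative predictive value
        "mcc": (tp * tn - fp * fn) / mcc_denominator,
    }

# Invented counts: 100 binding and 100 non-binding regions.
metrics = binary_metrics(tp=90, fp=20, tn=80, fn=10)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

    Unlike accuracy, the MCC uses all four cells symmetrically, which is one reason it is often reported when the ratio of positive to negative instances varies, as in the datasets described above. The sketch assumes every marginal total is non-zero.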

  12. Development of a EST dataset and characterization of EST-SSRs in a traditional Chinese medicinal plant, Epimedium sagittatum (Sieb. Et Zucc.) Maxim

    PubMed Central

    2010-01-01

    Background Epimedium sagittatum (Sieb. Et Zucc.) Maxim, a traditional Chinese medicinal plant species, has been used extensively as genuine medicinal materials. Certain Epimedium species are endangered due to commercial overexploitation, while sustainable application studies, conservation genetics, systematics, and marker-assisted selection (MAS) of Epimedium are less studied due to the lack of molecular markers. Here, we report a set of expressed sequence tags (ESTs) and simple sequence repeats (SSRs) identified in these ESTs for E. sagittatum. Results cDNAs of E. sagittatum are sequenced using 454 GS-FLX pyrosequencing technology. The raw reads are cleaned and assembled into a total of 76,459 consensus sequences comprising 17,231 contigs and 59,228 singlets. About 38.5% (29,466) of the consensus sequences significantly match to the non-redundant protein database (E-value < 1e-10), 22,295 of which are further annotated using Gene Ontology (GO) terms. A total of 2,810 EST-SSRs are identified from the Epimedium EST dataset. Trinucleotide SSR is the dominant repeat type (55.2%) followed by dinucleotide (30.4%), tetranucleotide (7.3%), hexanucleotide (4.9%), and pentanucleotide (2.2%) SSR. The dominant repeat motif is AAG/CTT (23.6%) followed by AG/CT (19.3%), ACC/GGT (11.1%), AT/AT (7.5%), and AAC/GTT (5.9%). Thirty-two SSR-ESTs are randomly selected and primer pairs are synthesized for testing the transferability across 52 Epimedium species. Eighteen primer pairs (85.7%) could be successfully transferred to Epimedium species and sixteen of those show high genetic diversity with 0.35 of observed heterozygosity (Ho) and 0.65 of expected heterozygosity (He) and high number of alleles per locus (11.9). Conclusion A large EST dataset with a total of 76,459 consensus sequences is generated, aiming to provide sequence information for deciphering secondary metabolism, especially for flavonoid pathway in Epimedium. 
A total of 2,810 EST-SSRs is identified from EST dataset and ~1580 EST-SSR markers are transferable. E. sagittatum EST-SSR transferability to the major Epimedium germplasm is up to 85.7%. Therefore, this EST dataset and EST-SSRs will be a powerful resource for further studies such as taxonomy, molecular breeding, genetics, genomics, and secondary metabolism in Epimedium species. PMID:20141623
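    The SSR survey summarized in this record can be illustrated with a minimal sketch that scans a sequence for perfect di- to hexanucleotide repeats. The abstract does not state the search criteria, so the minimum repeat counts in MIN_REPEATS and the function name find_ssrs are assumptions made for illustration only, not the authors' pipeline.

```python
import re

# Hypothetical thresholds (motif length -> minimum number of tandem copies).
# The record above does not give the criteria actually used.
MIN_REPEATS = {2: 6, 3: 5, 4: 4, 5: 4, 6: 4}


def find_ssrs(seq):
    """Return sorted (start, motif, repeat_count) tuples for perfect SSRs in seq."""
    seq = seq.upper()
    hits = []
    for k, min_rep in MIN_REPEATS.items():
        # A k-base motif followed by at least (min_rep - 1) further copies of itself.
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (k, min_rep - 1))
        for m in pattern.finditer(seq):
            motif = m.group(1)
            # Skip motifs that are repeats of a shorter unit (e.g. "AGAG" is really "AG").
            if any(motif == motif[:d] * (k // d) for d in range(1, k) if k % d == 0):
                continue
            hits.append((m.start(), motif, len(m.group(0)) // k))
    return sorted(hits)


if __name__ == "__main__":
    example = "GGC" + "AAG" * 6 + "TTCA" + "AG" * 7 + "C"
    for start, motif, count in find_ssrs(example):
        print(start, motif, count)
```

    Grouping the hits by motif length would give repeat-type proportions of the kind reported in the abstract (trinucleotide, dinucleotide, and so on).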

  13. Evaluation of privacy in high dynamic range video sequences

    NASA Astrophysics Data System (ADS)

    Řeřábek, Martin; Yuan, Lin; Krasula, Lukáš; Korshunov, Pavel; Fliegel, Karel; Ebrahimi, Touradj

    2014-09-01

    The ability of high dynamic range (HDR) to capture details in environments with high contrast has a significant impact on privacy in video surveillance. However, the extent to which HDR imaging affects privacy, when compared to a typical low dynamic range (LDR) imaging, is neither well studied nor well understood. To achieve such an objective, a suitable dataset of images and video sequences is needed. Therefore, we have created a publicly available dataset of HDR video for privacy evaluation PEViD-HDR, which is an HDR extension of an existing Privacy Evaluation Video Dataset (PEViD). PEViD-HDR video dataset can help in the evaluations of privacy protection tools, as well as for showing the importance of HDR imaging in video surveillance applications and its influence on the privacy-intelligibility trade-off. We conducted a preliminary subjective experiment demonstrating the usability of the created dataset for evaluation of privacy issues in video. The results confirm that a tone-mapped HDR video contains more privacy sensitive information and details compared to a typical LDR video.

  14. Spectra library assisted de novo peptide sequencing for HCD and ETD spectra pairs.

    PubMed

    Yan, Yan; Zhang, Kaizhong

    2016-12-23

    De novo peptide sequencing via tandem mass spectrometry (MS/MS) has developed rapidly in recent years. With the use of spectra pairs from the same peptide under different fragmentation modes, the performance of de novo sequencing is greatly improved. Currently, with a large number of spectra sequenced every day, spectra libraries containing tens of thousands of annotated experimental MS/MS spectra have become available. These libraries provide information on spectra properties and thus have the potential to be used with de novo sequencing to improve its performance. In this study, an improved de novo sequencing method assisted by a spectra library is proposed. It uses spectra libraries as training datasets and introduces significance scores for the features used in our previous de novo sequencing method for HCD and ETD spectra pairs. Two pairs of HCD and ETD spectral datasets were used to test the performance of the proposed method and our previous method. The results show that the proposed method achieves better sequencing accuracy, with higher-ranked correct sequences and less computational time. This paper proposes an advanced de novo sequencing method for HCD and ETD spectra pairs that uses information from spectra libraries and significantly improves on previous similar methods.

  15. SPHINX--an algorithm for taxonomic binning of metagenomic sequences.

    PubMed

    Mohammed, Monzoorul Haque; Ghosh, Tarini Shankar; Singh, Nitin Kumar; Mande, Sharmila S

    2011-01-01

    Compared with composition-based binning algorithms, the binning accuracy and specificity of alignment-based binning algorithms are significantly higher. However, being alignment-based, the latter class of algorithms requires an enormous amount of time and computing resources for binning huge metagenomic datasets. The motivation was to develop a binning approach that can analyze metagenomic datasets as rapidly as composition-based approaches, but nevertheless has the accuracy and specificity of alignment-based algorithms. This article describes a hybrid binning approach (SPHINX) that achieves high binning efficiency by utilizing the principles of both 'composition'- and 'alignment'-based binning algorithms. Validation results with simulated sequence datasets indicate that SPHINX is able to analyze metagenomic sequences as rapidly as composition-based algorithms. Furthermore, the binning efficiency (in terms of accuracy and specificity of assignments) of SPHINX is observed to be comparable with results obtained using alignment-based algorithms. A web server for the SPHINX algorithm is available at http://metagenomics.atc.tcs.com/SPHINX/.

  16. Promoter classifier: software package for promoter database analysis.

    PubMed

    Gershenzon, Naum I; Ioshikhes, Ilya P

    2005-01-01

    Promoter Classifier is a package of seven stand-alone Windows-based C++ programs allowing the following basic manipulations with a set of promoter sequences: (i) calculation of positional distributions of nucleotides averaged over all promoters of the dataset; (ii) calculation of the averaged occurrence frequencies of the transcription factor binding sites and their combinations; (iii) division of the dataset into subsets of sequences containing or lacking certain promoter elements or combinations; (iv) extraction of the promoter subsets containing or lacking CpG islands around the transcription start site; and (v) calculation of spatial distributions of the promoter DNA stacking energy and bending stiffness. All programs have a user-friendly interface and provide the results in a convenient graphical form. The Promoter Classifier package is an effective tool for various basic manipulations with eukaryotic promoter sequences that usually are necessary for analysis of large promoter datasets. The program Promoter Divider is described in more detail as a representative component of the package.
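    Manipulation (i) in this record, the positional distribution of nucleotides averaged over a promoter set, can be sketched in a few lines. This is not code from the Promoter Classifier package (which is described as Windows-based C++); the function name positional_frequencies and the assumption that all sequences are pre-aligned to the same length are illustrative.

```python
def positional_frequencies(seqs):
    """Per-position A/C/G/T frequencies over equal-length, pre-aligned sequences."""
    if not seqs:
        raise ValueError("need at least one sequence")
    length = len(seqs[0])
    if any(len(s) != length for s in seqs):
        raise ValueError("all sequences must have the same length")
    freqs = []
    # zip(*seqs) yields one tuple of bases per alignment column.
    for column in zip(*(s.upper() for s in seqs)):
        n = len(column)
        freqs.append({base: column.count(base) / n for base in "ACGT"})
    return freqs


if __name__ == "__main__":
    promoters = ["TATA", "TACA", "TATA", "GATA"]  # toy data
    for pos, dist in enumerate(positional_frequencies(promoters)):
        print(pos, dist)
```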

  17. Estimation of pairwise sequence similarity of mammalian enhancers with word neighbourhood counts.

    PubMed

    Göke, Jonathan; Schulz, Marcel H; Lasserre, Julia; Vingron, Martin

    2012-03-01

    The identity of cells and tissues is to a large degree governed by transcriptional regulation. A major part is accomplished by the combinatorial binding of transcription factors at regulatory sequences, such as enhancers. Even though binding of transcription factors is sequence-specific, estimating the sequence similarity of two functionally similar enhancers is very difficult. However, a similarity measure for regulatory sequences is crucial to detect and understand functional similarities between two enhancers and will facilitate large-scale analyses like clustering, prediction and classification of genome-wide datasets. We present the standardized alignment-free sequence similarity measure N2, a flexible framework that is defined for word neighbourhoods. We explore the usefulness of adding reverse complement words as well as words including mismatches into the neighbourhood. On simulated enhancer sequences as well as functional enhancers in mouse development, N2 is shown to outperform previous alignment-free measures. N2 is flexible, faster than competing methods and less susceptible to single sequence noise and the occurrence of repetitive sequences. Experiments on the mouse enhancers reveal that enhancers active in different tissues can be separated by pairwise comparison using N2. N2 represents an improvement over previous alignment-free similarity measures without compromising speed, which makes it a good candidate for large-scale sequence comparison of regulatory sequences. The software is part of the open-source C++ library SeqAn (www.seqan.de) and a compiled version can be downloaded at http://www.seqan.de/projects/alf.html. Supplementary data are available at Bioinformatics online.
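    To make the idea of alignment-free comparison with reverse-complement words concrete, the sketch below counts k-mers together with their reverse complements and compares two sequences by cosine similarity. This is deliberately not the N2 measure (its standardization and mismatch neighbourhoods are not reproduced here); the function names, the default k = 3, and the choice of cosine similarity are assumptions for illustration.

```python
from collections import Counter
from math import sqrt

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def revcomp(seq):
    """Reverse complement of a DNA string over A/C/G/T."""
    return seq.translate(COMPLEMENT)[::-1]


def word_counts(seq, k=3, with_revcomp=True):
    """Count k-mers; optionally also count each word's reverse complement."""
    seq = seq.upper()
    counts = Counter()
    for i in range(len(seq) - k + 1):
        word = seq[i:i + k]
        if set(word) <= set("ACGT"):  # skip words containing ambiguous bases
            counts[word] += 1
            if with_revcomp:
                counts[revcomp(word)] += 1
    return counts


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0
```

    With reverse-complement words included, a sequence and its reverse complement yield identical count vectors, which is the strand-independence property the abstract alludes to.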

  18. Identification of optimum sequencing depth especially for de novo genome assembly of small genomes using next generation sequencing data.

    PubMed

    Desai, Aarti; Marwah, Veer Singh; Yadav, Akshay; Jha, Vineet; Dhaygude, Kishor; Bangar, Ujwala; Kulkarni, Vivek; Jere, Abhay

    2013-01-01

    Next Generation Sequencing (NGS) is a disruptive technology that has found widespread acceptance in the life sciences research community. The high throughput and low cost of sequencing have encouraged researchers to undertake ambitious genomic projects, especially in de novo genome sequencing. Currently, NGS systems generate sequence data as short reads, and de novo genome assembly using these short reads is computationally very intensive. Due to the lower cost of sequencing and higher throughput, NGS systems now provide the ability to sequence genomes at high depth. However, currently no report is available highlighting the impact of high sequence depth on genome assembly using real data sets and multiple assembly algorithms. Recently, some studies have evaluated the impact of sequence coverage, error rate and average read length on genome assembly using multiple assembly algorithms; however, these evaluations were performed using simulated datasets. One limitation of using simulated datasets is that variables such as error rates, read length and coverage, which are known to impact genome assembly, are carefully controlled. Hence, this study was undertaken to identify the minimum depth of sequencing required for de novo assembly of different sized genomes using graph-based assembly algorithms and real datasets. Illumina reads for E. coli (4.6 MB), S. kudriavzevii (11.18 MB) and C. elegans (100 MB) were assembled using SOAPdenovo, Velvet, ABySS, Meraculous and IDBA-UD. Our analysis shows that 50X is the optimum read depth for assembling these genomes using all assemblers except Meraculous, which requires 100X read depth. Moreover, our analysis shows that de novo assembly from 50X read data requires only 6-40 GB RAM, depending on the genome size and assembly algorithm used. We believe that this information can be extremely valuable for researchers in designing experiments and multiplexing, which will enable optimum utilization of sequencing as well as analysis resources.
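    The depth figures in this record follow the usual relation depth = (number of reads × read length) / genome size. The sketch below applies it to estimate how many reads a target depth implies; the 100 bp read length in the example is an assumption (the abstract does not state read lengths), and the function names are illustrative.

```python
import math


def reads_for_depth(genome_size_bp, depth, read_length_bp):
    """Number of reads needed to reach the target average depth."""
    return math.ceil(genome_size_bp * depth / read_length_bp)


def depth_from_reads(n_reads, read_length_bp, genome_size_bp):
    """Average depth obtained from a given number of reads."""
    return n_reads * read_length_bp / genome_size_bp


if __name__ == "__main__":
    # Genome sizes as quoted in the abstract; read length is an assumed 100 bp.
    for name, size in [("E. coli", 4_600_000),
                       ("S. kudriavzevii", 11_180_000),
                       ("C. elegans", 100_000_000)]:
        print(name, reads_for_depth(size, 50, 100))
```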

  19. Challenges and opportunities in understanding microbial communities with metagenome assembly (accompanied by IPython Notebook tutorial)

    DOE PAGES

    Howe, Adina; Chain, Patrick S. G.

    2015-07-09

    Metagenomic investigations hold great promise for informing the genetics, physiology, and ecology of environmental microorganisms. Current challenges for metagenomic analysis are related to our ability to connect the dots between sequencing reads, their population of origin, and their encoding functions. Assembly-based methods reduce dataset size by extending overlapping reads into larger contiguous sequences (contigs), providing contextual information for genetic sequences that does not rely on existing references. These methods, however, tend to be computationally intensive and are again challenged by sequencing errors as well as by genomic repeats. While numerous tools have been developed based on these methodological concepts, they present confounding choices and training requirements to metagenomic investigators. To help with accessibility to assembly tools, this review also includes an IPython Notebook metagenomic assembly tutorial. This tutorial has instructions for execution on any operating system using Amazon Elastic Cloud Compute and guides users through downloading, assembly, and mapping reads to contigs of a mock microbiome metagenome. Despite its challenges, metagenomic analysis has already revealed novel insights into many environments on Earth. As software, training, and data continue to emerge, metagenomic data access and its discoveries will grow.

  20. Structural classification of proteins using texture descriptors extracted from the cellular automata image.

    PubMed

    Kavianpour, Hamidreza; Vasighi, Mahdi

    2017-02-01

    Nowadays, knowledge about cellular attributes of proteins has an important role in pharmacy, medical science and molecular biology. These attributes are closely correlated with the function and three-dimensional structure of proteins. Knowledge of protein structural class is used by various methods for better understanding of protein functionality and folding patterns. Computational methods and intelligent systems can have an important role in performing structural classification of proteins. Most protein sequences are saved in databanks as characters and strings, and a numerical representation is essential for applying machine learning methods. In this work, a binary representation of protein sequences is introduced based on reduced amino acid alphabets according to the surrounding hydrophobicity index. Many important features which are hidden in these long binary sequences can be clearly displayed through their cellular automata images. The features extracted from these images are used to build a classification model with a support vector machine. Compared with previous studies on several benchmark datasets, the promising classification rates obtained by tenfold cross-validation imply that the current approach can help in revealing some inherent features deeply hidden in protein sequences and improve the quality of predicting protein structural class.

  1. Species Diversity of Puerto Rican Heterotermes (Dictyoptera: Rhinotermitidae) Revealed by Phylogenetic Analyses of Two Mitochondrial Genes

    PubMed Central

    Jones, Susan C.; Jenkins, Tracie M.

    2016-01-01

    The goal of this study was to infer Heterotermes (Froggatt) (Dictyoptera: Rhinotermitidae) species diversity on the island of Puerto Rico from phylogenetic analyses of DNA sequence data from two mitochondrial genes, 16S rRNA and cytochrome oxidase II (COII). This termite genus is a structural pest known to be well adapted to arid environments in subtropical and tropical regions worldwide including Puerto Rico and many other Caribbean islands. Extensive sampling was accomplished across Puerto Rico, and phylogenetic analyses of individual gene sequences from these samples indicated robust datasets of congruent gene tree topologies showing three monophyletic groups: H. cardini (Snyder), H. convexinotatus (Snyder), and H. tenuis (Hagen). We found that H. cardini and H. convexinotatus were widespread in the arid coastal regions of Puerto Rico, whereas H. tenuis was uncommon and may represent a relatively new introduction. We found only H. convexinotatus on Culebra Island. We provide strong evidence that Puerto Rico may be linked to the Heterotermes in southern Florida, USA, since its GenBank 16S sequence was identical to that of seven Puerto Rican H. cardini sequences. Our study represents the first records of H. cardini from Puerto Rico and Grand Bahama.

  2. Challenges and opportunities in understanding microbial communities with metagenome assembly (accompanied by IPython Notebook tutorial)

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Howe, Adina; Chain, Patrick S. G.

    Metagenomic investigations hold great promise for informing the genetics, physiology, and ecology of environmental microorganisms. Current challenges for metagenomic analysis are related to our ability to connect the dots between sequencing reads, their population of origin, and their encoding functions. Assembly-based methods reduce dataset size by extending overlapping reads into larger contiguous sequences (contigs), providing contextual information for genetic sequences that does not rely on existing references. These methods, however, tend to be computationally intensive and are again challenged by sequencing errors as well as by genomic repeats. While numerous tools have been developed based on these methodological concepts, they present confounding choices and training requirements to metagenomic investigators. To help with accessibility to assembly tools, this review also includes an IPython Notebook metagenomic assembly tutorial. This tutorial has instructions for execution on any operating system using Amazon Elastic Cloud Compute and guides users through downloading, assembly, and mapping reads to contigs of a mock microbiome metagenome. Despite its challenges, metagenomic analysis has already revealed novel insights into many environments on Earth. As software, training, and data continue to emerge, metagenomic data access and its discoveries will grow.

  3. Genovo: De Novo Assembly for Metagenomes

    NASA Astrophysics Data System (ADS)

    Laserson, Jonathan; Jojic, Vladimir; Koller, Daphne

    Next-generation sequencing technologies produce a large number of noisy reads from the DNA in a sample. Metagenomics and population sequencing aim to recover the genomic sequences of the species in the sample, which could be of high diversity. Methods geared towards single sequence reconstruction are not sensitive enough when applied in this setting. We introduce a generative probabilistic model of read generation from environmental samples and present Genovo, a novel de novo sequence assembler that discovers likely sequence reconstructions under the model. A Chinese restaurant process prior accounts for the unknown number of genomes in the sample. Inference is made by applying a series of hill-climbing steps iteratively until convergence. We compare the performance of Genovo to three other short read assembly programs across one synthetic dataset and eight metagenomic datasets created using the 454 platform, the largest of which has 311k reads. Genovo's reconstructions cover more bases and recover more genes than the other methods, and yield a higher assembly score.
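    The Chinese restaurant process (CRP) prior mentioned in this record lets the number of clusters (here, genomes) grow with the data: item i joins an existing cluster with probability proportional to that cluster's size, or opens a new one with probability proportional to a concentration parameter alpha. The sketch below samples from that prior only; it is not Genovo's read-generation model or its inference procedure, and the names crp_assign, alpha and seed are illustrative.

```python
import random


def crp_assign(n_items, alpha, seed=0):
    """Sample cluster labels for n_items from a Chinese restaurant process.

    Returns (labels, sizes): labels[i] is the cluster of item i,
    sizes[j] is the number of items in cluster j.
    """
    rng = random.Random(seed)
    sizes = []
    labels = []
    for i in range(n_items):
        # Total weight is i (items already seated) plus alpha (new cluster).
        r = rng.uniform(0, i + alpha)
        acc = 0.0
        for j, size in enumerate(sizes):
            acc += size
            if r < acc:
                sizes[j] += 1
                labels.append(j)
                break
        else:
            # r fell into the alpha-sized slice: open a new cluster.
            sizes.append(1)
            labels.append(len(sizes) - 1)
    return labels, sizes


if __name__ == "__main__":
    labels, sizes = crp_assign(1000, alpha=2.0)
    print(len(sizes), "clusters; largest has", max(sizes), "items")
```

    Larger values of alpha favour more clusters, which is how such a prior can account for an unknown number of genomes in a sample.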

  4. Sequence Data for Clostridium autoethanogenum using Three Generations of Sequencing Technologies

    DOE PAGES

    Utturkar, Sagar M.; Klingeman, Dawn Marie; Bruno-Barcena, José M.; ...

    2015-04-14

    During the past decade, DNA sequencing output has been mostly dominated by the second generation sequencing platforms, which are characterized by low cost, high throughput and shorter read lengths, for example Illumina. The emergence and development of so-called third generation sequencing platforms such as PacBio has permitted exceptionally long reads (over 20 kb) to be generated. Due to read length increases, algorithm improvements and hybrid assembly approaches, the concept of one chromosome, one contig and automated finishing of microbial genomes is now a realistic and achievable task for many microbial laboratories. In this paper, we describe high quality sequence datasets which span three generations of sequencing technologies, containing six types of data from four NGS platforms and originating from a single microorganism, Clostridium autoethanogenum. The dataset reported here will be useful for the scientific community to evaluate upcoming NGS platforms, enabling comparison of existing and novel bioinformatics approaches, and will encourage interest in the development of innovative experimental and computational methods for NGS data.

  5. Integrated Analyses of Cuticular Hydrocarbons, Chromosome and mtDNA in the Neotropical Social Wasp Mischocyttarus consimilis Zikán (Hymenoptera, Vespidae).

    PubMed

    Cunha, D A S; Menezes, R S T; Costa, M A; Lima, S M; Andrade, L H C; Antonialli, W F

    2017-12-01

    In the present work, we explored multiple data from different biological levels such as cuticular hydrocarbons, chromosomal features, and mtDNA sequences in the Neotropical social wasp Mischocyttarus consimilis (J.F. Zikán). Particularly, we explored the genetic and chemical differentiation level within and between populations of this insect. Our dataset revealed shallow intraspecific differentiation in M. consimilis. The similarity among the analyzed samples can probably be due to the geographical proximity where the colonies were sampled, and we argue that Paraná River did not contribute effectively as a historical barrier to this wasp.

  6. Evaluation of rapid and simple techniques for the enrichment of viruses prior to metagenomic virus discovery.

    PubMed

    Hall, Richard J; Wang, Jing; Todd, Angela K; Bissielo, Ange B; Yen, Seiha; Strydom, Hugo; Moore, Nicole E; Ren, Xiaoyun; Huang, Q Sue; Carter, Philip E; Peacey, Matthew

    2014-01-01

    The discovery of new or divergent viruses using metagenomics and high-throughput sequencing has become more commonplace. The preparation of a sample is known to have an effect on the representation of virus sequences within the metagenomic dataset yet comparatively little attention has been given to this. Physical enrichment techniques are often applied to samples to increase the number of viral sequences and therefore enhance the probability of detection. With the exception of virus ecology studies, there is a paucity of information available to researchers on the type of sample preparation required for a viral metagenomic study that seeks to identify an aetiological virus in an animal or human diagnostic sample. A review of published virus discovery studies revealed the most commonly used enrichment methods, that were usually quick and simple to implement, namely low-speed centrifugation, filtration, nuclease-treatment (or combinations of these) which have been routinely used but often without justification. These were applied to a simple and well-characterised artificial sample composed of bacterial and human cells, as well as DNA (adenovirus) and RNA viruses (influenza A and human enterovirus), being either non-enveloped capsid or enveloped viruses. The effect of the enrichment method was assessed by both quantitative real-time PCR and metagenomic analysis that incorporated an amplification step. Reductions in the absolute quantities of bacteria and human cells were observed for each method as determined by qPCR, but the relative abundance of viral sequences in the metagenomic dataset remained largely unchanged. A 3-step method of centrifugation, filtration and nuclease-treatment showed the greatest increase in the proportion of viral sequences. 
This study provides a starting point for the selection of a purification method in future virus discovery studies, and highlights the need for more data to validate the effect of enrichment methods on different sample types, amplification, bioinformatics approaches and sequencing platforms. This study also highlights the potential risks that may attend selection of a virus enrichment method without any consideration for the sample type being investigated. Copyright © 2013 The Authors. Published by Elsevier B.V. All rights reserved.

  7. Prediction of Drug-Target Interaction Networks from the Integration of Protein Sequences and Drug Chemical Structures.

    PubMed

    Meng, Fan-Rong; You, Zhu-Hong; Chen, Xing; Zhou, Yong; An, Ji-Yong

    2017-07-05

    Knowledge of drug-target interaction (DTI) plays an important role in discovering new drug candidates. Unfortunately, there are unavoidable shortcomings, including the time-consuming and expensive nature of the experimental methods used to determine DTI. This motivates us to develop an effective computational method to predict DTI based on protein sequence. In this paper, we propose a novel computational approach based on protein sequence, namely PDTPS (Predicting Drug Targets with Protein Sequence), to predict DTI. The PDTPS method combines Bi-gram probabilities (BIGP), Position Specific Scoring Matrix (PSSM), and Principal Component Analysis (PCA) with Relevance Vector Machine (RVM). In order to evaluate the prediction capacity of PDTPS, the experiment was carried out on enzyme, ion channel, GPCR, and nuclear receptor datasets by using five-fold cross-validation tests. The proposed PDTPS method achieved average accuracy of 97.73%, 93.12%, 86.78%, and 87.78% on enzyme, ion channel, GPCR and nuclear receptor datasets, respectively. The experimental results showed that our method has good prediction performance. Furthermore, in order to further evaluate the prediction performance of the proposed PDTPS method, we compared it with the state-of-the-art support vector machine (SVM) classifier on enzyme and ion channel datasets, and with other existing methods on four datasets. The promising comparison results further demonstrate the efficiency and robustness of the proposed PDTPS method. This makes it a useful tool, suitable for predicting DTI as well as for other bioinformatics tasks.

  8. SVM-PB-Pred: SVM based protein block prediction method using sequence profiles and secondary structures.

    PubMed

    Suresh, V; Parthasarathy, S

    2014-01-01

    We developed a support vector machine based web server called SVM-PB-Pred, to predict the Protein Block for any given amino acid sequence. The input features of SVM-PB-Pred include i) sequence profiles (PSSM) and ii) actual secondary structures (SS) from DSSP method or predicted secondary structures from NPS@ and GOR4 methods. There were three combined input features PSSM+SS(DSSP), PSSM+SS(NPS@) and PSSM+SS(GOR4) used to test and train the SVM models. Similarly, four datasets RS90, DB433, LI1264 and SP1577 were used to develop the SVM models. These four SVM models developed were tested using three different benchmarking tests namely; (i) self consistency, (ii) seven fold cross validation test and (iii) independent case test. The maximum possible prediction accuracy of ~70% was observed in self consistency test for the SVM models of both LI1264 and SP1577 datasets, where PSSM+SS(DSSP) input features was used to test. The prediction accuracies were reduced to ~53% for PSSM+SS(NPS@) and ~43% for PSSM+SS(GOR4) in independent case test, for the SVM models of above two same datasets. Using our method, it is possible to predict the protein block letters for any query protein sequence with ~53% accuracy, when the SP1577 dataset and predicted secondary structure from NPS@ server were used. The SVM-PB-Pred server can be freely accessed through http://bioinfo.bdu.ac.in/~svmpbpred.

  9. Predicting protein subcellular locations using hierarchical ensemble of Bayesian classifiers based on Markov chains.

    PubMed

    Bulashevska, Alla; Eils, Roland

    2006-06-14

    The subcellular location of a protein is closely related to its function. It would be worthwhile to develop a method to predict the subcellular location for a given protein when only the amino acid sequence of the protein is known. Although many efforts have been made to predict subcellular location from sequence information only, there is the need for further research to improve the accuracy of prediction. A novel method called HensBC is introduced to predict protein subcellular location. HensBC is a recursive algorithm which constructs a hierarchical ensemble of classifiers. The classifiers used are Bayesian classifiers based on Markov chain models. We tested our method on six various datasets; among them are Gram-negative bacteria dataset, data for discriminating outer membrane proteins and apoptosis proteins dataset. We observed that our method can predict the subcellular location with high accuracy. Another advantage of the proposed method is that it can improve the accuracy of the prediction of some classes with few sequences in training and is therefore useful for datasets with imbalanced distribution of classes. This study introduces an algorithm which uses only the primary sequence of a protein to predict its subcellular location. The proposed recursive scheme represents an interesting methodology for learning and combining classifiers. The method is computationally efficient and competitive with the previously reported approaches in terms of prediction accuracies as empirical results indicate. The code for the software is available upon request.
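    The building block named in this record, a Bayesian classifier based on Markov chain models, can be sketched as follows: one first-order Markov chain is estimated per class, and a query sequence is assigned to the class with the highest log prior plus log likelihood. The hierarchical, recursive ensemble of HensBC is not reproduced; the chain order, the add-one smoothing, and the function names train and classify are assumptions for illustration.

```python
from collections import defaultdict
from math import log

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def train(seqs_by_class):
    """Estimate first-order transition counts and a class prior per label."""
    total = sum(len(seqs) for seqs in seqs_by_class.values())
    models = {}
    for label, seqs in seqs_by_class.items():
        pair_counts = defaultdict(int)   # (a, b) -> count of a followed by b
        from_counts = defaultdict(int)   # a -> count of transitions leaving a
        for s in seqs:
            for a, b in zip(s, s[1:]):
                pair_counts[a, b] += 1
                from_counts[a] += 1
        models[label] = (pair_counts, from_counts, log(len(seqs) / total))
    return models


def classify(models, seq):
    """Return the label maximizing log prior + log likelihood of seq."""
    def score(model):
        pair_counts, from_counts, log_prior = model
        ll = log_prior
        for a, b in zip(seq, seq[1:]):
            # Add-one (Laplace) smoothing over the 20-letter alphabet.
            ll += log((pair_counts[a, b] + 1) / (from_counts[a] + len(AMINO_ACIDS)))
        return ll
    return max(models, key=lambda label: score(models[label]))


if __name__ == "__main__":
    toy = {  # toy training data, not real localization classes
        "hydrophobic": ["LLVVLLIVLL", "VVLLIILLVV"],
        "charged": ["KKEEDDKKEE", "DDEEKKRRDD"],
    }
    print(classify(train(toy), "LLVVLI"))
```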

  10. LS³: A Method for Improving Phylogenomic Inferences When Evolutionary Rates Are Heterogeneous among Taxa.

    PubMed

    Rivera-Rivera, Carlos J; Montoya-Burgos, Juan I

    2016-06-01

    Phylogenetic inference artifacts can occur when sequence evolution deviates from assumptions made by the models used to analyze them. The combination of strong model assumption violations and highly heterogeneous lineage evolutionary rates can become problematic in phylogenetic inference, and lead to the well-described long-branch attraction (LBA) artifact. Here, we define an objective criterion for assessing lineage evolutionary rate heterogeneity among predefined lineages: the result of a likelihood ratio test between a model in which the lineages evolve at the same rate (homogeneous model) and a model in which different lineage rates are allowed (heterogeneous model). We implement this criterion in the algorithm Locus Specific Sequence Subsampling (LS³), aimed at reducing the effects of LBA in multi-gene datasets. For each gene, LS³ sequentially removes the fastest-evolving taxon of the ingroup and tests for lineage rate homogeneity until all lineages have uniform evolutionary rates. The sequences excluded from the homogeneously evolving taxon subset are flagged as potentially problematic. The software implementation provides the user with the possibility to remove the flagged sequences for generating a new concatenated alignment. We tested LS³ with simulations and two real datasets containing LBA artifacts: a nucleotide dataset regarding the position of Glires within mammals and an amino-acid dataset concerning the position of nematodes within bilaterians. The initially incorrect phylogenies were corrected in all cases upon removing data flagged by LS³. © The Author 2016. Published by Oxford University Press on behalf of the Society for Molecular Biology and Evolution.

  11. In silico analysis of expressed sequence tags from Trichostrongylus vitrinus (Nematoda): comparison of the automated ESTExplorer workflow platform with conventional database searches.

    PubMed

    Nagaraj, Shivashankar H; Gasser, Robin B; Nisbet, Alasdair J; Ranganathan, Shoba

    2008-01-01

    The analysis of expressed sequence tags (EST) offers a rapid and cost effective approach to elucidate the transcriptome of an organism, but requires several computational methods for assembly and annotation. Researchers frequently analyse each step manually, which is laborious and time consuming. We have recently developed ESTExplorer, a semi-automated computational workflow system, in order to achieve the rapid analysis of EST datasets. In this study, we evaluated EST data analysis for the parasitic nematode Trichostrongylus vitrinus (order Strongylida) using ESTExplorer, compared with database matching alone. We functionally annotated 1776 ESTs obtained via suppressive-subtractive hybridisation from T. vitrinus, an important parasitic trichostrongylid of small ruminants. Cluster and comparative genomic analyses of the transcripts using ESTExplorer indicated that 290 (41%) sequences had homologues in Caenorhabditis elegans, 329 (42%) in parasitic nematodes, 202 (28%) in organisms other than nematodes, and 218 (31%) had no significant match to any sequence in the current databases. Of the C. elegans homologues, 90 were associated with 'non-wildtype' double-stranded RNA interference (RNAi) phenotypes, including embryonic lethality, maternal sterility, sterile progeny, larval arrest and slow growth. We could functionally classify 267 (38%) sequences using the Gene Ontologies (GO) and establish pathway associations for 230 (33%) sequences using the Kyoto Encyclopedia of Genes and Genomes (KEGG). Further examination of this EST dataset revealed a number of signalling molecules, proteases, protease inhibitors, enzymes, ion channels and immune-related genes. In addition, we identified 40 putative secreted proteins that could represent potential candidates for developing novel anthelmintics or vaccines. We further compared the automated EST sequence annotations, using ESTExplorer, with database search results for individual T. vitrinus ESTs. 
ESTExplorer reliably and rapidly annotated 301 ESTs, with pathway and GO information, eliminating 60 low quality hits from database searches. We evaluated the efficacy of ESTExplorer in analysing EST data, and demonstrate that computational tools can be used to accelerate the process of gene discovery in EST sequencing projects. The present study has elucidated sets of relatively conserved and potentially novel genes for biological investigation, and the annotated EST set provides further insight into the molecular biology of T. vitrinus, towards the identification of novel drug targets.

  12. Deep sequencing of the Camellia chekiangoleosa transcriptome revealed candidate genes for anthocyanin biosynthesis.

    PubMed

    Wang, Zhong-Wei; Jiang, Cong; Wen, Qiang; Wang, Na; Tao, Yuan-Yuan; Xu, Li-An

    2014-03-15

    Camellia chekiangoleosa is an important species of genus Camellia. It provides high-quality edible oil and has great ornamental value. The flowers are big and red which bloom between February and March. Flower pigmentation is closely related to the accumulation of anthocyanin. Although anthocyanin biosynthesis has been studied extensively in herbaceous plants, little molecular information on the anthocyanin biosynthesis pathway of C. chekiangoleosa is yet known. In the present study, a cDNA library was constructed to obtain detailed and general data from the flowers of C. chekiangoleosa. To explore the transcriptome of C. chekiangoleosa and investigate genes involved in anthocyanin biosynthesis, a 454 GS FLX Titanium platform was used to generate an EST dataset. About 46,279 sequences were obtained, and 24,593 (53.1%) were annotated. Using Blast search against the AGRIS, 1740 unigenes were found homologous to 599 Arabidopsis transcription factor genes. Based on the transcriptome dataset, nine anthocyanin biosynthesis pathway genes (PAL, CHS1, CHS2, CHS3, CHI, F3H, DFR, ANS, and UFGT) were identified and cloned. The spatio-temporal expression patterns of these genes were also analyzed using quantitative real-time polymerase chain reaction. The study results not only enrich the gene resource but also provide valuable information for further studies concerning anthocyanin biosynthesis. Copyright © 2014 Elsevier B.V. All rights reserved.

  13. Detecting exact breakpoints of deletions with diversity in hepatitis B viral genomic DNA from next-generation sequencing data.

    PubMed

    Cheng, Ji-Hong; Liu, Wen-Chun; Chang, Ting-Tsung; Hsieh, Sun-Yuan; Tseng, Vincent S

    2017-10-01

    Many studies have suggested that deletions in the hepatitis B virus (HBV) genome are associated with the development of progressive liver diseases, ultimately even resulting in hepatocellular carcinoma (HCC). Among the methods for detecting deletions from next-generation sequencing (NGS) data, few consider the characteristics of viruses, such as high evolution rates and high divergence among different HBV genomes. Sequencing highly divergent HBV genomes with NGS technology outputs millions of reads, so detecting exact deletion breakpoints in these large and complex data incurs a very high computational cost. We propose a novel analytical method named VirDelect (Virus Deletion Detect), which uses split-read alignment to detect exact breakpoints and a diversity variable to account for high divergence in single-end read data, so that computational cost can be reduced without losing accuracy. We used four simulated read datasets and two real paired-end read datasets of HBV genome sequences to verify VirDelect's accuracy using score functions. The experimental results show that VirDelect outperforms the state-of-the-art method Pindel in accuracy score on all simulated datasets, and that VirDelect had only two base errors even on the real datasets. VirDelect is also shown to deliver high accuracy on single-end as well as paired-end read data. VirDelect can serve as an effective and efficient bioinformatics tool for physiologists, and is applicable to further analyses of genomes with characteristics similar to HBV in genome length and high divergence. The VirDelect software can be downloaded at https://sourceforge.net/projects/virdelect/. Copyright © 2017. Published by Elsevier Inc.

  14. miRCat2: accurate prediction of plant and animal microRNAs from next-generation sequencing datasets

    PubMed Central

    Paicu, Claudia; Mohorianu, Irina; Stocks, Matthew; Xu, Ping; Coince, Aurore; Billmeier, Martina; Dalmay, Tamas; Moulton, Vincent; Moxon, Simon

    2017-01-01

    Abstract Motivation MicroRNAs are a class of ∼21–22 nt small RNAs which are excised from a stable hairpin-like secondary structure. They have important gene regulatory functions and are involved in many pathways including developmental timing, organogenesis and development in eukaryotes. There are several computational tools for miRNA detection from next-generation sequencing datasets. However, many of these tools suffer from high false positive and false negative rates. Here we present a novel miRNA prediction algorithm, miRCat2. miRCat2 incorporates a new entropy-based approach to detect miRNA loci, which is designed to cope with the high sequencing depth of current next-generation sequencing datasets. It has a user-friendly interface and produces graphical representations of the hairpin structure and plots depicting the alignment of sequences on the secondary structure. Results We test miRCat2 on a number of animal and plant datasets and present a comparative analysis with miRCat, miRDeep2, miRPlant and miReap. We also use mutants in the miRNA biogenesis pathway to evaluate the predictions of these tools. Results indicate that miRCat2 has an improved accuracy compared with other methods tested. Moreover, miRCat2 predicts several new miRNAs that are differentially expressed in wild-type versus mutants in the miRNA biogenesis pathway. Availability and Implementation miRCat2 is part of the UEA small RNA Workbench and is freely available from http://srna-workbench.cmp.uea.ac.uk/. Contact v.moulton@uea.ac.uk or s.moxon@uea.ac.uk Supplementary information Supplementary data are available at Bioinformatics online. PMID:28407097

  15. Genetic diversity and recombination analysis of sweepoviruses from Brazil

    PubMed Central

    2012-01-01

    Background Monopartite begomoviruses (genus Begomovirus, family Geminiviridae) that infect sweet potato (Ipomoea batatas) around the world are known as sweepoviruses. Because sweet potato plants are vegetatively propagated, the accumulation of viruses can become a major constraint for root production. Mixed infections of sweepovirus species and strains can lead to recombination, which may contribute to the generation of new recombinant sweepoviruses. Results This study reports the full genome sequence of 34 sweepoviruses sampled from a sweet potato germplasm bank and commercial fields in Brazil. These sequences were compared with others from public nucleotide sequence databases to provide a comprehensive overview of the genetic diversity and patterns of genetic exchange in sweepoviruses isolated from Brazil, as well as to review the classification and nomenclature of sweepoviruses in accordance with the current guidelines proposed by the Geminiviridae Study Group of the International Committee on Taxonomy of Viruses (ICTV). Co-infections and extensive recombination events were identified in Brazilian sweepoviruses. Analysis of the recombination breakpoints detected within the sweepovirus dataset revealed that most recombination events occurred in the intergenic region (IR) and in the middle of the C1 open reading frame (ORF). Conclusions The genetic diversity of sweepoviruses was considerably greater than previously described in Brazil. Moreover, recombination analysis revealed that a genomic exchange is responsible for the emergence of sweepovirus species and strains and provided valuable new information for understanding the diversity and evolution of sweepoviruses. PMID:23082767

  16. Core microbial functional activities in ocean environments revealed by global metagenomic profiling analyses.

    PubMed

    Ferreira, Ari J S; Siam, Rania; Setubal, João C; Moustafa, Ahmed; Sayed, Ahmed; Chambergo, Felipe S; Dawe, Adam S; Ghazy, Mohamed A; Sharaf, Hazem; Ouf, Amged; Alam, Intikhab; Abdel-Haleem, Alyaa M; Lehvaslaiho, Heikki; Ramadan, Eman; Antunes, André; Stingl, Ulrich; Archer, John A C; Jankovic, Boris R; Sogin, Mitchell; Bajic, Vladimir B; El-Dorry, Hamza

    2014-01-01

    Metagenomics-based functional profiling analysis is an effective means of gaining deeper insight into the composition of marine microbial populations and developing a better understanding of the interplay between the functional genome content of microbial communities and abiotic factors. Here we present a comprehensive analysis of 24 datasets covering surface and depth-related environments at 11 sites around the world's oceans. The complete dataset comprises approximately 12 million sequences, totaling 5,358 Mb. Based on profiling patterns of Clusters of Orthologous Groups (COGs) of proteins, a core set of reference photic and aphotic depth-related COGs, and a collection of COGs that are associated with extreme oxygen limitation were defined. Their inferred functions were utilized as indicators to characterize the distribution of light- and oxygen-related biological activities in marine environments. The results reveal that, while light level in the water column is a major determinant of phenotypic adaptation in marine microorganisms, oxygen concentration in the aphotic zone has a significant impact only in extremely hypoxic waters. Phylogenetic profiling of the reference photic/aphotic gene sets revealed a greater variety of source organisms in the aphotic zone, although the majority of individual photic and aphotic depth-related COGs are assigned to the same taxa across the different sites. This increase in phylogenetic and functional diversity of the core aphotic-related COGs most probably reflects selection for the utilization of a broad range of alternate energy sources in the absence of light.

  17. Core Microbial Functional Activities in Ocean Environments Revealed by Global Metagenomic Profiling Analyses

    PubMed Central

    Ferreira, Ari J. S.; Siam, Rania; Setubal, João C.; Moustafa, Ahmed; Sayed, Ahmed; Chambergo, Felipe S.; Dawe, Adam S.; Ghazy, Mohamed A.; Sharaf, Hazem; Ouf, Amged; Alam, Intikhab; Abdel-Haleem, Alyaa M.; Lehvaslaiho, Heikki; Ramadan, Eman; Antunes, André; Stingl, Ulrich; Archer, John A. C.; Jankovic, Boris R.; Sogin, Mitchell; Bajic, Vladimir B.; El-Dorry, Hamza

    2014-01-01

    Metagenomics-based functional profiling analysis is an effective means of gaining deeper insight into the composition of marine microbial populations and developing a better understanding of the interplay between the functional genome content of microbial communities and abiotic factors. Here we present a comprehensive analysis of 24 datasets covering surface and depth-related environments at 11 sites around the world's oceans. The complete dataset comprises approximately 12 million sequences, totaling 5,358 Mb. Based on profiling patterns of Clusters of Orthologous Groups (COGs) of proteins, a core set of reference photic and aphotic depth-related COGs, and a collection of COGs that are associated with extreme oxygen limitation were defined. Their inferred functions were utilized as indicators to characterize the distribution of light- and oxygen-related biological activities in marine environments. The results reveal that, while light level in the water column is a major determinant of phenotypic adaptation in marine microorganisms, oxygen concentration in the aphotic zone has a significant impact only in extremely hypoxic waters. Phylogenetic profiling of the reference photic/aphotic gene sets revealed a greater variety of source organisms in the aphotic zone, although the majority of individual photic and aphotic depth-related COGs are assigned to the same taxa across the different sites. This increase in phylogenetic and functional diversity of the core aphotic-related COGs most probably reflects selection for the utilization of a broad range of alternate energy sources in the absence of light. PMID:24921648

  18. Rhipicephalus microplus dataset of nonredundant raw sequence reads from 454 GS FLX sequencing of Cot-selected (Cot = 660) genomic DNA

    USDA-ARS's Scientific Manuscript database

    A reassociation kinetics-based approach was used to reduce the complexity of genomic DNA from the Deutsch laboratory strain of the cattle tick, Rhipicephalus microplus, to facilitate genome sequencing. Selected genomic DNA (Cot value = 660) was sequenced using 454 GS FLX technology, resulting in 356...

  19. Screening Currency Notes for Microbial Pathogens and Antibiotic Resistance Genes Using a Shotgun Metagenomic Approach

    PubMed Central

    Jalali, Saakshi; Kohli, Samantha; Latka, Chitra; Bhatia, Sugandha; Vellarikal, Shamsudheen Karuthedath; Sivasubbu, Sridhar; Scaria, Vinod; Ramachandran, Srinivasan

    2015-01-01

    Fomites are a well-known source of microbial infections, and previous studies have provided insights into the sojourning microbiome of fomites from various sources. Paper currency notes are among the most commonly exchanged objects, and their potential to transmit pathogenic organisms has been well recognized. Approaches to identify the microbiome associated with paper currency notes have been largely limited to culture-dependent approaches. Subsequent studies used 16S ribosomal RNA-based approaches, which provided insights into the taxonomic distribution of the microbiome. However, recent techniques including shotgun sequencing provide resolution at the gene level and enable estimation of gene copy numbers in the metagenome. We investigated the microbiome of Indian paper currency notes using a shotgun metagenome sequencing approach. Metagenomic DNA isolated from samples of frequently circulated denominations of Indian currency notes was sequenced using an Illumina HiSeq sequencer. Analysis of the data revealed the presence of species belonging to both eukaryotic and prokaryotic genera. The taxonomic distribution at kingdom level revealed contigs mapping to eukaryota (70%), bacteria (9%), viruses and archaea (~1%). We identified 78 pathogens including Staphylococcus aureus, Corynebacterium glutamicum, Enterococcus faecalis, and 75 cellulose-degrading organisms including Acidothermus cellulolyticus, Cellulomonas flavigena and Ruminococcus albus. Additionally, 78 antibiotic resistance genes were identified, and 18 of these were found in all the samples. Furthermore, six out of 78 pathogens harbored at least one of the 18 common antibiotic resistance genes. To the best of our knowledge, this is the first report of a shotgun metagenome sequence dataset of paper currency notes, which can be useful for future applications including bio-surveillance of exchangeable fomites for infectious agents. PMID:26035208

  20. Identification of DNA-binding proteins using multi-features fusion and binary firefly optimization algorithm.

    PubMed

    Zhang, Jian; Gao, Bo; Chai, Haiting; Ma, Zhiqiang; Yang, Guifu

    2016-08-26

    DNA-binding proteins (DBPs) play fundamental roles in many biological processes. Therefore, the development of effective computational tools for identifying DBPs is becoming highly desirable. In this study, we proposed an accurate method for the prediction of DBPs. Firstly, we focused on the challenge of improving DBP prediction accuracy with information solely from the sequence. Secondly, we used multiple informative features to encode the protein. These features included evolutionary conservation profile, secondary structure motifs, and physicochemical properties. Thirdly, we introduced a novel improved Binary Firefly Algorithm (BFA) to remove redundant or noisy features as well as select optimal parameters for the classifier. On two benchmark datasets, our predictor outperformed many state-of-the-art predictors, which demonstrates the effectiveness of our method. The promising prediction performance on a newly compiled independent testing dataset from PDB and a large-scale dataset from UniProt showed the good generalization ability of our method. In addition, the BFA developed in this research has great potential for practical applications in optimization, especially in feature selection problems. A highly accurate method was proposed for the identification of DBPs. A user-friendly web-server named iDbP (identification of DNA-binding Proteins) was constructed and provided for academic use.

  1. De novo transcriptome sequencing in Frankliniella occidentalis to identify genes involved in plant virus transmission and insecticide resistance.

    PubMed

    Zhang, Zhijun; Zhang, Pengjun; Li, Weidi; Zhang, Jinming; Huang, Fang; Yang, Jian; Bei, Yawei; Lu, Yaobin

    2013-05-01

    The western flower thrips (WFT), Frankliniella occidentalis, a worldwide invasive insect, causes agricultural damage by feeding directly and by vectoring Tospoviruses, such as Tomato spotted wilt virus (TSWV). We characterized the transcriptome of WFT and analyzed the global gene expression response of WFT to TSWV infection using the Illumina sequencing platform. We compiled 59,932 unigenes, and identified 36,339 unigenes by similarity analysis against public databases, most of which were annotated using gene ontology (GO) and Kyoto Encyclopedia of Genes and Genomes (KEGG) analysis. Within these annotated transcripts, we collected 278 sequences related to insecticide resistance. GO and KEGG analysis of differentially expressed genes between TSWV-infected and non-infected WFT populations revealed that TSWV can regulate cellular processes and the immune response, which might lead to low virus titers in thrips cells and no detrimental effects on F. occidentalis. This dataset not only enriches the genomic resources for WFT, but also benefits research into its molecular genetics and functional genomics. Copyright © 2013 Elsevier Inc. All rights reserved.

  2. Structural Analysis of Biodiversity

    PubMed Central

    Sirovich, Lawrence; Stoeckle, Mark Y.; Zhang, Yu

    2010-01-01

    Large, recently-available genomic databases cover a wide range of life forms, suggesting opportunity for insights into genetic structure of biodiversity. In this study we refine our recently-described technique using indicator vectors to analyze and visualize nucleotide sequences. The indicator vector approach generates correlation matrices, dubbed Klee diagrams, which represent a novel way of assembling and viewing large genomic datasets. To explore its potential utility, here we apply the improved algorithm to a collection of almost 17000 DNA barcode sequences covering 12 widely-separated animal taxa, demonstrating that indicator vectors for classification gave correct assignment in all 11000 test cases. Indicator vector analysis revealed discontinuities corresponding to species- and higher-level taxonomic divisions, suggesting an efficient approach to classification of organisms from poorly-studied groups. As compared to standard distance metrics, indicator vectors preserve diagnostic character probabilities, enable automated classification of test sequences, and generate high-information density single-page displays. These results support application of indicator vectors for comparative analysis of large nucleotide data sets and raise prospect of gaining insight into broad-scale patterns in the genetic structure of biodiversity. PMID:20195371

  3. Design of multiple sequence alignment algorithms on parallel, distributed memory supercomputers.

    PubMed

    Church, Philip C; Goscinski, Andrzej; Holt, Kathryn; Inouye, Michael; Ghoting, Amol; Makarychev, Konstantin; Reumann, Matthias

    2011-01-01

    The challenge of comparing two or more genomes that have undergone recombination and substantial amounts of segmental loss and gain has recently been addressed for small numbers of genomes. However, datasets of hundreds of genomes are now common and their sizes will only increase in the future. Multiple sequence alignment of hundreds of genomes remains an intractable problem due to quadratic increases in compute time and memory footprint. To date, most alignment algorithms are designed for commodity clusters without parallelism. Hence, we propose the design of a multiple sequence alignment algorithm on massively parallel, distributed memory supercomputers to enable research into comparative genomics on large data sets. Following the methodology of the sequential progressiveMauve algorithm, we design data structures including sequences and sorted k-mer lists on the IBM Blue Gene/P supercomputer (BG/P). Preliminary results show that we can reduce the memory footprint so that we can potentially align over 250 bacterial genomes on a single BG/P compute node. We verify our results on a dataset of E.coli, Shigella and S.pneumoniae genomes. Our implementation returns results matching those of the original algorithm but in 1/2 the time and with 1/4 the memory footprint for scaffold building. In this study, we have laid the basis for multiple sequence alignment of large-scale datasets on a massively parallel, distributed memory supercomputer, thus enabling comparison of hundreds instead of a few genome sequences within reasonable time.
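
    The abstract above mentions sorted k-mer lists as one of the data structures built for alignment. As a rough illustration of why sorting helps, the sketch below builds such a list for two sequences and merges them to find shared k-mers that could serve as candidate anchors. It is a simplified assumption-laden example, not the authors' Blue Gene/P implementation: it uses plain contiguous k-mers, and the function names and toy sequences are hypothetical.

```python
def sorted_kmer_list(seq, k):
    """Return (k-mer, start position) pairs sorted lexicographically by k-mer."""
    return sorted((seq[i:i + k], i) for i in range(len(seq) - k + 1))


def shared_kmers(list_a, list_b):
    """Merge two sorted k-mer lists and return (k-mer, pos_a, pos_b) matches.

    Because both inputs are sorted, one linear pass is enough to find every
    k-mer present in both sequences.
    """
    matches = []
    i = j = 0
    while i < len(list_a) and j < len(list_b):
        ka, kb = list_a[i][0], list_b[j][0]
        if ka < kb:
            i += 1
        elif ka > kb:
            j += 1
        else:
            # Find the run of identical k-mers in each list.
            i_end = i
            while i_end < len(list_a) and list_a[i_end][0] == ka:
                i_end += 1
            j_end = j
            while j_end < len(list_b) and list_b[j_end][0] == ka:
                j_end += 1
            # Pair every occurrence in A with every occurrence in B.
            for _, pos_a in list_a[i:i_end]:
                for _, pos_b in list_b[j:j_end]:
                    matches.append((ka, pos_a, pos_b))
            i, j = i_end, j_end
    return matches


# Toy sequences (hypothetical, for illustration only).
list_a = sorted_kmer_list("ACGTAC", 3)
list_b = sorted_kmer_list("TTACGT", 3)
print(shared_kmers(list_a, list_b))
```

    Sorting costs O(n log n) per sequence, after which each pairwise comparison is a linear merge; a production aligner would also have to handle repeats, reverse complements and memory layout, which this sketch ignores.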

  4. FRAGS: estimation of coding sequence substitution rates from fragmentary data

    PubMed Central

    Swart, Estienne C; Hide, Winston A; Seoighe, Cathal

    2004-01-01

    Background Rates of substitution in protein-coding sequences can provide important insights into evolutionary processes that are of biomedical and theoretical interest. Increased availability of coding sequence data has enabled researchers to estimate more accurately the coding sequence divergence of pairs of organisms. However the use of different data sources, alignment protocols and methods to estimate substitution rates leads to widely varying estimates of key parameters that define the coding sequence divergence of orthologous genes. Although complete genome sequence data are not available for all organisms, fragmentary sequence data can provide accurate estimates of substitution rates provided that an appropriate and consistent methodology is used and that differences in the estimates obtainable from different data sources are taken into account. Results We have developed FRAGS, an application framework that uses existing, freely available software components to construct in-frame alignments and estimate coding substitution rates from fragmentary sequence data. Coding sequence substitution estimates for human and chimpanzee sequences, generated by FRAGS, reveal that methodological differences can give rise to significantly different estimates of important substitution parameters. The estimated substitution rates were also used to infer upper-bounds on the amount of sequencing error in the datasets that we have analysed. Conclusion We have developed a system that performs robust estimation of substitution rates for orthologous sequences from a pair of organisms. Our system can be used when fragmentary genomic or transcript data is available from one of the organisms and the other is a completely sequenced genome within the Ensembl database. As well as estimating substitution statistics our system enables the user to manage and query alignment and substitution data. PMID:15005802

  5. Sequence-based analysis of the Vitis vinifera L. cv Cabernet Sauvignon grape must mycobiome in three South African vineyards employing distinct agronomic systems

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Setati, Mathabatha E.; Jacobson, Daniel; Bauer, Florian F.

    Recent microbiomic research of agricultural habitats has highlighted tremendous microbial biodiversity associated with such ecosystems. In addition, data generated in vineyards have furthermore highlighted significant regional differences in vineyard biodiversity, hinting at the possibility that such differences might be responsible for regional differences in wine style and character, a hypothesis referred to as "microbial terroir." The current study further contributes to this body of work by comparing the mycobiome associated with South African (SA) Cabernet Sauvignon grapes in three neighboring vineyards that employ different agronomic approaches, and comparing the outcome with similar data sets from Californian vineyards. The aim of this study was to fully characterize the mycobiomes associated with the grapes from these vineyards. The data revealed approximately 10 times more fungal diversity than what is typically retrieved from culture-based studies. The Biodynamic vineyard was found to harbor a more diverse fungal community (H = 2.6) than the conventional (H = 2.1) and integrated (H = 1.8) vineyards. The data show that ascomycota are the most abundant phylum in the three vineyards, with Aureobasidium pullulans and its close relative Kabatiella microsticta being the most dominant fungi. This is the first report to reveal a high incidence of K. microsticta in the grape/wine ecosystem. Different common wine yeast species, such as Metschnikowia pulcherrima and Starmerella bacillaris dominated the mycobiome in the three vineyards. The data show that the filamentous fungi are the most abundant community in grape must although they are not regarded as relevant during wine fermentation. Comparison of metagenomic datasets from the three SA vineyards and previously published data from Californian vineyards revealed only 25% of the fungi in the SA dataset was also present in the Californian dataset, with greater variation evident amongst ubiquitous epiphytic fungi.

  6. Sequence-based Analysis of the Vitis vinifera L. cv Cabernet Sauvignon Grape Must Mycobiome in Three South African Vineyards Employing Distinct Agronomic Systems

    PubMed Central

    Setati, Mathabatha E.; Jacobson, Daniel; Bauer, Florian F.

    2015-01-01

    Recent microbiomic research of agricultural habitats has highlighted tremendous microbial biodiversity associated with such ecosystems. Data generated in vineyards have furthermore highlighted significant regional differences in vineyard biodiversity, hinting at the possibility that such differences might be responsible for regional differences in wine style and character, a hypothesis referred to as “microbial terroir.” The current study further contributes to this body of work by comparing the mycobiome associated with South African (SA) Cabernet Sauvignon grapes in three neighboring vineyards that employ different agronomic approaches, and comparing the outcome with similar data sets from Californian vineyards. The aim of this study was to fully characterize the mycobiomes associated with the grapes from these vineyards. The data revealed approximately 10 times more fungal diversity than what is typically retrieved from culture-based studies. The Biodynamic vineyard was found to harbor a more diverse fungal community (H = 2.6) than the conventional (H = 2.1) and integrated (H = 1.8) vineyards. The data show that ascomycota are the most abundant phylum in the three vineyards, with Aureobasidium pullulans and its close relative Kabatiella microsticta being the most dominant fungi. This is the first report to reveal a high incidence of K. microsticta in the grape/wine ecosystem. Different common wine yeast species, such as Metschnikowia pulcherrima and Starmerella bacillaris dominated the mycobiome in the three vineyards. The data show that the filamentous fungi are the most abundant community in grape must although they are not regarded as relevant during wine fermentation. Comparison of metagenomic datasets from the three SA vineyards and previously published data from Californian vineyards revealed only 25% of the fungi in the SA dataset was also present in the Californian dataset, with greater variation evident amongst ubiquitous epiphytic fungi. 
PMID:26648930
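
    The diversity values quoted above (H = 2.6, 2.1 and 1.8) read like Shannon diversity indices, although the abstract does not name the index. Assuming that interpretation, H = -sum(p_i * ln p_i) over the proportions p_i of each taxon. The sketch below shows the calculation; the function name and the read counts are hypothetical and are not data from the study.

```python
import math


def shannon_index(counts):
    """Shannon diversity H = -sum(p * ln p) over taxa with non-zero counts."""
    total = sum(counts)
    if total <= 0:
        raise ValueError("counts must contain at least one positive value")
    h = 0.0
    for count in counts:
        if count > 0:
            p = count / total
            h -= p * math.log(p)
    return h


# Hypothetical read counts per fungal taxon (not from the study).
even_community = [25, 25, 25, 25]    # four equally abundant taxa
skewed_community = [97, 1, 1, 1]     # one dominant taxon

print(round(shannon_index(even_community), 3))    # ln(4), about 1.386
print(shannon_index(skewed_community) < shannon_index(even_community))  # True
```

    For a fixed number of taxa, H is largest when all taxa are equally abundant and falls as one taxon dominates, which is why a higher H is read as a more diverse community.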

  7. Sequence-based analysis of the Vitis vinifera L. cv Cabernet Sauvignon grape must mycobiome in three South African vineyards employing distinct agronomic systems

    DOE PAGES

    Setati, Mathabatha E.; Jacobson, Daniel; Bauer, Florian F.

    2015-11-30

    Recent microbiomic research of agricultural habitats has highlighted tremendous microbial biodiversity associated with such ecosystems. In addition, data generated in vineyards have furthermore highlighted significant regional differences in vineyard biodiversity, hinting at the possibility that such differences might be responsible for regional differences in wine style and character, a hypothesis referred to as "microbial terroir." The current study further contributes to this body of work by comparing the mycobiome associated with South African (SA) Cabernet Sauvignon grapes in three neighboring vineyards that employ different agronomic approaches, and comparing the outcome with similar data sets from Californian vineyards. The aim of this study was to fully characterize the mycobiomes associated with the grapes from these vineyards. The data revealed approximately 10 times more fungal diversity than what is typically retrieved from culture-based studies. The Biodynamic vineyard was found to harbor a more diverse fungal community (H = 2.6) than the conventional (H = 2.1) and integrated (H = 1.8) vineyards. The data show that ascomycota are the most abundant phylum in the three vineyards, with Aureobasidium pullulans and its close relative Kabatiella microsticta being the most dominant fungi. This is the first report to reveal a high incidence of K. microsticta in the grape/wine ecosystem. Different common wine yeast species, such as Metschnikowia pulcherrima and Starmerella bacillaris dominated the mycobiome in the three vineyards. The data show that the filamentous fungi are the most abundant community in grape must although they are not regarded as relevant during wine fermentation. Comparison of metagenomic datasets from the three SA vineyards and previously published data from Californian vineyards revealed only 25% of the fungi in the SA dataset was also present in the Californian dataset, with greater variation evident amongst ubiquitous epiphytic fungi.

  8. Transcriptome deep-sequencing and clustering of expressed isoforms from Favia corals

    PubMed Central

    2013-01-01

    Background Genomic and transcriptomic sequence data are essential tools for tackling ecological problems. Using an approach that combines next-generation sequencing, de novo transcriptome assembly, gene annotation and synthetic gene construction, we identify and cluster the protein families from Favia corals from the northern Red Sea. Results We obtained 80 million 75 bp paired-end cDNA reads from two Favia adult samples collected at 65 m (Fav1, Fav2) on the Illumina GA platform, and generated two de novo assemblies using ABySS and CAP3. After removing redundancy and filtering out low quality reads, our transcriptome datasets contained 58,268 (Fav1) and 62,469 (Fav2) contigs longer than 100 bp, with N50 values of 1,665 bp and 1,439 bp, respectively. Using the proteome of the sea anemone Nematostella vectensis as a reference, we were able to annotate almost 20% of each dataset using reciprocal homology searches. Homologous clustering of these annotated transcripts allowed us to divide them into 7,186 (Fav1) and 6,862 (Fav2) homologous transcript clusters (E-value ≤ 2e-30). Functional annotation categories were assigned to homologous clusters using the functional annotation of Nematostella vectensis. General annotation of the assembled transcripts was improved 1-3% using the Acropora digitifera proteome. In addition, we screened these transcript isoform clusters for fluorescent proteins (FPs) homologs and identified seven potential FP homologs in Fav1, and four in Fav2. These transcripts were validated as bona fide FP transcripts via robust fluorescence heterologous expression. Annotation of the assembled contigs revealed that 1.34% and 1.61% (in Fav1 and Fav2, respectively) of the total assembled contigs likely originated from the corals’ algal symbiont, Symbiodinium spp. Conclusions Here we present a study to identify the homologous transcript isoform clusters from the transcriptome of Favia corals using a far-related reference proteome. 
Furthermore, the symbiont-derived transcripts were isolated from the datasets and their contribution quantified. This is the first annotated transcriptome of the genus Favia, a major increase in genomics resources available in this important family of corals. PMID:23937070
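
    The N50 values quoted for the two assemblies summarize contig length distributions. As a minimal sketch of how such a value is usually computed (this is the common definition, not the authors' script, and the contig lengths below are hypothetical rather than taken from the Favia data):

```python
def n50(lengths):
    """Return the N50 of a collection of contig lengths.

    N50 is the length L such that contigs of length >= L together
    contain at least half of the total assembled bases.
    """
    if not lengths:
        raise ValueError("need at least one contig length")
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        # The first contig that pushes the running sum past half is the N50.
        if running >= half_total:
            return length


# Hypothetical contig lengths in bp, for illustration only.
example_lengths = [2000, 1500, 1000, 500, 300, 200]
example_n50 = n50(example_lengths)
```

    Sorting from longest to shortest and stopping at the half-way point keeps the calculation linear after the sort, which is usually sufficient even for assemblies with tens of thousands of contigs.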

  9. A Predictive Model of the Oxygen and Heme Regulatory Network in Yeast

    PubMed Central

    Kundaje, Anshul; Xin, Xiantong; Lan, Changgui; Lianoglou, Steve; Zhou, Mei; Zhang, Li; Leslie, Christina

    2008-01-01

    Deciphering gene regulatory mechanisms through the analysis of high-throughput expression data is a challenging computational problem. Previous computational studies have used large expression datasets in order to resolve fine patterns of coexpression, producing clusters or modules of potentially coregulated genes. These methods typically examine promoter sequence information, such as DNA motifs or transcription factor occupancy data, in a separate step after clustering. We needed an alternative and more integrative approach to study the oxygen regulatory network in Saccharomyces cerevisiae using a small dataset of perturbation experiments. Mechanisms of oxygen sensing and regulation underlie many physiological and pathological processes, and only a handful of oxygen regulators have been identified in previous studies. We used a new machine learning algorithm called MEDUSA to uncover detailed information about the oxygen regulatory network using genome-wide expression changes in response to perturbations in the levels of oxygen, heme, Hap1, and Co2+. MEDUSA integrates mRNA expression, promoter sequence, and ChIP-chip occupancy data to learn a model that accurately predicts the differential expression of target genes in held-out data. We used a novel margin-based score to extract significant condition-specific regulators and assemble a global map of the oxygen sensing and regulatory network. This network includes both known oxygen and heme regulators, such as Hap1, Mga2, Hap4, and Upc2, as well as many new candidate regulators. MEDUSA also identified many DNA motifs that are consistent with previous experimentally identified transcription factor binding sites. Because MEDUSA's regulatory program associates regulators to target genes through their promoter sequences, we directly tested the predicted regulators for OLE1, a gene specifically induced under hypoxia, by experimental analysis of the activity of its promoter. 
In each case, deletion of the candidate regulator resulted in the predicted effect on promoter activity, confirming that several novel regulators identified by MEDUSA are indeed involved in oxygen regulation. MEDUSA can reveal important information from a small dataset and generate testable hypotheses for further experimental analysis. Supplemental data are included. PMID:19008939

  10. Identifying Group-Specific Sequences for Microbial Communities Using Long k-mer Sequence Signatures

    PubMed Central

    Wang, Ying; Fu, Lei; Ren, Jie; Yu, Zhaoxia; Chen, Ting; Sun, Fengzhu

    2018-01-01

    Comparing metagenomic samples is crucial for understanding microbial communities. For different groups of microbial communities, such as human gut metagenomic samples from patients with a certain disease and healthy controls, identifying group-specific sequences offers essential information for potential biomarker discovery. A sequence that is present, or rich, in one group, but absent, or scarce, in another group is considered “group-specific” in our study. Our main purpose is to discover group-specific sequence regions between control and case groups as disease-associated markers. We developed a long k-mer (k ≥ 30 bps)-based computational pipeline to detect group-specific sequences at strain resolution free from reference sequences, sequence alignments, and metagenome-wide de novo assembly. We called our method MetaGO: Group-specific oligonucleotide analysis for metagenomic samples. An open-source pipeline on Apache Spark was developed with parallel computing. We applied MetaGO to one simulated and three real metagenomic datasets to evaluate the discriminative capability of identified group-specific markers. In the simulated dataset, 99.11% of group-specific logical 40-mers covered 98.89% disease-specific regions from the disease-associated strain. In addition, 97.90% of group-specific numerical 40-mers covered 99.61 and 96.39% of differentially abundant genome and regions between two groups, respectively. For a large-scale metagenomic liver cirrhosis (LC)-associated dataset, we identified 37,647 group-specific 40-mer features. Any one of the features can predict disease status of the training samples with the average of sensitivity and specificity higher than 0.8. The random forests classification using the top 10 group-specific features yielded a higher AUC (from ∼0.8 to ∼0.9) than that of previous studies. All group-specific 40-mers were present in LC patients, but not healthy controls. 
All the assembled 11 LC-specific sequences can be mapped to two strains of Veillonella parvula: UTDB1-3 and DSM2008. The experiments on the other two real datasets related to Inflammatory Bowel Disease and Type 2 Diabetes in Women consistently demonstrated that MetaGO achieved better prediction accuracy with fewer features compared to previous studies. The experiments showed that MetaGO is a powerful tool for identifying group-specific k-mers, which would be clinically applicable for disease prediction. MetaGO is available at https://github.com/VVsmileyx/MetaGO. PMID:29774017
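
    The idea of a "logical" group-specific k-mer (present in one group, absent in the other) can be illustrated with a small sketch. This is not the MetaGO pipeline: it runs on a single machine, ignores reverse complements and abundance, and applies a strict all-cases/no-controls rule that is an assumption made here for simplicity. The function names and toy reads are hypothetical.

```python
def kmers(seq, k):
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def sample_kmers(reads, k):
    """Union of k-mers over all reads of one sample."""
    found = set()
    for read in reads:
        found |= kmers(read, k)
    return found


def group_specific_kmers(case_samples, control_samples, k):
    """k-mers present in every case sample and absent from every control.

    The all-or-nothing rule is an illustrative assumption; a real
    pipeline would typically tolerate some missing or stray k-mers.
    """
    if not case_samples:
        raise ValueError("need at least one case sample")
    case_sets = [sample_kmers(reads, k) for reads in case_samples]
    control_sets = [sample_kmers(reads, k) for reads in control_samples]
    shared_by_cases = set.intersection(*case_sets)
    seen_in_controls = set().union(*control_sets)
    return shared_by_cases - seen_in_controls


# Toy reads with k = 4 (the study uses k >= 30 on real metagenomes).
toy_cases = [["ACGTTGCA"], ["TTGCAAC"]]
toy_controls = [["ACGTAAAA"]]
toy_markers = group_specific_kmers(toy_cases, toy_controls, k=4)
```

    Representing each sample as a set makes the presence/absence test a pair of set operations; counting occurrences instead of storing sets would be the natural starting point for the "numerical" (abundance-based) variant mentioned in the abstract.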

  11. Analysis of phylogenomic datasets reveals conflict, concordance, and gene duplications with examples from animals and plants.

    PubMed

    Smith, Stephen A; Moore, Michael J; Brown, Joseph W; Yang, Ya

    2015-08-05

    The use of transcriptomic and genomic datasets for phylogenetic reconstruction has become increasingly common as researchers attempt to resolve recalcitrant nodes with increasing amounts of data. The large size and complexity of these datasets introduce significant phylogenetic noise and conflict into subsequent analyses. The sources of conflict may include hybridization, incomplete lineage sorting, or horizontal gene transfer, and may vary across the phylogeny. For phylogenetic analysis, this noise and conflict has been accommodated in one of several ways: by binning gene regions into subsets to isolate consistent phylogenetic signal; by using gene-tree methods for reconstruction, where conflict is presumed to be explained by incomplete lineage sorting (ILS); or through concatenation, where noise is presumed to be the dominant source of conflict. The results provided herein emphasize that analysis of individual homologous gene regions can greatly improve our understanding of the underlying conflict within these datasets. Here we examined two published transcriptomic datasets, the angiosperm group Caryophyllales and the aculeate Hymenoptera, for the presence of conflict, concordance, and gene duplications in individual homologs across the phylogeny. We found significant conflict throughout the phylogeny in both datasets and in particular along the backbone. While some nodes in each phylogeny showed patterns of conflict similar to what might be expected with ILS alone, the backbone nodes also exhibited low levels of phylogenetic signal. In addition, certain nodes, especially in the Caryophyllales, had highly elevated levels of strongly supported conflict that cannot be explained by ILS alone. This study demonstrates that phylogenetic signal is highly variable in phylogenomic data sampled across related species and poses challenges when conducting species tree analyses on large genomic and transcriptomic datasets. 
Further insight into the conflict and processes underlying these complex datasets is necessary to improve and develop adequate models for sequence analysis and downstream applications. To aid this effort, we developed the open source software phyparts (https://bitbucket.org/blackrim/phyparts), which calculates unique, conflicting, and concordant bipartitions, maps gene duplications, and outputs summary statistics such as internode certainty (ICA) scores and node-specific counts of gene duplications.

  12. Comparing species tree estimation with large anchored phylogenomic and small Sanger-sequenced molecular datasets: an empirical study on Malagasy pseudoxyrhophiine snakes.

    PubMed

    Ruane, Sara; Raxworthy, Christopher J; Lemmon, Alan R; Lemmon, Emily Moriarty; Burbrink, Frank T

    2015-10-12

    Using molecular data generated by high throughput next generation sequencing (NGS) platforms to infer phylogeny is becoming common as costs go down and the ability to capture loci from across the genome goes up. While there is a general consensus that greater numbers of independent loci should result in more robust phylogenetic estimates, few studies have compared phylogenies resulting from smaller datasets for commonly used genetic markers with the large datasets captured using NGS. Here, we determine how a 5-locus Sanger dataset compares with a 377-locus anchored genomics dataset for understanding the evolutionary history of the pseudoxyrhophiine snake radiation centered in Madagascar. The Pseudoxyrhophiinae comprise ~86 % of Madagascar's serpent diversity, yet they are poorly known with respect to ecology, behavior, and systematics. Using the 377-locus NGS dataset and the summary statistics species-tree methods STAR and MP-EST, we estimated a well-supported species tree that provides new insights concerning intergeneric relationships for the pseudoxyrhophiines. We also compared how these and other methods performed with respect to estimating tree topology using datasets with varying numbers of loci. Using Sanger sequencing and an anchored phylogenomics approach, we sequenced datasets comprised of 5 and 377 loci, respectively, for 23 pseudoxyrhophiine taxa. For each dataset, we estimated phylogenies using both gene-tree (concatenation) and species-tree (STAR, MP-EST) approaches. We determined the similarity of resulting tree topologies from the different datasets using Robinson-Foulds distances. 
In addition, we examined how subsets of these data performed compared to the complete Sanger and anchored datasets for phylogenetic accuracy using the same tree inference methodologies, as well as the program *BEAST to determine if a full coalescent model for species tree estimation could generate robust results with fewer loci compared to the summary statistics species tree approaches. We also examined the individual gene trees in comparison to the 377-locus species tree using the program MetaTree. Using the full anchored dataset under a variety of methods gave us the same, well-supported phylogeny for pseudoxyrhophiines. The African pseudoxyrhophiine Duberria is the sister taxon to the Malagasy pseudoxyrhophiine genera, providing evidence for a monophyletic radiation in Madagascar. In addition, within Madagascar, the two major clades inferred correspond largely to the aglyphous and opisthoglyphous genera, suggesting that feeding specializations associated with tooth venom delivery may have played a major role in the early diversification of this radiation. The comparison of tree topologies from the concatenated and species-tree methods using different datasets indicated the 5-locus dataset cannot be used to infer a correct phylogeny for the pseudoxyrhophiines under any method tested here and that summary statistics methods require 50 or more loci to consistently recover the species-tree inferred using the complete anchored dataset. However, as few as 15 loci may infer the correct topology when using the full coalescent species tree method *BEAST. MetaTree analyses of each gene tree from the Sanger and anchored datasets found that none of the individual gene trees matched the 377-locus species tree, and that no gene trees were identical with respect to topology.
Our results suggest that ≥50 loci may be necessary to confidently infer phylogenies when using summary species-tree methods, but that the coalescent-based method *BEAST consistently recovers the same topology using only 15 loci. These results reinforce that datasets with small numbers of markers may result in misleading topologies, and further, that the method of inference used to generate a phylogeny also has a major influence on the number of loci necessary to infer robust species trees.
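
    The Robinson-Foulds distance used above to compare topologies counts the bipartitions (splits) found in one tree but not the other. A minimal sketch follows, assuming trees are given as nested tuples of taxon names and treated as unrooted; the representation, function names, and the five placeholder taxa are assumptions for illustration and are unrelated to the software used in the study.

```python
def leaf_set(tree):
    """Return the frozenset of leaf names under a nested-tuple tree."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaf_set(child) for child in tree))


def splits(tree):
    """Collect the non-trivial bipartitions induced by a tree's edges.

    Each split is stored as the side that does NOT contain a fixed
    anchor taxon, so the same split always has the same representation.
    """
    all_leaves = leaf_set(tree)
    anchor = min(all_leaves)
    found = set()

    def visit(node):
        if isinstance(node, str):
            return frozenset([node])
        below = frozenset().union(*(visit(child) for child in node))
        side = all_leaves - below if anchor in below else below
        # Skip trivial splits that separate zero or one taxon.
        if 1 < len(side) < len(all_leaves) - 1:
            found.add(side)
        return below

    visit(tree)
    return found


def robinson_foulds(tree_a, tree_b):
    """Number of splits present in exactly one of the two trees."""
    if leaf_set(tree_a) != leaf_set(tree_b):
        raise ValueError("trees must share the same taxa")
    return len(splits(tree_a) ^ splits(tree_b))


# Placeholder taxa A-E; not the pseudoxyrhophiine genera.
tree_1 = (("A", "B"), ("C", ("D", "E")))
tree_2 = (("A", "C"), ("B", ("D", "E")))
rf_example = robinson_foulds(tree_1, tree_2)
```

    A distance of zero means the two trees share every split, which is the sense in which different methods "recover the same topology".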

  13. SSU rDNA divergence in planktonic foraminifera: molecular taxonomy and biogeographic implications.

    PubMed

    André, Aurore; Quillévéré, Frédéric; Morard, Raphaël; Ujiié, Yurika; Escarguel, Gilles; de Vargas, Colomban; de Garidel-Thoron, Thibault; Douady, Christophe J

    2014-01-01

    The use of planktonic foraminifera in paleoceanography requires taxonomic consistency and precise assessment of the species biogeography. Yet, ribosomal small subunit (SSUr) DNA analyses have revealed that most of the modern morpho-species of planktonic foraminifera are composed of a complex of several distinct genetic types that may correspond to cryptic or pseudo-cryptic species. These genetic types are usually delimitated using partial sequences located at the 3'end of the SSUrDNA, but typically based on empirical delimitation. Here, we first use patristic genetic distances calculated within and among genetic types of the most common morpho-species to show that intra-type and inter-type genetic distances within morpho-species may significantly overlap, suggesting that genetic types have been sometimes inconsistently defined. We further apply two quantitative and independent methods, ABGD (Automatic Barcode Gap Detection) and GMYC (General Mixed Yule Coalescent) to a dataset of published and newly obtained partial SSU rDNA for a more objective assessment of the species status of these genetic types. Results of these complementary approaches are highly congruent and lead to a molecular taxonomy that ranks 49 genetic types of planktonic foraminifera as genuine (pseudo)cryptic species. Our results advocate for a standardized sequencing procedure allowing homogenous delimitations of (pseudo)cryptic species. On the ground of this revised taxonomic framework, we finally provide an integrative taxonomy synthesizing geographic, ecological and morphological differentiations that can occur among the genuine (pseudo)cryptic species. Due to molecular, environmental or morphological data scarcities, many aspects of our proposed integrative taxonomy are not yet fully resolved. On the other hand, our study opens up the potential for a correct interpretation of environmental sequence datasets.

  14. SSU rDNA Divergence in Planktonic Foraminifera: Molecular Taxonomy and Biogeographic Implications

    PubMed Central

    André, Aurore; Quillévéré, Frédéric; Morard, Raphaël; Ujiié, Yurika; Escarguel, Gilles; de Vargas, Colomban; de Garidel-Thoron, Thibault; Douady, Christophe J.

    2014-01-01

    The use of planktonic foraminifera in paleoceanography requires taxonomic consistency and precise assessment of the species biogeography. Yet, ribosomal small subunit (SSUr) DNA analyses have revealed that most of the modern morpho-species of planktonic foraminifera are composed of a complex of several distinct genetic types that may correspond to cryptic or pseudo-cryptic species. These genetic types are usually delimitated using partial sequences located at the 3′end of the SSUrDNA, but typically based on empirical delimitation. Here, we first use patristic genetic distances calculated within and among genetic types of the most common morpho-species to show that intra-type and inter-type genetic distances within morpho-species may significantly overlap, suggesting that genetic types have been sometimes inconsistently defined. We further apply two quantitative and independent methods, ABGD (Automatic Barcode Gap Detection) and GMYC (General Mixed Yule Coalescent) to a dataset of published and newly obtained partial SSU rDNA for a more objective assessment of the species status of these genetic types. Results of these complementary approaches are highly congruent and lead to a molecular taxonomy that ranks 49 genetic types of planktonic foraminifera as genuine (pseudo)cryptic species. Our results advocate for a standardized sequencing procedure allowing homogenous delimitations of (pseudo)cryptic species. On the ground of this revised taxonomic framework, we finally provide an integrative taxonomy synthesizing geographic, ecological and morphological differentiations that can occur among the genuine (pseudo)cryptic species. Due to molecular, environmental or morphological data scarcities, many aspects of our proposed integrative taxonomy are not yet fully resolved. On the other hand, our study opens up the potential for a correct interpretation of environmental sequence datasets. PMID:25119900

  15. Transcription factor profiling reveals molecular choreography and key regulators of human retrotransposon expression

    PubMed Central

    Sun, Xiaoji; Wang, Xuya; Tang, Zuojian; Grivainis, Mark; Kahler, David; Yun, Chi; Mita, Paolo; Fenyö, David

    2018-01-01

    Transposable elements (TEs) represent a substantial fraction of many eukaryotic genomes, and transcriptional regulation of these factors is important to determine TE activities in human cells. However, due to the repetitive nature of TEs, identifying transcription factor (TF)-binding sites from ChIP-sequencing (ChIP-seq) datasets is challenging. Current algorithms are focused on subtle differences between TE copies and thus bias the analysis to relatively old and inactive TEs. Here we describe an approach termed “MapRRCon” (mapping repeat reads to a consensus) which allows us to identify proteins binding to TE DNA sequences by mapping ChIP-seq reads to the TE consensus sequence after whole-genome alignment. Although this method does not assign binding sites to individual insertions in the genome, it provides a landscape of interacting TFs by capturing factors that bind to TEs under various conditions. We applied this method to screen TFs’ interaction with L1 in human cells/tissues using ENCODE ChIP-seq datasets and identified 178 of the 512 TFs tested as bound to L1 in at least one biological condition with most of them (138) localized to the promoter. Among these L1-binding factors, we focused on Myc and CTCF, as they play important roles in cancer progression and 3D chromatin structure formation. Furthermore, we explored the transcriptomes of The Cancer Genome Atlas breast and ovarian tumor samples in which a consistent anti-/correlation between L1 and Myc/CTCF expression was observed, suggesting that these two factors may play roles in regulating L1 transcription during the development of such tumors. PMID:29802231

  16. Early-branching euteleost relationships: areas of congruence between concatenation and coalescent model inferences.

    PubMed

    Campbell, Matthew A; Alfaro, Michael E; Belasco, Max; López, J Andrés

    2017-01-01

    Phylogenetic inference based on evidence from DNA sequences has led to significant strides in the development of a stable and robustly supported framework for the vertebrate tree of life. To date, the bulk of those advances have relied on sequence data from a small number of genome regions that have proven unable to produce satisfactory answers to consistently recalcitrant phylogenetic questions. Here, we re-examine phylogenetic relationships among early-branching euteleostean fish lineages classically grouped in the Protacanthopterygii using DNA sequence data surrounding ultraconserved elements. We report and examine a dataset of thirty-four OTUs with 17,957 aligned characters from fifty-three nuclear loci. Phylogenetic analysis is conducted in concatenated, joint gene trees and species tree estimation and summary coalescent frameworks. All analytical frameworks yield supporting evidence for existing hypotheses of relationship for the placement of Lepidogalaxias salamandroides, monophyly of the Stomiatii and the presence of an esociform + salmonid clade. Lepidogalaxias salamandroides and the Esociformes + Salmoniformes are successive sister lineages to all other euteleosts in the majority of analyses. The concatenated and joint gene trees and species tree analysis types produce high support values for this arrangement. However, inter-relationships of Argentiniformes, Stomiatii and Neoteleostei remain uncertain as they varied by analysis type while receiving strong and contradictory indices of support. Topological differences between analysis types are also apparent within the otomorph and the percomorph taxa in the data set. Our results identify concordant areas with strong support for relationships within and between early-branching euteleost lineages but they also reveal limitations in the ability of larger datasets to conclusively resolve other aspects of that phylogeny.

  17. Early-branching euteleost relationships: areas of congruence between concatenation and coalescent model inferences

    PubMed Central

    Alfaro, Michael E.; Belasco, Max; López, J. Andrés

    2017-01-01

    Phylogenetic inference based on evidence from DNA sequences has led to significant strides in the development of a stable and robustly supported framework for the vertebrate tree of life. To date, the bulk of those advances have relied on sequence data from a small number of genome regions that have proven unable to produce satisfactory answers to consistently recalcitrant phylogenetic questions. Here, we re-examine phylogenetic relationships among early-branching euteleostean fish lineages classically grouped in the Protacanthopterygii using DNA sequence data surrounding ultraconserved elements. We report and examine a dataset of thirty-four OTUs with 17,957 aligned characters from fifty-three nuclear loci. Phylogenetic analysis is conducted in concatenated, joint gene trees and species tree estimation and summary coalescent frameworks. All analytical frameworks yield supporting evidence for existing hypotheses of relationship for the placement of Lepidogalaxias salamandroides, monophyly of the Stomiatii and the presence of an esociform + salmonid clade. Lepidogalaxias salamandroides and the Esociformes + Salmoniformes are successive sister lineages to all other euteleosts in the majority of analyses. The concatenated and joint gene trees and species tree analysis types produce high support values for this arrangement. However, inter-relationships of Argentiniformes, Stomiatii and Neoteleostei remain uncertain as they varied by analysis type while receiving strong and contradictory indices of support. Topological differences between analysis types are also apparent within the otomorph and the percomorph taxa in the data set. Our results identify concordant areas with strong support for relationships within and between early-branching euteleost lineages but they also reveal limitations in the ability of larger datasets to conclusively resolve other aspects of that phylogeny. PMID:28929008

  18. SeqPig: simple and scalable scripting for large sequencing data sets in Hadoop.

    PubMed

    Schumacher, André; Pireddu, Luca; Niemenmaa, Matti; Kallio, Aleksi; Korpelainen, Eija; Zanetti, Gianluigi; Heljanko, Keijo

    2014-01-01

    Hadoop MapReduce-based approaches have become increasingly popular due to their scalability in processing large sequencing datasets. However, as these methods typically require in-depth expertise in Hadoop and Java, they are still out of reach of many bioinformaticians. To solve this problem, we have created SeqPig, a library and a collection of tools to manipulate, analyze and query sequencing datasets in a scalable and simple manner. SeqPig scripts use the Hadoop-based distributed scripting engine Apache Pig, which automatically parallelizes and distributes data processing tasks. We demonstrate SeqPig's scalability over many computing nodes and illustrate its use with example scripts. SeqPig is available under the open source MIT license at http://sourceforge.net/projects/seqpig/

  19. Bellerophon: a program to detect chimeric sequences in multiple sequence alignments.

    PubMed

    Huber, Thomas; Faulkner, Geoffrey; Hugenholtz, Philip

    2004-09-22

    Bellerophon is a program for detecting chimeric sequences in multiple sequence datasets by an adaption of partial treeing analysis. Bellerophon was specifically developed to detect 16S rRNA gene chimeras in PCR-clone libraries of environmental samples but can be applied to other nucleotide sequence alignments. Bellerophon is available as an interactive web server at http://foo.maths.uq.edu.au/~huber/bellerophon.pl

  20. CPTAC Releases Largest-Ever Ovarian Cancer Proteome Dataset from Previously Genome Characterized Tumors | Office of Cancer Clinical Proteomics Research

    Cancer.gov

    National Cancer Institute (NCI) Clinical Proteomic Tumor Analysis Consortium (CPTAC) scientists have just released a comprehensive dataset of the proteomic analysis of high grade serous ovarian tumor samples, previously genomically analyzed by The Cancer Genome Atlas (TCGA).  This is one of the largest public datasets covering the proteome, phosphoproteome and glycoproteome with complementary deep genomic sequencing data on the same tumor.

  1. MRUniNovo: an efficient tool for de novo peptide sequencing utilizing the hadoop distributed computing framework.

    PubMed

    Li, Chuang; Chen, Tao; He, Qiang; Zhu, Yunping; Li, Kenli

    2017-03-15

    Tandem mass spectrometry-based de novo peptide sequencing is a complex and time-consuming process. The current algorithms for de novo peptide sequencing cannot rapidly and thoroughly process large mass spectrometry datasets. In this paper, we propose MRUniNovo, a novel tool for parallel de novo peptide sequencing. MRUniNovo parallelizes UniNovo based on the Hadoop compute platform. Our experimental results demonstrate that MRUniNovo significantly reduces the computation time of de novo peptide sequencing without sacrificing the correctness and accuracy of the results, and thus can process very large datasets that UniNovo cannot. MRUniNovo is an open source software tool implemented in Java. The source code and the parameter settings are available at http://bioinfo.hupo.org.cn/MRUniNovo/index.php. Contact: s131020002@hnu.edu.cn; taochen1019@163.com. Supplementary data are available at Bioinformatics online. © The Author 2016. Published by Oxford University Press.

  2. Transcriptome analysis of Bupleurum chinense focusing on genes involved in the biosynthesis of saikosaponins

    PubMed Central

    2011-01-01

    Background Bupleurum chinense DC. is a widely used traditional Chinese medicinal plant. Saikosaponins are the major bioactive constituents of B. chinense, but relatively little is known about saikosaponin biosynthesis. The 454 pyrosequencing technology provides a promising opportunity for finding novel genes that participate in plant metabolism. Consequently, this technology may help to identify the candidate genes involved in the saikosaponin biosynthetic pathway. Results One-quarter of the 454 pyrosequencing runs produced a total of 195,088 high-quality reads, with an average read length of 356 bases (NCBI SRA accession SRA039388). A de novo assembly generated 24,037 unique sequences (22,748 contigs and 1,289 singletons), 12,649 (52.6%) of which were annotated against three public protein databases using a basic local alignment search tool (E-value ≤1e-10). All unique sequences were compared with NCBI expressed sequence tags (ESTs) (237) and encoding sequences (44) from the Bupleurum genus, and with a Sanger-sequenced EST dataset (3,111). The 23,173 (96.4%) unique sequences obtained in the present study represent novel Bupleurum genes. The ESTs of genes related to saikosaponin biosynthesis were found to encode known enzymes that catalyze the formation of the saikosaponin backbone; 246 cytochrome P450 (P450s) and 102 glycosyltransferases (GTs) unique sequences were also found in the 454 dataset. Full length cDNAs of 7 P450s and 7 uridine diphosphate GTs (UGTs) were verified by reverse transcriptase polymerase chain reaction or by cloning using 5' and/or 3' rapid amplification of cDNA ends. Two P450s and three UGTs were identified as the most likely candidates involved in saikosaponin biosynthesis. This finding was based on the coordinate up-regulation of their expression with β-AS in methyl jasmonate-treated adventitious roots and on their similar expression patterns with β-AS in various B. chinense tissues.
Conclusions A collection of high-quality ESTs for B. chinense obtained by 454 pyrosequencing is provided here for the first time. These data should aid further research on the functional genomics of B. chinense and other Bupleurum species. The candidate genes for enzymes involved in saikosaponin biosynthesis, especially the P450s and UGTs, that were revealed provide a substantial foundation for follow-up research on the metabolism and regulation of the saikosaponins. PMID:22047182

  3. Computational Tools for Parsimony Phylogenetic Analysis of Omics Data

    PubMed Central

    Salazar, Jose; Amri, Hakima; Noursi, David

    2015-01-01

    High-throughput assays from genomics, proteomics, metabolomics, and next generation sequencing produce massive omics datasets that are challenging to analyze in biological or clinical contexts. Thus far, there is no publicly available program for converting quantitative omics data into input formats to be used in off-the-shelf robust phylogenetic programs. To the best of our knowledge, this is the first report on creation of two Windows-based programs, OmicsTract and SynpExtractor, to address this gap. We note, as a way of introduction and development of these programs, that one particularly useful bioinformatics inferential model is the phylogenetic cladogram. Cladograms are multidimensional tools that show the relatedness between subgroups of healthy and diseased individuals and the latter's shared aberrations; they also reveal some characteristics of a disease that would not otherwise be apparent by other analytical methods. OmicsTract and SynpExtractor were written for the respective tasks of (1) accommodating advanced phylogenetic parsimony analysis (through standard programs of MIX [from PHYLIP] and TNT), and (2) extracting shared aberrations at the cladogram nodes. OmicsTract converts comma-delimited data tables by assigning each data point a binary value (“0” for normal states and “1” for abnormal states) then outputs the converted data tables into the proper input file formats for MIX or with embedded commands for TNT. SynpExtractor uses outfiles from MIX and TNT to extract the shared aberrations of each node of the cladogram, matching them with identifying labels from the dataset and exporting them into a comma-delimited file. Labels may be gene identifiers in gene-expression datasets or m/z values in mass spectrometry datasets.
By automating these steps, OmicsTract and SynpExtractor offer a veritable opportunity for rapid and standardized phylogenetic analyses of omics data; their model can also be extended to next generation sequencing (NGS) data. We make OmicsTract and SynpExtractor publicly and freely available for non-commercial use in order to strengthen and build capacity for the phylogenetic paradigm of omics analysis. PMID:26230532
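
    The binary coding step described for OmicsTract (0 for normal states, 1 for abnormal states) can be sketched as follows. The abstract does not state how "normal" is decided, so this example assumes, purely for illustration, that a value is normal when it falls inside the minimum-to-maximum range observed in the healthy samples for that feature. The function name, table layout, and sample IDs are hypothetical, and no MIX or TNT file is written.

```python
import csv
import io


def binarize(table_text, normal_ids):
    """Code each value as 0 (normal) or 1 (abnormal), per sample.

    table_text: comma-delimited text; first column is the feature
    label, remaining columns are samples named in the header row.
    normal_ids: sample IDs that define the normal range (assumed rule).
    """
    rows = [row for row in csv.reader(io.StringIO(table_text)) if row]
    header, body = rows[0], rows[1:]
    sample_ids = header[1:]
    normal_cols = [i for i, s in enumerate(sample_ids) if s in normal_ids]
    if not normal_cols:
        raise ValueError("no normal samples found in the header")
    coded = {s: [] for s in sample_ids}
    for row in body:
        values = [float(v) for v in row[1:]]
        # Normal range for this feature, taken from the healthy samples.
        low = min(values[i] for i in normal_cols)
        high = max(values[i] for i in normal_cols)
        for sample, value in zip(sample_ids, values):
            coded[sample].append(0 if low <= value <= high else 1)
    return coded


# Hypothetical table: two healthy samples (N1, N2), two diseased (D1, D2).
TOY_TABLE = "feature,N1,N2,D1,D2\ng1,1.0,2.0,5.0,1.5\ng2,3.0,4.0,3.5,0.5\n"
toy_coded = binarize(TOY_TABLE, normal_ids={"N1", "N2"})
```

    Each sample ends up as a string of 0/1 character states, one per feature, which is the kind of matrix a parsimony program takes as input.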

  4. User Guidelines for the Brassica Database: BRAD.

    PubMed

    Wang, Xiaobo; Cheng, Feng; Wang, Xiaowu

    2016-01-01

    The genome sequence of Brassica rapa was first released in 2011. Since then, further Brassica genomes have been sequenced or are undergoing sequencing. It is therefore necessary to develop tools that help users to mine information from genomic data efficiently. This will greatly aid scientific exploration and breeding application, especially for those with low levels of bioinformatic training. Therefore, the Brassica database (BRAD) was built to collect, integrate, illustrate, and visualize Brassica genomic datasets. BRAD provides useful searching and data mining tools, and facilitates the search of gene annotation datasets, syntenic or non-syntenic orthologs, and flanking regions of functional genomic elements. It also includes genome-analysis tools such as BLAST and GBrowse. One of the important aims of BRAD is to build a bridge between Brassica crop genomes with the genome of the model species Arabidopsis thaliana, thus transferring the bulk of A. thaliana gene study information for use with newly sequenced Brassica crops.

  5. Partial Shotgun Sequencing of the Boechera stricta Genome Reveals Extensive Microsynteny and Promoter Conservation with Arabidopsis

    PubMed Central

    Windsor, Aaron J.; Schranz, M. Eric; Formanová, Nataša; Gebauer-Jung, Steffi; Bishop, John G.; Schnabelrauch, Domenica; Kroymann, Juergen; Mitchell-Olds, Thomas

    2006-01-01

    Comparative genomics provides insight into the evolutionary dynamics that shape discrete sequences as well as whole genomes. To advance comparative genomics within the Brassicaceae, we have end sequenced 23,136 medium-sized insert clones from Boechera stricta, a wild relative of Arabidopsis (Arabidopsis thaliana). A significant proportion of these sequences, 18,797, are nonredundant and display highly significant similarity (BLASTn e-value ≤ 10^−30) to low copy number Arabidopsis genomic regions, including more than 9,000 annotated coding sequences. We have used this dataset to identify orthologous gene pairs in the two species and to perform a global comparison of DNA regions 5′ to annotated coding regions. On average, the 500 nucleotides upstream to coding sequences display 71.4% identity between the two species. In a similar analysis, 61.4% identity was observed between 5′ noncoding sequences of Brassica oleracea and Arabidopsis, indicating that regulatory regions are not as diverged among these lineages as previously anticipated. By mapping the B. stricta end sequences onto the Arabidopsis genome, we have identified nearly 2,000 conserved blocks of microsynteny (bracketing 26% of the Arabidopsis genome). A comparison of fully sequenced B. stricta inserts to their homologous Arabidopsis genomic regions indicates that indel polymorphisms >5 kb contribute substantially to the genome size difference observed between the two species. Further, we demonstrate that microsynteny inferred from end-sequence data can be applied to the rapid identification and cloning of genomic regions of interest from nonmodel species. These results suggest that among diploid relatives of Arabidopsis, small- to medium-scale shotgun sequencing approaches can provide rapid and cost-effective benefits to evolutionary and/or functional comparative genomic frameworks. PMID:16607030

  6. The diversity and structure of marine protists in the coastal waters of China revealed by morphological observation and 454 pyrosequencing

    NASA Astrophysics Data System (ADS)

    Liu, Yun; Song, Shuqun; Chen, Tiantian; Li, Caiwen

    2017-04-01

    Pyrosequencing of the 18S rRNA gene has been widely adopted to study the eukaryotic diversity in various types of environments, and has an advantage over traditional morphology methods in exploring unknown microbial communities. To comprehensively assess the diversity and community composition of marine protists in the coastal waters of China, we applied both morphological observations and high-throughput sequencing of the V2 and V3 regions of 18S rDNA simultaneously to analyze samples collected from the surface layer of the Yellow and East China Seas. Dinoflagellates, diatoms and ciliates were the three dominant protistan groups as revealed by the two methods. Diatoms were the first dominant protistan group in the microscopic observations, with Skeletonema mainly distributed in the nearshore eutrophic waters and Chaetoceros in higher temperature and higher pH waters. The mixotrophic dinoflagellates, Gymnodinium and Gyrodinium, were more competitive in the oligotrophic waters. The pyrosequencing method revealed an extensive diversity of dinoflagellates. Chaetoceros was the only dominant diatom group in the pyrosequencing dataset. Gyrodinium represented the most abundant reads and dominated the offshore oligotrophic protistan community as they were in the microscopic observations. The dominance of parasitic dinoflagellates in the pyrosequencing dataset, which were overlooked in the morphological observations, indicates more attention should be paid to explore the potential role of this group. Both methods provide coherent clustering of samples. Nutrient levels, salinity and pH were the main factors influencing the distribution of protists. This study demonstrates that different primer pairs used in the pyrosequencing will indicate different protistan community structures. A suitable marker may reveal more comprehensive composition of protists and provide valuable information on environmental drivers.

  7. Ten years of maintaining and expanding a microbial genome and metagenome analysis system.

    PubMed

    Markowitz, Victor M; Chen, I-Min A; Chu, Ken; Pati, Amrita; Ivanova, Natalia N; Kyrpides, Nikos C

    2015-11-01

    Launched in March 2005, the Integrated Microbial Genomes (IMG) system is a comprehensive data management system that supports multidimensional comparative analysis of genomic data. At the core of the IMG system is a data warehouse that contains genome and metagenome datasets sequenced at the Joint Genome Institute or provided by scientific users, as well as public genome datasets available at the National Center for Biotechnology Information Genbank sequence data archive. Genomes and metagenome datasets are processed using IMG's microbial genome and metagenome sequence data processing pipelines and are integrated into the data warehouse using IMG's data integration toolkits. Microbial genome and metagenome application specific data marts and user interfaces provide access to different subsets of IMG's data and analysis toolkits. This review article revisits IMG's original aims, highlights key milestones reached by the system during the past 10 years, and discusses the main challenges faced by a rapidly expanding system, in particular the complexity of maintaining such a system in an academic setting with limited budgets and computing and data management infrastructure. Copyright © 2015 Elsevier Ltd. All rights reserved.

  8. SATCHMO-JS: a webserver for simultaneous protein multiple sequence alignment and phylogenetic tree construction.

    PubMed

    Hagopian, Raffi; Davidson, John R; Datta, Ruchira S; Samad, Bushra; Jarvis, Glen R; Sjölander, Kimmen

    2010-07-01

    We present the jump-start simultaneous alignment and tree construction using hidden Markov models (SATCHMO-JS) web server for simultaneous estimation of protein multiple sequence alignments (MSAs) and phylogenetic trees. The server takes as input a set of sequences in FASTA format, and outputs a phylogenetic tree and MSA; these can be viewed online or downloaded from the website. SATCHMO-JS is an extension of the SATCHMO algorithm, and employs a divide-and-conquer strategy to jump-start SATCHMO at a higher point in the phylogenetic tree, reducing the computational complexity of the progressive all-versus-all HMM-HMM scoring and alignment. Results on a benchmark dataset of 983 structurally aligned pairs from the PREFAB benchmark dataset show that SATCHMO-JS provides a statistically significant improvement in alignment accuracy over MUSCLE, Multiple Alignment using Fast Fourier Transform (MAFFT), ClustalW and the original SATCHMO algorithm. The SATCHMO-JS webserver is available at http://phylogenomics.berkeley.edu/satchmo-js. The datasets used in these experiments are available for download at http://phylogenomics.berkeley.edu/satchmo-js/supplementary/.

  9. Beyond Reasonable Doubt: Evolution from DNA Sequences

    PubMed Central

    Penny, David

    2013-01-01

    We demonstrate quantitatively that, as predicted by evolutionary theory, sequences of homologous proteins from different species converge as we go further and further back in time. The converse, a non-evolutionary model, can be expressed as probabilities, and the test works for chloroplast, nuclear and mitochondrial sequences, as well as for sequences that diverged at different time depths. Even on our conservative test, the probability that chance could produce the observed levels of ancestral convergence for just one of the eight datasets of 51 proteins is ≈1×10^−19 and combined over 8 datasets is ≈1×10^−132. By comparison, there are about 10^80 protons in the universe, hence the probability that the sequences could have been produced by a process involving unrelated ancestral sequences is about 10^50 lower than picking, among all protons, the same proton at random twice in a row. A non-evolutionary control model shows no convergence, and only a small number of parameters are required to account for the observations. It is time that researchers insisted that doubters put up testable alternatives to evolution. PMID:23950906

  10. MODBASE, a database of annotated comparative protein structure models

    PubMed Central

    Pieper, Ursula; Eswar, Narayanan; Stuart, Ashley C.; Ilyin, Valentin A.; Sali, Andrej

    2002-01-01

    MODBASE (http://guitar.rockefeller.edu/modbase) is a relational database of annotated comparative protein structure models for all available protein sequences matched to at least one known protein structure. The models are calculated by MODPIPE, an automated modeling pipeline that relies on PSI-BLAST, IMPALA and MODELLER. MODBASE uses the MySQL relational database management system for flexible and efficient querying, and the MODVIEW Netscape plugin for viewing and manipulating multiple sequences and structures. It is updated regularly to reflect the growth of the protein sequence and structure databases, as well as improvements in the software for calculating the models. For ease of access, MODBASE is organized into different datasets. The largest dataset contains models for domains in 304 517 out of 539 171 unique protein sequences in the complete TrEMBL database (23 March 2001); only models based on significant alignments (PSI-BLAST E-value < 10^−4) and models assessed to have the correct fold are included. Other datasets include models for target selection and structure-based annotation by the New York Structural Genomics Research Consortium, models for prediction of genes in the Drosophila melanogaster genome, models for structure determination of several ribosomal particles and models calculated by the MODWEB comparative modeling web server. PMID:11752309

  11. Identification of Habitat-Specific Biomes of Aquatic Fungal Communities Using a Comprehensive Nearly Full-Length 18S rRNA Dataset Enriched with Contextual Data

    PubMed Central

    Panzer, Katrin; Yilmaz, Pelin; Weiß, Michael; Reich, Lothar; Richter, Michael; Wiese, Jutta; Schmaljohann, Rolf; Labes, Antje; Imhoff, Johannes F.; Glöckner, Frank Oliver; Reich, Marlis

    2015-01-01

    Molecular diversity surveys have demonstrated that aquatic fungi are highly diverse, and that they play fundamental ecological roles in aquatic systems. Unfortunately, comparative studies of aquatic fungal communities are few and far between, due to the scarcity of adequate datasets. We combined all publicly available fungal 18S ribosomal RNA (rRNA) gene sequences with new sequence data from a marine fungi culture collection. We further enriched this dataset by adding validated contextual data. Specifically, we included data on the habitat type of the samples assigning fungal taxa to ten different habitat categories. This dataset has been created with the intention to serve as a valuable reference dataset for aquatic fungi including a phylogenetic reference tree. The combined data enabled us to infer fungal community patterns in aquatic systems. Pairwise habitat comparisons showed significant phylogenetic differences, indicating that habitat strongly affects fungal community structure. Fungal taxonomic composition differed considerably even on phylum and class level. Freshwater fungal assemblage was most different from all other habitat types and was dominated by basal fungal lineages. For most communities, phylogenetic signals indicated clustering of sequences suggesting that environmental factors were the main drivers of fungal community structure, rather than species competition. Thus, the diversification process of aquatic fungi must be highly clade specific in some cases. PMID:26226014

  12. A high HIV-1 strain variability in London, UK, revealed by full-genome analysis: Results from the ICONIC project

    PubMed Central

    Frampton, Dan; Gallo Cassarino, Tiziano; Raffle, Jade; Hubb, Jonathan; Ferns, R. Bridget; Waters, Laura; Tong, C. Y. William; Kozlakidis, Zisis; Hayward, Andrew; Kellam, Paul; Pillay, Deenan; Clark, Duncan; Nastouli, Eleni; Leigh Brown, Andrew J.

    2018-01-01

    Background & methods The ICONIC project has developed an automated high-throughput pipeline to generate HIV nearly full-length genomes (NFLG, i.e. from gag to nef) from next-generation sequencing (NGS) data. The pipeline was applied to 420 HIV samples collected at University College London Hospitals NHS Trust and Barts Health NHS Trust (London) and sequenced using an Illumina MiSeq at the Wellcome Trust Sanger Institute (Cambridge). Consensus genomes were generated and subtyped using COMET, and unique recombinants were studied with jpHMM and SimPlot. Maximum-likelihood phylogenetic trees were constructed using RAxML to identify transmission networks using the Cluster Picker. Results The pipeline generated sequences of at least 1Kb of length (median = 7.46Kb, IQR = 4.01Kb) for 375 out of the 420 samples (89%), with 174 (46.4%) being NFLG. A total of 365 sequences (169 of them NFLG) corresponded to unique subjects and were included in the down-stream analyses. The most frequent HIV subtypes were B (n = 149, 40.8%) and C (n = 77, 21.1%) and the circulating recombinant form CRF02_AG (n = 32, 8.8%). We found 14 different CRFs (n = 66, 18.1%) and multiple URFs (n = 32, 8.8%) that involved recombination between 12 different subtypes/CRFs. The most frequent URFs were B/CRF01_AE (4 cases) and A1/D, B/C, and B/CRF02_AG (3 cases each). Most URFs (19/26, 73%) lacked breakpoints in the PR+RT pol region, rendering them undetectable if only that was sequenced. Twelve (37.5%) of the URFs could have emerged within the UK, whereas the rest were probably imported from sub-Saharan Africa, South East Asia and South America. For 2 URFs we found highly similar pol sequences circulating in the UK. We detected 31 phylogenetic clusters using the full dataset: 25 pairs (mostly subtypes B and C), 4 triplets and 2 quadruplets. Some of these were not consistent across different genes due to inter- and intra-subtype recombination. Clusters involved 70 sequences, 19.2% of the dataset. 
Conclusions The initial analysis of genome sequences detected substantial hidden variability in the London HIV epidemic. Analysing full genome sequences, as opposed to only PR+RT, identified previously undetected recombinants. It provided a more reliable description of CRFs (that would be otherwise misclassified) and transmission clusters. PMID:29389981

  13. CMSA: a heterogeneous CPU/GPU computing system for multiple similar RNA/DNA sequence alignment.

    PubMed

    Chen, Xi; Wang, Chen; Tang, Shanjiang; Yu, Ce; Zou, Quan

    2017-06-24

    The multiple sequence alignment (MSA) is a classic and powerful technique for sequence analysis in bioinformatics. With the rapid growth of biological datasets, MSA parallelization becomes necessary to keep its running time at an acceptable level. Although there is a lot of work on MSA problems, existing approaches are either insufficient or contain some implicit assumptions that limit the generality of usage. First, the information of users' sequences, including the sizes of datasets and the lengths of sequences, can be of arbitrary values and is generally unknown before submission, which is unfortunately ignored by previous work. Second, the center star strategy is suited for aligning similar sequences. But its first stage, center sequence selection, is highly time-consuming and requires further optimization. Moreover, given the heterogeneous CPU/GPU platform, prior studies consider the MSA parallelization on GPU devices only, making the CPUs idle during the computation. Co-run computation, however, can maximize the utilization of the computing resources by enabling the workload computation on both CPU and GPU simultaneously. This paper presents CMSA, a robust and efficient MSA system for large-scale datasets on the heterogeneous CPU/GPU platform. It performs and optimizes multiple sequence alignment automatically for users' submitted sequences without any assumptions. CMSA adopts the co-run computation model so that both CPU and GPU devices are fully utilized. Moreover, CMSA proposes an improved center star strategy that reduces the time complexity of its center sequence selection process from O(mn^2) to O(mn). The experimental results show that CMSA achieves an up to 11× speedup and outperforms the state-of-the-art software. CMSA focuses on the multiple similar RNA/DNA sequence alignment and proposes a novel bitmap based algorithm to improve the center star strategy.
We can conclude that harvesting the high performance of modern GPU is a promising approach to accelerate multiple sequence alignment. Besides, adopting the co-run computation model can maximize the entire system utilization significantly. The source code is available at https://github.com/wangvsa/CMSA .

  14. Modular prediction of protein structural classes from sequences of twilight-zone identity with predicting sequences.

    PubMed

    Mizianty, Marcin J; Kurgan, Lukasz

    2009-12-13

    Knowledge of structural class is used by numerous methods for identification of structural/functional characteristics of proteins and could be used for the detection of remote homologues, particularly for chains that share twilight-zone similarity. In contrast to existing sequence-based structural class predictors, which target four major classes and which are designed for high identity sequences, we predict seven classes from sequences that share twilight-zone identity with the training sequences. The proposed MODular Approach to Structural class prediction (MODAS) method is unique as it allows for selection of any subset of the classes. MODAS is also the first to utilize a novel, custom-built feature-based sequence representation that combines evolutionary profiles and predicted secondary structure. The features quantify information relevant to the definition of the classes including conservation of residues and arrangement and number of helix/strand segments. Our comprehensive design considers 8 feature selection methods and 4 classifiers to develop Support Vector Machine-based classifiers that are tailored for each of the seven classes. Tests on 5 twilight-zone and 1 high-similarity benchmark datasets and comparison with over two dozens of modern competing predictors show that MODAS provides the best overall accuracy that ranges between 80% and 96.7% (83.5% for the twilight-zone datasets), depending on the dataset. This translates into 19% and 8% error rate reduction when compared against the best performing competing method on two largest datasets. The proposed predictor provides accurate predictions at 58% accuracy for membrane proteins class, which is not considered by majority of existing methods, in spite that this class accounts for only 2% of the data. Our predictive model is analyzed to demonstrate how and why the input features are associated with the corresponding classes. 
The improved predictions stem from the novel features that express collocation of the secondary structure segments in the protein sequence and that combine evolutionary and secondary structure information. Our work demonstrates that conservation and arrangement of the secondary structure segments predicted along the protein chain can successfully predict structural classes which are defined based on the spatial arrangement of the secondary structures. A web server is available at http://biomine.ece.ualberta.ca/MODAS/.

  15. Modular prediction of protein structural classes from sequences of twilight-zone identity with predicting sequences

    PubMed Central

    2009-01-01

    Background Knowledge of structural class is used by numerous methods for identification of structural/functional characteristics of proteins and could be used for the detection of remote homologues, particularly for chains that share twilight-zone similarity. In contrast to existing sequence-based structural class predictors, which target four major classes and which are designed for high identity sequences, we predict seven classes from sequences that share twilight-zone identity with the training sequences. Results The proposed MODular Approach to Structural class prediction (MODAS) method is unique as it allows for selection of any subset of the classes. MODAS is also the first to utilize a novel, custom-built feature-based sequence representation that combines evolutionary profiles and predicted secondary structure. The features quantify information relevant to the definition of the classes including conservation of residues and arrangement and number of helix/strand segments. Our comprehensive design considers 8 feature selection methods and 4 classifiers to develop Support Vector Machine-based classifiers that are tailored for each of the seven classes. Tests on 5 twilight-zone and 1 high-similarity benchmark datasets and comparison with over two dozens of modern competing predictors show that MODAS provides the best overall accuracy that ranges between 80% and 96.7% (83.5% for the twilight-zone datasets), depending on the dataset. This translates into 19% and 8% error rate reduction when compared against the best performing competing method on two largest datasets. The proposed predictor provides accurate predictions at 58% accuracy for membrane proteins class, which is not considered by majority of existing methods, in spite that this class accounts for only 2% of the data. Our predictive model is analyzed to demonstrate how and why the input features are associated with the corresponding classes. 
Conclusions The improved predictions stem from the novel features that express collocation of the secondary structure segments in the protein sequence and that combine evolutionary and secondary structure information. Our work demonstrates that conservation and arrangement of the secondary structure segments predicted along the protein chain can successfully predict structural classes which are defined based on the spatial arrangement of the secondary structures. A web server is available at http://biomine.ece.ualberta.ca/MODAS/. PMID:20003388

  16. The standard operating procedure of the DOE-JGI Microbial Genome Annotation Pipeline (MGAP v.4).

    PubMed

    Huntemann, Marcel; Ivanova, Natalia N; Mavromatis, Konstantinos; Tripp, H James; Paez-Espino, David; Palaniappan, Krishnaveni; Szeto, Ernest; Pillay, Manoj; Chen, I-Min A; Pati, Amrita; Nielsen, Torben; Markowitz, Victor M; Kyrpides, Nikos C

    2015-01-01

    The DOE-JGI Microbial Genome Annotation Pipeline performs structural and functional annotation of microbial genomes that are further included into the Integrated Microbial Genome comparative analysis system. MGAP is applied to assembled nucleotide sequence datasets that are provided via the IMG submission site. Dataset submission for annotation first requires project and associated metadata description in GOLD. The MGAP sequence data processing consists of feature prediction including identification of protein-coding genes, non-coding RNAs and regulatory RNA features, as well as CRISPR elements. Structural annotation is followed by assignment of protein product names and functions.

  17. GeNets: a unified web platform for network-based genomic analyses.

    PubMed

    Li, Taibo; Kim, April; Rosenbluh, Joseph; Horn, Heiko; Greenfeld, Liraz; An, David; Zimmer, Andrew; Liberzon, Arthur; Bistline, Jon; Natoli, Ted; Li, Yang; Tsherniak, Aviad; Narayan, Rajiv; Subramanian, Aravind; Liefeld, Ted; Wong, Bang; Thompson, Dawn; Calvo, Sarah; Carr, Steve; Boehm, Jesse; Jaffe, Jake; Mesirov, Jill; Hacohen, Nir; Regev, Aviv; Lage, Kasper

    2018-06-18

    Functional genomics networks are widely used to identify unexpected pathway relationships in large genomic datasets. However, it is challenging to compare the signal-to-noise ratios of different networks and to identify the optimal network with which to interpret a particular genetic dataset. We present GeNets, a platform in which users can train a machine-learning model (Quack) to carry out these comparisons and execute, store, and share analyses of genetic and RNA-sequencing datasets.

  18. Prediction of glutathionylation sites in proteins using minimal sequence information and their experimental validation.

    PubMed

    Pal, Debojyoti; Sharma, Deepak; Kumar, Mukesh; Sandur, Santosh K

    2016-09-01

    S-glutathionylation of proteins plays an important role in various biological processes and is known to be a protective modification during oxidative stress. Since experimental detection of S-glutathionylation is labor intensive and time consuming, a bioinformatics-based approach is a viable alternative. Available methods require relatively longer sequence information, which may prevent prediction if sequence information is incomplete. Here, we present a model to predict glutathionylation sites from pentapeptide sequences. It is based upon differential association of amino acids with glutathionylated and non-glutathionylated cysteines from a database of experimentally verified sequences. These data were used to calculate position-dependent F-scores, which measure how a particular amino acid at a particular position may affect the likelihood of a glutathionylation event. A glutathionylation score (G-score), indicating the propensity of a sequence to undergo glutathionylation, was calculated using position-dependent F-scores for each amino acid. Cut-off values were used for prediction. Our model returned an accuracy of 58% with a Matthews correlation coefficient (MCC) value of 0.165. On an independent dataset, our model outperformed the currently available model, in spite of needing much less sequence information. Pentapeptide motifs having high abundance among glutathionylated proteins were identified. A list of potential glutathionylation hotspot sequences was obtained by assigning G-scores, and subsequent Protein-BLAST analysis revealed a total of 254 putative glutathionable proteins, a number of which were already known to be glutathionylated. Our model predicted glutathionylation sites in 93.93% of experimentally verified glutathionylated proteins. The outcome of this study may assist in discovering novel glutathionylation sites and finding candidate proteins for glutathionylation.

  19. Identification of 15 candidate structured noncoding RNA motifs in fungi by comparative genomics.

    PubMed

    Li, Sanshu; Breaker, Ronald R

    2017-10-13

    With the development of rapid and inexpensive DNA sequencing, the genome sequences of more than 100 fungal species have been made available. This dataset provides an excellent resource for comparative genomics analyses, which can be used to discover genetic elements, including noncoding RNAs (ncRNAs). Bioinformatics tools similar to those used to uncover novel ncRNAs in bacteria, likewise, should be useful for searching fungal genomic sequences, and the relative ease of genetic experiments with some model fungal species could facilitate experimental validation studies. We have adapted a bioinformatics pipeline for discovering bacterial ncRNAs to systematically analyze many fungal genomes. This comparative genomics pipeline integrates information on conserved RNA sequence and structural features with alternative splicing information to reveal fungal RNA motifs that are candidate regulatory domains, or that might have other possible functions. A total of 15 prominent classes of structured ncRNA candidates were identified, including variant HDV self-cleaving ribozyme representatives, atypical snoRNA candidates, and possible structured antisense RNA motifs. Candidate regulatory motifs were also found associated with genes for ribosomal proteins, S-adenosylmethionine decarboxylase (SDC), amidase, and HexA protein involved in Woronin body formation. We experimentally confirm that the variant HDV ribozymes undergo rapid self-cleavage, and we demonstrate that the SDC RNA motif reduces the expression of SAM decarboxylase by translational repression. Furthermore, we provide evidence that several other motifs discovered in this study are likely to be functional ncRNA elements. Systematic screening of fungal genomes using a computational discovery pipeline has revealed the existence of a variety of novel structured ncRNAs. 
Genome contexts and similarities to known ncRNA motifs provide strong evidence for the biological and biochemical functions of some newly found ncRNA motifs. Although initial examinations of several motifs provide evidence for their likely functions, other motifs will require more in-depth analysis to reveal their functions.

  20. Phylogenetic analysis of canine distemper virus in South America clade 1 reveals unique molecular signatures of the local epidemic.

    PubMed

    Fischer, Cristine D B; Gräf, Tiago; Ikuta, Nilo; Lehmann, Fernanda K M; Passos, Daniel T; Makiejczuk, Aline; Silveira, Marcos A T; Fonseca, André S K; Canal, Cláudio W; Lunge, Vagner R

    2016-07-01

    Canine distemper virus (CDV) is a highly contagious pathogen for domestic dogs and several wild carnivore species. In Brazil, natural infection of CDV in dogs is very high due to the large non-vaccinated dog population, a scenario that calls for new studies on the molecular epidemiology. This study investigates the phylodynamics and amino-acid signatures of the CDV epidemic in South America by analyzing a large dataset compiled from publicly available sequences and also by collecting new samples from Brazil. A population of 175 dogs with canine distemper (CD) signs was sampled, from which 89 were positive for CDV, generating 42 new CDV sequences. Phylogenetic analysis of the new and publicly available sequences revealed that Brazilian sequences mainly clustered in the South America 1 (SA1) clade, which has its origin estimated to the late 1980s. The reconstruction of the demographic history in the SA1 clade showed an epidemic expanding until recent years, doubling in size every nine years. The SA1 clade epidemic is distinguished from the world CDV epidemic by the emergence of the R580Q strain, a very rare and potentially detrimental substitution in the viral genome. The R580Q substitution was estimated to have happened in one single evolutionary step in the epidemic history in the SA1 clade, emerging shortly after introduction to the continent. Moreover, a high prevalence (11.9%) of the Y549H mutation was observed among the domestic dogs sampled here. This finding was associated (p<0.05) with outcome-death and higher frequency in mixed-breed dogs, the latter being an indicator of a continuous exchange of CDV strains circulating among wild carnivores and domestic dogs. The results reported here highlight the diversity of the worldwide CDV epidemic and reveal local features that can be valuable for combating the disease. Copyright © 2016 Elsevier B.V. All rights reserved.

  1. DNA entropy reveals a significant difference in complexity between housekeeping and tissue specific gene promoters.

    PubMed

    Thomas, David; Finan, Chris; Newport, Melanie J; Jones, Susan

    2015-10-01

    The complexity of DNA can be quantified using estimates of entropy. Variation in DNA complexity is expected between the promoters of genes with different transcriptional mechanisms, namely housekeeping (HK) and tissue specific (TS). The former are transcribed constitutively to maintain general cellular functions, and the latter are transcribed in restricted tissue and cell types for specific molecular events. It is known that promoter features in the human genome are related to tissue specificity, but this has been difficult to quantify on a genomic scale. If entropy effectively quantifies DNA complexity, calculating the entropies of HK and TS gene promoters as profiles may reveal significant differences. Entropy profiles were calculated for a total dataset of 12,003 human gene promoters and for 501 housekeeping (HK) and 587 tissue specific (TS) human gene promoters. The mean profiles show the TS promoters have a significantly lower entropy (p<2.2e-16) than HK gene promoters. The entropy distributions for the 3 datasets show that promoter entropies could be used to identify novel HK genes. Functional features comprise DNA sequence patterns that are non-random and hence they have lower entropies. The lower entropy of TS gene promoters can be explained by a higher density of positive and negative regulatory elements, required for genes with complex spatial and temporal expression. Copyright © 2015 Elsevier Ltd. All rights reserved.
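
    The abstract above does not say which entropy estimator or window size was used. As a minimal sketch only, assuming plain Shannon entropy over single-nucleotide frequencies and a hypothetical sliding window, an entropy profile along a promoter sequence could be computed like this (the function names and the window size are illustrative, not the authors'):

```python
import math
from collections import Counter


def shannon_entropy(seq):
    """Shannon entropy, in bits per symbol, of the symbol frequencies in seq."""
    counts = Counter(seq)
    n = len(seq)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def entropy_profile(seq, window=4):
    """Entropy of every window-length slice of seq, moving one base at a time.

    The window size is an assumption made for illustration.
    """
    return [
        shannon_entropy(seq[i:i + window])
        for i in range(len(seq) - window + 1)
    ]


if __name__ == "__main__":
    # A low-complexity stretch followed by a more varied stretch.
    print(entropy_profile("AAAACGT", window=4))
```

    Low values mark repetitive, non-random stretches; a uniform mix of the four bases gives the maximum of 2 bits per symbol.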

  2. An efficient, versatile and scalable pattern growth approach to mine frequent patterns in unaligned protein sequences.

    PubMed

    Ye, Kai; Kosters, Walter A; Ijzerman, Adriaan P

    2007-03-15

    Pattern discovery in protein sequences is often based on multiple sequence alignments (MSA). The procedure can be computationally intensive and often requires manual adjustment, which may be particularly difficult for a set of deviating sequences. In contrast, two algorithms, PRATT2 (http://www.ebi.ac.uk/pratt/) and TEIRESIAS (http://cbcsrv.watson.ibm.com/) are used to directly identify frequent patterns from unaligned biological sequences without an attempt to align them. Here we propose a new algorithm with more efficiency and more functionality than both PRATT2 and TEIRESIAS, and discuss some of its applications to G protein-coupled receptors, a protein family of important drug targets. In this study, we designed and implemented six algorithms to mine three different pattern types from either one or two datasets using a pattern growth approach. We compared our approach to PRATT2 and TEIRESIAS in efficiency, completeness and the diversity of pattern types. Compared to PRATT2, our approach is faster, capable of processing large datasets and able to identify the so-called type III patterns. Our approach is comparable to TEIRESIAS in the discovery of the so-called type I patterns but has additional functionality such as mining the so-called type II and type III patterns and finding discriminating patterns between two datasets. The source code for pattern growth algorithms and their pseudo-code are available at http://www.liacs.nl/home/kosters/pg/.

  3. A Web Server and Mobile App for Computing Hemolytic Potency of Peptides.

    PubMed

    Chaudhary, Kumardeep; Kumar, Ritesh; Singh, Sandeep; Tuknait, Abhishek; Gautam, Ankur; Mathur, Deepika; Anand, Priya; Varshney, Grish C; Raghava, Gajendra P S

    2016-03-08

    Numerous therapeutic peptides do not enter clinical trials because of their high hemolytic activity. Recently, we developed a database, Hemolytik, for maintaining experimentally validated hemolytic and non-hemolytic peptides. The present study describes a web server and mobile app developed for predicting and screening peptides having hemolytic potency. Firstly, we generated a dataset, HemoPI-1, that contains 552 hemolytic peptides extracted from the Hemolytik database and 552 random non-hemolytic peptides (from Swiss-Prot). The sequence analysis of these peptides revealed that certain residues (e.g., L, K, F, W) and motifs (e.g., "FKK", "LKL", "KKLL", "KWK", "VLK", "CYCR", "CRR", "RFC", "RRR", "LKKL") are more abundant in hemolytic peptides. Therefore, we developed models for discriminating hemolytic and non-hemolytic peptides using various machine learning techniques and achieved more than 95% accuracy. We also developed models for discriminating peptides having high and low hemolytic potential on different datasets called HemoPI-2 and HemoPI-3. In order to serve the scientific community, we developed a web server, mobile app and JAVA-based standalone software (http://crdd.osdd.net/raghava/hemopi/).
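
    The motif comparison described above can be illustrated with a small sketch. It assumes that "more abundant" means a motif occurs in a larger fraction of peptides in one set than in the other; the abstract does not state the exact statistic. The peptide strings below are invented placeholders, not entries from HemoPI-1.

```python
def motif_fraction(peptides, motif):
    """Fraction of peptides that contain motif at least once."""
    if not peptides:
        return 0.0
    return sum(motif in peptide for peptide in peptides) / len(peptides)


def compare_motifs(set_a, set_b, motifs):
    """Map each motif to its (fraction in set_a, fraction in set_b)."""
    return {
        motif: (motif_fraction(set_a, motif), motif_fraction(set_b, motif))
        for motif in motifs
    }


if __name__ == "__main__":
    # Hypothetical peptides, for illustration only.
    candidate_hemolytic = ["GFKKLLKWK", "ALKLAGKKLL", "GIGAVLKVLT"]
    candidate_other = ["GSSGSSG", "AEEAEEA", "PTPTPTP"]
    for motif, fractions in compare_motifs(
        candidate_hemolytic, candidate_other, ["FKK", "LKL", "KKLL"]
    ).items():
        print(motif, fractions)
```

    Per-motif fractions like these are one simple way to build features for a classifier; the models reported in the abstract may use different features.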

  4. Detection of timescales in evolving complex systems

    PubMed Central

    Darst, Richard K.; Granell, Clara; Arenas, Alex; Gómez, Sergio; Saramäki, Jari; Fortunato, Santo

    2016-01-01

    Most complex systems are intrinsically dynamic in nature. The evolution of a dynamic complex system is typically represented as a sequence of snapshots, where each snapshot describes the configuration of the system at a particular instant of time. This is often done by using constant intervals but a better approach would be to define dynamic intervals that match the evolution of the system’s configuration. To this end, we propose a method that aims at detecting evolutionary changes in the configuration of a complex system, and generates intervals accordingly. We show that evolutionary timescales can be identified by looking for peaks in the similarity between the sets of events on consecutive time intervals of data. Tests on simple toy models reveal that the technique is able to detect evolutionary timescales of time-varying data both when the evolution is smooth as well as when it changes sharply. This is further corroborated by analyses of several real datasets. Our method is scalable to extremely large datasets and is computationally efficient. This allows a quick, parameter-free detection of multiple timescales in the evolution of a complex system. PMID:28004820
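
    The idea of choosing an interval length by looking for a peak in the similarity between event sets on consecutive intervals can be sketched as follows. This is an illustration under stated assumptions, not the published method: it assumes Jaccard similarity as the similarity measure and a fixed list of candidate interval lengths, and all names are hypothetical.

```python
def jaccard(a, b):
    """Jaccard similarity of two collections, treated as sets."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def best_interval(events, start, candidates):
    """Pick the interval length whose two consecutive windows are most similar.

    events     -- iterable of (time, event_id) pairs
    start      -- start time of the first window
    candidates -- candidate interval lengths to try (an assumption here)
    Returns (best_length, scores), where scores maps length -> similarity.
    """
    events = list(events)

    def window(lo, hi):
        return {event for t, event in events if lo <= t < hi}

    scores = {
        dt: jaccard(window(start, start + dt), window(start + dt, start + 2 * dt))
        for dt in candidates
    }
    return max(scores, key=scores.get), scores


if __name__ == "__main__":
    # Events "a" and "b" repeat with period 2, so an interval of 2 fits best.
    toy = [(0, "a"), (1, "b"), (2, "a"), (3, "b"), (4, "c"), (5, "d")]
    print(best_interval(toy, start=0, candidates=[1, 2]))
```

    Repeating the selection from the end of each chosen interval would yield a sequence of variable-length snapshots that follows the system's own pace of change.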

  5. Anatomy and evolution of database search engines-a central component of mass spectrometry based proteomic workflows.

    PubMed

    Verheggen, Kenneth; Raeder, Helge; Berven, Frode S; Martens, Lennart; Barsnes, Harald; Vaudel, Marc

    2017-09-13

    Sequence database search engines are bioinformatics algorithms that identify peptides from tandem mass spectra using a reference protein sequence database. Two decades of development, notably driven by advances in mass spectrometry, have provided scientists with more than 30 published search engines, each with its own properties. In this review, we present the common paradigm behind the different implementations, and its limitations for modern mass spectrometry datasets. We also detail how the search engines attempt to alleviate these limitations, and provide an overview of the different software frameworks available to the researcher. Finally, we highlight alternative approaches for the identification of proteomic mass spectrometry datasets, either as a replacement for, or as a complement to, sequence database search engines. © 2017 Wiley Periodicals, Inc.

  6. Sensitivity and specificity considerations for fMRI encoding, decoding, and mapping of auditory cortex at ultra-high field.

    PubMed

    Moerel, Michelle; De Martino, Federico; Kemper, Valentin G; Schmitter, Sebastian; Vu, An T; Uğurbil, Kâmil; Formisano, Elia; Yacoub, Essa

    2018-01-01

    Following rapid technological advances, ultra-high field functional MRI (fMRI) enables exploring correlates of neuronal population activity at an increasing spatial resolution. However, as the fMRI blood-oxygenation-level-dependent (BOLD) contrast is a vascular signal, the spatial specificity of fMRI data is ultimately determined by the characteristics of the underlying vasculature. At 7T, fMRI measurement parameters determine the relative contribution of the macro- and microvasculature to the acquired signal. Here we investigate how these parameters affect relevant high-end fMRI analyses such as encoding, decoding, and submillimeter mapping of voxel preferences in the human auditory cortex. Specifically, we compare a T2*-weighted fMRI dataset, obtained with 2D gradient echo (GE) EPI, to a predominantly T2-weighted dataset obtained with 3D GRASE. We first investigated the decoding accuracy based on two encoding models that represented different hypotheses about auditory cortical processing. This encoding/decoding analysis profited from the large spatial coverage and sensitivity of the T2*-weighted acquisitions, as evidenced by a significantly higher prediction accuracy in the GE-EPI dataset compared to the 3D GRASE dataset for both encoding models. The main disadvantage of the T2*-weighted GE-EPI dataset for encoding/decoding analyses was that the prediction accuracy exhibited cortical depth dependent vascular biases. However, we propose that the comparison of prediction accuracy across the different encoding models may be used as a post processing technique to salvage the spatial interpretability of the GE-EPI cortical depth-dependent prediction accuracy. Second, we explored the mapping of voxel preferences. Large-scale maps of frequency preference (i.e., tonotopy) were similar across datasets, yet the GE-EPI dataset was preferable due to its larger spatial coverage and sensitivity. However, submillimeter tonotopy maps revealed biases in assigned frequency preference and selectivity for the GE-EPI dataset, but not for the 3D GRASE dataset. Thus, a T2-weighted acquisition is recommended if high specificity in tonotopic maps is required. In conclusion, different fMRI acquisitions were better suited for different analyses. It is therefore critical that any sequence parameter optimization considers the eventual intended fMRI analyses and the nature of the neuroscience questions being asked. Copyright © 2017 Elsevier Inc. All rights reserved.

  7. Dense infraspecific sampling reveals rapid and independent trajectories of plastome degradation in a heterotrophic orchid complex.

    PubMed

    Barrett, Craig F; Wicke, Susann; Sass, Chodon

    2018-05-01

    Heterotrophic plants provide excellent opportunities to study the effects of altered selective regimes on genome evolution. Plastid genome (plastome) studies in heterotrophic plants are often based on one or a few highly divergent species or sequences as representatives of an entire lineage, thus missing important evolutionary-transitory events. Here, we present the first infraspecific analysis of plastome evolution in any heterotrophic plant. By combining genome skimming and targeted sequence capture, we address hypotheses on the degree and rate of plastome degradation in a complex of leafless orchids (Corallorhiza striata) across its geographic range. Plastomes provide strong support for relationships and evidence of reciprocal monophyly between C. involuta and the endangered C. bentleyi. Plastome degradation is extensive, occurring rapidly over a few million years, with evidence of differing rates of genomic change among the two principal clades of the complex. Genome skimming and targeted sequence capture differ widely in coverage depth overall, with depth in targeted sequence capture datasets varying immensely across the plastome as a function of GC content. These findings will help to fill a knowledge gap in models of heterotrophic plastid genome evolution, and have implications for future studies in heterotrophs. © 2018 The Authors. New Phytologist © 2018 New Phytologist Trust.

  8. Phylogenetic study of Oryzoideae species and related taxa of the Poaceae based on atpB-rbcL and ndhF DNA sequences.

    PubMed

    Zeng, Xu; Yuan, Zhengrong; Tong, Xin; Li, Qiushi; Gao, Weiwei; Qin, Minjian; Liu, Zhihua

    2012-05-01

    Oryzoideae (Poaceae) plants have economic and ecological value. However, the phylogenetic position of some plants is not clear, such as Hygroryza aristata (Retz.) Nees. and Porteresia coarctata (Roxb.) Tateoka (syn. Oryza coarctata). Comprehensive molecular phylogenetic studies have been carried out on many genera in the Poaceae. Different DNA sequences, including nuclear and chloroplast sequences, have been extensively employed to determine relationships at both higher and lower taxonomic levels in the Poaceae. The chloroplast DNA ndhF gene and atpB-rbcL spacer were used to construct phylogenetic trees and estimate the divergence time of Oryzoideae, Bambusoideae, Panicoideae, Pooideae and so on. Complete sequences of atpB-rbcL and ndhF were generated for 17 species representing six species of the Oryzoideae and related subfamilies. Nicotiana tabacum L. was the outgroup species. The two DNA datasets were analyzed using Maximum Parsimony and Bayesian analysis methods. The molecular phylogeny revealed that H. aristata (Retz.) Nees was the sister to Chikusichloa aquatica Koidz. Moreover, P. coarctata (Roxb.) Tateoka was in the genus Oryza. Furthermore, the result of the evolution analysis, which was based on the ndhF marker, indicated that the time of origin of Oryzoideae might be 31 million years ago.

  9. Ontology-based meta-analysis of global collections of high-throughput public data.

    PubMed

    Kupershmidt, Ilya; Su, Qiaojuan Jane; Grewal, Anoop; Sundaresh, Suman; Halperin, Inbal; Flynn, James; Shekar, Mamatha; Wang, Helen; Park, Jenny; Cui, Wenwu; Wall, Gregory D; Wisotzkey, Robert; Alag, Satnam; Akhtari, Saeid; Ronaghi, Mostafa

    2010-09-29

    The investigation of the interconnections between the molecular and genetic events that govern biological systems is essential if we are to understand the development of disease and design effective novel treatments. Microarray and next-generation sequencing technologies have the potential to provide this information. However, taking full advantage of these approaches requires that biological connections be made across large quantities of highly heterogeneous genomic datasets. Leveraging the increasingly huge quantities of genomic data in the public domain is fast becoming one of the key challenges in the research community today. We have developed a novel data mining framework that enables researchers to use this growing collection of public high-throughput data to investigate any set of genes or proteins. The connectivity between molecular states across thousands of heterogeneous datasets from microarrays and other genomic platforms is determined through a combination of rank-based enrichment statistics, meta-analyses, and biomedical ontologies. We address data quality concerns through dataset replication and meta-analysis and ensure that the majority of the findings are derived using multiple lines of evidence. As an example of our strategy and the utility of this framework, we apply our data mining approach to explore the biology of brown fat within the context of the thousands of publicly available gene expression datasets. Our work presents a practical strategy for organizing, mining, and correlating global collections of large-scale genomic data to explore normal and disease biology. Using a hypothesis-free approach, we demonstrate how a data-driven analysis across very large collections of genomic data can reveal novel discoveries and evidence to support existing hypotheses.

  10. The king cobra genome reveals dynamic gene evolution and adaptation in the snake venom system.

    PubMed

    Vonk, Freek J; Casewell, Nicholas R; Henkel, Christiaan V; Heimberg, Alysha M; Jansen, Hans J; McCleary, Ryan J R; Kerkkamp, Harald M E; Vos, Rutger A; Guerreiro, Isabel; Calvete, Juan J; Wüster, Wolfgang; Woods, Anthony E; Logan, Jessica M; Harrison, Robert A; Castoe, Todd A; de Koning, A P Jason; Pollock, David D; Yandell, Mark; Calderon, Diego; Renjifo, Camila; Currier, Rachel B; Salgado, David; Pla, Davinia; Sanz, Libia; Hyder, Asad S; Ribeiro, José M C; Arntzen, Jan W; van den Thillart, Guido E E J M; Boetzer, Marten; Pirovano, Walter; Dirks, Ron P; Spaink, Herman P; Duboule, Denis; McGlinn, Edwina; Kini, R Manjunatha; Richardson, Michael K

    2013-12-17

    Snakes are limbless predators, and many species use venom to help overpower relatively large, agile prey. Snake venoms are complex protein mixtures encoded by several multilocus gene families that function synergistically to cause incapacitation. To examine venom evolution, we sequenced and interrogated the genome of a venomous snake, the king cobra (Ophiophagus hannah), and compared it, together with our unique transcriptome, microRNA, and proteome datasets from this species, with data from other vertebrates. In contrast to the platypus, the only other venomous vertebrate with a sequenced genome, we find that snake toxin genes evolve through several distinct co-option mechanisms and exhibit surprisingly variable levels of gene duplication and directional selection that correlate with their functional importance in prey capture. The enigmatic accessory venom gland shows a very different pattern of toxin gene expression from the main venom gland and seems to have recruited toxin-like lectin genes repeatedly for new nontoxic functions. In addition, tissue-specific microRNA analyses suggested the co-option of core genetic regulatory components of the venom secretory system from a pancreatic origin. Although the king cobra is limbless, we recovered coding sequences for all Hox genes involved in amniote limb development, with the exception of Hoxd12. Our results provide a unique view of the origin and evolution of snake venom and reveal multiple genome-level adaptive responses to natural selection in this complex biological weapon system. More generally, they provide insight into mechanisms of protein evolution under strong selection.

  11. Marine turtle mitogenome phylogenetics and evolution.

    PubMed

    Duchene, Sebastián; Frey, Amy; Alfaro-Núñez, Alonzo; Dutton, Peter H; Gilbert, M Thomas P; Morin, Phillip A

    2012-10-01

    The sea turtles are a group of Cretaceous origin containing seven recognized living species: leatherback, hawksbill, Kemp's ridley, olive ridley, loggerhead, green, and flatback. The leatherback is the single member of the Dermochelyidae family, whereas all other sea turtles belong in Cheloniidae. Analyses of partial mitochondrial sequences and some nuclear markers have revealed phylogenetic inconsistencies within Cheloniidae, especially regarding the placement of the flatback. Population genetic studies based on D-Loop sequences have shown considerable structuring in species with broad geographic distributions, shedding light on complex migration patterns and possible geographic or climatic events as driving forces of sea-turtle distribution. We have sequenced complete mitogenomes for all sea-turtle species, including samples from their geographic range extremes, and performed phylogenetic analyses to assess sea-turtle evolution with a large molecular dataset. We found variation in the length of the ATP8 gene and a highly variable site in ND4 near a proton translocation channel in the resulting protein. Complete mitogenomes show strong support and resolution for phylogenetic relationships among all sea turtles, and reveal phylogeographic patterns within globally-distributed species. Although there was clear concordance between phylogenies and geographic origin of samples in most taxa, we found evidence of more recent dispersal events in the loggerhead and olive ridley turtles, suggesting more recent migrations (<1 Myr) in these species. Overall, our results demonstrate the complexity of sea-turtle diversity, and indicate the need for further research in phylogeography and molecular evolution. Published by Elsevier Inc.

  12. Fusimonas intestini gen. nov., sp. nov., a novel intestinal bacterium of the family Lachnospiraceae associated with diabetes in mice.

    PubMed

    Kusada, Hiroyuki; Kameyama, Keishi; Meng, Xian-Ying; Kamagata, Yoichi; Tamaki, Hideyuki

    2017-12-22

    Our previous study shows that an anaerobic intestinal bacterium strain AJ110941 P contributes to type 2 diabetes development in mice. Here we phylogenetically and physiologically characterized this unique mouse gut bacterium. The 16S rRNA gene analysis revealed that the strain belongs to the family Lachnospiraceae but shows low sequence similarities (<92.5%) to valid species, and rather formed a distinct cluster with uncultured mouse gut bacteria clones. In a metagenomic database survey, the 16S sequence of AJ110941 P also matched with mouse gut-derived datasets (56% of total datasets) with >99% similarity, suggesting that AJ110941 P -related bacteria mainly reside in mouse digestive tracts. Strain AJ110941 P shared common physiological traits (e.g., Gram-positive, anaerobic, mesophilic, and fermentative growth with carbohydrates) with relative species of the Lachnospiraceae. Notably, the biofilm-forming capacity was found in both AJ110941 P and relative species. However, AJ110941 P possessed a far stronger ability to produce biofilm than relative species and formed a unique structure of extracellular polymeric substances. Furthermore, AJ110941 P cells are markedly long fusiform-shaped rods (9.0-62.5 µm) with multiple flagella that have never been observed in any other Lachnospiraceae members. Based on the phenotypic and phylogenetic features, we propose a new genus and species, Fusimonas intestini gen. nov., sp. nov. for strain AJ110941 P (FERM BP-11443).

  13. Haematobia irritans dataset of raw sequence reads from Illumina and Pac Bio sequencing of genomic DNA

    USDA-ARS's Scientific Manuscript database

    The genome of the horn fly, Haematobia irritans, was sequenced using Illumina- and Pac Bio-based protocols. Following quality filtering, the raw reads have been deposited at NCBI under the BioProject and BioSample accession numbers PRJNA30967 and SAMN07830356, respectively. The Illumina reads are un...

  14. Haematobia irritans dataset of raw sequence reads from Illumina-based transcriptome sequencing of specific tissues and life stages

    USDA-ARS's Scientific Manuscript database

    Illumina HiSeq technology was used to sequence the transcriptome from various dissected tissues and life stages from the horn fly, Haematobia irritans. These samples include eggs (0, 2, 4, and 9 hours post-oviposition), adult fly gut, adult fly legs, adult fly malpighian tubule, adult fly ovary, adu...

  15. A detailed gene expression study of the Miscanthus genus reveals changes in the transcriptome associated with the rejuvenation of spring rhizomes.

    PubMed

    Barling, Adam; Swaminathan, Kankshita; Mitros, Therese; James, Brandon T; Morris, Juliette; Ngamboma, Ornella; Hall, Megan C; Kirkpatrick, Jessica; Alabady, Magdy; Spence, Ashley K; Hudson, Matthew E; Rokhsar, Daniel S; Moose, Stephen P

    2013-12-09

    The Miscanthus genus of perennial C4 grasses contains promising biofuel crops for temperate climates. However, few genomic resources exist for Miscanthus, which limits understanding of its interesting biology and future genetic improvement. A comprehensive catalog of expressed sequences was generated from a variety of Miscanthus species and tissue types, with an emphasis on characterizing gene expression changes in spring compared to fall rhizomes. Illumina short read sequencing technology was used to produce transcriptome sequences from different tissues and organs during distinct developmental stages for multiple Miscanthus species, including Miscanthus sinensis, Miscanthus sacchariflorus, and their interspecific hybrid Miscanthus × giganteus. More than fifty billion base-pairs of Miscanthus transcript sequence were produced. Overall, 26,230 Sorghum gene models (i.e., ~96% of predicted Sorghum genes) had at least five Miscanthus reads mapped to them, suggesting that a large portion of the Miscanthus transcriptome is represented in this dataset. The Miscanthus × giganteus data was used to identify genes preferentially expressed in a single tissue, such as the spring rhizome, using Sorghum bicolor as a reference. Quantitative real-time PCR was used to verify examples of preferential expression predicted via RNA-Seq. Contiguous consensus transcript sequences were assembled for each species and annotated using InterProScan. Sequences from the assembled transcriptome were used to amplify genomic segments from a doubled haploid Miscanthus sinensis and from Miscanthus × giganteus to further disentangle the allelic and paralogous variations in genes. This large expressed sequence tag collection creates a valuable resource for the study of Miscanthus biology by providing detailed gene sequence information and tissue preferred expression patterns. We have successfully generated a database of transcriptome assemblies and demonstrated its use in the study of genes of interest. Analysis of gene expression profiles revealed biological pathways that exhibit altered regulation in spring compared to fall rhizomes, which are consistent with their different physiological functions. The expression profiles of the subterranean rhizome provide a better understanding of the biological activities of the underground stem structures that are essential for perenniality and the storage or remobilization of carbon and nutrient resources.

  16. A phylogenetic analysis using full-length viral genomes of South American dengue serotype 3 in consecutive Venezuelan outbreaks reveals novel NS5 mutation

    PubMed Central

    Schmidt, DJ; Pickett, BE; Camacho, D; Comach, G; Xhaja, K; Lennon, NJ; Rizzolo, K; de Bosch, N; Becerra, A; Nogueira, ML; Mondini, A; da Silva, EV; Vasconcelos, PF; Muñoz-Jordán, JL; Santiago, GA; Ocazionez, R; Gehrke, L; Lefkowitz, EJ; Birren, BW; Henn, MR; Bosch, I

    2013-01-01

    Dengue virus currently causes 50-100 million infections annually. Comprehensive knowledge about the evolution of Dengue in response to selection pressure is currently unavailable, but would greatly enhance vaccine design efforts. In the current study, we sequenced 187 new dengue virus serotype 3 (DENV-3) genotype III whole genomes isolated from Asia and the Americas. We analyzed them together with previously-sequenced isolates to gain a more detailed understanding of the evolutionary adaptations existing in this prevalent American serotype. In order to analyze the phylogenetic dynamics of DENV-3 during outbreak periods, we incorporated datasets of 48 and 11 sequences spanning two major outbreaks in Venezuela during 2001 and 2007-2008 respectively. Our phylogenetic analysis of newly sequenced viruses shows that subsets of genomes cluster primarily by geographic location, and secondarily by time of virus isolation. DENV-3 genotype III sequences from Asia are significantly divergent from those from the Americas due to their geographical separation and subsequent speciation. We measured amino acid variation for the E protein by calculating the Shannon entropy at each position between Asian and American genomes. We found a cluster of 7 amino acid substitutions having high variability within E protein domain III, which has previously been implicated in serotype-specific neutralization escape mutants. No novel mutations were found in the E protein of sequences isolated during either Venezuelan outbreak. Shannon entropy analysis of the NS5 polymerase mature protein revealed that a G374E mutation, in a region that contributes to interferon resistance in other flaviviruses by interfering with JAK-STAT signaling, was present in both the Asian and American sequences from the 2007-2008 Venezuelan outbreak, but was absent in the sequences from the 2001 Venezuelan outbreak. In addition to E, several NS5 amino acid changes were unique to the 2007-2008 epidemic in Venezuela and may give additional insight into the adaptive response of DENV-3 at the population level. PMID:21964598
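
    The per-position Shannon entropy mentioned above can be sketched for a set of aligned protein sequences. This is a minimal illustration assuming equal-length, pre-aligned sequences and base-2 logarithms; the abstract does not describe how gaps or ambiguous residues were handled, so this sketch simply treats every character as a symbol. The toy sequences are invented.

```python
import math
from collections import Counter


def column_entropies(aligned):
    """Shannon entropy (bits) of each column of an alignment.

    aligned -- list of equal-length strings, one per sequence.
    A value of 0 means the position is fully conserved.
    """
    if not aligned:
        return []
    length = len(aligned[0])
    if any(len(seq) != length for seq in aligned):
        raise ValueError("all aligned sequences must have the same length")
    entropies = []
    for column in zip(*aligned):
        counts = Counter(column)
        n = len(column)
        entropies.append(
            -sum((c / n) * math.log2(c / n) for c in counts.values())
        )
    return entropies


if __name__ == "__main__":
    # Hypothetical four-sequence alignment: position 1 conserved, 2 and 3 variable.
    print(column_entropies(["MKG", "MKE", "MRG", "MRE"]))
```

    Positions with high entropy are candidates for variable sites; comparing profiles between two groups of sequences is one way to flag group-specific substitutions.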

  17. Integrating alignment-based and alignment-free sequence similarity measures for biological sequence classification.

    PubMed

    Borozan, Ivan; Watt, Stuart; Ferretti, Vincent

    2015-05-01

    Alignment-based sequence similarity searches, while accurate for some types of sequences, can produce incorrect results when used on more divergent but functionally related sequences that have undergone the sequence rearrangements observed in many bacterial and viral genomes. Here, we propose a classification model that exploits the complementary nature of alignment-based and alignment-free similarity measures with the aim to improve the accuracy with which DNA and protein sequences are characterized. Our model classifies sequences using a combined sequence similarity score calculated by adaptively weighting the contribution of different sequence similarity measures. Weights are determined independently for each sequence in the test set and reflect the discriminatory ability of individual similarity measures in the training set. Because the similarity between some sequences is determined more accurately with one type of measure rather than another, our classifier allows different sets of weights to be associated with different sequences. Using five different similarity measures, we show that our model significantly improves the classification accuracy over the current composition- and alignment-based models, when predicting the taxonomic lineage for both short viral sequence fragments and complete viral sequences. We also show that our model can be used effectively for the classification of reads from a real metagenome dataset as well as protein sequences. All the datasets and the code used in this study are freely available at https://collaborators.oicr.on.ca/vferretti/borozan_csss/csss.html. Contact: ivan.borozan@gmail.com. Supplementary data are available at Bioinformatics online. © The Author 2015. Published by Oxford University Press.

  18. Integrating alignment-based and alignment-free sequence similarity measures for biological sequence classification

    PubMed Central

    Borozan, Ivan; Watt, Stuart; Ferretti, Vincent

    2015-01-01

    Motivation: Alignment-based sequence similarity searches, while accurate for some types of sequences, can produce incorrect results when used on more divergent but functionally related sequences that have undergone the sequence rearrangements observed in many bacterial and viral genomes. Here, we propose a classification model that exploits the complementary nature of alignment-based and alignment-free similarity measures with the aim to improve the accuracy with which DNA and protein sequences are characterized. Results: Our model classifies sequences using a combined sequence similarity score calculated by adaptively weighting the contribution of different sequence similarity measures. Weights are determined independently for each sequence in the test set and reflect the discriminatory ability of individual similarity measures in the training set. Because the similarity between some sequences is determined more accurately with one type of measure rather than another, our classifier allows different sets of weights to be associated with different sequences. Using five different similarity measures, we show that our model significantly improves the classification accuracy over the current composition- and alignment-based models, when predicting the taxonomic lineage for both short viral sequence fragments and complete viral sequences. We also show that our model can be used effectively for the classification of reads from a real metagenome dataset as well as protein sequences. Availability and implementation: All the datasets and the code used in this study are freely available at https://collaborators.oicr.on.ca/vferretti/borozan_csss/csss.html. Contact: ivan.borozan@gmail.com Supplementary information: Supplementary data are available at Bioinformatics online. PMID:25573913

  19. Classifying next-generation sequencing data using a zero-inflated Poisson model.

    PubMed

    Zhou, Yan; Wan, Xiang; Zhang, Baoxue; Tong, Tiejun

    2018-04-15

    With the development of high-throughput techniques, RNA-sequencing (RNA-seq) is becoming increasingly popular as an alternative for gene expression analysis, such as RNA profiling and classification. Identifying which type of disease a new patient has from RNA-seq data has been recognized as a vital problem in medical research. As RNA-seq data are discrete, statistical methods developed for classifying microarray data cannot be readily applied to RNA-seq data classification. Witten proposed a Poisson linear discriminant analysis (PLDA) to classify the RNA-seq data in 2011. Note, however, that the count datasets are frequently characterized by excess zeros in real RNA-seq or microRNA sequence data (i.e. when the sequence depth is not enough or small RNAs with the length of 18-30 nucleotides). Therefore, it is desirable to develop a new model to analyze RNA-seq data with an excess of zeros. In this paper, we propose a Zero-Inflated Poisson Logistic Discriminant Analysis (ZIPLDA) for RNA-seq data with an excess of zeros. The new method assumes that the data are from a mixture of two distributions: one is a point mass at zero, and the other follows a Poisson distribution. We then consider a logistic relation between the probability of observing zeros and the mean of the genes and the sequencing depth in the model. Simulation studies show that the proposed method performs better than, or at least as well as, the existing methods in a wide range of settings. Two real datasets including a breast cancer RNA-seq dataset and a microRNA-seq dataset are also analyzed, and they coincide with the simulation results that our proposed method outperforms the existing competitors. The software is available at http://www.math.hkbu.edu.hk/~tongt. Contact: xwan@comp.hkbu.edu.hk or tongt@hkbu.edu.hk. Supplementary information: Supplementary data are available at Bioinformatics online.
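
    As a minimal sketch of the mixture described in this abstract, the zero-inflated Poisson probability of a count can be written as a point mass at zero plus a Poisson component. The function name `zip_pmf` and the parameters `pi_zero` and `lam` are illustrative assumptions, not names taken from the ZIPLDA software, and the mixing probability is treated here as a given constant rather than modelled through the logistic relation the authors describe.

```python
import math


def zip_pmf(k: int, pi_zero: float, lam: float) -> float:
    """Probability of observing count k under a zero-inflated Poisson.

    pi_zero: probability that the count is a structural zero (assumed given).
    lam:     mean of the Poisson component.
    """
    if not 0.0 <= pi_zero <= 1.0:
        raise ValueError("pi_zero must lie in [0, 1]")
    if lam < 0:
        raise ValueError("lam must be non-negative")
    if k < 0:
        return 0.0
    # Ordinary Poisson probability of k.
    poisson = math.exp(-lam) * lam ** k / math.factorial(k)
    if k == 0:
        # Zeros come from either the point mass or the Poisson component.
        return pi_zero + (1.0 - pi_zero) * poisson
    return (1.0 - pi_zero) * poisson
```

    Setting `pi_zero` to 0 recovers the plain Poisson model, which is one way to see why excess zeros call for the extra mixture component.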

  20. Alignment-free Transcriptomic and Metatranscriptomic Comparison Using Sequencing Signatures with Variable Length Markov Chains.

    PubMed

    Liao, Weinan; Ren, Jie; Wang, Kun; Wang, Shun; Zeng, Feng; Wang, Ying; Sun, Fengzhu

    2016-11-23

    The comparison between microbial sequencing data is critical to understand the dynamics of microbial communities. The alignment-based tools analyzing metagenomic datasets require reference sequences and read alignments. The available alignment-free dissimilarity approaches model the background sequences with Fixed Order Markov Chain (FOMC) yielding promising results for the comparison of microbial communities. However, in FOMC, the number of parameters grows exponentially with the increase of the order of Markov Chain (MC). Under a fixed high order of MC, the parameters might not be accurately estimated owing to the limitation of sequencing depth. In our study, we investigate an alternative to FOMC to model background sequences with the data-driven Variable Length Markov Chain (VLMC) in metatranscriptomic data. The VLMC originally designed for long sequences was extended to apply to high-throughput sequencing reads and the strategies to estimate the corresponding parameters were developed. The flexible number of parameters in VLMC avoids estimating the vast number of parameters of high-order MC under limited sequencing depth. Different from the manual selection in FOMC, VLMC determines the MC order adaptively. Several beta diversity measures based on VLMC were applied to compare the bacterial RNA-Seq and metatranscriptomic datasets. Experiments show that VLMC outperforms FOMC to model the background sequences in transcriptomic and metatranscriptomic samples. A software pipeline is available at https://d2vlmc.codeplex.com.
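
    The exponential growth in parameters noted in this abstract can be illustrated with a short calculation. This is a sketch under stated assumptions: a four-letter nucleotide alphabet, one transition distribution per context of length `order`, and `alphabet_size - 1` free probabilities per context. The helper name `fomc_free_parameters` is hypothetical and is not part of the authors' pipeline.

```python
def fomc_free_parameters(order: int, alphabet_size: int = 4) -> int:
    """Number of free transition probabilities in a fixed-order Markov chain.

    There are alphabet_size ** order possible contexts, and each context
    has alphabet_size - 1 free probabilities (the last one is fixed
    because the probabilities for a context sum to 1).
    """
    if order < 0:
        raise ValueError("order must be non-negative")
    if alphabet_size < 2:
        raise ValueError("alphabet_size must be at least 2")
    return alphabet_size ** order * (alphabet_size - 1)


if __name__ == "__main__":
    # Show how quickly the parameter count grows with the chain order.
    for order in range(0, 11, 2):
        print(order, fomc_free_parameters(order))
```

    Each extra unit of order multiplies the count by the alphabet size, which is why a high fixed order can be hard to estimate from limited sequencing depth.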

  1. Characterization and Pathogenicity of New Record of Anthracnose on Various Chili Varieties Caused by Colletotrichum scovillei in Korea.

    PubMed

    Oo, May Moe; Lim, GiTaek; Jang, Hyun A; Oh, Sang-Keun

    2017-09-01

    The anthracnose disease caused by Colletotrichum species is well-known as a major plant pathogen that primarily causes fruit rot in pepper and reduces its marketability. Thirty-five isolates representing species of Colletotrichum were obtained from chili fruits showing anthracnose disease symptoms in Chungcheongnam-do and Chungcheongbuk-do, South Korea. These 35 isolates were characterized according to morphological characteristics and nucleotide sequence data of internal transcribed spacer, glyceraldehyde-3-phosphate-dehydrogenase, and β-tubulin. The combined dataset shows that all of these 35 isolates were identified as C. scovillei and morphological characteristics were directly correlated with the nucleotide sequence data. Notably, these isolates were recorded for the first time as the causes of anthracnose caused by C. scovillei on pepper in Korea. Forty cultivars were used to investigate the pathogenicity and to identify the possible source of resistance. The result reveals that all of chili cultivars used in this study are susceptible to C. scovillei.

  2. Characterization and Pathogenicity of New Record of Anthracnose on Various Chili Varieties Caused by Colletotrichum scovillei in Korea

    PubMed Central

    Oo, May Moe; Lim, GiTaek; Jang, Hyun A

    2017-01-01

    The anthracnose disease caused by Colletotrichum species is well-known as a major plant pathogen that primarily causes fruit rot in pepper and reduces its marketability. Thirty-five isolates representing species of Colletotrichum were obtained from chili fruits showing anthracnose disease symptoms in Chungcheongnam-do and Chungcheongbuk-do, South Korea. These 35 isolates were characterized according to morphological characteristics and nucleotide sequence data of internal transcribed spacer, glyceraldehyde-3-phosphate-dehydrogenase, and β-tubulin. The combined dataset shows that all of these 35 isolates were identified as C. scovillei and morphological characteristics were directly correlated with the nucleotide sequence data. Notably, these isolates were recorded for the first time as the causes of anthracnose caused by C. scovillei on pepper in Korea. Forty cultivars were used to investigate the pathogenicity and to identify the possible source of resistance. The result reveals that all of chili cultivars used in this study are susceptible to C. scovillei. PMID:29138623

  3. A New Approach for Mining Order-Preserving Submatrices Based on All Common Subsequences.

    PubMed

    Xue, Yun; Liao, Zhengling; Li, Meihang; Luo, Jie; Kuang, Qiuhua; Hu, Xiaohui; Li, Tiechen

    2015-01-01

    Order-preserving submatrices (OPSMs) have been applied in many fields, such as DNA microarray data analysis, automatic recommendation systems, and target marketing systems, as an important unsupervised learning model. Unfortunately, most existing methods are heuristic algorithms which are unable to reveal all OPSMs in this NP-complete problem. In particular, deep OPSMs, corresponding to long patterns with few supporting sequences, incur explosive computational costs and are completely pruned by most popular methods. In this paper, we propose an exact method to discover all OPSMs based on frequent sequential pattern mining. First, an existing algorithm was adjusted to disclose all common subsequences (ACS) between every two row sequences, so that no deep OPSMs are missed. Then, an improved data structure for the prefix tree was used to store and traverse ACS, and the Apriori principle was employed to efficiently mine the frequent sequential patterns. Finally, experiments were implemented on gene and synthetic datasets. Results demonstrated the effectiveness and efficiency of this method.

  4. Evolutionary dynamics of Newcastle disease virus

    USGS Publications Warehouse

    Miller, P.J.; Kim, L.M.; Ip, Hon S.; Afonso, C.L.

    2009-01-01

    A comprehensive dataset of NDV genome sequences was evaluated using bioinformatics to characterize the evolutionary forces affecting NDV genomes. Despite evidence of recombination in most genes, only one event in the fusion gene of genotype V viruses produced evolutionarily viable progenies. The codon-associated rate of change for the six NDV proteins revealed that the highest rate of change occurred at the fusion protein. All proteins were under strong purifying (negative) selection; the fusion protein displayed the highest number of amino acids under positive selection. Regardless of the phylogenetic grouping or the level of virulence, the cleavage site motif was highly conserved implying that mutations at this site that result in changes of virulence may not be favored. The coding sequence of the fusion gene and the genomes of viruses from wild birds displayed higher yearly rates of change in virulent viruses than in viruses of low virulence, suggesting that an increase in virulence may accelerate the rate of NDV evolution. © 2009 Elsevier Inc.

  5. Genome-wide methylation analysis identified sexually dimorphic methylated regions in hybrid tilapia

    PubMed Central

    Wan, Zi Yi; Xia, Jun Hong; Lin, Grace; Wang, Le; Lin, Valerie C. L.; Yue, Gen Hua

    2016-01-01

    Sexual dimorphism is an interesting biological phenomenon. Previous studies showed that DNA methylation might play a role in sexual dimorphism. However, the overall picture of the genome-wide methylation landscape in sexually dimorphic species remains unclear. We analyzed the DNA methylation landscape and transcriptome in hybrid tilapia (Oreochromis spp.) using whole genome bisulfite sequencing (WGBS) and RNA-sequencing (RNA-seq). We found 4,757 sexually dimorphic differentially methylated regions (DMRs), with significant clusters of DMRs located on chromosomal regions associated with sex determination. CpG methylation in promoter regions was negatively correlated with the gene expression level. The MAPK/ERK pathway was upregulated in male tilapia. We also inferred active cis-regulatory regions (ACRs) in skeletal muscle tissues from WGBS datasets, revealing sexually dimorphic cis-regulatory regions. These results suggest that DNA methylation contributes to sex-specific phenotypes and serve as resources for further investigation to analyze the functions of these regions and their contributions towards sexual dimorphisms. PMID:27782217

  6. Sequence-based prediction of protein-binding sites in DNA: comparative study of two SVM models.

    PubMed

    Park, Byungkyu; Im, Jinyong; Tuvshinjargal, Narankhuu; Lee, Wook; Han, Kyungsook

    2014-11-01

    As many structures of protein-DNA complexes have been known in the past years, several computational methods have been developed to predict DNA-binding sites in proteins. However, its inverse problem (i.e., predicting protein-binding sites in DNA) has received much less attention. One of the reasons is that the differences between the interaction propensities of nucleotides are much smaller than those between amino acids. Another reason is that DNA exhibits less diverse sequence patterns than protein. Therefore, predicting protein-binding DNA nucleotides is much harder than predicting DNA-binding amino acids. We computed the interaction propensity (IP) of nucleotide triplets with amino acids using an extensive dataset of protein-DNA complexes, and developed two support vector machine (SVM) models that predict protein-binding nucleotides from sequence data alone. One SVM model predicts protein-binding nucleotides using DNA sequence data alone, and the other SVM model predicts protein-binding nucleotides using both DNA and protein sequences. In a 10-fold cross-validation with 1519 DNA sequences, the SVM model that uses DNA sequence data only predicted protein-binding nucleotides with an accuracy of 67.0%, an F-measure of 67.1%, and a Matthews correlation coefficient (MCC) of 0.340. With an independent dataset of 181 DNAs that were not used in training, it achieved an accuracy of 66.2%, an F-measure 66.3% and a MCC of 0.324. Another SVM model that uses both DNA and protein sequences achieved an accuracy of 69.6%, an F-measure of 69.6%, and a MCC of 0.383 in a 10-fold cross-validation with 1519 DNA sequences and 859 protein sequences. With an independent dataset of 181 DNAs and 143 proteins, it showed an accuracy of 67.3%, an F-measure of 66.5% and a MCC of 0.329. Both in cross-validation and independent testing, the second SVM model that used both DNA and protein sequence data showed better performance than the first model that used DNA sequence data. 
    To the best of our knowledge, this is the first attempt to predict protein-binding nucleotides in a given DNA sequence from the sequence data alone. Copyright © 2014 Elsevier Ireland Ltd. All rights reserved.
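
    The three scores reported in this abstract (accuracy, F-measure, and Matthews correlation coefficient) can all be derived from a binary confusion matrix. The sketch below uses the standard textbook definitions; the abstract does not say how the authors computed them, and the function name `binary_metrics` and the counts in the usage note are hypothetical.

```python
import math


def binary_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Accuracy, F-measure and MCC from binary confusion-matrix counts."""
    total = tp + fp + tn + fn
    if total == 0:
        raise ValueError("at least one prediction is required")
    accuracy = (tp + tn) / total
    # Precision and recall fall back to 0 when their denominator is empty.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall:
        f_measure = 2 * precision * recall / (precision + recall)
    else:
        f_measure = 0.0
    # MCC is conventionally set to 0 when any marginal total is zero.
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"accuracy": accuracy, "f_measure": f_measure, "mcc": mcc}
```

    For example, `binary_metrics(tp=25, fp=25, tn=25, fn=25)` describes a classifier no better than chance: accuracy and F-measure are both 0.5 while the MCC is 0, which shows why the MCC is often reported alongside the other two.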

  7. Assessment of species diversity and distribution of an ancient diatom lineage using a DNA metabarcoding approach.

    PubMed

    Nanjappa, Deepak; Audic, Stephane; Romac, Sarah; Kooistra, Wiebe H C F; Zingone, Adriana

    2014-01-01

    Continuous efforts to estimate actual diversity and to trace the species distribution and ranges in the natural environments have gone in equal pace with advancements of the technologies in the study of microbial species diversity from microscopic observations to DNA-based barcoding. DNA metabarcoding based on Next Generation Sequencing (NGS) constitutes the latest advancement in these efforts. Here we use NGS data from different sites to investigate the geographic range of six species of the diatom family Leptocylindraceae and to identify possible new taxa within the family. We analysed the V4 and V9 regions of the nuclear-encoded SSU rDNA gene region in the NGS database of the European ERA-Biodiversa project BioMarKs, collected in plankton and sediments at six coastal sites in European coastal waters, as well as environmental sequences from the NCBI database. All species known in the family Leptocylindraceae were detected in both datasets, but the much larger Illumina V9 dataset showed a higher species coverage at the various sites than the 454 V4 dataset. Sequences identical or similar to the references of Leptocylindrus aporus, L. convexus, L. danicus/hargravesii and Tenuicylindrus belgicus were found in the Mediterranean Sea, North Atlantic Ocean and Black Sea as well as at locations outside Europe. Instead, sequences identical or close to that of L. minimus were found in the North Atlantic Ocean and the Black Sea but not in the Mediterranean Sea, while sequences belonging to a yet undescribed taxon were encountered only in Oslo Fjord and Baffin Bay. Identification of Leptocylindraceae species in NGS datasets has expanded our knowledge of the species biogeographic distribution and of the overall diversity of this diatom family. Individual species appear to be widespread, but not all of them are found everywhere. 
    Despite the sequencing depth allowed by NGS and the wide geographic area covered by this study, the diversity of this ancient diatom family appears to be low, at least at the level of the marker used in this study.

  8. Assessment of Species Diversity and Distribution of an Ancient Diatom Lineage Using a DNA Metabarcoding Approach

    PubMed Central

    Nanjappa, Deepak; Audic, Stephane; Romac, Sarah; Kooistra, Wiebe H. C. F.; Zingone, Adriana

    2014-01-01

    Background Continuous efforts to estimate actual diversity and to trace the species distribution and ranges in the natural environments have gone in equal pace with advancements of the technologies in the study of microbial species diversity from microscopic observations to DNA-based barcoding. DNA metabarcoding based on Next Generation Sequencing (NGS) constitutes the latest advancement in these efforts. Here we use NGS data from different sites to investigate the geographic range of six species of the diatom family Leptocylindraceae and to identify possible new taxa within the family. Methodology/Principal Findings We analysed the V4 and V9 regions of the nuclear-encoded SSU rDNA gene region in the NGS database of the European ERA-Biodiversa project BioMarKs, collected in plankton and sediments at six coastal sites in European coastal waters, as well as environmental sequences from the NCBI database. All species known in the family Leptocylindraceae were detected in both datasets, but the much larger Illumina V9 dataset showed a higher species coverage at the various sites than the 454 V4 dataset. Sequences identical or similar to the references of Leptocylindrus aporus, L. convexus, L. danicus/hargravesii and Tenuicylindrus belgicus were found in the Mediterranean Sea, North Atlantic Ocean and Black Sea as well as at locations outside Europe. Instead, sequences identical or close to that of L. minimus were found in the North Atlantic Ocean and the Black Sea but not in the Mediterranean Sea, while sequences belonging to a yet undescribed taxon were encountered only in Oslo Fjord and Baffin Bay. Conclusions/Significance Identification of Leptocylindraceae species in NGS datasets has expanded our knowledge of the species biogeographic distribution and of the overall diversity of this diatom family. Individual species appear to be widespread, but not all of them are found everywhere. 
    Despite the sequencing depth allowed by NGS and the wide geographic area covered by this study, the diversity of this ancient diatom family appears to be low, at least at the level of the marker used in this study. PMID:25133638

  9. Identification and Removal of Contaminant Sequences From Ribosomal Gene Databases: Lessons From the Census of Deep Life

    PubMed Central

    Sheik, Cody S.; Reese, Brandi Kiel; Twing, Katrina I.; Sylvan, Jason B.; Grim, Sharon L.; Schrenk, Matthew O.; Sogin, Mitchell L.; Colwell, Frederick S.

    2018-01-01

    Earth’s subsurface environment is one of the largest, yet least studied, biomes on Earth, and many questions remain regarding what microorganisms are indigenous to the subsurface. Through the activity of the Census of Deep Life (CoDL) and the Deep Carbon Observatory, an open access 16S ribosomal RNA gene sequence database from diverse subsurface environments has been compiled. However, due to low quantities of biomass in the deep subsurface, the potential for incorporation of contaminants from reagents used during sample collection, processing, and/or sequencing is high. Thus, to understand the ecology of subsurface microorganisms (i.e., the distribution, richness, or survival), it is necessary to minimize, identify, and remove contaminant sequences that will skew the relative abundances of all taxa in the sample. In this meta-analysis, we identify putative contaminants associated with the CoDL dataset, recommend best practices for removing contaminants from samples, and propose a series of best practices for subsurface microbiology sampling. The most abundant putative contaminant genera observed, independent of evenness across samples, were Propionibacterium, Aquabacterium, Ralstonia, and Acinetobacter. While the top five most frequently observed genera were Pseudomonas, Propionibacterium, Acinetobacter, Ralstonia, and Sphingomonas. The majority of the most frequently observed genera (high evenness) were associated with reagent or potential human contamination. Additionally, in DNA extraction blanks, we observed potential archaeal contaminants, including methanogens, which have not been discussed in previous contamination studies. Such contaminants would directly affect the interpretation of subsurface molecular studies, as methanogenesis is an important subsurface biogeochemical process. 
    Utilizing previously identified contaminant genera, we found that ∼27% of the total dataset were identified as contaminant sequences that likely originate from DNA extraction and DNA cleanup methods. Thus, controls must be taken at every step of the collection and processing procedure when working with low biomass environments such as, but not limited to, portions of Earth’s deep subsurface. Taken together, we stress that the CoDL dataset is an incredible resource for the broader research community interested in subsurface life, and steps to remove contamination derived sequences must be taken prior to using this dataset. PMID:29780369

  10. VirSorter: mining viral signal from microbial genomic data.

    PubMed

    Roux, Simon; Enault, Francois; Hurwitz, Bonnie L; Sullivan, Matthew B

    2015-01-01

    Viruses of microbes impact all ecosystems where microbes drive key energy and substrate transformations including the oceans, humans and industrial fermenters. However, despite this recognized importance, our understanding of viral diversity and impacts remains limited by too few model systems and reference genomes. One way to fill these gaps in our knowledge of viral diversity is through the detection of viral signal in microbial genomic data. While multiple approaches have been developed and applied for the detection of prophages (viral genomes integrated in a microbial genome), new types of microbial genomic data are emerging that are more fragmented and larger scale, such as Single-cell Amplified Genomes (SAGs) of uncultivated organisms or genomic fragments assembled from metagenomic sequencing. Here, we present VirSorter, a tool designed to detect viral signal in these different types of microbial sequence data in both a reference-dependent and reference-independent manner, leveraging probabilistic models and extensive virome data to maximize detection of novel viruses. Performance testing shows that VirSorter's prophage prediction capability compares to that of available prophage predictors for complete genomes, but is superior in predicting viral sequences outside of a host genome (i.e., from extrachromosomal prophages, lytic infections, or partially assembled prophages). Furthermore, VirSorter outperforms existing tools for fragmented genomic and metagenomic datasets, and can identify viral signal in assembled sequence (contigs) as short as 3kb, while providing near-perfect identification (>95% Recall and 100% Precision) on contigs of at least 10kb. Because VirSorter scales to large datasets, it can also be used in "reverse" to more confidently identify viral sequence in viral metagenomes by sorting away cellular DNA whether derived from gene transfer agents, generalized transduction or contamination. 
    Finally, VirSorter is made available through the iPlant Cyberinfrastructure that provides a web-based user interface interconnected with the required computing resources. VirSorter thus complements existing prophage prediction software to better leverage fragmented, SAG and metagenomic datasets in a way that will scale to modern sequencing. Given these features, VirSorter should enable the discovery of new viruses in microbial datasets, and further our understanding of uncultivated viral communities across diverse ecosystems.

  11. VirSorter: mining viral signal from microbial genomic data

    PubMed Central

    Roux, Simon; Enault, Francois; Hurwitz, Bonnie L.

    2015-01-01

    Viruses of microbes impact all ecosystems where microbes drive key energy and substrate transformations including the oceans, humans and industrial fermenters. However, despite this recognized importance, our understanding of viral diversity and impacts remains limited by too few model systems and reference genomes. One way to fill these gaps in our knowledge of viral diversity is through the detection of viral signal in microbial genomic data. While multiple approaches have been developed and applied for the detection of prophages (viral genomes integrated in a microbial genome), new types of microbial genomic data are emerging that are more fragmented and larger scale, such as Single-cell Amplified Genomes (SAGs) of uncultivated organisms or genomic fragments assembled from metagenomic sequencing. Here, we present VirSorter, a tool designed to detect viral signal in these different types of microbial sequence data in both a reference-dependent and reference-independent manner, leveraging probabilistic models and extensive virome data to maximize detection of novel viruses. Performance testing shows that VirSorter’s prophage prediction capability compares to that of available prophage predictors for complete genomes, but is superior in predicting viral sequences outside of a host genome (i.e., from extrachromosomal prophages, lytic infections, or partially assembled prophages). Furthermore, VirSorter outperforms existing tools for fragmented genomic and metagenomic datasets, and can identify viral signal in assembled sequence (contigs) as short as 3kb, while providing near-perfect identification (>95% Recall and 100% Precision) on contigs of at least 10kb. Because VirSorter scales to large datasets, it can also be used in “reverse” to more confidently identify viral sequence in viral metagenomes by sorting away cellular DNA whether derived from gene transfer agents, generalized transduction or contamination. 
    Finally, VirSorter is made available through the iPlant Cyberinfrastructure that provides a web-based user interface interconnected with the required computing resources. VirSorter thus complements existing prophage prediction software to better leverage fragmented, SAG and metagenomic datasets in a way that will scale to modern sequencing. Given these features, VirSorter should enable the discovery of new viruses in microbial datasets, and further our understanding of uncultivated viral communities across diverse ecosystems. PMID:26038737

  12. Identification and Removal of Contaminant Sequences From Ribosomal Gene Databases: Lessons From the Census of Deep Life.

    PubMed

    Sheik, Cody S; Reese, Brandi Kiel; Twing, Katrina I; Sylvan, Jason B; Grim, Sharon L; Schrenk, Matthew O; Sogin, Mitchell L; Colwell, Frederick S

    2018-01-01

    Earth's subsurface environment is one of the largest, yet least studied, biomes on Earth, and many questions remain regarding what microorganisms are indigenous to the subsurface. Through the activity of the Census of Deep Life (CoDL) and the Deep Carbon Observatory, an open access 16S ribosomal RNA gene sequence database from diverse subsurface environments has been compiled. However, due to low quantities of biomass in the deep subsurface, the potential for incorporation of contaminants from reagents used during sample collection, processing, and/or sequencing is high. Thus, to understand the ecology of subsurface microorganisms (i.e., the distribution, richness, or survival), it is necessary to minimize, identify, and remove contaminant sequences that will skew the relative abundances of all taxa in the sample. In this meta-analysis, we identify putative contaminants associated with the CoDL dataset, recommend best practices for removing contaminants from samples, and propose a series of best practices for subsurface microbiology sampling. The most abundant putative contaminant genera observed, independent of evenness across samples, were Propionibacterium, Aquabacterium, Ralstonia, and Acinetobacter. While the top five most frequently observed genera were Pseudomonas, Propionibacterium, Acinetobacter, Ralstonia, and Sphingomonas. The majority of the most frequently observed genera (high evenness) were associated with reagent or potential human contamination. Additionally, in DNA extraction blanks, we observed potential archaeal contaminants, including methanogens, which have not been discussed in previous contamination studies. Such contaminants would directly affect the interpretation of subsurface molecular studies, as methanogenesis is an important subsurface biogeochemical process. 
    Utilizing previously identified contaminant genera, we found that ∼27% of the total dataset were identified as contaminant sequences that likely originate from DNA extraction and DNA cleanup methods. Thus, controls must be taken at every step of the collection and processing procedure when working with low biomass environments such as, but not limited to, portions of Earth's deep subsurface. Taken together, we stress that the CoDL dataset is an incredible resource for the broader research community interested in subsurface life, and steps to remove contamination derived sequences must be taken prior to using this dataset.

  13. JingleBells: A Repository of Immune-Related Single-Cell RNA-Sequencing Datasets.

    PubMed

    Ner-Gaon, Hadas; Melchior, Ariel; Golan, Nili; Ben-Haim, Yael; Shay, Tal

    2017-05-01

    Recent advances in single-cell RNA-sequencing (scRNA-seq) technology increase the understanding of immune differentiation and activation processes, as well as the heterogeneity of immune cell types. Although the number of available immune-related scRNA-seq datasets increases rapidly, their large size and various formats render them hard for the wider immunology community to use, and read-level data are practically inaccessible to the non-computational immunologist. To facilitate dataset reuse, we created the JingleBells repository for immune-related scRNA-seq datasets ready for analysis and visualization of reads at the single-cell level (http://jinglebells.bgu.ac.il/). To this end, we collected the raw data of publicly available immune-related scRNA-seq datasets, aligned the reads to the relevant genome, and saved aligned reads in a uniform format, annotated for cell of origin. We also added scripts and a step-by-step tutorial for visualizing each dataset at the single-cell level, through the commonly used Integrated Genome Viewer (www.broadinstitute.org/igv/). The uniform scRNA-seq format used in JingleBells can facilitate reuse of scRNA-seq data by computational biologists. It also enables immunologists who are interested in a specific gene to visualize the reads aligned to this gene to estimate cell-specific preferences for splicing, mutation load, or alleles. Thus JingleBells is a resource that will extend the usefulness of scRNA-seq datasets outside the programming aficionado realm. Copyright © 2017 by The American Association of Immunologists, Inc.

  14. PredictSNP2: A Unified Platform for Accurately Evaluating SNP Effects by Exploiting the Different Characteristics of Variants in Distinct Genomic Regions

    PubMed Central

    Brezovský, Jan

    2016-01-01

    An important message taken from human genome sequencing projects is that the human population exhibits approximately 99.9% genetic similarity. Variations in the remaining parts of the genome determine our identity, trace our history and reveal our heritage. The precise delineation of phenotypically causal variants plays a key role in providing accurate personalized diagnosis, prognosis, and treatment of inherited diseases. Several computational methods for achieving such delineation have been reported recently. However, their ability to pinpoint potentially deleterious variants is limited by the fact that their mechanisms of prediction do not account for the existence of different categories of variants. Consequently, their output is biased towards the variant categories that are most strongly represented in the variant databases. Moreover, most such methods provide numeric scores but not binary predictions of the deleteriousness of variants or confidence scores that would be more easily understood by users. We have constructed three datasets covering different types of disease-related variants, which were divided across five categories: (i) regulatory, (ii) splicing, (iii) missense, (iv) synonymous, and (v) nonsense variants. These datasets were used to develop category-optimal decision thresholds and to evaluate six tools for variant prioritization: CADD, DANN, FATHMM, FitCons, FunSeq2 and GWAVA. This evaluation revealed some important advantages of the category-based approach. The results obtained with the five best-performing tools were then combined into a consensus score. Additional comparative analyses showed that in the case of missense variations, protein-based predictors perform better than DNA sequence-based predictors. A user-friendly web interface was developed that provides easy access to the five tools’ predictions, and their consensus scores, in a user-understandable format tailored to the specific features of different categories of variations. 
To enable comprehensive evaluation of variants, the predictions are complemented with annotations from eight databases. The web server is freely available to the community at http://loschmidt.chemi.muni.cz/predictsnp2. PMID:27224906

  15. PredictSNP2: A Unified Platform for Accurately Evaluating SNP Effects by Exploiting the Different Characteristics of Variants in Distinct Genomic Regions.

    PubMed

    Bendl, Jaroslav; Musil, Miloš; Štourač, Jan; Zendulka, Jaroslav; Damborský, Jiří; Brezovský, Jan

    2016-05-01

    An important message taken from human genome sequencing projects is that the human population exhibits approximately 99.9% genetic similarity. Variations in the remaining parts of the genome determine our identity, trace our history and reveal our heritage. The precise delineation of phenotypically causal variants plays a key role in providing accurate personalized diagnosis, prognosis, and treatment of inherited diseases. Several computational methods for achieving such delineation have been reported recently. However, their ability to pinpoint potentially deleterious variants is limited by the fact that their mechanisms of prediction do not account for the existence of different categories of variants. Consequently, their output is biased towards the variant categories that are most strongly represented in the variant databases. Moreover, most such methods provide numeric scores but not binary predictions of the deleteriousness of variants or confidence scores that would be more easily understood by users. We have constructed three datasets covering different types of disease-related variants, which were divided across five categories: (i) regulatory, (ii) splicing, (iii) missense, (iv) synonymous, and (v) nonsense variants. These datasets were used to develop category-optimal decision thresholds and to evaluate six tools for variant prioritization: CADD, DANN, FATHMM, FitCons, FunSeq2 and GWAVA. This evaluation revealed some important advantages of the category-based approach. The results obtained with the five best-performing tools were then combined into a consensus score. Additional comparative analyses showed that in the case of missense variations, protein-based predictors perform better than DNA sequence-based predictors. A user-friendly web interface was developed that provides easy access to the five tools' predictions, and their consensus scores, in a user-understandable format tailored to the specific features of different categories of variations. 
To enable comprehensive evaluation of variants, the predictions are complemented with annotations from eight databases. The web server is freely available to the community at http://loschmidt.chemi.muni.cz/predictsnp2.

  16. De novo sequencing and characterization of floral transcriptome in two species of buckwheat (Fagopyrum)

    PubMed Central

    2011-01-01

    Background Transcriptome sequencing data has become an integral component of modern genetics, genomics and evolutionary biology. However, despite advances in the technologies of DNA sequencing, such data are lacking for many groups of living organisms, in particular, many plant taxa. We present here the results of transcriptome sequencing for two closely related plant species. These species, Fagopyrum esculentum and F. tataricum, belong to the order Caryophyllales - a large group of flowering plants with uncertain evolutionary relationships. F. esculentum (common buckwheat) is also an important food crop. Despite these practical and evolutionary considerations, Fagopyrum species have not been the subject of large-scale sequencing projects. Results Normalized cDNA corresponding to genes expressed in flowers and inflorescences of F. esculentum and F. tataricum was sequenced using the 454 pyrosequencing technology. This resulted in 267 thousand (F. esculentum) and 229 thousand (F. tataricum) reads with an average length of 341-349 nucleotides. De novo assembly of the reads produced about 25 thousand contigs for each species, with 7.5-8.2× coverage. Comparative analysis of the two transcriptomes demonstrated their overall similarity but also revealed genes that are presumably differentially expressed. Among them are retrotransposon genes and genes involved in sugar biosynthesis and metabolism. Thirteen single-copy genes were used for phylogenetic analysis; the resulting trees are largely consistent with those inferred from multigenic plastid datasets. The sister relationship of the Caryophyllales and asterids has now gained high support from nuclear gene sequences. Conclusions 454 transcriptome sequencing and de novo assembly were performed for two congeneric flowering plant species, F. esculentum and F. tataricum. As a result, a large set of cDNA sequences that represent orthologs of known plant genes as well as potential new genes was generated. PMID:21232141

  17. Transcriptome sequence analysis of an ornamental plant, Ananas comosus var. bracteatus, revealed the potential unigenes involved in terpenoid and phenylpropanoid biosynthesis.

    PubMed

    Ma, Jun; Kanakala, S; He, Yehua; Zhang, Junli; Zhong, Xiaolan

    2015-01-01

    Ananas comosus var. bracteatus (Red Pineapple) is an important ornamental plant for its colorful leaves and decorative red fruits. Because of its complex genome, it is difficult to understand the molecular mechanisms involved in the growth and development. Thus high-throughput transcriptome sequencing of Ananas comosus var. bracteatus is necessary to generate large quantities of transcript sequences for the purpose of gene discovery and functional genomic studies. The Ananas comosus var. bracteatus transcriptome was sequenced by the Illumina paired-end sequencing technology. We obtained a total of 23.5 million high quality sequencing reads, 1,555,808 contigs and 41,052 unigenes. Of the 41,052 unigenes of Ananas comosus var. bracteatus, 23,275 were annotated in the NCBI non-redundant protein database and 23,134 were annotated in the Swiss-Prot database. Out of these, 17,748 and 8,505 unigenes were assigned to gene ontology categories and clusters of orthologous groups, respectively. Functional annotation against the Kyoto Encyclopedia of Genes and Genomes Pathway database identified 5,825 unigenes which were mapped to 117 pathways. The assembly predicted many unigenes that were previously unknown. The annotated unigenes were compared against pineapple, rice, maize, Arabidopsis, and sorghum. Unigenes that did not match any of those five sequence datasets are considered to be Ananas comosus var. bracteatus unique. We predicted unigenes encoding enzymes involved in terpenoid and phenylpropanoid biosynthesis. The sequence data provide the most comprehensive transcriptomic resource currently available for Ananas comosus var. bracteatus. To our knowledge, this is the first report on the de novo transcriptome sequencing of Ananas comosus var. bracteatus. The unigenes obtained in this study may help improve future gene expression, genetic and genomic studies in Ananas comosus var. bracteatus.

  18. Transcriptome Sequence Analysis of an Ornamental Plant, Ananas comosus var. bracteatus, Revealed the Potential Unigenes Involved in Terpenoid and Phenylpropanoid Biosynthesis

    PubMed Central

    Ma, Jun; Kanakala, S.; He, Yehua; Zhang, Junli; Zhong, Xiaolan

    2015-01-01

    Background Ananas comosus var. bracteatus (Red Pineapple) is an important ornamental plant for its colorful leaves and decorative red fruits. Because of its complex genome, it is difficult to understand the molecular mechanisms involved in the growth and development. Thus high-throughput transcriptome sequencing of Ananas comosus var. bracteatus is necessary to generate large quantities of transcript sequences for the purpose of gene discovery and functional genomic studies. Results The Ananas comosus var. bracteatus transcriptome was sequenced by the Illumina paired-end sequencing technology. We obtained a total of 23.5 million high quality sequencing reads, 1,555,808 contigs and 41,052 unigenes. Of the 41,052 unigenes of Ananas comosus var. bracteatus, 23,275 were annotated in the NCBI non-redundant protein database and 23,134 were annotated in the Swiss-Prot database. Out of these, 17,748 and 8,505 unigenes were assigned to gene ontology categories and clusters of orthologous groups, respectively. Functional annotation against the Kyoto Encyclopedia of Genes and Genomes Pathway database identified 5,825 unigenes which were mapped to 117 pathways. The assembly predicted many unigenes that were previously unknown. The annotated unigenes were compared against pineapple, rice, maize, Arabidopsis, and sorghum. Unigenes that did not match any of those five sequence datasets are considered to be Ananas comosus var. bracteatus unique. We predicted unigenes encoding enzymes involved in terpenoid and phenylpropanoid biosynthesis. Conclusion The sequence data provide the most comprehensive transcriptomic resource currently available for Ananas comosus var. bracteatus. To our knowledge, this is the first report on the de novo transcriptome sequencing of Ananas comosus var. bracteatus. The unigenes obtained in this study may help improve future gene expression, genetic and genomic studies in Ananas comosus var. bracteatus. PMID:25769053

  19. Structural diversity of domain superfamilies in the CATH database.

    PubMed

    Reeves, Gabrielle A; Dallman, Timothy J; Redfern, Oliver C; Akpor, Adrian; Orengo, Christine A

    2006-07-14

    The CATH database of domain structures has been used to explore the structural variation of homologous domains in 294 well populated domain structure superfamilies, each containing at least three sequence diverse relatives. Our analyses confirm some previously detected trends relating sequence divergence to structural variation but for a much larger dataset and in some superfamilies the new data reveal exceptional structural variation. Use of a new algorithm (2DSEC) to analyse variability in secondary structure compositions across a superfamily sheds new light on how structures evolve. 2DSEC detects inserted secondary structures that embellish the core of conserved secondary structures found throughout the superfamily. Analysis showed that for 56% of highly populated superfamilies (>9 sequence diverse relatives), there are twofold or more increases in the numbers of secondary structures in some relatives. In some families fivefold increases occur, sometimes modifying the fold of the domain. Manual inspection of secondary structure insertions or embellishments in 48 particularly variable superfamilies revealed that although these insertions were usually discontiguous in the sequence they were often co-located in 3D resulting in a larger structural motif that often modified the geometry of the active site or the surface conformation promoting diverse domain partnerships and protein interactions. These observations, supported by automatic analysis of all well populated CATH families, suggest that accretion of small secondary structure insertions may provide a simple mechanism for evolving new functions in diverse relatives. Some layered domain architectures (e.g. mainly-beta and alpha-beta sandwiches) that recur highly in the genomes more frequently exploit these types of embellishments to modify function. In these architectures, aggregation occurs most often at the edges, top or bottom of the beta-sheets. 
Information on structural variability across domain superfamilies has been made available through the CATH Dictionary of Homologous Structures (DHS).

  20. Identifying the pattern of molecular evolution for Zaire ebolavirus in the 2014 outbreak in West Africa.

    PubMed

    Liu, Si-Qing; Deng, Cheng-Lin; Yuan, Zhi-Ming; Rayner, Simon; Zhang, Bo

    2015-06-01

    The current Ebola virus disease (EVD) epidemic has killed more people than all previous Ebola outbreaks combined and, even as efforts appear to be bringing the outbreak under control, the threat of reemergence remains. The availability of new whole-genome sequences from the 2014 West Africa outbreak, together with those from earlier outbreaks, provides an opportunity to investigate the genetic characteristics, the epidemiological dynamics and the evolutionary history of Zaire ebolavirus (ZEBOV). To investigate the evolutionary properties of ZEBOV in this outbreak, we examined amino acid mutations, positive selection, and evolutionary rates on the basis of 123 ZEBOV genome sequences. The estimated phylogenetic relationships within ZEBOV revealed that viral sequences from the same period or location formed a distinct cluster. The West Africa viruses probably derived from Middle Africa, consistent with results from previous studies. Analysis of the seven protein regions of ZEBOV revealed evidence of positive selection acting on the GP and L genes. Interestingly, all putatively positive-selected sites identified in the GP are located within the mucin-like domain of the solved structure of the protein, suggesting a possible role in the immune evasion properties of ZEBOV. Compared with earlier outbreaks, the evolutionary rate of the GP gene was estimated to have accelerated significantly in the 2014 outbreak, suggesting that more ZEBOV variants are generated for human-to-human transmission during this sweeping epidemic. However, a more balanced sample set and next generation sequencing datasets would help achieve a clearer understanding at the genetic level of how the virus is evolving and adapting to new conditions. Copyright © 2015 Elsevier B.V. All rights reserved.

  1. Breaking the computational barriers of pairwise genome comparison.

    PubMed

    Torreno, Oscar; Trelles, Oswaldo

    2015-08-11

    Conventional pairwise sequence comparison software algorithms are being used to process much larger datasets than they were originally designed for. This can result in processing bottlenecks that limit software capabilities or prevent full use of the available hardware resources. Overcoming the barriers that limit the efficient computational analysis of large biological sequence datasets by retrofitting existing algorithms or by creating new applications represents a major challenge for the bioinformatics community. We have developed C libraries for pairwise sequence comparison within diverse architectures, ranging from commodity systems to high performance and cloud computing environments. Exhaustive tests were performed using different datasets of closely- and distantly-related sequences that span from small viral genomes to large mammalian chromosomes. The tests demonstrated that our solution is capable of generating high quality results with a linear-time response and controlled memory consumption, being comparable or faster than the current state-of-the-art methods. We have addressed the problem of pairwise and all-versus-all comparison of large sequences in general, greatly increasing the limits on input data size. The approach described here is based on a modular out-of-core strategy that uses secondary storage to avoid reaching memory limits during the identification of High-scoring Segment Pairs (HSPs) between the sequences under comparison. Software engineering concepts were applied to avoid intermediate result re-calculation, to minimise the performance impact of input/output (I/O) operations and to modularise the process, thus enhancing application flexibility and extendibility. Our computationally-efficient approach allows tasks such as the massive comparison of complete genomes, evolutionary event detection, the identification of conserved synteny blocks and inter-genome distance calculations to be performed more effectively.

  2. Combining high-throughput sequencing and targeted loci data to infer the phylogeny of the "Adenocalymma-Neojobertia" clade (Bignonieae, Bignoniaceae).

    PubMed

    Fonseca, Luiz Henrique M; Lohmann, Lúcia G

    2018-06-01

    Combining high-throughput sequencing data with amplicon sequences allows the reconstruction of robust phylogenies based on comprehensive sampling of characters and taxa. Here, we combine Next Generation Sequencing (NGS) and Sanger sequencing data to infer the phylogeny of the "Adenocalymma-Neojobertia" clade (Bignonieae, Bignoniaceae), a diverse lineage of Neotropical plants, using Maximum Likelihood and Bayesian approaches. We used NGS to obtain complete or nearly-complete plastomes of members of this clade, leading to a final dataset with 54 individuals, representing 44 members of ingroup and 10 outgroups. In addition, we obtained Sanger sequences of two plastid markers (ndhF and rpl32-trnL) for 44 individuals (43 ingroup and 1 outgroup) and the nuclear PepC for 64 individuals (63 ingroup and 1 outgroup). Our final dataset includes 87 individuals of members of the "Adenocalymma-Neojobertia" clade, representing 66 species (ca. 90% of the diversity), plus 11 outgroups. Plastid and nuclear datasets recovered congruent topologies and were combined. The combined analysis recovered a monophyletic "Adenocalymma-Neojobertia" clade and a paraphyletic Adenocalymma that also contained a monophyletic Neojobertia plus Pleonotoma albiflora. Relationships are strongly supported in all analyses, with most lineages within the "Adenocalymma-Neojobertia" clade receiving maximum posterior probabilities. Ancestral character state reconstructions using Bayesian approaches identified six morphological synapomorphies of clades namely, prophyll type, petiole and petiolule articulation, tendril ramification, inflorescence ramification, calyx shape, and fruit wings. Other characters such as habit, calyx cupular trichomes, corolla color, and corolla shape evolved multiple times. These characters are putatively related with the clade diversification and can be further explored in diversification studies. Copyright © 2018 Elsevier Inc. All rights reserved.

  3. The Tara Oceans voyage reveals global diversity and distribution patterns of marine planktonic ciliates

    PubMed Central

    Gimmler, Anna; Korn, Ralf; de Vargas, Colomban; Audic, Stéphane; Stoeck, Thorsten

    2016-01-01

    Illumina reads of the SSU-rDNA-V9 region obtained from the circumglobal Tara Oceans expedition allow the investigation of protistan plankton diversity patterns on a global scale. We analyzed 6,137,350 V9-amplicons from ocean surface waters and the deep chlorophyll maximum, which were taxonomically assigned to the phylum Ciliophora. For open ocean samples global planktonic ciliate diversity is relatively low (ca. 1,300 observed and predicted ciliate OTUs). We found that 17% of all detected ciliate OTUs occurred in all oceanic regions under study. On average, local ciliate OTU richness represented 27% of the global ciliate OTU richness, indicating that a large proportion of ciliates is widely distributed. Yet, more than half of these OTUs shared <90% sequence similarity with reference sequences of described ciliates. While alpha-diversity measures (richness and exp(Shannon H)) are hardly affected by contemporary environmental conditions, species (OTU) turnover and community similarity (β-diversity) across taxonomic groups showed strong correlation to environmental parameters. Logistic regression models predicted significant correlations between the occurrence of specific ciliate genera and individual nutrients, the oceanic carbonate system and temperature. Planktonic ciliates displayed distinct vertical distributions relative to chlorophyll a. In contrast, the Tara Oceans dataset did not reveal any evidence that latitude is structuring ciliate communities. PMID:27633177
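
    The two alpha-diversity measures named in this abstract, richness and exp(Shannon H), can be illustrated with a minimal sketch. The OTU names and read counts below are hypothetical and are not taken from the Tara Oceans data; the function name is likewise an assumption made for illustration.

```python
import math


def alpha_diversity(otu_counts):
    """Return (richness, exp(Shannon H)) for a mapping of OTU name -> read count.

    Richness is the number of OTUs with at least one read. exp(H) is the
    "effective number" of equally abundant OTUs implied by the Shannon index.
    """
    counts = [c for c in otu_counts.values() if c > 0]
    total = sum(counts)
    if total == 0:
        return 0, 0.0
    shannon_h = -sum((c / total) * math.log(c / total) for c in counts)
    return len(counts), math.exp(shannon_h)


# Hypothetical sample: two equally abundant OTUs and one absent OTU.
sample = {"OTU_1": 50, "OTU_2": 50, "OTU_3": 0}
richness, effective_otus = alpha_diversity(sample)
print(richness, round(effective_otus, 3))
```

    For a perfectly even community exp(H) equals the richness; the more uneven the read counts, the further exp(H) falls below it.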

  4. The molecular epidemiology of HIV-1 in the Comunidad Valenciana (Spain): analysis of transmission clusters.

    PubMed

    Patiño-Galindo, Juan Ángel; Torres-Puente, Manoli; Bracho, María Alma; Alastrué, Ignacio; Juan, Amparo; Navarro, David; Galindo, María José; Ocete, Dolores; Ortega, Enrique; Gimeno, Concepción; Belda, Josefina; Domínguez, Victoria; Moreno, Rosario; González-Candelas, Fernando

    2017-09-14

    HIV infections are still a very serious concern for public health worldwide. We have applied molecular evolution methods to study the HIV-1 epidemics in the Comunidad Valenciana (CV, Spain) from a public health surveillance perspective. For this, we analysed 1804 HIV-1 sequences comprising protease and reverse transcriptase (PR/RT) coding regions, sampled between 2004 and 2014. These sequences were subtyped and subjected to phylogenetic analyses in order to detect transmission clusters. In addition, univariate and multinomial comparisons were performed to detect epidemiological differences between HIV-1 subtypes, and risk groups. The HIV epidemic in the CV is dominated by subtype B infections among local men who have sex with men (MSM). 270 transmission clusters were identified (>57% of the dataset), 12 of which included ≥10 patients; 11 of subtype B (9 affecting MSMs) and one (n = 21) of CRF14, affecting predominately intravenous drug users (IDUs). Dated phylogenies revealed these large clusters to have originated from the mid-80s to the early 2000s. Subtype B is more likely to form transmission clusters than non-B variants, and MSMs are more likely to cluster than other risk groups. Multinomial analyses revealed an association between non-B variants, which are not established in the local population yet, and different foreign groups.

  5. Wide-Open: Accelerating public data release by automating detection of overdue datasets

    PubMed Central

    Poon, Hoifung; Howe, Bill

    2017-01-01

    Open data is a vital pillar of open science and a key enabler for reproducibility, data reuse, and novel discoveries. Enforcement of open-data policies, however, largely relies on manual efforts, which invariably lag behind the increasingly automated generation of biological data. To address this problem, we developed a general approach to automatically identify datasets overdue for public release by applying text mining to identify dataset references in published articles and parse query results from repositories to determine if the datasets remain private. We demonstrate the effectiveness of this approach on 2 popular National Center for Biotechnology Information (NCBI) repositories: Gene Expression Omnibus (GEO) and Sequence Read Archive (SRA). Our Wide-Open system identified a large number of overdue datasets, which spurred administrators to respond directly by releasing 400 datasets in one week. PMID:28594819
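
    The text-mining step described in this abstract, finding dataset references in article text, can be sketched as follows. This is an illustration only and not the Wide-Open implementation: it assumes GEO series accessions of the form "GSE" plus digits and SRA study accessions of the form "SRP" plus digits, and it omits the repository query that decides whether a dataset is still private. The function name and the sample sentence are hypothetical.

```python
import re

# Assumed accession shapes: GEO series (GSE + digits), SRA study (SRP + digits).
ACCESSION_PATTERN = re.compile(r"\b(GSE\d+|SRP\d+)\b")


def find_accessions(text):
    """Return the unique accessions mentioned in text, in order of first mention."""
    seen = []
    for match in ACCESSION_PATTERN.finditer(text):
        accession = match.group(1)
        if accession not in seen:
            seen.append(accession)
    return seen


# Hypothetical article sentence.
sentence = "Data are available under GSE12345 and SRP067890 (see also GSE12345)."
print(find_accessions(sentence))
```

    Each accession found this way would then be looked up in the corresponding repository; a reference that does not resolve to a public record is a candidate overdue dataset.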

  6. Wide-Open: Accelerating public data release by automating detection of overdue datasets.

    PubMed

    Grechkin, Maxim; Poon, Hoifung; Howe, Bill

    2017-06-01

    Open data is a vital pillar of open science and a key enabler for reproducibility, data reuse, and novel discoveries. Enforcement of open-data policies, however, largely relies on manual efforts, which invariably lag behind the increasingly automated generation of biological data. To address this problem, we developed a general approach to automatically identify datasets overdue for public release by applying text mining to identify dataset references in published articles and parse query results from repositories to determine if the datasets remain private. We demonstrate the effectiveness of this approach on 2 popular National Center for Biotechnology Information (NCBI) repositories: Gene Expression Omnibus (GEO) and Sequence Read Archive (SRA). Our Wide-Open system identified a large number of overdue datasets, which spurred administrators to respond directly by releasing 400 datasets in one week.

  7. BeerDeCoded: the open beer metagenome project.

    PubMed

    Sobel, Jonathan; Henry, Luc; Rotman, Nicolas; Rando, Gianpaolo

    2017-01-01

    Next generation sequencing has radically changed research in the life sciences, in both academic and corporate laboratories. The potential impact is tremendous, yet a majority of citizens have little or no understanding of the technological and ethical aspects of this widespread adoption. We designed BeerDeCoded as a pretext to discuss the societal issues related to genomic and metagenomic data with fellow citizens, while advancing scientific knowledge of the most popular beverage of all. In the spirit of citizen science, sample collection and DNA extraction were carried out with the participation of non-scientists in the community laboratory of Hackuarium, a not-for-profit organisation that supports unconventional research and promotes the public understanding of science. The dataset presented herein contains the targeted metagenomic profile of 39 bottled beers from 5 countries, based on internal transcribed spacer (ITS) sequencing of fungal species. A preliminary analysis reveals the presence of a large diversity of wild yeast species in commercial brews. With this project, we demonstrate that coupling simple laboratory procedures that can be carried out in a non-professional environment with state-of-the-art sequencing technologies and targeted metagenomic analyses, can lead to the detection and identification of the microbial content in bottled beer.

  8. Intrinsic flexibility of B-DNA: the experimental TRX scale.

    PubMed

    Heddi, Brahim; Oguey, Christophe; Lavelle, Christophe; Foloppe, Nicolas; Hartmann, Brigitte

    2010-01-01

    B-DNA flexibility, crucial for DNA-protein recognition, is sequence dependent. Free DNA in solution would in principle be the best reference state to uncover the relation between base sequences and their intrinsic flexibility; however, this has long been hampered by a lack of suitable experimental data. We investigated this relationship by compiling and analyzing a large dataset of NMR ³¹P chemical shifts in solution. These measurements reflect the BI ↔ BII equilibrium in DNA, intimately correlated to helicoidal descriptors of the curvature, winding and groove dimensions. Comparing the ten complementary DNA dinucleotide steps indicates that some steps are much more flexible than others. This malleability is primarily controlled at the dinucleotide level, modulated by the tetranucleotide environment. Our analyses provide an experimental scale called TRX that quantifies the intrinsic flexibility of the ten dinucleotide steps in terms of Twist, Roll, and X-disp (base pair displacement). Applying the TRX scale to DNA sequences optimized for nucleosome formation reveals a 10 base-pair periodic alternation of stiff and flexible regions. Thus, DNA flexibility captured by the TRX scale is relevant to nucleosome formation, suggesting that this scale may be of general interest to better understand protein-DNA recognition.

  9. Discovery of sex-related genes through high-throughput transcriptome sequencing from the salmon louse Caligus rogercresseyi.

    PubMed

    Farlora, Rodolfo; Araya-Garay, José; Gallardo-Escárate, Cristian

    2014-06-01

    Understanding the molecular underpinnings involved in the reproduction of the salmon louse is critical for designing novel strategies of pest management for this ectoparasite. However, genomic information on sex-related genes is still limited. In the present work, sex-specific gene transcription was revealed in the salmon louse Caligus rogercresseyi using high-throughput Illumina sequencing. A total of 30,191,914 and 32,292,250 high quality reads were generated for females and males, and these were de novo assembled into 32,173 and 38,177 contigs, respectively. Gene ontology analysis showed a pattern of higher expression in the female as compared to the male transcriptome. Based on our sequence analysis and known sex-related proteins, several genes putatively involved in sex differentiation, including Dmrt3, FOXL2, VASA, and FEM1, and other potentially significant candidate genes in C. rogercresseyi, were identified for the first time. In addition, the occurrence of SNPs in several differentially expressed contigs annotating for sex-related genes was found. This transcriptome dataset provides a useful resource for future functional analyses, opening new opportunities for sea lice pest control. Copyright © 2014 Elsevier B.V. All rights reserved.

  10. BeerDeCoded: the open beer metagenome project

    PubMed Central

    Sobel, Jonathan; Henry, Luc; Rotman, Nicolas; Rando, Gianpaolo

    2017-01-01

    Next generation sequencing has radically changed research in the life sciences, in both academic and corporate laboratories. The potential impact is tremendous, yet a majority of citizens have little or no understanding of the technological and ethical aspects of this widespread adoption. We designed BeerDeCoded as a pretext to discuss the societal issues related to genomic and metagenomic data with fellow citizens, while advancing scientific knowledge of the most popular beverage of all. In the spirit of citizen science, sample collection and DNA extraction were carried out with the participation of non-scientists in the community laboratory of Hackuarium, a not-for-profit organisation that supports unconventional research and promotes the public understanding of science. The dataset presented herein contains the targeted metagenomic profile of 39 bottled beers from 5 countries, based on internal transcribed spacer (ITS) sequencing of fungal species. A preliminary analysis reveals the presence of a large diversity of wild yeast species in commercial brews. With this project, we demonstrate that coupling simple laboratory procedures that can be carried out in a non-professional environment with state-of-the-art sequencing technologies and targeted metagenomic analyses, can lead to the detection and identification of the microbial content in bottled beer. PMID:29123645

  11. Transcription Factor Map Alignment of Promoter Regions

    PubMed Central

    Blanco, Enrique; Messeguer, Xavier; Smith, Temple F; Guigó, Roderic

    2006-01-01

    We address the problem of comparing and characterizing the promoter regions of genes with similar expression patterns. This remains a challenging problem in sequence analysis, because often the promoter regions of co-expressed genes do not show discernible sequence conservation. In our approach, thus, we have not directly compared the nucleotide sequence of promoters. Instead, we have obtained predictions of transcription factor binding sites, annotated the predicted sites with the labels of the corresponding binding factors, and aligned the resulting sequences of labels—to which we refer here as transcription factor maps (TF-maps). To obtain the global pairwise alignment of two TF-maps, we have adapted an algorithm initially developed to align restriction enzyme maps. We have optimized the parameters of the algorithm in a small, but well-curated, collection of human–mouse orthologous gene pairs. Results in this dataset, as well as in an independent much larger dataset from the CISRED database, indicate that TF-map alignments are able to uncover conserved regulatory elements, which cannot be detected by the typical sequence alignments. PMID:16733547
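
    The idea of aligning sequences of binding-factor labels rather than nucleotides can be sketched with a plain Needleman-Wunsch global alignment. This is a stand-in technique chosen for illustration: the algorithm in the abstract is adapted from restriction-map alignment and is not reproduced here. The scoring values and the factor labels in the example are assumptions.

```python
def align_labels(a, b, match=2, mismatch=-1, gap=-1):
    """Globally align two lists of factor labels.

    Returns (score, shared), where shared lists the labels placed in matching
    columns of one optimal alignment, in promoter order.
    """
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)

    # Trace back from the bottom-right corner to recover the matched labels.
    shared = []
    i, j = n, m
    while i > 0 and j > 0:
        sub = match if a[i - 1] == b[j - 1] else mismatch
        if score[i][j] == score[i - 1][j - 1] + sub:
            if a[i - 1] == b[j - 1]:
                shared.append(a[i - 1])
            i, j = i - 1, j - 1
        elif score[i][j] == score[i - 1][j] + gap:
            i -= 1
        else:
            j -= 1
    shared.reverse()
    return score[n][m], shared


# Hypothetical predicted-site labels for two orthologous promoters.
human = ["SP1", "TBP", "CEBP", "NFKB"]
mouse = ["SP1", "CEBP", "NFKB"]
print(align_labels(human, mouse))
```

    Two promoters with no detectable nucleotide conservation can still share an ordered run of labels, which is the kind of signal the abstract reports that typical sequence alignments miss.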

  12. Sequence-similar, structure-dissimilar protein pairs in the PDB.

    PubMed

    Kosloff, Mickey; Kolodny, Rachel

    2008-05-01

    It is often assumed that in the Protein Data Bank (PDB), two proteins with similar sequences will also have similar structures. Accordingly, it has proved useful to develop subsets of the PDB from which "redundant" structures have been removed, based on a sequence-based criterion for similarity. Similarly, when predicting protein structure using homology modeling, if a template structure for modeling a target sequence is selected by sequence alone, this implicitly assumes that all sequence-similar templates are equivalent. Here, we show that this assumption is often not correct and that standard approaches to create subsets of the PDB can lead to the loss of structurally and functionally important information. We have carried out sequence-based structural superpositions and geometry-based structural alignments of a large number of protein pairs to determine the extent to which sequence similarity ensures structural similarity. We find many examples where two proteins that are similar in sequence have structures that differ significantly from one another. The source of the structural differences usually has a functional basis. The number of such protein pairs that are identified and the magnitude of the dissimilarity depend on the approach that is used to calculate the differences; in particular sequence-based structure superpositioning will identify a larger number of structurally dissimilar pairs than geometry-based structural alignments. When two sequences can be aligned in a statistically meaningful way, sequence-based structural superpositioning provides a meaningful measure of structural differences. This approach and geometry-based structure alignments reveal somewhat different information and one or the other might be preferable in a given application. Our results suggest that in some cases, notably homology modeling, the common use of nonredundant datasets, culled from the PDB based on sequence, may mask important structural and functional information. We have established a database of sequence-similar, structurally dissimilar protein pairs that will help address this problem (http://luna.bioc.columbia.edu/rachel/seqsimstrdiff.htm).

  13. Counting Patterns in Degenerated Sequences

    NASA Astrophysics Data System (ADS)

    Nuel, Grégory

    Biological sequences like DNA or proteins are always obtained through a sequencing process which might produce some uncertainty. As a result, such sequences are usually written in a degenerated alphabet where some symbols may correspond to several possible letters (e.g., the IUPAC DNA alphabet). When counting patterns in such degenerated sequences, the question that naturally arises is how to deal with degenerated positions. Since most (usually 99%) of the positions are not degenerated, it is considered harmless to discard the degenerated positions in order to get an observation, but the exact consequences of such a practice are unclear. In this paper, we introduce a rigorous method to take into account the uncertainty of sequencing for biological sequences (DNA, proteins). We first introduce a Forward-Backward approach to compute the marginal distribution of the constrained sequence and use it both to perform an Expectation-Maximization estimation of parameters and to derive a heterogeneous Markov distribution for the constrained sequence. This distribution is then used along with known DFA-based pattern approaches to obtain the exact distribution of the pattern count under the constraints. As an illustration, we consider an EST dataset from the EMBL database. Despite the fact that only 1% of the positions in this dataset are degenerated, we show that not taking these positions into account might lead to erroneous observations, further demonstrating the interest of our approach.
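
    To make the effect of degenerated positions concrete, the following minimal sketch compares two ways of counting a pattern in an IUPAC-coded DNA sequence. It is not the Forward-Backward or DFA-based method of the abstract: it assumes, purely for illustration, that each degenerated symbol is an independent and uniform draw from its candidate bases. The function names `expected_count` and `naive_count` are hypothetical.

```python
from math import prod

# IUPAC nucleotide codes mapped to the bases each symbol may stand for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def expected_count(sequence, pattern):
    """Expected number of (overlapping) occurrences of a plain A/C/G/T pattern.

    Assumption (not from the paper): every degenerated symbol is an
    independent, uniform draw from the bases it can represent.
    """
    total = 0.0
    for start in range(len(sequence) - len(pattern) + 1):
        window = sequence[start:start + len(pattern)]
        # Probability that this window spells the pattern exactly.
        total += prod(
            1.0 / len(IUPAC[symbol]) if base in IUPAC[symbol] else 0.0
            for symbol, base in zip(window, pattern)
        )
    return total


def naive_count(sequence, pattern):
    """Count only windows that match the pattern letter for letter,
    which silently discards every window touching a degenerated symbol."""
    width = len(pattern)
    return sum(
        sequence[start:start + width] == pattern
        for start in range(len(sequence) - width + 1)
    )


seq = "ACRTACGT"  # R stands for A or G
print(naive_count(seq, "ACG"))     # 1
print(expected_count(seq, "ACG"))  # 1.5
```

    Even in this toy case the two counts differ, which is the kind of discrepancy the abstract warns about; the paper's approach replaces the uniform assumption with an estimated Markov model.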

  14. A reference bacterial genome dataset generated on the MinION™ portable single-molecule nanopore sequencer.

    PubMed

    Quick, Joshua; Quinlan, Aaron R; Loman, Nicholas J

    2014-01-01

    The MinION™ is a new, portable single-molecule sequencer developed by Oxford Nanopore Technologies. It measures four inches in length and is powered from the USB 3.0 port of a laptop computer. The MinION™ measures the change in current resulting from DNA strands interacting with a charged protein nanopore. These measurements can then be used to deduce the underlying nucleotide sequence. We present a read dataset from whole-genome shotgun sequencing of the model organism Escherichia coli K-12 substr. MG1655 generated on a MinION™ device during the early-access MinION™ Access Program (MAP). Two sequencing runs of the MinION™ are presented, one generated using R7 chemistry (released in July 2014) and one using R7.3 (released in September 2014). Base-called sequence data are provided to demonstrate the nature of data produced by the MinION™ platform and to encourage the development of customised methods for alignment, consensus and variant calling, de novo assembly and scaffolding. FAST5 files containing event data within the HDF5 container format are provided to assist with the development of improved base-calling methods.

  15. SeqFIRE: a web application for automated extraction of indel regions and conserved blocks from protein multiple sequence alignments.

    PubMed

    Ajawatanawong, Pravech; Atkinson, Gemma C; Watson-Haigh, Nathan S; Mackenzie, Bryony; Baldauf, Sandra L

    2012-07-01

    Analyses of multiple sequence alignments generally focus on well-defined conserved sequence blocks, while the rest of the alignment is largely ignored or discarded. This is especially true in phylogenomics, where large multigene datasets are produced through automated pipelines. However, some of the most powerful phylogenetic markers have been found in the variable length regions of multiple alignments, particularly insertions/deletions (indels) in protein sequences. We have developed Sequence Feature and Indel Region Extractor (SeqFIRE) to enable the automated identification and extraction of indels from protein sequence alignments. The program can also extract conserved blocks and identify fast evolving sites using a combination of conservation and entropy. All major variables can be adjusted by the user, allowing them to identify the sets of variables most suited to a particular analysis or dataset. Thus, all major tasks in preparing an alignment for further analysis are combined in a single flexible and user-friendly program. The output includes a numbered list of indels, alignments in NEXUS format with indels annotated or removed and indel-only matrices. SeqFIRE is a user-friendly web application, freely available online at www.seqfire.org/.
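
    The two ideas named in the abstract, locating indel regions and scoring columns by entropy, can be sketched in a few lines. This is an illustrative sketch only and not SeqFIRE's algorithm: it assumes an alignment given as equal-length strings with `-` as the gap character, treats any column containing a gap as part of an indel region, and uses hypothetical function names.

```python
from collections import Counter
from math import log2


def column_entropy(column):
    """Shannon entropy (in bits) of the non-gap residues in one column."""
    residues = [c for c in column if c != "-"]
    if not residues:
        return 0.0
    n = len(residues)
    return sum(-(k / n) * log2(k / n) for k in Counter(residues).values())


def indel_regions(alignment):
    """Half-open (start, end) column ranges in which some sequence has a gap."""
    regions, start = [], None
    for i, column in enumerate(zip(*alignment)):
        has_gap = "-" in column
        if has_gap and start is None:
            start = i                      # a gapped run begins here
        elif not has_gap and start is not None:
            regions.append((start, i))     # the run ended at the previous column
            start = None
    if start is not None:                  # run reaches the end of the alignment
        regions.append((start, len(alignment[0])))
    return regions


alignment = [
    "MKT--LLA",
    "MKTGALLA",
    "MRT--LIA",
]
print(indel_regions(alignment))  # [(3, 5)]
print([round(column_entropy(col), 3) for col in zip(*alignment)])
```

    Columns with zero entropy and no gaps are candidates for conserved blocks, while high-entropy columns point to fast-evolving sites; real tools add user-adjustable thresholds on top of these raw scores.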

  16. The standard operating procedure of the DOE-JGI Microbial Genome Annotation Pipeline (MGAP v.4)

    DOE PAGES

    Huntemann, Marcel; Ivanova, Natalia N.; Mavromatis, Konstantinos; ...

    2015-10-26

    The DOE-JGI Microbial Genome Annotation Pipeline performs structural and functional annotation of microbial genomes that are further included into the Integrated Microbial Genome comparative analysis system. MGAP is applied to assembled nucleotide sequence datasets that are provided via the IMG submission site. Dataset submission for annotation first requires project and associated metadata description in GOLD. The MGAP sequence data processing consists of feature prediction including identification of protein-coding genes, non-coding RNAs and regulatory RNA features, as well as CRISPR elements. In conclusion, structural annotation is followed by assignment of protein product names and functions.

  17. The standard operating procedure of the DOE-JGI Microbial Genome Annotation Pipeline (MGAP v.4)

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Huntemann, Marcel; Ivanova, Natalia N.; Mavromatis, Konstantinos

    The DOE-JGI Microbial Genome Annotation Pipeline performs structural and functional annotation of microbial genomes that are further included into the Integrated Microbial Genome comparative analysis system. MGAP is applied to assembled nucleotide sequence datasets that are provided via the IMG submission site. Dataset submission for annotation first requires project and associated metadata description in GOLD. The MGAP sequence data processing consists of feature prediction including identification of protein-coding genes, non-coding RNAs and regulatory RNA features, as well as CRISPR elements. In conclusion, structural annotation is followed by assignment of protein product names and functions.

  18. Sma3s: a three-step modular annotator for large sequence datasets.

    PubMed

    Muñoz-Mérida, Antonio; Viguera, Enrique; Claros, M Gonzalo; Trelles, Oswaldo; Pérez-Pulido, Antonio J

    2014-08-01

    Automatic sequence annotation is an essential component of modern 'omics' studies, which aim to extract information from large collections of sequence data. Most existing tools use sequence homology to establish evolutionary relationships and assign putative functions to sequences. However, it can be difficult to define a similarity threshold that achieves sufficient coverage without sacrificing annotation quality. Defining the correct configuration is critical and can be challenging for non-specialist users. Thus, the development of robust automatic annotation techniques that generate high-quality annotations without needing expert knowledge would be very valuable for the research community. We present Sma3s, a tool for automatically annotating very large collections of biological sequences from any kind of gene library or genome. Sma3s is composed of three modules that progressively annotate query sequences using either: (i) very similar homologues, (ii) orthologous sequences or (iii) terms enriched in groups of homologous sequences. We trained the system using several random sets of known sequences, demonstrating average sensitivity and specificity values of ~85%. In conclusion, Sma3s is a versatile tool for high-throughput annotation of a wide variety of sequence datasets that outperforms the accuracy of other well-established annotation algorithms, and it can enrich existing database annotations and uncover previously hidden features. Importantly, Sma3s has already been used in the functional annotation of two published transcriptomes. © The Author 2014. Published by Oxford University Press on behalf of Kazusa DNA Research Institute.

  19. SPRINT: ultrafast protein-protein interaction prediction of the entire human interactome.

    PubMed

    Li, Yiwei; Ilie, Lucian

    2017-11-15

    Proteins usually perform their functions by interacting with other proteins. Predicting which proteins interact is a fundamental problem. Experimental methods are slow, expensive, and have a high rate of error. Many computational methods have been proposed, among which sequence-based ones are very promising. However, so far no such method is able to predict effectively the entire human interactome: they require too much time or memory. We present SPRINT (Scoring PRotein INTeractions), a new sequence-based algorithm and tool for predicting protein-protein interactions. We comprehensively compare SPRINT with state-of-the-art programs on the seven most reliable human PPI datasets and show that it is more accurate while running orders of magnitude faster and using very little memory. SPRINT is the only sequence-based program that can effectively predict the entire human interactome: it requires between 15 and 100 min, depending on the dataset. Our goal is to transform the very challenging problem of predicting the entire human interactome into a routine task. The source code of SPRINT is freely available from https://github.com/lucian-ilie/SPRINT/ and the datasets and predicted PPIs from www.csd.uwo.ca/faculty/ilie/SPRINT/ .

  20. De novo characterization of the Chinese fir (Cunninghamia lanceolata) transcriptome and analysis of candidate genes involved in cellulose and lignin biosynthesis

    PubMed Central

    2012-01-01

    Background: Chinese fir (Cunninghamia lanceolata) is an important timber species that accounts for 20–30% of the total commercial timber production in China. However, the available genomic information of Chinese fir is limited, and this severely encumbers functional genomic analysis and molecular breeding in Chinese fir. Recently, major advances in transcriptome sequencing have provided fast and cost-effective approaches to generate large expression datasets that have proven to be powerful tools to profile the transcriptomes of non-model organisms with undetermined genomes. Results: In this study, the transcriptomes of nine tissues from Chinese fir were analyzed using the Illumina HiSeq™ 2000 sequencing platform. Approximately 40 million paired-end reads were obtained, generating 3.62 gigabase pairs of sequencing data. These reads were assembled into 83,248 unique sequences (i.e. Unigenes) with an average length of 449 bp, amounting to 37.40 Mb. A total of 73,779 Unigenes were supported by more than 5 reads, 42,663 (57.83%) had homologs in the NCBI non-redundant and Swiss-Prot protein databases, corresponding to 27,224 unique protein entries. Of these Unigenes, 16,750 were assigned to Gene Ontology classes, and 14,877 were clustered into orthologous groups. A total of 21,689 (29.40%) were mapped to 119 pathways by BLAST comparison against the Kyoto Encyclopedia of Genes and Genomes (KEGG) database. The majority of the genes encoding the enzymes in the biosynthetic pathways of cellulose and lignin were identified in the Unigene dataset by targeted searches of their annotations. A number of candidate Chinese fir genes in the two metabolic pathways were discovered for the first time. Eighteen genes related to cellulose and lignin biosynthesis were cloned for experimental validation of the transcriptome data. Overall, 49 Unigenes, covering different regions of these selected genes, were found by alignment. Their expression patterns in different tissues were analyzed by qRT-PCR to explore their putative functions. Conclusions: A substantial fraction of transcript sequences was obtained from the deep sequencing of Chinese fir. The assembled Unigene dataset was used to discover candidate genes of cellulose and lignin biosynthesis. This transcriptome dataset will provide a comprehensive sequence resource for molecular genetics research of C. lanceolata. PMID:23171398

  1. Action recognition using multi-scale histograms of oriented gradients based depth motion trail Images

    NASA Astrophysics Data System (ADS)

    Wang, Guanxi; Tie, Yun; Qi, Lin

    2017-07-01

    In this paper, we propose a novel approach based on Depth Maps and compute Multi-Scale Histograms of Oriented Gradient (MSHOG) from sequences of depth maps to recognize actions. Each depth frame in a depth video sequence is projected onto three orthogonal Cartesian planes. Under each projection view, the absolute difference between two consecutive projected maps is accumulated through a depth video sequence to form a Depth Map, which is called a Depth Motion Trail Image (DMTI). The MSHOG is then computed from the Depth Maps for the representation of an action. In addition, we apply L2-Regularized Collaborative Representation (L2-CRC) to classify actions. We evaluate the proposed approach on the MSR Action3D dataset and the MSRGesture3D dataset. Promising experimental results demonstrate the effectiveness of our proposed method.
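
    The accumulation step described above can be sketched for a single projection view as follows. This is a minimal illustration, assuming each projected map is a small 2-D grid stored as a list of rows; the projection onto the three Cartesian planes and the MSHOG descriptor are not shown, and the function name `depth_motion_trail` is hypothetical.

```python
def depth_motion_trail(frames):
    """Accumulate |frame[t+1] - frame[t]| over a sequence of 2-D maps.

    `frames` is a list of equally sized grids (lists of rows) that are
    assumed to be already projected onto one view.
    """
    rows, cols = len(frames[0]), len(frames[0][0])
    trail = [[0] * cols for _ in range(rows)]
    for previous, current in zip(frames, frames[1:]):
        for r in range(rows):
            for c in range(cols):
                trail[r][c] += abs(current[r][c] - previous[r][c])
    return trail


# Three 2x2 frames in which only the top-left pixel changes over time.
frames = [
    [[0, 5], [5, 5]],
    [[3, 5], [5, 5]],
    [[1, 5], [5, 5]],
]
print(depth_motion_trail(frames))  # [[5, 0], [0, 0]]
```

    Static pixels stay at zero while moving regions accumulate energy, so the resulting image summarises where motion occurred across the whole sequence.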

  2. SeqPig: simple and scalable scripting for large sequencing data sets in Hadoop

    PubMed Central

    Schumacher, André; Pireddu, Luca; Niemenmaa, Matti; Kallio, Aleksi; Korpelainen, Eija; Zanetti, Gianluigi; Heljanko, Keijo

    2014-01-01

    Summary: Hadoop MapReduce-based approaches have become increasingly popular due to their scalability in processing large sequencing datasets. However, as these methods typically require in-depth expertise in Hadoop and Java, they are still out of reach of many bioinformaticians. To solve this problem, we have created SeqPig, a library and a collection of tools to manipulate, analyze and query sequencing datasets in a scalable and simple manner. SeqPig scripts use the Hadoop-based distributed scripting engine Apache Pig, which automatically parallelizes and distributes data processing tasks. We demonstrate SeqPig’s scalability over many computing nodes and illustrate its use with example scripts. Availability and Implementation: Available under the open source MIT license at http://sourceforge.net/projects/seqpig/ Contact: andre.schumacher@yahoo.com Supplementary information: Supplementary data are available at Bioinformatics online. PMID:24149054

  3. mtDNA sequence diversity of Hazara ethnic group from Pakistan.

    PubMed

    Rakha, Allah; Fatima; Peng, Min-Sheng; Adan, Atif; Bi, Rui; Yasmin, Memona; Yao, Yong-Gang

    2017-09-01

    The present study was undertaken to investigate mitochondrial DNA (mtDNA) control region sequences of Hazaras from Pakistan, so as to generate an mtDNA reference database for forensic casework in Pakistan and to analyze the phylogenetic relationship of this particular ethnic group with geographically proximal populations. Complete mtDNA control region (nt 16024-576) sequences were generated through Sanger sequencing for 319 Hazara individuals from Quetta, Baluchistan. The population sample set showed a total of 189 distinct haplotypes, belonging mainly to West Eurasian (51.72%), East & Southeast Asian (29.78%) and South Asian (18.50%) haplogroups. Compared with other populations from Pakistan, the Hazara population had a relatively high haplotype diversity (0.9945) and a lower random match probability (0.0085). The dataset has been incorporated into the EMPOP database under accession number EMP00680. The data herein comprise the largest, and likely most thoroughly examined, control region mtDNA dataset from Hazaras of Pakistan. Copyright © 2017 Elsevier B.V. All rights reserved.
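
    For readers unfamiliar with the two forensic statistics quoted above, the sketch below shows how they are commonly computed from a list of observed haplotypes. The abstract does not state which estimators the authors used, so the formulas here (random match probability as the sum of squared haplotype frequencies, and the sample-size-corrected haplotype diversity) are an assumption, and the haplotype labels are made up.

```python
from collections import Counter


def random_match_probability(haplotypes):
    """Sum of squared haplotype frequencies in the sample."""
    n = len(haplotypes)
    return sum((count / n) ** 2 for count in Counter(haplotypes).values())


def haplotype_diversity(haplotypes):
    """Unbiased diversity estimate: n / (n - 1) * (1 - sum of squared frequencies)."""
    n = len(haplotypes)
    return n / (n - 1) * (1.0 - random_match_probability(haplotypes))


# Hypothetical sample of four individuals carrying three haplotypes.
sample = ["H1", "H1", "H2", "H3"]
print(random_match_probability(sample))  # 0.375
print(haplotype_diversity(sample))
```

    A sample with many distinct, evenly distributed haplotypes pushes the diversity towards 1 and the match probability towards 0, which is the pattern reported for the Hazara dataset.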

  4. Next-generation sequencing coupled with a cell-free display technology for high-throughput production of reliable interactome data

    PubMed Central

    Fujimori, Shigeo; Hirai, Naoya; Ohashi, Hiroyuki; Masuoka, Kazuyo; Nishikimi, Akihiko; Fukui, Yoshinori; Washio, Takanori; Oshikubo, Tomohiro; Yamashita, Tatsuhiro; Miyamoto-Sato, Etsuko

    2012-01-01

    Next-generation sequencing (NGS) has been applied to various kinds of omics studies, resulting in many biological and medical discoveries. However, high-throughput protein-protein interactome datasets derived from detection by sequencing are scarce, because protein-protein interaction analysis requires many cell manipulations to examine the interactions. The low reliability of the high-throughput data is also a problem. Here, we describe a cell-free display technology combined with NGS that can improve both the coverage and reliability of interactome datasets. The completely cell-free method gives a high-throughput and a large detection space, testing the interactions without using clones. The quantitative information provided by NGS reduces the number of false positives. The method is suitable for the in vitro detection of proteins that interact not only with the bait protein, but also with DNA, RNA and chemical compounds. Thus, it could become a universal approach for exploring the large space of protein sequences and interactome networks. PMID:23056904

  5. Complete genome sequencing of the luminescent bacterium, Vibrio qinghaiensis sp. Q67 using PacBio technology

    NASA Astrophysics Data System (ADS)

    Gong, Liang; Wu, Yu; Jian, Qijie; Yin, Chunxiao; Li, Taotao; Gupta, Vijai Kumar; Duan, Xuewu; Jiang, Yueming

    2018-01-01

    Vibrio qinghaiensis sp.-Q67 (Vqin-Q67) is a freshwater luminescent bacterium that continuously emits blue-green light (485 nm). The bacterium has been widely used for detecting toxic contaminants. Here, we report the complete genome sequence of Vqin-Q67, obtained using third-generation PacBio sequencing technology. Continuous long reads were attained from three PacBio sequencing runs and reads >500 bp with a quality value of >0.75 were merged together into a single dataset. This resultant highly-contiguous de novo assembly has no genome gaps, and comprises two chromosomes with substantial genetic information, including protein-coding genes, non-coding RNA, transposon and gene islands. Our dataset can be useful as a comparative genome for evolution and speciation studies, as well as for the analysis of protein-coding gene families, the pathogenicity of different Vibrio species in fish, the evolution of non-coding RNA and transposon, and the regulation of gene expression in relation to the bioluminescence of Vqin-Q67.

  6. Association algorithm to mine the rules that govern enzyme definition and to classify protein sequences.

    PubMed

    Chiu, Shih-Hau; Chen, Chien-Chi; Yuan, Gwo-Fang; Lin, Thy-Hou

    2006-06-15

    The number of sequences compiled in many genome projects is growing exponentially, but most of them have not been characterized experimentally. An automatic annotation scheme is urgently needed to reduce the gap between the number of new sequences produced and reliable functional annotation. This work proposes rules for automatically classifying fungus genes. The approach involves elucidating the enzyme classifying rule that is hidden in the UniProt protein knowledgebase and then applying it for classification. The association algorithm, Apriori, is utilized to mine the relationship between the enzyme class and significant InterPro entries. The candidate rules are evaluated for their classificatory capacity. Five datasets were collected from Swiss-Prot for establishing the annotation rules. These were treated as the training sets. The TrEMBL entries were treated as the testing set. A correct enzyme classification rate of 70% was obtained for the prokaryote datasets and a similar rate of about 80% was obtained for the eukaryote datasets. The fungus training dataset, which lacks an enzyme class description, was also used to evaluate the fungus candidate rules. A total of 88 out of 5085 test entries were matched with the fungus rule set. These were otherwise poorly annotated using their functional descriptions. The feasibility of using the method presented here to classify enzyme classes based on the enzyme domain rules is evident. The rules may also be employed by protein annotators in manual annotation or implemented in an automatic annotation flowchart.
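
    Association rules of the kind mined by Apriori are usually ranked by support and confidence. The sketch below shows only that scoring step for a single candidate rule; it does not implement Apriori's candidate generation, and the domain and enzyme-class labels (`DOMAIN_A`, `EC_1`, and so on) are hypothetical placeholders rather than real InterPro or EC identifiers.

```python
def rule_metrics(records, antecedent, consequent):
    """Support and confidence of the rule `antecedent -> consequent`.

    `records` is a list of sets, each holding the annotations of one protein.
    """
    antecedent, consequent = frozenset(antecedent), frozenset(consequent)
    with_antecedent = sum(1 for record in records if antecedent <= record)
    with_both = sum(1 for record in records if (antecedent | consequent) <= record)
    support = with_both / len(records)
    confidence = with_both / with_antecedent if with_antecedent else 0.0
    return support, confidence


# Hypothetical training entries: domain annotations plus an enzyme class.
records = [
    {"DOMAIN_A", "DOMAIN_B", "EC_1"},
    {"DOMAIN_A", "EC_1"},
    {"DOMAIN_A", "EC_2"},
    {"DOMAIN_B", "EC_2"},
]
support, confidence = rule_metrics(records, {"DOMAIN_A"}, {"EC_1"})
print(support)     # 0.5
print(confidence)  # 2 of the 3 DOMAIN_A entries are EC_1
```

    A rule would typically be kept only if both values exceed chosen thresholds; the thresholds used in the study are not given in the abstract.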

  7. Comparison of recent SnIa datasets

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Sanchez, J.C. Bueno; Perivolaropoulos, L.; Nesseris, S., E-mail: jbueno@cc.uoi.gr, E-mail: nesseris@nbi.ku.dk, E-mail: leandros@uoi.gr

    2009-11-01

    We rank the six latest Type Ia supernova (SnIa) datasets (Constitution (C), Union (U), ESSENCE (Davis) (E), Gold06 (G), SNLS 1yr (S) and SDSS-II (D)) in the context of the Chevalier-Polarski-Linder (CPL) parametrization w(a) = w_0 + w_1(1−a), according to their Figure of Merit (FoM), their consistency with the cosmological constant (ΛCDM), their consistency with standard rulers (Cosmic Microwave Background (CMB) and Baryon Acoustic Oscillations (BAO)) and their mutual consistency. We find a significant improvement of the FoM (defined as the inverse area of the 95.4% parameter contour) with the number of SnIa of these datasets ((C) highest FoM, (U), (G), (D), (E), (S) lowest FoM). Standard rulers (CMB+BAO) have a better FoM by about a factor of 3, compared to the highest FoM SnIa dataset (C). We also find that the ranking sequence based on consistency with ΛCDM is identical with the corresponding ranking based on consistency with standard rulers ((S) most consistent, (D), (C), (E), (U), (G) least consistent). The ranking sequence of the datasets however changes when we consider the consistency with an expansion history corresponding to evolving dark energy (w_0, w_1) = (−1.4, 2) crossing the phantom divide line w = −1 (it is practically reversed to (G), (U), (E), (S), (D), (C)). The SALT2 and MLCS2k2 fitters are also compared and some peculiar features of the SDSS-II dataset when standardized with the MLCS2k2 fitter are pointed out. Finally, we construct a statistic to estimate the internal consistency of a collection of SnIa datasets. We find that even though there is good consistency among most samples taken from the above datasets, this consistency decreases significantly when the Gold06 (G) dataset is included in the sample.
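
    The CPL parametrization is a straight line in the scale factor a, so the point where a given model crosses the phantom divide w = −1 follows from simple algebra. The sketch below evaluates the equation of state and solves for that crossing; the function names are hypothetical, and the conversion to redshift uses the standard relation a = 1/(1+z).

```python
def w_cpl(a, w0, w1):
    """CPL equation of state: w(a) = w0 + w1 * (1 - a)."""
    return w0 + w1 * (1.0 - a)


def phantom_crossing(w0, w1):
    """Scale factor at which w(a) = -1, or None if w is constant.

    The caller should check that the result lies in 0 < a <= 1 before
    interpreting it as a crossing in the past.
    """
    if w1 == 0:
        return None
    return 1.0 - (-1.0 - w0) / w1


# Evolving dark energy model quoted in the abstract: (w0, w1) = (-1.4, 2).
a_cross = phantom_crossing(-1.4, 2.0)
z_cross = 1.0 / a_cross - 1.0
print(round(a_cross, 6), round(z_cross, 6))  # 0.8 0.25
print(w_cpl(1.0, -1.4, 2.0))                 # -1.4 today (a = 1)
```

    For this model the equation of state is below −1 today and rises above −1 at earlier times, crossing the divide at a redshift of about 0.25.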

  8. Prediction of Protein Structural Classes for Low-Similarity Sequences Based on Consensus Sequence and Segmented PSSM.

    PubMed

    Liang, Yunyun; Liu, Sanyang; Zhang, Shengli

    2015-01-01

    Prediction of protein structural classes for low-similarity sequences is useful for understanding fold patterns, regulation, functions, and interactions of proteins. It is well known that feature extraction is significant to prediction of protein structural class and it mainly uses protein primary sequence, predicted secondary structure sequence, and position-specific scoring matrix (PSSM). Currently, prediction solely based on the PSSM has played a key role in improving the prediction accuracy. In this paper, we propose a novel method called CSP-SegPseP-SegACP by fusing consensus sequence (CS), segmented PsePSSM, and segmented autocovariance transformation (ACT) based on PSSM. Three widely used low-similarity datasets (1189, 25PDB, and 640) are adopted in this paper. Then a 700-dimensional (700D) feature vector is constructed and the dimension is decreased to 224D by using principal component analysis (PCA). To verify the performance of our method, rigorous jackknife cross-validation tests are performed on the 1189, 25PDB, and 640 datasets. Comparison of our results with the existing PSSM-based methods demonstrates that our method achieves favorable and competitive performance. This will offer an important complement to other PSSM-based methods for prediction of protein structural classes for low-similarity sequences.

  9. Efficient algorithms for accurate hierarchical clustering of huge datasets: tackling the entire protein space.

    PubMed

    Loewenstein, Yaniv; Portugaly, Elon; Fromer, Menachem; Linial, Michal

    2008-07-01

    UPGMA (average linking) is probably the most popular algorithm for hierarchical data clustering, especially in computational biology. However, UPGMA requires the entire dissimilarity matrix in memory. Due to this prohibitive requirement, UPGMA is not scalable to very large datasets. We present a novel class of memory-constrained UPGMA (MC-UPGMA) algorithms. Given any practical memory size constraint, this framework guarantees the correct clustering solution without explicitly requiring all dissimilarities in memory. The algorithms are general and are applicable to any dataset. We present a data-dependent characterization of hardness and clustering efficiency. The presented concepts are applicable to any agglomerative clustering formulation. We apply our algorithm to the entire collection of protein sequences, to automatically build a comprehensive evolutionary-driven hierarchy of proteins from sequence alone. The newly created tree captures protein families better than state-of-the-art large-scale methods such as CluSTr, ProtoNet4 or single-linkage clustering. We demonstrate that leveraging the entire mass embodied in all sequence similarities makes it possible to improve significantly on current protein family clusterings, which are unable to directly tackle the sheer mass of this data. Furthermore, we argue that non-metric constraints are an inherent complexity of the sequence space and should not be overlooked. The robustness of UPGMA allows significant improvement, especially for multidomain proteins, and for large or divergent families. A comprehensive tree built from all UniProt sequence similarities, together with navigation and classification tools, will be made available as part of the ProtoNet service. A C++ implementation of the algorithm is available on request.
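
    For reference, a naive textbook UPGMA is sketched below. It is not the memory-constrained MC-UPGMA of the abstract: it holds every pairwise dissimilarity in memory at once, which is precisely the limitation the authors set out to remove. The function name and the nested-tuple tree format `(left, right, height)` are choices made for this illustration.

```python
def upgma(labels, matrix):
    """Naive UPGMA on a full symmetric dissimilarity matrix (list of lists).

    Returns the tree as nested tuples (left, right, height); leaves are labels.
    """
    # Active clusters: id -> (subtree, number of leaves).
    clusters = {i: (label, 1) for i, label in enumerate(labels)}
    # All pairwise dissimilarities, keyed by (smaller id, larger id).
    dist = {(i, j): matrix[i][j]
            for i in range(len(labels)) for j in range(i + 1, len(labels))}
    next_id = len(labels)
    while len(clusters) > 1:
        # Merge the closest pair of active clusters.
        (i, j), d_ij = min(dist.items(), key=lambda item: item[1])
        (tree_i, n_i), (tree_j, n_j) = clusters.pop(i), clusters.pop(j)
        # Average linkage: size-weighted mean of the two merged clusters' distances.
        for k in clusters:
            d_ik = dist.pop((min(i, k), max(i, k)))
            d_jk = dist.pop((min(j, k), max(j, k)))
            dist[(k, next_id)] = (n_i * d_ik + n_j * d_jk) / (n_i + n_j)
        del dist[(i, j)]
        clusters[next_id] = ((tree_i, tree_j, d_ij / 2), n_i + n_j)
        next_id += 1
    return next(iter(clusters.values()))[0]


labels = ["A", "B", "C", "D"]
matrix = [
    [0, 2, 6, 10],
    [2, 0, 6, 10],
    [6, 6, 0, 8],
    [10, 10, 8, 0],
]
print(upgma(labels, matrix))
```

    The dictionary of distances grows quadratically with the number of sequences, so this form is only practical for small inputs.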

  10. The complete chloroplast genome sequence of Mahonia bealei (Berberidaceae) reveals a significant expansion of the inverted repeat and phylogenetic relationship with other angiosperms.

    PubMed

    Ma, Ji; Yang, Bingxian; Zhu, Wei; Sun, Lianli; Tian, Jingkui; Wang, Xumin

    2013-10-10

    Mahonia bealei (Berberidaceae) is a frequently-used traditional Chinese medicinal plant with efficient anti-inflammatory ability. This plant is one of the sources of berberine, a new cholesterol-lowering drug with anti-diabetic activity. We have sequenced the complete nucleotide sequence of the chloroplast (cp) genome of M. bealei. The complete cp genome of M. bealei is 164,792 bp in length, and has a typical structure with large (LSC 73,052 bp) and small (SSC 18,591 bp) single-copy regions separated by a pair of inverted repeats (IRs 36,501 bp) of large size. The Mahonia cp genome contains 111 unique genes and 39 genes are duplicated in the IR regions. The gene order and content of M. bealei are almost unrearranged, which is consistent with the hypothesis that large IRs stabilize the cp genome and reduce gene loss-and-gain probabilities during the evolutionary process. A large IR expansion of over 12 kb has occurred in M. bealei; 15 genes (rps19, rpl22, rps3, rpl16, rpl14, rps8, infA, rpl36, rps11, petD, petB, psbH, psbN, psbT and psbB) have expanded to have an additional copy in the IRs. The IR expansion rearrangement occurred via a double-strand DNA break and subsequent repair, which is different from the ordinary gene conversion mechanism. Repeat analysis identified 39 direct/inverted repeats 30 bp or longer with a sequence identity ≥ 90%. Analysis also revealed 75 simple sequence repeat (SSR) loci and almost all are composed of A or T, contributing to a distinct bias in base composition. Comparison of protein-coding sequences with ESTs reveals 9 putative RNA edits and 5 of them resulted in non-synonymous modifications in rpoC1, rps2, rps19 and ycf1. Phylogenetic analysis using maximum parsimony (MP) and maximum likelihood (ML) was performed on a dataset composed of 65 protein-coding genes from 25 taxa, which yields the same tree topology as previous plastid-based trees, and provides strong support for the sister relationship between Ranunculaceae and Berberidaceae. Molecular dating analyses suggest that Ranunculaceae and Berberidaceae diverged between 90 and 84 mya, which is congruent with the fossil records and with recent estimates of the divergence time of these two taxa. © 2013.

  11. RNA-seq reveals more consistent reference genes for gene expression studies in human non-melanoma skin cancers

    PubMed Central

    Tan, Jean-Marie; Payne, Elizabeth J.; Lin, Lynlee L.; Sinnya, Sudipta; Raphael, Anthony P.; Lambie, Duncan; Frazer, Ian H.; Dinger, Marcel E.; Soyer, H. Peter

    2017-01-01

    Identification of appropriate reference genes (RGs) is critical to accurate data interpretation in quantitative real-time PCR (qPCR) experiments. In this study, we have utilised next generation RNA sequencing (RNA-seq) to analyse the transcriptome of a panel of non-melanoma skin cancer lesions, identifying genes that are consistently expressed across all samples. Genes encoding ribosomal proteins were amongst the most stable in this dataset. Validation of this RNA-seq data was examined using qPCR to confirm the suitability of a set of highly stable genes for use as qPCR RGs. These genes will provide a valuable resource for the normalisation of qPCR data for the analysis of non-melanoma skin cancer. PMID:28852586

  12. Improved detection of CXCR4-using HIV by V3 genotyping: application of population-based and "deep" sequencing to plasma RNA and proviral DNA.

    PubMed

    Swenson, Luke C; Moores, Andrew; Low, Andrew J; Thielen, Alexander; Dong, Winnie; Woods, Conan; Jensen, Mark A; Wynhoven, Brian; Chan, Dennison; Glascock, Christopher; Harrigan, P Richard

    2010-08-01

    Tropism testing should rule out CXCR4-using HIV before treatment with CCR5 antagonists. Currently, the recombinant phenotypic Trofile assay (Monogram) is most widely utilized; however, genotypic tests may represent alternative methods. Independent triplicate amplifications of the HIV gp120 V3 region were made from either plasma HIV RNA or proviral DNA. These underwent standard, population-based sequencing with an ABI3730 (RNA n = 63; DNA n = 40), or "deep" sequencing with a Roche/454 Genome Sequencer-FLX (RNA n = 12; DNA n = 12). Position-specific scoring matrices (PSSMX4/R5) (-6.96 cutoff) and geno2pheno[coreceptor] (5% false-positive rate) inferred tropism from V3 sequence. These methods were then independently validated with a separate, blinded dataset (n = 278) of screening samples from the maraviroc MOTIVATE trials. Standard sequencing of HIV RNA with PSSM yielded 69% sensitivity and 91% specificity, relative to Trofile. The validation dataset gave 75% sensitivity and 83% specificity. Proviral DNA plus PSSM gave 77% sensitivity and 71% specificity. "Deep" sequencing of HIV RNA detected >2% inferred-CXCR4-using virus in 8/8 samples called non-R5 by Trofile, and <2% in 4/4 samples called R5. Triplicate analyses of V3 standard sequence data detect greater proportions of CXCR4-using samples than previously achieved. Sequencing proviral DNA and "deep" V3 sequencing may also be useful tools for assessing tropism.
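
    The sensitivity and specificity figures quoted above compare genotypic calls against a reference assay. The sketch below shows the arithmetic for paired boolean calls, where True means "CXCR4-using" and the reference plays the role of Trofile; the example calls are invented for illustration and the function name is hypothetical.

```python
def sensitivity_specificity(predicted, reference):
    """Sensitivity and specificity of `predicted` calls against `reference`.

    Both arguments are equal-length sequences of booleans, where True
    marks a sample called positive (here: CXCR4-using).
    """
    pairs = list(zip(predicted, reference))
    tp = sum(1 for p, r in pairs if p and r)          # true positives
    fn = sum(1 for p, r in pairs if not p and r)      # missed positives
    tn = sum(1 for p, r in pairs if not p and not r)  # true negatives
    fp = sum(1 for p, r in pairs if p and not r)      # false alarms
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    return sensitivity, specificity


# Invented calls for six samples.
genotype_calls = [True, True, False, False, True, False]
reference_calls = [True, False, False, False, True, True]
print(sensitivity_specificity(genotype_calls, reference_calls))
```

    In this toy example two of three reference positives are detected and two of three reference negatives are correctly excluded, so both values are two thirds.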

  13. Genome-Wide Association Study of a Validated Case Definition of Gulf War Illness in a Population-Representative Sample

    DTIC Science & Technology

    2013-09-01

    sequence dataset. All procedures were performed by personnel in the IIMT UT Southwestern Genomics and Microarray Core using standard protocols. More... sequencing run, samples were demultiplexed using standard algorithms in the Genomics and Microarray Core and processed into individual sample Illumina single... Sequencing (RNA-Seq), using Illumina’s multiplexing mRNA-Seq to generate full sequence libraries from the poly-A tailed RNA to a read depth of 30

  14. Digital Gene Expression Analysis Based on De Novo Transcriptome Assembly Reveals New Genes Associated with Floral Organ Differentiation of the Orchid Plant Cymbidium ensifolium

    PubMed Central

    Yang, Fengxi; Zhu, Genfa

    2015-01-01

    Cymbidium ensifolium belongs to the genus Cymbidium of the orchid family. Owing to its spectacular flower morphology, C. ensifolium has considerable ecological and cultural value. However, limited genetic data are available for this non-model plant, and the molecular mechanism underlying floral organ identity is still poorly understood. In this study, we characterize the floral transcriptome of C. ensifolium and present, for the first time, extensive sequence and transcript abundance data of individual floral organs. After sequencing, over 10 Gb clean sequence data were generated and assembled into 111,892 unigenes with an average length of 932.03 base pairs, including 1,227 clusters and 110,665 singletons. Assembled sequences were annotated with gene descriptions, gene ontology, clusters of orthologous group terms, the Kyoto Encyclopedia of Genes and Genomes, and the plant transcription factor database. From these annotations, 131 flowering-associated unigenes, 61 CONSTANS-LIKE (COL) unigenes and 90 floral homeotic genes were identified. In addition, four digital gene expression libraries were constructed for the sepal, petal, labellum and gynostemium, and 1,058 genes corresponding to individual floral organ development were identified. Among them, eight MADS-box genes were further investigated by full-length cDNA sequence analysis and expression validation, which revealed two APETALA1/AGL9-like MADS-box genes preferentially expressed in the sepal and petal, two AGAMOUS-like genes particularly restricted to the gynostemium, and four DEF-like genes distinctively expressed in different floral organs. The spatial expression of these genes varied distinctly in different floral mutants corresponding to different floral morphogenesis, which validated their specialized roles in floral patterning and further supported the effectiveness of our in silico analysis. 
The dataset generated in our study provides new insights into the molecular mechanisms underlying floral patterning of Cymbidium and offers a valuable resource for molecular breeding of the orchid plant. PMID:26580566

  15. Genomics dataset on unclassified published organism (patent US 7547531).

    PubMed

    Khan Shawan, Mohammad Mahfuz Ali; Hasan, Md Ashraful; Hossain, Md Mozammel; Hasan, Md Mahmudul; Parvin, Afroza; Akter, Salina; Uddin, Kazi Rasel; Banik, Subrata; Morshed, Mahbubul; Rahman, Md Nazibur; Rahman, S M Badier

    2016-12-01

    Nucleotide (DNA) sequence analysis provides important clues regarding the characteristics and taxonomic position of an organism. DNA sequence analysis is therefore crucial for learning about the hierarchical classification of that organism. This dataset (patent US 7547531) was chosen to simplify the complex raw data buried in undisclosed DNA sequences, which may help to open doors for new collaborations. A total of 48 unidentified DNA sequences from patent US 7547531 were selected and their complete sequences were retrieved from the NCBI BioSample database. Quick response (QR) codes for those DNA sequences were constructed with the DNA BarID tool. The QR code is useful for the identification and comparison of isolates with other organisms. AT/GC content of the DNA sequences was determined using the ENDMEMO GC Content Calculator, which indicates their stability at different temperatures. The highest GC content was observed in GP445188 (62.5%), followed by GP445198 (61.8%) and GP445189 (59.44%), while the lowest was in GP445178 (24.39%). In addition, the New England BioLabs (NEB) database was used to identify the cleavage code, indicating the 5', 3' and blunt ends, and the enzyme code, indicating the methylation site of the DNA sequences. These data will be helpful for the construction of the organisms' hierarchical classification, determination of their phylogenetic and taxonomic position and revelation of their molecular characteristics.
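GC content, as reported in the abstract above, is the percentage of G and C bases in a sequence. The Python sketch below shows that calculation. It is not the ENDMEMO calculator; in particular, skipping ambiguous bases such as N is an assumption made here, and the example sequences are made up rather than taken from the patent.

```python
def gc_content(sequence: str) -> float:
    """Return the GC content of a DNA sequence as a percentage.

    Only A, C, G and T are counted; other symbols (for example N)
    are ignored, which is an assumption of this sketch.
    """
    bases = [b for b in sequence.upper() if b in "ACGT"]
    if not bases:
        raise ValueError("sequence contains no A, C, G or T bases")
    gc = sum(1 for b in bases if b in "GC")
    return 100.0 * gc / len(bases)


# Made-up example sequences, for illustration only.
balanced = gc_content("ATGC")
at_only = gc_content("aattNN")
```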

  16. Datasets for evolutionary comparative genomics

    PubMed Central

    Liberles, David A

    2005-01-01

    Many decisions about genome sequencing projects are directed by perceived gaps in the tree of life, or towards model organisms. With the goal of a better understanding of biology through the lens of evolution, however, there are additional genomes that are worth sequencing. One such rationale for whole-genome sequencing is discussed here, along with other important strategies for understanding the phenotypic divergence of species. PMID:16086856

  17. RNA-seq based transcriptomic map reveals new insights into mouse salivary gland development and maturation.

    PubMed

    Gluck, Christian; Min, Sangwon; Oyelakin, Akinsola; Smalley, Kirsten; Sinha, Satrajit; Romano, Rose-Anne

    2016-11-16

    Mouse models have served a valuable role in deciphering various facets of Salivary Gland (SG) biology, from normal developmental programs to diseased states. To facilitate such studies, gene expression profiling maps have been generated for various stages of SG organogenesis. However, these prior studies fall short of capturing the transcriptional complexity due to the limited scope of gene-centric microarray-based technology. Compared to microarrays, RNA-sequencing (RNA-seq) offers unbiased detection of novel transcripts, broader dynamic range and high specificity and sensitivity for detection of genes, transcripts, and differential gene expression. Although RNA-seq data, particularly under the auspices of the ENCODE project, have covered a large number of biological specimens, studies on the SG have been lacking. To better appreciate the wide spectrum of gene expression profiles, we isolated RNA from mouse submandibular salivary glands at different embryonic and adult stages. In parallel, we processed RNA-seq data for 24 organs and tissues obtained from the mouse ENCODE consortium and calculated the average gene expression values. To identify molecular players and pathways likely to be relevant for SG biology, we performed functional gene enrichment analysis, network construction and hierarchical clustering of the RNA-seq datasets obtained from different stages of SG development and maturation, and other mouse organs and tissues. Our bioinformatics-based data analysis not only reaffirmed known modulators of SG morphogenesis but also revealed novel transcription factors and signaling pathways unique to mouse SG biology and function. Finally, we demonstrated that the unique SG gene signature obtained from our mouse studies is also well conserved and can demarcate features of the human SG transcriptome that differ from other tissues. 
Our RNA-seq based Atlas has revealed a high-resolution cartographic view of the dynamic transcriptomic landscape of the mouse SG at various stages. These RNA-seq datasets will complement pre-existing microarray based datasets, including the Salivary Gland Molecular Anatomy Project, by offering a broader systems-biology based perspective rather than the classical gene-centric view. Ultimately, such resources will provide a useful toolkit to better understand how the diverse cell populations of the SG are organized and controlled during development and differentiation.

  18. MELOGEN: an EST database for melon functional genomics

    PubMed Central

    Gonzalez-Ibeas, Daniel; Blanca, José; Roig, Cristina; González-To, Mireia; Picó, Belén; Truniger, Verónica; Gómez, Pedro; Deleu, Wim; Caño-Delgado, Ana; Arús, Pere; Nuez, Fernando; Garcia-Mas, Jordi; Puigdomènech, Pere; Aranda, Miguel A

    2007-01-01

    Background Melon (Cucumis melo L.) is one of the most important fleshy fruits for fresh consumption. Despite this, few genomic resources exist for this species. To facilitate the discovery of genes involved in essential traits, such as fruit development, fruit maturation and disease resistance, and to speed up the process of breeding new and better adapted melon varieties, we have produced a large collection of expressed sequence tags (ESTs) from eight normalized cDNA libraries from different tissues in different physiological conditions. Results We determined over 30,000 ESTs that were clustered into 16,637 non-redundant sequences or unigenes, comprising 6,023 tentative consensus sequences (contigs) and 10,614 unclustered sequences (singletons). Many potential molecular markers were identified in the melon dataset: 1,052 potential simple sequence repeats (SSRs) and 356 single nucleotide polymorphisms (SNPs) were found. Sixty-nine percent of the melon unigenes showed a significant similarity with proteins in databases. Functional classification of the unigenes was carried out following the Gene Ontology scheme. In total, 9,402 unigenes were mapped to one or more ontology. Remarkably, the distributions of melon and Arabidopsis unigenes followed similar tendencies, suggesting that the melon dataset is representative of the whole melon transcriptome. Bioinformatic analyses primarily focused on potential precursors of melon micro RNAs (miRNAs) in the melon dataset, but many other genes potentially controlling disease resistance and fruit quality traits were also identified. Patterns of transcript accumulation were characterised by Real-Time-qPCR for 20 of these genes. Conclusion The collection of ESTs characterised here represents a substantial increase on the genetic information available for melon. A database (MELOGEN) which contains all EST sequences, contig images and several tools for analysis and data mining has been created. 
This set of sequences constitutes also the basis for an oligo-based microarray for melon that is being used in experiments to further analyse the melon transcriptome. PMID:17767721

  19. PredPPCrys: accurate prediction of sequence cloning, protein production, purification and crystallization propensity from protein sequences using multi-step heterogeneous feature fusion and selection.

    PubMed

    Wang, Huilin; Wang, Mingjun; Tan, Hao; Li, Yuan; Zhang, Ziding; Song, Jiangning

    2014-01-01

    X-ray crystallography is the primary approach to solve the three-dimensional structure of a protein. However, a major bottleneck of this method is the failure of multi-step experimental procedures to yield diffraction-quality crystals, including sequence cloning, protein material production, purification, crystallization and ultimately, structural determination. Accordingly, prediction of the propensity of a protein to successfully undergo these experimental procedures based on the protein sequence may help narrow down laborious experimental efforts and facilitate target selection. A number of bioinformatics methods based on protein sequence information have been developed for this purpose. However, our knowledge on the important determinants of propensity for a protein sequence to produce high diffraction-quality crystals remains largely incomplete. In practice, most of the existing methods display poorer performance when evaluated on larger and updated datasets. To address this problem, we constructed an up-to-date dataset as the benchmark, and subsequently developed a new approach termed 'PredPPCrys' using the support vector machine (SVM). Using a comprehensive set of multifaceted sequence-derived features in combination with a novel multi-step feature selection strategy, we identified and characterized the relative importance and contribution of each feature type to the prediction performance of five individual experimental steps required for successful crystallization. The resulting optimal candidate features were used as inputs to build the first-level SVM predictor (PredPPCrys I). Next, prediction outputs of PredPPCrys I were used as the input to build second-level SVM classifiers (PredPPCrys II), which led to significantly enhanced prediction performance. Benchmarking experiments indicated that our PredPPCrys method outperforms most existing procedures on both up-to-date and previous datasets. 
In addition, the predicted crystallization targets of currently non-crystallizable proteins were provided as compendium data, which are anticipated to facilitate target selection and design for the worldwide structural genomics consortium. PredPPCrys is freely available at http://www.structbioinfor.org/PredPPCrys.

  20. Resources and costs for microbial sequence analysis evaluated using virtual machines and cloud computing.

    PubMed

    Angiuoli, Samuel V; White, James R; Matalka, Malcolm; White, Owen; Fricke, W Florian

    2011-01-01

    The widespread popularity of genomic applications is threatened by the "bioinformatics bottleneck" resulting from uncertainty about the cost and infrastructure needed to meet increasing demands for next-generation sequence analysis. Cloud computing services have been discussed as potential new bioinformatics support systems but have not been evaluated thoroughly. We present benchmark costs and runtimes for common microbial genomics applications, including 16S rRNA analysis, microbial whole-genome shotgun (WGS) sequence assembly and annotation, WGS metagenomics and large-scale BLAST. Sequence dataset types and sizes were selected to correspond to outputs typically generated by small- to midsize facilities equipped with 454 and Illumina platforms, except for WGS metagenomics where sampling of Illumina data was used. Automated analysis pipelines, as implemented in the CloVR virtual machine, were used in order to guarantee transparency, reproducibility and portability across different operating systems, including the commercial Amazon Elastic Compute Cloud (EC2), which was used to attach real dollar costs to each analysis type. We found considerable differences in computational requirements, runtimes and costs associated with different microbial genomics applications. While all 16S analyses completed on a single-CPU desktop in under three hours, microbial genome and metagenome analyses utilized multi-CPU support of up to 120 CPUs on Amazon EC2, where each analysis completed in under 24 hours for less than $60. Representative datasets were used to estimate maximum data throughput on different cluster sizes and to compare costs between EC2 and comparable local grid servers. 
Although bioinformatics requirements for microbial genomics depend on dataset characteristics and the analysis protocols applied, our results suggest that smaller sequencing facilities (up to three Roche/454 or one Illumina GAIIx sequencer) invested in 16S rRNA amplicon sequencing, microbial single-genome and metagenomics WGS projects can achieve cost-efficient bioinformatics support using CloVR in combination with Amazon EC2 as an alternative to local computing centers.

  1. Resources and Costs for Microbial Sequence Analysis Evaluated Using Virtual Machines and Cloud Computing

    PubMed Central

    Angiuoli, Samuel V.; White, James R.; Matalka, Malcolm; White, Owen; Fricke, W. Florian

    2011-01-01

    Background The widespread popularity of genomic applications is threatened by the “bioinformatics bottleneck” resulting from uncertainty about the cost and infrastructure needed to meet increasing demands for next-generation sequence analysis. Cloud computing services have been discussed as potential new bioinformatics support systems but have not been evaluated thoroughly. Results We present benchmark costs and runtimes for common microbial genomics applications, including 16S rRNA analysis, microbial whole-genome shotgun (WGS) sequence assembly and annotation, WGS metagenomics and large-scale BLAST. Sequence dataset types and sizes were selected to correspond to outputs typically generated by small- to midsize facilities equipped with 454 and Illumina platforms, except for WGS metagenomics where sampling of Illumina data was used. Automated analysis pipelines, as implemented in the CloVR virtual machine, were used in order to guarantee transparency, reproducibility and portability across different operating systems, including the commercial Amazon Elastic Compute Cloud (EC2), which was used to attach real dollar costs to each analysis type. We found considerable differences in computational requirements, runtimes and costs associated with different microbial genomics applications. While all 16S analyses completed on a single-CPU desktop in under three hours, microbial genome and metagenome analyses utilized multi-CPU support of up to 120 CPUs on Amazon EC2, where each analysis completed in under 24 hours for less than $60. Representative datasets were used to estimate maximum data throughput on different cluster sizes and to compare costs between EC2 and comparable local grid servers. 
Conclusions Although bioinformatics requirements for microbial genomics depend on dataset characteristics and the analysis protocols applied, our results suggest that smaller sequencing facilities (up to three Roche/454 or one Illumina GAIIx sequencer) invested in 16S rRNA amplicon sequencing, microbial single-genome and metagenomics WGS projects can achieve cost-efficient bioinformatics support using CloVR in combination with Amazon EC2 as an alternative to local computing centers. PMID:22028928

  2. RRE: a tool for the extraction of non-coding regions surrounding annotated genes from genomic datasets.

    PubMed

    Lazzarato, F; Franceschinis, G; Botta, M; Cordero, F; Calogero, R A

    2004-11-01

    RRE allows the extraction of non-coding regions surrounding a coding sequence [i.e. gene upstream region, 5'-untranslated region (5'-UTR), introns, 3'-UTR, downstream region] from annotated genomic datasets available at NCBI. RRE parser and web-based interface are accessible at http://www.bioinformatica.unito.it/bioinformatics/rre/rre.html
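The kind of extraction described above can be illustrated with a small Python sketch that computes the upstream and downstream intervals around an annotated gene. This is not RRE's implementation, which the abstract does not describe. The `Gene` type, the `flanking_regions` function, the 1-based inclusive coordinates and the fixed flank length are all hypothetical choices made for illustration.

```python
from typing import Dict, NamedTuple, Optional, Tuple

Interval = Optional[Tuple[int, int]]  # (start, end), 1-based inclusive, or None if empty


class Gene(NamedTuple):
    """Hypothetical minimal gene annotation."""
    start: int   # 1-based, inclusive
    end: int     # 1-based, inclusive
    strand: str  # "+" or "-"


def _clip(start: int, end: int) -> Interval:
    """Return the interval, or None when it is empty."""
    return (start, end) if start <= end else None


def flanking_regions(gene: Gene, flank: int, seq_length: int) -> Dict[str, Interval]:
    """Return upstream and downstream intervals of `flank` bases around a gene.

    Intervals are clipped to the sequence boundaries. On the minus strand
    the upstream region lies at higher coordinates than the gene.
    """
    if gene.strand not in ("+", "-"):
        raise ValueError("strand must be '+' or '-'")
    left = _clip(max(1, gene.start - flank), gene.start - 1)
    right = _clip(gene.end + 1, min(seq_length, gene.end + flank))
    if gene.strand == "+":
        return {"upstream": left, "downstream": right}
    return {"upstream": right, "downstream": left}


# Made-up coordinates, for illustration only.
plus_gene = flanking_regions(Gene(101, 200, "+"), flank=50, seq_length=1000)
minus_gene = flanking_regions(Gene(101, 200, "-"), flank=50, seq_length=1000)
edge_gene = flanking_regions(Gene(1, 10, "+"), flank=5, seq_length=12)
```

Introns and untranslated regions would need exon and CDS coordinates from the annotation as well, so they are left out of this sketch.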

  3. Improving phylogenetic analyses by incorporating additional information from genetic sequence databases.

    PubMed

    Liang, Li-Jung; Weiss, Robert E; Redelings, Benjamin; Suchard, Marc A

    2009-10-01

    Statistical analyses of phylogenetic data culminate in uncertain estimates of underlying model parameters. Lack of additional data hinders the ability to reduce this uncertainty, as the original phylogenetic dataset is often complete, containing the entire gene or genome information available for the given set of taxa. Informative priors in a Bayesian analysis can reduce posterior uncertainty; however, publicly available phylogenetic software specifies vague priors for model parameters by default. We build objective and informative priors using hierarchical random effect models that combine additional datasets whose parameters are not of direct interest but are similar to the analysis of interest. We propose principled statistical methods that permit more precise parameter estimates in phylogenetic analyses by creating informative priors for parameters of interest. Using additional sequence datasets from our lab or public databases, we construct a fully Bayesian semiparametric hierarchical model to combine datasets. A dynamic iteratively reweighted Markov chain Monte Carlo algorithm conveniently recycles posterior samples from the individual analyses. We demonstrate the value of our approach by examining the insertion-deletion (indel) process in the enolase gene across the Tree of Life using the phylogenetic software BALI-PHY; we incorporate prior information about indels from 82 curated alignments downloaded from the BAliBASE database.

  4. A multigene phylogenetic synthesis for the class Lecanoromycetes (Ascomycota): 1307 fungi representing 1139 infrageneric taxa, 317 genera and 66 families

    PubMed Central

    Miadlikowska, Jolanta; Kauff, Frank; Högnabba, Filip; Oliver, Jeffrey C.; Molnár, Katalin; Fraker, Emily; Gaya, Ester; Hafellner, Josef; Hofstetter, Valérie; Gueidan, Cécile; Otálora, Mónica A.G.; Hodkinson, Brendan; Kukwa, Martin; Lücking, Robert; Björk, Curtis; Sipman, Harrie J.M.; Burgaz, Ana Rosa; Thell, Arne; Passo, Alfredo; Myllys, Leena; Goward, Trevor; Fernández-Brime, Samantha; Hestmark, Geir; Lendemer, James; Lumbsch, H. Thorsten; Schmull, Michaela; Schoch, Conrad; Sérusiaux, Emmanuël; Maddison, David R.; Arnold, A. Elizabeth; Lutzoni, François; Stenroos, Soili

    2014-01-01

    The Lecanoromycetes is the largest class of lichenized Fungi, and one of the most species-rich classes in the kingdom. Here we provide a multigene phylogenetic synthesis (using three ribosomal RNA-coding and two protein-coding genes) of the Lecanoromycetes based on 642 newly generated and 3329 publicly available sequences representing 1139 taxa, 317 genera, 66 families, 17 orders and five subclasses (four currently recognized: Acarosporomycetidae, Lecanoromycetidae, Ostropomycetidae, Umbilicariomycetidae; and one provisionally recognized, ‘Candelariomycetidae’). Maximum likelihood phylogenetic analyses on four multigene datasets assembled using a cumulative supermatrix approach with a progressively higher number of species and missing data (5-gene, 5+4-gene, 5+4+3-gene and 5+4+3+2-gene datasets) show that the current classification includes non-monophyletic taxa at various ranks, which need to be recircumscribed and require revisionary treatments based on denser taxon sampling and more loci. Two newly circumscribed orders (Arctomiales and Hymeneliales in the Ostropomycetidae) and three families (Ramboldiaceae and Psilolechiaceae in the Lecanorales, and Strangosporaceae in the Lecanoromycetes inc. sed.) are introduced. The potential resurrection of the families Eigleraceae and Lopadiaceae is considered here to alleviate phylogenetic and classification disparities. An overview of the photobionts associated with the main fungal lineages in the Lecanoromycetes based on available published records is provided. A revised schematic classification at the family level in the phylogenetic context of widely accepted and newly revealed relationships across Lecanoromycetes is included. 
The cumulative addition of taxa with an increasing amount of missing data (i.e., a cumulative supermatrix approach, starting with taxa for which sequences were available for all five targeted genes and ending with the addition of taxa for which only two genes have been sequenced) revealed relatively stable relationships for many families and orders. However, the increasing number of taxa without the addition of more loci also resulted in an expected substantial loss of phylogenetic resolving power and support (especially for deep phylogenetic relationships), potentially including the misplacements of several taxa. Future phylogenetic analyses should include additional single copy protein-coding markers in order to improve the tree of the Lecanoromycetes. As part of this study, a new module (“Hypha”) of the freely available Mesquite software was developed to compare and display the internodal support values derived from this cumulative supermatrix approach. PMID:24747130
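The cumulative supermatrix approach described above builds nested datasets: first the taxa with all five genes, then adding taxa with four, three and finally two genes. A minimal Python sketch of that taxon selection is given below. The function name, the placeholder gene labels `g1` to `g5` and the taxon labels are hypothetical, and the sketch covers only the selection step, not alignment or tree inference.

```python
from typing import Dict, List, Set


def cumulative_datasets(
    genes_by_taxon: Dict[str, Set[str]], total_genes: int = 5, min_genes: int = 2
) -> Dict[int, List[str]]:
    """Map each gene-count threshold to the taxa having at least that many genes.

    With the defaults, threshold 5 corresponds to the 5-gene dataset,
    4 to the 5+4-gene dataset, 3 to 5+4+3 and 2 to 5+4+3+2.
    """
    datasets: Dict[int, List[str]] = {}
    for threshold in range(total_genes, min_genes - 1, -1):
        datasets[threshold] = sorted(
            taxon
            for taxon, genes in genes_by_taxon.items()
            if len(genes) >= threshold
        )
    return datasets


# Made-up taxa and placeholder gene labels, for illustration only.
example = {
    "taxon_A": {"g1", "g2", "g3", "g4", "g5"},
    "taxon_B": {"g1", "g2", "g3", "g4"},
    "taxon_C": {"g1", "g2"},
    "taxon_D": {"g1"},
}
nested = cumulative_datasets(example)
```

Each larger dataset contains the previous one, so taxon sampling grows while the proportion of missing data per taxon increases, which is the trade-off the abstract reports.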

  5. Automatic detection of lift-off and touch-down of a pick-up walker using 3D kinematics.

    PubMed

    Grootveld, L; Thies, S B; Ogden, D; Howard, D; Kenney, L P J

    2014-02-01

    Walking aids have been associated with falls and it is believed that incorrect use limits their usefulness. Measures are therefore needed that characterize their stable use and the classification of key events in walking aid movement is the first step in their development. This study presents an automated algorithm for detection of lift-off (LO) and touch-down (TD) events of a pick-up walker. For algorithm design and initial testing, a single user performed trials for which the four individual walker feet lifted off the ground and touched down again in various sequences, and for different amounts of frame loading (Dataset_1). For further validation, ten healthy young subjects walked with the pick-up walker on flat ground (Dataset_2a) and on a narrow beam (Dataset_2b), to challenge balance. One 88-year-old walking frame user was also assessed. Kinematic data were collected with a 3D optoelectronic camera system. The algorithm detected over 93% of events (Dataset_1), and 95% and 92% in Dataset_2a and b, respectively. Of the various LO/TD sequences, those associated with natural progression resulted in up to 100% correctly identified events. For the 88-year-old walking frame user, 96% of LO events and 93% of TD events were detected, demonstrating the potential of the approach. Copyright © 2013 IPEM. Published by Elsevier Ltd. All rights reserved.

  6. How Often Do They Have Sex? A Comparative Analysis of the Population Structure of Seven Eukaryotic Microbial Pathogens

    PubMed Central

    Tomasini, Nicolás; Lauthier, Juan José; Ayala, Francisco José; Tibayrenc, Michel; Diosque, Patricio

    2014-01-01

    The model of predominant clonal evolution (PCE) proposed for micropathogens does not state that genetic exchange is totally absent, but rather, that it is too rare to break the prevalent PCE pattern. However, the actual impact of this “residual” genetic exchange should be evaluated. Multilocus Sequence Typing (MLST) is an excellent tool to explore the problem. Here, we compared online available MLST datasets for seven eukaryotic microbial pathogens: Trypanosoma cruzi, the Fusarium solani complex, Aspergillus fumigatus, Blastocystis subtype 3, the Leishmania donovani complex, Candida albicans and Candida glabrata. We first analyzed phylogenetic relationships among genotypes within each dataset. Then, we examined different measures of branch support and incongruence among loci as signs of genetic structure and levels of past recombination. The analyses allow us to identify three types of genetic structure. The first was characterized by trees with well-supported branches and low levels of incongruence suggesting well-structured populations and PCE. This was the case for the T. cruzi and F. solani datasets. The second genetic structure, represented by Blastocystis spp., A. fumigatus and the L. donovani complex datasets, showed trees with weakly-supported branches but low levels of incongruence among loci, whereby genetic structuration was not clearly defined by MLST. Finally, trees showing weakly-supported branches and high levels of incongruence among loci were observed for Candida species, suggesting that genetic exchange has a higher evolutionary impact in these mainly clonal yeast species. Furthermore, simulations showed that MLST may fail to show right clustering in population datasets even in the absence of genetic exchange. In conclusion, these results make it possible to infer variable impacts of genetic exchange in populations of predominantly clonal micro-pathogens. 
Moreover, our results reveal several limitations of MLST in determining the genetic structure of these organisms that should be considered. PMID:25054834

  7. Thysanophora penicillioides includes multiple genetically diverged groups that coexist respectively in Abies mariesii forests in Japan.

    PubMed

    Iwamoto, Susumu; Tokumasu, Seiji; Suyama, Yoshihisa; Kakishima, Makoto

    2005-01-01

    We investigated intraspecific diversity and genetic structures of a saprotrophic fungus--Thysanophora penicillioides--based on sequences of nuclear ribosomal internal transcribed spacer (ITS) in 15 discontinuous Abies mariesii forests of Japan. In such a well-defined morphological species, numerous unexpected ITS variations were revealed: 12 ITS sequence types detected in 254 isolates collected from 15 local populations were classified into five ITS sequence groups. Maximally, four ITS groups consisted of seven ITS types coexisting in one population. However, group 1 was dominant with approximately 65%; in particular, one haplotype, 1a, was most dominant with approximately 60% in respective populations. Therefore, few differences were recognized in genetic structure among local populations, implying that the gene flow of each lineage of the fungus occurs among local populations without geographic limitations. However, minor haplotypes in some ITS groups were found only in restricted areas, suggesting that they might expand steadily from their places of origin to neighboring A. mariesii forests. Aggregating sequence data of seven European strains and four North American strains from various substrates to those of Japanese strains, 18 ITS sequence types and 28 variable sites were recognized. They were clustered into nine lineages by phylogenetic analyses of the beta-tubulin and combined ITS and beta-tubulin datasets. According to phylogenetic species recognition by the concordance of genealogies, respective lineages correspond to phylogenetic species. Plural phylogenetic species coexist in a local population in an A. mariesii forest in Japan.

  8. Orthology detection combining clustering and synteny for very large datasets.

    PubMed

    Lechner, Marcus; Hernandez-Rosales, Maribel; Doerr, Daniel; Wieseke, Nicolas; Thévenin, Annelyse; Stoye, Jens; Hartmann, Roland K; Prohaska, Sonja J; Stadler, Peter F

    2014-01-01

    The elucidation of orthology relationships is an important step both in gene function prediction as well as towards understanding patterns of sequence evolution. Orthology assignments are usually derived directly from sequence similarities for large data because more exact approaches exhibit too high computational costs. Here we present PoFF, an extension for the standalone tool Proteinortho, which enhances orthology detection by combining clustering, sequence similarity, and synteny. In the course of this work, FFAdj-MCS, a heuristic that assesses pairwise gene order using adjacencies (a similarity measure related to the breakpoint distance) was adapted to support multiple linear chromosomes and extended to detect duplicated regions. PoFF largely reduces the number of false positives and enables more fine-grained predictions than purely similarity-based approaches. The extension maintains the low memory requirements and the efficient concurrency options of its basis Proteinortho, making the software applicable to very large datasets.
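The abstract above describes gene order similarity in terms of adjacencies, a measure related to the breakpoint distance. The Python sketch below counts the adjacencies shared by two gene orders on a single linear chromosome. It is a simplified illustration only: it is not FFAdj-MCS, it ignores gene orientation, duplicates and multiple chromosomes, and the function names are hypothetical.

```python
from typing import FrozenSet, Hashable, Sequence, Set


def adjacencies(order: Sequence[Hashable]) -> Set[FrozenSet[Hashable]]:
    """Return the set of unordered neighbouring gene pairs in a linear gene order."""
    return {frozenset(pair) for pair in zip(order, order[1:])}


def shared_adjacencies(order_a: Sequence[Hashable], order_b: Sequence[Hashable]) -> int:
    """Count adjacencies present in both gene orders."""
    return len(adjacencies(order_a) & adjacencies(order_b))


def breakpoints(order_a: Sequence[Hashable], order_b: Sequence[Hashable]) -> int:
    """Adjacencies of order_a that are missing from order_b.

    Assumes both orders contain the same genes, each exactly once.
    """
    return len(adjacencies(order_a)) - shared_adjacencies(order_a, order_b)


# Made-up gene orders, for illustration only.
genome_a = [1, 2, 3, 4, 5]
genome_b = [1, 2, 4, 3, 5]
conserved = shared_adjacencies(genome_a, genome_b)
broken = breakpoints(genome_a, genome_b)
```

More shared adjacencies mean more conserved local gene order, which is the signal a synteny-aware method can use to separate orthologs from similar but differently placed paralogs.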

  9. Orthology Detection Combining Clustering and Synteny for Very Large Datasets

    PubMed Central

    Lechner, Marcus; Hernandez-Rosales, Maribel; Doerr, Daniel; Wieseke, Nicolas; Thévenin, Annelyse; Stoye, Jens; Hartmann, Roland K.; Prohaska, Sonja J.; Stadler, Peter F.

    2014-01-01

    The elucidation of orthology relationships is an important step both in gene function prediction as well as towards understanding patterns of sequence evolution. Orthology assignments are usually derived directly from sequence similarities for large data because more exact approaches exhibit too high computational costs. Here we present PoFF, an extension for the standalone tool Proteinortho, which enhances orthology detection by combining clustering, sequence similarity, and synteny. In the course of this work, FFAdj-MCS, a heuristic that assesses pairwise gene order using adjacencies (a similarity measure related to the breakpoint distance) was adapted to support multiple linear chromosomes and extended to detect duplicated regions. PoFF largely reduces the number of false positives and enables more fine-grained predictions than purely similarity-based approaches. The extension maintains the low memory requirements and the efficient concurrency options of its basis Proteinortho, making the software applicable to very large datasets. PMID:25137074

  10. Genomic Data Quality Impacts Automated Detection of Lateral Gene Transfer in Fungi

    PubMed Central

    Dupont, Pierre-Yves; Cox, Murray P.

    2017-01-01

    Lateral gene transfer (LGT, also known as horizontal gene transfer), an atypical mechanism of transferring genes between species, has almost become the default explanation for genes that display an unexpected composition or phylogeny. Numerous methods of detecting LGT events all rely on two fundamental strategies: primary structure composition or gene tree/species tree comparisons. Discouragingly, the results of these different approaches rarely coincide. With the wealth of genome data now available, detection of laterally transferred genes is increasingly being attempted in large uncurated eukaryotic datasets. However, detection methods depend greatly on the quality of the underlying genomic data, which are typically complex for eukaryotes. Furthermore, given the automated nature of genomic data collection, it is typically impractical to manually verify all protein or gene models, orthology predictions, and multiple sequence alignments, requiring researchers to accept a substantial margin of error in their datasets. Using a test case comprising plant-associated genomes across the fungal kingdom, this study reveals that composition- and phylogeny-based methods have little statistical power to detect laterally transferred genes. In particular, phylogenetic methods reveal extreme levels of topological variation in fungal gene trees, the vast majority of which show departures from the canonical species tree. Therefore, it is inherently challenging to detect LGT events in typical eukaryotic genomes. This finding is in striking contrast to the large number of claims for laterally transferred genes in eukaryotic species that routinely appear in the literature, and questions how many of these proposed examples are statistically well supported. PMID:28235827

  11. The king cobra genome reveals dynamic gene evolution and adaptation in the snake venom system

    PubMed Central

    Vonk, Freek J.; Casewell, Nicholas R.; Henkel, Christiaan V.; Heimberg, Alysha M.; Jansen, Hans J.; McCleary, Ryan J. R.; Kerkkamp, Harald M. E.; Vos, Rutger A.; Guerreiro, Isabel; Calvete, Juan J.; Wüster, Wolfgang; Woods, Anthony E.; Logan, Jessica M.; Harrison, Robert A.; Castoe, Todd A.; de Koning, A. P. Jason; Pollock, David D.; Yandell, Mark; Calderon, Diego; Renjifo, Camila; Currier, Rachel B.; Salgado, David; Pla, Davinia; Sanz, Libia; Hyder, Asad S.; Ribeiro, José M. C.; Arntzen, Jan W.; van den Thillart, Guido E. E. J. M.; Boetzer, Marten; Pirovano, Walter; Dirks, Ron P.; Spaink, Herman P.; Duboule, Denis; McGlinn, Edwina; Kini, R. Manjunatha; Richardson, Michael K.

    2013-01-01

    Snakes are limbless predators, and many species use venom to help overpower relatively large, agile prey. Snake venoms are complex protein mixtures encoded by several multilocus gene families that function synergistically to cause incapacitation. To examine venom evolution, we sequenced and interrogated the genome of a venomous snake, the king cobra (Ophiophagus hannah), and compared it, together with our unique transcriptome, microRNA, and proteome datasets from this species, with data from other vertebrates. In contrast to the platypus, the only other venomous vertebrate with a sequenced genome, we find that snake toxin genes evolve through several distinct co-option mechanisms and exhibit surprisingly variable levels of gene duplication and directional selection that correlate with their functional importance in prey capture. The enigmatic accessory venom gland shows a very different pattern of toxin gene expression from the main venom gland and seems to have recruited toxin-like lectin genes repeatedly for new nontoxic functions. In addition, tissue-specific microRNA analyses suggested the co-option of core genetic regulatory components of the venom secretory system from a pancreatic origin. Although the king cobra is limbless, we recovered coding sequences for all Hox genes involved in amniote limb development, with the exception of Hoxd12. Our results provide a unique view of the origin and evolution of snake venom and reveal multiple genome-level adaptive responses to natural selection in this complex biological weapon system. More generally, they provide insight into mechanisms of protein evolution under strong selection. PMID:24297900

  12. Deep sequencing of small RNA repertoires in mice reveals metabolic disorders-associated hepatic miRNAs.

    PubMed

    Liang, Tingming; Liu, Chang; Ye, Zhenchao

    2013-01-01

    Obesity and associated metabolic disorders contribute importantly to the metabolic syndrome. On the other hand, microRNAs (miRNAs) are a class of small non-coding RNAs that repress target gene expression by inducing mRNA degradation and/or translation repression. Dysregulation of specific miRNAs in obesity may influence energy metabolism and cause insulin resistance, which leads to dyslipidemia, steatosis hepatis and type 2 diabetes. In the present study, we comprehensively analyzed and validated dysregulated miRNAs in ob/ob mouse liver, as well as miRNA groups based on miRNA gene cluster and gene family by using deep sequencing miRNA datasets. We found that over 13.8% of the total analyzed miRNAs were dysregulated, of which 37 miRNA species showed significantly differential expression. Further RT-qPCR analysis of selected miRNAs validated the expression patterns observed in deep sequencing. Interestingly, we found that miRNA gene cluster and family always showed consistent dysregulation patterns in ob/ob mouse liver, although they had various enrichment levels. Functional enrichment analysis revealed the versatile physiological roles (over six signal pathways and five human diseases) of these miRNAs. Biological studies indicated that overexpression of miR-126 or inhibition of miR-24 in AML-12 cells attenuated free fatty acids-induced fat accumulation. Taken together, our data strongly suggest that obesity and metabolic disturbance are tightly associated with functional miRNAs. We also identified hepatic miRNA candidates serving as potential biomarkers for the diagnosis of the metabolic syndrome.

  13. Factors That Affect Large Subunit Ribosomal DNA Amplicon Sequencing Studies of Fungal Communities: Classification Method, Primer Choice, and Error

    PubMed Central

    Porter, Teresita M.; Golding, G. Brian

    2012-01-01

    Nuclear large subunit ribosomal DNA is widely used in fungal phylogenetics and, to an increasing extent, in amplicon-based environmental sequencing. The relatively short reads produced by next-generation sequencing, however, make primer choice and sequence error important variables for obtaining accurate taxonomic classifications. In this simulation study we tested the performance of three classification methods: 1) a similarity-based method (BLAST + Metagenomic Analyzer, MEGAN); 2) a composition-based method (Ribosomal Database Project naïve Bayesian classifier, NBC); and, 3) a phylogeny-based method (Statistical Assignment Package, SAP). We also tested the effects of sequence length, primer choice, and sequence error on classification accuracy and perceived community composition. Using a leave-one-out cross validation approach, results for classifications to the genus rank were as follows: BLAST + MEGAN had the lowest error rate and was particularly robust to sequence error; SAP accuracy was highest when long LSU query sequences were classified; and, NBC runs significantly faster than the other tested methods. All methods performed poorly with the shortest 50–100 bp sequences. Increasing simulated sequence error reduced classification accuracy. Community shifts were detected due to sequence error and primer selection even though there was no change in the underlying community composition. Short read datasets from individual primers, as well as pooled datasets, appear to only approximate the true community composition. We hope this work informs investigators of some of the factors that affect the quality and interpretation of their environmental gene surveys. PMID:22558215

  14. Low Maternal Microbiota Sharing across Gut, Breast Milk and Vagina, as Revealed by 16S rRNA Gene and Reduced Metagenomic Sequencing.

    PubMed

    Avershina, Ekaterina; Angell, Inga Leena; Simpson, Melanie; Storrø, Ola; Øien, Torbjørn; Johnsen, Roar; Rudi, Knut

    2018-05-01

    The maternal microbiota plays an important role in infant gut colonization. In this work we have investigated which bacterial species are shared across the breast milk, vaginal and stool microbiotas of 109 women shortly before and after giving birth using 16S rRNA gene sequencing and a novel reduced metagenomic sequencing (RMS) approach in a subgroup of 16 women. All the species predicted by the 16S rRNA gene sequencing were also detected by RMS analysis and there was good correspondence between their relative abundances estimated by both approaches. Both approaches also demonstrate a low level of maternal microbiota sharing across the population and RMS analysis identified only two species common to most women and in all sample types (Bifidobacterium longum and Enterococcus faecalis). Breast milk was the only sample type that had significantly higher intra- than inter- individual similarity towards both vaginal and stool samples. We also searched our RMS dataset against an in silico generated reference database derived from bacterial isolates in the Human Microbiome Project. The use of this reference-based search enabled further separation of Bifidobacterium longum into Bifidobacterium longum ssp. longum and Bifidobacterium longum ssp. infantis. We also detected the Lactobacillus rhamnosus GG strain, which was used as a probiotic supplement by some women, demonstrating the potential of RMS approach for deeper taxonomic delineation and estimation.

  15. Low Maternal Microbiota Sharing across Gut, Breast Milk and Vagina, as Revealed by 16S rRNA Gene and Reduced Metagenomic Sequencing

    PubMed Central

    Angell, Inga Leena; Storrø, Ola; Øien, Torbjørn; Johnsen, Roar; Rudi, Knut

    2018-01-01

    The maternal microbiota plays an important role in infant gut colonization. In this work we have investigated which bacterial species are shared across the breast milk, vaginal and stool microbiotas of 109 women shortly before and after giving birth using 16S rRNA gene sequencing and a novel reduced metagenomic sequencing (RMS) approach in a subgroup of 16 women. All the species predicted by the 16S rRNA gene sequencing were also detected by RMS analysis and there was good correspondence between their relative abundances estimated by both approaches. Both approaches also demonstrate a low level of maternal microbiota sharing across the population and RMS analysis identified only two species common to most women and in all sample types (Bifidobacterium longum and Enterococcus faecalis). Breast milk was the only sample type that had significantly higher intra- than inter- individual similarity towards both vaginal and stool samples. We also searched our RMS dataset against an in silico generated reference database derived from bacterial isolates in the Human Microbiome Project. The use of this reference-based search enabled further separation of Bifidobacterium longum into Bifidobacterium longum ssp. longum and Bifidobacterium longum ssp. infantis. We also detected the Lactobacillus rhamnosus GG strain, which was used as a probiotic supplement by some women, demonstrating the potential of RMS approach for deeper taxonomic delineation and estimation. PMID:29724017

  16. Identifying N6-methyladenosine sites using multi-interval nucleotide pair position specificity and support vector machine

    NASA Astrophysics Data System (ADS)

    Xing, Pengwei; Su, Ran; Guo, Fei; Wei, Leyi

    2017-04-01

    N6-methyladenosine (m6A) refers to methylation of the adenosine nucleotide at the nitrogen-6 position. It plays an important role in a series of biological processes, such as splicing events, mRNA export, nascent mRNA synthesis, nuclear translocation and translation. Numerous experiments have successfully characterized m6A sites within sequences since high-resolution mapping of m6A sites was established. However, with the explosive growth of genomic sequences, using experimental methods to identify m6A sites is time-consuming and expensive. Thus, it is highly desirable to develop fast and accurate computational identification methods. In this study, we propose a sequence-based predictor called RAM-NPPS for identifying m6A sites within RNA sequences, in which we present a novel feature representation algorithm based on multi-interval nucleotide pair position specificity, and use a support vector machine classifier to construct the prediction model. Comparison results show that our proposed method outperforms the state-of-the-art predictors on three benchmark datasets across the three species, indicating the effectiveness and robustness of our method. Moreover, an online webserver implementing the proposed predictor has been established at http://server.malab.cn/RAM-NPPS/. It is anticipated to be a useful prediction tool to help biologists reveal the mechanisms of m6A site functions.

  17. Modeling ecological drivers in marine viral communities using comparative metagenomics and network analyses.

    PubMed

    Hurwitz, Bonnie L; Westveld, Anton H; Brum, Jennifer R; Sullivan, Matthew B

    2014-07-22

    Long-standing questions in marine viral ecology are centered on understanding how viral assemblages change along gradients in space and time. However, investigating these fundamental ecological questions has been challenging due to incomplete representation of naturally occurring viral diversity in single gene- or morphology-based studies and an inability to identify up to 90% of reads in viral metagenomes (viromes). Although protein clustering techniques provide a significant advance by helping organize this unknown metagenomic sequence space, they typically use only ∼75% of the data and rely on assembly methods not yet tuned for naturally occurring sequence variation. Here, we introduce an annotation- and assembly-free strategy for comparative metagenomics that combines shared k-mer and social network analyses (regression modeling). This robust statistical framework enables visualization of complex sample networks and determination of ecological factors driving community structure. Application to 32 viromes from the Pacific Ocean Virome dataset identified clusters of samples broadly delineated by photic zone and revealed that geographic region, depth, and proximity to shore were significant predictors of community structure. Within subsets of this dataset, depth, season, and oxygen concentration were significant drivers of viral community structure at a single open ocean station, whereas variability along onshore-offshore transects was driven by oxygen concentration in an area with an oxygen minimum zone and not depth or proximity to shore, as might be expected. Together these results demonstrate that this highly scalable approach using complete metagenomic network-based comparisons can both test and generate hypotheses for ecological investigation of viral and microbial communities in nature.
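
    The shared k-mer idea mentioned above can be illustrated with a small sketch. It is an assumption-laden simplification, not the authors' pipeline: it scores two samples by the Jaccard similarity of their k-mer sets, and the function names (`kmers`, `sample_kmers`, `jaccard`) are hypothetical. The resulting pairwise scores are the kind of sample-by-sample matrix a network analysis could start from.

```python
def kmers(seq, k):
    """Return the set of all length-k substrings of one read."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def sample_kmers(reads, k):
    """Union of the k-mer sets of every read in one sample."""
    result = set()
    for read in reads:
        result |= kmers(read, k)
    return result


def jaccard(a, b):
    """Jaccard similarity of two sets (defined as 0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Two toy "samples" (hypothetical reads), compared with k = 3.
sample_a = sample_kmers(["ACGTAC"], 3)
sample_b = sample_kmers(["GTACCA"], 3)
similarity = jaccard(sample_a, sample_b)
```

    Real viromes contain far too many k-mers for in-memory Python sets; dedicated k-mer counters are normally used, and the choice of similarity measure here is only one of several reasonable options.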

  18. Modeling ecological drivers in marine viral communities using comparative metagenomics and network analyses

    PubMed Central

    Hurwitz, Bonnie L.; Westveld, Anton H.; Brum, Jennifer R.; Sullivan, Matthew B.

    2014-01-01

    Long-standing questions in marine viral ecology are centered on understanding how viral assemblages change along gradients in space and time. However, investigating these fundamental ecological questions has been challenging due to incomplete representation of naturally occurring viral diversity in single gene- or morphology-based studies and an inability to identify up to 90% of reads in viral metagenomes (viromes). Although protein clustering techniques provide a significant advance by helping organize this unknown metagenomic sequence space, they typically use only ∼75% of the data and rely on assembly methods not yet tuned for naturally occurring sequence variation. Here, we introduce an annotation- and assembly-free strategy for comparative metagenomics that combines shared k-mer and social network analyses (regression modeling). This robust statistical framework enables visualization of complex sample networks and determination of ecological factors driving community structure. Application to 32 viromes from the Pacific Ocean Virome dataset identified clusters of samples broadly delineated by photic zone and revealed that geographic region, depth, and proximity to shore were significant predictors of community structure. Within subsets of this dataset, depth, season, and oxygen concentration were significant drivers of viral community structure at a single open ocean station, whereas variability along onshore–offshore transects was driven by oxygen concentration in an area with an oxygen minimum zone and not depth or proximity to shore, as might be expected. Together these results demonstrate that this highly scalable approach using complete metagenomic network-based comparisons can both test and generate hypotheses for ecological investigation of viral and microbial communities in nature. PMID:25002514

  19. Encasement and subsidence of salt minibasins: observations from the SE Precaspian Basin and numerical modeling.

    NASA Astrophysics Data System (ADS)

    Fernandez, Naiara; Duffy, Oliver B.; Hudec, Michael R.; Jackson, Christopher A.-L.; Dooley, Tim P.; Jackson, Martin P. A.; Burg, George

    2017-04-01

    The SE Precaspian Basin is characterized by an assemblage of Upper Permian to Triassic minibasins. A recently acquired borehole-constrained 3D reflection dataset reveals the existence of abundant intrasalt reflection packages lying in between the Permo-Triassic minibasins. We propose that most of the mapped intrasalt reflection packages in the study area are minibasins originally deposited on top of salt that were later incorporated into salt by encasement processes. This makes the SE Precaspian Basin a new example of a salt province populated by encased minibasins, which until now had been mainly described from the Gulf of Mexico. Identifying salt-encased sediment packages in the study area has been crucial, not only because they provide a new exploration target, but also because they can play a key role on improving seismic imaging of adjacent or deeper stratigraphic sections. Another remarkable feature observed in the seismic dataset is the widespread occurrence of distinct seismic sequences in the Permo-Triassic minibasins. Bowl- and wedge-shaped seismic sequences define discrete periods of vertical and asymmetric minibasin subsidence. In the absence of shortening, the bowl-to-wedge transition is typically associated with the timing of basal welding and subsequent rotation of the minibasins. Timing of minibasin welding has important implications when addressing the likelihood of suprasalt reservoir charging. We performed a set of 2D numerical simulations aimed at investigating what drives the tilting of minibasins and how it relates to welding. A key observation from the numerical models is that the bowl-to-wedge transition can predate the time of basal welding.

  20. Spherical: an iterative workflow for assembling metagenomic datasets.

    PubMed

    Hitch, Thomas C A; Creevey, Christopher J

    2018-01-24

    The consensus emerging from the study of microbiomes is that they are far more complex than previously thought, requiring better assemblies and increasingly deeper sequencing. However, current metagenomic assembly techniques regularly fail to incorporate all, or even the majority in some cases, of the sequence information generated for many microbiomes, negating this effort. This can especially bias the information gathered and the perceived importance of the minor taxa in a microbiome. We propose a simple but effective approach, implemented in Python, to address this problem. Based on an iterative methodology, our workflow (called Spherical) carries out successive rounds of assemblies with the sequencing reads not yet utilised. This approach also allows the user to reduce the resources required for very large datasets, by assembling random subsets of the whole in a "divide and conquer" manner. We demonstrate the accuracy of Spherical using simulated data based on completely sequenced genomes and the effectiveness of the workflow at retrieving lost information for taxa in three published metagenomics studies of varying sizes. Our results show that Spherical increased the number of reads utilized in the assembly by up to 109% compared to the base assembly. The additional contigs assembled by the Spherical workflow resulted in significant (P < 0.05) changes in the predicted taxonomic profile of all datasets analysed. Spherical is implemented in Python 2.7 and freely available for use under the MIT license. Source code and documentation are hosted publicly at: https://github.com/thh32/Spherical .
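
    The iterative scheme described above (assemble, set aside the reads the assembly explains, then re-assemble what is left, optionally on random subsets) can be sketched as a loop. This is an illustration only and not Spherical's code: `iterative_assembly`, `assemble` and `maps_to` are hypothetical names, with the assembler and the read-to-contig alignment test supplied by the caller.

```python
import random


def iterative_assembly(reads, assemble, maps_to,
                       max_rounds=5, subset_fraction=1.0, seed=0):
    """Repeatedly assemble the reads that earlier rounds did not use.

    assemble(subset) -> list of contigs        (caller-supplied, hypothetical)
    maps_to(read, contig) -> bool              (caller-supplied, hypothetical)
    Returns (all contigs, reads still unused).
    """
    rng = random.Random(seed)
    remaining = list(reads)
    all_contigs = []
    for _ in range(max_rounds):
        if not remaining:
            break
        # "Divide and conquer": assemble only a random subset of unused reads.
        n = max(1, int(len(remaining) * subset_fraction))
        subset = rng.sample(remaining, n)
        contigs = assemble(subset)
        if not contigs:
            break
        all_contigs.extend(contigs)
        # Keep only the reads that none of the new contigs account for.
        remaining = [r for r in remaining
                     if not any(maps_to(r, c) for c in contigs)]
    return all_contigs, remaining
```

    In practice `assemble` would wrap an external assembler and `maps_to` a read aligner; the stopping rule used here (a fixed number of rounds, or no reads or contigs left) is an assumption rather than the published criterion.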

  1. Genome-wide gene–gene interaction analysis for next-generation sequencing

    PubMed Central

    Zhao, Jinying; Zhu, Yun; Xiong, Momiao

    2016-01-01

    The critical barrier in interaction analysis for next-generation sequencing (NGS) data is that the traditional pairwise interaction analysis that is suitable for common variants is difficult to apply to rare variants because of their prohibitive computational time, large number of tests and low power. The great challenges for successful detection of interactions with NGS data are (1) the demands in the paradigm of changes in interaction analysis; (2) severe multiple testing; and (3) heavy computations. To meet these challenges, we shift the paradigm of interaction analysis between two SNPs to interaction analysis between two genomic regions. In other words, we take a gene as a unit of analysis and use functional data analysis techniques as dimensional reduction tools to develop a novel statistic to collectively test interaction between all possible pairs of SNPs within two genome regions. By intensive simulations, we demonstrate that the functional logistic regression for interaction analysis has the correct type 1 error rates and higher power to detect interaction than the currently used methods. The proposed method was applied to a coronary artery disease dataset from the Wellcome Trust Case Control Consortium (WTCCC) study and the Framingham Heart Study (FHS) dataset, and the early-onset myocardial infarction (EOMI) exome sequence datasets with European origin from the NHLBI's Exome Sequencing Project. We discovered that 6 of 27 pairs of significantly interacted genes in the FHS were replicated in the independent WTCCC study and 24 pairs of significantly interacted genes after applying Bonferroni correction in the EOMI study. PMID:26173972

  2. Joint estimation of motion and illumination change in a sequence of images

    NASA Astrophysics Data System (ADS)

    Koo, Ja-Keoung; Kim, Hyo-Hun; Hong, Byung-Woo

    2015-09-01

    We present an algorithm that simultaneously computes optical flow and estimates illumination change from an image sequence in a unified framework. We propose an energy functional consisting of conventional optical flow energy based on Horn-Schunck method and an additional constraint that is designed to compensate for illumination changes. Any undesirable illumination change that occurs in the imaging procedure in a sequence while the optical flow is being computed is considered a nuisance factor. In contrast to the conventional optical flow algorithm based on Horn-Schunck functional, which assumes the brightness constancy constraint, our algorithm is shown to be robust with respect to temporal illumination changes in the computation of optical flows. An efficient conjugate gradient descent technique is used in the optimization procedure as a numerical scheme. The experimental results obtained from the Middlebury benchmark dataset demonstrate the robustness and the effectiveness of our algorithm. In addition, comparative analysis of our algorithm and Horn-Schunck algorithm is performed on the additional test dataset that is constructed by applying a variety of synthetic bias fields to the original image sequences in the Middlebury benchmark dataset in order to demonstrate that our algorithm outperforms the Horn-Schunck algorithm. The superior performance of the proposed method is observed in terms of both qualitative visualizations and quantitative accuracy errors when compared to Horn-Schunck optical flow algorithm that easily yields poor results in the presence of small illumination changes leading to violation of the brightness constancy constraint.
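
    For reference, the brightness constancy constraint and the classical Horn-Schunck energy that the abstract builds on can be written as follows. The notation is the standard textbook one and is an assumption here: $I$ is image intensity, $(u, v)$ the flow field, $\alpha$ the smoothness weight and $\Omega$ the image domain. The paper's additional illumination-compensation term is not given in the abstract and is therefore not reproduced.

```latex
% Brightness constancy and its first-order (linearized) form
\[
I(x+u,\; y+v,\; t+1) = I(x, y, t)
\quad\Longrightarrow\quad
I_x u + I_y v + I_t \approx 0
\]

% Classical Horn-Schunck energy: data term plus smoothness regularizer
\[
E_{\mathrm{HS}}(u, v) = \iint_{\Omega}
\Big[ \left( I_x u + I_y v + I_t \right)^2
+ \alpha^2 \left( \lVert \nabla u \rVert^2 + \lVert \nabla v \rVert^2 \right) \Big]
\, dx \, dy
\]
```

    A temporal illumination change makes the left-hand equality fail even for a correct flow, which is why the data term alone penalizes the true motion in that situation.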

  3. Layer-specific chromatin accessibility landscapes reveal regulatory networks in adult mouse visual cortex

    PubMed Central

    Gray, Lucas T; Yao, Zizhen; Nguyen, Thuc Nghi; Kim, Tae Kyung; Zeng, Hongkui; Tasic, Bosiljka

    2017-01-01

    Mammalian cortex is a laminar structure, with each layer composed of a characteristic set of cell types with different morphological, electrophysiological, and connectional properties. Here, we define chromatin accessibility landscapes of major, layer-specific excitatory classes of neurons, and compare them to each other and to inhibitory cortical neurons using the Assay for Transposase-Accessible Chromatin with high-throughput sequencing (ATAC-seq). We identify a large number of layer-specific accessible sites, and significant association with genes that are expressed in specific cortical layers. Integration of these data with layer-specific transcriptomic profiles and transcription factor binding motifs enabled us to construct a regulatory network revealing potential key layer-specific regulators, including Cux1/2, Foxp2, Nfia, Pou3f2, and Rorb. This dataset is a valuable resource for identifying candidate layer-specific cis-regulatory elements in adult mouse cortex. DOI: http://dx.doi.org/10.7554/eLife.21883.001 PMID:28112643

  4. Genotypic and Phylogenetic Insights on Prevention of the Spread of HIV-1 and Drug Resistance in “Real-World” Settings

    PubMed Central

    Brenner, Bluma G.; Ibanescu, Ruxandra-Ilinca; Hardy, Isabelle; Roger, Michel

    2017-01-01

    HIV continues to spread among vulnerable heterosexual (HET), Men-having-Sex with Men (MSM) and intravenous drug user (IDU) populations, influenced by a complex array of biological, behavioral and societal factors. Phylogenetic analyses of large sequence datasets from national drug resistance testing programs reveal the evolutionary interrelationships of viral strains implicated in the dynamic spread of HIV in different regional settings. Viral phylogenetics can be combined with demographic and behavioral information to gain insights on epidemiological processes shaping transmission networks at the population-level. Drug resistance testing programs also reveal emergent mutational pathways leading to resistance to the 23 antiretroviral drugs used in HIV-1 management in low-, middle- and high-income settings. This article describes how genotypic and phylogenetic information from Quebec and elsewhere provides critical information on HIV transmission and resistance. Cumulative findings can be used to optimize public health strategies to tackle the challenges of HIV in “real-world” settings. PMID:29283390

  5. Deciphering microbial interactions and detecting keystone species with co-occurrence networks.

    PubMed

    Berry, David; Widder, Stefanie

    2014-01-01

    Co-occurrence networks produced from microbial survey sequencing data are frequently used to identify interactions between community members. While this approach has potential to reveal ecological processes, it has been insufficiently validated due to the technical limitations inherent in studying complex microbial ecosystems. Here, we simulate multi-species microbial communities with known interaction patterns using generalized Lotka-Volterra dynamics. We then construct co-occurrence networks and evaluate how well networks reveal the underlying interactions and how experimental and ecological parameters can affect network inference and interpretation. We find that co-occurrence networks can recapitulate interaction networks under certain conditions, but that they lose interpretability when the effects of habitat filtering become significant. We demonstrate that networks suffer from local hot spots of spurious correlation in the neighborhood of hub species that engage in many interactions. We also identify topological features associated with keystone species in co-occurrence networks. This study provides a substantiated framework to guide environmental microbiologists in the construction and interpretation of co-occurrence networks from microbial survey datasets.
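
    The generalized Lotka-Volterra dynamics used for the simulated communities follow dx_i/dt = x_i (r_i + sum_j A_ij x_j). The sketch below integrates that system with a plain forward Euler step; the function name `simulate_glv`, the step size and the parameter values in the usage lines are illustrative assumptions, not the settings used in the study.

```python
def simulate_glv(x0, r, A, dt=0.01, steps=2000):
    """Forward-Euler integration of generalized Lotka-Volterra dynamics.

    x0: initial abundances, r: intrinsic growth rates,
    A:  interaction matrix, where A[i][j] is the effect of species j on i.
    Returns the abundances after `steps` steps of size `dt`.
    """
    x = list(x0)
    n = len(x)
    for _ in range(steps):
        # Per-capita growth rate of each species at the current state.
        growth = [r[i] + sum(A[i][j] * x[j] for j in range(n))
                  for i in range(n)]
        # Euler update; abundances are clipped at zero.
        x = [max(0.0, x[i] + dt * x[i] * growth[i]) for i in range(n)]
    return x


# Two weakly competing species (hypothetical parameters).
final = simulate_glv([0.1, 0.2], [1.0, 1.0],
                     [[-1.0, -0.5], [-0.5, -1.0]], steps=5000)
```

    Abundance tables sampled from many such runs, with a known matrix A, are what make it possible to compare an inferred co-occurrence network against the true interactions.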

  6. Binary Interval Search: a scalable algorithm for counting interval intersections.

    PubMed

    Layer, Ryan M; Skadron, Kevin; Robins, Gabriel; Hall, Ira M; Quinlan, Aaron R

    2013-01-01

    The comparison of diverse genomic datasets is fundamental to understand genome biology. Researchers must explore many large datasets of genome intervals (e.g. genes, sequence alignments) to place their experimental results in a broader context and to make new discoveries. Relationships between genomic datasets are typically measured by identifying intervals that intersect, that is, they overlap and thus share a common genome interval. Given the continued advances in DNA sequencing technologies, efficient methods for measuring statistically significant relationships between many sets of genomic features are crucial for future discovery. We introduce the Binary Interval Search (BITS) algorithm, a novel and scalable approach to interval set intersection. We demonstrate that BITS outperforms existing methods at counting interval intersections. Moreover, we show that BITS is intrinsically suited to parallel computing architectures, such as graphics processing units by illustrating its utility for efficient Monte Carlo simulations measuring the significance of relationships between sets of genomic intervals. https://github.com/arq5x/bits.
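
    Counting interval intersections with binary searches can be sketched as below. This is a minimal illustration of the counting idea, assuming closed intervals, and it is not the BITS source code: intervals that do not intersect the query either end before it starts or start after it ends, so both groups can be counted with one binary search each and subtracted from the total. The names `build_index` and `count_intersections` are hypothetical.

```python
from bisect import bisect_left, bisect_right


def build_index(intervals):
    """Sort interval starts and ends independently (done once per dataset)."""
    starts = sorted(s for s, _ in intervals)
    ends = sorted(e for _, e in intervals)
    return starts, ends


def count_intersections(starts, ends, q_start, q_end):
    """Count closed intervals that overlap the closed query [q_start, q_end]."""
    n = len(starts)
    ends_before = bisect_left(ends, q_start)          # intervals with end < q_start
    starts_after = n - bisect_right(starts, q_end)    # intervals with start > q_end
    return n - ends_before - starts_after


starts, ends = build_index([(1, 5), (3, 8), (10, 12), (20, 25)])
hits = count_intersections(starts, ends, 4, 10)
```

    Each query costs two binary searches and never enumerates the overlapping intervals, and because queries are independent of one another they are easy to run in parallel.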

  7. The metagenomic data life-cycle: standards and best practices

    PubMed Central

    ten Hoopen, Petra; Finn, Robert D.; Bongo, Lars Ailo; Corre, Erwan; Meyer, Folker; Mitchell, Alex; Pelletier, Eric; Pesole, Graziano; Santamaria, Monica; Willassen, Nils Peder

    2017-01-01

    Metagenomics data analyses from independent studies can only be compared if the analysis workflows are described in a harmonized way. In this overview, we have mapped the landscape of data standards available for the description of essential steps in metagenomics: (i) material sampling, (ii) material sequencing, (iii) data analysis, and (iv) data archiving and publishing. Taking examples from marine research, we summarize essential variables used to describe material sampling processes and sequencing procedures in a metagenomics experiment. These aspects of metagenomics dataset generation have been to some extent addressed by the scientific community, but greater awareness and adoption is still needed. We emphasize the lack of standards relating to reporting how metagenomics datasets are analysed and how the metagenomics data analysis outputs should be archived and published. We propose best practice as a foundation for a community standard to enable reproducibility and better sharing of metagenomics datasets, leading ultimately to greater metagenomics data reuse and repurposing. PMID:28637310

  8. The metagenomic data life-cycle: standards and best practices

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    ten Hoopen, Petra; Finn, Robert D.; Bongo, Lars Ailo

    Metagenomics data analyses from independent studies can only be compared if the analysis workflows are described in a harmonised way. In this overview, we have mapped the landscape of data standards available for the description of essential steps in metagenomics: (1) material sampling, (2) material sequencing, (3) data analysis and (4) data archiving & publishing. Taking examples from marine research, we summarise essential variables used to describe material sampling processes and sequencing procedures in a metagenomics experiment. These aspects of metagenomics dataset generation have been to some extent addressed by the scientific community but greater awareness and adoption is still needed. We emphasise the lack of standards relating to reporting how metagenomics datasets are analysed and how the metagenomics data analysis outputs should be archived and published. We propose best practice as a foundation for a community standard to enable reproducibility and better sharing of metagenomics datasets, leading ultimately to greater metagenomics data reuse and repurposing.

  9. GUIDEseq: a bioconductor package to analyze GUIDE-Seq datasets for CRISPR-Cas nucleases.

    PubMed

    Zhu, Lihua Julie; Lawrence, Michael; Gupta, Ankit; Pagès, Hervé; Kucukural, Alper; Garber, Manuel; Wolfe, Scot A

    2017-05-15

    Genome editing technologies developed around the CRISPR-Cas9 nuclease system have facilitated the investigation of a broad range of biological questions. These nucleases also hold tremendous promise for treating a variety of genetic disorders. In the context of their therapeutic application, it is important to identify the spectrum of genomic sequences that are cleaved by a candidate nuclease when programmed with a particular guide RNA, as well as the cleavage efficiency of these sites. Powerful new experimental approaches, such as GUIDE-seq, facilitate the sensitive, unbiased genome-wide detection of nuclease cleavage sites within the genome. Flexible bioinformatics analysis tools for processing GUIDE-seq data are needed. Here, we describe an open source, open development software suite, GUIDEseq, for GUIDE-seq data analysis and annotation as a Bioconductor package in R. The GUIDEseq package provides a flexible platform with more than 60 adjustable parameters for the analysis of datasets associated with custom nuclease applications. These parameters allow data analysis to be tailored to different nuclease platforms with different length and complexity in their guide and PAM recognition sequences or their DNA cleavage position. They also enable users to customize sequence aggregation criteria, and vary peak calling thresholds that can influence the number of potential off-target sites recovered. GUIDEseq also annotates potential off-target sites that overlap with genes based on genome annotation information, as these may be the most important off-target sites for further characterization. In addition, GUIDEseq enables the comparison and visualization of off-target site overlap between different datasets for a rapid comparison of different nuclease configurations or experimental conditions. 
For each identified off-target, the GUIDEseq package outputs the mapped GUIDE-seq read count as well as the cleavage score from a user-specified off-target cleavage score prediction algorithm, permitting the identification of genomic sequences with unexpected cleavage activity. The GUIDEseq package enables analysis of GUIDE-seq data from various nuclease platforms for any species with a defined genomic sequence. This software package has been used successfully to analyze several GUIDE-seq datasets. The software, source code and documentation are freely available at http://www.bioconductor.org/packages/release/bioc/html/GUIDEseq.html .

  10. Phylo-VISTA: Interactive visualization of multiple DNA sequence alignments

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Shah, Nameeta; Couronne, Olivier; Pennacchio, Len A.

    The power of multi-sequence comparison for biological discovery is well established. The need for new capabilities to visualize and compare cross-species alignment data is intensified by the growing number of genomic sequence datasets being generated for an ever-increasing number of organisms. To be efficient these visualization algorithms must support the ability to accommodate consistently a wide range of evolutionary distances in a comparison framework based upon phylogenetic relationships. Results: We have developed Phylo-VISTA, an interactive tool for analyzing multiple alignments by visualizing a similarity measure for multiple DNA sequences. The complexity of visual presentation is effectively organized using a framework based upon interspecies phylogenetic relationships. The phylogenetic organization supports rapid, user-guided interspecies comparison. To aid in navigation through large sequence datasets, Phylo-VISTA leverages concepts from VISTA that provide a user with the ability to select and view data at varying resolutions. The combination of multiresolution data visualization and analysis, combined with the phylogenetic framework for interspecies comparison, produces a highly flexible and powerful tool for visual data analysis of multiple sequence alignments. Availability: Phylo-VISTA is available at http://www-gsd.lbl.gov/phylovista. It requires an Internet browser with Java Plugin 1.4.2 and it is integrated into the global alignment program LAGAN at http://lagan.stanford.edu

  11. IMNGS: A comprehensive open resource of processed 16S rRNA microbial profiles for ecology and diversity studies.

    PubMed

    Lagkouvardos, Ilias; Joseph, Divya; Kapfhammer, Martin; Giritli, Sabahattin; Horn, Matthias; Haller, Dirk; Clavel, Thomas

    2016-09-23

    The SRA (Sequence Read Archive) serves as the primary depository for massive amounts of Next Generation Sequencing data, and currently hosts over 100,000 16S rRNA gene amplicon-based microbial profiles from various host habitats and environments. This number is increasing rapidly and there is a dire need for approaches to utilize this pool of knowledge. Here we created IMNGS (Integrated Microbial Next Generation Sequencing), an innovative platform that uniformly and systematically screens for and processes all prokaryotic 16S rRNA gene amplicon datasets available in SRA and uses them to build sample-specific sequence databases and OTU-based profiles. Via a web interface, this integrative sequence resource can easily be queried by users. We show examples of how the approach allows testing the ecological importance of specific microorganisms in different hosts or ecosystems, and performing targeted diversity studies for selected taxonomic groups. The platform also offers a complete workflow for de novo analysis of users' own raw 16S rRNA gene amplicon datasets for the sake of comparison with existing data. IMNGS can be accessed at www.imngs.org.

  12. Prediction and Identification of Krüppel-Like Transcription Factors by Machine Learning Method.

    PubMed

    Liao, Zhijun; Wang, Xinrui; Chen, Xingyong; Zou, Quan

    2017-01-01

    The Krüppel-like factors (KLFs) are a family of zinc finger (ZF) motif-containing transcription factors with 18 members in the human genome; among them, KLF18 is predicted by bioinformatics. KLFs have various physiological functions and are involved in a number of cancers and other diseases. Here we perform a binary classification of KLFs and non-KLFs by machine learning methods. The protein sequences of KLFs and non-KLFs were retrieved from UniProt and randomly separated into a training dataset (containing positive and negative sequences) and a test dataset (containing only negative sequences). After extracting 188-dimensional (188D) feature vectors, we carried out classification with four classifiers (GBDT, libSVM, RF, and k-NN). For the human KLFs, we further examined the evolutionary relationships and motif distribution, and finally analyzed the conserved amino acid residues of the three zinc fingers. The classifier models from the training dataset were well constructed; the highest specificity (Sp) was 99.83%, from a library for support vector machines (libSVM), and all the correct classification rates were over 70% for 10-fold cross-validation on the test dataset. The 18 human KLFs can be further divided into 7 groups, and the zinc finger domains were located at the carboxyl terminus; many amino acid residues, including cysteine and histidine, were conserved, and the span and interval between them were consistent in the three ZF domains. Two classification models for KLF prediction have been built by novel machine learning methods. Copyright© Bentham Science Publishers; For any queries, please email at epub@benthamscience.org.

  13. Association algorithm to mine the rules that govern enzyme definition and to classify protein sequences

    PubMed Central

    Chiu, Shih-Hau; Chen, Chien-Chi; Yuan, Gwo-Fang; Lin, Thy-Hou

    2006-01-01

    Background The number of sequences compiled in many genome projects is growing exponentially, but most of them have not been characterized experimentally. An automatic annotation scheme is urgently needed to reduce the gap between the amount of new sequences produced and reliable functional annotation. This work proposes rules for automatically classifying fungus genes. The approach involves elucidating the enzyme classifying rule that is hidden in the UniProt protein knowledgebase and then applying it for classification. The association algorithm, Apriori, is utilized to mine the relationship between the enzyme class and significant InterPro entries. The candidate rules are evaluated for their classificatory capacity. Results Five datasets were collected from Swiss-Prot for establishing the annotation rules. These were treated as the training sets. The TrEMBL entries were treated as the testing set. A correct enzyme classification rate of 70% was obtained for the prokaryote datasets and a similar rate of about 80% was obtained for the eukaryote datasets. The fungus training dataset, which lacks an enzyme class description, was also used to evaluate the fungus candidate rules. A total of 88 out of 5085 test entries were matched with the fungus rule set. These were otherwise poorly annotated using their functional descriptions. Conclusion The feasibility of using the method presented here to classify enzyme classes based on the enzyme domain rules is evident. The rules may also be employed by protein annotators in manual annotation or implemented in an automatic annotation flowchart. PMID:16776838
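
    As a rough illustration of the kind of rule mining described in this record, the sketch below runs a small level-wise Apriori search and keeps rules of the form "set of domain entries implies enzyme class". It is not the authors' implementation: the item names (`IPR_A`, `EC3`, and so on), the toy transactions, and the support and confidence thresholds are all hypothetical placeholders.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {itemset: support} for every itemset meeting min_support."""
    n = len(transactions)

    def support(itemset):
        return sum(itemset <= t for t in transactions) / n

    level = {frozenset([i]) for t in transactions for i in t}
    frequent, k = {}, 1
    while level:
        current = {s: support(s) for s in level}
        current = {s: v for s, v in current.items() if v >= min_support}
        frequent.update(current)
        k += 1
        # Join step: unions of frequent (k-1)-itemsets that have exactly k items.
        candidates = {a | b for a in current for b in current if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent.
        level = {c for c in candidates
                 if all(frozenset(s) in current for s in combinations(c, k - 1))}
    return frequent


def class_rules(transactions, class_items, min_support, min_confidence):
    """Rules (antecedent, class, support, confidence) with a class as consequent."""
    frequent = apriori(transactions, min_support)
    rules = []
    for itemset, sup in frequent.items():
        for cls in itemset & class_items:
            antecedent = itemset - {cls}
            if antecedent and not antecedent & class_items:
                confidence = sup / frequent[antecedent]
                if confidence >= min_confidence:
                    rules.append((antecedent, cls, sup, confidence))
    return rules


# Hypothetical annotated proteins: domain entries plus one enzyme class each.
proteins = [frozenset(t) for t in (
    {"IPR_A", "IPR_B", "EC3"}, {"IPR_A", "EC3"}, {"IPR_A", "IPR_C", "EC3"},
    {"IPR_C", "EC1"}, {"IPR_B", "IPR_C", "EC1"},
)]
rules = class_rules(proteins, frozenset({"EC1", "EC3"}), 0.4, 0.9)
```

    On this toy input the only surviving rule is `IPR_A` implying `EC3`, because every protein carrying `IPR_A` belongs to that class; in practice the thresholds would need tuning against a held-out set such as the TrEMBL entries mentioned above.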

  14. Reticulamoeba Is a Long-Branched Granofilosean (Cercozoa) That Is Missing from Sequence Databases

    PubMed Central

    Bass, David; Yabuki, Akinori; Santini, Sébastien; Romac, Sarah; Berney, Cédric

    2012-01-01

    We sequenced the 18S ribosomal RNA gene of seven isolates of the enigmatic marine amoeboflagellate Reticulamoeba Grell, which resolved into four genetically distinct Reticulamoeba lineages, two of which correspond to R. gemmipara Grell and R. minor Grell, another with a relatively large cell body forming lacunae, and another that has similarities to both R. minor and R. gemmipara but with a greater propensity to form cell clusters. These lineages together form a long-branched clade that branches within the cercozoan class Granofilosea (phylum Cercozoa), showing phylogenetic affinities with the genus Mesofila. The basic morphology of Reticulamoeba is a roundish or ovoid cell with a more or less irregular outline. Long and branched reticulopodia radiate from the cell. The reticulopodia bear granules that are bidirectionally motile. There is also a biflagellate dispersal stage. Reticulamoeba is frequently observed in coastal marine environmental samples. PCR primers specific to the Reticulamoeba clade confirm that it is a frequent member of benthic marine microbial communities, and is also found in brackish water sediments and freshwater biofilm. However, so far it has not been found in large molecular datasets such as the nucleotide database in NCBI GenBank, metagenomic datasets in Camera, and the marine microbial eukaryote sampling and sequencing consortium BioMarKs, although closely related lineages can be found in some of these datasets using a highly targeted approach. Therefore, although such datasets are very powerful tools in microbial ecology, they may, for several methodological reasons, fail to detect ecologically and evolutionarily key lineages. PMID:23226495

  15. Inaugural Genomics Automation Congress and the coming deluge of sequencing data.

    PubMed

    Creighton, Chad J

    2010-10-01

    Presentations at Select Biosciences's first 'Genomics Automation Congress' (Boston, MA, USA) in 2010 focused on next-generation sequencing and the platforms and methodology around them. The meeting provided an overview of sequencing technologies, both new and emerging. Speakers shared their recent work on applying sequencing to profile cells for various levels of biomolecular complexity, including DNA sequences, DNA copy, DNA methylation, mRNA and microRNA. With sequencing time and costs continuing to drop dramatically, a virtual explosion of very large sequencing datasets is at hand, which will probably present challenges and opportunities for high-level data analysis and interpretation, as well as for information technology infrastructure.

  16. Prediction of constitutive A-to-I editing sites from human transcriptomes in the absence of genomic sequences

    PubMed Central

    2013-01-01

    Background Adenosine-to-inosine (A-to-I) RNA editing is recognized as a cellular mechanism for generating both RNA and protein diversity. Inosine base pairs with cytidine during reverse transcription and therefore appears as guanosine during sequencing of cDNA. Current approaches of RNA editing identification largely depend on the comparison between transcriptomes and genomic DNA (gDNA) sequencing datasets from the same individuals, and it has been challenging to identify editing candidates from transcriptomes in the absence of gDNA information. Results We have developed a new strategy to accurately predict constitutive RNA editing sites from publicly available human RNA-seq datasets in the absence of relevant genomic sequences. Our approach establishes new parameters to increase the ability to map mismatches and to minimize sequencing/mapping errors and unreported genome variations. We identified 695 novel constitutive A-to-I editing sites that appear in clusters (named “editing boxes”) in multiple samples and which exhibit spatial and dynamic regulation across human tissues. Some of these editing boxes are enriched in non-repetitive regions lacking inverted repeat structures and contain an extremely high conversion frequency of As to Is. We validated a number of editing boxes in multiple human cell lines and confirmed that ADAR1 is responsible for the observed promiscuous editing events in non-repetitive regions, further expanding our knowledge of the catalytic substrate of A-to-I RNA editing by ADAR enzymes. Conclusions The approach we present here provides a novel way of identifying A-to-I RNA editing events by analyzing only RNA-seq datasets. This method has allowed us to gain new insights into RNA editing and should also aid in the identification of more constitutive A-to-I editing sites from additional transcriptomes. PMID:23537002

  17. Efficient algorithms for accurate hierarchical clustering of huge datasets: tackling the entire protein space

    PubMed Central

    Loewenstein, Yaniv; Portugaly, Elon; Fromer, Menachem; Linial, Michal

    2008-01-01

    Motivation: UPGMA (average linking) is probably the most popular algorithm for hierarchical data clustering, especially in computational biology. However, UPGMA requires the entire dissimilarity matrix in memory. Due to this prohibitive requirement, UPGMA is not scalable to very large datasets. Application: We present a novel class of memory-constrained UPGMA (MC-UPGMA) algorithms. Given any practical memory size constraint, this framework guarantees the correct clustering solution without explicitly requiring all dissimilarities in memory. The algorithms are general and are applicable to any dataset. We present a data-dependent characterization of hardness and clustering efficiency. The presented concepts are applicable to any agglomerative clustering formulation. Results: We apply our algorithm to the entire collection of protein sequences, to automatically build a comprehensive evolutionary-driven hierarchy of proteins from sequence alone. The newly created tree captures protein families better than state-of-the-art large-scale methods such as CluSTr, ProtoNet4 or single-linkage clustering. We demonstrate that leveraging the entire mass embodied in all sequence similarities allows significant improvement on current protein family clusterings, which are unable to directly tackle the sheer mass of this data. Furthermore, we argue that non-metric constraints are an inherent complexity of the sequence space and should not be overlooked. The robustness of UPGMA allows significant improvement, especially for multidomain proteins, and for large or divergent families. Availability: A comprehensive tree built from all UniProt sequence similarities, together with navigation and classification tools will be made available as part of the ProtoNet service. A C++ implementation of the algorithm is available on request. Contact: lonshy@cs.huji.ac.il PMID:18586742
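
    To make the memory requirement concrete, here is a minimal sketch of ordinary in-memory UPGMA, not the memory-constrained MC-UPGMA variant this record proposes. It keeps every pairwise dissimilarity in a dictionary, which is exactly the cost the abstract calls prohibitive for very large datasets. The function name, the input layout and the four-item toy matrix are assumptions made for illustration.

```python
from itertools import combinations


def upgma(labels, dist):
    """Naive in-memory UPGMA (average linkage).

    labels: iterable of item names.
    dist:   dict mapping frozenset({a, b}) -> dissimilarity (all pairs needed).
    Returns the merges in order as (cluster_a, cluster_b, dissimilarity).
    """
    clusters = [(label,) for label in labels]

    def average(c1, c2):
        # Mean of all pairwise dissimilarities between members of two clusters.
        total = sum(dist[frozenset((a, b))] for a in c1 for b in c2)
        return total / (len(c1) * len(c2))

    merges = []
    while len(clusters) > 1:
        # Pick the pair of clusters with the smallest average dissimilarity.
        c1, c2 = min(combinations(clusters, 2), key=lambda pair: average(*pair))
        merges.append((c1, c2, average(c1, c2)))
        clusters.remove(c1)
        clusters.remove(c2)
        clusters.append(c1 + c2)
    return merges


# Toy dissimilarities between four hypothetical sequences A-D.
toy = {
    frozenset("AB"): 2, frozenset("AC"): 6, frozenset("AD"): 10,
    frozenset("BC"): 6, frozenset("BD"): 10, frozenset("CD"): 8,
}
merges = upgma("ABCD", toy)
for left, right, d in merges:
    print(left, right, round(d, 3))
```

    The quadratic storage of `dist` and the repeated scan over all cluster pairs are what make this form impractical at protein-space scale.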

  18. SearchSmallRNA: a graphical interface tool for the assemblage of viral genomes using small RNA libraries data.

    PubMed

    de Andrade, Roberto R S; Vaslin, Maite F S

    2014-03-07

    Next-generation parallel sequencing (NGS) allows the identification of viral pathogens by sequencing the small RNAs of infected hosts. Thus, viral genomes may be assembled from host immune response products without prior virus enrichment, amplification or purification. However, mapping of the vast information obtained presents a bioinformatics challenge. In order to bypass the need for command-line use and basic bioinformatics knowledge, we developed mapping software with a graphical interface for the assemblage of viral genomes from small RNA datasets obtained by NGS. SearchSmallRNA was developed in Java language version 7 using NetBeans IDE 7.1 software. The program also allows the analysis of the viral small interfering RNA (vsRNA) profile, providing an overview of the size distribution and other features of the vsRNAs produced in infected cells. The program performs comparisons between each sequenced read present in a library and a chosen reference genome. Reads showing Hamming distances smaller than or equal to an allowed number of mismatches are selected as positives and used for the assemblage of a long nucleotide genome sequence. In order to validate the software, distinct analyses using NGS datasets obtained from HIV and two plant viruses were used to reconstruct whole viral genomes. The SearchSmallRNA program was able to reconstruct viral genomes using NGS small RNA datasets with a high degree of reliability, so it will be a valuable tool for virus sequencing and discovery. It is accessible and free to all research communities and has the advantage of an easy-to-use graphical interface. SearchSmallRNA was written in Java and is freely available at http://www.microbiologia.ufrj.br/ssrna/.
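
    The read-selection criterion described here can be sketched as follows. This is an illustrative Python outline, not SearchSmallRNA's own Java code: it assumes each read is compared against every same-length window of the reference on the forward strand only, and the names `select_reads` and `max_mismatches` are hypothetical.

```python
def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs equal-length sequences")
    return sum(x != y for x, y in zip(a, b))


def select_reads(reads, reference, max_mismatches):
    """Return (position, read) pairs whose window is within max_mismatches."""
    hits = []
    for read in reads:
        k = len(read)
        # Slide the read along the reference, one base at a time.
        for pos in range(len(reference) - k + 1):
            if hamming(read, reference[pos:pos + k]) <= max_mismatches:
                hits.append((pos, read))
    return hits


# Toy reference and reads; real small RNA reads are typically about 18-30 nt.
reference = "ACGTACGTTAGC"
reads = ["ACGT", "TTAG", "GGGG"]
exact_hits = select_reads(reads, reference, max_mismatches=0)
```

    The positions returned for the accepted reads are what a later step would tile into a longer consensus sequence; raising `max_mismatches` trades specificity for tolerance of sequencing errors and viral variation.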

  19. SearchSmallRNA: a graphical interface tool for the assemblage of viral genomes using small RNA libraries data

    PubMed Central

    2014-01-01

    Background Next-generation parallel sequencing (NGS) allows the identification of viral pathogens by sequencing the small RNAs of infected hosts. Thus, viral genomes may be assembled from host immune response products without prior virus enrichment, amplification or purification. However, mapping of the vast information obtained presents a bioinformatics challenge. Methods In order to bypass the need for command-line use and basic bioinformatics knowledge, we developed mapping software with a graphical interface for the assemblage of viral genomes from small RNA datasets obtained by NGS. SearchSmallRNA was developed in Java language version 7 using NetBeans IDE 7.1 software. The program also allows the analysis of the viral small interfering RNA (vsRNA) profile, providing an overview of the size distribution and other features of the vsRNAs produced in infected cells. Results The program performs comparisons between each sequenced read present in a library and a chosen reference genome. Reads showing Hamming distances smaller than or equal to an allowed number of mismatches are selected as positives and used for the assemblage of a long nucleotide genome sequence. In order to validate the software, distinct analyses using NGS datasets obtained from HIV and two plant viruses were used to reconstruct whole viral genomes. Conclusions The SearchSmallRNA program was able to reconstruct viral genomes using NGS small RNA datasets with a high degree of reliability, so it will be a valuable tool for virus sequencing and discovery. It is accessible and free to all research communities and has the advantage of an easy-to-use graphical interface. Availability and implementation SearchSmallRNA was written in Java and is freely available at http://www.microbiologia.ufrj.br/ssrna/. PMID:24607237

  20. Validation of Genotyping-By-Sequencing Analysis in Populations of Tetraploid Alfalfa by 454 Sequencing

    PubMed Central

    Rocher, Solen; Jean, Martine; Castonguay, Yves; Belzile, François

    2015-01-01

    Genotyping-by-sequencing (GBS) is a relatively low-cost high throughput genotyping technology based on next generation sequencing and is applicable to orphan species with no reference genome. A combination of genome complexity reduction and multiplexing with DNA barcoding provides a simple and affordable way to resolve allelic variation between plant samples or populations. GBS was performed on ApeKI libraries using DNA from 48 genotypes each of two heterogeneous populations of tetraploid alfalfa (Medicago sativa spp. sativa): the synthetic cultivar Apica (ATF0) and a derived population (ATF5) obtained after five cycles of recurrent selection for superior tolerance to freezing (TF). Nearly 400 million reads were obtained from two lanes of an Illumina HiSeq 2000 sequencer and analyzed with the Universal Network-Enabled Analysis Kit (UNEAK) pipeline designed for species with no reference genome. Following the application of whole dataset-level filters, 11,694 single nucleotide polymorphism (SNP) loci were obtained. About 60% had a significant match on the Medicago truncatula syntenic genome. The accuracy of allelic ratios and genotype calls based on GBS data was directly assessed using 454 sequencing on a subset of SNP loci scored in eight plant samples. Sequencing depth in this study was not sufficient for accurate tetraploid allelic dosage, but reliable genotype calls based on diploid allelic dosage were obtained when using additional quality filtering. Principal Component Analysis of SNP loci in plant samples revealed that a small proportion (<5%) of the genetic variability assessed by GBS is able to differentiate ATF0 and ATF5. Our results confirm that analysis of GBS data using UNEAK is a reliable approach for genome-wide discovery of SNP loci in outcrossed polyploids. PMID:26115486

  1. GOLabeler: Improving Sequence-based Large-scale Protein Function Prediction by Learning to Rank.

    PubMed

    You, Ronghui; Zhang, Zihan; Xiong, Yi; Sun, Fengzhu; Mamitsuka, Hiroshi; Zhu, Shanfeng

    2018-03-07

    Gene Ontology (GO) has been widely used to annotate functions of proteins and understand their biological roles. Currently only <1% of more than 70 million proteins in UniProtKB have experimental GO annotations, implying the strong necessity of automated function prediction (AFP) of proteins, where AFP is a hard multilabel classification problem due to one protein with a diverse number of GO terms. Most of these proteins have only sequences as input information, indicating the importance of sequence-based AFP (SAFP: sequences are the only input). Furthermore homology-based SAFP tools are competitive in AFP competitions, while they do not necessarily work well for so-called difficult proteins, which have <60% sequence identity to proteins with annotations already. Thus the vital and challenging problem now is how to develop a method for SAFP, particularly for difficult proteins. The key of this method is to extract not only homology information but also diverse, deep-rooted information/evidence from sequence inputs and integrate them into a predictor in a both effective and efficient manner. We propose GOLabeler, which integrates five component classifiers, trained from different features, including GO term frequency, sequence alignment, amino acid trigram, domains and motifs, and biophysical properties, etc., in the framework of learning to rank (LTR), a paradigm of machine learning, especially powerful for multilabel classification. The empirical results obtained by examining GOLabeler extensively and thoroughly by using large-scale datasets revealed numerous favorable aspects of GOLabeler, including significant performance advantage over state-of-the-art AFP methods. Availability: http://datamining-iip.fudan.edu.cn/golabeler. Contact: zhusf@fudan.edu.cn. Supplementary data are available at Bioinformatics online.

  2. Hum-mPLoc 3.0: prediction enhancement of human protein subcellular localization through modeling the hidden correlations of gene ontology and functional domain features.

    PubMed

    Zhou, Hang; Yang, Yang; Shen, Hong-Bin

    2017-03-15

    Protein subcellular localization prediction has been an important research topic in computational biology over the last decade. Various automatic methods have been proposed to predict locations for large scale protein datasets, where statistical machine learning algorithms are widely used for model construction. A key step in these predictors is encoding the amino acid sequences into feature vectors. Many studies have shown that features extracted from biological domains, such as gene ontology and functional domains, can be very useful for improving the prediction accuracy. However, domain knowledge usually results in redundant features and high-dimensional feature spaces, which may degenerate the performance of machine learning models. In this paper, we propose a new amino acid sequence-based human protein subcellular location prediction approach Hum-mPLoc 3.0, which covers 12 human subcellular localizations. The sequences are represented by multi-view complementary features, i.e. context vocabulary annotation-based gene ontology (GO) terms, peptide-based functional domains, and residue-based statistical features. To systematically reflect the structural hierarchy of the domain knowledge bases, we propose a novel feature representation protocol denoted as HCM (Hidden Correlation Modeling), which will create more compact and discriminative feature vectors by modeling the hidden correlations between annotation terms. Experimental results on four benchmark datasets show that HCM improves prediction accuracy by 5-11% and F1 by 8-19% compared with conventional GO-based methods. A large-scale application of Hum-mPLoc 3.0 on the whole human proteome reveals protein co-localization preferences in the cell. Availability: www.csbio.sjtu.edu.cn/bioinf/Hum-mPLoc3/. Contact: hbshen@sjtu.edu.cn. Supplementary data are available at Bioinformatics online. © The Author 2016. Published by Oxford University Press. All rights reserved. For Permissions, please e-mail: journals.permissions@oup.com

  3. DMRfinder: efficiently identifying differentially methylated regions from MethylC-seq data.

    PubMed

    Gaspar, John M; Hart, Ronald P

    2017-11-29

    DNA methylation is an epigenetic modification that is studied at a single-base resolution with bisulfite treatment followed by high-throughput sequencing. After alignment of the sequence reads to a reference genome, methylation counts are analyzed to determine genomic regions that are differentially methylated between two or more biological conditions. Even though a variety of software packages is available for different aspects of the bioinformatics analysis, they often produce results that are biased or have excessive computational requirements. DMRfinder is a novel computational pipeline that identifies differentially methylated regions efficiently. Following alignment, DMRfinder extracts methylation counts and performs a modified single-linkage clustering of methylation sites into genomic regions. It then compares methylation levels using beta-binomial hierarchical modeling and Wald tests. Among its innovative attributes are the analyses of novel methylation sites and methylation linkage, as well as the simultaneous statistical analysis of multiple sample groups. To demonstrate its efficiency, DMRfinder is benchmarked against other computational approaches using a large published dataset. Contrasting two replicates of the same sample yielded minimal genomic regions with DMRfinder, whereas two alternative software packages reported a substantial number of false positives. Further analyses of biological samples revealed fundamental differences between DMRfinder and another software package, despite the fact that they utilize the same underlying statistical basis. For each step, DMRfinder completed the analysis in a fraction of the time required by other software. Among the computational approaches for identifying differentially methylated regions from high-throughput bisulfite sequencing datasets, DMRfinder is the first that integrates all the post-alignment steps in a single package. 
Compared to other software, DMRfinder is extremely efficient and unbiased in this process. DMRfinder is free and open-source software, available on GitHub ( github.com/jsh58/DMRfinder ); it is written in Python and R, and is supported on Linux.
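
    The clustering step mentioned in this record can be pictured with a plain single-linkage grouping of site coordinates on one chromosome. The sketch below is not DMRfinder's modified procedure, whose details are not given in the abstract; the function name, the `max_gap` parameter and the coordinates are hypothetical.

```python
def cluster_sites(positions, max_gap):
    """Group positions into regions; adjacent sites in a region are <= max_gap apart."""
    regions = []
    for pos in sorted(positions):
        # In one dimension, single linkage reduces to chaining sorted neighbours.
        if regions and pos - regions[-1][-1] <= max_gap:
            regions[-1].append(pos)
        else:
            regions.append([pos])
    return regions


# Hypothetical methylation site coordinates on a single chromosome.
sites = [400, 100, 150, 130, 420, 900]
regions = cluster_sites(sites, max_gap=100)
```

    Each resulting region would then carry summed methylated and unmethylated counts per sample into the statistical comparison; a real pipeline would also need minimum-site and maximum-length rules, which are omitted here.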

  4. Mining spatiotemporal patterns of urban dwellers from taxi trajectory data

    NASA Astrophysics Data System (ADS)

    Mao, Feng; Ji, Minhe; Liu, Ting

    2016-06-01

    With the widespread adoption of location-aware technology, obtaining long-sequence, massive and high-accuracy spatiotemporal trajectory data of individuals has become increasingly popular in various geographic studies. Trajectory data of taxis, one of the most widely used inner-city travel modes, contain rich information about both road network traffic and travel behavior of passengers. Such data can be used to study the microscopic activity patterns of individuals as well as the macro system of urban spatial structures. This paper focuses on trajectories obtained from GPS-enabled taxis and their applications for mining urban commuting patterns. A novel approach is proposed to discover spatiotemporal patterns of household travel from the taxi trajectory dataset with a large number of point locations. The approach involves three critical steps: spatial clustering of taxi origin-destination (OD) based on urban traffic grids to discover potentially meaningful places, identifying threshold values from statistics of the OD clusters to extract urban jobs-housing structures, and visualization of analytic results to understand the spatial distribution and temporal trends of the revealed urban structures and implied household commuting behavior. A case study with a taxi trajectory dataset in Shanghai, China is presented to demonstrate and evaluate the proposed method.

  5. A phylogenomic analysis of the role and timing of molecular adaptation in the aquatic transition of cetartiodactyl mammals.

    PubMed

    Tsagkogeorga, Georgia; McGowen, Michael R; Davies, Kalina T J; Jarman, Simon; Polanowski, Andrea; Bertelsen, Mads F; Rossiter, Stephen J

    2015-09-01

    Recent studies have reported multiple cases of molecular adaptation in cetaceans related to their aquatic abilities. However, none of these has included the hippopotamus, precluding an understanding of whether molecular adaptations in cetaceans occurred before or after they split from their semi-aquatic sister taxa. Here, we obtained new transcriptomes from the hippopotamus and humpback whale, and analysed these together with available data from eight other cetaceans. We identified more than 11 000 orthologous genes and compiled a genome-wide dataset of 6845 coding DNA sequences among 23 mammals, to our knowledge the largest phylogenomic dataset to date for cetaceans. We found positive selection in nine genes on the branch leading to the common ancestor of hippopotamus and whales, and 461 genes in cetaceans compared to 64 in hippopotamus. Functional annotation revealed adaptations in diverse processes, including lipid metabolism, hypoxia, muscle and brain function. By combining these findings with data on protein-protein interactions, we found evidence suggesting clustering among gene products relating to nervous and muscular systems in cetaceans. We found little support for shared ancestral adaptations in the two taxa; most molecular adaptations in extant cetaceans occurred after their split with hippopotamids.

  6. Extraction of Molecular Features through Exome to Transcriptome Alignment

    PubMed Central

    Mudvari, Prakriti; Kowsari, Kamran; Cole, Charles; Mazumder, Raja; Horvath, Anelia

    2014-01-01

    Integrative Next Generation Sequencing (NGS) DNA and RNA analyses have very recently become feasible, and the studies published to date have discovered critical disease-implicated pathways, and diagnostic and therapeutic targets. A growing number of exomes, genomes and transcriptomes from the same individual are quickly accumulating, providing unique venues for mechanistic and regulatory features analysis, and, at the same time, requiring new exploration strategies. In this study, we have integrated variation and expression information of four NGS datasets from the same individual: normal and tumor breast exomes and transcriptomes. Focusing on SNP-centered variant allelic prevalence, we illustrate analytical algorithms that can be applied to extract or validate potential regulatory elements, such as expression or growth advantage, imprinting, loss of heterozygosity (LOH), somatic changes, and RNA editing. In addition, we point to some critical elements that might bias the output and recommend alternative measures to maximize the confidence of findings. The need for such strategies is especially recognized within the growing appreciation of the concept of systems biology: integrative exploration of genome and transcriptome features reveals mechanistic and regulatory insights that reach far beyond linear addition of the individual datasets. PMID:24791251

  7. A Web Server and Mobile App for Computing Hemolytic Potency of Peptides

    NASA Astrophysics Data System (ADS)

    Chaudhary, Kumardeep; Kumar, Ritesh; Singh, Sandeep; Tuknait, Abhishek; Gautam, Ankur; Mathur, Deepika; Anand, Priya; Varshney, Grish C.; Raghava, Gajendra P. S.

    2016-03-01

    Numerous therapeutic peptides do not enter clinical trials because of their high hemolytic activity. Recently, we developed a database, Hemolytik, for maintaining experimentally validated hemolytic and non-hemolytic peptides. The present study describes a web server and mobile app developed for predicting and screening peptides having hemolytic potency. Firstly, we generated a dataset, HemoPI-1, that contains 552 hemolytic peptides extracted from the Hemolytik database and 552 random non-hemolytic peptides (from Swiss-Prot). The sequence analysis of these peptides revealed that certain residues (e.g., L, K, F, W) and motifs (e.g., “FKK”, “LKL”, “KKLL”, “KWK”, “VLK”, “CYCR”, “CRR”, “RFC”, “RRR”, “LKKL”) are more abundant in hemolytic peptides. Therefore, we developed models for discriminating hemolytic and non-hemolytic peptides using various machine learning techniques and achieved more than 95% accuracy. We also developed models for discriminating peptides having high and low hemolytic potential on different datasets called HemoPI-2 and HemoPI-3. In order to serve the scientific community, we developed a web server, mobile app and Java-based standalone software (http://crdd.osdd.net/raghava/hemopi/).
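The motif observation in this abstract can be illustrated with a small counting sketch. It is not the HemoPI model: it only tallies the motifs named in the abstract, and the function names and example peptides are hypothetical.

```python
from collections import Counter

# Motifs the abstract reports as more abundant in hemolytic peptides.
MOTIFS = ["FKK", "LKL", "KKLL", "KWK", "VLK", "CYCR", "CRR", "RFC", "RRR", "LKKL"]


def count_motifs(peptide, motifs=MOTIFS):
    """Count occurrences (overlaps allowed) of each motif in one peptide."""
    peptide = peptide.upper()
    counts = Counter()
    for motif in motifs:
        start = peptide.find(motif)
        while start != -1:
            counts[motif] += 1
            # Advance by one residue so overlapping hits are also counted.
            start = peptide.find(motif, start + 1)
    return counts


def motif_presence(peptides, motifs=MOTIFS):
    """Fraction of peptides that contain each motif at least once."""
    if not peptides:
        return {motif: 0.0 for motif in motifs}
    present = Counter()
    for peptide in peptides:
        present.update(count_motifs(peptide, motifs).keys())
    return {motif: present[motif] / len(peptides) for motif in motifs}


if __name__ == "__main__":
    # Made-up sequences, used only to exercise the functions.
    print(count_motifs("GLKKLLKWK"))
    print(motif_presence(["GLKKLLKWK", "AAAA"]))
```

Comparing `motif_presence` between a hemolytic and a non-hemolytic set would give the kind of abundance contrast the abstract describes; a classifier would need many more features than these counts.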

  8. A phylogenomic analysis of the role and timing of molecular adaptation in the aquatic transition of cetartiodactyl mammals

    PubMed Central

    Tsagkogeorga, Georgia; McGowen, Michael R.; Davies, Kalina T. J.; Jarman, Simon; Polanowski, Andrea; Bertelsen, Mads F.; Rossiter, Stephen J.

    2015-01-01

    Recent studies have reported multiple cases of molecular adaptation in cetaceans related to their aquatic abilities. However, none of these has included the hippopotamus, precluding an understanding of whether molecular adaptations in cetaceans occurred before or after they split from their semi-aquatic sister taxa. Here, we obtained new transcriptomes from the hippopotamus and humpback whale, and analysed these together with available data from eight other cetaceans. We identified more than 11 000 orthologous genes and compiled a genome-wide dataset of 6845 coding DNA sequences among 23 mammals, to our knowledge the largest phylogenomic dataset to date for cetaceans. We found positive selection in nine genes on the branch leading to the common ancestor of hippopotamus and whales, and 461 genes in cetaceans compared to 64 in hippopotamus. Functional annotation revealed adaptations in diverse processes, including lipid metabolism, hypoxia, muscle and brain function. By combining these findings with data on protein–protein interactions, we found evidence suggesting clustering among gene products relating to nervous and muscular systems in cetaceans. We found little support for shared ancestral adaptations in the two taxa; most molecular adaptations in extant cetaceans occurred after their split with hippopotamids. PMID:26473040

  9. Metasecretome-selective phage display approach for mining the functional potential of a rumen microbial community.

    PubMed

    Ciric, Milica; Moon, Christina D; Leahy, Sinead C; Creevey, Christopher J; Altermann, Eric; Attwood, Graeme T; Rakonjac, Jasna; Gagic, Dragana

    2014-05-12

    In silico, secretome proteins can be predicted from completely sequenced genomes using various available algorithms that identify membrane-targeting sequences. For the metasecretome (the collection of surface, secreted and transmembrane proteins from environmental microbial communities) this approach is impractical, considering that the metasecretome open reading frames (ORFs) comprise only 10% to 30% of the total metagenome, and are poorly represented in the dataset due to overall low coverage of the metagenomic gene pool, even in large-scale projects. By combining secretome-selective phage display and next-generation sequencing, we focused the sequence analysis of a complex rumen microbial community on the metasecretome component of the metagenome. This approach achieved high enrichment (29 fold) of secreted fibrolytic enzymes from the plant-adherent microbial community of the bovine rumen. In particular, we identified hundreds of heretofore rare modules belonging to cellulosomes, cell-surface complexes specialised for recognition and degradation of the plant fibre. As a method, metasecretome phage display combined with next-generation sequencing has the power to sample the diversity of low-abundance surface and secreted proteins that would otherwise require exceptionally large metagenomic sequencing projects. As a resource, the metasecretome display library backed by the dataset obtained by next-generation sequencing is ready for i) affinity selection by standard phage display methodology and ii) easy purification of displayed proteins as part of the virion for individual functional analysis.

  10. Sensitivity to sequencing depth in single-cell cancer genomics.

    PubMed

    Alves, João M; Posada, David

    2018-04-16

    Querying cancer genomes at single-cell resolution is expected to provide a powerful framework to understand in detail the dynamics of cancer evolution. However, given the high costs currently associated with single-cell sequencing, together with the inevitable technical noise arising from single-cell genome amplification, cost-effective strategies that maximize the quality of single-cell data are critically needed. Taking advantage of previously published single-cell whole-genome and whole-exome cancer datasets, we studied the impact of sequencing depth and sampling effort towards single-cell variant detection. Five single-cell whole-genome and whole-exome cancer datasets were independently downscaled to 25, 10, 5, and 1× sequencing depth. For each depth level, ten technical replicates were generated, resulting in a total of 6280 single-cell BAM files. The sensitivity of variant detection, including structural and driver mutations, genotyping, clonal inference, and phylogenetic reconstruction to sequencing depth was evaluated using recent tools specifically designed for single-cell data. Altogether, our results suggest that for relatively large sample sizes (25 or more cells) sequencing single tumor cells at depths > 5× does not drastically improve somatic variant discovery, characterization of clonal genotypes, or estimation of single-cell phylogenies. We suggest that sequencing multiple individual tumor cells at a modest depth represents an effective alternative to explore the mutational landscape and clonal evolutionary patterns of cancer genomes.
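Downscaling a dataset to a lower sequencing depth, as described in this abstract, amounts to keeping each read with probability target depth / current depth. The sketch below shows that idea on an in-memory list; the function name and inputs are hypothetical, and it does not reproduce the authors' pipeline (in practice a tool such as `samtools view -s` is commonly used for BAM files).

```python
import random


def downsample_reads(reads, current_depth, target_depth, seed=0):
    """Keep each read independently with probability target/current.

    `reads` is any list of read records; depths are mean coverages (e.g. 25, 5).
    A fixed seed makes one technical replicate reproducible; different seeds
    give different replicates.
    """
    if current_depth <= 0:
        raise ValueError("current_depth must be positive")
    fraction = min(1.0, target_depth / current_depth)
    rng = random.Random(seed)
    return [read for read in reads if rng.random() < fraction]


if __name__ == "__main__":
    reads = list(range(10_000))  # stand-in for real read records
    for depth in (25, 10, 5, 1):
        replicate = downsample_reads(reads, current_depth=25, target_depth=depth, seed=1)
        print(depth, len(replicate))
```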

  11. Biodiversity assessment among two Nebraska prairies: a comparison between traditional and phylogenetic diversity indices.

    PubMed

    Aust, Shelly K; Ahrendsen, Dakota L; Kellar, P Roxanne

    2015-01-01

    Conservation of the evolutionary diversity among organisms should be included in the selection of priority regions for preservation of Earth's biodiversity. Traditionally, biodiversity has been determined from an assessment of species richness (S), abundance, evenness, rarity, etc. of organisms but not from variation in species' evolutionary histories. Phylogenetic diversity (PD) measures evolutionary differences between taxa in a community and is gaining acceptance as a biodiversity assessment tool. However, with the increase in the number of ways to calculate PD, end-users and decision-makers are left wondering how metrics compare and what data are needed to calculate various metrics. In this study, we used massively parallel sequencing to generate over 65,000 DNA characters from three cellular compartments for over 60 species in the asterid clade of flowering plants. We estimated asterid phylogenies from character datasets of varying nucleotide quantities, and then assessed the effect of varying character datasets on resulting PD metric values. We also compared multiple PD metrics with traditional diversity indices (including S) among two endangered grassland prairies in Nebraska (U.S.A.). Our results revealed that PD metrics varied based on the quantity of genes used to infer the phylogenies; therefore, when comparing PD metrics between sites, it is vital to use comparable datasets. Additionally, various PD metrics and traditional diversity indices characterize biodiversity differently and should be chosen depending on the research question. Our study provides empirical results that reveal the value of measuring PD when considering sites for conservation, and it highlights the usefulness of using PD metrics in combination with other diversity indices when studying community assembly and ecosystem functioning. 
    Ours is just one example of the types of investigations that need to be conducted across the tree of life and across varying ecosystems in order to build a database of phylogenetic diversity assessments that lead to a pool of results upon which a guide through the plethora of PD metrics may be prepared for use by ecologists and conservation planners.

  12. Haplowebs as a graphical tool for delimiting species: a revival of Doyle's "field for recombination" approach and its application to the coral genus Pocillopora in Clipperton

    PubMed Central

    2010-01-01

    Background Usual methods for inferring species boundaries from molecular sequence data rely either on gene trees or on population genetic analyses. Another way of delimiting species, based on a view of species as "fields for recombination" (FFRs) characterized by mutual allelic exclusivity, was suggested in 1995 by Doyle. Here we propose to use haplowebs (haplotype networks with additional connections between haplotypes found co-occurring in heterozygous individuals) to visualize and delineate single-locus FFRs (sl-FFRs). Furthermore, we introduce a method to quantify the reliability of putative species boundaries according to the number of independent markers that support them, and illustrate this approach with a case study of taxonomically difficult corals of the genus Pocillopora collected around Clipperton Island (far eastern Pacific). Results One haploweb built from intron sequences of the ATP synthase β subunit gene revealed the presence of two sl-FFRs among our 74 coral samples, whereas a second one built from ITS sequences turned out to be composed of four sl-FFRs. As a third independent marker, we performed a combined analysis of two regions of the mitochondrial genome: since haplowebs are not suited to analyze non-recombining markers, individuals were sorted into four haplogroups according to their mitochondrial sequences. Among all possible bipartitions of our set of samples, thirteen were supported by at least one molecular dataset, none by two and only one by all three datasets: this congruent pattern obtained from independent nuclear and mitochondrial markers indicates that two species of Pocillopora are present in Clipperton. Conclusions Our approach builds on Doyle's method and extends it by introducing an intuitive, user-friendly graphical representation and by proposing a conceptual framework to analyze and quantify the congruence between sl-FFRs obtained from several independent markers. 
    Like delineation methods based on population-level statistical approaches, our method can distinguish closely-related species that have not yet reached reciprocal monophyly at most or all of their loci; like tree-based approaches, it can yield meaningful conclusions using a number of independent markers as low as three. Future efforts will aim to develop programs that speed up the construction of haplowebs from FASTA sequence alignments and help perform the congruence analysis outlined in this article. PMID:21118572

  13. Haplowebs as a graphical tool for delimiting species: a revival of Doyle's "field for recombination" approach and its application to the coral genus Pocillopora in Clipperton.

    PubMed

    Flot, Jean-François; Couloux, Arnaud; Tillier, Simon

    2010-11-30

    Usual methods for inferring species boundaries from molecular sequence data rely either on gene trees or on population genetic analyses. Another way of delimiting species, based on a view of species as "fields for recombination" (FFRs) characterized by mutual allelic exclusivity, was suggested in 1995 by Doyle. Here we propose to use haplowebs (haplotype networks with additional connections between haplotypes found co-occurring in heterozygous individuals) to visualize and delineate single-locus FFRs (sl-FFRs). Furthermore, we introduce a method to quantify the reliability of putative species boundaries according to the number of independent markers that support them, and illustrate this approach with a case study of taxonomically difficult corals of the genus Pocillopora collected around Clipperton Island (far eastern Pacific). One haploweb built from intron sequences of the ATP synthase β subunit gene revealed the presence of two sl-FFRs among our 74 coral samples, whereas a second one built from ITS sequences turned out to be composed of four sl-FFRs. As a third independent marker, we performed a combined analysis of two regions of the mitochondrial genome: since haplowebs are not suited to analyze non-recombining markers, individuals were sorted into four haplogroups according to their mitochondrial sequences. Among all possible bipartitions of our set of samples, thirteen were supported by at least one molecular dataset, none by two and only one by all three datasets: this congruent pattern obtained from independent nuclear and mitochondrial markers indicates that two species of Pocillopora are present in Clipperton. Our approach builds on Doyle's method and extends it by introducing an intuitive, user-friendly graphical representation and by proposing a conceptual framework to analyze and quantify the congruence between sl-FFRs obtained from several independent markers. 
    Like delineation methods based on population-level statistical approaches, our method can distinguish closely-related species that have not yet reached reciprocal monophyly at most or all of their loci; like tree-based approaches, it can yield meaningful conclusions using a number of independent markers as low as three. Future efforts will aim to develop programs that speed up the construction of haplowebs from FASTA sequence alignments and help perform the congruence analysis outlined in this article.
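The core of the haploweb idea, grouping haplotypes that co-occur in heterozygous individuals, can be sketched as a connected-components problem. This is a simplified illustration under stated assumptions: each individual carries exactly two haplotypes at the locus, the sample and haplotype labels are invented, and the mutational links of the underlying haplotype network are ignored.

```python
def _find(parent, x):
    """Union-find root lookup with path halving."""
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x


def single_locus_ffrs(genotypes):
    """Group individuals into single-locus fields for recombination.

    `genotypes` maps an individual to its pair of haplotypes at one locus.
    Two haplotypes found together in a heterozygote are joined; each connected
    set of haplotypes (with the individuals carrying them) is one sl-FFR.
    """
    parent = {}
    for hap_a, hap_b in genotypes.values():
        parent.setdefault(hap_a, hap_a)
        parent.setdefault(hap_b, hap_b)
        root_a, root_b = _find(parent, hap_a), _find(parent, hap_b)
        if root_a != root_b:
            parent[root_a] = root_b
    groups = {}
    for individual, (hap_a, _) in genotypes.items():
        groups.setdefault(_find(parent, hap_a), set()).add(individual)
    return sorted(groups.values(), key=sorted)


if __name__ == "__main__":
    # Hypothetical genotypes for four coral samples.
    example = {
        "c1": ("h1", "h2"),
        "c2": ("h2", "h3"),
        "c3": ("h4", "h4"),
        "c4": ("h4", "h5"),
    }
    print(single_locus_ffrs(example))
```

Running the same grouping on several independent markers and counting how many markers support each bipartition of the samples corresponds to the congruence analysis the abstract outlines.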

  14. CisSERS: Customizable in silico sequence evaluation for restriction sites

    DOE PAGES

    Sharpe, Richard M.; Koepke, Tyson; Harper, Artemus; ...

    2016-04-12

    High-throughput sequencing continues to produce an immense volume of information that is processed and assembled into mature sequence data. Here, data analysis tools are urgently needed that leverage the embedded DNA sequence polymorphisms and consequent changes to restriction sites or sequence motifs in a high-throughput manner to enable biological experimentation. CisSERS was developed as a standalone open source tool to analyze sequence datasets and provide biologists with individual or comparative genome organization information in terms of presence and frequency of patterns or motifs such as restriction enzymes. Predicted agarose gel visualization of the custom analyses results was also integrated to enhance the usefulness of the software. CisSERS offers several novel functionalities, such as handling of large and multiple datasets in parallel, multiple restriction enzyme site detection and custom motif detection features, which are seamlessly integrated with real time agarose gel visualization. Using a simple FASTA-formatted file as input, CisSERS utilizes the REBASE enzyme database. Results from CisSERS enable the user to make decisions for designing genotyping by sequencing experiments, reduced representation sequencing, 3’UTR sequencing, and cleaved amplified polymorphic sequence (CAPS) molecular markers for large sample sets. CisSERS is a Java-based graphical user interface built around a Perl backbone. Several of the applications of CisSERS including CAPS molecular marker development were successfully validated using wet-lab experimentation. Here, we present the tool CisSERS and results from in-silico and corresponding wet-lab analyses demonstrating that CisSERS is a technology platform solution that facilitates efficient data utilization in genomics and genetics studies.
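In-silico restriction analysis of the kind this abstract describes can be sketched in a few lines. This is not CisSERS code: the two-enzyme table is a hand-written stand-in for a REBASE lookup, the function names are hypothetical, and only the top strand of a linear sequence is considered.

```python
import re

# Recognition site and cut offset within the site (EcoRI: G^AATTC, BamHI: G^GATCC).
ENZYMES = {
    "EcoRI": ("GAATTC", 1),
    "BamHI": ("GGATCC", 1),
}


def cut_positions(sequence, site, offset):
    """Return cut coordinates for every occurrence of `site` (overlaps allowed)."""
    sequence = sequence.upper()
    # A lookahead lets the regex report overlapping sites as well.
    pattern = "(?=" + re.escape(site) + ")"
    return [match.start() + offset for match in re.finditer(pattern, sequence)]


def fragment_lengths(sequence, cuts):
    """Lengths of the fragments produced by cutting a linear sequence."""
    bounds = [0] + sorted(set(cuts)) + [len(sequence)]
    return [end - start for start, end in zip(bounds, bounds[1:])]


def digest(sequence, enzyme_names):
    """Fragment lengths for a digest with one or more enzymes."""
    cuts = []
    for name in enzyme_names:
        site, offset = ENZYMES[name]
        cuts.extend(cut_positions(sequence, site, offset))
    return fragment_lengths(sequence, cuts)


if __name__ == "__main__":
    seq = "AAAGAATTCTTTTGGATCCAA"  # made-up 21-base sequence
    print(digest(seq, ["EcoRI"]))
    print(digest(seq, ["EcoRI", "BamHI"]))
```

The sorted fragment lengths are what a predicted gel lane would display; a polymorphism that creates or destroys a site changes the fragment list, which is the basis of CAPS markers.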

  15. CisSERS: Customizable in silico sequence evaluation for restriction sites

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Sharpe, Richard M.; Koepke, Tyson; Harper, Artemus

    High-throughput sequencing continues to produce an immense volume of information that is processed and assembled into mature sequence data. Here, data analysis tools are urgently needed that leverage the embedded DNA sequence polymorphisms and consequent changes to restriction sites or sequence motifs in a high-throughput manner to enable biological experimentation. CisSERS was developed as a standalone open source tool to analyze sequence datasets and provide biologists with individual or comparative genome organization information in terms of presence and frequency of patterns or motifs such as restriction enzymes. Predicted agarose gel visualization of the custom analyses results was also integrated to enhance the usefulness of the software. CisSERS offers several novel functionalities, such as handling of large and multiple datasets in parallel, multiple restriction enzyme site detection and custom motif detection features, which are seamlessly integrated with real time agarose gel visualization. Using a simple FASTA-formatted file as input, CisSERS utilizes the REBASE enzyme database. Results from CisSERS enable the user to make decisions for designing genotyping by sequencing experiments, reduced representation sequencing, 3’UTR sequencing, and cleaved amplified polymorphic sequence (CAPS) molecular markers for large sample sets. CisSERS is a Java-based graphical user interface built around a Perl backbone. Several of the applications of CisSERS including CAPS molecular marker development were successfully validated using wet-lab experimentation. Here, we present the tool CisSERS and results from in-silico and corresponding wet-lab analyses demonstrating that CisSERS is a technology platform solution that facilitates efficient data utilization in genomics and genetics studies.

  16. Exploring 3D Human Action Recognition: from Offline to Online.

    PubMed

    Liu, Zhenyu; Li, Rui; Tan, Jianrong

    2018-02-20

    With the introduction of cost-effective depth sensors, a tremendous amount of research has been devoted to studying human action recognition using 3D motion data. However, most existing methods work in an offline fashion, i.e., they operate on a segmented sequence. There are a few methods specifically designed for online action recognition, which continually predict action labels as a stream sequence proceeds. In view of this fact, we pose a question: can we draw inspiration and borrow techniques or descriptors from existing offline methods, and then apply these to online action recognition? Note that extending offline techniques or descriptors to online applications is not straightforward, since at least two problems (real-time performance and sequence segmentation) are usually not considered in offline action recognition. In this paper, we give a positive answer to the question. To develop applicable online action recognition methods, we carefully explore feature extraction, sequence segmentation, computational costs, and classifier selection. The effectiveness of the developed methods is validated on the MSR 3D Online Action dataset and the MSR Daily Activity 3D dataset.

  17. Exploring 3D Human Action Recognition: from Offline to Online

    PubMed Central

    Li, Rui; Liu, Zhenyu; Tan, Jianrong

    2018-01-01

    With the introduction of cost-effective depth sensors, a tremendous amount of research has been devoted to studying human action recognition using 3D motion data. However, most existing methods work in an offline fashion, i.e., they operate on a segmented sequence. There are a few methods specifically designed for online action recognition, which continually predict action labels as a stream sequence proceeds. In view of this fact, we pose a question: can we draw inspiration and borrow techniques or descriptors from existing offline methods, and then apply these to online action recognition? Note that extending offline techniques or descriptors to online applications is not straightforward, since at least two problems (real-time performance and sequence segmentation) are usually not considered in offline action recognition. In this paper, we give a positive answer to the question. To develop applicable online action recognition methods, we carefully explore feature extraction, sequence segmentation, computational costs, and classifier selection. The effectiveness of the developed methods is validated on the MSR 3D Online Action dataset and the MSR Daily Activity 3D dataset. PMID:29461502

  18. Detection of Splice Sites Using Support Vector Machine

    NASA Astrophysics Data System (ADS)

    Varadwaj, Pritish; Purohit, Neetesh; Arora, Bhumika

    Automatic identification and annotation of the exon and intron regions of genes from DNA sequences has been an important research area in the field of computational biology. Several approaches, viz. Hidden Markov Models (HMM), Artificial Intelligence (AI) based machine learning and Digital Signal Processing (DSP) techniques, have been used extensively and independently by various researchers to address this challenging task. In this work, we propose a Support Vector Machine based kernel learning approach for detection of splice sites (the exon-intron boundaries) in a gene. Electron-Ion Interaction Potential (EIIP) values of nucleotides have been used for mapping character sequences to corresponding numeric sequences. A Radial Basis Function (RBF) kernel SVM is trained using the EIIP numeric sequences. Furthermore, this was tested on a test gene dataset for detection of splice sites by shifting a window (of 12 residues). The window size and various important parameters of the SVM kernel have been optimized for better accuracy. Receiver Operating Characteristic (ROC) curves have been utilized for displaying the sensitivity rate of the classifier, and results showed 94.82% accuracy for splice site detection on the test dataset.
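The feature-extraction step in this abstract, mapping nucleotides to EIIP values and sliding a 12-residue window, can be sketched as follows. The EIIP constants are the commonly tabulated values; the function names are hypothetical, and the classifier itself is omitted (the resulting vectors could be passed to any RBF-kernel SVM implementation, for example scikit-learn's `SVC(kernel="rbf")`, which is an assumption rather than the authors' stated tooling).

```python
# Commonly tabulated Electron-Ion Interaction Potential values for DNA bases.
EIIP = {"A": 0.1260, "C": 0.1340, "G": 0.0806, "T": 0.1335}


def eiip_encode(sequence):
    """Map a DNA string to its numeric EIIP sequence."""
    try:
        return [EIIP[base] for base in sequence.upper()]
    except KeyError as err:
        raise ValueError(f"unsupported base: {err.args[0]!r}") from None


def sliding_windows(sequence, width=12):
    """Yield (start, feature_vector) for every window of `width` residues."""
    for start in range(len(sequence) - width + 1):
        yield start, eiip_encode(sequence[start:start + width])


if __name__ == "__main__":
    gene = "ATGGTAAGTCCTAGGCAT"  # made-up sequence
    for start, features in sliding_windows(gene):
        print(start, features)
```

Each window becomes one 12-dimensional feature vector; a trained classifier would score every window position as splice site or non-splice site.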

  19. Combined molecular and morphological phylogenetic analyses of the New Zealand wolf spider genus Anoteropsis (Araneae: Lycosidae).

    PubMed

    Vink, Cor J; Paterson, Adrian M

    2003-09-01

    Datasets from the mitochondrial gene regions NADH dehydrogenase subunit I (ND1) and cytochrome c oxidase subunit I (COI) of the 20 species in the New Zealand wolf spider (Lycosidae) genus Anoteropsis were generated. Sequence data were phylogenetically analysed using parsimony and maximum likelihood analyses. The phylogenies generated from the ND1 and COI sequence data and a previously generated morphological dataset were significantly congruent (p<0.001). Sequence data were combined with morphological data and phylogenetically analysed using parsimony. The ND1 region sequenced included part of tRNA(Leu(CUN)), which appears to have an unstable amino-acyl arm and no TpsiC arm in lycosids. Analyses supported the existence of five species groups within Anoteropsis and the monophyly of species represented by multiple samples. A radiation of Anoteropsis species within the last five million years is inferred from the ND1 and COI likelihood phylograms, habitat and geological data, which also indicates that Anoteropsis arrived in New Zealand some time after it separated from Gondwana.

  20. Disk-based compression of data from genome sequencing.

    PubMed

    Grabowski, Szymon; Deorowicz, Sebastian; Roguski, Łukasz

    2015-05-01

    High-coverage sequencing data have significant, yet hard to exploit, redundancy. Most FASTQ compressors cannot efficiently compress the DNA stream of large datasets, since the redundancy between overlapping reads cannot be easily captured in the (relatively small) main memory. More interesting solutions for this problem are disk based, where the better of these two, from Cox et al. (2012), is based on the Burrows-Wheeler transform (BWT) and achieves 0.518 bits per base for a 134.0 Gbp human genome sequencing collection with almost 45-fold coverage. We propose overlapping reads compression with minimizers, a compression algorithm dedicated to sequencing reads (DNA only). Our method makes use of a conceptually simple and easily parallelizable idea of minimizers, to obtain 0.317 bits per base as the compression ratio, allowing the 134.0 Gbp dataset to fit into only 5.31 GB of space. Available at http://sun.aei.polsl.pl/orcom under a free license. Contact: sebastian.deorowicz@polsl.pl. Supplementary data are available at Bioinformatics online. © The Author 2014. Published by Oxford University Press.
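The minimizer idea in this abstract can be illustrated with a short sketch: reads that overlap tend to share their smallest k-mer, so bucketing by it brings similar reads together before compression. This is a simplification and not the ORCOM implementation; the function names and reads are hypothetical, and reverse complements are ignored. The last function only checks the arithmetic stated in the abstract.

```python
from collections import defaultdict


def minimizer(read, k):
    """Lexicographically smallest k-mer of a read."""
    if len(read) < k:
        raise ValueError("read shorter than k")
    return min(read[i:i + k] for i in range(len(read) - k + 1))


def bucket_by_minimizer(reads, k):
    """Group reads that share the same minimizer."""
    buckets = defaultdict(list)
    for read in reads:
        buckets[minimizer(read, k)].append(read)
    return dict(buckets)


def compressed_size_gb(total_bases, bits_per_base):
    """Compressed size in gigabytes (10**9 bytes) for a given ratio."""
    return total_bases * bits_per_base / 8 / 1e9


if __name__ == "__main__":
    reads = ["TTGATTACA", "GATTACAGG", "CCCCGGGG"]  # made-up reads
    print(bucket_by_minimizer(reads, k=3))
    # 134.0 Gbp at 0.317 bits per base, as quoted in the abstract.
    print(round(compressed_size_gb(134.0e9, 0.317), 2))
```

The first two reads overlap on `GATTACA` and land in the same bucket, which is what lets a disk-based compressor process each bucket separately and still exploit redundancy between overlapping reads.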

  1. The next generation of melanocyte data: Genetic, epigenetic, and transcriptional resource datasets and analysis tools.

    PubMed

    Loftus, Stacie K

    2018-05-01

    The number of melanocyte- and melanoma-derived next generation sequence genome-scale datasets has rapidly expanded over the past several years. This resource guide provides a summary of publicly available sources of melanocyte cell derived whole genome, exome, mRNA and miRNA transcriptome, chromatin accessibility and epigenetic datasets. Also highlighted are bioinformatic resources and tools for visualization and data queries which allow researchers a genome-scale view of the melanocyte. Published 2018. This article is a U.S. Government work and is in the public domain in the USA.

  2. Phylogenetic diversity of Pasteurellaceae and horizontal gene transfer of leukotoxin in wild and domestic sheep.

    PubMed

    Kelley, Scott T; Cassirer, E Frances; Weiser, Glen C; Safaee, Shirin

    2007-01-01

    Wild and domestic animal populations are known to be sources and reservoirs of emerging diseases. There is also a growing recognition that horizontal genetic transfer (HGT) plays an important role in bacterial pathogenesis. We used molecular phylogenetic methods to assess diversity and cross-transmission rates of Pasteurellaceae bacteria in populations of bighorn sheep, Dall's sheep, domestic sheep and domestic goats. Members of the Pasteurellaceae cause an array of deadly illnesses including bacterial pneumonia known as "pasteurellosis", a particularly devastating disease for bighorn sheep. A phylogenetic analysis of a combined dataset of two RNA genes (16S ribosomal RNA and RNAse P RNA) revealed remarkable evolutionary diversity among Pasteurella trehalosi and Mannheimia (Pasteurella) haemolytica bacteria isolated from sheep and goats. Several phylotypes appeared to associate with particular host species, though we found numerous instances of apparent cross-transmission among species and populations. Statistical analyses revealed that host species, geographic locale and biovariant classification, but not virulence, correlated strongly with Pasteurellaceae phylogeny. Sheep host species correlated with P. trehalosi isolates phylogeny (PTP test; P=0.002), but not with the phylogeny of M. haemolytica isolates, suggesting that P. trehalosi bacteria may be more host specific. With regards to populations within species, we also discovered a strong correlation between geographic locale and isolate phylogeny in the Rocky Mountain bighorn sheep (PTP test; P=0.001). We also investigated the potential for HGT of the leukotoxin A (lktA) gene, which produces a toxin that plays an integral role in causing disease. Comparative analysis of the combined RNA gene phylogeny and the lktA phylogenies revealed considerable incongruence between the phylogenies, suggestive of HGT. 
    Furthermore, we found identical lktA alleles in unrelated bacterial species, some of which had been isolated from sheep in distantly removed populations. For example, lktA sequences from P. trehalosi isolated from remote Alaskan Dall's sheep were 100% identical over a 900-nucleotide stretch to sequences determined from M. haemolytica isolated from domestic sheep in the UK. This extremely high degree of sequence similarity of lktA sequences among distinct bacterial species suggests that HGT has played a role in the evolution of lktA in wild hosts.

  3. Development of Genic and Genomic SSR Markers of Robusta Coffee (Coffea canephora Pierre Ex A. Froehner)

    PubMed Central

    Hendre, Prasad S.; Aggarwal, Ramesh K.

    2014-01-01

    Coffee breeding and improvement efforts can be greatly facilitated by availability of a large repository of simple sequence repeat (SSR) based microsatellite markers, which provide efficiency and high resolution in genetic analyses. This study aimed to improve SSR availability in coffee by developing new genic and genomic SSR markers using an in-silico bioinformatics approach and a streptavidin-biotin based enrichment approach, respectively. The expressed sequence tag (EST) based genic microsatellite markers (EST-SSRs) were developed using the publicly available dataset of 13,175 unigene ESTs, which showed a distribution of 1 SSR/3.4 kb of coffee transcriptome. Genomic SSRs, on the other hand, were developed from an SSR-enriched small-insert partial genomic library of robusta coffee. In total, 69 new SSRs (44 EST-SSRs and 25 genomic SSRs) were developed and validated as suitable genetic markers. Diversity analysis of selected coffee genotypes revealed these to be highly informative in terms of allelic diversity and PIC values, and eighteen of these markers (∼27%) could be mapped on a robusta linkage map. Notably, the markers described here also revealed a very high cross-species transferability. In addition to the validated markers, we have also designed primer pairs for 270 putative EST-SSRs, which are expected to provide another ca. 200 useful genetic markers considering the high success rate (88%) of marker conversion of similar pairs tested/validated in this study. PMID:25461752

  4. Revising the recent evolutionary history of equids using ancient DNA.

    PubMed

    Orlando, Ludovic; Metcalf, Jessica L; Alberdi, Maria T; Telles-Antunes, Miguel; Bonjean, Dominique; Otte, Marcel; Martin, Fabiana; Eisenmann, Véra; Mashkour, Marjan; Morello, Flavia; Prado, Jose L; Salas-Gismondi, Rodolfo; Shockey, Bruce J; Wrinn, Patrick J; Vasil'ev, Sergei K; Ovodov, Nikolai D; Cherry, Michael I; Hopwood, Blair; Male, Dean; Austin, Jeremy J; Hänni, Catherine; Cooper, Alan

    2009-12-22

    The rich fossil record of the family Equidae (Mammalia: Perissodactyla) over the past 55 MY has made it an icon for the patterns and processes of macroevolution. Despite this, many aspects of equid phylogenetic relationships and taxonomy remain unresolved. Recent genetic analyses of extinct equids have revealed unexpected evolutionary patterns and a need for major revisions at the generic, subgeneric, and species levels. To investigate this issue we examine 35 ancient equid specimens from four geographic regions (South America, Europe, Southwest Asia, and South Africa), of which 22 delivered 87-688 bp of reproducible aDNA mitochondrial sequence. Phylogenetic analyses support a major revision of the recent evolutionary history of equids and reveal two new species, a South American hippidion and a descendant of a basal lineage potentially related to Middle Pleistocene equids. Sequences from specimens assigned to the giant extinct Cape zebra, Equus capensis, formed a separate clade within the modern plain zebra species, a phenotypically plastic group that also included the extinct quagga. In addition, we revise the currently recognized extinction times for two hemione-related equid groups. However, it is apparent that the current dataset cannot solve all of the taxonomic and phylogenetic questions relevant to the evolution of Equus. In light of these findings, we propose a rapid DNA barcoding approach to evaluate the taxonomic status of the many Late Pleistocene fossil Equidae species that have been described from purely morphological analyses.

  5. DNA demethylation activates genes in seed maternal integument development in rice (Oryza sativa L.).

    PubMed

    Wang, Yifeng; Lin, Haiyan; Tong, Xiaohong; Hou, Yuxuan; Chang, Yuxiao; Zhang, Jian

    2017-11-01

    DNA methylation is an important epigenetic modification that regulates various plant developmental processes. Rice seed integument determines the seed size. However, the role of DNA methylation in its development remains largely unknown. Here, we report the first dynamic DNA methylomic profiling of rice maternal integument before and after pollination by using a whole-genome bisulfite deep sequencing approach. Analysis of DNA methylation patterns identified 4238 differentially methylated regions underpinning 4112 differentially methylated genes, including GW2, DEP1, RGB1 and numerous other regulators that participate in maternal integument development. Bisulfite Sanger sequencing and qRT-PCR of six differentially methylated genes revealed extensive occurrence of DNA hypomethylation triggered by double fertilization at IAP compared with IBP, suggesting that DNA demethylation might be a key mechanism to activate numerous maternal controlling genes. The results presented here not only greatly expanded the rice methylome dataset, but also shed novel insight into the regulatory roles of DNA methylation in rice seed maternal integument development. Copyright © 2017 Elsevier Masson SAS. All rights reserved.

  6. Scalable and cost-effective NGS genotyping in the cloud.

    PubMed

    Souilmi, Yassine; Lancaster, Alex K; Jung, Jae-Yoon; Rizzo, Ettore; Hawkins, Jared B; Powles, Ryan; Amzazi, Saaïd; Ghazal, Hassan; Tonellato, Peter J; Wall, Dennis P

    2015-10-15

    While next-generation sequencing (NGS) costs have plummeted in recent years, cost and complexity of computation remain substantial barriers to the use of NGS in routine clinical care. The clinical potential of NGS will not be realized until robust and routine whole genome sequencing data can be accurately rendered to medically actionable reports within a time window of hours and at scales of economy in the 10's of dollars. We take a step towards addressing this challenge, by using COSMOS, a cloud-enabled workflow management system, to develop GenomeKey, an NGS whole genome analysis workflow. COSMOS implements complex workflows making optimal use of high-performance compute clusters. Here we show that the Amazon Web Service (AWS) implementation of GenomeKey via COSMOS provides a fast, scalable, and cost-effective analysis of both public benchmarking and large-scale heterogeneous clinical NGS datasets. Our systematic benchmarking reveals important new insights and considerations to produce clinical turn-around of whole genome analysis optimization and workflow management including strategic batching of individual genomes and efficient cluster resource configuration.

  7. Metagenome sequencing and 98 microbial genomes from Juan de Fuca Ridge flank subsurface fluids

    NASA Astrophysics Data System (ADS)

    Jungbluth, Sean P.; Amend, Jan P.; Rappé, Michael S.

    2017-03-01

    The global deep subsurface biosphere is one of the largest reservoirs for microbial life on our planet. This study takes advantage of new sampling technologies and couples them with improvements to DNA sequencing and associated informatics tools to reconstruct the genomes of uncultivated Bacteria and Archaea from fluids collected deep within the Juan de Fuca Ridge subseafloor. Here, we generated two metagenomes from borehole observatories located 311 meters apart and, using binning tools, retrieved 98 genomes from metagenomes (GFMs). Of the GFMs, 31 were estimated to be >90% complete, while an additional 17 were >70% complete. Phylogenomic analysis revealed 53 bacterial and 45 archaeal GFMs, of which nearly all were distantly related to known cultivated isolates. In the GFMs, abundant Bacteria included Chloroflexi, Nitrospirae, Acetothermia (OP1), EM3, Aminicenantes (OP8), Gammaproteobacteria, and Deltaproteobacteria, while abundant Archaea included Archaeoglobi, Bathyarchaeota (MCG), and Marine Benthic Group E (MBG-E). These data are the first GFMs reconstructed from the deep basaltic subseafloor biosphere, and provide a dataset available for further interrogation.

  8. Metagenome sequencing and 98 microbial genomes from Juan de Fuca Ridge flank subsurface fluids.

    PubMed

    Jungbluth, Sean P; Amend, Jan P; Rappé, Michael S

    2017-03-28

    The global deep subsurface biosphere is one of the largest reservoirs for microbial life on our planet. This study takes advantage of new sampling technologies and couples them with improvements to DNA sequencing and associated informatics tools to reconstruct the genomes of uncultivated Bacteria and Archaea from fluids collected deep within the Juan de Fuca Ridge subseafloor. Here, we generated two metagenomes from borehole observatories located 311 meters apart and, using binning tools, retrieved 98 genomes from metagenomes (GFMs). Of the GFMs, 31 were estimated to be >90% complete, while an additional 17 were >70% complete. Phylogenomic analysis revealed 53 bacterial and 45 archaeal GFMs, of which nearly all were distantly related to known cultivated isolates. In the GFMs, abundant Bacteria included Chloroflexi, Nitrospirae, Acetothermia (OP1), EM3, Aminicenantes (OP8), Gammaproteobacteria, and Deltaproteobacteria, while abundant Archaea included Archaeoglobi, Bathyarchaeota (MCG), and Marine Benthic Group E (MBG-E). These data are the first GFMs reconstructed from the deep basaltic subseafloor biosphere, and provide a dataset available for further interrogation.

  9. Invasion of a Holarctic planktonic cladoceran Daphnia galeata Sars (Crustacea: Cladocera) in the Lower Lakes of South Australia.

    PubMed

    Karabanov, Dmitry P; Bekker, Eugeniya I; Shiel, Russell J; Kotov, Alexey A

    2018-03-27

    We found a Holarctic microcrustacean Daphnia galeata Sars, 1863 (Cladocera: Daphniidae) in the Lower Lakes of South Australia. This taxon had never been detected in continental Australia before. Its identity was confirmed by the sequences of mitochondrial COI, 12S and 16S and nuclear 18S and 28S genes. A maximum likelihood tree from a dataset combining 12S + 16S mitochondrial sequences and a split network of the COI haplotypes are provided, but resolution of both genes is not sufficient to reveal the exact region of the Holarctic from where D. galeata was introduced to Australia; the vector of its invasion is also unknown. We hypothesize that the appearance of D. galeata in the Lower Lakes of the Murray River is related to a recent anthropogenic eutrophication of water bodies in this region, keeping in mind that examples of successful invasion of some European lakes by D. galeata after their eutrophication are well-known. We also hypothesize that establishment of populations of this non-indigenous taxon in Australia might have a strong negative impact on native lake biota.

  10. The opportunities and challenges of large-scale molecular approaches to songbird neurobiology

    PubMed Central

    Mello, C.V.; Clayton, D.F.

    2014-01-01

    High-throughput methods for analyzing genome structure and function are having a large impact in songbird neurobiology. Methods include genome sequencing and annotation, comparative genomics, DNA microarrays and transcriptomics, and the development of a brain atlas of gene expression. Key emerging findings include the identification of complex transcriptional programs active during singing, the robust brain expression of non-coding RNAs, evidence of profound variations in gene expression across brain regions, and the identification of molecular specializations within song production and learning circuits. Current challenges include the statistical analysis of large datasets, effective genome curations, the efficient localization of gene expression changes to specific neuronal circuits and cells, and the dissection of behavioral and environmental factors that influence brain gene expression. The field requires efficient methods for comparisons with organisms like chicken, which offer important anatomical, functional and behavioral contrasts. As sequencing costs plummet, opportunities emerge for comparative approaches that may help reveal evolutionary transitions contributing to vocal learning, social behavior and other properties that make songbirds such compelling research subjects. PMID:25280907

  11. Metagenome sequencing and 98 microbial genomes from Juan de Fuca Ridge flank subsurface fluids

    PubMed Central

    Jungbluth, Sean P.; Amend, Jan P.; Rappé, Michael S.

    2017-01-01

    The global deep subsurface biosphere is one of the largest reservoirs for microbial life on our planet. This study takes advantage of new sampling technologies and couples them with improvements to DNA sequencing and associated informatics tools to reconstruct the genomes of uncultivated Bacteria and Archaea from fluids collected deep within the Juan de Fuca Ridge subseafloor. Here, we generated two metagenomes from borehole observatories located 311 meters apart and, using binning tools, retrieved 98 genomes from metagenomes (GFMs). Of the GFMs, 31 were estimated to be >90% complete, while an additional 17 were >70% complete. Phylogenomic analysis revealed 53 bacterial and 45 archaeal GFMs, of which nearly all were distantly related to known cultivated isolates. In the GFMs, abundant Bacteria included Chloroflexi, Nitrospirae, Acetothermia (OP1), EM3, Aminicenantes (OP8), Gammaproteobacteria, and Deltaproteobacteria, while abundant Archaea included Archaeoglobi, Bathyarchaeota (MCG), and Marine Benthic Group E (MBG-E). These data are the first GFMs reconstructed from the deep basaltic subseafloor biosphere, and provide a dataset available for further interrogation. PMID:28350381

  12. Assessing the prevalence of mycoplasma contamination in cell culture via a survey of NCBI's RNA-seq archive

    PubMed Central

    Olarerin-George, Anthony O.; Hogenesch, John B.

    2015-01-01

    Mycoplasmas are notorious contaminants of cell culture and can have profound effects on host cell biology by depriving cells of nutrients and inducing global changes in gene expression. Over the last two decades, sentinel testing has revealed wide-ranging contamination rates in mammalian culture. To obtain an unbiased assessment from hundreds of labs, we analyzed sequence data from 9395 rodent and primate samples from 884 series in the NCBI Sequence Read Archive. We found 11% of these series were contaminated (defined as ≥100 reads/million mapping to mycoplasma in one or more samples). Ninety percent of mycoplasma-mapped reads aligned to ribosomal RNA. This was unexpected given 37% of contaminated series used poly(A)-selection for mRNA enrichment. Lastly, we examined the relationship between mycoplasma contamination and host gene expression in a single cell RNA-seq dataset and found 61 host genes (P < 0.001) were significantly associated with mycoplasma-mapped read counts. In all, this study suggests mycoplasma contamination is still prevalent today and poses substantial risk to research quality. PMID:25712092
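
    The contamination criterion quoted in this abstract (at least 100 reads per million mapping to mycoplasma in one or more samples of a series) can be sketched as follows. This is only an illustration of the stated threshold, not the authors' pipeline; the function names and the read counts are hypothetical.

```python
def reads_per_million(mapped_reads: int, total_reads: int) -> float:
    """Scale a mapped-read count to reads per million sequenced reads."""
    if total_reads <= 0:
        raise ValueError("total_reads must be positive")
    return mapped_reads * 1_000_000 / total_reads


def series_is_contaminated(samples, threshold_rpm: float = 100.0) -> bool:
    """Flag a series if any sample reaches the reads-per-million threshold.

    `samples` is an iterable of (mycoplasma_mapped_reads, total_reads) pairs.
    The default threshold follows the definition given in the abstract.
    """
    return any(
        reads_per_million(mapped, total) >= threshold_rpm
        for mapped, total in samples
    )


# Hypothetical read counts, for illustration only.
clean_series = [(12, 20_000_000), (0, 15_000_000)]
flagged_series = [(5, 10_000_000), (3_000, 25_000_000)]

print(series_is_contaminated(clean_series))
print(series_is_contaminated(flagged_series))
```

    A single sample at or above the threshold is enough to flag the whole series, which matches the "one or more samples" wording of the definition.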

  13. Sequence-based predictive modeling to identify cancerlectins

    PubMed Central

    Lai, Hong-Yan; Chen, Xin-Xin; Chen, Wei; Tang, Hua; Lin, Hao

    2017-01-01

    Lectins are a diverse type of glycoproteins or carbohydrate-binding proteins that are widely distributed across species. They can specifically identify and exclusively bind to certain kinds of saccharide groups. Cancerlectins are a group of lectins that are closely related to cancer and play a major role in the initiation, survival, growth, metastasis and spread of tumors. Several computational methods have emerged to discriminate cancerlectins from non-cancerlectins, promoting the study of pathogenic mechanisms and clinical treatment of cancer. However, the predictive accuracies of most of these techniques are very limited. In this work, by constructing a benchmark dataset based on the CancerLectinDB database, a new amino acid sequence-based strategy for feature description was developed, and then the binomial distribution was applied to screen the optimal feature set. Ultimately, an SVM-based predictor was used to distinguish cancerlectins from non-cancerlectins, and achieved an accuracy of 77.48% with AUC of 85.52% in jackknife cross-validation. The results revealed that our prediction model could perform better than published predictive tools. PMID:28423655

  14. A Feature-Based Approach to Modeling Protein–DNA Interactions

    PubMed Central

    Segal, Eran

    2008-01-01

    Transcription factor (TF) binding to its DNA target site is a fundamental regulatory interaction. The most common model used to represent TF binding specificities is a position specific scoring matrix (PSSM), which assumes independence between binding positions. However, in many cases, this simplifying assumption does not hold. Here, we present feature motif models (FMMs), a novel probabilistic method for modeling TF–DNA interactions, based on log-linear models. Our approach uses sequence features to represent TF binding specificities, where each feature may span multiple positions. We develop the mathematical formulation of our model and devise an algorithm for learning its structural features from binding site data. We also developed a discriminative motif finder, which discovers de novo FMMs that are enriched in target sets of sequences compared to background sets. We evaluate our approach on synthetic data and on the widely used TF chromatin immunoprecipitation (ChIP) dataset of Harbison et al. We then apply our algorithm to high-throughput TF ChIP data from mouse and human, reveal sequence features that are present in the binding specificities of mouse and human TFs, and show that FMMs explain TF binding significantly better than PSSMs. Our FMM learning and motif finder software are available at http://genie.weizmann.ac.il/. PMID:18725950
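
    The independence assumption behind a PSSM, which this abstract contrasts with feature motif models, can be illustrated with a short sketch. Each position contributes its own log-odds term and the terms are simply summed. The 3-bp matrix, the uniform background, and the function names below are hypothetical; this is the baseline PSSM model, not the authors' FMM software.

```python
import math

# Hypothetical per-position base probabilities for a 3-bp binding site.
PSSM = [
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.7, "T": 0.1},
    {"A": 0.1, "C": 0.7, "G": 0.1, "T": 0.1},
]
# Assumed uniform background base frequencies.
BACKGROUND = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}


def pssm_log_odds(site, pssm=PSSM, background=BACKGROUND):
    """Sum per-position log2 odds; positions are treated as independent."""
    if len(site) != len(pssm):
        raise ValueError("site length must equal the PSSM width")
    return sum(
        math.log2(column[base] / background[base])
        for column, base in zip(pssm, site)
    )


def best_window(sequence, pssm=PSSM):
    """Return (score, start) of the highest-scoring window in `sequence`."""
    width = len(pssm)
    if len(sequence) < width:
        raise ValueError("sequence is shorter than the PSSM")
    return max(
        (pssm_log_odds(sequence[i:i + width], pssm), i)
        for i in range(len(sequence) - width + 1)
    )


score, start = best_window("TTAGCTT")
print(start, round(score, 3))
```

    Because the score is a plain sum over positions, a PSSM cannot express a preference that depends on two positions jointly; features spanning multiple positions are what the abstract says FMMs add.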

  15. Quantification of DNA cleavage specificity in Hi-C experiments.

    PubMed

    Meluzzi, Dario; Arya, Gaurav

    2016-01-08

    Hi-C experiments produce large numbers of DNA sequence read pairs that are typically analyzed to deduce genomewide interactions between arbitrary loci. A key step in these experiments is the cleavage of cross-linked chromatin with a restriction endonuclease. Although this cleavage should happen specifically at the enzyme's recognition sequence, an unknown proportion of cleavage events may involve other sequences, owing to the enzyme's star activity or to random DNA breakage. A quantitative estimation of these non-specific cleavages may enable simulating realistic Hi-C read pairs for validation of downstream analyses, monitoring the reproducibility of experimental conditions and investigating biophysical properties that correlate with DNA cleavage patterns. Here we describe a computational method for analyzing Hi-C read pairs to estimate the fractions of cleavages at different possible targets. The method relies on expressing an observed local target distribution downstream of aligned reads as a linear combination of known conditional local target distributions. We validated this method using Hi-C read pairs obtained by computer simulation. Application of the method to experimental Hi-C datasets from murine cells revealed interesting similarities and differences in patterns of cleavage across the various experiments considered. © The Author(s) 2015. Published by Oxford University Press on behalf of Nucleic Acids Research.

  16. Comparison of intraspecific, interspecific and intergeneric chloroplast diversity in Cycads

    PubMed Central

    Jiang, Guo-Feng; Hinsinger, Damien Daniel; Strijk, Joeri Sergej

    2016-01-01

    Cycads are among the most threatened plant species. Increasing the availability of genomic information by adding whole chloroplast data is a fundamental step in supporting phylogenetic studies and conservation efforts. Here, we assemble a dataset encompassing three taxonomic levels in cycads, including ten genera, three species in the genus Cycas and two individuals of C. debaoensis. Repeated sequences, SSRs and variations of the chloroplast were analyzed at the intraspecific, interspecific and intergeneric scale, and using our sequence data, we reconstruct a phylogenomic tree for cycads. The chloroplast was 162,094 bp in length, with 133 genes annotated, including 87 protein-coding, 37 tRNA and 8 rRNA genes. We found 7 repeated sequences and 39 SSRs. Seven loci showed promising levels of variations for application in DNA-barcoding. The chloroplast phylogeny confirmed the division of Cycadales in two suborders, each of them being monophyletic, revealing a contradiction with the current family circumscription and its evolution. Finally, 10 intraspecific SNPs were found. Our results showed that despite the extremely restricted distribution range of C. debaoensis, using complete chloroplast data is useful not only in intraspecific studies, but also to improve our understanding of cycad evolution and in defining conservation strategies for this emblematic group. PMID:27558458

  17. A Portrait of the Transcriptome of the Neglected Trematode, Fasciola gigantica—Biological and Biotechnological Implications

    PubMed Central

    Young, Neil D.; Jex, Aaron R.; Cantacessi, Cinzia; Hall, Ross S.; Campbell, Bronwyn E.; Spithill, Terence W.; Tangkawattana, Sirikachorn; Tangkawattana, Prasarn; Laha, Thewarach; Gasser, Robin B.

    2011-01-01

    Fasciola gigantica (Digenea) is an important foodborne trematode that causes liver fluke disease (fascioliasis) in mammals, including ungulates and humans, mainly in tropical climatic zones of the world. Despite its socioeconomic impact, almost nothing is known about the molecular biology of this parasite, its interplay with its hosts, and the pathogenesis of fascioliasis. Modern genomic technologies now provide unique opportunities to rapidly tackle these exciting areas. The present study reports the first transcriptome representing the adult stage of F. gigantica (of bovid origin), defined using a massively parallel sequencing-coupled bioinformatic approach. From >20 million raw sequence reads, >30,000 contiguous sequences were assembled, of which most were novel. Relative levels of transcription were determined for individual molecules, which were also characterized (at the inferred amino acid level) based on homology, gene ontology, and/or pathway mapping. Comparisons of the transcriptome of F. gigantica with those of other trematodes, including F. hepatica, revealed similarities in transcription for molecules inferred to have key roles in parasite-host interactions. Overall, the present dataset should provide a solid foundation for future fundamental genomic, proteomic, and metabolomic explorations of F. gigantica, as well as a basis for applied outcomes such as the development of novel methods of intervention against this neglected parasite. PMID:21408104

  18. Panax ginseng genome examination for ginsenoside biosynthesis.

    PubMed

    Xu, Jiang; Chu, Yang; Liao, Baosheng; Xiao, Shuiming; Yin, Qinggang; Bai, Rui; Su, He; Dong, Linlin; Li, Xiwen; Qian, Jun; Zhang, Jingjing; Zhang, Yujun; Zhang, Xiaoyan; Wu, Mingli; Zhang, Jie; Li, Guozheng; Zhang, Lei; Chang, Zhenzhan; Zhang, Yuebin; Jia, Zhengwei; Liu, Zhixiang; Afreh, Daniel; Nahurira, Ruth; Zhang, Lianjuan; Cheng, Ruiyang; Zhu, Yingjie; Zhu, Guangwei; Rao, Wei; Zhou, Chao; Qiao, Lirui; Huang, Zhihai; Cheng, Yung-Chi; Chen, Shilin

    2017-11-01

    Ginseng, which contains ginsenosides as bioactive compounds, has been regarded as an important traditional medicine for several millennia. However, the genetic background of ginseng remains poorly understood, partly because of the plant's large and complex genome composition. We report the entire genome sequence of Panax ginseng using next-generation sequencing. The 3.5-Gb nucleotide sequence contains more than 60% repeats and encodes 42 006 predicted genes. Twenty-two transcriptome datasets and mass spectrometry images of ginseng roots were adopted to precisely quantify the functional genes. Thirty-one genes were identified to be involved in the mevalonic acid pathway. Eight of these genes were annotated as 3-hydroxy-3-methylglutaryl-CoA reductases, which displayed diverse structures and expression characteristics. A total of 225 UDP-glycosyltransferases (UGTs) were identified, and these UGTs accounted for one of the largest gene families of ginseng. Tandem repeats contributed to the duplication and divergence of UGTs. Molecular modeling of UGTs in the 71st, 74th, and 94th families revealed a regiospecific conserved motif located at the N-terminus. Molecular docking predicted that this motif captures ginsenoside precursors. The ginseng genome represents a valuable resource for understanding and improving the breeding, cultivation, and synthesis biology of this key herb. © The Author 2017. Published by Oxford University Press.

  19. Functional evaluation of out-of-the-box text-mining tools for data-mining tasks

    PubMed Central

    Jung, Kenneth; LePendu, Paea; Iyer, Srinivasan; Bauer-Mehren, Anna; Percha, Bethany; Shah, Nigam H

    2015-01-01

    Objective: The trade-off between the speed and simplicity of dictionary-based term recognition and the richer linguistic information provided by more advanced natural language processing (NLP) is an area of active discussion in clinical informatics. In this paper, we quantify this trade-off among text processing systems that make different trade-offs between speed and linguistic understanding. We tested both types of systems in three clinical research tasks: phase IV safety profiling of a drug, learning adverse drug–drug interactions, and learning used-to-treat relationships between drugs and indications. Materials: We first benchmarked the accuracy of the NCBO Annotator and REVEAL in a manually annotated, publicly available dataset from the 2008 i2b2 Obesity Challenge. We then applied the NCBO Annotator and REVEAL to 9 million clinical notes from the Stanford Translational Research Integrated Database Environment (STRIDE) and used the resulting data for three research tasks. Results: There is no significant difference between using the NCBO Annotator and REVEAL in the results of the three research tasks when using large datasets. In one subtask, REVEAL achieved higher sensitivity with smaller datasets. Conclusions: For a variety of tasks, employing simple term recognition methods instead of advanced NLP methods results in little or no impact on accuracy when using large datasets. Simpler dictionary-based methods have the advantage of scaling well to very large datasets. Promoting the use of simple, dictionary-based methods for population level analyses can advance adoption of NLP in practice. PMID:25336595

  20. The FUN of identifying gene function in bacterial pathogens; insights from Salmonella functional genomics.

    PubMed

    Hammarlöf, Disa L; Canals, Rocío; Hinton, Jay C D

    2013-10-01

    The availability of thousands of genome sequences of bacterial pathogens poses a particular challenge because each genome contains hundreds of genes of unknown function (FUN). How can we easily discover which FUN genes encode important virulence factors? One solution is to combine two different functional genomic approaches. First, transcriptomics identifies bacterial FUN genes that show differential expression during the process of mammalian infection. Second, global mutagenesis identifies individual FUN genes that the pathogen requires to cause disease. The intersection of these datasets can reveal a small set of candidate genes most likely to encode novel virulence attributes. We demonstrate this approach with the Salmonella infection model, and propose that a similar strategy could be used for other bacterial pathogens. Copyright © 2013 Elsevier Ltd. All rights reserved.
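
    The dataset intersection described in this abstract can be sketched as a plain set operation: keep genes of unknown function that are both differentially expressed during infection and required for disease. The gene identifiers and the function name below are hypothetical placeholders, not data from the study.

```python
def candidate_virulence_genes(differentially_expressed, required_for_infection,
                              unknown_function):
    """Return FUN genes supported by both functional genomic approaches."""
    candidates = (
        set(differentially_expressed)
        & set(required_for_infection)
        & set(unknown_function)
    )
    return sorted(candidates)


# Hypothetical gene identifiers, for illustration only.
transcriptomics_hits = {"funA", "funB", "funC", "knownX"}
mutagenesis_hits = {"funB", "funC", "funD", "knownX"}
fun_genes = {"funA", "funB", "funC", "funD"}

print(candidate_virulence_genes(transcriptomics_hits, mutagenesis_hits, fun_genes))
```

    Requiring support from both approaches shrinks the candidate list, which is the point the abstract makes about the intersection revealing a small set of genes.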

  1. Pathway perturbations in signaling networks: Linking genotype to phenotype.

    PubMed

    Li, Yongsheng; McGrail, Daniel J; Latysheva, Natasha; Yi, Song; Babu, M Madan; Sahni, Nidhi

    2018-05-10

    Genes and gene products interact with each other to form signal transduction networks in the cell. The interactome networks are under intricate regulation in physiological conditions, but could go awry upon genome instability caused by genetic mutations. In the past decade with next-generation sequencing technologies, an increasing number of genomic mutations have been identified in a variety of disease patients and healthy individuals. As functional and systematic studies on these mutations leap forward, they begin to reveal insights into cellular homeostasis and disease mechanisms. In this review, we discuss recent advances in the field of network biology and signaling pathway perturbations upon genomic changes, and highlight the success of various omics datasets in unraveling genotype-to-phenotype relationships. Copyright © 2018 Elsevier Ltd. All rights reserved.

  2. PSSP-RFE: accurate prediction of protein structural class by recursive feature extraction from PSI-BLAST profile, physical-chemical property and functional annotations.

    PubMed

    Li, Liqi; Cui, Xiang; Yu, Sanjiu; Zhang, Yuan; Luo, Zhong; Yang, Hua; Zhou, Yue; Zheng, Xiaoqi

    2014-01-01

    Protein structure prediction is critical to functional annotation of the massively accumulated biological sequences, which prompts an imperative need for the development of high-throughput technologies. As a first and key step in protein structure prediction, protein structural class prediction becomes an increasingly challenging task. Amongst most homological-based approaches, the accuracies of protein structural class prediction are sufficiently high for high similarity datasets, but still far from being satisfactory for low similarity datasets, i.e., below 40% in pairwise sequence similarity. Therefore, we present a novel method for accurate and reliable protein structural class prediction for both high and low similarity datasets. This method is based on Support Vector Machine (SVM) in conjunction with integrated features from position-specific score matrix (PSSM), PROFEAT and Gene Ontology (GO). A feature selection approach, SVM-RFE, is also used to rank the integrated feature vectors through recursively removing the feature with the lowest ranking score. The definitive top features selected by SVM-RFE are input into the SVM engines to predict the structural class of a query protein. To validate our method, jackknife tests were applied to seven widely used benchmark datasets, reaching overall accuracies between 84.61% and 99.79%, which are significantly higher than those achieved by state-of-the-art tools. These results suggest that our method could serve as an accurate and cost-effective alternative to existing methods in protein structural classification, especially for low similarity datasets.

  3. Artificial seismic acceleration

    USGS Publications Warehouse

    Felzer, Karen R.; Page, Morgan T.; Michael, Andrew J.

    2015-01-01

    In their 2013 paper, Bouchon, Durand, Marsan, Karabulut, and Schmittbuhl (BDMKS) claim to see significant accelerating seismicity before M 6.5 interplate mainshocks, but not before intraplate mainshocks, reflecting a preparatory process before large events. We concur with the finding of BDMKS that their interplate dataset has significantly more foreshocks than their intraplate dataset; however, we disagree that the foreshocks are predictive of large events in particular. Acceleration in stacked foreshock sequences has been seen before and has been explained by the cascade model, in which earthquakes occasionally trigger aftershocks larger than themselves. In this model, the time lags between the smaller mainshocks and larger aftershocks follow the inverse power law common to all aftershock sequences, creating an apparent acceleration when stacked (see Supplementary Information).

  4. Screening and expression of selected taxonomically conserved and unique hypothetical proteins in Burkholderia pseudomallei K96243

    NASA Astrophysics Data System (ADS)

    Akhir, Nor Azurah Mat; Nadzirin, Nurul; Mohamed, Rahmah; Firdaus-Raih, Mohd

    2015-09-01

    Hypothetical proteins of bacterial pathogens represent a large number of novel biological mechanisms which could belong to essential pathways in the bacteria. They lack functional characterizations mainly due to the inability of sequence homology based methods to detect functional relationships in the absence of detectable sequence similarity. The dataset derived from this study showed 550 candidates conserved in genomes that have pathogenicity information and present only in the Burkholderiales order. The dataset has been narrowed down to taxonomic clusters. Ten proteins were selected for ORF amplification; seven of them were successfully amplified, and only four proteins were successfully expressed. These proteins will be great candidates for determining the true function via structural biology.

  5. Evaluation of experimental design and computational parameter choices affecting analyses of ChIP-seq and RNA-seq data in undomesticated poplar trees.

    Treesearch

    Lijun Liu; V. Missirian; Matthew S. Zinkgraf; Andrew Groover; V. Filkov

    2014-01-01

    Background: One of the great advantages of next generation sequencing is the ability to generate large genomic datasets for virtually all species, including non-model organisms. It should be possible, in turn, to apply advanced computational approaches to these datasets to develop models of biological processes. In a practical sense, working with non-model organisms...

  6. Consensus coding sequence (CCDS) database: a standardized set of human and mouse protein-coding regions supported by expert curation

    PubMed Central

    Pujar, Shashikant; O’Leary, Nuala A; Farrell, Catherine M; Mudge, Jonathan M; Wallin, Craig; Diekhans, Mark; Barnes, If; Bennett, Ruth; Berry, Andrew E; Cox, Eric; Davidson, Claire; Goldfarb, Tamara; Gonzalez, Jose M; Hunt, Toby; Jackson, John; Joardar, Vinita; Kay, Mike P; Kodali, Vamsi K; McAndrews, Monica; McGarvey, Kelly M; Murphy, Michael; Rajput, Bhanu; Rangwala, Sanjida H; Riddick, Lillian D; Seal, Ruth L; Webb, David; Zhu, Sophia; Aken, Bronwen L; Bult, Carol J; Frankish, Adam; Pruitt, Kim D

    2018-01-01

    The Consensus Coding Sequence (CCDS) project provides a dataset of protein-coding regions that are identically annotated on the human and mouse reference genome assembly in genome annotations produced independently by NCBI and the Ensembl group at EMBL-EBI. This dataset is the product of an international collaboration that includes NCBI, Ensembl, HUGO Gene Nomenclature Committee, Mouse Genome Informatics and University of California, Santa Cruz. Identically annotated coding regions, which are generated using an automated pipeline and pass multiple quality assurance checks, are assigned a stable and tracked identifier (CCDS ID). Additionally, coordinated manual review by expert curators from the CCDS collaboration helps in maintaining the integrity and high quality of the dataset. The CCDS data are available through an interactive web page (https://www.ncbi.nlm.nih.gov/CCDS/CcdsBrowse.cgi) and an FTP site (ftp://ftp.ncbi.nlm.nih.gov/pub/CCDS/). In this paper, we outline the ongoing work, growth and stability of the CCDS dataset and provide updates on new collaboration members and new features added to the CCDS user interface. We also present expert curation scenarios, with specific examples highlighting the importance of an accurate reference genome assembly and the crucial role played by input from the research community. PMID:29126148

  7. Rooting phylogenies using gene duplications: an empirical example from the bees (Apoidea).

    PubMed

    Brady, Seán G; Litman, Jessica R; Danforth, Bryan N

    2011-09-01

    The placement of the root node in a phylogeny is fundamental to characterizing evolutionary relationships. The root node of bee phylogeny remains unclear despite considerable previous attention. In order to test alternative hypotheses for the location of the root node in bees, we used the F1 and F2 paralogs of elongation factor 1-alpha (EF-1α) to compare the tree topologies that result when using outgroup versus paralogous rooting. Fifty-two taxa representing each of the seven bee families were sequenced for both copies of EF-1α. Two datasets were analyzed. In the first (the "concatenated" dataset), the F1 and F2 copies for each species were concatenated and the tree was rooted using appropriate outgroups (sphecid and crabronid wasps). In the second dataset (the "duplicated" dataset), the F1 and F2 copies were aligned to each other and each copy for all taxa was treated as a separate terminal. In this dataset, the root was placed between the F1 and F2 copies (i.e., paralog rooting). Bayesian analyses demonstrate that the outgroup rooting approach outperforms paralog rooting, recovering deeper clades and showing stronger support for groups well established by both morphological and other molecular data. Sequence characteristics of the two copies were compared at the amino acid level, but little evidence was found to suggest that one copy is more functionally conserved. Although neither approach yields an unambiguous root to the tree, both approaches strongly indicate that the root of bee phylogeny does not fall near Colletidae, as has been previously proposed. We discuss paralog rooting as a general strategy and why this approach performs relatively poorly with our particular dataset. Copyright © 2011 Elsevier Inc. All rights reserved.

  8. Benchmarking protein classification algorithms via supervised cross-validation.

    PubMed

    Kertész-Farkas, Attila; Dhir, Somdutta; Sonego, Paolo; Pacurar, Mircea; Netoteia, Sergiu; Nijveen, Harm; Kuzniar, Arnold; Leunissen, Jack A M; Kocsor, András; Pongor, Sándor

    2008-04-24

    Development and testing of protein classification algorithms are hampered by the fact that the protein universe is characterized by groups vastly different in the number of members, in average protein size, similarity within group, etc. Datasets based on traditional cross-validation (k-fold, leave-one-out, etc.) may not give reliable estimates on how an algorithm will generalize to novel, distantly related subtypes of the known protein classes. Supervised cross-validation, i.e., selection of test and train sets according to the known subtypes within a database has been successfully used earlier in conjunction with the SCOP database. Our goal was to extend this principle to other databases and to design standardized benchmark datasets for protein classification. Hierarchical classification trees of protein categories provide a simple and general framework for designing supervised cross-validation strategies for protein classification. Benchmark datasets can be designed at various levels of the concept hierarchy using a simple graph-theoretic distance. A combination of supervised and random sampling was selected to construct reduced size model datasets, suitable for algorithm comparison. Over 3000 new classification tasks were added to our recently established protein classification benchmark collection that currently includes protein sequence (including protein domains and entire proteins), protein structure and reading frame DNA sequence data. We carried out an extensive evaluation based on various machine-learning algorithms such as nearest neighbor, support vector machines, artificial neural networks, random forests and logistic regression, used in conjunction with comparison algorithms, BLAST, Smith-Waterman, Needleman-Wunsch, as well as 3D comparison methods DALI and PRIDE. The resulting datasets provide lower, and in our opinion more realistic estimates of the classifier performance than do random cross-validation schemes. 
    A combination of supervised and random sampling was used to construct model datasets, suitable for algorithm comparison.
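
    The supervised cross-validation idea in this abstract can be illustrated with a minimal Python sketch: rather than drawing random folds, each known subtype of a class is held out in turn, so the test proteins always come from a subtype absent from training. The function name `supervised_cv_splits`, the record layout and the toy family labels are illustrative assumptions, not the format of the benchmark collection itself.

```python
from collections import defaultdict


def supervised_cv_splits(records):
    """Yield (held_out_subtype, train_ids, test_ids) for one protein class.

    `records` is a list of (protein_id, subtype) pairs, all members of the
    same class. Each subtype is held out exactly once, so the test proteins
    never share a subtype with the training proteins.
    """
    by_subtype = defaultdict(list)
    for protein_id, subtype in records:
        by_subtype[subtype].append(protein_id)

    for held_out in sorted(by_subtype):
        test_ids = sorted(by_subtype[held_out])
        train_ids = sorted(
            pid
            for subtype, ids in by_subtype.items()
            if subtype != held_out
            for pid in ids
        )
        yield held_out, train_ids, test_ids


# Toy class with three subtypes (hypothetical identifiers).
toy_records = [
    ("p1", "famA"), ("p2", "famA"),
    ("p3", "famB"),
    ("p4", "famC"), ("p5", "famC"),
]
splits = list(supervised_cv_splits(toy_records))
```

    A random k-fold split of the same five proteins could place p1 in training and p2 in testing, which is the leakage between closely related members that the subtype-wise split is meant to avoid.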

  9. HBLAST: Parallelised sequence similarity--A Hadoop MapReducable basic local alignment search tool.

    PubMed

    O'Driscoll, Aisling; Belogrudov, Vladislav; Carroll, John; Kropp, Kai; Walsh, Paul; Ghazal, Peter; Sleator, Roy D

    2015-04-01

    The recent exponential growth of genomic databases has resulted in the common task of sequence alignment becoming one of the major bottlenecks in the field of computational biology. It is typical for these large datasets and complex computations to require cost prohibitive High Performance Computing (HPC) to function. As such, parallelised solutions have been proposed but many exhibit scalability limitations and are incapable of effectively processing "Big Data" - the name attributed to datasets that are extremely large, complex and require rapid processing. The Hadoop framework, comprised of distributed storage and a parallelised programming framework known as MapReduce, is specifically designed to work with such datasets but it is not trivial to efficiently redesign and implement bioinformatics algorithms according to this paradigm. The parallelisation strategy of "divide and conquer" for alignment algorithms can be applied to both data sets and input query sequences. However, scalability is still an issue due to memory constraints or large databases, with very large database segmentation leading to additional performance decline. Herein, we present Hadoop Blast (HBlast), a parallelised BLAST algorithm that proposes a flexible method to partition both databases and input query sequences using "virtual partitioning". HBlast presents improved scalability over existing solutions and well balanced computational work load while keeping database segmentation and recompilation to a minimum. Enhanced BLAST search performance on cheap memory constrained hardware has significant implications for in field clinical diagnostic testing; enabling faster and more accurate identification of pathogenic DNA in human blood or tissue samples. Copyright © 2015 Elsevier Inc. All rights reserved.

  10. Ohmic resistance in a multi-anode MxCs

    EPA Pesticide Factsheets

    A-3txf_sequence summary.xksx: Abundance of contigs or unique sequences for each biofilm sample from anodes in the MEC reactor. Hodon Waterloo final_fasta_working.docx: Raw sequences with their identification numbers. RNA S1_MEC.docx: Representative sequences with their ID number and taxonomy. This dataset is associated with the following publication: Santodomingo, J., H. Ryu, B. Dhar, and H. Lee. Ohmic resistance affects microbial community and electrochemical kinetics in a multi-anode microbial electrochemical cell. JOURNAL OF POWER SOURCES. Elsevier Science Ltd, New York, NY, USA, 331: 315-321, (2016).

  11. A photogrammetric technique for generation of an accurate multispectral optical flow dataset

    NASA Astrophysics Data System (ADS)

    Kniaz, V. V.

    2017-06-01

    The availability of an accurate dataset is the key requirement for the successful development of an optical flow estimation algorithm. A large number of freely available optical flow datasets have been developed in recent years and have given rise to many powerful algorithms. However, most of the datasets include only images captured in the visible spectrum. This paper focuses on the creation of a multispectral optical flow dataset with an accurate ground truth. The generation of an accurate ground truth optical flow is a rather complex problem, as no device for error-free optical flow measurement has been developed to date. Existing methods for ground truth optical flow estimation are based on hidden textures, 3D modelling or laser scanning. Such techniques either work only with a synthetic optical flow or provide a sparse ground truth optical flow. In this paper a new photogrammetric method for generation of an accurate ground truth optical flow is proposed. The method combines the accuracy and density of synthetic optical flow datasets with the flexibility of laser scanning based techniques. A multispectral dataset including various image sequences was generated using the developed method. The dataset is freely available on the accompanying web site.

  12. I'll take that to go: Big data bags and minimal identifiers for exchange of large, complex datasets

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Chard, Kyle; D'Arcy, Mike; Heavner, Benjamin D.

    Big data workflows often require the assembly and exchange of complex, multi-element datasets. For example, in biomedical applications, the input to an analytic pipeline can be a dataset consisting of thousands of images and genome sequences assembled from diverse repositories, requiring a description of the contents of the dataset in a concise and unambiguous form. Typical approaches to creating datasets for big data workflows assume that all data reside in a single location, requiring costly data marshaling and permitting errors of omission and commission because dataset members are not explicitly specified. We address these issues by proposing simple methods and tools for assembling, sharing, and analyzing large and complex datasets that scientists can easily integrate into their daily workflows. These tools combine a simple and robust method for describing data collections (BDBags), data descriptions (Research Objects), and simple persistent identifiers (Minids) to create a powerful ecosystem of tools and services for big data analysis and sharing. We present these tools and use biomedical case studies to illustrate their use for the rapid assembly, sharing, and analysis of large datasets.

  13. Cross-species transferability and mapping of genomic and cDNA SSRs in pines

    Treesearch

    D. Chagne; P. Chaumeil; A. Ramboer; C. Collada; A. Guevara; M. T. Cervera; G. G. Vendramin; V. Garcia; J-M. Frigerio; Craig Echt; T. Richardson; Christophe Plomion

    2004-01-01

    Two unigene datasets of Pinus taeda and Pinus pinaster were screened to detect di-, tri- and tetranucleotide repeated motifs using the SSRIT script. A total of 419 simple sequence repeats (SSRs) were identified, of which only 12.8% overlapped between the two sets. The positions of the SSRs within the coding sequence were predicted...
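
    As a rough illustration of what screening for di-, tri- and tetranucleotide repeats involves, the Python sketch below finds perfect tandem repeats with a regular expression. It is not the SSRIT script; the function name `find_ssrs`, the minimum repeat counts and the example sequence are assumptions chosen for illustration only.

```python
import re

# Assumed minimum repeat counts per motif length (illustrative thresholds,
# not taken from the abstract or from SSRIT).
MIN_REPEATS = {2: 5, 3: 4, 4: 3}


def find_ssrs(sequence, min_repeats=MIN_REPEATS):
    """Return (start, motif, repeat_count) tuples for perfect di-, tri- and
    tetranucleotide tandem repeats in a DNA sequence (0-based start)."""
    sequence = sequence.upper()
    hits = []
    for motif_len, min_count in sorted(min_repeats.items()):
        # A motif of fixed length followed by at least (min_count - 1) copies.
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (motif_len, min_count - 1))
        for match in pattern.finditer(sequence):
            motif = match.group(1)
            # Skip motifs that are themselves repeats of a shorter unit,
            # e.g. "AA" (a homopolymer) or "ACAC" (really a dinucleotide).
            if any(
                motif == motif[:k] * (motif_len // k)
                for k in range(1, motif_len)
                if motif_len % k == 0
            ):
                continue
            repeat_count = len(match.group(0)) // motif_len
            hits.append((match.start(), motif, repeat_count))
    return hits


# Hypothetical sequence: (AC)6, (CTT)4 and a poly-A run that should be ignored.
example = "GGTT" + "AC" * 6 + "GGA" + "CTT" * 4 + "A" * 12
ssrs = find_ssrs(example)
```

    Comparing the motif and position lists produced for two unigene sets would be one simple way to estimate how many SSRs they share, although the abstract does not say how its overlap figure was computed.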

  14. Kidney segmentation in CT sequences using graph cuts based active contours model and contextual continuity.

    PubMed

    Zhang, Pin; Liang, Yanmei; Chang, Shengjiang; Fan, Hailun

    2013-08-01

    Accurate segmentation of renal tissues in abdominal computed tomography (CT) image sequences is an indispensable step for computer-aided diagnosis and pathology detection in clinical applications. In this study, the goal is to develop a radiology tool to extract renal tissues in CT sequences for the management of renal diagnosis and treatments. In this paper, the authors propose a new graph-cuts-based active contours model with an adaptive width of narrow band for kidney extraction in CT image sequences. Based on graph cuts and contextual continuity, the segmentation is carried out slice-by-slice. In the first stage, the middle two adjacent slices in a CT sequence are segmented interactively based on the graph cuts approach. Subsequently, the deformable contour evolves toward the renal boundaries by the proposed model for the kidney extraction of the remaining slices. In this model, the energy function combining boundary with regional information is optimized in the constructed graph and the adaptive search range is determined by contextual continuity and the object size. In addition, in order to reduce the complexity of the min-cut computation, the nodes in the graph only have n-links for fewer edges. A total of 30 CT image sequences with normal and pathological renal tissues were used to evaluate the accuracy and effectiveness of our method. The experimental results reveal that the average Dice similarity coefficient of these image sequences ranges from 92.37% to 95.71% and the corresponding standard deviation for each dataset ranges from 2.18% to 3.87%. In addition, the average automatic segmentation time for one kidney in each slice is about 0.36 s. Integrating the graph-cuts-based active contours model with contextual continuity, the algorithm takes advantage of energy minimization and the characteristics of image sequences. The proposed method achieves effective results for kidney segmentation in CT sequences.
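
    The Dice similarity coefficient reported above measures the overlap between a predicted segmentation A and a reference segmentation B as 2|A ∩ B| / (|A| + |B|). A minimal Python sketch follows; the function name `dice_coefficient` and the tiny masks are illustrative assumptions and are not taken from the study.

```python
def dice_coefficient(predicted, reference):
    """Dice similarity coefficient between two binary masks.

    Masks are nested lists of 0/1 values (rows of pixels) with the same
    number of pixels. Returns 2 * |A and B| / (|A| + |B|), or 1.0 when
    both masks are empty.
    """
    pred_flat = [p for row in predicted for p in row]
    ref_flat = [r for row in reference for r in row]
    if len(pred_flat) != len(ref_flat):
        raise ValueError("masks must have the same number of pixels")
    overlap = sum(1 for p, r in zip(pred_flat, ref_flat) if p and r)
    total = sum(1 for p in pred_flat if p) + sum(1 for r in ref_flat if r)
    return 1.0 if total == 0 else 2.0 * overlap / total


# Hypothetical 3x4 masks: 5 foreground pixels each, 4 of them shared.
predicted_mask = [
    [0, 1, 1, 0],
    [0, 1, 1, 0],
    [0, 0, 1, 0],
]
reference_mask = [
    [0, 1, 1, 0],
    [0, 1, 1, 1],
    [0, 0, 0, 0],
]
score = dice_coefficient(predicted_mask, reference_mask)  # 2 * 4 / (5 + 5)
```

    Here the score is 0.8, that is 80%; the values of 92.37% to 95.71% in the abstract are averages of this kind of per-sequence score.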

  15. Strategies to improve reference databases for soil microbiomes

    DOE PAGES

    Choi, Jinlyung; Yang, Fan; Stepanauskas, Ramunas; ...

    2016-12-09

    A database of curated genomes is needed to better assess soil microbial communities and their processes associated with differing land management and environmental impacts. Interpreting soil metagenomic datasets with existing sequence databases is challenging because these datasets are biased towards medical and biotechnology research and can result in misleading annotations. We have curated a database of 928 genomes of soil-associated organisms (888 bacteria, 34 archaea, and 6 fungi). Using this database as a representation of the current state of knowledge of soil microbes that are well-characterized, we evaluated its composition and compared it to broader microbial databases, specifically NCBI’s RefSeq, as well as 3,035 publicly available soil amplicon datasets. These comparisons identified phyla and functions that are enriched in soils as well as those that may be underrepresented in RefSoil. For example, RefSoil was observed to have increased representation of Firmicutes despite its low abundance in soil environments and also lacked representation of Acidobacteria and Verrucomicrobia, which are abundant in soils. Our comparison of RefSoil to soil amplicon datasets allowed us to identify targets that if cultured or sequenced would significantly increase the biodiversity represented within RefSoil. To demonstrate the opportunities to access these underrepresented targets, we employed single cell genomics in a pilot experiment to recover 14 genomes from the "most wanted" list, which improved RefSoil's representation of EMP sequences by 7% by abundance. This effort demonstrates the value of RefSoil in the guidance of future research efforts and the capability of single cell genomics as a practical means to fill the existing genomic data gaps.

  16. Strategies to improve reference databases for soil microbiomes

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Choi, Jinlyung; Yang, Fan; Stepanauskas, Ramunas

    A database of curated genomes is needed to better assess soil microbial communities and their processes associated with differing land management and environmental impacts. Interpreting soil metagenomic datasets with existing sequence databases is challenging because these datasets are biased towards medical and biotechnology research and can result in misleading annotations. We have curated a database of 928 genomes of soil-associated organisms (888 bacteria, 34 archaea, and 6 fungi). Using this database as a representation of the current state of knowledge of soil microbes that are well-characterized, we evaluated its composition and compared it to broader microbial databases, specifically NCBI’s RefSeq, as well as 3,035 publicly available soil amplicon datasets. These comparisons identified phyla and functions that are enriched in soils as well as those that may be underrepresented in RefSoil. For example, RefSoil was observed to have increased representation of Firmicutes despite its low abundance in soil environments and also lacked representation of Acidobacteria and Verrucomicrobia, which are abundant in soils. Our comparison of RefSoil to soil amplicon datasets allowed us to identify targets that if cultured or sequenced would significantly increase the biodiversity represented within RefSoil. To demonstrate the opportunities to access these underrepresented targets, we employed single cell genomics in a pilot experiment to recover 14 genomes from the "most wanted" list, which improved RefSoil's representation of EMP sequences by 7% by abundance. This effort demonstrates the value of RefSoil in the guidance of future research efforts and the capability of single cell genomics as a practical means to fill the existing genomic data gaps.

  17. PredPPCrys: Accurate Prediction of Sequence Cloning, Protein Production, Purification and Crystallization Propensity from Protein Sequences Using Multi-Step Heterogeneous Feature Fusion and Selection

    PubMed Central

    Wang, Huilin; Wang, Mingjun; Tan, Hao; Li, Yuan; Zhang, Ziding; Song, Jiangning

    2014-01-01

    X-ray crystallography is the primary approach to solve the three-dimensional structure of a protein. However, a major bottleneck of this method is the failure of multi-step experimental procedures to yield diffraction-quality crystals, including sequence cloning, protein material production, purification, crystallization and ultimately, structural determination. Accordingly, prediction of the propensity of a protein to successfully undergo these experimental procedures based on the protein sequence may help narrow down laborious experimental efforts and facilitate target selection. A number of bioinformatics methods based on protein sequence information have been developed for this purpose. However, our knowledge on the important determinants of propensity for a protein sequence to produce high diffraction-quality crystals remains largely incomplete. In practice, most of the existing methods display poorer performance when evaluated on larger and updated datasets. To address this problem, we constructed an up-to-date dataset as the benchmark, and subsequently developed a new approach termed ‘PredPPCrys’ using the support vector machine (SVM). Using a comprehensive set of multifaceted sequence-derived features in combination with a novel multi-step feature selection strategy, we identified and characterized the relative importance and contribution of each feature type to the prediction performance of five individual experimental steps required for successful crystallization. The resulting optimal candidate features were used as inputs to build the first-level SVM predictor (PredPPCrys I). Next, prediction outputs of PredPPCrys I were used as the input to build second-level SVM classifiers (PredPPCrys II), which led to significantly enhanced prediction performance. Benchmarking experiments indicated that our PredPPCrys method outperforms most existing procedures on both up-to-date and previous datasets. 
    In addition, the predicted crystallization targets of currently non-crystallizable proteins were provided as compendium data, which are anticipated to facilitate target selection and design for the worldwide structural genomics consortium. PredPPCrys is freely available at http://www.structbioinfor.org/PredPPCrys. PMID:25148528

  18. Sequence Polymorphisms and Structural Variations among Four Grapevine (Vitis vinifera L.) Cultivars Representing Sardinian Agriculture

    PubMed Central

    Mercenaro, Luca; Nieddu, Giovanni; Porceddu, Andrea; Pezzotti, Mario; Camiolo, Salvatore

    2017-01-01

    The genetic diversity among grapevine (Vitis vinifera L.) cultivars that underlies differences in agronomic performance and wine quality reflects the accumulation of single nucleotide polymorphisms (SNPs) and small indels as well as larger genomic variations. A combination of high throughput sequencing and mapping against the grapevine reference genome allows the creation of comprehensive sequence variation maps. We used next generation sequencing and bioinformatics to generate an inventory of SNPs and small indels in four widely cultivated Sardinian grape cultivars (Bovale sardo, Cannonau, Carignano and Vermentino). More than 3,200,000 SNPs were identified with high statistical confidence. Some of the SNPs caused the appearance of premature stop codons and thus identified putative pseudogenes. The analysis of SNP distribution along chromosomes led to the identification of large genomic regions with uninterrupted series of homozygous SNPs. We used a digital comparative genomic hybridization approach to identify 6526 genomic regions with significant differences in copy number among the four cultivars compared to the reference sequence, including 81 regions shared between all four cultivars and 4953 specific to single cultivars (representing 1.2 and 75.9% of total copy number variation, respectively). Reads mapping at a distance that was not compatible with the insert size were used to identify a dataset of putative large deletions with cultivar Cannonau revealing the highest number. The analysis of genes mapping to these regions provided a list of candidates that may explain some of the phenotypic differences among the Bovale sardo, Cannonau, Carignano and Vermentino cultivars. PMID:28775732
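
    The copy number percentages quoted in this abstract follow directly from the region counts, as the short Python check below shows. The variable names are illustrative, and the final "remaining" category (regions that are neither shared by all four cultivars nor specific to one) is an inference from the totals rather than a figure stated in the abstract.

```python
# Region counts as reported in the abstract.
total_regions = 6526
shared_by_all_four = 81
specific_to_one_cultivar = 4953


def percent_of_total(count, total=total_regions):
    """Share of all copy number variant regions, rounded to one decimal."""
    return round(100.0 * count / total, 1)


shared_pct = percent_of_total(shared_by_all_four)          # 81 / 6526
specific_pct = percent_of_total(specific_to_one_cultivar)  # 4953 / 6526

# Inferred, not reported: regions found in two or three cultivars.
remaining_regions = total_regions - shared_by_all_four - specific_to_one_cultivar
remaining_pct = percent_of_total(remaining_regions)
```

    The first two values reproduce the reported 1.2% and 75.9%; the remainder works out to 1492 regions, or about 22.9% of the total.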

  19. Single cell sequencing reveals heterogeneity within ovarian cancer epithelium and cancer associated stromal cells.

    PubMed

    Winterhoff, Boris J; Maile, Makayla; Mitra, Amit Kumar; Sebe, Attila; Bazzaro, Martina; Geller, Melissa A; Abrahante, Juan E; Klein, Molly; Hellweg, Raffaele; Mullany, Sally A; Beckman, Kenneth; Daniel, Jerry; Starr, Timothy K

    2017-03-01

    The purpose of this study was to determine the level of heterogeneity in high grade serous ovarian cancer (HGSOC) by analyzing RNA expression in single epithelial and cancer associated stromal cells. In addition, we explored the possibility of identifying subgroups based on pathway activation and pre-defined signatures from cancer stem cells and chemo-resistant cells. A fresh HGSOC tumor specimen derived from ovary was enzymatically digested and depleted of immune infiltrating cells. RNA sequencing was performed on 92 single cells and 66 of these single cell datasets passed quality control checks. Sequences were analyzed using multiple bioinformatics tools, including clustering, principal component analysis, and geneset enrichment analysis to identify subgroups and activated pathways. Immunohistochemistry for ovarian cancer, stem cell and stromal markers was performed on adjacent tumor sections. Analysis of the gene expression patterns identified two major subsets of cells characterized by epithelial and stromal gene expression patterns. The epithelial group was characterized by proliferative genes including genes associated with oxidative phosphorylation and MYC activity, while the stromal group was characterized by increased expression of extracellular matrix (ECM) genes and genes associated with epithelial-to-mesenchymal transition (EMT). Neither group expressed a signature correlating with published chemo-resistant gene signatures, but many cells, predominantly in the stromal subgroup, expressed markers associated with cancer stem cells. Single cell sequencing provides a means of identifying subpopulations of cancer cells within a single patient. Single cell sequence analysis may prove to be critical for understanding the etiology, progression and drug resistance in ovarian cancer. Copyright © 2017 Elsevier Inc. All rights reserved.

  20. The Targeted Sequencing of Alpha Satellite DNA in Cercopithecus pogonias Provides New Insight into the Diversity and Dynamics of Centromeric Repeats in Old World monkeys.

    PubMed

    Cacheux, Lauriane; Ponger, Loïc; Gerbault-Seureau, Michèle; Loll, François; Gey, Delphine; Richard, Florence Anne; Escudé, Christophe

    2018-06-01

    Alpha satellite is the major repeated DNA element of primate centromeres. Specific evolutionary mechanisms have led to a great diversity of sequence families with peculiar genomic organization and distribution, which have till now been studied mostly in great apes. Using high throughput sequencing of alpha satellite monomers obtained by enzymatic digestion followed by computational and cytogenetic analysis, we compare here the diversity and genomic distribution of alpha satellite DNA in two related Old World monkey species, Cercopithecus pogonias and Cercopithecus solatus, which are known to have diverged about seven million years ago. Two main families of monomers, called C1 and C2, are found in both species. A detailed analysis of our datasets revealed the existence of numerous subfamilies within the centromeric C1 family. Although the most abundant subfamily is conserved between both species, our FISH experiments clearly show that some subfamilies are specific for each species and that their distribution is restricted to a subset of chromosomes, thereby pointing to the existence of recurrent amplification/homogenization events. The pericentromeric C2 family is very abundant on the short arm of all acrocentric chromosomes in both species, pointing to specific mechanisms that lead to this distribution. Results obtained using two different restriction enzymes are fully consistent with a predominant monomeric organization of alpha satellite DNA which coexists with higher order organization patterns in the Cercopithecus pogonias genome. Our study suggests that alpha satellite DNA is highly dynamic in Cercopithecini, with recurrent emergence of new sequence variants and interchromosomal sequence transfer.

  1. RNA design using simulated SHAPE data.

    PubMed

    Lotfi, Mohadeseh; Zare-Mirakabad, Fatemeh; Montaseri, Soheila

    2018-05-03

    It has long been established that in addition to being involved in protein translation, RNA plays essential roles in numerous other cellular processes, including gene regulation and DNA replication. Such roles are known to be dictated by higher-order structures of RNA molecules. It is therefore of prime importance to find an RNA sequence that can fold to acquire a particular function that is desirable for use in pharmaceuticals and basic research. The challenge of finding an RNA sequence for a given structure is known as the RNA design problem. Although there are several algorithms to solve this problem, they mainly consider hard constraints, such as minimum free energy, to evaluate the predicted sequences. Recently, SHAPE data has emerged as a new soft constraint for RNA secondary structure prediction. To take advantage of this new experimental constraint, we report here a new method for accurate design of RNA sequences based on their secondary structures using SHAPE data as pseudo-free energy. We then compare our algorithm with four others: INFO-RNA, ERD, MODENA and RNAifold 2.0. Our algorithm precisely predicts 26 out of 29 new sequences for the structures extracted from the Rfam dataset, while the other four algorithms predict no more than 22 out of 29. The proposed algorithm is comparable to the above algorithms on RNA-SSD datasets, where they can predict up to 33 appropriate sequences for RNA secondary structures out of 34.

  2. Linking GPS and travel diary data using sequence alignment in a study of children's independent mobility

    PubMed Central

    2011-01-01

    Background: Global positioning systems (GPS) are increasingly being used in health research to determine the location of study participants. Combining GPS data with data collected via travel/activity diaries allows researchers to assess where people travel in conjunction with data about trip purpose and accompaniment. However, linking GPS and diary data is problematic and to date the only method has been to match the two datasets manually, which is time consuming and unlikely to be practical for larger data sets. This paper assesses the feasibility of a new sequence alignment method of linking GPS and travel diary data in comparison with the manual matching method. Methods: GPS and travel diary data obtained from a study of children's independent mobility were linked using sequence alignment algorithms to test the proof of concept. Travel diaries were assessed for quality by counting the number of errors and inconsistencies in each participant's set of diaries. The success of the sequence alignment method was compared for higher versus lower quality travel diaries, and for accompanied versus unaccompanied trips. Time taken and percentage of trips matched were compared for the sequence alignment method and the manual method. Results: The sequence alignment method matched 61.9% of all trips. Higher quality travel diaries were associated with higher match rates in both the sequence alignment and manual matching methods. The sequence alignment method performed almost as well as the manual method and was an order of magnitude faster. However, the sequence alignment method was less successful at fully matching trips and at matching unaccompanied trips. Conclusions: Sequence alignment is a promising method of linking GPS and travel diary data in large population datasets, especially if limitations in the trip detection algorithm are addressed. PMID:22142322

  3. Enrichment allows identification of diverse, rare elements in metagenomic resistome-virulome sequencing.

    PubMed

    Noyes, Noelle R; Weinroth, Maggie E; Parker, Jennifer K; Dean, Chris J; Lakin, Steven M; Raymond, Robert A; Rovira, Pablo; Doster, Enrique; Abdo, Zaid; Martin, Jennifer N; Jones, Kenneth L; Ruiz, Jaime; Boucher, Christina A; Belk, Keith E; Morley, Paul S

    2017-10-17

    Shotgun metagenomic sequencing is increasingly utilized as a tool to evaluate ecological-level dynamics of antimicrobial resistance and virulence, in conjunction with microbiome analysis. Interest in use of this method for environmental surveillance of antimicrobial resistance and pathogenic microorganisms is also increasing. In published metagenomic datasets, the total of all resistance- and virulence-related sequences accounts for < 1% of all sequenced DNA, leading to limitations in detection of low-abundance resistome-virulome elements. This study describes the extent and composition of the low-abundance portion of the resistome-virulome, using a bait-capture and enrichment system that incorporates unique molecular indices to count DNA molecules and correct for enrichment bias. The use of the bait-capture and enrichment system significantly increased on-target sequencing of the resistome-virulome, enabling detection of an additional 1441 gene accessions and revealing a low-abundance portion of the resistome-virulome that was more diverse and compositionally different than that detected by more traditional metagenomic assays. The low-abundance portion of the resistome-virulome also contained resistance genes with public health importance, such as extended-spectrum betalactamases, that were not detected using traditional shotgun metagenomic sequencing. In addition, the use of the bait-capture and enrichment system enabled identification of rare resistance gene haplotypes that were used to discriminate between sample origins. These results demonstrate that the rare resistome-virulome contains valuable and unique information that can be utilized for both surveillance and population genetic investigations of resistance. Access to the rare resistome-virulome using the bait-capture and enrichment system validated in this study can greatly advance our understanding of microbiome-resistome dynamics.

  4. Gene discovery in an invasive tephritid model pest species, the Mediterranean fruit fly, Ceratitis capitata

    PubMed Central

    Gomulski, Ludvik M; Dimopoulos, George; Xi, Zhiyong; Soares, Marcelo B; Bonaldo, Maria F; Malacrida, Anna R; Gasperi, Giuliano

    2008-01-01

    Background: The medfly, Ceratitis capitata, is a highly invasive agricultural pest that has become a model insect for the development of biological control programs. Despite research into the behavior and classical and population genetics of this organism, the quantity of sequence data available is limited. We have utilized an expressed sequence tag (EST) approach to obtain detailed information on transcriptome signatures that relate to a variety of physiological systems in the medfly; this information emphasizes reproduction, sex determination, and chemosensory perception, since the study was based on normalized cDNA libraries from embryos and adult heads. Results: A total of 21,253 high-quality ESTs were obtained from the embryo and head libraries. Clustering analyses performed separately for each library resulted in 5201 embryo and 6684 head transcripts. Considering an estimated 19% overlap in the transcriptomes of the two libraries, they represent about 9614 unique transcripts involved in a wide range of biological processes and molecular functions. Of particular interest are the sequences that share homology with Drosophila genes involved in sex determination, olfaction, and reproductive behavior. The medfly transformer2 (tra2) homolog was identified among the embryonic sequences, and its genomic organization and expression were characterized. Conclusion: The sequences obtained in this study represent the first major dataset of expressed genes in a tephritid species of agricultural importance. This resource provides essential information to support the investigation of numerous questions regarding the biology of the medfly and other related species and also constitutes an invaluable tool for the annotation of complete genome sequences. Our study has revealed intriguing findings regarding the transcript regulation of tra2 and other sex determination genes, as well as insights into the comparative genomics of genes implicated in chemosensory reception and reproduction. PMID:18500975

  5. Genome-Wide Characterization of RNA Editing in Chicken Embryos Reveals Common Features among Vertebrates.

    PubMed

    Frésard, Laure; Leroux, Sophie; Roux, Pierre-François; Klopp, Christophe; Fabre, Stéphane; Esquerré, Diane; Dehais, Patrice; Djari, Anis; Gourichon, David; Lagarrigue, Sandrine; Pitel, Frédérique

    2015-01-01

    RNA editing results in a post-transcriptional nucleotide change in the RNA sequence that creates an alternative nucleotide not present in the DNA sequence. This leads to a diversification of transcription products with potential functional consequences. Two nucleotide substitutions are mainly described in animals, from adenosine to inosine (A-to-I) and from cytidine to uridine (C-to-U). This phenomenon is described in more detail in mammals, notably since the availability of next generation sequencing technologies allowing whole genome screening of RNA-DNA differences. The number of studies recording RNA editing in other vertebrates like chicken is still limited. We chose to use high throughput sequencing technologies to search for RNA editing in chicken, and to extend the knowledge of its conservation among vertebrates. We performed sequencing of RNA and DNA from 8 embryos. Being aware of common pitfalls inherent to sequence analyses that lead to false positive discovery, we stringently filtered our datasets and found fewer than 40 reliable candidates. Conservation of particular sites of RNA editing was attested by the presence of 3 edited sites previously detected in mammals. We then characterized editing levels for selected candidates in several tissues and at different time points, from 4.5 days of embryonic development to adults, and observed a clear tissue-specificity and a gradual increase of editing level with time. By characterizing the RNA editing landscape in chicken, our results highlight the extent of evolutionary conservation of this phenomenon within vertebrates, attest to its tissue and stage specificity and provide support for the absence of non-A-to-I events from the chicken transcriptome.

  6. Genome-Wide Characterization of RNA Editing in Chicken Embryos Reveals Common Features among Vertebrates

    PubMed Central

    Frésard, Laure; Leroux, Sophie; Roux, Pierre-François; Klopp, Christophe; Fabre, Stéphane; Esquerré, Diane; Dehais, Patrice; Djari, Anis; Gourichon, David

    2015-01-01

    RNA editing results in a post-transcriptional nucleotide change in the RNA sequence that creates an alternative nucleotide not present in the DNA sequence. This leads to a diversification of transcription products with potential functional consequences. Two nucleotide substitutions are mainly described in animals, from adenosine to inosine (A-to-I) and from cytidine to uridine (C-to-U). This phenomenon is described in more detail in mammals, notably since the availability of next generation sequencing technologies allowing whole genome screening of RNA-DNA differences. The number of studies recording RNA editing in other vertebrates like chicken is still limited. We chose to use high throughput sequencing technologies to search for RNA editing in chicken, and to extend the knowledge of its conservation among vertebrates. We performed sequencing of RNA and DNA from 8 embryos. Being aware of common pitfalls inherent to sequence analyses that lead to false positive discovery, we stringently filtered our datasets and found fewer than 40 reliable candidates. Conservation of particular sites of RNA editing was attested by the presence of 3 edited sites previously detected in mammals. We then characterized editing levels for selected candidates in several tissues and at different time points, from 4.5 days of embryonic development to adults, and observed a clear tissue-specificity and a gradual increase of editing level with time. By characterizing the RNA editing landscape in chicken, our results highlight the extent of evolutionary conservation of this phenomenon within vertebrates, attest to its tissue and stage specificity and provide support for the absence of non-A-to-I events from the chicken transcriptome. PMID:26024316

  7. Prediction of Disease Causing Non-Synonymous SNPs by the Artificial Neural Network Predictor NetDiseaseSNP

    PubMed Central

    Johansen, Morten Bo; Izarzugaza, Jose M. G.; Brunak, Søren; Petersen, Thomas Nordahl; Gupta, Ramneek

    2013-01-01

    We have developed a sequence conservation-based artificial neural network predictor called NetDiseaseSNP which classifies nsSNPs as disease-causing or neutral. Our method uses the excellent alignment generation algorithm of SIFT to identify related sequences and a combination of 31 features assessing sequence conservation and the predicted surface accessibility to produce a single score which can be used to rank nsSNPs based on their potential to cause disease. NetDiseaseSNP successfully classifies disease-causing and neutral mutations. In addition, we show that NetDiseaseSNP discriminates cancer driver and passenger mutations satisfactorily. Our method outperforms other state-of-the-art methods on several disease/neutral datasets as well as on cancer driver/passenger mutation datasets and can thus be used to pinpoint and prioritize plausible disease candidates among nsSNPs for further investigation. NetDiseaseSNP is publicly available as an online tool as well as a web service: http://www.cbs.dtu.dk/services/NetDiseaseSNP PMID:23935863

  8. Characterization of unknown genetic modifications using high throughput sequencing and computational subtraction.

    PubMed

    Tengs, Torstein; Zhang, Haibo; Holst-Jensen, Arne; Bohlin, Jon; Butenko, Melinka A; Kristoffersen, Anja Bråthen; Sorteberg, Hilde-Gunn Opsahl; Berdal, Knut G

    2009-10-08

    When generating a genetically modified organism (GMO), the primary goal is to give a target organism one or several novel traits by using biotechnology techniques. A GMO will differ from its parental strain in that its pool of transcripts will be altered. Currently, there are no methods that are reliably able to determine if an organism has been genetically altered if the nature of the modification is unknown. We show that the concept of computational subtraction can be used to identify transgenic cDNA sequences from genetically modified plants. Our datasets include 454-type sequences from a transgenic line of Arabidopsis thaliana and published EST datasets from commercially relevant species (rice and papaya). We believe that computational subtraction represents a powerful new strategy for determining if an organism has been genetically modified as well as to define the nature of the modification. Fewer assumptions have to be made compared to methods currently in use and this is an advantage particularly when working with unknown GMOs.

  9. Characterization of unknown genetic modifications using high throughput sequencing and computational subtraction

    PubMed Central

    Tengs, Torstein; Zhang, Haibo; Holst-Jensen, Arne; Bohlin, Jon; Butenko, Melinka A; Kristoffersen, Anja Bråthen; Sorteberg, Hilde-Gunn Opsahl; Berdal, Knut G

    2009-01-01

    Background When generating a genetically modified organism (GMO), the primary goal is to give a target organism one or several novel traits by using biotechnology techniques. A GMO will differ from its parental strain in that its pool of transcripts will be altered. Currently, there are no methods that are reliably able to determine if an organism has been genetically altered if the nature of the modification is unknown. Results We show that the concept of computational subtraction can be used to identify transgenic cDNA sequences from genetically modified plants. Our datasets include 454-type sequences from a transgenic line of Arabidopsis thaliana and published EST datasets from commercially relevant species (rice and papaya). Conclusion We believe that computational subtraction represents a powerful new strategy for determining if an organism has been genetically modified as well as to define the nature of the modification. Fewer assumptions have to be made compared to methods currently in use and this is an advantage particularly when working with unknown GMOs. PMID:19814792

  10. A Low-Dimensional Radial Silhouette-Based Feature for Fast Human Action Recognition Fusing Multiple Views.

    PubMed

    Chaaraoui, Alexandros Andre; Flórez-Revuelta, Francisco

    2014-01-01

    This paper presents a novel silhouette-based feature for vision-based human action recognition, which relies on the contour of the silhouette and a radial scheme. Its low dimensionality and ease of extraction result in an outstanding proficiency for real-time scenarios. This feature is used in a learning algorithm that, by means of model fusion of multiple camera streams, builds a bag of key poses, which serves as a dictionary of known poses and allows converting the training sequences into sequences of key poses. These are used to perform action recognition by means of a sequence matching algorithm. Experimentation on three different datasets returns high and stable recognition rates. To the best of our knowledge, this paper presents the highest results so far on the MuHAVi-MAS dataset. Real-time suitability is given, since the method easily performs above video frequency. Therefore, the related requirements that applications such as ambient-assisted living services impose are successfully fulfilled.

  11. ANCAC: amino acid, nucleotide, and codon analysis of COGs--a tool for sequence bias analysis in microbial orthologs.

    PubMed

    Meiler, Arno; Klinger, Claudia; Kaufmann, Michael

    2012-09-08

    The COG database is the most popular collection of orthologous proteins from many different completely sequenced microbial genomes. By definition, a cluster of orthologous groups (COG) within this database exclusively contains proteins that most likely achieve the same cellular function. Recently, the COG database was extended by assigning to every protein both the corresponding amino acid and its encoding nucleotide sequence resulting in the NUCOCOG database. This extended version of the COG database is a valuable resource connecting sequence features with the functionality of the respective proteins. Here we present ANCAC, a web tool and MySQL database for the analysis of amino acid, nucleotide, and codon frequencies in COGs on the basis of freely definable phylogenetic patterns. We demonstrate the usefulness of ANCAC by analyzing amino acid frequencies, codon usage, and GC-content in a species- or function-specific context. With respect to amino acids we, at least in part, confirm the cognate bias hypothesis by using ANCAC's NUCOCOG dataset as the largest one available for that purpose thus far. Using the NUCOCOG datasets, ANCAC connects taxonomic, amino acid, and nucleotide sequence information with the functional classification via COGs and provides a GUI for flexible mining for sequence bias. Thereby, to our knowledge, it is the only tool for the analysis of sequence composition in the light of physiological roles and phylogenetic context without requirement of substantial programming skills.

  12. ANCAC: amino acid, nucleotide, and codon analysis of COGs – a tool for sequence bias analysis in microbial orthologs

    PubMed Central

    2012-01-01

    Background The COG database is the most popular collection of orthologous proteins from many different completely sequenced microbial genomes. By definition, a cluster of orthologous groups (COG) within this database exclusively contains proteins that most likely achieve the same cellular function. Recently, the COG database was extended by assigning to every protein both the corresponding amino acid and its encoding nucleotide sequence resulting in the NUCOCOG database. This extended version of the COG database is a valuable resource connecting sequence features with the functionality of the respective proteins. Results Here we present ANCAC, a web tool and MySQL database for the analysis of amino acid, nucleotide, and codon frequencies in COGs on the basis of freely definable phylogenetic patterns. We demonstrate the usefulness of ANCAC by analyzing amino acid frequencies, codon usage, and GC-content in a species- or function-specific context. With respect to amino acids we, at least in part, confirm the cognate bias hypothesis by using ANCAC’s NUCOCOG dataset as the largest one available for that purpose thus far. Conclusions Using the NUCOCOG datasets, ANCAC connects taxonomic, amino acid, and nucleotide sequence information with the functional classification via COGs and provides a GUI for flexible mining for sequence bias. Thereby, to our knowledge, it is the only tool for the analysis of sequence composition in the light of physiological roles and phylogenetic context without requirement of substantial programming skills. PMID:22958836

  13. Allele Identification for Transcriptome-Based Population Genomics in the Invasive Plant Centaurea solstitialis

    PubMed Central

    Dlugosch, Katrina M.; Lai, Zhao; Bonin, Aurélie; Hierro, José; Rieseberg, Loren H.

    2013-01-01

    Transcriptome sequences are becoming more broadly available for multiple individuals of the same species, providing opportunities to derive population genomic information from these datasets. Using the 454 Life Science Genome Sequencer FLX and FLX-Titanium next-generation platforms, we generated 11−430 Mbp of sequence for normalized cDNA for 40 wild genotypes of the invasive plant Centaurea solstitialis, yellow starthistle, from across its worldwide distribution. We examined the impact of sequencing effort on transcriptome recovery and overlap among individuals. To do this, we developed two novel publicly available software pipelines: SnoWhite for read cleaning before assembly, and AllelePipe for clustering of loci and allele identification in assembled datasets with or without a reference genome. AllelePipe is designed specifically for cases in which read depth information is not appropriate or available to assist with disentangling closely related paralogs from allelic variation, as in transcriptome or previously assembled libraries. We find that modest applications of sequencing effort recover most of the novel sequences present in the transcriptome of this species, including single-copy loci and a representative distribution of functional groups. In contrast, the coverage of variable sites, observation of heterozygosity, and overlap among different libraries are all highly dependent on sequencing effort. Nevertheless, the information gained from overlapping regions was informative regarding coarse population structure and variation across our small number of population samples, providing the first genetic evidence in support of hypothesized invasion scenarios. PMID:23390612

  14. Impact of sequencing depth in ChIP-seq experiments

    PubMed Central

    Jung, Youngsook L.; Luquette, Lovelace J.; Ho, Joshua W.K.; Ferrari, Francesco; Tolstorukov, Michael; Minoda, Aki; Issner, Robbyn; Epstein, Charles B.; Karpen, Gary H.; Kuroda, Mitzi I.; Park, Peter J.

    2014-01-01

    In a chromatin immunoprecipitation followed by high-throughput sequencing (ChIP-seq) experiment, an important consideration in experimental design is the minimum number of sequenced reads required to obtain statistically significant results. We present an extensive evaluation of the impact of sequencing depth on identification of enriched regions for key histone modifications (H3K4me3, H3K36me3, H3K27me3 and H3K9me2/me3) using deep-sequenced datasets in human and fly. We propose to define sufficient sequencing depth as the number of reads at which detected enrichment regions increase <1% for an additional million reads. Although the required depth depends on the nature of the mark and the state of the cell in each experiment, we observe that sufficient depth is often reached at <20 million reads for fly. For human, there are no clear saturation points for the examined datasets, but our analysis suggests 40–50 million reads as a practical minimum for most marks. We also devise a mathematical model to estimate the sufficient depth and total genomic coverage of a mark. Lastly, we find that the five algorithms tested do not agree well for broad enrichment profiles, especially at lower depths. Our findings suggest that sufficient sequencing depth and an appropriate peak-calling algorithm are essential for ensuring robustness of conclusions derived from ChIP-seq data. PMID:24598259
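
    The saturation criterion described in this abstract (detected enrichment regions increasing by less than 1% per additional million reads) can be illustrated with a minimal sketch. This is not the authors' code or their mathematical model: the function name, the input layout (parallel lists of subsampled depths and region counts) and the example numbers are all assumptions made for illustration.

```python
def sufficient_depth(depths_millions, region_counts, threshold=0.01):
    """Return the first sampled depth (in millions of reads) at which the
    relative gain in detected regions per additional million reads drops
    below `threshold`, or None if the curve never saturates.

    Both inputs are hypothetical: region counts would come from running a
    peak caller on reads subsampled to each depth.
    """
    for i in range(len(depths_millions) - 1):
        extra_reads = depths_millions[i + 1] - depths_millions[i]
        if extra_reads <= 0 or region_counts[i] <= 0:
            continue  # skip malformed or empty points
        relative_gain = (region_counts[i + 1] - region_counts[i]) / region_counts[i]
        gain_per_million = relative_gain / extra_reads
        if gain_per_million < threshold:
            return depths_millions[i]
    return None


# Made-up saturation curve, for illustration only.
depths = [5, 10, 15, 20, 25]
regions = [10000, 14000, 15500, 16000, 16200]
print(sufficient_depth(depths, regions))
```

    With coarse subsampling steps the gain is averaged over each step, so the reported depth is only as precise as the spacing between sampled depths.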

  15. Comparison of whole-genome bisulfite sequencing library preparation strategies identifies sources of biases affecting DNA methylation data.

    PubMed

    Olova, Nelly; Krueger, Felix; Andrews, Simon; Oxley, David; Berrens, Rebecca V; Branco, Miguel R; Reik, Wolf

    2018-03-15

    Whole-genome bisulfite sequencing (WGBS) is becoming an increasingly accessible technique, used widely for both fundamental and disease-oriented research. Library preparation methods benefit from a variety of available kits, polymerases and bisulfite conversion protocols. Although some steps in the procedure, such as PCR amplification, are known to introduce biases, a systematic evaluation of biases in WGBS strategies is missing. We perform a comparative analysis of several commonly used pre- and post-bisulfite WGBS library preparation protocols for their performance and quality of sequencing outputs. Our results show that bisulfite conversion per se is the main trigger of pronounced sequencing biases, and PCR amplification builds on these underlying artefacts. The majority of standard library preparation methods yield a significantly biased sequence output and overestimate global methylation. Importantly, both absolute and relative methylation levels at specific genomic regions vary substantially between methods, with clear implications for DNA methylation studies. We show that amplification-free library preparation is the least biased approach for WGBS. In protocols with amplification, the choice of bisulfite conversion protocol or polymerase can significantly minimize artefacts. To aid with the quality assessment of existing WGBS datasets, we have integrated a bias diagnostic tool in the Bismark package and offer several approaches for consideration during the preparation and analysis of WGBS datasets.

  16. De-MetaST-BLAST: A Tool for the Validation of Degenerate Primer Sets and Data Mining of Publicly Available Metagenomes

    PubMed Central

    Gulvik, Christopher A.; Effler, T. Chad; Wilhelm, Steven W.; Buchan, Alison

    2012-01-01

    Development and use of primer sets to amplify nucleic acid sequences of interest is fundamental to studies spanning many life science disciplines. As such, the validation of primer sets is essential. Several computer programs have been created to aid in the initial selection of primer sequences that may or may not require multiple nucleotide combinations (i.e., degeneracies). Conversely, validation of primer specificity has remained largely unchanged for several decades, and there are currently few available programs that allow for an evaluation of primers containing degenerate nucleotide bases. To address this gap, we developed the program De-MetaST, which performs an in silico amplification using user-defined nucleotide sequence dataset(s) and primer sequences that may contain degenerate bases. The program returns an output file that contains the in silico amplicons. When De-MetaST is paired with NCBI’s BLAST (De-MetaST-BLAST), the program also returns the top 10 nr NCBI database hits for each recovered in silico amplicon. While the original motivation for development of this search tool was degenerate primer validation using the wealth of nucleotide sequences available in environmental metagenome and metatranscriptome databases, this search tool has potential utility in many data mining applications. PMID:23189198
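
    As a rough illustration of in silico amplification with degenerate primers, the sketch below expands IUPAC ambiguity codes into regular-expression character classes and extracts the region bounded by a forward primer and the reverse complement of a reverse primer. It is not the De-MetaST implementation: all function names, the `max_len` cutoff and the example sequences are assumptions, and only the given strand of the template is searched.

```python
import re

# Standard IUPAC nucleotide ambiguity codes as regex character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}
COMPLEMENT = str.maketrans("ACGTRYSWKMBDHVN", "TGCAYRSWMKVHDBN")


def primer_to_regex(primer):
    """Translate a (possibly degenerate) primer into a regex pattern."""
    return "".join(IUPAC[base] for base in primer.upper())


def reverse_complement(seq):
    """Reverse-complement a sequence, preserving IUPAC ambiguity codes."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def in_silico_amplicons(template, forward, reverse, max_len=2000):
    """Return amplicons on the given strand of `template`.

    For each non-overlapping forward-primer match, the nearest downstream
    site matching the reverse complement of the reverse primer closes the
    amplicon; products longer than `max_len` are discarded.
    """
    template = template.upper()
    fwd = re.compile(primer_to_regex(forward))
    rev = re.compile(primer_to_regex(reverse_complement(reverse)))
    amplicons = []
    for f in fwd.finditer(template):
        r = rev.search(template, f.end())
        if r and r.end() - f.start() <= max_len:
            amplicons.append(template[f.start():r.end()])
    return amplicons


# Made-up template and primers, for illustration only.
print(in_silico_amplicons("TTTACGTAGGGGCCCATGCAAA", "ACRTA", "GCAYG"))
```

    A fuller version would also scan the reverse strand and tolerate mismatches; both are omitted here to keep the sketch short.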

  17. Anchored enrichment dataset for true flies (order Diptera) reveals insights into the phylogeny of flower flies (family Syrphidae).

    PubMed

    Young, Andrew Donovan; Lemmon, Alan R; Skevington, Jeffrey H; Mengual, Ximo; Ståhls, Gunilla; Reemer, Menno; Jordaens, Kurt; Kelso, Scott; Lemmon, Emily Moriarty; Hauser, Martin; De Meyer, Marc; Misof, Bernhard; Wiegmann, Brian M

    2016-06-29

    Anchored hybrid enrichment is a form of next-generation sequencing that uses oligonucleotide probes to target conserved regions of the genome flanked by less conserved regions in order to acquire data useful for phylogenetic inference from a broad range of taxa. Once a probe kit is developed, anchored hybrid enrichment is superior to traditional PCR-based Sanger sequencing in terms of both the amount of genomic data that can be recovered and effective cost. Due to their incredibly diverse nature, importance as pollinators, and historical instability with regard to subfamilial and tribal classification, Syrphidae (flower flies or hoverflies) are an ideal candidate for anchored hybrid enrichment-based phylogenetics, especially since recent molecular phylogenies of the syrphids using only a few markers have resulted in highly unresolved topologies. Over 6200 syrphids are currently known and uncovering their phylogeny will help us to understand how these species have diversified, providing insight into an array of ecological processes, from the development of adult mimicry, the origin of adult migration, to pollination patterns and the evolution of larval resource utilization. We present the first use of anchored hybrid enrichment in insect phylogenetics on a dataset containing 30 flower fly species from across all four subfamilies and 11 tribes out of 15. To produce a phylogenetic hypothesis, 559 loci were sampled to produce a final dataset containing 217,702 sites. We recovered a well resolved topology with bootstrap support values that were almost universally >95 %. The subfamily Eristalinae is recovered as paraphyletic, with the strongest support for this hypothesis to date. The ant predators in the Microdontinae are sister to all other syrphids. Syrphinae and Pipizinae are monophyletic and sister to each other. Larval predation on soft-bodied hemipterans evolved only once in this family. Anchored hybrid enrichment was successful in producing a robustly supported phylogenetic hypothesis for the syrphids. Subfamilial reconstruction is concordant with recent phylogenetic hypotheses, but with much higher support values. With the newly designed probe kit this analysis could be rapidly expanded with further sampling, opening the door to more comprehensive analyses targeting problem areas in syrphid phylogenetics and ecology.

  18. ChimeRScope: a novel alignment-free algorithm for fusion transcript prediction using paired-end RNA-Seq data

    PubMed Central

    Li, You; Heavican, Tayla B.; Vellichirammal, Neetha N.; Iqbal, Javeed

    2017-01-01

    RNA-Seq technology has revolutionized transcriptome characterization not only by accurately quantifying gene expression, but also by the identification of novel transcripts like chimeric fusion transcripts. The ‘fusion’ or ‘chimeric’ transcripts have improved the diagnosis and prognosis of several tumors, and have led to the development of novel therapeutic regimens. Fusion transcript detection is currently accomplished by several software packages, primarily relying on sequence alignment algorithms. The alignment of sequencing reads from fusion transcript loci in cancer genomes can be highly challenging due to the incorrect mapping induced by genomic alterations, thereby limiting the performance of alignment-based fusion transcript detection methods. Here, we developed a novel alignment-free method, ChimeRScope, which accurately predicts fusion transcripts based on the gene fingerprint (as k-mers) profiles of the RNA-Seq paired-end reads. Results on published datasets and in-house cancer cell line datasets followed by experimental validations demonstrate that ChimeRScope consistently outperforms other popular methods irrespective of the read lengths and sequencing depth. More importantly, results on our in-house datasets show that ChimeRScope is a better tool that is capable of identifying novel fusion transcripts with potential oncogenic functions. ChimeRScope is available as standalone software at https://github.com/ChimeRScope/ChimeRScope/wiki or via the Galaxy web interface at https://galaxy.unmc.edu/. PMID:28472320

  19. Concordance and discordance of sequence survey methods for molecular epidemiology

    PubMed Central

    Hasan, Nur A.; Cebula, Thomas A.; Colwell, Rita R.; Robison, Richard A.; Johnson, W. Evan; Crandall, Keith A.

    2015-01-01

    The post-genomic era is characterized by the direct acquisition and analysis of genomic data with many applications, including the enhancement of the understanding of microbial epidemiology and pathology. However, there are a number of molecular approaches to survey pathogen diversity, and the impact of these different approaches on parameter estimation and inference is not entirely clear. We sequenced whole genomes of bacterial pathogens, Burkholderia pseudomallei, Yersinia pestis, and Brucella spp. (60 new genomes), and combined them with 55 genomes from GenBank to address how different molecular survey approaches (whole genomes, SNPs, and MLST) impact downstream inferences on molecular evolutionary parameters, evolutionary relationships, and trait character associations. We selected isolates for sequencing to represent temporal, geographic origin, and host range variability. We found that substitution rate estimates vary widely among approaches, and that SNP and genomic datasets yielded different but strongly supported phylogenies. MLST yielded poorly supported phylogenies, especially in our low diversity dataset, i.e., Y. pestis. Trait associations showed that B. pseudomallei and Y. pestis phylogenies are significantly associated with geography, irrespective of the molecular survey approach used, while Brucella spp. phylogeny appears to be strongly associated with geography and host origin. We contrast inferences made among monomorphic (clonal) and non-monomorphic bacteria, and between intra- and inter-specific datasets. We also discuss our results in light of underlying assumptions of different approaches. PMID:25737810

  20. A Multilocus Species Delimitation Reveals a Striking Number of Species of Coralline Algae Forming Maerl in the OSPAR Maritime Area

    PubMed Central

    Pardo, Cristina; Lopez, Lua; Peña, Viviana; Hernández-Kantún, Jazmin; Le Gall, Line; Bárbara, Ignacio; Barreiro, Rodolfo

    2014-01-01

    Maerl beds are sensitive biogenic habitats built by an accumulation of loose-lying, non-geniculate coralline algae. While these habitats are considered hot-spots of marine biodiversity, the number and distribution of maerl-forming species is uncertain because homoplasy and plasticity of morphological characters are common. As a result, species discrimination based on morphological features is notoriously challenging, making these coralline algae the ideal candidates for a DNA barcoding study. Here, mitochondrial (COI-5P DNA barcode fragment) and plastidial (psbA gene) sequence data were used in a two-step approach to delimit species in 224 collections of maerl sampled from Svalbard (78°96’N) to the Canary Islands (28°64’N) that represented 10 morphospecies from four genera and two families. First, the COI-5P dataset was analyzed with two methods based on distinct criteria (ABGD and GMYC) to delineate 16 primary species hypotheses (PSHs) arranged into four major lineages. Second, chloroplast (psbA) sequence data served to consolidate these PSHs into 13 secondary species hypotheses (SSHs) that showed biologically plausible ranges. Using several lines of evidence (e.g. morphological characters, known species distributions, sequences from type and topotype material), six SSHs were assigned to available species names that included the geographically widespread Phymatolithon calcareum, Lithothamnion corallioides, and L. glaciale; possible identities of other SSHs are discussed. Concordance between SSHs and morphospecies was minimal, highlighting the convenience of DNA barcoding for an accurate identification of maerl specimens. Our survey indicated that a majority of maerl forming species have small distribution ranges and revealed a gradual replacement of species with latitude. PMID:25111057
