Sample records for DNA sequence datasets

  1. Genomics dataset of unidentified disclosed isolates.

    PubMed

    Rekadwad, Bhagwan N

    2016-09-01

    Analysis of DNA sequences is necessary for higher hierarchical classification of organisms. It gives clues about the characteristics of organisms and their taxonomic position. This dataset was chosen to find complexities in the unidentified DNA sequences disclosed in patents. A total of 17 unidentified DNA sequences were thoroughly analyzed. Quick response (QR) codes were generated for them, and their AT/GC content was analyzed. The QR codes are helpful for quick identification of isolates, and the AT/GC content is helpful for studying their stability at different temperatures. Additionally, a dataset of cleavage codes and enzyme codes from a restriction digestion study, which is helpful for studies using short DNA sequences, is reported. The dataset disclosed here provides new data for exploring unique DNA sequences for evaluation, identification, comparison and analysis.
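
The AT/GC content mentioned in this abstract can be computed directly from a sequence string. The following is a minimal sketch, not the tool used by the authors; the function names are hypothetical and ambiguous symbols such as N are assumed to be ignored.

```python
def gc_content(sequence: str) -> float:
    """Return the GC fraction (0..1) of a DNA sequence.

    Assumption: symbols other than A, C, G, T (for example N) are ignored.
    """
    seq = sequence.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    total = gc + at
    if total == 0:
        raise ValueError("sequence contains no A, C, G or T bases")
    return gc / total


def at_content(sequence: str) -> float:
    """Return the AT fraction, the complement of the GC fraction."""
    return 1.0 - gc_content(sequence)


if __name__ == "__main__":
    example = "ATGCGCGTTA"  # made-up sequence, for illustration only
    print(f"GC: {gc_content(example):.1%}  AT: {at_content(example):.1%}")
```

A higher GC fraction generally goes with a higher melting temperature, which is why the abstract ties AT/GC content to stability at different temperatures.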

  2. Benchmark Dataset for Whole Genome Sequence Compression.

    PubMed

    C L, Biji; S Nair, Achuthsankar

    2017-01-01

    Research in DNA data compression lacks a standard dataset for testing compression tools specific to DNA. This paper argues that the current state of achievement in DNA compression cannot be benchmarked in the absence of such a scientifically compiled whole genome sequence dataset and proposes a benchmark dataset built using a multistage sampling procedure. Considering the genome sequences of organisms available in the National Center for Biotechnology Information (NCBI) as the universe, the proposed dataset selects 1,105 prokaryotes, 200 plasmids, 164 viruses, and 65 eukaryotes. This paper reports the results of using three established tools on the newly compiled dataset and shows that their strengths and weaknesses are evident only with a comparison based on the scientifically compiled benchmark dataset. The sample dataset and the respective links are available at https://sourceforge.net/projects/benchmarkdnacompressiondataset/.
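
A benchmark of this kind typically reports a size metric such as bits per base for each tool on each sequence. The sketch below shows that measurement only. The abstract does not name the three tools, so general-purpose compressors from the Python standard library stand in for them here, and the function names are hypothetical.

```python
import bz2
import lzma
import random
import zlib


def bits_per_base(sequence: str, compress) -> float:
    """Compressed size in bits divided by the number of bases."""
    raw = sequence.encode("ascii")
    if not raw:
        raise ValueError("empty sequence")
    return 8 * len(compress(raw)) / len(raw)


def benchmark(sequence: str) -> dict:
    """Run each stand-in compressor on one sequence and collect bits/base."""
    compressors = {
        "zlib": zlib.compress,
        "bz2": bz2.compress,
        "lzma": lzma.compress,
    }
    return {name: bits_per_base(sequence, fn) for name, fn in compressors.items()}


if __name__ == "__main__":
    rng = random.Random(0)
    # Synthetic sequence for illustration; a real benchmark would read genomes.
    toy = "".join(rng.choice("ACGT") for _ in range(100_000))
    for name, value in benchmark(toy).items():
        print(f"{name}: {value:.3f} bits/base")
```

Storing each of the four bases in a fixed 2-bit code gives 2 bits per base, a common reference point when reading such numbers.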

  3. DNApod: DNA polymorphism annotation database from next-generation sequence read archives.

    PubMed

    Mochizuki, Takako; Tanizawa, Yasuhiro; Fujisawa, Takatomo; Ohta, Tazro; Nikoh, Naruo; Shimizu, Tokurou; Toyoda, Atsushi; Fujiyama, Asao; Kurata, Nori; Nagasaki, Hideki; Kaminuma, Eli; Nakamura, Yasukazu

    2017-01-01

    With the rapid advances in next-generation sequencing (NGS), datasets for DNA polymorphisms among various species and strains have been produced, stored, and distributed. However, reliability varies among these datasets because the experimental and analytical conditions used differ among assays. Furthermore, such datasets have been frequently distributed from the websites of individual sequencing projects. It is desirable to integrate DNA polymorphism data into one database featuring uniform quality control that is distributed from a single platform at a single place. DNA polymorphism annotation database (DNApod; http://tga.nig.ac.jp/dnapod/) is an integrated database that stores genome-wide DNA polymorphism datasets acquired under uniform analytical conditions, and this includes uniformity in the quality of the raw data, the reference genome version, and evaluation algorithms. DNApod genotypic data are re-analyzed whole-genome shotgun datasets extracted from sequence read archives, and DNApod distributes genome-wide DNA polymorphism datasets and known-gene annotations for each DNA polymorphism. This new database was developed for storing genome-wide DNA polymorphism datasets of plants, with crops being the first priority. Here, we describe our analyzed data for 679, 404, and 66 strains of rice, maize, and sorghum, respectively. The analytical methods are available as a DNApod workflow in an NGS annotation system of the DNA Data Bank of Japan and a virtual machine image. Furthermore, DNApod provides tables of links of identifiers between DNApod genotypic data and public phenotypic data. To advance the sharing of organism knowledge, DNApod offers basic and ubiquitous functions for multiple alignment and phylogenetic tree construction by using orthologous gene information.

  4. DNApod: DNA polymorphism annotation database from next-generation sequence read archives

    PubMed Central

    Mochizuki, Takako; Tanizawa, Yasuhiro; Fujisawa, Takatomo; Ohta, Tazro; Nikoh, Naruo; Shimizu, Tokurou; Toyoda, Atsushi; Fujiyama, Asao; Kurata, Nori; Nagasaki, Hideki; Kaminuma, Eli; Nakamura, Yasukazu

    2017-01-01

    With the rapid advances in next-generation sequencing (NGS), datasets for DNA polymorphisms among various species and strains have been produced, stored, and distributed. However, reliability varies among these datasets because the experimental and analytical conditions used differ among assays. Furthermore, such datasets have been frequently distributed from the websites of individual sequencing projects. It is desirable to integrate DNA polymorphism data into one database featuring uniform quality control that is distributed from a single platform at a single place. DNA polymorphism annotation database (DNApod; http://tga.nig.ac.jp/dnapod/) is an integrated database that stores genome-wide DNA polymorphism datasets acquired under uniform analytical conditions, and this includes uniformity in the quality of the raw data, the reference genome version, and evaluation algorithms. DNApod genotypic data are re-analyzed whole-genome shotgun datasets extracted from sequence read archives, and DNApod distributes genome-wide DNA polymorphism datasets and known-gene annotations for each DNA polymorphism. This new database was developed for storing genome-wide DNA polymorphism datasets of plants, with crops being the first priority. Here, we describe our analyzed data for 679, 404, and 66 strains of rice, maize, and sorghum, respectively. The analytical methods are available as a DNApod workflow in an NGS annotation system of the DNA Data Bank of Japan and a virtual machine image. Furthermore, DNApod provides tables of links of identifiers between DNApod genotypic data and public phenotypic data. To advance the sharing of organism knowledge, DNApod offers basic and ubiquitous functions for multiple alignment and phylogenetic tree construction by using orthologous gene information. PMID:28234924

  5. TaxI: a software tool for DNA barcoding using distance methods

    PubMed Central

    Steinke, Dirk; Vences, Miguel; Salzburger, Walter; Meyer, Axel

    2005-01-01

    DNA barcoding is a promising approach to the diagnosis of biological diversity in which DNA sequences serve as the primary key for information retrieval. Most existing software for evolutionary analysis of DNA sequences was designed for phylogenetic analyses and, hence, those algorithms do not offer appropriate solutions for the rapid but precise analyses needed for DNA barcoding, and are also unable to process the often large comparative datasets. We developed a flexible software tool for DNA taxonomy, named TaxI. This program calculates sequence divergences between a query sequence (taxon to be barcoded) and each sequence of a dataset of reference sequences defined by the user. Because the analysis is based on separate pairwise alignments, this software is also able to work with sequences characterized by multiple insertions and deletions that are difficult to align in large sequence sets (i.e. thousands of sequences) by multiple alignment algorithms because of computational restrictions. Here, we demonstrate the utility of this approach with two datasets of fish larvae and juveniles from Lake Constance and juvenile land snails under different models of sequence evolution. Sets of ribosomal 16S rRNA sequences, characterized by multiple indels, performed as well as or better than cox1 sequence sets in assigning sequences to species, demonstrating the suitability of rRNA genes for DNA barcoding. PMID:16214755
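
The core idea described here, aligning the query separately against each reference and ranking references by divergence, can be sketched as follows. This is not TaxI's code: the Needleman-Wunsch scoring values, the uncorrected p-distance, and all function names are assumptions chosen for illustration.

```python
def global_align(a, b, match=1, mismatch=-1, gap=-2):
    """Needleman-Wunsch global alignment with a linear gap penalty."""
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return "".join(reversed(out_a)), "".join(reversed(out_b))


def divergence(query, reference):
    """Uncorrected p-distance over aligned columns that contain no gap."""
    aligned_q, aligned_r = global_align(query, reference)
    pairs = [(x, y) for x, y in zip(aligned_q, aligned_r) if x != "-" and y != "-"]
    if not pairs:
        return 1.0  # assumption: nothing comparable counts as maximally divergent
    return sum(x != y for x, y in pairs) / len(pairs)


def rank_references(query, references):
    """Return (name, divergence) pairs, closest reference first."""
    scored = [(name, divergence(query, seq)) for name, seq in references.items()]
    return sorted(scored, key=lambda item: item[1])
```

Because each comparison is an independent pairwise alignment, indels in one reference do not affect how any other reference is aligned, which is the property the abstract highlights.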

  6. SamSelect: a sample sequence selection algorithm for quorum planted motif search on large DNA datasets.

    PubMed

    Yu, Qiang; Wei, Dingbang; Huo, Hongwei

    2018-06-18

    Given a set of t n-length DNA sequences, q satisfying 0 < q ≤ 1, and l and d satisfying 0 ≤ d < l < n, the quorum planted motif search (qPMS) finds l-length strings that occur in at least qt input sequences with up to d mismatches and is mainly used to locate transcription factor binding sites in DNA sequences. Existing qPMS algorithms have been able to efficiently process small standard datasets (e.g., t = 20 and n = 600), but they are too time consuming to process large DNA datasets, such as ChIP-seq datasets that contain thousands of sequences or more. We analyze the effects of t and q on the time performance of qPMS algorithms and find that a large t or a small q causes a longer computation time. Based on this information, we improve the time performance of existing qPMS algorithms by selecting a sample sequence set D' with a small t and a large q from the large input dataset D and then executing qPMS algorithms on D'. A sample sequence selection algorithm named SamSelect is proposed. The experimental results on both simulated and real data show (1) that SamSelect can select D' efficiently and (2) that the qPMS algorithms executed on D' can find implanted or real motifs in a significantly shorter time than when executed on D. We improve the ability of existing qPMS algorithms to process large DNA datasets from the perspective of selecting high-quality sample sequence sets so that the qPMS algorithms can find motifs in a short time in the selected sample sequence set D', rather than take an unfeasibly long time to search the original sequence set D. Our motif discovery method is an approximate algorithm.
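
The qPMS problem statement in this abstract can be made concrete with a brute-force search. The sketch below is not SamSelect or any published qPMS algorithm; it enumerates all 4^l candidate strings, so it is only usable for very small l, and the function names are hypothetical.

```python
from itertools import product


def hamming(a: str, b: str) -> int:
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def occurs(motif: str, sequence: str, d: int) -> bool:
    """True if some l-mer of the sequence is within d mismatches of the motif."""
    l = len(motif)
    return any(hamming(motif, sequence[i:i + l]) <= d
               for i in range(len(sequence) - l + 1))


def qpms_bruteforce(sequences, l: int, d: int, q: float):
    """Return every l-length string found in at least q*t of the t sequences."""
    t = len(sequences)
    found = []
    for letters in product("ACGT", repeat=l):  # 4**l candidates
        motif = "".join(letters)
        support = sum(occurs(motif, s, d) for s in sequences)
        if support / t >= q:
            found.append(motif)
    return found


if __name__ == "__main__":
    # Made-up toy input: ACGT is planted in three of the four sequences.
    toy = ["TTACGTTT", "GGACGTGG", "CCACCTCC", "AAAAAAAA"]
    print("ACGT" in qpms_bruteforce(toy, l=4, d=1, q=0.75))
```

Each candidate is checked against all t sequences, so the running time grows with t, and lowering q or raising d lets more candidates through, in line with the abstract's observations.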

  7. Processing and population genetic analysis of multigenic datasets with ProSeq3 software.

    PubMed

    Filatov, Dmitry A

    2009-12-01

    The current tendency in molecular population genetics is to use increasing numbers of genes in the analysis. Here I describe a program for handling and population genetic analysis of DNA polymorphism data collected from multiple genes. The program includes a sequence/alignment editor and an internal relational database that simplify the preparation and manipulation of multigenic DNA polymorphism datasets. The most commonly used DNA polymorphism analyses are implemented in ProSeq3, facilitating population genetic analysis of large multigenic datasets. Extensive input/output options make ProSeq3 a convenient hub for sequence data processing and analysis. The program is available free of charge from http://dps.plants.ox.ac.uk/sequencing/proseq.htm.

  8. Sequence-based prediction of protein-binding sites in DNA: comparative study of two SVM models.

    PubMed

    Park, Byungkyu; Im, Jinyong; Tuvshinjargal, Narankhuu; Lee, Wook; Han, Kyungsook

    2014-11-01

    As many structures of protein-DNA complexes have become known in recent years, several computational methods have been developed to predict DNA-binding sites in proteins. However, the inverse problem (i.e., predicting protein-binding sites in DNA) has received much less attention. One of the reasons is that the differences between the interaction propensities of nucleotides are much smaller than those between amino acids. Another reason is that DNA exhibits less diverse sequence patterns than protein. Therefore, predicting protein-binding DNA nucleotides is much harder than predicting DNA-binding amino acids. We computed the interaction propensity (IP) of nucleotide triplets with amino acids using an extensive dataset of protein-DNA complexes, and developed two support vector machine (SVM) models that predict protein-binding nucleotides from sequence data alone. One SVM model predicts protein-binding nucleotides using DNA sequence data alone, and the other SVM model predicts protein-binding nucleotides using both DNA and protein sequences. In a 10-fold cross-validation with 1519 DNA sequences, the SVM model that uses DNA sequence data only predicted protein-binding nucleotides with an accuracy of 67.0%, an F-measure of 67.1%, and a Matthews correlation coefficient (MCC) of 0.340. With an independent dataset of 181 DNAs that were not used in training, it achieved an accuracy of 66.2%, an F-measure of 66.3% and an MCC of 0.324. Another SVM model that uses both DNA and protein sequences achieved an accuracy of 69.6%, an F-measure of 69.6%, and an MCC of 0.383 in a 10-fold cross-validation with 1519 DNA sequences and 859 protein sequences. With an independent dataset of 181 DNAs and 143 proteins, it showed an accuracy of 67.3%, an F-measure of 66.5% and an MCC of 0.329. Both in cross-validation and independent testing, the second SVM model that used both DNA and protein sequence data showed better performance than the first model that used DNA sequence data. To the best of our knowledge, this is the first attempt to predict protein-binding nucleotides in a given DNA sequence from the sequence data alone. Copyright © 2014 Elsevier Ireland Ltd. All rights reserved.
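
The three figures reported for each model (accuracy, F-measure, MCC) all derive from a binary confusion matrix. The sketch below uses the standard definitions; the counts in the example are made up and are not taken from the study.

```python
from math import sqrt


def binary_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Accuracy, precision, recall, F-measure and MCC from confusion counts."""
    total = tp + fp + tn + fn
    if total == 0:
        raise ValueError("confusion matrix is empty")
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f_measure = (2 * precision * recall / (precision + recall)
                 if precision + recall else 0.0)
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Convention: MCC is reported as 0 when any marginal total is zero.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f_measure": f_measure,
        "mcc": mcc,
    }


if __name__ == "__main__":
    # Illustrative counts only.
    for name, value in binary_metrics(tp=50, fp=10, tn=30, fn=10).items():
        print(f"{name}: {value:.3f}")
```

MCC ranges from -1 to 1 with 0 meaning chance-level prediction, so values around 0.33 to 0.38 with accuracies near 67% indicate a modest but real signal.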

  9. A phylogenetic overview of the antrodia clade (Basidiomycota, Polyporales)

    Treesearch

    Beatriz Ortiz-Santana; Daniel L. Lindner; Otto Miettinen; Alfredo Justo; David S. Hibbett

    2013-01-01

    Phylogenetic relationships among members of the antrodia clade were investigated with molecular data from two nuclear ribosomal DNA regions, LSU and ITS. A total of 123 species representing 26 genera producing a brown rot were included in the present study. Three DNA datasets (combined LSU-ITS dataset, LSU dataset, ITS dataset) comprising sequences of 449 isolates were...

  10. A computational method for estimating the PCR duplication rate in DNA and RNA-seq experiments.

    PubMed

    Bansal, Vikas

    2017-03-14

    PCR amplification is an important step in the preparation of DNA sequencing libraries prior to high-throughput sequencing. PCR amplification introduces redundant reads in the sequence data and estimating the PCR duplication rate is important to assess the frequency of such reads. Existing computational methods do not distinguish PCR duplicates from "natural" read duplicates that represent independent DNA fragments and therefore, over-estimate the PCR duplication rate for DNA-seq and RNA-seq experiments. In this paper, we present a computational method to estimate the average PCR duplication rate of high-throughput sequence datasets that accounts for natural read duplicates by leveraging heterozygous variants in an individual genome. Analysis of simulated data and exome sequence data from the 1000 Genomes project demonstrated that our method can accurately estimate the PCR duplication rate on paired-end as well as single-end read datasets which contain a high proportion of natural read duplicates. Further, analysis of exome datasets prepared using the Nextera library preparation method indicated that 45-50% of read duplicates correspond to natural read duplicates likely due to fragmentation bias. Finally, analysis of RNA-seq datasets from individuals in the 1000 Genomes project demonstrated that 70-95% of read duplicates observed in such datasets correspond to natural duplicates sampled from genes with high expression and identified outlier samples with a 2-fold greater PCR duplication rate than other samples. The method described here is a useful tool for estimating the PCR duplication rate of high-throughput sequence datasets and for assessing the fraction of read duplicates that correspond to natural read duplicates. An implementation of the method is available at https://github.com/vibansal/PCRduplicates .
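
One way to see why heterozygous variants help separate PCR duplicates from natural duplicates is the simplified model below. It is an illustrative assumption, not the estimator implemented in the PCRduplicates repository: it assumes both alleles are sampled with equal probability, ignores sequencing errors, and looks only at pairs of duplicate reads that overlap a heterozygous site. The function name and input format are hypothetical.

```python
def natural_duplicate_fraction(allele_pairs) -> float:
    """Estimate the share of duplicate pairs that are natural duplicates.

    allele_pairs: iterable of (allele_1, allele_2) observed at a heterozygous
    site in two reads flagged as duplicates of each other.

    Under the stated assumptions, copies of one PCR-amplified fragment always
    carry the same allele, while two independent fragments differ half the
    time. The discordant fraction is therefore about half the natural
    fraction, so the estimate is twice the discordant fraction (capped at 1).
    """
    pairs = list(allele_pairs)
    if not pairs:
        raise ValueError("no duplicate pairs overlapping heterozygous sites")
    discordant = sum(a != b for a, b in pairs)
    return min(1.0, 2 * discordant / len(pairs))


if __name__ == "__main__":
    # Made-up observations: 6 concordant pairs and 2 discordant pairs.
    observed = [("A", "A")] * 6 + [("A", "G")] * 2
    print(natural_duplicate_fraction(observed))
```

The complement of this value gives a rough PCR share among the examined duplicate pairs under the same assumptions.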

  11. Rhipicephalus microplus dataset of nonredundant raw sequence reads from 454 GS FLX sequencing of Cot-selected (Cot = 660) genomic DNA

    USDA-ARS's Scientific Manuscript database

    A reassociation kinetics-based approach was used to reduce the complexity of genomic DNA from the Deutsch laboratory strain of the cattle tick, Rhipicephalus microplus, to facilitate genome sequencing. Selected genomic DNA (Cot value = 660) was sequenced using 454 GS FLX technology, resulting in 356...

  12. MOCCS: Clarifying DNA-binding motif ambiguity using ChIP-Seq data.

    PubMed

    Ozaki, Haruka; Iwasaki, Wataru

    2016-08-01

    As a key mechanism of gene regulation, transcription factors (TFs) bind to DNA by recognizing specific short sequence patterns that are called DNA-binding motifs. A single TF can accept ambiguity within its DNA-binding motifs, which comprise both canonical (typical) and non-canonical motifs. Clarification of such DNA-binding motif ambiguity is crucial for revealing gene regulatory networks and evaluating mutations in cis-regulatory elements. Although chromatin immunoprecipitation sequencing (ChIP-seq) now provides abundant data on the genomic sequences to which a given TF binds, existing motif discovery methods are unable to directly answer whether a given TF can bind to a specific DNA-binding motif. Here, we report a method for clarifying the DNA-binding motif ambiguity, MOCCS. Given ChIP-Seq data of any TF, MOCCS comprehensively analyzes and describes every k-mer to which that TF binds. Analysis of simulated datasets revealed that MOCCS is applicable to various ChIP-Seq datasets, requiring only a few minutes per dataset. Application to the ENCODE ChIP-Seq datasets proved that MOCCS directly evaluates whether a given TF binds to each DNA-binding motif, even if known position weight matrix models do not provide sufficient information on DNA-binding motif ambiguity. Furthermore, users are not required to provide numerous parameters or background genomic sequence models that are typically unavailable. MOCCS is implemented in Perl and R and is freely available via https://github.com/yuifu/moccs. By complementing existing motif-discovery software, MOCCS will contribute to the basic understanding of how the genome controls diverse cellular processes via DNA-protein interactions. Copyright © 2016 Elsevier Ltd. All rights reserved.

  13. fCCAC: functional canonical correlation analysis to evaluate covariance between nucleic acid sequencing datasets.

    PubMed

    Madrigal, Pedro

    2017-03-01

    Computational evaluation of variability across DNA or RNA sequencing datasets is a crucial step in genomic science, as it allows one both to evaluate the reproducibility of biological or technical replicates and to compare different datasets to identify their potential correlations. Here we present fCCAC, an application of functional canonical correlation analysis to assess covariance of nucleic acid sequencing datasets such as chromatin immunoprecipitation followed by deep sequencing (ChIP-seq). We show how this method differs from other measures of correlation, and exemplify how it can reveal shared covariance between histone modifications and DNA binding proteins, such as the relationship between the H3K4me3 chromatin mark and its epigenetic writers and readers. An R/Bioconductor package is available at http://bioconductor.org/packages/fCCAC/. Contact: pmb59@cam.ac.uk. Supplementary data are available at Bioinformatics online. © The Author 2016. Published by Oxford University Press.

  14. mtDNA sequence diversity of Hazara ethnic group from Pakistan.

    PubMed

    Rakha, Allah; Fatima; Peng, Min-Sheng; Adan, Atif; Bi, Rui; Yasmin, Memona; Yao, Yong-Gang

    2017-09-01

    The present study was undertaken to investigate mitochondrial DNA (mtDNA) control region sequences of Hazaras from Pakistan, so as to generate an mtDNA reference database for forensic casework in Pakistan and to analyze the phylogenetic relationship of this particular ethnic group with geographically proximal populations. Complete mtDNA control region (nt 16024-576) sequences were generated through Sanger sequencing for 319 Hazara individuals from Quetta, Baluchistan. The population sample set showed a total of 189 distinct haplotypes, belonging mainly to West Eurasian (51.72%), East & Southeast Asian (29.78%) and South Asian (18.50%) haplogroups. Compared with other populations from Pakistan, the Hazara population had a relatively high haplotype diversity (0.9945) and a lower random match probability (0.0085). The dataset has been incorporated into the EMPOP database under accession number EMP00680. The data herein comprise the largest, and likely most thoroughly examined, control region mtDNA dataset from Hazaras of Pakistan. Copyright © 2017 Elsevier B.V. All rights reserved.
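
Haplotype diversity and random match probability are both functions of the haplotype frequencies. The abstract does not state which formulas were used, so the sketch below assumes the commonly used definitions: random match probability as the sum of squared frequencies, and haplotype diversity as that complement scaled by n/(n-1). The function names are hypothetical.

```python
from collections import Counter


def random_match_probability(haplotypes) -> float:
    """Sum of squared haplotype frequencies in the sample."""
    counts = Counter(haplotypes)
    n = sum(counts.values())
    if n == 0:
        raise ValueError("no haplotypes given")
    return sum((c / n) ** 2 for c in counts.values())


def haplotype_diversity(haplotypes) -> float:
    """Unbiased haplotype diversity: n/(n-1) * (1 - sum of squared freqs)."""
    haplotypes = list(haplotypes)
    n = len(haplotypes)
    if n < 2:
        raise ValueError("at least two individuals are required")
    return n / (n - 1) * (1 - random_match_probability(haplotypes))


if __name__ == "__main__":
    # Made-up sample of four individuals carrying three haplotypes.
    sample = ["H1", "H1", "H2", "H3"]
    print(random_match_probability(sample), haplotype_diversity(sample))
```

Under these assumed definitions, n = 319 with a random match probability of 0.0085 gives a diversity of roughly 0.9946, close to the reported 0.9945.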

  15. Genomics dataset on unclassified published organism (patent US 7547531).

    PubMed

    Khan Shawan, Mohammad Mahfuz Ali; Hasan, Md Ashraful; Hossain, Md Mozammel; Hasan, Md Mahmudul; Parvin, Afroza; Akter, Salina; Uddin, Kazi Rasel; Banik, Subrata; Morshed, Mahbubul; Rahman, Md Nazibur; Rahman, S M Badier

    2016-12-01

    Nucleotide (DNA) sequence analysis provides important clues regarding the characteristics and taxonomic position of an organism. DNA sequence analysis is therefore crucial for learning about the hierarchical classification of that organism. This dataset (patent US 7547531) was chosen to simplify the complex raw data buried in undisclosed DNA sequences, which helps to open doors for new collaborations. A total of 48 unidentified DNA sequences from patent US 7547531 were selected and their complete sequences were retrieved from the NCBI BioSample database. Quick response (QR) codes for those DNA sequences were constructed with the DNA BarID tool. The QR code is useful for the identification and comparison of isolates with other organisms. The AT/GC content of the DNA sequences, which indicates their stability at different temperatures, was determined using the ENDMEMO GC Content Calculator. The highest GC content was observed in GP445188 (62.5%), followed by GP445198 (61.8%) and GP445189 (59.44%), while the lowest was in GP445178 (24.39%). In addition, the New England BioLabs (NEB) database was used to identify the cleavage code, indicating the 5', 3' and blunt ends, and the enzyme code, indicating the methylation site, of the DNA sequences. These data will be helpful for the construction of the organisms' hierarchical classification, determination of their phylogenetic and taxonomic position and revelation of their molecular characteristics.

  16. VaDiR: an integrated approach to Variant Detection in RNA.

    PubMed

    Neums, Lisa; Suenaga, Seiji; Beyerlein, Peter; Anders, Sara; Koestler, Devin; Mariani, Andrea; Chien, Jeremy

    2018-02-01

    Advances in next-generation DNA sequencing technologies are now enabling detailed characterization of sequence variations in cancer genomes. With whole-genome sequencing, variations in coding and non-coding sequences can be discovered, but its cost currently limits its general use in research. Whole-exome sequencing is used to characterize sequence variations in coding regions, but the cost associated with capture reagents and biases in capture rate limit its full use in research. Additional limitations include uncertainty in assigning the functional significance of the mutations when these mutations are observed in the non-coding region or in genes that are not expressed in cancer tissue. We investigated the feasibility of uncovering mutations from expressed genes using RNA sequencing datasets with a method called Variant Detection in RNA (VaDiR) that integrates 3 variant callers, namely: SNPiR, RVBoost, and MuTect2. The combination of all 3 methods, which we called Tier 1 variants, produced the highest precision with true positive mutations from RNA-seq that could be validated at the DNA level. We also found that the integration of Tier 1 variants with those called by MuTect2 and SNPiR produced the highest recall with acceptable precision. Finally, we observed a higher rate of mutation discovery in genes that are expressed at higher levels. Our method, VaDiR, provides a possibility of uncovering mutations from RNA sequencing datasets that could be useful in further functional analysis. In addition, our approach allows orthogonal validation of DNA-based mutation discovery by providing complementary sequence variation analysis from paired RNA/DNA sequencing datasets.

  17. Improved detection of CXCR4-using HIV by V3 genotyping: application of population-based and "deep" sequencing to plasma RNA and proviral DNA.

    PubMed

    Swenson, Luke C; Moores, Andrew; Low, Andrew J; Thielen, Alexander; Dong, Winnie; Woods, Conan; Jensen, Mark A; Wynhoven, Brian; Chan, Dennison; Glascock, Christopher; Harrigan, P Richard

    2010-08-01

    Tropism testing should rule out CXCR4-using HIV before treatment with CCR5 antagonists. Currently, the recombinant phenotypic Trofile assay (Monogram) is most widely utilized; however, genotypic tests may represent alternative methods. Independent triplicate amplifications of the HIV gp120 V3 region were made from either plasma HIV RNA or proviral DNA. These underwent standard, population-based sequencing with an ABI3730 (RNA n = 63; DNA n = 40), or "deep" sequencing with a Roche/454 Genome Sequencer-FLX (RNA n = 12; DNA n = 12). Position-specific scoring matrices (PSSMX4/R5) (-6.96 cutoff) and geno2pheno[coreceptor] (5% false-positive rate) inferred tropism from V3 sequence. These methods were then independently validated with a separate, blinded dataset (n = 278) of screening samples from the maraviroc MOTIVATE trials. Standard sequencing of HIV RNA with PSSM yielded 69% sensitivity and 91% specificity, relative to Trofile. The validation dataset gave 75% sensitivity and 83% specificity. Proviral DNA plus PSSM gave 77% sensitivity and 71% specificity. "Deep" sequencing of HIV RNA detected >2% inferred-CXCR4-using virus in 8/8 samples called non-R5 by Trofile, and <2% in 4/4 samples called R5. Triplicate analyses of V3 standard sequence data detect greater proportions of CXCR4-using samples than previously achieved. Sequencing proviral DNA and "deep" V3 sequencing may also be useful tools for assessing tropism.

  18. A Comprehensive, Automatically Updated Fungal ITS Sequence Dataset for Reference-Based Chimera Control in Environmental Sequencing Efforts.

    PubMed

    Nilsson, R Henrik; Tedersoo, Leho; Ryberg, Martin; Kristiansson, Erik; Hartmann, Martin; Unterseher, Martin; Porter, Teresita M; Bengtsson-Palme, Johan; Walker, Donald M; de Sousa, Filipe; Gamper, Hannes Andres; Larsson, Ellen; Larsson, Karl-Henrik; Kõljalg, Urmas; Edgar, Robert C; Abarenkov, Kessy

    2015-01-01

    The nuclear ribosomal internal transcribed spacer (ITS) region is the most commonly chosen genetic marker for the molecular identification of fungi in environmental sequencing and molecular ecology studies. Several analytical issues complicate such efforts, one of which is the formation of chimeric (artificially joined) DNA sequences during PCR amplification or sequence assembly. Several software tools are currently available for chimera detection, but rely to various degrees on the presence of a chimera-free reference dataset for optimal performance. However, no such dataset is available for use with the fungal ITS region. This study introduces a comprehensive, automatically updated reference dataset for fungal ITS sequences based on the UNITE database for the molecular identification of fungi. This dataset supports chimera detection throughout the fungal kingdom and for full-length ITS sequences as well as partial (ITS1 or ITS2 only) datasets. The performance of the dataset on a large set of artificial chimeras was above 99.5%, and we subsequently used the dataset to remove nearly 1,000 compromised fungal ITS sequences from public circulation. The dataset is available at http://unite.ut.ee/repository.php and is subject to web-based third-party curation.

  19. A Comprehensive, Automatically Updated Fungal ITS Sequence Dataset for Reference-Based Chimera Control in Environmental Sequencing Efforts

    PubMed Central

    Nilsson, R. Henrik; Tedersoo, Leho; Ryberg, Martin; Kristiansson, Erik; Hartmann, Martin; Unterseher, Martin; Porter, Teresita M.; Bengtsson-Palme, Johan; Walker, Donald M.; de Sousa, Filipe; Gamper, Hannes Andres; Larsson, Ellen; Larsson, Karl-Henrik; Kõljalg, Urmas; Edgar, Robert C.; Abarenkov, Kessy

    2015-01-01

    The nuclear ribosomal internal transcribed spacer (ITS) region is the most commonly chosen genetic marker for the molecular identification of fungi in environmental sequencing and molecular ecology studies. Several analytical issues complicate such efforts, one of which is the formation of chimeric—artificially joined—DNA sequences during PCR amplification or sequence assembly. Several software tools are currently available for chimera detection, but rely to various degrees on the presence of a chimera-free reference dataset for optimal performance. However, no such dataset is available for use with the fungal ITS region. This study introduces a comprehensive, automatically updated reference dataset for fungal ITS sequences based on the UNITE database for the molecular identification of fungi. This dataset supports chimera detection throughout the fungal kingdom and for full-length ITS sequences as well as partial (ITS1 or ITS2 only) datasets. The performance of the dataset on a large set of artificial chimeras was above 99.5%, and we subsequently used the dataset to remove nearly 1,000 compromised fungal ITS sequences from public circulation. The dataset is available at http://unite.ut.ee/repository.php and is subject to web-based third-party curation. PMID:25786896

  20. Mapping the Space of Genomic Signatures

    PubMed Central

    Kari, Lila; Hill, Kathleen A.; Sayem, Abu S.; Karamichalis, Rallis; Bryans, Nathaniel; Davis, Katelyn; Dattani, Nikesh S.

    2015-01-01

    We propose a computational method to measure and visualize interrelationships among any number of DNA sequences allowing, for example, the examination of hundreds or thousands of complete mitochondrial genomes. An "image distance" is computed for each pair of graphical representations of DNA sequences, and the distances are visualized as a Molecular Distance Map: Each point on the map represents a DNA sequence, and the spatial proximity between any two points reflects the degree of structural similarity between the corresponding sequences. The graphical representation of DNA sequences utilized, Chaos Game Representation (CGR), is genome- and species-specific and can thus act as a genomic signature. Consequently, Molecular Distance Maps could inform species identification, taxonomic classifications and, to a certain extent, evolutionary history. The image distance employed, Structural Dissimilarity Index (DSSIM), implicitly compares the occurrences of oligomers of length up to k (herein k = 9) in DNA sequences. We computed DSSIM distances for more than 5 million pairs of complete mitochondrial genomes, and used Multi-Dimensional Scaling (MDS) to obtain Molecular Distance Maps that visually display the sequence relatedness in various subsets, at different taxonomic levels. This general-purpose method does not require DNA sequence alignment and can thus be used to compare similar or vastly different DNA sequences, genomic or computer-generated, of the same or different lengths. We illustrate potential uses of this approach by applying it to several taxonomic subsets: phylum Vertebrata, (super)kingdom Protista, classes Amphibia-Insecta-Mammalia, class Amphibia, and order Primates. This analysis of an extensive dataset confirms that the oligomer composition of full mtDNA sequences can be a source of taxonomic information. 
    This method also correctly finds the mtDNA sequences most closely related to that of the anatomically modern human (the Neanderthal, the Denisovan, and the chimp), and shows that the sequence most different from it in this dataset belongs to a cucumber. PMID:26000734
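
The Chaos Game Representation underlying these maps can be sketched in a few lines. The corner assignment below (A, C, G, T at the four corners of the unit square) is one common convention and is an assumption here, as are the function names. The DSSIM image distance and the MDS step from the paper are not implemented.

```python
# Assumed corner layout; other layouts give a mirrored or rotated image.
CORNERS = {"A": (0.0, 0.0), "C": (0.0, 1.0), "G": (1.0, 1.0), "T": (1.0, 0.0)}


def cgr_points(sequence: str):
    """Yield CGR points: start at the centre, move halfway to each base's corner."""
    x, y = 0.5, 0.5
    points = []
    for base in sequence.upper():
        if base not in CORNERS:
            continue  # assumption: ambiguous symbols such as N are skipped
        cx, cy = CORNERS[base]
        x, y = (x + cx) / 2, (y + cy) / 2
        points.append((x, y))
    return points


def cgr_grid(sequence: str, k: int):
    """Count CGR points in a 2**k by 2**k grid.

    After at least k steps, the cell a point falls in is determined by the
    last k bases, so each cell holds the count of one k-mer.
    """
    size = 2 ** k
    grid = [[0] * size for _ in range(size)]
    for x, y in cgr_points(sequence)[k - 1:]:
        col = min(int(x * size), size - 1)
        row = min(int(y * size), size - 1)
        grid[row][col] += 1
    return grid


if __name__ == "__main__":
    for row in cgr_grid("ACGTACGTTGCA", k=2):  # made-up sequence
        print(row)
```

Two such grids can be compared as images without aligning the sequences, which is why the approach applies to sequences of different lengths.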

  1. Inaugural Genomics Automation Congress and the coming deluge of sequencing data.

    PubMed

    Creighton, Chad J

    2010-10-01

    Presentations at Select Biosciences's first 'Genomics Automation Congress' (Boston, MA, USA) in 2010 focused on next-generation sequencing and the platforms and methodology around them. The meeting provided an overview of sequencing technologies, both new and emerging. Speakers shared their recent work on applying sequencing to profile cells for various levels of biomolecular complexity, including DNA sequences, DNA copy, DNA methylation, mRNA and microRNA. With sequencing time and costs continuing to drop dramatically, a virtual explosion of very large sequencing datasets is at hand, which will probably present challenges and opportunities for high-level data analysis and interpretation, as well as for information technology infrastructure.

  2. DAMe: a toolkit for the initial processing of datasets with PCR replicates of double-tagged amplicons for DNA metabarcoding analyses.

    PubMed

    Zepeda-Mendoza, Marie Lisandra; Bohmann, Kristine; Carmona Baez, Aldo; Gilbert, M Thomas P

    2016-05-03

    DNA metabarcoding is an approach for identifying multiple taxa in an environmental sample using specific genetic loci and taxa-specific primers. When combined with high-throughput sequencing it enables the taxonomic characterization of large numbers of samples in a relatively time- and cost-efficient manner. One recent laboratory development is the addition of 5'-nucleotide tags to both primers producing double-tagged amplicons and the use of multiple PCR replicates to filter erroneous sequences. However, there is currently no available toolkit for the straightforward analysis of datasets produced in this way. We present DAMe, a toolkit for the processing of datasets generated by double-tagged amplicons from multiple PCR replicates derived from an unlimited number of samples. Specifically, DAMe can be used to (i) sort amplicons by tag combination, (ii) evaluate PCR replicates dissimilarity, and (iii) filter sequences derived from sequencing/PCR errors, chimeras, and contamination. This is attained by calculating the following parameters: (i) sequence content similarity between the PCR replicates from each sample, (ii) reproducibility of each unique sequence across the PCR replicates, and (iii) copy number of the unique sequences in each PCR replicate. We showcase the insights that can be obtained using DAMe prior to taxonomic assignment, by applying it to two real datasets that vary in their complexity regarding number of samples, sequencing libraries, PCR replicates, and used tag combinations. Finally, we use a third mock dataset to demonstrate the impact and importance of filtering the sequences with DAMe. DAMe allows the user-friendly manipulation of amplicons derived from multiple samples with PCR replicates built in a single or multiple sequencing libraries. 
It allows the user to: (i) collapse amplicons into unique sequences and sort them by tag combination while retaining the sample identifier and copy number information, (ii) identify sequences carrying unused tag combinations, (iii) evaluate the comparability of PCR replicates of the same sample, and (iv) filter tagged amplicons from a number of PCR replicates using parameters of minimum length, copy number, and reproducibility across the PCR replicates. This enables an efficient analysis of complex datasets, and ultimately increases the ease of handling datasets from large-scale studies.
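
The filtering criteria listed above (minimum length, copy number, and reproducibility across PCR replicates) can be sketched in a few lines. This is not DAMe's code or command-line interface; the function name, parameter names and default thresholds are hypothetical and only illustrate the idea.

```python
from collections import Counter


def filter_sequences(replicates, min_len=1, min_copies=1, min_replicates=2):
    """Keep sequences that are long enough and reproducible across PCR replicates.

    replicates: list of lists of reads, one inner list per PCR replicate.
    Returns {sequence: summed copy number over the replicates that passed}.
    """
    # Collapse each replicate into unique sequences with copy numbers.
    counts = [Counter(reads) for reads in replicates]
    kept = {}
    for seq in set().union(*counts):
        if len(seq) < min_len:
            continue
        # Copy numbers in the replicates where the sequence is abundant enough.
        passing = [c[seq] for c in counts if c[seq] >= min_copies]
        if len(passing) >= min_replicates:
            kept[seq] = sum(passing)
    return kept
```

With three replicates and `min_copies=2, min_replicates=2`, a sequence seen twice in two replicates is retained, while one that is abundant in a single replicate only is discarded as a likely error or contaminant.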

  3. An accurate algorithm for the detection of DNA fragments from dilution pool sequencing experiments.

    PubMed

    Bansal, Vikas

    2018-01-01

    The short read lengths of current high-throughput sequencing technologies limit the ability to recover long-range haplotype information. Dilution pool methods for preparing DNA sequencing libraries from high molecular weight DNA fragments enable the recovery of long DNA fragments from short sequence reads. These approaches require computational methods for identifying the DNA fragments using aligned sequence reads and assembling the fragments into long haplotypes. Although a number of computational methods have been developed for haplotype assembly, the problem of identifying DNA fragments from dilution pool sequence data has not received much attention. We formulate the problem of detecting DNA fragments from dilution pool sequencing experiments as a genome segmentation problem and develop an algorithm that uses dynamic programming to optimize a likelihood function derived from a generative model for the sequence reads. This algorithm uses an iterative approach to automatically infer the mean background read depth and the number of fragments in each pool. Using simulated data, we demonstrate that our method, FragmentCut, has 25-30% greater sensitivity compared with an HMM based method for fragment detection and can also detect overlapping fragments. On a whole-genome human fosmid pool dataset, the haplotypes assembled using the fragments identified by FragmentCut had greater N50 length, 16.2% lower switch error rate and 35.8% lower mismatch error rate compared with two existing methods. We further demonstrate the greater accuracy of our method using two additional dilution pool datasets. FragmentCut is available from https://bansal-lab.github.io/software/FragmentCut. vibansal@ucsd.edu. Supplementary data are available at Bioinformatics online. © The Author 2017. Published by Oxford University Press. All rights reserved. For Permissions, please e-mail: journals.permissions@oup.com

  4. High-quality mtDNA control region sequences from 680 individuals sampled across the Netherlands to establish a national forensic mtDNA reference database.

    PubMed

    Chaitanya, Lakshmi; van Oven, Mannis; Brauer, Silke; Zimmermann, Bettina; Huber, Gabriela; Xavier, Catarina; Parson, Walther; de Knijff, Peter; Kayser, Manfred

    2016-03-01

    The use of mitochondrial DNA (mtDNA) for maternal lineage identification often marks the last resort when investigating forensic and missing-person cases involving highly degraded biological materials. As with all comparative DNA testing, a match between evidence and reference sample requires a statistical interpretation, for which high-quality mtDNA population frequency data are crucial. Here, we determined, under high quality standards, the complete mtDNA control-region sequences of 680 individuals from across the Netherlands sampled at 54 sites, covering the entire country with 10 geographic sub-regions. The complete mtDNA control region (nucleotide positions 16,024-16,569 and 1-576) was amplified with two PCR primers and sequenced with ten different sequencing primers using the EMPOP protocol. Haplotype diversity of the entire sample set was very high at 99.63% and, accordingly, the random-match probability was 0.37%. No population substructure within the Netherlands was detected with our dataset. Phylogenetic analyses were performed to determine mtDNA haplogroups. Inclusion of these high-quality data in the EMPOP database (accession number: EMP00666) will improve its overall data content and geographic coverage in the interest of all EMPOP users worldwide. Moreover, this dataset will serve as (the start of) a national reference database for mtDNA applications in forensic and missing person casework in the Netherlands. Copyright © 2015 Elsevier Ireland Ltd. All rights reserved.
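
The two summary statistics quoted above add up to 100%, which is what one would expect if haplotype diversity is taken as the complement of the random-match probability (the sum of squared haplotype frequencies). The sketch below works under that assumption; the abstract does not state the exact estimator used, and the optional n/(n-1) small-sample correction is shown only as a common variant.

```python
from collections import Counter


def random_match_probability(haplotypes):
    """Sum of squared haplotype frequencies in the sample."""
    n = len(haplotypes)
    counts = Counter(haplotypes)
    return sum((c / n) ** 2 for c in counts.values())


def haplotype_diversity(haplotypes, unbiased=False):
    """1 - RMP, optionally scaled by n / (n - 1) (small-sample correction)."""
    n = len(haplotypes)
    diversity = 1.0 - random_match_probability(haplotypes)
    if unbiased:
        diversity *= n / (n - 1)
    return diversity
```

For a toy sample `["a", "a", "b", "c"]` the random-match probability is 0.375 and the uncorrected diversity is 0.625.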

  5. Identifying active foraminifera in the Sea of Japan using metatranscriptomic approach

    NASA Astrophysics Data System (ADS)

    Lejzerowicz, Franck; Voltsky, Ivan; Pawlowski, Jan

    2013-02-01

    Metagenetics represents an efficient and rapid tool to describe environmental diversity patterns of microbial eukaryotes based on ribosomal DNA sequences. However, the results of metagenetic studies are often biased by the presence of extracellular DNA molecules that are persistent in the environment, especially in deep-sea sediment. As an alternative, short-lived RNA molecules constitute a good proxy for the detection of active species. Here, we used a metatranscriptomic approach based on RNA-derived (cDNA) sequences to study the diversity of the deep-sea benthic foraminifera and compared it to the metagenetic approach. We analyzed 257 ribosomal DNA and cDNA sequences obtained from seven sediment samples collected in the Sea of Japan at depths ranging from 486 to 3665 m. The DNA and RNA-based approaches gave a similar view of the taxonomic composition of the foraminiferal assemblage, but differed in some important points. First, the cDNA dataset was dominated by sequences of rotaliids and robertiniids, suggesting that these calcareous species, some of which have been observed in Rose Bengal stained samples, are the most active component of the foraminiferal community. Second, the richness of monothalamous (single-chambered) foraminifera was particularly high in DNA extracts from the deepest samples, confirming that this group of foraminifera is abundant but not necessarily very active in the deep-sea sediments. Finally, the high divergence of undetermined sequences in the cDNA dataset indicates the limits of our database and lack of knowledge about some active but possibly rare species. Our study demonstrates the capability of the metatranscriptomic approach to detect active foraminiferal species and prompts its use in future high-throughput sequencing-based environmental surveys.

  6. Argo_CUDA: Exhaustive GPU based approach for motif discovery in large DNA datasets.

    PubMed

    Vishnevsky, Oleg V; Bocharnikov, Andrey V; Kolchanov, Nikolay A

    2018-02-01

    The development of chromatin immunoprecipitation sequencing (ChIP-seq) technology has revolutionized the genetic analysis of the basic mechanisms underlying transcription regulation and led to the accumulation of information about a huge number of DNA sequences. Many web services are currently available for de novo motif discovery in datasets containing information about DNA/protein binding. The enormous diversity of motifs makes finding them challenging. To avoid these difficulties, researchers use different stochastic approaches. Unfortunately, the efficiency of motif discovery programs declines dramatically as the query set size increases. As a result, only a fraction of the top "peak" ChIP-Seq segments can be analyzed, or the area of analysis must be narrowed. Thus, motif discovery in massive datasets remains a challenging issue. The Argo_CUDA (Compute Unified Device Architecture) web service is designed to process massive DNA data. It is a program for the detection of degenerate oligonucleotide motifs of fixed length written in the 15-letter IUPAC code. Argo_CUDA is a fully exhaustive approach based on high-performance GPU technologies. Compared with the existing motif discovery web services, Argo_CUDA shows good prediction quality on simulated sets. The analysis of ChIP-Seq sequences revealed motifs that correspond to known transcription factor binding sites.
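
To make the notion of a degenerate motif in the 15-letter IUPAC code concrete, the sketch below checks a single candidate motif against a set of sequences on the CPU. It is not Argo_CUDA's implementation; the function names are hypothetical, and an exhaustive search would repeat this scoring for every candidate motif of the chosen length, which is the part a GPU would parallelize.

```python
# The 15-letter IUPAC nucleotide code: each symbol stands for a set of bases.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def motif_matches(motif, site):
    """True if every base of `site` is allowed by the IUPAC symbol at that position."""
    return len(motif) == len(site) and all(
        base in IUPAC[symbol] for symbol, base in zip(motif, site)
    )


def count_sequences_with_motif(motif, sequences):
    """Number of sequences containing at least one occurrence of the motif."""
    k = len(motif)
    hits = 0
    for seq in sequences:
        seq = seq.upper()
        if any(motif_matches(motif, seq[i:i + k]) for i in range(len(seq) - k + 1)):
            hits += 1
    return hits
```

For example, the motif `TGASTCA` matches both `TGACTCA` and `TGAGTCA`, because `S` stands for C or G.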

  7. Palindromic Sequence Artifacts Generated during Next Generation Sequencing Library Preparation from Historic and Ancient DNA

    PubMed Central

    Star, Bastiaan; Nederbragt, Alexander J.; Hansen, Marianne H. S.; Skage, Morten; Gilfillan, Gregor D.; Bradbury, Ian R.; Pampoulie, Christophe; Stenseth, Nils Chr; Jakobsen, Kjetill S.; Jentoft, Sissel

    2014-01-01

    Degradation-specific processes and variation in laboratory protocols can bias the DNA sequence composition from samples of ancient or historic origin. Here, we identify a novel artifact in sequences from historic samples of Atlantic cod (Gadus morhua), which forms interrupted palindromes consisting of reverse complementary sequence at the 5′ and 3′-ends of sequencing reads. The palindromic sequences themselves have specific properties – the bases at the 5′-end align well to the reference genome, whereas extensive misalignments exist among the bases at the terminal 3′-end. The terminal 3′ bases are artificial extensions likely caused by the occurrence of hairpin loops in single stranded DNA (ssDNA), which can be ligated and amplified in particular library creation protocols. We propose that such hairpin loops allow the inclusion of erroneous nucleotides, specifically at the 3′-end of DNA strands, with the 5′-end of the same strand providing the template. We also find these palindromes in previously published ancient DNA (aDNA) datasets, albeit at varying and substantially lower frequencies. This artifact can negatively affect the yield of endogenous DNA in these types of samples and introduces sequence bias. PMID:24608104
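
One simple way to screen reads for the kind of interrupted palindrome described above is to test whether the 3′ end of a read is the reverse complement of its own 5′ end. The sketch below does only that; the function names and the minimum length threshold are assumptions for illustration, and the authors' actual detection procedure is not described in the abstract.

```python
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq):
    """Reverse complement of a DNA string (N is kept as N)."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def palindromic_tail_length(read, min_len=4):
    """Length of the longest 3' suffix that is the reverse complement of the
    5' prefix of the same length, or 0 if none reaches min_len."""
    read = read.upper()
    # Try the longest candidate first; the two ends cannot overlap.
    for n in range(len(read) // 2, min_len - 1, -1):
        if read[-n:] == reverse_complement(read[:n]):
            return n
    return 0
```

A read such as `ACGGTATTTTTTACCGT` returns 6, because its last six bases (`TACCGT`) are the reverse complement of its first six (`ACGGTA`).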

  8. Prediction of constitutive A-to-I editing sites from human transcriptomes in the absence of genomic sequences

    PubMed Central

    2013-01-01

    Background Adenosine-to-inosine (A-to-I) RNA editing is recognized as a cellular mechanism for generating both RNA and protein diversity. Inosine base pairs with cytidine during reverse transcription and therefore appears as guanosine during sequencing of cDNA. Current approaches of RNA editing identification largely depend on the comparison between transcriptomes and genomic DNA (gDNA) sequencing datasets from the same individuals, and it has been challenging to identify editing candidates from transcriptomes in the absence of gDNA information. Results We have developed a new strategy to accurately predict constitutive RNA editing sites from publicly available human RNA-seq datasets in the absence of relevant genomic sequences. Our approach establishes new parameters to increase the ability to map mismatches and to minimize sequencing/mapping errors and unreported genome variations. We identified 695 novel constitutive A-to-I editing sites that appear in clusters (named “editing boxes”) in multiple samples and which exhibit spatial and dynamic regulation across human tissues. Some of these editing boxes are enriched in non-repetitive regions lacking inverted repeat structures and contain an extremely high conversion frequency of As to Is. We validated a number of editing boxes in multiple human cell lines and confirmed that ADAR1 is responsible for the observed promiscuous editing events in non-repetitive regions, further expanding our knowledge of the catalytic substrate of A-to-I RNA editing by ADAR enzymes. Conclusions The approach we present here provides a novel way of identifying A-to-I RNA editing events by analyzing only RNA-seq datasets. This method has allowed us to gain new insights into RNA editing and should also aid in the identification of more constitutive A-to-I editing sites from additional transcriptomes. PMID:23537002

  9. Improved detection of DNA-binding proteins via compression technology on PSSM information.

    PubMed

    Wang, Yubo; Ding, Yijie; Guo, Fei; Wei, Leyi; Tang, Jijun

    2017-01-01

    Since the importance of DNA-binding proteins in multiple biomolecular functions has been recognized, an increasing number of researchers are attempting to identify DNA-binding proteins. In recent years, machine learning methods have become more and more compelling as the volume of protein sequence data soars, because of their favorable speed and accuracy. In this paper, we extract three features from the protein sequence, namely NMBAC (Normalized Moreau-Broto Autocorrelation), PSSM-DWT (Position-specific scoring matrix-Discrete Wavelet Transform), and PSSM-DCT (Position-specific scoring matrix-Discrete Cosine Transform). We also employ a feature selection algorithm on these feature vectors. Then, these features are fed into an SVM (support vector machine) classifier to predict DNA-binding proteins. Our method uses three datasets, namely PDB1075, PDB594 and PDB186, to evaluate the performance of our approach. The PDB1075 and PDB594 datasets are employed for the Jackknife test and the PDB186 dataset is used for the independent test. Our method achieves the best accuracy in the Jackknife test, from 79.20% to 86.23% and 80.5% to 86.20% on PDB1075 and PDB594 datasets, respectively. In the independent test, the accuracy of our method comes to 76.3%. The performance of the independent test also shows that our method has a certain ability to be effectively used for DNA-binding protein prediction. The data and source code are at https://doi.org/10.6084/m9.figshare.5104084.

  10. Counting Patterns in Degenerated Sequences

    NASA Astrophysics Data System (ADS)

    Nuel, Grégory

    Biological sequences such as DNA or proteins are always obtained through a sequencing process that may introduce some uncertainty. As a result, such sequences are usually written in a degenerated alphabet in which some symbols may correspond to several possible letters (e.g., the IUPAC DNA alphabet). When counting patterns in such degenerated sequences, the question that naturally arises is how to deal with degenerated positions. Since most (usually 99%) of the positions are not degenerated, it is considered harmless to discard the degenerated positions in order to get an observation, but the exact consequences of such a practice are unclear. In this paper, we introduce a rigorous method to take into account the uncertainty of sequencing for biological sequences (DNA, proteins). We first introduce a Forward-Backward approach to compute the marginal distribution of the constrained sequence and use it both to perform an Expectation-Maximization estimation of parameters and to derive a heterogeneous Markov distribution for the constrained sequence. This distribution is then used along with known DFA-based pattern approaches to obtain the exact distribution of the pattern count under the constraints. As an illustration, we consider an EST dataset from the EMBL database. Despite the fact that only 1% of the positions in this dataset are degenerated, we show that not taking into account these positions might lead to erroneous observations, further demonstrating the value of our approach.

  11. DNA Barcodes for Forensically Important Fly Species in Brazil.

    PubMed

    Koroiva, Ricardo; de Souza, Mirian S; Roque, Fabio de Oliveira; Pepinelli, Mateus

    2018-04-07

    Here, we analyze 248 DNA barcode sequences of 35 fly species of forensic importance in Brazil. DNA barcoding can be effectively used for specimen identification of these species, allowing the unambiguous identification of 31 species, an overall success rate of 88%. Our results show a high rate of success for molecular identification using DNA barcoding sequences and open new perspectives for immature species identification, a subject on which limited forensic investigations exist in Tropical regions. We also address the implications of building a robust forensic DNA barcode database. A geographic bias is recognized for the COI dataset available for forensically important fly species in Brazil, with concentration of sequences from specimens collected mainly in sites located in the Cerrado, Mata Atlântica, and Pampa biomes.

  12. Assessment of species diversity and distribution of an ancient diatom lineage using a DNA metabarcoding approach.

    PubMed

    Nanjappa, Deepak; Audic, Stephane; Romac, Sarah; Kooistra, Wiebe H C F; Zingone, Adriana

    2014-01-01

    Continuous efforts to estimate actual diversity and to trace the species distribution and ranges in the natural environments have gone in equal pace with advancements of the technologies in the study of microbial species diversity from microscopic observations to DNA-based barcoding. DNA metabarcoding based on Next Generation Sequencing (NGS) constitutes the latest advancement in these efforts. Here we use NGS data from different sites to investigate the geographic range of six species of the diatom family Leptocylindraceae and to identify possible new taxa within the family. We analysed the V4 and V9 regions of the nuclear-encoded SSU rDNA gene region in the NGS database of the European ERA-Biodiversa project BioMarKs, collected in plankton and sediments at six coastal sites in European coastal waters, as well as environmental sequences from the NCBI database. All species known in the family Leptocylindraceae were detected in both datasets, but the much larger Illumina V9 dataset showed a higher species coverage at the various sites than the 454 V4 dataset. Sequences identical or similar to the references of Leptocylindrus aporus, L. convexus, L. danicus/hargravesii and Tenuicylindrus belgicus were found in the Mediterranean Sea, North Atlantic Ocean and Black Sea as well as at locations outside Europe. Instead, sequences identical or close to that of L. minimus were found in the North Atlantic Ocean and the Black Sea but not in the Mediterranean Sea, while sequences belonging to a yet undescribed taxon were encountered only in Oslo Fjord and Baffin Bay. Identification of Leptocylindraceae species in NGS datasets has expanded our knowledge of the species biogeographic distribution and of the overall diversity of this diatom family. Individual species appear to be widespread, but not all of them are found everywhere. 
Despite the sequencing depth allowed by NGS and the wide geographic area covered by this study, the diversity of this ancient diatom family appears to be low, at least at the level of the marker used in this study.

  13. Assessment of Species Diversity and Distribution of an Ancient Diatom Lineage Using a DNA Metabarcoding Approach

    PubMed Central

    Nanjappa, Deepak; Audic, Stephane; Romac, Sarah; Kooistra, Wiebe H. C. F.; Zingone, Adriana

    2014-01-01

    Background Continuous efforts to estimate actual diversity and to trace the species distribution and ranges in the natural environments have gone in equal pace with advancements of the technologies in the study of microbial species diversity from microscopic observations to DNA-based barcoding. DNA metabarcoding based on Next Generation Sequencing (NGS) constitutes the latest advancement in these efforts. Here we use NGS data from different sites to investigate the geographic range of six species of the diatom family Leptocylindraceae and to identify possible new taxa within the family. Methodology/Principal Findings We analysed the V4 and V9 regions of the nuclear-encoded SSU rDNA gene region in the NGS database of the European ERA-Biodiversa project BioMarKs, collected in plankton and sediments at six coastal sites in European coastal waters, as well as environmental sequences from the NCBI database. All species known in the family Leptocylindraceae were detected in both datasets, but the much larger Illumina V9 dataset showed a higher species coverage at the various sites than the 454 V4 dataset. Sequences identical or similar to the references of Leptocylindrus aporus, L. convexus, L. danicus/hargravesii and Tenuicylindrus belgicus were found in the Mediterranean Sea, North Atlantic Ocean and Black Sea as well as at locations outside Europe. Instead, sequences identical or close to that of L. minimus were found in the North Atlantic Ocean and the Black Sea but not in the Mediterranean Sea, while sequences belonging to a yet undescribed taxon were encountered only in Oslo Fjord and Baffin Bay. Conclusions/Significance Identification of Leptocylindraceae species in NGS datasets has expanded our knowledge of the species biogeographic distribution and of the overall diversity of this diatom family. Individual species appear to be widespread, but not all of them are found everywhere. 
Despite the sequencing depth allowed by NGS and the wide geographic area covered by this study, the diversity of this ancient diatom family appears to be low, at least at the level of the marker used in this study. PMID:25133638

  14. iASeq: integrative analysis of allele-specificity of protein-DNA interactions in multiple ChIP-seq datasets

    PubMed Central

    2012-01-01

    Background ChIP-seq provides new opportunities to study allele-specific protein-DNA binding (ASB). However, detecting allelic imbalance from a single ChIP-seq dataset often has low statistical power since only sequence reads mapped to heterozygote SNPs are informative for discriminating two alleles. Results We develop a new method iASeq to address this issue by jointly analyzing multiple ChIP-seq datasets. iASeq uses a Bayesian hierarchical mixture model to learn correlation patterns of allele-specificity among multiple proteins. Using the discovered correlation patterns, the model allows one to borrow information across datasets to improve detection of allelic imbalance. Application of iASeq to 77 ChIP-seq samples from 40 ENCODE datasets and 1 genomic DNA sample in GM12878 cells reveals that allele-specificity of multiple proteins are highly correlated, and demonstrates the ability of iASeq to improve allelic inference compared to analyzing each individual dataset separately. Conclusions iASeq illustrates the value of integrating multiple datasets in the allele-specificity inference and offers a new tool to better analyze ASB. PMID:23194258

  15. Geoseq: a tool for dissecting deep-sequencing datasets.

    PubMed

    Gurtowski, James; Cancio, Anthony; Shah, Hardik; Levovitz, Chaya; George, Ajish; Homann, Robert; Sachidanandam, Ravi

    2010-10-12

    Datasets generated on deep-sequencing platforms have been deposited in various public repositories such as the Gene Expression Omnibus (GEO), the Sequence Read Archive (SRA) hosted by the NCBI, or the DNA Data Bank of Japan (DDBJ). Despite being rich data sources, they have not been used much due to the difficulty in locating and analyzing datasets of interest. Geoseq (http://geoseq.mssm.edu) provides a new method of analyzing short reads from deep sequencing experiments. Instead of mapping the reads to reference genomes or sequences, Geoseq maps a reference sequence against the sequencing data. It is web-based, and holds pre-computed data from public libraries. The analysis reduces the input sequence to tiles and measures the coverage of each tile in a sequence library through the use of suffix arrays. The user can upload custom target sequences or use gene/miRNA names for the search and get back results as plots and spreadsheet files. Geoseq organizes the public sequencing data using a controlled vocabulary, allowing identification of relevant libraries by organism, tissue and type of experiment. Analysis of small sets of sequences against deep-sequencing datasets, as well as identification of public datasets of interest, is simplified by Geoseq. We applied Geoseq to (a) identify differential isoform expression in mRNA-seq datasets, (b) identify miRNAs (microRNAs) in libraries and identify mature and star sequences in miRNAs, and (c) identify potentially mis-annotated miRNAs. The ease of using Geoseq for these analyses suggests its utility and uniqueness as an analysis tool.
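
The tiling idea described above, cutting the reference into tiles and measuring how often each tile occurs in a read library, can be sketched as follows. Geoseq uses suffix arrays over pre-computed libraries; this example swaps in plain substring search for brevity, and the function name, tile length and step are hypothetical choices rather than Geoseq's settings.

```python
def tile_coverage(reference, reads, tile_len=20, step=1):
    """Cut `reference` into tiles and count, for each tile, the reads containing it.

    Uses naive substring search in place of a suffix array, so it only suits
    small examples. Returns a list of (tile_start, read_count) pairs.
    """
    reference = reference.upper()
    reads = [r.upper() for r in reads]
    coverage = []
    for start in range(0, len(reference) - tile_len + 1, step):
        tile = reference[start:start + tile_len]
        coverage.append((start, sum(1 for r in reads if tile in r)))
    return coverage
```

Plotting the read counts against the tile start positions gives a coverage profile of the reference sequence within the chosen library.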

  16. Ribosomal DNA sequence heterogeneity reflects intraspecies phylogenies and predicts genome structure in two contrasting yeast species.

    PubMed

    West, Claire; James, Stephen A; Davey, Robert P; Dicks, Jo; Roberts, Ian N

    2014-07-01

    The ribosomal RNA encapsulates a wealth of evolutionary information, including genetic variation that can be used to discriminate between organisms at a wide range of taxonomic levels. For example, the prokaryotic 16S rDNA sequence is very widely used both in phylogenetic studies and as a marker in metagenomic surveys and the internal transcribed spacer region, frequently used in plant phylogenetics, is now recognized as a fungal DNA barcode. However, this widespread use does not escape criticism, principally due to issues such as difficulties in classification of paralogous versus orthologous rDNA units and intragenomic variation, both of which may be significant barriers to accurate phylogenetic inference. We recently analyzed data sets from the Saccharomyces Genome Resequencing Project, characterizing rDNA sequence variation within multiple strains of the baker's yeast Saccharomyces cerevisiae and its nearest wild relative Saccharomyces paradoxus in unprecedented detail. Notably, both species possess single locus rDNA systems. Here, we use these new variation datasets to assess whether a more detailed characterization of the rDNA locus can alleviate the second of these phylogenetic issues, sequence heterogeneity, while controlling for the first. We demonstrate that a strong phylogenetic signal exists within both datasets and illustrate how they can be used, with existing methodology, to estimate intraspecies phylogenies of yeast strains consistent with those derived from whole-genome approaches. We also describe the use of partial Single Nucleotide Polymorphisms, a type of sequence variation found only in repetitive genomic regions, in identifying key evolutionary features such as genome hybridization events and show their consistency with whole-genome Structure analyses. 
We conclude that our approach can transform rDNA sequence heterogeneity from a problem to a useful source of evolutionary information, enabling the estimation of highly accurate phylogenies of closely related organisms, and discuss how it could be extended to future studies of multilocus rDNA systems. [concerted evolution; genome hybridisation; phylogenetic analysis; ribosomal DNA; whole genome sequencing; yeast]. © The Author(s) 2014. Published by Oxford University Press, on behalf of the Society of Systematic Biologists.

  17. An efficient annotation and gene-expression derivation tool for Illumina Solexa datasets.

    PubMed

    Hosseini, Parsa; Tremblay, Arianne; Matthews, Benjamin F; Alkharouf, Nadim W

    2010-07-02

    An Illumina flow cell with all eight lanes occupied produces well over a terabyte of images, with gigabytes of reads following sequence alignment. The ability to translate such reads into meaningful annotation is therefore of great concern and importance. One can very easily be flooded with such a great volume of textual, unannotated data irrespective of read quality or size. CASAVA, an optional analysis tool for Illumina sequencing experiments, provides INDEL detection, SNP information, and allele calling. Extracting from such analysis a measure of gene expression in the form of tag-counts, and furthermore annotating such reads, is therefore of significant value. We developed TASE (Tag counting and Analysis of Solexa Experiments), a rapid tag-counting and annotation software tool specifically designed for Illumina CASAVA sequencing datasets. Developed in Java and deployed using the jTDS JDBC driver and a SQL Server backend, TASE provides an extremely fast means of calculating gene expression through tag-counts while annotating sequenced reads with the gene's presumed function, from any given CASAVA-build. Such a build is generated for both DNA and RNA sequencing. Analysis is broken into two distinct components: DNA sequence or read concatenation, followed by tag-counting and annotation. The end result is output containing the homology-based functional annotation and the respective gene expression measure, signifying how many times sequenced reads were found within the genomic ranges of functional annotations. TASE is a powerful tool to facilitate the process of annotating a given Illumina Solexa sequencing dataset. Our results indicate that both homology-based annotation and tag-count analysis are achieved in very efficient times, enabling researchers to delve deep into a given CASAVA-build and maximize information extraction from a sequencing dataset.
TASE is specially designed to translate sequence data in a CASAVA-build into functional annotations while producing corresponding gene expression measurements. Achieving such analysis is executed in an ultrafast and highly efficient manner, whether the analysis be a single-read or paired-end sequencing experiment. TASE is a user-friendly and freely available application, allowing rapid analysis and annotation of any given Illumina Solexa sequencing dataset with ease.

  18. Sequence Based Prediction of DNA-Binding Proteins Based on Hybrid Feature Selection Using Random Forest and Gaussian Naïve Bayes

    PubMed Central

    Lou, Wangchao; Wang, Xiaoqing; Chen, Fan; Chen, Yixiao; Jiang, Bo; Zhang, Hua

    2014-01-01

    Developing an efficient method for determination of DNA-binding proteins, given their vital roles in gene regulation, is highly desirable since it would be invaluable for advancing our understanding of protein functions. In this study, we proposed a new method for the prediction of DNA-binding proteins, performing feature ranking using random forest and wrapper-based feature selection using a forward best-first search strategy. The features comprise information from primary sequence, predicted secondary structure, predicted relative solvent accessibility, and position specific scoring matrix. The proposed method, called DBPPred, used Gaussian naïve Bayes as the underlying classifier since it outperformed five other classifiers, including decision tree, logistic regression, k-nearest neighbor, support vector machine with polynomial kernel, and support vector machine with radial basis function. As a result, the proposed DBPPred yields the highest average accuracy of 0.791 and average MCC of 0.583 according to the five-fold cross validation with ten runs on the training benchmark dataset PDB594. Subsequently, blind tests on the independent dataset PDB186 were performed with the proposed model trained on the entire PDB594 dataset and with five other existing methods (iDNA-Prot, DNA-Prot, DNAbinder, DNABIND and DBD-Threader); the proposed DBPPred yielded the highest accuracy of 0.769, MCC of 0.538, and AUC of 0.790. The independent tests performed by the proposed DBPPred on a large non-DNA-binding protein dataset and two RNA-binding protein datasets also showed improved or comparable quality when compared with the relevant prediction methods. Moreover, we observed that the majority of the features selected by the proposed method differ statistically significantly in mean value between the DNA-binding and the non-DNA-binding proteins. 
All of the experimental results indicate that the proposed DBPPred can be an alternative perspective predictor for large-scale determination of DNA-binding proteins. PMID:24475169

  19. IM-TORNADO: a tool for comparison of 16S reads from paired-end libraries.

    PubMed

    Jeraldo, Patricio; Kalari, Krishna; Chen, Xianfeng; Bhavsar, Jaysheel; Mangalam, Ashutosh; White, Bryan; Nelson, Heidi; Kocher, Jean-Pierre; Chia, Nicholas

    2014-01-01

    16S rDNA hypervariable tag sequencing has become the de facto method for accessing microbial diversity. Illumina paired-end sequencing, which produces two separate reads for each DNA fragment, has become the platform of choice for this application. However, when the two reads do not overlap, existing computational pipelines analyze data from each read separately and underutilize the information contained in the paired-end reads. We created a workflow known as Illinois Mayo Taxon Organization from RNA Dataset Operations (IM-TORNADO) for processing non-overlapping reads while retaining maximal information content. Using synthetic mock datasets, we show that the use of both reads produced answers with greater correlation to those from full length 16S rDNA when looking at taxonomy, phylogeny, and beta-diversity. IM-TORNADO is freely available at http://sourceforge.net/projects/imtornado and produces BIOM format output for cross compatibility with other pipelines such as QIIME, mothur, and phyloseq.

  20. TriageTools: tools for partitioning and prioritizing analysis of high-throughput sequencing data.

    PubMed

    Fimereli, Danai; Detours, Vincent; Konopka, Tomasz

    2013-04-01

    High-throughput sequencing is becoming a popular research tool but carries with it considerable costs in terms of computation time, data storage and bandwidth. Meanwhile, some research applications focusing on individual genes or pathways do not necessitate processing of a full sequencing dataset. Thus, it is desirable to partition a large dataset into smaller, manageable, but relevant pieces. We present a toolkit for partitioning raw sequencing data that includes a method for extracting reads that are likely to map onto pre-defined regions of interest. We show the method can be used to extract information about genes of interest from DNA or RNA sequencing samples in a fraction of the time and disk space required to process and store a full dataset. We report speedup factors between 2.6 and 96, depending on settings and samples used. The software is available at http://www.sourceforge.net/projects/triagetools/.
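
    The idea of extracting only those reads that are likely to map onto pre-defined regions can be sketched in Python as follows. The abstract does not state how TriageTools makes this decision, so the shared k-mer criterion used here is an assumption chosen for illustration, as are the function names, the parameters k and min_hits, and the toy sequences. Reverse-complement matches are ignored for brevity.

```python
def kmers(seq, k):
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def triage_reads(reads, regions, k=8, min_hits=2):
    """Keep reads that share at least min_hits k-mers with the target regions.

    Assumption for illustration: exact k-mer sharing stands in for
    "likely to map"; only the forward strand is considered.
    """
    target = set()
    for region in regions:
        target |= kmers(region, k)
    kept = []
    for read in reads:
        hits = sum(
            1 for i in range(len(read) - k + 1) if read[i:i + k] in target
        )
        if hits >= min_hits:
            kept.append(read)
    return kept


# Hypothetical region of interest and raw reads.
region = "ACGTACGTTAGCCGATAGGCTTAACG"
raw_reads = ["TAGCCGATAGGC", "GGGGGGGGGGGGGG", "TTTTCCCCAAAAGGGG"]
selected = triage_reads(raw_reads, [region])
print(selected)
```

    Only the first read is kept because it is the only one sharing k-mers with the region; the full dataset never needs to be aligned to make that decision, which is where the time and disk savings would come from.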

  1. High-Throughput Block Optical DNA Sequence Identification.

    PubMed

    Sagar, Dodderi Manjunatha; Korshoj, Lee Erik; Hanson, Katrina Bethany; Chowdhury, Partha Pratim; Otoupal, Peter Britton; Chatterjee, Anushree; Nagpal, Prashant

    2018-01-01

    Optical techniques for molecular diagnostics or DNA sequencing generally rely on small molecule fluorescent labels, which utilize light with a wavelength of several hundred nanometers for detection. Developing a label-free optical DNA sequencing technique will require nanoscale focusing of light, a high-throughput and multiplexed identification method, and a data compression technique to rapidly identify sequences and analyze genomic heterogeneity for big datasets. Such a method should identify characteristic molecular vibrations using optical spectroscopy, especially in the "fingerprinting region" from ≈400-1400 cm⁻¹. Here, surface-enhanced Raman spectroscopy is used to demonstrate label-free identification of DNA nucleobases with multiplexed 3D plasmonic nanofocusing. While nanometer-scale mode volumes prevent identification of single nucleobases within a DNA sequence, the block optical technique can identify A, T, G, and C content in DNA k-mers. The content of each nucleotide in a DNA block can be a unique and high-throughput method for identifying sequences, genes, and other biomarkers as an alternative to single-letter sequencing. Additionally, coupling two complementary vibrational spectroscopy techniques (infrared and Raman) can improve block characterization. These results pave the way for developing a novel, high-throughput block optical sequencing method with lossy genomic data compression using k-mer identification from multiplexed optical data acquisition. © 2017 WILEY-VCH Verlag GmbH & Co. KGaA, Weinheim.
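
    To make the notion of "A, T, G, and C content in DNA k-mers" concrete, the short Python sketch below computes the nucleotide content of consecutive blocks of a sequence. It only illustrates the computational side of the representation, not the optical measurement; the function name block_content and the choice of non-overlapping blocks are assumptions.

```python
from collections import Counter


def block_content(seq, k):
    """Split seq into consecutive non-overlapping blocks of length k and
    return the (A, T, G, C) counts of each block. A trailing partial
    block is dropped."""
    signatures = []
    for i in range(0, len(seq) - k + 1, k):
        counts = Counter(seq[i:i + k])
        signatures.append(tuple(counts[base] for base in "ATGC"))
    return signatures


print(block_content("ATGCGCTA", 4))
print(block_content("AAAATTGC", 4))
```

    In the first call the blocks ATGC and GCTA give the same signature although their base order differs, which shows why a content-only representation is a lossy compression of the sequence.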

  2. Haematobia irritans dataset of raw sequence reads from Illumina and Pac Bio sequencing of genomic DNA

    USDA-ARS's Scientific Manuscript database

    The genome of the horn fly, Haematobia irritans, was sequenced using Illumina- and Pac Bio-based protocols. Following quality filtering, the raw reads have been deposited at NCBI under the BioProject and BioSample accession numbers PRJNA30967 and SAMN07830356, respectively. The Illumina reads are un...

  3. Preparation of metagenomic libraries from naturally occurring marine viruses.

    PubMed

    Solonenko, Sergei A; Sullivan, Matthew B

    2013-01-01

    Microbes are now well recognized as major drivers of the biogeochemical cycling that fuels the Earth, and their viruses (phages) are known to be abundant and important in microbial mortality, horizontal gene transfer, and modulating microbial metabolic output. Investigation of environmental phages has been frustrated by an inability to culture the vast majority of naturally occurring diversity coupled with the lack of robust, quantitative, culture-independent methods for studying this uncultured majority. However, for double-stranded DNA phages, a quantitative viral metagenomic sample-to-sequence workflow now exists. Here, we review these advances with special emphasis on the technical details of preparing DNA sequencing libraries for metagenomic sequencing from environmentally relevant low-input DNA samples. Library preparation steps broadly involve manipulating the sample DNA by fragmentation, end repair and adaptor ligation, size fractionation, and amplification. One critical area of future research and development is parallel advances for alternate nucleic acid types such as single-stranded DNA and RNA viruses that are also abundant in nature. Combinations of recent advances in fragmentation (e.g., acoustic shearing and tagmentation), ligation reactions (adaptor-to-template ratio reference table availability), size fractionation (non-gel-sizing), and amplification (linear amplification for deep sequencing and linker amplification protocols) enhance our ability to generate quantitatively representative metagenomic datasets from low-input DNA samples. Such datasets are already providing new insights into the role of viruses in marine systems and will continue to do so as new environments are explored and synergies and paradigms emerge from large-scale comparative analyses. © 2013 Elsevier Inc. All rights reserved.

  4. A Novel Computational Method for Detecting DNA Methylation Sites with DNA Sequence Information and Physicochemical Properties.

    PubMed

    Pan, Gaofeng; Jiang, Limin; Tang, Jijun; Guo, Fei

    2018-02-08

    DNA methylation is an important biochemical process, and it has a close connection with many types of cancer. Research on DNA methylation can help us to understand the regulation mechanism and epigenetic reprogramming. Therefore, it is very important to recognize the methylation sites in a DNA sequence. In the past several decades, many computational methods, especially machine learning methods, have been developed since high-throughput sequencing technology became widely used in research and industry. In order to accurately identify whether or not a nucleotide residue is methylated under the specific DNA sequence context, we propose a novel method that overcomes the shortcomings of previous methods for predicting methylation sites. We use k-gram, multivariate mutual information, discrete wavelet transform, and pseudo amino acid composition to extract features, and train a sparse Bayesian learning model to do DNA methylation prediction. Five criteria are used to evaluate the prediction results of our method: area under the receiver operating characteristic curve (AUC), Matthews correlation coefficient (MCC), accuracy (ACC), sensitivity (SN), and specificity. On the benchmark dataset, we could reach 0.8632 on AUC, 0.8017 on ACC, 0.5558 on MCC, and 0.7268 on SN. Additionally, the best results on two scBS-seq profiled mouse embryonic stem cells datasets were 0.8896 and 0.9511 by AUC, respectively. When compared with other outstanding methods, our method surpassed them on the accuracy of prediction. The improvement of AUC by our method compared to other methods was at least 0.0399. For the convenience of other researchers, our code has been uploaded to a file hosting service, and can be downloaded from: https://figshare.com/s/0697b692d802861282d3.
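
    Four of the five evaluation criteria named above (ACC, SN, specificity, and MCC) follow directly from the counts of a binary confusion matrix, and the Python sketch below shows the standard formulas. The counts in the example are invented for illustration and are not taken from the paper; AUC is omitted because it requires ranked prediction scores rather than counts. The sketch assumes both classes are present, so the SN and specificity denominators are non-zero.

```python
import math


def binary_metrics(tp, fp, tn, fn):
    """Accuracy, sensitivity, specificity and Matthews correlation
    coefficient from confusion-matrix counts."""
    acc = (tp + tn) / (tp + fp + tn + fn)
    sn = tp / (tp + fn)   # sensitivity: fraction of true sites recovered
    sp = tn / (tn + fp)   # specificity: fraction of non-sites rejected
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention MCC is set to 0 when any marginal total is zero.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"ACC": acc, "SN": sn, "SP": sp, "MCC": mcc}


# Hypothetical counts, not results from the paper.
metrics = binary_metrics(tp=40, fp=10, tn=40, fn=10)
print(metrics)
```

    With these balanced counts ACC, SN, and specificity are all 0.8 while MCC is 0.6, which illustrates that MCC is a stricter summary than accuracy.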

  5. A reference human genome dataset of the BGISEQ-500 sequencer.

    PubMed

    Huang, Jie; Liang, Xinming; Xuan, Yuankai; Geng, Chunyu; Li, Yuxiang; Lu, Haorong; Qu, Shoufang; Mei, Xianglin; Chen, Hongbo; Yu, Ting; Sun, Nan; Rao, Junhua; Wang, Jiahao; Zhang, Wenwei; Chen, Ying; Liao, Sha; Jiang, Hui; Liu, Xin; Yang, Zhaopeng; Mu, Feng; Gao, Shangxian

    2017-05-01

    BGISEQ-500 is a new desktop sequencer developed by BGI. Using DNA nanoball and combinational probe anchor synthesis developed from Complete Genomics™ sequencing technologies, it generates short reads at a large scale. Here, we present the first human whole-genome sequencing dataset of BGISEQ-500. The dataset was generated by sequencing the widely used cell line HG001 (NA12878) in two sequencing runs of paired-end 50 bp (PE50) and two sequencing runs of paired-end 100 bp (PE100). We also include examples of the raw images from the sequencer for reference. Finally, we identified variations using this dataset, estimated the accuracy of the variations, and compared it to that of the variations identified from similar amounts of publicly available HiSeq2500 data. We found similar single nucleotide polymorphism (SNP) detection accuracy for the BGISEQ-500 PE100 data (false positive rate [FPR] = 0.00020%, sensitivity = 96.20%) compared to the PE150 HiSeq2500 data (FPR = 0.00017%, sensitivity = 96.60%), and better SNP detection accuracy than for the PE50 data (FPR = 0.0006%, sensitivity = 94.15%). For insertions and deletions (indels), however, we found lower accuracy for BGISEQ-500 data (FPR = 0.00069% and 0.00067% for PE100 and PE50 respectively, sensitivity = 88.52% and 70.93%) than for the HiSeq2500 data (FPR = 0.00032%, sensitivity = 96.28%). Our dataset can serve as the reference dataset, providing basic information not just for future development, but also for all research and applications based on the new sequencing platform. © The Authors 2017. Published by Oxford University Press.

  6. Phylogenetic placement of the enigmatic parasite, Polypodium hydriforme, within the Phylum Cnidaria

    PubMed Central

    2008-01-01

    Background: Polypodium hydriforme is a parasite with an unusual life cycle and peculiar morphology, both of which have made its systematic position uncertain. Polypodium has traditionally been considered a cnidarian because it possesses nematocysts, the stinging structures characteristic of this phylum. However, recent molecular phylogenetic studies using 18S rDNA sequence data have challenged this interpretation, and have shown that Polypodium is a close relative to myxozoans and together they share a closer affinity to bilaterians than cnidarians. Due to the variable rates of 18S rDNA sequences, these results have been suggested to be an artifact of long-branch attraction (LBA). A recent study, using multiple protein coding markers, shows that the myxozoan Buddenbrockia, is nested within cnidarians. Polypodium was not included in this study. To further investigate the phylogenetic placement of Polypodium, we have performed phylogenetic analyses of metazoans with 18S and partial 28S rDNA sequences in a large dataset that includes Polypodium and a comprehensive sampling of cnidarian taxa. Results: Analyses of a combined dataset of 18S and partial 28S sequences, and partial 28S alone, support the placement of Polypodium within Cnidaria. Removal of the long-branched myxozoans from the 18S dataset also results in Polypodium being nested within Cnidaria. These results suggest that previous reports showing that Polypodium and Myxozoa form a sister group to Bilateria were an artifact of long-branch attraction. Conclusion: By including 28S rDNA sequences and a comprehensive sampling of cnidarian taxa, we demonstrate that previously conflicting hypotheses concerning the phylogenetic placement of Polypodium can be reconciled. Specifically, the data presented provide evidence that Polypodium is indeed a cnidarian and is either the sister taxon to Hydrozoa, or part of the hydrozoan clade, Leptothecata. 
The former hypothesis is consistent with the traditional view that Polypodium should be placed in its own cnidarian class, Polypodiozoa. PMID:18471296

  7. Phylogenetic placement of the enigmatic parasite, Polypodium hydriforme, within the Phylum Cnidaria.

    PubMed

    Evans, Nathaniel M; Lindner, Alberto; Raikova, Ekaterina V; Collins, Allen G; Cartwright, Paulyn

    2008-05-09

    Polypodium hydriforme is a parasite with an unusual life cycle and peculiar morphology, both of which have made its systematic position uncertain. Polypodium has traditionally been considered a cnidarian because it possesses nematocysts, the stinging structures characteristic of this phylum. However, recent molecular phylogenetic studies using 18S rDNA sequence data have challenged this interpretation, and have shown that Polypodium is a close relative to myxozoans and together they share a closer affinity to bilaterians than cnidarians. Due to the variable rates of 18S rDNA sequences, these results have been suggested to be an artifact of long-branch attraction (LBA). A recent study, using multiple protein coding markers, shows that the myxozoan Buddenbrockia, is nested within cnidarians. Polypodium was not included in this study. To further investigate the phylogenetic placement of Polypodium, we have performed phylogenetic analyses of metazoans with 18S and partial 28S rDNA sequences in a large dataset that includes Polypodium and a comprehensive sampling of cnidarian taxa. Analyses of a combined dataset of 18S and partial 28S sequences, and partial 28S alone, support the placement of Polypodium within Cnidaria. Removal of the long-branched myxozoans from the 18S dataset also results in Polypodium being nested within Cnidaria. These results suggest that previous reports showing that Polypodium and Myxozoa form a sister group to Bilateria were an artifact of long-branch attraction. By including 28S rDNA sequences and a comprehensive sampling of cnidarian taxa, we demonstrate that previously conflicting hypotheses concerning the phylogenetic placement of Polypodium can be reconciled. Specifically, the data presented provide evidence that Polypodium is indeed a cnidarian and is either the sister taxon to Hydrozoa, or part of the hydrozoan clade, Leptothecata. 
The former hypothesis is consistent with the traditional view that Polypodium should be placed in its own cnidarian class, Polypodiozoa.

  8. The 'dark matter' in the plant genomes: non-coding and unannotated DNA sequences associated with open chromatin.

    PubMed

    Jiang, Jiming

    2015-04-01

    Sequencing of complete plant genomes has become increasingly more routine since the advent of the next-generation sequencing technology. Identification and annotation of large amounts of noncoding but functional DNA sequences, including cis-regulatory DNA elements (CREs), have become a new frontier in plant genome research. Genomic regions containing active CREs bound to regulatory proteins are hypersensitive to DNase I digestion and are called DNase I hypersensitive sites (DHSs). Several recent DHS studies in plants illustrate that DHS datasets produced by DNase I digestion followed by next-generation sequencing (DNase-seq) are highly valuable for the identification and characterization of CREs associated with plant development and responses to environmental cues. DHS-based genomic profiling has opened a door to identify and annotate the 'dark matter' in sequenced plant genomes. Copyright © 2015 Elsevier Ltd. All rights reserved.

  9. An efficient annotation and gene-expression derivation tool for Illumina Solexa datasets

    PubMed Central

    2010-01-01

    Background: An Illumina flow cell with all eight lanes occupied produces well over a terabyte of images, with gigabytes of reads following sequence alignment. The ability to translate such reads into meaningful annotation is therefore of great importance. One can very easily be flooded with such a great volume of textual, unannotated data, irrespective of read quality or size. CASAVA, an optional analysis tool for Illumina sequencing experiments, supports INDEL detection, SNP information, and allele calling. Extracting from such analysis not only a measure of gene expression in the form of tag-counts, but also an annotation of the reads, is therefore of significant value. Findings: We developed TASE (Tag counting and Analysis of Solexa Experiments), a rapid tag-counting and annotation software tool specifically designed for Illumina CASAVA sequencing datasets. Developed in Java and deployed using the jTDS JDBC driver and a SQL Server backend, TASE provides an extremely fast means of calculating gene expression through tag-counts while annotating sequenced reads with the gene's presumed function, from any given CASAVA-build. Such a build is generated for both DNA and RNA sequencing. Analysis is broken into two distinct components: DNA sequence or read concatenation, followed by tag-counting and annotation. The end result is output containing the homology-based functional annotation and the respective gene expression measure, signifying how many times sequenced reads were found within the genomic ranges of functional annotations. Conclusions: TASE is a powerful tool to facilitate the process of annotating a given Illumina Solexa sequencing dataset. Our results indicate that both homology-based annotation and tag-count analysis are achieved in very efficient times, allowing researchers to delve deep into a given CASAVA-build and maximize information extraction from a sequencing dataset. 
TASE is specially designed to translate sequence data in a CASAVA-build into functional annotations while producing corresponding gene expression measurements. Such analysis is executed in an ultrafast and highly efficient manner, whether the experiment is single-read or paired-end. TASE is a user-friendly and freely available application, allowing rapid analysis and annotation of any given Illumina Solexa sequencing dataset with ease. PMID:20598141

  10. Identification and Removal of Contaminant Sequences From Ribosomal Gene Databases: Lessons From the Census of Deep Life

    PubMed Central

    Sheik, Cody S.; Reese, Brandi Kiel; Twing, Katrina I.; Sylvan, Jason B.; Grim, Sharon L.; Schrenk, Matthew O.; Sogin, Mitchell L.; Colwell, Frederick S.

    2018-01-01

    Earth’s subsurface environment is one of the largest, yet least studied, biomes on Earth, and many questions remain regarding what microorganisms are indigenous to the subsurface. Through the activity of the Census of Deep Life (CoDL) and the Deep Carbon Observatory, an open access 16S ribosomal RNA gene sequence database from diverse subsurface environments has been compiled. However, due to low quantities of biomass in the deep subsurface, the potential for incorporation of contaminants from reagents used during sample collection, processing, and/or sequencing is high. Thus, to understand the ecology of subsurface microorganisms (i.e., the distribution, richness, or survival), it is necessary to minimize, identify, and remove contaminant sequences that will skew the relative abundances of all taxa in the sample. In this meta-analysis, we identify putative contaminants associated with the CoDL dataset, recommend best practices for removing contaminants from samples, and propose a series of best practices for subsurface microbiology sampling. The most abundant putative contaminant genera observed, independent of evenness across samples, were Propionibacterium, Aquabacterium, Ralstonia, and Acinetobacter, while the top five most frequently observed genera were Pseudomonas, Propionibacterium, Acinetobacter, Ralstonia, and Sphingomonas. The majority of the most frequently observed genera (high evenness) were associated with reagent or potential human contamination. Additionally, in DNA extraction blanks, we observed potential archaeal contaminants, including methanogens, which have not been discussed in previous contamination studies. Such contaminants would directly affect the interpretation of subsurface molecular studies, as methanogenesis is an important subsurface biogeochemical process. 
Utilizing previously identified contaminant genera, we found that ∼27% of the total dataset were identified as contaminant sequences that likely originate from DNA extraction and DNA cleanup methods. Thus, controls must be taken at every step of the collection and processing procedure when working with low biomass environments such as, but not limited to, portions of Earth’s deep subsurface. Taken together, we stress that the CoDL dataset is an incredible resource for the broader research community interested in subsurface life, and steps to remove contamination derived sequences must be taken prior to using this dataset. PMID:29780369

  11. Identification and Removal of Contaminant Sequences From Ribosomal Gene Databases: Lessons From the Census of Deep Life.

    PubMed

    Sheik, Cody S; Reese, Brandi Kiel; Twing, Katrina I; Sylvan, Jason B; Grim, Sharon L; Schrenk, Matthew O; Sogin, Mitchell L; Colwell, Frederick S

    2018-01-01

    Earth's subsurface environment is one of the largest, yet least studied, biomes on Earth, and many questions remain regarding what microorganisms are indigenous to the subsurface. Through the activity of the Census of Deep Life (CoDL) and the Deep Carbon Observatory, an open access 16S ribosomal RNA gene sequence database from diverse subsurface environments has been compiled. However, due to low quantities of biomass in the deep subsurface, the potential for incorporation of contaminants from reagents used during sample collection, processing, and/or sequencing is high. Thus, to understand the ecology of subsurface microorganisms (i.e., the distribution, richness, or survival), it is necessary to minimize, identify, and remove contaminant sequences that will skew the relative abundances of all taxa in the sample. In this meta-analysis, we identify putative contaminants associated with the CoDL dataset, recommend best practices for removing contaminants from samples, and propose a series of best practices for subsurface microbiology sampling. The most abundant putative contaminant genera observed, independent of evenness across samples, were Propionibacterium, Aquabacterium, Ralstonia, and Acinetobacter, while the top five most frequently observed genera were Pseudomonas, Propionibacterium, Acinetobacter, Ralstonia, and Sphingomonas. The majority of the most frequently observed genera (high evenness) were associated with reagent or potential human contamination. Additionally, in DNA extraction blanks, we observed potential archaeal contaminants, including methanogens, which have not been discussed in previous contamination studies. Such contaminants would directly affect the interpretation of subsurface molecular studies, as methanogenesis is an important subsurface biogeochemical process. 
Utilizing previously identified contaminant genera, we found that ∼27% of the total dataset were identified as contaminant sequences that likely originate from DNA extraction and DNA cleanup methods. Thus, controls must be taken at every step of the collection and processing procedure when working with low biomass environments such as, but not limited to, portions of Earth's deep subsurface. Taken together, we stress that the CoDL dataset is an incredible resource for the broader research community interested in subsurface life, and steps to remove contamination derived sequences must be taken prior to using this dataset.

  12. Phylogenetic study on Shiraia bambusicola by rDNA sequence analyses.

    PubMed

    Cheng, Tian-Fan; Jia, Xiao-Ming; Ma, Xiao-Hang; Lin, Hai-Ping; Zhao, Yu-Hua

    2004-01-01

    In this study, 18S rDNA and ITS-5.8S rDNA regions of four Shiraia bambusicola isolates collected from different species of bamboos were amplified by PCR with universal primer pairs NS1/NS8 and ITS5/ITS4, respectively, and sequenced. Phylogenetic analyses were conducted on three selected datasets of rDNA sequences. Maximum parsimony, distance and maximum likelihood criteria were used to infer trees. Morphological characteristics were also observed. The positioning of Shiraia in the order Pleosporales was well supported by bootstrap, which agreed with the placement by Amano (1980) according to their morphology. We did not find significant inter-hostal differences among these four isolates from different species of bamboos. From the results of analyses and comparison of their rDNA sequences, we conclude that Shiraia should be classified into Pleosporales as Amano (1980) proposed and suggest that it might be positioned in the family Phaeosphaeriaceae. Copyright 2004 WILEY-VCH Verlag GmbH & Co.

  13. An improved model for whole genome phylogenetic analysis by Fourier transform.

    PubMed

    Yin, Changchuan; Yau, Stephen S-T

    2015-10-07

    DNA sequence similarity comparison is one of the major steps in computational phylogenetic studies. The sequence comparison of closely related DNA sequences and genomes is usually performed by multiple sequence alignments (MSA). While the MSA method is accurate for some types of sequences, it may produce incorrect results when DNA sequences have undergone rearrangements, as in many bacterial and viral genomes. It is also limited by its computational complexity for comparing large volumes of data. Previously, we proposed an alignment-free method that exploits the full information contents of DNA sequences by Discrete Fourier Transform (DFT), but still with some limitations. Here, we present a significantly improved method for the similarity comparison of DNA sequences by DFT. In this method, we map DNA sequences into 2-dimensional (2D) numerical sequences and then apply DFT to transform the 2D numerical sequences into the frequency domain. In the 2D mapping, the nucleotide composition of a DNA sequence is a determinant factor, and the 2D mapping reduces the nucleotide composition bias in the distance measure, thus improving the similarity measure of DNA sequences. To compare the DFT power spectra of DNA sequences with different lengths, we propose an improved even scaling algorithm to extend shorter DFT power spectra to the longest length of the underlying sequences. After the DFT power spectra are evenly scaled, the spectra have the same dimensionality in the Fourier frequency space, and the Euclidean distances between the full Fourier power spectra of the DNA sequences are used as the dissimilarity metrics. The improved DFT method, with increased computational performance by 2D numerical representation, is applicable to DNA sequences of any length range. We assess the accuracy of the improved DFT similarity measure in hierarchical clustering of different DNA sequences including simulated and real datasets. 
The method yields accurate and reliable phylogenetic trees and demonstrates that the improved DFT dissimilarity measure is an efficient and effective similarity measure of DNA sequences. Due to its high efficiency and accuracy, the proposed DFT similarity measure is successfully applied on phylogenetic analysis for individual genes and large whole bacterial genomes. Copyright © 2015 Elsevier Ltd. All rights reserved.
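
    The pipeline outlined in this abstract (2D numerical mapping, DFT power spectrum, scaling of the shorter spectrum to a common length, Euclidean distance) can be sketched in Python as below. The abstract does not give the actual 2D mapping or the even scaling algorithm, so both are stand-ins and should be read as assumptions: each nucleotide is mapped to a unit vector in the complex plane, and plain linear interpolation is used in place of the authors' scaling procedure. All function names are hypothetical. A naive O(n²) DFT is used to keep the sketch dependency-free.

```python
import cmath
import math

# Assumed 2D mapping for illustration only (real part, imaginary part);
# the mapping used in the paper is not stated in the abstract.
MAPPING = {
    "A": complex(0, -1),
    "T": complex(0, 1),
    "C": complex(-1, 0),
    "G": complex(1, 0),
}


def power_spectrum(seq):
    """Naive DFT power spectrum of the 2D-mapped (non-empty) sequence."""
    x = [MAPPING[base] for base in seq]
    n = len(x)
    spectrum = []
    for k in range(n):
        s = sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
        spectrum.append(abs(s) ** 2)
    return spectrum


def scale_spectrum(spec, m):
    """Stretch a spectrum to length m by linear interpolation.

    Stand-in for the paper's even scaling algorithm (an assumption)."""
    n = len(spec)
    if n == m:
        return list(spec)
    out = []
    for k in range(m):
        q = k * (n - 1) / (m - 1)      # fractional index into spec
        lo = int(math.floor(q))
        hi = min(lo + 1, n - 1)
        frac = q - lo
        out.append(spec[lo] * (1 - frac) + spec[hi] * frac)
    return out


def dft_distance(a, b):
    """Euclidean distance between length-matched power spectra."""
    pa, pb = power_spectrum(a), power_spectrum(b)
    m = max(len(pa), len(pb))
    pa, pb = scale_spectrum(pa, m), scale_spectrum(pb, m)
    return math.sqrt(sum((u - v) ** 2 for u, v in zip(pa, pb)))


print(dft_distance("ACGTACGT", "ACGTACGT"))
print(dft_distance("ACGTACGT", "AAAATTTT"))
print(dft_distance("ACGTACGT", "ACGTAC"))
```

    The resulting pairwise distances could be fed to any hierarchical clustering routine to build a tree; for genome-scale inputs the naive DFT would be replaced by an FFT implementation.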

  14. MetaCAA: A clustering-aided methodology for efficient assembly of metagenomic datasets.

    PubMed

    Reddy, Rachamalla Maheedhar; Mohammed, Monzoorul Haque; Mande, Sharmila S

    2014-01-01

    A key challenge in analyzing metagenomics data pertains to assembly of sequenced DNA fragments (i.e. reads) originating from various microbes in a given environmental sample. Several existing methodologies can assemble reads originating from a single genome. However, these methodologies cannot be applied for efficient assembly of metagenomic sequence datasets. In this study, we present MetaCAA - a clustering-aided methodology which helps in improving the quality of metagenomic sequence assembly. MetaCAA initially groups sequences constituting a given metagenome into smaller clusters. Subsequently, sequences in each cluster are independently assembled using CAP3, an existing single genome assembly program. Contigs formed in each of the clusters along with the unassembled reads are then subjected to another round of assembly for generating the final set of contigs. Validation using simulated and real-world metagenomic datasets indicates that MetaCAA aids in improving the overall quality of assembly. A software implementation of MetaCAA is available at https://metagenomics.atc.tcs.com/MetaCAA. Copyright © 2014 Elsevier Inc. All rights reserved.

  15. TFBSshape: a motif database for DNA shape features of transcription factor binding sites.

    PubMed

    Yang, Lin; Zhou, Tianyin; Dror, Iris; Mathelier, Anthony; Wasserman, Wyeth W; Gordân, Raluca; Rohs, Remo

    2014-01-01

    Transcription factor binding sites (TFBSs) are most commonly characterized by the nucleotide preferences at each position of the DNA target. Whereas these sequence motifs are quite accurate descriptions of DNA binding specificities of transcription factors (TFs), proteins recognize DNA as a three-dimensional object. DNA structural features refine the description of TF binding specificities and provide mechanistic insights into protein-DNA recognition. Existing motif databases contain extensive nucleotide sequences identified in binding experiments based on their selection by a TF. To utilize DNA shape information when analysing the DNA binding specificities of TFs, we developed a new tool, the TFBSshape database (available at http://rohslab.cmb.usc.edu/TFBSshape/), for calculating DNA structural features from nucleotide sequences provided by motif databases. The TFBSshape database can be used to generate heat maps and quantitative data for DNA structural features (i.e., minor groove width, roll, propeller twist and helix twist) for 739 TF datasets from 23 different species derived from the motif databases JASPAR and UniPROBE. As demonstrated for the basic helix-loop-helix and homeodomain TF families, our TFBSshape database can be used to compare, qualitatively and quantitatively, the DNA binding specificities of closely related TFs and, thus, uncover differential DNA binding specificities that are not apparent from nucleotide sequence alone.

  16. TFBSshape: a motif database for DNA shape features of transcription factor binding sites

    PubMed Central

    Yang, Lin; Zhou, Tianyin; Dror, Iris; Mathelier, Anthony; Wasserman, Wyeth W.; Gordân, Raluca; Rohs, Remo

    2014-01-01

    Transcription factor binding sites (TFBSs) are most commonly characterized by the nucleotide preferences at each position of the DNA target. Whereas these sequence motifs are quite accurate descriptions of DNA binding specificities of transcription factors (TFs), proteins recognize DNA as a three-dimensional object. DNA structural features refine the description of TF binding specificities and provide mechanistic insights into protein–DNA recognition. Existing motif databases contain extensive nucleotide sequences identified in binding experiments based on their selection by a TF. To utilize DNA shape information when analysing the DNA binding specificities of TFs, we developed a new tool, the TFBSshape database (available at http://rohslab.cmb.usc.edu/TFBSshape/), for calculating DNA structural features from nucleotide sequences provided by motif databases. The TFBSshape database can be used to generate heat maps and quantitative data for DNA structural features (i.e., minor groove width, roll, propeller twist and helix twist) for 739 TF datasets from 23 different species derived from the motif databases JASPAR and UniPROBE. As demonstrated for the basic helix-loop-helix and homeodomain TF families, our TFBSshape database can be used to compare, qualitatively and quantitatively, the DNA binding specificities of closely related TFs and, thus, uncover differential DNA binding specificities that are not apparent from nucleotide sequence alone. PMID:24214955

  17. Investigating population continuity with ancient DNA under a spatially explicit simulation framework.

    PubMed

    Silva, Nuno Miguel; Rio, Jeremy; Currat, Mathias

    2017-12-15

    Recent advances in sequencing technologies have allowed for the retrieval of ancient DNA data (aDNA) from skeletal remains, providing direct genetic snapshots from diverse periods of human prehistory. Comparing samples taken in the same region but at different times, hereafter called "serial samples", may indicate whether there is continuity in the peopling history of that area or whether an immigration of a genetically different population has occurred between the two sampling times. However, the exploration of genetic relationships between serial samples generally ignores their geographical locations and the spatiotemporal dynamics of populations. Here, we present a new coalescent-based, spatially explicit modelling approach to investigate population continuity using aDNA, which includes two fundamental elements neglected in previous methods: population structure and migration. The approach also considers the extensive temporal and geographical variance that is commonly found in aDNA population samples. We first showed that our spatially explicit approach is more conservative than the previous (panmictic) approach and should be preferred to test for population continuity, especially when small and isolated populations are considered. We then applied our method to two mitochondrial datasets from Germany and France, both including modern and ancient lineages dating from the early Neolithic. The results clearly reject population continuity for the maternal line over the last 7500 years for the German dataset but not for the French dataset, suggesting regional heterogeneity in post-Neolithic migratory processes. Here, we demonstrate the benefits of using a spatially explicit method when investigating population continuity with aDNA. It constitutes an improvement over panmictic methods by considering the spatiotemporal dynamics of genetic lineages and the precise location of ancient samples. 
The method can be used to investigate population continuity between any pair of serial samples (ancient-ancient or ancient-modern) and to investigate more complex evolutionary scenarios. Although we based our study on mitochondrial DNA sequences, diploid molecular markers of different types (DNA, SNP, STR) can also be simulated with our approach. It thus constitutes a promising tool for the analysis of the numerous aDNA datasets being produced, including genome wide data, in humans but also in many other species.

  18. Identification of DNA-binding proteins by combining auto-cross covariance transformation and ensemble learning.

    PubMed

    Liu, Bin; Wang, Shanyi; Dong, Qiwen; Li, Shumin; Liu, Xuan

    2016-04-20

    DNA-binding proteins play a pivotal role in various intra- and extra-cellular activities ranging from DNA replication to gene expression control. With the rapid development of next-generation sequencing techniques, the number of protein sequences is increasing at an unprecedented rate. Thus it is necessary to develop computational methods to identify DNA-binding proteins based only on protein sequence information. In this study, a novel method called iDNA-KACC is presented, which combines the Support Vector Machine (SVM) and the auto-cross covariance transformation. The protein sequences are first converted into a profile-based protein representation, and then converted into a series of fixed-length vectors by the auto-cross covariance transformation with Kmer composition. The sequence order effect can be effectively captured by this scheme. These vectors are then fed into the SVM to discriminate the DNA-binding proteins from the non-DNA-binding ones. iDNA-KACC achieves an overall accuracy of 75.16% and a Matthews correlation coefficient of 0.5 in a rigorous jackknife test. Its performance is further improved by employing an ensemble learning approach, and the improved predictor is called iDNA-KACC-EL. Experimental results on an independent dataset show that iDNA-KACC-EL outperforms all the other state-of-the-art predictors, indicating that it would be a useful computational tool for DNA-binding protein identification.

  19. Identification of species based on DNA barcode using k-mer feature vector and Random forest classifier.

    PubMed

    Meher, Prabina Kumar; Sahu, Tanmaya Kumar; Rao, A R

    2016-11-05

    DNA barcoding is a molecular diagnostic method that allows automated and accurate identification of species based on a short and standardized fragment of DNA. To this end, an attempt has been made in this study to develop a computational approach for identifying a species by comparing its barcode with the barcode sequences of known species present in the reference library. Each barcode sequence was first mapped onto a numeric feature vector based on k-mer frequencies, and then the Random forest methodology was employed on the transformed dataset for species identification. The proposed approach outperformed similarity-based, tree-based, and diagnostic-based approaches and was found comparable with existing supervised learning based approaches in terms of species identification success rate when compared using real and simulated datasets. Based on the proposed approach, an online web interface, SPIDBAR, has also been developed and made freely available at http://cabgrid.res.in:8080/spidbar/ for species identification by taxonomists. Copyright © 2016 Elsevier B.V. All rights reserved.
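
    The k-mer mapping step described in this abstract can be illustrated with a minimal sketch. The function name, the default k, and the handling of ambiguous bases are assumptions made for illustration; the abstract does not specify them, and the Random forest step is not shown.

```python
from collections import Counter
from itertools import product


def kmer_frequency_vector(sequence, k=3):
    """Map a DNA sequence to a fixed-length vector of k-mer frequencies.

    Hypothetical helper: the name, the default k and the skipping of
    ambiguous bases are illustrative choices, not taken from the paper.
    """
    alphabet = "ACGT"
    seq = sequence.upper()
    # All 4**k possible k-mers in a fixed lexicographic order, so every
    # sequence is mapped onto the same feature positions.
    kmers = ["".join(p) for p in product(alphabet, repeat=k)]
    # Count overlapping windows; windows with ambiguous bases (e.g. N) are skipped.
    counts = Counter(
        seq[i:i + k]
        for i in range(len(seq) - k + 1)
        if set(seq[i:i + k]) <= set(alphabet)
    )
    total = sum(counts.values())
    if total == 0:
        return [0.0] * len(kmers)
    return [counts[kmer] / total for kmer in kmers]


if __name__ == "__main__":
    vector = kmer_frequency_vector("ACGTACGTNACG", k=2)
    print(len(vector), round(sum(vector), 6))
```

    Vectors of this kind have the same length for every barcode, which is what lets a standard classifier be trained on sequences of differing lengths.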

  20. IM-TORNADO: A Tool for Comparison of 16S Reads from Paired-End Libraries

    PubMed Central

    Jeraldo, Patricio; Kalari, Krishna; Chen, Xianfeng; Bhavsar, Jaysheel; Mangalam, Ashutosh; White, Bryan; Nelson, Heidi; Kocher, Jean-Pierre; Chia, Nicholas

    2014-01-01

    Motivation: 16S rDNA hypervariable tag sequencing has become the de facto method for accessing microbial diversity. Illumina paired-end sequencing, which produces two separate reads for each DNA fragment, has become the platform of choice for this application. However, when the two reads do not overlap, existing computational pipelines analyze data from each read separately and underutilize the information contained in the paired-end reads. Results: We created a workflow known as Illinois Mayo Taxon Organization from RNA Dataset Operations (IM-TORNADO) for processing non-overlapping reads while retaining maximal information content. Using synthetic mock datasets, we show that the use of both reads produced answers with greater correlation to those from full length 16S rDNA when looking at taxonomy, phylogeny, and beta-diversity. Availability and Implementation: IM-TORNADO is freely available at http://sourceforge.net/projects/imtornado and produces BIOM format output for cross compatibility with other pipelines such as QIIME, mothur, and phyloseq. PMID:25506826

  1. A comparative study of ChIP-seq sequencing library preparation methods.

    PubMed

    Sundaram, Arvind Y M; Hughes, Timothy; Biondi, Shea; Bolduc, Nathalie; Bowman, Sarah K; Camilli, Andrew; Chew, Yap C; Couture, Catherine; Farmer, Andrew; Jerome, John P; Lazinski, David W; McUsic, Andrew; Peng, Xu; Shazand, Kamran; Xu, Feng; Lyle, Robert; Gilfillan, Gregor D

    2016-10-21

    ChIP-seq is the primary technique used to investigate genome-wide protein-DNA interactions. As part of this procedure, immunoprecipitated DNA must undergo "library preparation" to enable subsequent high-throughput sequencing. To facilitate the analysis of biopsy samples and rare cell populations, there has been a recent proliferation of methods allowing sequencing library preparation from low-input DNA amounts. However, little information exists on the relative merits, performance, comparability and biases inherent to these procedures. Notably, recently developed single-cell ChIP procedures employing microfluidics must also employ library preparation reagents to allow downstream sequencing. In this study, seven methods designed for low-input DNA/ChIP-seq sample preparation (Accel-NGS® 2S, Bowman-method, HTML-PCR, SeqPlex™, DNA SMART™, TELP and ThruPLEX®) were performed on five replicates of 1 ng and 0.1 ng input H3K4me3 ChIP material, and compared to a "gold standard" reference PCR-free dataset. The performance of each method was examined for the prevalence of unmappable reads, amplification-derived duplicate reads, reproducibility, and for the sensitivity and specificity of peak calling. We identified consistent high performance in a subset of the tested reagents, which should aid researchers in choosing the most appropriate reagents for their studies. Furthermore, we expect this work to drive future advances by identifying and encouraging use of the most promising methods and reagents. The results may also aid judgements on how comparable are existing datasets that have been prepared with different sample library preparation reagents.

  2. Data Release: DNA barcodes of plant species collected for the Global Genome Initiative for Gardens Program, National Museum of Natural History, Smithsonian Institution

    PubMed Central

    Zúñiga, Jose D.; Gostel, Morgan R.; Mulcahy, Daniel G.; Barker, Katharine; Asia Hill; Sedaghatpour, Maryam; Vo, Samantha Q.; Funk, Vicki A.; Coddington, Jonathan A.

    2017-01-01

    The Global Genome Initiative has sequenced and released 1961 DNA barcodes for genetic samples obtained as part of the Global Genome Initiative for Gardens Program. The dataset includes barcodes for 29 plant families and 309 genera that did not have sequences flagged as barcodes in GenBank and sequences from officially recognized barcoding genetic markers meet the data standard of the Consortium for the Barcode of Life. The genetic samples were deposited in the Smithsonian Institution’s National Museum of Natural History Biorepository and their records were made public through the Global Genome Biodiversity Network’s portal. The DNA barcodes are now available on GenBank. PMID:29118648

  3. Disk-based compression of data from genome sequencing.

    PubMed

    Grabowski, Szymon; Deorowicz, Sebastian; Roguski, Łukasz

    2015-05-01

    High-coverage sequencing data have significant, yet hard to exploit, redundancy. Most FASTQ compressors cannot efficiently compress the DNA stream of large datasets, since the redundancy between overlapping reads cannot be easily captured in the (relatively small) main memory. More interesting solutions for this problem are disk based, where the better of these two, from Cox et al. (2012), is based on the Burrows-Wheeler transform (BWT) and achieves 0.518 bits per base for a 134.0 Gbp human genome sequencing collection with almost 45-fold coverage. We propose overlapping reads compression with minimizers, a compression algorithm dedicated to sequencing reads (DNA only). Our method makes use of a conceptually simple and easily parallelizable idea of minimizers, to obtain 0.317 bits per base as the compression ratio, allowing the 134.0 Gbp dataset to fit into only 5.31 GB of space. Availability: http://sun.aei.polsl.pl/orcom under a free license. Contact: sebastian.deorowicz@polsl.pl. Supplementary data are available at Bioinformatics online. © The Author 2014. Published by Oxford University Press. All rights reserved.
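
    The minimizer idea and the size arithmetic in this abstract can be sketched as follows. This is a simplified illustration, not the authors' implementation: the minimizer is taken here as the lexicographically smallest k-mer of a read, and the function names, the value of k, and the example reads are assumptions.

```python
from collections import defaultdict


def minimizer(read, k=4):
    """Return the lexicographically smallest k-mer of a read (simplified definition)."""
    if len(read) < k:
        raise ValueError("read is shorter than k")
    return min(read[i:i + k] for i in range(len(read) - k + 1))


def bucket_by_minimizer(reads, k=4):
    """Group reads by minimizer; reads that overlap often share one."""
    buckets = defaultdict(list)
    for read in reads:
        buckets[minimizer(read, k)].append(read)
    return dict(buckets)


def compressed_size_gb(bases, bits_per_base):
    """Convert a bits-per-base figure into gigabytes (10**9 bytes)."""
    return bases * bits_per_base / 8 / 1e9


if __name__ == "__main__":
    # Hypothetical overlapping reads from the same region.
    reads = ["GGTACCATTG", "TACCATTGCA", "CCATTGCAGG"]
    print(bucket_by_minimizer(reads))
    # 134.0 Gbp at 0.317 bits per base, the figures quoted in the abstract.
    print(round(compressed_size_gb(134.0e9, 0.317), 2))
```

    Bucketing reads this way lets each bucket be processed independently, which is one reason the idea parallelizes easily. The size calculation reproduces the 5.31 GB figure from the stated 0.317 bits per base.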

  4. Identification of fungi in shotgun metagenomics datasets

    PubMed Central

    Donovan, Paul D.; Gonzalez, Gabriel; Higgins, Desmond G.

    2018-01-01

    Metagenomics uses nucleic acid sequencing to characterize species diversity in different niches such as environmental biomes or the human microbiome. Most studies have used 16S rRNA amplicon sequencing to identify bacteria. However, the decreasing cost of sequencing has resulted in a gradual shift away from amplicon analyses and towards shotgun metagenomic sequencing. Shotgun metagenomic data can be used to identify a wide range of species, but have rarely been applied to fungal identification. Here, we develop a sequence classification pipeline, FindFungi, and use it to identify fungal sequences in public metagenome datasets. We focus primarily on animal metagenomes, especially those from pig and mouse microbiomes. We identified fungi in 39 of 70 datasets comprising 71 fungal species. At least 11 pathogenic species with zoonotic potential were identified, including Candida tropicalis. We identified Pseudogymnoascus species from 13 Antarctic soil samples initially analyzed for the presence of bacteria capable of degrading diesel oil. We also show that Candida tropicalis and Candida loboi are likely the same species. In addition, we identify several examples where contaminating DNA was erroneously included in fungal genome assemblies. PMID:29444186

  5. Supervised DNA Barcodes species classification: analysis, comparisons and results

    PubMed Central

    2014-01-01

    Background: Specific fragments, coming from short portions of DNA (e.g., mitochondrial, nuclear, and plastid sequences), have been defined as DNA Barcode and can be used as markers for organisms of the main life kingdoms. Species classification with DNA Barcode sequences has been proven effective on different organisms. Indeed, specific gene regions have been identified as Barcode: COI in animals, rbcL and matK in plants, and ITS in fungi. The classification problem assigns an unknown specimen to a known species by analyzing its Barcode. This task has to be supported with reliable methods and algorithms. Methods: In this work the efficacy of supervised machine learning methods to classify species with DNA Barcode sequences is shown. The Weka software suite, which includes a collection of supervised classification methods, is adopted to address the task of DNA Barcode analysis. Classifier families are tested on synthetic and empirical datasets belonging to the animal, fungus, and plant kingdoms. In particular, the function-based method Support Vector Machines (SVM), the rule-based RIPPER, the decision tree C4.5, and the Naïve Bayes method are considered. Additionally, the classification results are compared with respect to ad-hoc and well-established DNA Barcode classification methods. Results: A software tool that converts the DNA Barcode FASTA sequences to the Weka format is released, to adapt different input formats and to allow the execution of the classification procedure. The analysis of results on synthetic and real datasets shows that SVM and Naïve Bayes outperform on average the other considered classifiers, although they do not provide a human interpretable classification model. Rule-based methods have slightly inferior classification performances, but deliver the species specific positions and nucleotide assignments. 
On synthetic data the supervised machine learning methods obtain superior classification performances with respect to the traditional DNA Barcode classification methods. On empirical data their classification performances are at a comparable level to the other methods. Conclusions: The classification analysis shows that supervised machine learning methods are promising candidates for successfully handling the DNA Barcoding species classification problem, obtaining excellent performances. To conclude, a powerful tool to perform species identification is now available to the DNA Barcoding community. PMID:24721333

  6. Efficiency of ITS Sequences for DNA Barcoding in Passiflora (Passifloraceae)

    PubMed Central

    Giudicelli, Giovanna Câmara; Mäder, Geraldo; de Freitas, Loreta Brandão

    2015-01-01

    DNA barcoding is a technique for discriminating and identifying species using short, variable, and standardized DNA regions. Here, we tested for the first time the performance of plastid and nuclear regions as DNA barcodes in Passiflora. This genus is largely variable, with more than 900 species of high ecological, commercial, and ornamental importance. We analyzed 1034 accessions of 222 species representing the four subgenera of Passiflora and evaluated the effectiveness of five plastid regions and three nuclear datasets currently employed as DNA barcodes in plants using barcoding gap, applied similarity-, and tree-based methods. The plastid regions were able to identify less than 45% of species, whereas the nuclear datasets were efficient for more than 50% using “best match” and “best close match” methods of TaxonDNA software. All subgenera presented higher interspecific pairwise distances and did not fully overlap with the intraspecific distance, and similarity-based methods showed better results than tree-based methods. The nuclear ribosomal internal transcribed spacer 1 (ITS1) region presented a higher discrimination power than the other datasets and also showed other desirable characteristics as a DNA barcode for this genus. Therefore, we suggest that this region should be used as a starting point to identify Passiflora species. PMID:25837628

  7. Cross-species transferability and mapping of genomic and cDNA SSRs in pines

    Treesearch

    D. Chagne; P. Chaumeil; A. Ramboer; C. Collada; A. Guevara; M. T. Cervera; G. G. Vendramin; V. Garcia; J-M. Frigerio; Craig Echt; T. Richardson; Christophe Plomion

    2004-01-01

    Two unigene datasets of Pinus taeda and Pinus pinaster were screened to detect di-, tri-, and tetranucleotide repeated motifs using the SSRIT script. A total of 419 simple sequence repeats (SSRs) were identified, of which only 12.8% overlapped between the two sets. The positions of the SSRs within the coding sequence were predicted...
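
    As a rough illustration of what screening for di-, tri-, and tetranucleotide repeats involves, the sketch below finds perfect tandem repeats with regular expressions. It is not the SSRIT script; the function name and the minimum repeat counts are hypothetical values chosen for the example.

```python
import re


def find_ssrs(sequence, min_repeats=None):
    """Find perfect di-, tri- and tetranucleotide tandem repeats.

    Returns sorted (start, unit, repeat_count) tuples. The default
    thresholds are illustrative assumptions, not SSRIT's settings.
    """
    if min_repeats is None:
        min_repeats = {2: 5, 3: 4, 4: 3}  # unit length -> minimum copies
    seq = sequence.upper()
    hits = []
    for unit_len, copies in sorted(min_repeats.items()):
        # A unit followed by at least (copies - 1) exact copies of itself.
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (unit_len, copies - 1))
        for match in pattern.finditer(seq):
            unit = match.group(1)
            if len(set(unit)) == 1:
                continue  # mononucleotide run such as AAAA
            if unit_len == 4 and unit[:2] == unit[2:]:
                continue  # ATAT is a dinucleotide repeat, not a tetranucleotide one
            hits.append((match.start(), unit, len(match.group(0)) // unit_len))
    return sorted(hits)


if __name__ == "__main__":
    example = "GGG" + "AT" * 6 + "CCC" + "AAG" * 4 + "T"
    print(find_ssrs(example))
```

    Running the same screen over two unigene sets and comparing the hits would be one way to arrive at an overlap figure like the one reported.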

  8. EMPOP-quality mtDNA control region sequences from Kashmiri of Azad Jammu & Kashmir, Pakistan.

    PubMed

    Rakha, Allah; Peng, Min-Sheng; Bi, Rui; Song, Jiao-Jiao; Salahudin, Zeenat; Adan, Atif; Israr, Muhammad; Yao, Yong-Gang

    2016-11-01

    The mitochondrial DNA (mtDNA) control region (nucleotide position 16024-576) sequences were generated through the Sanger sequencing method for 317 self-identified Kashmiris from all districts of Azad Jammu & Kashmir, Pakistan. The population sample set showed a total of 251 haplotypes, with a relatively high haplotype diversity (0.9977) and a low random match probability (0.54%). The matrilineal lineages belong to three different phylogeographic origins: Western Eurasian (48.9%), South Asian (47.0%) and East Asian (4.1%). The data from the present study were compared with previous data from Pakistan and other worldwide populations (Central Asia, Western Asia, and East & Southeast Asia). The dataset is made available through EMPOP under accession number EMP00679 and will serve as an mtDNA reference database in forensic casework in Pakistan. Copyright © 2016 Elsevier Ireland Ltd. All rights reserved.
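
    The two summary statistics quoted in this abstract are commonly computed from haplotype frequencies as sketched below. The abstract does not state which estimators were used, so the formulas here (sum of squared frequencies for random match probability, and the sample-size-corrected complement for haplotype diversity) are an assumption, as are the function names and the toy sample.

```python
from collections import Counter


def random_match_probability(haplotypes):
    """Sum of squared haplotype frequencies in the sample."""
    n = len(haplotypes)
    counts = Counter(haplotypes)
    return sum((c / n) ** 2 for c in counts.values())


def haplotype_diversity(haplotypes):
    """Sample-size-corrected diversity: n / (n - 1) * (1 - sum of p_i**2)."""
    n = len(haplotypes)
    if n < 2:
        raise ValueError("need at least two samples")
    return n / (n - 1) * (1.0 - random_match_probability(haplotypes))


if __name__ == "__main__":
    # Hypothetical toy sample of four individuals and three haplotypes.
    sample = ["H1", "H1", "H2", "H3"]
    print(random_match_probability(sample))
    print(round(haplotype_diversity(sample), 4))
    # Plugging in the rounded figures from the abstract (n = 317, RMP = 0.54%).
    print(round(317 / 316 * (1 - 0.0054), 4))
```

    Under these assumed estimators, the reported random match probability of 0.54% and sample size of 317 give a diversity of about 0.9977, in line with the value quoted in the abstract.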

  9. Using classical population genetics tools with heterochroneous data: time matters!

    PubMed

    Depaulis, Frantz; Orlando, Ludovic; Hänni, Catherine

    2009-01-01

    New polymorphism datasets from heterochroneous data have arisen thanks to recent advances in experimental and microbial molecular evolution, and the sequencing of ancient DNA (aDNA). However, classical tools for population genetics analyses do not take into account heterochrony between subsets, despite potential bias on neutrality and population structure tests. Here, we characterize the extent of such possible biases using serial coalescent simulations. We first use a coalescent framework to generate datasets assuming no or different levels of heterochrony and contrast most classical population genetic statistics. We show that even weak levels of heterochrony (approximately 10% of the average depth of a standard population tree) affect the distribution of polymorphism substantially, leading to overestimation of the level of polymorphism theta and to star-like trees, with an excess of rare mutations and a deficit of linkage disequilibrium, which are the hallmark of e.g. population expansion (possibly after a drastic bottleneck). Substantial departures of the tests are detected in the opposite direction for more heterochroneous and equilibrated datasets, with balanced trees mimicking in particular population contraction, balancing selection, and population differentiation. We therefore introduce simple corrections to classical estimators of polymorphism and of the genetic distance between populations, in order to remove heterochrony-driven bias. Finally, we show that these effects do occur on real aDNA datasets, taking advantage of the currently available sequence data for Cave Bears (Ursus spelaeus), for which large mtDNA haplotypes have been reported over a substantial time period (22-130 thousand years ago (KYA)). Considering serial sampling changed the conclusion of several tests, indicating that neglecting heterochrony could provide significant support for false past history of populations and inappropriate conservation decisions. 
We therefore argue for systematically considering heterochroneous models when analyzing heterochroneous samples covering a large time scale.

  10. Phylo-VISTA: Interactive visualization of multiple DNA sequence alignments

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Shah, Nameeta; Couronne, Olivier; Pennacchio, Len A.

    The power of multi-sequence comparison for biological discovery is well established. The need for new capabilities to visualize and compare cross-species alignment data is intensified by the growing number of genomic sequence datasets being generated for an ever-increasing number of organisms. To be efficient these visualization algorithms must support the ability to accommodate consistently a wide range of evolutionary distances in a comparison framework based upon phylogenetic relationships. Results: We have developed Phylo-VISTA, an interactive tool for analyzing multiple alignments by visualizing a similarity measure for multiple DNA sequences. The complexity of visual presentation is effectively organized using a framework based upon interspecies phylogenetic relationships. The phylogenetic organization supports rapid, user-guided interspecies comparison. To aid in navigation through large sequence datasets, Phylo-VISTA leverages concepts from VISTA that provide a user with the ability to select and view data at varying resolutions. The combination of multiresolution data visualization and analysis, combined with the phylogenetic framework for interspecies comparison, produces a highly flexible and powerful tool for visual data analysis of multiple sequence alignments. Availability: Phylo-VISTA is available at http://www-gsd.lbl.gov/phylovista. It requires an Internet browser with Java Plugin 1.4.2 and it is integrated into the global alignment program LAGAN at http://lagan.stanford.edu

  11. Predicting the binding preference of transcription factors to individual DNA k-mers.

    PubMed

    Alleyne, Trevis M; Peña-Castillo, Lourdes; Badis, Gwenael; Talukder, Shaheynoor; Berger, Michael F; Gehrke, Andrew R; Philippakis, Anthony A; Bulyk, Martha L; Morris, Quaid D; Hughes, Timothy R

    2009-04-15

    Recognition of specific DNA sequences is a central mechanism by which transcription factors (TFs) control gene expression. Many TF-binding preferences, however, are unknown or poorly characterized, in part due to the difficulty associated with determining their specificity experimentally, and an incomplete understanding of the mechanisms governing sequence specificity. New techniques that estimate the affinity of TFs to all possible k-mers provide a new opportunity to study DNA-protein interaction mechanisms, and may facilitate inference of binding preferences for members of a given TF family when such information is available for other family members. We employed a new dataset consisting of the relative preferences of mouse homeodomains for all eight-base DNA sequences in order to ask how well we can predict the binding profiles of homeodomains when only their protein sequences are given. We evaluated a panel of standard statistical inference techniques, as well as variations of the protein features considered. Nearest neighbour among functionally important residues emerged among the most effective methods. Our results underscore the complexity of TF-DNA recognition, and suggest a rational approach for future analyses of TF families.

  12. Scanning sequences after Gibbs sampling to find multiple occurrences of functional elements

    PubMed Central

    Tharakaraman, Kannan; Mariño-Ramírez, Leonardo; Sheetlin, Sergey L; Landsman, David; Spouge, John L

    2006-01-01

    Background: Many DNA regulatory elements occur as multiple instances within a target promoter. Gibbs sampling programs for finding DNA regulatory elements de novo can be prohibitively slow in locating all instances of such an element in a sequence set. Results: We describe an improvement to the A-GLAM computer program, which predicts regulatory elements within DNA sequences with Gibbs sampling. The improvement adds an optional "scanning step" after Gibbs sampling. Gibbs sampling produces a position specific scoring matrix (PSSM). The new scanning step resembles an iterative PSI-BLAST search based on the PSSM. First, it assigns an "individual score" to each subsequence of appropriate length within the input sequences using the initial PSSM. Second, it computes an E-value from each individual score, to assess the agreement between the corresponding subsequence and the PSSM. Third, it permits subsequences with E-values falling below a threshold to contribute to the underlying PSSM, which is then updated using the Bayesian calculus. A-GLAM iterates its scanning step to convergence, at which point no new subsequences contribute to the PSSM. After convergence, A-GLAM reports predicted regulatory elements within each sequence in order of increasing E-values, so users have a statistical evaluation of the predicted elements in a convenient presentation. Thus, although the Gibbs sampling step in A-GLAM finds at most one regulatory element per input sequence, the scanning step can now rapidly locate further instances of the element in each sequence. Conclusion: Datasets from experiments determining the binding sites of transcription factors were used to evaluate the improvement to A-GLAM. Typically, the datasets included several sequences containing multiple instances of a regulatory motif. The improvements to A-GLAM permitted it to predict the multiple instances. PMID:16961919
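
    The iterate-to-convergence structure of the scanning step can be sketched in simplified form. This is not A-GLAM's code: a raw log-odds score threshold stands in for the E-value test, a pseudocount-based recount stands in for the Bayesian update, and all names, parameters, and example sequences are hypothetical. Input sequences are assumed to contain only A, C, G and T.

```python
import math

ALPHABET = "ACGT"


def build_pssm(sites, pseudocount=1.0, background=0.25):
    """Build a log-odds PSSM from aligned sites of equal width."""
    total = len(sites) + pseudocount * len(ALPHABET)
    pssm = []
    for pos in range(len(sites[0])):
        column = [site[pos] for site in sites]
        pssm.append({
            base: math.log2((column.count(base) + pseudocount) / total / background)
            for base in ALPHABET
        })
    return pssm


def score(pssm, window):
    """Individual score of one subsequence against the PSSM."""
    return sum(col[base] for col, base in zip(pssm, window))


def scan_to_convergence(sequences, seed_hits, width, threshold, max_rounds=20):
    """Repeat scan-and-update until no new (sequence, start) hits are added."""
    hits = set(seed_hits)
    for _ in range(max_rounds):
        pssm = build_pssm([sequences[s][i:i + width] for s, i in sorted(hits)])
        found = {
            (s, i)
            for s, seq in enumerate(sequences)
            for i in range(len(seq) - width + 1)
            if score(pssm, seq[i:i + width]) >= threshold
        }
        if found <= hits:
            break  # converged: nothing new contributes to the PSSM
        hits |= found
    return sorted(hits)


if __name__ == "__main__":
    seqs = ["GGACGTGG", "CCACGTCCACGTCC"]
    # One seed site per sequence, as a Gibbs sampling step might supply.
    print(scan_to_convergence(seqs, [(0, 2), (1, 2)], width=4, threshold=3.5))
```

    In this toy run the second sequence contains two copies of the motif; the seed covers only one, and the scan recovers the other.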

  13. Privacy-preserving microbiome analysis using secure computation.

    PubMed

    Wagner, Justin; Paulson, Joseph N; Wang, Xiao; Bhattacharjee, Bobby; Corrada Bravo, Héctor

    2016-06-15

    Developing targeted therapeutics and identifying biomarkers relies on large amounts of research participant data. Beyond human DNA, scientists now investigate the DNA of micro-organisms inhabiting the human body. Recent work shows that an individual's collection of microbial DNA consistently identifies that person and could be used to link a real-world identity to a sensitive attribute in a research dataset. Unfortunately, the current suite of DNA-specific privacy-preserving analysis tools does not meet the requirements for microbiome sequencing studies. To address privacy concerns around microbiome sequencing, we implement metagenomic analyses using secure computation. Our implementation allows comparative analysis over combined data without revealing the feature counts for any individual sample. We focus on three analyses and perform an evaluation on datasets currently used by the microbiome research community. We use our implementation to simulate sharing data between four policy-domains. Additionally, we describe an application of our implementation in which patients combine their data, allowing drug developers to query the combined data and compensate patients for the analysis. The software is freely available for download at: http://cbcb.umd.edu/~hcorrada/projects/secureseq.html Supplementary data are available at Bioinformatics online. Contact: hcorrada@umiacs.umd.edu. © The Author 2016. Published by Oxford University Press.

  14. ngs.plot: Quick mining and visualization of next-generation sequencing data by integrating genomic databases.

    PubMed

    Shen, Li; Shao, Ningyi; Liu, Xiaochuan; Nestler, Eric

    2014-04-15

    Understanding the relationship between the millions of functional DNA elements and their protein regulators, and how they work in conjunction to manifest diverse phenotypes, is key to advancing our understanding of the mammalian genome. Next-generation sequencing technology is now used widely to probe these protein-DNA interactions and to profile gene expression at a genome-wide scale. As the cost of DNA sequencing continues to fall, the interpretation of the ever increasing amount of data generated represents a considerable challenge. We have developed ngs.plot - a standalone program to visualize enrichment patterns of DNA-interacting proteins at functionally important regions based on next-generation sequencing data. We demonstrate that ngs.plot is not only efficient but also scalable. We use a few examples to demonstrate that ngs.plot is easy to use and yet very powerful to generate figures that are publication ready. We conclude that ngs.plot is a useful tool to help fill the gap between massive datasets and genomic information in this era of big sequencing data.

  15. ngs.plot: Quick mining and visualization of next-generation sequencing data by integrating genomic databases

    PubMed Central

    2014-01-01

    Background: Understanding the relationship between the millions of functional DNA elements and their protein regulators, and how they work in conjunction to manifest diverse phenotypes, is key to advancing our understanding of the mammalian genome. Next-generation sequencing technology is now used widely to probe these protein-DNA interactions and to profile gene expression at a genome-wide scale. As the cost of DNA sequencing continues to fall, the interpretation of the ever increasing amount of data generated represents a considerable challenge. Results: We have developed ngs.plot – a standalone program to visualize enrichment patterns of DNA-interacting proteins at functionally important regions based on next-generation sequencing data. We demonstrate that ngs.plot is not only efficient but also scalable. We use a few examples to demonstrate that ngs.plot is easy to use and yet very powerful to generate figures that are publication ready. Conclusions: We conclude that ngs.plot is a useful tool to help fill the gap between massive datasets and genomic information in this era of big sequencing data. PMID:24735413

  16. Enhancer Linking by Methylation/Expression Relationships (ELMER) | Informatics Technology for Cancer Research (ITCR)

    Cancer.gov

    R tool for analysis of DNA methylation and expression datasets. Integrative analysis allows reconstruction of in vivo transcription factor networks altered in cancer along with identification of the underlying gene regulatory sequences.

  17. Base-Calling Algorithm with Vocabulary (BCV) Method for Analyzing Population Sequencing Chromatograms

    PubMed Central

    Fantin, Yuri S.; Neverov, Alexey D.; Favorov, Alexander V.; Alvarez-Figueroa, Maria V.; Braslavskaya, Svetlana I.; Gordukova, Maria A.; Karandashova, Inga V.; Kuleshov, Konstantin V.; Myznikova, Anna I.; Polishchuk, Maya S.; Reshetov, Denis A.; Voiciehovskaya, Yana A.; Mironov, Andrei A.; Chulanov, Vladimir P.

    2013-01-01

    Sanger sequencing is a common method of reading DNA sequences. It is less expensive than high-throughput methods, and it is appropriate for numerous applications including molecular diagnostics. However, sequencing mixtures of similar DNA of pathogens with this method is challenging. This is important because most clinical samples contain such mixtures, rather than pure single strains. The traditional solution is to sequence selected clones of PCR products, a complicated, time-consuming, and expensive procedure. Here, we propose the base-calling with vocabulary (BCV) method that computationally deciphers Sanger chromatograms obtained from mixed DNA samples. The inputs to the BCV algorithm are a chromatogram and a dictionary of sequences that are similar to those we expect to obtain. We apply the base-calling function on a test dataset of chromatograms without ambiguous positions, as well as one with 3–14% sequence degeneracy. Furthermore, we use BCV to assemble a consensus sequence for an HIV genome fragment in a sample containing a mixture of viral DNA variants and to determine the positions of the indels. Finally, we detect drug-resistant Mycobacterium tuberculosis strains carrying frameshift mutations mixed with wild-type bacteria in the pncA gene, and roughly characterize bacterial communities in clinical samples by direct 16S rRNA sequencing. PMID:23382983

  18. An expanded mammal mitogenome dataset from Southeast Asia

    PubMed Central

    Ramos-Madrigal, Jazmín; Peñaloza, Fernando; Liu, Shanlin; Mikkel-Holger, S. Sinding; Riddhi, P. Patel; Martins, Renata; Lenz, Dorina; Fickel, Jörns; Roos, Christian; Shamsir, Mohd Shahir; Azman, Mohammad Shahfiz; Burton, K. Lim; Stephen, J. Rossiter; Wilting, Andreas

    2017-01-01

    Southeast (SE) Asia is one of the most biodiverse regions in the world, and it holds approximately 20% of all mammal species. Despite this, the majority of SE Asia's genetic diversity is still poorly characterized. The growing interest in using environmental DNA to assess and monitor SE Asian species, in particular threatened mammals, has created the urgent need to expand the available reference database of mitochondrial barcode and complete mitogenome sequences. We have partially addressed this need by generating 72 new mitogenome sequences reconstructed from DNA isolated from a range of historical and modern tissue samples. Approximately 55 gigabases of raw sequence were generated. From this data, we assembled 72 complete mitogenome sequences, with an average depth of coverage of ×102.9 and ×55.2 for modern samples and historical samples, respectively. This dataset represents 52 species, of which 30 species had no previous mitogenome data available. The mitogenomes were geotagged to their sampling location, where known, to display a detailed geographical distribution of the species. Our new database of 52 taxa will strongly enhance the utility of environmental DNA approaches for monitoring mammals in SE Asia as it greatly increases the likelihood that metabarcoding sequencing reads can be assigned to reference sequences. This magnifies the confidence in species detections and thus allows more robust surveys and monitoring programmes of SE Asia's threatened mammal biodiversity. The extensive collections of historical samples from SE Asia in western and SE Asian museums should serve as additional valuable material to further enrich this reference database. PMID:28873965

  19. An expanded mammal mitogenome dataset from Southeast Asia.

    PubMed

    Mohd Salleh, Faezah; Ramos-Madrigal, Jazmín; Peñaloza, Fernando; Liu, Shanlin; Mikkel-Holger, S Sinding; Riddhi, P Patel; Martins, Renata; Lenz, Dorina; Fickel, Jörns; Roos, Christian; Shamsir, Mohd Shahir; Azman, Mohammad Shahfiz; Burton, K Lim; Stephen, J Rossiter; Wilting, Andreas; Gilbert, M Thomas P

    2017-08-01

    Southeast (SE) Asia is one of the most biodiverse regions in the world, and it holds approximately 20% of all mammal species. Despite this, the majority of SE Asia's genetic diversity is still poorly characterized. The growing interest in using environmental DNA to assess and monitor SE Asian species, in particular threatened mammals, has created the urgent need to expand the available reference database of mitochondrial barcode and complete mitogenome sequences. We have partially addressed this need by generating 72 new mitogenome sequences reconstructed from DNA isolated from a range of historical and modern tissue samples. Approximately 55 gigabases of raw sequence were generated. From this data, we assembled 72 complete mitogenome sequences, with an average depth of coverage of ×102.9 and ×55.2 for modern samples and historical samples, respectively. This dataset represents 52 species, of which 30 species had no previous mitogenome data available. The mitogenomes were geotagged to their sampling location, where known, to display a detailed geographical distribution of the species. Our new database of 52 taxa will strongly enhance the utility of environmental DNA approaches for monitoring mammals in SE Asia as it greatly increases the likelihood that metabarcoding sequencing reads can be assigned to reference sequences. This magnifies the confidence in species detections and thus allows more robust surveys and monitoring programmes of SE Asia's threatened mammal biodiversity. The extensive collections of historical samples from SE Asia in western and SE Asian museums should serve as additional valuable material to further enrich this reference database. © The Author 2017. Published by Oxford University Press.

  20. Intrinsic flexibility of B-DNA: the experimental TRX scale.

    PubMed

    Heddi, Brahim; Oguey, Christophe; Lavelle, Christophe; Foloppe, Nicolas; Hartmann, Brigitte

    2010-01-01

    B-DNA flexibility, crucial for DNA-protein recognition, is sequence dependent. Free DNA in solution would in principle be the best reference state to uncover the relation between base sequences and their intrinsic flexibility; however, this has long been hampered by a lack of suitable experimental data. We investigated this relationship by compiling and analyzing a large dataset of NMR ³¹P chemical shifts in solution. These measurements reflect the BI ↔ BII equilibrium in DNA, intimately correlated to helicoidal descriptors of the curvature, winding and groove dimensions. Comparing the ten complementary DNA dinucleotide steps indicates that some steps are much more flexible than others. This malleability is primarily controlled at the dinucleotide level, modulated by the tetranucleotide environment. Our analyses provide an experimental scale called TRX that quantifies the intrinsic flexibility of the ten dinucleotide steps in terms of Twist, Roll, and X-disp (base pair displacement). Applying the TRX scale to DNA sequences optimized for nucleosome formation reveals a 10 base-pair periodic alternation of stiff and flexible regions. Thus, DNA flexibility captured by the TRX scale is relevant to nucleosome formation, suggesting that this scale may be of general interest to better understand protein-DNA recognition.

  1. A DNA Barcode Library for Korean Chironomidae (Insecta: Diptera) and Indexes for Defining Barcode Gap

    PubMed Central

    Kim, Sungmin; Song, Kyo-Hong; Ree, Han-Il; Kim, Won

    2012-01-01

    Non-biting midges (Diptera: Chironomidae) are a diverse population that commonly causes respiratory allergies in humans. Chironomid larvae can be used to indicate freshwater pollution, but accurate identification on the basis of morphological characteristics is difficult. In this study, we constructed a mitochondrial cytochrome c oxidase subunit I (COI)-based DNA barcode library for Korean chironomids. This library consists of 211 specimens from 49 species, including adults and unidentified larvae. The interspecies and intraspecies COI sequence variations were analyzed. Sophisticated indexes were developed in order to properly evaluate indistinct barcode gaps that are created by insufficient sampling on both the interspecies and intraspecies levels and by variable mutation rates across taxa. In a variety of insect datasets, these indexes were useful for re-evaluating large barcode datasets and for defining COI barcode gaps. The COI-based DNA barcode library will provide a rapid and reliable tool for the molecular identification of Korean chironomid species. Furthermore, this reverse-taxonomic approach will be improved by the continuous addition of other species' sequences to the library. PMID:22138764
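
    The barcode gap mentioned in this record can be illustrated with a minimal sketch. It assumes aligned COI sequences of equal length and uncorrected p-distances, and it computes only the classic gap (smallest interspecific distance minus largest intraspecific distance). The function names are hypothetical, and the more sophisticated indexes developed in the study are not reproduced here.

```python
from itertools import combinations


def p_distance(a, b):
    """Proportion of differing sites between two aligned sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b)) / len(a)


def barcode_gap(records):
    """Classic barcode gap for (species_label, aligned_sequence) pairs.

    Returns (max intraspecific distance, min interspecific distance, gap).
    Assumption: gaps and ambiguous bases are treated like any other
    character, which a real analysis would handle more carefully.
    """
    intra, inter = [], []
    for (sp1, s1), (sp2, s2) in combinations(records, 2):
        (intra if sp1 == sp2 else inter).append(p_distance(s1, s2))
    if not intra or not inter:
        raise ValueError("need at least one within- and one between-species pair")
    max_intra, min_inter = max(intra), min(inter)
    return max_intra, min_inter, min_inter - max_intra
```

    A positive gap suggests the species in the toy input are separable by distance alone; a gap near zero or below it corresponds to the indistinct barcode gaps the abstract describes.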

  2. IMG/M: integrated genome and metagenome comparative data analysis system

    DOE PAGES

    Chen, I-Min A.; Markowitz, Victor M.; Chu, Ken; ...

    2016-10-13

    The Integrated Microbial Genomes with Microbiome Samples (IMG/M: https://img.jgi.doe.gov/m/) system contains annotated DNA and RNA sequence data of (i) archaeal, bacterial, eukaryotic and viral genomes from cultured organisms, (ii) single cell genomes (SCG) and genomes from metagenomes (GFM) from uncultured archaea, bacteria and viruses and (iii) metagenomes from environmental, host associated and engineered microbiome samples. Sequence data are generated by DOE's Joint Genome Institute (JGI), submitted by individual scientists, or collected from public sequence data archives. Structural and functional annotation is carried out by JGI's genome and metagenome annotation pipelines. A variety of analytical and visualization tools provide support for examining and comparing IMG/M's datasets. IMG/M allows open access interactive analysis of publicly available datasets, while manual curation, submission and access to private datasets and computationally intensive workspace-based analysis require login/password access to its expert review (ER) companion system (IMG/M ER: https://img.jgi.doe.gov/mer/). Since the last report published in the 2014 NAR Database Issue, IMG/M's dataset content has tripled in terms of number of datasets and overall protein coding genes, while its analysis tools have been extended to cope with the rapid growth in the number and size of datasets handled by the system.

  3. IMG/M: integrated genome and metagenome comparative data analysis system

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Chen, I-Min A.; Markowitz, Victor M.; Chu, Ken

    The Integrated Microbial Genomes with Microbiome Samples (IMG/M: https://img.jgi.doe.gov/m/) system contains annotated DNA and RNA sequence data of (i) archaeal, bacterial, eukaryotic and viral genomes from cultured organisms, (ii) single cell genomes (SCG) and genomes from metagenomes (GFM) from uncultured archaea, bacteria and viruses and (iii) metagenomes from environmental, host associated and engineered microbiome samples. Sequence data are generated by DOE's Joint Genome Institute (JGI), submitted by individual scientists, or collected from public sequence data archives. Structural and functional annotation is carried out by JGI's genome and metagenome annotation pipelines. A variety of analytical and visualization tools provide support for examining and comparing IMG/M's datasets. IMG/M allows open access interactive analysis of publicly available datasets, while manual curation, submission and access to private datasets and computationally intensive workspace-based analysis require login/password access to its expert review (ER) companion system (IMG/M ER: https://img.jgi.doe.gov/mer/). Since the last report published in the 2014 NAR Database Issue, IMG/M's dataset content has tripled in terms of number of datasets and overall protein coding genes, while its analysis tools have been extended to cope with the rapid growth in the number and size of datasets handled by the system.

  4. IMG/M: integrated genome and metagenome comparative data analysis system

    PubMed Central

    Chen, I-Min A.; Markowitz, Victor M.; Chu, Ken; Palaniappan, Krishna; Szeto, Ernest; Pillay, Manoj; Ratner, Anna; Huang, Jinghua; Andersen, Evan; Huntemann, Marcel; Varghese, Neha; Hadjithomas, Michalis; Tennessen, Kristin; Nielsen, Torben; Ivanova, Natalia N.; Kyrpides, Nikos C.

    2017-01-01

    The Integrated Microbial Genomes with Microbiome Samples (IMG/M: https://img.jgi.doe.gov/m/) system contains annotated DNA and RNA sequence data of (i) archaeal, bacterial, eukaryotic and viral genomes from cultured organisms, (ii) single cell genomes (SCG) and genomes from metagenomes (GFM) from uncultured archaea, bacteria and viruses and (iii) metagenomes from environmental, host associated and engineered microbiome samples. Sequence data are generated by DOE's Joint Genome Institute (JGI), submitted by individual scientists, or collected from public sequence data archives. Structural and functional annotation is carried out by JGI's genome and metagenome annotation pipelines. A variety of analytical and visualization tools provide support for examining and comparing IMG/M's datasets. IMG/M allows open access interactive analysis of publicly available datasets, while manual curation, submission and access to private datasets and computationally intensive workspace-based analysis require login/password access to its expert review (ER) companion system (IMG/M ER: https://img.jgi.doe.gov/mer/). Since the last report published in the 2014 NAR Database Issue, IMG/M's dataset content has tripled in terms of number of datasets and overall protein coding genes, while its analysis tools have been extended to cope with the rapid growth in the number and size of datasets handled by the system. PMID:27738135

  5. DNA sequence+shape kernel enables alignment-free modeling of transcription factor binding.

    PubMed

    Ma, Wenxiu; Yang, Lin; Rohs, Remo; Noble, William Stafford

    2017-10-01

    Transcription factors (TFs) bind to specific DNA sequence motifs. Several lines of evidence suggest that TF-DNA binding is mediated in part by properties of the local DNA shape: the width of the minor groove, the relative orientations of adjacent base pairs, etc. Several methods have been developed to jointly account for DNA sequence and shape properties in predicting TF binding affinity. However, a limitation of these methods is that they typically require a training set of aligned TF binding sites. We describe a sequence + shape kernel that leverages DNA sequence and shape information to better understand protein-DNA binding preference and affinity. This kernel extends an existing class of k-mer based sequence kernels, based on the recently described di-mismatch kernel. Using three in vitro benchmark datasets, derived from universal protein binding microarrays (uPBMs), genomic context PBMs (gcPBMs) and SELEX-seq data, we demonstrate that incorporating DNA shape information improves our ability to predict protein-DNA binding affinity. In particular, we observe that (i) the k-spectrum + shape model performs better than the classical k-spectrum kernel, particularly for small k values; (ii) the di-mismatch kernel performs better than the k-mer kernel, for larger k; and (iii) the di-mismatch + shape kernel performs better than the di-mismatch kernel for intermediate k values. The software is available at https://bitbucket.org/wenxiu/sequence-shape.git. Contact: rohs@usc.edu or william-noble@uw.edu. Supplementary data are available at Bioinformatics online. © The Author(s) 2017. Published by Oxford University Press.
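
    As a rough illustration of the classical k-spectrum kernel that this record uses as its baseline, the following sketch counts overlapping k-mers in two sequences and takes the inner product of the count vectors. It is an assumption-laden toy: the function names are hypothetical, and neither the shape features nor the di-mismatch kernel from the paper are implemented.

```python
from collections import Counter
from math import sqrt


def kmer_spectrum(seq, k):
    """Count every overlapping k-mer in seq (case-insensitive)."""
    seq = seq.upper()
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def spectrum_kernel(a, b, k, normalize=True):
    """Inner product of the k-mer count vectors of two sequences.

    With normalize=True the value is the cosine similarity of the two
    spectra. Sequences need not be aligned or of equal length.
    """
    sa, sb = kmer_spectrum(a, k), kmer_spectrum(b, k)
    # Counter returns 0 for k-mers absent from sb, so no key check is needed.
    dot = sum(count * sb[kmer] for kmer, count in sa.items())
    if not normalize:
        return dot
    norm_a = sqrt(sum(c * c for c in sa.values()))
    norm_b = sqrt(sum(c * c for c in sb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0
```

    Because the kernel only compares k-mer counts, it needs no alignment of binding sites, which is the property the abstract highlights.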

  6. Exploring Genetic Divergence in a Species-Rich Insect Genus Using 2790 DNA Barcodes

    PubMed Central

    Lin, Xiaolong; Stur, Elisabeth; Ekrem, Torbjørn

    2015-01-01

    DNA barcoding using a fragment of the mitochondrial cytochrome c oxidase subunit 1 gene (COI) has proven to be successful for species-level identification in many animal groups. However, most studies have been focused on relatively small datasets or on large datasets of taxonomically high-ranked groups. We explore the quality of DNA barcodes to delimit species in the diverse chironomid genus Tanytarsus (Diptera: Chironomidae) by using different analytical tools. The genus Tanytarsus is the most species-rich taxon of tribe Tanytarsini (Diptera: Chironomidae) with more than 400 species worldwide, some of which can be notoriously difficult to identify to species-level using morphology. Our dataset, based on sequences generated from our own material and publicly available data in BOLD, consists of 2790 DNA barcodes with a fragment length of at least 500 base pairs. A neighbor joining tree of this dataset comprises 131 well separated clusters representing 121 morphological species of Tanytarsus: 77 named, 16 unnamed and 28 unidentified theoretical species. For our geographically widespread dataset, DNA barcodes unambiguously discriminate 94.6% of the Tanytarsus species recognized through prior morphological study. Deep intraspecific divergences exist in some species complexes, and need further taxonomic studies using appropriate nuclear markers as well as morphological and ecological data to be resolved. The DNA barcodes cluster into 120–242 molecular operational taxonomic units (OTUs) depending on whether Objective Clustering, Automatic Barcode Gap Discovery (ABGD), Generalized Mixed Yule Coalescent model (GMYC), Poisson Tree Process (PTP), subjective evaluation of the neighbor joining tree or Barcode Index Numbers (BINs) are used. We suggest that a 4–5% threshold is appropriate to delineate species of Tanytarsus non-biting midges. PMID:26406595

  7. Comparison of whole-genome bisulfite sequencing library preparation strategies identifies sources of biases affecting DNA methylation data.

    PubMed

    Olova, Nelly; Krueger, Felix; Andrews, Simon; Oxley, David; Berrens, Rebecca V; Branco, Miguel R; Reik, Wolf

    2018-03-15

    Whole-genome bisulfite sequencing (WGBS) is becoming an increasingly accessible technique, used widely for both fundamental and disease-oriented research. Library preparation methods benefit from a variety of available kits, polymerases and bisulfite conversion protocols. Although some steps in the procedure, such as PCR amplification, are known to introduce biases, a systematic evaluation of biases in WGBS strategies is missing. We perform a comparative analysis of several commonly used pre- and post-bisulfite WGBS library preparation protocols for their performance and quality of sequencing outputs. Our results show that bisulfite conversion per se is the main trigger of pronounced sequencing biases, and PCR amplification builds on these underlying artefacts. The majority of standard library preparation methods yield a significantly biased sequence output and overestimate global methylation. Importantly, both absolute and relative methylation levels at specific genomic regions vary substantially between methods, with clear implications for DNA methylation studies. We show that amplification-free library preparation is the least biased approach for WGBS. In protocols with amplification, the choice of bisulfite conversion protocol or polymerase can significantly minimize artefacts. To aid with the quality assessment of existing WGBS datasets, we have integrated a bias diagnostic tool in the Bismark package and offer several approaches for consideration during the preparation and analysis of WGBS datasets.

  8. A k-mer-based barcode DNA classification methodology based on spectral representation and a neural gas network.

    PubMed

    Fiannaca, Antonino; La Rosa, Massimo; Rizzo, Riccardo; Urso, Alfonso

    2015-07-01

    In this paper, an alignment-free method for DNA barcode classification that is based on both a spectral representation and a neural gas network for unsupervised clustering is proposed. In the proposed methodology, distinctive words are identified from a spectral representation of DNA sequences. A taxonomic classification of the DNA sequence is then performed using the sequence signature, i.e., the smallest set of k-mers that can assign a DNA sequence to its proper taxonomic category. Experiments were then performed to compare our method with other supervised machine learning classification algorithms, such as support vector machine, random forest, ripper, naïve Bayes, ridor, and classification tree, which also consider short DNA sequence fragments of 200 and 300 base pairs (bp). The experimental tests were conducted over 10 real barcode datasets belonging to different animal species, which were provided by the on-line resource "Barcode of Life Database". The experimental results showed that our k-mer-based approach is directly comparable, in terms of accuracy, recall and precision metrics, with the other classifiers when considering full-length sequences. In addition, we demonstrate the robustness of our method when a classification task is performed with a set of short DNA sequences that were randomly extracted from the original data. For example, the proposed method can reach an accuracy of 64.8% at the species level with 200-bp fragments. Under the same conditions, the best other classifier (random forest) reaches an accuracy of 20.9%. Our results indicate that we obtained a clear improvement over the other classifiers for the study of short DNA barcode sequence fragments. Copyright © 2015 Elsevier B.V. All rights reserved.

  9. Factors That Affect Large Subunit Ribosomal DNA Amplicon Sequencing Studies of Fungal Communities: Classification Method, Primer Choice, and Error

    PubMed Central

    Porter, Teresita M.; Golding, G. Brian

    2012-01-01

    Nuclear large subunit ribosomal DNA is widely used in fungal phylogenetics and, to an increasing extent, in amplicon-based environmental sequencing. The relatively short reads produced by next-generation sequencing, however, make primer choice and sequence error important variables for obtaining accurate taxonomic classifications. In this simulation study we tested the performance of three classification methods: 1) a similarity-based method (BLAST + Metagenomic Analyzer, MEGAN); 2) a composition-based method (Ribosomal Database Project naïve Bayesian classifier, NBC); and, 3) a phylogeny-based method (Statistical Assignment Package, SAP). We also tested the effects of sequence length, primer choice, and sequence error on classification accuracy and perceived community composition. Using a leave-one-out cross validation approach, results for classifications to the genus rank were as follows: BLAST + MEGAN had the lowest error rate and was particularly robust to sequence error; SAP accuracy was highest when long LSU query sequences were classified; and, NBC runs significantly faster than the other tested methods. All methods performed poorly with the shortest 50–100 bp sequences. Increasing simulated sequence error reduced classification accuracy. Community shifts were detected due to sequence error and primer selection even though there was no change in the underlying community composition. Short read datasets from individual primers, as well as pooled datasets, appear to only approximate the true community composition. We hope this work informs investigators of some of the factors that affect the quality and interpretation of their environmental gene surveys. PMID:22558215

  10. ESTuber db: an online database for Tuber borchii EST sequences.

    PubMed

    Lazzari, Barbara; Caprera, Andrea; Cosentino, Cristian; Stella, Alessandra; Milanesi, Luciano; Viotti, Angelo

    2007-03-08

    The ESTuber database (http://www.itb.cnr.it/estuber) includes 3,271 Tuber borchii expressed sequence tags (EST). The dataset consists of 2,389 sequences from an in-house prepared cDNA library from truffle vegetative hyphae, and 882 sequences downloaded from GenBank and representing four libraries from white truffle mycelia and ascocarps at different developmental stages. An automated pipeline was prepared to process EST sequences using public software integrated by in-house developed Perl scripts. Data were collected in a MySQL database, which can be queried via a php-based web interface. Sequences included in the ESTuber db were clustered and annotated against three databases: the GenBank nr database, the UniProtKB database and a third in-house prepared database of fungi genomic sequences. An algorithm was implemented to infer statistical classification among Gene Ontology categories from the ontology occurrences deduced from the annotation procedure against the UniProtKB database. Ontologies were also deduced from the annotation of more than 130,000 EST sequences from five filamentous fungi, for intra-species comparison purposes. Further analyses were performed on the ESTuber db dataset, including tandem repeats search and comparison of the putative protein dataset inferred from the EST sequences to the PROSITE database for protein patterns identification. All the analyses were performed both on the complete sequence dataset and on the contig consensus sequences generated by the EST assembly procedure. The resulting web site is a resource of data and links related to truffle expressed genes. The Sequence Report and Contig Report pages are the web interface core structures which, together with the Text search utility and the Blast utility, allow easy access to the data stored in the database.

  11. Cloud-based adaptive exon prediction for DNA analysis.

    PubMed

    Putluri, Srinivasareddy; Zia Ur Rahman, Md; Fathima, Shaik Yasmeen

    2018-02-01

    Cloud computing offers significant research and economic benefits to healthcare organisations. Cloud services provide a safe place for storing and managing large amounts of such sensitive data. Under the conventional flow of gene information, gene sequence laboratories send out raw and inferred information via the Internet to several sequence libraries. DNA sequence storage costs can be minimised by using cloud services. In this study, the authors put forward a novel genomic informatics system using Amazon Cloud Services, where genomic sequence information is stored and accessed for processing. Accurate identification of exon regions in a DNA sequence is a key task in bioinformatics, which helps in disease identification and drug design. The three-base periodicity property of exons forms the basis of all exon identification techniques. Adaptive signal processing techniques have been found to be promising in comparison with several other methods. Several adaptive exon predictors (AEPs) are developed using variable normalised least mean square and its maximum normalised variants to reduce computational complexity. Finally, performance evaluation of various AEPs is done based on measures such as sensitivity, specificity and precision using various standard genomic datasets taken from the National Center for Biotechnology Information genomic sequence database.
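
    The three-base periodicity property named in this record can be sketched with a plain discrete Fourier transform evaluated at frequency 1/3. This is only an illustration of the underlying signal property, not the adaptive (normalised least mean square) predictors developed in the study; the function name and the normalisation by sequence length are assumptions.

```python
import cmath


def period3_power(seq):
    """Spectral power at frequency 1/3, summed over the four bases.

    Each base defines a binary indicator sequence (1 where the base
    occurs, 0 elsewhere). Coding regions tend to show a peak at this
    frequency; the score here is divided by the sequence length so
    windows of different sizes are roughly comparable (an assumption
    of this sketch, not a standard from the source).
    """
    seq = seq.upper()
    n = len(seq)
    if n == 0:
        return 0.0
    total = 0.0
    for base in "ACGT":
        # DFT coefficient of the indicator sequence at k = n/3.
        coeff = sum(
            cmath.exp(-2j * cmath.pi * i / 3)
            for i, ch in enumerate(seq)
            if ch == base
        )
        total += abs(coeff) ** 2
    return total / n
```

    Applied over a sliding window along a genomic sequence, a score like this gives a crude exon indicator; a strictly period-3 string such as a repeated codon scores high, while a period-4 repeat scores near zero.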

  12. DnaSAM: Software to perform neutrality testing for large datasets with complex null models.

    PubMed

    Eckert, Andrew J; Liechty, John D; Tearse, Brandon R; Pande, Barnaly; Neale, David B

    2010-05-01

    Patterns of DNA sequence polymorphisms can be used to understand the processes of demography and adaptation within natural populations. High-throughput generation of DNA sequence data has historically been the bottleneck with respect to data processing and experimental inference. Advances in marker technologies have largely solved this problem. Currently, the limiting step is computational, with most molecular population genetic software allowing a gene-by-gene analysis through a graphical user interface. An easy-to-use analysis program that allows both high-throughput processing of multiple sequence alignments along with the flexibility to simulate data under complex demographic scenarios is currently lacking. We introduce a new program, named DnaSAM, which allows high-throughput estimation of DNA sequence diversity and neutrality statistics from experimental data along with the ability to test those statistics via Monte Carlo coalescent simulations. These simulations are conducted using the ms program, which is able to incorporate several genetic parameters (e.g. recombination) and demographic scenarios (e.g. population bottlenecks). The output is a set of diversity and neutrality statistics with associated probability values under a user-specified null model that are stored in an easy-to-manipulate text file. © 2009 Blackwell Publishing Ltd.
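
    To make "diversity statistics" concrete, here is a minimal sketch of two standard per-site estimators computed from a multiple sequence alignment: nucleotide diversity (pi, the mean pairwise difference) and Watterson's theta (based on the number of segregating sites). It is not DnaSAM's code, it performs no coalescent simulation, and it assumes an ungapped alignment of at least two equal-length sequences; all function names are hypothetical.

```python
from itertools import combinations


def segregating_sites(alignment):
    """Number of alignment columns with more than one distinct character."""
    return sum(1 for column in zip(*alignment) if len(set(column)) > 1)


def nucleotide_diversity(alignment):
    """Mean number of pairwise differences per site (pi)."""
    pairs = list(combinations(alignment, 2))
    diffs = sum(sum(x != y for x, y in zip(a, b)) for a, b in pairs)
    return diffs / len(pairs) / len(alignment[0])


def watterson_theta(alignment):
    """Watterson's estimator per site: S / a1 / L, with a1 = sum(1/i), i < n."""
    n = len(alignment)
    a1 = sum(1.0 / i for i in range(1, n))
    return segregating_sites(alignment) / a1 / len(alignment[0])
```

    Neutrality statistics such as Tajima's D are built from the difference between these two estimators, which is why programs in this area typically report them together.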

  13. A systematic molecular dynamics study of nearest-neighbor effects on base pair and base pair step conformations and fluctuations in B-DNA

    PubMed Central

    Lavery, Richard; Zakrzewska, Krystyna; Beveridge, David; Bishop, Thomas C.; Case, David A.; Cheatham, Thomas; Dixit, Surjit; Jayaram, B.; Lankas, Filip; Laughton, Charles; Maddocks, John H.; Michon, Alexis; Osman, Roman; Orozco, Modesto; Perez, Alberto; Singh, Tanya; Spackova, Nada; Sponer, Jiri

    2010-01-01

    It is well recognized that base sequence exerts a significant influence on the properties of DNA and plays a significant role in protein–DNA interactions vital for cellular processes. Understanding and predicting base sequence effects requires an extensive structural and dynamic dataset which is currently unavailable from experiment. A consortium of laboratories was consequently formed to obtain this information using molecular simulations. This article describes results providing information not only on all 10 unique base pair steps, but also on all possible nearest-neighbor effects on these steps. These results are derived from simulations of 50–100 ns on 39 different DNA oligomers in explicit solvent and using a physiological salt concentration. We demonstrate that the simulations are converged in terms of helical and backbone parameters. The results show that nearest-neighbor effects on base pair steps are very significant, implying that dinucleotide models are insufficient for predicting sequence-dependent behavior. Flanking base sequences can notably lead to base pair step parameters in dynamic equilibrium between two conformational sub-states. Although this study only provides limited data on next-nearest-neighbor effects, we suggest that such effects should be analyzed before attempting to predict the sequence-dependent behavior of DNA. PMID:19850719

  14. Best practices for mapping replication origins in eukaryotic chromosomes.

    PubMed

    Besnard, Emilie; Desprat, Romain; Ryan, Michael; Kahli, Malik; Aladjem, Mirit I; Lemaitre, Jean-Marc

    2014-09-02

    Understanding the regulatory principles ensuring complete DNA replication in each cell division is critical for deciphering the mechanisms that maintain genomic stability. Recent advances in genome sequencing technology facilitated complete mapping of DNA replication sites and helped move the field from observing replication patterns at a handful of single loci to analyzing replication patterns genome-wide. These advances address issues, such as the relationship between replication initiation events, transcription, and chromatin modifications, and identify potential replication origin consensus sequences. This unit summarizes the technological and fundamental aspects of replication profiling and briefly discusses novel insights emerging from mining large datasets, published in the last 3 years, and also describes DNA replication dynamics on a whole-genome scale. Copyright © 2014 John Wiley & Sons, Inc.

  15. Promoter classifier: software package for promoter database analysis.

    PubMed

    Gershenzon, Naum I; Ioshikhes, Ilya P

    2005-01-01

    Promoter Classifier is a package of seven stand-alone Windows-based C++ programs allowing the following basic manipulations with a set of promoter sequences: (i) calculation of positional distributions of nucleotides averaged over all promoters of the dataset; (ii) calculation of the averaged occurrence frequencies of the transcription factor binding sites and their combinations; (iii) division of the dataset into subsets of sequences containing or lacking certain promoter elements or combinations; (iv) extraction of the promoter subsets containing or lacking CpG islands around the transcription start site; and (v) calculation of spatial distributions of the promoter DNA stacking energy and bending stiffness. All programs have a user-friendly interface and provide the results in a convenient graphical form. The Promoter Classifier package is an effective tool for various basic manipulations with eukaryotic promoter sequences that usually are necessary for analysis of large promoter datasets. The program Promoter Divider is described in more detail as a representative component of the package.

  16. CMSA: a heterogeneous CPU/GPU computing system for multiple similar RNA/DNA sequence alignment.

    PubMed

    Chen, Xi; Wang, Chen; Tang, Shanjiang; Yu, Ce; Zou, Quan

    2017-06-24

    The multiple sequence alignment (MSA) is a classic and powerful technique for sequence analysis in bioinformatics. With the rapid growth of biological datasets, MSA parallelization becomes necessary to keep its running time at an acceptable level. Although there is a lot of work on MSA problems, existing approaches are either insufficient or contain some implicit assumptions that limit the generality of usage. First, the information of users' sequences, including the sizes of datasets and the lengths of sequences, can be of arbitrary values and is generally unknown before submission, which is unfortunately ignored by previous work. Second, the center star strategy is suited for aligning similar sequences, but its first stage, center sequence selection, is highly time-consuming and requires further optimization. Moreover, given the heterogeneous CPU/GPU platform, prior studies consider the MSA parallelization on GPU devices only, leaving the CPUs idle during the computation. Co-run computation, however, can maximize the utilization of the computing resources by enabling the workload computation on both CPU and GPU simultaneously. This paper presents CMSA, a robust and efficient MSA system for large-scale datasets on the heterogeneous CPU/GPU platform. It performs and optimizes multiple sequence alignment automatically for users' submitted sequences without any assumptions. CMSA adopts the co-run computation model so that both CPU and GPU devices are fully utilized. Moreover, CMSA proposes an improved center star strategy that reduces the time complexity of its center sequence selection process from O(mn^2) to O(mn). The experimental results show that CMSA achieves an up to 11× speedup and outperforms the state-of-the-art software. CMSA focuses on the multiple similar RNA/DNA sequence alignment and proposes a novel bitmap based algorithm to improve the center star strategy. We can conclude that harvesting the high performance of modern GPUs is a promising approach to accelerate multiple sequence alignment. Besides, adopting the co-run computation model can maximize the entire system utilization significantly. The source code is available at https://github.com/wangvsa/CMSA.
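
    To make the center sequence selection step concrete, here is a minimal sketch of the naive baseline: every sequence is compared with every other one, and the sequence with the smallest total distance is chosen as the center. This is not CMSA's bitmap-based algorithm, which the abstract does not describe in detail. Using plain edit distance as the pairwise measure, along with the function names and the example sequences, is an assumption made for illustration.

```python
def edit_distance(a, b):
    """Levenshtein distance via the standard two-row dynamic programme."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # match or substitution
        prev = curr
    return prev[-1]


def select_center(seqs):
    """Return (index, totals): the sequence with the smallest summed distance.

    This naive selection performs one pairwise comparison per pair of
    sequences, which is the cost an improved strategy would aim to avoid.
    """
    totals = [
        sum(edit_distance(s, t) for j, t in enumerate(seqs) if j != i)
        for i, s in enumerate(seqs)
    ]
    best = min(range(len(seqs)), key=totals.__getitem__)
    return best, totals


# Hypothetical input: four short, similar sequences.
reads = ["ACGT", "ACGA", "ACGT", "TCGT"]
center_index, distance_sums = select_center(reads)
print(center_index, distance_sums)
```

    In the center star strategy, the remaining sequences would then each be aligned against the selected center and the pairwise alignments merged; only the selection stage is sketched here.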

  17. Genovo: De Novo Assembly for Metagenomes

    NASA Astrophysics Data System (ADS)

    Laserson, Jonathan; Jojic, Vladimir; Koller, Daphne

    Next-generation sequencing technologies produce a large number of noisy reads from the DNA in a sample. Metagenomics and population sequencing aim to recover the genomic sequences of the species in the sample, which could be of high diversity. Methods geared towards single sequence reconstruction are not sensitive enough when applied in this setting. We introduce a generative probabilistic model of read generation from environmental samples and present Genovo, a novel de novo sequence assembler that discovers likely sequence reconstructions under the model. A Chinese restaurant process prior accounts for the unknown number of genomes in the sample. Inference is made by applying a series of hill-climbing steps iteratively until convergence. We compare the performance of Genovo to three other short read assembly programs across one synthetic dataset and eight metagenomic datasets created using the 454 platform, the largest of which has 311k reads. Genovo's reconstructions cover more bases and recover more genes than the other methods, and yield a higher assembly score.

  18. Sequence Data for Clostridium autoethanogenum using Three Generations of Sequencing Technologies

    DOE PAGES

    Utturkar, Sagar M.; Klingeman, Dawn Marie; Bruno-Barcena, José M.; ...

    2015-04-14

    During the past decade, DNA sequencing output has been mostly dominated by the second generation sequencing platforms, which are characterized by low cost, high throughput and shorter read lengths (for example, Illumina). The emergence and development of so-called third generation sequencing platforms such as PacBio has permitted exceptionally long reads (over 20 kb) to be generated. Due to read length increases, algorithm improvements and hybrid assembly approaches, the concept of one chromosome, one contig and automated finishing of microbial genomes is now a realistic and achievable task for many microbial laboratories. In this paper, we describe high quality sequence datasets which span three generations of sequencing technologies, containing six types of data from four NGS platforms and originating from a single microorganism, Clostridium autoethanogenum. The dataset reported here will be useful for the scientific community to evaluate upcoming NGS platforms, enabling comparison of existing and novel bioinformatics approaches and will encourage interest in the development of innovative experimental and computational methods for NGS data.

  19. A comprehensive characterization of mitochondrial DNA mutations in glioblastoma multiforme.

    PubMed

    Vidone, Michele; Clima, Rosanna; Santorsola, Mariangela; Calabrese, Claudia; Girolimetti, Giulia; Kurelac, Ivana; Amato, Laura Benedetta; Iommarini, Luisa; Trevisan, Elisa; Leone, Marco; Soffietti, Riccardo; Morra, Isabella; Faccani, Giuliano; Attimonelli, Marcella; Porcelli, Anna Maria; Gasparre, Giuseppe

    2015-06-01

    Glioblastoma multiforme (GBM) is the most malignant brain cancer in adults, with a poor prognosis, whose molecular stratification still represents a challenge in pathology and clinics. On the other hand, mitochondrial DNA (mtDNA) mutations have been found in most tumors as modifiers of the bioenergetics state, albeit in GBM a characterization of the mtDNA status is lacking to date. Here, a characterization of the burden of mtDNA mutations in GBM samples was performed. First, investigation of tumor-specific vs. non tumor-specific mutations was carried out with the MToolBox bioinformatics pipeline by analyzing 45 matched tumor/blood samples, from whole genome or whole exome sequencing datasets obtained from The Cancer Genome Atlas (TCGA) consortium. Additionally, the entire mtDNA sequence was obtained in a dataset of 104 fresh-frozen GBM samples. Mitochondrial mutations with potential pathogenic interest were prioritized based on heteroplasmic fraction, nucleotide variability, and in silico prediction of pathogenicity. A preliminary biochemical analysis of the activity of mitochondrial respiratory complexes was also performed on fresh-frozen GBM samples. Although a high number of mutations was detected, we report that the large majority of them does not pass the prioritization filters. Therefore, a relatively limited burden of pathogenic mutations is indeed carried by GBM, which did not appear to determine a general impairment of the respiratory chain. This article is part of a Directed Issue entitled: Energy Metabolism Disorders and Therapies. Copyright © 2015 Elsevier Ltd. All rights reserved.

  20. PWHATSHAP: efficient haplotyping for future generation sequencing.

    PubMed

    Bracciali, Andrea; Aldinucci, Marco; Patterson, Murray; Marschall, Tobias; Pisanti, Nadia; Merelli, Ivan; Torquati, Massimo

    2016-09-22

    Haplotype phasing is an important problem in the analysis of genomics information. Given a set of DNA fragments of an individual, it consists of determining which one of the possible alleles (alternative forms of a gene) each fragment comes from. Haplotype information is relevant to gene regulation, epigenetics, genome-wide association studies, evolutionary and population studies, and the study of mutations. Haplotyping is currently addressed as an optimisation problem aiming at solutions that minimise, for instance, error correction costs, where costs are a measure of the confidence in the accuracy of the information acquired from DNA sequencing. Solutions typically have an exponential computational complexity. WHATSHAP is a recent optimal approach which moves computational complexity from DNA fragment length to fragment overlap, i.e., coverage, and is hence of particular interest given current trends in sequencing technology, which is producing longer fragments. Given the potential relevance of efficient haplotyping in several analysis pipelines, we have designed and engineered PWHATSHAP, a parallel, high-performance version of WHATSHAP. PWHATSHAP is embedded in a toolkit developed in Python and supports genomics datasets in standard file formats. Building on WHATSHAP, PWHATSHAP exhibits the same complexity, exploring a number of possible solutions which is exponential in the coverage of the dataset. The parallel implementation on multi-core architectures allows for a relevant reduction of the execution time for haplotyping, while the provided results enjoy the same high accuracy as that provided by WHATSHAP, which increases with coverage. Due to its structure and management of the large datasets, the parallelisation of WHATSHAP posed demanding technical challenges, which have been addressed exploiting a high-level parallel programming framework. The result, PWHATSHAP, is a freely available toolkit that improves the efficiency of the analysis of genomics information.

  1. Fast and Sensitive Alignment of Microbial Whole Genome Sequencing Reads to Large Sequence Datasets on a Desktop PC: Application to Metagenomic Datasets and Pathogen Identification

    PubMed Central

    2014-01-01

    Next generation sequencing (NGS) of metagenomic samples is becoming a standard approach to detect individual species or pathogenic strains of microorganisms. Computer programs used in the NGS community have to balance between speed and sensitivity and as a result, species or strain level identification is often inaccurate and low abundance pathogens can sometimes be missed. We have developed Taxoner, an open source, taxon assignment pipeline that includes a fast aligner (e.g. Bowtie2) and a comprehensive DNA sequence database. We tested the program on simulated datasets as well as experimental data from Illumina, IonTorrent, and Roche 454 sequencing platforms. We found that Taxoner performs as well as, and often better than BLAST, but requires two orders of magnitude less running time, meaning that it can be run on desktop or laptop computers. Taxoner is slower than the approaches that use small marker databases but is more sensitive due to the comprehensive reference database. In addition, it can be easily tuned to specific applications using small tailored databases. When applied to metagenomic datasets, Taxoner can provide a functional summary of the genes mapped and can provide strain level identification. Taxoner is written in C for Linux operating systems. The code and documentation are available for research applications at http://code.google.com/p/taxoner. PMID:25077800

  2. Fast and sensitive alignment of microbial whole genome sequencing reads to large sequence datasets on a desktop PC: application to metagenomic datasets and pathogen identification.

    PubMed

    Pongor, Lőrinc S; Vera, Roberto; Ligeti, Balázs

    2014-01-01

    Next generation sequencing (NGS) of metagenomic samples is becoming a standard approach to detect individual species or pathogenic strains of microorganisms. Computer programs used in the NGS community have to balance between speed and sensitivity and as a result, species or strain level identification is often inaccurate and low abundance pathogens can sometimes be missed. We have developed Taxoner, an open source, taxon assignment pipeline that includes a fast aligner (e.g. Bowtie2) and a comprehensive DNA sequence database. We tested the program on simulated datasets as well as experimental data from Illumina, IonTorrent, and Roche 454 sequencing platforms. We found that Taxoner performs as well as, and often better than BLAST, but requires two orders of magnitude less running time, meaning that it can be run on desktop or laptop computers. Taxoner is slower than the approaches that use small marker databases but is more sensitive due to the comprehensive reference database. In addition, it can be easily tuned to specific applications using small tailored databases. When applied to metagenomic datasets, Taxoner can provide a functional summary of the genes mapped and can provide strain level identification. Taxoner is written in C for Linux operating systems. The code and documentation are available for research applications at http://code.google.com/p/taxoner.

  3. Using Classical Population Genetics Tools with Heterochroneous Data: Time Matters!

    PubMed Central

    Depaulis, Frantz; Orlando, Ludovic; Hänni, Catherine

    2009-01-01

    Background: New polymorphism datasets from heterochroneous data have arisen thanks to recent advances in experimental and microbial molecular evolution, and the sequencing of ancient DNA (aDNA). However, classical tools for population genetics analyses do not take into account heterochrony between subsets, despite potential bias on neutrality and population structure tests. Here, we characterize the extent of such possible biases using serial coalescent simulations. Methodology/Principal Findings: We first use a coalescent framework to generate datasets assuming no or different levels of heterochrony and contrast most classical population genetic statistics. We show that even weak levels of heterochrony (∼10% of the average depth of a standard population tree) affect the distribution of polymorphism substantially, leading to overestimate the level of polymorphism θ, to star like trees, with an excess of rare mutations and a deficit of linkage disequilibrium, which are the hallmark of e.g. population expansion (possibly after a drastic bottleneck). Substantial departures of the tests are detected in the opposite direction for more heterochroneous and equilibrated datasets, with balanced trees mimicking in particular population contraction, balancing selection, and population differentiation. We therefore introduce simple corrections to classical estimators of polymorphism and of the genetic distance between populations, in order to remove heterochrony-driven bias. Finally, we show that these effects do occur on real aDNA datasets, taking advantage of the currently available sequence data for Cave Bears (Ursus spelaeus), for which large mtDNA haplotypes have been reported over a substantial time period (22–130 thousand years ago (KYA)). Conclusions/Significance: Considering serial sampling changed the conclusion of several tests, indicating that neglecting heterochrony could provide significant support for false past history of populations and inappropriate conservation decisions. We therefore argue for systematically considering heterochroneous models when analyzing heterochroneous samples covering a large time scale. PMID:19440242

  4. A new method to cluster genomes based on cumulative Fourier power spectrum.

    PubMed

    Dong, Rui; Zhu, Ziyue; Yin, Changchuan; He, Rong L; Yau, Stephen S-T

    2018-06-20

    Analyzing phylogenetic relationships using mathematical methods has always been of importance in bioinformatics. Quantitative research may interpret the raw biological data in a precise way. Multiple Sequence Alignment (MSA) is used frequently to analyze biological evolution, but is very time-consuming. When the scale of data is large, alignment methods cannot finish calculation in reasonable time. Therefore, we present a new method using moments of cumulative Fourier power spectrum in clustering the DNA sequences. Each sequence is translated into a vector in Euclidean space. Distances between the vectors can reflect the relationships between sequences. The mapping between the spectra and moment vector is one-to-one, which means that no information is lost in the power spectra during the calculation. We cluster and classify several datasets including Influenza A, primates, and human rhinovirus (HRV) datasets to build up the phylogenetic trees. Results show that the newly proposed cumulative Fourier power spectrum method is much faster and more accurate than MSA and another alignment-free method known as k-mer. The research provides new insights into the study of phylogeny, evolution, and efficient DNA comparison algorithms for large genomes. The computer programs of the cumulative Fourier power spectrum are available at GitHub (https://github.com/YaulabTsinghua/cumulative-Fourier-power-spectrum). Copyright © 2018. Published by Elsevier B.V.

  5. Mitochondrial DNA heteroplasmy in the emerging field of massively parallel sequencing

    PubMed Central

    Just, Rebecca S.; Irwin, Jodi A.; Parson, Walther

    2015-01-01

    Long an important and useful tool in forensic genetic investigations, mitochondrial DNA (mtDNA) typing continues to mature. Research in the last few years has demonstrated both that data from the entire molecule will have practical benefits in forensic DNA casework, and that massively parallel sequencing (MPS) methods will make full mitochondrial genome (mtGenome) sequencing of forensic specimens feasible and cost-effective. A spate of recent studies has employed these new technologies to assess intraindividual mtDNA variation. However, in several instances, contamination and other sources of mixed mtDNA data have been erroneously identified as heteroplasmy. Well vetted mtGenome datasets based on both Sanger and MPS sequences have found authentic point heteroplasmy in approximately 25% of individuals when minor component detection thresholds are in the range of 10–20%, along with positional distribution patterns in the coding region that differ from patterns of point heteroplasmy in the well-studied control region. A few recent studies that examined very low-level heteroplasmy are concordant with these observations when the data are examined at a common level of resolution. In this review we provide an overview of considerations related to the use of MPS technologies to detect mtDNA heteroplasmy. In addition, we examine published reports on point heteroplasmy to characterize features of the data that will assist in the evaluation of future mtGenome data developed by any typing method. PMID:26009256

  6. Pooled assembly of marine metagenomic datasets: enriching annotation through chimerism.

    PubMed

    Magasin, Jonathan D; Gerloff, Dietlind L

    2015-02-01

    Despite advances in high-throughput sequencing, marine metagenomic samples remain largely opaque. A typical sample contains billions of microbial organisms from thousands of genomes and quadrillions of DNA base pairs. Its derived metagenomic dataset underrepresents this complexity by orders of magnitude because of the sparseness and shortness of sequencing reads. Read shortness and sequencing errors pose a major challenge to accurate species and functional annotation. This includes distinguishing known from novel species. Often the majority of reads cannot be annotated and thus cannot help our interpretation of the sample. Here, we demonstrate quantitatively how careful assembly of marine metagenomic reads within, but also across, datasets can alleviate this problem. For 10 simulated datasets, each with species complexity modeled on a real counterpart, chimerism remained within the same species for most contigs (97%). For 42 real pyrosequencing ('454') datasets, assembly increased the proportion of annotated reads, and even more so when datasets were pooled, by on average 1.6% (max 6.6%) for species, 9.0% (max 28.7%) for Pfam protein domains and 9.4% (max 22.9%) for PANTHER gene families. Our results outline exciting prospects for data sharing in the metagenomics community. While chimeric sequences should be avoided in other areas of metagenomics (e.g. biodiversity analyses), conservative pooled assembly is advantageous for annotation specificity and sensitivity. Intriguingly, our experiment also found potential prospects for (low-cost) discovery of new species in 'old' data. Contact: dgerloff@ffame.org. Supplementary data are available at Bioinformatics online. © The Author 2014. Published by Oxford University Press. All rights reserved. For Permissions, please e-mail: journals.permissions@oup.com.

  7. Simple and efficient identification of rare recessive pathologically important sequence variants from next generation exome sequence data.

    PubMed

    Carr, Ian M; Morgan, Joanne; Watson, Christopher; Melnik, Svitlana; Diggle, Christine P; Logan, Clare V; Harrison, Sally M; Taylor, Graham R; Pena, Sergio D J; Markham, Alexander F; Alkuraya, Fowzan S; Black, Graeme C M; Ali, Manir; Bonthron, David T

    2013-07-01

    Massively parallel ("next generation") DNA sequencing (NGS) has quickly become the method of choice for seeking pathogenic mutations in rare uncharacterized monogenic diseases. Typically, before DNA sequencing, protein-coding regions are enriched from patient genomic DNA, representing either the entire genome ("exome sequencing") or selected mapped candidate loci. Sequence variants, identified as differences between the patient's and the human genome reference sequences, are then filtered according to various quality parameters. Changes are screened against datasets of known polymorphisms, such as dbSNP and the 1000 Genomes Project, in the effort to narrow the list of candidate causative variants. An increasing number of commercial services now offer to both generate and align NGS data to a reference genome. This potentially allows small groups with limited computing infrastructure and informatics skills to utilize this technology. However, the capability to effectively filter and assess sequence variants is still an important bottleneck in the identification of deleterious sequence variants in both research and diagnostic settings. We have developed an approach to this problem comprising a user-friendly suite of programs that can interactively analyze, filter and screen data from enrichment-capture NGS data. These programs ("Agile Suite") are particularly suitable for small-scale gene discovery or for diagnostic analysis. © 2013 WILEY PERIODICALS, INC.

  8. DNA demethylation activates genes in seed maternal integument development in rice (Oryza sativa L.).

    PubMed

    Wang, Yifeng; Lin, Haiyan; Tong, Xiaohong; Hou, Yuxuan; Chang, Yuxiao; Zhang, Jian

    2017-11-01

    DNA methylation is an important epigenetic modification that regulates various plant developmental processes. Rice seed integument determines the seed size. However, the role of DNA methylation in its development remains largely unknown. Here, we report the first dynamic DNA methylomic profiling of rice maternal integument before and after pollination by using a whole-genome bisulfite deep sequencing approach. Analysis of DNA methylation patterns identified 4238 differentially methylated regions underpinning 4112 differentially methylated genes, including GW2, DEP1, RGB1 and numerous other regulators participating in maternal integument development. Bisulfite Sanger sequencing and qRT-PCR of six differentially methylated genes revealed extensive occurrence of DNA hypomethylation triggered by double fertilization at IAP compared with IBP, suggesting that DNA demethylation might be a key mechanism to activate numerous maternal controlling genes. The results presented here not only greatly expand the rice methylome dataset, but also shed novel insight into the regulatory roles of DNA methylation in rice seed maternal integument development. Copyright © 2017 Elsevier Masson SAS. All rights reserved.

  9. Sequence2Vec: a novel embedding approach for modeling transcription factor binding affinity landscape.

    PubMed

    Dai, Hanjun; Umarov, Ramzan; Kuwahara, Hiroyuki; Li, Yu; Song, Le; Gao, Xin

    2017-11-15

    An accurate characterization of transcription factor (TF)-DNA affinity landscape is crucial to a quantitative understanding of the molecular mechanisms underpinning endogenous gene regulation. While recent advances in biotechnology have brought the opportunity for building binding affinity prediction methods, the accurate characterization of TF-DNA binding affinity landscape still remains a challenging problem. Here we propose a novel sequence embedding approach for modeling the transcription factor binding affinity landscape. Our method represents DNA binding sequences as a hidden Markov model which captures both position specific information and long-range dependency in the sequence. A cornerstone of our method is a novel message passing-like embedding algorithm, called Sequence2Vec, which maps these hidden Markov models into a common nonlinear feature space and uses these embedded features to build a predictive model. Our method is a novel combination of the strength of probabilistic graphical models, feature space embedding and deep learning. We conducted comprehensive experiments on over 90 large-scale TF-DNA datasets which were measured by different high-throughput experimental technologies. Sequence2Vec outperforms alternative machine learning methods as well as the state-of-the-art binding affinity prediction methods. Our program is freely available at https://github.com/ramzan1990/sequence2vec. Contact: xin.gao@kaust.edu.sa or lsong@cc.gatech.edu. Supplementary data are available at Bioinformatics online. © The Author(s) 2017. Published by Oxford University Press.

  10. Computational optimisation of targeted DNA sequencing for cancer detection

    NASA Astrophysics Data System (ADS)

    Martinez, Pierre; McGranahan, Nicholas; Birkbak, Nicolai Juul; Gerlinger, Marco; Swanton, Charles

    2013-12-01

    Despite recent progress thanks to next-generation sequencing technologies, personalised cancer medicine is still hampered by intra-tumour heterogeneity and drug resistance. As most patients with advanced metastatic disease face poor survival, there is need to improve early diagnosis. Analysing circulating tumour DNA (ctDNA) might represent a non-invasive method to detect mutations in patients, facilitating early detection. In this article, we define reduced gene panels from publicly available datasets as a first step to assess and optimise the potential of targeted ctDNA scans for early tumour detection. Dividing 4,467 samples into one discovery and two independent validation cohorts, we show that up to 76% of 10 cancer types harbour at least one mutation in a panel of only 25 genes, with high sensitivity across most tumour types. Our analyses demonstrate that targeting "hotspot" regions would introduce biases towards in-frame mutations and would compromise the reproducibility of tumour detection.

  11. Cloud-based adaptive exon prediction for DNA analysis

    PubMed Central

    Putluri, Srinivasareddy; Fathima, Shaik Yasmeen

    2018-01-01

    Cloud computing offers significant research and economic benefits to healthcare organisations. Cloud services provide a safe place for storing and managing large amounts of such sensitive data. Under the conventional flow of gene information, gene sequence laboratories send out raw and inferred information via the Internet to several sequence libraries. DNA sequencing storage costs will be minimised by use of a cloud service. In this study, the authors put forward a novel genomic informatics system using Amazon Cloud Services, where genomic sequence information is stored and accessed for processing. True identification of exon regions in a DNA sequence is a key task in bioinformatics, which helps in disease identification and drug design. The three-base periodicity property of exons forms the basis of all exon identification techniques. Adaptive signal processing techniques are found to be promising in comparison with several other methods. Several adaptive exon predictors (AEPs) are developed using variable normalised least mean square and its maximum normalised variants to reduce computational complexity. Finally, performance evaluation of various AEPs is done based on measures such as sensitivity, specificity and precision using various standard genomic datasets taken from the National Center for Biotechnology Information genomic sequence database. PMID:29515813
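
    The three-base periodicity property mentioned above can be illustrated with a short sketch that measures spectral power at frequency 1/3 in the four binary indicator sequences of a DNA string. This is the plain Fourier measure, not the adaptive least-mean-square predictors developed in the study. The function names, window size, step and test strings are assumptions chosen for illustration.

```python
import cmath


def period3_power(seq):
    """Sum over A, C, G, T of |X_base(1/3)|^2, the power at period three.

    For each base, the indicator sequence is 1 where that base occurs and
    0 elsewhere; X_base(1/3) is its Fourier transform at frequency 1/3.
    """
    seq = seq.upper()
    power = 0.0
    for base in "ACGT":
        x = sum(cmath.exp(-2j * cmath.pi * n / 3)
                for n, ch in enumerate(seq) if ch == base)
        power += abs(x) ** 2
    return power


def sliding_period3(seq, window=90, step=3):
    """Return (start, power) pairs for windows sliding along the sequence."""
    return [(start, period3_power(seq[start:start + window]))
            for start in range(0, len(seq) - window + 1, step)]


# Hypothetical inputs: a perfectly codon-periodic string and a homopolymer.
periodic = "ATG" * 20
flat = "A" * 60
print(period3_power(periodic), period3_power(flat))
```

    Windows with high period-three power would be treated as exon candidates in a scheme of this kind; choosing the threshold and smoothing the signal are left out of the sketch.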

  12. Detecting the borders between coding and non-coding DNA regions in prokaryotes based on recursive segmentation and nucleotide doublets statistics

    PubMed Central

    2012-01-01

    Background: Detecting the borders between coding and non-coding regions is an essential step in genome annotation, and information entropy measures are useful for describing the signals in genome sequences. However, the accuracy of previous border-finding methods based on entropy segmentation still needs to be improved. Methods: In this study, we first applied a new recursive entropic segmentation method on DNA sequences to get preliminary significant cuts. A 22-symbol alphabet is used to capture the differential composition of nucleotide doublets and stop codon patterns along three phases in both DNA strands. This process requires no prior training datasets. Results: Compared with previous segmentation methods, the experimental results on three bacterial genomes, Rickettsia prowazekii, Borrelia burgdorferi and E. coli, show that our approach improves the accuracy for finding the borders between coding and non-coding regions in DNA sequences. Conclusions: This paper presents a new segmentation method in prokaryotes based on Jensen-Rényi divergence with a 22-symbol alphabet. For three bacterial genomes, compared to the A12_JR method, our method raised the accuracy of finding the borders between protein coding and non-coding regions in DNA sequences. PMID:23282225

  13. The Sorcerer II Global Ocean Sampling expedition: expanding the universe of protein families.

    PubMed

    Yooseph, Shibu; Sutton, Granger; Rusch, Douglas B; Halpern, Aaron L; Williamson, Shannon J; Remington, Karin; Eisen, Jonathan A; Heidelberg, Karla B; Manning, Gerard; Li, Weizhong; Jaroszewski, Lukasz; Cieplak, Piotr; Miller, Christopher S; Li, Huiying; Mashiyama, Susan T; Joachimiak, Marcin P; van Belle, Christopher; Chandonia, John-Marc; Soergel, David A; Zhai, Yufeng; Natarajan, Kannan; Lee, Shaun; Raphael, Benjamin J; Bafna, Vineet; Friedman, Robert; Brenner, Steven E; Godzik, Adam; Eisenberg, David; Dixon, Jack E; Taylor, Susan S; Strausberg, Robert L; Frazier, Marvin; Venter, J Craig

    2007-03-01

    Metagenomics projects based on shotgun sequencing of populations of micro-organisms yield insight into protein families. We used sequence similarity clustering to explore proteins with a comprehensive dataset consisting of sequences from available databases together with 6.12 million proteins predicted from an assembly of 7.7 million Global Ocean Sampling (GOS) sequences. The GOS dataset covers nearly all known prokaryotic protein families. A total of 3,995 medium- and large-sized clusters consisting of only GOS sequences are identified, out of which 1,700 have no detectable homology to known families. The GOS-only clusters contain a higher than expected proportion of sequences of viral origin, thus reflecting a poor sampling of viral diversity until now. Protein domain distributions in the GOS dataset and current protein databases show distinct biases. Several protein domains that were previously categorized as kingdom specific are shown to have GOS examples in other kingdoms. About 6,000 sequences (ORFans) from the literature that heretofore lacked similarity to known proteins have matches in the GOS data. The GOS dataset is also used to improve remote homology detection. Overall, besides nearly doubling the number of current proteins, the predicted GOS proteins also add a great deal of diversity to known protein families and shed light on their evolution. These observations are illustrated using several protein families, including phosphatases, proteases, ultraviolet-irradiation DNA damage repair enzymes, glutamine synthetase, and RuBisCO. The diversity added by GOS data has implications for choosing targets for experimental structure characterization as part of structural genomics efforts. Our analysis indicates that new families are being discovered at a rate that is linear or almost linear with the addition of new sequences, implying that we are still far from discovering all protein families in nature.

  14. MethylViewer: computational analysis and editing for bisulfite sequencing and methyltransferase accessibility protocol for individual templates (MAPit) projects.

    PubMed

    Pardo, Carolina E; Carr, Ian M; Hoffman, Christopher J; Darst, Russell P; Markham, Alexander F; Bonthron, David T; Kladde, Michael P

    2011-01-01

    Bisulfite sequencing is a widely-used technique for examining cytosine DNA methylation at nucleotide resolution along single DNA strands. Probing with cytosine DNA methyltransferases followed by bisulfite sequencing (MAPit) is an effective technique for mapping protein-DNA interactions. Here, MAPit methylation footprinting with M.CviPI, a GC methyltransferase we previously cloned and characterized, was used to probe hMLH1 chromatin in HCT116 and RKO colorectal cancer cells. Because M.CviPI-probed samples contain both CG and GC methylation, we developed a versatile, visually-intuitive program, called MethylViewer, for evaluating the bisulfite sequencing results. Uniquely, MethylViewer can simultaneously query cytosine methylation status in bisulfite-converted sequences at as many as four different user-defined motifs, e.g. CG, GC, etc., including motifs with degenerate bases. Data can also be exported for statistical analysis and as publication-quality images. Analysis of hMLH1 MAPit data with MethylViewer showed that endogenous CG methylation and accessible GC sites were both mapped on single molecules at high resolution. Disruption of positioned nucleosomes on single molecules of the PHO5 promoter was detected in budding yeast using M.CviPII, increasing the number of enzymes available for probing protein-DNA interactions. MethylViewer provides an integrated solution for primer design and rapid, accurate and detailed analysis of bisulfite sequencing or MAPit datasets from virtually any biological or biochemical system.

  15. TREE2FASTA: a flexible Perl script for batch extraction of FASTA sequences from exploratory phylogenetic trees.

    PubMed

    Sauvage, Thomas; Plouviez, Sophie; Schmidt, William E; Fredericq, Suzanne

    2018-03-05

    The body of DNA sequence data lacking taxonomically informative sequence headers is rapidly growing in user and public databases (e.g. sequences lacking identification and contaminants). In the context of systematics studies, sorting such sequence data for taxonomic curation and/or molecular diversity characterization (e.g. crypticism) often requires the building of exploratory phylogenetic trees with reference taxa. The subsequent step of segregating DNA sequences of interest based on observed topological relationships can represent a challenging task, especially for large datasets. We have written TREE2FASTA, a Perl script that enables and expedites the sorting of FASTA-formatted sequence data from exploratory phylogenetic trees. TREE2FASTA takes advantage of the interactive, rapid point-and-click color selection and/or annotations of tree leaves in the popular Java tree-viewer FigTree to segregate groups of FASTA sequences of interest to separate files. TREE2FASTA allows for both simple and nested segregation designs to facilitate the simultaneous preparation of multiple data sets that may overlap in sequence content.

  16. Technical Report: Benchmarking for Quasispecies Abundance Inference with Confidence Intervals from Metagenomic Sequence Data

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    McLoughlin, K.

    2016-01-22

    The software application “MetaQuant” was developed by our group at Lawrence Livermore National Laboratory (LLNL). It is designed to profile microbial populations in a sample using data from whole-genome shotgun (WGS) metagenomic DNA sequencing. Several other metagenomic profiling applications have been described in the literature. We ran a series of benchmark tests to compare the performance of MetaQuant against that of a few existing profiling tools, using real and simulated sequence datasets. This report describes our benchmarking procedure and results.

  17. Beyond Reasonable Doubt: Evolution from DNA Sequences

    PubMed Central

    Penny, David

    2013-01-01

    We demonstrate quantitatively that, as predicted by evolutionary theory, sequences of homologous proteins from different species converge as we go further and further back in time. The converse, a non-evolutionary model, can be expressed as probabilities, and the test works for chloroplast, nuclear and mitochondrial sequences, as well as for sequences that diverged at different time depths. Even on our conservative test, the probability that chance could produce the observed levels of ancestral convergence for just one of the eight datasets of 51 proteins is ≈1×10^−19 and combined over 8 datasets is ≈1×10^−132. By comparison, there are about 10^80 protons in the universe, hence the probability that the sequences could have been produced by a process involving unrelated ancestral sequences is about 10^50 times lower than picking, among all protons, the same proton at random twice in a row. A non-evolutionary control model shows no convergence, and only a small number of parameters are required to account for the observations. It is time that researchers insisted that doubters put up testable alternatives to evolution. PMID:23950906

  18. High-throughput engineering of a mammalian genome reveals building principles of methylation states at CG rich regions.

    PubMed

    Krebs, Arnaud R; Dessus-Babus, Sophie; Burger, Lukas; Schübeler, Dirk

    2014-09-26

    The majority of mammalian promoters are CpG islands; regions of high CG density that require protection from DNA methylation to be functional. Importantly, how sequence architecture mediates this unmethylated state remains unclear. To address this question in a comprehensive manner, we developed a method to interrogate methylation states of hundreds of sequence variants inserted at the same genomic site in mouse embryonic stem cells. Using this assay, we were able to quantify the contribution of various sequence motifs towards the resulting DNA methylation state. Modeling of this comprehensive dataset revealed that CG density alone is a minor determinant of their unmethylated state. Instead, these data argue for a principal role for transcription factor binding sites, a prediction confirmed by testing synthetic mutant libraries. Taken together, these findings establish the hierarchy between the two cis-encoded mechanisms that define the DNA methylation state and thus the transcriptional competence of CpG islands.

  19. Molecular diversification of Trichuris spp. from Sigmodontinae (Cricetidae) rodents from Argentina based on mitochondrial DNA sequences.

    PubMed

    Callejón, Rocío; Robles, María Del Rosario; Panei, Carlos Javier; Cutillas, Cristina

    2016-08-01

    A molecular phylogenetic hypothesis is presented for the genus Trichuris based on sequence data from mitochondrial cytochrome c oxidase 1 (cox1) and cytochrome b (cob). The taxa consisted of nine populations of whipworm from five species of Sigmodontinae rodents from Argentina. Bayesian Inference, Maximum Parsimony, and Maximum Likelihood methods were used to infer phylogenies for each gene separately but also for the combined mitochondrial data and the combined mitochondrial and nuclear dataset. Phylogenetic results based on cox1 and cob mitochondrial DNA (mtDNA) revealed three strongly resolved clades corresponding to three different species (Trichuris navonae, Trichuris bainae, and Trichuris pardinasi) showing phylogeographic variation, but relationships among Trichuris species were poorly resolved. Phylogenetic reconstruction based on concatenated sequences had greater phylogenetic resolution for delimiting species and intra-specific populations of Trichuris than those based on partitioned genes. Thus, populations of T. bainae and T. pardinasi could be affected by geographical factors and parasite-host co-divergence.

  20. Resolving the phylogenetic position of Darwin's extinct ground sloth (Mylodon darwinii) using mitogenomic and nuclear exon data.

    PubMed

    Delsuc, Frédéric; Kuch, Melanie; Gibb, Gillian C; Hughes, Jonathan; Szpak, Paul; Southon, John; Enk, Jacob; Duggan, Ana T; Poinar, Hendrik N

    2018-05-16

    Mylodon darwinii is the extinct giant ground sloth named after Charles Darwin, who first collected its remains in South America. We have successfully obtained a high-quality mitochondrial genome at 99-fold coverage using Illumina shotgun sequencing of a 12 880-year-old bone fragment from Mylodon Cave in Chile. The low level of DNA damage showed that this sample was exceptionally well preserved for an ancient subfossil, probably the result of the dry and cold conditions prevailing within the cave. Accordingly, taxonomic assessment of our shotgun metagenomic data showed a very high percentage of endogenous DNA with 22% of the assembled metagenomic contigs assigned to Xenarthra. Additionally, we enriched over 15 kb of sequence data from seven nuclear exons, using target sequence capture designed against a wide xenarthran dataset. Phylogenetic and dating analyses of the mitogenomic dataset including all extant species of xenarthrans and the assembled nuclear supermatrix unambiguously place Mylodon darwinii as the sister-group of modern two-fingered sloths, from which it diverged around 22 million years ago. These congruent results from both the mitochondrial and nuclear data support the diphyly of the two modern sloth lineages, implying the convergent evolution of their unique suspensory behaviour as an adaptation to arboreality. Our results offer promising perspectives for whole-genome sequencing of this emblematic extinct taxon. © 2018 The Authors.

  1. Next-generation sequencing coupled with a cell-free display technology for high-throughput production of reliable interactome data

    PubMed Central

    Fujimori, Shigeo; Hirai, Naoya; Ohashi, Hiroyuki; Masuoka, Kazuyo; Nishikimi, Akihiko; Fukui, Yoshinori; Washio, Takanori; Oshikubo, Tomohiro; Yamashita, Tatsuhiro; Miyamoto-Sato, Etsuko

    2012-01-01

    Next-generation sequencing (NGS) has been applied to various kinds of omics studies, resulting in many biological and medical discoveries. However, high-throughput protein-protein interactome datasets derived from detection by sequencing are scarce, because protein-protein interaction analysis requires many cell manipulations to examine the interactions. The low reliability of the high-throughput data is also a problem. Here, we describe a cell-free display technology combined with NGS that can improve both the coverage and reliability of interactome datasets. The completely cell-free method gives a high-throughput and a large detection space, testing the interactions without using clones. The quantitative information provided by NGS reduces the number of false positives. The method is suitable for the in vitro detection of proteins that interact not only with the bait protein, but also with DNA, RNA and chemical compounds. Thus, it could become a universal approach for exploring the large space of protein sequences and interactome networks. PMID:23056904

  2. SSU rDNA divergence in planktonic foraminifera: molecular taxonomy and biogeographic implications.

    PubMed

    André, Aurore; Quillévéré, Frédéric; Morard, Raphaël; Ujiié, Yurika; Escarguel, Gilles; de Vargas, Colomban; de Garidel-Thoron, Thibault; Douady, Christophe J

    2014-01-01

    The use of planktonic foraminifera in paleoceanography requires taxonomic consistency and precise assessment of the species biogeography. Yet, ribosomal small subunit (SSUr) DNA analyses have revealed that most of the modern morpho-species of planktonic foraminifera are composed of a complex of several distinct genetic types that may correspond to cryptic or pseudo-cryptic species. These genetic types are usually delimitated using partial sequences located at the 3'end of the SSUrDNA, but typically based on empirical delimitation. Here, we first use patristic genetic distances calculated within and among genetic types of the most common morpho-species to show that intra-type and inter-type genetic distances within morpho-species may significantly overlap, suggesting that genetic types have been sometimes inconsistently defined. We further apply two quantitative and independent methods, ABGD (Automatic Barcode Gap Detection) and GMYC (General Mixed Yule Coalescent) to a dataset of published and newly obtained partial SSU rDNA for a more objective assessment of the species status of these genetic types. Results of these complementary approaches are highly congruent and lead to a molecular taxonomy that ranks 49 genetic types of planktonic foraminifera as genuine (pseudo)cryptic species. Our results advocate for a standardized sequencing procedure allowing homogenous delimitations of (pseudo)cryptic species. On the ground of this revised taxonomic framework, we finally provide an integrative taxonomy synthesizing geographic, ecological and morphological differentiations that can occur among the genuine (pseudo)cryptic species. Due to molecular, environmental or morphological data scarcities, many aspects of our proposed integrative taxonomy are not yet fully resolved. On the other hand, our study opens up the potential for a correct interpretation of environmental sequence datasets.

  3. SSU rDNA Divergence in Planktonic Foraminifera: Molecular Taxonomy and Biogeographic Implications

    PubMed Central

    André, Aurore; Quillévéré, Frédéric; Morard, Raphaël; Ujiié, Yurika; Escarguel, Gilles; de Vargas, Colomban; de Garidel-Thoron, Thibault; Douady, Christophe J.

    2014-01-01

    The use of planktonic foraminifera in paleoceanography requires taxonomic consistency and precise assessment of the species biogeography. Yet, ribosomal small subunit (SSUr) DNA analyses have revealed that most of the modern morpho-species of planktonic foraminifera are composed of a complex of several distinct genetic types that may correspond to cryptic or pseudo-cryptic species. These genetic types are usually delimitated using partial sequences located at the 3′end of the SSUrDNA, but typically based on empirical delimitation. Here, we first use patristic genetic distances calculated within and among genetic types of the most common morpho-species to show that intra-type and inter-type genetic distances within morpho-species may significantly overlap, suggesting that genetic types have been sometimes inconsistently defined. We further apply two quantitative and independent methods, ABGD (Automatic Barcode Gap Detection) and GMYC (General Mixed Yule Coalescent) to a dataset of published and newly obtained partial SSU rDNA for a more objective assessment of the species status of these genetic types. Results of these complementary approaches are highly congruent and lead to a molecular taxonomy that ranks 49 genetic types of planktonic foraminifera as genuine (pseudo)cryptic species. Our results advocate for a standardized sequencing procedure allowing homogenous delimitations of (pseudo)cryptic species. On the ground of this revised taxonomic framework, we finally provide an integrative taxonomy synthesizing geographic, ecological and morphological differentiations that can occur among the genuine (pseudo)cryptic species. Due to molecular, environmental or morphological data scarcities, many aspects of our proposed integrative taxonomy are not yet fully resolved. On the other hand, our study opens up the potential for a correct interpretation of environmental sequence datasets. PMID:25119900

  4. OSG-GEM: Gene Expression Matrix Construction Using the Open Science Grid.

    PubMed

    Poehlman, William L; Rynge, Mats; Branton, Chris; Balamurugan, D; Feltus, Frank A

    2016-01-01

    High-throughput DNA sequencing technology has revolutionized the study of gene expression while introducing significant computational challenges for biologists. These computational challenges include access to sufficient computer hardware and functional data processing workflows. Both these challenges are addressed with our scalable, open-source Pegasus workflow for processing high-throughput DNA sequence datasets into a gene expression matrix (GEM) using computational resources available to U.S.-based researchers on the Open Science Grid (OSG). We describe the usage of the workflow (OSG-GEM), discuss workflow design, inspect performance data, and assess accuracy in mapping paired-end sequencing reads to a reference genome. A target OSG-GEM user is proficient with the Linux command line and possesses basic bioinformatics experience. The user may run this workflow directly on the OSG or adapt it to novel computing environments.

  5. OSG-GEM: Gene Expression Matrix Construction Using the Open Science Grid

    PubMed Central

    Poehlman, William L.; Rynge, Mats; Branton, Chris; Balamurugan, D.; Feltus, Frank A.

    2016-01-01

    High-throughput DNA sequencing technology has revolutionized the study of gene expression while introducing significant computational challenges for biologists. These computational challenges include access to sufficient computer hardware and functional data processing workflows. Both these challenges are addressed with our scalable, open-source Pegasus workflow for processing high-throughput DNA sequence datasets into a gene expression matrix (GEM) using computational resources available to U.S.-based researchers on the Open Science Grid (OSG). We describe the usage of the workflow (OSG-GEM), discuss workflow design, inspect performance data, and assess accuracy in mapping paired-end sequencing reads to a reference genome. A target OSG-GEM user is proficient with the Linux command line and possesses basic bioinformatics experience. The user may run this workflow directly on the OSG or adapt it to novel computing environments. PMID:27499617

  6. CpG island mapping by epigenome prediction.

    PubMed

    Bock, Christoph; Walter, Jörn; Paulsen, Martina; Lengauer, Thomas

    2007-06-01

    CpG islands were originally identified by epigenetic and functional properties, namely, absence of DNA methylation and frequent promoter association. However, this concept was quickly replaced by simple DNA sequence criteria, which allowed for genome-wide annotation of CpG islands in the absence of large-scale epigenetic datasets. Although widely used, the current CpG island criteria incur significant disadvantages: (1) reliance on arbitrary threshold parameters that bear little biological justification, (2) failure to account for widespread heterogeneity among CpG islands, and (3) apparent lack of specificity when applied to the human genome. This study is driven by the idea that a quantitative score of "CpG island strength" that incorporates epigenetic and functional aspects can help resolve these issues. We construct an epigenome prediction pipeline that links the DNA sequence of CpG islands to their epigenetic states, including DNA methylation, histone modifications, and chromatin accessibility. By training support vector machines on epigenetic data for CpG islands on human Chromosomes 21 and 22, we identify informative DNA attributes that correlate with open versus compact chromatin structures. These DNA attributes are used to predict the epigenetic states of all CpG islands genome-wide. Combining predictions for multiple epigenetic features, we estimate the inherent CpG island strength for each CpG island in the human genome, i.e., its inherent tendency to exhibit an open and transcriptionally competent chromatin structure. We extensively validate our results on independent datasets, showing that the CpG island strength predictions are applicable and informative across different tissues and cell types, and we derive improved maps of predicted "bona fide" CpG islands. 
    The mapping of CpG islands by epigenome prediction is conceptually superior to identifying CpG islands by widely used sequence criteria since it links CpG island detection to their characteristic epigenetic and functional states. And it is superior to purely experimental epigenome mapping for CpG island detection since it abstracts from specific properties that are limited to a single cell type or tissue. In addition, using computational epigenetics methods we could identify high correlation between the epigenome and characteristics of the DNA sequence, a finding which emphasizes the need for a better understanding of the mechanistic links between genome and epigenome.
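
    For context, the "simple DNA sequence criteria" that this abstract contrasts with epigenome prediction can be sketched in a few lines. The abstract does not name specific thresholds; the defaults below (length of at least 200 bp, GC content of at least 0.5, observed/expected CpG ratio of at least 0.6) are the commonly cited Gardiner-Garden and Frommer values and are used here as an assumption. Function names are hypothetical, and the sketch is not the authors' prediction pipeline.

```python
def cpg_stats(seq):
    """Return (GC content, observed/expected CpG ratio) for a DNA string."""
    seq = seq.upper()
    n = len(seq)
    c_count, g_count = seq.count("C"), seq.count("G")
    cpg_count = seq.count("CG")  # CpG dinucleotides on the given strand
    gc_content = (c_count + g_count) / n if n else 0.0
    # Expected CpG count under independence is (#C * #G) / N.
    obs_exp = (cpg_count * n) / (c_count * g_count) if c_count and g_count else 0.0
    return gc_content, obs_exp


def passes_sequence_criteria(seq, min_len=200, min_gc=0.5, min_obs_exp=0.6):
    """Threshold test; the default cutoffs are assumed, not taken from the abstract."""
    gc_content, obs_exp = cpg_stats(seq)
    return len(seq) >= min_len and gc_content >= min_gc and obs_exp >= min_obs_exp
```

    The hard cutoffs in passes_sequence_criteria are exactly the kind of threshold parameters the abstract criticises as arbitrary, which is why the authors argue for a quantitative strength score instead.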

  7. Performances of Different Fragment Sizes for Reduced Representation Bisulfite Sequencing in Pigs.

    PubMed

    Yuan, Xiao-Long; Zhang, Zhe; Pan, Rong-Yang; Gao, Ning; Deng, Xi; Li, Bin; Zhang, Hao; Sangild, Per Torp; Li, Jia-Qi

    2017-01-01

    Reduced representation bisulfite sequencing (RRBS) has been widely used to profile genome-scale DNA methylation in mammalian genomes. However, the applications and technical performances of RRBS with different fragment sizes have not been systematically reported in pigs, which serve as one of the important biomedical models for humans. The aim of this study was to evaluate the capacities of RRBS libraries with different fragment sizes to characterize the porcine genome. We found that the MspI-digested segments between 40 and 220 bp harbored a high distribution peak at 74 bp, which overlapped heavily with repetitive elements and might reduce the unique mapping alignment. The RRBS library of 110-220 bp fragment size had the highest unique mapping alignment and the lowest multiple alignment. The cost-effectiveness of the 40-110 bp, 110-220 bp and 40-220 bp fragment sizes might decrease when the dataset size was more than 70, 50 and 110 million reads for these three fragment sizes, respectively. Given a 50-million dataset size, the average sequencing depth of the detected CpG sites in the 110-220 bp fragment size appeared to be deeper than in the 40-110 bp and 40-220 bp fragment sizes, and these detected CpG sites were located differently in gene- and CpG island-related regions. In this study, our results demonstrated that the selection of fragment sizes could affect the numbers and sequencing depth of detected CpG sites as well as the cost-efficiency. No single solution of RRBS is optimal in all circumstances for investigating genome-scale DNA methylation. This work provides useful knowledge on designing and executing RRBS for investigating genome-wide DNA methylation in tissues from pigs.
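
    The fragment-size selection discussed in this abstract can be mimicked in silico. The sketch below cuts a linear sequence at MspI sites (recognition sequence CCGG, cleaved after the first C) and keeps fragments in a chosen size range. It is a toy model on one strand that ignores adapters, end repair and bisulfite conversion; the function names and the example sequence are assumptions, not part of the study.

```python
import re


def mspi_fragments(seq):
    """Split a linear sequence at every MspI site (C^CGG) on the given strand."""
    seq = seq.upper()
    # Cut one base into each CCGG occurrence, i.e. between the two Cs.
    cuts = [m.start() + 1 for m in re.finditer("CCGG", seq)]
    bounds = [0] + cuts + [len(seq)]
    return [seq[a:b] for a, b in zip(bounds, bounds[1:]) if b > a]


def size_select(fragments, low=40, high=220):
    """Keep fragments whose length lies within [low, high] bp."""
    return [f for f in fragments if low <= len(f) <= high]


toy = "A" * 10 + "CCGG" + "T" * 50 + "CCGG" + "A" * 5
fragments = mspi_fragments(toy)
selected = size_select(fragments)
```

    Changing low and high to 40-110 or 110-220 reproduces the other two size windows compared in the abstract.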

  8. With a little help from DNA barcoding: investigating the diversity of Gastropoda from the Portuguese coast

    PubMed Central

    Borges, Luísa M. S.; Hollatz, Claudia; Lobo, Jorge; Cunha, Ana M.; Vilela, Ana P.; Calado, Gonçalo; Coelho, Rita; Costa, Ana C.; Ferreira, Maria S. G.; Costa, Maria H.; Costa, Filipe O.

    2016-01-01

    The Gastropoda is one of the best studied classes of marine invertebrates. Yet, most species have been delimited based on morphology only. The application of DNA barcodes has been shown to be very useful in helping to delimit species. Therefore, sequences of the cytochrome c oxidase I gene from 108 specimens of 34 morpho-species were used to investigate the molecular diversity within the gastropods from the Portuguese coast. To the above dataset, we added available COI-5P sequences of taxonomically close species, for a total of 58 morpho-species examined. There was a good match between our sequences and those from independent studies in public repositories. We found 32 concordant (91.4%) out of the 35 Barcode Index Numbers (BINs) generated from our sequences. The application of a ranking system to the barcodes yielded over 70% with top taxonomic congruence, while 14.2% of the species barcodes had insufficient data. In the majority of the cases, there was a good concordance between morphological identification and DNA barcodes. Nonetheless, the discordance between morphological and molecular data is a reminder that even the comparatively well-known European marine gastropods can benefit from being probed using the DNA barcode approach. Discordant cases should be reviewed with more integrative studies. PMID:26876495

  9. With a little help from DNA barcoding: investigating the diversity of Gastropoda from the Portuguese coast.

    PubMed

    Borges, Luísa M S; Hollatz, Claudia; Lobo, Jorge; Cunha, Ana M; Vilela, Ana P; Calado, Gonçalo; Coelho, Rita; Costa, Ana C; Ferreira, Maria S G; Costa, Maria H; Costa, Filipe O

    2016-02-15

    The Gastropoda is one of the best studied classes of marine invertebrates. Yet, most species have been delimited based on morphology only. The application of DNA barcodes has been shown to be very useful in helping to delimit species. Therefore, sequences of the cytochrome c oxidase I gene from 108 specimens of 34 morpho-species were used to investigate the molecular diversity within the gastropods from the Portuguese coast. To the above dataset, we added available COI-5P sequences of taxonomically close species, for a total of 58 morpho-species examined. There was a good match between our sequences and those from independent studies in public repositories. We found 32 concordant (91.4%) out of the 35 Barcode Index Numbers (BINs) generated from our sequences. The application of a ranking system to the barcodes yielded over 70% with top taxonomic congruence, while 14.2% of the species barcodes had insufficient data. In the majority of the cases, there was a good concordance between morphological identification and DNA barcodes. Nonetheless, the discordance between morphological and molecular data is a reminder that even the comparatively well-known European marine gastropods can benefit from being probed using the DNA barcode approach. Discordant cases should be reviewed with more integrative studies.

  10. Complete nuclear ribosomal DNA sequence amplification and molecular analyses of Bangia (Bangiales, Rhodophyta) from China

    NASA Astrophysics Data System (ADS)

    Xu, Jiajie; Jiang, Bo; Chai, Sanming; He, Yuan; Zhu, Jianyi; Shen, Zonggen; Shen, Songdong

    2016-09-01

    Filamentous Bangia, which are distributed extensively throughout the world, have simple and similar morphological characteristics. Scientists can classify these organisms using molecular markers in combination with morphology. We successfully sequenced the complete nuclear ribosomal DNA, approximately 13 kb in length, from a marine Bangia population. We further analyzed the small subunit ribosomal DNA gene (nrSSU) and the internal transcribed spacer (ITS) sequence regions along with nine other marine and two freshwater Bangia samples from China. Pairwise distances of the nrSSU and 5.8S ribosomal DNA gene sequences show the marine samples grouping together with low divergences (0-0.003; 0-0.006, respectively) from each other, but high divergences (0.123-0.126; 0.198, respectively) from freshwater samples. An exception is the marine sample collected from Weihai, which shows high divergence from both other marine samples (0.063-0.065; 0.129, respectively) and the freshwater samples (0.097; 0.120, respectively). A maximum likelihood phylogenetic tree based on a combined SSU-ITS dataset shows the samples divided into three clades, with the two marine sample clades containing Bangia spp. from North America, Europe, Asia, and Australia; and one freshwater clade, containing Bangia atropurpurea from North America and China.

  11. Phylogenetic study of Oryzoideae species and related taxa of the Poaceae based on atpB-rbcL and ndhF DNA sequences.

    PubMed

    Zeng, Xu; Yuan, Zhengrong; Tong, Xin; Li, Qiushi; Gao, Weiwei; Qin, Minjian; Liu, Zhihua

    2012-05-01

    Oryzoideae (Poaceae) plants have economic and ecological value. However, the phylogenetic position of some plants is not clear, such as Hygroryza aristata (Retz.) Nees. and Porteresia coarctata (Roxb.) Tateoka (syn. Oryza coarctata). Comprehensive molecular phylogenetic studies have been carried out on many genera in the Poaceae. Different DNA sequences, including nuclear and chloroplast sequences, have been extensively employed to determine relationships at both higher and lower taxonomic levels in the Poaceae. The chloroplast DNA ndhF gene and atpB-rbcL spacer were used to construct phylogenetic trees and estimate the divergence time of Oryzoideae, Bambusoideae, Panicoideae, Pooideae and other subfamilies. Complete sequences of atpB-rbcL and ndhF were generated for 17 species representing six species of the Oryzoideae and related subfamilies. Nicotiana tabacum L. was the outgroup species. The two DNA datasets were analyzed using Maximum Parsimony and Bayesian analysis methods. The molecular phylogeny revealed that H. aristata (Retz.) Nees was the sister to Chikusichloa aquatica Koidz. Moreover, P. coarctata (Roxb.) Tateoka was in the genus Oryza. Furthermore, the result of the evolution analysis, which was based on the ndhF marker, indicated that the time of origin of Oryzoideae might be 31 million years ago.

  12. PDNAsite: Identification of DNA-binding Site from Protein Sequence by Incorporating Spatial and Sequence Context

    PubMed Central

    Zhou, Jiyun; Xu, Ruifeng; He, Yulan; Lu, Qin; Wang, Hongpeng; Kong, Bing

    2016-01-01

    Protein-DNA interactions are involved in many fundamental biological processes essential for cellular function. Most of the existing computational approaches employed only the sequence context of the target residue for its prediction. In the present study, for each target residue, we applied both the spatial context and the sequence context to construct the feature space. Subsequently, Latent Semantic Analysis (LSA) was applied to remove the redundancies in the feature space. Finally, a predictor (PDNAsite) was developed through the integration of the support vector machines (SVM) classifier and ensemble learning. Results on the PDNA-62 and the PDNA-224 datasets demonstrate that features extracted from spatial context provide more information than those from sequence context and that the combination of them gives more performance gain. An analysis of the number of binding sites in the spatial context of the target site indicates that the interactions between binding sites next to each other are important for protein-DNA recognition and their binding ability. The comparison between our proposed PDNAsite method and the existing methods indicates that PDNAsite outperforms most of the existing methods and is a useful tool for DNA-binding site identification. A web-server of our predictor (http://hlt.hitsz.edu.cn:8080/PDNAsite/) is freely accessible to the biological research community. PMID:27282833

  13. RAD tag sequencing as a source of SNP markers in Cynara cardunculus L

    PubMed Central

    2012-01-01

    Background The globe artichoke (Cynara cardunculus L. var. scolymus) genome is relatively poorly explored, especially compared to those of the other major Asteraceae crops sunflower and lettuce. No SNP markers are in the public domain. We have combined the recently developed restriction-site associated DNA (RAD) approach with the Illumina DNA sequencing platform to effect the rapid and mass discovery of SNP markers for C. cardunculus. Results RAD tags were sequenced from the genomic DNA of three C. cardunculus mapping population parents, generating 9.7 million reads, corresponding to ~1 Gbp of sequence. An assembly based on paired ends produced ~6.0 Mbp of genomic sequence, separated into ~19,000 contigs (mean length 312 bp), of which ~21% were fragments of putative coding sequence. The shared sequences allowed for the discovery of ~34,000 SNPs and nearly 800 indels, equivalent to a SNP frequency of 5.6 per 1,000 nt, and an indel frequency of 0.2 per 1,000 nt. A sample of heterozygous SNP loci was mapped by CAPS assays and this exercise provided validation of our mining criteria. The repetitive fraction of the genome had a high representation of retrotransposon sequence, followed by simple repeats, AT-low complexity regions and mobile DNA elements. The genomic k-mer distribution and CpG rate of C. cardunculus, compared with data derived from three whole genome-sequenced dicot species, provided further evidence of the random representation of the C. cardunculus genome generated by RAD sampling. Conclusion The RAD tag sequencing approach is a cost-effective and rapid method to develop SNP markers in a highly heterozygous species. Our approach permitted the generation of a large and robust SNP dataset by the adoption of optimized filtering criteria. PMID:22214349

  14. Lineage divergence detected in the malaria vector Anopheles marajoara (Diptera: Culicidae) in Amazonian Brazil

    PubMed Central

    2010-01-01

    Background Cryptic species complexes are common among anophelines. Previous phylogenetic analysis based on the complete mtDNA COI gene sequences detected paraphyly in the Neotropical malaria vector Anopheles marajoara. The "Folmer region" detects a single taxon using a 3% divergence threshold. Methods To test the paraphyletic hypothesis and examine the utility of the Folmer region, genealogical trees based on a concatenated (white + 3' COI sequences) dataset and pairwise differentiation of COI fragments were examined. The population structure and demographic history were based on partial COI sequences for 294 individuals from 14 localities in Amazonian Brazil. 109 individuals from 12 localities were sequenced for the nDNA white gene, and 57 individuals from 11 localities were sequenced for the ribosomal DNA (rDNA) internal transcribed spacer 2 (ITS2). Results Distinct A. marajoara lineages were detected by combined genealogical analysis and were also supported among COI haplotypes using a median joining network and AMOVA, with time since divergence during the Pleistocene (<100,000 ya). COI sequences at the 3' end were more variable, demonstrating significant pairwise differentiation (3.82%) compared to the more moderate 2.92% detected by the Folmer region. Lineage 1 was present in all localities, whereas lineage 2 was restricted mainly to the west. Mismatch distributions for both lineages were bimodal, likely due to multiple colonization events and spatial expansion (~798 - 81,045 ya). There appears to be gene flow within, not between lineages, and a partial barrier was detected near Rio Jari in Amapá state, separating western and eastern populations. In contrast, both nDNA data sets (white gene sequences with or without the retention of the 4th intron, and ITS2 sequences and length) detected a single A. marajoara lineage. 
    Conclusions Strong support for combined data with significant differentiation detected in the COI and absent in the nDNA suggests that the divergence is recent, and detectable only by the faster evolving mtDNA. A within subgenus threshold of >2% may be more appropriate among sister taxa in cryptic anopheline complexes than the standard 3%. Differences in demographic history and climatic changes may have contributed to mtDNA lineage divergence in A. marajoara. PMID:20929572

  15. An improved filtering algorithm for big read datasets and its application to single-cell assembly.

    PubMed

    Wedemeyer, Axel; Kliemann, Lasse; Srivastav, Anand; Schielke, Christian; Reusch, Thorsten B; Rosenstiel, Philip

    2017-07-03

    For single-cell or metagenomic sequencing projects, it is necessary to sequence with a very high mean coverage in order to make sure that all parts of the sample DNA get covered by the reads produced. This leads to huge datasets with lots of redundant data. A filtering of this data prior to assembly is advisable. Brown et al. (2012) presented the algorithm Diginorm for this purpose, which filters reads based on the abundance of their k-mers. We present Bignorm, a faster and quality-conscious read filtering algorithm. An important new algorithmic feature is the use of phred quality scores together with a detailed analysis of the k-mer counts to decide which reads to keep. We qualify and recommend parameters for our new read filtering algorithm. Guided by these parameters, we remove in terms of median 97.15% of the reads while keeping the mean phred score of the filtered dataset high. Using the SPAdes assembler, we produce assemblies of high quality from these filtered datasets in a fraction of the time needed for an assembly from the datasets filtered with Diginorm. We conclude that read filtering is a practical and efficient method for reducing read data and for speeding up the assembly process. This applies not only to single cell assembly, as shown in this paper, but also to other projects with high mean coverage datasets like metagenomic sequencing projects. Our Bignorm algorithm allows assemblies of competitive quality in comparison to Diginorm, while being much faster. Bignorm is available for download at https://git.informatik.uni-kiel.de/axw/Bignorm .
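
    The idea of filtering reads by the abundance of their k-mers can be sketched in a few lines of Python. This is an illustrative simplification in the spirit of Diginorm, not the Bignorm algorithm itself: it ignores phred quality scores, uses an exact in-memory counter rather than a memory-efficient sketch, and the function names and the default values of k and cutoff are hypothetical.

```python
from collections import Counter
from statistics import median


def kmers(read, k):
    """Return all overlapping substrings of length k in the read."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]


def normalize(reads, k=4, cutoff=2):
    """Keep a read only while its k-mers are still rare.

    A read is kept if the median count of its k-mers, among reads
    kept so far, is below `cutoff`; its k-mers are then counted.
    """
    counts = Counter()
    kept = []
    for read in reads:
        read_kmers = kmers(read, k)
        if not read_kmers:
            continue  # read is shorter than k
        if median(counts[x] for x in read_kmers) < cutoff:
            kept.append(read)
            counts.update(read_kmers)
    return kept


# Five copies of one read plus one distinct read: redundant copies are dropped.
reads = ["ACGTACGT"] * 5 + ["TTTTGGGG"]
print(normalize(reads, k=4, cutoff=2))
```

    In this toy input, only two of the five identical reads survive, while the single distinct read is retained, which is the redundancy reduction the abstract describes.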

  16. Analysis of developmental gene conservation in the Actinomycetales using DNA/DNA microarray comparisons.

    PubMed

    Kirby, Ralph; Herron, Paul; Hoskisson, Paul

    2011-02-01

    Based on available genome sequences, Actinomycetales show significant gene synteny across a wide range of species and genera. In addition, many genera show varying degrees of complex morphological development. Using the presence of gene synteny as a basis, it is clear that an analysis of gene conservation across the Streptomyces and various other Actinomycetales will provide information on both the importance of genes and gene clusters and the evolution of morphogenesis in these bacteria. Genome sequencing, although becoming cheaper, is still relatively expensive for comparing large numbers of strains. Thus, a heterologous DNA/DNA microarray hybridization dataset based on a Streptomyces coelicolor microarray allows a cheaper and greater depth of analysis of gene conservation. This study, using both bioinformatical and microarray approaches, was able to classify genes previously identified as involved in morphogenesis in Streptomyces into various subgroups in terms of conservation across species and genera. This will allow the targeting of genes for further study based on their importance at the species level and at higher evolutionary levels.

  17. oPOSSUM-3: Advanced Analysis of Regulatory Motif Over-Representation Across Genes or ChIP-Seq Datasets

    PubMed Central

    Kwon, Andrew T.; Arenillas, David J.; Hunt, Rebecca Worsley; Wasserman, Wyeth W.

    2012-01-01

    oPOSSUM-3 is a web-accessible software system for identification of over-represented transcription factor binding sites (TFBS) and TFBS families in either DNA sequences of co-expressed genes or sequences generated from high-throughput methods, such as ChIP-Seq. Validation of the system with known sets of co-regulated genes and published ChIP-Seq data demonstrates the capacity for oPOSSUM-3 to identify mediating transcription factors (TF) for co-regulated genes or co-recovered sequences. oPOSSUM-3 is available at http://opossum.cisreg.ca. PMID:22973536

  18. oPOSSUM-3: advanced analysis of regulatory motif over-representation across genes or ChIP-Seq datasets.

    PubMed

    Kwon, Andrew T; Arenillas, David J; Worsley Hunt, Rebecca; Wasserman, Wyeth W

    2012-09-01

    oPOSSUM-3 is a web-accessible software system for identification of over-represented transcription factor binding sites (TFBS) and TFBS families in either DNA sequences of co-expressed genes or sequences generated from high-throughput methods, such as ChIP-Seq. Validation of the system with known sets of co-regulated genes and published ChIP-Seq data demonstrates the capacity for oPOSSUM-3 to identify mediating transcription factors (TF) for co-regulated genes or co-recovered sequences. oPOSSUM-3 is available at http://opossum.cisreg.ca.

  19. Binary Interval Search: a scalable algorithm for counting interval intersections.

    PubMed

    Layer, Ryan M; Skadron, Kevin; Robins, Gabriel; Hall, Ira M; Quinlan, Aaron R

    2013-01-01

    The comparison of diverse genomic datasets is fundamental to understanding genome biology. Researchers must explore many large datasets of genome intervals (e.g. genes, sequence alignments) to place their experimental results in a broader context and to make new discoveries. Relationships between genomic datasets are typically measured by identifying intervals that intersect, that is, they overlap and thus share a common genome interval. Given the continued advances in DNA sequencing technologies, efficient methods for measuring statistically significant relationships between many sets of genomic features are crucial for future discovery. We introduce the Binary Interval Search (BITS) algorithm, a novel and scalable approach to interval set intersection. We demonstrate that BITS outperforms existing methods at counting interval intersections. Moreover, we show that BITS is intrinsically suited to parallel computing architectures, such as graphics processing units, by illustrating its utility for efficient Monte Carlo simulations measuring the significance of relationships between sets of genomic intervals. BITS is available at https://github.com/arq5x/bits.
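
    Counting interval intersections with binary searches can be sketched as follows. The abstract does not spell out the algorithm, so this Python example assumes one common formulation: for each query, the database intervals that end before the query starts and those that start after it ends are excluded, and each of those two counts is found with a binary search over a sorted array. The function name and the closed-interval convention are assumptions for illustration, not details of the BITS code.

```python
from bisect import bisect_left, bisect_right


def count_intersections(queries, database):
    """Count (query, database) interval pairs that overlap.

    Intervals are (start, end) tuples, closed at both ends.
    """
    starts = sorted(start for start, _ in database)
    ends = sorted(end for _, end in database)
    n = len(database)
    total = 0
    for q_start, q_end in queries:
        # Database intervals that end strictly before the query starts.
        ends_before = bisect_left(ends, q_start)
        # Database intervals that start strictly after the query ends.
        starts_after = n - bisect_right(starts, q_end)
        # Everything else must overlap the query.
        total += n - ends_before - starts_after
    return total


database = [(1, 5), (3, 8), (10, 12)]
queries = [(4, 6), (9, 9), (0, 20)]
print(count_intersections(queries, database))
```

    Because each query needs only two independent binary searches and no shared state, the per-query work is easy to distribute, which is consistent with the abstract's remark about parallel architectures.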

  20. A comprehensive resource of genomic, epigenomic and transcriptomic sequencing data for the black truffle Tuber melanosporum

    PubMed Central

    2014-01-01

    Background Tuber melanosporum, also known in the gastronomic community as “truffle”, features one of the largest fungal genomes (125 Mb) with an exceptionally high transposable element (TE) and repetitive DNA content (>58%). The main purpose of DNA methylation in fungi is TE silencing. As obligate outcrossing organisms, truffles are bound to a sexual mode of propagation, which together with TEs is thought to represent a major force driving the evolution of DNA methylation. Thus, it was of interest to examine if and how T. melanosporum exploits DNA methylation to maintain genome integrity. Findings We performed whole-genome DNA bisulfite sequencing and mRNA sequencing on different developmental stages of T. melanosporum; namely, fruitbody (“truffle”), free-living mycelium and ectomycorrhiza. The data revealed a high rate of cytosine methylation (>44%), selectively targeting TEs rather than genes with a strong preference for CpG sites. Whole genome DNA sequencing uncovered multiple TE-enriched, copy number variant regions bearing a significant fraction of hypomethylated and expressed TEs, almost exclusively in free-living mycelium propagated in vitro. Treatment of mycelia with 5-azacytidine partially reduced DNA methylation and increased TE transcription. Our transcriptome assembly also resulted in the identification of a set of novel transcripts from 614 genes. Conclusions The datasets presented here provide valuable and comprehensive (epi)genomic information that can be of interest for evolutionary genomics studies of multicellular (filamentous) fungi, in particular Ascomycetes belonging to the subphylum, Pezizomycotina. Evidence derived from comparative methylome and transcriptome analyses indicates that a non-exhaustive and partly reversible methylation process operates in truffles. PMID:25392735

  1. A comprehensive resource of genomic, epigenomic and transcriptomic sequencing data for the black truffle Tuber melanosporum.

    PubMed

    Chen, Pao-Yang; Montanini, Barbara; Liao, Wen-Wei; Morselli, Marco; Jaroszewicz, Artur; Lopez, David; Ottonello, Simone; Pellegrini, Matteo

    2014-01-01

    Tuber melanosporum, also known in the gastronomic community as "truffle", features one of the largest fungal genomes (125 Mb) with an exceptionally high transposable element (TE) and repetitive DNA content (>58%). The main purpose of DNA methylation in fungi is TE silencing. As obligate outcrossing organisms, truffles are bound to a sexual mode of propagation, which together with TEs is thought to represent a major force driving the evolution of DNA methylation. Thus, it was of interest to examine if and how T. melanosporum exploits DNA methylation to maintain genome integrity. We performed whole-genome DNA bisulfite sequencing and mRNA sequencing on different developmental stages of T. melanosporum; namely, fruitbody ("truffle"), free-living mycelium and ectomycorrhiza. The data revealed a high rate of cytosine methylation (>44%), selectively targeting TEs rather than genes with a strong preference for CpG sites. Whole genome DNA sequencing uncovered multiple TE-enriched, copy number variant regions bearing a significant fraction of hypomethylated and expressed TEs, almost exclusively in free-living mycelium propagated in vitro. Treatment of mycelia with 5-azacytidine partially reduced DNA methylation and increased TE transcription. Our transcriptome assembly also resulted in the identification of a set of novel transcripts from 614 genes. The datasets presented here provide valuable and comprehensive (epi)genomic information that can be of interest for evolutionary genomics studies of multicellular (filamentous) fungi, in particular Ascomycetes belonging to the subphylum, Pezizomycotina. Evidence derived from comparative methylome and transcriptome analyses indicates that a non-exhaustive and partly reversible methylation process operates in truffles.

  2. Privacy-Preserving Data Exploration in Genome-Wide Association Studies.

    PubMed

    Johnson, Aaron; Shmatikov, Vitaly

    2013-08-01

    Genome-wide association studies (GWAS) have become a popular method for analyzing sets of DNA sequences in order to discover the genetic basis of disease. Unfortunately, statistics published as the result of GWAS can be used to identify individuals participating in the study. To prevent privacy breaches, even previously published results have been removed from public databases, impeding researchers' access to the data and hindering collaborative research. Existing techniques for privacy-preserving GWAS focus on answering specific questions, such as correlations between a given pair of SNPs (DNA sequence variations). This does not fit the typical GWAS process, where the analyst may not know in advance which SNPs to consider and which statistical tests to use, how many SNPs are significant for a given dataset, etc. We present a set of practical, privacy-preserving data mining algorithms for GWAS datasets. Our framework supports exploratory data analysis, where the analyst does not know a priori how many and which SNPs to consider. We develop privacy-preserving algorithms for computing the number and location of SNPs that are significantly associated with the disease, the significance of any statistical test between a given SNP and the disease, any measure of correlation between SNPs, and the block structure of correlations. We evaluate our algorithms on real-world datasets and demonstrate that they produce significantly more accurate results than prior techniques while guaranteeing differential privacy.

  3. Analyses of mitochondrial amino acid sequence datasets support the proposal that specimens of Hypodontus macropi from three species of macropodid hosts represent distinct species

    PubMed Central

    2013-01-01

    Background Hypodontus macropi is a common intestinal nematode of a range of kangaroos and wallabies (macropodid marsupials). Based on previous multilocus enzyme electrophoresis (MEE) and nuclear ribosomal DNA sequence data sets, H. macropi has been proposed to be a complex of species. To test this proposal using independent molecular data, we sequenced the whole mitochondrial (mt) genomes of individuals of H. macropi from three different species of hosts (Macropus robustus robustus, Thylogale billardierii and Macropus [Wallabia] bicolor) as well as that of Macropicola ocydromi (a related nematode), and undertook a comparative analysis of the amino acid sequence datasets derived from these genomes. Results The mt genomes sequenced by next-generation (454) technology from H. macropi from the three host species varied from 13,634 bp to 13,699 bp in size. Pairwise comparisons of the amino acid sequences predicted from these three mt genomes revealed differences of 5.8% to 18%. Phylogenetic analysis of the amino acid sequence data sets using Bayesian Inference (BI) showed that H. macropi from the three different host species formed distinct, well-supported clades. In addition, sliding window analysis of the mt genomes defined variable regions for future population genetic studies of H. macropi in different macropodid hosts and geographical regions around Australia. Conclusions The present analyses of inferred mt protein sequence datasets clearly supported the hypothesis that H. macropi from M. robustus robustus, M. bicolor and T. billardierii represent distinct species. PMID:24261823

  4. The sponge microbiome project.

    PubMed

    Moitinho-Silva, Lucas; Nielsen, Shaun; Amir, Amnon; Gonzalez, Antonio; Ackermann, Gail L; Cerrano, Carlo; Astudillo-Garcia, Carmen; Easson, Cole; Sipkema, Detmer; Liu, Fang; Steinert, Georg; Kotoulas, Giorgos; McCormack, Grace P; Feng, Guofang; Bell, James J; Vicente, Jan; Björk, Johannes R; Montoya, Jose M; Olson, Julie B; Reveillaud, Julie; Steindler, Laura; Pineda, Mari-Carmen; Marra, Maria V; Ilan, Micha; Taylor, Michael W; Polymenakou, Paraskevi; Erwin, Patrick M; Schupp, Peter J; Simister, Rachel L; Knight, Rob; Thacker, Robert W; Costa, Rodrigo; Hill, Russell T; Lopez-Legentil, Susanna; Dailianis, Thanos; Ravasi, Timothy; Hentschel, Ute; Li, Zhiyong; Webster, Nicole S; Thomas, Torsten

    2017-10-01

    Marine sponges (phylum Porifera) are a diverse, phylogenetically deep-branching clade known for forming intimate partnerships with complex communities of microorganisms. To date, 16S rRNA gene sequencing studies have largely utilised different extraction and amplification methodologies to target the microbial communities of a limited number of sponge species, severely limiting comparative analyses of sponge microbial diversity and structure. Here, we provide an extensive and standardised dataset that will facilitate sponge microbiome comparisons across large spatial, temporal, and environmental scales. Samples from marine sponges (n = 3569 specimens), seawater (n = 370), marine sediments (n = 65) and other environments (n = 29) were collected from different locations across the globe. This dataset incorporates at least 268 different sponge species, including several yet unidentified taxa. The V4 region of the 16S rRNA gene was amplified and sequenced from extracted DNA using standardised procedures. Raw sequences (total of 1.1 billion sequences) were processed and clustered with (i) a standard protocol using QIIME closed-reference picking resulting in 39 543 operational taxonomic units (OTU) at 97% sequence identity, (ii) a de novo clustering using Mothur resulting in 518 246 OTUs, and (iii) a new high-resolution Deblur protocol resulting in 83 908 unique bacterial sequences. Abundance tables, representative sequences, taxonomic classifications, and metadata are provided. This dataset represents a comprehensive resource of sponge-associated microbial communities based on 16S rRNA gene sequences that can be used to address overarching hypotheses regarding host-associated prokaryotes, including host specificity, convergent evolution, environmental drivers of microbiome structure, and the sponge-associated rare biosphere. © The Authors 2017. Published by Oxford University Press.

  5. iDNA-Prot: Identification of DNA Binding Proteins Using Random Forest with Grey Model

    PubMed Central

    Lin, Wei-Zhong; Fang, Jian-An; Xiao, Xuan; Chou, Kuo-Chen

    2011-01-01

    DNA-binding proteins play crucial roles in various cellular processes. Developing high throughput tools for rapidly and effectively identifying DNA-binding proteins is one of the major challenges in the field of genome annotation. Although many efforts have been made in this regard, further effort is needed to enhance the prediction power. By incorporating the features into the general form of pseudo amino acid composition that were extracted from protein sequences via the “grey model” and by adopting the random forest operation engine, we proposed a new predictor, called iDNA-Prot, for identifying uncharacterized proteins as DNA-binding proteins or non-DNA binding proteins based on their amino acid sequences information alone. The overall success rate by iDNA-Prot was 83.96% that was obtained via jackknife tests on a newly constructed stringent benchmark dataset in which none of the proteins included has pairwise sequence identity to any other in a same subset. In addition to achieving high success rate, the computational time for iDNA-Prot is remarkably shorter in comparison with the relevant existing predictors. Hence it is anticipated that iDNA-Prot may become a useful high throughput tool for large-scale analysis of DNA-binding proteins. As a user-friendly web-server, iDNA-Prot is freely accessible to the public at the web-site on http://icpr.jci.edu.cn/bioinfo/iDNA-Prot or http://www.jci-bioinfo.cn/iDNA-Prot. Moreover, for the convenience of the vast majority of experimental scientists, a step-by-step guide is provided on how to use the web-server to get the desired results. PMID:21935457

  6. Protocol matters: which methylome are you actually studying?

    PubMed Central

    Robinson, Mark D; Statham, Aaron L; Speed, Terence P; Clark, Susan J

    2011-01-01

    The field of epigenetics is now capitalizing on the vast number of emerging technologies, largely based on second-generation sequencing, which interrogate DNA methylation status and histone modifications genome-wide. However, getting an exhaustive and unbiased view of a methylome at a reasonable cost is proving to be a significant challenge. In this article, we take a closer look at the impact of the DNA sequence and bias effects introduced to datasets by genome-wide DNA methylation technologies and, where possible, explore the bioinformatics tools that deconvolve them. There remains much to be learned about the performance of genome-wide technologies, the data we mine from these assays and how it reflects the actual biology. While there are several methods to interrogate the DNA methylation status genome-wide, our opinion is that no single technique suitably covers the minimum criteria of high coverage and high resolution at a reasonable cost. In fact, the fraction of the methylome that is studied currently depends entirely on the inherent biases of the protocol employed. There is promise for this to change, as the third generation of sequencing technologies is expected to again ‘revolutionize’ the way that we study genomes and epigenomes. PMID:21566704

  7. SHARAKU: an algorithm for aligning and clustering read mapping profiles of deep sequencing in non-coding RNA processing.

    PubMed

    Tsuchiya, Mariko; Amano, Kojiro; Abe, Masaya; Seki, Misato; Hase, Sumitaka; Sato, Kengo; Sakakibara, Yasubumi

    2016-06-15

    Deep sequencing of the transcripts of regulatory non-coding RNA generates footprints of post-transcriptional processes. After obtaining sequence reads, the short reads are mapped to a reference genome, and specific mapping patterns can be detected called read mapping profiles, which are distinct from random non-functional degradation patterns. These patterns reflect the maturation processes that lead to the production of shorter RNA sequences. Recent next-generation sequencing studies have revealed not only the typical maturation process of miRNAs but also the various processing mechanisms of small RNAs derived from tRNAs and snoRNAs. We developed an algorithm termed SHARAKU to align two read mapping profiles of next-generation sequencing outputs for non-coding RNAs. In contrast with previous work, SHARAKU incorporates the primary and secondary sequence structures into an alignment of read mapping profiles to allow for the detection of common processing patterns. Using a benchmark simulated dataset, SHARAKU exhibited superior performance to previous methods for correctly clustering the read mapping profiles with respect to 5'-end processing and 3'-end processing from degradation patterns and in detecting similar processing patterns in deriving the shorter RNAs. Further, using experimental data of small RNA sequencing for the common marmoset brain, SHARAKU succeeded in identifying the significant clusters of read mapping profiles for similar processing patterns of small derived RNA families expressed in the brain. The source code of our program SHARAKU is available at http://www.dna.bio.keio.ac.jp/sharaku/, and the simulated dataset used in this work is available at the same link. Accession code: The sequence data from the whole RNA transcripts in the hippocampus of the left brain used in this work is available from the DNA DataBank of Japan (DDBJ) Sequence Read Archive (DRA) under the accession number DRA004502. 
    Contact: yasu@bio.keio.ac.jp. Supplementary data are available at Bioinformatics online. © The Author 2016. Published by Oxford University Press.

  8. A reference bacterial genome dataset generated on the MinION™ portable single-molecule nanopore sequencer.

    PubMed

    Quick, Joshua; Quinlan, Aaron R; Loman, Nicholas J

    2014-01-01

    The MinION™ is a new, portable single-molecule sequencer developed by Oxford Nanopore Technologies. It measures four inches in length and is powered from the USB 3.0 port of a laptop computer. The MinION™ measures the change in current resulting from DNA strands interacting with a charged protein nanopore. These measurements can then be used to deduce the underlying nucleotide sequence. We present a read dataset from whole-genome shotgun sequencing of the model organism Escherichia coli K-12 substr. MG1655 generated on a MinION™ device during the early-access MinION™ Access Program (MAP). Two sequencing runs of the MinION™ are presented, one generated using R7 chemistry (released in July 2014) and one using R7.3 (released in September 2014). Base-called sequence data are provided to demonstrate the nature of data produced by the MinION™ platform and to encourage the development of customised methods for alignment, consensus and variant calling, de novo assembly and scaffolding. FAST5 files containing event data within the HDF5 container format are provided to assist with the development of improved base-calling methods.

  9. Protecting genomic sequence anonymity with generalization lattices.

    PubMed

    Malin, B A

    2005-01-01

    Current genomic privacy technologies assume the identity of genomic sequence data is protected if personal information, such as demographics, are obscured, removed, or encrypted. While demographic features can directly compromise an individual's identity, recent research demonstrates such protections are insufficient because sequence data itself is susceptible to re-identification. To counteract this problem, we introduce an algorithm for anonymizing a collection of person-specific DNA sequences. The technique is termed DNA lattice anonymization (DNALA), and is based upon the formal privacy protection schema of k -anonymity. Under this model, it is impossible to observe or learn features that distinguish one genetic sequence from k-1 other entries in a collection. To maximize information retained in protected sequences, we incorporate a concept generalization lattice to learn the distance between two residues in a single nucleotide region. The lattice provides the most similar generalized concept for two residues (e.g. adenine and guanine are both purines). The method is tested and evaluated with several publicly available human population datasets ranging in size from 30 to 400 sequences. Our findings imply the anonymization schema is feasible for the protection of sequences privacy. The DNALA method is the first computational disclosure control technique for general DNA sequences. Given the computational nature of the method, guarantees of anonymity can be formally proven. There is room for improvement and validation, though this research provides the groundwork from which future researchers can construct genomics anonymization schemas tailored to specific datasharing scenarios.
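
    The generalization step described in this abstract can be illustrated with a toy lattice. The Python sketch below assumes only three levels: the exact base, the purine or pyrimidine class (written with the IUPAC codes R and Y), and the fully generalized symbol N. The lattice used by DNALA may be richer, and the function names and cost values here are hypothetical.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def generalize(a, b):
    """Return (symbol, cost) for the most specific concept covering both bases."""
    if a == b:
        return a, 0       # identical bases need no generalization
    if {a, b} <= PURINES:
        return "R", 1     # both purines
    if {a, b} <= PYRIMIDINES:
        return "Y", 1     # both pyrimidines
    return "N", 2         # only "any nucleotide" covers both


def generalize_pair(seq1, seq2):
    """Generalize two aligned sequences into one shared sequence.

    Returns the shared sequence and the total generalization cost,
    which can serve as a distance between the two inputs.
    """
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned to the same length")
    symbols = []
    total = 0
    for a, b in zip(seq1, seq2):
        symbol, cost = generalize(a, b)
        symbols.append(symbol)
        total += cost
    return "".join(symbols), total


print(generalize_pair("ACGT", "GCTT"))
```

    Publishing the shared sequence in place of both originals makes the two records indistinguishable, which corresponds to k-anonymity with k = 2; pairing sequences with low total cost retains more information.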

  10. Comparing sequencing assays and human-machine analyses in actionable genomics for glioblastoma.

    PubMed

    Wrzeszczynski, Kazimierz O; Frank, Mayu O; Koyama, Takahiko; Rhrissorrakrai, Kahn; Robine, Nicolas; Utro, Filippo; Emde, Anne-Katrin; Chen, Bo-Juen; Arora, Kanika; Shah, Minita; Vacic, Vladimir; Norel, Raquel; Bilal, Erhan; Bergmann, Ewa A; Moore Vogel, Julia L; Bruce, Jeffrey N; Lassman, Andrew B; Canoll, Peter; Grommes, Christian; Harvey, Steve; Parida, Laxmi; Michelini, Vanessa V; Zody, Michael C; Jobanputra, Vaidehi; Royyuru, Ajay K; Darnell, Robert B

    2017-08-01

    To analyze a glioblastoma tumor specimen with 3 different platforms and compare potentially actionable calls from each. Tumor DNA was analyzed by a commercial targeted panel. In addition, tumor-normal DNA was analyzed by whole-genome sequencing (WGS) and tumor RNA was analyzed by RNA sequencing (RNA-seq). The WGS and RNA-seq data were analyzed by a team of bioinformaticians and cancer oncologists, and separately by IBM Watson Genomic Analytics (WGA), an automated system for prioritizing somatic variants and identifying drugs. More variants were identified by WGS/RNA analysis than by targeted panels. WGA completed a comparable analysis in a fraction of the time required by the human analysts. The development of an effective human-machine interface in the analysis of deep cancer genomic datasets may provide potentially clinically actionable calls for individual patients in a more timely and efficient manner than currently possible. ClinicalTrials.gov identifier: NCT02725684.

  11. MethVisual - visualization and exploratory statistical analysis of DNA methylation profiles from bisulfite sequencing.

    PubMed

    Zackay, Arie; Steinhoff, Christine

    2010-12-15

    Exploration of DNA methylation and its impact on various regulatory mechanisms has become a very active field of research. At the same time, there is a growing need for tools to process and analyse the data together with statistical investigation and visualisation. MethVisual is a new application that enables exploratory analysis and intuitive visualization of DNA methylation data as is typically generated by bisulfite sequencing. The package allows the import of DNA methylation sequences, aligns them and performs quality control comparison. It comprises basic analysis steps such as lollipop visualization, co-occurrence display of methylation of neighbouring and distant CpG sites, summary statistics on methylation status, clustering and correspondence analysis. The package has been developed for methylation data but can also be used for other data types for which binary coding can be inferred. The application of the package, as well as a comparison to existing DNA methylation analysis tools and its workflow based on two datasets, is presented in this paper. The R package MethVisual offers various analysis procedures for data that can be binarized, in particular for bisulfite sequenced methylation data. R/Bioconductor has become one of the most important environments for statistical analysis of various types of biological and medical data. Therefore, any data analysis within R that allows the integration of various data types as provided from different technological platforms is convenient. It is the first and so far the only specific package for DNA methylation analysis, in particular for bisulfite sequenced data, available in the R/Bioconductor environment. The package is available for free at http://methvisual.molgen.mpg.de/ and from the Bioconductor Consortium http://www.bioconductor.org.

  12. MethVisual - visualization and exploratory statistical analysis of DNA methylation profiles from bisulfite sequencing

    PubMed Central

    2010-01-01

    Background: Exploration of DNA methylation and its impact on various regulatory mechanisms has become a very active field of research. At the same time, there is a growing need for tools to process and analyse the data together with statistical investigation and visualisation. Findings: MethVisual is a new application that enables exploratory analysis and intuitive visualization of DNA methylation data as is typically generated by bisulfite sequencing. The package allows the import of DNA methylation sequences, aligns them and performs quality control comparison. It comprises basic analysis steps such as lollipop visualization, co-occurrence display of methylation of neighbouring and distant CpG sites, summary statistics on methylation status, clustering and correspondence analysis. The package has been developed for methylation data but can also be used for other data types for which binary coding can be inferred. The application of the package, as well as a comparison to existing DNA methylation analysis tools and its workflow based on two datasets, is presented in this paper. Conclusions: The R package MethVisual offers various analysis procedures for data that can be binarized, in particular for bisulfite sequenced methylation data. R/Bioconductor has become one of the most important environments for statistical analysis of various types of biological and medical data. Therefore, any data analysis within R that allows the integration of various data types as provided from different technological platforms is convenient. It is the first and so far the only specific package for DNA methylation analysis, in particular for bisulfite sequenced data, available in the R/Bioconductor environment. The package is available for free at http://methvisual.molgen.mpg.de/ and from the Bioconductor Consortium http://www.bioconductor.org. PMID:21159174

  13. The Targeted Sequencing of Alpha Satellite DNA in Cercopithecus pogonias Provides New Insight into the Diversity and Dynamics of Centromeric Repeats in Old World monkeys.

    PubMed

    Cacheux, Lauriane; Ponger, Loïc; Gerbault-Seureau, Michèle; Loll, François; Gey, Delphine; Richard, Florence Anne; Escudé, Christophe

    2018-06-01

    Alpha satellite is the major repeated DNA element of primate centromeres. Specific evolutionary mechanisms have led to a great diversity of sequence families with peculiar genomic organization and distribution, which have till now been studied mostly in great apes. Using high throughput sequencing of alpha satellite monomers obtained by enzymatic digestion followed by computational and cytogenetic analysis, we compare here the diversity and genomic distribution of alpha satellite DNA in two related Old World monkey species, Cercopithecus pogonias and Cercopithecus solatus, which are known to have diverged about seven million years ago. Two main families of monomers, called C1 and C2, are found in both species. A detailed analysis of our datasets revealed the existence of numerous subfamilies within the centromeric C1 family. Although the most abundant subfamily is conserved between both species, our FISH experiments clearly show that some subfamilies are specific for each species and that their distribution is restricted to a subset of chromosomes, thereby pointing to the existence of recurrent amplification/homogenization events. The pericentromeric C2 family is very abundant on the short arm of all acrocentric chromosomes in both species, pointing to specific mechanisms that lead to this distribution. Results obtained using two different restriction enzymes are fully consistent with a predominant monomeric organization of alpha satellite DNA which coexists with higher order organization patterns in the Cercopithecus pogonias genome. Our study suggests a high dynamics of alpha satellite DNA in Cercopithecini, with recurrent apparition of new sequence variants and interchromosomal sequence transfer.

  14. HMMBinder: DNA-Binding Protein Prediction Using HMM Profile Based Features.

    PubMed

    Zaman, Rianon; Chowdhury, Shahana Yasmin; Rashid, Mahmood A; Sharma, Alok; Dehzangi, Abdollah; Shatabda, Swakkhar

    2017-01-01

    DNA-binding proteins often play an important role in various processes within the cell. Over the last decade, a wide range of classification algorithms and feature extraction techniques have been used to identify them. In this paper, we propose a novel DNA-binding protein prediction method called HMMBinder. HMMBinder uses monogram and bigram features extracted from the HMM profiles of the protein sequences. To the best of our knowledge, this is the first application of HMM profile based features for the DNA-binding protein prediction problem. We applied Support Vector Machines (SVM) as a classification technique in HMMBinder. Our method was tested on standard benchmark datasets. We experimentally show that our method outperforms the state-of-the-art methods found in the literature.
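
    As a rough illustration of monogram and bigram features, the sketch below assumes the profile is an L x A matrix of per-position scores (A = 20 for amino acids), takes the monogram as the column average, and takes the bigram as the averaged product of scores at adjacent positions. These definitions follow a common convention for profile-based features and are an assumption here; the abstract does not give the exact formulas used by HMMBinder.

```python
def monogram(profile):
    """Average each column of an L x A profile, giving A features."""
    length = len(profile)
    width = len(profile[0])
    return [sum(row[j] for row in profile) / length for j in range(width)]


def bigram(profile):
    """Average the products of adjacent-position scores, giving A * A features.

    Feature (a, b) accumulates profile[i][a] * profile[i + 1][b] over all
    adjacent position pairs; dividing by L - 1 is an assumed normalization.
    """
    length = len(profile)
    if length < 2:
        raise ValueError("bigram features need at least two positions")
    width = len(profile[0])
    features = []
    for a in range(width):
        for b in range(width):
            total = sum(profile[i][a] * profile[i + 1][b] for i in range(length - 1))
            features.append(total / (length - 1))
    return features


if __name__ == "__main__":
    # Toy profile with 3 positions and a 2-letter alphabet (hypothetical data).
    toy = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]]
    print(monogram(toy))
    print(bigram(toy))
```

    Concatenating the two lists gives a fixed-length vector regardless of protein length, which is what makes such features usable as input to a classifier such as an SVM.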

  15. Identification of DNA-binding proteins using multi-features fusion and binary firefly optimization algorithm.

    PubMed

    Zhang, Jian; Gao, Bo; Chai, Haiting; Ma, Zhiqiang; Yang, Guifu

    2016-08-26

    DNA-binding proteins (DBPs) play fundamental roles in many biological processes. Therefore, the development of effective computational tools for identifying DBPs is becoming highly desirable. In this study, we proposed an accurate method for the prediction of DBPs. Firstly, we focused on the challenge of improving DBP prediction accuracy with information solely from the sequence. Secondly, we used multiple informative features to encode the protein. These features included evolutionary conservation profile, secondary structure motifs, and physicochemical properties. Thirdly, we introduced a novel improved Binary Firefly Algorithm (BFA) to remove redundant or noisy features as well as select optimal parameters for the classifier. In experiments on two benchmark datasets, our predictor outperformed many state-of-the-art predictors, which revealed the effectiveness of our method. The promising prediction performance on a newly compiled independent testing dataset from PDB and a large-scale dataset from UniProt proved the good generalization ability of our method. In addition, the BFA forged in this research would be of great potential in practical applications in optimization fields, especially in feature selection problems. A highly accurate method was proposed for the identification of DBPs. A user-friendly web-server named iDbP (identification of DNA-binding Proteins) was constructed and provided for academic use.

  16. Fine Dissection of Human Mitochondrial DNA Haplogroup HV Lineages Reveals Paleolithic Signatures from European Glacial Refugia

    PubMed Central

    Sarno, Stefania; Sevini, Federica; Vianello, Dario; Tamm, Erika; Metspalu, Ene; van Oven, Mannis; Hübner, Alexander; Sazzini, Marco; Franceschi, Claudio; Pettener, Davide; Luiselli, Donata

    2015-01-01

    Genetic signatures from the Paleolithic inhabitants of Eurasia can be traced from the early divergent mitochondrial DNA lineages still present in contemporary human populations. Previous studies already suggested a pre-Neolithic diffusion of mitochondrial haplogroup HV*(xH,V) lineages, a relatively rare class of mtDNA types that includes parallel branches mainly distributed across Europe and West Asia with a certain degree of structure. Up till now, variation within haplogroup HV was addressed mainly by analyzing sequence data from the mtDNA control region, except for specific sub-branches, such as HV4 or the widely distributed haplogroups H and V. In this study, we present a revised HV topology based on full mtDNA genome data, and we include a comprehensive dataset consisting of 316 complete mtDNA sequences including 60 new samples from the Italian peninsula, a previously underrepresented geographic area. We highlight points of instability in the particular topology of this haplogroup, reconstructed with BEAST-generated trees and networks. We also confirm a major lineage expansion that probably followed the Late Glacial Maximum and preceded Neolithic population movements. We finally observe that Italy harbors a reservoir of mtDNA diversity, with deep-rooting HV lineages often related to sequences present in the Caucasus and the Middle East. The resulting hypothesis of a glacial refugium in Southern Italy has implications for the understanding of late Paleolithic population movements and is discussed within the archaeological cultural shifts occurred over the entire continent. PMID:26640946

  17. Extending SEQenv: a taxa-centric approach to environmental annotations of 16S rDNA sequences

    PubMed Central

    Jeffries, Thomas C.; Ijaz, Umer Z.; Hamonts, Kelly

    2017-01-01

    Understanding how the environment selects a given taxon and the diversity patterns that emerge as a result of environmental filtering can dramatically improve our ability to analyse any environment in depth as well as advancing our knowledge on how the response of different taxa can impact each other and ecosystem functions. Most of the work investigating microbial biogeography has been site-specific, and logical environmental factors, rather than geographical location, may be more influential on microbial diversity. SEQenv, a novel pipeline aiming to provide environmental annotations of sequences emerged to provide a consistent description of the environmental niches using the ENVO ontology. While the pipeline provides a list of environmental terms on the basis of sample datasets and, therefore, the annotations obtained are at the dataset level, it lacks a taxa centric approach to environmental annotation. The work here describes an extension developed to enhance the SEQenv pipeline, which provided the means to directly generate environmental annotations for taxa under different contexts. 16S rDNA amplicon datasets belonging to distinct biomes were selected to illustrate the applicability of the extended SEQenv pipeline. A literature survey of the results demonstrates the immense importance of sequence level environmental annotations by illustrating the distribution of both taxa across environments as well as the various environmental sources of a specific taxon. Significantly enhancing the SEQenv pipeline in the process, this information would be valuable to any biologist seeking to understand the various taxa present in the habitat and the environment they originated from, enabling a more thorough analysis of which lineages are abundant in certain habitats and the recovery of patterns in taxon distribution across different habitats and environmental gradients. PMID:29038749

  18. Does a global DNA barcoding gap exist in Annelida?

    PubMed

    Kvist, Sebastian

    2016-05-01

    Accurate identification of unknown specimens by means of DNA barcoding is contingent on the presence of a DNA barcoding gap, among other factors, as its absence may result in dubious specimen identifications - false negatives or positives. Whereas the utility of DNA barcoding would be greatly reduced in the absence of a distinct and sufficiently sized barcoding gap, the limits of intraspecific and interspecific distances are seldom thoroughly inspected across comprehensive sampling. The present study aims to illuminate this aspect of barcoding in a comprehensive manner for the animal phylum Annelida. All cytochrome c oxidase subunit I sequences (cox1 gene; the chosen region for zoological DNA barcoding) present in GenBank for Annelida, as well as for "Polychaeta", "Oligochaeta", and Hirudinea separately, were downloaded and curated for length, coverage and potential contaminations. The final datasets consisted of 9782 (Annelida), 5545 ("Polychaeta"), 3639 ("Oligochaeta"), and 598 (Hirudinea) cox1 sequences and these were either (i) used as is in an automated global barcoding gap detection analysis or (ii) further analyzed for genetic distances, separated into bins containing intraspecific and interspecific comparisons and plotted in a graph to visualize any potential global barcoding gap. Over 70 million pairwise genetic comparisons were made and results suggest that although there is a tendency towards separation, no distinct or sufficiently sized global barcoding gap exists in either of the datasets rendering future barcoding efforts at risk of erroneous specimen identifications (but local barcoding gaps may still exist allowing for the identification of specimens at lower taxonomic ranks). This seems to be especially true for earthworm taxa, which account for fully 35% of the total number of interspecific comparisons that show 0% divergence.
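
    The second analysis route in this abstract (pairwise distances binned into intraspecific and interspecific comparisons) can be sketched as follows. The uncorrected p-distance, the toy records, and the function names are assumptions for illustration; the abstract does not state which distance measure or software was used.

```python
from itertools import combinations


def p_distance(a, b):
    """Proportion of differing sites between two aligned sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b)) / len(a)


def barcoding_gap(records):
    """Bin pairwise distances and test for a gap.

    records is a list of (species_label, aligned_sequence) tuples.
    Returns (largest intraspecific distance, smallest interspecific
    distance, True if the interspecific minimum exceeds the
    intraspecific maximum).
    """
    intra, inter = [], []
    for (sp1, s1), (sp2, s2) in combinations(records, 2):
        target = intra if sp1 == sp2 else inter
        target.append(p_distance(s1, s2))
    if not intra or not inter:
        raise ValueError("need both intraspecific and interspecific pairs")
    return max(intra), min(inter), min(inter) > max(intra)


if __name__ == "__main__":
    # Hypothetical toy alignment with two species, two specimens each.
    toy = [
        ("sp_a", "ACGTACGT"),
        ("sp_a", "ACGTACGA"),
        ("sp_b", "TCGAACCT"),
        ("sp_b", "TCGAACCA"),
    ];
    print(barcoding_gap(toy))
```

    A global gap in this sense requires every interspecific distance to exceed every intraspecific one, which is why a single pair of species with 0% divergence is enough to close it.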

  19. Detection of Splice Sites Using Support Vector Machine

    NASA Astrophysics Data System (ADS)

    Varadwaj, Pritish; Purohit, Neetesh; Arora, Bhumika

    Automatic identification and annotation of exon and intron regions of genes from DNA sequences has been an important research area in the field of computational biology. Several approaches, viz. Hidden Markov Model (HMM), Artificial Intelligence (AI) based machine learning and Digital Signal Processing (DSP) techniques, have been used extensively and independently by various researchers to address this challenging task. In this work, we propose a Support Vector Machine based kernel learning approach for detection of splice sites (the exon-intron boundary) in a gene. Electron-Ion Interaction Potential (EIIP) values of nucleotides have been used for mapping character sequences to corresponding numeric sequences. A Radial Basis Function (RBF) kernel SVM is trained using the EIIP numeric sequences. The classifier was then tested on a test gene dataset for detection of splice sites by shifting a window of 12 residues. The window size and other important parameters of the SVM kernel have been optimized for better accuracy. Receiver Operating Characteristic (ROC) curves have been used to display the sensitivity rate of the classifier, and results showed 94.82% accuracy for splice site detection on the test dataset.
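
    The encoding and window-shifting steps described here can be sketched as below. The EIIP values are the commonly cited ones for the four nucleotides and are not given in the abstract; the function names and the step size of one residue are assumptions.

```python
# Commonly cited EIIP values for nucleotides (assumed; not stated in the abstract).
EIIP = {"A": 0.1260, "C": 0.1340, "G": 0.0806, "T": 0.1335}


def encode(sequence):
    """Map a DNA string to its EIIP numeric sequence."""
    return [EIIP[base] for base in sequence.upper()]


def sliding_windows(sequence, size=12, step=1):
    """Return EIIP-encoded windows of the given size, shifted by step."""
    values = encode(sequence)
    return [values[i:i + size] for i in range(0, len(values) - size + 1, step)]


if __name__ == "__main__":
    windows = sliding_windows("ACGTACGTACGTACGT")
    # Each window is a fixed-length numeric vector that a classifier could score.
    print(len(windows), len(windows[0]))
```

    Each window is a 12-dimensional feature vector; in a setup like the one described, such vectors would be passed to an RBF-kernel SVM that labels the window as containing a splice site or not.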

  20. X-MATE: a flexible system for mapping short read data

    PubMed Central

    Pearson, John V.; Cloonan, Nicole; Grimmond, Sean M.

    2011-01-01

    Summary: Accurate and complete mapping of short-read sequencing to a reference genome greatly enhances the discovery of biological results and improves statistical predictions. We recently presented RNA-MATE, a pipeline for the recursive mapping of RNA-Seq datasets. With the rapid increase in genome re-sequencing projects, progression of available mapping software and the evolution of file formats, we now present X-MATE, an updated version of RNA-MATE, capable of mapping both RNA-Seq and DNA datasets and with improved performance, output file formats, configuration files, and flexibility in core mapping software. Availability: Executables, source code, junction libraries, test data and results and the user manual are available from http://grimmond.imb.uq.edu.au/X-MATE/. Contact: n.cloonan@uq.edu.au; s.grimmond@uq.edu.au Supplementary information: Supplementary data are available at Bioinformatics Online. PMID:21216778

  1. Detecting exact breakpoints of deletions with diversity in hepatitis B viral genomic DNA from next-generation sequencing data.

    PubMed

    Cheng, Ji-Hong; Liu, Wen-Chun; Chang, Ting-Tsung; Hsieh, Sun-Yuan; Tseng, Vincent S

    2017-10-01

    Many studies have suggested that deletions in the hepatitis B virus (HBV) genome are associated with the development of progressive liver diseases, even ultimately resulting in hepatocellular carcinoma (HCC). Among the methods for detecting deletions from next-generation sequencing (NGS) data, few consider the characteristics of viruses, such as high evolution rates and high divergence among the different HBV genomes. Sequencing highly divergent HBV genomes using NGS technology outputs millions of reads. Thus, detecting exact breakpoints of deletions from these big and complex data incurs very high computational cost. We proposed a novel analytical method named VirDelect (Virus Deletion Detect), which uses split-read alignment to detect exact breakpoints and a diversity variable to account for high divergence in single-end read data, such that the computational cost can be reduced without losing accuracy. We use four simulated read datasets and two real paired-end read datasets of HBV genome sequences to verify VirDelect accuracy by score functions. The experimental results show that VirDelect outperforms the state-of-the-art method Pindel in terms of accuracy score for all simulated datasets, and VirDelect had only two base errors even in real datasets. VirDelect is also shown to deliver high accuracy in analyzing single-end read data as well as paired-end data. VirDelect can serve as an effective and efficient bioinformatics tool for physiologists, with high accuracy and efficient performance, and is applicable to further analysis of viruses with characteristics similar to HBV in genome length and high divergence. The software program of VirDelect can be downloaded at https://sourceforge.net/projects/virdelect/. Copyright © 2017. Published by Elsevier Inc.

  2. DOE Office of Scientific and Technical Information (OSTI.GOV)

    Ruggles, Kelly V.; Tang, Zuojian; Wang, Xuya

    Improvements in mass spectrometry (MS)-based peptide sequencing provide a new opportunity to determine whether polymorphisms, mutations and splice variants identified in cancer cells are translated. Herein we therefore describe a proteogenomic data integration tool (QUILTS) and illustrate its application to whole genome, transcriptome and global MS peptide sequence datasets generated from a pair of luminal and basal-like breast cancer patient derived xenografts (PDX). The sensitivity of proteogenomic analysis for single nucleotide variant (SNV) expression and novel splice junction (NSJ) detection was probed using multiple MS/MS process replicates. Despite over thirty sample replicates, only about 10% of all SNV (somatic and germline) detected by both DNA and RNA sequencing were observed as peptides. An even smaller proportion of peptides corresponding to NSJ observed by RNA sequencing were detected (<0.1%). Peptides mapping to DNA-detected SNV without a detectable mRNA transcript were also observed, demonstrating that the transcriptome coverage was also incomplete (~80%). In contrast to germ-line variants, somatic variants were less likely to be detected at the peptide level in the basal-like tumor than the luminal tumor, raising the possibility of differential translation or protein degradation effects. In conclusion, the QUILTS program integrates DNA, RNA and peptide sequencing to assess the degree to which somatic mutations are translated and therefore biologically active. By identifying gaps in sequence coverage, QUILTS benchmarks current technology and assesses progress towards whole cancer proteome and transcriptome analysis.

  3. Characterization of unknown genetic modifications using high throughput sequencing and computational subtraction.

    PubMed

    Tengs, Torstein; Zhang, Haibo; Holst-Jensen, Arne; Bohlin, Jon; Butenko, Melinka A; Kristoffersen, Anja Bråthen; Sorteberg, Hilde-Gunn Opsahl; Berdal, Knut G

    2009-10-08

    When generating a genetically modified organism (GMO), the primary goal is to give a target organism one or several novel traits by using biotechnology techniques. A GMO will differ from its parental strain in that its pool of transcripts will be altered. Currently, there are no methods that are reliably able to determine if an organism has been genetically altered if the nature of the modification is unknown. We show that the concept of computational subtraction can be used to identify transgenic cDNA sequences from genetically modified plants. Our datasets include 454-type sequences from a transgenic line of Arabidopsis thaliana and published EST datasets from commercially relevant species (rice and papaya). We believe that computational subtraction represents a powerful new strategy for determining if an organism has been genetically modified as well as to define the nature of the modification. Fewer assumptions have to be made compared to methods currently in use and this is an advantage particularly when working with unknown GMOs.
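
    The idea of computational subtraction can be sketched as filtering out every read that is explained by known sequences of the unmodified organism, so that candidate transgenic sequences remain. In the sketch below, shared k-mer content stands in for the similarity search, which the abstract does not specify; the function names, the k-mer length, and the threshold are all illustrative assumptions.

```python
def kmers(sequence, k):
    """Return the set of all substrings of length k in the sequence."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


def subtract(reads, reference_sequences, k=8, max_shared=0.5):
    """Keep reads whose k-mers are mostly absent from the reference set.

    A read is discarded when more than max_shared of its k-mers occur in
    the reference sequences (assumed to come from the parental organism).
    """
    reference_kmers = set()
    for sequence in reference_sequences:
        reference_kmers |= kmers(sequence, k)

    remaining = []
    for read in reads:
        read_kmers = kmers(read, k)
        if not read_kmers:
            continue  # read shorter than k carries no usable signal here
        shared = len(read_kmers & reference_kmers) / len(read_kmers)
        if shared <= max_shared:
            remaining.append(read)
    return remaining


if __name__ == "__main__":
    # Hypothetical toy data: one read copied from the reference, one foreign read.
    reference = ["ACGTACGTTAGCCGATAGCTAGGCTA"]
    reads = [reference[0][2:20], "TTTTTTTTGGGGGGGGCCCCCCCC"]
    print(subtract(reads, reference))
```

    Whatever survives the subtraction is only a candidate; in practice the remaining reads would still need to be assembled and checked against sequence databases before being called transgenic.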

  4. Characterization of unknown genetic modifications using high throughput sequencing and computational subtraction

    PubMed Central

    Tengs, Torstein; Zhang, Haibo; Holst-Jensen, Arne; Bohlin, Jon; Butenko, Melinka A; Kristoffersen, Anja Bråthen; Sorteberg, Hilde-Gunn Opsahl; Berdal, Knut G

    2009-01-01

    Background: When generating a genetically modified organism (GMO), the primary goal is to give a target organism one or several novel traits by using biotechnology techniques. A GMO will differ from its parental strain in that its pool of transcripts will be altered. Currently, there are no methods that are reliably able to determine if an organism has been genetically altered if the nature of the modification is unknown. Results: We show that the concept of computational subtraction can be used to identify transgenic cDNA sequences from genetically modified plants. Our datasets include 454-type sequences from a transgenic line of Arabidopsis thaliana and published EST datasets from commercially relevant species (rice and papaya). Conclusion: We believe that computational subtraction represents a powerful new strategy for determining if an organism has been genetically modified as well as to define the nature of the modification. Fewer assumptions have to be made compared to methods currently in use and this is an advantage particularly when working with unknown GMOs. PMID:19814792

  5. An empirical evaluation of two-stage species tree inference strategies using a multilocus dataset from North American pines

    Treesearch

    Michael DeGiorgio; John Syring; Andrew J. Eckert; Aaron Liston; Richard Cronn; David B. Neale; Noah A. Rosenberg

    2014-01-01

    Background: As it becomes increasingly possible to obtain DNA sequences of orthologous genes from diverse sets of taxa, species trees are frequently being inferred from multilocus data. However, the behavior of many methods for performing this inference has remained largely unexplored. Some methods have been proven to be consistent given certain evolutionary models,...

  6. Vaginal microbial flora analysis by next generation sequencing and microarrays; can microbes indicate vaginal origin in a forensic context?

    PubMed

    Benschop, Corina C G; Quaak, Frederike C A; Boon, Mathilde E; Sijen, Titia; Kuiper, Irene

    2012-03-01

    Forensic analysis of biological traces generally encompasses the investigation of both the person who contributed to the trace and the body site(s) from which the trace originates. For instance, for sexual assault cases, it can be beneficial to distinguish vaginal samples from skin or saliva samples. In this study, we explored the use of microbial flora to indicate vaginal origin. First, we explored the vaginal microbiome for a large set of clinical vaginal samples (n = 240) by next generation sequencing (n = 338,184 sequence reads) and found 1,619 different sequences. Next, we selected 389 candidate probes targeting genera or species and designed a microarray, with which we analysed a diverse set of samples; 43 DNA extracts from vaginal samples and 25 DNA extracts from samples from other body sites, including sites in close proximity of or in contact with the vagina. Finally, we used the microarray results and next generation sequencing dataset to assess the potential for a future approach that uses microbial markers to indicate vaginal origin. Since no candidate genera/species were found to positively identify all vaginal DNA extracts on their own, while excluding all non-vaginal DNA extracts, we deduce that a reliable statement about the cellular origin of a biological trace should be based on the detection of multiple species within various genera. Microarray analysis of a sample will then render a microbial flora pattern that is probably best analysed in a probabilistic approach.

  7. Use of the melting curve assay as a means for high-throughput quantification of Illumina sequencing libraries.

    PubMed

    Shinozuka, Hiroshi; Forster, John W

    2016-01-01

    Background. Multiplexed sequencing is commonly performed on massively parallel short-read sequencing platforms such as Illumina, and the efficiency of library normalisation can affect the quality of the output dataset. Although several library normalisation approaches have been established, none are ideal for highly multiplexed sequencing due to issues of cost and/or processing time. Methods. An inexpensive and high-throughput library quantification method has been developed, based on an adaptation of the melting curve assay. Sequencing libraries were subjected to the assay using the Bio-Rad Laboratories CFX Connect™ Real-Time PCR Detection System. The library quantity was calculated through summation of reduction of relative fluorescence units between 86 and 95 °C. Results. PCR-enriched sequencing libraries are suitable for this quantification without pre-purification of DNA. Short DNA molecules, which ideally should be eliminated from the library for subsequent processing, were differentiated from the target DNA in a mixture on the basis of differences in melting temperature. Quantification results for long sequences targeted using the melting curve assay were correlated with those from existing methods (R² > 0.77), and that observed from MiSeq sequencing (R² = 0.82). Discussion. The results of multiplexed sequencing suggested that the normalisation performance of the described method is equivalent to that of another recently reported high-throughput bead-based method, BeNUS. However, costs for the melting curve assay are considerably lower and processing times shorter than those of other existing methods, suggesting greater suitability for highly multiplexed sequencing applications.
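
    The quantification rule stated in the Methods (summing the reduction in relative fluorescence units between 86 and 95 °C) can be sketched as follows. The function name, the input layout, and the choice to ignore intervals where fluorescence rises are assumptions for illustration, not details from the abstract.

```python
def library_signal(temperatures, rfu, low=86.0, high=95.0):
    """Sum the drops in fluorescence over readings that fall within [low, high].

    temperatures and rfu are parallel lists ordered by increasing temperature.
    Only intervals lying entirely inside the temperature range are counted,
    and intervals where fluorescence rises are ignored (an assumption).
    """
    if len(temperatures) != len(rfu):
        raise ValueError("temperatures and rfu must have the same length")
    total = 0.0
    for i in range(1, len(temperatures)):
        inside = temperatures[i - 1] >= low and temperatures[i] <= high
        drop = rfu[i - 1] - rfu[i]
        if inside and drop > 0:
            total += drop
    return total


if __name__ == "__main__":
    # Hypothetical melting curve readings for one library.
    temps = [84.0, 86.0, 88.0, 90.0, 92.0, 95.0, 96.0]
    signal = [1000, 950, 800, 500, 300, 250, 240]
    print(library_signal(temps, signal))
```

    Restricting the sum to the upper temperature range is what lets short fragments, which melt at lower temperatures, be excluded from the library quantity.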

  8. Efficient alignment-free DNA barcode analytics.

    PubMed

    Kuksa, Pavel; Pavlovic, Vladimir

    2009-11-10

    In this work we consider barcode DNA analysis problems and address them using alternative, alignment-free methods and representations which model sequences as collections of short sequence fragments (features). The methods use fixed-length representations (spectrum) for barcode sequences to measure similarities or dissimilarities between sequences coming from the same or different species. The spectrum-based representation not only allows for accurate and computationally efficient species classification, but also opens possibility for accurate clustering analysis of putative species barcodes and identification of critical within-barcode loci distinguishing barcodes of different sample groups. New alignment-free methods provide highly accurate and fast DNA barcode-based identification and classification of species with substantial improvements in accuracy and speed over state-of-the-art barcode analysis methods. We evaluate our methods on problems of species classification and identification using barcodes, important and relevant analytical tasks in many practical applications (adverse species movement monitoring, sampling surveys for unknown or pathogenic species identification, biodiversity assessment, etc.) On several benchmark barcode datasets, including ACG, Astraptes, Hesperiidae, Fish larvae, and Birds of North America, proposed alignment-free methods considerably improve prediction accuracy compared to prior results. We also observe significant running time improvements over the state-of-the-art methods. Our results show that newly developed alignment-free methods for DNA barcoding can efficiently and with high accuracy identify specimens by examining only few barcode features, resulting in increased scalability and interpretability of current computational approaches to barcoding.
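
    The spectrum representation described here models a sequence as counts of its short fragments, so two barcodes can be compared without alignment. The sketch below assumes k-mer counts as the spectrum, cosine similarity as the comparison, and nearest-reference assignment as the classifier; these are generic choices for illustration and are not claimed to be the authors' exact kernels or methods.

```python
from collections import Counter
from math import sqrt


def spectrum(sequence, k=3):
    """Count every length-k fragment of the sequence (its k-mer spectrum)."""
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))


def cosine(u, v):
    """Cosine similarity between two k-mer count vectors."""
    dot = sum(u[key] * v[key] for key in u.keys() & v.keys())
    norm = sqrt(sum(c * c for c in u.values())) * sqrt(sum(c * c for c in v.values()))
    return dot / norm if norm else 0.0


def classify(query, references, k=3):
    """Return the reference label whose spectrum is most similar to the query."""
    query_spectrum = spectrum(query, k)
    return max(
        references,
        key=lambda label: cosine(query_spectrum, spectrum(references[label], k)),
    )


if __name__ == "__main__":
    # Hypothetical reference barcodes, one per species.
    refs = {"species_a": "ACGTACGTACGT", "species_b": "TTTTGGGGCCCC"}
    print(classify("ACGTACGAACGT", refs))
```

    Because the spectrum has a fixed set of possible features regardless of sequence length, the same vectors can also feed standard classifiers or clustering methods, and individual high-weight k-mers point back to loci that distinguish groups.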

  9. Training alignment parameters for arbitrary sequencers with LAST-TRAIN.

    PubMed

    Hamada, Michiaki; Ono, Yukiteru; Asai, Kiyoshi; Frith, Martin C

    2017-03-15

    LAST-TRAIN improves sequence alignment accuracy by inferring substitution and gap scores that fit the frequencies of substitutions, insertions, and deletions in a given dataset. We have applied it to mapping DNA reads from IonTorrent and PacBio RS, and we show that it reduces reference bias for Oxford Nanopore reads. The source code is freely available at http://last.cbrc.jp/. Contact: mhamada@waseda.jp or mcfrith@edu.k.u-tokyo.ac.jp. Supplementary data are available at Bioinformatics online. © The Author 2016. Published by Oxford University Press.

  10. Quantum annealing versus classical machine learning applied to a simplified computational biology problem

    PubMed Central

    Li, Richard Y.; Di Felice, Rosa; Rohs, Remo; Lidar, Daniel A.

    2018-01-01

    Transcription factors regulate gene expression, but how these proteins recognize and specifically bind to their DNA targets is still debated. Machine learning models are effective means to reveal interaction mechanisms. Here we studied the ability of a quantum machine learning approach to predict binding specificity. Using simplified datasets of a small number of DNA sequences derived from actual binding affinity experiments, we trained a commercially available quantum annealer to classify and rank transcription factor binding. The results were compared to state-of-the-art classical approaches for the same simplified datasets, including simulated annealing, simulated quantum annealing, multiple linear regression, LASSO, and extreme gradient boosting. Despite technological limitations, we find a slight advantage in classification performance and nearly equal ranking performance using the quantum annealer for these fairly small training data sets. Thus, we propose that quantum annealing might be an effective method to implement machine learning for certain computational biology problems. PMID:29652405

  11. Genome-wide characterization of centromeric satellites from multiple mammalian genomes.

    PubMed

    Alkan, Can; Cardone, Maria Francesca; Catacchio, Claudia Rita; Antonacci, Francesca; O'Brien, Stephen J; Ryder, Oliver A; Purgato, Stefania; Zoli, Monica; Della Valle, Giuliano; Eichler, Evan E; Ventura, Mario

    2011-01-01

    Despite its importance in cell biology and evolution, the centromere has remained the final frontier in genome assembly and annotation due to its complex repeat structure. However, isolation and characterization of the centromeric repeats from newly sequenced species are necessary for a complete understanding of genome evolution and function. In recent years, various genomes have been sequenced, but the characterization of the corresponding centromeric DNA has lagged behind. Here, we present a computational method (RepeatNet) to systematically identify higher-order repeat structures from unassembled whole-genome shotgun sequence and test whether these sequence elements correspond to functional centromeric sequences. We analyzed genome datasets from six species of mammals representing the diversity of the mammalian lineage, namely, horse, dog, elephant, armadillo, opossum, and platypus. We define candidate monomer satellite repeats and demonstrate centromeric localization for five of the six genomes. Our analysis revealed the greatest diversity of centromeric sequences in horse and dog in contrast to elephant and armadillo, which showed high centromeric sequence homogeneity. We could not isolate centromeric sequences within the platypus genome, suggesting that centromeres in platypus are not enriched in satellite DNA. Our method can be applied to the characterization of thousands of other vertebrate genomes anticipated for sequencing in the near future, providing an important tool for annotation of centromeres.

  12. Capturing chloroplast variation for molecular ecology studies: a simple next generation sequencing approach applied to a rainforest tree

    PubMed Central

    2013-01-01

    Background: With high quantity and quality data production and low cost, next generation sequencing has the potential to provide new opportunities for plant phylogeographic studies on single and multiple species. Here we present an approach for in silico chloroplast DNA assembly and single nucleotide polymorphism detection from short-read shotgun sequencing. The approach is simple and effective and can be implemented using standard bioinformatic tools. Results: The chloroplast genome of Toona ciliata (Meliaceae), 159,514 base pairs long, was assembled from shotgun sequencing on the Illumina platform using de novo assembly of contigs. To evaluate its practicality, value and quality, we compared the short read assembly with an assembly completed using 454 data obtained after chloroplast DNA isolation. Sanger sequence verifications indicated that the Illumina dataset outperformed the longer read 454 data. Pooling of several individuals during preparation of the shotgun library enabled detection of informative chloroplast SNP markers. Following validation, we used the identified SNPs for a preliminary phylogeographic study of T. ciliata in Australia and to confirm low diversity across the distribution. Conclusions: Our approach provides a simple method for construction of whole chloroplast genomes from shotgun sequencing of whole genomic DNA using short-read data and no available closely related reference genome (e.g. from the same species or genus). The high coverage of Illumina sequence data also renders this method appropriate for multiplexing and SNP discovery and therefore a useful approach for landscape level studies of evolutionary ecology. PMID:23497206

  13. A phylogenetic study of Laeliinae (Orchidaceae) based on combined nuclear and plastid DNA sequences

    PubMed Central

    van den Berg, Cássio; Higgins, Wesley E.; Dressler, Robert L.; Whitten, W. Mark; Soto-Arenas, Miguel A.; Chase, Mark W.

    2009-01-01

    Background and Aims: Laeliinae are a neotropical orchid subtribe with approx. 1500 species in 50 genera. In this study, an attempt is made to assess generic alliances based on molecular phylogenetic analysis of DNA sequence data. Methods: Six DNA datasets were gathered: plastid trnL intron, trnL-F spacer, matK gene and trnK introns upstream and downstream from matK and nuclear ITS rDNA. Data were analysed with maximum parsimony (MP) and Bayesian analysis with mixed models (BA). Key Results: Although relationships between Laeliinae and outgroups are well supported, within the subtribe sequence variation is low considering the broad taxonomic range covered. Localized incongruence between the ITS and plastid trees was found. A combined tree followed the ITS trees more closely, but the levels of support obtained with MP were low. The Bayesian analysis recovered more well-supported nodes. The trees from combined MP and BA allowed eight generic alliances to be recognized within Laeliinae, all of which show trends in morphological characters but lack unambiguous synapomorphies. Conclusions: By using combined plastid and nuclear DNA data in conjunction with mixed-models Bayesian inference, it is possible to delimit smaller groups within Laeliinae and discuss general patterns of pollination and hybridization compatibility. Furthermore, these small groups can now be used for further detailed studies to explain morphological evolution and diversification patterns within the subtribe. PMID:19423551

  14. Discriminative Prediction of A-To-I RNA Editing Events from DNA Sequence

    PubMed Central

    Sun, Jiangming; Singh, Pratibha; Bagge, Annika; Valtat, Bérengère; Vikman, Petter; Spégel, Peter; Mulder, Hindrik

    2016-01-01

    RNA editing is a post-transcriptional alteration of RNA sequences that, via insertions, deletions or base substitutions, can affect protein structure as well as RNA and protein expression. Recently, it has been suggested that RNA editing may be more frequent than previously thought. A great impediment, however, to a deeper understanding of this process is the paramount sequencing effort that needs to be undertaken to identify RNA editing events. Here, we describe an in silico approach, based on machine learning, that ameliorates this problem. Using 41 nucleotide long DNA sequences, we show that novel A-to-I RNA editing events can be predicted from known A-to-I RNA editing events intra- and interspecies. The validity of the proposed method was verified in an independent experimental dataset. Using our approach, 203 202 putative A-to-I RNA editing events were predicted in the whole human genome. Out of these, 9% were previously reported. The remaining sites require further validation, e.g., by targeted deep sequencing. In conclusion, the approach described here is a useful tool to identify potential A-to-I RNA editing events without the requirement of extensive RNA sequencing. PMID:27764195

  15. VarDict: a novel and versatile variant caller for next-generation sequencing in cancer research

    PubMed Central

    Lai, Zhongwu; Markovets, Aleksandra; Ahdesmaki, Miika; Chapman, Brad; Hofmann, Oliver; McEwen, Robert; Johnson, Justin; Dougherty, Brian; Barrett, J. Carl; Dry, Jonathan R.

    2016-01-01

    Accurate variant calling in next generation sequencing (NGS) is critical to understand cancer genomes better. Here we present VarDict, a novel and versatile variant caller for both DNA- and RNA-sequencing data. VarDict simultaneously calls SNV, MNV, InDels, complex and structural variants, expanding the detected genetic driver landscape of tumors. It performs local realignments on the fly for more accurate allele frequency estimation. VarDict performance scales linearly to sequencing depth, enabling ultra-deep sequencing used to explore tumor evolution or detect tumor DNA circulating in blood. In addition, VarDict performs amplicon aware variant calling for polymerase chain reaction (PCR)-based targeted sequencing often used in diagnostic settings, and is able to detect PCR artifacts. Finally, VarDict also detects differences in somatic and loss of heterozygosity variants between paired samples. VarDict reprocessing of The Cancer Genome Atlas (TCGA) Lung Adenocarcinoma dataset called known driver mutations in KRAS, EGFR, BRAF, PIK3CA and MET in 16% more patients than previously published variant calls. We believe VarDict will greatly facilitate application of NGS in clinical cancer research. PMID:27060149

  16. Adaptable gene-specific dye bias correction for two-channel DNA microarrays.

    PubMed

    Margaritis, Thanasis; Lijnzaad, Philip; van Leenen, Dik; Bouwmeester, Diane; Kemmeren, Patrick; van Hooff, Sander R; Holstege, Frank C P

    2009-01-01

    DNA microarray technology is a powerful tool for monitoring gene expression or for finding the location of DNA-bound proteins. DNA microarrays can suffer from gene-specific dye bias (GSDB), causing some probes to be affected more by the dye than by the sample. This results in large measurement errors, which vary considerably for different probes and also across different hybridizations. GSDB is not corrected by conventional normalization and has been difficult to address systematically because of its variance. We show that GSDB is influenced by label incorporation efficiency, explaining the variation of GSDB across different hybridizations. A correction method (Gene- And Slide-Specific Correction, GASSCO) is presented, whereby sequence-specific corrections are modulated by the overall bias of individual hybridizations. GASSCO outperforms earlier methods and works well on a variety of publicly available datasets covering a range of platforms, organisms and applications, including ChIP on chip. A sequence-based model is also presented, which predicts which probes will suffer most from GSDB, useful for microarray probe design and correction of individual hybridizations. Software implementing the method is publicly available.

  17. Adaptable gene-specific dye bias correction for two-channel DNA microarrays

    PubMed Central

    Margaritis, Thanasis; Lijnzaad, Philip; van Leenen, Dik; Bouwmeester, Diane; Kemmeren, Patrick; van Hooff, Sander R; Holstege, Frank CP

    2009-01-01

    DNA microarray technology is a powerful tool for monitoring gene expression or for finding the location of DNA-bound proteins. DNA microarrays can suffer from gene-specific dye bias (GSDB), causing some probes to be affected more by the dye than by the sample. This results in large measurement errors, which vary considerably for different probes and also across different hybridizations. GSDB is not corrected by conventional normalization and has been difficult to address systematically because of its variance. We show that GSDB is influenced by label incorporation efficiency, explaining the variation of GSDB across different hybridizations. A correction method (Gene- And Slide-Specific Correction, GASSCO) is presented, whereby sequence-specific corrections are modulated by the overall bias of individual hybridizations. GASSCO outperforms earlier methods and works well on a variety of publicly available datasets covering a range of platforms, organisms and applications, including ChIP on chip. A sequence-based model is also presented, which predicts which probes will suffer most from GSDB, useful for microarray probe design and correction of individual hybridizations. Software implementing the method is publicly available. PMID:19401678

  18. Quantification of DNA cleavage specificity in Hi-C experiments.

    PubMed

    Meluzzi, Dario; Arya, Gaurav

    2016-01-08

    Hi-C experiments produce large numbers of DNA sequence read pairs that are typically analyzed to deduce genomewide interactions between arbitrary loci. A key step in these experiments is the cleavage of cross-linked chromatin with a restriction endonuclease. Although this cleavage should happen specifically at the enzyme's recognition sequence, an unknown proportion of cleavage events may involve other sequences, owing to the enzyme's star activity or to random DNA breakage. A quantitative estimation of these non-specific cleavages may enable simulating realistic Hi-C read pairs for validation of downstream analyses, monitoring the reproducibility of experimental conditions and investigating biophysical properties that correlate with DNA cleavage patterns. Here we describe a computational method for analyzing Hi-C read pairs to estimate the fractions of cleavages at different possible targets. The method relies on expressing an observed local target distribution downstream of aligned reads as a linear combination of known conditional local target distributions. We validated this method using Hi-C read pairs obtained by computer simulation. Application of the method to experimental Hi-C datasets from murine cells revealed interesting similarities and differences in patterns of cleavage across the various experiments considered. © The Author(s) 2015. Published by Oxford University Press on behalf of Nucleic Acids Research.
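
    The linear-combination idea can be illustrated in its simplest form, with only two candidate cleavage targets. This is a hedged sketch rather than the authors' method: it assumes an observed distribution that is a mixture of one "specific" and one "non-specific" reference distribution, and it estimates the mixing fraction by least squares. All distributions and the name `estimate_specific_fraction` are hypothetical.

```python
def estimate_specific_fraction(observed, specific, nonspecific):
    """Least-squares estimate of f in: observed = f*specific + (1-f)*nonspecific.

    Each argument is a list of probabilities over the same bins.
    The result is clamped to the valid range [0, 1].
    """
    # Project (observed - nonspecific) onto the direction (specific - nonspecific).
    num = sum((o - n) * (s - n) for o, s, n in zip(observed, specific, nonspecific))
    den = sum((s - n) ** 2 for s, n in zip(specific, nonspecific))
    if den == 0:
        raise ValueError("reference distributions are identical")
    return min(1.0, max(0.0, num / den))

# Hypothetical reference distributions over three bins.
specific = [0.70, 0.20, 0.10]
nonspecific = [0.25, 0.35, 0.40]
# Synthetic observation built from 80% specific and 20% non-specific cleavage.
observed = [0.8 * s + 0.2 * n for s, n in zip(specific, nonspecific)]

fraction = estimate_specific_fraction(observed, specific, nonspecific)
print(round(fraction, 3))
```

    With more than two targets the same fit becomes a constrained least-squares problem over several coefficients that must be non-negative and sum to one.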

  19. Binary Interval Search: a scalable algorithm for counting interval intersections

    PubMed Central

    Layer, Ryan M.; Skadron, Kevin; Robins, Gabriel; Hall, Ira M.; Quinlan, Aaron R.

    2013-01-01

    Motivation: The comparison of diverse genomic datasets is fundamental to understand genome biology. Researchers must explore many large datasets of genome intervals (e.g. genes, sequence alignments) to place their experimental results in a broader context and to make new discoveries. Relationships between genomic datasets are typically measured by identifying intervals that intersect, that is, they overlap and thus share a common genome interval. Given the continued advances in DNA sequencing technologies, efficient methods for measuring statistically significant relationships between many sets of genomic features are crucial for future discovery. Results: We introduce the Binary Interval Search (BITS) algorithm, a novel and scalable approach to interval set intersection. We demonstrate that BITS outperforms existing methods at counting interval intersections. Moreover, we show that BITS is intrinsically suited to parallel computing architectures, such as graphics processing units by illustrating its utility for efficient Monte Carlo simulations measuring the significance of relationships between sets of genomic intervals. Availability: https://github.com/arq5x/bits. Contact: arq5x@virginia.edu Supplementary information: Supplementary data are available at Bioinformatics online. PMID:23129298
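
    Counting interval intersections with binary searches can be sketched as follows. This is an illustration of the general idea under stated assumptions (closed intervals, counting only), not the BITS reference implementation from the repository above. The function name and the sample intervals are hypothetical.

```python
from bisect import bisect_left, bisect_right

def count_intersections(query, database):
    """Count the database intervals that overlap each query interval.

    Intervals are closed (start, end) pairs with start <= end.
    """
    # Sort starts and ends independently; only counts are needed, not identities.
    starts = sorted(s for s, _ in database)
    ends = sorted(e for _, e in database)
    n = len(database)
    counts = []
    for q_start, q_end in query:
        # Intervals that begin after the query ends cannot overlap it.
        starts_after = n - bisect_right(starts, q_end)
        # Intervals that end before the query begins cannot overlap it.
        ends_before = bisect_left(ends, q_start)
        # Everything else must overlap the query.
        counts.append(n - starts_after - ends_before)
    return counts

database = [(1, 5), (3, 8), (10, 12), (20, 25)]
query = [(4, 11), (13, 19), (0, 30)]
print(count_intersections(query, database))  # [3, 0, 4]
```

    Each query costs two binary searches, and the queries are independent of one another, which is what makes this kind of counting easy to parallelize.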

  20. 454 Pyrosequencing to Describe Microbial Eukaryotic Community Composition, Diversity and Relative Abundance: A Test for Marine Haptophytes

    PubMed Central

    Egge, Elianne; Bittner, Lucie; Andersen, Tom; Audic, Stéphane; de Vargas, Colomban; Edvardsen, Bente

    2013-01-01

    Next generation sequencing of ribosomal DNA is increasingly used to assess the diversity and structure of microbial communities. Here we test the ability of 454 pyrosequencing to detect the number of species present, and assess the relative abundance in terms of cell numbers and biomass of protists in the phylum Haptophyta. We used a mock community consisting of equal number of cells of 11 haptophyte species and compared targeting DNA and RNA/cDNA, and two different V4 SSU rDNA haptophyte-biased primer pairs. Further, we tested four different bioinformatic filtering methods to reduce errors in the resulting sequence dataset. With sequencing depth of 11000–20000 reads and targeting cDNA with Haptophyta specific primers Hap454 we detected all 11 species. A rarefaction analysis of expected number of species recovered as a function of sampling depth suggested that minimum 1400 reads were required here to recover all species in the mock community. Relative read abundance did not correlate to relative cell numbers. Although the species represented with the largest biomass was also proportionally most abundant among the reads, there was generally a weak correlation between proportional read abundance and proportional biomass of the different species, both with DNA and cDNA as template. The 454 sequencing generated considerable spurious diversity, and more with cDNA than DNA as template. With initial filtering based only on match with barcode and primer we observed 100-fold more operational taxonomic units (OTUs) at 99% similarity than the number of species present in the mock community. Filtering based on quality scores, or denoising with PyroNoise resulted in ten times more OTU99% than the number of species. Denoising with AmpliconNoise reduced the number of OTU99% to match the number of species present in the mock community. Based on our analyses, we propose a strategy to more accurately depict haptophyte diversity using 454 pyrosequencing. PMID:24069303
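
    The rarefaction step, the expected number of species recovered at a given sampling depth, can be sketched with the standard analytic (hypergeometric) formula. The abstract does not say which rarefaction procedure was used, so this is an assumption, and the read counts below are hypothetical rather than taken from the study.

```python
from math import comb

def expected_species(read_counts, depth):
    """Expected number of species seen when `depth` reads are drawn
    without replacement from a sample with the given per-species counts."""
    total = sum(read_counts)
    if not 0 <= depth <= total:
        raise ValueError("depth must be between 0 and the total read count")
    all_draws = comb(total, depth)
    # For each species: 1 minus the probability that no read of it is drawn.
    return sum(1 - comb(total - c, depth) / all_draws for c in read_counts)

# Hypothetical, uneven read counts for a small mock community.
reads_per_species = [500, 300, 120, 50, 20, 10]
for depth in (10, 100, 1000):
    print(depth, round(expected_species(reads_per_species, depth), 2))
```

    The curve flattens once the rarest species is almost certain to be sampled, and that is the depth a rarefaction analysis reports as the minimum needed.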

  1. Challenging a bioinformatic tool's ability to detect microbial contaminants using in silico whole genome sequencing data.

    PubMed

    Olson, Nathan D; Zook, Justin M; Morrow, Jayne B; Lin, Nancy J

    2017-01-01

    High sensitivity methods such as next generation sequencing and polymerase chain reaction (PCR) are adversely impacted by organismal and DNA contaminants. Current methods for detecting contaminants in microbial materials (genomic DNA and cultures) are not sensitive enough and require either a known or culturable contaminant. Whole genome sequencing (WGS) is a promising approach for detecting contaminants due to its sensitivity and lack of need for a priori assumptions about the contaminant. Prior to applying WGS, we must first understand its limitations for detecting contaminants and potential for false positives. Herein we demonstrate and characterize a WGS-based approach to detect organismal contaminants using an existing metagenomic taxonomic classification algorithm. Simulated WGS datasets from ten genera as individuals and binary mixtures of eight organisms at varying ratios were analyzed to evaluate the role of contaminant concentration and taxonomy on detection. For the individual genomes the false positive contaminants reported depended on the genus, with Staphylococcus, Escherichia, and Shigella having the highest proportion of false positives. For nearly all binary mixtures the contaminant was detected in the in-silico datasets at the equivalent of 1 in 1,000 cells, though F. tularensis was not detected in any of the simulated contaminant mixtures and Y. pestis was only detected at the equivalent of one in 10 cells. Once a WGS method for detecting contaminants is characterized, it can be applied to evaluate microbial material purity, in efforts to ensure that contaminants are characterized in microbial materials used to validate pathogen detection assays, generate genome assemblies for database submission, and benchmark sequencing methods.

  2. The Porcelain Crab Transcriptome and PCAD, the Porcelain Crab Microarray and Sequence Database

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Tagmount, Abderrahmane; Wang, Mei; Lindquist, Erika

    2010-01-27

    Background: With the emergence of a completed genome sequence of the freshwater crustacean Daphnia pulex, construction of genomic-scale sequence databases for additional crustacean sequences is important for comparative genomics and annotation. Porcelain crabs, genus Petrolisthes, have been powerful crustacean models for environmental and evolutionary physiology with respect to thermal adaptation and understanding responses of marine organisms to climate change. Here, we present a large-scale EST sequencing and cDNA microarray database project for the porcelain crab Petrolisthes cinctipes. Methodology/Principal Findings: A set of ~30K unique sequences (UniSeqs) representing ~19K clusters were generated from ~98K high quality ESTs from a set of tissue specific non-normalized and mixed-tissue normalized cDNA libraries from the porcelain crab Petrolisthes cinctipes. Homology for each UniSeq was assessed using BLAST, InterProScan, GO and KEGG database searches. Approximately 66 percent of the UniSeqs had homology in at least one of the databases. All EST and UniSeq sequences along with annotation results and coordinated cDNA microarray datasets have been made publicly accessible at the Porcelain Crab Array Database (PCAD), a feature-enriched version of the Stanford and Longhorn Array Databases. Conclusions/Significance: The EST project presented here represents the third largest sequencing effort for any crustacean, and the largest effort for any crab species. Our assembly and clustering results suggest that our porcelain crab EST data set is equally diverse to the much larger EST set generated in the Daphnia pulex genome sequencing project, and thus will be an important resource to the Daphnia research community. Our homology results support the pancrustacea hypothesis and suggest that Malacostraca may be ancestral to Branchiopoda and Hexapoda. Our results also suggest that our cDNA microarrays cover as much of the transcriptome as can reasonably be captured in EST library sequencing approaches, and thus represent a rich resource for studies of environmental genomics.

  3. Sma3s: a three-step modular annotator for large sequence datasets.

    PubMed

    Muñoz-Mérida, Antonio; Viguera, Enrique; Claros, M Gonzalo; Trelles, Oswaldo; Pérez-Pulido, Antonio J

    2014-08-01

    Automatic sequence annotation is an essential component of modern 'omics' studies, which aim to extract information from large collections of sequence data. Most existing tools use sequence homology to establish evolutionary relationships and assign putative functions to sequences. However, it can be difficult to define a similarity threshold that achieves sufficient coverage without sacrificing annotation quality. Defining the correct configuration is critical and can be challenging for non-specialist users. Thus, the development of robust automatic annotation techniques that generate high-quality annotations without needing expert knowledge would be very valuable for the research community. We present Sma3s, a tool for automatically annotating very large collections of biological sequences from any kind of gene library or genome. Sma3s is composed of three modules that progressively annotate query sequences using either: (i) very similar homologues, (ii) orthologous sequences or (iii) terms enriched in groups of homologous sequences. We trained the system using several random sets of known sequences, demonstrating average sensitivity and specificity values of ~85%. In conclusion, Sma3s is a versatile tool for high-throughput annotation of a wide variety of sequence datasets that outperforms the accuracy of other well-established annotation algorithms, and it can enrich existing database annotations and uncover previously hidden features. Importantly, Sma3s has already been used in the functional annotation of two published transcriptomes. © The Author 2014. Published by Oxford University Press on behalf of Kazusa DNA Research Institute.

  4. Wanted dead or alive? Using metabarcoding of environmental DNA and RNA to distinguish living assemblages for biosecurity applications

    PubMed Central

    Zaiko, Anastasija; Fletcher, Lauren M.; Laroche, Olivier; Wood, Susanna A.

    2017-01-01

    High-throughput sequencing metabarcoding studies in marine biosecurity have largely focused on targeting environmental DNA (eDNA). DNA can persist extracellularly in the environment, making discrimination of living organisms difficult. In this study, bilge water samples (i.e., water accumulating on-board a vessel during transit) were collected from 15 small recreational and commercial vessels. eDNA and eRNA molecules were co-extracted and the V4 region of the 18S ribosomal RNA gene targeted for metabarcoding. In total, 62.7% of the Operational Taxonomic Units (OTUs) were identified at least once in the corresponding eDNA and eRNA reads, with 19.5% unique to eDNA and 17.7% to eRNA. There were substantial differences in diversity between molecular compartments; 57% of sequences from eDNA-only OTUs belonged to fungi, likely originating from legacy DNA. In contrast, there was a higher percentage of metazoan (50.2%) and ciliate (31.7%) sequences in the eRNA-only OTUs. Our data suggest that the presence of eRNA-only OTUs could be due to increased cellular activities of some rare taxa that were not identified in the eDNA datasets, unusually high numbers of rRNA transcripts in ciliates, and/or artefacts produced during the reverse transcriptase, PCR and sequencing steps. The proportions of eDNA/eRNA shared and unshared OTUs were highly heterogeneous within individual bilge water samples. Multiple factors including boat type and the activities performed on-board, such as washing of scientific equipment, may play a major role in contributing to this variability. For some marine biosecurity applications analysis, eDNA-only data may be sufficient, however there are an increasing number of instances where distinguishing the living portion of a community is essential. For these circumstances, we suggest only including OTUs that are present in both eDNA and eRNA data. 
OTUs found only in the eRNA data need to be interpreted with caution until further research provides conclusive evidence for their origin. PMID:29095959
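
    The recommendation to keep only OTUs present in both molecular compartments amounts to a set intersection. The following is a minimal sketch with hypothetical OTU identifiers. The function names are illustrative and do not come from the study's pipeline.

```python
def partition_otus(edna_otus, erna_otus):
    """Split OTU identifiers into shared, eDNA-only and eRNA-only groups."""
    edna, erna = set(edna_otus), set(erna_otus)
    return {
        "shared": edna & erna,      # detected in both eDNA and eRNA reads
        "edna_only": edna - erna,   # possibly legacy (extracellular) DNA
        "erna_only": erna - edna,   # to be interpreted with caution
    }

def group_percentages(groups):
    """Percentage of all detected OTUs that falls into each group."""
    total = sum(len(members) for members in groups.values())
    return {name: 100 * len(members) / total for name, members in groups.items()}

# Hypothetical OTU identifiers from a single bilge water sample.
edna_reads = ["OTU1", "OTU2", "OTU3", "OTU4", "OTU7"]
erna_reads = ["OTU2", "OTU3", "OTU4", "OTU5", "OTU6"]

groups = partition_otus(edna_reads, erna_reads)
living_candidates = sorted(groups["shared"])
print(living_candidates)
print(group_percentages(groups))
```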

  5. Wanted dead or alive? Using metabarcoding of environmental DNA and RNA to distinguish living assemblages for biosecurity applications.

    PubMed

    Pochon, Xavier; Zaiko, Anastasija; Fletcher, Lauren M; Laroche, Olivier; Wood, Susanna A

    2017-01-01

    High-throughput sequencing metabarcoding studies in marine biosecurity have largely focused on targeting environmental DNA (eDNA). DNA can persist extracellularly in the environment, making discrimination of living organisms difficult. In this study, bilge water samples (i.e., water accumulating on-board a vessel during transit) were collected from 15 small recreational and commercial vessels. eDNA and eRNA molecules were co-extracted and the V4 region of the 18S ribosomal RNA gene targeted for metabarcoding. In total, 62.7% of the Operational Taxonomic Units (OTUs) were identified at least once in the corresponding eDNA and eRNA reads, with 19.5% unique to eDNA and 17.7% to eRNA. There were substantial differences in diversity between molecular compartments; 57% of sequences from eDNA-only OTUs belonged to fungi, likely originating from legacy DNA. In contrast, there was a higher percentage of metazoan (50.2%) and ciliate (31.7%) sequences in the eRNA-only OTUs. Our data suggest that the presence of eRNA-only OTUs could be due to increased cellular activities of some rare taxa that were not identified in the eDNA datasets, unusually high numbers of rRNA transcripts in ciliates, and/or artefacts produced during the reverse transcriptase, PCR and sequencing steps. The proportions of eDNA/eRNA shared and unshared OTUs were highly heterogeneous within individual bilge water samples. Multiple factors including boat type and the activities performed on-board, such as washing of scientific equipment, may play a major role in contributing to this variability. For some marine biosecurity applications analysis, eDNA-only data may be sufficient, however there are an increasing number of instances where distinguishing the living portion of a community is essential. For these circumstances, we suggest only including OTUs that are present in both eDNA and eRNA data. 
OTUs found only in the eRNA data need to be interpreted with caution until further research provides conclusive evidence for their origin.

  6. Integrating alignment-based and alignment-free sequence similarity measures for biological sequence classification.

    PubMed

    Borozan, Ivan; Watt, Stuart; Ferretti, Vincent

    2015-05-01

    Alignment-based sequence similarity searches, while accurate for some types of sequences, can produce incorrect results when used on more divergent but functionally related sequences that have undergone the sequence rearrangements observed in many bacterial and viral genomes. Here, we propose a classification model that exploits the complementary nature of alignment-based and alignment-free similarity measures with the aim to improve the accuracy with which DNA and protein sequences are characterized. Our model classifies sequences using a combined sequence similarity score calculated by adaptively weighting the contribution of different sequence similarity measures. Weights are determined independently for each sequence in the test set and reflect the discriminatory ability of individual similarity measures in the training set. Because the similarity between some sequences is determined more accurately with one type of measure rather than another, our classifier allows different sets of weights to be associated with different sequences. Using five different similarity measures, we show that our model significantly improves the classification accuracy over the current composition- and alignment-based models, when predicting the taxonomic lineage for both short viral sequence fragments and complete viral sequences. We also show that our model can be used effectively for the classification of reads from a real metagenome dataset as well as protein sequences. All the datasets and the code used in this study are freely available at https://collaborators.oicr.on.ca/vferretti/borozan_csss/csss.html. Contact: ivan.borozan@gmail.com. Supplementary data are available at Bioinformatics online. © The Author 2015. Published by Oxford University Press.

  7. Integrating alignment-based and alignment-free sequence similarity measures for biological sequence classification

    PubMed Central

    Borozan, Ivan; Watt, Stuart; Ferretti, Vincent

    2015-01-01

    Motivation: Alignment-based sequence similarity searches, while accurate for some types of sequences, can produce incorrect results when used on more divergent but functionally related sequences that have undergone the sequence rearrangements observed in many bacterial and viral genomes. Here, we propose a classification model that exploits the complementary nature of alignment-based and alignment-free similarity measures with the aim to improve the accuracy with which DNA and protein sequences are characterized. Results: Our model classifies sequences using a combined sequence similarity score calculated by adaptively weighting the contribution of different sequence similarity measures. Weights are determined independently for each sequence in the test set and reflect the discriminatory ability of individual similarity measures in the training set. Because the similarity between some sequences is determined more accurately with one type of measure rather than another, our classifier allows different sets of weights to be associated with different sequences. Using five different similarity measures, we show that our model significantly improves the classification accuracy over the current composition- and alignment-based models, when predicting the taxonomic lineage for both short viral sequence fragments and complete viral sequences. We also show that our model can be used effectively for the classification of reads from a real metagenome dataset as well as protein sequences. Availability and implementation: All the datasets and the code used in this study are freely available at https://collaborators.oicr.on.ca/vferretti/borozan_csss/csss.html. Contact: ivan.borozan@gmail.com Supplementary information: Supplementary data are available at Bioinformatics online. PMID:25573913

  8. Genome-wide methylation analysis identified sexually dimorphic methylated regions in hybrid tilapia

    PubMed Central

    Wan, Zi Yi; Xia, Jun Hong; Lin, Grace; Wang, Le; Lin, Valerie C. L.; Yue, Gen Hua

    2016-01-01

    Sexual dimorphism is an interesting biological phenomenon. Previous studies showed that DNA methylation might play a role in sexual dimorphism. However, the overall picture of the genome-wide methylation landscape in sexually dimorphic species remains unclear. We analyzed the DNA methylation landscape and transcriptome in hybrid tilapia (Oreochromis spp.) using whole genome bisulfite sequencing (WGBS) and RNA-sequencing (RNA-seq). We found 4,757 sexually dimorphic differentially methylated regions (DMRs), with significant clusters of DMRs located on chromosomal regions associated with sex determination. CpG methylation in promoter regions was negatively correlated with the gene expression level. MAPK/ERK pathway was upregulated in male tilapia. We also inferred active cis-regulatory regions (ACRs) in skeletal muscle tissues from WGBS datasets, revealing sexually dimorphic cis-regulatory regions. These results suggest that DNA methylation contributes to sex-specific phenotypes and serve as resources for further investigation to analyze the functions of these regions and their contributions towards sexual dimorphisms. PMID:27782217

  9. Contrasting morphological and DNA barcode-suggested species boundaries among shallow-water amphipod fauna from the southern European Atlantic coast.

    PubMed

    Lobo, Jorge; Ferreira, Maria S; Antunes, Ilisa C; Teixeira, Marcos A L; Borges, Luisa M S; Sousa, Ronaldo; Gomes, Pedro A; Costa, Maria Helena; Cunha, Marina R; Costa, Filipe O

    2017-02-01

    In this study we compared DNA barcode-suggested species boundaries with morphology-based species identifications in the amphipod fauna of the southern European Atlantic coast. DNA sequences of the cytochrome c oxidase subunit I barcode region (COI-5P) were generated for 43 morphospecies (178 specimens) collected along the Portuguese coast which, together with publicly available COI-5P sequences, produced a final dataset comprising 68 morphospecies and 295 sequences. Seventy-five BINs (Barcode Index Numbers) were assigned to these morphospecies, of which 48 were concordant (i.e., 1 BIN = 1 species), 8 were taxonomically discordant, and 19 were singletons. Twelve species had matching sequences (<2% distance) with conspecifics from distant locations (e.g., North Sea). Seven morphospecies were assigned to multiple, and highly divergent, BINs, including specimens of Corophium multisetosum (18% divergence) and Dexamine spiniventris (16% divergence), which originated from sampling locations on the west coast of Portugal (only about 36 and 250 km apart, respectively). We also found deep divergence (4%-22%) among specimens of seven species from Portugal compared to those from the North Sea and Italy. The detection of evolutionarily meaningful divergence among populations of several amphipod species from southern Europe reinforces the need for a comprehensive re-assessment of the diversity of this faunal group.

  10. qPMS9: An Efficient Algorithm for Quorum Planted Motif Search

    NASA Astrophysics Data System (ADS)

    Nicolae, Marius; Rajasekaran, Sanguthevar

    2015-01-01

    Discovering patterns in biological sequences is a crucial problem. For example, the identification of patterns in DNA sequences has resulted in the determination of open reading frames, identification of gene promoter elements, intron/exon splicing sites, and SH RNAs, location of RNA degradation signals, identification of alternative splicing sites, etc. In protein sequences, patterns have led to domain identification, location of protease cleavage sites, identification of signal peptides, protein interactions, determination of protein degradation elements, identification of protein trafficking elements, discovery of short functional motifs, etc. In this paper we focus on the identification of an important class of patterns, namely, motifs. We study the (l, d) motif search problem or Planted Motif Search (PMS). PMS receives as input n strings and two integers l and d. It returns all sequences M of length l that occur in each input string, where each occurrence differs from M in at most d positions. Another formulation is quorum PMS (qPMS), where the motif appears in at least q% of the strings. We introduce qPMS9, a parallel exact qPMS algorithm that offers significant runtime improvements on DNA and protein datasets. qPMS9 solves the challenging DNA (l, d)-instances (28, 12) and (30, 13). The source code is available at https://code.google.com/p/qpms9/.
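
    The (l, d) quorum planted motif search problem defined above can be made concrete with a brute-force sketch. This is not the qPMS9 algorithm, which is far more efficient; it simply enumerates every candidate of length l over the DNA alphabet and is only practical for very small l. The function names and the toy strings are assumptions for illustration.

```python
import math
from itertools import product
from typing import List, Set


def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def occurs(motif: str, s: str, d: int) -> bool:
    """True if some length-l window of s differs from motif in at most d positions."""
    l = len(motif)
    return any(hamming(motif, s[i:i + l]) <= d for i in range(len(s) - l + 1))


def qpms_bruteforce(strings: List[str], l: int, d: int,
                    q_percent: float = 100.0,
                    alphabet: str = "ACGT") -> Set[str]:
    """Return all length-l motifs occurring (within d mismatches)
    in at least q_percent of the input strings."""
    need = math.ceil(q_percent * len(strings) / 100)
    found = set()
    for cand in product(alphabet, repeat=l):
        motif = "".join(cand)
        hits = sum(occurs(motif, s, d) for s in strings)
        if hits >= need:
            found.add(motif)
    return found


toy = ["ACGTACGT", "TTACGATT", "GGGACGTG"]
print("ACGT" in qpms_bruteforce(toy, l=4, d=1))
print(sorted(qpms_bruteforce(toy, l=4, d=0, q_percent=66)))
```

    The exhaustive loop visits 4^l candidates, which is why instances such as (28, 12) call for the specialised exact algorithms the abstract refers to.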

  11. GABI-Kat SimpleSearch: new features of the Arabidopsis thaliana T-DNA mutant database.

    PubMed

    Kleinboelting, Nils; Huep, Gunnar; Kloetgen, Andreas; Viehoever, Prisca; Weisshaar, Bernd

    2012-01-01

    T-DNA insertion mutants are very valuable for reverse genetics in Arabidopsis thaliana. Several projects have generated large sequence-indexed collections of T-DNA insertion lines, of which GABI-Kat is the second largest resource worldwide. User access to the collection and its Flanking Sequence Tags (FSTs) is provided by the front end SimpleSearch (http://www.GABI-Kat.de). Several significant improvements have been implemented recently. The database now relies on the TAIRv10 genome sequence and annotation dataset. All FSTs have been newly mapped using an optimized procedure that leads to improved accuracy of insertion site predictions. A fraction of the collection with weak FST yield was re-analysed by generating new FSTs. Along with newly found predictions for older sequences, about 20,000 new FSTs were included in the database. Information about groups of FSTs pointing to the same insertion site that is found in several lines but is real only in a single line is included, and many problematic FST-to-line links have been corrected using new wet-lab data. SimpleSearch currently contains data from ~71,000 lines with predicted insertions covering 62.5% of the 27,206 nuclear protein coding genes, and offers insertion allele-specific data from 9545 confirmed lines that are available from the Nottingham Arabidopsis Stock Centre.

  12. High-resolution phylogeography of zoonotic tapeworm Echinococcus granulosus sensu stricto genotype G1 with an emphasis on its distribution in Turkey, Italy and Spain.

    PubMed

    Kinkar, Liina; Laurimäe, Teivi; Simsek, Sami; Balkaya, Ibrahim; Casulli, Adriano; Manfredi, Maria Teresa; Ponce-Gordo, Francisco; Varcasia, Antonio; Lavikainen, Antti; González, Luis Miguel; Rehbein, Steffen; VAN DER Giessen, Joke; Sprong, Hein; Saarma, Urmas

    2016-11-01

    Echinococcus granulosus is the causative agent of cystic echinococcosis. The disease is a significant global public health concern and human infections are most commonly associated with E. granulosus sensu stricto (s. s.) genotype G1. The objectives of this study were to: (i) analyse the genetic variation and phylogeography of E. granulosus s. s. G1 in part of its main distribution range in Europe using 8274 bp of mtDNA; (ii) compare the results with those derived from previously used shorter mtDNA sequences and highlight the major differences. We sequenced a total of 91 E. granulosus s. s. G1 isolates from six different intermediate host species, including humans. The isolates originated from seven countries representing primarily Turkey, Italy and Spain. A few samples were also from Albania, Greece, Romania and from a patient originating from Algeria, but diagnosed in Finland. The analysed 91 sequences were divided into 83 haplotypes, revealing complex phylogeography and high genetic variation of E. granulosus s. s. G1 in Europe, particularly in the high-diversity domestication centre of western Asia. Comparisons with shorter mtDNA datasets revealed that 8274 bp sequences provided significantly higher phylogenetic resolution and thus more power to reveal the genetic relations between different haplotypes.

  13. Molecular blood group typing in Banjar, Jawa, Mandailing and Kelantan Malays in Peninsular Malaysia.

    PubMed

    Abd Gani, Rahayu; Manaf, Siti Mariam; Zafarina, Zainuddin; Panneerchelvam, Sundararajulu; Chambers, Geoffrey Keith; Norazmi, Mohd Noor; Edinur, Hisham Atan

    2015-08-01

    In this study we genotyped ABO, Rhesus, Kell, Kidd and Duffy blood group loci in DNA samples from 120 unrelated individuals representing four Malay subethnic groups living in Peninsular Malaysia (Banjar: n = 30, Jawa: n = 30, Mandailing: n = 30 and Kelantan: n = 30). Analyses were performed using commercial polymerase chain reaction-sequence specific primer (PCR-SSP) typing kits (BAG Health Care GmbH, Lich, Germany). Overall, the present study has successfully compiled blood group datasets for the four Malay subethnic groups and used the datasets for studying ancestry and health. Copyright © 2015 Elsevier Ltd. All rights reserved.

  14. Advances in the phylogenesis of Agaricales and its higher ranks and strategies for establishing phylogenetic hypotheses

    PubMed Central

    Zhao, Rui-lin; Desjardin, Dennis E.; Soytong, Kasem; Hyde, Kevin D.

    2008-01-01

    We present an overview of previous research results on the molecular phylogenetic analyses in Agaricales and its higher ranks (Agaricomycetes/Agaricomycotina/Basidiomycota) along with the most recent treatments of taxonomic systems in these taxa. Establishing phylogenetic hypotheses using DNA sequences, from which an understanding of the natural evolutionary relationships amongst clades may be derived, requires a robust dataset. It has been recognized that single-gene phylogenies may not truly represent organismal phylogenies, but the concordant phylogenetic genealogies from multiple-gene datasets can resolve this problem. The genes commonly used in mushroom phylogenetic research are summarized. PMID:18837104

  15. Metagenomic ventures into outer sequence space.

    PubMed

    Dutilh, Bas E

    Sequencing DNA or RNA directly from the environment often results in many sequencing reads that have no homologs in the database. These are referred to as "unknowns," and reflect the vast unexplored microbial sequence space of our biosphere, also known as "biological dark matter." However, unknowns also exist because metagenomic datasets are not optimally mined. There is a pressure on researchers to publish and move on, and the unknown sequences are often left for what they are, and conclusions drawn based on reads with annotated homologs. This can cause abundant and widespread genomes to be overlooked, such as the recently discovered human gut bacteriophage crAssphage. The unknowns may be enriched for bacteriophage sequences, the most abundant and genetically diverse component of the biosphere and of sequence space. However, it remains an open question, what is the actual size of biological sequence space? The de novo assembly of shotgun metagenomes is the most powerful tool to address this question.

  16. Pse-Analysis: a python package for DNA/RNA and protein/peptide sequence analysis based on pseudo components and kernel methods.

    PubMed

    Liu, Bin; Wu, Hao; Zhang, Deyuan; Wang, Xiaolong; Chou, Kuo-Chen

    2017-02-21

    To expedite the pace in conducting genome/proteome analysis, we have developed a Python package called Pse-Analysis. The powerful package can automatically complete the following five procedures: (1) sample feature extraction, (2) optimal parameter selection, (3) model training, (4) cross validation, and (5) evaluating prediction quality. All the work a user needs to do is to input a benchmark dataset along with the query biological sequences concerned. Based on the benchmark dataset, Pse-Analysis will automatically construct an ideal predictor, followed by yielding the predicted results for the submitted query samples. All the aforementioned tedious jobs can be automatically done by the computer. Moreover, the multiprocessing technique was adopted to enhance computational speed about 6-fold. The Pse-Analysis Python package is freely accessible to the public at http://bioinformatics.hitsz.edu.cn/Pse-Analysis/, and can be directly run on Windows, Linux, and Unix.

  17. ChIP-chip versus ChIP-seq: Lessons for experimental design and data analysis

    PubMed Central

    2011-01-01

    Background: Chromatin immunoprecipitation (ChIP) followed by microarray hybridization (ChIP-chip) or high-throughput sequencing (ChIP-seq) allows genome-wide discovery of protein-DNA interactions such as transcription factor bindings and histone modifications. Previous reports only compared a small number of profiles, and little has been done to compare histone modification profiles generated by the two technologies or to assess the impact of input DNA libraries in ChIP-seq analysis. Here, we performed a systematic analysis of a modENCODE dataset consisting of 31 pairs of ChIP-chip/ChIP-seq profiles of the coactivator CBP, RNA polymerase II (RNA PolII), and six histone modifications across four developmental stages of Drosophila melanogaster. Results: Both technologies produce highly reproducible profiles within each platform; ChIP-seq generally produces profiles with a better signal-to-noise ratio and allows detection of more peaks and narrower peaks. The set of peaks identified by the two technologies can be significantly different, but the extent to which they differ varies depending on the factor and the analysis algorithm. Importantly, we found that there is a significant variation among multiple sequencing profiles of input DNA libraries and that this variation most likely arises from both differences in experimental condition and sequencing depth. We further show that using an inappropriate input DNA profile can impact the average signal profiles around genomic features and peak calling results, highlighting the importance of having high quality input DNA data for normalization in ChIP-seq analysis. Conclusions: Our findings highlight the biases present in each of the platforms, show the variability that can arise from both technology and analysis methods, and emphasize the importance of obtaining high quality and deeply sequenced input DNA libraries for ChIP-seq analysis. PMID:21356108

  18. Direct comparisons of Illumina vs. Roche 454 sequencing technologies on the same microbial community DNA sample.

    PubMed

    Luo, Chengwei; Tsementzi, Despina; Kyrpides, Nikos; Read, Timothy; Konstantinidis, Konstantinos T

    2012-01-01

    Next-generation sequencing (NGS) is commonly used in metagenomic studies of complex microbial communities, but whether different NGS platforms recover the same diversity from a sample and whether their assembled sequences are of comparable quality remains unclear. We compared the two most frequently used platforms, the Roche 454 FLX Titanium and the Illumina Genome Analyzer (GA) II, on the same DNA sample obtained from a complex freshwater planktonic community. Despite the substantial differences in read length and sequencing protocols, the platforms provided a comparable view of the community sampled. For instance, derived assemblies overlapped in ~90% of their total sequences and in situ abundances of genes and genotypes (estimated based on sequence coverage) correlated highly between the two platforms (R² > 0.9). Evaluation of base-call error, frameshift frequency, and contig length suggested that Illumina offered equivalent, if not better, assemblies than Roche 454. The results from metagenomic samples were further validated against DNA samples of eighteen isolate genomes, which showed a range of genome sizes and G+C% content. We also provide quantitative estimates of the errors in gene and contig sequences assembled from datasets characterized by different levels of complexity and G+C% content. For instance, we noted that homopolymer-associated, single-base errors affected ~1% of the protein sequences recovered in Illumina contigs of 10× coverage and 50% G+C; this frequency increased to ~3% when non-homopolymer errors were also considered. Collectively, our results should serve as a useful practical guide for choosing proper sampling strategies and data processing protocols for future metagenomic studies.

  19. A parallel and sensitive software tool for methylation analysis on multicore platforms.

    PubMed

    Tárraga, Joaquín; Pérez, Mariano; Orduña, Juan M; Duato, José; Medina, Ignacio; Dopazo, Joaquín

    2015-10-01

    DNA methylation analysis suffers from very long processing time, as the advent of Next-Generation Sequencers has shifted the bottleneck of genomic studies from the sequencers that obtain the DNA samples to the software that performs the analysis of these samples. The existing software for methylation analysis does not seem to scale efficiently with either the size of the dataset or the length of the reads to be analyzed. As it is expected that the sequencers will provide longer and longer reads in the near future, efficient and scalable methylation software should be developed. We present a new software tool, called HPG-Methyl, which efficiently maps bisulphite sequencing reads on DNA, analyzing DNA methylation. The strategy used by this software consists of leveraging the speed of the Burrows-Wheeler Transform to map a large number of DNA fragments (reads) rapidly, as well as the accuracy of the Smith-Waterman algorithm, which is exclusively employed to deal with the most ambiguous and shortest reads. Experimental results on platforms with Intel multicore processors show that HPG-Methyl significantly outperforms state-of-the-art software such as Bismark, BS-Seeker or BSMAP in both execution time and sensitivity, particularly for long bisulphite reads. Availability: Software in the form of C libraries and functions, together with instructions to compile and execute it, is available by sftp to anonymous@clariano.uv.es (password 'anonymous'). Contact: juan.orduna@uv.es or jdopazo@cipf.es. © The Author 2015. Published by Oxford University Press. All rights reserved. For Permissions, please e-mail: journals.permissions@oup.com.
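
    The Smith-Waterman step mentioned above can be sketched as a local-alignment scoring routine. This is a generic textbook version with a linear gap penalty, not HPG-Methyl's implementation; the scoring parameters (match, mismatch, gap) and the function name are assumptions for illustration, and the sketch does not handle bisulphite conversion.

```python
def smith_waterman_score(a: str, b: str,
                         match: int = 2,
                         mismatch: int = -1,
                         gap: int = -2) -> int:
    """Best local alignment score between a and b (linear gap penalty).

    Keeps only two rows of the dynamic-programming matrix, so memory
    use is proportional to len(b).
    """
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A cell never drops below zero: a local alignment may restart anywhere.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


print(smith_waterman_score("TTACGTTT", "ACG"))
```

    The quadratic cost of this routine is the reason a mapper would reserve it for the reads that a faster index-based search cannot place confidently.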

  20. Differential principal component analysis of ChIP-seq.

    PubMed

    Ji, Hongkai; Li, Xia; Wang, Qian-fei; Ning, Yang

    2013-04-23

    We propose differential principal component analysis (dPCA) for analyzing multiple ChIP-sequencing datasets to identify differential protein-DNA interactions between two biological conditions. dPCA integrates unsupervised pattern discovery, dimension reduction, and statistical inference into a single framework. It uses a small number of principal components to summarize concisely the major multiprotein synergistic differential patterns between the two conditions. For each pattern, it detects and prioritizes differential genomic loci by comparing the between-condition differences with the within-condition variation among replicate samples. dPCA provides a unique tool for efficiently analyzing large amounts of ChIP-sequencing data to study dynamic changes of gene regulation across different biological conditions. We demonstrate this approach through analyses of differential chromatin patterns at transcription factor binding sites and promoters as well as allele-specific protein-DNA interactions.

  1. A database of annotated tentative orthologs from crop abiotic stress transcripts.

    PubMed

    Balaji, Jayashree; Crouch, Jonathan H; Petite, Prasad V N S; Hoisington, David A

    2006-10-07

    A minimal requirement to initiate a comparative genomics study on plant responses to abiotic stresses is a dataset of orthologous sequences. The availability of a large amount of sequence information, including that derived from stress cDNA libraries, allows for the identification of stress related genes and orthologs associated with the stress response. Orthologous sequences serve as tools to explore genes and their relationships across species. For this purpose, ESTs from stress cDNA libraries across 16 crop species including 6 important cereal crops and 10 dicots were systematically collated and subjected to bioinformatics analysis such as clustering, grouping of tentative orthologous sets, identification of protein motifs/patterns in the predicted protein sequence, and annotation with stress conditions, tissue/library source and putative function. All data are available to the scientific community at http://intranet.icrisat.org/gt1/tog/homepage.htm. We believe that the availability of annotated plant abiotic stress ortholog sets will be a valuable resource for researchers studying the biology of environmental stresses in plant systems, molecular evolution and genomics.

  2. Comparing sequencing assays and human-machine analyses in actionable genomics for glioblastoma

    PubMed Central

    Wrzeszczynski, Kazimierz O.; Frank, Mayu O.; Koyama, Takahiko; Rhrissorrakrai, Kahn; Robine, Nicolas; Utro, Filippo; Emde, Anne-Katrin; Chen, Bo-Juen; Arora, Kanika; Shah, Minita; Vacic, Vladimir; Norel, Raquel; Bilal, Erhan; Bergmann, Ewa A.; Moore Vogel, Julia L.; Bruce, Jeffrey N.; Lassman, Andrew B.; Canoll, Peter; Grommes, Christian; Harvey, Steve; Parida, Laxmi; Michelini, Vanessa V.; Zody, Michael C.; Jobanputra, Vaidehi; Royyuru, Ajay K.

    2017-01-01

    Objective: To analyze a glioblastoma tumor specimen with 3 different platforms and compare potentially actionable calls from each. Methods: Tumor DNA was analyzed by a commercial targeted panel. In addition, tumor-normal DNA was analyzed by whole-genome sequencing (WGS) and tumor RNA was analyzed by RNA sequencing (RNA-seq). The WGS and RNA-seq data were analyzed by a team of bioinformaticians and cancer oncologists, and separately by IBM Watson Genomic Analytics (WGA), an automated system for prioritizing somatic variants and identifying drugs. Results: More variants were identified by WGS/RNA analysis than by targeted panels. WGA completed a comparable analysis in a fraction of the time required by the human analysts. Conclusions: The development of an effective human-machine interface in the analysis of deep cancer genomic datasets may provide potentially clinically actionable calls for individual patients in a more timely and efficient manner than currently possible. ClinicalTrials.gov identifier: NCT02725684. PMID:28740869

  3. Mitochondrial DNA sequence variation and phylogeography of the scarlet kingsnake (Lampropeltis elapsoides).

    PubMed

    Friedman, Michael; Schaffer, Les

    2011-02-01

    BACKGROUND AND AIMS. With the goal of assessing population structure and geographic distribution of haplotype lineages among Lampropeltis elapsoides, we sequenced the ND4 mitochondrial DNA locus from 96 specimens of this snake across its area of distribution. MATERIALS AND METHODS. We relied heavily on formalin-fixed museum specimens to accomplish this analysis. RESULTS. The sequence alignment consisted of 491 bp of the selected gene, with 28% missing data. A simulation used to assess the effect of missing data on population genetic and phylogenetic resolution indicated increased character conflict, but with minimal loss of phylogenetic structure. CONCLUSION. This limited dataset suggests that L. elapsoides constitutes a largely unstructured population, with both widespread haplotypes and a large number of private haplotypes, a moderate level of nucleotide diversity, and a low, but significant, degree of north-south population differentiation. Haplotype structure and frequency, nucleotide frequency, and values for Tajima's D and Fu's F(S) indicate a recent range or population expansion following a historic bottleneck.

  4. Machine learning classifier for identification of damaging missense mutations exclusive to human mitochondrial DNA-encoded polypeptides.

    PubMed

    Martín-Navarro, Antonio; Gaudioso-Simón, Andrés; Álvarez-Jarreta, Jorge; Montoya, Julio; Mayordomo, Elvira; Ruiz-Pesini, Eduardo

    2017-03-07

    Several methods have been developed to predict the pathogenicity of missense mutations but none has been specifically designed for classification of variants in mtDNA-encoded polypeptides. Moreover, there is no curated dataset of neutral and damaging mtDNA missense variants available to test the accuracy of predictors. Because mtDNA sequencing of patients suffering from mitochondrial diseases is revealing many missense mutations, it is necessary to prioritize candidate substitutions for further confirmation. Predictors can be useful as screening tools but their performance must be improved. We have developed an SVM classifier (Mitoclass.1) specific for mtDNA missense variants. Training and validation of the model were executed with 2,835 mtDNA damaging and neutral amino acid substitutions, previously curated by a set of rigorous pathogenicity criteria with high specificity. Each instance is described by a set of three attributes based on evolutionary conservation in Eukaryota of wildtype and mutant amino acids as well as coevolution and a novel evolutionary analysis of specific substitutions belonging to the same domain of mitochondrial polypeptides. Our classifier has performed better than other web-available tested predictors. We checked the performance of three broadly used predictors with the total mutations of our curated dataset. PolyPhen-2 showed the best results for a screening purpose, with good sensitivity. Nevertheless, the number of false positive predictions was too high. Our method has an improved sensitivity and better specificity in relation to PolyPhen-2. We also publish predictions for the complete set of 24,201 possible missense variants in the 13 human mtDNA-encoded polypeptides. Mitoclass.1 allows a better selection of candidate damaging missense variants from mtDNA. A careful search of discriminatory attributes and a training step based on a curated dataset of amino acid substitutions belonging exclusively to human mtDNA genes allow an improved performance. Mitoclass.1 accuracy could be improved in the future when more mtDNA missense substitutions become available for updating the attributes and retraining the model.

  5. Random whole metagenomic sequencing for forensic discrimination of soils.

    PubMed

    Khodakova, Anastasia S; Smith, Renee J; Burgoyne, Leigh; Abarno, Damien; Linacre, Adrian

    2014-01-01

    Here we assess the ability of random whole metagenomic sequencing approaches to discriminate between similar soils from two geographically distinct urban sites for application in forensic science. Repeat samples from two parklands in residential areas separated by approximately 3 km were collected and the DNA was extracted. Shotgun, whole genome amplification (WGA) and single arbitrarily primed DNA amplification (AP-PCR) based sequencing techniques were then used to generate soil metagenomic profiles. Full and subsampled metagenomic datasets were then annotated against M5NR/M5RNA (taxonomic classification) and SEED Subsystems (metabolic classification) databases. Further comparative analyses were performed using a number of statistical tools including: hierarchical agglomerative clustering (CLUSTER); similarity profile analysis (SIMPROF); non-metric multidimensional scaling (NMDS); and canonical analysis of principal coordinates (CAP) at all major levels of taxonomic and metabolic classification. Our data showed that shotgun and WGA-based approaches generated highly similar metagenomic profiles for the soil samples such that the soil samples could not be distinguished accurately. An AP-PCR based approach was shown to be successful at obtaining reproducible site-specific metagenomic DNA profiles, which in turn were employed for successful discrimination of visually similar soil samples collected from two different locations.

  6. Machine Learned Replacement of N-Labels for Basecalled Sequences in DNA Barcoding.

    PubMed

    Ma, Eddie Y T; Ratnasingham, Sujeevan; Kremer, Stefan C

    2018-01-01

    This study presents a machine learning method that increases the number of identified bases in Sanger Sequencing. The system post-processes a KB basecalled chromatogram. It selects a recoverable subset of N-labels in the KB-called chromatogram to replace with basecalls (A,C,G,T). An N-label correction is defined given an additional read of the same sequence, and a human finished sequence. Corrections are added to the dataset when an alignment determines the additional read and human agree on the identity of the N-label. KB must also rate the replacement with quality value of in the additional read. Corrections are only available during system training. Developing the system, nearly 850,000 N-labels are obtained from Barcode of Life Datasystems, the premier database of genetic markers called DNA Barcodes. Increasing the number of correct bases improves reference sequence reliability, increases sequence identification accuracy, and assures analysis correctness. Keeping with barcoding standards, our system maintains an error rate of percent. Our system only applies corrections when it estimates low rate of error. Tested on this data, our automation selects and recovers: 79 percent of N-labels from COI (animal barcode); 80 percent from matK and rbcL (plant barcodes); and 58 percent from non-protein-coding sequences (across eukaryotes).

  7. Efficient alignment-free DNA barcode analytics

    PubMed Central

    Kuksa, Pavel; Pavlovic, Vladimir

    2009-01-01

    Background: In this work we consider barcode DNA analysis problems and address them using alternative, alignment-free methods and representations which model sequences as collections of short sequence fragments (features). The methods use fixed-length representations (spectrum) for barcode sequences to measure similarities or dissimilarities between sequences coming from the same or different species. The spectrum-based representation not only allows for accurate and computationally efficient species classification, but also opens the possibility for accurate clustering analysis of putative species barcodes and identification of critical within-barcode loci distinguishing barcodes of different sample groups. Results: New alignment-free methods provide highly accurate and fast DNA barcode-based identification and classification of species with substantial improvements in accuracy and speed over state-of-the-art barcode analysis methods. We evaluate our methods on problems of species classification and identification using barcodes, important and relevant analytical tasks in many practical applications (adverse species movement monitoring, sampling surveys for unknown or pathogenic species identification, biodiversity assessment, etc.). On several benchmark barcode datasets, including ACG, Astraptes, Hesperiidae, Fish larvae, and Birds of North America, the proposed alignment-free methods considerably improve prediction accuracy compared to prior results. We also observe significant running time improvements over the state-of-the-art methods. Conclusion: Our results show that newly developed alignment-free methods for DNA barcoding can efficiently and with high accuracy identify specimens by examining only a few barcode features, resulting in increased scalability and interpretability of current computational approaches to barcoding. PMID:19900305
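
    A fixed-length spectrum representation of the kind described above can be sketched with k-mer counts. The choice of cosine similarity, the value of k, the nearest-reference rule, and all names below are assumptions for illustration; the abstract does not specify the authors' kernels or classifiers, and the toy sequences are not real barcodes.

```python
from collections import Counter
from math import sqrt
from typing import Dict


def kmer_spectrum(seq: str, k: int) -> Counter:
    """Count every overlapping substring of length k in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def cosine_similarity(p: Counter, q: Counter) -> float:
    """Cosine of the angle between two k-mer count vectors."""
    dot = sum(count * q[kmer] for kmer, count in p.items())
    norm = sqrt(sum(v * v for v in p.values())) * sqrt(sum(v * v for v in q.values()))
    return dot / norm if norm else 0.0


def nearest_label(query: str, references: Dict[str, str], k: int = 3) -> str:
    """Assign the label of the reference whose spectrum is most similar to the query."""
    qs = kmer_spectrum(query, k)
    return max(
        references,
        key=lambda label: cosine_similarity(qs, kmer_spectrum(references[label], k)),
    )


refs = {"sp1": "ACGTACGA", "sp2": "TTTTGGGG"}
print(nearest_label("ACGTACGT", refs))
```

    No alignment is computed at any point: sequences of different lengths map to vectors over the same k-mer vocabulary, which is what makes the comparison fast.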

  8. The mitochondrial DNA makeup of Romanians: A forensic mtDNA control region database and phylogenetic characterization.

    PubMed

    Turchi, Chiara; Stanciu, Florin; Paselli, Giorgia; Buscemi, Loredana; Parson, Walther; Tagliabracci, Adriano

    2016-09-01

    To evaluate the pattern of the Romanian population from a mitochondrial perspective and to establish an appropriate mtDNA forensic database, we generated a high-quality mtDNA control region dataset from 407 Romanian subjects belonging to four major historical regions: Moldavia, Transylvania, Wallachia and Dobruja. The entire control region (CR) was analyzed by Sanger-type sequencing assays and the resulting 306 different haplotypes were classified into haplogroups according to the most updated mtDNA phylogeny. The Romanian gene pool is mainly composed of West Eurasian lineages H (31.7%), U (12.8%), J (10.8%), R (10.1%), T (9.1%), N (8.1%), HV (5.4%), K (3.7%), HV0 (4.2%), with the exceptions of East Asian haplogroup M (3.4%) and African haplogroup L (0.7%). The pattern of mtDNA variation observed in this study indicates that the mitochondrial DNA pool is geographically homogeneous across Romania and that the haplogroup composition reveals signals of admixture of populations of different origin. The PCA scatterplot supported this scenario, with Romania located in the southeastern Europe area, close to Bulgaria and Hungary, and as a borderland with respect to east Mediterranean and other eastern European countries. High haplotype diversity (0.993) and nucleotide diversity indices (0.00838±0.00426), together with low random match probability (0.0087), suggest the usefulness of this control region dataset as a forensic database in routine forensic mtDNA analysis and in the investigation of maternal genetic lineages in the Romanian population. Copyright © 2016 Elsevier Ireland Ltd. All rights reserved.
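
    The two forensic summary statistics quoted above can be computed from a list of observed haplotypes. The sketch assumes the commonly used definitions: random match probability as the sum of squared haplotype frequencies, and haplotype diversity as Nei's unbiased estimator, n/(n-1) times one minus that sum. The abstract does not state which estimators were used, and the haplotype labels below are made up.

```python
from collections import Counter
from typing import Sequence


def random_match_probability(haplotypes: Sequence[str]) -> float:
    """Sum of squared haplotype frequencies in the sample."""
    n = len(haplotypes)
    return sum((count / n) ** 2 for count in Counter(haplotypes).values())


def haplotype_diversity(haplotypes: Sequence[str]) -> float:
    """Nei's unbiased estimator: n / (n - 1) * (1 - sum of squared frequencies)."""
    n = len(haplotypes)
    return n / (n - 1) * (1 - random_match_probability(haplotypes))


sample = ["h1", "h1", "h2", "h3"]  # hypothetical haplotype calls
print(random_match_probability(sample))
print(haplotype_diversity(sample))
```

    Under these definitions the two statistics are linked: with 407 subjects and a random match probability of 0.0087, the estimator gives a haplotype diversity of roughly 0.99, in line with the reported value.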

  9. De novo sequencing and characterization of floral transcriptome in two species of buckwheat (Fagopyrum)

    PubMed Central

    2011-01-01

    Background Transcriptome sequencing data have become an integral component of modern genetics, genomics and evolutionary biology. However, despite advances in the technologies of DNA sequencing, such data are lacking for many groups of living organisms, in particular, many plant taxa. We present here the results of transcriptome sequencing for two closely related plant species. These species, Fagopyrum esculentum and F. tataricum, belong to the order Caryophyllales - a large group of flowering plants with uncertain evolutionary relationships. F. esculentum (common buckwheat) is also an important food crop. Despite these practical and evolutionary considerations, Fagopyrum species have not been the subject of large-scale sequencing projects. Results Normalized cDNA corresponding to genes expressed in flowers and inflorescences of F. esculentum and F. tataricum was sequenced using the 454 pyrosequencing technology. This resulted in 267 thousand (F. esculentum) and 229 thousand (F. tataricum) reads with an average length of 341-349 nucleotides. De novo assembly of the reads produced about 25 thousand contigs for each species, with 7.5-8.2× coverage. Comparative analysis of the two transcriptomes demonstrated their overall similarity but also revealed genes that are presumably differentially expressed. Among them are retrotransposon genes and genes involved in sugar biosynthesis and metabolism. Thirteen single-copy genes were used for phylogenetic analysis; the resulting trees are largely consistent with those inferred from multigenic plastid datasets. The sister relationship of the Caryophyllales and asterids has now gained high support from nuclear gene sequences. Conclusions 454 transcriptome sequencing and de novo assembly were performed for two congeneric flowering plant species, F. esculentum and F. tataricum. As a result, a large set of cDNA sequences that represent orthologs of known plant genes as well as potential new genes was generated. PMID:21232141

  10. DNA entropy reveals a significant difference in complexity between housekeeping and tissue specific gene promoters.

    PubMed

    Thomas, David; Finan, Chris; Newport, Melanie J; Jones, Susan

    2015-10-01

    The complexity of DNA can be quantified using estimates of entropy. Variation in DNA complexity is expected between the promoters of genes with different transcriptional mechanisms; namely housekeeping (HK) and tissue specific (TS). The former are transcribed constitutively to maintain general cellular functions, and the latter are transcribed in restricted tissues and cell types for specific molecular events. It is known that promoter features in the human genome are related to tissue specificity, but this has been difficult to quantify on a genomic scale. If entropy effectively quantifies DNA complexity, calculating the entropies of HK and TS gene promoters as profiles may reveal significant differences. Entropy profiles were calculated for a total dataset of 12,003 human gene promoters and for 501 housekeeping (HK) and 587 tissue specific (TS) human gene promoters. The mean profiles show the TS promoters have a significantly lower entropy (p<2.2e-16) than HK gene promoters. The entropy distributions for the 3 datasets show that promoter entropies could be used to identify novel HK genes. Functional features comprise DNA sequence patterns that are non-random and hence they have lower entropies. The lower entropy of TS gene promoters can be explained by a higher density of positive and negative regulatory elements, required for genes with complex spatial and temporal expression. Copyright © 2015 Elsevier Ltd. All rights reserved.
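
    As a rough illustration of an entropy profile, the sketch below computes the Shannon entropy of the base composition in a sliding window. The abstract does not specify the entropy estimator, window size or step the authors used, so the single-base Shannon entropy, the window and step parameters, and the function names are all assumptions.

```python
from collections import Counter
from math import log2


def shannon_entropy(window):
    """Shannon entropy (bits) of the symbol frequencies in a string."""
    total = len(window)
    if total == 0:
        return 0.0
    counts = Counter(window)
    # abs() only normalises the -0.0 that a single-symbol window produces.
    return abs(-sum((c / total) * log2(c / total) for c in counts.values()))


def entropy_profile(seq, window=4, step=1):
    """Entropy of each sliding window along seq (window and step are assumed values)."""
    seq = seq.upper()
    return [
        shannon_entropy(seq[i:i + window])
        for i in range(0, len(seq) - window + 1, step)
    ]


if __name__ == "__main__":
    # Toy sequence, not a real promoter.
    print(entropy_profile("AAAACGTACGT", window=4))
```

    Under this definition a window with one repeated base scores 0 bits and a window with all four bases in equal proportion scores 2 bits.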

  11. A Novel Approach to Data Collection for Difficult Structures: Data Management for Large Numbers of Crystals with the BLEND Software.

    PubMed

    Mylona, Anastasia; Carr, Stephen; Aller, Pierre; Moraes, Isabel; Treisman, Richard; Evans, Gwyndaf; Foadi, James

    2017-08-04

    The present article describes how to use the computer program BLEND to help assemble complete datasets for the solution of macromolecular structures, starting from partial or complete datasets, derived from data collection from multiple crystals. The program is demonstrated on more than two hundred X-ray diffraction datasets obtained from 50 crystals of a complex formed between the SRF transcription factor, its cognate DNA, and a peptide from the SRF cofactor MRTF-A. This structure is currently in the process of being fully solved. While full details of the structure are not yet available, the repeated application of BLEND on data from this structure, as they have become available, has made it possible to produce electron density maps clear enough to visualise the potential location of MRTF sequences.

  12. A Novel Approach to Data Collection for Difficult Structures: Data Management for Large Numbers of Crystals with the BLEND Software

    PubMed Central

    Mylona, Anastasia; Carr, Stephen; Aller, Pierre; Moraes, Isabel; Treisman, Richard; Evans, Gwyndaf; Foadi, James

    2018-01-01

    The present article describes how to use the computer program BLEND to help assemble complete datasets for the solution of macromolecular structures, starting from partial or complete datasets, derived from data collection from multiple crystals. The program is demonstrated on more than two hundred X-ray diffraction datasets obtained from 50 crystals of a complex formed between the SRF transcription factor, its cognate DNA, and a peptide from the SRF cofactor MRTF-A. This structure is currently in the process of being fully solved. While full details of the structure are not yet available, the repeated application of BLEND on data from this structure, as they have become available, has made it possible to produce electron density maps clear enough to visualise the potential location of MRTF sequences. PMID:29456874

  13. Cats of the Pharaohs: Genetic Comparison of Egyptian Cat Mummies to their Feline Contemporaries

    PubMed Central

    Kurushima, Jennifer D.; Ikram, Salima; Knudsen, Joan; Bleiberg, Edward; Grahn, Robert A.; Lyons, Leslie A.

    2012-01-01

    The ancient Egyptians mummified an abundance of cats during the Late Period (664 - 332 BC). The overlapping morphology and sizes of developing wildcats and domestic cats confound the identity of mummified cat species. Genetic analysis should support mummy identification and was conducted on two long bones and a mandible of three cats that were mummified by the ancient Egyptians. The mummy DNA was extracted in a dedicated ancient DNA laboratory at the University of California – Davis, followed by direct sequencing of between 246 and 402 bp of the mtDNA control region from each bone. When compared to a dataset of wildcats (Felis silvestris silvestris, F. s. tristrami, and F. chaus) as well as a previously published worldwide dataset of modern domestic cat samples, including Egypt, the DNA evidence suggests the three mummies represent common contemporary domestic cat mitotypes prevalent in modern Egypt and the Middle East. Divergence estimates date the origin of the mummies’ mitotypes to between two and 7.5 thousand years prior to their mummification, likely prior to or during the Egyptian Predynastic and Early Dynastic Periods. These data are the first genetic evidence supporting that the ancient Egyptians used domesticated cats, F. s. catus, for votive mummies, and likely imply that cats were domesticated prior to extensive mummification of cats. PMID:22923880

  14. Bicycle: a bioinformatics pipeline to analyze bisulfite sequencing data.

    PubMed

    Graña, Osvaldo; López-Fernández, Hugo; Fdez-Riverola, Florentino; González Pisano, David; Glez-Peña, Daniel

    2018-04-15

    High-throughput sequencing of bisulfite-converted DNA is a technique used to measure DNA methylation levels. Although a considerable number of computational pipelines have been developed to analyze such data, none of them tackles all the peculiarities of the analysis together, revealing limitations that can force the user to manually perform additional steps needed for a complete processing of the data. This article presents bicycle, an integrated, flexible analysis pipeline for bisulfite sequencing data. Bicycle analyzes whole genome bisulfite sequencing data, targeted bisulfite sequencing data and hydroxymethylation data. To show how bicycle overtakes other available pipelines, we compared them on a defined number of features that are summarized in a table. We also tested bicycle with both simulated and real datasets, to show its level of performance, and compared it to different state-of-the-art methylation analysis pipelines. Bicycle is publicly available under GNU LGPL v3.0 license at http://www.sing-group.org/bicycle. Users can also download a customized Ubuntu LiveCD including bicycle and other bisulfite sequencing data pipelines compared here. In addition, a docker image with bicycle and its dependencies, which allows a straightforward use of bicycle in any platform (e.g. Linux, OS X or Windows), is also available. Contact: ograna@cnio.es or dgpena@uvigo.es. Supplementary data are available at Bioinformatics online.

  15. New FeFe-hydrogenase genes identified in a metagenomic fosmid library from a municipal wastewater treatment plant as revealed by high-throughput sequencing.

    PubMed

    Tomazetto, Geizecler; Wibberg, Daniel; Schlüter, Andreas; Oliveira, Valéria M

    2015-01-01

    A fosmid metagenomic library was constructed with total community DNA obtained from a municipal wastewater treatment plant (MWWTP), with the aim of identifying new FeFe-hydrogenase genes encoding the enzymes most important for hydrogen metabolism. The dataset generated by pyrosequencing of a fosmid library was mined to identify environmental gene tags (EGTs) assigned to FeFe-hydrogenase. The majority of EGTs representing FeFe-hydrogenase genes were affiliated with the class Clostridia, suggesting that this group is the main hydrogen producer in the MWWTP analyzed. Based on assembled sequences, three FeFe-hydrogenase genes were predicted based on detection of the L2 motif (MPCxxKxxE) in the encoded gene product, confirming true FeFe-hydrogenase sequences. These sequences were used to design specific primers to detect fosmids encoding FeFe-hydrogenase genes predicted from the dataset. Three identified fosmids were completely sequenced. The cloned genomic fragments within these fosmids are closely related to members of the Spirochaetaceae, Bacteroidales and Firmicutes, and their FeFe-hydrogenase sequences are characterized by the structure type M3, which is common to clostridial enzymes. FeFe-hydrogenase sequences found in this study represent hitherto undetected sequences, indicating the high genetic diversity regarding these enzymes in MWWTP. Results suggest that MWWTP have to be considered as reservoirs for new FeFe-hydrogenase genes. Copyright © 2014 Institut Pasteur. Published by Elsevier Masson SAS. All rights reserved.
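
    The motif-based screening step mentioned above can be sketched with a regular expression. This is only an illustration, not the authors' pipeline: it assumes that each "x" in MPCxxKxxE stands for any single residue, and the function name and the test peptide are invented for the example.

```python
import re

# L2 motif MPCxxKxxE from the abstract; "x" is assumed to mean any one residue.
L2_MOTIF = re.compile(r"MPC..K..E")


def find_l2_motif(protein):
    """Return (start index, matched text) for each non-overlapping L2 motif hit."""
    return [(m.start(), m.group()) for m in L2_MOTIF.finditer(protein.upper())]


if __name__ == "__main__":
    # Hypothetical peptide, not a real hydrogenase sequence.
    print(find_l2_motif("AAMPCTSKQAEGG"))
```

    A hit only shows that the pattern is present in a predicted gene product; the abstract describes further steps (primer design and full fosmid sequencing) that this sketch does not cover.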

  16. Integrated Analyses of Cuticular Hydrocarbons, Chromosome and mtDNA in the Neotropical Social Wasp Mischocyttarus consimilis Zikán (Hymenoptera, Vespidae).

    PubMed

    Cunha, D A S; Menezes, R S T; Costa, M A; Lima, S M; Andrade, L H C; Antonialli, W F

    2017-12-01

    In the present work, we explored multiple data from different biological levels such as cuticular hydrocarbons, chromosomal features, and mtDNA sequences in the Neotropical social wasp Mischocyttarus consimilis (J.F. Zikán). Particularly, we explored the genetic and chemical differentiation level within and between populations of this insect. Our dataset revealed shallow intraspecific differentiation in M. consimilis. The similarity among the analyzed samples is probably due to the geographical proximity of the sites where the colonies were sampled, and we argue that the Paraná River did not act as an effective historical barrier for this wasp.

  17. Heterogeneous data fusion for brain tumor classification.

    PubMed

    Metsis, Vangelis; Huang, Heng; Andronesi, Ovidiu C; Makedon, Fillia; Tzika, Aria

    2012-10-01

    Current research in biomedical informatics involves analysis of multiple heterogeneous data sets. This includes patient demographics, clinical and pathology data, treatment history, patient outcomes as well as gene expression, DNA sequences and other information sources such as gene ontology. Analysis of these data sets could lead to better disease diagnosis, prognosis, treatment and drug discovery. In this report, we present a novel machine learning framework for brain tumor classification based on heterogeneous data fusion of metabolic and molecular datasets, including state-of-the-art high-resolution magic angle spinning (HRMAS) proton (1H) magnetic resonance spectroscopy and gene transcriptome profiling, obtained from intact brain tumor biopsies. Our experimental results show that our novel framework outperforms analyses using any individual dataset.

  18. kmer-SVM: a web server for identifying predictive regulatory sequence features in genomic data sets

    PubMed Central

    Fletez-Brant, Christopher; Lee, Dongwon; McCallion, Andrew S.; Beer, Michael A.

    2013-01-01

    Massively parallel sequencing technologies have made the generation of genomic data sets a routine component of many biological investigations. For example, Chromatin immunoprecipitation followed by sequence assays detect genomic regions bound (directly or indirectly) by specific factors, and DNase-seq identifies regions of open chromatin. A major bottleneck in the interpretation of these data is the identification of the underlying DNA sequence code that defines, and ultimately facilitates prediction of, these transcription factor (TF) bound or open chromatin regions. We have recently developed a novel computational methodology, which uses a support vector machine (SVM) with kmer sequence features (kmer-SVM) to identify predictive combinations of short transcription factor-binding sites, which determine the tissue specificity of these genomic assays (Lee, Karchin and Beer, Discriminative prediction of mammalian enhancers from DNA sequence. Genome Res. 2011; 21:2167–80). This regulatory information can (i) give confidence in genomic experiments by recovering previously known binding sites, and (ii) reveal novel sequence features for subsequent experimental testing of cooperative mechanisms. Here, we describe the development and implementation of a web server to allow the broader research community to independently apply our kmer-SVM to analyze and interpret their genomic datasets. We analyze five recently published data sets and demonstrate how this tool identifies accessory factors and repressive sequence elements. kmer-SVM is available at http://kmersvm.beerlab.org. PMID:23771147

  19. kmer-SVM: a web server for identifying predictive regulatory sequence features in genomic data sets.

    PubMed

    Fletez-Brant, Christopher; Lee, Dongwon; McCallion, Andrew S; Beer, Michael A

    2013-07-01

    Massively parallel sequencing technologies have made the generation of genomic data sets a routine component of many biological investigations. For example, Chromatin immunoprecipitation followed by sequence assays detect genomic regions bound (directly or indirectly) by specific factors, and DNase-seq identifies regions of open chromatin. A major bottleneck in the interpretation of these data is the identification of the underlying DNA sequence code that defines, and ultimately facilitates prediction of, these transcription factor (TF) bound or open chromatin regions. We have recently developed a novel computational methodology, which uses a support vector machine (SVM) with kmer sequence features (kmer-SVM) to identify predictive combinations of short transcription factor-binding sites, which determine the tissue specificity of these genomic assays (Lee, Karchin and Beer, Discriminative prediction of mammalian enhancers from DNA sequence. Genome Res. 2011; 21:2167-80). This regulatory information can (i) give confidence in genomic experiments by recovering previously known binding sites, and (ii) reveal novel sequence features for subsequent experimental testing of cooperative mechanisms. Here, we describe the development and implementation of a web server to allow the broader research community to independently apply our kmer-SVM to analyze and interpret their genomic datasets. We analyze five recently published data sets and demonstrate how this tool identifies accessory factors and repressive sequence elements. kmer-SVM is available at http://kmersvm.beerlab.org.

  20. Challenging a bioinformatic tool’s ability to detect microbial contaminants using in silico whole genome sequencing data

    PubMed Central

    Zook, Justin M.; Morrow, Jayne B.; Lin, Nancy J.

    2017-01-01

    High sensitivity methods such as next generation sequencing and polymerase chain reaction (PCR) are adversely impacted by organismal and DNA contaminants. Current methods for detecting contaminants in microbial materials (genomic DNA and cultures) are not sensitive enough and require either a known or culturable contaminant. Whole genome sequencing (WGS) is a promising approach for detecting contaminants due to its sensitivity and lack of need for a priori assumptions about the contaminant. Prior to applying WGS, we must first understand its limitations for detecting contaminants and potential for false positives. Herein we demonstrate and characterize a WGS-based approach to detect organismal contaminants using an existing metagenomic taxonomic classification algorithm. Simulated WGS datasets from ten genera as individuals and binary mixtures of eight organisms at varying ratios were analyzed to evaluate the role of contaminant concentration and taxonomy on detection. For the individual genomes the false positive contaminants reported depended on the genus, with Staphylococcus, Escherichia, and Shigella having the highest proportion of false positives. For nearly all binary mixtures the contaminant was detected in the in-silico datasets at the equivalent of 1 in 1,000 cells, though F. tularensis was not detected in any of the simulated contaminant mixtures and Y. pestis was only detected at the equivalent of one in 10 cells. Once a WGS method for detecting contaminants is characterized, it can be applied to evaluate microbial material purity, in efforts to ensure that contaminants are characterized in microbial materials used to validate pathogen detection assays, generate genome assemblies for database submission, and benchmark sequencing methods. PMID:28924496

  1. P-Hint-Hunt: a deep parallelized whole genome DNA methylation detection tool.

    PubMed

    Peng, Shaoliang; Yang, Shunyun; Gao, Ming; Liao, Xiangke; Liu, Jie; Yang, Canqun; Wu, Chengkun; Yu, Wenqiang

    2017-03-14

    An increasing number of studies use whole-genome DNA methylation detection, one of the most important parts of epigenetics research, to find significant relationships between DNA methylation and several typical diseases, such as cancers and diabetes. In many of those studies, mapping bisulfite-treated sequences to the whole genome has been the main method to study DNA cytosine methylation. However, most of today's relevant tools suffer from inaccuracy and long running times. In our study, we designed a new DNA methylation prediction tool ("Hint-Hunt") to solve the problem. Using an optimal complex alignment computation and Smith-Waterman matrix dynamic programming, Hint-Hunt can analyze and predict DNA methylation status. However, when Hint-Hunt predicts DNA methylation status on large-scale datasets, it is still slow and has low temporal-spatial efficiency. To address the cost of Smith-Waterman dynamic programming and the low temporal-spatial efficiency, we further designed a deep parallelized whole-genome DNA methylation detection tool ("P-Hint-Hunt") on the Tianhe-2 (TH-2) supercomputer. To the best of our knowledge, P-Hint-Hunt is the first parallel DNA methylation detection tool with a high speed-up for processing large-scale datasets, and it can run on both CPUs and Intel Xeon Phi coprocessors. Moreover, we deployed and evaluated Hint-Hunt and P-Hint-Hunt on the TH-2 supercomputer at different scales. The experimental results show that our tools eliminate the deviation caused by bisulfite treatment in the mapping procedure and that the multi-level parallel program yields a 48-fold speed-up with 64 threads. P-Hint-Hunt gains a deep acceleration on the CPU and Intel Xeon Phi heterogeneous platform, which takes full advantage of multi-core (CPU) and many-core (Phi) processors.
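
    For readers unfamiliar with the dynamic programming step named above, here is a minimal, serial sketch of Smith-Waterman local alignment scoring. It is not the Hint-Hunt or P-Hint-Hunt code: the scoring values (match, mismatch, gap), the linear gap penalty, and the function name are assumptions, and no bisulfite-aware scoring or parallelization is modeled.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score between a and b.

    Scoring parameters are illustrative assumptions (linear gap penalty).
    Only two rows of the matrix are kept, so memory is O(len(b)).
    """
    cols = len(b) + 1
    prev = [0] * cols
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * cols
        for j in range(1, cols):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = prev[j] + gap        # gap in b
            left = curr[j - 1] + gap  # gap in a
            # Local alignment: a cell never drops below zero.
            curr[j] = max(0, diag, up, left)
            best = max(best, curr[j])
        prev = curr
    return best


if __name__ == "__main__":
    # Toy read and reference fragment, not real bisulfite data.
    print(smith_waterman_score("GGACGTGG", "TTACGTTT"))
```

    The quadratic cost of filling this matrix for every read is the kind of workload that motivates the parallel implementation described in the abstract.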

  2. Deep Sequencing of T-cell Receptor DNA as a Biomarker of Clonally Expanded TILs in Breast Cancer after Immunotherapy.

    PubMed

    Page, David B; Yuan, Jianda; Redmond, David; Wen, Y Hanna; Durack, Jeremy C; Emerson, Ryan; Solomon, Stephen; Dong, Zhiwan; Wong, Phillip; Comstock, Christopher; Diab, Adi; Sung, Janice; Maybody, Majid; Morris, Elizabeth; Brogi, Edi; Morrow, Monica; Sacchini, Virgilio; Elemento, Olivier; Robins, Harlan; Patil, Sujata; Allison, James P; Wolchok, Jedd D; Hudis, Clifford; Norton, Larry; McArthur, Heather L

    2016-10-01

    In early-stage breast cancer, the degree of tumor-infiltrating lymphocytes (TIL) predicts response to chemotherapy and overall survival. Combination immunotherapy with immune checkpoint antibody plus tumor cryoablation can induce lymphocytic infiltrates and improve survival in mice. We used T-cell receptor (TCR) DNA sequencing to evaluate both the effect of cryoimmunotherapy in humans and the feasibility of TCR sequencing in early-stage breast cancer. In a pilot clinical trial, 18 women with early-stage breast cancer were treated preoperatively with cryoablation, single-dose anti-CTLA-4 (ipilimumab), or cryoablation + ipilimumab. TCRs within serially collected peripheral blood and tumor tissue were sequenced. In baseline tumor tissues, T-cell density as measured by TCR sequencing correlated with TIL scores obtained by hematoxylin and eosin (H&E) staining. However, tumors with little or no lymphocytes by H&E contained up to 3.6 × 10^6 TCR DNA sequences, highlighting the sensitivity of the ImmunoSEQ platform. In this dataset, ipilimumab increased intratumoral T-cell density over time, whereas cryoablation ± ipilimumab diversified and remodeled the intratumoral T-cell clonal repertoire. Compared with monotherapy, cryoablation plus ipilimumab was associated with numerically greater numbers of peripheral blood and intratumoral T-cell clones expanding robustly following therapy. In conclusion, TCR sequencing correlates with H&E lymphocyte scoring and provides additional information on clonal diversity. These findings support further study of the use of TCR sequencing as a biomarker for T-cell responses to therapy and for the study of cryoimmunotherapy in early-stage breast cancer. Cancer Immunol Res; 4(10); 835-44. ©2016 American Association for Cancer Research.

  3. Resolving species delimitation within the genus Bunopus Blanford, 1874 (Squamata: Gekkonidae) in Iran using DNA barcoding approach.

    PubMed

    Khosravani, Azar; Rastegar-Pouyani, Eskandar; Rastegar-Pouyani, Nasrullah; Oraie, Hamzeh; Papenfuss, Theodore J

    2017-12-19

    Mitochondrial COI sequences were used to investigate species delimitation within the genus Bunopus in Iran. A dataset with a final sequence length of 633 nucleotides including 100 specimens from 31 geographically distant localities across Iran was generated. The results demonstrated that two major clades with strong support can be identified within the genus Bunopus in Iran. Clade A includes Bunopus crassicaudus and two new entities, eastern populations (subclade A2,1) and Shahdad populations (subclade A2,2). The second clade comprises western and southwestern populations (subclade B1,1), Arabian populations (subclade B1,2) and south and southeast populations in Iran, to which Bunopus tuberculatus (subclade B2) is assigned. In addition to Bunopus crassicaudus and B. tuberculatus, three new candidate species in Iran can easily be identified based on the DNA barcoding approach.

  4. ESTminer: a Web interface for mining EST contig and cluster databases.

    PubMed

    Huang, Yecheng; Pumphrey, Janie; Gingle, Alan R

    2005-03-01

    ESTminer is a Web application and database schema for interactive mining of expressed sequence tag (EST) contig and cluster datasets. The Web interface contains a query frame that allows the selection of contigs/clusters with specific cDNA library makeup or a threshold number of members. The results are displayed as color-coded tree nodes, where the color indicates the fractional size of each cDNA library component. The nodes are expandable, revealing library statistics as well as EST or contig members, with links to sequence data, GenBank records or user configurable links. Also, the interface allows 'queries within queries' where the result set of a query is further filtered by the subsequent query. ESTminer is implemented in Java/JSP and the package, including MySQL and Oracle schema creation scripts, is available from http://cggc.agtec.uga.edu/Data/download.asp. Contact: agingle@uga.edu.

  5. RNA design using simulated SHAPE data.

    PubMed

    Lotfi, Mohadeseh; Zare-Mirakabad, Fatemeh; Montaseri, Soheila

    2018-05-03

    It has long been established that in addition to being involved in protein translation, RNA plays essential roles in numerous other cellular processes, including gene regulation and DNA replication. Such roles are known to be dictated by higher-order structures of RNA molecules. It is therefore of prime importance to find an RNA sequence that can fold to acquire a particular function that is desirable for use in pharmaceuticals and basic research. The challenge of finding an RNA sequence for a given structure is known as the RNA design problem. Although there are several algorithms to solve this problem, they mainly consider hard constraints, such as minimum free energy, to evaluate the predicted sequences. Recently, SHAPE data has emerged as a new soft constraint for RNA secondary structure prediction. To take advantage of this new experimental constraint, we report here a new method for accurate design of RNA sequences based on their secondary structures using SHAPE data as pseudo-free energy. We then compare our algorithm with four others: INFO-RNA, ERD, MODENA and RNAifold 2.0. Our algorithm precisely predicts 26 out of 29 new sequences for the structures extracted from the Rfam dataset, while the other four algorithms predict no more than 22 out of 29. The proposed algorithm is comparable to the above algorithms on RNA-SSD datasets, where they can predict up to 33 appropriate sequences for RNA secondary structures out of 34.

  6. Chloroplast variation is incongruent with classification of the Australian bloodwood eucalypts (genus Corymbia, family Myrtaceae)

    PubMed Central

    Schuster, Tanja M.; Setaro, Sabrina D.; Tibbits, Josquin F. G.; Batty, Erin L.; Fowler, Rachael M.; McLay, Todd G. B.; Wilcox, Stephen; Ades, Peter K.

    2018-01-01

    Previous molecular phylogenetic analyses have resolved the Australian bloodwood eucalypt genus Corymbia (~100 species) as either monophyletic or paraphyletic with respect to Angophora (9–10 species). Here we assess relationships of Corymbia and Angophora using a large dataset of chloroplast DNA sequences (121,016 base pairs; from 90 accessions representing 55 Corymbia and 8 Angophora species, plus 33 accessions of related genera), skimmed from high throughput sequencing of genomic DNA, and compare results with new analyses of nuclear ITS sequences (119 accessions) from previous studies. Maximum likelihood and maximum parsimony analyses of cpDNA resolve well supported trees with most nodes having >95% bootstrap support. These trees strongly reject monophyly of Corymbia, its two subgenera (Corymbia and Blakella), most taxonomic sections (Abbreviatae, Maculatae, Naviculares, Septentrionales), and several species. ITS trees weakly indicate paraphyly of Corymbia (bootstrap support <50% for maximum likelihood, and 71% for parsimony), but are highly incongruent with the cpDNA analyses, in that they support monophyly of both subgenera and some taxonomic sections of Corymbia. The striking incongruence between cpDNA trees and both morphological taxonomy and ITS trees is attributed largely to chloroplast introgression between taxa, because of geographic sharing of chloroplast clades across taxonomic groups. Such introgression has been widely inferred in studies of the related genus Eucalyptus. This is the first report of its likely prevalence in Corymbia and Angophora, but this is consistent with previous morphological inferences of hybridisation between species. Our findings (based on continent-wide sampling) highlight a need for more focussed studies to assess the extent of hybridisation and introgression in the evolutionary history of these genera, and that critical testing of the classification of Corymbia and Angophora requires additional sequence data from nuclear genomes. PMID:29668710

  7. Finding Protein and Nucleotide Similarities with FASTA

    PubMed Central

    Pearson, William R.

    2016-01-01

    The FASTA programs provide a comprehensive set of rapid similarity searching tools (fasta36, fastx36, tfastx36, fasty36, tfasty36), similar to those provided by the BLAST package, as well as programs for slower, optimal, local and global similarity searches (ssearch36, ggsearch36) and for searching with short peptides and oligonucleotides (fasts36, fastm36). The FASTA programs use an empirical strategy for estimating statistical significance that accommodates a range of similarity scoring matrices and gap penalties, improving alignment boundary accuracy and search sensitivity (Unit 3.5). The FASTA programs can produce “BLAST-like” alignment and tabular output, for ease of integration into existing analysis pipelines, and can search small, representative databases, and then report results for a larger set of sequences, using links from the smaller dataset. The FASTA programs work with a wide variety of database formats, including mySQL and postgreSQL databases (Unit 9.4). The programs also provide a strategy for integrating domain and active site annotations into alignments and highlighting the mutational state of functionally critical residues. These protocols describe how to use the FASTA programs to characterize protein and DNA sequences, using protein:protein, protein:DNA, and DNA:DNA comparisons. PMID:27010337

  8. Finding Protein and Nucleotide Similarities with FASTA.

    PubMed

    Pearson, William R

    2016-03-24

    The FASTA programs provide a comprehensive set of rapid similarity searching tools (fasta36, fastx36, tfastx36, fasty36, tfasty36), similar to those provided by the BLAST package, as well as programs for slower, optimal, local, and global similarity searches (ssearch36, ggsearch36), and for searching with short peptides and oligonucleotides (fasts36, fastm36). The FASTA programs use an empirical strategy for estimating statistical significance that accommodates a range of similarity scoring matrices and gap penalties, improving alignment boundary accuracy and search sensitivity. The FASTA programs can produce "BLAST-like" alignment and tabular output, for ease of integration into existing analysis pipelines, and can search small, representative databases, and then report results for a larger set of sequences, using links from the smaller dataset. The FASTA programs work with a wide variety of database formats, including mySQL and postgreSQL databases. The programs also provide a strategy for integrating domain and active site annotations into alignments and highlighting the mutational state of functionally critical residues. These protocols describe how to use the FASTA programs to characterize protein and DNA sequences, using protein:protein, protein:DNA, and DNA:DNA comparisons. Copyright © 2016 John Wiley & Sons, Inc.

  9. cuRRBS: simple and robust evaluation of enzyme combinations for reduced representation approaches.

    PubMed

    Martin-Herranz, Daniel E; Ribeiro, António J M; Krueger, Felix; Thornton, Janet M; Reik, Wolf; Stubbs, Thomas M

    2017-11-16

    DNA methylation is an important epigenetic modification in many species that is critical for development, and implicated in ageing and many complex diseases, such as cancer. Many cost-effective genome-wide analyses of DNA modifications rely on restriction enzymes capable of digesting genomic DNA at defined sequence motifs. There are hundreds of restriction enzyme families but few are used to date, because no tool is available for the systematic evaluation of restriction enzyme combinations that can enrich for certain sites of interest in a genome. Herein, we present customised Reduced Representation Bisulfite Sequencing (cuRRBS), a novel and easy-to-use computational method that solves this problem. By computing the optimal enzymatic digestions and size selection steps required, cuRRBS generalises the traditional MspI-based Reduced Representation Bisulfite Sequencing (RRBS) protocol to all restriction enzyme combinations. In addition, cuRRBS estimates the fold-reduction in sequencing costs and provides a robustness value for the personalised RRBS protocol, allowing users to tailor the protocol to their experimental needs. Moreover, we show in silico that cuRRBS-defined restriction enzymes consistently out-perform MspI digestion in many biological systems, considering both CpG and CHG contexts. Finally, we have validated the accuracy of cuRRBS predictions for single and double enzyme digestions using two independent experimental datasets. © The Author(s) 2017. Published by Oxford University Press on behalf of Nucleic Acids Research.
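
    The digestion and size-selection steps mentioned in this abstract can be illustrated with a minimal Python sketch. It is not the cuRRBS implementation: the function names, the toy sequence, and the size window are assumptions made for illustration, and only one strand is considered. The one grounded fact is that MspI recognizes CCGG and cuts after the first C (C^CGG).

```python
import re


def digest(sequence, site, cut_offset):
    """Cut `sequence` at every occurrence of `site`.

    `cut_offset` is the number of bases from the start of the site to the
    cut position (1 for MspI, which cuts C^CGG). Hypothetical helper.
    """
    sequence = sequence.upper()
    # The lookahead lets overlapping recognition sites be found as well.
    pattern = f"(?={re.escape(site)})"
    cuts = [m.start() + cut_offset for m in re.finditer(pattern, sequence)]
    bounds = [0] + cuts + [len(sequence)]
    return [sequence[a:b] for a, b in zip(bounds, bounds[1:]) if b > a]


def size_select(fragments, min_len, max_len):
    """Keep only fragments whose length lies within [min_len, max_len]."""
    return [f for f in fragments if min_len <= len(f) <= max_len]


# Toy sequence with three CCGG sites (an assumption, not real data).
toy = "AAACCGGTTTTTTCCGGAACCGGA"
fragments = digest(toy, "CCGG", 1)
selected = size_select(fragments, 5, 10)
print(fragments)
print(selected)
```

    Evaluating other enzymes or enzyme combinations in this sketch would amount to calling digest with a different site and offset and comparing which fragments survive the size window.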

  10. Land use type significantly affects microbial gene transcription in soil.

    PubMed

    Nacke, Heiko; Fischer, Christiane; Thürmer, Andrea; Meinicke, Peter; Daniel, Rolf

    2014-05-01

    Soil microorganisms play an essential role in sustaining biogeochemical processes and cycling of nutrients across different land use types. To gain insights into microbial gene transcription in forest and grassland soil, we isolated mRNA from 32 sampling sites. After sequencing of generated complementary DNA (cDNA), a total of 5,824,229 sequences could be further analyzed. We were able to assign nonribosomal cDNA sequences to all three domains of life. A dominance of bacterial sequences, which were affiliated to 25 different phyla, was found. Bacterial groups capable of aromatic compound degradation such as Phenylobacterium and Burkholderia were detected in significantly higher relative abundance in forest soil than in grassland soil. Accordingly, KEGG pathway categories related to degradation of aromatic ring-containing molecules (e.g., benzoate degradation) were identified in high abundance within forest soil-derived metatranscriptomic datasets. The impact of land use type forest on community composition and activity is evidently to a high degree caused by the presence of wood breakdown products. Correspondingly, bacterial groups known to be involved in lignin degradation and containing ligninolytic genes such as Burkholderia, Bradyrhizobium, and Azospirillum exhibited increased transcriptional activity in forest soil. Higher solar radiation in grassland presumably induced increased transcription of photosynthesis-related genes within this land use type. This is in accordance with high abundance of photosynthetic organisms and plant-infecting viruses in grassland.

  11. Identification of Predictive Cis-Regulatory Elements Using a Discriminative Objective Function and a Dynamic Search Space

    PubMed Central

    Karnik, Rahul; Beer, Michael A.

    2015-01-01

    The generation of genomic binding or accessibility data from massively parallel sequencing technologies such as ChIP-seq and DNase-seq continues to accelerate. Yet state-of-the-art computational approaches for the identification of DNA binding motifs often yield motifs of weak predictive power. Here we present a novel computational algorithm called MotifSpec, designed to find predictive motifs, in contrast to over-represented sequence elements. The key distinguishing feature of this algorithm is that it uses a dynamic search space and a learned threshold to find discriminative motifs in combination with the modeling of motifs using a full PWM (position weight matrix) rather than k-mer words or regular expressions. We demonstrate that our approach finds motifs corresponding to known binding specificities in several mammalian ChIP-seq datasets, and that our PWMs classify the ChIP-seq signals with accuracy comparable to, or marginally better than motifs from the best existing algorithms. In other datasets, our algorithm identifies novel motifs where other methods fail. Finally, we apply this algorithm to detect motifs from expression datasets in C. elegans using a dynamic expression similarity metric rather than fixed expression clusters, and find novel predictive motifs. PMID:26465884

  12. Identification of Predictive Cis-Regulatory Elements Using a Discriminative Objective Function and a Dynamic Search Space.

    PubMed

    Karnik, Rahul; Beer, Michael A

    2015-01-01

    The generation of genomic binding or accessibility data from massively parallel sequencing technologies such as ChIP-seq and DNase-seq continues to accelerate. Yet state-of-the-art computational approaches for the identification of DNA binding motifs often yield motifs of weak predictive power. Here we present a novel computational algorithm called MotifSpec, designed to find predictive motifs, in contrast to over-represented sequence elements. The key distinguishing feature of this algorithm is that it uses a dynamic search space and a learned threshold to find discriminative motifs in combination with the modeling of motifs using a full PWM (position weight matrix) rather than k-mer words or regular expressions. We demonstrate that our approach finds motifs corresponding to known binding specificities in several mammalian ChIP-seq datasets, and that our PWMs classify the ChIP-seq signals with accuracy comparable to, or marginally better than motifs from the best existing algorithms. In other datasets, our algorithm identifies novel motifs where other methods fail. Finally, we apply this algorithm to detect motifs from expression datasets in C. elegans using a dynamic expression similarity metric rather than fixed expression clusters, and find novel predictive motifs.

  13. CisSERS: Customizable in silico sequence evaluation for restriction sites

    DOE PAGES

    Sharpe, Richard M.; Koepke, Tyson; Harper, Artemus; ...

    2016-04-12

    High-throughput sequencing continues to produce an immense volume of information that is processed and assembled into mature sequence data. Here, data analysis tools are urgently needed that leverage the embedded DNA sequence polymorphisms and consequent changes to restriction sites or sequence motifs in a high-throughput manner to enable biological experimentation. CisSERS was developed as a standalone open source tool to analyze sequence datasets and provide biologists with individual or comparative genome organization information in terms of presence and frequency of patterns or motifs such as restriction enzymes. Predicted agarose gel visualization of the custom analyses results was also integrated to enhance the usefulness of the software. CisSERS offers several novel functionalities, such as handling of large and multiple datasets in parallel, multiple restriction enzyme site detection and custom motif detection features, which are seamlessly integrated with real time agarose gel visualization. Using a simple fasta-formatted file as input, CisSERS utilizes the REBASE enzyme database. Results from CisSERS enable the user to make decisions for designing genotyping by sequencing experiments, reduced representation sequencing, 3’UTR sequencing, and cleaved amplified polymorphic sequence (CAPS) molecular markers for large sample sets. CisSERS is a Java-based graphical user interface built around a Perl backbone. Several of the applications of CisSERS including CAPS molecular marker development were successfully validated using wet-lab experimentation. Here, we present the tool CisSERS and results from in-silico and corresponding wet-lab analyses demonstrating that CisSERS is a technology platform solution that facilitates efficient data utilization in genomics and genetics studies.

  14. CisSERS: Customizable in silico sequence evaluation for restriction sites

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Sharpe, Richard M.; Koepke, Tyson; Harper, Artemus

    High-throughput sequencing continues to produce an immense volume of information that is processed and assembled into mature sequence data. Here, data analysis tools are urgently needed that leverage the embedded DNA sequence polymorphisms and consequent changes to restriction sites or sequence motifs in a high-throughput manner to enable biological experimentation. CisSERS was developed as a standalone open source tool to analyze sequence datasets and provide biologists with individual or comparative genome organization information in terms of presence and frequency of patterns or motifs such as restriction enzymes. Predicted agarose gel visualization of the custom analyses results was also integrated to enhance the usefulness of the software. CisSERS offers several novel functionalities, such as handling of large and multiple datasets in parallel, multiple restriction enzyme site detection and custom motif detection features, which are seamlessly integrated with real time agarose gel visualization. Using a simple fasta-formatted file as input, CisSERS utilizes the REBASE enzyme database. Results from CisSERS enable the user to make decisions for designing genotyping by sequencing experiments, reduced representation sequencing, 3’UTR sequencing, and cleaved amplified polymorphic sequence (CAPS) molecular markers for large sample sets. CisSERS is a Java-based graphical user interface built around a Perl backbone. Several of the applications of CisSERS including CAPS molecular marker development were successfully validated using wet-lab experimentation. Here, we present the tool CisSERS and results from in-silico and corresponding wet-lab analyses demonstrating that CisSERS is a technology platform solution that facilitates efficient data utilization in genomics and genetics studies.

  15. Human centromere genomics: now it's personal.

    PubMed

    Hayden, Karen E

    2012-07-01

    Advances in human genomics have accelerated studies in evolution, disease, and cellular regulation. However, centromere sequences, defining the chromosomal interface with spindle microtubules, remain largely absent from ongoing genomic studies and disconnected from functional, genome-wide analyses. This disparity results from the challenge of predicting the linear order of multi-megabase-sized regions that are composed almost entirely of near-identical satellite DNA. Acknowledging these challenges, the field of human centromere genomics possesses the potential to rapidly advance given the availability of individual, or personalized, genome projects matched with the promise of long-read sequencing technologies. Here I review the current genomic model of human centromeres in consideration of those studies involving functional datasets that examine the role of sequence in centromere identity.

  16. Deep sequencing methods for protein engineering and design.

    PubMed

    Wrenbeck, Emily E; Faber, Matthew S; Whitehead, Timothy A

    2017-08-01

    The advent of next-generation sequencing (NGS) has revolutionized protein science, and the development of complementary methods enabling NGS-driven protein engineering has followed. In general, these experiments address the functional consequences of thousands of protein variants in a massively parallel manner using genotype-phenotype linked high-throughput functional screens followed by DNA counting via deep sequencing. We highlight the use of information-rich datasets to engineer protein molecular recognition. Examples include the creation of multiple dual-affinity Fabs targeting structurally dissimilar epitopes and engineering of a broad germline-targeted anti-HIV-1 immunogen. Additionally, we highlight the generation of enzyme fitness landscapes for conducting fundamental studies of protein behavior and evolution. We conclude with discussion of technological advances. Copyright © 2016 Elsevier Ltd. All rights reserved.

  17. The Histone Database: an integrated resource for histones and histone fold-containing proteins

    PubMed Central

    Mariño-Ramírez, Leonardo; Levine, Kevin M.; Morales, Mario; Zhang, Suiyuan; Moreland, R. Travis; Baxevanis, Andreas D.; Landsman, David

    2011-01-01

    Eukaryotic chromatin is composed of DNA and protein components—core histones—that act to compactly pack the DNA into nucleosomes, the fundamental building blocks of chromatin. These nucleosomes are connected to adjacent nucleosomes by linker histones. Nucleosomes are highly dynamic and, through various core histone post-translational modifications and incorporation of diverse histone variants, can serve as epigenetic marks to control processes such as gene expression and recombination. The Histone Sequence Database is a curated collection of sequences and structures of histones and non-histone proteins containing histone folds, assembled from major public databases. Here, we report a substantial increase in the number of sequences and taxonomic coverage for histone and histone fold-containing proteins available in the database. Additionally, the database now contains an expanded dataset that includes archaeal histone sequences. The database also provides comprehensive multiple sequence alignments for each of the four core histones (H2A, H2B, H3 and H4), the linker histones (H1/H5) and the archaeal histones. The database also includes current information on solved histone fold-containing structures. The Histone Sequence Database is an inclusive resource for the analysis of chromatin structure and function focused on histones and histone fold-containing proteins. Database URL: The Histone Sequence Database is freely available and can be accessed at http://research.nhgri.nih.gov/histones/. PMID:22025671

  18. Allele Identification for Transcriptome-Based Population Genomics in the Invasive Plant Centaurea solstitialis

    PubMed Central

    Dlugosch, Katrina M.; Lai, Zhao; Bonin, Aurélie; Hierro, José; Rieseberg, Loren H.

    2013-01-01

    Transcriptome sequences are becoming more broadly available for multiple individuals of the same species, providing opportunities to derive population genomic information from these datasets. Using the 454 Life Science Genome Sequencer FLX and FLX-Titanium next-generation platforms, we generated 11−430 Mbp of sequence for normalized cDNA for 40 wild genotypes of the invasive plant Centaurea solstitialis, yellow starthistle, from across its worldwide distribution. We examined the impact of sequencing effort on transcriptome recovery and overlap among individuals. To do this, we developed two novel publicly available software pipelines: SnoWhite for read cleaning before assembly, and AllelePipe for clustering of loci and allele identification in assembled datasets with or without a reference genome. AllelePipe is designed specifically for cases in which read depth information is not appropriate or available to assist with disentangling closely related paralogs from allelic variation, as in transcriptome or previously assembled libraries. We find that modest applications of sequencing effort recover most of the novel sequences present in the transcriptome of this species, including single-copy loci and a representative distribution of functional groups. In contrast, the coverage of variable sites, observation of heterozygosity, and overlap among different libraries are all highly dependent on sequencing effort. Nevertheless, the information gained from overlapping regions was informative regarding coarse population structure and variation across our small number of population samples, providing the first genetic evidence in support of hypothesized invasion scenarios. PMID:23390612

  19. Distinct profiles of expressed sequence tags during intestinal regeneration in the sea cucumber Holothuria glaberrima

    PubMed Central

    Rojas-Cartagena, Carmencita; Ortíz-Pineda, Pablo; Ramírez-Gómez, Francisco; Suárez-Castillo, Edna C.; Matos-Cruz, Vanessa; Rodríguez, Carlos; Ortíz-Zuazaga, Humberto; García-Arrarás, José E.

    2010-01-01

    Repair and regeneration are key processes for tissue maintenance, and their disruption may lead to disease states. Little is known about the molecular mechanisms that underlie the repair and regeneration of the digestive tract. The sea cucumber Holothuria glaberrima represents an excellent model to dissect and characterize the molecular events during intestinal regeneration. To study the gene expression profile, cDNA libraries were constructed from normal, 3-day, and 7-day regenerating intestines of H. glaberrima. Clones were randomly sequenced and queried against the nonredundant protein database at the National Center for Biotechnology Information. RT-PCR analyses were made of several genes to determine their expression profile during intestinal regeneration. A total of 5,173 sequences from three cDNA libraries were obtained. About 46.2, 35.6, and 26.2% of the sequences for the normal, 3-day, and 7-day cDNA libraries, respectively, shared significant similarity with known sequences in the protein database of GenBank but only present 10% of similarity among them. Analysis of the libraries in terms of functional processes, protein domains, and most common sequences suggests that a differential expression profile is taking place during the regeneration process. Further examination of the expressed sequence tag dataset revealed that 12 putative genes are differentially expressed at a significant level (R > 6). Experimental validation by RT-PCR analysis reveals that at least three genes (unknown C-4677-1, melanotransferrin, and centaurin) present a differential expression during regeneration. These findings strongly suggest that the gene expression profile varies among regeneration stages and provide evidence for the existence of differential gene expression. PMID:17579180

  20. A Feature-Based Approach to Modeling Protein–DNA Interactions

    PubMed Central

    Segal, Eran

    2008-01-01

    Transcription factor (TF) binding to its DNA target site is a fundamental regulatory interaction. The most common model used to represent TF binding specificities is a position specific scoring matrix (PSSM), which assumes independence between binding positions. However, in many cases, this simplifying assumption does not hold. Here, we present feature motif models (FMMs), a novel probabilistic method for modeling TF–DNA interactions, based on log-linear models. Our approach uses sequence features to represent TF binding specificities, where each feature may span multiple positions. We develop the mathematical formulation of our model and devise an algorithm for learning its structural features from binding site data. We also developed a discriminative motif finder, which discovers de novo FMMs that are enriched in target sets of sequences compared to background sets. We evaluate our approach on synthetic data and on the widely used TF chromatin immunoprecipitation (ChIP) dataset of Harbison et al. We then apply our algorithm to high-throughput TF ChIP data from mouse and human, reveal sequence features that are present in the binding specificities of mouse and human TFs, and show that FMMs explain TF binding significantly better than PSSMs. Our FMM learning and motif finder software are available at http://genie.weizmann.ac.il/. PMID:18725950
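
    The independence assumption of a PSSM described in this abstract can be made concrete with a short Python sketch: the score of a candidate site is simply the sum of per-position log-odds terms, so no term can depend on more than one position. This is a generic illustration, not the authors' FMM software; the function names, the pseudocount, the uniform background of 0.25, and the toy binding sites are all assumptions.

```python
import math


def build_pssm(sites, pseudocount=1.0, background=0.25):
    """Build a log-odds PSSM from equal-length binding sites (hypothetical helper)."""
    pssm = []
    for i in range(len(sites[0])):
        # Start every base at the pseudocount so unseen bases get a finite score.
        counts = {base: pseudocount for base in "ACGT"}
        for site in sites:
            counts[site[i]] += 1
        total = sum(counts.values())
        pssm.append({b: math.log2(counts[b] / total / background) for b in "ACGT"})
    return pssm


def score(pssm, window):
    """Sum of independent per-position log-odds scores."""
    return sum(column[base] for column, base in zip(pssm, window))


def best_hit(pssm, sequence):
    """Return (score, start) of the best-scoring window in `sequence`."""
    width = len(pssm)
    return max(
        (score(pssm, sequence[i:i + width]), i)
        for i in range(len(sequence) - width + 1)
    )


# Toy binding sites (an assumption, not data from the paper).
sites = ["TATA", "TATA", "TACA", "TATA"]
pssm = build_pssm(sites)
print(best_hit(pssm, "GGTATAGG"))
```

    Because score adds one term per position, a dependency such as "C at position 3 only when A is at position 2" cannot be expressed; features spanning several positions, as proposed in the abstract, are one way to lift that restriction.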

  1. A Hybrid Parallel Strategy Based on String Graph Theory to Improve De Novo DNA Assembly on the TianHe-2 Supercomputer.

    PubMed

    Zhang, Feng; Liao, Xiangke; Peng, Shaoliang; Cui, Yingbo; Wang, Bingqiang; Zhu, Xiaoqian; Liu, Jie

    2016-06-01

    The de novo assembly of DNA sequences is increasingly important for biological research in the genomic era. More than a decade after the Human Genome Project, some challenges still exist and new solutions are being explored to improve de novo assembly of genomes. String graph assembler (SGA), based on string graph theory, is a new method/tool developed to address these challenges. In this paper, based on an in-depth analysis of SGA, we prove that SGA-based de novo sequence assembly is an NP-complete problem. According to our analysis, SGA outperforms other similar methods/tools in memory consumption, but costs much more time, of which 60-70% is spent on index construction. Building on this analysis, we introduce a hybrid parallel optimization algorithm and implement it in the TianHe-2's parallel framework. Simulations are performed with different datasets. For small datasets the optimized solution is 3.06 times faster than before, and for medium-sized datasets it is 1.60 times faster. The results demonstrate an evident performance improvement, with linear scalability for parallel FM-index construction. These results thus contribute significantly to improving the efficiency of de novo assembly of DNA sequences.
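
    Since the abstract attributes most of the running time to index construction, a minimal Python sketch of what an FM-index contains may help. It is a textbook-style illustration, not code from SGA or from the TianHe-2 implementation: the function names are assumptions, and the Burrows-Wheeler transform is built by naively sorting all rotations, which is only practical for toy inputs.

```python
def bwt(text):
    """Burrows-Wheeler transform via sorted rotations (naive, toy inputs only)."""
    text += "$"  # sentinel that sorts before A, C, G, T and lowercase letters
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    return "".join(rotation[-1] for rotation in rotations)


def fm_index(bwt_str):
    """Return (C, occ) for backward search.

    C[c] is the number of characters in the text smaller than c;
    occ[c][i] is the number of occurrences of c in bwt_str[:i].
    """
    alphabet = sorted(set(bwt_str))
    C, total = {}, 0
    for c in alphabet:
        C[c] = total
        total += bwt_str.count(c)
    occ = {c: [0] for c in alphabet}
    for ch in bwt_str:
        for c in alphabet:
            occ[c].append(occ[c][-1] + (ch == c))
    return C, occ


def count(pattern, C, occ, n):
    """Count occurrences of `pattern` by backward search over the index."""
    lo, hi = 0, n
    for ch in reversed(pattern):
        if ch not in C:
            return 0
        lo = C[ch] + occ[ch][lo]
        hi = C[ch] + occ[ch][hi]
        if lo >= hi:
            return 0
    return hi - lo


transformed = bwt("banana")
C, occ = fm_index(transformed)
print(transformed, count("ana", C, occ, len(transformed)))
```

    Once the index exists, each query costs time proportional to the pattern length regardless of the text length, which is why the construction step dominates.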

  2. Methods for comparative metagenomics

    PubMed Central

    Huson, Daniel H; Richter, Daniel C; Mitra, Suparna; Auch, Alexander F; Schuster, Stephan C

    2009-01-01

    Background: Metagenomics is a rapidly growing field of research that aims at studying uncultured organisms to understand the true diversity of microbes, their functions, cooperation and evolution, in environments such as soil, water, ancient remains of animals, or the digestive system of animals and humans. The recent development of ultra-high throughput sequencing technologies, which do not require cloning or PCR amplification, and can produce huge numbers of DNA reads at an affordable cost, has boosted the number and scope of metagenomic sequencing projects. Increasingly, there is a need for new ways of comparing multiple metagenomics datasets, and for fast and user-friendly implementations of such approaches. Results: This paper introduces a number of new methods for interactively exploring, analyzing and comparing multiple metagenomic datasets, which will be made freely available in a new, comparative version 2.0 of the stand-alone metagenome analysis tool MEGAN. Conclusion: There is a great need for powerful and user-friendly tools for comparative analysis of metagenomic data and MEGAN 2.0 will help to fill this gap. PMID:19208111

  3. A Continental-Wide Perspective: The Genepool of Nuclear Encoded Ribosomal DNA and Single-Copy Gene Sequences in North American Boechera (Brassicaceae)

    PubMed Central

    Kiefer, Christiane; Koch, Marcus A.

    2012-01-01

    74 of the currently accepted 111 taxa of the North American genus Boechera (Brassicaceae) were subject to phylogenetic reconstruction and network analysis. The dataset comprised 911 accessions for which ITS sequences were analyzed. Phylogenetic analyses yielded largely unresolved trees. Together with the network analysis confirming this result, this can be interpreted as an indication of multiple, independent, and rapid diversification events. Network analyses were superimposed with datasets describing i) geographical distribution, ii) taxonomy, iii) reproductive mode, and iv) distribution history based on phylogeographic evidence. Our results provide first direct evidence for enormous reticulate evolution in the entire genus and give further insights into the evolutionary history of this complex genus on a continental scale. In addition two novel single-copy gene markers, orthologues of the Arabidopsis thaliana genes At2g25920 and At3g18900, were analyzed for subsets of taxa and confirmed the findings obtained through the ITS data. PMID:22606266

  4. Phylogenetic relationships and timing of diversification in gonorynchiform fishes inferred using nuclear gene DNA sequences (Teleostei: Ostariophysi).

    PubMed

    Near, Thomas J; Dornburg, Alex; Friedman, Matt

    2014-11-01

    The Gonorynchiformes are the sister lineage of the species-rich Otophysi and provide important insights into the diversification of ostariophysan fishes. Phylogenies of gonorynchiforms inferred using morphological characters and mtDNA gene sequences provide differing resolutions with regard to the sister lineage of all other gonorynchiforms (Chanos vs. Gonorynchus) and support for monophyly of the two miniaturized lineages Cromeria and Grasseichthys. In this study the phylogeny and divergence times of gonorynchiforms are investigated with DNA sequences sampled from nine nuclear genes and a published morphological character matrix. Bayesian phylogenetic analyses reveal substantial congruence among individual gene trees with inferences from eight genes placing Gonorynchus as the sister lineage to all other gonorynchiforms. Seven gene trees resolve Cromeria and Grasseichthys as a clade, supporting previous inferences using morphological characters. Phylogenies resulting from either concatenating the nuclear genes, performing a multispecies coalescent species tree analysis, or combining the morphological and nuclear gene DNA sequences resolve Gonorynchus as the living sister lineage of all other gonorynchiforms, strongly support the monophyly of Cromeria and Grasseichthys, and resolve a clade containing Parakneria, Cromeria, and Grasseichthys. The morphological dataset, which includes 13 gonorynchiform fossil taxa that range in age from Early Cretaceous to Eocene, was analyzed in combination with DNA sequences from the nine nuclear genes and a relaxed molecular clock to estimate times of evolutionary divergence. This "tip dating" strategy accommodates uncertainty in the phylogenetic resolution of fossil taxa that provide calibration information in the relaxed molecular clock analysis. 
    The estimated age of the most recent common ancestor (MRCA) of living gonorynchiforms is slightly older than estimates from previous node dating efforts, but the molecular tip dating estimated ages of Kneriinae (Kneria, Parakneria, Cromeria, and Grasseichthys) and the two paedomorphic lineages, Cromeria and Grasseichthys, are considerably younger. Copyright © 2014 Elsevier Inc. All rights reserved.

  5. Open resource metagenomics: a model for sharing metagenomic libraries.

    PubMed

    Neufeld, J D; Engel, K; Cheng, J; Moreno-Hagelsieb, G; Rose, D R; Charles, T C

    2011-11-30

    Both sequence-based and activity-based exploitation of environmental DNA have provided unprecedented access to the genomic content of cultivated and uncultivated microorganisms. Although researchers deposit microbial strains in culture collections and DNA sequences in databases, activity-based metagenomic studies typically only publish sequences from the hits retrieved from specific screens. Physical metagenomic libraries, conceptually similar to entire sequence datasets, are usually not straightforward to obtain by interested parties subsequent to publication. In order to facilitate unrestricted distribution of metagenomic libraries, we propose the adoption of open resource metagenomics, in line with the trend towards open access publishing, and similar to culture- and mutant-strain collections that have been the backbone of traditional microbiology and microbial genetics. The concept of open resource metagenomics includes preparation of physical DNA libraries, preferably in versatile vectors that facilitate screening in a diversity of host organisms, and pooling of clones so that single aliquots containing complete libraries can be easily distributed upon request. Database deposition of associated metadata and sequence data for each library provides researchers with information to select the most appropriate libraries for further research projects. As a starting point, we have established the Canadian MetaMicroBiome Library (CM(2)BL [1]). The CM(2)BL is a publicly accessible collection of cosmid libraries containing environmental DNA from soils collected from across Canada, spanning multiple biomes. The libraries were constructed such that the cloned DNA can be easily transferred to Gateway® compliant vectors, facilitating functional screening in virtually any surrogate microbial host for which there are available plasmid vectors. 
    The libraries, which we are placing in the public domain, will be distributed upon request without restriction to members of both the academic research community and industry. This article invites the scientific community to adopt this philosophy of open resource metagenomics to extend the utility of functional metagenomics beyond initial publication, circumventing the need to start from scratch with each new research project.

  6. Open resource metagenomics: a model for sharing metagenomic libraries

    PubMed Central

    Neufeld, J.D.; Engel, K.; Cheng, J.; Moreno-Hagelsieb, G.; Rose, D.R.; Charles, T.C.

    2011-01-01

    Both sequence-based and activity-based exploitation of environmental DNA have provided unprecedented access to the genomic content of cultivated and uncultivated microorganisms. Although researchers deposit microbial strains in culture collections and DNA sequences in databases, activity-based metagenomic studies typically only publish sequences from the hits retrieved from specific screens. Physical metagenomic libraries, conceptually similar to entire sequence datasets, are usually not straightforward to obtain by interested parties subsequent to publication. In order to facilitate unrestricted distribution of metagenomic libraries, we propose the adoption of open resource metagenomics, in line with the trend towards open access publishing, and similar to culture- and mutant-strain collections that have been the backbone of traditional microbiology and microbial genetics. The concept of open resource metagenomics includes preparation of physical DNA libraries, preferably in versatile vectors that facilitate screening in a diversity of host organisms, and pooling of clones so that single aliquots containing complete libraries can be easily distributed upon request. Database deposition of associated metadata and sequence data for each library provides researchers with information to select the most appropriate libraries for further research projects. As a starting point, we have established the Canadian MetaMicroBiome Library (CM2BL [1]). The CM2BL is a publicly accessible collection of cosmid libraries containing environmental DNA from soils collected from across Canada, spanning multiple biomes. The libraries were constructed such that the cloned DNA can be easily transferred to Gateway® compliant vectors, facilitating functional screening in virtually any surrogate microbial host for which there are available plasmid vectors. 
    The libraries, which we are placing in the public domain, will be distributed upon request without restriction to members of both the academic research community and industry. This article invites the scientific community to adopt this philosophy of open resource metagenomics to extend the utility of functional metagenomics beyond initial publication, circumventing the need to start from scratch with each new research project. PMID:22180823

  7. Classification and assessment tools for structural motif discovery algorithms.

    PubMed

    Badr, Ghada; Al-Turaiki, Isra; Mathkour, Hassan

    2013-01-01

    Motif discovery is the problem of finding recurring patterns in biological data. Patterns can be sequential, mainly when discovered in DNA sequences. They can also be structural (e.g. when discovering RNA motifs). Finding common structural patterns helps to gain a better understanding of the mechanism of action (e.g. post-transcriptional regulation). Unlike DNA motifs, which are sequentially conserved, RNA motifs exhibit conservation in structure, which may be common even if the sequences are different. Over the past few years, hundreds of algorithms have been developed to solve the sequential motif discovery problem, while less work has been done for the structural case. In this paper, we survey, classify, and compare different algorithms that solve the structural motif discovery problem, where the underlying sequences may be different. We highlight their strengths and weaknesses. We start by proposing a benchmark dataset and a measurement tool that can be used to evaluate different motif discovery approaches. Then, we proceed by proposing our experimental setup. Finally, results are obtained using the proposed benchmark to compare available tools. To the best of our knowledge, this is the first attempt to compare tools solely designed for structural motif discovery. Results show that the accuracy of discovered motifs is relatively low. The results also suggest a complementary behavior among tools where some tools perform well on simple structures, while other tools are better for complex structures. We have classified and evaluated the performance of available structural motif discovery tools. In addition, we have proposed a benchmark dataset with tools that can be used to evaluate newly developed tools.

  8. HBLAST: Parallelised sequence similarity--A Hadoop MapReducable basic local alignment search tool.

    PubMed

    O'Driscoll, Aisling; Belogrudov, Vladislav; Carroll, John; Kropp, Kai; Walsh, Paul; Ghazal, Peter; Sleator, Roy D

    2015-04-01

    The recent exponential growth of genomic databases has resulted in the common task of sequence alignment becoming one of the major bottlenecks in the field of computational biology. It is typical for these large datasets and complex computations to require cost-prohibitive High Performance Computing (HPC) to function. As such, parallelised solutions have been proposed, but many exhibit scalability limitations and are incapable of effectively processing "Big Data", the name attributed to datasets that are extremely large, complex and require rapid processing. The Hadoop framework, comprising distributed storage and a parallelised programming framework known as MapReduce, is specifically designed to work with such datasets, but it is not trivial to efficiently redesign and implement bioinformatics algorithms according to this paradigm. The parallelisation strategy of "divide and conquer" for alignment algorithms can be applied to both data sets and input query sequences. However, scalability is still an issue due to memory constraints or large databases, with very large database segmentation leading to additional performance decline. Herein, we present Hadoop Blast (HBlast), a parallelised BLAST algorithm that proposes a flexible method to partition both databases and input query sequences using "virtual partitioning". HBlast presents improved scalability over existing solutions and well-balanced computational work load while keeping database segmentation and recompilation to a minimum. Enhanced BLAST search performance on cheap memory-constrained hardware has significant implications for in-field clinical diagnostic testing, enabling faster and more accurate identification of pathogenic DNA in human blood or tissue samples.
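
The "divide and conquer" idea of splitting both the query set and the database can be illustrated with a minimal sketch. This is not HBlast's "virtual partitioning" (the abstract does not specify it); the function names `chunk` and `make_work_units` and the chunk sizes are assumptions chosen for illustration. Each resulting work unit could be handed to an independent alignment task.

```python
from itertools import product


def chunk(items, size):
    """Split a list into consecutive chunks of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def make_work_units(queries, db_sequences, query_chunk, db_chunk):
    """Pair every query chunk with every database chunk.

    Every (query, database sequence) combination lands in exactly one
    work unit, so the units can be searched independently and merged.
    """
    return list(product(chunk(queries, query_chunk),
                        chunk(db_sequences, db_chunk)))


# Hypothetical identifiers standing in for real sequences.
queries = ["q1", "q2", "q3"]
database = ["s1", "s2", "s3", "s4", "s5"]
units = make_work_units(queries, database, query_chunk=2, db_chunk=2)

for query_part, db_part in units:
    print(query_part, "vs", db_part)
```

Smaller chunks give more, lighter work units (better load balance, more scheduling overhead); larger chunks give the opposite. That trade-off is the reason partition sizes are left as parameters here.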

  9. VirSorter: mining viral signal from microbial genomic data.

    PubMed

    Roux, Simon; Enault, Francois; Hurwitz, Bonnie L; Sullivan, Matthew B

    2015-01-01

    Viruses of microbes impact all ecosystems where microbes drive key energy and substrate transformations including the oceans, humans and industrial fermenters. However, despite this recognized importance, our understanding of viral diversity and impacts remains limited by too few model systems and reference genomes. One way to fill these gaps in our knowledge of viral diversity is through the detection of viral signal in microbial genomic data. While multiple approaches have been developed and applied for the detection of prophages (viral genomes integrated in a microbial genome), new types of microbial genomic data are emerging that are more fragmented and larger scale, such as Single-cell Amplified Genomes (SAGs) of uncultivated organisms or genomic fragments assembled from metagenomic sequencing. Here, we present VirSorter, a tool designed to detect viral signal in these different types of microbial sequence data in both a reference-dependent and reference-independent manner, leveraging probabilistic models and extensive virome data to maximize detection of novel viruses. Performance testing shows that VirSorter's prophage prediction capability compares to that of available prophage predictors for complete genomes, but is superior in predicting viral sequences outside of a host genome (i.e., from extrachromosomal prophages, lytic infections, or partially assembled prophages). Furthermore, VirSorter outperforms existing tools for fragmented genomic and metagenomic datasets, and can identify viral signal in assembled sequence (contigs) as short as 3kb, while providing near-perfect identification (>95% Recall and 100% Precision) on contigs of at least 10kb. Because VirSorter scales to large datasets, it can also be used in "reverse" to more confidently identify viral sequence in viral metagenomes by sorting away cellular DNA whether derived from gene transfer agents, generalized transduction or contamination. 
    Finally, VirSorter is made available through the iPlant Cyberinfrastructure that provides a web-based user interface interconnected with the required computing resources. VirSorter thus complements existing prophage prediction software to better leverage fragmented, SAG and metagenomic datasets in a way that will scale to modern sequencing. Given these features, VirSorter should enable the discovery of new viruses in microbial datasets, and further our understanding of uncultivated viral communities across diverse ecosystems.

  10. VirSorter: mining viral signal from microbial genomic data

    PubMed Central

    Roux, Simon; Enault, Francois; Hurwitz, Bonnie L.

    2015-01-01

    Viruses of microbes impact all ecosystems where microbes drive key energy and substrate transformations including the oceans, humans and industrial fermenters. However, despite this recognized importance, our understanding of viral diversity and impacts remains limited by too few model systems and reference genomes. One way to fill these gaps in our knowledge of viral diversity is through the detection of viral signal in microbial genomic data. While multiple approaches have been developed and applied for the detection of prophages (viral genomes integrated in a microbial genome), new types of microbial genomic data are emerging that are more fragmented and larger scale, such as Single-cell Amplified Genomes (SAGs) of uncultivated organisms or genomic fragments assembled from metagenomic sequencing. Here, we present VirSorter, a tool designed to detect viral signal in these different types of microbial sequence data in both a reference-dependent and reference-independent manner, leveraging probabilistic models and extensive virome data to maximize detection of novel viruses. Performance testing shows that VirSorter’s prophage prediction capability compares to that of available prophage predictors for complete genomes, but is superior in predicting viral sequences outside of a host genome (i.e., from extrachromosomal prophages, lytic infections, or partially assembled prophages). Furthermore, VirSorter outperforms existing tools for fragmented genomic and metagenomic datasets, and can identify viral signal in assembled sequence (contigs) as short as 3kb, while providing near-perfect identification (>95% Recall and 100% Precision) on contigs of at least 10kb. Because VirSorter scales to large datasets, it can also be used in “reverse” to more confidently identify viral sequence in viral metagenomes by sorting away cellular DNA whether derived from gene transfer agents, generalized transduction or contamination. 
    Finally, VirSorter is made available through the iPlant Cyberinfrastructure that provides a web-based user interface interconnected with the required computing resources. VirSorter thus complements existing prophage prediction software to better leverage fragmented, SAG and metagenomic datasets in a way that will scale to modern sequencing. Given these features, VirSorter should enable the discovery of new viruses in microbial datasets, and further our understanding of uncultivated viral communities across diverse ecosystems. PMID:26038737

  11. NABIC: A New Access Portal to Search, Visualize, and Share Agricultural Genomics Data.

    PubMed

    Seol, Young-Joo; Lee, Tae-Ho; Park, Dong-Suk; Kim, Chang-Kug

    2016-01-01

    The National Agricultural Biotechnology Information Center developed an access portal to search, visualize, and share agricultural genomics data with a focus on South Korean information and resources. The portal features an agricultural biotechnology database containing a wide range of omics data from public and proprietary sources. We collected 28.4 TB of data from 162 agricultural organisms, with 10 types of omics data comprising next-generation sequencing sequence read archive, genome, gene, nucleotide, DNA chip, expressed sequence tag, interactome, protein structure, molecular marker, and single-nucleotide polymorphism datasets. Our genomic resources contain information on five animals, seven plants, and one fungus, which is accessed through a genome browser. We also developed a data submission and analysis system as a web service, with easy-to-use functions and cutting-edge algorithms, including those for handling next-generation sequencing data.

  12. Indel detection from DNA and RNA sequencing data with transIndel.

    PubMed

    Yang, Rendong; Van Etten, Jamie L; Dehm, Scott M

    2018-04-19

    Insertions and deletions (indels) are a major class of genomic variation associated with human disease. Indels are primarily detected from DNA sequencing (DNA-seq) data but their transcriptional consequences remain unexplored due to challenges in discriminating medium-sized and large indels from splicing events in RNA-seq data. Here, we developed transIndel, a splice-aware algorithm that parses the chimeric alignments predicted by a short read aligner and reconstructs the mid-sized insertions and large deletions based on the linear alignments of split reads from DNA-seq or RNA-seq data. TransIndel exhibits competitive or superior performance over eight state-of-the-art indel detection tools on benchmarks using both synthetic and real DNA-seq data. Additionally, we applied transIndel to DNA-seq and RNA-seq datasets from 333 primary prostate cancer patients from The Cancer Genome Atlas (TCGA) and 59 metastatic prostate cancer patients from AACR-PCF Stand-Up-To-Cancer (SU2C) studies. TransIndel enhanced the taxonomy of DNA- and RNA-level alterations in prostate cancer by identifying recurrent FOXA1 indels as well as exitron splicing in genes implicated in disease progression. Our study demonstrates that transIndel is a robust tool for elucidation of medium- and large-sized indels from DNA-seq and RNA-seq data. Including RNA-seq in indel discovery efforts leads to significant improvements in sensitivity for identification of medium-sized and large indels missed by DNA-seq, and reveals non-canonical RNA-splicing events in genes associated with disease pathology.

  13. MELOGEN: an EST database for melon functional genomics

    PubMed Central

    Gonzalez-Ibeas, Daniel; Blanca, José; Roig, Cristina; González-To, Mireia; Picó, Belén; Truniger, Verónica; Gómez, Pedro; Deleu, Wim; Caño-Delgado, Ana; Arús, Pere; Nuez, Fernando; Garcia-Mas, Jordi; Puigdomènech, Pere; Aranda, Miguel A

    2007-01-01

    Background: Melon (Cucumis melo L.) is one of the most important fleshy fruits for fresh consumption. Despite this, few genomic resources exist for this species. To facilitate the discovery of genes involved in essential traits, such as fruit development, fruit maturation and disease resistance, and to speed up the process of breeding new and better adapted melon varieties, we have produced a large collection of expressed sequence tags (ESTs) from eight normalized cDNA libraries from different tissues in different physiological conditions. Results: We determined over 30,000 ESTs that were clustered into 16,637 non-redundant sequences or unigenes, comprising 6,023 tentative consensus sequences (contigs) and 10,614 unclustered sequences (singletons). Many potential molecular markers were identified in the melon dataset: 1,052 potential simple sequence repeats (SSRs) and 356 single nucleotide polymorphisms (SNPs) were found. Sixty-nine percent of the melon unigenes showed a significant similarity with proteins in databases. Functional classification of the unigenes was carried out following the Gene Ontology scheme. In total, 9,402 unigenes were mapped to one or more ontologies. Remarkably, the distributions of melon and Arabidopsis unigenes followed similar tendencies, suggesting that the melon dataset is representative of the whole melon transcriptome. Bioinformatic analyses primarily focused on potential precursors of melon micro RNAs (miRNAs) in the melon dataset, but many other genes potentially controlling disease resistance and fruit quality traits were also identified. Patterns of transcript accumulation were characterised by Real-Time-qPCR for 20 of these genes. Conclusion: The collection of ESTs characterised here represents a substantial increase in the genetic information available for melon. A database (MELOGEN) which contains all EST sequences, contig images and several tools for analysis and data mining has been created. This set of sequences also constitutes the basis for an oligo-based microarray for melon that is being used in experiments to further analyse the melon transcriptome. PMID:17767721
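
As an illustration of how potential simple sequence repeats (SSRs) can be located in EST sequences, the sketch below scans a sequence for short motifs repeated in tandem. It is not the method used for MELOGEN, which the abstract does not describe; the function name `find_ssrs`, the motif lengths (2 to 4 bases) and the minimum of four repeats are assumptions.

```python
import re


def find_ssrs(seq, min_unit=2, max_unit=4, min_repeats=4):
    """Return (start, unit, repeat_count) for each tandem repeat found.

    A hit is a motif of min_unit..max_unit bases followed by at least
    (min_repeats - 1) further copies of itself. The lazy quantifier makes
    the regex try the shortest candidate motif first.
    """
    pattern = re.compile(
        r"([ACGT]{%d,%d}?)\1{%d,}" % (min_unit, max_unit, min_repeats - 1)
    )
    hits = []
    for match in pattern.finditer(seq.upper()):
        unit = match.group(1)
        hits.append((match.start(), unit, len(match.group(0)) // len(unit)))
    return hits


# Hypothetical sequence containing an (AC)n and an (AG)n repeat.
example = "TTGACACACACACAGGTCTAGAGAGAGAGATT"
for start, unit, count in find_ssrs(example):
    print(f"position {start}: ({unit}) x {count}")
```

Real SSR screens usually apply different minimum repeat counts per motif length and also consider mononucleotide runs; those thresholds are left out here to keep the sketch short.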

  14. Tentacle: distributed quantification of genes in metagenomes.

    PubMed

    Boulund, Fredrik; Sjögren, Anders; Kristiansson, Erik

    2015-01-01

    In metagenomics, microbial communities are sequenced at increasingly high resolution, generating datasets with billions of DNA fragments. Novel methods that can efficiently process the growing volumes of sequence data are necessary for the accurate analysis and interpretation of existing and upcoming metagenomes. Here we present Tentacle, which is a novel framework that uses distributed computational resources for gene quantification in metagenomes. Tentacle is implemented using a dynamic master-worker approach in which DNA fragments are streamed via a network and processed in parallel on worker nodes. Tentacle is modular, extensible, and comes with support for six commonly used sequence aligners. It is easy to adapt Tentacle to different applications in metagenomics and easy to integrate into existing workflows. Evaluations show that Tentacle scales very well with increasing computing resources. We illustrate the versatility of Tentacle on three different use cases. Tentacle is written for Linux in Python 2.7 and is published as open source under the GNU General Public License (v3). Documentation, tutorials, installation instructions, and the source code are freely available online at: http://bioinformatics.math.chalmers.se/tentacle.
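
The master-worker pattern described above can be sketched in a few lines. This is not Tentacle's code or API: the thread pool stands in for networked worker nodes, exact substring matching stands in for a real sequence aligner, and the names `GENES`, `quantify_batch` and `quantify` are hypothetical.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

# Hypothetical toy reference: gene name -> sequence.
GENES = {"geneA": "ACGTACGTGG", "geneB": "TTGCAATTCC"}


def quantify_batch(reads):
    """Worker task: count reads that occur as exact substrings of each gene."""
    counts = Counter()
    for read in reads:
        for name, seq in GENES.items():
            if read in seq:
                counts[name] += 1
    return counts


def batches(reads, size):
    """Yield consecutive batches of reads, mimicking a stream of fragments."""
    for i in range(0, len(reads), size):
        yield reads[i:i + size]


def quantify(reads, batch_size=2, workers=2):
    """Master: hand batches to workers and merge their partial counts."""
    total = Counter()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for partial in pool.map(quantify_batch, batches(reads, batch_size)):
            total.update(partial)
    return total


reads = ["ACGT", "GCAA", "CGTG", "TTCC", "GGGG"]
result = quantify(reads)
print(dict(result))
```

Because each batch is processed independently and the partial counts are simply summed, adding workers does not change the result, only the wall-clock time.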

  15. Mammalian genomic regulatory regions predicted by utilizing human genomics, transcriptomics, and epigenetics data

    PubMed Central

    Nguyen, Quan H; Tellam, Ross L; Naval-Sanchez, Marina; Porto-Neto, Laercio R; Barendse, William; Reverter, Antonio; Hayes, Benjamin; Kijas, James; Dalrymple, Brian P

    2018-01-01

    Genome sequences for hundreds of mammalian species are available, but an understanding of their genomic regulatory regions, which control gene expression, is only beginning. A comprehensive prediction of potential active regulatory regions is necessary to functionally study the roles of the majority of genomic variants in evolution, domestication, and animal production. We developed a computational method to predict regulatory DNA sequences (promoters, enhancers, and transcription factor binding sites) in production animals (cows and pigs) and extended its broad applicability to other mammals. The method utilizes human regulatory features identified from thousands of tissues, cell lines, and experimental assays to find homologous regions that are conserved in sequences and genome organization and are enriched for regulatory elements in the genome sequences of other mammalian species. Importantly, we developed a filtering strategy, including a machine learning classification method, to utilize a very small number of species-specific experimental datasets available to select for the likely active regulatory regions. The method finds the optimal combination of sensitivity and accuracy to unbiasedly predict regulatory regions in mammalian species. Furthermore, we demonstrated the utility of the predicted regulatory datasets in cattle for prioritizing variants associated with multiple production and climate change adaptation traits and identifying potential genome editing targets. PMID:29618048

  16. Mammalian genomic regulatory regions predicted by utilizing human genomics, transcriptomics, and epigenetics data.

    PubMed

    Nguyen, Quan H; Tellam, Ross L; Naval-Sanchez, Marina; Porto-Neto, Laercio R; Barendse, William; Reverter, Antonio; Hayes, Benjamin; Kijas, James; Dalrymple, Brian P

    2018-03-01

    Genome sequences for hundreds of mammalian species are available, but an understanding of their genomic regulatory regions, which control gene expression, is only beginning. A comprehensive prediction of potential active regulatory regions is necessary to functionally study the roles of the majority of genomic variants in evolution, domestication, and animal production. We developed a computational method to predict regulatory DNA sequences (promoters, enhancers, and transcription factor binding sites) in production animals (cows and pigs) and extended its broad applicability to other mammals. The method utilizes human regulatory features identified from thousands of tissues, cell lines, and experimental assays to find homologous regions that are conserved in sequences and genome organization and are enriched for regulatory elements in the genome sequences of other mammalian species. Importantly, we developed a filtering strategy, including a machine learning classification method, to utilize a very small number of species-specific experimental datasets available to select for the likely active regulatory regions. The method finds the optimal combination of sensitivity and accuracy to unbiasedly predict regulatory regions in mammalian species. Furthermore, we demonstrated the utility of the predicted regulatory datasets in cattle for prioritizing variants associated with multiple production and climate change adaptation traits and identifying potential genome editing targets.

  17. Reliable DNA Barcoding Performance Proved for Species and Island Populations of Comoran Squamate Reptiles

    PubMed Central

    Hawlitschek, Oliver; Nagy, Zoltán T.; Berger, Johannes; Glaw, Frank

    2013-01-01

    In the past decade, DNA barcoding became increasingly common as a method for species identification in biodiversity inventories and related studies. However, mainly due to technical obstacles, squamate reptiles have been the target of few barcoding studies. In this article, we present the results of a DNA barcoding study of squamates of the Comoros archipelago, a poorly studied group of oceanic islands close to and mostly colonized from Madagascar. The barcoding dataset presented here includes 27 of the 29 currently recognized squamate species of the Comoros, including 17 of the 18 endemic species. Some species considered endemic to the Comoros according to current taxonomy were found to cluster with non-Comoran lineages, probably due to poorly resolved taxonomy. All other species for which more than one barcode was obtained corresponded to distinct clusters useful for species identification by barcoding. In most species, even island populations could be distinguished using barcoding. Two cryptic species were identified using the DNA barcoding approach. The obtained barcoding topology, a Bayesian tree based on COI sequences of 5 genera, was compared with available multigene topologies, and in 3 cases, major incongruences between the two topologies became evident. Three of the multigene studies were initiated after initial screening of a preliminary version of the barcoding dataset presented here. We conclude that in the case of the squamates of the Comoros Islands, DNA barcoding has proven a very useful and efficient way of detecting isolated populations and promising starting points for subsequent research. PMID:24069192

  18. A multi-locus analysis of phylogenetic relationships within grass subfamily Pooideae (Poaceae) inferred from sequences of nuclear single copy gene regions compared with plastid DNA.

    PubMed

    Hochbach, Anne; Schneider, Julia; Röser, Martin

    2015-06-01

    To investigate phylogenetic relationships within the grass subfamily Pooideae we studied about 50 taxa covering all recognized tribes, using one plastid DNA (cpDNA) marker (matK gene-3'trnK exon) and for the first time four nuclear single copy gene loci. DNA sequence information from two parts of the nuclear genes topoisomerase 6 (Topo6) spanning the exons 8-13 and 17-19, the exons 9-13 encoding plastid acetyl-CoA-carboxylase (Acc1) and the partial exon 1 of phytochrome B (PhyB) were generated. Individual and nuclear combined data were evaluated using maximum parsimony, maximum likelihood and Bayesian methods. All of the phylogenetic results show Brachyelytrum and the tribe Nardeae as earliest diverging lineages within the subfamily. The 'core' Pooideae (Hordeeae and the Aveneae/Poeae tribe complex) are also strongly supported, as well as the monophyly of the tribes Brachypodieae, Meliceae and Stipeae (except PhyB). The beak grass tribe Diarrheneae and the tribe Duthieeae are not monophyletic in some of the analyses. However, the combined nuclear DNA (nDNA) tree yields the highest resolution and the best delimitation of the tribes, and provides the following evolutionary hypothesis for the tribes: Brachyelytrum, Nardeae, Duthieeae, Meliceae, Stipeae, Diarrheneae, Brachypodieae and the 'core' Pooideae. Within the individual datasets, the phylogenetic trees obtained from Topo6 exon 8-13 show the most interesting results. The divergent positions of some clone sequences of Ampelodesmos mauritanicus and Trikeraia pappiformis, for instance, may indicate a hybrid origin of these stipoid taxa.

  19. Mitochondrial DNA and Y-chromosomal diversity in ancient populations of domestic sheep (Ovis aries) in Finland: comparison with contemporary sheep breeds.

    PubMed

    Niemi, Marianna; Bläuer, Auli; Iso-Touru, Terhi; Nyström, Veronica; Harjula, Janne; Taavitsainen, Jussi-Pekka; Storå, Jan; Lidén, Kerstin; Kantanen, Juha

    2013-01-22

    Several molecular and population genetic studies have focused on the native sheep breeds of Finland. In this work, we investigated their ancestral sheep populations from Iron Age, Medieval and Post-Medieval periods by sequencing a partial mitochondrial DNA D-loop and the 5'-promoter region of the SRY gene. We compared the maternal (mitochondrial DNA haplotypes) and paternal (SNP oY1) genetic diversity of ancient sheep in Finland with modern domestic sheep populations in Europe and Asia to study temporal changes in genetic variation and affinities between ancient and modern populations. A 523-bp mitochondrial DNA sequence was successfully amplified for 26 of 36 sheep ancient samples i.e. five, seven and 14 samples representative of Iron Age, Medieval and Post-Medieval sheep, respectively. Genetic diversity was analyzed within the cohorts. This ancient dataset was compared with present-day data consisting of 94 animals from 10 contemporary European breeds and with GenBank DNA sequence data to carry out a haplotype sharing analysis. Among the 18 ancient mitochondrial DNA haplotypes identified, 14 were present in the modern breeds. Ancient haplotypes were assigned to the highly divergent ovine haplogroups A and B, haplogroup B being the major lineage within the cohorts. Only two haplotypes were detected in the Iron Age samples, while the genetic diversity of the Medieval and Post-Medieval cohorts was higher. For three of the ancient DNA samples, Y-chromosome SRY gene sequences were amplified indicating that they originated from rams. The SRY gene of these three ancient ram samples contained SNP G-oY1, which is frequent in modern north-European sheep breeds. Our study did not reveal any sign of major population replacement of native sheep in Finland since the Iron Age. 
    Variations in the availability of archaeological remains may explain differences in genetic diversity estimates and patterns within the cohorts rather than demographic events that occurred in the past. Our ancient DNA results fit well with the genetic context of domestic sheep as determined by analyses of modern north-European sheep breeds.

  20. Mitochondrial DNA and Y-chromosomal diversity in ancient populations of domestic sheep (Ovis aries) in Finland: comparison with contemporary sheep breeds

    PubMed Central

    2013-01-01

    Background: Several molecular and population genetic studies have focused on the native sheep breeds of Finland. In this work, we investigated their ancestral sheep populations from Iron Age, Medieval and Post-Medieval periods by sequencing a partial mitochondrial DNA D-loop and the 5’-promoter region of the SRY gene. We compared the maternal (mitochondrial DNA haplotypes) and paternal (SNP oY1) genetic diversity of ancient sheep in Finland with modern domestic sheep populations in Europe and Asia to study temporal changes in genetic variation and affinities between ancient and modern populations. Results: A 523-bp mitochondrial DNA sequence was successfully amplified for 26 of 36 sheep ancient samples i.e. five, seven and 14 samples representative of Iron Age, Medieval and Post-Medieval sheep, respectively. Genetic diversity was analyzed within the cohorts. This ancient dataset was compared with present-day data consisting of 94 animals from 10 contemporary European breeds and with GenBank DNA sequence data to carry out a haplotype sharing analysis. Among the 18 ancient mitochondrial DNA haplotypes identified, 14 were present in the modern breeds. Ancient haplotypes were assigned to the highly divergent ovine haplogroups A and B, haplogroup B being the major lineage within the cohorts. Only two haplotypes were detected in the Iron Age samples, while the genetic diversity of the Medieval and Post-Medieval cohorts was higher. For three of the ancient DNA samples, Y-chromosome SRY gene sequences were amplified indicating that they originated from rams. The SRY gene of these three ancient ram samples contained SNP G-oY1, which is frequent in modern north-European sheep breeds. Conclusions: Our study did not reveal any sign of major population replacement of native sheep in Finland since the Iron Age. Variations in the availability of archaeological remains may explain differences in genetic diversity estimates and patterns within the cohorts rather than demographic events that occurred in the past. Our ancient DNA results fit well with the genetic context of domestic sheep as determined by analyses of modern north-European sheep breeds. PMID:23339395

  1. Benchmarking protein classification algorithms via supervised cross-validation.

    PubMed

    Kertész-Farkas, Attila; Dhir, Somdutta; Sonego, Paolo; Pacurar, Mircea; Netoteia, Sergiu; Nijveen, Harm; Kuzniar, Arnold; Leunissen, Jack A M; Kocsor, András; Pongor, Sándor

    2008-04-24

    Development and testing of protein classification algorithms are hampered by the fact that the protein universe is characterized by groups vastly different in the number of members, in average protein size, similarity within group, etc. Datasets based on traditional cross-validation (k-fold, leave-one-out, etc.) may not give reliable estimates on how an algorithm will generalize to novel, distantly related subtypes of the known protein classes. Supervised cross-validation, i.e., selection of test and train sets according to the known subtypes within a database has been successfully used earlier in conjunction with the SCOP database. Our goal was to extend this principle to other databases and to design standardized benchmark datasets for protein classification. Hierarchical classification trees of protein categories provide a simple and general framework for designing supervised cross-validation strategies for protein classification. Benchmark datasets can be designed at various levels of the concept hierarchy using a simple graph-theoretic distance. A combination of supervised and random sampling was selected to construct reduced size model datasets, suitable for algorithm comparison. Over 3000 new classification tasks were added to our recently established protein classification benchmark collection that currently includes protein sequence (including protein domains and entire proteins), protein structure and reading frame DNA sequence data. We carried out an extensive evaluation based on various machine-learning algorithms such as nearest neighbor, support vector machines, artificial neural networks, random forests and logistic regression, used in conjunction with comparison algorithms, BLAST, Smith-Waterman, Needleman-Wunsch, as well as 3D comparison methods DALI and PRIDE. The resulting datasets provide lower, and in our opinion more realistic estimates of the classifier performance than do random cross-validation schemes. 
    A combination of supervised and random sampling was used to construct model datasets, suitable for algorithm comparison.
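
To make the contrast with random cross-validation concrete, the sketch below holds out one known subtype at a time, so the test items always come from a subtype the classifier never saw during training. It is a simplified illustration and not the benchmark's actual construction procedure; the function name `supervised_splits` and the toy protein labels are assumptions.

```python
def supervised_splits(labels_by_item):
    """Yield (held_out_subtype, train_ids, test_ids), one split per subtype.

    `labels_by_item` maps an item id to a (class, subtype) pair. All members
    of the held-out subtype go to the test set; everything else is training.
    """
    subtypes = sorted({sub for _, sub in labels_by_item.values()})
    for held_out in subtypes:
        test = [i for i, (_, sub) in labels_by_item.items() if sub == held_out]
        train = [i for i, (_, sub) in labels_by_item.items() if sub != held_out]
        yield held_out, train, test


# Hypothetical two-level hierarchy: protein class -> subtype.
proteins = {
    "p1": ("kinase", "k_sub1"),
    "p2": ("kinase", "k_sub1"),
    "p3": ("kinase", "k_sub2"),
    "p4": ("protease", "pr_sub1"),
    "p5": ("protease", "pr_sub2"),
}
splits = list(supervised_splits(proteins))

for held_out, train, test in splits:
    print(f"held out {held_out}: train={train} test={test}")
```

With a random k-fold split, p1 and p2 would often end up on opposite sides of the split, letting a classifier succeed by recognising a near-duplicate. Holding out the whole subtype removes that shortcut, which is consistent with the lower performance estimates the abstract reports.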

  2. A newly recognised Australian endemic species of Austrolecanium Gullan & Hodgson 1998 (Hemiptera: Coccidae) from Queensland.

    PubMed

    Lin, Yen-Po; Ding, Zheng Yee; Gullan, Penny J; Cook, Lyn G

    2017-05-26

    Austrolecanium cryptocaryae Lin & Cook sp. n. is described based on adult female morphology and DNA sequences from mitochondrial and nuclear loci. This Australian endemic species was found on the underside of leaves of Cryptocarya microneura (Lauraceae) in Queensland. All phylogenetic analyses of four independent DNA loci and a concatenated dataset show that A. cryptocaryae is monophyletic and closely related to A. sassafras Gullan & Hodgson, the type species of Austrolecanium Gullan & Hodgson. The adult female of A. cryptocaryae is described and illustrated and a table is provided of the characters that differ among adult females of the three species of Austrolecanium currently recognised (A. cappari (Froggatt), A. cryptocaryae sp. n. and A. sassafras).

  3. GenomicTools: a computational platform for developing high-throughput analytics in genomics.

    PubMed

    Tsirigos, Aristotelis; Haiminen, Niina; Bilal, Erhan; Utro, Filippo

    2012-01-15

    Recent advances in sequencing technology have resulted in the dramatic increase of sequencing data, which, in turn, requires efficient management of computational resources, such as computing time, memory requirements as well as prototyping of computational pipelines. We present GenomicTools, a flexible computational platform, comprising both a command-line set of tools and a C++ API, for the analysis and manipulation of high-throughput sequencing data such as DNA-seq, RNA-seq, ChIP-seq and MethylC-seq. GenomicTools implements a variety of mathematical operations between sets of genomic regions thereby enabling the prototyping of computational pipelines that can address a wide spectrum of tasks ranging from pre-processing and quality control to meta-analyses. Additionally, the GenomicTools platform is designed to analyze large datasets of any size by minimizing memory requirements. In practical applications, where comparable, GenomicTools outperforms existing tools in terms of both time and memory usage. The GenomicTools platform (version 2.0.0) was implemented in C++. The source code, documentation, user manual, example datasets and scripts are available online at http://code.google.com/p/ibm-cbc-genomic-tools.
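
One example of an operation between sets of genomic regions is intersection. The sketch below is written in Python for illustration and does not use or reproduce the GenomicTools command-line tools or C++ API; the function name `intersect` and the tuple layout `(chrom, start, end)` with half-open coordinates are assumptions. Both inputs are assumed to be sorted by chromosome and start, with no overlaps inside a single list.

```python
def intersect(regions_a, regions_b):
    """Intersect two sorted lists of (chrom, start, end) half-open regions.

    A single linear sweep: only the current region of each list is held,
    so memory use does not grow with the size of the inputs.
    """
    out = []
    i = j = 0
    while i < len(regions_a) and j < len(regions_b):
        a, b = regions_a[i], regions_b[j]
        if a[0] != b[0]:
            # Different chromosomes: advance the list that sorts first.
            if a[0] < b[0]:
                i += 1
            else:
                j += 1
            continue
        start, end = max(a[1], b[1]), min(a[2], b[2])
        if start < end:
            out.append((a[0], start, end))
        # Advance whichever region ends first.
        if a[2] <= b[2]:
            i += 1
        else:
            j += 1
    return out


# Hypothetical region sets.
peaks = [("chr1", 100, 200), ("chr1", 300, 400), ("chr2", 50, 80)]
genes = [("chr1", 150, 350), ("chr2", 10, 60)]
overlaps = intersect(peaks, genes)
print(overlaps)
```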

  4. Consed: a graphical editor for next-generation sequencing.

    PubMed

    Gordon, David; Green, Phil

    2013-11-15

    The rapid growth of DNA sequencing throughput in recent years implies that graphical interfaces for viewing and correcting errors must now handle large numbers of reads, efficiently pinpoint regions of interest and automate as many tasks as possible. We have adapted consed to reflect this. To allow full-feature editing of large datasets while keeping memory requirements low, we developed a viewer, bamScape, that reads billion-read BAM files, identifies and displays problem areas for user review and launches the consed graphical editor on user-selected regions, allowing, in addition to longstanding consed capabilities such as assembly editing, a variety of new features including direct editing of the reference sequence, variant and error detection, display of annotation tracks and the ability to simultaneously process a group of reads. Many batch processing capabilities have been added. The consed package is free to academic, government and non-profit users, and licensed to others for a fee by the University of Washington. The current version (26.0) is available for Linux, Mac OS X and Solaris systems or as C++ source code. It includes a user's manual (with exercises) and example datasets, and is available at http://www.phrap.org/consed/consed.html (contact: dgordon@uw.edu).

  5. A statistical method for the detection of variants from next-generation resequencing of DNA pools.

    PubMed

    Bansal, Vikas

    2010-06-15

    Next-generation sequencing technologies have enabled the sequencing of several human genomes in their entirety. However, the routine resequencing of complete genomes remains infeasible. The massive capacity of next-generation sequencers can be harnessed for sequencing specific genomic regions in hundreds to thousands of individuals. Sequencing-based association studies are currently limited by the low level of multiplexing offered by sequencing platforms. Pooled sequencing represents a cost-effective approach for studying rare variants in large populations. To utilize the power of DNA pooling, it is important to accurately identify sequence variants from pooled sequencing data. Detection of rare variants from pooled sequencing represents a different challenge than detection of variants from individual sequencing. We describe a novel statistical approach, CRISP [Comprehensive Read analysis for Identification of Single Nucleotide Polymorphisms (SNPs) from Pooled sequencing] that is able to identify both rare and common variants by using two approaches: (i) comparing the distribution of allele counts across multiple pools using contingency tables and (ii) evaluating the probability of observing multiple non-reference base calls due to sequencing errors alone. Information about the distribution of reads between the forward and reverse strands and the size of the pools is also incorporated within this framework to filter out false variants. Validation of CRISP on two separate pooled sequencing datasets generated using the Illumina Genome Analyzer demonstrates that it can detect 80-85% of SNPs identified using individual sequencing while achieving a low false discovery rate (3-5%). Comparison with previous methods for pooled SNP detection demonstrates the significantly lower false positive and false negative rates for CRISP. Implementation of this method is available at http://polymorphism.scripps.edu/~vbansal/software/CRISP/.
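
    Approach (ii) in this abstract, the probability of observing multiple non-reference base calls due to sequencing errors alone, can be illustrated with a simple binomial tail. The sketch below is not the CRISP implementation: it assumes a single uniform per-base error rate and independent reads, and the function name `prob_alt_reads_from_errors` is hypothetical.

```python
from math import comb


def prob_alt_reads_from_errors(n_reads: int, n_alt: int, error_rate: float) -> float:
    """P(X >= n_alt) for X ~ Binomial(n_reads, error_rate).

    Simplifying assumption: every read independently shows a
    non-reference base with the same probability `error_rate`.
    """
    return sum(
        comb(n_reads, k) * error_rate ** k * (1.0 - error_rate) ** (n_reads - k)
        for k in range(n_alt, n_reads + 1)
    )
```

    A small value for a given position suggests that the observed non-reference calls are unlikely to be explained by errors alone. A real caller would also weigh base qualities, strand distribution and pool size, as the abstract notes.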

  6. Evidence of accelerated evolution and ectodermal-specific expression of presumptive BDS toxin cDNAs from Anemonia viridis.

    PubMed

    Nicosia, Aldo; Maggio, Teresa; Mazzola, Salvatore; Cuttitta, Angela

    2013-10-30

    Anemonia viridis is a widespread and extensively studied Mediterranean species of sea anemone from which a large number of polypeptide toxins, such as blood depressing substances (BDS) peptides, have been isolated. The first members of this class, BDS-1 and BDS-2, are polypeptides belonging to the β-defensin fold family and were initially described for their antihypertensive and antiviral activities. BDS-1 and BDS-2 are 43 amino acid peptides characterised by three disulfide bonds that act as neurotoxins affecting Kv3.1, Kv3.2 and Kv3.4 channel gating kinetics. In addition, BDS-1 inactivates the Nav1.7 and Nav1.3 channels. The development of a large dataset of A. viridis expressed sequence tags (ESTs) and the identification of 13 putative BDS-like cDNA sequences has attracted interest, especially as scientific and diagnostic tools. A comparison of BDS cDNA sequences showed that the untranslated regions are more conserved than the protein-coding regions. Moreover, the KA/KS ratios calculated for all pairwise comparisons showed values greater than 1, suggesting mechanisms of accelerated evolution. The structures of the BDS homologs were predicted by molecular modelling. All toxins possess similar 3D structures that consist of a triple-stranded antiparallel β-sheet and an additional small antiparallel β-sheet located downstream of the cleavage/maturation site; however, the orientation of the triple-stranded β-sheet appears to differ among the toxins. To characterise the spatial expression profile of the putative BDS cDNA sequences, tissue-specific cDNA libraries, enriched for BDS transcripts, were constructed. In addition, the proper amplification of ectodermal or endodermal markers ensured the tissue specificity of each library. 
    Sequencing randomly selected clones from each library revealed ectodermal-specific expression of ten BDS transcripts, while transcripts of BDS-8, BDS-13, BDS-14 and BDS-15 failed to be retrieved, likely due to under-representation in our cDNA libraries. The calculation of the relative abundance of BDS transcripts in the cDNA libraries revealed that BDS-1, BDS-3, BDS-4, BDS-5 and BDS-6 are the most represented transcripts.

  7. MitoBreak: the mitochondrial DNA breakpoints database.

    PubMed

    Damas, Joana; Carneiro, João; Amorim, António; Pereira, Filipe

    2014-01-01

    Mitochondrial DNA (mtDNA) rearrangements are key events in the development of many diseases. Investigations of mtDNA regions affected by rearrangements (i.e. breakpoints) can lead to important discoveries about rearrangement mechanisms and can offer important clues about the causes of mitochondrial diseases. Here, we present the mitochondrial DNA breakpoints database (MitoBreak; http://mitobreak.portugene.com), a free, web-accessible comprehensive list of breakpoints from three classes of somatic mtDNA rearrangements: circular deleted (deletions), circular partially duplicated (duplications) and linear mtDNAs. Currently, MitoBreak contains >1400 mtDNA rearrangements from seven species (Homo sapiens, Mus musculus, Rattus norvegicus, Macaca mulatta, Drosophila melanogaster, Caenorhabditis elegans and Podospora anserina) and their associated phenotypic information collected from nearly 400 publications. The database allows researchers to perform multiple types of data analyses through user-friendly interfaces with full or partial datasets. It also permits the download of curated data and the submission of new mtDNA rearrangements. For each reported case, MitoBreak also documents the precise breakpoint positions, junction sequences, disease or associated symptoms and links to the related publications, providing a useful resource to study the causes and consequences of mtDNA structural alterations.

  8. Genome-Wide Characterization of RNA Editing in Chicken Embryos Reveals Common Features among Vertebrates.

    PubMed

    Frésard, Laure; Leroux, Sophie; Roux, Pierre-François; Klopp, Christophe; Fabre, Stéphane; Esquerré, Diane; Dehais, Patrice; Djari, Anis; Gourichon, David; Lagarrigue, Sandrine; Pitel, Frédérique

    2015-01-01

    RNA editing results in a post-transcriptional nucleotide change in the RNA sequence that creates an alternative nucleotide not present in the DNA sequence. This leads to a diversification of transcription products with potential functional consequences. Two nucleotide substitutions are mainly described in animals, from adenosine to inosine (A-to-I) and from cytidine to uridine (C-to-U). This phenomenon is described in more details in mammals, notably since the availability of next generation sequencing technologies allowing whole genome screening of RNA-DNA differences. The number of studies recording RNA editing in other vertebrates like chicken is still limited. We chose to use high throughput sequencing technologies to search for RNA editing in chicken, and to extend the knowledge of its conservation among vertebrates. We performed sequencing of RNA and DNA from 8 embryos. Being aware of common pitfalls inherent to sequence analyses that lead to false positive discovery, we stringently filtered our datasets and found fewer than 40 reliable candidates. Conservation of particular sites of RNA editing was attested by the presence of 3 edited sites previously detected in mammals. We then characterized editing levels for selected candidates in several tissues and at different time points, from 4.5 days of embryonic development to adults, and observed a clear tissue-specificity and a gradual increase of editing level with time. By characterizing the RNA editing landscape in chicken, our results highlight the extent of evolutionary conservation of this phenomenon within vertebrates, attest to its tissue and stage specificity and provide support of the absence of non A-to-I events from the chicken transcriptome.

  9. Genome-Wide Characterization of RNA Editing in Chicken Embryos Reveals Common Features among Vertebrates

    PubMed Central

    Frésard, Laure; Leroux, Sophie; Roux, Pierre-François; Klopp, Christophe; Fabre, Stéphane; Esquerré, Diane; Dehais, Patrice; Djari, Anis; Gourichon, David

    2015-01-01

    RNA editing results in a post-transcriptional nucleotide change in the RNA sequence that creates an alternative nucleotide not present in the DNA sequence. This leads to a diversification of transcription products with potential functional consequences. Two nucleotide substitutions are mainly described in animals, from adenosine to inosine (A-to-I) and from cytidine to uridine (C-to-U). This phenomenon is described in more details in mammals, notably since the availability of next generation sequencing technologies allowing whole genome screening of RNA-DNA differences. The number of studies recording RNA editing in other vertebrates like chicken is still limited. We chose to use high throughput sequencing technologies to search for RNA editing in chicken, and to extend the knowledge of its conservation among vertebrates. We performed sequencing of RNA and DNA from 8 embryos. Being aware of common pitfalls inherent to sequence analyses that lead to false positive discovery, we stringently filtered our datasets and found fewer than 40 reliable candidates. Conservation of particular sites of RNA editing was attested by the presence of 3 edited sites previously detected in mammals. We then characterized editing levels for selected candidates in several tissues and at different time points, from 4.5 days of embryonic development to adults, and observed a clear tissue-specificity and a gradual increase of editing level with time. By characterizing the RNA editing landscape in chicken, our results highlight the extent of evolutionary conservation of this phenomenon within vertebrates, attest to its tissue and stage specificity and provide support of the absence of non A-to-I events from the chicken transcriptome. PMID:26024316

  10. Pyrosequencing analysis of the gyrB gene to differentiate bacteria responsible for diarrheal diseases.

    PubMed

    Hou, X-L; Cao, Q-Y; Jia, H-Y; Chen, Z

    2008-07-01

    Pathogens causing acute diarrhea include a large variety of species from Enterobacteriaceae and Vibrionaceae. A method based on pyrosequencing was used here to differentiate bacteria commonly associated with diarrhea in China; the method is targeted to a partial amplicon of the gyrB gene, which encodes the B subunit of DNA gyrase. Twenty-eight specific polymorphic positions were identified from sequence alignment of a large sequence dataset and targeted using 17 sequencing primers. Of 95 isolates tested, belonging to 13 species within 7 genera, most could be identified to the species level; O157 type could be differentiated from other E. coli types; Salmonella enterica subsp. enterica could be identified at the serotype level; the genus Shigella, except for S. boydii and S. dysenteriae, could also be identified. All these isolates were also subjected to conventional sequencing of a relatively long (approximately 1.2 kb) region of gyrB DNA; these results confirmed those with pyrosequencing. Twenty-two fecal samples were surveyed, the results of which were concordant with culture-based bacterial identification, and the pathogen detection limit with simulated stool specimens was 10^4 CFU/ml. DNA from different pathogens was also mixed to simulate a case of multibacterial infection, and the generated signals correlated well with the mix ratio. In summary, the gyrB-based pyrosequencing approach proved to have significant reliability and discriminatory power for enteropathogenic bacterial identification and provided a fast and effective method for clinical diagnosis.

  11. Identification of Rays through DNA Barcoding: An Application for Ecologists

    PubMed Central

    Cerutti-Pereyra, Florencia; Meekan, Mark G.; Wei, Nu-Wei V.; O'Shea, Owen; Bradshaw, Corey J. A.; Austin, Chris M.

    2012-01-01

    DNA barcoding potentially offers scientists who are not expert taxonomists a powerful tool to support the accuracy of field studies involving taxa that are diverse and difficult to identify. The taxonomy of rays has received reasonable attention in Australia, although the fauna in remote locations such as Ningaloo Reef, Western Australia is poorly studied and the identification of some species in the field is problematic. Here, we report an application of DNA-barcoding to the identification of 16 species (from 10 genera) of tropical rays as part of an ecological study. Analysis of the dataset combined across all samples grouped sequences into clearly defined operational taxonomic units, with two conspicuous exceptions: the Neotrygon kuhlii species complex and the Aetobatus species complex. In the field, the group that presented the most difficulties for identification was the spotted whiptail rays, referred to as the ‘uarnak’ complex. Two sets of problems limited the successful application of DNA barcoding: (1) the presence of cryptic species, species complexes with unresolved taxonomic status and intra-specific geographical variation, and (2) insufficient numbers of entries in online databases that have been verified taxonomically, and the presence of lodged sequences in databases with inconsistent names. Nevertheless, we demonstrate the potential of the DNA barcoding approach to confirm field identifications and to highlight species complexes where taxonomic uncertainty might confound ecological data. PMID:22701556

  12. MeDReaders: a database for transcription factors that bind to methylated DNA.

    PubMed

    Wang, Guohua; Luo, Ximei; Wang, Jianan; Wan, Jun; Xia, Shuli; Zhu, Heng; Qian, Jiang; Wang, Yadong

    2018-01-04

    Understanding the molecular principles governing interactions between transcription factors (TFs) and DNA targets is one of the main subjects for transcriptional regulation. Recently, emerging evidence demonstrated that some TFs could bind to DNA motifs containing highly methylated CpGs both in vitro and in vivo. Identification of such TFs and elucidation of their physiological roles now become an important stepping-stone toward understanding the mechanisms underlying the methylation-mediated biological processes, which have crucial implications for human disease and disease development. Hence, we constructed a database, named MeDReaders, to collect information about methylated DNA binding activities. A total of 731 TFs, which could bind to methylated DNA sequences, were manually curated in human and mouse studies reported in the literature. In silico approaches were applied to predict methylated and unmethylated motifs of 292 TFs by integrating whole genome bisulfite sequencing (WGBS) and ChIP-Seq datasets in six human cell lines and one mouse cell line extracted from the ENCODE and GEO databases. The MeDReaders database will provide a comprehensive resource for further studies and aid related experiment designs. The database implements unified access for users to most TFs involved in such methylation-associated binding activities. The website is available at http://medreader.org/. © The Author(s) 2017. Published by Oxford University Press on behalf of Nucleic Acids Research.

  13. Influence of structural variation on nuclear localization of DNA-binding polyamide-fluorophore conjugates.

    PubMed

    Edelson, Benjamin S; Best, Timothy P; Olenyuk, Bogdan; Nickols, Nicholas G; Doss, Raymond M; Foister, Shane; Heckel, Alexander; Dervan, Peter B

    2004-01-01

    A pivotal step forward in chemical approaches to controlling gene expression is the development of sequence-specific DNA-binding molecules that can enter live cells and traffic to nuclei unaided. DNA-binding polyamides are a class of programmable, sequence-specific small molecules that have been shown to influence a wide variety of protein-DNA interactions. We have synthesized over 100 polyamide-fluorophore conjugates and assayed their nuclear uptake profiles in 13 mammalian cell lines. The compiled dataset, comprising 1300 entries, establishes a benchmark for the nuclear localization of polyamide-dye conjugates. Compounds in this series were chosen to provide systematic variation in several structural variables, including dye composition and placement, molecular weight, charge, ordering of the aromatic and aliphatic amino-acid building blocks and overall shape. Nuclear uptake does not appear to be correlated with polyamide molecular weight or with the number of imidazole residues, although the positions of imidazole residues affect nuclear access properties significantly. Generally negative determinants for nuclear access include the presence of a beta-Ala-tail residue and the lack of a cationic alkyl amine moiety, whereas the presence of an acetylated 2,4-diaminobutyric acid-turn is a positive factor for nuclear localization. We discuss implications of these data on the design of polyamide-dye conjugates for use in biological systems.

  14. Phylogeny and evolutionary histories of Pyrus L. revealed by phylogenetic trees and networks based on data from multiple DNA sequences.

    PubMed

    Zheng, Xiaoyan; Cai, Danying; Potter, Daniel; Postman, Joseph; Liu, Jing; Teng, Yuanwen

    2014-11-01

    Reconstructing the phylogeny of Pyrus has been difficult due to the wide distribution of the genus and lack of informative data. In this study, we collected 110 accessions representing 25 Pyrus species and constructed both phylogenetic trees and phylogenetic networks based on multiple DNA sequence datasets. Phylogenetic trees based on both cpDNA and nuclear LFY2int2-N (LN) data resulted in poor resolution, especially, only five primary species were monophyletic in the LN tree. A phylogenetic network of LN suggested that reticulation caused by hybridization is one of the major evolutionary processes for Pyrus species. Polytomies of the gene trees and star-like structure of cpDNA networks suggested rapid radiation is another major evolutionary process, especially for the occidental species. Pyrus calleryana and P. regelii were the earliest diverged Pyrus species. Two North African species, P. cordata, P. spinosa and P. betulaefolia were descendent of primitive stock Pyrus species and still share some common molecular characters. Southwestern China, where a large number of P. pashia populations are found, is probably the most important diversification center of Pyrus. More accessions and nuclear genes are needed for further understanding the evolutionary histories of Pyrus. Copyright © 2014 Elsevier Inc. All rights reserved.

  15. Integrative analysis of gene expression and DNA methylation using unsupervised feature extraction for detecting candidate cancer biomarkers.

    PubMed

    Moon, Myungjin; Nakai, Kenta

    2018-04-01

    Currently, cancer biomarker discovery is one of the important research topics worldwide. In particular, detecting significant genes related to cancer is an important task for early diagnosis and treatment of cancer. Conventional studies mostly focus on genes that are differentially expressed in different states of cancer; however, noise in gene expression datasets and insufficient information in limited datasets impede precise analysis of novel candidate biomarkers. In this study, we propose an integrative analysis of gene expression and DNA methylation using normalization and unsupervised feature extractions to identify candidate biomarkers of cancer using renal cell carcinoma RNA-seq datasets. Gene expression and DNA methylation datasets are normalized by Box-Cox transformation and integrated into a one-dimensional dataset that retains the major characteristics of the original datasets by unsupervised feature extraction methods, and differentially expressed genes are selected from the integrated dataset. Use of the integrated dataset demonstrated improved performance as compared with conventional approaches that utilize gene expression or DNA methylation datasets alone. Validation based on the literature showed that a considerable number of top-ranked genes from the integrated dataset have known relationships with cancer, implying that novel candidate biomarkers can also be acquired from the proposed analysis method. Furthermore, we expect that the proposed method can be expanded for applications involving various types of multi-omics datasets.
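
    The Box-Cox normalization step named in this abstract follows a standard formula: for positive x and parameter lambda, the transformed value is (x^lambda - 1) / lambda when lambda is non-zero, and ln(x) when lambda is zero. The Python sketch below shows only that formula. It is not the authors' pipeline: the function name `box_cox` is hypothetical, and how lambda was chosen in the study is not stated here, so it is left as an input.

```python
import math
from typing import List, Sequence


def box_cox(values: Sequence[float], lam: float) -> List[float]:
    """Apply the Box-Cox power transform to strictly positive values.

    lam is the transform parameter (lambda); lam == 0 gives the natural log.
    """
    if any(v <= 0 for v in values):
        raise ValueError("Box-Cox requires strictly positive inputs")
    if lam == 0:
        return [math.log(v) for v in values]
    return [(v ** lam - 1.0) / lam for v in values]
```

    In practice lambda is usually estimated from the data (for example by maximum likelihood) rather than fixed by hand, and inputs containing zeros need a positive offset before transforming.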

  16. DNA analysis of ancient dogs of the Americas: identifying possible founding haplotypes and reconstructing population histories.

    PubMed

    Witt, Kelsey E; Judd, Kathleen; Kitchen, Andrew; Grier, Colin; Kohler, Timothy A; Ortman, Scott G; Kemp, Brian M; Malhi, Ripan S

    2015-02-01

    As dogs have traveled with humans to every continent, they can potentially serve as an excellent proxy when studying human migration history. Past genetic studies into the origins of Native American dogs have used portions of the hypervariable region (HVR) of mitochondrial DNA (mtDNA) to indicate that prior to European contact the dogs of Native Americans originated in Eurasia. In this study, we summarize past DNA studies of both humans and dogs to discuss their population histories in the Americas. We then sequenced a portion of the mtDNA HVR of 42 pre-Columbian dogs from three sites located in Illinois, coastal British Columbia, and Colorado, and identify four novel dog mtDNA haplotypes. Next, we analyzed a dataset comprised of all available ancient dog sequences from the Americas to infer the pre-Columbian population history of dogs in the Americas. Interestingly, we found low levels of genetic diversity for some populations consistent with the possibility of deliberate breeding practices. Furthermore, we identified multiple putative founding haplotypes in addition to dog haplotypes that closely resemble those of wolves, suggesting admixture with North American wolves or perhaps a second domestication of canids in the Americas. Notably, initial effective population size estimates suggest at least 1000 female dogs likely existed in the Americas at the time of the first known canid burial, and that population size increased gradually over time before stabilizing roughly 1200 years before present. Copyright © 2014 Elsevier Ltd. All rights reserved.

  17. NABIC: A New Access Portal to Search, Visualize, and Share Agricultural Genomics Data

    PubMed Central

    Seol, Young-Joo; Lee, Tae-Ho; Park, Dong-Suk; Kim, Chang-Kug

    2016-01-01

    The National Agricultural Biotechnology Information Center developed an access portal to search, visualize, and share agricultural genomics data with a focus on South Korean information and resources. The portal features an agricultural biotechnology database containing a wide range of omics data from public and proprietary sources. We collected 28.4 TB of data from 162 agricultural organisms, with 10 types of omics data comprising next-generation sequencing sequence read archive, genome, gene, nucleotide, DNA chip, expressed sequence tag, interactome, protein structure, molecular marker, and single-nucleotide polymorphism datasets. Our genomic resources contain information on five animals, seven plants, and one fungus, which is accessed through a genome browser. We also developed a data submission and analysis system as a web service, with easy-to-use functions and cutting-edge algorithms, including those for handling next-generation sequencing data. PMID:26848255

  18. An integrated SNP mining and utilization (ISMU) pipeline for next generation sequencing data.

    PubMed

    Azam, Sarwar; Rathore, Abhishek; Shah, Trushar M; Telluri, Mohan; Amindala, BhanuPrakash; Ruperao, Pradeep; Katta, Mohan A V S K; Varshney, Rajeev K

    2014-01-01

    Open source single nucleotide polymorphism (SNP) discovery pipelines for next generation sequencing data commonly require working knowledge of the command line interface, massive computational resources and expertise, which is daunting for biologists. Further, the SNP information generated may not be readily used for downstream processes such as genotyping. Hence, a comprehensive pipeline has been developed by integrating several open source next generation sequencing (NGS) tools along with a graphical user interface called Integrated SNP Mining and Utilization (ISMU) for SNP discovery and their utilization by developing genotyping assays. The pipeline features functionalities such as pre-processing of raw data, integration of open source alignment tools (Bowtie2, BWA, Maq, NovoAlign and SOAP2), SNP prediction (SAMtools/SOAPsnp/CNS2snp and CbCC) methods and interfaces for developing genotyping assays. The pipeline outputs a list of high quality SNPs between all pairwise combinations of genotypes analyzed, in addition to the reference genome/sequence. Visualization tools (Tablet and Flapjack) integrated into the pipeline enable inspection of the alignment and errors, if any. The pipeline also provides a confidence score or polymorphism information content value with flanking sequences for identified SNPs in the standard format required for developing marker genotyping (KASP and Golden Gate) assays. The pipeline enables users to process a range of NGS datasets such as whole genome re-sequencing, restriction site associated DNA sequencing and transcriptome sequencing data at a fast speed. The pipeline is very useful for members of the plant genetics and breeding community with no computational expertise who wish to discover SNPs and utilize them in genomics, genetics and breeding studies. The pipeline has been parallelized to process huge datasets of next generation sequencing. It was developed in Java and is available at http://hpc.icrisat.cgiar.org/ISMU as free standalone software.

  19. GUIDEseq: a bioconductor package to analyze GUIDE-Seq datasets for CRISPR-Cas nucleases.

    PubMed

    Zhu, Lihua Julie; Lawrence, Michael; Gupta, Ankit; Pagès, Hervé; Kucukural, Alper; Garber, Manuel; Wolfe, Scot A

    2017-05-15

    Genome editing technologies developed around the CRISPR-Cas9 nuclease system have facilitated the investigation of a broad range of biological questions. These nucleases also hold tremendous promise for treating a variety of genetic disorders. In the context of their therapeutic application, it is important to identify the spectrum of genomic sequences that are cleaved by a candidate nuclease when programmed with a particular guide RNA, as well as the cleavage efficiency of these sites. Powerful new experimental approaches, such as GUIDE-seq, facilitate the sensitive, unbiased genome-wide detection of nuclease cleavage sites within the genome. Flexible bioinformatics analysis tools for processing GUIDE-seq data are needed. Here, we describe an open source, open development software suite, GUIDEseq, for GUIDE-seq data analysis and annotation as a Bioconductor package in R. The GUIDEseq package provides a flexible platform with more than 60 adjustable parameters for the analysis of datasets associated with custom nuclease applications. These parameters allow data analysis to be tailored to different nuclease platforms with different length and complexity in their guide and PAM recognition sequences or their DNA cleavage position. They also enable users to customize sequence aggregation criteria, and vary peak calling thresholds that can influence the number of potential off-target sites recovered. GUIDEseq also annotates potential off-target sites that overlap with genes based on genome annotation information, as these may be the most important off-target sites for further characterization. In addition, GUIDEseq enables the comparison and visualization of off-target site overlap between different datasets for a rapid comparison of different nuclease configurations or experimental conditions. 
    For each identified off-target, the GUIDEseq package outputs the mapped GUIDE-seq read count as well as a cleavage score from a user-specified off-target cleavage score prediction algorithm, permitting the identification of genomic sequences with unexpected cleavage activity. The GUIDEseq package enables analysis of GUIDE-seq data from various nuclease platforms for any species with a defined genomic sequence. This software package has been used successfully to analyze several GUIDE-seq datasets. The software, source code and documentation are freely available at http://www.bioconductor.org/packages/release/bioc/html/GUIDEseq.html.

  20. CMS: A Web-Based System for Visualization and Analysis of Genome-Wide Methylation Data of Human Cancers

    PubMed Central

    Huang, Yi-Wen; Roa, Juan C.; Goodfellow, Paul J.; Kizer, E. Lynette; Huang, Tim H. M.; Chen, Yidong

    2013-01-01

    Background: DNA methylation of promoter CpG islands is associated with gene suppression, and its unique genome-wide profiles have been linked to tumor progression. Coupled with high-throughput sequencing technologies, it is now possible to efficiently determine genome-wide methylation profiles in cancer cells. Also, experimental and computational technologies make it possible to find the functional relationship between cancer-specific methylation patterns and their clinicopathological parameters. Methodology/Principal Findings: Cancer methylome system (CMS) is a web-based database application designed for the visualization, comparison and statistical analysis of human cancer-specific DNA methylation. Methylation intensities were obtained from MBDCap-sequencing, pre-processed and stored in the database. 191 patient samples (169 tumor and 22 normal specimens) and 41 breast cancer cell-lines are deposited in the database, comprising about 6.6 billion uniquely mapped sequence reads. This provides comprehensive and genome-wide epigenetic portraits of human breast cancer and endometrial cancer to date. Two views are proposed for users to better understand methylation structure at the genomic level or systemic methylation alteration at the gene level. In addition, a variety of annotation tracks are provided to cover genomic information. CMS includes important analytic functions for interpretation of methylation data, such as the detection of differentially methylated regions, statistical calculation of global methylation intensities, multiple gene sets of biologically significant categories, and interactivity with UCSC via custom-track data. We also present examples of discoveries utilizing the framework. Conclusions/Significance: CMS provides visualization and analytic functions for cancer methylome datasets. A comprehensive collection of datasets, a variety of embedded analytic functions and extensive applications with biological and translational significance make this system powerful and unique in cancer methylation research. CMS is freely accessible at: http://cbbiweb.uthscsa.edu/KMethylomes/. PMID:23630576

  1. Early-branching euteleost relationships: areas of congruence between concatenation and coalescent model inferences.

    PubMed

    Campbell, Matthew A; Alfaro, Michael E; Belasco, Max; López, J Andrés

    2017-01-01

    Phylogenetic inference based on evidence from DNA sequences has led to significant strides in the development of a stable and robustly supported framework for the vertebrate tree of life. To date, the bulk of those advances have relied on sequence data from a small number of genome regions that have proven unable to produce satisfactory answers to consistently recalcitrant phylogenetic questions. Here, we re-examine phylogenetic relationships among early-branching euteleostean fish lineages classically grouped in the Protacanthopterygii using DNA sequence data surrounding ultraconserved elements. We report and examine a dataset of thirty-four OTUs with 17,957 aligned characters from fifty-three nuclear loci. Phylogenetic analysis is conducted in concatenated, joint gene trees and species tree estimation and summary coalescent frameworks. All analytical frameworks yield supporting evidence for existing hypotheses of relationship for the placement of Lepidogalaxias salamandroides, monophyly of the Stomiatii and the presence of an esociform + salmonid clade. Lepidogalaxias salamandroides and the Esociformes + Salmoniformes are successive sister lineages to all other euteleosts in the majority of analyses. The concatenated and joint gene trees and species tree analysis types produce high support values for this arrangement. However, inter-relationships of Argentiniformes, Stomiatii and Neoteleostei remain uncertain as they varied by analysis type while receiving strong and contradictory indices of support. Topological differences between analysis types are also apparent within the otomorph and the percomorph taxa in the data set. Our results identify concordant areas with strong support for relationships within and between early-branching euteleost lineages but they also reveal limitations in the ability of larger datasets to conclusively resolve other aspects of that phylogeny.

  2. Early-branching euteleost relationships: areas of congruence between concatenation and coalescent model inferences

    PubMed Central

    Campbell, Matthew A.; Alfaro, Michael E.; Belasco, Max; López, J. Andrés

    2017-01-01

    Phylogenetic inference based on evidence from DNA sequences has led to significant strides in the development of a stable and robustly supported framework for the vertebrate tree of life. To date, the bulk of those advances have relied on sequence data from a small number of genome regions that have proven unable to produce satisfactory answers to consistently recalcitrant phylogenetic questions. Here, we re-examine phylogenetic relationships among early-branching euteleostean fish lineages classically grouped in the Protacanthopterygii using DNA sequence data surrounding ultraconserved elements. We report and examine a dataset of thirty-four OTUs with 17,957 aligned characters from fifty-three nuclear loci. Phylogenetic analysis is conducted in concatenated, joint gene trees and species tree estimation and summary coalescent frameworks. All analytical frameworks yield supporting evidence for existing hypotheses of relationship for the placement of Lepidogalaxias salamandroides, monophyly of the Stomiatii and the presence of an esociform + salmonid clade. Lepidogalaxias salamandroides and the Esociformes + Salmoniformes are successive sister lineages to all other euteleosts in the majority of analyses. The concatenated and joint gene trees and species tree analysis types produce high support values for this arrangement. However, inter-relationships of Argentiniformes, Stomiatii and Neoteleostei remain uncertain as they varied by analysis type while receiving strong and contradictory indices of support. Topological differences between analysis types are also apparent within the otomorph and the percomorph taxa in the data set. Our results identify concordant areas with strong support for relationships within and between early-branching euteleost lineages but they also reveal limitations in the ability of larger datasets to conclusively resolve other aspects of that phylogeny. PMID:28929008

  3. CMS: a web-based system for visualization and analysis of genome-wide methylation data of human cancers.

    PubMed

    Gu, Fei; Doderer, Mark S; Huang, Yi-Wen; Roa, Juan C; Goodfellow, Paul J; Kizer, E Lynette; Huang, Tim H M; Chen, Yidong

    2013-01-01

    DNA methylation of promoter CpG islands is associated with gene suppression, and its unique genome-wide profiles have been linked to tumor progression. Coupled with high-throughput sequencing technologies, it can now efficiently determine genome-wide methylation profiles in cancer cells. Also, experimental and computational technologies make it possible to find the functional relationship between cancer-specific methylation patterns and their clinicopathological parameters. Cancer methylome system (CMS) is a web-based database application designed for the visualization, comparison and statistical analysis of human cancer-specific DNA methylation. Methylation intensities were obtained from MBDCap-sequencing, pre-processed and stored in the database. 191 patient samples (169 tumor and 22 normal specimens) and 41 breast cancer cell-lines are deposited in the database, comprising about 6.6 billion uniquely mapped sequence reads. This provides comprehensive and genome-wide epigenetic portraits of human breast cancer and endometrial cancer to date. Two views are proposed for users to better understand methylation structure at the genomic level or systemic methylation alteration at the gene level. In addition, a variety of annotation tracks are provided to cover genomic information. CMS includes important analytic functions for interpretation of methylation data, such as the detection of differentially methylated regions, statistical calculation of global methylation intensities, multiple gene sets of biologically significant categories, interactivity with UCSC via custom-track data. We also present examples of discoveries utilizing the framework. CMS provides visualization and analytic functions for cancer methylome datasets. A comprehensive collection of datasets, a variety of embedded analytic functions and extensive applications with biological and translational significance make this system powerful and unique in cancer methylation research. CMS is freely accessible at: http://cbbiweb.uthscsa.edu/KMethylomes/.

  4. Normalization, bias correction, and peak calling for ChIP-seq

    PubMed Central

    Diaz, Aaron; Park, Kiyoub; Lim, Daniel A.; Song, Jun S.

    2012-01-01

    Next-generation sequencing is rapidly transforming our ability to profile the transcriptional, genetic, and epigenetic states of a cell. In particular, sequencing DNA from the immunoprecipitation of protein-DNA complexes (ChIP-seq) and methylated DNA (MeDIP-seq) can reveal the locations of protein binding sites and epigenetic modifications. These approaches contain numerous biases which may significantly influence the interpretation of the resulting data. Rigorous computational methods for detecting and removing such biases are still lacking. Also, multi-sample normalization still remains an important open problem. This theoretical paper systematically characterizes the biases and properties of ChIP-seq data by comparing 62 separate publicly available datasets, using rigorous statistical models and signal processing techniques. Statistical methods for separating ChIP-seq signal from background noise, as well as correcting enrichment test statistics for sequence-dependent and sonication biases, are presented. Our method effectively separates reads into signal and background components prior to normalization, improving the signal-to-noise ratio. Moreover, most peak callers currently use a generic null model which suffers from low specificity at the sensitivity level requisite for detecting subtle, but true, ChIP enrichment. The proposed method of determining a cell type-specific null model, which accounts for cell type-specific biases, is shown to be capable of achieving a lower false discovery rate at a given significance threshold than current methods. PMID:22499706

  5. Control control control: a reassessment and comparison of GenBank and chromatogram mtDNA sequence variation in Baltic grey seals (Halichoerus grypus).

    PubMed

    Fietz, Katharina; Graves, Jeff A; Olsen, Morten Tange

    2013-01-01

    Genetic data can provide a powerful tool for those interested in the biology, management and conservation of wildlife, but also lead to erroneous conclusions if appropriate controls are not taken at all steps of the analytical process. This particularly applies to data deposited in public repositories such as GenBank, whose utility relies heavily on the assumption of high data quality. Here we report on an in-depth reassessment and comparison of GenBank and chromatogram mtDNA sequence data generated in a previous study of Baltic grey seals. By re-editing the original chromatogram data we found that approximately 40% of the grey seal mtDNA haplotype sequences posted in GenBank contained errors. The re-analysis of the edited chromatogram data yielded overall similar results and conclusions as the original study. However, a significantly different outcome was observed when using the uncorrected dataset based on the GenBank haplotypes. We therefore suggest disregarding the existing GenBank data and instead using the correct haplotypes reported here. Our study serves as an illustrative example reiterating the importance of quality control through every step of a research project, from data generation to interpretation and submission to an online repository. Errors conducted in any step may lead to biased results and conclusions, and could impact management decisions.

  6. Control Control Control: A Reassessment and Comparison of GenBank and Chromatogram mtDNA Sequence Variation in Baltic Grey Seals (Halichoerus grypus)

    PubMed Central

    Fietz, Katharina; Graves, Jeff A.; Olsen, Morten Tange

    2013-01-01

    Genetic data can provide a powerful tool for those interested in the biology, management and conservation of wildlife, but also lead to erroneous conclusions if appropriate controls are not taken at all steps of the analytical process. This particularly applies to data deposited in public repositories such as GenBank, whose utility relies heavily on the assumption of high data quality. Here we report on an in-depth reassessment and comparison of GenBank and chromatogram mtDNA sequence data generated in a previous study of Baltic grey seals. By re-editing the original chromatogram data we found that approximately 40% of the grey seal mtDNA haplotype sequences posted in GenBank contained errors. The re-analysis of the edited chromatogram data yielded overall similar results and conclusions as the original study. However, a significantly different outcome was observed when using the uncorrected dataset based on the GenBank haplotypes. We therefore suggest disregarding the existing GenBank data and instead using the correct haplotypes reported here. Our study serves as an illustrative example reiterating the importance of quality control through every step of a research project, from data generation to interpretation and submission to an online repository. Errors conducted in any step may lead to biased results and conclusions, and could impact management decisions. PMID:23977362

  7. Insight into the validity of Leptobrachium guangxiense (Anura: Megophryidae): evidence from mitochondrial DNA sequences and morphological characters.

    PubMed

    Chen, Weicai; Zhang, Wei; Zhou, Shichu; Li, Ning; Huang, Yong; Mo, Yunming

    2013-01-01

    Leptobrachium guangxiense Fei, Mo, Ye and Jiang, 2009 (Anura: Megophryidae), is presently thought to be endemic to Shangsi, Guangxi Province, China. Molecular phylogenetic and morphological analyses were performed to gain insight into the phylogenetic position of this species. Maximum parsimony, maximum likelihood, and Bayesian inference methods were employed to reconstruct phylogenetic relationships, using 1914 bp of sequences from mtDNA genes of 12S rRNA, tRNAVal and 16S rRNA. Topologies revealed that L. guangxiense and Tam Dao (Vietnam) L. chapaense lineage (3A) formed a monophyletic group with well-supported values. The uncorrected p-distance of ~1.4k bp 16S rRNA data-sets between Tam Dao L. chapaense lineage (3A) and L. guangxiense is only 0.1%. Morphologically, L. guangxiense and Tam Dao L. chapaense lineage (3A) shared the same characters, and are distinguishable from "true" L. chapaense from the type locality in Sa Pa, Vietnam. Based on morphological characters and mitochondrial DNA, we suggested that the Tam Dao lineages of L. chapaense are conspecific with L. guangxiense. This represents a range extension for L. guangxiense, and a new country record for Vietnam.
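
    The uncorrected p-distance quoted in this abstract is the proportion of compared alignment positions at which two sequences differ. A minimal sketch follows, assuming two pre-aligned sequences of equal length and assuming that positions with a gap or ambiguity code in either sequence are skipped; the function name p_distance and that gap-handling choice are illustrative and not taken from the paper.

```python
def p_distance(seq_a: str, seq_b: str) -> float:
    """Uncorrected p-distance between two aligned nucleotide sequences.

    Assumption (not from the abstract): columns where either sequence has
    a gap or a non-ACGT symbol are excluded from the comparison.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")

    compared = 0  # columns where both sequences carry an unambiguous base
    differing = 0  # compared columns where the bases differ
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a in "ACGT" and b in "ACGT":
            compared += 1
            if a != b:
                differing += 1

    if compared == 0:
        raise ValueError("no comparable positions")
    return differing / compared


if __name__ == "__main__":
    # One mismatch over ten compared sites gives a distance of 0.1 (10%).
    print(p_distance("ACGTACGTAC", "ACGTACGTAT"))
```

    On this definition, a distance of 0.1% over roughly 1,400 aligned sites corresponds to only one or two differing positions.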

  8. Xenopus in Space and Time: Fossils, Node Calibrations, Tip-Dating, and Paleobiogeography.

    PubMed

    Cannatella, David

    2015-01-01

    Published data from DNA sequences, morphology of 11 extant and 15 extinct frog taxa, and stratigraphic ranges of fossils were integrated to open a window into the deep-time evolution of Xenopus. The ages and morphological characters of fossils were used as independent datasets to calibrate a chronogram. We found that DNA sequences, either alone or in combination with morphological data and fossils, tended to support a close relationship between Xenopus and Hymenochirus, although in some analyses this topology was not significantly better than the Pipa + Hymenochirus topology. Analyses that excluded DNA data found strong support for the Pipa + Hymenochirus tree. The criterion for selecting the maximum age of the calibration prior influenced the age estimates, and our age estimates of early divergences in the tree of frogs are substantially younger than those of published studies. Node-dating and tip-dating calibrations, either alone or in combination, yielded older dates for nodes than did a root calibration alone. Our estimates of divergence times indicate that overwater dispersal, rather than vicariance due to the splitting of Africa and South America, may explain the presence of Xenopus in Africa and its closest fossil relatives in South America.

  9. My-Forensic-Loci-queries (MyFLq) framework for analysis of forensic STR data generated by massive parallel sequencing.

    PubMed

    Van Neste, Christophe; Vandewoestyne, Mado; Van Criekinge, Wim; Deforce, Dieter; Van Nieuwerburgh, Filip

    2014-03-01

    Forensic scientists are currently investigating how to transition from capillary electrophoresis (CE) to massive parallel sequencing (MPS) for analysis of forensic DNA profiles. MPS offers several advantages over CE such as virtually unlimited multiplexy of loci, combining both short tandem repeat (STR) and single nucleotide polymorphism (SNP) loci, small amplicons without constraints of size separation, more discrimination power, deep mixture resolution and sample multiplexing. We present our bioinformatic framework My-Forensic-Loci-queries (MyFLq) for analysis of MPS forensic data. For allele calling, the framework uses a MySQL reference allele database with automatically determined regions of interest (ROIs) by a generic maximal flanking algorithm which makes it possible to use any STR or SNP forensic locus. Python scripts were designed to automatically make allele calls starting from raw MPS data. We also present a method to assess the usefulness and overall performance of a forensic locus with respect to MPS, as well as methods to estimate whether an unknown allele, whose sequence is not present in the MySQL database, is in fact a new allele or a sequencing error. The MyFLq framework was applied to an Illumina MiSeq dataset of a forensic Illumina amplicon library, generated from multilocus STR polymerase chain reaction (PCR) on both single contributor samples and multiple person DNA mixtures. Although the multilocus PCR was not yet optimized for MPS in terms of amplicon length or locus selection, the results are excellent for most loci. The results show a high signal-to-noise ratio, correct allele calls, and a low limit of detection for minor DNA contributors in mixed DNA samples. Technically, forensic MPS affords great promise for routine implementation in forensic genomics. The method is also applicable to adjacent disciplines such as molecular autopsy in legal medicine and in mitochondrial DNA research.

  10. Origins of domestic dog in southern East Asia is supported by analysis of Y-chromosome DNA.

    PubMed

    Ding, Z-L; Oskarsson, M; Ardalan, A; Angleby, H; Dahlgren, L-G; Tepeli, C; Kirkness, E; Savolainen, P; Zhang, Y-P

    2012-05-01

    Global mitochondrial DNA (mtDNA) data indicates that the dog originates from domestication of wolf in Asia South of Yangtze River (ASY), with minor genetic contributions from dog-wolf hybridisation elsewhere. Archaeological data and autosomal single nucleotide polymorphism data have instead suggested that dogs originate from Europe and/or South West Asia but, because these datasets lack data from ASY, evidence pointing to ASY may have been overlooked. Analyses of additional markers for global datasets, including ASY, are therefore necessary to test if mtDNA phylogeography reflects the actual dog history and not merely stochastic events or selection. Here, we analyse 14,437 bp of Y-chromosome DNA sequence in 151 dogs sampled worldwide. We found 28 haplotypes distributed in five haplogroups. Two haplogroups were universally shared and included three haplotypes carried by 46% of all dogs, but two other haplogroups were primarily restricted to East Asia. Highest genetic diversity and virtually complete phylogenetic coverage was found within ASY. The 151 dogs were estimated to originate from 13-24 wolf founders, but there was no indication of post-domestication dog-wolf hybridisations. Thus, Y-chromosome and mtDNA data give strikingly similar pictures of dog phylogeography, most importantly that roughly 50% of the gene pools are shared universally but only ASY has nearly the full range of genetic diversity, such that the gene pools in all other regions may derive from ASY. This corroborates that ASY was the principal, and possibly sole region of wolf domestication, that a large number of wolves were domesticated, and that subsequent dog-wolf hybridisation contributed modestly to the dog gene pool.

  11. Origins of domestic dog in Southern East Asia is supported by analysis of Y-chromosome DNA

    PubMed Central

    Ding, Z-L; Oskarsson, M; Ardalan, A; Angleby, H; Dahlgren, L-G; Tepeli, C; Kirkness, E; Savolainen, P; Zhang, Y-P

    2012-01-01

    Global mitochondrial DNA (mtDNA) data indicates that the dog originates from domestication of wolf in Asia South of Yangtze River (ASY), with minor genetic contributions from dog–wolf hybridisation elsewhere. Archaeological data and autosomal single nucleotide polymorphism data have instead suggested that dogs originate from Europe and/or South West Asia but, because these datasets lack data from ASY, evidence pointing to ASY may have been overlooked. Analyses of additional markers for global datasets, including ASY, are therefore necessary to test if mtDNA phylogeography reflects the actual dog history and not merely stochastic events or selection. Here, we analyse 14 437 bp of Y-chromosome DNA sequence in 151 dogs sampled worldwide. We found 28 haplotypes distributed in five haplogroups. Two haplogroups were universally shared and included three haplotypes carried by 46% of all dogs, but two other haplogroups were primarily restricted to East Asia. Highest genetic diversity and virtually complete phylogenetic coverage was found within ASY. The 151 dogs were estimated to originate from 13–24 wolf founders, but there was no indication of post-domestication dog–wolf hybridisations. Thus, Y-chromosome and mtDNA data give strikingly similar pictures of dog phylogeography, most importantly that roughly 50% of the gene pools are shared universally but only ASY has nearly the full range of genetic diversity, such that the gene pools in all other regions may derive from ASY. This corroborates that ASY was the principal, and possibly sole region of wolf domestication, that a large number of wolves were domesticated, and that subsequent dog–wolf hybridisation contributed modestly to the dog gene pool. PMID:22108628

  12. DNA binding sites characterization by means of Rényi entropy measures on nucleotide transitions.

    PubMed

    Perera, Alexandre; Vallverdu, Montserrat; Claria, Francesc; Soria, José Manuel; Caminal, Pere

    2006-01-01

    In this work, parametric information-theory measures for the characterization of binding sites in DNA are extended with the use of transitional probabilities on the sequence. We propose the use of parametric uncertainty measures such as Rényi entropies obtained from the transition probabilities for the study of the binding sites, in addition to nucleotide frequency based Rényi measures. Results are reported in this manuscript comparing transition frequencies (i.e. dinucleotides) and base frequencies for Shannon and parametric Rényi entropies for a number of binding sites found in E. coli, lambda and T7 organisms. We observe that, for the evaluated datasets, the information provided by both approaches is not redundant, as they evolve differently under increasing Rényi orders.
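
    For reference, the Rényi entropy of order α of a distribution p is H_α(p) = log2(Σ p_i^α) / (1 − α), and it reduces to the Shannon entropy as α approaches 1. The sketch below computes it for both base frequencies and dinucleotide frequencies of a single sequence. It is only an illustration of the two kinds of measure the abstract compares: treating overlapping dinucleotide counts as the "transition" distribution is an assumption, and the function names are hypothetical rather than the authors' implementation.

```python
import math
from collections import Counter


def renyi_entropy(probs, alpha):
    """Rényi entropy (in bits) of order alpha; alpha = 1 gives Shannon entropy."""
    probs = [p for p in probs if p > 0]
    if abs(alpha - 1.0) < 1e-12:
        # Limit alpha -> 1: Shannon entropy.
        return -sum(p * math.log2(p) for p in probs)
    return math.log2(sum(p ** alpha for p in probs)) / (1.0 - alpha)


def base_frequencies(seq):
    """Relative frequency of each nucleotide observed in seq."""
    counts = Counter(seq)
    total = sum(counts.values())
    return [c / total for c in counts.values()]


def dinucleotide_frequencies(seq):
    """Relative frequency of each overlapping dinucleotide in seq.

    Assumption: overlapping pairs stand in for the abstract's
    'transition frequencies (i.e. dinucleotides)'.
    """
    pairs = Counter(seq[i:i + 2] for i in range(len(seq) - 1))
    total = sum(pairs.values())
    return [c / total for c in pairs.values()]


if __name__ == "__main__":
    site = "TTGACAATTAATCATCGAACTAGTTAACT"  # arbitrary example string
    for alpha in (0.5, 1.0, 2.0, 4.0):
        h_base = renyi_entropy(base_frequencies(site), alpha)
        h_pair = renyi_entropy(dinucleotide_frequencies(site), alpha)
        print(f"alpha={alpha}: bases={h_base:.3f} bits, dinucleotides={h_pair:.3f} bits")
```

    Sweeping alpha as in the usage loop shows how the two measures change with increasing Rényi order, which is the comparison the abstract describes.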

  13. Wheat EST resources for functional genomics of abiotic stress

    PubMed Central

    Houde, Mario; Belcaid, Mahdi; Ouellet, François; Danyluk, Jean; Monroy, Antonio F; Dryanova, Ani; Gulick, Patrick; Bergeron, Anne; Laroche, André; Links, Matthew G; MacCarthy, Luke; Crosby, William L; Sarhan, Fathey

    2006-01-01

    Background Wheat is an excellent species to study freezing tolerance and other abiotic stresses. However, the sequence of the wheat genome has not been completely characterized due to its complexity and large size. To circumvent this obstacle and identify genes involved in cold acclimation and associated stresses, a large scale EST sequencing approach was undertaken by the Functional Genomics of Abiotic Stress (FGAS) project. Results We generated 73,521 quality-filtered ESTs from eleven cDNA libraries constructed from wheat plants exposed to various abiotic stresses and at different developmental stages. In addition, 196,041 ESTs for which tracefiles were available from the National Science Foundation wheat EST sequencing program and DuPont were also quality-filtered and used in the analysis. Clustering of the combined ESTs with d2_cluster and TGICL yielded a few large clusters containing several thousand ESTs that were refractory to routine clustering techniques. To resolve this problem, the sequence proximity and "bridges" were identified by an e-value distance graph to manually break clusters into smaller groups. Assembly of the resolved ESTs generated a 75,488 unique sequence set (31,580 contigs and 43,908 singletons/singlets). Digital expression analyses indicated that the FGAS dataset is enriched in stress-regulated genes compared to the other public datasets. Over 43% of the unique sequence set was annotated and classified into functional categories according to Gene Ontology. Conclusion We have annotated 29,556 different sequences, an almost 5-fold increase in annotated sequences compared to the available wheat public databases. Digital expression analysis combined with gene annotation helped in the identification of several pathways associated with abiotic stress. The genomic resources and knowledge developed by this project will contribute to a better understanding of the different mechanisms that govern stress tolerance in wheat and other cereals. PMID:16772040

  14. Convolutional neural network architectures for predicting DNA–protein binding

    PubMed Central

    Zeng, Haoyang; Edwards, Matthew D.; Liu, Ge; Gifford, David K.

    2016-01-01

    Motivation: Convolutional neural networks (CNN) have outperformed conventional methods in modeling the sequence specificity of DNA–protein binding. Yet inappropriate CNN architectures can yield poorer performance than simpler models. Thus an in-depth understanding of how to match CNN architecture to a given task is needed to fully harness the power of CNNs for computational biology applications. Results: We present a systematic exploration of CNN architectures for predicting DNA sequence binding using a large compendium of transcription factor datasets. We identify the best-performing architectures by varying CNN width, depth and pooling designs. We find that adding convolutional kernels to a network is important for motif-based tasks. We show the benefits of CNNs in learning rich higher-order sequence features, such as secondary motifs and local sequence context, by comparing network performance on multiple modeling tasks ranging in difficulty. We also demonstrate how careful construction of sequence benchmark datasets, using approaches that control potentially confounding effects like positional or motif strength bias, is critical in making fair comparisons between competing methods. We explore how to establish the sufficiency of training data for these learning tasks, and we have created a flexible cloud-based framework that permits the rapid exploration of alternative neural network architectures for problems in computational biology. Availability and Implementation: All the models analyzed are available at http://cnn.csail.mit.edu. Contact: gifford@mit.edu Supplementary information: Supplementary data are available at Bioinformatics online. PMID:27307608

  15. Ultra Deep Sequencing of Listeria monocytogenes sRNA Transcriptome Revealed New Antisense RNAs

    PubMed Central

    Behrens, Sebastian; Widder, Stefanie; Mannala, Gopala Krishna; Qing, Xiaoxing; Madhugiri, Ramakanth; Kefer, Nathalie; Mraheil, Mobarak Abu; Rattei, Thomas; Hain, Torsten

    2014-01-01

    Listeria monocytogenes, a gram-positive pathogen, and causative agent of listeriosis, has become a widely used model organism for intracellular infections. Recent studies have identified small non-coding RNAs (sRNAs) as important factors for regulating gene expression and pathogenicity of L. monocytogenes. Increased speed and reduced costs of high throughput sequencing (HTS) techniques have made RNA sequencing (RNA-Seq) the state-of-the-art method to study bacterial transcriptomes. We created a large transcriptome dataset of L. monocytogenes containing a total of 21 million reads, using the SOLiD sequencing technology. The dataset contained cDNA sequences generated from L. monocytogenes RNA collected under intracellular and extracellular conditions and additionally was size-fractionated into three size ranges: <40 nt, 40–150 nt and >150 nt. We report here the identification of nine new sRNA candidates of L. monocytogenes and a reevaluation of known sRNAs of L. monocytogenes EGD-e. Automatic comparison to known sRNAs revealed a high recovery rate of 55%, which was increased to 90% by manual revision of the data. Moreover, thorough classification of known sRNAs shed further light on their possible biological functions. Interestingly, among the newly identified sRNA candidates are antisense RNAs (asRNAs) associated with the housekeeping genes purA, fumC and pgi and potentially their regulation, emphasizing the significance of sRNAs for metabolic adaptation in L. monocytogenes. PMID:24498259

  16. A Predictive Model of the Oxygen and Heme Regulatory Network in Yeast

    PubMed Central

    Kundaje, Anshul; Xin, Xiantong; Lan, Changgui; Lianoglou, Steve; Zhou, Mei; Zhang, Li; Leslie, Christina

    2008-01-01

    Deciphering gene regulatory mechanisms through the analysis of high-throughput expression data is a challenging computational problem. Previous computational studies have used large expression datasets in order to resolve fine patterns of coexpression, producing clusters or modules of potentially coregulated genes. These methods typically examine promoter sequence information, such as DNA motifs or transcription factor occupancy data, in a separate step after clustering. We needed an alternative and more integrative approach to study the oxygen regulatory network in Saccharomyces cerevisiae using a small dataset of perturbation experiments. Mechanisms of oxygen sensing and regulation underlie many physiological and pathological processes, and only a handful of oxygen regulators have been identified in previous studies. We used a new machine learning algorithm called MEDUSA to uncover detailed information about the oxygen regulatory network using genome-wide expression changes in response to perturbations in the levels of oxygen, heme, Hap1, and Co2+. MEDUSA integrates mRNA expression, promoter sequence, and ChIP-chip occupancy data to learn a model that accurately predicts the differential expression of target genes in held-out data. We used a novel margin-based score to extract significant condition-specific regulators and assemble a global map of the oxygen sensing and regulatory network. This network includes both known oxygen and heme regulators, such as Hap1, Mga2, Hap4, and Upc2, as well as many new candidate regulators. MEDUSA also identified many DNA motifs that are consistent with previous experimentally identified transcription factor binding sites. Because MEDUSA's regulatory program associates regulators to target genes through their promoter sequences, we directly tested the predicted regulators for OLE1, a gene specifically induced under hypoxia, by experimental analysis of the activity of its promoter. In each case, deletion of the candidate regulator resulted in the predicted effect on promoter activity, confirming that several novel regulators identified by MEDUSA are indeed involved in oxygen regulation. MEDUSA can reveal important information from a small dataset and generate testable hypotheses for further experimental analysis. Supplemental data are included. PMID:19008939

  17. Billions of basepairs of recently expanded, repetitive sequences are eliminated from the somatic genome during copepod development.

    PubMed

    Sun, Cheng; Wyngaard, Grace; Walton, D Brian; Wichman, Holly A; Mueller, Rachel Lockridge

    2014-03-11

    Chromatin diminution is the programmed deletion of DNA from presomatic cell or nuclear lineages during development, producing single organisms that contain two different nuclear genomes. Phylogenetically diverse taxa undergo chromatin diminution, including some ciliates, nematodes, copepods, and vertebrates. In cyclopoid copepods, chromatin diminution occurs in taxa with massively expanded germline genomes; depending on species, germline genome sizes range from 15-75 Gb, 12-74 Gb of which are lost from pre-somatic cell lineages at germline-soma differentiation. This is more than an order of magnitude more sequence than is lost from other taxa. To date, the sequences excised from copepods have not been analyzed using large-scale genomic datasets, and the processes underlying germline genomic gigantism in this clade, as well as the functional significance of chromatin diminution, have remained unknown. Here, we used high-throughput genomic sequencing and qPCR to characterize the germline and somatic genomes of Mesocyclops edax, a freshwater cyclopoid copepod with a germline genome of ~15 Gb and a somatic genome of ~3 Gb. We show that most of the excised DNA consists of repetitive sequences that are either 1) verifiable transposable elements (TEs), or 2) non-simple repeats of likely TE origin. Repeat elements in both genomes are skewed towards younger (i.e. less divergent) elements. Excised DNA is a non-random sample of the germline repeat element landscape; younger elements, and high frequency DNA transposons and LINEs, are disproportionately eliminated from the somatic genome. Our results suggest that germline genome expansion in M. edax reflects explosive repeat element proliferation, and that billions of base pairs of such repeats are deleted from the somatic genome every generation. Thus, we hypothesize that chromatin diminution is a mechanism that controls repeat element load, and that this load can evolve to be divergent between tissue types within single organisms.

  18. HLA DNA Sequence Variation among Human Populations: Molecular Signatures of Demographic and Selective Events

    PubMed Central

    Buhler, Stéphane; Sanchez-Mazas, Alicia

    2011-01-01

    Molecular differences between HLA alleles vary up to 57 nucleotides within the peptide binding coding region of human Major Histocompatibility Complex (MHC) genes, but it is still unclear whether this variation results from a stochastic process or from selective constraints related to functional differences among HLA molecules. Although HLA alleles are generally treated as equidistant molecular units in population genetic studies, DNA sequence diversity among populations is also crucial to interpret the observed HLA polymorphism. In this study, we used a large dataset of 2,062 DNA sequences defined for the different HLA alleles to analyze nucleotide diversity of seven HLA genes in 23,500 individuals of about 200 populations spread worldwide. We first analyzed the HLA molecular structure and diversity of these populations in relation to geographic variation and we further investigated possible departures from selective neutrality through Tajima's tests and mismatch distributions. All results were compared to those obtained by classical approaches applied to HLA allele frequencies. Our study shows that the global patterns of HLA nucleotide diversity among populations are significantly correlated to geography, although in some specific cases the molecular information reveals unexpected genetic relationships. At all loci except HLA-DPB1, populations have accumulated a high proportion of very divergent alleles, suggesting an advantage of heterozygotes expressing molecularly distant HLA molecules (asymmetric overdominant selection model). However, both different intensities of selection and unequal levels of gene conversion may explain the heterogeneous mismatch distributions observed among the loci. Also, distinctive patterns of sequence divergence observed at the HLA-DPB1 locus suggest current neutrality but old selective pressures on this gene. 
We conclude that HLA DNA sequences advantageously complement HLA allele frequencies as a source of data used to explore the genetic history of human populations, and that their analysis allows a more thorough investigation of human MHC molecular evolution. PMID:21408106
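
    The Tajima's test mentioned in the abstract above compares two estimates of nucleotide diversity: the mean number of pairwise differences and the number of segregating sites. As a rough illustration only (not the authors' pipeline), the statistic can be sketched in Python as follows. The function name `tajimas_d` and the toy input are hypothetical, and the sketch assumes equal-length, gap-free aligned sequences with at least four samples.

```python
from collections import Counter
from math import sqrt


def tajimas_d(seqs):
    """Return Tajima's D for a list of aligned, equal-length sequences.

    Illustrative sketch: gaps and ambiguity codes are not handled.
    Returns None when there are no segregating sites (D is undefined).
    """
    n = len(seqs)
    if n < 4:
        raise ValueError("need at least 4 sequences (variance is zero below that)")
    if len({len(s) for s in seqs}) != 1:
        raise ValueError("sequences must be aligned to the same length")

    seg_sites = 0   # S: number of polymorphic columns
    pi = 0.0        # mean number of pairwise differences
    for column in zip(*seqs):
        counts = Counter(column)
        if len(counts) > 1:
            seg_sites += 1
            # Fraction of sequence pairs that differ at this column.
            pi += (n * n - sum(c * c for c in counts.values())) / (n * (n - 1))
    if seg_sites == 0:
        return None

    # Constants from Tajima (1989).
    a1 = sum(1.0 / i for i in range(1, n))
    a2 = sum(1.0 / (i * i) for i in range(1, n))
    b1 = (n + 1) / (3.0 * (n - 1))
    b2 = 2.0 * (n * n + n + 3) / (9.0 * n * (n - 1))
    c1 = b1 - 1.0 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / (a1 * a1)
    e1 = c1 / a1
    e2 = c2 / (a1 * a1 + a2)

    variance = e1 * seg_sites + e2 * seg_sites * (seg_sites - 1)
    return (pi - seg_sites / a1) / sqrt(variance)
```

    An excess of rare variants gives a negative value, while an excess of intermediate-frequency variants, as expected under balancing selection, gives a positive one.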

  19. Billions of basepairs of recently expanded, repetitive sequences are eliminated from the somatic genome during copepod development

    PubMed Central

    2014-01-01

    Background Chromatin diminution is the programmed deletion of DNA from presomatic cell or nuclear lineages during development, producing single organisms that contain two different nuclear genomes. Phylogenetically diverse taxa undergo chromatin diminution — some ciliates, nematodes, copepods, and vertebrates. In cyclopoid copepods, chromatin diminution occurs in taxa with massively expanded germline genomes; depending on species, germline genome sizes range from 15 – 75 Gb, 12–74 Gb of which are lost from pre-somatic cell lineages at germline – soma differentiation. This is more than an order of magnitude more sequence than is lost from other taxa. To date, the sequences excised from copepods have not been analyzed using large-scale genomic datasets, and the processes underlying germline genomic gigantism in this clade, as well as the functional significance of chromatin diminution, have remained unknown. Results Here, we used high-throughput genomic sequencing and qPCR to characterize the germline and somatic genomes of Mesocyclops edax, a freshwater cyclopoid copepod with a germline genome of ~15 Gb and a somatic genome of ~3 Gb. We show that most of the excised DNA consists of repetitive sequences that are either 1) verifiable transposable elements (TEs), or 2) non-simple repeats of likely TE origin. Repeat elements in both genomes are skewed towards younger (i.e. less divergent) elements. Excised DNA is a non-random sample of the germline repeat element landscape; younger elements, and high frequency DNA transposons and LINEs, are disproportionately eliminated from the somatic genome. Conclusions Our results suggest that germline genome expansion in M. edax reflects explosive repeat element proliferation, and that billions of base pairs of such repeats are deleted from the somatic genome every generation. 
Thus, we hypothesize that chromatin diminution is a mechanism that controls repeat element load, and that this load can evolve to be divergent between tissue types within single organisms. PMID:24618421

  20. Amplicon Sequencing of the slpH Locus Permits Culture-Independent Strain Typing of Lactobacillus helveticus in Dairy Products

    PubMed Central

    Moser, Aline; Wüthrich, Daniel; Bruggmann, Rémy; Eugster-Meier, Elisabeth; Meile, Leo; Irmler, Stefan

    2017-01-01

    The advent of massively parallel sequencing technologies has opened up possibilities for the study of the bacterial diversity of ecosystems without the need for enrichment or single strain isolation. By exploiting 78 genome data-sets from Lactobacillus helveticus strains, we found that the slpH locus that encodes a putative surface layer protein displays sufficient genetic heterogeneity to be a suitable target for strain typing. Based on high-throughput slpH gene sequencing and the detection of single-base DNA sequence variations, we established a culture-independent method to assess the biodiversity of the L. helveticus strains present in fermented dairy food. When we applied the method to study the L. helveticus strain composition in 15 natural whey cultures (NWCs) that were collected at different production facilities for Gruyère, a protected designation of origin (PDO) cheese, we detected a total of 10 sequence types (STs). In addition, we monitored the development of a three-strain mix in raclette cheese for 17 weeks. PMID:28775722

  1. Illuminator, a desktop program for mutation detection using short-read clonal sequencing.

    PubMed

    Carr, Ian M; Morgan, Joanne E; Diggle, Christine P; Sheridan, Eamonn; Markham, Alexander F; Logan, Clare V; Inglehearn, Chris F; Taylor, Graham R; Bonthron, David T

    2011-10-01

    Current methods for sequencing clonal populations of DNA molecules yield several gigabases of data per day, typically comprising reads of < 100 nt. Such datasets permit widespread genome resequencing and transcriptome analysis or other quantitative tasks. However, this huge capacity can also be harnessed for the resequencing of smaller (gene-sized) target regions, through the simultaneous parallel analysis of multiple subjects, using sample "tagging" or "indexing". These methods promise to have a huge impact on diagnostic mutation analysis and candidate gene testing. Here we describe a software package developed for such studies, offering the ability to resolve pooled samples carrying barcode tags and to align reads to a reference sequence using a mutation-tolerant process. The program, Illuminator, can identify rare sequence variants, including insertions and deletions, and permits interactive data analysis on standard desktop computers. It facilitates the effective analysis of targeted clonal sequencer data without dedicated computational infrastructure or specialized training. Copyright © 2011 Elsevier Inc. All rights reserved.
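
    The abstract above describes resolving pooled samples by their barcode tags. The sketch below illustrates the general idea of barcode demultiplexing with a small mismatch tolerance; it is not Illuminator's actual algorithm, which the abstract does not specify. The function names, the `max_mismatches` parameter, and the assumption that each barcode sits at the 5' end of the read are all hypothetical.

```python
def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def demultiplex(reads, barcodes, max_mismatches=1):
    """Assign reads to samples by their leading barcode.

    reads: iterable of read sequences (str)
    barcodes: dict mapping sample name -> barcode sequence
    Reads that match no barcode, or that match several equally well,
    are collected under the key None. The barcode is trimmed from
    assigned reads.
    """
    bins = {sample: [] for sample in barcodes}
    bins[None] = []
    for read in reads:
        # Score every barcode against the start of the read.
        scored = [
            (hamming(read[:len(bc)], bc), sample, len(bc))
            for sample, bc in barcodes.items()
            if len(read) >= len(bc)
        ]
        scored.sort(key=lambda item: item[0])
        unique_best = scored and (len(scored) == 1 or scored[0][0] < scored[1][0])
        if unique_best and scored[0][0] <= max_mismatches:
            _, sample, bc_len = scored[0]
            bins[sample].append(read[bc_len:])
        else:
            bins[None].append(read)
    return bins
```

    For example, with barcodes `{"s1": "ACGT", "s2": "TGCA"}`, a read beginning `ACGA` is assigned to `s1` because it is one mismatch away from that barcode and three away from the other.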

  2. Teaching artificial intelligence to read electropherograms.

    PubMed

    Taylor, Duncan; Powers, David

    2016-11-01

    Electropherograms are produced in great numbers in forensic DNA laboratories as part of everyday criminal casework. Before the results of these electropherograms can be used they must be scrutinised by analysts to determine what the identified data tells us about the underlying DNA sequences and what is purely an artefact of the DNA profiling process. A technique that lends itself well to such a task of classification in the face of vast amounts of data is the use of artificial neural networks. These networks, inspired by the workings of the human brain, have been increasingly successful in analysing large datasets, performing medical diagnoses, identifying handwriting, playing games, or recognising images. In this work we demonstrate the use of an artificial neural network which we train to 'read' electropherograms and show that it can generalise to unseen profiles. Copyright © 2016 Elsevier Ireland Ltd. All rights reserved.

  3. Species delimitation in the Central African herbs Haumania (Marantaceae) using georeferenced nuclear and chloroplastic DNA sequences.

    PubMed

    Ley, A C; Hardy, O J

    2010-11-01

    Species delimitation is a fundamental biological concept which is frequently discussed and altered to integrate new insights. These revealed that speciation is not a one step phenomenon but an ongoing process and morphological characters alone are not sufficient anymore to properly describe the results of this process. Here we want to assess the degree of speciation in two closely related lianescent taxa from the tropical African genus Haumania which display distinct vegetative traits despite a high similarity in reproductive traits and a partial overlap in distribution area which might facilitate gene flow. To this end, we combined phylogenetic and phylogeographic analyses using nuclear (nr) and chloroplast (cp) DNA sequences in comparison to morphological species descriptions. The nuclear dataset unambiguously supports the morphological species concept in Haumania. However, the main chloroplastic haplotypes are shared between species and, although a geographic analysis of cpDNA diversity confirms that individuals from the same taxon are more related than individuals from distinct taxa, cp-haplotypes display correlated geographic distributions between species. Hybridization is the most plausible reason for this pattern. A scenario involving speciation in geographic isolation followed by range expansion is outlined. The study highlights the gain of information on the speciation process in Haumania by adding georeferenced molecular data to the morphological characteristics. It also shows that nr and cp sequence data might provide different but complementary information, questioning the reliability of the unique use of chloroplast data for species recognition by DNA barcoding. Copyright © 2010 Elsevier Inc. All rights reserved.

  4. Computational Identification and Functional Predictions of Long Noncoding RNA in Zea mays

    PubMed Central

    Boerner, Susan; McGinnis, Karen M.

    2012-01-01

    Background Computational analysis of cDNA sequences from multiple organisms suggests that a large portion of transcribed DNA does not code for a functional protein. In mammals, noncoding transcription is abundant, and often results in functional RNA molecules that do not appear to encode proteins. Many long noncoding RNAs (lncRNAs) appear to have epigenetic regulatory function in humans, including HOTAIR and XIST. While epigenetic gene regulation is clearly an essential mechanism in plants, relatively little is known about the presence or function of lncRNAs in plants. Methodology/Principal Findings To explore the connection between lncRNA and epigenetic regulation of gene expression in plants, a computational pipeline using the programming language Python has been developed and applied to maize full length cDNA sequences to identify, classify, and localize potential lncRNAs. The pipeline was used in parallel with an SVM tool for identifying ncRNAs to identify the maximal number of ncRNAs in the dataset. Although the available library of sequences was small and potentially biased toward protein coding transcripts, 15% of the sequences were predicted to be noncoding. Approximately 60% of these sequences appear to act as precursors for small RNA molecules and may function to regulate gene expression via a small RNA dependent mechanism. ncRNAs were predicted to originate from both genic and intergenic loci. Of the lncRNAs that originated from genic loci, ∼20% were antisense to the host gene loci. Conclusions/Significance Consistent with similar studies in other organisms, noncoding transcription appears to be widespread in the maize genome. Computational predictions indicate that maize lncRNAs may function to regulate expression of other genes through multiple RNA mediated mechanisms. PMID:22916204

  5. Purification of nanogram-range immunoprecipitated DNA in ChIP-seq application.

    PubMed

    Zhong, Jian; Ye, Zhenqing; Lenz, Samuel W; Clark, Chad R; Bharucha, Adil; Farrugia, Gianrico; Robertson, Keith D; Zhang, Zhiguo; Ordog, Tamas; Lee, Jeong-Heon

    2017-12-21

    Chromatin immunoprecipitation-sequencing (ChIP-seq) is a widely used epigenetic approach for investigating genome-wide protein-DNA interactions in cells and tissues. The approach has been relatively well established but several key steps still require further improvement. As a part of the procedure, immunoprecipitated DNA must undergo purification and library preparation for subsequent high-throughput sequencing. Current ChIP protocols typically yield nanogram quantities of immunoprecipitated DNA mainly depending on the target of interest and starting chromatin input amount. However, little information exists on the performance of reagents used for the purification of such minute amounts of immunoprecipitated DNA in ChIP elution buffer and their effects on ChIP-seq data. Here, we compared DNA recovery, library preparation efficiency, and ChIP-seq results obtained with several commercial DNA purification reagents applied to 1 ng ChIP DNA and also investigated the impact of conditions under which ChIP DNA is stored. We compared DNA recovery of ten commercial DNA purification reagents and phenol/chloroform extraction from 1 to 50 ng of immunoprecipitated DNA in ChIP elution buffer. The recovery yield was significantly different with 1 ng of DNA while similar in higher DNA amounts. We also observed that the low nanogram range of purified DNA is prone to loss during storage depending on the type of polypropylene tube used. The immunoprecipitated DNA equivalent to 1 ng of purified DNA was subjected to DNA purification and library preparation to evaluate the performance of four better performing purification reagents in ChIP-seq applications. Quantification of library DNAs indicated the selected purification kits have a negligible impact on the efficiency of library preparation. The resulting ChIP-seq data were comparable with the dataset generated by ENCODE consortium and were highly correlated between the data from different purification reagents. 
This study provides comparative data on commercial DNA purification reagents applied to nanogram-range immunoprecipitated ChIP DNA and evidence for the importance of storage conditions of low nanogram-range purified DNA. We verified consistent high performance of a subset of the tested reagents. These results will facilitate the improvement of ChIP-seq methodology for low-input applications.

  6. Assessing the phylogeographic history of the montane caddisfly Thremma gallicum using mitochondrial and restriction-site-associated DNA (RAD) markers

    PubMed Central

    Macher, Jan-Niklas; Rozenberg, Andrey; Pauls, Steffen U; Tollrian, Ralph; Wagner, Rüdiger; Leese, Florian

    2015-01-01

    Repeated Quaternary glaciations have significantly shaped the present distribution and diversity of several European species in aquatic and terrestrial habitats. To study the phylogeography of freshwater invertebrates, patterns of intraspecific variation have been examined primarily using mitochondrial DNA markers that may yield results unrepresentative of the true species history. Here, population genetic parameters were inferred for a montane aquatic caddisfly, Thremma gallicum, by sequencing a 658-bp fragment of the mitochondrial CO1 gene, and 12,514 nuclear RAD loci. T. gallicum has a highly disjunct distribution in southern and central Europe, with known populations in the Cantabrian Mountains, Pyrenees, Massif Central, and Black Forest. Both datasets represented rangewide sampling of T. gallicum. For the CO1 dataset, this included 352 specimens from 26 populations, and for the RAD dataset, 17 specimens from eight populations. We tested 20 competing phylogeographic scenarios using approximate Bayesian computation (ABC) and estimated genetic diversity patterns. Support for phylogeographic scenarios and diversity estimates differed between datasets with the RAD data favouring a southern origin of extant populations and indicating the Cantabrian Mountains and Massif Central populations to represent highly diverse populations as compared with the Pyrenees and Black Forest populations. The CO1 data supported a vicariance scenario (north–south) and yielded inconsistent diversity estimates. Permutation tests suggest that a few hundred polymorphic RAD SNPs are necessary for reliable parameter estimates. Our results highlight the potential of RAD and ABC-based hypothesis testing to complement phylogeographic studies on non-model species. PMID:25691988

  7. Deep sequencing of the Camellia sinensis transcriptome revealed candidate genes for major metabolic pathways of tea-specific compounds

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Shi, CY; Yang, H; Wei, CL

    Tea is one of the most popular non-alcoholic beverages worldwide. However, the tea plant, Camellia sinensis, is difficult to culture in vitro, to transform, and has a large genome, rendering little genomic information available. Recent advances in large-scale RNA sequencing (RNA-seq) provide a fast, cost-effective, and reliable approach to generate large expression datasets for functional genomic analysis, which is especially suitable for non-model species with un-sequenced genomes. Using high-throughput Illumina RNA-seq, the transcriptome from poly(A)+ RNA of C. sinensis was analyzed at an unprecedented depth (2.59 gigabase pairs). Approximately 34.5 million reads were obtained, trimmed, and assembled into 127,094 unigenes, with an average length of 355 bp and an N50 of 506 bp, which consisted of 788 contig clusters and 126,306 singletons. This number of unigenes was 10-fold higher than existing C. sinensis sequences deposited in GenBank (as of August 2010). Sequence similarity analyses against six public databases (Uniprot, NR and COGs at NCBI, Pfam, InterPro and KEGG) found 55,088 unigenes that could be annotated with gene descriptions, conserved protein domains, or gene ontology terms. Some of the unigenes were assigned to putative metabolic pathways. Targeted searches using these annotations identified the majority of genes associated with several primary metabolic pathways and natural product pathways that are important to tea quality, such as flavonoid, theanine and caffeine biosynthesis pathways. Novel candidate genes of these secondary pathways were discovered. Comparisons with four previously prepared cDNA libraries revealed that this transcriptome dataset has both a high degree of consistency with previous EST data and an approximately 20-fold increase in coverage. Thirteen unigenes related to theanine and flavonoid synthesis were validated. 
Their expression patterns in different organs of the tea plant were analyzed by RT-PCR and quantitative real time PCR (qRT-PCR). An extensive transcriptome dataset has been obtained from the deep sequencing of tea plant. The coverage of the transcriptome is comprehensive enough to discover all known genes of several major metabolic pathways. This transcriptome dataset can serve as an important public information platform for gene expression, genomics, and functional genomic studies in C. sinensis.

  8. Deep sequencing of the Camellia sinensis transcriptome revealed candidate genes for major metabolic pathways of tea-specific compounds

    PubMed Central

    2011-01-01

    Background Tea is one of the most popular non-alcoholic beverages worldwide. However, the tea plant, Camellia sinensis, is difficult to culture in vitro, to transform, and has a large genome, rendering little genomic information available. Recent advances in large-scale RNA sequencing (RNA-seq) provide a fast, cost-effective, and reliable approach to generate large expression datasets for functional genomic analysis, which is especially suitable for non-model species with un-sequenced genomes. Results Using high-throughput Illumina RNA-seq, the transcriptome from poly(A)+ RNA of C. sinensis was analyzed at an unprecedented depth (2.59 gigabase pairs). Approximately 34.5 million reads were obtained, trimmed, and assembled into 127,094 unigenes, with an average length of 355 bp and an N50 of 506 bp, which consisted of 788 contig clusters and 126,306 singletons. This number of unigenes was 10-fold higher than existing C. sinensis sequences deposited in GenBank (as of August 2010). Sequence similarity analyses against six public databases (Uniprot, NR and COGs at NCBI, Pfam, InterPro and KEGG) found 55,088 unigenes that could be annotated with gene descriptions, conserved protein domains, or gene ontology terms. Some of the unigenes were assigned to putative metabolic pathways. Targeted searches using these annotations identified the majority of genes associated with several primary metabolic pathways and natural product pathways that are important to tea quality, such as flavonoid, theanine and caffeine biosynthesis pathways. Novel candidate genes of these secondary pathways were discovered. Comparisons with four previously prepared cDNA libraries revealed that this transcriptome dataset has both a high degree of consistency with previous EST data and an approximately 20-fold increase in coverage. Thirteen unigenes related to theanine and flavonoid synthesis were validated. 
Their expression patterns in different organs of the tea plant were analyzed by RT-PCR and quantitative real time PCR (qRT-PCR). Conclusions An extensive transcriptome dataset has been obtained from the deep sequencing of tea plant. The coverage of the transcriptome is comprehensive enough to discover all known genes of several major metabolic pathways. This transcriptome dataset can serve as an important public information platform for gene expression, genomics, and functional genomic studies in C. sinensis. PMID:21356090
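
    The assembly statistics quoted above include an N50 of 506 bp. For readers unfamiliar with the metric, N50 is the length L such that sequences of length L or longer contain at least half of the total assembled bases. A minimal Python sketch follows; the function name `n50` and the example lengths are illustrative and are not taken from the study.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths.

    N50 is the largest length L such that sequences of length >= L
    together cover at least half of the total number of bases.
    """
    if not lengths:
        raise ValueError("need at least one sequence length")
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2.0
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length


# Toy example: total is 32 bases, and the two longest contigs
# (8 + 8 = 16) already reach half of the total.
print(n50([2, 2, 2, 3, 3, 4, 8, 8]))  # 8
```

    Because the metric is weighted by length, an assembly dominated by short singletons, as here, can have an N50 well above its average sequence length (506 bp versus 355 bp).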

  9. CowPI: A Rumen Microbiome Focussed Version of the PICRUSt Functional Inference Software.

    PubMed

    Wilkinson, Toby J; Huws, Sharon A; Edwards, Joan E; Kingston-Smith, Alison H; Siu-Ting, Karen; Hughes, Martin; Rubino, Francesco; Friedersdorff, Maximillian; Creevey, Christopher J

    2018-01-01

    Metataxonomic 16S rDNA based studies are a commonplace and useful tool in the research of the microbiome, but they do not provide the full investigative power of metagenomics and metatranscriptomics for revealing the functional potential of microbial communities. However, the use of metagenomic and metatranscriptomic technologies is hindered by high costs and skills barrier necessary to generate and interpret the data. To address this, a tool for Phylogenetic Investigation of Communities by Reconstruction of Unobserved States (PICRUSt) was developed for inferring the functional potential of an observed microbiome profile, based on 16S data. This allows functional inferences to be made from metataxonomic 16S rDNA studies with little extra work or cost, but its accuracy relies on the availability of completely sequenced genomes of representative organisms from the community being investigated. The rumen microbiome is an example of a community traditionally underrepresented in genome and sequence databases, but recent efforts by projects such as the Global Rumen Census and Hungate 1000 have resulted in a wide sampling of 16S rDNA profiles and almost 500 fully sequenced microbial genomes from this environment. Using this information, we have developed "CowPI," a focused version of the PICRUSt tool provided for use by the wider scientific community in the study of the rumen microbiome. We evaluated the accuracy of CowPI and PICRUSt using two 16S datasets from the rumen microbiome: one generated from rDNA and the other from rRNA where corresponding metagenomic and metatranscriptomic data was also available. We show that the functional profiles predicted by CowPI better match estimates for both the meta-genomic and transcriptomic datasets than PICRUSt, and capture the higher degree of genetic variation and larger pangenomes of rumen organisms. 
Nonetheless, whilst being closer in terms of predictive power for the rumen microbiome, there were differences when compared to both the metagenomic and metatranscriptome data and so we recommend, where possible, functional inferences from 16S data should not replace metagenomic and metatranscriptomic approaches. The tool can be accessed at http://www.cowpi.org and is provided to the wider scientific community for use in the study of the rumen microbiome.

  10. Comparative and Joint Analysis of Two Metagenomic Datasets from a Biogas Fermenter Obtained by 454-Pyrosequencing

    PubMed Central

    Jaenicke, Sebastian; Ander, Christina; Bekel, Thomas; Bisdorf, Regina; Dröge, Marcus; Gartemann, Karl-Heinz; Jünemann, Sebastian; Kaiser, Olaf; Krause, Lutz; Tille, Felix; Zakrzewski, Martha; Pühler, Alfred

    2011-01-01

    Biogas production from renewable resources is attracting increased attention as an alternative energy source due to the limited availability of traditional fossil fuels. Many countries are promoting the use of alternative energy sources for sustainable energy production. In this study, a metagenome from a production-scale biogas fermenter was analysed employing Roche's GS FLX Titanium technology and compared to a previous dataset obtained from the same community DNA sample that was sequenced on the GS FLX platform. Taxonomic profiling based on 16S rRNA-specific sequences and an Environmental Gene Tag (EGT) analysis employing CARMA demonstrated that both approaches benefit from the longer read lengths obtained on the Titanium platform. Results confirmed Clostridia as the most prevalent taxonomic class, whereas species of the order Methanomicrobiales are dominant among methanogenic Archaea. However, the analyses also identified additional taxa that were missed by the previous study, including members of the genera Streptococcus, Acetivibrio, Garciella, Tissierella, and Gelria, which might also play a role in the fermentation process leading to the formation of methane. Taking advantage of the CARMA feature to correlate taxonomic information of sequences with their assigned functions, it appeared that Firmicutes, followed by Bacteroidetes and Proteobacteria, dominate within the functional context of polysaccharide degradation whereas Methanomicrobiales represent the most abundant taxonomic group responsible for methane production. Clostridia is the most important class involved in the reductive CoA pathway (Wood-Ljungdahl pathway) that is characteristic for acetogenesis. Based on binning of 16S rRNA-specific sequences allocated to the dominant genus Methanoculleus, it could be shown that this genus is represented by several different species. 
Phylogenetic analysis of these sequences placed them in close proximity to the hydrogenotrophic methanogen Methanoculleus bourgensis. While rarefaction analyses still indicate incomplete coverage, examination of the GS FLX Titanium dataset resulted in the identification of additional genera and functional elements, providing a far more complete coverage of the community involved in anaerobic fermentative pathways leading to methane formation. PMID:21297863

  11. LDsplit: screening for cis-regulatory motifs stimulating meiotic recombination hotspots by analysis of DNA sequence polymorphisms.

    PubMed

    Yang, Peng; Wu, Min; Guo, Jing; Kwoh, Chee Keong; Przytycka, Teresa M; Zheng, Jie

    2014-02-17

    As a fundamental genomic element, meiotic recombination hotspot plays important roles in life sciences. Thus uncovering its regulatory mechanisms has broad impact on biomedical research. Despite the recent identification of the zinc finger protein PRDM9 and its 13-mer binding motif as major regulators for meiotic recombination hotspots, other regulators remain to be discovered. Existing methods for finding DNA sequence motifs of recombination hotspots often rely on the enrichment of co-localizations between hotspots and short DNA patterns, which ignore the cross-individual variation of recombination rates and sequence polymorphisms in the population. Our objective in this paper is to capture signals encoded in genetic variations for the discovery of recombination-associated DNA motifs. Recently, an algorithm called "LDsplit" has been designed to detect the association between single nucleotide polymorphisms (SNPs) and proximal meiotic recombination hotspots. The association is measured by the difference of population recombination rates at a hotspot between two alleles of a candidate SNP. Here we present an open source software tool of LDsplit, with integrative data visualization for recombination hotspots and their proximal SNPs. Applying LDsplit on SNPs inside an established 7-mer motif bound by PRDM9 we observed that SNP alleles preserving the original motif tend to have higher recombination rates than the opposite alleles that disrupt the motif. Running on SNP windows around hotspots each containing an occurrence of the 7-mer motif, LDsplit is able to guide the established motif finding algorithm of MEME to recover the 7-mer motif. In contrast, without LDsplit the 7-mer motif could not be identified. LDsplit is a software tool for the discovery of cis-regulatory DNA sequence motifs stimulating meiotic recombination hotspots by screening and narrowing down to hotspot associated SNPs. 
It is the first computational method that utilizes the genetic variation of recombination hotspots among individuals, opening a new avenue for motif finding. Tested on an established motif and simulated datasets, LDsplit shows promise to discover novel DNA motifs for meiotic recombination hotspots.

  12. LDsplit: screening for cis-regulatory motifs stimulating meiotic recombination hotspots by analysis of DNA sequence polymorphisms

    PubMed Central

    2014-01-01

    Background As a fundamental genomic element, meiotic recombination hotspot plays important roles in life sciences. Thus uncovering its regulatory mechanisms has broad impact on biomedical research. Despite the recent identification of the zinc finger protein PRDM9 and its 13-mer binding motif as major regulators for meiotic recombination hotspots, other regulators remain to be discovered. Existing methods for finding DNA sequence motifs of recombination hotspots often rely on the enrichment of co-localizations between hotspots and short DNA patterns, which ignore the cross-individual variation of recombination rates and sequence polymorphisms in the population. Our objective in this paper is to capture signals encoded in genetic variations for the discovery of recombination-associated DNA motifs. Results Recently, an algorithm called “LDsplit” has been designed to detect the association between single nucleotide polymorphisms (SNPs) and proximal meiotic recombination hotspots. The association is measured by the difference of population recombination rates at a hotspot between two alleles of a candidate SNP. Here we present an open source software tool of LDsplit, with integrative data visualization for recombination hotspots and their proximal SNPs. Applying LDsplit on SNPs inside an established 7-mer motif bound by PRDM9 we observed that SNP alleles preserving the original motif tend to have higher recombination rates than the opposite alleles that disrupt the motif. Running on SNP windows around hotspots each containing an occurrence of the 7-mer motif, LDsplit is able to guide the established motif finding algorithm of MEME to recover the 7-mer motif. In contrast, without LDsplit the 7-mer motif could not be identified. Conclusions LDsplit is a software tool for the discovery of cis-regulatory DNA sequence motifs stimulating meiotic recombination hotspots by screening and narrowing down to hotspot associated SNPs. 
It is the first computational method that utilizes the genetic variation of recombination hotspots among individuals, opening a new avenue for motif finding. Tested on an established motif and simulated datasets, LDsplit shows promise to discover novel DNA motifs for meiotic recombination hotspots. PMID:24533858

  13. Simultaneously learning DNA motif along with its position and sequence rank preferences through expectation maximization algorithm.

    PubMed

    Zhang, ZhiZhuo; Chang, Cheng Wei; Hugo, Willy; Cheung, Edwin; Sung, Wing-Kin

    2013-03-01

    Although de novo motifs can be discovered through mining over-represented sequence patterns, this approach misses some real motifs and generates many false positives. To improve accuracy, one solution is to consider some additional binding features (i.e., position preference and sequence rank preference). This information is usually required from the user. This article presents a de novo motif discovery algorithm called SEME (sampling with expectation maximization for motif elicitation), which uses pure probabilistic mixture model to model the motif's binding features and uses expectation maximization (EM) algorithms to simultaneously learn the sequence motif, position, and sequence rank preferences without asking for any prior knowledge from the user. SEME is both efficient and accurate thanks to two important techniques: the variable motif length extension and importance sampling. Using 75 large-scale synthetic datasets, 32 metazoan compendium benchmark datasets, and 164 chromatin immunoprecipitation sequencing (ChIP-Seq) libraries, we demonstrated the superior performance of SEME over existing programs in finding transcription factor (TF) binding sites. SEME is further applied to a more difficult problem of finding the co-regulated TF (coTF) motifs in 15 ChIP-Seq libraries. It identified significantly more correct coTF motifs and, at the same time, predicted coTF motifs with better matching to the known motifs. Finally, we show that the learned position and sequence rank preferences of each coTF reveals potential interaction mechanisms between the primary TF and the coTF within these sites. Some of these findings were further validated by the ChIP-Seq experiments of the coTFs. The application is available online.

  14. A New Phylogeographic Pattern of Endemic Bufo bankorensis in Taiwan Island Is Attributed to the Genetic Variation of Populations

    PubMed Central

    Yu, Teng-Lang; Lin, Hung-Du; Weng, Ching-Feng

    2014-01-01

    Aim To comprehend the phylogeographic patterns of genetic variation in anurans at Taiwan Island, this study attempted to examine (1) the existence of various geological barriers (Central Mountain Ranges, CMRs); and (2) the genetic variation of Bufo bankorensis using mtDNA sequences among populations located in different regions of Taiwan, characterized by different climates and existing under extreme conditions when compared available sequences of related species B. gargarizans of mainland China. Methodology/Principal Findings Phylogenetic analyses of the dataset with mitochondrial DNA (mtDNA) D-loop gene (348 bp) recovered a close relationship between B. bankorensis and B. gargarizans, identified three distinct lineages. Furthermore, the network of mtDNA D-loop gene (564 bp) amplified (279 individuals, 27 localities) from Taiwan Island indicated three divergent clades within B. bankorensis (Clade W, E and S), corresponding to the geography, thereby verifying the importance of the CMRs and Kaoping River drainage as major biogeographic barriers. Mismatch distribution analysis, neutrality tests and Bayesian skyline plots revealed that a significant population expansion occurred for the total population and Clade W, with horizons dated to approximately 0.08 and 0.07 Mya, respectively. These results suggest that the population expansion of Taiwan Island species B. bankorensis might have resulted from the release of available habitat in post-glacial periods, the genetic variation on mtDNA showing habitat selection, subsequent population dispersal, and co-distribution among clades. Conclusions The multiple origins (different clades) of B. bankorensis mtDNA sequences were first evident in this study. The divergent genetic clades found within B. bankorensis could be independent colonization by previously diverged lineages; inferring B. bankorensis originated from B. gargarizans of mainland China, then dispersal followed by isolation within Taiwan Island. 
    The highly divergent clades W and E of B. bankorensis imply that the CMRs serve as a genetic barrier, separating the whole island into western and eastern phylogroups. PMID:24853679

  15. HAlign-II: efficient ultra-large multiple sequence alignment and phylogenetic tree reconstruction with distributed and parallel computing.

    PubMed

    Wan, Shixiang; Zou, Quan

    2017-01-01

    Multiple sequence alignment (MSA) plays a key role in biological sequence analyses, especially in phylogenetic tree construction. The extreme growth of next-generation sequencing has led to a shortage of efficient ultra-large biological sequence alignment approaches for coping with different sequence types. Distributed and parallel computing represents a crucial technique for accelerating ultra-large (e.g. files more than 1 GB) sequence analyses. Based on HAlign and the Spark distributed computing system, we implement a highly cost-efficient and time-efficient HAlign-II tool to address ultra-large multiple biological sequence alignment and phylogenetic tree construction. Experiments on large-scale DNA and protein data sets (files larger than 1 GB) showed that HAlign-II could save time and space. It outperformed the current software tools. HAlign-II can efficiently carry out MSA and construct phylogenetic trees with ultra-large numbers of biological sequences. HAlign-II shows extremely high memory efficiency and scales well with increases in computing resources. HAlign-II provides a user-friendly web server based on our distributed computing infrastructure. HAlign-II with open-source codes and datasets was established at http://lab.malab.cn/soft/halign.

  16. A new method for species identification via protein-coding and non-coding DNA barcodes by combining machine learning with bioinformatic methods.

    PubMed

    Zhang, Ai-bing; Feng, Jie; Ward, Robert D; Wan, Ping; Gao, Qiang; Wu, Jun; Zhao, Wei-zhong

    2012-01-01

    Species identification via DNA barcodes is contributing greatly to current bioinventory efforts. The initial, and widely accepted, proposal was to use the protein-coding cytochrome c oxidase subunit I (COI) region as the standard barcode for animals, but recently non-coding internal transcribed spacer (ITS) genes have been proposed as candidate barcodes for both animals and plants. However, achieving a robust alignment for non-coding regions can be problematic. Here we propose two new methods (DV-RBF and FJ-RBF) to address this issue for species assignment by both coding and non-coding sequences that take advantage of the power of machine learning and bioinformatics. We demonstrate the value of the new methods with four empirical datasets, two representing typical protein-coding COI barcode datasets (neotropical bats and marine fish) and two representing non-coding ITS barcodes (rust fungi and brown algae). Using two random sub-sampling approaches, we demonstrate that the new methods significantly outperformed existing Neighbor-joining (NJ) and Maximum likelihood (ML) methods for both coding and non-coding barcodes when there was complete species coverage in the reference dataset. The new methods also out-performed NJ and ML methods for non-coding sequences in circumstances of potentially incomplete species coverage, although then the NJ and ML methods performed slightly better than the new methods for protein-coding barcodes. A 100% success rate of species identification was achieved with the two new methods for 4,122 bat queries and 5,134 fish queries using COI barcodes, with 95% confidence intervals (CI) of 99.75-100%. The new methods also obtained a 96.29% success rate (95%CI: 91.62-98.40%) for 484 rust fungi queries and a 98.50% success rate (95%CI: 96.60-99.37%) for 1094 brown algae queries, both using ITS barcodes.

  17. Characterization of the complete mitochondrial genomes of two whipworms Trichuris ovis and Trichuris discolor (Nematoda: Trichuridae).

    PubMed

    Liu, Guo-Hua; Wang, Yan; Xu, Min-Jun; Zhou, Dong-Hui; Ye, Yong-Gang; Li, Jia-Yuan; Song, Hui-Qun; Lin, Rui-Qing; Zhu, Xing-Quan

    2012-12-01

    For many years, whipworms (Trichuris spp.) have been described with a relatively narrow range of both morphological and biometrical features. Moreover, there has been insufficient discrimination between congeners (or closely related species). In the present study, we determined the complete mitochondrial (mt) genomes of two whipworms Trichuris ovis and Trichuris discolor, compared them and then tested the hypothesis that T. ovis and T. discolor are distinct species by phylogenetic analyses (using Bayesian inference, maximum likelihood and maximum parsimony) based on the deduced amino acid sequences of the mt protein-coding genes. The complete mt genomes of T. ovis and T. discolor were 13,946 bp and 13,904 bp in size, respectively. Both mt genomes are circular, and consist of 37 genes, including 13 genes coding for proteins, 2 genes for rRNA, and 22 genes for tRNA. The gene content and arrangement are identical to that of human and pig whipworms Trichuris trichiura and Trichuris suis. Taken together, these analyses showed genetic distinctiveness and strongly supported the recent proposal that T. ovis and T. discolor are distinct species using nuclear ribosomal DNA and a portion of the mtDNA sequence dataset. The availability of the complete mtDNA sequences of T. ovis and T. discolor provides novel genetic markers for studying the population genetics, diagnostics and molecular epidemiology of T. ovis and T. discolor. Copyright © 2012 Elsevier B.V. All rights reserved.

  18. Back to BaySICS: a user-friendly program for Bayesian Statistical Inference from Coalescent Simulations.

    PubMed

    Sandoval-Castellanos, Edson; Palkopoulou, Eleftheria; Dalén, Love

    2014-01-01

    Inference of population demographic history has vastly improved in recent years due to a number of technological and theoretical advances including the use of ancient DNA. Approximate Bayesian computation (ABC) stands among the most promising methods due to its simple theoretical fundament and exceptional flexibility. However, limited availability of user-friendly programs that perform ABC analysis renders it difficult to implement, and hence programming skills are frequently required. In addition, there is limited availability of programs able to deal with heterochronous data. Here we present the software BaySICS: Bayesian Statistical Inference of Coalescent Simulations. BaySICS provides an integrated and user-friendly platform that performs ABC analyses by means of coalescent simulations from DNA sequence data. It estimates historical demographic population parameters and performs hypothesis testing by means of Bayes factors obtained from model comparisons. Although providing specific features that improve inference from datasets with heterochronous data, BaySICS also has several capabilities making it a suitable tool for analysing contemporary genetic datasets. Those capabilities include joint analysis of independent tables, a graphical interface and the implementation of Markov-chain Monte Carlo without likelihoods.

  19. Analysis of Parallel Algorithms on SMP Node and Cluster of Workstations Using Parallel Programming Models with New Tile-based Method for Large Biological Datasets.

    PubMed

    Shrimankar, D D; Sathe, S R

    2016-01-01

    Sequence alignment is an important tool for describing the relationships between DNA sequences. Many sequence alignment algorithms exist, differing in efficiency, in their models of the sequences, and in the relationship between sequences. The focus of this study is to obtain an optimal alignment between two sequences of biological data, particularly DNA sequences. The algorithm is discussed with particular emphasis on time, speedup, and efficiency optimizations. Parallel programming presents a number of critical challenges to application developers. Today's supercomputer often consists of clusters of SMP nodes. Programming paradigms such as OpenMP and MPI are used to write parallel codes for such architectures. However, OpenMP programs cannot scale beyond a single SMP node, whereas programs written in MPI can span multiple SMP nodes at the cost of internode communication overhead. In this work, we explore the tradeoffs between using OpenMP and MPI. We demonstrate that communication overhead is significant even in OpenMP loop execution and increases with the number of cores participating. We also demonstrate a communication model to approximate the overhead from communication in OpenMP loops. Our results are astonishing and interesting to a large variety of input data files. We have developed our own load balancing and cache optimization technique for message passing model. Our experimental results show that our own developed techniques give optimum performance of our parallel algorithm for various sizes of input parameter, such as sequence size and tile size, on a wide variety of multicore architectures.
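
    As a point of reference for the optimal pairwise alignment discussed above, here is a minimal sketch of the classic Needleman-Wunsch global alignment score. It is not the authors' tile-based parallel implementation; the function name and the scoring values (match, mismatch, gap) are illustrative assumptions.

```python
def global_alignment_score(a: str, b: str, match: int = 1,
                           mismatch: int = -1, gap: int = -2) -> int:
    """Optimal global alignment score (Needleman-Wunsch), keeping two DP rows.

    The scoring parameters are illustrative defaults, not values from the paper.
    """
    # Row 0: aligning the empty prefix of `a` against prefixes of `b` costs gaps only.
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap] + [0] * len(b)
        for j in range(1, len(b) + 1):
            # Each cell depends only on its diagonal, upper and left neighbours.
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr[j] = max(diag, prev[j] + gap, curr[j - 1] + gap)
        prev = curr
    return prev[-1]


if __name__ == "__main__":
    print(global_alignment_score("GATTACA", "GATTACA"))
    print(global_alignment_score("ACGT", "AGT"))
```

    Because every cell depends only on its left, upper and diagonal neighbours, blocks of the matrix that lie on the same anti-diagonal can be computed independently. This is the property that tiled or wavefront parallel schedules generally exploit; the abstract does not describe the authors' exact scheme.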

  20. Analysis of Parallel Algorithms on SMP Node and Cluster of Workstations Using Parallel Programming Models with New Tile-based Method for Large Biological Datasets

    PubMed Central

    Shrimankar, D. D.; Sathe, S. R.

    2016-01-01

    Sequence alignment is an important tool for describing the relationships between DNA sequences. Many sequence alignment algorithms exist, differing in efficiency, in their models of the sequences, and in the relationship between sequences. The focus of this study is to obtain an optimal alignment between two sequences of biological data, particularly DNA sequences. The algorithm is discussed with particular emphasis on time, speedup, and efficiency optimizations. Parallel programming presents a number of critical challenges to application developers. Today’s supercomputer often consists of clusters of SMP nodes. Programming paradigms such as OpenMP and MPI are used to write parallel codes for such architectures. However, OpenMP programs cannot scale beyond a single SMP node, whereas programs written in MPI can span multiple SMP nodes at the cost of internode communication overhead. In this work, we explore the tradeoffs between using OpenMP and MPI. We demonstrate that communication overhead is significant even in OpenMP loop execution and increases with the number of cores participating. We also demonstrate a communication model to approximate the overhead from communication in OpenMP loops. Our results are astonishing and interesting to a large variety of input data files. We have developed our own load balancing and cache optimization technique for message passing model. Our experimental results show that our own developed techniques give optimum performance of our parallel algorithm for various sizes of input parameter, such as sequence size and tile size, on a wide variety of multicore architectures. PMID:27932868

  1. Metavisitor, a Suite of Galaxy Tools for Simple and Rapid Detection and Discovery of Viruses in Deep Sequence Data

    PubMed Central

    Vernick, Kenneth D.

    2017-01-01

    Metavisitor is a software package that allows biologists and clinicians without specialized bioinformatics expertise to detect and assemble viral genomes from deep sequence datasets. The package is composed of a set of modular bioinformatic tools and workflows that are implemented in the Galaxy framework. Using the graphical Galaxy workflow editor, users with minimal computational skills can use existing Metavisitor workflows or adapt them to suit specific needs by adding or modifying analysis modules. Metavisitor works with DNA, RNA or small RNA sequencing data over a range of read lengths and can use a combination of de novo and guided approaches to assemble genomes from sequencing reads. We show that the software has the potential for quick diagnosis as well as discovery of viruses from a vast array of organisms. Importantly, we provide here executable Metavisitor use cases, which increase the accessibility and transparency of the software, ultimately enabling biologists or clinicians to focus on biological or medical questions. PMID:28045932

  2. MToolBox: a highly automated pipeline for heteroplasmy annotation and prioritization analysis of human mitochondrial variants in high-throughput sequencing

    PubMed Central

    Diroma, Maria Angela; Santorsola, Mariangela; Guttà, Cristiano; Gasparre, Giuseppe; Picardi, Ernesto; Pesole, Graziano; Attimonelli, Marcella

    2014-01-01

    Motivation: The increasing availability of mitochondria-targeted and off-target sequencing data in whole-exome and whole-genome sequencing studies (WXS and WGS) has raised the demand for effective pipelines to accurately measure heteroplasmy and to easily recognize the most functionally important mitochondrial variants among a huge number of candidates. To this purpose, we developed MToolBox, a highly automated pipeline to reconstruct and analyze human mitochondrial DNA from high-throughput sequencing data. Results: MToolBox implements an effective computational strategy for mitochondrial genome assembly and haplogroup assignment, also including a prioritization analysis of detected variants. MToolBox provides a Variant Call Format file featuring, for the first time, allele-specific heteroplasmy and annotation files with prioritized variants. MToolBox was tested on simulated samples and applied on 1000 Genomes WXS datasets. Availability and implementation: MToolBox package is available at https://sourceforge.net/projects/mtoolbox/. Contact: marcella.attimonelli@uniba.it Supplementary information: Supplementary data are available at Bioinformatics online. PMID:25028726

  3. Isolating a functionally relevant guild of fungi from the root microbiome of Populus

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Bonito, Gregory; Hameed, Khalid; Ventura, Rafael

    Plant roots interact with a bewilderingly complex community of microbes, including root-associated fungi that are essential for maintaining plant health. To improve understanding of the diversity of fungi in the rhizobiome of Populus deltoides, Populus trichocarpa and co-occurring plant hosts Quercus alba and Pinus taeda, we conducted field and greenhouse studies and sampled, isolated, and characterized the diversity of culturable root-associated fungi on these hosts. Using both general and selective isolation media we obtained more than 1800 fungal isolates from individual surface sterilized root tips. Sequences from the ITS and/or D1–D2 regions of the LSU rDNA were obtained from 1042 of the >1800 pure culture isolates and were compared to accessions in the NCBI nucleotide database and analyzed through phylogenetics for preliminary taxonomic identification. Sequences from these isolates were also compared to 454 sequence datasets obtained directly from the Populus rhizosphere and endosphere. Although most of the ectomycorrhizal taxa known to associate with Populus evaded isolation, many of the abundant sequence types from rhizosphere and endosphere 454 datasets were isolated, including novel species belonging to the Atractiellales. Isolation and identification of key endorrhizal fungi will enable more targeted study of plant-fungal interactions. Genome sequencing is currently underway for a subset of our culture library with the aim of understanding the mechanisms involved in host-endophyte establishment and function. As a result, this diverse culture library of fungal root associates will be a valuable resource for metagenomic research, experimentation and further studies on plant-fungal interactions.

  4. Isolating a functionally relevant guild of fungi from the root microbiome of Populus

    DOE PAGES

    Bonito, Gregory; Hameed, Khalid; Ventura, Rafael; ...

    2016-05-27

    Plant roots interact with a bewilderingly complex community of microbes, including root-associated fungi that are essential for maintaining plant health. To improve understanding of the diversity of fungi in the rhizobiome of Populus deltoides, Populus trichocarpa and co-occurring plant hosts Quercus alba and Pinus taeda, we conducted field and greenhouse studies and sampled, isolated, and characterized the diversity of culturable root-associated fungi on these hosts. Using both general and selective isolation media we obtained more than 1800 fungal isolates from individual surface sterilized root tips. Sequences from the ITS and/or D1–D2 regions of the LSU rDNA were obtained from 1042 of the >1800 pure culture isolates and were compared to accessions in the NCBI nucleotide database and analyzed through phylogenetics for preliminary taxonomic identification. Sequences from these isolates were also compared to 454 sequence datasets obtained directly from the Populus rhizosphere and endosphere. Although most of the ectomycorrhizal taxa known to associate with Populus evaded isolation, many of the abundant sequence types from rhizosphere and endosphere 454 datasets were isolated, including novel species belonging to the Atractiellales. Isolation and identification of key endorrhizal fungi will enable more targeted study of plant-fungal interactions. Genome sequencing is currently underway for a subset of our culture library with the aim of understanding the mechanisms involved in host-endophyte establishment and function. As a result, this diverse culture library of fungal root associates will be a valuable resource for metagenomic research, experimentation and further studies on plant-fungal interactions.

  5. Chromatin accessibility prediction via a hybrid deep convolutional neural network.

    PubMed

    Liu, Qiao; Xia, Fei; Yin, Qijin; Jiang, Rui

    2018-03-01

    A majority of known genetic variants associated with human-inherited diseases lie in non-coding regions that lack adequate interpretation, making it indispensable to systematically discover functional sites at the whole genome level and precisely decipher their implications in a comprehensive manner. Although computational approaches have been complementing high-throughput biological experiments towards the annotation of the human genome, it still remains a big challenge to accurately annotate regulatory elements in the context of a specific cell type via automatic learning of the DNA sequence code from large-scale sequencing data. Indeed, the development of an accurate and interpretable model to learn the DNA sequence signature and further enable the identification of causative genetic variants has become essential in both genomic and genetic studies. We proposed Deopen, a hybrid framework mainly based on a deep convolutional neural network, to automatically learn the regulatory code of DNA sequences and predict chromatin accessibility. In a series of comparison with existing methods, we show the superior performance of our model in not only the classification of accessible regions against background sequences sampled at random, but also the regression of DNase-seq signals. Besides, we further visualize the convolutional kernels and show the match of identified sequence signatures and known motifs. We finally demonstrate the sensitivity of our model in finding causative noncoding variants in the analysis of a breast cancer dataset. We expect to see wide applications of Deopen with either public or in-house chromatin accessibility data in the annotation of the human genome and the identification of non-coding variants associated with diseases. Deopen is freely available at https://github.com/kimmo1019/Deopen. ruijiang@tsinghua.edu.cn. Supplementary data are available at Bioinformatics online. © The Author (2017). Published by Oxford University Press. 

  6. RIEMS: a software pipeline for sensitive and comprehensive taxonomic classification of reads from metagenomics datasets.

    PubMed

    Scheuch, Matthias; Höper, Dirk; Beer, Martin

    2015-03-03

    Fuelled by the advent and subsequent development of next generation sequencing technologies, metagenomics became a powerful tool for the analysis of microbial communities both scientifically and diagnostically. The biggest challenge is the extraction of relevant information from the huge sequence datasets generated for metagenomics studies. Although a plethora of tools are available, data analysis is still a bottleneck. To overcome the bottleneck of data analysis, we developed an automated computational workflow called RIEMS - Reliable Information Extraction from Metagenomic Sequence datasets. RIEMS assigns every individual read sequence within a dataset taxonomically by cascading different sequence analyses with decreasing stringency of the assignments using various software applications. After completion of the analyses, the results are summarised in a clearly structured result protocol organised taxonomically. The high accuracy and performance of RIEMS analyses were proven in comparison with other tools for metagenomics data analysis using simulated sequencing read datasets. RIEMS has the potential to fill the gap that still exists with regard to data analysis for metagenomics studies. The usefulness and power of RIEMS for the analysis of genuine sequencing datasets was demonstrated with an early version of RIEMS in 2011 when it was used to detect the orthobunyavirus sequences leading to the discovery of Schmallenberg virus.

  7. Aberrant DNA methylation of WNT pathway genes in the development and progression of CIMP-negative colorectal cancer.

    PubMed

    Galamb, Orsolya; Kalmár, Alexandra; Péterfia, Bálint; Csabai, István; Bodor, András; Ribli, Dezső; Krenács, Tibor; Patai, Árpád V; Wichmann, Barnabás; Barták, Barbara Kinga; Tóth, Kinga; Valcz, Gábor; Spisák, Sándor; Tulassay, Zsolt; Molnár, Béla

    2016-08-02

    The WNT signaling pathway has an essential role in colorectal carcinogenesis and progression, which involves a cascade of genetic and epigenetic changes. We aimed to analyze DNA methylation affecting the WNT pathway genes in colorectal carcinogenesis in promoter and gene body regions using whole methylome analysis in 9 colorectal cancer, 15 adenoma, and 6 normal tumor adjacent tissue (NAT) samples by methyl capture sequencing. Functional methylation was confirmed on 5-aza-2'-deoxycytidine-treated colorectal cancer cell line datasets. In parallel with the DNA methylation analysis, mutations of WNT pathway genes (APC, β-catenin/CTNNB1) were analyzed by 454 sequencing on GS Junior platform. Most differentially methylated CpG sites were localized in gene body regions (95% of WNT pathway genes). In the promoter regions, 33 of the 160 analyzed WNT pathway genes were differentially methylated in colorectal cancer vs. normal, including hypermethylated AXIN2, CHP1, PRICKLE1, SFRP1, SFRP2, SOX17, and hypomethylated CACYBP, CTNNB1, MYC; 44 genes in adenoma vs. NAT; and 41 genes in colorectal cancer vs. adenoma comparisons. Hypermethylation of AXIN2, DKK1, VANGL1, and WNT5A gene promoters was higher, while those of SOX17, PRICKLE1, DAAM2, and MYC was lower in colon carcinoma compared to adenoma. Inverse correlation between expression and methylation was confirmed in 23 genes, including APC, CHP1, PRICKLE1, PSEN1, and SFRP1. Differential methylation affected both canonical and noncanonical WNT pathway genes in colorectal normal-adenoma-carcinoma sequence. Aberrant DNA methylation appears already in adenomas as an early event of colorectal carcinogenesis.

  8. Rare k-mer DNA: Identification of sequence motifs and prediction of CpG island and promoter.

    PubMed

    Mohamed Hashim, Ezzeddin Kamil; Abdullah, Rosni

    2015-12-21

    Empirical analysis of k-mer DNA has proven to be an effective tool for finding unique patterns in DNA sequences, which can lead to the discovery of potential sequence motifs. In an extensive study of empirical k-mer DNA on hundreds of organisms, the researchers found that unique multi-modal k-mer spectra occur only in the genomes of organisms from the tetrapod clade, which includes all mammals. The multi-modality is caused by the formation of the two lowest modes, where k-mers under them are referred to as the rare k-mers. The suppression of the two lowest modes (or the rare k-mers) can be attributed to the CG dinucleotide inclusions in them. Apart from that, the rare k-mers are selectively distributed in certain genomic features of CpG Island (CGI), promoter, 5' UTR, and exon. We correlated the rare k-mers with hundreds of annotated features using several bioinformatic tools, performed further intrinsic rare k-mer analyses within the correlated features, and modeled the elucidated rare k-mer clustering feature into a classifier to predict the correlated CGI and promoter features. Our correlation results show that rare k-mers are highly associated with several annotated features of CGI, promoter, 5' UTR, and open chromatin regions. Our intrinsic results show that rare k-mers have several unique topological, compositional, and clustering properties in CGI and promoter features. Finally, the performances of our RWC (rare-word clustering) method in predicting the CGI and promoter features are ranked among the top three, in eight of the CGI and promoter evaluations, among eight of the benchmarked datasets. Crown Copyright © 2015. Published by Elsevier Ltd. All rights reserved.
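
    To make the notion of a k-mer spectrum concrete, the following minimal sketch counts k-mers in a sequence and then tallies how many distinct k-mers occur at each frequency. The function names are illustrative assumptions, and the sketch does not reproduce the RWC classifier or the authors' handling of strands or genome-scale data.

```python
from collections import Counter

DNA = set("ACGT")


def kmer_counts(seq: str, k: int) -> Counter:
    """Count every overlapping k-mer made up only of A, C, G and T."""
    seq = seq.upper()
    windows = (seq[i:i + k] for i in range(len(seq) - k + 1))
    # Windows containing ambiguous bases such as N are skipped.
    return Counter(w for w in windows if set(w) <= DNA)


def kmer_spectrum(counts: Counter) -> Counter:
    """Map an occurrence count to the number of distinct k-mers with that count."""
    return Counter(counts.values())


if __name__ == "__main__":
    counts = kmer_counts("ACGACG", 3)
    print(counts)
    print(kmer_spectrum(counts))
```

    Plotting the spectrum (occurrence count on the x-axis, number of distinct k-mers on the y-axis) is the usual way to look for the modes the abstract describes; k-mers falling under the lowest modes would be the rare ones.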

  9. Phylogenetic screening of a bacterial, metagenomic library using homing endonuclease restriction and marker insertion

    PubMed Central

    Yung, Pui Yi; Burke, Catherine; Lewis, Matt; Egan, Suhelen; Kjelleberg, Staffan; Thomas, Torsten

    2009-01-01

    Metagenomics provides access to the uncultured majority of the microbial world. The approaches employed in this field have, however, had limited success in linking functional genes to the taxonomic or phylogenetic origin of the organism they belong to. Here we present an efficient strategy to recover environmental DNA fragments that contain phylogenetic marker genes from metagenomic libraries. Our method involves the cleavage of 23S ribosomal RNA (rRNA) genes within pooled library clones by the homing endonuclease I-CeuI followed by the insertion and selection of an antibiotic resistance cassette. This approach was applied to screen a library of 6500 fosmid clones derived from the microbial community associated with the sponge Cymbastela concentrica. Several fosmid clones were recovered after the screen and detailed phylogenetic and taxonomic assignment based on the rRNA gene showed that they belong to previously unknown organisms. In addition, compositional features of these fosmid clones were used to classify and taxonomically assign a dataset of environmental shotgun sequences. Our approach represents a valuable tool for the analysis of rapidly increasing, environmental DNA sequencing information. PMID:19767618

  10. Deconvoluting simulated metagenomes: the performance of hard- and soft- clustering algorithms applied to metagenomic chromosome conformation capture (3C)

    PubMed Central

    DeMaere, Matthew Z.

    2016-01-01

    Background Chromosome conformation capture, coupled with high throughput DNA sequencing in protocols like Hi-C and 3C-seq, has been proposed as a viable means of generating data to resolve the genomes of microorganisms living in naturally occurring environments. Metagenomic Hi-C and 3C-seq datasets have begun to emerge, but the feasibility of resolving genomes when closely related organisms (strain-level diversity) are present in the sample has not yet been systematically characterised. Methods We developed a computational simulation pipeline for metagenomic 3C and Hi-C sequencing to evaluate the accuracy of genomic reconstructions at, above, and below an operationally defined species boundary. We simulated datasets and measured accuracy over a wide range of parameters. Five clustering algorithms were evaluated (2 hard, 3 soft) using an adaptation of the extended B-cubed validation measure. Results When all genomes in a sample are below 95% sequence identity, all of the tested clustering algorithms performed well. When sequence data contains genomes above 95% identity (our operational definition of strain-level diversity), a naive soft-clustering extension of the Louvain method achieves the highest performance. Discussion Previously, only hard-clustering algorithms have been applied to metagenomic 3C and Hi-C data, yet none of these perform well when strain-level diversity exists in a metagenomic sample. Our simple extension of the Louvain method performed the best in these scenarios; however, accuracy remained well below the levels observed for samples without strain-level diversity. Strain resolution is also highly dependent on the amount of available 3C sequence data, suggesting that depth of sequencing must be carefully considered during experimental design. Finally, there appears to be great scope to improve the accuracy of strain resolution through further algorithm development. PMID:27843713
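
    For readers unfamiliar with B-cubed validation, the sketch below computes the basic B-cubed precision and recall for a hard clustering, where each item has exactly one predicted cluster and one true class. This is an assumption-laden simplification: the study uses an adaptation of the extended measure, which also handles overlapping (soft) clusters and is not reproduced here. The function name and the input layout (two dicts keyed by item) are hypothetical.

```python
def bcubed(pred: dict, truth: dict) -> tuple:
    """Basic B-cubed precision and recall for a hard clustering.

    `pred` maps each item to its predicted cluster, `truth` to its true class.
    Both dicts are assumed to share the same keys.
    """
    items = list(truth)
    precision = recall = 0.0
    for i in items:
        same_cluster = [j for j in items if pred[j] == pred[i]]
        same_class = [j for j in items if truth[j] == truth[i]]
        # Items that share both the cluster and the true class with item i.
        correct = sum(1 for j in same_cluster if truth[j] == truth[i])
        precision += correct / len(same_cluster)
        recall += correct / len(same_class)
    n = len(items)
    return precision / n, recall / n


if __name__ == "__main__":
    truth = {"c1": "genomeA", "c2": "genomeA", "c3": "genomeB", "c4": "genomeB"}
    pred = {"c1": 0, "c2": 0, "c3": 0, "c4": 1}
    print(bcubed(pred, truth))
```

    In the usage example, one contig from the second genome is merged into the first genome's cluster, which lowers precision, while splitting the second genome across two clusters lowers recall.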

  11. COPS: Detecting Co-Occurrence and Spatial Arrangement of Transcription Factor Binding Motifs in Genome-Wide Datasets

    PubMed Central

    Lohmann, Ingrid

    2012-01-01

    In multi-cellular organisms, spatiotemporal activity of cis-regulatory DNA elements depends on their occupancy by different transcription factors (TFs). In recent years, genome-wide ChIP-on-Chip, ChIP-Seq and DamID assays have been extensively used to unravel the combinatorial interaction of TFs with cis-regulatory modules (CRMs) in the genome. Even though genome-wide binding profiles are increasingly becoming available for different TFs, single TF binding profiles are in most cases not sufficient for dissecting complex regulatory networks. Thus, potent computational tools detecting statistically significant and biologically relevant TF-motif co-occurrences in genome-wide datasets are essential for analyzing context-dependent transcriptional regulation. We have developed COPS (Co-Occurrence Pattern Search), a new bioinformatics tool based on a combination of association rules and Markov chain models, which detects co-occurring TF binding sites (BSs) on genomic regions of interest. COPS scans DNA sequences for frequent motif patterns using a Frequent-Pattern tree based data mining approach, which allows efficient performance of the software with respect to both data structure and implementation speed, in particular when mining large datasets. Since transcriptional gene regulation very often relies on the formation of regulatory protein complexes mediated by closely adjoining TF binding sites on CRMs, COPS additionally detects preferred short distance between co-occurring TF motifs. The performance of our software with respect to biological significance was evaluated using three published datasets containing genomic regions that are independently bound by several TFs involved in a defined biological process. In sum, COPS is a fast, efficient and user-friendly tool mining statistically and biologically significant TFBS co-occurrences and therefore allows the identification of TFs that combinatorially regulate gene expression. PMID:23272209
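
    The idea of counting motif co-occurrences and the distances between them can be illustrated with a deliberately naive sketch. It uses exact string matching on a single strand, whereas COPS itself relies on a Frequent-Pattern tree and Markov chain models; all function names, the example motifs and the `max_dist` cutoff are assumptions made for illustration only.

```python
from collections import Counter


def find_sites(seq: str, motif: str) -> list:
    """Start positions of every (possibly overlapping) exact match of motif."""
    sites, start = [], seq.find(motif)
    while start != -1:
        sites.append(start)
        start = seq.find(motif, start + 1)
    return sites


def cooccurrence_distances(regions, motif_a: str, motif_b: str, max_dist: int = 50):
    """Count regions holding both motifs within max_dist and tally the distances.

    Distances are measured between motif start positions.
    """
    n_regions = 0
    distances = Counter()
    for seq in regions:
        pairs = [abs(a - b)
                 for a in find_sites(seq, motif_a)
                 for b in find_sites(seq, motif_b)
                 if abs(a - b) <= max_dist]
        if pairs:
            n_regions += 1
            distances.update(pairs)
    return n_regions, distances


if __name__ == "__main__":
    regions = ["TTGACGTAAACAGGTGTT", "GGGGGGGG", "TGACGCAGGTG"]
    print(cooccurrence_distances(regions, "TGACG", "CAGGTG"))
```

    A peak in the resulting distance tally would hint at a preferred spacing between the two motifs; judging whether such a peak is statistically significant needs a background model, which this sketch omits.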

  12. Functional metagenomics of oil-impacted mangrove sediments reveals high abundance of hydrolases of biotechnological interest.

    PubMed

    Ottoni, Júlia Ronzella; Cabral, Lucélia; de Sousa, Sanderson Tarciso Pereira; Júnior, Gileno Vieira Lacerda; Domingos, Daniela Ferreira; Soares Junior, Fábio Lino; da Silva, Mylenne Calciolari Pinheiro; Marcon, Joelma; Dias, Armando Cavalcante Franco; de Melo, Itamar Soares; de Souza, Anete Pereira; Andreote, Fernando Dini; de Oliveira, Valéria Maia

    2017-07-01

    Mangroves are located in coastal wetlands and are susceptible to the consequences of oil spills, which may threaten the diversity of microorganisms responsible for the nutrient cycling and the consequent ecosystem functioning. Previous reports show that high concentration of oil favors the incidence of epoxide hydrolases and haloalkane dehalogenases in mangroves. This finding has guided the goals of this study in an attempt to broaden the analysis to other hydrolases and thereby verify whether oil contamination interferes with the prevalence of particular hydrolases and their assigned microorganisms. For this, an in-depth survey of the taxonomic and functional microbial diversity recovered in a fosmid library (Library_Oil Mgv) constructed from oil-impacted Brazilian mangrove sediment was carried out. Fosmid DNA of the whole library was extracted and submitted to Illumina HiSeq sequencing. The resulting Library Oil_Mgv dataset was further compared with those obtained by direct sequencing of environmental DNA from Brazilian mangroves (from distinct regions and affected by distinct sources of contamination), focusing on hydrolases with potential use in biotechnological processes. The most abundant hydrolases found were proteases, esterases and amylases, with similar occurrence profile in all datasets. The main microbial groups harboring such hydrolase-encoding genes were distinct in each mangrove, and in the fosmid library these enzymes were mainly assigned to Chloroflexaceae (for amylases), Planctomycetaceae (for esterases) and Bradyrhizobiaceae (for proteases). Assembly and analysis of Library_Oil Mgv reads revealed three potentially novel enzymes, one epoxide hydrolase, one xylanase and one amylase, to be further investigated via heterologous expression assays.

  13. An Integrated SNP Mining and Utilization (ISMU) Pipeline for Next Generation Sequencing Data

    PubMed Central

    Azam, Sarwar; Rathore, Abhishek; Shah, Trushar M.; Telluri, Mohan; Amindala, BhanuPrakash; Ruperao, Pradeep; Katta, Mohan A. V. S. K.; Varshney, Rajeev K.

    2014-01-01

    Open source single nucleotide polymorphism (SNP) discovery pipelines for next generation sequencing data commonly require working knowledge of the command line interface, massive computational resources and expertise, which is a daunting task for biologists. Further, the SNP information generated may not be readily used for downstream processes such as genotyping. Hence, a comprehensive pipeline has been developed by integrating several open source next generation sequencing (NGS) tools along with a graphical user interface called Integrated SNP Mining and Utilization (ISMU) for SNP discovery and their utilization by developing genotyping assays. The pipeline features functionalities such as pre-processing of raw data, integration of open source alignment tools (Bowtie2, BWA, Maq, NovoAlign and SOAP2), SNP prediction (SAMtools/SOAPsnp/CNS2snp and CbCC) methods and interfaces for developing genotyping assays. The pipeline outputs a list of high quality SNPs between all pairwise combinations of genotypes analyzed, in addition to the reference genome/sequence. Visualization tools (Tablet and Flapjack) integrated into the pipeline enable inspection of the alignment and errors, if any. The pipeline also provides a confidence score or polymorphism information content value with flanking sequences for identified SNPs in standard format required for developing marker genotyping (KASP and Golden Gate) assays. The pipeline enables users to process a range of NGS datasets such as whole genome re-sequencing, restriction site associated DNA sequencing and transcriptome sequencing data at a fast speed. The pipeline is very useful for members of the plant genetics and breeding community with no computational expertise who want to discover SNPs and utilize them in genomics, genetics and breeding studies. The pipeline has been parallelized to process huge datasets of next generation sequencing. It has been developed in the Java language and is available at http://hpc.icrisat.cgiar.org/ISMU as standalone free software. PMID:25003610

  14. Phylogeny of the bears (Ursidae) based on nuclear and mitochondrial genes.

    PubMed

    Yu, Li; Li, Qing-wei; Ryder, O A; Zhang, Ya-ping

    2004-08-01

    The taxonomic classification and phylogenetic relationships within the bear family have remained contentious subjects in recent years. Prior investigation has concentrated on the application of different mitochondrial (mt) sequence data; herein we employ two nuclear single-copy gene segments, the partial exon 1 from the gene encoding interphotoreceptor retinoid binding protein (IRBP) and the complete intron 1 from the transthyretin (TTR) gene, in conjunction with previously published mt data, to clarify these enigmatic problems. The combined analyses of nuclear IRBP and TTR datasets not only corroborated prior hypotheses, positioning the spectacled bear most basally and grouping the brown and polar bear together, but also provided new insights into the bear phylogeny, suggesting the sister-taxa association of sloth bear and sun bear with strong support. Analyses based on combination of nuclear and mt genes differed from nuclear analysis in recognizing the sloth bears as the earliest diverging species among the subfamily ursine representatives, while the exact placement of the sun bear was not resolved. Asiatic and American black bears clustered as a sister group in all analyses with moderate levels of bootstrap support and high posterior probabilities. Comparisons between the nuclear and mtDNA findings suggested that our combined nuclear dataset has resolving power comparable to the mtDNA dataset for the phylogenetic interpretation of the bear family. As can be seen from the present study, a unanimous phylogeny for this recently derived family has still not been produced and additional independent genetic markers are needed.

  15. Phylogenetics of modern birds in the era of genomics

    PubMed Central

    Edwards, Scott V; Bryan Jennings, W; Shedlock, Andrew M

    2005-01-01

    In the 14 years since the first higher-level bird phylogenies based on DNA sequence data, avian phylogenetics has witnessed the advent and maturation of the genomics era, the completion of the chicken genome and a suite of technologies that promise to add considerably to the agenda of avian phylogenetics. In this review, we summarize current approaches and data characteristics of recent higher-level bird studies and suggest a number of as yet untested molecular and analytical approaches for the unfolding tree of life for birds. A variety of comparative genomics strategies, including adoption of objective quality scores for sequence data, analysis of contiguous DNA sequences provided by large-insert genomic libraries, and the systematic use of retroposon insertions and other rare genomic changes all promise an integrated phylogenetics that is solidly grounded in genome evolution. The avian genome is an excellent testing ground for such approaches because of the more balanced representation of single-copy and repetitive DNA regions than in mammals. Although comparative genomics has a number of obvious uses in avian phylogenetics, its application to large numbers of taxa poses a number of methodological and infrastructural challenges, and can be greatly facilitated by a ‘community genomics’ approach in which the modest sequencing throughputs of single PI laboratories are pooled to produce larger, complementary datasets. Although the polymerase chain reaction era of avian phylogenetics is far from complete, the comparative genomics era—with its ability to vastly increase the number and type of molecular characters and to provide a genomic context for these characters—will usher in a host of new perspectives and opportunities for integrating genome evolution and avian phylogenetics. PMID:16024355

  16. Clusters of ancestrally related genes that show paralogy in whole or in part are a major feature of the genomes of humans and other species.

    PubMed

    Walker, Michael B; King, Benjamin L; Paigen, Kenneth

    2012-01-01

    Arrangements of genes along chromosomes are a product of evolutionary processes, and we can expect that preferable arrangements will prevail over the span of evolutionary time, often being reflected in the non-random clustering of structurally and/or functionally related genes. Such non-random arrangements can arise by two distinct evolutionary processes: duplications of DNA sequences that give rise to clusters of genes sharing both sequence similarity and common sequence features and the migration together of genes related by function, but not by common descent. To provide a background for distinguishing between the two, which is important for future efforts to unravel the evolutionary processes involved, we here provide a description of the extent to which ancestrally related genes are found in proximity. Towards this purpose, we combined information from five genomic datasets, InterPro, SCOP, PANTHER, Ensembl protein families, and Ensembl gene paralogs. The results are provided in publicly available datasets (http://cgd.jax.org/datasets/clustering/paraclustering.shtml) describing the extent to which ancestrally related genes are in proximity beyond what is expected by chance (i.e. form paraclusters) in the human and nine other vertebrate genomes, as well as the D. melanogaster, C. elegans, A. thaliana, and S. cerevisiae genomes. With the exception of Saccharomyces, paraclusters are a common feature of the genomes we examined. In the human genome they are estimated to include at least 22% of all protein coding genes. Paraclusters are far more prevalent among some gene families than others, are highly species or clade specific and can evolve rapidly, sometimes in response to environmental cues. Altogether, they account for a large portion of the functional clustering previously reported in several genomes.

  17. Coral life history and symbiosis: Functional genomic resources for two reef building Caribbean corals, Acropora palmata and Montastraea faveolata

    PubMed Central

    Schwarz, Jodi A; Brokstein, Peter B; Voolstra, Christian; Terry, Astrid Y; Miller, David J; Szmant, Alina M; Coffroth, Mary Alice; Medina, Mónica

    2008-01-01

    Background: Scleractinian corals are the foundation of reef ecosystems in tropical marine environments. Their great success is due to interactions with endosymbiotic dinoflagellates (Symbiodinium spp.), with which they are obligately symbiotic. To develop a foundation for studying coral biology and coral symbiosis, we have constructed a set of cDNA libraries and generated and annotated ESTs from two species of corals, Acropora palmata and Montastraea faveolata. Results: We generated 14,588 (Ap) and 3,854 (Mf) high quality ESTs from five life history/symbiosis stages (spawned eggs, early-stage planula larvae, late-stage planula larvae either infected with symbionts or uninfected, and adult coral). The ESTs assembled into a set of primarily stage-specific clusters, producing 4,980 (Ap), and 1,732 (Mf) unigenes. The egg stage library, relative to the other developmental stages, was enriched in genes functioning in cell division and proliferation, transcription, signal transduction, and regulation of protein function. Fifteen unigenes were identified as candidate symbiosis-related genes as they were expressed in all libraries constructed from the symbiotic stages and were absent from all of the non symbiotic stages. These include several DNA interacting proteins, and one highly expressed unigene (containing 17 cDNAs) with no significant protein-coding region. A significant number of unigenes (25) encode potential pattern recognition receptors (lectins, scavenger receptors, and others), as well as genes that may function in signaling pathways involved in innate immune responses (toll-like signaling, NFkB p105, and MAP kinases). Comparison between the A. palmata and an A. millepora EST dataset identified ferritin as a highly expressed gene in both datasets that appears to be undergoing adaptive evolution. Five unigenes appear to be restricted to the Scleractinia, as they had no homology to any sequences in the nr databases nor to the non-scleractinian cnidarians Nematostella vectensis and Hydra magnipapillata. Conclusion: Partial sequencing of 5 cDNA libraries each for A. palmata and M. faveolata has produced a rich set of candidate genes (4,980 genes from A. palmata, and 1,732 genes from M. faveolata) that we can use as a starting point for examining the life history and symbiosis of these two species, as well as to further expand the dataset of cnidarian genes for comparative genomics and evolutionary studies. PMID:18298846

  18. Coral Life History and Symbiosis: functional genomic resources for two reef building Caribbean corals, Acropora palmata and Montastraea faveolata

    DOE PAGES

    Schwarz, Jodi A.; Brokstein, Peter B.; Voolstra, Christian R.; ...

    2008-02-25

    Scleractinian corals are the foundation of reef ecosystems in tropical marine environments. Their great success is due to interactions with endosymbiotic dinoflagellates (Symbiodinium spp.), with which they are obligately symbiotic. To develop a foundation for studying coral biology and coral symbiosis, we have constructed a set of cDNA libraries and generated and annotated ESTs from two species of corals, Acropora palmata and Montastraea faveolata. Here we generated 14,588 (Ap) and 3,854 (Mf) high quality ESTs from five life history/symbiosis stages (spawned eggs, early-stage planula larvae, late-stage planula larvae either infected with symbionts or uninfected, and adult coral). The ESTs assembled into a set of primarily stage-specific clusters, producing 4,980 (Ap), and 1,732 (Mf) unigenes. The egg stage library, relative to the other developmental stages, was enriched in genes functioning in cell division and proliferation, transcription, signal transduction, and regulation of protein function. Fifteen unigenes were identified as candidate symbiosis-related genes as they were expressed in all libraries constructed from the symbiotic stages and were absent from all of the non symbiotic stages. These include several DNA interacting proteins, and one highly expressed unigene (containing 17 cDNAs) with no significant protein-coding region. A significant number of unigenes (25) encode potential pattern recognition receptors (lectins, scavenger receptors, and others), as well as genes that may function in signaling pathways involved in innate immune responses (toll-like signaling, NFkB p105, and MAP kinases). Comparison between the A. palmata and an A. millepora EST dataset identified ferritin as a highly expressed gene in both datasets that appears to be undergoing adaptive evolution. Five unigenes appear to be restricted to the Scleractinia, as they had no homology to any sequences in the nr databases nor to the non-scleractinian cnidarians Nematostella vectensis and Hydra magnipapillata. In conclusion, partial sequencing of 5 cDNA libraries each for A. palmata and M. faveolata has produced a rich set of candidate genes (4,980 genes from A. palmata, and 1,732 genes from M. faveolata) that we can use as a starting point for examining the life history and symbiosis of these two species, as well as to further expand the dataset of cnidarian genes for comparative genomics and evolutionary studies.

  19. Coral Life History and Symbiosis: functional genomic resources for two reef building Caribbean corals, Acropora palmata and Montastraea faveolata

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Schwarz, Jodi A.; Brokstein, Peter B.; Voolstra, Christian R.

    Scleractinian corals are the foundation of reef ecosystems in tropical marine environments. Their great success is due to interactions with endosymbiotic dinoflagellates (Symbiodinium spp.), with which they are obligately symbiotic. To develop a foundation for studying coral biology and coral symbiosis, we have constructed a set of cDNA libraries and generated and annotated ESTs from two species of corals, Acropora palmata and Montastraea faveolata. Here we generated 14,588 (Ap) and 3,854 (Mf) high quality ESTs from five life history/symbiosis stages (spawned eggs, early-stage planula larvae, late-stage planula larvae either infected with symbionts or uninfected, and adult coral). The ESTs assembled into a set of primarily stage-specific clusters, producing 4,980 (Ap), and 1,732 (Mf) unigenes. The egg stage library, relative to the other developmental stages, was enriched in genes functioning in cell division and proliferation, transcription, signal transduction, and regulation of protein function. Fifteen unigenes were identified as candidate symbiosis-related genes as they were expressed in all libraries constructed from the symbiotic stages and were absent from all of the non symbiotic stages. These include several DNA interacting proteins, and one highly expressed unigene (containing 17 cDNAs) with no significant protein-coding region. A significant number of unigenes (25) encode potential pattern recognition receptors (lectins, scavenger receptors, and others), as well as genes that may function in signaling pathways involved in innate immune responses (toll-like signaling, NFkB p105, and MAP kinases). Comparison between the A. palmata and an A. millepora EST dataset identified ferritin as a highly expressed gene in both datasets that appears to be undergoing adaptive evolution. Five unigenes appear to be restricted to the Scleractinia, as they had no homology to any sequences in the nr databases nor to the non-scleractinian cnidarians Nematostella vectensis and Hydra magnipapillata. In conclusion, partial sequencing of 5 cDNA libraries each for A. palmata and M. faveolata has produced a rich set of candidate genes (4,980 genes from A. palmata, and 1,732 genes from M. faveolata) that we can use as a starting point for examining the life history and symbiosis of these two species, as well as to further expand the dataset of cnidarian genes for comparative genomics and evolutionary studies.

  20. Accurate phylogenetic classification of DNA fragments based on sequence composition

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    McHardy, Alice C.; Garcia Martin, Hector; Tsirigos, Aristotelis

    2006-05-01

    Metagenome studies have retrieved vast amounts of sequence out of a variety of environments, leading to novel discoveries and great insights into the uncultured microbial world. Except for very simple communities, diversity makes sequence assembly and analysis a very challenging problem. To understand the structure and function of microbial communities, a taxonomic characterization of the obtained sequence fragments is highly desirable, yet currently limited mostly to those sequences that contain phylogenetic marker genes. We show that for clades at the rank of domain down to genus, sequence composition allows the very accurate phylogenetic characterization of genomic sequence. We developed a composition-based classifier, PhyloPythia, for de novo phylogenetic sequence characterization and have trained it on a data set of 340 genomes. By extensive evaluation experiments we show that the method is accurate across all taxonomic ranks considered, even for sequences that originate from novel organisms and are as short as 1 kb. Application to two metagenome datasets obtained from samples of phosphorus-removing sludge showed that the method allows the accurate classification at genus level of most sequence fragments from the dominant populations, while at the same time correctly characterizing even larger parts of the samples at higher taxonomic levels.
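
    As a rough illustration of what "sequence composition" means in the abstract above, the sketch below turns a DNA fragment into a normalized k-mer frequency vector and assigns it to the closest reference by squared Euclidean distance. This is not PhyloPythia: the abstract does not describe the classifier's internals, and the nearest-reference rule, the function names, the choice of k and the toy sequences are all assumptions chosen to keep the example short.

```python
from collections import Counter
from itertools import product


def kmer_profile(seq, k=4):
    """Normalized frequency of every DNA k-mer in seq (overlapping windows).

    Windows containing characters other than A, C, G, T are ignored.
    """
    seq = seq.upper()
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts[m] for m in kmers)
    return [counts[m] / total if total else 0.0 for m in kmers]


def nearest_reference(fragment, references, k=4):
    """Name of the reference whose k-mer profile is closest to the fragment's.

    references: dict mapping a label to a reference sequence (hypothetical).
    """
    frag = kmer_profile(fragment, k)

    def squared_distance(name):
        ref = kmer_profile(references[name], k)
        return sum((a - b) ** 2 for a, b in zip(frag, ref))

    return min(references, key=squared_distance)


# Toy references, invented for this example only.
references = {
    "at_rich": "ATATTAATATTATAAT",
    "gc_rich": "GCGCCGGCGCCGCGGC",
}
label = nearest_reference("ATTATAATAT", references, k=2)
print(label)
```

    With k=4 the profile has 256 entries, so very short fragments give sparse, noisy vectors; that trade-off is one reason fragment length matters for composition-based methods.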

  1. Antibiotic Resistome: Improving Detection and Quantification Accuracy for Comparative Metagenomics.

    PubMed

    Elbehery, Ali H A; Aziz, Ramy K; Siam, Rania

    2016-04-01

    The unprecedented rise of life-threatening antibiotic resistance (AR), combined with the unparalleled advances in DNA sequencing of genomes and metagenomes, has pushed the need for in silico detection of the resistance potential of clinical and environmental metagenomic samples through the quantification of AR genes (i.e., genes conferring antibiotic resistance). Therefore, determining an optimal methodology to quantitatively and accurately assess AR genes in a given environment is pivotal. Here, we optimized and improved existing AR detection methodologies from metagenomic datasets to properly consider AR-generating mutations in antibiotic target genes. Through comparative metagenomic analysis of previously published AR gene abundance in three publicly available metagenomes, we illustrate how mutation-generated resistance genes are either falsely assigned or neglected, which alters the detection and quantitation of the antibiotic resistome. In addition, we inspected factors influencing the outcome of AR gene quantification using metagenome simulation experiments, and identified that genome size, AR gene length, total number of metagenomics reads and selected sequencing platforms had pronounced effects on the level of detected AR. In conclusion, our proposed improvements in the current methodologies for accurate AR detection and resistome assessment show reliable results when tested on real and simulated metagenomic datasets.

  2. Validation of Genotyping-By-Sequencing Analysis in Populations of Tetraploid Alfalfa by 454 Sequencing

    PubMed Central

    Rocher, Solen; Jean, Martine; Castonguay, Yves; Belzile, François

    2015-01-01

    Genotyping-by-sequencing (GBS) is a relatively low-cost high throughput genotyping technology based on next generation sequencing and is applicable to orphan species with no reference genome. A combination of genome complexity reduction and multiplexing with DNA barcoding provides a simple and affordable way to resolve allelic variation between plant samples or populations. GBS was performed on ApeKI libraries using DNA from 48 genotypes each of two heterogeneous populations of tetraploid alfalfa (Medicago sativa spp. sativa): the synthetic cultivar Apica (ATF0) and a derived population (ATF5) obtained after five cycles of recurrent selection for superior tolerance to freezing (TF). Nearly 400 million reads were obtained from two lanes of an Illumina HiSeq 2000 sequencer and analyzed with the Universal Network-Enabled Analysis Kit (UNEAK) pipeline designed for species with no reference genome. Following the application of whole dataset-level filters, 11,694 single nucleotide polymorphism (SNP) loci were obtained. About 60% had a significant match on the Medicago truncatula syntenic genome. The accuracy of allelic ratios and genotype calls based on GBS data was directly assessed using 454 sequencing on a subset of SNP loci scored in eight plant samples. Sequencing depth in this study was not sufficient for accurate tetraploid allelic dosage, but reliable genotype calls based on diploid allelic dosage were obtained when using additional quality filtering. Principal Component Analysis of SNP loci in plant samples revealed that a small proportion (<5%) of the genetic variability assessed by GBS is able to differentiate ATF0 and ATF5. Our results confirm that analysis of GBS data using UNEAK is a reliable approach for genome-wide discovery of SNP loci in outcrossed polyploids. PMID:26115486

  3. The microbiome of Brazilian mangrove sediments as revealed by metagenomics.

    PubMed

    Andreote, Fernando Dini; Jiménez, Diego Javier; Chaves, Diego; Dias, Armando Cavalcante Franco; Luvizotto, Danice Mazzer; Dini-Andreote, Francisco; Fasanella, Cristiane Cipola; Lopez, Maryeimy Varon; Baena, Sandra; Taketani, Rodrigo Gouvêa; de Melo, Itamar Soares

    2012-01-01

    Here we embark on a deep metagenomic survey that revealed the taxonomic and potential metabolic pathways aspects of mangrove sediment microbiology. The extraction of DNA from sediment samples and the direct application of pyrosequencing resulted in approximately 215 Mb of data from four distinct mangrove areas (BrMgv01 to 04) in Brazil. The taxonomic approaches applied revealed the dominance of Deltaproteobacteria and Gammaproteobacteria in the samples. Paired statistical analysis showed higher proportions of specific taxonomic groups in each dataset. The metabolic reconstruction indicated the possible occurrence of processes modulated by the prevailing conditions found in mangrove sediments. In terms of carbon cycling, the sequences indicated the prevalence of genes involved in the metabolism of methane, formaldehyde, and carbon dioxide. With respect to the nitrogen cycle, evidence for sequences associated with dissimilatory reduction of nitrate, nitrogen immobilization, and denitrification was detected. Sequences related to the production of adenylsulfate, sulfite, and H2S were relevant to the sulphur cycle. These data indicate that the microbial core involved in methane, nitrogen, and sulphur metabolism consists mainly of Burkholderiaceae, Planctomycetaceae, Rhodobacteraceae, and Desulfobacteraceae. Comparison of our data to datasets from soil and sea samples resulted in the allotment of the mangrove sediments between those samples. The results of this study add valuable data about the composition of microbial communities in mangroves and also shed light on possible transformations promoted by microbial organisms in mangrove sediments.

  4. The Microbiome of Brazilian Mangrove Sediments as Revealed by Metagenomics

    PubMed Central

    Andreote, Fernando Dini; Jiménez, Diego Javier; Chaves, Diego; Dias, Armando Cavalcante Franco; Luvizotto, Danice Mazzer; Dini-Andreote, Francisco; Fasanella, Cristiane Cipola; Lopez, Maryeimy Varon; Baena, Sandra; Taketani, Rodrigo Gouvêa; de Melo, Itamar Soares

    2012-01-01

    Here we embark on a deep metagenomic survey that revealed the taxonomic and potential metabolic pathways aspects of mangrove sediment microbiology. The extraction of DNA from sediment samples and the direct application of pyrosequencing resulted in approximately 215 Mb of data from four distinct mangrove areas (BrMgv01 to 04) in Brazil. The taxonomic approaches applied revealed the dominance of Deltaproteobacteria and Gammaproteobacteria in the samples. Paired statistical analysis showed higher proportions of specific taxonomic groups in each dataset. The metabolic reconstruction indicated the possible occurrence of processes modulated by the prevailing conditions found in mangrove sediments. In terms of carbon cycling, the sequences indicated the prevalence of genes involved in the metabolism of methane, formaldehyde, and carbon dioxide. With respect to the nitrogen cycle, evidence for sequences associated with dissimilatory reduction of nitrate, nitrogen immobilization, and denitrification was detected. Sequences related to the production of adenylsulfate, sulfite, and H2S were relevant to the sulphur cycle. These data indicate that the microbial core involved in methane, nitrogen, and sulphur metabolism consists mainly of Burkholderiaceae, Planctomycetaceae, Rhodobacteraceae, and Desulfobacteraceae. Comparison of our data to datasets from soil and sea samples resulted in the allotment of the mangrove sediments between those samples. The results of this study add valuable data about the composition of microbial communities in mangroves and also shed light on possible transformations promoted by microbial organisms in mangrove sediments. PMID:22737213

  5. Performance and precision of double digestion RAD (ddRAD) genotyping in large multiplexed datasets of marine fish species.

    PubMed

    Maroso, F; Hillen, J E J; Pardo, B G; Gkagkavouzis, K; Coscia, I; Hermida, M; Franch, R; Hellemans, B; Van Houdt, J; Simionati, B; Taggart, J B; Nielsen, E E; Maes, G; Ciavaglia, S A; Webster, L M I; Volckaert, F A M; Martinez, P; Bargelloni, L; Ogden, R

    2018-06-01

    The development of Genotyping-By-Sequencing (GBS) technologies enables cost-effective analysis of large numbers of Single Nucleotide Polymorphisms (SNPs), especially in "non-model" species. Nevertheless, as such technologies enter a mature phase, biases and errors inherent to GBS are becoming evident. Here, we evaluated the performance of double digest Restriction enzyme Associated DNA (ddRAD) sequencing in SNP genotyping studies including a high number of samples. Datasets of sequence data were generated from three marine teleost species (>5500 samples, >2.5 × 10^12 bases in total), using a standardized protocol. A common bioinformatics pipeline based on STACKS was established, with and without the use of a reference genome. We performed analyses throughout the production and analysis of ddRAD data in order to explore (i) the loss of information due to heterogeneous raw read number across samples; (ii) the discrepancy between expected and observed tag length and coverage; (iii) the performances of reference based vs. de novo approaches; (iv) the sources of potential genotyping errors of the library preparation/bioinformatics protocol, by comparing technical replicates. Our results showed that use of a reference genome and a posteriori genotype correction improved genotyping precision. Individual read coverage was a key variable for reproducibility; variance in sequencing depth between loci in the same individual was also identified as an important factor and found to correlate to tag length. A comparison of downstream analysis carried out with ddRAD vs single SNP allele specific assay genotypes provided information about the levels of genotyping imprecision that can have a significant impact on allele frequency estimations and population assignment. The results and insights presented here will help to select and improve approaches to the analysis of large datasets based on RAD-like methodologies.

  6. Comparative description of ten transcriptomes of newly sequenced invertebrates and efficiency estimation of genomic sampling in non-model taxa

    PubMed Central

    2012-01-01

    Introduction: Traditionally, genomic or transcriptomic data have been restricted to a few model or emerging model organisms, and to a handful of species of medical and/or environmental importance. Next-generation sequencing techniques have the capability of yielding massive amounts of gene sequence data for virtually any species at a modest cost. Here we provide a comparative analysis of de novo assembled transcriptomic data for ten non-model species of previously understudied animal taxa. Results: cDNA libraries of ten species belonging to five animal phyla (2 Annelida [including Sipuncula], 2 Arthropoda, 2 Mollusca, 2 Nemertea, and 2 Porifera) were sequenced in different batches with an Illumina Genome Analyzer II (read length 100 or 150 bp), rendering between ca. 25 and 52 million reads per species. Read thinning, trimming, and de novo assembly were performed under different parameters to optimize output. Between 67,423 and 207,559 contigs were obtained across the ten species, post-optimization. Of those, 9,069 to 25,681 contigs retrieved blast hits against the NCBI non-redundant database, and approximately 50% of these were assigned with Gene Ontology terms, covering all major categories, and with similar percentages in all species. Local blasts against our datasets, using selected genes from major signaling pathways and housekeeping genes, revealed high efficiency in gene recovery compared to available genomes of closely related species. Intriguingly, our transcriptomic datasets detected multiple paralogues in all phyla and in nearly all gene pathways, including housekeeping genes that are traditionally used in phylogenetic applications for their purported single-copy nature. Conclusions: We generated the first study of comparative transcriptomics across multiple animal phyla (comparing two species per phylum in most cases), established the first Illumina-based transcriptomic datasets for sponge, nemertean, and sipunculan species, and generated a tractable catalogue of annotated genes (or gene fragments) and protein families for ten newly sequenced non-model organisms, some of commercial importance (i.e., Octopus vulgaris). These comprehensive sets of genes can be readily used for phylogenetic analysis, gene expression profiling, developmental analysis, and can also be a powerful resource for gene discovery. The characterization of the transcriptomes of such a diverse array of animal species permitted the comparison of sequencing depth, functional annotation, and efficiency of genomic sampling using the same pipelines, which proved to be similar for all considered species. In addition, the datasets revealed their potential as a resource for paralogue detection, a recurrent concern in various aspects of biological inquiry, including phylogenetics, molecular evolution, development, and cellular biochemistry. PMID:23190771

  7. Distinct Trends of DNA Methylation Patterning in the Innate and Adaptive Immune Systems

    PubMed Central

    Schuyler, Ronald P.; Merkel, Angelika; Raineri, Emanuele; Altucci, Lucia; Vellenga, Edo; Martens, Joost H.A.; Pourfarzad, Farzin; Kuijpers, Taco W.; Burden, Frances; Farrow, Samantha; Downes, Kate; Ouwehand, Willem H.; Clarke, Laura; Datta, Avik; Lowy, Ernesto; Flicek, Paul; Frontini, Mattia; Stunnenberg, Hendrik G.; Martín-Subero, José I.; Gut, Ivo; Heath, Simon

    2018-01-01

    Summary: DNA methylation and the localization and post-translational modification of nucleosomes are interdependent factors that contribute to the generation of distinct phenotypes from genetically identical cells. With 112 whole-genome bisulfite sequencing datasets from the BLUEPRINT Epigenome Project, we analyzed the global development of DNA methylation patterns during lineage commitment and maturation of a range of immune system effector cells and the cancers that arise from them. We show clear trends in methylation patterns that are distinct in the innate and adaptive arms of the human immune system, both globally and in relation to consistently positioned nucleosomes. Most notable are a progressive loss of methylation in developing lymphocytes and the consistent occurrence of non-CG methylation in specific cell types. Cancer samples from the two lineages are further polarized, suggesting the involvement of distinct lineage-specific epigenetic mechanisms. We anticipate broad utility for this resource as a basis for further comparative epigenetic analyses. PMID:27851971

  8. Distinct Trends of DNA Methylation Patterning in the Innate and Adaptive Immune Systems.

    PubMed

    Schuyler, Ronald P; Merkel, Angelika; Raineri, Emanuele; Altucci, Lucia; Vellenga, Edo; Martens, Joost H A; Pourfarzad, Farzin; Kuijpers, Taco W; Burden, Frances; Farrow, Samantha; Downes, Kate; Ouwehand, Willem H; Clarke, Laura; Datta, Avik; Lowy, Ernesto; Flicek, Paul; Frontini, Mattia; Stunnenberg, Hendrik G; Martín-Subero, José I; Gut, Ivo; Heath, Simon

    2016-11-15

    DNA methylation and the localization and post-translational modification of nucleosomes are interdependent factors that contribute to the generation of distinct phenotypes from genetically identical cells. With 112 whole-genome bisulfite sequencing datasets from the BLUEPRINT Epigenome Project, we analyzed the global development of DNA methylation patterns during lineage commitment and maturation of a range of immune system effector cells and the cancers that arise from them. We show clear trends in methylation patterns that are distinct in the innate and adaptive arms of the human immune system, both globally and in relation to consistently positioned nucleosomes. Most notable are a progressive loss of methylation in developing lymphocytes and the consistent occurrence of non-CG methylation in specific cell types. Cancer samples from the two lineages are further polarized, suggesting the involvement of distinct lineage-specific epigenetic mechanisms. We anticipate broad utility for this resource as a basis for further comparative epigenetic analyses. Copyright © 2016 The Authors. Published by Elsevier Inc. All rights reserved.

  9. Mitochondrial comparative genomics and phylogenetic signal assessment of mtDNA among arbuscular mycorrhizal fungi.

    PubMed

    Nadimi, Maryam; Daubois, Laurence; Hijri, Mohamed

    2016-05-01

    Mitochondrial (mt) genes, such as cytochrome C oxidase genes (cox), have been widely used for barcoding in many groups of organisms, although this approach has been less powerful in the fungal kingdom due to the rapid evolution of their mt genomes. The use of mt genes in phylogenetic studies of Dikarya has been met with success, while early diverging fungal lineages remain less studied, particularly the arbuscular mycorrhizal fungi (AMF). Advances in next-generation sequencing have substantially increased the number of publicly available mtDNA sequences for the Glomeromycota. As a result, comparison of mtDNA across key AMF taxa can now be applied to assess the phylogenetic signal of individual mt coding genes, as well as concatenated subsets of coding genes. Here we show comparative analyses of publicly available mt genomes of Glomeromycota, augmented with two mtDNA genomes that were newly sequenced for this study (Rhizophagus irregularis DAOM240159 and Glomus aggregatum DAOM240163), resulting in 16 complete mtDNA datasets. R. irregularis isolate DAOM240159 and G. aggregatum isolate DAOM240163 showed mt genomes measuring 72,293 bp and 69,505 bp with G+C contents of 37.1% and 37.3%, respectively. We assessed the phylogenies inferred from single mt genes and complete sets of coding genes, which are referred to as "supergenes" (16 concatenated coding genes), using Shimodaira-Hasegawa tests, in order to identify genes that best described AMF phylogeny. We found that rnl, nad5, cox1, and nad2 genes, as well as a concatenated subset of these genes, provided phylogenies that were similar to the supergene set. This mitochondrial genomic analysis was also combined with principal coordinate and partitioning analyses, which helped to unravel certain evolutionary relationships in the Rhizophagus genus and for G. aggregatum within the Glomeromycota. We showed evidence to support the position of G. aggregatum within the R. irregularis 'species complex'.
Copyright © 2016 Elsevier Inc. All rights reserved.
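
    The G+C contents quoted above (37.1% and 37.3%) are simple base-composition percentages. As a minimal sketch of how such a figure can be computed, the following Python function counts G and C among the unambiguous bases of a sequence. The function name `gc_content` and the toy sequence are illustrative assumptions, not taken from the study, which does not describe its tooling.

```python
def gc_content(seq: str) -> float:
    """Return the G+C percentage of a DNA sequence.

    Ambiguous symbols (e.g. N) are ignored, which is an assumption of
    this sketch; other tools may count them in the denominator.
    """
    seq = seq.upper()
    counts = {base: seq.count(base) for base in "ACGT"}
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sequence contains no A, C, G or T")
    return 100.0 * (counts["G"] + counts["C"]) / total


# Hypothetical toy sequence, not real mtDNA.
toy = "ATGCGCATTANNAT"
print(round(gc_content(toy), 1))
```

    Applied to a complete mt genome read from a FASTA file, the same function would yield a single percentage comparable to the values reported in the abstract.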

  10. A SNP resource for Douglas-fir: de novo transcriptome assembly and SNP detection and validation.

    PubMed

    Howe, Glenn T; Yu, Jianbin; Knaus, Brian; Cronn, Richard; Kolpak, Scott; Dolan, Peter; Lorenz, W Walter; Dean, Jeffrey F D

    2013-02-28

    Douglas-fir (Pseudotsuga menziesii), one of the most economically and ecologically important tree species in the world, also has one of the largest tree breeding programs. Although the coastal and interior varieties of Douglas-fir (vars. menziesii and glauca) are native to North America, the coastal variety is also widely planted for timber production in Europe, New Zealand, Australia, and Chile. Our main goal was to develop a SNP resource large enough to facilitate genomic selection in Douglas-fir breeding programs. To accomplish this, we developed a 454-based reference transcriptome for coastal Douglas-fir, annotated and evaluated the quality of the reference, identified putative SNPs, and then validated a sample of those SNPs using the Illumina Infinium genotyping platform. We assembled a reference transcriptome consisting of 25,002 isogroups (unique gene models) and 102,623 singletons from 2.76 million 454 and Sanger cDNA sequences from coastal Douglas-fir. We identified 278,979 unique SNPs by mapping the 454 and Sanger sequences to the reference, and by mapping four datasets of Illumina cDNA sequences from multiple seed sources, genotypes, and tissues. The Illumina datasets represented coastal Douglas-fir (64.00 and 13.41 million reads), interior Douglas-fir (80.45 million reads), and a Yakima population similar to interior Douglas-fir (8.99 million reads). We assayed 8067 SNPs on 260 trees using an Illumina Infinium SNP genotyping array. Of these SNPs, 5847 (72.5%) were called successfully and were polymorphic. Based on our validation efficiency, our SNP database may contain as many as ~200,000 true SNPs, and as many as ~69,000 SNPs that could be genotyped at ~20,000 gene loci using an Infinium II array, more SNPs than are needed to use genomic selection in tree breeding programs. 
Ultimately, these genomic resources will enhance Douglas-fir breeding and allow us to better understand landscape-scale patterns of genetic variation and potential responses to climate change.
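
    The ~200,000 figure is consistent with scaling the 278,979 putative SNPs by the observed validation rate. The short Python sketch below reproduces that arithmetic from the numbers in the abstract; treating the estimate as a plain product is an assumption of this sketch, since the abstract does not spell out the calculation.

```python
# Numbers taken from the abstract.
assayed = 8067            # SNPs placed on the Infinium array
validated = 5847          # called successfully and polymorphic
candidate_snps = 278_979  # unique putative SNPs in the database

# Assumed reconstruction: validation rate times database size.
validation_rate = validated / assayed
expected_true = candidate_snps * validation_rate

print(f"validation rate: {validation_rate:.1%}")
print(f"expected true SNPs: about {expected_true:,.0f}")
```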

  11. A SNP resource for Douglas-fir: de novo transcriptome assembly and SNP detection and validation

    PubMed Central

    2013-01-01

    Background: Douglas-fir (Pseudotsuga menziesii), one of the most economically and ecologically important tree species in the world, also has one of the largest tree breeding programs. Although the coastal and interior varieties of Douglas-fir (vars. menziesii and glauca) are native to North America, the coastal variety is also widely planted for timber production in Europe, New Zealand, Australia, and Chile. Our main goal was to develop a SNP resource large enough to facilitate genomic selection in Douglas-fir breeding programs. To accomplish this, we developed a 454-based reference transcriptome for coastal Douglas-fir, annotated and evaluated the quality of the reference, identified putative SNPs, and then validated a sample of those SNPs using the Illumina Infinium genotyping platform. Results: We assembled a reference transcriptome consisting of 25,002 isogroups (unique gene models) and 102,623 singletons from 2.76 million 454 and Sanger cDNA sequences from coastal Douglas-fir. We identified 278,979 unique SNPs by mapping the 454 and Sanger sequences to the reference, and by mapping four datasets of Illumina cDNA sequences from multiple seed sources, genotypes, and tissues. The Illumina datasets represented coastal Douglas-fir (64.00 and 13.41 million reads), interior Douglas-fir (80.45 million reads), and a Yakima population similar to interior Douglas-fir (8.99 million reads). We assayed 8067 SNPs on 260 trees using an Illumina Infinium SNP genotyping array. Of these SNPs, 5847 (72.5%) were called successfully and were polymorphic. Conclusions: Based on our validation efficiency, our SNP database may contain as many as ~200,000 true SNPs, and as many as ~69,000 SNPs that could be genotyped at ~20,000 gene loci using an Infinium II array, more SNPs than are needed to use genomic selection in tree breeding programs. Ultimately, these genomic resources will enhance Douglas-fir breeding and allow us to better understand landscape-scale patterns of genetic variation and potential responses to climate change. PMID:23445355

  12. GeNemo: a search engine for web-based functional genomic data.

    PubMed

    Zhang, Yongqing; Cao, Xiaoyi; Zhong, Sheng

    2016-07-08

    A set of new data types emerged from functional genomic assays, including ChIP-seq, DNase-seq, FAIRE-seq and others. The results are typically stored as genome-wide intensities (WIG/bigWig files) or functional genomic regions (peak/BED files). These data types present new challenges to big data science. Here, we present GeNemo, a web-based search engine for functional genomic data. GeNemo searches user-input data against online functional genomic datasets, including the entire collection of ENCODE and mouse ENCODE datasets. Unlike text-based search engines, GeNemo's searches are based on pattern matching of functional genomic regions. This distinguishes GeNemo from text or DNA sequence searches. The user can input any complete or partial functional genomic dataset, for example, a binding intensity file (bigWig) or a peak file. GeNemo reports any genomic regions, ranging from hundred bases to hundred thousand bases, from any of the online ENCODE datasets that share similar functional (binding, modification, accessibility) patterns. This is enabled by a Markov Chain Monte Carlo-based maximization process, executed on up to 24 parallel computing threads. By clicking on a search result, the user can visually compare her/his data with the found datasets and navigate the identified genomic regions. GeNemo is available at www.genemo.org. © The Author(s) 2016. Published by Oxford University Press on behalf of Nucleic Acids Research.

  13. From conservation genetics to conservation genomics: a genome-wide assessment of blue whales (Balaenoptera musculus) in Australian feeding aggregations

    PubMed Central

    Sandoval-Castillo, Jonathan; Jenner, K. Curt S.; Gill, Peter C.; Jenner, Micheline-Nicole M.; Morrice, Margaret G.

    2018-01-01

    Genetic datasets of tens of markers have been superseded through next-generation sequencing technology with genome-wide datasets of thousands of markers. Genomic datasets improve our power to detect low population structure and identify adaptive divergence. The increased population-level knowledge can inform the conservation management of endangered species, such as the blue whale (Balaenoptera musculus). In Australia, there are two known feeding aggregations of the pygmy blue whale (B. m. brevicauda) which have shown no evidence of genetic structure based on a small dataset of 10 microsatellites and mtDNA. Here, we develop and implement a high-resolution dataset of 8294 genome-wide filtered single nucleotide polymorphisms, the first of its kind for blue whales. We use these data to assess whether the Australian feeding aggregations constitute one population and to test for the first time whether there is adaptive divergence between the feeding aggregations. We found no evidence of neutral population structure and negligible evidence of adaptive divergence. We propose that individuals likely travel widely between feeding areas and to breeding areas, which would require them to be adapted to a wide range of environmental conditions. This has important implications for their conservation as this blue whale population is likely vulnerable to a range of anthropogenic threats both off Australia and elsewhere. PMID:29410806

  14. Ancestral inference from haplotypes and mutations.

    PubMed

    Griffiths, Robert C; Tavaré, Simon

    2018-04-25

    We consider inference about the history of a sample of DNA sequences, conditional upon the haplotype counts and the number of segregating sites observed at the present time. After deriving some theoretical results in the coalescent setting, we implement rejection sampling and importance sampling schemes to perform the inference. The importance sampling scheme addresses an extension of the Ewens Sampling Formula for a configuration of haplotypes and the number of segregating sites in the sample. The implementations include both constant and variable population size models. The methods are illustrated by two human Y chromosome datasets. Copyright © 2018. Published by Elsevier Inc.
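
    To illustrate the rejection-sampling idea in a much simpler setting than the one treated in the paper, the Python sketch below samples the time to the most recent common ancestor (TMRCA) conditional only on the number of segregating sites. It assumes a constant-size coalescent with infinite-sites mutation, time measured in units of 2N generations, and a known scaled mutation rate theta = 4Nμ. It does not condition on haplotype counts and is not the authors' scheme; all function names are hypothetical.

```python
import math
import random


def simulate_tree(n, rng):
    """Simulate a constant-size coalescent for n lineages.

    Returns (TMRCA, total branch length), in units of 2N generations.
    """
    tmrca = 0.0
    total = 0.0
    for k in range(n, 1, -1):
        # With k lineages, the waiting time to the next coalescence
        # is exponential with rate k(k-1)/2.
        t = rng.expovariate(k * (k - 1) / 2)
        tmrca += t
        total += k * t
    return tmrca, total


def poisson(lam, rng):
    """Draw a Poisson variate by Knuth's multiplication method (small lam)."""
    limit = math.exp(-lam)
    k = 0
    p = rng.random()
    while p > limit:
        k += 1
        p *= rng.random()
    return k


def rejection_sample_tmrca(n, theta, s_obs, n_accept, seed=0):
    """Keep simulated TMRCAs whose tree produces exactly s_obs segregating sites."""
    rng = random.Random(seed)
    accepted = []
    while len(accepted) < n_accept:
        tmrca, total = simulate_tree(n, rng)
        # Under infinite sites, the number of segregating sites given the
        # tree is Poisson with mean theta/2 times the total branch length.
        if poisson(theta / 2 * total, rng) == s_obs:
            accepted.append(tmrca)
    return accepted


samples = rejection_sample_tmrca(n=5, theta=2.0, s_obs=3, n_accept=200, seed=1)
print(sum(samples) / len(samples))  # Monte Carlo estimate of E[TMRCA | S = 3]
```

    The acceptance rate falls quickly as the conditioning event becomes rarer, which is one motivation for importance sampling schemes of the kind the abstract mentions.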

  15. A novel on-line spatial-temporal k-anonymity method for location privacy protection from sequence rules-based inference attacks.

    PubMed

    Zhang, Haitao; Wu, Chenxue; Chen, Zewei; Liu, Zhao; Zhu, Yunhong

    2017-01-01

    Analyzing large-scale spatial-temporal k-anonymity datasets recorded in location-based service (LBS) application servers can benefit some LBS applications. However, such analyses can allow adversaries to make inference attacks that cannot be handled by spatial-temporal k-anonymity methods or other methods for protecting sensitive knowledge. In response to this challenge, first we defined a destination location prediction attack model based on privacy-sensitive sequence rules mined from large scale anonymity datasets. Then we proposed a novel on-line spatial-temporal k-anonymity method that can resist such inference attacks. Our anti-attack technique generates new anonymity datasets with awareness of privacy-sensitive sequence rules. The new datasets extend the original sequence database of anonymity datasets to hide the privacy-sensitive rules progressively. The process includes two phases: off-line analysis and on-line application. In the off-line phase, sequence rules are mined from an original sequence database of anonymity datasets, and privacy-sensitive sequence rules are developed by correlating privacy-sensitive spatial regions with spatial grid cells among the sequence rules. In the on-line phase, new anonymity datasets are generated upon LBS requests by adopting specific generalization and avoidance principles to hide the privacy-sensitive sequence rules progressively from the extended sequence anonymity datasets database. We conducted extensive experiments to test the performance of the proposed method, and to explore the influence of the parameter K value. The results demonstrated that our proposed approach is faster and more effective for hiding privacy-sensitive sequence rules in terms of hiding sensitive rules ratios to eliminate inference attacks. 
Our method also had fewer side effects in terms of generating new sensitive rules ratios than the traditional spatial-temporal k-anonymity method, and had basically the same side effects in terms of non-sensitive rules variation ratios with the traditional spatial-temporal k-anonymity method. Furthermore, we also found the performance variation tendency from the parameter K value, which can help achieve the goal of hiding the maximum number of original sensitive rules while generating a minimum of new sensitive rules and affecting a minimum number of non-sensitive rules.

  16. A novel on-line spatial-temporal k-anonymity method for location privacy protection from sequence rules-based inference attacks

    PubMed Central

    Wu, Chenxue; Liu, Zhao; Zhu, Yunhong

    2017-01-01

    Analyzing large-scale spatial-temporal k-anonymity datasets recorded in location-based service (LBS) application servers can benefit some LBS applications. However, such analyses can allow adversaries to make inference attacks that cannot be handled by spatial-temporal k-anonymity methods or other methods for protecting sensitive knowledge. In response to this challenge, first we defined a destination location prediction attack model based on privacy-sensitive sequence rules mined from large scale anonymity datasets. Then we proposed a novel on-line spatial-temporal k-anonymity method that can resist such inference attacks. Our anti-attack technique generates new anonymity datasets with awareness of privacy-sensitive sequence rules. The new datasets extend the original sequence database of anonymity datasets to hide the privacy-sensitive rules progressively. The process includes two phases: off-line analysis and on-line application. In the off-line phase, sequence rules are mined from an original sequence database of anonymity datasets, and privacy-sensitive sequence rules are developed by correlating privacy-sensitive spatial regions with spatial grid cells among the sequence rules. In the on-line phase, new anonymity datasets are generated upon LBS requests by adopting specific generalization and avoidance principles to hide the privacy-sensitive sequence rules progressively from the extended sequence anonymity datasets database. We conducted extensive experiments to test the performance of the proposed method, and to explore the influence of the parameter K value. The results demonstrated that our proposed approach is faster and more effective for hiding privacy-sensitive sequence rules in terms of hiding sensitive rules ratios to eliminate inference attacks. 
Our method also had fewer side effects in terms of generating new sensitive rules ratios than the traditional spatial-temporal k-anonymity method, and had basically the same side effects in terms of non-sensitive rules variation ratios with the traditional spatial-temporal k-anonymity method. Furthermore, we also found the performance variation tendency from the parameter K value, which can help achieve the goal of hiding the maximum number of original sensitive rules while generating a minimum of new sensitive rules and affecting a minimum number of non-sensitive rules. PMID:28767687

  17. Imbalanced class learning in epigenetics.

    PubMed

    Haque, M Muksitul; Skinner, Michael K; Holder, Lawrence B

    2014-07-01

    In machine learning, one of the important criteria for higher classification accuracy is a balanced dataset. Datasets with a large ratio between minority and majority classes face hindrance in learning using any classifier. Datasets having a magnitude difference in number of instances between the target concept result in an imbalanced class distribution. Such datasets can range from biological data, sensor data, medical diagnostics, or any other domain where labeling any instances of the minority class can be time-consuming or costly or the data may not be easily available. The current study investigates a number of imbalanced class algorithms for solving the imbalanced class distribution present in epigenetic datasets. Epigenetic (DNA methylation) datasets inherently come with few differentially DNA methylated regions (DMR) and with a higher number of non-DMR sites. For this class imbalance problem, a number of algorithms are compared, including the TAN+AdaBoost algorithm. Experiments performed on four epigenetic datasets and several known datasets show that an imbalanced dataset can have similar accuracy as a regular learner on a balanced dataset.
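
    A small worked example shows why plain accuracy can mislead on imbalanced data such as DMR versus non-DMR sites. In the Python sketch below, a classifier that always predicts the majority class scores 95% accuracy while detecting no minority instances; balanced accuracy (the mean of per-class recall) exposes this. The labels and the 5:95 split are invented for illustration and are not from the study.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall, so each class counts equally."""
    recalls = []
    for label in set(y_true):
        idx = [i for i, t in enumerate(y_true) if t == label]
        recalls.append(sum(y_pred[i] == label for i in idx) / len(idx))
    return sum(recalls) / len(recalls)


# Hypothetical data: 1 = DMR (minority), 0 = non-DMR (majority).
y_true = [1] * 5 + [0] * 95
y_pred = [0] * 100  # trivial classifier that always predicts the majority

print(accuracy(y_true, y_pred))           # 0.95
print(balanced_accuracy(y_true, y_pred))  # 0.5
```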

  18. Molecular species identification with rich floristic sampling: DNA barcoding the pteridophyte flora of Japan.

    PubMed

    Ebihara, Atsushi; Nitta, Joel H; Ito, Motomi

    2010-12-08

    DNA barcoding is expected to be an effective identification tool for organisms with heteromorphic generations such as pteridophytes, which possess a morphologically simple gametophyte generation. Although a reference data set including complete coverage of the target local flora/fauna is necessary for accurate identification, DNA barcode studies including such rich taxonomic sampling on a countrywide scale are lacking. The Japanese pteridophyte flora (733 taxa including subspecies and varieties) was used to test the utility of two plastid DNA barcode regions (rbcL and trnH-psbA) with the intention of developing an identification system for native gametophytes. DNA sequences were obtained from each of 689 (94.0%) taxa for rbcL and 617 (84.2%) taxa for trnH-psbA. Mean interspecific divergence values across all taxon pairs (K2P genetic distances) did not reveal a significant difference in rate between trnH-psbA and rbcL, but mean K2P distances of each genus showed significant heterogeneity according to systematic position. The minimum fail rate of taxon discrimination in an identification test using BLAST (12.52%) was obtained when rbcL and trnH-psbA were combined, and became lower in datasets excluding infraspecific taxa or apogamous taxa, or including sexual diploids only. This study demonstrates the overall effectiveness of DNA barcodes for species identification in the Japanese pteridophyte flora. Although this flora is characterized by a high occurrence of apogamous taxa that pose a serious challenge to identification using DNA barcodes, such taxa are limited to a small number of genera, and only minimally detract from the overall success rate. In the case that a query sequence is matched to a known apogamous genus, routine species identification may not be possible. Otherwise, DNA barcoding is a practical tool for identification of most Japanese pteridophytes, and is especially anticipated to be helpful for identification of non-hybridizing gametophytes.
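
    The K2P (Kimura two-parameter) distance used above is computed from the proportions of transitions (P) and transversions (Q) between two aligned sequences as d = -1/2 ln[(1 - 2P - Q) sqrt(1 - 2Q)]. The Python sketch below implements that formula for a pair of pre-aligned sequences. Skipping sites with gaps or ambiguous bases is an assumption of this sketch, and the example sequences are invented.

```python
import math

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def k2p_distance(seq1: str, seq2: str) -> float:
    """Kimura two-parameter distance between two aligned DNA sequences."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned to the same length")
    transitions = transversions = compared = 0
    for a, b in zip(seq1.upper(), seq2.upper()):
        if a not in "ACGT" or b not in "ACGT":
            continue  # skip gaps and ambiguous symbols (assumption)
        compared += 1
        if a == b:
            continue
        if {a, b} <= PURINES or {a, b} <= PYRIMIDINES:
            transitions += 1    # A<->G or C<->T
        else:
            transversions += 1  # purine <-> pyrimidine
    if compared == 0:
        raise ValueError("no comparable sites")
    p = transitions / compared
    q = transversions / compared
    # math.log raises ValueError when the sequences are too divergent
    # for the correction to be defined.
    return -0.5 * math.log((1 - 2 * p - q) * math.sqrt(1 - 2 * q))


# One transition in ten sites: P = 0.1, Q = 0.
print(round(k2p_distance("AAAAAAAAAA", "GAAAAAAAAA"), 4))
```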

  19. Multilocus inference of species trees and DNA barcoding.

    PubMed

    Mallo, Diego; Posada, David

    2016-09-05

    The unprecedented amount of data resulting from next-generation sequencing has opened a new era in phylogenetic estimation. Although large datasets should, in theory, increase phylogenetic resolution, massive, multilocus datasets have uncovered a great deal of phylogenetic incongruence among different genomic regions, due both to stochastic error and to the action of different evolutionary processes such as incomplete lineage sorting, gene duplication and loss and horizontal gene transfer. This incongruence violates one of the fundamental assumptions of the DNA barcoding approach, which assumes that gene history and species history are identical. In this review, we explain some of the most important challenges we will have to face to reconstruct the history of species, and the advantages and disadvantages of different strategies for the phylogenetic analysis of multilocus data. In particular, we describe the evolutionary events that can generate species tree-gene tree discordance, compare the most popular methods for species tree reconstruction, highlight the challenges we need to face when using them and discuss their potential utility in barcoding. Current barcoding methods sacrifice a great amount of statistical power by only considering one locus, and a transition to multilocus barcodes would not only improve current barcoding methods, but also facilitate an eventual transition to species-tree-based barcoding strategies, which could better accommodate scenarios where the barcode gap is too small or inexistent. This article is part of the themed issue 'From DNA barcodes to biomes'. © 2016 The Authors.

  20. Reading Mammal Diversity from Flies: The Persistence Period of Amplifiable Mammal mtDNA in Blowfly Guts (Chrysomya megacephala) and a New DNA Mini-Barcode Target.

    PubMed

    Lee, Ping-Shin; Sing, Kong-Wah; Wilson, John-James

    2015-01-01

    Most tropical mammal species are threatened or data-deficient. Data collection is impeded by the traditional monitoring approaches, which can be laborious and expensive and can struggle to detect cryptic diversity. Monitoring approaches using mammal DNA derived from invertebrates are emerging as cost- and time-effective alternatives. As a step towards development of blowfly-derived DNA as an effective method for mammal monitoring in the biodiversity hotspot of Peninsular Malaysia, our objectives were (i) to determine the persistence period of amplifiable mammal mtDNA in blowfly guts through a laboratory feeding experiment and (ii) to design and test primers that can selectively amplify mammal COI DNA mini-barcodes in the presence of high concentrations of blowfly DNA. The persistence period of amplifiable mammal mtDNA in blowfly guts was 24 h to 96 h post-feeding, indicating the need for collecting flies within 24 h of capture to detect mammal mtDNA of sufficient quantity and quality. We designed a new primer combination for a COI DNA mini-barcode that did not amplify blowfly DNA and showed 89% amplification success for a dataset of mammals from Peninsular Malaysia. The short (205 bp) DNA mini-barcode could distinguish most mammal species (including separating dark taxa) and is of suitable length for high-throughput sequencing. Our new DNA mini-barcode target and a standardized trapping protocol with retrieval of blowflies every 24 h could point the way forward in the development of blowfly-derived DNA as an effective method for mammal monitoring.

  1. Reading Mammal Diversity from Flies: The Persistence Period of Amplifiable Mammal mtDNA in Blowfly Guts (Chrysomya megacephala) and a New DNA Mini-Barcode Target

    PubMed Central

    Lee, Ping-Shin; Sing, Kong-Wah; Wilson, John-James

    2015-01-01

    Most tropical mammal species are threatened or data-deficient. Data collection is impeded by the traditional monitoring approaches, which can be laborious and expensive and can struggle to detect cryptic diversity. Monitoring approaches using mammal DNA derived from invertebrates are emerging as cost- and time-effective alternatives. As a step towards development of blowfly-derived DNA as an effective method for mammal monitoring in the biodiversity hotspot of Peninsular Malaysia, our objectives were (i) to determine the persistence period of amplifiable mammal mtDNA in blowfly guts through a laboratory feeding experiment and (ii) to design and test primers that can selectively amplify mammal COI DNA mini-barcodes in the presence of high concentrations of blowfly DNA. The persistence period of amplifiable mammal mtDNA in blowfly guts was 24 h to 96 h post-feeding, indicating the need for collecting flies within 24 h of capture to detect mammal mtDNA of sufficient quantity and quality. We designed a new primer combination for a COI DNA mini-barcode that did not amplify blowfly DNA and showed 89% amplification success for a dataset of mammals from Peninsular Malaysia. The short (205 bp) DNA mini-barcode could distinguish most mammal species (including separating dark taxa) and is of suitable length for high-throughput sequencing. Our new DNA mini-barcode target and a standardized trapping protocol with retrieval of blowflies every 24 h could point the way forward in the development of blowfly-derived DNA as an effective method for mammal monitoring. PMID:25898278

  2. A Comparison Study for DNA Motif Modeling on Protein Binding Microarray.

    PubMed

    Wong, Ka-Chun; Li, Yue; Peng, Chengbin; Wong, Hau-San

    2016-01-01

    Transcription factor binding sites (TFBSs) are relatively short (5-15 bp) and degenerate. Identifying them is a computationally challenging task. In particular, protein binding microarray (PBM) is a high-throughput platform that can measure the DNA binding preference of a protein in a comprehensive and unbiased manner; for instance, a typical PBM experiment can measure binding signal intensities of a protein to all possible DNA k-mers (k = 8∼10). Since proteins can often bind to DNA with different binding intensities, one of the major challenges is to build TFBS (also known as DNA motif) models which can fully capture the quantitative binding affinity data. To learn DNA motif models from the non-convex objective function landscape, several optimization methods are compared and applied to the PBM motif model building problem. In particular, representative methods from different optimization paradigms have been chosen for modeling performance comparison on hundreds of PBM datasets. The results suggest that the multimodal optimization methods are very effective for capturing the binding preference information from PBM data. In particular, we observe a general performance improvement if choosing di-nucleotide modeling over mono-nucleotide modeling. In addition, the models learned by the best-performing method are applied to two independent applications: PBM probe rotation testing and ChIP-Seq peak sequence prediction, demonstrating its biological applicability.
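
    To give a sense of the scale implied by "all possible DNA k-mers (k = 8∼10)", the Python sketch below enumerates k-mers over the four-letter alphabet and prints how many exist for each k. The helper name `all_kmers` is an illustrative assumption; how a particular PBM design lays these k-mers out on probes is not covered here.

```python
from itertools import product


def all_kmers(k):
    """Lazily yield every DNA k-mer in lexicographic order."""
    return ("".join(letters) for letters in product("ACGT", repeat=k))


# The number of distinct k-mers is 4**k.
for k in (8, 9, 10):
    print(k, 4 ** k)

print(list(all_kmers(2))[:5])  # ['AA', 'AC', 'AG', 'AT', 'CA']
```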

  3. Somatic mutation load of estrogen receptor-positive breast tumors predicts overall survival: an analysis of genome sequence data.

    PubMed

    Haricharan, Svasti; Bainbridge, Matthew N; Scheet, Paul; Brown, Powel H

    2014-07-01

    Breast cancer is one of the most commonly diagnosed cancers in women. While there are several effective therapies for breast cancer and important single gene prognostic/predictive markers, more than 40,000 women die from this disease every year. The increasing availability of large-scale genomic datasets provides opportunities for identifying factors that influence breast cancer survival in smaller, well-defined subsets. The purpose of this study was to investigate the genomic landscape of various breast cancer subtypes and its potential associations with clinical outcomes. We used statistical analysis of sequence data generated by the Cancer Genome Atlas initiative including somatic mutation load (SML) analysis, Kaplan-Meier survival curves, gene mutational frequency, and mutational enrichment evaluation to study the genomic landscape of breast cancer. We show that ER(+), but not ER(-), tumors with high SML associate with poor overall survival (HR = 2.02). Further, these high mutation load tumors are enriched for coincident mutations in both DNA damage repair and ER signature genes. While it is known that somatic mutations in specific genes affect breast cancer survival, this study is the first to identify that SML may constitute an important global signature for a subset of ER(+) tumors prone to high mortality. Moreover, although somatic mutations in individual DNA damage genes affect clinical outcome, our results indicate that coincident mutations in DNA damage response and signature ER genes may prove more informative for ER(+) breast cancer survival. Next generation sequencing may prove an essential tool for identifying pathways underlying poor outcomes and for tailoring therapeutic strategies.
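
    Kaplan-Meier survival curves, mentioned above, estimate the probability of surviving past each observed event time while accounting for censored follow-up. The Python sketch below is a minimal version of the product-limit estimator. The function name and the toy follow-up data are invented for illustration and are unrelated to the Cancer Genome Atlas data analyzed in the study.

```python
def kaplan_meier(times, events):
    """Product-limit survival estimate.

    times  : follow-up time for each subject
    events : 1 if the event (death) was observed, 0 if censored
    Returns a list of (time, survival probability) at each event time.
    """
    data = sorted(zip(times, events))
    at_risk = len(data)
    surv = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = 0
        removed = 0
        # Group all subjects sharing the same follow-up time.
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            removed += 1
            i += 1
        if deaths:
            surv *= 1 - deaths / at_risk
            curve.append((t, surv))
        at_risk -= removed  # events and censored subjects leave the risk set
    return curve


# Hypothetical follow-up: the subject at time 2 is censored.
print(kaplan_meier([1, 2, 3, 4], [1, 0, 1, 1]))
# [(1, 0.75), (3, 0.375), (4, 0.0)]
```

    Comparing two such curves, for example high versus low mutation load, would additionally require a test such as the log-rank test or a hazard-ratio model, which this sketch does not cover.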

  4. DNA barcode data accurately assign higher spider taxa

    PubMed Central

    Coddington, Jonathan A.; Agnarsson, Ingi; Cheng, Ren-Chung; Čandek, Klemen; Driskell, Amy; Frick, Holger; Gregorič, Matjaž; Kostanjšek, Rok; Kropf, Christian; Kweskin, Matthew; Lokovšek, Tjaša; Pipan, Miha; Vidergar, Nina

    2016-01-01

    The use of unique DNA sequences as a method for taxonomic identification is no longer fundamentally controversial, even though debate continues on the best markers, methods, and technology to use. Although both existing databanks such as GenBank and BOLD, as well as reference taxonomies, are imperfect, in best case scenarios “barcodes” (whether single or multiple, organelle or nuclear, loci) clearly are an increasingly fast and inexpensive method of identification, especially as compared to manual identification of unknowns by increasingly rare expert taxonomists. Because most species on Earth are undescribed, a complete reference database at the species level is impractical in the near term. The question therefore arises whether unidentified species can, using DNA barcodes, be accurately assigned to more inclusive groups such as genera and families—taxonomic ranks of putatively monophyletic groups for which the global inventory is more complete and stable. We used a carefully chosen test library of CO1 sequences from 49 families, 313 genera, and 816 species of spiders to assess the accuracy of genus and family-level assignment. We used BLAST queries of each sequence against the entire library and got the top ten hits. The percent sequence identity was reported from these hits (PIdent, range 75–100%). Accurate assignment of higher taxa (PIdent above which errors totaled less than 5%) occurred for genera at PIdent values >95 and families at PIdent values ≥ 91, suggesting these as heuristic thresholds for accurate generic and familial identifications in spiders. Accuracy of identification increases with numbers of species/genus and genera/family in the library; above five genera per family and fifteen species per genus all higher taxon assignments were correct. We propose that using percent sequence identity between conventional barcode sequences may be a feasible and reasonably accurate method to identify animals to family/genus. 
However, the quality of the underlying database impacts accuracy of results; many outliers in our dataset could be attributed to taxonomic and/or sequencing errors in BOLD and GenBank. It seems that an accurate and complete reference library of families and genera of life could provide accurate higher level taxonomic identifications cheaply and accessibly, within years rather than decades. PMID:27547527
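
    The heuristic thresholds reported above can be sketched in Python as a simple decision rule applied to the best BLAST hit. The `Hit` type, the function name, and the example values are hypothetical; only the two cut-offs (PIdent > 95 for genera, PIdent >= 91 for families) come from the abstract.

```python
from typing import NamedTuple, Optional, Tuple

GENUS_THRESHOLD = 95.0   # abstract: genera assigned at PIdent > 95
FAMILY_THRESHOLD = 91.0  # abstract: families assigned at PIdent >= 91


class Hit(NamedTuple):
    """Best BLAST hit for a query (hypothetical minimal record)."""
    pident: float  # percent sequence identity, 0-100
    genus: str
    family: str


def assign_higher_taxon(hit: Hit) -> Tuple[Optional[str], Optional[str]]:
    """Return (genus, family); None where the hit is too dissimilar to trust."""
    genus = hit.genus if hit.pident > GENUS_THRESHOLD else None
    family = hit.family if hit.pident >= FAMILY_THRESHOLD else None
    return genus, family


# Illustrative hit: identity is high enough for family but not for genus.
print(assign_higher_taxon(Hit(93.0, "Araneus", "Araneidae")))
```

    Because the genus cut-off is stricter than the family cut-off, a query can receive a family assignment without a genus assignment, but not the reverse.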

  5. GFinisher: a new strategy to refine and finish bacterial genome assemblies

    NASA Astrophysics Data System (ADS)

    Guizelini, Dieval; Raittz, Roberto T.; Cruz, Leonardo M.; Souza, Emanuel M.; Steffens, Maria B. R.; Pedrosa, Fabio O.

    2016-10-01

    Despite developments in DNA sequencing technology that have improved the number and length of reads, the reconstruction of complete genome sequences, the so-called genome assembly, is still complex. Only 13% of the prokaryotic genome sequencing projects have been completed. Draft genome sequences deposited in public databases are fragmented into contigs and may lack the full gene complement. The aim of the present work is to identify assembly errors and improve the assembly process of bacterial genomes. The biological patterns observed in genomic sequences and the application of a priori information can allow the identification of misassembled regions, and the reorganization and improvement of the overall de novo genome assembly. GFinisher starts by generating a Fuzzy GC skew graph for each contig in an assembly, then breaks the contigs at critical points in order to reassemble and close them using jFGap. This has been successfully applied to a dataset from 96 genome assemblies, decreasing the number of contigs by up to 86%. GFinisher can easily optimize assemblies of prokaryotic draft genomes and can be used to improve the assembly programs based on nucleotide sequence patterns in the genome. The software and source code are available at http://gfinisher.sourceforge.net/.
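
    For readers unfamiliar with GC skew, the following Python sketch computes the plain windowed statistic (G - C) / (G + C) along a contig. This is the conventional GC skew, not the Fuzzy GC skew used by GFinisher, whose details are not given in the abstract; the function name and the default window size are assumptions.

```python
from typing import List


def gc_skew(seq: str, window: int = 1000) -> List[float]:
    """Return (G - C) / (G + C) for consecutive non-overlapping windows."""
    seq = seq.upper()
    skews = []
    for start in range(0, len(seq), window):
        chunk = seq[start:start + window]
        g, c = chunk.count("G"), chunk.count("C")
        # A window with no G or C has undefined skew; report 0.0 by convention.
        skews.append((g - c) / (g + c) if g + c else 0.0)
    return skews


# Toy contig: a G-rich half followed by a C-rich half.
print(gc_skew("GGGGCCCC", window=4))
```

    A sign change in the skew along a sequence is the kind of pattern that can flag a candidate breakpoint for closer inspection.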

  6. Analysis of genomic sequences by Chaos Game Representation.

    PubMed

    Almeida, J S; Carriço, J A; Maretzek, A; Noble, P A; Fletcher, M

    2001-05-01

    Chaos Game Representation (CGR) is an iterative mapping technique that processes sequences of units, such as nucleotides in a DNA sequence or amino acids in a protein, in order to find the coordinates for their position in a continuous space. This distribution of positions has two properties: it is unique, and the source sequence can be recovered from the coordinates such that distance between positions measures similarity between the corresponding sequences. The possibility of using the latter property to identify succession schemes has been entirely overlooked in previous studies, which raises the possibility that CGR may be upgraded from a mere representation technique to a sequence modeling tool. The distribution of positions in the CGR plane was shown to be a generalization of Markov chain probability tables that accommodates non-integer orders. Therefore, Markov models are particular cases of CGR models rather than the reverse, as currently accepted. In addition, the CGR generalization has both practical (computational efficiency) and fundamental (scale independence) advantages. These results are illustrated by using Escherichia coli K-12 as a test data-set, in particular, the genes thrA, thrB and thrC of the threonine operon.
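
    The iterative mapping and the recoverability property described above can be sketched in Python for DNA. The corner assignment (A, C, G, T on the corners of the unit square) and the starting point (0.5, 0.5) follow a common convention and are assumptions here, as are the function names.

```python
from typing import List, Tuple

# Assumed corner layout of the unit square, one corner per nucleotide.
CORNERS = {"A": (0.0, 0.0), "C": (0.0, 1.0), "G": (1.0, 1.0), "T": (1.0, 0.0)}
BASE_AT = {corner: base for base, corner in CORNERS.items()}


def cgr_coordinates(seq: str) -> List[Tuple[float, float]]:
    """Map each prefix of seq to a point: move halfway toward the base's corner."""
    x, y = 0.5, 0.5
    points = []
    for base in seq.upper():
        cx, cy = CORNERS[base]
        x, y = (x + cx) / 2, (y + cy) / 2
        points.append((x, y))
    return points


def cgr_decode(point: Tuple[float, float], length: int) -> str:
    """Recover the last `length` bases from a single CGR point.

    The quadrant containing the point identifies the most recent base;
    undoing the halving step exposes the previous point. Exactness is
    limited by floating-point precision, so this suits short sequences.
    """
    x, y = point
    bases = []
    for _ in range(length):
        cx = 1.0 if x >= 0.5 else 0.0
        cy = 1.0 if y >= 0.5 else 0.0
        bases.append(BASE_AT[(cx, cy)])
        x, y = 2 * x - cx, 2 * y - cy
    return "".join(reversed(bases))


last_point = cgr_coordinates("ACGT")[-1]
print(last_point, cgr_decode(last_point, 4))
```

    Two sequences that share a long suffix land in the same small sub-square, which is the sense in which distance between positions reflects sequence similarity.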

  7. GFinisher: a new strategy to refine and finish bacterial genome assemblies.

    PubMed

    Guizelini, Dieval; Raittz, Roberto T; Cruz, Leonardo M; Souza, Emanuel M; Steffens, Maria B R; Pedrosa, Fabio O

    2016-10-10

    Despite developments in DNA sequencing technology that have improved the number and length of reads, the reconstruction of complete genome sequences, the so-called genome assembly, is still complex. Only 13% of the prokaryotic genome sequencing projects have been completed. Draft genome sequences deposited in public databases are fragmented into contigs and may lack the full gene complement. The aim of the present work is to identify assembly errors and improve the assembly process of bacterial genomes. The biological patterns observed in genomic sequences and the application of a priori information can allow the identification of misassembled regions, and the reorganization and improvement of the overall de novo genome assembly. GFinisher starts by generating a Fuzzy GC skew graph for each contig in an assembly, then breaks the contigs at critical points in order to reassemble and close them using jFGap. This has been successfully applied to a dataset from 96 genome assemblies, decreasing the number of contigs by up to 86%. GFinisher can easily optimize assemblies of prokaryotic draft genomes and can be used to improve the assembly programs based on nucleotide sequence patterns in the genome. The software and source code are available at http://gfinisher.sourceforge.net/.

  8. Estimating Bacterial Diversity for Ecological Studies: Methods, Metrics, and Assumptions

    PubMed Central

    Birtel, Julia; Walser, Jean-Claude; Pichon, Samuel; Bürgmann, Helmut; Matthews, Blake

    2015-01-01

    Methods to estimate microbial diversity have developed rapidly in an effort to understand the distribution and diversity of microorganisms in natural environments. For bacterial communities, the 16S rRNA gene is the phylogenetic marker gene of choice, but most studies select only a specific region of the 16S rRNA to estimate bacterial diversity. Whereas biases derived from DNA extraction, primer choice and PCR amplification are well documented, we here address how the choice of variable region can influence a wide range of standard ecological metrics, such as species richness, phylogenetic diversity, β-diversity and rank-abundance distributions. We have used Illumina paired-end sequencing to estimate the bacterial diversity of 20 natural lakes across Switzerland derived from three trimmed variable 16S rRNA regions (V3, V4, V5). Species richness, phylogenetic diversity, community composition, β-diversity, and rank-abundance distributions differed significantly between 16S rRNA regions. Overall, patterns of diversity quantified by the V3 and V5 regions were more similar to one another than those assessed by the V4 region. Similar results were obtained when analyzing the datasets with different sequence similarity thresholds used during sequence clustering and when the same analysis was used on a reference dataset of sequences from the Greengenes database. In addition, we also measured species richness from the same lake samples using ARISA Fingerprinting, but did not find a strong relationship between species richness estimated by Illumina and ARISA. We conclude that the selection of 16S rRNA region significantly influences the estimation of bacterial diversity and species distributions and that caution is warranted when comparing data from different variable regions as well as when using different sequencing techniques. PMID:25915756

  9. Organismal, genetic, and transcriptional variation in the deeply sequenced gut microbiomes of identical twins

    PubMed Central

    Turnbaugh, Peter J.; Quince, Christopher; Faith, Jeremiah J.; McHardy, Alice C.; Yatsunenko, Tanya; Niazi, Faheem; Affourtit, Jason; Egholm, Michael; Henrissat, Bernard; Knight, Rob; Gordon, Jeffrey I.

    2010-01-01

    We deeply sampled the organismal, genetic, and transcriptional diversity in fecal samples collected from a monozygotic (MZ) twin pair and compared the results to 1,095 communities from the gut and other body habitats of related and unrelated individuals. Using a new scheme for noise reduction in pyrosequencing data, we estimated the total diversity of species-level bacterial phylotypes in the 1.2-1.5 million bacterial 16S rRNA reads obtained from each deeply sampled cotwin to be ~800 (35.9%, 49.1% detected in both). A combined 1.1 million read 16S rRNA dataset representing 281 shallowly sequenced fecal samples from 54 twin pairs and their mothers contained an estimated 4,018 species-level phylotypes, with each sample having a unique species assemblage (53.4 ± 0.6% and 50.3 ± 0.5% overlap with the deeply sampled cotwins). Of the 134 phylotypes with a relative abundance of >0.1% in the combined dataset, only 37 appeared in >50% of the samples, with one phylotype in the Lachnospiraceae family present in 99%. Nongut communities had significantly reduced overlap with the deeply sequenced twins’ fecal microbiota (18.3 ± 0.3%, 15.3 ± 0.3%). The MZ cotwins’ fecal DNA was deeply sequenced (3.8-6.3 Gbp/sample) and assembled reads were assigned to 25 genus-level phylogenetic bins. Only 17% of the genes in these bins were shared between the cotwins. Bins exhibited differences in their degree of sequence variation, gene content including the repertoire of carbohydrate active enzymes present within and between twins (e.g., predicted cellulases, dockerins), and transcriptional activities. These results provide an expanded perspective about features that make each of us unique life forms and directions for future characterization of our gut ecosystems. PMID:20363958

  10. Defining the Estimated Core Genome of Bacterial Populations Using a Bayesian Decision Model

    PubMed Central

    van Tonder, Andries J.; Mistry, Shilan; Bray, James E.; Hill, Dorothea M. C.; Cody, Alison J.; Farmer, Chris L.; Klugman, Keith P.; von Gottberg, Anne; Bentley, Stephen D.; Parkhill, Julian; Jolley, Keith A.; Maiden, Martin C. J.; Brueggemann, Angela B.

    2014-01-01

    The bacterial core genome is of intense interest and the volume of whole genome sequence data in the public domain available to investigate it has increased dramatically. The aim of our study was to develop a model to estimate the bacterial core genome from next-generation whole genome sequencing data and use this model to identify novel genes associated with important biological functions. Five bacterial datasets were analysed, comprising 2096 genomes in total. We developed a Bayesian decision model to estimate the number of core genes, calculated pairwise evolutionary distances (p-distances) based on nucleotide sequence diversity, and plotted the median p-distance for each core gene relative to its genome location. We designed visually-informative genome diagrams to depict areas of interest in genomes. Case studies demonstrated how the model could identify areas for further study, e.g. 25% of the core genes with higher sequence diversity in the Campylobacter jejuni and Neisseria meningitidis genomes encoded hypothetical proteins. The core gene with the highest p-distance value in C. jejuni was annotated in the reference genome as a putative hydrolase, but further work revealed that it shared sequence homology with beta-lactamase/metallo-beta-lactamases (enzymes that provide resistance to a range of broad-spectrum antibiotics) and thioredoxin reductase genes (which reduce oxidative stress and are essential for DNA replication) in other C. jejuni genomes. Our Bayesian model of estimating the core genome is principled, easy to use and can be applied to large genome datasets. This study also highlighted the lack of knowledge currently available for many core genes in bacterial genomes of significant global public health importance. PMID:25144616
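
    The p-distance used above is the proportion of aligned sites at which two sequences differ. The following Python sketch computes it for a pair of aligned sequences and takes the median over all pairs of alleles of one gene. Skipping sites with gaps or ambiguous bases is an assumption, as are the function names; the abstract does not specify these details.

```python
from itertools import combinations
from statistics import median
from typing import Sequence


def p_distance(a: str, b: str) -> float:
    """Proportion of compared sites that differ between two aligned sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    compared = diffs = 0
    for x, y in zip(a.upper(), b.upper()):
        # Assumption: only unambiguous A/C/G/T sites are compared.
        if x in "ACGT" and y in "ACGT":
            compared += 1
            diffs += x != y
    if compared == 0:
        raise ValueError("no comparable sites")
    return diffs / compared


def median_p_distance(alleles: Sequence[str]) -> float:
    """Median pairwise p-distance over all pairs of aligned alleles of a gene."""
    return median(p_distance(a, b) for a, b in combinations(alleles, 2))


# Toy alignment of three alleles of one hypothetical core gene.
print(median_p_distance(["AAAA", "AAAT", "AATT"]))
```

    Plotting this median for each core gene against its position in the genome gives the kind of diversity profile the abstract describes.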

  11. Scar-less multi-part DNA assembly design automation

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Hillson, Nathan J.

    The present invention provides a method of designing an implementation of a DNA assembly. In an exemplary embodiment, the method includes (1) receiving a list of DNA sequence fragments to be assembled together and an order in which to assemble the DNA sequence fragments, (2) designing DNA oligonucleotides (oligos) for each of the DNA sequence fragments, and (3) creating a plan for adding flanking homology sequences to each of the DNA oligos. In an exemplary embodiment, the method includes (1) receiving a list of DNA sequence fragments to be assembled together and an order in which to assemble the DNA sequence fragments, (2) designing DNA oligonucleotides (oligos) for each of the DNA sequence fragments, and (3) creating a plan for adding optimized overhang sequences to each of the DNA oligos.

  12. SeqWare Query Engine: storing and searching sequence data in the cloud.

    PubMed

    O'Connor, Brian D; Merriman, Barry; Nelson, Stanley F

    2010-12-21

    Since the introduction of next-generation DNA sequencers the rapid increase in sequencer throughput, and associated drop in costs, has resulted in more than a dozen human genomes being resequenced over the last few years. These efforts are merely a prelude for a future in which genome resequencing will be commonplace for both biomedical research and clinical applications. The dramatic increase in sequencer output strains all facets of computational infrastructure, especially databases and query interfaces. The advent of cloud computing, and a variety of powerful tools designed to process petascale datasets, provide a compelling solution to these ever increasing demands. In this work, we present the SeqWare Query Engine which has been created using modern cloud computing technologies and designed to support databasing information from thousands of genomes. Our backend implementation was built using the highly scalable, NoSQL HBase database from the Hadoop project. We also created a web-based frontend that provides both a programmatic and interactive query interface and integrates with widely used genome browsers and tools. Using the query engine, users can load and query variants (SNVs, indels, translocations, etc) with a rich level of annotations including coverage and functional consequences. As a proof of concept we loaded several whole genome datasets including the U87MG cell line. We also used a glioblastoma multiforme tumor/normal pair to both profile performance and provide an example of using the Hadoop MapReduce framework within the query engine. This software is open source and freely available from the SeqWare project (http://seqware.sourceforge.net). The SeqWare Query Engine provided an easy way to make the U87MG genome accessible to programmers and non-programmers alike. 
This enabled a faster and more open exploration of results, quicker tuning of parameters for heuristic variant calling filters, and a common data interface to simplify development of analytical tools. The range of data types supported, the ease of querying and integrating with existing tools, and the robust scalability of the underlying cloud-based technologies make SeqWare Query Engine a natural fit for storing and searching ever-growing genome sequence datasets.

  13. SeqWare Query Engine: storing and searching sequence data in the cloud

    PubMed Central

    2010-01-01

    Background: Since the introduction of next-generation DNA sequencers the rapid increase in sequencer throughput, and associated drop in costs, has resulted in more than a dozen human genomes being resequenced over the last few years. These efforts are merely a prelude for a future in which genome resequencing will be commonplace for both biomedical research and clinical applications. The dramatic increase in sequencer output strains all facets of computational infrastructure, especially databases and query interfaces. The advent of cloud computing, and a variety of powerful tools designed to process petascale datasets, provide a compelling solution to these ever increasing demands. Results: In this work, we present the SeqWare Query Engine which has been created using modern cloud computing technologies and designed to support databasing information from thousands of genomes. Our backend implementation was built using the highly scalable, NoSQL HBase database from the Hadoop project. We also created a web-based frontend that provides both a programmatic and interactive query interface and integrates with widely used genome browsers and tools. Using the query engine, users can load and query variants (SNVs, indels, translocations, etc) with a rich level of annotations including coverage and functional consequences. As a proof of concept we loaded several whole genome datasets including the U87MG cell line. We also used a glioblastoma multiforme tumor/normal pair to both profile performance and provide an example of using the Hadoop MapReduce framework within the query engine. This software is open source and freely available from the SeqWare project (http://seqware.sourceforge.net). Conclusions: The SeqWare Query Engine provided an easy way to make the U87MG genome accessible to programmers and non-programmers alike. 
This enabled a faster and more open exploration of results, quicker tuning of parameters for heuristic variant calling filters, and a common data interface to simplify development of analytical tools. The range of data types supported, the ease of querying and integrating with existing tools, and the robust scalability of the underlying cloud-based technologies make SeqWare Query Engine a natural fit for storing and searching ever-growing genome sequence datasets. PMID:21210981

  14. Flexbar 3.0 - SIMD and multicore parallelization.

    PubMed

    Roehr, Johannes T; Dieterich, Christoph; Reinert, Knut

    2017-09-15

    High-throughput sequencing machines can process many samples in a single run. For Illumina systems, sequencing reads are barcoded with an additional DNA tag that is contained in the respective sequencing adapters. The recognition of barcode and adapter sequences is hence commonly needed for the analysis of next-generation sequencing data. Flexbar performs demultiplexing based on barcodes and adapter trimming for such data. The massive amounts of data generated on modern sequencing machines demand that this preprocessing is done as efficiently as possible. We present Flexbar 3.0, the successor of the popular program Flexbar. It now employs twofold parallelism: multi-threading and, additionally, SIMD vectorization. Both types of parallelism are used to speed up the computation of pair-wise sequence alignments, which are used for the detection of barcodes and adapters. Furthermore, new features were included to cover a wide range of applications. We evaluated the performance of Flexbar based on a simulated sequencing dataset. Our program outcompetes other tools in terms of speed and is among the best tools in the presented quality benchmark. Availability: https://github.com/seqan/flexbar. Contact: johannes.roehr@fu-berlin.de or knut.reinert@fu-berlin.de. © The Author (2017). Published by Oxford University Press. All rights reserved.

  15. Expressed sequence tag based identification and expression analysis of some cold inducible elements in seabuckthorn (Hippophae rhamnoides L.).

    PubMed

    Ghangal, Rajesh; Raghuvanshi, Saurabh; Sharma, Prakash C

    2012-02-01

    A cDNA library was constructed from the mature leaves of seabuckthorn (Hippophae rhamnoides). Expressed Sequence Tags (ESTs) were generated by single pass sequencing of 4500 cDNA clones. We submitted 3412 ESTs to dbEST of NCBI. Clustering of these ESTs yielded 1665 unigenes comprising of 345 contigs and 1320 singletons. Out of 1665 unigenes, 1278 unigenes were annotated by similarity search while the remaining 387 unannotated unigenes were considered as organism specific. Gene Ontology (GO) analysis of the unigene dataset showed 691 unigenes related to biological processes, 727 to molecular functions and 588 to cellular component category. On the basis of similarity search and GO annotation, 43 unigenes were found responsive to biotic and abiotic stresses. To validate this observation, 13 genes that are known to be associated with cold stress tolerance from previous studies in Arabidopsis and 3 novel transcripts were examined by Real time RT-PCR to understand the change in expression pattern under cold/freeze stress. In silico study of occurrence of microsatellites in these ESTs revealed the presence of 62 Simple Sequence Repeats (SSRs), some of which are being explored to assess genetic diversity among seabuckthorn collections. This is the first report of generation of transcriptome data providing information about genes involved in managing plant abiotic stress in seabuckthorn, a plant known for its enormous medicinal and ecological value. Copyright © 2011 Elsevier Masson SAS. All rights reserved.

  16. Blast2GO goes grid: developing a grid-enabled prototype for functional genomics analysis.

    PubMed

    Aparicio, G; Götz, S; Conesa, A; Segrelles, D; Blanquer, I; García, J M; Hernandez, V; Robles, M; Talon, M

    2006-01-01

    The vast amount and complexity of data generated in genomic research implies that new dedicated and powerful computational tools need to be developed to meet their analysis requirements. Blast2GO (B2G) is a bioinformatics tool for Gene Ontology-based DNA or protein sequence annotation and function-based data mining. The application has been developed with the aim of offering an easy-to-use tool for functional genomics research. Typical B2G users are middle size genomics labs carrying out sequencing, EST and microarray projects, handling datasets up to several thousand sequences. In the current version of B2G, the power and analytical potential of both annotation and function data-mining is somewhat restricted to the computational power behind each particular installation. In order to offer an enhanced computational capacity within this bioinformatics application, a Grid component is being developed. A prototype has been conceived for the particular problem of speeding up the Blast searches to obtain fast results for large datasets. Many efforts have been reported in the literature concerning the speeding up of Blast searches, but few of them deal with the use of large heterogeneous production Grid Infrastructures. These are the infrastructures that could reach the largest number of resources and the best load balancing for data access. The Grid Service under development will analyse requests based on the number of sequences, splitting them according to the available resources. Lower-level computation will be performed through MPIBLAST. The software architecture is based on the WSRF standard.

  17. Enrichment allows identification of diverse, rare elements in metagenomic resistome-virulome sequencing.

    PubMed

    Noyes, Noelle R; Weinroth, Maggie E; Parker, Jennifer K; Dean, Chris J; Lakin, Steven M; Raymond, Robert A; Rovira, Pablo; Doster, Enrique; Abdo, Zaid; Martin, Jennifer N; Jones, Kenneth L; Ruiz, Jaime; Boucher, Christina A; Belk, Keith E; Morley, Paul S

    2017-10-17

    Shotgun metagenomic sequencing is increasingly utilized as a tool to evaluate ecological-level dynamics of antimicrobial resistance and virulence, in conjunction with microbiome analysis. Interest in use of this method for environmental surveillance of antimicrobial resistance and pathogenic microorganisms is also increasing. In published metagenomic datasets, the total of all resistance- and virulence-related sequences accounts for < 1% of all sequenced DNA, leading to limitations in detection of low-abundance resistome-virulome elements. This study describes the extent and composition of the low-abundance portion of the resistome-virulome, using a bait-capture and enrichment system that incorporates unique molecular indices to count DNA molecules and correct for enrichment bias. The use of the bait-capture and enrichment system significantly increased on-target sequencing of the resistome-virulome, enabling detection of an additional 1441 gene accessions and revealing a low-abundance portion of the resistome-virulome that was more diverse and compositionally different than that detected by more traditional metagenomic assays. The low-abundance portion of the resistome-virulome also contained resistance genes with public health importance, such as extended-spectrum betalactamases, that were not detected using traditional shotgun metagenomic sequencing. In addition, the use of the bait-capture and enrichment system enabled identification of rare resistance gene haplotypes that were used to discriminate between sample origins. These results demonstrate that the rare resistome-virulome contains valuable and unique information that can be utilized for both surveillance and population genetic investigations of resistance. Access to the rare resistome-virulome using the bait-capture and enrichment system validated in this study can greatly advance our understanding of microbiome-resistome dynamics.

  18. Surface ocean metabarcoding confirms limited diversity in planktonic foraminifera but reveals unknown hyper-abundant lineages.

    PubMed

    Morard, Raphaël; Garet-Delmas, Marie-José; Mahé, Frédéric; Romac, Sarah; Poulain, Julie; Kucera, Michal; de Vargas, Colomban

    2018-02-07

    Since the advent of DNA metabarcoding surveys, the planktonic realm is considered a treasure trove of diversity, inhabited by a small number of abundant taxa, and a hugely diverse and taxonomically uncharacterized consortium of rare species. Here we assess if the apparent underestimation of plankton diversity applies universally. We target planktonic foraminifera, a group of protists whose known morphological diversity is limited, taxonomically resolved and linked to ribosomal DNA barcodes. We generated a pyrosequencing dataset of ~100,000 partial 18S rRNA foraminiferal sequences from 32 size fractioned photic-zone plankton samples collected at 8 stations in the Indian and Atlantic Oceans during the Tara Oceans expedition (2009-2012). We identified 69 genetic types belonging to 41 morphotaxa in our metabarcoding dataset. The diversity saturated at local and regional scale as well as in the three size fractions and the two depths sampled indicating that the diversity of foraminifera is modest and finite. The large majority of the newly discovered lineages occur in the small size fraction, neglected by classical taxonomy. These unknown lineages dominate the bulk [>0.8 µm] size fraction, implying that a considerable part of the planktonic foraminifera community biomass has its origin in unknown lineages.

  19. A computational model for predicting integrase catalytic domain of retrovirus.

    PubMed

    Wu, Sijia; Han, Jiuqiang; Zhang, Xinman; Zhong, Dexing; Liu, Ruiling

    2017-06-21

    The integrase catalytic domain (ICD) is an essential part of the retrovirus for the integration reaction, which enables its newly synthesized DNA to be incorporated into the DNA of infected cells. Owing to the crucial role of ICD in retroviral replication and the absence of an equivalent of integrase in host cells, ICD is understandably a promising drug target for therapeutic intervention. However, the ICDs annotated in the UniProtKB database have so far been insufficient for a good understanding of their statistical characteristics. Accordingly, this work puts forward a computational ICD model to annotate these domains in retroviruses. The proposed model discovered 11,660 new putative ICDs after scanning sequences without ICD annotations. Subsequently, to provide more confidence in ICD prediction, it was tested under different cross-validation methods, compared with other database search tools, and verified on independent datasets. Furthermore, an evolutionary analysis performed on the annotated ICDs of retroviruses revealed a tight connection between ICD and retroviral classification. All the datasets involved in this paper and the application software tool of this model are available for free download at https://sourceforge.net/projects/icdtool/files/?source=navbar. Copyright © 2017 Elsevier Ltd. All rights reserved.

  20. Maternal Genetic Ancestry and Legacy of 10th Century AD Hungarians

    NASA Astrophysics Data System (ADS)

    Csősz, Aranka; Szécsényi-Nagy, Anna; Csákyová, Veronika; Langó, Péter; Bódis, Viktória; Köhler, Kitti; Tömöry, Gyöngyvér; Nagy, Melinda; Mende, Balázs Gusztáv

    2016-09-01

    The ancient Hungarians originated from the Ural region in today’s central Russia and migrated across the Eastern European steppe, according to historical sources. The Hungarians conquered the Carpathian Basin 895-907 AD, and admixed with the indigenous communities. Here we present mitochondrial DNA results from three datasets: one from the Avar period (7th-9th centuries) of the Carpathian Basin (n = 31); one from the Hungarian conquest-period (n = 76); and a completion of the published 10th-12th century Hungarian-Slavic contact zone dataset by four samples. We compare these mitochondrial DNA hypervariable segment sequences and haplogroup results with published ancient and modern Eurasian data. Whereas the analyzed Avars represent a certain group of the Avar society that shows East and South European genetic characteristics, the Hungarian conquerors’ maternal gene pool is a mixture of West Eurasian and Central and North Eurasian elements. Comprehensively analyzing the results, both the linguistically recorded Finno-Ugric roots and historically documented Turkic and Central Asian influxes had possible genetic imprints in the conquerors’ genetic composition. Our data allows a complex series of historic and population genetic events before the formation of the medieval population of the Carpathian Basin, and the maternal genetic continuity between 10th-12th century and modern Hungarians.

  1. Maternal Genetic Ancestry and Legacy of 10th Century AD Hungarians.

    PubMed

    Csősz, Aranka; Szécsényi-Nagy, Anna; Csákyová, Veronika; Langó, Péter; Bódis, Viktória; Köhler, Kitti; Tömöry, Gyöngyvér; Nagy, Melinda; Mende, Balázs Gusztáv

    2016-09-16

    The ancient Hungarians originated from the Ural region in today's central Russia and migrated across the Eastern European steppe, according to historical sources. The Hungarians conquered the Carpathian Basin 895-907 AD, and admixed with the indigenous communities. Here we present mitochondrial DNA results from three datasets: one from the Avar period (7th-9th centuries) of the Carpathian Basin (n = 31); one from the Hungarian conquest-period (n = 76); and a completion of the published 10th-12th century Hungarian-Slavic contact zone dataset by four samples. We compare these mitochondrial DNA hypervariable segment sequences and haplogroup results with published ancient and modern Eurasian data. Whereas the analyzed Avars represent a certain group of the Avar society that shows East and South European genetic characteristics, the Hungarian conquerors' maternal gene pool is a mixture of West Eurasian and Central and North Eurasian elements. Comprehensively analyzing the results, both the linguistically recorded Finno-Ugric roots and historically documented Turkic and Central Asian influxes had possible genetic imprints in the conquerors' genetic composition. Our data allows a complex series of historic and population genetic events before the formation of the medieval population of the Carpathian Basin, and the maternal genetic continuity between 10th-12th century and modern Hungarians.

  2. Maternal Genetic Ancestry and Legacy of 10th Century AD Hungarians

    PubMed Central

    Csősz, Aranka; Szécsényi-Nagy, Anna; Csákyová, Veronika; Langó, Péter; Bódis, Viktória; Köhler, Kitti; Tömöry, Gyöngyvér; Nagy, Melinda; Mende, Balázs Gusztáv

    2016-01-01

    The ancient Hungarians originated from the Ural region in today’s central Russia and migrated across the Eastern European steppe, according to historical sources. The Hungarians conquered the Carpathian Basin 895–907 AD, and admixed with the indigenous communities. Here we present mitochondrial DNA results from three datasets: one from the Avar period (7th–9th centuries) of the Carpathian Basin (n = 31); one from the Hungarian conquest-period (n = 76); and a completion of the published 10th–12th century Hungarian-Slavic contact zone dataset by four samples. We compare these mitochondrial DNA hypervariable segment sequences and haplogroup results with published ancient and modern Eurasian data. Whereas the analyzed Avars represent a certain group of the Avar society that shows East and South European genetic characteristics, the Hungarian conquerors’ maternal gene pool is a mixture of West Eurasian and Central and North Eurasian elements. Comprehensively analyzing the results, both the linguistically recorded Finno-Ugric roots and historically documented Turkic and Central Asian influxes had possible genetic imprints in the conquerors’ genetic composition. Our data allows a complex series of historic and population genetic events before the formation of the medieval population of the Carpathian Basin, and the maternal genetic continuity between 10th–12th century and modern Hungarians. PMID:27633963

  3. Revising the recent evolutionary history of equids using ancient DNA.

    PubMed

    Orlando, Ludovic; Metcalf, Jessica L; Alberdi, Maria T; Telles-Antunes, Miguel; Bonjean, Dominique; Otte, Marcel; Martin, Fabiana; Eisenmann, Véra; Mashkour, Marjan; Morello, Flavia; Prado, Jose L; Salas-Gismondi, Rodolfo; Shockey, Bruce J; Wrinn, Patrick J; Vasil'ev, Sergei K; Ovodov, Nikolai D; Cherry, Michael I; Hopwood, Blair; Male, Dean; Austin, Jeremy J; Hänni, Catherine; Cooper, Alan

    2009-12-22

    The rich fossil record of the family Equidae (Mammalia: Perissodactyla) over the past 55 MY has made it an icon for the patterns and processes of macroevolution. Despite this, many aspects of equid phylogenetic relationships and taxonomy remain unresolved. Recent genetic analyses of extinct equids have revealed unexpected evolutionary patterns and a need for major revisions at the generic, subgeneric, and species levels. To investigate this issue we examine 35 ancient equid specimens from four geographic regions (South America, Europe, Southwest Asia, and South Africa), of which 22 delivered 87-688 bp of reproducible aDNA mitochondrial sequence. Phylogenetic analyses support a major revision of the recent evolutionary history of equids and reveal two new species, a South American hippidion and a descendant of a basal lineage potentially related to Middle Pleistocene equids. Sequences from specimens assigned to the giant extinct Cape zebra, Equus capensis, formed a separate clade within the modern plain zebra species, a phenotypically plastic group that also included the extinct quagga. In addition, we revise the currently recognized extinction times for two hemione-related equid groups. However, it is apparent that the current dataset cannot solve all of the taxonomic and phylogenetic questions relevant to the evolution of Equus. In light of these findings, we propose a rapid DNA barcoding approach to evaluate the taxonomic status of the many Late Pleistocene fossil Equidae species that have been described from purely morphological analyses.

  4. Exploring Pandora's Box: Potential and Pitfalls of Low Coverage Genome Surveys for Evolutionary Biology

    PubMed Central

    Leese, Florian; Mayer, Christoph; Agrawal, Shobhit; Dambach, Johannes; Dietz, Lars; Doemel, Jana S.; Goodall-Copstake, William P.; Held, Christoph; Jackson, Jennifer A.; Lampert, Kathrin P.; Linse, Katrin; Macher, Jan N.; Nolzen, Jennifer; Raupach, Michael J.; Rivera, Nicole T.; Schubart, Christoph D.; Striewski, Sebastian; Tollrian, Ralph; Sands, Chester J.

    2012-01-01

    High throughput sequencing technologies are revolutionizing genetic research. With this “rise of the machines”, genomic sequences can be obtained even for unknown genomes within a short time and for reasonable costs. This has enabled evolutionary biologists studying genetically unexplored species to identify molecular markers or genomic regions of interest (e.g. micro- and minisatellites, mitochondrial and nuclear genes) by sequencing only a fraction of the genome. However, when using such datasets from non-model species, it is possible that DNA from non-target contaminant species such as bacteria, viruses, fungi, or other eukaryotic organisms may complicate the interpretation of the results. In this study we analysed 14 genomic pyrosequencing libraries of aquatic non-model taxa from four major evolutionary lineages. We quantified the amount of suitable micro- and minisatellites, mitochondrial genomes, known nuclear genes and transposable elements and searched for contamination from various sources using bioinformatic approaches. Our results show that in all sequence libraries with estimated coverage of about 0.02–25%, many appropriate micro- and minisatellites, mitochondrial gene sequences and nuclear genes from different KEGG (Kyoto Encyclopedia of Genes and Genomes) pathways could be identified and characterized. These can serve as markers for phylogenetic and population genetic analyses. A central finding of our study is that several genomic libraries suffered from different biases owing to non-target DNA or mobile elements. In particular, viruses, bacteria or eukaryote endosymbionts contributed significantly (up to 10%) to some of the libraries analysed. If not identified as such, genetic markers developed from high-throughput sequencing data for non-model organisms may bias evolutionary studies or fail completely in experimental tests. 
    In conclusion, our study demonstrates the enormous potential of low-coverage genome survey sequences and suggests bioinformatic analysis workflows. The results also advise a more sophisticated filtering for problematic sequences and non-target genome sequences prior to developing markers. PMID:23185309

  5. Sequential addition of short DNA oligos in DNA-polymerase-based synthesis reactions

    DOEpatents

    Gardner, Shea N; Mariella, Jr., Raymond P; Christian, Allen T; Young, Jennifer A; Clague, David S

    2013-06-25

    A method of preselecting a multiplicity of DNA sequence segments that will comprise the DNA molecule of user-defined sequence, separating the DNA sequence segments temporally, and combining the multiplicity of DNA sequence segments with at least one polymerase enzyme wherein the multiplicity of DNA sequence segments join to produce the DNA molecule of user-defined sequence. Sequence segments may be of length n, where n is an odd integer. In one embodiment the length of desired hybridizing overlap is specified by the user and the sequences and the protocol for combining them are guided by computational (bioinformatics) predictions. In one embodiment sequence segments are combined from multiple reading frames to span the same region of a sequence, so that multiple desired hybridizations may occur with different overlap lengths.

  6. Sequential addition of short DNA oligos in DNA-polymerase-based synthesis reactions

    DOEpatents

    Gardner, Shea N [San Leandro, CA]; Mariella, Jr., Raymond P.; Christian, Allen T [Tracy, CA]; Young, Jennifer A [Berkeley, CA]; Clague, David S [Livermore, CA]

    2011-01-18

    A method of fabricating a DNA molecule of user-defined sequence. The method comprises the steps of preselecting a multiplicity of DNA sequence segments that will comprise the DNA molecule of user-defined sequence, separating the DNA sequence segments temporally, and combining the multiplicity of DNA sequence segments with at least one polymerase enzyme wherein the multiplicity of DNA sequence segments join to produce the DNA molecule of user-defined sequence. Sequence segments may be of length n, where n is an even or odd integer. In one embodiment the length of desired hybridizing overlap is specified by the user and the sequences and the protocol for combining them are guided by computational (bioinformatics) predictions. In one embodiment sequence segments are combined from multiple reading frames to span the same region of a sequence, so that multiple desired hybridizations may occur with different overlap lengths. In one embodiment starting sequence fragments are of different lengths, n, n+1, n+2, etc.
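
    The idea of user-specified segment length n and hybridizing overlap can be illustrated with a small tiling routine. This is only a minimal sketch under stated assumptions, not the patented protocol: it ignores strand orientation, melting temperatures, the temporal ordering of additions, and the multiple-reading-frame embodiment, and the function names `tile_with_overlap` and `reassemble` are hypothetical.

```python
def tile_with_overlap(target, segment_len, overlap):
    """Split `target` into segments of length `segment_len` in which each
    consecutive pair shares `overlap` bases (the last segment may be shorter)."""
    if not 0 < overlap < segment_len:
        raise ValueError("overlap must be positive and smaller than segment_len")
    step = segment_len - overlap
    segments = []
    start = 0
    while True:
        segments.append(target[start:start + segment_len])
        if start + segment_len >= len(target):
            break  # this segment already reaches the end of the target
        start += step
    return segments


def reassemble(segments, overlap):
    """Rebuild the target by dropping the shared prefix of every later segment."""
    return segments[0] + "".join(seg[overlap:] for seg in segments[1:])


if __name__ == "__main__":
    target = "ACGTACGTAC"
    parts = tile_with_overlap(target, segment_len=4, overlap=2)
    print(parts)
    print(reassemble(parts, overlap=2) == target)
```

    Keeping the overlap as an explicit parameter mirrors the abstract's statement that the overlap length is chosen by the user; a real design would additionally score each overlap for hybridization behaviour.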

  7. The green impact: bacterioplankton response toward a phytoplankton spring bloom in the southern North Sea assessed by comparative metagenomic and metatranscriptomic approaches

    PubMed Central

    Wemheuer, Bernd; Wemheuer, Franziska; Hollensteiner, Jacqueline; Meyer, Frauke-Dorothee; Voget, Sonja; Daniel, Rolf

    2015-01-01

    Phytoplankton blooms exhibit a severe impact on bacterioplankton communities as they change nutrient availabilities and other environmental factors. In the current study, the response of a bacterioplankton community to a Phaeocystis globosa spring bloom was investigated in the southern North Sea. For this purpose, water samples were taken inside and reference samples outside of an algal spring bloom. Structural changes of the bacterioplankton community were assessed by amplicon-based analysis of 16S rRNA genes and transcripts generated from environmental DNA and RNA, respectively. Several marine groups responded to bloom presence. The abundance of the Roseobacter RCA cluster and the SAR92 clade significantly increased in bloom presence in the total and active fraction of the bacterial community. Functional changes were investigated by direct sequencing of environmental DNA and mRNA. The corresponding datasets comprised more than 500 million sequences across all samples. Metatranscriptomic data sets were mapped on representative genomes of abundant marine groups present in the samples and on assembled metagenomic and metatranscriptomic datasets. Differences in gene expression profiles between non-bloom and bloom samples were recorded. The genome-wide gene expression level of Planktomarina temperata, an abundant member of the Roseobacter RCA cluster, was higher inside the bloom. Genes that were differently expressed included transposases, which showed increased expression levels inside the bloom. This might contribute to the adaptation of this organism toward environmental stresses through genome reorganization. In addition, several genes affiliated to the SAR92 clade were significantly upregulated inside the bloom including genes encoding for proteins involved in isoleucine and leucine incorporation. Obtained results provide novel insights into compositional and functional variations of marine bacterioplankton communities as response to a phytoplankton bloom. 
    PMID:26322028

  8. LDSplitDB: a database for studies of meiotic recombination hotspots in MHC using human genomic data.

    PubMed

    Guo, Jing; Chen, Hao; Yang, Peng; Lee, Yew Ti; Wu, Min; Przytycka, Teresa M; Kwoh, Chee Keong; Zheng, Jie

    2018-04-20

    Meiotic recombination happens during the process of meiosis when chromosomes inherited from two parents exchange genetic materials to generate chromosomes in the gamete cells. The recombination events tend to occur in narrow genomic regions called recombination hotspots. Its dysregulation could lead to serious human diseases such as birth defects. Although the regulatory mechanism of recombination events is still unclear, DNA sequence polymorphisms have been found to play crucial roles in the regulation of recombination hotspots. To facilitate the studies of the underlying mechanism, we developed a database named LDSplitDB which provides an integrative and interactive data mining and visualization platform for the genome-wide association studies of recombination hotspots. It contains the pre-computed association maps of the major histocompatibility complex (MHC) region in the 1000 Genomes Project and the HapMap Phase III datasets, and a genome-scale study of the European population from the HapMap Phase II dataset. Besides the recombination profiles, related data of genes, SNPs and different types of epigenetic modifications, which could be associated with meiotic recombination, are provided for comprehensive analysis. To meet the computational requirement of the rapidly increasing population genomics data, we prepared a lookup table of 400 haplotypes for recombination rate estimation using the well-known LDhat algorithm which includes all possible two-locus haplotype configurations. To the best of our knowledge, LDSplitDB is the first large-scale database for the association analysis of human recombination hotspots with DNA sequence polymorphisms. It provides valuable resources for the discovery of the mechanism of meiotic recombination hotspots. The information about MHC in this database could help understand the roles of recombination in human immune system. DATABASE URL: http://histone.scse.ntu.edu.sg/LDSplitDB.

  9. Piecing together cis-regulatory networks: insights from epigenomics studies in plants.

    PubMed

    Huang, Shao-Shan C; Ecker, Joseph R

    2018-05-01

    5-Methylcytosine, a chemical modification of DNA, is a covalent modification found in the genomes of both plants and animals. Epigenetic inheritance of phenotypes mediated by DNA methylation is well established in plants. Most of the known mechanisms of establishing, maintaining and modifying DNA methylation have been worked out in the reference plant Arabidopsis thaliana. Major functions of DNA methylation in plants include regulation of gene expression and silencing of transposable elements (TEs) and repetitive sequences, both of which have parallels in mammalian biology, involve interaction with the transcriptional machinery, and may have profound effects on the regulatory networks in the cell. Methylome and transcriptome dynamics have been investigated in development and environmental responses in Arabidopsis and agriculturally and ecologically important plants, revealing the interdependent relationship among genomic context, methylation patterns, and expression of TE and protein coding genes. Analyses of methylome variation among plant natural populations and species have begun to quantify the extent of genetic control of methylome variation vs. true epimutation, and model the evolutionary forces driving methylome evolution in both short and long time scales. The ability of DNA methylation to positively or negatively modulate binding affinity of transcription factors (TFs) provides a natural link from genome sequence and methylation changes to transcription. Technologies that allow systematic determination of methylation sensitivities of TFs, in native genomic and methylation context without confounding factors such as histone modifications, will provide baseline datasets for building cell-type- and individual-specific regulatory networks that underlie the establishment and inheritance of complex traits. This article is categorized under: Laboratory Methods and Technologies > Genetic/Genomic Methods Biological Mechanisms > Regulatory Biology. 
    © 2017 Wiley Periodicals, Inc.

  10. Identification of Human Lineage-Specific Transcriptional Coregulators Enabled by a Glossary of Binding Modules and Tunable Genomic Backgrounds.

    PubMed

    Mariani, Luca; Weinand, Kathryn; Vedenko, Anastasia; Barrera, Luis A; Bulyk, Martha L

    2017-09-27

    Transcription factors (TFs) control cellular processes by binding specific DNA motifs to modulate gene expression. Motif enrichment analysis of regulatory regions can identify direct and indirect TF binding sites. Here, we created a glossary of 108 non-redundant TF-8mer "modules" of shared specificity for 671 metazoan TFs from publicly available and new universal protein binding microarray data. Analysis of 239 ENCODE TF chromatin immunoprecipitation sequencing datasets and associated RNA sequencing profiles suggests the 8mer modules are more precise than position weight matrices in identifying indirect binding motifs and their associated tethering TFs. We also developed GENRE (genomically equivalent negative regions), a tunable tool for construction of matched genomic background sequences for analysis of regulatory regions. GENRE outperformed four state-of-the-art approaches to background sequence construction. We used our TF-8mer glossary and GENRE in the analysis of the indirect binding motifs for the co-occurrence of tethering factors, suggesting novel TF-TF interactions. We anticipate that these tools will aid in elucidating tissue-specific gene-regulatory programs. Copyright © 2017 Elsevier Inc. All rights reserved.

  11. Rapid Fine Conformational Epitope Mapping Using Comprehensive Mutagenesis and Deep Sequencing

    PubMed Central

    Kowalsky, Caitlin A.; Faber, Matthew S.; Nath, Aritro; Dann, Hailey E.; Kelly, Vince W.; Liu, Li; Shanker, Purva; Wagner, Ellen K.; Maynard, Jennifer A.; Chan, Christina; Whitehead, Timothy A.

    2015-01-01

    Knowledge of the fine location of neutralizing and non-neutralizing epitopes on human pathogens affords a better understanding of the structural basis of antibody efficacy, which will expedite rational design of vaccines, prophylactics, and therapeutics. However, full utilization of the wealth of information from single cell techniques and antibody repertoire sequencing awaits the development of a high throughput, inexpensive method to map the conformational epitopes for antibody-antigen interactions. Here we show such an approach that combines comprehensive mutagenesis, cell surface display, and DNA deep sequencing. We develop analytical equations to identify epitope positions and show the method effectiveness by mapping the fine epitope for different antibodies targeting TNF, pertussis toxin, and the cancer target TROP2. In all three cases, the experimentally determined conformational epitope was consistent with previous experimental datasets, confirming the reliability of the experimental pipeline. Once the comprehensive library is generated, fine conformational epitope maps can be prepared at a rate of four per day. PMID:26296891

  12. PM-07 LOSS OF ATRX DECREASES SURVIVAL AND IMPROVES RESPONSE TO DNA DAMAGING AGENTS IN A NOVEL MOUSE MODEL OF GLIOBLASTOMA

    PubMed Central

    Koschmann, Carl; Calinescu, Alexandra; Thomas, Daniel; Kamran, Neha; Nunez-Aguilera, Felipe; Dzaman, Marta; Lemons, Rosie; Li, Youping; Roh, Haeji; Lowenstein, Pedro; Castro, Maria

    2014-01-01

    Pediatric glioblastoma (GBM) remains one of the most difficult childhood tumors to treat. ATRX is a histone chaperone protein that is mutated primarily in younger patients with GBM. No previous animal model has demonstrated the effect of ATRX loss on GBM formation. We cloned an ATRX knockdown sequence into a Sleeping Beauty (SB) transposase-responsive plasmid (shATRX) for insertion into host genomic DNA. Glioblastomas were induced in mice by injecting plasmids encoding SB transposase/luciferase, shp53 and NRAS, with or without shATRX, into the ventricle of neonatal mice. Tumors in both groups (with or without shATRX) showed histological hallmarks of human glioblastoma. The loss of ATRX was specifically localized only within tumors generated with the shATRX plasmid and not in the adjacent cortex. Notably, loss of ATRX reduced median survival of mice by 43% (p = 0.012). ATRX-deficient tumors were significantly more likely to develop microsatellite instability (p = 0.014), a hallmark of impaired DNA-damage repair. Analysis of three human GBM sequencing datasets confirmed an increased number of somatic nucleotide mutations in ATRX-deficient tumors. Treatment of primary cell cultures generated from mouse GBMs showed that ATRX-deficient tumor cells are significantly more sensitive to DNA damaging agents. In addition, mice with ATRX-deficient GBM treated with whole brain irradiation showed a trend toward improved survival (p = 0.06), with some long-term survivors. Treated ATRX-deficient tumor cells showed greater evidence of double-stranded DNA breakage, by gH2A.X. In summary, this mouse model prospectively validates ATRX as a tumor suppressor in human GBM for the first time in an animal model. In addition, loss of ATRX leads to increased mutation frequency and response to DNA-damaging therapy. We have generated the hypothesis that ATRX loss leads to a genetically unstable tumor, which is more aggressive when untreated but more responsive to DNA-damaging therapy, ultimately resulting in equivalent or improved overall survival.

  13. Morphological and genetic analysis of three new species of Ceratomyxa Thélohan, 1892 (Myxozoa: Myxosporea) from carcharhinid sharks off Australia.

    PubMed

    Gleeson, Ricky; Adlard, Robert

    2011-10-01

    Three new species of Ceratomyxa Thélohan, 1892 are described from the gall-bladders of two species of carcharhinid sharks collected off Heron and Lizard Islands on the Great Barrier Reef, Australia. Ceratomyxa carcharhini n. sp. and C. melanopteri n. sp. are described from Carcharhinus melanopterus (Quoy & Gaimard), and Ceratomyxa negaprioni n. sp. is described from Negaprion acutidens (Rüppell). These species are the first ceratomyxids reported from Australian elasmobranchs, and this is the first paper to formally characterise a novel Ceratomyxa species from an elasmobranch using both morphology and small subunit ribosomal DNA sequence data. Maximum parsimony and Bayesian inference analyses of the SSU rDNA dataset revealed that ceratomyxids from elasmobranchs form a sister clade to that of species infecting marine teleosts and Palliatus indecorus Schulman, Kovaleva & Dubina, 1979. Furthermore, the only sequenced freshwater ceratomyxid, Ceratomyxa shasta Noble, 1950, fell outside the overall marine ceratomyxid clade. These data show that Ceratomyxa, as currently recognised, is polyphyletic and ignites discussion on whether Ceratomyxa should be split. However, further taxon sampling, particularly in freshwater systems, is required to establish relevant biological divisions within the genus.

  14. A massively parallel strategy for STR marker development, capture, and genotyping.

    PubMed

    Kistler, Logan; Johnson, Stephen M; Irwin, Mitchell T; Louis, Edward E; Ratan, Aakrosh; Perry, George H

    2017-09-06

    Short tandem repeat (STR) variants are highly polymorphic markers that facilitate powerful population genetic analyses. STRs are especially valuable in conservation and ecological genetic research, yielding detailed information on population structure and short-term demographic fluctuations. Massively parallel sequencing has not previously been leveraged for scalable, efficient STR recovery. Here, we present a pipeline for developing STR markers directly from high-throughput shotgun sequencing data without a reference genome, and an approach for highly parallel target STR recovery. We employed our approach to capture a panel of 5000 STRs from a test group of diademed sifakas (Propithecus diadema, n = 3), endangered Malagasy rainforest lemurs, and we report extremely efficient recovery of targeted loci: 97.3-99.6% of STRs characterized with ≥10x non-redundant sequence coverage. We then tested our STR capture strategy on P. diadema fecal DNA, and report robust initial results and suggestions for future implementations. In addition to STR targets, this approach also generates large, genome-wide single nucleotide polymorphism (SNP) panels from flanking regions. Our method provides a cost-effective and scalable solution for rapid recovery of large STR and SNP datasets in any species without needing a reference genome, and can be used even with suboptimal DNA more easily acquired in conservation and ecological studies. Published by Oxford University Press on behalf of Nucleic Acids Research 2017.
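
    As an illustration of what locating STR candidates in a read can look like, the sketch below scans a sequence for perfect tandem repeats with a regular expression. It is not the authors' pipeline: the function name `find_strs`, the unit-length range (2-6 bp), and the minimum repeat count are assumptions chosen for the example, and real marker development would also handle imperfect repeats, flanking-sequence quality, and redundancy between reads.

```python
import re


def find_strs(seq, min_unit=2, max_unit=6, min_repeats=5):
    """Return (start, unit, repeat_count) for perfect tandem repeats in `seq`.

    The lazy quantifier makes the regex prefer the shortest repeat unit, so a
    run such as ACACACACAC is reported with unit 'AC' rather than 'ACAC'.
    """
    pattern = re.compile(
        r"([ACGT]{%d,%d}?)\1{%d,}" % (min_unit, max_unit, min_repeats - 1)
    )
    hits = []
    for match in pattern.finditer(seq.upper()):
        unit = match.group(1)
        if len(set(unit)) == 1:
            continue  # skip homopolymer runs such as AAAAAAAAAA
        repeats = len(match.group(0)) // len(unit)
        hits.append((match.start(), unit, repeats))
    return hits


if __name__ == "__main__":
    read = "GGT" + "AC" * 6 + "TTG"
    print(find_strs(read))
```

    The start coordinate makes it straightforward to extract flanking sequence around each hit, which is the part a capture or primer design step would need.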

  15. Application of Quaternion in improving the quality of global sequence alignment scores for an ambiguous sequence target in Streptococcus pneumoniae DNA

    NASA Astrophysics Data System (ADS)

    Lestari, D.; Bustamam, A.; Novianti, T.; Ardaneswari, G.

    2017-07-01

    A DNA sequence can be defined as a succession of letters representing the order of nucleotides within DNA, using the four DNA base codes adenine (A), guanine (G), cytosine (C), and thymine (T). The precise code of a sequence is determined using DNA sequencing methods and technologies, which have been developed since the 1970s and have now become advanced, high-throughput technologies. DNA sequencing has greatly accelerated biological and medical research and discovery. However, in some cases DNA sequencing produces ambiguous results, making it difficult to determine whether a given code is A, T, G, or C. To solve this problem, in this study we introduce another representation of DNA codes, namely the Quaternion Q = (PA, PT, PG, PC), where PA, PT, PG, PC are the probabilities that the bases A, T, G, C appear in Q, and PA + PT + PG + PC = 1. Using Quaternion representations, we construct an improved scoring matrix for global sequence alignment by applying a dot product method. This scoring matrix produces better-quality match and mismatch scores between two DNA base codes. In our implementation, we applied the Needleman-Wunsch global sequence alignment algorithm using Octave to analyze our target sequence, which contains some ambiguous sequence data. The subject sequences are DNA sequences of Streptococcus pneumoniae families obtained from GenBank, while the target DNA sequence was received from our collaborator's database. We found that the Quaternion representations improve the quality of the sequence alignment score, and we conclude that the target DNA sequence has maximum similarity with Streptococcus pneumoniae.
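
    The probability-vector representation and dot-product scoring described above can be sketched as follows. This is a hedged illustration in Python rather than the authors' Octave code, and several details are assumptions not stated in the abstract: ambiguity is taken from IUPAC codes with uniform probabilities, the dot product is mapped linearly onto match and mismatch scores, and a linear gap penalty of -2 is used. The names `to_q`, `pair_score` and `needleman_wunsch_score` are hypothetical.

```python
# Component order follows Q = (PA, PT, PG, PC) from the abstract.
BASES = "ATGC"

# Assumption: ambiguous calls arrive as IUPAC codes and the probability mass
# is spread uniformly over the bases each code allows.
IUPAC = {
    "A": "A", "T": "T", "G": "G", "C": "C",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def to_q(symbol):
    """Return the 4-component probability vector (PA, PT, PG, PC) for a symbol."""
    allowed = IUPAC[symbol.upper()]
    return tuple(1.0 / len(allowed) if base in allowed else 0.0 for base in BASES)


def pair_score(a, b, match=1.0, mismatch=-1.0):
    """Expected score of aligning a with b; the dot product of the two vectors
    is the probability that both positions hold the same base."""
    p_same = sum(x * y for x, y in zip(to_q(a), to_q(b)))
    return p_same * match + (1.0 - p_same) * mismatch


def needleman_wunsch_score(s, t, gap=-2.0):
    """Global alignment score by the standard Needleman-Wunsch recurrence."""
    rows, cols = len(s) + 1, len(t) + 1
    f = [[0.0] * cols for _ in range(rows)]
    for i in range(1, rows):
        f[i][0] = i * gap
    for j in range(1, cols):
        f[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            f[i][j] = max(
                f[i - 1][j - 1] + pair_score(s[i - 1], t[j - 1]),  # align pair
                f[i - 1][j] + gap,                                 # gap in t
                f[i][j - 1] + gap,                                 # gap in s
            )
    return f[-1][-1]


if __name__ == "__main__":
    print(needleman_wunsch_score("ACGT", "ACGT"))
    print(needleman_wunsch_score("ACGT", "ACNT"))
```

    Under these assumed parameters an unambiguous base scores exactly as in ordinary match/mismatch scoring, while an ambiguous position contributes a graded value (for example, N against G scores -0.5 instead of a flat mismatch of -1).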

  16. Localized population divergence of vervet monkeys (Chlorocebus spp.) in South Africa: evidence from mtDNA

    PubMed Central

    Turner, Trudy R.; Coetzer, Willem G.; Schmitt, Christopher A.; Lorenz, Joseph G.; Freimer, Nelson B.; Grobler, J. Paul

    2015-01-01

    Objectives: Vervet monkeys are common in most tree-rich areas of South Africa, but their absence from grassland and semi-desert areas of the country suggests potentially restricted and mosaic local population patterns that may have relevance to local phenotype patterns and selection. A portion of the mtDNA control region was sequenced to study patterns of genetic differentiation. Materials and Methods: DNA was extracted and mtDNA sequences were obtained from 101 vervet monkeys at 15 localities which represent both an extensive (widely across the distribution range) and intensive (more than one troop at most of the localities) sampling strategy. Analyses utilized Arlequin 3.1, MEGA 6, BEAST v1.5.2 and Network V3.6.1. Results: The dataset contained 26 distinct haplotypes, with six populations fixed for single haplotypes. Pairwise P-distance among population pairs showed significant differentiation among most population pairs, but with non-significant differences among populations within some regions. Populations were grouped into three broad clusters in a maximum likelihood phylogenetic tree and a haplotype network. These clusters correspond to (i) north-western, northern and north-eastern parts of the distribution range as well as the northern coastal belt; (ii) central areas of the country; and (iii) southern part of the Indian Ocean coastal belt, and adjacent inland areas. Discussion: Apparent patterns of genetic structure correspond to current and past distribution of suitable habitat, geographic barriers to gene flow, geographic distance and female philopatry. However, further work on nuclear markers and other genomic data is necessary to confirm these results. PMID:26265297
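
    The two summary quantities mentioned in the results, distinct haplotypes and pairwise p-distance, are simple to compute from aligned sequences. The sketch below is an illustration only and does not reproduce the Arlequin or MEGA analyses: the function names are hypothetical, sites with gaps or ambiguous bases are skipped (an assumption about missing-data handling), and no significance testing is performed.

```python
from collections import Counter


def p_distance(a, b):
    """Proportion of differing sites between two aligned, equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to equal length")
    # Assumption: sites where either sequence has a gap or ambiguity are ignored.
    pairs = [
        (x, y) for x, y in zip(a.upper(), b.upper())
        if x in "ACGT" and y in "ACGT"
    ]
    if not pairs:
        raise ValueError("no comparable sites")
    return sum(x != y for x, y in pairs) / len(pairs)


def haplotype_counts(seqs):
    """Count identical sequences; each distinct key is one haplotype."""
    return Counter(s.upper() for s in seqs)


def mean_between_population_distance(pop1, pop2):
    """Average p-distance over all pairs drawn from two populations."""
    dists = [p_distance(a, b) for a in pop1 for b in pop2]
    return sum(dists) / len(dists)


if __name__ == "__main__":
    north = ["ACGT", "ACGT"]
    south = ["ACGA"]
    print(len(haplotype_counts(north + south)))
    print(mean_between_population_distance(north, south))
```

    A population that is fixed for a single haplotype, as reported for six localities, would show exactly one key in `haplotype_counts` for its samples.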

  17. The evolutionary dynamics of the lion Panthera leo revealed by host and viral population genomics.

    PubMed

    Antunes, Agostinho; Troyer, Jennifer L; Roelke, Melody E; Pecon-Slattery, Jill; Packer, Craig; Winterbach, Christiaan; Winterbach, Hanlie; Hemson, Graham; Frank, Laurence; Stander, Philip; Siefert, Ludwig; Driciru, Margaret; Funston, Paul J; Alexander, Kathy A; Prager, Katherine C; Mills, Gus; Wildt, David; Bush, Mitch; O'Brien, Stephen J; Johnson, Warren E

    2008-11-01

    The lion Panthera leo is one of the world's most charismatic carnivores and is one of Africa's key predators. Here, we used a large dataset from 357 lions comprehending 1.13 megabases of sequence data and genotypes from 22 microsatellite loci to characterize its recent evolutionary history. Patterns of molecular genetic variation in multiple maternal (mtDNA), paternal (Y-chromosome), and biparental nuclear (nDNA) genetic markers were compared with patterns of sequence and subtype variation of the lion feline immunodeficiency virus (FIV(Ple)), a lentivirus analogous to human immunodeficiency virus (HIV). In spite of the ability of lions to disperse long distances, patterns of lion genetic diversity suggest substantial population subdivision (mtDNA Phi(ST) = 0.92; nDNA F(ST) = 0.18), and reduced gene flow, which, along with large differences in sero-prevalence of six distinct FIV(Ple) subtypes among lion populations, refute the hypothesis that African lions consist of a single panmictic population. Our results suggest that extant lion populations derive from several Pleistocene refugia in East and Southern Africa (approximately 324,000-169,000 years ago), which expanded during the Late Pleistocene (approximately 100,000 years ago) into Central and North Africa and into Asia. During the Pleistocene/Holocene transition (approximately 14,000-7,000 years), another expansion occurred from southern refugia northwards towards East Africa, causing population interbreeding. In particular, lion and FIV(Ple) variation affirms that the large, well-studied lion population occupying the greater Serengeti Ecosystem is derived from three distinct populations that admixed recently.

  18. The Evolutionary Dynamics of the Lion Panthera leo Revealed by Host and Viral Population Genomics

    PubMed Central

    Antunes, Agostinho; Troyer, Jennifer L.; Roelke, Melody E.; Pecon-Slattery, Jill; Packer, Craig; Winterbach, Christiaan; Winterbach, Hanlie; Hemson, Graham; Frank, Laurence; Stander, Philip; Siefert, Ludwig; Driciru, Margaret; Funston, Paul J.; Alexander, Kathy A.; Prager, Katherine C.; Mills, Gus; Wildt, David; Bush, Mitch; O'Brien, Stephen J.; Johnson, Warren E.

    2008-01-01

    The lion Panthera leo is one of the world's most charismatic carnivores and is one of Africa's key predators. Here, we used a large dataset from 357 lions comprehending 1.13 megabases of sequence data and genotypes from 22 microsatellite loci to characterize its recent evolutionary history. Patterns of molecular genetic variation in multiple maternal (mtDNA), paternal (Y-chromosome), and biparental nuclear (nDNA) genetic markers were compared with patterns of sequence and subtype variation of the lion feline immunodeficiency virus (FIVPle), a lentivirus analogous to human immunodeficiency virus (HIV). In spite of the ability of lions to disperse long distances, patterns of lion genetic diversity suggest substantial population subdivision (mtDNA ΦST = 0.92; nDNA FST = 0.18), and reduced gene flow, which, along with large differences in sero-prevalence of six distinct FIVPle subtypes among lion populations, refute the hypothesis that African lions consist of a single panmictic population. Our results suggest that extant lion populations derive from several Pleistocene refugia in East and Southern Africa (∼324,000–169,000 years ago), which expanded during the Late Pleistocene (∼100,000 years ago) into Central and North Africa and into Asia. During the Pleistocene/Holocene transition (∼14,000–7,000 years), another expansion occurred from southern refugia northwards towards East Africa, causing population interbreeding. In particular, lion and FIVPle variation affirms that the large, well-studied lion population occupying the greater Serengeti Ecosystem is derived from three distinct populations that admixed recently. PMID:18989457

  19. Protein sequences clustering of herpes virus by using Tribe Markov clustering (Tribe-MCL)

    NASA Astrophysics Data System (ADS)

    Bustamam, A.; Siswantining, T.; Febriyani, N. L.; Novitasari, I. D.; Cahyaningrum, R. D.

    2017-07-01

    The herpes virus is widespread, and one of its important characteristics is its ability to cause both acute and chronic infection, which can lead to severe complications. The herpes virus consists of DNA enclosed in protein and wrapped by glycoproteins. In this work, the herpes virus family is classified and analyzed by clustering its protein sequences using the Tribe Markov Clustering (Tribe-MCL) algorithm. Tribe-MCL is an efficient clustering method, based on the theory of Markov chains, that classifies protein families from protein sequences using pre-computed sequence similarity information. We implement the Tribe-MCL algorithm in the open-source program R. We select 24 herpes virus protein sequences obtained from the NCBI database. The dataset consists of three types of glycoprotein: B, F, and H. Each type is represented by eight herpes viruses that infect humans. In our simulations with inflation factors r = 1.5, 2, and 3, we obtain different numbers of clusters: the greater the inflation factor, the greater the number of clusters. Each protein is grouped together with proteins of the same type.
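
    The expansion and inflation steps that Markov clustering alternates can be sketched as follows. This is a minimal pure-Python illustration, not the authors' R implementation: the function names, the self-loop weight of 1.0, the fixed iteration count, and the toy similarity matrix are all assumptions made for the example.

```python
def normalize_columns(m):
    """Scale every column to sum to 1 (column-stochastic matrix)."""
    n = len(m)
    out = [row[:] for row in m]
    for j in range(n):
        s = sum(m[i][j] for i in range(n))
        for i in range(n):
            out[i][j] = m[i][j] / s if s else 0.0
    return out


def matmul(a, b):
    """Plain square-matrix product."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mcl(similarity, inflation=2.0, iterations=50):
    """Toy Markov clustering on a symmetric similarity matrix."""
    n = len(similarity)
    # Add self-loops (assumed weight 1.0), then turn weights into probabilities.
    m = [[similarity[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
         for i in range(n)]
    m = normalize_columns(m)
    for _ in range(iterations):
        m = matmul(m, m)                                   # expansion
        m = normalize_columns(
            [[x ** inflation for x in row] for row in m])  # inflation
    # Each row that still carries mass lists the members of one cluster.
    clusters = set()
    for i in range(n):
        members = frozenset(j for j in range(n) if m[i][j] > 1e-6)
        if members:
            clusters.add(members)
    return sorted(sorted(c) for c in clusters)


# Hypothetical similarity scores: two groups with no cross-similarity.
toy = [
    [0, 1, 1, 0, 0, 0],
    [1, 0, 1, 0, 0, 0],
    [1, 1, 0, 0, 0, 0],
    [0, 0, 0, 0, 1, 1],
    [0, 0, 0, 1, 0, 1],
    [0, 0, 0, 1, 1, 0],
]
print(mcl(toy, inflation=2.0))  # → [[0, 1, 2], [3, 4, 5]]
```

    In real use the matrix entries would come from pre-computed sequence similarity scores, and the inflation parameter would be varied (for example 1.5, 2, 3) to see how the number of clusters changes.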

  20. Large-Scale Concatenation cDNA Sequencing

    PubMed Central

    Yu, Wei; Andersson, Björn; Worley, Kim C.; Muzny, Donna M.; Ding, Yan; Liu, Wen; Ricafrente, Jennifer Y.; Wentland, Meredith A.; Lennon, Greg; Gibbs, Richard A.

    1997-01-01

    A total of 100 kb of DNA derived from 69 individual human brain cDNA clones of 0.7–2.0 kb were sequenced by concatenated cDNA sequencing (CCS), whereby multiple individual DNA fragments are sequenced simultaneously in a single shotgun library. The method yielded accurate sequences and a similar efficiency compared with other shotgun libraries constructed from single DNA fragments (>20 kb). Computer analyses were carried out on 65 cDNA clone sequences and their corresponding end sequences to examine both nucleic acid and amino acid sequence similarities in the databases. Thirty-seven clones revealed no DNA database matches, 12 clones generated exact matches (≥98% identity), and 16 clones generated nonexact matches (57%–97% identity) to either known human or other species genes. Of those 28 matched clones, 8 had corresponding end sequences that failed to identify similarities. In a protein similarity search, 27 clone sequences displayed significant matches, whereas only 20 of the end sequences had matches to known protein sequences. Our data indicate that full-length cDNA insert sequences provide significantly more nucleic acid and protein sequence similarity matches than expressed sequence tags (ESTs) for database searching. [All 65 cDNA clone sequences described in this paper have been submitted to the GenBank data library under accession nos. U79240–U79304.] PMID:9110174

  1. Synthesis of DNA

    DOEpatents

    Mariella, Jr., Raymond P.

    2008-11-18

    A method of synthesizing a desired double-stranded DNA of a predetermined length and of a predetermined sequence. Preselected sequence segments that will complete the desired double-stranded DNA are determined. Preselected segment sequences of DNA that will be used to complete the desired double-stranded DNA are provided. The preselected segment sequences of DNA are assembled to produce the desired double-stranded DNA.

  2. Isogenic mice exhibit sexually-dimorphic DNA methylation patterns across multiple tissues.

    PubMed

    McCormick, Helen; Young, Paul E; Hur, Suzy S J; Booher, Keith; Chung, Hunter; Cropley, Jennifer E; Giannoulatou, Eleni; Suter, Catherine M

    2017-12-13

    Cytosine methylation is a stable epigenetic modification of DNA that plays an important role in both normal physiology and disease. Most diseases exhibit some degree of sexual dimorphism, but the extent to which epigenetic states are influenced by sex is understudied and poorly understood. To address this deficit we studied DNA methylation patterns across multiple reduced representation bisulphite sequencing datasets (from liver, heart, brain, muscle and spleen) derived from isogenic male and female mice. DNA methylation patterns varied significantly from tissue to tissue, as expected, but they also varied between the sexes, with thousands of sexually dimorphic loci identified. The loci affected were largely autonomous to each tissue, even within tissues derived from the same germ layer. At most loci, differences between genders were driven by females exhibiting hypermethylation relative to males; a proportion of these differences were independent of the presence of testosterone in males. Loci harbouring gender differences were clustered in ontologies related to tissue function. Our findings suggest that gender is underwritten in the epigenome in a tissue-specific and potentially sex hormone-independent manner. Gender-specific epigenetic states are likely to have important implications for understanding sexually dimorphic phenotypes in health and disease.

  3. Nanopore Technology: A Simple, Inexpensive, Futuristic Technology for DNA Sequencing.

    PubMed

    Gupta, P D

    2016-10-01

    In health care, the importance of DNA sequencing has been fully established. Sanger's capillary electrophoresis DNA sequencing methodology is time-consuming and cumbersome, and hence more expensive. Lately, because of its versatility, DNA sequencing has become a household name, and there is therefore an urgent need for a simple, fast, inexpensive DNA sequencing technology. At the beginning of this century efforts were made, and nanopore DNA sequencing technology was developed; it is still in its infancy, but it is nevertheless the futuristic technology.

  4. The genome-wide DNA sequence specificity of the anti-tumour drug bleomycin in human cells.

    PubMed

    Murray, Vincent; Chen, Jon K; Tanaka, Mark M

    2016-07-01

    The cancer chemotherapeutic agent, bleomycin, cleaves DNA at specific sites. For the first time, the genome-wide DNA sequence specificity of bleomycin breakage was determined in human cells. Utilising Illumina next-generation DNA sequencing techniques, over 200 million bleomycin cleavage sites were examined to elucidate the bleomycin genome-wide DNA selectivity. The genome-wide bleomycin cleavage data were analysed by four different methods to determine the cellular DNA sequence specificity of bleomycin strand breakage. For the most highly cleaved DNA sequences, the preferred site of bleomycin breakage was at 5'-GT* dinucleotide sequences (where the asterisk indicates the bleomycin cleavage site), with lesser cleavage at 5'-GC* dinucleotides. This investigation also determined longer bleomycin cleavage sequences, with preferred cleavage at 5'-GT*A and 5'-TGT* trinucleotide sequences, and 5'-TGT*A tetranucleotides. For cellular DNA, the hexanucleotide DNA sequence 5'-RTGT*AY (where R is a purine and Y is a pyrimidine) was the most highly cleaved DNA sequence. It was striking that alternating purine-pyrimidine sequences were highly cleaved by bleomycin. The highest intensity cleavage sites in cellular and purified DNA were very similar although there were some minor differences. Statistical nucleotide frequency analysis indicated a G nucleotide was present at the -3 position (relative to the cleavage site) in cellular DNA but was absent in purified DNA.
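
    A degenerate motif such as 5'-RTGT*AY can be located in a sequence by expanding the ambiguity codes into a regular expression. The sketch below is only an illustration of that idea, not the analysis pipeline used in the study; the helper names and the example sequence are hypothetical, and only the codes R, Y, and N are handled alongside the four bases.

```python
import re

# Subset of the IUPAC nucleotide codes: R = purine, Y = pyrimidine, N = any base.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "N": "[ACGT]",
}


def motif_to_regex(motif):
    """Translate a degenerate motif such as 'RTGTAY' into a regex string."""
    return "".join(IUPAC[base] for base in motif.upper())


def find_motif(sequence, motif):
    """Return (start, matched_text) for every occurrence, overlaps included."""
    # A lookahead lets consecutive matches overlap one another.
    pattern = re.compile("(?=(" + motif_to_regex(motif) + "))")
    return [(m.start(), m.group(1)) for m in pattern.finditer(sequence.upper())]


# Hypothetical test sequence containing two RTGTAY sites.
print(find_motif("CCATGTACGGTGTATT", "RTGTAY"))
# → [(2, 'ATGTAC'), (9, 'GTGTAT')]
```

    Start positions are 0-based; under the asterisk convention above, the cleavage site lies after the fourth base of each match.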

  5. Genetic resources offer efficient tools for rice functional genomics research.

    PubMed

    Lo, Shuen-Fang; Fan, Ming-Jen; Hsing, Yue-Ie; Chen, Liang-Jwu; Chen, Shu; Wen, Ien-Chie; Liu, Yi-Lun; Chen, Ku-Ting; Jiang, Mirng-Jier; Lin, Ming-Kuang; Rao, Meng-Yen; Yu, Lin-Chih; Ho, Tuan-Hua David; Yu, Su-May

    2016-05-01

    Rice is an important crop and major model plant for monocot functional genomics studies. With the establishment of various genetic resources for rice genomics, the next challenge is to systematically assign functions to predicted genes in the rice genome. Compared with the robustness of genome sequencing and bioinformatics techniques, progress in understanding the function of rice genes has lagged, hampering the utilization of rice genes for cereal crop improvement. The use of transfer DNA (T-DNA) insertional mutagenesis offers the advantage of uniform distribution throughout the rice genome, but preferentially in gene-rich regions, resulting in direct gene knockout or activation of genes within 20-30 kb up- and downstream of the T-DNA insertion site and high gene tagging efficiency. Here, we summarize the recent progress in functional genomics using the T-DNA-tagged rice mutant population. We also discuss important features of T-DNA activation- and knockout-tagging and promoter-trapping of the rice genome in relation to mutant and candidate gene characterizations and how to more efficiently utilize rice mutant populations and datasets for high-throughput functional genomics and phenomics studies by forward and reverse genetics approaches. These studies may facilitate the translation of rice functional genomics research to improvements of rice and other cereal crops. © 2015 John Wiley & Sons Ltd.

  6. Mining metadata from unidentified ITS sequences in GenBank: A case study in Inocybe (Basidiomycota)

    PubMed Central

    2008-01-01

    Background The lack of reference sequences from well-identified mycorrhizal fungi often poses a challenge to the inference of taxonomic affiliation of sequences from environmental samples, and many environmental sequences are thus left unidentified. Such unidentified sequences belonging to the widely distributed ectomycorrhizal fungal genus Inocybe (Basidiomycota) were retrieved from GenBank and divided into species that were identified in a phylogenetic context using a reference dataset from an ongoing study of the genus. The sequence metadata of the unidentified Inocybe sequences stored in GenBank, as well as data from the corresponding original papers, were compiled and used to explore the ecology and distribution of the genus. In addition, the relative occurrence of Inocybe was contrasted to that of other mycorrhizal genera. Results Most species of Inocybe were found to have less than 3% intraspecific variability in the ITS2 region of the nuclear ribosomal DNA. This cut-off value was used jointly with phylogenetic analysis to delimit and identify unidentified Inocybe sequences to species level. A total of 177 unidentified Inocybe ITS sequences corresponding to 98 species were recovered, 32% of which were successfully identified to species level in this study. These sequences account for an unexpectedly large proportion of the publicly available unidentified fungal ITS sequences when compared with other mycorrhizal genera. Eight Inocybe species were reported from multiple hosts and some even from hosts forming arbutoid or orchid mycorrhizae. Furthermore, Inocybe sequences have been reported from four continents and in climate zones ranging from cold temperate to equatorial climate. Out of the 19 species found in more than one study, six were found in both Europe and North America and one was found in both Europe and Japan, indicating that at least many north temperate species have a wide distribution. 
    Conclusion Although DNA-based species identification and circumscription are associated with practical and conceptual difficulties, they also offer new possibilities and avenues for research. Metadata assembly holds great potential to synthesize valuable information from community studies for use in a species and taxonomy-oriented framework. PMID:18282272
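
    The 3% cut-off mentioned in the Results can be illustrated with a small grouping routine. This is a hedged sketch only: it assumes pre-aligned sequences of equal length, uses an uncorrected p-distance and single-linkage grouping, and omits the phylogenetic analysis that the study applied jointly with the cut-off. All names and sequences are hypothetical.

```python
def p_distance(a, b):
    """Proportion of differing sites between two aligned sequences,
    skipping columns where either sequence has a gap ('-')."""
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        return 1.0
    return sum(x != y for x, y in pairs) / len(pairs)


def group_by_cutoff(seqs, cutoff=0.03):
    """Single-linkage grouping: sequences closer than `cutoff` share a group."""
    names = list(seqs)
    parent = {n: n for n in names}

    def find(n):
        # Union-find lookup with path halving.
        while parent[n] != n:
            parent[n] = parent[parent[n]]
            n = parent[n]
        return n

    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if p_distance(seqs[a], seqs[b]) < cutoff:
                parent[find(a)] = find(b)

    groups = {}
    for n in names:
        groups.setdefault(find(n), []).append(n)
    return sorted(sorted(g) for g in groups.values())


# Hypothetical aligned ITS2 fragments, 100 columns each.
toy = {
    "s1": "A" * 100,
    "s2": "A" * 98 + "CC",       # 2% different from s1
    "s3": "A" * 90 + "C" * 10,   # 10% different from s1
}
print(group_by_cutoff(toy))  # → [['s1', 's2'], ['s3']]
```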

  7. In silico approaches reveal the potential for DNA sequence-dependent histone octamer affinity to influence chromatin structure in vivo.

    PubMed

    Fraser, Ross M; Allan, James; Simmen, Martin W

    2006-12-08

    Nucleosome positioning signals embedded within the DNA sequence have the potential to influence the detailed structure of the higher-order chromatin fibre. In two previous studies of long stretches of DNA, encompassing the chicken beta-globin and ovine beta-lactoglobulin genes, respectively, we mapped the relative affinity of every site for the core histone octamer. In both cases a periodic arrangement of the in vitro positioning sites suggests that they might influence the folding of a nucleosome chain into higher-order structure; this hypothesis was borne out in the case of the beta-lactoglobulin gene, where the distribution of the in vitro positioning sites is related to the positions nucleosomes actually occupy in sheep liver cells. Here, we have exploited the in vitro nucleosome positioning datasets to simulate nucleosomal organisation using in silico approaches. We use the high-resolution, quantitative positioning maps to define a one-dimensional positioning energy lattice, which can be populated with a defined number of nucleosomes. Monte Carlo techniques are employed to simulate the behaviour of the model at equilibrium to produce a set of configurations, which provide a probability-based occupancy map. Employing a variety of techniques we show that the occupancy maps are a sensitive function of the histone octamer density (nucleosome repeat length) and find that a minimal change in this property can produce dramatic localised changes in structure. Although simulations generally give rise to regular periodic nucleosomal arrangements, they often show octamer density-dependent discontinuities, which tend to co-localise with sequences that adopt distinctive chromatin structure in vivo. Furthermore, the overall organisation of simulated chromatin structures are more closely related to the situation in vivo than is the original in vitro positioning data, particularly at a nucleosome density corresponding to the in vivo state. 
    Although our model is simplified, we argue that it provides a unique insight into the influence that DNA sequence can have in determining chromatin structure and could serve as a useful basis for the incorporation of other parameters.
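
    The idea of populating a one-dimensional positioning energy lattice with a fixed number of nucleosomes and sampling configurations by Monte Carlo can be sketched as below. This is an illustrative toy under stated assumptions, not the authors' model: the Metropolis move set (relocating one nucleosome to a uniformly chosen start site), the hard-core exclusion rule, the energy values, and the short footprint are all invented for the example.

```python
import math
import random


def simulate_occupancy(energy, n_nuc, footprint, steps=20000, kT=1.0, seed=0):
    """Metropolis sampling of n_nuc non-overlapping nucleosomes.

    energy[i] is the (assumed) positioning energy of a nucleosome that
    starts at site i. Returns the fraction of sampled configurations in
    which each site is covered by a nucleosome.
    """
    if n_nuc * footprint > len(energy):
        raise ValueError("lattice too short for this many nucleosomes")
    rng = random.Random(seed)
    n_starts = len(energy) - footprint + 1
    # Begin from a tightly packed, non-overlapping arrangement.
    starts = [i * footprint for i in range(n_nuc)]
    counts = [0] * len(energy)
    for _ in range(steps):
        k = rng.randrange(n_nuc)        # nucleosome to move
        new = rng.randrange(n_starts)   # proposed start site
        # Hard-core exclusion: the move must not overlap any other nucleosome.
        if all(abs(new - s) >= footprint
               for j, s in enumerate(starts) if j != k):
            delta = energy[new] - energy[starts[k]]
            if delta <= 0 or rng.random() < math.exp(-delta / kT):
                starts[k] = new
        # Record which sites are covered in the current configuration.
        for s in starts:
            for x in range(s, s + footprint):
                counts[x] += 1
    return [c / steps for c in counts]


# Hypothetical lattice: flat energy with one favourable start site at 10.
energy = [0.0] * 40
energy[10] = -5.0
occupancy = simulate_occupancy(energy, n_nuc=1, footprint=5, seed=1)
print(round(occupancy[12], 2), round(occupancy[30], 2))
```

    Raising n_nuc for the same lattice changes the nucleosome density, which is the property the abstract reports the occupancy maps to be sensitive to.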

  8. Molecular systematics of Gagea and Lloydia (Liliaceae; Liliales): implications of analyses of nuclear ribosomal and plastid DNA sequences for infrageneric classification

    PubMed Central

    Zarrei, M.; Wilkin, P.; Fay, M. F.; Ingrouille, M. J.; Zarre, S.; Chase, M. W.

    2009-01-01

    Background and Aims Gagea is a Eurasian genus of petaloid monocots, with a few species in North Africa, comprising between 70 and approximately 275 species depending on the author. Lloydia (thought to be the closest relative of Gagea) consists of 12–20 species that have a mostly eastern Asian distribution. Delimitation of these genera and their subdivisions are unresolved questions in Liliaceae taxonomy. The objective of this study is to evaluate generic and infrageneric circumscription of Gagea and Lloydia using DNA sequence data. Methods A phylogenetic study of Gagea and Lloydia (Liliaceae) was conducted using sequences of nuclear ribosomal internal transcribed spacer (ITS) and plastid (rpl16 intron, trnL intron, trnL-F spacer, matK and the psbA-trnH spacer) DNA regions. This included 149 accessions (seven as outgroups), with multiple accessions of some taxa; 552 sequences were included, of which 393 were generated as part of this research. Key Results A close relationship of Gagea and Lloydia was confirmed in analyses using different datasets, but neither Gagea nor Lloydia forms a monophyletic group as currently circumscribed; however, the ITS and plastid analyses did not produce congruent results for the placement of Lloydia relative to the major groups within Gagea. Gagea accessions formed five moderately to strongly supported clades in all trees, with most Lloydia taxa positioned at the basal nodes; in the strict consensus trees from the combined data a basal polytomy occurs. There is limited congruence between the classical, morphology-derived infrageneric taxonomy in Gagea (including Lloydia) and clades in the present phylogenetic analyses. Conclusions The analyses support monophyly of Gagea/Lloydia collectively, and they clearly comprise a single lineage, as some previous authors have hypothesized. The results provide the basis for a new classification of Gagea that has support from some morphological features. 
    Incongruence between plastid and nuclear ITS results is interpreted as potentially due to ancient hybridization and/or paralogy of ITS rDNA. PMID:19451146

  9. Sequence and Structure Dependent DNA-DNA Interactions

    NASA Astrophysics Data System (ADS)

    Kopchick, Benjamin; Qiu, Xiangyun

    Molecular forces between dsDNA strands are largely dominated by electrostatics and have been extensively studied. Quantitative knowledge has been accumulated on how DNA-DNA interactions are modulated by varied biological constituents such as ions, cationic ligands, and proteins. Despite its central role in biology, the sequence of DNA has not received substantial attention, and "random" DNA sequences are typically used in biophysical studies. However, ~50% of the human genome is composed of non-random-sequence DNAs, particularly repetitive sequences. Furthermore, covalent modifications of DNA such as methylation play key roles in gene functions. Such DNAs with specific sequences or modifications often take on structures other than the canonical B-form. Here we present a series of quantitative measurements of the DNA-DNA forces with the osmotic stress method on different DNA sequences, from short repeats to the most frequent sequences in the genome, and to modifications such as bromination and methylation. We observe peculiar behaviors that appear to be strongly correlated with the incurred structural changes. We speculate on the causes in terms of the differences in hydration shell and DNA surface structures.

  10. Identification of Differentially Methylated Sites with Weak Methylation Effects

    PubMed Central

    Tran, Hong; Zhu, Hongxiao; Wu, Xiaowei; Kim, Gunjune; Clarke, Christopher R.; Larose, Hailey; Haak, David C.; Westwood, James H.; Zhang, Liqing

    2018-01-01

    Deoxyribonucleic acid (DNA) methylation is an epigenetic alteration crucial for regulating stress responses. Identifying large-scale DNA methylation at single nucleotide resolution is made possible by whole genome bisulfite sequencing. An essential task following the generation of bisulfite sequencing data is to detect differentially methylated cytosines (DMCs) among treatments. Most statistical methods for DMC detection do not consider the dependency of methylation patterns across the genome, thus possibly inflating type I error. Furthermore, small sample sizes and weak methylation effects among different phenotype categories make it difficult for these statistical methods to accurately detect DMCs. To address these issues, the wavelet-based functional mixed model (WFMM) was introduced to detect DMCs. To further examine the performance of WFMM in detecting weak differential methylation events, we used both simulated and empirical data and compare WFMM performance to a popular DMC detection tool methylKit. Analyses of simulated data that replicated the effects of the herbicide glyphosate on DNA methylation in Arabidopsis thaliana show that WFMM results in higher sensitivity and specificity in detecting DMCs compared to methylKit, especially when the methylation differences among phenotype groups are small. Moreover, the performance of WFMM is robust with respect to small sample sizes, making it particularly attractive considering the current high costs of bisulfite sequencing. Analysis of empirical Arabidopsis thaliana data under varying glyphosate dosages, and the analysis of monozygotic (MZ) twins who have different pain sensitivities—both datasets have weak methylation effects of <1%—show that WFMM can identify more relevant DMCs related to the phenotype of interest than methylKit. Differentially methylated regions (DMRs) are genomic regions with different DNA methylation status across biological samples. 
    DMRs and DMCs are essentially the same concepts, with the only difference being how methylation information across the genome is summarized. If methylation levels are determined by grouping neighboring cytosine sites, then they are DMRs; if methylation levels are calculated based on single cytosines, they are DMCs. PMID:29419727
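
    The distinction drawn above between per-cytosine and grouped summaries can be made concrete with a small example. This sketch only shows the two ways of summarizing methylation levels; it does not perform any differential test and is unrelated to the WFMM or methylKit implementations. The input layout (position, methylated reads, total reads), the 100 bp grouping gap, and the read counts are assumptions.

```python
def site_levels(sites):
    """Per-cytosine methylation level: methylated reads / total reads."""
    return {pos: meth / total for pos, meth, total in sites if total > 0}


def region_levels(sites, max_gap=100):
    """Pool neighbouring cytosines (gap <= max_gap bp) into regions and
    report one pooled methylation level per region."""
    covered = sorted(s for s in sites if s[2] > 0)
    regions, current = [], []
    for pos, meth, total in covered:
        # Start a new region when the next cytosine is too far away.
        if current and pos - current[-1][0] > max_gap:
            regions.append(current)
            current = []
        current.append((pos, meth, total))
    if current:
        regions.append(current)
    return [
        ((r[0][0], r[-1][0]),
         sum(m for _, m, _ in r) / sum(t for _, _, t in r))
        for r in regions
    ]


# Hypothetical bisulfite counts: (position, methylated reads, total reads).
counts = [(10, 8, 10), (50, 2, 10), (500, 5, 10)]
print(site_levels(counts))    # → {10: 0.8, 50: 0.2, 500: 0.5}
print(region_levels(counts))  # → [((10, 50), 0.5), ((500, 500), 0.5)]
```

    The first two cytosines differ strongly at single-site resolution (0.8 versus 0.2) but average out to 0.5 once pooled, which shows how the choice of summary changes what can be detected.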

  11. A Multilocus Species Delimitation Reveals a Striking Number of Species of Coralline Algae Forming Maerl in the OSPAR Maritime Area

    PubMed Central

    Pardo, Cristina; Lopez, Lua; Peña, Viviana; Hernández-Kantún, Jazmin; Le Gall, Line; Bárbara, Ignacio; Barreiro, Rodolfo

    2014-01-01

    Maerl beds are sensitive biogenic habitats built by an accumulation of loose-lying, non-geniculate coralline algae. While these habitats are considered hot-spots of marine biodiversity, the number and distribution of maerl-forming species is uncertain because homoplasy and plasticity of morphological characters are common. As a result, species discrimination based on morphological features is notoriously challenging, making these coralline algae the ideal candidates for a DNA barcoding study. Here, mitochondrial (COI-5P DNA barcode fragment) and plastidial (psbA gene) sequence data were used in a two-step approach to delimit species in 224 collections of maerl sampled from Svalbard (78°96’N) to the Canary Islands (28°64’N) that represented 10 morphospecies from four genera and two families. First, the COI-5P dataset was analyzed with two methods based on distinct criteria (ABGD and GMYC) to delineate 16 primary species hypotheses (PSHs) arranged into four major lineages. Second, chloroplast (psbA) sequence data served to consolidate these PSHs into 13 secondary species hypotheses (SSHs) that showed biologically plausible ranges. Using several lines of evidence (e.g. morphological characters, known species distributions, sequences from type and topotype material), six SSHs were assigned to available species names that included the geographically widespread Phymatolithon calcareum, Lithothamnion corallioides, and L. glaciale; possible identities of other SSHs are discussed. Concordance between SSHs and morphospecies was minimal, highlighting the convenience of DNA barcoding for an accurate identification of maerl specimens. Our survey indicated that a majority of maerl forming species have small distribution ranges and revealed a gradual replacement of species with latitude. PMID:25111057

  12. De novo Transcriptome Analysis of Portunus trituberculatus Ovary and Testis by RNA-Seq: Identification of Genes Involved in Gonadal Development

    PubMed Central

    Meng, Xian-liang; Liu, Ping; Jia, Fu-long; Li, Jian; Gao, Bao-Quan

    2015-01-01

    The swimming crab Portunus trituberculatus is a commercially important crab species in East Asia countries. Gonadal development is a physiological process of great significance to the reproduction as well as commercial seed production for P. trituberculatus. However, little is currently known about the molecular mechanisms governing the developmental processes of gonads in this species. To open avenues of molecular research on P. trituberculatus gonadal development, Illumina paired-end sequencing technology was employed to develop deep-coverage transcriptome sequencing data for its gonads. Illumina sequencing generated 58,429,148 and 70,474,978 high-quality reads from the ovary and testis cDNA library, respectively. All these reads were assembled into 54,960 unigenes with an average sequence length of 879 bp, of which 12,340 unigenes (22.45% of the total) matched sequences in GenBank non-redundant database. Based on our transcriptome analysis as well as published literature, a number of candidate genes potentially involved in the regulation of gonadal development of P. trituberculatus were identified, such as FAOMeT, mPRγ, PGMRC1, PGDS, PGER4, 3β-HSD and 17β-HSDs. Differential expression analysis generated 5,919 differentially expressed genes between ovary and testis, among which many genes related to gametogenesis and several genes previously reported to be critical in differentiation and development of gonads were found, including Foxl2, Wnt4, Fst, Fem-1 and Sox9. Furthermore, 28,534 SSRs and 111,646 high-quality SNPs were identified in this transcriptome dataset. This work represents the first transcriptome analysis of P. trituberculatus gonads using the next generation sequencing technology and provides a valuable dataset for understanding molecular mechanisms controlling development of gonads and facilitating future investigation of reproductive biology in this species. 
    The molecular markers obtained in this study will provide a fundamental basis for population genetics and functional genomics in P. trituberculatus and other closely related species. PMID:26042806

  13. Phylogenetic reconstruction in the Order Nymphaeales: ITS2 secondary structure analysis and in silico testing of maturase k (matK) as a potential marker for DNA bar coding

    PubMed Central

    2012-01-01

    Background The Nymphaeales (water lily and relatives) lineage has diverged as the second branch of basal angiosperms and comprises two families: Cabombaceae and Nymphaeaceae. The classification of Nymphaeales and phylogeny within the flowering plants are quite intriguing, as several systems (Thorne system, Dahlgren system, Cronquist system, Takhtajan system and APG III system (Angiosperm Phylogeny Group III system)) have attempted to redefine the Nymphaeales taxonomy. There are also fossil records, consisting especially of seeds, pollen, stems, leaves and flowers, from as early as the lower Cretaceous. Here we present an in silico study of the order Nymphaeales taking maturase K (matK) and internal transcribed spacer (ITS2) as biomarkers for phylogeny reconstruction (using character-based methods and a Bayesian approach) and identification of motifs for DNA barcoding. Results The Maximum Likelihood (ML) and Bayesian approaches yielded congruent, fully resolved and well-supported trees using a concatenated (ITS2 + matK) supermatrix aligned dataset. The taxon sampling corroborates the monophyly of Cabombaceae. Nuphar emerges as a monophyletic clade in the family Nymphaeaceae, while there are slight discrepancies in the monophyletic nature of the genus Nymphaea owing to Victoria-Euryale and Ondinea grouping in the same node of Nymphaeaceae. ITS2 secondary structure alignment corroborates the primary sequence analysis. Hydatellaceae emerged as a sister clade to Nymphaeaceae and had a basal lineage amongst the water lily clades. Species from Cycas and Ginkgo were taken as outgroups and were rooted in the overall tree topology from various methods. Conclusions MatK genes are fast-evolving, highly variant regions of plant chloroplast DNA that can serve as potential biomarkers for DNA barcoding and also in generating primers for angiosperms with identification of unique motif regions.
    We have reported unique genus-specific motif regions in the order Nymphaeales from the matK dataset, which can be further validated for barcoding and the design of PCR primers. Our analysis, using a novel approach of sequence-structure alignment and phylogenetic reconstruction using molecular morphometrics, is congruent with the current placement of Hydatellaceae within the early-divergent angiosperm order Nymphaeales. The results underscore the fact that more diverse genera, if not fully resolved to be monophyletic, should be represented by all major lineages. PMID:23282079

  14. Simulating Next-Generation Sequencing Datasets from Empirical Mutation and Sequencing Models

    PubMed Central

    Stephens, Zachary D.; Hudson, Matthew E.; Mainzer, Liudmila S.; Taschuk, Morgan; Weber, Matthew R.; Iyer, Ravishankar K.

    2016-01-01

    An obstacle to validating and benchmarking methods for genome analysis is that there are few reference datasets available for which the “ground truth” about the mutational landscape of the sample genome is known and fully validated. Additionally, the free and public availability of real human genome datasets is incompatible with the preservation of donor privacy. In order to better analyze and understand genomic data, we need test datasets that model all variants, reflecting known biology as well as sequencing artifacts. Read simulators can fulfill this requirement, but are often criticized for limited resemblance to true data and overall inflexibility. We present NEAT (NExt-generation sequencing Analysis Toolkit), a set of tools that not only includes an easy-to-use read simulator, but also scripts to facilitate variant comparison and tool evaluation. NEAT has a wide variety of tunable parameters which can be set manually on the default model or parameterized using real datasets. The software is freely available at github.com/zstephens/neat-genreads. PMID:27893777
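
    The appeal of simulated reads is that the true origin of every read is known. The toy generator below illustrates that idea only; it is not NEAT and does not use NEAT's models or options. The function name, the uniform substitution-only error model, and the reference string are assumptions for the example.

```python
import random


def simulate_reads(reference, n_reads, read_len, error_rate=0.01, seed=0):
    """Sample fixed-length reads from `reference` with random substitutions.

    Returns (start, read) pairs so the true origin of each read is kept
    as ground truth for later benchmarking.
    """
    rng = random.Random(seed)
    reads = []
    for _ in range(n_reads):
        start = rng.randrange(len(reference) - read_len + 1)
        bases = list(reference[start:start + read_len])
        for i, base in enumerate(bases):
            # Substitute with a different base at the assumed error rate.
            if rng.random() < error_rate:
                bases[i] = rng.choice([b for b in "ACGT" if b != base])
        reads.append((start, "".join(bases)))
    return reads


# Hypothetical reference sequence.
reference = "ACGTTGCAAGGCTTAACCGGTACGATCGATTACG"
for start, read in simulate_reads(reference, n_reads=3, read_len=10):
    print(start, read)
```

    A realistic simulator would additionally model insertions, deletions, quality scores, and known variants, which is where empirically parameterized models come in.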

  15. Molecular taxonomy of phytopathogenic fungi: a case study in Peronospora.

    PubMed

    Göker, Markus; García-Blázquez, Gema; Voglmayr, Hermann; Tellería, M Teresa; Martín, María P

    2009-07-29

    Inappropriate taxon definitions may have severe consequences in many areas. For instance, biologically sensible species delimitation of plant pathogens is crucial for measures such as plant protection or biological control and for comparative studies involving model organisms. However, delimiting species is challenging in the case of organisms for which often only molecular data are available, such as prokaryotes, fungi, and many unicellular eukaryotes. Even in the case of organisms with well-established morphological characteristics, molecular taxonomy is often necessary to emend current taxonomic concepts and to analyze DNA sequences directly sampled from the environment. Typically, for this purpose clustering approaches to delineate molecular operational taxonomic units have been applied using arbitrary choices regarding the distance threshold values, and the clustering algorithms. Here, we report on a clustering optimization method to establish a molecular taxonomy of Peronospora based on ITS nrDNA sequences. Peronospora is the largest genus within the downy mildews, which are obligate parasites of higher plants, and includes various economically important pathogens. The method determines the distance function and clustering setting that result in an optimal agreement with selected reference data. Optimization was based on both taxonomy-based and host-based reference information, yielding the same outcome. Resampling and permutation methods indicate that the method is robust regarding taxon sampling and errors in the reference data. Tests with newly obtained ITS sequences demonstrate the use of the re-classified dataset in molecular identification of downy mildews. A corrected taxonomy is provided for all Peronospora ITS sequences contained in public databases. Clustering optimization appears to be broadly applicable in automated, sequence-based taxonomy. 
The method connects traditional and modern taxonomic disciplines by specifically addressing the issue of how to optimally account for both traditional species concepts and genetic divergence.
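
    The core idea of the abstract above, choosing the clustering setting that agrees best with reference data, can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: single-linkage clustering, the Rand index as the agreement measure, the toy distance matrix, and the function names.

```python
from itertools import combinations


def single_linkage(dist, threshold):
    """Label items so that any pair with distance <= threshold shares a cluster."""
    n = len(dist)
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving keeps trees shallow
            i = parent[i]
        return i

    for i, j in combinations(range(n), 2):
        if dist[i][j] <= threshold:
            parent[find(i)] = find(j)
    return [find(i) for i in range(n)]


def rand_index(labels_a, labels_b):
    """Fraction of item pairs on which two partitions agree (same vs. different cluster)."""
    pairs = list(combinations(range(len(labels_a)), 2))
    agree = sum(
        (labels_a[i] == labels_a[j]) == (labels_b[i] == labels_b[j])
        for i, j in pairs
    )
    return agree / len(pairs)


def best_threshold(dist, reference, candidates):
    """Return (threshold, score) with the highest agreement; ties go to the smallest threshold."""
    scored = [(rand_index(single_linkage(dist, t), reference), t) for t in candidates]
    best_score = max(score for score, _ in scored)
    return min(t for score, t in scored if score == best_score), best_score


# Hypothetical pairwise distances for five sequences; the first three share
# one reference taxon ("A"), the last two another ("B").
DIST = [
    [0.000, 0.010, 0.020, 0.080, 0.080],
    [0.010, 0.000, 0.015, 0.080, 0.080],
    [0.020, 0.015, 0.000, 0.080, 0.080],
    [0.080, 0.080, 0.080, 0.000, 0.010],
    [0.080, 0.080, 0.080, 0.010, 0.000],
]
REFERENCE = ["A", "A", "A", "B", "B"]

if __name__ == "__main__":
    print(best_threshold(DIST, REFERENCE, [0.005, 0.03, 0.1]))
```

    In this toy case the middle threshold reproduces the reference partition exactly, while the smaller one splits every sequence apart and the larger one lumps everything together. A fuller treatment would also vary the distance function and the linkage rule, as the abstract describes.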

  16. Molecular Taxonomy of Phytopathogenic Fungi: A Case Study in Peronospora

    PubMed Central

    Göker, Markus; García-Blázquez, Gema; Voglmayr, Hermann; Tellería, M. Teresa; Martín, María P.

    2009-01-01

    Background Inappropriate taxon definitions may have severe consequences in many areas. For instance, biologically sensible species delimitation of plant pathogens is crucial for measures such as plant protection or biological control and for comparative studies involving model organisms. However, delimiting species is challenging in the case of organisms for which often only molecular data are available, such as prokaryotes, fungi, and many unicellular eukaryotes. Even in the case of organisms with well-established morphological characteristics, molecular taxonomy is often necessary to emend current taxonomic concepts and to analyze DNA sequences directly sampled from the environment. Typically, for this purpose clustering approaches to delineate molecular operational taxonomic units have been applied using arbitrary choices regarding the distance threshold values, and the clustering algorithms. Methodology Here, we report on a clustering optimization method to establish a molecular taxonomy of Peronospora based on ITS nrDNA sequences. Peronospora is the largest genus within the downy mildews, which are obligate parasites of higher plants, and includes various economically important pathogens. The method determines the distance function and clustering setting that result in an optimal agreement with selected reference data. Optimization was based on both taxonomy-based and host-based reference information, yielding the same outcome. Resampling and permutation methods indicate that the method is robust regarding taxon sampling and errors in the reference data. Tests with newly obtained ITS sequences demonstrate the use of the re-classified dataset in molecular identification of downy mildews. Conclusions A corrected taxonomy is provided for all Peronospora ITS sequences contained in public databases. Clustering optimization appears to be broadly applicable in automated, sequence-based taxonomy. 
The method connects traditional and modern taxonomic disciplines by specifically addressing the issue of how to optimally account for both traditional species concepts and genetic divergence. PMID:19641601

  17. A new way to contemplate Darwin's tangled bank: how DNA barcodes are reconnecting biodiversity science and biomonitoring.

    PubMed

    Hajibabaei, Mehrdad; Baird, Donald J; Fahner, Nicole A; Beiko, Robert; Golding, G Brian

    2016-09-05

    Encompassing the breadth of biodiversity in biomonitoring programmes has been frustrated by an inability to simultaneously identify large numbers of species accurately and in a timely fashion. Biomonitoring infers the state of an ecosystem from samples collected and identified using the best available taxonomic knowledge. The advent of DNA barcoding has now given way to the extraction of bulk DNA from mixed samples of organisms in environmental samples through the development of high-throughput sequencing (HTS). This DNA metabarcoding approach allows an unprecedented view of the true breadth and depth of biodiversity, but its adoption poses two important challenges. First, bioinformatics techniques must simultaneously perform complex analyses of large datasets and translate the results of these analyses to a range of users. Second, the insights gained from HTS need to be amalgamated with concepts such as Linnaean taxonomy and indicator species, which are less comprehensive but more intuitive. It is clear that we are moving beyond proof-of-concept studies to address the challenge of implementation of this new approach for environmental monitoring and regulation. Interpreting Darwin's 'tangled bank' through a DNA lens is now a reality, but the question remains: how can this information be generated and used reliably, and how does it relate to accepted norms in ecosystem study? This article is part of the themed issue 'From DNA barcodes to biomes'. © 2016 The Authors.

  18. A new way to contemplate Darwin's tangled bank: how DNA barcodes are reconnecting biodiversity science and biomonitoring

    PubMed Central

    Baird, Donald J.; Fahner, Nicole A.; Beiko, Robert; Golding, G. Brian

    2016-01-01

    Encompassing the breadth of biodiversity in biomonitoring programmes has been frustrated by an inability to simultaneously identify large numbers of species accurately and in a timely fashion. Biomonitoring infers the state of an ecosystem from samples collected and identified using the best available taxonomic knowledge. The advent of DNA barcoding has now given way to the extraction of bulk DNA from mixed samples of organisms in environmental samples through the development of high-throughput sequencing (HTS). This DNA metabarcoding approach allows an unprecedented view of the true breadth and depth of biodiversity, but its adoption poses two important challenges. First, bioinformatics techniques must simultaneously perform complex analyses of large datasets and translate the results of these analyses to a range of users. Second, the insights gained from HTS need to be amalgamated with concepts such as Linnaean taxonomy and indicator species, which are less comprehensive but more intuitive. It is clear that we are moving beyond proof-of-concept studies to address the challenge of implementation of this new approach for environmental monitoring and regulation. Interpreting Darwin's ‘tangled bank’ through a DNA lens is now a reality, but the question remains: how can this information be generated and used reliably, and how does it relate to accepted norms in ecosystem study? This article is part of the themed issue ‘From DNA barcodes to biomes’. PMID:27481782

  19. Feasibility of nuclear ribosomal region ITS1 over ITS2 in barcoding taxonomically challenging genera of subtribe Cassiinae (Fabaceae).

    PubMed

    Mishra, Priyanka; Kumar, Amit; Rodrigues, Vereena; Shukla, Ashutosh K; Sundaresan, Velusamy

    2016-01-01

    The internal transcribed spacer (ITS) region is situated between 18S and 26S in a polycistronic rRNA precursor transcript. It has proved to be the most commonly sequenced region across plant species to resolve phylogenetic relationships ranging from shallow to deep taxonomic levels. Despite several taxonomical revisions in Cassiinae, a stable phylogeny remains elusive at the molecular level, particularly concerning the delineation of species in the genera Cassia, Senna and Chamaecrista. This study addresses the comparative potential of ITS datasets (ITS1, ITS2 and concatenated) in resolving the underlying morphological disparity in the highly complex genera, to assess their discriminatory power as potential barcode candidates in Cassiinae. A combination of experimental data and an in-silico approach based on threshold genetic distances, sequence similarity based and hierarchical tree-based methods was performed to decipher the discriminating power of ITS datasets on 18 different species of Cassiinae complex. Lab-generated sequences were compared against those available in GenBank using BLAST and were aligned through MUSCLE 3.8.31 and analysed in PAUP 4.0 and BEAST 1.8 using parsimony ratchet, maximum likelihood and Bayesian inference (BI) methods of gene and species tree reconciliation with bootstrapping. DNA barcoding gap was realized based on the Kimura two-parameter distance model (K2P) in TaxonDNA and MEGA. Based on the K2P distance, significant divergences between the inter- and intra-specific genetic distances were observed, while the presence of a DNA barcoding gap was obvious. The ITS1 region efficiently identified 81.63% and 90% of species using TaxonDNA and BI methods, respectively. The PWG-distance method based on simple pairwise matching indicated the significance of ITS1 whereby the highest number of variable (210) and informative sites (206) were obtained.
The BI tree-based methods outperformed the similarity-based methods producing well-resolved phylogenetic trees with many nodes well supported by bootstrap analyses. The reticulated phylogenetic hypothesis using the ITS1 region mainly supported the relationship between the species of Cassiinae established by traditional morphological methods. The ITS1 region showed a higher discrimination power and desirable characteristics as compared to ITS2 and ITS1 + 2, and is therefore concluded to be the locus of choice. Considering the complexity of the group and the underlying biological ambiguities, the results presented here are encouraging for developing DNA barcoding as a useful tool for resolving taxonomical challenges in corroboration with a morphological framework.
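
    The Kimura two-parameter (K2P) distance used for the barcoding gap above has a closed form, d = -1/2 ln((1 - 2P - Q) sqrt(1 - 2Q)), where P and Q are the proportions of transitional and transversional differences between two aligned sequences. The sketch below is a minimal illustration of that formula, not the TaxonDNA or MEGA implementation; the function name and the choice to skip gaps and ambiguity codes are assumptions.

```python
import math

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def k2p_distance(seq1, seq2):
    """Kimura two-parameter distance between two aligned DNA sequences."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned to the same length")

    transitions = transversions = compared = 0
    for a, b in zip(seq1.upper(), seq2.upper()):
        if a not in "ACGT" or b not in "ACGT":
            continue  # assumption: ignore gaps and ambiguity codes
        compared += 1
        if a == b:
            continue
        if {a, b} <= PURINES or {a, b} <= PYRIMIDINES:
            transitions += 1    # A<->G or C<->T
        else:
            transversions += 1  # purine <-> pyrimidine

    if compared == 0:
        raise ValueError("no comparable sites")

    p = transitions / compared
    q = transversions / compared
    term1 = 1 - 2 * p - q
    term2 = 1 - 2 * q
    if term1 <= 0 or term2 <= 0:
        raise ValueError("sequences too divergent for the K2P correction")
    return -0.5 * math.log(term1 * math.sqrt(term2))


if __name__ == "__main__":
    # One transition in ten sites: the corrected distance slightly exceeds 0.1.
    print(round(k2p_distance("AAAAAAAAAA", "GAAAAAAAAA"), 4))
```

    A barcoding gap is then assessed by comparing the largest within-species distance against the smallest between-species distance computed this way.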

  20. Simrank: Rapid and sensitive general-purpose k-mer search tool

    PubMed Central

    2011-01-01

    Background Terabyte-scale collections of string-encoded data are expected from consortia efforts such as the Human Microbiome Project http://nihroadmap.nih.gov/hmp. Intra- and inter-project data similarity searches are enabled by rapid k-mer matching strategies. Software applications for sequence database partitioning, guide tree estimation, molecular classification and alignment acceleration have benefited from embedded k-mer searches as sub-routines. However, a rapid, general-purpose, open-source, flexible, stand-alone k-mer tool has not been available. Results Here we present a stand-alone utility, Simrank, which allows users to rapidly identify the database strings most similar to query strings. Performance testing of Simrank and related tools against DNA, RNA, protein and human-language data found Simrank 10X to 928X faster depending on the dataset. Conclusions Simrank provides molecular ecologists with a high-throughput, open source choice for comparing large sequence sets to find similarity. PMID:21524302
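
    To make the idea of k-mer matching concrete, the following is a minimal sketch that ranks database strings by the fraction of a query's distinct k-mers they contain. It is not Simrank's algorithm or interface; the scoring rule, the default k, and the function names are assumptions chosen for illustration.

```python
def kmers(seq, k):
    """Set of all distinct substrings of length k in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def rank_by_shared_kmers(query, database, k=4):
    """Rank database entries by the fraction of the query's k-mers they share.

    `database` maps a name to a string. Returns (score, name) tuples, best first;
    ties are broken alphabetically by name.
    """
    query_kmers = kmers(query, k)
    scores = []
    for name, seq in database.items():
        shared = len(query_kmers & kmers(seq, k))
        score = shared / len(query_kmers) if query_kmers else 0.0
        scores.append((score, name))
    return sorted(scores, key=lambda item: (-item[0], item[1]))


if __name__ == "__main__":
    db = {
        "a": "ACGTACGTAC",
        "b": "TTTTTTTTTT",
        "c": "ACGTAAAAAA",
    }
    for score, name in rank_by_shared_kmers("ACGTACGT", db):
        print(name, score)
```

    Because the comparison works on plain strings, the same sketch applies unchanged to protein or natural-language text, which matches the general-purpose framing of the abstract.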

  1. A High-Throughput Process for the Solid-Phase Purification of Synthetic DNA Sequences

    PubMed Central

    Grajkowski, Andrzej; Cieślak, Jacek; Beaucage, Serge L.

    2017-01-01

    An efficient process for the purification of synthetic phosphorothioate and native DNA sequences is presented. The process is based on the use of an aminopropylated silica gel support functionalized with aminooxyalkyl functions to enable capture of DNA sequences through an oximation reaction with the keto function of a linker conjugated to the 5′-terminus of DNA sequences. Deoxyribonucleoside phosphoramidites carrying this linker, as a 5′-hydroxyl protecting group, have been synthesized for incorporation into DNA sequences during the last coupling step of a standard solid-phase synthesis protocol executed on a controlled pore glass (CPG) support. Solid-phase capture of the nucleobase- and phosphate-deprotected DNA sequences released from the CPG support is demonstrated to proceed near quantitatively. Shorter than full-length DNA sequences are first washed away from the capture support; the solid-phase purified DNA sequences are then released from this support upon reaction with tetra-n-butylammonium fluoride in dry dimethylsulfoxide (DMSO) and precipitated in tetrahydrofuran (THF). The purity of solid-phase-purified DNA sequences exceeds 98%. The simulated high-throughput and scalability features of the solid-phase purification process are demonstrated without sacrificing purity of the DNA sequences. PMID:28628204

  2. DLocalMotif: a discriminative approach for discovering local motifs in protein sequences.

    PubMed

    Mehdi, Ahmed M; Sehgal, Muhammad Shoaib B; Kobe, Bostjan; Bailey, Timothy L; Bodén, Mikael

    2013-01-01

    Local motifs are patterns of DNA or protein sequences that occur within a sequence interval relative to a biologically defined anchor or landmark. Current protein motif discovery methods do not adequately consider such constraints to identify biologically significant motifs that are only weakly over-represented but spatially confined. Using negatives, i.e. sequences known to not contain a local motif, can further increase the specificity of their discovery. This article introduces the method DLocalMotif that makes use of positional information and negative data for local motif discovery in protein sequences. DLocalMotif combines three scoring functions, measuring degrees of motif over-representation, entropy and spatial confinement, specifically designed to discriminatively exploit the availability of negative data. The method is shown to outperform current methods that use only a subset of these motif characteristics. We apply the method to several biological datasets. The analysis of peroxisomal targeting signals uncovers several novel motifs that occur immediately upstream of the dominant peroxisomal targeting signal-1 signal. The analysis of proline-tyrosine nuclear localization signals uncovers multiple novel motifs that overlap with C2H2 zinc finger domains. We also evaluate the method on classical nuclear localization signals and endoplasmic reticulum retention signals and find that DLocalMotif successfully recovers biologically relevant sequence properties. http://bioinf.scmb.uq.edu.au/dlocalmotif/

  3. Prediction of enhancer-promoter interactions via natural language processing.

    PubMed

    Zeng, Wanwen; Wu, Mengmeng; Jiang, Rui

    2018-05-09

    Precise identification of three-dimensional genome organization, especially enhancer-promoter interactions (EPIs), is important to deciphering gene regulation, cell differentiation and disease mechanisms. Currently, it is a challenging task to distinguish true interactions from other nearby non-interacting ones since the power of traditional experimental methods is limited due to low resolution or low throughput. We propose a novel computational framework EP2vec to assay three-dimensional genomic interactions. We first extract sequence embedding features, defined as fixed-length vector representations learned from variable-length sequences using an unsupervised deep learning method in natural language processing. Then, we train a classifier to predict EPIs using the learned representations in a supervised way. Experimental results demonstrate that EP2vec obtains F1 scores ranging from 0.841 to 0.933 on different datasets, which outperforms existing methods. We prove the robustness of sequence embedding features by carrying out sensitivity analysis. Besides, we identify motifs that represent cell line-specific information through analysis of the learned sequence embedding features by adopting an attention mechanism. Last, we show that even superior performance with F1 scores of 0.889 to 0.940 can be achieved by combining sequence embedding features and experimental features. EP2vec sheds light on feature extraction for DNA sequences of arbitrary lengths and provides a powerful approach for EPI identification.

  4. The Species Dilemma of Northeast Indian Mahseer (Actinopterygii: Cyprinidae): DNA Barcoding in Clarifying the Riddle

    PubMed Central

    Laskar, Boni A.; Bhattacharjee, Maloyjo J.; Dhar, Bishal; Mahadani, Pradosh; Kundu, Shantanu; Ghosh, Sankar K.

    2013-01-01

    Background The taxonomic validity of Northeast Indian endemic Mahseer species, Tor progeneius and Neolissochilus hexastichus, has been argued repeatedly. This is mainly due to disagreements in recognizing the species based on morphological characters. Consequently, both the species have been concealed for many decades. DNA barcoding has become a promising and an independent technique for accurate species level identification. Therefore, utilization of such technique in association with the traditional morphotaxonomic description can resolve the species dilemma of this important group of sport fishes. Methodology/Principal Findings Altogether, 28 mahseer specimens including paratypes were studied from different locations in Northeast India, and 24 morphometric characters were measured invariably. The Principal Component Analysis with morphometric data revealed five distinct groups of samples that were taxonomically categorized into 4 species, viz., Tor putitora, T. progeneius, Neolissochilus hexagonolepis and N. hexastichus. Analysis with a dataset of 76 DNA barcode sequences of different mahseer species showed that the queries of T. putitora and N. hexagonolepis clustered cohesively with the respective conspecific database sequences maintaining 0.8% maximum K2P divergence. The closest congeneric divergence was 3 times higher than the mean conspecific divergence and was considered as barcode gap. The maximum divergence among the samples of T. progeneius and T. putitora was 0.8% that was much below the barcode gap, indicating them being synonymous. The query sequences of N. hexastichus invariably formed a discrete and a congeneric clade with the database sequences and maintained the interspecific divergence that supported its distinct species status. Notably, N. hexastichus was encountered in a single site and seemed to be under threat. Conclusion This study substantiated the identification of N. hexastichus to be a true species, and tentatively regarded T. progeneius to be a synonym of T. putitora. It would guide the conservationists to initiate priority conservation of N. hexastichus and T. putitora. PMID:23341979

  5. Sequencing, Annotation and Analysis of the Syrian Hamster (Mesocricetus auratus) Transcriptome

    PubMed Central

    Tchitchek, Nicolas; Safronetz, David; Rasmussen, Angela L.; Martens, Craig; Virtaneva, Kimmo; Porcella, Stephen F.; Feldmann, Heinz

    2014-01-01

    Background The Syrian hamster (golden hamster, Mesocricetus auratus) is gaining importance as a new experimental animal model for multiple pathogens, including emerging zoonotic diseases such as Ebola. Nevertheless there are currently no publicly available transcriptome reference sequences or genome for this species. Results A cDNA library derived from mRNA and snRNA isolated and pooled from the brains, lungs, spleens, kidneys, livers, and hearts of three adult female Syrian hamsters was sequenced. Sequence reads were assembled into 62,482 contigs and 111,796 reads remained unassembled (singletons). This combined contig/singleton dataset, designated as the Syrian hamster transcriptome, represents a total of 60,117,204 nucleotides. Our Mesocricetus auratus Syrian hamster transcriptome mapped to 11,648 mouse transcripts representing 9,562 distinct genes, and mapped to a similar number of transcripts and genes in the rat. We identified 214 quasi-complete transcripts based on mouse annotations. Canonical pathways involved in a broad spectrum of fundamental biological processes were significantly represented in the library. The Syrian hamster transcriptome was aligned to the current release of the Chinese hamster ovary (CHO) cell transcriptome and genome to improve the genomic annotation of this species. Finally, our Syrian hamster transcriptome was aligned against 14 other rodents, primate and laurasiatheria species to gain insights about the genetic relatedness and placement of this species. Conclusions This Syrian hamster transcriptome dataset significantly improves our knowledge of the Syrian hamster's transcriptome, especially towards its future use in infectious disease research. Moreover, this library is an important resource for the wider scientific community to help improve genome annotation of the Syrian hamster and other closely related species. 
Furthermore, these data provide the basis for development of expression microarrays that can be used in functional genomics studies. PMID:25398096

  6. DeF-GPU: Efficient and effective deletions finding in hepatitis B viral genomic DNA using a GPU architecture.

    PubMed

    Cheng, Chun-Pei; Lan, Kuo-Lun; Liu, Wen-Chun; Chang, Ting-Tsung; Tseng, Vincent S

    2016-12-01

    Hepatitis B viral (HBV) infection is strongly associated with an increased risk of liver diseases like cirrhosis or hepatocellular carcinoma (HCC). Many lines of evidence suggest that deletions occurring in HBV genomic DNA are highly associated with the activity of HBV via the interplay between the release of aberrant viral proteins and the human immune system. Finding deletions in HBV whole-genome sequences is thus an important problem, although mining such large and complex biological data poses underlying challenges. Although some next generation sequencing (NGS) tools have recently been designed for identifying structural variations such as insertions or deletions, their validation has generally been limited to studies of human sequences. This design may not be suitable for viruses because the species differ. We propose a graphics processing unit (GPU)-based data mining method called DeF-GPU to efficiently and precisely identify HBV deletions from large NGS data, which generally contain millions of reads. To fit the single instruction multiple data instructions, sequencing reads are referred to as multiple data and the deletion finding procedure is referred to as a single instruction. We use Compute Unified Device Architecture (CUDA) to parallelize the procedures, and further validate DeF-GPU on 5 synthetic datasets and 1 real dataset. Our results suggest that DeF-GPU outperforms the existing commonly-used method Pindel and is able to exactly identify the deletions of our ground truth in a few seconds. The source code and other related materials are available at https://sourceforge.net/projects/defgpu/. Copyright © 2016 Elsevier Inc. All rights reserved.

  7. An Analysis of the Sensitivity of Proteogenomic Mapping of Somatic Mutations and Novel Splicing Events in Cancer.

    PubMed

    Ruggles, Kelly V; Tang, Zuojian; Wang, Xuya; Grover, Himanshu; Askenazi, Manor; Teubl, Jennifer; Cao, Song; McLellan, Michael D; Clauser, Karl R; Tabb, David L; Mertins, Philipp; Slebos, Robbert; Erdmann-Gilmore, Petra; Li, Shunqiang; Gunawardena, Harsha P; Xie, Ling; Liu, Tao; Zhou, Jian-Ying; Sun, Shisheng; Hoadley, Katherine A; Perou, Charles M; Chen, Xian; Davies, Sherri R; Maher, Christopher A; Kinsinger, Christopher R; Rodland, Karen D; Zhang, Hui; Zhang, Zhen; Ding, Li; Townsend, R Reid; Rodriguez, Henry; Chan, Daniel; Smith, Richard D; Liebler, Daniel C; Carr, Steven A; Payne, Samuel; Ellis, Matthew J; Fenyő, David

    2016-03-01

    Improvements in mass spectrometry (MS)-based peptide sequencing provide a new opportunity to determine whether polymorphisms, mutations, and splice variants identified in cancer cells are translated. Herein, we apply a proteogenomic data integration tool (QUILTS) to illustrate protein variant discovery using whole genome, whole transcriptome, and global proteome datasets generated from a pair of luminal and basal-like breast-cancer-patient-derived xenografts (PDX). The sensitivity of proteogenomic analysis for single nucleotide variant (SNV) expression and novel splice junction (NSJ) detection was probed using multiple MS/MS sample process replicates defined here as an independent tandem MS experiment using identical sample material. Despite analysis of over 30 sample process replicates, only about 10% of SNVs (somatic and germline) detected by both DNA and RNA sequencing were observed as peptides. An even smaller proportion of peptides corresponding to NSJ observed by RNA sequencing were detected (<0.1%). Peptides mapping to DNA-detected SNVs without a detectable mRNA transcript were also observed, suggesting that transcriptome coverage was incomplete (∼80%). In contrast to germline variants, somatic variants were less likely to be detected at the peptide level in the basal-like tumor than in the luminal tumor, raising the possibility of differential translation or protein degradation effects. In conclusion, this large-scale proteogenomic integration allowed us to determine the degree to which mutations are translated and identify gaps in sequence coverage, thereby benchmarking current technology and progress toward whole cancer proteome and transcriptome analysis. © 2016 by The American Society for Biochemistry and Molecular Biology, Inc.

  8. Ribosomal RNA Genes Contribute to the Formation of Pseudogenes and Junk DNA in the Human Genome.

    PubMed

    Robicheau, Brent M; Susko, Edward; Harrigan, Amye M; Snyder, Marlene

    2017-02-01

    Approximately 35% of the human genome can be identified as sequence devoid of a selected-effect function, and not derived from transposable elements or repeated sequences. We provide evidence supporting a known origin for a fraction of this sequence. We show that: 1) highly degraded, but near full length, ribosomal DNA (rDNA) units, including both 45S and Intergenic Spacer (IGS), can be found at multiple sites in the human genome on chromosomes without rDNA arrays, 2) that these rDNA sequences have a propensity for being centromere proximal, and 3) that sequence at all human functional rDNA array ends is divergent from canonical rDNA to the point that it is pseudogenic. We also show that small sequence strings of rDNA (from 45S + IGS) can be found distributed throughout the genome and are identifiable as an "rDNA-like signal", representing 0.26% of the q-arm of HSA21 and ∼2% of the total sequence of other regions tested. The size of sequence strings found in the rDNA-like signal intergrade into the size of sequence strings that make up the full-length degrading rDNA units found scattered throughout the genome. We conclude that the displaced and degrading rDNA sequences are likely of a similar origin but represent different stages in their evolution towards random sequence. Collectively, our data suggests that over vast evolutionary time, rDNA arrays contribute to the production of junk DNA. The concept that the production of rDNA pseudogenes is a by-product of concerted evolution represents a previously under-appreciated process; we demonstrate here its importance. © The Author(s) 2017. Published by Oxford University Press on behalf of the Society for Molecular Biology and Evolution.

  9. Sim3C: simulation of Hi-C and Meta3C proximity ligation sequencing technologies.

    PubMed

    DeMaere, Matthew Z; Darling, Aaron E

    2018-02-01

    Chromosome conformation capture (3C) and Hi-C DNA sequencing methods have rapidly advanced our understanding of the spatial organization of genomes and metagenomes. Many variants of these protocols have been developed, each with their own strengths. Currently there is no systematic means for simulating sequence data from this family of sequencing protocols, potentially hindering the advancement of algorithms to exploit this new datatype. We describe a computational simulator that, given simple parameters and reference genome sequences, will simulate Hi-C sequencing on those sequences. The simulator models the basic spatial structure in genomes that is commonly observed in Hi-C and 3C datasets, including the distance-decay relationship in proximity ligation, differences in the frequency of interaction within and across chromosomes, and the structure imposed by cells. A means to model the 3D structure of randomly generated topologically associating domains is provided. The simulator considers several sources of error common to 3C and Hi-C library preparation and sequencing methods, including spurious proximity ligation events and sequencing error. We have introduced the first comprehensive simulator for 3C and Hi-C sequencing protocols. We expect the simulator to have use in testing of Hi-C data analysis algorithms, as well as more general value for experimental design, where questions such as the required depth of sequencing, enzyme choice, and other decisions can be made in advance in order to ensure adequate statistical power with respect to experimental hypothesis testing.
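
    The distance-decay relationship mentioned above can be illustrated with a toy sampler in which the probability of a ligation pair falls off as a power of the genomic separation. This is a hedged sketch only: the power-law form, the exponent `alpha`, the single linear chromosome, and the function name are assumptions, and none of it reflects Sim3C's actual model or command-line interface.

```python
import random


def simulate_contacts(genome_len, n_pairs, alpha=1.0, seed=0):
    """Sample intra-chromosomal contact pairs with P(separation d) proportional to d**-alpha.

    Returns a list of (left, right) coordinates with 0 <= left < right < genome_len.
    """
    rng = random.Random(seed)
    separations = list(range(1, genome_len))
    weights = [d ** -alpha for d in separations]  # short separations are favoured

    pairs = []
    for sep in rng.choices(separations, weights=weights, k=n_pairs):
        left = rng.randrange(genome_len - sep)  # place the pair uniformly along the genome
        pairs.append((left, left + sep))
    return pairs


if __name__ == "__main__":
    contacts = simulate_contacts(genome_len=1000, n_pairs=5000)
    short_range = sum(1 for a, b in contacts if b - a <= 50)
    long_range = sum(1 for a, b in contacts if b - a > 950)
    print(short_range, long_range)
```

    A realistic simulator would additionally model inter-chromosomal and inter-cellular contacts, restriction-site positions, spurious ligation products and sequencing error, as the abstract lists.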

  10. Sequence editing by Apolipoprotein B RNA-editing catalytic component-B and epidemiological surveillance of transmitted HIV-1 drug resistance

    PubMed Central

    Gifford, Robert J.; Rhee, Soo-Yon; Eriksson, Nicolas; Liu, Tommy F.; Kiuchi, Mark; Das, Amar K.; Shafer, Robert W.

    2008-01-01

    Design Promiscuous guanine (G) to adenine (A) substitutions catalysed by apolipoprotein B RNA-editing catalytic component (APOBEC) enzymes are observed in a proportion of HIV-1 sequences in vivo and can introduce artifacts into some genetic analyses. The potential impact of undetected lethal editing on genotypic estimation of transmitted drug resistance was assessed. Methods Classifiers of lethal, APOBEC-mediated editing were developed by analysis of lentiviral pol gene sequence variation and evaluated using control sets of HIV-1 sequences. The potential impact of sequence editing on genotypic estimation of drug resistance was assessed in sets of sequences obtained from 77 studies of 25 or more therapy-naive individuals, using mixture modelling approaches to determine the maximum likelihood classification of sequences as lethally edited as opposed to viable. Results Analysis of 6437 protease and reverse transcriptase sequences from therapy-naive individuals using a novel classifier of lethal, APOBEC3G-mediated sequence editing, the ‘polypeptide-like 3G (APOBEC3G)-mediated defectives (A3GD) index’, detected lethal editing in association with spurious ‘transmitted drug resistance’ in nearly 3% of proviral sequences obtained from whole blood and 0.2% of samples obtained from plasma. Conclusion Screening for lethally edited sequences in datasets containing a proportion of proviral DNA, such as those likely to be obtained for epidemiological surveillance of transmitted drug resistance in the developing world, can eliminate rare but potentially significant errors in genotypic estimation of transmitted drug resistance. PMID:18356601

  11. Nuclear markers confirm taxonomic status and relationships among highly endangered and closely related right whale species

    PubMed Central

    Gaines, C.A; Hare, M.P; Beck, S.E; Rosenbaum, H.C

    2005-01-01

    Right whales (genus: Eubalaena) are among the most endangered mammals, yet their taxonomy and phylogeny have been questioned. A phylogenetic hypothesis based on mitochondrial DNA (mtDNA) variation recently prompted a taxonomic revision, increasing the number of right whale species to three. We critically evaluated this hypothesis using sequence data from 13 nuclear DNA (nuDNA) loci as well as the mtDNA control region. Fixed diagnostic characters among the nuclear markers strongly support the hypothesis of three genetically distinct species, despite the lack of any diagnostic morphological characters. A phylogenetic analysis of all data produced a strict consensus cladogram with strong support at nodes that define each right whale species as well as relationships among species. Results showed very little conflict among the individual partitions as well as congruence between the mtDNA and nuDNA datasets. These data clearly demonstrate the strength of using numerous independent genetic markers during a phylogenetic analysis of closely related species. In evaluating phylogenetic support contributed by individual loci, 11 of the 14 loci provided support for at least one of the nodes of interest to this study. Only a single marker (mtDNA control region) provided support at all four nodes. A study using any single nuclear marker would have failed to support the proposed phylogeny, and a strong phylogenetic hypothesis was only revealed by the simultaneous analysis of many nuclear loci. In addition, nuDNA and mtDNA data provided complementary levels of support at nodes of different evolutionary depth indicating that the combined use of mtDNA and nuDNA data is both practical and desirable. PMID:15846869

  12. Representation of DNA sequences in genetic codon context with applications in exon and intron prediction.

    PubMed

    Yin, Changchuan

    2015-04-01

    To apply digital signal processing (DSP) methods to analyze DNA sequences, the sequences first must be specially mapped into numerical sequences. Thus, effective numerical mappings of DNA sequences play key roles in the effectiveness of DSP-based methods such as exon prediction. Despite numerous mappings of symbolic DNA sequences to numerical series, the existing mapping methods do not include the genetic coding features of DNA sequences. We present a novel numerical representation of DNA sequences using genetic codon context (GCC) in which the numerical values are optimized by simulated annealing to maximize the 3-periodicity signal to noise ratio (SNR). The optimized GCC representation is then applied in exon and intron prediction by a Short-Time Fourier Transform (STFT) approach. The results show the GCC method enhances the SNR values of exon sequences and thus increases the accuracy of predicting protein coding regions in genomes compared with the commonly used 4D binary representation. In addition, this study offers a novel way to reveal specific features of DNA sequences by optimizing numerical mappings of symbolic DNA sequences.
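
    As a rough illustration of the 3-periodicity SNR mentioned in this record, the sketch below uses the common 4D binary (indicator) representation that the abstract names as its baseline, not the GCC mapping itself. The function name and the choice to define SNR as the power at frequency N/3 divided by the mean power of the non-DC spectrum are assumptions made for this example.

```python
import numpy as np

def three_periodicity_snr(seq: str) -> float:
    """Power at frequency N/3 divided by the mean non-DC spectral power.

    Uses four binary indicator sequences (one per base); this is an
    assumed, simplified definition, not the GCC method of the record.
    """
    seq = seq.upper()
    n = len(seq) - len(seq) % 3          # trim so that N/3 is an exact DFT bin
    seq = seq[:n]
    if n < 3:
        raise ValueError("sequence must contain at least 3 bases")
    power = np.zeros(n)
    for base in "ACGT":
        indicator = np.array([1.0 if ch == base else 0.0 for ch in seq])
        power += np.abs(np.fft.fft(indicator)) ** 2
    mean_power = power[1:].mean()        # skip the DC term at k = 0
    if mean_power == 0:
        return 0.0
    return float(power[n // 3] / mean_power)

# A codon-like repeat has a strong period-3 peak; a period-2 repeat does not.
coding_like = "ATG" * 30
period_two = "AC" * 45
print(three_periodicity_snr(coding_like), three_periodicity_snr(period_two))
```

    Sliding this calculation along a genome in fixed windows gives a crude stand-in for the STFT step; the window length would be a further assumption.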

  13. Single-cell genomic sequencing using Multiple Displacement Amplification.

    PubMed

    Lasken, Roger S

    2007-10-01

    Single microbial cells can now be sequenced using DNA amplified by the Multiple Displacement Amplification (MDA) reaction. The few femtograms of DNA in a bacterium are amplified into micrograms of high molecular weight DNA suitable for DNA library construction and Sanger sequencing. The MDA-generated DNA also performs well when used directly as template for pyrosequencing by the 454 Life Sciences method. While MDA from single cells loses some of the genomic sequence, this approach will greatly accelerate the pace of sequencing from uncultured microbes. The genetically linked sequences from single cells are also a powerful tool to be used in guiding genomic assembly of shotgun sequences of multiple organisms from environmental DNA extracts (metagenomic sequences).

  14. An evaluation of two-channel ChIP-on-chip and DNA methylation microarray normalization strategies

    PubMed Central

    2012-01-01

    Background The combination of chromatin immunoprecipitation with two-channel microarray technology enables genome-wide mapping of binding sites of DNA-interacting proteins (ChIP-on-chip) or sites with methylated CpG di-nucleotides (DNA methylation microarray). These powerful tools are the gateway to understanding gene transcription regulation. Since the goals of such studies, the sample preparation procedures, the microarray content and study design are all different from transcriptomics microarrays, the data pre-processing strategies traditionally applied to transcriptomics microarrays may not be appropriate. Particularly, the main challenge of the normalization of "regulation microarrays" is (i) to make the data of individual microarrays quantitatively comparable and (ii) to keep the signals of the enriched probes, representing DNA sequences from the precipitate, as distinguishable as possible from the signals of the un-enriched probes, representing DNA sequences largely absent from the precipitate. Results We compare several widely used normalization approaches (VSN, LOWESS, quantile, T-quantile, Tukey's biweight scaling, Peng's method) applied to a selection of regulation microarray datasets, ranging from DNA methylation to transcription factor binding and histone modification studies. Through comparison of the data distributions of control probes and gene promoter probes before and after normalization, and assessment of the power to identify known enriched genomic regions after normalization, we demonstrate that there are clear differences in performance between normalization procedures. Conclusion T-quantile normalization applied separately on the channels and Tukey's biweight scaling outperform other methods in terms of the conservation of enriched and un-enriched signal separation, as well as in identification of genomic regions known to be enriched. T-quantile normalization is preferable as it additionally improves comparability between microarrays. 
    In contrast, popular normalization approaches like quantile, LOWESS, Peng's method and VSN normalization alter the data distributions of regulation microarrays to such an extent that using these approaches will impact the reliability of the downstream analysis substantially. PMID:22276688
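
    For readers unfamiliar with the methods compared here, the following is a minimal sketch of plain quantile normalization, one of the approaches the abstract evaluates. It is not the T-quantile variant the authors recommend, and the function name and the probes-by-arrays matrix layout are assumptions for illustration.

```python
import numpy as np

def quantile_normalize(x):
    """Force every column (array) to share the same empirical distribution.

    x: 2-D array with probes in rows and arrays/channels in columns.
    Ties are broken by position (ordinal ranks), a simplification.
    """
    x = np.asarray(x, dtype=float)
    order = np.argsort(x, axis=0)                 # per-column sort order
    ranks = np.argsort(order, axis=0)             # rank of each entry in its column
    reference = np.sort(x, axis=0).mean(axis=1)   # mean of the k-th smallest values
    return reference[ranks]

signals = np.array([[5.0, 4.0, 3.0],
                    [2.0, 1.0, 4.0],
                    [3.0, 4.0, 6.0],
                    [4.0, 2.0, 8.0]])
print(quantile_normalize(signals))
```

    Because every column ends up with an identical distribution, any genuine shift between enriched and un-enriched probes that differs across arrays is flattened, which is consistent with the distortion the abstract reports for regulation microarrays.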

  15. Modified RNA-seq method for microbial community and diversity analysis using rRNA in different types of environmental samples

    PubMed Central

    Yan, Yong-Wei; Zou, Bin; Zhu, Ting; Hozzein, Wael N.

    2017-01-01

    RNA-seq-based SSU (small subunit) rRNA (ribosomal RNA) analysis has provided a better understanding of potentially active microbial community within environments. However, for RNA-seq library construction, high quantities of purified RNA are typically required. We propose a modified RNA-seq method for SSU rRNA-based microbial community analysis that depends on the direct ligation of a 5’ adaptor to RNA before reverse-transcription. The method requires only a low-input quantity of RNA (10–100 ng) and does not require a DNA removal step. The method was initially tested on three mock communities synthesized with enriched SSU rRNA of archaeal, bacterial and fungal isolates at different ratios, and was subsequently used for environmental samples of high or low biomass. For high-biomass salt-marsh sediments, enriched SSU rRNA and total nucleic acid-derived RNA-seq datasets revealed highly consistent community compositions for all of the SSU rRNA sequences, and as much as 46.4%-59.5% of 16S rRNA sequences were suitable for OTU (operational taxonomic unit)-based community and diversity analyses with complete coverage of V1-V2 regions. OTU-based community structures for the two datasets were also highly consistent with those determined by all of the 16S rRNA reads. For low-biomass samples, total nucleic acid-derived RNA-seq datasets were analyzed, and highly active bacterial taxa were also identified by the OTU-based method, notably including members of the previously underestimated genus Nitrospira and phylum Acidobacteria in tap water, members of the phylum Actinobacteria on a shower curtain, and members of the phylum Cyanobacteria on leaf surfaces. More than half of the bacterial 16S rRNA sequences covered the complete region of primer 8F, and non-coverage rates as high as 38.7% were obtained for phylum-unclassified sequences, providing many opportunities to identify novel bacterial taxa. 
    This modified RNA-seq method will provide a better snapshot of diverse microbial communities, most notably by OTU-based analysis, even communities with low-biomass samples. PMID:29016661

  16. Analysis of expressed sequence tags from Prunus mume flower and fruit and development of simple sequence repeat markers

    PubMed Central

    2010-01-01

    Background Expressed sequence tags (ESTs) have been a cost-effective tool in molecular biology and represent an abundant, valuable resource for genome annotation, gene expression, and comparative genomics in plants. Results In this study, we constructed a cDNA library of Prunus mume flower and fruit, sequenced 10,123 clones of the library, and obtained 8,656 high-quality expressed sequence tag (EST) sequences. The ESTs were assembled into 4,473 unigenes composed of 1,492 contigs and 2,981 singletons, which have been deposited in NCBI (accession IDs: GW868575 - GW873047), among which 1,294 unique ESTs had known or putative functions. Furthermore, we found 1,233 putative simple sequence repeats (SSRs) in the P. mume unigene dataset. We randomly tested 42 pairs of PCR primers flanking potential SSRs, and 14 pairs were identified as true-to-type SSR loci and could amplify polymorphic bands from 20 individual plants of P. mume. We further used the 14 EST-SSR primer pairs to test the transferability on peach and plum. The result showed that nearly 89% of the primer pairs produced target PCR bands in the two species. A high level of marker polymorphism was observed in the plum species (65%) and a lower level in the peach (46%), and the clustering analysis of the three species indicated that these SSR markers were useful in the evaluation of genetic relationships and diversity between and within the Prunus species. Conclusions We have constructed the first cDNA library of P. mume flower and fruit, and our data provide sets of molecular biology resources for P. mume and other Prunus species. These resources will be useful for further studies such as genome annotation, new gene discovery, gene functional analysis, molecular breeding, evolution and comparative genomics between Prunus species. PMID:20626882
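
    The SSR search described in this record can be pictured with a small regular-expression scan. This is only a sketch: the abstract does not say which tool or thresholds the authors used, so the function name, the 2 to 6 bp motif range, and the minimum of four repeats are all assumptions.

```python
import re

def find_ssrs(seq, min_unit=2, max_unit=6, min_repeats=4):
    """Return (start, motif, repeat_count) for perfect tandem repeats.

    start is a 0-based offset into seq. Thresholds are illustrative
    assumptions, not values taken from the record.
    """
    seq = seq.upper()
    # Lazy quantifier: the shortest motif that repeats is tried first.
    pattern = re.compile(
        r"([ACGT]{%d,%d}?)\1{%d,}" % (min_unit, max_unit, min_repeats - 1)
    )
    hits = []
    for match in pattern.finditer(seq):
        motif = match.group(1)
        if len(set(motif)) == 1:
            continue                      # skip homopolymer runs such as AAAA...
        hits.append((match.start(), motif, len(match.group(0)) // len(motif)))
    return hits

unigene = "GGCT" + "AG" * 6 + "TTCA" + "CTT" * 5 + "GA"
for start, motif, count in find_ssrs(unigene):
    print(start, motif, count)
```

    Primer pairs, as in the study, would then be designed against the sequence flanking each reported start position.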

  17. A new model for ancient DNA decay based on paleogenomic meta-analysis

    PubMed Central

    Ware, Roselyn; Smith, Oliver; Collins, Matthew

    2017-01-01

    The persistence of DNA over archaeological and paleontological timescales in diverse environments has led to a revolutionary body of paleogenomic research, yet the dynamics of DNA degradation are still poorly understood. We analyzed 185 paleogenomic datasets and compared DNA survival with environmental variables and sample ages. We find cytosine deamination follows a conventional thermal age model, but we find no correlation between DNA fragmentation and sample age over the timespans analyzed, even when controlling for environmental variables. We propose a model for ancient DNA decay wherein fragmentation rapidly reaches a threshold, then subsequently slows. The observed loss of DNA over time may be due to a bulk diffusion process in many cases, highlighting the importance of tissues and environments creating effectively closed systems for DNA preservation. This model of DNA degradation is largely based on mammal bone samples due to published genomic dataset availability. Continued refinement to the model to reflect diverse biological systems and tissue types will further improve our understanding of ancient DNA breakdown dynamics. PMID:28486705

  18. Acquisition of New DNA Sequences After Infection of Chicken Cells with Avian Myeloblastosis Virus

    PubMed Central

    Shoyab, M.; Baluda, M. A.; Evans, R.

    1974-01-01

    DNA-RNA hybridization studies between 70S RNA from avian myeloblastosis virus (AMV) and an excess of DNA from (i) AMV-induced leukemic chicken myeloblasts or (ii) a mixture of normal and of congenitally infected K-137 chicken embryos producing avian leukosis viruses revealed the presence of fast- and slow-hybridizing virus-specific DNA sequences. However, the leukemic cells contained twice the level of AMV-specific DNA sequences observed in normal chicken embryonic cells. The fast-reacting sequences were two to three times more numerous in leukemic DNA than in DNA from the mixed embryos. The slow-reacting sequences had a reiteration frequency of approximately 9 and 6, in the two respective systems. Both the fast- and the slow-reacting DNA sequences in leukemic cells exhibited a higher Tm (2 C) than the respective DNA sequences in normal cells. In normal and leukemic cells the slow hybrid sequences appeared to have a Tm which was 2 C higher than that of the fast hybrid sequences. Individual non-virus-producing chicken embryos, either group-specific antigen positive or negative, contained 40 to 100 copies of the fast sequences and 2 to 6 copies of the slowly hybridizing sequences per cell genome. Normal rat cells did not contain DNA that hybridized with AMV RNA, whereas non-virus-producing rat cells transformed by B-77 avian sarcoma virus contained only the slowly reacting sequences. The results demonstrate that leukemic cells transformed by AMV contain new AMV-specific DNA sequences which were not present before infection. PMID:16789139

  19. Homogeneity of the 16S rDNA sequence among geographically disparate isolates of Taylorella equigenitalis

    PubMed Central

    Matsuda, M; Tazumi, A; Kagawa, S; Sekizuka, T; Murayama, O; Moore, JE; Millar, BC

    2006-01-01

    Background At present, six accessible sequences of 16S rDNA from Taylorella equigenitalis (T. equigenitalis) are available, whose sequence differences occur at a few nucleotide positions. Thus it is important to determine these sequences from additional strains in other countries, if possible, in order to clarify any anomalies regarding 16S rDNA sequence heterogeneity. Here, we clone and sequence the approximate full-length 16S rDNA from additional strains of T. equigenitalis isolated in Japan, Australia and France and compare these sequences to the existing published sequences. Results Clarification of any anomalies regarding 16S rDNA sequence heterogeneity of T. equigenitalis was carried out. Cloning, sequencing and comparison of the approximate full-length 16S rDNA from 17 strains of T. equigenitalis isolated in Japan, Australia and France demonstrated nucleotide sequence differences at six loci in the 1,469-nucleotide sequence. Moreover, 12 polymorphic sites occurred among 23 sequences of the 16S rDNA, including the six reference sequences. Conclusion High sequence similarity (99.5% or more) was observed throughout, except from nucleotide positions 138 to 501 where substitutions and deletions were noted. PMID:16398935

  20. Approaching the taxonomic affiliation of unidentified sequences in public databases--an example from the mycorrhizal fungi.

    PubMed

    Nilsson, R Henrik; Kristiansson, Erik; Ryberg, Martin; Larsson, Karl-Henrik

    2005-07-18

    During the last few years, DNA sequence analysis has become one of the primary means of taxonomic identification of species, particularly so for species that are minute or otherwise lack distinct, readily obtainable morphological characters. Although the number of sequences available for comparison in public databases such as GenBank increases exponentially, only a minuscule fraction of all organisms have been sequenced, leaving taxon sampling a momentous problem for sequence-based taxonomic identification. When querying GenBank with a set of unidentified sequences, a considerable proportion typically lack fully identified matches, forming an ever-mounting pile of sequences that the researcher will have to monitor manually in the hope that new, clarifying sequences have been submitted by other researchers. To alleviate these concerns, a project to automatically monitor select unidentified sequences in GenBank for taxonomic progress through repeated local BLAST searches was initiated. Mycorrhizal fungi--a field where species identification often is prohibitively complex--and the much used ITS locus were chosen as test bed. A Perl script package called emerencia is presented. On a regular basis, it downloads select sequences from GenBank, separates the identified sequences from those insufficiently identified, and performs BLAST searches between these two datasets, storing all results in an SQL database. On the accompanying web-service http://emerencia.math.chalmers.se, users can monitor the taxonomic progress of insufficiently identified sequences over time, either through active searches or by signing up for e-mail notification upon disclosure of better matches. Other search categories, such as listing all insufficiently identified sequences (and their present best fully identified matches) publication-wise, are also available. 
    The ever-increasing use of DNA sequences for identification purposes largely falls back on the assumption that public sequence databases contain a thorough sampling of taxonomically well-annotated sequences. Taxonomy, held by some to be an old-fashioned trade, has accordingly never been more important. emerencia does not automate the taxonomic process, but it does allow researchers to focus their efforts elsewhere than countless manual BLAST runs and arduous sieving of BLAST hit lists. The emerencia system is available on an open source basis for local installation with any organism and gene group as targets.

  1. Revising the recent evolutionary history of equids using ancient DNA

    PubMed Central

    Orlando, Ludovic; Metcalf, Jessica L.; Alberdi, Maria T.; Telles-Antunes, Miguel; Bonjean, Dominique; Otte, Marcel; Martin, Fabiana; Eisenmann, Véra; Mashkour, Marjan; Morello, Flavia; Prado, Jose L.; Salas-Gismondi, Rodolfo; Shockey, Bruce J.; Wrinn, Patrick J.; Vasil'ev, Sergei K.; Ovodov, Nikolai D.; Cherry, Michael I.; Hopwood, Blair; Male, Dean; Austin, Jeremy J.; Hänni, Catherine; Cooper, Alan

    2009-01-01

    The rich fossil record of the family Equidae (Mammalia: Perissodactyla) over the past 55 MY has made it an icon for the patterns and processes of macroevolution. Despite this, many aspects of equid phylogenetic relationships and taxonomy remain unresolved. Recent genetic analyses of extinct equids have revealed unexpected evolutionary patterns and a need for major revisions at the generic, subgeneric, and species levels. To investigate this issue we examine 35 ancient equid specimens from four geographic regions (South America, Europe, Southwest Asia, and South Africa), of which 22 delivered 87–688 bp of reproducible aDNA mitochondrial sequence. Phylogenetic analyses support a major revision of the recent evolutionary history of equids and reveal two new species, a South American hippidion and a descendant of a basal lineage potentially related to Middle Pleistocene equids. Sequences from specimens assigned to the giant extinct Cape zebra, Equus capensis, formed a separate clade within the modern plain zebra species, a phenotypically plastic group that also included the extinct quagga. In addition, we revise the currently recognized extinction times for two hemione-related equid groups. However, it is apparent that the current dataset cannot solve all of the taxonomic and phylogenetic questions relevant to the evolution of Equus. In light of these findings, we propose a rapid DNA barcoding approach to evaluate the taxonomic status of the many Late Pleistocene fossil Equidae species that have been described from purely morphological analyses. PMID:20007379

  2. Multigene analysis suggests ecological speciation in the fungal pathogen Claviceps purpurea

    PubMed Central

    Douhan, G. W.; Smith, M. E.; Huyrn, K. L.; Westbrook, A.; Beerli, P.; Fisher, A. J.

    2008-01-01

    Claviceps purpurea is an important pathogen of grasses and source of novel chemical compounds. Three groups within this species (G1, G2, and G3) have been recognized based on habitat association, sclerotia and conidia morphology, and alkaloid production. These groups have further been supported by RAPD and AFLP markers, suggesting this species may be more accurately described as a species complex. However, all divergent ecotypes can coexist in sympatric populations with no obvious physical barriers to prevent gene flow. In this study, we used both phylogenetic and population genetic analyses to test for speciation within C. purpurea using DNA sequences from ITS, a RAS-like locus, and a portion of beta-tubulin. The G1 types are significantly divergent from the G2/G3 types based on each of the three loci and the combined dataset, whereas the G2/G3 types are more integrated with one another. Although the G2 and G3 lineages have not diverged as much as the G1 lineage based on DNA sequence data, the use of three DNA loci does reliably separate the G2 and G3 lineages. However, the population genetic analyses strongly suggest little to no gene flow occurring between the different ecotypes and we argue that this process is driven by adaptations to ecological habitats; G1 isolates are associated with terrestrial grasses, G2 isolates are found in wet and shady environments, and G3 isolates are found in salt marsh habitats. PMID:18373531

  3. De Novo Transcriptome Sequence Assembly from Coconut Leaves and Seeds with a Focus on Factors Involved in RNA-Directed DNA Methylation

    PubMed Central

    Huang, Ya-Yi; Lee, Chueh-Pai; Fu, Jason L.; Chang, Bill Chia-Han; Matzke, Antonius J. M.; Matzke, Marjori

    2014-01-01

    Coconut palm (Cocos nucifera) is a symbol of the tropics and a source of numerous edible and nonedible products of economic value. Despite its nutritional and industrial significance, coconut remains under-represented in public repositories for genomic and transcriptomic data. We report de novo transcript assembly from RNA-seq data and analysis of gene expression in seed tissues (embryo and endosperm) and leaves of a dwarf coconut variety. Assembly of 10 GB sequencing data for each tissue resulted in 58,211 total unigenes in embryo, 61,152 in endosperm, and 33,446 in leaf. Within each unigene pool, 24,857 could be annotated in embryo, 29,731 could be annotated in endosperm, and 26,064 could be annotated in leaf. A KEGG analysis identified 138, 138, and 139 pathways, respectively, in transcriptomes of embryo, endosperm, and leaf tissues. Given the extraordinarily large size of coconut seeds and the importance of small RNA-mediated epigenetic regulation during seed development in model plants, we used homology searches to identify putative homologs of factors required for RNA-directed DNA methylation in coconut. The findings suggest that RNA-directed DNA methylation is important during coconut seed development, particularly in maturing endosperm. This dataset will expand the genomics resources available for coconut and provide a foundation for more detailed analyses that may assist molecular breeding strategies aimed at improving this major tropical crop. PMID:25193496

  4. De novo transcriptome sequence assembly from coconut leaves and seeds with a focus on factors involved in RNA-directed DNA methylation.

    PubMed

    Huang, Ya-Yi; Lee, Chueh-Pai; Fu, Jason L; Chang, Bill Chia-Han; Matzke, Antonius J M; Matzke, Marjori

    2014-09-04

    Coconut palm (Cocos nucifera) is a symbol of the tropics and a source of numerous edible and nonedible products of economic value. Despite its nutritional and industrial significance, coconut remains under-represented in public repositories for genomic and transcriptomic data. We report de novo transcript assembly from RNA-seq data and analysis of gene expression in seed tissues (embryo and endosperm) and leaves of a dwarf coconut variety. Assembly of 10 GB sequencing data for each tissue resulted in 58,211 total unigenes in embryo, 61,152 in endosperm, and 33,446 in leaf. Within each unigene pool, 24,857 could be annotated in embryo, 29,731 could be annotated in endosperm, and 26,064 could be annotated in leaf. A KEGG analysis identified 138, 138, and 139 pathways, respectively, in transcriptomes of embryo, endosperm, and leaf tissues. Given the extraordinarily large size of coconut seeds and the importance of small RNA-mediated epigenetic regulation during seed development in model plants, we used homology searches to identify putative homologs of factors required for RNA-directed DNA methylation in coconut. The findings suggest that RNA-directed DNA methylation is important during coconut seed development, particularly in maturing endosperm. This dataset will expand the genomics resources available for coconut and provide a foundation for more detailed analyses that may assist molecular breeding strategies aimed at improving this major tropical crop. Copyright © 2014 Huang et al.

  5. Detection and quantitation of single nucleotide polymorphisms, DNA sequence variations, DNA mutations, DNA damage and DNA mismatches

    DOEpatents

    McCutchen-Maloney, Sandra L.

    2002-01-01

    DNA mutation binding proteins alone and as chimeric proteins with nucleases are used with solid supports to detect DNA sequence variations, DNA mutations and single nucleotide polymorphisms. The solid supports may be flow cytometry beads, DNA chips, glass slides or DNA dipsticks. DNA molecules are coupled to solid supports to form DNA-support complexes. Labeled DNA is used with unlabeled DNA mutation binding proteins such as TthMutS to detect DNA sequence variations, DNA mutations and single nucleotide length polymorphisms by binding which gives an increase in signal. Unlabeled DNA is utilized with labeled chimeras to detect DNA sequence variations, DNA mutations and single nucleotide length polymorphisms by nuclease activity of the chimera which gives a decrease in signal.

  6. Methylation patterns of repetitive DNA sequences in germ cells of Mus musculus.

    PubMed

    Sanford, J; Forrester, L; Chapman, V; Chandley, A; Hastie, N

    1984-03-26

    The major and the minor satellite sequences of Mus musculus were undermethylated in both sperm and oocyte DNAs relative to the amount of undermethylation observed in adult somatic tissue DNA. This hypomethylation was specific for satellite sequences in sperm DNA. Dispersed repetitive and low copy sequences show a high degree of methylation in sperm DNA; however, a dispersed repetitive sequence was undermethylated in oocyte DNA. This finding suggests a difference in the amount of total genomic DNA methylation between sperm and oocyte DNA. The methylation levels of the minor satellite sequences did not change during spermiogenesis, and were not associated with the onset of meiosis or a specific stage in sperm development.

  7. Nebula--a web-server for advanced ChIP-seq data analysis.

    PubMed

    Boeva, Valentina; Lermine, Alban; Barette, Camille; Guillouf, Christel; Barillot, Emmanuel

    2012-10-01

    ChIP-seq consists of chromatin immunoprecipitation and deep sequencing of the extracted DNA fragments. It is the technique of choice for accurate characterization of the binding sites of transcription factors and other DNA-associated proteins. We present a web service, Nebula, which allows inexperienced users to perform a complete bioinformatics analysis of ChIP-seq data. Nebula was designed for both bioinformaticians and biologists. It is based on the Galaxy open source framework. Galaxy already includes a large number of functionalities for mapping reads and peak calling. We added the following to Galaxy: (i) peak calling with FindPeaks and a module for immunoprecipitation quality control, (ii) de novo motif discovery with ChIPMunk, (iii) calculation of the density and the cumulative distribution of peak locations relative to gene transcription start sites, (iv) annotation of peaks with genomic features and (v) annotation of genes with peak information. Nebula generates the graphs and the enrichment statistics at each step of the process. During Steps 3-5, Nebula optionally repeats the analysis on a control dataset and compares these results with those from the main dataset. Nebula can also incorporate gene expression (or gene modulation) data during these steps. In summary, Nebula is an innovative web service that provides an advanced ChIP-seq analysis pipeline providing ready-to-publish results. Nebula is available at http://nebula.curie.fr/. Supplementary data are available at Bioinformatics online.
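
    Step (iii) of the list above, the cumulative distribution of peak locations relative to transcription start sites, can be sketched in a few lines. This is not Nebula's code: the function names are hypothetical, and the sketch assumes a single chromosome, unsigned distances, and no strand information.

```python
from bisect import bisect_left

def distance_to_nearest_tss(peak_pos, sorted_tss):
    """Unsigned distance (bp) from a peak summit to the closest TSS."""
    if not sorted_tss:
        raise ValueError("need at least one TSS position")
    i = bisect_left(sorted_tss, peak_pos)
    # Only the TSS just before and just after the peak can be the nearest.
    candidates = sorted_tss[max(i - 1, 0): i + 1]
    return min(abs(peak_pos - t) for t in candidates)

def cumulative_fraction(distances, thresholds):
    """Fraction of peaks lying within each distance threshold."""
    if not distances:
        raise ValueError("need at least one distance")
    n = len(distances)
    return [sum(d <= t for d in distances) / n for t in thresholds]

tss = sorted([1000, 5000, 9000])          # hypothetical TSS coordinates
peaks = [1200, 4900, 9500, 5000]          # hypothetical peak summits
dists = [distance_to_nearest_tss(p, tss) for p in peaks]
print(dists, cumulative_fraction(dists, [100, 250, 1000]))
```

    Running the same two functions on a control peak set gives the comparison curve that the abstract describes for the control dataset.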

  8. Process of labeling specific chromosomes using recombinant repetitive DNA

    DOEpatents

    Moyzis, R.K.; Meyne, J.

    1988-02-12

    Chromosome preferential nucleotide sequences are first determined from a library of recombinant DNA clones having families of repetitive sequences. Library clones are identified with a low homology with a sequence of repetitive DNA families to which the first clones respectively belong and variant sequences are then identified by selecting clones having a pattern of hybridization with genomic DNA dissimilar to the hybridization pattern shown by the respective families. In another embodiment, variant sequences are selected from a sequence of a known repetitive DNA family. The selected variant sequence is classified as chromosome specific, chromosome preferential, or chromosome nonspecific. Sequences which are classified as chromosome preferential are further sequenced and regions are identified having a low homology with other regions of the chromosome preferential sequence or with known sequences of other family members and consensus sequences of the repetitive DNA families for the chromosome preferential sequences. The selected low homology regions are then hybridized with chromosomes to determine those low homology regions hybridized with a specific chromosome under normal stringency conditions.

  9. Metaxa: a software tool for automated detection and discrimination among ribosomal small subunit (12S/16S/18S) sequences of archaea, bacteria, eukaryotes, mitochondria, and chloroplasts in metagenomes and environmental sequencing datasets.

    PubMed

    Bengtsson, Johan; Eriksson, K Martin; Hartmann, Martin; Wang, Zheng; Shenoy, Belle Damodara; Grelet, Gwen-Aëlle; Abarenkov, Kessy; Petri, Anna; Rosenblad, Magnus Alm; Nilsson, R Henrik

    2011-10-01

    The ribosomal small subunit (SSU) rRNA gene has emerged as an important genetic marker for taxonomic identification in environmental sequencing datasets. In addition to being present in the nucleus of eukaryotes and the core genome of prokaryotes, the gene is also found in the mitochondria of eukaryotes and in the chloroplasts of photosynthetic eukaryotes. These three sets of genes are conceptually paralogous and should in most situations not be aligned and analyzed jointly. To identify the origin of SSU sequences in complex sequence datasets has hitherto been a time-consuming and largely manual undertaking. However, the present study introduces Metaxa (http://microbiology.se/software/metaxa/), an automated software tool to extract full-length and partial SSU sequences from larger sequence datasets and assign them to an archaeal, bacterial, nuclear eukaryote, mitochondrial, or chloroplast origin. Using data from reference databases and from full-length organelle and organism genomes, we show that Metaxa detects and scores SSU sequences for origin with very low proportions of false positives and negatives. We believe that this tool will be useful in microbial and evolutionary ecology as well as in metagenomics.

  10. Enlightenment of Yeast Mitochondrial Homoplasmy: Diversified Roles of Gene Conversion

    PubMed Central

    Ling, Feng; Mikawa, Tsutomu; Shibata, Takehiko

    2011-01-01

    Mitochondria have their own genomic DNA. Unlike the nuclear genome, each cell contains hundreds to thousands of copies of mitochondrial DNA (mtDNA). The copies of mtDNA tend to have heterogeneous sequences, due to the high frequency of mutagenesis, but are quickly homogenized within a cell (“homoplasmy”) during vegetative cell growth or through a few sexual generations. Heteroplasmy is strongly associated with mitochondrial diseases, diabetes and aging. Recent studies revealed that the yeast cell has the machinery to homogenize mtDNA, using a common DNA processing pathway with gene conversion; i.e., both genetic events are initiated by a double-stranded break, which is processed into 3′ single-stranded tails. One of the tails is base-paired with the complementary sequence of the recipient double-stranded DNA to form a D-loop (homologous pairing), in which repair DNA synthesis is initiated to restore the sequence lost by the breakage. Gene conversion generates sequence diversity, depending on the divergence between the donor and recipient sequences, especially when it occurs among a number of copies of a DNA sequence family with some sequence variations, such as in immunoglobulin diversification in chicken. MtDNA can be regarded as a sequence family, in which the members tend to be diversified by a high frequency of spontaneous mutagenesis. Thus, it would be interesting to determine why and how double-stranded breakage and D-loop formation induce sequence homogenization in mitochondria and sequence diversification in nuclear DNA. We will review the mechanisms and roles of mtDNA homoplasmy, in contrast to nuclear gene conversion, which diversifies gene and genome sequences, to provide clues toward understanding how the common DNA processing pathway results in such divergent outcomes. PMID:24710143

  11. Global phylogeography and genetic diversity of the zoonotic tapeworm Echinococcus granulosus sensu stricto genotype G1.

    PubMed

    Kinkar, Liina; Laurimäe, Teivi; Acosta-Jamett, Gerardo; Andresiuk, Vanessa; Balkaya, Ibrahim; Casulli, Adriano; Gasser, Robin B; van der Giessen, Joke; González, Luis Miguel; Haag, Karen L; Zait, Houria; Irshadullah, Malik; Jabbar, Abdul; Jenkins, David J; Kia, Eshrat Beigom; Manfredi, Maria Teresa; Mirhendi, Hossein; M'rad, Selim; Rostami-Nejad, Mohammad; Oudni-M'rad, Myriam; Pierangeli, Nora Beatriz; Ponce-Gordo, Francisco; Rehbein, Steffen; Sharbatkhori, Mitra; Simsek, Sami; Soriano, Silvia Viviana; Sprong, Hein; Šnábel, Viliam; Umhang, Gérald; Varcasia, Antonio; Saarma, Urmas

    2018-05-19

    Echinococcus granulosus sensu stricto (s.s.) is the major cause of human cystic echinococcosis worldwide and is listed among the most severe parasitic diseases of humans. To date, numerous studies have investigated the genetic diversity and population structure of E. granulosus s.s. in various geographic regions. However, there has been no global study. Recently, using mitochondrial DNA, it was shown that E. granulosus s.s. G1 and G3 are distinct genotypes, but a larger dataset is required to confirm the distinction of these genotypes. The objectives of this study were to: (i) investigate the distinction of genotypes G1 and G3 using a large global dataset; and (ii) analyse the genetic diversity and phylogeography of genotype G1 on a global scale using near-complete mitogenome sequences. For this study, 222 globally distributed E. granulosus s.s. samples were used, of which 212 belonged to genotype G1 and 10 to G3. Using a total sequence length of 11,682 bp, we inferred phylogenetic networks for three datasets: E. granulosus s.s. (n = 222), G1 (n = 212) and human G1 samples (n = 41). In addition, the Bayesian phylogenetic and phylogeographic analyses were performed. The latter yielded several strongly supported diffusion routes of genotype G1 originating from Turkey, Tunisia and Argentina. We conclude that: (i) using a considerably larger dataset than employed previously, E. granulosus s.s. G1 and G3 are indeed distinct mitochondrial genotypes; (ii) the genetic diversity of E. granulosus s.s. G1 is high globally, with lower values in South America; and (iii) the complex phylogeographic patterns emerging from the phylogenetic and geographic analyses suggest that the current distribution of genotype G1 has been shaped by intensive animal trade. Copyright © 2018 Australian Society for Parasitology. Published by Elsevier Ltd. All rights reserved.

  12. Molecular exploration of hidden diversity in the Indo-West Pacific sciaenid clade

    PubMed Central

    Lo, Pei-Chun; Liu, Shu-Hui; Nor, Siti Azizah Mohd

    2017-01-01

    The family Sciaenidae, known as croakers or drums, is one of the largest perciform fish families. A recent multi-gene based study investigating the phylogeny and biogeography of global sciaenids revealed that the origin and early diversification of this family occurred in tropical America during the Late Oligocene—Early Miocene before undergoing range expansions to other seas including the Indo-West Pacific, where high species richness is observed. Despite this clarification of the overall evolutionary history of the family, knowledge of the taxonomy and phylogeny of sciaenid genera endemic to the Indo-West Pacific is still limited due to lack of a thorough survey of all taxa. In this study, we used DNA-based approaches to investigate the evolutionary relationships, to explore the species diversity, and to elucidate the taxonomic status of sciaenid species/genera within the Indo-West Pacific clade. Three datasets were herein built for the above objectives: the combined dataset (248 samples from 45 currently recognized species) from one nuclear gene (RAG1) and one mitochondrial gene (COI); the dataset with only RAG1 gene sequences (245 samples from 44 currently recognized species); and the dataset with only COI gene sequences (308 samples from 51 currently recognized species). The latter was primarily used for our biodiversity exploration with two different species delimitation methods (Automatic Barcode Gap Discovery, ABGD and Generalized Mixed Yule Coalescent, GMYC). The results were further evaluated with help of four supplementary criteria for species delimitation (genetic similarity, monophyly inferred from individual gene and combined data trees, geographic distribution, and morphology). Our final results confirmed the validity of 32 currently recognized species and identified several potential new species waiting for formal descriptions. 
    We also reexamined the taxonomic status of the genera Larimichthys, Nibea, Protonibea and Megalonibea, suggested a revision of Nibea, and proposed a new genus, Pseudolarimichthys. PMID:28453569

  13. Molecular exploration of hidden diversity in the Indo-West Pacific sciaenid clade.

    PubMed

    Lo, Pei-Chun; Liu, Shu-Hui; Nor, Siti Azizah Mohd; Chen, Wei-Jen

    2017-01-01

    The family Sciaenidae, known as croakers or drums, is one of the largest perciform fish families. A recent multi-gene based study investigating the phylogeny and biogeography of global sciaenids revealed that the origin and early diversification of this family occurred in tropical America during the Late Oligocene-Early Miocene before undergoing range expansions to other seas including the Indo-West Pacific, where high species richness is observed. Despite this clarification of the overall evolutionary history of the family, knowledge of the taxonomy and phylogeny of sciaenid genera endemic to the Indo-West Pacific is still limited due to lack of a thorough survey of all taxa. In this study, we used DNA-based approaches to investigate the evolutionary relationships, to explore the species diversity, and to elucidate the taxonomic status of sciaenid species/genera within the Indo-West Pacific clade. Three datasets were herein built for the above objectives: the combined dataset (248 samples from 45 currently recognized species) from one nuclear gene (RAG1) and one mitochondrial gene (COI); the dataset with only RAG1 gene sequences (245 samples from 44 currently recognized species); and the dataset with only COI gene sequences (308 samples from 51 currently recognized species). The latter was primarily used for our biodiversity exploration with two different species delimitation methods (Automatic Barcode Gap Discovery, ABGD and Generalized Mixed Yule Coalescent, GMYC). The results were further evaluated with help of four supplementary criteria for species delimitation (genetic similarity, monophyly inferred from individual gene and combined data trees, geographic distribution, and morphology). Our final results confirmed the validity of 32 currently recognized species and identified several potential new species waiting for formal descriptions. 
    We also reexamined the taxonomic status of the genera Larimichthys, Nibea, Protonibea and Megalonibea, suggested a revision of Nibea, and proposed a new genus, Pseudolarimichthys.

  14. Acceptance of Domestic Cat Mitochondrial DNA in a Criminal Proceeding

    PubMed Central

    Lyons, Leslie A.; Grahn, Robert A.; Kun, Teri J.; Netzel, Linda R.; Wictum, Elizabeth E.; Halverson, Joy L.

    2014-01-01

    Shed hair from domestic animals readily adheres to clothing and other contact items, providing a source of transfer evidence for criminal investigations. Mitochondrial DNA is often the only option for DNA analysis of shed hair. Human mitochondrial DNA analysis has been accepted in the US court system since 1996. The murder trial of the State of Missouri versus Henry L. Polk, Jr. represents the first legal proceeding where cat mitochondrial DNA analysis was introduced into evidence. The mitochondrial DNA evidence was initially considered inadmissible due to concerns about the cat dataset and the scientific acceptance of the marker. Those concerns were subsequently addressed, and the evidence was deemed admissible. This report reviews the case with regard to the cat biological evidence and its ultimate admission as generally accepted and reliable. Expansion and saturation analysis of the cat mitochondrial DNA control region dataset supported the initial interpretation of the evidence. PMID:25086413

  15. "First generation" automated DNA sequencing technology.

    PubMed

    Slatko, Barton E; Kieleczawa, Jan; Ju, Jingyue; Gardner, Andrew F; Hendrickson, Cynthia L; Ausubel, Frederick M

    2011-10-01

    Beginning in the 1980s, automation of DNA sequencing has greatly increased throughput, reduced costs, and enabled large projects to be completed more easily. The development of automation technology paralleled the development of other aspects of DNA sequencing: better enzymes and chemistry, separation and imaging technology, sequencing protocols, robotics, and computational advancements (including base-calling algorithms with quality scores, database developments, and sequence analysis programs). Despite the emergence of high-throughput sequencing platforms, automated Sanger sequencing technology remains useful for many applications. This unit provides background and a description of the "First-Generation" automated DNA sequencing technology. It also includes protocols for using the current Applied Biosystems (ABI) automated DNA sequencing machines. © 2011 by John Wiley & Sons, Inc.

  16. Influence of DNA sequence on the structure of minicircles under torsional stress

    PubMed Central

    Wang, Qian; Irobalieva, Rossitza N.; Chiu, Wah; Schmid, Michael F.; Fogg, Jonathan M.; Zechiedrich, Lynn

    2017-01-01

    Abstract The sequence dependence of the conformational distribution of DNA under various levels of torsional stress is an important unsolved problem. Combining theory and coarse-grained simulations shows that the DNA sequence and a structural correlation due to topology constraints of a circle are the main factors that dictate the 3D structure of a 336 bp DNA minicircle under torsional stress. We found that DNA minicircle topoisomers can have multiple bend locations under high torsional stress and that the positions of these sharp bends are determined by the sequence, and by a positive mechanical correlation along the sequence. We showed that simulations and theory are able to provide sequence-specific information about individual DNA minicircles observed by cryo-electron tomography (cryo-ET). We provided a sequence-specific cryo-ET tomogram fitting of DNA minicircles, registering the sequence within the geometric features. Our results indicate that the conformational distribution of minicircles under torsional stress can be designed, which has important implications for using minicircle DNA for gene therapy. PMID:28609782

  17. Analysis of DNA Sequences by an Optical Time-Integrating Correlator: Proof-of-Concept Experiments.

    DTIC Science & Technology

    1992-05-01

    DNA ANALYSIS STRATEGY 4 2.1 Representation of DNA Bases 4 2.2 DNA Analysis Strategy 6 3.0 CUSTOM GENERATORS FOR DNA SEQUENCES 10 3.1 Hardware Design 10...of the DNA bases where each base is represented by a 7-bits long pseudorandom sequence. 5 Figure 4: Coarse analysis of a DNA sequence. 7 Figure 5: Fine...a 20-bases long database. 32 xiii LIST OF TABLES PAGE Table 1: Short representations of the DNA bases where each base is represented by 7-bits long

  18. Empirical Validation of Pooled Whole Genome Population Re-Sequencing in Drosophila melanogaster

    PubMed Central

    Zhu, Yuan; Bergland, Alan O.; González, Josefa; Petrov, Dmitri A.

    2012-01-01

    The sequencing of pooled non-barcoded individuals is an inexpensive and efficient means of assessing genome-wide population allele frequencies, yet its accuracy has not been thoroughly tested. We assessed the accuracy of this approach on whole, complex eukaryotic genomes by resequencing pools of largely isogenic, individually sequenced Drosophila melanogaster strains. We called SNPs in the pooled data and estimated false positive and false negative rates using the SNPs called in the individual strains as a reference. We also estimated allele frequencies of the SNPs using “pooled” data and compared them with “true” frequencies taken from the estimates in the individual strains. We demonstrate that pooled sequencing provides a faithful estimate of population allele frequency with the error well approximated by binomial sampling, and is a reliable means of novel SNP discovery with low false positive rates. However, a sufficient number of strains should be used in the pooling because variation in the amount of DNA derived from individual strains is a substantial source of noise when the number of pooled strains is low. Our results and analysis confirm that pooled sequencing is a very powerful and cost-effective technique for assessing patterns of sequence variation in populations on genome-wide scales, and is applicable to any dataset where sequencing individuals or individual cells is impossible, difficult, time-consuming, or expensive. PMID:22848651
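
    The abstract above states that the error of pooled allele-frequency estimates is well approximated by binomial sampling. A minimal sketch of what that approximation implies is given here. It assumes that reads are the only source of sampling noise and uses a simple normal-approximation (Wald) interval; the function names are illustrative and are not taken from the paper, and the sketch ignores the unequal DNA contribution of individual strains that the authors identify as an extra source of noise.

```python
import math


def binomial_se(freq, n_reads):
    """Standard error of an allele frequency estimated from n_reads reads,
    assuming each read is an independent binomial draw (an assumption)."""
    return math.sqrt(freq * (1.0 - freq) / n_reads)


def pooled_frequency_interval(alt_reads, total_reads, z=1.96):
    """Point estimate and approximate 95% interval for a pooled allele frequency.

    Uses the normal approximation to the binomial; it is rough near 0 or 1.
    """
    p_hat = alt_reads / total_reads
    se = binomial_se(p_hat, total_reads)
    lower = max(0.0, p_hat - z * se)
    upper = min(1.0, p_hat + z * se)
    return p_hat, lower, upper


if __name__ == "__main__":
    # Hypothetical site: 20 of 100 reads carry the alternative allele.
    print(pooled_frequency_interval(20, 100))
```

    Because the standard error shrinks with the square root of the read depth, quadrupling coverage at a site halves the expected sampling error under this model.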

  19. Reference datasets for 2-treatment, 2-sequence, 2-period bioequivalence studies.

    PubMed

    Schütz, Helmut; Labes, Detlew; Fuglsang, Anders

    2014-11-01

    It is difficult to validate statistical software used to assess bioequivalence since very few datasets with known results are in the public domain, and the few that are published are of moderate size and balanced. The purpose of this paper is therefore to introduce reference datasets of varying complexity in terms of dataset size and characteristics (balance, range, outlier presence, residual error distribution) for 2-treatment, 2-period, 2-sequence bioequivalence studies and to report their point estimates and 90% confidence intervals which companies can use to validate their installations. The results for these datasets were calculated using the commercial packages EquivTest, Kinetica, SAS and WinNonlin, and the non-commercial package R. The results of three of these packages mostly agree, but imbalance between sequences seems to provoke questionable results with one package, which illustrates well the need for proper software validation.

  20. DNA-Based Taxonomy in Ecologically Versatile Microalgae: A Re-Evaluation of the Species Concept within the Coccoid Green Algal Genus Coccomyxa (Trebouxiophyceae, Chlorophyta)

    PubMed Central

    Rindi, Fabio; Tempesta, Sabrina; Paoletti, Michela; Pasqualetti, Marcella

    2016-01-01

    Coccomyxa is a genus of unicellular green algae of the class Trebouxiophyceae, well known for its cosmopolitan distribution and great ecological amplitude. The taxonomy of this genus has long been problematic, due to reliance on badly-defined and environmentally variable morphological characters. In this study, based on the discovery of a new species from an extreme habitat, we reassess species circumscription in Coccomyxa, a unicellular genus of the class Trebouxiophyceae, using a combination of ecological and DNA sequence data (analyzed with three different methods of algorithmic species delineation). Our results are compared with those of a recent integrative study of Darienko and colleagues that reassessed the taxonomy of Coccomyxa, recognizing 7 species in the genus. Expanding the dataset from 43 to 61 sequences (SSU + ITS rDNA) resulted in a different delimitation, supporting the recognition of a higher number of species (24 to 27 depending on the analysis used, with the 27-species scenario receiving the strongest support). Among these, C. melkonianii sp. nov. is described from material isolated from a river highly polluted by heavy metals (Rio Irvi, Sardinia, Italy). Analyses performed on ecological characters detected a significant phylogenetic signal in six different characters. We conclude that the 27-species scenario is presently the most realistic for Coccomyxa and we suggest that well-supported lineages distinguishable by ecological preferences should be recognized as different species in this genus. We also recommend that for microbial lineages in which the overall diversity is unknown and taxon sampling is sparse, as is often the case for green microalgae, the results of analyses for algorithmic DNA-based species delimitation should be interpreted with extreme caution. PMID:27028195

  1. HNRNPLL stabilizes mRNAs for DNA replication proteins and promotes cell cycle progression in colorectal cancer cells.

    PubMed

    Sakuma, Keiichiro; Sasaki, Eiichi; Kimura, Kenya; Komori, Koji; Shimizu, Yasuhiro; Yatabe, Yasushi; Aoki, Masahiro

    2018-06-05

    HNRNPLL (heterogeneous nuclear ribonucleoprotein L-like), an RNA-binding protein that regulates alternative splicing of pre-mRNAs, has been shown to regulate differentiation of lymphocytes, as well as metastasis of colorectal cancer cells. Here we show that HNRNPLL promotes cell cycle progression and hence proliferation of colorectal cancer cells. Functional annotation analysis of those genes whose expression levels were changed by three-fold or more in RNA sequencing analysis between SW480 cells overexpressing HNRNPLL and those knocked down for HNRNPLL revealed enrichment of DNA replication-related genes by HNRNPLL overexpression. Among 13 genes detected in the DNA replication pathway, PCNA, RFC3, and FEN1 showed reproducible upregulation by HNRNPLL overexpression both at mRNA and protein levels in SW480 and HT29 cells. Importantly, knockdown of any of these genes alone suppressed the proliferation-promoting effect induced by HNRNPLL overexpression. An RNA-immunoprecipitation assay showed binding of FLAG-tagged HNRNPLL to the mRNAs of these genes, and HNRNPLL overexpression significantly suppressed the downregulation of these genes during 12 hours of actinomycin D treatment, suggesting a role of HNRNPLL in mRNA stability. Finally, analysis of a public RNA sequencing dataset of clinical samples suggested a link between overexpression of HNRNPLL and that of PCNA, RFC3, and FEN1. This link was further supported by immunohistochemistry of colorectal cancer clinical samples, whereas expression of CDKN1A, which is known to inhibit the cooperative function of PCNA, RFC3, and FEN1, was negatively associated with HNRNPLL expression. These results indicate that HNRNPLL stabilizes mRNAs encoding regulators of DNA replication and promotes colorectal cancer cell proliferation. This article is protected by copyright. All rights reserved.

  2. Laser mass spectrometry for DNA sequencing, disease diagnosis, and fingerprinting

    NASA Astrophysics Data System (ADS)

    Chen, C. H. Winston; Taranenko, N. I.; Zhu, Y. F.; Chung, C. N.; Allman, S. L.

    1997-05-01

    Since laser mass spectrometry has the potential for achieving very fast DNA analysis, we recently applied it to DNA sequencing, DNA typing for fingerprinting, and DNA screening for disease diagnosis. Two different approaches for sequencing DNA have been successfully demonstrated. One is to sequence DNA with DNA ladders produced from Sanger's enzymatic method. The other is to do direct sequencing without DNA ladders. The need for quick DNA typing for identification purposes is critical for forensic application. Our preliminary results indicate that laser mass spectrometry can possibly be used for rapid DNA fingerprinting applications at a much lower cost than gel electrophoresis. Population screening for certain genetic diseases can be a very efficient step toward reducing medical costs through prevention. Since laser mass spectrometry can provide very fast DNA analysis, we applied laser mass spectrometry to disease diagnosis. Clinical samples with both base deletion and point mutation have been tested with complete success.

  3. Resolving incongruence: Species of hybrid origin in Columnea (Gesneriaceae).

    PubMed

    Smith, James F; Clark, John L; Amaya-Márquez, Marisol; Marín-Gómez, Oscar H

    2017-01-01

    Speciation by hybridization has long been recognized among plants and includes both homoploid and allopolyploid speciation. The number of presumed hybrid species averages close to 11% and tends to be concentrated in a subset of angiosperm families. Recent advances in molecular methods have verified species of hybrid origin that had been presumed on the basis of morphology and have identified species that were not initially considered hybrids. Identifying species of hybrid origin is often a challenge and is typically based on intermediate morphology or discrepancies between molecular datasets. Discrepancies between data partitions may result from several factors including poor support, incomplete lineage sorting, or hybridization. A phylogenetic analysis of species in Columnea (Gesneriaceae) indicated significant incongruencies between the cpDNA and nrDNA datasets. Tests that examined whether one or both of the datasets had the phylogenetic signal to reject the topology of the alternate dataset (Shimodaira and Hasegawa [SH] and approximately unbiased [AU] tests) indicated significant differences between the topologies. Splitstree analyses also showed that there was support for the placement of the discrepant taxa in both datasets and that the combined data placed the putative hybrid species in an intermediate position between the two datasets. The genealogical sorting index (GSI) implied that coalescence in nrDNA had occurred in all species where more than a single individual had been sampled, but the GSI value was lower for the cpDNA of most of the putative hybrids, implying that these regions have not yet coalesced in these lineages despite being haploid. The JML test that evaluates simulated species pairwise distances against observed distances also implies that observed nrDNA data generate shorter distances than simulated data, implying hybridization. It is most likely that C. gigantifolia, C. rubriacuta, and C. sp. nov. represent a lineage from a hybrid ancestor, but C. moorei may be a more recent hybrid and may still be undergoing hybridization with sympatric species. Copyright © 2016 Elsevier Inc. All rights reserved.

  4. Species Diversity of Puerto Rican Heterotermes (Dictyoptera: Rhinotermitidae) Revealed by Phylogenetic Analyses of Two Mitochondrial Genes

    PubMed Central

    Jones, Susan C.; Jenkins, Tracie M.

    2016-01-01

    The goal of this study was to infer Heterotermes (Froggatt) (Dictyoptera: Rhinotermitidae) species diversity on the island of Puerto Rico from phylogenetic analyses of DNA sequence data from two mitochondrial genes, 16S rRNA and cytochrome oxidase II (COII). This termite genus is a structural pest known to be well adapted to arid environments in subtropical and tropical regions worldwide including Puerto Rico and many other Caribbean islands. Extensive sampling was accomplished across Puerto Rico, and phylogenetic analyses of individual gene sequences from these samples indicated robust datasets of congruent gene tree topologies showing three monophyletic groups: H. cardini (Snyder), H. convexinotatus (Snyder), and H. tenuis (Hagen). We found that H. cardini and H. convexinotatus were widespread in the arid coastal regions of Puerto Rico, whereas H. tenuis was uncommon and may represent a relatively new introduction. We found only H. convexinotatus on Culebra Island. We provide strong evidence that Puerto Rico may be linked to the Heterotermes in southern Florida, USA, since its GenBank 16S sequence was identical to that of seven Puerto Rican H. cardini sequences. Our study represents the first records of H. cardini from Puerto Rico and Grand Bahama.

  5. Degenerate Pax2 and Senseless binding motifs improve detection of low-affinity sites required for enhancer specificity

    PubMed Central

    Zandvakili, Arya; Campbell, Ian; Weirauch, Matthew T.

    2018-01-01

    Cells use thousands of regulatory sequences to recruit transcription factors (TFs) and produce specific transcriptional outcomes. Since TFs bind degenerate DNA sequences, discriminating functional TF binding sites (TFBSs) from background sequences represents a significant challenge. Here, we show that a Drosophila regulatory element that activates Epidermal Growth Factor signaling requires overlapping, low-affinity TFBSs for competing TFs (Pax2 and Senseless) to ensure cell- and segment-specific activity. Testing available TF binding models for Pax2 and Senseless, however, revealed variable accuracy in predicting such low-affinity TFBSs. To better define parameters that increase accuracy, we developed a method that systematically selects subsets of TFBSs based on predicted affinity to generate hundreds of position-weight matrices (PWMs). Counterintuitively, we found that degenerate PWMs produced from datasets depleted of high-affinity sequences were more accurate in identifying both low- and high-affinity TFBSs for the Pax2 and Senseless TFs. Taken together, these findings reveal how TFBS arrangement can be constrained by competition rather than cooperativity and that degenerate models of TF binding preferences can improve identification of biologically relevant low affinity TFBSs. PMID:29617378
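
    The abstract above relies on position-weight matrices (PWMs) built from sets of binding sites. As a generic illustration only, and not the authors' affinity-based subset selection, the following sketch builds a log-odds PWM from equal-length sites and scans a sequence with it. The pseudocount, the uniform background of 0.25 per base, the function names, and the example sites are all assumptions made for this sketch.

```python
import math

BASES = "ACGT"


def build_pwm(sites, pseudocount=1.0, background=0.25):
    """Build a log2-odds PWM from equal-length, upper-case DNA binding sites."""
    width = len(sites[0])
    if any(len(site) != width for site in sites):
        raise ValueError("all sites must have the same length")
    pwm = []
    for i in range(width):
        # Start every base at the pseudocount so no probability is zero.
        counts = {base: pseudocount for base in BASES}
        for site in sites:
            counts[site[i]] += 1
        total = sum(counts.values())
        pwm.append({b: math.log2((counts[b] / total) / background) for b in BASES})
    return pwm


def score(pwm, window):
    """Sum the log-odds of each base in a window as wide as the PWM."""
    return sum(column[base] for column, base in zip(pwm, window))


def best_hit(pwm, sequence):
    """Return (score, start) of the highest-scoring window in the sequence."""
    width = len(pwm)
    return max(
        (score(pwm, sequence[i:i + width]), i)
        for i in range(len(sequence) - width + 1)
    )


if __name__ == "__main__":
    example_sites = ["ACGT", "ACGT", "ACGA"]  # hypothetical sites
    matrix = build_pwm(example_sites)
    print(best_hit(matrix, "TTACGTTT"))
```

    A more degenerate matrix, in the sense used in the abstract, would come from training sites with more variation per column, which flattens the log-odds values and lets weaker matches score closer to the consensus.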

  6. ReadXplorer—visualization and analysis of mapped sequences

    PubMed Central

    Hilker, Rolf; Stadermann, Kai Bernd; Doppmeier, Daniel; Kalinowski, Jörn; Stoye, Jens; Straube, Jasmin; Winnebald, Jörn; Goesmann, Alexander

    2014-01-01

    Motivation: Fast algorithms and well-arranged visualizations are required for the comprehensive analysis of the ever-growing size of genomic and transcriptomic next-generation sequencing data. Results: ReadXplorer is a software offering straightforward visualization and extensive analysis functions for genomic and transcriptomic DNA sequences mapped on a reference. A unique specialty of ReadXplorer is the quality classification of the read mappings. It is incorporated in all analysis functions and displayed in ReadXplorer's various synchronized data viewers for (i) the reference sequence, its base coverage as (ii) normalizable plot and (iii) histogram, (iv) read alignments and (v) read pairs. ReadXplorer's analysis capability covers RNA secondary structure prediction, single nucleotide polymorphism and deletion–insertion polymorphism detection, genomic feature and general coverage analysis. Especially for RNA-Seq data, it offers differential gene expression analysis, transcription start site and operon detection as well as RPKM value and read count calculations. Furthermore, ReadXplorer can combine or superimpose coverage of different datasets. Availability and implementation: ReadXplorer is available as open-source software at http://www.readxplorer.org along with a detailed manual. Contact: rhilker@mikrobio.med.uni-giessen.de Supplementary information: Supplementary data are available at Bioinformatics online. PMID:24790157

  7. Micronuclear DNA of Oxytricha nova contains sequences with autonomously replicating activity in Saccharomyces cerevisiae.

    PubMed Central

    Colombo, M M; Swanton, M T; Donini, P; Prescott, D M

    1984-01-01

    Oxytricha nova is a hypotrichous ciliate with micronuclei and macronuclei. Micronuclei, which contain large, chromosomal-sized DNA, are genetically inert but undergo meiosis and exchange during cell mating. Macronuclei, which contain only small, gene-sized DNA molecules, provide all of the nuclear RNA needed to run the cell. After cell mating the macronucleus is derived from a micronucleus, a derivation that includes excision of the genes from chromosomes and elimination of the remaining DNA. The eliminated DNA includes all of the repetitious sequences and approximately 95% of the unique sequences. We cloned large restriction fragments from the micronucleus that confer replication ability on a replication-deficient plasmid in Saccharomyces cerevisiae. Sequences that confer replication ability are called autonomously replicating sequences. The frequency and effectiveness of autonomously replicating sequences in micronuclear DNA are similar to those reported for DNAs of other organisms introduced into yeast cells. Of the 12 micronuclear fragments with autonomously replicating sequence activity, 9 also showed homology to macronuclear DNA, indicating that they contain a macronuclear gene sequence. We conclude from this that autonomously replicating sequence activity is nonrandomly distributed throughout micronuclear DNA and is preferentially associated with those regions of micronuclear DNA that contain genes. Images PMID:6092934

  8. DNA sequence-dependent mechanics and protein-assisted bending in repressor-mediated loop formation

    PubMed Central

    Boedicker, James Q.; Garcia, Hernan G.; Johnson, Stephanie; Phillips, Rob

    2014-01-01

    As the chief informational molecule of life, DNA is subject to extensive physical manipulations. The energy required to deform double-helical DNA depends on sequence, and this mechanical code of DNA influences gene regulation, such as through nucleosome positioning. Here we examine the sequence-dependent flexibility of DNA in bacterial transcription factor-mediated looping, a context for which the role of sequence remains poorly understood. Using a suite of synthetic constructs repressed by the Lac repressor and two well-known sequences that show large flexibility differences in vitro, we make precise statistical mechanical predictions as to how DNA sequence influences loop formation and test these predictions using in vivo transcription and in vitro single-molecule assays. Surprisingly, sequence-dependent flexibility does not affect in vivo gene regulation. By theoretically and experimentally quantifying the relative contributions of sequence and the DNA-bending protein HU to DNA mechanical properties, we reveal that bending by HU dominates DNA mechanics and masks intrinsic sequence-dependent flexibility. Such a quantitative understanding of how mechanical regulatory information is encoded in the genome will be a key step towards a predictive understanding of gene regulation at single-base pair resolution. PMID:24231252

  9. An analysis of the positional distribution of DNA motifs in promoter regions and its biological relevance.

    PubMed

    Casimiro, Ana C; Vinga, Susana; Freitas, Ana T; Oliveira, Arlindo L

    2008-02-07

    Motif finding algorithms have developed in their ability to use computationally efficient methods to detect patterns in biological sequences. However, the posterior classification of the output still suffers from some limitations, which makes it difficult to assess the biological significance of the motifs found. Previous work has highlighted the existence of positional bias of motifs in DNA sequences, which might not only indicate that the pattern is important, but also provide hints of the positions where these patterns occur preferentially. We propose to integrate position uniformity tests and over-representation tests to improve the accuracy of the classification of motifs. Using artificial data, we have compared three different statistical tests (Chi-Square, Kolmogorov-Smirnov and a Chi-Square bootstrap) to assess whether a given motif occurs uniformly in the promoter region of a gene. Using the test that performed best on this dataset, we proceeded to study the positional distribution of several well-known cis-regulatory elements in the promoter sequences of different organisms (S. cerevisiae, H. sapiens, D. melanogaster, E. coli and several dicotyledonous plants). The results show that position conservation is relevant for the transcriptional machinery. We conclude that many biologically relevant motifs appear heterogeneously distributed in the promoter region of genes, and therefore that non-uniformity is a good indicator of biological relevance and can be used to complement the over-representation tests commonly used. In this article we present the results obtained for the S. cerevisiae data sets.
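
    The position uniformity tests named in the abstract above can be illustrated with their test statistics. The sketch below computes a chi-square statistic over equal-width bins and a one-sample Kolmogorov-Smirnov distance against the uniform distribution for a list of motif start positions. The binning scheme, the 0-based position convention, and the function names are assumptions of this sketch; p-values and the bootstrap variant used by the authors are not reproduced here.

```python
def chi_square_uniform(positions, region_length, n_bins=10):
    """Chi-square statistic of binned motif positions against a uniform expectation.

    Positions are assumed to be 0-based and smaller than region_length.
    Bins are equal-width only when region_length is divisible by n_bins.
    """
    counts = [0] * n_bins
    for pos in positions:
        index = min(pos * n_bins // region_length, n_bins - 1)
        counts[index] += 1
    expected = len(positions) / n_bins
    return sum((count - expected) ** 2 / expected for count in counts)


def ks_uniform(positions, region_length):
    """Kolmogorov-Smirnov distance between observed positions and Uniform(0, 1)."""
    scaled = sorted(pos / region_length for pos in positions)
    n = len(scaled)
    distance = 0.0
    for i, x in enumerate(scaled):
        # Compare the uniform CDF (x) with the empirical CDF just after and before x.
        distance = max(distance, (i + 1) / n - x, x - i / n)
    return distance


if __name__ == "__main__":
    spread = list(range(50, 1000, 100))   # hypothetical evenly spread motif starts
    clustered = [10] * 10                 # hypothetical motif piled up at one site
    print(chi_square_uniform(spread, 1000), ks_uniform(spread, 1000))
    print(chi_square_uniform(clustered, 1000), ks_uniform(clustered, 1000))
```

    Large values of either statistic indicate departure from a uniform positional distribution, which is the property the abstract proposes as a complement to over-representation tests.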

  10. Divergent nuclear 18S rDNA paralogs in a turkey coccidium, Eimeria meleagrimitis, complicate molecular systematics and identification.

    PubMed

    El-Sherry, Shiem; Ogedengbe, Mosun E; Hafeez, Mian A; Barta, John R

    2013-07-01

    Multiple 18S rDNA sequences were obtained from two single-oocyst-derived lines of each of Eimeria meleagrimitis and Eimeria adenoeides. After analysing the 15 new 18S rDNA sequences from two lines of E. meleagrimitis and 17 new sequences from two lines of E. adenoeides, there were clear indications that divergent, paralogous 18S rDNA copies existed within the nuclear genome of E. meleagrimitis. In contrast, mitochondrial cytochrome c oxidase subunit I (COI) partial sequences from all lines of a particular Eimeria sp. were identical and, in phylogenetic analyses, COI sequences clustered unambiguously in monophyletic and highly-supported clades specific to individual Eimeria sp. Phylogenetic analysis of the new 18S rDNA sequences from E. meleagrimitis showed that they formed two distinct clades: Type A with four new sequences; and Type B with nine new sequences; both Types A and B sequences were obtained from each of the single-oocyst-derived lines of E. meleagrimitis. Together these rDNA types formed a well-supported E. meleagrimitis clade. Types A and B 18S rDNA sequences from E. meleagrimitis had a mean sequence identity of only 97.4% whereas mean sequence identity within types was 99.1-99.3%. The observed intraspecific sequence divergence among E. meleagrimitis 18S rDNA sequence types was even higher (approximately 2.6%) than the interspecific sequence divergence present between some well-recognized species such as Eimeria tenella and Eimeria necatrix (1.1%). Our observations suggest that, unlike COI sequences, 18S rDNA sequences are not reliable molecular markers to be used alone for species identification with coccidia, although 18S rDNA sequences have clear utility for phylogenetic reconstruction of apicomplexan parasites at the genus and higher taxonomic ranks. Copyright © 2013. Published by Elsevier Ltd.
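
    The identity and divergence figures quoted above (for example 97.4% identity, corresponding to roughly 2.6% divergence) are pairwise statistics over aligned sequences. A minimal Python sketch of such a calculation follows; it assumes the two sequences are already aligned to equal length and that gap columns are skipped, which may differ from the conventions the authors used. The function name and the example sequences are hypothetical.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percent identity over alignment columns where neither sequence has a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("sequences must be aligned to the same length")
    compared = 0
    matches = 0
    for x, y in zip(aligned_a.upper(), aligned_b.upper()):
        if x == gap or y == gap:
            continue  # gap columns are ignored in this sketch (an assumption)
        compared += 1
        if x == y:
            matches += 1
    if compared == 0:
        raise ValueError("no comparable columns")
    return 100.0 * matches / compared

# Toy aligned fragments, not real 18S rDNA data.
identity = percent_identity("ACGTACGTAC", "ACGTACGTAA")
divergence = 100.0 - identity
print(identity, divergence)
```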

  11. Affordable hands-on DNA sequencing and genotyping: an exercise for teaching DNA analysis to undergraduates.

    PubMed

    Shah, Kushani; Thomas, Shelby; Stein, Arnold

    2013-01-01

    In this report, we describe a 5-week laboratory exercise for undergraduate biology and biochemistry students in which students learn to sequence DNA and to genotype their DNA for selected single nucleotide polymorphisms (SNPs). Students use miniaturized DNA sequencing gels that require approximately 8 min to run. The students perform G, A, T, C Sanger sequencing reactions. They prepare and run the gels, perform Southern blots (which require only 10 min), and detect sequencing ladders using a colorimetric detection system. Students enlarge their sequencing ladders from digital images of their small nylon membranes, and read the sequence manually. They compare their reads with the actual DNA sequence using BLAST2. After mastering the DNA sequencing system, students prepare their own DNA from a cheek swab, polymerase chain reaction-amplify a region of their DNA that encompasses a SNP of interest, and perform sequencing to determine their genotype at the SNP position. A family pedigree can also be constructed. The SNP chosen by the instructor was rs17822931, which is in the ABCC11 gene and is the determinant of human earwax type. Genotypes at the rs17822931 site vary in different ethnic populations. © 2013 by The International Union of Biochemistry and Molecular Biology.

  12. Phylogenetic characterization of a biogas plant microbial community integrating clone library 16S-rDNA sequences and metagenome sequence data obtained by 454-pyrosequencing.

    PubMed

    Kröber, Magdalena; Bekel, Thomas; Diaz, Naryttza N; Goesmann, Alexander; Jaenicke, Sebastian; Krause, Lutz; Miller, Dimitri; Runte, Kai J; Viehöver, Prisca; Pühler, Alfred; Schlüter, Andreas

    2009-06-01

    The phylogenetic structure of the microbial community residing in a fermentation sample from a production-scale biogas plant fed with maize silage, green rye and liquid manure was analysed by an integrated approach using clone library sequences and metagenome sequence data obtained by 454-pyrosequencing. Sequencing of 109 clones from a bacterial and an archaeal 16S-rDNA amplicon library revealed that the obtained nucleotide sequences are similar but not identical to 16S-rDNA database sequences derived from different anaerobic environments including digestors and bioreactors. Most of the bacterial 16S-rDNA sequences could be assigned to the phylum Firmicutes with the most abundant class Clostridia and to the class Bacteroidetes, whereas most archaeal 16S-rDNA sequences cluster close to the methanogen Methanoculleus bourgensis. Further sequences of the archaeal library most probably represent so far non-characterised species within the genus Methanoculleus. A similar result derived from phylogenetic analysis of mcrA clone sequences. The mcrA gene product encodes the alpha-subunit of methyl-coenzyme-M reductase involved in the final step of methanogenesis. BLASTn analysis applying stringent settings resulted in assignment of 16S-rDNA metagenome sequence reads to 62 16S-rDNA amplicon sequences thus enabling frequency of abundance estimations for 16S-rDNA clone library sequences. Ribosomal Database Project (RDP) Classifier processing of metagenome 16S-rDNA reads revealed abundance of the phyla Firmicutes, Bacteroidetes and Euryarchaeota and the orders Clostridiales, Bacteroidales and Methanomicrobiales. Moreover, a large fraction of 16S-rDNA metagenome reads could not be assigned to lower taxonomic ranks, demonstrating that numerous microorganisms in the analysed fermentation sample of the biogas plant are still unclassified or unknown.

  13. Taxator-tk: precise taxonomic assignment of metagenomes by fast approximation of evolutionary neighborhoods

    PubMed Central

    Dröge, J.; Gregor, I.; McHardy, A. C.

    2015-01-01

    Motivation: Metagenomics characterizes microbial communities by random shotgun sequencing of DNA isolated directly from an environment of interest. An essential step in computational metagenome analysis is taxonomic sequence assignment, which allows identifying the sequenced community members and reconstructing taxonomic bins with sequence data for the individual taxa. For the massive datasets generated by next-generation sequencing technologies, this cannot be performed with de-novo phylogenetic inference methods. We describe an algorithm and the accompanying software, taxator-tk, which performs taxonomic sequence assignment by fast approximate determination of evolutionary neighbors from sequence similarities. Results: Taxator-tk was precise in its taxonomic assignment across all ranks and taxa for a range of evolutionary distances and for short as well as for long sequences. In addition to the taxonomic binning of metagenomes, it is well suited for profiling microbial communities from metagenome samples because it identifies bacterial, archaeal and eukaryotic community members without being affected by varying primer binding strengths, as in marker gene amplification, or copy number variations of marker genes across different taxa. Taxator-tk has an efficient, parallelized implementation that allows the assignment of 6 Gb of sequence data per day on a standard multiprocessor system with 10 CPU cores and microbial RefSeq as the genomic reference data. Availability and implementation: Taxator-tk source and binary program files are publicly available at http://algbio.cs.uni-duesseldorf.de/software/. Contact: Alice.McHardy@uni-duesseldorf.de Supplementary information: Supplementary data are available at Bioinformatics online. PMID:25388150

  14. iSS-PseDNC: identifying splicing sites using pseudo dinucleotide composition.

    PubMed

    Chen, Wei; Feng, Peng-Mian; Lin, Hao; Chou, Kuo-Chen

    2014-01-01

    In eukaryotic genes, exons are generally interrupted by introns. Accurately removing introns and joining exons together are essential processes in eukaryotic gene expression. With the avalanche of genome sequences generated in the postgenomic age, it is highly desired to develop automated methods for rapid and effective detection of splice sites that play important roles in gene structure annotation and even in RNA splicing. Although a series of computational methods were proposed for splice site identification, most of them neglected the intrinsic local structural properties. In the present study, a predictor called "iSS-PseDNC" was developed for identifying splice sites. In the new predictor, the sequences were formulated by a novel feature-vector called "pseudo dinucleotide composition" (PseDNC) into which six DNA local structural properties were incorporated. It was observed by the rigorous cross-validation tests on two benchmark datasets that the overall success rates achieved by iSS-PseDNC in identifying splice donor site and splice acceptor site were 85.45% and 87.73%, respectively. It is anticipated that iSS-PseDNC may become a useful tool for identifying splice sites and that the six DNA local structural properties described in this paper may provide novel insights for in-depth investigations into the mechanism of RNA splicing.
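
    The abstract does not spell out how the pseudo dinucleotide composition vector is built. The Python sketch below follows the general form commonly described for PseDNC-style features: 16 normalized dinucleotide frequencies followed by lam correlation terms computed from per-dinucleotide property values. Everything specific here is an assumption for illustration, including the function name, the parameters lam and w, and above all the TOY_PROPS table, which holds made-up placeholder numbers rather than the six DNA local structural properties used by iSS-PseDNC.

```python
from itertools import product

# The 16 dinucleotides in a fixed order: AA, AC, AG, AT, CA, ...
DINUCS = ["".join(p) for p in product("ACGT", repeat=2)]

# Placeholder property table (one made-up value per dinucleotide), NOT real structural data.
TOY_PROPS = {d: [i / 15.0] for i, d in enumerate(DINUCS)}

def pse_dnc(seq, props, lam=2, w=0.5):
    """Return a (16 + lam)-dimensional PseDNC-style feature vector that sums to 1."""
    seq = seq.upper()
    if len(seq) < lam + 2:
        raise ValueError("sequence too short for the requested lam")
    pairs = [seq[i:i + 2] for i in range(len(seq) - 1)]
    freqs = [pairs.count(d) / len(pairs) for d in DINUCS]
    thetas = []
    for j in range(1, lam + 1):
        # Mean squared property difference between dinucleotides j steps apart.
        terms = [
            sum((pa - pb) ** 2 for pa, pb in zip(props[a], props[b])) / len(props[a])
            for a, b in zip(pairs, pairs[j:])
        ]
        thetas.append(sum(terms) / len(terms))
    denom = sum(freqs) + w * sum(thetas)
    return [f / denom for f in freqs] + [w * t / denom for t in thetas]

vector = pse_dnc("ACGTACGTAGCTAGCT", TOY_PROPS, lam=2, w=0.5)
print(len(vector))
```

    In practice such a vector would be fed to a classifier; the weight w and the number of correlation tiers lam are tuning parameters in this sketch.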

  15. An examination of the origin and evolution of additional tandem repeats in the mitochondrial DNA control region of Japanese sika deer (Cervus nippon).

    PubMed

    Ba, Hengxing; Wu, Lang; Liu, Zongyue; Li, Chunyi

    2016-01-01

    Tandem repeat units are only detected in the left domain of the mitochondrial DNA control region in sika deer. Previous studies showed that Japanese sika deer have more tandem repeat units than its cousins from the Asian continent and Taiwan, which often have only three repeat units. To determine the origin and evolution of these additional repeat units in Japanese sika deer, we obtained the sequence of repeat units from an expanded dataset of the control region from all sika deer lineages. The functional constraint is inferred to act on the first repeat unit because this repeat has the least sequence divergence in comparison to the other units. Based on slipped-strand mispairing mechanisms, the illegitimate elongation model could account for the addition or deletion of these additional repeat units in the Japanese sika deer population. We also report that these additional repeat units could be occurring in the internal positions of tandem repeat regions, possibly via coupling with a homogenization mechanism within and among these lineages. Moreover, the increased number of repeat units in the Japanese sika deer population could reflect a balance between mutation and selection, as well as genetic drift.

  16. Cache-Oblivious parallel SIMD Viterbi decoding for sequence search in HMMER.

    PubMed

    Ferreira, Miguel; Roma, Nuno; Russo, Luis M S

    2014-05-30

    HMMER is a commonly used bioinformatics tool based on Hidden Markov Models (HMMs) to analyze and process biological sequences. One of its main homology engines is based on the Viterbi decoding algorithm, which was already highly parallelized and optimized using Farrar's striped processing pattern with Intel SSE2 instruction set extension. A new SIMD vectorization of the Viterbi decoding algorithm is proposed, based on an SSE2 inter-task parallelization approach similar to the DNA alignment algorithm proposed by Rognes. Besides this alternative vectorization scheme, the proposed implementation also introduces a new partitioning of the Markov model that allows a significantly more efficient exploitation of the cache locality. Such optimization, together with an improved loading of the emission scores, allows the achievement of a constant processing throughput, regardless of the innermost-cache size and of the dimension of the considered model. The proposed optimized vectorization of the Viterbi decoding algorithm was extensively evaluated and compared with the HMMER3 decoder to process DNA and protein datasets, proving to be a rather competitive alternative implementation. Being always faster than the already highly optimized ViterbiFilter implementation of HMMER3, the proposed Cache-Oblivious Parallel SIMD Viterbi (COPS) implementation provides a constant throughput and offers a processing speedup as high as two times faster, depending on the model's size.
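
    For readers unfamiliar with the underlying recurrence, here is a plain scalar Viterbi decoder in Python, working in log space. It is a reference sketch only: it has none of the SIMD vectorization, striping or cache-oblivious partitioning that the abstract is about, and it uses a generic HMM rather than HMMER's profile HMM architecture. The two-state model (AT-rich versus GC-rich) and all its probabilities are invented for the example.

```python
import math

def viterbi(obs, states, start_p, trans_p, emit_p):
    """Return the most probable state path and its log-probability."""
    # scores[s] = best log-probability of any path that ends in state s
    scores = {s: math.log(start_p[s]) + math.log(emit_p[s][obs[0]]) for s in states}
    back = []  # one dict of back-pointers per observation after the first
    for symbol in obs[1:]:
        new_scores, pointers = {}, {}
        for s in states:
            best_prev = max(states, key=lambda r: scores[r] + math.log(trans_p[r][s]))
            new_scores[s] = (scores[best_prev] + math.log(trans_p[best_prev][s])
                             + math.log(emit_p[s][symbol]))
            pointers[s] = best_prev
        scores = new_scores
        back.append(pointers)
    # Trace back from the best final state.
    last = max(states, key=lambda s: scores[s])
    path = [last]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    path.reverse()
    return path, scores[last]

# Hypothetical two-state model; probabilities are illustrative only.
STATES = ["AT", "GC"]
START = {"AT": 0.5, "GC": 0.5}
TRANS = {"AT": {"AT": 0.9, "GC": 0.1}, "GC": {"AT": 0.1, "GC": 0.9}}
EMIT = {
    "AT": {"A": 0.45, "T": 0.45, "G": 0.05, "C": 0.05},
    "GC": {"A": 0.05, "T": 0.05, "G": 0.45, "C": 0.45},
}
path, log_prob = viterbi("AATTGGCC", STATES, START, TRANS, EMIT)
print(path)
```

    The inner loop over states for every observation is the part that vectorized implementations parallelize.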

  17. Molecular phylogeny of the Cladophoraceae (Cladophorales, Ulvophyceae), with the resurrection of Acrocladus Nägeli and Willeella Børgesen, and the description of Lurbica gen. nov. and Pseudorhizoclonium gen. nov.

    PubMed

    Boedeker, Christian; Leliaert, Frederik; Zuccarello, Giuseppe C

    2016-12-01

    The taxonomy of the Cladophoraceae, a large family of filamentous green algae, has been problematic for a long time due to morphological simplicity, parallel evolution, phenotypic plasticity, and unknown distribution ranges. Partial large subunit (LSU) rDNA sequences were generated for 362 isolates, and the analyses of a concatenated dataset consisting of unique LSU and small subunit (SSU) rDNA sequences of 95 specimens greatly clarified the phylogeny of the Cladophoraceae. The phylogenetic reconstructions showed that the three currently accepted genera Chaetomorpha, Cladophora, and Rhizoclonium are polyphyletic. The backbone of the phylogeny is robust and the relationships of the main lineages were inferred with high support, only the phylogenetic position of both Chaetomorpha melagonium and Cladophora rupestris could not be inferred unambiguously. There have been at least three independent switches between branched and unbranched morphologies within the Cladophoraceae. Freshwater environments have been colonized twice independently, namely by the freshwater Cladophora species as well as by several lineages of the Rhizoclonium riparium clade. In an effort to establish monophyletic genera, the genera Acrocladus and Willeella are resurrected and two new genera are described: Pseudorhizoclonium and Lurbica. © 2016 Phycological Society of America.

  18. A communal catalogue reveals Earth’s multiscale microbial diversity

    DOE PAGES

    Thompson, Luke R.; Sanders, Jon G.; McDonald, Daniel; ...

    2017-11-01

    Our growing awareness of the importance and diversity of the microbial world contrasts starkly with our limited understanding of its fundamental structure. Despite remarkable advances in DNA sequence generation, a lack of standardized protocols and common analytical framework impedes useful comparison between studies, hindering development of global inferences about microbial life on Earth. Here, we show that with coordinated protocols, exact microbial 16S rRNA gene sequences can be followed across scores of individual studies, revealing patterns of diversity, community structure, and life history strategy at a planetary scale. Using 27,751 crowdsourced environmental samples comprising more than 2.2 billion reads, we find sharp divides between host-associated and free-living communities. We show that the distribution of taxonomic and sequence diversity follows consistent trends across sample types and along gradients of environmental parameters, highlighting some of the global evolutionary patterns and ecological principles that underpin Earth’s microbiome. Here, this dataset provides the most complete environmental survey of our microbial world to date, and serves as a growing reference to provide immediate global context to future microbial surveys.

  19. Pyicos: a versatile toolkit for the analysis of high-throughput sequencing data.

    PubMed

    Althammer, Sonja; González-Vallinas, Juan; Ballaré, Cecilia; Beato, Miguel; Eyras, Eduardo

    2011-12-15

    High-throughput sequencing (HTS) has revolutionized gene regulation studies and is now fundamental for the detection of protein-DNA and protein-RNA binding, as well as for measuring RNA expression. With increasing variety and sequencing depth of HTS datasets, the need for more flexible and memory-efficient tools to analyse them is growing. We describe Pyicos, a powerful toolkit for the analysis of mapped reads from diverse HTS experiments: ChIP-Seq, either punctuated or broad signals, CLIP-Seq and RNA-Seq. We prove the effectiveness of Pyicos to select for significant signals and show that its accuracy is comparable and sometimes superior to that of methods specifically designed for each particular type of experiment. Pyicos facilitates the analysis of a variety of HTS datatypes through its flexibility and memory efficiency, providing a useful framework for data integration into models of regulatory genomics. Open-source software, with tutorials and protocol files, is available at http://regulatorygenomics.upf.edu/pyicos or as a Galaxy server at http://regulatorygenomics.upf.edu/galaxy. Contact: eduardo.eyras@upf.edu. Supplementary data are available at Bioinformatics online.

  20. A communal catalogue reveals Earth’s multiscale microbial diversity

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Thompson, Luke R.; Sanders, Jon G.; McDonald, Daniel

    Our growing awareness of the importance and diversity of the microbial world contrasts starkly with our limited understanding of its fundamental structure. Despite remarkable advances in DNA sequence generation, a lack of standardized protocols and common analytical framework impedes useful comparison between studies, hindering development of global inferences about microbial life on Earth. Here, we show that with coordinated protocols, exact microbial 16S rRNA gene sequences can be followed across scores of individual studies, revealing patterns of diversity, community structure, and life history strategy at a planetary scale. Using 27,751 crowdsourced environmental samples comprising more than 2.2 billion reads, we find sharp divides between host-associated and free-living communities. We show that the distribution of taxonomic and sequence diversity follows consistent trends across sample types and along gradients of environmental parameters, highlighting some of the global evolutionary patterns and ecological principles that underpin Earth’s microbiome. Here, this dataset provides the most complete environmental survey of our microbial world to date, and serves as a growing reference to provide immediate global context to future microbial surveys.

  1. Structural Analysis of Biodiversity

    PubMed Central

    Sirovich, Lawrence; Stoeckle, Mark Y.; Zhang, Yu

    2010-01-01

    Large, recently-available genomic databases cover a wide range of life forms, suggesting opportunity for insights into genetic structure of biodiversity. In this study we refine our recently-described technique using indicator vectors to analyze and visualize nucleotide sequences. The indicator vector approach generates correlation matrices, dubbed Klee diagrams, which represent a novel way of assembling and viewing large genomic datasets. To explore its potential utility, here we apply the improved algorithm to a collection of almost 17000 DNA barcode sequences covering 12 widely-separated animal taxa, demonstrating that indicator vectors for classification gave correct assignment in all 11000 test cases. Indicator vector analysis revealed discontinuities corresponding to species- and higher-level taxonomic divisions, suggesting an efficient approach to classification of organisms from poorly-studied groups. As compared to standard distance metrics, indicator vectors preserve diagnostic character probabilities, enable automated classification of test sequences, and generate high-information density single-page displays. These results support application of indicator vectors for comparative analysis of large nucleotide data sets and raise prospect of gaining insight into broad-scale patterns in the genetic structure of biodiversity. PMID:20195371

  2. PlantFuncSSR: Integrating First and Next Generation Transcriptomics for Mining of SSR-Functional Domains Markers

    PubMed Central

    Sablok, Gaurav; Pérez-Pulido, Antonio J.; Do, Thac; Seong, Tan Y.; Casimiro-Soriguer, Carlos S.; La Porta, Nicola; Ralph, Peter J.; Squartini, Andrea; Muñoz-Merida, Antonio; Harikrishna, Jennifer A.

    2016-01-01

    Analysis of repetitive DNA sequence content and divergence among the repetitive functional classes is a well-accepted approach for estimation of inter- and intra-generic differences in plant genomes. Among these elements, microsatellites, or Simple Sequence Repeats (SSRs), have been widely demonstrated to be powerful genetic markers for discriminating species and varieties. We present the PlantFuncSSRs platform, which covers more than 364 plant species and more than 2 million functional SSRs. The SSRs are provided with detailed annotations for easy functional browsing and with information on primer pairs and associated functional domains. PlantFuncSSRs can be leveraged to identify functional-based genic variability among the species of interest, which might be of particular interest in developing functional markers in plants. This comprehensive on-line portal unifies mining of SSRs from first and next generation sequencing datasets, corresponding primer pairs and associated in-depth functional annotation such as gene ontology annotation, gene interactions and its identification from reference protein databases. PlantFuncSSRs is freely accessible at: http://www.bioinfocabd.upo.es/plantssr. PMID:27446111
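
    To make the idea of SSR mining concrete, the following Python sketch locates perfect microsatellites in a nucleotide string with a regular expression. It is not the pipeline behind the portal described above: the function name, the unit lengths (2 to 6 bases) and the minimum of four repeats are assumptions chosen for the example, and real SSR miners typically apply different thresholds per unit length and also handle mononucleotide and imperfect repeats.

```python
import re

def find_ssrs(seq, min_unit=2, max_unit=6, min_repeats=4):
    """Return (start, unit, repeat_count) for each perfect tandem repeat found.

    start is a 0-based index into seq.
    """
    # Lazy unit match so the shortest repeating unit is preferred,
    # followed by at least (min_repeats - 1) further copies of that unit.
    pattern = re.compile(
        r"([ACGT]{%d,%d}?)\1{%d,}" % (min_unit, max_unit, min_repeats - 1)
    )
    hits = []
    for match in pattern.finditer(seq.upper()):
        unit = match.group(1)
        hits.append((match.start(), unit, len(match.group(0)) // len(unit)))
    return hits

# Toy sequence containing five tandem copies of AC.
print(find_ssrs("TTGACACACACACGG"))
```

    Flanking sequence around each hit is what primer design would then use; that step is outside this sketch.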

  3. DNA capture and next-generation sequencing can recover whole mitochondrial genomes from highly degraded samples for human identification

    PubMed Central

    2013-01-01

    Background Mitochondrial DNA (mtDNA) typing can be a useful aid for identifying people from compromised samples when nuclear DNA is too damaged, degraded or below detection thresholds for routine short tandem repeat (STR)-based analysis. Standard mtDNA typing, focused on PCR amplicon sequencing of the control region (HVS I and HVS II), is limited by the resolving power of this short sequence, which misses up to 70% of the variation present in the mtDNA genome. Methods We used in-solution hybridisation-based DNA capture (using DNA capture probes prepared from modern human mtDNA) to recover mtDNA from post-mortem human remains in which the majority of DNA is both highly fragmented (<100 base pairs in length) and chemically damaged. The method ‘immortalises’ the finite quantities of DNA in valuable extracts as DNA libraries, which is followed by the targeted enrichment of endogenous mtDNA sequences and characterisation by next-generation sequencing (NGS). Results We sequenced whole mitochondrial genomes for human identification from samples where standard nuclear STR typing produced only partial profiles or demonstrably failed and/or where standard mtDNA hypervariable region sequences lacked resolving power. Multiple rounds of enrichment can substantially improve coverage and sequencing depth of mtDNA genomes from highly degraded samples. The application of this method has led to the reliable mitochondrial sequencing of human skeletal remains from unidentified World War Two (WWII) casualties approximately 70 years old and from archaeological remains (up to 2,500 years old). Conclusions This approach has potential applications in forensic science, historical human identification cases, archived medical samples, kinship analysis and population studies. In particular the methodology can be applied to any case, involving human or non-human species, where whole mitochondrial genome sequences are required to provide the highest level of maternal lineage discrimination. 
Multiple rounds of in-solution hybridisation-based DNA capture can retrieve whole mitochondrial genome sequences from even the most challenging samples. PMID:24289217

  4. Intracellular diversity of the V4 and V9 regions of the 18S rRNA in marine protists (radiolarians) assessed by high-throughput sequencing.

    PubMed

    Decelle, Johan; Romac, Sarah; Sasaki, Eriko; Not, Fabrice; Mahé, Frédéric

    2014-01-01

    Metabarcoding is a powerful tool for exploring microbial diversity in the environment, but its accurate interpretation is impeded by diverse technical (e.g. PCR and sequencing errors) and biological biases (e.g. intra-individual polymorphism) that remain poorly understood. To help interpret environmental metabarcoding datasets, we investigated the intracellular diversity of the V4 and V9 regions of the 18S rRNA gene from Acantharia and Nassellaria (radiolarians) using 454 pyrosequencing. Individual cells of radiolarians were isolated, and PCRs were performed with generalist primers to amplify the V4 and V9 regions. Different denoising procedures were employed to filter the pyrosequenced raw amplicons (Acacia, AmpliconNoise, Linkage method). For each of the six isolated cells, an average of 541 V4 and 562 V9 amplicons assigned to radiolarians were obtained, from which one numerically dominant sequence and several minor variants were found. At the 97% identity threshold, a diversity metric commonly used in environmental surveys, up to 5 distinct OTUs were detected in a single cell. However, most amplicons grouped within a single OTU whereas other OTUs contained very few amplicons. Different analytical methods provided evidence that most minor variants forming different OTUs correspond to PCR and sequencing artifacts. Duplicate PCR and sequencing from the same DNA extract of a single cell had only 9 to 16% of unique amplicons in common, and alignment visualization of V4 and V9 amplicons showed that most minor variants contained substitutions in highly-conserved regions. We conclude that intracellular variability of the 18S rRNA in radiolarians is very limited despite its multi-copy nature and the existence of multiple nuclei in these protists. Our study recommends some technical guidelines to conservatively discard artificial amplicons from metabarcoding datasets, and thus properly assess the diversity and richness of protists in the environment.

  5. Transcriptome Analysis of the Portunus trituberculatus: De Novo Assembly, Growth-Related Gene Identification and Marker Discovery

    PubMed Central

    Lv, Jianjian; Liu, Ping; Gao, Baoquan; Wang, Yu; Wang, Zheng; Chen, Ping; Li, Jian

    2014-01-01

    Background The swimming crab, Portunus trituberculatus, an important farmed species in China, has been attracting extensive study, which requires more and more genomic background knowledge. To date, the whole genome sequence is unavailable and transcriptomic information is also scarce for this species. In the present study, we performed de novo transcriptome sequencing to produce a comprehensive transcript dataset for major tissues of Portunus trituberculatus by the Illumina paired-end sequencing technology. Results Total RNA was isolated from eyestalk, gill, heart, hepatopancreas and muscle. Equal quantities of RNA from each tissue were pooled to construct a cDNA library. Using the Illumina paired-end sequencing technology, we generated a total of 120,137 transcripts with an average length of 1037 bp. Further assembly analysis showed that all contigs contributed to 87,100 unigenes; of these, 16,029 unigenes (18.40% of the total) can be matched in the GenBank non-redundant database. Potential genes and their functions were predicted by GO, KEGG pathway mapping and COG analysis. Based on our sequence analysis and published literature, many putative genes with fundamental roles in growth and muscle development, including actin, myosin, tropomyosin, troponin and other potentially important candidate genes, were identified for the first time in this species. Furthermore, 22,673 SSRs and 66,191 high-confidence SNPs were identified in this EST dataset. Conclusion The transcriptome provides invaluable new data as a functional genomics resource for future biological research in Portunus trituberculatus. The data will also instruct future functional studies to manipulate or select for genes influencing growth that should find practical applications in aquaculture breeding programs. 
The molecular markers identified in this study will provide a material basis for future genetic linkage and quantitative trait loci analyses, and will be essential for accelerating aquaculture breeding programs with this species. PMID:24722690

  6. Transcriptome survey of the anhydrobiotic tardigrade Milnesium tardigradum in comparison with Hypsibius dujardini and Richtersius coronifer

    PubMed Central

    2010-01-01

    Background The phenomenon of desiccation tolerance, also called anhydrobiosis, involves the ability of an organism to survive the loss of almost all cellular water without sustaining irreversible damage. Although there are several physiological, morphological and ecological studies on tardigrades, only limited DNA sequence information is available. Therefore, we explored the transcriptome in the active and anhydrobiotic state of the tardigrade Milnesium tardigradum which has extraordinary tolerance to desiccation and freezing. In this study, we present the first overview of the transcriptome of M. tardigradum and its response to desiccation and discuss potential parallels to stress responses in other organisms. Results We sequenced a total of 9984 expressed sequence tags (ESTs) from two cDNA libraries from the eutardigrade M. tardigradum in its active and inactive, anhydrobiotic (tun) stage. Assembly of these ESTs resulted in 3283 putative unique transcripts, whereof ~50% showed significant sequence similarity to known genes. The resulting unigenes were functionally annotated using the Gene Ontology (GO) vocabulary. A GO term enrichment analysis revealed several GOs that were significantly underrepresented in the inactive stage. Furthermore we compared the putative unigenes of M. tardigradum with ESTs from two other eutardigrade species that are available from public sequence databases, namely Richtersius coronifer and Hypsibius dujardini. The processed sequences of the three tardigrade species revealed similar functional content and the M. tardigradum dataset contained additional sequences from tardigrades not present in the other two. Conclusions This study describes novel sequence data from the tardigrade M. tardigradum, which significantly contributes to the available tardigrade sequence data and will help to establish this extraordinary tardigrade as a model for studying anhydrobiosis. 
Functional comparison of active and anhydrobiotic tardigrades revealed a differential distribution of Gene Ontology terms associated with chromatin structure and the translation machinery, which are underrepresented in the inactive animals. These findings imply a widespread metabolic response of the animals on dehydration. The collective tardigrade transcriptome data will serve as a reference for further studies and support the identification and characterization of genes involved in the anhydrobiotic response. PMID:20226016

  7. Transcriptome survey of the anhydrobiotic tardigrade Milnesium tardigradum in comparison with Hypsibius dujardini and Richtersius coronifer.

    PubMed

    Mali, Brahim; Grohme, Markus A; Förster, Frank; Dandekar, Thomas; Schnölzer, Martina; Reuter, Dirk; Wełnicz, Weronika; Schill, Ralph O; Frohme, Marcus

    2010-03-12

    The phenomenon of desiccation tolerance, also called anhydrobiosis, involves the ability of an organism to survive the loss of almost all cellular water without sustaining irreversible damage. Although there are several physiological, morphological and ecological studies on tardigrades, only limited DNA sequence information is available. Therefore, we explored the transcriptome in the active and anhydrobiotic state of the tardigrade Milnesium tardigradum which has extraordinary tolerance to desiccation and freezing. In this study, we present the first overview of the transcriptome of M. tardigradum and its response to desiccation and discuss potential parallels to stress responses in other organisms. We sequenced a total of 9984 expressed sequence tags (ESTs) from two cDNA libraries from the eutardigrade M. tardigradum in its active and inactive, anhydrobiotic (tun) stage. Assembly of these ESTs resulted in 3283 putative unique transcripts, whereof approximately 50% showed significant sequence similarity to known genes. The resulting unigenes were functionally annotated using the Gene Ontology (GO) vocabulary. A GO term enrichment analysis revealed several GOs that were significantly underrepresented in the inactive stage. Furthermore we compared the putative unigenes of M. tardigradum with ESTs from two other eutardigrade species that are available from public sequence databases, namely Richtersius coronifer and Hypsibius dujardini. The processed sequences of the three tardigrade species revealed similar functional content and the M. tardigradum dataset contained additional sequences from tardigrades not present in the other two. This study describes novel sequence data from the tardigrade M. tardigradum, which significantly contributes to the available tardigrade sequence data and will help to establish this extraordinary tardigrade as a model for studying anhydrobiosis. 
    Functional comparison of active and anhydrobiotic tardigrades revealed a differential distribution of Gene Ontology terms associated with chromatin structure and the translation machinery, which are underrepresented in the inactive animals. These findings imply a widespread metabolic response of the animals to dehydration. The collective tardigrade transcriptome data will serve as a reference for further studies and support the identification and characterization of genes involved in the anhydrobiotic response.

  8. RDNAnalyzer: A tool for DNA secondary structure prediction and sequence analysis.

    PubMed

    Afzal, Muhammad; Shahid, Ahmad Ali; Shehzadi, Abida; Nadeem, Shahid; Husnain, Tayyab

    2012-01-01

    RDNAnalyzer is an innovative computer-based tool designed for DNA secondary structure prediction and sequence analysis. It can generate a random DNA sequence, or the user can upload sequences of interest in RAW format. It uses and extends the Nussinov dynamic programming algorithm and has various applications in sequence analysis. It predicts the DNA secondary structure and base pairings. It also provides tools for sequence analyses routinely performed by biological scientists, such as DNA replication, reverse complement generation, transcription, translation, a sequence cleaner, and sequence-specific information such as the total number of nucleotide bases and the A, T, G and C base contents with their respective percentages. RDNAnalyzer was developed in Microsoft Visual Studio 2008 using Microsoft Visual C# and Windows Presentation Foundation and provides a user-friendly environment for sequence analysis. It is freely available at http://www.cemb.edu.pk/sw.html. Abbreviations: RDNAnalyzer, Random DNA Analyser; GUI, graphical user interface; XAML, Extensible Application Markup Language.
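
    The abstract names the Nussinov dynamic programming algorithm but does not describe how RDNAnalyzer extends it. As a minimal sketch of the textbook recurrence only, and not of the tool's own C# implementation, the following Python function counts the maximum number of nested Watson-Crick pairs in a DNA string. The function name nussinov_pairs and the min_loop parameter (minimum number of unpaired bases enclosed by a pair) are assumptions introduced for illustration.

```python
def nussinov_pairs(seq, min_loop=3):
    """Maximum number of nested complementary pairs (basic Nussinov recurrence).

    Assumptions for this sketch: only A-T and G-C pairs count, and a pair
    (k, j) must enclose at least `min_loop` unpaired bases.
    """
    pairs = {("A", "T"), ("T", "A"), ("G", "C"), ("C", "G")}
    seq = seq.upper()
    n = len(seq)
    if n == 0:
        return 0
    # dp[i][j] holds the best pair count for the subsequence seq[i..j].
    dp = [[0] * n for _ in range(n)]
    for span in range(min_loop + 1, n):
        for i in range(n - span):
            j = i + span
            best = dp[i][j - 1]  # case 1: base j stays unpaired
            # case 2: base j pairs with some k, splitting the interval
            for k in range(i, j - min_loop):
                if (seq[k], seq[j]) in pairs:
                    left = dp[i][k - 1] if k > i else 0
                    best = max(best, left + 1 + dp[k + 1][j - 1])
            dp[i][j] = best
    return dp[0][n - 1]
```

    For the hairpin-like string GGGAAATCCC the three G-C pairs can nest, while the single A-T pair would enclose too few bases under the default min_loop, so it is only counted when min_loop is set to 0.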

  9. Direct Detection and Sequencing of Damaged DNA Bases

    PubMed Central

    2011-01-01

    Products of various forms of DNA damage have been implicated in a variety of important biological processes, such as aging, neurodegenerative diseases, and cancer. Therefore, there exists great interest to develop methods for interrogating damaged DNA in the context of sequencing. Here, we demonstrate that single-molecule, real-time (SMRT®) DNA sequencing can directly detect damaged DNA bases in the DNA template - as a by-product of the sequencing method - through an analysis of the DNA polymerase kinetics that are altered by the presence of a modified base. We demonstrate the sequencing of several DNA templates containing products of DNA damage, including 8-oxoguanine, 8-oxoadenine, O6-methylguanine, 1-methyladenine, O4-methylthymine, 5-hydroxycytosine, 5-hydroxyuracil, 5-hydroxymethyluracil, or thymine dimers, and show that these base modifications can be readily detected with single-modification resolution and DNA strand specificity. We characterize the distinct kinetic signatures generated by these DNA base modifications. PMID:22185597

  10. Direct detection and sequencing of damaged DNA bases.

    PubMed

    Clark, Tyson A; Spittle, Kristi E; Turner, Stephen W; Korlach, Jonas

    2011-12-20

    Products of various forms of DNA damage have been implicated in a variety of important biological processes, such as aging, neurodegenerative diseases, and cancer. Therefore, there exists great interest to develop methods for interrogating damaged DNA in the context of sequencing. Here, we demonstrate that single-molecule, real-time (SMRT®) DNA sequencing can directly detect damaged DNA bases in the DNA template - as a by-product of the sequencing method - through an analysis of the DNA polymerase kinetics that are altered by the presence of a modified base. We demonstrate the sequencing of several DNA templates containing products of DNA damage, including 8-oxoguanine, 8-oxoadenine, O6-methylguanine, 1-methyladenine, O4-methylthymine, 5-hydroxycytosine, 5-hydroxyuracil, 5-hydroxymethyluracil, or thymine dimers, and show that these base modifications can be readily detected with single-modification resolution and DNA strand specificity. We characterize the distinct kinetic signatures generated by these DNA base modifications.

  11. A comprehensive list of cloned human DNA sequences

    PubMed Central

    Schmidtke, Jörg; Cooper, David N.

    1987-01-01

    A list of DNA sequences cloned from the human genome is presented. Intended as a guide to clone availability, this list includes published reports of cDNA, genomic and synthetic clones comprising gene and pseudogene sequences, uncharacterised DNA segments and repetitive DNA elements. PMID:3575113

  12. A comprehensive list of cloned human DNA sequences

    PubMed Central

    Schmidtke, Jörg; Cooper, David N.

    1990-01-01

    A list of DNA sequences cloned from the human genome is presented. Intended as a guide to clone availability, this list includes published reports of cDNA, genomic and synthetic clones comprising gene and pseudogene sequences, uncharacterised DNA segments and repetitive DNA elements. PMID:2333227

  13. A comprehensive list of cloned human DNA sequences

    PubMed Central

    Schmidtke, Jörg; Cooper, David N.

    1988-01-01

    A list of DNA sequences cloned from the human genome is presented. Intended as a guide to clone availability, this list includes published reports of cDNA, genomic and synthetic clones comprising gene and pseudogene sequences, uncharacterised DNA segments and repetitive DNA elements. PMID:3368330

  14. A comprehensive list of cloned human DNA sequences

    PubMed Central

    Schmidtke, Jörg; Cooper, David N.

    1989-01-01

    A list of DNA sequences cloned from the human genome is presented. Intended as a guide to clone availability, this list includes published reports of cDNA, genomic and synthetic clones comprising gene and pseudogene sequences, uncharacterised DNA segments and repetitive DNA elements. PMID:2654889

  15. Kilo-sequencing: an ordered strategy for rapid DNA sequence data acquisition.

    PubMed Central

    Barnes, W M; Bevan, M

    1983-01-01

    A strategy for rapid DNA sequence acquisition in an ordered, nonrandom manner, while retaining all of the conveniences of the dideoxy method with M13 transducing phage DNA template, is described. Target DNA 3 to 14 kb in size can be stably carried by our M13 vectors. Suitable targets are stretches of DNA which lack an enzyme recognition site which is unique on our cloning vectors and adjacent to the sequencing primer; current sites that are so useful when lacking are Pst, Xba, HindIII, BglII, EcoRI. By an in vitro procedure, we cut RF DNA once randomly and once specifically, to create thousands of deletions which start at the unique restriction site adjacent to the dideoxy sequencing primer and extend various distances across the target DNA. Phage carrying a desired size of deletions, whose DNA as template will give rise to DNA sequence data in a desired location along the target DNA, may be purified by electrophoresis alive on agarose gels. Phage running in the same location on the agarose gel thus conveniently give rise to nucleotide sequence data from the same kilobase of target DNA. PMID:6298723

  16. Silicene nanoribbon as a new DNA sequencing device

    NASA Astrophysics Data System (ADS)

    Alesheikh, Sara; Shahtahmassebi, Nasser; Roknabadi, Mahmood Rezaee; Pilevar Shahri, Raheleh

    2018-02-01

    The importance of DNA sequencing in many fields motivates the search for fast and inexpensive methods. Nanotechnology supports this development by introducing nanostructures that can be used for DNA sequencing. In this work we study the interaction between a zigzag silicene nanoribbon and DNA nucleobases using DFT and the non-equilibrium Green's function approach, to investigate the possibility of using zigzag silicene nanoribbons as a biosensor for DNA sequencing.

  17. Isolation and characterization of target sequences of the chicken CdxA homeobox gene.

    PubMed Central

    Margalit, Y; Yarus, S; Shapira, E; Gruenbaum, Y; Fainsod, A

    1993-01-01

    The DNA binding specificity of the chicken homeodomain protein CDXA was studied. Using a CDXA-glutathione-S-transferase fusion protein, DNA fragments containing the binding site for this protein were isolated. The sources of DNA were oligonucleotides with random sequence and chicken genomic DNA. The DNA fragments isolated were sequenced and tested in DNA binding assays. Sequencing revealed that most DNA fragments are AT rich, which is a common feature of homeodomain binding sites. By electrophoretic mobility shift assays it was shown that the different target sequences isolated bind to the CDXA protein with different affinities. The specific sequences bound by the CDXA protein in the genomic fragments isolated were determined by DNase I footprinting. From the footprinted sequences, the CDXA consensus binding site was determined. The CDXA protein binds the consensus sequence A, A/T, T, A/T, A, T, A/G. The CAUDAL binding site in the ftz promoter is also included in this consensus sequence. When tested, some of the genomic target sequences were capable of enhancing the transcriptional activity of reporter plasmids when introduced into CDXA expressing cells. This study determined the DNA sequence specificity of the CDXA protein and it also shows that this protein can further activate transcription in cells in culture. PMID:7909943
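
    The seven-position consensus quoted above (A, A/T, T, A/T, A, T, A/G) maps directly onto a regular expression. The sketch below is illustrative only: it scans the forward strand of a DNA string for exact consensus matches, including overlapping ones, and says nothing about binding affinity. The names CDXA_CONSENSUS and find_consensus_sites are hypothetical and do not come from the cited study.

```python
import re

# A, A/T, T, A/T, A, T, A/G written as a character-class pattern.
# The lookahead lets overlapping matches be reported as well.
CDXA_CONSENSUS = re.compile(r"(?=(A[AT]T[AT]AT[AG]))")


def find_consensus_sites(seq):
    """Return (start_index, matched_site) for each forward-strand match."""
    seq = seq.upper()
    return [(m.start(), m.group(1)) for m in CDXA_CONSENSUS.finditer(seq)]
```

    For example, find_consensus_sites("GGATTTATGCC") reports one site, ATTTATG, starting at index 2. Scanning the reverse strand would require running the same function on the reverse complement of the input.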

  18. Sequence periodicity in nucleosomal DNA and intrinsic curvature.

    PubMed

    Nair, T Murlidharan

    2010-05-17

    Most eukaryotic DNA contained in the nucleus is packaged by wrapping DNA around histone octamers. Histones are ubiquitous and bind most regions of chromosomal DNA. In order to achieve smooth wrapping of the DNA around the histone octamer, the DNA duplex should be able to deform and should possess intrinsic curvature. The deformability of DNA is a result of the non-parallelness of base pair stacks. The stacking interaction between base pairs is sequence dependent. The higher the stacking energy, the more rigid the DNA helix; thus it is natural to expect that sequences that are involved in wrapping around the histone octamer should be unstacked and possess intrinsic curvature. Intrinsic curvature has been shown to be dictated by the periodic recurrence of certain dinucleotides. Several genome-wide studies directed towards mapping of nucleosome positions have revealed periodicity associated with certain stretches of sequences. In the current study, these sequences have been analyzed with a view to understanding their sequence-dependent structures. Higher order DNA structures and the distribution of molecular bend loci associated with 146-base nucleosome core DNA sequences from C. elegans and chicken have been analyzed using the theoretical model for DNA curvature. The curvature dispersion calculated by cyclically permuting the sequences revealed that the molecular bend loci were delocalized throughout the nucleosome core region and had varying degrees of intrinsic curvature. The higher order structures associated with nucleosomes of C. elegans and chicken calculated from the sequences revealed heterogeneity with respect to the deviation of the DNA axis. The results point to the possibility that context-dependent curvature of varying degrees is associated with nucleosomal DNA.
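
    The periodic recurrence of dinucleotides mentioned above can be inspected with a simple spacing histogram. The sketch below is not the curvature model used in the study; it only counts how often two occurrences of a chosen dinucleotide lie a given number of bases apart, so that a spacing which recurs unusually often stands out. The function name dinucleotide_spacings and the max_distance cutoff are assumptions made for illustration.

```python
from collections import Counter


def dinucleotide_spacings(seq, dinucleotide, max_distance=30):
    """Count distances between all pairs of occurrences of a dinucleotide."""
    seq = seq.upper()
    dinucleotide = dinucleotide.upper()
    # Start positions of every (possibly overlapping) occurrence, ascending.
    positions = [i for i in range(len(seq) - 1) if seq[i:i + 2] == dinucleotide]
    spacings = Counter()
    for idx, start in enumerate(positions):
        for later in positions[idx + 1:]:
            distance = later - start
            if distance > max_distance:
                break  # positions are sorted, so later ones are farther still
            spacings[distance] += 1
    return spacings
```

    On a toy sequence with AA placed every 10 bases, the histogram shows counts at distances 10 and 20 and nowhere else.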

  19. Data Portal | Office of Cancer Clinical Proteomics Research

    Cancer.gov

    The CPTAC Data Portal is a centralized repository for the public dissemination of proteomic sequence datasets collected by CPTAC, along with corresponding genomic sequence datasets. Also available are analyses of CPTAC's raw mass spectrometry-based data files (mapping of spectra to peptide sequences and protein identification) by individual investigators from CPTAC and by a Common Data Analysis Pipeline.

  20. Assessing the Fidelity of Ancient DNA Sequences Amplified From Nuclear Genes

    PubMed Central

    Binladen, Jonas; Wiuf, Carsten; Gilbert, M. Thomas P.; Bunce, Michael; Barnett, Ross; Larson, Greger; Greenwood, Alex D.; Haile, James; Ho, Simon Y. W.; Hansen, Anders J.; Willerslev, Eske

    2006-01-01

    To date, the field of ancient DNA has relied almost exclusively on mitochondrial DNA (mtDNA) sequences. However, a number of recent studies have reported the successful recovery of ancient nuclear DNA (nuDNA) sequences, thereby allowing the characterization of genetic loci directly involved in phenotypic traits of extinct taxa. It is well documented that postmortem damage in ancient mtDNA can lead to the generation of artifactual sequences. However, as yet no one has thoroughly investigated the damage spectrum in ancient nuDNA. By comparing clone sequences from 23 fossil specimens, recovered from environments ranging from permafrost to desert, we demonstrate the presence of miscoding lesion damage in both the mtDNA and nuDNA, resulting in insertion of erroneous bases during amplification. Interestingly, no significant differences in the frequency of miscoding lesion damage are recorded between mtDNA and nuDNA despite great differences in cellular copy numbers. For both mtDNA and nuDNA, we find significant positive correlations between total sequence heterogeneity and the rates of type 1 transitions (adenine → guanine and thymine → cytosine) and type 2 transitions (cytosine → thymine and guanine → adenine), respectively. Type 2 transitions are by far the most dominant and increase relative to those of type 1 with damage load. The results suggest that the deamination of cytosine (and 5-methyl cytosine) to uracil (and thymine) is the main cause of miscoding lesions in both ancient mtDNA and nuDNA sequences. We argue that the problems presented by postmortem damage, as well as problems with contamination from exogenous sources of conserved nuclear genes, allelic variation, and the reliance on single nucleotide polymorphisms, call for great caution in studies relying on ancient nuDNA sequences. PMID:16299392
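
    The type 1 and type 2 transition classes defined above can be tallied mechanically from a pair of aligned sequences. The sketch below assumes a reference sequence and a clone sequence of equal length that are already aligned and contain no gaps; it is an illustration of the classification only, not the statistical analysis performed in the study. The name classify_substitutions and the "other" bucket for all remaining mismatches are assumptions.

```python
from collections import Counter

# Transition classes as defined in the abstract (reference base, observed base).
TYPE1 = {("A", "G"), ("T", "C")}
TYPE2 = {("C", "T"), ("G", "A")}


def classify_substitutions(reference, clone):
    """Count type 1, type 2 and other mismatches between aligned sequences."""
    if len(reference) != len(clone):
        raise ValueError("sequences must be aligned to the same length")
    counts = Counter()
    for ref_base, obs_base in zip(reference.upper(), clone.upper()):
        if ref_base == obs_base:
            continue
        change = (ref_base, obs_base)
        if change in TYPE1:
            counts["type1"] += 1
        elif change in TYPE2:
            counts["type2"] += 1
        else:
            counts["other"] += 1
    return counts
```

    Comparing the reference ACGTAC with the clone GTATAA, for instance, yields one type 1 change (A to G), two type 2 changes (C to T and G to A) and one other mismatch (C to A).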
