Network-assisted target identification for haploinsufficiency and homozygous profiling screens
Wang, Sheng
2017-01-01
Chemical genomic screens have recently emerged as a systematic approach to drug discovery on a genome-wide scale. Drug target identification and elucidation of the mechanism of action (MoA) of hits from these noisy high-throughput screens remain difficult. Here, we present GIT (Genetic Interaction Network-Assisted Target Identification), a network analysis method for drug target identification in haploinsufficiency profiling (HIP) and homozygous profiling (HOP) screens. In addition to the drug-induced phenotypic fitness defect caused by deleting a gene, GIT incorporates the fitness defects of the gene’s neighbors in the genetic interaction network. On three genome-scale yeast chemical genomic screens, GIT substantially outperforms previous scoring methods on target identification in both HIP and HOP assays. Finally, we show that by combining HIP and HOP assays, GIT further boosts target identification and reveals the potential mechanism of action of a drug. PMID:28574983
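The idea of supplementing a gene's own fitness defect with those of its network neighbors can be illustrated with a minimal sketch. The scoring rule below (own defect plus a weighted mean of neighbor defects), the `weight` parameter, and all gene names are assumptions made for illustration; the abstract does not give the actual GIT formula or how it treats HIP and HOP differently.

```python
def network_assisted_score(gene, fitness_defect, neighbors, weight=0.5):
    """Hypothetical score: a gene's own fitness defect plus a weighted
    mean of the fitness defects of its genetic-interaction neighbors."""
    own = fitness_defect.get(gene, 0.0)
    nbr_scores = [fitness_defect[n] for n in neighbors.get(gene, ())
                  if n in fitness_defect]
    if not nbr_scores:
        # No measured neighbors: fall back to the gene's own defect.
        return own
    return own + weight * sum(nbr_scores) / len(nbr_scores)


# Toy data (invented): fitness-defect scores and a small interaction network.
fitness_defect = {"A": 2.0, "B": 1.0, "C": 3.0, "D": 2.5}
neighbors = {"A": {"B", "C"}, "D": set()}

scores = {g: network_assisted_score(g, fitness_defect, neighbors)
          for g in ("A", "D")}
ranked = sorted(scores, key=scores.get, reverse=True)
```

In this toy case gene A overtakes gene D despite a lower individual defect, because its neighbors also show defects. That is the kind of noise reduction a network-assisted score aims for.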
Mapping Challenging Mutations by Whole-Genome Sequencing
Smith, Harold E.; Fabritius, Amy S.; Jaramillo-Lambert, Aimee; Golden, Andy
2016-01-01
Whole-genome sequencing provides a rapid and powerful method for identifying mutations on a global scale, and has spurred a renewed enthusiasm for classical genetic screens in model organisms. The most commonly characterized category of mutation consists of monogenic, recessive traits, due to their genetic tractability. Therefore, most of the mapping methods for mutation identification by whole-genome sequencing are directed toward alleles that fulfill those criteria (i.e., single-gene, homozygous variants). However, such approaches are not entirely suitable for the characterization of a variety of more challenging mutations, such as dominant and semidominant alleles or multigenic traits. Therefore, we have developed strategies for the identification of those classes of mutations, using polymorphism mapping in Caenorhabditis elegans as our model for validation. We also report an alternative approach for mutation identification from traditional recombinant crosses, and a solution to the technical challenge of sequencing sterile or terminally arrested strains where population size is limiting. The methods described herein extend the applicability of whole-genome sequencing to a broader spectrum of mutations, including classes that are difficult to map by traditional means. PMID:26945029
Haplotype-Based Genotyping in Polyploids.
Clevenger, Josh P; Korani, Walid; Ozias-Akins, Peggy; Jackson, Scott
2018-01-01
Accurate identification of polymorphisms from sequence data is crucial to unlocking the potential of high throughput sequencing for genomics. Single nucleotide polymorphisms (SNPs) are difficult to accurately identify in polyploid crops due to the duplicative nature of polyploid genomes leading to low confidence in the true alignment of short reads. Implementing a haplotype-based method in contrasting subgenome-specific sequences leads to higher accuracy of SNP identification in polyploids. To test this method, a large-scale 48K SNP array (Axiom Arachis2) was developed for Arachis hypogaea (peanut), an allotetraploid, in which 1,674 haplotype-based SNPs were included. Results of the array show that 74% of the haplotype-based SNP markers could be validated, which is considerably higher than previous methods used for peanut. The haplotype method has been implemented in a standalone program, HAPLOSWEEP, which takes as input bam files and a vcf file and identifies haplotype-based markers. Haplotype discovery can be made within single reads or span paired reads, and can leverage long read technology by targeting any length of haplotype. Haplotype-based genotyping is applicable in all allopolyploid genomes and provides confidence in marker identification and in silico-based genotyping for polyploid genomics.
Veiga, Diogo F. T.; Dutta, Bhaskar; Balaźsi, Gábor
2011-01-01
The escalating amount of genome-scale data demands a pragmatic stance from the research community. How can we utilize this deluge of information to better understand biology, cure diseases, or engage cells in bioremediation or biomaterial production for various purposes? A research pipeline moving new sequence, expression and binding data towards practical end goals seems to be necessary. While most individual researchers are not motivated by such well-articulated pragmatic end goals, the scientific community has already self-organized itself to successfully convert genomic data into fundamentally new biological knowledge and practical applications. Here we review two important steps in this workflow: network inference and network response identification, applied to transcriptional regulatory networks. Among network inference methods, we concentrate on relevance networks due to their conceptual simplicity. We classify and discuss network response identification approaches as either data-centric or network-centric. Finally, we conclude with an outlook on what is still missing from these approaches and what may be ahead on the road to biological discovery. PMID:20174676
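As a rough illustration of why relevance networks are considered conceptually simple, the sketch below links two genes whenever the absolute Pearson correlation of their expression profiles exceeds a threshold. The threshold value, the gene names, and the choice of Pearson correlation (mutual information is another common similarity measure) are assumptions for illustration and are not taken from the review.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length profiles.
    Assumes neither profile is constant (non-zero variance)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(vx * vy)


def relevance_network(expression, threshold=0.9):
    """Return edges (gene1, gene2, r) whose |r| meets the threshold."""
    edges = []
    for g1, g2 in combinations(sorted(expression), 2):
        r = pearson(expression[g1], expression[g2])
        if abs(r) >= threshold:
            edges.append((g1, g2, r))
    return edges


# Toy expression profiles across four conditions (invented values).
expression = {
    "tf1": [1, 2, 3, 4],
    "geneA": [2, 4, 6, 8],
    "geneB": [4, 1, 3, 2],
}
edges = relevance_network(expression)
```

Only the tf1/geneA pair passes the threshold here. Note that the resulting edges are undirected, so a sketch like this says nothing about which gene regulates which.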
Single-molecule optical genome mapping of a human HapMap and a colorectal cancer cell line.
Teo, Audrey S M; Verzotto, Davide; Yao, Fei; Nagarajan, Niranjan; Hillmer, Axel M
2015-01-01
Next-generation sequencing (NGS) technologies have changed our understanding of the variability of the human genome. However, the identification of genome structural variations based on NGS approaches with read lengths of 35-300 bases remains a challenge. Single-molecule optical mapping technologies allow the analysis of DNA molecules of up to 2 Mb and as such are suitable for the identification of large-scale genome structural variations, and for de novo genome assemblies when combined with short-read NGS data. Here we present optical mapping data for two human genomes: the HapMap cell line GM12878 and the colorectal cancer cell line HCT116. High molecular weight DNA was obtained by embedding GM12878 and HCT116 cells, respectively, in agarose plugs, followed by DNA extraction under mild conditions. Genomic DNA was digested with KpnI and 310,000 and 296,000 DNA molecules (≥ 150 kb and 10 restriction fragments), respectively, were analyzed per cell line using the Argus optical mapping system. Maps were aligned to the human reference by OPTIMA, a new glocal alignment method. Genome coverage of 6.8× and 5.7× was obtained, respectively; 2.9× and 1.7× more than the coverage obtained with previously available software. Optical mapping allows the resolution of large-scale structural variations of the genome, and the scaffold extension of NGS-based de novo assemblies. OPTIMA is an efficient new alignment method; our optical mapping data provide a resource for genome structure analyses of the human HapMap reference cell line GM12878, and the colorectal cancer cell line HCT116.
The opportunities and challenges of large-scale molecular approaches to songbird neurobiology
Mello, C.V.; Clayton, D.F.
2014-01-01
High-throughput methods for analyzing genome structure and function are having a large impact in songbird neurobiology. Methods include genome sequencing and annotation, comparative genomics, DNA microarrays and transcriptomics, and the development of a brain atlas of gene expression. Key emerging findings include the identification of complex transcriptional programs active during singing, the robust brain expression of non-coding RNAs, evidence of profound variations in gene expression across brain regions, and the identification of molecular specializations within song production and learning circuits. Current challenges include the statistical analysis of large datasets, effective genome curation, the efficient localization of gene expression changes to specific neuronal circuits and cells, and the dissection of behavioral and environmental factors that influence brain gene expression. The field requires efficient methods for comparisons with organisms like chicken, which offer important anatomical, functional and behavioral contrasts. As sequencing costs plummet, opportunities emerge for comparative approaches that may help reveal evolutionary transitions contributing to vocal learning, social behavior and other properties that make songbirds such compelling research subjects. PMID:25280907
Identification of genomic indels and structural variations using split reads
2011-01-01
Background: Recent studies have demonstrated the genetic significance of insertions, deletions, and other more complex structural variants (SVs) in the human population. With the development of the next-generation sequencing technologies, high-throughput surveys of SVs on the whole-genome level have become possible. Here we present split-read identification, calibrated (SRiC), a sequence-based method for SV detection. Results: We start by mapping each read to the reference genome in standard fashion using gapped alignment. Then to identify SVs, we score each of the many initial mappings with an assessment strategy designed to take into account both sequencing and alignment errors (e.g. scoring more highly events gapped in the center of a read). All current SV calling methods have multilevel biases in their identifications due to both experimental and computational limitations (e.g. calling more deletions than insertions). A key aspect of our approach is that we calibrate all our calls against synthetic data sets generated from simulations of high-throughput sequencing (with realistic error models). This allows us to calculate sensitivity and the positive predictive value under different parameter-value scenarios and for different classes of events (e.g. long deletions vs. short insertions). We run our calculations on representative data from the 1000 Genomes Project. Coupling the observed numbers of events on chromosome 1 with the calibrations gleaned from the simulations (for different length events) allows us to construct a relatively unbiased estimate for the total number of SVs in the human genome across a wide range of length scales. We estimate in particular that an individual genome contains ~670,000 indels/SVs.
Conclusions: Compared with the existing read-depth and read-pair approaches for SV identification, our method can pinpoint the exact breakpoints of SV events, reveal the actual sequence content of insertions, and cover the whole size spectrum for deletions. Moreover, with the advent of the third-generation sequencing technologies that produce longer reads, we expect our method to be even more useful. PMID:21787423
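The calibration step described above can be sketched as follows: sensitivity and positive predictive value (PPV) are measured on simulated data where the true events are known, and are then used to correct an observed call count. The event representation, the exact-match comparison, and the correction formula (observed × PPV / sensitivity) are assumptions for illustration; the abstract does not state how SRiC matches events or applies its calibrations.

```python
def sensitivity_and_ppv(true_events, called_events):
    """Compare calls on simulated data against the known truth set."""
    true_set, called_set = set(true_events), set(called_events)
    true_positives = len(true_set & called_set)
    sensitivity = true_positives / len(true_set) if true_set else 0.0
    ppv = true_positives / len(called_set) if called_set else 0.0
    return sensitivity, ppv


def calibrated_count(observed_calls, sensitivity, ppv):
    """Estimate the true number of events from an observed call count.
    observed * ppv removes expected false positives; dividing by
    sensitivity adds back the events expected to have been missed."""
    return observed_calls * ppv / sensitivity


# Simulated truth (invented): four deletions and four insertions.
truth = {("del", p) for p in (100, 200, 300, 400)} | \
        {("ins", p) for p in (500, 600, 700, 800)}
# Calls on the simulation: four correct events and one false deletion.
calls = {("del", 100), ("del", 200), ("del", 300), ("ins", 500), ("del", 999)}

sens, ppv = sensitivity_and_ppv(truth, calls)
estimate = calibrated_count(100, sens, ppv)
```

In practice the two rates would be computed separately for each event class and length range, since the abstract notes that biases differ between, for example, long deletions and short insertions.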
High-Throughput Block Optical DNA Sequence Identification.
Sagar, Dodderi Manjunatha; Korshoj, Lee Erik; Hanson, Katrina Bethany; Chowdhury, Partha Pratim; Otoupal, Peter Britton; Chatterjee, Anushree; Nagpal, Prashant
2018-01-01
Optical techniques for molecular diagnostics or DNA sequencing generally rely on small molecule fluorescent labels, which utilize light with a wavelength of several hundred nanometers for detection. Developing a label-free optical DNA sequencing technique will require nanoscale focusing of light, a high-throughput and multiplexed identification method, and a data compression technique to rapidly identify sequences and analyze genomic heterogeneity for big datasets. Such a method should identify characteristic molecular vibrations using optical spectroscopy, especially in the "fingerprinting region" from ≈400–1400 cm⁻¹. Here, surface-enhanced Raman spectroscopy is used to demonstrate label-free identification of DNA nucleobases with multiplexed 3D plasmonic nanofocusing. While nanometer-scale mode volumes prevent identification of single nucleobases within a DNA sequence, the block optical technique can identify A, T, G, and C content in DNA k-mers. The content of each nucleotide in a DNA block can be a unique and high-throughput method for identifying sequences, genes, and other biomarkers as an alternative to single-letter sequencing. Additionally, coupling two complementary vibrational spectroscopy techniques (infrared and Raman) can improve block characterization. These results pave the way for developing a novel, high-throughput block optical sequencing method with lossy genomic data compression using k-mer identification from multiplexed optical data acquisition. © 2017 WILEY-VCH Verlag GmbH & Co. KGaA, Weinheim.
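To make the notion of block content concrete, the sketch below reduces a sequence to the A, T, G, and C counts of consecutive non-overlapping k-mers. It illustrates only the data representation, not the spectroscopy; the function name, the non-overlapping tiling, and the example sequences are assumptions.

```python
from collections import Counter


def block_content(sequence, k):
    """Split a sequence into consecutive non-overlapping k-mers and
    return the (A, T, G, C) counts of each complete block."""
    blocks = []
    for start in range(0, len(sequence) - k + 1, k):
        counts = Counter(sequence[start:start + k].upper())
        blocks.append(tuple(counts.get(base, 0) for base in "ATGC"))
    return blocks


signature = block_content("AATGCCGT", 4)

# The representation is lossy: blocks with the same composition but a
# different base order produce identical content vectors.
same_content = block_content("AATG", 4) == block_content("GATA", 4)
```

This loss of base order within a block is what the abstract refers to as lossy compression: a sequence is identified by its series of content vectors rather than by its individual letters.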
Harnessing Whole Genome Sequencing in Medical Mycology.
Cuomo, Christina A
2017-01-01
Comparative genome sequencing studies of human fungal pathogens enable identification of genes and variants associated with virulence and drug resistance. This review describes current approaches, resources, and advances in applying whole genome sequencing to study clinically important fungal pathogens. Genomes for some important fungal pathogens were only recently assembled, revealing gene family expansions in many species and extreme gene loss in one obligate species. The scale and scope of species sequenced is rapidly expanding, leveraging technological advances to assemble and annotate genomes with higher precision. By using iteratively improved reference assemblies or those generated de novo for new species, recent studies have compared the sequence of isolates representing populations or clinical cohorts. Whole genome approaches provide the resolution necessary for comparison of closely related isolates, for example, in the analysis of outbreaks or sampled across time within a single host. Genomic analysis of fungal pathogens has enabled both basic research and diagnostic studies. The increased scale of sequencing can be applied across populations, and new metagenomic methods allow direct analysis of complex samples.
Liu, Siyang; Huang, Shujia; Rao, Junhua; Ye, Weijian; Krogh, Anders; Wang, Jun
2015-01-01
Comprehensive recognition of genomic variation in one individual is important for understanding disease and developing personalized medication and treatment. Many tools based on DNA re-sequencing exist for identification of single nucleotide polymorphisms, small insertions and deletions (indels) as well as large deletions. However, these approaches consistently display a substantial bias against the recovery of complex structural variants and novel sequence in individual genomes and do not provide interpretation information such as the annotation of ancestral state and formation mechanism. We present a novel approach implemented in a single software package, AsmVar, to discover, genotype and characterize different forms of structural variation and novel sequence from population-scale de novo genome assemblies up to nucleotide resolution. Application of AsmVar to several human de novo genome assemblies captures a wide spectrum of structural variants and novel sequences present in the human population in high sensitivity and specificity. Our method provides a direct solution for investigating structural variants and novel sequences from de novo genome assemblies, facilitating the construction of population-scale pan-genomes. Our study also highlights the usefulness of the de novo assembly strategy for definition of genome structure.
Identification of copy number variants in whole-genome data using Reference Coverage Profiles
Glusman, Gustavo; Severson, Alissa; Dhankani, Varsha; Robinson, Max; Farrah, Terry; Mauldin, Denise E.; Stittrich, Anna B.; Ament, Seth A.; Roach, Jared C.; Brunkow, Mary E.; Bodian, Dale L.; Vockley, Joseph G.; Shmulevich, Ilya; Niederhuber, John E.; Hood, Leroy
2015-01-01
The identification of DNA copy numbers from short-read sequencing data remains a challenge for both technical and algorithmic reasons. The raw data for these analyses are measured in tens to hundreds of gigabytes per genome; transmitting, storing, and analyzing such large files is cumbersome, particularly for methods that analyze several samples simultaneously. We developed a very efficient representation of depth of coverage (150–1000× compression) that enables such analyses. Current methods for analyzing variants in whole-genome sequencing (WGS) data frequently miss copy number variants (CNVs), particularly hemizygous deletions in the 1–100 kb range. To fill this gap, we developed a method to identify CNVs in individual genomes, based on comparison to joint profiles pre-computed from a large set of genomes. We analyzed depth of coverage in over 6000 high quality (>40×) genomes. The depth of coverage has strong sequence-specific fluctuations only partially explained by global parameters like %GC. To account for these fluctuations, we constructed multi-genome profiles representing the observed or inferred diploid depth of coverage at each position along the genome. These Reference Coverage Profiles (RCPs) take into account the diverse technologies and pipeline versions used. Normalization of the scaled coverage to the RCP followed by hidden Markov model (HMM) segmentation enables efficient detection of CNVs and large deletions in individual genomes. Use of pre-computed multi-genome coverage profiles improves our ability to analyze each individual genome. We make available RCPs and tools for performing these analyses on personal genomes. We expect the increased sensitivity and specificity for individual genome analysis to be critical for achieving clinical-grade genome interpretation. PMID:25741365
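The normalization and segmentation steps can be sketched as below: per-bin coverage is divided by the reference profile, and a three-state hidden Markov model is decoded with the Viterbi algorithm. The state set, the expected ratios, the Gaussian emission width `sigma`, and the transition probability `stay` are illustrative assumptions; the abstract does not specify the parameters of the published HMM.

```python
import math

# Expected coverage ratio for each hidden state (assumed values).
STATES = {"deletion": 0.5, "diploid": 1.0, "duplication": 1.5}


def normalize(coverage, reference_profile):
    """Divide observed coverage by the reference profile, bin by bin.
    Assumes every reference value is non-zero."""
    return [c / r for c, r in zip(coverage, reference_profile)]


def viterbi(ratios, sigma=0.15, stay=0.95):
    """Most likely state path under Gaussian emissions and a uniform prior."""
    names = list(STATES)
    switch = (1.0 - stay) / (len(names) - 1)

    def log_emit(state, x):
        # Gaussian log-density up to a constant shared by all states.
        return -((x - STATES[state]) ** 2) / (2 * sigma ** 2)

    def log_trans(a, b):
        return math.log(stay if a == b else switch)

    score = {s: log_emit(s, ratios[0]) for s in names}
    back = []
    for x in ratios[1:]:
        prev, score, pointers = score, {}, {}
        for s in names:
            best = max(names, key=lambda p: prev[p] + log_trans(p, s))
            pointers[s] = best
            score[s] = prev[best] + log_trans(best, s) + log_emit(s, x)
        back.append(pointers)
    # Trace back from the best final state.
    state = max(names, key=lambda s: score[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return path[::-1]


coverage = [40, 42, 20, 19, 21, 41, 39]   # toy per-bin depth
profile = [40] * 7                        # toy reference profile
path = viterbi(normalize(coverage, profile))
```

With these settings the three low-coverage bins are labeled as a deletion, while a single low bin would not outweigh the cost of two state switches. This smoothing is the reason for segmenting with an HMM rather than thresholding each bin independently.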
Misra, Ashish; Green, Michael R
2017-01-01
Alternative splicing is a regulated process that leads to inclusion or exclusion of particular exons in a pre-mRNA transcript, resulting in multiple protein isoforms being encoded by a single gene. With more than 90 % of human genes known to undergo alternative splicing, it represents a major source for biological diversity inside cells. Although in vitro splicing assays have revealed insights into the mechanisms regulating individual alternative splicing events, our global understanding of alternative splicing regulation is still evolving. In recent years, genome-wide RNA interference (RNAi) screening has transformed biological research by enabling genome-scale loss-of-function screens in cultured cells and model organisms. In addition to resulting in the identification of new cellular pathways and potential drug targets, these screens have also uncovered many previously unknown mechanisms regulating alternative splicing. Here, we describe a method for the identification of alternative splicing regulators using genome-wide RNAi screening, as well as assays for further validation of the identified candidates. With modifications, this method can also be adapted to study the splicing regulation of pre-mRNAs that contain two or more splice isoforms.
Exhaustive identification of steady state cycles in large stoichiometric networks
Wright, Jeremiah; Wagner, Andreas
2008-01-01
Background: Identifying cyclic pathways in chemical reaction networks is important, because such cycles may indicate in silico violation of energy conservation, or the existence of feedback in vivo. Unfortunately, our ability to identify cycles in stoichiometric networks, such as signal transduction and genome-scale metabolic networks, has been hampered by the computational complexity of the methods currently used. Results: We describe a new algorithm for the identification of cycles in stoichiometric networks, and we compare its performance to two others by exhaustively identifying the cycles contained in the genome-scale metabolic networks of H. pylori, M. barkeri, E. coli, and S. cerevisiae. Our algorithm can substantially decrease both the execution time and maximum memory usage in comparison to the two previous algorithms. Conclusion: The algorithm we describe improves our ability to study large, real-world, biochemical reaction networks, although additional methodological improvements are desirable. PMID:18616835
Harris, R. Alan; Wang, Ting; Coarfa, Cristian; Nagarajan, Raman P.; Hong, Chibo; Downey, Sara L.; Johnson, Brett E.; Fouse, Shaun D.; Delaney, Allen; Zhao, Yongjun; Olshen, Adam; Ballinger, Tracy; Zhou, Xin; Forsberg, Kevin J.; Gu, Junchen; Echipare, Lorigail; O’Geen, Henriette; Lister, Ryan; Pelizzola, Mattia; Xi, Yuanxin; Epstein, Charles B.; Bernstein, Bradley E.; Hawkins, R. David; Ren, Bing; Chung, Wen-Yu; Gu, Hongcang; Bock, Christoph; Gnirke, Andreas; Zhang, Michael Q.; Haussler, David; Ecker, Joseph; Li, Wei; Farnham, Peggy J.; Waterland, Robert A.; Meissner, Alexander; Marra, Marco A.; Hirst, Martin; Milosavljevic, Aleksandar; Costello, Joseph F.
2010-01-01
Sequencing-based DNA methylation profiling methods are comprehensive and, as accuracy and affordability improve, will increasingly supplant microarrays for genome-scale analyses. Here, four sequencing-based methodologies were applied to biological replicates of human embryonic stem cells to compare their CpG coverage genome-wide and in transposons, resolution, cost, concordance and its relationship with CpG density and genomic context. The two bisulfite methods reached concordance of 82% for CpG methylation levels and 99% for non-CpG cytosine methylation levels. Using binary methylation calls, two enrichment methods were 99% concordant, while regions assessed by all four methods were 97% concordant. To achieve comprehensive methylome coverage while reducing cost, an approach integrating two complementary methods was examined. The integrative methylome profile along with histone methylation, RNA, and SNP profiles derived from the sequence reads allowed genome-wide assessment of allele-specific epigenetic states, identifying most known imprinted regions and new loci with monoallelic epigenetic marks and monoallelic expression. PMID:20852635
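Concordance on binary methylation calls can be computed along the following lines: each method's methylation level per region is binarized, and agreement is counted over the regions both methods assessed. The 0.5 cutoff, the region identifiers, and the methylation values are assumptions for illustration; the abstract does not state how calls were binarized.

```python
def binary_concordance(levels_a, levels_b, cutoff=0.5):
    """Fraction of shared regions where two methods agree on a
    methylated (>= cutoff) versus unmethylated (< cutoff) call."""
    shared = levels_a.keys() & levels_b.keys()
    if not shared:
        return None  # nothing assessed by both methods
    agree = sum((levels_a[r] >= cutoff) == (levels_b[r] >= cutoff)
                for r in shared)
    return agree / len(shared)


# Toy methylation levels per region for two methods (invented values).
method_a = {"r1": 0.9, "r2": 0.1, "r3": 0.7, "r4": 0.2}
method_b = {"r1": 0.8, "r2": 0.3, "r3": 0.4, "r4": 0.1, "r5": 0.9}

concordance = binary_concordance(method_a, method_b)
```

Region r5 is ignored because only one method covers it, which mirrors the abstract's restriction of the four-way comparison to regions assessed by all methods.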
Shen, Dan-na; Yi, Xu-fu; Chen, Xiao-gang; Xu, Tong-li; Cui, Li-juan
2007-10-01
Individual response to drugs, toxicants, environmental chemicals and allergens varies with genotype. Some respond well to these substances without significant consequences, while others may respond strongly with severe consequences and even death. Toxicogenetics and toxicogenomics as well as pharmacogenetics explain the genetic basis for the variations of individual response to toxicants by sequencing the human genome and large-scale identification of genome polymorphism. The new disciplines will provide a new route for forensic specialists to determine the cause of death.
Accurate identification of RNA editing sites from primitive sequence with deep neural networks.
Ouyang, Zhangyi; Liu, Feng; Zhao, Chenghui; Ren, Chao; An, Gaole; Mei, Chuan; Bo, Xiaochen; Shu, Wenjie
2018-04-16
RNA editing is a post-transcriptional RNA sequence alteration. Current methods have identified editing sites and facilitated research but require sufficient genomic annotations and prior-knowledge-based filtering steps, resulting in a cumbersome, time-consuming identification process. Moreover, these methods have limited generalizability and applicability in species with insufficient genomic annotations or in conditions of limited prior knowledge. We developed DeepRed, a deep learning-based method that identifies RNA editing from primitive RNA sequences without prior-knowledge-based filtering steps or genomic annotations. DeepRed achieved 98.1% and 97.9% area under the curve (AUC) in training and test sets, respectively. We further validated DeepRed using experimentally verified U87 cell RNA-seq data, achieving 97.9% positive predictive value (PPV). We demonstrated that DeepRed offers better prediction accuracy and computational efficiency than current methods with large-scale, mass RNA-seq data. We used DeepRed to assess the impact of multiple factors on editing identification with RNA-seq data from the Association of Biomolecular Resource Facilities and Sequencing Quality Control projects. We explored developmental RNA editing pattern changes during human early embryogenesis and evolutionary patterns in Drosophila species and the primate lineage using DeepRed. Our work illustrates DeepRed's state-of-the-art performance; it may decipher the hidden principles behind RNA editing, making editing detection convenient and effective.
Argimón, Silvia; Konganti, Kranti; Chen, Hao; Alekseyenko, Alexander V.; Brown, Stuart; Caufield, Page W.
2014-01-01
Comparative genomics is a popular method for the identification of microbial virulence determinants, especially since the sequencing of a large number of whole bacterial genomes from pathogenic and non-pathogenic strains has become relatively inexpensive. The bioinformatics pipelines for comparative genomics usually include gene prediction and annotation and can require significant computer power. To circumvent this, we developed a rapid method for genome-scale in silico subtractive hybridization, based on blastn and independent of feature identification and annotation. Whole genome comparisons by in silico genome subtraction were performed to identify genetic loci specific to Streptococcus mutans strains associated with severe early childhood caries (S-ECC), compared to strains isolated from caries-free (CF) children. The genome similarity of the 20 S. mutans strains included in this study, calculated by Simrank k-mer sharing, ranged from 79.5 to 90.9%, confirming this is a genetically heterogeneous group of strains. We identified strain-specific genetic elements in 19 strains, with sizes ranging from 200 bp to 39 kb. These elements contained protein-coding regions with functions mostly associated with mobile DNA. We did not, however, identify any genetic loci consistently associated with dental caries, i.e., shared by all the S-ECC strains and absent in the CF strains. Conversely, we did not identify any genetic loci specific to the healthy group. Comparison of previously published genomes from pathogenic and carriage strains of Neisseria meningitidis with our in silico genome subtraction yielded the same set of genes specific to the pathogenic strains, thus validating our method. Our results suggest that S. mutans strains derived from caries-active or caries-free dentitions cannot be differentiated based on the presence or absence of specific genetic elements.
Our in silico genome subtraction method is available as the Microbial Genome Comparison (MGC) tool, with a user-friendly JAVA graphical interface. PMID:24291226
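Genome similarity by k-mer sharing can be illustrated with a simple sketch: the fraction of a query's distinct k-mers that also occur in a target. This is not the Simrank implementation; the function names, the choice of k, and the normalization by the query's k-mer count are assumptions made for illustration.

```python
def kmers(sequence, k):
    """Set of distinct k-mers in a sequence."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


def kmer_sharing(query, target, k=7):
    """Fraction of the query's distinct k-mers also found in the target."""
    query_kmers = kmers(query, k)
    if not query_kmers:
        return 0.0  # query shorter than k
    return len(query_kmers & kmers(target, k)) / len(query_kmers)


# Toy sequences (invented); k is kept small so the result is easy to check.
similarity = kmer_sharing("ACGTAC", "ACGTTT", k=3)
```

Because the measure is normalized by the query, it is asymmetric: swapping query and target can give a different value when the two genomes differ in size or content.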
SynFind: Compiling Syntenic Regions across Any Set of Genomes on Demand.
Tang, Haibao; Bomhoff, Matthew D; Briones, Evan; Zhang, Liangsheng; Schnable, James C; Lyons, Eric
2015-11-11
The identification of conserved syntenic regions enables discovery of predicted locations for orthologous and homeologous genes, even when no such gene is present. This capability means that synteny-based methods are far more effective than sequence similarity-based methods in identifying true-negatives, a necessity for studying gene loss and gene transposition. However, the identification of syntenic regions requires complex analyses which must be repeated for pairwise comparisons between any two species. Therefore, as the number of published genomes increases, there is a growing demand for scalable, simple-to-use applications to perform comparative genomic analyses that cater to both gene family studies and genome-scale studies. We implemented SynFind, a web-based tool that addresses this need. Given one query genome, SynFind is capable of identifying conserved syntenic regions in any set of target genomes. SynFind is capable of reporting per-gene information, useful for researchers studying specific gene families, as well as genome-wide data sets of syntenic gene and predicted gene locations, critical for researchers focused on large-scale genomic analyses. Inference of syntenic homologs provides the basis for correlation of functional changes around genes of interest between related organisms. Deployed on the CoGe online platform, SynFind is connected to the genomic data from over 15,000 organisms from all domains of life as well as supporting multiple releases of the same organism. SynFind makes use of a powerful job execution framework that promises scalability and reproducibility. SynFind can be accessed at http://genomevolution.org/CoGe/SynFind.pl. A video tutorial of SynFind using Phytophthora as an example is available at http://www.youtube.com/watch?v=2Agczny9Nyc. © The Author(s) 2015. Published by Oxford University Press on behalf of the Society for Molecular Biology and Evolution.
Bi, Jianjun; Song, Rengang; Yang, Huilan; Li, Bingling; Fan, Jianyong; Liu, Zhongrong; Long, Chaoqin
2011-01-01
Identification of immunodominant epitopes is the first step in the rational design of peptide vaccines aimed at T-cell immunity. To date, however, accurately and efficiently predicting potent epitope peptides from a large pool of candidates remains a great challenge. In this study, a method that we named StepRank has been developed for the reliable and rapid prediction of binding capabilities/affinities between proteins and genome-wide peptides. In this procedure, instead of the single strategy used in most traditional epitope identification algorithms, four steps with different purposes and thus different computational demands are employed in turn to screen the large-scale peptide candidates that are normally generated from, for example, a pathogen genome. Steps 1 and 2 aim at qualitative exclusion of typical nonbinders by using an empirical rule and a linear statistical approach, while steps 3 and 4 focus on quantitative examination and prediction of the interaction energy profile and binding affinity of a peptide to the target protein via quantitative structure-activity relationship (QSAR) and structure-based free energy analysis. We exemplify this method through its application to binding predictions of the peptide segments derived from the 76 known open-reading frames (ORFs) of the herpes simplex virus type 1 (HSV-1) genome with or without affinity to the human major histocompatibility complex class I (MHC I) molecule HLA-A*0201, and find that the predictive results are well compatible with the classical anchor residue theory and perfectly match the extended motif pattern of MHC I-binding peptides. The putative epitopes are further confirmed by comparisons with 11 experimentally measured HLA-A*0201-restricted peptides from the HSV-1 glycoproteins D and K. We expect that this well-designed scheme can be applied in the computational screening of other viral genomes as well.
miRNAFold: a web server for fast miRNA precursor prediction in genomes.
Tav, Christophe; Tempel, Sébastien; Poligny, Laurent; Tahi, Fariza
2016-07-08
Computational methods are required for prediction of non-coding RNAs (ncRNAs), which are involved in many biological processes, especially at the post-transcriptional level. Among these ncRNAs, miRNAs have been extensively studied, and biologists need efficient and fast tools for their identification. In particular, ab initio methods are usually required when predicting novel miRNAs. Here we present a web server dedicated to the large-scale identification of miRNA precursors in genomes. It is based on an algorithm called miRNAFold that predicts miRNA hairpin structures quickly and with high sensitivity. miRNAFold is implemented as a web server with an intuitive and user-friendly interface, as well as a standalone version. The web server is freely available at: http://EvryRNA.ibisc.univ-evry.fr/miRNAFold. © The Author(s) 2016. Published by Oxford University Press on behalf of Nucleic Acids Research.
Zhu, Shiyou; Li, Wei; Liu, Jingze; Chen, Chen-Hao; Liao, Qi; Xu, Ping; Xu, Han; Xiao, Tengfei; Cao, Zhongzheng; Peng, Jingyu; Yuan, Pengfei; Brown, Myles; Liu, Xiaole Shirley; Wei, Wensheng
2017-01-01
CRISPR/Cas9 screens have been widely adopted to analyse coding gene functions, but high throughput screening of non-coding elements using this method is more challenging, because indels caused by a single cut in non-coding regions are unlikely to produce a functional knockout. A high-throughput method to produce deletions of non-coding DNA is needed. Herein, we report a high throughput genomic deletion strategy to screen for functional long non-coding RNAs (lncRNAs) that is based on a lentiviral paired-guide RNA (pgRNA) library. Applying our screening method, we identified 51 lncRNAs that can positively or negatively regulate human cancer cell growth. We individually validated 9 lncRNAs using CRISPR/Cas9-mediated genomic deletion and functional rescue, CRISPR activation or inhibition, and gene expression profiling. Our high-throughput pgRNA genome deletion method should enable rapid identification of functional mammalian non-coding elements. PMID:27798563
Patel, Isha R.; Gangiredla, Jayanthi; Lacher, David W.; Mammel, Mark K.; Jackson, Scott A.; Lampel, Keith A.
2016-01-01
ABSTRACT Most Escherichia coli strains are nonpathogenic. However, for clinical diagnosis and food safety analysis, current identification methods for pathogenic E. coli either are time-consuming and/or provide limited information. Here, we utilized a custom DNA microarray with informative genetic features extracted from 368 sequence sets for rapid and high-throughput pathogen identification. The FDA Escherichia coli Identification (FDA-ECID) platform contains three sets of molecularly informative features that together stratify strain identification and relatedness. First, 53 known flagellin alleles, 103 alleles of wzx and wzy, and 5 alleles of wzm provide molecular serotyping utility. Second, 41,932 probe sets representing the pan-genome of E. coli provide strain-level gene content information. Third, approximately 125,000 single nucleotide polymorphisms (SNPs) of available whole-genome sequences (WGS) were distilled to 9,984 SNPs capable of recapitulating the E. coli phylogeny. We analyzed 103 diverse E. coli strains with available WGS data, including those associated with past foodborne illnesses, to determine robustness and accuracy. The array was able to accurately identify the molecular O and H serotypes, potentially correcting serological failures and providing better resolution for H-nontypeable/nonmotile phenotypes. In addition, molecular risk assessment was possible with key virulence marker identifications. Epidemiologically, each strain had a unique comparative genomic fingerprint that was extended to an additional 507 food and clinical isolates. Finally, a 99.7% phylogenetic concordance was established between microarray analysis and WGS using SNP-level data for advanced genome typing. Our study demonstrates FDA-ECID as a powerful tool for epidemiology and molecular risk assessment with the capacity to profile the global landscape and diversity of E. coli. 
IMPORTANCE This study describes a robust, state-of-the-art platform developed from available whole-genome sequences of E. coli and Shigella spp. by distilling useful signatures for epidemiology and molecular risk assessment into one assay. The FDA-ECID microarray contains features that enable comprehensive molecular serotyping and virulence profiling along with genome-scale genotyping and SNP analysis. Hence, it is a molecular toolbox that stratifies strain identification and pathogenic potential in the contexts of epidemiology and phylogeny. We applied this tool to strains from food, environmental, and clinical sources, resulting in significantly greater phylogenetic and strain-specific resolution than previously reported for available typing methods. PMID:27037122
Garst, Andrew D; Bassalo, Marcelo C; Pines, Gur; Lynch, Sean A; Halweg-Edwards, Andrea L; Liu, Rongming; Liang, Liya; Wang, Zhiwen; Zeitoun, Ramsey; Alexander, William G; Gill, Ryan T
2017-01-01
Improvements in DNA synthesis and sequencing have underpinned comprehensive assessment of gene function in bacteria and eukaryotes. Genome-wide analyses require high-throughput methods to generate mutations and analyze their phenotypes, but approaches to date have been unable to efficiently link the effects of mutations in coding regions or promoter elements in a highly parallel fashion. We report that CRISPR-Cas9 gene editing in combination with massively parallel oligomer synthesis can enable trackable editing on a genome-wide scale. Our method, CRISPR-enabled trackable genome engineering (CREATE), links each guide RNA to homologous repair cassettes that both edit loci and function as barcodes to track genotype-phenotype relationships. We apply CREATE to site saturation mutagenesis for protein engineering, reconstruction of adaptive laboratory evolution experiments, and identification of stress tolerance and antibiotic resistance genes in bacteria. We provide preliminary evidence that CREATE will work in yeast. We also provide a webtool to design multiplex CREATE libraries.
The UAB Informatics Institute and 2016 CEGS N-GRID de-identification shared task challenge.
Bui, Duy Duc An; Wyatt, Mathew; Cimino, James J
2017-11-01
Clinical narratives (the text notes found in patients' medical records) are important information sources for secondary use in research. However, in order to protect patient privacy, they must be de-identified prior to use. Manual de-identification is considered to be the gold standard approach but is tedious, expensive, slow, and impractical for use with large-scale clinical data. Automated or semi-automated de-identification using computer algorithms is a potentially promising alternative. The Informatics Institute of the University of Alabama at Birmingham is applying de-identification to clinical data drawn from the UAB hospital's electronic medical records system before releasing them for research. We participated in a shared task challenge by the Centers of Excellence in Genomic Science (CEGS) Neuropsychiatric Genome-Scale and RDoC Individualized Domains (N-GRID) at the de-identification regular track to gain experience developing our own automatic de-identification tool. We focused on the popular and successful methods from previous challenges: rule-based, dictionary-matching, and machine-learning approaches. We also explored new techniques such as disambiguation rules and term ambiguity measurement, and used a multi-pass sieve framework at a micro level. For the challenge's primary measure (strict entity), our submissions achieved competitive results (f-measures: 87.3%, 87.1%, and 86.7%). For our preferred measure (binary token HIPAA), our submissions achieved superior results (f-measures: 93.7%, 93.6%, and 93%). With those encouraging results, we gained the confidence to improve and use the tool for the real de-identification task at the UAB Informatics Institute. Copyright © 2017 Elsevier Inc. All rights reserved.
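The f-measures quoted in the abstract above combine precision and recall. As a rough illustration only, the following Python sketch scores a "strict entity" style comparison, assuming that an entity counts as correct only when its start offset, end offset, and type all match the gold annotation exactly. The function name `strict_entity_scores` and the tuple layout are hypothetical and are not taken from the challenge's evaluation software.

```python
def strict_entity_scores(gold, predicted):
    """Return (precision, recall, f1) for exact-match entity comparison.

    gold, predicted: sets of (start, end, entity_type) tuples.
    Assumption: a prediction is a true positive only if it matches a gold
    entity exactly in start, end, and type.
    """
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = 2.0 * precision * recall / (precision + recall)
    return precision, recall, f1


# Toy annotations (invented for illustration, not real clinical data).
gold = {(0, 4, "NAME"), (10, 20, "DATE"), (30, 35, "PHONE")}
predicted = {(0, 4, "NAME"), (10, 20, "DATE"), (40, 45, "NAME")}
precision, recall, f1 = strict_entity_scores(gold, predicted)
```

A token-level measure would apply the same arithmetic to per-token labels rather than whole entities, which is one reason the two reported measures can differ.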
Rapid Identification of Sequences for Orphan Enzymes to Power Accurate Protein Annotation
Ramkissoon, Kevin R.; Miller, Jennifer K.; Ojha, Sunil; Watson, Douglas S.; Bomar, Martha G.; Galande, Amit K.; Shearer, Alexander G.
2013-01-01
The power of genome sequencing depends on the ability to understand what those genes and their protein products actually do. The automated methods used to assign functions to putative proteins in newly sequenced organisms are limited by the size of our library of proteins with both known function and sequence. Unfortunately this library grows slowly, lagging well behind the rapid increase in novel protein sequences produced by modern genome sequencing methods. One potential source for rapidly expanding this functional library is the “back catalog” of enzymology – “orphan enzymes,” those enzymes that have been characterized and yet lack any associated sequence. There are hundreds of orphan enzymes in the Enzyme Commission (EC) database alone. In this study, we demonstrate how this orphan enzyme “back catalog” is a fertile source for rapidly advancing the state of protein annotation. Starting from three orphan enzyme samples, we applied mass-spectrometry based analysis and computational methods (including sequence similarity networks, sequence and structural alignments, and operon context analysis) to rapidly identify the specific sequence for each orphan while avoiding the most time- and labor-intensive aspects of typical sequence identifications. We then used these three new sequences to more accurately predict the catalytic function of 385 previously uncharacterized or misannotated proteins. We expect that this kind of rapid sequence identification could be efficiently applied on a larger scale to make enzymology’s “back catalog” another powerful tool to drive accurate genome annotation. PMID:24386392
Lim, Hansaim; Poleksic, Aleksandar; Yao, Yuan; Tong, Hanghang; He, Di; Zhuang, Luke; Meng, Patrick; Xie, Lei
2016-10-01
Target-based screening is one of the major approaches in drug discovery. Besides the intended target, unexpected drug off-target interactions often occur, and many of them have not been recognized and characterized. The off-target interactions can be responsible for either therapeutic or side effects. Thus, identifying the genome-wide off-targets of lead compounds or existing drugs will be critical for designing effective and safe drugs, and providing new opportunities for drug repurposing. Although many computational methods have been developed to predict drug-target interactions, they are either less accurate than the one that we are proposing here or computationally too intensive, thereby limiting their capability for large-scale off-target identification. In addition, the performance of most machine learning based algorithms has been mainly evaluated on predicting off-target interactions in the same gene family for hundreds of chemicals. It is not clear how these algorithms perform in terms of detecting off-targets across gene families on a proteome scale. Here, we present a fast and accurate off-target prediction method, REMAP, which is based on a dual regularized one-class collaborative filtering algorithm, to explore continuous chemical space, protein space, and their interactome on a large scale. When tested in a reliable, extensive, and cross-gene family benchmark, REMAP outperforms the state-of-the-art methods. Furthermore, REMAP is highly scalable. It can screen a dataset of 200 thousand chemicals against 20 thousand proteins within 2 hours. Using the reconstructed genome-wide target profile as the fingerprint of a chemical compound, we predicted that seven FDA-approved drugs can be repurposed as novel anti-cancer therapies. The anti-cancer activity of six of them is supported by experimental evidence. Thus, REMAP is a valuable addition to the existing in silico toolbox for drug target identification, drug repurposing, phenotypic screening, and side effect prediction. The software and benchmark are available at https://github.com/hansaimlim/REMAP. PMID:27716836
Gu, Deqing; Jian, Xingxing; Zhang, Cheng; Hua, Qiang
2017-01-01
Genome-scale metabolic network models (GEMs) have played important roles in the design of genetically engineered strains and helped biologists to decipher metabolism. However, due to the complex gene-reaction relationships that exist in model systems, most algorithms have limited capabilities with respect to directly predicting accurate genetic design for metabolic engineering. In particular, methods that predict reaction knockout strategies leading to overproduction are often impractical in terms of gene manipulations. Recently, we proposed a method named logical transformation of model (LTM) to simplify the gene-reaction associations by introducing intermediate pseudo reactions, which makes it possible to generate genetic design. Here, we propose an alternative method to relieve researchers from deciphering complex gene-reaction associations by adding pseudo gene controlling reactions. In comparison to LTM, this new method introduces fewer pseudo reactions and generates a much smaller model system named gModel. We showed that gModel allows two seldom reported applications: identification of minimal genomes and design of minimal cell factories within a modified OptKnock framework. In addition, gModel could be used to integrate expression data directly and improve the performance of the E-Fmin method for predicting fluxes. In conclusion, the model transformation procedure will facilitate genetic research based on GEMs, extending their applications.
Pacheco, Luis G C; Mattos-Guaraldi, Ana L; Santos, Carolina S; Veras, Adonney A O; Guimarães, Luis C; Abreu, Vinícius; Pereira, Felipe L; Soares, Siomar C; Dorella, Fernanda A; Carvalho, Alex F; Leal, Carlos G; Figueiredo, Henrique C P; Ramos, Juliana N; Vieira, Veronica V; Farfour, Eric; Guiso, Nicole; Hirata, Raphael; Azevedo, Vasco; Silva, Artur; Ramos, Rommel T J
2015-01-01
Non-diphtheriae Corynebacterium species have been increasingly recognized as the causative agents of infections in humans. Differential identification of these bacteria in the clinical microbiology laboratory by the most commonly used biochemical tests is challenging, and normally requires additional molecular methods. Herein, we present the annotated draft genome sequences of two isolates of "difficult-to-identify" human-pathogenic corynebacterial species: C. xerosis and C. minutissimum. The genome sequences of ca. 2.7 Mbp, with a mean number of 2,580 protein encoding genes, were also compared with the publicly available genome sequences of strains of C. amycolatum and C. striatum. These results will aid the exploration of novel biochemical reactions to improve existing identification tests as well as the development of more accurate molecular identification methods through detection of species-specific target genes for isolate identification or drug susceptibility profiling.
Research progress of plant population genomics based on high-throughput sequencing.
Wang, Yun-sheng
2016-08-01
Population genomics, a new paradigm for population genetics, combines the concepts and techniques of genomics with the theoretical system of population genetics and improves our understanding of microevolution through identification of site-specific effects and genome-wide effects using genome-wide polymorphic site genotyping. With the appearance and improvement of next generation high-throughput sequencing technology, the number of plant species with complete genome sequences has increased rapidly and large scale resequencing has also been carried out in recent years. Parallel sequencing has also been done in some plant species without complete genome sequences. These studies have greatly promoted the development of population genomics and deepened our understanding of the genetic diversity, level of linkage disequilibrium, selection effect, demographic history and molecular mechanism of complex traits of relevant plant populations at a genomic level. In this review, I briefly introduced the concept and research methods of population genomics and summarized the research progress of plant population genomics based on high-throughput sequencing. I also discussed the prospects as well as existing problems of plant population genomics in order to provide references for related studies.
High-throughput screening of a CRISPR/Cas9 library for functional genomics in human cells.
Zhou, Yuexin; Zhu, Shiyou; Cai, Changzu; Yuan, Pengfei; Li, Chunmei; Huang, Yanyi; Wei, Wensheng
2014-05-22
Targeted genome editing technologies are powerful tools for studying biology and disease, and have a broad range of research applications. In contrast to the rapid development of toolkits to manipulate individual genes, large-scale screening methods based on the complete loss of gene expression are only now beginning to be developed. Here we report the development of a focused CRISPR/Cas-based (clustered regularly interspaced short palindromic repeats/CRISPR-associated) lentiviral library in human cells and a method of gene identification based on functional screening and high-throughput sequencing analysis. Using knockout library screens, we successfully identified the host genes essential for the intoxication of cells by anthrax and diphtheria toxins, which were confirmed by functional validation. The broad application of this powerful genetic screening strategy will not only facilitate the rapid identification of genes important for bacterial toxicity but will also enable the discovery of genes that participate in other biological processes.
Carr, Ian M; Morgan, Joanne; Watson, Christopher; Melnik, Svitlana; Diggle, Christine P; Logan, Clare V; Harrison, Sally M; Taylor, Graham R; Pena, Sergio D J; Markham, Alexander F; Alkuraya, Fowzan S; Black, Graeme C M; Ali, Manir; Bonthron, David T
2013-07-01
Massively parallel ("next generation") DNA sequencing (NGS) has quickly become the method of choice for seeking pathogenic mutations in rare uncharacterized monogenic diseases. Typically, before DNA sequencing, protein-coding regions are enriched from patient genomic DNA, representing either the entire genome ("exome sequencing") or selected mapped candidate loci. Sequence variants, identified as differences between the patient's and the human genome reference sequences, are then filtered according to various quality parameters. Changes are screened against datasets of known polymorphisms, such as dbSNP and the 1000 Genomes Project, in the effort to narrow the list of candidate causative variants. An increasing number of commercial services now offer to both generate and align NGS data to a reference genome. This potentially allows small groups with limited computing infrastructure and informatics skills to utilize this technology. However, the capability to effectively filter and assess sequence variants is still an important bottleneck in the identification of deleterious sequence variants in both research and diagnostic settings. We have developed an approach to this problem comprising a user-friendly suite of programs that can interactively analyze, filter and screen data from enrichment-capture NGS data. These programs ("Agile Suite") are particularly suitable for small-scale gene discovery or for diagnostic analysis. © 2013 WILEY PERIODICALS, INC.
Lin, Michael F.; Deoras, Ameya N.; Rasmussen, Matthew D.; Kellis, Manolis
2008-01-01
Comparative genomics of multiple related species is a powerful methodology for the discovery of functional genomic elements, and its power should increase with the number of species compared. Here, we use 12 Drosophila genomes to study the power of comparative genomics metrics to distinguish between protein-coding and non-coding regions. First, we study the relative power of different comparative metrics and their relationship to single-species metrics. We find that even relatively simple multi-species metrics robustly outperform advanced single-species metrics, especially for shorter exons (≤240 nt), which are common in animal genomes. Moreover, the two capture largely independent features of protein-coding genes, with different sensitivity/specificity trade-offs, such that their combinations lead to even greater discriminatory power. In addition, we study how discovery power scales with the number and phylogenetic distance of the genomes compared. We find that species at a broad range of distances are comparably effective informants for pairwise comparative gene identification, but that these are surpassed by multi-species comparisons at similar evolutionary divergence. In particular, while pairwise discovery power plateaued at larger distances and never outperformed the most advanced single-species metrics, multi-species comparisons continued to benefit even from the most distant species with no apparent saturation. Last, we find that genes in functional categories typically considered fast-evolving can nonetheless be recovered at very high rates using comparative methods. Our results have implications for comparative genomics analyses in any species, including the human. PMID:18421375
OrthoMCL: Identification of Ortholog Groups for Eukaryotic Genomes
Li, Li; Stoeckert, Christian J.; Roos, David S.
2003-01-01
The identification of orthologous groups is useful for genome annotation, studies on gene/protein evolution, comparative genomics, and the identification of taxonomically restricted sequences. Methods successfully exploited for prokaryotic genome analysis have proved difficult to apply to eukaryotes, however, as larger genomes may contain multiple paralogous genes, and sequence information is often incomplete. OrthoMCL provides a scalable method for constructing orthologous groups across multiple eukaryotic taxa, using a Markov Cluster algorithm to group (putative) orthologs and paralogs. This method performs similarly to the INPARANOID algorithm when applied to two genomes, but can be extended to cluster orthologs from multiple species. OrthoMCL clusters are coherent with groups identified by EGO, but improved recognition of “recent” paralogs permits overlapping EGO groups representing the same gene to be merged. Comparison with previously assigned EC annotations suggests a high degree of reliability, implying utility for automated eukaryotic genome annotation. OrthoMCL has been applied to the proteome data set from seven publicly available genomes (human, fly, worm, yeast, Arabidopsis, the malaria parasite Plasmodium falciparum, and Escherichia coli). A Web interface allows queries based on individual genes or user-defined phylogenetic patterns (http://www.cbil.upenn.edu/gene-family). Analysis of clusters incorporating P. falciparum genes identifies numerous enzymes that were incompletely annotated in first-pass annotation of the parasite genome. PMID:12952885
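The abstract above mentions a Markov Cluster algorithm for grouping sequences. As a minimal sketch of that general idea, and not of the OrthoMCL pipeline itself, the following pure-Python code alternates the two usual Markov clustering steps (expansion by matrix squaring, inflation by element-wise powering and column renormalization) on a small similarity graph. The function names, the default inflation value of 2.0, the fixed self-loop weight, and the cluster read-out threshold are all assumptions made for illustration; real tools work on weighted similarity scores and sparse matrices.

```python
def _normalize_columns(matrix):
    """Scale every column so that it sums to 1 (column-stochastic)."""
    n = len(matrix)
    sums = [sum(matrix[i][j] for i in range(n)) for j in range(n)]
    return [[matrix[i][j] / sums[j] if sums[j] else 0.0 for j in range(n)]
            for i in range(n)]


def _matmul(a, b):
    """Plain dense matrix product for small square matrices."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def markov_cluster(adjacency, inflation=2.0, max_iter=100, tol=1e-9):
    """Cluster nodes of an undirected graph given as a square 0/1 matrix."""
    n = len(adjacency)
    # Add self-loops (a common convention) and make columns stochastic.
    m = [[float(adjacency[i][j]) + (1.0 if i == j else 0.0) for j in range(n)]
         for i in range(n)]
    m = _normalize_columns(m)
    for _ in range(max_iter):
        expanded = _matmul(m, m)  # expansion: flow spreads along paths
        inflated = _normalize_columns(
            [[value ** inflation for value in row] for row in expanded]
        )  # inflation: strong flows are boosted, weak flows decay
        delta = max(abs(inflated[i][j] - m[i][j])
                    for i in range(n) for j in range(n))
        m = inflated
        if delta < tol:
            break
    # Read out clusters: each row that still carries flow lists one cluster.
    clusters = set()
    for i in range(n):
        members = frozenset(j for j in range(n) if m[i][j] > 1e-6)
        if members:
            clusters.add(members)
    return clusters


# Toy graph: two separate triangles (nodes 0-2 and nodes 3-5).
toy_graph = [
    [0, 1, 1, 0, 0, 0],
    [1, 0, 1, 0, 0, 0],
    [1, 1, 0, 0, 0, 0],
    [0, 0, 0, 0, 1, 1],
    [0, 0, 0, 1, 0, 1],
    [0, 0, 0, 1, 1, 0],
]
toy_clusters = markov_cluster(toy_graph)
```

Raising the inflation parameter generally yields finer clusters, which is the main tuning knob when such a method is used to separate recent paralogs from orthologs.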
Resources for Functional Genomics Studies in Drosophila melanogaster
Mohr, Stephanie E.; Hu, Yanhui; Kim, Kevin; Housden, Benjamin E.; Perrimon, Norbert
2014-01-01
Drosophila melanogaster has become a system of choice for functional genomic studies. Many resources, including online databases and software tools, are now available to support design or identification of relevant fly stocks and reagents or analysis and mining of existing functional genomic, transcriptomic, proteomic, etc. datasets. These include large community collections of fly stocks and plasmid clones, “meta” information sites like FlyBase and FlyMine, and an increasing number of more specialized reagents, databases, and online tools. Here, we introduce key resources useful to plan large-scale functional genomics studies in Drosophila and to analyze, integrate, and mine the results of those studies in ways that facilitate identification of highest-confidence results and generation of new hypotheses. We also discuss ways in which existing resources can be used and might be improved and suggest a few areas of future development that would further support large- and small-scale studies in Drosophila and facilitate use of Drosophila information by the research community more generally. PMID:24653003
Badotti, Fernanda; de Oliveira, Francislon Silva; Garcia, Cleverson Fernando; Vaz, Aline Bruna Martins; Fonseca, Paula Luize Camargos; Nahum, Laila Alves; Oliveira, Guilherme; Góes-Neto, Aristóteles
2017-02-23
Fungi are among the most abundant and diverse organisms on Earth. However, a substantial amount of the species diversity, relationships, habitats, and life strategies of these microorganisms remains to be discovered and characterized. One important factor hindering progress is the difficulty in correctly identifying fungi. Morphological and molecular characteristics have been applied in such tasks. More recently, DNA barcoding has emerged as a new method for the rapid and reliable identification of species. The nrITS region is considered the universal barcode of Fungi, and the ITS1 and ITS2 sub-regions have been applied as metabarcoding markers. In this study, we performed a large-scale analysis of all the available Basidiomycota sequences from GenBank. We carried out a rigorous trimming of the initial dataset based on methodological principles of DNA barcoding. Two different approaches (PCI and barcode gap) were used to determine the performance of the complete ITS region and sub-regions. For most of the Basidiomycota genera, the three genomic markers performed similarly, i.e., when one was considered a good marker for the identification of a genus, the others were also; the same results were observed when the performance was insufficient. However, based on barcode gap analyses, we identified genomic markers that had identification performance superior to the others and genomic markers that were not indicated for the identification of some genera. Notably, neither the complete ITS nor the sub-regions were useful in identifying 11 of the 113 Basidiomycota genera. The complex phylogenetic relationships and the presence of cryptic species in some genera are possible explanations of this limitation and are discussed. Knowledge regarding the efficiency and limitations of the barcode markers that are currently used for the identification of organisms is crucial because it benefits research in many areas. Our study provides information that may guide researchers in choosing the most suitable genomic markers for identifying Basidiomycota species.
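To make the "barcode gap" criterion mentioned above concrete, here is a small Python sketch under one commonly used definition, assumed here and not confirmed as the authors' exact procedure: the gap is the smallest between-species distance minus the largest within-species distance, so a positive value suggests the marker separates the species. The function name `barcode_gap`, the input layout, and the distances are hypothetical.

```python
def barcode_gap(pairwise):
    """Return min(interspecific distance) - max(intraspecific distance).

    pairwise: iterable of (species_a, species_b, distance) tuples, one per
    compared pair of sequences. A positive result means every between-species
    distance exceeds every within-species distance (a barcode gap exists).
    """
    intra = [d for a, b, d in pairwise if a == b]
    inter = [d for a, b, d in pairwise if a != b]
    if not intra or not inter:
        raise ValueError("need both within- and between-species comparisons")
    return min(inter) - max(intra)


# Invented distances for two hypothetical species, A and B.
pairs = [
    ("A", "A", 0.01),
    ("A", "A", 0.02),
    ("B", "B", 0.015),
    ("A", "B", 0.08),
    ("A", "B", 0.10),
]
gap = barcode_gap(pairs)  # positive here, so the toy marker separates A and B
```

Running the same function per genus and per marker (complete ITS, ITS1, ITS2) would give a simple table of which marker shows a gap where.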
Yang, Jun-Bo; Li, De-Zhu; Li, Hong-Tao
2014-09-01
Chloroplast genomes supply indispensable information that helps improve phylogenetic resolution and can even serve as organelle-scale barcodes. Next-generation sequencing technologies have helped promote sequencing of complete chloroplast genomes, but compared with the number of angiosperms, relatively few chloroplast genomes have been sequenced. There are two major reasons for the paucity of completely sequenced chloroplast genomes: (i) massive amounts of fresh leaves are needed for chloroplast sequencing and (ii) there are considerable gaps in the sequenced chloroplast genomes of many plants because of the difficulty of isolating high-quality chloroplast DNA, preventing complete chloroplast genomes from being assembled. To overcome these obstacles, all known angiosperm chloroplast genomes available to date were analysed, and then we designed nine universal primer pairs corresponding to the highly conserved regions. Using these primers, angiosperm whole chloroplast genomes can be amplified using long-range PCR and sequenced using next-generation sequencing methods. The primers showed high universality, which was tested using 24 species representing major clades of angiosperms. To validate the functionality of the primers, eight species representing major groups of angiosperms, that is, early-diverging angiosperms, magnoliids, monocots, Saxifragales, fabids, malvids and asterids, were sequenced and their complete chloroplast genomes were assembled. In our trials, only 100 mg of fresh leaves was used. The results show that the universal primer set provided an easy, effective and feasible approach for sequencing whole chloroplast genomes in angiosperms. The designed universal primer pairs provide a possibility to accelerate genome-scale data acquisition and will therefore magnify the phylogenetic resolution and species identification in angiosperms. © 2014 John Wiley & Sons Ltd.
Liu, Jun-Jun; Xiang, Yu
2011-01-01
WRKY transcription factors are key regulators of numerous biological processes in plant growth and development, as well as plant responses to abiotic and biotic stresses. Research on biological functions of plant WRKY genes has focused in the past on model plant species or species with largely characterized transcriptomes. However, a variety of non-model plants, such as forest conifers, are essential as feed, biofuel, and wood or for sustainable ecosystems. Identification of WRKY genes in these non-model plants is equally important for understanding the evolutionary and function-adaptive processes of this transcription factor family. Because of limited genomic information, the rarity of regulatory gene mRNAs in transcriptomes, and the sequence divergence to model organism genes, identification of transcription factors in non-model plants using methods similar to those generally used for model plants is difficult. This chapter describes a gene family discovery strategy for identification of WRKY transcription factors in conifers by a combination of in silico-based prediction and PCR-based experimental approaches. Compared to traditional cDNA library screening or EST sequencing at transcriptome scales, this integrated gene discovery strategy provides fast, simple, reliable, and specific methods to unveil the WRKY gene family at both genome and transcriptome levels in non-model plants.
Experimental annotation of the human genome using microarray technology.
Shoemaker, D D; Schadt, E E; Armour, C D; He, Y D; Garrett-Engele, P; McDonagh, P D; Loerch, P M; Leonardson, A; Lum, P Y; Cavet, G; Wu, L F; Altschuler, S J; Edwards, S; King, J; Tsang, J S; Schimmack, G; Schelter, J M; Koch, J; Ziman, M; Marton, M J; Li, B; Cundiff, P; Ward, T; Castle, J; Krolewski, M; Meyer, M R; Mao, M; Burchard, J; Kidd, M J; Dai, H; Phillips, J W; Linsley, P S; Stoughton, R; Scherer, S; Boguski, M S
2001-02-15
The most important product of the sequencing of a genome is a complete, accurate catalogue of genes and their products, primarily messenger RNA transcripts and their cognate proteins. Such a catalogue cannot be constructed by computational annotation alone; it requires experimental validation on a genome scale. Using 'exon' and 'tiling' arrays fabricated by ink-jet oligonucleotide synthesis, we devised an experimental approach to validate and refine computational gene predictions and define full-length transcripts on the basis of co-regulated expression of their exons. These methods can provide more accurate gene numbers and allow the detection of mRNA splice variants and identification of the tissue- and disease-specific conditions under which genes are expressed. We apply our technique to chromosome 22q under 69 experimental condition pairs, and to the entire human genome under two experimental conditions. We discuss implications for more comprehensive, consistent and reliable genome annotation, more efficient, full-length complementary DNA cloning strategies and application to complex diseases.
A Glance at Microsatellite Motifs from 454 Sequencing Reads of Watermelon Genomic DNA
USDA-ARS's Scientific Manuscript database
A single 454 (Life Sciences Sequencing Technology) run of Charleston Gray watermelon (Citrullus lanatus var. lanatus) genomic DNA was performed and sequence data were assembled. A large scale identification of simple sequence repeat (SSR) was performed and SSR sequence data were used for the develo...
Hertveldt, Kirsten; Beliën, Tim; Volckaert, Guido
2009-01-01
In M13 phage display, proteins and peptides are exposed on one of the surface proteins of filamentous phage particles and become accessible to affinity enrichment against a bait of interest. We describe the construction of fragmented whole genome and gene fragment phage display libraries and interaction selection by panning. This strategy allows the identification and characterization of interacting proteins on a genomic scale by screening the fragmented "proteome" against protein baits. Gene fragment libraries allow a more in depth characterization of the protein-protein interaction site by identification of the protein region involved in the interaction.
Identification and Characterization of Genomic Amplifications in Ovarian Serous Carcinoma
2009-07-01
oncogenes, Rsf1 and Notch3, which were up-regulated in both genomic DNA and transcript levels in ovarian cancer. In a large- scale FISH analysis, Rsf1...associated with worse disease outcome, suggesting that Rsf1 could be potentially used as a prognostic marker in the future (Appendix #1). For the...over- expressed in a recurrent carcinoma. Although the follow-up study in a larger- scale sample size did not demonstrate clear amplification in NAC1
Kim, Heon Seok; Lee, Kyungjin; Bae, Sangsu; Park, Jeongbin; Lee, Chong-Kyo; Kim, Meehyein; Kim, Eunji; Kim, Minju; Kim, Seokjoong; Kim, Chonsaeng; Kim, Jin-Soo
2017-06-23
Several groups have used genome-wide libraries of lentiviruses encoding small guide RNAs (sgRNAs) for genetic screens. In most cases, sgRNA expression cassettes are integrated into cells by using lentiviruses, and target genes are statistically estimated by the readout of sgRNA sequences after targeted sequencing. We present a new virus-free method for human gene knockout screens using a genome-wide library of CRISPR/Cas9 sgRNAs based on plasmids and target gene identification via whole-genome sequencing (WGS) confirmation of authentic mutations rather than statistical estimation through targeted amplicon sequencing. We used 30,840 pairs of individually synthesized oligonucleotides to construct the genome-scale sgRNA library, collectively targeting 10,280 human genes (i.e., three sgRNAs per gene). These plasmid libraries were co-transfected with a Cas9-expression plasmid into human cells, which were then treated with cytotoxic drugs or viruses. Only cells lacking key factors essential for cytotoxic drug metabolism or viral infection were able to survive. Genomic DNA isolated from cells that survived these challenges was subjected to WGS to directly identify CRISPR/Cas9-mediated causal mutations essential for cell survival. With this approach, we were able to identify known and novel genes essential for viral infection in human cells. We propose that genome-wide sgRNA screens based on plasmids coupled with WGS are powerful tools for forward genetics studies and drug target discovery. © 2017 by The American Society for Biochemistry and Molecular Biology, Inc.
NASA Astrophysics Data System (ADS)
Liu, Hongna; Li, Song; Wang, Zhifei; Li, Zhiyang; Deng, Yan; Wang, Hua; Shi, Zhiyang; He, Nongyue
2008-11-01
Single nucleotide polymorphisms (SNPs) comprise the most abundant source of genetic variation in the human genome. Therefore, large-scale identification of codominant SNPs, especially those associated with complex diseases, has created the need for a fully high-throughput and automated SNP genotyping method. Herein, we present an automated detection system for SNPs based on two kinds of functional magnetic nanoparticles (MNPs) and dual-color hybridization. The amido-modified MNPs (NH2-MNPs) modified with APTES were used to extract DNA directly from whole blood by electrostatic reaction, and the subsequent PCR was performed successfully. Furthermore, biotinylated PCR products were captured on the streptavidin-coated MNPs (SA-MNPs) and interrogated by hybridization with a pair of dual-color probes to determine the SNP; the genotype of each sample can then be simultaneously identified by scanning the microarray printed with the denatured fluorescent probes. This system provided a rapid, sensitive and highly versatile automated procedure that will greatly facilitate the analysis of different known SNPs in the human genome.
Motivation: As cancer genomics initiatives move toward comprehensive identification of genetic alterations in cancer, attention is now turning to understanding how interactions among these genes lead to the acquisition of tumor hallmarks. Emerging pharmacological and clinical data suggest a highly promising role of cancer-specific protein-protein interactions (PPIs) as druggable cancer targets. However, large-scale experimental identification of cancer-related PPIs remains challenging, and currently available resources to explore oncogenic PPI networks are limited.
Flanagan, Keith; Cockell, Simon; Harwood, Colin; Hallinan, Jennifer; Nakjang, Sirintra; Lawry, Beth; Wipat, Anil
2014-06-30
The rapid and cost-effective identification of bacterial species is crucial, especially for clinical diagnosis and treatment. Peptide aptamers have been shown to be valuable for use as a component of novel, direct detection methods. These small peptides have a number of advantages over antibodies, including greater specificity and longer shelf life. These properties facilitate their use as the detector components of biosensor devices. However, the identification of suitable aptamer targets for particular groups of organisms is challenging. We present a semi-automated processing pipeline for the identification of candidate aptamer targets from whole bacterial genome sequences. The pipeline can be configured to search for protein sequence fragments that uniquely identify a set of strains of interest. The system is also capable of identifying additional organisms that may be of interest due to their possession of protein fragments in common with the initial set. Through the use of Cloud computing technology and distributed databases, our system is capable of scaling with the rapidly growing genome repositories, and consequently of keeping the resulting data sets up-to-date. The system described is also more generically applicable to the discovery of specific targets for other diagnostic approaches such as DNA probes, PCR primers and antibodies.
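The core search described in this abstract, finding protein sequence fragments that uniquely identify a set of strains, can be illustrated with a small set-based sketch. This is not the authors' pipeline: the function names, the fixed fragment length `k`, and the in-memory dictionaries are assumptions made purely for illustration, and a real system would work against distributed genome databases.

```python
def kmers(seq, k):
    """Return the set of all length-k fragments of a protein sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def strain_kmers(proteins, k):
    """Union of length-k fragments over every protein of one strain."""
    fragments = set()
    for protein in proteins:
        fragments |= kmers(protein, k)
    return fragments


def unique_fragments(targets, background, k):
    """Fragments present in every target strain and in no background strain.

    targets, background: dict mapping strain name -> list of protein sequences
    (hypothetical input layout chosen for this sketch).
    """
    shared = None
    for proteins in targets.values():
        fragments = strain_kmers(proteins, k)
        shared = fragments if shared is None else shared & fragments
    shared = shared or set()
    for proteins in background.values():
        shared -= strain_kmers(proteins, k)
    return shared
```

Calling `unique_fragments(targets, background, k)` with the strains of interest as `targets` and all other organisms as `background` returns candidate fragments. Relaxing the background filter would, by the same logic, surface additional organisms that share fragments with the initial set.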
Flanagan, Keith; Cockell, Simon; Harwood, Colin; Hallinan, Jennifer; Nakjang, Sirintra; Lawry, Beth; Wipat, Anil
2014-06-01
The rapid and cost-effective identification of bacterial species is crucial, especially for clinical diagnosis and treatment. Peptide aptamers have been shown to be valuable for use as a component of novel, direct detection methods. These small peptides have a number of advantages over antibodies, including greater specificity and longer shelf life. These properties facilitate their use as the detector components of biosensor devices. However, the identification of suitable aptamer targets for particular groups of organisms is challenging. We present a semi-automated processing pipeline for the identification of candidate aptamer targets from whole bacterial genome sequences. The pipeline can be configured to search for protein sequence fragments that uniquely identify a set of strains of interest. The system is also capable of identifying additional organisms that may be of interest due to their possession of protein fragments in common with the initial set. Through the use of Cloud computing technology and distributed databases, our system is capable of scaling with the rapidly growing genome repositories, and consequently of keeping the resulting data sets up-to-date. The system described is also more generically applicable to the discovery of specific targets for other diagnostic approaches such as DNA probes, PCR primers and antibodies.
Constructing an integrated gene similarity network for the identification of disease genes.
Tian, Zhen; Guo, Maozu; Wang, Chunyu; Xing, LinLin; Wang, Lei; Zhang, Yin
2017-09-20
Discovering novel genes that are involved in human diseases is a challenging task in biomedical research. In recent years, several computational approaches have been proposed to prioritize candidate disease genes. Most of these methods are mainly based on protein-protein interaction (PPI) networks. However, since these PPI networks contain false positives and cover less than half of known human genes, their reliability and coverage are very low. Therefore, it is highly necessary to fuse multiple genomic data to construct a credible gene similarity network and then infer disease genes on the whole genomic scale. We propose a novel method, named RWRB, to infer causal genes of diseases of interest. First, we construct five individual gene (protein) similarity networks based on multiple genomic data of human genes. Then, an integrated gene similarity network (IGSN) is reconstructed based on the similarity network fusion (SNF) method. Finally, we employ the random walk with restart algorithm on the phenotype-gene bilayer network, which combines the phenotype similarity network, the IGSN and the phenotype-gene association network, to prioritize candidate disease genes. We investigate the effectiveness of RWRB through leave-one-out cross-validation methods in inferring phenotype-gene relationships. Results show that RWRB is more accurate than state-of-the-art methods on most evaluation metrics. Further analysis shows that the success of RWRB benefits from the IGSN, which has wider coverage and higher reliability than current PPI networks. Moreover, we conduct a comprehensive case study for Alzheimer's disease and predict some novel disease genes that are supported by the literature. RWRB is an effective and reliable algorithm in prioritizing candidate disease genes on the genomic scale. Software and supplementary information are available at http://nclab.hit.edu.cn/~tianzhen/RWRB/ .
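The random walk with restart step mentioned in this abstract can be sketched on a single weighted network. This is a simplified illustration, not the RWRB implementation: RWRB walks on a phenotype-gene bilayer network, whereas the sketch below uses one layer, and the function name, the `restart` value and the dictionary-based graph layout are all assumptions.

```python
def random_walk_with_restart(adj, seeds, restart=0.7, tol=1e-10, max_iter=1000):
    """Score nodes by proximity to seed nodes (hypothetical helper).

    adj: dict mapping node -> dict of neighbour -> edge weight (symmetric).
    seeds: set of known disease genes from which the walker restarts.
    """
    nodes = list(adj)
    # Total edge weight leaving each node, used to normalise transitions.
    out_weight = {u: sum(adj[u].values()) for u in nodes}
    # Restart distribution: uniform over the seed nodes.
    p0 = {u: (1.0 / len(seeds) if u in seeds else 0.0) for u in nodes}
    p = dict(p0)
    for _ in range(max_iter):
        nxt = {u: restart * p0[u] for u in nodes}
        for u in nodes:
            if out_weight[u] == 0:
                continue  # isolated node: nothing to propagate
            for v, w in adj[u].items():
                nxt[v] += (1.0 - restart) * p[u] * w / out_weight[u]
        delta = sum(abs(nxt[u] - p[u]) for u in nodes)
        p = nxt
        if delta < tol:
            break
    return p
```

Ranking non-seed genes by their score in the returned dictionary gives a candidate list; genes closer to the seeds in the network receive higher scores.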
Pers, Tune H; Hansen, Niclas Tue; Lage, Kasper; Koefoed, Pernille; Dworzynski, Piotr; Miller, Martin Lee; Flint, Tracey J; Mellerup, Erling; Dam, Henrik; Andreassen, Ole A; Djurovic, Srdjan; Melle, Ingrid; Børglum, Anders D; Werge, Thomas; Purcell, Shaun; Ferreira, Manuel A; Kouskoumvekaki, Irene; Workman, Christopher T; Hansen, Torben; Mors, Ole; Brunak, Søren
2011-07-01
Meta-analyses of large-scale association studies typically proceed solely within one data type and do not exploit the potential complementarities in other sources of molecular evidence. Here, we present an approach to combine heterogeneous data from genome-wide association (GWA) studies, protein-protein interaction screens, disease similarity, linkage studies, and gene expression experiments into a multi-layered evidence network which is used to prioritize the entire protein-coding part of the genome identifying a shortlist of candidate genes. We report specifically results on bipolar disorder, a genetically complex disease where GWA studies have only been moderately successful. We validate one such candidate experimentally, YWHAH, by genotyping five variations in 640 patients and 1,377 controls. We found a significant allelic association for the rs1049583 polymorphism in YWHAH (adjusted P = 5.6e-3) with an odds ratio of 1.28 [1.12-1.48], which replicates a previous case-control study. In addition, we demonstrate our approach's general applicability by use of type 2 diabetes data sets. The method presented augments moderately powered GWA data, and represents a validated, flexible, and publicly available framework for identifying risk genes in highly polygenic diseases. The method is made available as a web service at www.cbs.dtu.dk/services/metaranker. © 2011 Wiley-Liss, Inc.
Yu, Hua; Jiao, Bingke; Lu, Lu; Wang, Pengfei; Chen, Shuangcheng; Liang, Chengzhi; Liu, Wei
2018-01-01
Accurately reconstructing a gene co-expression network is of great importance for uncovering the genetic architecture underlying complex and varied phenotypes. The recent availability of high-throughput RNA-seq sequencing has made genome-wide detection and quantification of novel, rare and low-abundance transcripts practical. However, its potential merits in reconstructing gene co-expression networks have still not been well explored. Using massive-scale RNA-seq samples, we have designed an ensemble pipeline, called NetMiner, for building a genome-scale and high-quality Gene Co-expression Network (GCN) by integrating three frequently used inference algorithms. We constructed an RNA-seq-based GCN in one species of monocot rice. The quality of the network obtained by our method was verified and evaluated against curated gene functional association data sets, and it clearly outperformed each single method. In addition, the powerful capability of the network for associating genes with functions and agronomic traits was shown by enrichment analysis and case studies. In particular, we demonstrated the potential value of our proposed method to predict the biological roles of unknown protein-coding genes, long non-coding RNA (lncRNA) genes and circular RNA (circRNA) genes. Our results provided a valuable and highly reliable data source to select key candidate genes for subsequent experimental validation. To facilitate identification of novel genes regulating important biological processes and phenotypes in other plants or animals, we have published the source code of NetMiner, making it freely available at https://github.com/czllab/NetMiner.
GUIDE-Seq enables genome-wide profiling of off-target cleavage by CRISPR-Cas nucleases
Nguyen, Nhu T.; Liebers, Matthew; Topkar, Ved V.; Thapar, Vishal; Wyvekens, Nicolas; Khayter, Cyd; Iafrate, A. John; Le, Long P.; Aryee, Martin J.; Joung, J. Keith
2014-01-01
CRISPR RNA-guided nucleases (RGNs) are widely used genome-editing reagents, but methods to delineate their genome-wide off-target cleavage activities have been lacking. Here we describe an approach for global detection of DNA double-stranded breaks (DSBs) introduced by RGNs and potentially other nucleases. This method, called Genome-wide Unbiased Identification of DSBs Enabled by Sequencing (GUIDE-Seq), relies on capture of double-stranded oligodeoxynucleotides into breaks. Application of GUIDE-Seq to thirteen RGNs in two human cell lines revealed wide variability in RGN off-target activities and unappreciated characteristics of off-target sequences. The majority of identified sites were not detected by existing computational methods or ChIP-Seq. GUIDE-Seq also identified RGN-independent genomic breakpoint ‘hotspots’. Finally, GUIDE-Seq revealed that truncated guide RNAs exhibit substantially reduced RGN-induced off-target DSBs. Our experiments define the most rigorous framework for genome-wide identification of RGN off-target effects to date and provide a method for evaluating the safety of these nucleases prior to clinical use. PMID:25513782
Jakupciak, John P; Wells, Jeffrey M; Karalus, Richard J; Pawlowski, David R; Lin, Jeffrey S; Feldman, Andrew B
2013-01-01
Large-scale genomics projects are identifying biomarkers to detect human disease. B. pseudomallei and B. mallei are two closely related select agents that cause melioidosis and glanders. Accurate characterization of metagenomic samples is dependent on accurate measurements of genetic variation between isolates with resolution down to strain level. Often single biomarker sensitivity is augmented by use of multiple or panels of biomarkers. In parallel with single biomarker validation, advances in DNA sequencing enable analysis of entire genomes in a single run: population-sequencing. Potentially, direct sequencing could be used to analyze an entire genome to serve as the biomarker for genome identification. However, genome variation and population diversity complicate use of direct sequencing, as well as differences caused by sample preparation protocols including sequencing artifacts and mistakes. As part of a Department of Homeland Security program in bacterial forensics, we examined how to implement whole genome sequencing (WGS) analysis as a judicially defensible forensic method for attributing microbial sample relatedness; and also to determine the strengths and limitations of whole genome sequence analysis in a forensics context. Herein, we demonstrate use of sequencing to provide genetic characterization of populations: direct sequencing of populations.
Jakupciak, John P.; Wells, Jeffrey M.; Karalus, Richard J.; Pawlowski, David R.; Lin, Jeffrey S.; Feldman, Andrew B.
2013-01-01
Large-scale genomics projects are identifying biomarkers to detect human disease. B. pseudomallei and B. mallei are two closely related select agents that cause melioidosis and glanders. Accurate characterization of metagenomic samples is dependent on accurate measurements of genetic variation between isolates with resolution down to strain level. Often single biomarker sensitivity is augmented by use of multiple or panels of biomarkers. In parallel with single biomarker validation, advances in DNA sequencing enable analysis of entire genomes in a single run: population-sequencing. Potentially, direct sequencing could be used to analyze an entire genome to serve as the biomarker for genome identification. However, genome variation and population diversity complicate use of direct sequencing, as well as differences caused by sample preparation protocols including sequencing artifacts and mistakes. As part of a Department of Homeland Security program in bacterial forensics, we examined how to implement whole genome sequencing (WGS) analysis as a judicially defensible forensic method for attributing microbial sample relatedness; and also to determine the strengths and limitations of whole genome sequence analysis in a forensics context. Herein, we demonstrate use of sequencing to provide genetic characterization of populations: direct sequencing of populations. PMID:24455204
Genome-scale modeling of human metabolism - a systems biology approach.
Mardinoglu, Adil; Gatto, Francesco; Nielsen, Jens
2013-09-01
Altered metabolism is linked to the appearance of various human diseases and a better understanding of disease-associated metabolic changes may lead to the identification of novel prognostic biomarkers and the development of new therapies. Genome-scale metabolic models (GEMs) have been employed for studying human metabolism in a systematic manner, as well as for understanding complex human diseases. In the past decade, such metabolic models - one of the fundamental aspects of systems biology - have started contributing to the understanding of the mechanistic relationship between genotype and phenotype. In this review, we focus on the construction of the Human Metabolic Reaction database, the generation of healthy cell type- and cancer-specific GEMs using different procedures, and the potential applications of these developments in the study of human metabolism and in the identification of metabolic changes associated with various disorders. We further examine how in silico genome-scale reconstructions can be employed to simulate metabolic flux distributions and how high-throughput omics data can be analyzed in a context-dependent fashion. Insights yielded from this mechanistic modeling approach can be used for identifying new therapeutic agents and drug targets as well as for the discovery of novel biomarkers. Finally, recent advancements in genome-scale modeling and the future challenge of developing a model of whole-body metabolism are presented. The emergent contribution of GEMs to personalized and translational medicine is also discussed. Copyright © 2013 WILEY-VCH Verlag GmbH & Co. KGaA, Weinheim.
Complementary approaches to diagnosing marine diseases: a union of the modern and the classic
Burge, Colleen A.; Friedman, Carolyn S.; Getchell, Rodman; House, Marcia; Mydlarz, Laura D.; Prager, Katherine C.; Renault, Tristan; Kiryu, Ikunari; Vega-Thurber, Rebecca
2016-01-01
Linking marine epizootics to a specific aetiology is notoriously difficult. Recent diagnostic successes show that marine disease diagnosis requires both modern, cutting-edge technology (e.g. metagenomics, quantitative real-time PCR) and more classic methods (e.g. transect surveys, histopathology and cell culture). Here, we discuss how this combination of traditional and modern approaches is necessary for rapid and accurate identification of marine diseases, and emphasize how sole reliance on any one technology or technique may lead disease investigations astray. We present diagnostic approaches at different scales, from the macro (environment, community, population and organismal scales) to the micro (tissue, organ, cell and genomic scales). We use disease case studies from a broad range of taxa to illustrate diagnostic successes from combining traditional and modern diagnostic methods. Finally, we recognize the need for increased capacity of centralized databases, networks, data repositories and contingency plans for diagnosis and management of marine disease. PMID:26880839
Complementary approaches to diagnosing marine diseases: a union of the modern and the classic
Burge, Colleen A.; Friedman, Carolyn S.; Getchell, Rodman G.; House, Marcia; Lafferty, Kevin D.; Mydlarz, Laura D.; Prager, Katherine C.; Sutherland, Kathryn P.; Renault, Tristan; Kiryu, Ikunari; Vega-Thurber, Rebecca
2016-01-01
Linking marine epizootics to a specific aetiology is notoriously difficult. Recent diagnostic successes show that marine disease diagnosis requires both modern, cutting-edge technology (e.g. metagenomics, quantitative real-time PCR) and more classic methods (e.g. transect surveys, histopathology and cell culture). Here, we discuss how this combination of traditional and modern approaches is necessary for rapid and accurate identification of marine diseases, and emphasize how sole reliance on any one technology or technique may lead disease investigations astray. We present diagnostic approaches at different scales, from the macro (environment, community, population and organismal scales) to the micro (tissue, organ, cell and genomic scales). We use disease case studies from a broad range of taxa to illustrate diagnostic successes from combining traditional and modern diagnostic methods. Finally, we recognize the need for increased capacity of centralized databases, networks, data repositories and contingency plans for diagnosis and management of marine disease.
Smukowski Heil, Caiti; Burton, Joshua N; Liachko, Ivan; Friedrich, Anne; Hanson, Noah A; Morris, Cody L; Schacherer, Joseph; Shendure, Jay; Thomas, James H; Dunham, Maitreya J
2018-01-01
Interspecific hybridization is a common mechanism enabling genetic diversification and adaptation; however, the detection of hybrid species has been quite difficult. The identification of microbial hybrids is made even more complicated, as most environmental microbes are resistant to culturing and must be studied in their native mixed communities. We have previously adapted the chromosome conformation capture method Hi-C to the assembly of genomes from mixed populations. Here, we show the method's application in assembling genomes directly from an uncultured, mixed population from a spontaneously inoculated beer sample. Our assembly method has enabled us to de-convolute four bacterial and four yeast genomes from this sample, including a putative yeast hybrid. Downstream isolation and analysis of this hybrid confirmed its genome to consist of Pichia membranifaciens and that of another related, but undescribed, yeast. Our work shows that Hi-C-based metagenomic methods can overcome the limitation of traditional sequencing methods in studying complex mixtures of genomes. Copyright © 2017 John Wiley & Sons, Ltd.
USDA-ARS's Scientific Manuscript database
The comprehensive identification of genes underlying phenotypic variation of complex traits remains a major challenge. Most genome-wide screens lack sufficient resolving power as they typically depend on linkage. An alternate method is to screen for allele-specific expression (ASE), a simple yet pow...
Molecular analysis of single oocyst of Eimeria by whole genome amplification (WGA) based nested PCR.
Wang, Yunzhou; Tao, Geru; Cui, Yujuan; Lv, Qiyao; Xie, Li; Li, Yuan; Suo, Xun; Qin, Yinghe; Xiao, Lihua; Liu, Xianyong
2014-09-01
PCR-based molecular tools are widely used for the identification and characterization of protozoa. Here we report the molecular analysis of Eimeria species using combined methods of whole genome amplification (WGA) and nested PCR. A single oocyst of Eimeria stiedai or Eimeria media was directly used for random amplification of the genomic DNA with either primer extension preamplification (PEP) or multiple displacement amplification (MDA), and then the WGA product was used as template in nested PCR with species-specific primers for ITS-1, 18S rDNA and 23S rDNA of E. stiedai and E. media. WGA-based PCR was successful for the amplification of these genes from a single oocyst. For the species identification of single oocysts isolated from mixed E. stiedai or E. media, the results from WGA-based PCR were exactly in accordance with those from morphological identification, suggesting the suitability of this method for molecular analysis of eimerian parasites at the single oocyst level. The WGA-based PCR method can also be applied for the identification and genetic characterization of other protists. Copyright © 2014 Elsevier Inc. All rights reserved.
Kügler, Jonas; Nieswandt, Simone; Gerlach, Gerald F; Meens, Jochen; Schirrmann, Thomas; Hust, Michael
2008-09-01
The identification of immunogenic polypeptides of pathogens is helpful for the development of diagnostic assays and therapeutic applications like vaccines. Routinely, these proteins are identified by two-dimensional polyacrylamide gel electrophoresis and Western blot using convalescent serum, followed by mass spectrometry. This technology, however, is limited, because low or differentially expressed proteins, e.g. dependent on pathogen-host interaction, cannot be identified. In this work, we developed and improved a M13 genomic phage display-based method for the selection of immunogenic polypeptides of Mycoplasma hyopneumoniae, a pathogen causing porcine enzootic pneumonia. The fragmented genome of M. hyopneumoniae was cloned into a phage display vector, and the genomic library was packaged using the helperphage Hyperphage to enrich open reading frames (ORFs). Afterwards, the phage display library was screened by panning using convalescent serum. The analysis of individual phage clones resulted in the identification of five genes encoding immunogenic proteins, only two of which had been previously identified and described as immunogenic. This M13 genomic phage display, directly combining ORF enrichment and the presentation of the corresponding polypeptide on the phage surface, complements proteome-based methods for the identification of immunogenic polypeptides and is particularly well suited for the use in mycoplasma species.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Muchero, Wellington; Labbe, Jessy L; Priya, Ranjan
2014-01-01
To date, Populus ranks among a few plant species with a complete genome sequence and other highly developed genomic resources. With the first genome sequence among all tree species, Populus has been adopted as a suitable model organism for genomic studies in trees. However, far from being just a model species, Populus is a key renewable economic resource that plays a significant role in providing raw materials for the biofuel and pulp and paper industries. Therefore, aside from leading frontiers of basic tree molecular biology and ecological research, Populus leads frontiers in addressing global economic challenges related to fuel and fiber production. The latter fact suggests that research aimed at improving quality and quantity of Populus as a raw material will likely drive the pursuit of more targeted and deeper research in order to unlock the economic potential tied in molecular biology processes that drive this tree species. Advances in genome sequence-driven technologies, such as resequencing individual genotypes, which in turn facilitates large scale SNP discovery and identification of large scale polymorphisms are key determinants of future success in these initiatives. In this treatise we discuss implications of genome sequence-enabled technologies on Populus genomic and genetic studies of complex and specialized-traits.
TipMT: Identification of PCR-based taxon-specific markers.
Rodrigues-Luiz, Gabriela F; Cardoso, Mariana S; Valdivia, Hugo O; Ayala, Edward V; Gontijo, Célia M F; Rodrigues, Thiago de S; Fujiwara, Ricardo T; Lopes, Robson S; Bartholomeu, Daniella C
2017-02-11
Molecular genetic markers are one of the most informative and widely used genome features in clinical and environmental diagnostic studies. A polymerase chain reaction (PCR)-based molecular marker is very attractive because it is suitable to high throughput automation and confers high specificity. However, the design of taxon-specific primers may be difficult and time consuming due to the need to identify appropriate genomic regions for annealing primers and to evaluate primer specificity. Here, we report the development of a Tool for Identification of Primers for Multiple Taxa (TipMT), which is a web application to search and design primers for genotyping based on genomic data. The tool identifies and targets single sequence repeats (SSR) or orthologous/taxa-specific genes for genotyping using Multiplex PCR. This pipeline was applied to the genomes of four species of Leishmania (L. amazonensis, L. braziliensis, L. infantum and L. major) and validated by PCR using artificial genomic DNA mixtures of the Leishmania species as templates. This experimental validation demonstrates the reliability of TipMT because amplification profiles showed discrimination of genomic DNA samples from Leishmania species. The TipMT web tool allows for large-scale identification and design of taxon-specific primers and is freely available to the scientific community at http://200.131.37.155/tipMT/ .
Oluwadare, Oluwatosin; Cheng, Jianlin
2017-11-14
With the development of chromosomal conformation capturing techniques, particularly the Hi-C technique, the study of the spatial conformation of a genome is becoming an important topic in bioinformatics and computational biology. The Hi-C technique can generate genome-wide chromosomal interaction (contact) data, which can be used to investigate the higher-level organization of chromosomes, such as Topologically Associated Domains (TADs), i.e., locally packed chromosome regions bound together by intra-chromosomal contacts. The identification of the TADs for a genome is useful for studying gene regulation, genomic interaction, and genome function. Here, we formulate the TAD identification problem as an unsupervised machine learning (clustering) problem, and develop a new TAD identification method called ClusterTAD. We introduce a novel method to represent chromosomal contacts as features to be used by the clustering algorithm. Our results show that ClusterTAD can accurately predict the TADs on simulated Hi-C data. Our method is also largely complementary and consistent with existing methods on the real Hi-C datasets of two mouse cells. The validation with the chromatin immunoprecipitation (ChIP) sequencing (ChIP-Seq) data shows that the domain boundaries identified by ClusterTAD have a high enrichment of CTCF binding sites, promoter-related marks, and enhancer-related histone modifications. As ClusterTAD is based on a proven clustering approach, it opens a new avenue to apply a large array of clustering methods developed in the machine learning field to the TAD identification problem. The source code, the results, and the TADs generated for the simulated and real Hi-C datasets are available here: https://github.com/BDM-Lab/ClusterTAD .
A BAC clone fingerprinting approach to the detection of human genome rearrangements
Krzywinski, Martin; Bosdet, Ian; Mathewson, Carrie; Wye, Natasja; Brebner, Jay; Chiu, Readman; Corbett, Richard; Field, Matthew; Lee, Darlene; Pugh, Trevor; Volik, Stas; Siddiqui, Asim; Jones, Steven; Schein, Jacquie; Collins, Collin; Marra, Marco
2007-01-01
We present a method, called fingerprint profiling (FPP), that uses restriction digest fingerprints of bacterial artificial chromosome clones to detect and classify rearrangements in the human genome. The approach uses alignment of experimental fingerprint patterns to in silico digests of the sequence assembly and is capable of detecting micro-deletions (1-5 kb) and balanced rearrangements. Our method has compelling potential for use as a whole-genome method for the identification and characterization of human genome rearrangements. PMID:17953769
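The comparison of an experimental fingerprint with an in silico digest can be illustrated with a small sketch. It assumes a linear sequence, a single enzyme (the EcoRI site GAATTC, cut after the first base), and a simple size-matching rule with a tolerance; the function names and the tolerance rule are illustrative assumptions, not the FPP algorithm.

```python
def in_silico_digest(seq, site="GAATTC", cut_offset=1):
    """Return fragment lengths after cutting a linear sequence at every site."""
    cuts = []
    pos = seq.find(site)
    while pos != -1:
        cuts.append(pos + cut_offset)
        pos = seq.find(site, pos + 1)
    bounds = [0] + cuts + [len(seq)]
    return [b - a for a, b in zip(bounds, bounds[1:])]

def unmatched_fragments(expected, observed, tol=0):
    """Expected fragment sizes with no observed fragment within tol bases."""
    remaining = sorted(observed)
    missing = []
    for size in sorted(expected):
        hit = next((o for o in remaining if abs(o - size) <= tol), None)
        if hit is None:
            missing.append(size)
        else:
            remaining.remove(hit)  # each observed band can explain one fragment
    return missing

reference = "AAAA" + "GAATTC" + "CCCCC" + "GAATTC" + "TT"
expected = in_silico_digest(reference)
print(expected, unmatched_fragments(expected, observed=[5, 7]))
```

An expected fragment that is missing from the observed fingerprint is the kind of discrepancy that could hint at a deletion or rearrangement in the clone relative to the reference assembly.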
Liu, Bingqiang; Zhang, Hanyuan; Zhou, Chuan; Li, Guojun; Fennell, Anne; Wang, Guanghui; Kang, Yu; Liu, Qi; Ma, Qin
2016-08-09
Phylogenetic footprinting is an important computational technique for identifying cis-regulatory motifs in orthologous regulatory regions from multiple genomes, as motifs tend to evolve slower than their surrounding non-functional sequences. Its application, however, has several difficulties for optimizing the selection of orthologous data and reducing the false positives in motif prediction. Here we present an integrative phylogenetic footprinting framework for accurate motif predictions in prokaryotic genomes (MP(3)). The framework includes a new orthologous data preparation procedure, an additional promoter scoring and pruning method and an integration of six existing motif finding algorithms as basic motif search engines. Specifically, we collected orthologous genes from available prokaryotic genomes and built the orthologous regulatory regions based on sequence similarity of promoter regions. This procedure made full use of the large-scale genomic data and taxonomy information and filtered out the promoters with limited contribution to produce a high quality orthologous promoter set. The promoter scoring and pruning is implemented through motif voting by a set of complementary predicting tools that mine as many motif candidates as possible and simultaneously eliminate the effect of random noise. We have applied the framework to Escherichia coli k12 genome and evaluated the prediction performance through comparison with seven existing programs. This evaluation was systematically carried out at the nucleotide and binding site level, and the results showed that MP(3) consistently outperformed other popular motif finding tools. We have integrated MP(3) into our motif identification and analysis server DMINDA, allowing users to efficiently identify and analyze motifs in 2,072 completely sequenced prokaryotic genomes. The performance evaluation indicated that MP(3) is effective for predicting regulatory motifs in prokaryotic genomes. 
Its application may enhance progress in elucidating transcription regulation mechanisms, thus benefiting the genomic research community and prokaryotic genome researchers in particular.
An overview of bioinformatics methods for modeling biological pathways in yeast
Hou, Jie; Acharya, Lipi; Zhu, Dongxiao
2016-01-01
The advent of high-throughput genomics techniques, along with the completion of genome sequencing projects, identification of protein–protein interactions and reconstruction of genome-scale pathways, has accelerated the development of systems biology research in the yeast organism Saccharomyces cerevisiae. In particular, discovery of biological pathways in yeast has become an important forefront in systems biology, which aims to understand the interactions among molecules within a cell leading to certain cellular processes in response to a specific environment. While the existing theoretical and experimental approaches enable the investigation of well-known pathways involved in metabolism, gene regulation and signal transduction, bioinformatics methods offer new insights into computational modeling of biological pathways. A wide range of computational approaches has been proposed in the past for reconstructing biological pathways from high-throughput datasets. Here we review selected bioinformatics approaches for modeling biological pathways in S. cerevisiae, including metabolic pathways, gene-regulatory pathways and signaling pathways. We start with reviewing the research on biological pathways followed by discussing key biological databases. In addition, several representative computational approaches for modeling biological pathways in yeast are discussed. PMID:26476430
Conservation genetics and genomics of amphibians and reptiles.
Shaffer, H Bradley; Gidiş, Müge; McCartney-Melstad, Evan; Neal, Kevin M; Oyamaguchi, Hilton M; Tellez, Marisa; Toffelmier, Erin M
2015-01-01
Amphibians and reptiles as a group are often secretive, reach their greatest diversity often in remote tropical regions, and contain some of the most endangered groups of organisms on earth. Particularly in the past decade, genetics and genomics have been instrumental in the conservation biology of these cryptic vertebrates, enabling work ranging from the identification of populations subject to trade and exploitation, to the identification of cryptic lineages harboring critical genetic variation, to the analysis of genes controlling key life history traits. In this review, we highlight some of the most important ways that genetic analyses have brought new insights to the conservation of amphibians and reptiles. Although genomics has only recently emerged as part of this conservation tool kit, several large-scale data sources, including full genomes, expressed sequence tags, and transcriptomes, are providing new opportunities to identify key genes, quantify landscape effects, and manage captive breeding stocks of at-risk species.
Computational modelling of genome-scale metabolic networks and its application to CHO cell cultures.
Rejc, Živa; Magdevska, Lidija; Tršelič, Tilen; Osolin, Timotej; Vodopivec, Rok; Mraz, Jakob; Pavliha, Eva; Zimic, Nikolaj; Cvitanović, Tanja; Rozman, Damjana; Moškon, Miha; Mraz, Miha
2017-09-01
Genome-scale metabolic models (GEMs) have become increasingly important in recent years. Currently, GEMs are the most accurate in silico representation of the genotype-phenotype link. They allow us to study complex networks from the systems perspective. Their application may drastically reduce the amount of experimental and clinical work, improve diagnostic tools and increase our understanding of complex biological phenomena. GEMs have also demonstrated high potential for the optimisation of bio-based production of recombinant proteins. Herein, we review the basic concepts, methods, resources and software tools used for the reconstruction and application of GEMs. We overview the evolution of the modelling efforts devoted to the metabolism of Chinese Hamster Ovary (CHO) cells. We present a case study on CHO cell metabolism under different amino acid depletions. This leads us to the identification of the most influential as well as essential amino acids in selected CHO cell lines. Copyright © 2017 Elsevier Ltd. All rights reserved.
Pindyurin, Alexey V
2017-01-01
A thorough study of the genome-wide binding patterns of chromatin proteins is essential for understanding the regulatory mechanisms of genomic processes in eukaryotic nuclei, including DNA replication, transcription, and repair. The DNA adenine methyltransferase identification (DamID) method is a powerful tool to identify genomic binding sites of chromatin proteins. This method does not require fixation of cells and the use of specific antibodies, and has been used to generate genome-wide binding maps of more than a hundred different proteins in Drosophila tissue culture cells. Recent versions of inducible DamID allow performing cell type-specific profiling of chromatin proteins even in small samples of Drosophila tissues that contain heterogeneous cell types. Importantly, with these methods sorting of cells of interest or their nuclei is not necessary as genomic DNA isolated from the whole tissue can be used as an input. Here, I describe in detail an FLP-inducible DamID method, namely generation of suitable transgenic flies, activation of the Dam transgenes by the FLP recombinase, isolation of DNA from small amounts of dissected tissues, and subsequent identification of the DNA binding sites of the chromatin proteins.
Highlights of DNA Barcoding in identification of salient microorganisms like fungi.
Dulla, E L; Kathera, C; Gurijala, H K; Mallakuntla, T R; Srinivasan, P; Prasad, V; Mopati, R D; Jasti, P K
2016-12-01
Fungi, the second largest kingdom of eukaryotic life, are diverse and widespread. Fungi play a distinctive role in the production of different products on an industrial scale, like fungal enzymes, antibiotics, fermented foods, etc., to give storage stability and improved health to meet major global challenges. To utilize fungi fully for human needs, and to pave the way for a healthy relationship with fungi, it is important to identify them quickly and robustly with a molecular-based identification system. DNA Barcoding is a technique that aims to provide a well-organized method for species-level identification and to contribute powerfully to taxonomic and biodiversity research. DNA Barcoding is generally achieved by the retrieval of a short DNA sequence - the 'barcode' - from a standard part of the genome; that barcode is then compared with a library of reference barcode sequences derived from individuals of known identity for identification. Copyright © 2016 Elsevier Masson SAS. All rights reserved.
Xiao, Xiaolin; Moreno-Moral, Aida; Rotival, Maxime; Bottolo, Leonardo; Petretto, Enrico
2014-01-01
Recent high-throughput efforts such as ENCODE have generated a large body of genome-scale transcriptional data in multiple conditions (e.g., cell-types and disease states). Leveraging these data is especially important for network-based approaches to human disease, for instance to identify coherent transcriptional modules (subnetworks) that can inform functional disease mechanisms and pathological pathways. Yet, genome-scale network analysis across conditions is significantly hampered by the paucity of robust and computationally-efficient methods. Building on the Higher-Order Generalized Singular Value Decomposition, we introduce a new algorithmic approach for efficient, parameter-free and reproducible identification of network-modules simultaneously across multiple conditions. Our method can accommodate weighted (and unweighted) networks of any size and can similarly use co-expression or raw gene expression input data, without hinging upon the definition and stability of the correlation used to assess gene co-expression. In simulation studies, we demonstrated distinctive advantages of our method over existing methods, which was able to recover accurately both common and condition-specific network-modules without entailing ad-hoc input parameters as required by other approaches. We applied our method to genome-scale and multi-tissue transcriptomic datasets from rats (microarray-based) and humans (mRNA-sequencing-based) and identified several common and tissue-specific subnetworks with functional significance, which were not detected by other methods. In humans we recapitulated the crosstalk between cell-cycle progression and cell-extracellular matrix interactions processes in ventricular zones during neocortex expansion and further, we uncovered pathways related to development of later cognitive functions in the cortical plate of the developing brain which were previously unappreciated. 
Analyses of seven rat tissues identified a multi-tissue subnetwork of co-expressed heat shock protein (Hsp) and cardiomyopathy genes (Bag3, Cryab, Kras, Emd, Plec), which was significantly replicated using separate failing heart and liver gene expression datasets in humans, thus revealing a conserved functional role for Hsp genes in cardiovascular disease.
Jia, Cangzhi; Yang, Qing; Zou, Quan
2018-04-18
The nucleosome is the basic structure of chromatin in eukaryotic cells, with essential roles in the regulation of many biological processes, such as DNA transcription, replication and repair, and RNA splicing. Because of the importance of nucleosomes, the factors that determine their positioning within genomes should be investigated. High-resolution nucleosome-positioning maps are now available for organisms including Saccharomyces cerevisiae, Drosophila melanogaster and Caenorhabditis elegans, enabling the identification of nucleosome positioning by application of computational tools. Here, we describe a novel predictor called NucPosPred, which was specifically designed for large-scale identification of nucleosome positioning in C. elegans and D. melanogaster genomes. NucPosPred was separately optimized for each species for four types of DNA sequence feature extraction, with consideration of two classification algorithms (gradient-boosting decision tree and support vector machine). The overall accuracy obtained with NucPosPred was 92.29% for C. elegans and 88.26% for D. melanogaster, outperforming previous methods and demonstrating the potential for species-specific prediction of nucleosome positioning. For the convenience of most experimental scientists, a web-server for the predictor NucPosPred is available at http://121.42.167.206/NucPosPred/index.jsp. Copyright © 2018 Elsevier Ltd. All rights reserved.
Jorjani, Hadi; Zavolan, Mihaela
2014-04-01
Accurate identification of transcription start sites (TSSs) is an essential step in the analysis of transcription regulatory networks. In higher eukaryotes, the cap analysis of gene expression technology enabled comprehensive annotation of TSSs in genomes such as those of mice and humans. In bacteria, an equivalent approach, termed differential RNA sequencing (dRNA-seq), has recently been proposed, but the application of this approach to a large number of genomes is hindered by the paucity of computational analysis methods. With few exceptions, when the method has been used, annotation of TSSs has been largely done manually. In this work, we present a computational method called 'TSSer' that enables the automatic inference of TSSs from dRNA-seq data. The method rests on a probabilistic framework for identifying genomic positions that are both preferentially enriched in the dRNA-seq data and preferentially captured relative to neighboring genomic regions. Evaluating our approach for TSS calling on several publicly available datasets, we find that TSSer achieves high consistency with the curated lists of annotated TSSs, but identifies many additional TSSs. Therefore, TSSer can accelerate genome-wide identification of TSSs in bacterial genomes and can aid in further characterization of bacterial transcription regulatory networks. TSSer is freely available under the GPL license at http://www.clipz.unibas.ch/TSSer/index.php
DOE Office of Scientific and Technical Information (OSTI.GOV)
Catfish Genome Consortium; Wang, Shaolin; Peatman, Eric
2010-03-23
Background: Through the Community Sequencing Program, a catfish EST sequencing project was carried out through a collaboration between the catfish research community and the Department of Energy's Joint Genome Institute. Prior to this project, only a limited EST resource from catfish was available for the purpose of SNP identification. Results: A total of 438,321 quality ESTs were generated from 8 channel catfish (Ictalurus punctatus) and 4 blue catfish (Ictalurus furcatus) libraries, bringing the number of catfish ESTs to nearly 500,000. Assembly of all catfish ESTs resulted in 45,306 contigs and 66,272 singletons. Over 35% of the unique sequences had significant similarities to known genes, allowing the identification of 14,776 unique genes in catfish. Over 300,000 putative SNPs have been identified, of which approximately 48,000 are high-quality SNPs identified from contigs with at least four sequences and the minor allele presence of at least two sequences in the contig. The EST resource should be valuable for identification of microsatellites, genome annotation, large-scale expression analysis, and comparative genome analysis. Conclusions: This project generated a large EST resource for catfish that captured the majority of the catfish transcriptome. The parallel analysis of ESTs from two closely related Ictalurid catfishes should also provide powerful means for the evaluation of ancient and recent gene duplications, and for the development of high-density microarrays in catfish. The inter- and intra-specific SNPs identified from the assembly of all catfish EST datasets will greatly benefit the catfish introgression breeding program and whole genome association studies.
4C-ker: A Method to Reproducibly Identify Genome-Wide Interactions Captured by 4C-Seq Experiments
Raviram, Ramya; Rocha, Pedro P.; Müller, Christian L.; Miraldi, Emily R.; Badri, Sana; Fu, Yi; Swanzey, Emily; Proudhon, Charlotte; Snetkova, Valentina; Bonneau, Richard; Skok, Jane A.
2016-01-01
4C-Seq has proven to be a powerful technique to identify genome-wide interactions with a single locus of interest (or “bait”) that can be important for gene regulation. However, analysis of 4C-Seq data is complicated by the many biases inherent to the technique. An important consideration when dealing with 4C-Seq data is the differences in resolution of signal across the genome that result from differences in 3D distance separation from the bait. This leads to the highest signal in the region immediately surrounding the bait and increasingly lower signals in far-cis and trans. Another important aspect of 4C-Seq experiments is the resolution, which is greatly influenced by the choice of restriction enzyme and the frequency at which it can cut the genome. Thus, it is important that a 4C-Seq analysis method is flexible enough to analyze data generated using different enzymes and to identify interactions across the entire genome. Current methods for 4C-Seq analysis only identify interactions in regions near the bait or in regions located in far-cis and trans, but no method comprehensively analyzes 4C signals of different length scales. In addition, some methods also fail in experiments where chromatin fragments are generated using frequent cutter restriction enzymes. Here, we describe 4C-ker, a Hidden-Markov Model based pipeline that identifies regions throughout the genome that interact with the 4C bait locus. In addition, we incorporate methods for the identification of differential interactions in multiple 4C-seq datasets collected from different genotypes or experimental conditions. Adaptive window sizes are used to correct for differences in signal coverage in near-bait regions, far-cis and trans chromosomes. Using several datasets, we demonstrate that 4C-ker outperforms all existing 4C-Seq pipelines in its ability to reproducibly identify interaction domains at all genomic ranges with different resolution enzymes. PMID:26938081
Addressing Beacon re-identification attacks: quantification and mitigation of privacy risks
Zhao, Yongan; Carey, Knox; Lloyd, David; Sofia, Heidi; Baker, Dixie; Flicek, Paul; Shringarpure, Suyash; Bustamante, Carlos; Wang, Shuang; Jiang, Xiaoqian; Ohno-Machado, Lucila; Tang, Haixu; Wang, XiaoFeng; Hubaux, Jean-Pierre
2018-01-01
The Global Alliance for Genomics and Health (GA4GH) created the Beacon Project as a means of testing the willingness of data holders to share genetic data in the simplest technical context—a query for the presence of a specified nucleotide at a given position within a chromosome. Each participating site (or “beacon”) is responsible for assuring that genomic data are exposed through the Beacon service only with the permission of the individual to whom the data pertains and in accordance with the GA4GH policy and standards. While recognizing the inference risks associated with large-scale data aggregation, and the fact that some beacons contain sensitive phenotypic associations that increase privacy risk, the GA4GH adjudged the risk of re-identification based on the binary yes/no allele-presence query responses as acceptable. However, recent work demonstrated that, given a beacon with specific characteristics (including relatively small sample size and an adversary who possesses an individual’s whole genome sequence), the individual’s membership in a beacon can be inferred through repeated queries for variants present in the individual’s genome. In this paper, we propose three practical strategies for reducing re-identification risks in beacons. The first two strategies manipulate the beacon such that the presence of rare alleles is obscured; the third strategy budgets the number of accesses per user for each individual genome. Using a beacon containing data from the 1000 Genomes Project, we demonstrate that the proposed strategies can effectively reduce re-identification risk in beacon-like datasets. PMID:28339683
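The third strategy, budgeting accesses per user for each individual genome, can be pictured with the toy model below. It is a sketch under stated assumptions rather than the authors' scheme: every positive answer is assumed to cost one unit for each carrier, and an individual whose budget for a given user is exhausted simply stops contributing to that user's answers. The class name `BudgetedBeacon` and the variant tuples are hypothetical.

```python
from collections import defaultdict

class BudgetedBeacon:
    """Toy beacon that limits how often each user can 'see' each genome."""

    def __init__(self, genomes, budget):
        self.genomes = genomes          # individual id -> set of variants
        self.budget = budget            # allowed hits per (user, individual)
        self.spent = defaultdict(int)   # (user, individual) -> hits so far

    def query(self, user, variant):
        """Answer yes/no, counting only individuals with budget left."""
        carriers = [
            ind for ind, variants in self.genomes.items()
            if variant in variants and self.spent[(user, ind)] < self.budget
        ]
        for ind in carriers:
            self.spent[(user, ind)] += 1  # uniform unit cost (an assumption)
        return bool(carriers)

v1, v2, v3 = ("1", 1000, "A"), ("1", 2000, "T"), ("2", 500, "G")
beacon = BudgetedBeacon({"ind_a": {v1, v2, v3}, "ind_b": {v1}}, budget=2)
print([beacon.query("alice", v) for v in (v1, v2, v3)])
```

In this toy run the third query returns False even though `ind_a` carries the variant, because the user has already used up the budget for that genome; a different user still receives a positive answer.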
RNA-Seq Based Transcriptional Map of Bovine Respiratory Disease Pathogen “Histophilus somni 2336”
Kumar, Ranjit; Lawrence, Mark L.; Watt, James; Cooksey, Amanda M.; Burgess, Shane C.; Nanduri, Bindu
2012-01-01
Genome structural annotation, i.e., identification and demarcation of the boundaries for all the functional elements in a genome (e.g., genes, non-coding RNAs, proteins and regulatory elements), is a prerequisite for systems level analysis. Current genome annotation programs do not identify all of the functional elements of the genome, especially small non-coding RNAs (sRNAs). Whole genome transcriptome analysis is a complementary method to identify “novel” genes, small RNAs, regulatory regions, and operon structures, thus improving the structural annotation in bacteria. In particular, the identification of non-coding RNAs has revealed their widespread occurrence and functional importance in gene regulation, stress and virulence. However, very little is known about non-coding transcripts in Histophilus somni, one of the causative agents of Bovine Respiratory Disease (BRD) as well as bovine infertility, abortion, septicemia, arthritis, myocarditis, and thrombotic meningoencephalitis. In this study, we report a single nucleotide resolution transcriptome map of H. somni strain 2336 using RNA-Seq method. The RNA-Seq based transcriptome map identified 94 sRNAs in the H. somni genome of which 82 sRNAs were never predicted or reported in earlier studies. We also identified 38 novel potential protein coding open reading frames that were absent in the current genome annotation. The transcriptome map allowed the identification of 278 operon (total 730 genes) structures in the genome. When compared with the genome sequence of a non-virulent strain 129Pt, a disproportionate number of sRNAs (∼30%) were located in genomic region unique to strain 2336 (∼18% of the total genome). This observation suggests that a number of the newly identified sRNAs in strain 2336 may be involved in strain-specific adaptations. PMID:22276113
Zhang, Hongtao; Setubal, Joao Carlos; Zhan, Xiaobei; Zheng, Zhiyong; Yu, Lijun; Wu, Jianrong; Chen, Dingqiang
2011-06-01
Agrobacterium sp. ATCC 31749 (formerly named Alcaligenes faecalis var. myxogenes) is a non-pathogenic aerobic soil bacterium used in large-scale biotechnological production of curdlan. However, little is known about its genome. Partial DNA sequences of electron transport chain (ETC) protein genes were obtained in order to understand the components of the ETC and their genomic specificity in Agrobacterium sp. ATCC 31749. Degenerate primers were designed according to ETC conserved sequences in other reported species. Partial DNA sequences of ETC genes in Agrobacterium sp. ATCC 31749 were cloned by PCR using the degenerate primers. Based on comparative genomic analysis, nine electron transport elements were ascertained, including NADH ubiquinone oxidoreductase, succinate dehydrogenase complex II, complex III, cytochrome c, ubiquinone biosynthesis protein ubiB, cytochrome d terminal oxidase, cytochrome bo terminal oxidase, cytochrome cbb(3)-type terminal oxidase and cytochrome caa(3)-type terminal oxidase. Similarity and phylogenetic analyses of these genes revealed that among fully sequenced Agrobacterium species, Agrobacterium sp. ATCC 31749 is closest to Agrobacterium tumefaciens C58. Based on these results, a comprehensive ETC model for Agrobacterium sp. ATCC 31749 is proposed.
Reads2Type: a web application for rapid microbial taxonomy identification.
Saputra, Dhany; Rasmussen, Simon; Larsen, Mette V; Haddad, Nizar; Sperotto, Maria Maddalena; Aarestrup, Frank M; Lund, Ole; Sicheritz-Pontén, Thomas
2015-11-25
Identification of bacteria may be based on sequencing and molecular analysis of a specific locus such as 16S rRNA, or a set of loci such as in multilocus sequence typing. In the near future, healthcare institutions and routine diagnostic microbiology laboratories may need to sequence the entire genome of microbial isolates. Therefore we have developed Reads2Type, a web-based tool for taxonomy identification based on whole bacterial genome sequence data. Raw sequencing data provided by the user are mapped against a set of marker probes that are derived from currently available complete bacterial genomes. Using a dataset of 1003 whole-genome-sequenced bacteria from various sequencing platforms, Reads2Type was able to identify the species with 99.5% accuracy within minutes. In comparison with other tools, Reads2Type offers the advantage of not needing to transfer sequencing files, as the entire computational analysis is done on the computer of the person using the web application. This also prevents data privacy issues from arising. The Reads2Type tool is available at http://www.cbs.dtu.dk/~dhany/reads2type.html.
O'Flaherty, Brigid M; Li, Yan; Tao, Ying; Paden, Clinton R; Queen, Krista; Zhang, Jing; Dinwiddie, Darrell L; Gross, Stephen M; Schroth, Gary P; Tong, Suxiang
2018-06-01
Next generation sequencing (NGS) technologies have revolutionized the genomics field and are becoming more commonplace for identification of human infectious diseases. However, due to the low abundance of viral nucleic acids (NAs) in relation to host, viral identification using direct NGS technologies often lacks sufficient sensitivity. Here, we describe an approach based on two complementary enrichment strategies that significantly improves the sensitivity of NGS-based virus identification. To start, we developed two sets of DNA probes to enrich virus NAs associated with respiratory diseases. The first set of probes spans the genomes, allowing for identification of known viruses and full genome sequencing, while the second set targets regions conserved among viral families or genera, providing the ability to detect both known and potentially novel members of those virus groups. Efficiency of enrichment was assessed by NGS testing reference virus and clinical samples with known infection. We show significant improvement in viral identification using enriched NGS compared to unenriched NGS. Without enrichment, we observed an average of 0.3% targeted viral reads per sample. However, after enrichment, 50%-99% of the reads per sample were the targeted viral reads for both the reference isolates and clinical specimens using both probe sets. Importantly, dramatic improvements on genome coverage were also observed following virus-specific probe enrichment. The methods described here provide improved sensitivity for virus identification by NGS, allowing for a more comprehensive analysis of disease etiology. © 2018 O'Flaherty et al.; Published by Cold Spring Harbor Laboratory Press.
Li, Guosheng; Jagadeeswaran, Guru; Mort, Andrew; Sunkar, Ramanjulu
2017-01-01
Histone modifications represent the crux of epigenetic gene regulation essential for most biological processes including abiotic stress responses in plants. Thus, identification of histone modifications at the genome-scale can provide clues for how some genes are "turned on" while some others are "turned off" in response to stress. This chapter details a step-by-step protocol for identifying genome-wide histone modifications associated with stress-responsive gene regulation using chromatin immunoprecipitation (ChIP) followed by sequencing of the DNA (ChIP-seq).
Alignment-free genome tree inference by learning group-specific distance metrics.
Patil, Kaustubh R; McHardy, Alice C
2013-01-01
Understanding the evolutionary relationships between organisms is vital for their in-depth study. Gene-based methods are often used to infer such relationships, which are not without drawbacks. One can now attempt to use genome-scale information, because of the ever increasing number of genomes available. This opportunity also presents a challenge in terms of computational efficiency. Two fundamentally different methods are often employed for sequence comparisons, namely alignment-based and alignment-free methods. Alignment-free methods rely on the genome signature concept and provide a computationally efficient way that is also applicable to nonhomologous sequences. The genome signature contains evolutionary signal as it is more similar for closely related organisms than for distantly related ones. We used genome-scale sequence information to infer taxonomic distances between organisms without additional information such as gene annotations. We propose a method to improve genome tree inference by learning specific distance metrics over the genome signature for groups of organisms with similar phylogenetic, genomic, or ecological properties. Specifically, our method learns a Mahalanobis metric for a set of genomes and a reference taxonomy to guide the learning process. By applying this method to more than a thousand prokaryotic genomes, we showed that, indeed, better distance metrics could be learned for most of the 18 groups of organisms tested here. Once a group-specific metric is available, it can be used to estimate the taxonomic distances for other sequenced organisms from the group. This study also presents a large scale comparison between 10 methods--9 alignment-free and 1 alignment-based.
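The two ingredients named in the abstract above, a genome signature and a Mahalanobis metric, can be sketched as follows. This is a minimal illustration under stated assumptions: the signature is taken to be relative k-mer frequencies with k=2, and the matrix `M` is simply supplied (here the identity, which reduces to Euclidean distance). The learning step that fits `M` to a reference taxonomy is not shown, and the function names are hypothetical.

```python
from itertools import product
from math import sqrt

def signature(seq, k=2):
    """Relative frequencies of all 4**k k-mers, in lexicographic order."""
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if kmer in counts:  # skip k-mers containing non-ACGT characters
            counts[kmer] += 1
    total = sum(counts.values()) or 1
    return [counts[m] / total for m in kmers]

def mahalanobis(x, y, M):
    """sqrt((x - y)^T M (x - y)) for a positive semidefinite matrix M."""
    d = [a - b for a, b in zip(x, y)]
    Md = [sum(row[j] * d[j] for j in range(len(d))) for row in M]
    return sqrt(sum(di * mi for di, mi in zip(d, Md)))

n = 16  # 4**2 dinucleotides
identity = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
a = signature("ACGTACGTACGT")
b = signature("AAAACCCCGGGG")
print(mahalanobis(a, b, identity))
```

A group-specific metric would replace `identity` with a learned matrix, so that k-mers that are more informative for that group contribute more to the distance.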
High resolution identity testing of inactivated poliovirus vaccines
Mee, Edward T.; Minor, Philip D.; Martin, Javier
2015-01-01
Background: Definitive identification of poliovirus strains in vaccines is essential for quality control, particularly where multiple wild-type and Sabin strains are produced in the same facility. Sequence-based identification provides the ultimate in identity testing and would offer several advantages over serological methods. Methods: We employed random RT-PCR and high throughput sequencing to recover full-length genome sequences from monovalent and trivalent poliovirus vaccine products at various stages of the manufacturing process. Results: All expected strains were detected in previously characterised products and the method permitted identification of strains comprising as little as 0.1% of sequence reads. Highly similar Mahoney and Sabin 1 strains were readily discriminated on the basis of specific variant positions. Analysis of a product known to contain incorrect strains demonstrated that the method correctly identified the contaminants. Conclusion: Random RT-PCR and shotgun sequencing provided high resolution identification of vaccine components. In addition to the recovery of full-length genome sequences, the method could also be easily adapted to the characterisation of minor variant frequencies and distinction of closely related products on the basis of distinguishing consensus and low frequency polymorphisms. PMID:26049003
Singh, Vikas K; Khan, Aamir W; Saxena, Rachit K; Sinha, Pallavi; Kale, Sandip M; Parupalli, Swathi; Kumar, Vinay; Chitikineni, Annapurna; Vechalapu, Suryanarayana; Sameer Kumar, Chanda Venkata; Sharma, Mamta; Ghanta, Anuradha; Yamini, Kalinati Narasimhan; Muniswamy, Sonnappa; Varshney, Rajeev K
2017-07-01
Identification of candidate genomic regions associated with target traits using conventional mapping methods is challenging and time-consuming. In recent years, a number of single nucleotide polymorphism (SNP)-based mapping approaches have been developed and used for identification of candidate/putative genomic regions. However, in the majority of these studies, insertion-deletions (Indels) were largely ignored. For efficient use of Indels in mapping target traits, we propose the Indel-seq approach, which is a combination of whole-genome resequencing (WGRS) and bulked segregant analysis (BSA) and relies on the Indel frequencies in extreme bulks. Deployment of the Indel-seq approach for identification of candidate genomic regions associated with fusarium wilt (FW) and sterility mosaic disease (SMD) resistance in pigeonpea identified 16 Indels affecting 26 putative candidate genes. Of these 26 affected putative candidate genes, 24 were affected upstream or downstream of the genic region and two were affected within the gene itself. Validation of these 16 candidate Indels in other FW- and SMD-resistant and FW- and SMD-susceptible genotypes revealed a significant association of five Indels (three for FW and two for SMD resistance). Comparative analysis of Indel-seq with other genetic mapping approaches highlighted the importance of the approach in identification of significant genomic regions associated with target traits. Therefore, the Indel-seq approach can be used for quick and precise identification of candidate genomic regions for any target traits in any crop species. © 2016 The Authors. Plant Biotechnology Journal published by Society for Experimental Biology and The Association of Applied Biologists and John Wiley & Sons Ltd.
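The abstract above says that Indel-seq relies on Indel frequencies in extreme bulks but does not give the statistic. The sketch below assumes one plausible form: the fraction of reads carrying the Indel in each bulk, and the difference between the two bulks. The read counts, site names, and the 0.8 cutoff are invented for illustration and are not taken from the study.

```python
def indel_frequency(alt_reads, total_reads):
    """Fraction of reads at a site that carry the Indel allele."""
    if total_reads == 0:
        raise ValueError("no coverage at this site")
    return alt_reads / total_reads

def delta_frequency(resistant, susceptible):
    """Difference in Indel frequency between the two extreme bulks.

    Each argument is an (alt_reads, total_reads) tuple.
    """
    return indel_frequency(*resistant) - indel_frequency(*susceptible)

# Hypothetical read counts per site: (resistant bulk, susceptible bulk).
sites = {
    "indel_1": ((38, 40), (3, 42)),
    "indel_2": ((20, 41), (19, 39)),
    "indel_3": ((2, 35), (33, 36)),
}
THRESHOLD = 0.8  # illustrative cutoff, not taken from the source
candidates = [name for name, (r, s) in sites.items()
              if abs(delta_frequency(r, s)) >= THRESHOLD]
print(candidates)  # ['indel_1', 'indel_3']
```

Sites where both bulks carry the Indel at similar frequency (such as `indel_2` here) would be uninformative for the trait under this scheme.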
Nowrousian, Minou; Würtz, Christian; Pöggeler, Stefanie; Kück, Ulrich
2004-03-01
One of the most challenging parts of large scale sequencing projects is the identification of functional elements encoded in a genome. Recently, studies of genomes of up to six different Saccharomyces species have demonstrated that a comparative analysis of genome sequences from closely related species is a powerful approach to identify open reading frames and other functional regions within genomes [Science 301 (2003) 71, Nature 423 (2003) 241]. Here, we present a comparison of selected sequences from Sordaria macrospora to their corresponding Neurospora crassa orthologous regions. Our analysis indicates that due to the high degree of sequence similarity and conservation of overall genomic organization, S. macrospora sequence information can be used to simplify the annotation of the N. crassa genome.
Comparative Genomics in Homo sapiens.
Oti, Martin; Sammeth, Michael
2018-01-01
Genomes can be compared at different levels of divergence, either between species or within species. Within species genomes can be compared between different subpopulations, such as human subpopulations from different continents. Investigating the genomic differences between different human subpopulations is important when studying complex diseases that are affected by many genetic variants, as the variants involved can differ between populations. The 1000 Genomes Project collected genome-scale variation data for 2504 human individuals from 26 different populations, enabling a systematic comparison of variation between human subpopulations. In this chapter, we present step-by-step a basic protocol for the identification of population-specific variants employing the 1000 Genomes data. These variants are subsequently further investigated for those that affect the proteome or RNA splice sites, to investigate potentially biologically relevant differences between the populations.
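The core filtering step of such a protocol can be sketched in a few lines. This is an assumption-laden illustration, not the chapter's protocol: it supposes that per-population alternate allele frequencies have already been extracted into a dictionary, and the thresholds `min_target` and `max_other`, the variant identifiers, and the frequency values are all hypothetical.

```python
def population_specific(freqs, target, min_target=0.05, max_other=0.0):
    """Variants with frequency >= min_target in `target` and <= max_other elsewhere.

    freqs maps variant id -> {population: alternate allele frequency}.
    """
    result = []
    for variant, by_pop in freqs.items():
        others = [f for pop, f in by_pop.items() if pop != target]
        if by_pop.get(target, 0.0) >= min_target and all(f <= max_other for f in others):
            result.append(variant)
    return result

# Hypothetical allele frequencies; real values would come from the 1000 Genomes data.
freqs = {
    "var1": {"AFR": 0.12, "EUR": 0.0, "EAS": 0.0},
    "var2": {"AFR": 0.30, "EUR": 0.25, "EAS": 0.28},
    "var3": {"AFR": 0.0, "EUR": 0.0, "EAS": 0.07},
}
print(population_specific(freqs, "AFR"))  # ['var1']
```

The resulting variant list would then be passed to an annotation step to find those affecting the proteome or splice sites, which is outside the scope of this sketch.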
Systematic Identification of Combinatorial Drivers and Targets in Cancer Cell Lines
Tabchy, Adel; Eltonsy, Nevine; Housman, David E.; Mills, Gordon B.
2013-01-01
There is an urgent need to elicit and validate highly efficacious targets for combinatorial intervention from large scale ongoing molecular characterization efforts of tumors. We established an in silico bioinformatic platform in concert with a high throughput screening platform evaluating 37 novel targeted agents in 669 extensively characterized cancer cell lines reflecting the genomic and tissue-type diversity of human cancers, to systematically identify combinatorial biomarkers of response and co-actionable targets in cancer. Genomic biomarkers discovered in a 141 cell line training set were validated in an independent 359 cell line test set. We identified co-occurring and mutually exclusive genomic events that represent potential drivers and combinatorial targets in cancer. We demonstrate multiple cooperating genomic events that predict sensitivity to drug intervention independent of tumor lineage. The coupling of scalable in silico and biologic high throughput cancer cell line platforms for the identification of co-events in cancer delivers rational combinatorial targets for synthetic lethal approaches with a high potential to pre-empt the emergence of resistance. PMID:23577104
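One common way to test whether two genomic events co-occur or are mutually exclusive across cell lines is a one-sided Fisher exact test on a 2x2 contingency table. The abstract does not state which statistic the authors used, so the following is a generic sketch with hypothetical counts and a hypothetical function name.

```python
from math import comb

def fisher_exact_tails(a, b, c, d):
    """One-sided Fisher exact p-values for a 2x2 table [[a, b], [c, d]].

    a = lines with both events, b = only event 1, c = only event 2, d = neither.
    Returns (p_cooccurrence, p_exclusivity).
    """
    row1, col1, n = a + b, a + c, a + b + c + d

    def pmf(k):
        # Hypergeometric probability of exactly k lines carrying both events.
        return comb(col1, k) * comb(n - col1, row1 - k) / comb(n, row1)

    lo = max(0, row1 + col1 - n)
    hi = min(row1, col1)
    p_co = sum(pmf(k) for k in range(a, hi + 1))   # as many or more overlaps
    p_ex = sum(pmf(k) for k in range(lo, a + 1))   # as few or fewer overlaps
    return p_co, p_ex

# Hypothetical counts: the two events never appear in the same cell line.
p_co, p_ex = fisher_exact_tails(0, 10, 10, 10)
print(p_co, p_ex)
```

With these counts the exclusivity p-value is small while the co-occurrence p-value is 1, which is the pattern expected for mutually exclusive events.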
Large Scale Single Nucleotide Polymorphism Study of PD Susceptibility
2005-03-01
identification of eight genetic loci in the familial PD, the results of intensive investigations of polymorphisms in dozens of genes related to sporadic, late...1) investigate the association between classical, sporadic PD and 2386 SNPs in 23 genes implicated in the pathogenesis of PD; (2) construct...addition, experiences derived from this study may be applied in other complex disorders for the identification of susceptibility genes, as well as in genome
Bacterial Group II Introns: Identification and Mobility Assay.
Toro, Nicolás; Molina-Sánchez, María Dolores; Nisa-Martínez, Rafael; Martínez-Abarca, Francisco; García-Rodríguez, Fernando Manuel
2016-01-01
Group II introns are large catalytic RNAs and mobile retroelements that encode a reverse transcriptase. Here, we provide methods for their identification in bacterial genomes and further analysis of their splicing and mobility capacities.
Addressing Beacon re-identification attacks: quantification and mitigation of privacy risks.
Raisaro, Jean Louis; Tramèr, Florian; Ji, Zhanglong; Bu, Diyue; Zhao, Yongan; Carey, Knox; Lloyd, David; Sofia, Heidi; Baker, Dixie; Flicek, Paul; Shringarpure, Suyash; Bustamante, Carlos; Wang, Shuang; Jiang, Xiaoqian; Ohno-Machado, Lucila; Tang, Haixu; Wang, XiaoFeng; Hubaux, Jean-Pierre
2017-07-01
The Global Alliance for Genomics and Health (GA4GH) created the Beacon Project as a means of testing the willingness of data holders to share genetic data in the simplest technical context: a query for the presence of a specified nucleotide at a given position within a chromosome. Each participating site (or "beacon") is responsible for assuring that genomic data are exposed through the Beacon service only with the permission of the individual to whom the data pertains and in accordance with the GA4GH policy and standards. While recognizing the inference risks associated with large-scale data aggregation, and the fact that some beacons contain sensitive phenotypic associations that increase privacy risk, the GA4GH adjudged the risk of re-identification based on the binary yes/no allele-presence query responses as acceptable. However, recent work demonstrated that, given a beacon with specific characteristics (including relatively small sample size and an adversary who possesses an individual's whole genome sequence), the individual's membership in a beacon can be inferred through repeated queries for variants present in the individual's genome. In this paper, we propose three practical strategies for reducing re-identification risks in beacons. The first two strategies manipulate the beacon such that the presence of rare alleles is obscured; the third strategy budgets the number of accesses per user for each individual genome. Using a beacon containing data from the 1000 Genomes Project, we demonstrate that the proposed strategies can effectively reduce re-identification risk in beacon-like datasets. © The Author 2017. Published by Oxford University Press on behalf of the American Medical Informatics Association.
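The third strategy, budgeting accesses per user for each individual genome, can be illustrated with a toy beacon. This sketch assumes the simplest possible accounting (a fixed count of answered queries per user and individual); the paper's actual budget scheme may weight queries differently. The class `BudgetedBeacon` and its data layout are hypothetical.

```python
from collections import defaultdict

class BudgetedBeacon:
    """Toy beacon: each user may 'touch' each individual's genome a limited number of times."""

    def __init__(self, genomes, budget):
        # genomes: individual id -> set of (chrom, pos, allele) tuples
        self.genomes = genomes
        self.budget = budget
        self.used = defaultdict(int)  # (user, individual) -> answered queries

    def query(self, user, variant):
        """Answer yes/no, ignoring individuals whose budget for this user is spent."""
        answer = False
        for individual, variants in self.genomes.items():
            if variant in variants and self.used[(user, individual)] < self.budget:
                self.used[(user, individual)] += 1
                answer = True
        return answer

beacon = BudgetedBeacon({"ind1": {("1", 100, "A"), ("2", 200, "T")}}, budget=1)
print(beacon.query("alice", ("1", 100, "A")))  # True: first hit on ind1
print(beacon.query("alice", ("2", 200, "T")))  # False: ind1's budget for alice is spent
print(beacon.query("bob", ("2", 200, "T")))    # True: bob has his own budget
```

The design choice here is that an exhausted individual silently stops contributing to "yes" answers, so repeated queries by one user reveal progressively less about that individual's membership.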
Large-scale sequencing efforts are uncovering the complexity of cancer genomes, which are composed of causal "driver" mutations that promote tumor progression along with many more pathologically neutral "passenger" events. The majority of mutations, both in known cancer drivers and uncharacterized genes, are generally of low occurrence, highlighting the need to functionally annotate the long tail of infrequent mutations present in heterogeneous cancers.
An overview of bioinformatics methods for modeling biological pathways in yeast.
Hou, Jie; Acharya, Lipi; Zhu, Dongxiao; Cheng, Jianlin
2016-03-01
The advent of high-throughput genomics techniques, along with the completion of genome sequencing projects, identification of protein-protein interactions and reconstruction of genome-scale pathways, has accelerated the development of systems biology research in the yeast organism Saccharomyces cerevisiae. In particular, discovery of biological pathways in yeast has become an important forefront in systems biology, which aims to understand the interactions among molecules within a cell leading to certain cellular processes in response to a specific environment. While the existing theoretical and experimental approaches enable the investigation of well-known pathways involved in metabolism, gene regulation and signal transduction, bioinformatics methods offer new insights into computational modeling of biological pathways. A wide range of computational approaches has been proposed in the past for reconstructing biological pathways from high-throughput datasets. Here we review selected bioinformatics approaches for modeling biological pathways in S. cerevisiae, including metabolic pathways, gene-regulatory pathways and signaling pathways. We start with reviewing the research on biological pathways followed by discussing key biological databases. In addition, several representative computational approaches for modeling biological pathways in yeast are discussed. © The Author 2015. Published by Oxford University Press. All rights reserved. For permissions, please email: journals.permissions@oup.com.
Genetics of Resistant Hypertension: the Missing Heritability and Opportunities.
Teixeira, Samantha K; Pereira, Alexandre C; Krieger, Jose E
2018-05-19
Blood pressure regulation in humans has long been known to be a genetically determined trait. The identification of causal genetic modulators for this trait has been unfulfilling at the least. Despite the recent advances of genome-wide genetic studies, loci associated with hypertension or blood pressure still explain a very low percentage of the overall variation of blood pressure in the general population. This has precluded the translation of discoveries in the genetics of human hypertension to clinical use. Here, we propose the combined use of resistant hypertension as a trait for mapping genetic determinants in humans and the integration of new large-scale technologies to approach in model systems the multidimensional nature of the problem. New large-scale efforts in the genetic and genomic arenas are paving the way for an increased and granular understanding of genetic determinants of hypertension. New technologies for whole genome sequence and large-scale forward genetic screens can help prioritize gene and gene-pathways for downstream characterization and large-scale population studies, and guided pharmacological design can be used to drive discoveries to the translational application through better risk stratification and new therapeutic approaches. Although significant challenges remain in the mapping and identification of genetic determinants of hypertension, new large-scale technological approaches have been proposed to surpass some of the shortcomings that have limited progress in the area for the last three decades. The incorporation of these technologies to hypertension research may significantly help in the understanding of inter-individual blood pressure variation and the deployment of new phenotyping and treatment approaches for the condition.
Large protein as a potential target for use in rabies diagnostics.
Santos Katz, I S; Dias, M H; Lima, I F; Chaves, L B; Ribeiro, O G; Scheffer, K C; Iwai, L K
Rabies is a zoonotic viral disease that remains a serious threat to public health worldwide. The rabies lyssavirus (RABV) genome encodes five structural proteins, multifunctional and significant for pathogenicity. The large protein (L) presents well-conserved genomic regions, which may be a good alternative to generate informative datasets for development of new methods for rabies diagnosis. This paper describes the development of a technique for the identification of L protein in several RABV strains from different hosts, demonstrating that MS-based proteomics is a potential method for antigen identification and a good alternative for rabies diagnosis.
Verde, Ignazio; Jenkins, Jerry; Dondini, Luca; Micali, Sabrina; Pagliarani, Giulia; Vendramin, Elisa; Paris, Roberta; Aramini, Valeria; Gazza, Laura; Rossini, Laura; Bassi, Daniele; Troggio, Michela; Shu, Shengqiang; Grimwood, Jane; Tartarini, Stefano; Dettori, Maria Teresa; Schmutz, Jeremy
2017-03-11
The availability of the peach genome sequence has fostered relevant research in peach and related Prunus species enabling the identification of genes underlying important horticultural traits as well as the development of advanced tools for genetic and genomic analyses. The first release of the peach genome (Peach v1.0) represented a high-quality WGS (Whole Genome Shotgun) chromosome-scale assembly with high contiguity (contig L50 214.2 kb), large portions of mapped sequences (96%) and high base accuracy (99.96%). The aim of this work was to improve the quality of the first assembly by increasing the portion of mapped and oriented sequences, correcting misassemblies and improving the contiguity and base accuracy using high-throughput linkage mapping and deep resequencing approaches. Four linkage maps with 3,576 molecular markers were used to improve the portion of mapped and oriented sequences (from 96.0% and 85.6% of Peach v1.0 to 99.2% and 98.2% of v2.0, respectively) and enabled a more detailed identification of discernible misassemblies (10.4 Mb in total). The deep resequencing approach fixed 859 homozygous SNPs (Single Nucleotide Polymorphisms) and 1347 homozygous indels. Moreover, the assembled NGS contigs enabled the closing of 212 gaps with an improvement in the contig L50 of 19.2%. The improved high quality peach genome assembly (Peach v2.0) represents a valuable tool for the analysis of the genetic diversity, domestication, and as a vehicle for genetic improvement of peach and related Prunus species. Moreover, the important phylogenetic position of peach and the absence of recent whole genome duplication (WGD) events make peach a pivotal species for comparative genomics studies aiming at elucidating plant speciation and diversification processes.
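The contiguity statistic quoted above (a "contig L50" expressed as a length in kb) corresponds to what is often called N50: the contig length at which the longest contigs together cover at least half of the assembly. The sketch below shows that calculation on made-up contig lengths, followed by the value implied by the reported 19.2% improvement, assuming the percentage is relative to the v1.0 figure of 214.2 kb.

```python
def n50(lengths):
    """Length L such that contigs of length >= L cover at least half the assembly."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    raise ValueError("no contigs given")

print(n50([100, 80, 60, 40, 20]))  # 80

# Implied v2.0 value if the 19.2% gain is relative to 214.2 kb (an assumption).
implied_kb = 214.2 * 1.192
print(f"{implied_kb:.1f} kb")
```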
Genome-wide SNP identification and QTL mapping for black rot resistance in cabbage.
Lee, Jonghoon; Izzah, Nur Kholilatul; Jayakodi, Murukarthick; Perumal, Sampath; Joh, Ho Jun; Lee, Hyeon Ju; Lee, Sang-Choon; Park, Jee Young; Yang, Ki-Woung; Nou, Il-Sup; Seo, Joodeok; Yoo, Jaeheung; Suh, Youngdeok; Ahn, Kyounggu; Lee, Ji Hyun; Choi, Gyung Ja; Yu, Yeisoo; Kim, Heebal; Yang, Tae-Jin
2015-02-03
Black rot is a destructive bacterial disease causing large yield and quality losses in Brassica oleracea. To detect quantitative trait loci (QTL) for black rot resistance, we performed whole-genome resequencing of two cabbage parental lines and genome-wide SNP identification using the recently published B. oleracea genome sequences as reference. Approximately 11.5 Gb of sequencing data was produced from each parental line. Reference genome-guided mapping and SNP calling revealed 674,521 SNPs between the two cabbage lines, with an average of one SNP per 662.5 bp. Among 167 dCAPS markers derived from candidate SNPs, 117 (70.1%) were validated as bona fide SNPs showing polymorphism between the parental lines. We then improved the resolution of a previous genetic map by adding 103 markers including 87 SNP-based dCAPS markers. The new map is composed of 368 markers and covers 1467.3 cM with an average interval of 3.88 cM between adjacent markers. We evaluated black rot resistance in the mapping population in three independent inoculation tests using F2:3 progenies and identified one major QTL and three minor QTLs. We report successful utilization of whole-genome resequencing for large-scale SNP identification and development of molecular markers for genetic map construction. In addition, we identified novel QTLs for black rot resistance. The high-density genetic map will promote QTL analysis for other important agricultural traits and marker-assisted breeding of B. oleracea.
Genome-scale engineering of Saccharomyces cerevisiae with single-nucleotide precision.
Bao, Zehua; HamediRad, Mohammad; Xue, Pu; Xiao, Han; Tasan, Ipek; Chao, Ran; Liang, Jing; Zhao, Huimin
2018-07-01
We developed a CRISPR-Cas9- and homology-directed-repair-assisted genome-scale engineering method named CHAnGE that can rapidly output tens of thousands of specific genetic variants in yeast. More than 98% of target sequences were efficiently edited with an average frequency of 82%. We validate the single-nucleotide resolution genome-editing capability of this technology by creating a genome-wide gene disruption collection and apply our method to improve tolerance to growth inhibitors.
Meyers, Robin M; Bryan, Jordan G; McFarland, James M; Weir, Barbara A; Sizemore, Ann E; Xu, Han; Dharia, Neekesh V; Montgomery, Phillip G; Cowley, Glenn S; Pantel, Sasha; Goodale, Amy; Lee, Yenarae; Ali, Levi D; Jiang, Guozhi; Lubonja, Rakela; Harrington, William F; Strickland, Matthew; Wu, Ting; Hawes, Derek C; Zhivich, Victor A; Wyatt, Meghan R; Kalani, Zohra; Chang, Jaime J; Okamoto, Michael; Stegmaier, Kimberly; Golub, Todd R; Boehm, Jesse S; Vazquez, Francisca; Root, David E; Hahn, William C; Tsherniak, Aviad
2017-12-01
The CRISPR-Cas9 system has revolutionized gene editing both at single genes and in multiplexed loss-of-function screens, thus enabling precise genome-scale identification of genes essential for proliferation and survival of cancer cells. However, previous studies have reported that a gene-independent antiproliferative effect of Cas9-mediated DNA cleavage confounds such measurement of genetic dependency, thereby leading to false-positive results in copy number-amplified regions. We developed CERES, a computational method to estimate gene-dependency levels from CRISPR-Cas9 essentiality screens while accounting for the copy number-specific effect. In our efforts to define a cancer dependency map, we performed genome-scale CRISPR-Cas9 essentiality screens across 342 cancer cell lines and applied CERES to this data set. We found that CERES decreased false-positive results and estimated sgRNA activity for both this data set and previously published screens performed with different sgRNA libraries. We further demonstrate the utility of this collection of screens, after CERES correction, for identifying cancer-type-specific vulnerabilities.
Meyers, Robin M.; Bryan, Jordan G.; McFarland, James M.; Weir, Barbara A.; Sizemore, Ann E.; Xu, Han; Dharia, Neekesh V.; Montgomery, Phillip G.; Cowley, Glenn S.; Pantel, Sasha; Goodale, Amy; Lee, Yenarae; Ali, Levi D.; Jiang, Guozhi; Lubonja, Rakela; Harrington, William F.; Strickland, Matthew; Wu, Ting; Hawes, Derek C.; Zhivich, Victor A.; Wyatt, Meghan R.; Kalani, Zohra; Chang, Jaime J.; Okamoto, Michael; Stegmaier, Kimberly; Golub, Todd R.; Boehm, Jesse S.; Vazquez, Francisca; Root, David E.; Hahn, William C.; Tsherniak, Aviad
2017-01-01
The CRISPR-Cas9 system has revolutionized gene editing both on single genes and in multiplexed loss-of-function screens, enabling precise genome-scale identification of genes essential to proliferation and survival of cancer cells [1,2]. However, previous studies reported that a gene-independent anti-proliferative effect of Cas9-mediated DNA cleavage confounds such measurement of genetic dependency, leading to false positive results in copy number amplified regions [3,4]. We developed CERES, a computational method to estimate gene dependency levels from CRISPR-Cas9 essentiality screens while accounting for the copy-number-specific effect. As part of our efforts to define a cancer dependency map, we performed genome-scale CRISPR-Cas9 essentiality screens across 342 cancer cell lines and applied CERES to this dataset. We found that CERES reduced false positive results and estimated sgRNA activity for both this dataset and previously published screens performed with different sgRNA libraries. Here, we demonstrate the utility of this collection of screens, upon CERES correction, in revealing cancer-type-specific vulnerabilities. PMID:29083409
Association Studies of Sporadic Parkinson’s Disease in the Genomic Era
Labbé, Catherine; Ross, Owen A
2014-01-01
Parkinson’s disease is a common age-related progressive neurodegenerative disorder. Over the last 10 years, advances have been made in our understanding of the etiology of the disease with the greatest insights perhaps coming from genetic studies, including genome-wide association approaches. These large scale studies allow the identification of genomic regions harboring common variants associated with disease risk. Since the first genome-wide association study on sporadic Parkinson’s disease performed in 2005, improvements in study design, including the advent of meta-analyses, have allowed the identification of ~21 susceptibility loci. The first loci to be nominated were previously associated with familial PD (SNCA, MAPT, LRRK2) and these have been extensively replicated. For other more recently identified loci (SREBF1, SCARB2, RIT2) independent replication is still warranted. Cumulative risk estimates of associated variants suggest that more loci are still to be discovered. Additional association studies combined with deep re-sequencing of known genome-wide association study loci are necessary to identify the functional variants that drive disease risk. As each of these associated genes and variants is identified, it will give insight into the biological pathways involved in the etiology of Parkinson’s disease. This will ultimately lead to the identification of molecules that can be used as biomarkers for diagnosis and as targets for the development of better, personalized treatment. PMID:24653658
Initial sequencing and comparative analysis of the mouse genome
DOE Office of Scientific and Technical Information (OSTI.GOV)
Waterston, Robert H.; Lindblad-Toh, Kerstin; Birney, Ewan
2002-12-15
The sequence of the mouse genome is a key informational tool for understanding the contents of the human genome and a key experimental tool for biomedical research. Here, we report the results of an international collaboration to produce a high-quality draft sequence of the mouse genome. We also present an initial comparative analysis of the mouse and human genomes, describing some of the insights that can be gleaned from the two sequences. We discuss topics including the analysis of the evolutionary forces shaping the size, structure and sequence of the genomes; the conservation of large-scale synteny across most of the genomes; the much lower extent of sequence orthology covering less than half of the genomes; the proportions of the genomes under selection; the number of protein-coding genes; the expansion of gene families related to reproduction and immunity; the evolution of proteins; and the identification of intraspecies polymorphism.
Dictionary-driven prokaryotic gene finding.
Shibuya, Tetsuo; Rigoutsos, Isidore
2002-06-15
Gene identification, also known as gene finding or gene recognition, is among the important problems of molecular biology that have been receiving increasing attention with the advent of large scale sequencing projects. Previous strategies for solving this problem can be categorized into essentially two schools of thought: one school employs sequence composition statistics, whereas the other relies on database similarity searches. In this paper, we propose a new gene identification scheme that combines the best characteristics from each of these two schools. In particular, our method determines gene candidates among the ORFs that can be identified in a given DNA strand through the use of the Bio-Dictionary, a database of patterns that covers essentially all of the currently available sample of the natural protein sequence space. Our approach relies entirely on the use of redundant patterns as the agents on which the presence or absence of genes is predicated and does not employ any additional evidence, e.g. ribosome-binding site signals. The Bio-Dictionary Gene Finder (BDGF), the algorithm's implementation, is a single computational engine able to handle the gene identification task across distinct archaeal and bacterial genomes. The engine exhibits performance that is characterized by simultaneous very high values of sensitivity and specificity, and a high percentage of correctly predicted start sites. Using a collection of patterns derived from an old (June 2000) release of the Swiss-Prot/TrEMBL database that contained 451 602 proteins and fragments, we demonstrate our method's generality and capabilities through an extensive analysis of 17 complete archaeal and bacterial genomes. Examples of previously unreported genes are also shown and discussed in detail.
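The first step described above, enumerating the ORFs that can be identified in a given DNA strand, can be sketched as follows. This covers only candidate generation on one strand with the standard ATG start and TAA/TAG/TGA stop codons; the pattern-based scoring against the Bio-Dictionary that BDGF then applies is not shown. The function name and the `min_codons` parameter are assumptions.

```python
STOPS = {"TAA", "TAG", "TGA"}

def find_orfs(dna, min_codons=2):
    """Return (start, end) index pairs of ATG...stop ORFs on the given strand.

    `end` is exclusive and includes the stop codon; all three frames are scanned.
    min_codons counts the codons before the stop, including the ATG.
    """
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOPS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None
    return sorted(orfs)

print(find_orfs("CCATGAAATTTTAGGG"))  # [(2, 14)]
```

A full gene finder would also scan the reverse complement and consider alternative start codons, which prokaryotes commonly use.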
Identification of differentially methylated sites with weak methylation effect
USDA-ARS's Scientific Manuscript database
DNA methylation is an epigenetic alteration crucial for regulating stress responses. Identifying large-scale DNA methylation at single nucleotide resolution is made possible by whole genome bisulfite sequencing. An essential task following the generation of bisulfite sequencing data is to detect dif...
Method for genetic identification of unknown organisms
Colston, Jr., Billy W.; Fitch, Joseph P.; Hindson, Benjamin J.; Carter, Chance J.; Beer, Neil Reginald
2016-08-23
A method of rapid, genome and proteome based identification of unknown pathogenic or non-pathogenic organisms in a complex sample. The entire sample is analyzed by creating millions of emulsion encapsulated microdroplets, each containing a single pathogenic or non-pathogenic organism sized particle and appropriate reagents for amplification. Following amplification, the amplified product is analyzed.
Pacheco-Arjona, Jose Ramon; Ramirez-Prado, Jorge Humberto
2014-01-01
The cell wall is a protective and versatile structure distributed in all fungi. The component responsible for its rigidity is chitin, a product of chitin synthase (Chsp) enzymes. There are seven classes of chitin synthase genes (CHS), and the number and type encoded in fungal genomes vary considerably from one species to another. Previous Chsp sequence analyses focused on their study as individual units, regardless of genomic context. The identification of blocks of conserved genes between genomes can provide important clues about the interactions and localization of chitin synthases. In the present study, we carried out an in silico search of all putative Chsp encoded in 54 full fungal genomes, encompassing 21 orders from five phyla. Phylogenetic studies of these Chsp were able to confidently classify 347 out of the 369 Chsp identified (94%). Patterns in the distribution of Chsp related to taxonomy were identified, the most prominent being related to the type of fungal growth. More importantly, a synteny analysis for genomic blocks centered on class IV Chsp (the most abundant and widely distributed Chsp class) identified a putative cell wall metabolism gene cluster in members of the genus Aspergillus, the first such association reported for any fungal genome. PMID:25148134
Comparative Genomics and Host Resistance against Infectious Diseases
Qureshi, Salman T.; Skamene, Emil
1999-01-01
The large size and complexity of the human genome have limited the identification and functional characterization of components of the innate immune system that play a critical role in front-line defense against invading microorganisms. However, advances in genome analysis (including the development of comprehensive sets of informative genetic markers, improved physical mapping methods, and novel techniques for transcript identification) have reduced the obstacles to discovery of novel host resistance genes. Study of the genomic organization and content of widely divergent vertebrate species has shown a remarkable degree of evolutionary conservation and enables meaningful cross-species comparison and analysis of newly discovered genes. Application of comparative genomics to host resistance will rapidly expand our understanding of human immune defense by facilitating the translation of knowledge acquired through the study of model organisms. We review the rationale and resources for comparative genomic analysis and describe three examples of host resistance genes successfully identified by this approach. PMID:10081670
An approach to large scale identification of non-obvious structural similarities between proteins
Cherkasov, Artem; Jones, Steven JM
2004-01-01
Background A new sequence independent bioinformatics approach allowing genome-wide search for proteins with similar three dimensional structures has been developed. By utilizing the numerical output of the sequence threading it establishes putative non-obvious structural similarities between proteins. When applied to the testing set of proteins with known three dimensional structures the developed approach was able to recognize structurally similar proteins with high accuracy. Results The method has been developed to identify pathogenic proteins with low sequence identity and high structural similarity to host analogues. Such protein structure relationships would be hypothesized to arise through convergent evolution or through ancient horizontal gene transfer events, now undetectable using current sequence alignment techniques. The pathogen proteins, which could mimic or interfere with host activities, would represent candidate virulence factors. The developed approach utilizes the numerical outputs from the sequence-structure threading. It identifies the potential structural similarity between a pair of proteins by correlating the threading scores of the corresponding two primary sequences against the library of the standard folds. This approach allowed up to 64% sensitivity and 99.9% specificity in distinguishing protein pairs with high structural similarity. Conclusion Preliminary results obtained by comparison of the genomes of Homo sapiens and several strains of Chlamydia trachomatis have demonstrated the potential usefulness of the method in the identification of bacterial proteins with known or potential roles in virulence. PMID:15147578
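The abstract above compares two proteins by correlating their threading scores against a library of standard folds. A minimal sketch of that idea follows. It assumes a Pearson correlation and a hypothetical similarity cutoff of 0.9; the abstract specifies neither, and the names `pearson` and `structurally_similar` are introduced only for illustration.

```python
import math


def pearson(x, y):
    """Pearson correlation between two equal-length score vectors."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("score vectors must have equal length >= 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant score profile carries no fold preference
    return cov / math.sqrt(var_x * var_y)


def structurally_similar(scores_a, scores_b, threshold=0.9):
    """Flag a protein pair whose threading-score profiles correlate strongly.

    Each vector holds one threading score per fold in the same fold library,
    in the same order. The threshold is a hypothetical value for the sketch.
    """
    return pearson(scores_a, scores_b) >= threshold
```

Each protein is reduced to a profile of scores over the same folds, so two sequences with little sequence identity can still be paired when their profiles rise and fall together.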
Identification of genes and gene clusters involved in mycotoxin synthesis
USDA-ARS's Scientific Manuscript database
Research methods to identify and characterize genes involved in mycotoxin biosynthetic pathways have evolved considerably over the years. Before whole genome sequences were available (e.g. pre-genomics), work focused primarily on chemistry, biosynthetic mutant strains and molecular analysis of sing...
Reprogramming cell fate with a genome-scale library of artificial transcription factors.
Eguchi, Asuka; Wleklinski, Matthew J; Spurgat, Mackenzie C; Heiderscheit, Evan A; Kropornicka, Anna S; Vu, Catherine K; Bhimsaria, Devesh; Swanson, Scott A; Stewart, Ron; Ramanathan, Parameswaran; Kamp, Timothy J; Slukvin, Igor; Thomson, James A; Dutton, James R; Ansari, Aseem Z
2016-12-20
Artificial transcription factors (ATFs) are precision-tailored molecules designed to bind DNA and regulate transcription in a preprogrammed manner. Libraries of ATFs enable the high-throughput screening of gene networks that trigger cell fate decisions or phenotypic changes. We developed a genome-scale library of ATFs that display an engineered interaction domain (ID) to enable cooperative assembly and synergistic gene expression at targeted sites. We used this ATF library to screen for key regulators of the pluripotency network and discovered three combinations of ATFs capable of inducing pluripotency without exogenous expression of Oct4 (POU domain, class 5, TF 1). Cognate site identification, global transcriptional profiling, and identification of ATF binding sites reveal that the ATFs do not directly target Oct4; instead, they target distinct nodes that converge to stimulate the endogenous pluripotency network. This forward genetic approach enables cell type conversions without a priori knowledge of potential key regulators and reveals unanticipated gene network dynamics that drive cell fate choices.
Reprogramming cell fate with a genome-scale library of artificial transcription factors
Eguchi, Asuka; Wleklinski, Matthew J.; Spurgat, Mackenzie C.; Heiderscheit, Evan A.; Kropornicka, Anna S.; Vu, Catherine K.; Bhimsaria, Devesh; Swanson, Scott A.; Stewart, Ron; Ramanathan, Parameswaran; Kamp, Timothy J.; Slukvin, Igor; Thomson, James A.; Dutton, James R.; Ansari, Aseem Z.
2016-01-01
Artificial transcription factors (ATFs) are precision-tailored molecules designed to bind DNA and regulate transcription in a preprogrammed manner. Libraries of ATFs enable the high-throughput screening of gene networks that trigger cell fate decisions or phenotypic changes. We developed a genome-scale library of ATFs that display an engineered interaction domain (ID) to enable cooperative assembly and synergistic gene expression at targeted sites. We used this ATF library to screen for key regulators of the pluripotency network and discovered three combinations of ATFs capable of inducing pluripotency without exogenous expression of Oct4 (POU domain, class 5, TF 1). Cognate site identification, global transcriptional profiling, and identification of ATF binding sites reveal that the ATFs do not directly target Oct4; instead, they target distinct nodes that converge to stimulate the endogenous pluripotency network. This forward genetic approach enables cell type conversions without a priori knowledge of potential key regulators and reveals unanticipated gene network dynamics that drive cell fate choices. PMID:27930301
Applicability of SCAR markers to food genomics: olive oil traceability.
Pafundo, Simona; Agrimonti, Caterina; Maestri, Elena; Marmiroli, Nelson
2007-07-25
DNA analysis with molecular markers has opened a shortcut toward a genomic comprehension of complex organisms. The availability of micro-DNA extraction methods, coupled with selective amplification of the smallest extracted fragments with molecular markers, could equally bring a breakthrough in food genomics: the identification of original components in food. Amplified fragment length polymorphisms (AFLPs) have been instrumental in plant genomics because they may allow rapid and reliable analysis of multiple and potentially polymorphic sites. Nevertheless, their direct application to the analysis of DNA extracted from food matrixes is complicated by the low quality of the extracted DNA: its high degradation and the presence of inhibitors of enzymatic reactions. The conversion of an AFLP fragment to a robust and specific single-locus PCR-based marker, therefore, could extend the use of molecular markers to large-scale analysis of complex agro-food matrixes. The present study reports the development of sequence characterized amplified regions (SCARs) starting from AFLP profiles of monovarietal olive oils analyzed on agarose gel; one of these was used to identify differences among 56 olive cultivars. All the developed markers were purposefully amplified in olive oils to apply them to olive oil traceability.
[Genome editing of industrial microorganism].
Zhu, Linjiang; Li, Qi
2015-03-01
Genome editing is defined as highly effective and precise modification of the cellular genome on a large scale. In recent years, genome-editing methods have been developed rapidly in the field of industrial strain improvement. These rapidly evolving methods thoroughly change the old, inefficient mode of genetic modification, which is "one modification, one selection marker, and one target site". Highly effective modes of genome editing have been developed, including simultaneous modification of multiple genes; highly effective insertion, replacement, and deletion of target genes at the genome scale; and cut-and-paste of large DNA fragments. These new tools for microbial genome editing will certainly be applied widely, increase the efficiency of industrial strain improvement, and promote the revolution of the traditional fermentation industry and the rapid development of novel industrial biotechnology such as the production of biofuels and biomaterials. This review summarizes the technological principles of these genome-editing methods and their applications, which can benefit the engineering and construction of industrial microorganisms.
Phylogenomics of plant genomes: a methodology for genome-wide searches for orthologs in plants
Conte, Matthieu G; Gaillard, Sylvain; Droc, Gaetan; Perin, Christophe
2008-01-01
Background Gene ortholog identification is now a major objective for mining the increasing amount of sequence data generated by complete or partial genome sequencing projects. Comparative and functional genomics urgently need a method for ortholog detection to reduce gene function inference and to aid in the identification of conserved or divergent genetic pathways between several species. As gene functions change during evolution, reconstructing the evolutionary history of genes should be a more accurate way to differentiate orthologs from paralogs. Phylogenomics takes into account phylogenetic information from high-throughput genome annotation and is the most straightforward way to infer orthologs. However, procedures for automatic detection of orthologs are still scarce and suffer from several limitations. Results We developed a procedure for ortholog prediction between Oryza sativa and Arabidopsis thaliana. Firstly, we established an efficient method to cluster A. thaliana and O. sativa full proteomes into gene families. Then, we developed an optimized phylogenomics pipeline for ortholog inference. We validated the full procedure using test sets of orthologs and paralogs to demonstrate that our method outperforms pairwise methods for ortholog predictions. Conclusion Our procedure achieved a high level of accuracy in predicting ortholog and paralog relationships. Phylogenomic predictions for all validated gene families in both species were easily achieved and we can conclude that our methodology outperforms similarly based methods. PMID:18426584
Identifying Bacterial Immune Evasion Proteins Using Phage Display.
Fevre, Cindy; Scheepmaker, Lisette; Haas, Pieter-Jan
2017-01-01
Methods aimed at identification of immune evasion proteins mainly rely on in silico prediction of sequence or structural homology to known evasion proteins, or use a proteomics-driven approach. Although proven successful, these methods are limited by low efficiency and/or a lack of functional identification. Here we describe a high-throughput genomic strategy to functionally identify bacterial immune evasion proteins using phage display technology. Genomic bacterial DNA is randomly fragmented and ligated into a phage display vector that is used to create a phage display library expressing bacterial secreted and membrane-bound proteins. This library is used to select displayed bacterial secretome proteins that interact with host immune components.
Review of Processing and Analytical Methods for Francisella ...
Journal Article The etiological agent of tularemia, Francisella tularensis, is a resilient organism within the environment and can be acquired many ways (infectious aerosols and dust, contaminated food and water, infected carcasses, and arthropod bites). However, isolating F. tularensis from environmental samples can be challenging due to its nutritionally fastidious and slow-growing nature. In order to determine the current state of the science regarding available processing and analytical methods for detection and recovery of F. tularensis from water and soil matrices, a review of the literature was conducted. During the review, analysis via culture, immunoassays, and genomic identification were the most commonly found methods for F. tularensis detection within environmental samples. Other methods included combined culture and genomic analysis for rapid quantification of viable microorganisms and use of one assay to identify multiple pathogens from a single sample. Gaps in the literature that were identified during this review suggest that further work to integrate culture and genomic identification would advance our ability to detect and to assess the viability of Francisella spp. The optimization of DNA extraction, whole genome amplification with inhibition-resistant polymerases, and multiagent microarray detection would also advance biothreat detection.
Scalable Parameter Estimation for Genome-Scale Biochemical Reaction Networks
Kaltenbacher, Barbara; Hasenauer, Jan
2017-01-01
Mechanistic mathematical modeling of biochemical reaction networks using ordinary differential equation (ODE) models has improved our understanding of small- and medium-scale biological processes. While the same should in principle hold for large- and genome-scale processes, the computational methods for the analysis of ODE models which describe hundreds or thousands of biochemical species and reactions are missing so far. While individual simulations are feasible, the inference of the model parameters from experimental data is computationally too intensive. In this manuscript, we evaluate adjoint sensitivity analysis for parameter estimation in large scale biochemical reaction networks. We present the approach for time-discrete measurement and compare it to state-of-the-art methods used in systems and computational biology. Our comparison reveals a significantly improved computational efficiency and a superior scalability of adjoint sensitivity analysis. The computational complexity is effectively independent of the number of parameters, enabling the analysis of large- and genome-scale models. Our study of a comprehensive kinetic model of ErbB signaling shows that parameter estimation using adjoint sensitivity analysis requires a fraction of the computation time of established methods. The proposed method will facilitate mechanistic modeling of genome-scale cellular processes, as required in the age of omics. PMID:28114351
Do, Hongdo; Molania, Ramyar
2017-01-01
The identification of genomic rearrangements with high sensitivity and specificity using massively parallel sequencing remains a major challenge, particularly in precision medicine and cancer research. Here, we describe a new method for detecting rearrangements, GRIDSS (Genome Rearrangement IDentification Software Suite). GRIDSS is a multithreaded structural variant (SV) caller that performs efficient genome-wide break-end assembly prior to variant calling using a novel positional de Bruijn graph-based assembler. By combining assembly, split read, and read pair evidence using a probabilistic scoring, GRIDSS achieves high sensitivity and specificity on simulated, cell line, and patient tumor data, recently winning SV subchallenge #5 of the ICGC-TCGA DREAM8.5 Somatic Mutation Calling Challenge. On human cell line data, GRIDSS halves the false discovery rate compared to other recent methods while matching or exceeding their sensitivity. GRIDSS identifies nontemplate sequence insertions, microhomologies, and large imperfect homologies, estimates a quality score for each breakpoint, stratifies calls into high or low confidence, and supports multisample analysis. PMID:29097403
RGAugury: a pipeline for genome-wide prediction of resistance gene analogs (RGAs) in plants.
Li, Pingchuan; Quan, Xiande; Jia, Gaofeng; Xiao, Jin; Cloutier, Sylvie; You, Frank M
2016-11-02
Resistance gene analogs (RGAs), such as NBS-encoding proteins, receptor-like protein kinases (RLKs) and receptor-like proteins (RLPs), are potential R-genes that contain specific conserved domains and motifs. Thus, RGAs can be predicted based on their conserved structural features using bioinformatics tools. Computer programs have been developed for the identification of individual domains and motifs from the protein sequences of RGAs but none offer a systematic assessment of the different types of RGAs. A user-friendly and efficient pipeline is needed for large-scale genome-wide RGA predictions of the growing number of sequenced plant genomes. An integrative pipeline, named RGAugury, was developed to automate RGA prediction. The pipeline first identifies RGA-related protein domains and motifs, namely nucleotide binding site (NB-ARC), leucine rich repeat (LRR), transmembrane (TM), serine/threonine and tyrosine kinase (STTK), lysin motif (LysM), coiled-coil (CC) and Toll/Interleukin-1 receptor (TIR). RGA candidates are identified and classified into four major families based on the presence of combinations of these RGA domains and motifs: NBS-encoding, TM-CC, and membrane associated RLP and RLK. All time-consuming analyses of the pipeline are parallelized to improve performance. The pipeline was evaluated using the well-annotated Arabidopsis genome. A total of 98.5, 85.2, and 100 % of the reported NBS-encoding genes, membrane associated RLPs and RLKs were validated, respectively. The pipeline was also successfully applied to predict RGAs for 50 sequenced plant genomes. A user-friendly web interface was implemented to ease command line operations, facilitate visualization and simplify result management for multiple datasets. RGAugury is an efficient, integrative bioinformatics tool for large scale genome-wide identification of RGAs. It is freely available at Bitbucket: https://bitbucket.org/yaanlpc/rgaugury.
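The classification step described above, which assigns a candidate to a family from the combination of domains it carries, can be sketched as a small rule table. The rules below are assumptions chosen for illustration and are not taken from RGAugury: the abstract names the four families and the domains but does not state which combinations define each family. The function name `classify_rga` is likewise hypothetical.

```python
def classify_rga(domains):
    """Assign an RGA family from a collection of detected domain labels.

    Illustrative rules only (assumed, not RGAugury's actual logic):
      NB-ARC present                     -> "NBS-encoding"
      TM + STTK + (LRR or LysM)          -> "RLK"
      TM + (LRR or LysM), no STTK kinase -> "RLP"
      TM + CC                            -> "TM-CC"
    Returns None when no rule matches.
    """
    found = set(domains)
    if "NB-ARC" in found:
        return "NBS-encoding"
    has_tm = "TM" in found
    has_ectodomain = bool(found & {"LRR", "LysM"})
    if has_tm and "STTK" in found and has_ectodomain:
        return "RLK"
    if has_tm and has_ectodomain:
        return "RLP"
    if has_tm and "CC" in found:
        return "TM-CC"
    return None
```

Ordering the checks from most to least specific keeps each candidate in exactly one family, which is one simple way to avoid double counting when domains overlap.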
As genomics advances reveal the cancer gene landscape, a daunting task is to understand how these genes contribute to dysregulated oncogenic pathways. Integration of cancer genes into networks offers opportunities to reveal protein–protein interactions (PPIs) with functional and therapeutic significance. Here, we report the generation of a cancer-focused PPI network, termed OncoPPi, and identification of >260 cancer-associated PPIs not in other large-scale interactomes.
The CRISPR-Cas9 system has revolutionized gene editing both at single genes and in multiplexed loss-of-function screens, thus enabling precise genome-scale identification of genes essential for proliferation and survival of cancer cells. However, previous studies have reported that a gene-independent antiproliferative effect of Cas9-mediated DNA cleavage confounds such measurement of genetic dependency, thereby leading to false-positive results in copy number-amplified regions.
Parallel human genome analysis: microarray-based expression monitoring of 1000 genes.
Schena, M; Shalon, D; Heller, R; Chai, A; Brown, P O; Davis, R W
1996-01-01
Microarrays containing 1046 human cDNAs of unknown sequence were printed on glass with high-speed robotics. These 1.0-cm2 DNA "chips" were used to quantitatively monitor differential expression of the cognate human genes using a highly sensitive two-color hybridization assay. Array elements that displayed differential expression patterns under given experimental conditions were characterized by sequencing. The identification of known and novel heat shock and phorbol ester-regulated genes in human T cells demonstrates the sensitivity of the assay. Parallel gene analysis with microarrays provides a rapid and efficient method for large-scale human gene discovery. Images Fig. 1 Fig. 2 Fig. 3 PMID:8855227
Zhao, Mengran; Hsiang, Tom; Feng, Xiaoxing
2016-01-01
Noncoding RNAs (ncRNAs) have been identified in many fungi. However, no genome-scale inventory of ncRNAs has been reported for basidiomycetes. In this research, we detected 254 small noncoding RNAs (sncRNAs) in a genome assembly of an isolate (CCEF00389) of Pleurotus ostreatus, an edible basidiomycetous fungus that is widely cultivated worldwide. The identified sncRNAs include snRNAs, snoRNAs, tRNAs, and miRNAs. SnRNA U1 was not found by BLASTn in the CCEF00389 genome assembly or in some other basidiomycetous genomes. This implies that if snRNA U1 of basidiomycetes exists, it has a sequence that varies significantly from other organisms. By analyzing the distribution of sncRNA loci, we found that snRNAs and most tRNAs (88.6%) were located in pseudo-UTR regions, while miRNAs are commonly found in introns. To analyze the evolutionary conservation of the sncRNAs in P. ostreatus, we aligned all 254 sncRNAs to the genome assemblies of some other Agaricomycotina fungi. The results suggest that most sncRNAs (77.56%) were highly conserved in P. ostreatus, and 20% were conserved in Agaricomycotina fungi. These findings indicate that most sncRNAs of P. ostreatus were not conserved across Agaricomycotina fungi. PMID:27703969
Leach, Verity; Tonkin, Emma; Lancastle, Deborah; Kirk, Maggie
2016-06-01
Genomics is an ever increasing aspect of nursing practice, with focus being directed towards improving health. The authors present an implementation strategy for the incorporation of genomics into nursing practice within the UK, based on three behaviour change theories and the identification of individuals who are likely to provide support for change. Individuals identified as Opinion Leaders and Adopters of genomics illustrate how changes in behaviour might occur among the nursing profession. The core philosophy of the strategy is that genomic nurse Adopters and Opinion Leaders who have direct interaction with their peers in practice will be best placed to highlight the importance of genomics within the nursing role. The strategy discussed in this paper provides scope for continued nursing education and development of genomics within nursing practice on a larger scale. The recommendations might be of particular relevance for senior staff and management. © 2016 John Wiley & Sons Australia, Ltd.
Garg, Aprajita; Wesolowski, Donna; Alonso, Dulce; Deitsch, Kirk W; Ben Mamoun, Choukri; Altman, Sidney
2015-09-22
Identification and genetic validation of new targets from available genome sequences are critical steps toward the development of new potent and selective antimalarials. However, no methods are currently available for large-scale functional analysis of the Plasmodium falciparum genome. Here we present evidence for successful use of morpholino oligomers (MO) to mediate degradation of target mRNAs or to inhibit RNA splicing or translation of several genes of P. falciparum involved in chloroquine transport, apicoplast biogenesis, and phospholipid biosynthesis. Consistent with their role in the parasite life cycle, down-regulation of these essential genes resulted in inhibition of parasite development. We show that a MO conjugate that targets the chloroquine-resistant transporter PfCRT is effective against chloroquine-sensitive and -resistant parasites, causes enlarged digestive vacuoles, and renders chloroquine-resistant strains more sensitive to chloroquine. Similarly, we show that a MO conjugate that targets the PfDXR involved in apicoplast biogenesis inhibits parasite growth and that this defect can be rescued by addition of isopentenyl pyrophosphate. MO-based gene regulation is a viable alternative approach to functional analysis of the P. falciparum genome.
van Leeuwen, Elisabeth M; Sabo, Aniko; Bis, Joshua C; Huffman, Jennifer E; Manichaikul, Ani; Smith, Albert V; Feitosa, Mary F; Demissie, Serkalem; Joshi, Peter K; Duan, Qing; Marten, Jonathan; van Klinken, Jan B; Surakka, Ida; Nolte, Ilja M; Zhang, Weihua; Mbarek, Hamdi; Li-Gao, Ruifang; Trompet, Stella; Verweij, Niek; Evangelou, Evangelos; Lyytikäinen, Leo-Pekka; Tayo, Bamidele O; Deelen, Joris; van der Most, Peter J; van der Laan, Sander W; Arking, Dan E; Morrison, Alanna; Dehghan, Abbas; Franco, Oscar H; Hofman, Albert; Rivadeneira, Fernando; Sijbrands, Eric J; Uitterlinden, Andre G; Mychaleckyj, Josyf C; Campbell, Archie; Hocking, Lynne J; Padmanabhan, Sandosh; Brody, Jennifer A; Rice, Kenneth M; White, Charles C; Harris, Tamara; Isaacs, Aaron; Campbell, Harry; Lange, Leslie A; Rudan, Igor; Kolcic, Ivana; Navarro, Pau; Zemunik, Tatijana; Salomaa, Veikko; Kooner, Angad S; Kooner, Jaspal S; Lehne, Benjamin; Scott, William R; Tan, Sian-Tsung; de Geus, Eco J; Milaneschi, Yuri; Penninx, Brenda W J H; Willemsen, Gonneke; de Mutsert, Renée; Ford, Ian; Gansevoort, Ron T; Segura-Lepe, Marcelo P; Raitakari, Olli T; Viikari, Jorma S; Nikus, Kjell; Forrester, Terrence; McKenzie, Colin A; de Craen, Anton J M; de Ruijter, Hester M; Pasterkamp, Gerard; Snieder, Harold; Oldehinkel, Albertine J; Slagboom, P Eline; Cooper, Richard S; Kähönen, Mika; Lehtimäki, Terho; Elliott, Paul; van der Harst, Pim; Jukema, J Wouter; Mook-Kanamori, Dennis O; Boomsma, Dorret I; Chambers, John C; Swertz, Morris; Ripatti, Samuli; Willems van Dijk, Ko; Vitart, Veronique; Polasek, Ozren; Hayward, Caroline; Wilson, James G; Wilson, James F; Gudnason, Vilmundur; Rich, Stephen S; Psaty, Bruce M; Borecki, Ingrid B; Boerwinkle, Eric; Rotter, Jerome I; Cupples, L Adrienne; van Duijn, Cornelia M
2016-01-01
Background So far, more than 170 loci have been associated with circulating lipid levels through genome-wide association studies (GWAS). These associations are largely driven by common variants, their function is often not known, and many are likely to be markers for the causal variants. In this study we aimed to identify more new rare and low-frequency functional variants associated with circulating lipid levels. Methods We used the 1000 Genomes Project as a reference panel for the imputations of GWAS data from ∼60 000 individuals in the discovery stage and ∼90 000 samples in the replication stage. Results Our study resulted in the identification of five new associations with circulating lipid levels at four loci. All four loci are within genes that can be linked biologically to lipid metabolism. One of the variants, rs116843064, is a damaging missense variant within the ANGPTL4 gene. Conclusions This study illustrates that GWAS with high-scale imputation may still help us unravel the biological mechanism behind circulating lipid levels. PMID:27036123
Evaluation of Quality Assessment Protocols for High Throughput Genome Resequencing Data
Chiara, Matteo; Pavesi, Giulio
2017-01-01
Large-scale initiatives aiming to recover the complete sequence of thousands of human genomes are currently being undertaken worldwide, contributing to the generation of a comprehensive catalog of human genetic variation. The ultimate and most ambitious goal of human population scale genomics is the characterization of the so-called human “variome,” through the identification of causal mutations or haplotypes. Several research institutions worldwide currently use genotyping assays based on Next-Generation Sequencing (NGS) for diagnostics and clinical screenings, and the widespread application of such technologies promises major revolutions in medical science. Bioinformatic analysis of human resequencing data is one of the main factors limiting the effectiveness and general applicability of NGS for clinical studies. The requirement for multiple tools, to be combined in dedicated protocols in order to accommodate different types of data (gene panels, exomes, or whole genomes), and the high variability of the data make it difficult to establish an ultimate strategy of general use. While there already exist several studies comparing sensitivity and accuracy of bioinformatic pipelines for the identification of single nucleotide variants from resequencing data, little is known about the impact of quality assessment and read pre-processing strategies. In this work we discuss major strengths and limitations of the various genome resequencing protocols that are currently used in molecular diagnostics and for the discovery of novel disease-causing mutations. By taking advantage of publicly available data we devise and suggest a series of best practices for the pre-processing of the data that consistently improve the outcome of genotyping with minimal impacts on computational costs. PMID:28736571
Sserwadda, Ivan; Amujal, Marion; Namatovu, Norah
2018-01-01
HIV/AIDS, tuberculosis (TB), and malaria are 3 major global public health threats that undermine development in many resource-poor settings. Recently, the notion that positive selection during epidemics or longer periods of exposure to common infectious diseases may have had a major effect in modifying the constitution of the human genome is being interrogated at a large scale in many populations around the world. This positive selection from infectious diseases increases power to detect associations in genome-wide association studies (GWASs). High-throughput sequencing (HTS) has transformed the management of infectious diseases and continues to enable large-scale functional characterization of host resistance/susceptibility alleles and loci, a paradigm shift from single candidate gene studies. Application of genome sequencing technologies and genomics has enabled us to interrogate the host-pathogen interface for improving human health. Human populations are constantly locked in evolutionary arms races with pathogens; therefore, identification of common infectious disease-associated genomic variants/markers is important in therapeutic and vaccine development and in screening susceptible individuals in a population. This review describes a range of host-pathogen genomic loci that have been associated with disease susceptibility and resistance patterns in the era of HTS. We further highlight potential opportunities for these genetic markers. PMID:29755620
Complex multifractal nature in Mycobacterium tuberculosis genome
Mandal, Saurav; Roychowdhury, Tanmoy; Chirom, Keilash; Bhattacharya, Alok; Brojen Singh, R. K.
2017-01-01
The multifractal and long-range correlation (C(r)) properties of strings, such as nucleotide sequences, can be useful parameters for the identification of underlying patterns and variations. In this study, C(r) and the multifractal singularity function f(α) have been used to study variations in the genomes of the pathogenic bacterium Mycobacterium tuberculosis. Genomic sequences of M. tuberculosis isolates displayed significant variations in C(r) and f(α), reflecting inherent differences in sequences among isolates. M. tuberculosis isolates can be categorised into different subgroups based on sensitivity to drugs: DS (drug sensitive isolates), MDR (multi-drug resistant isolates) and XDR (extremely drug resistant isolates). C(r) follows significantly different scaling rules in different subgroups of isolates, but all the isolates follow a one-parameter scaling law. The richness in complexity of each subgroup can be quantified by multifractal parameters, which display a pattern in which XDR isolates have the highest value and drug sensitive isolates the lowest. Therefore, C(r) and multifractal functions can be useful parameters for the analysis of genomic sequences. PMID:28440326
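The abstract does not define C(r). As a minimal sketch, assuming one common convention (the autocovariance at lag r of a 0/1 indicator sequence for a chosen base), a correlation value for a nucleotide string could be computed as follows. The function name, the indicator mapping, and the choice of base are illustrative assumptions, not the authors' method.

```python
def correlation(seq, r, base="G"):
    """Autocovariance at lag r of the 0/1 indicator of `base` in `seq`.

    Assumed definition (not taken from the paper):
        C(r) = <x_i * x_{i+r}> - <x_i> * <x_{i+r}>
    where x_i = 1 if position i holds `base`, else 0.
    """
    x = [1.0 if c == base else 0.0 for c in seq.upper()]
    n = len(x) - r  # number of position pairs separated by r
    if r < 0 or n <= 0:
        raise ValueError("lag r must be non-negative and shorter than the sequence")
    mean_head = sum(x[:n]) / n   # mean over the first member of each pair
    mean_tail = sum(x[r:]) / n   # mean over the second member of each pair
    joint = sum(x[i] * x[i + r] for i in range(n)) / n
    return joint - mean_head * mean_tail


if __name__ == "__main__":
    periodic = "GA" * 50
    for lag in (1, 2, 3, 4):
        print(lag, correlation(periodic, lag))
```

On a strictly periodic toy sequence the value alternates in sign with the lag; for real genomes one would examine how C(r) decays with r, which is the scaling behaviour the abstract refers to.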
IDENTIFICATION OF CHICKEN-SPECIFIC FECAL MICROBIAL SEQUENCES USING A METAGENOMIC APPROACH
In this study, we applied a genome fragment enrichment (GFE) method to select for genomic regions that differ between different fecal metagenomes. Competitive DNA hybridizations were performed between chicken fecal DNA and pig fecal DNA (C-P) and between chicken fecal DNA and an ...
Chechetkin, V R; Lobzin, V V
2017-08-07
State-of-the-art techniques combining imaging methods and high-throughput genomic mapping tools have led to significant progress in detailing the chromosome architecture of various organisms. However, a gap still remains between the rapidly growing structural data on chromosome folding and large-scale genome organization. Could part of the information on chromosome folding be obtained directly from the underlying genomic DNA sequences abundantly stored in databanks? To answer this question, we developed an original discrete double Fourier transform (DDFT). DDFT serves for the detection of large-scale genome regularities associated with domains/units at the different levels of hierarchical chromosome folding. The method is versatile and can be applied to both genomic DNA sequences and corresponding physico-chemical parameters such as base-pairing free energy. The latter characteristic is closely related to replication and transcription and can also be used for the assessment of temperature or supercoiling effects on chromosome folding. We tested the method on the genome of E. coli K-12 and found good correspondence with the annotated domains/units established experimentally. As a brief illustration of further abilities of DDFT, a study of large-scale genome organization for bacteriophage PHIX174 and the bacterium Caulobacter crescentus was also added. The combined experimental, modeling, and bioinformatic DDFT analysis should yield more complete knowledge of chromosome architecture and genome organization. Copyright © 2017 Elsevier Ltd. All rights reserved.
Dessimoz, Christophe; Zoller, Stefan; Manousaki, Tereza; Qiu, Huan; Meyer, Axel; Kuraku, Shigehiro
2011-09-01
Recent development of deep sequencing technologies has facilitated de novo genome sequencing projects, now conducted even by individual laboratories. However, this will yield more and more genome sequences that are not well assembled, and will hinder thorough annotation when no closely related reference genome is available. One of the challenging issues is the identification of protein-coding sequences split into multiple unassembled genomic segments, which can confound orthology assignment and various laboratory experiments requiring the identification of individual genes. In this study, using the genome of a cartilaginous fish, Callorhinchus milii, as test case, we performed gene prediction using a model specifically trained for this genome. We implemented an algorithm, designated ESPRIT, to identify possible linkages between multiple protein-coding portions derived from a single genomic locus split into multiple unassembled genomic segments. We developed a validation framework based on an artificially fragmented human genome, improvements between early and recent mouse genome assemblies, comparison with experimentally validated sequences from GenBank, and phylogenetic analyses. Our strategy provided insights into practical solutions for efficient annotation of only partially sequenced (low-coverage) genomes. To our knowledge, our study is the first formulation of a method to link unassembled genomic segments based on proteomes of relatively distantly related species as references.
Zoller, Stefan; Manousaki, Tereza; Qiu, Huan; Meyer, Axel; Kuraku, Shigehiro
2011-01-01
Recent development of deep sequencing technologies has facilitated de novo genome sequencing projects, now conducted even by individual laboratories. However, this will yield more and more genome sequences that are not well assembled, and will hinder thorough annotation when no closely related reference genome is available. One of the challenging issues is the identification of protein-coding sequences split into multiple unassembled genomic segments, which can confound orthology assignment and various laboratory experiments requiring the identification of individual genes. In this study, using the genome of a cartilaginous fish, Callorhinchus milii, as test case, we performed gene prediction using a model specifically trained for this genome. We implemented an algorithm, designated ESPRIT, to identify possible linkages between multiple protein-coding portions derived from a single genomic locus split into multiple unassembled genomic segments. We developed a validation framework based on an artificially fragmented human genome, improvements between early and recent mouse genome assemblies, comparison with experimentally validated sequences from GenBank, and phylogenetic analyses. Our strategy provided insights into practical solutions for efficient annotation of only partially sequenced (low-coverage) genomes. To our knowledge, our study is the first formulation of a method to link unassembled genomic segments based on proteomes of relatively distantly related species as references. PMID:21712341
HTSFinder: Powerful Pipeline of DNA Signature Discovery by Parallel and Distributed Computing
Karimi, Ramin; Hajdu, Andras
2016-01-01
Comprehensive efforts toward low-cost sequencing in the past few years have led to the growth of complete genome databases. In parallel with this effort, and in response to a strong need, fast and cost-effective methods and applications have been developed to accelerate sequence analysis. Identification is the very first step of this task. Due to the difficulties, high costs, and computational challenges of alignment-based approaches, an alternative universal identification method is highly desirable. As an alignment-free approach, DNA signatures have provided new opportunities for the rapid identification of species. In this paper, we present an effective pipeline, HTSFinder (high-throughput signature finder), with a corresponding k-mer generator, GkmerG (genome k-mers generator). Using this pipeline, we determine the frequency of k-mers from the available complete genome databases for the detection of extensive DNA signatures in a reasonably short time. Our application can detect both unique and common signatures in arbitrarily selected target and nontarget databases. Hadoop and MapReduce are used in this pipeline as parallel and distributed computing tools running on commodity hardware. This approach brings the power of high-performance computing to ordinary desktop personal computers for discovering DNA signatures in large databases such as bacterial genomes. The considerable number of unique and common DNA signatures detected in the target database creates opportunities to improve the identification process not only for polymerase chain reaction and microarray assays but also for more complex scenarios such as metagenomics and next-generation sequencing analysis. PMID:26884678
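The idea of k-mer based signatures can be illustrated with a small in-memory sketch. It assumes that a "common" signature is a k-mer present in every target genome and that a "unique" signature is a common k-mer absent from all nontarget genomes; these definitions, and the function names kmers and signatures, are illustrative assumptions. The sketch does not reproduce HTSFinder, which runs on Hadoop/MapReduce, and it ignores reverse complements.

```python
def kmers(seq, k):
    """Return the set of all length-k substrings of seq (upper-cased)."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def signatures(targets, nontargets, k):
    """Return (common, unique) k-mer sets under the assumed definitions.

    common: k-mers found in every target sequence.
    unique: common k-mers that occur in no nontarget sequence.
    """
    target_sets = [kmers(s, k) for s in targets]
    common = set.intersection(*target_sets) if target_sets else set()
    background = set().union(*(kmers(s, k) for s in nontargets))
    return common, common - background


if __name__ == "__main__":
    common, unique = signatures(["ACGTAC", "TACGTT"], ["GGACGG"], 3)
    print(sorted(common), sorted(unique))
```

Holding whole k-mer sets in memory only works for small inputs; at the scale of complete genome databases the counting step is what a distributed framework would take over.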
Hitomi, Yuki; Tokunaga, Katsushi
2017-01-01
Human genome variation may cause differences in traits and disease risks. Disease-causal/susceptible genes and variants for both common and rare diseases can be detected by comprehensive whole-genome analyses, such as whole-genome sequencing (WGS), using next-generation sequencing (NGS) technology and genome-wide association studies (GWAS). Here, in addition to the application of an NGS as a whole-genome analysis method, we summarize approaches for the identification of functional disease-causal/susceptible variants from abundant genetic variants in the human genome and methods for evaluating their functional effects in human diseases, using an NGS and in silico and in vitro functional analyses. We also discuss the clinical applications of the functional disease causal/susceptible variants to personalized medicine.
CRISPR Approaches to Small Molecule Target Identification. | Office of Cancer Genomics
A long-standing challenge in drug development is the identification of the mechanisms of action of small molecules with therapeutic potential. A number of methods have been developed to address this challenge, each with inherent strengths and limitations. We here provide a brief review of these methods with a focus on chemical-genetic methods that are based on systematically profiling the effects of genetic perturbations on drug sensitivity.
Identifying genetic relatives without compromising privacy
He, Dan; Furlotte, Nicholas A.; Hormozdiari, Farhad; Joo, Jong Wha J.; Wadia, Akshay; Ostrovsky, Rafail; Sahai, Amit; Eskin, Eleazar
2014-01-01
The development of high-throughput genomic technologies has impacted many areas of genetic research. While many applications of these technologies focus on the discovery of genes involved in disease from population samples, applications of genomic technologies to an individual’s genome or personal genomics have recently gained much interest. One such application is the identification of relatives from genetic data. In this application, genetic information from a set of individuals is collected in a database, and each pair of individuals is compared in order to identify genetic relatives. An inherent issue that arises in the identification of relatives is privacy. In this article, we propose a method for identifying genetic relatives without compromising privacy by taking advantage of novel cryptographic techniques customized for secure and private comparison of genetic information. We demonstrate the utility of these techniques by allowing a pair of individuals to discover whether or not they are related without compromising their genetic information or revealing it to a third party. The idea is that individuals only share enough special-purpose cryptographically protected information with each other to identify whether or not they are relatives, but not enough to expose any information about their genomes. We show in HapMap and 1000 Genomes data that our method can recover first- and second-order genetic relationships and, through simulations, show that our method can identify relationships as distant as third cousins while preserving privacy. PMID:24614977
2012-01-01
Background: A single-step blending approach allows genomic prediction using information of genotyped and non-genotyped animals simultaneously. However, the combined relationship matrix in a single-step method may need to be adjusted because marker-based and pedigree-based relationship matrices may not be on the same scale. The same may apply when a GBLUP model includes both genomic breeding values and residual polygenic effects. The objective of this study was to compare single-step blending methods and GBLUP methods with and without adjustment of the genomic relationship matrix for genomic prediction of 16 traits in the Nordic Holstein population. Methods: The data consisted of de-regressed proofs (DRP) for 5 214 genotyped and 9 374 non-genotyped bulls. The bulls were divided into a training and a validation population by birth date, October 1, 2001. Five approaches for genomic prediction were used: 1) a simple GBLUP method, 2) a GBLUP method with a polygenic effect, 3) an adjusted GBLUP method with a polygenic effect, 4) a single-step blending method, and 5) an adjusted single-step blending method. In the adjusted GBLUP and single-step methods, the genomic relationship matrix was adjusted for the difference of scale between the genomic and the pedigree relationship matrices. A set of weights on the pedigree relationship matrix (ranging from 0.05 to 0.40) was used to build the combined relationship matrix in the single-step blending method and the GBLUP method with a polygenic effect. Results: Averaged over the 16 traits, reliabilities of genomic breeding values predicted using the GBLUP method with a polygenic effect (relative weight of 0.20) were 0.3% higher than reliabilities from the simple GBLUP method (without a polygenic effect). The adjusted single-step blending and original single-step blending methods (relative weight of 0.20) had average reliabilities that were 2.1% and 1.8% higher than the simple GBLUP method, respectively.
In addition, the GBLUP method with a polygenic effect led to less bias of genomic predictions than the simple GBLUP method, and both single-step blending methods yielded less bias of predictions than all GBLUP methods. Conclusions The single-step blending method is an appealing approach for practical genomic prediction in dairy cattle. Genomic prediction from the single-step blending method can be improved by adjusting the scale of the genomic relationship matrix. PMID:22455934
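The weighting of the pedigree relationship matrix described in the Methods can be sketched as a convex combination, G_w = (1 - w) G + w A. This form is an assumption (it is a widely used blending formula, but the abstract does not state the exact expression), and the scale adjustment of G mentioned in the abstract is not reproduced here. The function name blend and the toy matrices are illustrative.

```python
def blend(G, A, w):
    """Combine a genomic matrix G and a pedigree matrix A with weight w on A.

    Assumed formula: G_w = (1 - w) * G + w * A, with 0 <= w <= 1.
    Matrices are plain lists of rows with identical shapes.
    """
    if not 0.0 <= w <= 1.0:
        raise ValueError("weight w must lie in [0, 1]")
    if len(G) != len(A) or any(len(g) != len(a) for g, a in zip(G, A)):
        raise ValueError("G and A must have the same shape")
    return [[(1.0 - w) * g + w * a for g, a in zip(g_row, a_row)]
            for g_row, a_row in zip(G, A)]


if __name__ == "__main__":
    G = [[1.0, 0.2], [0.2, 1.0]]   # toy genomic relationships
    A = [[1.0, 0.5], [0.5, 1.0]]   # toy pedigree relationships
    for w in (0.05, 0.20, 0.40):   # range of weights quoted in the abstract
        print(w, blend(G, A, w))
```

With w = 0.20, the relative weight highlighted in the Results, each blended entry lies one fifth of the way from the genomic value to the pedigree value.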
High resolution identity testing of inactivated poliovirus vaccines.
Mee, Edward T; Minor, Philip D; Martin, Javier
2015-07-09
Definitive identification of poliovirus strains in vaccines is essential for quality control, particularly where multiple wild-type and Sabin strains are produced in the same facility. Sequence-based identification provides the ultimate in identity testing and would offer several advantages over serological methods. We employed random RT-PCR and high throughput sequencing to recover full-length genome sequences from monovalent and trivalent poliovirus vaccine products at various stages of the manufacturing process. All expected strains were detected in previously characterised products and the method permitted identification of strains comprising as little as 0.1% of sequence reads. Highly similar Mahoney and Sabin 1 strains were readily discriminated on the basis of specific variant positions. Analysis of a product known to contain incorrect strains demonstrated that the method correctly identified the contaminants. Random RT-PCR and shotgun sequencing provided high resolution identification of vaccine components. In addition to the recovery of full-length genome sequences, the method could also be easily adapted to the characterisation of minor variant frequencies and distinction of closely related products on the basis of distinguishing consensus and low frequency polymorphisms. Copyright © 2015 The Authors. Published by Elsevier Ltd. All rights reserved.
Nikolova, Olga; Moser, Russell; Kemp, Christopher; Gönen, Mehmet; Margolin, Adam A
2017-05-01
In recent years, vast advances in biomedical technologies and comprehensive sequencing have revealed the genomic landscape of common forms of human cancer in unprecedented detail. The broad heterogeneity of the disease calls for rapid development of personalized therapies. Translating the readily available genomic data into useful knowledge that can be applied in the clinic remains a challenge. Computational methods are needed to aid these efforts by robustly analyzing genome-scale data from distinct experimental platforms for prioritization of targets and treatments. We propose a novel, biologically motivated, Bayesian multitask approach, which explicitly models gene-centric dependencies across multiple and distinct genomic platforms. We introduce a gene-wise prior and present a fully Bayesian formulation of a group factor analysis model. In supervised prediction applications, our multitask approach leverages similarities in response profiles of groups of drugs that are more likely to be related to true biological signal, which leads to more robust performance and improved generalization ability. We evaluate the performance of our method on molecularly characterized collections of cell lines profiled against two compound panels, namely the Cancer Cell Line Encyclopedia and the Cancer Therapeutics Response Portal. We demonstrate that accounting for the gene-centric dependencies enables leveraging information from multi-omic input data and improves prediction and feature selection performance. We further demonstrate the applicability of our method in an unsupervised dimensionality reduction application by inferring genes essential to tumorigenesis in the pancreatic ductal adenocarcinoma and lung adenocarcinoma patient cohorts from The Cancer Genome Atlas. Availability: The code for this work is available at https://github.com/olganikolova/gbgfa. Contact: nikolova@ohsu.edu or margolin@ohsu.edu. Supplementary data are available at Bioinformatics online. © The Author 2017. Published by Oxford University Press.
Guo, Bingfu; Guo, Yong; Hong, Huilong; Qiu, Li-Juan
2016-01-01
Molecular characterization of the sequences flanking exogenous fragment insertions is essential for the safety assessment and labeling of genetically modified organisms (GMOs). In this study, the T-DNA insertion sites and flanking sequences were identified in two newly developed transgenic glyphosate-tolerant soybeans, GE-J16 and ZH10-6, using a whole genome sequencing (WGS) method. More than 22.4 Gb of sequence data (∼21× coverage) was generated for each line on the Illumina HiSeq 2500 platform. The junction reads mapped to the boundaries of T-DNA and flanking sequences in these two events were identified by comparing all sequencing reads with the soybean reference genome and the sequence of the transgenic vector. The putative insertion loci and flanking sequences were further confirmed by PCR amplification, Sanger sequencing, and co-segregation analysis. All these analyses supported that the exogenous T-DNA fragments were integrated at positions Chr19: 50543767-50543792 and Chr17: 7980527-7980541 in these two transgenic lines. Identification of the genomic insertion sites of the G2-EPSPS and GAT transgenes will facilitate the utilization of their glyphosate-tolerant traits in soybean breeding programs. These results also demonstrate that WGS is a cost-effective and rapid method for identifying sites of T-DNA insertions and flanking sequences in soybean.
FluReF, an automated flu virus reassortment finder based on phylogenetic trees.
Yurovsky, Alisa; Moret, Bernard M E
2011-01-01
Reassortments are events in the evolution of the genome of influenza (flu), whereby segments of the genome are exchanged between different strains. As reassortments have been implicated in major human pandemics of the last century, their identification has become a health priority. While such identification can be done "by hand" on a small dataset, researchers and health authorities are building up enormous databases of genomic sequences for every flu strain, so that it is imperative to develop automated identification methods. However, current methods are limited to pairwise segment comparisons. We present FluReF, a fully automated flu virus reassortment finder. FluReF is inspired by the visual approach to reassortment identification and uses the reconstructed phylogenetic trees of the individual segments and of the full genome. We also present a simple flu evolution simulator, based on the current, source-sink, hypothesis for flu cycles. On synthetic datasets produced by our simulator, FluReF, tuned for a 0% false positive rate, yielded false negative rates of less than 10%. FluReF corroborated two new reassortments identified by visual analysis of 75 Human H3N2 New York flu strains from 2005-2008 and gave partial verification of reassortments found using another bioinformatics method. FluReF finds reassortments by a bottom-up search of the full-genome and segment-based phylogenetic trees for candidate clades--groups of one or more sampled viruses that are separated from the other variants from the same season. Candidate clades in each tree are tested to guarantee confidence values, using the lengths of key edges as well as other tree parameters; clades with reassortments must have validated incongruencies among segment trees. FluReF demonstrates robustness of prediction for geographically and temporally expanded datasets, and is not limited to finding reassortments with previously collected sequences. 
The complete source code is available from http://lcbb.epfl.ch/software.html.
Demir, E; Babur, O; Dogrusoz, U; Gursoy, A; Nisanci, G; Cetin-Atalay, R; Ozturk, M
2002-07-01
The availability of the sequences of entire genomes shifts scientific curiosity towards the large-scale identification of genome function, as in genome studies. In the near future, data produced about cellular processes at the molecular level will accumulate at an accelerating rate as a result of proteomics studies. In this regard, it is essential to develop tools for storing, integrating, accessing, and analyzing these data effectively. We define an ontology for a comprehensive representation of cellular events. The ontology presented here enables integration of fragmented or incomplete pathway information and supports manipulation and incorporation of the stored data, as well as multiple levels of abstraction. Based on this ontology, we present the architecture of an integrated environment named Patika (Pathway Analysis Tool for Integration and Knowledge Acquisition). Patika is composed of a server-side, scalable, object-oriented database and client-side editors to provide an integrated, multi-user environment for visualizing and manipulating networks of cellular events. This tool features automated pathway layout, functional computation support, advanced querying and a user-friendly graphical interface. We expect that Patika will be a valuable tool for rapid knowledge acquisition, interpretation of large-scale microarray-generated data, disease gene identification, and drug development. A prototype of Patika is available upon request from the authors.
Toward the automated generation of genome-scale metabolic networks in the SEED.
DeJongh, Matthew; Formsma, Kevin; Boillot, Paul; Gould, John; Rycenga, Matthew; Best, Aaron
2007-04-26
Current methods for the automated generation of genome-scale metabolic networks focus on genome annotation and preliminary biochemical reaction network assembly, but do not adequately address the process of identifying and filling gaps in the reaction network, and verifying that the network is suitable for systems level analysis. Thus, current methods are only sufficient for generating draft-quality networks, and refinement of the reaction network is still largely a manual, labor-intensive process. We have developed a method for generating genome-scale metabolic networks that produces substantially complete reaction networks, suitable for systems level analysis. Our method partitions the reaction space of central and intermediary metabolism into discrete, interconnected components that can be assembled and verified in isolation from each other, and then integrated and verified at the level of their interconnectivity. We have developed a database of components that are common across organisms, and have created tools for automatically assembling appropriate components for a particular organism based on the metabolic pathways encoded in the organism's genome. This focuses manual efforts on that portion of an organism's metabolism that is not yet represented in the database. We have demonstrated the efficacy of our method by reverse-engineering and automatically regenerating the reaction network from a published genome-scale metabolic model for Staphylococcus aureus. Additionally, we have verified that our method capitalizes on the database of common reaction network components created for S. aureus, by using these components to generate substantially complete reconstructions of the reaction networks from three other published metabolic models (Escherichia coli, Helicobacter pylori, and Lactococcus lactis). We have implemented our tools and database within the SEED, an open-source software environment for comparative genome annotation and analysis. 
Our method sets the stage for the automated generation of substantially complete metabolic networks for over 400 complete genome sequences currently in the SEED. With each genome that is processed using our tools, the database of common components grows to cover more of the diversity of metabolic pathways. This increases the likelihood that components of reaction networks for subsequently processed genomes can be retrieved from the database, rather than assembled and verified manually.
Genome-to-Watershed Predictive Understanding of Terrestrial Environments
NASA Astrophysics Data System (ADS)
Hubbard, S. S.; Agarwal, D.; Banfield, J. F.; Beller, H. R.; Brodie, E.; Long, P.; Nico, P. S.; Steefel, C. I.; Tokunaga, T. K.; Williams, K. H.
2014-12-01
Although terrestrial environments play a critical role in cycling water, greenhouse gases, and other life-critical elements, the complexity of interactions among component microbes, plants, minerals, migrating fluids and dissolved constituents hinders predictive understanding of system behavior. The 'Sustainable Systems 2.0' project is developing genome-to-watershed scale predictive capabilities to quantify how the microbiome affects biogeochemical watershed functioning, how watershed-scale hydro-biogeochemical processes affect microbial functioning, and how these interactions co-evolve with climate and land-use changes. Development of such predictive capabilities is critical for guiding the optimal management of water resources, contaminant remediation, carbon stabilization, and agricultural sustainability - now and with global change. Initial investigations are focused on floodplains in the Colorado River Basin, and include iterative model development, experiments and observations with an early emphasis on subsurface aspects. Field experiments include local-scale experiments at Rifle CO to quantify spatiotemporal metabolic and geochemical responses to O2 and nitrate amendments as well as floodplain-scale monitoring to quantify genomic and biogeochemical response to natural hydrological perturbations. Information obtained from such experiments is represented within GEWaSC, a Genome-Enabled Watershed Simulation Capability, which is being developed to allow mechanistic interrogation of how genomic information stored in a subsurface microbiome affects biogeochemical cycling. This presentation will describe the genome-to-watershed scale approach as well as early highlights associated with the project.
Highlights include: first insights into the diversity of the subsurface microbiome and metabolic roles of organisms involved in subsurface nitrogen, sulfur and hydrogen and carbon cycling; the extreme variability of subsurface DOC and hydrological controls on carbon and nitrogen cycling; geophysical identification of floodplain hotspots that are useful for model parameterization; and GEWaSC demonstration of how incorporation of identified microbial metabolic processes improves prediction of the larger system biogeochemical behavior.
Cheng, Chia-Yang; Chu, Chia-Han; Hsu, Hung-Wei; Hsu, Fang-Rong; Tang, Chung Yi; Wang, Wen-Ching; Kung, Hsing-Jien; Chang, Pei-Ching
2014-01-01
Post-translational modification (PTM) of transcription factors and chromatin remodelling proteins is recognized as a major mechanism by which transcriptional regulation occurs. Chromatin immunoprecipitation (ChIP) in combination with high-throughput sequencing (ChIP-seq) is being applied as a gold standard when studying the genome-wide binding sites of transcription factors (TFs). This has greatly improved our understanding of protein-DNA interactions on a genome-wide scale. However, current ChIP-seq peak calling tools are not sufficiently sensitive and are unable to simultaneously identify post-translational modified TFs based on ChIP-seq analysis; this is largely due to the wide-spread presence of multiple modified TFs. Using SUMO-1 modification as an example, we describe here an improved approach that allows the simultaneous identification of the particular genomic binding regions of all TFs with SUMO-1 modification. Traditional peak calling methods are inadequate when identifying multiple TF binding sites that involve long genomic regions and therefore we designed a ChIP-seq processing pipeline for the detection of peaks via a combinatorial fusion method. Then, we annotate the peaks with known transcription factor binding sites (TFBS) using the Transfac Matrix Database (v7.0), which predicts potential SUMOylated TFs. Next, the peak calling result was further analyzed based on the promoter proximity, TFBS annotation, a literature review, and was validated by ChIP-real-time quantitative PCR (qPCR) and ChIP-reChIP real-time qPCR. The results show clearly that SUMOylated TFs are able to be pinpointed using our pipeline. A methodology is presented that analyzes SUMO-1 ChIP-seq patterns and predicts related TFs. Our analysis uses three peak calling tools. The fusion of these different tools increases the precision of the peak calling results. The TFBS annotation method is able to predict potential SUMOylated TFs. Here, we offer a new approach that enhances ChIP-seq data analysis and allows the identification of multiple SUMOylated TF binding sites simultaneously, which can then be utilized for other functional PTM binding site prediction in future.
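The idea of fusing the output of several peak callers can be illustrated with a simple voting scheme. The sketch below is only an assumption about what such a fusion might look like, not the authors' combinatorial fusion method: the function name `consensus_regions`, the caller labels and the `min_callers` threshold are all hypothetical. It keeps the genomic regions (half-open intervals on one chromosome) that are covered by peaks from at least `min_callers` different tools.

```python
from collections import defaultdict


def consensus_regions(peaks_by_caller, min_callers=2):
    """Return regions covered by peaks from at least `min_callers` callers.

    peaks_by_caller maps a caller name to a list of (start, end) half-open
    intervals on a single chromosome. This is a plain voting scheme used
    for illustration, not the published fusion procedure.
    """
    events = []
    for caller, peaks in peaks_by_caller.items():
        for start, end in peaks:
            events.append((start, 1, caller))   # a peak opens
            events.append((end, -1, caller))    # a peak closes
    # Sort by position; at equal positions closings (-1) come before openings (+1).
    events.sort(key=lambda e: (e[0], e[1]))

    depth = defaultdict(int)          # open peaks per caller
    regions, region_start = [], None
    for pos, delta, caller in events:
        depth[caller] += delta
        active = sum(1 for d in depth.values() if d > 0)
        if region_start is None and active >= min_callers:
            region_start = pos
        elif region_start is not None and active < min_callers:
            regions.append((region_start, pos))
            region_start = None
    return regions


# Hypothetical peak calls from three tools on the same chromosome.
calls = {
    "caller_a": [(100, 200)],
    "caller_b": [(150, 250)],
    "caller_c": [(180, 300)],
}
print(consensus_regions(calls, min_callers=2))
print(consensus_regions(calls, min_callers=3))
```

Raising `min_callers` trades sensitivity for precision, which is the general motivation for combining tools in the first place.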
Gene context analysis in the Integrated Microbial Genomes (IMG) data management system.
Mavromatis, Konstantinos; Chu, Ken; Ivanova, Natalia; Hooper, Sean D; Markowitz, Victor M; Kyrpides, Nikos C
2009-11-24
Computational methods for determining the function of genes in newly sequenced genomes have been traditionally based on sequence similarity to genes whose function has been identified experimentally. Function prediction methods can be extended using gene context analysis approaches such as examining the conservation of chromosomal gene clusters, gene fusion events and co-occurrence profiles across genomes. Context analysis is based on the observation that functionally related genes often have similar gene context, and it relies on the identification of such events across a phylogenetically diverse collection of genomes. We have used the data management system of the Integrated Microbial Genomes (IMG) as the framework to implement and explore the power of gene context analysis methods because it provides one of the largest available genome integrations. Visualization and search tools to facilitate gene context analysis have been developed and applied across all publicly available archaeal and bacterial genomes in IMG. These computations are now maintained as part of IMG's regular genome content update cycle. IMG is available at: http://img.jgi.doe.gov.
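One of the context signals mentioned above, co-occurrence across genomes, can be sketched in a few lines. The example assumes each gene family is represented by the set of genomes in which it occurs and scores two families by the Jaccard index of those sets; the function name, the genome labels and the choice of Jaccard similarity are illustrative assumptions, not a description of how IMG computes its profiles.

```python
def cooccurrence_score(profile_a, profile_b):
    """Jaccard similarity of two presence profiles (sets of genome IDs)."""
    genomes_a, genomes_b = set(profile_a), set(profile_b)
    union = genomes_a | genomes_b
    if not union:
        return 0.0
    return len(genomes_a & genomes_b) / len(union)


# Hypothetical presence profiles for two gene families.
family_x = {"genome1", "genome2", "genome3"}
family_y = {"genome2", "genome3", "genome4"}
print(cooccurrence_score(family_x, family_y))  # 2 shared / 4 total = 0.5
```

A score near 1 means the two families are almost always present or absent together, which is the kind of pattern context analysis treats as a hint of functional linkage.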
Risk assessment increasingly relies more heavily on mode of action, thus the identification of human bioindicators of disease becomes all the more important. Genomic methods represent a tool for both mode of action determination and bioindicator identification. The Mechanistic In...
Li, Chunmei; Yu, Zhilong; Fu, Yusi; Pang, Yuhong; Huang, Yanyi
2017-04-26
We develop a novel single-cell-based platform through digital counting of amplified genomic DNA fragments, named multifraction amplification (mfA), to detect the copy number variations (CNVs) in a single cell. Amplification is required to acquire genomic information from a single cell, while introducing unavoidable bias. Unlike prevalent methods that directly infer CNV profiles from the pattern of sequencing depth, our mfA platform denatures and separates the DNA molecules from a single cell into multiple fractions of a reaction mix before amplification. By examining the sequencing result of each fraction for a specific fragment and applying a segment-merge maximum likelihood algorithm to the calculation of copy number, we digitize the sequencing-depth-based CNV identification and thus provide a method that is less sensitive to the amplification bias. In this paper, we demonstrate an mfA platform through multiple displacement amplification (MDA) chemistry. With the mfA platform, the noise of MDA is reduced; therefore, the resolution of single-cell CNV identification can be improved to 100 kb. We can also determine the genomic region free of allelic drop-out with the mfA platform, which is impossible for conventional single-cell amplification methods.
Identifying micro-inversions using high-throughput sequencing reads.
He, Feifei; Li, Yang; Tang, Yu-Hang; Ma, Jian; Zhu, Huaiqiu
2016-01-11
The identification of inversions of DNA segments shorter than read length (e.g., 100 bp), defined as micro-inversions (MIs), remains challenging for next-generation sequencing reads. It is acknowledged that MIs are an important class of genomic variation and may play roles in causing genetic disease. However, current alignment methods are generally insensitive to MIs. Here we develop a novel tool, MID (Micro-Inversion Detector), to identify MIs in human genomes using next-generation sequencing reads. The algorithm of MID is designed based on a dynamic programming path-finding approach. What makes MID different from other variant detection tools is that MID can handle small MIs and multiple breakpoints within an unmapped read. Moreover, MID improves reliability in low coverage data by integrating multiple samples. Our evaluation demonstrated that MID outperforms Gustaf, which can currently detect inversions from 30 bp to 500 bp. To our knowledge, MID is the first method that can efficiently and reliably identify MIs from unmapped short next-generation sequencing reads. MID is reliable on low coverage data, which makes it suitable for large-scale projects such as the 1000 Genomes Project (1KGP). MID identified previously unknown MIs from the 1KGP that overlap with genes and regulatory elements in the human genome. We also identified MIs in cancer cell lines from the Cancer Cell Line Encyclopedia (CCLE). Therefore our tool is expected to be useful to improve the study of MIs as a type of genetic variant in the human genome. The source code can be downloaded from: http://cqb.pku.edu.cn/ZhuLab/MID .
DOE Office of Scientific and Technical Information (OSTI.GOV)
Song, Hyun-Seob; Goldberg, Noam; Mahajan, Ashutosh
Elementary (flux) modes (EMs) have served as a valuable tool for investigating structural and functional properties of metabolic networks. Identification of the full set of EMs in genome-scale networks remains challenging due to combinatorial explosion of EMs in complex networks. Often, however, only a small subset of relevant EMs needs to be known, for which optimization-based sequential computation is a useful alternative. Most of the currently available methods along this line are based on the iterative use of mixed integer linear programming (MILP), the effectiveness of which significantly deteriorates as the number of iterations builds up. To alleviate the computational burden associated with the MILP implementation, we here present a novel optimization algorithm termed alternate integer linear programming (AILP). Results: Our algorithm was designed to iteratively solve a pair of integer programming (IP) and linear programming (LP) problems to compute EMs in a sequential manner. In each step, the IP identifies a minimal subset of reactions, the deletion of which disables all previously identified EMs. Thus, a subsequent LP solution subject to this reaction deletion constraint becomes a distinct EM. In cases where no feasible LP solution is available, IP-derived reaction deletion sets represent minimal cut sets (MCSs). Despite the additional computation of MCSs, AILP achieved significant time reduction in computing EMs by orders of magnitude. The proposed AILP algorithm not only offers a computational advantage in the EM analysis of genome-scale networks, but also improves the understanding of the linkage between EMs and MCSs.
Charlesworth, Jac C; Peralta, Juan M; Drigalenko, Eugene; Göring, Harald Hh; Almasy, Laura; Dyer, Thomas D; Blangero, John
2009-12-15
Gene identification using linkage, association, or genome-wide expression is often underpowered. We propose that formal combination of information from multiple gene-identification approaches may lead to the identification of novel loci that are missed when only one form of information is available. Firstly, we analyze the Genetic Analysis Workshop 16 Framingham Heart Study Problem 2 genome-wide association data for HDL-cholesterol using a "gene-centric" approach. Then we formally combine the association test results with genome-wide transcriptional profiling data for high-density lipoprotein cholesterol (HDL-C), from the San Antonio Family Heart Study, using a Z-transform test (Stouffer's method). We identified 39 genes by the joint test at a conservative 1% false-discovery rate, including 9 from the significant gene-based association test and 23 whose expression was significantly correlated with HDL-C. Seven genes identified as significant in the joint test were not independently identified by either the association or expression tests. This combined approach has increased power and leads to the direct nomination of novel candidate genes likely to be involved in the determination of HDL-C levels. Such information can then be used as justification for a more exhaustive search for functional sequence variation within the nominated genes. We anticipate that this type of analysis will improve our speed of identification of regulatory genes causally involved in disease risk.
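The Z-transform test (Stouffer's method) named above combines independent p-values by converting each to a standard normal deviate, summing the deviates, and rescaling by the square root of the number of tests (or of the sum of squared weights). The following is a minimal sketch assuming one-sided p-values strictly between 0 and 1; the function name and the example values are illustrative and are not taken from the study.

```python
from math import sqrt
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # mean 0, standard deviation 1


def stouffer_combine(p_values, weights=None):
    """Combine one-sided p-values with Stouffer's Z-transform method.

    Returns (combined_z, combined_p). Each p-value must lie strictly
    between 0 and 1. Equal weights are assumed unless given.
    """
    if weights is None:
        weights = [1.0] * len(p_values)
    # A small p-value maps to a large positive z-score.
    z_scores = [_STD_NORMAL.inv_cdf(1.0 - p) for p in p_values]
    numerator = sum(w * z for w, z in zip(weights, z_scores))
    combined_z = numerator / sqrt(sum(w * w for w in weights))
    combined_p = 1.0 - _STD_NORMAL.cdf(combined_z)
    return combined_z, combined_p


# Hypothetical gene with modest evidence from association and expression tests.
z, p = stouffer_combine([0.05, 0.05])
print(round(z, 3), round(p, 4))
```

Two tests that are each only marginal can yield a clearly smaller joint p-value, which is how a joint test can nominate genes that neither source identifies on its own.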
The Essential Genome of Escherichia coli K-12
2018-01-01
ABSTRACT Transposon-directed insertion site sequencing (TraDIS) is a high-throughput method coupling transposon mutagenesis with short-fragment DNA sequencing. It is commonly used to identify essential genes. Single gene deletion libraries are considered the gold standard for identifying essential genes. Currently, the TraDIS method has not been benchmarked against such libraries, and therefore, it remains unclear whether the two methodologies are comparable. To address this, a high-density transposon library was constructed in Escherichia coli K-12. Essential genes predicted from sequencing of this library were compared to existing essential gene databases. To decrease false-positive identification of essential genes, statistical data analysis included corrections for both gene length and genome length. Through this analysis, new essential genes and genes previously incorrectly designated essential were identified. We show that manual analysis of TraDIS data reveals novel features that would not have been detected by statistical analysis alone. Examples include short essential regions within genes, orientation-dependent effects, and fine-resolution identification of genome and protein features. Recognition of these insertion profiles in transposon mutagenesis data sets will assist genome annotation of less well characterized genomes and provides new insights into bacterial physiology and biochemistry. PMID:29463657
Direct mapping of symbolic DNA sequence into frequency domain in global repeat map algorithm
Glunčić, Matko; Paar, Vladimir
2013-01-01
The main feature of the global repeat map (GRM) algorithm (www.hazu.hr/grm/software/win/grm2012.exe) is its ability to identify a broad variety of repeats of unbounded length that can be arbitrarily distant in sequences as large as human chromosomes. The efficacy is due to the use of a complete K-string ensemble, which enables a new method of direct mapping of symbolic DNA sequence into frequency domain, with straightforward identification of repeats as peaks in the GRM diagram. In this way, we obtain a very fast, efficient and highly automated repeat-finding tool. The method is robust to substitutions and insertions/deletions, as well as to various complexities of the sequence pattern. We present several case studies of GRM use, in order to illustrate its capabilities: identification of α-satellite tandem repeats and higher order repeats (HORs), identification of Alu dispersed repeats and of Alu tandems, identification of Period 3 pattern in exons, implementation of ‘magnifying glass’ effect, identification of complex HOR pattern, identification of inter-tandem transitional dispersed repeat sequences and identification of long segmental duplications. The GRM algorithm is convenient for use, in particular, in cases of large repeat units, of highly mutated and/or complex repeats, and of global repeat maps for large genomic sequences (chromosomes and genomes). PMID:22977183
Ab initio gene identification in metagenomic sequences
Zhu, Wenhan; Lomsadze, Alexandre; Borodovsky, Mark
2010-01-01
We describe an algorithm for gene identification in DNA sequences derived from shotgun sequencing of microbial communities. Accurate ab initio gene prediction in a short nucleotide sequence of anonymous origin is hampered by uncertainty in model parameters. While several machine learning approaches could be proposed to bypass this difficulty, one effective method is to estimate parameters from dependencies, formed in evolution, between frequencies of oligonucleotides in protein-coding regions and genome nucleotide composition. The original version of the method was proposed in 1999 and has been used since for (i) reconstructing the codon frequency vector needed for gene finding in viral genomes and (ii) initializing parameters of self-training gene finding algorithms. With the advent of new prokaryotic genomes en masse it became possible to enhance the original approach by using direct polynomial and logistic approximations of oligonucleotide frequencies, as well as by separating models for bacteria and archaea. These advances have increased the accuracy of model reconstruction and, subsequently, gene prediction. We describe the refined method and assess its accuracy on known prokaryotic genomes split into short sequences. Also, we show that as a result of application of the new method, several thousands of new genes could be added to existing annotations of several human and mouse gut metagenomes. PMID:20403810
Genome-scale engineering for systems and synthetic biology
Esvelt, Kevin M; Wang, Harris H
2013-01-01
Genome-modification technologies enable the rational engineering and perturbation of biological systems. Historically, these methods have been limited to gene insertions or mutations at random or at a few pre-defined locations across the genome. The handful of methods capable of targeted gene editing suffered from low efficiencies, significant labor costs, or both. Recent advances have dramatically expanded our ability to engineer cells in a directed and combinatorial manner. Here, we review current technologies and methodologies for genome-scale engineering, discuss the prospects for extending efficient genome modification to new hosts, and explore the implications of continued advances toward the development of flexibly programmable chasses, novel biochemistries, and safer organismal and ecological engineering. PMID:23340847
Yu, Ron X.; Liu, Jie; True, Nick; Wang, Wei
2008-01-01
A major challenge in the post-genome era is to reconstruct regulatory networks from the biological knowledge accumulated up to date. The development of tools for identifying direct target genes of transcription factors (TFs) is critical to this endeavor. Given a set of microarray experiments, a probabilistic model called TRANSMODIS has been developed which can infer the direct targets of a TF by integrating sequence motif, gene expression and ChIP-chip data. The performance of TRANSMODIS was first validated on a set of transcription factor perturbation experiments (TFPEs) involving Pho4p, a well studied TF in Saccharomyces cerevisiae. TRANSMODIS removed elements of arbitrariness in manual target gene selection process and produced results that concur with one's intuition. TRANSMODIS was further validated on a genome-wide scale by comparing it with two other methods in Saccharomyces cerevisiae. The usefulness of TRANSMODIS was then demonstrated by applying it to the identification of direct targets of DAF-16, a critical TF regulating ageing in Caenorhabditis elegans. We found that 189 genes were tightly regulated by DAF-16. In addition, DAF-16 has differential preference for motifs when acting as an activator or repressor, which awaits experimental verification. TRANSMODIS is computationally efficient and robust, making it a useful probabilistic framework for finding immediate targets. PMID:18350157
Cloud-based adaptive exon prediction for DNA analysis.
Putluri, Srinivasareddy; Zia Ur Rahman, Md; Fathima, Shaik Yasmeen
2018-02-01
Cloud computing offers significant research and economic benefits to healthcare organisations. Cloud services provide a safe place for storing and managing large amounts of sensitive data. Under the conventional flow of gene information, gene sequence laboratories send out raw and inferred information via the Internet to several sequence libraries. DNA sequencing storage costs will be minimised by the use of cloud services. In this study, the authors put forward a novel genomic informatics system using Amazon Cloud Services, where genomic sequence information is stored and accessed for processing. True identification of exon regions in a DNA sequence is a key task in bioinformatics, which helps in disease identification and drug design. The three-base periodicity property of exons forms the basis of all exon identification techniques. Adaptive signal processing techniques are found to be promising in comparison with several other methods. Several adaptive exon predictors (AEPs) are developed using variable normalised least mean square and its maximum normalised variants to reduce computational complexity. Finally, performance evaluation of various AEPs is done based on measures such as sensitivity, specificity and precision using various standard genomic datasets taken from the National Center for Biotechnology Information genomic sequence database.
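The three-base periodicity mentioned above can be made concrete with a small spectral measure. The sketch below does not implement the adaptive (normalised least mean square) predictors of the study; it swaps in the simpler, well-known discrete Fourier transform view, evaluating the four base-indicator sequences at frequency 1/3 and summing the squared magnitudes. The function name and the normalisation by window length are assumptions made for illustration.

```python
import cmath


def period3_power(seq):
    """Spectral power at period 3 for a DNA window.

    For each base, the indicator sequence (1 where the base occurs,
    0 elsewhere) is projected onto a complex exponential of frequency 1/3.
    The squared magnitudes are summed over A, C, G and T and divided by
    the window length.
    """
    seq = seq.upper()
    n = len(seq)
    if n == 0:
        return 0.0
    total = 0.0
    for base in "ACGT":
        coeff = sum(
            cmath.exp(-2j * cmath.pi * i / 3)
            for i, ch in enumerate(seq)
            if ch == base
        )
        total += abs(coeff) ** 2
    return total / n


# A strictly codon-periodic toy window versus a homopolymer of equal length.
print(period3_power("ATG" * 30))
print(period3_power("A" * 90))
```

In practice such a measure would be computed in a sliding window along the sequence, with high values taken as a hint of coding potential; window length and thresholds are choices the abstract does not specify.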
oPOSSUM: identification of over-represented transcription factor binding sites in co-expressed genes
Ho Sui, Shannan J.; Mortimer, James R.; Arenillas, David J.; Brumm, Jochen; Walsh, Christopher J.; Kennedy, Brian P.; Wasserman, Wyeth W.
2005-01-01
Targeted transcript profiling studies can identify sets of co-expressed genes; however, identification of the underlying functional mechanism(s) is a significant challenge. Established methods for the analysis of gene annotations, particularly those based on the Gene Ontology, can identify functional linkages between genes. Similar methods for the identification of over-represented transcription factor binding sites (TFBSs) have been successful in yeast, but extension to human genomics has largely proved ineffective. Creation of a system for the efficient identification of common regulatory mechanisms in a subset of co-expressed human genes promises to break a roadblock in functional genomics research. We have developed an integrated system that searches for evidence of co-regulation by one or more transcription factors (TFs). oPOSSUM combines a pre-computed database of conserved TFBSs in human and mouse promoters with statistical methods for identification of sites over-represented in a set of co-expressed genes. The algorithm successfully identified mediating TFs in control sets of tissue-specific genes and in sets of co-expressed genes from three transcript profiling studies. Simulation studies indicate that oPOSSUM produces few false positives using empirically defined thresholds and can tolerate up to 50% noise in a set of co-expressed genes. PMID:15933209
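A common way to ask whether a binding site is over-represented in a gene set is a one-sided hypergeometric test against a background set. The sketch below is a generic illustration of that idea and is not claimed to be the statistic oPOSSUM itself uses; the function name, argument names and example counts are hypothetical.

```python
from math import comb


def hypergeom_upper_tail(hits_in_set, set_size, hits_in_background, background_size):
    """P(X >= hits_in_set) under the hypergeometric distribution.

    Models drawing `set_size` genes without replacement from a background
    of `background_size` genes, `hits_in_background` of which carry the
    binding site of interest.
    """
    max_hits = min(set_size, hits_in_background)
    total = comb(background_size, set_size)
    tail = sum(
        comb(hits_in_background, k)
        * comb(background_size - hits_in_background, set_size - k)
        for k in range(hits_in_set, max_hits + 1)
    )
    return tail / total


# Hypothetical counts: 8 of 20 co-expressed genes carry a site that is
# present in 500 of 10000 background genes.
print(hypergeom_upper_tail(8, 20, 500, 10000))
```

A small tail probability suggests the site occurs in the co-expressed set more often than chance would explain; any significance threshold would need to be set empirically, as the abstract notes.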
Barrett, Christian L.; Cho, Byung-Kwan
2011-01-01
Immuno-precipitation of protein–DNA complexes followed by microarray hybridization is a powerful and cost-effective technology for discovering protein–DNA binding events at the genome scale. It is still an unresolved challenge to comprehensively, accurately and sensitively extract binding event information from the produced data. We have developed a novel strategy composed of an information-preserving signal-smoothing procedure, higher order derivative analysis and application of the principle of maximum entropy to address this challenge. Importantly, our method does not require any input parameters to be specified by the user. Using genome-scale binding data of two Escherichia coli global transcription regulators for which a relatively large number of experimentally supported sites are known, we show that ∼90% of known sites were resolved to within four probes, or ∼88 bp. Over half of the sites were resolved to within two probes, or ∼38 bp. Furthermore, we demonstrate that our strategy delivers significant quantitative and qualitative performance gains over available methods. Such accurate and sensitive binding site resolution has important consequences for accurately reconstructing transcriptional regulatory networks, for motif discovery, for furthering our understanding of local and non-local factors in protein–DNA interactions and for extending the usefulness horizon of the ChIP-chip platform. PMID:21051353
Neugebauer, Tomasz; Bordeleau, Eric; Burrus, Vincent; Brzezinski, Ryszard
2015-01-01
Data visualization methods are necessary during the exploration and analysis activities of an increasingly data-intensive scientific process. There are few existing visualization methods for raw nucleotide sequences of a whole genome or chromosome. Software for data visualization should allow the researchers to create accessible data visualization interfaces that can be exported and shared with others on the web. Herein, novel software developed for generating DNA data visualization interfaces is described. The software converts DNA data sets into images that are further processed as multi-scale images to be accessed through a web-based interface that supports zooming, panning and sequence fragment selection. Nucleotide composition frequencies and GC skew of a selected sequence segment can be obtained through the interface. The software was used to generate DNA data visualization of human and bacterial chromosomes. Examples of visually detectable features such as short and long direct repeats, long terminal repeats, mobile genetic elements, heterochromatic segments in microbial and human chromosomes, are presented. The software and its source code are available for download and further development. The visualization interfaces generated with the software allow for the immediate identification and observation of several types of sequence patterns in genomes of various sizes and origins. The visualization interfaces generated with the software are readily accessible through a web browser. This software is a useful research and teaching tool for genetics and structural genomics.
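The two per-segment statistics mentioned above are simple to compute. This sketch uses the standard definition of GC skew, (G - C) / (G + C), and reports base frequencies over A, C, G and T only; the function name is hypothetical and the code is not taken from the described software.

```python
from collections import Counter


def composition_and_gc_skew(segment):
    """Return (frequencies, gc_skew) for a DNA segment.

    Frequencies are computed over A, C, G and T only, so ambiguous
    symbols such as N are ignored. GC skew is (G - C) / (G + C), or 0.0
    when the segment contains neither G nor C.
    """
    counts = Counter(segment.upper())
    total = sum(counts[b] for b in "ACGT")
    freqs = {b: (counts[b] / total if total else 0.0) for b in "ACGT"}
    g, c = counts["G"], counts["C"]
    skew = (g - c) / (g + c) if (g + c) else 0.0
    return freqs, skew


freqs, skew = composition_and_gc_skew("GGGCATAT")
print(freqs)
print(skew)  # (3 - 1) / (3 + 1) = 0.5
```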
USDA-ARS's Scientific Manuscript database
Long noncoding RNAs (lncRNAs) have been recognized in recent years as key regulators of diverse cellular processes. Genome-wide large-scale projects have uncovered thousands of lncRNAs in many model organisms. Large intergenic noncoding RNAs (lincRNAs) are lncRNAs that are transcribed from intergeni...
Stolc, Viktor; Samanta, Manoj Pratim; Tongprasit, Waraporn; Sethi, Himanshu; Liang, Shoudan; Nelson, David C.; Hegeman, Adrian; Nelson, Clark; Rancour, David; Bednarek, Sebastian; Ulrich, Eldon L.; Zhao, Qin; Wrobel, Russell L.; Newman, Craig S.; Fox, Brian G.; Phillips, George N.; Markley, John L.; Sussman, Michael R.
2005-01-01
Using a maskless photolithography method, we produced DNA oligonucleotide microarrays with probe sequences tiled throughout the genome of the plant Arabidopsis thaliana. RNA expression was determined for the complete nuclear, mitochondrial, and chloroplast genomes by tiling 5 million 36-mer probes. These probes were hybridized to labeled mRNA isolated from liquid grown T87 cells, an undifferentiated Arabidopsis cell culture line. Transcripts were detected from at least 60% of the nearly 26,330 annotated genes, which included 151 predicted genes that were not identified previously by a similar genome-wide hybridization study on four different cell lines. In comparison with previously published results with 25-mer tiling arrays produced by chromium masking-based photolithography technique, 36-mer oligonucleotide probes were found to be more useful in identifying intron–exon boundaries. Using two-dimensional HPLC tandem mass spectrometry, a small-scale proteomic analysis was performed with the same cells. A large amount of strongly hybridizing RNA was found in regions “antisense” to known genes. Similarity of antisense activities between the 25-mer and 36-mer data sets suggests that it is a reproducible and inherent property of the experiments. Transcription activities were also detected for many of the intergenic regions and the small RNAs, including tRNA, small nuclear RNA, small nucleolar RNA, and microRNA. Expression of tRNAs correlates with genome-wide amino acid usage. PMID:15755812
Variation block-based genomics method for crop plants.
Kim, Yul Ho; Park, Hyang Mi; Hwang, Tae-Young; Lee, Seuk Ki; Choi, Man Soo; Jho, Sungwoong; Hwang, Seungwoo; Kim, Hak-Min; Lee, Dongwoo; Kim, Byoung-Chul; Hong, Chang Pyo; Cho, Yun Sung; Kim, Hyunmin; Jeong, Kwang Ho; Seo, Min Jung; Yun, Hong Tai; Kim, Sun Lim; Kwon, Young-Up; Kim, Wook Han; Chun, Hye Kyung; Lim, Sang Jong; Shin, Young-Ah; Choi, Ik-Young; Kim, Young Sun; Yoon, Ho-Sung; Lee, Suk-Ha; Lee, Sunghoon
2014-06-15
In contrast with wild species, cultivated crop genomes consist of reshuffled recombination blocks, which occurred by crossing and selection processes. Accordingly, recombination block-based genomics analysis can be an effective approach for the screening of target loci for agricultural traits. We propose the variation block method, which is a three-step process for recombination block detection and comparison. The first step is to detect variations by comparing the short-read DNA sequences of the cultivar to the reference genome of the target crop. Next, sequence blocks with variation patterns are examined and defined. The boundaries between the variation-containing sequence blocks are regarded as recombination sites. All the assumed recombination sites in the cultivar set are used to split the genomes, and the resulting sequence regions are termed variation blocks. Finally, the genomes are compared using the variation blocks. The variation block method identified recurring recombination blocks accurately and successfully represented block-level diversities in the publicly available genomes of 31 soybean and 23 rice accessions. The practicality of this approach was demonstrated by the identification of a putative locus determining soybean hilum color. We suggest that the variation block method is an efficient genomics method for the recombination block-level comparison of crop genomes. We expect that this method will facilitate the development of crop genomics by bringing genomics technologies to the field of crop breeding.
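The splitting step described above can be sketched as follows, under simplifying assumptions: each cultivar is represented by a list of variation-containing intervals on one chromosome (half-open coordinates), every interval boundary from any cultivar is treated as an assumed recombination site, and the chromosome is cut at all of them. The function names, data layout and the 0/1 block profile are illustrative and not the authors' implementation.

```python
def variation_blocks(genome_length, cultivar_intervals):
    """Split [0, genome_length) at every interval boundary of any cultivar.

    cultivar_intervals maps a cultivar name to a list of (start, end)
    variation-containing intervals. Returns the resulting blocks in order.
    """
    boundaries = {0, genome_length}
    for intervals in cultivar_intervals.values():
        for start, end in intervals:
            boundaries.update((start, end))
    cuts = sorted(b for b in boundaries if 0 <= b <= genome_length)
    return list(zip(cuts[:-1], cuts[1:]))


def block_profile(blocks, intervals):
    """Mark each block 1 if it lies inside a variation-containing interval."""
    return [
        int(any(s <= start and end <= e for s, e in intervals))
        for start, end in blocks
    ]


# Two hypothetical cultivars on a 100-bp toy chromosome.
cultivars = {"cultivar_a": [(10, 40)], "cultivar_b": [(30, 60)]}
blocks = variation_blocks(100, cultivars)
print(blocks)
for name, intervals in cultivars.items():
    print(name, block_profile(blocks, intervals))
```

Because every cultivar is described on the same set of blocks, genomes can then be compared block by block rather than variant by variant.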
Whole-genome sequencing for comparative genomics and de novo genome assembly.
Benjak, Andrej; Sala, Claudia; Hartkoorn, Ruben C
2015-01-01
Next-generation sequencing technologies for whole-genome sequencing of mycobacteria are rapidly becoming an attractive alternative to more traditional sequencing methods. In particular, this technology is proving useful for genome-wide identification of mutations in mycobacteria (comparative genomics) as well as for de novo assembly of whole genomes. Next-generation sequencing, however, generates a vast quantity of data that can only be transformed into a usable and comprehensible form using bioinformatics. Here we describe the methodology one would use to prepare libraries for whole-genome sequencing, and the basic bioinformatics to identify mutations in a genome following Illumina HiSeq or MiSeq sequencing, as well as de novo genome assembly following sequencing using Pacific Biosciences (PacBio).
Genomic insights into the taxonomic status of the Bacillus cereus group
Liu, Yang; Lai, Qiliang; Göker, Markus; Meier-Kolthoff, Jan P.; Wang, Meng; Sun, Yamin; Wang, Lei; Shao, Zongze
2015-01-01
The identification and phylogenetic relationships of bacteria within the Bacillus cereus group are controversial. This study aimed at determining the taxonomic affiliations of these strains using the whole-genome sequence-based Genome BLAST Distance Phylogeny (GBDP) approach. The GBDP analysis clearly separated 224 strains into 30 clusters, representing eleven known, partially merged species and accordingly 19–20 putative novel species. Additionally, 16S rRNA gene analysis, a novel variant of multi-locus sequence analysis (nMLSA) and screening of virulence genes were performed. The 16S rRNA gene sequence was not sufficient to differentiate the bacteria within this group due to its high conservation. The nMLSA results were consistent with GBDP. Moreover, a fast typing method was proposed using the pycA gene, and where necessary, the ccpA gene. The pXO plasmids and cry genes were widely distributed, suggesting little correlation with the phylogenetic positions of the host bacteria. This might explain why classifications based on virulence characteristics proved unsatisfactory in the past. In summary, this is the first large-scale and systematic study of the taxonomic status of the bacteria within the B. cereus group using whole-genome sequences, and is likely to contribute to further insights into their pathogenicity, phylogeny and adaptation to diverse environments. PMID:26373441
Meng, Xianjing; Yin, Yilong; Yang, Gongping; Xi, Xiaoming
2013-01-01
Retinal identification based on retinal vasculatures in the retina provides the most secure and accurate means of authentication among biometrics and has primarily been used in combination with access control systems at high security facilities. Recently, there has been much interest in retina identification. As digital retina images always suffer from deformations, the Scale Invariant Feature Transform (SIFT), which is known for its distinctiveness and invariance for scale and rotation, has been introduced to retinal based identification. However, some shortcomings like the difficulty of feature extraction and mismatching exist in SIFT-based identification. To solve these problems, a novel preprocessing method based on the Improved Circular Gabor Transform (ICGF) is proposed. After further processing by the iterated spatial anisotropic smooth method, the number of uninformative SIFT keypoints is decreased dramatically. Tested on the VARIA and eight simulated retina databases combining rotation and scaling, the developed method presents promising results and shows robustness to rotations and scale changes. PMID:23873409
Genome-Wide Fine-Scale Recombination Rate Variation in Drosophila melanogaster
Song, Yun S.
2012-01-01
Estimating fine-scale recombination maps of Drosophila from population genomic data is a challenging problem, in particular because of the high background recombination rate. In this paper, a new computational method is developed to address this challenge. Through an extensive simulation study, it is demonstrated that the method allows more accurate inference, and exhibits greater robustness to the effects of natural selection and noise, compared to a well-used previous method developed for studying fine-scale recombination rate variation in the human genome. As an application, a genome-wide analysis of genetic variation data is performed for two Drosophila melanogaster populations, one from North America (Raleigh, USA) and the other from Africa (Gikongoro, Rwanda). It is shown that fine-scale recombination rate variation is widespread throughout the D. melanogaster genome, across all chromosomes and in both populations. At the fine-scale, a conservative, systematic search for evidence of recombination hotspots suggests the existence of a handful of putative hotspots each with at least a tenfold increase in intensity over the background rate. A wavelet analysis is carried out to compare the estimated recombination maps in the two populations and to quantify the extent to which recombination rates are conserved. In general, similarity is observed at very broad scales, but substantial differences are seen at fine scales. The average recombination rate of the X chromosome appears to be higher than that of the autosomes in both populations, and this pattern is much more pronounced in the African population than the North American population. The correlation between various genomic features—including recombination rates, diversity, divergence, GC content, gene content, and sequence quality—is examined using the wavelet analysis, and it is shown that the most notable difference between D. melanogaster and humans is in the correlation between recombination and diversity. 
PMID:23284288
Krasota, Alexandr; Loginovskih, Natalia; Ivanova, Olga; Lipskaya, Galina
2016-01-01
Enteroviruses, the most common human viral pathogens worldwide, have been associated with serous meningitis, encephalitis, acute flaccid paralysis syndrome, myocarditis and the onset of type 1 diabetes. In the future, rapid identification of the etiological agent would allow therapy to be adjusted promptly, thereby improving the course of the disease and the prognosis. We developed an RT-nested PCR amplification of the genomic region coding for viral structural protein VP1 for direct identification of enteroviruses in clinical specimens and compared it with existing methods. One hundred fifty-nine cerebrospinal fluid (CSF) samples from patients with suspected meningitis were studied. Amplification of the VP1 genomic region using the new method was achieved for 86 (54.1%) patients, compared with 75 (47.2%), 53 (33.3%) and 31 (19.5%) with previously published methods. We identified 11 serotypes of Enterovirus species B in 2012, including the relatively rare echovirus 14 (E-14), E-15 and E-32, and eight serotypes of species B and 5 enteroviruses A71 (EV-A71) in 2013. The developed method can be useful for direct identification of enteroviruses in clinical material with low virus loads, such as CSF. PMID:26751470
Finding the Genomic Basis of Local Adaptation: Pitfalls, Practical Solutions, and Future Directions.
Hoban, Sean; Kelley, Joanna L; Lotterhos, Katie E; Antolin, Michael F; Bradburd, Gideon; Lowry, David B; Poss, Mary L; Reed, Laura K; Storfer, Andrew; Whitlock, Michael C
2016-10-01
Uncovering the genetic and evolutionary basis of local adaptation is a major focus of evolutionary biology. The recent development of cost-effective methods for obtaining high-quality genome-scale data makes it possible to identify some of the loci responsible for adaptive differences among populations. Two basic approaches for identifying putatively locally adaptive loci have been developed and are broadly used: one that identifies loci with unusually high genetic differentiation among populations (differentiation outlier methods) and one that searches for correlations between local population allele frequencies and local environments (genetic-environment association methods). Here, we review the promises and challenges of these genome scan methods, including correcting for the confounding influence of a species' demographic history, biases caused by missing aspects of the genome, matching scales of environmental data with population structure, and other statistical considerations. In each case, we make suggestions for best practices for maximizing the accuracy and efficiency of genome scans to detect the underlying genetic basis of local adaptation. With attention to their current limitations, genome scan methods can be an important tool in finding the genetic basis of adaptive evolutionary change.
GIGGLE: a search engine for large-scale integrated genome analysis
Layer, Ryan M; Pedersen, Brent S; DiSera, Tonya; Marth, Gabor T; Gertz, Jason; Quinlan, Aaron R
2018-01-01
GIGGLE is a genomics search engine that identifies and ranks the significance of genomic loci shared between query features and thousands of genome interval files. GIGGLE (https://github.com/ryanlayer/giggle) scales to billions of intervals and is over three orders of magnitude faster than existing methods. Its speed extends the accessibility and utility of resources such as ENCODE, Roadmap Epigenomics, and GTEx by facilitating data integration and hypothesis generation. PMID:29309061
Mapping the Space of Genomic Signatures
Kari, Lila; Hill, Kathleen A.; Sayem, Abu S.; Karamichalis, Rallis; Bryans, Nathaniel; Davis, Katelyn; Dattani, Nikesh S.
2015-01-01
We propose a computational method to measure and visualize interrelationships among any number of DNA sequences allowing, for example, the examination of hundreds or thousands of complete mitochondrial genomes. An "image distance" is computed for each pair of graphical representations of DNA sequences, and the distances are visualized as a Molecular Distance Map: Each point on the map represents a DNA sequence, and the spatial proximity between any two points reflects the degree of structural similarity between the corresponding sequences. The graphical representation of DNA sequences utilized, Chaos Game Representation (CGR), is genome- and species-specific and can thus act as a genomic signature. Consequently, Molecular Distance Maps could inform species identification, taxonomic classifications and, to a certain extent, evolutionary history. The image distance employed, Structural Dissimilarity Index (DSSIM), implicitly compares the occurrences of oligomers of length up to k (herein k = 9) in DNA sequences. We computed DSSIM distances for more than 5 million pairs of complete mitochondrial genomes, and used Multi-Dimensional Scaling (MDS) to obtain Molecular Distance Maps that visually display the sequence relatedness in various subsets, at different taxonomic levels. This general-purpose method does not require DNA sequence alignment and can thus be used to compare similar or vastly different DNA sequences, genomic or computer-generated, of the same or different lengths. We illustrate potential uses of this approach by applying it to several taxonomic subsets: phylum Vertebrata, (super)kingdom Protista, classes Amphibia-Insecta-Mammalia, class Amphibia, and order Primates. This analysis of an extensive dataset confirms that the oligomer composition of full mtDNA sequences can be a source of taxonomic information. 
This method also correctly finds the mtDNA sequences most closely related to that of the anatomically modern human (the Neanderthal, the Denisovan, and the chimp), and that the sequence most different from it in this dataset belongs to a cucumber. PMID:26000734
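As a rough illustration of the Chaos Game Representation mentioned above, the sketch below counts the k-mers of a sequence into a 2^k x 2^k grid, where each k-mer lands in the cell a chaos-game walk would reach. The corner assignment for A, C, G and T and the name `cgr_counts` are assumptions made for this example, not taken from the paper, and the DSSIM image distance is not shown.

```python
# Hypothetical corner layout; the paper's exact convention is not stated here.
CORNERS = {"A": (0, 0), "C": (0, 1), "G": (1, 1), "T": (1, 0)}


def cgr_counts(seq, k):
    """Return a 2^k x 2^k grid (list of rows) of k-mer counts in CGR layout."""
    size = 2 ** k
    grid = [[0] * size for _ in range(size)]
    seq = seq.upper()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if any(base not in CORNERS for base in kmer):
            continue  # skip k-mers containing ambiguous bases such as N
        x = y = 0
        # Each base adds one bit per coordinate; the last base of the k-mer
        # is the most significant bit, so it selects the quadrant.
        for j, base in enumerate(kmer):
            bx, by = CORNERS[base]
            x |= bx << j
            y |= by << j
        grid[y][x] += 1
    return grid


if __name__ == "__main__":
    for row in cgr_counts("ACGTACGTTTGA", 2):
        print(row)
```

Two sequences of different lengths produce grids of the same shape for a given k, which is what makes an alignment-free, image-style comparison possible.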
Combining functional genomics and chemical biology to identify targets of bioactive compounds.
Ho, Cheuk Hei; Piotrowski, Jeff; Dixon, Scott J; Baryshnikova, Anastasia; Costanzo, Michael; Boone, Charles
2011-02-01
Genome sequencing projects have revealed thousands of suspected genes, challenging researchers to develop efficient large-scale functional analysis methodologies. Determining the function of a gene product generally requires a means to alter its function. Genetically tractable model organisms have been widely exploited for the isolation and characterization of activating and inactivating mutations in genes encoding proteins of interest. Chemical genetics represents a complementary approach involving the use of small molecules capable of either inactivating or activating their targets. Saccharomyces cerevisiae has been an important test bed for the development and application of chemical genomic assays aimed at identifying targets and modes of action of known and uncharacterized compounds. Here we review yeast chemical genomic assay strategies for drug target identification. Copyright © 2010 Elsevier Ltd. All rights reserved.
2014-12-11
Cassava (Manihot esculenta Crantz) is a major staple crop in Africa, Asia, and South America, and its starchy roots provide nourishment for 800 million people worldwide. Although native to South America, cassava was brought to Africa 400-500 years ago and is now widely cultivated across sub-Saharan Africa, but it is subject to biotic and abiotic stresses. To assist in the rapid identification of markers for pathogen resistance and crop traits, and to accelerate breeding programs, we generated a framework map for M. esculenta Crantz from reduced representation sequencing [genotyping-by-sequencing (GBS)]. The composite 2412-cM map integrates 10 biparental maps (comprising 3480 meioses) and organizes 22,403 genetic markers on 18 chromosomes, in agreement with the observed karyotype. We used the map to anchor 71.9% of the draft genome assembly and 90.7% of the predicted protein-coding genes. The chromosome-anchored genome sequence will be useful for breeding improvement by assisting in the rapid identification of markers linked to important traits, and in providing a framework for genomic selection-enhanced breeding of this important crop. Copyright © 2015 International Cassava Genetic Map Consortium (ICGMC).
Lyons, Jessica
2014-12-11
Cassava (Manihot esculenta Crantz) is a major staple crop in Africa, Asia, and South America, and its starchy roots provide nourishment for 800 million people worldwide. Although native to South America, cassava was brought to Africa 400–500 years ago and is now widely cultivated across sub-Saharan Africa, but it is subject to biotic and abiotic stresses. To assist in the rapid identification of markers for pathogen resistance and crop traits, and to accelerate breeding programs, we generated a framework map for M. esculenta Crantz from reduced representation sequencing [genotyping-by-sequencing (GBS)]. The composite 2412-cM map integrates 10 biparental maps (comprising 3480 meioses) and organizes 22,403 genetic markers on 18 chromosomes, in agreement with the observed karyotype. Here, we used the map to anchor 71.9% of the draft genome assembly and 90.7% of the predicted protein-coding genes. The chromosome-anchored genome sequence will be useful for breeding improvement by assisting in the rapid identification of markers linked to important traits, and in providing a framework for genomic selection-enhanced breeding of this important crop.
Finding cancer driver mutations in the era of big data research.
Poulos, Rebecca C; Wong, Jason W H
2018-04-02
In the last decade, the costs of genome sequencing have decreased considerably. The commencement of large-scale cancer sequencing projects has enabled cancer genomics to join the big data revolution. One of the challenges still facing cancer genomics research is determining which are the driver mutations in an individual cancer, as these contribute only a small subset of the overall mutation profile of a tumour. Focusing primarily on somatic single nucleotide mutations in this review, we consider both coding and non-coding driver mutations, and discuss how such mutations might be identified from cancer sequencing datasets. We describe some of the tools and databases that are available for the annotation of somatic variants and the identification of cancer driver genes. We also address the use of genome-wide variation in mutation load to establish background mutation rates from which to identify driver mutations under positive selection. Finally, we describe the ways in which mutational signatures can act as clues for the identification of cancer drivers, as these mutations may cause, or arise from, certain mutational processes. By defining the molecular changes responsible for driving cancer development, new cancer treatment strategies may be developed or novel preventative measures proposed.
Global Identification and Characterization of Transcriptionally Active Regions in the Rice Genome
Stolc, Viktor; Deng, Wei; He, Hang; Korbel, Jan; Chen, Xuewei; Tongprasit, Waraporn; Ronald, Pamela; Chen, Runsheng; Gerstein, Mark; Wang Deng, Xing
2007-01-01
Genome tiling microarray studies have consistently documented rich transcriptional activity beyond the annotated genes. However, systematic characterization and transcriptional profiling of the putative novel transcripts on the genome scale are still lacking. We report here the identification of 25,352 and 27,744 transcriptionally active regions (TARs) not encoded by annotated exons in the rice (Oryza sativa) subspecies japonica and indica, respectively. The non-exonic TARs account for approximately two thirds of the total TARs detected by tiling arrays and represent transcripts likely conserved between japonica and indica. Transcription of 21,018 (83%) japonica non-exonic TARs was verified through expression profiling in 10 tissue types using a re-array in which annotated genes and TARs were each represented by five independent probes. Subsequent analyses indicate that about 80% of the japonica TARs that were not assigned to annotated exons can be assigned to various putatively functional or structural elements of the rice genome, including splice variants, uncharacterized portions of incompletely annotated genes, antisense transcripts, duplicated gene fragments, and potential non-coding RNAs. These results provide a systematic characterization of non-exonic transcripts in rice and thus expand the current view of the complexity and dynamics of the rice transcriptome. PMID:17372628
Wang, Harris H; Church, George M
2011-01-01
Engineering at the scale of whole genomes requires fundamentally new molecular biology tools. Recent advances in recombineering using synthetic oligonucleotides enable the rapid generation of mutants at high efficiency and specificity and can be implemented at the genome scale. With these techniques, libraries of mutants can be generated, from which individuals with functionally useful phenotypes can be isolated. Furthermore, populations of cells can be evolved in situ by directed evolution using complex pools of oligonucleotides. Here, we discuss ways to utilize these multiplexed genome engineering methods, with special emphasis on experimental design and implementation. Copyright © 2011 Elsevier Inc. All rights reserved.
Development of Computational Tools for Metabolic Model Curation, Flux Elucidation and Strain Design
DOE Office of Scientific and Technical Information (OSTI.GOV)
Maranas, Costas D
An overarching goal of the Department of Energy mission is the efficient deployment and engineering of microbial and plant systems to enable biomass conversion in pursuit of high energy density liquid biofuels. This has spurred the pace at which new organisms are sequenced and annotated. This torrent of genomic information has opened the door to understanding metabolism in not just skeletal pathways and a handful of microorganisms but for truly genome-scale reconstructions derived for hundreds of microbes and plants. Understanding and redirecting metabolism is crucial because metabolic fluxes are unique descriptors of cellular physiology that directly assess the current cellular state and quantify the effect of genetic engineering interventions. At the same time, however, trying to keep pace with the rate of genomic data generation has ushered in a number of modeling and computational challenges related to (i) the automated assembly, testing and correction of genome-scale metabolic models, (ii) metabolic flux elucidation using labeled isotopes, and (iii) comprehensive identification of engineering interventions leading to the desired metabolism redirection.
Application of industrial scale genomics to discovery of therapeutic targets in heart failure.
Mehraban, F; Tomlinson, J E
2001-12-01
In recent years intense activity in both academic and industrial sectors has provided a wealth of information on the human genome with an associated impressive increase in the number of novel gene sequences deposited in sequence data repositories and patent applications. This genomic industrial revolution has transformed the way in which drug target discovery is now approached. In this article we discuss how various differential gene expression (DGE) technologies are being utilized for cardiovascular disease (CVD) drug target discovery. Other approaches such as sequencing cDNA from cardiovascular derived tissues and cells coupled with bioinformatic sequence analysis are used with the aim of identifying novel gene sequences that may be exploited towards target discovery. Additional leverage from gene sequence information is obtained through identification of polymorphisms that may confer disease susceptibility and/or affect drug responsiveness. Pharmacogenomic studies are described wherein gene expression-based techniques are used to evaluate drug response and/or efficacy. Industrial-scale genomics supports and addresses not only novel target gene discovery but also the burgeoning issues in pharmaceutical and clinical cardiovascular medicine relative to polymorphic gene responses.
Ponce-de-León, Miguel; Montero, Francisco; Peretó, Juli
2013-10-31
Metabolic reconstruction is the computational-based process that aims to elucidate the network of metabolites interconnected through reactions catalyzed by activities assigned to one or more genes. Reconstructed models may contain inconsistencies that appear as gap metabolites and blocked reactions. Although automatic methods for solving this problem have been previously developed, there are many situations where manual curation is still needed. We introduce a general definition of gap metabolite that allows its detection in a straightforward manner. Moreover, a method for the detection of Unconnected Modules, defined as isolated sets of blocked reactions connected through gap metabolites, is proposed. The method has been successfully applied to the curation of iCG238, the genome-scale metabolic model for the bacterium Blattabacterium cuenoti, obligate endosymbiont of cockroaches. We found the proposed approach to be a valuable tool for the curation of genome-scale metabolic models. The outcome of its application to the genome-scale model B. cuenoti iCG238 is a more accurate model version named as B. cuenoti iMP240.
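To make the gap-metabolite idea concrete, here is a minimal sketch under a simplified reading: a metabolite counts as a gap if, across the whole network, it is never produced or never consumed. This is an assumption for illustration (the paper's general definition may be broader), reversible reactions are naively treated as both producing and consuming every participant, and the function and reaction names are hypothetical.

```python
def find_gap_metabolites(reactions):
    """Return metabolites that are never produced or never consumed.

    reactions: dict mapping reaction name -> (stoichiometry, reversible),
    where stoichiometry maps metabolite -> coefficient
    (negative = consumed, positive = produced).
    """
    produced, consumed = set(), set()
    for stoich, reversible in reactions.values():
        for met, coeff in stoich.items():
            # Simplification: a reversible reaction can run either way,
            # so its participants count as both produced and consumed.
            if coeff > 0 or reversible:
                produced.add(met)
            if coeff < 0 or reversible:
                consumed.add(met)
    all_mets = produced | consumed
    return {m for m in all_mets if m not in produced or m not in consumed}


if __name__ == "__main__":
    toy_network = {
        "EX_A": ({"A": 1}, False),           # uptake of A
        "R1": ({"A": -1, "B": 1}, False),    # A -> B
        "R2": ({"B": -1, "C": 1}, False),    # B -> C, but nothing consumes C
    }
    print(find_gap_metabolites(toy_network))
```

In the toy network, C is produced by R2 but never consumed, so it is reported; in a steady-state model this would also block R2, which is the kind of inconsistency manual curation has to resolve.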
Chromosome evolution with naked eye: Palindromic context of the life origin
NASA Astrophysics Data System (ADS)
Larionov, Sergei; Loskutov, Alexander; Ryadchenko, Eugeny
2008-03-01
Based on the representation of the DNA sequence as a two-dimensional (2D) plane walk, we consider the problem of identification and comparison of functional and structural organizations of chromosomes of different organisms. According to the characteristic design of 2D walks we identify telomere sites, palindromes of various sizes and complexity, areas of ribosomal RNA, transposons, as well as diverse satellite sequences. As an interesting result of the application of the 2D walk method, a new duplicated gigantic palindrome in the human X chromosome is detected. A schematic mechanism leading to the formation of such a duplicated palindrome is proposed. Analysis of a large number of different genomes shows that some chromosomes (or their fragments) of various species appear as imperfect gigantic palindromes, which are disintegrated by many inversions and mutational drift on different scales. The widespread occurrence of these types of sequences in numerous chromosomes allows us to develop new insight into some accepted points of genome evolution in the prebiotic phase.
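A 2D DNA walk of the kind described above could be sketched as follows. The step directions (A and T in opposite directions along one axis, G and C along the other) are an assumed convention, since the abstract does not state which mapping was used, and `dna_walk` is a hypothetical name.

```python
# Assumed mapping: complementary bases step in opposite directions.
STEPS = {"A": (-1, 0), "T": (1, 0), "G": (0, 1), "C": (0, -1)}


def dna_walk(seq):
    """Return the list of (x, y) positions visited, starting at the origin."""
    x = y = 0
    path = [(0, 0)]
    for base in seq.upper():
        dx, dy = STEPS.get(base, (0, 0))  # unknown bases leave the position unchanged
        x += dx
        y += dy
        path.append((x, y))
    return path


if __name__ == "__main__":
    # "AG" followed by its reverse complement "CT"
    print(dna_walk("AGCT"))
```

Under this assumed mapping, a sequence followed by its reverse complement retraces its own path back to the starting point, which is one way a walk plot could make palindromic regions stand out visually.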
Hatano, Takashi; Sano, Daisuke; Takahashi, Hideaki; Hyakusoku, Hiroshi; Isono, Yasuhiro; Shimada, Shoko; Sawakuma, Kae; Takada, Kentaro; Oikawa, Ritsuko; Watanabe, Yoshiyuki; Yamamoto, Hiroyuki; Itoh, Fumio; Myers, Jeffrey N; Oridate, Nobuhiko
2017-04-01
Recent studies showed that human papillomavirus (HPV) integration contributes to the genomic instability seen in HPV-associated head and neck squamous cell carcinoma (HPV-HNSCC). However, the epigenetic alterations induced after HPV integration remain unclear. To identify the molecular details of HPV16 DNA integration and the ensuing patterns of methylation in HNSCC, we performed next-generation sequencing using a target-enrichment method for the effective identification of HPV16 integration breakpoints as well as the characterization of genomic sequences adjacent to HPV16 integration breakpoints with three HPV16-related HNSCC cell lines. The DNA methylation levels of the integrated HPV16 genome and of the adjacent human genome were also analyzed by bisulfite pyrosequencing. We found various integration loci, including novel integration sites. Integration loci were located predominantly in the intergenic region, with a significant enrichment of microhomologous sequences between the human and HPV16 genomes at the integration breakpoints. Furthermore, various levels of methylation within both the human genome and the integrated HPV genome at the integration breakpoints in each integrant were observed. Allele-specific methylation analysis suggested that the HPV16 integrants remained hypomethylated when the flanking host genome was hypomethylated. After integration into highly methylated human genome regions, however, the HPV16 DNA became methylated. In conclusion, we found novel integration sites and methylation patterns in HPV-HNSCC using our unique method. These findings may provide insights into the viral integration mechanism and virus-associated carcinogenesis of HPV-HNSCC. © 2016 UICC.
Xie, Qingjun; Tzfadia, Oren; Levy, Matan; Weithorn, Efrat; Peled-Zehavi, Hadas; Van Parys, Thomas; Van de Peer, Yves; Galili, Gad
2016-01-01
Most of the proteins that are specifically turned over by selective autophagy are recognized by the presence of short Atg8 interacting motifs (AIMs) that facilitate their association with the autophagy apparatus. Such AIMs can be identified by bioinformatics methods based on their defined degenerate consensus F/W/Y-X-X-L/I/V sequences in which X represents any amino acid. Achieving reliability and/or fidelity of the prediction of such AIMs on a genome-wide scale represents a major challenge. Here, we present a bioinformatics approach, high fidelity AIM (hfAIM), which uses additional sequence requirements (the presence of acidic amino acids and the absence of positively charged amino acids in certain positions) to reliably identify AIMs in proteins. We demonstrate that the use of the hfAIM method allows for in silico high fidelity prediction of AIMs in AIM-containing proteins (ACPs) on a genome-wide scale in various organisms. Furthermore, by using hfAIM to identify putative AIMs in the Arabidopsis proteome, we illustrate a potential contribution of selective autophagy to various biological processes. More specifically, we identified 9 peroxisomal PEX proteins that contain hfAIM motifs, among which AtPEX1, AtPEX6 and AtPEX10 possess evolutionarily conserved AIMs. Bimolecular fluorescence complementation (BiFC) results verified that AtPEX6 and AtPEX10 indeed interact with Atg8 in planta. In addition, we show that mutations occurring within or near hfAIMs in PEX1, PEX6 and PEX10 caused defects in the growth and development of various organisms. Taken together, the above results suggest that the hfAIM tool can be used to effectively perform genome-wide in silico screens of proteins that are potentially regulated by selective autophagy. The hfAIM system is a web tool that can be accessed at http://bioinformatics.psb.ugent.be/hfAIM/. PMID:27071037
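The degenerate core consensus quoted above (F/W/Y-X-X-L/I/V) maps directly onto a regular expression. The sketch below scans a protein sequence for that core motif only; it does not implement hfAIM's additional acidic and positive-charge requirements, because the abstract does not specify the positions involved. The names `AIM_CORE` and `find_core_aims` are hypothetical.

```python
import re

# Core consensus only: [F/W/Y]-X-X-[L/I/V]. The lookahead lets overlapping
# candidate motifs be reported instead of only non-overlapping ones.
AIM_CORE = re.compile(r"(?=([FWY]..[LIV]))")


def find_core_aims(protein):
    """Return (1-based start position, motif) for every core AIM candidate."""
    return [(m.start() + 1, m.group(1)) for m in AIM_CORE.finditer(protein.upper())]


if __name__ == "__main__":
    print(find_core_aims("MAWEELKKFYAVL"))
```

Because a four-residue degenerate pattern matches very frequently by chance, a scan like this mostly shows why extra sequence constraints are needed for genome-wide prediction.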
Efficient identification of Y chromosome sequences in the human and Drosophila genomes.
Carvalho, Antonio Bernardo; Clark, Andrew G
2013-11-01
Notwithstanding their biological importance, Y chromosomes remain poorly known in most species. A major obstacle to their study is the identification of Y chromosome sequences; due to its high content of repetitive DNA, in most genome projects, the Y chromosome sequence is fragmented into a large number of small, unmapped scaffolds. Identification of Y-linked genes among these fragments has yielded important insights about the origin and evolution of Y chromosomes, but the process is labor intensive, restricting studies to a small number of species. Apart from these fragmentary assemblies, in a few mammalian species, the euchromatic sequence of the Y is essentially complete, owing to painstaking BAC mapping and sequencing. Here we use female short-read sequencing and k-mer comparison to identify Y-linked sequences in two very different genomes, Drosophila virilis and human. Using this method, essentially all D. virilis scaffolds were unambiguously classified as Y-linked or not Y-linked. We found 800 new scaffolds (totaling 8.5 Mbp), and four new genes in the Y chromosome of D. virilis, including JYalpha, a gene involved in hybrid male sterility. Our results also strongly support the preponderance of gene gains over gene losses in the evolution of the Drosophila Y. In the intensively studied human genome, used here as a positive control, we recovered all previously known genes or gene families, plus a small amount (283 kb) of new, unfinished sequence. Hence, this method works in large and complex genomes and can be applied to any species with sex chromosomes.
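The k-mer comparison idea can be sketched as follows: for each scaffold, compute the fraction of its k-mers that never appear in reads from a female; scaffolds where most k-mers are unmatched are candidates for Y linkage. The values of `k` and `threshold` and all function names are hypothetical, and a real pipeline would also need to handle reverse complements, repeats and sequencing errors, which this sketch ignores.

```python
def kmers(seq, k):
    """Set of all k-mers in a sequence (empty if the sequence is shorter than k)."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def unmatched_fraction(scaffold, female_kmers, k):
    """Fraction of the scaffold's distinct k-mers absent from female reads."""
    scaffold_kmers = kmers(scaffold, k)
    if not scaffold_kmers:
        return 0.0
    return len(scaffold_kmers - female_kmers) / len(scaffold_kmers)


def classify_scaffolds(scaffolds, female_reads, k=15, threshold=0.8):
    """Map scaffold name -> True if it looks Y-linked under this toy criterion."""
    female_kmers = set()
    for read in female_reads:
        female_kmers |= kmers(read, k)
    return {
        name: unmatched_fraction(seq, female_kmers, k) >= threshold
        for name, seq in scaffolds.items()
    }


if __name__ == "__main__":
    reads = ["ACGTACGT"]
    scaffolds = {"autosomal": "ACGTAC", "candidate_Y": "TTTTGGGG"}
    print(classify_scaffolds(scaffolds, reads, k=3))
```

The toy example uses k=3 only so that the sequences stay short; with real data a longer k would be needed for k-mers to be specific.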
Genome-scale identification of Legionella pneumophila effectors using a machine learning approach.
Burstein, David; Zusman, Tal; Degtyar, Elena; Viner, Ram; Segal, Gil; Pupko, Tal
2009-07-01
A large number of highly pathogenic bacteria utilize secretion systems to translocate effector proteins into host cells. Using these effectors, the bacteria subvert host cell processes during infection. Legionella pneumophila translocates effectors via the Icm/Dot type-IV secretion system and to date, approximately 100 effectors have been identified by various experimental and computational techniques. Effector identification is a critical first step towards the understanding of the pathogenesis system in L. pneumophila as well as in other bacterial pathogens. Here, we formulate the task of effector identification as a classification problem: each L. pneumophila open reading frame (ORF) was classified as either effector or not. We computationally defined a set of features that best distinguish effectors from non-effectors. These features cover a wide range of characteristics including taxonomical dispersion, regulatory data, genomic organization, similarity to eukaryotic proteomes and more. Machine learning algorithms utilizing these features were then applied to classify all the ORFs within the L. pneumophila genome. Using this approach we were able to predict and experimentally validate 40 new effectors, reaching a success rate of above 90%. Increasing the number of validated effectors to around 140, we were able to gain novel insights into their characteristics. Effectors were found to have low G+C content, supporting the hypothesis that a large number of effectors originate via horizontal gene transfer, probably from their protozoan host. In addition, effectors were found to cluster in specific genomic regions. Finally, we were able to provide a novel description of the C-terminal translocation signal required for effector translocation by the Icm/Dot secretion system. To conclude, we have discovered 40 novel L. pneumophila effectors, predicted over a hundred additional highly probable effectors, and shown the applicability of machine learning algorithms for the identification and characterization of bacterial pathogenesis determinants.
Nakamura, Kosuke; Kondo, Kazunari; Akiyama, Hiroshi; Ishigaki, Takumi; Noguchi, Akio; Katsumata, Hiroshi; Takasaki, Kazuto; Futo, Satoshi; Sakata, Kozue; Fukuda, Nozomi; Mano, Junichi; Kitta, Kazumi; Tanaka, Hidenori; Akashi, Ryo; Nishimaki-Mogami, Tomoko
2016-08-15
Identification of transgenic sequences in an unknown genetically modified (GM) papaya (Carica papaya L.) by whole genome sequence analysis was demonstrated. Whole genome sequence data were generated for a GM-positive fresh papaya fruit commodity detected in monitoring using real-time polymerase chain reaction (PCR). The sequences obtained were mapped against an open database for papaya genome sequence. The transgenic construct- and event-specific sequences identified the sample as a GM papaya developed to resist infection by Papaya ringspot virus. Based on the transgenic sequences, a specific real-time PCR detection method for GM papaya applicable to various food commodities was developed. Whole genome sequence analysis enabled the identification of unknown transgenic construct- and event-specific sequences in GM papaya and the development of a reliable method for detecting them in papaya food commodities. Copyright © 2016 Elsevier Ltd. All rights reserved.
Toward an Upgraded Honey Bee (Apis mellifera L.) Genome Annotation Using Proteogenomics.
McAfee, Alison; Harpur, Brock A; Michaud, Sarah; Beavis, Ronald C; Kent, Clement F; Zayed, Amro; Foster, Leonard J
2016-02-05
The honey bee is a key pollinator in agricultural operations as well as a model organism for studying the genetics and evolution of social behavior. The Apis mellifera genome has been sequenced and annotated twice over, enabling proteomics and functional genomics methods for probing relevant aspects of their biology. One troubling trend that emerged from proteomic analyses is that honey bee peptide samples consistently result in lower peptide identification rates compared with other organisms. This suggests that the genome annotation can be improved, or that atypical biological processes are interfering with the mass spectrometry workflow. First, we tested whether high levels of polymorphisms could explain some of the missed identifications by searching spectra against the reference proteome (OGSv3.2) versus a customized proteome of a single honey bee, but our results indicate that this contribution was minor. Likewise, error-tolerant peptide searches led us to eliminate unexpected post-translational modifications as a major factor in missed identifications. We then used a proteogenomic approach with ~1500 raw files to search for missing genes and new exons, to revive discarded annotations and to identify over 2000 new coding regions. These results will contribute to a more comprehensive genome annotation and facilitate continued research on this important insect.
USDA-ARS's Scientific Manuscript database
Large sets of genomic data are becoming available for cucumber (Cucumis sativus), yet there is no tool for whole genome genotyping. Creation of saturated genetic maps depends on development of good markers. The present cucumber genetic maps are based on several hundreds of markers. However they are ...
The Effects of Signal Erosion and Core Genome Reduction on the Identification of Diagnostic Markers
Sahl, Jason W.; Vazquez, Adam J.; Hall, Carina M.; Busch, Joseph D.; Tuanyok, Apichai; Mayo, Mark; Schupp, James M.; Lummis, Madeline; Pearson, Talima; Shippy, Kenzie; Allender, Christopher J.; Theobald, Vanessa; Hutcheson, Alex; Korlach, Jonas; LiPuma, John J.; Ladner, Jason; Lovett, Sean; Koroleva, Galina; Palacios, Gustavo; Limmathurotsakul, Direk; Wuthiekanun, Vanaporn; Wongsuwan, Gumphol; Currie, Bart J.
2016-01-01
ABSTRACT Whole-genome sequence (WGS) data are commonly used to design diagnostic targets for the identification of bacterial pathogens. To do this effectively, genomics databases must be comprehensive to identify the strict core genome that is specific to the target pathogen. As additional genomes are analyzed, the core genome size is reduced and there is erosion of the target-specific regions due to commonality with related species, potentially resulting in the identification of false positives and/or false negatives. PMID:27651357
Freytag, Saskia; Manitz, Juliane; Schlather, Martin; Kneib, Thomas; Amos, Christopher I.; Risch, Angela; Chang-Claude, Jenny; Heinrich, Joachim; Bickeböller, Heike
2014-01-01
Biological pathways provide rich information and biological context on the genetic causes of complex diseases. The logistic kernel machine test integrates prior knowledge on pathways in order to analyze data from genome-wide association studies (GWAS). Here, the kernel converts genomic information of two individuals to a quantitative value reflecting their genetic similarity. With the selection of the kernel one implicitly chooses a genetic effect model. Like many other pathway methods, none of the available kernels accounts for topological structure of the pathway or gene-gene interaction types. However, evidence indicates that connectivity and neighborhood of genes are crucial in the context of GWAS, because genes associated with a disease often interact. Thus, we propose a novel kernel that incorporates the topology of pathways and information on interactions. Using simulation studies, we demonstrate that the proposed method maintains the type I error correctly and can be more effective in the identification of pathways associated with a disease than non-network-based methods. We apply our approach to genome-wide association case control data on lung cancer and rheumatoid arthritis. We identify some promising new pathways associated with these diseases, which may improve our current understanding of the genetic mechanisms. PMID:24434848
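To make the kernel idea concrete, the sketch below contrasts a plain linear kernel with a network-weighted variant. It is an illustration only and not the kernel proposed in this abstract, whose exact form is not given here. The genotype coding (0/1/2 minor-allele counts), the weight matrix and the function names are assumptions.

```python
def linear_kernel(x, y):
    """Similarity of two individuals as the dot product of their genotypes."""
    return sum(a * b for a, b in zip(x, y))

def network_kernel(x, y, weights):
    """Hypothetical network-weighted similarity x^T W y.

    weights[i][j] > 0 lets marker i of one individual contribute together
    with marker j of the other, e.g. when their genes interact in a pathway.
    """
    n = len(x)
    return sum(x[i] * weights[i][j] * y[j] for i in range(n) for j in range(n))

# Toy genotypes for two individuals at three markers.
ind_a = [0, 1, 2]
ind_b = [1, 1, 0]

# Toy weights: identity plus 0.5 for markers in neighbouring genes.
w = [[1.0, 0.5, 0.0],
     [0.5, 1.0, 0.5],
     [0.0, 0.5, 1.0]]

print(linear_kernel(ind_a, ind_b))
print(network_kernel(ind_a, ind_b, w))
```

With an identity weight matrix the second function reduces to the first, which shows how a topology-aware kernel can be read as a generalisation of a non-network one.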
Jiang, Lingxi; Yang, Litao; Rao, Jun; Guo, Jinchao; Wang, Shu; Liu, Jia; Lee, Seonghun; Zhang, Dabing
2010-02-01
To implement genetically modified organism (GMO) labeling regulations, an event-specific analysis method based on the junction sequence between exogenous integration and host genomic DNA has become the preferred approach for GMO identification and quantification. In this study, specific primers and TaqMan probes based on the revealed 5'-end junction sequence of GM cotton MON15985 were designed, and qualitative and quantitative polymerase chain reaction (PCR) assays were established employing the designed primers and probes. In the qualitative PCR assay, the limit of detection (LOD) was 0.5 g kg⁻¹ in 100 ng total cotton genomic DNA, corresponding to about 17 copies of haploid cotton genomic DNA, and the LOD and limit of quantification (LOQ) for the quantitative PCR assay were 10 and 17 copies of haploid cotton genomic DNA, respectively. Furthermore, the developed quantitative PCR assays were validated in-house by five different researchers. In addition, five practical samples with known GM contents were quantified using the developed PCR assay during the in-house validation, and the bias between the true and quantified values ranged from 2.06% to 12.59%. This study shows that the developed qualitative and quantitative PCR methods are applicable for the identification and quantification of GM cotton MON15985 and its derivatives.
Privacy preserving protocol for detecting genetic relatives using rare variants.
Hormozdiari, Farhad; Joo, Jong Wha J; Wadia, Akshay; Guan, Feng; Ostrosky, Rafail; Sahai, Amit; Eskin, Eleazar
2014-06-15
High-throughput sequencing technologies have impacted many areas of genetic research. One such area is the identification of relatives from genetic data. The standard approach for the identification of genetic relatives collects the genomic data of all individuals and stores it in a database. Then, each pair of individuals is compared to detect the set of genetic relatives, and the matched individuals are informed. The main drawback of this approach is the requirement of sharing your genetic data with a trusted third party to perform the relatedness test. In this work, we propose a secure protocol to detect genetic relatives from sequencing data without exposing any information about their genomes. We assume that individuals have access to their genome sequences but do not want to share their genomes with anyone else. Unlike previous approaches, our approach uses both common and rare variants, which provides the ability to detect much more distant relationships securely. We use simulated data generated from the 1000 Genomes data and show that we can easily detect relatives as distant as fifth-degree cousins, which was not possible using existing methods. We also show in the 1000 Genomes data with cryptic relationships that our method can detect these individuals. The software is freely available for download at http://genetics.cs.ucla.edu/crypto/. © The Author 2014. Published by Oxford University Press.
High-throughput gene mapping in Caenorhabditis elegans.
Swan, Kathryn A; Curtis, Damian E; McKusick, Kathleen B; Voinov, Alexander V; Mapa, Felipa A; Cancilla, Michael R
2002-07-01
Positional cloning of mutations in model genetic systems is a powerful method for the identification of targets of medical and agricultural importance. To facilitate the high-throughput mapping of mutations in Caenorhabditis elegans, we have identified a further 9602 putative new single nucleotide polymorphisms (SNPs) between two C. elegans strains, Bristol N2 and the Hawaiian mapping strain CB4856, by sequencing inserts from a CB4856 genomic DNA library and using an informatics pipeline to compare sequences with the canonical N2 genomic sequence. When combined with data from other laboratories, our marker set of 17,189 SNPs provides even coverage of the complete worm genome. To date, we have confirmed >1099 evenly spaced SNPs (one every 91 +/- 56 kb) across the six chromosomes and validated the utility of our SNP marker set and new fluorescence polarization-based genotyping methods for systematic and high-throughput identification of genes in C. elegans by cloning several proprietary genes. We illustrate our approach by recombination mapping and confirmation of the mutation in the cloned gene, dpy-18.
Damage identification of a TLP floating wind turbine by meta-heuristic algorithms
NASA Astrophysics Data System (ADS)
Ettefagh, M. M.
2015-12-01
Damage identification of offshore floating wind turbines from vibration/dynamic signals is an important and new research field in Structural Health Monitoring (SHM). In this paper a new damage identification method is proposed based on meta-heuristic algorithms using the dynamic response of the TLP (Tension-Leg Platform) floating wind turbine structure. Genetic Algorithms (GA), Artificial Immune System (AIS), Particle Swarm Optimization (PSO), and Artificial Bee Colony (ABC) are chosen for minimizing the objective function, which is defined for the purpose of damage identification. In addition to studying the capability of these algorithms to correctly identify the damage, the effect of the response type on the identification results is studied. The results of the proposed damage identification are also investigated considering possible uncertainties of the structure. Finally, to evaluate the proposed method under real conditions, a 1/100 scaled experimental setup of the TLP Floating Wind Turbine (TLPFWT) is set up at laboratory scale and the proposed damage identification method is applied to the scaled turbine.
Systematic comparison of variant calling pipelines using gold standard personal exome variants
Hwang, Sohyun; Kim, Eiru; Lee, Insuk; Marcotte, Edward M.
2015-01-01
The success of clinical genomics using next generation sequencing (NGS) requires the accurate and consistent identification of personal genome variants. Assorted variant calling methods have been developed, which show low concordance between their calls. Hence, a systematic comparison of the variant callers could give important guidance to NGS-based clinical genomics. Recently, a set of high-confidence variant calls for one individual (NA12878) has been published by the Genome in a Bottle (GIAB) consortium, enabling performance benchmarking of different variant calling pipelines. Based on the gold standard reference variant calls from GIAB, we compared the performance of thirteen variant calling pipelines, testing combinations of three read aligners (BWA-MEM, Bowtie2, and Novoalign) and four variant callers (Genome Analysis Tool Kit HaplotypeCaller [GATK-HC], Samtools mpileup, Freebayes and Ion Proton Variant Caller [TVC]), for twelve data sets for the NA12878 genome sequenced by different platforms including Illumina2000, Illumina2500, and Ion Proton, with various exome capture systems and exome coverage. We observed different biases toward specific types of SNP genotyping errors by the different variant callers. The results of our study provide useful guidelines for reliable variant identification from deep sequencing of personal genomes. PMID:26639839
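Benchmarking a pipeline against a gold standard call set usually comes down to counting shared and unshared variants. The sketch below assumes each variant is represented as a (chromosome, position, ref, alt) tuple and reports precision and recall; the abstract does not state which metrics were used, and a real comparison would also restrict calls to high-confidence regions and normalise variant representation. The function name and toy variants are hypothetical.

```python
def benchmark_calls(called, truth):
    """Compare a pipeline's variant calls with a gold standard call set."""
    called, truth = set(called), set(truth)
    tp = len(called & truth)   # called and present in the gold standard
    fp = len(called - truth)   # called but absent from the gold standard
    fn = len(truth - called)   # in the gold standard but missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return {"tp": tp, "fp": fp, "fn": fn,
            "precision": precision, "recall": recall}

# Toy call sets, not real NA12878 variants.
gold = [("chr1", 100, "A", "G"), ("chr1", 250, "C", "T"), ("chr2", 42, "G", "A")]
calls = [("chr1", 100, "A", "G"), ("chr2", 42, "G", "A"),
         ("chr2", 77, "T", "C"), ("chr3", 5, "A", "T")]

print(benchmark_calls(calls, gold))
```

Running the same function over each aligner and caller combination would give a table of per-pipeline metrics that can be compared directly.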
Guidelines for Genome-Scale Analysis of Biological Rhythms
Hughes, Michael E.; Abruzzi, Katherine C.; Allada, Ravi; Anafi, Ron; Arpat, Alaaddin Bulak; Asher, Gad; Baldi, Pierre; de Bekker, Charissa; Bell-Pedersen, Deborah; Blau, Justin; Brown, Steve; Ceriani, M. Fernanda; Chen, Zheng; Chiu, Joanna C.; Cox, Juergen; Crowell, Alexander M.; DeBruyne, Jason P.; Dijk, Derk-Jan; DiTacchio, Luciano; Doyle, Francis J.; Duffield, Giles E.; Dunlap, Jay C.; Eckel-Mahan, Kristin; Esser, Karyn A.; FitzGerald, Garret A.; Forger, Daniel B.; Francey, Lauren J.; Fu, Ying-Hui; Gachon, Frédéric; Gatfield, David; de Goede, Paul; Golden, Susan S.; Green, Carla; Harer, John; Harmer, Stacey; Haspel, Jeff; Hastings, Michael H.; Herzel, Hanspeter; Herzog, Erik D.; Hoffmann, Christy; Hong, Christian; Hughey, Jacob J.; Hurley, Jennifer M.; de la Iglesia, Horacio O.; Johnson, Carl; Kay, Steve A.; Koike, Nobuya; Kornacker, Karl; Kramer, Achim; Lamia, Katja; Leise, Tanya; Lewis, Scott A.; Li, Jiajia; Li, Xiaodong; Liu, Andrew C.; Loros, Jennifer J.; Martino, Tami A.; Menet, Jerome S.; Merrow, Martha; Millar, Andrew J.; Mockler, Todd; Naef, Felix; Nagoshi, Emi; Nitabach, Michael N.; Olmedo, Maria; Nusinow, Dmitri A.; Ptáček, Louis J.; Rand, David; Reddy, Akhilesh B.; Robles, Maria S.; Roenneberg, Till; Rosbash, Michael; Ruben, Marc D.; Rund, Samuel S.C.; Sancar, Aziz; Sassone-Corsi, Paolo; Sehgal, Amita; Sherrill-Mix, Scott; Skene, Debra J.; Storch, Kai-Florian; Takahashi, Joseph S.; Ueda, Hiroki R.; Wang, Han; Weitz, Charles; Westermark, Pål O.; Wijnen, Herman; Xu, Ying; Wu, Gang; Yoo, Seung-Hee; Young, Michael; Zhang, Eric Erquan; Zielinski, Tomasz; Hogenesch, John B.
2017-01-01
Genome biology approaches have made enormous contributions to our understanding of biological rhythms, particularly in identifying outputs of the clock, including RNAs, proteins, and metabolites, whose abundance oscillates throughout the day. These methods hold significant promise for future discovery, particularly when combined with computational modeling. However, genome-scale experiments are costly and laborious, yielding “big data” that are conceptually and statistically difficult to analyze. There is no obvious consensus regarding design or analysis. Here we discuss the relevant technical considerations to generate reproducible, statistically sound, and broadly useful genome-scale data. Rather than suggest a set of rigid rules, we aim to codify principles by which investigators, reviewers, and readers of the primary literature can evaluate the suitability of different experimental designs for measuring different aspects of biological rhythms. We introduce CircaInSilico, a web-based application for generating synthetic genome biology data to benchmark statistical methods for studying biological rhythms. Finally, we discuss several unmet analytical needs, including applications to clinical medicine, and suggest productive avenues to address them. PMID:29098954
Cotten, Matthew; Oude Munnink, Bas; Canuti, Marta; Deijs, Martin; Watson, Simon J.; Kellam, Paul; van der Hoek, Lia
2014-01-01
We have developed a full genome virus detection process that combines sensitive nucleic acid preparation optimised for virus identification in fecal material with Illumina MiSeq sequencing and a novel post-sequencing virus identification algorithm. Enriched viral nucleic acid was converted to double-stranded DNA and subjected to Illumina MiSeq sequencing. The resulting short reads were processed with a novel iterative Python algorithm SLIM for the identification of sequences with homology to known viruses. De novo assembly was then used to generate full viral genomes. The sensitivity of this process was demonstrated with a set of fecal samples from HIV-1 infected patients. A quantitative assessment of the mammalian, plant, and bacterial virus content of this compartment was generated and the deep sequencing data were sufficient to assemble 12 complete viral genomes from 6 virus families. The method detected high levels of enteropathic viruses that are normally controlled in healthy adults, but may be involved in the pathogenesis of HIV-1 infection and will provide a powerful tool for virus detection and for analyzing changes in the fecal virome associated with HIV-1 progression and pathogenesis. PMID:24695106
Cloud-based adaptive exon prediction for DNA analysis
Putluri, Srinivasareddy; Fathima, Shaik Yasmeen
2018-01-01
Cloud computing offers significant research and economic benefits to healthcare organisations. Cloud services provide a safe place for storing and managing large amounts of sensitive data. Under the conventional flow of gene information, gene sequence laboratories send out raw and inferred information via the Internet to several sequence libraries. DNA sequence storage costs can be minimised by the use of cloud services. In this study, the authors put forward a novel genomic informatics system using Amazon Cloud Services, where genomic sequence information is stored and accessed for processing. Accurate identification of exon regions in a DNA sequence is a key task in bioinformatics, which helps in disease identification and drug design. The three-base periodicity property of exons forms the basis of all exon identification techniques. Adaptive signal processing techniques have been found to be promising in comparison with several other methods. Several adaptive exon predictors (AEPs) are developed using variable normalised least mean square and its maximum normalised variants to reduce computational complexity. Finally, performance evaluation of the various AEPs is done based on measures such as sensitivity, specificity and precision using various standard genomic datasets taken from the National Center for Biotechnology Information genomic sequence database. PMID:29515813
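The three-base periodicity property mentioned above can be illustrated with a simple spectral measure. The sketch below is not one of the adaptive exon predictors from this abstract (those use normalised least mean square filters); it is a plainer discrete Fourier transform illustration that sums, over the four base indicator sequences, the power at frequency 1/3. The function name and the toy sequences are assumptions.

```python
import cmath

def period3_power(seq):
    """Power at frequency 1/3, summed over A/C/G/T indicator sequences.

    Coding regions tend to score higher than non-coding ones because codon
    structure makes base usage repeat with period three.
    """
    seq = seq.upper()
    n = len(seq)
    if n == 0:
        return 0.0
    total = 0.0
    for base in "ACGT":
        # DFT coefficient of the 0/1 indicator sequence for this base.
        coeff = sum(cmath.exp(-2j * cmath.pi * i / 3)
                    for i, b in enumerate(seq) if b == base)
        total += abs(coeff) ** 2
    return total / n

# Toy sequences: a perfectly period-3 repeat versus a homopolymer.
print(period3_power("ACG" * 4))
print(period3_power("A" * 12))
```

In practice the measure would be computed in a sliding window along a sequence, and windows with a pronounced peak would be treated as candidate exon regions.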
Large-scale identification of chemically induced mutations in Drosophila melanogaster
Haelterman, Nele A.; Jiang, Lichun; Li, Yumei; Bayat, Vafa; Sandoval, Hector; Ugur, Berrak; Tan, Kai Li; Zhang, Ke; Bei, Danqing; Xiong, Bo; Charng, Wu-Lin; Busby, Theodore; Jawaid, Adeel; David, Gabriela; Jaiswal, Manish; Venken, Koen J.T.; Yamamoto, Shinya
2014-01-01
Forward genetic screens using chemical mutagens have been successful in defining the function of thousands of genes in eukaryotic model organisms. The main drawback of this strategy is the time-consuming identification of the molecular lesions causative of the phenotypes of interest. With whole-genome sequencing (WGS), it is now possible to sequence hundreds of strains, but determining which mutations are causative among thousands of polymorphisms remains challenging. We have sequenced 394 mutant strains, generated in a chemical mutagenesis screen, for essential genes on the Drosophila X chromosome and describe strategies to reduce the number of candidate mutations from an average of ∼3500 to 35 single-nucleotide variants per chromosome. By combining WGS with a rough mapping method based on large duplications, we were able to map 274 (∼70%) mutations. We show that these mutations are causative, using small 80-kb duplications that rescue lethality. Hence, our findings demonstrate that combining rough mapping with WGS dramatically expands the toolkit necessary for assigning function to genes. PMID:25258387
Integrative analysis of the Caenorhabditis elegans genome by the modENCODE project.
Gerstein, Mark B; Lu, Zhi John; Van Nostrand, Eric L; Cheng, Chao; Arshinoff, Bradley I; Liu, Tao; Yip, Kevin Y; Robilotto, Rebecca; Rechtsteiner, Andreas; Ikegami, Kohta; Alves, Pedro; Chateigner, Aurelien; Perry, Marc; Morris, Mitzi; Auerbach, Raymond K; Feng, Xin; Leng, Jing; Vielle, Anne; Niu, Wei; Rhrissorrakrai, Kahn; Agarwal, Ashish; Alexander, Roger P; Barber, Galt; Brdlik, Cathleen M; Brennan, Jennifer; Brouillet, Jeremy Jean; Carr, Adrian; Cheung, Ming-Sin; Clawson, Hiram; Contrino, Sergio; Dannenberg, Luke O; Dernburg, Abby F; Desai, Arshad; Dick, Lindsay; Dosé, Andréa C; Du, Jiang; Egelhofer, Thea; Ercan, Sevinc; Euskirchen, Ghia; Ewing, Brent; Feingold, Elise A; Gassmann, Reto; Good, Peter J; Green, Phil; Gullier, Francois; Gutwein, Michelle; Guyer, Mark S; Habegger, Lukas; Han, Ting; Henikoff, Jorja G; Henz, Stefan R; Hinrichs, Angie; Holster, Heather; Hyman, Tony; Iniguez, A Leo; Janette, Judith; Jensen, Morten; Kato, Masaomi; Kent, W James; Kephart, Ellen; Khivansara, Vishal; Khurana, Ekta; Kim, John K; Kolasinska-Zwierz, Paulina; Lai, Eric C; Latorre, Isabel; Leahey, Amber; Lewis, Suzanna; Lloyd, Paul; Lochovsky, Lucas; Lowdon, Rebecca F; Lubling, Yaniv; Lyne, Rachel; MacCoss, Michael; Mackowiak, Sebastian D; Mangone, Marco; McKay, Sheldon; Mecenas, Desirea; Merrihew, Gennifer; Miller, David M; Muroyama, Andrew; Murray, John I; Ooi, Siew-Loon; Pham, Hoang; Phippen, Taryn; Preston, Elicia A; Rajewsky, Nikolaus; Rätsch, Gunnar; Rosenbaum, Heidi; Rozowsky, Joel; Rutherford, Kim; Ruzanov, Peter; Sarov, Mihail; Sasidharan, Rajkumar; Sboner, Andrea; Scheid, Paul; Segal, Eran; Shin, Hyunjin; Shou, Chong; Slack, Frank J; Slightam, Cindie; Smith, Richard; Spencer, William C; Stinson, E O; Taing, Scott; Takasaki, Teruaki; Vafeados, Dionne; Voronina, Ksenia; Wang, Guilin; Washington, Nicole L; Whittle, Christina M; Wu, Beijing; Yan, Koon-Kiu; Zeller, Georg; Zha, Zheng; Zhong, Mei; Zhou, Xingliang; Ahringer, Julie; Strome, Susan; Gunsalus, Kristin C; Micklem, Gos; Liu, X Shirley; Reinke, Valerie; Kim, Stuart K; Hillier, LaDeana W; Henikoff, Steven; Piano, Fabio; Snyder, Michael; Stein, Lincoln; Lieb, Jason D; Waterston, Robert H
2010-12-24
We systematically generated large-scale data sets to improve genome annotation for the nematode Caenorhabditis elegans, a key model organism. These data sets include transcriptome profiling across a developmental time course, genome-wide identification of transcription factor-binding sites, and maps of chromatin organization. From this, we created more complete and accurate gene models, including alternative splice forms and candidate noncoding RNAs. We constructed hierarchical networks of transcription factor-binding and microRNA interactions and discovered chromosomal locations bound by an unusually large number of transcription factors. Different patterns of chromatin composition and histone modification were revealed between chromosome arms and centers, with similarly prominent differences between autosomes and the X chromosome. Integrating data types, we built statistical models relating chromatin, transcription factor binding, and gene expression. Overall, our analyses ascribed putative functions to most of the conserved genome.
Unravelling the hidden ancestry of American admixed populations.
Montinaro, Francesco; Busby, George B J; Pascali, Vincenzo L; Myers, Simon; Hellenthal, Garrett; Capelli, Cristian
2015-03-24
The movement of people into the Americas has brought different populations into contact, and contemporary American genomes are the product of a range of complex admixture events. Here we apply a haplotype-based ancestry identification approach to a large set of genome-wide SNP data from a variety of American, European and African populations to determine the contributions of different ancestral populations to the Americas. Our results provide a fine-scale characterization of the source populations, identify a series of novel, previously unreported contributions from Africa and Europe and highlight geohistorical structure in the ancestry of American admixed populations.
Quantifying Selection with Pool-Seq Time Series Data.
Taus, Thomas; Futschik, Andreas; Schlötterer, Christian
2017-11-01
Allele frequency time series data constitute a powerful resource for unraveling mechanisms of adaptation, because the temporal dimension captures important information about evolutionary forces. In particular, Evolve and Resequence (E&R), the whole-genome sequencing of replicated experimentally evolving populations, is becoming increasingly popular. Based on computer simulations several studies proposed experimental parameters to optimize the identification of the selection targets. No such recommendations are available for the underlying parameters selection strength and dominance. Here, we introduce a highly accurate method to estimate selection parameters from replicated time series data, which is fast enough to be applied on a genome scale. Using this new method, we evaluate how experimental parameters can be optimized to obtain the most reliable estimates for selection parameters. We show that the effective population size (Ne) and the number of replicates have the largest impact. Because the number of time points and sequencing coverage had only a minor effect, we suggest that time series analysis is feasible without major increase in sequencing costs. We anticipate that time series analysis will become routine in E&R studies. © The Author 2017. Published by Oxford University Press on behalf of the Society for Molecular Biology and Evolution.
Nanoscale Bio-engineering Solutions for Space Exploration: The Nanopore Sequencer
NASA Technical Reports Server (NTRS)
Stolc, Viktor; Cozmuta, Ioana
2004-01-01
Characterization of biological systems at the molecular level and extraction of essential information for nano-engineering design to guide the nano-fabrication of solid-state sensors and molecular identification devices is a computational challenge. The alpha hemolysin protein ion channel is used as a model system for structural analysis of nucleic acids like DNA. Applied voltage draws a DNA strand and surrounding ionic solution through the biological nanopore. The subunits in the DNA strand block ion flow by differing amounts. Atomistic scale simulations are employed using NASA supercomputers to study DNA translocation, with the aim to enhance single DNA subunit identification. Compared to protein channels, solid-state nanopores offer a better temporal control of the translocation of DNA and the possibility to easily tune its chemistry to increase the signal resolution. Potential applications for NASA missions, besides real-time genome sequencing include astronaut health, life detection and decoding of various genomes.
Nanoscale Bioengineering Solutions for Space Exploration: The Nanopore Sequencer
NASA Technical Reports Server (NTRS)
Cozmuta, Ioana; Stolc, Viktor
2005-01-01
Characterization of biological systems at the molecular level and extraction of essential information for nano-engineering design to guide the nano-fabrication of solid-state sensors and molecular identification devices is a computational challenge. The alpha hemolysin protein ion channel is used as a model system for structural analysis of nucleic acids like DNA. Applied voltage draws a DNA strand and surrounding ionic solution through the biological nanopore. The subunits in the DNA strand block ion flow by differing amounts. Atomistic scale simulations are employed using NASA supercomputers to study DNA translocation, with the aim to enhance single DNA subunit identification. Compared to protein channels, solid-state nanopores offer a better temporal control of the translocation of DNA and the possibility to easily tune its chemistry to increase the signal resolution. Potential applications for NASA missions, besides real-time genome sequencing include astronaut health, life detection and decoding of various genomes. http://phenomrph.arc.nasa.gov/index.php
Multiprimer PCR system for differential identification of mycobacteria in clinical samples.
Del Portillo, P; Thomas, M C; Martínez, E; Marañón, C; Valladares, B; Patarroyo, M E; Carlos López, M
1996-01-01
A novel multiprimer PCR method with the potential to identify mycobacteria in clinical samples is presented. The assay relies on the simultaneous amplification of three bacterial DNA genomic fragments by using different sets of oligonucleotide primers. The first set of primers amplifies a 506-bp fragment from the gene for the 32-kDa antigen of Mycobacterium tuberculosis, which is present in most of the species belonging to the genus Mycobacterium. The second set of primers amplifies a 984-bp fragment from the IS6110 insertion sequence of the bacteria belonging to the M. tuberculosis complex. The third set of primers, derived from an M. tuberculosis species-specific sequence named MTP40, amplifies a 396-bp genomic fragment. Thus, while the multiprimer system would render three amplification fragments from the M. tuberculosis genome and two fragments from the Mycobacterium bovis genome, a unique amplification fragment would be obtained from nontuberculous mycobacteria. The results obtained, using reference mycobacterial strains and typed clinical isolates, show that the multiprimer PCR method may be a rapid, sensitive, and specific tool for the differential identification of various mycobacterial strains in a single-step assay. PMID:8789008
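The band-pattern logic described above can be sketched as a small decision rule. The fragment sizes come from the abstract; the assumption that the two M. bovis fragments are the 506-bp and 984-bp products, the size tolerance, and the function name are illustrative choices, not details stated in the source.

```python
# Expected amplicon sizes (bp) as stated in the abstract.
GENUS_BP = 506     # 32-kDa antigen gene, most Mycobacterium species
COMPLEX_BP = 984   # IS6110, M. tuberculosis complex
MTP40_BP = 396     # MTP40, M. tuberculosis species-specific

def classify(bands, tolerance=15):
    """Map observed band sizes (bp) to a call.

    tolerance is an assumed gel-reading margin, not a value from the paper.
    """
    def seen(expected):
        return any(abs(b - expected) <= tolerance for b in bands)

    genus, cplx, mtp40 = seen(GENUS_BP), seen(COMPLEX_BP), seen(MTP40_BP)
    if genus and cplx and mtp40:
        return "M. tuberculosis"
    if genus and cplx:
        # Assumed reading of "two fragments from the M. bovis genome".
        return "M. tuberculosis complex, MTP40-negative (e.g. M. bovis)"
    if genus and not cplx and not mtp40:
        return "nontuberculous mycobacterium"
    if not (genus or cplx or mtp40):
        return "no mycobacterial product"
    return "inconclusive pattern"
```

For example, `classify([510, 980, 400])` yields the M. tuberculosis call, while a lone band near 506 bp yields the nontuberculous call.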
Mohd-Yusoff, Nur Fatihah; Ruperao, Pradeep; Tomoyoshi, Nurain Emylia; Edwards, David; Gresshoff, Peter M.; Biswas, Bandana; Batley, Jacqueline
2015-01-01
Genetic structure can be altered by chemical mutagenesis, which is a common method applied in molecular biology and genetics. Second-generation sequencing provides a platform to reveal base alterations occurring in the whole genome due to mutagenesis. A model legume, Lotus japonicus ecotype Miyakojima, was chemically mutated with alkylating ethyl methanesulfonate (EMS) for the scanning of DNA lesions throughout the genome. Using second-generation sequencing, two individually mutated third-generation progeny (M3, named AM and AS) were sequenced and analyzed to identify single nucleotide polymorphisms and reveal the effects of EMS on nucleotide sequences in these mutant genomes. Single-nucleotide polymorphisms were found once every 208 kb (AS) and 202 kb (AM), with a bias toward G/C-to-A/T changes at a low percentage. Most mutations were intergenic. The mutation spectrum of the genomes was comparable in their individual chromosomes; however, each mutated genome has unique alterations, which are useful to identify causal mutations for their phenotypic changes. The data obtained demonstrate that whole genomic sequencing is applicable as a high-throughput tool to investigate genomic changes due to mutagenesis. The identification of these single-point mutations will facilitate the identification of phenotypically causative mutations in EMS-mutated germplasm. PMID:25660167
An efficient graph theory based method to identify every minimal reaction set in a metabolic network
2014-01-01
Background Development of cells with minimal metabolic functionality is gaining importance due to their efficiency in producing chemicals and fuels. Existing computational methods to identify minimal reaction sets in metabolic networks are computationally expensive. Further, they identify only one of the several possible minimal reaction sets. Results In this paper, we propose an efficient graph theory based recursive optimization approach to identify all minimal reaction sets. Graph theoretical insights offer systematic methods to not only reduce the number of variables in math programming and increase its computational efficiency, but also provide efficient ways to find multiple optimal solutions. The efficacy of the proposed approach is demonstrated using case studies from Escherichia coli and Saccharomyces cerevisiae. In case study 1, the proposed method identified three minimal reaction sets, each containing 38 reactions, in the Escherichia coli central metabolic network with 77 reactions. Analysis of these three minimal reaction sets revealed that one of them is more suitable for developing a minimal-metabolism cell than the other two because of its practically achievable internal flux distribution. In case study 2, the proposed method identified 256 minimal reaction sets from the Saccharomyces cerevisiae genome-scale metabolic network with 620 reactions. The proposed method required only 4.5 hours to identify all the 256 minimal reaction sets and has shown a significant reduction (approximately 80%) in the solution time when compared to the existing methods for finding minimal reaction sets. Conclusions Identification of all minimal reaction sets in metabolic networks is essential since different minimal reaction sets have different properties that affect bioprocess development. The proposed method correctly identified all minimal reaction sets in both case studies.
The proposed method is computationally efficient compared to other methods for finding minimal reaction sets and useful to employ with genome-scale metabolic networks. PMID:24594118
An evaluation of two-channel ChIP-on-chip and DNA methylation microarray normalization strategies
2012-01-01
Background The combination of chromatin immunoprecipitation with two-channel microarray technology enables genome-wide mapping of binding sites of DNA-interacting proteins (ChIP-on-chip) or sites with methylated CpG di-nucleotides (DNA methylation microarray). These powerful tools are the gateway to understanding gene transcription regulation. Since the goals of such studies, the sample preparation procedures, the microarray content and study design are all different from transcriptomics microarrays, the data pre-processing strategies traditionally applied to transcriptomics microarrays may not be appropriate. Particularly, the main challenge of the normalization of "regulation microarrays" is (i) to make the data of individual microarrays quantitatively comparable and (ii) to keep the signals of the enriched probes, representing DNA sequences from the precipitate, as distinguishable as possible from the signals of the un-enriched probes, representing DNA sequences largely absent from the precipitate. Results We compare several widely used normalization approaches (VSN, LOWESS, quantile, T-quantile, Tukey's biweight scaling, Peng's method) applied to a selection of regulation microarray datasets, ranging from DNA methylation to transcription factor binding and histone modification studies. Through comparison of the data distributions of control probes and gene promoter probes before and after normalization, and assessment of the power to identify known enriched genomic regions after normalization, we demonstrate that there are clear differences in performance between normalization procedures. Conclusion T-quantile normalization applied separately on the channels and Tukey's biweight scaling outperform other methods in terms of the conservation of enriched and un-enriched signal separation, as well as in identification of genomic regions known to be enriched. T-quantile normalization is preferable as it additionally improves comparability between microarrays. 
In contrast, popular normalization approaches like quantile, LOWESS, Peng's method and VSN normalization alter the data distributions of regulation microarrays to such an extent that using these approaches will impact the reliability of the downstream analysis substantially. PMID:22276688
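As a point of reference for the quantile-based approaches compared above, here is a minimal sketch of plain quantile normalization across arrays. Applying it separately to each channel group (for example, all immunoprecipitated channels together, then all input channels together) is our reading of "applied separately on the channels" and should be treated as an assumption; the function name and toy intensities are hypothetical.

```python
def quantile_normalize(columns):
    """Force every column (one array's intensities) onto a shared distribution.

    The reference distribution is the mean of the sorted columns; each value
    is replaced by the reference value at its rank. Ties are broken by
    position, which is a simplification.
    """
    n = len(columns[0])
    if any(len(col) != n for col in columns):
        raise ValueError("all columns must have the same number of probes")
    sorted_cols = [sorted(col) for col in columns]
    reference = [sum(vals) / len(columns) for vals in zip(*sorted_cols)]
    normalized = []
    for col in columns:
        order = sorted(range(n), key=lambda i: col[i])
        out = [0.0] * n
        for rank, i in enumerate(order):
            out[i] = reference[rank]
        normalized.append(out)
    return normalized

# Toy intensities for one channel on two arrays (three probes each).
ip_channel = [[5.0, 2.0, 3.0], [4.0, 1.0, 6.0]]
ip_norm = quantile_normalize(ip_channel)
```

After normalization both arrays contain the same set of values, which is exactly why the abstract cautions that such methods can distort the separation between enriched and un-enriched probes when applied indiscriminately.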
2013-01-01
Background Colorectal cancer is the third leading cause of cancer deaths in the United States. The initial assessment of colorectal cancer involves clinical staging that takes into account the extent of primary tumor invasion, determining the number of lymph nodes with metastatic cancer and the identification of metastatic sites in other organs. Advanced clinical stage indicates metastatic cancer, either in regional lymph nodes or in distant organs. While the genomic and genetic basis of colorectal cancer has been elucidated to some degree, less is known about the identity of specific cancer genes that are associated with advanced clinical stage and metastasis. Methods We compiled multiple genomic data types (mutations, copy number alterations, gene expression and methylation status) as well as clinical meta-data from The Cancer Genome Atlas (TCGA). We used an elastic-net regularized regression method on the combined genomic data to identify genetic aberrations and their associated cancer genes that are indicators of clinical stage. We ranked candidate genes by their regression coefficient and level of support from multiple assay modalities. Results A fit of the elastic-net regularized regression to 197 samples and integrated analysis of four genomic platforms identified the set of top gene predictors of advanced clinical stage, including: WRN, SYK, DDX5 and ADRA2C. These genetic features were identified robustly in bootstrap resampling analysis. Conclusions We conducted an analysis integrating multiple genomic features including mutations, copy number alterations, gene expression and methylation. This integrated approach in which one considers all of these genomic features performs better than any individual genomic assay. We identified multiple genes that robustly delineate advanced clinical stage, suggesting their possible role in colorectal cancer metastatic progression. PMID:24308539
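For readers unfamiliar with the elastic net, the penalty combines lasso and ridge terms. One common parameterisation is shown below; the notation is ours and the loss function is left generic, since the abstract does not state which loss or mixing parameter was used.

```latex
\hat{\beta} \;=\; \arg\min_{\beta}\; \mathcal{L}(\beta; X, y)
  \;+\; \lambda \left[ \alpha \,\lVert \beta \rVert_{1}
  \;+\; \frac{1-\alpha}{2}\, \lVert \beta \rVert_{2}^{2} \right],
  \qquad \lambda \ge 0,\; 0 \le \alpha \le 1 .
```

Setting \alpha = 1 recovers the lasso and \alpha = 0 recovers ridge regression; intermediate values select sparse sets of predictors while keeping correlated features together.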
Owen, Joseph R.; Noyes, Noelle; Young, Amy E.; Prince, Daniel J.; Blanchard, Patricia C.; Lehenbauer, Terry W.; Aly, Sharif S.; Davis, Jessica H.; O’Rourke, Sean M.; Abdo, Zaid; Belk, Keith; Miller, Michael R.; Morley, Paul; Van Eenennaam, Alison L.
2017-01-01
Extended laboratory culture and antimicrobial susceptibility testing timelines hinder rapid species identification and susceptibility profiling of bacterial pathogens associated with bovine respiratory disease, the most prevalent cause of cattle mortality in the United States. Whole-genome sequencing offers a culture-independent alternative to current bacterial identification methods, but requires a library of bacterial reference genomes for comparison. To contribute new bacterial genome assemblies and evaluate genetic diversity and variation in antimicrobial resistance genotypes, whole-genome sequencing was performed on bovine respiratory disease–associated bacterial isolates (Histophilus somni, Mycoplasma bovis, Mannheimia haemolytica, and Pasteurella multocida) from dairy and beef cattle. One hundred genomically distinct assemblies were added to the NCBI database, doubling the available genomic sequences for these four species. Computer-based methods identified 11 predicted antimicrobial resistance genes in three species, with none being detected in M. bovis. While computer-based analysis can identify antibiotic resistance genes within whole-genome sequences (genotype), it may not predict the actual antimicrobial resistance observed in a living organism (phenotype). Antimicrobial susceptibility testing on 64 H. somni, M. haemolytica, and P. multocida isolates had an overall concordance rate between genotype and phenotypic resistance to the associated class of antimicrobials of 72.7% (P < 0.001), showing substantial discordance. Concordance rates varied greatly among different antimicrobial, antibiotic resistance gene, and bacterial species combinations. 
This suggests that antimicrobial susceptibility phenotypes are needed to complement genomically predicted antibiotic resistance gene genotypes to better understand how the presence of antibiotic resistance genes within a given bacterial species could potentially impact optimal bovine respiratory disease treatment and morbidity/mortality outcomes. PMID:28739600
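The concordance rate reported above is the share of isolate and drug-class pairs in which the genotype-based prediction matches the observed phenotype. A minimal sketch follows, using hypothetical calls rather than data from the study; the function name is ours.

```python
def concordance_rate(pairs):
    """Fraction of (genotype_resistant, phenotype_resistant) pairs that agree."""
    if not pairs:
        raise ValueError("no isolate/drug pairs supplied")
    agree = sum(1 for geno, pheno in pairs if geno == pheno)
    return agree / len(pairs)

# Hypothetical calls, not data from the study.
calls = [(True, True), (False, False), (True, False), (False, False)]
rate = concordance_rate(calls)
```

A single pooled rate hides the variation the authors describe, so in practice one would compute it per antimicrobial, per resistance gene, and per species.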
Owen, Joseph R; Noyes, Noelle; Young, Amy E; Prince, Daniel J; Blanchard, Patricia C; Lehenbauer, Terry W; Aly, Sharif S; Davis, Jessica H; O'Rourke, Sean M; Abdo, Zaid; Belk, Keith; Miller, Michael R; Morley, Paul; Van Eenennaam, Alison L
2017-09-07
Extended laboratory culture and antimicrobial susceptibility testing timelines hinder rapid species identification and susceptibility profiling of bacterial pathogens associated with bovine respiratory disease, the most prevalent cause of cattle mortality in the United States. Whole-genome sequencing offers a culture-independent alternative to current bacterial identification methods, but requires a library of bacterial reference genomes for comparison. To contribute new bacterial genome assemblies and evaluate genetic diversity and variation in antimicrobial resistance genotypes, whole-genome sequencing was performed on bovine respiratory disease-associated bacterial isolates (Histophilus somni, Mycoplasma bovis, Mannheimia haemolytica, and Pasteurella multocida) from dairy and beef cattle. One hundred genomically distinct assemblies were added to the NCBI database, doubling the available genomic sequences for these four species. Computer-based methods identified 11 predicted antimicrobial resistance genes in three species, with none being detected in M. bovis. While computer-based analysis can identify antibiotic resistance genes within whole-genome sequences (genotype), it may not predict the actual antimicrobial resistance observed in a living organism (phenotype). Antimicrobial susceptibility testing on 64 H. somni, M. haemolytica, and P. multocida isolates had an overall concordance rate between genotype and phenotypic resistance to the associated class of antimicrobials of 72.7% (P < 0.001), showing substantial discordance. Concordance rates varied greatly among different antimicrobial, antibiotic resistance gene, and bacterial species combinations.
This suggests that antimicrobial susceptibility phenotypes are needed to complement genomically predicted antibiotic resistance gene genotypes to better understand how the presence of antibiotic resistance genes within a given bacterial species could potentially impact optimal bovine respiratory disease treatment and morbidity/mortality outcomes. Copyright © 2017 Owen et al.
The Importance of Bacterial Culture to Food Microbiology in the Age of Genomics.
Gill, Alexander
2017-01-01
Culture-based and genomics methods provide different insights into the nature and behavior of bacteria. Maximizing the usefulness of both approaches requires recognizing their limitations and employing them appropriately. Genomic analysis excels at identifying bacteria and establishing the relatedness of isolates. Culture-based methods remain necessary for detection and enumeration, to determine viability, and to validate phenotype predictions made on the basis of genomic analysis. The purpose of this short paper is to discuss the application of culture-based analysis and genomics to the questions food microbiologists routinely need to ask regarding bacteria to ensure the safety of food and its economic production and distribution. To address these issues, appropriate tools are required for the detection and enumeration of specific bacterial populations and the characterization of isolates for identification, phylogenetics, and phenotype prediction.
Phylogenomic Reconstruction of the Oomycete Phylogeny Derived from 37 Genomes
McCarthy, Charley G. P.
2017-01-01
ABSTRACT The oomycetes are a class of microscopic, filamentous eukaryotes within the Stramenopiles-Alveolata-Rhizaria (SAR) supergroup which includes ecologically significant animal and plant pathogens, most infamously the causative agent of potato blight Phytophthora infestans. Single-gene and concatenated phylogenetic studies both of individual oomycete genera and of members of the larger class have resulted in conflicting conclusions concerning species phylogenies within the oomycetes, particularly for the large Phytophthora genus. Genome-scale phylogenetic studies have successfully resolved many eukaryotic relationships by using supertree methods, which combine large numbers of potentially disparate trees to determine evolutionary relationships that cannot be inferred from individual phylogenies alone. With a sufficient amount of genomic data now available, we have undertaken the first whole-genome phylogenetic analysis of the oomycetes using data from 37 oomycete species and 6 SAR species. In our analysis, we used established supertree methods to generate phylogenies from 8,355 homologous oomycete and SAR gene families and have complemented those analyses with both phylogenomic network and concatenated supermatrix analyses. Our results show that a genome-scale approach to oomycete phylogeny resolves oomycete classes and individual clades within the problematic Phytophthora genus. Support for the resolution of the inferred relationships between individual Phytophthora clades varies depending on the methodology used. Our analysis represents an important first step in large-scale phylogenomic analysis of the oomycetes. IMPORTANCE The oomycetes are a class of eukaryotes and include ecologically significant animal and plant pathogens. 
Single-gene and multigene phylogenetic studies of individual oomycete genera and of members of the larger classes have resulted in conflicting conclusions concerning interspecies relationships among these species, particularly for the Phytophthora genus. The onset of next-generation sequencing techniques now means that a wealth of oomycete genomic data is available. For the first time, we have used genome-scale phylogenetic methods to resolve oomycete phylogenetic relationships. We used supertree methods to generate single-gene and multigene species phylogenies. Overall, our supertree analyses utilized phylogenetic data from 8,355 oomycete gene families. We have also complemented our analyses with superalignment phylogenies derived from 131 single-copy ubiquitous gene families. Our results show that a genome-scale approach to oomycete phylogeny resolves oomycete classes and clades. Our analysis represents an important first step in large-scale phylogenomic analysis of the oomycetes. PMID:28435885
Brammeld, Jonathan S; Petljak, Mia; Martincorena, Inigo; Williams, Steven P; Alonso, Luz Garcia; Dalmases, Alba; Bellosillo, Beatriz; Robles-Espinoza, Carla Daniela; Price, Stacey; Barthorpe, Syd; Tarpey, Patrick; Alifrangis, Constantine; Bignell, Graham; Vidal, Joana; Young, Jamie; Stebbings, Lucy; Beal, Kathryn; Stratton, Michael R; Saez-Rodriguez, Julio; Garnett, Mathew; Montagut, Clara; Iorio, Francesco; McDermott, Ultan
2017-04-01
Drug resistance is an almost inevitable consequence of cancer therapy and ultimately proves fatal for the majority of patients. In many cases, this is the consequence of specific gene mutations that have the potential to be targeted to resensitize the tumor. The ability to uniformly saturate the genome with point mutations without chromosome or nucleotide sequence context bias would open the door to identify all putative drug resistance mutations in cancer models. Here, we describe such a method for elucidating drug resistance mechanisms using genome-wide chemical mutagenesis allied to next-generation sequencing. We show that chemically mutagenizing the genome of cancer cells dramatically increases the number of drug-resistant clones and allows the detection of both known and novel drug resistance mutations. We used an efficient computational process that allows for the rapid identification of involved pathways and druggable targets. Such a priori knowledge would greatly empower serial monitoring strategies for drug resistance in the clinic as well as the development of trials for drug-resistant patients. © 2017 Brammeld et al.; Published by Cold Spring Harbor Laboratory Press.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Haraldsdóttir, Hulda S.; Fleming, Ronan M. T.
Conserved moieties are groups of atoms that remain intact in all reactions of a metabolic network. Identification of conserved moieties gives insight into the structure and function of metabolic networks and facilitates metabolic modelling. All moiety conservation relations can be represented as nonnegative integer vectors in the left null space of the stoichiometric matrix corresponding to a biochemical network. Algorithms exist to compute such vectors based only on reaction stoichiometry but their computational complexity has limited their application to relatively small metabolic networks. Moreover, the vectors returned by existing algorithms do not, in general, represent conservation of a specific moiety with a defined atomic structure. Here, we show that identification of conserved moieties requires data on reaction atom mappings in addition to stoichiometry. We present a novel method to identify conserved moieties in metabolic networks by graph theoretical analysis of their underlying atom transition networks. Our method returns the exact group of atoms belonging to each conserved moiety as well as the corresponding vector in the left null space of the stoichiometric matrix. It can be implemented as a pipeline of polynomial time algorithms. Our implementation completes in under five minutes on a metabolic network with more than 4,000 mass balanced reactions. The scalability of the method enables extension of existing applications for moiety conservation relations to genome-scale metabolic networks. Finally, we also give examples of new applications made possible by elucidating the atomic structure of conserved moieties.
Haraldsdóttir, Hulda S.; Fleming, Ronan M. T.
2016-01-01
Conserved moieties are groups of atoms that remain intact in all reactions of a metabolic network. Identification of conserved moieties gives insight into the structure and function of metabolic networks and facilitates metabolic modelling. All moiety conservation relations can be represented as nonnegative integer vectors in the left null space of the stoichiometric matrix corresponding to a biochemical network. Algorithms exist to compute such vectors based only on reaction stoichiometry but their computational complexity has limited their application to relatively small metabolic networks. Moreover, the vectors returned by existing algorithms do not, in general, represent conservation of a specific moiety with a defined atomic structure. Here, we show that identification of conserved moieties requires data on reaction atom mappings in addition to stoichiometry. We present a novel method to identify conserved moieties in metabolic networks by graph theoretical analysis of their underlying atom transition networks. Our method returns the exact group of atoms belonging to each conserved moiety as well as the corresponding vector in the left null space of the stoichiometric matrix. It can be implemented as a pipeline of polynomial time algorithms. Our implementation completes in under five minutes on a metabolic network with more than 4,000 mass balanced reactions. The scalability of the method enables extension of existing applications for moiety conservation relations to genome-scale metabolic networks. We also give examples of new applications made possible by elucidating the atomic structure of conserved moieties. PMID:27870845
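The algebraic condition mentioned above, a nonnegative integer vector l in the left null space of the stoichiometric matrix S (that is, l^T S = 0), can be checked on a toy network. The two-reaction network, the metabolite ordering, and the function name below are illustrative assumptions; the check uses stoichiometry only and does not perform the atom-mapping analysis that the authors show is needed to identify actual moieties.

```python
def is_conservation_vector(l, S):
    """True if l is a nonnegative integer vector with l^T S = 0."""
    if any(x < 0 or x != int(x) for x in l):
        return False
    n_reactions = len(S[0])
    return all(
        sum(l[i] * S[i][j] for i in range(len(S))) == 0
        for j in range(n_reactions)
    )

# Rows: ATP, ADP, glucose, G6P, PEP, pyruvate.
# Columns: R1 (ATP + glucose -> ADP + G6P), R2 (ADP + PEP -> ATP + pyruvate).
S = [
    [-1,  1],   # ATP
    [ 1, -1],   # ADP
    [-1,  0],   # glucose
    [ 1,  0],   # G6P
    [ 0, -1],   # PEP
    [ 0,  1],   # pyruvate
]

adenosine = [1, 1, 0, 0, 0, 0]   # ATP + ADP pool
phosphate = [1, 0, 0, 1, 1, 0]   # transferable phosphate: ATP, G6P, PEP
atp_only = [1, 0, 0, 0, 0, 0]    # not conserved

checks = [is_conservation_vector(v, S) for v in (adenosine, phosphate, atp_only)]
```

Here the first two vectors pass and the third fails, but nothing in the calculation says which atoms make up each conserved group; that is the gap the atom transition network is meant to fill.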
Haraldsdóttir, Hulda S.; Fleming, Ronan M. T.
2016-11-21
Conserved moieties are groups of atoms that remain intact in all reactions of a metabolic network. Identification of conserved moieties gives insight into the structure and function of metabolic networks and facilitates metabolic modelling. All moiety conservation relations can be represented as nonnegative integer vectors in the left null space of the stoichiometric matrix corresponding to a biochemical network. Algorithms exist to compute such vectors based only on reaction stoichiometry but their computational complexity has limited their application to relatively small metabolic networks. Moreover, the vectors returned by existing algorithms do not, in general, represent conservation of a specific moiety with a defined atomic structure. Here, we show that identification of conserved moieties requires data on reaction atom mappings in addition to stoichiometry. We present a novel method to identify conserved moieties in metabolic networks by graph theoretical analysis of their underlying atom transition networks. Our method returns the exact group of atoms belonging to each conserved moiety as well as the corresponding vector in the left null space of the stoichiometric matrix. It can be implemented as a pipeline of polynomial time algorithms. Our implementation completes in under five minutes on a metabolic network with more than 4,000 mass balanced reactions. The scalability of the method enables extension of existing applications for moiety conservation relations to genome-scale metabolic networks. Finally, we also give examples of new applications made possible by elucidating the atomic structure of conserved moieties.
Haraldsdóttir, Hulda S; Fleming, Ronan M T
2016-11-01
Conserved moieties are groups of atoms that remain intact in all reactions of a metabolic network. Identification of conserved moieties gives insight into the structure and function of metabolic networks and facilitates metabolic modelling. All moiety conservation relations can be represented as nonnegative integer vectors in the left null space of the stoichiometric matrix corresponding to a biochemical network. Algorithms exist to compute such vectors based only on reaction stoichiometry but their computational complexity has limited their application to relatively small metabolic networks. Moreover, the vectors returned by existing algorithms do not, in general, represent conservation of a specific moiety with a defined atomic structure. Here, we show that identification of conserved moieties requires data on reaction atom mappings in addition to stoichiometry. We present a novel method to identify conserved moieties in metabolic networks by graph theoretical analysis of their underlying atom transition networks. Our method returns the exact group of atoms belonging to each conserved moiety as well as the corresponding vector in the left null space of the stoichiometric matrix. It can be implemented as a pipeline of polynomial time algorithms. Our implementation completes in under five minutes on a metabolic network with more than 4,000 mass balanced reactions. The scalability of the method enables extension of existing applications for moiety conservation relations to genome-scale metabolic networks. We also give examples of new applications made possible by elucidating the atomic structure of conserved moieties.
Identification of structural variation in mouse genomes.
Keane, Thomas M; Wong, Kim; Adams, David J; Flint, Jonathan; Reymond, Alexandre; Yalcin, Binnaz
2014-01-01
Structural variation is variation in structure of DNA regions affecting DNA sequence length and/or orientation. It generally includes deletions, insertions, copy-number gains, inversions, and transposable elements. Traditionally, the identification of structural variation in genomes has been challenging. However, with the recent advances in high-throughput DNA sequencing and paired-end mapping (PEM) methods, the ability to identify structural variation and their respective association to human diseases has improved considerably. In this review, we describe our current knowledge of structural variation in the mouse, one of the prime model systems for studying human diseases and mammalian biology. We further present the evolutionary implications of structural variation on transposable elements. We conclude with future directions on the study of structural variation in mouse genomes that will increase our understanding of molecular architecture and functional consequences of structural variation.
Vongsangnak, Wanwipa; Klanchui, Amornpan; Tawornsamretkit, Iyarest; Tatiyaborwornchai, Witthawin; Laoteng, Kobkul; Meechai, Asawin
2016-06-01
We present a novel genome-scale metabolic model iWV1213 of Mucor circinelloides, which is an oleaginous fungus for industrial applications. The model contains 1213 genes, 1413 metabolites and 1326 metabolic reactions across different compartments. We demonstrate that iWV1213 is able to accurately predict the growth rates of M. circinelloides on various nutrient sources and culture conditions using Flux Balance Analysis and Phenotypic Phase Plane analysis. Comparative analysis of three oleaginous genome-scale models, including M. circinelloides (iWV1213), Mortierella alpina (iCY1106) and Yarrowia lipolytica (iYL619_PCP) revealed that iWV1213 possesses a higher number of genes involved in carbohydrate, amino acid, and lipid metabolisms that might contribute to its versatility in nutrient utilization. Moreover, the identification of unique and common active reactions among the Zygomycetes oleaginous models using Flux Variability Analysis unveiled a set of gene/enzyme candidates as metabolic engineering targets for cellular improvement. Thus, iWV1213 offers a powerful metabolic engineering tool for multi-level omics analysis, enabling strain optimization as a cell factory platform of lipid-based production. Copyright © 2016 Elsevier B.V. All rights reserved.
Bernardes, Juliana; Zaverucha, Gerson; Vaquero, Catherine; Carbone, Alessandra
2016-01-01
Traditional protein annotation methods describe known domains with probabilistic models representing consensus among homologous domain sequences. However, when relevant signals become too weak to be identified by a global consensus, attempts at annotation fail. Here we address the fundamental question of domain identification for highly divergent proteins. By using high performance computing, we demonstrate that the limits of state-of-the-art annotation methods can be bypassed. We design a new strategy based on the observation that many structural and functional protein constraints are not globally conserved through all species but might be locally conserved in separate clades. We propose a novel exploitation of the large amount of data available: 1. for each known protein domain, several probabilistic clade-centered models are constructed from a large and differentiated panel of homologous sequences, 2. a decision-making protocol combines outcomes obtained from multiple models, 3. a multi-criteria optimization algorithm finds the most likely protein architecture. The method is evaluated for domain and architecture prediction over several datasets and statistical testing hypotheses. Its performance is compared against HMMScan and HHblits, two widely used search methods based on sequence-profile and profile-profile comparison. Due to their closeness to actual protein sequences, clade-centered models are shown to be more specific and functionally predictive than the broadly used consensus models. Based on them, we improved annotation of Plasmodium falciparum protein sequences on a scale not previously possible. We successfully predict at least one domain for 72% of P. falciparum proteins against 63% achieved previously, corresponding to a 30% improvement over the total number of Pfam domain predictions on the whole genome.
The method is applicable to any genome and opens new avenues to tackle evolutionary questions such as the reconstruction of ancient domain duplications, the reconstruction of the history of protein architectures, and the estimation of protein domain age. Website and software: http://www.lcqb.upmc.fr/CLADE. PMID:27472895
Epithelial ovarian cancer: the molecular genetics of epithelial ovarian cancer.
Krzystyniak, J; Ceppi, L; Dizon, D S; Birrer, M J
2016-04-01
Epithelial ovarian cancer (EOC) remains one of the leading causes of cancer-related deaths among women worldwide, despite gains in diagnostics and treatments made over the last three decades. Existing markers of ovarian cancer possess very limited clinical relevance, highlighting the emerging need for identification of novel prognostic biomarkers as well as better predictive factors that might allow the stratification of patients who could benefit from a more targeted approach. This review summarizes the molecular genetics of EOC. Large-scale high-throughput genomic technologies appear to be powerful tools for investigations into the genetic abnormalities in ovarian tumors, including studies on dysregulated genes and aberrantly activated signaling pathways. Such technologies can complement well-established clinical histopathology analysis and tumor grading and will hopefully result in better, more tailored treatments in the future. Genomic signatures obtained by gene expression profiling of EOC may be able to predict survival outcomes and other important clinical outcomes, such as the success of surgical treatment. Finally, genomic analyses may allow for the identification of novel predictive biomarkers for purposes of treatment planning. These data combined suggest a pathway to progress in the treatment of advanced ovarian cancer and the promise of fulfilling the objective of providing personalized medicine to women with ovarian cancer. The understanding of basic molecular events in the tumorigenesis and chemoresistance of EOC together with discovery of potential biomarkers may be greatly enhanced through large-scale genomic studies. In order to maximize the impact of these technologies, however, extensive validation studies are required. © The Author 2016. Published by Oxford University Press on behalf of the European Society for Medical Oncology. All rights reserved. For permissions, please email: journals.permissions@oup.com.
Letaief, Rabia; Rebours, Emmanuelle; Grohs, Cécile; Meersseman, Cédric; Fritz, Sébastien; Trouilh, Lidwine; Esquerré, Diane; Barbieri, Johanna; Klopp, Christophe; Philippe, Romain; Blanquet, Véronique; Boichard, Didier; Rocha, Dominique; Boussaha, Mekki
2017-10-24
Copy number variations (CNV) are known to play a major role in genetic variability and disease pathogenesis in several species including cattle. In this study, we report the identification and characterization of CNV in eight French beef and dairy breeds using whole-genome sequence data from 200 animals. Bioinformatics analyses to search for CNV were carried out using four different but complementary tools and we validated a subset of the CNV by both in silico and experimental approaches. We report the identification and localization of 4178 putative deletion-only, duplication-only and CNV regions, which cover 6% of the bovine autosomal genome; they were validated by two in silico approaches and/or experimentally validated using array-based comparative genomic hybridization and single nucleotide polymorphism genotyping arrays. The size of these variants ranged from 334 bp to 7.7 Mb, with an average size of ~ 54 kb. Of these 4178 variants, 3940 were deletions, 67 were duplications and 171 corresponded to both deletions and duplications, which were defined as potential CNV regions. Gene content analysis revealed that, among these variants, 1100 deletions and duplications encompassed 1803 known genes, which affect a wide spectrum of molecular functions, and 1095 overlapped with known QTL regions. Our study is a large-scale survey of CNV in eight French dairy and beef breeds. These CNV will be useful to study the link between genetic variability and economically important traits, and to improve our knowledge on the genomic architecture of cattle.
Hilke Schroeder; Richard Cronn; Yulai Yanbaev; Tara Jennings; Malte Mader; Bernd Degen; Birgit Kersten; Dusan Gomory
2016-01-01
To detect and avoid illegal logging of valuable tree species, identification methods for the origin of timber are necessary. We used next-generation sequencing to identify chloroplast genome regions that differentiate the origin of white oaks from the three continents; Asia, Europe, and North America. By using the chloroplast genome of Asian Q. mongolica...
2013-01-01
Background Mitochondrial DNA (mtDNA) typing can be a useful aid for identifying people from compromised samples when nuclear DNA is too damaged, degraded or below detection thresholds for routine short tandem repeat (STR)-based analysis. Standard mtDNA typing, focused on PCR amplicon sequencing of the control region (HVS I and HVS II), is limited by the resolving power of this short sequence, which misses up to 70% of the variation present in the mtDNA genome. Methods We used in-solution hybridisation-based DNA capture (using DNA capture probes prepared from modern human mtDNA) to recover mtDNA from post-mortem human remains in which the majority of DNA is both highly fragmented (<100 base pairs in length) and chemically damaged. The method ‘immortalises’ the finite quantities of DNA in valuable extracts as DNA libraries, which is followed by the targeted enrichment of endogenous mtDNA sequences and characterisation by next-generation sequencing (NGS). Results We sequenced whole mitochondrial genomes for human identification from samples where standard nuclear STR typing produced only partial profiles or demonstrably failed and/or where standard mtDNA hypervariable region sequences lacked resolving power. Multiple rounds of enrichment can substantially improve coverage and sequencing depth of mtDNA genomes from highly degraded samples. The application of this method has led to the reliable mitochondrial sequencing of human skeletal remains from unidentified World War Two (WWII) casualties approximately 70 years old and from archaeological remains (up to 2,500 years old). Conclusions This approach has potential applications in forensic science, historical human identification cases, archived medical samples, kinship analysis and population studies. In particular the methodology can be applied to any case, involving human or non-human species, where whole mitochondrial genome sequences are required to provide the highest level of maternal lineage discrimination. 
Multiple rounds of in-solution hybridisation-based DNA capture can retrieve whole mitochondrial genome sequences from even the most challenging samples. PMID:24289217
Structured illumination to spatially map chromatin motions.
Bonin, Keith; Smelser, Amanda; Moreno, Naike Salvador; Holzwarth, George; Wang, Kevin; Levy, Preston; Vidi, Pierre-Alexandre
2018-05-01
We describe a simple optical method that creates structured illumination of a photoactivatable probe and apply this method to characterize chromatin motions in nuclei of live cells. A laser beam coupled to a diffractive optical element at the back focal plane of an excitation objective generates an array of near diffraction-limited beamlets with FWHM of 340 ± 30 nm, which simultaneously photoactivate a 7 × 7 matrix pattern of GFP-labeled histones, with spots 1.70 μm apart. From the movements of the photoactivated spots, we map chromatin diffusion coefficients at multiple microdomains of the cell nucleus. The results show correlated motions of nearest chromatin microdomain neighbors, whereas chromatin movements are uncorrelated at the global scale of the nucleus. The method also reveals a DNA damage-dependent decrease in chromatin diffusion. The diffractive optical element instrumentation can be easily and cheaply implemented on commercial inverted fluorescence microscopes to analyze adherent cell culture models. A protocol to measure chromatin motions in nonadherent human hematopoietic stem and progenitor cells is also described. We anticipate that the method will contribute to the identification of the mechanisms regulating chromatin mobility, which influences most genomic processes and may underlie the biogenesis of genomic translocations associated with hematologic malignancies. (2018) COPYRIGHT Society of Photo-Optical Instrumentation Engineers (SPIE).
Wientjes, Yvonne C J; Bijma, Piter; Vandenplas, Jérémie; Calus, Mario P L
2017-10-01
Different methods are available to calculate multi-population genomic relationship matrices. Since those matrices differ in base population, it is anticipated that the method used to calculate genomic relationships affects the estimate of genetic variances, covariances, and correlations. The aim of this article is to define the multi-population genomic relationship matrix to estimate current genetic variances within and genetic correlations between populations. The genomic relationship matrix containing two populations consists of four blocks, one block for population 1, one block for population 2, and two blocks for relationships between the populations. It is known, based on literature, that by using current allele frequencies to calculate genomic relationships within a population, current genetic variances are estimated. In this article, we theoretically derived the properties of the genomic relationship matrix to estimate genetic correlations between populations and validated it using simulations. When the scaling factor of across-population genomic relationships is equal to the product of the square roots of the scaling factors for within-population genomic relationships, the genetic correlation is estimated unbiasedly even though estimated genetic variances do not necessarily refer to the current population. When this property is not met, the correlation based on estimated variances should be multiplied by a correction factor based on the scaling factors. In this study, we present a genomic relationship matrix which directly estimates current genetic variances as well as genetic correlations between populations. Copyright © 2017 by the Genetics Society of America.
Nielsen, H Bjørn; Almeida, Mathieu; Juncker, Agnieszka Sierakowska; Rasmussen, Simon; Li, Junhua; Sunagawa, Shinichi; Plichta, Damian R; Gautier, Laurent; Pedersen, Anders G; Le Chatelier, Emmanuelle; Pelletier, Eric; Bonde, Ida; Nielsen, Trine; Manichanh, Chaysavanh; Arumugam, Manimozhiyan; Batto, Jean-Michel; Quintanilha Dos Santos, Marcelo B; Blom, Nikolaj; Borruel, Natalia; Burgdorf, Kristoffer S; Boumezbeur, Fouad; Casellas, Francesc; Doré, Joël; Dworzynski, Piotr; Guarner, Francisco; Hansen, Torben; Hildebrand, Falk; Kaas, Rolf S; Kennedy, Sean; Kristiansen, Karsten; Kultima, Jens Roat; Léonard, Pierre; Levenez, Florence; Lund, Ole; Moumen, Bouziane; Le Paslier, Denis; Pons, Nicolas; Pedersen, Oluf; Prifti, Edi; Qin, Junjie; Raes, Jeroen; Sørensen, Søren; Tap, Julien; Tims, Sebastian; Ussery, David W; Yamada, Takuji; Renault, Pierre; Sicheritz-Ponten, Thomas; Bork, Peer; Wang, Jun; Brunak, Søren; Ehrlich, S Dusko
2014-08-01
Most current approaches for analyzing metagenomic data rely on comparisons to reference genomes, but the microbial diversity of many environments extends far beyond what is covered by reference databases. De novo segregation of complex metagenomic data into specific biological entities, such as particular bacterial strains or viruses, remains a largely unsolved problem. Here we present a method, based on binning co-abundant genes across a series of metagenomic samples, that enables comprehensive discovery of new microbial organisms, viruses and co-inherited genetic entities and aids assembly of microbial genomes without the need for reference sequences. We demonstrate the method on data from 396 human gut microbiome samples and identify 7,381 co-abundance gene groups (CAGs), including 741 metagenomic species (MGS). We use these to assemble 238 high-quality microbial genomes and identify affiliations between MGS and hundreds of viruses or genetic entities. Our method provides the means for comprehensive profiling of the diversity within complex metagenomic samples.
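The binning idea in the abstract above, grouping genes whose abundances rise and fall together across samples, can be illustrated with a minimal sketch. The greedy seed-based grouping, the Pearson threshold of 0.9, and the gene names and abundance values below are all assumptions made for illustration; they are not the procedure or parameters reported by the authors.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two equal-length abundance profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0  # a flat profile carries no co-abundance signal
    return cov / sqrt(vx * vy)

def bin_coabundant(profiles, threshold=0.9):
    """Greedy binning: each unassigned gene seeds a group and collects every
    other unassigned gene whose profile correlates with the seed at or
    above `threshold` (an illustrative cut-off)."""
    unassigned = list(profiles)
    groups = []
    while unassigned:
        seed = unassigned.pop(0)
        group, rest = [seed], []
        for gene in unassigned:
            if pearson(profiles[seed], profiles[gene]) >= threshold:
                group.append(gene)
            else:
                rest.append(gene)
        unassigned = rest
        groups.append(group)
    return groups

# Toy abundances of four hypothetical genes across five samples.
profiles = {
    "geneA": [1.0, 2.0, 3.0, 4.0, 5.0],
    "geneB": [2.1, 3.9, 6.2, 8.0, 9.9],
    "geneC": [5.0, 1.0, 4.0, 1.0, 3.0],
    "geneD": [10.0, 2.2, 8.1, 1.9, 6.0],
}
groups = bin_coabundant(profiles)
print(groups)
```

On this toy input geneA and geneB fall into one group and geneC and geneD into another, because each pair varies together across the samples. A real analysis would also need abundance normalization and a scalable search, which this sketch omits.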
Genomics Education for the Public: Perspectives of Genomic Researchers and ELSI Advisors
Jones, Sondra Smolek; Markey, Janell M.; Byerly, Katherine W.; Roberts, Megan C.
2014-01-01
Aims: For more than two decades genomic education of the public has been a significant challenge. As genomic information becomes integrated into daily life and routine clinical care, the need for public education is even more critical. We conducted a pilot study to learn how genomic researchers and ethical, legal, and social implications advisors who were affiliated with large-scale genomic variation studies have approached the issue of educating the public about genomics. Methods/Results: Semi-structured telephone interviews were conducted with researchers and advisors associated with the SNP/HAPMAP studies and the Cancer Genome Atlas Study. Respondents described approach(es) associated with educating the public about their study. Interviews were audio-recorded, transcribed, coded, and analyzed by team review. Although few respondents described formal educational efforts, most provided recommendations for what should/could be done, emphasizing the need for an overarching entity(s) to take responsibility to lead the effort to educate the public. Opposing views were described related to: who this should be; the overall goal of the educational effort; and the educational approach. Four thematic areas emerged: What is the rationale for educating the public about genomics?; Who is the audience?; Who should be responsible for this effort?; and What should the content be? Policy issues associated with these themes included the need to agree on philosophical framework(s) to guide the rationale, content, and target audiences for education programs; coordinate previous/ongoing educational efforts; and develop a centralized knowledge base. Suggestions for next steps are presented. Conclusion: A complex interplay of philosophical, professional, and cultural issues can create impediments to genomic education of the public. 
Many challenges, however, can be addressed by agreement on a guiding philosophical framework(s) and identification of a responsible entity(s) to provide leadership for developing/overseeing an appropriate infrastructure to support the coordination/integration/sharing and evaluation of educational efforts, benefiting consumers and professionals. PMID:24495163
Methods for Initial Characterization of Campylobacter jejuni Bacteriophages.
Sørensen, Martine Camilla Holst; Gencay, Yilmaz Emre; Brøndsted, Lone
2017-01-01
Here we describe an initial characterization of Campylobacter jejuni bacteriophages by host range analysis, genome size determination by pulsed-field gel electrophoresis, and receptor-type identification by screening mutants for phage sensitivity.
NASA Astrophysics Data System (ADS)
Seto, Donald
The convergence and wealth of informatics, bioinformatics and genomics methods and associated resources allow a comprehensive and rapid approach for the surveillance and detection of bacterial and viral organisms. Coupled with the continuing race for the fastest, most cost-efficient and highest-quality DNA sequencing technology, that is, "next generation sequencing", the detection of biological threat agents by 'cheaper and faster' means is possible. With the application of improved bioinformatic tools for the understanding of these genomes and for parsing unique pathogen genome signatures, along with state-of-the-art informatics which include faster computational methods, equipment and databases, it is feasible to apply new algorithms to biothreat agent detection. Two such methods are high-throughput DNA sequencing-based and resequencing microarray-based identification. These are illustrated and validated by two examples involving human adenoviruses, both from real-world test beds.
Huang, Jie; Li, Yu-Zhi; Du, Lian-Ming; Yang, Bo; Shen, Fu-Jun; Zhang, He-Min; Zhang, Zhi-He; Zhang, Xiu-Yue; Yue, Bi-Song
2015-02-07
The giant panda (Ailuropoda melanoleuca) is a critically endangered species endemic to China. Microsatellites are among the most popular molecular markers and have proven effective for estimating population size, testing paternity, and assessing genetic diversity in critically endangered species. The availability of the complete giant panda genome sequence provides the opportunity to carry out genome-wide scans for all types of microsatellite markers, opening the way for the analysis and development of microsatellites in the giant panda. By in silico mining of the whole genome sequence, we identified microsatellites in the giant panda genome and analyzed their frequency and distribution in different genomic regions. Based on our search criteria, a repertoire of 855,058 SSRs was detected, with mononucleotides being the most abundant. SSRs were found in all genomic regions and were more abundant in non-coding regions than in coding regions. A total of 160 primer pairs were designed to screen for polymorphic microsatellites using selected tetranucleotide microsatellite sequences. Fifty-one novel polymorphic tetranucleotide microsatellite loci were discovered by genotyping blood DNA from 22 captive giant pandas. Finally, 15 markers that showed good polymorphism, stability, and repeatability in faecal samples were used to establish a novel microsatellite marker system for the giant panda. A genotyping database for Chengdu captive giant pandas (n = 57) was set up using this standardized system. In addition, a universal individual identification method was established and genetic diversity was analysed as applications of this marker system. Microsatellite abundance and diversity were characterized in the giant panda genome. A total of 154,677 tetranucleotide microsatellites were identified, and 15 of them were found to be polymorphic and stable loci. 
The individual identification method and the genetic diversity analysis method in this study provided adequate material for the future study of giant panda.
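As a rough illustration of what in silico microsatellite mining involves, the sketch below scans a DNA string for perfect tandem repeats with a regular expression. The minimum repeat counts in `MIN_REPEATS`, the function name `find_ssrs`, and the test sequence are hypothetical; the study's actual search criteria are not given in the abstract and are not reproduced here.

```python
import re

# Minimum number of repeats required per motif length.
# Illustrative thresholds only, not the criteria used in the study.
MIN_REPEATS = {1: 10, 2: 6, 3: 5, 4: 4}

def find_ssrs(seq, min_repeats=MIN_REPEATS):
    """Return sorted (start, motif, repeat_count) tuples for perfect SSRs."""
    seq = seq.upper()
    hits = []
    for k, n in sorted(min_repeats.items()):
        # A k-base motif followed by at least n-1 further copies of itself.
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (k, n - 1))
        for m in pattern.finditer(seq):
            motif = m.group(1)
            # Skip motifs that are themselves repeats of a shorter unit
            # (e.g. "AA" is really the mononucleotide "A").
            if any(motif == motif[:d] * (k // d)
                   for d in range(1, k) if k % d == 0):
                continue
            hits.append((m.start(), motif, len(m.group(0)) // k))
    return sorted(hits)

# Hypothetical sequence containing an A run, an AGAT repeat and an AC repeat.
seq = "GGC" + "A" * 12 + "TCG" + "AGAT" * 5 + "TT" + "AC" * 7 + "G"
hits = find_ssrs(seq)
print(hits)  # [(3, 'A', 12), (18, 'AGAT', 5), (40, 'AC', 7)]
```

Counting hits per motif length over a whole genome would give the kind of abundance summary the abstract reports, although a genome-scale scan would normally process the sequence in chunks rather than as one string.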
Resources for Genetic and Genomic Analysis of Emerging Pathogen Acinetobacter baumannii
Ramage, Elizabeth; Weiss, Eli J.; Radey, Matthew; Hayden, Hillary S.; Held, Kiara G.; Huse, Holly K.; Zurawski, Daniel V.; Brittnacher, Mitchell J.; Manoil, Colin
2015-01-01
ABSTRACT Acinetobacter baumannii is a Gram-negative bacterial pathogen notorious for causing serious nosocomial infections that resist antibiotic therapy. Research to identify factors responsible for the pathogen's success has been limited by the resources available for genome-scale experimental studies. This report describes the development of several such resources for A. baumannii strain AB5075, a recently characterized wound isolate that is multidrug resistant and displays robust virulence in animal models. We report the completion and annotation of the genome sequence, the construction of a comprehensive ordered transposon mutant library, the extension of high-coverage transposon mutant pool sequencing (Tn-seq) to the strain, and the identification of the genes essential for growth on nutrient-rich agar. These resources should facilitate large-scale genetic analysis of virulence, resistance, and other clinically relevant traits that make A. baumannii a formidable public health threat. IMPORTANCE Acinetobacter baumannii is one of six bacterial pathogens primarily responsible for antibiotic-resistant infections that have become the scourge of health care facilities worldwide. Eliminating such infections requires a deeper understanding of the factors that enable the pathogen to persist in hospital environments, establish infections, and resist antibiotics. We present a set of resources that should accelerate genome-scale genetic characterization of these traits for a reference isolate of A. baumannii that is highly virulent and representative of current outbreak strains. PMID:25845845
Genomic Approaches to Zebrafish Cancer
2017-01-01
The zebrafish has emerged as an important model for studying cancer biology. Identification of DNA, RNA and chromatin abnormalities can give profound insight into the mechanisms of tumorigenesis, and there are many techniques for analyzing the genomes of these tumors. Here, I present an overview of the available technologies for analyzing tumor genomes in the zebrafish, including array-based methods as well as next-generation sequencing technologies. I also discuss the ways in which zebrafish tumor genomes can be compared to human genomes using cross-species oncogenomics, which acts to filter genomic noise and ultimately uncover central drivers of malignancy. Finally, I discuss downstream analytic tools, including network analysis, that can help to organize the alterations into coherent biological frameworks that can then be investigated further. PMID:27165352
Multiscale global identification of porous structures
NASA Astrophysics Data System (ADS)
Hatłas, Marcin; Beluch, Witold
2018-01-01
The paper is devoted to the evolutionary identification of the material constants of porous structures based on measurements conducted on a macro scale. Numerical homogenization with the representative volume element (RVE) concept is used to determine the equivalent properties of a macroscopically homogeneous material. Finite element method software is applied to solve the boundary-value problem at both scales. A global optimization method in the form of an evolutionary algorithm is employed to solve the identification task. Modal analysis is performed to collect the data necessary for the identification. A numerical example demonstrating the effectiveness of the proposed approach is included.
Chromatin accessibility prediction via a hybrid deep convolutional neural network.
Liu, Qiao; Xia, Fei; Yin, Qijin; Jiang, Rui
2018-03-01
A majority of known genetic variants associated with human-inherited diseases lie in non-coding regions that lack adequate interpretation, making it indispensable to systematically discover functional sites at the whole genome level and precisely decipher their implications in a comprehensive manner. Although computational approaches have been complementing high-throughput biological experiments towards the annotation of the human genome, it still remains a big challenge to accurately annotate regulatory elements in the context of a specific cell type via automatic learning of the DNA sequence code from large-scale sequencing data. Indeed, the development of an accurate and interpretable model to learn the DNA sequence signature and further enable the identification of causative genetic variants has become essential in both genomic and genetic studies. We proposed Deopen, a hybrid framework mainly based on a deep convolutional neural network, to automatically learn the regulatory code of DNA sequences and predict chromatin accessibility. In a series of comparisons with existing methods, we show the superior performance of our model in not only the classification of accessible regions against background sequences sampled at random, but also the regression of DNase-seq signals. In addition, we visualize the convolutional kernels and show that the identified sequence signatures match known motifs. We finally demonstrate the sensitivity of our model in finding causative noncoding variants in the analysis of a breast cancer dataset. We expect to see wide applications of Deopen with either public or in-house chromatin accessibility data in the annotation of the human genome and the identification of non-coding variants associated with diseases. Deopen is freely available at https://github.com/kimmo1019/Deopen. Contact: ruijiang@tsinghua.edu.cn. Supplementary data are available at Bioinformatics online. © The Author (2017). Published by Oxford University Press. 
All rights reserved. For Permissions, please email: journals.permissions@oup.com
Wang, Lu-Yong; Fasulo, D
2006-01-01
Genome-wide association studies of complex diseases generate massive amounts of single nucleotide polymorphism (SNP) data. Univariate statistical tests (e.g., the Fisher exact test) have been used to filter out non-associated SNPs. However, disease-susceptibility SNPs may have little marginal effect in the population and are unlikely to be retained after univariate tests. Also, model-based methods are impractical for large-scale datasets. Moreover, genetic heterogeneity makes it harder for traditional methods to identify the genetic causes of disease. The more recent random forest method provides a more robust way of screening SNPs at the scale of thousands. However, for larger-scale data, e.g., Affymetrix Human Mapping 100K GeneChip data, a faster method is required to screen SNPs in whole-genome association analysis with genetic heterogeneity. We propose a boosting-based method for rapid screening in large-scale analysis of complex traits in the presence of genetic heterogeneity. It provides a relatively fast and fairly good tool for screening and limiting the candidate SNPs for further, more complex computational modeling.
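For context, the univariate baseline that the abstract contrasts with boosting (a Fisher exact test applied to each SNP separately) can be sketched as follows. This illustrates the baseline only, not the proposed boosting method; the SNP identifiers, allele counts, and the 0.01 cut-off are invented for the example.

```python
from math import comb

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    total = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x in the top-left cell, margins fixed.
        return comb(row1, x) * comb(row2, col1 - x) / total

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum the probabilities of all tables no more likely than the observed one
    # (small relative tolerance guards against floating-point ties).
    return sum(p for p in (prob(x) for x in range(lo, hi + 1))
               if p <= p_obs * (1 + 1e-9))

# Hypothetical allele counts per SNP:
# (case minor, case major, control minor, control major).
snps = {
    "snp_hypothetical_1": (30, 70, 10, 90),
    "snp_hypothetical_2": (20, 80, 20, 80),
}
p_values = {name: fisher_exact_two_sided(*counts) for name, counts in snps.items()}
# Keep SNPs below an illustrative significance cut-off.
retained = [name for name, p in p_values.items() if p < 0.01]
print(retained)
```

A SNP whose effect appears only in combination with others would have a marginal table close to the second one and would be dropped here, which is the weakness of univariate filtering that the abstract points out.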
Kargarfard, Fatemeh; Sami, Ashkan; Mohammadi-Dehcheshmeh, Manijeh; Ebrahimie, Esmaeil
2016-11-16
Recent (2013 and 2009) zoonotic transmission of avian or porcine influenza to humans highlights an increase in host range by evading species barriers. Gene reassortment or antigenic shift between viruses from two or more hosts can generate a new life-threatening virus when the new shuffled virus is no longer recognized by antibodies existing within human populations. There is no large-scale study to help understand the underlying mechanisms of host transmission. Furthermore, there is no clear understanding of how different segments of the influenza genome contribute to the final determination of host range. To obtain insight into the rules underpinning host range determination, various supervised machine learning algorithms were employed to mine reassortment changes in different viral segments in a range of hosts. Our multi-host dataset contained whole segments of 674 influenza strains organized into three host categories: avian, human, and swine. Some of the sequences were assigned to multiple hosts. The dataset is therefore multi-labeled, and we utilized a multi-label learning method to identify discriminative sequence sites. Then algorithms such as CBA, Ripper, and decision tree were applied to extract informative and descriptive association rules for each viral protein segment. We found informative rules in all segments that are common within the same host class but varied between different hosts. For example, for infection of an avian host, HA14V and NS1230S were the most important discriminative and combinatorial positions. Host range identification is facilitated by high support combined rules in this study. Our major goal was to detect discriminative genomic positions that were able to identify multi-host viruses, because such viruses are likely to cause pandemics or disastrous epidemics.
Zheng, Guangyong; Xu, Yaochen; Zhang, Xiujun; Liu, Zhi-Ping; Wang, Zhuo; Chen, Luonan; Zhu, Xin-Guang
2016-12-23
A gene regulatory network (GRN) represents interactions of genes inside a cell or tissue, in which vertices and edges stand for genes and their regulatory interactions, respectively. Reconstruction of gene regulatory networks, in particular genome-scale networks, is essential for comparative exploration of different species and mechanistic investigation of biological processes. Currently, most network inference methods are computationally intensive; they are usually effective for small-scale tasks (e.g., networks with a few hundred genes) but struggle to construct GRNs at genome scale. Here, we present a software package for gene regulatory network reconstruction at a genomic level, in which gene interaction is measured by conditional mutual information using a parallel computing framework (hence the package name, CMIP). The package is a greatly improved implementation of our previous PCA-CMI algorithm. In CMIP, we provide not only an automatic threshold determination method but also an effective parallel computing framework for network inference. Performance tests on benchmark datasets show that the accuracy of CMIP is comparable to that of most current network inference methods. Moreover, running tests on synthetic datasets demonstrate that CMIP can handle large datasets, especially genome-wide datasets, within an acceptable time period. In addition, successful application to a real genomic dataset confirms the practical applicability of the package. This new software package provides the biological community with a powerful tool for genomic network reconstruction. The software can be accessed at http://www.picb.ac.cn/CMIP/ .
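To make the role of conditional mutual information concrete, the sketch below computes it for three expression vectors under a joint-Gaussian assumption, where first-order CMI reduces to a function of the partial correlation. The Gaussian closed form, the function names, and the simulated chain X to Z to Y are assumptions for illustration; this is not the CMIP code, and its threshold selection and parallel framework are not reproduced.

```python
import random
from math import log, sqrt

def pearson(x, y):
    """Pearson correlation of two equal-length expression vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(vx * vy)

def gaussian_mi(x, y):
    """Mutual information I(X;Y) in nats, assuming joint normality.
    Assumes the two vectors are not perfectly correlated."""
    r = pearson(x, y)
    return -0.5 * log(1.0 - r * r)

def gaussian_cmi(x, y, z):
    """First-order conditional mutual information I(X;Y|Z) in nats,
    obtained from the partial correlation of X and Y given Z."""
    rxy, rxz, ryz = pearson(x, y), pearson(x, z), pearson(y, z)
    partial = (rxy - rxz * ryz) / sqrt((1 - rxz ** 2) * (1 - ryz ** 2))
    return -0.5 * log(1.0 - partial * partial)

# Simulated chain X -> Z -> Y: X and Y are related only through Z.
random.seed(0)
n = 500
x = [random.gauss(0, 1) for _ in range(n)]
z = [xi + 0.5 * random.gauss(0, 1) for xi in x]
y = [zi + 0.5 * random.gauss(0, 1) for zi in z]

mi = gaussian_mi(x, y)       # large: X and Y look dependent
cmi = gaussian_cmi(x, y, z)  # near zero: dependence vanishes given Z
print(mi, cmi)
```

In a network inference setting, a near-zero conditional value like this would be the reason to remove the direct X to Y edge while keeping X to Z and Z to Y.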
Identification of coding and non-coding mutational hotspots in cancer genomes.
Piraino, Scott W; Furney, Simon J
2017-01-05
The identification of mutations that play a causal role in tumour development, so called "driver" mutations, is of critical importance for understanding how cancers form and how they might be treated. Several large cancer sequencing projects have identified genes that are recurrently mutated in cancer patients, suggesting a role in tumourigenesis. While the landscape of coding drivers has been extensively studied and many of the most prominent driver genes are well characterised, comparatively less is known about the role of mutations in the non-coding regions of the genome in cancer development. The continuing fall in genome sequencing costs has resulted in a concomitant increase in the number of cancer whole genome sequences being produced, facilitating systematic interrogation of both the coding and non-coding regions of cancer genomes. To examine the mutational landscapes of tumour genomes we have developed a novel method to identify mutational hotspots in tumour genomes using both mutational data and information on evolutionary conservation. We have applied our methodology to over 1300 whole cancer genomes and show that it identifies prominent coding and non-coding regions that are known or highly suspected to play a role in cancer. Importantly, we applied our method to the entire genome, rather than relying on predefined annotations (e.g. promoter regions) and we highlight recurrently mutated regions that may have resulted from increased exposure to mutational processes rather than selection, some of which have been identified previously as targets of selection. Finally, we implicate several pan-cancer and cancer-specific candidate non-coding regions, which could be involved in tumourigenesis. We have developed a framework to identify mutational hotspots in cancer genomes, which is applicable to the entire genome. 
This framework identifies known and novel coding and non-coding mutational hotspots and can be used to differentiate candidate driver regions from likely passenger regions susceptible to somatic mutation.
Bernard, Guillaume; Chan, Cheong Xin; Ragan, Mark A
2016-07-01
Alignment-free (AF) approaches have recently been highlighted as alternatives to methods based on multiple sequence alignment in phylogenetic inference. However, the sensitivity of AF methods to genome-scale evolutionary scenarios is little known. Here, using simulated microbial genome data we systematically assess the sensitivity of nine AF methods to three important evolutionary scenarios: sequence divergence, lateral genetic transfer (LGT) and genome rearrangement. Among these, AF methods are most sensitive to the extent of sequence divergence, less sensitive to low and moderate frequencies of LGT, and most robust against genome rearrangement. We describe the application of AF methods to three well-studied empirical genome datasets, and introduce a new application of the jackknife to assess node support. Our results demonstrate that AF phylogenomics is computationally scalable to multi-genome data and can generate biologically meaningful phylogenies and insights into microbial evolution.
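As a simple illustration of how an alignment-free comparison works, the sketch below scores two sequences by the Jaccard distance between their k-mer sets. This particular measure, the choice of k = 3, and the toy sequences are assumptions for the example; it is not necessarily one of the nine AF methods assessed in the study.

```python
def kmer_set(seq, k):
    """Set of all overlapping k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def jaccard_distance(seq_a, seq_b, k=3):
    """1 - |shared k-mers| / |all k-mers| for two sequences."""
    a, b = kmer_set(seq_a, k), kmer_set(seq_b, k)
    return 1.0 - len(a & b) / len(a | b)

# Toy sequences (hypothetical): `rearranged` swaps the two halves of
# `reference`; `diverged` carries three point substitutions instead.
reference = "AAACCCGGGTTT"
rearranged = "GGGTTTAAACCC"
diverged = "AATCCAGGCTTT"

d_rearranged = jaccard_distance(reference, rearranged)
d_diverged = jaccard_distance(reference, diverged)
print(d_rearranged, d_diverged)
```

In this toy case the rearranged sequence stays much closer to the reference than the diverged one, because a rearrangement disturbs only the k-mers spanning the breakpoints while each substitution alters up to k of them. That is consistent in spirit with the robustness to rearrangement described above, though a toy example is not evidence for it.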
Chockalingam, Sriram; Aluru, Maneesha; Aluru, Srinivas
2016-09-19
Pre-processing of microarray data is a well-studied problem. Furthermore, all popular platforms come with their own recommended best practices for differential analysis of genes. However, for genome-scale network inference using microarray data collected from large public repositories, these methods filter out a considerable number of genes. This is primarily due to the effects of aggregating a diverse array of experiments with different technical and biological scenarios. Here we introduce a pre-processing pipeline suitable for inferring genome-scale gene networks from large microarray datasets. We show that partitioning of the available microarray datasets according to biological relevance into tissue- and process-specific categories significantly extends the limits of downstream network construction. We demonstrate the effectiveness of our pre-processing pipeline by inferring genome-scale networks for the model plant Arabidopsis thaliana using two different construction methods and a collection of 11,760 Affymetrix ATH1 microarray chips. Our pre-processing pipeline and the datasets used in this paper are made available at http://alurulab.cc.gatech.edu/microarray-pp.
Computing Prediction and Functional Analysis of Prokaryotic Propionylation.
Wang, Li-Na; Shi, Shao-Ping; Wen, Ping-Ping; Zhou, Zhi-You; Qiu, Jian-Ding
2017-11-27
Identification and systematic analysis of candidates for protein propionylation are crucial steps for understanding its molecular mechanisms and biological functions. Although several proteome-scale methods have been applied to delineate potential propionylated proteins, the majority of lysine-propionylated substrates and their role in pathological physiology still remain largely unknown. Experimental prokaryotic propionylation data were collated from various databases and the literature and used to train a support vector machine, with features chosen via a three-step feature selection method. A novel online tool for seeking potential lysine-propionylated sites (PropSeek) ( http://bioinfo.ncu.edu.cn/PropSeek.aspx ) was built. Independent test results of leave-one-out and n-fold cross-validation were similar to each other, showing that PropSeek is a stable and robust predictor with satisfying performance. Meanwhile, analyses of Gene Ontology, Kyoto Encyclopedia of Genes and Genomes pathways, and protein-protein interactions implied a potential role of prokaryotic propionylation in protein synthesis and metabolism.
Dhanasekaran, A Ranjitha; Pearson, Jon L; Ganesan, Balasubramanian; Weimer, Bart C
2015-02-25
Mass spectrometric analysis of microbial metabolism provides a long list of possible compounds. Restricting the identification of the possible compounds to those produced by the specific organism would benefit the identification process. Currently, identification of mass spectrometry (MS) data is commonly done using empirically derived compound databases. Unfortunately, most databases contain relatively few compounds, leaving long lists of unidentified molecules. Incorporating genome-encoded metabolism enables MS output identification that may not be included in databases. Using an organism's genome as a database restricts metabolite identification to only those compounds that the organism can produce. To address the challenge of metabolomic analysis from MS data, a web-based application to directly search genome-constructed metabolic databases was developed. The user query returns a genome-restricted list of possible compound identifications along with the putative metabolic pathways based on the name, formula, SMILES structure, and the compound mass as defined by the user. Multiple queries can be done simultaneously by submitting a text file created by the user or obtained from the MS analysis software. The user can also provide parameters specific to the experiment's MS analysis conditions, such as mass deviation, adducts, and detection mode during the query so as to provide additional levels of evidence to produce the tentative identification. The query results are provided as an HTML page and downloadable text file of possible compounds that are restricted to a specific genome. Hyperlinks provided in the HTML file connect the user to the curated metabolic databases housed in ProCyc, a Pathway Tools platform, as well as the KEGG Pathway database for visualization and metabolic pathway analysis. Metabolome Searcher, a web-based tool, facilitates putative compound identification of MS output based on genome-restricted metabolic capability. 
This enables researchers to rapidly extend the possible identifications of large data sets for metabolites that are not in compound databases. Putative compound names with their associated metabolic pathways from metabolomics data sets are returned to the user for additional biological interpretation and visualization. This novel approach enables compound identification by restricting the possible masses to those encoded in the genome.
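The genome-restricted lookup described above can be pictured as a mass match against a small, organism-specific compound list. The sketch below is only an assumption-laden illustration, not the Metabolome Searcher implementation: the function name, the two-adduct table, the default 10 ppm tolerance, and the toy compound list are all hypothetical.

```python
# Monoisotopic mass shifts (Da) for two common positive-mode adducts.
ADDUCT_SHIFTS = {"[M+H]+": 1.007276, "[M+Na]+": 22.989218}


def match_compounds(observed_mz, compounds, adduct="[M+H]+", ppm=10.0):
    """Return (name, ppm_error) for compounds whose adduct m/z is within tolerance.

    compounds maps a compound name to its neutral monoisotopic mass; in a
    genome-restricted search it would hold only metabolites the organism
    is predicted to produce.
    """
    hits = []
    for name, mass in compounds.items():
        expected = mass + ADDUCT_SHIFTS[adduct]
        error_ppm = (observed_mz - expected) / expected * 1e6
        if abs(error_ppm) <= ppm:
            hits.append((name, round(error_ppm, 2)))
    # Closest match first.
    return sorted(hits, key=lambda hit: abs(hit[1]))


# Hypothetical genome-restricted compound list (neutral monoisotopic masses).
TOY_GENOME_DB = {
    "glucose": 180.063388,
    "pyruvate": 88.016044,
    "lactate": 90.031694,
}
```

For example, `match_compounds(181.0707, TOY_GENOME_DB)` would report only glucose as the protonated ion, because the other two entries fall far outside the mass tolerance.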
Exploiting proteomic data for genome annotation and gene model validation in Aspergillus niger.
Wright, James C; Sugden, Deana; Francis-McIntyre, Sue; Riba-Garcia, Isabel; Gaskell, Simon J; Grigoriev, Igor V; Baker, Scott E; Beynon, Robert J; Hubbard, Simon J
2009-02-04
Proteomic data is a potentially rich, but arguably unexploited, data source for genome annotation. Peptide identifications from tandem mass spectrometry provide prima facie evidence for gene predictions and can discriminate over a set of candidate gene models. Here we apply this to the recently sequenced Aspergillus niger fungal genome from the Joint Genome Institute (JGI) and another predicted protein set from another A. niger sequence. Tandem mass spectra (MS/MS) were acquired from 1D gel electrophoresis bands and searched against all available gene models using Average Peptide Scoring (APS) and reverse database searching to produce confident identifications at an acceptable false discovery rate (FDR). 405 identified peptide sequences were mapped to 214 different A. niger genomic loci to which 4093 predicted gene models clustered, 2872 of which contained the mapped peptides. Interestingly, 13 (6%) of these loci either had no preferred predicted gene model or the genome annotators' chosen "best" model for that genomic locus was not found to be the most parsimonious match to the identified peptides. The peptides identified also boosted confidence in predicted gene structures spanning 54 introns from different gene models. This work highlights the potential of integrating experimental proteomics data into genomic annotation pipelines much as expressed sequence tag (EST) data has been. A comparison of the published genome from another strain of A. niger sequenced by DSM showed that a number of the gene models or proteins with proteomics evidence did not occur in both genomes, further highlighting the utility of the method.
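Reverse (decoy) database searching is usually turned into a false discovery rate by counting decoy hits above a score threshold. The sketch below shows that generic estimate only; it does not reproduce Average Peptide Scoring, and the function name and the (score, is_decoy) input layout are assumptions for illustration.

```python
def decoy_fdr(psms, threshold):
    """Estimate the FDR among target hits scoring at or above threshold.

    psms is a list of (score, is_decoy) pairs, where is_decoy is True for
    matches to the reversed database. The estimate is decoys / targets.
    """
    targets = sum(1 for score, is_decoy in psms if score >= threshold and not is_decoy)
    decoys = sum(1 for score, is_decoy in psms if score >= threshold and is_decoy)
    return decoys / targets if targets else 0.0
```

Raising the threshold until this estimate drops below a chosen level (commonly 1% or 5%) gives the set of identifications accepted as confident.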
Distilled single-cell genome sequencing and de novo assembly for sparse microbial communities.
Taghavi, Zeinab; Movahedi, Narjes S; Draghici, Sorin; Chitsaz, Hamidreza
2013-10-01
Identification of every single genome present in a microbial sample is an important and challenging task with crucial applications. It is challenging because there are typically millions of cells in a microbial sample, the vast majority of which elude cultivation. The most accurate method to date is exhaustive single-cell sequencing using multiple displacement amplification, which is simply intractable for a large number of cells. However, there is hope for breaking this barrier, as the number of different cell types with distinct genome sequences is usually much smaller than the number of cells. Here, we present a novel divide and conquer method to sequence and de novo assemble all distinct genomes present in a microbial sample with a sequencing cost and computational complexity proportional to the number of genome types, rather than the number of cells. The method is implemented in a tool called Squeezambler. We evaluated Squeezambler on simulated data. The proposed divide and conquer method successfully reduces the cost of sequencing in comparison with the naïve exhaustive approach. Squeezambler and datasets are available at http://compbio.cs.wayne.edu/software/squeezambler/.
A versatile genome-scale PCR-based pipeline for high-definition DNA FISH.
Bienko, Magda; Crosetto, Nicola; Teytelman, Leonid; Klemm, Sandy; Itzkovitz, Shalev; van Oudenaarden, Alexander
2013-02-01
We developed a cost-effective genome-scale PCR-based method for high-definition DNA FISH (HD-FISH). We visualized gene loci with diffraction-limited resolution, chromosomes as spot clusters and single genes together with transcripts by combining HD-FISH with single-molecule RNA FISH. We provide a database of over 4.3 million primer pairs targeting the human and mouse genomes that is readily usable for rapid and flexible generation of probes.
Malin, Bradley A.
2005-01-01
The incorporation of genomic data into personal medical records poses many challenges to patient privacy. In response, various systems for preserving patient privacy in shared genomic data have been developed and deployed. Although these systems de-identify the data by removing explicit identifiers (e.g., name, address, or Social Security number) and incorporate sound security design principles, they suffer from a lack of formal modeling of inferences learnable from shared data. This report evaluates the extent to which current protection systems are capable of withstanding a range of re-identification methods, including genotype–phenotype inferences, location–visit patterns, family structures, and dictionary attacks. For a comparative re-identification analysis, the systems are mapped to a common formalism. Although there is variation in susceptibility, each system is deficient in its protection capacity. The author discovers patterns of protection failure and discusses several of the reasons why these systems are susceptible. The analyses and discussion within provide guideposts for the development of next-generation protection methods amenable to formal proofs. PMID:15492030
Ataman, Meric
2017-01-01
Genome-scale metabolic reconstructions have proven to be valuable resources in enhancing our understanding of metabolic networks as they encapsulate all known metabolic capabilities of the organisms from genes to proteins to their functions. However the complexity of these large metabolic networks often hinders their utility in various practical applications. Although reduced models are commonly used for modeling and in integrating experimental data, they are often inconsistent across different studies and laboratories due to different criteria and detail, which can compromise transferability of the findings and also integration of experimental data from different groups. In this study, we have developed a systematic semi-automatic approach to reduce genome-scale models into core models in a consistent and logical manner focusing on the central metabolism or subsystems of interest. The method minimizes the loss of information using an approach that combines graph-based search and optimization methods. The resulting core models are shown to be able to capture key properties of the genome-scale models and preserve consistency in terms of biomass and by-product yields, flux and concentration variability and gene essentiality. The development of these “consistently-reduced” models will help to clarify and facilitate integration of different experimental data to draw new understanding that can be directly extendable to genome-scale models. PMID:28727725
VirSorter: mining viral signal from microbial genomic data.
Roux, Simon; Enault, Francois; Hurwitz, Bonnie L; Sullivan, Matthew B
2015-01-01
Viruses of microbes impact all ecosystems where microbes drive key energy and substrate transformations including the oceans, humans and industrial fermenters. However, despite this recognized importance, our understanding of viral diversity and impacts remains limited by too few model systems and reference genomes. One way to fill these gaps in our knowledge of viral diversity is through the detection of viral signal in microbial genomic data. While multiple approaches have been developed and applied for the detection of prophages (viral genomes integrated in a microbial genome), new types of microbial genomic data are emerging that are more fragmented and larger scale, such as Single-cell Amplified Genomes (SAGs) of uncultivated organisms or genomic fragments assembled from metagenomic sequencing. Here, we present VirSorter, a tool designed to detect viral signal in these different types of microbial sequence data in both a reference-dependent and reference-independent manner, leveraging probabilistic models and extensive virome data to maximize detection of novel viruses. Performance testing shows that VirSorter's prophage prediction capability compares to that of available prophage predictors for complete genomes, but is superior in predicting viral sequences outside of a host genome (i.e., from extrachromosomal prophages, lytic infections, or partially assembled prophages). Furthermore, VirSorter outperforms existing tools for fragmented genomic and metagenomic datasets, and can identify viral signal in assembled sequence (contigs) as short as 3kb, while providing near-perfect identification (>95% Recall and 100% Precision) on contigs of at least 10kb. Because VirSorter scales to large datasets, it can also be used in "reverse" to more confidently identify viral sequence in viral metagenomes by sorting away cellular DNA whether derived from gene transfer agents, generalized transduction or contamination. 
Finally, VirSorter is made available through the iPlant Cyberinfrastructure, which provides a web-based user interface interconnected with the required computing resources. VirSorter thus complements existing prophage prediction software to better leverage fragmented, SAG and metagenomic datasets in a way that will scale to modern sequencing. Given these features, VirSorter should enable the discovery of new viruses in microbial datasets, and further our understanding of uncultivated viral communities across diverse ecosystems. PMID:26038737
2010-01-01
Background An important focus of genomic science is the discovery and characterization of all functional elements within genomes. In silico methods are used in genome studies to discover putative regulatory genomic elements (called words or motifs). Although a number of methods have been developed for motif discovery, most of them lack the scalability needed to analyze large genomic data sets. Methods This manuscript presents WordSeeker, an enumerative motif discovery toolkit that utilizes multi-core and distributed computational platforms to enable scalable analysis of genomic data. A controller task coordinates activities of worker nodes, each of which (1) enumerates a subset of the DNA word space and (2) scores words with a distributed Markov chain model. Results A comprehensive suite of performance tests was conducted to demonstrate the performance, speedup and efficiency of WordSeeker. The scalability of the toolkit enabled the analysis of the entire genome of Arabidopsis thaliana; the results of the analysis were integrated into The Arabidopsis Gene Regulatory Information Server (AGRIS). A public version of WordSeeker was deployed on the Glenn cluster at the Ohio Supercomputer Center. Conclusion WordSeeker effectively utilizes concurrent computing platforms to enable the identification of putative functional elements in genomic data sets. This capability facilitates the analysis of the large quantity of sequenced genomic data. PMID:21210985
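The two worker-node steps named above (enumerating words and scoring them against a Markov model) can be sketched in a single process as follows. This is a simplified illustration, not WordSeeker's code: it uses an order-0 background model (independent bases), whereas the abstract does not state the model order, and the function names and the log-ratio score are assumptions.

```python
from collections import Counter
from math import log2


def count_words(seq, w):
    """Count every overlapping word of length w in seq."""
    return Counter(seq[i:i + w] for i in range(len(seq) - w + 1))


def score_words(seq, w):
    """Score each observed word by log2(observed / expected).

    The expected count comes from an order-0 Markov model: the product of
    the word's base frequencies times the number of word positions.
    """
    n = len(seq)
    base_freq = {base: count / n for base, count in Counter(seq).items()}
    positions = n - w + 1
    scores = {}
    for word, observed in count_words(seq, w).items():
        prob = 1.0
        for base in word:
            prob *= base_freq[base]
        scores[word] = log2(observed / (prob * positions))
    return scores
```

In a distributed setting, each worker would run the counting and scoring over its own slice of the word space, and a controller would merge the resulting score tables.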
Dictionary-driven prokaryotic gene finding
Shibuya, Tetsuo; Rigoutsos, Isidore
2002-01-01
Gene identification, also known as gene finding or gene recognition, is among the important problems of molecular biology that have been receiving increasing attention with the advent of large scale sequencing projects. Previous strategies for solving this problem can be categorized into essentially two schools of thought: one school employs sequence composition statistics, whereas the other relies on database similarity searches. In this paper, we propose a new gene identification scheme that combines the best characteristics from each of these two schools. In particular, our method determines gene candidates among the ORFs that can be identified in a given DNA strand through the use of the Bio-Dictionary, a database of patterns that covers essentially all of the currently available sample of the natural protein sequence space. Our approach relies entirely on the use of redundant patterns as the agents on which the presence or absence of genes is predicated and does not employ any additional evidence, e.g. ribosome-binding site signals. The Bio-Dictionary Gene Finder (BDGF), the algorithm’s implementation, is a single computational engine able to handle the gene identification task across distinct archaeal and bacterial genomes. The engine exhibits performance that is characterized by simultaneous very high values of sensitivity and specificity, and a high percentage of correctly predicted start sites. Using a collection of patterns derived from an old (June 2000) release of the Swiss-Prot/TrEMBL database that contained 451 602 proteins and fragments, we demonstrate our method’s generality and capabilities through an extensive analysis of 17 complete archaeal and bacterial genomes. Examples of previously unreported genes are also shown and discussed in detail. PMID:12060689
DOE Office of Scientific and Technical Information (OSTI.GOV)
Adams, Michael W. W.
2014-01-07
Virtually all cellular processes are carried out by dynamic molecular assemblies or multiprotein complexes (PCs), the composition of which is largely unknown. Structural genomics efforts have demonstrated that less than 25% of the genes in a given prokaryotic genome will yield stable, soluble proteins when expressed using a one-ORF-at-a-time approach. We proposed that much of the remaining 75% of the genes encode proteins that are part of multiprotein complexes or are modified post-translationally, for example, with metals. The problem is that PCs and metalloproteins (MPs) cannot be accurately predicted on a genome-wide scale. The only solution to this dilemma is to experimentally determine PCs and MPs in biomass of a model organism and to develop analytical tools that can then be applied to the biomass of any other organism. In other words, organisms themselves must be analyzed to identify their PCs and MPs: “native proteomes” must be determined. This information can then be utilized to design multiple ORF expression systems to produce recombinant forms of PCs and MPs. Moreover, the information and utility of this approach can be enhanced by using a hyperthermophile, one that grows optimally at 100°C, as a model organism. By analyzing the native proteome at close to 100 °C below the optimum growth temperature, we will trap reversible and dynamic complexes, thereby enabling their identification, purification, and subsequent characterization. The model organism for the current study is Pyrococcus furiosus, a hyperthermophilic archaeon that grows optimally at 100°C. It is grown up to 600-liter scale and kg quantities of biomass are available. In this project we identified native PCs and MPs using P. furiosus biomass (with MS/MS analyses to identify proteins by component 4). In addition, we provided samples of abundant native PCs and MPs for structural characterization (using SAXS by component 5). 
We also designed and evaluated generic bioinformatics and experimental protocols for PC and MP production in other prokaryotes of DOE interest. The research resulted in ten peer-reviewed publications including in Nature and Nature Methods.
GENOMIC DIVERSITY AND THE MICROENVIRONMENT AS DRIVERS OF PROGRESSION IN DCIS
2017-10-01
stains, including quantitative analysis, 7) Identification of upstaged DCIS cases for the radiology aim, 8) Development of image analysis methods for...goals of the project? Aim 1. Determine whether genetic diversity of DCIS is greater in DCIS with adjacent invasive disease compared to DCIS without... compared to DCIS without IDC. Since genomics is not the sole driver of tumor behavior, we will phenotypically characterize DCIS and its
Computational Identification of Novel Genes: Current and Future Perspectives.
Klasberg, Steffen; Bitard-Feildel, Tristan; Mallet, Ludovic
2016-01-01
While it has long been thought that all genomic novelties are derived from the existing material, many genes lacking homology to known genes were found in recent genome projects. Some of these novel genes were proposed to have evolved de novo, i.e., out of noncoding sequences, whereas some have been shown to follow a duplication and divergence process. Their discovery called for an extension of the historical hypotheses about gene origination. Besides the theoretical breakthrough, increasing evidence accumulated that novel genes play important roles in evolutionary processes, including adaptation and speciation events. Different techniques are available to identify genes and classify them as novel. Their classification as novel is usually based on their similarity to known genes, or lack thereof, detected by comparative genomics or against databases. Computational approaches are further prime methods that can be based on existing models or leverage biological evidence from experiments. Identification of novel genes remains, however, a challenging task. With constant software and technology updates, no gold standard, and no available benchmark, evaluation and characterization of genomic novelty is a vibrant field. In this review, the classical and state-of-the-art tools for gene prediction are introduced. The current methods for novel gene detection are presented; the methodological strategies and their limits are discussed along with perspective approaches for further studies.
Song, Zhijiao; Zhang, Miaomiao; Li, Fagen; Weng, Qijie; Zhou, Chanpin; Li, Mei; Li, Jie; Huang, Huanhua; Mo, Xiaoyong; Gan, Siming
2016-01-01
Identification of loci or genes under natural selection is important for both understanding the genetic basis of local adaptation and practical applications, and genome scans provide a powerful means for such identification purposes. In this study, genome-wide simple sequence repeat markers (SSRs) were used to scan for molecular footprints of divergent selection in Eucalyptus grandis, a hardwood species occurring widely in coastal areas from 32° S to 16° S in Australia. High population diversity levels and weak population structure were detected with putatively neutral genomic SSRs. Using three FST outlier detection methods, a total of 58 outlying SSRs were collectively identified as loci under divergent selection against three non-correlated climatic variables, namely, mean annual temperature, isothermality and annual precipitation. Using a spatial analysis method, nine significant associations were revealed between FST outlier allele frequencies and climatic variables, involving seven alleles from five SSR loci. Of the five significant SSRs, two (EUCeSSR1044 and Embra394) contained alleles of putative genes with known functional importance for response to climatic factors. Our study presents critical information on the population diversity and structure of the important woody species E. grandis and provides insight into the adaptive responses of perennial trees to climatic variations. PMID:27748400
Mishra, Apurva; Pandey, Ramesh K; Manickam, Natesan
2015-01-01
Rapid phylogenetic and functional gene (gtfB) identification of S. mutans from the dental plaque derived from children. Dental plaque collected from fifteen patients of age group 7-12 underwent centrifugation followed by genomic DNA extraction for S. mutans. Genomic DNA was processed with S. mutans specific primers in suitable PCR conditions for phylogenetic and functional gene (gtfB) identification. The yield and results were confirmed by agarose gel electrophoresis. 1% agarose gel electrophoresis showed positive PCR amplification at 1,485 bp when compared with the 1 kbp standard, indicating the presence of S. mutans in the test sample. Another PCR reaction was set up using gtfB primers specific for S. mutans for functional gene identification. 1.2% agarose gel electrophoresis was done and a positive amplification was observed at 192 bp when compared to 100 bp standards. With the advancement in molecular biology techniques, PCR-based identification and quantification of the bacterial load can be done within hours using species-specific primers and DNA probes. Thus, this technique may reduce the laboratory time spent on conventional culture methods, reduces the possibility of colony identification errors, and is more sensitive than culture techniques.
Pajuelo, Mónica J.; Eguiluz, María; Dahlstrom, Eric; Requena, David; Guzmán, Frank; Ramirez, Manuel; Sheen, Patricia; Frace, Michael; Sammons, Scott; Cama, Vitaliano; Anzick, Sarah; Bruno, Dan; Mahanty, Siddhartha; Wilkins, Patricia; Nash, Theodore; Gonzalez, Armando; García, Héctor H.; Gilman, Robert H.; Porcella, Steve; Zimic, Mirko
2015-01-01
Background Infections with Taenia solium are the most common cause of adult acquired seizures worldwide, and are the leading cause of epilepsy in developing countries. A better understanding of the genetic diversity of T. solium will improve parasite diagnostics and transmission pathways in endemic areas thereby facilitating the design of future control measures and interventions. Microsatellite markers are useful genome features, which enable strain typing and identification in complex pathogen genomes. Here we describe microsatellite identification and characterization in T. solium, providing information that will assist in global efforts to control this important pathogen. Methods For genome sequencing, T. solium cysts and proglottids were collected from Huancayo and Puno in Peru, respectively. Using next generation sequencing (NGS) and de novo assembly, we assembled two draft genomes and one hybrid genome. Microsatellite sequences were identified and 36 of them were selected for further analysis. Twenty T. solium isolates were collected from Tumbes in the northern region, and twenty from Puno in the southern region of Peru. The size-polymorphism of the selected microsatellites was determined with multi-capillary electrophoresis. We analyzed the association between microsatellite polymorphism and the geographic origin of the samples. Results The predicted size of the hybrid (proglottid genome combined with cyst genome) T. solium genome was 111 MB with a GC content of 42.54%. A total of 7,979 contigs (>1,000 nt) were obtained. We identified 9,129 microsatellites in the Puno-proglottid genome and 9,936 in the Huancayo-cyst genome, with 5 or more repeats, ranging from mono- to hexa-nucleotide. Seven microsatellites were polymorphic and 29 were monomorphic within the analyzed isolates. T. solium tapeworms were classified into two genetic groups that correlated with the North/South geographic origin of the parasites. 
Conclusions/Significance The availability of draft genomes for T. solium represents a significant step towards understanding the biology of the parasite. We report here a set of T. solium polymorphic microsatellite markers that appear promising for genetic epidemiology studies. PMID:26697878
ITEP: an integrated toolkit for exploration of microbial pan-genomes.
Benedict, Matthew N; Henriksen, James R; Metcalf, William W; Whitaker, Rachel J; Price, Nathan D
2014-01-03
Comparative genomics is a powerful approach for studying variation in physiological traits as well as the evolution and ecology of microorganisms. Recent technological advances have enabled sequencing large numbers of related genomes in a single project, requiring computational tools for their integrated analysis. In particular, accurate annotations and identification of gene presence and absence are critical for understanding and modeling the cellular physiology of newly sequenced genomes. Although many tools are available to compare the gene contents of related genomes, new tools are necessary to enable close examination and curation of protein families from large numbers of closely related organisms, to integrate curation with the analysis of gain and loss, and to generate metabolic networks linking the annotations to observed phenotypes. We have developed ITEP, an Integrated Toolkit for Exploration of microbial Pan-genomes, to curate protein families, compute similarities to externally-defined domains, analyze gene gain and loss, and generate draft metabolic networks from one or more curated reference network reconstructions in groups of related microbial species among which the combination of core and variable genes constitutes their "pan-genomes". The ITEP toolkit consists of: (1) a series of modular command-line scripts for identification, comparison, curation, and analysis of protein families and their distribution across many genomes; (2) a set of Python libraries for programmatic access to the same data; and (3) pre-packaged scripts to perform common analysis workflows on a collection of genomes. ITEP's capabilities include de novo protein family prediction, ortholog detection, analysis of functional domains, identification of core and variable genes and gene regions, sequence alignments and tree generation, annotation curation, and the integration of cross-genome analysis and metabolic networks for study of metabolic network evolution. 
ITEP is a powerful, flexible toolkit for generation and curation of protein families. ITEP's modular design allows for straightforward extension as analysis methods and tools evolve. By integrating comparative genomics with the development of draft metabolic networks, ITEP harnesses the power of comparative genomics to build confidence in links between genotype and phenotype and helps disambiguate gene annotations when they are evaluated in both evolutionary and metabolic network contexts.
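The distinction between core and variable genes in the abstract above can be illustrated with plain set operations. The following is a minimal sketch, not ITEP's own code or API: the function name `split_pan_genome`, the genome labels, and the family identifiers are all hypothetical, and it assumes protein families have already been assigned to each genome.

```python
from typing import Dict, Set, Tuple


def split_pan_genome(families_by_genome: Dict[str, Set[str]]) -> Tuple[Set[str], Set[str]]:
    """Return (core, variable) protein families for a group of genomes.

    core:     families present in every genome
    variable: families present in at least one genome but not in all
    """
    if not families_by_genome:
        return set(), set()
    family_sets = list(families_by_genome.values())
    pan = set().union(*family_sets)          # the pan-genome: every family seen
    core = set.intersection(*family_sets)    # families shared by all genomes
    return core, pan - core


# Hypothetical presence/absence calls for three genomes.
example = {
    "genome_A": {"fam1", "fam2", "fam3"},
    "genome_B": {"fam1", "fam2"},
    "genome_C": {"fam1", "fam3", "fam4"},
}
core, variable = split_pan_genome(example)
print(sorted(core))      # ['fam1']
print(sorted(variable))  # ['fam2', 'fam3', 'fam4']
```

In practice the hard part is the family assignment itself (ortholog detection and curation), which this sketch takes as given.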
Supersize me: how whole-genome sequencing and big data are transforming epidemiology.
Kao, Rowland R; Haydon, Daniel T; Lycett, Samantha J; Murcia, Pablo R
2014-05-01
In epidemiology, the identification of 'who infected whom' allows us to quantify key characteristics such as incubation periods, heterogeneity in transmission rates, duration of infectiousness, and the existence of high-risk groups. Although such identification is invaluable, the existence of many plausible infection pathways makes it difficult, and epidemiological contact tracing is either uncertain, logistically prohibitive, or both. The recent advent of next-generation sequencing technology allows the identification of traceable differences in the pathogen genome that are transforming our ability to understand high-resolution disease transmission, sometimes even down to the host-to-host scale. We review recent examples of the use of pathogen whole-genome sequencing for the purpose of forensic tracing of transmission pathways, focusing on the particular problems where evolutionary dynamics must be supplemented by epidemiological information on the most likely timing of events as well as possible transmission pathways. We also discuss potential pitfalls in the over-interpretation of these data, and highlight the manner in which a confluence of this technology with sophisticated mathematical and statistical approaches has the potential to produce a paradigm shift in our understanding of infectious disease transmission and control.
Prioritizing causal disease genes using unbiased genomic features.
Deo, Rahul C; Musso, Gabriel; Tasan, Murat; Tang, Paul; Poon, Annie; Yuan, Christiana; Felix, Janine F; Vasan, Ramachandran S; Beroukhim, Rameen; De Marco, Teresa; Kwok, Pui-Yan; MacRae, Calum A; Roth, Frederick P
2014-12-03
Cardiovascular disease (CVD) is the leading cause of death in the developed world. Human genetic studies, including genome-wide sequencing and SNP-array approaches, promise to reveal disease genes and mechanisms representing new therapeutic targets. In practice, however, identification of the actual genes contributing to disease pathogenesis has lagged behind identification of associated loci, thus limiting the clinical benefits. To aid in localizing causal genes, we develop a machine learning approach, Objective Prioritization for Enhanced Novelty (OPEN), which quantitatively prioritizes gene-disease associations based on a diverse group of genomic features. This approach uses only unbiased predictive features and thus is not hampered by a preference towards previously well-characterized genes. We demonstrate success in identifying genetic determinants for CVD-related traits, including cholesterol levels, blood pressure, and conduction system and cardiomyopathy phenotypes. Using OPEN, we prioritize genes, including FLNC, for association with increased left ventricular diameter, which is a defining feature of a prevalent cardiovascular disorder, dilated cardiomyopathy or DCM. Using a zebrafish model, we experimentally validate FLNC and identify a novel FLNC splice-site mutation in a patient with severe DCM. Our approach stands to assist interpretation of large-scale genetic studies without compromising their fundamentally unbiased nature.
NASA Astrophysics Data System (ADS)
Serra, Martin J. (reviewer)
2000-01-01
Genomics is one of the most rapidly expanding areas of science. This book is an outgrowth of a series of lectures given by one of the former heads (CRC) of the Human Genome Initiative. The book is designed to reach a wide audience, from biologists with little chemical or physical science background through engineers, computer scientists, and physicists with little current exposure to the chemical or biological principles of genetics. The text starts with a basic review of the chemical and biological properties of DNA. However, without either a biochemistry background or a supplemental biochemistry text, this chapter and much of the rest of the text would be difficult to digest. The second chapter is designed to put DNA into the context of the larger chromosomal unit. Specialized chromosomal structures and sequences (centromeres, telomeres) are introduced, leading to a section on chromosome organization and purification. The next 4 chapters cover the physical (hybridization, electrophoresis), chemical (polymerase chain reaction), and biological (genetic) techniques that provide the backbone of genomic analysis. These chapters cover in significant detail the fundamental principles underlying each technique and provide a firm background for the remainder of the text. Chapters 7-9 consider the need and methods for the development of physical maps. Chapter 7 primarily discusses chromosomal localization techniques, including in situ hybridization, FISH, and chromosome paintings. The next two chapters focus on the development of libraries and clones. In particular, Chapter 9 considers the limitations of current mapping and clone production. The current state and future of DNA sequencing is covered in the next three chapters. The first considers the current methods of DNA sequencing - especially gel-based methods of analysis, although other possible approaches (mass spectrometry) are introduced.
Much of the chapter addresses the limitations of current methods, including analysis of error in sequencing and current bottlenecks in the sequencing effort. The next chapter describes the steps necessary to scale current technologies for the sequencing of entire genomes. Chapter 12 examines alternate methods for DNA sequencing. Initially, methods of single-molecule sequencing and sequencing by microscopy are introduced; the majority of the chapter is devoted to the development of DNA sequencing methods using chip microarrays and hybridization. The remaining chapters (13-15) consider the uses and analysis of DNA sequence information. The initial focus is on the identification of genes. Several examples are given of the use of DNA sequence information for diagnosis of inherited or infectious diseases. The sequence-specific manipulation of DNA is discussed in Chapter 14. The final chapter deals with the implications of large-scale sequencing, including methods for identifying genes and finding errors in DNA sequences, to the development of computer algorithms for the interpretation of DNA sequence information. The text figures are black and white line drawings that, although clearly done, seem a bit primitive for 1999. While I appreciated the simplicity of the drawings, many students accustomed to more colorful presentations will find them wanting. The four color figures in the center of the text seem an afterthought and add little to the text's clarity. Each chapter has a set of additional reading sources, mostly primary sources. Often, specialized topics are offset into boxes that provide clarification and amplification without cluttering the text. An appendix includes a list of the Web-based database resources. As an undergraduate instructor who has previously taught biochemistry, molecular biology, and a course on the human genome, I found many interesting tidbits and amplifications throughout the text. 
I would recommend this book as a text for an advanced undergraduate or beginning graduate course in genomics. Although the text works through several examples of genetic and genome analysis, additional problem/homework sets would need to be developed to ensure student comprehension. The text steers clear of the ethical implications of the Human Genome Initiative and remains true to its subtitle, The Science and Technology.
de Oliveira, Gilberto Santos; Kawahara, Rebeca; Rosa-Fernandes, Livia; Avila, Carla Cristi; Teixeira, Marta M. G.; Larsen, Martin R.
2018-01-01
Background: Chagas disease, also known as American trypanosomiasis, is caused by the protozoan Trypanosoma cruzi. Over the last 30 years, Chagas disease has expanded from a neglected parasitic infection of the rural population to an urbanized chronic disease, becoming a potentially emergent global health problem. T. cruzi strains were assigned to seven genetic groups (TcI-TcVI and TcBat), named discrete typing units (DTUs), which represent a set of isolates that differ in virulence, pathogenicity and immunological features. Indeed, attempts have been made to relate diverse clinical manifestations (from asymptomatic to highly severe disease) to T. cruzi genetic variability. For this reason, several DTU typing methods have been introduced. Each method has its own advantages and drawbacks, such as high complexity and long analysis time, and all of them are based on genetic signatures. Recently, a novel method discriminated bacterial strains using a peptide identification-free, genome sequence-independent shotgun proteomics workflow. Here, we aimed to develop a Trypanosoma cruzi Strain Typing Assay using MS/MS peptide spectral libraries, named Tc-STAMS2. Methods/Principal findings: The Tc-STAMS2 method uses shotgun proteomics combined with spectral library search to assign and discriminate T. cruzi strains independently of genome knowledge. The method is based on the construction of a library of MS/MS peptide spectra built using genotyped T. cruzi reference strains. For identification, the MS/MS peptide spectra of unknown T. cruzi cells are identified using the spectral matching algorithm SpectraST. The Tc-STAMS2 method allowed correct identification of all DTUs with high confidence. The method was robust towards different sample preparations, length of chromatographic gradients and fragmentation techniques. Moreover, a pilot inter-laboratory study showed the applicability to different MS platforms.
Conclusions and significance: This is the first study to develop an MS-based platform for T. cruzi strain typing. Indeed, the Tc-STAMS2 method allows T. cruzi strain typing using MS/MS spectra as discriminatory features and allows the differentiation of TcI-TcVI DTUs. Similar to genomic-based strategies, the Tc-STAMS2 method allows identification of strains within DTUs. Its robustness towards different experimental and biological variables makes it a valuable complementary strategy to the current T. cruzi genotyping assays. Moreover, this method can be used to identify DTU-specific features correlated with the strain phenotype. PMID:29608573
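The idea of assigning an unknown sample by matching its MS/MS spectra against a library of reference spectra can be sketched with a binned cosine similarity. This is an illustration only: it is not SpectraST's scoring function, and the peak lists, the DTU labels attached to them, the 1 m/z bin width, and all function names are assumptions made for the example. A real typing run would also aggregate matches over many spectra per sample rather than decide on one.

```python
import math
from typing import Dict, List, Tuple

Spectrum = List[Tuple[float, float]]  # (m/z, intensity) peaks


def bin_spectrum(peaks: Spectrum, bin_width: float = 1.0) -> Dict[int, float]:
    """Sum peak intensities into fixed-width m/z bins."""
    binned: Dict[int, float] = {}
    for mz, intensity in peaks:
        key = int(mz // bin_width)
        binned[key] = binned.get(key, 0.0) + intensity
    return binned


def cosine_similarity(a: Spectrum, b: Spectrum, bin_width: float = 1.0) -> float:
    """Normalized dot product of two binned spectra (0 = no shared bins, 1 = same shape)."""
    va, vb = bin_spectrum(a, bin_width), bin_spectrum(b, bin_width)
    dot = sum(v * vb.get(k, 0.0) for k, v in va.items())
    norm = math.sqrt(sum(v * v for v in va.values())) * math.sqrt(sum(v * v for v in vb.values()))
    return dot / norm if norm else 0.0


def best_library_match(query: Spectrum, library: Dict[str, Spectrum]) -> Tuple[str, float]:
    """Return the library label with the highest similarity to the query, and its score."""
    scores = {label: cosine_similarity(query, ref) for label, ref in library.items()}
    best = max(scores, key=scores.get)
    return best, scores[best]


# Hypothetical two-entry library and one query spectrum.
library = {
    "TcI": [(100.2, 10.0), (250.7, 5.0)],
    "TcII": [(100.2, 1.0), (300.1, 9.0)],
}
query = [(100.4, 8.0), (250.3, 4.0)]
label, score = best_library_match(query, library)
print(label, round(score, 3))
```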
De-identification of clinical notes via recurrent neural network and conditional random field.
Liu, Zengjian; Tang, Buzhou; Wang, Xiaolong; Chen, Qingcai
2017-11-01
De-identification, the identification and removal of sensitive information such as protected health information (PHI) from clinical data, is a critical step to enable data to be shared or published. The 2016 Centers of Excellence in Genomic Science (CEGS) Neuropsychiatric Genome-scale and RDOC Individualized Domains (N-GRID) clinical natural language processing (NLP) challenge includes a track on de-identifying electronic medical records (EMRs) (i.e., track 1). The challenge organizers provide 1000 annotated mental health records for this track, of which 600 are used as a training set and 400 as a test set. We develop a hybrid system for the de-identification task on the training set. First, four individual subsystems, namely a subsystem based on bidirectional LSTM (long short-term memory, a variant of recurrent neural network), a subsystem based on bidirectional LSTM with features, a subsystem based on conditional random field (CRF), and a rule-based subsystem, are used to identify PHI instances. Then, an ensemble learning-based classifier is deployed to combine all PHI instances predicted by the three machine learning-based subsystems. Finally, the results of the ensemble learning-based classifier and the rule-based subsystem are merged together. Experiments conducted on the official test set show that our system achieves the highest micro F1-scores of 93.07%, 91.43% and 95.23% under the "token", "strict" and "binary token" criteria respectively, ranking first in the 2016 CEGS N-GRID NLP challenge. In addition, on the dataset of the 2014 i2b2 NLP challenge, our system achieves the highest micro F1-scores of 96.98%, 95.11% and 98.28% under the "token", "strict" and "binary token" criteria respectively, outperforming other state-of-the-art systems. All these experiments demonstrate the effectiveness of our proposed method.
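The two-stage combination described above (combine the three machine learning subsystems, then merge with the rule-based output) might look roughly like the sketch below. It substitutes a simple majority vote for the learned ensemble classifier used by the authors, and a plain set union for their final merge; the span representation, function names, and example predictions are all hypothetical.

```python
from collections import Counter
from typing import List, Set, Tuple

Span = Tuple[int, int, str]  # (start offset, end offset, PHI type)


def majority_vote(predictions: List[Set[Span]], min_votes: int = 2) -> Set[Span]:
    """Keep spans predicted by at least `min_votes` subsystems.

    Stand-in for the paper's ensemble learning-based classifier.
    """
    votes = Counter(span for preds in predictions for span in preds)
    return {span for span, n in votes.items() if n >= min_votes}


def merge_with_rules(ensemble: Set[Span], rule_based: Set[Span]) -> List[Span]:
    """Union the ensemble output with rule-based spans, sorted by offset."""
    return sorted(ensemble | rule_based)


# Hypothetical outputs of the three ML subsystems and the rule-based one.
lstm = {(0, 8, "NAME"), (20, 30, "DATE")}
lstm_with_features = {(0, 8, "NAME"), (40, 45, "AGE")}
crf = {(0, 8, "NAME"), (20, 30, "DATE")}
rules = {(50, 62, "PHONE")}

voted = majority_vote([lstm, lstm_with_features, crf])
final = merge_with_rules(voted, rules)
print(final)  # [(0, 8, 'NAME'), (20, 30, 'DATE'), (50, 62, 'PHONE')]
```

The lone AGE span is dropped because only one subsystem proposed it; a learned combiner could instead weigh each subsystem by PHI type.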
Ando, David; Singh, Jahnavi; Keasling, Jay D.; García Martín, Héctor
2018-01-01
Determination of internal metabolic fluxes is crucial for fundamental and applied biology because they map how carbon and electrons flow through metabolism to enable cell function. 13C Metabolic Flux Analysis (13C MFA) and Two-Scale 13C Metabolic Flux Analysis (2S-13C MFA) are two techniques used to determine such fluxes. Both operate on the simplifying approximation that metabolic flux from peripheral metabolism into central “core” carbon metabolism is minimal, and can be omitted when modeling isotopic labeling in core metabolism. The validity of this “two-scale” or “bow tie” approximation is supported both by the ability to accurately model experimental isotopic labeling data, and by experimentally verified metabolic engineering predictions using these methods. However, the boundaries of core metabolism that satisfy this approximation can vary across species, and across cell culture conditions. Here, we present a set of algorithms that (1) systematically calculate flux bounds for any specified “core” of a genome-scale model so as to satisfy the bow tie approximation and (2) automatically identify an updated set of core reactions that can satisfy this approximation more efficiently. First, we leverage linear programming to simultaneously identify the lowest fluxes from peripheral metabolism into core metabolism compatible with the observed growth rate and extracellular metabolite exchange fluxes. Second, we use Simulated Annealing to identify an updated set of core reactions that allow for a minimum of fluxes into core metabolism to satisfy these experimental constraints. Together, these methods accelerate and automate the identification of a biologically reasonable set of core reactions for use with 13C MFA or 2S-13C MFA, as well as provide for a substantially lower set of flux bounds for fluxes into the core as compared with previous methods. We provide an open source Python implementation of these algorithms at https://github.com/JBEI/limitfluxtocore. 
PMID:29300340
Logue, Mark W; Amstadter, Ananda B; Baker, Dewleen G; Duncan, Laramie; Koenen, Karestan C; Liberzon, Israel; Miller, Mark W; Morey, Rajendra A; Nievergelt, Caroline M; Ressler, Kerry J; Smith, Alicia K; Smoller, Jordan W; Stein, Murray B; Sumner, Jennifer A; Uddin, Monica
2015-01-01
The development of posttraumatic stress disorder (PTSD) is influenced by genetic factors. Although there have been some replicated candidates, the identification of risk variants for PTSD has lagged behind genetic research of other psychiatric disorders such as schizophrenia, autism, and bipolar disorder. Psychiatric genetics has moved beyond examination of specific candidate genes in favor of the genome-wide association study (GWAS) strategy of very large numbers of samples, which allows for the discovery of previously unsuspected genes and molecular pathways. The successes of genetic studies of schizophrenia and bipolar disorder have been aided by the formation of a large-scale GWAS consortium: the Psychiatric Genomics Consortium (PGC). In contrast, only a handful of GWAS of PTSD have appeared in the literature to date. Here we describe the formation of a group dedicated to large-scale study of PTSD genetics: the PGC-PTSD. The PGC-PTSD faces challenges related to the contingency on trauma exposure and the large degree of ancestral genetic diversity within and across participating studies. Using the PGC analysis pipeline supplemented by analyses tailored to address these challenges, we anticipate that our first large-scale GWAS of PTSD will comprise over 10 000 cases and 30 000 trauma-exposed controls. Following in the footsteps of our PGC forerunners, this collaboration—of a scope that is unprecedented in the field of traumatic stress—will lead the search for replicable genetic associations and new insights into the biological underpinnings of PTSD. PMID:25904361
Cho, Namjin; Hwang, Byungjin; Yoon, Jung-ki; Park, Sangun; Lee, Joongoo; Seo, Han Na; Lee, Jeewon; Huh, Sunghoon; Chung, Jinsoo; Bang, Duhee
2015-09-21
Interpreting epistatic interactions is crucial for understanding evolutionary dynamics of complex genetic systems and unveiling structure and function of genetic pathways. Although high-resolution mapping of en masse variant libraries enables molecular biologists to address genotype-phenotype relationships, long-read sequencing technology remains indispensable for assessing functional relationships between mutations that lie far apart. Here, we introduce JigsawSeq for multiplexed sequence identification of pooled gene variant libraries by combining a codon-based molecular barcoding strategy and de novo assembly of short-read data. We first validate JigsawSeq on small sub-pools and observe high precision and recall under various experimental settings. With extensive simulations, we then apply JigsawSeq to large-scale gene variant libraries to show that our method can be reliably scaled using next-generation sequencing. JigsawSeq may serve as a rapid screening tool for functional genomics and offer the opportunity to explore evolutionary trajectories of protein variants.
Germine, L; Robinson, E B; Smoller, J W; Calkins, M E; Moore, T M; Hakonarson, H; Daly, M J; Lee, P H; Holmes, A J; Buckner, R L; Gur, R C; Gur, R E
2016-01-01
Breakthroughs in genomics have begun to unravel the genetic architecture of schizophrenia risk, providing methods for quantifying schizophrenia polygenic risk based on common genetic variants. Our objective in the current study was to understand the relationship between schizophrenia genetic risk variants and neurocognitive development in healthy individuals. We first used combined genomic and neurocognitive data from the Philadelphia Neurodevelopmental Cohort (4303 participants ages 8–21 years) to screen 26 neurocognitive phenotypes for their association with schizophrenia polygenic risk. Schizophrenia polygenic risk was estimated for each participant based on summary statistics from the most recent schizophrenia genome-wide association analysis (Psychiatric Genomics Consortium 2014). After correction for multiple comparisons, greater schizophrenia polygenic risk was significantly associated with reduced speed of emotion identification and verbal reasoning. These associations were significant by age 9 years and there was no evidence of interaction between schizophrenia polygenic risk and age on neurocognitive performance. We then looked at the association between schizophrenia polygenic risk and emotion identification speed in the Harvard/MGH Brain Genomics Superstruct Project sample (695 participants ages 18–35 years), where we replicated the association between schizophrenia polygenic risk and emotion identification speed. These analyses provide evidence for a replicable association between polygenic risk for schizophrenia and a specific aspect of social cognition. Our findings indicate that individual differences in genetic risk for schizophrenia are linked with the development of aspects of social cognition and potentially verbal reasoning, and that these associations emerge relatively early in development. PMID:27754483
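A polygenic risk estimate built from summary statistics is commonly computed as a weighted sum of risk-allele counts. The sketch below shows that general form only; it is not the study's pipeline, and the variant identifiers, effect sizes, dosages, and the function name `polygenic_score` are hypothetical. Steps such as variant selection, linkage pruning, and ancestry adjustment are omitted.

```python
from typing import Dict


def polygenic_score(dosages: Dict[str, float], effect_sizes: Dict[str, float]) -> float:
    """Sum of (risk-allele dosage x effect size) over variants present in both inputs.

    dosages:      variant id -> number of risk alleles carried (0, 1 or 2)
    effect_sizes: variant id -> per-allele weight from GWAS summary statistics
    """
    shared = dosages.keys() & effect_sizes.keys()
    return sum(dosages[v] * effect_sizes[v] for v in shared)


# Hypothetical weights and one participant's genotype dosages.
weights = {"variant_1": 0.5, "variant_2": -0.25, "variant_3": 0.125}
participant = {"variant_1": 2, "variant_2": 1, "variant_4": 2}

print(polygenic_score(participant, weights))  # 0.75
```

Variants missing from either input are skipped here; how missing genotypes are handled in practice is a design choice that affects comparability across participants.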
Germine, L; Robinson, E B; Smoller, J W; Calkins, M E; Moore, T M; Hakonarson, H; Daly, M J; Lee, P H; Holmes, A J; Buckner, R L; Gur, R C; Gur, R E
2016-10-18
Breakthroughs in genomics have begun to unravel the genetic architecture of schizophrenia risk, providing methods for quantifying schizophrenia polygenic risk based on common genetic variants. Our objective in the current study was to understand the relationship between schizophrenia genetic risk variants and neurocognitive development in healthy individuals. We first used combined genomic and neurocognitive data from the Philadelphia Neurodevelopmental Cohort (4303 participants ages 8-21 years) to screen 26 neurocognitive phenotypes for their association with schizophrenia polygenic risk. Schizophrenia polygenic risk was estimated for each participant based on summary statistics from the most recent schizophrenia genome-wide association analysis (Psychiatric Genomics Consortium 2014). After correction for multiple comparisons, greater schizophrenia polygenic risk was significantly associated with reduced speed of emotion identification and verbal reasoning. These associations were significant by age 9 years and there was no evidence of interaction between schizophrenia polygenic risk and age on neurocognitive performance. We then looked at the association between schizophrenia polygenic risk and emotion identification speed in the Harvard/MGH Brain Genomics Superstruct Project sample (695 participants ages 18-35 years), where we replicated the association between schizophrenia polygenic risk and emotion identification speed. These analyses provide evidence for a replicable association between polygenic risk for schizophrenia and a specific aspect of social cognition. Our findings indicate that individual differences in genetic risk for schizophrenia are linked with the development of aspects of social cognition and potentially verbal reasoning, and that these associations emerge relatively early in development.
Analysis of raw meats and fats of pigs using polymerase chain reaction for Halal authentication.
Aida, A A; Che Man, Y B; Wong, C M V L; Raha, A R; Son, R
2005-01-01
A method for species identification from pork and lard samples using polymerase chain reaction (PCR) analysis of a conserved region in the mitochondrial (mt) cytochrome b (cyt b) gene has been developed. Genomic DNA of pork and lard was extracted using Qiagen DNeasy® Tissue Kits and subjected to PCR amplification targeting the mt cyt b gene. The genomic DNA from lard was found to be of good quality and produced clear PCR products of approximately 360 base pairs on amplification of the mt cyt b gene. To distinguish between species, the amplified PCR products were cut with the restriction enzyme BsaJI, resulting in porcine-specific restriction fragment length polymorphisms (RFLP). The cyt b PCR-RFLP species identification assay yielded excellent results for identification of pig species. It is a potentially reliable technique for distinguishing pig meat and fat from those of other animals for Halal authentication.
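The RFLP step can be mimicked in silico by locating restriction sites in an amplicon and listing the resulting fragment lengths. The sketch below assumes the BsaJI recognition site is CCNNGG with the cut after the first C; check this against the enzyme supplier's documentation before relying on it. The sequences are made-up toy strings, not cyt b amplicons, and the function name is hypothetical.

```python
import re
from typing import List


def digest_fragment_lengths(sequence: str,
                            site_regex: str = "CC[ACGT]{2}GG",
                            cut_offset: int = 1) -> List[int]:
    """Return fragment lengths after cutting a linear sequence at every site.

    site_regex: recognition site as a regular expression (assumed CCNNGG here)
    cut_offset: cut position relative to the start of the site (assumed C^CNNGG)
    """
    seq = sequence.upper()
    # A lookahead is used so that overlapping recognition sites are all found.
    cuts = [m.start() + cut_offset for m in re.finditer(f"(?={site_regex})", seq)]
    bounds = [0] + cuts + [len(seq)]
    return [end - start for start, end in zip(bounds, bounds[1:])]


# Toy sequences: one with two sites, one with none.
with_sites = "ATAT" + "CCAAGG" + "TTTTT" + "CCTTGG" + "AA"
without_sites = "ATATATAT"

print(digest_fragment_lengths(with_sites))     # [5, 11, 7]
print(digest_fragment_lengths(without_sites))  # [8]
```

Differing fragment-length patterns between species are what a gel would show as species-specific band profiles.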
Kong, B H; Hanifah, Y A; Yusof, M Y; Thong, K L
2011-12-01
Acinetobacter baumannii, genomic species 3 and 13TU are being increasingly reported as the most important Acinetobacter species that cause infections in hospitalized patients. These Acinetobacter species are grouped in the Acinetobacter calcoaceticus-Acinetobacter baumannii (Acb) complex. Differentiation of the species in the Acb-complex by phenotypic methods is limited. Therefore, in this study, amplified ribosomal DNA restriction analysis (ARDRA) was applied to confirm the identity of A. baumannii strains as well as to differentiate between the subspecies. One hundred and eighty-five strains from the Intensive Care Unit, Universiti Malaya Medical Center (UMMC) were successfully identified as A. baumannii by ARDRA. Acinetobacter genomic species 13TU and 15TU were identified in 3 and 1 strains, respectively. ARDRA provides an accurate, rapid and definitive approach towards identification to the species level in the genus Acinetobacter. This paper reports the first application of ARDRA in genospecies identification of Acinetobacter in Malaysia.
Applications of Genomic Sequencing in Pediatric CNS Tumors.
Bavle, Abhishek A; Lin, Frank Y; Parsons, D Williams
2016-05-01
Recent advances in genome-scale sequencing methods have resulted in a significant increase in our understanding of the biology of human cancers. When applied to pediatric central nervous system (CNS) tumors, these remarkable technological breakthroughs have facilitated the molecular characterization of multiple tumor types, provided new insights into the genetic basis of these cancers, and prompted innovative strategies that are changing the management paradigm in pediatric neuro-oncology. Genomic tests have begun to affect medical decision making in a number of ways, from delineating histopathologically similar tumor types into distinct molecular subgroups that correlate with clinical characteristics, to guiding the addition of novel therapeutic agents for patients with high-risk or poor-prognosis tumors, or alternatively, reducing treatment intensity for those with a favorable prognosis. Genomic sequencing has also had a significant impact on translational research strategies in pediatric CNS tumors, resulting in wide-ranging applications that have the potential to direct the rational preclinical screening of novel therapeutic agents, shed light on tumor heterogeneity and evolution, and highlight differences (or similarities) between pediatric and adult CNS tumors. Finally, in addition to allowing the identification of somatic (tumor-specific) mutations, the analysis of patient-matched constitutional (germline) DNA has facilitated the detection of pathogenic germline alterations in cancer genes in patients with CNS tumors, with critical implications for genetic counseling and tumor surveillance strategies for children with familial predisposition syndromes. As our understanding of the molecular landscape of pediatric CNS tumors continues to advance, innovative applications of genomic sequencing hold significant promise for further improving the care of children with these cancers.
Harnessing CRISPR-Cas systems for bacterial genome editing.
Selle, Kurt; Barrangou, Rodolphe
2015-04-01
Manipulation of genomic sequences facilitates the identification and characterization of key genetic determinants in the investigation of biological processes. Genome editing via clustered regularly interspaced short palindromic repeats (CRISPR)-CRISPR-associated (Cas) constitutes a next-generation method for programmable and high-throughput functional genomics. CRISPR-Cas systems are readily reprogrammed to induce sequence-specific DNA breaks at target loci, resulting in fixed mutations via host-dependent DNA repair mechanisms. Although bacterial genome editing is a relatively unexplored and underrepresented application of CRISPR-Cas systems, recent studies provide valuable insights for the widespread future implementation of this technology. This review summarizes recent progress in bacterial genome editing and identifies fundamental genetic and phenotypic outcomes of CRISPR targeting in bacteria, in the context of tool development, genome homeostasis, and DNA repair.
de Oliveira, Gilberto Santos; Kawahara, Rebeca; Rosa-Fernandes, Livia; Mule, Simon Ngao; Avila, Carla Cristi; Teixeira, Marta M G; Larsen, Martin R; Palmisano, Giuseppe
2018-04-01
Chagas disease, also known as American trypanosomiasis, is caused by the protozoan Trypanosoma cruzi. Over the last 30 years, Chagas disease has expanded from a neglected parasitic infection of the rural population to an urbanized chronic disease, becoming a potentially emergent global health problem. T. cruzi strains were assigned to seven genetic groups (TcI-TcVI and TcBat), named discrete typing units (DTUs), which represent a set of isolates that differ in virulence, pathogenicity and immunological features. Indeed, attempts have been made to relate diverse clinical manifestations (from asymptomatic to highly severe disease) to T. cruzi genetic variability. For this reason, several DTU typing methods have been introduced. Each method has its own advantages and drawbacks, such as high complexity and long analysis time, and all of them are based on genetic signatures. Recently, a novel method discriminated bacterial strains using a peptide identification-free, genome sequence-independent shotgun proteomics workflow. Here, we aimed to develop a Trypanosoma cruzi Strain Typing Assay using MS/MS peptide spectral libraries, named Tc-STAMS2. The Tc-STAMS2 method uses shotgun proteomics combined with spectral library search to assign and discriminate T. cruzi strains independently of genome knowledge. The method is based on the construction of a library of MS/MS peptide spectra built using genotyped T. cruzi reference strains. For identification, the MS/MS peptide spectra of unknown T. cruzi cells are identified using the spectral matching algorithm SpectraST. The Tc-STAMS2 method allowed correct identification of all DTUs with high confidence. The method was robust towards different sample preparations, length of chromatographic gradients and fragmentation techniques. Moreover, a pilot inter-laboratory study showed the applicability to different MS platforms. This is the first study to develop an MS-based platform for T. cruzi strain typing.
Indeed, the Tc-STAMS2 method allows T. cruzi strain typing using MS/MS spectra as discriminatory features and allows the differentiation of TcI-TcVI DTUs. Similar to genomic-based strategies, the Tc-STAMS2 method allows identification of strains within DTUs. Its robustness towards different experimental and biological variables makes it a valuable complementary strategy to the current T. cruzi genotyping assays. Moreover, this method can be used to identify DTU-specific features correlated with the strain phenotype.
Zeng, Lu; Kortschak, R Daniel; Raison, Joy M; Bertozzi, Terry; Adelson, David L
2018-01-01
Transposable Elements (TEs) are mobile DNA sequences that make up significant fractions of amniote genomes. However, they are difficult to detect and annotate ab initio because of their variable features, lengths and clade-specific variants. We have addressed this problem by refining and developing a Comprehensive ab initio Repeat Pipeline (CARP) to identify and cluster TEs and other repetitive sequences in genome assemblies. The pipeline begins with a pairwise alignment using krishna, a custom aligner. Single linkage clustering is then carried out to produce families of repetitive elements. Consensus sequences are then filtered for protein coding genes and then annotated using Repbase and a custom library of retrovirus and reverse transcriptase sequences. This process yields three types of family: fully annotated, partially annotated and unannotated. Fully annotated families reflect recently diverged/young known TEs present in Repbase. The remaining two types of families contain a mixture of novel TEs and segmental duplications. These can be resolved by aligning these consensus sequences back to the genome to assess copy number vs. length distribution. Our pipeline has three significant advantages compared to other methods for ab initio repeat identification: 1) we generate not only consensus sequences, but keep the genomic intervals for the original aligned sequences, allowing straightforward analysis of evolutionary dynamics, 2) consensus sequences represent low-divergence, recently/currently active TE families, 3) segmental duplications are annotated as a useful by-product. We have compared our ab initio repeat annotations for 7 genome assemblies to other methods and demonstrate that CARP compares favourably with RepeatModeler, the most widely used repeat annotation package.
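The single linkage clustering step, which turns pairwise alignments into families of repetitive elements, amounts to finding connected components in a graph whose edges are the aligned pairs. A minimal union-find sketch follows; it is not CARP's implementation, it ignores alignment scores and coordinates, and the interval identifiers and function name are hypothetical.

```python
from typing import Dict, Iterable, List, Tuple


def single_linkage_families(pairs: Iterable[Tuple[str, str]]) -> List[List[str]]:
    """Group sequence ids into families: two ids share a family if a chain of
    pairwise alignments connects them (single linkage)."""
    parent: Dict[str, str] = {}

    def find(x: str) -> str:
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps trees shallow
            x = parent[x]
        return x

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a  # merge the two families

    families: Dict[str, List[str]] = {}
    for member in parent:
        families.setdefault(find(member), []).append(member)
    return sorted(sorted(members) for members in families.values())


# Hypothetical aligned pairs of genomic intervals.
alignments = [("r1", "r2"), ("r2", "r3"), ("r4", "r5")]
print(single_linkage_families(alignments))  # [['r1', 'r2', 'r3'], ['r4', 'r5']]
```

Single linkage merges families through any one shared alignment, so a single spurious hit can join unrelated repeats; that sensitivity is a known property of the method rather than of this sketch.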
Zeng, Lu; Kortschak, R. Daniel; Raison, Joy M.
2018-01-01
Transposable Elements (TEs) are mobile DNA sequences that make up significant fractions of amniote genomes. However, they are difficult to detect and annotate ab initio because of their variable features, lengths and clade-specific variants. We have addressed this problem by refining and developing a Comprehensive ab initio Repeat Pipeline (CARP) to identify and cluster TEs and other repetitive sequences in genome assemblies. The pipeline begins with a pairwise alignment using krishna, a custom aligner. Single linkage clustering is then carried out to produce families of repetitive elements. Consensus sequences are then filtered for protein coding genes and then annotated using Repbase and a custom library of retrovirus and reverse transcriptase sequences. This process yields three types of family: fully annotated, partially annotated and unannotated. Fully annotated families reflect recently diverged/young known TEs present in Repbase. The remaining two types of families contain a mixture of novel TEs and segmental duplications. These can be resolved by aligning these consensus sequences back to the genome to assess copy number vs. length distribution. Our pipeline has three significant advantages compared to other methods for ab initio repeat identification: 1) we generate not only consensus sequences, but keep the genomic intervals for the original aligned sequences, allowing straightforward analysis of evolutionary dynamics, 2) consensus sequences represent low-divergence, recently/currently active TE families, 3) segmental duplications are annotated as a useful by-product. We have compared our ab initio repeat annotations for 7 genome assemblies to other methods and demonstrate that CARP compares favourably with RepeatModeler, the most widely used repeat annotation package. PMID:29538441
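As a rough illustration of the single linkage clustering step mentioned in this abstract, the sketch below groups sequences into families with a union-find structure: any two sequences connected by a chain of pairwise alignment hits end up in the same family. The function name, the input format (pairs of sequence identifiers) and the example identifiers are assumptions made for illustration; this is not CARP's actual implementation.

```python
def single_linkage_families(pairs):
    """Group items into families; two items share a family if a chain
    of pairwise hits links them (single linkage)."""
    parent = {}

    def find(x):
        # Register unseen items as their own root, then walk up to the root.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps trees shallow
            x = parent[x]
        return x

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a  # merge the two families

    families = {}
    for item in list(parent):
        families.setdefault(find(item), set()).add(item)
    # Sorted output makes the result deterministic.
    return sorted(sorted(members) for members in families.values())


# Hypothetical alignment hits between sequence identifiers.
hits = [("s1", "s2"), ("s2", "s3"), ("s4", "s5")]
print(single_linkage_families(hits))
```

Here s1 and s3 land in the same family even though they were never aligned directly, which is the defining behaviour of single linkage.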
Identification of cis-suppression of human disease mutations by comparative genomics.
Jordan, Daniel M; Frangakis, Stephan G; Golzio, Christelle; Cassa, Christopher A; Kurtzberg, Joanne; Davis, Erica E; Sunyaev, Shamil R; Katsanis, Nicholas
2015-08-13
Patterns of amino acid conservation have served as a tool for understanding protein evolution. The same principles have also found broad application in human genomics, driven by the need to interpret the pathogenic potential of variants in patients. Here we performed a systematic comparative genomics analysis of human disease-causing missense variants. We found that an appreciable fraction of disease-causing alleles are fixed in the genomes of other species, suggesting a role for genomic context. We developed a model of genetic interactions that predicts most of these to be simple pairwise compensations. Functional testing of this model on two known human disease genes revealed discrete cis amino acid residues that, although benign on their own, could rescue the human mutations in vivo. This approach was also applied to ab initio gene discovery to support the identification of a de novo disease driver in BTG2 that is subject to protective cis-modification in more than 50 species. Finally, on the basis of our data and models, we developed a computational tool to predict candidate residues subject to compensation. Taken together, our data highlight the importance of cis-genomic context as a contributor to protein evolution; they provide an insight into the complexity of allele effect on phenotype; and they are likely to assist methods for predicting allele pathogenicity.
Deshmukh, Rupesh K; Sonah, Humira; Bélanger, Richard R
2016-01-01
Aquaporins (AQPs) are channel-forming integral membrane proteins that facilitate the movement of water and many other small molecules. Compared to animals, plants contain a much higher number of AQPs in their genome. Homology-based identification of AQPs in sequenced species is feasible because of the high level of conservation of protein sequences across plant species. Genome-wide characterization of AQPs has highlighted several important aspects such as distribution, genetic organization, evolution and conserved features governing solute specificity. From a functional point of view, the understanding of AQP transport system has expanded rapidly with the help of transcriptomics and proteomics data. The efficient analysis of enormous amounts of data generated through omic scale studies has been facilitated through computational advancements. Prediction of protein tertiary structures, pore architecture, cavities, phosphorylation sites, heterodimerization, and co-expression networks has become more sophisticated and accurate with increasing computational tools and pipelines. However, the effectiveness of computational approaches is based on the understanding of physiological and biochemical properties, transport kinetics, solute specificity, molecular interactions, sequence variations, phylogeny and evolution of aquaporins. For this purpose, tools like Xenopus oocyte assays, yeast expression systems, artificial proteoliposomes, and lipid membranes have been efficiently exploited to study the many facets that influence solute transport by AQPs. In the present review, we discuss genome-wide identification of AQPs in plants in relation with recent advancements in analytical tools, and their availability and technological challenges as they apply to AQPs. An exhaustive review of omics resources available for AQP research is also provided in order to optimize their efficient utilization. 
Finally, a detailed catalog of computational tools and analytical pipelines is offered as a resource for AQP research. PMID:28066459
Meng, Jia; Kanzaki, Gregory; Meas, Diane; Lam, Christopher K.; Crummer, Heather; Tain, Justina; Xu, H. Howard
2013-01-01
Regulated antisense RNA (asRNA) expression has been employed successfully in Gram-positive bacteria for genome-wide essential gene identification and drug target determination. However, there have been no published reports describing the application of asRNA gene silencing for comprehensive analyses of essential genes in Gram-negative bacteria. In this study, we report the first genome-wide identification of asRNA constructs for essential genes in Escherichia coli. We screened 250,000 library transformants for conditional growth-inhibitory recombinant clones from two shot-gun genomic libraries of E. coli using a paired-termini expression vector (pHN678). After sequencing plasmid inserts of 675 confirmed inducer-sensitive cell clones, we identified 152 separate asRNA constructs of which 134 inserts came from essential genes while 18 originated from non-essential genes (but share operons with essential genes). Among the 79 individual essential genes silenced by these asRNA constructs, 61 genes (77%) engage in processes related to protein synthesis. The cell-based assays of an asRNA clone targeting fusA (encoding elongation factor G) showed that the induced cells were sensitized 12 fold to fusidic acid, a known specific inhibitor. Our results demonstrate the utility of the paired-termini expression vector and feasibility of large-scale gene silencing in E. coli using regulated asRNA expression. PMID:22268863
The Effects of Signal Erosion and Core Genome Reduction on the Identification of Diagnostic Markers
2016-09-20
diagnostics for the identification of bacterial pathogens. To do this effectively, genomics databases must be comprehensive to identify the...diverse B. pseudomallei/mallei strains were sequenced, assembled, and deposited in public databases (Supplemental Table 1); these genomes were...combined with 160 B. pseudomallei/mallei genome assemblies already in public databases. Most of the genomes (n=779) in this study were
Khachane, Amit; Kumar, Ranjit; Jain, Sanyam; Jain, Samta; Banumathy, Gowrishankar; Singh, Varsha; Nagpal, Saurabh; Tatu, Utpal
2005-01-01
Bioinformatics tools to aid gene and protein sequence analysis have become an integral part of biology in the post-genomic era. Release of the Plasmodium falciparum genome sequence has allowed biologists to define the gene and the predicted protein content as well as their sequences in the parasite. Using pI and molecular weight as characteristics unique to each protein, we have developed a bioinformatics tool to aid identification of proteins from Plasmodium falciparum. The tool makes use of a Virtual 2-DE generated by plotting all of the proteins from the Plasmodium database on a pI versus molecular weight scale. Proteins are identified by comparing the position of migration of desired protein spots from an experimental 2-DE and that on a virtual 2-DE. The procedure has been automated in the form of user-friendly software called "Plasmo2D". The tool can be downloaded from http://144.16.89.25/Plasmo2D.zip.
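The spot-matching idea in this abstract can be sketched as a nearest-neighbour lookup on a pI versus molecular weight plane. Everything below is a hypothetical illustration, not the Plasmo2D code: the protein names and values are invented, and the distance measure (unweighted Euclidean distance on pI and log10 of molecular weight) is an assumption of this sketch.

```python
import math

# Hypothetical (name, pI, molecular weight in kDa) entries standing in for
# predicted proteins plotted on a virtual 2-DE.
EXAMPLE_PROTEINS = [
    ("protA", 5.2, 70.0),
    ("protB", 8.9, 25.0),
    ("protC", 5.5, 68.0),
]


def closest_proteins(spot_pi, spot_mw_kda, proteins, top_n=3):
    """Return the names of the top_n proteins nearest to an experimental spot."""

    def distance(protein):
        _, pi, mw_kda = protein
        # Assumption: compare weights on a log scale and weight both axes equally.
        return math.hypot(pi - spot_pi,
                          math.log10(mw_kda) - math.log10(spot_mw_kda))

    ranked = sorted(proteins, key=distance)
    return [name for name, _, _ in ranked[:top_n]]


# A spot observed at pI 5.3 and about 69 kDa on the experimental gel.
print(closest_proteins(5.3, 69.0, EXAMPLE_PROTEINS, top_n=2))
```

In practice the two axes would need a scaling that reflects the resolution of the gel in each dimension; equal weighting is used here only to keep the sketch short.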
DOE Office of Scientific and Technical Information (OSTI.GOV)
Pakrasi, Himadri
The overall objective of this project was to use a systems biology approach to evaluate the potentials of a number of cyanobacterial strains for photobiological production of advanced biofuels and/or their chemical precursors. Cyanobacteria are oxygen evolving photosynthetic prokaryotes. Among them, certain unicellular species such as Cyanothece can also fix N2, a process that is exquisitely sensitive to oxygen. To accommodate such incompatible processes in a single cell, Cyanothece produces oxygen during the day, and creates an O2-limited intracellular environment during the night to perform O2-sensitive processes such as N2-fixation. Thus, Cyanothece cells are natural bioreactors for the storage of captured solar energy with subsequent utilization at a different time during a diurnal cycle. Our studies include the identification of a novel, fast-growing, mixotrophic, transformable cyanobacterium. This strain has been sequenced and will be made available to the community. In addition, we have developed genome-scale models for a family of cyanobacteria to assess their metabolic repertoire. Furthermore, we developed a method for rapid construction of metabolic models using multiple annotation sources and a metabolic model of a related organism. This method will allow rapid annotation and screening of potential phenotypes based on the newly available genome sequences of many organisms.
Elkins, C A; Kotewicz, M L; Jackson, S A; Lacher, D W; Abu-Ali, G S; Patel, I R
2013-01-01
Modern risk control and food safety practices involving food-borne bacterial pathogens are benefiting from new genomic technologies for rapid, yet highly specific, strain characterisations. Within the United States Food and Drug Administration (USFDA) Center for Food Safety and Applied Nutrition (CFSAN), optical genome mapping and DNA microarray genotyping have been used for several years to quickly assess genomic architecture and gene content, respectively, for outbreak strain subtyping and to enhance retrospective trace-back analyses. The application and relative utility of each method varies with outbreak scenario and the suspect pathogen, with comparative analytical power enhanced by database scale and depth. Integration of these two technologies allows high-resolution scrutiny of the genomic landscapes of enteric food-borne pathogens with notable examples including Shiga toxin-producing Escherichia coli (STEC) and Salmonella enterica serovars from a variety of food commodities. Moreover, the recent application of whole genome sequencing technologies to food-borne pathogen outbreaks and surveillance has enhanced resolution to the single nucleotide scale. This new wealth of sequence data will support more refined next-generation custom microarray designs, targeted re-sequencing and "genomic signature recognition" approaches involving a combination of genes and single nucleotide polymorphism detection to distil strain-specific fingerprinting to a minimised scale. This paper examines the utility of microarrays and optical mapping in analysing outbreaks, reviews best practices and the limits of these technologies for pathogen differentiation, and it considers future integration with whole genome sequencing efforts.
The repetitive landscape of the chicken genome.
Wicker, Thomas; Robertson, Jon S; Schulze, Stefan R; Feltus, F Alex; Magrini, Vincent; Morrison, Jason A; Mardis, Elaine R; Wilson, Richard K; Peterson, Daniel G; Paterson, Andrew H; Ivarie, Robert
2005-01-01
Cot-based cloning and sequencing (CBCS) is a powerful tool for isolating and characterizing the various repetitive components of any genome, combining the established principles of DNA reassociation kinetics with high-throughput sequencing. CBCS was used to generate sequence libraries representing the high, middle, and low-copy fractions of the chicken genome. Sequencing high-copy DNA of chicken to about 2.7× coverage of its estimated sequence complexity led to the initial identification of several new repeat families, which were then used for a survey of the newly released first draft of the complete chicken genome. The analysis provided insight into the diversity and biology of known repeat structures such as CR1 and CNM, for which only limited sequence data had previously been available. Cot sequence data also resulted in the identification of four novel repeats (Birddawg, Hitchcock, Kronos, and Soprano), two new subfamilies of CR1 repeats, and many elements absent from the chicken genome assembly. Multiple autonomous elements were found for a novel Mariner-like transposon, Galluhop, in addition to nonautonomous deletion derivatives. Phylogenetic analysis of the high-copy repeats CR1, Galluhop, and Birddawg provided insight into two distinct genome dispersion strategies. This study also exemplifies the power of the CBCS method to create representative databases for the repetitive fractions of genomes for which only limited sequence data is available. PMID:15256510
Gugiu, Gabriel B
2017-01-01
Lipidomics refers to the large-scale study of lipids in biological systems (Wenk, Nat Rev Drug Discov 4(7):594-610, 2005; Rolim et al., Gene 554(2):131-139, 2015). From a mass spectrometric point of view, by lipidomics we understand targeted or untargeted mass spectrometric analysis of lipids using either liquid chromatography (LC) (Castro-Perez et al., J Proteome Res 9(5):2377-2389, 2010) or shotgun (Han and Gross, Mass Spectrom Rev 24(3):367-412, 2005) approaches coupled with tandem mass spectrometry. This chapter describes the former methodology, which is rapidly becoming the preferred method for lipid identification owing to similarities with established omics workflows, such as proteomics (Washburn et al., Nat Biotechnol 19(3):242-247, 2001) or genomics (Yadav, J Biomol Tech: JBT 18(5):277, 2007). The workflow described consists of lipid extraction using a modified Bligh and Dyer method (Bligh and Dyer, Can J Biochem Physiol 37(8):911-917, 1959), ultra high pressure liquid chromatography fractionation of lipid samples on a reverse phase C18 column, followed by tandem mass spectrometric analysis and in silico database search for lipid identification based on MSMS spectrum matching (Kind et al., Nat Methods 10(8):755-758, 2013; Yamada et al., J Chromatogr A 1292:211-218, 2013; Taguchi and Ishikawa, J Chromatogr A 1217(25):4229-4239, 2010; Peake et al., Thermoscientifices 1-3, 2015) and accurate mass of parent ion (Sud et al., Nucleic Acids Res 35(database issue):D527-D532, 2007; Wishart et al., Nucleic Acids Res 35(database):D521-D526, 2007).
Tran, Phuong N; Savka, Michael A; Gan, Han Ming
2017-01-01
The genus Pseudomonas has one of the largest diversity of species within the Bacteria kingdom. To date, its taxonomy is still being revised and updated. Due to the non-standardized procedure and ambiguous thresholds at species level, largely based on 16S rRNA gene or conventional biochemical assay, species identification of publicly available Pseudomonas genomes remains questionable. In this study, we performed a large-scale analysis of all Pseudomonas genomes with species designation (excluding the well-defined P. aeruginosa) and re-evaluated their taxonomic assignment via in silico genome-genome hybridization and/or genetic comparison with valid type species. Three-hundred and seventy-three pseudomonad genomes were analyzed and subsequently clustered into 145 distinct genospecies. We detected 207 erroneous labels and corrected 43 to the proper species based on Average Nucleotide Identity Multilocus Sequence Typing (MLST) sequence similarity to the type strain. Surprisingly, more than half of the genomes initially designated as Pseudomonas syringae and Pseudomonas fluorescens should be classified either to a previously described species or to a new genospecies. Notably, high pairwise average nucleotide identity (>95%) indicating species-level similarity was observed between P. synxantha-P. libanensis, P. psychrotolerans-P. oryzihabitans, and P. kilonensis-P. brassicacearum, that were previously differentiated based on conventional biochemical tests and/or genome-genome hybridization techniques.
Sadhukhan, Priyanka P; Raghunathan, Anu
2014-01-01
Genome Scale Metabolic Modeling methods represent one way to compute whole cell function starting from the genome sequence of an organism and contribute towards understanding and predicting the genotype-phenotype relationship. About 80 models spanning all the kingdoms of life from archaea to eukaryotes have been built to date and used to interrogate cell phenotype under varying conditions. These models have been used not only to understand the flux distribution in evolutionary conserved pathways like glycolysis and the Krebs cycle but also in applications ranging from value added product formation in Escherichia coli to predicting inborn errors of Homo sapiens metabolism. This chapter describes a protocol that delineates the process of genome scale metabolic modeling for analysing host-pathogen behavior and interaction using flux balance analysis (FBA). The steps discussed in the process include (1) reconstruction of a metabolic network from the genome sequence, (2) its representation in a precise mathematical framework, (3) its translation to a model, and (4) the analysis using linear algebra and optimization. The methods for biological interpretations of computed cell phenotypes in the context of individual host and pathogen models and their integration are also discussed.
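For orientation, the optimization in step (4) is commonly written as a linear program over reaction fluxes. The notation below is the standard textbook form of flux balance analysis and is not taken from the chapter itself; the symbols S, v, c and the bound vectors are introduced here for illustration.

```latex
\begin{aligned}
\max_{v}\quad & c^{\top} v \\
\text{subject to}\quad & S\,v = 0, \\
& v^{\mathrm{lb}} \le v \le v^{\mathrm{ub}}
\end{aligned}
```

Here S is the m-by-n stoichiometric matrix (m metabolites, n reactions), v is the vector of reaction fluxes, c selects the objective (often a biomass reaction), and the bounds encode reaction reversibility and nutrient uptake limits. The constraint S v = 0 expresses the steady-state assumption that no internal metabolite accumulates.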
Methods, Tools and Current Perspectives in Proteogenomics
Ruggles, Kelly V.; Krug, Karsten; Wang, Xiaojing; Clauser, Karl R.; Wang, Jing; Payne, Samuel H.; Fenyö, David; Zhang, Bing; Mani, D. R.
2017-01-01
With combined technological advancements in high-throughput next-generation sequencing and deep mass spectrometry-based proteomics, proteogenomics, i.e. the integrative analysis of proteomic and genomic data, has emerged as a new research field. Early efforts in the field were focused on improving protein identification using sample-specific genomic and transcriptomic sequencing data. More recently, integrative analysis of quantitative measurements from genomic and proteomic studies has yielded novel insights into gene expression regulation, cell signaling, and disease. Many methods and tools have been developed or adapted to enable an array of integrative proteogenomic approaches and in this article, we systematically classify published methods and tools into four major categories: (1) Sequence-centric proteogenomics; (2) Analysis of proteogenomic relationships; (3) Integrative modeling of proteogenomic data; and (4) Data sharing and visualization. We provide a comprehensive review of methods and available tools in each category and highlight their typical applications. PMID:28456751
Exploiting proteomic data for genome annotation and gene model validation in Aspergillus niger
Wright, James C; Sugden, Deana; Francis-McIntyre, Sue; Riba-Garcia, Isabel; Gaskell, Simon J; Grigoriev, Igor V; Baker, Scott E; Beynon, Robert J; Hubbard, Simon J
2009-01-01
Background: Proteomic data is a potentially rich, but arguably unexploited, data source for genome annotation. Peptide identifications from tandem mass spectrometry provide prima facie evidence for gene predictions and can discriminate over a set of candidate gene models. Here we apply this to the recently sequenced Aspergillus niger fungal genome from the Joint Genome Institutes (JGI) and another predicted protein set from another A. niger sequence. Tandem mass spectra (MS/MS) were acquired from 1d gel electrophoresis bands and searched against all available gene models using Average Peptide Scoring (APS) and reverse database searching to produce confident identifications at an acceptable false discovery rate (FDR). Results: 405 identified peptide sequences were mapped to 214 different A. niger genomic loci to which 4093 predicted gene models clustered, 2872 of which contained the mapped peptides. Interestingly, 13 (6%) of these loci either had no preferred predicted gene model or the genome annotators' chosen "best" model for that genomic locus was not found to be the most parsimonious match to the identified peptides. The peptides identified also boosted confidence in predicted gene structures spanning 54 introns from different gene models. Conclusion: This work highlights the potential of integrating experimental proteomics data into genomic annotation pipelines much as expressed sequence tag (EST) data has been. A comparison of the published genome from another strain of A. niger sequenced by DSM showed that a number of the gene models or proteins with proteomics evidence did not occur in both genomes, further highlighting the utility of the method. PMID:19193216
Enabling comparative modeling of closely related genomes: Example genus Brucella
Faria, José P.; Edirisinghe, Janaka N.; Davis, James J.; ...
2014-03-08
For many scientific applications, it is highly desirable to be able to compare metabolic models of closely related genomes. In this study, we attempt to raise awareness of the fact that taking annotated genomes from public repositories and using them for metabolic model reconstructions is far from being trivial due to annotation inconsistencies. We are proposing a protocol for comparative analysis of metabolic models on closely related genomes, using fifteen strains of genus Brucella, which contains pathogens of both humans and livestock. This study led to the identification and subsequent correction of inconsistent annotations in the SEED database, as well as the identification of 31 biochemical reactions that are common to Brucella, which were not originally identified by automated metabolic reconstructions. We are currently implementing this protocol for improving automated annotations within the SEED database and these improvements have been propagated into PATRIC, Model-SEED, KBase and RAST. This method is an enabling step for the future creation of consistent annotation systems and high-quality model reconstructions that will support the prediction of accurate phenotypes such as pathogenicity, media requirements or type of respiration.
Microfluidic droplet enrichment for targeted sequencing
Eastburn, Dennis J.; Huang, Yong; Pellegrino, Maurizio; Sciambi, Adam; Ptáček, Louis J.; Abate, Adam R.
2015-01-01
Targeted sequence enrichment enables better identification of genetic variation by providing increased sequencing coverage for genomic regions of interest. Here, we report the development of a new target enrichment technology that is highly differentiated from other approaches currently in use. Our method, MESA (Microfluidic droplet Enrichment for Sequence Analysis), isolates genomic DNA fragments in microfluidic droplets and performs TaqMan PCR reactions to identify droplets containing a desired target sequence. The TaqMan positive droplets are subsequently recovered via dielectrophoretic sorting, and the TaqMan amplicons are removed enzymatically prior to sequencing. We demonstrated the utility of this approach by generating an average 31.6-fold sequence enrichment across 250 kb of targeted genomic DNA from five unique genomic loci. Significantly, this enrichment enabled a more comprehensive identification of genetic polymorphisms within the targeted loci. MESA requires low amounts of input DNA, minimal prior locus sequence information and enriches the target region without PCR bias or artifacts. These features make it well suited for the study of genetic variation in a number of research and diagnostic applications. PMID:25873629
NIBBS-search for fast and accurate prediction of phenotype-biased metabolic systems.
Schmidt, Matthew C; Rocha, Andrea M; Padmanabhan, Kanchana; Shpanskaya, Yekaterina; Banfield, Jill; Scott, Kathleen; Mihelcic, James R; Samatova, Nagiza F
2012-01-01
Understanding of genotype-phenotype associations is important not only for furthering our knowledge on internal cellular processes, but also essential for providing the foundation necessary for genetic engineering of microorganisms for industrial use (e.g., production of bioenergy or biofuels). However, genotype-phenotype associations alone do not provide enough information to alter an organism's genome to either suppress or exhibit a phenotype. It is important to look at the phenotype-related genes in the context of the genome-scale network to understand how the genes interact with other genes in the organism. Identification of metabolic subsystems involved in the expression of the phenotype is one way of placing the phenotype-related genes in the context of the entire network. A metabolic system refers to a metabolic network subgraph; nodes are compounds and edge labels are the enzymes that catalyze the reaction. The metabolic subsystem could be part of a single metabolic pathway or span parts of multiple pathways. Arguably, comparative genome-scale metabolic network analysis is a promising strategy to identify these phenotype-related metabolic subsystems. Network Instance-Based Biased Subgraph Search (NIBBS) is a graph-theoretic method for genome-scale metabolic network comparative analysis that can identify metabolic systems that are statistically biased toward phenotype-expressing organismal networks. We set up experiments with target phenotypes like hydrogen production, TCA expression, and acid-tolerance. We show via extensive literature search that some of the resulting metabolic subsystems are indeed phenotype-related and formulate hypotheses for other systems in terms of their role in phenotype expression. NIBBS is also orders of magnitude faster than MULE, one of the most efficient maximal frequent subgraph mining algorithms that could be adjusted for this problem. 
Also, the set of phenotype-biased metabolic systems output by NIBBS comes very close to the set of phenotype-biased subgraphs output by an exact maximally-biased subgraph enumeration algorithm ( MBS-Enum ). The code (NIBBS and the module to visualize the identified subsystems) is available at http://freescience.org/cs/NIBBS.
NIBBS-Search for Fast and Accurate Prediction of Phenotype-Biased Metabolic Systems
Padmanabhan, Kanchana; Shpanskaya, Yekaterina; Banfield, Jill; Scott, Kathleen; Mihelcic, James R.; Samatova, Nagiza F.
2012-01-01
Understanding genotype-phenotype associations is important not only for furthering our knowledge of internal cellular processes, but also for providing the foundation necessary for genetic engineering of microorganisms for industrial use (e.g., production of bioenergy or biofuels). However, genotype-phenotype associations alone do not provide enough information to alter an organism's genome to either suppress or exhibit a phenotype. It is important to look at the phenotype-related genes in the context of the genome-scale network to understand how the genes interact with other genes in the organism. Identification of metabolic subsystems involved in the expression of the phenotype is one way of placing the phenotype-related genes in the context of the entire network. A metabolic system refers to a metabolic network subgraph in which nodes are compounds and edge labels are the enzymes that catalyze the reactions. The metabolic subsystem could be part of a single metabolic pathway or span parts of multiple pathways. Arguably, comparative genome-scale metabolic network analysis is a promising strategy to identify these phenotype-related metabolic subsystems. Network Instance-Based Biased Subgraph Search (NIBBS) is a graph-theoretic method for genome-scale metabolic network comparative analysis that can identify metabolic systems that are statistically biased toward phenotype-expressing organismal networks. We set up experiments with target phenotypes such as hydrogen production, TCA expression, and acid tolerance. We show via extensive literature search that some of the resulting metabolic subsystems are indeed phenotype-related and formulate hypotheses for other systems in terms of their role in phenotype expression. NIBBS is also orders of magnitude faster than MULE, one of the most efficient maximal frequent subgraph mining algorithms that could be adjusted for this problem. Also, the set of phenotype-biased metabolic systems output by NIBBS comes very close to the set of phenotype-biased subgraphs output by an exact maximally-biased subgraph enumeration algorithm (MBS-Enum). The code (NIBBS and the module to visualize the identified subsystems) is available at http://freescience.org/cs/NIBBS. PMID:22589706
Endara, María-José; Coley, Phyllis D; Wiggins, Natasha L; Forrister, Dale L; Younkin, Gordon C; Nicholls, James A; Pennington, R Toby; Dexter, Kyle G; Kidner, Catherine A; Stone, Graham N; Kursar, Thomas A
2018-04-01
The need for species identification and taxonomic discovery has led to the development of innovative technologies for large-scale plant identification. DNA barcoding has been useful, but fails to distinguish among many species in species-rich plant genera, particularly in tropical regions. Here, we show that chemical fingerprinting, or 'chemocoding', has great potential for plant identification in challenging tropical biomes. Using untargeted metabolomics in combination with multivariate analysis, we constructed species-level fingerprints, which we define as chemocoding. We evaluated the utility of chemocoding with species that were defined morphologically and subject to next-generation DNA sequencing in the diverse and recently radiated neotropical genus Inga (Leguminosae), both at single study sites and across broad geographic scales. Our results show that chemocoding is a robust method for distinguishing morphologically similar species at a single site and for identifying widespread species across continental-scale ranges. Given that species are the fundamental unit of analysis for conservation and biodiversity research, the development of accurate identification methods is essential. We suggest that chemocoding will be a valuable additional source of data for a quick identification of plants, especially for groups where other methods fall short. © 2018 The Authors. New Phytologist © 2018 New Phytologist Trust.
The Role of Genome Accessibility in Transcription Factor Binding in Bacteria.
Gomes, Antonio L C; Wang, Harris H
2016-04-01
ChIP-seq enables genome-scale identification of regulatory regions that govern gene expression. However, the biological insights generated from ChIP-seq analysis have been limited to predictions of binding sites and cooperative interactions. Furthermore, ChIP-seq data often poorly correlate with in vitro measurements or predicted motifs, highlighting that binding affinity alone is insufficient to explain transcription factor (TF)-binding in vivo. One possibility is that binding sites are not equally accessible across the genome. A more comprehensive biophysical representation of TF-binding is required to improve our ability to understand, predict, and alter gene expression. Here, we show that genome accessibility is a key parameter that impacts TF-binding in bacteria. We developed a thermodynamic model that parameterizes ChIP-seq coverage in terms of genome accessibility and binding affinity. The role of genome accessibility is validated using a large-scale ChIP-seq dataset of the M. tuberculosis regulatory network. We find that accounting for genome accessibility led to a model that explains 63% of the ChIP-seq profile variance, while a model based in motif score alone explains only 35% of the variance. Moreover, our framework enables de novo ChIP-seq peak prediction and is useful for inferring TF-binding peaks in new experimental conditions by reducing the need for additional experiments. We observe that the genome is more accessible in intergenic regions, and that increased accessibility is positively correlated with gene expression and anti-correlated with distance to the origin of replication. Our biophysically motivated model provides a more comprehensive description of TF-binding in vivo from first principles towards a better representation of gene regulation in silico, with promising applications in systems biology.
Zhu, Zhou; Ihle, Nathan T; Rejto, Paul A; Zarrinkar, Patrick P
2016-06-13
Genome-scale functional genomic screens across large cell line panels provide a rich resource for discovering tumor vulnerabilities that can lead to the next generation of targeted therapies. Analysis of these data has typically focused on identifying genes whose knockdown enhances response in various pre-defined genetic contexts, which are limited by biological complexities as well as the incompleteness of our knowledge. We thus introduce a complementary data mining strategy to identify genes with exceptional sensitivity in subsets, or outlier groups, of cell lines, allowing an unbiased analysis without any a priori assumption about the underlying biology of dependency. Genes with outlier features are strongly and specifically enriched with those known to be associated with cancer and relevant biological processes, despite no a priori knowledge being used to drive the analysis. Identification of exceptional responders (outliers) may lead not only to new candidates for therapeutic intervention, but also to tumor indications and response biomarkers for companion precision medicine strategies. Several tumor suppressors have an outlier sensitivity pattern, supporting and generalizing the notion that tumor suppressors can play context-dependent oncogenic roles. The novel application of outlier analysis described here demonstrates a systematic and data-driven analytical strategy to decipher large-scale functional genomic data for oncology target and precision medicine discoveries.
Molecular barcodes detect redundancy and contamination in hairpin-bisulfite PCR
Miner, Brooks E.; Stöger, Reinhard J.; Burden, Alice F.; Laird, Charles D.; Hansen, R. Scott
2004-01-01
PCR amplification of limited amounts of DNA template carries an increased risk of product redundancy and contamination. We use molecular barcoding to label each genomic DNA template with an individual sequence tag prior to PCR amplification. In addition, we include molecular ‘batch-stamps’ that effectively label each genomic template with a sample ID and analysis date. This highly sensitive method identifies redundant and contaminant sequences and serves as a reliable method for positive identification of desired sequences; we can therefore capture accurately the genomic template diversity in the sample analyzed. Although our application described here involves the use of hairpin-bisulfite PCR for amplification of double-stranded DNA, the method can readily be adapted to single-strand PCR. Useful applications will include analyses of limited template DNA for biomedical, ancient DNA and forensic purposes. PMID:15459281
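The barcode and batch-stamp logic described above can be illustrated with a minimal Python sketch. The read representation (a tuple of barcode, batch-stamp, and sequence), the function name `classify_reads`, and the example values are all hypothetical; the abstract does not specify a data format, so this is only one plausible reading of the scheme.

```python
def classify_reads(reads, expected_batch):
    """Sort sequenced clones into unique, redundant, and contaminant groups.

    reads: iterable of (barcode, batch_stamp, sequence) tuples (hypothetical format).
    expected_batch: the batch-stamp assigned to the sample being analyzed.
    """
    seen_barcodes = set()
    result = {"unique": [], "redundant": [], "contaminant": []}
    for barcode, batch_stamp, sequence in reads:
        if batch_stamp != expected_batch:
            # Wrong batch-stamp: the template came from another sample or date.
            result["contaminant"].append(sequence)
        elif barcode in seen_barcodes:
            # Same barcode seen again: a PCR copy of an already counted template.
            result["redundant"].append(sequence)
        else:
            seen_barcodes.add(barcode)
            result["unique"].append(sequence)
    return result
```

Only the `unique` group would then be counted when estimating template diversity in the sample.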
Direct detection of methylation in genomic DNA
Bart, A.; van Passel, M. W. J.; van Amsterdam, K.; van der Ende, A.
2005-01-01
The identification of methylated sites on bacterial genomic DNA would be a useful tool to study the major roles of DNA methylation in prokaryotes: distinction of self and nonself DNA, direction of post-replicative mismatch repair, control of DNA replication and cell cycle, and regulation of gene expression. Three types of methylated nucleobases are known: N6-methyladenine, 5-methylcytosine and N4-methylcytosine. The aim of this study was to develop a method to detect all three types of DNA methylation in complete genomic DNA. It was previously shown that N6-methyladenine and 5-methylcytosine in plasmid and viral DNA can be detected by intersequence trace comparison of methylated and unmethylated DNA. We extended this method to include N4-methylcytosine detection in both in vitro and in vivo methylated DNA. Furthermore, application of intersequence trace comparison was extended to bacterial genomic DNA. Finally, we present evidence that intrasequence comparison suffices to detect methylated sites in genomic DNA. In conclusion, we present a method to detect all three natural types of DNA methylation in bacterial genomic DNA. This provides the possibility to define the complete methylome of any prokaryote. PMID:16091626
Methanococcus jannaschii genome: revisited
NASA Technical Reports Server (NTRS)
Kyrpides, N. C.; Olsen, G. J.; Klenk, H. P.; White, O.; Woese, C. R.
1996-01-01
Analysis of genomic sequences is necessarily an ongoing process. Initial gene assignments tend (wisely) to be on the conservative side (Venter, 1996). The analysis of the genome then grows in an iterative fashion as additional data and more sophisticated algorithms are brought to bear on the data. The present report is an emendation of the original gene list of Methanococcus jannaschii (Bult et al., 1996). By using a somewhat more updated database and more relaxed (and operator-intensive) pattern matching methods, we were able to add significantly to, and in a few cases amend, the gene identification table originally published by Bult et al. (1996).
Genomics screens for metastasis genes
Yan, Jinchun; Huang, Qihong
2014-01-01
Metastasis is responsible for most cancer mortality. The process of metastasis is complex, requiring the coordinated expression and fine regulation of many genes in multiple pathways in both the tumor and host tissues. Identification and characterization of the genetic programs that regulate metastasis is critical to understanding the metastatic process and discovering molecular targets for the prevention and treatment of metastasis. Genomic approaches and functional genomic analyses can systemically discover metastasis genes. In this review, we summarize the genetic tools and methods that have been used to identify and characterize the genes that play critical roles in metastasis. PMID:22684367
NASA Astrophysics Data System (ADS)
Champeimont, Raphaël; Laine, Elodie; Hu, Shuang-Wei; Penin, Francois; Carbone, Alessandra
2016-05-01
A novel computational approach of coevolution analysis allowed us to reconstruct the protein-protein interaction network of the Hepatitis C Virus (HCV) at the residue resolution. For the first time, coevolution analysis of an entire viral genome was realized, based on a limited set of protein sequences with high sequence identity within genotypes. The identified coevolving residues constitute highly relevant predictions of protein-protein interactions for further experimental identification of HCV protein complexes. The method can be used to analyse other viral genomes and to predict the associated protein interaction networks.
Kelly, Benjamin J; Fitch, James R; Hu, Yangqiu; Corsmeier, Donald J; Zhong, Huachun; Wetzel, Amy N; Nordquist, Russell D; Newsom, David L; White, Peter
2015-01-20
While advances in genome sequencing technology make population-scale genomics a possibility, current approaches for analysis of these data rely upon parallelization strategies that have limited scalability, are complex to implement, and lack reproducibility. Churchill, a balanced regional parallelization strategy, overcomes these challenges, fully automating the multiple steps required to go from raw sequencing reads to variant discovery. Through implementation of novel deterministic parallelization techniques, Churchill allows computationally efficient analysis of a high-depth whole genome sample in less than two hours. The method is highly scalable, enabling full analysis of the 1000 Genomes raw sequence dataset in a week using cloud resources. Availability: http://churchill.nchri.org/.
BAC sequencing using pooled methods.
Saski, Christopher A; Feltus, F Alex; Parida, Laxmi; Haiminen, Niina
2015-01-01
Shotgun sequencing and assembly of a large, complex genome can be expensive, and accurately reconstructing the true genome sequence can be challenging. Repetitive DNA arrays, paralogous sequences, polyploidy, and heterozygosity are the main factors that plague de novo genome sequencing projects, which typically result in highly fragmented assemblies from which it is difficult to extract biological meaning. Targeted, sub-genomic sequencing offers complexity reduction by removing distal segments of the genome and a systematic mechanism for exploring prioritized genomic content through BAC sequencing. If one isolates and sequences the genome fraction that encodes the relevant biological information, then it is possible to reduce overall sequencing costs and efforts that target a genomic segment. This chapter describes the sub-genome assembly protocol for an organism based upon a BAC tiling path derived from a genome-scale physical map or from fine mapping using BACs to target sub-genomic regions. Methods that are described include BAC isolation and mapping, DNA sequencing, and sequence assembly.
Yap, Kien-Pong; Ho, Wing S; Gan, Han M; Chai, Lay C; Thong, Kwai L
2016-01-01
Typhoid fever, caused by Salmonella enterica serovar Typhi, remains an important public health burden in Southeast Asia and other endemic countries. Various genotyping methods have been applied to study the genetic variations of this human-restricted pathogen. Multilocus sequence typing (MLST) is one of the widely accepted methods, and recently, there is a growing interest in the re-application of MLST in the post-genomic era. In this study, we provide the global MLST distribution of S. Typhi utilizing both publicly available 1,826 S. Typhi genome sequences in addition to performing conventional MLST on S. Typhi strains isolated from various endemic regions spanning over a century. Our global MLST analysis confirms the predominance of two sequence types (ST1 and ST2) co-existing in the endemic regions. Interestingly, S. Typhi strains with ST8 are currently confined within the African continent. Comparative genomic analyses of ST8 and other rare STs with genomes of ST1/ST2 revealed unique mutations in important virulence genes such as flhB, sipC, and tviD that may explain the variations that differentiate between seemingly successful (widespread) and unsuccessful (poor dissemination) S. Typhi populations. Large scale whole-genome phylogeny demonstrated evidence of phylogeographical structuring and showed that ST8 may have diverged from the earlier ancestral population of ST1 and ST2, which later lost some of its fitness advantages, leading to poor worldwide dissemination. In response to the unprecedented increase in genomic data, this study demonstrates and highlights the utility of large-scale genome-based MLST as a quick and effective approach to narrow the scope of in-depth comparative genomic analysis and consequently provide new insights into the fine scale of pathogen evolution and population structure.
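Genome-based MLST, as used above, amounts to looking up an isolate's allele profile in a table of known sequence types. The following Python sketch shows that lookup under stated assumptions: the locus names follow the seven-gene Salmonella scheme as commonly described, while the allele numbers and the labels `ST-A` and `ST-B` are invented placeholders and are not real database values for ST1, ST2, or ST8.

```python
# Loci assumed for illustration (seven housekeeping genes).
LOCI = ("aroC", "dnaN", "hemD", "hisD", "purE", "sucA", "thrA")

# Placeholder profile table: allele numbers and ST labels are hypothetical.
PROFILES = {
    (1, 1, 1, 1, 1, 1, 1): "ST-A",
    (1, 1, 2, 1, 1, 1, 1): "ST-B",
}

def assign_sequence_type(alleles):
    """Map a dict of locus -> allele number to a sequence type label.

    Returns "novel" when the profile is not present in the table.
    """
    profile = tuple(alleles[locus] for locus in LOCI)
    return PROFILES.get(profile, "novel")
```

In a genome-based workflow the allele numbers would first be called from each assembled genome; that step is omitted here.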
2013-01-01
Background: Significant efforts have been made to address the problem of identifying short genes in prokaryotic genomes. However, most known methods are not effective in detecting short genes. Because of the limited information contained in short DNA sequences, it is very difficult to accurately distinguish between protein-coding and non-coding sequences in prokaryotic genomes. We have developed a new Iteratively Adaptive Sparse Partial Least Squares (IASPLS) algorithm as the classifier to improve the accuracy of the identification process. Results: For testing, we chose the short coding and non-coding sequences from seven prokaryotic organisms. We used seven feature sets (including GC content, Z-curve, etc.) of short genes. In comparison with the GeneMarkS, Metagene, Orphelia, and Heuristic Approach methods, our model achieved the best prediction performance in identification of short prokaryotic genes. Even when we focused on the very short length group ([60–100 nt)), our model provided sensitivity as high as 83.44% and specificity as high as 92.8%. These values are two or three times higher than those of three of the other methods, while Metagene fails to recognize genes in this length range. The experiments also proved that IASPLS can improve the identification accuracy in comparison with other widely used classifiers, i.e., Logistic, Random Forest (RF) and K nearest neighbors (KNN). IASPLS improved accuracy by 5.90% or more in comparison with the other methods. In addition to the improvements in accuracy, IASPLS required ten times less computer time than KNN or RF. Conclusions: We conclude that our method is preferable for application as an automated method of short gene classification. Its linearity and easily optimized parameters make it practicable for predicting short genes of newly-sequenced or under-studied species.
Reviewers: This article was reviewed by Alexey Kondrashov, Rajeev Azad (nominated by Dr J. Peter Gogarten) and Yuriy Fofanov (nominated by Dr Janet Siefert). PMID:24067167
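Two of the feature sets named above, GC content and the Z-curve, are simple to compute. The Python sketch below uses the basic whole-sequence Z-curve definition (purine/pyrimidine, amino/keto, and weak/strong hydrogen-bond contrasts); the function names are hypothetical, and the exact feature variants used by the authors (for example per-codon-position or normalized forms) are not stated in the abstract.

```python
def gc_content(seq):
    """Fraction of G and C bases in a DNA sequence (0.0 for an empty string)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)

def z_curve(seq):
    """Whole-sequence Z-curve coordinates (x, y, z) from base counts."""
    seq = seq.upper()
    a, c, g, t = (seq.count(base) for base in "ACGT")
    x = (a + g) - (c + t)  # purines versus pyrimidines
    y = (a + c) - (g + t)  # amino versus keto bases
    z = (a + t) - (g + c)  # weak versus strong hydrogen bonding
    return x, y, z
```

Feature vectors of this kind would then be passed to a classifier; the IASPLS classifier itself is not reproduced here.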
Pattemore, Julie A; Hane, James K; Williams, Angela H; Wilson, Bree A L; Stodart, Ben J; Ash, Gavin J
2014-08-07
Metarhizium anisopliae is an important fungal biocontrol agent of insect pests of agricultural crops. Genomics can aid the successful commercialization of biopesticides by identification of key genes differentiating closely related species, selection of virulent microbial isolates which are amenable to industrial scale production and formulation and through the reduction of phenotypic variability. The genome of Metarhizium isolate ARSEF23 was recently published as a model for M. anisopliae; however, phylogenetic analysis has since re-classified this isolate as M. robertsii. We present a new annotated genome sequence of M. anisopliae (isolate Ma69) and whole genome comparison to M. robertsii (ARSEF23) and M. acridum (CQMa 102). Whole genome analysis of M. anisopliae indicates significant macrosynteny with M. robertsii but with some large genomic inversions. In comparison to M. acridum, the genome of M. anisopliae shares lower sequence homology. While alignments overall are co-linear, the genome of M. acridum is not contiguous enough to conclusively observe macrosynteny. Mating type gene analysis revealed both MAT1-1 and MAT1-2 genes present in M. anisopliae suggesting putative homothallism, despite having no known teleomorph, in contrast with the putatively heterothallic M. acridum isolate CQMa 102 (MAT1-2) and M. robertsii isolate ARSEF23 (altered MAT1-1). Repetitive DNA and RIP analysis revealed M. acridum to have twice the repetitive content of the other two species and M. anisopliae to be five times more RIP affected than M. robertsii. We also present an initial bioinformatic survey of candidate pathogenicity genes in M. anisopliae. The annotated genome of M. anisopliae is an important resource for the identification of virulence genes specific to M. anisopliae and development of species- and strain-specific assays. New insight into the possibility of homothallism and RIP affectedness has important implications for the development of M. anisopliae as a biopesticide as it may indicate the potential for greater inherent diversity in this species than the other species. This could present opportunities to select isolates with unique combinations of pathogenicity factors, or it may point to instability in the species, a negative attribute in a biopesticide.
Aggarwal, Gautam; Worthey, E A; McDonagh, Paul D; Myler, Peter J
2003-06-07
Seattle Biomedical Research Institute (SBRI), as part of the Leishmania Genome Network (LGN), is sequencing chromosomes of the trypanosomatid protozoan species Leishmania major. At SBRI, chromosomal sequence is annotated using a combination of trained and untrained non-consensus gene-prediction algorithms with ARTEMIS, an annotation platform with rich and user-friendly interfaces. Here we describe a methodology used to import results from three different protein-coding gene-prediction algorithms (GLIMMER, TESTCODE and GENESCAN) into the ARTEMIS sequence viewer and annotation tool. Comparison of these methods, along with the CODONUSAGE algorithm built into ARTEMIS, shows the importance of combining methods to more accurately annotate the L. major genomic sequence. An improvised and powerful tool for gene prediction has been developed by importing data from widely-used algorithms into an existing annotation platform. This approach is especially fruitful in the Leishmania genome project, where there is a large proportion of novel genes requiring manual annotation.
Fu, Liezhen; Wen, Luan; Luu, Nga; Shi, Yun-Bo
2016-01-01
Genome editing with designer nucleases such as TALEN and CRISPR/Cas enzymes has broad applications. Delivery of these designer nucleases into organisms induces various genetic mutations including deletions, insertions and nucleotide substitutions. Characterizing those mutations is critical for evaluating the efficacy and specificity of targeted genome editing. While a number of methods have been developed to identify the mutations, none other than sequencing allows the identification of the most desired mutations, i.e., out-of-frame insertions/deletions that disrupt genes. Here we report a simple and efficient method to visualize and quantify the efficiency of genomic mutations induced by genome-editing. Our approach is based on the expression of a two-color fusion protein in a vector that allows the insertion of the edited region in the genome in between the two color moieties. We show that our approach not only easily identifies developing animals with desired mutations but also efficiently quantifies the mutation rate in vivo. Furthermore, by using LacZα and GFP as the color moieties, our approach can even eliminate the need for a fluorescent microscope, allowing the analysis with simple bright field visualization. Such an approach will greatly simplify the screen for effective genome-editing enzymes and identify the desired mutant cells/animals. PMID:27748423
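The distinction drawn above between out-of-frame and in-frame mutations comes down to whether the net length change of an indel is a multiple of three. The Python sketch below illustrates only that arithmetic on sequenced alleles; the function names and the (reference, edited) pair format are hypothetical and are not part of the reporter assay described in the abstract.

```python
def is_out_of_frame(ref, alt):
    """True when the net length change between ref and alt shifts the reading frame."""
    return (len(alt) - len(ref)) % 3 != 0

def out_of_frame_rate(alleles):
    """Fraction of edited alleles that are out of frame.

    alleles: list of (reference_sequence, observed_sequence) pairs.
    Unedited alleles (observed == reference) are ignored; returns 0.0 if none are edited.
    """
    edited = [(ref, alt) for ref, alt in alleles if ref != alt]
    if not edited:
        return 0.0
    shifted = sum(is_out_of_frame(ref, alt) for ref, alt in edited)
    return shifted / len(edited)
```

Note that a pure substitution leaves the length unchanged and is therefore counted as in-frame by this check.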
Shen, Qi; Zhang, Dong; Sun, Wei; Zhang, Yu-Jun; Shang, Zhi-Wei; Chen, Shi-Lin
2017-05-01
Perilla frutescens is one of 60 food and medicinal plants in the initial directory announced by the health ministry of China. With the development of the Perilla field in recent years, the breeding and application of good varieties has become the main bottleneck for its development. This study reports the breeding of a perilla variety by systematic selection combined with a marker-assisted method. Through whole-genome sequencing and consistency matching, mutation loci were annotated according to the genome data and compared with a database of common Perilla variants; 30 non-synonymous SNPs were finally selected as characteristic markers of Zhongyan Feishu No.1. These SNP markers were used as the selection standard for Perilla varieties. The resulting new perilla variety, Zhongyan Feishu No.1, has dual-use leaves and seeds, high yield, and high resistance, and can be used as a green fertilizer. Zhongyan Feishu No.1 obtained new plant variety identification from Beijing city, with identification number 2016054. Marker-assisted identification can guide the breeding of new plant varieties and can provide a new reference for the breeding of medicinal plants. Copyright© by the Chinese Pharmaceutical Association.
Song, Hyun-Seob; Goldberg, Noam; Mahajan, Ashutosh; Ramkrishna, Doraiswami
2017-08-01
Elementary (flux) modes (EMs) have served as a valuable tool for investigating structural and functional properties of metabolic networks. Identification of the full set of EMs in genome-scale networks remains challenging due to combinatorial explosion of EMs in complex networks. Often, however, only a small subset of relevant EMs needs to be known, for which optimization-based sequential computation is a useful alternative. Most of the currently available methods along this line are based on the iterative use of mixed integer linear programming (MILP), the effectiveness of which significantly deteriorates as the number of iterations builds up. To alleviate the computational burden associated with the MILP implementation, we here present a novel optimization algorithm termed alternate integer linear programming (AILP). Our algorithm was designed to iteratively solve a pair of integer programming (IP) and linear programming (LP) problems to compute EMs in a sequential manner. In each step, the IP identifies a minimal subset of reactions, the deletion of which disables all previously identified EMs. Thus, a subsequent LP solution subject to this reaction deletion constraint becomes a distinct EM. In cases where no feasible LP solution is available, IP-derived reaction deletion sets represent minimal cut sets (MCSs). Despite the additional computation of MCSs, AILP achieved significant time reduction in computing EMs by orders of magnitude. The proposed AILP algorithm not only offers a computational advantage in the EM analysis of genome-scale networks, but also improves the understanding of the linkage between EMs and MCSs. The software is implemented in Matlab, and is provided as supplementary information. Contact: hyunseob.song@pnnl.gov. Supplementary data are available at Bioinformatics online. Published by Oxford University Press 2017. This work is written by US Government employees and is in the public domain in the US.
Zhang, Baixia; Li, Yanwen; Zhang, Yanling; Li, Zhiyong; Bi, Tian; He, Yusu; Song, Kuokui; Wang, Yun
2016-01-01
Identification of bioactive components is an important area of research in traditional Chinese medicine (TCM) formula. The reported identification methods only consider the interaction between the components and the target proteins, which is not sufficient to explain the influence of TCM on the gene expression. Here, we propose the Initial Transcription Process-based Identification (ITPI) method for the discovery of bioactive components that influence transcription factors (TFs). In this method, genome-wide chip detection technology was used to identify differentially expressed genes (DEGs). The TFs of DEGs were derived from GeneCards. The components influencing the TFs were derived from STITCH. The bioactive components in the formula were identified by evaluating the molecular similarity between the components in formula and the components that influence the TF of DEGs. Using the formula of Tian-Zhu-San (TZS) as an example, the reliability and limitation of ITPI were examined and 16 bioactive components that influence TFs were identified. PMID:27034696
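The final ITPI step, evaluating molecular similarity between formula components and compounds known to influence a TF, can be sketched as follows. This is an illustration under assumptions: the abstract does not name a similarity measure, so the Tanimoto coefficient on fingerprint bit sets is swapped in here, and the function names, the 0.7 threshold, and the toy fingerprints are all hypothetical.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto coefficient between two fingerprints given as sets of on-bits."""
    a, b = set(fp_a), set(fp_b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

def flag_bioactive(formula_components, tf_influencers, threshold=0.7):
    """Return names of formula components similar to any known TF-influencing compound.

    formula_components, tf_influencers: dicts mapping compound name -> fingerprint set.
    threshold: similarity cutoff (an assumed value, not taken from the source).
    """
    return sorted(
        name
        for name, fp in formula_components.items()
        if any(tanimoto(fp, known) >= threshold for known in tf_influencers.values())
    )
```

In practice the fingerprints would come from a cheminformatics toolkit rather than hand-written sets.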
Sequence-based classification and identification of fungi
USDA-ARS's Scientific Manuscript database
Fungal taxonomy and ecology have been revolutionized by the application of molecular methods and both have increasing connections to genomics and functional biology. However, data streams from traditional specimen- and culture-based systematics are not yet fully integrated with those from metagenomi...
Engineering domain fusion chimeras from I-OnuI family LAGLIDADG homing endonucleases
Lambert, Abigail R.; Kuhar, Ryan; Jarjour, Jordan; Kulshina, Nadia; Parmeggiani, Fabio; Danaher, Patrick; Gano, Jacob; Baker, David; Stoddard, Barry L.; Scharenberg, Andrew M.
2012-01-01
Although engineered LAGLIDADG homing endonucleases (LHEs) are finding increasing applications in biotechnology, their generation remains a challenging, industrial-scale process. As new single-chain LAGLIDADG nuclease scaffolds are identified, however, an alternative paradigm is emerging: identification of an LHE scaffold whose native cleavage site is a close match to a desired target sequence, followed by small-scale engineering to modestly refine recognition specificity. The application of this paradigm could be accelerated if methods were available for fusing N- and C-terminal domains from newly identified LHEs into chimeric enzymes with hybrid cleavage sites. Here we have analyzed the structural requirements for fusion of domains extracted from six single-chain I-OnuI family LHEs, spanning 40–70% amino acid identity. Our analyses demonstrate that both the LAGLIDADG helical interface residues and the linker peptide composition have important effects on the stability and activity of chimeric enzymes. Using a simple domain fusion method in which linker peptide residues predicted to contact their respective domains are retained, and in which limited variation is introduced into the LAGLIDADG helix and nearby interface residues, catalytically active enzymes were recoverable for ∼70% of domain chimeras. This method will be useful for creating large numbers of chimeric LHEs for genome engineering applications. PMID:22684507
A Hybrid Approach for the Automated Finishing of Bacterial Genomes
Robins, William P.; Chin, Chen-Shan; Webster, Dale; Paxinos, Ellen; Hsu, David; Ashby, Meredith; Wang, Susana; Peluso, Paul; Sebra, Robert; Sorenson, Jon; Bullard, James; Yen, Jackie; Valdovino, Marie; Mollova, Emilia; Luong, Khai; Lin, Steven; LaMay, Brianna; Joshi, Amruta; Rowe, Lori; Frace, Michael; Tarr, Cheryl L.; Turnsek, Maryann; Davis, Brigid M; Kasarskis, Andrew; Mekalanos, John J.; Waldor, Matthew K.; Schadt, Eric E.
2013-01-01
Dramatic improvements in DNA sequencing technology have revolutionized our ability to characterize most genomic diversity. However, accurate resolution of large structural events has remained challenging due to the comparatively shorter read lengths of second-generation technologies. Emerging third-generation sequencing technologies, which yield markedly increased read length on rapid time scales and for low cost, have the potential to address assembly limitations. Here we combine sequencing data from second- and third-generation DNA sequencing technologies to assemble the two-chromosome genome of a recent Haitian cholera outbreak strain into two nearly finished contigs at > 99.9% accuracy. Complex regions with clinically significant structure were completely resolved. In separate control assemblies on experimental and simulated data for the canonical N16961 reference we obtain 14 and 8 scaffolds greater than 1kb, respectively, correcting several errors in the underlying source data. This work provides a blueprint for the next generation of rapid microbial identification and full-genome assembly. PMID:22750883
Hal: an automated pipeline for phylogenetic analyses of genomic data.
Robbertse, Barbara; Yoder, Ryan J; Boyd, Alex; Reeves, John; Spatafora, Joseph W
2011-02-07
The rapid increase in genomic and genome-scale data is resulting in unprecedented levels of discrete sequence data available for phylogenetic analyses. Major analytical impasses exist, however, prior to analyzing these data with existing phylogenetic software. Obstacles include the management of large data sets without standardized naming conventions, identification and filtering of orthologous clusters of proteins or genes, and the assembly of alignments of orthologous sequence data into individual and concatenated super alignments. Here we report the production of an automated pipeline, Hal, that produces multiple alignments and trees from genomic data. These alignments can be produced by a choice of four alignment programs and analyzed by a variety of phylogenetic programs. In short, the Hal pipeline connects the programs BLASTP, MCL, user-specified alignment programs, GBlocks, ProtTest and user-specified phylogenetic programs to produce species trees. The script is available at SourceForge (http://sourceforge.net/projects/bio-hal/). The results from an example analysis of Kingdom Fungi are briefly discussed.
Comparative genomic characterization of citrus-associated Xylella fastidiosa strains.
da Silva, Vivian S; Shida, Cláudio S; Rodrigues, Fabiana B; Ribeiro, Diógenes C D; de Souza, Alessandra A; Coletta-Filho, Helvécio D; Machado, Marcos A; Nunes, Luiz R; de Oliveira, Regina Costa
2007-12-21
The xylem-inhabiting bacterium Xylella fastidiosa (Xf) is the causal agent of Pierce's disease (PD) in vineyards and citrus variegated chlorosis (CVC) in orange trees. Both of these economically devastating diseases are caused by distinct strains of this complex group of microorganisms, which has motivated researchers to conduct extensive genomic sequencing projects with Xf strains. This sequence information, along with other molecular tools, has been used to estimate the evolutionary history of the group and provide clues to understand the capacity of Xf to infect different hosts, causing a variety of symptoms. Nonetheless, although significant amounts of information have been generated from Xf strains, a large proportion of these efforts has concentrated on the study of North American strains, limiting our understanding of the genomic composition of South American strains, which is particularly important for CVC-associated strains. This paper describes the first genome-wide comparison among South American Xf strains, involving 6 distinct citrus-associated bacteria. Comparative analyses performed through a microarray-based approach allowed identification and characterization of large mobile genetic elements that seem to be exclusive to South American strains. Moreover, a large-scale sequencing effort, based on Suppressive Subtraction Hybridization (SSH), identified 290 new ORFs, distributed in 135 Groups of Orthologous Elements, throughout the genomes of these bacteria. Results from microarray-based comparisons provide further evidence concerning activity of horizontally transferred elements, reinforcing their importance as major mediators in the evolution of Xf. Moreover, the microarray-based genomic profiles showed similarity between Xf strains 9a5c and Fb7, which is unexpected, given the geographical and chronological differences associated with the isolation of these microorganisms.
The newly identified ORFs, obtained by SSH, represent an approximately 10% increase in our current knowledge of the South American Xf gene pool and include new putative virulence factors, as well as novel potential markers for strain identification. Surprisingly, this list of novel elements includes sequences previously believed to be unique to North American strains, pointing to the necessity of revising the list of specific markers that may be used for identification of distinct Xf strains.
Janssen, K A; Sidoli, S; Garcia, B A
2017-01-01
Functional epigenetic regulation occurs by dynamic modification of chromatin, including genetic material (i.e., DNA methylation), histone proteins, and other nuclear proteins. Due to the highly complex nature of the histone code, mass spectrometry (MS) has become the leading technique in identification of single and combinatorial histone modifications. MS has now overcome antibody-based strategies due to its automation, high resolution, and accurate quantitation. Moreover, multiple approaches to analysis have been developed for global quantitation of posttranslational modifications (PTMs), including large-scale characterization of modification coexistence (middle-down and top-down proteomics), which is not currently possible with any other biochemical strategy. Recently, our group and others have simplified and increased the effectiveness of analyzing histone PTMs by improving multiple MS methods and data analysis tools. This review provides an overview of the major achievements in the analysis of histone PTMs using MS with a focus on the most recent improvements. We speculate that the workflow for histone analysis at its state of the art is highly reliable in terms of identification and quantitation accuracy, and it has the potential to become a routine method for systems biology thanks to the possibility of integrating histone MS results with genomics and proteomics datasets. © 2017 Elsevier Inc. All rights reserved.
Construction of Red Fox Chromosomal Fragments from the Short-Read Genome Assembly.
Rando, Halie M; Farré, Marta; Robson, Michael P; Won, Naomi B; Johnson, Jennifer L; Buch, Ronak; Bastounes, Estelle R; Xiang, Xueyan; Feng, Shaohong; Liu, Shiping; Xiong, Zijun; Kim, Jaebum; Zhang, Guojie; Trut, Lyudmila N; Larkin, Denis M; Kukekova, Anna V
2018-06-20
The genome of a red fox (Vulpes vulpes) was recently sequenced and assembled using next-generation sequencing (NGS). The assembly is of high quality, with 94X coverage and a scaffold N50 of 11.8 Mbp, but is split into 676,878 scaffolds, some of which are likely to contain assembly errors. Fragmentation and misassembly hinder accurate gene prediction and downstream analysis such as the identification of loci under selection. Therefore, assembly of the genome into chromosome-scale fragments was an important step towards developing this genomic model. Scaffolds from the assembly were aligned to the dog reference genome and compared to the alignment of an outgroup genome (cat) against the dog to identify syntenic sequences among species. The program Reference-Assisted Chromosome Assembly (RACA) then integrated the comparative alignment with the mapping of the raw sequencing reads generated during assembly against the fox scaffolds. The 128 sequence fragments RACA assembled were compared to the fox meiotic linkage map to guide the construction of 40 chromosomal fragments. This computational approach to assembly was facilitated by prior research in comparative mammalian genomics, and the continued improvement of the red fox genome can in turn offer insight into canid and carnivore chromosome evolution. This assembly is also necessary for advancing genetic research in foxes and other canids.
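The scaffold N50 quoted in the abstract above is a standard assembly summary statistic: the length of the scaffold at which the running total, taken from the longest scaffold downwards, first reaches half of the total assembly length. The following is a minimal sketch of that calculation in Python. The function name `n50` and the toy scaffold lengths are illustrative assumptions and are not taken from the red fox assembly.

```python
def n50(lengths):
    """Return the N50 of a collection of scaffold (or contig) lengths.

    N50 is the length L such that scaffolds of length >= L together
    cover at least half of the total assembled bases.
    """
    if not lengths:
        raise ValueError("lengths must not be empty")
    ordered = sorted(lengths, reverse=True)  # longest scaffold first
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length


# Hypothetical toy assembly: total length 300, so the half-way mark is 150.
toy_scaffolds = [80, 70, 50, 40, 30, 20, 10]
print(n50(toy_scaffolds))  # 80 + 70 reaches 150, so this prints 70
```

A large N50 alongside a very large scaffold count, as reported above, usually means that most bases sit in a few long scaffolds while many short scaffolds remain.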
Yamada, Takuji; Waller, Alison S; Raes, Jeroen; Zelezniak, Aleksej; Perchat, Nadia; Perret, Alain; Salanoubat, Marcel; Patil, Kiran R; Weissenbach, Jean; Bork, Peer
2012-01-01
Despite the current wealth of sequencing data, one-third of all biochemically characterized metabolic enzymes lack a corresponding gene or protein sequence, and as such can be considered orphan enzymes. They represent a major gap between our molecular and biochemical knowledge, and consequently are not amenable to modern systemic analyses. As 555 of these orphan enzymes have metabolic pathway neighbours, we developed a global framework that utilizes the pathway and (meta)genomic neighbour information to assign candidate sequences to orphan enzymes. For 131 orphan enzymes (37% of those for which (meta)genomic neighbours are available), we associate sequences to them using scoring parameters with an estimated accuracy of 70%, implying functional annotation of 16 345 gene sequences in numerous (meta)genomes. As a case in point, two of these candidate sequences were experimentally validated to encode the predicted activity. In addition, we augmented the currently available genome-scale metabolic models with these new sequence–function associations and were able to expand the models by on average 8%, with a considerable change in the flux connectivity patterns and improved essentiality prediction. PMID:22569339
Chemicals found in nature have evolved over geological time scales to productively interact with biological molecules, and thus represent an effective resource for pharmaceutical development. Marine-derived bacteria are rich sources of chemically diverse, bioactive secondary metabolites, but harnessing this diversity for biomedical benefit is limited by challenges associated with natural product purification and determination of biochemical mechanism.
Large Scale Single Nucleotide Polymorphism Study of PD Susceptibility
2006-03-01
familial PD, the results of intensive investigations of polymorphisms in dozens of genes related to sporadic, late onset, typical PD have not shown...association between classical, sporadic PD and 2386 SNPs in 23 genes implicated in the pathogenesis of PD; (2) construct haplotypes based on the SNP...derived from this study may be applied in other complex disorders for the identification of susceptibility genes, as well as in genome-wide SNP
Deciphering the Epigenetic Code: An Overview of DNA Methylation Analysis Methods
Umer, Muhammad
2013-01-01
Abstract Significance: Methylation of cytosine in DNA is linked with gene regulation, and this has profound implications in development, normal biology, and disease conditions in many eukaryotic organisms. A wide range of methods and approaches exist for its identification, quantification, and mapping within the genome. While the earliest approaches were nonspecific and were at best useful for quantification of total methylated cytosines in the chunk of DNA, this field has seen considerable progress and development over the past decades. Recent Advances: Methods for DNA methylation analysis differ in their coverage and sensitivity, and the method of choice depends on the intended application and desired level of information. Potential results include global methyl cytosine content, degree of methylation at specific loci, or genome-wide methylation maps. Introduction of more advanced approaches to DNA methylation analysis, such as microarray platforms and massively parallel sequencing, has brought us closer to unveiling the whole methylome. Critical Issues: Sensitive quantification of DNA methylation from degraded and minute quantities of DNA and high-throughput DNA methylation mapping of single cells still remain a challenge. Future Directions: Developments in DNA sequencing technologies as well as the methods for identification and mapping of 5-hydroxymethylcytosine are expected to augment our current understanding of epigenomics. Here we present an overview of methodologies available for DNA methylation analysis with special focus on recent developments in genome-wide and high-throughput methods. While the application focus relates to cancer research, the methods are equally relevant to broader issues of epigenetics and redox science in this special forum. Antioxid. Redox Signal. 18, 1972–1986. PMID:23121567
BiGG: a Biochemical Genetic and Genomic knowledgebase of large scale metabolic reconstructions
2010-01-01
Background Genome-scale metabolic reconstructions under the Constraint Based Reconstruction and Analysis (COBRA) framework are valuable tools for analyzing the metabolic capabilities of organisms and interpreting experimental data. As the number of such reconstructions and analysis methods increases, there is a greater need for data uniformity and ease of distribution and use. Description We describe BiGG, a knowledgebase of Biochemically, Genetically and Genomically structured genome-scale metabolic network reconstructions. BiGG integrates several published genome-scale metabolic networks into one resource with standard nomenclature which allows components to be compared across different organisms. BiGG can be used to browse model content, visualize metabolic pathway maps, and export SBML files of the models for further analysis by external software packages. Users may follow links from BiGG to several external databases to obtain additional information on genes, proteins, reactions, metabolites and citations of interest. Conclusions BiGG addresses a need in the systems biology community to have access to high quality curated metabolic models and reconstructions. It is freely available for academic use at http://bigg.ucsd.edu. PMID:20426874
USDA-ARS's Scientific Manuscript database
Focusing on the identification of pathogenicity gene content, we leveraged the reference genomes of Fusarium pathogens F. oxysporum f. sp. lycopersici (tomato-infecting) and F. solani (pea-infecting) and their well-characterised core and dispensable chromosomes to predict genomic organisation in the...
Competitive code-based fast palmprint identification using a set of cover trees
NASA Astrophysics Data System (ADS)
Yue, Feng; Zuo, Wangmeng; Zhang, David; Wang, Kuanquan
2009-06-01
A palmprint identification system recognizes a query palmprint image by searching for its nearest neighbor from among all the templates in a database. When applied on a large-scale identification system, it is often necessary to speed up the nearest-neighbor searching process. We use competitive code, which has very fast feature extraction and matching speed, for palmprint identification. To speed up the identification process, we extend the cover tree method and propose to use a set of cover trees to facilitate the fast and accurate nearest-neighbor searching. We can use the cover tree method because, as we show, the angular distance used in competitive code can be decomposed into a set of metrics. Using the Hong Kong PolyU palmprint database (version 2) and a large-scale palmprint database, our experimental results show that the proposed method searches for nearest neighbors faster than brute force searching.
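As a point of reference for the abstract above, the brute-force baseline can be sketched in a few lines of Python. This is only an illustration under stated assumptions. It assumes that each template is a sequence of orientation indices drawn from six orientations (0 to 5) and that the angular distance between two indices is the shorter way around that cycle, summed over positions. The function names, the toy templates, and the six-orientation setting are hypothetical, and the cover-tree acceleration proposed by the authors is not implemented here.

```python
def angular_distance(code_a, code_b, n_orientations=6):
    """Sum of per-position angular differences between two orientation codes.

    Each entry is an orientation index in range(n_orientations); the
    difference between two indices is measured the short way around
    the cycle (an assumption made for this sketch).
    """
    if len(code_a) != len(code_b):
        raise ValueError("codes must have the same length")
    total = 0
    for a, b in zip(code_a, code_b):
        diff = abs(a - b) % n_orientations
        total += min(diff, n_orientations - diff)
    return total


def nearest_template(query, templates):
    """Brute-force search: compare the query against every stored template."""
    best_id, best_dist = None, None
    for template_id, code in templates.items():
        dist = angular_distance(query, code)
        if best_dist is None or dist < best_dist:
            best_id, best_dist = template_id, dist
    return best_id, best_dist


# Hypothetical four-position templates; real codes cover a whole image.
templates = {
    "palm_a": [0, 1, 2, 3],
    "palm_b": [5, 5, 0, 1],
    "palm_c": [3, 4, 5, 0],
}
query = [0, 1, 2, 4]
print(nearest_template(query, templates))  # ('palm_a', 1)
```

The cost of this baseline grows linearly with the number of templates, which is the cost that a tree-based index such as the one described in the abstract aims to reduce.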
Dong, Zirui; Wang, Huilin; Chen, Haixiao; Jiang, Hui; Yuan, Jianying; Yang, Zhenjun; Wang, Wen-Jing; Xu, Fengping; Guo, Xiaosen; Cao, Ye; Zhu, Zhenzhen; Geng, Chunyu; Cheung, Wan Chee; Kwok, Yvonne K; Yang, Huangming; Leung, Tak Yeung; Morton, Cynthia C.; Cheung, Sau Wai; Choy, Kwong Wai
2017-01-01
Purpose Recent studies demonstrate that whole-genome sequencing (WGS) enables detection of cryptic rearrangements in apparently balanced chromosomal rearrangements (also known as balanced chromosomal abnormalities, BCAs) previously identified by conventional cytogenetic methods. We aimed to assess our analytical tool for detecting BCAs in The 1000 Genomes Project without knowing affected bands. Methods The 1000 Genomes Project provides an unprecedented integrated map of structural variants in phenotypically normal subjects, but there is no information on potential inclusion of subjects with apparently BCAs akin to those traditionally detected in diagnostic cytogenetics laboratories. We applied our analytical tool to 1,166 genomes from the 1000 Genomes Project with sufficient physical coverage (8.25-fold). Results Our approach detected four reciprocal balanced translocations and four inversions ranging in size from 57.9 kb to 13.3 Mb, all of which were confirmed by cytogenetic methods and PCR studies. One of the DNA samples has a subtle translocation that is not readily identified by chromosome analysis due to similar banding patterns and size of exchanged segments, and another results in disruption of all transcripts of an OMIM gene. Conclusions Our study demonstrates that low-coverage WGS can be extended to the unbiased detection of BCAs, including translocations and inversions previously unknown in the 1000 Genomes Project. PMID:29095815
Genetic recombination is targeted towards gene promoter regions in dogs.
Auton, Adam; Rui Li, Ying; Kidd, Jeffrey; Oliveira, Kyle; Nadel, Julie; Holloway, J Kim; Hayward, Jessica J; Cohen, Paula E; Greally, John M; Wang, Jun; Bustamante, Carlos D; Boyko, Adam R
2013-01-01
The identification of the H3K4 trimethylase, PRDM9, as the gene responsible for recombination hotspot localization has provided considerable insight into the mechanisms by which recombination is initiated in mammals. However, uniquely amongst mammals, canids appear to lack a functional version of PRDM9 and may therefore provide a model for understanding recombination that occurs in the absence of PRDM9, and thus how PRDM9 functions to shape the recombination landscape. We have constructed a fine-scale genetic map from patterns of linkage disequilibrium assessed using high-throughput sequence data from 51 free-ranging dogs, Canis lupus familiaris. While broad-scale properties of recombination appear similar to other mammalian species, our fine-scale estimates indicate that highly elevated canine recombination rates are observed in the vicinity of CpG-rich regions, including gene promoter regions, but show little association with H3K4 trimethylation marks identified in spermatocytes. By comparison to genomic data from the Andean fox, Lycalopex culpaeus, we show that biased gene conversion is a plausible mechanism by which the high CpG content of the dog genome could have occurred.
Genome-Wide Profiling of RNA–Protein Interactions Using CLIP-Seq
Stork, Cheryl; Zheng, Sika
2017-01-01
UV crosslinking immunoprecipitation (CLIP) is an increasingly popular technique to study protein–RNA interactions in tissues and cells. Whole cells or tissues are ultraviolet irradiated to generate a covalent bond between RNA and proteins that are in close contact. After partial RNase digestion, an antibody specific to an RNA-binding protein (RBP) or to a protein–epitope tag is then used to immunoprecipitate the protein–RNA complexes. After stringent washing and gel separation the RBP–RNA complex is excised. The RBP is protease digested to allow purification of the bound RNA. Reverse transcription of the RNA followed by high-throughput sequencing of the cDNA library is now often used to identify protein-bound RNA on a genome-wide scale. UV irradiation can result in cDNA truncations and/or mutations at the crosslink sites, which complicates the alignment of the sequencing library to the reference genome and the identification of the crosslinking sites. Meanwhile, one or more amino acids of a crosslinked RBP can remain attached to its bound RNA due to incomplete digestion of the protein. As a result, reverse transcriptase may not read through the crosslink sites, and instead produces cDNA ending at the crosslinked nucleotide. This is harnessed by one variant of CLIP methods to identify crosslinking sites at nucleotide resolution. This method, individual nucleotide resolution CLIP (iCLIP), circularizes cDNA to capture the truncated cDNA and also increases the efficiency of ligating sequencing adapters to the library. Here, we describe the detailed procedure of iCLIP. PMID:26965263
Tabata, Ryo; Kamiya, Takehiro; Shigenobu, Shuji; Yamaguchi, Katsushi; Yamada, Masashi; Hasebe, Mitsuyasu; Fujiwara, Toru; Sawa, Shinichiro
2013-01-01
Next-generation sequencing (NGS) technologies enable the rapid production of an enormous quantity of sequence data. These powerful new technologies allow the identification of mutations by whole-genome sequencing. However, most reported NGS-based mapping methods, which are based on bulked segregant analysis, are costly and laborious. To address these limitations, we designed a versatile NGS-based mapping method that consists of a combination of low- to medium-coverage multiplex SOLiD (Sequencing by Oligonucleotide Ligation and Detection) and classical genetic rough mapping. Using only low to medium coverage reduces the SOLiD sequencing costs and, since just 10 to 20 mutant F2 plants are required for rough mapping, the operation is simple enough to handle in a laboratory with limited space and funding. As a proof of principle, we successfully applied this method to identify CTR1, which is involved in boron-mediated root development, from among a population of high-boron-requiring Arabidopsis thaliana mutants. Our work demonstrates that this NGS-based mapping method is a moderately priced and versatile method that can readily be applied to other model organisms. PMID:23104114
Short-read, high-throughput sequencing technology for STR genotyping
Bornman, Daniel M.; Hester, Mark E.; Schuetter, Jared M.; Kasoji, Manjula D.; Minard-Smith, Angela; Barden, Curt A.; Nelson, Scott C.; Godbold, Gene D.; Baker, Christine H.; Yang, Boyu; Walther, Jacquelyn E.; Tornes, Ivan E.; Yan, Pearlly S.; Rodriguez, Benjamin; Bundschuh, Ralf; Dickens, Michael L.; Young, Brian A.; Faith, Seth A.
2013-01-01
DNA-based methods for human identification principally rely upon genotyping of short tandem repeat (STR) loci. Electrophoretic-based techniques for variable-length classification of STRs are universally utilized, but are limited in that they have relatively low throughput and do not yield nucleotide sequence information. High-throughput sequencing technology may provide a more powerful instrument for human identification, but is not currently validated for forensic casework. Here, we present a systematic method to perform high-throughput genotyping analysis of the Combined DNA Index System (CODIS) STR loci using short-read (150 bp) massively parallel sequencing technology. Open source reference alignment tools were optimized to evaluate PCR-amplified STR loci using a custom designed STR genome reference. Evaluation of this approach demonstrated that the 13 CODIS STR loci and amelogenin (AMEL) locus could be accurately called from individual and mixture samples. Sensitivity analysis showed that as few as 18,500 reads, aligned to an in silico referenced genome, were required to genotype an individual (>99% confidence) for the CODIS loci. The power of this technology was further demonstrated by identification of variant alleles containing single nucleotide polymorphisms (SNPs) and the development of quantitative measurements (reads) for resolving mixed samples. PMID:25621315
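The variable-length classification mentioned in the abstract above rests on a simple idea: an STR allele is named by how many times its repeat motif occurs in tandem. The following Python sketch illustrates only that idea by counting the longest uninterrupted run of a motif in a single read. It is not the alignment-based pipeline the authors describe, and the function name, the GATA motif, and the toy read are hypothetical.

```python
import re


def longest_repeat_run(read, motif):
    """Return the largest number of back-to-back copies of motif in read."""
    if not motif:
        raise ValueError("motif must not be empty")
    # A lookahead lets a candidate run start at every position, so an
    # earlier partial match cannot hide a longer run that overlaps it.
    pattern = re.compile(f"(?=((?:{re.escape(motif)})+))")
    longest = 0
    for match in pattern.finditer(read):
        longest = max(longest, len(match.group(1)) // len(motif))
    return longest


# Hypothetical read: flanking sequence around four GATA repeats.
read = "CCTA" + "GATA" * 4 + "TTGC"
print(longest_repeat_run(read, "GATA"))  # prints 4
```

A count obtained this way reflects length only. Variant alleles of the kind the abstract mentions, where a SNP interrupts the repeat, would break the run and need sequence-aware handling.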
Identification of cis-suppression of human disease mutations by comparative genomics
Jordan, Daniel M.; Frangakis, Stephan G.; Golzio, Christelle; Cassa, Christopher A.; Kurtzberg, Joanne; Davis, Erica E.; Sunyaev, Shamil R.; Katsanis, Nicholas
2015-01-01
Patterns of amino acid conservation have served as a tool for understanding protein evolution [1]. The same principles have also found broad application in human genomics, driven by the need to interpret the pathogenic potential of variants in patients [2]. Here we performed a systematic comparative genomics analysis of human disease-causing missense variants. We found that an appreciable fraction of disease-causing alleles are fixed in the genomes of other species, suggesting a role for genomic context. We developed a model of genetic interactions that predicts most of these to be simple pairwise compensations. Functional testing of this model on two known human disease genes [3,4] revealed discrete cis amino acid residues that, although benign on their own, could rescue the human mutations in vivo. This approach was also applied to ab initio gene discovery to support the identification of a de novo disease driver in BTG2 that is subject to protective cis-modification in more than 50 species. Finally, on the basis of our data and models, we developed a computational tool to predict candidate residues subject to compensation. Taken together, our data highlight the importance of cis-genomic context as a contributor to protein evolution; they provide an insight into the complexity of allele effect on phenotype; and they are likely to assist methods for predicting allele pathogenicity [5,6]. PMID:26123021
Ghouila, Amel; Florent, Isabelle; Guerfali, Fatma Zahra; Terrapon, Nicolas; Laouini, Dhafer; Yahia, Sadok Ben; Gascuel, Olivier; Bréhélin, Laurent
2014-01-01
Identification of protein domains is a key step for understanding protein function. Hidden Markov Models (HMMs) have proved to be a powerful tool for this task. The Pfam database notably provides a large collection of HMMs which are widely used for the annotation of proteins in sequenced organisms. This is done via sequence/HMM comparisons. However, this approach may lack sensitivity when searching for domains in divergent species. Recently, methods for HMM/HMM comparisons have been proposed and proved to be more sensitive than sequence/HMM approaches in certain cases. However, these approaches are usually not used for protein domain discovery at a genome scale, and the benefit that could be expected from their utilization for this problem has not been investigated. Using proteins of P. falciparum and L. major as examples, we investigate the extent to which HMM/HMM comparisons can identify new domain occurrences not already identified by sequence/HMM approaches. We show that although HMM/HMM comparisons are much more sensitive than sequence/HMM comparisons, they are not sufficiently accurate to be used as a standalone complement of sequence/HMM approaches at the genome scale. Hence, we propose to use domain co-occurrence — the general domain tendency to preferentially appear along with some favorite domains in the proteins — to improve the accuracy of the approach. We show that the combination of HMM/HMM comparisons and co-occurrence domain detection boosts protein annotations. At an estimated False Discovery Rate of 5%, it revealed 901 and 1098 new domains in Plasmodium and Leishmania proteins, respectively. Manual inspection of part of these predictions shows that it contains several domain families that were missing in the two organisms. All new domain occurrences have been integrated in the EuPathDomains database, along with the GO annotations that can be deduced. PMID:24901648
Toward Genomics-Based Breeding in C3 Cool-Season Perennial Grasses.
Talukder, Shyamal K; Saha, Malay C
2017-01-01
Most important food and feed crops in the world belong to the C3 grass family. The future of food security is highly reliant on achieving genetic gains of those grasses. Conventional breeding methods have already reached a plateau for improving major crops. Genomics tools and resources have opened an avenue to explore genome-wide variability and make use of the variation for enhancing genetic gains in breeding programs. Major C3 annual cereal breeding programs are well equipped with genomic tools; however, genomic research of C3 cool-season perennial grasses is lagging behind. In this review, we discuss the currently available genomics tools and approaches useful for C3 cool-season perennial grass breeding. Along with a general review, we focus the discussion on forage grasses that were considered orphan and have little or no genetic information available. Transcriptome sequencing and genotyping-by-sequencing technology for genome-wide marker detection using next-generation sequencing (NGS) are very promising as genomics tools. Most C3 cool-season perennial grass members have no prior genetic information; thus NGS technology will enhance collinear study with other C3 model grasses like Brachypodium and rice. Transcriptomics data can be used for identification of functional genes and molecular markers, i.e., polymorphism markers and simple sequence repeats (SSRs). Genome-wide association study with NGS-based markers will facilitate marker identification for marker-assisted selection. With limited genetic information, genomic selection holds great promise to breeders for attaining maximum genetic gain of the cool-season C3 perennial grasses. Application of all these tools can ensure better genetic gains, reduce length of selection cycles, and facilitate cultivar development to meet the future demand for food and fodder.
Improved evidence-based genome-scale metabolic models for maize leaf, embryo, and endosperm
Seaver, Samuel M. D.; Bradbury, Louis M. T.; Frelin, Océane; Zarecki, Raphy; Ruppin, Eytan; Hanson, Andrew D.; Henry, Christopher S.
2015-01-01
There is a growing demand for genome-scale metabolic reconstructions for plants, fueled by the need to understand the metabolic basis of crop yield and by progress in genome and transcriptome sequencing. Methods are also required to enable the interpretation of plant transcriptome data to study how cellular metabolic activity varies under different growth conditions or even within different organs, tissues, and developmental stages. Such methods depend extensively on the accuracy with which genes have been mapped to the biochemical reactions in the plant metabolic pathways. Errors in these mappings lead to metabolic reconstructions with an inflated number of reactions and possible generation of unreliable metabolic phenotype predictions. Here we introduce a new evidence-based genome-scale metabolic reconstruction of maize, with significant improvements in the quality of the gene-reaction associations included within our model. We also present a new approach for applying our model to predict active metabolic genes based on transcriptome data. This method includes a minimal set of reactions associated with low-expression genes to enable activity of a maximum number of reactions associated with high-expression genes. We apply this method to construct an organ-specific model for the maize leaf, and tissue-specific models for maize embryo and endosperm cells. We validate our models using fluxomics data for the endosperm and embryo, demonstrating an improved capacity of our models to fit the available fluxomics data. All models are publicly available via the DOE Systems Biology Knowledgebase and PlantSEED, and our new method is generally applicable for analysis of transcript profiles from any plant, paving the way for further in silico studies with a wide variety of plant genomes. PMID:25806041
Improved evidence-based genome-scale metabolic models for maize leaf, embryo, and endosperm
Seaver, Samuel M.D.; Bradbury, Louis M.T.; Frelin, Océane; ...
2015-03-10
There is a growing demand for genome-scale metabolic reconstructions for plants, fueled by the need to understand the metabolic basis of crop yield and by progress in genome and transcriptome sequencing. Methods are also required to enable the interpretation of plant transcriptome data to study how cellular metabolic activity varies under different growth conditions or even within different organs, tissues, and developmental stages. Such methods depend extensively on the accuracy with which genes have been mapped to the biochemical reactions in the plant metabolic pathways. Errors in these mappings lead to metabolic reconstructions with an inflated number of reactions and possible generation of unreliable metabolic phenotype predictions. Here we introduce a new evidence-based genome-scale metabolic reconstruction of maize, with significant improvements in the quality of the gene-reaction associations included within our model. We also present a new approach for applying our model to predict active metabolic genes based on transcriptome data. This method includes a minimal set of reactions associated with low-expression genes to enable activity of a maximum number of reactions associated with high-expression genes. We apply this method to construct an organ-specific model for the maize leaf, and tissue-specific models for maize embryo and endosperm cells. We validate our models using fluxomics data for the endosperm and embryo, demonstrating an improved capacity of our models to fit the available fluxomics data. All models are publicly available via the DOE Systems Biology Knowledgebase and PlantSEED, and our new method is generally applicable to the analysis of transcript profiles from any plant, paving the way for further in silico studies with a wide variety of plant genomes.
Thakur, Shalabh; Guttman, David S
2016-06-30
Comparative analysis of whole genome sequence data from closely related prokaryotic species or strains is becoming an increasingly important and accessible approach for addressing both fundamental and applied biological questions. While there are a number of excellent tools developed for performing this task, most scale poorly when faced with hundreds of genome sequences, and many require extensive manual curation. We have developed a de-novo genome analysis pipeline (DeNoGAP) for the automated, iterative and high-throughput analysis of data from comparative genomics projects involving hundreds of whole genome sequences. The pipeline is designed to perform reference-assisted and de novo gene prediction, homolog protein family assignment, ortholog prediction, functional annotation, and pan-genome analysis using a range of proven tools and databases. While most existing methods scale quadratically with the number of genomes since they rely on pairwise comparisons among predicted protein sequences, DeNoGAP scales linearly since the homology assignment is based on iteratively refined hidden Markov models. This iterative clustering strategy enables DeNoGAP to handle a very large number of genomes using minimal computational resources. Moreover, the modular structure of the pipeline permits easy updates as new analysis programs become available. DeNoGAP integrates bioinformatics tools and databases for comparative analysis of a large number of genomes. The pipeline offers tools and algorithms for annotation and analysis of completed and draft genome sequences. The pipeline is developed using Perl, BioPerl and SQLite on Ubuntu Linux version 12.04 LTS. Currently, the software package is accompanied by a script for automated installation of the necessary external programs on Ubuntu Linux; however, the pipeline should also be compatible with other Linux and Unix systems once the necessary external programs are installed. DeNoGAP is freely available at https://sourceforge.net/projects/denogap/ .
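The scaling claim above can be illustrated with a simple count. This is only a rough sketch, not DeNoGAP code: it assumes an all-against-all method compares every unordered pair of genomes, and that an iterative model-based method needs one pass per added genome. Both function names are hypothetical.

```python
def pairwise_comparisons(n_genomes):
    """Unordered genome pairs an all-against-all method must compare."""
    return n_genomes * (n_genomes - 1) // 2


def iterative_passes(n_genomes):
    """Assumed cost of an iterative approach: one pass per added genome."""
    return n_genomes


# Compare how the two counts grow as the collection gets larger.
for n in (10, 100, 1000):
    print(n, pairwise_comparisons(n), iterative_passes(n))
```

Under these assumptions the gap widens quickly: a tenfold increase in genomes multiplies the pairwise count by roughly one hundred, but the pass count only by ten.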
Identification of Acinetobacter seifertii isolated from Bolivian hospitals.
Cerezales, Mónica; Xanthopoulou, Kyriaki; Ertel, Julia; Nemec, Alexandr; Bustamante, Zulema; Seifert, Harald; Gallego, Lucia; Higgins, Paul G
2018-06-01
Acinetobacter seifertii is a recently described species that belongs to the Acinetobacter calcoaceticus-Acinetobacter baumannii complex. It has been recovered from clinical samples and is sometimes associated with antimicrobial resistance determinants. We present here three A. seifertii clinical isolates that were initially identified as Acinetobacter sp. by phenotypic methods but could not be identified at the species level using semi-automated identification methods. The isolates were further analysed by whole genome sequencing and identified as A. seifertii. Because A. seifertii has been isolated from serious infections such as respiratory tract and bloodstream infections, we emphasize the importance of correctly identifying isolates of the genus Acinetobacter at the species level to gain a deeper knowledge of their prevalence and clinical impact.
Otaño-Rivera, Víctor; Boakye, Amma; Grobe, Nadja; Almutairi, Mohammed M; Kursan, Shams; Mattis, Lesan K; Castrop, Hayo; Gurley, Susan B; Elased, Khalid M; Boivin, Gregory P; Di Fulvio, Mauricio
2017-04-01
Genotyping of genetically-engineered mice is necessary for the effective design of breeding strategies and identification of mutant mice. This process relies on the identification of DNA markers introduced into genomic sequences of mice, a task usually performed using the polymerase chain reaction (PCR). Clearly, the limiting step in genotyping is isolating pure genomic DNA. Isolation of mouse DNA for genotyping typically involves painful procedures such as tail snip, digit removal, or ear punch. Although the harvesting of hair has previously been proposed as a source of genomic DNA, there has been a perceived complication and reluctance to use this non-painful technique because of low DNA yields and fear of contamination. In this study we developed a simple, economic, and efficient strategy using Chelex® resins to purify genomic DNA from hair roots of mice which are suitable for genotyping. Upon comparison with standard DNA purification methods using a commercially available kit, we demonstrate that Chelex® efficiently and consistently purifies high-quality DNA from hair roots, minimizing pain, shortening time and reducing costs associated with the determination of accurate genotypes. Therefore, the use of hair roots combined with Chelex® is a reliable and more humane alternative for DNA genotyping.
Yadav, Pragya D; Shete, Anita M; Nyayanit, Dimpal A; Albarino, Cesar G; Jain, Shilpi; Guerrero, Lisa W; Kumar, Sandeep; Patil, Deepak Y; Nichol, Stuart T; Mourya, Devendra T
2018-06-25
In 1954, a virus named Wad Medani virus (WMV) was isolated from Hyalomma marginatum ticks from Maharashtra State, India. In 1963, another virus was isolated from Sturnia pagodarum birds in Tamil Nadu, India, and named Kammavanpettai virus (KVPTV) based on the site of its isolation. Originally these virus isolates could not be identified with conventional methods. Here we describe next-generation sequencing studies leading to the determination of their complete genome sequences, and identification of both virus isolates as orbiviruses (family Reoviridae). Sequencing data showed that KVPTV has an AT-rich genome, whereas the genome of WMV is GC-rich. The size of the KVPTV genome is 18 234 nucleotides encoding proteins ranging 238-1290 amino acids (aa) in length. Similarly, the size of the WMV genome is 16 941 nucleotides encoding proteins ranging 214-1305 amino acids in length. Phylogenetic analysis of the VP1 gene, along with the capsid genes VP5 and VP7, revealed that KVPTV is likely a novel mosquito-borne virus and WMV is a tick-borne orbivirus. This study focuses on the phylogenetic comparison of these newly identified orbiviruses with mosquito-, tick- and Culicoides-borne orbiviruses isolated in India and other countries.
Genomic markers for decision making: what is preventing us from using markers?
Coyle, Vicky M; Johnston, Patrick G
2010-02-01
The advent of novel genomic technologies that enable the evaluation of genomic alterations on a genome-wide scale has significantly altered the field of genomic marker research in solid tumors. Researchers have moved away from the traditional model of identifying a particular genomic alteration and evaluating the association between this finding and a clinical outcome measure to a new approach involving the identification and measurement of multiple genomic markers simultaneously within clinical studies. This in turn has presented additional challenges in considering the use of genomic markers in oncology, such as clinical study design, reproducibility and interpretation and reporting of results. This Review will explore these challenges, focusing on microarray-based gene-expression profiling, and highlights some common failings in study design that have impacted on the use of putative genomic markers in the clinic. Despite these rapid technological advances there is still a paucity of genomic markers in routine clinical use at present. A rational and focused approach to the evaluation and validation of genomic markers is needed, whereby analytically validated markers are investigated in clinical studies that are adequately powered and have pre-defined patient populations and study endpoints. Furthermore, novel adaptive clinical trial designs, incorporating putative genomic markers into prospective clinical trials, will enable the evaluation of these markers in a rigorous and timely fashion. Such approaches have the potential to facilitate the implementation of such markers into routine clinical practice and consequently enable the rational and tailored use of cancer therapies for individual patients.
NASA Astrophysics Data System (ADS)
Zhang, Jingxia; Guo, Yinghai; Shen, Yulin; Zhao, Difei; Li, Mi
2018-06-01
The use of geophysical logging data to identify lithology is important groundwork in logging interpretation. Inevitably, noise is mixed in during data collection because of the equipment and other external factors, and this affects subsequent lithological identification and other logging interpretation. Therefore, de-noising methods are needed to obtain a more accurate lithological identification. In this study, a new de-noising method, namely improved complete ensemble empirical mode decomposition with adaptive noise (CEEMDAN)-wavelet transform, is proposed, which combines the strengths of improved CEEMDAN and the wavelet transform. Improved CEEMDAN, an effective self-adaptive multi-scale analysis method, is used to decompose non-stationary signals such as logging data to obtain intrinsic mode functions (IMFs) at N different scales and one residual. Moreover, a self-adaptive scale selection method is used to determine the reconstruction scale k. In addition, given the possible frequency aliasing problem between adjacent IMFs, a wavelet transform threshold de-noising method is used to reduce the noise of the (k-1)th IMF. Subsequently, the de-noised logging data are reconstructed from the de-noised (k-1)th IMF, the remaining low-frequency IMFs and the residual. Finally, empirical mode decomposition, improved CEEMDAN, wavelet transform and the proposed method are applied to the simulated and the actual data. Results show that these de-noising methods differ in accuracy for lithological identification. Compared with the other methods, the proposed method has the best self-adaptability and accuracy in lithological identification.
Fricano, Meagan M; Ditewig, Amy C; Jung, Paul M; Liguori, Michael J; Blomme, Eric A G; Yang, Yi
2011-01-01
Blood is an ideal tissue for the identification of novel genomic biomarkers for toxicity or efficacy. However, using blood for transcriptomic profiling presents significant technical challenges due to the transcriptomic changes induced by ex vivo handling and the interference of highly abundant globin mRNA. Most whole blood RNA stabilization and isolation methods also require significant volumes of blood, limiting their effective use in small animal species, such as rodents. To overcome these challenges, a QIAzol-based RNA stabilization and isolation method (QSI) was developed to isolate sufficient amounts of high quality total RNA from 25 to 500 μL of rat whole blood. The method was compared to the standard PAXgene Blood RNA System using blood collected from rats exposed to saline or lipopolysaccharide (LPS). The QSI method yielded an average of 54 ng total RNA per μL of rat whole blood with an average RNA Integrity Number (RIN) of 9, a performance comparable with the standard PAXgene method. Total RNA samples were further processed using the NuGEN Ovation Whole Blood Solution system and cDNA was hybridized to Affymetrix Rat Genome 230 2.0 Arrays. The microarray QC parameters using RNA isolated with the QSI method were within the acceptable range for microarray analysis. The transcriptomic profiles were highly correlated with those using RNA isolated with the PAXgene method and were consistent with expected LPS-induced inflammatory responses. The present study demonstrated that the QSI method coupled with NuGEN Ovation Whole Blood Solution system is cost-effective and particularly suitable for transcriptomic profiling of minimal volumes of whole blood, typical of those obtained with small animal species.
Opportunities and challenges of big data for the social sciences: The case of genomic data.
Liu, Hexuan; Guo, Guang
2016-09-01
In this paper, we draw attention to one unique and valuable source of big data, genomic data, by demonstrating the opportunities they provide to social scientists. We discuss different types of large-scale genomic data and recent advances in statistical methods and computational infrastructure used to address challenges in managing and analyzing such data. We highlight how these data and methods can be used to benefit social science research. Copyright © 2016 Elsevier Inc. All rights reserved.
Context-specific metabolic networks are consistent with experiments.
Becker, Scott A; Palsson, Bernhard O
2008-05-16
Reconstructions of cellular metabolism are publicly available for a variety of different microorganisms and some mammalian genomes. To date, these reconstructions are "genome-scale" and strive to include all reactions implied by the genome annotation, as well as those with direct experimental evidence. Clearly, many of the reactions in a genome-scale reconstruction will not be active under particular conditions or in a particular cell type. Methods to tailor these comprehensive genome-scale reconstructions into context-specific networks will aid predictive in silico modeling for a particular situation. We present a method called Gene Inactivity Moderated by Metabolism and Expression (GIMME) to achieve this goal. The GIMME algorithm uses quantitative gene expression data and one or more presupposed metabolic objectives to produce the context-specific reconstruction that is most consistent with the available data. Furthermore, the algorithm provides a quantitative inconsistency score indicating how consistent a set of gene expression data is with a particular metabolic objective. We show that this algorithm produces results consistent with biological experiments and intuition for adaptive evolution of bacteria, rational design of metabolic engineering strains, and human skeletal muscle cells. This work represents progress towards producing constraint-based models of metabolism that are specific to the conditions where the expression profiling data is available.
Haddad, Diana; Bilcikova, Erika; Witney, Adam A.; Carlton, Jane M.; White, Charles E.; Blair, Peter L.; Chattopadhyay, Rana; Russell, Joshua; Abot, Esteban; Charoenvit, Yupin; Aguiar, Joao C.; Carucci, Daniel J.; Weiss, Walter R.
2004-01-01
We describe a novel approach for identifying target antigens for preerythrocytic malaria vaccines. Our strategy is to rapidly test hundreds of DNA vaccines encoding exons from the Plasmodium yoelii yoelii genomic sequence. In this antigen identification method, we measure reduction in parasite burden in the liver after sporozoite challenge in mice. Orthologs of protective P. y. yoelii genes can then be identified in the genomic databases of Plasmodium falciparum and Plasmodium vivax and investigated as candidate antigens for a human vaccine. A pilot study to develop the antigen identification method approach used 192 P. y. yoelii exons from genes expressed during the sporozoite stage of the life cycle. A total of 182 (94%) exons were successfully cloned into a DNA immunization vector with the Gateway cloning technology. To assess immunization strategies, mice were vaccinated with 19 of the new DNA plasmids in addition to the well-characterized protective plasmid encoding P. y. yoelii circumsporozoite protein. Single plasmid immunization by gene gun identified a novel vaccine target antigen which decreased liver parasite burden by 95% and which has orthologs in P. vivax and P. knowlesi but not P. falciparum. Intramuscular injection of DNA plasmids produced a different pattern of protective responses from those seen with gene gun immunization. Intramuscular immunization with plasmid pools could reduce liver parasite burden in mice despite the fact that none of the plasmids was protective when given individually. We conclude that high-throughput cloning of exons into DNA vaccines and their screening is feasible and can rapidly identify new malaria vaccine candidate antigens. PMID:14977966
Overview Article: Identifying transcriptional cis-regulatory modules in animal genomes
Suryamohan, Kushal; Halfon, Marc S.
2014-01-01
Gene expression is regulated through the activity of transcription factors and chromatin modifying proteins acting on specific DNA sequences, referred to as cis-regulatory elements. These include promoters, located at the transcription initiation sites of genes, and a variety of distal cis-regulatory modules (CRMs), the most common of which are transcriptional enhancers. Because regulated gene expression is fundamental to cell differentiation and acquisition of new cell fates, identifying, characterizing, and understanding the mechanisms of action of CRMs is critical for understanding development. CRM discovery has historically been challenging, as CRMs can be located far from the genes they regulate, have few readily-identifiable sequence characteristics, and for many years were not amenable to high-throughput discovery methods. However, the recent availability of complete genome sequences and the development of next-generation sequencing methods has led to an explosion of both computational and empirical methods for CRM discovery in model and non-model organisms alike. Experimentally, CRMs can be identified through chromatin immunoprecipitation directed against transcription factors or histone post-translational modifications, identification of nucleosome-depleted “open” chromatin regions, or sequencing-based high-throughput functional screening. Computational methods include comparative genomics, clustering of known or predicted transcription factor binding sites, and supervised machine-learning approaches trained on known CRMs. All of these methods have proven effective for CRM discovery, but each has its own considerations and limitations, and each is subject to a greater or lesser number of false-positive identifications. Experimental confirmation of predictions is essential, although shortcomings in current methods suggest that additional means of validation need to be developed. PMID:25704908
A new strategy for genome assembly using short sequence reads and reduced representation libraries.
Young, Andrew L; Abaan, Hatice Ozel; Zerbino, Daniel; Mullikin, James C; Birney, Ewan; Margulies, Elliott H
2010-02-01
We have developed a novel approach for using massively parallel short-read sequencing to generate fast and inexpensive de novo genomic assemblies comparable to those generated by capillary-based methods. The ultrashort (<100 base) sequences generated by this technology pose specific biological and computational challenges for de novo assembly of large genomes. To account for this, we devised a method for experimentally partitioning the genome using reduced representation (RR) libraries prior to assembly. We use two restriction enzymes independently to create a series of overlapping fragment libraries, each containing a tractable subset of the genome. Together, these libraries allow us to reassemble the entire genome without the need of a reference sequence. As proof of concept, we applied this approach to sequence and assemble the majority of the 125-Mb Drosophila melanogaster genome. We subsequently demonstrate the accuracy of our assembly method with meaningful comparisons against the currently available D. melanogaster reference genome (dm3). The ease of assembly and accuracy for comparative genomics suggest that our approach will scale to future mammalian genome-sequencing efforts, saving both time and money without sacrificing quality.
Meng, Jia; Kanzaki, Gregory; Meas, Diane; Lam, Christopher K; Crummer, Heather; Tain, Justina; Xu, H Howard
2012-04-01
Regulated antisense RNA (asRNA) expression has been employed successfully in Gram-positive bacteria for genome-wide essential gene identification and drug target determination. However, there have been no published reports describing the application of asRNA gene silencing for comprehensive analyses of essential genes in Gram-negative bacteria. In this study, we report the first genome-wide identification of asRNA constructs for essential genes in Escherichia coli. We screened 250 000 library transformants for conditional growth inhibitory recombinant clones from two shotgun genomic libraries of E. coli using a paired-termini expression vector (pHN678). After sequencing plasmid inserts of 675 confirmed inducer sensitive cell clones, we identified 152 separate asRNA constructs of which 134 inserts came from essential genes, while 18 originated from nonessential genes (but share operons with essential genes). Among the 79 individual essential genes silenced by these asRNA constructs, 61 genes (77%) engage in processes related to protein synthesis. The cell-based assays of an asRNA clone targeting fusA (encoding elongation factor G) showed that the induced cells were sensitized 12-fold to fusidic acid, a known specific inhibitor. Our results demonstrate the utility of the paired-termini expression vector and feasibility of large-scale gene silencing in E. coli using regulated asRNA expression. © 2012 Federation of European Microbiological Societies. Published by Blackwell Publishing Ltd. All rights reserved.
Chaudhary, Neha; Tøndel, Kristin; Bhatnagar, Rakesh; dos Santos, Vítor A P Martins; Puchałka, Jacek
2016-03-01
Genome-Scale Metabolic Reconstructions (GSMRs), along with optimization-based methods, predominantly Flux Balance Analysis (FBA) and its derivatives, are widely applied for assessing and predicting the behavior of metabolic networks upon perturbation, thereby enabling identification of potential novel drug targets and biotechnologically relevant pathways. The abundance of alternate flux profiles has led to the evolution of methods to explore the complete solution space, aiming to increase the accuracy of predictions. Herein we present a novel, generic algorithm to characterize the entire flux space of a GSMR upon application of FBA, leading to the optimal value of the objective (the optimal flux space). Our method employs Modified Latin-Hypercube Sampling (LHS) to effectively border the optimal space, followed by Principal Component Analysis (PCA) to identify and explain the major sources of variability within it. The approach was validated with the elementary mode analysis of a smaller network of Saccharomyces cerevisiae and applied to the GSMR of Pseudomonas aeruginosa PAO1 (iMO1086). It is shown to surpass the commonly used Monte Carlo Sampling (MCS) in providing a more uniform coverage of a much larger network with fewer samples. Results show that although many fluxes are identified as variable upon fixing the objective value, the majority of the variability can be reduced to several main patterns arising from a few alternative pathways. In iMO1086, the initial variability of 211 reactions could almost entirely be explained by 7 alternative pathway groups. These findings imply that the possibilities to reroute greater portions of flux may be limited within metabolic networks of bacteria. Furthermore, the optimal flux space is subject to change with environmental conditions. Our method may be a useful device to validate the predictions made by FBA-based tools, by describing the optimal flux space associated with these predictions, and thus to improve them.
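For readers unfamiliar with Latin hypercube sampling, the basic idea can be sketched as follows. This is plain LHS on the unit hypercube, not the authors' Modified LHS: it ignores the FBA constraints and the optimal flux space entirely, and the function name and parameters are assumptions made for illustration.

```python
import random


def latin_hypercube(n_samples, n_dims, seed=0):
    """Draw n_samples points in [0, 1)^n_dims, one per stratum in every dimension."""
    rng = random.Random(seed)
    columns = []
    for _ in range(n_dims):
        # Split [0, 1) into n_samples equal strata and draw one value from each.
        column = [(i + rng.random()) / n_samples for i in range(n_samples)]
        # Shuffle so that strata are paired at random across dimensions.
        rng.shuffle(column)
        columns.append(column)
    # Transpose the per-dimension columns into sample points.
    return [list(point) for point in zip(*columns)]


samples = latin_hypercube(n_samples=5, n_dims=3)
for point in samples:
    print([round(x, 3) for x in point])
```

Because every stratum of every dimension is hit exactly once, coverage is more even than with the same number of independent uniform draws, which is the property the abstract contrasts with Monte Carlo sampling.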
Schrider, Daniel R.; Kern, Andrew D.
2015-01-01
The comparative genomics revolution of the past decade has enabled the discovery of functional elements in the human genome via sequence comparison. While that is so, an important class of elements, those specific to humans, is entirely missed by searching for sequence conservation across species. Here we present an analysis based on variation data among human genomes that utilizes a supervised machine learning approach for the identification of human-specific purifying selection in the genome. Using only allele frequency information from the complete low-coverage 1000 Genomes Project data set in conjunction with a support vector machine trained from known functional and nonfunctional portions of the genome, we are able to accurately identify portions of the genome constrained by purifying selection. Our method identifies previously known human-specific gains or losses of function and uncovers many novel candidates. Candidate targets for gain and loss of function along the human lineage include numerous putative regulatory regions of genes essential for normal development of the central nervous system, including a significant enrichment of gain of function events near neurotransmitter receptor genes. These results are consistent with regulatory turnover being a key mechanism in the evolution of human-specific characteristics of brain development. Finally, we show that the majority of the genome is unconstrained by natural selection currently, in agreement with what has been estimated from phylogenetic methods but in sharp contrast to estimates based on transcriptomics or other high-throughput functional methods. PMID:26590212
Jayashree, B; Jagadeesh, V T; Hoisington, D
2008-05-01
The availability of complete, annotated genomic sequence information in model organisms is a rich resource that can be extended to understudied orphan crops through comparative genomic approaches. We report here a software tool (cisprimertool) for the identification of conserved intron scanning regions using expressed sequence tag alignments to a completely sequenced model crop genome. The method used is based on earlier studies reporting the assessment of conserved intron scanning primers (called CISP) within relatively conserved exons located near exon-intron boundaries from onion, banana, sorghum and pearl millet alignments with rice. The tool is freely available to academic users at http://www.icrisat.org/gt-bt/CISPTool.htm. © 2007 ICRISAT.
Tran, Phuong N.; Savka, Michael A.; Gan, Han Ming
2017-01-01
The genus Pseudomonas is among the most species-rich genera within the Bacteria. To date, its taxonomy is still being revised and updated. Due to the non-standardized procedure and ambiguous thresholds at species level, largely based on 16S rRNA gene or conventional biochemical assay, species identification of publicly available Pseudomonas genomes remains questionable. In this study, we performed a large-scale analysis of all Pseudomonas genomes with species designation (excluding the well-defined P. aeruginosa) and re-evaluated their taxonomic assignment via in silico genome-genome hybridization and/or genetic comparison with valid type species. Three hundred and seventy-three pseudomonad genomes were analyzed and subsequently clustered into 145 distinct genospecies. We detected 207 erroneous labels and corrected 43 to the proper species based on Average Nucleotide Identity Multilocus Sequence Typing (MLST) sequence similarity to the type strain. Surprisingly, more than half of the genomes initially designated as Pseudomonas syringae and Pseudomonas fluorescens should be classified either to a previously described species or to a new genospecies. Notably, high pairwise average nucleotide identity (>95%), indicating species-level similarity, was observed between P. synxantha-P. libanensis, P. psychrotolerans-P. oryzihabitans, and P. kilonensis-P. brassicacearum, which were previously differentiated based on conventional biochemical tests and/or genome-genome hybridization techniques. PMID:28747902
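Grouping genomes by a pairwise identity threshold can be sketched as below. This is an illustrative assumption, not the authors' pipeline: it uses single-linkage grouping with the >95% cutoff mentioned in the abstract, and the genome labels and ANI values are invented for the example.

```python
def cluster_by_ani(genomes, ani, threshold=95.0):
    """Group genomes whose pairwise ANI exceeds the threshold (single linkage)."""
    parent = {g: g for g in genomes}

    def find(g):
        # Follow parent links to the cluster representative, shortening the path.
        while parent[g] != g:
            parent[g] = parent[parent[g]]
            g = parent[g]
        return g

    for (a, b), value in ani.items():
        if value > threshold:
            parent[find(a)] = find(b)  # merge the two clusters

    clusters = {}
    for g in genomes:
        clusters.setdefault(find(g), []).append(g)
    return sorted(clusters.values())


# Hypothetical genomes and ANI percentages, for illustration only.
genomes = ["A", "B", "C", "D"]
ani = {("A", "B"): 97.2, ("A", "C"): 88.1, ("B", "C"): 87.9,
       ("C", "D"): 96.0, ("A", "D"): 88.4, ("B", "D"): 88.0}
clusters = cluster_by_ani(genomes, ani)
print(clusters)
```

Single linkage is the simplest choice and can chain loosely related genomes together; a stricter rule would require every pair in a group to exceed the cutoff.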
Ultrafast Comparison of Personal Genomes via Precomputed Genome Fingerprints.
Glusman, Gustavo; Mauldin, Denise E; Hood, Leroy E; Robinson, Max
2017-01-01
We present an ultrafast method for comparing personal genomes. We transform the standard genome representation (lists of variants relative to a reference) into "genome fingerprints" via locality sensitive hashing. The resulting genome fingerprints can be meaningfully compared even when the input data were obtained using different sequencing technologies, processed using different pipelines, represented in different data formats and relative to different reference versions. Furthermore, genome fingerprints are robust to up to 30% missing data. Because of their reduced size, computation on the genome fingerprints is fast and requires little memory. For example, we could compute all-against-all pairwise comparisons among the 2504 genomes in the 1000 Genomes data set in 67 s at high quality (21 μs per comparison, on a single processor), and achieved a lower quality approximation in just 11 s. Efficient computation enables scaling up a variety of important genome analyses, including quantifying relatedness, recognizing duplicative sequenced genomes in a set, population reconstruction, and many others. The original genome representation cannot be reconstructed from its fingerprint, effectively decoupling genome comparison from genome interpretation; the method thus has significant implications for privacy-preserving genome analytics.
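The reported per-comparison time can be checked from the figures in the abstract. The short calculation below assumes that "all-against-all" means every unordered pair of the 2504 genomes is compared once; the variable names are chosen for illustration.

```python
n_genomes = 2504
total_seconds = 67  # high-quality all-against-all run reported in the abstract

# Number of unordered genome pairs.
pairs = n_genomes * (n_genomes - 1) // 2

# Average time per comparison, in microseconds.
microseconds_per_comparison = total_seconds / pairs * 1e6

print(pairs)
print(round(microseconds_per_comparison, 1))
```

The result comes out at a little over 21 microseconds per comparison, consistent with the quoted figure.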
A PCR primer bank for quantitative gene expression analysis.
Wang, Xiaowei; Seed, Brian
2003-12-15
Although gene expression profiling by microarray analysis is a useful tool for assessing global levels of transcriptional activity, variability associated with the data sets usually requires that observed differences be validated by some other method, such as real-time quantitative polymerase chain reaction (real-time PCR). However, non-specific amplification of non-target genes is frequently observed in the latter, confounding the analysis in approximately 40% of real-time PCR attempts when primer-specific labels are not used. Here we present an experimentally validated algorithm for the identification of transcript-specific PCR primers on a genomic scale that can be applied to real-time PCR with sequence-independent detection methods. An online database, PrimerBank, has been created for researchers to retrieve primer information for their genes of interest. PrimerBank currently contains 147 404 primers encompassing most known human and mouse genes. The primer design algorithm has been tested by conventional and real-time PCR for a subset of 112 primer pairs with a success rate of 98.2%.
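The quoted success rate implies a specific number of working primer pairs, which the abstract does not state directly. The sketch below infers it by rounding; the inferred count is an assumption, not a reported value.

```python
tested_pairs = 112
reported_rate = 0.982  # success rate quoted in the abstract

# Inferred (not reported) number of successful primer pairs.
inferred_successes = round(tested_pairs * reported_rate)

# Recompute the rate from the inferred count to confirm it matches.
recomputed_rate = inferred_successes / tested_pairs

print(inferred_successes)
print(f"{recomputed_rate:.1%}")
```

If the inference holds, only a couple of the tested primer pairs failed.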
Santos, André S; Ramos, Rommel T; Silva, Artur; Hirata, Raphael; Mattos-Guaraldi, Ana L; Meyer, Roberto; Azevedo, Vasco; Felicori, Liza; Pacheco, Luis G C
2018-05-11
Biochemical tests are traditionally used for bacterial identification at the species level in clinical microbiology laboratories. While biochemical profiles are generally efficient for the identification of the most important corynebacterial pathogen Corynebacterium diphtheriae, their ability to differentiate between biovars of this bacterium is still controversial. In addition, the unambiguous identification of emerging human pathogenic species of the genus Corynebacterium may be hampered by highly variable biochemical profiles commonly reported for these species, including Corynebacterium striatum, Corynebacterium amycolatum, Corynebacterium minutissimum, and Corynebacterium xerosis. In order to identify the genomic basis contributing to the biochemical variability observed in phenotypic identification methods of these bacteria, we combined a comprehensive literature review with a bioinformatics approach based on reconstruction of six specific biochemical reactions/pathways in 33 recently released whole genome sequences. We used data retrieved from curated databases (MetaCyc, PathoSystems Resource Integration Center (PATRIC), The SEED, TransportDB, UniProtKB) associated with homology searches by BLAST and profile Hidden Markov Models (HMMs) to detect enzymes participating in the various pathways and performed ab initio protein structure modeling and molecular docking to confirm specific results. We found a differential distribution among the various strains of genes that code for some important enzymes, such as beta-phosphoglucomutase and fructokinase, and also for individual components of carbohydrate transport systems, including the fructose-specific phosphoenolpyruvate-dependent sugar phosphotransferase (PTS) and the ribose-specific ATP-binding cassette (ABC) transporter. Horizontal gene transfer plays a role in the biochemical variability of the isolates, as some genes needed for sucrose fermentation were found in genomic islands.
Notably, using profile HMMs, we identified an enzyme with putative alpha-1,6-glycosidase activity only in some specific strains of C. diphtheriae, and this may aid understanding of the differential abilities of the biovars to utilize glycogen and starch.
Chromatin-associated RNA sequencing (ChAR-seq) maps genome-wide RNA-to-DNA contacts
Jukam, David; Teran, Nicole A; Risca, Viviana I; Smith, Owen K; Johnson, Whitney L; Skotheim, Jan M; Greenleaf, William James
2018-01-01
RNA is a critical component of chromatin in eukaryotes, both as a product of transcription, and as an essential constituent of ribonucleoprotein complexes that regulate both local and global chromatin states. Here, we present a proximity ligation and sequencing method called Chromatin-Associated RNA sequencing (ChAR-seq) that maps all RNA-to-DNA contacts across the genome. Using Drosophila cells, we show that ChAR-seq provides unbiased, de novo identification of targets of chromatin-bound RNAs including nascent transcripts, chromosome-specific dosage compensation ncRNAs, and genome-wide trans-associated RNAs involved in co-transcriptional RNA processing. PMID:29648534
Identification and characterization of nuclear genes involved in photosynthesis in Populus
2014-01-01
Background The gap between the real and potential photosynthetic rate under field conditions suggests that photosynthesis could potentially be improved. Nuclear genes provide possible targets for improving photosynthetic efficiency. Hence, genome-wide identification and characterization of the nuclear genes affecting photosynthetic traits in woody plants would provide key insights on genetic regulation of photosynthesis and identify candidate processes for improvement of photosynthesis. Results Using microarray and bulked segregant analysis strategies, we identified differentially expressed nuclear genes for photosynthesis traits in a segregating population of poplar. We identified 515 differentially expressed genes in this population (FC ≥ 2 or FC ≤ 0.5, P < 0.05), 163 up-regulated and 352 down-regulated. Real-time PCR expression analysis confirmed the microarray data. Singular Enrichment Analysis identified 48 significantly enriched GO terms for molecular functions (28), biological processes (18) and cell components (2). Furthermore, we selected six candidate genes for functional examination by a single-marker association approach, which demonstrated that 20 SNPs in five candidate genes significantly associated with photosynthetic traits, and the phenotypic variance explained by each SNP ranged from 2.3% to 12.6%. This revealed that regulation of photosynthesis by the nuclear genome mainly involves transport, metabolism and response to stimulus functions. Conclusions This study provides new genome-scale strategies for the discovery of potential candidate genes affecting photosynthesis in Populus, and for identification of the functions of genes involved in regulation of photosynthesis. This work also suggests that improving photosynthetic efficiency under field conditions will require the consideration of multiple factors, such as stress responses. PMID:24673936
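The selection criterion stated above (FC ≥ 2 or FC ≤ 0.5, with P < 0.05) can be written as a small filter. This is a minimal sketch of the thresholding step only; the gene names and values are hypothetical, and the authors' actual microarray pipeline is not described in the abstract.

```python
def classify_gene(fold_change, p_value, fc_up=2.0, fc_down=0.5, alpha=0.05):
    """Return 'up', 'down', or None according to the FC and P thresholds."""
    if p_value >= alpha:
        return None  # not significant, regardless of fold change
    if fold_change >= fc_up:
        return "up"
    if fold_change <= fc_down:
        return "down"
    return None  # significant, but the change is too small

# Hypothetical genes: name -> (fold change, P value), for illustration only.
example = {
    "geneA": (3.1, 0.01),
    "geneB": (0.4, 0.03),
    "geneC": (2.5, 0.20),
    "geneD": (1.2, 0.001),
}
calls = {name: classify_gene(fc, p) for name, (fc, p) in example.items()}
```

Applied to the study's data, a filter of this form would yield the reported 163 up-regulated and 352 down-regulated genes, which together make up the 515 differentially expressed genes.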
Advances in molecular biological methods are continually being brought to bear on human health research, from a basic understanding of systems biology to identification of toxicity pathways for environmental stressors and to correlations of molecular indicators with physiological...
Scaglione, Davide; Lanteri, Sergio; Acquadro, Alberto; Lai, Zhao; Knapp, Steven J; Rieseberg, Loren; Portis, Ezio
2012-10-01
Cynara cardunculus (2n = 2× = 34) is a member of the Asteraceae family that contributes significantly to the agricultural economy of the Mediterranean basin. The species includes two cultivated varieties, globe artichoke and cardoon, which are grown mainly for food. Cynara cardunculus is an orphan crop species whose genome/transcriptome has been relatively unexplored, especially in comparison to other Asteraceae crops. Hence, there is a significant need to improve its genomic resources through the identification of novel genes and sequence-based markers, to design new breeding schemes aimed at increasing quality and crop productivity. We report the outcome of cDNA sequencing and assembly for eleven accessions of C. cardunculus. Sequencing of three mapping parental genotypes using Roche 454-Titanium technology generated 1.7 × 10⁶ reads, which were assembled into 38,726 reference transcripts covering 32 Mbp. Putative enzyme-encoding genes were annotated using the KEGG-database. Transcription factors and candidate resistance genes were surveyed as well. Paired-end sequencing was done for cDNA libraries of eight other representative C. cardunculus accessions on an Illumina Genome Analyzer IIx, generating 46 × 10⁶ reads. Alignment of the IGA and 454 reads to reference transcripts led to the identification of 195,400 SNPs with a Bayesian probability exceeding 95%; a validation rate of 90% was obtained by Sanger-sequencing of a subset of contigs. These results demonstrate that the integration of data from different NGS platforms enables large-scale transcriptome characterization, along with massive SNP discovery. This information will contribute to the dissection of key agricultural traits in C. cardunculus and facilitate the implementation of marker-assisted selection programs. © 2012 The Authors. Plant Biotechnology Journal © 2012 Society for Experimental Biology, Association of Applied Biologists and Blackwell Publishing Ltd.
Parallel Continuous Flow: A Parallel Suffix Tree Construction Tool for Whole Genomes
Farreras, Montse
2014-01-01
Abstract The construction of suffix trees for very long sequences is essential for many applications, and it plays a central role in the bioinformatic domain. With the advent of modern sequencing technologies, biological sequence databases have grown dramatically. Also, the methodologies required to analyze these data have become more complex every day, requiring fast queries to multiple genomes. In this article, we present parallel continuous flow (PCF), a parallel suffix tree construction method that is suitable for very long genomes. We tested our method for the suffix tree construction of the entire human genome, about 3 GB. We showed that PCF can scale gracefully as the size of the input genome grows. Our method can work with an efficiency of 90% with 36 processors and 55% with 172 processors. We can index the human genome in 7 minutes using 172 processes. PMID:24597675
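The efficiency figures above can be turned into speedups using the standard definition of parallel efficiency, E = S / p, where S is the speedup and p the number of processors. The sketch below assumes the abstract uses that standard definition, which it does not state explicitly.

```python
def speedup_from_efficiency(efficiency: float, n_processors: int) -> float:
    """Invert E = S / p to recover the speedup S."""
    return efficiency * n_processors

# Reported operating points: 90% at 36 processors, 55% at 172 processors.
speedup_36 = speedup_from_efficiency(0.90, 36)
speedup_172 = speedup_from_efficiency(0.55, 172)
```

Under this assumption the speedup is about 32 on 36 processors and about 95 on 172, so adding processors still shortens the run even though each one is used less efficiently.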
Bias correction for estimated QTL effects using the penalized maximum likelihood method.
Zhang, J; Yue, C; Zhang, Y-M
2012-04-01
A penalized maximum likelihood method has been proposed as an important approach to the detection of epistatic quantitative trait loci (QTL). However, this approach is not optimal in two special situations: (1) closely linked QTL with effects in opposite directions and (2) small-effect QTL, because the method produces downwardly biased estimates of QTL effects. The present study aims to correct the bias by using correction coefficients and shifting from the use of a uniform prior on the variance parameter of a QTL effect to that of a scaled inverse chi-square prior. The results of Monte Carlo simulation experiments show that the improved method increases the power from 25 to 88% in the detection of two closely linked QTL of equal size in opposite directions and from 60 to 80% in the identification of QTL with small effects (0.5% of the total phenotypic variance). We used the improved method to detect QTL responsible for the barley kernel weight trait using 145 doubled haploid lines developed in the North American Barley Genome Mapping Project. Application of the proposed method to other shrinkage estimation of QTL effects is discussed.
Applications of graph theory in protein structure identification
2011-01-01
There is a growing interest in the identification of proteins on the proteome-wide scale. Among the different kinds of protein structure identification methods, graph-theoretic methods are particularly effective. Due to their lower costs, higher effectiveness and many other advantages, they have drawn increasing attention from researchers. Specifically, graph-theoretic methods have been widely used in homology identification, side-chain cluster identification, peptide sequencing and so on. This paper reviews several methods for solving protein structure identification problems using graph theory. We mainly introduce classical methods and mathematical models including homology modeling based on clique finding, identification of side-chain clusters in protein structures upon graph spectrum, and de novo peptide sequencing via tandem mass spectrometry using the spectrum graph model. In addition, concluding remarks and future priorities of each method are given. PMID:22165974
Brabec, Jan; Kostadinova, Aneta; Scholz, Tomáš; Littlewood, D Timothy J
2015-06-19
The genus Diplostomum (Platyhelminthes: Trematoda: Diplostomidae) is a diverse group of freshwater parasites with complex life-cycles and global distribution. The larval stages are important pathogens causing eye fluke disease implicated in substantial impacts on natural fish populations and losses in aquaculture. However, the problematic species delimitation and difficulties in the identification of larval stages hamper the assessment of the distributional and host ranges of Diplostomum spp. and their transmission ecology. Total genomic DNA was isolated from adult worms and shotgun sequenced using Illumina MiSeq technology. Mitochondrial (mt) genomes and nuclear ribosomal RNA (rRNA) operons were assembled using established bioinformatic tools and fully annotated. Mt protein-coding genes and nuclear rRNA genes were subjected to phylogenetic analysis by maximum likelihood and the resulting topologies compared. We characterised novel complete mt genomes and nuclear rRNA operons of two closely related species, Diplostomum spathaceum and D. pseudospathaceum. Comparative mt genome assessment revealed that the cox1 gene and its 'barcode' region used for molecular identification are the most conserved regions; instead, nad4 and nad5 genes were identified as most promising molecular diagnostic markers. Using the novel data, we provide the first genome wide estimation of the phylogenetic relationships of the order Diplostomida, one of the two fundamental lineages of the Digenea. Analyses of the mitogenomic data invariably recovered the Diplostomidae as a sister lineage of the order Plagiorchiida rather than as a basal lineage of the Diplostomida as inferred in rDNA phylogenies; this was concordant with the mt gene order of Diplostomum spp. exhibiting closer match to the conserved gene order of the Plagiorchiida. 
Complete sequences of the mt genome and rRNA operon of two species of Diplostomum provide a valuable resource for novel genetic markers for species delineation and large-scale molecular epidemiology and disease ecology studies based on the most accessible life-cycle stages of eye flukes.
Data science approaches to pharmacogenetics.
Penrod, N M; Moore, J H
2014-01-01
Pharmacogenetic studies rely on applied statistics to evaluate genetic data describing natural variation in response to pharmacotherapeutics such as drugs and vaccines. In the beginning, these studies were based on candidate gene approaches that specifically focused on efficacy or adverse events correlated with variants of single genes. This hypothesis-driven method required the researcher to have a priori knowledge of which genes or gene sets to investigate. According to rational design, the focus of these studies has been on drug metabolizing enzymes, drug transporters, and drug targets. As technology has progressed, these studies have transitioned to hypothesis-free explorations where markers across the entire genome can be measured in large-scale, population-based, genome-wide association studies (GWAS). This enables identification of novel genetic biomarkers, therapeutic targets, and analysis of gene-gene interactions, which may reveal molecular mechanisms of drug activities. Ultimately, the challenge is to utilize gene-drug associations to create dosing algorithms based on individual genotypes, which will guide physicians and ensure they prescribe the correct dose of the correct drug the first time, eliminating trial-and-error and adverse events. We review here basic concepts and applications of data science to the genetic analysis of pharmacologic outcomes.
Integrative and conjugative elements and their hosts: composition, distribution and organization
Touchon, Marie; Rocha, Eduardo P. C.
2017-01-01
Abstract Conjugation of single-stranded DNA drives horizontal gene transfer between bacteria and was widely studied in conjugative plasmids. The organization and function of integrative and conjugative elements (ICE), even if they are more abundant, was only studied in a few model systems. Comparative genomics of ICE has been precluded by the difficulty in finding and delimiting these elements. Here, we present the results of a method that circumvents these problems by requiring only the identification of the conjugation genes and the species’ pan-genome. We delimited 200 ICEs and this allowed the first large-scale characterization of these elements. We quantified the presence in ICEs of a wide set of functions associated with the biology of mobile genetic elements, including some that are typically associated with plasmids, such as partition and replication. Protein sequence similarity networks and phylogenetic analyses revealed that ICEs are structured in functional modules. Integrases and conjugation systems have different evolutionary histories, even if the gene repertoires of ICEs can be grouped according to conjugation type. Our characterization of the composition and organization of ICEs paves the way for future functional and evolutionary analyses of their cargo genes, most of which are of unknown function. PMID:28911112
Howard, Thomas P.; Hayward, Andrew P.; Tordillos, Anthony; Fragoso, Christopher; Moreno, Maria A.; Tohme, Joe; Kausch, Albert P.; Mottinger, John P.; Dellaporta, Stephen L.
2014-01-01
Since their initial discovery, transposons have been widely used as mutagens for forward and reverse genetic screens in a range of organisms. The problems of high copy number and sequence divergence among related transposons have often limited the efficiency at which tagged genes can be identified. A method was developed to identify the locations of Mutator (Mu) transposons in the Zea mays genome using a simple enrichment method combined with genome resequencing to identify transposon junction fragments. The sequencing library was prepared from genomic DNA by digesting with a restriction enzyme that cuts within a perfectly conserved motif of the Mu terminal inverted repeats (TIR). Paired-end reads containing Mu TIR sequences were computationally identified and chromosomal sequences flanking the transposon were mapped to the maize reference genome. This method has been used to identify Mu insertions in a number of alleles and to isolate the previously unidentified lazy plant1 (la1) gene. The la1 gene is required for the negatively gravitropic response of shoots, and mutant plants lack the ability to sense gravity. Using bioinformatic and fluorescence microscopy approaches, we show that the la1 gene encodes a cell membrane and nuclear localized protein. Our Mu-Taq method is readily adaptable to identify the genomic locations of any insertion of a known sequence in any organism using any sequencing platform. PMID:24498020
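The computational step described above, finding reads that contain the TIR sequence and keeping the adjacent chromosomal sequence, can be sketched as follows. This is a simplified illustration, not the authors' pipeline: the motif `GATTACAG` is a placeholder rather than the real Mu TIR sequence, matching is exact, and the assumption that the flank lies immediately downstream of the motif is made only for the example.

```python
_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")

def reverse_complement(seq: str) -> str:
    """Reverse complement of an uppercase DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]

def flanking_sequences(reads, tir_motif, min_flank=5):
    """Collect the sequence that follows tir_motif in each read, on either strand."""
    flanks = []
    for read in reads:
        for oriented in (read, reverse_complement(read)):
            pos = oriented.find(tir_motif)
            if pos == -1:
                continue
            flank = oriented[pos + len(tir_motif):]
            if len(flank) >= min_flank:
                flanks.append(flank)
            break  # use the first strand on which the motif is found
    return flanks

# Placeholder motif and toy reads, for illustration only.
MOTIF = "GATTACAG"
reads = [
    "TTTT" + MOTIF + "GGGCCCAAA",             # motif on the forward strand
    "CCCCCCCCCC",                             # no motif on either strand
    reverse_complement(MOTIF + "TTGACCA"),    # motif on the reverse strand
]
flanks = flanking_sequences(reads, MOTIF)
```

In a real analysis the extracted flanks would then be aligned to the reference genome with a read mapper to locate each insertion site.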
Kernel methods for large-scale genomic data analysis
Xing, Eric P.; Schaid, Daniel J.
2015-01-01
Machine learning, particularly kernel methods, has been demonstrated as a promising new tool to tackle the challenges imposed by today’s explosive data growth in genomics. Kernel methods provide a practical and principled approach to learning how a large number of genetic variants are associated with complex phenotypes, helping to reveal the complexity in the relationship between the genetic markers and the outcome of interest. In this review, we highlight the potential key role they will have in modern genomic data processing, especially with regard to integration with classical methods for gene prioritizing, prediction and data fusion. PMID:25053743
Stress Sensors and Signal Transducers in Cyanobacteria
Los, Dmitry A.; Zorina, Anna; Sinetova, Maria; Kryazhov, Sergey; Mironov, Kirill; Zinchenko, Vladislav V.
2010-01-01
In living cells, the perception of environmental stress and the subsequent transduction of stress signals are primary events in the acclimation to changes in the environment. Some molecular sensors and transducers of environmental stress cannot be identified by traditional and conventional methods. Based on genomic information, a systematic approach has been applied to the solution of this problem in cyanobacteria, involving mutagenesis of potential sensors and signal transducers in combination with DNA microarray analyses for the genome-wide expression of genes. Forty-five genes for the histidine kinases (Hiks), 12 genes for serine-threonine protein kinases (Spks), 42 genes for response regulators (Rres), seven genes for RNA polymerase sigma factors, and nearly 70 genes for transcription factors have been successfully inactivated by targeted mutagenesis in the unicellular cyanobacterium Synechocystis sp. PCC 6803. Screening of mutant libraries by genome-wide DNA microarray analysis under various stress and non-stress conditions has allowed identification of proteins that perceive and transduce signals of environmental stress. Here we summarize recent progress in the identification of sensory and regulatory systems, including Hiks, Rres, Spks, sigma factors, transcription factors, and the role of genomic DNA supercoiling in the regulation of the responses of cyanobacterial cells to various types of stress. PMID:22294932
Wang, Ruijia; Nambiar, Ram; Zheng, Dinghai
2018-01-01
Abstract PolyA_DB is a database cataloging cleavage and polyadenylation sites (PASs) in several genomes. Previous versions were based mainly on expressed sequence tags (ESTs), which were limited in number and could lead to inaccurate PAS identification due to the presence of internal A-rich sequences in transcripts. Here, we present an updated version of the database based solely on deep sequencing data. First, PASs are mapped by the 3′ region extraction and deep sequencing (3′READS) method, ensuring unequivocal PAS identification. Second, a large volume of data based on diverse biological samples increases PAS coverage by 3.5-fold over the EST-based version and provides PAS usage information. Third, strand-specific RNA-seq data are used to extend annotated 3′ ends of genes to obtain more thorough annotations of alternative polyadenylation (APA) sites. Fourth, conservation information of PAS across mammals sheds light on the significance of APA sites. The database (URL: http://www.polya-db.org/v3) currently holds PASs in human, mouse, rat and chicken, and has links to the UCSC genome browser for further visualization and for integration with other genomic data. PMID:29069441
Kamoun, Choumouss; Payen, Thibaut; Hua-Van, Aurélie; Filée, Jonathan
2013-10-11
Insertion Sequences (ISs) and their non-autonomous derivatives (MITEs) are important components of prokaryotic genomes inducing duplication, deletion, rearrangement or lateral gene transfers. Although ISs and MITEs are relatively simple and basic genetic elements, their detection remains a difficult task due to their remarkable sequence diversity. With the advent of high-throughput genome and metagenome sequencing technologies, the development of fast, reliable and sensitive methods for IS and MITE detection has become an important challenge. So far, almost all studies dealing with prokaryotic transposons have used classical BLAST-based detection methods against reference libraries. Here we introduce alternative methods of detection, either taking advantage of the structural properties of the elements (de novo methods) or using an additional library-based method based on profile HMM searches. In this study, we have developed three different workflows dedicated to IS and MITE detection: the first two use de novo methods detecting either repeated sequences or the presence of Inverted Repeats; the third one uses 28 in-house transposase alignment profiles with HMM search methods. We have compared the respective performances of each method using a reference dataset of 30 archaeal and 30 bacterial genomes in addition to simulated and real metagenomes. Compared to a BLAST-based method using ISFinder as library, de novo methods significantly improve IS and MITE detection. For example, in the 30 archaeal genomes, we discovered 30 new elements (+20%) in addition to the 141 multi-copy elements already detected by the BLAST approach. Many of the new elements correspond to ISs belonging to unknown or highly divergent families. The total number of MITEs has even doubled with the discovery of elements displaying very limited sequence similarities with their respective autonomous partners (mainly in the Inverted Repeats of the elements).
Concerning metagenomes, with the exception of short-read data (<300 bp), for which both techniques seem equally limited, profile HMM searches considerably improve the detection of transposase-encoding genes (up to +50%) while generating a low level of false positives compared to BLAST-based methods. Compared to classical BLAST-based methods, the sensitivity of the de novo and profile HMM methods developed in this study allows a better and more reliable detection of transposons in prokaryotic genomes and metagenomes. We believe that future studies involving IS and MITE identification in genomic data should combine at least one de novo and one library-based method, with optimal results obtained by running the two de novo methods in addition to a library-based search. For metagenomic data, profile HMM search should be favored; a BLAST-based step is only useful for the final annotation into groups and families.
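One of the structural properties mentioned above is the presence of terminal Inverted Repeats: the start of an element matches the reverse complement of its end. The sketch below measures the longest perfect terminal inverted repeat of a candidate element. It is a deliberately simplified illustration with exact matching and a hypothetical toy sequence; the workflows in the study are not specified at this level of detail and presumably tolerate mismatches.

```python
_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")

def reverse_complement(seq: str) -> str:
    """Reverse complement of an uppercase DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]

def terminal_inverted_repeat_length(element: str, max_len: int = 50) -> int:
    """Length of the longest perfect terminal inverted repeat of `element`.

    The first k bases form an inverted repeat with the last k bases exactly
    when the element and its reverse complement share a prefix of length k.
    """
    rc = reverse_complement(element)
    limit = min(max_len, len(element) // 2)
    k = 0
    while k < limit and element[k] == rc[k]:
        k += 1
    return k

# Toy element: 7-bp left end, arbitrary interior, reverse-complemented right end.
left_end = "GGCACTG"
toy_element = left_end + "AAAAAAAACCCC" + reverse_complement(left_end)
tir_length = terminal_inverted_repeat_length(toy_element)
no_tir_length = terminal_inverted_repeat_length("AAAACCCC")
```

A de novo scan would apply a test of this kind to candidate regions across a genome and keep those whose repeat length exceeds a chosen threshold.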
Oba, Mami; Tsuchiaka, Shinobu; Omatsu, Tsutomu; Katayama, Yukie; Otomaru, Konosuke; Hirata, Teppei; Aoki, Hiroshi; Murata, Yoshiteru; Makino, Shinji; Nagai, Makoto; Mizutani, Tetsuya
2018-01-08
We tested the usefulness of the SureSelect target enrichment system, a comprehensive viral nucleic acid detection method, for rapid identification of viral pathogens in feces samples of cattle, pigs and goats. This system enriches nucleic acids of target viruses in clinical/field samples by using a library of biotinylated RNAs with sequences complementary to the target viruses. The enriched nucleic acids are amplified by PCR and subjected to next generation sequencing to identify the target viruses. In many samples, the SureSelect target enrichment method increased the efficiency of detection of the viruses listed in the biotinylated RNA library. Furthermore, this method enabled us to determine the nearly full-length genome sequence of porcine parainfluenza virus 1 and greatly increased Breadth, a value indicating the ratio of the mapping consensus length in the reference genome, in pig samples. Our data showed the usefulness of the SureSelect target enrichment system for comprehensive analysis of genomic information of various viruses in field samples. Copyright © 2017 Elsevier Inc. All rights reserved.
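The Breadth value mentioned above can be illustrated as the fraction of reference positions covered by at least one mapped read. That reading of the abstract's definition is an assumption, and the interval coordinates below are hypothetical; the sketch uses half-open (start, end) intervals on the reference.

```python
def breadth_of_coverage(intervals, reference_length: int) -> float:
    """Fraction of reference positions covered by at least one interval.

    `intervals` holds half-open (start, end) alignments on the reference.
    """
    covered = 0
    current_end = 0  # right edge of the region already counted
    for start, end in sorted(intervals):
        start = max(start, current_end)   # skip positions already counted
        end = min(end, reference_length)  # clip to the reference
        if end > start:
            covered += end - start
            current_end = end
    return covered / reference_length

# Hypothetical alignments on a 1000-bp reference.
alignments = [(0, 100), (50, 150), (300, 400), (10, 20)]
breadth = breadth_of_coverage(alignments, 1000)
```

With this definition, a nearly full-length genome corresponds to a Breadth close to 1.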
Development of Mycoplasma synoviae (MS) core genome multilocus sequence typing (cgMLST) scheme.
Ghanem, Mostafa; El-Gazzar, Mohamed
2018-05-01
Mycoplasma synoviae (MS) is a poultry pathogen with reported increased prevalence and virulence in recent years. MS strain identification is essential for prevention, control efforts and epidemiological outbreak investigations. Multiple multilocus-based sequence typing schemes have been developed for MS, yet the resolution of these schemes could be limited for outbreak investigation. The cost of whole genome sequencing has become close to that of sequencing the seven MLST targets; however, there is no standardized method for typing MS strains based on whole genome sequences. In this paper, we propose a core genome multilocus sequence typing (cgMLST) scheme as a standardized and reproducible method for typing MS based on whole genome sequences. A diverse set of 25 MS whole genome sequences was used to identify 302 core genome genes as cgMLST targets (35.5% of the MS genome), and 44 whole genome sequences of MS isolates from six countries in four continents were typed using this scheme. cgMLST-based phylogenetic trees displayed a high degree of agreement with core genome SNP-based analysis and available epidemiological information. cgMLST allowed evaluation of two conventional MLST schemes of MS. The high discriminatory power of cgMLST allowed differentiation between samples of the same conventional MLST type. cgMLST represents a standardized, accurate, highly discriminatory, and reproducible method for differentiation between MS isolates. Like conventional MLST, it provides stable and expandable nomenclature, allowing for comparing and sharing the typing results between different laboratories worldwide. Copyright © 2018 The Authors. Published by Elsevier B.V. All rights reserved.
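In MLST-style typing, each isolate is summarized as a profile of allele numbers, one per target locus, and isolates are compared by counting the loci at which their alleles differ. The sketch below shows that comparison on hypothetical profiles; the locus names, allele numbers and the handling of missing loci are assumptions for illustration, not details taken from the proposed scheme.

```python
def allelic_distance(profile_a: dict, profile_b: dict) -> int:
    """Number of loci, called in both profiles, at which the alleles differ.

    Profiles map locus name -> allele number, with None for a missing call.
    """
    shared = [
        locus for locus in profile_a
        if profile_a[locus] is not None and profile_b.get(locus) is not None
    ]
    return sum(1 for locus in shared if profile_a[locus] != profile_b[locus])

# Hypothetical profiles over four loci (a real scheme here would use 302).
isolate_1 = {"locus1": 1, "locus2": 4, "locus3": 2, "locus4": None}
isolate_2 = {"locus1": 1, "locus2": 5, "locus3": 3, "locus4": 7}
distance = allelic_distance(isolate_1, isolate_2)
```

Because many more loci are compared than in a seven-gene scheme, two isolates with identical conventional MLST types can still show a non-zero distance, which is the source of the higher discriminatory power described above.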
Iqbal, Muhammad; Hayat, Maqsood
2016-05-01
Gene splicing is a vital source of protein diversity. Precise removal of introns and joining of exons is a central task in eukaryotic gene expression, as exons are usually interrupted by introns. Identification of splicing sites through experimental techniques is a complicated and time-consuming task. With the avalanche of genome sequences generated in the post-genomic age, it remains challenging to develop an automatic, robust and reliable computational method for fast and effective identification of splicing sites. In this study, a hybrid model, "iSS-Hyb-mRMR", is proposed for quick and accurate identification of splicing sites. Two sample representation methods, namely pseudo trinucleotide composition (PseTNC) and pseudo tetranucleotide composition (PseTetraNC), were used to extract numerical descriptors from DNA sequences. The hybrid model was developed by concatenating PseTNC and PseTetraNC. To select highly discriminative features, the minimum redundancy maximum relevance algorithm was applied to the hybrid feature space. The performance of these feature representation methods was tested using various classification algorithms, including K-nearest neighbor, probabilistic neural network, general regression neural network, and fitting network. The jackknife test was used to evaluate performance on two benchmark datasets, S1 and S2. The proposed predictor achieved an accuracy of 93.26%, sensitivity of 88.77%, and specificity of 97.78% for S1, and an accuracy of 94.12%, sensitivity of 87.14%, and specificity of 98.64% for S2. The performance of the proposed model is higher than that of existing methods in the literature so far, and the model should be useful for studying the mechanism of RNA splicing and in related research. Copyright © 2016 Elsevier Ireland Ltd. All rights reserved.
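The feature extraction described above starts from k-mer composition. The sketch below computes plain trinucleotide composition, a 64-dimensional vector of normalized counts of overlapping 3-mers. It is an illustration of the base representation only: the additional "pseudo" components that PseTNC appends are not described in the abstract and are omitted here.

```python
from itertools import product

# All 64 trinucleotides in a fixed (lexicographic) order.
TRINUCLEOTIDES = ["".join(p) for p in product("ACGT", repeat=3)]

def trinucleotide_composition(seq: str) -> list:
    """Normalized frequencies of overlapping 3-mers, in TRINUCLEOTIDES order."""
    seq = seq.upper()
    counts = dict.fromkeys(TRINUCLEOTIDES, 0)
    total = 0
    for i in range(len(seq) - 2):
        kmer = seq[i:i + 3]
        if kmer in counts:  # skip windows containing ambiguous bases such as N
            counts[kmer] += 1
            total += 1
    if total == 0:
        return [0.0] * len(TRINUCLEOTIDES)
    return [counts[k] / total for k in TRINUCLEOTIDES]

features = trinucleotide_composition("ACGTACGT")
acg_frequency = features[TRINUCLEOTIDES.index("ACG")]
```

A tetranucleotide version follows the same pattern with `repeat=4`, giving 256 dimensions; concatenating the two vectors produces a hybrid feature space of the kind the abstract describes before feature selection.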
Daane, Jacob M.; Rohner, Nicolas; Konstantinidis, Peter; Djuranovic, Sergej; Harris, Matthew P.
2016-01-01
The identification of genetic mechanisms underlying evolutionary change is critical to our understanding of natural diversity, but is presently limited by the lack of genetic and genomic resources for most species. Here, we present a new comparative genomic approach that can be applied to a broad taxonomic sampling of nonmodel species to investigate the genetic basis of evolutionary change. Using our analysis pipeline, we show that duplication and divergence of fgfr1a is correlated with the reduction of scales within fishes of the genus Phoxinellus. As a parallel genetic mechanism is observed in scale-reduction within independent lineages of cypriniforms, our finding exposes significant developmental constraint guiding morphological evolution. In addition, we identified fixed variation in fgf20a within Phoxinellus and demonstrated that combinatorial loss-of-function of fgfr1a and fgf20a within zebrafish phenocopies the evolved scalation pattern. Together, these findings reveal epistatic interactions between fgfr1a and fgf20a as a developmental mechanism regulating skeletal variation among fishes. PMID:26452532
Wang, Ya-Xuan; Gao, Ying-Lian; Liu, Jin-Xing; Kong, Xiang-Zhen; Li, Hai-Jun
2017-09-01
Identifying differentially expressed genes from the thousands of genes is a challenging task. Robust principal component analysis (RPCA) is an efficient method in the identification of differentially expressed genes. RPCA method uses nuclear norm to approximate the rank function. However, theoretical studies showed that the nuclear norm minimizes all singular values, so it may not be the best solution to approximate the rank function. The truncated nuclear norm is defined as the sum of some smaller singular values, which may achieve a better approximation of the rank function than nuclear norm. In this paper, a novel method is proposed by replacing nuclear norm of RPCA with the truncated nuclear norm, which is named robust principal component analysis regularized by truncated nuclear norm (TRPCA). The method decomposes the observation matrix of genomic data into a low-rank matrix and a sparse matrix. Because the significant genes can be considered as sparse signals, the differentially expressed genes are viewed as the sparse perturbation signals. Thus, the differentially expressed genes can be identified according to the sparse matrix. The experimental results on The Cancer Genome Atlas data illustrate that the TRPCA method outperforms other state-of-the-art methods in the identification of differentially expressed genes.
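The difference between the two norms can be written out explicitly. The notation below is an assumption made for illustration and is not taken from the abstract: D is the m × n observation matrix, L the low-rank part, S the sparse part, σ_i(L) the singular values of L in decreasing order, r the truncation parameter, and λ a trade-off weight. The paper's exact formulation may differ.

```latex
\begin{align}
\|L\|_{*} &= \sum_{i=1}^{\min(m,n)} \sigma_i(L), \\
\|L\|_{r} &= \sum_{i=r+1}^{\min(m,n)} \sigma_i(L), \\
\text{RPCA:}\quad & \min_{L,S}\ \|L\|_{*} + \lambda \|S\|_{1}
  \quad \text{s.t.}\quad D = L + S, \\
\text{TRPCA:}\quad & \min_{L,S}\ \|L\|_{r} + \lambda \|S\|_{1}
  \quad \text{s.t.}\quad D = L + S.
\end{align}
```

Because the r largest singular values are excluded from the penalty, only the smaller ones are driven toward zero, which is why the truncated norm can track the rank function more closely than the full nuclear norm.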
Matthiesen, Rune; Kirpekar, Finn
2009-01-01
The idea of identifying or characterizing an RNA molecule based on a mass spectrum of specifically generated RNA fragments has been used in various forms for well over a decade. We have developed software, named RRM for ‘RNA mass mapping’, which can search whole prokaryotic genomes or RNA FASTA sequence databases to identify the origin of a given RNA based on a mass spectrum of RNA fragments. As input, the program uses the masses of the products of specific RNase cleavage of the RNA under investigation. RNase T1 digestion is used here as a demonstration of the usability of the method for RNA identification. The concept for identification is that the masses of the digestion products constitute a specific fingerprint, which characterizes the given RNA. The search algorithm is based on the same principles as those used in peptide mass fingerprinting, but has here been extended to work for both RNA sequence databases and genome searches. A simple and powerful probability model for ranking RNA matches is proposed. We demonstrate viability of the entire setup by identifying the DNA template of a series of RNAs of biological and of in vitro transcriptional origin in complete microbial genomes and by identifying authentic 16S ribosomal RNAs in a ‘small ribosomal subunit RNA’ database. Thus, we present a new tool for rapid identification of unknown RNAs using only a few picomoles of starting material. PMID:19264806
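The fingerprint idea can be sketched with an in silico RNase T1 digest, which cleaves after every guanosine. This is an illustration only and not the RRM software: it compares fragments by base composition (fragments with the same composition have the same mass) instead of computing masses, it omits the probability model, and all function names are hypothetical.

```python
from collections import Counter


def rnase_t1_digest(rna: str) -> list[str]:
    """Split an RNA sequence after every G, as RNase T1 does."""
    fragments, current = [], ""
    for base in rna.upper():
        current += base
        if base == "G":
            fragments.append(current)
            current = ""
    if current:  # 3'-terminal fragment without a trailing G
        fragments.append(current)
    return fragments


def composition(fragment: str) -> tuple[int, int, int, int]:
    """Base composition (A, C, G, U) used as a stand-in for fragment mass."""
    c = Counter(fragment)
    return (c["A"], c["C"], c["G"], c["U"])


def shared_fingerprint(query: str, candidate: str) -> int:
    """Count query fragment compositions that also occur in the candidate digest."""
    cand = Counter(composition(f) for f in rnase_t1_digest(candidate))
    query_comps = Counter(composition(f) for f in rnase_t1_digest(query))
    return sum(min(n, cand[comp]) for comp, n in query_comps.items())


print(rnase_t1_digest("AUGCCGUAAGCU"))  # ['AUG', 'CCG', 'UAAG', 'CU']
```

Ranking database entries by `shared_fingerprint` is a crude substitute for the scoring described in the abstract, but it shows why a set of digestion products can single out one RNA among many.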
Metabolic Network Modeling of Microbial Communities
Biggs, Matthew B.; Medlock, Gregory L.; Kolling, Glynis L.
2015-01-01
Genome-scale metabolic network reconstructions and constraint-based analysis are powerful methods that have the potential to make functional predictions about microbial communities. Current use of genome-scale metabolic networks to characterize the metabolic functions of microbial communities includes species compartmentalization, separating species-level and community-level objectives, dynamic analysis, the “enzyme-soup” approach, multi-scale modeling, and others. There are many challenges inherent to the field, including a need for tools that accurately assign high-level omics signals to individual community members, new automated reconstruction methods that rival manual curation, and novel algorithms for integrating omics data and engineering communities. As technologies and modeling frameworks improve, we expect that there will be proportional advances in the fields of ecology, health science, and microbial community engineering. PMID:26109480
Maggi, Elaine C; Gravina, Silvia; Cheng, Haiying; Piperdi, Bilal; Yuan, Ziqiang; Dong, Xiao; Libutti, Steven K; Vijg, Jan; Montagna, Cristina
2018-01-01
The goal of this study was to develop a method for whole-genome cell-free DNA (cfDNA) methylation analysis in humans and mice, with the ultimate goal of facilitating the identification of tumor-derived DNA methylation changes in the blood. Plasma or serum from patients with pancreatic neuroendocrine tumors or lung cancer, and plasma from a murine model of pancreatic adenocarcinoma, were used to develop a protocol for cfDNA isolation, library preparation and whole-genome bisulfite sequencing of ultra-low quantities of cfDNA, including tumor-specific DNA. The protocol produced high-quality libraries, consistently generating a conversion rate >98%, and will be applicable for the analysis of human and mouse plasma or serum to detect tumor-derived changes in DNA methylation.
Genome alignment with graph data structures: a comparison
2014-01-01
Background Recent advances in rapid, low-cost sequencing have opened up the opportunity to study complete genome sequences. The computational approach of multiple genome alignment allows investigation of evolutionarily related genomes in an integrated fashion, providing a basis for downstream analyses such as rearrangement studies and phylogenetic inference. Graphs have proven to be a powerful tool for coping with the complexity of genome-scale sequence alignments. The potential of graphs to intuitively represent all aspects of genome alignments led to the development of graph-based approaches for genome alignment. These approaches construct a graph from a set of local alignments, and derive a genome alignment through identification and removal of graph substructures that indicate errors in the alignment. Results We compare the structures of commonly used graphs in terms of their abilities to represent alignment information. We describe how the graphs can be transformed into each other, and identify and classify graph substructures common to one or more graphs. Based on previous approaches, we compile a list of modifications that remove these substructures. Conclusion We show that crucial pieces of alignment information, associated with inversions and duplications, are not visible in the structure of all graphs. If we neglect vertex or edge labels, the graphs differ in their information content. Still, many ideas are shared among all graph-based approaches. Based on these findings, we outline a conceptual framework for graph-based genome alignment that can assist in the development of future genome alignment tools. PMID:24712884
Bottari, Benedetta; Felis, Giovanna E; Salvetti, Elisa; Castioni, Anna; Campedelli, Ilenia; Torriani, Sandra; Bernini, Valentina; Gatti, Monica
2017-07-01
Lactobacillus casei, Lactobacillus paracasei and Lactobacillus rhamnosus form a closely related taxonomic group (the L. casei group) within the facultatively heterofermentative lactobacilli. Strains of these species have been used for a long time as probiotics in a wide range of products, and they represent the dominant species of nonstarter lactic acid bacteria in ripened cheeses, where they contribute to flavour development. The close genetic relationship among those species, as well as the similarity of biochemical properties of the strains, hinders the development of an adequate selective method to identify these bacteria. Despite this being a hot topic, as demonstrated by the large amount of literature about it, the results of different proposed identification methods are often ambiguous and unsatisfactory. The aim of this study was to develop a more robust species-specific identification assay for differentiating the species of the L. casei group. A taxonomy-driven comparative genomic analysis was carried out to select the potential target genes whose similarity could better reflect genome-wide diversity. The gene mutL appeared to be the most promising one and, therefore, a novel species-specific multiplex PCR assay was developed to rapidly and effectively distinguish L. casei, L. paracasei and L. rhamnosus strains. The analysis of a collection of 76 wild dairy isolates, previously identified as members of the L. casei group combining the results of multiple approaches, revealed that the novel designed primers, especially in combination with already existing ones, were able to improve the discrimination power at the species level and reveal previously undiscovered intraspecific biodiversity.
Schröder, Jan; Hsu, Arthur; Boyle, Samantha E.; Macintyre, Geoff; Cmero, Marek; Tothill, Richard W.; Johnstone, Ricky W.; Shackleton, Mark; Papenfuss, Anthony T.
2014-01-01
Motivation: Methods for detecting somatic genome rearrangements in tumours using next-generation sequencing are vital in cancer genomics. Available algorithms use one or more sources of evidence, such as read depth, paired-end reads or split reads to predict structural variants. However, the problem remains challenging due to the significant computational burden and high false-positive or false-negative rates. Results: In this article, we present Socrates (SOft Clip re-alignment To idEntify Structural variants), a highly efficient and effective method for detecting genomic rearrangements in tumours that uses only split-read data. Socrates has single-nucleotide resolution, identifies micro-homologies and untemplated sequence at break points, has high sensitivity and high specificity and takes advantage of parallelism for efficient use of resources. We demonstrate using simulated and real data that Socrates performs well compared with a number of existing structural variant detection tools. Availability and implementation: Socrates is released as open source and available from http://bioinf.wehi.edu.au/socrates. Contact: papenfuss@wehi.edu.au Supplementary information: Supplementary data are available at Bioinformatics online. PMID:24389656
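Split-read evidence of the kind Socrates relies on appears in SAM/BAM alignments as soft-clipped read ends. The sketch below only shows how such ends can be pulled out of a CIGAR string for re-alignment; it is not the Socrates implementation, and the function names and the `min_len` threshold are hypothetical.

```python
import re

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def soft_clips(cigar: str) -> tuple[int, int]:
    """Return (leading, trailing) soft-clipped lengths of a SAM CIGAR string."""
    ops = [(int(n), op) for n, op in CIGAR_OP.findall(cigar)]
    # Hard clips (H) may sit outside the soft clips; ignore them.
    ops = [(n, op) for n, op in ops if op != "H"]
    lead = ops[0][0] if ops and ops[0][1] == "S" else 0
    trail = ops[-1][0] if len(ops) > 1 and ops[-1][1] == "S" else 0
    return lead, trail


def clipped_sequences(seq: str, cigar: str, min_len: int = 10) -> list[str]:
    """Extract soft-clipped read ends long enough to be worth re-aligning."""
    lead, trail = soft_clips(cigar)
    out = []
    if lead >= min_len:
        out.append(seq[:lead])
    if trail >= min_len:
        out.append(seq[len(seq) - trail:])
    return out


print(soft_clips("30S70M"))  # (30, 0)
```

The position where the aligned part of the read ends and the clip begins is a candidate breakpoint at single-nucleotide resolution; re-aligning the clipped sequence elsewhere in the genome would suggest the partner locus.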
Selecting sequence variants to improve genomic predictions for dairy cattle
USDA-ARS's Scientific Manuscript database
Millions of genetic variants have been identified by population-scale sequencing projects, but subsets are needed for routine genomic predictions or to include on genotyping arrays. Methods of selecting sequence variants were compared using both simulated sequence genotypes and actual data from run ...
CRISPR-enabled tools for engineering microbial genomes and phenotypes.
Tarasava, Katia; Oh, Eun Joong; Eckert, Carrie A; Gill, Ryan T
2018-06-19
In recent years CRISPR-Cas technologies have revolutionized microbial engineering approaches. Genome editing and non-editing applications of various CRISPR-Cas systems have expanded the throughput and scale of engineering efforts, as well as opened up new avenues for manipulating genomes of non-model organisms. As we expand the range of organisms used for biotechnological applications, we need to develop better, more versatile tools for manipulation of these systems. Here we summarize the current advances in microbial gene editing using CRISPR-Cas based tools, and highlight state-of-the-art methods for high-throughput, efficient genome-scale engineering in model organisms Escherichia coli and Saccharomyces cerevisiae. We also review non-editing CRISPR-Cas applications available for gene expression manipulation, epigenetic remodeling, RNA editing, labeling and synthetic gene circuit design. Finally, we point out the areas of research that need further development in order to expand the range of applications and increase the utility of these new methods. This article is protected by copyright. All rights reserved.
Chwialkowska, Karolina; Korotko, Urszula; Kosinska, Joanna; Szarejko, Iwona; Kwasniewski, Miroslaw
2017-01-01
Epigenetic mechanisms, including histone modifications and DNA methylation, mutually regulate chromatin structure, maintain genome integrity, and affect gene expression and transposon mobility. Variations in DNA methylation within plant populations, as well as methylation in response to internal and external factors, are of increasing interest, especially in the crop research field. Methylation Sensitive Amplification Polymorphism (MSAP) is one of the most commonly used methods for assessing DNA methylation changes in plants. This method involves gel-based visualization of PCR fragments from selectively amplified DNA that are cleaved using methylation-sensitive restriction enzymes. In this study, we developed and validated a new method based on the conventional MSAP approach called Methylation Sensitive Amplification Polymorphism Sequencing (MSAP-Seq). We improved the MSAP-based approach by replacing the conventional separation of amplicons on polyacrylamide gels with direct, high-throughput sequencing using Next Generation Sequencing (NGS) and automated data analysis. MSAP-Seq allows for global sequence-based identification of changes in DNA methylation. This technique was validated in Hordeum vulgare. However, MSAP-Seq can be straightforwardly implemented in different plant species, including crops with large, complex and highly repetitive genomes. The incorporation of high-throughput sequencing into MSAP-Seq enables parallel and direct analysis of DNA methylation in hundreds of thousands of sites across the genome. MSAP-Seq provides direct genomic localization of changes and enables quantitative evaluation. We have shown that the MSAP-Seq method specifically targets gene-containing regions and that a single analysis can cover three-quarters of all genes in large genomes. Moreover, MSAP-Seq's simplicity, cost effectiveness, and high-multiplexing capability make this method highly affordable. 
Therefore, MSAP-Seq can be used for DNA methylation analysis in crop plants with large and complex genomes. PMID:29250096
Bohil, Corey J; Higgins, Nicholas A; Keebler, Joseph R
2014-01-01
We compared methods for predicting and understanding the source of confusion errors during military vehicle identification training. Participants completed training to identify main battle tanks. They also completed card-sorting and similarity-rating tasks to express their mental representation of resemblance across the set of training items. We expected participants to selectively attend to a subset of vehicle features during these tasks, and we hypothesised that we could predict identification confusion errors based on the outcomes of the card-sort and similarity-rating tasks. Based on card-sorting results, we were able to predict about 45% of observed identification confusions. Based on multidimensional scaling of the similarity-rating data, we could predict more than 80% of identification confusions. These methods also enabled us to infer the dimensions receiving significant attention from each participant. This understanding of mental representation may be crucial in creating personalised training that directs attention to features that are critical for accurate identification. Participants completed military vehicle identification training and testing, along with card-sorting and similarity-rating tasks. The data enabled us to predict up to 84% of identification confusion errors and to understand the mental representation underlying these errors. These methods have potential to improve training and reduce identification errors leading to fratricide.
Zwaenepoel, Arthur; Diels, Tim; Amar, David; Van Parys, Thomas; Shamir, Ron; Van de Peer, Yves; Tzfadia, Oren
2018-01-01
Recent times have seen an enormous growth of "omics" data, of which high-throughput gene expression data are arguably the most important from a functional perspective. Despite huge improvements in computational techniques for the functional classification of gene sequences, common similarity-based methods often fall short of providing full and reliable functional information. Recently, the combination of comparative genomics with approaches in functional genomics has received considerable interest for gene function analysis, leveraging both gene expression based guilt-by-association methods and annotation efforts in closely related model organisms. Besides the identification of missing genes in pathways, these methods also typically enable the discovery of biological regulators (i.e., transcription factors or signaling genes). A previously built guilt-by-association method is MORPH, which was proven to be an efficient algorithm that performs particularly well in identifying and prioritizing missing genes in plant metabolic pathways. Here, we present MorphDB, a resource where MORPH-based candidate genes for large-scale functional annotations (Gene Ontology, MapMan bins) are integrated across multiple plant species. Besides a gene centric query utility, we present a comparative network approach that enables researchers to efficiently browse MORPH predictions across functional gene sets and species, facilitating efficient gene discovery and candidate gene prioritization. MorphDB is available at http://bioinformatics.psb.ugent.be/webtools/morphdb/morphDB/index/. We also provide a toolkit, named "MORPH bulk" (https://github.com/arzwa/morph-bulk), for running MORPH in bulk mode on novel data sets, enabling researchers to apply MORPH to their own species of interest.
Zhang, Zhongyang; Hao, Ke
2015-01-01
Cancer genomes exhibit profound somatic copy number alterations (SCNAs). Studying tumor SCNAs using massively parallel sequencing provides unprecedented resolution and meanwhile gives rise to new challenges in data analysis, complicated by tumor aneuploidy and heterogeneity as well as normal cell contamination. While the majority of read depth based methods utilize total sequencing depth alone for SCNA inference, the allele specific signals are undervalued. We proposed a joint segmentation and inference approach using both signals to meet some of the challenges. Our method consists of four major steps: 1) extracting read depth supporting reference and alternative alleles at each SNP/Indel locus and comparing the total read depth and alternative allele proportion between tumor and matched normal sample; 2) performing joint segmentation on the two signal dimensions; 3) correcting the copy number baseline from which the SCNA state is determined; 4) calling SCNA state for each segment based on both signal dimensions. The method is applicable to whole exome/genome sequencing (WES/WGS) as well as SNP array data in a tumor-control study. We applied the method to a dataset containing no SCNAs to test the specificity, created by pairing sequencing replicates of a single HapMap sample as normal/tumor pairs, as well as a large-scale WGS dataset consisting of 88 liver tumors along with adjacent normal tissues. Compared with representative methods, our method demonstrated improved accuracy, scalability to large cancer studies, capability in handling both sequencing and SNP array data, and the potential to improve the estimation of tumor ploidy and purity. PMID:26583378
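Step 1 of the method, deriving the two signals per SNP/indel locus, can be sketched as follows. This is a simplified illustration under stated assumptions: library-size normalization, segmentation and baseline correction are left out, and the class name, function names and `pseudocount` parameter are hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass
class LocusCounts:
    ref: int  # reads supporting the reference allele
    alt: int  # reads supporting the alternative allele


def alt_proportion(c: LocusCounts) -> float:
    """Fraction of reads at the locus that support the alternative allele."""
    total = c.ref + c.alt
    return c.alt / total if total else float("nan")


def locus_signals(tumor: LocusCounts, normal: LocusCounts,
                  pseudocount: float = 0.5) -> tuple[float, float]:
    """Return (log2 depth ratio, shift in alternative allele proportion)."""
    t_depth = tumor.ref + tumor.alt + pseudocount
    n_depth = normal.ref + normal.alt + pseudocount
    log_ratio = math.log2(t_depth / n_depth)
    shift = alt_proportion(tumor) - alt_proportion(normal)
    return log_ratio, shift


# Tumor has twice the depth of the normal and an allelic imbalance.
print(locus_signals(LocusCounts(60, 20), LocusCounts(20, 20), pseudocount=0.0))
# (1.0, -0.25)
```

A joint segmentation would then look for change points along the genome in both of these dimensions at once, so that copy-neutral events that leave total depth unchanged can still be detected through the allele-specific signal.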
Ai, Jinxia; Wang, Xuesong; Gao, Lijun; Xia, Wei; Li, Mingcheng; Yuan, Guangxin; Niu, Jiamu; Zhang, Lihua
2017-11-01
The use of Fetus cervi, which is derived from the embryo and placenta of Cervus nippon Temminck or Cervus elaphus Linnaeus, has been documented for a long time in China. There are abundant species of deer worldwide. Those recorded by China Pharmacopeia (2010 edition) from all the species were either authentic or adulterants/counterfeits. Identification of their origins or authenticity has become key to the preparation of authentic products. The traditional SDS alkaline lysis and salting-out methods were modified to extract mtDNA and genomic DNA, respectively, from fresh and dry Fetus cervi as well as from the fetuses of non-authentic animals. A set of primers was designed using bioinformatics to target the intra- and inter-variation. The mtDNA and genomic DNA extracted from Fetus cervi using the two methods met the requirements for authentication. Extraction of mtDNA by SDS alkaline lysis is more practical and accurate than extraction of genomic DNA by the salting-out method. There were differences in the length and number of segments amplified by PCR between mtDNA from authentic Fetus cervi and from the fetuses of non-authentic animals. The distinctive PCR fingerprint patterns can distinguish Fetus cervi from adulterants and counterfeit animal fetuses.
Gagliano, Sarah A; Ravji, Reena; Barnes, Michael R; Weale, Michael E; Knight, Jo
2015-08-24
Although technology has triumphed in facilitating routine genome sequencing, new challenges have been created for the data-analyst. Genome-scale surveys of human variation generate volumes of data that far exceed capabilities for laboratory characterization. By incorporating functional annotations as predictors, statistical learning has been widely investigated for prioritizing genetic variants likely to be associated with complex disease. We compared three published prioritization procedures, which use different statistical learning algorithms and different predictors with regard to the quantity, type and coding. We also explored different combinations of algorithm and annotation set. As an application, we tested which methodology performed best for prioritizing variants using data from a large schizophrenia meta-analysis by the Psychiatric Genomics Consortium. Results suggest that all methods have considerable (and similar) predictive accuracies (AUCs 0.64-0.71) in test set data, but there is more variability in the application to the schizophrenia GWAS. In conclusion, a variety of algorithms and annotations seem to have a similar potential to effectively enrich true risk variants in genome-scale datasets, however none offer more than incremental improvement in prediction. We discuss how methods might be evolved for risk variant prediction to address the impending bottleneck of the new generation of genome re-sequencing studies.
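The AUC values quoted above have a direct interpretation: the probability that a randomly chosen true risk variant receives a higher score than a randomly chosen non-risk variant. A minimal sketch of that computation follows; the function name and the toy scores are illustrative and do not come from the study.

```python
def auc(scores_pos: list[float], scores_neg: list[float]) -> float:
    """AUC as the probability that a positive outscores a negative (ties count 0.5)."""
    wins = 0.0
    for p in scores_pos:
        for n in scores_neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))


print(auc([0.9, 0.4], [0.6, 0.3]))  # 0.75
```

On this scale 0.5 corresponds to random ranking and 1.0 to perfect separation, which puts the reported range of 0.64-0.71 in context as a modest enrichment.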
Gene discovery by chemical mutagenesis and whole-genome sequencing in Dictyostelium.
Li, Cheng-Lin Frank; Santhanam, Balaji; Webb, Amanda Nicole; Zupan, Blaž; Shaulsky, Gad
2016-09-01
Whole-genome sequencing is a useful approach for identification of chemical-induced lesions, but previous applications involved tedious genetic mapping to pinpoint the causative mutations. We propose that saturation mutagenesis under low mutagenic loads, followed by whole-genome sequencing, should allow direct implication of genes by identifying multiple independent alleles of each relevant gene. We tested the hypothesis by performing three genetic screens with chemical mutagenesis in the social soil amoeba Dictyostelium discoideum. Through genome sequencing, we successfully identified mutant genes with multiple alleles in near-saturation screens, including resistance to intense illumination and strong suppressors of defects in an allorecognition pathway. We tested the causality of the mutations by comparison to published data and by direct complementation tests, finding both dominant and recessive causative mutations. Therefore, our strategy provides a cost- and time-efficient approach to gene discovery by integrating chemical mutagenesis and whole-genome sequencing. The method should be applicable to many microbial systems, and it is expected to revolutionize the field of functional genomics in Dictyostelium by greatly expanding the mutation spectrum relative to other common mutagenesis methods. © 2016 Li et al.; Published by Cold Spring Harbor Laboratory Press.
Laurenson, Yan C S M; Kyriazakis, Ilias; Bishop, Stephen C
2013-10-18
Estimated breeding values (EBV) for faecal egg count (FEC) and genetic markers for host resistance to nematodes may be used to identify resistant animals for selective breeding programmes. Similarly, targeted selective treatment (TST) requires the ability to identify the animals that will benefit most from anthelmintic treatment. A mathematical model was used to combine the concepts and evaluate the potential of using genetic-based methods to identify animals for a TST regime. EBVs obtained by genomic prediction were predicted to be the best determinant criterion for TST in terms of the impact on average empty body weight and average FEC, whereas pedigree-based EBVs for FEC were predicted to be marginally worse than using phenotypic FEC as a determinant criterion. Whilst each method has financial implications, if the identification of host resistance is incorporated into a wider genomic selection indices or selective breeding programmes, then genetic or genomic information may be plausibly included in TST regimes. Copyright © 2013 Elsevier B.V. All rights reserved.
Time- and Cost-Efficient Identification of T-DNA Insertion Sites through Targeted Genomic Sequencing
Lepage, Étienne; Zampini, Éric; Boyle, Brian; Brisson, Normand
2013-01-01
Forward genetic screens enable the unbiased identification of genes involved in biological processes. In Arabidopsis, several mutant collections are publicly available, which greatly facilitates such practice. Most of these collections were generated by agrotransformation of a T-DNA at random sites in the plant genome. However, precise mapping of T-DNA insertion sites in mutants isolated from such screens is a laborious and time-consuming task. Here we report a simple, low-cost and time efficient approach to precisely map T-DNA insertions simultaneously in many different mutants. By combining sequence capture, next-generation sequencing and 2D-PCR pooling, we developed a new method that allowed the rapid localization of T-DNA insertion sites in 55 out of 64 mutant plants isolated in a screen for gyrase inhibition hypersensitivity. PMID:23951038
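The pooling step can be illustrated with a generic two-dimensional scheme. The 8 × 8 grid is an assumption made here because 64 plants fit it exactly; the abstract does not state the actual layout, and the pool labels and function names are hypothetical.

```python
def build_pools(n_rows: int = 8, n_cols: int = 8) -> dict[str, list[int]]:
    """Assign plants 0..n_rows*n_cols-1 to one row pool and one column pool each."""
    pools: dict[str, list[int]] = {}
    for r in range(n_rows):
        pools[f"R{r}"] = [r * n_cols + c for c in range(n_cols)]
    for c in range(n_cols):
        pools[f"C{c}"] = [r * n_cols + c for r in range(n_rows)]
    return pools


def locate_plant(row_pool: str, col_pool: str, n_cols: int = 8) -> int:
    """Map a (row pool, column pool) hit to the single plant in both pools."""
    return int(row_pool[1:]) * n_cols + int(col_pool[1:])


pools = build_pools()
print(len(pools))                # 16
print(locate_plant("R2", "C5"))  # 21
```

With this layout, an insertion site detected in exactly one row pool and one column pool is assigned to a single plant, so 16 pooled reactions stand in for 64 individual ones.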
Wang, WeiBo; Sun, Wei; Wang, Wei; Szatkiewicz, Jin
2018-03-01
The application of high-throughput sequencing in a broad range of quantitative genomic assays (e.g., DNA-seq, ChIP-seq) has created a high demand for the analysis of large-scale read-count data. Typically, the genome is divided into tiling windows and windowed read-count data is generated for the entire genome, from which genomic signals are detected (e.g., copy number changes in DNA-seq, enrichment peaks in ChIP-seq). For accurate analysis of read-count data, many state-of-the-art statistical methods use generalized linear models (GLM) coupled with the negative-binomial (NB) distribution, leveraging its ability for simultaneous bias correction and signal detection. However, although statistically powerful, the GLM+NB method has a quadratic computational complexity and therefore suffers from slow running time when applied to large-scale windowed read-count data. In this study, we aimed to substantially speed up the GLM+NB method by using a randomized algorithm, and we demonstrate here the utility of our approach in the application of detecting copy number variants (CNVs) using a real example. We propose an efficient estimator, the randomized GLM+NB coefficients estimator (RGE), for speeding up the GLM+NB method. RGE samples the read-count data and solves the estimation problem on a smaller scale. We first theoretically validated the consistency and the variance properties of RGE. We then applied RGE to GENSENG, a GLM+NB based method for detecting CNVs. We named the resulting method "R-GENSENG". Based on extensive evaluation using both simulated and empirical data, we concluded that R-GENSENG is ten times faster than the original GENSENG while maintaining GENSENG's accuracy in CNV detection. Our results suggest that the RGE strategy developed here could be applied to other GLM+NB based read-count analyses, e.g., ChIP-seq data analysis, to substantially improve their computational efficiency while preserving the analytic power.
The Use of Weighted Graphs for Large-Scale Genome Analysis
Zhou, Fang; Toivonen, Hannu; King, Ross D.
2014-01-01
There is an acute need for better tools to extract knowledge from the growing flood of sequence data. For example, thousands of complete genomes have been sequenced, and their metabolic networks inferred. Such data should enable a better understanding of evolution. However, most existing network analysis methods are based on pair-wise comparisons, and these do not scale to thousands of genomes. Here we propose the use of weighted graphs as a data structure to enable large-scale phylogenetic analysis of networks. We have developed three types of weighted graph for enzymes: taxonomic (these summarize phylogenetic importance), isoenzymatic (these summarize enzymatic variety/redundancy), and sequence-similarity (these summarize sequence conservation); and we applied these types of weighted graph to survey prokaryotic metabolism. To demonstrate the utility of this approach we have compared and contrasted the large-scale evolution of metabolism in Archaea and Eubacteria. Our results provide evidence for limits to the contingency of evolution. PMID:24619061
Reconstruction of genome-scale human metabolic models using omics data.
Ryu, Jae Yong; Kim, Hyun Uk; Lee, Sang Yup
2015-08-01
The impact of genome-scale human metabolic models on human systems biology and medical sciences is becoming greater, thanks to increasing volumes of model building platforms and publicly available omics data. The genome-scale human metabolic models started with Recon 1 in 2007, and have since been used to describe metabolic phenotypes of healthy and diseased human tissues and cells, and to predict therapeutic targets. Here we review recent trends in genome-scale human metabolic modeling, including various generic and tissue/cell type-specific human metabolic models developed to date, and methods, databases and platforms used to construct them. For generic human metabolic models, we pay attention to Recon 2 and HMR 2.0 with emphasis on data sources used to construct them. Draft and high-quality tissue/cell type-specific human metabolic models have been generated using these generic human metabolic models. Integration of tissue/cell type-specific omics data with the generic human metabolic models is the key step, and we discuss omics data and their integration methods to achieve this task. The initial version of the tissue/cell type-specific human metabolic models can further be computationally refined through gap filling, reaction directionality assignment and the subcellular localization of metabolic reactions. We review relevant tools for this model refinement procedure as well. Finally, we suggest the direction of further studies on reconstructing an improved human metabolic model.
Optimal knockout strategies in genome-scale metabolic networks using particle swarm optimization.
Nair, Govind; Jungreuthmayer, Christian; Zanghellini, Jürgen
2017-02-01
Knockout strategies, particularly the concept of constrained minimal cut sets (cMCSs), are an important part of the arsenal of tools used in manipulating metabolic networks. Given a specific design, cMCSs can be calculated even in genome-scale networks. We would however like to find not only the optimal intervention strategy for a given design but the best possible design too. Our solution (PSOMCS) is to use particle swarm optimization (PSO) along with the direct calculation of cMCSs from the stoichiometric matrix to obtain optimal designs satisfying multiple objectives. To illustrate the working of PSOMCS, we apply it to a toy network. Next we show its superiority by comparing its performance against other comparable methods on a medium-sized E. coli core metabolic network. PSOMCS not only finds solutions comparable to previously published results but is also orders of magnitude faster. Finally, we use PSOMCS to predict knockouts satisfying multiple objectives in a genome-scale metabolic model of E. coli and compare it with OptKnock and RobustKnock. PSOMCS finds competitive knockout strategies and designs compared to other current methods and is in some cases significantly faster. It can be used to identify knockouts that force optimal desired behaviors in large and genome-scale metabolic networks. It will be even more useful as larger metabolic models of industrially relevant organisms become available.
Dong, Zirui; Wang, Huilin; Chen, Haixiao; Jiang, Hui; Yuan, Jianying; Yang, Zhenjun; Wang, Wen-Jing; Xu, Fengping; Guo, Xiaosen; Cao, Ye; Zhu, Zhenzhen; Geng, Chunyu; Cheung, Wan Chee; Kwok, Yvonne K; Yang, Huanming; Leung, Tak Yeung; Morton, Cynthia C; Cheung, Sau Wai; Choy, Kwong Wai
2017-11-02
Purpose: Recent studies demonstrate that whole-genome sequencing enables detection of cryptic rearrangements in apparently balanced chromosomal rearrangements (also known as balanced chromosomal abnormalities, BCAs) previously identified by conventional cytogenetic methods. We aimed to assess our analytical tool for detecting BCAs in the 1000 Genomes Project without knowing which bands were affected. Methods: The 1000 Genomes Project provides an unprecedented integrated map of structural variants in phenotypically normal subjects, but there is no information on potential inclusion of subjects with apparent BCAs akin to those traditionally detected in diagnostic cytogenetics laboratories. We applied our analytical tool to 1,166 genomes from the 1000 Genomes Project with sufficient physical coverage (8.25-fold). Results: With this approach, we detected four reciprocal balanced translocations and four inversions, ranging in size from 57.9 kb to 13.3 Mb, all of which were confirmed by cytogenetic methods and polymerase chain reaction studies. One of these DNAs has a subtle translocation that is not readily identified by chromosome analysis because of the similarity of the banding patterns and size of exchanged segments, and another results in disruption of all transcripts of an OMIM gene. Conclusion: Our study demonstrates the extension of utilizing low-pass whole-genome sequencing for unbiased detection of BCAs including translocations and inversions previously unknown in the 1000 Genomes Project. Genetics in Medicine advance online publication, 2 November 2017; doi:10.1038/gim.2017.170.
Taranto, F; D'Agostino, N; Greco, B; Cardi, T; Tripodi, P
2016-11-21
Knowledge of population structure and genetic diversity in vegetable crops is essential for association mapping studies and genomic selection. Genotyping by sequencing (GBS) represents an innovative method for large-scale SNP detection and genotyping of genetic resources. Herein we used the GBS approach for the genome-wide identification of SNPs in a collection of Capsicum spp. accessions and for the assessment of the level of genetic diversity in a subset of 222 cultivated pepper (Capsicum annuum) genotypes. GBS analysis generated a total of 7,568,894 master tags, of which 43.4% uniquely aligned to the reference genome CM334. A total of 108,591 SNP markers were identified, of which 105,184 were in C. annuum accessions. In order to explore the genetic diversity of C. annuum and to select a minimal core set representing most of the total genetic variation with minimum redundancy, a subset of 222 C. annuum accessions was analysed using 32,950 high-quality SNPs. Based on Bayesian and Hierarchical clustering it was possible to divide the collection into three clusters. Cluster I had the majority of varieties and landraces mainly from Southern and Northern Italy, and from Eastern Europe, whereas clusters II and III comprised accessions of different geographical origins. Considering the genome-wide genetic variation among the accessions included in cluster I, a second round of Bayesian (K = 3) and Hierarchical (K = 2) clustering was performed. These analyses showed that genotypes were grouped not only based on geographical origin, but also on fruit-related features. GBS data has proven useful to assess the genetic diversity in a collection of C. annuum accessions. The high number of SNP markers, uniformly distributed on the 12 chromosomes, allowed the accessions to be distinguished according to geographical origin and fruit-related features.
SNP markers and information on population structure developed in this study will undoubtedly support genome-wide association mapping studies and marker-assisted selection programs.
Miura, Naoki; Kucho, Ken-Ichi; Noguchi, Michiko; Miyoshi, Noriaki; Uchiumi, Toshiki; Kawaguchi, Hiroaki; Tanimoto, Akihide
2014-01-01
The microminipig, which weighs less than 10 kg at an early stage of maturity, has been reported as a potential experimental model animal. Its extremely small size and other distinct characteristics suggest the possibility of a number of differences between the genome of the microminipig and that of conventional pigs. In this study, we analyzed the genomes of two healthy microminipigs using the SOLiD™ next-generation sequencing system. We then compared the obtained genomic sequences with a genomic database for the domestic pig (Sus scrofa). The mapping coverage of sequenced tags from the microminipig to conventional pig genomic sequences was greater than 96% and we detected no clear, substantial genomic variance from these data. The results may indicate that the distinct characteristics of the microminipig derive from small-scale alterations in the genome, such as Single Nucleotide Polymorphisms or translational modifications, rather than large-scale deletion or insertion polymorphisms. Further investigation of the entire genomic sequence of the microminipig with methods enabling deeper coverage is required to elucidate the genetic basis of its distinct phenotypic traits. Copyright © 2014 International Institute of Anticancer Research (Dr. John G. Delinassios), All rights reserved.
Ultrafast Comparison of Personal Genomes via Precomputed Genome Fingerprints
Glusman, Gustavo; Mauldin, Denise E.; Hood, Leroy E.; Robinson, Max
2017-01-01
We present an ultrafast method for comparing personal genomes. We transform the standard genome representation (lists of variants relative to a reference) into “genome fingerprints” via locality sensitive hashing. The resulting genome fingerprints can be meaningfully compared even when the input data were obtained using different sequencing technologies, processed using different pipelines, represented in different data formats and relative to different reference versions. Furthermore, genome fingerprints are robust to up to 30% missing data. Because of their reduced size, computation on the genome fingerprints is fast and requires little memory. For example, we could compute all-against-all pairwise comparisons among the 2504 genomes in the 1000 Genomes data set in 67 s at high quality (21 μs per comparison, on a single processor), and achieved a lower quality approximation in just 11 s. Efficient computation enables scaling up a variety of important genome analyses, including quantifying relatedness, recognizing duplicative sequenced genomes in a set, population reconstruction, and many others. The original genome representation cannot be reconstructed from its fingerprint, effectively decoupling genome comparison from genome interpretation; the method thus has significant implications for privacy-preserving genome analytics. PMID:29018478
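To make the general idea of locality-sensitive hashing concrete, here is a minimal sketch that reduces a variant list to a short fixed-length signature and compares signatures instead of the original lists. It uses MinHash, a standard LSH scheme chosen here for brevity; it is an assumption for illustration and not the fingerprint construction described in the abstract. The variant strings are made up.

```python
import hashlib

def stable_hash(text):
    """Return a 64-bit integer hash that is identical across runs and machines."""
    return int.from_bytes(hashlib.blake2b(text.encode(), digest_size=8).digest(), "big")

def minhash_signature(variants, num_hashes=64):
    """Summarize a set of variants as the minimum hash under each of num_hashes seeds."""
    return [min(stable_hash(f"{seed}:{v}") for v in variants) for seed in range(num_hashes)]

def estimated_similarity(sig_a, sig_b):
    """Fraction of matching positions, which estimates the Jaccard similarity of the sets."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

# Hypothetical variant lists relative to a reference.
genome_a = {f"chr1:{pos}:A>G" for pos in range(1000, 1100)}
genome_b = {f"chr1:{pos}:A>G" for pos in range(1020, 1120)}   # overlaps genome_a
genome_c = {f"chr2:{pos}:C>T" for pos in range(5000, 5100)}   # shares nothing with genome_a

sig_a = minhash_signature(genome_a)
sig_b = minhash_signature(genome_b)
sig_c = minhash_signature(genome_c)

print("a vs a:", estimated_similarity(sig_a, sig_a))
print("a vs b:", estimated_similarity(sig_a, sig_b))
print("a vs c:", estimated_similarity(sig_a, sig_c))
```

Each comparison touches only 64 integers regardless of how many variants the genomes carry, and the variant list cannot be read back from the signature.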
Marko, Nicholas F.; Weil, Robert J.
2012-01-01
Introduction: Gene expression data is often assumed to be normally-distributed, but this assumption has not been tested rigorously. We investigate the distribution of expression data in human cancer genomes and study the implications of deviations from the normal distribution for translational molecular oncology research. Methods: We conducted a central moments analysis of five cancer genomes and performed empiric distribution fitting to examine the true distribution of expression data both on the complete-experiment and on the individual-gene levels. We used a variety of parametric and nonparametric methods to test the effects of deviations from normality on gene calling, functional annotation, and prospective molecular classification using a sixth cancer genome. Results: Central moments analyses reveal statistically-significant deviations from normality in all of the analyzed cancer genomes. We observe as much as 37% variability in gene calling, 39% variability in functional annotation, and 30% variability in prospective, molecular tumor subclassification associated with this effect. Conclusions: Cancer gene expression profiles are not normally-distributed, either on the complete-experiment or on the individual-gene level. Instead, they exhibit complex, heavy-tailed distributions characterized by statistically-significant skewness and kurtosis. The non-Gaussian distribution of this data affects identification of differentially-expressed genes, functional annotation, and prospective molecular classification. These effects may be reduced in some circumstances, although not completely eliminated, by using nonparametric analytics. This analysis highlights two unreliable assumptions of translational cancer gene expression analysis: that “small” departures from normality in the expression data distributions are analytically-insignificant and that “robust” gene-calling algorithms can fully compensate for these effects. PMID:23118863
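A central moments analysis of the kind mentioned above rests on the third and fourth standardized moments. The sketch below computes sample skewness and excess kurtosis from first principles; both are zero for a normal distribution, so large values indicate departure from normality. The expression values are invented for illustration, and the population (biased) form of the moments is an assumption, since the abstract does not say which estimator was used.

```python
def central_moment(values, order):
    """Return the order-th central moment (population form, dividing by n)."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** order for v in values) / len(values)

def skewness(values):
    """Third standardized moment; 0 for any symmetric distribution."""
    return central_moment(values, 3) / central_moment(values, 2) ** 1.5

def excess_kurtosis(values):
    """Fourth standardized moment minus 3; 0 for a normal distribution."""
    return central_moment(values, 4) / central_moment(values, 2) ** 2 - 3.0

symmetric = [1.0, 2.0, 3.0, 4.0, 5.0]
right_skewed = [1.0, 1.0, 1.0, 1.0, 10.0]   # hypothetical heavy right tail

print("symmetric:", skewness(symmetric), excess_kurtosis(symmetric))
print("right-skewed:", skewness(right_skewed), excess_kurtosis(right_skewed))
```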
The Essential Genome of Escherichia coli K-12.
Goodall, Emily C A; Robinson, Ashley; Johnston, Iain G; Jabbari, Sara; Turner, Keith A; Cunningham, Adam F; Lund, Peter A; Cole, Jeffrey A; Henderson, Ian R
2018-02-20
Transposon-directed insertion site sequencing (TraDIS) is a high-throughput method coupling transposon mutagenesis with short-fragment DNA sequencing. It is commonly used to identify essential genes. Single gene deletion libraries are considered the gold standard for identifying essential genes. Currently, the TraDIS method has not been benchmarked against such libraries, and therefore, it remains unclear whether the two methodologies are comparable. To address this, a high-density transposon library was constructed in Escherichia coli K-12. Essential genes predicted from sequencing of this library were compared to existing essential gene databases. To decrease false-positive identification of essential genes, statistical data analysis included corrections for both gene length and genome length. Through this analysis, new essential genes and genes previously incorrectly designated essential were identified. We show that manual analysis of TraDIS data reveals novel features that would not have been detected by statistical analysis alone. Examples include short essential regions within genes, orientation-dependent effects, and fine-resolution identification of genome and protein features. Recognition of these insertion profiles in transposon mutagenesis data sets will assist genome annotation of less well characterized genomes and provides new insights into bacterial physiology and biochemistry. IMPORTANCE Incentives to define lists of genes that are essential for bacterial survival include the identification of potential targets for antibacterial drug development, genes required for rapid growth for exploitation in biotechnology, and discovery of new biochemical pathways. To identify essential genes in Escherichia coli, we constructed a transposon mutant library of unprecedented density. Initial automated analysis of the resulting data revealed many discrepancies compared to the literature.
We now report more extensive statistical analysis supported by both literature searches and detailed inspection of high-density TraDIS sequencing data for each putative essential gene for the E. coli model laboratory organism. This paper is important because it provides a better understanding of the essential genes of E. coli, reveals the limitations of relying on automated analysis alone, and provides a new standard for the analysis of TraDIS data. Copyright © 2018 Goodall et al.
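The gene-length correction mentioned above is commonly expressed as an insertion index, the number of transposon insertions in a gene divided by its length. The sketch below computes that index and flags genes below a cutoff. The gene records and the cutoff value are hypothetical; the abstract does not give the statistical procedure used to set the threshold, so a fixed placeholder is assumed here.

```python
def insertion_index(n_insertions, gene_length_bp):
    """Insertions per base pair, which removes the advantage long genes have in raw counts."""
    return n_insertions / gene_length_bp

def call_candidate_essential(genes, cutoff):
    """Return names of genes whose insertion index falls below the cutoff."""
    return [
        name
        for name, (n_insertions, length_bp) in genes.items()
        if insertion_index(n_insertions, length_bp) < cutoff
    ]

# Hypothetical genes: name -> (insertion count, gene length in bp).
genes = {
    "geneA": (2, 1500),     # long gene with almost no insertions
    "geneB": (180, 1200),   # densely hit, so likely dispensable
    "geneC": (1, 150),      # short gene: few insertions expected even if dispensable
}

CUTOFF = 0.005  # placeholder threshold, an assumption for this example
candidates = call_candidate_essential(genes, CUTOFF)
print(candidates)
```

Here `geneC` escapes the call even though it has the fewest insertions, which shows why raw counts alone would produce false positives for short genes.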
Mao, Hongliang; Wang, Hao
2017-03-01
Short Interspersed Nuclear Elements (SINEs) are transposable elements (TEs) that amplify through a copy-and-paste mode via RNA intermediates. The computational identification of new SINEs is challenging because of their weak structural signals and rapid diversification in sequences. Here we report SINE_Scan, a highly efficient program to predict SINE elements in genomic DNA sequences. SINE_Scan integrates the hallmark of SINE transposition, copy number and structural signals to identify a SINE element. SINE_Scan outperforms the previously published de novo SINE discovery program. It shows high sensitivity and specificity in 19 plant and animal genome assemblies, whose sizes vary from 120 Mb to 3.5 Gb. It identifies numerous new families and substantially increases the estimation of the abundance of SINEs in these genomes. The code of SINE_Scan is freely available at http://github.com/maohlzj/SINE_Scan, implemented in PERL and supported on Linux. Contact: wangh8@fudan.edu.cn. Supplementary data are available at Bioinformatics online. © The Author 2016. Published by Oxford University Press.
Mao, Hongliang
2017-01-01
Motivation: Short Interspersed Nuclear Elements (SINEs) are transposable elements (TEs) that amplify through a copy-and-paste mode via RNA intermediates. The computational identification of new SINEs is challenging because of their weak structural signals and rapid diversification in sequences. Results: Here we report SINE_Scan, a highly efficient program to predict SINE elements in genomic DNA sequences. SINE_Scan integrates the hallmark of SINE transposition, copy number and structural signals to identify a SINE element. SINE_Scan outperforms the previously published de novo SINE discovery program. It shows high sensitivity and specificity in 19 plant and animal genome assemblies, whose sizes vary from 120 Mb to 3.5 Gb. It identifies numerous new families and substantially increases the estimation of the abundance of SINEs in these genomes. Availability and Implementation: The code of SINE_Scan is freely available at http://github.com/maohlzj/SINE_Scan, implemented in PERL and supported on Linux. Contact: wangh8@fudan.edu.cn Supplementary information: Supplementary data are available at Bioinformatics online. PMID:28062442
Hyde, Craig L.; Nagle, Mike W.; Tian, Chao; Chen, Xing; Paciga, Sara A.; Wendland, Jens R.; Tung, Joyce; Hinds, David A.; Perlis, Roy H.; Winslow, Ashley R.
2016-01-01
Despite strong evidence supporting the heritability of Major Depressive Disorder, previous genome-wide studies were unable to identify risk loci among individuals of European descent. We used self-reported data from 75,607 individuals reporting clinical diagnosis of depression and 231,747 reporting no history of depression through 23andMe, and meta-analyzed these results with published MDD GWAS results. We identified five independent variants from four regions associated with self-report of clinical diagnosis or treatment for depression. Loci with P < 1.0 × 10⁻⁵ in the meta-analysis were further analyzed in a replication dataset (45,773 cases and 106,354 controls) from 23andMe. A total of 17 independent SNPs from 15 regions reached genome-wide significance after joint-analysis over all three datasets. Some of these loci were also implicated in GWAS of related psychiatric traits. These studies provide evidence for large-scale consumer genomic data as a powerful and efficient complement to traditional means of ascertainment for neuropsychiatric disease genomics. PMID:27479909
Kasi, Devi; Catherine, Christy; Lee, Seung-Won; Lee, Kyung-Ho; Kim, Yu Jung; Ro Lee, Myeong; Ju, Jung Won; Kim, Dong-Myung
2017-05-01
The rapidly evolving cloning and sequencing technologies have enabled understanding of the genomic structure of parasite genomes, opening up new ways of combatting parasite-related diseases. To make the most of the exponentially accumulating genomic data, however, it is crucial to analyze the proteins encoded by these genomic sequences. In this study, we adopted an engineered cell-free protein synthesis system for large-scale expression screening of an expressed sequence tag (EST) library of Clonorchis sinensis to identify potential antigens that can be used for diagnosis and treatment of clonorchiasis. To allow high-throughput expression and identification of individual genes comprising the library, a cell-free synthesis reaction was designed such that both the template DNA and the expressed proteins were co-immobilized on the same microbeads, leading to microbead-based linkage of the genotype and phenotype. This reaction configuration allowed streamlined expression, recovery, and analysis of proteins. This approach enabled us to identify 21 antigenic proteins. © 2017 American Institute of Chemical Engineers Biotechnol. Prog., 33:832-837, 2017.
Mookerjee, Shona A; Sia, Elaine A
2006-03-20
The mechanisms that govern mutation avoidance in the mitochondrial genome, though believed to be numerous, are poorly understood. The identification of individual genes has implicated mismatch repair and several recombination pathways in maintaining the fidelity and structural stability of mitochondrial DNA. However, the majority of genes in these pathways have not been identified and the interactions between different pathways have not been extensively studied. Additionally, the multicopy presence of the mitochondrial genome affects the occurrence and persistence of mutant phenotypes, making mitochondrial DNA transmission and sorting important factors affecting mutation accumulation. We present new evidence that the putative recombination genes CCE1, DIN7, and MHR1 have overlapping function with the mismatch repair homolog MSH1 in point mutation avoidance and suppression of aberrant recombination events. In addition, we demonstrate a novel role for Msh1p in mtDNA transmission, a role not predicted by studies of its nuclear homologs.
Pichon, Christophe; du Merle, Laurence; Caliot, Marie Elise; Trieu-Cuot, Patrick; Le Bouguénec, Chantal
2012-04-01
Characterization of small non-coding ribonucleic acids (sRNA) among the large volume of data generated by high-throughput RNA-seq or tiling microarray analyses remains a challenge. Thus, there is still a need for accurate in silico prediction methods to identify sRNAs within a given bacterial species. After years of effort, dedicated software was developed based on comparative genomic analyses or mathematical/statistical models. Although these genomic analyses enabled sRNAs in intergenic regions to be efficiently identified, they all failed to predict antisense sRNA genes (asRNA), i.e. RNA genes located on the DNA strand complementary to that which encodes the protein. The statistical models enabled any genomic region to be analyzed theoretically but not efficiently. We present a new model for in silico identification of sRNA and asRNA candidates within an entire bacterial genome. This model was successfully used to analyze the Gram-negative Escherichia coli and Gram-positive Streptococcus agalactiae. In both bacteria, numerous asRNAs are transcribed from the complementary strand of genes located in pathogenicity islands, strongly suggesting that these asRNAs are regulators of virulence expression. In particular, we characterized an asRNA that acted as an enhancer-like regulator of the type 1 fimbriae production involved in the virulence of extra-intestinal pathogenic E. coli.
Pichon, Christophe; du Merle, Laurence; Caliot, Marie Elise; Trieu-Cuot, Patrick; Le Bouguénec, Chantal
2012-01-01
Characterization of small non-coding ribonucleic acids (sRNA) among the large volume of data generated by high-throughput RNA-seq or tiling microarray analyses remains a challenge. Thus, there is still a need for accurate in silico prediction methods to identify sRNAs within a given bacterial species. After years of effort, dedicated software was developed based on comparative genomic analyses or mathematical/statistical models. Although these genomic analyses enabled sRNAs in intergenic regions to be efficiently identified, they all failed to predict antisense sRNA genes (asRNA), i.e. RNA genes located on the DNA strand complementary to that which encodes the protein. The statistical models enabled any genomic region to be analyzed theoretically but not efficiently. We present a new model for in silico identification of sRNA and asRNA candidates within an entire bacterial genome. This model was successfully used to analyze the Gram-negative Escherichia coli and Gram-positive Streptococcus agalactiae. In both bacteria, numerous asRNAs are transcribed from the complementary strand of genes located in pathogenicity islands, strongly suggesting that these asRNAs are regulators of virulence expression. In particular, we characterized an asRNA that acted as an enhancer-like regulator of the type 1 fimbriae production involved in the virulence of extra-intestinal pathogenic E. coli. PMID:22139924
How molecular profiling could revolutionize drug discovery.
Stoughton, Roland B; Friend, Stephen H
2005-04-01
Information from genomic, proteomic and metabolomic measurements has already benefited target discovery and validation, assessment of efficacy and toxicity of compounds, identification of disease subgroups and the prediction of responses of individual patients. Greater benefits can be expected from the application of these technologies on a significantly larger scale; by simultaneously collecting diverse measurements from the same subjects or cell cultures; by exploiting the steadily improving quantitative accuracy of the technologies; and by interpreting the emerging data in the context of underlying biological models of increasing sophistication. The benefits of applying molecular profiling to drug discovery and development will include much lower failure rates at all stages of the drug development pipeline, faster progression from discovery through to clinical trials and more successful therapies for patient subgroups. Upheavals in existing organizational structures in the current 'conveyor belt' models of drug discovery might be required to take full advantage of these methods.
Adaptive introgression across species boundaries in Heliconius butterflies.
Pardo-Diaz, Carolina; Salazar, Camilo; Baxter, Simon W; Merot, Claire; Figueiredo-Ready, Wilsea; Joron, Mathieu; McMillan, W Owen; Jiggins, Chris D
2012-01-01
It is widely documented that hybridisation occurs between many closely related species, but the importance of introgression in adaptive evolution remains unclear, especially in animals. Here, we have examined the role of introgressive hybridisation in transferring adaptations between mimetic Heliconius butterflies, taking advantage of the recent identification of a gene regulating red wing patterns in this genus. By sequencing regions both linked and unlinked to the red colour locus, we found a region that displays an almost perfect genotype by phenotype association across four species, H. melpomene, H. cydno, H. timareta, and H. heurippa. This particular segment is located 70 kb downstream of the red colour specification gene optix, and coalescent analysis indicates repeated introgression of adaptive alleles from H. melpomene into the H. cydno species clade. Our analytical methods complement recent genome scale data for the same region and suggest adaptive introgression has a crucial role in generating adaptive wing colour diversity in this group of butterflies.
On Functional Module Detection in Metabolic Networks
Koch, Ina; Ackermann, Jörg
2013-01-01
Functional modules of metabolic networks are essential for understanding the metabolism of an organism as a whole. With the vast amount of experimental data and the construction of complex and large-scale, often genome-wide, models, the computer-aided identification of functional modules becomes more and more important. Since steady states play a key role in biology, many methods have been developed in that context, for example, elementary flux modes, extreme pathways, transition invariants and place invariants. Metabolic networks can also be studied from the point of view of graph theory, and algorithms for graph decomposition have been applied to the identification of functional modules. A prominent and currently intensively discussed family of methods in graph theory addresses Q-modularity. In this paper, we recall known concepts of module detection based on the steady-state assumption, focusing on transition invariants (elementary modes) and their computation as minimal solutions of systems of Diophantine equations. We present the Fourier-Motzkin algorithm in detail. Afterwards, we introduce Q-modularity as an example of a useful non-steady-state method and its application to metabolic networks. To illustrate and discuss the concepts of invariants and Q-modularity, we use a part of the central carbon metabolism in potato tubers (Solanum tuberosum) as a running example. The intention of the paper is to give a compact presentation of known steady-state concepts from a graph-theoretical viewpoint in the context of network decomposition and reduction and to introduce the application of Q-modularity to metabolic Petri net models. PMID:24958145
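For readers unfamiliar with Q-modularity, a small worked example may help. The sketch below evaluates the standard Newman-Girvan modularity of a given partition of an undirected, unweighted graph, Q = Σ_c [ e_c / m − (d_c / 2m)² ], where m is the number of edges, e_c the number of edges inside community c, and d_c the sum of degrees in c. The toy graph and node names are assumptions for illustration and are unrelated to the potato tuber model in the paper.

```python
def modularity(edges, community_of):
    """Newman-Girvan modularity Q of a partition of an undirected, unweighted graph."""
    m = len(edges)
    internal = {}   # e_c: edges with both endpoints in community c
    degree = {}     # d_c: sum of node degrees in community c
    for u, v in edges:
        cu, cv = community_of[u], community_of[v]
        degree[cu] = degree.get(cu, 0) + 1
        degree[cv] = degree.get(cv, 0) + 1
        if cu == cv:
            internal[cu] = internal.get(cu, 0) + 1
    return sum(internal.get(c, 0) / m - (degree[c] / (2 * m)) ** 2 for c in degree)

# Two triangles joined by a single bridge edge (c-d).
edges = [("a", "b"), ("b", "c"), ("a", "c"),
         ("d", "e"), ("e", "f"), ("d", "f"),
         ("c", "d")]

two_modules = {"a": 0, "b": 0, "c": 0, "d": 1, "e": 1, "f": 1}
one_module = {node: 0 for node in "abcdef"}

q_split = modularity(edges, two_modules)
q_whole = modularity(edges, one_module)
print(q_split, q_whole)
```

Splitting at the bridge gives Q = 2 × (3/7 − (7/14)²) ≈ 0.357, while putting every node in one module gives Q = 0, so the partition into two triangles is preferred.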
SeqTU: A web server for identification of bacterial transcription units
DOE Office of Scientific and Technical Information (OSTI.GOV)
Chen, Xin; Chou, Wen-Chi; Ma, Qin
A transcription unit (TU) consists of K ≥ 1 consecutive genes on the same strand of a bacterial genome that are transcribed into a single mRNA molecule under certain conditions. Their identification is an essential step in elucidation of transcriptional regulatory networks. We have recently developed a machine-learning method to accurately identify TUs from RNA-seq data, based on two features of the assembled RNA reads: the continuity and stability of RNA-seq coverage across a genomic region. While good performance was achieved by the method on Escherichia coli and Clostridium thermocellum, substantial work is needed to make the program generally applicable to all bacteria, knowing that the program requires organism specific information. A web server, named SeqTU, was developed to automatically identify TUs with given RNA-seq data of any bacterium using a machine-learning approach. The server consists of a number of utility tools, in addition to TU identification, such as data preparation, data quality check and RNA-read mapping. SeqTU provides a user-friendly interface and automated prediction of TUs from given RNA-seq data. Furthermore, the predicted TUs are displayed intuitively using HTML format along with a graphic visualization of the prediction.
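The definition of a TU above (consecutive same-strand genes joined by continuous coverage) can be illustrated with a deliberately simplified rule. The sketch below is not the SeqTU classifier: it replaces the machine-learning model with a single hypothetical threshold on mean read coverage in the gap before each gene, and the gene records and field names (`strand`, `gap_coverage`) are invented for the example.

```python
def group_transcription_units(genes, min_gap_coverage=5.0):
    """Group genes (ordered by position) into candidate TUs.

    A gene joins the previous gene's unit only if both lie on the same strand
    and the mean coverage of the intergenic gap between them is high enough.
    """
    units = []
    for gene in genes:
        previous = units[-1][-1] if units else None
        if (
            previous is not None
            and previous["strand"] == gene["strand"]
            and gene["gap_coverage"] >= min_gap_coverage
        ):
            units[-1].append(gene)
        else:
            units.append([gene])
    return [[g["name"] for g in unit] for unit in units]

# Hypothetical genes in genomic order; gap_coverage is the mean read depth
# between this gene and the one before it.
genes = [
    {"name": "a", "strand": "+", "gap_coverage": 0.0},
    {"name": "b", "strand": "+", "gap_coverage": 20.0},
    {"name": "c", "strand": "+", "gap_coverage": 0.0},
    {"name": "d", "strand": "-", "gap_coverage": 30.0},
]

units = group_transcription_units(genes)
print(units)
```

Gene `d` starts a new unit despite high gap coverage because it lies on the opposite strand, and `c` starts a new unit because coverage drops to zero before it.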
Pettengill, Emily A.; Pettengill, James B.; Binet, Rachel
2016-01-19
As a leading cause of bacterial dysentery, Shigella represents a significant threat to public health and food safety. Related, but often overlooked, enteroinvasive Escherichia coli (EIEC) can also cause dysentery. Current typing methods have limited ability to identify and differentiate between these pathogens despite the need for rapid and accurate identification of pathogens for clinical treatment and outbreak response. We present a comprehensive phylogeny of Shigella and EIEC using whole genome sequencing of 169 samples, constituting unparalleled strain diversity, and observe a lack of monophyly between Shigella and EIEC and among Shigella taxonomic groups. The evolutionary relationships in the phylogeny are supported by analyses of population structure and hierarchical clustering patterns of translated gene homolog abundance. Lastly, we identified a panel of 404 single nucleotide polymorphism (SNP) markers specific to each phylogenetic cluster for more accurate identification of Shigella and EIEC. Our findings show that Shigella and EIEC are not distinct evolutionary groups within the E. coli genus and, thus, EIEC as a group is not the ancestor to Shigella. The multiple analyses presented provide evidence for reconsidering the taxonomic placement of Shigella. The SNP markers offer more discriminatory power to molecular epidemiological typing methods involving these bacterial pathogens.
SeqTU: A web server for identification of bacterial transcription units
Chen, Xin; Chou, Wen-Chi; Ma, Qin; ...
2017-03-07
A transcription unit (TU) consists of K ≥ 1 consecutive genes on the same strand of a bacterial genome that are transcribed into a single mRNA molecule under certain conditions. Their identification is an essential step in elucidation of transcriptional regulatory networks. We have recently developed a machine-learning method to accurately identify TUs from RNA-seq data, based on two features of the assembled RNA reads: the continuity and stability of RNA-seq coverage across a genomic region. While good performance was achieved by the method on Escherichia coli and Clostridium thermocellum, substantial work is needed to make the program generally applicable to all bacteria, knowing that the program requires organism-specific information. A web server, named SeqTU, was developed to automatically identify TUs with given RNA-seq data of any bacterium using a machine-learning approach. The server consists of a number of utility tools, in addition to TU identification, such as data preparation, data quality check and RNA-read mapping. SeqTU provides a user-friendly interface and automated prediction of TUs from given RNA-seq data. Furthermore, the predicted TUs are displayed intuitively using HTML format along with a graphic visualization of the prediction.
Košir, Alexandra Bogožalec; Arulandhu, Alfred J; Voorhuijzen, Marleen M; Xiao, Hongmei; Hagelaar, Rico; Staats, Martijn; Costessi, Adalberto; Žel, Jana; Kok, Esther J; Dijk, Jeroen P van
2017-10-26
The majority of feed products in industrialised countries contains materials derived from genetically modified organisms (GMOs). In parallel, the number of reports of unauthorised GMOs (UGMOs) is gradually increasing. There is a lack of specific detection methods for UGMOs, due to the absence of detailed sequence information and reference materials. In this research, an adapted genome walking approach was developed, called ALF: Amplification of Linearly-enriched Fragments. Coupling of ALF to NGS aims for simultaneous detection and identification of all GMOs, including UGMOs, in one sample, in a single analysis. The ALF approach was assessed on a mixture made of DNA extracts from four reference materials, in an uneven distribution, mimicking a real life situation. The complete insert and genomic flanking regions were known for three of the included GMO events, while for MON15985 only partial sequence information was available. Combined with a known organisation of elements, this GMO served as a model for a UGMO. We successfully identified sequences matching with this organisation of elements serving as proof of principle for ALF as new UGMO detection strategy. Additionally, this study provides a first outline of an automated, web-based analysis pipeline for identification of UGMOs containing known GM elements.
Systems genetics for drug target discovery
Penrod, Nadia M.; Cowper-Sal·lari, Richard; Moore, Jason H.
2011-01-01
The collection and analysis of genomic data has the potential to reveal novel druggable targets by providing insight into the genetic basis of disease. However, the number of drugs, targeting new molecular entities, approved by the US Food and Drug Administration (FDA) has not increased in the years since the collection of genomic data has become commonplace. The paucity of translatable results can be partly attributed to conventional analysis methods that test one gene at a time in an effort to identify disease-associated factors as candidate drug targets. By disengaging genetic factors from their position within the genetic regulatory system, much of the information stored within the genomic data set is lost. Here we discuss how genomic data is used to identify disease-associated genes or genomic regions, how disease-associated regions are validated as functional targets, and the role network analysis can play in bridging the gap between data generation and effective drug target identification. PMID:21862141
Uchiyama, Ikuo
2008-10-31
Identifying the set of intrinsically conserved genes, or the genomic core, among related genomes is crucial for understanding prokaryotic genomes where horizontal gene transfers are common. Although core genome identification appears to be obvious among very closely related genomes, it becomes more difficult when more distantly related genomes are compared. Here, we consider the core structure as a set of sufficiently long segments in which gene orders are conserved so that they are likely to have been inherited mainly through vertical transfer, and developed a method for identifying the core structure by finding the order of pre-identified orthologous groups (OGs) that maximally retains the conserved gene orders. The method was applied to genome comparisons of two well-characterized families, Bacillaceae and Enterobacteriaceae, and identified their core structures comprising 1438 and 2125 OGs, respectively. The core sets contained most of the essential genes and their related genes, which were primarily included in the intersection of the two core sets comprising around 700 OGs. The definition of the genomic core based on gene order conservation was demonstrated to be more robust than the simpler approach based only on gene conservation. We also investigated the core structures in terms of G+C content homogeneity and phylogenetic congruence, and found that the core genes primarily exhibited the expected characteristic, i.e., being indigenous and sharing the same history, more than the non-core genes. The results demonstrate that our strategy of genome alignment based on gene order conservation can provide an effective approach to identify the genomic core among moderately related microbial genomes.
Deciphering the distance to antibiotic resistance for the pneumococcus using genome sequencing data
Mobegi, Fredrick M.; Cremers, Amelieke J. H.; de Jonge, Marien I.; Bentley, Stephen D.; van Hijum, Sacha A. F. T.; Zomer, Aldert
2017-01-01
Advances in genome sequencing technologies and genome-wide association studies (GWAS) have provided unprecedented insights into the molecular basis of microbial phenotypes and enabled the identification of the underlying genetic variants in real populations. However, utilization of genome sequencing in clinical phenotyping of bacteria is challenging due to the lack of reliable and accurate approaches. Here, we report a method for predicting microbial resistance patterns using genome sequencing data. We analyzed whole genome sequences of 1,680 Streptococcus pneumoniae isolates from four independent populations using GWAS and identified probable hotspots of genetic variation which correlate with phenotypes of resistance to essential classes of antibiotics. With the premise that accumulation of putative resistance-conferring SNPs, potentially in combination with specific resistance genes, precedes full resistance, we retrogressively surveyed the hotspot loci and quantified the number of SNPs and/or genes, which if accumulated would confer full resistance to an otherwise susceptible strain. We name this approach the ‘distance to resistance’. It can be used to identify the creep towards complete antibiotics resistance in bacteria using genome sequencing. This approach serves as a basis for the development of future sequencing-based methods for predicting resistance profiles of bacterial strains in hospital microbiology and public health settings. PMID:28205635
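The counting idea behind the 'distance to resistance' can be sketched as a set difference. This is a minimal illustration, not the authors' implementation: it assumes a resistance profile can be written as a flat list of required variants, and the function names and variant labels (`locusA:snp1`, `geneX`) are hypothetical placeholders rather than real pneumococcal loci.

```python
from typing import Iterable, List


def missing_variants(isolate_variants: Iterable[str],
                     resistance_profile: Iterable[str]) -> List[str]:
    """Variants in the (hypothetical) resistance profile that the isolate lacks."""
    return sorted(set(resistance_profile) - set(isolate_variants))


def distance_to_resistance(isolate_variants: Iterable[str],
                           resistance_profile: Iterable[str]) -> int:
    """Number of profile variants the isolate would still have to accumulate."""
    return len(missing_variants(isolate_variants, resistance_profile))


# Hypothetical labels, for illustration only.
profile = ["locusA:snp1", "locusA:snp2", "locusB:snp1", "geneX"]
isolate = ["locusA:snp1", "locusC:snp9"]

print(distance_to_resistance(isolate, profile))  # 3
print(missing_variants(isolate, profile))
```

A distance of 0 would mean that the isolate already carries every variant in the assumed profile; variants outside the profile (here `locusC:snp9`) do not affect the count.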
Haploids: Constraints and opportunities in plant breeding.
Dwivedi, Sangam L; Britt, Anne B; Tripathi, Leena; Sharma, Shivali; Upadhyaya, Hari D; Ortiz, Rodomiro
2015-11-01
The discovery of haploids in higher plants led to the use of doubled haploid (DH) technology in plant breeding. This article provides the state of the art on DH technology, including the induction and identification of haploids, the factors that influence haploid induction, the molecular basis of microspore embryogenesis, the genetic underpinnings of haploid induction, and its use in plant breeding, particularly to fix traits and unlock genetic variation. Both in vitro and in vivo methods have been used to induce haploids that are thereafter chromosome doubled to produce DH. Various heritable factors contribute to the successful induction of haploids, whose genetics is that of a quantitative trait. Genomic regions associated with in vitro and in vivo DH production were noted in various crops with the aid of DNA markers. It seems that F2 plants are more suitable than F1 plants for the induction of DH lines. Identifying putative haploids is a key issue in haploid breeding. DH technology in Brassicas and cereals, such as barley, maize, rice, rye and wheat, has been improved and used routinely in cultivar development, while in other food staples such as pulses and root crops the technology has not reached the stage leading to its application in plant breeding. The centromere-mediated haploid induction system has been used in Arabidopsis, but not yet in crops. Most food staples are derived from genomic resources-rich crops, including those with sequenced reference genomes. The integration of genomic resources with DH technology provides new opportunities for improving selection methods, maximizing selection gains and accelerating cultivar development. Marker-aided breeding and DH technology have been used to improve host plant resistance in barley, rice, and wheat. Multinational seed companies are using DH technology in large-scale production of inbred lines for further development of hybrid cultivars, particularly in maize. The public sector provides support to national programs or small-medium private seed companies for the exploitation of DH technology in plant breeding. Copyright © 2015 Elsevier Inc. All rights reserved.
Identification of Differentially Methylated Sites with Weak Methylation Effects
Tran, Hong; Zhu, Hongxiao; Wu, Xiaowei; Kim, Gunjune; Clarke, Christopher R.; Larose, Hailey; Haak, David C.; Westwood, James H.; Zhang, Liqing
2018-01-01
Deoxyribonucleic acid (DNA) methylation is an epigenetic alteration crucial for regulating stress responses. Identifying large-scale DNA methylation at single nucleotide resolution is made possible by whole genome bisulfite sequencing. An essential task following the generation of bisulfite sequencing data is to detect differentially methylated cytosines (DMCs) among treatments. Most statistical methods for DMC detection do not consider the dependency of methylation patterns across the genome, thus possibly inflating type I error. Furthermore, small sample sizes and weak methylation effects among different phenotype categories make it difficult for these statistical methods to accurately detect DMCs. To address these issues, the wavelet-based functional mixed model (WFMM) was introduced to detect DMCs. To further examine the performance of WFMM in detecting weak differential methylation events, we used both simulated and empirical data and compared WFMM performance to a popular DMC detection tool, methylKit. Analyses of simulated data that replicated the effects of the herbicide glyphosate on DNA methylation in Arabidopsis thaliana show that WFMM results in higher sensitivity and specificity in detecting DMCs compared to methylKit, especially when the methylation differences among phenotype groups are small. Moreover, the performance of WFMM is robust with respect to small sample sizes, making it particularly attractive considering the current high costs of bisulfite sequencing. Analysis of empirical Arabidopsis thaliana data under varying glyphosate dosages, and the analysis of monozygotic (MZ) twins who have different pain sensitivities (both datasets have weak methylation effects of <1%), show that WFMM can identify more relevant DMCs related to the phenotype of interest than methylKit. Differentially methylated regions (DMRs) are genomic regions with different DNA methylation status across biological samples. DMRs and DMCs are essentially the same concepts, with the only difference being how methylation information across the genome is summarized. If methylation levels are determined by grouping neighboring cytosine sites, then they are DMRs; if methylation levels are calculated based on single cytosines, they are DMCs. PMID:29419727
Distinctive characters of Nostoc genomes in cyanolichens.
Gagunashvili, Andrey N; Andrésson, Ólafur S
2018-06-05
Cyanobacteria of the genus Nostoc are capable of forming symbioses with a wide range of organisms, including a diverse assemblage of cyanolichens. Only certain lineages of Nostoc appear to be able to form a close, stable symbiosis, raising the question whether symbiotic competence is determined by specific sets of genes and functionalities. We present the complete genome sequencing, annotation and analysis of two lichen Nostoc strains. Comparison with other Nostoc genomes allowed identification of genes potentially involved in symbioses with a broad range of partners including lichen mycobionts. The presence of additional genes necessary for symbiotic competence is likely reflected in larger genome sizes of symbiotic Nostoc strains. Some of the identified genes are presumably involved in the initial recognition and establishment of the symbiotic association, while others may confer advantage to cyanobionts during cohabitation with a mycobiont in the lichen symbiosis. Our study presents the first genome sequencing and genome-scale analysis of lichen-associated Nostoc strains. These data provide insight into the molecular nature of the cyanolichen symbiosis and pinpoint candidate genes for further studies aimed at deciphering the genetic mechanisms behind the symbiotic competence of Nostoc. Since many phylogenetic studies have shown that Nostoc is a polyphyletic group that includes several lineages, this work also provides an improved molecular basis for demarcation of a Nostoc clade with symbiotic competence.
Scaglione, Davide; Reyes-Chin-Wo, Sebastian; Acquadro, Alberto; Froenicke, Lutz; Portis, Ezio; Beitel, Christopher; Tirone, Matteo; Mauro, Rosario; Lo Monaco, Antonino; Mauromicale, Giovanni; Faccioli, Primetta; Cattivelli, Luigi; Rieseberg, Loren; Michelmore, Richard; Lanteri, Sergio
2016-01-01
Globe artichoke (Cynara cardunculus var. scolymus) is an out-crossing, perennial, multi-use crop species that is grown worldwide and belongs to the Compositae, one of the most successful Angiosperm families. We describe the first genome sequence of globe artichoke. The assembly, comprising 13,588 scaffolds covering 725 of the 1,084 Mb genome, was generated using ~133-fold Illumina sequencing data and encodes 26,889 predicted genes. Re-sequencing (30×) of globe artichoke and cultivated cardoon (C. cardunculus var. altilis) parental genotypes and low-coverage (0.5 to 1×) genotyping-by-sequencing of 163 F1 individuals resulted in 73% of the assembled genome being anchored in 2,178 genetic bins ordered along 17 chromosomal pseudomolecules. This was achieved using a novel pipeline, SOILoCo (Scaffold Ordering by Imputation with Low Coverage), to detect heterozygous regions and assign parental haplotypes with low sequencing read depth and of unknown phase. SOILoCo provides a powerful tool for de novo genome analysis of outcrossing species. Our data will enable genome-scale analyses of evolutionary processes among crops, weeds, and wild species within and beyond the Compositae, and will facilitate the identification of economically important genes from related species. PMID:26786968
Garazha, Andrew; Ivanova, Alena; Suntsova, Maria; Malakhova, Galina; Roumiantsev, Sergey; Zhavoronkov, Alex; Buzdin, Anton
2015-01-01
Endogenous retroviruses (ERVs) and LTR retrotransposons (LRs) occupy ∼8% of the human genome. Deep sequencing technologies provide clues to understanding the functional relevance of individual ERVs/LRs by enabling direct identification of transcription factor binding sites (TFBS) and other landmarks of functional genomic elements. Here, we performed the genome-wide identification of human ERVs/LRs containing TFBS according to the ENCODE project. We created the first interactive ERV/LRs database that groups the individual inserts according to their familial nomenclature, number of mapped TFBS and divergence from their consensus sequence. Information on any particular element can be easily extracted by the user. We also created a genome browser tool, which enables quick mapping of any ERV/LR insert according to genomic coordinates, known human genes and TFBS. These tools can be used to easily explore functionally relevant individual ERV/LRs, and for studying their impact on the regulation of human genes. Overall, we identified ∼110,000 ERV/LR genomic elements having TFBS. We propose a hypothesis of "domestication" of ERV/LR TFBS by the genome milieu including subsequent stages of initial epigenetic repression, partial functional release, and further mutation-driven reshaping of TFBS in tight coevolution with the enclosing genomic loci.
Next-generation genome-scale models for metabolic engineering.
King, Zachary A; Lloyd, Colton J; Feist, Adam M; Palsson, Bernhard O
2015-12-01
Constraint-based reconstruction and analysis (COBRA) methods have become widely used tools for metabolic engineering in both academic and industrial laboratories. By employing a genome-scale in silico representation of the metabolic network of a host organism, COBRA methods can be used to predict optimal genetic modifications that improve the rate and yield of chemical production. A new generation of COBRA models and methods is now being developed, encompassing many biological processes and simulation strategies, and next-generation models enable new types of predictions. Here, three key examples of applying COBRA methods to strain optimization are presented and discussed. Then, an outlook is provided on the next generation of COBRA models and the new types of predictions they will enable for systems metabolic engineering. Copyright © 2014 Elsevier Ltd. All rights reserved.
[Genome-scale sequence data processing and epigenetic analysis of DNA methylation].
Wang, Ting-Zhang; Shan, Gao; Xu, Jian-Hong; Xue, Qing-Zhong
2013-06-01
A recently developed approach for detecting cytosine DNA methylation (mC) and profiling DNA methylation at the genome scale is BS-Seq, which is based on bisulfite conversion of genomic DNA combined with next-generation sequencing. The method can not only provide insight into the differences in genome-scale DNA methylation among different organisms, but also reveal the conservation of DNA methylation in all contexts and the nucleotide preference for different genomic regions, including genes, exons, and repetitive DNA sequences. It will be helpful for understanding the epigenetic impacts of cytosine DNA methylation on the regulation of gene expression and on maintaining the silence of repetitive sequences, such as transposable elements. In this paper, we introduce the preprocessing steps for DNA methylation data, by which cytosine (C) and guanine (G) in the reference sequence are converted to thymine (T) and adenine (A), respectively, and cytosine in reads is converted to thymine. We also comprehensively review the main content of DNA methylation analysis on the genomic scale: (1) cytosine methylation in different sequence contexts; (2) the distribution of genomic methylcytosine; (3) DNA methylation context and the preference for the nucleotides; (4) DNA-protein interaction sites of DNA methylation; (5) the degree of cytosine methylation in the different structural elements of genes. DNA methylation analysis provides a powerful tool for studying the epigenome in humans and other species and for studying gene-environment interactions, and lays the theoretical basis for further development of disease diagnostics and therapeutics in humans.
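The in silico conversion step described in this abstract can be sketched in a few lines. This is a simplified illustration under stated assumptions: reads come from the forward strand only and are already placed at a known offset, and the function names are hypothetical rather than taken from any particular BS-Seq tool.

```python
from typing import List, Tuple


def convert_reference(ref: str) -> Tuple[str, str]:
    """Return the C->T and the G->A converted copies of a reference sequence."""
    ref = ref.upper()
    return ref.replace("C", "T"), ref.replace("G", "A")


def convert_read(read: str) -> str:
    """Fully convert a read (C->T) so it can be matched to the C->T reference."""
    return read.upper().replace("C", "T")


def call_methylation(ref: str, read: str, offset: int) -> List[Tuple[int, bool]]:
    """Compare the ORIGINAL read with the ORIGINAL reference at every reference C.

    A C that survived bisulfite treatment is called methylated (True);
    a C that reads as T is called unmethylated (False).
    """
    calls = []
    for i, base in enumerate(read.upper()):
        pos = offset + i
        if ref[pos].upper() != "C":
            continue
        if base == "C":
            calls.append((pos, True))
        elif base == "T":
            calls.append((pos, False))
    return calls


ref = "ACGTCCGA"
read = "ATGTCTGA"  # toy bisulfite-treated read from the forward strand

ct_ref, ga_ref = convert_reference(ref)
offset = ct_ref.find(convert_read(read))  # exact match stands in for alignment
print(offset)                             # 0
print(call_methylation(ref, read, offset))
```

Matching the converted read against the converted reference removes the C/T mismatches that bisulfite treatment introduces; the methylation state is then recovered by going back to the unconverted sequences.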
An exploration into study design for biomarker identification: issues and recommendations.
Hall, Jacqueline A; Brown, Robert; Paul, Jim
2007-01-01
Genomic profiling produces large amounts of data and a challenge remains in identifying relevant biological processes associated with clinical outcome. Many candidate biomarkers have been identified but few have been successfully validated and make an impact clinically. This review focuses on some of the study design issues encountered in data mining for biomarker identification with illustrations of how study design may influence the final results. This includes issues of clinical endpoint use and selection, power, statistical, biological and clinical significance. We give particular attention to study design for the application of supervised clustering methods for identification of gene networks associated with clinical outcome and provide recommendations for future work to increase the success of identification of clinically relevant biomarkers.
RATT: Rapid Annotation Transfer Tool
Otto, Thomas D.; Dillon, Gary P.; Degrave, Wim S.; Berriman, Matthew
2011-01-01
Second-generation sequencing technologies have made large-scale sequencing projects commonplace. However, making use of these datasets often requires gene function to be ascribed genome wide. Although tool development has kept pace with the changes in sequence production, for tasks such as mapping, de novo assembly or visualization, genome annotation remains a challenge. We have developed a method to rapidly provide accurate annotation for new genomes using previously annotated genomes as a reference. The method, implemented in a tool called RATT (Rapid Annotation Transfer Tool), transfers annotations from a high-quality reference to a new genome on the basis of conserved synteny. We demonstrate that a Mycobacterium tuberculosis genome or a single 2.5 Mb chromosome from a malaria parasite can be annotated in less than five minutes with only modest computational resources. RATT is available at http://ratt.sourceforge.net. PMID:21306991
Identification of genomic islands in six plant pathogens.
Chen, Ling-Ling
2006-06-07
Genomic islands (GIs), which are acquired by horizontal gene transfer, play important roles in microbial evolution. In this paper, the GIs of six completely sequenced plant pathogens are identified using a windowless method based on the Z curve representation of DNA sequences. Four, eight, four, one, two and four GIs longer than 20 kb are recognized in the plant pathogens Agrobacterium tumefaciens str. C58, Ralstonia solanacearum GMI1000, Xanthomonas axonopodis pv. citri str. 306 (Xac), Xanthomonas campestris pv. campestris str. ATCC33913 (Xcc), Xylella fastidiosa 9a5c and Pseudomonas syringae pv. tomato str. DC3000, respectively. Most of these regions share a set of conserved features of GIs, including an abrupt change in GC content compared with that of the rest of the genome, the existence of integrase genes at the junction, the use of tRNA as the integration sites, the presence of genetic mobility genes, and differences in codon usage, codon preference and amino acid usage. The identification of these GIs will benefit research on these six important phytopathogens.
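For readers unfamiliar with the Z curve, the sketch below computes its three standard components from cumulative base counts: purine versus pyrimidine (x), amino versus keto (y), and weak versus strong hydrogen bonding (z). It shows only the representation; how the paper locates island boundaries on the curve is not described in the abstract and is not reproduced here. The function name is an assumption.

```python
from typing import List, Tuple


def z_curve(seq: str) -> List[Tuple[int, int, int]]:
    """Return the Z curve points (x_n, y_n, z_n) for n = 1..len(seq).

    x_n = (A_n + G_n) - (C_n + T_n)   purine vs pyrimidine
    y_n = (A_n + C_n) - (G_n + T_n)   amino vs keto
    z_n = (A_n + T_n) - (C_n + G_n)   weak vs strong hydrogen bonding
    where A_n, C_n, G_n, T_n are cumulative counts in the first n bases.
    Ambiguous bases (anything other than A, C, G, T) are skipped.
    """
    counts = {"A": 0, "C": 0, "G": 0, "T": 0}
    points = []
    for base in seq.upper():
        if base not in counts:
            continue
        counts[base] += 1
        a, c, g, t = counts["A"], counts["C"], counts["G"], counts["T"]
        points.append(((a + g) - (c + t), (a + c) - (g + t), (a + t) - (c + g)))
    return points


print(z_curve("ACGT"))  # [(1, 1, 1), (0, 2, 0), (1, 1, -1), (0, 0, 0)]
```

Because the z component falls with every G or C and rises with every A or T, a region whose GC content differs sharply from the genome average appears as a change in the slope of z_n, which is why no sliding window is needed.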
Across language families: Genome diversity mirrors linguistic variation within Europe
Longobardi, Giuseppe; Ghirotto, Silvia; Guardiano, Cristina; Tassi, Francesca; Benazzo, Andrea; Ceolin, Andrea
2015-01-01
ABSTRACT Objectives: The notion that patterns of linguistic and biological variation may cast light on each other and on population histories dates back to Darwin's times; yet, turning this intuition into a proper research program has met with serious methodological difficulties, especially affecting language comparisons. This article takes advantage of two new tools of comparative linguistics: a refined list of Indo‐European cognate words, and a novel method of language comparison estimating linguistic diversity from a universal inventory of grammatical polymorphisms, and hence enabling comparison even across different families. We corroborated the method and used it to compare patterns of linguistic and genomic variation in Europe. Materials and Methods: Two sets of linguistic distances, lexical and syntactic, were inferred from these data and compared with measures of geographic and genomic distance through a series of matrix correlation tests. Linguistic and genomic trees were also estimated and compared. A method (Treemix) was used to infer migration episodes after the main population splits. Results: We observed significant correlations between genomic and linguistic diversity, the latter inferred from data on both Indo‐European and non‐Indo‐European languages. Contrary to previous observations, on the European scale, language proved a better predictor of genomic differences than geography. Inferred episodes of genetic admixture following the main population splits found convincing correlates also in the linguistic realm. Discussion: These results pave the ground for previously unfeasible cross‐disciplinary analyses at the worldwide scale, encompassing populations of distant language families. Am J Phys Anthropol 157:630–640, 2015. © 2015 Wiley Periodicals, Inc. PMID:26059462
Genetic Recombination Is Targeted towards Gene Promoter Regions in Dogs
Auton, Adam; Rui Li, Ying; Kidd, Jeffrey; Oliveira, Kyle; Nadel, Julie; Holloway, J. Kim; Hayward, Jessica J.; Cohen, Paula E.; Greally, John M.; Wang, Jun; Bustamante, Carlos D.; Boyko, Adam R.
2013-01-01
The identification of the H3K4 trimethylase, PRDM9, as the gene responsible for recombination hotspot localization has provided considerable insight into the mechanisms by which recombination is initiated in mammals. However, uniquely amongst mammals, canids appear to lack a functional version of PRDM9 and may therefore provide a model for understanding recombination that occurs in the absence of PRDM9, and thus how PRDM9 functions to shape the recombination landscape. We have constructed a fine-scale genetic map from patterns of linkage disequilibrium assessed using high-throughput sequence data from 51 free-ranging dogs, Canis lupus familiaris. While broad-scale properties of recombination appear similar to other mammalian species, our fine-scale estimates indicate that, in dogs, highly elevated recombination rates are observed in the vicinity of CpG-rich regions, including gene promoter regions, but show little association with H3K4 trimethylation marks identified in spermatocytes. By comparison to genomic data from the Andean fox, Lycalopex culpaeus, we show that biased gene conversion is a plausible mechanism by which the high CpG content of the dog genome could have occurred. PMID:24348265
Poly A- transcripts expressed in HeLa cells.
Wu, Qingfa; Kim, Yeong C; Lu, Jian; Xuan, Zhenyu; Chen, Jun; Zheng, Yonglan; Zhou, Tom; Zhang, Michael Q; Wu, Chung-I; Wang, San Ming
2008-07-30
Transcripts expressed in eukaryotes are classified as poly A+ transcripts or poly A- transcripts based on the presence or absence of the 3' poly A tail. Most transcripts identified so far are poly A+ transcripts, whereas the poly A- transcripts remain largely unknown. We developed the TRD (Total RNA Detection) system for transcript identification. The system detects the transcripts through the following steps: 1) depleting the abundant ribosomal and small-size transcripts; 2) synthesizing cDNA without regard to the status of the 3' poly A tail; 3) applying the 454 sequencing technology for massive 3' EST collection from the cDNA; and 4) determining the genome origins of the detected transcripts by mapping the sequences to the human genome reference sequences. Using this system, we characterized the cytoplasmic transcripts from HeLa cells. Of the 13,467 distinct 3' ESTs analyzed, 24% are poly A-, 36% are poly A+, and 40% are bimorphic with poly A+ features but without the 3' poly A tail. Most of the poly A- 3' ESTs do not match known transcript sequences; they have a similar distribution pattern in the genome as the poly A+ and bimorphic 3' ESTs, and their mapped intergenic regions are evolutionarily conserved. Experiments confirmed the authenticity of the detected poly A- transcripts. Our study provides the first large-scale sequence evidence for the presence of poly A- transcripts in eukaryotes. The abundance of the poly A- transcripts highlights the need for comprehensive identification of these transcripts for decoding the transcriptome, annotating the genome and studying biological relevance of the poly A- transcripts.
Voz, Marianne L.; Coppieters, Wouter; Manfroid, Isabelle; Baudhuin, Ariane; Von Berg, Virginie; Charlier, Carole; Meyer, Dirk; Driever, Wolfgang; Martial, Joseph A.; Peers, Bernard
2012-01-01
Forward genetics using zebrafish is a powerful tool for studying vertebrate development through large-scale mutagenesis. Nonetheless, the identification of the molecular lesion is still laborious and involves time-consuming genetic mapping. Here, we show that high-throughput sequencing of the whole zebrafish genome can directly locate the interval carrying the causative mutation and at the same time pinpoint the molecular lesion. The feasibility of this approach was validated by sequencing the m1045 mutant line that displays a severe hypoplasia of the exocrine pancreas. We generated 13 Gb of sequence, equivalent to an eightfold genomic coverage, from a pool of 50 mutant embryos obtained from a map-cross between the AB mutant carrier and the WIK polymorphic strain. The chromosomal region carrying the causal mutation was localized based on its unique property to display high levels of homozygosity among sequence reads as it derives exclusively from the initial AB mutated allele. We developed an algorithm identifying such a region by calculating a homozygosity score along all chromosomes. This highlighted an 8-Mb window on chromosome 5 with a score close to 1 in the m1045 mutants. The sequence analysis of all genes within this interval revealed a nonsense mutation in the snapc4 gene. Knockdown experiments confirmed the assertion that snapc4 is the gene whose mutation leads to exocrine pancreas hypoplasia. In conclusion, this study constitutes a proof-of-concept that whole-genome sequencing is a fast and effective alternative to the classical positional cloning strategies in zebrafish. PMID:22496837
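The homozygosity scan described in this abstract can be illustrated with a toy version. The abstract does not give the scoring formula, so the definition used here (the mean fraction of reads supporting the major allele at each polymorphic site, averaged over fixed, non-overlapping windows) is an assumption, as are the function names and the window size.

```python
from typing import Dict, List, Tuple


def site_homozygosity(ref_reads: int, alt_reads: int) -> float:
    """Fraction of reads supporting the major allele at one polymorphic site."""
    return max(ref_reads, alt_reads) / (ref_reads + alt_reads)


def window_scores(sites: List[Tuple[int, int, int]], window: int) -> Dict[int, float]:
    """Average the per-site score in non-overlapping windows along one chromosome.

    sites: (position, ref_reads, alt_reads) tuples; uncovered sites are skipped.
    Returns {window_start: mean score}.
    """
    buckets: Dict[int, List[float]] = {}
    for pos, ref_reads, alt_reads in sites:
        if ref_reads + alt_reads == 0:
            continue
        start = (pos // window) * window
        buckets.setdefault(start, []).append(site_homozygosity(ref_reads, alt_reads))
    return {start: sum(vals) / len(vals) for start, vals in sorted(buckets.items())}


def best_window(scores: Dict[int, float]) -> int:
    """Start coordinate of the window with the highest mean score."""
    return max(scores, key=scores.get)


# Toy pooled-read counts: the second window is fully homozygous.
sites = [(100, 4, 4), (900, 5, 3), (1200, 8, 0), (1800, 0, 8)]
scores = window_scores(sites, window=1000)
print(scores)               # {0: 0.5625, 1000: 1.0}
print(best_window(scores))  # 1000
```

In a pool of mutant embryos, unlinked regions carry both parental alleles and score well below 1, while the interval linked to the mutation derives from a single parental haplotype and scores close to 1, matching the behaviour the abstract reports for the window on chromosome 5.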
Liu, Jia; Guo, Jinchao; Zhang, Haibo; Li, Ning; Yang, Litao; Zhang, Dabing
2009-11-25
Various polymerase chain reaction (PCR) methods were developed for the execution of genetically modified organism (GMO) labeling policies, of which an event-specific PCR detection method based on the flanking sequence of exogenous integration is the primary trend in GMO detection due to its high specificity. In this study, the 5' and 3' flanking sequences of the exogenous integration of MON89788 soybean were revealed by thermal asymmetric interlaced PCR. The event-specific PCR primers and TaqMan probe were designed based upon the revealed 5' flanking sequence, and the qualitative and quantitative PCR assays were established employing these designed primers and probes. In qualitative PCR, the limit of detection (LOD) was about 0.01 ng of genomic DNA corresponding to 10 copies of haploid soybean genomic DNA. In the quantitative PCR assay, the LOD was as low as two haploid genome copies, and the limit of quantification was five haploid genome copies. Furthermore, the developed PCR methods were in-house validated by five researchers, and the validated results indicated that the developed event-specific PCR methods can be used for identification and quantification of MON89788 soybean and its derivatives.
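The stated equivalence between 0.01 ng of genomic DNA and roughly 10 haploid genome copies can be checked with a short calculation. The sketch assumes the commonly used conversion of about 978 Mbp per picogram of double-stranded DNA and a haploid soybean genome of about 1,100 Mbp; neither figure comes from the abstract, and the function name is hypothetical.

```python
MBP_PER_PG = 978.0         # assumed conversion: ~978 Mbp of dsDNA per picogram
SOYBEAN_1C_MBP = 1100.0    # assumed haploid soybean genome size, in Mbp


def haploid_copies(mass_ng: float, genome_mbp: float = SOYBEAN_1C_MBP) -> float:
    """Approximate number of haploid genome copies in a given DNA mass."""
    mass_pg = mass_ng * 1000.0
    pg_per_copy = genome_mbp / MBP_PER_PG
    return mass_pg / pg_per_copy


print(round(haploid_copies(0.01), 1))  # 8.9
```

Under these assumptions 0.01 ng corresponds to about nine copies, which agrees with the "about 10 copies" quoted for the qualitative assay.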
Population Genomics of Fungal and Oomycete Pathogens.
Grünwald, Niklaus J; McDonald, Bruce A; Milgroom, Michael G
2016-08-04
We are entering a new era in plant pathology in which whole-genome sequences of many individuals of a pathogen species are becoming readily available. Population genomics aims to discover genetic mechanisms underlying phenotypes associated with adaptive traits such as pathogenicity, virulence, fungicide resistance, and host specialization, as genome sequences or large numbers of single nucleotide polymorphisms become readily available from multiple individuals of the same species. This emerging field encompasses detailed genetic analyses of natural populations, comparative genomic analyses of closely related species, identification of genes under selection, and linkage analyses involving association studies in natural populations or segregating populations resulting from crosses. The era of pathogen population genomics will provide new opportunities and challenges, requiring new computational and analytical tools. This review focuses on conceptual and methodological issues as well as the approaches to answering questions in population genomics. The major steps start with defining relevant biological and evolutionary questions, followed by sampling, genotyping, and phenotyping, and ending in analytical methods and interpretations. We provide examples of recent applications of population genomics to fungal and oomycete plant pathogens.
Malin, Bradley; Sweeney, Latanya
2004-06-01
The increasing integration of patient-specific genomic data into clinical practice and research raises serious privacy concerns. Various systems have been proposed that protect privacy by removing or encrypting explicitly identifying information, such as name or social security number, into pseudonyms. Though these systems claim to protect identity from being disclosed, they lack formal proofs. In this paper, we study the erosion of privacy when genomic data, either pseudonymous or data believed to be anonymous, are released into a distributed healthcare environment. Several algorithms are introduced, collectively called RE-Identification of Data In Trails (REIDIT), which link genomic data to named individuals in publicly available records by leveraging unique features in patient-location visit patterns. Algorithmic proofs of re-identification are developed and we demonstrate, with experiments on real-world data, that susceptibility to re-identification is neither trivial nor the result of bizarre isolated occurrences. We propose that such techniques can be applied as system tests of privacy protection capabilities.
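The trail-matching idea described in this abstract can be illustrated with a small sketch. This is not the authors' REIDIT code: it is a toy version of exact trail matching under the assumption that every release is complete, and all record IDs, names, and hospital labels (`d1`, `alice`, `H1`, ...) are hypothetical. A pseudonymous genomic record is linked to a named record only when both have the same set of visited locations and no other record on either side shares that set.

```python
from collections import defaultdict


def build_trails(visits):
    """Map each record ID to the set of locations at which it appears."""
    trails = defaultdict(set)
    for record_id, location in visits:
        trails[record_id].add(location)
    return {record_id: frozenset(locs) for record_id, locs in trails.items()}


def group_by_trail(trails):
    """Invert the mapping: trail -> sorted list of record IDs sharing it."""
    groups = defaultdict(list)
    for record_id, trail in trails.items():
        groups[trail].append(record_id)
    return {trail: sorted(ids) for trail, ids in groups.items()}


def link_unique_trails(dna_visits, named_visits):
    """Link a genomic record to a name only when their shared trail is unique
    on both sides; ambiguous trails are left unlinked."""
    dna_groups = group_by_trail(build_trails(dna_visits))
    named_groups = group_by_trail(build_trails(named_visits))
    links = {}
    for trail, dna_ids in dna_groups.items():
        names = named_groups.get(trail, [])
        if len(dna_ids) == 1 and len(names) == 1:
            links[dna_ids[0]] = names[0]
    return links


# Hypothetical releases: (record, location) pairs from three hospitals.
DNA_VISITS = [("d1", "H1"), ("d1", "H2"), ("d2", "H2"), ("d2", "H3"),
              ("d3", "H1"), ("d4", "H1")]
NAMED_VISITS = [("alice", "H1"), ("alice", "H2"), ("bob", "H2"), ("bob", "H3"),
                ("carol", "H1"), ("dave", "H1")]

LINKS = link_unique_trails(DNA_VISITS, NAMED_VISITS)
print(LINKS)
```

In this toy data, `d1` and `d2` are re-identified because their trails are unique, while `d3` and `d4` stay ambiguous because two named patients share the same single-hospital trail.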
Nabavi, Sheida
2016-08-15
With advances in technologies, huge amounts of multiple types of high-throughput genomics data are available. These data have tremendous potential to identify new and clinically valuable biomarkers to guide the diagnosis, assessment of prognosis, and treatment of complex diseases, such as cancer. Integrating, analyzing, and interpreting big and noisy genomics data to obtain biologically meaningful results, however, remains highly challenging. Mining genomics datasets by utilizing advanced computational methods can help to address these issues. To facilitate the identification of a short list of biologically meaningful genes as candidate drivers of anti-cancer drug resistance from an enormous amount of heterogeneous data, we employed statistical machine-learning techniques and integrated genomics datasets. We developed a computational method that integrates gene expression, somatic mutation, and copy number aberration data of sensitive and resistant tumors. In this method, an integrative method based on module network analysis is applied to identify potential driver genes. This is followed by cross-validation and a comparison of the results of the sensitive and resistant groups to obtain the final list of candidate biomarkers. We applied this method to the ovarian cancer data from The Cancer Genome Atlas. The final result contains biologically relevant genes, such as COL11A1, which has been reported as a cis-platinum resistant biomarker for epithelial ovarian carcinoma in several recent studies. The described method yields a short list of aberrant genes that also control the expression of their co-regulated genes. The results suggest that the unbiased data driven computational method can identify biologically relevant candidate biomarkers. It can be utilized in a wide range of applications that compare two conditions with highly heterogeneous datasets.
Continuing Evolution of Burkholderia mallei Through Genome Reduction and Large-Scale Rearrangements
2010-01-22
in Materials and Methods. b NRPS, nonribosomal peptide synthase; PKS, polyketide synthase; RND, resistance nodulation-division like pump. Losada et al...genomics, genome erosion, bacterial virulence. © The Author(s) 2010. Published by Oxford University Press on behalf of the Society for Molecular Biology...creativecommons.org/licenses/by-nc/2.5), which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original
Inference of Ancestral Recombination Graphs through Topological Data Analysis
Cámara, Pablo G.; Levine, Arnold J.; Rabadán, Raúl
2016-01-01
The recent explosion of genomic data has underscored the need for interpretable and comprehensive analyses that can capture complex phylogenetic relationships within and across species. Recombination, reassortment and horizontal gene transfer constitute examples of pervasive biological phenomena that cannot be captured by tree-like representations. Starting from hundreds of genomes, we are interested in the reconstruction of potential evolutionary histories leading to the observed data. Ancestral recombination graphs represent potential histories that explicitly accommodate recombination and mutation events across orthologous genomes. However, they are computationally costly to reconstruct, usually being infeasible for more than a few tens of genomes. Recently, Topological Data Analysis (TDA) methods have been proposed as robust and scalable methods that can capture the genetic scale and frequency of recombination. We build upon previous TDA developments for detecting and quantifying recombination, and present a novel framework that can be applied to hundreds of genomes and can be interpreted in terms of minimal histories of mutation and recombination events, quantifying the scales and identifying the genomic locations of recombinations. We implement this framework in a software package, called TARGet, and apply it to several examples, including small migration between different populations, human recombination, and horizontal evolution in finches inhabiting the Galápagos Islands. PMID:27532298
Multiplexed precision genome editing with trackable genomic barcodes in yeast.
Roy, Kevin R; Smith, Justin D; Vonesch, Sibylle C; Lin, Gen; Tu, Chelsea Szu; Lederer, Alex R; Chu, Angela; Suresh, Sundari; Nguyen, Michelle; Horecka, Joe; Tripathi, Ashutosh; Burnett, Wallace T; Morgan, Maddison A; Schulz, Julia; Orsley, Kevin M; Wei, Wu; Aiyar, Raeka S; Davis, Ronald W; Bankaitis, Vytas A; Haber, James E; Salit, Marc L; St Onge, Robert P; Steinmetz, Lars M
2018-07-01
Our understanding of how genotype controls phenotype is limited by the scale at which we can precisely alter the genome and assess the phenotypic consequences of each perturbation. Here we describe a CRISPR-Cas9-based method for multiplexed accurate genome editing with short, trackable, integrated cellular barcodes (MAGESTIC) in Saccharomyces cerevisiae. MAGESTIC uses array-synthesized guide-donor oligos for plasmid-based high-throughput editing and features genomic barcode integration to prevent plasmid barcode loss and to enable robust phenotyping. We demonstrate that editing efficiency can be increased more than fivefold by recruiting donor DNA to the site of breaks using the LexA-Fkh1p fusion protein. We performed saturation editing of the essential gene SEC14 and identified amino acids critical for chemical inhibition of lipid signaling. We also constructed thousands of natural genetic variants, characterized guide mismatch tolerance at the genome scale, and ascertained that cryptic Pol III termination elements substantially reduce guide efficacy. MAGESTIC will be broadly useful to uncover the genetic basis of phenotypes in yeast.
2011-01-01
Background Copy number aberrations (CNAs) are an important molecular signature in cancer initiation, development, and progression. However, these aberrations span a wide range of chromosomes, making it hard to distinguish cancer related genes from other genes that are not closely related to cancer but are located in broadly aberrant regions. With the current availability of high-resolution data sets such as single nucleotide polymorphism (SNP) microarrays, it has become an important issue to develop a computational method to detect driving genes related to cancer development located in the focal regions of CNAs. Results In this study, we introduce a novel method referred to as the wavelet-based identification of focal genomic aberrations (WIFA). The use of the wavelet analysis, because it is a multi-resolution approach, makes it possible to effectively identify focal genomic aberrations in broadly aberrant regions. The proposed method integrates multiple cancer samples so that it enables the detection of the consistent aberrations across multiple samples. We then apply this method to glioblastoma multiforme and lung cancer data sets from the SNP microarray platform. Through this process, we confirm the ability to detect previously known cancer related genes from both cancer types with high accuracy. Also, the application of this approach to a lung cancer data set identifies focal amplification regions that contain known oncogenes, though these regions are not reported using a recent CNAs detecting algorithm GISTIC: SMAD7 (chr18q21.1) and FGF10 (chr5p12). Conclusions Our results suggest that WIFA can be used to reveal cancer related genes in various cancer data sets. PMID:21569311
PathFinder: reconstruction and dynamic visualization of metabolic pathways.
Goesmann, Alexander; Haubrock, Martin; Meyer, Folker; Kalinowski, Jörn; Giegerich, Robert
2002-01-01
Beyond methods for a gene-wise annotation and analysis of sequenced genomes, new automated methods for functional analysis on a higher level are needed. The identification of realized metabolic pathways provides valuable information on gene expression and regulation. Detection of incomplete pathways helps to improve a constantly evolving genome annotation or discover alternative biochemical pathways. To utilize automated genome analysis on the level of metabolic pathways, new methods for the dynamic representation and visualization of pathways are needed. PathFinder is a tool for the dynamic visualization of metabolic pathways based on annotation data. Pathways are represented as directed acyclic graphs; graph layout algorithms accomplish the dynamic drawing and visualization of the metabolic maps. A more detailed analysis of the input data on the level of biochemical pathways helps to identify genes and detect improper parts of annotations. As a Relational Database Management System (RDBMS) based internet application, PathFinder reads a list of EC-numbers or a given annotation in EMBL- or Genbank-format and dynamically generates pathway graphs.
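A minimal sketch of the representation described above, assuming a pathway is stored as a list of (substrate, product, EC number) edges. This is not PathFinder's implementation or data model; the function names and the `REACTIONS` table are illustrative assumptions, using the first steps of glycolysis as the example pathway. The sketch orders metabolites topologically, as a layered drawing would need, and reports reactions whose EC number is absent from an annotation, which is one way to flag an incomplete pathway.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Illustrative pathway fragment: (substrate, product, EC number).
REACTIONS = [
    ("glucose", "glucose-6-phosphate", "2.7.1.1"),
    ("glucose-6-phosphate", "fructose-6-phosphate", "5.3.1.9"),
    ("fructose-6-phosphate", "fructose-1,6-bisphosphate", "2.7.1.11"),
    ("fructose-1,6-bisphosphate", "glyceraldehyde-3-phosphate", "4.1.2.13"),
]


def drawing_order(reactions):
    """Return metabolites in topological order (substrates before products).

    TopologicalSorter raises graphlib.CycleError if the graph is not acyclic.
    """
    sorter = TopologicalSorter()
    for substrate, product, _ec in reactions:
        sorter.add(product, substrate)  # product depends on its substrate
    return list(sorter.static_order())


def missing_steps(reactions, annotated_ecs):
    """Return the reactions whose EC number is not in the annotation."""
    return [r for r in reactions if r[2] not in annotated_ecs]


# Hypothetical annotation that lacks 6-phosphofructokinase (EC 2.7.1.11).
ANNOTATED = {"2.7.1.1", "5.3.1.9", "4.1.2.13"}

ORDER = drawing_order(REACTIONS)
GAPS = missing_steps(REACTIONS, ANNOTATED)
print(ORDER)
print(GAPS)
```

A gap reported this way could mean either a missing annotation or an alternative biochemical route, which is the distinction the abstract says pathway-level analysis helps to investigate.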
DOE Office of Scientific and Technical Information (OSTI.GOV)
Green, Pamela J.
The long-term goal of this research was to better understand the influence of mRNA stability on gene regulation, particularly in response to hormones and the circadian clock. The primary aim of this project was to examine this using DNA microarrays, small RNA analysis and other approaches. We accomplished these objectives, although we were only able to detect small changes in mRNA stability in response to these stimuli. However, the work also contributed to a major breakthrough allowing the identification of small RNAs on a genomic scale in eukaryotes. Moreover, the project prompted us to develop a new way to analyze mRNA decay genome wide. Thus, the research was hugely successful beyond our objectives.
DelVecchio, Vito G; Wagner, Mary Ann; Eschenbrenner, Michel; Horn, Troy A; Kraycer, Jo Ann; Estock, Frank; Elzer, Phil; Mujer, Cesar V
2002-12-20
The proteomes of selected Brucella spp. have been extensively analyzed by utilizing current proteomic technology involving 2-DE and MALDI-MS. In Brucella melitensis, more than 500 proteins were identified. The rapid and large-scale identification of proteins in this organism was accomplished by using the annotated B. melitensis genome which is now available in the GenBank. Coupled with new and powerful tools for data analysis, differentially expressed proteins were identified and categorized into several classes. A global overview of protein expression patterns emerged, thereby facilitating the simultaneous analysis of different metabolic pathways in B. melitensis. Such a global characterization would not have been possible by using time consuming and traditional biochemical approaches. The era of post-genomic technology offers new and exciting opportunities to understand the complete biology of different Brucella species.
Convergence between biological, behavioural and genetic determinants of obesity.
Ghosh, Sujoy; Bouchard, Claude
2017-12-01
Multiple biological, behavioural and genetic determinants or correlates of obesity have been identified to date. Genome-wide association studies (GWAS) have contributed to the identification of more than 100 obesity-associated genetic variants, but their roles in causal processes leading to obesity remain largely unknown. Most variants are likely to have tissue-specific regulatory roles through joint contributions to biological pathways and networks, through changes in gene expression that influence quantitative traits, or through the regulation of the epigenome. The recent availability of large-scale functional genomics resources provides an opportunity to re-examine obesity GWAS data to begin elucidating the function of genetic variants. Interrogation of knockout mouse phenotype resources provides a further avenue to test for evidence of convergence between genetic variation and biological or behavioural determinants of obesity.
Usability study of clinical exome analysis software: top lessons learned and recommendations.
Shyr, Casper; Kushniruk, Andre; Wasserman, Wyeth W
2014-10-01
New DNA sequencing technologies have revolutionized the search for genetic disruptions. Targeted sequencing of all protein coding regions of the genome, called exome analysis, is actively used in research-oriented genetics clinics, with the transition to exomes as a standard procedure underway. This transition is challenging; identification of potentially causal mutation(s) amongst ∼10^6 variants requires specialized computation in combination with expert assessment. This study analyzes the usability of user interfaces for clinical exome analysis software. There are two study objectives: (1) To ascertain the key features of successful user interfaces for clinical exome analysis software based on the perspective of expert clinical geneticists, (2) To assess user-system interactions in order to reveal strengths and weaknesses of existing software, inform future design, and accelerate the clinical uptake of exome analysis. Surveys, interviews, and cognitive task analysis were performed for the assessment of two next-generation exome sequence analysis software packages. The subjects included ten clinical geneticists who interacted with the software packages using the "think aloud" method. Subjects' interactions with the software were recorded in their clinical office within an urban research and teaching hospital. All major user interface events (from the user interactions with the packages) were time-stamped and annotated with coding categories to identify usability issues in order to characterize desired features and deficiencies in the user experience. We detected 193 usability issues, the majority of which concern interface layout and navigation, and the resolution of reports. Our study highlights gaps in specific software features typical within exome analysis. The clinicians perform best when the flow of the system is structured into well-defined yet customizable layers for incorporation within the clinical workflow.
The results highlight opportunities to dramatically accelerate clinician analysis and interpretation of patient genomic data. We present the first application of usability methods to evaluate software interfaces in the context of exome analysis. Our results highlight how the study of user responses can lead to identification of usability issues and challenges and reveal software reengineering opportunities for improving clinical next-generation sequencing analysis. While the evaluation focused on two distinctive software tools, the results are general and should inform active and future software development for genome analysis software. As large-scale genome analysis becomes increasingly common in healthcare, it is critical that efficient and effective software interfaces are provided to accelerate clinical adoption of the technology. Implications for improved design of such applications are discussed. Copyright © 2014 The Authors. Published by Elsevier Inc. All rights reserved.
Gonzalez, Michael A; Lebrigio, Rafael F Acosta; Van Booven, Derek; Ulloa, Rick H; Powell, Eric; Speziani, Fiorella; Tekin, Mustafa; Schüle, Rebecca; Züchner, Stephan
2013-06-01
Novel genes are now identified at a rapid pace for many Mendelian disorders, and increasingly, for genetically complex phenotypes. However, new challenges have also become evident: (1) effectively managing larger exome and/or genome datasets, especially for smaller labs; (2) direct hands-on analysis and contextual interpretation of variant data in large genomic datasets; and (3) many small and medium-sized clinical and research-based investigative teams around the world are generating data that, if combined and shared, will significantly increase the opportunities for the entire community to identify new genes. To address these challenges, we have developed GEnomes Management Application (GEM.app), a software tool to annotate, manage, visualize, and analyze large genomic datasets (https://genomics.med.miami.edu/). GEM.app currently contains ∼1,600 whole exomes from 50 different phenotypes studied by 40 principal investigators from 15 different countries. The focus of GEM.app is on user-friendly analysis for nonbioinformaticians to make next-generation sequencing data directly accessible. Yet, GEM.app provides powerful and flexible filter options, including single family filtering, across family/phenotype queries, nested filtering, and evaluation of segregation in families. In addition, the system is fast, obtaining results within 4 sec across ∼1,200 exomes. We believe that this system will further enhance identification of genetic causes of human disease. © 2013 Wiley Periodicals, Inc.
Li, Peng; Jia, Junwei; Bai, Lan; Pan, Aihu; Tang, Xueming
2013-07-01
Genetically modified carnation (Dianthus caryophyllus L.) Moonshade was approved for planting and commercialization in several countries from 2004. Developing methods for analyzing Moonshade is necessary for implementing genetically modified organism labeling regulations. In this study, the 5'-transgene integration sequence was isolated using thermal asymmetric interlaced (TAIL)-PCR. Based upon the 5'-transgene integration sequence, conventional and TaqMan real-time PCR assays were established. The relative limit of detection for the conventional PCR assay was 0.05 % for Moonshade using 100 ng total carnation genomic DNA, corresponding to approximately 79 copies of the carnation haploid genome, and the limits of detection and quantification of the TaqMan real-time PCR assay were estimated to be 51 and 254 copies of haploid carnation genomic DNA, respectively. These results are useful for identifying and quantifying Moonshade and its derivatives.
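The detection limits quoted above can be cross-checked with simple arithmetic. The sketch below uses only the figures in the abstract (0.05 % of 100 ng, 79 copies, 51 and 254 copies); the per-genome mass is derived from those figures rather than taken from an independent measurement, and the function names are illustrative.

```python
def gm_template_mass_pg(total_dna_ng, gm_fraction):
    """Mass of transgenic template (pg) in a reaction, given total DNA in ng
    and the GM fraction as a proportion (0.05 % -> 0.0005)."""
    return total_dna_ng * 1000.0 * gm_fraction


def implied_genome_mass_pg(template_mass_pg, copies):
    """Mass of one haploid genome (pg) implied by a mass/copy-number pair."""
    return template_mass_pg / copies


def copies_to_mass_pg(copies, genome_mass_pg):
    """Template mass (pg) that corresponds to a given copy number."""
    return copies * genome_mass_pg


# Conventional PCR: 0.05 % Moonshade in 100 ng total carnation DNA.
LOD_MASS_PG = gm_template_mass_pg(100.0, 0.0005)           # about 50 pg
GENOME_MASS_PG = implied_genome_mass_pg(LOD_MASS_PG, 79)   # about 0.63 pg

# Real-time PCR limits expressed as template mass, using the implied value.
RT_LOD_PG = copies_to_mass_pg(51, GENOME_MASS_PG)    # about 32 pg
RT_LOQ_PG = copies_to_mass_pg(254, GENOME_MASS_PG)   # about 161 pg

print(round(LOD_MASS_PG, 1), round(GENOME_MASS_PG, 2),
      round(RT_LOD_PG, 1), round(RT_LOQ_PG, 1))
```

Under these numbers, the conventional assay detects roughly 50 pg of transgenic template, which implies a haploid carnation genome mass of about 0.63 pg.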
Deep sequencing approaches for the analysis of prokaryotic transcriptional boundaries and dynamics.
James, Katherine; Cockell, Simon J; Zenkin, Nikolay
2017-05-01
The identification of the protein-coding regions of a genome is straightforward due to the universality of start and stop codons. However, the boundaries of the transcribed regions, conditional operon structures, non-coding RNAs and the dynamics of transcription, such as pausing of elongation, are non-trivial to identify, even in the comparatively simple genomes of prokaryotes. Traditional methods for the study of these areas, such as tiling arrays, are noisy, labour-intensive and lack the resolution required for densely-packed bacterial genomes. Recently, deep sequencing has become increasingly popular for the study of the transcriptome due to its lower costs, higher accuracy and single nucleotide resolution. These methods have revolutionised our understanding of prokaryotic transcriptional dynamics. Here, we review the deep sequencing and data analysis techniques that are available for the study of transcription in prokaryotes, and discuss the bioinformatic considerations of these analyses. Copyright © 2017 Elsevier Inc. All rights reserved.
A new way to protect privacy in large-scale genome-wide association studies.
Kamm, Liina; Bogdanov, Dan; Laur, Sven; Vilo, Jaak
2013-04-01
Increased availability of various genotyping techniques has initiated a race for finding genetic markers that can be used in diagnostics and personalized medicine. Although many genetic risk factors are known, key causes of common diseases with complex heritage patterns are still unknown. Identification of such complex traits requires a targeted study over a large collection of data. Ideally, such studies bring together data from many biobanks. However, data aggregation on such a large scale raises many privacy issues. We show how to conduct such studies without violating privacy of individual donors and without leaking the data to third parties. The presented solution has provable security guarantees. Supplementary data are available at Bioinformatics online.
Negative Enrichment and Isolation of Circulating Tumor Cells for Whole Genome Amplification.
Kanwar, Nisha; Done, Susan J
2017-01-01
Circulating tumor cells (CTCs) are a rare population of cells found in the peripheral blood of patients with many types of cancer such as breast, prostate, colon, and lung cancers. Higher numbers of these cells in blood are associated with a poorer prognosis of patients. Genomic profiling of CTCs would help characterize markers specific for the identification of these cells in blood, and also define genomic alterations that give these cells a metastatic advantage over other cells in the primary tumor. Here, we describe an immunomagnetic method to enrich CTCs from the blood of patients with breast cancer, followed by single-cell laser capture microdissection to isolate single CTCs. Whole genome amplification of isolated CTCs allows for many downstream applications to be performed to aid in their characterization, such as whole genome or exome sequencing, Single Nucleotide Polymorphism (SNP) and copy number analysis, and targeted sequencing or quantitative Polymerase Chain Reaction (qPCR) for genomic analyses.
Oud, Bart; van Maris, Antonius J A; Daran, Jean-Marc; Pronk, Jack T
2012-01-01
Successful reverse engineering of mutants that have been obtained by nontargeted strain improvement has long presented a major challenge in yeast biotechnology. This paper reviews the use of genome-wide approaches for analysis of Saccharomyces cerevisiae strains originating from evolutionary engineering or random mutagenesis. On the basis of an evaluation of the strengths and weaknesses of different methods, we conclude that for the initial identification of relevant genetic changes, whole genome sequencing is superior to other analytical techniques, such as transcriptome, metabolome, proteome, or array-based genome analysis. Key advantages of this technique over gene expression analysis include the independency of genome sequences on experimental context and the possibility to directly and precisely reproduce the identified changes in naive strains. The predictive value of genome-wide analysis of strains with industrially relevant characteristics can be further improved by classical genetics or simultaneous analysis of strains derived from parallel, independent strain improvement lineages. PMID:22152095
Oud, Bart; van Maris, Antonius J A; Daran, Jean-Marc; Pronk, Jack T
2012-03-01
Successful reverse engineering of mutants that have been obtained by nontargeted strain improvement has long presented a major challenge in yeast biotechnology. This paper reviews the use of genome-wide approaches for analysis of Saccharomyces cerevisiae strains originating from evolutionary engineering or random mutagenesis. On the basis of an evaluation of the strengths and weaknesses of different methods, we conclude that for the initial identification of relevant genetic changes, whole genome sequencing is superior to other analytical techniques, such as transcriptome, metabolome, proteome, or array-based genome analysis. Key advantages of this technique over gene expression analysis include the independency of genome sequences on experimental context and the possibility to directly and precisely reproduce the identified changes in naive strains. The predictive value of genome-wide analysis of strains with industrially relevant characteristics can be further improved by classical genetics or simultaneous analysis of strains derived from parallel, independent strain improvement lineages. © 2011 Federation of European Microbiological Societies. Published by Blackwell Publishing Ltd. All rights reserved.
Efficient isolation method for high-quality genomic DNA from cicada exuviae.
Nguyen, Hoa Quynh; Kim, Ye Inn; Borzée, Amaël; Jang, Yikweon
2017-10-01
In recent years, animal ethics issues have led researchers to explore nondestructive methods to access materials for genetic studies. Cicada exuviae are among those materials because they are cast skins that individuals leave behind after molting and are easily collected. In this study, we aim to identify the most efficient extraction method to obtain DNA of high quantity and quality from cicada exuviae. We compared the relative DNA yield and purity of six extraction protocols, including both manual protocols and commercially available kits, applied to four different exoskeleton parts. Furthermore, amplification and sequencing of genomic DNA were evaluated in terms of whether sequence of the expected size was obtained. Both the choice of protocol and exuvia part significantly affected DNA yield and purity. Only samples that were extracted using the PowerSoil DNA Isolation kit generated gel bands of expected size as well as successful sequencing results. The failed attempts to extract DNA using other protocols could be explained partly by a low DNA yield from cicada exuviae and partly by contamination with humic acids, which are present in the soil where cicada nymphs reside before emergence, as shown by spectroscopic measurements. Genomic DNA extracted from cicada exuviae could provide valuable information for species identification, allowing the investigation of genetic diversity across consecutive broods, or spatiotemporal variation among various populations. Consequently, we hope to provide a simple method to acquire pure genomic DNA applicable for multiple research purposes.
Khang, Chang Hyun; Park, Sook-Young; Lee, Yong-Hwan; Kang, Seogchan
2005-06-01
Rapid progress in fungal genome sequencing presents many new opportunities for functional genomic analysis of fungal biology through the systematic mutagenesis of the genes identified through sequencing. However, the lack of efficient tools for targeted gene replacement is a limiting factor for fungal functional genomics, as it often necessitates the screening of a large number of transformants to identify the desired mutant. We developed an efficient method of gene replacement and evaluated factors affecting the efficiency of this method using two plant pathogenic fungi, Magnaporthe grisea and Fusarium oxysporum. This method is based on Agrobacterium tumefaciens-mediated transformation with a mutant allele of the target gene flanked by the herpes simplex virus thymidine kinase (HSVtk) gene as a conditional negative selection marker against ectopic transformants. The HSVtk gene product converts 5-fluoro-2'-deoxyuridine to a compound toxic to diverse fungi. Because ectopic transformants express HSVtk, while gene replacement mutants lack HSVtk, growing transformants on a medium amended with 5-fluoro-2'-deoxyuridine facilitates the identification of targeted mutants by counter-selecting against ectopic transformants. In addition to M. grisea and F. oxysporum, the method and associated vectors are likely to be applicable to manipulating genes in a broad spectrum of fungi, thus potentially serving as an efficient, universal functional genomic tool for harnessing the growing body of fungal genome sequence data to study fungal biology.
CORALINA: a universal method for the generation of gRNA libraries for CRISPR-based screening.
Köferle, Anna; Worf, Karolina; Breunig, Christopher; Baumann, Valentin; Herrero, Javier; Wiesbeck, Maximilian; Hutter, Lukas H; Götz, Magdalena; Fuchs, Christiane; Beck, Stephan; Stricker, Stefan H
2016-11-14
The bacterial CRISPR system is fast becoming the most popular genetic and epigenetic engineering tool due to its universal applicability and adaptability. The desire to deploy CRISPR-based methods in a large variety of species and contexts has created an urgent need for the development of easy, time- and cost-effective methods enabling large-scale screening approaches. Here we describe CORALINA (comprehensive gRNA library generation through controlled nuclease activity), a method for the generation of comprehensive gRNA libraries for CRISPR-based screens. CORALINA gRNA libraries can be derived from any source of DNA without the need of complex oligonucleotide synthesis. We show the utility of CORALINA for human and mouse genomic DNA, its reproducibility in covering the most relevant genomic features including regulatory, coding and non-coding sequences and confirm the functionality of CORALINA generated gRNAs. The simplicity and cost-effectiveness make CORALINA suitable for any experimental system. The unprecedented sequence complexities obtainable with CORALINA libraries are a necessary pre-requisite for less biased large scale genomic and epigenomic screens.
Liu, Lei; Ang, Keng Pee; Elliott, J A K; Kent, Matthew Peter; Lien, Sigbjørn; MacDonald, Danielle; Boulding, Elizabeth Grace
2017-03-01
Comparative genome scans can be used to identify chromosome regions, but not traits, that are putatively under selection. Identification of targeted traits may be more likely in recently domesticated populations under strong artificial selection for increased production. We used a North American Atlantic salmon 6K SNP dataset to locate genome regions of an aquaculture strain (Saint John River) that were highly diverged from that of its putative wild founder population (Tobique River). First, admixed individuals with partial European ancestry were detected using STRUCTURE and removed from the dataset. Outlier loci were then identified as those showing extreme differentiation between the aquaculture population and the founder population. All Arlequin methods identified an overlapping subset of 17 outlier loci, three of which were also identified by BayeScan. Many outlier loci were near candidate genes and some were near published quantitative trait loci (QTLs) for growth, appetite, maturity, or disease resistance. Parallel comparisons using a wild, nonfounder population (Stewiacke River) yielded only one overlapping outlier locus as well as a known maturity QTL. We conclude that genome scans comparing a recently domesticated strain with its wild founder population can facilitate identification of candidate genes for traits known to have been under strong artificial selection.
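The outlier-scan logic above (per-locus differentiation between the aquaculture strain and its founder, followed by selection of extreme loci) can be sketched in a few lines. This is a simplified illustration, not the Arlequin or BayeScan procedures used in the study: it computes a basic two-population FST from allele frequencies assuming equal sample sizes, and applies a fixed cutoff instead of a simulated null distribution. The locus names, frequencies, and cutoff are hypothetical.

```python
def fst_two_pops(p1, p2):
    """Basic FST = (H_T - H_S) / H_T for one biallelic locus, given the
    frequency of the same allele in two equally weighted populations."""
    p_bar = (p1 + p2) / 2.0
    h_total = 2.0 * p_bar * (1.0 - p_bar)  # expected heterozygosity, pooled
    if h_total == 0.0:
        return 0.0  # locus is fixed for the same allele in both populations
    h_within = (2.0 * p1 * (1.0 - p1) + 2.0 * p2 * (1.0 - p2)) / 2.0
    return (h_total - h_within) / h_total


def outlier_loci(freqs, cutoff):
    """Return the sorted names of loci whose FST is at or above the cutoff."""
    scores = {locus: fst_two_pops(p1, p2) for locus, (p1, p2) in freqs.items()}
    return sorted(locus for locus, score in scores.items() if score >= cutoff)


# Hypothetical allele frequencies: (aquaculture strain, wild founder).
FREQS = {
    "snp_a": (0.50, 0.50),
    "snp_b": (0.50, 0.60),
    "snp_c": (0.90, 0.10),
    "snp_d": (0.30, 0.35),
}

OUTLIERS = outlier_loci(FREQS, cutoff=0.25)
print(OUTLIERS)
```

In practice the cutoff would come from a null model of neutral divergence, which is what the dedicated packages estimate; a fixed threshold is used here only to keep the example short.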
Prigent, Sylvain; Nielsen, Jens Christian; Frisvad, Jens Christian; Nielsen, Jens
2018-06-05
Modelling of metabolism at the genome scale has proved to be an efficient method for explaining observed phenotypic traits in living organisms. Further, it can be used as a means of predicting the effect of genetic modifications, e.g. for development of microbial cell factories. With the increasing amount of genome sequencing data available, a need exists to accurately and efficiently generate such genome-scale metabolic models (GEMs) of non-model organisms, for which data is sparse. In this study, we present an automatic reconstruction approach applied to 24 Penicillium species, which have potential for production of pharmaceutical secondary metabolites or are used in the manufacturing of food products such as cheeses. The models were based on the MetaCyc database and a previously published Penicillium GEM, and gave rise to comprehensive genome-scale metabolic descriptions. The models proved that while central carbon metabolism is highly conserved, secondary metabolic pathways represent the main diversity among the species. The automatic reconstruction approach presented in this study can be applied to generate GEMs of other understudied organisms, and the developed GEMs are a useful resource for the study of Penicillium metabolism, for example with the scope of developing novel cell factories. This article is protected by copyright. All rights reserved.
Yu, Feiqiao Brian; Blainey, Paul C; Schulz, Frederik; Woyke, Tanja; Horowitz, Mark A; Quake, Stephen R
2017-07-05
Metagenomics and single-cell genomics have enabled genome discovery from unknown branches of life. However, extracting novel genomes from complex mixtures of metagenomic data can still be challenging and represents an ill-posed problem which is generally approached with ad hoc methods. Here we present a microfluidic-based mini-metagenomic method which offers a statistically rigorous approach to extract novel microbial genomes while preserving single-cell resolution. We used this approach to analyze two hot spring samples from Yellowstone National Park and extracted 29 new genomes, including three deeply branching lineages. The single-cell resolution enabled accurate quantification of genome function and abundance, down to 1% in relative abundance. Our analyses of genome level SNP distributions also revealed low to moderate environmental selection. The scale, resolution, and statistical power of microfluidic-based mini-metagenomics make it a powerful tool to dissect the genomic structure of microbial communities while effectively preserving the fundamental unit of biology, the single cell.
Pooled Protein Immunization for Identification of Cell Surface Antigens in Streptococcus sanguinis
Ge, Xiuchun; Kitten, Todd; Munro, Cindy L.; Conrad, Daniel H.; Xu, Ping
2010-01-01
Background Available bacterial genomes provide opportunities for screening vaccines by reverse vaccinology. Efficient identification of surface antigens is required to reduce time and animal cost in this technology. We developed an approach to identify surface antigens rapidly in Streptococcus sanguinis, a common infective endocarditis causative species. Methods and Findings We applied bioinformatics for antigen prediction and pooled antigens for immunization. Forty-seven surface-exposed proteins including 28 lipoproteins and 19 cell wall-anchored proteins were chosen based on computer algorithms and comparative genomic analyses. Eight proteins among these candidates and 2 other proteins were pooled together to immunize rabbits. The antiserum reacted strongly with each protein and with S. sanguinis whole cells. Affinity chromatography was used to purify the antibodies to 9 of the antigen pool components. Competitive ELISA and FACS results indicated that these 9 proteins were exposed on S. sanguinis cell surfaces. The purified antibodies had demonstrable opsonic activity. Conclusions The results indicate that immunization with pooled proteins, in combination with affinity purification, and comprehensive immunological assays may facilitate cell surface antigen identification to combat infectious diseases. PMID:20668678
2012-01-01
Background Efficient, robust, and accurate genotype imputation algorithms make large-scale application of genomic selection cost effective. An algorithm that imputes alleles or allele probabilities for all animals in the pedigree and for all genotyped single nucleotide polymorphisms (SNP) provides a framework to combine all pedigree, genomic, and phenotypic information into a single-stage genomic evaluation. Methods An algorithm was developed for imputation of genotypes in pedigreed populations that allows imputation for completely ungenotyped animals and for low-density genotyped animals, accommodates a wide variety of pedigree structures for genotyped animals, imputes unmapped SNP, and works for large datasets. The method involves simple phasing rules, long-range phasing and haplotype library imputation and segregation analysis. Results Imputation accuracy was high and computational cost was feasible for datasets with pedigrees of up to 25 000 animals. The resulting single-stage genomic evaluation increased the accuracy of estimated genomic breeding values compared to a scenario in which phenotypes on relatives that were not genotyped were ignored. Conclusions The developed imputation algorithm and software and the resulting single-stage genomic evaluation method provide powerful new ways to exploit imputation and to obtain more accurate genetic evaluations. PMID:22462519
Simultaneous Identification of Multiple Driver Pathways in Cancer
Leiserson, Mark D. M.; Blokh, Dima
2013-01-01
Distinguishing the somatic mutations responsible for cancer (driver mutations) from random, passenger mutations is a key challenge in cancer genomics. Driver mutations generally target cellular signaling and regulatory pathways consisting of multiple genes. This heterogeneity complicates the identification of driver mutations by their recurrence across samples, as different combinations of mutations in driver pathways are observed in different samples. We introduce the Multi-Dendrix algorithm for the simultaneous identification of multiple driver pathways de novo in somatic mutation data from a cohort of cancer samples. The algorithm relies on two combinatorial properties of mutations in a driver pathway: high coverage and mutual exclusivity. We derive an integer linear program that finds sets of mutations exhibiting these properties. We apply Multi-Dendrix to somatic mutations from glioblastoma, breast cancer, and lung cancer samples. Multi-Dendrix identifies sets of mutations in genes that overlap with known pathways – including Rb, p53, PI(3)K, and cell cycle pathways – and also novel sets of mutually exclusive mutations, including mutations in several transcription factors or other genes involved in transcriptional regulation. These sets are discovered directly from mutation data with no prior knowledge of pathways or gene interactions. We show that Multi-Dendrix outperforms other algorithms for identifying combinations of mutations and is also orders of magnitude faster on genome-scale data. Software available at: http://compbio.cs.brown.edu/software. PMID:23717195
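The two properties named in the abstract, high coverage and mutual exclusivity, can be combined into a single score for a candidate gene set. A minimal sketch follows, assuming the weight W(M) = 2·|Γ(M)| − Σ_g |Γ(g)| known from the earlier Dendrix method, where Γ(g) is the set of samples mutated in gene g and Γ(M) is their union. The abstract does not state the exact objective of the integer linear program, so this scoring function, the function names, and the toy data are illustrative assumptions only.

```python
def coverage(gene_set, mutations):
    """Samples that carry a mutation in at least one gene of gene_set.

    mutations: dict mapping gene name -> set of mutated sample IDs.
    """
    covered = set()
    for gene in gene_set:
        covered |= mutations.get(gene, set())
    return covered


def exclusivity_weight(gene_set, mutations):
    """W(M) = 2 * |coverage| - sum of per-gene mutation counts.

    Rewards sets that cover many samples and penalizes samples
    mutated in more than one gene of the set (co-occurrence).
    """
    n_covered = len(coverage(gene_set, mutations))
    n_mutations = sum(len(mutations.get(g, set())) for g in gene_set)
    return 2 * n_covered - n_mutations


# Toy mutation data (hypothetical, not from the paper).
toy = {"A": {"s1", "s2"}, "B": {"s3"}, "C": {"s1", "s3"}}
print(exclusivity_weight({"A", "B"}, toy))  # mutually exclusive pair
print(exclusivity_weight({"A", "C"}, toy))  # pair overlapping in sample s1
```

In the toy data both pairs cover the same three samples, but the pair with a co-mutated sample scores lower, which is the behaviour an exclusivity-aware objective is meant to capture.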
methylPipe and compEpiTools: a suite of R packages for the integrative analysis of epigenomics data.
Kishore, Kamal; de Pretis, Stefano; Lister, Ryan; Morelli, Marco J; Bianchi, Valerio; Amati, Bruno; Ecker, Joseph R; Pelizzola, Mattia
2015-09-29
Numerous methods are available to profile several epigenetic marks, providing data with different genome coverage and resolution. Large epigenomic datasets are then generated, and often combined with other high-throughput data, including RNA-seq, ChIP-seq for transcription factors (TFs) binding and DNase-seq experiments. Despite the numerous computational tools covering specific steps in the analysis of large-scale epigenomics data, comprehensive software solutions for their integrative analysis are still missing. Multiple tools must be identified and combined to jointly analyze histone marks, TFs binding and other -omics data together with DNA methylation data, complicating the analysis of these data and their integration with publicly available datasets. To overcome the burden of integrating various data types with multiple tools, we developed two companion R/Bioconductor packages. The former, methylPipe, is tailored to the analysis of high- or low-resolution DNA methylomes in several species, accommodating (hydroxy-)methyl-cytosines in both CpG and non-CpG sequence context. The analysis of multiple whole-genome bisulfite sequencing experiments is supported, while maintaining the ability of integrating targeted genomic data. The latter, compEpiTools, seamlessly incorporates the results obtained with methylPipe and supports their integration with other epigenomics data. It provides a number of methods to score these data in regions of interest, leading to the identification of enhancers, lncRNAs, and RNAPII stalling/elongation dynamics. Moreover, it allows a fast and comprehensive annotation of the resulting genomic regions, and the association of the corresponding genes with non-redundant GeneOntology terms. Finally, the package includes a flexible method based on heatmaps for the integration of various data types, combining annotation tracks with continuous or categorical data tracks. 
methylPipe and compEpiTools provide a comprehensive Bioconductor-compliant solution for the integrative analysis of heterogeneous epigenomics data. These packages are instrumental in providing biologists with minimal R skills a complete toolkit facilitating the analysis of their own data, or in accelerating the analyses performed by more experienced bioinformaticians.
Lu, Fu-Hao; McKenzie, Neil; Kettleborough, George; Heavens, Darren; Clark, Matthew D; Bevan, Michael W
2018-05-01
The accurate sequencing and assembly of very large, often polyploid, genomes remains a challenging task, limiting long-range sequence information and phased sequence variation for applications such as plant breeding. The 15-Gb hexaploid bread wheat (Triticum aestivum) genome has been particularly challenging to sequence, and several different approaches have recently generated long-range assemblies. Mapping and understanding the types of assembly errors are important for optimising future sequencing and assembly approaches and for comparative genomics. Here we use a Fosill 38-kb jumping library to assess medium and longer-range order of different publicly available wheat genome assemblies. Modifications to the Fosill protocol generated longer Illumina sequences and enabled comprehensive genome coverage. Analyses of two independent Bacterial Artificial Chromosome (BAC)-based chromosome-scale assemblies, two independent Illumina whole genome shotgun assemblies, and a hybrid Single Molecule Real Time (SMRT-PacBio) and short read (Illumina) assembly were carried out. We revealed a surprising scale and variety of discrepancies using Fosill mate-pair mapping and validated several of each class. In addition, Fosill mate-pairs were used to scaffold a whole genome Illumina assembly, leading to a 3-fold increase in N50 values. Our analyses, using an independent means to validate different wheat genome assemblies, show that whole genome shotgun assemblies based solely on Illumina sequences are significantly more accurate by all measures compared to BAC-based chromosome-scale assemblies and hybrid SMRT-Illumina approaches. Although current whole genome assemblies are reasonably accurate and useful, additional improvements will be needed to generate complete assemblies of wheat genomes using open-source, computationally efficient, and cost-effective methods.
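The abstract reports a 3-fold increase in N50 after scaffolding. For readers unfamiliar with the metric, the sketch below computes N50 under its usual definition: the length of the shortest sequence in the smallest set of longest sequences that together contain at least half of the total assembled bases. The function name and the example lengths are hypothetical and not taken from the study.

```python
def n50(lengths):
    """Return the N50 of a list of contig or scaffold lengths."""
    if not lengths:
        raise ValueError("lengths must not be empty")
    total = sum(lengths)
    running = 0
    # Walk from the longest sequence down until half the bases are covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length


# Made-up scaffold lengths (arbitrary units).
print(n50([2, 3, 4, 5, 6, 7, 8, 9, 10]))
```

Because N50 depends only on the length distribution, joining contigs into longer scaffolds raises it even when no new sequence is added, so it measures contiguity rather than correctness.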
Practical Approaches for Detecting Selection in Microbial Genomes.
Hedge, Jessica; Wilson, Daniel J
2016-02-01
Microbial genome evolution is shaped by a variety of selective pressures. Understanding how these processes occur can help to address important problems in microbiology by explaining observed differences in phenotypes, including virulence and resistance to antibiotics. Greater access to whole-genome sequencing provides microbiologists with the opportunity to perform large-scale analyses of selection in novel settings, such as within individual hosts. This tutorial aims to guide researchers through the fundamentals underpinning popular methods for measuring selection in pathogens. These methods are transferable to a wide variety of organisms, and the exercises provided are designed for researchers with any level of programming experience.
Accurate evaluation and analysis of functional genomics data and methods
Greene, Casey S.; Troyanskaya, Olga G.
2016-01-01
The development of technology capable of inexpensively performing large-scale measurements of biological systems has generated a wealth of data. Integrative analysis of these data holds the promise of uncovering gene function, regulation, and, in the longer run, understanding complex disease. However, their analysis has proved very challenging, as it is difficult to quickly and effectively assess the relevance and accuracy of these data for individual biological questions. Here, we identify biases that present challenges for the assessment of functional genomics data and methods. We then discuss evaluation methods that, taken together, begin to address these issues. We also argue that the funding of systematic data-driven experiments and of high-quality curation efforts will further improve evaluation metrics so that they more accurately assess functional genomics data and methods. Such metrics will allow researchers in the field of functional genomics to continue to answer important biological questions in a data-driven manner. PMID:22268703
Agren, Rasmus; Liu, Liming; Shoaie, Saeed; Vongsangnak, Wanwipa; Nookaew, Intawat; Nielsen, Jens
2013-01-01
We present the RAVEN (Reconstruction, Analysis and Visualization of Metabolic Networks) Toolbox: a software suite that allows for semi-automated reconstruction of genome-scale models. It makes use of published models and/or the KEGG database, coupled with extensive gap-filling and quality control features. The software suite also contains methods for visualizing simulation results and omics data, as well as a range of methods for performing simulations and analyzing the results. The software is a useful tool for system-wide data analysis in a metabolic context and for streamlined reconstruction of metabolic networks based on protein homology. The RAVEN Toolbox workflow was applied in order to reconstruct a genome-scale metabolic model for the important microbial cell factory Penicillium chrysogenum Wisconsin54-1255. The model was validated in a bibliomic study of in total 440 references, and it comprises 1471 unique biochemical reactions and 1006 ORFs. It was then used to study the roles of ATP and NADPH in the biosynthesis of penicillin, and to identify potential metabolic engineering targets for maximization of penicillin production. PMID:23555215
A novel approach to identifying regulatory motifs in distantly related genomes
Van Hellemont, Ruth; Monsieurs, Pieter; Thijs, Gert; De Moor, Bart; Van de Peer, Yves; Marchal, Kathleen
2005-01-01
Although proven successful in the identification of regulatory motifs, phylogenetic footprinting methods still show some shortcomings. To assess these difficulties, most apparent when applying phylogenetic footprinting to distantly related organisms, we developed a two-step procedure that combines the advantages of sequence alignment and motif detection approaches. The results on well-studied benchmark datasets indicate that the presented method outperforms other methods when the sequences become either too long or too heterogeneous in size. PMID:16420672
Molecular Diagnosis and Biomarker Identification on SELDI proteomics data by ADTBoost method.
Wang, Lu-Yong; Chakraborty, Amit; Comaniciu, Dorin
2005-01-01
Clinical proteomics is an emerging field that will have great impact on molecular diagnosis, identification of disease biomarkers, drug discovery and clinical trials in the post-genomic era. Protein profiling in tissues and fluids in disease and pathological control, together with other proteomics techniques, will play an important role in molecular diagnosis with therapeutics and personalized healthcare. We introduce a new robust diagnostic method based on the ADTboost algorithm, a novel algorithm in proteomics data analysis, to improve classification accuracy. It generates classification rules, which are often smaller and easier to interpret. The method often yields the most discriminative features, which can be used as biomarkers for diagnostic purposes. It also provides a measure of prediction confidence. We applied this method to amyotrophic lateral sclerosis (ALS) data acquired by surface-enhanced laser desorption/ionization time-of-flight mass spectrometry (SELDI-TOF MS) experiments. The method shows outstanding prediction capacity in cross-validation, ROC analysis and a comparative study. Our molecular diagnosis method provides an efficient way to distinguish ALS from neurological controls. The results are expressed in a simple and straightforward alternating decision tree format or conditional format. We identified the most discriminative peaks in the proteomic data, which can be used as biomarkers for diagnosis. The method will have broad application in molecular diagnosis through proteomics data analysis and in personalized medicine in this post-genomic era.
Phytochemical genomics--a new trend.
Saito, Kazuki
2013-06-01
Phytochemical genomics is a recently emerging field, which investigates the genomic basis of the synthesis and function of phytochemicals (plant metabolites), particularly based on advanced metabolomics. The chemical diversity of the model plant Arabidopsis thaliana is larger than previously expected, and the gene-to-metabolite correlations have been elucidated mostly by an integrated analysis of transcriptomes and metabolomes. For example, most genes involved in the biosynthesis of flavonoids in Arabidopsis have been characterized by this method. A similar approach has been applied to the functional genomics for production of phytochemicals in crops and medicinal plants. Great promise is seen in metabolic quantitative loci analysis in major crops such as rice and tomato, and identification of novel genes involved in the biosynthesis of bioactive specialized metabolites in medicinal plants. Copyright © 2013 The Author. Published by Elsevier Ltd.. All rights reserved.
[The application of genome editing in identification of plant gene function and crop breeding].
Zhou, Xiang-chun; Xing, Yong-zhong
2016-03-01
Plant genomes can be modified via current biotechnology with high specificity and excellent efficiency. Zinc finger nucleases (ZFN), transcription activator-like effector nucleases (TALEN) and the clustered regularly interspaced short palindromic repeats (CRISPR)/CRISPR-associated 9 (Cas9) system are the key engineered nucleases used in genome editing. Genome editing techniques enable targeted gene mutagenesis, gene knock-out, and gene insertion or replacement at target sites during the endogenous DNA repair processes, including non-homologous end joining (NHEJ) and homologous recombination (HR), that are triggered by the induction of a DNA double-strand break (DSB). Genome editing has been successfully applied in the genome modification of diverse plant species, such as Arabidopsis thaliana, Oryza sativa, and Nicotiana tabacum. In this review, we summarize the application of genome editing in identification of plant gene function and crop breeding. Moreover, we discuss the aspects of genome editing that still need improvement for precise genetic improvement of crops in future studies.
Jin, Sheng Chih; Benitez, Bruno A; Deming, Yuetiva; Cruchaga, Carlos
2016-01-01
Analyses of genome-wide association studies (GWAS) for complex disorders usually identify common variants with a relatively small effect size that only explain a small proportion of phenotypic heritability. Several studies have suggested that a significant fraction of heritability may be explained by low-frequency (minor allele frequency (MAF) of 1-5 %) and rare variants that are not contained in the commercial GWAS genotyping arrays (Schork et al., Curr Opin Genet Dev 19:212, 2009). Rare variants can also have relatively large effects on risk for developing human diseases or disease phenotype (Cruchaga et al., PLoS One 7:e31039, 2012). However, it is necessary to perform next-generation sequencing (NGS) studies in a large population (>4,000 samples) to detect a significant rare-variant association. Several NGS methods, such as custom capture sequencing and amplicon-based sequencing, are designed to screen a small proportion of the genome, but most of these methods are limited in the number of samples that can be multiplexed (i.e. most sequencing kits only provide 96 distinct indexes). Additionally, the sequencing library preparation for 4,000 samples remains expensive, and thus conducting NGS studies with the aforementioned methods is not feasible for most research laboratories. The need for low-cost, large-scale rare-variant detection makes pooled-DNA sequencing an ideally efficient and cost-effective technique to identify rare variants in target regions by sequencing hundreds to thousands of samples. Our recent work has demonstrated that pooled-DNA sequencing can accurately detect rare variants in targeted regions in multiple DNA samples with high sensitivity and specificity (Jin et al., Alzheimers Res Ther 4:34, 2012).
In these studies we used a well-established pooled-DNA sequencing approach and a computational package, SPLINTER (short indel prediction by large deviation inference and nonlinear true frequency estimation by recursion) (Vallania et al., Genome Res 20:1711, 2010), for accurate identification of rare variants in large DNA pools. Given an average sequencing coverage of 30× per haploid genome, SPLINTER can detect rare variants and short indels up to 4 base pairs (bp) with high sensitivity and specificity (up to 1 haploid allele in a pool as large as 500 individuals). Step-by-step instructions on how to conduct pooled-DNA sequencing experiments and data analyses are described in this chapter.
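The figures quoted above (30× coverage per haploid genome, one variant allele in a pool of 500 individuals) can be turned into a quick back-of-the-envelope calculation. The sketch below assumes diploid individuals pooled in equal amounts and ignores sequencing error and sampling noise; the function name and return format are hypothetical and this is not part of SPLINTER.

```python
def pooled_variant_expectation(n_individuals, coverage_per_haploid,
                               ploidy=2, allele_copies=1):
    """Expected read support for a rare allele in a DNA pool.

    Assumes equal representation of every individual in the pool
    and no sequencing error (illustrative simplifications).
    """
    n_haploid = n_individuals * ploidy                # haploid genomes in pool
    total_depth = coverage_per_haploid * n_haploid    # reads per pooled site
    allele_fraction = allele_copies / n_haploid       # frequency within pool
    # Each copy of the allele contributes coverage_per_haploid reads on average.
    expected_reads = coverage_per_haploid * allele_copies
    return {
        "haploid_genomes": n_haploid,
        "total_depth": total_depth,
        "allele_fraction": allele_fraction,
        "expected_variant_reads": expected_reads,
    }


result = pooled_variant_expectation(n_individuals=500, coverage_per_haploid=30)
print(result)
```

Under these assumptions a singleton in a 500-person pool sits at a frequency of 0.1 % within the pool and is expected to appear in about 30 of roughly 30,000 reads at that site, which is why an explicit error model is needed to separate true rare variants from sequencing errors.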
Huang, Guoliang; Huang, Qin; Ma, Li; Luo, Xianbo; Pang, Biao; Zhang, Zhixin; Wang, Ruliang; Zhang, Junqi; Li, Qi; Fu, Rongxin; Ye, Jiancheng
2014-01-01
A sensitive DNA isothermal amplification method for the detection of DNA at fM to aM concentrations for pathogen identification was developed using a non-stick-coated metal microfluidic bioreactor. A portable confocal optical detector was utilized to monitor the DNA amplification in micro- to nanoliter reaction assays in real-time, with fluorescence collection near the optical diffraction limit. The non-stick-coated metal microfluidic bioreactor, with a surface contact angle of 103°, was largely inert to bio-molecules, and DNA amplification could be performed in a minimum reaction volume of 40 nL. The isothermal nucleic acid amplification for Mycoplasma pneumoniae identification in the non-stick-coated microfluidic bioreactor could be performed at a minimum DNA template concentration of 1.3 aM, and a detection limit of three copies of genomic DNA was obtained. This microfluidic bioreactor offers a promising clinically relevant pathogen molecular diagnostic method via the amplification of targets from only a few copies of genomic DNA from a single bacterium. PMID:25475544
Muthukrishnan, Madhanmohan; Singanallur, Nagendrakumar B; Ralla, Kumar; Villuppanoor, Srinivasan A
2008-08-01
Foot-and-mouth disease virus (FMDV) samples transported to the laboratory from far and inaccessible areas for serodiagnosis pose a major problem in a tropical country like India, where there is maximum temperature fluctuation. Inadequate storage methods lead to spoilage of FMDV samples collected from clinically positive animals in the field. Such samples are declared as non-typeable by the typing laboratories with the consequent loss of valuable epidemiological data. The present study evaluated the usefulness of FTA Classic Cards for the collection, shipment, storage and identification of the FMDV genome by RT-PCR and real-time RT-PCR. The stability of the viral RNA, the absence of infectivity and ease of processing the sample for molecular methods make the FTA cards a useful option for transport of FMDV genome for identification and serotyping. The method can be used routinely for FMDV research as it is economical and the cards can be transported easily in envelopes by regular document transport methods. Live virus cannot be isolated from samples collected in FTA cards, which is a limitation. This property can be viewed as an advantage as it limits the risk of transmission of live virus.
Lagier, Jean-Christophe; Hugon, Perrine; Khelaifia, Saber; Fournier, Pierre-Edouard; La Scola, Bernard
2015-01-01
SUMMARY Bacterial culture was the first method used to describe the human microbiota, but this method is considered outdated by many researchers. Metagenomics studies have since been applied to clinical microbiology; however, a “dark matter” of prokaryotes, which corresponds to a hole in our knowledge and includes minority bacterial populations, is not elucidated by these studies. By replicating the natural environment, environmental microbiologists were the first to reduce the “great plate count anomaly,” which corresponds to the difference between microscopic and culture counts. The revolution in bacterial identification also allowed rapid progress. 16S rRNA bacterial identification allowed the accurate identification of new species. Mass spectrometry allowed the high-throughput identification of rare species and the detection of new species. By using these methods and by increasing the number of culture conditions, culturomics allowed the extension of the known human gut repertoire to levels equivalent to those of pyrosequencing. Finally, taxonogenomics strategies became an emerging method for describing new species, associating the genome sequence of the bacteria systematically. We provide a comprehensive review on these topics, demonstrating that both empirical and hypothesis-driven approaches will enable a rapid increase in the identification of the human prokaryote repertoire. PMID:25567229
Liu, Chang
2017-01-01
The spatial organization of the genome in the nucleus is critical for many cellular processes. It has been broadly accepted that the packing of chromatin inside the nucleus is not random, but structured at several hierarchical levels. The Hi-C method combines Chromatin Conformation Capture and high-throughput sequencing, which allows interrogating genome-wide chromatin interactions. Depending on the sequencing depth, chromatin packing patterns derived from Hi-C experiments can be viewed on a chromosomal scale or at a local genic level. Here, I describe a protocol of plant in situ Hi-C library preparation, which covers procedures starting from tissue fixation to library amplification.