2012-01-01
Background: Efficient, robust, and accurate genotype imputation algorithms make large-scale application of genomic selection cost effective. An algorithm that imputes alleles or allele probabilities for all animals in the pedigree and for all genotyped single nucleotide polymorphisms (SNP) provides a framework to combine all pedigree, genomic, and phenotypic information into a single-stage genomic evaluation. Methods: An algorithm was developed for imputation of genotypes in pedigreed populations that allows imputation for completely ungenotyped animals and for low-density genotyped animals, accommodates a wide variety of pedigree structures for genotyped animals, imputes unmapped SNP, and works for large datasets. The method involves simple phasing rules, long-range phasing, haplotype library imputation, and segregation analysis. Results: Imputation accuracy was high and computational cost was feasible for datasets with pedigrees of up to 25 000 animals. The resulting single-stage genomic evaluation increased the accuracy of estimated genomic breeding values compared to a scenario in which phenotypes on relatives that were not genotyped were ignored. Conclusions: The developed imputation algorithm and software and the resulting single-stage genomic evaluation method provide powerful new ways to exploit imputation and to obtain more accurate genetic evaluations. PMID:22462519
Best practices for evaluating single nucleotide variant calling methods for microbial genomics
Olson, Nathan D.; Lund, Steven P.; Colman, Rebecca E.; Foster, Jeffrey T.; Sahl, Jason W.; Schupp, James M.; Keim, Paul; Morrow, Jayne B.; Salit, Marc L.; Zook, Justin M.
2015-01-01
Innovations in sequencing technologies have allowed biologists to make incredible advances in understanding biological systems. As experience grows, researchers increasingly recognize that analyzing the wealth of data provided by these new sequencing platforms requires careful attention to detail for robust results. Thus far, much of the scientific community's focus in bacterial genomics has been on evaluating genome assembly algorithms and rigorously validating assembly program performance. Missing, however, is a focus on critical evaluation of variant callers for these genomes. Variant calling is essential for comparative genomics as it yields insights into nucleotide-level organismal differences. Variant calling is a multistep process with a host of potential error sources that may lead to incorrect variant calls. Identifying and resolving these incorrect calls is critical for bacterial genomics to advance. The goal of this review is to provide guidance on validating algorithms and pipelines used in variant calling for bacterial genomics. First, we will provide an overview of the variant calling procedures and the potential sources of error associated with the methods. We will then identify appropriate datasets for use in evaluating algorithms and describe statistical methods for evaluating algorithm performance. As variant calling moves from basic research to the applied setting, standardized methods for performance evaluation and reporting are required; it is our hope that this review provides the groundwork for the development of these standards. PMID:26217378
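The precision/recall framing used to benchmark variant callers can be made concrete with a small sketch. This is an illustrative implementation, not code from the review; representing variants as (position, ref, alt) tuples and requiring exact matches are simplifying assumptions (real benchmarking must also reconcile different representations of equivalent calls).

```python
def evaluate_calls(truth, called):
    """Compare a set of called variants against a truth set.

    Variants are (position, ref, alt) tuples; a call counts as a true
    positive only if it matches an entry in the truth set exactly.
    """
    tp = len(called & truth)   # correct calls
    fp = len(called - truth)   # calls absent from the truth set
    fn = len(truth - called)   # truth variants that were missed
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"tp": tp, "fp": fp, "fn": fn,
            "precision": precision, "recall": recall, "f1": f1}

# Toy truth set and call set (hypothetical positions and alleles):
truth = {(100, "A", "G"), (250, "C", "T"), (400, "G", "A")}
called = {(100, "A", "G"), (250, "C", "T"), (512, "T", "C")}
metrics = evaluate_calls(truth, called)
```

Reporting all three counts (TP, FP, FN) alongside the summary statistics, as the review recommends for standardized reporting, makes results comparable across pipelines.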
Including α s1 casein gene information in genomic evaluations of French dairy goats.
Carillier-Jacquin, Céline; Larroque, Hélène; Robert-Granié, Christèle
2016-08-04
Genomic best linear unbiased prediction methods assume that all markers explain the same fraction of the genetic variance and do not account effectively for genes with major effects such as the α s1 casein polymorphism in dairy goats. In this study, we investigated methods to include the available α s1 casein genotype effect in genomic evaluations of French dairy goats. First, the α s1 casein genotype was included as a fixed effect in genomic evaluation models based only on bucks that were genotyped at the α s1 casein locus. Less than 1 % of the females with phenotypes were genotyped at the α s1 casein gene. Thus, to incorporate these female phenotypes in the genomic evaluation, two methods that allowed for this large number of missing α s1 casein genotypes were investigated. Probabilities for each possible α s1 casein genotype were first estimated for each female of unknown genotype based on iterative peeling equations. The second method is based on a multiallelic gene content approach. For each model tested, we used three datasets each divided into a training and a validation set: (1) two-breed population (Alpine + Saanen), (2) Alpine population, and (3) Saanen population. The α s1 casein genotype had a significant effect on milk yield, fat content and protein content. Including an α s1 casein effect in genetic and genomic evaluations based only on male known α s1 casein genotypes improved accuracies (from 6 to 27 %). In genomic evaluations based on all female phenotypes, the gene content approach performed better than the other tested methods but the improvement in accuracy was only slightly better (from 1 to 14 %) than that of a genomic model without the α s1 casein effect. Including the α s1 casein effect in a genomic evaluation model for French dairy goats is possible and useful to improve accuracy. Difficulties in predicting the genotypes for ungenotyped animals limited the improvement in accuracy of the obtained estimated breeding values.
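The gene content approach can be illustrated with a minimal sketch of its expectation step: an ungenotyped female's expected allele count is computed from her genotype probabilities (e.g., obtained by iterative peeling). The two-allele locus, allele labels, and probabilities below are hypothetical, and this shows only the expectation step, not the full evaluation model.

```python
def expected_gene_content(genotype_probs, allele):
    """Expected allele count ('gene content') for one animal at a locus,
    given probabilities over unordered genotypes.

    genotype_probs maps a genotype tuple such as ('A', 'E') to its
    probability; the probabilities should sum to 1.
    """
    return sum(p * genotype.count(allele)
               for genotype, p in genotype_probs.items())

# Hypothetical ungenotyped doe with peeling-derived genotype
# probabilities at a two-allele casein-like locus:
probs = {("A", "A"): 0.25, ("A", "E"): 0.50, ("E", "E"): 0.25}
content_A = expected_gene_content(probs, "A")  # 0.25*2 + 0.50*1 = 1.0
```

The expected gene content then enters the evaluation model as a covariate in place of the missing genotype, which is what lets all female phenotypes contribute despite fewer than 1 % of them being genotyped at the major gene.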
Gregory, T Ryan; Nathwani, Paula; Bonnett, Tiffany R; Huber, Dezene P W
2013-09-01
A study was undertaken to evaluate both a pre-existing method and a newly proposed approach for the estimation of nuclear genome sizes in arthropods. First, concerns regarding the reliability of the well-established method of flow cytometry, relating to impacts of rearing conditions on genome size estimates, were examined. Contrary to previous reports, a more carefully controlled test found negligible environmental effects on genome size estimates in the fly Drosophila melanogaster. Second, a more recently touted method based on quantitative real-time PCR (qPCR) was examined in terms of ease of use, efficiency, and (most importantly) accuracy using four test species: the flies Drosophila melanogaster and Musca domestica and the beetles Tribolium castaneum and Dendroctonus ponderosae. The results of this analysis demonstrated that qPCR has the tendency to produce substantially different genome size estimates from other established techniques while also being far less efficient than existing methods.
Accurate evaluation and analysis of functional genomics data and methods
Greene, Casey S.; Troyanskaya, Olga G.
2016-01-01
The development of technology capable of inexpensively performing large-scale measurements of biological systems has generated a wealth of data. Integrative analysis of these data holds the promise of uncovering gene function, regulation, and, in the longer run, understanding complex disease. However, their analysis has proved very challenging, as it is difficult to quickly and effectively assess the relevance and accuracy of these data for individual biological questions. Here, we identify biases that present challenges for the assessment of functional genomics data and methods. We then discuss evaluation methods that, taken together, begin to address these issues. We also argue that the funding of systematic data-driven experiments and of high-quality curation efforts will further improve evaluation metrics so that they more accurately assess functional genomics data and methods. Such metrics will allow researchers in the field of functional genomics to continue to answer important biological questions in a data-driven manner. PMID:22268703
Rapid calculation of genomic evaluations for new animals
USDA-ARS's Scientific Manuscript database
A method was developed to calculate preliminary genomic evaluations daily or weekly before the release of official monthly evaluations by processing only newly genotyped animals using estimates of SNP effects from the previous official evaluation. To minimize computing time, reliabilities and genomi...
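The core idea of the preliminary evaluation, scoring newly genotyped animals with SNP effects carried over from the previous official evaluation, can be sketched as follows. The 0/1/2 dosage coding and the `mu` intercept are illustrative assumptions, not the actual USDA implementation.

```python
def preliminary_gebv(dosages, snp_effects, mu=0.0):
    """Preliminary genomic prediction for a newly genotyped animal:
    GEBV = mu + sum_i g_i * b_i, where g_i is the allele dosage (0/1/2)
    and b_i is the SNP effect estimated in the previous official run.
    Only the new animal is processed; no re-estimation of effects.
    """
    return mu + sum(g * b for g, b in zip(dosages, snp_effects))

# Hypothetical three-SNP example with carried-over effect estimates:
gebv = preliminary_gebv([0, 1, 2], [0.5, -0.2, 0.1], mu=1.0)
```

Because each new animal requires only one dot product against stored effects, daily or weekly scoring stays cheap between the monthly official runs.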
Comparison of different methods for isolation of bacterial DNA from retail oyster tissues
USDA-ARS's Scientific Manuscript database
Oysters are filter-feeders that bio-accumulate bacteria in water while feeding. To evaluate the bacterial genomic DNA extracted from retail oyster tissues, including the gills and digestive glands, four isolation methods were used. Genomic DNA extraction was performed using the Allmag™ Blood Genomic...
Comparison of methods for the implementation of genome-assisted evaluation of Spanish dairy cattle.
Jiménez-Montero, J A; González-Recio, O; Alenda, R
2013-01-01
The aim of this study was to evaluate methods for genomic evaluation of the Spanish Holstein population as an initial step toward the implementation of routine genomic evaluations. This study provides a description of the population structure of progeny tested bulls in Spain at the genomic level and compares different genomic evaluation methods with regard to accuracy and bias. Two Bayesian linear regression models, Bayes-A and Bayesian-LASSO (B-LASSO), as well as a machine learning algorithm, Random-Boosting (R-Boost), and BLUP using a realized genomic relationship matrix (G-BLUP), were compared. Five traits that are currently under selection in the Spanish Holstein population were used: milk yield, fat yield, protein yield, fat percentage, and udder depth. In total, genotypes from 1859 progeny tested bulls were used. The training sets were composed of bulls born before 2005, including 1601 bulls for production and 1574 bulls for type, whereas the testing sets contained 258 and 235 bulls born in 2005 or later for production and type, respectively. Deregressed proofs (DRP) from the January 2009 Interbull (Uppsala, Sweden) evaluation were used as the dependent variables for bulls in the training sets, whereas DRP from the December 2011 Interbull evaluation were used to compare genomic predictions with progeny test results for bulls in the testing set. Genomic predictions were more accurate than traditional pedigree indices for predicting future progeny test results of young bulls. The gain in accuracy due to the inclusion of genomic data varied by trait and ranged from 0.04 to 0.42 Pearson correlation units. Results averaged across traits showed that B-LASSO had the highest accuracy, with an advantage of 0.01, 0.03 and 0.03 points in Pearson correlation compared with R-Boost, Bayes-A, and G-BLUP, respectively.
The B-LASSO predictions also showed the least bias (0.02, 0.03 and 0.10 SD units less than Bayes-A, R-Boost and G-BLUP, respectively), as measured by the mean difference between genomic predictions and progeny test results. The R-Boost algorithm provided genomic predictions with regression coefficients closer to unity, which is an alternative measure of bias, for 4 out of 5 traits, and also resulted in mean squared error estimates that were 2%, 10%, and 12% smaller than those of B-LASSO, Bayes-A, and G-BLUP, respectively. The observed prediction accuracy obtained with these methods was within the range of values expected for a population of similar size, suggesting that the prediction method and reference population described herein are appropriate for the implementation of routine genome-assisted evaluations in Spanish dairy cattle. R-Boost is a competitive marker regression methodology in terms of predictive ability that can accommodate large data sets. Copyright © 2013 American Dairy Science Association. Published by Elsevier Inc. All rights reserved.
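The validation statistics used in studies like this one (Pearson correlation for accuracy, mean difference for bias, regression slope for dispersion, and mean squared error) can be computed with a short sketch; the toy numbers below are illustrative only.

```python
from statistics import mean

def validation_metrics(drp, gebv):
    """Validation-set metrics for genomic predictions:
    Pearson correlation (accuracy), mean difference GEBV - DRP (bias),
    slope of the regression of DRP on GEBV (1.0 means no over- or
    under-dispersion), and mean squared error."""
    mx, my = mean(gebv), mean(drp)
    sxy = sum((x - mx) * (y - my) for x, y in zip(gebv, drp))
    sxx = sum((x - mx) ** 2 for x in gebv)
    syy = sum((y - my) ** 2 for y in drp)
    r = sxy / (sxx * syy) ** 0.5
    slope = sxy / sxx
    bias = mean(x - y for x, y in zip(gebv, drp))
    mse = mean((x - y) ** 2 for x, y in zip(gebv, drp))
    return r, bias, slope, mse

# Toy data: predictions perfectly correlated with DRP but
# under-dispersed (slope 2.0) and downward biased:
drp = [2.0, 4.0, 6.0, 8.0]
gebv = [1.0, 2.0, 3.0, 4.0]
r, bias, slope, mse = validation_metrics(drp, gebv)
```

The toy case shows why accuracy alone is not enough: the correlation is perfect (r = 1.0) even though the predictions are biased and under-dispersed, which is exactly why the study reports all three classes of statistics.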
Lu, Bingxin; Leong, Hon Wai
2016-02-01
Genomic islands (GIs) are clusters of functionally related genes acquired by lateral genetic transfer (LGT), and they are present in many bacterial genomes. GIs are extremely important for bacterial research, because they not only promote genome evolution but also contain genes that enhance adaptation and enable antibiotic resistance. Many methods have been proposed to predict GIs, but most of them rely on either annotations or comparisons with other closely related genomes. Hence these methods cannot be easily applied to new genomes. As the number of newly sequenced bacterial genomes rapidly increases, there is a need for methods to detect GIs based solely on the sequence of a single genome. In this paper, we propose a novel method, GI-SVM, to predict GIs given only the unannotated genome sequence. GI-SVM is based on a one-class support vector machine (SVM), utilizing composition bias in terms of k-mer content. From our evaluations on three real genomes, GI-SVM achieves higher recall than current methods, without much loss of precision. In addition, GI-SVM allows flexible parameter tuning to obtain optimal results for each genome. In short, GI-SVM provides a more sensitive method for researchers interested in a first-pass detection of GIs in newly sequenced genomes.
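The composition features underlying GI-SVM-style methods are simple to compute: each genomic window is summarized as a k-mer frequency vector, and windows whose composition deviates from the genome-wide background stand out to a one-class SVM as candidate islands. The sketch below shows only the feature extraction step for one sequence, as a generic illustration rather than GI-SVM's own code.

```python
from collections import Counter
from itertools import product

def kmer_frequencies(seq, k=4):
    """Frequency vector over all 4**k possible k-mers of a DNA string.
    Vectors like this, computed per genomic window, are the inputs a
    composition-based one-class SVM would be trained on."""
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts.values())
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    return [counts[km] / total for km in kmers]

# Short toy sequence; k=2 keeps the vector small enough to inspect:
freqs = kmer_frequencies("ACGTACGTACGT", k=2)
```

In practice k and the window size are the tuning parameters the abstract alludes to: larger k captures finer composition bias but needs longer windows to estimate 4**k frequencies reliably.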
Lin, Da; Hong, Ping; Zhang, Siheng; Xu, Weize; Jamal, Muhammad; Yan, Keji; Lei, Yingying; Li, Liang; Ruan, Yijun; Fu, Zhen F; Li, Guoliang; Cao, Gang
2018-05-01
Chromosome conformation capture (3C) technologies can be used to investigate 3D genomic structures. However, high background noise, high costs, and a lack of straightforward noise evaluation in current methods impede the advancement of 3D genomic research. Here we developed a simple digestion-ligation-only Hi-C (DLO Hi-C) technology to explore the 3D landscape of the genome. This method requires only two rounds of digestion and ligation, without the need for biotin labeling and pulldown. Non-ligated DNA was efficiently removed in a cost-effective step by purifying specific linker-ligated DNA fragments. Notably, random ligation could be quickly evaluated in an early quality-control step before sequencing. Moreover, an in situ version of DLO Hi-C using a four-cutter restriction enzyme has been developed. We applied DLO Hi-C to delineate the genomic architecture of THP-1 and K562 cells and uncovered chromosomal translocations. This technology may facilitate investigation of genomic organization, gene regulation, and (meta)genome assembly.
Comparison of genomic-enhanced EPD systems using an external phenotypic database
USDA-ARS's Scientific Manuscript database
The American Angus Association (AAA) is currently evaluating two methods to incorporate genomic information into their genetic evaluation program: 1) multi-trait incorporation of an externally produced molecular breeding value as an indicator trait (MT) and 2) single-step evaluation with an unweight...
Issues surrounding the health economic evaluation of genomic technologies
Buchanan, James; Wordsworth, Sarah; Schuh, Anna
2014-01-01
Aim: Genomic interventions could enable improved disease stratification and individually tailored therapies. However, they have had a limited impact on clinical practice to date due to a lack of evidence, particularly economic evidence. This is partly because health economists have yet to reach consensus on whether existing methods are sufficient to evaluate genomic technologies. As different approaches may produce conflicting adoption decisions, clarification is urgently required. This article summarizes the methodological issues associated with conducting economic evaluations of genomic interventions. Materials & methods: A structured literature review was conducted to identify references that considered the methodological challenges faced when conducting economic evaluations of genomic interventions. Results: Methodological challenges related to the analytical approach included the choice of comparator, perspective and timeframe. Challenges in costing centered on the need to collect a broad range of costs, frequently in a data-limited environment. Measuring outcomes is problematic because standard measures have limited applicability; alternative metrics (e.g., personal utility) are underdeveloped and alternative approaches (e.g., cost–benefit analysis) underused. Effectiveness data quality is weak and challenging to incorporate into standard economic analyses, while little is known about patient and clinician behavior in this context. Comprehensive value of information analyses are likely to be helpful. Conclusion: Economic evaluations of genomic technologies present a particular challenge for health economists. New methods may be required to resolve these issues, but the evidence to justify alternative approaches is yet to be produced. This should be the focus of future work in this field. PMID:24236483
Accuracy and training population design for genomic selection in elite north american oats
USDA-ARS's Scientific Manuscript database
Genomic selection (GS) is a method to estimate the breeding values of individuals by using markers throughout the genome. We evaluated the accuracies of GS using data from five traits on 446 oat lines genotyped with 1005 Diversity Array Technology (DArT) markers and two GS methods (RR-BLUP and Bayes...
Approximating genomic reliabilities for national genomic evaluation
USDA-ARS's Scientific Manuscript database
With the introduction of standard methods for approximating effective daughter/data contribution by Interbull in 2001, conventional EDC or reliabilities contributed by daughter phenotypes are directly comparable across countries and used in routine conventional evaluations. In order to make publishe...
QUAST: quality assessment tool for genome assemblies
Gurevich, Alexey; Saveliev, Vladislav; Vyahhi, Nikolay; Tesler, Glenn
2013-01-01
Summary: Limitations of genome sequencing techniques have led to dozens of assembly algorithms, none of which is perfect. A number of methods for comparing assemblers have been developed, but none is yet a recognized benchmark. Further, most existing methods for comparing assemblies are only applicable to new assemblies of finished genomes; the problem of evaluating assemblies of previously unsequenced species has not been adequately considered. Here, we present QUAST, a quality assessment tool for evaluating and comparing genome assemblies. This tool improves on leading assembly comparison software with new ideas and quality metrics. QUAST can evaluate assemblies both with and without a reference genome. QUAST produces many reports, summary tables and plots to help scientists in their research and in their publications. In this study, we used QUAST to compare several genome assemblers on three datasets. QUAST tables and plots for all of them are available in the Supplementary Material, and interactive versions of these reports are on the QUAST website. Availability: http://bioinf.spbau.ru/quast Contact: gurevich@bioinf.spbau.ru Supplementary information: Supplementary data are available at Bioinformatics online. PMID:23422339
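One of the contiguity statistics that assembly assessment tools like QUAST report, N50, is easy to illustrate. The function below is a generic textbook implementation, not QUAST's own code, and the contig lengths are a toy example.

```python
def n50(contig_lengths):
    """N50: the length L such that contigs of length >= L together
    cover at least half of the total assembly length. Higher N50
    generally indicates a more contiguous assembly."""
    total = sum(contig_lengths)
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length

# Toy assembly of seven contigs, 300 bp total; the two largest
# contigs (80 + 70 = 150 bp) reach half the total, so N50 = 70.
lengths = [80, 70, 50, 40, 30, 20, 10]
assembly_n50 = n50(lengths)
```

N50 alone says nothing about correctness, which is why reference-based tools also count misassemblies before ranking assemblers.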
Application of single-step genomic evaluation for crossbred performance in pig.
Xiang, T; Nielsen, B; Su, G; Legarra, A; Christensen, O F
2016-03-01
Crossbreeding is predominant and intensively used in commercial meat production systems, especially in poultry and swine. Genomic evaluation has been successfully applied for breeding within purebreds but also offers opportunities for selecting purebreds for crossbred performance by combining information from purebreds with information from crossbreds. However, it generally requires that all relevant animals are genotyped, which is costly and presently does not seem to be feasible in practice. Recently, a novel single-step BLUP method for genomic evaluation of both purebred and crossbred performance has been developed that can incorporate marker genotypes into a traditional animal model. This new method has not been validated in real data sets. In this study, we applied this single-step method to analyze data for the maternal trait of total number of piglets born in Danish Landrace, Yorkshire, and two-way crossbred pigs in different scenarios. The genetic correlation between purebred and crossbred performances was investigated first, and then the impact of (crossbred) genomic information on prediction reliability for crossbred performance was explored. The results confirm the existence of a moderate genetic correlation, and the standard errors of the estimates were reduced when genomic information was included. Models with marker information, especially crossbred genomic information, improved model-based reliabilities for crossbred performance of purebred boars and also improved the predictive ability for crossbred animals and, to some extent, reduced the bias of prediction. We conclude that the new single-step BLUP method is a good tool in the genetic evaluation for crossbred performance in purebred animals.
Assemblathon 2: evaluating de novo methods of genome assembly in three vertebrate species
2013-01-01
Background: The process of generating raw genome sequence data continues to become cheaper, faster, and more accurate. However, assembly of such data into high-quality, finished genome sequences remains challenging. Many genome assembly tools are available, but they differ greatly in terms of their performance (speed, scalability, hardware requirements, acceptance of newer read technologies) and in their final output (composition of assembled sequence). More importantly, it remains largely unclear how to best assess the quality of assembled genome sequences. The Assemblathon competitions are intended to assess current state-of-the-art methods in genome assembly. Results: In Assemblathon 2, we provided a variety of sequence data to be assembled for three vertebrate species (a bird, a fish, and a snake). This resulted in a total of 43 submitted assemblies from 21 participating teams. We evaluated these assemblies using a combination of optical map data, Fosmid sequences, and several statistical methods. From over 100 different metrics, we chose ten key measures by which to assess the overall quality of the assemblies. Conclusions: Many current genome assemblers produced useful assemblies, containing a significant representation of their genes and overall genome structure. However, the high degree of variability between the entries suggests that there is still much room for improvement in the field of genome assembly and that approaches that work well in assembling the genome of one species may not necessarily work well for another. PMID:23870653
Evaluation method for the potential functionome harbored in the genome and metagenome.
Takami, Hideto; Taniguchi, Takeaki; Moriya, Yuki; Kuwahara, Tomomi; Kanehisa, Minoru; Goto, Susumu
2012-12-12
One of the main goals of genomic analysis is to elucidate the comprehensive functions (functionome) in individual organisms or a whole community in various environments. However, a standard evaluation method for discerning the functional potentials harbored within the genome or metagenome has not yet been established. We have developed a new evaluation method for the potential functionome, based on the completion ratio of Kyoto Encyclopedia of Genes and Genomes (KEGG) functional modules. Distribution of the completion ratio of the KEGG functional modules in 768 prokaryotic species varied greatly with the kind of module, and all modules primarily fell into 4 patterns (universal, restricted, diversified and non-prokaryotic modules), indicating the universal and unique nature of each module, and also the versatility of the KEGG Orthology (KO) identifiers mapped to each one. The module completion ratio in 8 phenotypically different bacilli revealed that some modules were shared only in phenotypically similar species. Metagenomes of human gut microbiomes from 13 healthy individuals previously determined by the Sanger method were analyzed based on the module completion ratio. Results led to new discoveries in the nutritional preferences of gut microbes, believed to be one of the mutualistic representations of gut microbiomes to avoid nutritional competition with the host. The method developed in this study could characterize the functionome harbored in genomes and metagenomes. As this method also provided taxonomical information from KEGG modules as well as the gene hosts constructing the modules, interpretation of completion profiles was simplified and we could identify the complementarity between biochemical functions in human hosts and the nutritional preferences in human gut microbiomes. 
Thus, our method has the potential to be a powerful tool for comparative functional analysis in genomics and metagenomics, able to target unknown environments containing various uncultivable microbes within unidentified phyla.
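The module completion ratio idea can be sketched generically: a module is a list of reaction steps, each step fillable by one of several KO identifiers, and the ratio is the fraction of steps the genome can fill. The module structure and KO numbers below are hypothetical, and real KEGG module definitions are richer (alternative paths, complexes), so this is an illustration of the principle only.

```python
def module_completion_ratio(module_steps, genome_kos):
    """Completion ratio of a KEGG-style functional module: the fraction
    of steps for which the genome carries at least one of the KO
    identifiers that can fill that step."""
    filled = sum(1 for step in module_steps
                 if any(ko in genome_kos for ko in step))
    return filled / len(module_steps)

# Hypothetical 4-step module; the second step has two alternative KOs.
module = [["K00001"], ["K00002", "K00003"], ["K00004"], ["K00005"]]
genome = {"K00001", "K00003", "K00005"}
ratio = module_completion_ratio(module, genome)  # 3 of 4 steps filled
```

Profiling many genomes or metagenomes this way yields the per-module completion distributions from which patterns such as universal versus restricted modules emerge.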
Evaluation of methods and marker Systems in Genomic Selection of oil palm (Elaeis guineensis Jacq.).
Kwong, Qi Bin; Teh, Chee Keng; Ong, Ai Ling; Chew, Fook Tim; Mayes, Sean; Kulaveerasingam, Harikrishna; Tammi, Martti; Yeoh, Suat Hui; Appleton, David Ross; Harikrishna, Jennifer Ann
2017-12-11
Genomic selection (GS) uses genome-wide markers as an attempt to accelerate genetic gain in breeding programs of both animals and plants. This approach is particularly useful for perennial crops such as oil palm, which have long breeding cycles, and for which the optimal method for GS is still under debate. In this study, we evaluated the effect of different marker systems and modeling methods for implementing GS in an introgressed dura family derived from a Deli dura x Nigerian dura (Deli x Nigerian) with 112 individuals. This family is an important breeding source for developing new mother palms for superior oil yield and bunch characters. The traits of interest selected for this study were fruit-to-bunch (F/B), shell-to-fruit (S/F), kernel-to-fruit (K/F), mesocarp-to-fruit (M/F), oil per palm (O/P) and oil-to-dry mesocarp (O/DM). The marker systems evaluated were simple sequence repeats (SSRs) and single nucleotide polymorphisms (SNPs). RR-BLUP, Bayes A, Bayes B, Bayes Cπ, Bayesian LASSO, ridge regression and two machine learning methods (SVM and Random Forest) were used to evaluate GS accuracy of the traits. The kinship coefficient between individuals in this family ranged from 0.35 to 0.62. S/F and O/DM had the highest genomic heritability, whereas F/B and O/P had the lowest. The accuracies using 135 SSRs were low, with accuracies of the traits around 0.20. The average accuracy of machine learning methods was 0.24, as compared to 0.20 achieved by other methods. The trait with the highest mean accuracy was F/B (0.28), while the lowest were both M/F and O/P (0.18). By using whole genomic SNPs, the accuracies for all traits, especially for O/DM (0.43), S/F (0.39) and M/F (0.30), were improved. The average accuracy of machine learning methods was 0.32, compared to 0.31 achieved by other methods. Due to high genomic resolution, the use of whole-genome SNPs improved the efficiency of GS dramatically for oil palm and is recommended for dura breeding programs.
Machine learning slightly outperformed the other methods, but required parameter optimization for GS implementation.
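Among the compared methods, RR-BLUP has a particularly compact form: marker effects come from a single ridge-regression solve with one shrinkage parameter shared by all markers. The NumPy sketch below is a generic illustration of that idea under toy data, with lambda chosen arbitrarily, and is not the authors' pipeline.

```python
import numpy as np

def rr_blup_effects(Z, y, lam):
    """Ridge-regression (RR-BLUP-style) marker effect estimates:
    solve (Z'Z + lam*I) b = Z'y, shrinking every marker equally.
    Z is the n x p marker matrix, y the centered phenotypes."""
    p = Z.shape[1]
    return np.linalg.solve(Z.T @ Z + lam * np.eye(p), Z.T @ y)

# Toy orthogonal design with two markers: with lam = 1 each
# least-squares effect of 2.0 is shrunk by half, to 1.0.
Z = np.array([[1.0, 0.0], [0.0, 1.0]])
y = np.array([2.0, 2.0])
b = rr_blup_effects(Z, y, 1.0)
```

The equal-shrinkage assumption is exactly what Bayes B and Cπ relax, which is why those methods can gain accuracy when a few markers have large effects.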
The Glyphosate-Based Herbicide Roundup Does not Elevate Genome-Wide Mutagenesis of Escherichia coli.
Tincher, Clayton; Long, Hongan; Behringer, Megan; Walker, Noah; Lynch, Michael
2017-10-05
Mutations induced by pollutants may promote pathogen evolution, for example by accelerating mutations conferring antibiotic resistance. Generally, evaluating the genome-wide mutagenic effects of long-term sublethal pollutant exposure at single-nucleotide resolution is extremely difficult. To overcome this technical barrier, we use the mutation accumulation/whole-genome sequencing (MA/WGS) method as a mutagenicity test, to quantitatively evaluate genome-wide mutagenesis of Escherichia coli after long-term exposure to a wide gradient of the glyphosate-based herbicide (GBH) Roundup Concentrate Plus. The genome-wide mutation rate decreases as GBH concentration increases, suggesting that even long-term GBH exposure does not compromise the genome stability of bacteria. Copyright © 2017 Tincher et al.
Quasispecies Analyses of the HIV-1 Near-full-length Genome With Illumina MiSeq
Ode, Hirotaka; Matsuda, Masakazu; Matsuoka, Kazuhiro; Hachiya, Atsuko; Hattori, Junko; Kito, Yumiko; Yokomaku, Yoshiyuki; Iwatani, Yasumasa; Sugiura, Wataru
2015-01-01
Human immunodeficiency virus type-1 (HIV-1) exhibits high between-host genetic diversity and within-host heterogeneity, recognized as quasispecies. Because HIV-1 quasispecies fluctuate in terms of multiple factors, such as antiretroviral exposure and host immunity, analyzing the HIV-1 genome is critical for selecting effective antiretroviral therapy and understanding within-host viral coevolution mechanisms. Here, to obtain HIV-1 genome sequence information that includes minority variants, we sought to develop a method for evaluating quasispecies throughout the HIV-1 near-full-length genome using the Illumina MiSeq benchtop deep sequencer. To ensure the reliability of minority mutation detection, we applied an analysis method of sequence read mapping onto a consensus sequence derived from de novo assembly followed by iterative mapping and subsequent unique error correction. Deep sequencing analyses of an HIV-1 clone showed that the analysis method reduced erroneous base prevalence below 1% in each sequence position and discarded only < 1% of all collected nucleotides, maximizing the usage of the collected genome sequences. Further, we designed primer sets to amplify the HIV-1 near-full-length genome from clinical plasma samples. Deep sequencing of 92 samples in combination with the primer sets and our analysis method provided sufficient coverage to identify >1%-frequency sequences throughout the genome. When we evaluated sequences of pol genes from 18 treatment-naïve patients' samples, the deep sequencing results were in agreement with Sanger sequencing and identified numerous additional minority mutations. The results suggest that our deep sequencing method would be suitable for identifying within-host viral population dynamics throughout the genome. PMID:26617593
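The >1%-frequency reporting threshold can be illustrated with a toy pileup scan. The data structure and read counts below are hypothetical, and a real analysis would also model sequencing error (as the authors do via consensus building and error correction) before calling minority variants.

```python
def minority_variants(pileup, min_freq=0.01):
    """Flag minority variants from per-position base counts.

    pileup maps position -> {base: read count}; a non-consensus base is
    reported when its within-position frequency exceeds min_freq (the
    >1% threshold used for deep-sequencing quasispecies analysis)."""
    variants = []
    for pos, counts in sorted(pileup.items()):
        depth = sum(counts.values())
        consensus = max(counts, key=counts.get)
        for base, n in counts.items():
            if base != consensus and n / depth > min_freq:
                variants.append((pos, base, n / depth))
    return variants

# Hypothetical pileup: a 3% minority G at position 101 is reported,
# while a 0.1% T at position 102 falls below the threshold.
pileup = {101: {"A": 970, "G": 30}, 102: {"C": 999, "T": 1}}
hits = minority_variants(pileup)
```

The threshold choice matters: set too low, residual sequencing errors surface as false minority variants, which is why the pipeline first drives per-position error rates below 1%.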
Zhang, Kui; Wiener, Howard; Beasley, Mark; George, Varghese; Amos, Christopher I; Allison, David B
2006-08-01
Individual genome scans for quantitative trait loci (QTL) mapping often suffer from low statistical power and imprecise estimates of QTL location and effect. This lack of precision yields large confidence intervals for QTL location, which are problematic for subsequent fine mapping and positional cloning. In prioritizing areas for follow-up after an initial genome scan and in evaluating the credibility of apparent linkage signals, investigators typically examine the results of other genome scans of the same phenotype and informally update their beliefs about which linkage signals in their scan most merit confidence and follow-up via a subjective-intuitive integration approach. A method that acknowledges the wisdom of this general paradigm but formally borrows information from other scans to increase confidence in objectivity would be a benefit. We developed an empirical Bayes analytic method to integrate information from multiple genome scans. The linkage statistic obtained from a single genome scan study is updated by incorporating statistics from other genome scans as prior information. This technique does not require that all studies have an identical marker map or a common estimated QTL effect. The updated linkage statistic can then be used for the estimation of QTL location and effect. We evaluate the performance of our method by using extensive simulations based on actual marker spacing and allele frequencies from available data. Results indicate that the empirical Bayes method can account for between-study heterogeneity, estimate the QTL location and effect more precisely, and provide narrower confidence intervals than results from any single individual study. We also compared the empirical Bayes method with a method originally developed for meta-analysis (a closely related but distinct purpose). In the face of marked heterogeneity among studies, the empirical Bayes method outperforms the comparator.
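A minimal precision-weighted sketch conveys the flavor of updating one scan's linkage statistic with prior information from other scans. The real method is more elaborate (it accommodates differing marker maps and between-study heterogeneity), so treat the following as an illustration of the general empirical Bayes idea only.

```python
def eb_update(z_own, var_own, z_prior, var_prior):
    """Precision-weighted empirical Bayes update of a linkage statistic:
    the estimate from one scan (z_own, var_own) is shrunk toward a
    prior built from other scans (z_prior, var_prior). Weights are the
    inverse variances, so more precise sources pull harder."""
    w_own = 1.0 / var_own
    w_prior = 1.0 / var_prior
    post_mean = (w_own * z_own + w_prior * z_prior) / (w_own + w_prior)
    post_var = 1.0 / (w_own + w_prior)
    return post_mean, post_var

# Own scan: z = 3.0 with variance 1.0; prior from other scans:
# z = 1.0 with variance 1.0. Equal precisions average the estimates
# and halve the variance, i.e. a narrower interval for QTL follow-up.
m, v = eb_update(3.0, 1.0, 1.0, 1.0)
```

The narrower posterior variance is the formal counterpart of the abstract's claim that borrowing information across scans yields tighter confidence intervals for QTL location and effect.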
Berry, Nadine Kaye; Bain, Nicole L; Enjeti, Anoop K; Rowlings, Philip
2014-01-01
Aim To evaluate the role of whole genome comparative genomic hybridisation microarray (array-CGH) in detecting genomic imbalances as compared to conventional karyotype (GTG-analysis) or myeloma specific fluorescence in situ hybridisation (FISH) panel in a diagnostic setting for plasma cell dyscrasia (PCD). Methods A myeloma-specific interphase FISH (i-FISH) panel was carried out on CD138 PC-enriched bone marrow (BM) from 20 patients having BM biopsies for evaluation of PCD. Whole genome array-CGH was performed on reference (control) and neoplastic (test patient) genomic DNA extracted from CD138 PC-enriched BM and analysed. Results Comparison of techniques demonstrated a much higher detection rate of genomic imbalances using array-CGH. Genomic imbalances were detected in 1, 19 and 20 patients using GTG-analysis, i-FISH and array-CGH, respectively. Genomic rearrangements were detected in one patient using GTG-analysis and seven patients using i-FISH, while none were detected using array-CGH. I-FISH was the most sensitive method for detecting gene rearrangements and GTG-analysis was the least sensitive method overall. All copy number aberrations observed in GTG-analysis were detected using array-CGH and i-FISH. Conclusions We show that array-CGH performed on CD138-enriched PCs significantly improves the detection of clinically relevant and possibly novel genomic abnormalities in PCD, and thus could be considered as a standard diagnostic technique in combination with IGH rearrangement i-FISH. PMID:23969274
Assessing genomic selection prediction accuracy in a dynamic barley breeding
USDA-ARS's Scientific Manuscript database
Genomic selection is a method to improve quantitative traits in crops and livestock by estimating breeding values of selection candidates using phenotype and genome-wide marker data sets. Prediction accuracy has been evaluated through simulation and cross-validation; however, validation based on prog...
Evaluating imputation algorithms for low-depth genotyping-by-sequencing (GBS) data
USDA-ARS's Scientific Manuscript database
Well-powered genomic studies require genome-wide marker coverage across many individuals. For non-model species with few genomic resources, high-throughput sequencing (HTS) methods, such as Genotyping-By-Sequencing (GBS), offer an inexpensive alternative to array-based genotyping. Although affordabl...
Methods to approximate reliabilities in single-step genomic evaluation
USDA-ARS's Scientific Manuscript database
Reliability of predictions from single-step genomic BLUP (ssGBLUP) can be calculated by inversion, but that is not feasible for large data sets. Two methods of approximating reliability were developed based on decomposition of a function of reliability into contributions from records, pedigrees, and...
White, James R; Escobar-Paramo, Patricia; Mongodin, Emmanuel F; Nelson, Karen E; DiRuggiero, Jocelyne
2008-10-01
The extent of chromosome rearrangements in Pyrococcus isolates from marine hydrothermal vents in Vulcano Island, Italy, was evaluated by high-throughput genomic methods. The results illustrate the dynamic nature of the genomes of the genus Pyrococcus and raise the possibility of a connection between rapidly changing environmental conditions and adaptive genomic properties.
USDA-ARS's Scientific Manuscript database
Bacterial cold water disease (BCWD) causes significant economic losses in salmonid aquaculture, and traditional family-based breeding programs aimed at improving BCWD resistance have been limited to exploiting only between-family variation. We used genomic selection (GS) models to predict genomic br...
Accounting for discovery bias in genomic prediction
USDA-ARS's Scientific Manuscript database
Our objective was to evaluate an approach to mitigating discovery bias in genomic prediction. Accuracy may be improved by placing greater emphasis on regions of the genome expected to be more influential on a trait. Methods emphasizing regions result in a phenomenon known as “discovery bias” if info...
Evaluation method for the potential functionome harbored in the genome and metagenome
2012-01-01
Background One of the main goals of genomic analysis is to elucidate the comprehensive functions (functionome) in individual organisms or a whole community in various environments. However, a standard evaluation method for discerning the functional potentials harbored within the genome or metagenome has not yet been established. We have developed a new evaluation method for the potential functionome, based on the completion ratio of Kyoto Encyclopedia of Genes and Genomes (KEGG) functional modules. Results Distribution of the completion ratio of the KEGG functional modules in 768 prokaryotic species varied greatly with the kind of module, and all modules primarily fell into 4 patterns (universal, restricted, diversified and non-prokaryotic modules), indicating the universal and unique nature of each module, and also the versatility of the KEGG Orthology (KO) identifiers mapped to each one. The module completion ratio in 8 phenotypically different bacilli revealed that some modules were shared only in phenotypically similar species. Metagenomes of human gut microbiomes from 13 healthy individuals previously determined by the Sanger method were analyzed based on the module completion ratio. Results led to new discoveries in the nutritional preferences of gut microbes, believed to be one of the mutualistic representations of gut microbiomes to avoid nutritional competition with the host. Conclusions The method developed in this study could characterize the functionome harbored in genomes and metagenomes. As this method also provided taxonomical information from KEGG modules as well as the gene hosts constructing the modules, interpretation of completion profiles was simplified and we could identify the complementarity between biochemical functions in human hosts and the nutritional preferences in human gut microbiomes. 
Thus, our method has the potential to be a powerful tool for comparative functional analysis in genomics and metagenomics, able to target unknown environments containing various uncultivable microbes within unidentified phyla. PMID:23234305
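The module completion ratio at the heart of this method can be illustrated with a toy calculation: the fraction of the KO identifiers defining a KEGG module that are present in a genome's annotation. The module definition and KO sets below are invented placeholders, not real KEGG entries:

```python
def module_completion_ratio(module_kos, genome_kos):
    """Fraction of a module's KO identifiers present in a genome's annotation."""
    if not module_kos:
        return 0.0
    return len(module_kos & genome_kos) / len(module_kos)

# Hypothetical 4-KO module; this genome carries three of its KOs.
module = {"K00001", "K00002", "K00003", "K00004"}
genome = {"K00001", "K00002", "K00003", "K09999"}
ratio = module_completion_ratio(module, genome)
```

Computing this ratio for every module across many genomes or metagenomes yields the completion profiles that the study clusters into its four module patterns.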
Dewey, Colin N
2012-01-01
Whole-genome alignment (WGA) is the prediction of evolutionary relationships at the nucleotide level between two or more genomes. It combines aspects of both colinear sequence alignment and gene orthology prediction, and is typically more challenging to address than either of these tasks due to the size and complexity of whole genomes. Despite the difficulty of this problem, numerous methods have been developed for its solution because WGAs are valuable for genome-wide analyses, such as phylogenetic inference, genome annotation, and function prediction. In this chapter, we discuss the meaning and significance of WGA and present an overview of the methods that address it. We also examine the problem of evaluating whole-genome aligners and offer a set of methodological challenges that need to be tackled in order to make the most effective use of our rapidly growing databases of whole genomes.
Hitomi, Yuki; Tokunaga, Katsushi
2017-01-01
Human genome variation may cause differences in traits and disease risks. Disease-causal/susceptible genes and variants for both common and rare diseases can be detected by comprehensive whole-genome analyses, such as whole-genome sequencing (WGS), using next-generation sequencing (NGS) technology and genome-wide association studies (GWAS). Here, in addition to the application of NGS as a whole-genome analysis method, we summarize approaches for the identification of functional disease-causal/susceptible variants from abundant genetic variants in the human genome and methods for evaluating their functional effects in human diseases, using NGS and in silico and in vitro functional analyses. We also discuss the clinical applications of functional disease-causal/susceptible variants to personalized medicine.
Influence of outliers on accuracy estimation in genomic prediction in plant breeding.
Estaghvirou, Sidi Boubacar Ould; Ogutu, Joseph O; Piepho, Hans-Peter
2014-10-01
Outliers often pose problems in analyses of data in plant breeding, but their influence on the performance of methods for estimating predictive accuracy in genomic prediction studies has not yet been evaluated. Here, we evaluate the influence of outliers on the performance of methods for accuracy estimation in genomic prediction studies using simulation. We simulated 1000 datasets for each of 10 scenarios to evaluate the influence of outliers on the performance of seven methods for estimating accuracy. These scenarios are defined by the number of genotypes, marker effect variance, and magnitude of outliers. To mimic outliers, we added to one observation in each simulated dataset, in turn, 5-, 8-, and 10-times the error SD used to simulate small and large phenotypic datasets. The effect of outliers on accuracy estimation was evaluated by comparing deviations in the estimated and true accuracies for datasets with and without outliers. Outliers adversely influenced accuracy estimation, more so at small values of genetic variance or number of genotypes. A method for estimating heritability and predictive accuracy in plant breeding and another used to estimate accuracy in animal breeding were the most accurate and resistant to outliers across all scenarios and are therefore preferable for accuracy estimation in genomic prediction studies. The performances of the other five methods that use cross-validation were less consistent and varied widely across scenarios. The computing time for the methods increased as the size of outliers and sample size increased and the genetic variance decreased. Copyright © 2014 Ould Estaghvirou et al.
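The outlier-injection scheme described above can be sketched in a few lines: add a multiple of the error SD to a single observation and measure how far the estimated accuracy moves. The accuracy estimator here is a plain Pearson correlation between true and observed values, standing in for the seven methods compared in the study, and placing the outlier on the observation nearest the mean is our simplification so the distortion is purely variance inflation:

```python
import random
import statistics

def pearson(x, y):
    """Sample Pearson correlation between two equal-length sequences."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = (sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y)) ** 0.5
    return num / den

random.seed(1)
error_sd = 1.0
true_bv = [random.gauss(0, 1) for _ in range(200)]        # simulated breeding values
pheno = [g + random.gauss(0, error_sd) for g in true_bv]  # phenotype = g + e

clean_acc = pearson(true_bv, pheno)                       # accuracy without outliers

# Inject a single 10x error-SD outlier.
idx = min(range(len(true_bv)), key=lambda i: abs(true_bv[i]))
contaminated = pheno[:]
contaminated[idx] += 10 * error_sd
outlier_acc = pearson(true_bv, contaminated)
deviation = clean_acc - outlier_acc                       # accuracy distortion from one outlier
```

Repeating this over many simulated datasets and outlier magnitudes, as the study does, gives the distribution of deviations used to compare estimation methods.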
Optimized gene editing technology for Drosophila melanogaster using germ line-specific Cas9.
Ren, Xingjie; Sun, Jin; Housden, Benjamin E; Hu, Yanhui; Roesel, Charles; Lin, Shuailiang; Liu, Lu-Ping; Yang, Zhihao; Mao, Decai; Sun, Lingzhu; Wu, Qujie; Ji, Jun-Yuan; Xi, Jianzhong; Mohr, Stephanie E; Xu, Jiang; Perrimon, Norbert; Ni, Jian-Quan
2013-11-19
The ability to engineer genomes in a specific, systematic, and cost-effective way is critical for functional genomic studies. Recent advances using the CRISPR-associated single-guide RNA system (Cas9/sgRNA) illustrate the potential of this simple system for genome engineering in a number of organisms. Here we report an effective and inexpensive method for genome DNA editing in Drosophila melanogaster whereby plasmid DNAs encoding short sgRNAs under the control of the U6b promoter are injected into transgenic flies in which Cas9 is specifically expressed in the germ line via the nanos promoter. We evaluate the off-targets associated with the method and establish a Web-based resource, along with a searchable, genome-wide database of predicted sgRNAs appropriate for genome engineering in flies. Finally, we discuss the advantages of our method in comparison with other recently published approaches.
USDA-ARS's Scientific Manuscript database
The objective of this study was to compare methods for genomic evaluation in a rainbow trout (Oncorhynchus mykiss) population for survival when challenged by Flavobacterium psychrophilum, the causative agent of bacterial cold water disease (BCWD). The methods used were: 1) regular ssGBLUP that assume...
Assessing the Robustness of Complete Bacterial Genome Segmentations
NASA Astrophysics Data System (ADS)
Devillers, Hugo; Chiapello, Hélène; Schbath, Sophie; El Karoui, Meriem
Comparison of closely related bacterial genomes has revealed the presence of highly conserved sequences forming a "backbone" that is interrupted by numerous, less conserved, DNA fragments. Segmentation of bacterial genomes into backbone and variable regions is particularly useful for investigating bacterial genome evolution. Several software tools have been designed to compare complete bacterial chromosomes, and a few online databases store pre-computed genome comparisons. However, very few statistical methods are available to evaluate the reliability of these software tools and to compare the results obtained with them. To fill this gap, we have developed two local scores to measure the robustness of bacterial genome segmentations. Our method uses a simulation procedure based on random perturbations of the compared genomes. The scores presented in this paper are simple to implement, and our results show that they readily discriminate between robust and non-robust bacterial genome segmentations when using aligners such as MAUVE and MGA.
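A toy version of a local robustness score in this spirit: a segment from the original comparison counts as "robust" if it is recovered, with sufficient overlap, in segmentations of randomly perturbed genomes. The overlap criterion and the 90% threshold below are our assumptions for illustration, not the paper's definitions:

```python
def overlap(a, b):
    """Length of the intersection of two [start, end) intervals."""
    return max(0, min(a[1], b[1]) - max(a[0], b[0]))

def robustness_score(segment, perturbed_segmentations, threshold=0.9):
    """Fraction of perturbed runs in which `segment` is recovered, i.e. some
    segment in that run covers at least `threshold` of its length."""
    length = segment[1] - segment[0]
    hits = sum(
        1
        for segs in perturbed_segmentations
        if any(overlap(segment, s) >= threshold * length for s in segs)
    )
    return hits / len(perturbed_segmentations)

original = (100, 200)     # a backbone segment from the original comparison
perturbed = [
    [(98, 203)],          # fully covered: recovered
    [(100, 195)],         # 95% covered: recovered
    [(150, 400)],         # only half covered: not recovered
]
score = robustness_score(original, perturbed)
```

A score near 1 flags a segmentation feature that survives random perturbation of the input genomes; a low score flags an artifact of the particular aligner run.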
Amyotte, Beatrice; Bowen, Amy J.; Banks, Travis; Rajcan, Istvan; Somers, Daryl J.
2017-01-01
Breeding apples is a long-term endeavour and it is imperative that new cultivars are selected to have outstanding consumer appeal. This study has taken the approach of merging sensory science with genome wide association analyses in order to map the human perception of apple flavour and texture onto the apple genome. The goal was to identify genomic associations that could be used in breeding apples for improved fruit quality. A collection of 85 apple cultivars was examined over two years through descriptive sensory evaluation by a trained sensory panel. The trained sensory panel scored randomized sliced samples of each apple cultivar for seventeen taste, flavour and texture attributes using controlled sensory evaluation practices. In addition, the apple collection was subjected to genotyping by sequencing for marker discovery. A genome wide association analysis suggested significant genomic associations for several sensory traits including juiciness, crispness, mealiness and fresh green apple flavour. The findings include previously unreported genomic regions that could be used in apple breeding and suggest that similar sensory association mapping methods could be applied in other plants. PMID:28231290
USDA-ARS's Scientific Manuscript database
Genomic Selection (GS) is a new breeding method in which genome-wide markers are used to predict the breeding value of individuals in a breeding population. GS has been shown to improve breeding efficiency in dairy cattle and several crop plant species, and here we evaluate for the first time its ef...
Evaluation of FTA® paper for storage of oral meta-genomic DNA.
Foitzik, Magdalena; Stumpp, Sascha N; Grischke, Jasmin; Eberhard, Jörg; Stiesch, Meike
2014-10-01
The purpose of the present study was to evaluate the short-term storage of meta-genomic DNA from native oral biofilms on FTA® paper. Thirteen volunteers of both sexes received an acrylic splint for intraoral biofilm formation over a period of 48 hours. The biofilms were collected, resuspended in phosphate-buffered saline, and either stored on FTA® paper or directly processed by standard laboratory DNA extraction. The nucleic acid extraction efficiencies were evaluated by 16S rDNA targeted SSCP fingerprinting. The acquired banding pattern of FTA-derived meta-genomic DNA was compared to a standard DNA preparation protocol. Sensitivity and positive predictive values were calculated. The volunteers showed inter-individual differences in their bacterial species composition. A total of 200 bands were found for both methods and 85% of the banding patterns were equal, representing a sensitivity of 0.941 and a false-negative predictive value of 0.059. Meta-genomic DNA sampling, extraction, and adhesion using FTA® paper is a reliable method for storage of microbial DNA for a short period of time.
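The evaluation arithmetic above can be sketched by treating the standard DNA preparation as the reference and scoring bands recovered from FTA-stored samples as true positives or false negatives. The counts below are hypothetical, chosen only to reproduce the sensitivity reported in the abstract; they are not the study's raw data:

```python
def sensitivity(tp, fn):
    """Fraction of reference bands recovered by the test method."""
    return tp / (tp + fn)

tp, fn = 160, 10  # hypothetical band counts (reference bands recovered vs missed)
sens = sensitivity(tp, fn)
false_neg_rate = 1 - sens
```

With these counts the sensitivity rounds to 0.941 and the false-negative rate to 0.059, matching the figures quoted in the abstract.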
GUIDE-Seq enables genome-wide profiling of off-target cleavage by CRISPR-Cas nucleases
Nguyen, Nhu T.; Liebers, Matthew; Topkar, Ved V.; Thapar, Vishal; Wyvekens, Nicolas; Khayter, Cyd; Iafrate, A. John; Le, Long P.; Aryee, Martin J.; Joung, J. Keith
2014-01-01
CRISPR RNA-guided nucleases (RGNs) are widely used genome-editing reagents, but methods to delineate their genome-wide off-target cleavage activities have been lacking. Here we describe an approach for global detection of DNA double-stranded breaks (DSBs) introduced by RGNs and potentially other nucleases. This method, called Genome-wide Unbiased Identification of DSBs Enabled by Sequencing (GUIDE-Seq), relies on capture of double-stranded oligodeoxynucleotides into breaks. Application of GUIDE-Seq to thirteen RGNs in two human cell lines revealed wide variability in RGN off-target activities and unappreciated characteristics of off-target sequences. The majority of identified sites were not detected by existing computational methods or ChIP-Seq. GUIDE-Seq also identified RGN-independent genomic breakpoint ‘hotspots’. Finally, GUIDE-Seq revealed that truncated guide RNAs exhibit substantially reduced RGN-induced off-target DSBs. Our experiments define the most rigorous framework for genome-wide identification of RGN off-target effects to date and provide a method for evaluating the safety of these nucleases prior to clinical use. PMID:25513782
Howard, Jeremy T; Pryce, Jennie E; Baes, Christine; Maltecca, Christian
2017-08-01
Traditionally, pedigree-based relationship coefficients have been used to manage the inbreeding and degree of inbreeding depression that exists within a population. The widespread incorporation of genomic information in dairy cattle genetic evaluations allows for the opportunity to develop and implement methods to manage populations at the genomic level. As a result, the realized proportion of the genome that 2 individuals share can be more accurately estimated instead of using pedigree information to estimate the expected proportion of shared alleles. Furthermore, genomic information allows genome-wide relationship or inbreeding estimates to be augmented to characterize relationships for specific regions of the genome. Region-specific stretches can be used to more effectively manage areas of low genetic diversity or areas that, when homozygous, result in reduced performance across economically important traits. The use of region-specific metrics should allow breeders to more precisely manage the trade-off between the genetic value of the progeny and undesirable side effects associated with inbreeding. Methods tailored toward more effectively identifying regions affected by inbreeding and their associated use to manage the genome at the herd level, however, still need to be developed. We have reviewed topics related to inbreeding, measures of relatedness, genetic diversity and methods to manage populations at the genomic level, and we discuss future challenges related to managing populations through implementing genomic methods at the herd and population levels. Copyright © 2017 American Dairy Science Association. Published by Elsevier Inc. All rights reserved.
Benchmarking of Methods for Genomic Taxonomy
Larsen, Mette V.; Cosentino, Salvatore; Lukjancenko, Oksana; ...
2014-02-26
One of the first issues that emerges when a prokaryotic organism of interest is encountered is the question of what it is—that is, which species it is. The 16S rRNA gene formed the basis of the first method for sequence-based taxonomy and has had a tremendous impact on the field of microbiology. Nevertheless, the method has been found to have a number of shortcomings. In this paper, we trained and benchmarked five methods for whole-genome sequence-based prokaryotic species identification on a common data set of complete genomes: (i) SpeciesFinder, which is based on the complete 16S rRNA gene; (ii) Reads2Type, which searches for species-specific 50-mers in either the 16S rRNA gene or the gyrB gene (for the Enterobacteriaceae family); (iii) the ribosomal multilocus sequence typing (rMLST) method that samples up to 53 ribosomal genes; (iv) TaxonomyFinder, which is based on species-specific functional protein domain profiles; and finally (v) KmerFinder, which examines the number of cooccurring k-mers (substrings of k nucleotides in DNA sequence data). The performances of the methods were subsequently evaluated on three data sets of short sequence reads or draft genomes from public databases. In total, the evaluation sets constituted sequence data from more than 11,000 isolates covering 159 genera and 243 species. Our results indicate that methods that sample only chromosomal, core genes have difficulties in distinguishing closely related species that only recently diverged. Finally, the KmerFinder method had the overall highest accuracy and correctly identified from 93% to 97% of the isolates in the evaluation sets.
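The k-mer idea behind the best-performing method can be illustrated in miniature: identify a query by counting shared k-mers against each reference genome. The sequences, species names, and k value below are toy inputs, not KmerFinder's actual parameters:

```python
def kmers(seq, k=5):
    """Set of all k-length substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def identify(query, references, k=5):
    """Return the reference name sharing the most k-mers with the query."""
    q = kmers(query, k)
    return max(references, key=lambda name: len(q & kmers(references[name], k)))

# Toy reference "genomes"; the query fragment comes from species_A.
references = {"species_A": "ACGTACGTACGTAA", "species_B": "TTGGCCTTGGCCTT"}
best = identify("ACGTACGTAC", references)
```

Because k-mer sharing is computed over the whole genome rather than a single core gene, this style of method retains signal from accessory regions, consistent with the paper's finding that single-gene methods struggle with recently diverged species.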
Use of simulated data sets to evaluate the fidelity of metagenomic processing methods
DOE Office of Scientific and Technical Information (OSTI.GOV)
Mavromatis, K; Ivanova, N; Barry, Kerrie
2007-01-01
Metagenomics is a rapidly emerging field of research for studying microbial communities. To evaluate methods presently used to process metagenomic sequences, we constructed three simulated data sets of varying complexity by combining sequencing reads randomly selected from 113 isolate genomes. These data sets were designed to model real metagenomes in terms of complexity and phylogenetic composition. We assembled sampled reads using three commonly used genome assemblers (Phrap, Arachne and JAZZ), and predicted genes using two popular gene-finding pipelines (fgenesb and CRITICA/GLIMMER). The phylogenetic origins of the assembled contigs were predicted using one sequence similarity-based (blast hit distribution) and two sequence composition-based (PhyloPythia, oligonucleotide frequencies) binning methods. We explored the effects of the simulated community structure and method combinations on the fidelity of each processing step by comparison to the corresponding isolate genomes. The simulated data sets are available online to facilitate standardized benchmarking of tools for metagenomic analysis.
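The construction of such simulated data sets can be sketched as sampling reads from isolate genomes at chosen per-genome abundances, while recording each read's genome of origin so downstream assembly and binning calls can be checked against the truth. The genome sequences and abundances below are toy values, and real pipelines would also model read errors:

```python
import random

def simulate_metagenome(genomes, abundances, n_reads, read_len, seed=0):
    """Sample error-free reads from isolate genomes in proportion to abundance,
    keeping each read's genome of origin for later benchmarking."""
    rng = random.Random(seed)
    names = list(genomes)
    weights = [abundances[name] for name in names]
    reads = []
    for _ in range(n_reads):
        name = rng.choices(names, weights=weights)[0]  # pick a genome by abundance
        seq = genomes[name]
        start = rng.randrange(len(seq) - read_len + 1)
        reads.append((name, seq[start:start + read_len]))
    return reads

# Two toy "isolate genomes" mixed at 3:1 abundance.
genomes = {"genomeA": "ACGT" * 50, "genomeB": "TTGGCCAA" * 25}
reads = simulate_metagenome(genomes, {"genomeA": 3, "genomeB": 1},
                            n_reads=100, read_len=20)
```

Keeping the origin label with each read is what makes fidelity measurable: a contig's or bin's predicted phylogenetic origin can be compared directly to the known source genomes of its reads.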
Roberts, Megan C; Clyne, Mindy; Kennedy, Amy E; Chambers, David A; Khoury, Muin J
2017-10-26
Purpose Implementation science offers methods to evaluate the translation of genomic medicine research into practice. The extent to which the National Institutes of Health (NIH) human genomics grant portfolio includes implementation science is unknown. This brief report's objective is to describe recently funded implementation science studies in genomic medicine in the NIH grant portfolio, and identify remaining gaps. Methods We identified investigator-initiated NIH research grants on implementation science in genomic medicine (funding initiated 2012-2016). A codebook was adapted from the literature, three authors coded grants, and descriptive statistics were calculated for each code. Results Forty-two grants fit the inclusion criteria (~1.75% of investigator-initiated genomics grants). The majority of included grants proposed qualitative and/or quantitative methods with cross-sectional study designs, and described clinical settings and primarily white, non-Hispanic study populations. Most grants were in oncology and examined genetic testing for risk assessment. Finally, grants lacked the use of implementation science frameworks, and most examined uptake of genomic medicine and/or assessed patient-centeredness. Conclusion We identified large gaps in implementation science studies in genomic medicine in the funded NIH portfolio over the past 5 years. To move the genomics field forward, investigator-initiated research grants should employ rigorous implementation science methods within diverse settings and populations. Genetics in Medicine advance online publication, 26 October 2017; doi:10.1038/gim.2017.180.
A public resource facilitating clinical use of genomes
Ball, Madeleine P.; Thakuria, Joseph V.; Zaranek, Alexander Wait; Clegg, Tom; Rosenbaum, Abraham M.; Wu, Xiaodi; Angrist, Misha; Bhak, Jong; Bobe, Jason; Callow, Matthew J.; Cano, Carlos; Chou, Michael F.; Chung, Wendy K.; Douglas, Shawn M.; Estep, Preston W.; Gore, Athurva; Hulick, Peter; Labarga, Alberto; Lee, Je-Hyuk; Lunshof, Jeantine E.; Kim, Byung Chul; Kim, Jong-Il; Li, Zhe; Murray, Michael F.; Nilsen, Geoffrey B.; Peters, Brock A.; Raman, Anugraha M.; Rienhoff, Hugh Y.; Robasky, Kimberly; Wheeler, Matthew T.; Vandewege, Ward; Vorhaus, Daniel B.; Yang, Joyce L.; Yang, Luhan; Aach, John; Ashley, Euan A.; Drmanac, Radoje; Kim, Seong-Jin; Li, Jin Billy; Peshkin, Leonid; Seidman, Christine E.; Seo, Jeong-Sun; Zhang, Kun; Rehm, Heidi L.; Church, George M.
2012-01-01
Rapid advances in DNA sequencing promise to enable new diagnostics and individualized therapies. Achieving personalized medicine, however, will require extensive research on highly reidentifiable, integrated datasets of genomic and health information. To assist with this, participants in the Personal Genome Project choose to forgo privacy via our institutional review board-approved “open consent” process. The contribution of public data and samples facilitates both scientific discovery and standardization of methods. We present our findings after enrollment of more than 1,800 participants, including whole-genome sequencing of 10 pilot participant genomes (the PGP-10). We introduce the Genome-Environment-Trait Evidence (GET-Evidence) system. This tool automatically processes genomes and prioritizes both published and novel variants for interpretation. In the process of reviewing the presumed healthy PGP-10 genomes, we find numerous literature references implying serious disease. Although it is sometimes impossible to rule out a late-onset effect, stringent evidence requirements can address the high rate of incidental findings. To that end we develop a peer production system for recording and organizing variant evaluations according to standard evidence guidelines, creating a public forum for reaching consensus on interpretation of clinically relevant variants. Genome analysis becomes a two-step process: using a prioritized list to record variant evaluations, then automatically sorting reviewed variants using these annotations. Genome data, health and trait information, participant samples, and variant interpretations are all shared in the public domain—we invite others to review our results using our participant samples and contribute to our interpretations. We offer our public resource and methods to further personalized medical research. PMID:22797899
Apollo: a sequence annotation editor
Lewis, SE; Searle, SMJ; Harris, N; Gibson, M; Iyer, V; Richter, J; Wiel, C; Bayraktaroglu, L; Birney, E; Crosby, MA; Kaminker, JS; Matthews, BB; Prochnik, SE; Smith, CD; Tupy, JL; Rubin, GM; Misra, S; Mungall, CJ; Clamp, ME
2002-01-01
The well-established inaccuracy of purely computational methods for annotating genome sequences necessitates an interactive tool to allow biological experts to refine these approximations by viewing and independently evaluating the data supporting each annotation. Apollo was developed to meet this need, enabling curators to inspect genome annotations closely and edit them. FlyBase biologists successfully used Apollo to annotate the Drosophila melanogaster genome and it is increasingly being used as a starting point for the development of customized annotation editing tools for other genome projects. PMID:12537571
Quantifying Genome Editing Outcomes at Endogenous Loci using SMRT Sequencing
Clark, Joseph; Punjya, Niraj; Sebastiano, Vittorio; Bao, Gang; Porteus, Matthew H
2014-01-01
Targeted genome editing with engineered nucleases has transformed the ability to introduce precise sequence modifications at almost any site within the genome. A major obstacle to probing the efficiency and consequences of genome editing is that no existing method enables the frequency of different editing events to be simultaneously measured across a cell population at any endogenous genomic locus. We have developed a novel method for quantifying individual genome editing outcomes at any site of interest using single molecule real time (SMRT) DNA sequencing. We show that this approach can be applied at various loci, using multiple engineered nuclease platforms including TALENs, RNA-guided endonucleases (CRISPR/Cas9), and ZFNs, and in different cell lines to identify conditions and strategies in which the desired engineering outcome has occurred. This approach facilitates the evaluation of new gene editing technologies and permits sensitive quantification of editing outcomes in almost every experimental system used. PMID:24685129
Orlando, Lori A.; Sperber, Nina R.; Voils, Corrine; Nichols, Marshall; Myers, Rachel A.; Wu, R. Ryanne; Rakhra-Burris, Tejinder; Levy, Kenneth D.; Levy, Mia; Pollin, Toni I.; Guan, Yue; Horowitz, Carol R.; Ramos, Michelle; Kimmel, Stephen E.; McDonough, Caitrin W.; Madden, Ebony B.; Damschroder, Laura J.
2017-01-01
Purpose Implementation research provides a structure for evaluating the clinical integration of genomic medicine interventions. This paper describes the Implementing GeNomics In PracTicE (IGNITE) Network’s efforts to promote: 1) a broader understanding of genomic medicine implementation research; and 2) the sharing of knowledge generated in the network. Methods To facilitate this goal, the IGNITE Network Common Measures Working Group (CMG) members adopted the Consolidated Framework for Implementation Research (CFIR) to guide their approach to: identifying constructs and measures relevant to evaluating genomic medicine as a whole, standardizing data collection across projects, and combining data in a centralized resource for cross-network analyses. Results CMG identified ten high-priority CFIR constructs as important for genomic medicine. Of those, eight did not have standardized measurement instruments. Therefore, we developed four survey tools to address this gap. In addition, we identified seven high-priority constructs related to patients, families, and communities that did not map to CFIR constructs. Both sets of constructs were combined to create a draft genomic medicine implementation model. Conclusion We developed processes to identify constructs deemed valuable for genomic medicine implementation and codified them in a model. These resources are freely available to facilitate knowledge generation and sharing across the field. PMID:28914267
Orlando, Lori A; Sperber, Nina R; Voils, Corrine; Nichols, Marshall; Myers, Rachel A; Wu, R Ryanne; Rakhra-Burris, Tejinder; Levy, Kenneth D; Levy, Mia; Pollin, Toni I; Guan, Yue; Horowitz, Carol R; Ramos, Michelle; Kimmel, Stephen E; McDonough, Caitrin W; Madden, Ebony B; Damschroder, Laura J
2018-06-01
Purpose Implementation research provides a structure for evaluating the clinical integration of genomic medicine interventions. This paper describes the Implementing Genomics in Practice (IGNITE) Network's efforts to promote (i) a broader understanding of genomic medicine implementation research and (ii) the sharing of knowledge generated in the network. Methods To facilitate this goal, the IGNITE Network Common Measures Working Group (CMG) members adopted the Consolidated Framework for Implementation Research (CFIR) to guide its approach to identifying constructs and measures relevant to evaluating genomic medicine as a whole, standardizing data collection across projects, and combining data in a centralized resource for cross-network analyses. Results CMG identified 10 high-priority CFIR constructs as important for genomic medicine. Of those, eight did not have standardized measurement instruments. Therefore, we developed four survey tools to address this gap. In addition, we identified seven high-priority constructs related to patients, families, and communities that did not map to CFIR constructs. Both sets of constructs were combined to create a draft genomic medicine implementation model. Conclusion We developed processes to identify constructs deemed valuable for genomic medicine implementation and codified them in a model. These resources are freely available to facilitate knowledge generation and sharing across the field.
Maize - GO annotation methods, evaluation, and review (Maize-GAMER)
USDA-ARS?s Scientific Manuscript database
Making a genome sequence accessible and useful involves three basic steps: genome assembly, structural annotation, and functional annotation. The quality of data generated at each step influences the accuracy of inferences that can be made, with high-quality analyses producing better datasets resultin...
Jorjani, Hadi; Zavolan, Mihaela
2014-04-01
Accurate identification of transcription start sites (TSSs) is an essential step in the analysis of transcription regulatory networks. In higher eukaryotes, the capped analysis of gene expression technology enabled comprehensive annotation of TSSs in genomes such as those of mice and humans. In bacteria, an equivalent approach, termed differential RNA sequencing (dRNA-seq), has recently been proposed, but the application of this approach to a large number of genomes is hindered by the paucity of computational analysis methods. With few exceptions, when the method has been used, annotation of TSSs has been largely done manually. In this work, we present a computational method called 'TSSer' that enables the automatic inference of TSSs from dRNA-seq data. The method rests on a probabilistic framework for identifying genomic positions that are both preferentially enriched in the dRNA-seq data and preferentially captured relative to neighboring genomic regions. Evaluating our approach for TSS calling on several publicly available datasets, we find that TSSer achieves high consistency with the curated lists of annotated TSSs, but identifies many additional TSSs. Therefore, TSSer can accelerate genome-wide identification of TSSs in bacterial genomes and can aid in further characterization of bacterial transcription regulatory networks. TSSer is freely available under GPL license at http://www.clipz.unibas.ch/TSSer/index.php
swga: a primer design toolkit for selective whole genome amplification.
Clarke, Erik L; Sundararaman, Sesh A; Seifert, Stephanie N; Bushman, Frederic D; Hahn, Beatrice H; Brisson, Dustin
2017-07-15
Population genomic analyses are often hindered by difficulties in obtaining sufficient numbers of genomes for analysis by DNA sequencing. Selective whole-genome amplification (SWGA) provides an efficient approach to amplify microbial genomes from complex backgrounds for sequence acquisition. However, the process of designing sets of primers for this method has many degrees of freedom and would benefit from an automated process to evaluate the vast number of potential primer sets. Here, we present swga, a program that identifies primer sets for SWGA and evaluates them for efficiency and selectivity. We used swga to design and test primer sets for the selective amplification of Wolbachia pipientis genomic DNA from infected Drosophila melanogaster and Mycobacterium tuberculosis from human blood. We identify primer sets that successfully amplify each against their backgrounds and describe a general method for using swga for arbitrary targets. In addition, we describe characteristics of primer sets that correlate with successful amplification, and present guidelines for implementation of SWGA to detect new targets. Source code and documentation are freely available on https://www.github.com/eclarke/swga. The program is implemented in Python and C and licensed under the GNU Public License. Contact: ecl@mail.med.upenn.edu. Supplementary data are available at Bioinformatics online. © The Author(s) 2017. Published by Oxford University Press.
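The core selectivity idea described above, preferring primers that bind densely in the target genome and sparsely in the background, can be illustrated with a toy scorer. This is a hypothetical sketch (exact one-strand matching, a simple binding-site density ratio), not the actual swga scoring function, and the sequences are made-up stand-ins:

```python
def count_sites(genome, primer):
    """Count (possibly overlapping) exact one-strand binding sites."""
    hits, i = 0, genome.find(primer)
    while i != -1:
        hits += 1
        i = genome.find(primer, i + 1)
    return hits

def selectivity(primer, target, background):
    """Binding-site density ratio: values > 1 favor the target genome."""
    t = count_sites(target, primer) / len(target)
    b = count_sites(background, primer) / len(background)
    return t / b if b else float("inf")

# hypothetical toy sequences standing in for target and background genomes
target = "ATGCGC" * 100       # primer-rich "target"
background = "ATTTTT" * 1000  # primer-poor "background"
print(selectivity("GCGC", target, background))  # inf: no background sites
```

A real SWGA workflow would additionally screen primers for melting temperature and cross-primer compatibility; the ratio above only captures the target-vs-background contrast.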
Simultaneous gene finding in multiple genomes.
König, Stefanie; Romoth, Lars W; Gerischer, Lizzy; Stanke, Mario
2016-11-15
As the tree of life is populated with sequenced genomes ever more densely, the new challenge is the accurate and consistent annotation of entire clades of genomes. We address this problem with a new approach to comparative gene finding that takes a multiple genome alignment of closely related species and simultaneously predicts the location and structure of protein-coding genes in all input genomes, thereby exploiting negative selection and sequence conservation. The model prefers potential gene structures in the different genomes that are in agreement with each other, or, if not, where the exon gains and losses are plausible given the species tree. We formulate the multi-species gene finding problem as a binary labeling problem on a graph. The resulting optimization problem is NP-hard, but can be efficiently approximated using a subgradient-based dual decomposition approach. The proposed method was tested on whole-genome alignments of 12 vertebrate and 12 Drosophila species. The accuracy was evaluated for human, mouse and Drosophila melanogaster and compared to competing methods. Results suggest that our method is well-suited for annotation of (a large number of) genomes of closely related species within a clade, in particular, when RNA-Seq data are available for many of the genomes. The transfer of existing annotations from one genome to another via the genome alignment is more accurate than previous approaches that are based on protein-spliced alignments, when the genomes are at close to medium distances. The method is implemented in C++ as part of Augustus and available open source at http://bioinf.uni-greifswald.de/augustus/. Contact: stefaniekoenig@ymail.com or mario.stanke@uni-greifswald.de. Supplementary information: Supplementary data are available at Bioinformatics online. © The Author 2016. Published by Oxford University Press. All rights reserved. For Permissions, please e-mail: journals.permissions@oup.com.
Godfroy, Olivier; Peters, Akira F; Coelho, Susana M; Cock, J Mark
2015-12-01
Ectocarpus has emerged as a model organism for the brown algae and a broad range of genetic and genomic resources are being generated for this species. The aim of the work presented here was to evaluate two mutagenesis protocols based on ultraviolet irradiation and ethyl methanesulphonate treatment using genome resequencing to measure the number, type and distribution of mutations generated by the two methods. Ultraviolet irradiation generated a greater number of genetic lesions than ethyl methanesulphonate treatment, with more than 400 mutations being detected in the genome of the mutagenised individual. This study therefore confirms that the ultraviolet mutagenesis protocol is suitable for approaches that require a high density of mutations, such as saturation mutagenesis or Targeting Induced Local Lesions in Genomes (TILLING). Copyright © 2015 Elsevier B.V. All rights reserved.
Nielsen, E E; Morgan, J A T; Maher, S L; Edson, J; Gauthier, M; Pepperell, J; Holmes, B J; Bennett, M B; Ovenden, J R
2017-05-01
Archived specimens are highly valuable sources of DNA for retrospective genetic/genomic analysis. However, often limited effort has been made to evaluate and optimize extraction methods, which may be crucial for downstream applications. Here, we assessed and optimized the usefulness of abundant archived skeletal material from sharks as a source of DNA for temporal genomic studies. Six different methods for DNA extraction, encompassing two different commercial kits and three different protocols, were applied to material, so-called bio-swarf, from contemporary and archived jaws and vertebrae of tiger sharks (Galeocerdo cuvier). Protocols were compared for DNA yield and quality using a qPCR approach. For jaw swarf, all methods provided relatively high DNA yield and quality, while large differences in yield between protocols were observed for vertebrae. Similar results were obtained from samples of white shark (Carcharodon carcharias). Application of the optimized methods to 38 museum and private angler trophy specimens dating back to 1912 yielded sufficient DNA for downstream genomic analysis for 68% of the samples. No clear relationships between age of samples, DNA quality and quantity were observed, likely reflecting different preparation and storage methods for the trophies. Trial sequencing of DNA capture genomic libraries using 20 000 baits revealed that a significant proportion of captured sequences were derived from tiger sharks. This study demonstrates that archived shark jaws and vertebrae are potential high-yield sources of DNA for genomic-scale analysis. It also highlights that even for similar tissue types, a careful evaluation of extraction protocols can vastly improve DNA yield. © 2016 John Wiley & Sons Ltd.
Distilled single-cell genome sequencing and de novo assembly for sparse microbial communities.
Taghavi, Zeinab; Movahedi, Narjes S; Draghici, Sorin; Chitsaz, Hamidreza
2013-10-01
Identification of every single genome present in a microbial sample is an important and challenging task with crucial applications. It is challenging because there are typically millions of cells in a microbial sample, the vast majority of which elude cultivation. The most accurate method to date is exhaustive single-cell sequencing using multiple displacement amplification, which is simply intractable for a large number of cells. However, there is hope for breaking this barrier, as the number of different cell types with distinct genome sequences is usually much smaller than the number of cells. Here, we present a novel divide and conquer method to sequence and de novo assemble all distinct genomes present in a microbial sample with a sequencing cost and computational complexity proportional to the number of genome types, rather than the number of cells. The method is implemented in a tool called Squeezambler. We evaluated Squeezambler on simulated data. The proposed divide and conquer method successfully reduces the cost of sequencing in comparison with the naïve exhaustive approach. Squeezambler and datasets are available at http://compbio.cs.wayne.edu/software/squeezambler/.
Fast and Accurate Approximation to Significance Tests in Genome-Wide Association Studies
Zhang, Yu; Liu, Jun S.
2011-01-01
Genome-wide association studies commonly involve simultaneous tests of millions of single nucleotide polymorphisms (SNP) for disease association. The SNPs in nearby genomic regions, however, are often highly correlated due to linkage disequilibrium (LD, a genetic term for correlation). Simple Bonferroni correction for multiple comparisons is therefore too conservative. Permutation tests, which are often employed in practice, are both computationally expensive for genome-wide studies and limited in their scopes. We present an accurate and computationally efficient method, based on Poisson de-clumping heuristics, for approximating genome-wide significance of SNP associations. Compared with permutation tests and other multiple comparison adjustment approaches, our method computes the most accurate and robust p-value adjustments for millions of correlated comparisons within seconds. We demonstrate analytically that the accuracy and the efficiency of our method are nearly independent of the sample size, the number of SNPs, and the scale of p-values to be adjusted. In addition, our method can be easily adopted to estimate false discovery rate. When applied to genome-wide SNP datasets, we observed highly variable p-value adjustment results evaluated from different genomic regions. The variation in adjustments along the genome, however, is well conserved between the European and the African populations. The p-value adjustments are significantly correlated with LD among SNPs, recombination rates, and SNP densities. Given the large variability of sequence features in the genome, we further discuss a novel approach of using SNP-specific (local) thresholds to detect genome-wide significant associations. This article has supplementary material online. PMID:22140288
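The contrast the abstract draws between Bonferroni correction and permutation tests can be sketched in a few lines. This is a generic illustration with hypothetical numbers, not the authors' Poisson de-clumping method: Bonferroni scales a p-value by the number of tests regardless of correlation, while a permutation-based family-wise adjustment is calibrated on the observed LD structure, at far greater computational cost:

```python
def bonferroni(p, m):
    """Bonferroni-adjusted p-value: multiply by the number of tests.
    Overly conservative when nearby SNPs are correlated through LD."""
    return min(1.0, m * p)

def permutation_adjusted(p_obs, null_min_ps):
    """Family-wise adjusted p-value: fraction of permutation replicates
    whose minimum p-value is at least as extreme as the observed one.
    The +1 terms give the standard bias-corrected estimator."""
    hits = sum(1 for q in null_min_ps if q <= p_obs)
    return (hits + 1) / (len(null_min_ps) + 1)

m = 1_000_000                # hypothetical number of SNP tests
p_obs = 1e-8                 # hypothetical best single-SNP p-value
print(bonferroni(p_obs, m))  # ~0.01

# hypothetical minima of p-values from 999 genotype-label permutations
null_min_ps = [2e-7] * 999
print(permutation_adjusted(p_obs, null_min_ps))  # 0.001
```

With correlated tests the permutation adjustment is typically less conservative than Bonferroni, which is exactly the gap the paper's fast approximation aims to close without running the permutations.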
SIDR: simultaneous isolation and parallel sequencing of genomic DNA and total RNA from single cells.
Han, Kyung Yeon; Kim, Kyu-Tae; Joung, Je-Gun; Son, Dae-Soon; Kim, Yeon Jeong; Jo, Areum; Jeon, Hyo-Jeong; Moon, Hui-Sung; Yoo, Chang Eun; Chung, Woosung; Eum, Hye Hyeon; Kim, Sangmin; Kim, Hong Kwan; Lee, Jeong Eon; Ahn, Myung-Ju; Lee, Hae-Ock; Park, Donghyun; Park, Woong-Yang
2018-01-01
Simultaneous sequencing of the genome and transcriptome at the single-cell level is a powerful tool for characterizing genomic and transcriptomic variation and revealing correlative relationships. However, it remains technically challenging to analyze both the genome and transcriptome in the same cell. Here, we report a novel method for simultaneous isolation of genomic DNA and total RNA (SIDR) from single cells, achieving high recovery rates with minimal cross-contamination, as is crucial for accurate description and integration of the single-cell genome and transcriptome. For reliable and efficient separation of genomic DNA and total RNA from single cells, the method uses hypotonic lysis to preserve nuclear lamina integrity and subsequently captures the cell lysate using antibody-conjugated magnetic microbeads. Evaluating the performance of this method using real-time PCR demonstrated that it efficiently recovered genomic DNA and total RNA. Thorough data quality assessments showed that DNA and RNA simultaneously fractionated by the SIDR method were suitable for genome and transcriptome sequencing analysis at the single-cell level. The integration of single-cell genome and transcriptome sequencing by SIDR (SIDR-seq) showed that genetic alterations, such as copy-number and single-nucleotide variations, were more accurately captured by single-cell SIDR-seq compared with conventional single-cell RNA-seq, although copy-number variations positively correlated with the corresponding gene expression levels. These results suggest that SIDR-seq is potentially a powerful tool to reveal genetic heterogeneity and phenotypic information inferred from gene expression patterns at the single-cell level. © 2018 Han et al.; Published by Cold Spring Harbor Laboratory Press.
SIDR: simultaneous isolation and parallel sequencing of genomic DNA and total RNA from single cells
Han, Kyung Yeon; Kim, Kyu-Tae; Joung, Je-Gun; Son, Dae-Soon; Kim, Yeon Jeong; Jo, Areum; Jeon, Hyo-Jeong; Moon, Hui-Sung; Yoo, Chang Eun; Chung, Woosung; Eum, Hye Hyeon; Kim, Sangmin; Kim, Hong Kwan; Lee, Jeong Eon; Ahn, Myung-Ju; Lee, Hae-Ock; Park, Donghyun; Park, Woong-Yang
2018-01-01
Simultaneous sequencing of the genome and transcriptome at the single-cell level is a powerful tool for characterizing genomic and transcriptomic variation and revealing correlative relationships. However, it remains technically challenging to analyze both the genome and transcriptome in the same cell. Here, we report a novel method for simultaneous isolation of genomic DNA and total RNA (SIDR) from single cells, achieving high recovery rates with minimal cross-contamination, as is crucial for accurate description and integration of the single-cell genome and transcriptome. For reliable and efficient separation of genomic DNA and total RNA from single cells, the method uses hypotonic lysis to preserve nuclear lamina integrity and subsequently captures the cell lysate using antibody-conjugated magnetic microbeads. Evaluating the performance of this method using real-time PCR demonstrated that it efficiently recovered genomic DNA and total RNA. Thorough data quality assessments showed that DNA and RNA simultaneously fractionated by the SIDR method were suitable for genome and transcriptome sequencing analysis at the single-cell level. The integration of single-cell genome and transcriptome sequencing by SIDR (SIDR-seq) showed that genetic alterations, such as copy-number and single-nucleotide variations, were more accurately captured by single-cell SIDR-seq compared with conventional single-cell RNA-seq, although copy-number variations positively correlated with the corresponding gene expression levels. These results suggest that SIDR-seq is potentially a powerful tool to reveal genetic heterogeneity and phenotypic information inferred from gene expression patterns at the single-cell level. PMID:29208629
Icarus: visualizer for de novo assembly evaluation.
Mikheenko, Alla; Valin, Gleb; Prjibelski, Andrey; Saveliev, Vladislav; Gurevich, Alexey
2016-11-01
Data visualization plays an increasingly important role in NGS data analysis. With advances in both sequencing and computational technologies, it has become a new bottleneck in genomics studies. Indeed, evaluation of de novo genome assemblies is one of the areas that can benefit from visualization. However, even though multiple quality assessment methods are now available, existing visualization tools are hardly suitable for this purpose. Here, we present Icarus, a novel genome visualizer for accurate assessment and analysis of genomic draft assemblies, which is based on the tool QUAST. Icarus can be used in studies where a related reference genome is available, as well as for non-model organisms. The tool is available online and as a standalone application at http://cab.spbu.ru/software/icarus. Contact: aleksey.gurevich@spbu.ru. Supplementary information: Supplementary data are available at Bioinformatics online. © The Author 2016. Published by Oxford University Press. All rights reserved. For Permissions, please e-mail: journals.permissions@oup.com.
Reference-guided de novo assembly approach improves genome reconstruction for related species.
Lischer, Heidi E L; Shimizu, Kentaro K
2017-11-10
The development of next-generation sequencing has made it possible to sequence whole genomes at a relatively low cost. However, de novo genome assemblies remain challenging due to short read length, missing data, repetitive regions, polymorphisms and sequencing errors. As more and more genomes are sequenced, reference-guided assembly approaches can be used to assist the assembly process. However, previous methods mostly focused on the assembly of other genotypes within the same species. We adapted and extended a reference-guided de novo assembly approach, which enables the usage of a related reference sequence to guide the genome assembly. In order to compare and evaluate de novo and our reference-guided de novo assembly approaches, we used a simulated data set of a repetitive and heterozygous plant genome. The extended reference-guided de novo assembly approach almost always outperforms the corresponding de novo assembly program, even when a reference of a different species is used. Similar improvements can be observed in high- and low-coverage situations. In addition, we show that a single evaluation metric, like the widely used N50 length, is not enough to properly rate assemblies, as it does not always point to the best assembly evaluated with other criteria. Therefore, we used the summed z-scores of 36 different statistics to evaluate the assemblies. The combination of reference mapping and de novo assembly provides a powerful tool to improve genome reconstruction by integrating information of a related genome. Our extension of the reference-guided de novo assembly approach enables the application of this strategy not only within but also between related species. Finally, the evaluation of genome assemblies is often not straightforward, as the truth is not known. Thus, one should always use a combination of evaluation metrics, which not only try to assess the continuity but also the accuracy of an assembly.
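The multi-metric evaluation idea above, collapsing many assembly statistics into summed z-scores, can be sketched in a few lines of Python. The metric names and values below are hypothetical toy data, and the sign convention (flipping metrics where smaller is better, such as misassembly counts) is an assumption rather than the paper's exact recipe:

```python
from statistics import mean, stdev

def summed_z_scores(assemblies, higher_is_better):
    """Rank assemblies by summing per-metric z-scores.

    assemblies: {name: {metric: value}}
    higher_is_better: {metric: bool}; the z-score sign is flipped for
    metrics (e.g. misassembly counts) where smaller values are better.
    """
    metrics = next(iter(assemblies.values())).keys()
    scores = {name: 0.0 for name in assemblies}
    for m in metrics:
        vals = [a[m] for a in assemblies.values()]
        mu, sd = mean(vals), stdev(vals)
        if sd == 0:
            continue  # uninformative metric: all assemblies tie
        for name, a in assemblies.items():
            z = (a[m] - mu) / sd
            scores[name] += z if higher_is_better[m] else -z
    return scores

# hypothetical toy data with three of many possible statistics
assemblies = {
    "denovo": {"N50": 20_000, "genes_found": 900, "misassemblies": 40},
    "guided": {"N50": 35_000, "genes_found": 950, "misassemblies": 25},
}
flags = {"N50": True, "genes_found": True, "misassemblies": False}
scores = summed_z_scores(assemblies, flags)
print(max(scores, key=scores.get))  # "guided" wins on all three toy metrics
```

Standardizing each statistic before summing keeps large-valued metrics such as N50 from dominating count-valued ones such as the number of misassemblies.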
Methods of Genomic Competency Integration in Practice
Jenkins, Jean; Calzone, Kathleen A.; Caskey, Sarah; Culp, Stacey; Weiner, Marsha; Badzek, Laurie
2015-01-01
Purpose Genomics is increasingly relevant to health care, necessitating support for nurses to incorporate genomic competencies into practice. The primary aim of this project was to develop, implement, and evaluate a year-long genomic education intervention that trained, supported, and supervised institutional administrator and educator champion dyads to increase nursing capacity to integrate genomics, through assessments of program satisfaction and the outcomes achieved by institutions. Design Longitudinal study of 23 Magnet Recognition Program® Hospitals (21 intervention, 2 controls) participating in a 1-year new competency integration effort aimed at increasing genomic nursing competency and overcoming barriers to genomics integration in practice. Methods Champion dyads underwent genomic training consisting of one in-person kick-off training meeting followed by monthly education webinars. Champion dyads designed institution-specific action plans detailing objectives, methods or strategies used to engage and educate nursing staff, timeline for implementation, and outcomes achieved. Action plans focused on a minimum of seven genomic priority areas: champion dyad personal development; practice assessment; policy content assessment; staff knowledge needs assessment; staff development; plans for integration; and anticipated obstacles and challenges. Action plans were updated quarterly, outlining progress made as well as inclusion of new methods or strategies. Progress was validated through virtual site visits with the champion dyads and chief nursing officers. Descriptive data were collected on all strategies or methods utilized, and timeline for achievement. Descriptive data were analyzed using content analysis. Findings The complexity of the competency content and the uniqueness of social systems and infrastructure resulted in significant variation of champion dyad interventions.
Conclusions Nursing champions can facilitate change in genomic nursing capacity through varied strategies but require substantial training in order to design and implement interventions. Clinical Relevance Genomics is critical to the practice of all nurses. There is a great opportunity and interest to address genomic knowledge deficits in the practicing nurse workforce as a strategy to improve patient outcomes. Exemplars of champion dyad interventions designed to increase nursing capacity focus on improving education, policy, and healthcare services. PMID:25808828
Detection of DNA Methylation by Whole-Genome Bisulfite Sequencing.
Li, Qing; Hermanson, Peter J; Springer, Nathan M
2018-01-01
DNA methylation plays an important role in the regulation of the expression of transposons and genes. Various methods have been developed to assay DNA methylation levels. Bisulfite sequencing is considered to be the "gold standard" for single-base resolution measurement of DNA methylation levels. Coupled with next-generation sequencing, whole-genome bisulfite sequencing (WGBS) allows DNA methylation to be evaluated at a genome-wide scale. Here, we describe a protocol for WGBS in plant species with large genomes. This protocol has been successfully applied to assay genome-wide DNA methylation levels in maize and barley. This protocol has also been successfully coupled with sequence capture technology to assay DNA methylation levels in a targeted set of genomic regions.
Wan, Zhiyu; Vorobeychik, Yevgeniy; Kantarcioglu, Murat; Malin, Bradley
2017-07-26
Genomic data is increasingly collected by a wide array of organizations. As such, there is a growing demand to make summary information about such collections available more widely. However, over the past decade, a series of investigations have shown that attacks, rooted in statistical inference methods, can be applied to discern the presence of a known individual's DNA sequence in the pool of subjects. Recently, it was shown that the Beacon Project of the Global Alliance for Genomics and Health, a web service for querying about the presence (or absence) of a specific allele, was vulnerable. The Integrating Data for Analysis, Anonymization, and Sharing (iDASH) Center modeled a track in their third Privacy Protection Challenge on how to mitigate the Beacon vulnerability. We developed the winning solution for this track. This paper describes our computational method to optimize the tradeoff between the utility and the privacy of the Beacon service. We generalize the genomic data sharing problem beyond that which was introduced in the iDASH Challenge to be more representative of real world scenarios to allow for a more comprehensive evaluation. We then conduct a sensitivity analysis of our method with respect to several state-of-the-art methods using a dataset of 400,000 positions in Chromosome 10 for 500 individuals from Phase 3 of the 1000 Genomes Project. All methods are evaluated for utility, privacy and efficiency. Our method achieves better performance than all state-of-the-art methods, irrespective of how key factors (e.g., the allele frequency in the population, the size of the pool and utility weights) change from the original parameters of the problem. We further illustrate that it is possible for our method to exhibit subpar performance under special cases of allele query sequences. 
However, we show our method can be extended to address this issue when the query sequence is fixed and known a priori to the data custodian, so that they may plan and stage their responses accordingly. This research shows that it is possible to thwart the attack on Beacon services, without substantially altering the utility of the system, using computational methods. The method we initially developed is limited by the design of the scenario and evaluation protocol for the iDASH Challenge; however, it can be improved by allowing the data custodian to act in a staged manner.
Dessimoz, Christophe; Boeckmann, Brigitte; Roth, Alexander C J; Gonnet, Gaston H
2006-01-01
Correct orthology assignment is a critical prerequisite of numerous comparative genomics procedures, such as function prediction, construction of phylogenetic species trees and genome rearrangement analysis. We present an algorithm for the detection of non-orthologs that arise by mistake in current orthology classification methods based on genome-specific best hits, such as the COGs database. The algorithm works with pairwise distance estimates, rather than computationally expensive and error-prone tree-building methods. The accuracy of the algorithm is evaluated through verification of the distribution of predicted cases, case-by-case phylogenetic analysis and comparisons with predictions from other projects using independent methods. Our results show that a very significant fraction of the COG groups include non-orthologs: using conservative parameters, the algorithm detects non-orthology in a third of all COG groups. Consequently, sequence analysis sensitive to correct orthology assignments will greatly benefit from these findings.
Effect of genotyped cows in the reference population on the genomic evaluation of Holstein cattle.
Uemoto, Y; Osawa, T; Saburi, J
2017-03-01
This study evaluated the dependence of reliability and prediction bias on the prediction method, the contribution of including animals (bulls or cows), and the genetic relatedness, when including genotyped cows in the progeny-tested bull reference population. We performed genomic evaluation using a Japanese Holstein population, and assessed the accuracy of genomic enhanced breeding value (GEBV) for three production traits and 13 linear conformation traits. A total of 4564 animals for production traits and 4172 animals for conformation traits were genotyped using Illumina BovineSNP50 array. Single- and multi-step methods were compared for predicting GEBV in genotyped bull-only and genotyped bull-cow reference populations. No large differences in realized reliability and regression coefficient were found between the two reference populations; however, a slight difference was found between the two methods for production traits. The accuracy of GEBV determined by single-step method increased slightly when genotyped cows were included in the bull reference population, but decreased slightly by multi-step method. A validation study was used to evaluate the accuracy of GEBV when 800 additional genotyped bulls (POPbull) or cows (POPcow) were included in the base reference population composed of 2000 genotyped bulls. The realized reliabilities of POPbull were higher than those of POPcow for all traits. For the gain of realized reliability over the base reference population, the average ratios of POPbull gain to POPcow gain for production traits and conformation traits were 2.6 and 7.2, respectively, and the ratios depended on heritabilities of the traits. For regression coefficient, no large differences were found between the results for POPbull and POPcow. Another validation study was performed to investigate the effect of genetic relatedness between cows and bulls in the reference and test populations. 
The effect of genetic relationship among bulls in the reference population was also assessed. The results showed that it is important to account for relatedness among bulls in the reference population. Our studies indicate that the prediction method, the contribution ratio of including animals, and genetic relatedness could affect the prediction accuracy in genomic evaluation of Holstein cattle, when including genotyped cows in the reference population.
Mesquita, R A; Anzai, E K; Oliveira, R N; Nunes, F D
2001-01-01
There are several protocols reported in the literature for the extraction of genomic DNA from formalin-fixed, paraffin-embedded samples. Genomic DNA is used in molecular analyses, including PCR. This study compares three different methods for the extraction of genomic DNA from formalin-fixed, paraffin-embedded (inflammatory fibrous hyperplasia) and non-formalin-fixed (normal oral mucosa) samples: phenol with enzymatic digestion, and silica with and without enzymatic digestion. DNA amplification by PCR was carried out with primers for exon 7 of human keratin type 14. Amplicons were analyzed by electrophoresis in an 8% polyacrylamide gel with 5% glycerol, followed by silver-staining visualization. The phenol/enzymatic digestion and the silica/enzymatic digestion methods provided amplicons from both tissue samples. The methods described are a potential aid in establishing histopathologic diagnoses and in retrospective studies of archival paraffin-embedded samples.
Generation of Knock-in Mouse by Genome Editing.
Fujii, Wataru
2017-01-01
Knock-in mice are useful for evaluating endogenous gene expressions and functions in vivo. Instead of the conventional gene-targeting method using embryonic stem cells, an exogenous DNA sequence can be inserted into the target locus in the zygote using genome editing technology. In this chapter, I describe the generation of epitope-tagged mice using engineered endonuclease and single-stranded oligodeoxynucleotide through the mouse zygote as an example of how to generate a knock-in mouse by genome editing.
Etard, Christelle; Joshi, Swarnima; Stegmaier, Johannes; Mikut, Ralf; Strähle, Uwe
2017-12-01
A bottleneck in CRISPR/Cas9 genome editing is variable efficiencies of in silico-designed gRNAs. We evaluated the sensitivity of the TIDE method (Tracking of Indels by DEcomposition) introduced by Brinkman et al. in 2014 for assessing the cutting efficiencies of gRNAs in zebrafish. We show that this simple method, which involves bulk polymerase chain reaction amplification and Sanger sequencing, is highly effective in tracking well-performing gRNAs in pools of genomic DNA derived from injected embryos. The method is equally effective for tracing INDELs in heterozygotes.
Olson, Nathan D; Zook, Justin M; Morrow, Jayne B; Lin, Nancy J
2017-01-01
High sensitivity methods such as next generation sequencing and polymerase chain reaction (PCR) are adversely impacted by organismal and DNA contaminants. Current methods for detecting contaminants in microbial materials (genomic DNA and cultures) are not sensitive enough and require either a known or culturable contaminant. Whole genome sequencing (WGS) is a promising approach for detecting contaminants due to its sensitivity and lack of need for a priori assumptions about the contaminant. Prior to applying WGS, we must first understand its limitations for detecting contaminants and potential for false positives. Herein we demonstrate and characterize a WGS-based approach to detect organismal contaminants using an existing metagenomic taxonomic classification algorithm. Simulated WGS datasets from ten genera as individuals and binary mixtures of eight organisms at varying ratios were analyzed to evaluate the role of contaminant concentration and taxonomy on detection. For the individual genomes the false positive contaminants reported depended on the genus, with Staphylococcus , Escherichia , and Shigella having the highest proportion of false positives. For nearly all binary mixtures the contaminant was detected in the in-silico datasets at the equivalent of 1 in 1,000 cells, though F. tularensis was not detected in any of the simulated contaminant mixtures and Y. pestis was only detected at the equivalent of one in 10 cells. Once a WGS method for detecting contaminants is characterized, it can be applied to evaluate microbial material purity, in efforts to ensure that contaminants are characterized in microbial materials used to validate pathogen detection assays, generate genome assemblies for database submission, and benchmark sequencing methods.
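The abstract above describes flagging organismal contaminants from the output of a taxonomic classifier, with detection demonstrated down to the equivalent of 1 in 1,000 cells. A minimal sketch of the flagging step, assuming hypothetical per-genus read counts and a read-fraction threshold (the paper's actual classifier output and thresholds are not given in the abstract):

```python
# Sketch: given per-genus read counts from a taxonomic classifier, report
# genera other than the expected one whose read fraction meets a threshold
# as candidate contaminants. Counts, genus names, and the 1e-3 threshold
# are hypothetical illustrations.

def flag_contaminants(read_counts, expected, min_fraction=1e-3):
    """Return {genus: read fraction} for candidate contaminants."""
    total = sum(read_counts.values())
    return {genus: count / total
            for genus, count in read_counts.items()
            if genus != expected and count / total >= min_fraction}

counts = {"Yersinia": 99000, "Francisella": 900, "Escherichia": 100}
hits = flag_contaminants(counts, expected="Yersinia")
# Francisella (0.9%) and Escherichia (0.1%) are flagged as candidates.
```

In practice, the paper's false-positive observations suggest such thresholds must be tuned per genus, since some taxa (e.g. Escherichia/Shigella) are frequently misclassified.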
Fu, Liezhen; Wen, Luan; Luu, Nga; Shi, Yun-Bo
2016-01-01
Genome editing with designer nucleases such as TALEN and CRISPR/Cas enzymes has broad applications. Delivery of these designer nucleases into organisms induces various genetic mutations including deletions, insertions and nucleotide substitutions. Characterizing those mutations is critical for evaluating the efficacy and specificity of targeted genome editing. While a number of methods have been developed to identify the mutations, none other than sequencing allows the identification of the most desired mutations, i.e., out-of-frame insertions/deletions that disrupt genes. Here we report a simple and efficient method to visualize and quantify the efficiency of genomic mutations induced by genome-editing. Our approach is based on the expression of a two-color fusion protein in a vector that allows the insertion of the edited region in the genome in between the two color moieties. We show that our approach not only easily identifies developing animals with desired mutations but also efficiently quantifies the mutation rate in vivo. Furthermore, by using LacZα and GFP as the color moieties, our approach can even eliminate the need for a fluorescent microscope, allowing the analysis with simple bright field visualization. Such an approach will greatly simplify the screen for effective genome-editing enzymes and identify the desired mutant cells/animals. PMID:27748423
SvABA: genome-wide detection of structural variants and indels by local assembly.
Wala, Jeremiah A; Bandopadhayay, Pratiti; Greenwald, Noah F; O'Rourke, Ryan; Sharpe, Ted; Stewart, Chip; Schumacher, Steve; Li, Yilong; Weischenfeldt, Joachim; Yao, Xiaotong; Nusbaum, Chad; Campbell, Peter; Getz, Gad; Meyerson, Matthew; Zhang, Cheng-Zhong; Imielinski, Marcin; Beroukhim, Rameen
2018-04-01
Structural variants (SVs), including small insertion and deletion variants (indels), are challenging to detect through standard alignment-based variant calling methods. Sequence assembly offers a powerful approach to identifying SVs, but is difficult to apply at scale genome-wide for SV detection due to its computational complexity and the difficulty of extracting SVs from assembly contigs. We describe SvABA, an efficient and accurate method for detecting SVs from short-read sequencing data using genome-wide local assembly with low memory and computing requirements. We evaluated SvABA's performance on the NA12878 human genome and in simulated and real cancer genomes. SvABA demonstrates superior sensitivity and specificity across a large spectrum of SVs and substantially improves detection performance for variants in the 20-300 bp range, compared with existing methods. SvABA also identifies complex somatic rearrangements with chains of short (<1000 bp) templated-sequence insertions copied from distant genomic regions. We applied SvABA to 344 cancer genomes from 11 cancer types and found that short templated-sequence insertions occur in ∼4% of all somatic rearrangements. Finally, we demonstrate that SvABA can identify sites of viral integration and cancer driver alterations containing medium-sized (50-300 bp) SVs. © 2018 Wala et al.; Published by Cold Spring Harbor Laboratory Press.
Lando, David; Stevens, Tim J; Basu, Srinjan; Laue, Ernest D
2018-01-01
Single-cell chromosome conformation capture approaches are revealing the extent of cell-to-cell variability in the organization and packaging of genomes. These single-cell methods, unlike their multi-cell counterparts, allow straightforward computation of realistic chromosome conformations that may be compared and combined with other, independent, techniques to study 3D structure. Here we discuss how single-cell Hi-C and subsequent 3D genome structure determination allows comparison with data from microscopy. We then carry out a systematic evaluation of recently published single-cell Hi-C datasets to establish a computational approach for the evaluation of single-cell Hi-C protocols. We show that the calculation of genome structures provides a useful tool for assessing the quality of single-cell Hi-C data because it requires a self-consistent network of interactions, relating to the underlying 3D conformation, with few errors, as well as sufficient longer-range cis- and trans-chromosomal contacts.
Ranking of Prokaryotic Genomes Based on Maximization of Sortedness of Gene Lengths
Bolshoy, A; Salih, B; Cohen, I; Tatarinova, T
2014-01-01
How variations in gene length (some genes become longer than their predecessors, while others become shorter, and the sizes of these fractions differ randomly from organism to organism) depend on organismal evolution and adaptation is still an open question. We propose to rank genomes according to the lengths of their genes, and then find associations between genome rank and various properties, such as growth temperature, nucleotide composition, and pathogenicity. This approach reveals evolutionary driving factors. The main purpose of this study is to test the effectiveness and robustness of several ranking methods. The chosen evaluation criterion is the overall sortedness of the data. We have demonstrated that all considered methods give consistent results, and that Bubble Sort and Simulated Annealing achieve the highest sortedness. Bubble Sort is also considerably faster than Simulated Annealing. PMID:26146586
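The ranking-by-sortedness idea above can be sketched in a few lines. The abstract does not define the sortedness measure or ranking key precisely, so this toy uses the fraction of correctly ordered adjacent pairs as a stand-in sortedness metric and mean gene length as the key; both are illustrative assumptions:

```python
# Sketch: rank toy "genomes" by a gene-length summary via Bubble Sort, and
# measure sortedness. The paper's exact sortedness metric and ranking key
# are not given in the abstract; the choices below are stand-ins.

def sortedness(values):
    """Fraction of adjacent pairs that are in non-decreasing order."""
    if len(values) < 2:
        return 1.0
    ordered = sum(1 for a, b in zip(values, values[1:]) if a <= b)
    return ordered / (len(values) - 1)

def bubble_sort_rank(items, key):
    """Return items sorted by key using Bubble Sort."""
    ranked = list(items)
    n = len(ranked)
    for i in range(n):
        for j in range(n - 1 - i):
            if key(ranked[j]) > key(ranked[j + 1]):
                ranked[j], ranked[j + 1] = ranked[j + 1], ranked[j]
    return ranked

# Hypothetical genome -> gene lengths data.
genomes = {
    "org_a": [300, 900, 1200],
    "org_b": [450, 500, 700],
    "org_c": [1500, 2100, 2400],
}
mean_len = lambda name: sum(genomes[name]) / len(genomes[name])
ranking = bubble_sort_rank(genomes, mean_len)  # org_b < org_a < org_c
```

Bubble Sort's speed advantage over Simulated Annealing reported in the paper is plausible here because each comparison is cheap and deterministic, whereas annealing repeatedly re-evaluates candidate orderings.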
Sebastiani, Paola; Zhao, Zhenming; Abad-Grau, Maria M; Riva, Alberto; Hartley, Stephen W; Sedgewick, Amanda E; Doria, Alessandro; Montano, Monty; Melista, Efthymia; Terry, Dellara; Perls, Thomas T; Steinberg, Martin H; Baldwin, Clinton T
2008-01-01
Background One of the challenges of the analysis of pooling-based genome-wide association studies is to identify authentic associations among potentially thousands of false positive associations. Results We present a hierarchical and modular approach to the analysis of genome-wide genotype data that incorporates quality control, linkage disequilibrium, physical distance and gene ontology to identify authentic associations among those found by statistical association tests. The method is developed for the allelic association analysis of pooled DNA samples, but it can be easily generalized to the analysis of individually genotyped samples. We evaluate the approach using data sets from diverse genome-wide association studies, including fetal hemoglobin levels in sickle cell anemia and a sample of centenarians, and show that the approach is highly reproducible and allows for discovery at different levels of synthesis. Conclusion Results from the integration of Bayesian tests and other machine learning techniques with linkage disequilibrium data suggest that we do not need to use overly stringent thresholds to reduce the number of false positive associations. This method yields increased power even with relatively small samples. In fact, our evaluation shows that the method can reach almost 70% sensitivity with samples of only 100 subjects. PMID:18194558
Wang, Shuai; Wei, Wei; Luo, Xuenong; Cai, Xuepeng
2014-01-01
Besides the complete genome, different partial genomic sequences of Hepatitis E virus (HEV) have been used in genotyping studies, making it difficult to compare the results based on them. No commonly agreed partial region for HEV genotyping has been determined. In this study, we used a statistical method to evaluate the phylogenetic performance of each partial genomic sequence from a genome-wide perspective, by comparing evolutionary distances between genomic regions and the full-length genomes of 101 HEV isolates, to identify short genomic regions that can reproduce HEV genotype assignments based on full-length genomes. Several genomic regions, especially one at the 3'-terminal end of the papain-like cysteine protease domain, were found to have relatively high phylogenetic correlations with the full-length genome. Phylogenetic analyses confirmed the identical performance of these regions and the full-length genome in genotyping, in which the HEV isolates involved could be divided into reasonable genotypes. This analysis may be of value in developing a partial sequence-based consensus classification of HEV species.
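The comparison underlying this study, correlating distances from a partial region against distances from full-length genomes, can be sketched with toy aligned sequences. The distance measure here is a simple p-distance (proportion of mismatched sites), an illustrative stand-in for the evolutionary distances the paper computes:

```python
# Sketch: pairwise distances among isolates from a partial genomic region
# vs. from full-length sequences. Sequences are toy aligned strings and
# p-distance is a stand-in for the paper's evolutionary distances.

def p_distance(s1, s2):
    """Proportion of differing sites between two aligned sequences."""
    return sum(a != b for a, b in zip(s1, s2)) / len(s1)

def pairwise(seqs):
    """All pairwise distances, in sorted-name pair order."""
    names = sorted(seqs)
    return [p_distance(seqs[a], seqs[b])
            for i, a in enumerate(names) for b in names[i + 1:]]

full = {"iso1": "ACGTACGTAC", "iso2": "ACGTACGAAC", "iso3": "TCGTACGTTC"}
part = {k: v[:5] for k, v in full.items()}  # a hypothetical partial region

d_full, d_part = pairwise(full), pairwise(part)
# A Pearson correlation between d_full and d_part would then quantify how
# well this partial region reproduces full-genome relationships.
```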
USDA-ARS's Scientific Manuscript database
Recent developments in high-throughput sequencing technology have made low-cost sequencing an attractive approach for many genome analysis tasks. Increasing read lengths, improving quality and the production of increasingly larger numbers of usable sequences per instrument-run continue to make whole...
Strategies for the evaluation of DNA damage and repair mechanisms in cancer.
Figueroa-González, Gabriela; Pérez-Plasencia, Carlos
2017-06-01
DNA lesions and the repair mechanisms that maintain the integrity of genomic DNA are important in preventing carcinogenesis and its progression. Notably, mutations in DNA repair mechanisms are associated with cancer predisposition syndromes. Additionally, these mechanisms maintain the genomic integrity of cancer cells. The majority of therapies established to treat cancer are genotoxic agents that induce DNA damage, promoting cancer cells to undergo apoptotic death. Effective methods currently exist to evaluate the diverse effects of genotoxic agents and the underlying molecular mechanisms that repair DNA lesions. The current study provides an overview of a number of methods that are available for the detection, analysis and quantification of underlying DNA repair mechanisms.
Comparison of Seven Methods for Boolean Factor Analysis and Their Evaluation by Information Gain.
Frolov, Alexander A; Húsek, Dušan; Polyakov, Pavel Yu
2016-03-01
A common task in large data set analysis is searching for an appropriate data representation in a space of fewer dimensions. One of the most efficient methods for this task is factor analysis. In this paper, we compare seven methods for Boolean factor analysis (BFA) in solving the so-called bars problem (BP), which is a BFA benchmark. The performance of the methods is evaluated by means of information gain. Studying the results obtained in solving BPs of different levels of complexity has allowed us to reveal the strengths and weaknesses of these methods. It is shown that the Likelihood maximization Attractor Neural Network with Increasing Activity (LANNIA) is the most efficient BFA method in solving BP in many cases. The efficacy of the LANNIA method is also shown when applied to real data from the Kyoto Encyclopedia of Genes and Genomes database, which contains full genome sequencing for 1368 organisms, and to the text data set R52 (from Reuters 21578), typically used for label categorization.
Development of Mycoplasma synoviae (MS) core genome multilocus sequence typing (cgMLST) scheme.
Ghanem, Mostafa; El-Gazzar, Mohamed
2018-05-01
Mycoplasma synoviae (MS) is a poultry pathogen with reported increased prevalence and virulence in recent years. MS strain identification is essential for prevention and control efforts and for epidemiological outbreak investigations. Multiple multilocus sequence typing (MLST) schemes have been developed for MS, yet the resolution of these schemes can be limited for outbreak investigation. The cost of whole genome sequencing has become close to that of sequencing the seven MLST targets; however, there is no standardized method for typing MS strains based on whole genome sequences. In this paper, we propose a core genome multilocus sequence typing (cgMLST) scheme as a standardized and reproducible method for typing MS based on whole genome sequences. A diverse set of 25 MS whole genome sequences was used to identify 302 core genome genes as cgMLST targets (35.5% of the MS genome), and 44 whole genome sequences of MS isolates from six countries on four continents were typed using this scheme. cgMLST-based phylogenetic trees displayed a high degree of agreement with core genome SNP-based analysis and available epidemiological information. cgMLST allowed evaluation of two conventional MLST schemes for MS. The high discriminatory power of cgMLST allowed differentiation between samples of the same conventional MLST type. cgMLST represents a standardized, accurate, highly discriminatory, and reproducible method for differentiation between MS isolates. Like conventional MLST, it provides a stable and expandable nomenclature, allowing typing results to be compared and shared between laboratories worldwide. Copyright © 2018 The Authors. Published by Elsevier B.V. All rights reserved.
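The core of any cgMLST scheme is representing each isolate as a vector of allele numbers, one per core locus, and comparing isolates by counting loci with differing alleles. A minimal sketch of that comparison, with hypothetical locus names and profiles (the MS scheme above uses 302 core loci):

```python
# Sketch of cgMLST-style comparison: each isolate is an allele-number
# profile over core loci; distance = number of shared loci that differ.
# Locus names and allele numbers below are hypothetical.

def allele_distance(profile_a, profile_b):
    """Count shared loci at which two profiles carry different alleles.
    Loci missing from either profile are skipped, as is typical when a
    locus fails to assemble in one isolate."""
    shared = set(profile_a) & set(profile_b)
    return sum(1 for locus in shared if profile_a[locus] != profile_b[locus])

iso1 = {"locus_001": 1, "locus_002": 3, "locus_003": 2}
iso2 = {"locus_001": 1, "locus_002": 4, "locus_003": 2}
d = allele_distance(iso1, iso2)  # isolates differ at one core locus
```

Because alleles are numbered against a shared nomenclature rather than compared as raw sequences, profiles computed in different laboratories remain directly comparable, which is the portability advantage the abstract emphasizes.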
Elloumi, Fathi; Hu, Zhiyuan; Li, Yan; Parker, Joel S; Gulley, Margaret L; Amos, Keith D; Troester, Melissa A
2011-06-30
Genomic tests are available to predict breast cancer recurrence and to guide clinical decision making. These predictors provide recurrence risk scores along with a measure of uncertainty, usually a confidence interval. The confidence interval conveys random error and not systematic bias. Standard tumor sampling methods make this problematic, as it is common to have a substantial proportion (typically 30-50%) of a tumor sample comprised of histologically benign tissue. This "normal" tissue could represent a source of non-random error or systematic bias in genomic classification. To assess the sensitivity of genomic classification to systematic error from normal contamination, we collected 55 tumor samples and paired tumor-adjacent normal tissue. Using genomic signatures from the tumor and paired normal, we evaluated how increasing normal contamination altered recurrence risk scores for various genomic predictors. Simulations of normal tissue contamination caused misclassification of tumors in all predictors evaluated, but different breast cancer predictors showed different types of vulnerability to normal tissue bias. While two predictors had unpredictable direction of bias (either higher or lower risk of relapse resulted from normal contamination), one signature showed a predictable direction of normal tissue effects. Due to this predictable direction of effect, this signature (the PAM50) was adjusted for normal tissue contamination, and these corrections improved sensitivity and negative predictive value. For all three assays, quality control standards and/or appropriate bias adjustment strategies can be used to improve assay reliability. Normal tissue sampled concurrently with tumor is an important source of bias in breast genomic predictors. All genomic predictors show some sensitivity to normal tissue contamination, and ideal strategies for mitigating this bias vary depending upon the particular genes and computational methods used in the predictor.
Chwialkowska, Karolina; Korotko, Urszula; Kosinska, Joanna; Szarejko, Iwona; Kwasniewski, Miroslaw
2017-01-01
Epigenetic mechanisms, including histone modifications and DNA methylation, mutually regulate chromatin structure, maintain genome integrity, and affect gene expression and transposon mobility. Variations in DNA methylation within plant populations, as well as methylation in response to internal and external factors, are of increasing interest, especially in the crop research field. Methylation Sensitive Amplification Polymorphism (MSAP) is one of the most commonly used methods for assessing DNA methylation changes in plants. This method involves gel-based visualization of PCR fragments from selectively amplified DNA that are cleaved using methylation-sensitive restriction enzymes. In this study, we developed and validated a new method based on the conventional MSAP approach called Methylation Sensitive Amplification Polymorphism Sequencing (MSAP-Seq). We improved the MSAP-based approach by replacing the conventional separation of amplicons on polyacrylamide gels with direct, high-throughput sequencing using Next Generation Sequencing (NGS) and automated data analysis. MSAP-Seq allows for global sequence-based identification of changes in DNA methylation. This technique was validated in Hordeum vulgare. However, MSAP-Seq can be straightforwardly implemented in different plant species, including crops with large, complex and highly repetitive genomes. The incorporation of high-throughput sequencing into MSAP-Seq enables parallel and direct analysis of DNA methylation in hundreds of thousands of sites across the genome. MSAP-Seq provides direct genomic localization of changes and enables quantitative evaluation. We have shown that the MSAP-Seq method specifically targets gene-containing regions and that a single analysis can cover three-quarters of all genes in large genomes. Moreover, MSAP-Seq's simplicity, cost effectiveness, and high-multiplexing capability make this method highly affordable.
Therefore, MSAP-Seq can be used for DNA methylation analysis in crop plants with large and complex genomes.
Valente, Bruno D.; Morota, Gota; Peñagaricano, Francisco; Gianola, Daniel; Weigel, Kent; Rosa, Guilherme J. M.
2015-01-01
The term “effect” in additive genetic effect suggests a causal meaning. However, inferences of such quantities for selection purposes are typically viewed and conducted as a prediction task. Predictive ability as tested by cross-validation is currently the most acceptable criterion for comparing models and evaluating new methodologies. Nevertheless, it does not directly indicate if predictors reflect causal effects. Such evaluations would require causal inference methods that are not typical in genomic prediction for selection. This suggests that the usual approach to infer genetic effects contradicts the label of the quantity inferred. Here we investigate if genomic predictors for selection should be treated as standard predictors or if they must reflect a causal effect to be useful, requiring causal inference methods. Conducting the analysis as a prediction or as a causal inference task affects, for example, how covariates of the regression model are chosen, which may heavily affect the magnitude of genomic predictors and therefore selection decisions. We demonstrate that selection requires learning causal genetic effects. However, genomic predictors from some models might capture noncausal signal, providing good predictive ability but poorly representing true genetic effects. Simulated examples are used to show that aiming for predictive ability may lead to poor modeling decisions, while causal inference approaches may guide the construction of regression models that better infer the target genetic effect even when they underperform in cross-validation tests. In conclusion, genomic selection models should be constructed to aim primarily for identifiability of causal genetic effects, not for predictive ability. PMID:25908318
2015-01-01
Objective Developed sequencing techniques are yielding large-scale genomic data at low cost. A genome-wide association study (GWAS) targeting genetic variations that are significantly associated with a particular disease offers great potential for medical improvement. However, subjects who volunteer their genomic data expose themselves to the risk of privacy invasion; these privacy concerns prevent efficient genomic data sharing. Our goal is to presents a cryptographic solution to this problem. Methods To maintain the privacy of subjects, we propose encryption of all genotype and phenotype data. To allow the cloud to perform meaningful computation in relation to the encrypted data, we use a fully homomorphic encryption scheme. Noting that we can evaluate typical statistics for GWAS from a frequency table, our solution evaluates frequency tables with encrypted genomic and clinical data as input. We propose to use a packing technique for efficient evaluation of these frequency tables. Results Our solution supports evaluation of the D′ measure of linkage disequilibrium, the Hardy-Weinberg Equilibrium, the χ2 test, etc. In this paper, we take χ2 test and linkage disequilibrium as examples and demonstrate how we can conduct these algorithms securely and efficiently in an outsourcing setting. We demonstrate with experimentation that secure outsourcing computation of one χ2 test with 10, 000 subjects requires about 35 ms and evaluation of one linkage disequilibrium with 10, 000 subjects requires about 80 ms. Conclusions With appropriate encoding and packing technique, cryptographic solutions based on fully homomorphic encryption for secure computations of GWAS can be practical. PMID:26732892
Prospects and Potential Uses of Genomic Prediction of Key Performance Traits in Tetraploid Potato.
Stich, Benjamin; Van Inghelandt, Delphine
2018-01-01
Genomic prediction is a routine tool in breeding programs of most major animal and plant species. However, its usefulness for potato breeding has not yet been evaluated in detail. The objectives of this study were to (i) examine the prospects of genomic prediction of key performance traits in a diversity panel of tetraploid potato modeling additive, dominance, and epistatic effects, (ii) investigate the effects of size and make up of training set, number of test environments and molecular markers on prediction accuracy, and (iii) assess the effect of including markers from candidate genes on the prediction accuracy. With genomic best linear unbiased prediction (GBLUP), BayesA, BayesCπ, and Bayesian LASSO, four different prediction methods were used for genomic prediction of relative area under disease progress curve after a Phytophthora infestans infection, plant maturity, maturity corrected resistance, tuber starch content, tuber starch yield (TSY), and tuber yield (TY) of 184 tetraploid potato clones or subsets thereof genotyped with the SolCAP 8.3k SNP array. The cross-validated prediction accuracies with GBLUP and the three Bayesian approaches for the six evaluated traits ranged from about 0.5 to about 0.8. For traits with a high expected genetic complexity, such as TSY and TY, we observed an 8% higher prediction accuracy using a model with additive and dominance effects compared with a model with additive effects only. Our results suggest that for oligogenic traits in general and when diagnostic markers are available in particular, the use of Bayesian methods for genomic prediction is highly recommended and that the diagnostic markers should be modeled as fixed effects. The evaluation of the relative performance of genomic prediction vs. phenotypic selection indicated that the former is superior, assuming cycle lengths and selection intensities that are possible to realize in commercial potato breeding programs.
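GBLUP, the baseline method in the study above, is equivalent to a ridge regression on SNP markers (the RR-BLUP view). A minimal sketch on simulated diploid-coded data; the SolCAP array, tetraploid dosages, and the Bayesian alternatives in the paper are not modeled, and the shrinkage parameter is an arbitrary illustration:

```python
# Sketch of the ridge-regression (RR-BLUP) view of GBLUP: marker effects
# are shrunk jointly, and predictions for new clones come from their
# genotypes. All data are simulated; dosage coding, lambda, and sizes are
# illustrative assumptions, not the paper's setup.
import numpy as np

rng = np.random.default_rng(0)
n_train, n_markers = 100, 500
X = rng.integers(0, 3, size=(n_train, n_markers)).astype(float)  # 0/1/2
true_effects = rng.normal(0.0, 0.1, n_markers)
y = X @ true_effects + rng.normal(0.0, 1.0, n_train)  # phenotypes

lam = 10.0  # shrinkage parameter (residual-to-marker variance ratio)
# Solve (X'X + lam*I) b = X'y for the shrunken marker effects b.
b = np.linalg.solve(X.T @ X + lam * np.eye(n_markers), X.T @ y)

# Genomic estimated breeding values for 10 unphenotyped clones.
X_new = rng.integers(0, 3, size=(10, n_markers)).astype(float)
gebv = X_new @ b
```

Cross-validated prediction accuracy, as reported in the study, would then be the correlation between such predictions and held-out phenotypes; extending the design matrix with dominance-coded columns is one way to capture the non-additive effects the authors found useful for complex traits.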
Prospects and Potential Uses of Genomic Prediction of Key Performance Traits in Tetraploid Potato
Stich, Benjamin; Van Inghelandt, Delphine
2018-01-01
Genomic prediction is a routine tool in breeding programs of most major animal and plant species. However, its usefulness for potato breeding has not yet been evaluated in detail. The objectives of this study were to (i) examine the prospects of genomic prediction of key performance traits in a diversity panel of tetraploid potato, modeling additive, dominance, and epistatic effects, (ii) investigate the effects of the size and make-up of the training set, the number of test environments, and the number of molecular markers on prediction accuracy, and (iii) assess the effect of including markers from candidate genes on prediction accuracy. Four different prediction methods, genomic best linear unbiased prediction (GBLUP), BayesA, BayesCπ, and Bayesian LASSO, were used for genomic prediction of relative area under the disease progress curve after a Phytophthora infestans infection, plant maturity, maturity-corrected resistance, tuber starch content, tuber starch yield (TSY), and tuber yield (TY) of 184 tetraploid potato clones, or subsets thereof, genotyped with the SolCAP 8.3k SNP array. The cross-validated prediction accuracies with GBLUP and the three Bayesian approaches for the six evaluated traits ranged from about 0.5 to about 0.8. For traits with a high expected genetic complexity, such as TSY and TY, we observed an 8% higher prediction accuracy using a model with additive and dominance effects compared with a model with additive effects only. Our results suggest that for oligogenic traits in general, and when diagnostic markers are available in particular, the use of Bayesian methods for genomic prediction is highly recommended, and that the diagnostic markers should be modeled as fixed effects. The evaluation of the relative performance of genomic prediction vs. phenotypic selection indicated that the former is superior, assuming cycle lengths and selection intensities that are possible to realize in commercial potato breeding programs. PMID:29563919
Guidelines for Genome-Scale Analysis of Biological Rhythms.
Hughes, Michael E; Abruzzi, Katherine C; Allada, Ravi; Anafi, Ron; Arpat, Alaaddin Bulak; Asher, Gad; Baldi, Pierre; de Bekker, Charissa; Bell-Pedersen, Deborah; Blau, Justin; Brown, Steve; Ceriani, M Fernanda; Chen, Zheng; Chiu, Joanna C; Cox, Juergen; Crowell, Alexander M; DeBruyne, Jason P; Dijk, Derk-Jan; DiTacchio, Luciano; Doyle, Francis J; Duffield, Giles E; Dunlap, Jay C; Eckel-Mahan, Kristin; Esser, Karyn A; FitzGerald, Garret A; Forger, Daniel B; Francey, Lauren J; Fu, Ying-Hui; Gachon, Frédéric; Gatfield, David; de Goede, Paul; Golden, Susan S; Green, Carla; Harer, John; Harmer, Stacey; Haspel, Jeff; Hastings, Michael H; Herzel, Hanspeter; Herzog, Erik D; Hoffmann, Christy; Hong, Christian; Hughey, Jacob J; Hurley, Jennifer M; de la Iglesia, Horacio O; Johnson, Carl; Kay, Steve A; Koike, Nobuya; Kornacker, Karl; Kramer, Achim; Lamia, Katja; Leise, Tanya; Lewis, Scott A; Li, Jiajia; Li, Xiaodong; Liu, Andrew C; Loros, Jennifer J; Martino, Tami A; Menet, Jerome S; Merrow, Martha; Millar, Andrew J; Mockler, Todd; Naef, Felix; Nagoshi, Emi; Nitabach, Michael N; Olmedo, Maria; Nusinow, Dmitri A; Ptáček, Louis J; Rand, David; Reddy, Akhilesh B; Robles, Maria S; Roenneberg, Till; Rosbash, Michael; Ruben, Marc D; Rund, Samuel S C; Sancar, Aziz; Sassone-Corsi, Paolo; Sehgal, Amita; Sherrill-Mix, Scott; Skene, Debra J; Storch, Kai-Florian; Takahashi, Joseph S; Ueda, Hiroki R; Wang, Han; Weitz, Charles; Westermark, Pål O; Wijnen, Herman; Xu, Ying; Wu, Gang; Yoo, Seung-Hee; Young, Michael; Zhang, Eric Erquan; Zielinski, Tomasz; Hogenesch, John B
2017-10-01
Genome biology approaches have made enormous contributions to our understanding of biological rhythms, particularly in identifying outputs of the clock, including RNAs, proteins, and metabolites, whose abundance oscillates throughout the day. These methods hold significant promise for future discovery, particularly when combined with computational modeling. However, genome-scale experiments are costly and laborious, yielding "big data" that are conceptually and statistically difficult to analyze. There is no obvious consensus regarding design or analysis. Here we discuss the relevant technical considerations to generate reproducible, statistically sound, and broadly useful genome-scale data. Rather than suggest a set of rigid rules, we aim to codify principles by which investigators, reviewers, and readers of the primary literature can evaluate the suitability of different experimental designs for measuring different aspects of biological rhythms. We introduce CircaInSilico, a web-based application for generating synthetic genome biology data to benchmark statistical methods for studying biological rhythms. Finally, we discuss several unmet analytical needs, including applications to clinical medicine, and suggest productive avenues to address them.
Development and evaluation of a genomics training program for community health workers in Texas.
Chen, Lei-Shih; Zhao, Shixi; Stelzig, Donaji; Dhar, Shweta U; Eble, Tanya; Yeh, Yu-Chen; Kwok, Oi-Man
2018-01-04
Purpose: Genomics services have the potential to reduce the incidence and mortality of diseases by providing individualized, family health history (FHH)-based prevention strategies to clients. These services may benefit from the involvement of community health workers (CHWs) in the provision of FHH-based genomics education and services, as CHWs are frontline public health workers and lay health educators who share similar ethnicities, languages, socioeconomic statuses, and life experiences with the communities they serve. We developed, implemented, and evaluated an FHH-based genomics training program for CHWs. Methods: This theory- and evidence-based, FHH-focused genomics curriculum was developed by an interdisciplinary team. Full-day workshops in English and Spanish were delivered to 145 Texas CHWs (91.6% were Hispanic/black). Pre-workshop, post-workshop, and 3-month follow-up data were collected. Results: CHWs significantly improved their attitudes, intention, self-efficacy, and knowledge regarding adopting FHH-based genomics into their practice after the workshops. At 3-month follow-up, these scores remained higher, and there was a significant increase in CHWs' genomics practices. Conclusion: This FHH-based genomics training successfully educated Texas CHWs, and the outcomes were promising. Dissemination of the training to CHWs in and outside of Texas is needed to promote better access to and delivery of personalized genomics services for lay and underserved communities. GENETICS in MEDICINE advance online publication, 4 January 2018; doi:10.1038/gim.2017.236.
Future animal improvement programs applied to global populations
USDA-ARS?s Scientific Manuscript database
Breeding programs evolved gradually from within-herd phenotypic selection to local and regional cooperatives to national evaluations and now international evaluations. In the future, breeders may adapt reproductive, computational, and genomic methods to global populations as easily as with national ...
snpTree--a web-server to identify and construct SNP trees from whole genome sequence data.
Leekitcharoenphon, Pimlapas; Kaas, Rolf S; Thomsen, Martin Christen Frølund; Friis, Carsten; Rasmussen, Simon; Aarestrup, Frank M
2012-01-01
Advances in, and the decreasing cost of, whole genome sequencing (WGS) will soon make this technology available for routine infectious disease epidemiology. In epidemiological studies, outbreak isolates have very little diversity and require extensive genomic analysis to differentiate and classify. One of the most successful and broadly used methods is analysis of single nucleotide polymorphisms (SNPs). Currently, there are different tools and methods to identify SNPs, with various options and cut-off values, and all current methods require bioinformatic skills. Thus, we lack a standard and simple automatic tool to determine SNPs and construct a phylogenetic tree from WGS data. Here we introduce snpTree, a server for online automatic SNP analysis. This tool is composed of different SNP analysis suites and perl and python scripts. snpTree can identify SNPs and construct phylogenetic trees from WGS data as well as from assembled genomes or contigs. WGS data in fastq format are aligned to reference genomes by BWA, while contigs in fasta format are processed by Nucmer. SNPs are concatenated based on position on the reference genome, and a tree is constructed from the concatenated SNPs using FastTree and a perl script. The online server was implemented in HTML, Java and python scripts. The server was evaluated using four published bacterial WGS data sets (V. cholerae, S. aureus CC398, S. Typhimurium and M. tuberculosis). The evaluation results for the first three cases were consistent and concordant for both raw reads and assembled genomes. In the latter case, the original publication involved extensive filtering of SNPs, which could not be repeated using snpTree. The snpTree server is an easy-to-use option for rapid, standardised and automatic SNP analysis in epidemiological studies, including for users with limited bioinformatic experience. The web server is freely accessible at http://www.cbs.dtu.dk/services/snpTree-1.0/.
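The concatenation step described above, ordering SNP calls by reference position and joining them into one pseudo-sequence per isolate for tree building, can be sketched as follows. The isolate names, positions and alleles are hypothetical, and snpTree's actual perl/python scripts are not reproduced here; this is only a minimal illustration of the idea.

```python
def concatenate_snps(snp_calls, positions=None):
    """snp_calls maps isolate -> {reference_position: allele}."""
    if positions is None:
        # Use every position that is variant in at least one isolate.
        positions = sorted({p for calls in snp_calls.values() for p in calls})
    # Missing calls fall back to 'N' so all pseudo-sequences have equal length,
    # as required by alignment-based tree builders such as FastTree.
    return {iso: "".join(calls.get(p, "N") for p in positions)
            for iso, calls in snp_calls.items()}

# Hypothetical SNP calls for three isolates against one reference.
calls = {
    "isolate_A": {1042: "A", 5310: "G", 9001: "T"},
    "isolate_B": {1042: "C", 9001: "T"},
    "isolate_C": {1042: "A", 5310: "T", 9001: "C"},
}
pseudo = concatenate_snps(calls)
print(pseudo)
```

The resulting equal-length strings could be written out in fasta format as the input "alignment" for a tree-building step.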
MetaQUAST: evaluation of metagenome assemblies.
Mikheenko, Alla; Saveliev, Vladislav; Gurevich, Alexey
2016-04-01
In recent years we have witnessed the rapid development of new metagenome assembly methods. Although there are many benchmark utilities designed for single-genome assemblies, there is no well-recognized tool for evaluating and comparing metagenome assemblies. In this article, we present MetaQUAST, a modification of QUAST, the state-of-the-art tool for genome assembly evaluation based on alignment of contigs to a reference. MetaQUAST addresses metagenome dataset features such as (i) unknown species content, by detecting and downloading reference sequences, (ii) huge diversity, by giving comprehensive reports for multiple genomes, and (iii) the presence of closely related species, by detecting chimeric contigs. We demonstrate MetaQUAST performance by comparing several leading assemblers on one simulated and two real datasets. http://bioinf.spbau.ru/metaquast aleksey.gurevich@spbu.ru Supplementary data are available at Bioinformatics online. © The Author 2015. Published by Oxford University Press. All rights reserved. For Permissions, please e-mail: journals.permissions@oup.com.
Ridge, Lasso and Bayesian additive-dominance genomic models.
Azevedo, Camila Ferreira; de Resende, Marcos Deon Vilela; E Silva, Fabyano Fonseca; Viana, José Marcelo Soriano; Valente, Magno Sávio Ferreira; Resende, Márcio Fernando Ribeiro; Muñoz, Patricio
2015-08-25
A complete approach for genome-wide selection (GWS) involves reliable statistical genetics models and methods. Reports on this topic are common for additive genetic models but not for additive-dominance models. The objectives of this paper were (i) to compare the performance of 10 additive-dominance predictive models (including current models and proposed modifications), fitted using Bayesian, Lasso and Ridge regression approaches; and (ii) to decompose genomic heritability and accuracy in terms of three quantitative genetic information sources, namely, linkage disequilibrium (LD), co-segregation (CS) and pedigree relationships or family structure (PR). The simulation study considered two broad-sense heritability levels (0.30 and 0.50, associated with narrow-sense heritabilities of 0.20 and 0.35, respectively) and two genetic architectures for traits (the first consisting of small gene effects and the second consisting of a mixed inheritance model with five major genes). G-REML/G-BLUP and a modified Bayesian/Lasso method (called BayesA*B* or t-BLASSO) performed best in the prediction of genomic breeding values as well as the total genotypic values of individuals in all four scenarios (two heritabilities × two genetic architectures). The BayesA*B*-type method showed a better ability to recover the dominance variance/additive variance ratio. Decomposition of genomic heritability and accuracy revealed the following descending order of importance of the information sources: LD, then CS and PR not captured by markers, the last two being very close. Amongst the 10 models/methods evaluated, the G-BLUP, BayesA*B*(-2,8) and BayesA*B*(4,6) methods presented the best results and were found to be adequate for accurately predicting genomic breeding values and total genotypic values, as well as for estimating additive and dominance effects in additive-dominance genomic models.
Methodology and software to detect viral integration site hot-spots
2011-01-01
Background Modern gene therapy methods have limited control over where a therapeutic viral vector inserts into the host genome. Vector integration can activate local gene expression, which can cause cancer if the vector inserts near an oncogene. Viral integration site (VIS) hot-spots or 'common insertion sites' (CIS) are scrutinized to evaluate and predict patient safety. CIS are typically defined by a minimum density of insertions (such as 2-4 within a 30-100 kb region), which unfortunately depends on the total number of observed VIS. This is problematic for comparing hot-spot distributions across data sets and patients, where the VIS numbers may vary. Results We develop two new methods for defining hot-spots that are relatively independent of data set size. Both methods operate on distributions of VIS across consecutive 1 Mb 'bins' of the genome. The first method, 'z-threshold', tallies the number of VIS per bin, converts these counts to z-scores, and applies a threshold to define high-density bins. The second method, 'BCP', applies a Bayesian change-point model to the z-scores to define hot-spots. The novel hot-spot methods are compared with a conventional CIS method using simulated data sets and data sets from five published human studies, including the X-linked ALD (adrenoleukodystrophy), CGD (chronic granulomatous disease) and SCID-X1 (X-linked severe combined immunodeficiency) trials. The BCP analysis of the human X-linked ALD data for two patients separately (774 and 1627 VIS) and combined (2401 VIS) resulted in 5-6 hot-spots covering 0.17-0.251% of the genome and containing 5.56-7.74% of the total VIS. In comparison, the CIS analysis resulted in 12-110 hot-spots covering 0.018-0.246% of the genome and containing 5.81-22.7% of the VIS, corresponding to a greater number of hot-spots as the data set size increased. Our hot-spot methods enable one to evaluate the extent of VIS clustering, and formally compare data sets in terms of hot-spot overlap.
Finally, we show that the BCP hot-spots from the repopulating samples coincide with greater gene and CpG island density than the median genome density. Conclusions The z-threshold and BCP methods are useful for comparing hot-spot patterns across data sets of disparate sizes. The methodology and software provided here should enable one to study hot-spot conservation across a variety of VIS data sets and evaluate vector safety for gene therapy trials. PMID:21914224
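The 'z-threshold' method described above (tally VIS per 1 Mb bin, convert counts to z-scores, threshold) can be illustrated with a minimal sketch. The 1 Mb bin size follows the description in the abstract, but the VIS positions, genome length and z cutoff of 2.0 are illustrative assumptions, not values from the paper.

```python
# Minimal sketch of z-score-based hot-spot detection: count viral
# integration sites (VIS) per fixed-size genomic bin, standardize the
# counts, and report bins exceeding a z-score cutoff.
from collections import Counter
from statistics import mean, pstdev

def zscore_hotspots(vis_positions, genome_length, bin_size=1_000_000, z_cut=2.0):
    n_bins = -(-genome_length // bin_size)  # ceiling division
    counts = Counter(pos // bin_size for pos in vis_positions)
    per_bin = [counts.get(i, 0) for i in range(n_bins)]
    mu, sigma = mean(per_bin), pstdev(per_bin)
    if sigma == 0:
        return []  # uniform counts: no bin stands out
    return [i for i, c in enumerate(per_bin) if (c - mu) / sigma > z_cut]

# Hypothetical data: one dense cluster of insertions in bin 2, sparse elsewhere.
vis = [2_100_000 + k for k in range(50)] + [5_000_000, 7_500_000, 9_100_000]
hot = zscore_hotspots(vis, genome_length=10_000_000)
print(hot)
```

Because the counts are standardized before thresholding, the flagged bins depend on each bin's density relative to the genome-wide distribution rather than on an absolute insertion count, which is what makes the result comparatively insensitive to data set size.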
Clinical evaluation incorporating a personal genome
Ashley, Euan A.; Butte, Atul J.; Wheeler, Matthew T.; Chen, Rong; Klein, Teri E.; Dewey, Frederick E.; Dudley, Joel T.; Ormond, Kelly E.; Pavlovic, Aleksandra; Hudgins, Louanne; Gong, Li; Hodges, Laura M.; Berlin, Dorit S.; Thorn, Caroline F.; Sangkuhl, Katrin; Hebert, Joan M.; Woon, Mark; Sagreiya, Hersh; Whaley, Ryan; Morgan, Alexander A.; Pushkarev, Dmitry; Neff, Norma F; Knowles, Joshua W.; Chou, Mike; Thakuria, Joseph; Rosenbaum, Abraham; Zaranek, Alexander Wait; Church, George; Greely, Henry T.; Quake, Stephen R.; Altman, Russ B.
2010-01-01
Background The cost of genomic information has fallen steeply but the path to clinical translation of risk estimates for common variants found in genome wide association studies remains unclear. Since the speed and cost of sequencing complete genomes is rapidly declining, more comprehensive means of analyzing these data in concert with rare variants for genetic risk assessment and individualisation of therapy are required. Here, we present the first integrated analysis of a complete human genome in a clinical context. Methods An individual with a family history of vascular disease and early sudden death was evaluated. Clinical assessment included risk prediction for coronary artery disease, screening for causes of sudden cardiac death, and genetic counselling. Genetic analysis included the development of novel methods for the integration of whole genome sequence data including 2.6 million single nucleotide polymorphisms and 752 copy number variations. The algorithm focused on predicting genetic risk of genes associated with known Mendelian disease, recognised drug responses, and pathogenicity for novel variants. In addition, since integration of risk ratios derived from case control studies is challenging, we estimated posterior probabilities from age and sex appropriate prior probability and likelihood ratios derived for each genotype. In addition, we developed a visualisation approach to account for gene-environment interactions and conditionally dependent risks. Findings We found increased genetic risk for myocardial infarction, type II diabetes and certain cancers. Rare variants in LPA are consistent with the family history of coronary artery disease. Pharmacogenomic analysis suggested a positive response to lipid lowering therapy, likely clopidogrel resistance, and a low initial dosing requirement for warfarin. Many variants of uncertain significance were reported. 
Interpretation Although challenges remain, our results suggest that whole genome sequencing can yield useful and clinically relevant information for individual patients, especially for those with a strong family history of significant disease. PMID:20435227
Vincent, Caroline; Usongo, Valentine; Berry, Chrystal; Tremblay, Denise M; Moineau, Sylvain; Yousfi, Khadidja; Doualla-Bell, Florence; Fournier, Eric; Nadon, Céline; Goodridge, Lawrence; Bekal, Sadjia
2018-08-01
Salmonella enterica serovar Heidelberg (S. Heidelberg) is one of the top serovars causing human salmonellosis. This serovar ranks second and third among serovars that cause human infections in Québec and Canada, respectively, and has been associated with severe infections. Traditional typing methods such as PFGE do not display adequate discrimination required to resolve outbreak investigations due to the low level of genetic diversity of isolates belonging to this serovar. This study evaluates the ability of four whole genome sequence (WGS)-based typing methods to differentiate among 145 S. Heidelberg strains involved in four distinct outbreak events and sporadic cases of salmonellosis that occurred in Québec between 2007 and 2016. Isolates from all outbreaks were indistinguishable by PFGE. The core genome single nucleotide variant (SNV), core genome multilocus sequence typing (MLST) and whole genome MLST approaches were highly discriminatory and separated outbreak strains into four distinct phylogenetic clusters that were concordant with the epidemiological data. The clustered regularly interspaced short palindromic repeats (CRISPR) typing method was less discriminatory. However, CRISPR typing may be used as a secondary method to differentiate isolates of S. Heidelberg that are genetically similar but epidemiologically unrelated to outbreak events. WGS-based typing methods provide a highly discriminatory alternative to PFGE for the laboratory investigation of foodborne outbreaks. Copyright © 2018 Elsevier Ltd. All rights reserved.
Campos, G S; Reimann, F A; Cardoso, L L; Ferreira, C E R; Junqueira, V S; Schmidt, P I; Braccini Neto, J; Yokoo, M J I; Sollero, B P; Boligon, A A; Cardoso, F F
2018-05-07
The objective of the present study was to evaluate the accuracy and bias of direct and blended genomic predictions using different methods and cross-validation techniques for growth traits (weight and weight gains) and visual scores (conformation, precocity, muscling and size) obtained at weaning and at yearling in Hereford and Braford breeds. Phenotypic data contained 126,290 animals belonging to the Delta G Connection genetic improvement program, and a set of 3,545 animals genotyped with the 50K chip and 131 sires genotyped with the 777K chip. After quality control, 41,045 markers remained for all animals. An animal model was used to estimate (co)variance components and to predict breeding values, which were later used to calculate deregressed estimated breeding values (DEBV). Animals with genotypes and phenotypes for the traits studied were divided into four or five groups by random and k-means clustering cross-validation strategies. The accuracies of direct genomic values (DGV) were of moderate to high magnitude for traits measured at weaning and at yearling, ranging from 0.19 to 0.45 for k-means and 0.23 to 0.78 for random clustering among all traits. The greatest gain in relation to pedigree BLUP (PBLUP) was 9.5% with the BayesB method with both k-means and random clustering. Blended genomic value accuracies ranged from 0.19 to 0.56 for k-means and from 0.21 to 0.82 for random clustering. The analyses using the historical pedigree and phenotypes contributed additional information to calculate the GEBV, and in general the largest gains were for the single-step (ssGBLUP) method in bivariate analyses, with a mean increase of 43.00% among all traits measured at weaning and of 46.27% for those evaluated at yearling. The accuracy values for the marker-effect estimation methods were lower for k-means clustering, indicating that the relationship of the training set to the selection candidates is a major factor affecting the accuracy of genomic predictions. 
The gains in accuracy obtained with genomic blending methods, mainly ssGBLUP in bivariate analyses, indicate that genomic predictions should be used as a tool to improve genetic gains in relation to the traditional PBLUP selection.
Finding approximate gene clusters with Gecko 3.
Winter, Sascha; Jahn, Katharina; Wehner, Stefanie; Kuchenbecker, Leon; Marz, Manja; Stoye, Jens; Böcker, Sebastian
2016-11-16
Gene-order-based comparison of multiple genomes provides signals for functional analysis of genes and the evolutionary process of genome organization. Gene clusters are regions of co-localized genes on genomes of different species. The rapid increase in sequenced genomes necessitates bioinformatics tools for finding gene clusters in hundreds of genomes. Existing tools are often restricted to few (in many cases, only two) genomes, and often make restrictive assumptions such as short perfect conservation, conserved gene order or monophyletic gene clusters. We present Gecko 3, an open-source software for finding gene clusters in hundreds of bacterial genomes, that comes with an easy-to-use graphical user interface. The underlying gene cluster model is intuitive, can cope with low degrees of conservation as well as misannotations and is complemented by a sound statistical evaluation. To evaluate the biological benefit of Gecko 3 and to exemplify our method, we search for gene clusters in a dataset of 678 bacterial genomes using Synechocystis sp. PCC 6803 as a reference. We confirm detected gene clusters reviewing the literature and comparing them to a database of operons; we detect two novel clusters, which were confirmed by publicly available experimental RNA-Seq data. The computational analysis is carried out on a laptop computer in <40 min. © The Author(s) 2016. Published by Oxford University Press on behalf of Nucleic Acids Research.
Van Coillie, Samya; Liang, Lunxi; Zhang, Yao; Wang, Huanbin; Fang, Jing-Yuan; Xu, Jie
2016-04-05
High-throughput methods such as co-immunoprecipitation-mass spectrometry (coIP-MS) and yeast two-hybrid (Y2H) screening have suggested a broad range of unannotated protein-protein interactions (PPIs), and interpretation of these PPIs remains a challenging task. Advancements in cancer genomics research allow for the inference of "coactivation pairs" in cancer, which may facilitate the identification of PPIs involved in cancer. Here we present OncoBinder as a tool for the assessment of proteomic interaction data based on the functional synergy of oncoproteins in cancer. This decision-tree-based method combines gene mutation, copy number and mRNA expression information to infer the functional status of protein-coding genes. We applied OncoBinder to evaluate the potential binders of EGFR and ERK2 proteins based on the gastric cancer dataset of The Cancer Genome Atlas (TCGA). As a result, OncoBinder identified high-confidence interactions (annotated by the Kyoto Encyclopedia of Genes and Genomes (KEGG) or validated by low-throughput assays) more efficiently than a co-expression-based method. Taken together, our results suggest that evaluation of gene functional synergy in cancer may facilitate the interpretation of proteomic interaction data. The OncoBinder toolbox for Matlab is freely accessible online.
Hawkeye and AMOS: visualizing and assessing the quality of genome assemblies
Schatz, Michael C.; Phillippy, Adam M.; Sommer, Daniel D.; Delcher, Arthur L.; Puiu, Daniela; Narzisi, Giuseppe; Salzberg, Steven L.; Pop, Mihai
2013-01-01
Since its launch in 2004, the open-source AMOS project has released several innovative DNA sequence analysis applications including: Hawkeye, a visual analytics tool for inspecting the structure of genome assemblies; the Assembly Forensics and FRCurve pipelines for systematically evaluating the quality of a genome assembly; and AMOScmp, the first comparative genome assembler. These applications have been used to assemble and analyze dozens of genomes ranging in complexity from simple microbial species through mammalian genomes. Recent efforts have been focused on enhancing support for new data characteristics brought on by second- and now third-generation sequencing. This review describes the major components of AMOS in light of these challenges, with an emphasis on methods for assessing assembly quality and the visual analytics capabilities of Hawkeye. These interactive graphical aspects are essential for navigating and understanding the complexities of a genome assembly, from the overall genome structure down to individual bases. Hawkeye and AMOS are available open source at http://amos.sourceforge.net. PMID:22199379
MUFFINN: cancer gene discovery via network analysis of somatic mutation data.
Cho, Ara; Shim, Jung Eun; Kim, Eiru; Supek, Fran; Lehner, Ben; Lee, Insuk
2016-06-23
A major challenge for distinguishing cancer-causing driver mutations from inconsequential passenger mutations is the long-tail of infrequently mutated genes in cancer genomes. Here, we present and evaluate a method for prioritizing cancer genes accounting not only for mutations in individual genes but also in their neighbors in functional networks, MUFFINN (MUtations For Functional Impact on Network Neighbors). This pathway-centric method shows high sensitivity compared with gene-centric analyses of mutation data. Notably, only a marginal decrease in performance is observed when using 10 % of TCGA patient samples, suggesting the method may potentiate cancer genome projects with small patient populations.
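The pathway-centric idea above, scoring a gene not only by its own mutations but also by contributions from its neighbors in a functional network, can be illustrated with a toy sketch. The network, mutation counts and damping factor below are hypothetical; MUFFINN's actual scoring functions are not reproduced here.

```python
# Toy sketch of network-neighbor mutation scoring: a gene's score is its
# own mutation count plus a damped sum of its direct neighbors' counts,
# so infrequently mutated genes in heavily mutated neighborhoods rise
# above the long tail of passengers.
def neighbor_scores(mutations, network, alpha=0.5):
    """mutations: gene -> mutation count; network: gene -> set of neighbors."""
    genes = set(mutations) | set(network)
    return {
        g: mutations.get(g, 0)
        + alpha * sum(mutations.get(nb, 0) for nb in network.get(g, ()))
        for g in genes
    }

# Hypothetical functional network and mutation counts.
net = {"TP53": {"MDM2", "ATM"}, "MDM2": {"TP53"}, "ATM": {"TP53"}}
muts = {"TP53": 10, "MDM2": 4, "ATM": 2}
scores = neighbor_scores(muts, net)
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)
```

Ranking by the combined score rather than the raw count is what lets such a method recover candidates that gene-centric analyses of mutation frequency alone would miss.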
Extensive complementarity between gene function prediction methods.
Vidulin, Vedrana; Šmuc, Tomislav; Supek, Fran
2016-12-01
The number of sequenced genomes rises steadily, but we still lack knowledge about the biological roles of many genes. Automated function prediction (AFP) is thus a necessity. We hypothesized that AFP approaches that draw on distinct genome features may be useful for predicting different types of gene functions, motivating a systematic analysis of the benefits gained by obtaining and integrating such predictions. Our pipeline amalgamates 5 133 543 genes from 2071 genomes in a single massive analysis that evaluates five established genomic AFP methodologies. While 1227 Gene Ontology (GO) terms yielded reliable predictions, the majority of these functions were accessible to only one or two of the methods. Moreover, different methods tend to assign a GO term to non-overlapping sets of genes. Thus, inferences made by diverse genomic AFP methods display a striking complementarity, both gene-wise and function-wise. Because of this, a viable integration strategy is to rely on a single most-confident prediction per gene/function, rather than enforcing agreement across multiple AFP methods. Using an information-theoretic approach, we estimate that current databases contain 29.2 bits/gene of known Escherichia coli gene functions. This can be increased by up to 5.5 bits/gene using individual AFP methods, or by 11 additional bits/gene upon integration, thereby providing a highly ranked predictor on the Critical Assessment of Function Annotation 2 community benchmark. The availability of more sequenced genomes boosts the predictive accuracy of AFP approaches and also the benefit of integrating them. The individual and integrated GO predictions for the complete set of genes are available from http://gorbi.irb.hr/ Contact: fran.supek@irb.hr. Supplementary information: Supplementary materials are available at Bioinformatics online. © The Author 2016. Published by Oxford University Press. All rights reserved. For Permissions, please e-mail: journals.permissions@oup.com.
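The integration strategy described above, keeping the single most-confident prediction per gene/function pair rather than requiring agreement across methods, can be sketched as follows. The method names, genes, GO terms and confidence scores are hypothetical.

```python
# Sketch of "best single prediction" integration: for each (gene, GO term)
# pair, retain only the prediction with the highest confidence across all
# AFP methods, instead of intersecting the methods' outputs.
def integrate_best(predictions):
    """predictions: iterable of (method, gene, go_term, confidence) tuples."""
    best = {}
    for method, gene, go, conf in predictions:
        key = (gene, go)
        if key not in best or conf > best[key][1]:
            best[key] = (method, conf)
    return best

preds = [
    ("method_A", "geneX", "GO:0008150", 0.62),
    ("method_B", "geneX", "GO:0008150", 0.91),
    ("method_A", "geneY", "GO:0003674", 0.55),
]
integrated = integrate_best(preds)
print(integrated)
```

Because the methods assign terms to largely non-overlapping gene sets, taking a maximum per pair preserves each method's unique coverage, whereas an intersection would discard it.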
Evaluating the protein coding potential of exonized transposable element sequences
Piriyapongsa, Jittima; Rutledge, Mark T; Patel, Sanil; Borodovsky, Mark; Jordan, I King
2007-01-01
Background Transposable element (TE) sequences, once thought to be merely selfish or parasitic members of the genomic community, have been shown to contribute a wide variety of functional sequences to their host genomes. Analysis of complete genome sequences has turned up numerous cases where TE sequences have been incorporated as exons into mRNAs, and it is widely assumed that such 'exonized' TEs encode protein sequences. However, the extent to which TE-derived sequences actually encode proteins is unknown and a matter of some controversy. We have tried to address this outstanding issue from two perspectives: (i) by evaluating ascertainment biases related to the search methods used to uncover TE-derived protein coding sequences (CDS) and (ii) through a probabilistic codon-frequency based analysis of the protein coding potential of TE-derived exons. Results We compared the ability of three classes of sequence similarity search methods to detect TE-derived sequences among data sets of experimentally characterized proteins: (1) a profile-based hidden Markov model (HMM) approach, (2) BLAST methods and (3) RepeatMasker. Profile-based methods are more sensitive and more selective than the other methods evaluated. However, the application of profile-based search methods to the detection of TE-derived sequences among well-curated, experimentally characterized protein data sets did not turn up many more cases than had been previously detected, and nowhere near as many cases as recent genome-wide searches have. We observed that the different search methods used were complementary in the sense that they yielded largely non-overlapping sets of hits and differed in their ability to recover known cases of TE-derived CDS. The probabilistic analysis of TE-derived exon sequences indicates that these sequences have low protein coding potential on average.
In particular, non-autonomous TEs that do not encode protein sequences, such as Alu elements, are frequently exonized but unlikely to encode protein sequences. Conclusion The exaptation of the numerous TE sequences found in exons as bona fide protein coding sequences may prove to be far less common than has been suggested by the analysis of complete genomes. We hypothesize that many exonized TE sequences actually function as post-transcriptional regulators of gene expression, rather than coding sequences, which may act through a variety of double-stranded RNA related regulatory pathways. Indeed, their relatively high copy numbers and similarity to sequences dispersed throughout the genome suggest that exonized TE sequences could serve as master regulators with a wide scope of regulatory influence. Reviewers: This article was reviewed by Itai Yanai, Kateryna D. Makova, Melissa Wilson (nominated by Kateryna D. Makova) and Cedric Feschotte (nominated by John M. Logsdon Jr.). PMID:18036258
Peano, Clelia; Samson, Maria Cristina; Palmieri, Luisa; Gulli, Mariolina; Marmiroli, Nelson
2004-11-17
The presence of DNA in foodstuffs derived from or containing genetically modified organisms (GMO) is the basic requirement for labeling of GMO foods in Council Directive 2001/18/EC (Off. J. Eur. Communities 2001, L1 06/2). In this work, four different methods for DNA extraction were evaluated and compared. To rank the different methods, the quality and quantity of DNA extracted from standards containing known percentages of GMO material and from different food products were considered. The food products analyzed derived from both soybean and maize and were chosen on the basis of the mechanical, technological, and chemical treatment they had been subjected to during processing. The degree of DNA degradation at various stages of food production was evaluated through the amplification of different DNA fragments belonging to the endogenous genes of both maize and soybean. Genomic DNA was extracted from Roundup Ready soybean and maize MON810 standard flours, according to the four different methods, and quantified by real-time polymerase chain reaction (PCR), with the aim of determining the influence of the extraction methods on DNA quantification through real-time PCR.
Jonas, Elisabeth; de Koning, Dirk Jan
Genomic selection is an important topic in quantitative genetics and breeding. Not only does it allow the full use of current molecular genetic technologies, it also stimulates the development of new methods and models. Genomic selection, if fully implemented in commercial farming, should have a major impact on the productivity of various agricultural systems, but the suggested approaches need to be applicable in commercial breeding populations. Many of the published research studies focus on methodologies. We conclude from the reviewed publications that a stronger focus is needed on strategies for implementing genomic selection in advanced breeding lines and in the introduction of new varieties, hybrids or multi-line crosses. Efforts to find solutions for better prediction and integration of environmental influences need to continue within applied breeding schemes. Goals of the implementation of genomic selection in crop breeding should be carefully defined, and crop breeders in the private sector will play a substantial part in the decision-making process. However, the lack of published results from studies within, or in collaboration with, private companies limits knowledge of the status of genomic selection within applied breeding programmes. Studies on the implementation of genomic selection in plant breeding need to evaluate models and methods with an enhanced emphasis on population-specific requirements and production environments. Adaptation of methods to breeding schemes, or changes to breeding programmes for better integration of genomic selection strategies, are needed across species. Greater openness and continuous exchange of results will contribute to success.
Ferrarini, Alberto; Forcato, Claudio; Buson, Genny; Tononi, Paola; Del Monaco, Valentina; Terracciano, Mario; Bolognesi, Chiara; Fontana, Francesca; Medoro, Gianni; Neves, Rui; Möhlendick, Birte; Rihawi, Karim; Ardizzoni, Andrea; Sumanasuriya, Semini; Flohr, Penny; Lambros, Maryou; de Bono, Johann; Stoecklein, Nikolas H; Manaresi, Nicolò
2018-01-01
Chromosomal instability and associated chromosomal aberrations are hallmarks of cancer and play a critical role in disease progression and development of resistance to drugs. Single-cell genome analysis has gained interest in recent years as a source of biomarkers for targeted-therapy selection and drug resistance, and several methods have been developed to amplify the genomic DNA and to produce libraries suitable for Whole Genome Sequencing (WGS). However, most protocols require several enzymatic and cleanup steps, thus increasing the complexity and length of protocols, while robustness and speed are key factors for clinical applications. To tackle this issue, we developed a single-tube, single-step, streamlined protocol, exploiting the ligation-mediated PCR (LM-PCR) Whole Genome Amplification (WGA) method, for low-pass genome sequencing with the Ion Torrent™ platform and copy number alterations (CNAs) calling from single cells. The method was evaluated on single cells isolated from 6 aberrant cell lines of the NCI-H series. In addition, to demonstrate the feasibility of the workflow on clinical samples, we analyzed single circulating tumor cells (CTCs) and white blood cells (WBCs) isolated from the blood of patients affected by prostate cancer or lung adenocarcinoma. The results obtained show that the developed workflow generates data accurately representing whole genome absolute copy number profiles of single cells and allows alteration calling at resolutions down to 100 Kbp with as few as 200,000 reads. The presented data demonstrate the feasibility of the Ampli1™ WGA-based low-pass workflow for detection of CNAs in single tumor cells, which would be of particular interest for genome-driven targeted therapy selection and for monitoring of disease progression.
Scalable Parameter Estimation for Genome-Scale Biochemical Reaction Networks
Kaltenbacher, Barbara; Hasenauer, Jan
2017-01-01
Mechanistic mathematical modeling of biochemical reaction networks using ordinary differential equation (ODE) models has improved our understanding of small- and medium-scale biological processes. While the same should in principle hold for large- and genome-scale processes, computational methods for the analysis of ODE models which describe hundreds or thousands of biochemical species and reactions have so far been missing. While individual simulations are feasible, the inference of the model parameters from experimental data is computationally too intensive. In this manuscript, we evaluate adjoint sensitivity analysis for parameter estimation in large-scale biochemical reaction networks. We present the approach for time-discrete measurements and compare it to state-of-the-art methods used in systems and computational biology. Our comparison reveals a significantly improved computational efficiency and a superior scalability of adjoint sensitivity analysis. The computational complexity is effectively independent of the number of parameters, enabling the analysis of large- and genome-scale models. Our study of a comprehensive kinetic model of ErbB signaling shows that parameter estimation using adjoint sensitivity analysis requires a fraction of the computation time of established methods. The proposed method will facilitate mechanistic modeling of genome-scale cellular processes, as required in the age of omics. PMID:28114351
Hu, Jiazhi; Meyers, Robin M; Dong, Junchao; Panchakshari, Rohit A; Alt, Frederick W; Frock, Richard L
2016-05-01
Unbiased, high-throughput assays for detecting and quantifying DNA double-stranded breaks (DSBs) across the genome in mammalian cells will facilitate basic studies of the mechanisms that generate and repair endogenous DSBs. They will also enable more applied studies, such as those to evaluate the on- and off-target activities of engineered nucleases. Here we describe a linear amplification-mediated high-throughput genome-wide sequencing (LAM-HTGTS) method for the detection of genome-wide 'prey' DSBs via their translocation in cultured mammalian cells to a fixed 'bait' DSB. Bait-prey junctions are cloned directly from isolated genomic DNA using LAM-PCR and unidirectionally ligated to bridge adapters; subsequent PCR steps amplify the single-stranded DNA junction library in preparation for Illumina MiSeq paired-end sequencing. A custom bioinformatics pipeline identifies prey sequences that contribute to junctions and maps them across the genome. LAM-HTGTS differs from related approaches because it detects a wide range of broken end structures with nucleotide-level resolution. Familiarity with nucleic acid methods and next-generation sequencing analysis is necessary for library generation and data interpretation. LAM-HTGTS assays are sensitive, reproducible, relatively inexpensive, scalable and straightforward to implement with a turnaround time of <1 week.
Legarra, A; Baloche, G; Barillet, F; Astruc, J M; Soulas, C; Aguerre, X; Arrese, F; Mintegi, L; Lasarte, M; Maeztu, F; Beltrán de Heredia, I; Ugarte, E
2014-05-01
Genotypes, phenotypes and pedigrees of 6 breeds of dairy sheep (including subdivisions of Latxa, Manech, and Basco-Béarnaise) from the Western Pyrenees of Spain and France were used to estimate genetic relationships across breeds (together with genotypes from the Lacaune dairy sheep) and to verify by forward cross-validation single-breed or multiple-breed genetic evaluations. The number of rams genotyped fluctuated between 100 and 1,300 but generally represented the last 10 cohorts of progeny-tested rams within each breed. Genetic relationships were assessed by principal components analysis of the genomic relationship matrices and also by the conservation of linkage disequilibrium patterns at given physical distances in the genome. Genomic and pedigree-based evaluations used daughter yield performances of all rams, although some of them were not genotyped. A pseudo-single step method was used in this case for genomic predictions. Results showed a clear structure in blond and black breeds for Manech and Latxa, reflecting historical exchanges, and isolation of Basco-Béarnaise and Lacaune. Relatedness between any 2 breeds was, however, lower than expected. Single-breed genomic predictions had accuracies comparable with other breeds of dairy sheep or small breeds of dairy cattle. They were more accurate than pedigree predictions for 5 out of 6 breeds, with absolute increases in accuracy ranging from 0.05 to 0.30 points. They were significantly better, as assessed by bootstrapping of candidates, for 2 of the breeds. Predictions using multiple populations only marginally increased the accuracy for a couple of breeds. Pooling populations does not increase the accuracy of genomic evaluations in dairy sheep; however, single-breed genomic predictions are more accurate, even for small breeds, and make the consideration of genomic schemes in dairy sheep interesting. Copyright © 2014 American Dairy Science Association. Published by Elsevier Inc. All rights reserved.
Liu, Bingqiang; Zhang, Hanyuan; Zhou, Chuan; Li, Guojun; Fennell, Anne; Wang, Guanghui; Kang, Yu; Liu, Qi; Ma, Qin
2016-08-09
Phylogenetic footprinting is an important computational technique for identifying cis-regulatory motifs in orthologous regulatory regions from multiple genomes, as motifs tend to evolve slower than their surrounding non-functional sequences. Its application, however, has several difficulties for optimizing the selection of orthologous data and reducing the false positives in motif prediction. Here we present an integrative phylogenetic footprinting framework for accurate motif predictions in prokaryotic genomes (MP(3)). The framework includes a new orthologous data preparation procedure, an additional promoter scoring and pruning method and an integration of six existing motif finding algorithms as basic motif search engines. Specifically, we collected orthologous genes from available prokaryotic genomes and built the orthologous regulatory regions based on sequence similarity of promoter regions. This procedure made full use of the large-scale genomic data and taxonomy information and filtered out the promoters with limited contribution to produce a high quality orthologous promoter set. The promoter scoring and pruning is implemented through motif voting by a set of complementary predicting tools that mine as many motif candidates as possible and simultaneously eliminate the effect of random noise. We have applied the framework to Escherichia coli k12 genome and evaluated the prediction performance through comparison with seven existing programs. This evaluation was systematically carried out at the nucleotide and binding site level, and the results showed that MP(3) consistently outperformed other popular motif finding tools. We have integrated MP(3) into our motif identification and analysis server DMINDA, allowing users to efficiently identify and analyze motifs in 2,072 completely sequenced prokaryotic genomes. The performance evaluation indicated that MP(3) is effective for predicting regulatory motifs in prokaryotic genomes. 
Its application may enhance progress in elucidating transcription regulation mechanisms, thus benefiting the genomics research community and prokaryotic genome researchers in particular.
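The promoter scoring and pruning step above rests on motif voting across complementary prediction tools. A minimal sketch of that idea, with hypothetical tool outputs and a made-up support threshold (the actual MP(3) scoring is more elaborate):

```python
from collections import Counter

def vote_motifs(tool_predictions, min_support=2):
    """Keep motif candidates reported by at least `min_support` tools.
    `tool_predictions` is a list of per-tool candidate motif lists;
    counting each motif at most once per tool damps tool-specific noise."""
    counts = Counter(m for tool in tool_predictions for m in set(tool))
    return sorted(m for m, c in counts.items() if c >= min_support)

# Hypothetical candidates from three motif finders.
predictions = [["TTGACA", "GCGCGC"], ["TTGACA", "TATAAT"], ["TATAAT", "TTGACA"]]
consensus = vote_motifs(predictions)  # ['TATAAT', 'TTGACA']
```

Candidates seen by only a single engine are discarded as likely random noise, which is the intended effect of combining complementary finders.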
Efficient isolation method for high-quality genomic DNA from cicada exuviae.
Nguyen, Hoa Quynh; Kim, Ye Inn; Borzée, Amaël; Jang, Yikweon
2017-10-01
In recent years, animal ethics issues have led researchers to explore nondestructive methods to access materials for genetic studies. Cicada exuviae are among those materials because they are the cast skins that individuals leave after molting and are easily collected. In this study, we aimed to identify the most efficient extraction method to obtain a high quantity and quality of DNA from cicada exuviae. We compared the relative DNA yield and purity of six extraction protocols, including both manual protocols and available commercial kits, extracting from four different exoskeleton parts. Furthermore, amplification and sequencing of the genomic DNA were evaluated in terms of whether sequences of the expected genomic size could be obtained. Both the choice of protocol and the exuvia part significantly affected DNA yield and purity. Only samples extracted using the PowerSoil DNA Isolation kit generated gel bands of the expected size as well as successful sequencing results. The failed attempts to extract DNA using other protocols could be explained partly by a low DNA yield from cicada exuviae and partly by contamination with humic acids that exist in the soil where cicada nymphs reside before emergence, as shown by spectroscopic measurements. Genomic DNA extracted from cicada exuviae could provide valuable information for species identification, allowing the investigation of genetic diversity across consecutive broods, or spatiotemporal variation among various populations. Consequently, we hope to provide a simple method to acquire pure genomic DNA applicable for multiple research purposes.
Khang, Chang Hyun; Park, Sook-Young; Lee, Yong-Hwan; Kang, Seogchan
2005-06-01
Rapid progress in fungal genome sequencing presents many new opportunities for functional genomic analysis of fungal biology through the systematic mutagenesis of the genes identified through sequencing. However, the lack of efficient tools for targeted gene replacement is a limiting factor for fungal functional genomics, as it often necessitates the screening of a large number of transformants to identify the desired mutant. We developed an efficient method of gene replacement and evaluated factors affecting the efficiency of this method using two plant pathogenic fungi, Magnaporthe grisea and Fusarium oxysporum. This method is based on Agrobacterium tumefaciens-mediated transformation with a mutant allele of the target gene flanked by the herpes simplex virus thymidine kinase (HSVtk) gene as a conditional negative selection marker against ectopic transformants. The HSVtk gene product converts 5-fluoro-2'-deoxyuridine to a compound toxic to diverse fungi. Because ectopic transformants express HSVtk, while gene replacement mutants lack HSVtk, growing transformants on a medium amended with 5-fluoro-2'-deoxyuridine facilitates the identification of targeted mutants by counter-selecting against ectopic transformants. In addition to M. grisea and F. oxysporum, the method and associated vectors are likely to be applicable to manipulating genes in a broad spectrum of fungi, thus potentially serving as an efficient, universal functional genomic tool for harnessing the growing body of fungal genome sequence data to study fungal biology.
Takemoto, Kazuhiro; Aie, Kazuki
2017-05-25
Host-pathogen interactions are important in a wide range of research fields. Given the importance of metabolic crosstalk between hosts and pathogens, a metabolic network-based reverse ecology method was proposed to infer these interactions. However, the validity of this method remains unclear because of the various explanations presented and the influence of potentially confounding factors that have thus far been neglected. We re-evaluated the importance of the reverse ecology method for evaluating host-pathogen interactions while statistically controlling for confounding effects using oxygen requirement, genome, metabolic network, and phylogeny data. Our data analyses showed that host-pathogen interactions were more strongly influenced by genome size, primary network parameters (e.g., number of edges), oxygen requirement, and phylogeny than by the reverse ecology-based measures. These results indicate the limitations of the reverse ecology method; however, they do not discount the importance of adopting reverse ecology approaches altogether. Rather, we highlight the need for developing more suitable methods for inferring host-pathogen interactions and conducting more careful examinations of the relationships between metabolic networks and host-pathogen interactions.
DeMaere, Matthew Z.
2016-01-01
Background Chromosome conformation capture, coupled with high throughput DNA sequencing in protocols like Hi-C and 3C-seq, has been proposed as a viable means of generating data to resolve the genomes of microorganisms living in naturally occurring environments. Metagenomic Hi-C and 3C-seq datasets have begun to emerge, but the feasibility of resolving genomes when closely related organisms (strain-level diversity) are present in the sample has not yet been systematically characterised. Methods We developed a computational simulation pipeline for metagenomic 3C and Hi-C sequencing to evaluate the accuracy of genomic reconstructions at, above, and below an operationally defined species boundary. We simulated datasets and measured accuracy over a wide range of parameters. Five clustering algorithms were evaluated (2 hard, 3 soft) using an adaptation of the extended B-cubed validation measure. Results When all genomes in a sample are below 95% sequence identity, all of the tested clustering algorithms performed well. When sequence data contains genomes above 95% identity (our operational definition of strain-level diversity), a naive soft-clustering extension of the Louvain method achieves the highest performance. Discussion Previously, only hard-clustering algorithms have been applied to metagenomic 3C and Hi-C data, yet none of these perform well when strain-level diversity exists in a metagenomic sample. Our simple extension of the Louvain method performed the best in these scenarios, however, accuracy remained well below the levels observed for samples without strain-level diversity. Strain resolution is also highly dependent on the amount of available 3C sequence data, suggesting that depth of sequencing must be carefully considered during experimental design. Finally, there appears to be great scope to improve the accuracy of strain resolution through further algorithm development. PMID:27843713
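The B-cubed family of measures used above scores each item by the overlap between its assigned cluster and its true class. A minimal sketch of the plain hard-clustering case (the study uses an extended, soft-clustering adaptation; this simplified version only illustrates the principle):

```python
def bcubed(clusters, truth):
    """B-cubed precision and recall for a hard clustering.
    `clusters` and `truth` map each item to a cluster/class label."""
    items = list(truth)
    precision = recall = 0.0
    for e in items:
        same_cluster = [o for o in items if clusters[o] == clusters[e]]
        same_class = [o for o in items if truth[o] == truth[e]]
        correct = sum(1 for o in same_cluster if truth[o] == truth[e])
        precision += correct / len(same_cluster)
        recall += correct / len(same_class)
    n = len(items)
    return precision / n, recall / n

# Toy example: item "b" is placed in the wrong cluster.
precision, recall = bcubed({"a": 1, "b": 2, "c": 2}, {"a": 1, "b": 1, "c": 2})
```

Unlike pairwise measures, B-cubed degrades gracefully with singleton clusters, which is one reason it is favoured for evaluating genome-binning output.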
2013-01-01
Background Efficient screening of bacterial artificial chromosome (BAC) libraries with polymerase chain reaction (PCR)-based markers is feasible provided that a multidimensional pooling strategy is implemented. Single nucleotide polymorphisms (SNPs) can be screened in multiplexed format, therefore this marker type lends itself particularly well to medium- to high-throughput applications. Combining the power of multiplex-PCR assays with a multidimensional pooling system may prove to be especially challenging in a polyploid genome. In polyploid genomes two classes of SNPs need to be distinguished, polymorphisms between accessions (intragenomic SNPs) and those differentiating between homoeologous genomes (intergenomic SNPs). We have assessed whether the highly parallel Illumina GoldenGate® Genotyping Assay is suitable for the screening of a BAC library of the polyploid Brassica napus genome. Results A multidimensional screening platform was developed for a Brassica napus BAC library which is composed of almost 83,000 clones. Intragenomic and intergenomic SNPs were included in Illumina’s GoldenGate® Genotyping Assay and both SNP classes were used successfully for screening of the multidimensional BAC pools of the Brassica napus library. An optimized scoring method is proposed which is especially valuable for SNP calling of intergenomic SNPs. Validation of the genotyping results by independent methods revealed a success rate of approximately 80% for the multiplex PCR-based screening regardless of whether intra- or intergenomic SNPs were evaluated. Conclusions Illumina’s GoldenGate® Genotyping Assay can be efficiently used for screening of multidimensional Brassica napus BAC pools. SNP calling was specifically tailored for the evaluation of BAC pool screening data. The developed scoring method can be implemented independently of plant reference samples. It is demonstrated that intergenomic SNPs represent a powerful tool for BAC library screening of a polyploid genome.
PMID:24010766
Descriptive Statistics of the Genome: Phylogenetic Classification of Viruses.
Hernandez, Troy; Yang, Jie
2016-10-01
The typical process for classifying and submitting a newly sequenced virus to the NCBI database involves two steps. First, a BLAST search is performed to determine likely family candidates. That is followed by checking the candidate families with the pairwise sequence alignment tool for similar species. The submitter's judgment is then used to determine the most likely species classification. The aim of this article is to show that this process can be automated into a fast, accurate, one-step process using the proposed alignment-free method and properly implemented machine learning techniques. We present a new family of alignment-free vectorizations of the genome, the generalized vector, that maintains the speed of existing alignment-free methods while outperforming all available methods. This new alignment-free vectorization uses the frequency of genomic words (k-mers), as is done in the composition vector, and incorporates descriptive statistics of those k-mers' positional information, as inspired by the natural vector. We analyze five different characterizations of genome similarity using k-nearest neighbor classification and evaluate these on two collections of viruses totaling over 10,000 viruses. We show that our proposed method performs better than, or as well as, other methods at every level of the phylogenetic hierarchy. The data and R code are available upon request.
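The generalized-vector idea above, k-mer frequencies augmented with descriptive statistics of k-mer positions, can be sketched as follows. The exact statistics and normalizations in the paper differ, so treat this as an illustration only; the sequences below are toy data.

```python
from collections import defaultdict
from itertools import product
import math

def kmer_positional_vector(seq, k=2):
    """Per k-mer: frequency plus length-normalized mean and variance of its positions."""
    positions = defaultdict(list)
    for i in range(len(seq) - k + 1):
        positions[seq[i:i + k]].append(i)
    total = len(seq) - k + 1
    vec = []
    for kmer in ("".join(p) for p in product("ACGT", repeat=k)):
        pos = positions[kmer]
        freq = len(pos) / total
        mean = (sum(pos) / len(pos)) / len(seq) if pos else 0.0
        var = (sum((x - sum(pos) / len(pos)) ** 2 for x in pos) / len(pos)) / len(seq) ** 2 if pos else 0.0
        vec += [freq, mean, var]
    return vec

def nearest_label(query_vec, labelled_vecs):
    """1-nearest-neighbour classification by Euclidean distance."""
    dist = lambda a, b: math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    return min(labelled_vecs, key=lambda lv: dist(query_vec, lv[1]))[0]

# Toy reference genomes with made-up labels.
reference = [("alpha", kmer_positional_vector("AAAAAAAATAAAAAAA")),
             ("beta", kmer_positional_vector("GGGGGGGGCGGGGGGG"))]
label = nearest_label(kmer_positional_vector("AAAATAAAAAAAAAAA"), reference)
```

Because no alignment is computed, vectorization is linear in sequence length, which is what keeps such methods fast at the 10,000-virus scale described above.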
Effective normalization for copy number variation detection from whole genome sequencing.
Janevski, Angel; Varadan, Vinay; Kamalakaran, Sitharthan; Banerjee, Nilanjana; Dimitrova, Nevenka
2012-01-01
Whole genome sequencing enables a high resolution view of the human genome and provides unique insights into genome structure at an unprecedented scale. There have been a number of tools to infer copy number variation in the genome. These tools, while validated, also include a number of parameters that are configurable to the genome data being analyzed. These algorithms allow for normalization to account for individual and population-specific effects on individual genome CNV estimates, but the impact of these changes on the estimated CNVs is not well characterized. We evaluate in detail the effect of normalization methodologies in two CNV algorithms, FREEC and CNV-seq, using whole genome sequencing data from 8 individuals spanning four populations. We apply FREEC and CNV-seq to a sequencing data set consisting of 8 genomes. We use multiple configurations corresponding to different read-count normalization methodologies in FREEC, and statistically characterize the concordance of the CNV calls between FREEC configurations and the analogous output from CNV-seq. The normalization methodologies evaluated in FREEC are: GC content, mappability and control genome. We further stratify the concordance analysis within genic, non-genic, and a collection of validated variant regions. The GC content normalization methodology generates the highest number of altered copy number regions. Both mappability and control genome normalization reduce the total number and length of copy number regions. Relative to normalization based on GC content, mappability normalization yields Jaccard indices in the 0.07 - 0.3 range, whereas control genome normalization yields Jaccard index values around 0.4. The most critical impact of using mappability as a normalization factor is a substantial reduction of deletion CNV calls. The output of another method based on control genome normalization, CNV-seq, resulted in comparable CNV call profiles, and substantial agreement in variable gene and CNV region calls.
Choice of read-count normalization methodology has a substantial effect on CNV calls and the use of genomic mappability or an appropriately chosen control genome can optimize the output of CNV analysis.
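Concordance between two sets of CNV calls, as reported above, is commonly summarized with the Jaccard index over the genomic bases the calls cover. A small illustrative implementation (position sets, so suited only to toy-scale coordinates; real tools use interval arithmetic):

```python
def covered(regions):
    """Expand half-open (start, end) regions into the set of covered positions."""
    bases = set()
    for start, end in regions:
        bases.update(range(start, end))
    return bases

def jaccard(calls_a, calls_b):
    """Jaccard index of two CNV call sets: |intersection| / |union| of covered bases."""
    a, b = covered(calls_a), covered(calls_b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

# Two overlapping calls: 50 shared bases out of 150 covered overall.
similarity = jaccard([(0, 100)], [(50, 150)])  # 1/3
```

On this scale, the reported values of 0.07-0.3 versus 0.4 indicate that most called bases differ between normalization configurations, which is why the choice matters.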
Clinical and Biological Relevance of Genomic Heterogeneity in Chronic Lymphocytic Leukemia
Friedman, Daphne R.; Lucas, Joseph E.; Weinberg, J. Brice
2013-01-01
Background Chronic lymphocytic leukemia (CLL) is typically regarded as an indolent B-cell malignancy. However, there is wide variability with regards to need for therapy, time to progressive disease, and treatment response. This clinical variability is due, in part, to biological heterogeneity between individual patients’ leukemias. While much has been learned about this biological variation using genomic approaches, it is unclear whether such efforts have sufficiently evaluated biological and clinical heterogeneity in CLL. Methods To study the extent of genomic variability in CLL and the biological and clinical attributes of genomic classification in CLL, we evaluated 893 unique CLL samples from fifteen publicly available gene expression profiling datasets. We used unsupervised approaches to divide the data into subgroups, evaluated the biological pathways and genetic aberrations that were associated with the subgroups, and compared prognostic and clinical outcome data between the subgroups. Results Using an unsupervised approach, we determined that approximately 600 CLL samples are needed to define the spectrum of diversity in CLL genomic expression. We identified seven genomically-defined CLL subgroups that have distinct biological properties, are associated with specific chromosomal deletions and amplifications, and have marked differences in molecular prognostic markers and clinical outcomes. Conclusions Our results indicate that investigations focusing on small numbers of patient samples likely provide a biased outlook on CLL biology. These findings may have important implications in identifying patients who should be treated with specific targeted therapies, which could have efficacy against CLL cells that rely on specific biological pathways. PMID:23468975
NASA Astrophysics Data System (ADS)
Su, Hailin; Li, Hengde; Wang, Shi; Wang, Yangfan; Bao, Zhenmin
2017-02-01
Genomic selection is increasingly popular in animal and plant breeding industries all around the world, as it can be applied early in life without impacting selection candidates. The objective of this study was to bring the advantages of genomic selection to scallop breeding. Two different genomic selection tools, MixP and gsbay, were applied to the genomic evaluation of simulated data and Zhikong scallop (Chlamys farreri) field data. The results were compared with the genomic best linear unbiased prediction (GBLUP) method, which has been widely applied. Our results showed that both MixP and gsbay could accurately estimate single-nucleotide polymorphism (SNP) marker effects, and thereby could be applied for the analysis of genomic estimated breeding values (GEBV). In simulated data from different scenarios, the accuracy of the GEBV obtained ranged from 0.20 to 0.78 with MixP, from 0.21 to 0.67 with gsbay, and from 0.21 to 0.61 with GBLUP. Estimates made by MixP and gsbay were expected to be more reliable than those from GBLUP. Predictions made by gsbay were more robust, while MixP computes much faster, especially when dealing with large-scale data. These results suggest that the algorithms implemented by both MixP and gsbay are feasible for carrying out genomic selection in scallop breeding, and that more genotype data will be necessary to produce genomic estimated breeding values with a higher accuracy for the industry.
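GEBV computation of the kind referenced above rests on estimating marker effects from genotypes and phenotypes. A minimal pure-Python sketch of SNP-BLUP (ridge regression of marker effects, equivalent to GBLUP under standard assumptions); the toy data and the shrinkage value are made up, and a real analysis would use a linear-algebra library:

```python
def solve(matrix, rhs):
    """Gauss-Jordan elimination with partial pivoting for small dense systems."""
    n = len(rhs)
    aug = [row[:] + [rhs[i]] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col and aug[r][col]:
                f = aug[r][col] / aug[col][col]
                aug[r] = [x - f * y for x, y in zip(aug[r], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]

def snp_blup(genotypes, phenotypes, shrinkage):
    """Estimate marker effects u from (Z'Z + shrinkage*I) u = Z'y,
    then return GEBV = Z u for each individual."""
    n, p = len(genotypes), len(genotypes[0])
    ztz = [[sum(genotypes[i][a] * genotypes[i][b] for i in range(n))
            + (shrinkage if a == b else 0.0) for b in range(p)] for a in range(p)]
    zty = [sum(genotypes[i][a] * phenotypes[i] for i in range(n)) for a in range(p)]
    u = solve(ztz, zty)
    return [sum(z * e for z, e in zip(row, u)) for row in genotypes]

# Made-up toy data: 3 individuals, 2 SNPs.
gebv = snp_blup([[1, 0], [0, 1], [1, 1]], [1.0, 1.0, 2.0], shrinkage=0.0)
```

With zero shrinkage this reduces to ordinary least squares; in practice the shrinkage is derived from the ratio of residual to marker-effect variance, which is where methods such as MixP, gsbay and GBLUP differ.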
NGSPanPipe: A Pipeline for Pan-genome Identification in Microbial Strains from Experimental Reads.
Kulsum, Umay; Kapil, Arti; Singh, Harpreet; Kaur, Punit
2018-01-01
Recent advancements in sequencing technologies have decreased both the time and cost of sequencing whole bacterial genomes. High-throughput next-generation sequencing (NGS) technology has led to the generation of enormous amounts of data on microbial populations, publicly available across various repositories. As a consequence, it has become possible to study and compare the genomes of different bacterial strains within a species or genus in terms of evolution, ecology and diversity. Studying the pan-genome provides insights into deciphering microevolution, global composition and diversity in virulence and pathogenesis of a species. It can also assist in identifying drug targets and proposing vaccine candidates. The effective analysis of these large genome datasets necessitates the development of robust tools. Current methods for pan-genome construction do not support direct input of raw reads from the sequencer but require the reads to be preprocessed into an assembled protein/gene sequence file or a binary matrix of orthologous genes/proteins. We have designed an easy-to-use integrated pipeline, NGSPanPipe, which can directly identify the pan-genome from short reads. The output from the pipeline is compatible with other pan-genome analysis tools. We evaluated our pipeline against other methods for developing a pan-genome, i.e. reference-based assembly and de novo assembly, using simulated reads of Mycobacterium tuberculosis. The single script pipeline (pipeline.pl) is applicable to all bacterial strains. It integrates multiple in-house Perl scripts and is freely accessible from https://github.com/Biomedinformatics/NGSPanPipe .
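The pan-genome concept summarized above reduces, once orthologous gene clusters have been called per strain, to simple set operations: the pan-genome is the union of clusters and the core genome their intersection. A minimal sketch, with hypothetical gene names (not from the NGSPanPipe codebase):

```python
# Illustrative pan/core/accessory genome computation from per-strain gene sets.
# Strain and gene names are invented for demonstration.

strains = {
    "strain_A": {"gyrA", "rpoB", "katG", "pks1"},
    "strain_B": {"gyrA", "rpoB", "katG", "esxA"},
    "strain_C": {"gyrA", "rpoB", "embB"},
}

pan_genome = set().union(*strains.values())        # all genes seen in any strain
core_genome = set.intersection(*strains.values())  # genes shared by every strain
accessory = pan_genome - core_genome               # variably present genes
```

Real pipelines add the hard part, calling the orthologous clusters from reads, but downstream analysis tools typically consume exactly this kind of presence/absence structure.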
Bharathi, Madasamy Jayahar; Murugan, Nandagopal; Rameshkumar, Gunasekaran; Ramakrishnan, Rengappa; Venugopal Reddy, Yerahaia Chinna; Shivkumar, Chandrasekar; Ramesh, Srinivasan
2013-05-01
This study aimed to determine the utility of various polymerase chain reaction (PCR) methods in vitreous fluids (VFs) for detecting infectious genomes in the diagnosis of infectious endophthalmitis, in terms of sensitivity and specificity. This prospective and consecutive analysis included a total of 66 VFs submitted for microbiological evaluation, obtained from 66 clinically diagnosed endophthalmitis patients who presented between November 2010 and October 2011 at a tertiary eye care referral centre in South India. Part of each collected VF was subjected to cultures and smears, and the remaining part was utilized for five PCR methods: uniplex, nested, semi-nested, multiplex and nested multiplex after extracting DNA, using universal eubacterial and Propionibacterium acnes species-specific primer sets targeting the 16S rRNA gene in all bacteria and P. acnes, and panfungal primers targeting the 28S rRNA gene in all fungi. Of the 66 VFs, five (7.5%) showed positive results in smears, 16 (24%) in cultures and 43 (65%) in PCRs. Among the 43 positively amplified VFs, 10 (15%) were positive for the P. acnes genome, one for the panfungal genome and 42 (62%) for the eubacterial genome (including 10 P. acnes positives). Among the 42 eubacterial-positive VFs, 36 were positive by both uniplex (first round) and multiplex (first round) PCRs, while nested (second round) and nested multiplex (second round) PCRs produced positive results in 42 and 41 VFs, respectively. Of the 43 PCR-positive specimens, 16 (37%) had positive growth (15 bacterial and one fungal) in culture. Of the 50 culture-negative specimens, 27 (54%) showed positive amplification, of which 10 were amplified for both P. acnes and eubacterial genomes and the remaining 17 for the eubacterial genome alone. Nested PCRs were superior to uniplex and multiplex PCRs. PCRs proved to be a powerful tool in the diagnosis of endophthalmitis, especially for detecting uncultured microbes.
Malin, Bradley A.
2005-01-01
The incorporation of genomic data into personal medical records poses many challenges to patient privacy. In response, various systems for preserving patient privacy in shared genomic data have been developed and deployed. Although these systems de-identify the data by removing explicit identifiers (e.g., name, address, or Social Security number) and incorporate sound security design principles, they suffer from a lack of formal modeling of inferences learnable from shared data. This report evaluates the extent to which current protection systems are capable of withstanding a range of re-identification methods, including genotype–phenotype inferences, location–visit patterns, family structures, and dictionary attacks. For a comparative re-identification analysis, the systems are mapped to a common formalism. Although there is variation in susceptibility, each system is deficient in its protection capacity. The author discovers patterns of protection failure and discusses several of the reasons why these systems are susceptible. The analyses and discussion within provide guideposts for the development of next-generation protection methods amenable to formal proofs. PMID:15492030
Toward a standard in structural genome annotation for prokaryotes
Tripp, H. James; Sutton, Granger; White, Owen; ...
2015-07-25
In an effort to identify the best practice for finding genes in prokaryotic genomes and propose it as a standard for automated annotation pipelines, we collected 1,004,576 peptides from various publicly available resources, and these were used as a basis to evaluate various gene-calling methods. The peptides came from 45 bacterial replicons with GC content ranging from 31% to 74%, biased toward higher GC content genomes. Automated, manual, and semi-manual methods were used to tally errors in three widely used gene calling methods, as evidenced by peptides mapped outside the boundaries of called genes. We found that the consensus set of identical genes predicted by the three methods constitutes only about 70% of the genes predicted by each individual method (with start and stop required to coincide). Peptide data were useful for evaluating some of the differences between gene callers, but not reliable enough to make the results conclusive, due to limitations inherent in any proteogenomic study. A single, unambiguous, unanimous best practice did not emerge from this analysis, since the available proteomics data were not adequate to provide an objective measurement of differences in the accuracy between these methods. However, as a result of this study, software, reference data, and procedures have been better matched among participants, representing a step toward a much-needed standard. In the absence of a sufficient amount of experimental data to achieve a universal standard, our recommendation is that any of these methods can be used by the community, as long as a single method is employed across all datasets to be compared.
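The "consensus set" above (genes on which all three callers agree, with start and stop required to coincide) is a straightforward set intersection over gene-call coordinates. A hedged sketch with invented calls:

```python
# Toy consensus of three gene callers: a call is a (start, stop, strand) tuple,
# and agreement requires exact coordinate identity. All coordinates are invented.

caller_a = {(100, 400, "+"), (500, 900, "-"), (1000, 1200, "+")}
caller_b = {(100, 400, "+"), (510, 900, "-"), (1000, 1200, "+")}  # start shifted
caller_c = {(100, 400, "+"), (500, 900, "-"), (1000, 1250, "+")}  # stop shifted

consensus = caller_a & caller_b & caller_c
fraction_of_a = len(consensus) / len(caller_a)  # share of caller A's calls confirmed
```

Even small start-site disagreements drop a gene out of the consensus, which is one reason the study found the exact-agreement consensus covering only about 70% of each caller's predictions.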
Secure searching of biomarkers through hybrid homomorphic encryption scheme.
Kim, Miran; Song, Yongsoo; Cheon, Jung Hee
2017-07-26
As genome sequencing technology develops rapidly, there has lately been an increasing need to keep genomic data secure even when stored in the cloud and still used for research. We are interested in designing a protocol for the secure outsourced matching problem on encrypted data. We propose an efficient method to securely search for a matching position with the query data and extract some information at that position. After decryption, only a small number of comparisons with the query information need to be performed in the plaintext state. We apply this method to find a set of biomarkers in encrypted genomes. The important feature of our method is to encode a genomic database as a single element of a polynomial ring. Since our method requires a single homomorphic multiplication of the hybrid scheme for query computation, it has the advantage over previous methods in parameter size, computation complexity, and communication cost. In particular, the extraction procedure not only prevents leakage of database information that has not been queried by the user but also reduces the communication cost by half. We evaluate the performance of our method and verify that the computation on large-scale personal data can be securely and practically outsourced to a cloud environment during data analysis. It takes about 3.9 s to search-and-extract the reference and alternate sequences at the queried position in a database of size 4M. Our solution for finding a set of biomarkers in DNA sequences shows the progress of cryptographic techniques in terms of their capability to support real-world genome data analysis in a cloud environment.
Improvement of Predictive Ability by Uniform Coverage of the Target Genetic Space
Bustos-Korts, Daniela; Malosetti, Marcos; Chapman, Scott; Biddulph, Ben; van Eeuwijk, Fred
2016-01-01
Genome-enabled prediction provides breeders with the means to increase the number of genotypes that can be evaluated for selection. One of the major challenges in genome-enabled prediction is how to construct a training set of genotypes from a calibration set that represents the target population of genotypes, where the calibration set is composed of a training and validation set. A random sampling protocol of genotypes from the calibration set will lead to low quality coverage of the total genetic space by the training set when the calibration set contains population structure. As a consequence, predictive ability will be affected negatively, because some parts of the genotypic diversity in the target population will be under-represented in the training set, whereas other parts will be over-represented. Therefore, we propose a training set construction method that uniformly samples the genetic space spanned by the target population of genotypes, thereby increasing predictive ability. To evaluate our method, we constructed training sets, together with corresponding genomic prediction models, for four genotype panels that differed in the amount of population structure they contained (maize Flint, maize Dent, wheat, and rice). Training sets were constructed using uniform sampling, stratified-uniform sampling, stratified sampling and random sampling. We compared these methods with a method that maximizes the generalized coefficient of determination (CD). Several training set sizes were considered. We investigated four genomic prediction models: multi-locus QTL models, GBLUP models, combinations of QTL and GBLUPs, and Reproducing Kernel Hilbert Space (RKHS) models. For the maize and wheat panels, construction of the training set under uniform sampling led to a larger predictive ability than under stratified and random sampling. The results of our methods were similar to those of the CD method.
For the rice panel, all training set construction methods led to similar predictive ability, a reflection of the very strong population structure in this panel. PMID:27672112
Genome-reconstruction for eukaryotes from complex natural microbial communities.
West, Patrick T; Probst, Alexander J; Grigoriev, Igor V; Thomas, Brian C; Banfield, Jillian F
2018-04-01
Microbial eukaryotes are integral components of natural microbial communities, and their inclusion is critical for many ecosystem studies, yet the majority of published metagenome analyses ignore eukaryotes. In order to include eukaryotes in environmental studies, we propose a method to recover eukaryotic genomes from complex metagenomic samples. A key step for genome recovery is separation of eukaryotic and prokaryotic fragments. We developed a k-mer-based strategy, EukRep, for eukaryotic sequence identification and applied it to environmental samples to show that it enables genome recovery, genome completeness evaluation, and prediction of metabolic potential. We used this approach to test the effect of addition of organic carbon on a geyser-associated microbial community and detected a substantial change of the community metabolism, with selection against almost all candidate phyla bacteria and archaea and for eukaryotes. Near complete genomes were reconstructed for three fungi placed within the Eurotiomycetes and an arthropod. While carbon fixation and sulfur oxidation were important functions in the geyser community prior to carbon addition, the organic carbon-impacted community showed enrichment for secreted proteases, secreted lipases, cellulose-targeting CAZymes, and methanol oxidation. We demonstrate the broader utility of EukRep by reconstructing and evaluating relatively high-quality fungal, protist, and rotifer genomes from complex environmental samples. This approach opens the way for cultivation-independent analyses of whole microbial communities. © 2018 West et al.; Published by Cold Spring Harbor Laboratory Press.
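The k-mer idea behind a classifier like EukRep can be sketched minimally (this is not EukRep's actual model): each contig is summarized by its k-mer frequency profile, and a trained classifier then separates eukaryotic from prokaryotic sequence on those features.

```python
# Toy k-mer frequency profile of a contig; the sequence is invented.
from collections import Counter

def kmer_profile(seq, k=3):
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts.values())
    return {kmer: n / total for kmer, n in counts.items()}

profile = kmer_profile("ATGCGATATGCGAT", k=3)
# A real pipeline would feed such profiles, computed over 5-mers or similar,
# into a classifier trained on labeled eukaryotic and prokaryotic sequence.
```

The profile is a probability distribution over observed k-mers, which is why compositional differences between eukaryotic and prokaryotic genomes become usable as classification features.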
Kim, Hyun-Joong; Ryu, Ji-Oh; Song, Ji-Yeon; Kim, Hae-Yeong
2017-07-01
In the detection of Shigella species using molecular biological methods, previously known genetic markers for Shigella species were not sufficient to discriminate between Shigella species and diarrheagenic Escherichia coli. The purposes of this study were to screen for genetic markers of the Shigella genus and four Shigella species through comparative genomics and to develop a multiplex polymerase chain reaction (PCR) for the detection of shigellae and Shigella species. A total of seven genomic DNA sequences from Shigella species were subjected to comparative genomics for the screening of genetic markers of shigellae and each Shigella species. The primer sets were designed from the screened genetic markers and evaluated using PCR with genomic DNAs from Shigella and other bacterial strains in Enterobacteriaceae. A novel Shigella quintuplex PCR, designed for the detection of the Shigella genus, S. dysenteriae, S. boydii, S. flexneri, and S. sonnei, was developed from the evaluated primer sets, and its performance was demonstrated with specifically amplified results from each Shigella species. This Shigella multiplex PCR is the first to be reported with novel genetic markers developed through comparative genomics and may be a useful tool for the accurate detection of the Shigella genus and species from closely related bacteria in clinical microbiology and food safety.
Laurenson, Yan C S M; Kyriazakis, Ilias; Bishop, Stephen C
2013-10-18
Estimated breeding values (EBV) for faecal egg count (FEC) and genetic markers for host resistance to nematodes may be used to identify resistant animals for selective breeding programmes. Similarly, targeted selective treatment (TST) requires the ability to identify the animals that will benefit most from anthelmintic treatment. A mathematical model was used to combine the concepts and evaluate the potential of using genetic-based methods to identify animals for a TST regime. EBVs obtained by genomic prediction were predicted to be the best determinant criterion for TST in terms of the impact on average empty body weight and average FEC, whereas pedigree-based EBVs for FEC were predicted to be marginally worse than using phenotypic FEC as a determinant criterion. Whilst each method has financial implications, if the identification of host resistance is incorporated into wider genomic selection indices or selective breeding programmes, then genetic or genomic information may plausibly be included in TST regimes. Copyright © 2013 Elsevier B.V. All rights reserved.
Tan, Cheng; Wu, Zhenfang; Ren, Jiangli; Huang, Zhuolin; Liu, Dewu; He, Xiaoyan; Prakapenka, Dzianis; Zhang, Ran; Li, Ning; Da, Yang; Hu, Xiaoxiang
2017-03-29
The number of teats in pigs is related to a sow's ability to rear piglets to weaning age. Several studies have identified genes and genomic regions that affect teat number in swine but few common results were reported. The objective of this study was to identify genetic factors that affect teat number in pigs, evaluate the accuracy of genomic prediction, and evaluate the contribution of significant genes and genomic regions to genomic broad-sense heritability and prediction accuracy using 41,108 autosomal single nucleotide polymorphisms (SNPs) from genotyping-by-sequencing on 2936 Duroc boars. Narrow-sense heritability and dominance heritability of teat number estimated by genomic restricted maximum likelihood were 0.365 ± 0.030 and 0.035 ± 0.019, respectively. The accuracy of genomic predictions, calculated as the average correlation between the genomic best linear unbiased prediction and phenotype in a tenfold validation study, was 0.437 ± 0.064 for the model with additive and dominance effects and 0.435 ± 0.064 for the model with additive effects only. Genome-wide association studies (GWAS) using three methods of analysis identified 85 significant SNP effects for teat number on chromosomes 1, 6, 7, 10, 11, 12 and 14. The region between 102.9 and 106.0 Mb on chromosome 7, which was reported in several studies, had the most significant SNP effects in or near the PTGR2, FAM161B, LIN52, VRTN, FCF1, AREL1 and LRRC74A genes. This region accounted for 10.0% of the genomic additive heritability and 8.0% of the accuracy of prediction. The second most significant chromosome region not reported by previous GWAS was the region between 77.7 and 79.7 Mb on chromosome 11, where SNPs in the FGF14 gene had the most significant effect and accounted for 5.1% of the genomic additive heritability and 5.2% of the accuracy of prediction. The 85 significant SNPs accounted for 28.5 to 28.8% of the genomic additive heritability and 35.8 to 36.8% of the accuracy of prediction. 
The three methods used for the GWAS identified 85 significant SNPs with additive effects on teat number, including SNPs in a previously reported chromosomal region and SNPs in novel chromosomal regions. Most significant SNPs with larger estimated effects also had larger contributions to the total genomic heritability and accuracy of prediction than other SNPs.
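The "model with additive and dominance effects" above relies on a standard SNP coding; conventions vary between studies, but one common choice (shown here as an assumption, not the authors' exact parameterization) maps minor-allele counts 0/1/2 to an additive covariate and a heterozygote indicator for dominance.

```python
# One common additive + dominance coding of SNP genotypes (illustrative choice).

def encode(genotype):
    """genotype = copies of the minor allele (0, 1 or 2)."""
    additive = genotype - 1               # -1, 0, 1: linear allele-dosage effect
    dominance = 1 if genotype == 1 else 0  # 1 only for heterozygotes
    return additive, dominance

codes = [encode(g) for g in [0, 1, 2]]
```

Fitting both covariates per SNP is what lets a model partition genomic variance into the narrow-sense (additive) and dominance heritabilities reported above.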
Luo, Li; Zhu, Yun; Xiong, Momiao
2012-01-01
The genome-wide association studies (GWAS) designed for next-generation sequencing data involve testing association of genomic variants, including common, low frequency, and rare variants. The current strategies for association studies are well developed for identifying association of common variants with common diseases, but may be ill-suited when large amounts of allelic heterogeneity are present in sequence data. Recently, group tests that analyze collective frequency differences between cases and controls have shifted the current variant-by-variant analysis paradigm for GWAS of common variants to the collective test of multiple variants in the association analysis of rare variants. However, group tests ignore differences in genetic effects among SNPs at different genomic locations. As an alternative to group tests, we developed a novel genome-information content-based statistic for testing association of the entire allele frequency spectrum of genomic variation with disease. To evaluate the performance of the proposed statistic, we use large-scale simulations based on whole genome low coverage pilot data in the 1000 Genomes Project to calculate the type 1 error rates and power of seven alternative statistics: the genome-information content-based statistic, the generalized T2, the collapsing method, the combined multivariate and collapsing (CMC) method, the individual χ2 test, the weighted-sum statistic, and the variable threshold statistic. Finally, we apply the seven statistics to a published resequencing dataset from the ANGPTL3, ANGPTL4, ANGPTL5, and ANGPTL6 genes in the Dallas Heart Study. We report that the genome-information content-based statistic has significantly better type 1 error rates and higher power than the other six statistics in both simulated and empirical datasets. PMID:22651812
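The "collapsing" idea behind group tests such as CMC, criticized above for ignoring per-SNP effect differences, can be sketched in a few lines: rare variants in a region are collapsed into a single carrier indicator per individual, and carrier frequency is compared between cases and controls. The genotype matrices below are invented (rows = individuals, columns = variants).

```python
# Toy collapsing (burden-style) comparison of rare-variant carrier frequencies.

def carriers(genotypes):
    """Count individuals carrying at least one rare allele in the region."""
    return sum(1 for row in genotypes if any(g > 0 for g in row))

cases = [[0, 1, 0], [0, 0, 0], [1, 0, 0], [0, 0, 1]]
controls = [[0, 0, 0], [0, 0, 0], [0, 1, 0], [0, 0, 0]]

case_freq = carriers(cases) / len(cases)
control_freq = carriers(controls) / len(controls)
# A real test would compare these with a chi-square or Fisher's exact test;
# note that which variant is carried is discarded, the critique made above.
```

Collapsing gains power when many rare variants act in the same direction, but, as the abstract argues, it throws away positional and effect-size information that spectrum-wide statistics try to retain.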
Predicting Protein Function by Genomic Context: Quantitative Evaluation and Qualitative Inferences
Huynen, Martijn; Snel, Berend; Lathe, Warren; Bork, Peer
2000-01-01
Various new methods have been proposed to predict functional interactions between proteins based on the genomic context of their genes. The types of genomic context that they use are Type I: the fusion of genes; Type II: the conservation of gene-order or co-occurrence of genes in potential operons; and Type III: the co-occurrence of genes across genomes (phylogenetic profiles). Here we compare these types for their coverage, their correlations with various types of functional interaction, and their overlap with homology-based function assignment. We apply the methods to Mycoplasma genitalium, the standard benchmarking genome in computational and experimental genomics. Quantitatively, conservation of gene order is the technique with the highest coverage, applying to 37% of the genes. By combining gene order conservation with gene fusion (6%), the co-occurrence of genes in operons in absence of gene order conservation (8%), and the co-occurrence of genes across genomes (11%), significant context information can be obtained for 50% of the genes (the categories overlap). Qualitatively, we observe that the functional interactions between genes are stronger as the requirements for physical neighborhood on the genome are more stringent, while the fraction of potential false positives decreases. Moreover, only in cases in which gene order is conserved in a substantial fraction of the genomes, in this case six out of twenty-five, does a single type of functional interaction (physical interaction) clearly dominate (>80%). In other cases, complementary function information from homology searches, which is available for most of the genes with significant genomic context, is essential to predict the type of interaction. Using a combination of genomic context and homology searches, new functional features can be predicted for 10% of M. genitalium genes. PMID:10958638
Avelar, Daniel M; Linardi, Pedro M
2010-09-15
The recently developed Multiple Displacement Amplification technique (MDA) allows for the production of a large quantity of high quality genomic DNA from low amounts of the original DNA. The goal of this study was to evaluate the performance of the MDA technique to amplify genomic DNA of siphonapterids that have been stored for long periods in 70% ethanol at room temperature. We subjected each DNA sample to two different methodologies: (1) amplification of mitochondrial 16S sequences without MDA; (2) amplification of 16S after MDA. All the samples obtained from these procedures were then sequenced. Only 4 samples (15.4%) subjected to method 1 showed amplification. In contrast, the application of MDA (method 2) improved the performance substantially, with 24 samples (92.3%) showing amplification, with significant difference. Interestingly, one of the samples successfully amplified with this method was originally collected in 1909. All of the sequenced samples displayed satisfactory results in quality evaluations (Phred ≥ 20) and good similarities, as identified with the BLASTn tool. Our results demonstrate that the use of MDA may be an effective tool in molecular studies involving specimens of fleas that have traditionally been considered inadequately preserved for such purposes. PMID:20840790
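The "Phred ≥ 20" quality cut-off quoted above has a direct probabilistic meaning: Q = -10 log10(P_error), so Q20 corresponds to a base-call error probability of at most 1%. A minimal sketch of the conversion:

```python
# Phred quality score <-> base-call error probability.
import math

def phred(p_error):
    """Error probability to Phred score: Q = -10 * log10(P)."""
    return -10 * math.log10(p_error)

def error_prob(q):
    """Phred score back to error probability: P = 10^(-Q/10)."""
    return 10 ** (-q / 10)

q20_error = error_prob(20)  # the abstract's quality threshold
```

On this scale, Q30 means a 0.1% error rate, which is why a Q20 floor is a common minimum standard for downstream sequence comparison.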
Akanno, E C; Schenkel, F S; Sargolzaei, M; Friendship, R M; Robinson, J A B
2014-10-01
Genetic improvement of pigs in tropical developing countries has focused on imported exotic populations which have been subjected to intensive selection with attendant high population-wide linkage disequilibrium (LD). Presently, indigenous pig populations with limited selection and low LD are being considered for improvement. Given that the infrastructure for genetic improvement using conventional BLUP selection methods is lacking, a genome-wide selection (GS) program was proposed for developing countries. A simulation study was conducted to evaluate the option of using a 60 K SNP panel and the observed amount of LD in the exotic and indigenous pig populations. Several scenarios were evaluated, including different sizes and structures of training and validation populations, different selection methods, and long-term accuracy of GS in different population/breeding structures and traits. The training set included a previously selected exotic population, an unselected indigenous population and their crossbreds. Traits studied included number born alive (NBA), average daily gain (ADG) and back fat thickness (BFT). The ridge regression method was used to train the prediction model. The results showed that accuracies of genomic breeding values (GBVs) in the range of 0.30 (NBA) to 0.86 (BFT) in the validation population are expected if high density marker panels are utilized. The GS method improved accuracy of breeding values better than the pedigree-based approach for traits with low heritability and in young animals with no performance data. Crossbred training populations performed better than purebreds when validation was in populations with a similar or a different structure from the training set. Genome-wide selection holds promise for genetic improvement of pigs in the tropics. © 2014 Blackwell Verlag GmbH.
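Ridge regression, named above as the training method, shrinks SNP effect estimates by solving (X'X + λI)β = X'y. A hedged toy sketch with two markers and a closed-form 2×2 solve (real panels have tens of thousands of SNPs and use iterative or mixed-model solvers; all numbers here are invented):

```python
# Toy ridge regression of phenotypes on two SNP dosage covariates.

def ridge_2marker(X, y, lam):
    # normal-equation terms with the ridge penalty on the diagonal
    a = sum(r[0] * r[0] for r in X) + lam
    b = sum(r[0] * r[1] for r in X)
    d = sum(r[1] * r[1] for r in X) + lam
    g0 = sum(r[0] * yi for r, yi in zip(X, y))
    g1 = sum(r[1] * yi for r, yi in zip(X, y))
    det = a * d - b * b
    # explicit 2x2 inverse applied to X'y
    return ((d * g0 - b * g1) / det, (a * g1 - b * g0) / det)

X = [[0, 1], [1, 0], [2, 1], [1, 2]]   # SNP allele dosages per animal
y = [0.1, 0.9, 2.1, 1.2]               # phenotypes
beta = ridge_2marker(X, y, lam=0.5)    # shrunken SNP effect estimates
```

A candidate's GBV is then the dot product of its dosage vector with these estimated effects, which is what makes selection possible before any performance data exist.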
McKinney, Garrett J; Larson, Wesley A; Seeb, Lisa W; Seeb, James E
2017-05-01
In their recently corrected manuscript, "Breaking RAD: An evaluation of the utility of restriction site associated DNA sequencing for genome scans of adaptation", Lowry et al. argue that genome scans using RADseq will miss many loci under selection due to a combination of sparse marker density and low levels of linkage disequilibrium in most species. We agree that marker density and levels of LD are important considerations when designing a RADseq study; however, we dispute that RAD-based genome scans are as prone to failure as Lowry et al. suggest. Their arguments ignore the flexible nature of RADseq; the availability of different restriction enzymes and capacity for combining restriction enzymes ensures that a well-designed study should be able to generate enough markers for efficient genome coverage. We further believe that simplifying assumptions about linkage disequilibrium in their simulations are invalid in many species. Finally, it is important to note that the alternative methods proposed by Lowry et al. have limitations equal to or greater than RADseq. The wealth of studies with positive impactful findings that have used RAD genome scans instead supports the argument that properly conducted RAD genome scans are an effective method for gaining insight into ecology and evolution, particularly for non-model organisms and those with large or complex genomes. © 2016 John Wiley & Sons Ltd.
Li, Qiling; Li, Min; Ma, Li; Li, Wenzhi; Wu, Xuehong; Richards, Jendai; Fu, Guoxing; Xu, Wei; Bythwood, Tameka; Li, Xu; Wang, Jianxin; Song, Qing
2014-01-01
Background The use of DNA from archival formalin-fixed and paraffin-embedded (FFPE) tissue for genetic and epigenetic analyses may be problematic, since the DNA is often degraded and only limited amounts may be available. Thus, it is currently not known whether genome-wide methylation can be reliably assessed in DNA from archival FFPE tissue. Methodology/Principal Findings Ovarian tissues, which were obtained and formalin-fixed and paraffin-embedded in either 1999 or 2011, were sectioned and stained with hematoxylin-eosin (H&E). Epithelial cells were captured by laser microdissection, and their DNA subjected to whole genomic bisulfite conversion, whole genomic polymerase chain reaction (PCR) amplification, and purification. Sequencing and software analyses were performed to identify the extent of genomic methylation. We observed that 31.7% of sequence reads from the DNA in the 1999 archival FFPE tissue, and 70.6% of the reads from the 2011 sample, could be matched with the genome. Methylation rates of CpG on the Watson and Crick strands were 32.2% and 45.5%, respectively, in the 1999 sample, and 65.1% and 42.7% in the 2011 sample. Conclusions/Significance We have developed an efficient method that allows DNA methylation to be assessed in archival FFPE tissue samples. PMID:25133528
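CpG methylation rates like those quoted above come from bisulfite sequencing counts: after conversion, unconverted (methylated) cytosine reads are divided by the total reads covering each site. A minimal sketch with invented counts:

```python
# Toy per-site and overall CpG methylation rates from bisulfite read counts.

def methylation_rate(methylated_reads, total_reads):
    return methylated_reads / total_reads

sites = [(13, 40), (9, 20), (30, 60)]  # (methylated reads, covering reads) per CpG
overall = sum(m for m, t in sites) / sum(t for m, t in sites)
```

Strand-specific rates, as reported separately for Watson and Crick strands above, are obtained by restricting the counts to reads mapping to each strand.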
Genomic evaluation of regional dairy cattle breeds in single-breed and multibreed contexts.
Jónás, D; Ducrocq, V; Fritz, S; Baur, A; Sanchez, M-P; Croiseau, P
2017-02-01
An important prerequisite for high prediction accuracy in genomic prediction is the availability of a large training population, which allows accurate marker effect estimation. This requirement is not fulfilled in the case of regional breeds with a limited number of breeding animals. We assessed the efficiency of the current French routine genomic evaluation procedure in four regional breeds (Abondance, Tarentaise, French Simmental and Vosgienne) as well as the potential benefits when the training populations consisting of males and females of these breeds are merged to form a multibreed training population. Genomic evaluation was 5-11% more accurate than a pedigree-based BLUP in three of the four breeds, while the numerically smallest breed showed a < 1% increase in accuracy. Multibreed genomic evaluation was beneficial for two breeds (Abondance and French Simmental) with maximum gains of 5 and 8% in correlation coefficients between yield deviations and genomic estimated breeding values, when compared to the single-breed genomic evaluation results. Inflation of genomic evaluation of young candidates was also reduced. Our results indicate that genomic selection can be effective in regional breeds as well. Here, we provide empirical evidence that genetic distance between breeds is only one of the factors affecting the efficiency of multibreed genomic evaluation. © 2016 Blackwell Verlag GmbH.
Essie: A Concept-based Search Engine for Structured Biomedical Text
Ide, Nicholas C.; Loane, Russell F.; Demner-Fushman, Dina
2007-01-01
This article describes the algorithms implemented in the Essie search engine that is currently serving several Web sites at the National Library of Medicine. Essie is a phrase-based search engine with term and concept query expansion and probabilistic relevancy ranking. Essie’s design is motivated by the observation that query terms are often conceptually related to terms in a document, without actually occurring in the document text. Essie’s performance was evaluated using data and standard evaluation methods from the 2003 and 2006 Text REtrieval Conference (TREC) Genomics track. Essie was the best-performing search engine in the 2003 TREC Genomics track and achieved results comparable to those of the highest-ranking systems on the 2006 TREC Genomics track task. Essie shows that a judicious combination of exploiting document structure, phrase searching, and concept-based query expansion is a useful approach for information retrieval in the biomedical domain. PMID:17329729
Anchoring and ordering NGS contig assemblies by population sequencing (POPSEQ)
Mascher, Martin; Muehlbauer, Gary J; Rokhsar, Daniel S; Chapman, Jarrod; Schmutz, Jeremy; Barry, Kerrie; Muñoz-Amatriaín, María; Close, Timothy J; Wise, Roger P; Schulman, Alan H; Himmelbach, Axel; Mayer, Klaus FX; Scholz, Uwe; Poland, Jesse A; Stein, Nils; Waugh, Robbie
2013-01-01
Next-generation whole-genome shotgun assemblies of complex genomes are highly useful, but fail to link nearby sequence contigs with each other or provide a linear order of contigs along individual chromosomes. Here, we introduce a strategy based on sequencing progeny of a segregating population that allows de novo production of a genetically anchored linear assembly of the gene space of an organism. We demonstrate the power of the approach by reconstructing the chromosomal organization of the gene space of barley, a large, complex and highly repetitive 5.1 Gb genome. We evaluate the robustness of the new assembly by comparison to a recently released physical and genetic framework of the barley genome, and to various genetically ordered sequence-based genotypic datasets. The method is independent of the need for any prior sequence resources, and will enable rapid and cost-efficient establishment of powerful genomic information for many species. PMID:23998490
Goodswen, Stephen J.; Kennedy, Paul J.; Ellis, John T.
2012-01-01
Next generation sequencing technology is advancing genome sequencing at an unprecedented level. By unravelling the code within a pathogen’s genome, every possible protein (prior to post-translational modifications) can theoretically be discovered, irrespective of life cycle stages and environmental stimuli. Now more than ever there is a great need for high-throughput ab initio gene finding. Ab initio gene finders use statistical models to predict genes and their exon-intron structures from the genome sequence alone. This paper evaluates whether existing ab initio gene finders can effectively predict genes to deduce proteins that have presently missed capture by laboratory techniques. An aim here is to identify possible patterns of prediction inaccuracies for gene finders as a whole irrespective of the target pathogen. All currently available ab initio gene finders are considered in the evaluation but only four fulfil high-throughput capability: AUGUSTUS, GeneMark_hmm, GlimmerHMM, and SNAP. These gene finders require training data specific to a target pathogen and consequently the evaluation results are inextricably linked to the availability and quality of the data. The pathogen, Toxoplasma gondii, is used to illustrate the evaluation methods. The results support current opinion that predicted exons by ab initio gene finders are inaccurate in the absence of experimental evidence. However, the results reveal some patterns of inaccuracy that are common to all gene finders and these inaccuracies may provide a focus area for future gene finder developers. PMID:23226328
Wang, Xihong; Zheng, Zhuqing; Cai, Yudong; Chen, Ting; Li, Chao; Fu, Weiwei; Jiang, Yu
2017-12-01
The increasing amount of sequencing data available for a wide variety of species can be theoretically used for detecting copy number variations (CNVs) at the population level. However, the growing sample sizes and the divergent complexity of nonhuman genomes challenge the efficiency and robustness of current human-oriented CNV detection methods. Here, we present CNVcaller, a read-depth method for discovering CNVs in population sequencing data. The computational speed of CNVcaller was 1-2 orders of magnitude faster than CNVnator and Genome STRiP for complex genomes with thousands of unmapped scaffolds. CNV detection of 232 goats required only 1.4 days on a single compute node. Additionally, the Mendelian consistency of sheep trios indicated that CNVcaller mitigated the influence of high proportions of gaps and misassembled duplications in the nonhuman reference genome assembly. Furthermore, multiple evaluations using real sheep and human data indicated that CNVcaller achieved the best accuracy and sensitivity for detecting duplications. The fast generalized detection algorithms included in CNVcaller overcome prior computational barriers for detecting CNVs in large-scale sequencing data with complex genomic structures. Therefore, CNVcaller promotes population genetic analyses of functional CNVs in more species. © The Authors 2017. Published by Oxford University Press.
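The read-depth principle underlying tools like the one described in this abstract can be illustrated with a minimal sketch. This is not CNVcaller's actual implementation; the window counts, normalization to the genome-wide median, and the deletion/duplication cutoffs below are assumptions chosen only for the illustration.

```python
# Illustrative read-depth CNV calling: reads mapped to fixed-size windows
# are normalized to the genome-wide median depth, and windows whose
# normalized depth deviates strongly are flagged as candidate CNVs.
# The cutoffs (<=0.5x deletion, >=1.5x duplication) are arbitrary choices
# for this sketch, not parameters of any published tool.
from statistics import median

def call_cnv_windows(window_depths, del_cut=0.5, dup_cut=1.5):
    m = median(window_depths)
    calls = []
    for i, depth in enumerate(window_depths):
        ratio = depth / m if m > 0 else 0.0
        if ratio <= del_cut:
            calls.append((i, ratio, "deletion"))
        elif ratio >= dup_cut:
            calls.append((i, ratio, "duplication"))
    return calls

# Toy per-window read counts: windows 3-4 look deleted, 6-7 duplicated.
depths = [100, 98, 102, 40, 45, 101, 210, 205, 99]
print(call_cnv_windows(depths))
```

Real population-scale callers add steps this sketch omits, such as GC-content correction, cross-sample normalization, and merging of adjacent candidate windows.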
Oud, Bart; Maris, Antonius J A; Daran, Jean-Marc; Pronk, Jack T
2012-01-01
Successful reverse engineering of mutants that have been obtained by nontargeted strain improvement has long presented a major challenge in yeast biotechnology. This paper reviews the use of genome-wide approaches for analysis of Saccharomyces cerevisiae strains originating from evolutionary engineering or random mutagenesis. On the basis of an evaluation of the strengths and weaknesses of different methods, we conclude that for the initial identification of relevant genetic changes, whole genome sequencing is superior to other analytical techniques, such as transcriptome, metabolome, proteome, or array-based genome analysis. Key advantages of this technique over gene expression analysis include the independency of genome sequences on experimental context and the possibility to directly and precisely reproduce the identified changes in naive strains. The predictive value of genome-wide analysis of strains with industrially relevant characteristics can be further improved by classical genetics or simultaneous analysis of strains derived from parallel, independent strain improvement lineages. PMID:22152095
Guo, X; Christensen, O F; Ostersen, T; Wang, Y; Lund, M S; Su, G
2015-02-01
A single-step method allows genetic evaluation using information of phenotypes, pedigree, and markers from genotyped and nongenotyped individuals simultaneously. This paper compared genomic predictions obtained from a single-step BLUP (SSBLUP) method, a genomic BLUP (GBLUP) method, a selection index blending (SELIND) method, and a traditional pedigree-based method (BLUP) for total number of piglets born (TNB), litter size at d 5 after birth (LS5), and mortality rate before d 5 (Mort; including stillbirth) in Danish Landrace and Yorkshire pigs. Data sets of 778,095 litters from 309,362 Landrace sows and 472,001 litters from 190,760 Yorkshire sows were used for the analysis. There were 332,795 Landrace and 207,255 Yorkshire animals in the pedigree data, among which 3,445 Landrace pigs (1,366 boars and 2,079 sows) and 3,372 Yorkshire pigs (1,241 boars and 2,131 sows) were genotyped with the Illumina PorcineSNP60 BeadChip. The results showed that the 3 methods with marker information (SSBLUP, GBLUP, and SELIND) produced more accurate predictions for genotyped animals than the pedigree-based method. For genotyped animals, the average of reliabilities for all traits in both breeds using traditional BLUP was 0.091, which increased to 0.171 when using GBLUP and to 0.179 when using SELIND and further increased to 0.209 when using SSBLUP. Furthermore, the average reliability of EBV for nongenotyped animals was increased from 0.091 for traditional BLUP to 0.105 for the SSBLUP. The results indicate that the SSBLUP is a good approach to practical genomic prediction of litter size and piglet mortality in Danish Landrace and Yorkshire populations.
Bangera, Rama; Correa, Katharina; Lhorente, Jean P; Figueroa, René; Yáñez, José M
2017-01-31
Salmon Rickettsial Syndrome (SRS) caused by Piscirickettsia salmonis is a major disease affecting the Chilean salmon industry. Genomic selection (GS) is a method wherein genome-wide markers and phenotype information of full-sibs are used to predict genomic EBV (GEBV) of selection candidates and is expected to have increased accuracy and response to selection over traditional pedigree based Best Linear Unbiased Prediction (PBLUP). Widely used GS methods such as genomic BLUP (GBLUP), SNPBLUP, Bayes C and Bayesian Lasso may perform differently with respect to accuracy of GEBV prediction. Our aim was to compare the accuracy, in terms of reliability of genome-enabled prediction, from different GS methods with PBLUP for resistance to SRS in an Atlantic salmon breeding program. Number of days to death (DAYS), binary survival status (STATUS) phenotypes, and 50 K SNP array genotypes were obtained from 2601 smolts challenged with P. salmonis. The reliability of different GS methods at different SNP densities with and without pedigree were compared to PBLUP using a five-fold cross validation scheme. Heritability estimated from GS methods was significantly higher than PBLUP. Pearson's correlation between predicted GEBV from PBLUP and GS models ranged from 0.79 to 0.91 and 0.79-0.95 for DAYS and STATUS, respectively. The relative increase in reliability from different GS methods for DAYS and STATUS with 50 K SNP ranged from 8 to 25% and 27-30%, respectively. All GS methods outperformed PBLUP at all marker densities. DAYS and STATUS showed superior reliability over PBLUP even at the lowest marker density of 3 K and 500 SNP, respectively. 20 K SNP showed close to maximal reliability for both traits with little improvement using higher densities. These results indicate that genomic predictions can accelerate genetic progress for SRS resistance in Atlantic salmon and implementation of this approach will contribute to the control of SRS in Chile. 
We recommend GBLUP for routine GS evaluation because this method is computationally faster and its results are very similar to those of the other GS methods. The use of lower-density SNP panels, or of low-density SNP combined with an imputation strategy, may help to reduce genotyping costs without compromising the gain in reliability.
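The GBLUP method recommended in this abstract can be sketched in a few lines on toy data. This is a generic illustration, not the pipeline used in the study: it assumes VanRaden's first genomic relationship matrix, a single fixed effect (the mean), an arbitrary variance ratio lambda = 1, and a small diagonal ridge to keep G invertible.

```python
import numpy as np

# Toy data: n animals genotyped at m SNPs, coded 0/1/2 copies of an allele.
np.random.seed(1)
n, m = 20, 50
M = np.random.binomial(2, 0.4, size=(n, m)).astype(float)

# VanRaden-style G matrix: center genotypes by 2p, scale by 2*sum(p*(1-p)).
p = M.mean(axis=0) / 2.0
Z = M - 2.0 * p
G = Z @ Z.T / (2.0 * np.sum(p * (1.0 - p)))

# Simulated phenotypes for the model y = 1*mu + g + e, g ~ N(0, G*sigma_g^2).
y = M[:, 0] + np.random.normal(0.0, 1.0, n)

lam = 1.0                                   # assumed sigma_e^2 / sigma_g^2
Ginv = np.linalg.inv(G + 0.01 * np.eye(n))  # ridge keeps G invertible
X = np.ones((n, 1))

# Henderson's mixed-model equations with Z = I:
# [X'X  X' ] [mu]   [X'y]
# [X    I+Ginv*lam] [g ] = [y  ]
lhs = np.block([[X.T @ X, X.T], [X, np.eye(n) + Ginv * lam]])
rhs = np.concatenate([X.T @ y, y])
sol = np.linalg.solve(lhs, rhs)
mu, gebv = sol[0], sol[1:]
print(gebv.round(2))  # genomic EBVs for the 20 toy animals
```

A single-step (SSBLUP) evaluation replaces G here with the blended H matrix combining pedigree and genomic relationships, which is how nongenotyped animals enter the system.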
The Functional Genomics Network in the evolution of biological text mining over the past decade.
Blaschke, Christian; Valencia, Alfonso
2013-03-25
Different programs of the European Science Foundation (ESF) have contributed significantly to connecting researchers in Europe and beyond through several initiatives. This support was particularly relevant for the development of the areas related to extracting information from papers (text mining), because it supported the field in its early phases, long before it was recognized by the community. We review the historical development of text-mining research and how it was introduced in bioinformatics. Specific applications in (functional) genomics are described, such as its integration into genome annotation pipelines and its support for the analysis of high-throughput genomics experimental data, and we highlight the activities of method evaluation and benchmarking for which the ESF programme support was instrumental. Copyright © 2013 Elsevier B.V. All rights reserved.
Seyerle, Amanda A; Sitlani, Colleen M; Noordam, Raymond; Gogarten, Stephanie M; Li, Jin; Li, Xiaohui; Evans, Daniel S; Sun, Fangui; Laaksonen, Maarit A; Isaacs, Aaron; Kristiansson, Kati; Highland, Heather M; Stewart, James D; Harris, Tamara B; Trompet, Stella; Bis, Joshua C; Peloso, Gina M; Brody, Jennifer A; Broer, Linda; Busch, Evan L; Duan, Qing; Stilp, Adrienne M; O’Donnell, Christopher J; Macfarlane, Peter W; Floyd, James S; Kors, Jan A; Lin, Henry J; Li-Gao, Ruifang; Sofer, Tamar; Méndez-Giráldez, Raúl; Cummings, Steven R; Heckbert, Susan R; Hofman, Albert; Ford, Ian; Li, Yun; Launer, Lenore J; Porthan, Kimmo; Newton-Cheh, Christopher; Napier, Melanie D; Kerr, Kathleen F; Reiner, Alexander P; Rice, Kenneth M; Roach, Jeffrey; Buckley, Brendan M; Soliman, Elsayed Z; de Mutsert, Renée; Sotoodehnia, Nona; Uitterlinden, André G; North, Kari E; Lee, Craig R; Gudnason, Vilmundur; Stürmer, Til; Rosendaal, Frits R; Taylor, Kent D; Wiggins, Kerri L; Wilson, James G; Chen, Yii-Der I; Kaplan, Robert C; Wilhelmsen, Kirk; Cupples, L Adrienne; Salomaa, Veikko; van Duijn, Cornelia; Jukema, J Wouter; Liu, Yongmei; Mook-Kanamori, Dennis O; Lange, Leslie A; Vasan, Ramachandran S; Smith, Albert V; Stricker, Bruno H; Laurie, Cathy C; Rotter, Jerome I; Whitsel, Eric A; Psaty, Bruce M; Avery, Christy L
2017-01-01
Thiazide diuretics, commonly used antihypertensives, may cause QT interval (QT) prolongation, a risk factor for highly fatal and difficult to predict ventricular arrhythmias. We examined whether common SNPs modified the association between thiazide use and QT or its component parts (QRS interval, JT interval) by performing ancestry-specific, trans-ethnic, and cross-phenotype genome-wide analyses of European (66%), African American (15%), and Hispanic (19%) populations (N=78,199), leveraging longitudinal data, incorporating corrected standard errors to account for underestimation of interaction estimate variances and evaluating evidence for pathway enrichment. Although no loci achieved genome-wide significance (P<5×10−8), we found suggestive evidence (P<5×10−6) for SNPs modifying the thiazide-QT association at 22 loci, including ion transport loci (e.g. NELL1, KCNQ3). The biologic plausibility of our suggestive results and simulations demonstrating modest power to detect interaction effects at genome-wide significant levels indicate that larger studies and innovative statistical methods are warranted in future efforts evaluating thiazide-SNP interactions. PMID:28719597
Mehrban, Hossein; Lee, Deuk Hwan; Moradi, Mohammad Hossein; IlCho, Chung; Naserkheil, Masoumeh; Ibáñez-Escriche, Noelia
2017-01-04
Hanwoo beef is known for its marbled fat, tenderness, juiciness and characteristic flavor, as well as for its low cholesterol and high omega 3 fatty acid contents. As yet, there has been no comprehensive investigation to estimate genomic selection accuracy for carcass traits in Hanwoo cattle using dense markers. This study aimed at evaluating the accuracy of alternative statistical methods that differed in assumptions about the underlying genetic model for various carcass traits: backfat thickness (BT), carcass weight (CW), eye muscle area (EMA), and marbling score (MS). Accuracies of direct genomic breeding values (DGV) for carcass traits were estimated by applying fivefold cross-validation to a dataset including 1183 animals and approximately 34,000 single nucleotide polymorphisms (SNPs). Accuracies of BayesC, Bayesian LASSO (BayesL) and genomic best linear unbiased prediction (GBLUP) methods were similar for BT, EMA and MS. However, for CW, DGV accuracy was 7% higher with BayesC than with BayesL and GBLUP. The increased accuracy of BayesC, compared to GBLUP and BayesL, was maintained for CW, regardless of the training sample size, but not for BT, EMA, and MS. Genome-wide association studies detected consistent large effects for SNPs on chromosomes 6 and 14 for CW. The predictive performance of the models depended on the trait analyzed. For CW, the results showed a clear superiority of BayesC compared to GBLUP and BayesL. These findings indicate the importance of using a proper variable selection method for genomic selection of traits and also suggest that the genetic architecture that underlies CW differs from that of the other carcass traits analyzed. Thus, our study provides significant new insights into the carcass traits of Hanwoo cattle.
Parson, Walther; Strobl, Christina; Huber, Gabriela; Zimmermann, Bettina; Gomes, Sibylle M.; Souto, Luis; Fendt, Liane; Delport, Rhena; Langit, Reina; Wootton, Sharon; Lagacé, Robert; Irwin, Jodi
2013-01-01
Insights into the human mitochondrial phylogeny have been primarily achieved by sequencing full mitochondrial genomes (mtGenomes). In forensic genetics (partial) mtGenome information can be used to assign haplotypes to their phylogenetic backgrounds, which may, in turn, have characteristic geographic distributions that would offer useful information in a forensic case. In addition and perhaps even more relevant in the forensic context, haplogroup-specific patterns of mutations form the basis for quality control of mtDNA sequences. The current method for establishing (partial) mtDNA haplotypes is Sanger-type sequencing (STS), which is laborious, time-consuming, and expensive. With the emergence of Next Generation Sequencing (NGS) technologies, the body of available mtDNA data can potentially be extended much more quickly and cost-efficiently. Customized chemistries, laboratory workflows and data analysis packages could support the community and increase the utility of mtDNA analysis in forensics. We have evaluated the performance of mtGenome sequencing using the Personal Genome Machine (PGM) and compared the resulting haplotypes directly with conventional Sanger-type sequencing. A total of 64 mtGenomes (>1 million bases) were established that yielded high concordance with the corresponding STS haplotypes (<0.02% differences). About two-thirds of the differences were observed in or around homopolymeric sequence stretches. In addition, the sequence alignment algorithm employed to align NGS reads played a significant role in the analysis of the data and the resulting mtDNA haplotypes. Further development of alignment software would be desirable to facilitate the application of NGS in mtDNA forensic genetics. PMID:23948325
Roth, Andrew; Khattra, Jaswinder; Ho, Julie; Yap, Damian; Prentice, Leah M.; Melnyk, Nataliya; McPherson, Andrew; Bashashati, Ali; Laks, Emma; Biele, Justina; Ding, Jiarui; Le, Alan; Rosner, Jamie; Shumansky, Karey; Marra, Marco A.; Gilks, C. Blake; Huntsman, David G.; McAlpine, Jessica N.; Aparicio, Samuel
2014-01-01
The evolution of cancer genomes within a single tumor creates mixed cell populations with divergent somatic mutational landscapes. Inference of tumor subpopulations has been disproportionately focused on the assessment of somatic point mutations, whereas computational methods targeting evolutionary dynamics of copy number alterations (CNA) and loss of heterozygosity (LOH) in whole-genome sequencing data remain underdeveloped. We present a novel probabilistic model, TITAN, to infer CNA and LOH events while accounting for mixtures of cell populations, thereby estimating the proportion of cells harboring each event. We evaluate TITAN on idealized mixtures, simulating clonal populations from whole-genome sequences taken from genomically heterogeneous ovarian tumor sites collected from the same patient. In addition, we show in 23 whole genomes of breast tumors that the inference of CNA and LOH using TITAN critically informs population structure and the nature of the evolving cancer genome. Finally, we experimentally validated subclonal predictions using fluorescence in situ hybridization (FISH) and single-cell sequencing from an ovarian cancer patient sample, thereby recapitulating the key modeling assumptions of TITAN. PMID:25060187
Takabatake, Reona; Masubuchi, Tomoko; Futo, Satoshi; Minegishi, Yasutaka; Noguchi, Akio; Kondo, Kazunari; Teshima, Reiko; Kurashima, Takeyo; Mano, Junichi; Kitta, Kazumi
2016-01-01
A novel real-time PCR-based analytical method was developed for the event-specific quantification of a genetically modified (GM) maize, 3272. We first attempted to obtain genomic DNA from this maize using a DNeasy Plant Maxi kit and a DNeasy Plant Mini kit, which have been widely utilized in our previous studies, but DNA extraction yields from 3272 were markedly lower than those from non-GM maize seeds. However, lowering of DNA extraction yields was not observed with GM quicker or Genomic-tip 20/G. We chose GM quicker for evaluation of the quantitative method. We prepared a standard plasmid for 3272 quantification. The conversion factor (Cf), which is required to calculate the amount of a genetically modified organism (GMO), was experimentally determined for two real-time PCR instruments, the Applied Biosystems 7900HT (the ABI 7900) and the Applied Biosystems 7500 (the ABI 7500). The determined Cf values were 0.60 and 0.59 for the ABI 7900 and the ABI 7500, respectively. To evaluate the developed method, a blind test was conducted as part of an interlaboratory study. The trueness and precision were evaluated as the bias and the reproducibility relative standard deviation (RSDr), respectively. The determined values were similar to those in our previous validation studies. The limit of quantitation for the method was estimated to be 0.5% or less, and we concluded that the developed method would be suitable and practical for detection and quantification of 3272.
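The role of the conversion factor (Cf) in this kind of event-specific quantification can be shown with a small worked example. The formula is the generic real-time PCR approach of dividing the event/endogenous copy-number ratio by Cf; the copy numbers below are invented, and only the Cf values 0.60 and 0.59 come from the abstract.

```python
def gmo_percent(event_copies, endogenous_copies, cf):
    """Generic event-specific GMO quantification: the measured ratio of
    event-specific to endogenous-gene copy numbers is divided by the
    experimentally determined conversion factor Cf and expressed as a
    percentage. Copy-number inputs here are hypothetical."""
    return (event_copies / endogenous_copies) / cf * 100.0

# Hypothetical sample measured on the ABI 7900 (Cf = 0.60): a raw
# copy-number ratio of 0.003 corresponds to a 0.5% GMO content.
print(round(gmo_percent(30.0, 10000.0, 0.60), 2))
```

Because Cf is instrument-specific (0.60 vs. 0.59 in the abstract), the same raw ratio yields slightly different GMO percentages on the two platforms, which is why Cf must be determined per instrument.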
Katz, Lee S.; Griswold, Taylor; Williams-Newkirk, Amanda J.; Wagner, Darlene; Petkau, Aaron; Sieffert, Cameron; Van Domselaar, Gary; Deng, Xiangyu; Carleton, Heather A.
2017-01-01
Modern epidemiology of foodborne bacterial pathogens in industrialized countries relies increasingly on whole genome sequencing (WGS) techniques. As opposed to profiling techniques such as pulsed-field gel electrophoresis, WGS requires a variety of computational methods. Since 2013, United States agencies responsible for food safety including the CDC, FDA, and USDA, have been performing whole-genome sequencing (WGS) on all Listeria monocytogenes found in clinical, food, and environmental samples. Each year, more genomes of other foodborne pathogens such as Escherichia coli, Campylobacter jejuni, and Salmonella enterica are being sequenced. Comparing thousands of genomes across an entire species requires a fast method with coarse resolution; however, capturing the fine details of highly related isolates requires a computationally heavy and sophisticated algorithm. Most L. monocytogenes investigations employing WGS depend on being able to identify an outbreak clade whose inter-genomic distances are less than an empirically determined threshold. When the difference between a few single nucleotide polymorphisms (SNPs) can help distinguish between genomes that are likely outbreak-associated and those that are less likely to be associated, we require a fine-resolution method. To achieve this level of resolution, we have developed Lyve-SET, a high-quality SNP pipeline. We evaluated Lyve-SET by retrospectively investigating 12 outbreak data sets along with four other SNP pipelines that have been used in outbreak investigation or similar scenarios. To compare these pipelines, several distance and phylogeny-based comparison methods were applied, which collectively showed that multiple pipelines were able to identify most outbreak clusters and strains. 
Currently in the US PulseNet system, whole genome multi-locus sequence typing (wgMLST) is the preferred primary method for foodborne WGS cluster detection and outbreak investigation due to its ability to name standardized genomic profiles, its central database, and its ability to be run in a graphical user interface. However, creating a functional wgMLST scheme requires extended up-front development and subject-matter expertise. When a scheme does not exist or when the highest resolution is needed, SNP analysis is used. Using three Listeria outbreak data sets, we demonstrated the concordance between Lyve-SET SNP typing and wgMLST. Availability: Lyve-SET can be found at https://github.com/lskatz/Lyve-SET. PMID:28348549
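The threshold-based outbreak clustering described above, in which an outbreak clade is a set of isolates whose inter-genomic SNP distances fall below an empirical cutoff, can be sketched independently of any particular pipeline as single-linkage clustering on pairwise SNP distances. The isolate profiles and the threshold below are toy assumptions, not data from the study.

```python
# Toy SNP profiles: each isolate is a string of called alleles at the same
# shared positions; distance = number of differing positions (SNPs).
def snp_distance(a, b):
    return sum(1 for x, y in zip(a, b) if x != y)

def cluster(isolates, threshold):
    """Single-linkage clustering: an isolate joins (and merges) every
    existing cluster containing a member within `threshold` SNPs,
    mimicking threshold-based outbreak-clade detection."""
    clusters = []
    for name, prof in isolates.items():
        hits = [c for c in clusters
                if any(snp_distance(prof, isolates[m]) <= threshold for m in c)]
        merged = {name}
        for c in hits:
            merged |= c
            clusters.remove(c)
        clusters.append(merged)
    return sorted(sorted(c) for c in clusters)

isolates = {
    "out1": "AAAAACGT",
    "out2": "AAAAACGA",    # 1 SNP from out1
    "out3": "AAAAATGA",    # 2 SNPs from out1
    "sporadic": "TTTTTCGT" # distant from all others
}
print(cluster(isolates, threshold=2))
```

Real SNP pipelines spend most of their effort producing the distance matrix reliably (read mapping, variant calling, masking of recombinant or low-quality sites); the clustering step itself is comparatively simple, as the sketch shows.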
Transition of genomic evaluation from a research project to a production system
USDA-ARS?s Scientific Manuscript database
Genomic data began to be included in official USDA genetic evaluations of dairy cattle in January 2009. Numerous changes to the evaluation system were made to enable efficient management of genomic information, to incorporate it in official evaluations, and to distribute evaluations. Artificial-inse...
Prophage Integrase Typing Is a Useful Indicator of Genomic Diversity in Salmonella enterica
Colavecchio, Anna; D’Souza, Yasmin; Tompkins, Elizabeth; Jeukens, Julie; Freschi, Luca; Emond-Rheault, Jean-Guillaume; Kukavica-Ibrulj, Irena; Boyle, Brian; Bekal, Sadjia; Tamber, Sandeep; Levesque, Roger C.; Goodridge, Lawrence D.
2017-01-01
Salmonella enterica is a bacterial species that is a major cause of illness in humans and food-producing animals. S. enterica exhibits considerable inter-serovar diversity, as evidenced by the large number of host adapted serovars that have been identified. The development of methods to assess genome diversity in S. enterica will help to further define the limits of diversity in this foodborne pathogen. Thus, we evaluated a PCR assay, which targets prophage integrase genes, as a rapid method to investigate S. enterica genome diversity. To evaluate the PCR prophage integrase assay, 49 isolates of S. enterica were selected, including 19 clinical isolates from clonal serovars (Enteritidis and Heidelberg) that commonly cause human illness, and 30 isolates from food-associated Salmonella serovars that rarely cause human illness. The number of integrase genes identified by the PCR assay was compared to the number of integrase genes within intact prophages identified by whole genome sequencing and phage finding program PHASTER. The PCR assay identified a total of 147 prophage integrase genes within the 49 S. enterica genomes (79 integrase genes in the food-associated Salmonella isolates, 50 integrase genes in S. Enteritidis, and 18 integrase genes in S. Heidelberg). In comparison, whole genome sequencing and PHASTER identified a total of 75 prophage integrase genes within 102 intact prophages in the 49 S. enterica genomes (44 integrase genes in the food-associated Salmonella isolates, 21 integrase genes in S. Enteritidis, and 9 integrase genes in S. Heidelberg). Collectively, both the PCR assay and PHASTER identified the presence of a large diversity of prophage integrase genes in the food-associated isolates compared to the clinical isolates, thus indicating a high degree of diversity in the food-associated isolates, and confirming the clonal nature of S. Enteritidis and S. Heidelberg. 
Moreover, PHASTER revealed a diversity of 29 different types of prophages and 23 different integrase genes within the food-associated isolates, but only identified four different phages and integrase genes within clonal isolates of S. Enteritidis and S. Heidelberg. These results demonstrate the potential usefulness of PCR based detection of prophage integrase genes as a rapid indicator of genome diversity in S. enterica. PMID:28740489
Prophage Integrase Typing Is a Useful Indicator of Genomic Diversity in Salmonella enterica.
Colavecchio, Anna; D'Souza, Yasmin; Tompkins, Elizabeth; Jeukens, Julie; Freschi, Luca; Emond-Rheault, Jean-Guillaume; Kukavica-Ibrulj, Irena; Boyle, Brian; Bekal, Sadjia; Tamber, Sandeep; Levesque, Roger C; Goodridge, Lawrence D
2017-01-01
Salmonella enterica is a bacterial species that is a major cause of illness in humans and food-producing animals. S. enterica exhibits considerable inter-serovar diversity, as evidenced by the large number of host-adapted serovars that have been identified. The development of methods to assess genome diversity in S. enterica will help to further define the limits of diversity in this foodborne pathogen. Thus, we evaluated a PCR assay, which targets prophage integrase genes, as a rapid method to investigate S. enterica genome diversity. To evaluate the PCR prophage integrase assay, 49 isolates of S. enterica were selected, including 19 clinical isolates from clonal serovars (Enteritidis and Heidelberg) that commonly cause human illness, and 30 isolates from food-associated Salmonella serovars that rarely cause human illness. The number of integrase genes identified by the PCR assay was compared to the number of integrase genes within intact prophages identified by whole genome sequencing and the phage-finding program PHASTER. The PCR assay identified a total of 147 prophage integrase genes within the 49 S. enterica genomes (79 integrase genes in the food-associated Salmonella isolates, 50 integrase genes in S. Enteritidis, and 18 integrase genes in S. Heidelberg). In comparison, whole genome sequencing and PHASTER identified a total of 75 prophage integrase genes within 102 intact prophages in the 49 S. enterica genomes (44 integrase genes in the food-associated Salmonella isolates, 21 integrase genes in S. Enteritidis, and 9 integrase genes in S. Heidelberg). Collectively, both the PCR assay and PHASTER identified a greater diversity of prophage integrase genes in the food-associated isolates than in the clinical isolates, thus indicating a high degree of diversity in the food-associated isolates, and confirming the clonal nature of S. Enteritidis and S. Heidelberg.
Moreover, PHASTER revealed a diversity of 29 different types of prophages and 23 different integrase genes within the food-associated isolates, but identified only four different phages and integrase genes within clonal isolates of S. Enteritidis and S. Heidelberg. These results demonstrate the potential usefulness of PCR-based detection of prophage integrase genes as a rapid indicator of genome diversity in S. enterica. PMID:28740489
Schmidt, Martin; Van Bel, Michiel; Woloszynska, Magdalena; Slabbinck, Bram; Martens, Cindy; De Block, Marc; Coppens, Frederik; Van Lijsebettens, Mieke
2017-07-06
Cytosine methylation in plant genomes is important for the regulation of gene transcription and transposon activity. Genome-wide methylomes are studied upon mutation of the DNA methyltransferases, adaptation to environmental stresses or during development. However, from basic biology to breeding programs, there is a need to monitor multiple samples to determine transgenerational methylation inheritance or differential cytosine methylation. Methylome data obtained by sodium hydrogen sulfite (bisulfite)-conversion and next-generation sequencing (NGS) provide genome-wide information on cytosine methylation. However, a profiling method that detects cytosine methylation state dispersed over the genome would allow high-throughput analysis of multiple plant samples with distinct epigenetic signatures. We use specific restriction endonucleases to enrich for cytosine coverage in a bisulfite and NGS-based profiling method, which was compared to whole-genome bisulfite sequencing of the same plant material. We established an effective methylome profiling method in plants, termed plant-reduced representation bisulfite sequencing (plant-RRBS), using optimized double restriction endonuclease digestion, fragment end repair, adapter ligation, followed by bisulfite conversion, PCR amplification and NGS. We report a performant laboratory protocol and a straightforward bioinformatics data analysis pipeline for plant-RRBS, applicable for any reference-sequenced plant species. As a proof of concept, methylome profiling was performed using an Oryza sativa ssp. indica pure breeding line and a derived epigenetically altered line (epiline). Plant-RRBS detects methylation levels at tens of millions of cytosine positions deduced from bisulfite conversion in multiple samples. To evaluate the method, the coverage of cytosine positions, the intra-line similarity and the differential cytosine methylation levels between the pure breeding line and the epiline were determined. 
Plant-RRBS reproducibly covers up to one-fourth of the cytosine positions in the rice genome in common across a group of five biological replicates of a line when using MspI-DpnII. The method predominantly detects cytosine methylation in putative promoter regions and unannotated regions in rice. Plant-RRBS offers high-throughput and broad, genome-dispersed methylation detection by effective read number generation obtained from reproducibly covered genome fractions using optimized endonuclease combinations, facilitating comparative analyses of multi-sample studies for cytosine methylation and transgenerational stability in experimental material and plant breeding populations.
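The reduced-representation idea above can be caricatured with an in-silico double digest. The enzyme recognition sites are real (MspI: CCGG, DpnII: GATC), but the size-selection window and the simplified cut-position handling here are illustrative assumptions, not the published protocol:

```python
import re

# Simplified in-silico double digest: cut at every MspI (CCGG) or DpnII (GATC)
# recognition site, then keep fragments within a size-selection window.
# Cut positions are approximated at the start of each recognition site.
def digest_fragments(seq, min_len=40, max_len=300):
    cut_sites = sorted({m.start() for m in re.finditer(r"CCGG|GATC", seq)})
    bounds = [0] + cut_sites + [len(seq)]
    return [(a, b) for a, b in zip(bounds, bounds[1:]) if min_len <= b - a <= max_len]

# Fraction of the sequence's cytosines that fall inside retained fragments,
# i.e. the cytosine coverage a library built on these fragments could reach.
def covered_cytosine_fraction(seq, frags):
    covered = sum(seq[a:b].count("C") for a, b in frags)
    total = seq.count("C")
    return covered / total if total else 0.0

toy = "A" * 50 + "CCGG" + "A" * 100 + "GATC" + "A" * 50
frags = digest_fragments(toy)
fraction = covered_cytosine_fraction(toy, frags)
```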
DOE Office of Scientific and Technical Information (OSTI.GOV)
Fliedner, Theodor M.; Feinendegen, Ludwig E.; Meineke, Viktor
2005-02-28
First results of this feasibility study showed that evaluation of the stored material from the chronically irradiated dogs with modern molecular biological techniques proved successful and extremely promising. Therefore, an in-depth analysis of at least part of the huge amount of remaining material is of utmost interest. The methods applied in this feasibility study were pathological evaluation with different staining methods, protein analysis by means of immunohistochemistry, strand-break analysis with the TdT assay, DNA and RNA analysis, as well as genomic examination by gene array. Overall, more than 50% of the investigated material could be used. In particular, the finding of an increased stimulation of the immune system in the dogs of the 3 mSv group, compared to both the control and higher-dose groups, provides a rationale for in-depth study of the cellular events occurring in the context of low-dose radiation. Based on the findings of this study, further evaluation and statistical analysis of more material can help to identify promising biomarkers for low-dose radiation. A systematic evaluation of the correlation between dose rates and strand breaks in the dog tissue might moreover help to explain mechanisms of tolerance to ionizing radiation. One central problem is that most sequences for dog-specific primers are not yet known, as sequencing of the dog genome is still in progress. In this study, the isolation of RNA from the dog tissue was successful, but as yet no gene arrays or gene chips that have been tested and adapted for canine tissue are commercially available. The uncritical use of untested genomic test systems on canine tissue currently appears time-consuming and ineffective. Next steps in the investigation of genomic changes after irradiation in the stored dog tissue should be limited to quantitative RT-PCR with primer sequences validated for the dog.
A collaboration with institutions working on sequencing of the dog genome could have synergistic effects.
Spindel, Jennifer; Begum, Hasina; Akdemir, Deniz; Virk, Parminder; Collard, Bertrand; Redoña, Edilberto; Atlin, Gary; Jannink, Jean-Luc; McCouch, Susan R.
2015-01-01
Genomic Selection (GS) is a new breeding method in which genome-wide markers are used to predict the breeding value of individuals in a breeding population. GS has been shown to improve breeding efficiency in dairy cattle and several crop plant species, and here we evaluate for the first time its efficacy for breeding inbred lines of rice. We performed a genome-wide association study (GWAS) in conjunction with five-fold GS cross-validation on a population of 363 elite breeding lines from the International Rice Research Institute's (IRRI) irrigated rice breeding program and herein report the GS results. The population was genotyped with 73,147 markers using genotyping-by-sequencing. The training population, statistical method used to build the GS model, number of markers, and trait were varied to determine their effect on prediction accuracy. For all three traits, genomic prediction models outperformed prediction based on pedigree records alone. Prediction accuracies ranged from 0.31 and 0.34 for grain yield and plant height to 0.63 for flowering time. Analyses using subsets of the full marker set suggest that using one marker every 0.2 cM is sufficient for genomic selection in this collection of rice breeding materials. RR-BLUP was the best performing statistical method for grain yield where no large effect QTL were detected by GWAS, while for flowering time, where a single very large effect QTL was detected, the non-GS multiple linear regression method outperformed GS models. For plant height, in which four mid-sized QTL were identified by GWAS, random forest produced the most consistently accurate GS models. Our results suggest that GS, informed by GWAS interpretations of genetic architecture and population structure, could become an effective tool for increasing the efficiency of rice breeding as the costs of genotyping continue to decline. PMID:25689273
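The five-fold cross-validation scheme described above can be sketched with a ridge-regression analogue of RR-BLUP on synthetic genotypes. The data, fold count, and shrinkage value below are illustrative assumptions, not taken from the study; prediction accuracy is measured, as is conventional, by the correlation between predicted and observed phenotypes in the held-out fold:

```python
import numpy as np

# Synthetic stand-in for a genotyped breeding population:
# n lines, m biallelic markers coded 0/1/2, additive marker effects.
rng = np.random.default_rng(1)
n, m = 200, 500
X = rng.integers(0, 3, size=(n, m)).astype(float)
beta = rng.normal(0.0, 0.1, m)
y = X @ beta + rng.normal(0.0, 1.0, n)   # phenotype = genetics + noise

def ridge_predict(X_train, y_train, X_test, lam=10.0):
    """Ridge regression on markers, an RR-BLUP analogue with fixed shrinkage."""
    A = X_train.T @ X_train + lam * np.eye(X_train.shape[1])
    coef = np.linalg.solve(A, X_train.T @ y_train)
    return X_test @ coef

# Five-fold cross-validation: accuracy = cor(predicted, observed) per fold.
folds = np.array_split(rng.permutation(n), 5)
accuracies = []
for test_idx in folds:
    train_idx = np.setdiff1d(np.arange(n), test_idx)
    preds = ridge_predict(X[train_idx], y[train_idx], X[test_idx])
    accuracies.append(float(np.corrcoef(preds, y[test_idx])[0, 1]))
mean_accuracy = sum(accuracies) / len(accuracies)
```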
Yang, Yunfeng; Zhu, Mengxia; Wu, Liyou; Zhou, Jizhong
2008-09-16
Using genomic DNA as a common reference in microarray experiments has recently been tested by different laboratories, and conflicting results have been reported regarding the reliability of microarray results obtained with this method. To explain this, we hypothesized that data processing is a critical element that impacts data quality. Microarray experiments were performed in the gamma-proteobacterium Shewanella oneidensis. Pairwise comparisons of three experimental conditions were obtained either with two labeled cDNA samples co-hybridized to the same array, or by employing Shewanella genomic DNA as a standard reference. Various data processing techniques were applied to reduce the amount of inconsistency between the two methods, and the results were assessed. We found that data quality was significantly improved by imposing a constraint on the minimal number of replicates, logarithmic transformation and random error analyses. These findings demonstrate that data processing significantly influences data quality, which provides an explanation for the conflicting evaluations in the literature. This work could serve as a guideline for microarray data analysis using genomic DNA as a standard reference.
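The indirect comparison a genomic-DNA common reference enables works through log-ratios: each condition is hybridized against gDNA, and two conditions are then compared via log2(A/gDNA) - log2(B/gDNA) = log2(A/B). The intensity ratios below are hypothetical values for a single gene, and the replicate constraint mirrors the one discussed in the abstract:

```python
import math
import statistics

# Hypothetical replicate intensity ratios for one gene, each measured
# against the genomic-DNA common reference.
cond_a_vs_gdna = [2.1, 1.9, 2.3]   # condition A / gDNA
cond_b_vs_gdna = [1.0, 1.1, 0.9]   # condition B / gDNA

MIN_REPLICATES = 3                  # minimal-replicate constraint

def mean_log2(ratios):
    """Average log2 ratio, refusing to estimate from too few replicates."""
    if len(ratios) < MIN_REPLICATES:
        raise ValueError("too few replicates for a reliable estimate")
    return statistics.mean(math.log2(r) for r in ratios)

# log2(A/gDNA) - log2(B/gDNA) estimates log2(A/B) without a direct
# co-hybridization of A and B on the same array.
log2_a_vs_b = mean_log2(cond_a_vs_gdna) - mean_log2(cond_b_vs_gdna)
```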
García-Mena, Jaime; Cano-Ramirez, Claudia; Garibay-Orijel, Claudio; Ramirez-Canseco, Sergio; Poggi-Varaldo, Héctor M
2005-06-01
A PCR-based method for the quantitative detection of Lentinus edodes and Trametes versicolor, two ligninolytic fungi applied in wastewater treatment and bioremediation, was developed. Genomic DNA was used to optimize a PCR method targeting the conserved copper-binding sequence of laccase genes. The method allowed the quantitative detection and differentiation of these fungi in single and defined mixed cultures after fractionation of the PCR products by electrophoresis in agarose gels. Amplified products of about 150 bp for L. edodes and about 200 bp for T. versicolor were purified and cloned. The PCR method showed a linear detection response over the range of 1.0 μg to 1 ng of DNA. The same method was tested with genomic DNA from a third fungus (Phanerochaete chrysosporium), yielding a fragment of about 400 bp. Southern blot and DNA sequence analysis indicated that a specific PCR product was amplified from each genome, and that these products corresponded to sequences of laccase genes. This PCR protocol permits the detection and differentiation of three ligninolytic fungi by amplifying DNA fragments of different sizes using a single pair of primers, without further enzymatic restriction of the PCR products. This method has potential use in the monitoring, evaluation, and improvement of fungal cultures used in wastewater treatment processes.
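Because each species yields a characteristically sized amplicon (~150 bp, ~200 bp, ~400 bp), species calling from a gel reduces to mapping observed band sizes onto size classes. The tolerances in this sketch are illustrative assumptions, not values from the paper:

```python
# Amplicon size classes (bp) per species, centered on the sizes reported
# in the abstract; the ±20 bp windows are hypothetical tolerances.
SIZE_CLASSES = {
    "Lentinus edodes": (130, 170),
    "Trametes versicolor": (180, 220),
    "Phanerochaete chrysosporium": (380, 420),
}

def classify_amplicon(size_bp):
    """Return the single species whose size window contains the band, else None."""
    hits = [sp for sp, (lo, hi) in SIZE_CLASSES.items() if lo <= size_bp <= hi]
    return hits[0] if len(hits) == 1 else None

# A defined mixed culture yields one call per observed band:
bands = [152, 198]
species = [classify_amplicon(b) for b in bands]
```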
Island-Model Genomic Selection for Long-Term Genetic Improvement of Autogamous Crops
Yabe, Shiori; Yamasaki, Masanori; Ebana, Kaworu; Hayashi, Takeshi; Iwata, Hiroyoshi
2016-01-01
Acceleration of genetic improvement of autogamous crops such as wheat and rice is necessary to increase cereal production in response to the global food crisis. Population and pedigree methods of breeding, which are based on inbred line selection, are used commonly in the genetic improvement of autogamous crops. These methods, however, produce only a few novel combinations of genes in a breeding population. Recurrent selection promotes recombination among genes and produces novel combinations of genes in a breeding population, but it relies on single-plant evaluation for selection, which is inaccurate. Genomic selection (GS), which can predict the genetic potential of individuals based on their marker genotypes, might offer high reliability in single-plant evaluation and might be effective in recurrent selection. To evaluate the efficiency of recurrent selection with GS, we conducted simulations using real marker genotype data of rice cultivars. Additionally, we introduced the concept of an “island model” inspired by evolutionary algorithms that might be useful to maintain genetic variation through the breeding process. We conducted GS simulations using real marker genotype data of rice cultivars to evaluate the efficiency of recurrent selection and the island model in an autogamous species. Results demonstrated the importance of producing novel combinations of genes through recurrent selection. An initial population derived from admixture of multiple bi-parental crosses showed larger genetic gains than a population derived from a single bi-parental cross across all cycles, suggesting the importance of genetic variation in an initial population. The island-model GS better maintained genetic improvement in later generations than the other GS methods, suggesting that the island-model GS can utilize genetic variation in breeding and can retain alleles with small effects in the breeding population.
The island-model GS will become a new breeding method that enhances the potential of genomic selection in autogamous crops, especially for long-term improvement. PMID:27115872
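The island idea can be caricatured in a few lines: independent truncation selection on each island, plus a small ring migration each cycle so islands exchange material. The genetic values here are random numbers standing in for genomic predictions, and the population size, migration rate, and selection intensity are arbitrary choices, not the study's parameters:

```python
import random

random.seed(7)
N_ISLANDS, POP, N_MIGRANTS = 4, 20, 2

# Each individual is reduced to a single genetic value (a stand-in for a
# genomic prediction); each island is an independent subpopulation.
islands = [[random.gauss(0, 1) for _ in range(POP)] for _ in range(N_ISLANDS)]

def cycle(islands):
    # Within-island truncation selection: keep the top half, duplicate it
    # to restore population size (no recombination or mutation in this toy).
    selected = [sorted(isl, reverse=True)[:POP // 2] * 2 for isl in islands]
    # Ring migration: each island sends its best individuals to the next one,
    # which is what keeps variation flowing between otherwise isolated islands.
    for i in range(len(selected)):
        migrants = sorted(selected[i], reverse=True)[:N_MIGRANTS]
        selected[(i + 1) % len(selected)][-N_MIGRANTS:] = migrants
    return selected

for _ in range(5):
    islands = cycle(islands)
mean_value = sum(sum(isl) for isl in islands) / (N_ISLANDS * POP)
```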
Post-genomics nanotechnology is gaining momentum: nanoproteomics and applications in life sciences.
Kobeissy, Firas H; Gulbakan, Basri; Alawieh, Ali; Karam, Pierre; Zhang, Zhiqun; Guingab-Cagmat, Joy D; Mondello, Stefania; Tan, Weihong; Anagli, John; Wang, Kevin
2014-02-01
The post-genomics era has brought about new Omics biotechnologies, such as proteomics and metabolomics, as well as their novel applications to personal genomics and the quantified self. These advances are now also catalyzing other and newer post-genomics innovations, leading to convergences between Omics and nanotechnology. In this work, we systematically contextualize and exemplify an emerging strand of post-genomics life sciences, namely, nanoproteomics and its applications in health and integrative biological systems. Nanotechnology has been utilized as a complementary component to revolutionize proteomics through different kinds of nanotechnology applications, including nanoporous structures, functionalized nanoparticles, quantum dots, and polymeric nanostructures. Those applications, though still in their infancy, have led to several highly sensitive diagnostics and new methods of drug delivery and targeted therapy for clinical use. The present article differs from previous analyses of nanoproteomics in that it offers an in-depth and comparative evaluation of the attendant biotechnology portfolio and their applications as seen through the lens of post-genomics life sciences and biomedicine. These include: (1) immunosensors for inflammatory, pathogenic, and autoimmune markers for infectious and autoimmune diseases, (2) amplified immunoassays for detection of cancer biomarkers, and (3) methods for targeted therapy and automatically adjusted drug delivery such as in experimental stroke and brain injury studies. As nanoproteomics becomes available both to the clinician at the bedside and the citizens who are increasingly interested in access to novel post-genomics diagnostics through initiatives such as the quantified self, we anticipate further breakthroughs in personalized and targeted medicine.
Gene order in rosid phylogeny, inferred from pairwise syntenies among extant genomes
2012-01-01
Background Ancestral gene order reconstruction for flowering plants has lagged behind developments in yeasts, insects and higher animals, because of the recency of widespread plant genome sequencing, sequencers' embargoes on public data use, paralogies due to whole genome duplication (WGD) and fractionation of undeleted duplicates, extensive paralogy from other sources, and the computational cost of existing methods. Results We address these problems, using the gene order of four core eudicot genomes (cacao, castor bean, papaya and grapevine) that have escaped any recent WGD events, and two others (poplar and cucumber) that descend from independent WGDs, in inferring the ancestral gene order of the rosid clade and those of its main subgroups, the fabids and malvids. We improve and adapt techniques including the OMG method for extracting large, paralogy-free, multiple orthologies from conflated pairwise synteny data among the six genomes and the PATHGROUPS approach for ancestral gene order reconstruction in a given phylogeny, where some genomes may be descendants of WGD events. We use the gene order evidence to evaluate the hypothesis that the order Malpighiales belongs to the malvids rather than as traditionally assigned to the fabids. Conclusions Gene orders of ancestral eudicot species, involving 10,000 or more genes can be reconstructed in an efficient, parsimonious and consistent way, despite paralogies due to WGD and other processes. Pairwise genomic syntenies provide appropriate input to a parameter-free procedure of multiple ortholog identification followed by gene-order reconstruction in solving instances of the "small phylogeny" problem. PMID:22759433
Kaur, Navneet; Hasegawa, Daniel K; Ling, Kai-Shu; Wintermantel, William M
2016-10-01
The relationships between plant viruses and their vectors have evolved over the millennia, and yet, studies on viruses began <150 years ago and investigations into the virus and vector interactions even more recently. The advent of next generation sequencing, including rapid genome and transcriptome analysis, methods for evaluation of small RNAs, and the related disciplines of proteomics and metabolomics offer a significant shift in the ability to elucidate molecular mechanisms involved in virus infection and transmission by insect vectors. Genomic technologies offer an unprecedented opportunity to examine the response of insect vectors to the presence of ingested viruses through gene expression changes and altered biochemical pathways. This review focuses on the interactions between viruses and their whitefly or thrips vectors and on potential applications of genomics-driven control of the insect vectors. Recent studies have evaluated gene expression in vectors during feeding on plants infected with begomoviruses, criniviruses, and tospoviruses, which exhibit very different types of virus-vector interactions. These studies demonstrate the advantages of genomics and the potential complementary studies that rapidly advance our understanding of the biology of virus transmission by insect vectors and offer additional opportunities to design novel genetic strategies to manage insect vectors and the viruses they transmit.
Using flow cytometry to estimate pollen DNA content: improved methodology and applications
Kron, Paul; Husband, Brian C.
2012-01-01
Background and Aims Flow cytometry has been used to measure nuclear DNA content in pollen, mostly to understand pollen development and detect unreduced gametes. Published data have not always met the high-quality standards required for some applications, in part due to difficulties inherent in the extraction of nuclei. Here we describe a simple and relatively novel method for extracting pollen nuclei, involving the bursting of pollen through a nylon mesh, compare it with other methods and demonstrate its broad applicability and utility. Methods The method was tested across 80 species, 64 genera and 33 families, and the data were evaluated using established criteria for estimating genome size and analysing cell cycle. Filter bursting was directly compared with chopping in five species, yields were compared with published values for sonicated samples, and the method was applied by comparing genome size estimates for leaf and pollen nuclei in six species. Key Results Data quality met generally applied standards for estimating genome size in 81 % of species and the higher best practice standards for cell cycle analysis in 51 %. In 41 % of species we met the most stringent criterion of screening 10 000 pollen grains per sample. In direct comparison with two chopping techniques, our method produced better quality histograms with consistently higher nuclei yields, and yields were higher than previously published results for sonication. In three binucleate and three trinucleate species we found that pollen-based genome size estimates differed from leaf tissue estimates by 1·5 % or less when 1C pollen nuclei were used, while estimates from 2C generative nuclei differed from leaf estimates by up to 2·5 %. Conclusions The high success rate, ease of use and wide applicability of the filter bursting method show that this method can facilitate the use of pollen for estimating genome size and dramatically improve unreduced pollen production estimation with flow cytometry. 
PMID:22875815
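The pollen-versus-leaf comparison above rests on the standard flow-cytometry calculation: an internal standard of known 1C value is co-processed with the sample, and the unknown genome size is the standard's 1C value scaled by the ratio of peak fluorescence positions. The peak positions and standard value below are hypothetical numbers chosen to illustrate a ~1.25 % difference between tissues:

```python
def genome_size_pg(sample_peak, standard_peak, standard_1c_pg):
    """Estimate the sample 1C genome size (pg) from mean peak fluorescence,
    scaled against a co-processed internal standard of known 1C value."""
    return standard_1c_pg * (sample_peak / standard_peak)

# Hypothetical peak positions for the same species measured from two tissues.
leaf_estimate = genome_size_pg(sample_peak=120.0, standard_peak=100.0,
                               standard_1c_pg=2.5)
pollen_1c = genome_size_pg(sample_peak=118.5, standard_peak=100.0,
                           standard_1c_pg=2.5)

# Percent difference between pollen- and leaf-based estimates.
pct_diff = abs(pollen_1c - leaf_estimate) / leaf_estimate * 100
```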
Owen, Joseph R.; Noyes, Noelle; Young, Amy E.; Prince, Daniel J.; Blanchard, Patricia C.; Lehenbauer, Terry W.; Aly, Sharif S.; Davis, Jessica H.; O’Rourke, Sean M.; Abdo, Zaid; Belk, Keith; Miller, Michael R.; Morley, Paul; Van Eenennaam, Alison L.
2017-01-01
Extended laboratory culture and antimicrobial susceptibility testing timelines hinder rapid species identification and susceptibility profiling of bacterial pathogens associated with bovine respiratory disease, the most prevalent cause of cattle mortality in the United States. Whole-genome sequencing offers a culture-independent alternative to current bacterial identification methods, but requires a library of bacterial reference genomes for comparison. To contribute new bacterial genome assemblies and evaluate genetic diversity and variation in antimicrobial resistance genotypes, whole-genome sequencing was performed on bovine respiratory disease–associated bacterial isolates (Histophilus somni, Mycoplasma bovis, Mannheimia haemolytica, and Pasteurella multocida) from dairy and beef cattle. One hundred genomically distinct assemblies were added to the NCBI database, doubling the available genomic sequences for these four species. Computer-based methods identified 11 predicted antimicrobial resistance genes in three species, with none being detected in M. bovis. While computer-based analysis can identify antibiotic resistance genes within whole-genome sequences (genotype), it may not predict the actual antimicrobial resistance observed in a living organism (phenotype). Antimicrobial susceptibility testing on 64 H. somni, M. haemolytica, and P. multocida isolates had an overall concordance rate between genotype and phenotypic resistance to the associated class of antimicrobials of 72.7% (P < 0.001), showing substantial discordance. Concordance rates varied greatly among different antimicrobial, antibiotic resistance gene, and bacterial species combinations. 
This suggests that antimicrobial susceptibility phenotypes are needed to complement genomically predicted antibiotic resistance gene genotypes to better understand how the presence of antibiotic resistance genes within a given bacterial species could potentially impact optimal bovine respiratory disease treatment and morbidity/mortality outcomes. PMID:28739600
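The concordance rate reported above is a simple agreement statistic: for each isolate × antimicrobial-class pair, the genotype (resistance gene present or absent) is compared with the phenotype (resistant or susceptible by susceptibility testing), and concordance is the fraction of agreeing pairs. The data below are hypothetical, chosen only to show the calculation; the study's actual rate was 72.7 %:

```python
# Hypothetical (genotype, phenotype) pairs:
# genotype  = resistance gene detected in the assembly (True/False)
# phenotype = phenotypically resistant by susceptibility testing (True/False)
pairs = [
    (True, True), (True, True), (False, False),
    (True, False), (False, False), (False, True),
    (True, True), (False, False), (True, True), (True, False),
]

concordant = sum(1 for genotype, phenotype in pairs if genotype == phenotype)
concordance_rate = concordant / len(pairs)

# The discordant pairs are the interesting ones: gene present but susceptible
# (e.g. unexpressed gene), or resistant with no known gene detected.
discordant = [(g, p) for g, p in pairs if g != p]
```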
Zhu, Bo; Niu, Hong; Zhang, Wengang; Wang, Zezhao; Liang, Yonghu; Guan, Long; Guo, Peng; Chen, Yan; Zhang, Lupei; Guo, Yong; Ni, Heming; Gao, Xue; Gao, Huijiang; Xu, Lingyang; Li, Junya
2017-06-14
Fatty acid composition of muscle is an important trait contributing to meat quality. Recently, genome-wide association studies (GWAS) have been extensively used to explore the molecular mechanisms underlying important traits in cattle. In this study, we performed GWAS using a high-density SNP array to analyze the associations between SNPs and fatty acids, and evaluated the accuracy of genomic prediction for fatty acids in Chinese Simmental cattle. Using the BayesB method, we identified 35 and 7 regions in Chinese Simmental cattle that displayed significant associations with individual fatty acids and fatty acid groups, respectively. We further obtained several candidate genes that may be involved in fatty acid biosynthesis, including elongation of very long chain fatty acids protein 5 (ELOVL5), fatty acid synthase (FASN), caspase 2 (CASP2) and thyroglobulin (TG). Specifically, we obtained strong evidence of association signals for one SNP located at 51.3 Mb for FASN using the Genome-wide Rapid Association Mixed Model and Regression-Genomic Control (GRAMMAR-GC) approach. Also, a region-based association test identified multiple SNPs within FASN and ELOVL5 for C14:0. In addition, our results revealed that the accuracy of genomic prediction for fatty acid composition using BayesB was slightly superior to that of GBLUP in Chinese Simmental cattle. We identified several significantly associated regions and loci that can be considered potential candidate markers for genomics-assisted breeding programs. Using multiple methods, our results provided strong evidence that FASN and ELOVL5 are associated with fatty acids. Our findings also suggested that it is feasible to perform genomic selection for fatty acids in Chinese Simmental cattle.
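The GBLUP side of the BayesB-versus-GBLUP comparison above rests on a genomic relationship matrix (GRM) built from SNP genotypes. A common construction is VanRaden's method 1, sketched here on synthetic genotypes; the study does not specify its exact implementation, so this is an illustration of the standard technique, not the authors' code:

```python
import numpy as np

# Synthetic genotypes: n animals, m SNPs coded 0/1/2 copies of one allele,
# simulated with marker-specific allele frequencies.
rng = np.random.default_rng(0)
n, m = 50, 1000
freq = rng.uniform(0.1, 0.9, m)
X = rng.binomial(2, freq, size=(n, m)).astype(float)

# VanRaden method 1: center genotypes by twice the observed allele
# frequencies, then scale the cross-product by 2 * sum(p * (1 - p)).
p = X.mean(axis=0) / 2.0
Z = X - 2.0 * p
G = (Z @ Z.T) / (2.0 * np.sum(p * (1.0 - p)))

# For a sample of unrelated individuals the diagonal averages near 1;
# GBLUP then uses G as the covariance structure of the breeding values.
mean_diag = float(np.mean(np.diag(G)))
```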
Redundancy analysis allows improved detection of methylation changes in large genomic regions.
Ruiz-Arenas, Carlos; González, Juan R
2017-12-14
DNA methylation is an epigenetic process that regulates gene expression. Methylation can be modified by environmental exposures, and changes in methylation patterns have been associated with diseases. Methylation microarrays measure methylation levels at more than 450,000 CpGs in a single experiment, and the most common analysis strategy is to perform a single-probe analysis to find methylation probes associated with the outcome of interest. However, methylation changes usually occur at the regional level: for example, genomic structural variants can affect methylation patterns in regions up to several megabases in length. Existing DMR methods provide lists of Differentially Methylated Regions (DMRs) of up to only a few kilobases in length, and cannot check whether a target region is differentially methylated. Therefore, these methods are not suitable for evaluating methylation changes in large regions. To address these limitations, we developed a new DMR approach based on redundancy analysis (RDA) that assesses whether a target region is differentially methylated. Using simulated and real datasets, we compared our approach to three common DMR detection methods (Bumphunter, blockFinder, and DMRcate). We found that Bumphunter underestimated methylation changes and blockFinder showed poor performance. DMRcate showed poor power in the simulated datasets and low specificity in the real data analysis. Our method showed very high performance in all simulation settings, even with small sample sizes and subtle methylation changes, while controlling type I error. Other advantages of our method are: 1) it estimates the degree of association between the DMR and the outcome; 2) it can analyze a targeted region of interest; and 3) it can evaluate the simultaneous effects of different variables. The proposed methodology is implemented in MEAL, a Bioconductor package designed to facilitate the analysis of methylation data.
We propose a multivariate approach to decipher whether an outcome of interest alters the methylation pattern of a region of interest. The method is designed to analyze large target genomic regions and outperforms the three most popular methods for detecting DMRs. Our method can evaluate factors with more than two levels or the simultaneous effect of more than one continuous variable, which is not possible with the state-of-the-art methods.
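The core RDA idea can be sketched as a global test of how much methylation variance in a target region is explained by the outcome, with significance assessed by permutation. This is a simplified illustration, not MEAL's implementation; the function names and simulated data are invented.

```python
# Simplified RDA-style global test for one target region: fraction of
# methylation variance explained by the outcome, permutation p-value.
import numpy as np

def rda_r2(Y, x):
    """Y: (samples x CpGs) methylation values; x: outcome vector."""
    Yc = Y - Y.mean(axis=0)
    X = np.column_stack([np.ones(len(x)), x])
    fitted = X @ np.linalg.lstsq(X, Yc, rcond=None)[0]
    return np.sum(fitted ** 2) / np.sum(Yc ** 2)

def rda_perm_test(Y, x, n_perm=199, seed=1):
    rng = np.random.default_rng(seed)
    obs = rda_r2(Y, x)
    null = [rda_r2(Y, rng.permutation(x)) for _ in range(n_perm)]
    p = (1 + sum(r >= obs for r in null)) / (n_perm + 1)
    return obs, p

rng = np.random.default_rng(0)
x = np.repeat([0.0, 1.0], 20)               # two groups of 20 samples
Y = rng.normal(0.0, 0.5, size=(40, 30))     # 30 CpGs in the target region
Y[x == 1.0] += 1.0                          # group 2 is hypermethylated
obs, p = rda_perm_test(Y, x)
```

Because the test pools variance across the whole region rather than probe by probe, it stays sensitive to consistent but subtle shifts spread over many CpGs, which matches the large-region use case described above.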
Orthology prediction methods: A quality assessment using curated protein families
Trachana, Kalliopi; Larsson, Tomas A; Powell, Sean; Chen, Wei-Hua; Doerks, Tobias; Muller, Jean; Bork, Peer
2011-01-01
The increasing number of sequenced genomes has prompted the development of several automated orthology prediction methods. Tests to evaluate the accuracy of predictions and to explore biases caused by biological and technical factors are therefore required. We used 70 manually curated families to analyze the performance of five public methods in Metazoa. We analyzed the strengths and weaknesses of the methods and quantified the impact of biological and technical challenges. From the latter part of the analysis, genome annotation emerged as the largest single influencer, affecting up to 30% of the performance. Generally, most methods did well in assigning orthologous groups, but they failed to assign the exact number of genes for half of the groups. The publicly available benchmark set (http://eggnog.embl.de/orthobench/) should facilitate the improvement of current orthology assignment protocols, which is of utmost importance for many fields of biology and should be tackled by a broad scientific community. PMID:21853451
Zseq: An Approach for Preprocessing Next-Generation Sequencing Data.
Alkhateeb, Abedalrhman; Rueda, Luis
2017-08-01
Next-generation sequencing technology generates a huge number of reads (short sequences), which contain a vast amount of genomic data. The sequencing process, however, comes with artifacts. Preprocessing of sequences is mandatory for further downstream analysis. We present Zseq, a linear method that identifies the most informative genomic sequences and reduces the number of biased sequences, sequence duplications, and ambiguous nucleotides. Zseq finds the complexity of the sequences by counting the number of unique k-mers in each sequence as its corresponding score and also takes into account other factors, such as ambiguous nucleotides or high GC-content percentage in k-mers. Based on a z-score threshold, Zseq sweeps through the sequences again and filters those with a z-score less than the user-defined threshold. The Zseq algorithm is able to provide a better mapping rate; it reduces the number of ambiguous bases significantly in comparison with other methods. Evaluation of the filtered reads has been conducted by aligning the reads and assembling the transcripts using the reference genome as well as de novo assembly. The assembled transcripts show a better discriminative ability to separate cancer and normal samples in comparison with another state-of-the-art method. Moreover, de novo assembled transcripts from the reads filtered by Zseq have longer genomic sequences than other tested methods. A procedure for estimating the cutoff threshold from labeling rules is also introduced, with promising results.
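The scoring-and-filtering step described above can be sketched simply: score each read by its number of unique k-mers, standardize the scores to z-scores, and keep reads above a threshold. This is a simplified illustration; the paper's additional penalties for ambiguous bases and GC-rich k-mers are omitted, and the reads are invented.

```python
# Zseq-style read filtering (simplified): unique-k-mer score -> z-score
# -> threshold. Extra penalty terms from the paper are omitted.
from statistics import mean, pstdev

def unique_kmers(read, k=4):
    return len({read[i:i + k] for i in range(len(read) - k + 1)})

def zseq_filter(reads, k=4, z_cut=-0.5):
    scores = [unique_kmers(r, k) for r in reads]
    mu, sd = mean(scores), pstdev(scores) or 1.0
    return [r for r, s in zip(reads, scores) if (s - mu) / sd >= z_cut]

reads = ["ACGTACGTACGTACGT",   # repetitive: only 4 unique 4-mers
         "AAAAAAAAAAAAAAAA",   # homopolymer: 1 unique 4-mer
         "ACGTTGCAAGCTTGGA",   # complex: 13 unique 4-mers
         "GATTACAGATCCGTAA"]   # complex: 13 unique 4-mers
kept = zseq_filter(reads)      # low-complexity reads are dropped
```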
Economic evaluation of genomic selection in small ruminants: a sheep meat breeding program.
Shumbusho, F; Raoul, J; Astruc, J M; Palhiere, I; Lemarié, S; Fugeray-Scarbel, A; Elsen, J M
2016-06-01
Recent genomic evaluation studies using real data and predicting genetic gain by modeling breeding programs have reported moderate expected benefits from the replacement of classic selection schemes by genomic selection (GS) in small ruminants. The objectives of this study were to compare the cost, monetary genetic gain and economic efficiency of classic selection and GS schemes in the meat sheep industry. Deterministic methods were used to model selection based on multi-trait indices from a sheep meat breeding program. Decisional variables related to male selection candidates and progeny testing were optimized to maximize the annual monetary genetic gain (AMGG), that is, a weighted sum of the annual genetic gains for meat and maternal traits. For GS, a reference population of 2000 individuals was assumed and genomic information was available for evaluation of male candidates only. In the classic selection scheme, male breeding values were estimated from own and offspring phenotypes. In GS, different scenarios were considered, differing by the information used to select males (genomic only, genomic+own performance, genomic+offspring phenotypes). The results showed that all GS scenarios were associated with higher total variable costs than classic selection (assuming a genotyping cost of 123 euros per animal). In terms of AMGG and economic returns, GS scenarios were found to be superior to classic selection only if genomic information was combined with their own meat phenotypes (GS-Pheno) or with their progeny test information. The predicted economic efficiency, defined as returns (proportional to number of expressions of AMGG in the nucleus and commercial flocks) minus total variable costs, showed that the best GS scenario (GS-Pheno) was up to 15% more efficient than classic selection. For all selection scenarios, optimization increased the overall AMGG, returns and economic efficiency.
In conclusion, our study shows that some forms of GS strategies are more advantageous than classic selection, provided that GS is already initiated (i.e. the initial reference population is available). Optimizing decisional variables of the classic selection scheme could be of greater benefit than including genomic information in optimized designs.
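The efficiency criterion above reduces to simple arithmetic: returns (proportional to the number of expressions of AMGG) minus total variable costs. A toy calculation, with all figures invented purely for illustration:

```python
# Toy version of the economic-efficiency comparison: returns minus
# total variable costs. Every number below is invented.
def efficiency(amgg, expressions, variable_costs):
    """Returns are modeled as proportional to AMGG expressions."""
    return amgg * expressions - variable_costs

classic  = efficiency(amgg=1.00, expressions=10_000, variable_costs=4_000)
gs_pheno = efficiency(amgg=1.20, expressions=10_000, variable_costs=5_200)
advantage = gs_pheno / classic - 1.0    # relative gain of GS-Pheno
```

The structure makes the paper's conclusion concrete: GS can pay off only when the extra AMGG earned across nucleus and commercial expressions outweighs the added genotyping costs.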
Badke, Yvonne M; Bates, Ronald O; Ernst, Catherine W; Fix, Justin; Steibel, Juan P
2014-04-16
Genomic selection has the potential to increase genetic progress. Genotype imputation of high-density single-nucleotide polymorphism (SNP) genotypes can improve the cost efficiency of genomic breeding value (GEBV) prediction for pig breeding. Consequently, the objectives of this work were to: (1) estimate accuracy of genomic evaluation and GEBV for three traits in a Yorkshire population and (2) quantify the loss of accuracy of genomic evaluation and GEBV when genotypes were imputed under two scenarios: a high-cost, high-accuracy scenario in which only selection candidates were imputed from a low-density platform and a low-cost, low-accuracy scenario in which all animals were imputed using a small reference panel of haplotypes. Phenotypes and genotypes obtained with the PorcineSNP60 BeadChip were available for 983 Yorkshire boars. Genotypes of selection candidates were masked and imputed using tagSNP in the GeneSeek Genomic Profiler (10K). Imputation was performed with BEAGLE using 128 or 1800 haplotypes as reference panels. GEBV were obtained through an animal-centric ridge regression model using de-regressed breeding values as response variables. Accuracy of genomic evaluation was estimated as the correlation between estimated breeding values and GEBV in a 10-fold cross-validation design. Accuracy of genomic evaluation using observed genotypes was high for all traits (0.65-0.68). Using genotypes imputed from a large reference panel (accuracy: R² = 0.95) for genomic evaluation did not significantly decrease accuracy, whereas a scenario with genotypes imputed from a small reference panel (R² = 0.88) did show a significant decrease in accuracy. Genomic evaluation based on imputed genotypes in selection candidates can be implemented at a fraction of the cost of a genomic evaluation using observed genotypes and still yield virtually the same accuracy.
On the other hand, using a very small reference panel of haplotypes to impute training animals and candidates for selection results in lower accuracy of genomic evaluation.
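The imputation accuracy quoted above is the squared Pearson correlation between true and imputed allele dosages, computed per SNP. A self-contained sketch with invented genotype vectors:

```python
# Per-SNP imputation accuracy: squared correlation between true and
# imputed allele dosages (0/1/2). Example vectors are invented.
def r2(true, imputed):
    n = len(true)
    mt, mi = sum(true) / n, sum(imputed) / n
    cov = sum((t - mt) * (u - mi) for t, u in zip(true, imputed))
    vt = sum((t - mt) ** 2 for t in true)
    vi = sum((u - mi) ** 2 for u in imputed)
    return cov * cov / (vt * vi)

true_doses    = [0, 1, 2, 2, 0, 1, 2, 0]
imputed_doses = [0, 1, 2, 1, 0, 1, 2, 0]   # one of eight animals wrong
acc = r2(true_doses, imputed_doses)
```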
Clinical utility and development of biomarkers in invasive aspergillosis.
Patterson, Thomas F
2011-01-01
The diagnosis of invasive aspergillosis remains very difficult, and there are limited treatment options for the disease. Pre-clinical models have been used to evaluate the diagnosis and treatment of Aspergillus infection and to assess the pathogenicity and virulence of the organism. Extensive efforts in Aspergillus research have significantly expanded the genomic information about this microorganism. The standardization of animal models of invasive aspergillosis can be used to enhance the evaluation of genomic information about the organism to improve the diagnosis and treatment of invasive aspergillosis. One approach to this process has been the award of a contract by the National Institute of Allergy and Infectious Diseases of the National Institutes of Health to establish and standardize animal models of invasive aspergillosis for the development of new diagnostic technologies for both pulmonary and disseminated Aspergillus infection. This work utilizes molecular approaches for the genetic manipulation of Aspergillus strains that can be tested in animal-model systems to establish new diagnostic targets and tools. Studies have evaluated the performance characteristics of assays for cell-wall antigens of Aspergillus including galactomannan and beta-D-glucan, as well as for DNA targets in the organism, through PCR. New targets, such as proteomic and genomic approaches, and novel detection methods, such as point-of-care lateral-flow devices, have also been evaluated. The goal of this paper is to provide a framework for evaluating genomic targets in animal models to improve the diagnosis and treatment of invasive aspergillosis toward ultimately improving the outcomes for patients with this frequently fatal infection.
Assessing the evolutionary rate of positional orthologous genes in prokaryotes using synteny data
Lemoine, Frédéric; Lespinet, Olivier; Labedan, Bernard
2007-01-01
Background Comparison of completely sequenced microbial genomes has revealed how fluid these genomes are. Detecting synteny blocks requires reliable methods for determining the orthologs among the whole set of homologs detected by exhaustive comparisons between each pair of completely sequenced genomes. This is a complex and difficult problem in the field of comparative genomics but will help to better understand the way prokaryotic genomes are evolving. Results We have developed a suite of programs that automate three essential steps to study conservation of gene order, and validated them with a set of 107 bacteria and archaea that cover the majority of the prokaryotic taxonomic space. We identified the whole set of shared homologs between two or more species and computed the evolutionary distance separating each pair of homologs. We applied two strategies to extract from the set of homologs a collection of valid orthologs shared by at least two genomes. The first computes the Reciprocal Smallest Distance (RSD) using the PAM distances separating pairs of homologs. The second method groups homologs into families and reconstructs each family's evolutionary tree, distinguishing bona fide orthologs as well as paralogs created after the last speciation event. Although the phylogenetic tree method often succeeds where RSD fails, the reverse could occasionally be true. Accordingly, we used the data obtained with either method, or their intersection, to identify, for each pair of genomes, the orthologs that are adjacent: the Positional Orthologous Genes (POGs), and to further study their properties. Once all these synteny blocks had been detected, we showed that POGs are subject to more evolutionary constraints than orthologs outside synteny groups, whatever the taxonomic distance separating the compared organisms.
Conclusion The suite of programs described in this paper allows reliable detection of orthologs and is useful for evaluating gene order conservation in prokaryotes, whatever their taxonomic distance. Thus, our approach will facilitate the rapid identification of POGs in the next few years, as we expect to be inundated with thousands of completely sequenced microbial genomes. PMID:18047665
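The Reciprocal Smallest Distance criterion used above can be sketched directly: given evolutionary distances between the genes of two genomes, call a pair orthologous only when each gene is the other's smallest-distance partner. The gene names and distances below are invented.

```python
# Reciprocal Smallest Distance (RSD) sketch over a toy distance matrix.
def rsd_pairs(dist):
    """dist[a][b]: distance between gene a of genome A and gene b of B."""
    pairs = []
    for a, row in dist.items():
        b = min(row, key=row.get)                 # a's closest gene in B
        back = {x: dist[x][b] for x in dist}      # every A-gene vs that b
        if min(back, key=back.get) == a:          # reciprocal check
            pairs.append((a, b))
    return pairs

dist = {"a1": {"b1": 0.10, "b2": 0.90},
        "a2": {"b1": 0.80, "b2": 0.20},
        "a3": {"b1": 0.15, "b2": 0.70}}           # a3 loses b1 to a1
orthologs = rsd_pairs(dist)
```

In the toy data, a3's nearest gene is b1, but b1's nearest gene is a1, so the a3-b1 pair fails the reciprocity test; this is exactly the filtering that separates orthologs from close paralogs.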
Sperber, Nina R; Carpenter, Janet S; Cavallari, Larisa H; J Damschroder, Laura; Cooper-DeHoff, Rhonda M; Denny, Joshua C; Ginsburg, Geoffrey S; Guan, Yue; Horowitz, Carol R; Levy, Kenneth D; Levy, Mia A; Madden, Ebony B; Matheny, Michael E; Pollin, Toni I; Pratt, Victoria M; Rosenman, Marc; Voils, Corrine I; W Weitzel, Kristen; Wilke, Russell A; Ryanne Wu, R; Orlando, Lori A
2017-05-22
To realize potential public health benefits from genetic and genomic innovations, understanding how best to implement the innovations into clinical care is important. The objective of this study was to synthesize data on challenges identified by six diverse projects that are part of a National Human Genome Research Institute (NHGRI)-funded network focused on implementing genomics into practice and strategies to overcome these challenges. We used a multiple-case study approach with each project considered as a case and qualitative methods to elicit and describe themes related to implementation challenges and strategies. We describe challenges and strategies in an implementation framework and typology to enable consistent definitions and cross-case comparisons. Strategies were linked to challenges based on expert review and shared themes. Three challenges were identified by all six projects, and strategies to address these challenges varied across the projects. One common challenge was to increase the relative priority of integrating genomics within the health system electronic health record (EHR). Four projects used data warehousing techniques to accomplish the integration. The second common challenge was to strengthen clinicians' knowledge and beliefs about genomic medicine. To overcome this challenge, all projects developed educational materials and conducted meetings and outreach focused on genomic education for clinicians. The third challenge was engaging patients in the genomic medicine projects. Strategies to overcome this challenge included use of mass media to spread the word, actively involving patients in implementation (e.g., a patient advisory board), and preparing patients to be active participants in their healthcare decisions. This is the first collaborative evaluation focusing on the description of genomic medicine innovations implemented in multiple real-world clinical settings. 
Findings suggest that strategies to facilitate integration of genomic data within existing EHRs and educate stakeholders about the value of genomic services are considered important for effective implementation. Future work could build on these findings to evaluate which strategies are optimal under what conditions. This information will be useful for guiding translation of discoveries to clinical care, which, in turn, can provide data to inform continual improvement of genomic innovations and their applications.
Automated ensemble assembly and validation of microbial genomes.
Koren, Sergey; Treangen, Todd J; Hill, Christopher M; Pop, Mihai; Phillippy, Adam M
2014-05-03
The continued democratization of DNA sequencing has sparked a new wave of development of genome assembly and assembly validation methods. As individual research labs, rather than centralized centers, begin to sequence the majority of new genomes, it is important to establish best practices for genome assembly. However, recent evaluations such as GAGE and the Assemblathon have concluded that there is no single best approach to genome assembly. Instead, it is preferable to generate multiple assemblies and validate them to determine which is most useful for the desired analysis; this is a labor-intensive process that is often infeasible. To encourage best practices supported by the community, we present iMetAMOS, an automated ensemble assembly pipeline; iMetAMOS encapsulates the process of running, validating, and selecting a single assembly from multiple assemblies. iMetAMOS packages several leading open-source tools into a single binary that automates parameter selection and execution of multiple assemblers, scores the resulting assemblies based on multiple validation metrics, and annotates the assemblies for genes and contaminants. We demonstrate the utility of the ensemble process on 225 previously unassembled Mycobacterium tuberculosis genomes as well as a Rhodobacter sphaeroides benchmark dataset. On these real data, iMetAMOS reliably produces validated assemblies and identifies potential contamination without user intervention. In addition, intelligent parameter selection produces assemblies of R. sphaeroides comparable to or exceeding the quality of those from the GAGE-B evaluation, affecting the relative ranking of some assemblers. Ensemble assembly with iMetAMOS provides users with multiple, validated assemblies for each genome.
Although computationally limited to small or mid-sized genomes, this approach is the most effective and reproducible means for generating high-quality assemblies and enables users to select an assembly best tailored to their specific needs.
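The select-one-assembly-from-many step can be sketched as rank aggregation over validation metrics: rank each candidate assembly on every metric and keep the best aggregate rank. This is an illustration of the general idea, not iMetAMOS's actual scoring; assembler names and metric values are invented.

```python
# Ensemble-selection sketch: sum per-metric ranks, keep the best.
def pick_best(assemblies):
    """assemblies: {name: {metric: value}}, every metric higher-is-better
    (negate error counts before calling)."""
    metrics = next(iter(assemblies.values()))
    totals = dict.fromkeys(assemblies, 0)
    for m in metrics:
        ranked = sorted(assemblies, key=lambda n: assemblies[n][m], reverse=True)
        for rank, name in enumerate(ranked):
            totals[name] += rank
    return min(totals, key=totals.get)      # smallest rank sum wins

scores = {
    "assembler_a": {"n50": 120_000, "core_genes": 4100, "neg_misassemblies": -12},
    "assembler_b": {"n50":  45_000, "core_genes": 3900, "neg_misassemblies": -40},
    "assembler_c": {"n50":  90_000, "core_genes": 4050, "neg_misassemblies": -20},
}
best = pick_best(scores)
```

Aggregating ranks rather than raw values keeps incommensurable metrics (contiguity, gene content, error counts) on an equal footing, which is why ensemble pipelines favor this style of scoring.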
A universal genomic coordinate translator for comparative genomics.
Zamani, Neda; Sundström, Görel; Meadows, Jennifer R S; Höppner, Marc P; Dainat, Jacques; Lantz, Henrik; Haas, Brian J; Grabherr, Manfred G
2014-06-30
Genomic duplications constitute major events in the evolution of species, allowing paralogous copies of genes to take on fine-tuned biological roles. Unambiguously identifying the orthology relationship between copies across multiple genomes can be resolved by synteny, i.e. the conserved order of genomic sequences. However, a comprehensive analysis of duplication events and their contributions to evolution would require all-to-all genome alignments, which grow as N² with the number of available genomes, N. Here, we introduce Kraken, software that omits the all-to-all requirement by recursively traversing a graph of pairwise alignments and dynamically re-computing orthology. Kraken scales linearly with the number of targeted genomes, N, which allows for including large numbers of genomes in analyses. We first evaluated the method on the set of 12 Drosophila genomes, finding that orthologous correspondence computed indirectly through a graph of multiple synteny maps comes at minimal cost in terms of sensitivity, but reduces overall computational runtime by an order of magnitude. We then used the method on three well-annotated mammalian genomes, human, mouse, and rat, and show that up to 93% of protein coding transcripts have unambiguous pairwise orthologous relationships across the genomes. On a nucleotide level, 70 to 83% of exons match exactly at both splice junctions, and up to 97% on at least one junction. We last applied Kraken to an RNA-sequencing dataset from multiple vertebrates and diverse tissues, where we confirmed that brain-specific gene family members, i.e. one-to-many or many-to-many homologs, are more highly correlated across species than single-copy (i.e. one-to-one homologous) genes.
Not limited to protein coding genes, Kraken also identifies thousands of additional transcribed loci, likely non-coding RNAs, that are consistently transcribed in human, chimpanzee and gorilla, and maintain significant correlation of expression levels across species. Kraken is a computational genome coordinate translator that facilitates cross-species comparisons, distinguishes orthologs from paralogs, and does not require costly all-to-all whole genome mappings. Kraken is freely available under the LGPL from http://github.com/nedaz/kraken.
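The graph idea that lets Kraken avoid all-to-all alignments can be sketched as composing pairwise correspondence maps along a path of genomes: a genome1-to-genome3 mapping is recovered through an intermediate genome rather than computed directly. This is a conceptual sketch, not Kraken's algorithm; the gene maps below are invented toy data.

```python
# Compose pairwise ortholog maps along a path of genomes instead of
# aligning every genome pair directly. Toy data, not real orthologs.
def compose(maps, path):
    """maps[(g, h)]: dict of gene-in-g -> gene-in-h; path: genome list."""
    result = dict(maps[(path[0], path[1])])
    for g, h in zip(path[1:], path[2:]):
        step = maps[(g, h)]
        # keep only genes whose correspondence survives the next hop
        result = {a: step[b] for a, b in result.items() if b in step}
    return result

maps = {
    ("human", "mouse"): {"GENE1": "gene1_m", "GENE2": "gene2_m"},
    ("mouse", "rat"):   {"gene1_m": "gene1_r", "gene2_m": "gene2_r"},
}
human_to_rat = compose(maps, ["human", "mouse", "rat"])
```

With pairwise maps stored for a spanning set of genome pairs, any pair of genomes can be connected through such a path, which is what turns the quadratic all-to-all cost into a linear one.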
Ai, Yuncan; Ai, Hannan; Meng, Fanmei; Zhao, Lei
2013-01-01
Little attention has been paid to comparing sets of genome sequences that cross genetic components and biological categories, with wide divergence over a large size range. We define this as systematic comparative genomics and aim to develop its methodology. First, we create a method, GenomeFingerprinter, to unambiguously produce a set of three-dimensional coordinates from a sequence, followed by one three-dimensional plot and six two-dimensional trajectory projections, to illustrate the genome fingerprint of a given genome sequence. Second, we develop a set of concepts and tools and thereby establish a method called universal genome fingerprint analysis (UGFA). In particular, we define the total genetic component configuration (TGCC) (including chromosome, plasmid, and phage) for describing a strain as a systematic unit, the universal genome fingerprint map (UGFM) of TGCC for differentiating strains as a universal system, and systematic comparative genomics (SCG) for comparing a set of genomes across genetic components and biological categories. Third, we construct a method of quantitative analysis to compare two genomes using the outcome dataset of genome fingerprint analysis. Specifically, we define the geometric center and its geometric mean for a given genome fingerprint map, followed by the Euclidean distance, the differentiation rate, and the weighted differentiation rate to quantitatively describe the difference between two genomes under comparison. Moreover, we demonstrate the applications through case studies on various genome sequences, giving insights into critical issues in microbial genomics and taxonomy. We have created a method, GenomeFingerprinter, for rapidly computing, geometrically visualizing, and intuitively comparing a set of genomes at the genome fingerprint level, and hence established universal genome fingerprint analysis, as well as a method for quantitative analysis of the resulting dataset.
Together, these establish a methodology for systematic comparative genomics based on genome fingerprint analysis.
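The quantitative comparison step described above can be illustrated with a toy version: walk a sequence through 3-D space, take the geometric center of the trajectory, and compare two genomes by the Euclidean distance between their centers. The base-to-step encoding here is invented for illustration and is not the paper's actual GenomeFingerprinter mapping.

```python
# Toy genome-fingerprint comparison: 3-D trajectory -> geometric center
# -> Euclidean distance. The STEP encoding is an invented stand-in.
import math

STEP = {"A": (1, 0, 0), "C": (0, 1, 0), "G": (0, 0, 1), "T": (1, 1, 1)}

def center(seq):
    x = y = z = 0
    points = []
    for base in seq:
        dx, dy, dz = STEP[base]
        x, y, z = x + dx, y + dy, z + dz
        points.append((x, y, z))
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(3))

def center_distance(seq1, seq2):
    """Distance between the geometric centers of two fingerprints."""
    return math.dist(center(seq1), center(seq2))

d_same = center_distance("ACGTACGT", "ACGTACGT")   # identical genomes
d_diff = center_distance("ACGTACGT", "GGGGCCCC")   # divergent genomes
```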
Mycobacterium tuberculosis promotes genomic instability in macrophages
Castro-Garza, Jorge; Luévano-Martínez, Miriam Lorena; Villarreal-Treviño, Licet; Gosálvez, Jaime; Fernández, José Luis; Dávila-Rodríguez, Martha Imelda; García-Vielma, Catalina; González-Hernández, Silvia; Cortés-Gutiérrez, Elva Irene
2018-01-01
BACKGROUND Mycobacterium tuberculosis is an intracellular pathogen, which may either block cellular defensive mechanisms and survive inside the host cell or induce cell death. Several studies are still exploring the mechanisms involved in these processes. OBJECTIVES To evaluate the genomic instability of M. tuberculosis-infected macrophages and compare it with that of uninfected macrophages. METHODS We analysed the possible variations in the genomic instability of Mycobacterium-infected macrophages using the DNA breakage detection fluorescence in situ hybridisation (DBD-FISH) technique with a whole human genome DNA probe. FINDINGS Quantitative image analyses showed a significant increase in DNA damage in infected macrophages as compared with uninfected cells. DNA breaks were localised in nuclear membrane blebs, as confirmed with a DNA fragmentation assay. Furthermore, a significant increase in micronuclei and nuclear abnormalities was observed in infected macrophages versus uninfected cells. MAIN CONCLUSIONS Genomic instability occurs during mycobacterial infection, and these data may serve as a basis for future research on host cell DNA damage in M. tuberculosis infection. PMID:29412354
A Partial Least Squares Based Procedure for Upstream Sequence Classification in Prokaryotes.
Mehmood, Tahir; Bohlin, Jon; Snipen, Lars
2015-01-01
The upstream region of coding genes is important for several reasons, for instance for locating transcription factor binding sites and start site initiation in genomic DNA. Motivated by a recently conducted study in which a multivariate approach was successfully applied to coding sequence modeling, we introduce a partial least squares (PLS) based procedure for classifying true prokaryotic upstream sequences against background upstream sequences. The upstream sequences of coding genes conserved across genomes were considered in the analysis, where conserved coding genes were identified using the pan-genome concept for each prokaryotic species considered. PLS uses a position-specific scoring matrix (PSSM) to study the characteristics of the upstream region. Results obtained by the PLS-based method were compared with the Gini importance of random forest (RF) and with support vector machines (SVM), which are widely used methods for sequence classification. The upstream sequence classification performance was evaluated using cross validation, and the suggested approach identified the prokaryotic upstream region significantly better than RF (p-value < 0.01) and SVM (p-value < 0.01). Further, the proposed method also produced results that concurred with known biological characteristics of the upstream region.
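The PSSM representation mentioned above can be sketched as per-position log-odds of base frequencies in aligned upstream sequences against a uniform background; the PLS step then operates on such scores. The example sequences below are invented (loosely resembling a -10 box motif).

```python
# PSSM sketch: column-wise base frequencies with pseudocounts, scored
# as log-odds against a uniform 0.25 background. Sequences invented.
import math

def build_pssm(seqs, pseudo=0.5):
    pssm = []
    for i in range(len(seqs[0])):
        column = [s[i] for s in seqs]
        pssm.append({b: math.log2(((column.count(b) + pseudo)
                                   / (len(seqs) + 4 * pseudo)) / 0.25)
                     for b in "ACGT"})
    return pssm

def pssm_score(seq, pssm):
    """Sum of per-position log-odds; higher = more motif-like."""
    return sum(pssm[i][b] for i, b in enumerate(seq))

upstream = ["TATAAT", "TATGAT", "TACAAT", "TATAAT"]
pssm = build_pssm(upstream)
```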
Muley, Vijaykumar Yogesh; Ranjan, Akash
2012-01-01
Recent progress in computational methods for predicting physical and functional protein-protein interactions has provided new insights into the complexity of biological processes. Most of these methods assume that functionally interacting proteins are likely to have a shared evolutionary history. This history can be traced out for the protein pairs of a query genome by correlating different evolutionary aspects of their homologs in multiple genomes known as the reference genomes. These methods include phylogenetic profiling, gene neighborhood and co-occurrence of the orthologous protein coding genes in the same cluster or operon. These are collectively known as genomic context methods. On the other hand, a method called mirrortree is based on the similarity of phylogenetic trees between two interacting proteins. Comprehensive performance analyses of these methods have been frequently reported in the literature. However, very few studies provide insight into the effect of reference genome selection on detection of meaningful protein interactions. We analyzed the performance of four methods and their variants to understand the effect of reference genome selection on prediction efficacy. We used six sets of reference genomes, sampled in accordance with phylogenetic diversity and relationship between organisms from 565 bacteria. We used Escherichia coli as a model organism and the gold standard datasets of interacting proteins reported in the DIP, EcoCyc and KEGG databases to compare the performance of the prediction methods. Higher performance for predicting protein-protein interactions was achievable even with 100-150 bacterial genomes out of 565 genomes. Inclusion of archaeal genomes in the reference genome set improves performance. We find that in order to obtain good performance, it is better to sample a few genomes of related genera of prokaryotes from the large number of available genomes.
Moreover, such a sampling allows for selecting 50-100 genomes for comparable accuracy of predictions when computational resources are limited.
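The phylogenetic-profiling approach evaluated above can be sketched minimally: each protein becomes a presence/absence vector across the reference genomes, and profile similarity (here Jaccard) flags a possible functional link. The protein names and profiles below are invented.

```python
# Phylogenetic-profiling sketch: presence/absence vectors over
# reference genomes, compared with the Jaccard index. Data invented.
def jaccard(p, q):
    both = sum(1 for a, b in zip(p, q) if a and b)
    either = sum(1 for a, b in zip(p, q) if a or b)
    return both / either if either else 0.0

profiles = {                             # 8 reference genomes per protein
    "protA": [1, 1, 0, 1, 0, 1, 1, 0],
    "protB": [1, 1, 0, 1, 0, 1, 1, 0],   # co-occurs with protA everywhere
    "protC": [0, 0, 1, 0, 1, 0, 0, 1],   # complementary pattern
}
s_linked = jaccard(profiles["protA"], profiles["protB"])
s_unlinked = jaccard(profiles["protA"], profiles["protC"])
```

The study's point about reference genome choice maps directly onto this sketch: which genomes supply the columns of the profile vectors determines how informative the similarity scores are.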
Cryptosporidium as a testbed for single cell genome characterization of unicellular eukaryotes.
Troell, Karin; Hallström, Björn; Divne, Anna-Maria; Alsmark, Cecilia; Arrighi, Romanico; Huss, Mikael; Beser, Jessica; Bertilsson, Stefan
2016-06-23
Infectious disease involving multiple genetically distinct populations of pathogens is frequently concurrent, but difficult to detect or describe with current routine methodology. Cryptosporidium sp. is a widespread gastrointestinal protozoan of global significance in both animals and humans. It cannot be easily maintained in culture and infections of multiple strains have been reported. To explore the potential use of single cell genomics methodology for revealing genome-level variation in clinical samples from Cryptosporidium-infected hosts, we sorted individual oocysts for subsequent genome amplification and full-genome sequencing. Cells were identified with fluorescent antibodies with an 80 % success rate for the entire single cell genomics workflow, demonstrating that the methodology can be applied directly to purified fecal samples. Ten amplified genomes from sorted single cells were selected for genome sequencing and compared both to the original population and a reference genome in order to evaluate the accuracy and performance of the method. Single cell genome coverage was on average 81 % even with a moderate sequencing effort and by combining the 10 single cell genomes, the full genome was accounted for. By a comparison to the original sample, biological variation could be distinguished and separated from noise introduced in the amplification. As a proof of principle, we have demonstrated the power of applying single cell genomics to dissect infectious disease caused by closely related parasite species or subtypes. The workflow can easily be expanded and adapted to target other protozoans, and potential applications include mapping genome-encoded traits, virulence, pathogenicity, host specificity and resistance at the level of cells as truly meaningful biological units.
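The pooling result above (81% average single-cell coverage, full genome from the union of ten cells) can be illustrated with a toy calculation; the coverage intervals below are invented.

```python
# Toy illustration: partial per-cell coverage, complete combined
# coverage. Positions and intervals are invented.
reference = set(range(100))                  # 100 reference positions
cells = [set(range(0, 81)),                  # each cell covers ~81%
         set(range(10, 91)),
         set(range(19, 100))]
per_cell = [len(c) / len(reference) for c in cells]
combined = len(set().union(*cells)) / len(reference)
```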
Evaluating phylogenetic congruence in the post-genomic era.
Leigh, Jessica W; Lapointe, François-Joseph; Lopez, Philippe; Bapteste, Eric
2011-01-01
Congruence is a broadly applied notion in evolutionary biology used to justify multigene phylogeny or phylogenomics, as well as in studies of coevolution, lateral gene transfer, and as evidence for common descent. Existing methods for identifying incongruence or heterogeneity using character data were designed for data sets that are both small and expected to be rarely incongruent. At the same time, methods that assess incongruence using comparison of trees test a null hypothesis of uncorrelated tree structures, which may be inappropriate for phylogenomic studies. As such, they are ill-suited for the growing number of available genome sequences, most of which are from prokaryotes and viruses, either for phylogenomic analysis or for studies of the evolutionary forces and events that have shaped these genomes. Specifically, many existing methods scale poorly with large numbers of genes, cannot accommodate high levels of incongruence, and do not adequately model patterns of missing taxa for different markers. We propose the development of novel incongruence assessment methods suitable for the analysis of the molecular evolution of the vast majority of life and support the investigation of homogeneity of evolutionary process in cases where markers do not share identical tree structures. PMID:21712432
Ha, Gavin; Roth, Andrew; Khattra, Jaswinder; Ho, Julie; Yap, Damian; Prentice, Leah M; Melnyk, Nataliya; McPherson, Andrew; Bashashati, Ali; Laks, Emma; Biele, Justina; Ding, Jiarui; Le, Alan; Rosner, Jamie; Shumansky, Karey; Marra, Marco A; Gilks, C Blake; Huntsman, David G; McAlpine, Jessica N; Aparicio, Samuel; Shah, Sohrab P
2014-11-01
The evolution of cancer genomes within a single tumor creates mixed cell populations with divergent somatic mutational landscapes. Inference of tumor subpopulations has been disproportionately focused on the assessment of somatic point mutations, whereas computational methods targeting evolutionary dynamics of copy number alterations (CNA) and loss of heterozygosity (LOH) in whole-genome sequencing data remain underdeveloped. We present a novel probabilistic model, TITAN, to infer CNA and LOH events while accounting for mixtures of cell populations, thereby estimating the proportion of cells harboring each event. We evaluate TITAN on idealized mixtures, simulating clonal populations from whole-genome sequences taken from genomically heterogeneous ovarian tumor sites collected from the same patient. In addition, we show in 23 whole genomes of breast tumors that the inference of CNA and LOH using TITAN critically informs population structure and the nature of the evolving cancer genome. Finally, we experimentally validated subclonal predictions using fluorescence in situ hybridization (FISH) and single-cell sequencing from an ovarian cancer patient sample, thereby recapitulating the key modeling assumptions of TITAN. © 2014 Ha et al.; Published by Cold Spring Harbor Laboratory Press.
Genomic Selection in Dairy Cattle: The USDA Experience.
Wiggans, George R; Cole, John B; Hubbard, Suzanne M; Sonstegard, Tad S
2017-02-08
Genomic selection has revolutionized dairy cattle breeding. Since 2000, assays have been developed to genotype large numbers of single-nucleotide polymorphisms (SNPs) at relatively low cost. The first commercial SNP genotyping chip was released with a set of 54,001 SNPs in December 2007. Over 15,000 genotypes were used to determine which SNPs should be used in genomic evaluation of US dairy cattle. Official USDA genomic evaluations were first released in January 2009 for Holsteins and Jerseys, in August 2009 for Brown Swiss, in April 2013 for Ayrshires, and in April 2016 for Guernseys. Producers have accepted genomic evaluations as accurate indications of a bull's eventual daughter-based evaluation. The integration of DNA marker technology and genomics into the traditional evaluation system has doubled the rate of genetic progress for traits of economic importance, decreased the generation interval, increased selection accuracy, reduced the cost of progeny testing, and allowed identification of recessive lethals.
Zhu, Wensheng; Yuan, Ying; Zhang, Jingwen; Zhou, Fan; Knickmeyer, Rebecca C; Zhu, Hongtu
2017-02-01
The aim of this paper is to systematically evaluate a biased sampling issue associated with genome-wide association analysis (GWAS) of imaging phenotypes for most imaging genetic studies, including the Alzheimer's Disease Neuroimaging Initiative (ADNI). Specifically, the original sampling scheme of these imaging genetic studies is primarily the retrospective case-control design, whereas most existing statistical analyses of these studies ignore such sampling scheme by directly correlating imaging phenotypes (called the secondary traits) with genotype. Although it has been well documented in genetic epidemiology that ignoring the case-control sampling scheme can produce highly biased estimates, and subsequently lead to misleading results and suspicious associations, such findings are not well documented in imaging genetics. We use extensive simulations and a large-scale imaging genetic data analysis of the Alzheimer's Disease Neuroimaging Initiative (ADNI) data to evaluate the effects of the case-control sampling scheme on GWAS results based on some standard statistical methods, such as linear regression methods, while comparing it with several advanced statistical methods that appropriately adjust for the case-control sampling scheme. Copyright © 2016 Elsevier Inc. All rights reserved.
Background: Asthma and allergy represent complex phenotypes, which disproportionately burden ethnic minorities in the United States. Strong evidence for genomic factors predisposing subjects to asthma/allergy is available. However, methods to utilize this information to identify ...
Haplotype assembly in polyploid genomes and identical by descent shared tracts.
Aguiar, Derek; Istrail, Sorin
2013-07-01
Genome-wide haplotype reconstruction from sequence data, or haplotype assembly, is at the center of major challenges in molecular biology and life sciences. For complex eukaryotic organisms like humans, the genome is vast and the population samples are growing so rapidly that algorithms processing high-throughput sequencing data must scale favorably in terms of both accuracy and computational efficiency. Furthermore, current models and methodologies for haplotype assembly (i) do not consider individuals sharing haplotypes jointly, which reduces the size and accuracy of assembled haplotypes, and (ii) are unable to model genomes having more than two sets of homologous chromosomes (polyploidy). Polyploid organisms are increasingly becoming the target of many research groups interested in the genomics of disease, phylogenetics, botany and evolution but there is an absence of theory and methods for polyploid haplotype reconstruction. In this work, we present a number of results, extensions and generalizations of compass graphs and our HapCompass framework. We prove the theoretical complexity of two haplotype assembly optimizations, thereby motivating the use of heuristics. Furthermore, we present graph theory-based algorithms for the problem of haplotype assembly using our previously developed HapCompass framework for (i) novel implementations of haplotype assembly optimizations (minimum error correction), (ii) assembly of a pair of individuals sharing a haplotype tract identical by descent and (iii) assembly of polyploid genomes. We evaluate our methods on 1000 Genomes Project, Pacific Biosciences and simulated sequence data. HapCompass is available for download at http://www.brown.edu/Research/Istrail_Lab/. Supplementary data are available at Bioinformatics online.
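The minimum error correction (MEC) optimization named in this record can be stated compactly: choose the haplotype pair that minimizes the number of base corrections needed to make every fragment consistent with one of the two haplotypes. Below is a minimal sketch of the scoring function only (HapCompass itself solves the problem on a compass graph, which is not reproduced here); the 0/1/'-' fragment encoding is an illustrative assumption.

```python
def mec_score(fragments, hap1, hap2):
    """Total base corrections needed to make every fragment consistent
    with one of the two candidate haplotypes ('-' marks a site the
    fragment does not cover)."""
    def mismatches(frag, hap):
        return sum(1 for f, h in zip(frag, hap) if f != '-' and f != h)
    # Each fragment is assigned to whichever haplotype it matches better.
    return sum(min(mismatches(f, hap1), mismatches(f, hap2)) for f in fragments)

fragments = ["01--1", "--100", "1110-"]
print(mec_score(fragments, "01100", "10011"))  # 2 corrections needed
```

A full assembler would search over candidate haplotype pairs (an NP-hard problem in general, which is why heuristics are motivated) rather than score a single fixed pair.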
Post-Genomics Nanotechnology Is Gaining Momentum: Nanoproteomics and Applications in Life Sciences
Kobeissy, Firas H.; Gulbakan, Basri; Alawieh, Ali; Karam, Pierre; Zhang, Zhiqun; Guingab-Cagmat, Joy D.; Mondello, Stefania; Tan, Weihong; Anagli, John
2014-01-01
The post-genomics era has brought about new Omics biotechnologies, such as proteomics and metabolomics, as well as their novel applications to personal genomics and the quantified self. These advances are now also catalyzing other and newer post-genomics innovations, leading to convergences between Omics and nanotechnology. In this work, we systematically contextualize and exemplify an emerging strand of post-genomics life sciences, namely, nanoproteomics and its applications in health and integrative biological systems. Nanotechnology has been utilized as a complementary component to revolutionize proteomics through different kinds of nanotechnology applications, including nanoporous structures, functionalized nanoparticles, quantum dots, and polymeric nanostructures. Those applications, though still in their infancy, have led to several highly sensitive diagnostics and new methods of drug delivery and targeted therapy for clinical use. The present article differs from previous analyses of nanoproteomics in that it offers an in-depth and comparative evaluation of the attendant biotechnology portfolio and their applications as seen through the lens of post-genomics life sciences and biomedicine. These include: (1) immunosensors for inflammatory, pathogenic, and autoimmune markers for infectious and autoimmune diseases, (2) amplified immunoassays for detection of cancer biomarkers, and (3) methods for targeted therapy and automatically adjusted drug delivery such as in experimental stroke and brain injury studies. As nanoproteomics becomes available both to the clinician at the bedside and the citizens who are increasingly interested in access to novel post-genomics diagnostics through initiatives such as the quantified self, we anticipate further breakthroughs in personalized and targeted medicine. PMID:24410486
Diaz, Naryttza N; Krause, Lutz; Goesmann, Alexander; Niehaus, Karsten; Nattkemper, Tim W
2009-01-01
Background Metagenomics, or the sequencing and analysis of collective genomes (metagenomes) of microorganisms isolated from an environment, promises direct access to the "unculturable majority". This emerging field offers the potential to lay a solid basis for our understanding of the entire living world. However, taxonomic classification, an essential task in the analysis of metagenomic data sets, is still far from being solved. We present a novel strategy to predict the taxonomic origin of environmental genomic fragments. The proposed classifier combines the idea of the k-nearest neighbor with strategies from kernel-based learning. Results Our novel strategy was extensively evaluated using the leave-one-out cross validation strategy on fragments of variable length (800 bp – 50 Kbp) from 373 completely sequenced genomes. TACOA is able to classify genomic fragments of length 800 bp and 1 Kbp with high accuracy down to the rank of class. For longer fragments (≥ 3 Kbp), accurate predictions are made at even deeper taxonomic ranks (order and genus). Remarkably, TACOA also produces reliable results when the taxonomic origin of a fragment is not represented in the reference set, assigning such fragments to their known broader taxonomic class or simply to "unknown". We compared the classification accuracy of TACOA with the latest intrinsic classifier PhyloPythia using 63 recently published complete genomes. For fragments of length 800 bp and 1 Kbp the overall accuracy of TACOA is higher than that obtained by PhyloPythia at all taxonomic ranks. For all fragment lengths, both methods achieved comparably high specificity up to the rank of class, and low false negative rates were also obtained. Conclusion An accurate multi-class taxonomic classifier was developed for environmental genomic fragments. TACOA can predict with high reliability the taxonomic origin of genomic fragments as short as 800 bp.
The proposed method is transparent, fast and accurate, and the reference set can be easily updated as newly sequenced genomes become available. Moreover, the method proved competitive when compared to the most current classifier PhyloPythia and has the advantage that it can be installed locally and its reference set kept up to date. PMID:19210774
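The combination of k-nearest neighbor classification with kernel-based weighting described in the TACOA record can be sketched roughly as follows. The 2-mer features, the Euclidean distance, and the Gaussian kernel with its gamma value are all illustrative assumptions here, not TACOA's actual feature set or kernel.

```python
from collections import Counter
from math import exp

def kmer_freqs(seq, k=2):
    """Normalized k-mer frequency vector (as a dict) for a DNA sequence."""
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts.values())
    return {kmer: c / total for kmer, c in counts.items()}

def distance(f1, f2):
    """Euclidean distance between two sparse frequency vectors."""
    keys = set(f1) | set(f2)
    return sum((f1.get(x, 0) - f2.get(x, 0)) ** 2 for x in keys) ** 0.5

def kernel_knn(query, references, k=3, gamma=10.0):
    """Classify a fragment by a kernel-weighted vote among its k nearest
    reference fragments (Gaussian kernel; gamma chosen for illustration)."""
    q = kmer_freqs(query)
    neighbors = sorted((distance(q, kmer_freqs(s)), label)
                       for s, label in references)[:k]
    votes = {}
    for d, label in neighbors:
        votes[label] = votes.get(label, 0.0) + exp(-gamma * d * d)
    return max(votes, key=votes.get)
```

Real classifiers of this kind use longer fragments, higher-order oligonucleotide patterns, and reference vectors precomputed from complete genomes; the sketch only shows how the kernel softens the plain k-NN vote.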
Quantitative phenotyping via deep barcode sequencing.
Smith, Andrew M; Heisler, Lawrence E; Mellor, Joseph; Kaper, Fiona; Thompson, Michael J; Chee, Mark; Roth, Frederick P; Giaever, Guri; Nislow, Corey
2009-10-01
Next-generation DNA sequencing technologies have revolutionized diverse genomics applications, including de novo genome sequencing, SNP detection, chromatin immunoprecipitation, and transcriptome analysis. Here we apply deep sequencing to genome-scale fitness profiling to evaluate yeast strain collections in parallel. This method, Barcode analysis by Sequencing, or "Bar-seq," outperforms the current benchmark barcode microarray assay in terms of both dynamic range and throughput. When applied to a complex chemogenomic assay, Bar-seq quantitatively identifies drug targets, with performance superior to the benchmark microarray assay. We also show that Bar-seq is well-suited for a multiplex format. We completely re-sequenced and re-annotated the yeast deletion collection using deep sequencing, found that approximately 20% of the barcodes and common priming sequences varied from expectation, and used this revised list of barcode sequences to improve data quality. Together, this new assay and analysis routine provide a deep-sequencing-based toolkit for identifying gene-environment interactions on a genome-wide scale.
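The heart of a Bar-seq analysis is counting barcode occurrences per strain and comparing relative abundances across conditions. Below is a toy sketch assuming exact barcode matches and a simple pseudocount; a real pipeline would also handle sequencing errors, the revised barcode list the authors describe, and depth normalization.

```python
from collections import Counter
from math import log2

def count_barcodes(reads, barcode_to_strain):
    """Tally reads whose barcode exactly matches a known strain barcode;
    unmatched reads are ignored."""
    counts = Counter()
    for read in reads:
        strain = barcode_to_strain.get(read)
        if strain is not None:
            counts[strain] += 1
    return counts

def log2_fitness_ratio(control, treatment, pseudo=1):
    """Per-strain log2 ratio of treatment to control abundance; the
    pseudocount avoids division by zero for dropout strains."""
    strains = set(control) | set(treatment)
    return {s: log2((treatment.get(s, 0) + pseudo) / (control.get(s, 0) + pseudo))
            for s in strains}
```

A strongly negative ratio flags a strain depleted under treatment, the signature used to identify drug-sensitive deletion mutants in chemogenomic assays.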
Kelemen, Arpad; Vasilakos, Athanasios V; Liang, Yulan
2009-09-01
Comprehensive evaluation of common genetic variations through association of single-nucleotide polymorphism (SNP) structure with common complex disease in the genome-wide scale is currently a hot area in human genome research due to the recent development of the Human Genome Project and HapMap Project. Computational science, which includes computational intelligence (CI), has recently become the third method of scientific enquiry besides theory and experimentation. There have been fast growing interests in developing and applying CI in disease mapping using SNP and haplotype data. Some of the recent studies have demonstrated the promise and importance of CI for common complex diseases in genomic association study using SNP/haplotype data, especially for tackling challenges, such as gene-gene and gene-environment interactions, and the notorious "curse of dimensionality" problem. This review provides coverage of recent developments of CI approaches for complex diseases in genetic association study with SNP/haplotype data.
McDonough, Caitrin W.; Elsey, Amanda R.; Burkley, Benjamin; Cavallari, Larisa H.; Johnson, Julie A.
2016-01-01
Objective. To evaluate the impact of personal genotyping and a novel educational approach on student attitudes, knowledge, and beliefs regarding pharmacogenomics and genomic medicine. Methods. Two online elective courses (pharmacogenomics and genomic medicine) were offered to student pharmacists at the University of Florida using a flipped-classroom, patient-centered teaching approach. In the pharmacogenomics course, students could be genotyped and apply results to patient cases. Results. Thirty-four and 19 student pharmacists completed the pharmacogenomics and genomic medicine courses, respectively, and 100% of eligible students (n=34) underwent genotyping. Student knowledge improved after the courses. Seventy-four percent (n=25) of students reported better understanding of pharmacogenomics based on having undergone genotyping. Conclusions. Completion of a novel pharmacogenomics elective course sequence that incorporated personal genotyping and genomic medicine was associated with increased student pharmacist knowledge and improved clinical confidence with pharmacogenomics. PMID:27756930
Nilsson, Björn; Håkansson, Petra; Johansson, Mikael; Nelander, Sven; Fioretos, Thoas
2007-01-01
Ontological analysis facilitates the interpretation of microarray data. Here we describe new ontological analysis methods which, unlike existing approaches, are threshold-free and statistically powerful. We perform extensive evaluations and introduce a new concept, detection spectra, to characterize methods. We show that different ontological analysis methods exhibit distinct detection spectra, and that it is critical to account for this diversity. Our results argue strongly against the continued use of existing methods, and provide directions towards an enhanced approach. PMID:17488501
Gruenstaeudl, Michael; Gerschler, Nico; Borsch, Thomas
2018-06-21
The sequencing and comparison of plastid genomes are becoming a standard method in plant genomics, and many researchers are using this approach to infer plant phylogenetic relationships. Due to the widespread availability of next-generation sequencing, plastid genome sequences are being generated at breakneck pace. This trend towards massive sequencing of plastid genomes highlights the need for standardized bioinformatic workflows. In particular, documentation and dissemination of the details of genome assembly, annotation, alignment and phylogenetic tree inference are needed, as these processes are highly sensitive to the choice of software and the precise settings used. Here, we present the procedure and results of sequencing, assembling, annotating and quality-checking of three complete plastid genomes of the aquatic plant genus Cabomba as well as subsequent gene alignment and phylogenetic tree inference. We accompany our findings by a detailed description of the bioinformatic workflow employed. Importantly, we share a total of eleven software scripts for each of these bioinformatic processes, enabling other researchers to evaluate and replicate our analyses step by step. The results of our analyses illustrate that the plastid genomes of Cabomba are highly conserved in both structure and gene content.
Duan, Junbo; Zhang, Ji-Gang; Deng, Hong-Wen; Wang, Yu-Ping
2013-01-01
Copy number variation (CNV) has played an important role in studies of susceptibility or resistance to complex diseases. Traditional methods such as fluorescence in situ hybridization (FISH) and array comparative genomic hybridization (aCGH) suffer from low resolution of genomic regions. Following the emergence of next generation sequencing (NGS) technologies, CNV detection methods based on short read data have recently been developed. However, because these procedures are relatively young, their performance is not fully understood. To help investigators choose suitable methods to detect CNVs, comparative studies are needed. We compared six publicly available CNV detection methods: CNV-seq, FREEC, readDepth, CNVnator, SegSeq and event-wise testing (EWT). They are evaluated on both simulated and real data under different experimental settings. Receiver operating characteristic (ROC) curves demonstrate detection performance in terms of sensitivity and specificity, box plots compare performance in breakpoint and copy number estimation, Venn diagrams show the consistency among the methods, and the F-score quantifies the overlap quality of detected CNVs. The computational demands are also studied. The results of our work provide a comprehensive evaluation of the performances of the selected CNV detection methods, which will help biological investigators choose the best possible method.
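The sensitivity, specificity and F-score comparisons described in this record can be illustrated with a small evaluation routine. The sketch assumes CNV calls have already been reduced to hashable region identifiers drawn from a fixed candidate universe; a real evaluation must first reconcile predicted and true regions by genomic overlap.

```python
def evaluate_calls(predicted, truth, universe):
    """Sensitivity, specificity and F-score for a set of predicted CNV
    regions against a ground-truth set; 'universe' is every candidate
    region considered, so true negatives are well defined."""
    predicted, truth = set(predicted), set(truth)
    tp = len(predicted & truth)          # correctly called CNVs
    fp = len(predicted - truth)          # spurious calls
    fn = len(truth - predicted)          # missed CNVs
    tn = len(universe - predicted - truth)
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    f_score = (2 * precision * sensitivity / (precision + sensitivity)
               if precision + sensitivity else 0.0)
    return sensitivity, specificity, f_score
```

Sweeping a caller's detection threshold and recording (1 - specificity, sensitivity) pairs yields the ROC curve used in the comparison.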
Deverka, Patricia A; Lavallee, Danielle C; Desai, Priyanka J; Armstrong, Joanne; Gorman, Mark; Hole-Curry, Leah; O’Leary, James; Ruffner, BW; Watkins, John; Veenstra, David L; Baker, Laurence H; Unger, Joseph M; Ramsey, Scott D
2013-01-01
Aims The Center for Comparative Effectiveness Research in Cancer Genomics completed a 2-year stakeholder-guided process for the prioritization of genomic tests for comparative effectiveness research studies. We sought to evaluate the effectiveness of engagement procedures in achieving project goals and to identify opportunities for future improvements. Materials & methods The evaluation included an online questionnaire, one-on-one telephone interviews and facilitated discussion. Responses to the online questionnaire were tabulated for descriptive purposes, while transcripts from key informant interviews were analyzed using a directed content analysis approach. Results A total of 11 out of 13 stakeholders completed both the online questionnaire and interview process, while nine participated in the facilitated discussion. Eighty-nine percent of questionnaire items received overall ratings of agree or strongly agree; 11% of responses were rated as neutral with the exception of a single rating of disagreement with an item regarding the clarity of how stakeholder input was incorporated into project decisions. Recommendations for future improvement included developing standard recruitment practices, role descriptions and processes for improved communication with clinical and comparative effectiveness research investigators. Conclusions Evaluation of the stakeholder engagement process provided constructive feedback for future improvements and should be routinely conducted to ensure maximal effectiveness of stakeholder involvement. PMID:23459832
Jenkinson, Garrett; Abante, Jordi; Feinberg, Andrew P; Goutsias, John
2018-03-07
DNA methylation is a stable form of epigenetic memory used by cells to control gene expression. Whole genome bisulfite sequencing (WGBS) has emerged as a gold-standard experimental technique for studying DNA methylation by producing high resolution genome-wide methylation profiles. Statistical modeling and analysis are employed to computationally extract and quantify information from these profiles in an effort to identify regions of the genome that demonstrate crucial or aberrant epigenetic behavior. However, the performance of most currently available methods for methylation analysis is hampered by their inability to directly account for statistical dependencies between neighboring methylation sites, thus ignoring significant information available in WGBS reads. We present a powerful information-theoretic approach for genome-wide modeling and analysis of WGBS data based on the 1D Ising model of statistical physics. This approach takes into account correlations in methylation by utilizing a joint probability model that encapsulates all information available in WGBS methylation reads and produces accurate results even when applied to single WGBS samples with low coverage. Using the Shannon entropy, our approach provides a rigorous quantification of methylation stochasticity in individual WGBS samples genome-wide. Furthermore, it utilizes the Jensen-Shannon distance to evaluate differences in methylation distributions between a test and a reference sample. Differential performance assessment using simulated and real human lung normal/cancer data demonstrates a clear superiority of our approach over DSS, a recently proposed method for WGBS data analysis. Critically, these results demonstrate that marginal methods become statistically invalid when correlations are present in the data.
This contribution demonstrates clear benefits and the necessity of modeling joint probability distributions of methylation using the 1D Ising model of statistical physics and of quantifying methylation stochasticity using concepts from information theory. By employing this methodology, substantial improvement of DNA methylation analysis can be achieved by effectively taking into account the massive amount of statistical information available in WGBS data, which is largely ignored by existing methods.
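The Jensen-Shannon distance used to compare a test and a reference methylation distribution is straightforward to compute for discrete distributions. Below is a minimal sketch of the quantity itself; the Ising-model machinery in which the method embeds it is not reproduced here.

```python
from math import log2, sqrt

def js_distance(p, q):
    """Jensen-Shannon distance (square root of the JS divergence, log
    base 2, so the result lies in [0, 1]) between two discrete
    probability distributions of equal length."""
    def kl(a, b):
        # Kullback-Leibler divergence; terms with a_i = 0 contribute 0.
        return sum(x * log2(x / y) for x, y in zip(a, b) if x > 0)
    m = [(x + y) / 2 for x, y in zip(p, q)]  # mixture distribution
    return sqrt(0.5 * kl(p, m) + 0.5 * kl(q, m))
```

Identical distributions give a distance of 0, disjoint ones give 1, and unlike the raw KL divergence the measure is symmetric and always finite.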
Evaluation of approaches for estimating the accuracy of genomic prediction in plant breeding.
Ould Estaghvirou, Sidi Boubacar; Ogutu, Joseph O; Schulz-Streeck, Torben; Knaak, Carsten; Ouzunova, Milena; Gordillo, Andres; Piepho, Hans-Peter
2013-12-06
Background In genomic prediction, an important measure of accuracy is the correlation between the predicted and the true breeding values. Direct computation of this quantity for real datasets is not possible, because the true breeding value is unknown. Instead, the correlation between the predicted breeding values and the observed phenotypic values, called predictive ability, is often computed. In order to indirectly estimate predictive accuracy, this latter correlation is usually divided by an estimate of the square root of heritability. In this study we use simulation to evaluate estimates of predictive accuracy for seven methods, four (1 to 4) of which use an estimate of heritability to divide predictive ability computed by cross-validation. Between them the seven methods cover balanced and unbalanced datasets as well as correlated and uncorrelated genotypes. We propose one new indirect method (4) and two direct methods (5 and 6) for estimating predictive accuracy and compare their performances and those of four other existing approaches (three indirect (1 to 3) and one direct (7)) with simulated true predictive accuracy as the benchmark and with each other. Results The size of the estimated genetic variance and hence heritability exerted the strongest influence on the variation in the estimated predictive accuracy. Increasing the number of genotypes considerably increases the time required to compute predictive accuracy by all the seven methods, most notably for the five methods that require cross-validation (Methods 1, 2, 3, 4 and 6). A new method that we propose (Method 5) and an existing method (Method 7) used in animal breeding programs were the fastest and gave the least biased, most precise and stable estimates of predictive accuracy. Of the methods that use cross-validation Methods 4 and 6 were often the best. Conclusions The estimated genetic variance and the number of genotypes had the greatest influence on predictive accuracy. 
Methods 5 and 7 were the fastest and produced the least biased, the most precise, robust and stable estimates of predictive accuracy. These properties argue for routinely using Methods 5 and 7 to assess predictive accuracy in genomic selection studies. PMID:24314298
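The indirect estimate described above, predictive ability divided by the square root of an estimated heritability, can be written down directly. A minimal sketch with a hand-rolled Pearson correlation; the function names are illustrative, and in practice the predicted breeding values would come from cross-validation.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(vx * vy)

def predictive_accuracy(predicted_bv, phenotypes, h2):
    """Indirect estimate of accuracy: predictive ability (correlation of
    predictions with phenotypes) divided by sqrt(heritability)."""
    return pearson(predicted_bv, phenotypes) / sqrt(h2)
```

The study's point is that the quality of this estimate hinges on the heritability estimate in the denominator, which is why the authors compare it against direct simulation-based benchmarks.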
2012-01-01
Background A single-step blending approach allows genomic prediction using information of genotyped and non-genotyped animals simultaneously. However, the combined relationship matrix in a single-step method may need to be adjusted because marker-based and pedigree-based relationship matrices may not be on the same scale. The same may apply when a GBLUP model includes both genomic breeding values and residual polygenic effects. The objective of this study was to compare single-step blending methods and GBLUP methods with and without adjustment of the genomic relationship matrix for genomic prediction of 16 traits in the Nordic Holstein population. Methods The data consisted of de-regressed proofs (DRP) for 5 214 genotyped and 9 374 non-genotyped bulls. The bulls were divided into a training and a validation population by birth date, October 1, 2001. Five approaches for genomic prediction were used: 1) a simple GBLUP method, 2) a GBLUP method with a polygenic effect, 3) an adjusted GBLUP method with a polygenic effect, 4) a single-step blending method, and 5) an adjusted single-step blending method. In the adjusted GBLUP and single-step methods, the genomic relationship matrix was adjusted for the difference of scale between the genomic and the pedigree relationship matrices. A set of weights on the pedigree relationship matrix (ranging from 0.05 to 0.40) was used to build the combined relationship matrix in the single-step blending method and the GBLUP method with a polygenic effect. Results Averaged over the 16 traits, reliabilities of genomic breeding values predicted using the GBLUP method with a polygenic effect (relative weight of 0.20) were 0.3% higher than reliabilities from the simple GBLUP method (without a polygenic effect). The adjusted single-step blending and original single-step blending methods (relative weight of 0.20) had average reliabilities that were 2.1% and 1.8% higher than the simple GBLUP method, respectively. 
In addition, the GBLUP method with a polygenic effect led to less bias of genomic predictions than the simple GBLUP method, and both single-step blending methods yielded less bias of predictions than all GBLUP methods. Conclusions The single-step blending method is an appealing approach for practical genomic prediction in dairy cattle. Genomic prediction from the single-step blending method can be improved by adjusting the scale of the genomic relationship matrix. PMID:22455934
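The scale adjustment and blending of the genomic and pedigree relationship matrices described above can be sketched as follows. This is an illustrative reconstruction, not the authors' code: the linear rescaling G_adj = a + b*G that matches average diagonal and off-diagonal elements to A22 is one common adjustment, and the function names and weight value are assumptions.

```python
import numpy as np

def adjust_G(G, A22):
    """Rescale G so its average diagonal and average off-diagonal
    match those of the pedigree relationship matrix A22, by solving
    G_adj = a + b*G for scalars a and b (a common adjustment; the
    paper's exact formula may differ)."""
    n = G.shape[0]
    off = ~np.eye(n, dtype=bool)
    b = (A22[off].mean() - A22.diagonal().mean()) / (G[off].mean() - G.diagonal().mean())
    a = A22.diagonal().mean() - b * G.diagonal().mean()
    return a + b * G

def blend(G_adj, A22, w=0.20):
    """Combined relationship matrix with relative weight w on the
    pedigree relationship matrix, as in the weighting grid (0.05-0.40)
    used in the study."""
    return (1 - w) * G_adj + w * A22
```

After adjustment, the blended matrix keeps the pedigree scale regardless of the weight chosen, which is the point of adjusting before blending.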
NOIROT, M.; BARRE, P.; DUPERRAY, C.; LOUARN, J.; HAMON, S.
2003-01-01
Estimates of genome size using flow cytometry can be biased by the presence of cytosolic compounds, leading to pseudo‐intraspecific variation in genome size. Two important compounds present in coffee trees—caffeine and chlorogenic acid—modify accessibility of the dye propidium iodide to Petunia DNA, a species used as internal standard in our genome size evaluation. These compounds could be responsible for intraspecific variation in genome size since their contents vary between trees. They could also be implicated in environmental variations in genome size, such as those revealed when comparing the results of evaluations carried out on different dates on several genotypes. PMID:12876189
Statistical Methods in Integrative Genomics
Richardson, Sylvia; Tseng, George C.; Sun, Wei
2016-01-01
Statistical methods in integrative genomics aim to answer important biology questions by jointly analyzing multiple types of genomic data (vertical integration) or aggregating the same type of data across multiple studies (horizontal integration). In this article, we introduce different types of genomic data and data resources, and then review statistical methods of integrative genomics, with emphasis on the motivation and rationale of these methods. We conclude with some summary points and future research directions. PMID:27482531
Chou, Wen-Chi; Ma, Qin; Yang, Shihui; ...
2015-03-12
The identification of transcription units (TUs) encoded in a bacterial genome is essential to elucidation of transcriptional regulation of the organism. To gain a detailed understanding of the dynamically composed TU structures, we have used four strand-specific RNA-seq (ssRNA-seq) datasets collected under two experimental conditions to derive the genomic TU organization of Clostridium thermocellum using a machine-learning approach. Our method accurately predicted the genomic boundaries of individual TUs based on two sets of parameters measuring the RNA-seq expression patterns across the genome: expression-level continuity and variance. A total of 2590 distinct TUs were predicted based on the four RNA-seq datasets. Moreover, among the predicted TUs, 44% have multiple genes. We assessed our prediction method on an independent set of RNA-seq data with longer reads. The evaluation confirmed the high quality of the predicted TUs. Functional enrichment analyses on a selected subset of the predicted TUs revealed interesting biology. To demonstrate the generality of the prediction method, we have also applied the method to RNA-seq data collected on Escherichia coli and achieved high prediction accuracies. The TU prediction program named SeqTU is publicly available at https://code.google.com/p/seqtu/. We expect that the predicted TUs can serve as the baseline information for studying transcriptional and post-transcriptional regulation in C. thermocellum and other bacteria.
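The two feature families named in the abstract, expression-level continuity and variance across an intergenic gap, can be given a schematic form. This is only a sketch of the idea: the function, its feature definitions, and the coordinate convention are assumptions, and SeqTU's actual features are computed differently.

```python
import numpy as np

def tu_gap_features(depth, gene_a_end, gene_b_start):
    """Schematic SeqTU-style features for deciding whether two adjacent
    genes belong to one transcription unit, computed on the per-base
    read-depth signal over the intergenic gap:
      - continuity: fraction of positions with nonzero coverage
      - variance:   variance of log-scaled depth across the gap
    High continuity and low variance suggest a shared transcript."""
    region = np.asarray(depth[gene_a_end:gene_b_start], dtype=float)
    continuity = np.mean(region > 0)
    variance = np.var(np.log1p(region))
    return continuity, variance
```

A classifier trained on such features (plus within-gene baselines) would then label each gap as intra-TU or TU boundary.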
Informing the Design of Direct-to-Consumer Interactive Personal Genomics Reports
Shaer, Orit; Okerlund, Johanna; Balestra, Martina; Stowell, Elizabeth; Ascher, Laura; Bi, Joanna; Schlenker, Claire; Ball, Madeleine
2015-01-01
Background In recent years, people who sought direct-to-consumer genetic testing services have been increasingly confronted with an unprecedented amount of personal genomic information, which influences their decisions, emotional state, and well-being. However, these users of direct-to-consumer genetic services, who vary in their education and interests, frequently have little relevant experience or tools for understanding, reasoning about, and interacting with their personal genomic data. Online interactive techniques can play a central role in making personal genomic data useful for these users. Objective We sought to (1) identify the needs of diverse users as they make sense of their personal genomic data, (2) consequently develop effective interactive visualizations of genomic trait data to address these users’ needs, and (3) evaluate the effectiveness of the developed visualizations in facilitating comprehension. Methods The first two user studies, conducted with 63 volunteers in the Personal Genome Project and with 36 personal genomic users who participated in a design workshop, respectively, employed surveys and interviews to identify the needs and expectations of diverse users. Building on the two initial studies, the third study was conducted with 730 Amazon Mechanical Turk users and employed a controlled experimental design to examine the effectiveness of different design interventions on user comprehension. Results The first two studies identified searching, comparing, sharing, and organizing data as fundamental to users’ understanding of personal genomic data. The third study demonstrated that interactive and visual design interventions could improve the understandability of personal genomic reports for consumers. In particular, results showed that a new interactive bubble chart visualization designed for the study resulted in the highest comprehension scores, as well as the highest perceived comprehension scores. 
These scores were significantly higher than scores received using the industry standard tabular reports currently used for communicating personal genomic information. Conclusions Drawing on multiple research methods and populations, the findings of the studies reported in this paper offer deep understanding of users’ needs and practices, and demonstrate that interactive online design interventions can improve the understandability of personal genomic reports for consumers. We discuss implications for designers and researchers. PMID:26070951
Bletz, Stefan; Janezic, Sandra; Harmsen, Dag; Rupnik, Maja; Mellmann, Alexander
2018-06-01
Clostridium difficile, recently renamed Clostridioides difficile, is the most common cause of antibiotic-associated nosocomial gastrointestinal infections worldwide. To differentiate endogenous infections and transmission events, highly discriminatory subtyping is necessary. Today, methods based on whole-genome sequencing data are increasingly used to subtype bacterial pathogens; however, frequently a standardized methodology and typing nomenclature are missing. Here we report a core genome multilocus sequence typing (cgMLST) approach developed for C. difficile. Initially, we determined the breadth of the C. difficile population based on all available MLST sequence types with Bayesian inference (BAPS). The resulting BAPS partitions were used in combination with C. difficile clade information to select representative isolates that were subsequently used to define cgMLST target genes. Finally, we evaluated the novel cgMLST scheme with genomes from 3,025 isolates. BAPS grouping (n = 6 groups) together with the clade information led to a total of 11 representative isolates that were included for cgMLST definition and resulted in 2,270 cgMLST genes that were present in all isolates. Overall, 2,184 to 2,268 cgMLST targets were detected in the genome sequences of 70 outbreak-associated and reference strains, and on average 99.3% of cgMLST targets (1,116 to 2,270 targets) were present in 2,954 genomes downloaded from the NCBI database, underlining the representativeness of the cgMLST scheme. Moreover, reanalysis of different cluster scenarios with cgMLST was concordant with published single nucleotide variant analyses. In conclusion, the novel cgMLST is representative for the whole C. difficile population, is highly discriminatory in outbreak situations, and provides a unique nomenclature facilitating interlaboratory exchange. Copyright © 2018 American Society for Microbiology.
CNV-TV: a robust method to discover copy number variation from short sequencing reads.
Duan, Junbo; Zhang, Ji-Gang; Deng, Hong-Wen; Wang, Yu-Ping
2013-05-02
Copy number variation (CNV) is an important structural variation (SV) in the human genome. Various studies have shown that CNVs are associated with complex diseases. Traditional CNV detection methods such as fluorescence in situ hybridization (FISH) and array comparative genomic hybridization (aCGH) suffer from low resolution. The next generation sequencing (NGS) technique promises higher resolution detection of CNVs, and several methods were recently proposed for realizing such a promise. However, the performance of these methods is not robust under some conditions, e.g., some of them may fail to detect CNVs of short sizes. There has been a strong demand for reliable detection of CNVs from high resolution NGS data. A novel and robust method to detect CNVs from short sequencing reads is proposed in this study. The detection of CNVs is modeled as change-point detection on the read depth (RD) signal derived from the NGS data, which is fitted with a total variation (TV) penalized least squares model. The performance (e.g., sensitivity and specificity) of the proposed approach is evaluated by comparison with several recently published methods on both simulated and real data from the 1000 Genomes Project. The experimental results showed that both the true positive rate and false positive rate of the proposed detection method do not change significantly for CNVs with different copy numbers and lengths, when compared with several existing methods. Therefore, our proposed approach results in more reliable detection of CNVs than the existing methods.
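The TV-penalized least-squares model on the read-depth signal can be sketched with a small solver. This is an illustrative reimplementation, not the CNV-TV code: the ADMM solver choice, parameter values, and function names are assumptions made here for the sketch.

```python
import numpy as np

def tv_denoise(y, lam, rho=1.0, n_iter=200):
    """Fit the TV-penalized least-squares model
        min_x 0.5*||y - x||^2 + lam * sum_i |x[i+1] - x[i]|
    to a read-depth signal y via ADMM on the split z = Dx.
    The piecewise-constant fit x reveals change points (CNV
    boundaries) wherever consecutive values of x jump."""
    n = len(y)
    D = np.diff(np.eye(n), axis=0)            # (n-1) x n first-difference operator
    A = np.eye(n) + rho * D.T @ D             # system matrix for the x-update
    x = y.astype(float).copy()
    z = D @ x
    u = np.zeros(n - 1)                       # scaled dual variable
    for _ in range(n_iter):
        x = np.linalg.solve(A, y + rho * D.T @ (z - u))
        Dx = D @ x
        # soft-thresholding enforces sparsity of the jumps
        z = np.sign(Dx + u) * np.maximum(np.abs(Dx + u) - lam / rho, 0.0)
        u = u + Dx - z
    return x
```

Calling `tv_denoise` on a noisy depth profile with an elevated segment recovers a near-piecewise-constant signal whose jumps mark the candidate CNV breakpoints.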
Focke, Felix; Haase, Ilka; Fischer, Markus
2011-01-26
Usually, spices are identified morphologically using simple methods like magnifying glasses or microscopic instruments. On the other hand, molecular biological methods like the polymerase chain reaction (PCR) enable accurate and specific detection even in complex matrices. Generally, the origins of spices are plants with diverse genetic backgrounds and relationships. The processing methods used for the production of spices are complex and individual. Consequently, the development of a reliable DNA-based method for spice analysis is a challenging undertaking. However, once established, such a method can easily be adapted to less difficult food matrices. In the current study, several alternative methods for the isolation of DNA from spices were developed and evaluated in detail with regard to (i) purity (photometric), (ii) yield (fluorimetric), and (iii) amplifiability (PCR). Whole genome amplification methods were used to preamplify isolates to improve the ratio between amplifiable DNA and inhibiting substances. Specific primer sets were designed, and the PCR conditions were optimized to detect 18 spices selectively. Assays of self-made spice mixtures were performed to prove the applicability of the developed methods.
Nambiar, Bindu; Cornell Sookdeo, Cathleen; Berthelette, Patricia; Jackson, Robert; Piraino, Susan; Burnham, Brenda; Nass, Shelley; Souza, David; O'Riordan, Catherine R; Vincent, Karen A; Cheng, Seng H; Armentano, Donna; Kyostio-Moore, Sirkka
2017-02-01
Several ongoing clinical studies are evaluating recombinant adeno-associated virus (rAAV) vectors as gene delivery vehicles for a variety of diseases. However, the production of vectors with genomes >4.7 kb is challenging, with vector preparations frequently containing truncated genomes. To determine whether the generation of oversized rAAVs can be improved using a producer cell-line (PCL) process, HeLaS3-cell lines harboring either a 5.1 or 5.4 kb rAAV vector genome encoding codon-optimized cDNA for human B-domain deleted Factor VIII (FVIII) were isolated. High-producing "masterwells" (MWs), defined as producing >50,000 vg/cell, were identified for each oversized vector. These MWs provided stable vector production for >20 passages. The quality and potency of the AAVrh8R/FVIII-5.1 and AAVrh8R/FVIII-5.4 vectors generated by the PCL method were then compared to those prepared via transient transfection (TXN). Southern and dot blot analyses demonstrated that both production methods resulted in packaging of heterogeneously sized genomes. However, the PCL-derived rAAV vector preparations contained some genomes >4.7 kb, whereas the majority of genomes generated by the TXN method were ≤4.7 kb. The PCL process reduced packaging of non-vector DNA for both the AAVrh8R/FVIII-5.1 and the AAVrh8R/FVIII-5.4 kb vector preparations. Furthermore, more DNA-containing viral particles were obtained for the AAVrh8R/FVIII-5.1 vector. In a mouse model of hemophilia A, animals administered a PCL-derived rAAV vector exhibited twofold higher plasma FVIII activity and increased levels of vector genomes in the liver than did mice treated with vector produced via TXN. Hence, the quality of oversized vectors prepared using the PCL method is greater than that of vectors generated using the TXN process, and importantly this improvement translates to enhanced performance in vivo.
Genomic markers for decision making: what is preventing us from using markers?
Coyle, Vicky M; Johnston, Patrick G
2010-02-01
The advent of novel genomic technologies that enable the evaluation of genomic alterations on a genome-wide scale has significantly altered the field of genomic marker research in solid tumors. Researchers have moved away from the traditional model of identifying a particular genomic alteration and evaluating the association between this finding and a clinical outcome measure to a new approach involving the identification and measurement of multiple genomic markers simultaneously within clinical studies. This in turn has presented additional challenges in considering the use of genomic markers in oncology, such as clinical study design, reproducibility and interpretation and reporting of results. This Review will explore these challenges, focusing on microarray-based gene-expression profiling, and highlights some common failings in study design that have impacted on the use of putative genomic markers in the clinic. Despite these rapid technological advances there is still a paucity of genomic markers in routine clinical use at present. A rational and focused approach to the evaluation and validation of genomic markers is needed, whereby analytically validated markers are investigated in clinical studies that are adequately powered and have pre-defined patient populations and study endpoints. Furthermore, novel adaptive clinical trial designs, incorporating putative genomic markers into prospective clinical trials, will enable the evaluation of these markers in a rigorous and timely fashion. Such approaches have the potential to facilitate the implementation of such markers into routine clinical practice and consequently enable the rational and tailored use of cancer therapies for individual patients.
Brownstein, Catherine A; Beggs, Alan H; Homer, Nils; Merriman, Barry; Yu, Timothy W; Flannery, Katherine C; DeChene, Elizabeth T; Towne, Meghan C; Savage, Sarah K; Price, Emily N; Holm, Ingrid A; Luquette, Lovelace J; Lyon, Elaine; Majzoub, Joseph; Neupert, Peter; McCallie, David; Szolovits, Peter; Willard, Huntington F; Mendelsohn, Nancy J; Temme, Renee; Finkel, Richard S; Yum, Sabrina W; Medne, Livija; Sunyaev, Shamil R; Adzhubey, Ivan; Cassa, Christopher A; de Bakker, Paul I W; Duzkale, Hatice; Dworzyński, Piotr; Fairbrother, William; Francioli, Laurent; Funke, Birgit H; Giovanni, Monica A; Handsaker, Robert E; Lage, Kasper; Lebo, Matthew S; Lek, Monkol; Leshchiner, Ignaty; MacArthur, Daniel G; McLaughlin, Heather M; Murray, Michael F; Pers, Tune H; Polak, Paz P; Raychaudhuri, Soumya; Rehm, Heidi L; Soemedi, Rachel; Stitziel, Nathan O; Vestecka, Sara; Supper, Jochen; Gugenmus, Claudia; Klocke, Bernward; Hahn, Alexander; Schubach, Max; Menzel, Mortiz; Biskup, Saskia; Freisinger, Peter; Deng, Mario; Braun, Martin; Perner, Sven; Smith, Richard J H; Andorf, Janeen L; Huang, Jian; Ryckman, Kelli; Sheffield, Val C; Stone, Edwin M; Bair, Thomas; Black-Ziegelbein, E Ann; Braun, Terry A; Darbro, Benjamin; DeLuca, Adam P; Kolbe, Diana L; Scheetz, Todd E; Shearer, Aiden E; Sompallae, Rama; Wang, Kai; Bassuk, Alexander G; Edens, Erik; Mathews, Katherine; Moore, Steven A; Shchelochkov, Oleg A; Trapane, Pamela; Bossler, Aaron; Campbell, Colleen A; Heusel, Jonathan W; Kwitek, Anne; Maga, Tara; Panzer, Karin; Wassink, Thomas; Van Daele, Douglas; Azaiez, Hela; Booth, Kevin; Meyer, Nic; Segal, Michael M; Williams, Marc S; Tromp, Gerard; White, Peter; Corsmeier, Donald; Fitzgerald-Butt, Sara; Herman, Gail; Lamb-Thrush, Devon; McBride, Kim L; Newsom, David; Pierson, Christopher R; Rakowsky, Alexander T; Maver, Aleš; Lovrečić, Luca; Palandačić, Anja; Peterlin, Borut; Torkamani, Ali; Wedell, Anna; Huss, Mikael; Alexeyenko, Andrey; Lindvall, Jessica M; Magnusson, Måns; Nilsson, 
Daniel; Stranneheim, Henrik; Taylan, Fulya; Gilissen, Christian; Hoischen, Alexander; van Bon, Bregje; Yntema, Helger; Nelen, Marcel; Zhang, Weidong; Sager, Jason; Zhang, Lu; Blair, Kathryn; Kural, Deniz; Cariaso, Michael; Lennon, Greg G; Javed, Asif; Agrawal, Saloni; Ng, Pauline C; Sandhu, Komal S; Krishna, Shuba; Veeramachaneni, Vamsi; Isakov, Ofer; Halperin, Eran; Friedman, Eitan; Shomron, Noam; Glusman, Gustavo; Roach, Jared C; Caballero, Juan; Cox, Hannah C; Mauldin, Denise; Ament, Seth A; Rowen, Lee; Richards, Daniel R; San Lucas, F Anthony; Gonzalez-Garay, Manuel L; Caskey, C Thomas; Bai, Yu; Huang, Ying; Fang, Fang; Zhang, Yan; Wang, Zhengyuan; Barrera, Jorge; Garcia-Lobo, Juan M; González-Lamuño, Domingo; Llorca, Javier; Rodriguez, Maria C; Varela, Ignacio; Reese, Martin G; De La Vega, Francisco M; Kiruluta, Edward; Cargill, Michele; Hart, Reece K; Sorenson, Jon M; Lyon, Gholson J; Stevenson, David A; Bray, Bruce E; Moore, Barry M; Eilbeck, Karen; Yandell, Mark; Zhao, Hongyu; Hou, Lin; Chen, Xiaowei; Yan, Xiting; Chen, Mengjie; Li, Cong; Yang, Can; Gunel, Murat; Li, Peining; Kong, Yong; Alexander, Austin C; Albertyn, Zayed I; Boycott, Kym M; Bulman, Dennis E; Gordon, Paul M K; Innes, A Micheil; Knoppers, Bartha M; Majewski, Jacek; Marshall, Christian R; Parboosingh, Jillian S; Sawyer, Sarah L; Samuels, Mark E; Schwartzentruber, Jeremy; Kohane, Isaac S; Margulies, David M
2014-03-25
Background There is tremendous potential for genome sequencing to improve clinical diagnosis and care once it becomes routinely accessible, but this will require formalizing research methods into clinical best practices in the areas of sequence data generation, analysis, interpretation and reporting. The CLARITY Challenge was designed to spur convergence in methods for diagnosing genetic disease starting from clinical case history and genome sequencing data. DNA samples were obtained from three families with heritable genetic disorders and genomic sequence data were donated by sequencing platform vendors. The challenge was to analyze and interpret these data with the goals of identifying disease-causing variants and reporting the findings in a clinically useful format. Participating contestant groups were solicited broadly, and an independent panel of judges evaluated their performance. Results A total of 30 international groups were engaged. The entries reveal a general convergence of practices on most elements of the analysis and interpretation process. However, even given this commonality of approach, only two groups identified the consensus candidate variants in all disease cases, demonstrating a need for consistent fine-tuning of the generally accepted methods. There was greater diversity of the final clinical report content and in the patient consenting process, demonstrating that these areas require additional exploration and standardization. Conclusions The CLARITY Challenge provides a comprehensive assessment of current practices for using genome sequencing to diagnose and report genetic diseases. There is remarkable convergence in bioinformatic techniques, but medical interpretation and reporting are areas that require further development by many groups. PMID:24667040
Fangmann, A; Sharifi, R A; Heinkel, J; Danowski, K; Schrade, H; Erbe, M; Simianer, H
2017-04-01
Currently used multi-step methods to incorporate genomic information in the prediction of breeding values (BV) implicitly involve many assumptions which, if violated, may result in loss of information, inaccuracies and bias. To overcome this, single-step genomic best linear unbiased prediction (ssGBLUP) was proposed combining pedigree, phenotype and genotype of all individuals for genetic evaluation. Our objective was to implement ssGBLUP for genomic predictions in pigs and to compare the accuracy of ssGBLUP with that of multi-step methods with empirical data of moderately sized pig breeding populations. Different predictions were performed: conventional parent average (PA), direct genomic value (DGV) calculated with genomic BLUP (GBLUP), a GEBV obtained by blending the DGV with PA, and ssGBLUP. Data comprised individuals from a German Landrace (LR) and Large White (LW) population. The trait 'number of piglets born alive' (NBA) was available for 182,054 litters of 41,090 LR sows and 15,750 litters from 4534 LW sows. The pedigree contained 174,021 animals, of which 147,461 (26,560) animals were LR (LW) animals. In total, 526 LR and 455 LW animals were genotyped with the Illumina PorcineSNP60 BeadChip. After quality control and imputation, 495 LR (424 LW) animals with 44,368 (43,678) SNP on 18 autosomes remained for the analysis. Predictive abilities, i.e., correlations between de-regressed proofs and genomic BV, were calculated with a five-fold cross validation and with a forward prediction for young genotyped validation animals born after 2011. Generally, predictive abilities for LR were rather small (0.08 for GBLUP, 0.19 for GEBV and 0.18 for ssGBLUP). For LW, ssGBLUP had the greatest predictive ability (0.45). For both breeds, assessment of reliabilities for young genotyped animals indicated that genomic prediction outperforms PA with ssGBLUP providing greater reliabilities (0.40 for LR and 0.32 for LW) than GEBV (0.35 for LR and 0.29 for LW). 
Grouping of animals according to information sources revealed that genomic prediction had the highest potential benefit for genotyped animals without their own phenotype. Although ssGBLUP did not generally outperform GBLUP or GEBV, the results suggest that ssGBLUP can be a useful and conceptually convincing approach for practical genomic prediction of NBA in moderately sized LR and LW populations.
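The combined relationship matrix at the heart of ssGBLUP is usually handled through its inverse, which has a simple closed form. The sketch below uses the standard construction (H-inverse equals A-inverse plus a correction on the genotyped block); it is a schematic aid, not the authors' implementation, and the function name and argument layout are assumptions.

```python
import numpy as np

def h_inverse(A_inv, A22, G, idx):
    """Inverse of the combined relationship matrix H used in ssGBLUP:
        H^-1 = A^-1 + [[0, 0], [0, G^-1 - A22^-1]]
    where the correction applies only to the block of genotyped
    animals, indexed by idx. A_inv is the inverse of the full pedigree
    relationship matrix, A22 the pedigree relationships among the
    genotyped animals, and G their genomic relationship matrix."""
    H_inv = A_inv.copy()
    block = np.linalg.inv(G) - np.linalg.inv(A22)
    H_inv[np.ix_(idx, idx)] += block
    return H_inv
```

When G coincides with A22 the correction vanishes and ssGBLUP collapses to conventional pedigree BLUP, which is a convenient sanity check.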
Jia, Zhaofeng; Liang, Yujie; Ma, Bin; Xu, Xiao; Xiong, Jianyi; Duan, Li; Wang, Daping
2017-05-17
The dedifferentiation of hyaline chondrocytes into fibroblastic chondrocytes often accompanies monolayer expansion of chondrocytes in vitro. The global DNA methylation level of chondrocytes is considered to be a suitable biomarker for the loss of the chondrocyte phenotype. However, results based on different experimental methods can be inconsistent. Therefore, it is important to establish a precise, simple, and rapid method to quantify global DNA methylation levels during chondrocyte dedifferentiation. Current genome-wide methylation analysis techniques largely rely on bisulfite genomic sequencing. Due to DNA degradation during bisulfite conversion, these methods typically require a large sample volume. Other methods used to quantify global DNA methylation levels include high-performance liquid chromatography (HPLC). However, HPLC requires complete digestion of genomic DNA. Additionally, the prohibitively high cost of HPLC instruments limits HPLC's wider application. In this study, genomic DNA (gDNA) was extracted from human chondrocytes cultured with varying numbers of passages. The gDNA methylation level was detected using a methylation-specific dot blot assay. In this dot blot approach, a gDNA mixture containing the methylated DNA to be detected was spotted directly onto an N+ membrane as a dot inside a previously drawn circular template pattern. Compared with other gel electrophoresis-based blotting approaches and other complex blotting procedures, the dot blot method saves significant time. In addition, dot blots can detect overall DNA methylation level using a commercially available 5-mC antibody. We found that the DNA methylation level differed between the monolayer subcultures, and therefore could play a key role in chondrocyte dedifferentiation. The 5-mC dot blot is a reliable, simple, and rapid method to detect the general DNA methylation level to evaluate chondrocyte phenotype.
González-Recio, O; Jiménez-Montero, J A; Alenda, R
2013-01-01
In the next few years, with the advent of high-density single nucleotide polymorphism (SNP) arrays and genome sequencing, genomic evaluation methods will need to deal with a large number of genetic variants and an increasing sample size. The boosting algorithm is a machine-learning technique that may alleviate the drawbacks of dealing with such large data sets. This algorithm combines different predictors in a sequential manner with some shrinkage on them; each predictor is applied consecutively to the residuals from the committee formed by the previous ones to form a final prediction based on a subset of covariates. Here, a detailed description is provided and examples using a toy data set are included. A modification of the algorithm called "random boosting" was proposed to increase predictive ability and decrease computation time of genome-assisted evaluation in large data sets. Random boosting uses a random selection of markers to add a subsequent weak learner to the predictive model. These modifications were applied to a real data set composed of 1,797 bulls genotyped for 39,714 SNP. Deregressed proofs of 4 yield traits and 1 type trait from January 2009 routine evaluations were used as dependent variables. A 2-fold cross-validation scenario was implemented. Sires born before 2005 were used as a training sample (1,576 and 1,562 for production and type traits, respectively), whereas younger sires were used as a testing sample to evaluate predictive ability of the algorithm on yet-to-be-observed phenotypes. Comparison with the original algorithm was provided. The predictive ability of the algorithm was measured as Pearson correlations between observed and predicted responses. Further, estimated bias was computed as the average difference between observed and predicted phenotypes. 
The results showed that the modification of the original boosting algorithm could be run in 1% of the time used by the original algorithm, with negligible differences in accuracy and bias. This modification may be used to speed up the computation of genome-assisted evaluation in large data sets such as those obtained from consortiums. Copyright © 2013 American Dairy Science Association. Published by Elsevier Inc. All rights reserved.
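The random boosting idea described above (each weak learner fit to the residuals of the committee so far, with shrinkage, using a random subset of markers) can be sketched compactly. This is a schematic reconstruction under assumptions: the base learner here is ordinary least squares on the sampled markers, and the function name, subset fraction, and shrinkage value are illustrative, not the paper's settings.

```python
import numpy as np

def random_boost(X, y, n_rounds=100, subset=0.05, shrink=0.1, seed=0):
    """Random boosting sketch for genomic prediction.
    X: (n_animals, n_markers) genotype matrix; y: phenotypes/DRP.
    Each round draws a random subset of markers, fits a weak
    least-squares learner to the current residuals, and adds a
    shrunken copy of its predictions to the committee."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    pred = np.zeros(n)
    model = []                                # (marker indices, shrunken coefficients)
    for _ in range(n_rounds):
        resid = y - pred                      # residuals of the committee so far
        idx = rng.choice(p, size=max(1, int(subset * p)), replace=False)
        Xs = X[:, idx]
        beta, *_ = np.linalg.lstsq(Xs, resid, rcond=None)
        pred += shrink * Xs @ beta            # shrinkage slows learning, reduces overfit
        model.append((idx, shrink * beta))
    return pred, model
```

Restricting each round to a marker subset is what cuts the per-round cost relative to boosting over all SNPs, at the price of needing more rounds.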
Approaches to Fungal Genome Annotation
Haas, Brian J.; Zeng, Qiandong; Pearson, Matthew D.; Cuomo, Christina A.; Wortman, Jennifer R.
2011-01-01
Fungal genome annotation is the starting point for analysis of genome content. This generally involves the application of diverse methods to identify features on a genome assembly such as protein-coding and non-coding genes, repeats and transposable elements, and pseudogenes. Here we describe tools and methods leveraged for eukaryotic genome annotation with a focus on the annotation of fungal nuclear and mitochondrial genomes. We highlight the application of the latest technologies and tools to improve the quality of predicted gene sets. The Broad Institute eukaryotic genome annotation pipeline is described as one example of how such methods and tools are integrated into a sequencing center’s production genome annotation environment. PMID:22059117
Machine learning derived risk prediction of anorexia nervosa.
Guo, Yiran; Wei, Zhi; Keating, Brendan J; Hakonarson, Hakon
2016-01-20
Anorexia nervosa (AN) is a complex psychiatric disease with a moderate to strong genetic contribution. In addition to conventional genome wide association (GWA) studies, researchers have been using machine learning methods in conjunction with genomic data to predict risk of diseases in which genetics play an important role. In this study, we collected whole genome genotyping data on 3940 AN cases and 9266 controls from the Genetic Consortium for Anorexia Nervosa (GCAN), the Wellcome Trust Case Control Consortium 3 (WTCCC3), Price Foundation Collaborative Group and the Children's Hospital of Philadelphia (CHOP), and applied machine learning methods for predicting AN disease risk. The prediction performance is measured by area under the receiver operating characteristic curve (AUC), indicating how well the model distinguishes cases from unaffected control subjects. Logistic regression with the lasso penalty generated an AUC of 0.693, while Support Vector Machines and Gradient Boosted Trees reached AUCs of 0.691 and 0.623, respectively. Using different sample sizes, our results suggest that larger datasets are required to optimize the machine learning models and achieve higher AUC values. To our knowledge, this is the first attempt to assess AN risk based on genome wide genotype level data. Future integration of genomic, environmental and family-based information is likely to improve the AN risk evaluation process, eventually benefitting AN patients and families in the clinical setting.
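The AUC metric used to compare the classifiers above has a simple rank-based form: it equals the probability that a randomly chosen case receives a higher score than a randomly chosen control. A minimal sketch (assuming untied scores; real implementations average ranks over ties):

```python
import numpy as np

def auc(scores, labels):
    """Rank-based AUC (equivalent to the Mann-Whitney U statistic).
    scores: predicted risk scores; labels: 1 = case, 0 = control.
    Assumes no tied scores for simplicity."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)
    order = np.argsort(scores)
    ranks = np.empty(len(scores))
    ranks[order] = np.arange(1, len(scores) + 1)   # rank 1 = lowest score
    n_pos = labels.sum()
    n_neg = len(labels) - n_pos
    # rank-sum of cases, minus its minimum possible value, normalized
    return (ranks[labels == 1].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
```

An AUC of 0.5 corresponds to random guessing, so values such as 0.693 indicate modest but real discriminative signal.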
Orthology prediction methods: a quality assessment using curated protein families.
Trachana, Kalliopi; Larsson, Tomas A; Powell, Sean; Chen, Wei-Hua; Doerks, Tobias; Muller, Jean; Bork, Peer
2011-10-01
The increasing number of sequenced genomes has prompted the development of several automated orthology prediction methods. Tests to evaluate the accuracy of predictions and to explore biases caused by biological and technical factors are therefore required. We used 70 manually curated families to analyze the performance of five public methods in Metazoa. We analyzed the strengths and weaknesses of the methods and quantified the impact of biological and technical challenges. From the latter part of the analysis, genome annotation emerged as the largest single influencer, affecting up to 30% of the performance. Generally, most methods did well in assigning orthologous groups but they failed to assign the exact number of genes for half of the groups. The publicly available benchmark set (http://eggnog.embl.de/orthobench/) should facilitate the improvement of current orthology assignment protocols, which is of utmost importance for many fields of biology and should be tackled by a broad scientific community. Copyright © 2011 WILEY Periodicals, Inc.
Evaluation of xenobiotic-induced changes in gene expression as a method to identify and classify potential toxicants is being pursued by industry and regulatory agencies worldwide. A workshop was held at the Research Triangle Park campus of the Environmental Protection Agency to...
Meuwissen, Theo H E; Indahl, Ulf G; Ødegård, Jørgen
2017-12-27
Non-linear Bayesian genomic prediction models such as BayesA/B/C/R involve iteration and mostly Markov chain Monte Carlo (MCMC) algorithms, which are computationally expensive, especially when whole-genome sequence (WGS) data are analyzed. Singular value decomposition (SVD) of the genotype matrix can facilitate genomic prediction in large datasets, and can be used to estimate marker effects and their prediction error variances (PEV) in a computationally efficient manner. Here, we developed, implemented, and evaluated a direct, non-iterative method for the estimation of marker effects for the BayesC genomic prediction model. The BayesC model assumes a priori that markers have normally distributed effects with probability π and no effect with probability (1 - π). Marker effects and their PEV are estimated by using SVD and the posterior probability of the marker having a non-zero effect is calculated. These posterior probabilities are used to obtain marker-specific effect variances, which are subsequently used to approximate BayesC estimates of marker effects in a linear model. A computer simulation study was conducted to compare alternative genomic prediction methods, where a single reference generation was used to estimate marker effects, which were subsequently used for 10 generations of forward prediction, for which accuracies were evaluated. SVD-based posterior probabilities of markers having non-zero effects were generally lower than MCMC-based posterior probabilities, but for some regions the opposite occurred, resulting in clear signals for QTL-rich regions. The accuracies of breeding values estimated using SVD- and MCMC-based BayesC analyses were similar across the 10 generations of forward prediction.
For an intermediate number of generations (2 to 5) of forward prediction, accuracies obtained with the BayesC model tended to be slightly higher than accuracies obtained using the best linear unbiased prediction of SNP effects (SNP-BLUP model). When reducing marker density from WGS data to 30 K, SNP-BLUP tended to yield the highest accuracies, at least in the short term. Based on SVD of the genotype matrix, we developed a direct method for the calculation of BayesC estimates of marker effects. Although SVD- and MCMC-based marker effects differed slightly, their prediction accuracies were similar. Assuming that the SVD of the marker genotype matrix is already performed for other reasons (e.g. for SNP-BLUP), computation times for the BayesC predictions were comparable to those of SNP-BLUP.
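The SVD shortcut this work builds on can be sketched for the simplest case, SNP-BLUP (ridge regression) estimation of marker effects. This is an illustrative reconstruction under simulated data, not the authors' BayesC code: once X = U S Vᵀ is available, the ridge solution needs only eigenvalue-wise shrinkage rather than a matrix inversion per analysis.

```python
import numpy as np

def snp_blup_svd(X, y, lam):
    """SNP-BLUP (ridge) marker effects via SVD of the genotype matrix X
    (n animals x m markers). Solves beta = (X'X + lam*I)^{-1} X'y using
    X = U S V', so only the singular values are shrunk."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    shrink = s / (s**2 + lam)              # eigenvalue-wise shrinkage
    return Vt.T @ (shrink * (U.T @ y))

# Simulated toy data: 50 animals, 200 markers coded 0/1/2, 5 QTL.
rng = np.random.default_rng(1)
X = rng.integers(0, 3, size=(50, 200)).astype(float)
true_effects = np.zeros(200)
true_effects[:5] = 1.0
y = X @ true_effects + rng.normal(size=50)
beta = snp_blup_svd(X, y, lam=10.0)
```

Because the SVD is computed once, re-solving for new shrinkage parameters (as marker-specific variances are updated) is cheap, which is the efficiency argument made in the abstract.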
Can Nucleoli Be Markers of Developmental Potential in Human Zygotes?
Fulka, Helena; Kyogoku, Hirohisa; Zatsepina, Olga; Langerova, Alena; Fulka, Josef
2015-11-01
In 1999, Tesarik and Greco reported that they could predict the developmental potential of human zygotes from a single static evaluation of their pronuclei. This was based on the distribution and number of specific nuclear organelles - the nucleoli. Recent studies in mice show that nucleoli play a key role in parental genome restructuring after fertilization, and that interfering with this process may lead to developmental failure. These studies thus support the Tesarik-Greco evaluation as a potentially useful method for selecting high-quality embryos in human assisted reproductive technologies. In this opinion article we discuss recent evidence linking nucleoli to parental genome reprogramming, and ask whether nucleoli can mirror or be used as representative markers of embryonic parameters such as chromosome content or DNA fragmentation. Copyright © 2015 Elsevier Ltd. All rights reserved.
Byron, Sara A.; Aldrich, Jessica; Sangal, Ashish; Barilla, Heather; Kiefer, Jeffrey A.; Carpten, John D.; Craig, David W.; Whitsett, Timothy G.
2017-01-01
Background Small cell lung cancer (SCLC) that has progressed after first-line therapy is an aggressive disease with few effective therapeutic strategies. In this prospective study, we employed next-generation sequencing (NGS) to identify therapeutically actionable alterations to guide treatment for advanced SCLC patients. Methods Twelve patients with SCLC were enrolled after failing platinum-based chemotherapy. Following informed consent, genome-wide exome and RNA-sequencing was performed in a CLIA-certified, CAP-accredited environment. Actionable targets were identified and therapeutic recommendations made from a pharmacopeia of FDA-approved drugs. Clinical response to genomically-guided treatment was evaluated by Response Evaluation Criteria in Solid Tumors (RECIST) 1.1. Results The study completed its accrual goal of 12 evaluable patients. The minimum tumor content for successful NGS was 20%, with a median turnaround time from sample collection to genomics-based treatment recommendation of 27 days. At least two clinically actionable targets were identified in each patient, and six patients (50%) received treatment identified by NGS. Two had partial responses by RECIST 1.1 on a clinical trial involving a PD-1 inhibitor + irinotecan (indicated by MLH1 alteration). The remaining patients had clinical deterioration before NGS recommended therapy could be initiated. Conclusions Comprehensive genomic profiling using NGS identified clinically-actionable alterations in SCLC patients who progressed on initial therapy. Recommended PD-1 therapy generated partial responses in two patients. Earlier access to NGS guided therapy, along with improved understanding of those SCLC patients likely to respond to immune-based therapies, should help to extend survival in these cases with poor outcomes. PMID:28586388
Alignment-free genome tree inference by learning group-specific distance metrics.
Patil, Kaustubh R; McHardy, Alice C
2013-01-01
Understanding the evolutionary relationships between organisms is vital for their in-depth study. Gene-based methods are often used to infer such relationships, which are not without drawbacks. One can now attempt to use genome-scale information, because of the ever increasing number of genomes available. This opportunity also presents a challenge in terms of computational efficiency. Two fundamentally different methods are often employed for sequence comparisons, namely alignment-based and alignment-free methods. Alignment-free methods rely on the genome signature concept and provide a computationally efficient way that is also applicable to nonhomologous sequences. The genome signature contains evolutionary signal as it is more similar for closely related organisms than for distantly related ones. We used genome-scale sequence information to infer taxonomic distances between organisms without additional information such as gene annotations. We propose a method to improve genome tree inference by learning specific distance metrics over the genome signature for groups of organisms with similar phylogenetic, genomic, or ecological properties. Specifically, our method learns a Mahalanobis metric for a set of genomes and a reference taxonomy to guide the learning process. By applying this method to more than a thousand prokaryotic genomes, we showed that, indeed, better distance metrics could be learned for most of the 18 groups of organisms tested here. Once a group-specific metric is available, it can be used to estimate the taxonomic distances for other sequenced organisms from the group. This study also presents a large-scale comparison of 10 methods: nine alignment-free and one alignment-based.
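The two ingredients named above, a genome signature and a Mahalanobis metric over it, can be sketched as follows. This is a toy illustration (k-mer choice, normalization, and the metric matrix M are assumptions for the example; the paper learns M from a reference taxonomy, which is omitted here):

```python
import numpy as np
from itertools import product

def signature(seq, k=4):
    """Normalized k-mer frequency vector: a simple form of the
    'genome signature' used by alignment-free methods."""
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    index = {km: i for i, km in enumerate(kmers)}
    v = np.zeros(len(kmers))
    for i in range(len(seq) - k + 1):
        j = index.get(seq[i:i + k])   # skip k-mers with ambiguous bases
        if j is not None:
            v[j] += 1
    total = v.sum()
    return v / total if total else v

def mahalanobis(x, y, M):
    """Distance between two signatures under a (learned) metric M,
    assumed positive semi-definite; M = I recovers Euclidean distance."""
    d = x - y
    return float(np.sqrt(d @ M @ d))
```

A learned M reweights and correlates signature dimensions so that distances better reflect the reference taxonomy for that group of organisms.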
An Evaluation Framework for Lossy Compression of Genome Sequencing Quality Values.
Alberti, Claudio; Daniels, Noah; Hernaez, Mikel; Voges, Jan; Goldfeder, Rachel L; Hernandez-Lopez, Ana A; Mattavelli, Marco; Berger, Bonnie
2016-01-01
This paper provides the specification and an initial validation of an evaluation framework for the comparison of lossy compressors of genome sequencing quality values. The goal is to define reference data, test sets, tools and metrics that shall be used to evaluate the impact of lossy compression of quality values on human genome variant calling. The functionality of the framework is validated referring to two state-of-the-art genomic compressors. This work has been spurred by the current activity within the ISO/IEC SC29/WG11 technical committee (a.k.a. MPEG), which is investigating the possibility of starting a standardization activity for genomic information representation.
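One family of lossy compressors such a framework must evaluate is coarse quantization of Phred quality scores. The toy quantizer below illustrates the idea; the bin edges and representative values are illustrative assumptions, loosely modeled on Illumina-style 8-level binning, not values prescribed by the framework:

```python
def bin_quality(q):
    """Toy Phred quality-score quantizer: map each score to one of a
    small set of representative values. Fewer distinct values means
    better compressibility, at the cost of fidelity, which is exactly
    the trade-off the evaluation framework measures via variant calling."""
    bins = [(2, 2), (9, 6), (19, 15), (24, 22), (29, 27),
            (34, 33), (39, 37), (41, 40)]   # (upper edge, representative)
    for upper, rep in bins:
        if q <= upper:
            return rep
    return 40
```

The framework's question is then empirical: how much does replacing q with bin_quality(q) change the variant calls on the reference datasets?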
Campagne, Fabien
2008-02-29
The evaluation of information retrieval techniques has traditionally relied on human judges to determine which documents are relevant to a query and which are not. This protocol is used in the Text Retrieval Evaluation Conference (TREC), organized annually for the past 15 years, to support the unbiased evaluation of novel information retrieval approaches. The TREC Genomics Track has recently been introduced to measure the performance of information retrieval for biomedical applications. We describe two protocols for evaluating biomedical information retrieval techniques without human relevance judgments. We call these protocols No Title Evaluation (NT Evaluation). The first protocol measures performance for focused searches, where only one relevant document exists for each query. The second protocol measures performance for queries expected to have potentially many relevant documents per query (high-recall searches). Both protocols take advantage of the clear separation of titles and abstracts found in Medline. We compare the performance obtained with these evaluation protocols to results obtained by reusing the relevance judgments produced in the 2004 and 2005 TREC Genomics Track and observe significant correlations between performance rankings generated by our approach and TREC. Spearman's correlation coefficients in the range of 0.79-0.92 are observed comparing bpref measured with NT Evaluation or with TREC evaluations. For comparison, coefficients in the range 0.86-0.94 can be observed when evaluating the same set of methods with data from two independent TREC Genomics Track evaluations. We discuss the advantages of NT Evaluation over the TRels and the data fusion evaluation protocols introduced recently. Our results suggest that the NT Evaluation protocols described here could be used to optimize some search engine parameters before human evaluation. 
Further research is needed to determine if NT Evaluation or variants of these protocols can fully substitute for human evaluations.
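The agreement between NT Evaluation and TREC rankings quoted above is measured with Spearman's rank correlation, which can be computed with the standard library alone (a self-contained sketch of the statistic, not the authors' evaluation code):

```python
def rankdata(values):
    """1-based ranks; ties receive the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman's rho: the Pearson correlation of the rank vectors."""
    rx, ry = rankdata(x), rankdata(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    return cov / (vx * vy) ** 0.5
```

A rho of 0.79-0.92 between NT Evaluation and TREC rankings means the two protocols order the same retrieval systems in largely the same way.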
Variation block-based genomics method for crop plants.
Kim, Yul Ho; Park, Hyang Mi; Hwang, Tae-Young; Lee, Seuk Ki; Choi, Man Soo; Jho, Sungwoong; Hwang, Seungwoo; Kim, Hak-Min; Lee, Dongwoo; Kim, Byoung-Chul; Hong, Chang Pyo; Cho, Yun Sung; Kim, Hyunmin; Jeong, Kwang Ho; Seo, Min Jung; Yun, Hong Tai; Kim, Sun Lim; Kwon, Young-Up; Kim, Wook Han; Chun, Hye Kyung; Lim, Sang Jong; Shin, Young-Ah; Choi, Ik-Young; Kim, Young Sun; Yoon, Ho-Sung; Lee, Suk-Ha; Lee, Sunghoon
2014-06-15
In contrast with wild species, cultivated crop genomes consist of reshuffled recombination blocks, which occurred by crossing and selection processes. Accordingly, recombination block-based genomics analysis can be an effective approach for the screening of target loci for agricultural traits. We propose the variation block method, which is a three-step process for recombination block detection and comparison. The first step is to detect variations by comparing the short-read DNA sequences of the cultivar to the reference genome of the target crop. Next, sequence blocks with variation patterns are examined and defined. The boundaries between the variation-containing sequence blocks are regarded as recombination sites. All the assumed recombination sites in the cultivar set are used to split the genomes, and the resulting sequence regions are termed variation blocks. Finally, the genomes are compared using the variation blocks. The variation block method identified recurring recombination blocks accurately and successfully represented block-level diversities in the publicly available genomes of 31 soybean and 23 rice accessions. The practicality of this approach was demonstrated by the identification of a putative locus determining soybean hilum color. We suggest that the variation block method is an efficient genomics method for the recombination block-level comparison of crop genomes. We expect that this method will facilitate the development of crop genomics by bringing genomics technologies to the field of crop breeding.
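The block-building step of the three-step process can be sketched on toy data: given per-position variant patterns in genome order, consecutive positions sharing a pattern are merged into one block, and a pattern change marks a putative recombination boundary. This is an illustrative simplification, not the authors' pipeline:

```python
def variation_blocks(patterns):
    """Merge consecutive positions with identical variant patterns into
    blocks. `patterns` is a list of per-position patterns (e.g. tuples of
    cultivar alleles); returns (start, end, pattern) with end exclusive.
    Boundaries between blocks are the assumed recombination sites."""
    blocks = []
    start = 0
    for i in range(1, len(patterns) + 1):
        if i == len(patterns) or patterns[i] != patterns[start]:
            blocks.append((start, i, patterns[start]))
            start = i
    return blocks
```

Downstream, the genomes are compared block-by-block rather than position-by-position, which is what makes the method efficient for screening trait loci.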
Aubrey, Wayne; Riley, Michael C; Young, Michael; King, Ross D; Oliver, Stephen G; Clare, Amanda
2015-01-01
Many advances in synthetic biology require the removal of a large number of genomic elements from a genome. Most existing deletion methods leave behind markers, and as there are a limited number of markers, such methods can only be applied a fixed number of times. Deletion methods that recycle markers generally are either imprecise (remove untargeted sequences), or leave scar sequences which can cause genome instability and rearrangements. No existing marker recycling method is automation-friendly. We have developed a novel openly available deletion tool that consists of: 1) a method for deleting genomic elements that can be repeatedly used without limit, is precise, scar-free, and suitable for automation; and 2) software to design the method's primers. Our tool is sequence agnostic and could be used to delete large numbers of coding sequences, promoter regions, transcription factor binding sites, terminators, etc in a single genome. We have validated our tool on the deletion of non-essential open reading frames (ORFs) from S. cerevisiae. The tool is applicable to arbitrary genomes, and we provide primer sequences for the deletion of: 90% of the ORFs from the S. cerevisiae genome, 88% of the ORFs from S. pombe genome, and 85% of the ORFs from the L. lactis genome.
Mining sequence variations in representative polyploid sugarcane germplasm accessions
DOE Office of Scientific and Technical Information (OSTI.GOV)
Yang, Xiping; Song, Jian; You, Qian
2017-08-09
Sugarcane (Saccharum spp.) is one of the most important economic crops because of its high sugar production and biofuel potential. Due to its high ploidy level and complex genome, it has been a huge challenge to investigate genomic sequence variations in sugarcane, which are critical for identifying alleles contributing to important agronomic traits. In order to mine the genetic variations in sugarcane, genotyping by sequencing (GBS) was used to genotype 14 representative Saccharum complex accessions. GBS is a method to generate a large number of markers, enabled by next-generation sequencing (NGS) and genome complexity reduction using restriction enzymes. To use GBS for high-throughput genotyping of highly polyploid sugarcane, GBS analysis pipelines for the 14 Saccharum complex accessions were established by evaluating different alignment methods, sequence variant callers, and sequence depths for single nucleotide polymorphism (SNP) filtering. Using the established pipeline, a total of 76,251 non-redundant SNPs, 5642 InDels, 6380 presence/absence variants (PAVs), and 826 copy number variations (CNVs) were detected among the 14 accessions. In addition, the non-reference-based universal network enabled analysis kit and Stacks de novo pipelines called 34,353 and 109,043 SNPs, respectively. In the 14 accessions, the percentage of single-dose SNPs ranged from 38.3% to 62.3% with an average of 49.6%, much higher than the proportions of multiple-dose SNPs. Concordantly called SNPs were used to evaluate the phylogenetic relationships among the 14 accessions. The results showed that the divergence time between the genus Erianthus and the genus Saccharum was more than 10 million years ago (MYA). The Saccharum species separated from their common ancestors 0.19 to 1.65 MYA.
The GBS pipelines, including the reference sequences, alignment methods, sequence variant callers, and sequence depths, were recommended and discussed for the Saccharum complex and other related species. A large number of sequence variations were discovered in the Saccharum complex, including SNPs, InDels, PAVs, and CNVs. Genome-wide SNPs were further used to illustrate sequence features of polyploid species and demonstrated the divergence of different species in the Saccharum complex. The results of this study showed that GBS was an effective NGS-based method to discover genomic sequence variations in highly polyploid and heterozygous species.
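The single- versus multiple-dose distinction drawn above can be illustrated with a toy classifier: in an autopolyploid of ploidy p, a d-dose SNP is expected to show an alternate-allele read fraction near d/p. The thresholding below is an illustrative stand-in for the statistical tests (e.g. depth-ratio tests) that real polyploid genotyping pipelines use; the tolerance is an assumption for the example:

```python
def dose_class(alt_reads, total_reads, ploidy=8, tol=0.06):
    """Toy allele-dosage classifier from read depths in a polyploid:
    pick the dose d (1..ploidy-1) whose expected fraction d/ploidy is
    closest to the observed alternate-allele fraction; return None when
    no dose fits within the tolerance or there is no coverage."""
    if total_reads == 0:
        return None
    frac = alt_reads / total_reads
    best = min(range(1, ploidy), key=lambda d: abs(frac - d / ploidy))
    return best if abs(frac - best / ploidy) <= tol else None
```

Single-dose markers (d = 1) segregate most simply, which is why their prevalence (on average 49.6% here) matters for downstream mapping.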
Non-additive Effects in Genomic Selection
Varona, Luis; Legarra, Andres; Toro, Miguel A.; Vitezica, Zulma G.
2018-01-01
In the last decade, genomic selection has become a standard in the genetic evaluation of livestock populations. However, most procedures for the implementation of genomic selection only consider the additive effects associated with SNP (Single Nucleotide Polymorphism) markers used to calculate the prediction of the breeding values of candidates for selection. Nevertheless, the availability of estimates of non-additive effects is of interest because: (i) they contribute to an increase in the accuracy of the prediction of breeding values and the genetic response; (ii) they allow the definition of mate allocation procedures between candidates for selection; and (iii) they can be used to enhance non-additive genetic variation through the definition of appropriate crossbreeding or purebred breeding schemes. This study presents a review of methods for the incorporation of non-additive genetic effects into genomic selection procedures and their potential applications in the prediction of future performance, mate allocation, crossbreeding, and purebred selection. The work concludes with a brief outline of some ideas for future lines of research that may help the standard inclusion of non-additive effects in genomic selection. PMID:29559995
A new normalizing algorithm for BAC CGH arrays with quality control metrics.
Miecznikowski, Jeffrey C; Gaile, Daniel P; Liu, Song; Shepherd, Lori; Nowak, Norma
2011-01-01
The main focus in pin-tip (or print-tip) microarray analysis is determining which probes, genes, or oligonucleotides are differentially expressed. Specifically in array comparative genomic hybridization (aCGH) experiments, researchers search for chromosomal imbalances in the genome. To model this data, scientists apply statistical methods to the structure of the experiment and assume that the data consist of the signal plus random noise. In this paper we propose "SmoothArray", a new method to preprocess comparative genomic hybridization (CGH) bacterial artificial chromosome (BAC) arrays, and we show the effects on a cancer dataset. As part of our R software package "aCGHplus," this freely available algorithm removes the variation due to the intensity effects, pin/print-tip, the spatial location on the microarray chip, and the relative location from the well plate. Removal of this variation improves the downstream analysis and subsequent inferences made on the data. Further, we present measures to evaluate the quality of the dataset according to the arrayer pins, 384-well plates, plate rows, and plate columns. We compare our method against competing methods using several metrics to measure the biological signal. With this novel normalization algorithm and quality control measures, the user can improve their inferences on datasets and pinpoint problems that may arise in their BAC aCGH technology.
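One ingredient of print-tip normalization can be sketched in a few lines: subtract each pin group's median log-ratio so systematic pin offsets vanish. This is a minimal illustration of the idea only; SmoothArray itself also models intensity, spatial, and well-plate effects, which are omitted here:

```python
import numpy as np

def print_tip_normalize(log_ratios, pin_ids):
    """Median-center log-ratios within each print-tip (pin) group so
    that pin-specific systematic offsets are removed. Returns a new
    array; per-pin medians of the result are zero."""
    log_ratios = np.asarray(log_ratios, dtype=float)
    pin_ids = np.asarray(pin_ids)
    out = log_ratios.copy()
    for pin in np.unique(pin_ids):
        mask = pin_ids == pin
        out[mask] -= np.median(log_ratios[mask])
    return out
```

After such centering, remaining per-pin structure can serve as a quality-control signal for problematic pins or plates.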
Joint genomic evaluation of French dairy cattle breeds using multiple-trait models
2012-01-01
Background Using a multi-breed reference population might be a way of increasing the accuracy of genomic breeding values in small breeds. Models involving mixed-breed data do not take into account the fact that marker effects may differ among breeds. This study was aimed at investigating the impact on accuracy of increasing the number of genotyped candidates in the training set by using a multi-breed reference population, in contrast to single-breed genomic evaluations. Methods Three traits (milk production, fat content and female fertility) were analyzed by genomic mixed linear models and Bayesian methodology. Three breeds of French dairy cattle were used: Holstein, Montbéliarde and Normande, with 2976, 950 and 970 bulls in the training population, respectively, and 964, 222 and 248 bulls in the validation population, respectively. All animals were genotyped with the Illumina Bovine SNP50 array. Accuracy of genomic breeding values was evaluated under three scenarios for the correlation of genomic breeding values between breeds (rg): (1) uncorrelated, rg = 0; (2) rg estimated from the data; and (3) highly correlated, rg = 0.95. Accuracy and bias of predictions obtained in the validation population with the multi-breed training set were assessed by the coefficient of determination (R2) and by the regression coefficient of daughter yield deviations of validation bulls on their predicted genomic breeding values, respectively. Results The genetic variation captured by the markers for each trait was similar to that estimated for routine pedigree-based genetic evaluation. Posterior means for rg ranged from −0.01 for fertility between Montbéliarde and Normande to 0.79 for milk yield between Montbéliarde and Holstein. Differences in R2 between the three scenarios were notable only for fat content in the Montbéliarde breed: from 0.27 in scenario (1) to 0.33 in scenarios (2) and (3). Accuracies for fertility were lower than for other traits.
Conclusions Using a multi-breed reference population resulted in small or no increases in accuracy. Only the breed with a small data set and large genetic correlation with the breed with a large data set showed increased accuracy for the traits with moderate (milk) to high (fat content) heritability. No benefit was observed for fertility, a lowly heritable trait. PMID:23216664
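The two validation statistics used in this study, R2 for accuracy and the regression slope of daughter yield deviations (DYD) on predicted breeding values for bias, can be sketched directly (a minimal numpy illustration, not the authors' evaluation software):

```python
import numpy as np

def validation_stats(dyd, gebv):
    """Regress daughter yield deviations on predicted genomic breeding
    values. Returns (r2, slope): the squared correlation measures
    accuracy; a slope near 1 indicates unbiased (well-scaled) predictions."""
    dyd = np.asarray(dyd, dtype=float)
    gebv = np.asarray(gebv, dtype=float)
    slope = np.cov(dyd, gebv, bias=True)[0, 1] / np.var(gebv)
    r2 = np.corrcoef(dyd, gebv)[0, 1] ** 2
    return r2, slope
```

A slope well below 1 would mean the genomic predictions over-disperse the true differences among validation bulls, which is why both statistics are reported together.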
DOE Office of Scientific and Technical Information (OSTI.GOV)
Hug, Laura A.; Thomas, Brian C.; Sharon, Itai
Nitrogen, sulfur and carbon fluxes in the terrestrial subsurface are determined by the intersecting activities of microbial community members, yet the organisms responsible are largely unknown. Metagenomic methods can identify organisms and functions, but genome recovery is often precluded by data complexity. To address this limitation, we developed subsampling assembly methods to reconstruct high-quality draft genomes from complex samples. Here, we applied these methods to evaluate the interlinked roles of the most abundant organisms in biogeochemical cycling in the aquifer sediment. Community proteomics confirmed these activities. The eight most abundant organisms belong to novel lineages, and two represent phyla with no previously sequenced genome. Four organisms are predicted to fix carbon via the Calvin-Benson-Bassham, Wood-Ljungdahl or 3-hydroxypropionate/4-hydroxybutyrate pathways. The profiled organisms are involved in the network of denitrification, dissimilatory nitrate reduction to ammonia, ammonia oxidation and sulfate reduction/oxidation, and require substrates supplied by other community members. An ammonia-oxidizing Thaumarchaeote is the most abundant community member, despite low ammonium concentrations in the groundwater. This organism likely benefits from two other relatively abundant organisms capable of producing ammonium from nitrate, which is abundant in the groundwater. Overall, dominant members of the microbial community are interconnected through exchange of geochemical resources.
Economic evaluation of genomic sequencing in the paediatric population: a critical review.
Alam, Khurshid; Schofield, Deborah
2018-05-24
Systematic evidence is critical to the formulation of national health policy to provide public funding for the integration of genomic sequencing into routine clinical care. The purpose of this review is to present systematic evidence on the economic evaluation of genomic sequencing conducted for paediatric patients in clinical care, and to identify any gaps in the methodology of economic evaluations. We undertook a critical review of the empirical evidence from economic evaluations of genomic sequencing among paediatric patients searching five electronic databases. Our inclusion criteria were limited to literature published in the English language between 2010 and 2017 in OECD countries. Articles that met our inclusion criteria were assessed using a recognised checklist for a well-designed economic evaluation. We found 11 full-text articles that met our inclusion criteria. Our analysis found that genomic sequencing markedly increased the diagnostic rate to 16-79%, but lowered the cost by 11-64% compared to the standard diagnostic pathway. Only five recent studies in paediatric clinical cohorts met most of the criteria for a well-designed economic evaluation and demonstrated cost-effectiveness of genomic sequencing in paediatric clinical cohorts of patients. Our review identified the need for improvement in the rigour of the methodologies used to provide robust evidence for the formulation of health policy on public funding to integrate genomic sequencing into routine clinical care. Nonetheless, there is emerging evidence of the cost-effectiveness of genomic sequencing over usual care for paediatric patients.
Agreda, Patricia M.; Beitman, Gerard H.; Gutierrez, Erin C.; Harris, James M.; Koch, Kristopher R.; LaViers, William D.; Leitch, Sharon V.; Maus, Courtney E.; McMillian, Ray A.; Nussbaumer, William A.; Palmer, Marcus L. R.; Porter, Michael J.; Richart, Gregory A.; Schwab, Ryan J.
2013-01-01
We evaluated the effect of storage at 2 to 8°C on the stability of human genomic and human papillomavirus (HPV) DNA stored in BD SurePath and Hologic PreservCyt liquid-based cytology media. DNA retained the ability to be extracted and PCR amplified for more than 2.5 years in both medium types. Prior inability to detect DNA in archived specimens may have been due to failure of the extraction method to isolate DNA from fixed cells. PMID:23678069
Genotypes are useful for more than genomic evaluation
USDA-ARS?s Scientific Manuscript database
New services that provide pedigree discovery, breed composition, mating programs, genomic inbreeding, fertility defects, and inheritance tracking all are possible from low-cost genotyping in addition to genomic evaluation. Genetic markers let breeders select among sibs before their phenotypes became...
Optimizing and evaluating the reconstruction of Metagenome-assembled microbial genomes.
Papudeshi, Bhavya; Haggerty, J Matthew; Doane, Michael; Morris, Megan M; Walsh, Kevin; Beattie, Douglas T; Pande, Dnyanada; Zaeri, Parisa; Silva, Genivaldo G Z; Thompson, Fabiano; Edwards, Robert A; Dinsdale, Elizabeth A
2017-11-28
Microbiome/host interactions describe characteristics that affect the host's health. Shotgun metagenomics involves sequencing a random subset of the microbiome to analyze its taxonomic and metabolic potential. Reconstruction of DNA fragments into genomes from metagenomes (called metagenome-assembled genomes) assigns unknown fragments to taxa/functions and facilitates discovery of novel organisms. Genome reconstruction incorporates sequence assembly and sorting of assembled sequences into bins, characteristic of a genome. However, the microbial community composition, including taxonomic and phylogenetic diversity, may influence genome reconstruction. We determined the optimal reconstruction method for four microbiome projects that had variable sequencing platforms (IonTorrent and Illumina), diversity (high or low), and environments (coral reefs and kelp forests), using a set of parameters to select for optimal assembly and binning tools. We tested the effects of the assembly and binning processes on population genome reconstruction using 105 marine metagenomes from 4 projects. Reconstructed genomes were obtained from each project using 3 assemblers (IDBA, MetaVelvet, and SPAdes) and 2 binning tools (GroopM and MetaBat). We assessed the efficiency of assemblers using statistics including contig continuity and contig chimerism, and the effectiveness of binning tools using genome completeness and taxonomic identification. We concluded that SPAdes assembled more contigs (143,718 ± 124 contigs) of longer length (N50 = 1632 ± 108 bp) and incorporated the most sequences (sequences-assembled = 19.65%). The microbial richness and evenness were maintained across the assembly, suggesting low contig chimerism. SPAdes assembly was responsive to the biological and technological variations within each project, compared with other assemblers.
Among binning tools, we conclude that MetaBat produced bins with less variation in GC content (average standard deviation: 1.49), low species richness (4.91 ± 0.66), and higher genome completeness (40.92 ± 1.75) across all projects. MetaBat extracted 115 bins from the 4 projects, of which 66 bins were identified as reconstructed metagenome-assembled genomes with sequences belonging to a specific genus. We identified 13 novel genomes, some of which were 100% complete but showed low similarity to genomes within databases. In conclusion, we present a set of biologically relevant parameters for selecting optimal assembly and binning tools. For the tools we tested, the SPAdes assembler and the MetaBat binning tool reconstructed high-quality metagenome-assembled genomes for the four projects. We also conclude that metagenomes from microbial communities with high sequence coverage, phylogenetically distinct members, and low taxonomic diversity yield the highest-quality metagenome-assembled genomes.
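The contig-continuity comparison above relies on the N50 statistic. As a point of reference, it can be computed from a list of contig lengths in a few lines (a minimal sketch; the function name is ours):

```python
def n50(contig_lengths):
    """Return the N50: the length L such that contigs of length >= L
    together cover at least half of the total assembly length."""
    total = sum(contig_lengths)
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    return 0

# Example: total = 29, half = 14.5; cumulative sums are 10, then 18,
# so the N50 is the second-longest contig, length 8.
print(n50([10, 8, 5, 3, 2, 1]))  # -> 8
```

A higher N50 (as reported for SPAdes above) indicates that fewer, longer contigs account for most of the assembled bases.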
Evaluating Quality of Aged Archival Formalin-Fixed Paraffin-Embedded Samples for RNA-Sequencing
Archival formalin-fixed paraffin-embedded (FFPE) samples offer a vast, untapped source of genomic data for biomarker discovery. However, the quality of FFPE samples is often highly variable, and conventional methods to assess RNA quality for RNA-sequencing (RNA-seq) are not infor...
Use of archival resources has been limited to date by inconsistent methods for genomic profiling of degraded RNA from formalin-fixed paraffin-embedded (FFPE) samples. RNA-sequencing offers a promising way to address this problem. Here we evaluated transcriptomic dose responses us...
Developing Predictive Toxicity Signatures Using In Vitro Data from the EPA ToxCast Program
A major focus in toxicology research is the development of in vitro methods to predict in vivo chemical toxicity. Numerous studies have evaluated the use of targeted biochemical, cell-based and genomic assay approaches. Each of these techniques is potentially helpful, but provide...
Müller, Bárbara S F; Neves, Leandro G; de Almeida Filho, Janeo E; Resende, Márcio F R; Muñoz, Patricio R; Dos Santos, Paulo E T; Filho, Estefano Paludzyszyn; Kirst, Matias; Grattapaglia, Dario
2017-07-11
The advent of high-throughput genotyping technologies coupled to genomic prediction methods established a new paradigm to integrate genomics and breeding. We carried out whole-genome prediction and contrasted it to a genome-wide association study (GWAS) for growth traits in breeding populations of Eucalyptus benthamii (n = 505) and Eucalyptus pellita (n = 732). Both species are of increasing commercial interest for the development of germplasm adapted to environmental stresses. Predictive ability reached 0.16 in E. benthamii and 0.44 in E. pellita for diameter growth. Predictive abilities using either Genomic BLUP or different Bayesian methods were similar, suggesting that growth adequately fits the infinitesimal model. Genomic prediction models using ~5000-10,000 SNPs provided predictive abilities equivalent to using all 13,787 and 19,506 SNPs genotyped in the E. benthamii and E. pellita populations, respectively. No difference was detected in predictive ability when different sets of SNPs were utilized, based on position (equidistantly genome-wide, inside genes, linkage disequilibrium pruned or on single chromosomes), as long as the total number of SNPs used was above ~5000. Predictive abilities obtained by removing relatedness between training and validation sets fell near zero for E. benthamii and were halved for E. pellita. These results corroborate the current view that relatedness is the main driver of genomic prediction, although some short-range historical linkage disequilibrium (LD) was likely captured for E. pellita. A GWAS identified only one significant association for volume growth in E. pellita, illustrating the fact that while genome-wide regression is able to account for large proportions of the heritability, very little or none of it is captured in significant associations using GWAS in breeding populations of the size evaluated in this study.
This study provides further experimental data supporting positive prospects of using genome-wide data to capture large proportions of trait heritability and predict growth traits in trees with accuracies equal or better than those attainable by phenotypic selection. Additionally, our results document the superiority of the whole-genome regression approach in accounting for large proportions of the heritability of complex traits such as growth in contrast to the limited value of the local GWAS approach toward breeding applications in forest trees.
Statistical methods for detecting periodic fragments in DNA sequence data
2011-01-01
Background Period 10 dinucleotides are structurally and functionally validated factors that influence the ability of DNA to form nucleosomes around histone core octamers. Robust identification of periodic signals in DNA sequences is therefore required to understand nucleosome organisation in genomes. While various techniques for identifying periodic components in genomic sequences have been proposed or adopted, the requirements for such techniques have not been considered in detail and confirmatory testing for a priori specified periods has not been developed. Results We compared the estimation accuracy and suitability for confirmatory testing of autocorrelation, discrete Fourier transform (DFT), integer period discrete Fourier transform (IPDFT) and a previously proposed Hybrid measure. A number of different statistical significance procedures were evaluated but a blockwise bootstrap proved superior. When applied to synthetic data whose period-10 signal had been eroded, or for which the signal was approximately period-10, the Hybrid technique exhibited superior properties during exploratory period estimation. In contrast, confirmatory testing using the blockwise bootstrap procedure identified IPDFT as having the greatest statistical power. These properties were validated on yeast sequences defined from a ChIP-chip study where the Hybrid metric confirmed the expected dominance of period-10 in nucleosome associated DNA but IPDFT identified more significant occurrences of period-10. Application to the whole genomes of yeast and mouse identified ~21% and ~19% respectively of these genomes as spanned by period-10 nucleosome positioning sequences (NPS). Conclusions For estimating the dominant period, we find the Hybrid period estimation method empirically to be the most effective for both eroded and approximate periodicity.
The blockwise bootstrap was found to be effective as a significance measure, performing particularly well in the problem of period detection in the presence of eroded periodicity. The autocorrelation method was identified as poorly suited for use with the blockwise bootstrap. Application of our methods to the genomes of two model organisms revealed a striking proportion of the yeast and mouse genomes are spanned by NPS. Despite their markedly different sizes, roughly equivalent proportions (19-21%) of the genomes lie within period-10 spans of the NPS dinucleotides {AA, TT, TA}. The biological significance of these regions remains to be demonstrated. To facilitate this, the genomic coordinates are available as Additional files 1, 2, and 3 in a format suitable for visualisation as tracks on popular genome browsers. Reviewers This article was reviewed by Prof Tomas Radivoyevitch, Dr Vsevolod Makeev (nominated by Dr Mikhail Gelfand), and Dr Rob D Knight. PMID:21527008
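A minimal sketch of a confirmatory blockwise-bootstrap test of the kind described above, using DFT power at an a priori specified period: resampling blocks of the sequence preserves local structure while destroying the long-range phase coherence that a genuine periodic signal produces. The block length, resampling scheme, and bootstrap count here are illustrative choices, not the published procedure's exact settings.

```python
import numpy as np

def power_at_period(x, period):
    """Squared magnitude of the DFT of x at frequency 1/period, scaled by length."""
    t = np.arange(len(x))
    c = np.sum(x * np.exp(-2j * np.pi * t / period))
    return abs(c) ** 2 / len(x)

def blockwise_bootstrap_pvalue(x, period=10, block=5, n_boot=500, seed=0):
    """Confirmatory test for an a priori period: resample overlapping blocks
    of x (breaking phase relationships between blocks) to build a null
    distribution for the observed spectral power."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    n = len(x)
    obs = power_at_period(x, period)
    starts = np.arange(0, n - block + 1)
    n_blocks = n // block
    exceed = 0
    for _ in range(n_boot):
        idx = rng.choice(starts, size=n_blocks)
        xb = np.concatenate([x[s:s + block] for s in idx])
        if power_at_period(xb, period) >= obs:
            exceed += 1
    return (exceed + 1) / (n_boot + 1)  # add-one to avoid a zero p-value

# A strong period-10 indicator signal should give a small p-value.
x = (np.arange(500) % 10 == 0).astype(float)
print(blockwise_bootstrap_pvalue(x))
```

In practice x would be a dinucleotide indicator sequence (e.g. positions where AA, TT, or TA begins), and the test would be run at the a priori period of interest, here 10.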
Comprehensive evaluation of genome-wide 5-hydroxymethylcytosine profiling approaches in human DNA.
Skvortsova, Ksenia; Zotenko, Elena; Luu, Phuc-Loi; Gould, Cathryn M; Nair, Shalima S; Clark, Susan J; Stirzaker, Clare
2017-01-01
The discovery that 5-methylcytosine (5mC) can be oxidized to 5-hydroxymethylcytosine (5hmC) by the ten-eleven translocation (TET) proteins has prompted wide interest in the potential role of 5hmC in reshaping the mammalian DNA methylation landscape. The gold-standard bisulphite conversion technologies to study DNA methylation do not distinguish between 5mC and 5hmC. However, new approaches to mapping 5hmC genome-wide have advanced rapidly, although it is unclear how the different methods compare in accurately calling 5hmC. In this study, we provide a comparative analysis on brain DNA using three 5hmC genome-wide approaches, namely whole-genome bisulphite/oxidative bisulphite sequencing (WG Bis/OxBis-seq), Infinium HumanMethylation450 BeadChip arrays coupled with oxidative bisulphite (HM450K Bis/OxBis) and antibody-based immunoprecipitation and sequencing of hydroxymethylated DNA (hMeDIP-seq). We also perform loci-specific TET-assisted bisulphite sequencing (TAB-seq) for validation of candidate regions. We show that whole-genome single-base resolution approaches are advantaged in providing precise 5hmC values but require high sequencing depth to accurately measure 5hmC, as this modification is commonly in low abundance in mammalian cells. HM450K arrays coupled with oxidative bisulphite provide a cost-effective representation of 5hmC distribution, at CpG sites with 5hmC levels >~10%. However, 5hmC analysis is restricted to the genomic location of the probes, which is an important consideration as 5hmC modification is commonly enriched at enhancer elements. Finally, we show that the widely used hMeDIP-seq method provides an efficient genome-wide profile of 5hmC and shows high correlation with WG Bis/OxBis-seq 5hmC distribution in brain DNA. However, in cell line DNA with low levels of 5hmC, hMeDIP-seq-enriched regions are not detected by WG Bis/OxBis or HM450K, suggesting either misinterpretation of 5hmC calls by hMeDIP-seq or a lack of sensitivity in the latter methods.
We highlight both the advantages and caveats of three commonly used genome-wide 5hmC profiling technologies and show that interpretation of 5hmC data can be significantly influenced by the sensitivity of methods used, especially as the levels of 5hmC are low and vary in different cell types and different genomic locations.
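The subtraction logic behind paired Bis/OxBis measurements, as used in the study above, is simple: bisulphite protects both 5mC and 5hmC from conversion, so BS reports 5mC + 5hmC, while oxidative bisulphite converts 5hmC, so oxBS reports 5mC alone. A minimal sketch (function name and counts are illustrative):

```python
def estimate_5hmc(bs_meth, bs_total, oxbs_meth, oxbs_total):
    """Subtraction estimate of the 5hmC level at one CpG site.

    bs_meth/bs_total: methylated and total read counts from bisulphite (5mC + 5hmC).
    oxbs_meth/oxbs_total: the same from oxidative bisulphite (5mC only).
    The difference estimates 5hmC, floored at zero because sampling noise
    can push the raw difference negative at low-abundance sites.
    """
    beta_bs = bs_meth / bs_total
    beta_oxbs = oxbs_meth / oxbs_total
    return max(0.0, beta_bs - beta_oxbs)

# 80/100 methylated calls in BS, 65/100 in oxBS -> ~15% 5hmC
print(estimate_5hmc(80, 100, 65, 100))
```

This difference-of-proportions form also makes the depth requirement noted above concrete: the variance of the estimate is the sum of two binomial sampling variances, so reliably detecting low 5hmC levels requires high coverage in both libraries.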
Sun, Xiaochun; Ma, Ping; Mumm, Rita H.
2012-01-01
Genomic selection (GS) procedures have proven useful in estimating breeding value and predicting phenotype with genome-wide molecular marker information. However, issues of high dimensionality, multicollinearity, and the inability to deal effectively with epistasis can jeopardize accuracy and predictive ability. We, therefore, propose a new nonparametric method, pRKHS, which combines the features of supervised principal component analysis (SPCA) and reproducing kernel Hilbert spaces (RKHS) regression, with versions for traits with no/low epistasis, pRKHS-NE, to high epistasis, pRKHS-E. Instead of assigning a specific relationship to represent the underlying epistasis, the method maps genotype to phenotype in a nonparametric way, thus requiring fewer genetic assumptions. SPCA decreases the number of markers needed for prediction by filtering out low-signal markers with the optimal marker set determined by cross-validation. Principal components are computed from reduced marker matrix (called supervised principal components, SPC) and included in the smoothing spline ANOVA model as independent variables to fit the data. The new method was evaluated in comparison with current popular methods for practicing GS, specifically RR-BLUP, BayesA, BayesB, as well as a newer method by Crossa et al., RKHS-M, using both simulated and real data. Results demonstrate that pRKHS generally delivers greater predictive ability, particularly when epistasis impacts trait expression. Beyond prediction, the new method also facilitates inferences about the extent to which epistasis influences trait expression. PMID:23226325
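The supervised-PCA filtering step described above can be sketched as follows. Names and dimensions are illustrative assumptions; in pRKHS the retained marker count is chosen by cross-validation, and the resulting supervised principal components then enter a smoothing spline ANOVA model rather than being used directly.

```python
import numpy as np

def supervised_pcs(X, y, n_markers, n_components):
    """Supervised principal components: keep the n_markers columns of X most
    correlated with y, then return the leading PCs of the reduced matrix."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    denom = np.linalg.norm(Xc, axis=0) * np.linalg.norm(yc)
    denom[denom == 0] = np.inf                 # guard against monomorphic markers
    corr = np.abs(Xc.T @ yc) / denom           # |correlation| per marker
    keep = np.argsort(corr)[::-1][:n_markers]  # top-signal markers
    U, S, Vt = np.linalg.svd(Xc[:, keep], full_matrices=False)
    return U[:, :n_components] * S[:n_components]

# Simulated check: 200 individuals, 1000 markers, first 10 are causal.
rng = np.random.default_rng(0)
X = rng.integers(0, 3, size=(200, 1000)).astype(float)
y = X[:, :10] @ np.ones(10) + rng.normal(0, 1, 200)
spc = supervised_pcs(X, y, n_markers=50, n_components=5)
print(spc.shape)  # (200, 5)
```

Filtering before the PCA is what distinguishes this from ordinary principal components: low-signal markers are discarded first, so the retained components are enriched for trait-associated variation.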
Bernard, Guillaume; Chan, Cheong Xin; Ragan, Mark A
2016-07-01
Alignment-free (AF) approaches have recently been highlighted as alternatives to methods based on multiple sequence alignment in phylogenetic inference. However, the sensitivity of AF methods to genome-scale evolutionary scenarios is little known. Here, using simulated microbial genome data we systematically assess the sensitivity of nine AF methods to three important evolutionary scenarios: sequence divergence, lateral genetic transfer (LGT) and genome rearrangement. Among these, AF methods are most sensitive to the extent of sequence divergence, less sensitive to low and moderate frequencies of LGT, and most robust against genome rearrangement. We describe the application of AF methods to three well-studied empirical genome datasets, and introduce a new application of the jackknife to assess node support. Our results demonstrate that AF phylogenomics is computationally scalable to multi-genome data and can generate biologically meaningful phylogenies and insights into microbial evolution.
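A minimal example of the alignment-free family of measures discussed here: the Euclidean distance between k-mer frequency profiles. This is one simple member of the AF family chosen for illustration, not necessarily one of the nine methods the study evaluated.

```python
from collections import Counter
from itertools import product
import math

def kmer_freqs(seq, k):
    """Normalised k-mer frequency vector over all DNA k-mers."""
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts.values())
    return {"".join(w): counts["".join(w)] / total
            for w in product("ACGT", repeat=k)}

def euclidean_kmer_distance(seq1, seq2, k=3):
    """Alignment-free distance: Euclidean distance between k-mer profiles."""
    f1, f2 = kmer_freqs(seq1, k), kmer_freqs(seq2, k)
    return math.sqrt(sum((f1[w] - f2[w]) ** 2 for w in f1))

a = "ACGTACGTACGTACGT"
b = "ACGTACGTACGTACGA"   # one substitution relative to a
c = "TTTTGGGGTTTTGGGG"   # very different composition
print(euclidean_kmer_distance(a, b) < euclidean_kmer_distance(a, c))  # True
```

Computing such pairwise distances over whole genomes yields a distance matrix from which a tree can be inferred, with no multiple sequence alignment required; this is what makes AF approaches computationally scalable to multi-genome data.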
Verzotto, Davide; M Teo, Audrey S; Hillmer, Axel M; Nagarajan, Niranjan
2016-01-01
Resolution of complex repeat structures and rearrangements in the assembly and analysis of large eukaryotic genomes is often aided by a combination of high-throughput sequencing and genome-mapping technologies (for example, optical restriction mapping). In particular, mapping technologies can generate sparse maps of large DNA fragments (150 kilobase pairs (kbp) to 2 Mbp) and thus provide a unique source of information for disambiguating complex rearrangements in cancer genomes. Despite their utility, combining high-throughput sequencing and mapping technologies has been challenging because of the lack of efficient and sensitive map-alignment algorithms for robustly aligning error-prone maps to sequences. We introduce a novel seed-and-extend glocal (short for global-local) alignment method, OPTIMA (and a sliding-window extension for overlap alignment, OPTIMA-Overlap), which is the first to create indexes for continuous-valued mapping data while accounting for mapping errors. We also present a novel statistical model, agnostic with respect to technology-dependent error rates, for conservatively evaluating the significance of alignments without relying on expensive permutation-based tests. We show that OPTIMA and OPTIMA-Overlap outperform other state-of-the-art approaches (1.6-2 times more sensitive) and are more efficient (170-200%) and precise in their alignments (nearly 99% precision). These advantages are independent of the quality of the data, suggesting that our indexing approach and statistical evaluation are robust, provide improved sensitivity and guarantee high precision.
Wang, Shuang; Zhang, Yuchen; Dai, Wenrui; Lauter, Kristin; Kim, Miran; Tang, Yuzhe; Xiong, Hongkai; Jiang, Xiaoqian
2016-01-01
Motivation: Genome-wide association studies (GWAS) have been widely used in discovering the association between genotypes and phenotypes. Human genome data contain valuable but highly sensitive information. Unprotected disclosure of such information might put individuals' privacy at risk. It is important to protect human genome data. Exact logistic regression is a bias-reduction method based on a penalized likelihood to discover rare variants that are associated with disease susceptibility. We propose the HEALER framework to facilitate secure rare variant analysis with a small sample size. Results: We target algorithm design aimed at reducing the computational and storage costs to learn a homomorphic exact logistic regression model (i.e. evaluate P-values of coefficients), where the circuit depth is proportional to the logarithmic scale of data size. We evaluate the algorithm performance using rare Kawasaki Disease datasets. Availability and implementation: Download HEALER at http://research.ucsd-dbmi.org/HEALER/ Contact: shw070@ucsd.edu Supplementary information: Supplementary data are available at Bioinformatics online. PMID:26446135
A private DNA motif finding algorithm.
Chen, Rui; Peng, Yun; Choi, Byron; Xu, Jianliang; Hu, Haibo
2014-08-01
With the increasing availability of genomic sequence data, numerous methods have been proposed for finding DNA motifs. The discovery of DNA motifs serves as a critical step in many biological applications. However, the privacy implication of DNA analysis is normally neglected in the existing methods. In this work, we propose a private DNA motif finding algorithm in which a DNA owner's privacy is protected by a rigorous privacy model, known as ∊-differential privacy. It provides provable privacy guarantees that are independent of adversaries' background knowledge. Our algorithm makes use of the n-gram model and is optimized for processing large-scale DNA sequences. We evaluate the performance of our algorithm over real-life genomic data and demonstrate the promise of integrating privacy into DNA motif finding. Copyright © 2014 Elsevier Inc. All rights reserved.
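The core privacy mechanism behind such methods can be illustrated with a Laplace-noise release of n-gram counts. This is a simplified sketch, not the published algorithm (which builds a noisy n-gram tree with adaptive privacy-budget allocation), and the sensitivity bound used here, the maximum number of n-grams one sequence contributes, is a loose illustrative choice.

```python
from collections import Counter
import numpy as np

def private_ngram_counts(sequences, n, epsilon, seed=0):
    """Release DNA n-gram counts under epsilon-differential privacy via the
    Laplace mechanism: add Laplace(sensitivity / epsilon) noise to each count,
    where sensitivity bounds how much one sequence can change the counts."""
    counts = Counter()
    max_contrib = 1
    for seq in sequences:
        grams = [seq[i:i + n] for i in range(len(seq) - n + 1)]
        counts.update(grams)
        max_contrib = max(max_contrib, len(grams))
    rng = np.random.default_rng(seed)
    scale = max_contrib / epsilon  # Laplace noise scale = sensitivity / epsilon
    return {g: c + rng.laplace(0.0, scale) for g, c in counts.items()}

noisy = private_ngram_counts(["ACGTACGT", "ACGTTTCA"], n=2, epsilon=1.0)
print(sorted(noisy))  # noisy counts over the observed 2-grams
```

A smaller epsilon gives stronger privacy but noisier counts, which is exactly the privacy/utility trade-off the motif-finding algorithm must manage at scale.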
Lim, Hansaim; Gray, Paul; Xie, Lei; Poleksic, Aleksandar
2016-01-01
The conventional one-drug-one-gene approach has been of limited success in modern drug discovery. Polypharmacology, which focuses on searching for multi-targeted drugs to perturb disease-causing networks instead of designing selective ligands to target individual proteins, has emerged as a new drug discovery paradigm. Although many methods for single-target virtual screening have been developed to improve the efficiency of drug discovery, few of these algorithms are designed for polypharmacology. Here, we present a novel theoretical framework and a corresponding algorithm for genome-scale multi-target virtual screening based on the one-class collaborative filtering technique. Our method overcomes the sparseness of the protein-chemical interaction data by means of interaction matrix weighting and dual regularization from both chemicals and proteins. While the statistical foundation behind our method is general enough to encompass genome-wide drug off-target prediction, the program is specifically tailored to find protein targets for new chemicals with little to no available interaction data. We extensively evaluate our method using a number of the most widely accepted gene-specific and cross-gene family benchmarks and demonstrate that our method outperforms other state-of-the-art algorithms for predicting the interaction of new chemicals with multiple proteins. Thus, the proposed algorithm may provide a powerful tool for multi-target drug design. PMID:27958331
Muthukrishnan, Madhanmohan; Singanallur, Nagendrakumar B; Ralla, Kumar; Villuppanoor, Srinivasan A
2008-08-01
Foot-and-mouth disease virus (FMDV) samples transported to the laboratory from far and inaccessible areas for serodiagnosis pose a major problem in a tropical country like India, where there is maximum temperature fluctuation. Inadequate storage methods lead to spoilage of FMDV samples collected from clinically positive animals in the field. Such samples are declared as non-typeable by the typing laboratories with the consequent loss of valuable epidemiological data. The present study evaluated the usefulness of FTA Classic Cards for the collection, shipment, storage and identification of the FMDV genome by RT-PCR and real-time RT-PCR. The stability of the viral RNA, the absence of infectivity and ease of processing the sample for molecular methods make the FTA cards a useful option for transport of FMDV genome for identification and serotyping. The method can be used routinely for FMDV research as it is economical and the cards can be transported easily in envelopes by regular document transport methods. Live virus cannot be isolated from samples collected in FTA cards, which is a limitation. This property can be viewed as an advantage as it limits the risk of transmission of live virus.
Crisan, Anamaria; McKee, Geoffrey; Munzner, Tamara
2018-01-01
Background Microbial genome sequencing is now being routinely used in many clinical and public health laboratories. Understanding how to report complex genomic test results to stakeholders who may have varying familiarity with genomics—including clinicians, laboratorians, epidemiologists, and researchers—is critical to the successful and sustainable implementation of this new technology; however, there are no evidence-based guidelines for designing such a report in the pathogen genomics domain. Here, we describe an iterative, human-centered approach to creating a report template for communicating tuberculosis (TB) genomic test results. Methods We used Design Study Methodology—a human centered approach drawn from the information visualization domain—to redesign an existing clinical report. We used expert consults and an online questionnaire to discover various stakeholders’ needs around the types of data and tasks related to TB that they encounter in their daily workflow. We also evaluated their perceptions of and familiarity with genomic data, as well as its utility at various clinical decision points. These data shaped the design of multiple prototype reports that were compared against the existing report through a second online survey, with the resulting qualitative and quantitative data informing the final, redesigned, report. Results We recruited 78 participants, 65 of whom were clinicians, nurses, laboratorians, researchers, and epidemiologists involved in TB diagnosis, treatment, and/or surveillance. Our first survey indicated that participants were largely enthusiastic about genomic data, with the majority agreeing on its utility for certain TB diagnosis and treatment tasks and many reporting some confidence in their ability to interpret this type of data (between 58.8% and 94.1%, depending on the specific data type). 
When we compared our four prototype reports against the existing design, we found that for the majority (86.7%) of design comparisons, participants preferred the alternative prototype designs over the existing version, and that both clinicians and non-clinicians expressed similar design preferences. Participants showed clearer design preferences when asked to compare individual design elements versus entire reports. Both the quantitative and qualitative data informed the design of a revised report, available online as a LaTeX template. Conclusions We show how a human-centered design approach integrating quantitative and qualitative feedback can be used to design an alternative report for representing complex microbial genomic data. We suggest experimental and design guidelines to inform future design studies in the bioinformatics and microbial genomics domains, and suggest that this type of mixed-methods study is important to facilitate the successful translation of pathogen genomics in the clinic, not only for clinical reports but also more complex bioinformatics data visualization software. PMID:29340235
Navigating the Interface Between Landscape Genetics and Landscape Genomics
Storfer, Andrew; Patton, Austin; Fraik, Alexandra K.
2018-01-01
As next-generation sequencing data become increasingly available for non-model organisms, a shift has occurred in the focus of studies of the geographic distribution of genetic variation. Whereas landscape genetics studies primarily focus on testing the effects of landscape variables on gene flow and genetic population structure, landscape genomics studies focus on detecting candidate genes under selection that indicate possible local adaptation. Navigating the transition between landscape genomics and landscape genetics can be challenging. The number of molecular markers analyzed has shifted from what used to be a few dozen loci to thousands of loci and even full genomes. Although genome scale data can be separated into sets of neutral loci for analyses of gene flow and population structure and putative loci under selection for inference of local adaptation, there are inherent differences in the questions that are addressed in the two study frameworks. We discuss these differences and their implications for study design, marker choice and downstream analysis methods. Similar to the rapid proliferation of analysis methods in the early development of landscape genetics, new analytical methods for detection of selection in landscape genomics studies are burgeoning. We focus on genome scan methods for detection of selection, and in particular, outlier differentiation methods and genetic-environment association tests because they are the most widely used. Use of genome scan methods requires an understanding of the potential mismatches between the biology of a species and assumptions inherent in analytical methods used, which can lead to high false positive rates of detected loci under selection. Key to choosing appropriate genome scan methods is an understanding of the underlying demographic structure of study populations, and such data can be obtained using neutral loci from the generated genome-wide data or prior knowledge of a species' phylogeographic history. 
To this end, we summarize recent simulation studies that test the power and accuracy of genome scan methods under a variety of demographic scenarios and sampling designs. We conclude with a discussion of additional considerations for future method development, and a summary of methods that show promise for landscape genomics studies but are not yet widely used. PMID:29593776
Shickh, Salma; Clausen, Marc; Mighton, Chloe; Casalino, Selina; Joshi, Esha; Glogowski, Emily; Schrader, Kasmintan A; Scheer, Adena; Elser, Christine; Panchal, Seema; Eisen, Andrea; Graham, Tracy; Aronson, Melyssa; Semotiuk, Kara M; Winter-Paquette, Laura; Evans, Michael; Lerner-Ellis, Jordan; Carroll, June C; Hamilton, Jada G; Offit, Kenneth; Robson, Mark; Thorpe, Kevin E; Laupacis, Andreas; Bombard, Yvonne
2018-04-26
Genome sequencing, a novel genetic diagnostic technology that analyses the billions of base pairs of DNA, promises to optimise healthcare through personalised diagnosis and treatment. However, implementation of genome sequencing faces challenges including the lack of consensus on disclosure of incidental results: gene changes unrelated to the disease under investigation but of potential clinical significance to the patient and their provider. Current recommendations encourage clinicians to return medically actionable incidental results and stress the importance of education and informed consent. Given the shortage of genetics professionals and genomics expertise among healthcare providers, decision aids (DAs) can help fill a critical gap in the clinical delivery of genome sequencing. We aim to assess the effectiveness of an interactive DA developed for selection of incidental results. We will compare the DA in combination with a brief Q&A session with a genetic counsellor to genetic counselling alone in a mixed-methods randomised controlled trial. Patients who received negative standard cancer genetic results for their personal and family history of cancer and are thus eligible for sequencing will be recruited from cancer genetics clinics in Toronto. Our primary outcome is decisional conflict. Secondary outcomes are knowledge, satisfaction, preparation for decision-making, anxiety and length of session with the genetic counsellor. A subset of participants will complete a qualitative interview about preferences for incidental results. This study has been approved by research ethics boards of St. Michael's Hospital, Mount Sinai Hospital and Sunnybrook Health Sciences Centre. This research poses no significant risk to participants. This study evaluates the effectiveness of a novel patient-centred tool to support clinical delivery of incidental results.
Results will be shared through national and international conferences, and at a stakeholder workshop to develop a consensus statement to optimise implementation of the DA in practice. NCT03244202; Pre-results. © Article author(s) (or their employer(s) unless otherwise stated in the text of the article) 2018. All rights reserved. No commercial use is permitted unless otherwise expressly granted.
Aiewsakun, Pakorn; Simmonds, Peter
2018-02-20
The International Committee on Taxonomy of Viruses (ICTV) classifies viruses into families, genera and species and provides a regulated system for their nomenclature that is universally used in virus descriptions. Virus taxonomic assignments have traditionally been based upon virus phenotypic properties such as host range, virion morphology and replication mechanisms, particularly at the family level. However, gene sequence comparisons provide a clearer guide to their evolutionary relationships and provide the only information that may guide the incorporation of viruses detected in environmental (metagenomic) studies that lack any phenotypic data. The current study sought to determine whether the existing virus taxonomy could be reproduced by examination of genetic relationships through the extraction of protein-coding gene signatures and genome organisational features. We found large-scale consistency between genetic relationships and taxonomic assignments for viruses of all genome configurations and genome sizes. The analysis pipeline that we have called 'Genome Relationships Applied to Virus Taxonomy' (GRAViTy) was highly effective at reproducing the current assignments of viruses at the family level as well as inter-family groupings into orders. Its ability to correctly differentiate assigned viruses from unassigned viruses, and to classify them into the correct taxonomic group, was evaluated by a threefold cross-validation technique. This predicted family membership of eukaryotic viruses with close to 100% accuracy and specificity, potentially enabling the algorithm to predict assignments for the vast corpus of metagenomic sequences in a manner consistent with ICTV taxonomy rules. In an evaluation run of GRAViTy, over one half (460/921) of (near-)complete genome sequences from several large published metagenomic eukaryotic virus datasets were assigned to 127 novel family-level groupings.
If corroborated by other analysis methods, these would potentially more than double the number of eukaryotic virus families in the ICTV taxonomy. A rapid and objective means to explore metagenomic viral diversity and make informed recommendations for their assignments at each taxonomic layer is essential. GRAViTy provides one means to make rule-based assignments at family and order levels in a manner that preserves the integrity and underlying organisational principles of the current ICTV taxonomy framework. Such methods are increasingly required as the vast virosphere is explored.
Constructing an integrated gene similarity network for the identification of disease genes.
Tian, Zhen; Guo, Maozu; Wang, Chunyu; Xing, LinLin; Wang, Lei; Zhang, Yin
2017-09-20
Discovering novel genes that are involved in human diseases is a challenging task in biomedical research. In recent years, several computational approaches have been proposed to prioritize candidate disease genes. Most of these methods are mainly based on protein-protein interaction (PPI) networks. However, since these PPI networks contain false positives and cover less than half of known human genes, their reliability and coverage are very low. Therefore, it is highly necessary to fuse multiple genomic data to construct a credible gene similarity network and then infer disease genes on the whole-genome scale. We propose a novel method, named RWRB, to infer causal genes of diseases of interest. First, we construct five individual gene (protein) similarity networks based on multiple genomic data of human genes. Then, an integrated gene similarity network (IGSN) is reconstructed based on the similarity network fusion (SNF) method. Finally, we employ the random walk with restart algorithm on the phenotype-gene bilayer network, which combines the phenotype similarity network, the IGSN and the phenotype-gene association network, to prioritize candidate disease genes. We investigate the effectiveness of RWRB through leave-one-out cross-validation in inferring phenotype-gene relationships. Results show that RWRB is more accurate than state-of-the-art methods on most evaluation metrics. Further analysis shows that the success of RWRB benefits from the IGSN, which has wider coverage and higher reliability compared with current PPI networks. Moreover, we conduct a comprehensive case study of Alzheimer's disease and predict some novel disease genes that are supported by the literature. RWRB is an effective and reliable algorithm for prioritizing candidate disease genes on the genomic scale. Software and supplementary information are available at http://nclab.hit.edu.cn/~tianzhen/RWRB/ .
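The random walk with restart at the heart of RWRB can be sketched in a few lines. This is a generic network-propagation sketch, not the authors' implementation: the bilayer phenotype-gene network and SNF-fused edge weights are simplified to an unweighted toy graph, the restart probability of 0.7 is an assumed default, and the gene names are invented.

```python
def random_walk_with_restart(edges, seeds, restart=0.7, tol=1e-9, max_iter=1000):
    """Network propagation: p <- (1 - r) * W p + r * p0, where W is the
    column-normalized adjacency of an undirected gene network and p0 puts
    all restart mass on the seed (known disease) genes."""
    nodes = sorted({n for e in edges for n in e})
    nbrs = {n: [] for n in nodes}
    for a, b in edges:
        nbrs[a].append(b)
        nbrs[b].append(a)
    p0 = {n: (1.0 / len(seeds) if n in seeds else 0.0) for n in nodes}
    p = dict(p0)
    for _ in range(max_iter):
        nxt = {}
        for n in nodes:
            # probability flowing into n from each neighbour's share
            inflow = sum(p[m] / len(nbrs[m]) for m in nbrs[n])
            nxt[n] = (1 - restart) * inflow + restart * p0[n]
        if sum(abs(nxt[n] - p[n]) for n in nodes) < tol:
            return nxt
        p = nxt
    return p

# Toy network: candidate "g3" touches the seed, "g4" is one step further away
edges = [("g1", "g2"), ("g1", "g3"), ("g2", "g3"), ("g3", "g4")]
scores = random_walk_with_restart(edges, seeds={"g1"})
```

Candidates are then ranked by their steady-state score; here the candidate adjacent to the seed outranks the more distant one, which is the intuition behind proximity-based prioritization.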
Tapping the promise of genomics in species with complex, nonmodel genomes.
Hirsch, Candice N; Buell, C Robin
2013-01-01
Genomics is enabling a renaissance in all disciplines of plant biology. However, many plant genomes are complex and remain recalcitrant to current genomic technologies. The complexities of these nonmodel plant genomes are attributable to gene and genome duplication, heterozygosity, ploidy, and/or repetitive sequences. Methods are available to simplify the genome and reduce these barriers, including inbreeding and genome reduction, making these species amenable to current sequencing and assembly methods. Some, but not all, of the complexities in nonmodel genomes can be bypassed by sequencing the transcriptome rather than the genome. Additionally, comparative genomics approaches, which leverage phylogenetic relatedness, can aid in the interpretation of complex genomes. Although there are limitations in accessing complex nonmodel plant genomes using current sequencing technologies, genome manipulation and resourceful analyses can allow access to even the most recalcitrant plant genomes.
Genomic selection of agronomic traits in hybrid rice using an NCII population.
Xu, Yang; Wang, Xin; Ding, Xiaowen; Zheng, Xingfei; Yang, Zefeng; Xu, Chenwu; Hu, Zhongli
2018-05-10
Hybrid breeding is an effective tool to improve yield in rice, but parental selection remains a key and difficult issue. Genomic selection (GS) provides opportunities to predict the performance of hybrids before phenotypes are measured. However, the application of GS is influenced by several genetic and statistical factors. Here, we used a rice North Carolina II (NC II) population constructed by crossing 115 rice varieties with five male sterile lines as a model to evaluate the effects of statistical method, heritability, marker density and training population size on prediction of hybrid performance. From the comparison of six GS methods, we found that predictability differed significantly among methods, with genomic best linear unbiased prediction (GBLUP) and the least absolute shrinkage and selection operator (LASSO) being the best, and support vector machine (SVM) and partial least squares (PLS) being the worst. Marker density had less influence on predicting rice hybrid performance than the size of the training population. Additionally, we used the 575 (115 × 5) hybrids as a training population to predict eight agronomic traits of all hybrids derived from the 120 (115 + 5) rice varieties each mated with 3023 rice accessions from the 3000 Rice Genomes Project (3K RGP). Of the 362,760 potential hybrids, selection of the top 100 predicted hybrids would lead to 35.5%, 23.25%, 30.21%, 42.87%, 61.80%, 75.83%, 19.24% and 36.12% increases in grain yield per plant, thousand-grain weight, panicle number per plant, plant height, secondary branch number, grain number per panicle, panicle length and primary branch number, respectively. This study evaluated the factors affecting predictability in hybrid prediction and demonstrated the implementation of GS to predict hybrid performance in rice. Our results suggest that GS could enable the rapid selection of superior hybrids, thus increasing the efficiency of rice hybrid breeding.
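GBLUP fits a mixed model with a genomic kinship matrix; an equivalent and easier-to-sketch view is ridge regression on marker dosages (RR-BLUP). The sketch below is illustrative only: the genotypes, phenotypes and shrinkage parameter lam are invented, and a real analysis would derive lam from estimated variance components rather than fix it by hand.

```python
def solve(A, b):
    """Gaussian elimination with partial pivoting for a small dense system."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[piv] = M[piv], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][k] * x[k] for k in range(r + 1, n))) / M[r][r]
    return x

def rrblup_effects(Z, y, lam):
    """Ridge (RR-BLUP-style) marker effects: solve (Z'Z + lam*I) u = Z'y."""
    m = len(Z[0])
    ZtZ = [[sum(Z[i][a] * Z[i][b] for i in range(len(Z)))
            + (lam if a == b else 0.0) for b in range(m)] for a in range(m)]
    Zty = [sum(Z[i][a] * y[i] for i in range(len(Z))) for a in range(m)]
    return solve(ZtZ, Zty)

# Invented training set: dosages coded -1/0/1, phenotype driven mostly by marker 0
Z = [[1, 0], [1, 1], [-1, 0], [-1, -1], [0, 1], [0, -1]]
y = [1.0, 1.2, -1.0, -1.3, 0.2, -0.2]
u = rrblup_effects(Z, y, lam=0.1)
pred = [sum(z[j] * u[j] for j in range(2)) for z in Z]
```

Untested hybrids are then scored by multiplying their (expected) marker dosages by the estimated effects u, which is how candidate crosses can be ranked before any field trial.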
Jayakumar, Vasanthan; Sakakibara, Yasubumi
2017-11-03
Long reads obtained from third-generation sequencing platforms can help overcome the long-standing challenge of the de novo assembly of sequences for the genomic analysis of non-model eukaryotic organisms. Numerous long-read-aided de novo assemblies have been published recently, which exhibited superior quality of the assembled genomes in comparison with those achieved using earlier second-generation sequencing technologies. Evaluating assemblies is important in guiding the appropriate choice for specific research needs. In this study, we evaluated 10 long-read assemblers using a variety of metrics on Pacific Biosciences (PacBio) data sets from different taxonomic categories with considerable differences in genome size. The results allowed us to narrow down the list to a few assemblers that can be effectively applied to eukaryotic assembly projects. Moreover, we highlight how best to use limited genomic resources for effectively evaluating the genome assemblies of non-model organisms. © The Author 2017. Published by Oxford University Press.
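Among the metrics commonly used to compare assemblies, contiguity statistics such as N50 are the simplest to compute. As a hedged illustration (the study's metric set is considerably broader than contiguity alone), N50 can be sketched as:

```python
def n50(contig_lengths):
    """Smallest contig length L such that contigs of length >= L
    together cover at least half of the total assembly length."""
    lengths = sorted(contig_lengths, reverse=True)
    half = sum(lengths) / 2.0
    covered = 0
    for length in lengths:
        covered += length
        if covered >= half:
            return length
    return 0

# Toy assembly: total 400 bp, half 200; 150 + 90 = 240 >= 200, so N50 = 90
assert n50([150, 90, 80, 50, 30]) == 90
```

Higher N50 values indicate a more contiguous assembly, though N50 alone says nothing about correctness, which is why such benchmarks pair it with reference-based and gene-completeness measures.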
Quantitative phenotyping via deep barcode sequencing
Smith, Andrew M.; Heisler, Lawrence E.; Mellor, Joseph; Kaper, Fiona; Thompson, Michael J.; Chee, Mark; Roth, Frederick P.; Giaever, Guri; Nislow, Corey
2009-01-01
Next-generation DNA sequencing technologies have revolutionized diverse genomics applications, including de novo genome sequencing, SNP detection, chromatin immunoprecipitation, and transcriptome analysis. Here we apply deep sequencing to genome-scale fitness profiling to evaluate yeast strain collections in parallel. This method, Barcode analysis by Sequencing, or “Bar-seq,” outperforms the current benchmark barcode microarray assay in terms of both dynamic range and throughput. When applied to a complex chemogenomic assay, Bar-seq quantitatively identifies drug targets, with performance superior to the benchmark microarray assay. We also show that Bar-seq is well-suited for a multiplex format. We completely re-sequenced and re-annotated the yeast deletion collection using deep sequencing, found that ∼20% of the barcodes and common priming sequences varied from expectation, and used this revised list of barcode sequences to improve data quality. Together, this new assay and analysis routine provide a deep-sequencing-based toolkit for identifying gene–environment interactions on a genome-wide scale. PMID:19622793
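The core of a Bar-seq readout, counting barcode matches per strain and scoring depletion between conditions, can be sketched as follows. This is a simplified illustration, not the published pipeline: exact substring matching stands in for real barcode demultiplexing, the strain names, barcodes and counts are invented, and the pseudocount of 1 is an arbitrary choice.

```python
import math
from collections import Counter

def barcode_counts(reads, barcodes):
    """Tally reads whose sequence contains a known strain barcode
    (exact substring match stands in for real demultiplexing)."""
    counts = Counter()
    for read in reads:
        for strain, bc in barcodes.items():
            if bc in read:
                counts[strain] += 1
                break
    return counts

def fitness_scores(control, treated, pseudo=1.0):
    """Per-strain log2 abundance ratio; strains depleted under
    treatment (drug-sensitive deletions) score negative."""
    strains = set(control) | set(treated)
    return {s: math.log2((treated[s] + pseudo) / (control[s] + pseudo))
            for s in strains}

# Invented barcodes and counts: strain "yfg2" is depleted by the treatment
barcodes = {"yfg1": "ACGTACGT", "yfg2": "TTGGCCAA"}
counts = barcode_counts(["NNACGTACGTNN", "TTGGCCAAGG"], barcodes)
control = Counter({"yfg1": 100, "yfg2": 100})
treated = Counter({"yfg1": 100, "yfg2": 12})
scores = fitness_scores(control, treated)
```

Strains with strongly negative scores in a chemogenomic screen point toward the drug's target pathway, which is the quantitative identification the abstract describes.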
Simulating Next-Generation Sequencing Datasets from Empirical Mutation and Sequencing Models
Stephens, Zachary D.; Hudson, Matthew E.; Mainzer, Liudmila S.; Taschuk, Morgan; Weber, Matthew R.; Iyer, Ravishankar K.
2016-01-01
An obstacle to validating and benchmarking methods for genome analysis is that there are few reference datasets available for which the “ground truth” about the mutational landscape of the sample genome is known and fully validated. Additionally, the free and public availability of real human genome datasets is incompatible with the preservation of donor privacy. In order to better analyze and understand genomic data, we need test datasets that model all variants, reflecting known biology as well as sequencing artifacts. Read simulators can fulfill this requirement, but are often criticized for limited resemblance to true data and overall inflexibility. We present NEAT (NExt-generation sequencing Analysis Toolkit), a set of tools that not only includes an easy-to-use read simulator, but also scripts to facilitate variant comparison and tool evaluation. NEAT has a wide variety of tunable parameters which can be set manually on the default model or parameterized using real datasets. The software is freely available at github.com/zstephens/neat-genreads. PMID:27893777
Iacono, William G; Malone, Stephen M; Vaidyanathan, Uma; Vrieze, Scott I
2014-12-01
This article provides an introductory overview of the investigative strategy employed to evaluate the genetic basis of 17 endophenotypes examined as part of a 20-year data collection effort from the Minnesota Center for Twin and Family Research. Included are characterization of the study samples, descriptive statistics for key properties of the psychophysiological measures, and rationale behind the steps taken in the molecular genetic study design. The statistical approach included (a) biometric analysis of twin and family data, (b) heritability analysis using 527,829 single nucleotide polymorphisms (SNPs), (c) genome-wide association analysis of these SNPs and 17,601 autosomal genes, (d) follow-up analyses of candidate SNPs and genes hypothesized to have an association with each endophenotype, (e) rare variant analysis of nonsynonymous SNPs in the exome, and (f) whole genome sequencing association analysis using 27 million genetic variants. These methods were used in the accompanying empirical articles comprising this special issue, Genome-Wide Scans of Genetic Variants for Psychophysiological Endophenotypes. Copyright © 2014 Society for Psychophysiological Research.
A new genome-mining tool redefines the lasso peptide biosynthetic landscape
Tietz, Jonathan I.; Schwalen, Christopher J.; Patel, Parth S.; Maxson, Tucker; Blair, Patricia M.; Tai, Hua-Chia; Zakai, Uzma I.; Mitchell, Douglas A.
2016-01-01
Ribosomally synthesized and post-translationally modified peptide (RiPP) natural products are attractive for genome-driven discovery and re-engineering, but limitations in bioinformatic methods and exponentially increasing genomic data make large-scale mining difficult. We report RODEO (Rapid ORF Description and Evaluation Online), which combines hidden Markov model-based analysis, heuristic scoring, and machine learning to identify biosynthetic gene clusters and predict RiPP precursor peptides. We initially focused on lasso peptides, which display intriguing physicochemical properties and bioactivities, but their hypervariability renders them challenging prospects for automated mining. Our approach yielded the most comprehensive mapping of lasso peptide space, revealing >1,300 compounds. We characterized the structures and bioactivities of six lasso peptides, prioritized based on predicted structural novelty, including an unprecedented handcuff-like topology and another with a citrulline modification exceptionally rare among bacteria. These combined insights significantly expand the knowledge of lasso peptides, and more broadly, provide a framework for future genome-mining efforts. PMID:28244986
Gymoese, Pernille; Sørensen, Gitte; Litrup, Eva; Olsen, John Elmerdal; Nielsen, Eva Møller
2017-01-01
Whole-genome sequencing is rapidly replacing current molecular typing methods for surveillance purposes. Our study evaluates core-genome single-nucleotide polymorphism analysis for outbreak detection and linking of sources of Salmonella enterica serovar Typhimurium and its monophasic variants during a 7-month surveillance period in Denmark. We reanalyzed and defined 8 previously characterized outbreaks from the phylogenetic relatedness of the isolates, epidemiologic data, and food traceback investigations. All outbreaks were identified, and we were able to exclude unrelated and include additional related human cases. We were furthermore able to link possible food and veterinary sources to the outbreaks. Isolates clustered according to sequence types (STs) 19, 34, and 36. Our study shows that core-genome single-nucleotide polymorphism analysis is suitable for surveillance and outbreak investigation for Salmonella Typhimurium (ST19 and ST36), but whole genome–wide analysis may be required for the tight genetic clone of monophasic variants (ST34). PMID:28930002
Comparing Mycobacterium tuberculosis genomes using genome topology networks.
Jiang, Jianping; Gu, Jianlei; Zhang, Liang; Zhang, Chenyi; Deng, Xiao; Dou, Tonghai; Zhao, Guoping; Zhou, Yan
2015-02-14
Over the last decade, emerging research methods, such as comparative genomic analysis and phylogenetic study, have yielded new insights into the genotypes and phenotypes of closely related bacterial strains. Several findings have revealed that genomic structural variations (SVs), including gene gain/loss, gene duplication and genome rearrangement, can lead to different phenotypes among strains, and an investigation of genes affected by SVs may extend our knowledge of the relationships between SVs and phenotypes in microbes, especially in pathogenic bacteria. In this work, we introduce a 'Genome Topology Network' (GTN) method based on gene homology and gene locations to analyze genomic SVs and perform phylogenetic analysis. Furthermore, the concept of the 'unfixed ortholog' is proposed, whose members are affected by SVs in genome topology among closely related species. To improve the precision of 'unfixed ortholog' recognition, a strategy to detect annotation differences and complete gene annotation was applied. To assess the GTN method, a set of thirteen complete M. tuberculosis genomes was analyzed as a case study. GTNs with two different gene homology-assigning methods were built, the Clusters of Orthologous Groups (COG) method and the orthoMCL clustering method, and two phylogenetic trees were constructed accordingly, which may provide additional insights into whole genome-based phylogenetic analysis. We obtained 24 unfixed COG groups, of which most members were related to immunogenicity and drug resistance, such as PPE-repeat proteins (COG5651) and transcriptional regulator TetR gene family members (COG1309). The GTN method has been implemented in PERL and released on our website. The tool can be downloaded from http://homepage.fudan.edu.cn/zhouyan/gtn/ , and allows re-annotating the 'lost' genes among closely related genomes, analyzing genes affected by SVs, and performing phylogenetic analysis.
With this tool, many immunogenic-related and drug resistance-related genes were found to be affected by SVs in M. tuberculosis genomes. We believe that the GTN method will be suitable for the exploration of genomic SVs in connection with biological features of bacterial strains, and that GTN-based phylogenetic analysis will provide additional insights into whole genome-based phylogenetic analysis.
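The GTN idea, representing each genome as a network of ortholog groups linked by chromosomal adjacency and flagging groups whose neighbourhood changes between strains, can be sketched on toy data. This is a minimal illustration of the concept only (the published tool additionally handles annotation completion, COG/orthoMCL assignment and tree building); the genomes and group labels are invented, and a circular chromosome is assumed.

```python
def topology_edges(genome):
    """Edges of a genome topology network: consecutive ortholog groups
    along the chromosome (circular genome assumed)."""
    n = len(genome)
    return {frozenset((genome[i], genome[(i + 1) % n])) for i in range(n)}

def unfixed_groups(genome_a, genome_b):
    """Ortholog groups whose adjacency differs between two genomes,
    i.e. groups touched by a structural variation."""
    ea, eb = topology_edges(genome_a), topology_edges(genome_b)
    return {g for edge in ea ^ eb for g in edge}

# Invented circular genomes: strain B carries an inversion of the C-D segment
strain_a = ["A", "B", "C", "D", "E"]
strain_b = ["A", "B", "D", "C", "E"]
changed = unfixed_groups(strain_a, strain_b)
```

The symmetric difference of edge sets localizes the rearrangement to the groups flanking and inside the inverted segment, while groups with unchanged neighbourhoods stay out of the "unfixed" set.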
Sellera, Fábio P; Fernandes, Miriam R; Moura, Quézia; Souza, Tiago A; Nascimento, Cristiane L; Cerdeira, Louise; Lincopan, Nilton
2018-03-01
The incidence of multidrug-resistant bacteria in wildlife has been investigated to improve our knowledge of the spread of clinically relevant antimicrobial resistance genes. The aim of this study was to report the first draft genome sequence of an extensively drug-resistant (XDR) Pseudomonas aeruginosa ST644 isolate recovered from a Magellanic penguin with a footpad infection (bumblefoot) undergoing rehabilitation. The genome was sequenced on an Illumina NextSeq® platform using 150-bp paired-end reads. De novo genome assembly was performed using Velvet v.1.2.10, and the whole genome sequence was evaluated using bioinformatics approaches from the Center for Genomic Epidemiology, whereas an in-house method (mapping of raw whole genome sequence reads) was used to identify chromosomal point mutations. The genome size was calculated at 6,436,450 bp, with 6,357 protein-coding sequences and the presence of genes conferring resistance to aminoglycosides, β-lactams, phenicols, sulphonamides, tetracyclines, quinolones and fosfomycin; in addition, mutations in gyrA (Thr83Ile) and parC (Ser87Leu), conferring resistance to quinolones, and in phoQ (Arg61His) and pmrB (Tyr345His), conferring resistance to polymyxins, were confirmed. This draft genome sequence can provide useful information for comparative genomic analysis regarding the dissemination of clinically significant antibiotic resistance genes and XDR bacterial species at the human-animal interface. Copyright © 2017 International Society for Chemotherapy of Infection and Cancer. Published by Elsevier Ltd. All rights reserved.
Kageyama, Shinji; Shinmura, Kazuya; Yamamoto, Hiroko; Goto, Masanori; Suzuki, Koichi; Tanioka, Fumihiko; Tsuneyoshi, Toshihiro; Sugimura, Haruhiko
2008-04-01
The PCR-based DNA fingerprinting method called methylation-sensitive amplified fragment length polymorphism (MS-AFLP) analysis is used for genome-wide scanning of methylation status. In this study, we developed a fluorescence-labeled MS-AFLP (FL-MS-AFLP) analysis by applying a fluorescence-labeled primer and a fluorescence-detecting electrophoresis apparatus to the existing MS-AFLP method. FL-MS-AFLP analysis enables quantitative evaluation of more than 350 random CpG loci per run. It was shown to allow assessment of differences in the methylation level of blood DNA among gastric cancer patients, and of hypermethylation and hypomethylation in DNA from gastric cancer tissue in comparison with adjacent non-cancerous tissue.
Benner, Christian; Havulinna, Aki S; Järvelin, Marjo-Riitta; Salomaa, Veikko; Ripatti, Samuli; Pirinen, Matti
2017-10-05
During the past few years, various novel statistical methods have been developed for fine-mapping with the use of summary statistics from genome-wide association studies (GWASs). Although these approaches require information about the linkage disequilibrium (LD) between variants, there has not been a comprehensive evaluation of how estimation of the LD structure from reference genotype panels performs in comparison with that from the original individual-level GWAS data. Using population genotype data from Finland and the UK Biobank, we show here that a reference panel of 1,000 individuals from the target population is adequate for a GWAS cohort of up to 10,000 individuals, whereas smaller panels, such as those from the 1000 Genomes Project, should be avoided. We also show, both theoretically and empirically, that the size of the reference panel needs to scale with the GWAS sample size; this has important consequences for the application of these methods in ongoing GWAS meta-analyses and large biobank studies. We conclude by providing software tools and by recommending practices for sharing LD information to more efficiently exploit summary statistics in genetics research. Copyright © 2017 American Society of Human Genetics. Published by Elsevier Inc. All rights reserved.
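The LD information these fine-mapping methods consume reduces, at its simplest, to squared correlations of allele dosages between variant pairs, estimated either from the original GWAS genotypes or from a reference panel; undersized panels make these estimates noisy, which is the failure mode the study quantifies. A minimal sketch with invented dosage vectors (real pipelines compute this over millions of variant pairs):

```python
def ld_r2(dosages_a, dosages_b):
    """Squared Pearson correlation (r^2) between allele dosages (0/1/2)
    at two variants, the standard pairwise LD measure."""
    n = len(dosages_a)
    ma, mb = sum(dosages_a) / n, sum(dosages_b) / n
    cov = sum((a - ma) * (b - mb) for a, b in zip(dosages_a, dosages_b)) / n
    var_a = sum((a - ma) ** 2 for a in dosages_a) / n
    var_b = sum((b - mb) ** 2 for b in dosages_b) / n
    return cov * cov / (var_a * var_b)

# Invented 6-sample panel: identical dosage vectors give r2 = 1;
# a single discordant sample pulls the estimate down sharply
snp1 = [0, 1, 2, 1, 0, 2]
snp2 = [0, 1, 2, 1, 0, 2]
snp3 = [0, 1, 2, 1, 0, 0]
r2 = ld_r2(snp1, snp3)
```

That a single sample out of six moves r2 from 1.0 to 0.3 illustrates why small reference panels, such as subpopulations of the 1000 Genomes Project, yield LD matrices too unstable for fine-mapping large GWAS cohorts.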
Computational Identification of Novel Genes: Current and Future Perspectives.
Klasberg, Steffen; Bitard-Feildel, Tristan; Mallet, Ludovic
2016-01-01
While it has long been thought that all genomic novelties are derived from the existing material, many genes lacking homology to known genes were found in recent genome projects. Some of these novel genes were proposed to have evolved de novo, ie, out of noncoding sequences, whereas some have been shown to follow a duplication and divergence process. Their discovery called for an extension of the historical hypotheses about gene origination. Besides the theoretical breakthrough, increasing evidence accumulated that novel genes play important roles in evolutionary processes, including adaptation and speciation events. Different techniques are available to identify genes and classify them as novel. Their classification as novel is usually based on their similarity to known genes, or lack thereof, detected by comparative genomics or against databases. Computational approaches are further prime methods that can be based on existing models or leveraging biological evidences from experiments. Identification of novel genes remains however a challenging task. With the constant software and technologies updates, no gold standard, and no available benchmark, evaluation and characterization of genomic novelty is a vibrant field. In this review, the classical and state-of-the-art tools for gene prediction are introduced. The current methods for novel gene detection are presented; the methodological strategies and their limits are discussed along with perspective approaches for further studies.
A multivariate prediction model for Rho-dependent termination of transcription.
Nadiras, Cédric; Eveno, Eric; Schwartz, Annie; Figueroa-Bossi, Nara; Boudvillain, Marc
2018-06-21
Bacterial transcription termination proceeds via two main mechanisms triggered either by simple, well-conserved (intrinsic) nucleic acid motifs or by the motor protein Rho. Although bacterial genomes can harbor hundreds of termination signals of either type, only intrinsic terminators are reliably predicted. Computational tools to detect the more complex and diversiform Rho-dependent terminators are lacking. To tackle this issue, we devised a prediction method based on Orthogonal Projections to Latent Structures Discriminant Analysis (OPLS-DA) of a large set of in vitro termination data. Using previously uncharacterized genomic sequences for biochemical evaluation and OPLS-DA, we identified new Rho-dependent signals and quantitative sequence descriptors with significant predictive value. The most relevant descriptors specify features of transcript C>G skewness, secondary structure, and richness in regularly spaced 5'CC/UC dinucleotides that are consistent with known principles for Rho-RNA interaction. Descriptors collectively warrant OPLS-DA predictions of Rho-dependent termination with a ∼85% success rate. Scanning of the Escherichia coli genome with the OPLS-DA model identifies significantly more termination-competent regions than anticipated from transcriptomics and predicts that regions intrinsically refractory to Rho are primarily located in open reading frames. Altogether, this work delineates features important for Rho activity and describes the first method able to predict Rho-dependent terminators in bacterial genomes.
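Two of the kinds of sequence descriptors the study names, transcript C>G skewness and richness in 5'CC/UC dinucleotides, are simple to compute. The sketch below illustrates the descriptors only, not the authors' actual OPLS-DA feature set or its window and spacing conventions; both example sequences are invented.

```python
def cg_skew(seq):
    """C>G skew, (C - G) / (C + G): positive for the C-rich, G-poor
    transcripts that Rho preferentially loads onto."""
    c, g = seq.count("C"), seq.count("G")
    return (c - g) / (c + g) if c + g else 0.0

def yc_dinucleotides(seq):
    """Count 5'CC/UC dinucleotides, a rough proxy for potential
    Rho contact points along the transcript."""
    return sum(1 for i in range(len(seq) - 1)
               if seq[i] in "CU" and seq[i + 1] == "C")

rut_like = "UCCUCCACUCCUCAUCCCUC"      # invented C-rich, G-poor stretch
structured = "GGCGCCGGAGGCGGCCUCCG"    # invented G-rich stretch
```

In a classifier such as OPLS-DA, descriptors like these would be computed over sliding windows and combined into the multivariate score that separates Rho-dependent terminators from background sequence.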
Validated method for quantification of genetically modified organisms in samples of maize flour.
Kunert, Renate; Gach, Johannes S; Vorauer-Uhl, Karola; Engel, Edwin; Katinger, Hermann
2006-02-08
Sensitive and accurate testing for trace amounts of biotechnology-derived DNA from plant material is the prerequisite for detection of 1% or 0.5% genetically modified ingredients in food products or raw materials thereof. Compared with ELISA detection of expressed proteins, real-time PCR (RT-PCR) amplification has easier sample preparation and lower detection limits. Of the different methods of DNA preparation, the CTAB method was chosen for its high flexibility in starting material and its generation of sufficient DNA of relevant quality. Previous RT-PCR data generated with the SYBR Green detection method showed that the method is highly sensitive to sample matrices and genomic DNA content, influencing the interpretation of results. Therefore, this paper describes real-time DNA quantification based on the TaqMan probe method, indicating high accuracy and sensitivity, with detection limits lower than 18 copies per sample, applicable and comparable to highly purified plasmid standards as well as complex matrices of genomic DNA samples. The results were evaluated with ValiData for homogeneity of variance, linearity, accuracy of the standard curve, and standard deviation.
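Quantification of this kind typically reads copy numbers off a standard curve: quantification-cycle (Ct) values from a plasmid dilution series are regressed on log10 copy number, and unknowns are interpolated. The sketch below uses idealized, invented values (a perfectly linear series with the textbook slope of roughly -3.3 Ct per 10-fold dilution), not data from this study.

```python
import math

def fit_standard_curve(copies, ct):
    """Least-squares fit of Ct = slope * log10(copies) + intercept
    over a plasmid standard dilution series."""
    x = [math.log10(c) for c in copies]
    n = len(x)
    mx, my = sum(x) / n, sum(ct) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, ct))
    slope = sxy / sxx
    return slope, my - slope * mx

def copies_from_ct(ct_value, slope, intercept):
    """Interpolate an unknown sample's copy number from its Ct."""
    return 10 ** ((ct_value - intercept) / slope)

# Idealized dilution series: slope -3.3 Ct per 10-fold dilution
standards = [1e2, 1e3, 1e4, 1e5]
cts = [30.3, 27.0, 23.7, 20.4]
slope, intercept = fit_standard_curve(standards, cts)
unknown = copies_from_ct(25.35, slope, intercept)   # ~3.2e3 copies
```

The fitted slope also yields the amplification efficiency (10^(-1/slope) - 1), one of the standard-curve quality checks a validation of this kind would report.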
EGASP: the human ENCODE Genome Annotation Assessment Project
Guigó, Roderic; Flicek, Paul; Abril, Josep F; Reymond, Alexandre; Lagarde, Julien; Denoeud, France; Antonarakis, Stylianos; Ashburner, Michael; Bajic, Vladimir B; Birney, Ewan; Castelo, Robert; Eyras, Eduardo; Ucla, Catherine; Gingeras, Thomas R; Harrow, Jennifer; Hubbard, Tim; Lewis, Suzanna E; Reese, Martin G
2006-01-01
Background We present the results of EGASP, a community experiment to assess the state-of-the-art in genome annotation within the ENCODE regions, which span 1% of the human genome sequence. The experiment had two major goals: the assessment of the accuracy of computational methods to predict protein coding genes; and the overall assessment of the completeness of the current human genome annotations as represented in the ENCODE regions. For the computational prediction assessment, eighteen groups contributed gene predictions. We evaluated these submissions against each other based on a 'reference set' of annotations generated as part of the GENCODE project. These annotations were not available to the prediction groups prior to the submission deadline, so that their predictions were blind and an external advisory committee could perform a fair assessment. Results The best methods had at least one gene transcript correctly predicted for close to 70% of the annotated genes. Nevertheless, the multiple transcript accuracy, taking into account alternative splicing, reached only approximately 40% to 50% accuracy. At the coding nucleotide level, the best programs reached an accuracy of 90% in both sensitivity and specificity. Programs relying on mRNA and protein sequences were the most accurate in reproducing the manually curated annotations. Experimental validation shows that only a very small percentage (3.2%) of the selected 221 computationally predicted exons outside of the existing annotation could be verified. Conclusion This is the first such experiment in human DNA, and we have followed the standards established in a similar experiment, GASP1, in Drosophila melanogaster. We believe the results presented here contribute to the value of ongoing large-scale annotation projects and should guide further experimental methods when being scaled up to the entire human genome sequence. PMID:16925836
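Nucleotide-level accuracy in gene-prediction assessments of this kind is scored per coding position; note that in this literature "specificity" conventionally means TP/(TP+FP), i.e. precision. A toy sketch with invented exon coordinates, not the EGASP evaluation code:

```python
def coding_positions(exons):
    """Genomic positions covered by a list of [start, end) exon intervals."""
    return {p for start, end in exons for p in range(start, end)}

def nucleotide_sn_sp(annotated_exons, predicted_exons):
    """Per-nucleotide sensitivity and 'specificity' (precision) of a
    gene prediction against a reference annotation."""
    annotated = coding_positions(annotated_exons)
    predicted = coding_positions(predicted_exons)
    tp = len(annotated & predicted)          # correctly predicted coding bases
    return tp / len(annotated), tp / len(predicted)

# Invented coordinates: the prediction misses the 3' half of exon 2
annotation = [(100, 200), (300, 400)]
prediction = [(100, 200), (300, 350)]
sn, sp = nucleotide_sn_sp(annotation, prediction)
```

Analogous counts at the exon, transcript and gene levels give the progressively stricter accuracy figures quoted in the abstract, since an exon or transcript only scores when every one of its coordinates matches.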
Nikolova, Olga; Moser, Russell; Kemp, Christopher; Gönen, Mehmet; Margolin, Adam A
2017-05-01
In recent years, vast advances in biomedical technologies and comprehensive sequencing have revealed the genomic landscape of common forms of human cancer in unprecedented detail. The broad heterogeneity of the disease calls for rapid development of personalized therapies. Translating the readily available genomic data into useful knowledge that can be applied in the clinic remains a challenge. Computational methods are needed to aid these efforts by robustly analyzing genome-scale data from distinct experimental platforms for prioritization of targets and treatments. We propose a novel, biologically motivated, Bayesian multitask approach, which explicitly models gene-centric dependencies across multiple and distinct genomic platforms. We introduce a gene-wise prior and present a fully Bayesian formulation of a group factor analysis model. In supervised prediction applications, our multitask approach leverages similarities in response profiles of groups of drugs that are more likely to be related to true biological signal, which leads to more robust performance and improved generalization ability. We evaluate the performance of our method on molecularly characterized collections of cell lines profiled against two compound panels, namely the Cancer Cell Line Encyclopedia and the Cancer Therapeutics Response Portal. We demonstrate that accounting for the gene-centric dependencies enables leveraging information from multi-omic input data and improves prediction and feature selection performance. We further demonstrate the applicability of our method in an unsupervised dimensionality reduction application by inferring genes essential to tumorigenesis in the pancreatic ductal adenocarcinoma and lung adenocarcinoma patient cohorts from The Cancer Genome Atlas. Availability and implementation: The code for this work is available at https://github.com/olganikolova/gbgfa. Contact: nikolova@ohsu.edu or margolin@ohsu.edu. Supplementary data are available at Bioinformatics online. © The Author 2017. Published by Oxford University Press. All rights reserved. For Permissions, please e-mail: journals.permissions@oup.com
OPATs: Omnibus P-value association tests.
Chen, Chia-Wei; Yang, Hsin-Chou
2017-07-10
Combining statistical significances (P-values) from a set of single-locus association tests in genome-wide association studies is a proof-of-principle method for identifying disease-associated genomic segments, functional genes and biological pathways. We review P-value combinations for genome-wide association studies and introduce an integrated analysis tool, Omnibus P-value Association Tests (OPATs), which implements popular P-value combination methods. OPATs is programmed in R and features a user-friendly R-based graphical user interface. In addition to analysis modules for data quality control and single-locus association tests, OPATs provides three types of set-based association test: window-, gene- and biopathway-based. P-value combinations with or without threshold and rank truncation are provided. The significance of a set-based association test is evaluated by resampling procedures. The performance of the set-based association tests in OPATs has been evaluated in simulation studies and real-data analyses. These set-based tests help boost statistical power, alleviate the multiple-testing problem, reduce the impact of genetic heterogeneity, increase the replication efficiency of association tests, and facilitate the interpretation of association signals by streamlining the testing procedures and integrating the genetic effects of multiple variants in genomic regions of biological relevance. In summary, P-value combinations facilitate the identification of marker sets associated with disease susceptibility and help uncover missing heritability in association studies, thereby establishing a foundation for the genetic dissection of complex diseases and traits. OPATs provides an easy-to-use and statistically powerful analysis tool for P-value combinations. OPATs, examples, and a user guide can be downloaded from http://www.stat.sinica.edu.tw/hsinchou/genetics/association/OPATs.htm. © The Author 2017. Published by Oxford University Press.
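One of the classic P-value combinations implemented in tools of this kind is Fisher's method. A minimal, self-contained sketch (illustrative only, not OPATs' own code; the resampling and truncation variants the abstract mentions are omitted):

```python
import math

def fisher_combine(pvalues):
    """Fisher's method: combine m independent P-values into one test.

    The statistic X = -2 * sum(log p_i) follows a chi-square distribution
    with 2m degrees of freedom under the global null hypothesis. For even
    degrees of freedom the chi-square survival function has a closed form,
    so no external stats library is needed:
        P(chi2_{2m} > x) = exp(-x/2) * sum_{i<m} (x/2)^i / i!
    """
    m = len(pvalues)
    half = -sum(math.log(p) for p in pvalues)  # x/2
    term, total = 1.0, 0.0
    for i in range(m):
        total += term
        term *= half / (i + 1)
    return math.exp(-half) * total
```

For a single P-value the method returns it unchanged; for several individually weak signals the combined P-value can be far smaller than any single one, which is what makes set-based tests powerful.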
Aubrey, Wayne; Riley, Michael C.; Young, Michael; King, Ross D.; Oliver, Stephen G.; Clare, Amanda
2015-01-01
Many advances in synthetic biology require the removal of a large number of genomic elements from a genome. Most existing deletion methods leave behind markers, and as there are a limited number of markers, such methods can only be applied a fixed number of times. Deletion methods that recycle markers generally are either imprecise (they remove untargeted sequences) or leave scar sequences, which can cause genome instability and rearrangements. No existing marker recycling method is automation-friendly. We have developed a novel, openly available deletion tool that consists of: 1) a method for deleting genomic elements that can be repeatedly used without limit, is precise, scar-free, and suitable for automation; and 2) software to design the method’s primers. Our tool is sequence agnostic and could be used to delete large numbers of coding sequences, promoter regions, transcription factor binding sites, terminators, etc., in a single genome. We have validated our tool on the deletion of non-essential open reading frames (ORFs) from S. cerevisiae. The tool is applicable to arbitrary genomes, and we provide primer sequences for the deletion of 90% of the ORFs from the S. cerevisiae genome, 88% of the ORFs from the S. pombe genome, and 85% of the ORFs from the L. lactis genome. PMID:26630677
Characterization of novel RS1 exonic deletions in juvenile X-linked retinoschisis
D’Souza, Leera; Cukras, Catherine; Antolik, Christian; Craig, Candice; He, Hong; Li, Shibo; Hejtmancik, James F.; Sieving, Paul A.; Wang, Xinjing
2013-01-01
Purpose X-linked juvenile retinoschisis (XLRS) is a vitreoretinal dystrophy characterized by schisis (splitting) of the inner layers of the neuroretina. Mutations within the retinoschisis (RS1) gene are responsible for this disease. The mutation spectrum consists of amino acid substitutions, splice site variations, small indels, and larger genomic deletions. Clinically, genomic deletions are rarely reported. Here, we characterize two novel full exonic deletions: one encompassing exon 1 and the other spanning exons 4–5 of the RS1 gene. We also report the clinical findings in these patients with XLRS with two different exonic deletions. Methods Unrelated XLRS men and boys and their mothers (if available) were enrolled for molecular genetics evaluation. The patients also underwent ophthalmologic examination and in some cases electroretinogram (ERG) recording. All the exons and the flanking intronic regions of the RS1 gene were analyzed with direct sequencing. Two patients with exonic deletions were further evaluated with array comparative genomic hybridization to define the scope of the genomic aberrations. After the deleted genomic region was identified, primer walking followed by direct sequencing was used to determine the exact breakpoints. Results Two novel exonic deletions of the RS1 gene were identified: one including exon 1 and the other spanning exons 4 and 5. The exon 1 deletion extends from the 5′ region of the RS1 gene (including the promoter) through intron 1 (c.(−35)-1723_c.51+2664del4472). The exon 4–5 deletion extends from intron 3 to intron 5 (c.185–1020_c.522+1844del5764). Conclusions Here we report two novel exonic deletions within the RS1 gene locus. We have also described the clinical presentations and hypothesized the genomic mechanisms underlying these schisis phenotypes. PMID:24227916
Protecting genomic sequence anonymity with generalization lattices.
Malin, B A
2005-01-01
Current genomic privacy technologies assume the identity of genomic sequence data is protected if personal information, such as demographics, is obscured, removed, or encrypted. While demographic features can directly compromise an individual's identity, recent research demonstrates such protections are insufficient because sequence data itself is susceptible to re-identification. To counteract this problem, we introduce an algorithm for anonymizing a collection of person-specific DNA sequences. The technique is termed DNA lattice anonymization (DNALA), and is based upon the formal privacy protection schema of k-anonymity. Under this model, it is impossible to observe or learn features that distinguish one genetic sequence from k-1 other entries in a collection. To maximize the information retained in protected sequences, we incorporate a concept generalization lattice to learn the distance between two residues at a single nucleotide position. The lattice provides the most similar generalized concept for two residues (e.g., adenine and guanine are both purines). The method is tested and evaluated on several publicly available human population datasets ranging in size from 30 to 400 sequences. Our findings imply the anonymization schema is feasible for the protection of sequence privacy. The DNALA method is the first computational disclosure control technique for general DNA sequences. Given the computational nature of the method, guarantees of anonymity can be formally proven. There is room for improvement and validation, though this research provides the groundwork from which future researchers can construct genomics anonymization schemas tailored to specific data-sharing scenarios.
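The generalization-lattice idea can be made concrete with a small sketch: each nucleotide generalizes upward (e.g. A and G to "purine"), and two residues are replaced by their most specific common concept. This is an illustrative reconstruction, not the DNALA implementation; the lattice shape and names are assumptions based on the purine/pyrimidine example in the abstract.

```python
# Hypothetical concept-generalization lattice for nucleotides, in the
# spirit of DNALA. Each residue generalizes up the lattice, e.g.
# A -> purine -> any; pairing two residues costs the height of their
# lowest common generalization.

PARENT = {  # child -> parent edges of the lattice
    "A": "purine", "G": "purine",
    "C": "pyrimidine", "T": "pyrimidine",
    "purine": "any", "pyrimidine": "any",
}

def ancestors(symbol):
    """Chain from a symbol up to the lattice top, e.g. A, purine, any."""
    chain = [symbol]
    while chain[-1] in PARENT:
        chain.append(PARENT[chain[-1]])
    return chain

def generalize(a, b):
    """Most specific lattice concept covering both residues."""
    chain_b = set(ancestors(b))
    for concept in ancestors(a):
        if concept in chain_b:
            return concept
    return "any"
```

Under k-anonymity, sequences would be paired so that the total generalization cost across positions is minimized, keeping as much specific information as possible.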
Simultaneous fitting of genomic-BLUP and Bayes-C components in a genomic prediction model.
Iheshiulor, Oscar O M; Woolliams, John A; Svendsen, Morten; Solberg, Trygve; Meuwissen, Theo H E
2017-08-24
The rapid adoption of genomic selection is due to two key factors: the availability of high-throughput dense genotyping and of statistical methods to estimate and predict breeding values. The development of such methods is still ongoing and, so far, there is no consensus on the best approach. Currently, the linear and non-linear methods for genomic prediction (GP) are treated as distinct approaches. The aim of this study was to evaluate the implementation of an iterative method (called GBC) that incorporates aspects of both linear [genomic best linear unbiased prediction (G-BLUP)] and non-linear (Bayes-C) methods for GP. The iterative nature of GBC makes it less computationally demanding, similar to other non-Markov chain Monte Carlo (MCMC) approaches. However, as a Bayesian method, GBC differs from both MCMC- and non-MCMC-based methods by combining aspects of the G-BLUP and Bayes-C methods for GP. Its performance was compared to that of G-BLUP and Bayes-C. We used an imputed 50 K single-nucleotide polymorphism (SNP) dataset based on the Illumina Bovine50K BeadChip, which included 48,249 SNPs and 3244 records. Daughter yield deviations for somatic cell count, fat yield, milk yield, and protein yield were used as response variables. GBC was frequently (marginally) superior to G-BLUP and Bayes-C in terms of prediction accuracy, and was significantly better than G-BLUP only for fat yield. On average across the four traits, GBC yielded a 0.009 and 0.006 increase in prediction accuracy over G-BLUP and Bayes-C, respectively. Computationally, GBC was much faster than Bayes-C and similar to G-BLUP. Our results show that incorporating aspects of G-BLUP and Bayes-C in a single model can improve the accuracy of GP over the commonly used G-BLUP method. In general, GBC did not perform statistically better than G-BLUP and Bayes-C, probably due to the close relationships between reference and validation individuals.
Nevertheless, it is a flexible tool in the sense that it simultaneously incorporates aspects of linear and non-linear models for GP, thereby exploiting family relationships while also accounting for linkage disequilibrium between SNPs and genes with large effects. The application of GBC in GP merits further exploration.
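The linear extreme that GBC builds on, G-BLUP, is mathematically equivalent to ridge regression on SNP genotypes: every SNP effect is shrunk equally, whereas Bayes-C lets a fraction of SNPs have zero effect. A minimal sketch of the ridge (SNP-BLUP) solution, with illustrative names and no connection to the GBC software:

```python
import numpy as np

def snp_blup(X, y, lam):
    """Ridge-regression (SNP-BLUP) estimate of SNP effects.

    Solves b = (X'X + lam*I)^{-1} X'y, where X is the n-by-p genotype
    matrix, y the phenotype vector, and lam the shrinkage parameter
    (the ratio of residual to marker variance in the BLUP view).
    All SNPs are shrunk toward zero by the same amount.
    """
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)
```

With a small shrinkage parameter and noise-free data, the estimated effects recover the true effects; raising `lam` trades variance for bias, which is the G-BLUP-like behavior GBC interpolates away from when it adds Bayes-C-style variable selection.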
Dairy cattle genomics evaluation program update
USDA-ARS's Scientific Manuscript database
Implementation of genomic evaluation has caused profound changes in dairy cattle breeding. All young bulls bought by major artificial-insemination organizations are now selected based on these evaluations. Evaluation reliability can reach ~75% for yield traits, which is adequate for marketing semen o...
Arakaki, Atsushi; Shibusawa, Mie; Hosokawa, Masahito; Matsunaga, Tadashi
2010-03-01
Magnetotactic bacteria comprise a phylogenetically diverse group that is capable of synthesizing intracellular magnetic particles. Although various morphotypes of magnetotactic bacteria have been observed in the environment, bacterial strains available in pure culture are currently limited to a few genera due to difficulties in their enrichment and cultivation. In order to obtain genetic information from uncultured magnetotactic bacteria, a genome preparation method that involves magnetic separation of cells, flow cytometry, and multiple displacement amplification (MDA) using phi29 polymerase was used in this study. The conditions for the MDA reaction using samples containing 1 to 100 cells were evaluated using a pure-culture magnetotactic bacterium, "Magnetospirillum magneticum AMB-1," whose complete genome sequence is available. Uniform gene amplification was confirmed by quantitative PCR (Q-PCR) when 100 cells were used as a template. This method was then applied for genome preparation of uncultured magnetotactic bacteria from complex bacterial communities in an aquatic environment. A sample containing 100 cells of the uncultured magnetotactic coccus was prepared by magnetic cell separation and flow cytometry and used as an MDA template. 16S rRNA sequence analysis of the MDA product from these 100 cells revealed that the amplified genomic DNA was from a single species of magnetotactic bacterium that was phylogenetically affiliated with magnetotactic cocci in the Alphaproteobacteria. The combined use of magnetic separation, flow cytometry, and MDA provides a new strategy to access individual genetic information from magnetotactic bacteria in environmental samples.
A statistical approach to selecting and confirming validation targets in -omics experiments
2012-01-01
Background Genomic technologies are, by their very nature, designed for hypothesis generation. In some cases, the hypotheses that are generated require that genome scientists confirm findings about specific genes or proteins. But one major advantage of high-throughput technology is that global genetic, genomic, transcriptomic, and proteomic behaviors can be observed. Manual confirmation of every statistically significant genomic result is prohibitively expensive. This has led researchers in genomics to adopt the strategy of confirming only a handful of the most statistically significant results, a small subset chosen for biological interest, or a small random subset. But there is no standard approach for selecting and quantitatively evaluating validation targets. Results Here we present a new statistical method and approach for statistically validating lists of significant results based on confirming only a small random sample. We apply our statistical method to show that the usual practice of confirming only the most statistically significant results does not statistically validate result lists. We analyze an extensively validated RNA-sequencing experiment to show that confirming a random subset can statistically validate entire lists of significant results. Finally, we analyze multiple publicly available microarray experiments to show that statistically validating random samples can both (i) provide evidence to confirm long gene lists and (ii) save thousands of dollars and hundreds of hours of labor over manual validation of each significant result. Conclusions For high-throughput -omics studies, statistical validation is a cost-effective and statistically valid approach to confirming lists of significant results. PMID:22738145
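The logic of confirming a random subset can be illustrated with a confidence bound on the true-positive proportion of a significant-result list: confirm a random sample of hits, then bound the fraction of the whole list that is real. The sketch below uses a simple one-sided normal-approximation bound; it is illustrative only and is not the statistical method developed in the paper.

```python
import math

def validated_fraction_lower_bound(confirmed, sampled, z=1.645):
    """Approximate one-sided lower confidence bound (normal approximation,
    ~95% by default) for the proportion of true positives in a list of
    significant results, after manually confirming `confirmed` of
    `sampled` randomly chosen hits. Assumes the sample is a small,
    uniformly random subset of the list.
    """
    p_hat = confirmed / sampled
    se = math.sqrt(p_hat * (1 - p_hat) / sampled)
    return max(0.0, p_hat - z * se)
```

For example, confirming 18 of 20 randomly sampled hits supports a claim that roughly 79% or more of the full list is real, regardless of the list's length, which is why a random subset can validate an entire list while the top-ranked hits alone cannot.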
Jack, John; Havener, Tammy M; McLeod, Howard L; Motsinger-Reif, Alison A; Foster, Matthew
2015-01-01
Aim: We investigate the role of ethnicity and admixture in drug response across a broad group of chemotherapeutic drugs. Also, we generate hypotheses on the genetic variants driving differential drug response through multivariate genome-wide association studies. Methods: Immortalized lymphoblastoid cell lines from 589 individuals (Hispanic or non-Hispanic/Caucasian) were used to investigate dose-response for 28 chemotherapeutic compounds. Univariate and multivariate statistical models were used to elucidate associations between genetic variants and differential drug response as well as the role of ethnicity in drug potency and efficacy. Results & Conclusion: For many drugs, the variability in drug response appears to correlate with self-reported race and estimates of genetic ancestry. Additionally, multivariate genome-wide association analyses offered interesting hypotheses governing these differential responses. PMID:26314407
Current Advances in Detection and Treatment of Babesiosis
Mosqueda, J; Olvera-Ramírez, A; Aguilar-Tipacamú, G; Cantó, GJ
2012-01-01
Babesiosis is a disease with a worldwide distribution affecting many species of mammals, principally cattle and man. Its major impact is on the cattle industry, where bovine babesiosis has had a huge economic effect due to losses in meat and beef production and to the death of infected animals. To those costs must now be added the high costs of tick control, disease detection, prevention and treatment. Almost a century and a quarter after the first report of the disease, the truth is that there is still no safe and efficient vaccine available, chemotherapeutic choices are limited, and low-cost, reliable and fast detection methods remain few. Detection and treatment are therefore important tools for controlling babesiosis. Microscopy detection methods are still the cheapest and fastest methods used to identify Babesia parasites, although their sensitivity and specificity are limited. Newer immunological methods are being developed that offer faster, more sensitive and more specific options than conventional methods, although direct immunological diagnosis of parasite antigens in host tissues is still missing. Detection methods based on nucleic acid identification and amplification are the most sensitive and reliable techniques available today; importantly, most of these methodologies were developed before the genomics and bioinformatics era, which leaves ample room for optimization. For years, babesiosis treatment has been based on very few drugs, such as imidocarb or diminazene aceturate. Recently, several pharmacological compounds were developed and evaluated, offering new options to control the disease. With the complete sequence of the Babesia bovis genome available and the B. bigemina genome project in progress, the post-genomic era sheds new light on the development of diagnostic methods and new chemotherapy targets.
In this review, we will present the current advances in detection and treatment of babesiosis in cattle and other animals, with additional reference to several apicomplexan parasites. PMID:22360483
USDA-ARS's Scientific Manuscript database
A research grant from the Azalea Society of America has enabled us to collect and begin evaluating diverse Rhododendron viscosum germplasm to identify genetic and phenotypic variation for pH adaptability. During the Spring of 2014, we developed novel, in vitro screening methods for Rhododendron to ...
USDA-ARS's Scientific Manuscript database
Blueberry bacterial leaf scorch (BBLS) disease, a threat to blueberry production in the Southern USA and potentially elsewhere, is caused by Xylella fastidiosa. Efficient control of BBLS requires knowledge of the pathogen. However, this is challenging because Xylella fastidiosa is difficult to cultu...
Strategies for Using Peer-Assisted Learning Effectively in an Undergraduate Bioinformatics Course
ERIC Educational Resources Information Center
Shapiro, Casey; Ayon, Carlos; Moberg-Parker, Jordan; Levis-Fitzgerald, Marc; Sanders, Erin R.
2013-01-01
This study used a mixed methods approach to evaluate hybrid peer-assisted learning approaches incorporated into a bioinformatics tutorial for a genome annotation research project. Quantitative and qualitative data were collected from undergraduates who enrolled in a research-based laboratory course during two different academic terms at UCLA.…
MetaPhinder—Identifying Bacteriophage Sequences in Metagenomic Data Sets
Jurtz, Vanessa Isabell; Villarroel, Julia; Lund, Ole; Voldby Larsen, Mette; Nielsen, Morten
2016-01-01
Bacteriophages are the most abundant biological entity on the planet, but at the same time do not account for much of the genetic material isolated from most environments due to their small genome sizes. They also show great genetic diversity and mosaic genomes, making them challenging to analyze and understand. Here we present MetaPhinder, a method to identify assembled genomic fragments (i.e., contigs) of phage origin in metagenomic data sets. The method is based on a comparison to a database of whole-genome bacteriophage sequences, integrating hits to multiple genomes to accommodate the mosaic genome structure of many bacteriophages. The method is demonstrated to outperform both BLAST methods based on single hits and methods based on k-mer comparisons. MetaPhinder is available as a web service at the Center for Genomic Epidemiology https://cge.cbs.dtu.dk/services/MetaPhinder/, while the source code can be downloaded from https://bitbucket.org/genomicepidemiology/metaphinder or https://github.com/vanessajurtz/MetaPhinder. PMID:27684958
Secure count query on encrypted genomic data.
Hasan, Mohammad Zahidul; Mahdi, Md Safiur Rahman; Sadat, Md Nazmus; Mohammed, Noman
2018-05-01
Human genomic information can yield more effective healthcare by guiding medical decisions. Genomics research is therefore gaining popularity, as it can identify potential correlations between a disease and a certain gene, which improves the safety and efficacy of drug treatment and can also support more effective prevention strategies [1]. To reduce sampling error and to increase the statistical accuracy of this type of research project, data from different sources need to be brought together, since a single organization does not necessarily possess the required amount of data. In this case, data sharing among multiple organizations must satisfy strict policies (for instance, HIPAA and PIPEDA) that have been enforced to regulate privacy-sensitive data sharing. Storage and computation on the shared data can be outsourced to a third-party cloud service provider equipped with enormous storage and computation resources. However, outsourcing data to a third party carries a potential risk of privacy violation for the participants whose genomic sequences or clinical profiles are used in these studies. In this article, we propose a method for secure sharing and computation on genomic data in a semi-honest cloud server. In particular, there are two main contributions. First, the proposed method can handle biomedical data containing both genotypes and phenotypes. Second, our proposed index-tree scheme significantly reduces the computational overhead of executing a secure count query.
In our proposed method, the confidentiality of shared data is ensured through encryption, while the entire computation process remains efficient and scalable for cutting-edge biomedical applications. We evaluated our proposed method in terms of efficiency on a database of single-nucleotide polymorphism (SNP) sequences; experimental results demonstrate that the execution time for a query of 50 SNPs in a database of 50,000 records, each containing 500 SNPs, is approximately 5 s. Executing the same query on a database that also includes phenotypes requires 69.7 s. Copyright © 2018 Elsevier Inc. All rights reserved.
Coordinates and intervals in graph-based reference genomes.
Rand, Knut D; Grytten, Ivar; Nederbragt, Alexander J; Storvik, Geir O; Glad, Ingrid K; Sandve, Geir K
2017-05-18
It has been proposed that future reference genomes should be graph structures in order to better represent the sequence diversity present in a species. However, there is currently no standard method to represent genomic intervals, such as the positions of genes or transcription factor binding sites, on graph-based reference genomes. We formalize offset-based coordinate systems on graph-based reference genomes and introduce methods for representing intervals on these reference structures. We show the advantage of our methods by representing genes on a graph-based representation of the newest assembly of the human genome (GRCh38) and its alternative loci for regions that are highly variable. More complex reference genomes, containing alternative loci, require methods to represent genomic data on these structures. Our proposed notation for genomic intervals makes it possible to fully utilize the alternative loci of the GRCh38 assembly and potential future graph-based reference genomes. We have made a Python package for representing such intervals on offset-based coordinate systems, available at https://github.com/uio-cels/offsetbasedgraph. An interactive web tool using this Python package to visualize genes on a graph created from GRCh38 is available at https://github.com/uio-cels/genomicgraphcoords.
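The core idea of an offset-based coordinate system can be sketched in a few lines: a position is a (block, offset) pair on a node of the sequence graph, and an interval records the path of blocks it traverses. The class and field names below are illustrative assumptions, not the offsetbasedgraph package's API.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Position:
    block: str   # node id in the sequence graph, e.g. a main or alt-locus block
    offset: int  # 0-based offset within that block's sequence

@dataclass
class Interval:
    start: Position
    end: Position
    path: list   # ordered block ids from start.block through end.block

    def length(self, block_len):
        """Interval length in bases, given a dict of block lengths:
        remainder of the first block, full interior blocks, and the
        prefix of the last block."""
        if len(self.path) == 1:
            return self.end.offset - self.start.offset
        interior = sum(block_len[b] for b in self.path[1:-1])
        return (block_len[self.path[0]] - self.start.offset
                + interior + self.end.offset)
```

Because the interval stores its path explicitly, the same notation works whether a gene lies on the primary assembly or crosses into an alternative locus.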
Jiang, Likun; You, Weiwei; Zhang, Xiaojun; Xu, Jian; Jiang, Yanliang; Wang, Kai; Zhao, Zixia; Chen, Baohua; Zhao, Yunfeng; Mahboob, Shahid; Al-Ghanim, Khalid A; Ke, Caihuan; Xu, Peng
2016-02-01
The small abalone (Haliotis diversicolor) is one of the most important aquaculture species in East Asia. To facilitate gene cloning and characterization, genome analysis, and genetic breeding of this species, we constructed a large-insert bacterial artificial chromosome (BAC) library, an important genetic tool for advanced genetics and genomics research. The small abalone BAC library includes 92,610 clones with an average insert size of 120 Kb, equivalent to approximately 7.6× coverage of the small abalone genome. We set up three-dimensional pools and super pools of 18,432 BAC clones for target gene screening by PCR. To assess the approach, we screened 12 target genes in these 18,432 BAC clones and identified 16 positive BAC clones. Eight positive BAC clones were then sequenced and assembled on a next-generation sequencing platform. The assembled contigs representing these 8 BAC clones spanned 928 Kb of the small abalone genome, providing the first batch of genome sequences for genome evaluation and characterization. The average GC content of the small abalone genome was estimated at 40.33%. A total of 21 protein-coding genes, including 7 target genes, were annotated in the 8 BACs, which proved the feasibility of the PCR screening approach with three-dimensional pools in the small abalone BAC library. One hundred fifty microsatellite loci were also identified from the sequences for future marker development. The BAC library and clone pools provide valuable resources and tools for the genetic breeding and conservation of H. diversicolor.
Attitudes regarding privacy of genomic information in personalized cancer therapy
Rogith, Deevakar; Yusuf, Rafeek A; Hovick, Shelley R; Peterson, Susan K; Burton-Chase, Allison M; Li, Yisheng; Meric-Bernstam, Funda; Bernstam, Elmer V
2014-01-01
Objective To evaluate attitudes regarding privacy of genomic data in a sample of patients with breast cancer. Methods Female patients with breast cancer (n=100) completed a questionnaire assessing attitudes regarding concerns about privacy of genomic data. Results Most patients (83%) indicated that genomic data should be protected. However, only 13% had significant concerns regarding privacy of such data. Patients expressed more concern about insurance discrimination than employment discrimination (43% vs 28%, p<0.001). They expressed less concern about research institutions protecting the security of their molecular data than government agencies or drug companies (20% vs 38% vs 44%; p<0.001). Most did not express concern regarding the association of their genomic data with their name and personal identity (49% concerned), billing and insurance information (44% concerned), or clinical data (27% concerned). Significantly fewer patients were concerned about the association with clinical data than other data types (p<0.001). In the absence of direct benefit, patients were more willing to consent to sharing of deidentified than identified data with researchers not involved in their care (76% vs 60%; p<0.001). Most (85%) patients were willing to consent to DNA banking. Discussion While patients are opposed to indiscriminate release of genomic data, privacy does not appear to be their primary concern. Furthermore, we did not find any specific predictors of privacy concerns. Conclusions Patients generally expressed low levels of concern regarding privacy of genomic data, and many expressed willingness to consent to sharing their genomic data with researchers. PMID:24737606
Recent genomic heritage in Scotland.
Amador, Carmen; Huffman, Jennifer; Trochet, Holly; Campbell, Archie; Porteous, David; Wilson, James F; Hastie, Nick; Vitart, Veronique; Hayward, Caroline; Navarro, Pau; Haley, Chris S
2015-06-06
The Generation Scotland Scottish Family Health Study (GS:SFHS) includes 23,960 participants from across Scotland with records for many health-related traits and environmental covariates. Genotypes at ~700 K SNPs are currently available for 10,000 participants. The cohort was designed as a resource for genetic and health-related research and the study of complex traits. In this study we developed a suite of analyses to disentangle the genomic differentiation within GS:SFHS individuals to describe and optimise the sample and methods for future analyses. We combined the genotypic information of GS:SFHS with 1092 individuals from the 1000 Genomes project and estimated their genomic relationships. Then, we performed Principal Component Analyses of the resulting relationships to investigate the genomic origin of different groups. We characterised two groups of individuals: those with a few sparse rare markers in the genome, and those with several large rare haplotypes which might represent relatively recent exogenous ancestors. We identified some individuals with likely Italian ancestry and a group with some potential African/Asian ancestry. An analysis of homozygosity in the GS:SFHS sample revealed a very similar pattern to other European populations. We also identified an individual carrying a chromosome 1 uniparental disomy. We found evidence of local geographic stratification within the population that affects its genomic structure. These findings illuminate the history of the Scottish population and have implications for further analyses such as the study of the contributions of common and rare variants to trait heritabilities and the evaluation of genomic and phenotypic prediction of disease.
Benchmarking Relatedness Inference Methods with Genome-Wide Data from Thousands of Relatives.
Ramstetter, Monica D; Dyer, Thomas D; Lehman, Donna M; Curran, Joanne E; Duggirala, Ravindranath; Blangero, John; Mezey, Jason G; Williams, Amy L
2017-09-01
Inferring relatedness from genomic data is an essential component of genetic association studies, population genetics, forensics, and genealogy. While numerous methods exist for inferring relatedness, thorough evaluation of these approaches in real data has been lacking. Here, we report an assessment of 12 state-of-the-art pairwise relatedness inference methods using a data set with 2485 individuals contained in several large pedigrees that span up to six generations. We find that all methods have high accuracy (92-99%) when detecting first- and second-degree relationships, but their accuracy dwindles to <43% for seventh-degree relationships. However, most identical by descent (IBD) segment-based methods inferred seventh-degree relatives correct to within one relatedness degree for >76% of relative pairs. Overall, the most accurate methods are Estimation of Recent Shared Ancestry (ERSA) and approaches that compute total IBD sharing using the output from GERMLINE and Refined IBD to infer relatedness. Combining information from the most accurate methods provides little accuracy improvement, indicating that novel approaches, such as new methods that leverage relatedness signals from multiple samples, are needed to achieve a sizeable jump in performance. Copyright © 2017 Ramstetter et al.
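The notion of a "relatedness degree" rests on a simple expectation: a degree-d pair shares an expected kinship coefficient of 2^-(d+1), so an estimated kinship can be binned by powers-of-two boundaries. The sketch below uses the standard midpoint cutoffs 2^-(d+1.5) (as popularized by tools such as KING); it is an illustrative utility, not the code of any method benchmarked in the study.

```python
def relatedness_degree(phi, max_degree=9):
    """Map an estimated kinship coefficient to a relatedness degree.

    Degree-d relative pairs have expected kinship 2**-(d+1): 0.5 for
    self/monozygotic twins (degree 0), 0.25 for parent-offspring or
    full siblings (degree 1), 0.125 for degree 2, and so on. The
    boundaries between degrees are placed at 2**-(d+1.5). Returns None
    for pairs below the most distant detectable degree.
    """
    for d in range(max_degree + 1):
        if phi > 2 ** -(d + 1.5):
            return d
    return None
```

The geometric shrinking of these bins is one way to see why accuracy "dwindles" at high degrees: by degree 7 the expected kinship (0.0039) is close to the noise floor of genome-wide estimates, so neighboring degrees become hard to separate.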
Daetwyler, Hans D; Calus, Mario P L; Pong-Wong, Ricardo; de Los Campos, Gustavo; Hickey, John M
2013-02-01
The genomic prediction of phenotypes and breeding values in animals and plants has developed rapidly into its own research field. Results of genomic prediction studies are often difficult to compare because data simulation varies, real or simulated data are not fully described, and not all relevant results are reported. In addition, some new methods have been compared only in limited genetic architectures, leading to potentially misleading conclusions. In this article we review simulation procedures, discuss validation and reporting of results, and apply benchmark procedures for a variety of genomic prediction methods in simulated and real example data. Plant and animal breeding programs are being transformed by the use of genomic data, which are becoming widely available and cost-effective to predict genetic merit. A large number of genomic prediction studies have been published using both simulated and real data. The relative novelty of this area of research has made the development of scientific conventions difficult with regard to description of the real data, simulation of genomes, validation and reporting of results, and forward in time methods. In this review article we discuss the generation of simulated genotype and phenotype data, using approaches such as the coalescent and forward in time simulation. We outline ways to validate simulated data and genomic prediction results, including cross-validation. The accuracy and bias of genomic prediction are highlighted as performance indicators that should be reported. We suggest that a measure of relatedness between the reference and validation individuals be reported, as its impact on the accuracy of genomic prediction is substantial. A large number of methods were compared in example simulated and real (pine and wheat) data sets, all of which are publicly available. 
In our limited simulations, most methods performed similarly in traits with a large number of quantitative trait loci (QTL), whereas in traits with fewer QTL variable selection did have some advantages. In the real data sets examined here all methods had very similar accuracies. We conclude that no single method can serve as a benchmark for genomic prediction. We recommend comparing accuracy and bias of new methods to results from genomic best linear prediction and a variable selection approach (e.g., BayesB), because, together, these methods are appropriate for a range of genetic architectures. An accompanying article in this issue provides a comprehensive review of genomic prediction methods and discusses a selection of topics related to application of genomic prediction in plants and animals.
Daetwyler, Hans D.; Calus, Mario P. L.; Pong-Wong, Ricardo; de los Campos, Gustavo; Hickey, John M.
2013-01-01
The genomic prediction of phenotypes and breeding values in animals and plants has developed rapidly into its own research field. Results of genomic prediction studies are often difficult to compare because data simulation varies, real or simulated data are not fully described, and not all relevant results are reported. In addition, some new methods have been compared only in limited genetic architectures, leading to potentially misleading conclusions. In this article we review simulation procedures, discuss validation and reporting of results, and apply benchmark procedures for a variety of genomic prediction methods in simulated and real example data. Plant and animal breeding programs are being transformed by the use of genomic data, which are becoming widely available and cost-effective to predict genetic merit. A large number of genomic prediction studies have been published using both simulated and real data. The relative novelty of this area of research has made the development of scientific conventions difficult with regard to description of the real data, simulation of genomes, validation and reporting of results, and forward in time methods. In this review article we discuss the generation of simulated genotype and phenotype data, using approaches such as the coalescent and forward in time simulation. We outline ways to validate simulated data and genomic prediction results, including cross-validation. The accuracy and bias of genomic prediction are highlighted as performance indicators that should be reported. We suggest that a measure of relatedness between the reference and validation individuals be reported, as its impact on the accuracy of genomic prediction is substantial. A large number of methods were compared in example simulated and real (pine and wheat) data sets, all of which are publicly available. 
In our limited simulations, most methods performed similarly in traits with a large number of quantitative trait loci (QTL), whereas in traits with fewer QTL variable selection did have some advantages. In the real data sets examined here, all methods had very similar accuracies. We conclude that no single method can serve as a benchmark for genomic prediction. We recommend comparing the accuracy and bias of new methods to results from genomic best linear unbiased prediction and a variable selection approach (e.g., BayesB), because, together, these methods are appropriate for a range of genetic architectures. An accompanying article in this issue provides a comprehensive review of genomic prediction methods and discusses a selection of topics related to application of genomic prediction in plants and animals. PMID:23222650
Eisenberg, David; Marcotte, Edward M.; Pellegrini, Matteo; Thompson, Michael J.; Yeates, Todd O.
2002-10-15
A computational method, system, and computer program are provided for inferring functional links from genome sequences. One method is based on the observation that some pairs of proteins A' and B' have homologs in another organism fused into a single protein chain AB. A trans-genome comparison of sequences can reveal these AB sequences, which are Rosetta Stone sequences because they decipher an interaction between A' and B'. Another method compares the genomic sequence of two or more organisms to create a phylogenetic profile for each protein indicating its presence or absence across all the genomes. The profile provides information regarding functional links between different families of proteins. In yet another method, a combination of the above two methods is used to predict functional links.
Approximation of reliability of direct genomic breeding values
USDA-ARS's Scientific Manuscript database
Two methods to efficiently approximate theoretical genomic reliabilities are presented. The first method is based on the direct inverse of the left hand side (LHS) of mixed model equations. It uses the genomic relationship matrix for a small subset of individuals with the highest genomic relationshi...
2014-01-01
Background Leptotrombidium pallidum and Leptotrombidium scutellare are the major vector mites for Orientia tsutsugamushi, the causative agent of scrub typhus. Before these organisms can be subjected to whole-genome sequencing, it is necessary to estimate their genome sizes to obtain basic information for establishing the strategies that should be used for genome sequencing and assembly. Method The genome sizes of L. pallidum and L. scutellare were estimated by a method based on quantitative real-time PCR. In addition, a k-mer analysis of the whole-genome sequences obtained through Illumina sequencing was conducted to verify the mutual compatibility and reliability of the results. Results The genome sizes estimated using qPCR were 191 ± 7 Mb for L. pallidum and 262 ± 13 Mb for L. scutellare. The k-mer analysis-based genome lengths were estimated to be 175 Mb for L. pallidum and 286 Mb for L. scutellare. The estimates from these two independent methods were mutually complementary and within a similar range to those of other Acariform mites. Conclusions The estimation method based on qPCR appears to be a useful alternative when the standard methods, such as flow cytometry, are impractical. The relatively small estimated genome sizes should facilitate whole-genome analysis, which could contribute to our understanding of Arachnida genome evolution and provide key information for scrub typhus prevention and mite vector competence. PMID:24947244
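The k-mer analysis mentioned in this abstract rests on a simple idea: count all k-mers in the reads, take the mode of the multiplicity histogram as the coverage peak, and divide the total k-mer count by that peak. The sketch below is a toy illustration of that idea with made-up reads, not the pipeline used in the study (real estimators model the full histogram and handle sequencing errors):

```python
from collections import Counter

def kmer_genome_size(reads, k):
    """Estimate genome size from sequencing reads via the k-mer spectrum.

    Counts every k-mer across the reads, takes the mode of the
    multiplicity histogram as the coverage peak, and divides the
    total k-mer count by that peak.
    """
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    total = sum(counts.values())
    # Histogram of k-mer multiplicities; its mode approximates coverage.
    multiplicity_hist = Counter(counts.values())
    peak = max(multiplicity_hist, key=multiplicity_hist.get)
    return total // peak

# Toy 20-bp "genome" sequenced at exactly 3x coverage.
genome = "ACGTACGGTTCAGATCCGAT"
print(kmer_genome_size([genome] * 3, k=5))  # 16, i.e. len(genome) - k + 1
```

With error-free, uniform coverage the estimate reduces to the number of distinct k-mers, which approximates genome length.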
Wang, Xiaolong; Li, Lin; Zhao, Jiaxin; Li, Fangliang; Guo, Wei; Chen, Xia
2017-04-01
To evaluate the effects of different preservation methods (stored in a -20°C ice chest, preserved in liquid nitrogen and dried in silica gel) on inter simple sequence repeat (ISSR) or random amplified polymorphic DNA (RAPD) analyses in various botanical specimens (including broad-leaved plants, needle-leaved plants and succulent plants) for different times (three weeks and three years), we used a statistical analysis based on the number of bands, genetic index and cluster analysis. The results demonstrate that all of these preservation methods can provide sufficient amounts of genomic DNA for ISSR and RAPD analyses; however, the effects of the different preservation methods on these analyses vary significantly, whereas preservation time has little effect. Our results provide a reference for researchers to select the most suitable preservation method, depending on their study subject, for the analysis of molecular markers based on genomic DNA. Copyright © 2017 Académie des sciences. Published by Elsevier Masson SAS. All rights reserved.
FHSA-SED: Two-Locus Model Detection for Genome-Wide Association Study with Harmony Search Algorithm
Tuo, Shouheng; Zhang, Junying; Yuan, Xiguo; Zhang, Yuanyuan; Liu, Zhaowen
2016-01-01
Motivation The two-locus model is a typical significant disease model to be identified in genome-wide association studies (GWAS). Due to the intensive computational burden and the diversity of disease models, existing methods suffer from low detection power, high computation cost, and a preference for some types of disease models. Method In this study, two scoring functions (the Bayesian network based K2-score and the Gini-score) are used to characterize a pair of SNP loci as a candidate model; the two criteria are adopted simultaneously to improve identification power and to tackle the preference problem for disease models. The harmony search algorithm (HSA) is improved to quickly find the most likely candidate models among all two-locus models; a local search algorithm with a two-dimensional tabu table is presented to avoid repeatedly evaluating disease models that have a strong marginal effect. Finally, the G-test statistic is used to further test the candidate models. Results We investigate our method, named FHSA-SED, on 82 simulated datasets and a real AMD dataset, and compare it with two typical methods (MACOED and CSE) that have recently been developed based on swarm intelligence search algorithms. The results of the simulation experiments indicate that our method outperforms the two compared algorithms in terms of detection power, computation time, evaluation times, sensitivity (TPR), specificity (SPC), positive predictive value (PPV) and accuracy (ACC). Our method identified two SNPs (rs3775652 and rs10511467) that may also be associated with disease in the AMD dataset. PMID:27014873
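The final filtering step named in this abstract, the G-test, reduces to a log-likelihood-ratio statistic over observed versus expected contingency-table counts. A minimal sketch with hypothetical genotype counts (not the study's data, and omitting the scoring and search stages):

```python
import math

def g_test(observed, expected):
    """G-test statistic: 2 * sum(O * ln(O / E)) over cells with O > 0."""
    return 2.0 * sum(o * math.log(o / e)
                     for o, e in zip(observed, expected) if o > 0)

# Hypothetical case counts over four joint genotype classes of a
# candidate two-locus model, against a flat expectation.
observed = [30, 10, 10, 50]
expected = [25, 25, 25, 25]
print(round(g_test(observed, expected), 2))  # 43.6
```

Large values of the statistic (compared against a χ² distribution with the appropriate degrees of freedom) indicate association.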
Covell, David G
2015-01-01
Developing reliable biomarkers of tumor cell drug sensitivity and resistance can guide hypothesis-driven basic science research and influence pre-therapy clinical decisions. A popular strategy for developing biomarkers uses characterizations of human tumor samples against a range of cancer drug responses that correlate with genomic change, developed largely from the efforts of the Cancer Cell Line Encyclopedia (CCLE) and the Sanger Cancer Genome Project (CGP). The purpose of this study is to provide an independent analysis of these data that aims to vet existing biomarker discoveries and applications and to add novel perspectives to them. Existing and alternative data mining and statistical methods will be used to a) evaluate drug responses of compounds with similar mechanism of action (MOA), b) examine measures of gene expression (GE), copy number (CN) and mutation status (MUT) biomarkers, combined with gene set enrichment analysis (GSEA), for hypothesizing biological processes important for drug response, c) conduct global comparisons of GE, CN and MUT as biomarkers across all drugs screened in the CGP dataset, and d) assess the positive predictive power of CGP-derived GE biomarkers as predictors of drug response in CCLE tumor cells. The perspectives derived from individual and global examinations of GEs, MUTs and CNs confirm existing and reveal unique and shared roles for these biomarkers in tumor cell drug sensitivity and resistance. Applications of CGP-derived genomic biomarkers to predict the drug response of CCLE tumor cells find a highly significant ROC, with a positive predictive power of 0.78. The results of this study expand the available data mining and analysis methods for genomic biomarker development and provide additional support for using biomarkers to guide hypothesis-driven basic science research and pre-therapy clinical decisions.
Tumor Touch Imprints as Source for Whole Genome Analysis of Neuroblastoma Tumors
Brunner, Clemens; Brunner-Herglotz, Bettina; Ziegler, Andrea; Frech, Christian; Amann, Gabriele; Ladenstein, Ruth; Ambros, Inge M.; Ambros, Peter F.
2016-01-01
Introduction Tumor touch imprints (TTIs) are routinely used for the molecular diagnosis of neuroblastomas by interphase fluorescence in-situ hybridization (I-FISH). However, in order to facilitate a comprehensive, up-to-date molecular diagnosis of neuroblastomas and to identify new markers to refine risk and therapy stratification methods, whole genome approaches are needed. We examined the applicability of an ultra-high density SNP array platform that identifies copy number changes of varying sizes down to a few exons for the detection of genomic changes in tumor DNA extracted from TTIs. Material and Methods DNAs were extracted from TTIs of 46 neuroblastoma and 4 other pediatric tumors. The DNAs were analyzed on the Cytoscan HD SNP array platform to evaluate numerical and structural genomic aberrations. The quality of the data obtained from TTIs was compared to that from randomly chosen fresh or fresh frozen solid tumors (n = 212) and I-FISH validation was performed. Results SNP array profiles were obtained from 48 (out of 50) TTI DNAs of which 47 showed genomic aberrations. The high marker density allowed for single gene analysis, e.g. loss of nine exons in the ATRX gene and the visualization of chromothripsis. Data quality was comparable to fresh or fresh frozen tumor SNP profiles. SNP array results were confirmed by I-FISH. Conclusion TTIs are an excellent source for SNP array processing with the advantage of simple handling, distribution and storage of tumor tissue on glass slides. The minimal amount of tumor tissue needed to analyze whole genomes makes TTIs an economic surrogate source in the molecular diagnostic work up of tumor samples. PMID:27560999
Genomic selection for fruit quality traits in apple (Malus×domestica Borkh.).
Kumar, Satish; Chagné, David; Bink, Marco C A M; Volz, Richard K; Whitworth, Claire; Carlisle, Charmaine
2012-01-01
The genome sequence of apple (Malus×domestica Borkh.) was published more than a year ago, which helped develop an 8K SNP chip to assist in implementing genomic selection (GS). In apple breeding programmes, GS can be used to obtain genomic breeding values (GEBV) for choosing next-generation parents or selections for further testing as potential commercial cultivars at a very early stage. Thus GS has the potential to accelerate breeding efficiency significantly because of a decreased generation interval or an increased selection intensity. We evaluated the accuracy of GS in a population of 1120 seedlings generated from a factorial mating design of four female and two male parents. All seedlings were genotyped using an Illumina Infinium chip comprising 8,000 single nucleotide polymorphisms (SNPs), and were phenotyped for various fruit quality traits. Random-regression best linear unbiased prediction (RR-BLUP) and the Bayesian LASSO method were used to obtain GEBV, and compared using a cross-validation approach for their accuracy in predicting unobserved BLUP-BV. Accuracies were very similar for both methods, varying from 0.70 to 0.90 for various fruit quality traits. The selection response per unit time using GS compared with traditional BLUP-based selection was very high (>100%), especially for low-heritability traits. Genome-wide average estimated linkage disequilibrium (LD) between adjacent SNPs was 0.32, with a relatively slow decay of LD in the long range (r(2) = 0.33 and 0.19 at 100 kb and 1,000 kb respectively), contributing to the higher accuracy of GS. The distribution of estimated SNP effects revealed the involvement of large-effect genes with likely pleiotropic effects. These results demonstrate that genomic selection is a credible alternative to conventional selection for fruit quality traits.
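The accuracy figures quoted here (0.70 to 0.90) are, as in most cross-validation studies of genomic selection, Pearson correlations between predicted GEBV and the observed values in a held-out fold. A minimal sketch of that computation on hypothetical numbers (the prediction model itself is not reproduced):

```python
from math import sqrt

def pearson(x, y):
    """Correlation between predictions and observations in a validation
    fold: the 'accuracy' reported in genomic-selection cross-validation."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(vx * vy)

# Hypothetical validation fold: predicted GEBV vs. observed BLUP-BV.
gebv     = [0.8, -0.2, 1.1, 0.4, -0.7]
observed = [1.0, -0.1, 0.9, 0.2, -0.5]
print(round(pearson(gebv, observed), 2))  # 0.96
```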
Abdeldaim, G; Herrmann, B; Korsgaard, J; Olcén, P; Blomberg, J; Strålin, K
2009-06-01
The pneumolysin (ply) gene is widely used as a target in PCR assays for Streptococcus pneumoniae in respiratory secretions. However, false-positive results with conventional ply-based PCR have been reported. The aim here was to study the performance of a quantitative ply-based PCR for the identification of pneumococcal lower respiratory tract infection (LRTI). In a prospective study, fibreoptic bronchoscopy was performed in 156 hospitalized adult patients with LRTI and 31 controls who underwent bronchoscopy because of suspicion of malignancy. Among the LRTI patients and controls, the quantitative ply-based PCR applied to bronchoalveolar lavage (BAL) fluid was positive at ≥10^3 genome copies/mL in 61% and 71% of the subjects, at ≥10^5 genome copies/mL in 40% and 58% of the subjects, and at ≥10^7 genome copies/mL in 15% and 3.2% of the subjects, respectively. Using BAL fluid culture, blood culture, and/or a urinary antigen test, S. pneumoniae was identified in 19 LRTI patients. As compared with these diagnostic methods used in combination, quantitative ply-based PCR showed sensitivities and specificities of 89% and 43% at a cut-off of 10^3 genome copies/mL, of 84% and 66% at a cut-off of 10^5 genome copies/mL, and of 53% and 90% at a cut-off of 10^7 genome copies/mL, respectively. In conclusion, a high cut-off with the quantitative ply-based PCR was required to reach acceptable specificity. However, as a high cut-off resulted in low sensitivity, quantitative ply-based PCR does not appear to be clinically useful. Quantitative PCR methods for S. pneumoniae using alternative gene targets should be evaluated.
Variable presence of the inverted repeat and plastome stability in Erodium
Blazier, John C.; Jansen, Robert K.; Mower, Jeffrey P.; Govindu, Madhu; Zhang, Jin; Weng, Mao-Lun; Ruhlman, Tracey A.
2016-01-01
Background and Aims Several unrelated lineages, such as plastids, viruses and plasmids, have converged on quadripartite genomes of similar size with large and small single copy regions and a large inverted repeat (IR). Except for Erodium (Geraniaceae), saguaro cactus and some legumes, the plastomes of all photosynthetic angiosperms display this structure. The functional significance of the IR is not understood, and Erodium provides a system to examine the role of the IR in the long-term stability of these genomes. We compared the degree of genomic rearrangement in plastomes of Erodium that differ in the presence and absence of the IR. Methods We sequenced 17 new Erodium plastomes. Using 454, Illumina, PacBio and Sanger sequences, 16 genomes were assembled and categorized along with one incomplete and two previously published Erodium plastomes. We conducted phylogenetic analyses among these species using a dataset of 19 protein-coding genes and determined whether significantly higher evolutionary rates had caused the long branch seen previously in phylogenetic reconstructions within the genus. Bioinformatic comparisons were also performed to evaluate plastome evolution across the genus. Key Results Erodium plastomes fell into four types (Types 1–4) that differ in their substitution rates, short dispersed repeat content, degree of genomic rearrangement, gene and intron content, and GC content. Type 4 plastomes had significantly higher rates of synonymous substitutions (dS) for all genes, and non-synonymous substitutions (dN) were significantly accelerated for 14 of the 19 genes. We evaluated the evidence for a single IR loss in Erodium and in doing so discovered that Type 4 plastomes contain a novel IR. Conclusions The presence or absence of the IR does not affect plastome stability in Erodium.
Rather, the overall repeat content shows a negative correlation with genome stability, a pattern in agreement with other angiosperm groups and recent findings on genome stability in bacterial endosymbionts. PMID:27192713
An efficient approach to BAC based assembly of complex genomes.
Visendi, Paul; Berkman, Paul J; Hayashi, Satomi; Golicz, Agnieszka A; Bayer, Philipp E; Ruperao, Pradeep; Hurgobin, Bhavna; Montenegro, Juan; Chan, Chon-Kit Kenneth; Staňková, Helena; Batley, Jacqueline; Šimková, Hana; Doležel, Jaroslav; Edwards, David
2016-01-01
There has been an exponential growth in the number of genome sequencing projects since the introduction of next generation DNA sequencing technologies. Genome projects have increasingly involved assembly of whole genome data, which produces inferior assemblies compared to traditional Sanger sequencing of genomic fragments cloned into bacterial artificial chromosomes (BACs). While whole genome shotgun sequencing using next generation sequencing (NGS) is relatively fast and inexpensive, this method is extremely challenging for highly complex genomes, where polyploidy or high repeat content confounds accurate assembly, or where a highly accurate 'gold' reference is required. Several attempts have been made to improve genome sequencing approaches by incorporating NGS methods, with variable success. We present the application of a novel BAC sequencing approach which combines indexed pools of BACs, Illumina paired read sequencing, a sequence assembler specifically designed for complex BAC assembly, and a custom bioinformatics pipeline. We demonstrate this method by sequencing and assembling BAC cloned fragments from the bread wheat and sugarcane genomes. We demonstrate that our assembly approach is accurate, robust, cost effective and scalable, with applications for complete genome sequencing in large and complex genomes.
Determining protein function and interaction from genome analysis
Eisenberg, David; Marcotte, Edward M.; Thompson, Michael J.; Pellegrini, Matteo; Yeates, Todd O.
2004-08-03
A computational method, system, and computer program are provided for inferring functional links from genome sequences. One method is based on the observation that some pairs of proteins A' and B' have homologs in another organism fused into a single protein chain AB. A trans-genome comparison of sequences can reveal these AB sequences, which are Rosetta Stone sequences because they decipher an interaction between A' and B'. Another method compares the genomic sequence of two or more organisms to create a phylogenetic profile for each protein indicating its presence or absence across all the genomes. The profile provides information regarding functional links between different families of proteins. In yet another method, a combination of the above two methods is used to predict functional links.
Assigning protein functions by comparative genome analysis protein phylogenetic profiles
Pellegrini, Matteo; Marcotte, Edward M.; Thompson, Michael J.; Eisenberg, David; Grothe, Robert; Yeates, Todd O.
2003-05-13
A computational method, system, and computer program are provided for inferring functional links from genome sequences. One method is based on the observation that some pairs of proteins A' and B' have homologs in another organism fused into a single protein chain AB. A trans-genome comparison of sequences can reveal these AB sequences, which are Rosetta Stone sequences because they decipher an interaction between A' and B'. Another method compares the genomic sequence of two or more organisms to create a phylogenetic profile for each protein indicating its presence or absence across all the genomes. The profile provides information regarding functional links between different families of proteins. In yet another method, a combination of the above two methods is used to predict functional links.
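The phylogenetic-profile method described in these patent abstracts amounts to building a presence/absence vector per protein across genomes and flagging proteins with matching vectors as candidate functional partners. A toy sketch with hypothetical homolog tables (real implementations use sequence-similarity searches to decide presence):

```python
def phylogenetic_profile(protein, genomes):
    """Presence/absence vector of a protein's homologs across genomes."""
    return tuple(int(protein in proteins) for proteins in genomes.values())

def linked_pairs(proteins, genomes):
    """Protein pairs with identical profiles -> candidate functional links."""
    profiles = {p: phylogenetic_profile(p, genomes) for p in proteins}
    return [(a, b) for i, a in enumerate(proteins)
            for b in proteins[i + 1:]
            if profiles[a] == profiles[b]]

# Hypothetical homolog tables for four genomes: which proteins have a
# detectable homolog in each genome.
genomes = {
    "genome1": {"A", "B", "C"},
    "genome2": {"A", "B"},
    "genome3": {"C"},
    "genome4": {"A", "B", "C"},
}
print(linked_pairs(["A", "B", "C"], genomes))  # [('A', 'B')]
```

A and B co-occur in exactly the same genomes, so they are inferred to be functionally linked; C follows a different pattern.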
Systematic evaluation of bias in microbial community profiles induced by whole genome amplification.
Direito, Susana O L; Zaura, Egija; Little, Miranda; Ehrenfreund, Pascale; Röling, Wilfred F M
2014-03-01
Whole genome amplification methods facilitate the detection and characterization of microbial communities in low biomass environments. We examined the extent to which the actual community structure is reliably revealed and the factors contributing to bias. One widely used method [multiple displacement amplification (MDA)] and one new primer-free method [primase-based whole genome amplification (pWGA)] were compared, using a polymerase chain reaction (PCR)-based method as a control. Pyrosequencing of an environmental sample and principal component analysis revealed that MDA impacted community profiles more strongly than pWGA and indicated that this related to species GC content, although an influence of DNA integrity could not be excluded. Subsequently, biases due to species GC content, DNA integrity and fragment size were separately analysed using defined mixtures of DNA from various species. We found significantly less amplification of species with the highest GC content for MDA-based templates and, to a lesser extent, for pWGA. DNA fragmentation also interfered severely: species with more fragmented DNA were less amplified with MDA and pWGA. pWGA was unable to amplify low molecular weight DNA (< 1.5 kb), whereas MDA amplified it only inefficiently. We conclude that pWGA is the most promising method for characterization of microbial communities in low-biomass environments and for currently planned astrobiological missions to Mars. © 2013 Society for Applied Microbiology and John Wiley & Sons Ltd.
MIPS plant genome information resources.
Spannagl, Manuel; Haberer, Georg; Ernst, Rebecca; Schoof, Heiko; Mayer, Klaus F X
2007-01-01
The Munich Institute for Protein Sequences (MIPS) has been involved in maintaining plant genome databases since the Arabidopsis thaliana genome project. Genome databases and analysis resources have focused on individual genomes and aim to provide flexible and maintainable data sets for model plant genomes as a backbone against which experimental data, for example from high-throughput functional genomics, can be organized and evaluated. In addition, model genomes also form a scaffold for comparative genomics, and much can be learned from genome-wide evolutionary studies.
2014-01-01
Background Although the X chromosome is the second largest bovine chromosome, markers on the X chromosome are not used for genomic prediction in some countries and populations. In this study, we presented a method for computing genomic relationships using X chromosome markers, investigated the accuracy of imputation from a low density (7K) to the 54K SNP (single nucleotide polymorphism) panel, and compared the accuracy of genomic prediction with and without using X chromosome markers. Methods The impact of considering X chromosome markers on prediction accuracy was assessed using data from Nordic Holstein bulls and different sets of SNPs: (a) the 54K SNPs for reference and test animals, (b) SNPs imputed from the 7K to the 54K SNP panel for test animals, (c) SNPs imputed from the 7K to the 54K panel for half of the reference animals, and (d) the 7K SNP panel for all animals. Beagle and Findhap were used for imputation. GBLUP (genomic best linear unbiased prediction) models with or without X chromosome markers and with or without a residual polygenic effect were used to predict genomic breeding values for 15 traits. Results Averaged over the two imputation datasets, correlation coefficients between imputed and true genotypes for autosomal markers, pseudo-autosomal markers, and X-specific markers were 0.971, 0.831 and 0.935 when using Findhap, and 0.983, 0.856 and 0.937 when using Beagle. Estimated reliabilities of genomic predictions based on the imputed datasets using Findhap or Beagle were very close to those using the real 54K data. Genomic prediction using all markers gave slightly higher reliabilities than predictions without X chromosome markers. Based on our data which included only bulls, using a G matrix that accounted for sex-linked relationships did not improve prediction, compared with a G matrix that did not account for sex-linked relationships. 
A model that included a polygenic effect did not recover the loss of prediction accuracy from exclusion of X chromosome markers. Conclusions The results from this study suggest that markers on the X chromosome contribute to accuracy of genomic predictions and should be used for routine genomic evaluation. PMID:25080199
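The genomic relationship matrix (G) discussed in this abstract is commonly built with VanRaden's first method: centre the 0/1/2 genotype matrix by twice the allele frequencies and normalise by 2Σp(1-p). A pure-Python sketch on hypothetical animals and markers (the sex-linked weighting for X chromosome markers examined in the study is omitted):

```python
def vanraden_g(genotypes):
    """VanRaden's first genomic relationship matrix (pure-Python sketch).

    genotypes: list of per-animal marker rows, each entry a 0/1/2
    count of one allele. G = Z Z' / (2 * sum_j p_j (1 - p_j)), where
    Z is the genotype matrix with each column centred by twice its
    allele frequency p_j.
    """
    n, m = len(genotypes), len(genotypes[0])
    p = [sum(row[j] for row in genotypes) / (2.0 * n) for j in range(m)]
    Z = [[row[j] - 2.0 * p[j] for j in range(m)] for row in genotypes]
    denom = 2.0 * sum(pj * (1.0 - pj) for pj in p)
    return [[sum(Z[i][k] * Z[j][k] for k in range(m)) / denom
             for j in range(n)] for i in range(n)]

# Toy data: 3 animals, 4 markers (hypothetical genotypes).
M = [[0, 1, 2, 1],
     [1, 1, 0, 2],
     [2, 0, 1, 1]]
G = vanraden_g(M)
print([round(x, 3) for x in G[0]])  # symmetric; diagonals ~ inbreeding + 1
```

In GBLUP this G replaces the pedigree-based relationship matrix in the mixed model equations.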
A dictionary-based informational genome analysis
2012-01-01
Background In the post-genomic era, several methods of computational genomics are emerging to understand how the whole information is structured within genomes. The literature of the last five years includes several alignment-free methods, which have arisen as alternative metrics for the dissimilarity of biological sequences. Among others, recent approaches are based on empirical frequencies of DNA k-mers in whole genomes. Results Any set of words (factors) occurring in a genome provides a genomic dictionary. About sixty genomes were analyzed by means of informational indexes based on genomic dictionaries, where a systemic view replaces a local sequence analysis. A software prototype applying the methodology outlined here carried out computations on genomic data. We computed informational indexes and built genomic dictionaries of different sizes, along with frequency distributions. The software performed three main tasks: computation of informational indexes, storage of these in a database, and index analysis and visualization. The validation was done by investigating genomes of various organisms. A systematic analysis of genomic repeats of several lengths, which is of keen interest in biology (for example, to identify over-represented functional sequences, such as promoters), was discussed, and a method to define synthetic genetic networks was suggested. Conclusions We introduced a methodology based on dictionaries, and an efficient motif-finding software application for comparative genomics. This approach could be extended along many lines of investigation, namely exported to other contexts of computational genomics, as a basis for discriminating genomic pathologies. PMID:22985068
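A genomic dictionary, as used here, is simply the set of length-k factors of a sequence together with their empirical frequencies; one natural informational index over such a dictionary is the Shannon entropy of the frequency distribution. A sketch on a toy sequence (the entropy index is illustrative and not necessarily the exact index used in the study):

```python
from collections import Counter
from math import log2

def genomic_dictionary(genome, k):
    """All k-mers (factors of length k) occurring in a genome,
    mapped to their empirical frequencies."""
    total = len(genome) - k + 1
    counts = Counter(genome[i:i + k] for i in range(total))
    return {word: c / total for word, c in counts.items()}

def empirical_entropy(freqs):
    """Shannon entropy (bits) of a k-mer frequency distribution:
    one simple informational index over a genomic dictionary."""
    return -sum(f * log2(f) for f in freqs.values())

d = genomic_dictionary("ACGTACGTAC", k=2)
print(sorted(d))                        # the k=2 dictionary
print(round(empirical_entropy(d), 3))   # its entropy in bits
```

Comparing such indexes across genomes supports the alignment-free, systemic comparisons the abstract describes.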
Efficient identification of context dependent subgroups of risk from genome wide association studies
Dyson, Greg; Sing, Charles F.
2014-01-01
We have developed a modified Patient Rule-Induction Method (PRIM) as an alternative strategy for analyzing representative samples of non-experimental human data to estimate and test the role of genomic variations as predictors of disease risk in etiologically heterogeneous sub-samples. A computational limit of the proposed strategy is encountered when the number of genomic variations (predictor variables) under study is large (>500), because permutations are used to generate a null distribution to test the significance of a term (defined by values of particular variables) that characterizes a sub-sample of individuals through the peeling and pasting processes. As an alternative, in this paper we introduce a theoretical strategy that facilitates the quick calculation of Type I and Type II errors in the evaluation of terms in the peeling and pasting processes of a PRIM analysis; these errors are underestimated and non-existent, respectively, when a permutation-based hypothesis test is employed. The resultant savings in computational time make it possible to consider larger numbers of genomic variations (an example genome-wide association study is given) in the selection of statistically significant terms when formulating PRIM prediction models. PMID:24570412
Rius, Nuria; Guillén, Yolanda; Delprat, Alejandra; Kapusta, Aurélie; Feschotte, Cédric; Ruiz, Alfredo
2016-05-10
Many new Drosophila genomes have been sequenced in recent years using new-generation sequencing platforms and assembly methods. Transposable elements (TEs), being repetitive sequences, are often misassembled, especially in genomes sequenced with short reads. Consequently, the mobile fraction of many of the new genomes has not been analyzed in detail or compared with that of other genomes sequenced with different methods, which could shed light on genome and TE evolution. Here we compare the TE content of three genomes: D. buzzatii st-1, j-19, and D. mojavensis. We have sequenced a new D. buzzatii genome (j-19) that complements the D. buzzatii reference genome (st-1) already published, and compared their TE contents with that of D. mojavensis. We found an underestimation of TE sequences in NGS genomes of the Drosophila genus when compared to Sanger genomes. To be able to compare genomes sequenced with different technologies, we developed a coverage-based method and applied it to the D. buzzatii st-1 and j-19 genomes. Between 10.85 and 11.16 % of the D. buzzatii st-1 genome is made up of TEs, and between 7 and 7.5 % of the D. buzzatii j-19 genome, while TEs represent 15.35 % of the D. mojavensis genome. Helitrons are the most abundant order in the three genomes. TEs in D. buzzatii are less abundant than in D. mojavensis, as expected from the positive correlation between genome size and TE content. However, TEs alone do not explain the genome size difference. TEs accumulate in the dot chromosomes and proximal regions of D. buzzatii and D. mojavensis chromosomes. We also report a significantly higher TE density in the D. buzzatii and D. mojavensis X chromosomes, which is not expected under current models. Our easy-to-use correction method allowed us to identify recently active families in D. buzzatii st-1 belonging to the LTR-retrotransposon superfamily Gypsy.
Gallego, Carlos J; Bennette, Caroline S; Heagerty, Patrick; Comstock, Bryan; Horike-Pyne, Martha; Hisama, Fuki; Amendola, Laura M; Bennett, Robin L; Dorschner, Michael O; Tarczy-Hornoch, Peter; Grady, William M; Fullerton, S Malia; Trinidad, Susan B; Regier, Dean A; Nickerson, Deborah A; Burke, Wylie; Patrick, Donald L; Jarvik, Gail P; Veenstra, David L
2014-09-01
Whole exome and whole genome sequencing are applications of next generation sequencing that are transforming clinical care, but there is little evidence on whether these tests improve patient outcomes or are cost-effective compared to the current standard of care. These gaps in knowledge can be addressed by comparative effectiveness and patient-centered outcomes research. We designed a randomized controlled trial that incorporates these research methods to evaluate whole exome sequencing compared to usual care in patients being evaluated for hereditary colorectal cancer and polyposis syndromes. Approximately 220 patients will be randomized and followed for 12 months after return of genomic findings. Patients will receive findings associated with colorectal cancer in a first return of results visit, and findings not associated with colorectal cancer (incidental findings) during a second return of results visit. The primary outcome is the efficacy of detecting mutations associated with these syndromes; secondary outcomes include psychosocial impact, cost-effectiveness and comparative costs. The secondary outcomes will be obtained via surveys before and after each return visit. The expected challenges in conducting this randomized controlled trial include the relatively low prevalence of genetic disease, the difficulty of interpreting some genetic variants, and uncertainty about which incidental findings should be returned to patients. The approaches utilized in this study may help guide other investigators in clinical genomics to identify useful outcome measures and strategies to address comparative effectiveness questions about the clinical implementation of genomic sequencing in clinical care. Copyright © 2014 Elsevier Inc. All rights reserved.
Private genome analysis through homomorphic encryption
2015-01-01
Background The rapid development of genome sequencing technology allows researchers to access large genome datasets. However, outsourcing the data processing to the cloud poses high risks for personal privacy. The aim of this paper is to give a practical solution for this problem using homomorphic encryption. In our approach, all the computations can be performed in an untrusted cloud without requiring the decryption key or any interaction with the data owner, which preserves the privacy of genome data. Methods We present evaluation algorithms for secure computation of the minor allele frequencies and χ2 statistic in a genome-wide association study setting. We also describe how to privately compute the Hamming distance and approximate Edit distance between encrypted DNA sequences. Finally, we compare performance details of using two practical homomorphic encryption schemes - the BGV scheme by Gentry, Halevi and Smart and the YASHE scheme by Bos, Lauter, Loftus and Naehrig. Results The approach with the YASHE scheme analyzes data from 400 people within about 2 seconds and picks a variant associated with disease from 311 spots. For another task, using the BGV scheme, it took about 65 seconds to securely compute the approximate Edit distance for DNA sequences of size 5K and figure out the differences between them. Conclusions The performance numbers for BGV are better than YASHE when homomorphically evaluating deep circuits (like the Hamming distance algorithm or approximate Edit distance algorithm). On the other hand, it is more efficient to use the YASHE scheme for a low-degree computation, such as minor allele frequencies or the χ2 test statistic in a case-control study. PMID:26733152
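The arithmetic that the paper evaluates under encryption can be shown in the clear. This plaintext sketch computes the two low-degree statistics named in the abstract; the function names and the allelic 2x2 table layout are illustrative assumptions, and the encrypted versions would compute the same arithmetic on ciphertexts.

```python
# Plaintext versions of the two GWAS statistics: minor allele frequency and
# the Pearson chi-squared statistic for a case-control allele-count table.

def minor_allele_frequency(genotypes):
    """genotypes: list of 0/1/2 minor-allele counts, one per person."""
    freq = sum(genotypes) / (2 * len(genotypes))
    return min(freq, 1 - freq)

def chi2_allelic(case_minor, case_major, ctrl_minor, ctrl_major):
    """Pearson chi-squared for a 2x2 allele-count contingency table."""
    n = case_minor + case_major + ctrl_minor + ctrl_major
    observed = [case_minor, case_major, ctrl_minor, ctrl_major]
    row = [case_minor + case_major, ctrl_minor + ctrl_major]
    col = [case_minor + ctrl_minor, case_major + ctrl_major]
    expected = [row[i // 2] * col[i % 2] / n for i in range(4)]
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))
```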
Draft genome of the lined seahorse, Hippocampus erectus.
Lin, Qiang; Qiu, Ying; Gu, Ruobo; Xu, Meng; Li, Jia; Bian, Chao; Zhang, Huixian; Qin, Geng; Zhang, Yanhong; Luo, Wei; Chen, Jieming; You, Xinxin; Fan, Mingjun; Sun, Min; Xu, Pao; Venkatesh, Byrappa; Xu, Junming; Fu, Hongtuo; Shi, Qiong
2017-06-01
The lined seahorse, Hippocampus erectus, is an Atlantic species that mainly inhabits shallow sea beds or coral reefs. It has become very popular in China for its wide use in traditional Chinese medicine. In order to improve the aquaculture yield of this valuable fish species, we are developing genomic resources for marker-assisted selection in genetic breeding. Here, we provide whole genome sequencing, assembly, and gene annotation of the lined seahorse, which enriches the available genomic resources and supports its molecular breeding. A total of 174.6 Gb (Gigabase) of raw DNA sequence was generated on the Illumina HiSeq 2500 platform. The final assembly of the lined seahorse genome is around 458 Mb, representing 94% of the estimated genome size (489 Mb by k-mer analysis). The contig N50 and scaffold N50 reached 14.57 kb and 1.97 Mb, respectively. Quality of the assembled genome was assessed by BUSCO, with 85% of the known vertebrate genes predicted, and evaluated using de novo assembled RNA-seq transcripts, which showed a high mapping ratio (more than 99% of transcripts could be mapped to the assembly). Using homology-based, de novo and transcriptome-based prediction methods, we predicted 20 788 protein-coding genes in the generated assembly, which is fewer than our previously reported gene number (23 458) for the tiger tail seahorse (H. comes). We report a draft genome of the lined seahorse. These genomic data enrich the genome resources of this economically important fish and provide insights into the genetic mechanisms of its iconic morphology and male pregnancy behavior. © The Authors 2017. Published by Oxford University Press.
In silico prediction of splice-altering single nucleotide variants in the human genome.
Jian, Xueqiu; Boerwinkle, Eric; Liu, Xiaoming
2014-12-16
In silico tools have been developed to predict variants that may have an impact on pre-mRNA splicing. The major limitation of the application of these tools to basic research and clinical practice is the difficulty in interpreting the output. Most tools only predict potential splice sites given a DNA sequence without measuring splicing signal changes caused by a variant. Another limitation is the lack of large-scale evaluation studies of these tools. We compared eight in silico tools on 2959 single nucleotide variants within splicing consensus regions (scSNVs) using receiver operating characteristic analysis. The Position Weight Matrix model and MaxEntScan outperformed other methods. Two ensemble learning methods, adaptive boosting and random forests, were used to construct models that take advantage of individual methods. Both models further improved prediction, with outputs of directly interpretable prediction scores. We applied our ensemble scores to scSNVs from the Catalogue of Somatic Mutations in Cancer database. Analysis showed that predicted splice-altering scSNVs are enriched in recurrent scSNVs and known cancer genes. We pre-computed our ensemble scores for all potential scSNVs across the human genome, providing a whole genome level resource for identifying splice-altering scSNVs discovered from large-scale sequencing studies.
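The two ensemble learners named above (adaptive boosting and random forests) are standard classifiers; a minimal random-forest sketch of the idea follows, treating per-tool splice scores as features and outputting a single interpretable probability. The tool names in the comments and all scores are synthetic stand-ins, not the study's data.

```python
# Sketch of the ensemble idea: per-tool splice scores become features of a
# random forest whose predicted probability is the combined score.
# Labels and scores are simulated; two "tools" carry signal, one does not.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n = 500
y = rng.integers(0, 2, n)  # 1 = splice-altering, 0 = neutral (synthetic)
X = np.column_stack([
    y + rng.normal(0, 0.8, n),   # informative score (e.g. a PWM-like tool)
    y + rng.normal(0, 1.0, n),   # informative score (e.g. MaxEntScan-like)
    rng.normal(0, 1.0, n),       # an uninformative tool
])

clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
ensemble_scores = clf.predict_proba(X)[:, 1]  # directly interpretable [0, 1]
```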
Hug, Laura A.; Thomas, Brian C.; Sharon, Itai; ...
2015-07-22
Nitrogen, sulfur and carbon fluxes in the terrestrial subsurface are determined by the intersecting activities of microbial community members, yet the organisms responsible are largely unknown. Metagenomic methods can identify organisms and functions, but genome recovery is often precluded by data complexity. To address this limitation, we developed subsampling assembly methods to reconstruct high-quality draft genomes from complex samples. Here, we applied these methods to evaluate the interlinked roles of the most abundant organisms in biogeochemical cycling in the aquifer sediment. Community proteomics confirmed these activities. The eight most abundant organisms belong to novel lineages, and two represent phyla with no previously sequenced genome. Four organisms are predicted to fix carbon via the Calvin-Benson-Bassham, Wood-Ljungdahl or 3-hydroxypropionate/4-hydroxybutyrate pathways. The profiled organisms are involved in the network of denitrification, dissimilatory nitrate reduction to ammonia, ammonia oxidation and sulfate reduction/oxidation, and require substrates supplied by other community members. An ammonium-oxidizing Thaumarchaeote is the most abundant community member, despite low ammonium concentrations in the groundwater. Finally, this organism likely benefits from two other relatively abundant organisms capable of producing ammonium from nitrate, which is abundant in the groundwater. Overall, dominant members of the microbial community are interconnected through exchange of geochemical resources.
Olova, Nelly; Krueger, Felix; Andrews, Simon; Oxley, David; Berrens, Rebecca V; Branco, Miguel R; Reik, Wolf
2018-03-15
Whole-genome bisulfite sequencing (WGBS) is becoming an increasingly accessible technique, used widely for both fundamental and disease-oriented research. Library preparation methods benefit from a variety of available kits, polymerases and bisulfite conversion protocols. Although some steps in the procedure, such as PCR amplification, are known to introduce biases, a systematic evaluation of biases in WGBS strategies is missing. We perform a comparative analysis of several commonly used pre- and post-bisulfite WGBS library preparation protocols for their performance and quality of sequencing outputs. Our results show that bisulfite conversion per se is the main trigger of pronounced sequencing biases, and PCR amplification builds on these underlying artefacts. The majority of standard library preparation methods yield a significantly biased sequence output and overestimate global methylation. Importantly, both absolute and relative methylation levels at specific genomic regions vary substantially between methods, with clear implications for DNA methylation studies. We show that amplification-free library preparation is the least biased approach for WGBS. In protocols with amplification, the choice of bisulfite conversion protocol or polymerase can significantly minimize artefacts. To aid with the quality assessment of existing WGBS datasets, we have integrated a bias diagnostic tool in the Bismark package and offer several approaches for consideration during the preparation and analysis of WGBS datasets.
Shi, Cheng-Min; Yang, Ziheng
2018-01-01
Abstract The phylogenetic relationships among extant gibbon species remain unresolved despite numerous efforts using morphological, behavioral, and genetic data and the sequencing of whole genomes. A major challenge in reconstructing the gibbon phylogeny is the radiative speciation process, which resulted in extremely short internal branches in the species phylogeny and extensive incomplete lineage sorting, with substantial gene-tree heterogeneity across the genome. Here, we analyze two genomic-scale data sets, with ∼10,000 putative noncoding and exonic loci, respectively, to estimate the species tree for the major groups of gibbons. We used the Bayesian full-likelihood method bpp under the multispecies coalescent model, which naturally accommodates incomplete lineage sorting and uncertainties in the gene trees. For comparison, we included three heuristic coalescent-based methods (mp-est, SVDQuartets, and astral) as well as concatenation. From both data sets, we infer the phylogeny for the four extant gibbon genera to be (Hylobates, (Nomascus, (Hoolock, Symphalangus))). We used simulation guided by the real data to evaluate the accuracy of the methods used. Astral, while not as efficient as bpp, performed well in estimating the species tree even in the presence of excessive incomplete lineage sorting. Concatenation, mp-est and SVDQuartets were unreliable when the species tree contains very short internal branches. A likelihood ratio test of gene flow suggests a small amount of migration from Hylobates moloch to H. pileatus, while cross-genera migration is absent or rare. Our results highlight the utility of coalescent-based methods in addressing challenging species tree problems characterized by short internal branches and rampant gene tree-species tree discordance. PMID:29087487
Evaluation of redundancy analysis to identify signatures of local adaptation.
Capblancq, Thibaut; Luu, Keurcien; Blum, Michael G B; Bazin, Eric
2018-05-26
Ordination is a common tool in ecology that aims at representing complex biological information in a reduced space. In landscape genetics, ordination methods such as principal component analysis (PCA) have been used to detect adaptive variation based on genomic data. Taking advantage of environmental data in addition to genotype data, redundancy analysis (RDA) is another ordination approach that is useful to detect adaptive variation. This paper proposes a test statistic based on RDA to search for loci under selection. We compare redundancy analysis to pcadapt, which is a nonconstrained ordination method, and to a latent factor mixed model (LFMM), which is a univariate genotype-environment association method. Individual-based simulations identify evolutionary scenarios where RDA genome scans have greater statistical power than genome scans based on PCA. By constraining the analysis with environmental variables, RDA performs better than PCA in identifying adaptive variation when selection gradients are weakly correlated with population structure. Additionally, we show that, although RDA and LFMM have similar power to identify genetic markers associated with environmental variables, the RDA-based procedure has the advantage of identifying the main selective gradients as a combination of environmental variables. To give a concrete illustration of RDA in population genomics, we apply this method to the detection of outliers and selective gradients on an SNP data set of Populus trichocarpa (Geraldes et al., 2013). The RDA-based approach identifies the main selective gradient contrasting southern and coastal populations with northern and continental populations on the northwestern American coast. This article is protected by copyright. All rights reserved.
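RDA has no dedicated package in the Python ecosystem comparable to R's vegan, but its core can be sketched from first principles: regress the centered genotype matrix on the environmental variables, then take an SVD of the fitted values; loci with extreme loadings on the constrained axes are outlier candidates. Everything below is simulated, and locus 0 is constructed to track the environment.

```python
# Minimal RDA-style genome scan: multivariate regression of genotypes on
# environment, SVD of the fitted (constrained) part, locus loadings flag
# candidates. Simulated data; not the Populus trichocarpa analysis.
import numpy as np

rng = np.random.default_rng(3)
n_ind, n_loci = 100, 500
env = rng.normal(size=(n_ind, 2))              # two environmental variables
env -= env.mean(axis=0)
G = rng.binomial(2, 0.5, size=(n_ind, n_loci)).astype(float)
G[:, 0] += 2.0 * env[:, 0]                     # environmentally selected locus
G -= G.mean(axis=0)

B, *_ = np.linalg.lstsq(env, G, rcond=None)    # multivariate regression
fitted = env @ B                               # constrained part of G
U, s, Vt = np.linalg.svd(fitted, full_matrices=False)
loadings = Vt[0]                               # locus loadings on RDA axis 1
outlier = int(np.argmax(np.abs(loadings)))     # recovers locus 0
```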
A multi-strategy approach to informative gene identification from gene expression data.
Liu, Ziying; Phan, Sieu; Famili, Fazel; Pan, Youlian; Lenferink, Anne E G; Cantin, Christiane; Collins, Catherine; O'Connor-McCourt, Maureen D
2010-02-01
An unsupervised multi-strategy approach has been developed to identify informative genes from high-throughput genomic data. Several statistical methods have been used in the field to identify differentially expressed genes. Since different methods generate different lists of genes, it is very challenging to determine the most reliable gene list and the appropriate method. This paper presents a multi-strategy method in which a combination of several data analysis techniques is applied to a given dataset and a confidence measure is established to select genes from the gene lists generated by these techniques to form the core of our final selection. The remainder of the genes, which form the peripheral region, are subject to exclusion from or inclusion in the final selection. This paper demonstrates this methodology through its application to an in-house cancer genomics dataset and a public dataset. The results indicate that our method provides a more reliable list of genes, which are validated using biological knowledge, biological experiments, and literature search. We further evaluated our multi-strategy method by consolidating two pairs of independent datasets, each pair for the same disease but generated by different labs using different platforms. The results showed that our method performed substantially better.
Pathway Analysis in Attention Deficit Hyperactivity Disorder: An Ensemble Approach
Mooney, Michael A.; McWeeney, Shannon K.; Faraone, Stephen V.; Hinney, Anke; Hebebrand, Johannes; Nigg, Joel T.; Wilmot, Beth
2016-01-01
Despite a wealth of evidence for the role of genetics in attention deficit hyperactivity disorder (ADHD), specific and definitive genetic mechanisms have not been identified. Pathway analyses, a subset of gene-set analyses, extend the knowledge gained from genome-wide association studies (GWAS) by providing functional context for genetic associations. However, there are numerous methods for association testing of gene sets and no real consensus regarding the best approach. The present study applied six pathway analysis methods to identify pathways associated with ADHD in two GWAS datasets from the Psychiatric Genomics Consortium. Methods that utilize genotypes to model pathway-level effects identified more replicable pathway associations than methods using summary statistics. In addition, pathways implicated by more than one method were significantly more likely to replicate. A number of brain-relevant pathways, such as RhoA signaling, glycosaminoglycan biosynthesis, fibroblast growth factor receptor activity, and pathways containing potassium channel genes, were nominally significant by multiple methods in both datasets. These results support previous hypotheses about the role of regulation of neurotransmitter release, neurite outgrowth and axon guidance in contributing to the ADHD phenotype and suggest the value of cross-method convergence in evaluating pathway analysis results. PMID:27004716
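The six methods compared in the study are specialized tools and are not reproduced here; as the simplest illustration of the underlying gene-set idea, the following sketch runs a hypergeometric over-representation test for one pathway. All counts are invented for illustration.

```python
# Hypergeometric over-representation test: is a pathway enriched for
# association-significant genes? Counts below are hypothetical.
from scipy.stats import hypergeom

total_genes = 20000    # genes tested genome-wide
significant = 300      # genes passing the association threshold
pathway_size = 50      # genes annotated to the pathway
overlap = 8            # significant genes that fall in the pathway

# P(X >= overlap) if `significant` genes were drawn at random
p = hypergeom.sf(overlap - 1, total_genes, pathway_size, significant)
```

Methods that model pathway effects from genotypes directly, which the study found more replicable, go beyond this count-based view, but the enrichment question they answer is the same.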
Zhou, Weiqiang; Sherwood, Ben; Ji, Hongkai
2017-01-01
Technological advances have led to an explosive growth of high-throughput functional genomic data. By exploiting the correlation among different data types, it is possible to predict one functional genomic data type from other data types. Prediction tools are valuable in understanding the relationship among different functional genomic signals. They also provide a cost-efficient solution for inferring unknown functional genomic profiles when experimental data are unavailable due to resource or technological constraints. The predicted data may be used for generating hypotheses, prioritizing targets, interpreting disease variants, facilitating data integration, quality control, and many other purposes. This article reviews various applications of prediction methods in functional genomics, discusses analytical challenges, and highlights some common and effective strategies used to develop prediction methods for functional genomic data. PMID:28076869
[Genome editing of industrial microorganism].
Zhu, Linjiang; Li, Qi
2015-03-01
Genome editing is defined as highly effective and precise modification of a cellular genome on a large scale. In recent years, such genome-editing methods have been rapidly developed in the field of industrial strain improvement. These quickly evolving methods have replaced the old, inefficient mode of genetic modification: "one modification, one selection marker, and one target site". Highly effective modes of genome editing have been developed, including simultaneous modification of multiple genes; efficient insertion, replacement, and deletion of target genes at genome scale; and cut-and-paste of large DNA fragments. These new tools for microbial genome editing will certainly be widely applied, increase the efficiency of industrial strain improvement, and promote the transformation of the traditional fermentation industry and the rapid development of novel industrial biotechnology, such as the production of biofuels and biomaterials. This review summarizes the technological principles of these genome-editing methods and their applications, which can benefit the engineering and construction of industrial microorganisms.
Wang, WeiBo; Sun, Wei; Wang, Wei; Szatkiewicz, Jin
2018-03-01
The application of high-throughput sequencing in a broad range of quantitative genomic assays (e.g., DNA-seq, ChIP-seq) has created a high demand for the analysis of large-scale read-count data. Typically, the genome is divided into tiling windows and windowed read-count data are generated for the entire genome, from which genomic signals are detected (e.g., copy number changes in DNA-seq, enrichment peaks in ChIP-seq). For accurate analysis of read-count data, many state-of-the-art statistical methods use generalized linear models (GLM) coupled with the negative-binomial (NB) distribution, leveraging its ability for simultaneous bias correction and signal detection. However, although statistically powerful, the GLM+NB method has a quadratic computational complexity and therefore suffers from slow running time when applied to large-scale windowed read-count data. In this study, we aimed to speed up the GLM+NB method substantially by using a randomized algorithm, and we demonstrate the utility of our approach in detecting copy number variants (CNVs) in a real example. We propose an efficient estimator, the randomized GLM+NB coefficients estimator (RGE), for speeding up the GLM+NB method. RGE samples the read-count data and solves the estimation problem on a smaller scale. We first theoretically validated the consistency and the variance properties of RGE. We then applied RGE to GENSENG, a GLM+NB based method for detecting CNVs. We named the resulting method "R-GENSENG". Based on extensive evaluation using both simulated and empirical data, we concluded that R-GENSENG is ten times faster than the original GENSENG while maintaining GENSENG's accuracy in CNV detection. Our results suggest that the RGE strategy developed here could be applied to other GLM+NB based read-count analyses, e.g., ChIP-seq data analysis, to substantially improve their computational efficiency while preserving the analytic power.
Guidelines for the selection of functional assays to evaluate the hallmarks of cancer.
Menyhárt, Otília; Harami-Papp, Hajnalka; Sukumar, Saraswati; Schäfer, Reinhold; Magnani, Luca; de Barrios, Oriol; Győrffy, Balázs
2016-12-01
The hallmarks of cancer capture the most essential phenotypic characteristics of malignant transformation and progression. Although numerous factors involved in this multi-step process are still unknown to date, an ever-increasing number of mutated/altered candidate genes are being identified within large-scale cancer genomic projects. Therefore, investigators need to be aware of available and appropriate techniques capable of determining characteristic features of each hallmark. We review the methods tailored to experimental cancer researchers to evaluate cell proliferation, programmed cell death, replicative immortality, induction of angiogenesis, invasion and metastasis, genome instability, and reprogramming of energy metabolism. Selecting the ideal method is based on the investigator's goals, available equipment and also on financial constraints. Multiplexing strategies enable a more in-depth data collection from a single experiment - obtaining several results from a single procedure reduces variability and saves time and relative cost, leading to more robust conclusions compared to a single end point measurement. Each hallmark possesses characteristics that can be analyzed by immunoblot, RT-PCR, immunocytochemistry, immunoprecipitation, RNA microarray or RNA-seq. In general, flow cytometry, fluorescence microscopy, and multiwell readers are extremely versatile tools and, with proper sample preparation, allow the detection of a vast number of hallmark features. Finally, we also provide a list of hallmark-specific genes to be measured in transcriptome-level studies. Although our list is not exhaustive, we provide a snapshot of the most widely used methods, with an emphasis on methods enabling the simultaneous evaluation of multiple hallmark features. Copyright © 2016 The Authors. Published by Elsevier B.V. All rights reserved.
Jung, Ki-Hong; Dardick, Christopher; Bartley, Laura E; Cao, Peijian; Phetsom, Jirapa; Canlas, Patrick; Seo, Young-Su; Shultz, Michael; Ouyang, Shu; Yuan, Qiaoping; Frank, Bryan C; Ly, Eugene; Zheng, Li; Jia, Yi; Hsia, An-Ping; An, Kyungsook; Chou, Hui-Hsien; Rocke, David; Lee, Geun Cheol; Schnable, Patrick S; An, Gynheung; Buell, C Robin; Ronald, Pamela C
2008-10-06
Studies of gene function are often hampered by gene redundancy, especially in organisms with large genomes such as rice (Oryza sativa). We present an approach for using transcriptomics data to focus functional studies and address redundancy. To this end, we have constructed and validated an inexpensive and publicly available rice oligonucleotide near-whole genome array, called the rice NSF45K array. We generated expression profiles for light- vs. dark-grown rice leaf tissue and validated the biological significance of the data by analyzing sources of variation and confirming expression trends with reverse transcription polymerase chain reaction. We examined trends in the data by evaluating enrichment of gene ontology terms at multiple false discovery rate thresholds. To compare data generated with the NSF45K array with published results, we developed publicly available, web-based tools (www.ricearray.org). The Oligo and EST Anatomy Viewer enables visualization of EST-based expression profiling data for all genes on the array. The Rice Multi-platform Microarray Search Tool facilitates comparison of gene expression profiles across multiple rice microarray platforms. Finally, we incorporated gene expression and biochemical pathway data to reduce the number of candidate gene products putatively participating in the eight steps of the photorespiration pathway from 52 to 10, based on expression levels of putatively functionally redundant genes. We confirmed the efficacy of this method to cope with redundancy by correctly predicting participation in photorespiration of a gene with five paralogs. Applying these methods will accelerate rice functional genomics.
Buccal Micronucleus Cytome Assay in Sickle Cell Disease
Naga, Mallika Bokka Sri Satya; Gour, Shreya; Nallagutta, Nalini; Velidandla, Surekha; Manikya, Sangameshwar
2016-01-01
Introduction Sickle Cell Anaemia (SCA) is a common inherited blood disorder characterized by episodes of pain, chronic haemolytic anaemia and severe infections. The underlying phenomenon which causes this disease is a point mutation in the haemoglobin beta gene (Hbβ) found on chromosome 11p. Increased oxidative stress leads to DNA damage. DNA damage occurring in such conditions can be studied by the buccal micronucleus cytome assay, which is a minimally invasive method for studying chromosomal instability, cell death and the regenerative potential of human buccal tissue. Aim To evaluate genomic instability in patients with sickle cell disease by the buccal micronucleus cytome assay. Materials and Methods The study included 40 sickle cell anaemia patients (Group A) and 40 age- and sex-matched controls (Group B). Buccal swabs were collected and stained with Papanicolaou (PAP) stain. The numbers of cells with micronuclei, binuclei, nuclear buds, pyknosis and karyolysis were counted in the two groups as parameters for the evaluation of genome stability. Results All analyses were performed using the t-test. A p-value of <0.001 was considered statistically significant. There was a statistically significant increase in micronucleus counts in SCA patients when compared with controls. The number of karyolytic (un-nucleated) cells in Group A was also higher than in the controls. Conclusion The results suggest that patients with sickle cell anaemia have genome instability, which is represented by the presence of micronuclei in the somatic cells. The presence of apoptotic cells might only indicate damage to the tissue as a result of the disease. PMID:27504413
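The group comparison described above reduces to an unpaired t-test on per-subject counts, which can be sketched in a few lines. The counts below are invented for illustration, not the study's data.

```python
# Unpaired t-test on per-subject micronucleus counts (synthetic numbers):
# Group A (SCA patients) vs Group B (matched controls).
from scipy import stats

sca = [8, 7, 9, 10, 6, 8, 9, 11]        # micronuclei per 1000 cells, Group A
controls = [2, 3, 1, 2, 4, 2, 3, 2]     # Group B
t, p = stats.ttest_ind(sca, controls)   # p < 0.001 -> significant increase
```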
Prevalence of genomic island PAPI-1 in clinical isolates of Pseudomonas aeruginosa in Iran.
Sadeghifard, Nourkhoda; Rasaei, Seyedeh Zahra; Ghafourian, Sobhan; Zolfaghary, Mohammad Reza; Ranjbar, Reza; Raftari, Mohammad; Mohebi, Reza; Maleki, Abbas; Rahbar, Mohammad
2012-03-01
Pseudomonas aeruginosa, a gram-negative rod-shaped bacterium, is an opportunistic pathogen that causes various serious diseases in humans and animals. The aims of this study were to evaluate the presence of the genomic island PAPI-1 in Pseudomonas aeruginosa isolated from the Reference Laboratory of Ilam, Milad Hospital and Emam Khomeini Hospital, Iran, and to study the frequency of extended-spectrum beta-lactamases (ESBLs) among the isolates. Forty-eight clinical isolates of P. aeruginosa were obtained during April to September 2010 and were evaluated for ESBLs by screening and confirmatory disk diffusion methods and for PAPI-1 by PCR. Fifteen of the 48 P. aeruginosa isolates were positive for ESBLs and 17 isolates were positive for PAPI-1. This was the first study of the prevalence of PAPI-1 in clinical isolates of P. aeruginosa in Iran, showing that most of the PAPI-1-positive strains had high levels of antibiotic resistance and produced ESBLs.
Wolc, Anna; Stricker, Chris; Arango, Jesus; Settar, Petek; Fulton, Janet E; O'Sullivan, Neil P; Preisinger, Rudolf; Habier, David; Fernando, Rohan; Garrick, Dorian J; Lamont, Susan J; Dekkers, Jack C M
2011-01-21
Genomic selection involves breeding value estimation of selection candidates based on high-density SNP genotypes. To quantify the potential benefit of genomic selection, accuracies of estimated breeding values (EBV) obtained with different methods using pedigree or high-density SNP genotypes were evaluated and compared in a commercial layer chicken breeding line. The following traits were analyzed: egg production, egg weight, egg color, shell strength, age at sexual maturity, body weight, albumen height, and yolk weight. Predictions appropriate for early or late selection were compared. A total of 2,708 birds were genotyped for 23,356 segregating SNP, including 1,563 females with records. Phenotypes on relatives without genotypes were incorporated in the analysis (in total 13,049 production records). The data were analyzed with a Reduced Animal Model using a relationship matrix based on pedigree data or on marker genotypes and with a Bayesian method using model averaging. Using a validation set that consisted of individuals from the generation following training, these methods were compared by correlating EBV with phenotypes corrected for fixed effects, by selecting the top 30 individuals based on EBV and evaluating their mean phenotype, and by regressing phenotypes on EBV. Using high-density SNP genotypes increased accuracies of EBV by up to two-fold for selection at an early age and by up to 88% for selection at a later age. Accuracy increases at an early age can be mostly attributed to improved estimates of parental EBV for shell quality and egg production, while for other egg quality traits it is mostly due to improved estimates of Mendelian sampling effects. A relatively small number of markers was sufficient to explain most of the genetic variation for egg weight and body weight.
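The three validation checks described above can be sketched on synthetic numbers: correlate EBV with corrected phenotypes, compare the mean phenotype of the top 30 individuals ranked by EBV, and regress phenotype on EBV as a bias check. All values are simulated, not the layer-line data.

```python
# Validation-set checks for EBV, on simulated values: accuracy (correlation),
# mean phenotype of the top 30 by EBV, and the regression slope of phenotype
# on EBV (a slope near the expected value indicates unbiased EBV dispersion).
import numpy as np

rng = np.random.default_rng(2)
n = 300
tbv = rng.normal(0, 1, n)              # true breeding values (unobserved)
ebv = tbv + rng.normal(0, 0.7, n)      # estimated BV with error
pheno = tbv + rng.normal(0, 1.5, n)    # phenotypes corrected for fixed effects

accuracy = np.corrcoef(ebv, pheno)[0, 1]
top30 = np.argsort(ebv)[-30:]
top30_mean = pheno[top30].mean()       # should exceed the population mean
slope = np.polyfit(ebv, pheno, 1)[0]   # dispersion/bias check
```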
Sczyrba, Alex
2018-02-13
DOE JGI's Alex Sczyrba on "Evaluation of the Cow Rumen Metagenome" and "Assembly by Single Copy Gene Analysis and Single Cell Genome Assemblies" at the Metagenomics Informatics Challenges Workshop held at the DOE JGI on October 12-13, 2011.
USDA-ARS's Scientific Manuscript database
The newly released rainbow trout genome assembly in NCBI RefSeq has greatly expanded our abilities for analyzing rainbow trout sequencing data. In this poster, we evaluate the utility of this genome assembly for analyzing RNA sequencing (RNA-seq) data of rainbow trout responses to various stressors,...
Fish genome manipulation and directional breeding.
Ye, Ding; Zhu, ZuoYan; Sun, YongHua
2015-02-01
Aquaculture is one of the fastest developing agricultural industries worldwide. One of the most important factors for sustainable aquaculture is the development of high performing culture strains. Genome manipulation offers a powerful method to achieve rapid and directional breeding in fish. We review the history of fish breeding methods based on classical genome manipulation, including polyploidy breeding and nuclear transfer. Then, we discuss the advances and applications of fish directional breeding based on transgenic technology and recently developed genome editing technologies. These methods offer increased efficiency, precision and predictability in genetic improvement over traditional methods.
A new strategy for genome assembly using short sequence reads and reduced representation libraries.
Young, Andrew L; Abaan, Hatice Ozel; Zerbino, Daniel; Mullikin, James C; Birney, Ewan; Margulies, Elliott H
2010-02-01
We have developed a novel approach for using massively parallel short-read sequencing to generate fast and inexpensive de novo genomic assemblies comparable to those generated by capillary-based methods. The ultrashort (<100 base) sequences generated by this technology pose specific biological and computational challenges for de novo assembly of large genomes. To account for this, we devised a method for experimentally partitioning the genome using reduced representation (RR) libraries prior to assembly. We use two restriction enzymes independently to create a series of overlapping fragment libraries, each containing a tractable subset of the genome. Together, these libraries allow us to reassemble the entire genome without the need of a reference sequence. As proof of concept, we applied this approach to sequence and assembled the majority of the 125-Mb Drosophila melanogaster genome. We subsequently demonstrate the accuracy of our assembly method with meaningful comparisons against the current available D. melanogaster reference genome (dm3). The ease of assembly and accuracy for comparative genomics suggest that our approach will scale to future mammalian genome-sequencing efforts, saving both time and money without sacrificing quality.
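The genome-partitioning idea behind the reduced representation libraries can be illustrated with a toy in-silico digest. This is only a sketch of the concept (the actual partitioning is experimental, using two restriction enzymes and size selection); the function name and size window are assumptions.

```python
def rr_fragments(seq, site, size_range=(100, 300)):
    """In-silico reduced-representation digest (conceptual sketch): cut
    the sequence at every occurrence of the recognition site and keep
    only fragments whose length falls in the selectable size window,
    yielding a tractable subset of the genome."""
    cuts, i = [0], seq.find(site)
    while i != -1:
        cuts.append(i)
        i = seq.find(site, i + 1)
    cuts.append(len(seq))
    frags = [seq[a:b] for a, b in zip(cuts, cuts[1:])]
    return [f for f in frags if size_range[0] <= len(f) <= size_range[1]]
```

Running two different enzymes produces overlapping fragment sets, which is what allows the libraries to be reassembled jointly without a reference.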
Optimization and evaluation of single-cell whole-genome multiple displacement amplification.
Spits, C; Le Caignec, C; De Rycke, M; Van Haute, L; Van Steirteghem, A; Liebaers, I; Sermon, K
2006-05-01
The scarcity of genomic DNA can be a limiting factor in some fields of genetic research. One of the methods developed to overcome this difficulty is whole genome amplification (WGA). Recently, multiple displacement amplification (MDA) has proved very efficient in the WGA of small DNA samples and pools of cells, the reaction being catalyzed by the phi29 or the Bst DNA polymerases. The aim of the present study was to develop a reliable, efficient, and fast protocol for MDA at the single-cell level. We first compared the efficiency of phi29 and Bst polymerases on DNA samples and single cells. The phi29 polymerase generated accurately, in a short time and from a single cell, sufficient DNA for a large set of tests, whereas the Bst enzyme showed a low efficiency and a high error rate. A single-cell protocol was optimized using the phi29 polymerase and was evaluated on 60 single cells; the obtained DNA was assessed by 22 locus-specific PCRs. This new protocol can be useful for many applications involving minute quantities of starting material, such as forensic DNA analysis, prenatal and preimplantation genetic diagnosis, or cancer research. (c) 2006 Wiley-Liss, Inc.
The Prevalence of Human Papilloma Virus in Esophageal Squamous Cell Carcinoma
Noori, Sadat; Monabati, Ahmad; Ghaderi, Abbasali
2012-01-01
Background: Carcinomas of the esophagus, mostly squamous cell carcinomas, occur throughout the world. There are a number of suspected genetic or environmental etiologies. Human papilloma virus (HPV) is reported to be a major etiology in areas with a high incidence of esophageal carcinoma, while it is hardly detectable in low-incidence regions. This study was designed to evaluate the prevalence of HPV in esophageal squamous cell carcinoma (ESCC) cases diagnosed in the Pathology Department, Medical School, Shiraz University of Medical Sciences. Methods: DNA for PCR amplification of the HPV genome was extracted from formalin-fixed paraffin-embedded tissue blocks of 92 cases of ESCC diagnosed during the 20 years from 1982 to 2002. Polymerase chain reaction was performed for amplification and detection of common HPV and type-specific HPV-16 and HPV-18 genomic sequences in the presence of a positive control (HPV-18 and HPV-positive biopsies of uterine exocervix) and additional internal controls, i.e., beta-globin and cytotoxic T lymphocyte antigen 4 (CTLA4). Result: Good amplification of the positive control and internal controls was observed. However, no amplification of the HPV genome was observed. Conclusion: There is no association between HPV infection and the development of esophageal squamous cell carcinoma in the cases evaluated. PMID:23115442
Neurobehavioral Development of Common Marmoset Monkeys
Schultz-Darken, Nancy; Braun, Katarina M.; Emborg, Marina E.
2016-01-01
Common marmoset (Callithrix jacchus) monkeys are a resource for biomedical research and their use is predicted to increase due to the suitability of this species for transgenic approaches. Identification of abnormal neurodevelopment due to genetic modification relies upon comparison with validated patterns of normal behavior defined by unbiased methods. As scientists unfamiliar with nonhuman primate development are interested in applying genomic editing techniques in marmosets, it would be beneficial to the field for investigators to use validated methods of postnatal evaluation that are age and species appropriate. This review aims to analyze currently available data on marmoset physical and behavioral postnatal development, describe the methods used, and discuss next steps to better understand and evaluate marmoset normal and abnormal postnatal neurodevelopment. PMID:26502294
Genomic prediction of reproduction traits for Merino sheep.
Bolormaa, S; Brown, D J; Swan, A A; van der Werf, J H J; Hayes, B J; Daetwyler, H D
2017-06-01
Economically important reproduction traits in sheep, such as number of lambs weaned and litter size, are expressed only in females and later in life after most selection decisions are made, which makes them ideal candidates for genomic selection. Accurate genomic predictions would lead to greater genetic gain for these traits by enabling accurate selection of young rams with high genetic merit. The aim of this study was to design and evaluate the accuracy of a genomic prediction method for female reproduction in sheep using daughter trait deviations (DTD) for sires and ewe phenotypes (when individual ewes were genotyped) for three reproduction traits: number of lambs born (NLB), litter size (LSIZE) and number of lambs weaned. Genomic best linear unbiased prediction (GBLUP), BayesR and pedigree BLUP analyses of the three reproduction traits measured on 5340 sheep (4503 ewes and 837 sires) with real and imputed genotypes for 510 174 SNPs were performed. The prediction of breeding values using both sire and ewe trait records was validated in Merino sheep. Prediction accuracy was evaluated by across sire family and random cross-validations. Accuracies of genomic estimated breeding values (GEBVs) were assessed as the mean Pearson correlation adjusted by the accuracy of the input phenotypes. The addition of sire DTD into the prediction analysis resulted in higher accuracies compared with using only ewe records in genomic predictions or pedigree BLUP. Using GBLUP, the average accuracy based on the combined records (ewes and sire DTD) was 0.43 across traits, but the accuracies varied by trait and type of cross-validations. The accuracies of GEBVs from random cross-validations (range 0.17-0.61) were higher than were those from sire family cross-validations (range 0.00-0.51). The GEBV accuracies of 0.41-0.54 for NLB and LSIZE based on the combined records were amongst the highest in the study. 
Although BayesR was not significantly different from GBLUP in prediction accuracy, it identified several candidate genes which are known to be associated with NLB and LSIZE. The approach provides a way to make use of all data available in genomic prediction for traits that have limited recording. © 2017 Stichting International Foundation for Animal Genetics.
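The GBLUP component of the comparison above can be sketched in Python. This is an illustrative minimal version under stated assumptions (a VanRaden-style genomic relationship matrix, all animals phenotyped, and heritability treated as known), not the pipeline used in the study.

```python
import numpy as np

def gblup(M, y, h2=0.3):
    """Minimal GBLUP sketch: build a VanRaden-style genomic relationship
    matrix G from centred genotype codes (0/1/2), then compute BLUP of
    genomic values u under y = mu + u + e with var(u) = sigma_u^2 * G and
    var(e) = sigma_e^2 * I. The heritability h2 is an assumed input."""
    p = M.mean(axis=0) / 2.0                      # observed allele frequencies
    Z = M - 2.0 * p                               # centred genotype matrix
    G = Z @ Z.T / (2.0 * np.sum(p * (1.0 - p)))   # genomic relationship matrix
    lam = (1.0 - h2) / h2                         # sigma_e^2 / sigma_u^2
    yc = y - y.mean()                             # crude fixed-effect correction
    return G @ np.linalg.solve(G + lam * np.eye(len(y)), yc)
```

Cross-validation accuracy would then be the correlation between these predictions and (corrected) phenotypes in a held-out set, as in the sire-family and random cross-validations described.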
Jacchia, Sara; Nardini, Elena; Savini, Christian; Petrillo, Mauro; Angers-Loustau, Alexandre; Shim, Jung-Hyun; Trijatmiko, Kurniawan; Kreysa, Joachim; Mazzara, Marco
2015-02-18
In this study, we developed, optimized, and in-house validated a real-time PCR method for the event-specific detection and quantification of Golden Rice 2, a genetically modified rice with provitamin A in the grain. We optimized and evaluated the performance of the taxon (targeting the rice Phospholipase D α2 gene)- and event (targeting the 3' insert-to-plant DNA junction)-specific assays that compose the method as independent modules, using haploid genome equivalents as the unit of measurement. We verified the specificity of the two real-time PCR assays and determined their dynamic range, limit of quantification, limit of detection, and robustness. We also confirmed that the taxon-specific DNA sequence is present in single copy in the rice genome and verified its stability of amplification across 132 rice varieties. A relative quantification experiment confirmed the correct performance of the two assays when used in combination.
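The haploid-genome-equivalent quantification underlying such paired assays can be sketched as follows. The calibration slope and intercept here are hypothetical standard-curve parameters for illustration only, not values from the validated method.

```python
def copies_from_ct(ct, slope, intercept):
    """Standard-curve interpolation: Ct = slope*log10(copies) + intercept,
    so copies = 10 ** ((ct - intercept) / slope). A slope near -3.32
    corresponds to ~100% PCR efficiency."""
    return 10.0 ** ((ct - intercept) / slope)

def gm_content(ct_event, ct_taxon, slope=-3.32, intercept=38.0):
    """GM content as a percentage of haploid genome equivalents: copy
    number from the event-specific assay relative to the taxon-specific
    (single-copy) assay. Both assays share a hypothetical calibration
    curve in this sketch."""
    ev = copies_from_ct(ct_event, slope, intercept)
    tx = copies_from_ct(ct_taxon, slope, intercept)
    return 100.0 * ev / tx
```

Because the intercept cancels in the ratio, a Ct difference of one slope unit (about 3.32 cycles at full efficiency) corresponds to a ten-fold change in relative copy number.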
Mitochondrial DNA heteroplasmy in the emerging field of massively parallel sequencing
Just, Rebecca S.; Irwin, Jodi A.; Parson, Walther
2015-01-01
Long an important and useful tool in forensic genetic investigations, mitochondrial DNA (mtDNA) typing continues to mature. Research in the last few years has demonstrated both that data from the entire molecule will have practical benefits in forensic DNA casework, and that massively parallel sequencing (MPS) methods will make full mitochondrial genome (mtGenome) sequencing of forensic specimens feasible and cost-effective. A spate of recent studies has employed these new technologies to assess intraindividual mtDNA variation. However, in several instances, contamination and other sources of mixed mtDNA data have been erroneously identified as heteroplasmy. Well vetted mtGenome datasets based on both Sanger and MPS sequences have found authentic point heteroplasmy in approximately 25% of individuals when minor component detection thresholds are in the range of 10–20%, along with positional distribution patterns in the coding region that differ from patterns of point heteroplasmy in the well-studied control region. A few recent studies that examined very low-level heteroplasmy are concordant with these observations when the data are examined at a common level of resolution. In this review we provide an overview of considerations related to the use of MPS technologies to detect mtDNA heteroplasmy. In addition, we examine published reports on point heteroplasmy to characterize features of the data that will assist in the evaluation of future mtGenome data developed by any typing method. PMID:26009256
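The threshold-based detection of point heteroplasmy discussed above can be sketched as a simple call on per-base read counts. This is a conceptual illustration (function name, depth cutoff, and default threshold are assumptions), not a forensic-grade caller.

```python
def call_heteroplasmy(base_counts, min_fraction=0.10, min_depth=100):
    """Point-heteroplasmy call (sketch): report a position as
    heteroplasmic when the minor component reaches the detection
    threshold (the 10-20% range discussed above) at adequate depth.
    Returns (is_heteroplasmic, minor_fraction)."""
    depth = sum(base_counts.values())
    if depth < min_depth or len(base_counts) < 2:
        return False, 0.0
    minor = sorted(base_counts.values())[-2]   # second most abundant base
    frac = minor / depth
    return frac >= min_fraction, frac
```

Real MPS analyses must additionally rule out contamination and other sources of mixed data before attributing a minor component to authentic heteroplasmy, as the review stresses.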
How to interpret methylation sensitive amplified polymorphism (MSAP) profiles?
Fulneček, Jaroslav; Kovařík, Aleš
2014-01-06
DNA methylation plays a key role in development, contributes to genome stability, and may also respond to external factors supporting adaptation and evolution. To connect different types of stimuli with particular biological processes, identifying genome regions with altered 5-methylcytosine distribution at a genome-wide scale is important. Many researchers are using the simple, reliable, and relatively inexpensive Methylation Sensitive Amplified Polymorphism (MSAP) method that is particularly useful in studies of epigenetic variation. However, electrophoretic patterns produced by the method are rather difficult to interpret, particularly when MspI and HpaII isoschizomers are used because these enzymes are methylation-sensitive, and any C within the CCGG recognition motif can be methylated in plant DNA. Here, we evaluate MSAP patterns with respect to current knowledge of the enzyme activities and the level and distribution of 5-methylcytosine in plant and vertebrate genomes. We discuss potential caveats related to complex MSAP patterns and provide clues regarding how to interpret them. We further show that addition of combined HpaII + MspI digestion would assist in the interpretation of the most controversial MSAP pattern represented by the signal in the HpaII but not in the MspI profile. We recommend modification of the MSAP protocol that definitely discerns between putative hemimethylated mCCGG and internal CmCGG sites. We believe that our view and the simple improvement will assist in correct MSAP data interpretation.
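The conventional reading of MSAP band patterns discussed above can be written as a small lookup table. This is a simplified sketch; as the review stresses, several patterns are ambiguous and must be interpreted with the caveats the authors describe (e.g., the combined HpaII + MspI digestion for the HpaII-only pattern).

```python
# Conventional (simplified) interpretation of one CCGG site from the
# presence/absence of a band in the HpaII and MspI profiles.
MSAP_PATTERNS = {
    (True, True): "unmethylated CCGG",
    (True, False): "ambiguous: e.g. hemimethylated mCCGG (the controversial "
                   "HpaII-only pattern discussed above)",
    (False, True): "internal cytosine methylation (CmCGG)",
    (False, False): "uninformative: full methylation or absence of the site",
}

def interpret_msap(hpaii_band, mspi_band):
    """Look up the conventional reading of one CCGG site."""
    return MSAP_PATTERNS[(bool(hpaii_band), bool(mspi_band))]
```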
Chen, DaYang; Zhen, HeFu; Qiu, Yong; Liu, Ping; Zeng, Peng; Xia, Jun; Shi, QianYu; Xie, Lin; Zhu, Zhu; Gao, Ya; Huang, GuoDong; Wang, Jian; Yang, HuanMing; Chen, Fang
2018-03-21
Research based on a strategy of single-cell low-coverage whole genome sequencing (SLWGS) has enabled better reproducibility and accuracy for detection of copy number variations (CNVs). The whole genome amplification (WGA) method and sequencing platform are critical factors for successful SLWGS (<0.1 × coverage). In this study, we compared single cell and multiple cells sequencing data produced by the HiSeq2000 and Ion Proton platforms using two WGA kits and then comprehensively evaluated the GC-bias, reproducibility, uniformity and CNV detection among different experimental combinations. Our analysis demonstrated that the PicoPLEX WGA Kit resulted in higher reproducibility, lower sequencing error frequency but more GC-bias than the GenomePlex Single Cell WGA Kit (WGA4 kit) independent of the cell number on the HiSeq2000 platform. While on the Ion Proton platform, the WGA4 kit (both single cell and multiple cells) had higher uniformity and less GC-bias but lower reproducibility than those of the PicoPLEX WGA Kit. Moreover, on these two sequencing platforms, depending on cell number, the performance of the two WGA kits was different for both sensitivity and specificity on CNV detection. The results can help researchers who plan to use SLWGS on single or multiple cells to select appropriate experimental conditions for their applications.
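The read-depth logic behind CNV detection from low-coverage single-cell data can be sketched in a few lines. This illustrates only the normalization step; real SLWGS pipelines also correct the GC-bias that the study evaluates, which this sketch omits.

```python
import statistics

def copy_number_per_bin(bin_counts, ploidy=2):
    """Read-depth CNV sketch: normalise per-bin read counts by the
    genome-wide median bin count and scale to the expected ploidy, so a
    diploid region scores ~2, a single-copy loss ~1, a gain ~3."""
    med = statistics.median(bin_counts)
    return [ploidy * c / med for c in bin_counts]
```

Uniformity and reproducibility of these per-bin values across replicate cells are exactly the properties on which the two WGA kits and sequencing platforms were compared.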
Parks, Donovan H.; Imelfort, Michael; Skennerton, Connor T.; Hugenholtz, Philip; Tyson, Gene W.
2015-01-01
Large-scale recovery of genomes from isolates, single cells, and metagenomic data has been made possible by advances in computational methods and substantial reductions in sequencing costs. Although this increasing breadth of draft genomes is providing key information regarding the evolutionary and functional diversity of microbial life, it has become impractical to finish all available reference genomes. Making robust biological inferences from draft genomes requires accurate estimates of their completeness and contamination. Current methods for assessing genome quality are ad hoc and generally make use of a limited number of “marker” genes conserved across all bacterial or archaeal genomes. Here we introduce CheckM, an automated method for assessing the quality of a genome using a broader set of marker genes specific to the position of a genome within a reference genome tree and information about the collocation of these genes. We demonstrate the effectiveness of CheckM using synthetic data and a wide range of isolate-, single-cell-, and metagenome-derived genomes. CheckM is shown to provide accurate estimates of genome completeness and contamination and to outperform existing approaches. Using CheckM, we identify a diverse range of errors currently impacting publicly available isolate genomes and demonstrate that genomes obtained from single cells and metagenomic data vary substantially in quality. In order to facilitate the use of draft genomes, we propose an objective measure of genome quality that can be used to select genomes suitable for specific gene- and genome-centric analyses of microbial communities. PMID:25977477
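The marker-gene idea behind completeness and contamination estimates can be sketched as follows. This is a deliberately simplified illustration in the spirit of CheckM: it uses a flat list of expected single-copy markers and ignores the lineage-specific marker sets and gene collocation that CheckM actually exploits.

```python
def completeness_contamination(marker_counts, expected_markers):
    """Single-copy-marker quality estimate (simplified): completeness is
    the percentage of expected markers observed at least once in the
    genome; contamination is the percentage of extra marker copies,
    which suggests sequence from more than one organism."""
    n = len(expected_markers)
    found = sum(1 for m in expected_markers if marker_counts.get(m, 0) >= 1)
    extra = sum(max(0, marker_counts.get(m, 0) - 1) for m in expected_markers)
    return 100.0 * found / n, 100.0 * extra / n
```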
Hamamoto, Kouta; Ueda, Shuhei; Yamamoto, Yoshimasa
2015-01-01
Genotyping and characterization of bacterial isolates are essential steps in the identification and control of antibiotic-resistant bacterial infections. Recently, one novel genotyping method using three genomic guided Escherichia coli markers (GIG-EM), dinG, tonB, and dipeptide permease (DPP), was reported. Because GIG-EM has not been fully evaluated using clinical isolates, we assessed this typing method with 72 E. coli collection of reference (ECOR) environmental E. coli reference strains and 63 E. coli isolates of various genetic backgrounds. In this study, we designated 768 bp of dinG, 745 bp of tonB, and 655 bp of DPP target sequences for use in the typing method. Concatenations of the processed marker sequences were used to draw GIG-EM phylogenetic trees. E. coli isolates with identical sequence types as identified by the conventional multilocus sequence typing (MLST) method were localized to the same branch of the GIG-EM phylogenetic tree. Sixteen clinical E. coli isolates were utilized as test isolates without prior characterization by conventional MLST and phylogenetic grouping before GIG-EM typing. Of these, 14 clinical isolates were assigned to a branch including only isolates of a pandemic clone, E. coli B2-ST131-O25b, and these results were confirmed by conventional typing methods. Our results suggested that the GIG-EM typing method and its application to phylogenetic trees might be useful tools for the molecular characterization and determination of the genetic relationships among E. coli isolates. PMID:25809972
Brezovský, Jan
2016-01-01
An important message taken from human genome sequencing projects is that the human population exhibits approximately 99.9% genetic similarity. Variations in the remaining parts of the genome determine our identity, trace our history and reveal our heritage. The precise delineation of phenotypically causal variants plays a key role in providing accurate personalized diagnosis, prognosis, and treatment of inherited diseases. Several computational methods for achieving such delineation have been reported recently. However, their ability to pinpoint potentially deleterious variants is limited by the fact that their mechanisms of prediction do not account for the existence of different categories of variants. Consequently, their output is biased towards the variant categories that are most strongly represented in the variant databases. Moreover, most such methods provide numeric scores but not binary predictions of the deleteriousness of variants or confidence scores that would be more easily understood by users. We have constructed three datasets covering different types of disease-related variants, which were divided across five categories: (i) regulatory, (ii) splicing, (iii) missense, (iv) synonymous, and (v) nonsense variants. These datasets were used to develop category-optimal decision thresholds and to evaluate six tools for variant prioritization: CADD, DANN, FATHMM, FitCons, FunSeq2 and GWAVA. This evaluation revealed some important advantages of the category-based approach. The results obtained with the five best-performing tools were then combined into a consensus score. Additional comparative analyses showed that in the case of missense variations, protein-based predictors perform better than DNA sequence-based predictors. A user-friendly web interface was developed that provides easy access to the five tools’ predictions, and their consensus scores, in a user-understandable format tailored to the specific features of different categories of variations. 
To enable comprehensive evaluation of variants, the predictions are complemented with annotations from eight databases. The web server is freely available to the community at http://loschmidt.chemi.muni.cz/predictsnp2. PMID:27224906
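The combination of per-tool thresholds into a consensus can be sketched as a majority vote. This is an illustrative scheme under stated assumptions (hypothetical tool names; the published consensus score may weight tools differently).

```python
def consensus_call(scores, thresholds):
    """Consensus in the spirit described above: apply a
    category-optimised decision threshold to each tool's numeric score
    to obtain a binary call, then report the majority vote together with
    an agreement-based confidence. Both arguments map tool name -> value."""
    calls = {tool: scores[tool] >= thresholds[tool] for tool in scores}
    votes = sum(calls.values())
    deleterious = votes > len(calls) / 2
    confidence = max(votes, len(calls) - votes) / len(calls)
    return deleterious, confidence
```

Turning numeric scores into a binary call plus a confidence value is precisely the user-facing improvement the abstract argues for.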
fastBMA: scalable network inference and transitive reduction.
Hung, Ling-Hong; Shi, Kaiyuan; Wu, Migao; Young, William Chad; Raftery, Adrian E; Yeung, Ka Yee
2017-10-01
Inferring genetic networks from genome-wide expression data is extremely demanding computationally. We have developed fastBMA, a distributed, parallel, and scalable implementation of Bayesian model averaging (BMA) for this purpose. fastBMA also includes a computationally efficient module for eliminating redundant indirect edges in the network by mapping the transitive reduction to an easily solved shortest-path problem. We evaluated the performance of fastBMA on synthetic data and experimental genome-wide time series yeast and human datasets. When using a single CPU core, fastBMA is up to 100 times faster than the next fastest method, LASSO, with increased accuracy. It is a memory-efficient, parallel, and distributed application that scales to human genome-wide expression data. A 10 000-gene regulation network can be obtained in a matter of hours using a 32-core cloud cluster (2 nodes of 16 cores). fastBMA is a significant improvement over its predecessor ScanBMA. It is more accurate and orders of magnitude faster than other fast network inference methods such as the one based on LASSO. The improved scalability allows it to calculate networks from genome scale data in a reasonable time frame. The transitive reduction method can improve accuracy in denser networks. fastBMA is available as code (M.I.T. license) from GitHub (https://github.com/lhhunghimself/fastBMA), as part of the updated networkBMA Bioconductor package (https://www.bioconductor.org/packages/release/bioc/html/networkBMA.html) and as ready-to-deploy Docker images (https://hub.docker.com/r/biodepot/fastbma/).  © The Authors 2017. Published by Oxford University Press.
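The transitive-reduction idea (dropping an edge u→v when v is still reachable from u by an indirect path) can be sketched with plain BFS on an unweighted graph. fastBMA maps this to a shortest-path computation on weighted edges; the version below is only a conceptual sketch of the unweighted case.

```python
from collections import deque

def transitive_reduction(edges):
    """Remove redundant indirect edges: drop u->v when v remains
    reachable from u without using the direct edge. Sketch of the idea
    behind fastBMA's module, using BFS on an unweighted digraph."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
    kept = []
    for u, v in edges:
        adj[u].discard(v)                  # temporarily remove the direct edge
        seen, queue, reachable = {u}, deque([u]), False
        while queue and not reachable:
            for w in adj.get(queue.popleft(), ()):
                if w == v:
                    reachable = True       # an indirect path exists
                    break
                if w not in seen:
                    seen.add(w)
                    queue.append(w)
        if reachable:
            continue                       # redundant edge: leave it removed
        adj[u].add(v)                      # essential edge: restore and keep it
        kept.append((u, v))
    return kept
```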
Covariance Between Genotypic Effects and its Use for Genomic Inference in Half-Sib Families
Wittenburg, Dörte; Teuscher, Friedrich; Klosa, Jan; Reinsch, Norbert
2016-01-01
In livestock, current statistical approaches utilize extensive molecular data, e.g., single nucleotide polymorphisms (SNPs), to improve the genetic evaluation of individuals. The number of model parameters increases with the number of SNPs, so the multicollinearity between covariates can affect the results obtained using whole genome regression methods. In this study, dependencies between SNPs due to linkage and linkage disequilibrium among the chromosome segments were explicitly considered in methods used to estimate the effects of SNPs. The population structure affects the extent of such dependencies, so the covariance among SNP genotypes was derived for half-sib families, which are typical in livestock populations. Conditional on the SNP haplotypes of the common parent (sire), the theoretical covariance was determined using the haplotype frequencies of the population from which the individual parent (dam) was derived. The resulting covariance matrix was included in a statistical model for a trait of interest, and this covariance matrix was then used to specify prior assumptions for SNP effects in a Bayesian framework. The approach was applied to one family in simulated scenarios (few and many quantitative trait loci) and using semireal data obtained from dairy cattle to identify genome segments that affect performance traits, as well as to investigate the impact on predictive ability. Compared with a method that does not explicitly consider any of the relationship among predictor variables, the accuracy of genetic value prediction was improved by 10–22%. The results show that the inclusion of dependence is particularly important for genomic inference based on small sample sizes. PMID:27402363
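Including a covariance matrix among SNP effects in the prior amounts to a generalised ridge regression. The sketch below shows that step only (illustrative Python; the derivation of the half-sib covariance matrix itself, conditional on sire haplotypes, is not reproduced here).

```python
import numpy as np

def snp_effects_with_prior(X, y, Sigma, lam=1.0):
    """Generalised ridge sketch: with prior beta ~ N(0, sigma_b^2 * Sigma)
    instead of the usual independent-effects prior (Sigma = I), the
    estimate solves (X'X + lam * Sigma^{-1}) beta = X'y, where lam is the
    residual-to-effect variance ratio."""
    A = X.T @ X + lam * np.linalg.inv(Sigma)
    return np.linalg.solve(A, X.T @ y)
```

Setting Sigma to the identity recovers ordinary ridge/SNP-BLUP, which is the baseline the study improves on by 10-22% when the family-specific dependence is supplied instead.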
[sgRNA design for the CRISPR/Cas9 system and evaluation of its off-target effects].
Xie, Sheng-song; Zhang, Yi; Zhang, Li-sheng; Li, Guang-lei; Zhao, Chang-zhi; Ni, Pan; Zhao, Shu-hong
2015-11-01
The third generation of CRISPR/Cas9-mediated genome editing technology has been successfully applied to genome modification of various species including animals, plants and microorganisms. How to improve the efficiency of CRISPR/Cas9 genome editing and reduce its off-target effects has been extensively explored in this field. Using sgRNAs (small guide RNAs) with high efficiency and specificity is one of the critical factors for successful genome editing. Several software tools have been developed for sgRNA design and/or off-target evaluation, each with its own advantages and disadvantages. In this review, we summarize the characteristics of 16 online and standalone software tools for sgRNA design and/or off-target evaluation and conduct a comparative analysis of these tools using 38 evaluation indexes. We also summarize 11 experimental approaches for testing genome editing efficiency and off-target effects, as well as how to screen highly efficient and specific sgRNAs.
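The core of what off-target evaluation software automates can be sketched as a naive mismatch scan. This is a conceptual illustration only (real tools use indexed search, scoring models, and bulge handling); the function names and mismatch budget are assumptions.

```python
def off_target_sites(sgrna, genome, max_mismatches=3, pam="GG"):
    """Naive off-target scan: slide the protospacer over the genome
    sequence, require an NGG PAM immediately 3' of the candidate site,
    and report (position, mismatch_count) for sites within the mismatch
    budget. Perfect matches elsewhere in the genome are the worst case."""
    k = len(sgrna)
    hits = []
    for i in range(len(genome) - k - 2):
        if genome[i + k + 1:i + k + 3] != pam:   # N-G-G: check the two Gs
            continue
        mm = sum(a != b for a, b in zip(sgrna, genome[i:i + k]))
        if mm <= max_mismatches:
            hits.append((i, mm))
    return hits
```

A production scan would also search the reverse-complement strand and weight mismatches by their distance from the PAM, which this sketch omits.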
Technical note: Equivalent genomic models with a residual polygenic effect.
Liu, Z; Goddard, M E; Hayes, B J; Reinhardt, F; Reents, R
2016-03-01
Routine genomic evaluations in animal breeding are usually based on either a BLUP with genomic relationship matrix (GBLUP) or single nucleotide polymorphism (SNP) BLUP model. For a multi-step genomic evaluation, these 2 alternative genomic models were proven to give equivalent predictions for genomic reference animals. The model equivalence was verified also for young genotyped animals without phenotypes. Due to incomplete linkage disequilibrium of SNP markers to genes or causal mutations responsible for genetic inheritance of quantitative traits, SNP markers cannot explain all the genetic variance. A residual polygenic effect is normally fitted in the genomic model to account for the incomplete linkage disequilibrium. In this study, we first prove that the multi-step GBLUP and SNP BLUP models are equivalent for the reference animals when a residual polygenic effect is included. Second, the equivalence of both multi-step genomic models with a residual polygenic effect was also verified for young genotyped animals without phenotypes. Additionally, we derived formulas to convert genomic estimated breeding values of the GBLUP model to its components, direct genomic values and residual polygenic effect. Third, we prove that the equivalence of these 2 genomic models with a residual polygenic effect holds also for single-step genomic evaluation. Both the single-step GBLUP and SNP BLUP models lead to equal predictions for genotyped animals with phenotypes (e.g., reference animals), as well as for (young) genotyped animals without phenotypes. Finally, these 2 single-step genomic models with a residual polygenic effect were proven to be equivalent for estimation of SNP effects, too. Copyright © 2016 American Dairy Science Association. Published by Elsevier Inc. All rights reserved.
Identifying micro-inversions using high-throughput sequencing reads.
He, Feifei; Li, Yang; Tang, Yu-Hang; Ma, Jian; Zhu, Huaiqiu
2016-01-11
The identification of inversions of DNA segments shorter than read length (e.g., 100 bp), defined as micro-inversions (MIs), remains challenging for next-generation sequencing reads. MIs are an important class of genomic variation and may play a role in causing genetic disease. However, current alignment methods are generally too insensitive to detect MIs. Here we develop a novel tool, MID (Micro-Inversion Detector), to identify MIs in human genomes using next-generation sequencing reads. The algorithm of MID is designed based on a dynamic programming path-finding approach. What makes MID different from other variant detection tools is that MID can handle small MIs and multiple breakpoints within an unmapped read. Moreover, MID improves reliability in low coverage data by integrating multiple samples. Our evaluation demonstrated that MID outperforms Gustaf, which can currently detect inversions from 30 bp to 500 bp. To our knowledge, MID is the first method that can efficiently and reliably identify MIs from unmapped short next-generation sequencing reads. MID is reliable on low coverage data, which is suitable for large-scale projects such as the 1000 Genomes Project (1KGP). MID identified previously unknown MIs from the 1KGP that overlap with genes and regulatory elements in the human genome. We also identified MIs in cancer cell lines from the Cancer Cell Line Encyclopedia (CCLE). Therefore, our tool is expected to be useful to improve the study of MIs as a type of genetic variant in the human genome. The source code can be downloaded from: http://cqb.pku.edu.cn/ZhuLab/MID .
Caprioara-Buda, M; Meyer, W; Jeynov, B; Corbisier, P; Trapmann, S; Emons, H
2012-07-01
The reliable quantification of genetically modified organisms (GMOs) by real-time PCR requires, besides thoroughly validated quantitative detection methods, sustainable calibration systems. The latter establish the anchor points for the measured value and the measurement unit, respectively. In this paper, the suitability of two types of DNA calibrants, i.e. plasmid DNA and genomic DNA extracted from plant leaves, for the certification of the GMO content in reference materials as copy number ratio between two targeted DNA sequences was investigated. The PCR efficiencies and coefficients of determination of the calibration curves as well as the measured copy number ratios for three powder certified reference materials (CRMs), namely ERM-BF415e (NK603 maize), ERM-BF425c (356043 soya), and ERM-BF427c (98140 maize), originally certified for their mass fraction of GMO, were compared for both types of calibrants. In all three systems investigated, the PCR efficiencies of plasmid DNA were slightly closer to the PCR efficiencies observed for the genomic DNA extracted from seed powders than to those of the genomic DNA extracted from leaves. Although the mean DNA copy number ratios for each CRM overlapped within their uncertainties, the DNA copy number ratios were significantly different using the two types of calibrants. Based on these observations, both plasmid and leaf genomic DNA calibrants would be technically suitable as anchor points for the calibration of the real-time PCR methods applied in this study. However, the most suitable approach to establish a sustainable traceability chain is to fix a reference system based on plasmid DNA.
Ultrafast Comparison of Personal Genomes via Precomputed Genome Fingerprints.
Glusman, Gustavo; Mauldin, Denise E; Hood, Leroy E; Robinson, Max
2017-01-01
We present an ultrafast method for comparing personal genomes. We transform the standard genome representation (lists of variants relative to a reference) into "genome fingerprints" via locality sensitive hashing. The resulting genome fingerprints can be meaningfully compared even when the input data were obtained using different sequencing technologies, processed using different pipelines, represented in different data formats and relative to different reference versions. Furthermore, genome fingerprints are robust to up to 30% missing data. Because of their reduced size, computation on the genome fingerprints is fast and requires little memory. For example, we could compute all-against-all pairwise comparisons among the 2504 genomes in the 1000 Genomes data set in 67 s at high quality (21 μs per comparison, on a single processor), and achieved a lower quality approximation in just 11 s. Efficient computation enables scaling up a variety of important genome analyses, including quantifying relatedness, recognizing duplicate sequenced genomes in a set, population reconstruction, and many others. The original genome representation cannot be reconstructed from its fingerprint, effectively decoupling genome comparison from genome interpretation; the method thus has significant implications for privacy-preserving genome analytics.
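The general principle of hashing variant lists into small, comparable fingerprints can be sketched with a MinHash-style scheme. This is a deliberately simplified, hypothetical stand-in for the paper's locality-sensitive hashing (the actual fingerprint construction differs); the function names and variant-string encoding are assumptions.

```python
import hashlib

def fingerprint(variants, k=64):
    """MinHash-style fingerprint of a set of variant strings.

    For each of k seeded hash functions, keep the minimum hash
    value over all variants. Matching slots between two
    fingerprints estimate the Jaccard similarity of the sets.
    """
    return [
        min(int(hashlib.sha256(f"{seed}:{v}".encode()).hexdigest(), 16)
            for v in variants)
        for seed in range(k)
    ]

def similarity(fp_a, fp_b):
    """Fraction of matching fingerprint slots (Jaccard estimate)."""
    return sum(a == b for a, b in zip(fp_a, fp_b)) / len(fp_a)

# Hypothetical variant lists: "chrom:pos:ref>alt" strings.
genome_a = {"chr1:100:A>G", "chr2:200:C>T", "chr3:300:G>A"}
genome_b = {"chr1:100:A>G", "chr2:200:C>T", "chr9:900:T>C"}

assert similarity(fingerprint(genome_a), fingerprint(genome_a)) == 1.0
```

Because comparison touches only the fixed-size fingerprints, all-against-all comparison of thousands of genomes scales with k rather than with the number of variants per genome, which is the key to the speed reported above.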
Comparing CNV detection methods for SNP arrays.
Winchester, Laura; Yau, Christopher; Ragoussis, Jiannis
2009-09-01
Data from whole genome association studies can now be used for dual purposes, genotyping and copy number detection. In this review we discuss some of the methods for using SNP data to detect copy number events. We examine a number of algorithms designed to detect copy number changes through the use of signal-intensity data and consider methods to evaluate the changes found. We describe the use of several statistical models in copy number detection in germline samples. We also present a comparison of data using these methods to assess accuracy of prediction and detection of changes in copy number.
FHSA-SED: Two-Locus Model Detection for Genome-Wide Association Study with Harmony Search Algorithm.
Tuo, Shouheng; Zhang, Junying; Yuan, Xiguo; Zhang, Yuanyuan; Liu, Zhaowen
2016-01-01
The two-locus model is a typical significant disease model to be identified in genome-wide association studies (GWAS). Due to the intensive computational burden and the diversity of disease models, existing methods suffer from low detection power, high computation cost, and a preference for some types of disease models. In this study, two scoring functions (the Bayesian-network-based K2-score and the Gini-score) are used to characterize a pair of SNP loci as a candidate model; the two criteria are adopted simultaneously to improve identification power and to tackle the preference for particular disease models. The harmony search algorithm (HSA) is improved to quickly find the most likely candidate models among all two-locus models, in which a local search algorithm with a two-dimensional tabu table is presented to avoid repeatedly evaluating disease models that have strong marginal effects. Finally, the G-test statistic is used to further test the candidate models. We investigate our method, named FHSA-SED, on 82 simulated datasets and a real AMD dataset, and compare it with two typical methods (MACOED and CSE) which have been developed recently based on swarm intelligent search algorithms. The results of simulation experiments indicate that our method outperforms the two compared algorithms in terms of detection power, computation time, evaluation times, sensitivity (TPR), specificity (SPC), positive predictive value (PPV) and accuracy (ACC). Our method has identified two SNPs (rs3775652 and rs10511467) that may be also associated with disease in the AMD dataset.
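A Gini-based score for a two-SNP genotype table can be sketched as the weighted Gini impurity of the case/control split within each joint-genotype cell. This is a hypothetical simplification for illustration; the paper's exact Gini-score and K2-score definitions are not reproduced here, and the table layout is an assumption.

```python
def gini_score(table):
    """Weighted Gini impurity for a two-SNP contingency table.

    table maps (genotype_a, genotype_b) -> (cases, controls).
    Lower score means the joint genotypes separate cases from
    controls more cleanly, i.e. a stronger candidate model.
    """
    total = sum(cases + controls for cases, controls in table.values())
    score = 0.0
    for cases, controls in table.values():
        n = cases + controls
        if n == 0:
            continue
        p = cases / n
        # Gini impurity of this cell, weighted by its share of samples.
        score += (n / total) * (1.0 - p * p - (1.0 - p) * (1.0 - p))
    return score

# A perfectly separating two-locus model scores 0 (no impurity).
assert gini_score({(0, 0): (10, 0), (1, 1): (0, 10)}) == 0.0
```

In the search itself, such a score would be evaluated for each candidate SNP pair proposed by the harmony search, with the tabu table preventing re-evaluation of pairs dominated by strong marginal effects.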
Lara-Ramírez, Edgar E.; Salazar, Ma Isabel; López-López, María de Jesús; Salas-Benito, Juan Santiago; Sánchez-Varela, Alejandro
2014-01-01
The increasing number of dengue virus (DENV) genome sequences available makes it possible to identify the factors contributing to DENV evolution. In the present study, the codon usage in serotypes 1–4 (DENV1–4) has been explored for 3047 sequenced genomes using different statistical methods. The correlation analysis of total GC content (GC) with GC content at the three nucleotide positions of codons (GC1, GC2, and GC3) as well as the effective number of codons (ENC, ENCp) versus GC3 plots revealed mutational bias and purifying selection pressures as the major forces influencing the codon usage, but with distinct pressures on specific nucleotide positions in the codon. The correspondence analysis (CA) and clustering analysis on relative synonymous codon usage (RSCU) within each serotype showed similar clustering patterns to the phylogenetic analysis of nucleotide sequences for DENV1–4. These clustering patterns are strongly related to the virus geographic origin. The phylogenetic dependence analysis also suggests that stabilizing selection acts on the codon usage bias. Our large-scale analysis reveals new features of DENV genomic evolution. PMID:25136631
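The GC1/GC2/GC3 statistics used in such correlation analyses are straightforward to compute from a coding sequence: the GC fraction at each of the three codon positions. A minimal sketch (function name assumed; sequences are illustrative):

```python
def gc_by_codon_position(seq):
    """GC content at codon positions 1, 2, 3 of a coding sequence.

    Returns [GC1, GC2, GC3] as fractions; any trailing partial
    codon is ignored.
    """
    seq = seq.upper()
    usable = len(seq) - len(seq) % 3
    codons = [seq[i:i + 3] for i in range(0, usable, 3)]
    gc_counts = [0, 0, 0]
    for codon in codons:
        for pos, base in enumerate(codon):
            if base in "GC":
                gc_counts[pos] += 1
    n = len(codons)
    return [count / n for count in gc_counts]

# Two codons, ATG and GCA: one G/C base at each position.
assert gc_by_codon_position("ATGGCA") == [0.5, 0.5, 0.5]
```

Plotting ENC against GC3 computed this way is a standard diagnostic: points hugging the expected-ENC curve suggest mutational bias dominates, while large downward deviations suggest selection on codon usage.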
Cloud-based adaptive exon prediction for DNA analysis.
Putluri, Srinivasareddy; Zia Ur Rahman, Md; Fathima, Shaik Yasmeen
2018-02-01
Cloud computing offers significant research and economic benefits to healthcare organisations, and cloud services provide a safe place for storing and managing large amounts of sensitive data. Under the conventional flow of gene information, gene sequence laboratories send out raw and inferred information via the Internet to several sequence libraries. DNA sequencing storage costs can be minimised by use of cloud services. In this study, the authors put forward a novel genomic informatics system using Amazon Cloud Services, in which genomic sequence information is stored and accessed for processing. True identification of exon regions in a DNA sequence is a key task in bioinformatics, which helps in disease identification and drug design. The three-base periodicity property of exons forms the basis of all exon identification techniques. Adaptive signal processing techniques have been found to be promising in comparison with several other methods. Several adaptive exon predictors (AEPs) are developed using the variable normalised least mean square algorithm and its maximum normalised variants to reduce computational complexity. Finally, performance evaluation of the various AEPs is carried out using measures such as sensitivity, specificity and precision on standard genomic datasets taken from the National Center for Biotechnology Information genomic sequence database.
Effects of Yangtze River source water on genomic polymorphisms of male mice detected by RAPD.
Zhang, Xiaolin; Zhang, Zongyao; Zhang, Xuxiang; Wu, Bing; Zhang, Yan; Yang, Liuyan; Cheng, Shupei
2010-02-01
In order to evaluate the environmental health risk of drinking water from the Yangtze River source, randomly amplified polymorphic DNA (RAPD) markers were used to detect the effects of the source water on genomic polymorphisms of hepatic cells of male mice (Mus musculus, ICR). After the mice were fed with source water for 90 days, RAPD-polymerase chain reactions (PCRs) were performed on hepatic genomic DNA using 20 arbitrary primers. In total, 189 loci were generated, including 151 polymorphic loci. On average, one PCR primer produced 5.3, 4.9 and 4.8 bands for each mouse in the control group and the groups fed with source water and BaP solution, respectively. Compared with the control, feeding mice with Yangtze River source water caused 33 new loci to appear and 19 to disappear. Statistical analysis of the RAPD fingerprints revealed that Yangtze River source water exerted a significant influence on the hepatic genomic polymorphisms of male mice. This study suggests that RAPD is a reliable and sensitive method for assessing the environmental health risk of Yangtze River source water.
Gene-gene and gene-environment interactions defining lipid-related traits.
Ordovás, José M; Robertson, Ruairi; Cléirigh, Ellen Ní
2011-04-01
Steps towards reducing chronic disease progression are continuously being taken through genomic research. Studies over the last year have highlighted more and more polymorphisms, pathways and interactions responsible for metabolic disorders such as cardiovascular disease, obesity and dyslipidemia. Many of these chronic illnesses can be partially blamed on altered lipid metabolism, combined with individual genetic components. Critical evaluation and comparison of these recent studies is essential in order to comprehend the results, conclusions and future prospects in the field of genomics as a whole. Recent literature elucidates significant gene-diet and gene-environment interactions resulting in altered lipid metabolism, inflammation and other metabolic imbalances leading to cardiovascular disease and obesity. Epigenetic and epistatic interactions are now becoming more significantly associated with such disorders, as genomic research digs deeper into the complex nature of genetic individuality and heritability. The vast array of data collected from genome-wide association studies must now be leveraged and explored through more complex interaction studies, using standardized methods and larger sample sizes. In doing so, the etiology of chronic disease progression will be further understood.
Genome hypermethylation in Pinus silvestris of Chernobyl--a mechanism for radiation adaptation?
Kovalchuk, Olga; Burke, Paula; Arkhipov, Andrey; Kuchma, Nikolaj; James, S Jill; Kovalchuk, Igor; Pogribny, Igor
2003-08-28
Adaptation is a complex process by which populations of organisms respond to long-term environmental stresses by permanent genetic change. Here we present data from the natural "open-field" radiation adaptation experiment after the Chernobyl accident and provide the first evidence of the involvement of epigenetic changes in the adaptation of a eukaryote, Scots pine (Pinus silvestris), to chronic radiation exposure. We have evaluated global genome methylation of control and radiation-exposed pine trees using a method based on cleavage by a methylation-sensitive HpaII restriction endonuclease that leaves a 5' guanine overhang and subsequent single nucleotide extension with labeled [3H] dCTP. We have found that genomic DNA of exposed pine trees was considerably hypermethylated. Moreover, hypermethylation appeared to be dependent upon the radiation dose absorbed by the trees. Such hypermethylation may be viewed as a defense strategy of plants that prevents genome instability and reshuffling of the hereditary material, allowing survival in an extreme environment. Further studies are clearly needed to analyze in detail the involvement of DNA methylation and other epigenetic mechanisms in the complex process of radiation stress and adaptive response.
Protein structure database search and evolutionary classification.
Yang, Jinn-Moon; Tung, Chi-Hua
2006-01-01
As more protein structures become available and structural genomics efforts provide structural models in a genome-wide strategy, there is a growing need for fast and accurate methods for discovering homologous proteins and evolutionary classifications of newly determined structures. We have developed 3D-BLAST, in part, to address these issues. 3D-BLAST is as fast as BLAST and calculates the statistical significance (E-value) of an alignment to indicate the reliability of the prediction. Using this method, we first identified 23 states of the structural alphabet that represent pattern profiles of the backbone fragments and then used them to represent protein structure databases as structural alphabet sequence databases (SADB). Our method enhanced BLAST as a search method, using a new structural alphabet substitution matrix (SASM) to find the longest common substructures with high-scoring structured segment pairs from an SADB database. Using personal computers with Intel Pentium4 (2.8 GHz) processors, our method searched more than 10 000 protein structures in 1.3 s and achieved a good agreement with search results from detailed structure alignment methods. [3D-BLAST is available at http://3d-blast.life.nctu.edu.tw].
Protecting genomic data analytics in the cloud: state of the art and opportunities.
Tang, Haixu; Jiang, Xiaoqian; Wang, Xiaofeng; Wang, Shuang; Sofia, Heidi; Fox, Dov; Lauter, Kristin; Malin, Bradley; Telenti, Amalio; Xiong, Li; Ohno-Machado, Lucila
2016-10-13
The outsourcing of genomic data into public cloud computing settings raises concerns over privacy and security. Significant advancements in secure computation methods have emerged over the past several years, but such techniques need to be rigorously evaluated for their ability to support the analysis of human genomic data in an efficient and cost-effective manner. With respect to public cloud environments, there are concerns about the inadvertent exposure of human genomic data to unauthorized users. In analyses involving multiple institutions, there is additional concern about data being used beyond the agreed research scope and being processed in untrusted computational environments, which may not satisfy institutional policies. To systematically investigate these issues, the NIH-funded National Center for Biomedical Computing iDASH (integrating Data for Analysis, 'anonymization' and SHaring) hosted the second Critical Assessment of Data Privacy and Protection competition to assess the capacity of cryptographic technologies for protecting computation over human genomes in the cloud and promoting cross-institutional collaboration. Data scientists were challenged to design and engineer practical algorithms for secure outsourcing of genome computation tasks in working software, whereby analyses are performed only on encrypted data. They were also challenged to develop approaches to enable secure collaboration on data from genomic studies generated by multiple organizations (e.g., medical centers) to jointly compute aggregate statistics without sharing individual-level records. The results of the competition indicated that secure computation techniques can enable comparative analysis of human genomes, but greater efficiency (in terms of compute time and memory utilization) is needed before they are sufficiently practical for real world environments.
BAC sequencing using pooled methods.
Saski, Christopher A; Feltus, F Alex; Parida, Laxmi; Haiminen, Niina
2015-01-01
Shotgun sequencing and assembly of a large, complex genome can be expensive, and accurately reconstructing the true genome sequence is challenging. Repetitive DNA arrays, paralogous sequences, polyploidy, and heterozygosity are the main factors that plague de novo genome sequencing projects, which typically result in highly fragmented assemblies from which it is difficult to extract biological meaning. Targeted, sub-genomic sequencing offers complexity reduction by removing distal segments of the genome, and BAC sequencing provides a systematic mechanism for exploring prioritized genomic content. If one isolates and sequences only the genome fraction that encodes the relevant biological information, overall sequencing costs and effort can be reduced. This chapter describes the sub-genome assembly protocol for an organism based upon a BAC tiling path derived from a genome-scale physical map or from fine mapping using BACs to target sub-genomic regions. Methods that are described include BAC isolation and mapping, DNA sequencing, and sequence assembly.
De novo assembly of a haplotype-resolved human genome.
Cao, Hongzhi; Wu, Honglong; Luo, Ruibang; Huang, Shujia; Sun, Yuhui; Tong, Xin; Xie, Yinlong; Liu, Binghang; Yang, Hailong; Zheng, Hancheng; Li, Jian; Li, Bo; Wang, Yu; Yang, Fang; Sun, Peng; Liu, Siyang; Gao, Peng; Huang, Haodong; Sun, Jing; Chen, Dan; He, Guangzhu; Huang, Weihua; Huang, Zheng; Li, Yue; Tellier, Laurent C A M; Liu, Xiao; Feng, Qiang; Xu, Xun; Zhang, Xiuqing; Bolund, Lars; Krogh, Anders; Kristiansen, Karsten; Drmanac, Radoje; Drmanac, Snezana; Nielsen, Rasmus; Li, Songgang; Wang, Jian; Yang, Huanming; Li, Yingrui; Wong, Gane Ka-Shu; Wang, Jun
2015-06-01
The human genome is diploid, and knowledge of the variants on each chromosome is important for the interpretation of genomic information. Here we report the assembly of a haplotype-resolved diploid genome without using a reference genome. Our pipeline relies on fosmid pooling together with whole-genome shotgun strategies, based solely on next-generation sequencing and hierarchical assembly methods. We applied our sequencing method to the genome of an Asian individual and generated a 5.15-Gb assembled genome with a haplotype N50 of 484 kb. Our analysis identified previously undetected indels and 7.49 Mb of novel coding sequences that could not be aligned to the human reference genome, which include at least six predicted genes. This haplotype-resolved genome represents the most complete de novo human genome assembly to date. Application of our approach to identify individual haplotype differences should aid in translating genotypes to phenotypes for the development of personalized medicine.
Using Partial Genomic Fosmid Libraries for Sequencing CompleteOrganellar Genomes
DOE Office of Scientific and Technical Information (OSTI.GOV)
McNeal, Joel R.; Leebens-Mack, James H.; Arumuganathan, K.
2005-08-26
Organellar genome sequences provide numerous phylogenetic markers and yield insight into organellar function and molecular evolution. These genomes are much smaller in size than their nuclear counterparts; thus, their complete sequencing is much less expensive than total nuclear genome sequencing, making broader phylogenetic sampling feasible. However, for some organisms it is challenging to isolate plastid DNA for sequencing using standard methods. To overcome these difficulties, we constructed partial genomic libraries from total DNA preparations of two heterotrophic and two autotrophic angiosperm species using fosmid vectors. We then used macroarray screening to isolate clones containing large fragments of plastid DNA. A minimum tiling path of clones comprising the entire genome sequence of each plastid was selected, and these clones were shotgun-sequenced and assembled into complete genomes. Although this method worked well for both heterotrophic and autotrophic plants, nuclear genome size had a dramatic effect on the proportion of screened clones containing plastid DNA and, consequently, the overall number of clones that must be screened to ensure full plastid genome coverage. This technique makes it possible to determine complete plastid genome sequences for organisms that defy other available organellar genome sequencing methods, especially those for which limited amounts of tissue are available.
USDA-ARS?s Scientific Manuscript database
Whole genome tandem repeat polymorphisms were evaluated between two closely related Xylella fastidiosa strains, M23 and Temecula1, both cause almond leaf scorch disease (ALSD) and grape Pierce’s disease (PD) in California. Strain M23 was isolated from almond and the genome was sequenced in this stu...
Applications of Genomics to Genetic Improvement of Dairy Cattle
USDA-ARS?s Scientific Manuscript database
Implementation of genomic evaluation has caused profound changes in dairy cattle breeding. All young bulls bought by major artificial-insemination (AI) organizations now are selected based on such evaluations. Evaluation reliability can reach about 75% for yield traits, which is adequate for marketi...
Malmberg, M Michelle; Shi, Fan; Spangenberg, German C; Daetwyler, Hans D; Cogan, Noel O I
2018-01-01
Intensive breeding of Brassica napus has resulted in relatively low diversity, such that B. napus would benefit from germplasm improvement schemes that sustain diversity. As such, samples representative of global germplasm pools need to be assessed for existing population structure, diversity and linkage disequilibrium (LD). Complexity reduction genotyping-by-sequencing (GBS) methods, including GBS-transcriptomics (GBS-t), enable cost-effective screening of a large number of samples, while whole genome re-sequencing (WGR) delivers the ability to generate large numbers of unbiased genomic single nucleotide polymorphisms (SNPs), and identify structural variants (SVs). Furthermore, the development of genomic tools based on whole genomes representative of global oilseed diversity and orientated by the reference genome has substantial industry relevance and will be highly beneficial for canola breeding. As recent studies have focused on European and Chinese varieties, a global diversity panel as well as a substantial number of Australian spring types were included in this study. Focusing on industry relevance, 633 varieties were initially genotyped using GBS-t to examine population structure using 61,037 SNPs. Subsequently, 149 samples representative of global diversity were selected for WGR and both data sets used for a side-by-side evaluation of diversity and LD. The WGR data was further used to develop genomic resources consisting of a list of 4,029,750 high-confidence SNPs annotated using SnpEff, and SVs in the form of 10,976 deletions and 2,556 insertions. These resources form the basis of a reliable and repeatable system allowing greater integration between canola genomics studies, with a strong focus on breeding germplasm and industry applicability.
Solving the problem of comparing whole bacterial genomes across different sequencing platforms.
Kaas, Rolf S; Leekitcharoenphon, Pimlapas; Aarestrup, Frank M; Lund, Ole
2014-01-01
Whole genome sequencing (WGS) shows great potential for real-time monitoring and identification of infectious disease outbreaks. However, rapid and reliable comparison of data generated in multiple laboratories and using multiple technologies is essential. So far, studies have focused on using one technology because each technology has a systematic bias, making integration of data generated from different platforms difficult. We developed two different procedures for identifying variable sites and inferring phylogenies in WGS data across multiple platforms. The methods were evaluated on three bacterial data sets, each sequenced on three different platforms (Illumina, 454, Ion Torrent). We show that the methods are able to overcome the systematic biases caused by the sequencers and infer the expected phylogenies. We conclude that the success of these new procedures is due to a validation of all informative sites that are included in the analysis. The procedures are available as web tools.
G-cimp status prediction of glioblastoma samples using mRNA expression data.
Baysan, Mehmet; Bozdag, Serdar; Cam, Margaret C; Kotliarova, Svetlana; Ahn, Susie; Walling, Jennifer; Killian, Jonathan K; Stevenson, Holly; Meltzer, Paul; Fine, Howard A
2012-01-01
Glioblastoma Multiforme (GBM) is a tumor with high mortality and no known cure. The dramatic molecular and clinical heterogeneity seen in this tumor has led to attempts to define genetically similar subgroups of GBM with the hope of developing tumor specific therapies targeted to the unique biology within each of these subgroups. Recently, a subset of relatively favorable prognosis GBMs has been identified. These glioma CpG island methylator phenotype, or G-CIMP tumors, have distinct genomic copy number aberrations, DNA methylation patterns, and (mRNA) expression profiles compared to other GBMs. While the standard method for identifying G-CIMP tumors is based on genome-wide DNA methylation data, such data is often not available compared to the more widely available gene expression data. In this study, we have developed and evaluated a method to predict the G-CIMP status of GBM samples based solely on gene expression data. PMID:23139755
Han, Joan C.; Elsea, Sarah H.; Pena, Heloísa B.; Pena, Sérgio Danilo Junho
2013-01-01
Detection of human microdeletion and microduplication syndromes poses a significant burden on public healthcare systems in developing countries. With genome-wide diagnostic assays frequently inaccessible, targeted low-cost PCR-based approaches are preferred. However, their reproducibility depends on equally efficient amplification using a number of target and control primers. To address this, the recently described technique called Microdeletion/Microduplication Quantitative Fluorescent PCR (MQF-PCR) was shown to reliably detect four human syndromes by quantifying DNA amplification in an internally controlled PCR reaction. Here, we confirm its utility in the detection of eight human microdeletion syndromes, including the more common WAGR, Smith-Magenis, and Potocki-Lupski syndromes, with 100% sensitivity and 100% specificity. We present the selection, design, and performance evaluation of detection primers using a variety of approaches. We conclude that MQF-PCR is an easily adaptable method for detection of human pathological chromosomal aberrations. PMID:24288428
Effectiveness of liquid soap and hand sanitizer against Norwalk virus on contaminated hands.
Liu, Pengbo; Yuen, Yvonne; Hsiao, Hui-Mien; Jaykus, Lee-Ann; Moe, Christine
2010-01-01
Disinfection is an essential measure for interrupting human norovirus (HuNoV) transmission, but it is difficult to evaluate the efficacy of disinfectants due to the absence of a practicable cell culture system for these viruses. The purpose of this study was to screen sodium hypochlorite and ethanol for efficacy against Norwalk virus (NV) and expand the studies to evaluate the efficacy of antibacterial liquid soap and alcohol-based hand sanitizer for the inactivation of NV on human finger pads. Samples were tested by real-time reverse transcription-quantitative PCR (RT-qPCR) both with and without a prior RNase treatment. In the suspension assay, sodium hypochlorite concentrations of ≥160 ppm effectively eliminated the RT-qPCR detection signal, while ethanol, regardless of concentration, was relatively ineffective, giving at most a 0.5 log(10) reduction in genomic copies of NV cDNA. Using the American Society for Testing and Materials (ASTM) standard finger pad method and a modification thereof (with rubbing), we observed the greatest reduction in genomic copies of NV cDNA with the antibacterial liquid soap treatment (0.67 to 1.20 log(10) reduction) and water rinse only (0.58 to 1.58 log(10) reduction). The alcohol-based hand sanitizer was relatively ineffective, reducing the genomic copies of NV cDNA by only 0.14 to 0.34 log(10) compared to baseline. Although the concentrations of genomic copies of NV cDNA were consistently lower in finger pad eluates pretreated with RNase compared to those without prior RNase treatment, these differences were not statistically significant. Despite the promise of alcohol-based sanitizers for the control of pathogen transmission, they may be relatively ineffective against HuNoV, reinforcing the need to develop and evaluate new products against this important group of viruses.
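The log(10) reduction values quoted above are simply the base-10 logarithm of the ratio of genomic copies before and after treatment. A minimal sketch (function name and example copy numbers are illustrative, not data from the study):

```python
import math

def log10_reduction(copies_before, copies_after):
    """Log10 reduction in genomic copies after a treatment.

    A 1.0 log10 reduction corresponds to a 10-fold drop, 2.0 to a
    100-fold drop, and so on.
    """
    return math.log10(copies_before / copies_after)

# e.g. a drop from 1e6 to 1e5 genomic copies is a 1.0 log10 reduction
assert abs(log10_reduction(1e6, 1e5) - 1.0) < 1e-9
```

Framed this way, the reported 0.14 to 0.34 log(10) reduction for hand sanitizer corresponds to removing only roughly 28-54% of genomic copies, which makes the contrast with the soap and water treatments concrete.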
A BAC clone fingerprinting approach to the detection of human genome rearrangements
Krzywinski, Martin; Bosdet, Ian; Mathewson, Carrie; Wye, Natasja; Brebner, Jay; Chiu, Readman; Corbett, Richard; Field, Matthew; Lee, Darlene; Pugh, Trevor; Volik, Stas; Siddiqui, Asim; Jones, Steven; Schein, Jacquie; Collins, Collin; Marra, Marco
2007-01-01
We present a method, called fingerprint profiling (FPP), that uses restriction digest fingerprints of bacterial artificial chromosome clones to detect and classify rearrangements in the human genome. The approach uses alignment of experimental fingerprint patterns to in silico digests of the sequence assembly and is capable of detecting micro-deletions (1-5 kb) and balanced rearrangements. Our method has compelling potential for use as a whole-genome method for the identification and characterization of human genome rearrangements. PMID:17953769
Wright, O R L
2014-06-01
This review examines knowledge and confidence of nutrition and dietetics professionals in nutritional genomics and evaluates the teaching strategies in this field within nutrition and dietetics university programmes and professional development courses internationally. A systematic search of 10 literature databases was conducted from January 2000 to December 2012 to identify original research. Any studies of either nutrition and/or dietetics students or dietitians/nutritionists investigating current levels of knowledge or confidence in nutritional genomics, or strategies to improve learning and/or confidence in this area, were eligible. Eighteen articles (15 separate studies) met the inclusion criteria. Three articles were assessed as negative, eight as neutral and seven as positive according to the American Dietetic Association Quality Criteria Checklist. The overall ranking of evidence was low. Dietitians have low involvement, knowledge and confidence in nutritional genomics, and evidence for educational strategies is limited and methodologically weak. There is a need to develop training pathways and material to up-skill nutrition and/or dietetics students and professionals in nutritional genomics through multidisciplinary collaboration with content area experts. There is a paucity of high-quality evidence on optimum teaching strategies; however, methods promoting repetitive exposure to nutritional genomics material, problem-solving, and collaborative and case-based learning are most promising for university and professional development programmes. © 2013 The British Dietetic Association Ltd.
Three invariant Hi-C interaction patterns: Applications to genome assembly.
Oddes, Sivan; Zelig, Aviv; Kaplan, Noam
2018-06-01
Assembly of reference-quality genomes from next-generation sequencing data is a key challenge in genomics. Recently, we and others have shown that Hi-C data can be used to address several outstanding challenges in the field of genome assembly. This principle has since been developed in academia and industry, and has been used in the assembly of several major genomes. In this paper, we explore the central principles underlying Hi-C-based assembly approaches, by quantitatively defining and characterizing three invariant Hi-C interaction patterns on which these approaches can build: Intrachromosomal interaction enrichment, distance-dependent interaction decay and local interaction smoothness. Specifically, we evaluate to what degree each invariant pattern holds on a single locus level in different species, cell types and Hi-C map resolutions. We find that these patterns are generally consistent across species and cell types but are affected by sequencing depth, and that matrix balancing improves consistency of loci with all three invariant patterns. Finally, we overview current Hi-C-based assembly approaches in light of these invariant patterns and demonstrate how local interaction smoothness can be used to easily detect scaffolding errors in extremely sparse Hi-C maps. We suggest that simultaneously considering all three invariant patterns may lead to better Hi-C-based genome assembly methods. Copyright © 2018 Elsevier Inc. All rights reserved.
Breeding and Genetics Symposium: really big data: processing and analysis of very large data sets.
Cole, J B; Newman, S; Foertter, F; Aguilar, I; Coffey, M
2012-03-01
Modern animal breeding data sets are large and getting larger, due in part to recent availability of high-density SNP arrays and cheap sequencing technology. High-performance computing methods for efficient data warehousing and analysis are under development. Financial and security considerations are important when using shared clusters. Sound software engineering practices are needed, and it is better to use existing solutions when possible. Storage requirements for genotypes are modest, although full-sequence data will require greater storage capacity. Storage requirements for intermediate and results files for genetic evaluations are much greater, particularly when multiple runs must be stored for research and validation studies. The greatest gains in accuracy from genomic selection have been realized for traits of low heritability, and there is increasing interest in new health and management traits. The collection of sufficient phenotypes to produce accurate evaluations may take many years, and high-reliability proofs for older bulls are needed to estimate marker effects. Data mining algorithms applied to large data sets may help identify unexpected relationships in the data, and improved visualization tools will provide insights. Genomic selection using large data requires a lot of computing power, particularly when large fractions of the population are genotyped. Theoretical improvements have made possible the inversion of large numerator relationship matrices, permitted the solving of large systems of equations, and produced fast algorithms for variance component estimation. Recent work shows that single-step approaches combining BLUP with a genomic relationship (G) matrix have similar computational requirements to traditional BLUP, and the limiting factor is the construction and inversion of G for many genotypes. 
A naïve algorithm for creating G for 14,000 individuals required almost 24 h to run, but custom libraries and parallel computing reduced that to 15 min. Large data sets also create challenges for the delivery of genetic evaluations that must be overcome in a way that does not disrupt the transition from conventional to genomic evaluations. Processing time is important, especially as real-time systems for on-farm decisions are developed. The ultimate value of these systems is to decrease time-to-results in research, increase accuracy in genomic evaluations, and accelerate rates of genetic improvement.
NASA Astrophysics Data System (ADS)
Yoshikazu, Kawata; Shin-Ichi, Yano; Hiroyuki, Kojima
1998-03-01
An efficient and simple method for constructing a genomic DNA library using a TA cloning vector is presented. It is based on the sonicative cleavage of genomic DNA and modification of fragment ends with Taq DNA polymerase, followed by ligation using a TA vector. This method was applied to clone the phytoene synthase gene crtB from Spirulina platensis. The method is useful when genomic DNA cannot be efficiently digested with restriction enzymes, a problem often encountered during the construction of a genomic DNA library of cyanobacteria.
78 FR 56905 - National Human Genome Research Institute; Notice of Closed Meeting
Federal Register 2010, 2011, 2012, 2013, 2014
2013-09-16
... DEPARTMENT OF HEALTH AND HUMAN SERVICES National Institutes of Health National Human Genome... clearly unwarranted invasion of personal privacy. Name of Committee: National Human Genome Research....m. Agenda: To review and evaluate grant applications. Place: National Human Genome Research...
77 FR 58402 - National Human Genome Research Institute; Notice of Closed Meetings
Federal Register 2010, 2011, 2012, 2013, 2014
2012-09-20
... DEPARTMENT OF HEALTH AND HUMAN SERVICES National Institutes of Health National Human Genome... clearly unwarranted invasion of personal privacy. Name of Committee: National Human Genome Research...: To review and evaluate grant applications. Place: National Human Genome Research Institute, 5635...
78 FR 107 - National Human Genome Research Institute; Notice of Closed Meeting
Federal Register 2010, 2011, 2012, 2013, 2014
2013-01-02
... DEPARTMENT OF HEALTH AND HUMAN SERVICES National Institutes of Health National Human Genome... evaluate grant applications. Place: National Human Genome Research Institute, 3rd Floor Conference Room....D., Scientific Review Officer, Scientific Review Branch, National Human Genome Research Institute...
Castellana, Stefano; Fusilli, Caterina; Mazzoccoli, Gianluigi; Biagini, Tommaso; Capocefalo, Daniele; Carella, Massimo; Vescovi, Angelo Luigi; Mazza, Tommaso
2017-06-01
There are 24,189 possible non-synonymous amino acid changes that could potentially affect the human mitochondrial DNA. Only a tiny subset has been functionally evaluated with certainty so far, while the pathogenicity of the vast majority has been assessed only in silico by software predictors. Since these tools have proved to be rather incongruent, we have designed and implemented APOGEE, a machine-learning algorithm that outperforms all existing prediction methods in estimating the harmfulness of mitochondrial non-synonymous genome variations. We provide a detailed description of the underlying algorithm, of the selected and manually curated training and test sets of variants, and of its classification ability.
An Exact Algorithm to Compute the Double-Cut-and-Join Distance for Genomes with Duplicate Genes.
Shao, Mingfu; Lin, Yu; Moret, Bernard M E
2015-05-01
Computing the edit distance between two genomes is a basic problem in the study of genome evolution. The double-cut-and-join (DCJ) model has formed the basis for most algorithmic research on rearrangements over the last few years. The edit distance under the DCJ model can be computed in linear time for genomes without duplicate genes, while the problem becomes NP-hard in the presence of duplicate genes. In this article, we propose an integer linear programming (ILP) formulation to compute the DCJ distance between two genomes with duplicate genes. We also provide an efficient preprocessing approach to simplify the ILP formulation while preserving optimality. Comparison on simulated genomes demonstrates that our method outperforms MSOAR in computing the edit distance, especially when the genomes contain long duplicated segments. We also apply our method to assign orthologous gene pairs among human, mouse, and rat genomes, where once again our method outperforms MSOAR.
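The linear-time case mentioned above (no duplicate genes) rests on counting cycles in the adjacency graph: for two circular genomes over the same n genes, the DCJ distance is n minus the number of cycles. A minimal sketch for signed circular genomes, illustrating that classic case rather than the authors' ILP for duplicated genes:

```python
def dcj_distance_circular(genome_a, genome_b):
    """DCJ distance between two circular genomes on the same gene set, no
    duplicates: d = n - c, where c is the number of cycles in the adjacency
    graph built from both genomes' gene adjacencies."""
    def adjacency_map(genome):
        # Each gene has a tail 't' and a head 'h'; a signed gene's
        # forward-facing extremity is its head if positive, tail if negative.
        nbr, n = {}, len(genome)
        for i in range(n):
            g, h = genome[i], genome[(i + 1) % n]
            left = (abs(g), 'h' if g > 0 else 't')
            right = (abs(h), 't' if h > 0 else 'h')
            nbr[left], nbr[right] = right, left
        return nbr

    nbr_a, nbr_b = adjacency_map(genome_a), adjacency_map(genome_b)
    # Every extremity has degree 2 (one adjacency per genome), so the graph
    # is a disjoint union of cycles; count them as connected components.
    seen, cycles = set(), 0
    for start in nbr_a:
        if start in seen:
            continue
        cycles += 1
        stack = [start]
        while stack:
            node = stack.pop()
            if node in seen:
                continue
            seen.add(node)
            stack.extend((nbr_a[node], nbr_b[node]))
    return len(genome_a) - cycles
```

For example, identical genomes give distance 0, and a single reversal ([1, 2, 3, 4] versus [1, -3, -2, 4]) gives distance 1.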
Gene context analysis in the Integrated Microbial Genomes (IMG) data management system.
Mavromatis, Konstantinos; Chu, Ken; Ivanova, Natalia; Hooper, Sean D; Markowitz, Victor M; Kyrpides, Nikos C
2009-11-24
Computational methods for determining the function of genes in newly sequenced genomes have traditionally been based on sequence similarity to genes whose function has been identified experimentally. Function prediction methods can be extended using gene context analysis approaches such as examining the conservation of chromosomal gene clusters, gene fusion events and co-occurrence profiles across genomes. Context analysis is based on the observation that functionally related genes often have similar gene contexts, and it relies on the identification of such events across a phylogenetically diverse collection of genomes. We have used the data management system of the Integrated Microbial Genomes (IMG) as the framework to implement and explore the power of gene context analysis methods because it provides one of the largest available genome integrations. Visualization and search tools to facilitate gene context analysis have been developed and applied across all publicly available archaeal and bacterial genomes in IMG. These computations are now maintained as part of IMG's regular genome content update cycle. IMG is available at: http://img.jgi.doe.gov.
Fu, Yong-Bi; Peterson, Gregory W; Dong, Yibo
2016-04-07
Genotyping-by-sequencing (GBS) has emerged as a useful genomic approach for exploring genome-wide genetic variation. However, GBS commonly samples a genome unevenly and can generate a substantial amount of missing data. These technical features limit the power of various GBS-based genetic and genomic analyses. Here we present software called IgCoverage for in silico evaluation of genomic coverage through GBS with an individual restriction enzyme or pair of restriction enzymes on one sequenced genome, and report a new set of 21 restriction enzyme combinations that can be applied to enhance GBS applications. These enzyme combinations were developed through an application of IgCoverage on 22 plant, animal, and fungus species with sequenced genomes, and some of them were empirically evaluated with different runs of Illumina MiSeq sequencing in 12 plant species. The in silico analysis of 22 organisms revealed up to eight times more genome coverage for the new combinations consisting of paired four- or five-cutter restriction enzymes than for the commonly used enzyme combination PstI + MspI. The empirical evaluation of the new enzyme combination (HinfI + HpyCH4IV) in 12 plant species showed 1.7-6 times more genome coverage than PstI + MspI, and 2.3 times more genome coverage in dicots than monocots. Also, the SNP genotyping in 12 Arabidopsis and 12 rice plants revealed that HinfI + HpyCH4IV generated 7 and 1.3 times more SNPs (with 0-16.7% missing observations) than PstI + MspI, respectively. These findings demonstrate that these novel enzyme combinations can be utilized to increase genome sampling and improve SNP genotyping in various GBS applications. Copyright © 2016 Fu et al.
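The kind of in silico coverage evaluation IgCoverage performs can be approximated in a few lines: find the cut sites of an enzyme pair, form the digest fragments, and report the fraction of the genome falling in fragments of sequenceable size. A toy sketch, not the IgCoverage implementation (real tools also handle ambiguous recognition bases, cut-site offsets, and size-selection details ignored here):

```python
import re

def cut_positions(seq, site):
    # Start positions of each recognition site; enzymes cut at an offset
    # within the site, which shifts boundaries slightly but not the picture.
    return [m.start() for m in re.finditer(site, seq)]

def gbs_coverage(seq, sites, min_len=100, max_len=400):
    """Fraction of a linear sequence contained in double-digest fragments
    whose length falls within a sequenceable size window."""
    cuts = sorted({p for s in sites for p in cut_positions(seq, s)})
    bounds = [0] + cuts + [len(seq)]
    covered = sum(e - b for b, e in zip(bounds, bounds[1:])
                  if min_len <= e - b <= max_len)
    return covered / len(seq)

# PstI (CTGCAG) + MspI (CCGG), the common combination named above, on a
# contrived 410 bp sequence; the 50 bp leading fragment is too short to count.
toy = "A" * 50 + "CTGCAG" + "A" * 200 + "CCGG" + "A" * 150
print(round(gbs_coverage(toy, ["CTGCAG", "CCGG"]), 3))  # prints 0.878
```

Running this over a whole assembled genome, per enzyme combination, gives the comparative coverage figures the abstract reports.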
Muñoz-San Martín, Catalina; Apt, Werner; Zulantay, Inés
2017-04-01
The protozoan Trypanosoma cruzi is the causative agent of Chagas disease, a major public health problem in Latin America. This parasite has a complex population structure comprising six or seven major evolutionary lineages (discrete typing units or DTUs), TcI-TcVI and TcBat, some of which have apparently resulted from ancient hybridization events. Because of the existence of significant biological differences between these lineages, strain characterization methods have been essential to study T. cruzi in its different vectors and hosts. However, available methods can be laborious and costly, or limited in resolution or sensitivity. In this study, a new genotyping strategy using real-time PCR to identify each of the six DTUs in clinical blood samples has been developed and evaluated. Two nuclear (SL-IR and 18S rDNA) and two mitochondrial genes (COII and ND1) were selected to develop original primers. The method was evaluated with eight genomic DNAs of T. cruzi populations belonging to the six DTUs, one genomic DNA of Trypanosoma rangeli, and 53 blood samples from individuals with chronic Chagas disease. The assays had an analytical sensitivity of 1-25 fg of DNA per reaction tube depending on the DTU analyzed. The selectivity of trials with 20 fg/μL of genomic DNA identified each DTU, excluding non-target DTUs in every test. The method was able to characterize 67.9% of the chronically infected clinical samples, with high detection of TcII followed by TcI. With the proposed original genotyping methodology, each DTU was established with high sensitivity after a single real-time PCR assay. This novel protocol reduces carryover contamination, enables detection of each DTU independently and, in the future, the quantification of each DTU in clinical blood samples. Copyright © 2017 Elsevier B.V. All rights reserved.
A Distance Measure for Genome Phylogenetic Analysis
NASA Astrophysics Data System (ADS)
Cao, Minh Duc; Allison, Lloyd; Dix, Trevor
Phylogenetic analyses of species based on single genes or parts of the genomes are often inconsistent because of factors such as variable rates of evolution and horizontal gene transfer. The availability of more and more sequenced genomes allows phylogeny construction from complete genomes, which is less sensitive to such inconsistency. For such long sequences, construction methods like maximum parsimony and maximum likelihood are often not feasible due to their intensive computational requirements. Another class of tree construction methods, namely distance-based methods, requires a measure of the distance between any two genomes. Some measures, such as the evolutionary edit distance of gene order and gene content, are computationally expensive or do not perform well when the gene content of the organisms is similar. This study presents an information-theoretic measure of genetic distance between genomes based on the biological compression algorithm expert model. We demonstrate that our distance measure can be applied to reconstruct the consensus phylogenetic tree of a number of Plasmodium parasites from their genomes, the statistical bias of which would mislead conventional analysis methods. Our approach is also used to successfully construct a plausible evolutionary tree for the γ-Proteobacteria group, whose genomes are known to contain many horizontally transferred genes.
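The compression-based distance idea above can be illustrated with the normalized compression distance, which measures how much two sequences tell a compressor about each other. The study uses the expert-model DNA compressor; the sketch below substitutes zlib purely to show the principle, so absolute values will differ:

```python
import zlib

def comp_size(data: bytes) -> int:
    # Compressed length as a crude stand-in for Kolmogorov complexity.
    return len(zlib.compress(data, 9))

def ncd(x: bytes, y: bytes) -> float:
    """Normalized compression distance: near 0 for closely related
    sequences, near 1 for unrelated ones (compressor-dependent)."""
    cx, cy = comp_size(x), comp_size(y)
    return (comp_size(x + y) - min(cx, cy)) / max(cx, cy)
```

Related genomes compress well jointly, so their pairwise NCD values are small; a matrix of such distances can feed any standard distance-based tree builder.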
Comparison of Methods of Detection of Exceptional Sequences in Prokaryotic Genomes.
Rusinov, I S; Ershova, A S; Karyagina, A S; Spirin, S A; Alexeevski, A V
2018-02-01
Many proteins need recognition of specific DNA sequences for functioning. The number of recognition sites and their distribution along the DNA might be of biological importance. For example, the number of restriction sites is often reduced in prokaryotic and phage genomes to decrease the probability of DNA cleavage by restriction endonucleases. We call a sequence exceptional if its frequency in a genome significantly differs from the one predicted by some mathematical model. An exceptional sequence could be either under- or over-represented, depending on its frequency in comparison with the predicted one. Exceptional sequences could be considered biologically meaningful, for example, as targets of DNA-binding proteins or as parts of abundant repetitive elements. Several methods to predict the frequency of a short sequence in a genome, based on actual frequencies of certain of its subsequences, are in use. The most popular are methods based on Markov chain models, but no rigorous comparison of the methods has previously been performed. We compared three methods for the prediction of short sequence frequencies: the maximum-order Markov chain model-based method, the method that uses the geometric mean of extended Markovian estimates, and the method that utilizes the frequencies of all subsequences, including discontiguous ones. We applied them to restriction sites in the complete genomes of 2500 prokaryotic species and demonstrated that the results depend greatly on the method used: lists of the 5% most under-represented sites differed by up to 50%. The method designed by Burge and coauthors in 1992, which utilizes all subsequences of the sequence, showed higher precision than the other two methods, both on prokaryotic genomes and on randomly generated sequences after computational imitation of selective pressure. We propose this method as the first choice for detection of exceptional sequences in prokaryotic genomes.
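The maximum-order Markov chain estimate named above predicts a k-mer's count from the counts of its two (k-1)-mers and their shared (k-2)-mer. A minimal sketch of that estimate and the resulting under/over-representation ratio:

```python
def occurrences(seq, word):
    # Overlapping occurrence count of word in seq.
    return sum(seq[i:i + len(word)] == word
               for i in range(len(seq) - len(word) + 1))

def markov_expected(seq, word):
    """Maximum-order Markov chain estimate of a word's expected count:
    E[N(w)] = N(w[:-1]) * N(w[1:]) / N(w[1:-1])."""
    return (occurrences(seq, word[:-1]) * occurrences(seq, word[1:])
            / occurrences(seq, word[1:-1]))

def representation(seq, word):
    # Ratio < 1 flags an under-represented site (e.g. an avoided
    # restriction site), > 1 an over-represented one.
    return occurrences(seq, word) / markov_expected(seq, word)
```

On the toy sequence "AAAA" the word "AAA" occurs twice while the model expects 3 · 3 / 4 = 2.25, giving a ratio of about 0.89.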
TEGS-CN: A Statistical Method for Pathway Analysis of Genome-wide Copy Number Profile.
Huang, Yen-Tsung; Hsu, Thomas; Christiani, David C
2014-01-01
The effects of copy number alterations make up a significant part of the tumor genome profile, but pathway analyses of these alterations are still not well established. We proposed a novel method to analyze multiple copy numbers of genes within a pathway, termed Test for the Effect of a Gene Set with Copy Number data (TEGS-CN). TEGS-CN was adapted from TEGS, a method that we previously developed for gene expression data using a variance component score test. With additional development, we extend the method to analyze DNA copy number data, accounting for different gene sizes and thus varying numbers of copy number probes per gene. The test statistic follows a mixture of χ² distributions that can be obtained using permutation with a scaled χ² approximation. We conducted simulation studies to evaluate the size and the power of TEGS-CN and to compare its performance with TEGS. We analyzed genome-wide copy number data from 264 patients with non-small-cell lung cancer. With the Molecular Signatures Database (MSigDB) pathway database, the genome-wide copy number data can be classified into 1814 biological pathways or gene sets. We investigated associations of the copy number profile of the 1814 gene sets with pack-years of cigarette smoking. Our analysis revealed five pathways with significant P values after Bonferroni adjustment (<2.8 × 10⁻⁵), including the PTEN pathway (7.8 × 10⁻⁷), the gene set up-regulated under heat shock (3.6 × 10⁻⁶), the gene sets involved in the immune profile for rejection of kidney transplantation (9.2 × 10⁻⁶) and for transcriptional control of leukocytes (2.2 × 10⁻⁵), and the ganglioside biosynthesis pathway (2.7 × 10⁻⁵). In conclusion, we present a new method for pathway analyses of copy number data; causal mechanisms of the five pathways require further study.
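The variance component score test underlying TEGS-CN can be caricatured as a quadratic statistic in the residuals whose null distribution (a mixture of chi-squares) is approximated empirically. A hedged sketch, not the TEGS-CN implementation: it uses an intercept-only null model, no covariate adjustment, and plain permutation in place of the scaled χ² approximation:

```python
import numpy as np

def gene_set_score_test(y, G, n_perm=2000, seed=0):
    """Score-type test of association between phenotype y (length n) and a
    gene set's copy number matrix G (n x p): Q = r' G G' r, with the null
    distribution approximated by permuting the residuals."""
    rng = np.random.default_rng(seed)
    r = y - y.mean()            # residuals under an intercept-only null model
    q_obs = r @ G @ G.T @ r
    exceed = 0
    for _ in range(n_perm):
        rp = rng.permutation(r)
        exceed += (rp @ G @ G.T @ rp) >= q_obs
    return (1 + exceed) / (1 + n_perm)   # permutation p-value
```

With a strong simulated signal the p-value collapses toward its minimum of 1/(n_perm + 1), while under the null it is roughly uniform.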
Visual Exploration of Genetic Association with Voxel-based Imaging Phenotypes in an MCI/AD Study
Kim, Sungeun; Shen, Li; Saykin, Andrew J.; West, John D.
2010-01-01
Neuroimaging genomics is a new transdisciplinary research field, which aims to examine genetic effects on the brain via integrated analyses of high-throughput neuroimaging and genomic data. We report our recent work on (1) developing an imaging genomic browsing system that allows for whole genome and entire brain analyses based on visual exploration and (2) applying the system to the imaging genomic analysis of an existing MCI/AD cohort. Voxel-based morphometry is used to define imaging phenotypes. ANCOVA is employed to evaluate the effect of the interaction of genotypes and diagnosis in relation to imaging phenotypes while controlling for relevant covariates. Encouraging experimental results suggest that the proposed system has substantial potential for enabling discovery of imaging genomic associations through visual evaluation and for localizing candidate imaging regions and genomic regions for refined statistical modeling. PMID:19963597
Kazemian, Majid; Zhu, Qiyun; Halfon, Marc S.; Sinha, Saurabh
2011-01-01
Despite recent advances in experimental approaches for identifying transcriptional cis-regulatory modules (CRMs, ‘enhancers’), direct empirical discovery of CRMs for all genes in all cell types and environmental conditions is likely to remain an elusive goal. Effective methods for computational CRM discovery are thus a critically needed complement to empirical approaches. However, existing computational methods that search for clusters of putative binding sites are ineffective if the relevant TFs and/or their binding specificities are unknown. Here, we provide a significantly improved method for ‘motif-blind’ CRM discovery that does not depend on knowledge or accurate prediction of TF-binding motifs and is effective when limited knowledge of functional CRMs is available to ‘supervise’ the search. We propose a new statistical method, based on ‘Interpolated Markov Models’, for motif-blind, genome-wide CRM discovery. It captures the statistical profile of variable length words in known CRMs of a regulatory network and finds candidate CRMs that match this profile. The method also uses orthologs of the known CRMs from closely related genomes. We perform in silico evaluation of predicted CRMs by assessing whether their neighboring genes are enriched for the expected expression patterns. This assessment uses a novel statistical test that extends the widely used Hypergeometric test of gene set enrichment to account for variability in intergenic lengths. We find that the new CRM prediction method is superior to existing methods. Finally, we experimentally validate 12 new CRM predictions by examining their regulatory activity in vivo in Drosophila; 10 of the tested CRMs were found to be functional, while 6 of the top 7 predictions showed the expected activity patterns. We make our program available as downloadable source code, and as a plugin for a genome browser installed on our servers. PMID:21821659
A Ranking Approach to Genomic Selection.
Blondel, Mathieu; Onogi, Akio; Iwata, Hiroyoshi; Ueda, Naonori
2015-01-01
Genomic selection (GS) is a recent selective breeding method which uses predictive models based on whole-genome molecular markers. Until now, existing studies formulated GS as the problem of modeling an individual's breeding value for a particular trait of interest, i.e., as a regression problem. To assess predictive accuracy of the model, the Pearson correlation between observed and predicted trait values was used. In this paper, we propose to formulate GS as the problem of ranking individuals according to their breeding value. Our proposed framework allows us to employ machine learning methods for ranking which had previously not been considered in the GS literature. To assess ranking accuracy of a model, we introduce a new measure originating from the information retrieval literature called normalized discounted cumulative gain (NDCG). NDCG rewards more strongly models which assign a high rank to individuals with high breeding value. Therefore, NDCG reflects a prerequisite objective in selective breeding: accurate selection of individuals with high breeding value. We conducted a comparison of 10 existing regression methods and 3 new ranking methods on 6 datasets, consisting of 4 plant species and 25 traits. Our experimental results suggest that tree-based ensemble methods including McRank, Random Forests and Gradient Boosting Regression Trees achieve excellent ranking accuracy. RKHS regression and RankSVM also achieve good accuracy when used with an RBF kernel. Traditional regression methods such as Bayesian lasso, wBSR and BayesC were found less suitable for ranking. Pearson correlation was found to correlate poorly with NDCG. Our study suggests two important messages. First, ranking methods are a promising research direction in GS. Second, NDCG can be a useful evaluation measure for GS.
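The NDCG measure described above can be computed directly: rank individuals by predicted value, use the true breeding values as relevance scores, and normalize by the ideal ordering. A sketch using a common gain/discount form (the paper's exact variant may differ, and negative breeding values would need shifting under the 2^rel gain used here):

```python
import math

def dcg(relevances):
    # Discounted cumulative gain; early ranks dominate via the log2 discount.
    return sum((2 ** rel - 1) / math.log2(i + 2)
               for i, rel in enumerate(relevances))

def ndcg(true_values, predicted_values):
    """NDCG of ranking individuals by predicted breeding value, scored
    against the ideal ranking by true breeding value."""
    order = sorted(range(len(true_values)),
                   key=lambda i: predicted_values[i], reverse=True)
    ranked = [true_values[i] for i in order]
    return dcg(ranked) / dcg(sorted(true_values, reverse=True))

# A perfect ranking scores 1.0; mis-ranking the best individuals costs most.
print(ndcg([3, 2, 1, 0], [0.9, 0.5, 0.3, 0.1]))  # prints 1.0
```

Because the discount decays quickly, NDCG rewards a model that correctly identifies the top candidates even if it ranks the rest poorly, which matches the selection objective described above.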
Baby, Vincent; Lachance, Jean-Christophe; Gagnon, Jules; Lucier, Jean-François; Matteau, Dominick; Knight, Tom; Rodrigue, Sébastien
2018-01-01
The creation and comparison of minimal genomes will help better define the most fundamental mechanisms supporting life. Mesoplasma florum is a near-minimal, fast-growing, nonpathogenic bacterium potentially amenable to genome reduction efforts. In a comparative genomic study of 13 M. florum strains, including 11 newly sequenced genomes, we have identified the core genome and open pangenome of this species. Our results show that all of the strains have approximately 80% of their gene content in common. Of the remaining 20%, 17% of the genes were found in multiple strains and 3% were unique to any given strain. On the basis of random transposon mutagenesis, we also estimated that ~290 out of 720 genes are essential for M. florum L1 in rich medium. We next evaluated different genome reduction scenarios for M. florum L1 by using gene conservation and essentiality data, as well as comparisons with the first working approximation of a minimal organism, Mycoplasma mycoides JCVI-syn3.0. Our results suggest that 409 of the 473 M. mycoides JCVI-syn3.0 genes have orthologs in M. florum L1. Conversely, 57 putatively essential M. florum L1 genes have no homolog in M. mycoides JCVI-syn3.0. This suggests differences in minimal genome compositions, even for these evolutionarily closely related bacteria. IMPORTANCE Recent years have witnessed the development of whole-genome cloning and transplantation methods and the complete synthesis of entire chromosomes. Recently, the first minimal cell, Mycoplasma mycoides JCVI-syn3.0, was created. Despite these milestone achievements, several questions remain to be answered. For example, is the composition of minimal genomes virtually identical in phylogenetically related species? On the basis of comparative genomics and transposon mutagenesis, we investigated this question by using an alternative model, Mesoplasma florum, that is also amenable to genome reduction efforts.
Our results suggest that the creation of additional minimal genomes could help reveal different gene compositions and strategies that can support life, even within closely related species.
77 FR 28888 - National Human Genome Research Institute Notice of Closed Meeting
Federal Register 2010, 2011, 2012, 2013, 2014
2012-05-16
... DEPARTMENT OF HEALTH AND HUMAN SERVICES National Institutes of Health National Human Genome... unwarranted invasion of personal privacy. Name of Committee: National Human Genome Research Institute Initial...: To review and evaluate grant applications. Place: National Human Genome Research Institute, 3635...
77 FR 20646 - National Human Genome Research Institute; Notice of Closed Meetings
Federal Register 2010, 2011, 2012, 2013, 2014
2012-04-05
... DEPARTMENT OF HEALTH AND HUMAN SERVICES National Institutes of Health National Human Genome... clearly unwarranted invasion of personal privacy. Name of Committee: National Human Genome Research.... Agenda: To review and evaluate grant applications. Place: National Human Genome Research Institute, 5635...
78 FR 20933 - National Human Genome Research Institute; Notice of Closed Meeting
Federal Register 2010, 2011, 2012, 2013, 2014
2013-04-08
... DEPARTMENT OF HEALTH AND HUMAN SERVICES National Institutes of Health National Human Genome... unwarranted invasion of personal privacy. Name of Committee: National Human Genome Research Institute Special... review and evaluate grant applications. Place: National Human Genome Research Institute, Room 3055, 5635...
78 FR 14806 - National Human Genome Research Institute; Notice of Closed Meeting
Federal Register 2010, 2011, 2012, 2013, 2014
2013-03-07
... DEPARTMENT OF HEALTH AND HUMAN SERVICES National Institutes of Health National Human Genome... unwarranted invasion of personal privacy. Name of Committee: National Human Genome Research Institute Special... p.m. Agenda: To review and evaluate grant applications. Place: National Human Genome Research...
77 FR 22332 - National Human Genome Research Institute; Notice of Closed Meeting
Federal Register 2010, 2011, 2012, 2013, 2014
2012-04-13
... DEPARTMENT OF HEALTH AND HUMAN SERVICES National Institutes of Health National Human Genome... unwarranted invasion of personal privacy. Name of Committee: National Human Genome Research Institute Special.... Agenda: To review and evaluate grant applications. Place: National Human Genome Research Institute, 5635...
77 FR 12604 - National Human Genome Research Institute; Notice of Closed Meetings
Federal Register 2010, 2011, 2012, 2013, 2014
2012-03-01
... DEPARTMENT OF HEALTH AND HUMAN SERVICES National Institutes of Health National Human Genome... clearly unwarranted invasion of personal privacy. Name of Committee: National Human Genome Research... review and evaluate contract proposals. Place: National Human Genome Research Institute, 5635 Fishers Lane...
78 FR 31953 - National Human Genome Research Institute; Notice of Closed Meeting
Federal Register 2010, 2011, 2012, 2013, 2014
2013-05-28
... DEPARTMENT OF HEALTH AND HUMAN SERVICES National Institutes of Health National Human Genome... unwarranted invasion of personal privacy. Name of Committee: National Human Genome Research Institute Special... review and evaluate grant applications. Place: National Human Genome Research Institute, 3rd Floor...
Foster, Charles S P; Sauquet, Hervé; van der Merwe, Marlien; McPherson, Hannah; Rossetto, Maurizio; Ho, Simon Y W
2017-05-01
The evolutionary timescale of angiosperms has long been a key question in biology. Molecular estimates of this timescale have shown considerable variation, being influenced by differences in taxon sampling, gene sampling, fossil calibrations, evolutionary models, and choices of priors. Here, we analyze a data set comprising 76 protein-coding genes from the chloroplast genomes of 195 taxa spanning 86 families, including novel genome sequences for 11 taxa, to evaluate the impact of models, priors, and gene sampling on Bayesian estimates of the angiosperm evolutionary timescale. Using a Bayesian relaxed molecular-clock method, with a core set of 35 minimum and two maximum fossil constraints, we estimated that crown angiosperms arose 221 (251-192) Ma during the Triassic. Based on a range of additional sensitivity and subsampling analyses, we found that our date estimates were generally robust to large changes in the parameters of the birth-death tree prior and of the model of rate variation across branches. We found an exception to this when we implemented fossil calibrations in the form of highly informative gamma priors rather than as uniform priors on node ages. Under all other calibration schemes, including trials of seven maximum age constraints, we consistently found that the earliest divergences of angiosperm clades substantially predate the oldest fossils that can be assigned unequivocally to their crown group. Overall, our results and experiments with genome-scale data suggest that reliable estimates of the angiosperm crown age will require increased taxon sampling, significant methodological changes, and new information from the fossil record. [Angiospermae, chloroplast, genome, molecular dating, Triassic.]. © The Author(s) 2016. Published by Oxford University Press, on behalf of the Society of Systematic Biologists. All rights reserved. For Permissions, please email: journals.permissions@oup.com.
Ben Hassen, Manel; Bartholomé, Jérôme; Valè, Giampiero; Cao, Tuong-Vi; Ahmadi, Nourollah
2018-05-09
Developing rice varieties adapted to alternate wetting and drying water management is crucial for the sustainability of irrigated rice cropping systems. Here we report the first study exploring the feasibility of breeding rice for adaptation to alternate wetting and drying using genomic prediction methods that account for genotype by environment interactions. Two breeding populations (a reference panel of 284 accessions and a progeny population of 97 advanced lines) were evaluated under alternate wetting and drying and continuous flooding management systems. The predictive abilities of genomic prediction for response variables (index of relative performance and the slope of the joint regression) and of multi-environment genomic prediction models were compared. For the three traits considered (days to flowering, panicle weight and nitrogen-balance index), significant genotype by environment interactions were observed in both populations. In cross-validation, predictive ability for the index was on average lower (0.31) than that for the slope of the joint regression (0.64), regardless of the trait considered. Similar results were found for progeny validation. Both cross-validation and progeny validation experiments showed that the performance of multi-environment models predicting unobserved phenotypes of untested entries was similar to the performance of single-environment models, with differences in predictive ability ranging from -6% to 4% depending on the trait and on the statistical model concerned. The predictive ability of multi-environment models predicting unobserved phenotypes of entries evaluated under both water management systems outperformed single-environment models by an average of 30%. Practical implications for breeding rice for adaptation to the alternate wetting and drying system are discussed. Copyright © 2018, G3: Genes, Genomes, Genetics.
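Predictive ability in studies like this is usually estimated by cross-validation: marker effects are fitted on a training set, held-out phenotypes are predicted, and the correlation between predicted and observed values is reported. A minimal ridge-regression (RR-BLUP-style) sketch on simulated data; the sample sizes, penalty value, and seed are illustrative assumptions, not the paper's multi-environment models:

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulate genotypes (0/1/2 allele counts) and an additive polygenic phenotype.
n, m = 200, 1000
X = rng.binomial(2, 0.5, size=(n, m)).astype(float)
beta = rng.normal(0, 0.05, m)          # true (unknown) SNP effects
y = X @ beta + rng.normal(0, 1.0, n)   # phenotype = genetic value + noise

train, test = slice(0, 150), slice(150, None)
lam = 100.0  # ridge penalty, standing in for the variance-component ratio

# Estimate SNP effects on the training set by penalized least squares.
Xc = X - X.mean(axis=0)
b_hat = np.linalg.solve(Xc[train].T @ Xc[train] + lam * np.eye(m),
                        Xc[train].T @ (y[train] - y[train].mean()))
pred = Xc[test] @ b_hat + y[train].mean()

# "Predictive ability" = correlation between predicted and observed phenotypes.
r = np.corrcoef(pred, y[test])[0, 1]
print(round(r, 2))
```

With a heritable simulated trait, the correlation is clearly positive; real studies repeat this over many random training/validation partitions.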
Galaev, A V; Babaiants, L T; Sivolap, Iu M
2003-01-01
Comparative analysis of introgressive and parental forms of wheat was carried out to reveal the sites of the donor genome carrying new loci of resistance to fungal diseases. Using the ISSR method, 124 ISSR loci were detected in the genomes of 18 individual plants of introgressive line 5/20-91; 17 of them were related to introgressed fragments of the Ae. cylindrica genome in T. aestivum. The ISSR method was shown to be effective for detecting variability caused by introgression of alien genetic material into the T. aestivum genome.
Targeted next-generation sequencing in monogenic dyslipidemias.
Hegele, Robert A; Ban, Matthew R; Cao, Henian; McIntyre, Adam D; Robinson, John F; Wang, Jian
2015-04-01
To evaluate the potential clinical translation of high-throughput next-generation sequencing (NGS) methods in diagnosis and management of dyslipidemia. Recent NGS experiments indicate that most causative genes for monogenic dyslipidemias are already known. Thus, monogenic dyslipidemias can now be diagnosed using targeted NGS. Targeting of dyslipidemia genes can be achieved by either: designing custom reagents for a dyslipidemia-specific NGS panel; or performing genome-wide NGS and focusing on genes of interest. Advantages of the former approach are lower cost and limited potential to detect incidental pathogenic variants unrelated to dyslipidemia. However, the latter approach is more flexible because masking criteria can be altered as knowledge advances, with no need for re-design of reagents or follow-up sequencing runs. Also, the cost of genome-wide analysis is decreasing and ethical concerns can likely be mitigated. DNA-based diagnosis is already part of the clinical diagnostic algorithms for familial hypercholesterolemia. Furthermore, DNA-based diagnosis is supplanting traditional biochemical methods to diagnose chylomicronemia caused by deficiency of lipoprotein lipase or its co-factors. The increasing availability and decreasing cost of clinical NGS for dyslipidemia means that its potential benefits can now be evaluated on a larger scale.
Short-read, high-throughput sequencing technology for STR genotyping
Bornman, Daniel M.; Hester, Mark E.; Schuetter, Jared M.; Kasoji, Manjula D.; Minard-Smith, Angela; Barden, Curt A.; Nelson, Scott C.; Godbold, Gene D.; Baker, Christine H.; Yang, Boyu; Walther, Jacquelyn E.; Tornes, Ivan E.; Yan, Pearlly S.; Rodriguez, Benjamin; Bundschuh, Ralf; Dickens, Michael L.; Young, Brian A.; Faith, Seth A.
2013-01-01
DNA-based methods for human identification principally rely upon genotyping of short tandem repeat (STR) loci. Electrophoretic-based techniques for variable-length classification of STRs are universally utilized, but are limited in that they have relatively low throughput and do not yield nucleotide sequence information. High-throughput sequencing technology may provide a more powerful instrument for human identification, but is not currently validated for forensic casework. Here, we present a systematic method to perform high-throughput genotyping analysis of the Combined DNA Index System (CODIS) STR loci using short-read (150 bp) massively parallel sequencing technology. Open source reference alignment tools were optimized to evaluate PCR-amplified STR loci using a custom designed STR genome reference. Evaluation of this approach demonstrated that the 13 CODIS STR loci and amelogenin (AMEL) locus could be accurately called from individual and mixture samples. Sensitivity analysis showed that as few as 18,500 reads, aligned to an in silico referenced genome, were required to genotype an individual (>99% confidence) for the CODIS loci. The power of this technology was further demonstrated by identification of variant alleles containing single nucleotide polymorphisms (SNPs) and the development of quantitative measurements (reads) for resolving mixed samples. PMID:25621315
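At its core, variable-length STR classification reduces to counting tandem repeat units of a locus motif within reads that span the locus. A toy illustration of that counting step; the motif and read are fabricated, and real pipelines additionally anchor on flanking sequence and handle stutter and partial repeats:

```python
import re

def count_str_repeats(read: str, motif: str) -> int:
    """Return the longest uninterrupted run of `motif` repeats in a read.

    A toy stand-in for length-based STR genotyping: both electrophoretic and
    sequence-based approaches ultimately report repeat counts per allele.
    """
    runs = re.findall(f"(?:{motif})+", read)
    return max((len(r) // len(motif) for r in runs), default=0)

# Hypothetical read spanning a TAGA-repeat locus with 7 repeat units.
read = "ACGT" + "TAGA" * 7 + "GGCA"
print(count_str_repeats(read, "TAGA"))  # 7
```

Unlike fragment-length methods, sequence-level counting also exposes SNPs inside the repeat region, which is the source of the variant alleles described above.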
A large-scale evaluation of computational protein function prediction
Radivojac, Predrag; Clark, Wyatt T; Ronnen Oron, Tal; Schnoes, Alexandra M; Wittkop, Tobias; Sokolov, Artem; Graim, Kiley; Funk, Christopher; Verspoor, Karin; Ben-Hur, Asa; Pandey, Gaurav; Yunes, Jeffrey M; Talwalkar, Ameet S; Repo, Susanna; Souza, Michael L; Piovesan, Damiano; Casadio, Rita; Wang, Zheng; Cheng, Jianlin; Fang, Hai; Gough, Julian; Koskinen, Patrik; Törönen, Petri; Nokso-Koivisto, Jussi; Holm, Liisa; Cozzetto, Domenico; Buchan, Daniel W A; Bryson, Kevin; Jones, David T; Limaye, Bhakti; Inamdar, Harshal; Datta, Avik; Manjari, Sunitha K; Joshi, Rajendra; Chitale, Meghana; Kihara, Daisuke; Lisewski, Andreas M; Erdin, Serkan; Venner, Eric; Lichtarge, Olivier; Rentzsch, Robert; Yang, Haixuan; Romero, Alfonso E; Bhat, Prajwal; Paccanaro, Alberto; Hamp, Tobias; Kassner, Rebecca; Seemayer, Stefan; Vicedo, Esmeralda; Schaefer, Christian; Achten, Dominik; Auer, Florian; Böhm, Ariane; Braun, Tatjana; Hecht, Maximilian; Heron, Mark; Hönigschmid, Peter; Hopf, Thomas; Kaufmann, Stefanie; Kiening, Michael; Krompass, Denis; Landerer, Cedric; Mahlich, Yannick; Roos, Manfred; Björne, Jari; Salakoski, Tapio; Wong, Andrew; Shatkay, Hagit; Gatzmann, Fanny; Sommer, Ingolf; Wass, Mark N; Sternberg, Michael J E; Škunca, Nives; Supek, Fran; Bošnjak, Matko; Panov, Panče; Džeroski, Sašo; Šmuc, Tomislav; Kourmpetis, Yiannis A I; van Dijk, Aalt D J; ter Braak, Cajo J F; Zhou, Yuanpeng; Gong, Qingtian; Dong, Xinran; Tian, Weidong; Falda, Marco; Fontana, Paolo; Lavezzo, Enrico; Di Camillo, Barbara; Toppo, Stefano; Lan, Liang; Djuric, Nemanja; Guo, Yuhong; Vucetic, Slobodan; Bairoch, Amos; Linial, Michal; Babbitt, Patricia C; Brenner, Steven E; Orengo, Christine; Rost, Burkhard; Mooney, Sean D; Friedberg, Iddo
2013-01-01
Automated annotation of protein function is challenging. As the number of sequenced genomes rapidly grows, the overwhelming majority of protein products can only be annotated computationally. If computational predictions are to be relied upon, it is crucial that the accuracy of these methods be high. Here we report the results from the first large-scale community-based Critical Assessment of protein Function Annotation (CAFA) experiment. Fifty-four methods representing the state-of-the-art for protein function prediction were evaluated on a target set of 866 proteins from eleven organisms. Two findings stand out: (i) today’s best protein function prediction algorithms significantly outperformed widely-used first-generation methods, with large gains on all types of targets; and (ii) although the top methods perform well enough to guide experiments, there is significant need for improvement of currently available tools. PMID:23353650
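CAFA-style evaluations score predictors with threshold-based metrics such as F-max: the maximum harmonic mean of precision and recall over all decision thresholds applied to predicted term scores. A simplified single-protein sketch (the real evaluation averages over targets and propagates terms through the GO hierarchy; the GO identifiers below are arbitrary examples):

```python
def precision_recall(pred_scores, true_terms, tau):
    """Precision and recall of predicted terms at score threshold tau."""
    pred = {t for t, s in pred_scores.items() if s >= tau}
    if not pred:
        return None, 0.0  # precision undefined with no predictions
    tp = len(pred & true_terms)
    return tp / len(pred), tp / len(true_terms)

def f_max(pred_scores, true_terms, taus=None):
    """Maximum F1 over decision thresholds (single-protein simplification)."""
    taus = taus or [i / 100 for i in range(101)]
    best = 0.0
    for tau in taus:
        p, r = precision_recall(pred_scores, true_terms, tau)
        if p is None or p + r == 0:
            continue
        best = max(best, 2 * p * r / (p + r))
    return best

scores = {"GO:0003677": 0.9, "GO:0005634": 0.7, "GO:0016301": 0.2}
truth = {"GO:0003677", "GO:0005634"}
print(f_max(scores, truth))  # 1.0: a threshold in (0.2, 0.7] yields perfect P and R
```

Sweeping the threshold rather than fixing it lets methods with different score calibrations be compared fairly.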
Fast ancestral gene order reconstruction of genomes with unequal gene content.
Feijão, Pedro; Araujo, Eloi
2016-11-11
During evolution, genomes are modified by large scale structural events, such as rearrangements, deletions or insertions of large blocks of DNA. Of particular interest, in order to better understand how this type of genomic evolution happens, is the reconstruction of ancestral genomes, given a phylogenetic tree with extant genomes at its leaves. One way of solving this problem is to assume a rearrangement model, such as Double Cut and Join (DCJ), and find a set of ancestral genomes that minimizes the number of events on the input tree. Since this problem is NP-hard for most rearrangement models, exact solutions are practical only for small instances, and heuristics have to be used for larger datasets. This type of approach can be called event-based. Another common approach is based on finding conserved structures between the input genomes, such as adjacencies between genes, possibly also assigning weights that indicate a measure of confidence or probability that this particular structure is present on each ancestral genome, and then finding a set of non conflicting adjacencies that optimize some given function, usually trying to maximize total weight and minimizing character changes in the tree. We call this type of methods homology-based. In previous work, we proposed an ancestral reconstruction method that combines homology- and event-based ideas, using the concept of intermediate genomes, that arise in DCJ rearrangement scenarios. This method showed better rate of correctly reconstructed adjacencies than other methods, while also being faster, since the use of intermediate genomes greatly reduces the search space. Here, we generalize the intermediate genome concept to genomes with unequal gene content, extending our method to account for gene insertions and deletions of any length. 
In many of the simulated datasets, our proposed method had better results than MLGO and MGRA, two state-of-the-art algorithms for ancestral reconstruction with unequal gene content, while running much faster, making it more scalable to larger datasets. Studying ancestral reconstruction problems in a new light, using the concept of intermediate genomes, allows the design of very fast algorithms by greatly reducing the solution search space, while also giving very good results. The algorithms introduced in this paper were implemented in an open-source software called RINGO (ancestral Reconstruction with INtermediate GenOmes), available at https://github.com/pedrofeijao/RINGO .
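The homology-based side of such methods operates on conserved adjacencies between gene extremities (each signed gene has a tail and a head). A minimal sketch of the adjacency-set representation for signed circular chromosomes; the encoding is one common convention, not necessarily RINGO's internal representation:

```python
def adjacencies(genome):
    """Represent a genome (list of circular chromosomes given as signed gene
    lists) as a set of adjacencies between gene extremities.

    Convention: (gene, 't') = tail, (gene, 'h') = head; a negative sign
    means the gene is read in reverse orientation.
    """
    adj = set()
    for chrom in genome:
        ext = []
        for g in chrom:
            if g > 0:
                ext.append(((abs(g), "t"), (abs(g), "h")))  # tail in, head out
            else:
                ext.append(((abs(g), "h"), (abs(g), "t")))  # reversed gene
        for i in range(len(ext)):
            left = ext[i][1]                    # outgoing extremity of gene i
            right = ext[(i + 1) % len(ext)][0]  # incoming extremity of next gene
            adj.add(frozenset((left, right)))   # circular: last wraps to first
    return adj

A = [[1, 2, 3, 4]]    # one circular chromosome
B = [[1, 2, -3, 4]]   # same chromosome with gene 3 inverted
conserved = adjacencies(A) & adjacencies(B)
print(len(conserved))  # 2: the inversion of gene 3 breaks its two flanking adjacencies
```

Scoring how many such adjacencies are conserved across the tree is the quantity that homology-based reconstructions maximize.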
Ultrafast Comparison of Personal Genomes via Precomputed Genome Fingerprints
Glusman, Gustavo; Mauldin, Denise E.; Hood, Leroy E.; Robinson, Max
2017-01-01
We present an ultrafast method for comparing personal genomes. We transform the standard genome representation (lists of variants relative to a reference) into “genome fingerprints” via locality sensitive hashing. The resulting genome fingerprints can be meaningfully compared even when the input data were obtained using different sequencing technologies, processed using different pipelines, represented in different data formats and relative to different reference versions. Furthermore, genome fingerprints are robust to up to 30% missing data. Because of their reduced size, computation on the genome fingerprints is fast and requires little memory. For example, we could compute all-against-all pairwise comparisons among the 2504 genomes in the 1000 Genomes data set in 67 s at high quality (21 μs per comparison, on a single processor), and achieved a lower quality approximation in just 11 s. Efficient computation enables scaling up a variety of important genome analyses, including quantifying relatedness, recognizing duplicate sequenced genomes in a set, population reconstruction, and many others. The original genome representation cannot be reconstructed from its fingerprint, effectively decoupling genome comparison from genome interpretation; the method thus has significant implications for privacy-preserving genome analytics. PMID:29018478
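The core idea, reducing a variant list to a small hashed summary that supports fast similarity comparison without exposing the variants themselves, can be illustrated with a toy bit-vector fingerprint. This is NOT the published locality-sensitive hashing scheme, only a sketch of the reduced-representation comparison principle, with fabricated variant strings:

```python
import hashlib

def toy_fingerprint(variants, size=256):
    """Hash variant strings (e.g. 'chr1:12345:A>G') into a fixed-size binary
    fingerprint. Irreversible and far smaller than the variant list itself."""
    fp = [0] * size
    for v in variants:
        h = int(hashlib.sha256(v.encode()).hexdigest(), 16)
        fp[h % size] = 1
    return fp

def similarity(fp1, fp2):
    """Jaccard similarity between two binary fingerprints."""
    inter = sum(a & b for a, b in zip(fp1, fp2))
    union = sum(a | b for a, b in zip(fp1, fp2))
    return inter / union if union else 0.0

a = [f"chr1:{i}:A>G" for i in range(100)]
b = a[:80] + [f"chr2:{i}:C>T" for i in range(20)]  # shares 80% of a's variants
print(similarity(toy_fingerprint(a), toy_fingerprint(b)))
```

Because comparisons touch only the fixed-size fingerprints, an all-against-all scan scales with the number of genomes squared times a small constant, rather than with full variant-list sizes.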
Population Structure and Genomic Breed Composition in an Angus-Brahman Crossbred Cattle Population.
Gobena, Mesfin; Elzo, Mauricio A; Mateescu, Raluca G
2018-01-01
Crossbreeding is a common strategy used in tropical and subtropical regions to enhance beef production, and having accurate knowledge of breed composition is essential for the success of a crossbreeding program. Although pedigree records have been traditionally used to obtain the breed composition of crossbred cattle, the accuracy of pedigree-based breed composition can be reduced by inaccurate and/or incomplete records and Mendelian sampling. Breed composition estimation from genomic data has multiple advantages including higher accuracy without being affected by missing, incomplete, or inaccurate records and the ability to be used as independent authentication of breed in breed-labeled beef products. The present study was conducted with 676 Angus-Brahman crossbred cattle with genotype and pedigree information to evaluate the feasibility and accuracy of using genomic data to determine breed composition. We used genomic data in parametric and non-parametric methods to detect population structure due to differences in breed composition while accounting for the confounding effect of close familial relationships. By applying principal component analysis (PCA) and the maximum likelihood method of ADMIXTURE to genomic data, it was possible to successfully characterize population structure resulting from heterogeneous breed ancestry, while accounting for close familial relationships. PCA results offered additional insight into the different hierarchies of genetic variation structuring. The first principal component was strongly correlated with Angus-Brahman proportions, and the second represented variation within animals that have a relatively more extended Brangus lineage-indicating the presence of a distinct pattern of genetic variation in these cattle. Although there was strong agreement between breed proportions estimated from pedigree and genetic information, there were significant discrepancies between these two methods for certain animals. 
This was most likely due to inaccuracies in the pedigree-based estimation of breed composition, which supported the case for using genomic information to complement and/or replace pedigree information when estimating breed composition. Comparison with a supervised analysis in which purebreds are used as the training set suggests that accurate predictions can be achieved even in the absence of purebred population information.
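The PCA step described above can be sketched on simulated data: generate genotypes for two breeds with diverged allele frequencies plus F1-like crossbreds, then take the leading principal component of the centered genotype matrix. All sizes, frequencies, and the seed are illustrative assumptions (the study used real SNP data, with ADMIXTURE alongside PCA):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy genotype matrix: rows = animals, columns = SNPs, entries = allele counts.
n_snp = 500
p_angus = rng.uniform(0.1, 0.9, n_snp)
p_brahman = np.clip(p_angus + rng.uniform(-0.4, 0.4, n_snp), 0.05, 0.95)

def sample(p, n):
    return rng.binomial(2, p, size=(n, len(p)))

G = np.vstack([sample(p_angus, 30),
               sample(p_brahman, 30),
               sample((p_angus + p_brahman) / 2, 30)])  # F1-like crossbreds

# PCA via SVD of the column-centered genotype matrix.
X = G - G.mean(axis=0)
U, s, Vt = np.linalg.svd(X, full_matrices=False)
pc1 = U[:, 0] * s[0]  # animal scores on the first principal component

# PC1 separates the two breeds, with crossbreds falling in between.
print(pc1[:30].mean(), pc1[30:60].mean(), pc1[60:].mean())
```

This mirrors the finding that the first principal component tracks Angus-Brahman proportion: admixed animals project between the two purebred clusters in proportion to their ancestry.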
Calculation and delivery of US genomic evaluations for dairy cattle
USDA-ARS?s Scientific Manuscript database
In April 2013, the responsibility for calculation and distribution of genomic evaluations for dairy cattle was transferred from the USDA to the US dairy industry’s Council on Dairy Cattle Breeding; the responsibility for development of evaluation methodology remained with the USDA. The Council on Da...
Comprehensive GMO detection using real-time PCR array: single-laboratory validation.
Mano, Junichi; Harada, Mioko; Takabatake, Reona; Furui, Satoshi; Kitta, Kazumi; Nakamura, Kosuke; Akiyama, Hiroshi; Teshima, Reiko; Noritake, Hiromichi; Hatano, Shuko; Futo, Satoshi; Minegishi, Yasutaka; Iizuka, Tayoshi
2012-01-01
We have developed a real-time PCR array method to comprehensively detect genetically modified (GM) organisms. In the method, genomic DNA extracted from an agricultural product is analyzed using various qualitative real-time PCR assays on a 96-well PCR plate, targeting individual GM events, recombinant DNA (r-DNA) segments, taxon-specific DNAs, and donor organisms of the respective r-DNAs. In this article, we report the single-laboratory validation of both the DNA extraction methods and the component PCR assays constituting the real-time PCR array. We selected DNA extraction methods for specified plant matrixes, i.e., maize flour, soybean flour, and ground canola seeds, then evaluated the DNA quantity, DNA fragmentation, and PCR inhibition of the resultant DNA extracts. For the component PCR assays, we evaluated the specificity and LOD. All DNA extraction methods and component PCR assays satisfied the criteria set on the basis of previous reports.
Precise, High-throughput Analysis of Bacterial Growth.
Kurokawa, Masaomi; Ying, Bei-Wen
2017-09-19
Bacterial growth is a central concept in the development of modern microbial physiology, as well as in the investigation of cellular dynamics at the systems level. Recent studies have reported correlations between bacterial growth and genome-wide events, such as genome reduction and transcriptome reorganization. Correctly analyzing bacterial growth is crucial for understanding the growth-dependent coordination of gene functions and cellular components. Accordingly, the precise quantitative evaluation of bacterial growth in a high-throughput manner is required. Emerging technological developments offer new experimental tools that allow updates of the methods used for studying bacterial growth. The protocol introduced here employs a microplate reader with a highly optimized experimental procedure for the reproducible and precise evaluation of bacterial growth. This protocol was used to evaluate the growth of several previously described Escherichia coli strains. The main steps of the protocol are as follows: the preparation of a large number of cell stocks in small vials for repeated tests with reproducible results, the use of 96-well plates for high-throughput growth evaluation, and the manual calculation of two major parameters (i.e., maximal growth rate and population density) representing the growth dynamics. In comparison to the traditional colony-forming unit (CFU) assay, which counts the cells that are cultured in glass tubes over time on agar plates, the present method is more efficient and provides more detailed temporal records of growth changes, but has a stricter detection limit at low population densities. In summary, the described method is advantageous for the precise and reproducible high-throughput analysis of bacterial growth, which can be used to draw conceptual conclusions or to make theoretical observations.
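The two growth parameters named in the protocol, maximal growth rate and maximal population density, can be computed from an OD time series roughly as follows. This is a minimal sketch assuming log-linear fitting over a sliding window; blank correction, smoothing, and replicate handling from the full protocol are omitted:

```python
import numpy as np

def growth_parameters(times_h, od, window=3):
    """Estimate the maximal specific growth rate (per hour) as the steepest
    slope of ln(OD) over a sliding window of time points, and the maximal
    population density as the highest OD reached."""
    log_od = np.log(od)
    best = 0.0
    for i in range(len(times_h) - window + 1):
        t = times_h[i:i + window]
        y = log_od[i:i + window]
        slope = np.polyfit(t, y, 1)[0]  # linear fit of ln(OD) vs time
        best = max(best, slope)
    return best, float(od.max())

# Synthetic logistic-like growth curve sampled every 30 min.
t = np.arange(0, 10, 0.5)
od = 1.0 / (1 + 99 * np.exp(-1.0 * t))  # carrying capacity 1.0, rate ~1/h
mu, k = growth_parameters(t, od)
print(round(mu, 2), round(k, 2))
```

On this synthetic curve the recovered rate is close to the simulated 1/h, illustrating why dense microplate-reader sampling gives better rate estimates than sparse CFU counts.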
Anonymizing patient genomic data for public sharing association studies.
Fernandez-Lozano, Carlos; Lopez-Campos, Guillermo; Seoane, Jose A; Lopez-Alonso, Victoria; Dorado, Julian; Martín-Sanchez, Fernando; Pazos, Alejandro
2013-01-01
The development of personalized medicine is tightly linked with the correct exploitation of molecular data, especially those associated with the genome sequence, and alongside this use of genomic data there is an increasing demand to share these data for research purposes. The transition of clinical data to research rests on the anonymization of these data so that the patient cannot be identified; genomic data pose a particular challenge because they are, by nature, identifying. In this work we have analyzed current methods for genome anonymization and propose a one-way encryption method that may enable genomic data sharing while granting access only to certain regions of the genome for research purposes.
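The one-way principle can be illustrated with a keyed hash: the same variant always maps to the same opaque token (so records can still be matched across datasets), but the token cannot be inverted without the key. This is a generic sketch, not the specific encryption method proposed in the paper; the variant string is hypothetical:

```python
import hashlib
import hmac
import os

def anonymize_variant(variant: str, secret: bytes) -> str:
    """Keyed one-way hash of a genomic variant identifier.

    A keyed HMAC (rather than a plain hash) prevents dictionary attacks by
    anyone who does not hold the secret, since the space of plausible
    variant strings is small enough to enumerate.
    """
    return hmac.new(secret, variant.encode(), hashlib.sha256).hexdigest()

secret = os.urandom(32)        # held only by the data custodian
v = "chr7:117559590:CTT>-"     # hypothetical variant identifier
token = anonymize_variant(v, secret)
print(len(token))              # 64 hex characters; reveals nothing about the locus
```

Region-restricted sharing, as proposed in the paper, would then amount to releasing tokens (or keys) only for the genomic regions a given study is authorized to query.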
Decoding transcriptional enhancers: Evolving from annotation to functional interpretation
Engel, Krysta L.; Mackiewicz, Mark; Hardigan, Andrew A.; Myers, Richard M.; Savic, Daniel
2016-01-01
Deciphering the intricate molecular processes that orchestrate the spatial and temporal regulation of genes has become an increasingly major focus of biological research. The differential expression of genes by diverse cell types with a common genome is a hallmark of complex cellular functions, as well as the basis for multicellular life. Importantly, a more coherent understanding of gene regulation is critical for defining developmental processes, evolutionary principles and disease etiologies. Here we present our current understanding of gene regulation by focusing on the role of enhancer elements in these complex processes. Although functional genomic methods have provided considerable advances to our understanding of gene regulation, these assays, which are usually performed on a genome-wide scale, typically provide correlative observations that lack functional interpretation. Recent innovations in genome editing technologies have placed gene regulatory studies at an exciting crossroads, as systematic, functional evaluation of enhancers and other transcriptional regulatory elements can now be performed in a coordinated, high-throughput manner across the entire genome. This review provides insights on transcriptional enhancer function, their role in development and disease, and catalogues experimental tools commonly used to study these elements. Additionally, we discuss the crucial role of novel techniques in deciphering the complex gene regulatory landscape and how these studies will shape future research. PMID:27224938
78 FR 24223 - National Human Genome Research Institute; Notice of Closed Meeting
Federal Register 2010, 2011, 2012, 2013, 2014
2013-04-24
... DEPARTMENT OF HEALTH AND HUMAN SERVICES National Institutes of Health National Human Genome... unwarranted invasion of personal privacy. Name of Committee: National Human Genome Research Institute Initial...: To review and evaluate grant applications. Place: National Human Genome Research Institute, 3rd floor...
Finding the Genomic Basis of Local Adaptation: Pitfalls, Practical Solutions, and Future Directions.
Hoban, Sean; Kelley, Joanna L; Lotterhos, Katie E; Antolin, Michael F; Bradburd, Gideon; Lowry, David B; Poss, Mary L; Reed, Laura K; Storfer, Andrew; Whitlock, Michael C
2016-10-01
Uncovering the genetic and evolutionary basis of local adaptation is a major focus of evolutionary biology. The recent development of cost-effective methods for obtaining high-quality genome-scale data makes it possible to identify some of the loci responsible for adaptive differences among populations. Two basic approaches for identifying putatively locally adaptive loci have been developed and are broadly used: one that identifies loci with unusually high genetic differentiation among populations (differentiation outlier methods) and one that searches for correlations between local population allele frequencies and local environments (genetic-environment association methods). Here, we review the promises and challenges of these genome scan methods, including correcting for the confounding influence of a species' demographic history, biases caused by missing aspects of the genome, matching scales of environmental data with population structure, and other statistical considerations. In each case, we make suggestions for best practices for maximizing the accuracy and efficiency of genome scans to detect the underlying genetic basis of local adaptation. With attention to their current limitations, genome scan methods can be an important tool in finding the genetic basis of adaptive evolutionary change.
A tailing genome walking method suitable for genomes with high local GC content.
Liu, Taian; Fang, Yongxiang; Yao, Wenjuan; Guan, Qisai; Bai, Gang; Jing, Zhizhong
2013-10-15
The tailing genome walking strategies are simple and efficient. However, they sometimes can be restricted due to the low stringency of homo-oligomeric primers. Here we modified their conventional tailing step by adding polythymidine and polyguanine to the target single-stranded DNA (ssDNA). The tailed ssDNA was then amplified exponentially with a specific primer in the known region and a primer comprising 5' polycytosine and 3' polyadenosine. The successful application of this novel method for identifying integration sites mediated by φC31 integrase in goat genome indicates that the method is more suitable for genomes with high complexity and local GC content. Copyright © 2013 Elsevier Inc. All rights reserved.
Phylogenic study of Lemnoideae (duckweeds) through complete chloroplast genomes for eight accessions
Ding, Yanqiang; Fang, Yang; Guo, Ling; Li, Zhidan; He, Kaize
2017-01-01
Background The phylogenetic relationships among the genera of Lemnoideae, a group of small aquatic monocotyledonous plants, were not well resolved using either morphological characters or traditional markers. Given that the rich genetic information in chloroplast genomes makes them particularly useful for phylogenetic studies, we used chloroplast genomes to clarify the phylogeny within Lemnoideae. Methods DNAs were sequenced with next-generation sequencing. The duckweed chloroplast genomes were either indirectly filtered from total DNA data or directly obtained from chloroplast DNA data. To test the reliability of assembling the chloroplast genome based on filtration of total DNA, two methods were used to assemble the chloroplast genome of Landoltia punctata strain ZH0202. A phylogenetic tree was built on the basis of the whole chloroplast genome sequences using MrBayes v.3.2.6 and PhyML 3.0. Results Eight complete duckweed chloroplast genomes were assembled, with lengths ranging from 165,775 bp to 171,152 bp; each contains 80 protein-coding sequences, four rRNAs, 30 tRNAs and two pseudogenes. The identity between the L. punctata strain ZH0202 chloroplast genomes assembled by the two methods was 100%, and their sequences and lengths were completely identical. The chloroplast genome comparison demonstrated that the differences in chloroplast genome sizes among the Lemnoideae primarily result from variation in non-coding regions, especially from repeat sequence variation. The phylogenetic analysis demonstrated that the different genera of Lemnoideae are derived from each other in the following order: Spirodela, Landoltia, Lemna, Wolffiella, and Wolffia. Discussion This study demonstrates the potential of whole chloroplast genome DNA as an effective option for phylogenetic studies of Lemnoideae. 
It also demonstrated the possibility of using chloroplast DNA data to resolve phylogenies that have not been well resolved by traditional methods, even in plants other than duckweeds. PMID:29302399
Genome-Wide Profiling of DNA Double-Strand Breaks by the BLESS and BLISS Methods.
Mirzazadeh, Reza; Kallas, Tomasz; Bienko, Magda; Crosetto, Nicola
2018-01-01
DNA double-strand breaks (DSBs) are major DNA lesions that are constantly formed during physiological processes such as DNA replication, transcription, and recombination, or as a result of exogenous agents such as ionizing radiation, radiomimetic drugs, and genome editing nucleases. Unrepaired DSBs threaten genomic stability by leading to the formation of potentially oncogenic rearrangements such as translocations. In the past few years, several methods based on next-generation sequencing (NGS) have been developed to study the genome-wide distribution of DSBs or their conversion to translocation events. We developed Breaks Labeling, Enrichment on Streptavidin, and Sequencing (BLESS), which was the first method for direct labeling of DSBs in situ followed by their genome-wide mapping at nucleotide resolution (Crosetto et al., Nat Methods 10:361-365, 2013). Recently, we have further expanded the quantitative nature, applicability, and scalability of BLESS by developing Breaks Labeling In Situ and Sequencing (BLISS) (Yan et al., Nat Commun 8:15058, 2017). Here, we first present an overview of existing methods for genome-wide localization of DSBs, and then focus on the BLESS and BLISS methods, discussing different assay design options depending on the sample type and application.
The spectrum of genomic signatures: from dinucleotides to chaos game representation.
Wang, Yingwei; Hill, Kathleen; Singh, Shiva; Kari, Lila
2005-02-14
In the post genomic era, access to complete genome sequence data for numerous diverse species has opened multiple avenues for examining and comparing primary DNA sequence organization of entire genomes. Previously, the concept of a genomic signature was introduced with the observation of species-type specific Dinucleotide Relative Abundance Profiles (DRAPs); dinucleotides were identified as the subsequences with the greatest bias in representation in a majority of genomes. Herein, we demonstrate that DRAP is one particular genomic signature contained within a broader spectrum of signatures. Within this spectrum, an alternative genomic signature, Chaos Game Representation (CGR), provides a unique visualization of patterns in sequence organization. A genomic signature is associated with a particular integer order or subsequence length that represents a measure of the resolution or granularity in the analysis of primary DNA sequence organization. We quantitatively explore the organizational information provided by genomic signatures of different orders through different distance measures, including a novel Image Distance. The Image Distance and other existing distance measures are evaluated by comparing the phylogenetic trees they generate for 26 complete mitochondrial genomes from a diversity of species. The phylogenetic tree generated by the Image Distance is compatible with the known relatedness of species. Quantitative evaluation of the spectrum of genomic signatures may be used to ultimately gain insight into the determinants and biological relevance of the genome signatures.
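Both signatures discussed above are straightforward to compute. A minimal sketch of the dinucleotide relative abundance profile, rho(XY) = f(XY)/(f(X)f(Y)), and of the Chaos Game Representation mapping, using a fabricated toy sequence:

```python
def cgr_points(seq):
    """Chaos Game Representation: map a DNA sequence to points in the unit
    square by repeatedly moving halfway toward the corner assigned to each
    base. Subsequence frequencies appear as point densities in subquadrants."""
    corners = {"A": (0.0, 0.0), "C": (0.0, 1.0), "G": (1.0, 1.0), "T": (1.0, 0.0)}
    x, y = 0.5, 0.5
    pts = []
    for base in seq:
        cx, cy = corners[base]
        x, y = (x + cx) / 2, (y + cy) / 2
        pts.append((x, y))
    return pts

def drap(seq):
    """Dinucleotide Relative Abundance Profile: rho(XY) = f(XY) / (f(X) f(Y))."""
    n = len(seq)
    mono = {b: seq.count(b) / n for b in "ACGT"}
    rho = {}
    for x in "ACGT":
        for y in "ACGT":
            f_xy = sum(seq[i:i + 2] == x + y for i in range(n - 1)) / (n - 1)
            rho[x + y] = f_xy / (mono[x] * mono[y]) if mono[x] * mono[y] else 0.0
    return rho

seq = "ACGTACGTACGTCGCGCG"
print(drap(seq)["CG"])  # > 1: CG is over-represented relative to base composition
```

In this framing, DRAP is the order-2 point of the signature spectrum, while the CGR point cloud encodes subsequence biases of all orders at once; image-distance measures between CGRs compare that full spectrum.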
Wilson, Brenda J; Miller, Fiona Alice; Rousseau, François
2017-12-01
Next generation genomic sequencing (NGS) technologies (whole genome and whole exome sequencing) are now cheap enough to be within the grasp of many health care organizations. To many, NGS is symbolic of cutting edge health care, offering the promise of "precision" and "personalized" medicine. Historically, research and clinical application have been a two-way street in clinical genetics: research often driven directly by the desire to understand and try to solve immediate clinical problems affecting real, identifiable patients and families, accompanied by a low threshold of willingness to apply research-driven interventions without resort to formal empirical evaluations. However, NGS technologies are not simple substitutes for older technologies and need careful evaluation for use as screening, diagnostic, or prognostic tools. We have concerns across three areas. First, at the moment, analytic validity is unknown because technical platforms are not yet stable, laboratory quality assurance programs are in their infancy, and data interpretation capabilities are badly underdeveloped. Second, clinical validity of genomic findings for patient populations without pre-existing high genetic risk is doubtful, as most clinical experience with NGS technologies relates to patients with a high prior likelihood of a genetic etiology. Finally, we are concerned that proponents argue not only for clinically driven approaches to assessing a patient's genome, but also for seeking out variants associated with unrelated conditions or susceptibilities, so-called "secondary targets"; this is screening on a genomic scale. We argue that clinical uses of genomic sequencing should remain limited to specialist and research settings, that screening for secondary findings in clinical testing should be limited to the maximum extent possible, and that the benefits, harms, and economic implications of their routine use be systematically evaluated. 
All stakeholders have a responsibility to ensure that patients receive effective, safe health care, in an economically sustainable health care system. There should be no exception for genome-based interventions. Copyright © 2017 Elsevier Inc. All rights reserved.
A genome-wide 3C-method for characterizing the three-dimensional architectures of genomes.
Duan, Zhijun; Andronescu, Mirela; Schutz, Kevin; Lee, Choli; Shendure, Jay; Fields, Stanley; Noble, William S; Anthony Blau, C
2012-11-01
Accumulating evidence demonstrates that the three-dimensional (3D) organization of chromosomes within the eukaryotic nucleus reflects and influences genomic activities, including transcription, DNA replication, recombination and DNA repair. In order to uncover structure-function relationships, it is necessary first to understand the principles underlying the folding and the 3D arrangement of chromosomes. Chromosome conformation capture (3C) provides a powerful tool for detecting interactions within and between chromosomes. A high throughput derivative of 3C, chromosome conformation capture on chip (4C), executes a genome-wide interrogation of interaction partners for a given locus. We recently developed a new method, a derivative of 3C and 4C, which, similar to Hi-C, is capable of comprehensively identifying long-range chromosome interactions throughout a genome in an unbiased fashion. Hence, our method can be applied to decipher the 3D architectures of genomes. Here, we provide a detailed protocol for this method. Published by Elsevier Inc.
Emergent rules for codon choice elucidated by editing rare arginine codons in Escherichia coli
Napolitano, Michael G.; Landon, Matthieu; Gregg, Christopher J.; Lajoie, Marc J.; Govindarajan, Lakshmi; Mosberg, Joshua A.; Kuznetsov, Gleb; Goodman, Daniel B.; Vargas-Rodriguez, Oscar; Isaacs, Farren J.; Söll, Dieter; Church, George M.
2016-01-01
The degeneracy of the genetic code allows nucleic acids to encode amino acid identity as well as noncoding information for gene regulation and genome maintenance. The rare arginine codons AGA and AGG (AGR) present a case study in codon choice, with AGRs encoding important transcriptional and translational properties distinct from the other synonymous alternatives (CGN). We created a strain of Escherichia coli with all 123 instances of AGR codons removed from all essential genes. We readily replaced 110 AGR codons with the synonymous CGU codons, but the remaining 13 “recalcitrant” AGRs required diversification to identify viable alternatives. Successful replacement codons tended to conserve local ribosomal binding site-like motifs and local mRNA secondary structure, sometimes at the expense of amino acid identity. Based on these observations, we empirically defined metrics for a multidimensional “safe replacement zone” (SRZ) within which alternative codons are more likely to be viable. To evaluate synonymous and nonsynonymous alternatives to essential AGRs further, we implemented a CRISPR/Cas9-based method to deplete a diversified population of a wild-type allele, allowing us to evaluate exhaustively the fitness impact of all 64 codon alternatives. Using this method, we confirmed the relevance of the SRZ by tracking codon fitness over time in 14 different genes, finding that codons that fall outside the SRZ are rapidly depleted from a growing population. Our unbiased and systematic strategy for identifying unpredicted design flaws in synthetic genomes and for elucidating rules governing codon choice will be crucial for designing genomes exhibiting radically altered genetic codes. PMID:27601680
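The CRISPR/Cas9-based depletion assay above tracks codon fitness over time in a growing population. As a minimal illustration of how such trajectories can be summarized (the paper's exact fitness metric is not given here; this assumes fitness is proxied by the slope of log allele frequency over sampling points, with all numbers hypothetical):

```python
import math

def codon_fitness(freqs, times):
    """Least-squares slope of log(allele frequency) over time.
    Codon alternatives rapidly depleted from the population (e.g. those
    falling outside the 'safe replacement zone') give strongly negative
    values; neutral alternatives hover near zero."""
    ys = [math.log(f) for f in freqs]
    n = len(times)
    tbar, ybar = sum(times) / n, sum(ys) / n
    num = sum((t - tbar) * (y - ybar) for t, y in zip(times, ys))
    den = sum((t - tbar) ** 2 for t in times)
    return num / den

# hypothetical trajectory: frequency halves every sampling interval,
# so the log-slope is exactly -ln 2 per interval
fitness = codon_fitness([0.4, 0.2, 0.1, 0.05], [0, 1, 2, 3])
```

A flat trajectory (constant frequency) returns a slope of zero, matching the intuition that codons which persist in the population are fitness-neutral.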
Sasheva, Pavlina; Grossniklaus, Ueli
2017-01-01
Over the last years, it has become increasingly clear that environmental influences can affect the epigenomic landscape and that some epigenetic variants can have heritable, phenotypic effects. While there are a variety of methods to perform genome-wide analyses of DNA methylation in model organisms, this is still a challenging task for non-model organisms without a reference genome. Differentially methylated region-representational difference analysis (DMR-RDA) is a sensitive and powerful PCR-based technique that isolates DNA fragments that are differentially methylated between two otherwise identical genomes. The technique does not require special equipment and is independent of prior knowledge about the genome. It is even applicable to genomes that have high complexity and a large size, being the method of choice for the analysis of plant non-model systems.
Genomics Education in Practice: Evaluation of a Mobile Lab Design
ERIC Educational Resources Information Center
Van Mil, Marc H. W.; Boerwinkel, Dirk Jan; Buizer-Voskamp, Jacobine E.; Speksnijder, Annelies; Waarlo, Arend Jan
2010-01-01
Dutch genomics research centers have developed the "DNA labs on the road" to bridge the gap between modern genomics research practice and secondary-school curriculum in the Netherlands. These mobile DNA labs offer upper-secondary students the opportunity to experience genomics research through experiments with laboratory equipment that…
78 FR 61851 - National Human Genome Research Institute; Notice of Closed Meeting
Federal Register 2010, 2011, 2012, 2013, 2014
2013-10-04
... DEPARTMENT OF HEALTH AND HUMAN SERVICES National Institutes of Health National Human Genome... unwarranted invasion of personal privacy. Name of Committee: National Human Genome Research Institute Special... a.m. to 4:00 p.m. Agenda: To review and evaluate grant applications. Place: National Human Genome...
USDA-ARS's Scientific Manuscript database
Geographic isolates of Lymantria dispar multiple nucleopolyhedrovirus: Genome sequence analysis and pathogenicity against European and Asian gypsy moth strains. To evaluate the genetic diversity of Lymantria dispar nucleopolyhedrovirus (LdMNPV) at the genomic level, the genomes of three isolates of...
GStream: Improving SNP and CNV Coverage on Genome-Wide Association Studies
Alonso, Arnald; Marsal, Sara; Tortosa, Raül; Canela-Xandri, Oriol; Julià, Antonio
2013-01-01
We present GStream, a method that combines genome-wide SNP and CNV genotyping in the Illumina microarray platform with unprecedented accuracy. This new method outperforms previous well-established SNP genotyping software. More importantly, the CNV calling algorithm of GStream dramatically improves the results obtained by previous state-of-the-art methods and yields an accuracy that is close to that obtained by purely CNV-oriented technologies like Comparative Genomic Hybridization (CGH). We demonstrate the superior performance of GStream using microarray data generated from HapMap samples. Using the reference CNV calls generated by the 1000 Genomes Project (1KGP) and well-known studies on whole genome CNV characterization based either on CGH or genotyping microarray technologies, we show that GStream can increase the number of reliably detected variants up to 25% compared to previously developed methods. Furthermore, the increased genome coverage provided by GStream allows the discovery of CNVs in close linkage disequilibrium with SNPs, previously associated with disease risk in published Genome-Wide Association Studies (GWAS). These results could provide important insights into the biological mechanism underlying the detected disease risk association. With GStream, large-scale GWAS will not only benefit from the combined genotyping of SNPs and CNVs at an unprecedented accuracy, but will also take advantage of the computational efficiency of the method. PMID:23844243
Effect of cow reference group on validation reliability of genomic evaluation.
Koivula, M; Strandén, I; Aamand, G P; Mäntysaari, E A
2016-06-01
We studied the effect of including genomic data for cows in the reference population of single-step evaluations. Deregressed individual cow genetic evaluations (DRP) from milk production evaluations of Nordic Red Dairy cattle were used to estimate the single-step breeding values. Validation reliability and bias of the evaluations were calculated with four data sets including different amounts of DRP record information from genotyped cows in the reference population. The gain in reliability was from 2 to 4 percentage points for the production traits, depending on the DRP data used and the amount of genomic data. Moreover, inclusion of genotyped bull dams and their genotyped daughters seemed to create some bias in the single-step evaluation. Still, genotyping cows and their inclusion in the reference population is advantageous and should be encouraged.
Muleta, Kebede T; Bulli, Peter; Zhang, Zhiwu; Chen, Xianming; Pumphrey, Michael
2017-11-01
Harnessing diversity from germplasm collections is more feasible today because of the development of lower-cost and higher-throughput genotyping methods. However, the cost of phenotyping is still generally high, so efficient methods of sampling and exploiting useful diversity are needed. Genomic selection (GS) has the potential to enhance the use of desirable genetic variation in germplasm collections through predicting the genomic estimated breeding values (GEBVs) for all traits that have been measured. Here, we evaluated the effects of various scenarios of population genetic properties and marker density on the accuracy of GEBVs in the context of applying GS for wheat (Triticum aestivum L.) germplasm use. Empirical data for adult plant resistance to stripe rust (Puccinia striiformis f. sp. tritici) collected on 1163 spring wheat accessions and genotypic data based on the wheat 9K single nucleotide polymorphism (SNP) iSelect assay were used for various genomic prediction tests. Unsurprisingly, the results of the cross-validation tests demonstrated that prediction accuracy increased with an increase in training population size and marker density. It was evident that using all the available markers (5619) was unnecessary for capturing the trait variation in the germplasm collection, with no further gain in prediction accuracy beyond 1 SNP per 3.2 cM (∼1850 markers), which is close to the linkage disequilibrium decay rate in this population. Collectively, our results suggest that larger germplasm collections may be efficiently sampled via lower-density genotyping methods, whereas genetic relationships between the training and validation populations remain critical when exploiting GS to select from germplasm collections. Copyright © 2017 Crop Science Society of America.
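The cross-validation logic described above (prediction accuracy as a function of training-population size) can be sketched with simulated data. This is a generic ridge-regression (SNP-BLUP-style) stand-in, not the authors' pipeline; the sample sizes, marker effects, and shrinkage parameter are all hypothetical:

```python
import numpy as np

rng = np.random.default_rng(7)
n, p = 500, 1000                                  # hypothetical: 500 accessions, 1000 SNPs
X = rng.binomial(2, 0.5, (n, p)).astype(float)    # 0/1/2 genotype dosages
beta_true = rng.normal(0.0, 0.05, p)              # small true marker effects
y = X @ beta_true + rng.normal(0.0, 1.0, n)       # phenotype = genetic value + noise

def gebv_accuracy(n_train, lam=100.0):
    """Fit shrunken marker effects on a training split (ridge regression)
    and report cor(GEBV, phenotype) in the held-out validation split."""
    Xt, yt = X[:n_train], y[:n_train]
    Xv, yv = X[n_train:], y[n_train:]
    beta = np.linalg.solve(Xt.T @ Xt + lam * np.eye(p), Xt.T @ yt)
    return float(np.corrcoef(Xv @ beta, yv)[0, 1])

# accuracy as a function of training-population size
accs = {m: gebv_accuracy(m) for m in (100, 200, 400)}
```

With this kind of setup, accuracy generally rises as the training set grows, mirroring the cross-validation trend reported in the abstract.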
A statistical method for the detection of variants from next-generation resequencing of DNA pools.
Bansal, Vikas
2010-06-15
Next-generation sequencing technologies have enabled the sequencing of several human genomes in their entirety. However, the routine resequencing of complete genomes remains infeasible. The massive capacity of next-generation sequencers can be harnessed for sequencing specific genomic regions in hundreds to thousands of individuals. Sequencing-based association studies are currently limited by the low level of multiplexing offered by sequencing platforms. Pooled sequencing represents a cost-effective approach for studying rare variants in large populations. To utilize the power of DNA pooling, it is important to accurately identify sequence variants from pooled sequencing data. Detection of rare variants from pooled sequencing represents a different challenge than detection of variants from individual sequencing. We describe a novel statistical approach, CRISP [Comprehensive Read analysis for Identification of Single Nucleotide Polymorphisms (SNPs) from Pooled sequencing] that is able to identify both rare and common variants by using two approaches: (i) comparing the distribution of allele counts across multiple pools using contingency tables and (ii) evaluating the probability of observing multiple non-reference base calls due to sequencing errors alone. Information about the distribution of reads between the forward and reverse strands and the size of the pools is also incorporated within this framework to filter out false variants. Validation of CRISP on two separate pooled sequencing datasets generated using the Illumina Genome Analyzer demonstrates that it can detect 80-85% of SNPs identified using individual sequencing while achieving a low false discovery rate (3-5%). Comparison with previous methods for pooled SNP detection demonstrates the significantly lower false positive and false negative rates for CRISP. Implementation of this method is available at http://polymorphism.scripps.edu/~vbansal/software/CRISP/.
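CRISP's approach (ii), evaluating whether the non-reference base calls observed in a pool can be explained by sequencing errors alone, can be sketched as an exact binomial tail test. This is a simplified stand-in for the published method (which also uses cross-pool contingency tables and strand information); the error rate, read counts, and significance cutoff are all hypothetical:

```python
import math

def binom_sf(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p): the chance of seeing k or more
    non-reference calls from sequencing error alone."""
    return sum(math.comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

def error_only_pvalue(alt_reads, total_reads, err=0.01):
    # err = assumed per-base sequencing error rate (hypothetical)
    return binom_sf(alt_reads, total_reads, err)

# hypothetical pooled data: (non-reference reads, total reads) per pool
pools = [(0, 200), (1, 180), (25, 210), (0, 190)]
pvals = [error_only_pvalue(a, n) for a, n in pools]

# only pool index 2 carries far more alt reads than a 1% error rate explains
variant_pools = [i for i, p in enumerate(pvals) if p < 1e-6]
```

Pools with one or two non-reference reads are consistent with error, while 25 alt reads out of 210 (expected ~2 under the error model) flags a genuine variant carried by some individuals in that pool.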
Mutation Scanning in Wheat by Exon Capture and Next-Generation Sequencing.
King, Robert; Bird, Nicholas; Ramirez-Gonzalez, Ricardo; Coghill, Jane A; Patil, Archana; Hassani-Pak, Keywan; Uauy, Cristobal; Phillips, Andrew L
2015-01-01
Targeted Induced Local Lesions in Genomes (TILLING) is a reverse genetics approach to identify novel sequence variation in genomes, with the aims of investigating gene function and/or developing useful alleles for breeding. Despite recent advances in wheat genomics, most current TILLING methods are low to medium in throughput, being based on PCR amplification of the target genes. We performed a pilot-scale evaluation of TILLING in wheat by next-generation sequencing through exon capture. An oligonucleotide-based enrichment array covering ~2 Mbp of wheat coding sequence was used to carry out exon capture and sequencing on three mutagenised lines of wheat containing previously-identified mutations in the TaGA20ox1 homoeologous genes. After testing different mapping algorithms and settings, candidate SNPs were identified by mapping to the IWGSC wheat Chromosome Survey Sequences. Where sequence data for all three homoeologues were found in the reference, mutant calls were unambiguous; however, where the reference lacked one or two of the homoeologues, captured reads from these genes were mis-mapped to other homoeologues, resulting either in dilution of the variant allele frequency or assignment of mutations to the wrong homoeologue. Competitive PCR assays were used to validate the putative SNPs and estimate cut-off levels for SNP filtering. At least 464 high-confidence SNPs were detected across the three mutagenized lines, including the three known alleles in TaGA20ox1, indicating a mutation rate of ~35 SNPs per Mb, similar to that estimated by PCR-based TILLING. This demonstrates the feasibility of using exon capture for genome re-sequencing as a method of mutation detection in polyploid wheat, but accurate mutation calling will require an improved genomic reference with more comprehensive coverage of homoeologues.
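The "dilution of the variant allele frequency" described above follows directly from read mixing: when reads from several homoeologous gene copies are mis-mapped onto a single reference locus, the mutant reads are averaged over all contributing copies. A back-of-envelope helper, assuming the collapsed copies contribute reads equally (a simplification; names and numbers are illustrative only):

```python
def expected_vaf(n_homoeologues_collapsed, zygosity=1.0):
    """Expected variant allele frequency at a reference locus onto which
    reads from several homoeologous copies are mis-mapped.
    zygosity = 1.0 for a homozygous mutation, 0.5 for heterozygous."""
    return zygosity / n_homoeologues_collapsed

correct_mapping = expected_vaf(1)      # mutation looks fixed in its homoeologue
collapsed_all3 = expected_vaf(3)       # hexaploid wheat: all three copies co-map
```

A homozygous mutation that should appear at 100% frequency is observed near one-third when all three homoeologues collapse onto one reference sequence, which is why SNP-filtering cut-offs had to be estimated empirically with competitive PCR assays.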
2014-01-01
Background Genome-wide microarrays have been useful for predicting chemical-genetic interactions at the gene level. However, interpreting genome-wide microarray results can be overwhelming due to the vast output of gene expression data combined with off-target transcriptional responses often induced by drug treatment. This study demonstrates how experimental and computational methods can interact with each other to arrive at more accurate predictions of drug-induced perturbations. We present a two-stage strategy that links microarray experimental testing and network training conditions to predict gene perturbations for a drug with a known mechanism of action in a well-studied organism. Results S. cerevisiae cells were treated with the antifungal fluconazole, and expression profiling was conducted under different biological conditions using Affymetrix genome-wide microarrays. Transcripts were filtered with a formal network-based method, sparse simultaneous equation models and Lasso regression (SSEM-Lasso), under different network training conditions. Gene expression results were evaluated using both gene set and single gene target analyses, and the drug’s transcriptional effects were narrowed first by pathway and then by individual genes. Variables included: (i) Testing conditions – exposure time and concentration and (ii) Network training conditions – training compendium modifications. Two analyses of SSEM-Lasso output – gene set and single gene – were conducted to gain a better understanding of how SSEM-Lasso predicts perturbation targets. Conclusions This study demonstrates that genome-wide microarrays can be optimized using a two-stage strategy for a more in-depth understanding of how a cell manifests biological reactions to a drug treatment at the transcription level. Additionally, a more detailed understanding of how the statistical model, SSEM-Lasso, propagates perturbations through a network of gene regulatory interactions is achieved. PMID:24444313
Fast randomization of large genomic datasets while preserving alteration counts.
Gobbi, Andrea; Iorio, Francesco; Dawson, Kevin J; Wedge, David C; Tamborero, David; Alexandrov, Ludmil B; Lopez-Bigas, Nuria; Garnett, Mathew J; Jurman, Giuseppe; Saez-Rodriguez, Julio
2014-09-01
Studying combinatorial patterns in cancer genomic datasets has recently emerged as a tool for identifying novel cancer driver networks. Approaches have been devised to quantify, for example, the tendency of a set of genes to be mutated in a 'mutually exclusive' manner. The significance of the proposed metrics is usually evaluated by computing P-values under appropriate null models. To this end, a Monte Carlo method (the switching-algorithm) is used to sample simulated datasets under a null model that preserves patient- and gene-wise mutation rates. In this method, a genomic dataset is represented as a bipartite network, to which Markov chain updates (switching-steps) are applied. These steps modify the network topology, and a minimal number of them must be executed to draw simulated datasets independently under the null model. This number has previously been deduced empirically to be a linear function of the total number of variants, making this process computationally expensive. We present a novel approximate lower bound for the number of switching-steps, derived analytically. Additionally, we have developed the R package BiRewire, including new efficient implementations of the switching-algorithm. We illustrate the performances of BiRewire by applying it to large real cancer genomics datasets. We report vast reductions in time requirement, with respect to existing implementations/bounds and equivalent P-value computations. Thus, we propose BiRewire to study statistical properties in genomic datasets, and other data that can be modeled as bipartite networks. BiRewire is available on BioConductor at http://www.bioconductor.org/packages/2.13/bioc/html/BiRewire.html. Supplementary data are available at Bioinformatics online. © The Author 2014. Published by Oxford University Press.
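A single switching-step on a bipartite mutation network can be sketched as follows. This is a minimal stand-in for BiRewire's implementation (which is in R and far more efficient), run on a toy patient-gene dataset; note how every patient's and every gene's mutation count is preserved exactly, as the null model requires:

```python
import random

def switching_step(edges, rng):
    """One Markov-chain update: pick two (patient, gene) mutations and
    swap their gene endpoints. The swap is rejected if it would create a
    duplicate entry, so patient- and gene-wise counts never change."""
    (p1, g1), (p2, g2) = rng.sample(sorted(edges), 2)
    if (p1, g2) in edges or (p2, g1) in edges:
        return False
    edges -= {(p1, g1), (p2, g2)}
    edges |= {(p1, g2), (p2, g1)}
    return True

def randomize(mutations, n_steps, seed=0):
    """Apply n_steps switching-steps to draw one simulated dataset."""
    rng = random.Random(seed)
    edges = set(mutations)
    for _ in range(n_steps):
        switching_step(edges, rng)
    return edges

# toy dataset: which patient carries a mutation in which gene
observed = {("pt1", "TP53"), ("pt1", "KRAS"), ("pt2", "TP53"),
            ("pt2", "BRAF"), ("pt3", "EGFR"), ("pt3", "KRAS")}
null_sample = randomize(observed, n_steps=100)
```

The paper's contribution is precisely the analytic lower bound on `n_steps` needed before samples like `null_sample` are effectively independent draws from the null model.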
Integration of Multiple Genomic and Phenotype Data to Infer Novel miRNA-Disease Associations
Zhou, Meng; Cheng, Liang; Yang, Haixiu; Wang, Jing; Sun, Jie; Wang, Zhenzhen
2016-01-01
MicroRNAs (miRNAs) play an important role in the development and progression of human diseases. The identification of disease-associated miRNAs will be helpful for understanding the molecular mechanisms of diseases at the post-transcriptional level. Based on different types of genomic data sources, computational methods for miRNA-disease association prediction have been proposed. However, individual source of genomic data tends to be incomplete and noisy; therefore, the integration of various types of genomic data for inferring reliable miRNA-disease associations is urgently needed. In this study, we present a computational framework, CHNmiRD, for identifying miRNA-disease associations by integrating multiple genomic and phenotype data, including protein-protein interaction data, gene ontology data, experimentally verified miRNA-target relationships, disease phenotype information and known miRNA-disease connections. The performance of CHNmiRD was evaluated by experimentally verified miRNA-disease associations, which achieved an area under the ROC curve (AUC) of 0.834 for 5-fold cross-validation. In particular, CHNmiRD displayed excellent performance for diseases without any known related miRNAs. The results of case studies for three human diseases (glioblastoma, myocardial infarction and type 1 diabetes) showed that all of the top 10 ranked miRNAs having no known associations with these three diseases in existing miRNA-disease databases were directly or indirectly confirmed by our latest literature mining. All these results demonstrated the reliability and efficiency of CHNmiRD, and it is anticipated that CHNmiRD will serve as a powerful bioinformatics method for mining novel disease-related miRNAs and providing a new perspective into molecular mechanisms underlying human diseases at the post-transcriptional level. CHNmiRD is freely available at http://www.bio-bigdata.com/CHNmiRD. PMID:26849207
Integration of Multiple Genomic and Phenotype Data to Infer Novel miRNA-Disease Associations.
Shi, Hongbo; Zhang, Guangde; Zhou, Meng; Cheng, Liang; Yang, Haixiu; Wang, Jing; Sun, Jie; Wang, Zhenzhen
2016-01-01
MicroRNAs (miRNAs) play an important role in the development and progression of human diseases. The identification of disease-associated miRNAs will be helpful for understanding the molecular mechanisms of diseases at the post-transcriptional level. Based on different types of genomic data sources, computational methods for miRNA-disease association prediction have been proposed. However, individual source of genomic data tends to be incomplete and noisy; therefore, the integration of various types of genomic data for inferring reliable miRNA-disease associations is urgently needed. In this study, we present a computational framework, CHNmiRD, for identifying miRNA-disease associations by integrating multiple genomic and phenotype data, including protein-protein interaction data, gene ontology data, experimentally verified miRNA-target relationships, disease phenotype information and known miRNA-disease connections. The performance of CHNmiRD was evaluated by experimentally verified miRNA-disease associations, which achieved an area under the ROC curve (AUC) of 0.834 for 5-fold cross-validation. In particular, CHNmiRD displayed excellent performance for diseases without any known related miRNAs. The results of case studies for three human diseases (glioblastoma, myocardial infarction and type 1 diabetes) showed that all of the top 10 ranked miRNAs having no known associations with these three diseases in existing miRNA-disease databases were directly or indirectly confirmed by our latest literature mining. All these results demonstrated the reliability and efficiency of CHNmiRD, and it is anticipated that CHNmiRD will serve as a powerful bioinformatics method for mining novel disease-related miRNAs and providing a new perspective into molecular mechanisms underlying human diseases at the post-transcriptional level. CHNmiRD is freely available at http://www.bio-bigdata.com/CHNmiRD.
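The AUC of 0.834 reported for CHNmiRD summarizes how often known miRNA-disease associations outrank non-associations under 5-fold cross-validation. A generic rank-based AUC (not CHNmiRD's code; the scores below are hypothetical predictor outputs) can be computed via the Mann-Whitney identity:

```python
def auc(pos_scores, neg_scores):
    """Area under the ROC curve via the rank-sum (Mann-Whitney) identity:
    the probability that a known association is scored above a random
    non-association, counting ties as half."""
    wins = sum((p > n) + 0.5 * (p == n)
               for p in pos_scores for n in neg_scores)
    return wins / (len(pos_scores) * len(neg_scores))

# hypothetical scores for known (positive) vs. unknown (negative) pairs:
# 8 of the 9 positive/negative pairs are ranked correctly
example = auc([0.9, 0.8, 0.4], [0.7, 0.3, 0.2])
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect separation, which is why values such as 0.834 indicate a usefully discriminative predictor.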
Nguyen, Quan H; Tellam, Ross L; Naval-Sanchez, Marina; Porto-Neto, Laercio R; Barendse, William; Reverter, Antonio; Hayes, Benjamin; Kijas, James; Dalrymple, Brian P
2018-01-01
Genome sequences for hundreds of mammalian species are available, but an understanding of their genomic regulatory regions, which control gene expression, is only beginning. A comprehensive prediction of potential active regulatory regions is necessary to functionally study the roles of the majority of genomic variants in evolution, domestication, and animal production. We developed a computational method to predict regulatory DNA sequences (promoters, enhancers, and transcription factor binding sites) in production animals (cows and pigs) and extended its broad applicability to other mammals. The method utilizes human regulatory features identified from thousands of tissues, cell lines, and experimental assays to find homologous regions that are conserved in sequences and genome organization and are enriched for regulatory elements in the genome sequences of other mammalian species. Importantly, we developed a filtering strategy, including a machine learning classification method, to utilize a very small number of species-specific experimental datasets available to select for the likely active regulatory regions. The method finds the optimal combination of sensitivity and accuracy to unbiasedly predict regulatory regions in mammalian species. Furthermore, we demonstrated the utility of the predicted regulatory datasets in cattle for prioritizing variants associated with multiple production and climate change adaptation traits and identifying potential genome editing targets. PMID:29618048
Mentasti, Massimo; Tewolde, Rediat; Aslett, Martin; Harris, Simon R.; Afshar, Baharak; Underwood, Anthony; Harrison, Timothy G.
2016-01-01
Sequence-based typing (SBT), analogous to multilocus sequence typing (MLST), is the current “gold standard” typing method for investigation of legionellosis outbreaks caused by Legionella pneumophila. However, as common sequence types (STs) cause many infections, some investigations remain unresolved. In this study, various whole-genome sequencing (WGS)-based methods were evaluated according to published guidelines, including (i) a single nucleotide polymorphism (SNP)-based method, (ii) extended MLST using different numbers of genes, (iii) determination of gene presence or absence, and (iv) a kmer-based method. L. pneumophila serogroup 1 isolates (n = 106) from the standard “typing panel,” previously used by the European Society for Clinical Microbiology Study Group on Legionella Infections (ESGLI), were tested together with another 229 isolates. Over 98% of isolates were considered typeable using the SNP- and kmer-based methods. Percentages of isolates with complete extended MLST profiles ranged from 99.1% (50 genes) to 86.8% (1,455 genes), while only 41.5% produced a full profile with the gene presence/absence scheme. Replicates demonstrated that all methods offer 100% reproducibility. Indices of discrimination range from 0.972 (ribosomal MLST) to 0.999 (SNP based), and all values were higher than that achieved with SBT (0.940). Epidemiological concordance is generally inversely related to discriminatory power. We propose that an extended MLST scheme with ∼50 genes provides optimal epidemiological concordance while substantially improving the discrimination offered by SBT and can be used as part of a hierarchical typing scheme that should maintain backwards compatibility and increase discrimination where necessary. This analysis will be useful for the ESGLI to design a scheme that has the potential to become the new gold standard typing method for L. pneumophila. PMID:27280420
David, Sophia; Mentasti, Massimo; Tewolde, Rediat; Aslett, Martin; Harris, Simon R; Afshar, Baharak; Underwood, Anthony; Fry, Norman K; Parkhill, Julian; Harrison, Timothy G
2016-08-01
Sequence-based typing (SBT), analogous to multilocus sequence typing (MLST), is the current "gold standard" typing method for investigation of legionellosis outbreaks caused by Legionella pneumophila. However, as common sequence types (STs) cause many infections, some investigations remain unresolved. In this study, various whole-genome sequencing (WGS)-based methods were evaluated according to published guidelines, including (i) a single nucleotide polymorphism (SNP)-based method, (ii) extended MLST using different numbers of genes, (iii) determination of gene presence or absence, and (iv) a kmer-based method. L. pneumophila serogroup 1 isolates (n = 106) from the standard "typing panel," previously used by the European Society for Clinical Microbiology Study Group on Legionella Infections (ESGLI), were tested together with another 229 isolates. Over 98% of isolates were considered typeable using the SNP- and kmer-based methods. Percentages of isolates with complete extended MLST profiles ranged from 99.1% (50 genes) to 86.8% (1,455 genes), while only 41.5% produced a full profile with the gene presence/absence scheme. Replicates demonstrated that all methods offer 100% reproducibility. Indices of discrimination range from 0.972 (ribosomal MLST) to 0.999 (SNP based), and all values were higher than that achieved with SBT (0.940). Epidemiological concordance is generally inversely related to discriminatory power. We propose that an extended MLST scheme with ∼50 genes provides optimal epidemiological concordance while substantially improving the discrimination offered by SBT and can be used as part of a hierarchical typing scheme that should maintain backwards compatibility and increase discrimination where necessary. This analysis will be useful for the ESGLI to design a scheme that has the potential to become the new gold standard typing method for L. pneumophila. Copyright © 2016 David et al.
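The indices of discrimination quoted above (0.940 for SBT up to 0.999 for the SNP-based method) are conventionally computed as the Hunter-Gaston index. A minimal implementation, with hypothetical type-count panels rather than the study's real data:

```python
def discrimination_index(type_counts):
    """Hunter-Gaston index of discrimination: the probability that two
    isolates sampled at random without replacement are assigned
    different types. type_counts[i] = number of isolates of type i."""
    n = sum(type_counts)
    return 1.0 - sum(c * (c - 1) for c in type_counts) / (n * (n - 1))

# hypothetical panels: a coarse scheme lumps isolates into a few large
# sequence types, while a finer scheme splits them further
coarse = discrimination_index([40, 30, 20, 10])
fine = discrimination_index([10] * 10)
```

Splitting common types into smaller groups drives the index toward 1.0, which is why schemes that resolve the frequent STs (like extended MLST with ~50 genes) discriminate better than SBT.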
Iyer, Janani; Wang, Qingyu; Le, Thanh; Pizzo, Lucilla; Grönke, Sebastian; Ambegaokar, Surendra S.; Imai, Yuzuru; Srivastava, Ashutosh; Troisí, Beatriz Llamusí; Mardon, Graeme; Artero, Ruben; Jackson, George R.; Isaacs, Adrian M.; Partridge, Linda; Lu, Bingwei; Kumar, Justin P.; Girirajan, Santhosh
2016-01-01
About two-thirds of the vital genes in the Drosophila genome are involved in eye development, making the fly eye an excellent genetic system to study cellular function and development, neurodevelopment/degeneration, and complex diseases such as cancer and diabetes. We developed a novel computational method, implemented as Flynotyper software (http://flynotyper.sourceforge.net), to quantitatively assess the morphological defects in the Drosophila eye resulting from genetic alterations affecting basic cellular and developmental processes. Flynotyper utilizes a series of image processing operations to automatically detect the fly eye and the individual ommatidium, and calculates a phenotypic score as a measure of the disorderliness of ommatidial arrangement in the fly eye. As a proof of principle, we tested our method by analyzing the defects due to eye-specific knockdown of Drosophila orthologs of 12 neurodevelopmental genes to accurately document differential sensitivities of these genes to dosage alteration. We also evaluated eye images from six independent studies assessing the effect of overexpression of repeats, candidates from peptide library screens, and modifiers of neurotoxicity and developmental processes on eye morphology, and show strong concordance with the original assessment. We further demonstrate the utility of this method by analyzing 16 modifiers of sine oculis obtained from two genome-wide deficiency screens of Drosophila and accurately quantifying the effect of its enhancers and suppressors during eye development. Our method will complement existing assays for eye phenotypes, and increase the accuracy of studies that use fly eyes for functional evaluation of genes and genetic interactions. PMID:26994292
Strategies and tools for whole genome alignments
DOE Office of Scientific and Technical Information (OSTI.GOV)
Couronne, Olivier; Poliakov, Alexander; Bray, Nicolas
2002-11-25
The availability of the assembled mouse genome makes possible, for the first time, an alignment and comparison of two large vertebrate genomes. We have investigated different strategies of alignment for the subsequent analysis of conservation of genomes that are effective for different quality assemblies. These strategies were applied to the comparison of the working draft of the human genome with the Mouse Genome Sequencing Consortium assembly, as well as other intermediate mouse assemblies. Our methods are fast and the resulting alignments exhibit a high degree of sensitivity, covering more than 90 percent of known coding exons in the human genome. We have obtained such coverage while preserving specificity. With a view towards the end user, we have developed a suite of tools and websites for automatically aligning, and subsequently browsing and working with, whole genome comparisons. We describe the use of these tools to identify conserved non-coding regions between the human and mouse genomes, some of which have not been identified by other methods.
Barría, Agustín; Christensen, Kris A.; Yoshida, Grazyella M.; Correa, Katharina; Jedlicki, Ana; Lhorente, Jean P.; Davidson, William S.; Yáñez, José M.
2018-01-01
Piscirickettsiosis, caused by the bacterium Piscirickettsia salmonis, is one of the main infectious diseases affecting coho salmon (Oncorhynchus kisutch) farming, and current treatments have been ineffective for the control of this disease. Genetic improvement for P. salmonis resistance has been proposed as a feasible alternative for the control of this infectious disease in farmed fish. Genotyping by sequencing (GBS) strategies allow genotyping of hundreds of individuals with thousands of single nucleotide polymorphisms (SNPs), which can be used to perform genome wide association studies (GWAS) and predict genetic values using genome-wide information. We used double-digest restriction-site associated DNA (ddRAD) sequencing to dissect the genetic architecture of resistance against P. salmonis in a farmed coho salmon population and to identify molecular markers associated with the trait. We also evaluated genomic selection (GS) models in order to determine the potential to accelerate the genetic improvement of this trait by means of using genome-wide molecular information. A total of 764 individuals from 33 full-sib families (17 highly resistant and 16 highly susceptible) were experimentally challenged against P. salmonis and their genotypes were assayed using ddRAD sequencing. A total of 9,389 SNP markers were identified in the population. These markers were used to test genomic selection models and compare different GWAS methodologies for resistance measured as day of death (DD) and binary survival (BIN). Genomic selection models showed higher accuracies than the traditional pedigree-based best linear unbiased prediction (PBLUP) method, for both DD and BIN. The models showed an improvement of up to 95% and 155% respectively over PBLUP. One SNP related to B-cell development was identified as a potential functional candidate associated with resistance to P. salmonis defined as DD. PMID:29440129
Rolf, Megan M; Taylor, Jeremy F; Schnabel, Robert D; McKay, Stephanie D; McClure, Matthew C; Northcutt, Sally L; Kerley, Monty S; Weaber, Robert L
2010-04-19
Molecular estimates of breeding value are expected to increase selection response due to improvements in the accuracy of selection and a reduction in generation interval, particularly for traits that are difficult or expensive to record or are measured late in life. Several statistical methods for incorporating molecular data into breeding value estimation have been proposed; however, most studies have utilized simulated data in which the generated linkage disequilibrium may not represent the targeted livestock population. A genomic relationship matrix was developed for 698 Angus steers and 1,707 Angus sires using 41,028 single nucleotide polymorphisms and breeding values were estimated using feed efficiency phenotypes (average daily feed intake [AFI], residual feed intake [RFI], and average daily gain) recorded on the steers. The number of SNPs needed to accurately estimate a genomic relationship matrix was evaluated in this population. Results were compared to estimates produced from pedigree-based mixed model analysis of 862 Angus steers with 34,864 identified paternal relatives but no female ancestors. Estimates of additive genetic variance and breeding value accuracies were similar for AFI and RFI using the numerator and genomic relationship matrices despite fewer animals in the genomic analysis. Bootstrap analyses indicated that 2,500-10,000 markers are required for robust estimation of genomic relationship matrices in cattle. This research shows that breeding values and their accuracies may be estimated for commercially important sires for traits recorded in experimental populations without the need for pedigree data to establish identity by descent between members of the commercial and experimental populations when at least 2,500 SNPs are available for the generation of a genomic relationship matrix.
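The genomic relationship matrix in this study is built from SNP genotypes. A common construction is VanRaden's first method, G = ZZ' / (2 Σ p(1-p)), sketched below for a toy genotype matrix; the study's exact construction details are not reproduced here:

```python
def vanraden_grm(M):
    """Genomic relationship matrix from an animals-x-SNPs matrix of 0/1/2
    minor-allele counts, following VanRaden's first method:
    G = ZZ' / (2 * sum_j p_j (1 - p_j)), where Z centres each SNP at 2p."""
    n, m = len(M), len(M[0])
    p = [sum(row[j] for row in M) / (2.0 * n) for j in range(m)]  # allele freqs
    Z = [[M[i][j] - 2.0 * p[j] for j in range(m)] for i in range(n)]
    denom = 2.0 * sum(pj * (1.0 - pj) for pj in p)
    return [[sum(Z[i][k] * Z[j][k] for k in range(m)) / denom
             for j in range(n)] for i in range(n)]

# four animals, five SNPs coded as minor-allele counts (toy data)
M = [[0, 1, 2, 1, 0],
     [1, 1, 2, 0, 0],
     [2, 0, 0, 1, 1],
     [1, 0, 0, 2, 1]]
G = vanraden_grm(M)
assert len(G) == 4
```

Animals with similar genotypes (rows 0 and 1 above) end up with a larger off-diagonal relationship than dissimilar ones, which is what lets G replace the pedigree-based numerator relationship matrix.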
Lee, Chi-Ching; Chen, Yi-Ping Phoebe; Yao, Tzu-Jung; Ma, Cheng-Yu; Lo, Wei-Cheng; Lyu, Ping-Chiang; Tang, Chuan Yi
2013-04-10
Sequencing of microbial genomes is important because many microbes carry antibiotic-resistance and pathogenicity genes. However, even with the help of new assembly software, finishing a whole genome is a time-consuming task. In most bacteria, pathogenicity or antibiotic-resistance genes are carried on genomic islands. Therefore, a quick genomic island (GI) prediction method is useful for genomes that are still being sequenced. In this work, we built a Web server called GI-POP (http://gipop.life.nthu.edu.tw) which integrates a sequence assembling tool, a functional annotation pipeline, and a high-performance GI-predicting module based on a support vector machine (SVM) method called genomic island genomic profile scanning (GI-GPS). The draft genomes of ongoing genome projects, in contigs or scaffolds, can be submitted to our Web server, which returns functional annotation and highly probable GI predictions. GI-POP is a comprehensive annotation Web server designed for ongoing genome project analysis. Researchers can perform annotation and obtain pre-analytic information, including possible GIs, coding/non-coding sequences, and functional analysis, from their draft genomes. This pre-analytic system can provide useful information for finishing a genome sequencing project. Copyright © 2012 Elsevier B.V. All rights reserved.
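GI-GPS itself is an SVM trained on genomic profiles. As a minimal illustration of the kind of compositional signal such predictors exploit (this is not the GI-GPS algorithm), the sketch below flags sliding windows whose GC content deviates from the genome average, a classic signature of horizontally acquired islands:

```python
def gc_deviation_windows(seq, window=20, step=10, cutoff=0.2):
    """Flag windows whose GC content deviates from the genome average by
    more than `cutoff` -- a simple compositional hint of a genomic island."""
    gc = lambda s: (s.count("G") + s.count("C")) / len(s)
    genome_gc = gc(seq)
    hits = []
    for start in range(0, len(seq) - window + 1, step):
        w = seq[start:start + window]
        if abs(gc(w) - genome_gc) > cutoff:
            hits.append((start, start + window))
    return hits

host = "ATGCATGCATGCATGCATGC" * 5   # ~50% GC backbone
island = "AATTAATTAATTAATTAATT"     # AT-rich insert mimicking acquired DNA
genome = host + island + host
print(gc_deviation_windows(genome))  # windows overlapping the insert
```

Real predictors such as GI-GPS combine many features (codon usage, k-mer bias, mobility genes) in a trained classifier rather than a single threshold.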
2013-01-01
Background With high quantity and quality data production and low cost, next generation sequencing has the potential to provide new opportunities for plant phylogeographic studies on single and multiple species. Here we present an approach for in silico chloroplast DNA assembly and single nucleotide polymorphism detection from short-read shotgun sequencing. The approach is simple and effective and can be implemented using standard bioinformatic tools. Results The chloroplast genome of Toona ciliata (Meliaceae), 159,514 base pairs long, was assembled from shotgun sequencing on the Illumina platform using de novo assembly of contigs. To evaluate its practicality, value and quality, we compared the short read assembly with an assembly completed using 454 data obtained after chloroplast DNA isolation. Sanger sequence verifications indicated that the Illumina dataset outperformed the longer read 454 data. Pooling of several individuals during preparation of the shotgun library enabled detection of informative chloroplast SNP markers. Following validation, we used the identified SNPs for a preliminary phylogeographic study of T. ciliata in Australia and to confirm low diversity across the distribution. Conclusions Our approach provides a simple method for construction of whole chloroplast genomes from shotgun sequencing of whole genomic DNA using short-read data and no available closely related reference genome (e.g. from the same species or genus). The high coverage of Illumina sequence data also renders this method appropriate for multiplexing and SNP discovery and therefore a useful approach for landscape level studies of evolutionary ecology. PMID:23497206
How to interpret Methylation Sensitive Amplified Polymorphism (MSAP) profiles?
2014-01-01
Background DNA methylation plays a key role in development, contributes to genome stability, and may also respond to external factors supporting adaptation and evolution. To connect different types of stimuli with particular biological processes, identifying genome regions with altered 5-methylcytosine distribution at a genome-wide scale is important. Many researchers are using the simple, reliable, and relatively inexpensive Methylation Sensitive Amplified Polymorphism (MSAP) method that is particularly useful in studies of epigenetic variation. However, electrophoretic patterns produced by the method are rather difficult to interpret, particularly when MspI and HpaII isoschizomers are used because these enzymes are methylation-sensitive, and any C within the CCGG recognition motif can be methylated in plant DNA. Results Here, we evaluate MSAP patterns with respect to current knowledge of the enzyme activities and the level and distribution of 5-methylcytosine in plant and vertebrate genomes. We discuss potential caveats related to complex MSAP patterns and provide clues regarding how to interpret them. We further show that addition of combined HpaII + MspI digestion would assist in the interpretation of the most controversial MSAP pattern represented by the signal in the HpaII but not in the MspI profile. Conclusions We recommend modification of the MSAP protocol that definitely discerns between putative hemimethylated mCCGG and internal CmCGG sites. We believe that our view and the simple improvement will assist in correct MSAP data interpretation. PMID:24393618
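The conventional MSAP scoring of HpaII/MspI band patterns discussed above can be summarised as a small lookup table. The interpretations below follow the standard reading; as the paper stresses, the (1, 0) pattern is the controversial one, and (0, 0) is inherently ambiguous without the extra combined digestion:

```python
# Conventional MSAP band scoring: presence (1) / absence (0) of a fragment
# in the (HpaII, MspI) profiles. (0, 0) is ambiguous because hypermethylation
# and absence of the CCGG target site are indistinguishable.
MSAP_PATTERNS = {
    (1, 1): "unmethylated CCGG site",
    (1, 0): "hemimethylated external cytosine (mCCGG)",
    (0, 1): "methylated internal cytosine (CmCGG)",
    (0, 0): "fully methylated site or absence of the CCGG target",
}

def interpret(hpaii, mspi):
    """Map a scored (HpaII, MspI) band pattern to its usual interpretation."""
    return MSAP_PATTERNS[(hpaii, mspi)]

print(interpret(1, 0))  # the pattern the paper calls most controversial
```

The paper's proposed HpaII + MspI combined digestion is precisely what resolves the (1, 0) row against alternative explanations.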
A Metagenomic Approach to Cyanobacterial Genomics
Alvarenga, Danillo O.; Fiore, Marli F.; Varani, Alessandro M.
2017-01-01
Cyanobacteria, or oxyphotobacteria, are primary producers that establish ecological interactions with a wide variety of organisms. Although their associations with eukaryotes have received most attention, interactions with bacterial and archaeal symbionts have also been occurring for billions of years. Due to these associations, obtaining axenic cultures of cyanobacteria is usually difficult, and most isolation efforts result in unicyanobacterial cultures containing a number of associated microbes, hence composing a microbial consortium. With rising numbers of cyanobacterial blooms due to climate change, demand for genomic evaluations of these microorganisms is increasing. However, standard genomic techniques call for the sequencing of axenic cultures, an approach that not only adds months or even years for culture purification, but also appears to be impossible for some cyanobacteria, which is reflected in the relatively low number of publicly available genomic sequences of this phylum. Under the framework of metagenomics, on the other hand, cumbersome techniques for achieving axenic growth can be circumvented and individual genomes can be successfully obtained from microbial consortia. This review focuses on approaches for the genomic and metagenomic assessment of non-axenic cyanobacterial cultures that bypass requirements for axenity. These methods enable researchers to achieve faster and less costly genomic characterizations of cyanobacterial strains and raise additional information about their associated microorganisms. While non-axenic cultures may have been previously frowned upon in cyanobacteriology, latest advancements in metagenomics have provided new possibilities for in vitro studies of oxyphotobacteria, renewing the value of microbial consortia as a reliable and functional resource for the rapid assessment of bloom-forming cyanobacteria. PMID:28536564
The Genomic Signature of Crop-Wild Introgression in Maize
Hufford, Matthew B.; Lubinksy, Pesach; Pyhäjärvi, Tanja; Devengenzo, Michael T.; Ellstrand, Norman C.; Ross-Ibarra, Jeffrey
2013-01-01
The evolutionary significance of hybridization and subsequent introgression has long been appreciated, but evaluation of the genome-wide effects of these phenomena has only recently become possible. Crop-wild study systems represent ideal opportunities to examine evolution through hybridization. For example, maize and the conspecific wild teosinte Zea mays ssp. mexicana (hereafter, mexicana) are known to hybridize in the fields of highland Mexico. Despite widespread evidence of gene flow, maize and mexicana maintain distinct morphologies and have done so in sympatry for thousands of years. Neither the genomic extent nor the evolutionary importance of introgression between these taxa is understood. In this study we assessed patterns of genome-wide introgression based on 39,029 single nucleotide polymorphisms genotyped in 189 individuals from nine sympatric maize-mexicana populations and reference allopatric populations. While portions of the maize and mexicana genomes appeared resistant to introgression (notably near known cross-incompatibility and domestication loci), we detected widespread evidence for introgression in both directions of gene flow. Through further characterization of these genomic regions and preliminary growth chamber experiments, we found evidence suggestive of the incorporation of adaptive mexicana alleles into maize during its expansion to the highlands of central Mexico. In contrast, very little evidence was found for adaptive introgression from maize to mexicana. The methods we have applied here can be replicated widely, and such analyses have the potential to greatly inform our understanding of evolution through introgressive hybridization. Crop species, due to their exceptional genomic resources and frequent histories of spread into sympatry with relatives, should be particularly influential in these studies. PMID:23671421
Wu, Liyou; Liu, Xueduan; Schadt, Christopher W.; Zhou, Jizhong
2006-01-01
Microarray technology provides the opportunity to identify thousands of microbial genes or populations simultaneously, but low microbial biomass often prevents application of this technology to many natural microbial communities. We developed a whole-community genome amplification-assisted microarray detection approach based on multiple displacement amplification. The representativeness of amplification was evaluated using several types of microarrays and quantitative indexes. Representative detection of individual genes or genomes was obtained with 1 to 100 ng DNA from individual or mixed genomes, in equal or unequal abundance, and with 1 to 500 ng community DNAs from groundwater. Lower concentrations of DNA (as low as 10 fg) could be detected, but the lower template concentrations affected the representativeness of amplification. Robust quantitative detection was also observed by significant linear relationships between signal intensities and initial DNA concentrations ranging from (i) 0.04 to 125 ng (r2 = 0.65 to 0.99) for DNA from pure cultures as detected by whole-genome open reading frame arrays, (ii) 0.1 to 1,000 ng (r2 = 0.91) for genomic DNA using community genome arrays, and (iii) 0.01 to 250 ng (r2 = 0.96 to 0.98) for community DNAs from ethanol-amended groundwater using 50-mer functional gene arrays. This method allowed us to investigate the oligotrophic microbial communities in groundwater contaminated with uranium and other metals. The results indicated that microorganisms containing genes involved in contaminant degradation and immobilization are present in these communities, that their spatial distribution is heterogeneous, and that microbial diversity is greatly reduced in the highly contaminated environment. PMID:16820490
2010-01-01
Background An important focus of genomic science is the discovery and characterization of all functional elements within genomes. In silico methods are used in genome studies to discover putative regulatory genomic elements (called words or motifs). Although a number of methods have been developed for motif discovery, most of them lack the scalability needed to analyze large genomic data sets. Methods This manuscript presents WordSeeker, an enumerative motif discovery toolkit that utilizes multi-core and distributed computational platforms to enable scalable analysis of genomic data. A controller task coordinates activities of worker nodes, each of which (1) enumerates a subset of the DNA word space and (2) scores words with a distributed Markov chain model. Results A comprehensive suite of performance tests was conducted to demonstrate the performance, speedup and efficiency of WordSeeker. The scalability of the toolkit enabled the analysis of the entire genome of Arabidopsis thaliana; the results of the analysis were integrated into The Arabidopsis Gene Regulatory Information Server (AGRIS). A public version of WordSeeker was deployed on the Glenn cluster at the Ohio Supercomputer Center. Conclusion WordSeeker effectively utilizes concurrent computing platforms to enable the identification of putative functional elements in genomic data sets. This capability facilitates the analysis of the large quantity of sequenced genomic data. PMID:21210985
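WordSeeker scores enumerated DNA words against a distributed Markov chain model. A single-machine sketch of the core idea, assuming an order-1 Markov background fitted to the input sequence itself (WordSeeker's actual scoring statistic and distributed architecture differ), is:

```python
import math
from collections import Counter

def markov_scores(seq, k=3):
    """Score each k-mer by log2(observed / expected), with the expectation
    taken from an order-1 Markov chain fitted to the same sequence.
    Overrepresented words (candidate motifs) score above zero."""
    kmers = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    pairs = Counter(seq[i:i + 2] for i in range(len(seq) - 1))
    mono = Counter(seq)
    n = len(seq)
    scores = {}
    for w, obs in kmers.items():
        # P(w) = P(w[0]) * prod_i P(w[i+1] | w[i]) under the chain
        p = mono[w[0]] / n
        for a, b in zip(w, w[1:]):
            p *= pairs[a + b] / mono[a]
        scores[w] = math.log2(obs / (p * (n - k + 1)))
    return scores

seq = "ACGT" * 30 + "TTTTT"   # periodic background plus a T-rich word
s = markov_scores(seq)
```

WordSeeker's contribution is doing this exhaustively over the whole word space, partitioned across worker nodes, which is what makes whole-genome analyses like Arabidopsis thaliana feasible.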
Genomics Education for the Public: Perspectives of Genomic Researchers and ELSI Advisors
Jones, Sondra Smolek; Markey, Janell M.; Byerly, Katherine W.; Roberts, Megan C.
2014-01-01
Aims: For more than two decades genomic education of the public has been a significant challenge. As genomic information becomes integrated into daily life and routine clinical care, the need for public education is even more critical. We conducted a pilot study to learn how genomic researchers and ethical, legal, and social implications advisors who were affiliated with large-scale genomic variation studies have approached the issue of educating the public about genomics. Methods/Results: Semi-structured telephone interviews were conducted with researchers and advisors associated with the SNP/HAPMAP studies and the Cancer Genome Atlas Study. Respondents described approach(es) associated with educating the public about their study. Interviews were audio-recorded, transcribed, coded, and analyzed by team review. Although few respondents described formal educational efforts, most provided recommendations for what should/could be done, emphasizing the need for an overarching entity(s) to take responsibility to lead the effort to educate the public. Opposing views were described related to: who this should be; the overall goal of the educational effort; and the educational approach. Four thematic areas emerged: What is the rationale for educating the public about genomics?; Who is the audience?; Who should be responsible for this effort?; and What should the content be? Policy issues associated with these themes included the need to agree on philosophical framework(s) to guide the rationale, content, and target audiences for education programs; coordinate previous/ongoing educational efforts; and develop a centralized knowledge base. Suggestions for next steps are presented. Conclusion: A complex interplay of philosophical, professional, and cultural issues can create impediments to genomic education of the public. 
Many challenges, however, can be addressed by agreement on a guiding philosophical framework(s) and identification of a responsible entity(s) to provide leadership for developing/overseeing an appropriate infrastructure to support the coordination/integration/sharing and evaluation of educational efforts, benefiting consumers and professionals. PMID:24495163
Mannila, H.; Koivisto, M.; Perola, M.; Varilo, T.; Hennah, W.; Ekelund, J.; Lukk, M.; Peltonen, L.; Ukkonen, E.
2003-01-01
We describe a new probabilistic method for finding haplotype blocks that is based on the use of the minimum description length (MDL) principle. We give a rigorous definition of the quality of a segmentation of a genomic region into blocks and describe a dynamic programming algorithm for finding the optimal segmentation with respect to this measure. We also describe a method for finding the probability of a block boundary for each pair of adjacent markers: this gives a tool for evaluating the significance of each block boundary. We have applied the method to the published data of Daly and colleagues. The results expose some problems that exist in the current methods for the evaluation of the significance of predicted block boundaries. Our method, MDL block finder, can be used to compare block borders in different sample sets, and we demonstrate this by applying the MDL-based method to define the block structure in chromosomes from population isolates. PMID:12761696
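The optimal segmentation is found by dynamic programming over block boundaries. A schematic version, with the MDL description-length criterion abstracted into a pluggable cost function (the toy cost below is illustrative, not the paper's actual coding cost), is:

```python
def optimal_segmentation(n, cost):
    """Dynamic programme over all segmentations of markers 0..n-1.
    cost(i, j) is the coding cost of a block spanning [i, j); the MDL
    criterion of the paper plugs its description-length cost in here."""
    best = [0.0] + [float("inf")] * n
    cut = [0] * (n + 1)
    for j in range(1, n + 1):
        for i in range(j):
            c = best[i] + cost(i, j)
            if c < best[j]:
                best[j], cut[j] = c, i
    # backtrack the block boundaries
    bounds, j = [], n
    while j > 0:
        bounds.append((cut[j], j))
        j = cut[j]
    return best[n], bounds[::-1]

# toy cost: constant per-block penalty plus squared block length
total, blocks = optimal_segmentation(6, lambda i, j: 4 + (j - i) ** 2)
print(total, blocks)
```

The same O(n^2) recursion also underlies the block-boundary probability computation, which sums over segmentations instead of minimising.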
Zhao, Yongan; Wang, Xiaofeng; Jiang, Xiaoqian; Ohno-Machado, Lucila; Tang, Haixu
2015-01-01
We propose a new approach to privacy-preserving data selection, which helps data users access human genomic datasets efficiently without undermining patients' privacy. Our idea is to let each data owner publish a set of differentially-private pilot data, on which a data user can test-run arbitrary association-test algorithms, including those not known to the data owner a priori. We developed a suite of new techniques, including a pilot-data generation approach that leverages the linkage disequilibrium in the human genome to preserve both the utility of the data and the privacy of the patients, and a utility evaluation method that helps the user assess the value of the real data from its pilot version with high confidence. We evaluated our approach on real human genomic data using four popular association tests. Our study shows that the proposed approach can help data users make the right choices in most cases. Even though the pilot data cannot be directly used for scientific discovery, it provides a useful indication of which datasets are more likely to be useful to data users, who can therefore approach the appropriate data owners to gain access to the data. © The Author 2014. Published by Oxford University Press on behalf of the American Medical Informatics Association.
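A standard building block for releasing differentially private summary data, such as the pilot datasets proposed here, is the Laplace mechanism on allele counts. The sketch below is illustrative only: sensitivity is taken as 1 per count, whereas a diploid individual can shift an allele count by 2 (doubling the scale), and the paper's pilot-data generation additionally models linkage disequilibrium, which this sketch does not:

```python
import math
import random

def dp_allele_counts(counts, epsilon=1.0, seed=42):
    """Release allele counts with Laplace(1/epsilon) noise, the basic
    mechanism behind differentially private summary releases."""
    rng = random.Random(seed)
    scale = 1.0 / epsilon
    noisy = []
    for c in counts:
        u = rng.random() - 0.5               # inverse-CDF Laplace sample
        noise = -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))
        noisy.append(c + noise)
    return noisy

true_counts = [120, 80, 40]                  # hypothetical allele counts
noisy = dp_allele_counts(true_counts, epsilon=1.0)
```

Smaller epsilon gives stronger privacy and noisier pilot data, which is exactly the utility/privacy trade-off the paper's evaluation method helps users reason about.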
Lu, Wen-Jie; Yamada, Yoshiji; Sakuma, Jun
2015-01-01
Advances in sequencing techniques are yielding large-scale genomic data at low cost. A genome-wide association study (GWAS) targeting genetic variations that are significantly associated with a particular disease offers great potential for medical improvement. However, subjects who volunteer their genomic data expose themselves to the risk of privacy invasion; these privacy concerns prevent efficient genomic data sharing. Our goal is to present a cryptographic solution to this problem. To maintain the privacy of subjects, we propose encryption of all genotype and phenotype data. To allow the cloud to perform meaningful computation on the encrypted data, we use a fully homomorphic encryption scheme. Noting that we can evaluate typical statistics for GWAS from a frequency table, our solution evaluates frequency tables with encrypted genomic and clinical data as input. We propose to use a packing technique for efficient evaluation of these frequency tables. Our solution supports evaluation of the D' measure of linkage disequilibrium, the Hardy-Weinberg equilibrium, the χ2 test, etc. In this paper, we take the χ2 test and linkage disequilibrium as examples and demonstrate how we can conduct these algorithms securely and efficiently in an outsourcing setting. We demonstrate through experimentation that secure outsourced computation of one χ2 test with 10,000 subjects requires about 35 ms and evaluation of one linkage disequilibrium measure with 10,000 subjects requires about 80 ms. With appropriate encoding and packing techniques, cryptographic solutions based on fully homomorphic encryption for secure computation of GWAS can be practical.
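The χ2 statistic that the encrypted pipeline evaluates can be computed from a 2x2 allele-by-phenotype frequency table. It is shown below in plaintext for clarity; the paper's contribution is evaluating exactly this kind of frequency-table statistic homomorphically over packed ciphertexts:

```python
def chi2_2x2(a, b, c, d):
    """Pearson chi-square for a 2x2 frequency table [[a, b], [c, d]],
    e.g. allele (A/a) by phenotype (case/control) counts."""
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))

# strong allele/case association vs. no association (hypothetical counts)
print(chi2_2x2(40, 10, 10, 40))   # large statistic: alleles track phenotype
print(chi2_2x2(25, 25, 25, 25))   # 0.0: independent table
```

Because the formula is a ratio of polynomials in the table entries, a homomorphic scheme only needs to evaluate the numerator and denominator under encryption and let the client perform the final division after decryption.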
Efficient Server-Aided Secure Two-Party Function Evaluation with Applications to Genomic Computation
2016-07-14
Computation based on genomic data is becoming increasingly popular today. It is known that full fairness, one of the important properties of secure computation, cannot be achieved in the two-party case; this work develops efficient server-aided secure two-party function evaluation with applications to genomic computation.
Efficient engineering of a bacteriophage genome using the type I-E CRISPR-Cas system.
Kiro, Ruth; Shitrit, Dror; Qimron, Udi
2014-01-01
The clustered regularly interspaced short palindromic repeats (CRISPR)-CRISPR-associated (Cas) system has recently been used to engineer genomes of various organisms, but surprisingly, not those of bacteriophages (phages). Here we present a method to genetically engineer the Escherichia coli phage T7 using the type I-E CRISPR-Cas system. The T7 phage genome is edited by homologous recombination with a DNA sequence flanked by sequences homologous to the desired location. Non-edited genomes are targeted by the CRISPR-Cas system, thus enabling isolation of the desired recombinant phages. This method broadens CRISPR-Cas-based editing to phages and uses a CRISPR-Cas type other than type II. The method may be adjusted to genetically engineer any bacteriophage genome.
Repeat-aware modeling and correction of short read errors.
Yang, Xiao; Aluru, Srinivas; Dorman, Karin S
2011-02-15
High-throughput short read sequencing is revolutionizing genomics and systems biology research by enabling cost-effective deep coverage sequencing of genomes and transcriptomes. Error detection and correction are crucial to many short read sequencing applications including de novo genome sequencing, genome resequencing, and digital gene expression analysis. Short read error detection is typically carried out by counting the observed frequencies of kmers in reads and validating those with frequencies exceeding a threshold. In the case of genomes with high repeat content, an erroneous kmer may be frequently observed if it has few nucleotide differences with valid kmers that occur multiple times in the genome. Error detection and correction have mostly been applied to genomes with low repeat content, and this remains a challenging problem for genomes with high repeat content. We develop a statistical model and a computational method for error detection and correction in the presence of genomic repeats. We propose a method to infer genomic frequencies of kmers from their observed frequencies by analyzing the misread relationships among observed kmers. We also propose a method to estimate the threshold useful for validating kmers whose estimated genomic frequency exceeds the threshold. We demonstrate that superior error detection is achieved using these methods. Furthermore, we break away from the common assumption of uniformly distributed errors within a read, and provide a framework to model position-dependent error occurrence frequencies common to many short read platforms. Lastly, we achieve better error correction in genomes with high repeat content. The software is implemented in C++ and is freely available under the GNU GPL3 license and Boost Software V1.0 license at "http://aluru-sun.ece.iastate.edu/doku.php?id=redeem". 
We introduce a statistical framework to model sequencing errors in next-generation reads, which led to promising results in detecting and correcting errors for genomes with high repeat content.
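The count-threshold baseline that the paper improves upon can be sketched in a few lines: k-mers observed fewer times than a cutoff are presumed to contain errors. In repeat-rich genomes this heuristic misclassifies k-mers, which motivates the paper's statistical model of genomic k-mer frequencies (not reproduced here):

```python
from collections import Counter

def kmer_spectrum(reads, k):
    """Count every k-mer occurring in a collection of reads."""
    counts = Counter()
    for r in reads:
        for i in range(len(r) - k + 1):
            counts[r[i:i + k]] += 1
    return counts

def flag_errors(reads, k=4, threshold=2):
    """Classic count-threshold error detection: k-mers seen fewer than
    `threshold` times are flagged as likely sequencing errors."""
    counts = kmer_spectrum(reads, k)
    return {w for w, c in counts.items() if c < threshold}

reads = ["ACGTACGT"] * 5 + ["ACGTACCT"]   # one read with a miscalled base
bad = flag_errors(reads)
```

With deep coverage, true k-mers recur across reads while error k-mers stay rare; the paper's point is that high-copy repeats break this simple separation.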
Argimón, Silvia; Konganti, Kranti; Chen, Hao; Alekseyenko, Alexander V.; Brown, Stuart; Caufield, Page W.
2014-01-01
Comparative genomics is a popular method for the identification of microbial virulence determinants, especially since the sequencing of a large number of whole bacterial genomes from pathogenic and non-pathogenic strains has become relatively inexpensive. The bioinformatics pipelines for comparative genomics usually include gene prediction and annotation and can require significant computer power. To circumvent this, we developed a rapid method for genome-scale in silico subtractive hybridization, based on blastn and independent of feature identification and annotation. Whole genome comparisons by in silico genome subtraction were performed to identify genetic loci specific to Streptococcus mutans strains associated with severe early childhood caries (S-ECC), compared to strains isolated from caries-free (CF) children. The genome similarity of the 20 S. mutans strains included in this study, calculated by Simrank k-mer sharing, ranged from 79.5 to 90.9%, confirming that this is a genetically heterogeneous group of strains. We identified strain-specific genetic elements in 19 strains, with sizes ranging from 200 bp to 39 kb. These elements contained protein-coding regions with functions mostly associated with mobile DNA. We did not, however, identify any genetic loci consistently associated with dental caries, i.e., shared by all the S-ECC strains and absent in the CF strains. Conversely, we did not identify any genetic loci specific to the healthy group. Comparison of previously published genomes from pathogenic and carriage strains of Neisseria meningitidis with our in silico genome subtraction yielded the same set of genes specific to the pathogenic strains, thus validating our method. Our results suggest that S. mutans strains derived from caries-active or caries-free dentitions cannot be differentiated based on the presence or absence of specific genetic elements. 
Our in silico genome subtraction method is available as the Microbial Genome Comparison (MGC) tool, with a user-friendly JAVA graphical interface. PMID:24291226
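A k-mer-sharing similarity in the spirit of the Simrank measure used above can be sketched as follows. This Jaccard-style variant is illustrative only; Simrank's exact definition of shared-k-mer similarity may differ:

```python
def kmer_set(seq, k=8):
    """All distinct k-mers of a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def kmer_sharing(a, b, k=8):
    """Jaccard fraction of k-mers shared by two sequences -- an
    alignment-free proxy for genome similarity, in the spirit of the
    Simrank k-mer sharing used to gauge strain heterogeneity."""
    ka, kb = kmer_set(a, k), kmer_set(b, k)
    return len(ka & kb) / len(ka | kb)

x = "ACGTACGTGGCCATAT" * 4      # toy "genome"
y = x[:-8] + "TTTTAAAA"         # same genome diverging at the tail
print(kmer_sharing(x, y))       # high but below 1.0
```

Because no alignment is needed, such measures scale well when comparing many whole genomes, which is also the design principle behind the annotation-free subtraction method itself.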
Toward the automated generation of genome-scale metabolic networks in the SEED.
DeJongh, Matthew; Formsma, Kevin; Boillot, Paul; Gould, John; Rycenga, Matthew; Best, Aaron
2007-04-26
Current methods for the automated generation of genome-scale metabolic networks focus on genome annotation and preliminary biochemical reaction network assembly, but do not adequately address the process of identifying and filling gaps in the reaction network, and verifying that the network is suitable for systems level analysis. Thus, current methods are only sufficient for generating draft-quality networks, and refinement of the reaction network is still largely a manual, labor-intensive process. We have developed a method for generating genome-scale metabolic networks that produces substantially complete reaction networks, suitable for systems level analysis. Our method partitions the reaction space of central and intermediary metabolism into discrete, interconnected components that can be assembled and verified in isolation from each other, and then integrated and verified at the level of their interconnectivity. We have developed a database of components that are common across organisms, and have created tools for automatically assembling appropriate components for a particular organism based on the metabolic pathways encoded in the organism's genome. This focuses manual efforts on that portion of an organism's metabolism that is not yet represented in the database. We have demonstrated the efficacy of our method by reverse-engineering and automatically regenerating the reaction network from a published genome-scale metabolic model for Staphylococcus aureus. Additionally, we have verified that our method capitalizes on the database of common reaction network components created for S. aureus, by using these components to generate substantially complete reconstructions of the reaction networks from three other published metabolic models (Escherichia coli, Helicobacter pylori, and Lactococcus lactis). We have implemented our tools and database within the SEED, an open-source software environment for comparative genome annotation and analysis. 
Our method sets the stage for the automated generation of substantially complete metabolic networks for over 400 complete genome sequences currently in the SEED. With each genome that is processed using our tools, the database of common components grows to cover more of the diversity of metabolic pathways. This increases the likelihood that components of reaction networks for subsequently processed genomes can be retrieved from the database, rather than assembled and verified manually.
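The gap-identification step described above can be illustrated with a toy model (all metabolite and reaction names here are hypothetical, not drawn from the SEED database): a metabolite that some reaction consumes but that no reaction produces, and that is not an external input, marks a gap to be filled.

```python
def find_gaps(reactions, external_inputs):
    """Flag metabolites that are consumed but never produced.

    reactions: list of (consumed_metabolites, produced_metabolites) set pairs.
    external_inputs: metabolites supplied by the growth medium.
    """
    consumed = set().union(*(c for c, _ in reactions))
    produced = set().union(*(p for _, p in reactions))
    return consumed - produced - external_inputs

# Toy three-reaction network: 'atp' is consumed but nothing regenerates it.
reactions = [
    ({"glucose"}, {"g6p"}),
    ({"g6p"}, {"f6p"}),
    ({"f6p", "atp"}, {"fbp"}),
]
gaps = find_gaps(reactions, external_inputs={"glucose"})
print(gaps)  # -> {'atp'}
```

A real reconstruction pipeline would then search a reaction database for candidate reactions producing each gap metabolite; the sketch only shows the detection step.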
Uemoto, Yoshinobu; Sasaki, Shinji; Kojima, Takatoshi; Sugimoto, Yoshikazu; Watanabe, Toshio
2015-11-19
Genetic variance that is not captured by single nucleotide polymorphisms (SNPs) is due to imperfect linkage disequilibrium (LD) between SNPs and quantitative trait loci (QTLs), and the extent of LD between SNPs and QTLs depends on the difference in minor allele frequency (MAF) between them. To evaluate the impact of QTL MAF on genomic evaluation, we performed a simulation study using real cattle genotype data. In total, 1368 Japanese Black cattle and 592,034 SNPs (Illumina BovineHD BeadChip) were used. We simulated phenotypes using real genotypes under different scenarios, varying the MAF categories, QTL heritability, number of QTLs, and distribution of QTL effects. After generating true breeding values and phenotypes, QTL heritability was estimated and the prediction accuracy of genomic estimated breeding values (GEBV) was assessed under different SNP densities, prediction models, and population sizes using a reference-test validation design. The extent of LD between SNPs and QTLs in this population was higher for QTLs with high MAF than for those with low MAF. The effect of QTL MAF on genomic evaluation depended on the genetic architecture, the evaluation strategy, and the population size. Regarding genetic architecture, genomic evaluation was affected by the MAF of QTLs in combination with QTL heritability and the distribution of QTL effects, whereas the number of QTLs had no effect on genomic evaluation when it exceeded 50. Regarding evaluation strategy, different SNP densities and prediction models affected heritability estimation and genomic prediction, and this dependence varied with the MAF of QTLs; accurate QTL heritabilities and GEBV were obtained using denser SNP information and a prediction model that accounted for SNPs with both low and high MAF. Regarding population size, a large sample size was needed to increase the accuracy of GEBV. Overall, the MAF of QTLs had an impact on heritability estimation and prediction accuracy. 
Most genetic variance can be captured using denser SNPs and a prediction model that accounts for MAF, but a large sample size is needed to increase the accuracy of GEBV across all QTL MAF categories.
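The simulation design above — sample QTLs from real genotypes, draw their effects, and add residual noise to reach a target heritability — can be sketched as follows (a minimal illustration with made-up dimensions, not the authors' pipeline):

```python
import numpy as np

rng = np.random.default_rng(1)

def simulate_phenotypes(genotypes, n_qtl, h2):
    """genotypes: (n_animals, n_snps) matrix of 0/1/2 allele counts.
    Picks n_qtl QTLs at random, draws normal effects, builds true breeding
    values (TBV), and adds residual noise scaled to give heritability h2."""
    n, m = genotypes.shape
    qtl_idx = rng.choice(m, size=n_qtl, replace=False)
    effects = rng.normal(size=n_qtl)
    tbv = genotypes[:, qtl_idx] @ effects
    var_e = tbv.var() * (1 - h2) / h2      # residual variance for target h2
    return tbv, tbv + rng.normal(scale=np.sqrt(var_e), size=n)

geno = rng.integers(0, 3, size=(200, 1000))   # toy genotypes, not BovineHD
tbv, y = simulate_phenotypes(geno, n_qtl=50, h2=0.5)
```

Varying `n_qtl`, `h2`, the effect distribution, and the MAF of eligible QTLs reproduces the kind of scenario grid the study explores.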
Recovering complete and draft population genomes from metagenome datasets
Sangwan, Naseer; Xia, Fangfang; Gilbert, Jack A.
2016-03-08
Assembly of metagenomic sequence data into microbial genomes is of fundamental value to improving our understanding of microbial ecology and metabolism by elucidating the functional potential of hard-to-culture microorganisms. Here, we provide a synthesis of available methods to bin metagenomic contigs into species-level groups and highlight how genetic diversity, sequencing depth, and coverage influence binning success. Despite the computational cost of application to deeply sequenced, complex metagenomes (e.g., soil), covarying patterns of contig coverage across multiple datasets significantly improve the binning process. We also discuss and compare current genome validation methods and reveal how these methods tackle the problem of chimeric genome bins, i.e., sequences from multiple species. Finally, we explore how population genome assembly can be used to uncover biogeographic trends and to characterize the effect of in situ functional constraints on genome-wide evolution.
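The covarying-coverage idea can be sketched as a toy clustering of per-contig coverage vectors across samples (illustrative only; production binners combine coverage with sequence composition and use far more robust models):

```python
import numpy as np

def bin_by_coverage(coverage, k, n_iter=25):
    """Naive k-means over per-contig coverage profiles.

    coverage: (n_contigs, n_samples) array of mean read depth per contig in
    each dataset. Contigs from the same genome covary in depth across
    samples, so they cluster together. Uses farthest-point initialization.
    """
    centers = [coverage[0]]
    for _ in range(k - 1):
        dists = np.min([np.linalg.norm(coverage - c, axis=1) for c in centers],
                       axis=0)
        centers.append(coverage[dists.argmax()])
    centers = np.array(centers)
    labels = np.zeros(len(coverage), dtype=int)
    for _ in range(n_iter):
        d = np.linalg.norm(coverage[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = coverage[labels == j].mean(axis=0)
    return labels

# Two mock genomes with distinct depth profiles over three samples.
rng = np.random.default_rng(0)
genome_a = rng.normal([10.0, 40.0, 5.0], 1.0, size=(30, 3))
genome_b = rng.normal([50.0, 5.0, 20.0], 1.0, size=(30, 3))
labels = bin_by_coverage(np.vstack([genome_a, genome_b]), k=2)
```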
Parallel Continuous Flow: A Parallel Suffix Tree Construction Tool for Whole Genomes
Farreras, Montse
2014-01-01
The construction of suffix trees for very long sequences is essential for many applications, and it plays a central role in the bioinformatic domain. With the advent of modern sequencing technologies, biological sequence databases have grown dramatically. The methodologies required to analyze these data have also become more complex every day, requiring fast queries to multiple genomes. In this article, we present parallel continuous flow (PCF), a parallel suffix tree construction method that is suitable for very long genomes. We tested our method by constructing the suffix tree of the entire human genome, about 3 GB. We showed that PCF can scale gracefully as the size of the input genome grows. Our method can work with an efficiency of 90% with 36 processors and 55% with 172 processors. We can index the human genome in 7 minutes using 172 processors. PMID:24597675
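For readers unfamiliar with suffix indexing, a naive single-process sketch of the underlying idea (this is not the PCF algorithm, which constructs the tree in parallel and scales to whole genomes):

```python
def suffix_array(s):
    """Naive O(n^2 log n) construction: sort all suffix start positions
    lexicographically. Fine for toy strings, hopeless for a 3 GB genome."""
    return sorted(range(len(s)), key=lambda i: s[i:])

def occurrences(s, sa, pattern):
    """Start positions of `pattern` in s; a real index binary-searches the
    sorted suffixes instead of scanning them."""
    return sorted(i for i in sa if s[i:i + len(pattern)] == pattern)

sa = suffix_array("banana")
print(sa)                                # [5, 3, 1, 0, 4, 2]
print(occurrences("banana", sa, "ana"))  # [1, 3]
```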
Genomic-based multiple-trait evaluation in Eucalyptus grandis using dominant DArT markers.
Cappa, Eduardo P; El-Kassaby, Yousry A; Muñoz, Facundo; Garcia, Martín N; Villalba, Pamela V; Klápště, Jaroslav; Marcucci Poltri, Susana N
2018-06-01
We investigated the impact of combining the pedigree- and genomic-based relationship matrices in a multiple-trait individual-tree mixed model (a.k.a., multiple-trait combined approach) on the estimates of heritability and on the genomic correlations between growth and stem straightness in an open-pollinated Eucalyptus grandis population. Additionally, the added advantage of incorporating genomic information on the theoretical accuracies of parents and offspring breeding values was evaluated. Our results suggested that the use of the combined approach for estimating heritabilities and additive genetic correlations in multiple-trait evaluations is advantageous and including genomic information increases the expected accuracy of breeding values. Furthermore, the multiple-trait combined approach was proven to be superior to the single-trait combined approach in predicting breeding values, in particular for low-heritability traits. Finally, our results advocate the use of the combined approach in forest tree progeny testing trials, specifically when a multiple-trait individual-tree mixed model is considered. Copyright © 2018 Elsevier B.V. All rights reserved.
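The "combined approach" merges pedigree-based (A) and genomic (G) relationship matrices. One standard realization is the single-step H-matrix, whose inverse adds the genomic information in the block of genotyped individuals; a sketch of that usual construction (illustrative; the paper's exact parameterization may differ):

```python
import numpy as np

def h_inverse(A_inv, A22_inv, G_inv, genotyped_idx):
    """Standard single-step construction: H^{-1} is the pedigree A^{-1} with
    (G^{-1} - A22^{-1}) added in the genotyped-individual block, where A22 is
    the pedigree relationship matrix among genotyped individuals."""
    H_inv = A_inv.copy()
    ix = np.ix_(genotyped_idx, genotyped_idx)
    H_inv[ix] = H_inv[ix] + (G_inv - A22_inv)
    return H_inv

# Toy example: 4 trees, the last two genotyped.
A_inv = np.eye(4)
A22_inv = np.eye(2)
G_inv = np.array([[2.0, 0.5], [0.5, 2.0]])
H_inv = h_inverse(A_inv, A22_inv, G_inv, genotyped_idx=[2, 3])
```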
Álvarez-Ainza, M L; Zamora-Quiñonez, K A; Moreno-Ibarra, G M; Acedo-Félix, E
2015-03-01
Bacanora is a spirituous beverage produced from Agave angustifolia Haw in an artisanal process. Natural fermentation is mostly performed by native yeasts and bacteria. In this study, 228 Saccharomyces-like yeast strains were isolated from the natural alcoholic fermentation used in bacanora production. Restriction analysis of the amplified ITS1-5.8S-ITS2 region of the ribosomal DNA genes (RFLPr) was used to confirm the genus, and 182 strains were identified as Saccharomyces cerevisiae. These strains displayed high genomic variability in their chromosome profiles by karyotyping. The electrophoretic profiles of the strains evaluated showed a large number of chromosomes, with sizes ranging from approximately 225 to 2200 kbp.
Unsupervised Outlier Profile Analysis
Ghosh, Debashis; Li, Song
2014-01-01
In much of the analysis of high-throughput genomic data, “interesting” genes have been selected based on assessment of differential expression between two groups or generalizations thereof. Most of the literature focuses on changes in mean expression or the entire distribution. In this article, we explore the use of C(α) tests, which have been applied in other genomic data settings. Their use for the outlier expression problem, in particular with continuous data, is problematic but nevertheless motivates new statistics that give an unsupervised analog to previously developed outlier profile analysis approaches. Some simulation studies are used to evaluate the proposal. A bivariate extension is described that can accommodate data from two platforms on matched samples. The proposed methods are applied to data from a prostate cancer study. PMID:25452686
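A generic unsupervised outlier-profile score in the spirit of the approaches discussed (an IQR-threshold count per gene; this is not the paper's C(α)-derived statistic):

```python
import numpy as np

def outlier_counts(expr, iqr_mult=1.5):
    """For each gene (row), count samples whose expression exceeds
    Q3 + iqr_mult * IQR of that gene -- no group labels needed."""
    q1 = np.percentile(expr, 25, axis=1, keepdims=True)
    q3 = np.percentile(expr, 75, axis=1, keepdims=True)
    return (expr > q3 + iqr_mult * (q3 - q1)).sum(axis=1)

rng = np.random.default_rng(2)
expr = rng.normal(size=(100, 40))   # 100 genes x 40 samples
expr[0, :5] += 8.0                  # gene 0: outlier expression in 5 samples
scores = outlier_counts(expr)       # gene 0 receives the top score
```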
State of the Art: Response Assessment in Lung Cancer in the Era of Genomic Medicine
Hatabu, Hiroto; Johnson, Bruce E.; McLoud, Theresa C.
2014-01-01
Tumor response assessment has been a foundation for advances in cancer therapy. Recent discoveries of effective targeted therapies for specific genomic abnormalities in lung cancer and their clinical application have brought revolutionary advances in lung cancer therapy and transformed the oncologist’s approach to patients with lung cancer. Because imaging is a major method of response assessment in lung cancer in both clinical trials and practice, radiologists must understand the genomic alterations in lung cancer and the rapidly evolving therapeutic approaches to communicate effectively with oncology colleagues and maintain a key role in lung cancer care. This article describes the origin and importance of tumor response assessment, presents the recent genomic discoveries in lung cancer and therapies directed against these genomic changes, and describes how these discoveries affect the radiology community. The authors then summarize the conventional Response Evaluation Criteria in Solid Tumors and World Health Organization guidelines, which continue to be the major determinants of trial endpoints, and describe their limitations, particularly in an era of genomic-based therapy. More advanced imaging techniques for lung cancer response assessment are presented, including computed tomography tumor volume and perfusion, dynamic contrast material–enhanced and diffusion-weighted magnetic resonance imaging, and positron emission tomography with fluorine 18 fluorodeoxyglucose and novel tracers. State-of-the-art knowledge of lung cancer biology, treatment, and imaging will help the radiology community to remain effective contributors to the personalized care of lung cancer patients. © RSNA, 2014 PMID:24661292
A Utility Maximizing and Privacy Preserving Approach for Protecting Kinship in Genomic Databases.
Kale, Gulce; Ayday, Erman; Tastan, Oznur
2017-09-12
Rapid and low-cost genome sequencing has enabled the widespread use of genomic data in research studies and personalized customer applications, where genomic data are shared in public databases. Although the identities of participants are anonymized in these databases, sensitive information about individuals can still be inferred. One such piece of information is kinship. We define two routes through which kinship privacy can leak and propose a technique to protect kinship privacy against these risks while maximizing the utility of shared data. The method involves systematic identification of minimal portions of genomic data to mask as new participants are added to the database. Choosing the proper positions to hide is cast as an optimization problem in which the number of positions to mask is minimized subject to privacy constraints that ensure the familial relationships are not revealed. We evaluate the proposed technique on real genomic data. Results indicate that concurrent sharing of data pertaining to a parent and an offspring results in high risks to kinship privacy, whereas sharing data from more distant relatives together is often safer. We also show that the arrival order of family members has a high impact on the level of privacy risk and on the utility of sharing data. Available at: https://github.com/tastanlab/Kinship-Privacy. erman@cs.bilkent.edu.tr or oznur.tastan@cs.bilkent.edu.tr. Supplementary data are available at Bioinformatics online. © The Author (2017). Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oup.com
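A toy stand-in for the masking idea (kinship inference often relies on the fraction of shared variants; the paper casts position selection as a constrained optimization, which this greedy sketch only approximates):

```python
def positions_to_mask(v1, v2, threshold):
    """v1, v2: 0/1 variant vectors of two shared genomes. Greedily mask
    shared-variant positions until the apparent sharing fraction drops
    below `threshold`, and return the positions to hide."""
    n = len(v1)
    shared = [i for i, (a, b) in enumerate(zip(v1, v2)) if a == b == 1]
    masked = []
    while len(masked) < len(shared) and (len(shared) - len(masked)) / n >= threshold:
        masked.append(shared[len(masked)])
    return masked

parent = [1, 1, 1, 0, 1, 0, 1, 0]
child  = [1, 1, 0, 0, 1, 0, 1, 1]
print(positions_to_mask(parent, child, threshold=0.3))  # -> [0, 1]
```

With a lenient threshold no masking is needed; the tighter the threshold, the more shared positions must be hidden, which is exactly the privacy-utility trade-off the paper optimizes.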
An experimental validation of genomic selection in octoploid strawberry
Gezan, Salvador A; Osorio, Luis F; Verma, Sujeet; Whitaker, Vance M
2017-01-01
The primary goal of genomic selection is to increase genetic gains for complex traits by predicting the performance of individuals for which phenotypic data are not available. The objective of this study was to experimentally evaluate the potential of genomic selection in strawberry breeding and to define a strategy for its implementation. Four clonally replicated field trials, two in each of 2 years and comprising a total of 1628 individuals, were established in 2013–2014 and 2014–2015. Five complex yield and fruit quality traits with moderate to low heritability were assessed in each trial. High-density genotyping was performed with the Affymetrix Axiom IStraw90 single-nucleotide polymorphism array, and 17 479 polymorphic markers were chosen for analysis. Several methods were compared, including Genomic BLUP, Bayes B, Bayes C, Bayesian LASSO Regression, Bayesian Ridge Regression and Reproducing Kernel Hilbert Spaces. Cross-validation within training populations yielded higher predictive abilities than true validation across trials. For true validations, Bayes B gave the highest predictive abilities on average and also the highest selection efficiencies, particularly for the yield traits, which had the lowest heritabilities. Selection efficiencies using Bayes B for parent selection ranged from 74% for average fruit weight to 34% for early marketable yield. A breeding strategy is proposed in which advanced selection trials are utilized as training populations and in which genomic selection can reduce the breeding cycle from 3 to 2 years for a subset of untested parents based on their predicted genomic breeding values. PMID:28090334
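A minimal sketch of genomic prediction with ridge regression on markers (SNP-BLUP, equivalent to Genomic BLUP; the study's best performer, Bayes B, additionally allows many marker effects to be zero and is not implemented here):

```python
import numpy as np

rng = np.random.default_rng(3)

def ridge_predict(X_train, y_train, X_test, lam=10.0):
    """Estimate marker effects with equal shrinkage, then predict breeding
    values for untested individuals from their genotypes alone."""
    m = X_train.shape[1]
    beta = np.linalg.solve(X_train.T @ X_train + lam * np.eye(m),
                           X_train.T @ y_train)
    return X_test @ beta

# Toy data: 300 individuals, 200 markers, 20 of them causal.
X = rng.integers(0, 3, size=(300, 200)).astype(float)
X -= X.mean(axis=0)                        # center genotypes
effects = np.zeros(200)
effects[:20] = rng.normal(size=20)
y = X @ effects + rng.normal(size=300)
pred = ridge_predict(X[:250], y[:250], X[250:])
r = np.corrcoef(pred, y[250:])[0, 1]       # predictive ability
```

The correlation between predicted and observed values in the held-out set mirrors the "predictive ability" reported in the study's validation design.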
Advanced yellow fever virus genome detection in point-of-care facilities and reference laboratories.
Domingo, Cristina; Patel, Pranav; Yillah, Jasmin; Weidmann, Manfred; Méndez, Jairo A; Nakouné, Emmanuel Rivalyn; Niedrig, Matthias
2012-12-01
Reported methods for the detection of the yellow fever viral genome are beset by limitations in sensitivity, specificity, strain detection spectra, and suitability to laboratories with simple infrastructure in areas of endemicity. We describe the development of two different approaches affording sensitive and specific detection of the yellow fever genome: a real-time reverse transcription-quantitative PCR (RT-qPCR) and an isothermal protocol employing the same primer-probe set but based on helicase-dependent amplification technology (RT-tHDA). Both assays were evaluated using yellow fever cell culture supernatants as well as spiked and clinical samples. We demonstrate reliable detection by both assays of different strains of yellow fever virus with improved sensitivity and specificity. The RT-qPCR assay is a powerful tool for reference or diagnostic laboratories with real-time PCR capability, while the isothermal RT-tHDA assay represents a useful alternative to earlier amplification techniques for the molecular diagnosis of yellow fever by field or point-of-care laboratories.
Gel, Bernat; Díez-Villanueva, Anna; Serra, Eduard; Buschbeck, Marcus; Peinado, Miguel A; Malinverni, Roberto
2016-01-15
Statistically assessing the relation between a set of genomic regions and other genomic features is a common and challenging task in genomic and epigenomic analyses. Randomization-based approaches implicitly take into account the complexity of the genome without the need to assume an underlying statistical model. regioneR is an R package that implements a permutation test framework specifically designed to work with genomic regions. In addition to the predefined randomization and evaluation strategies, regioneR is fully customizable, allowing the use of custom strategies to adapt it to specific questions. Finally, it also implements a novel function to evaluate the local specificity of the detected association. regioneR is an R package released under the Artistic-2.0 License. The source code and documents are freely available through Bioconductor (http://www.bioconductor.org/packages/regioneR). rmalinverni@carrerasresearch.org. © The Author 2015. Published by Oxford University Press.
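The permutation-test logic that regioneR implements can be sketched in plain Python (illustrative only, not regioneR's API; a "randomize positions, keep lengths" strategy with an overlap-count statistic):

```python
import random

def n_overlaps(regions_a, regions_b):
    """Count regions in A overlapping any region in B (half-open intervals)."""
    return sum(any(a0 < b1 and b0 < a1 for b0, b1 in regions_b)
               for a0, a1 in regions_a)

def permutation_test(regions_a, regions_b, genome_len, n_perm=200, seed=0):
    """Compare observed A-vs-B overlaps against a null built by re-placing
    the A regions uniformly at random (lengths preserved)."""
    rng = random.Random(seed)
    observed = n_overlaps(regions_a, regions_b)
    null = []
    for _ in range(n_perm):
        shuffled = []
        for a0, a1 in regions_a:
            start = rng.randrange(genome_len - (a1 - a0))
            shuffled.append((start, start + (a1 - a0)))
        null.append(n_overlaps(shuffled, regions_b))
    p = (1 + sum(v >= observed for v in null)) / (n_perm + 1)
    return observed, p

# Toy regions: 50 'peaks' deliberately placed next to 50 'features'.
peaks = [(i * 1000, i * 1000 + 100) for i in range(50)]
feats = [(i * 1000 + 50, i * 1000 + 150) for i in range(50)]
obs, p = permutation_test(peaks, feats, genome_len=1_000_000)
```

regioneR generalizes exactly these two plug-in points: the randomization strategy and the evaluation statistic.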