Sample records for utilizing genotype imputation

  1. LinkImputeR: user-guided genotype calling and imputation for non-model organisms.

    PubMed

    Money, Daniel; Migicovsky, Zoë; Gardner, Kyle; Myles, Sean

    2017-07-10

    Genomic studies such as genome-wide association and genomic selection require genome-wide genotype data. All existing technologies used to create these data result in missing genotypes, which are often then inferred using genotype imputation software. However, existing imputation methods most often make use only of genotypes that are successfully inferred after having passed a certain read depth threshold. Because of this, any read information for genotypes that did not pass the threshold, and were thus set to missing, is ignored. Most genomic studies also choose read depth thresholds and quality filters without investigating their effects on the size and quality of the resulting genotype data. Moreover, almost all genotype imputation methods require ordered markers and are therefore of limited utility in non-model organisms. Here we introduce LinkImputeR, a software program that exploits the read count information that is normally ignored, and makes use of all available DNA sequence information for the purposes of genotype calling and imputation. It is specifically designed for non-model organisms since it requires neither ordered markers nor a reference panel of genotypes. Using next-generation DNA sequence (NGS) data from apple, cannabis and grape, we quantify the effect of varying read count and missingness thresholds on the quantity and quality of genotypes generated from LinkImputeR. We demonstrate that LinkImputeR can increase the number of genotype calls by more than an order of magnitude, can improve genotyping accuracy by several percent and can thus improve the power of downstream analyses. Moreover, we show that the effects of quality and read depth filters can differ substantially between data sets and should therefore be investigated on a per-study basis. 
By exploiting DNA sequence data that is normally ignored during genotype calling and imputation, LinkImputeR can significantly improve both the quantity and quality of genotype data generated from NGS technologies. It enables the user to quickly and easily examine the effects of varying thresholds and filters on the number and quality of the resulting genotype calls. In this manner, users can decide on thresholds that are most suitable for their purposes. We show that LinkImputeR can significantly augment the value and utility of NGS data sets, especially in non-model organisms with poor genomic resources.
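
    To make the point about discarded read information concrete, the sketch below computes genotype posteriors directly from reference and alternate read counts. It is a generic binomial model under stated assumptions (a flat prior and a hypothetical per-read error rate of 0.01), not LinkImputeR's actual algorithm; the function names are illustrative only.

```python
from math import comb


def genotype_likelihoods(ref_reads, alt_reads, error_rate=0.01):
    """P(read counts | genotype) for genotypes 0, 1, 2 (number of alt alleles).

    error_rate is an assumed per-read sequencing error rate, not a value
    taken from the abstract.
    """
    n = ref_reads + alt_reads
    # Probability that a single read shows the alt allele under each genotype.
    p_alt = [error_rate, 0.5, 1.0 - error_rate]
    return [comb(n, alt_reads) * p ** alt_reads * (1 - p) ** ref_reads
            for p in p_alt]


def genotype_posteriors(ref_reads, alt_reads, error_rate=0.01):
    """Normalise the likelihoods under a flat prior over the three genotypes."""
    liks = genotype_likelihoods(ref_reads, alt_reads, error_rate)
    total = sum(liks)
    return [lik / total for lik in liks]


# Three alt reads and no ref reads: too shallow for many depth thresholds,
# yet the counts still favour the homozygous-alt genotype.
post = genotype_posteriors(ref_reads=0, alt_reads=3)
print(max(range(3), key=lambda g: post[g]))
```

    Under this model a genotype with only a few reads is uncertain rather than missing, which is the kind of information a hard depth threshold throws away.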

  2. Imputation of missing genotypes from sparse to high density using long-range phasing

    USDA-ARS's Scientific Manuscript database

    Related individuals share potentially long chromosome segments that trace to a common ancestor. A phasing algorithm (ChromoPhase) that utilizes this characteristic of finite populations was developed to phase large sections of a chromosome. In addition to phasing, ChromoPhase imputes missing genotyp...

  3. Effects of reduced panel, reference origin, and genetic relationship on imputation of genotypes in Hereford cattle

    USDA-ARS's Scientific Manuscript database

    The objective of this study was to investigate alternative methods for designing and utilizing reduced single nucleotide polymorphism (SNP) panels for imputing SNP genotypes. Two purebred Hereford populations, an experimental population known as Line 1 Hereford (L1, N=240) and registered Hereford wi...

  4. The utility of low-density genotyping for imputation in the Thoroughbred horse

    PubMed Central

    2014-01-01

    Background: Despite the dramatic reduction in the cost of high-density genotyping that has occurred over the last decade, it remains one of the limiting factors for obtaining the large datasets required for genomic studies of disease in the horse. In this study, we investigated the potential for low-density genotyping and subsequent imputation to address this problem. Results: Using the haplotype phasing and imputation program, BEAGLE, it is possible to impute genotypes from low- to high-density (50K) in the Thoroughbred horse with reasonable to high accuracy. Analysis of the sources of variation in imputation accuracy revealed dependence both on the minor allele frequency of the single nucleotide polymorphisms (SNPs) being imputed and on the underlying linkage disequilibrium structure. Whereas equidistant spacing of the SNPs on the low-density panel worked well, optimising SNP selection to increase their minor allele frequency was advantageous, even when the panel was subsequently used in a population of different geographical origin. Replacing base pair position with linkage disequilibrium map distance reduced the variation in imputation accuracy across SNPs. Whereas a 1K SNP panel was generally sufficient to ensure that more than 80% of genotypes were correctly imputed, other studies suggest that a 2K to 3K panel is more efficient to minimize the subsequent loss of accuracy in genomic prediction analyses. The relationship between accuracy and genotyping costs for the different low-density panels suggests that a 2K SNP panel would represent good value for money. Conclusions: Low-density genotyping with a 2K SNP panel followed by imputation provides a compromise between cost and accuracy that could promote more widespread genotyping, and hence the use of genomic information in horses.
In addition to offering a low cost alternative to high-density genotyping, imputation provides a means to combine datasets from different genotyping platforms, which is becoming necessary since researchers are starting to use the recently developed equine 70K SNP chip. However, more work is needed to evaluate the impact of between-breed differences on imputation accuracy. PMID:24495673

  5. A Unified Approach to Genotype Imputation and Haplotype-Phase Inference for Large Data Sets of Trios and Unrelated Individuals

    PubMed Central

    Browning, Brian L.; Browning, Sharon R.

    2009-01-01

    We present methods for imputing data for ungenotyped markers and for inferring haplotype phase in large data sets of unrelated individuals and parent-offspring trios. Our methods make use of known haplotype phase when it is available, and our methods are computationally efficient so that the full information in large reference panels with thousands of individuals is utilized. We demonstrate that substantial gains in imputation accuracy accrue with increasingly large reference panel sizes, particularly when imputing low-frequency variants, and that unphased reference panels can provide highly accurate genotype imputation. We place our methodology in a unified framework that enables the simultaneous use of unphased and phased data from trios and unrelated individuals in a single analysis. For unrelated individuals, our imputation methods produce well-calibrated posterior genotype probabilities and highly accurate allele-frequency estimates. For trios, our haplotype-inference method is four orders of magnitude faster than the gold-standard PHASE program and has excellent accuracy. Our methods enable genotype imputation to be performed with unphased trio or unrelated reference panels, thus accounting for haplotype-phase uncertainty in the reference panel. We present a useful measure of imputation accuracy, allelic R2, and show that this measure can be estimated accurately from posterior genotype probabilities. Our methods are implemented in version 3.0 of the BEAGLE software package. PMID:19200528

  6. Multi-task Gaussian process for imputing missing data in multi-trait and multi-environment trials.

    PubMed

    Hori, Tomoaki; Montcho, David; Agbangla, Clement; Ebana, Kaworu; Futakuchi, Koichi; Iwata, Hiroyoshi

    2016-11-01

    A method based on a multi-task Gaussian process using self-measuring similarity gave increased accuracy for imputing missing phenotypic data in multi-trait and multi-environment trials. Multi-environmental trial (MET) data often encounter the problem of missing data. Accurate imputation of missing data makes subsequent analysis more effective and the results easier to understand. Moreover, accurate imputation may help to reduce the cost of phenotyping for thinned-out lines tested in METs. METs are generally performed for multiple traits that are correlated to each other. Correlation among traits can be useful information for imputation, but single-trait-based methods cannot utilize information shared by traits that are correlated. In this paper, we propose imputation methods based on a multi-task Gaussian process (MTGP) using self-measuring similarity kernels reflecting relationships among traits, genotypes, and environments. This framework allows us to use genetic correlation among multi-trait multi-environment data and also to combine MET data and marker genotype data. We compared the accuracy of three MTGP methods and iterative regularized PCA using rice MET data. Two scenarios for the generation of missing data at various missing rates were considered. The MTGP achieved better imputation accuracy than regularized PCA, especially at high missing rates. Under the 'uniform' scenario, in which missing data arise randomly, inclusion of marker genotype data in the imputation increased the imputation accuracy at high missing rates. Under the 'fiber' scenario, in which missing data arise in all traits for some combinations between genotypes and environments, the inclusion of marker genotype data decreased the imputation accuracy for most traits while increasing the accuracy in a few traits remarkably. The proposed methods will be useful for solving the missing data problem in MET data.

  7. Imputation of microsatellite alleles from dense SNP genotypes for parentage verification across multiple Bos taurus and Bos indicus breeds

    PubMed Central

    McClure, Matthew C.; Sonstegard, Tad S.; Wiggans, George R.; Van Eenennaam, Alison L.; Weber, Kristina L.; Penedo, Cecilia T.; Berry, Donagh P.; Flynn, John; Garcia, Jose F.; Carmo, Adriana S.; Regitano, Luciana C. A.; Albuquerque, Milla; Silva, Marcos V. G. B.; Machado, Marco A.; Coffey, Mike; Moore, Kirsty; Boscher, Marie-Yvonne; Genestout, Lucie; Mazza, Raffaele; Taylor, Jeremy F.; Schnabel, Robert D.; Simpson, Barry; Marques, Elisa; McEwan, John C.; Cromie, Andrew; Coutinho, Luiz L.; Kuehn, Larry A.; Keele, John W.; Piper, Emily K.; Cook, Jim; Williams, Robert; Van Tassell, Curtis P.

    2013-01-01

    To help cattle producers transition from microsatellite (MS) to single nucleotide polymorphism (SNP) genotyping for parental verification we previously devised an effective and inexpensive method to impute MS alleles from SNP haplotypes. While the reported method was verified with only a limited data set (N = 479) from Brown Swiss, Guernsey, Holstein, and Jersey cattle, some of the MS-SNP haplotype associations were concordant across these phylogenetically diverse breeds. This implied that some haplotypes predate modern breed formation and remain in strong linkage disequilibrium. To expand the utility of MS allele imputation across breeds, MS and SNP data from more than 8000 animals representing 39 breeds (Bos taurus and B. indicus) were used to predict 9410 SNP haplotypes, incorporating an average of 73 SNPs per haplotype, for which alleles from 12 MS markers could be accurately imputed. Approximately 25% of the MS-SNP haplotypes were present in multiple breeds (N = 2 to 36 breeds). These shared haplotypes allowed for MS imputation in breeds that were not represented in the reference population with only a small increase in Mendelian inheritance inconsistencies. Our reported reference haplotypes can be used for any cattle breed and the reported methods can be applied to any species to aid the transition from MS to SNP genetic markers. While ~91% of the animals with imputed alleles for 12 MS markers had ≤1 Mendelian inheritance conflicts with their parents' reported MS genotypes, this figure was 96% for our reference animals, indicating potential errors in the reported MS genotypes. The workflow we suggest autocorrects for genotyping errors and rare haplotypes, by MS genotyping animals whose imputed MS alleles fail parentage verification, and then incorporating those animals into the reference dataset. PMID:24065982

  8. Evaluation and application of summary statistic imputation to discover new height-associated loci.

    PubMed

    Rüeger, Sina; McDaid, Aaron; Kutalik, Zoltán

    2018-05-01

    As most of the heritability of complex traits is attributed to common and low frequency genetic variants, imputing them by combining genotyping chips and large sequenced reference panels is the most cost-effective approach to discover the genetic basis of these traits. Association summary statistics from genome-wide meta-analyses are available for hundreds of traits. Updating these to ever-increasing reference panels is very cumbersome as it requires reimputation of the genetic data, rerunning the association scan, and meta-analysing the results. A much more efficient method is to directly impute the summary statistics, termed summary statistics imputation, which we improved to accommodate variable sample size across SNVs. Its performance relative to genotype imputation and practical utility has not yet been fully investigated. To this end, we compared the two approaches on real (genotyped and imputed) data from 120K samples from the UK Biobank and show that genotype imputation boasts a 3- to 5-fold lower root-mean-square error, and better distinguishes true associations from null ones. We observed the largest differences in power for variants with low minor allele frequency and low imputation quality. For fixed false positive rates of 0.001, 0.01, 0.05, using summary statistics imputation yielded a decrease in statistical power by 9, 43 and 35%, respectively. To test its capacity to discover novel associations, we applied summary statistics imputation to the GIANT height meta-analysis summary statistics covering HapMap variants, and identified 34 novel loci, 19 of which replicated using data in the UK Biobank. Additionally, we successfully replicated 55 out of the 111 variants published in an exome chip study. Our study demonstrates that summary statistics imputation is a very efficient and cost-effective way to identify and fine-map trait-associated loci.
Moreover, the ability to impute summary statistics is important for follow-up analyses, such as Mendelian randomisation or LD-score regression.

  9. Evaluation and application of summary statistic imputation to discover new height-associated loci

    PubMed Central

    2018-01-01

    As most of the heritability of complex traits is attributed to common and low frequency genetic variants, imputing them by combining genotyping chips and large sequenced reference panels is the most cost-effective approach to discover the genetic basis of these traits. Association summary statistics from genome-wide meta-analyses are available for hundreds of traits. Updating these to ever-increasing reference panels is very cumbersome as it requires reimputation of the genetic data, rerunning the association scan, and meta-analysing the results. A much more efficient method is to directly impute the summary statistics, termed summary statistics imputation, which we improved to accommodate variable sample size across SNVs. Its performance relative to genotype imputation and practical utility has not yet been fully investigated. To this end, we compared the two approaches on real (genotyped and imputed) data from 120K samples from the UK Biobank and show that genotype imputation boasts a 3- to 5-fold lower root-mean-square error, and better distinguishes true associations from null ones. We observed the largest differences in power for variants with low minor allele frequency and low imputation quality. For fixed false positive rates of 0.001, 0.01, 0.05, using summary statistics imputation yielded a decrease in statistical power by 9, 43 and 35%, respectively. To test its capacity to discover novel associations, we applied summary statistics imputation to the GIANT height meta-analysis summary statistics covering HapMap variants, and identified 34 novel loci, 19 of which replicated using data in the UK Biobank. Additionally, we successfully replicated 55 out of the 111 variants published in an exome chip study. Our study demonstrates that summary statistics imputation is a very efficient and cost-effective way to identify and fine-map trait-associated loci.
Moreover, the ability to impute summary statistics is important for follow-up analyses, such as Mendelian randomisation or LD-score regression. PMID:29782485

  10. Genetic Diversity Analysis of Highly Incomplete SNP Genotype Data with Imputations: An Empirical Assessment

    PubMed Central

    Fu, Yong-Bi

    2014-01-01

    Genotyping by sequencing (GBS) has recently emerged as a promising genomic approach for assessing genetic diversity on a genome-wide scale. However, concerns remain about the uniquely large imbalance in GBS genotype data. Although some genotype imputation has been proposed to infer missing observations, little is known about the reliability of a genetic diversity analysis of GBS data, with up to 90% of observations missing. Here we performed an empirical assessment of accuracy in genetic diversity analysis of highly incomplete single nucleotide polymorphism genotypes with imputations. Three large single-nucleotide polymorphism genotype data sets for corn, wheat, and rice were acquired, and missing data with up to 90% of missing observations were randomly generated and then imputed for missing genotypes with three map-independent imputation methods. Estimating heterozygosity and inbreeding coefficient from original, missing, and imputed data revealed variable patterns of bias from assessed levels of missingness and genotype imputation, but the estimation biases were smaller for missing data without genotype imputation. The estimates of genetic differentiation were rather robust up to 90% of missing observations but became substantially biased when missing genotypes were imputed. The estimates of topology accuracy for four representative samples of groups of interest generally were reduced with increased levels of missing genotypes. Probabilistic principal component analysis based imputation performed better in terms of topology accuracy than those analyses of missing data without genotype imputation. These findings are not only significant for understanding the reliability of the genetic diversity analysis with respect to large missing data and genotype imputation but also are instructive for performing a proper genetic diversity analysis of highly incomplete GBS or other genotype data. PMID:24626289
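
    For readers unfamiliar with the two statistics whose bias is assessed above, the sketch below computes observed heterozygosity, expected heterozygosity (2pq) and the inbreeding coefficient F = 1 - Ho/He for a single biallelic marker, skipping missing calls. It uses the textbook definitions and is not the study's own pipeline; the function name and the 0/1/2 dosage coding are assumptions.

```python
def diversity_stats(genotypes):
    """Return (Ho, He, F) for one biallelic marker.

    genotypes: list of alt-allele dosages (0, 1, 2) with None for missing.
    Missing calls are simply dropped, i.e. no imputation is performed.
    """
    observed = [g for g in genotypes if g is not None]
    n = len(observed)
    p = sum(observed) / (2 * n)                       # alt-allele frequency
    h_obs = sum(1 for g in observed if g == 1) / n    # observed heterozygosity
    h_exp = 2 * p * (1 - p)                           # Hardy-Weinberg expectation
    f = 1 - h_obs / h_exp if h_exp > 0 else float("nan")
    return h_obs, h_exp, f


print(diversity_stats([0, 1, 2, 1, None]))  # random-mating-like sample
print(diversity_stats([0, 0, 2, 2]))        # fully inbred-like sample
```

    Running the same function on original, masked and imputed versions of a marker is one simple way to see how missingness or imputation shifts these estimates.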

  11. Practical implementation of cost-effective genomic selection in commercial pig breeding using imputation.

    PubMed

    Cleveland, M A; Hickey, J M

    2013-08-01

    Genomic selection can be implemented in pig breeding at a reduced cost using genotype imputation. Accuracy of imputation and the impact on resulting genomic breeding values (gEBV) was investigated. High-density genotype data was available for 4,763 animals from a single pig line. Three low-density genotype panels were constructed with SNP densities of 450 (L450), 3,071 (L3k) and 5,963 (L6k). Accuracy of imputation was determined using 184 test individuals with no genotyped descendants in the data but with parents and grandparents genotyped using the Illumina PorcineSNP60 Beadchip. Alternative genotyping scenarios were created in which parents, grandparents, and individuals that were not direct ancestors of test animals (Other) were genotyped at high density (S1), grandparents were not genotyped (S2), dams and granddams were not genotyped (S3), and dams and granddams were genotyped at low density (S4). Four additional scenarios were created by excluding Other animal genotypes. Test individuals were always genotyped at low density. Imputation was performed with AlphaImpute. Genomic breeding values were calculated using the single-step genomic evaluation. Test animals were evaluated for the information retained in the gEBV, calculated as the correlation between gEBV using imputed genotypes and gEBV using true genotypes. Accuracy of imputation was high for all scenarios but decreased with fewer SNP on the low-density panel (0.995 to 0.965 for S1) and with reduced genotyping of ancestors, where the largest changes were for L450 (0.965 in S1 to 0.914 in S3). Exclusion of genotypes for Other animals resulted in only small accuracy decreases. Imputation accuracy was not consistent across the genome. Information retained in the gEBV was related to genotyping scenario and thus to imputation accuracy. Reducing the number of SNP on the low-density panel reduced the information retained in the gEBV, with the largest decrease observed from L3k to L450. 
Excluding Other animal genotypes had little impact on imputation accuracy but caused large decreases in the information retained in the gEBV. These results indicate that accuracy of gEBV from imputed genotypes depends on the level of genotyping in close relatives and the size of the genotyped dataset. Fewer high-density genotyped individuals are needed to obtain accurate imputation than are needed to obtain accurate gEBV. Strategies to optimize development of low-density panels can improve both imputation and gEBV accuracy.

  12. Imputation across genotyping arrays for genome-wide association studies: assessment of bias and a correction strategy.

    PubMed

    Johnson, Eric O; Hancock, Dana B; Levy, Joshua L; Gaddis, Nathan C; Saccone, Nancy L; Bierut, Laura J; Page, Grier P

    2013-05-01

    A great promise of publicly sharing genome-wide association data is the potential to create composite sets of controls. However, studies often use different genotyping arrays, and imputation to a common set of SNPs has shown substantial bias: a problem which has no broadly applicable solution. Based on the idea that using differing genotyped SNP sets as inputs creates differential imputation errors and thus bias in the composite set of controls, we examined the degree to which each of the following occurs: (1) imputation based on the union of genotyped SNPs (i.e., SNPs available on one or more arrays) results in bias, as evidenced by spurious associations (type 1 error) between imputed genotypes and arbitrarily assigned case/control status; (2) imputation based on the intersection of genotyped SNPs (i.e., SNPs available on all arrays) does not evidence such bias; and (3) imputation quality varies by the size of the intersection of genotyped SNP sets. Imputations were conducted in European Americans and African Americans with reference to HapMap phase II and III data. Imputation based on the union of genotyped SNPs across the Illumina 1M and 550v3 arrays showed spurious associations for 0.2 % of SNPs: ~2,000 false positives per million SNPs imputed. Biases remained problematic for very similar arrays (550v1 vs. 550v3) and were substantial for dissimilar arrays (Illumina 1M vs. Affymetrix 6.0). In all instances, imputing based on the intersection of genotyped SNPs (as few as 30 % of the total SNPs genotyped) eliminated such bias while still achieving good imputation quality.
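
    The union and intersection strategies compared above reduce to set operations on the SNP lists of each array. The sketch below illustrates them with hypothetical array names and SNP identifiers, and reproduces the abstract's conversion of 0.2 % into false positives per million imputed SNPs.

```python
def imputation_inputs(array_snps):
    """array_snps: dict mapping array name -> set of genotyped SNP IDs.

    Returns the union, the intersection, and the share of the union that
    the intersection covers.
    """
    sets = list(array_snps.values())
    union = set().union(*sets)
    intersection = set.intersection(*sets)
    return union, intersection, len(intersection) / len(union)


# Hypothetical arrays and SNP IDs, for illustration only.
arrays = {
    "array_a": {"rs1", "rs2", "rs3", "rs4"},
    "array_b": {"rs2", "rs3", "rs5"},
}
union, shared, share = imputation_inputs(arrays)
print(sorted(shared), share)

# 0.2 % spurious associations expressed per million imputed SNPs.
print(round(0.002 * 1_000_000))
```

    Restricting the imputation input to `shared` gives every cohort the same starting SNPs, which is the property the abstract reports as removing the bias.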

  13. Evaluating Imputation Algorithms for Low-Depth Genotyping-By-Sequencing (GBS) Data

    PubMed Central

    2016-01-01

    Well-powered genomic studies require genome-wide marker coverage across many individuals. For non-model species with few genomic resources, high-throughput sequencing (HTS) methods, such as Genotyping-By-Sequencing (GBS), offer an inexpensive alternative to array-based genotyping. Although affordable, datasets derived from HTS methods suffer from sequencing error, alignment errors, and missing data, all of which introduce noise and uncertainty to variant discovery and genotype calling. Under such circumstances, meaningful analysis of the data is difficult. Our primary interest lies in the issue of how one can accurately infer or impute missing genotypes in HTS-derived datasets. Many of the existing genotype imputation algorithms and software packages were primarily developed by and optimized for the human genetics community, a field where a complete and accurate reference genome has been constructed and SNP arrays have, in large part, been the common genotyping platform. We set out to answer two questions: 1) can we use existing imputation methods developed by the human genetics community to impute missing genotypes in datasets derived from non-human species and 2) are these methods, which were developed and optimized to impute ascertained variants, amenable for imputation of missing genotypes at HTS-derived variants? We selected Beagle v.4, a widely used algorithm within the human genetics community with reportedly high accuracy, to serve as our imputation contender. We performed a series of cross-validation experiments, using GBS data collected from the species Manihot esculenta by the Next Generation (NEXTGEN) Cassava Breeding Project. NEXTGEN currently imputes missing genotypes in their datasets using a LASSO-penalized, linear regression method (denoted ‘glmnet’). We selected glmnet to serve as a benchmark imputation method for this reason. 
We obtained estimates of imputation accuracy by masking a subset of observed genotypes, imputing, and calculating the sample Pearson correlation between observed and imputed genotype dosages at the site and individual level; computation time served as a second metric for comparison. We then set out to examine factors affecting imputation accuracy, such as levels of missing data, read depth, minor allele frequency (MAF), and reference panel composition. PMID:27537694
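
    The masking procedure described above can be sketched in a few lines. The version below hides a random subset of observed dosages, runs an imputer, and reports the Pearson correlation between true and imputed values over the masked cells. It pools all masked cells rather than working per site or per individual, and the site-mean imputer is only a stand-in for Beagle or glmnet; all names and the default mask fraction are assumptions.

```python
import random
from math import sqrt


def pearson(x, y):
    """Sample Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def site_mean_impute(matrix):
    """Stand-in imputer: fill None with the mean dosage of that site (column)."""
    means = []
    for j in range(len(matrix[0])):
        seen = [row[j] for row in matrix if row[j] is not None]
        means.append(sum(seen) / len(seen) if seen else 0.0)
    return [[means[j] if v is None else v for j, v in enumerate(row)]
            for row in matrix]


def masked_accuracy(matrix, impute_fn, mask_fraction=0.2, seed=0):
    """Mask observed cells, impute, and correlate truth with imputed dosages."""
    rng = random.Random(seed)
    cells = [(i, j) for i, row in enumerate(matrix)
             for j, v in enumerate(row) if v is not None]
    masked = rng.sample(cells, max(2, int(len(cells) * mask_fraction)))
    work = [list(row) for row in matrix]      # copy so the input is untouched
    for i, j in masked:
        work[i][j] = None
    imputed = impute_fn(work)
    truth = [matrix[i][j] for i, j in masked]
    guess = [imputed[i][j] for i, j in masked]
    return pearson(truth, guess)
```

    Any imputer with the same call shape (matrix in, filled matrix out) can be passed as `impute_fn`, so two methods can be compared on identical masks by reusing the same seed.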

  14. Evaluating Imputation Algorithms for Low-Depth Genotyping-By-Sequencing (GBS) Data.

    PubMed

    Chan, Ariel W; Hamblin, Martha T; Jannink, Jean-Luc

    2016-01-01

    Well-powered genomic studies require genome-wide marker coverage across many individuals. For non-model species with few genomic resources, high-throughput sequencing (HTS) methods, such as Genotyping-By-Sequencing (GBS), offer an inexpensive alternative to array-based genotyping. Although affordable, datasets derived from HTS methods suffer from sequencing error, alignment errors, and missing data, all of which introduce noise and uncertainty to variant discovery and genotype calling. Under such circumstances, meaningful analysis of the data is difficult. Our primary interest lies in the issue of how one can accurately infer or impute missing genotypes in HTS-derived datasets. Many of the existing genotype imputation algorithms and software packages were primarily developed by and optimized for the human genetics community, a field where a complete and accurate reference genome has been constructed and SNP arrays have, in large part, been the common genotyping platform. We set out to answer two questions: 1) can we use existing imputation methods developed by the human genetics community to impute missing genotypes in datasets derived from non-human species and 2) are these methods, which were developed and optimized to impute ascertained variants, amenable for imputation of missing genotypes at HTS-derived variants? We selected Beagle v.4, a widely used algorithm within the human genetics community with reportedly high accuracy, to serve as our imputation contender. We performed a series of cross-validation experiments, using GBS data collected from the species Manihot esculenta by the Next Generation (NEXTGEN) Cassava Breeding Project. NEXTGEN currently imputes missing genotypes in their datasets using a LASSO-penalized, linear regression method (denoted 'glmnet'). We selected glmnet to serve as a benchmark imputation method for this reason. 
We obtained estimates of imputation accuracy by masking a subset of observed genotypes, imputing, and calculating the sample Pearson correlation between observed and imputed genotype dosages at the site and individual level; computation time served as a second metric for comparison. We then set out to examine factors affecting imputation accuracy, such as levels of missing data, read depth, minor allele frequency (MAF), and reference panel composition.

  15. LinkImpute: Fast and Accurate Genotype Imputation for Nonmodel Organisms

    PubMed Central

    Money, Daniel; Gardner, Kyle; Migicovsky, Zoë; Schwaninger, Heidi; Zhong, Gan-Yuan; Myles, Sean

    2015-01-01

    Obtaining genome-wide genotype data from a set of individuals is the first step in many genomic studies, including genome-wide association and genomic selection. All genotyping methods suffer from some level of missing data, and genotype imputation can be used to fill in the missing data and improve the power of downstream analyses. Model organisms like human and cattle benefit from high-quality reference genomes and panels of reference genotypes that aid in imputation accuracy. In nonmodel organisms, however, genetic and physical maps often are either of poor quality or are completely absent, and there are no panels of reference genotypes available. There is therefore a need for imputation methods designed specifically for nonmodel organisms in which genomic resources are poorly developed and marker order is unreliable or unknown. Here we introduce LinkImpute, a software package based on a k-nearest neighbor genotype imputation method, LD-kNNi, which is designed for unordered markers. No physical or genetic maps are required, and it is designed to work on unphased genotype data from heterozygous species. It exploits the fact that markers useful for imputation often are not physically close to the missing genotype but rather distributed throughout the genome. Using genotyping-by-sequencing data from diverse and heterozygous accessions of apples, grapes, and maize, we compare LD-kNNi with several genotype imputation methods and show that LD-kNNi is fast, comparable in accuracy to the best-existing methods, and exhibits the least bias in allele frequency estimates. PMID:26377960
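
    A simplified illustration of the idea behind an LD-based k-nearest-neighbour imputer follows: pick the markers most correlated with the target marker regardless of position, measure distance between samples on those markers only, and let the nearest samples vote. This is a sketch of the general approach, not the published LD-kNNi implementation; the parameter names `k` and `l`, the r-squared LD measure and the 1/(1 + distance) weighting are assumptions.

```python
def marker_r2(geno, a, b):
    """Squared correlation between markers a and b over jointly observed samples."""
    pairs = [(s[a], s[b]) for s in geno if s[a] is not None and s[b] is not None]
    n = len(pairs)
    if n < 2:
        return 0.0
    ma = sum(x for x, _ in pairs) / n
    mb = sum(y for _, y in pairs) / n
    cov = sum((x - ma) * (y - mb) for x, y in pairs)
    va = sum((x - ma) ** 2 for x, _ in pairs)
    vb = sum((y - mb) ** 2 for _, y in pairs)
    return cov * cov / (va * vb) if va > 0 and vb > 0 else 0.0


def knn_impute(geno, sample, marker, k=2, l=2):
    """Impute geno[sample][marker] from the k nearest samples.

    geno: list of samples, each a list of dosages (0, 1, 2) or None.
    No marker order or map is used: the l markers are chosen purely by LD.
    """
    others = [m for m in range(len(geno[0])) if m != marker]
    linked = sorted(others, key=lambda m: marker_r2(geno, m, marker),
                    reverse=True)[:l]
    candidates = []
    for i, row in enumerate(geno):
        if i == sample or row[marker] is None:
            continue
        shared = [m for m in linked
                  if row[m] is not None and geno[sample][m] is not None]
        if shared:
            dist = sum(abs(row[m] - geno[sample][m]) for m in shared)
            candidates.append((dist, i))
    votes = {}
    for dist, i in sorted(candidates)[:k]:
        g = geno[i][marker]
        votes[g] = votes.get(g, 0.0) + 1.0 / (1.0 + dist)  # closer = heavier
    return max(votes, key=votes.get) if votes else None


geno = [
    [0, 0, 0, None],
    [0, 0, 0, 0],
    [0, 0, 1, 0],
    [2, 2, 2, 2],
    [2, 2, 1, 2],
    [1, 1, 1, 1],
]
print(knn_impute(geno, sample=0, marker=3))
```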

  16. Linkage disequilibrium among commonly genotyped SNP and variants detected from bull sequence

    USDA-ARS's Scientific Manuscript database

    Genomic prediction utilizing causal variants could increase selection accuracy above that achieved with SNP genotyped by commercial assays. A number of variants detected from sequencing influential sires are likely to be causal, but noticeable improvements in prediction accuracy using imputed sequen...

  17. A phasing and imputation method for pedigreed populations that results in a single-stage genomic evaluation

    PubMed Central

    2012-01-01

    Background: Efficient, robust, and accurate genotype imputation algorithms make large-scale application of genomic selection cost effective. An algorithm that imputes alleles or allele probabilities for all animals in the pedigree and for all genotyped single nucleotide polymorphisms (SNP) provides a framework to combine all pedigree, genomic, and phenotypic information into a single-stage genomic evaluation. Methods: An algorithm was developed for imputation of genotypes in pedigreed populations that allows imputation for completely ungenotyped animals and for low-density genotyped animals, accommodates a wide variety of pedigree structures for genotyped animals, imputes unmapped SNP, and works for large datasets. The method involves simple phasing rules, long-range phasing and haplotype library imputation and segregation analysis. Results: Imputation accuracy was high and computational cost was feasible for datasets with pedigrees of up to 25 000 animals. The resulting single-stage genomic evaluation increased the accuracy of estimated genomic breeding values compared to a scenario in which phenotypes on relatives that were not genotyped were ignored. Conclusions: The developed imputation algorithm and software and the resulting single-stage genomic evaluation method provide powerful new ways to exploit imputation and to obtain more accurate genetic evaluations. PMID:22462519

  18. Genotype imputation efficiency in Nelore Cattle

    USDA-ARS's Scientific Manuscript database

    Genotype imputation efficiency in Nelore cattle was evaluated in different scenarios of lower density (LD) chips, imputation methods and sets of animals to have their genotypes imputed. Twelve commercial and virtual custom LD chips with densities varying from 7K to 75K SNPs were tested. Customized L...

  19. LinkImpute: Fast and Accurate Genotype Imputation for Nonmodel Organisms.

    PubMed

    Money, Daniel; Gardner, Kyle; Migicovsky, Zoë; Schwaninger, Heidi; Zhong, Gan-Yuan; Myles, Sean

    2015-09-15

    Obtaining genome-wide genotype data from a set of individuals is the first step in many genomic studies, including genome-wide association and genomic selection. All genotyping methods suffer from some level of missing data, and genotype imputation can be used to fill in the missing data and improve the power of downstream analyses. Model organisms like human and cattle benefit from high-quality reference genomes and panels of reference genotypes that aid in imputation accuracy. In nonmodel organisms, however, genetic and physical maps often are either of poor quality or are completely absent, and there are no panels of reference genotypes available. There is therefore a need for imputation methods designed specifically for nonmodel organisms in which genomic resources are poorly developed and marker order is unreliable or unknown. Here we introduce LinkImpute, a software package based on a k-nearest neighbor genotype imputation method, LD-kNNi, which is designed for unordered markers. No physical or genetic maps are required, and it is designed to work on unphased genotype data from heterozygous species. It exploits the fact that markers useful for imputation often are not physically close to the missing genotype but rather distributed throughout the genome. Using genotyping-by-sequencing data from diverse and heterozygous accessions of apples, grapes, and maize, we compare LD-kNNi with several genotype imputation methods and show that LD-kNNi is fast, comparable in accuracy to the best-existing methods, and exhibits the least bias in allele frequency estimates. Copyright © 2015 Money et al.
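
    The idea of LD-guided nearest-neighbour imputation for unordered markers can be illustrated with a small sketch. This is not the LinkImpute implementation: the function names, the 0/1/2 genotype coding, the distance measure, and the parameters `k` (number of neighbours) and `l` (number of LD-ranked markers) are assumptions made for illustration only.

```python
from collections import defaultdict


def r2(a, b):
    """Squared Pearson correlation over samples genotyped at both markers."""
    pairs = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
    n = len(pairs)
    if n < 2:
        return 0.0
    mx = sum(x for x, _ in pairs) / n
    my = sum(y for _, y in pairs) / n
    sxx = sum((x - mx) ** 2 for x, _ in pairs)
    syy = sum((y - my) ** 2 for _, y in pairs)
    sxy = sum((x - mx) * (y - my) for x, y in pairs)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy * sxy / (sxx * syy)


def impute_ld_knn(geno, sample, snp, k=3, l=2):
    """Impute geno[sample][snp] (hypothetical helper, illustrative only).

    geno is a list of samples; each sample is a list of genotypes coded
    0/1/2, with None for missing. Marker order is never used.
    """
    columns = list(zip(*geno))
    # Rank all other markers by LD (r^2) with the target marker, wherever
    # they sit in the genome, and keep the l strongest.
    others = [j for j in range(len(columns)) if j != snp]
    others.sort(key=lambda j: r2(columns[snp], columns[j]), reverse=True)
    ld_snps = others[:l]

    # Distance between the target sample and every sample that has an
    # observed genotype at the target marker, using only the LD markers.
    neighbours = []
    for i, row in enumerate(geno):
        if i == sample or row[snp] is None:
            continue
        shared = [j for j in ld_snps
                  if row[j] is not None and geno[sample][j] is not None]
        if not shared:
            continue
        dist = sum(abs(row[j] - geno[sample][j]) for j in shared) / len(shared)
        neighbours.append((dist, i))
    neighbours.sort()

    # Distance-weighted vote among the k nearest neighbours.
    votes = defaultdict(float)
    for dist, i in neighbours[:k]:
        votes[geno[i][snp]] += 1.0 / (dist + 1e-6)
    return max(votes, key=votes.get) if votes else None
```

    For example, when marker 0 is in perfect LD with marker 1, a sample missing marker 0 but carrying genotype 2 at marker 1 is imputed from the neighbours that also carry 2 at marker 1.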

  20. Improved imputation accuracy of rare and low-frequency variants using population-specific high-coverage WGS-based imputation reference panel.

    PubMed

    Mitt, Mario; Kals, Mart; Pärn, Kalle; Gabriel, Stacey B; Lander, Eric S; Palotie, Aarno; Ripatti, Samuli; Morris, Andrew P; Metspalu, Andres; Esko, Tõnu; Mägi, Reedik; Palta, Priit

    2017-06-01

    Genetic imputation is a cost-efficient way to improve the power and resolution of genome-wide association (GWA) studies. Current publicly accessible imputation reference panels accurately predict genotypes for common variants with minor allele frequency (MAF)≥5% and low-frequency variants (0.5≤MAF<5%) across diverse populations, but the imputation of rare variation (MAF<0.5%) is still rather limited. In the current study, we evaluate imputation accuracy achieved with reference panels from diverse populations with a population-specific high-coverage (30 ×) whole-genome sequencing (WGS) based reference panel, comprising 2244 Estonian individuals (0.25% of adult Estonians). Although the Estonian-specific panel contains fewer haplotypes and variants, the imputation confidence and accuracy of imputed low-frequency and rare variants were significantly higher. The results indicate the utility of population-specific reference panels for human genetic studies.
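
    The frequency classes named in this abstract (common, MAF ≥ 5%; low-frequency, 0.5% ≤ MAF < 5%; rare, MAF < 0.5%) can be written out as a short sketch. The function names and the 0/1/2 alternate-allele coding are assumptions for illustration, not part of the study's pipeline.

```python
def minor_allele_frequency(genotypes):
    """MAF from genotypes coded as alternate-allele counts (0, 1, 2).

    None marks a missing genotype and is skipped.
    """
    observed = [g for g in genotypes if g is not None]
    if not observed:
        raise ValueError("no observed genotypes")
    alt_freq = sum(observed) / (2 * len(observed))
    # The minor allele is whichever of the two alleles is less frequent.
    return min(alt_freq, 1 - alt_freq)


def maf_class(maf):
    """Assign the frequency class using the thresholds quoted in the abstract."""
    if maf >= 0.05:
        return "common"
    if maf >= 0.005:
        return "low-frequency"
    return "rare"
```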

  1. Improved imputation accuracy of rare and low-frequency variants using population-specific high-coverage WGS-based imputation reference panel

    PubMed Central

    Mitt, Mario; Kals, Mart; Pärn, Kalle; Gabriel, Stacey B; Lander, Eric S; Palotie, Aarno; Ripatti, Samuli; Morris, Andrew P; Metspalu, Andres; Esko, Tõnu; Mägi, Reedik; Palta, Priit

    2017-01-01

    Genetic imputation is a cost-efficient way to improve the power and resolution of genome-wide association (GWA) studies. Current publicly accessible imputation reference panels accurately predict genotypes for common variants with minor allele frequency (MAF)≥5% and low-frequency variants (0.5≤MAF<5%) across diverse populations, but the imputation of rare variation (MAF<0.5%) is still rather limited. In the current study, we evaluate imputation accuracy achieved with reference panels from diverse populations with a population-specific high-coverage (30 ×) whole-genome sequencing (WGS) based reference panel, comprising 2244 Estonian individuals (0.25% of adult Estonians). Although the Estonian-specific panel contains fewer haplotypes and variants, the imputation confidence and accuracy of imputed low-frequency and rare variants were significantly higher. The results indicate the utility of population-specific reference panels for human genetic studies. PMID:28401899

  2. Accuracy of estimation of genomic breeding values in pigs using low-density genotypes and imputation.

    PubMed

    Badke, Yvonne M; Bates, Ronald O; Ernst, Catherine W; Fix, Justin; Steibel, Juan P

    2014-04-16

    Genomic selection has the potential to increase genetic progress. Genotype imputation of high-density single-nucleotide polymorphism (SNP) genotypes can improve the cost efficiency of genomic breeding value (GEBV) prediction for pig breeding. Consequently, the objectives of this work were to: (1) estimate accuracy of genomic evaluation and GEBV for three traits in a Yorkshire population and (2) quantify the loss of accuracy of genomic evaluation and GEBV when genotypes were imputed under two scenarios: a high-cost, high-accuracy scenario in which only selection candidates were imputed from a low-density platform and a low-cost, low-accuracy scenario in which all animals were imputed using a small reference panel of haplotypes. Phenotypes and genotypes obtained with the PorcineSNP60 BeadChip were available for 983 Yorkshire boars. Genotypes of selection candidates were masked and imputed using tagSNP in the GeneSeek Genomic Profiler (10K). Imputation was performed with BEAGLE using 128 or 1800 haplotypes as reference panels. GEBV were obtained through an animal-centric ridge regression model using de-regressed breeding values as response variables. Accuracy of genomic evaluation was estimated as the correlation between estimated breeding values and GEBV in a 10-fold cross validation design. Accuracy of genomic evaluation using observed genotypes was high for all traits (0.65-0.68). Using genotypes imputed from a large reference panel (accuracy: R(2) = 0.95) for genomic evaluation did not significantly decrease accuracy, whereas a scenario with genotypes imputed from a small reference panel (R(2) = 0.88) did show a significant decrease in accuracy. Genomic evaluation based on imputed genotypes in selection candidates can be implemented at a fraction of the cost of a genomic evaluation using observed genotypes and still yield virtually the same accuracy. 
    On the other hand, using a very small reference panel of haplotypes to impute training animals and candidates for selection results in lower accuracy of genomic evaluation.

  3. Strategies for genotype imputation in composite beef cattle.

    PubMed

    Chud, Tatiane C S; Ventura, Ricardo V; Schenkel, Flavio S; Carvalheiro, Roberto; Buzanskas, Marcos E; Rosa, Jaqueline O; Mudadu, Maurício de Alvarenga; da Silva, Marcos Vinicius G B; Mokry, Fabiana B; Marcondes, Cintia R; Regitano, Luciana C A; Munari, Danísio P

    2015-08-07

    Genotype imputation has been used to increase genomic information, allow more animals in genome-wide analyses, and reduce genotyping costs. In Brazilian beef cattle production, many animals result from crossbreeding, which may alter linkage disequilibrium patterns. Thus, the challenge is to obtain accurately imputed genotypes in crossbred animals. The objective of this study was to evaluate the best fitting and most accurate imputation strategy on the MA genetic group (the progeny of a Charolais sire mated with crossbred Canchim X Zebu cows) and Canchim cattle. The data set contained 400 animals (born between 1999 and 2005) genotyped with the Illumina BovineHD panel. Imputation accuracy of genotypes from the Illumina-Bovine3K (3K), Illumina-BovineLD (6K), GeneSeek-Genomic-Profiler (GGP) BeefLD (GGP9K), GGP-IndicusLD (GGP20Ki), Illumina-BovineSNP50 (50K), GGP-IndicusHD (GGP75Ki), and GGP-BeefHD (GGP80K) to Illumina-BovineHD (HD) SNP panels was investigated. Seven scenarios for reference and target populations were tested; the animals were grouped according to birth year (S1), genetic groups (S2 and S3), genetic groups and birth year (S4 and S5), gender (S6), and gender and birth year (S7). Analyses were performed using FImpute and BEAGLE software and computation run-time was recorded. Genotype imputation accuracy was measured by concordance rate (CR) and allelic R square (R(2)). The highest imputation accuracy scenario consisted of a reference population with males and females and a target population with young females. Among the SNP panels in the tested scenarios, the 50K, GGP75Ki and GGP80K were the most adequate to impute to HD in Canchim cattle. FImpute reduced the computation run-time for imputation by a factor of 20 to 100 compared with BEAGLE. Genotyping panels with at least 50 thousand markers are suitable for genotype imputation to HD with acceptable accuracy. The FImpute algorithm demonstrated a higher efficiency of imputed markers, especially in lower density panels. These considerations may help to increase genotypic information, reduce genotyping costs, and aid in genomic selection evaluations in crossbred animals.
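
    The two accuracy measures named in this abstract can be sketched as follows. The concordance rate is the share of genotypes imputed exactly right. For the R-square measure, this sketch uses the squared Pearson correlation between observed and imputed allele counts, which is an assumption; the abstract does not spell out the exact estimator the authors used. Function names are hypothetical.

```python
def concordance_rate(observed, imputed):
    """Proportion of genotypes (coded 0/1/2) that were imputed correctly."""
    matches = sum(1 for o, i in zip(observed, imputed) if o == i)
    return matches / len(observed)


def squared_correlation(observed, imputed):
    """Squared Pearson correlation between observed and imputed allele counts."""
    n = len(observed)
    mean_o = sum(observed) / n
    mean_i = sum(imputed) / n
    s_oo = sum((o - mean_o) ** 2 for o in observed)
    s_ii = sum((i - mean_i) ** 2 for i in imputed)
    s_oi = sum((o - mean_o) * (i - mean_i) for o, i in zip(observed, imputed))
    # A monomorphic marker has no variance, so the correlation is undefined;
    # returning 0.0 here is a convention chosen for this sketch.
    if s_oo == 0 or s_ii == 0:
        return 0.0
    return s_oi * s_oi / (s_oo * s_ii)
```

    The two measures can disagree: at a marker with a rare minor allele, always imputing the major homozygote gives a high concordance rate but a correlation of zero, which is one reason studies often report both.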

  4. Impact of pre-imputation SNP-filtering on genotype imputation results

    PubMed Central

    2014-01-01

    Background Imputation of partially missing or unobserved genotypes is an indispensable tool for SNP data analyses. However, research and understanding of the impact of initial SNP-data quality control on imputation results is still limited. In this paper, we aim to evaluate the effect of different strategies of pre-imputation quality filtering on the performance of the widely used imputation algorithms MaCH and IMPUTE. Results We considered three scenarios: imputation of partially missing genotypes with usage of an external reference panel, without usage of an external reference panel, as well as imputation of completely un-typed SNPs using an external reference panel. We first created various datasets applying different SNP quality filters and masking certain percentages of randomly selected high-quality SNPs. We imputed these SNPs and compared the results between the different filtering scenarios by using established and newly proposed measures of imputation quality. While the established measures assess certainty of imputation results, our newly proposed measures focus on the agreement with true genotypes. These measures showed that pre-imputation SNP-filtering might be detrimental regarding imputation quality. Moreover, the strongest drivers of imputation quality were in general the burden of missingness and the number of SNPs used for imputation. We also found that using a reference panel always improves imputation quality of partially missing genotypes. MaCH performed slightly better than IMPUTE2 in most of our scenarios. Again, these results were more pronounced when using our newly defined measures of imputation quality. Conclusion Even a moderate filtering has a detrimental effect on the imputation quality. Therefore little or no SNP filtering prior to imputation appears to be the best strategy for imputing small to moderately sized datasets. 
    Our results also showed that for these datasets, MaCH performs slightly better than IMPUTE2 in most scenarios at the cost of increased computing time. PMID:25112433

  5. High-density marker imputation accuracy in sixteen French cattle breeds.

    PubMed

    Hozé, Chris; Fouilloux, Marie-Noëlle; Venot, Eric; Guillaume, François; Dassonneville, Romain; Fritz, Sébastien; Ducrocq, Vincent; Phocas, Florence; Boichard, Didier; Croiseau, Pascal

    2013-09-03

    Genotyping with the medium-density Bovine SNP50 BeadChip® (50K) is now standard in cattle. The high-density BovineHD BeadChip®, which contains 777,609 single nucleotide polymorphisms (SNPs), was developed in 2010. Increasing marker density increases the level of linkage disequilibrium between quantitative trait loci (QTL) and SNPs and the accuracy of QTL localization and genomic selection. However, re-genotyping all animals with the high-density chip is not economically feasible. An alternative strategy is to genotype part of the animals with the high-density chip and to impute high-density genotypes for animals already genotyped with the 50K chip. Thus, it is necessary to investigate the error rate when imputing from the 50K to the high-density chip. Five thousand one hundred and fifty three animals from 16 breeds (89 to 788 per breed) were genotyped with the high-density chip. Imputation error rates from the 50K to the high-density chip were computed for each breed with a validation set that included the 20% youngest animals. Marker genotypes were masked for animals in the validation population in order to mimic 50K genotypes. Imputation was carried out using the Beagle 3.3.0 software. Mean allele imputation error rates ranged from 0.31% to 2.41% depending on the breed. In total, 1980 SNPs had high imputation error rates in several breeds, which is probably due to genome assembly errors, and we recommend to discard these in future studies. Differences in imputation accuracy between breeds were related to the high-density-genotyped sample size and to the genetic relationship between reference and validation populations, whereas differences in effective population size and level of linkage disequilibrium showed limited effects. Accordingly, imputation accuracy was higher in breeds with large populations and in dairy breeds than in beef breeds. More than 99% of the alleles were correctly imputed if more than 300 animals were genotyped at high-density. 
    No improvement was observed when multi-breed imputation was performed. In all breeds, imputation accuracy was higher than 97%, which indicates that imputation to the high-density chip was accurate. Imputation accuracy depends mainly on the size of the reference population and the relationship between reference and target populations.
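
    The validation design described in this abstract (hold out the youngest 20% of animals, mask their high-density genotypes down to the 50K marker set, impute, then count wrongly imputed alleles) can be sketched as below. The helper names and data layout are hypothetical, and the imputation step itself, done with Beagle in the study, is left out.

```python
def youngest_fraction(birth_years, fraction=0.2):
    """IDs of the youngest `fraction` of animals (at least one animal).

    birth_years maps animal ID -> birth year.
    """
    ranked = sorted(birth_years, key=birth_years.get, reverse=True)
    n_validation = max(1, round(len(ranked) * fraction))
    return set(ranked[:n_validation])


def mask_to_low_density(genotype, low_density_markers):
    """Keep genotypes at low-density markers; set all others to None."""
    return {marker: (g if marker in low_density_markers else None)
            for marker, g in genotype.items()}


def allele_error_rate(true_geno, imputed_geno, masked_markers):
    """Share of wrongly imputed alleles at the masked markers.

    With genotypes coded 0/1/2, |true - imputed| is the number of wrong
    alleles at a marker, and each genotype carries two alleles.
    """
    wrong = sum(abs(true_geno[m] - imputed_geno[m]) for m in masked_markers)
    return wrong / (2 * len(masked_markers))
```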

  6. High-density marker imputation accuracy in sixteen French cattle breeds

    PubMed Central

    2013-01-01

    Background Genotyping with the medium-density Bovine SNP50 BeadChip® (50K) is now standard in cattle. The high-density BovineHD BeadChip®, which contains 777 609 single nucleotide polymorphisms (SNPs), was developed in 2010. Increasing marker density increases the level of linkage disequilibrium between quantitative trait loci (QTL) and SNPs and the accuracy of QTL localization and genomic selection. However, re-genotyping all animals with the high-density chip is not economically feasible. An alternative strategy is to genotype part of the animals with the high-density chip and to impute high-density genotypes for animals already genotyped with the 50K chip. Thus, it is necessary to investigate the error rate when imputing from the 50K to the high-density chip. Methods Five thousand one hundred and fifty three animals from 16 breeds (89 to 788 per breed) were genotyped with the high-density chip. Imputation error rates from the 50K to the high-density chip were computed for each breed with a validation set that included the 20% youngest animals. Marker genotypes were masked for animals in the validation population in order to mimic 50K genotypes. Imputation was carried out using the Beagle 3.3.0 software. Results Mean allele imputation error rates ranged from 0.31% to 2.41% depending on the breed. In total, 1980 SNPs had high imputation error rates in several breeds, which is probably due to genome assembly errors, and we recommend to discard these in future studies. Differences in imputation accuracy between breeds were related to the high-density-genotyped sample size and to the genetic relationship between reference and validation populations, whereas differences in effective population size and level of linkage disequilibrium showed limited effects. Accordingly, imputation accuracy was higher in breeds with large populations and in dairy breeds than in beef breeds. 
    More than 99% of the alleles were correctly imputed if more than 300 animals were genotyped at high-density. No improvement was observed when multi-breed imputation was performed. Conclusion In all breeds, imputation accuracy was higher than 97%, which indicates that imputation to the high-density chip was accurate. Imputation accuracy depends mainly on the size of the reference population and the relationship between reference and target populations. PMID:24004563

  7. Genotype imputation in the domestic dog

    PubMed Central

    Meurs, K. M.

    2016-01-01

    Application of imputation methods to accurately predict a dense array of SNP genotypes in the dog could provide an important supplement to current analyses of array-based genotyping data. Here, we developed a reference panel of 4,885,283 SNPs in 83 dogs across 15 breeds using whole genome sequencing. We used this panel to predict the genotypes of 268 dogs across three breeds with 84,193 SNP array-derived genotypes as inputs. We then (1) performed breed clustering of the actual and imputed data; (2) evaluated several reference panel breed combinations to determine an optimal reference panel composition; and (3) compared the accuracy of two commonly used software algorithms (Beagle and IMPUTE2). Breed clustering was well preserved in the imputation process across eigenvalues representing 75 % of the variation in the imputed data. Using Beagle with a target panel from a single breed, genotype concordance was highest using a multi-breed reference panel (92.4 %) compared to a breed-specific reference panel (87.0 %) or a reference panel containing no breeds overlapping with the target panel (74.9 %). This finding was confirmed using target panels derived from two other breeds. Additionally, using the multi-breed reference panel, genotype concordance was slightly higher with IMPUTE2 (94.1 %) compared to Beagle; Pearson correlation coefficients were slightly higher for both software packages (0.946 for Beagle, 0.961 for IMPUTE2). Our findings demonstrate that genotype imputation from SNP array-derived data to whole genome-level genotypes is both feasible and accurate in the dog with appropriate breed overlap between the target and reference panels. PMID:27129452

  8. Fast imputation using medium- or low-coverage sequence data

    USDA-ARS's Scientific Manuscript database

    Direct imputation from raw sequence reads can be more accurate than calling genotypes first and then imputing, especially if read depth is low or error rates high, but different imputation strategies are required than those used for data from genotyping chips. A fast algorithm to impute from lower t...

  9. Use of partial least squares regression to impute SNP genotypes in Italian cattle breeds.

    PubMed

    Dimauro, Corrado; Cellesi, Massimo; Gaspa, Giustino; Ajmone-Marsan, Paolo; Steri, Roberto; Marras, Gabriele; Macciotta, Nicolò P P

    2013-06-05

    The objective of the present study was to test the ability of the partial least squares regression technique to impute genotypes from low density single nucleotide polymorphisms (SNP) panels i.e. 3K or 7K to a high density panel with 50K SNP. No pedigree information was used. Data consisted of 2093 Holstein, 749 Brown Swiss and 479 Simmental bulls genotyped with the Illumina 50K Beadchip. First, a single-breed approach was applied by using only data from Holstein animals. Then, to enlarge the training population, data from the three breeds were combined and a multi-breed analysis was performed. Accuracies of genotypes imputed using the partial least squares regression method were compared with those obtained by using the Beagle software. The impact of genotype imputation on breeding value prediction was evaluated for milk yield, fat content and protein content. In the single-breed approach, the accuracy of imputation using partial least squares regression was around 90 and 94% for the 3K and 7K platforms, respectively; corresponding accuracies obtained with Beagle were around 85% and 90%. Moreover, computing time required by the partial least squares regression method was on average around 10 times lower than computing time required by Beagle. Using the partial least squares regression method in the multi-breed resulted in lower imputation accuracies than using single-breed data. The impact of the SNP-genotype imputation on the accuracy of direct genomic breeding values was small. The correlation between estimates of genetic merit obtained by using imputed versus actual genotypes was around 0.96 for the 7K chip. Results of the present work suggested that the partial least squares regression imputation method could be useful to impute SNP genotypes when pedigree information is not available.

  10. Next-generation genotype imputation service and methods.

    PubMed

    Das, Sayantan; Forer, Lukas; Schönherr, Sebastian; Sidore, Carlo; Locke, Adam E; Kwong, Alan; Vrieze, Scott I; Chew, Emily Y; Levy, Shawn; McGue, Matt; Schlessinger, David; Stambolian, Dwight; Loh, Po-Ru; Iacono, William G; Swaroop, Anand; Scott, Laura J; Cucca, Francesco; Kronenberg, Florian; Boehnke, Michael; Abecasis, Gonçalo R; Fuchsberger, Christian

    2016-10-01

    Genotype imputation is a key component of genetic association studies, where it increases power, facilitates meta-analysis, and aids interpretation of signals. Genotype imputation is computationally demanding and, with current tools, typically requires access to a high-performance computing cluster and to a reference panel of sequenced genomes. Here we describe improvements to imputation machinery that reduce computational requirements by more than an order of magnitude with no loss of accuracy in comparison to standard imputation tools. We also describe a new web-based service for imputation that facilitates access to new reference panels and greatly improves user experience and productivity.

  11. Genotype imputation in a tropical crossbred dairy cattle population.

    PubMed

    Oliveira Júnior, Gerson A; Chud, Tatiane C S; Ventura, Ricardo V; Garrick, Dorian J; Cole, John B; Munari, Danísio P; Ferraz, José B S; Mullart, Erik; DeNise, Sue; Smith, Shannon; da Silva, Marcos Vinícius G B

    2017-12-01

    The objective of this study was to investigate different strategies for genotype imputation in a population of crossbred Girolando (Gyr × Holstein) dairy cattle. The data set consisted of 478 Girolando, 583 Gyr, and 1,198 Holstein sires genotyped at high density with the Illumina BovineHD (Illumina, San Diego, CA) panel, which includes ∼777K markers. The accuracy of imputation from low (20K) and medium densities (50K and 70K) to the HD panel density and from low to 50K density was investigated. Seven scenarios using different reference populations (RPop) considering Girolando, Gyr, and Holstein breeds separately or combinations of animals of these breeds were tested for imputing genotypes of 166 randomly chosen Girolando animals. Population genotype imputation was performed using FImpute. Imputation accuracy was measured as the correlation between observed and imputed genotypes (CORR) and also as the proportion of genotypes that were imputed correctly (CR). This is the first paper on imputation accuracy in a Girolando population. The sample-specific imputation accuracies ranged from 0.38 to 0.97 (CORR) and from 0.49 to 0.96 (CR) when imputing from low and medium densities to HD, and from 0.41 to 0.95 (CORR) and from 0.50 to 0.94 (CR) for imputation from 20K to 50K. The CORR_anim exceeded 0.96 (for 50K and 70K panels) when only Girolando animals were included in RPop (S1). We found smaller CORR_anim when Gyr (S2) was used instead of Holstein (S3) as RPop. The same behavior was observed between S4 (Gyr + Girolando) and S5 (Holstein + Girolando) because the target animals were more related to the Holstein population than to the Gyr population. The highest imputation accuracies were observed for scenarios including Girolando animals in the reference population, whereas using only Gyr animals resulted in low imputation accuracies, suggesting that the haplotypes segregating in the Girolando population had a greater effect on accuracy than the purebred haplotypes. All chromosomes had similar imputation accuracies (CORR_snp) within each scenario. Crossbred animals (Girolando) must be included in the reference population to provide the best imputation accuracies. Copyright © 2017 American Dairy Science Association. Published by Elsevier Inc. All rights reserved.

  12. Genotype Imputation for Latinos Using the HapMap and 1000 Genomes Project Reference Panels.

    PubMed

    Gao, Xiaoyi; Haritunians, Talin; Marjoram, Paul; McKean-Cowdin, Roberta; Torres, Mina; Taylor, Kent D; Rotter, Jerome I; Gauderman, William J; Varma, Rohit

    2012-01-01

    Genotype imputation is a vital tool in genome-wide association studies (GWAS) and meta-analyses of multiple GWAS results. Imputation enables researchers to increase genomic coverage and to pool data generated using different genotyping platforms. HapMap samples are often employed as the reference panel. More recently, the 1000 Genomes Project resource is becoming the primary source for reference panels. Multiple GWAS and meta-analyses are targeting Latinos, the most populous, and fastest growing minority group in the US. However, genotype imputation resources for Latinos are rather limited compared to individuals of European ancestry at present, largely because of the lack of good reference data. One choice of reference panel for Latinos is one derived from the population of Mexican individuals in Los Angeles contained in the HapMap Phase 3 project and the 1000 Genomes Project. However, a detailed evaluation of the quality of the imputed genotypes derived from the public reference panels has not yet been reported. Using simulation studies, the Illumina OmniExpress GWAS data from the Los Angeles Latino Eye Study and the MACH software package, we evaluated the accuracy of genotype imputation in Latinos. Our results show that the 1000 Genomes Project AMR + CEU + YRI reference panel provides the highest imputation accuracy for Latinos, and that also including Asian samples in the panel can reduce imputation accuracy. We also provide the imputation accuracy for each autosomal chromosome using the 1000 Genomes Project panel for Latinos. Our results serve as a guide to future imputation based analysis in Latinos.

  13. Significant variation between SNP-based HLA imputations in diverse populations: the last mile is the hardest.

    PubMed

    Pappas, D J; Lizee, A; Paunic, V; Beutner, K R; Motyer, A; Vukcevic, D; Leslie, S; Biesiada, J; Meller, J; Taylor, K D; Zheng, X; Zhao, L P; Gourraud, P-A; Hollenbach, J A; Mack, S J; Maiers, M

    2018-05-22

    Four single nucleotide polymorphism (SNP)-based human leukocyte antigen (HLA) imputation methods (e-HLA, HIBAG, HLA*IMP:02 and MAGPrediction) were trained using 1000 Genomes SNP and HLA genotypes and assessed for their ability to accurately impute molecular HLA-A, -B, -C and -DRB1 genotypes in the Human Genome Diversity Project cell panel. Imputation concordance was high (>89%) across all methods for both HLA-A and HLA-C, but HLA-B and HLA-DRB1 proved generally difficult to impute. Overall, <27.8% of subjects were correctly imputed for all HLA loci by any method. Concordance across all loci was not enhanced via the application of confidence thresholds; reliance on confidence scores across methods only led to noticeable improvement (+3.2%) for HLA-DRB1. As the HLA complex is highly relevant to the study of human health and disease, a standardized assessment of SNP-based HLA imputation methods is crucial for advancing genomic research. Considerable room remains for the improvement of HLA-B and especially HLA-DRB1 imputation methods, and no imputation method is as accurate as molecular genotyping. The application of large, ancestrally diverse HLA and SNP reference data sets and multiple imputation methods has the potential to make SNP-based HLA imputation methods a tractable option for determining HLA genotypes.

  14. Performance of genotype imputation for low frequency and rare variants from the 1000 genomes.

    PubMed

    Zheng, Hou-Feng; Rong, Jing-Jing; Liu, Ming; Han, Fang; Zhang, Xing-Wei; Richards, J Brent; Wang, Li

    2015-01-01

    Genotype imputation is now routinely applied in genome-wide association studies (GWAS) and meta-analyses. However, most imputations have been run using HapMap samples as reference, and imputation of low frequency and rare variants (minor allele frequency (MAF) < 5%) has not been systematically assessed. With the emergence of next-generation sequencing, large reference panels (such as the 1000 Genomes panel) are available to facilitate imputation of these variants. Therefore, in order to estimate the performance of low frequency and rare variants imputation, we imputed 153 individuals, each of whom had 3 different genotype array data including 317k, 610k and 1 million SNPs, to three different reference panels: the 1000 Genomes pilot March 2010 release (1KGpilot), the 1000 Genomes interim August 2010 release (1KGinterim), and the 1000 Genomes phase1 November 2010 and May 2011 release (1KGphase1) by using IMPUTE version 2. The differences between these three releases of the 1000 Genomes data are the sample size, ancestry diversity, number of variants and their frequency spectrum. We found that both reference panel and GWAS chip density affect the imputation of low frequency and rare variants. 1KGphase1 outperformed the other 2 panels, with a higher concordance rate, a higher proportion of well-imputed variants (info>0.4) and a higher mean info score in each MAF bin. Similarly, the 1M chip array outperformed 610K and 317K. However, for very rare variants (MAF ≤ 0.3%), only 0-1% of the variants were well imputed. We conclude that the imputation of low frequency and rare variants improves with larger reference panels and higher density of genome-wide genotyping arrays. Yet, despite a large reference panel size and dense genotyping density, very rare variants remain difficult to impute.

  15. Bias Characterization in Probabilistic Genotype Data and Improved Signal Detection with Multiple Imputation

    PubMed Central

    Palmer, Cameron; Pe’er, Itsik

    2016-01-01

    Missing data are an unavoidable component of modern statistical genetics. Different array or sequencing technologies cover different single nucleotide polymorphisms (SNPs), leading to a complicated mosaic pattern of missingness where both individual genotypes and entire SNPs are sporadically absent. Such missing data patterns cannot be ignored without introducing bias, yet cannot be inferred exclusively from nonmissing data. In genome-wide association studies, the accepted solution to missingness is to impute missing data using external reference haplotypes. The resulting probabilistic genotypes may be analyzed in the place of genotype calls. A general-purpose paradigm, called Multiple Imputation (MI), is known to model uncertainty in many contexts, yet it is not widely used in association studies. Here, we undertake a systematic evaluation of existing imputed data analysis methods and MI. We characterize biases related to uncertainty in association studies, and find that bias is introduced both at the imputation level, when imputation algorithms generate inconsistent genotype probabilities, and at the association level, when analysis methods inadequately model genotype uncertainty. We find that MI performs at least as well as existing methods or in some cases much better, and provides a straightforward paradigm for adapting existing genotype association methods to uncertain data. PMID:27310603
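
    The general Multiple Imputation paradigm mentioned in this abstract can be sketched in two steps: draw several complete genotype data sets from the imputed genotype probabilities, analyse each one, then pool the per-data-set results with Rubin's rules. The sketch below is a generic illustration under those assumptions, not the authors' procedure; the function names are hypothetical and the association analysis itself is left to the caller.

```python
import random
from statistics import mean, variance


def draw_genotypes(probabilities, rng):
    """Draw one hard genotype (0, 1 or 2) per individual.

    probabilities is a list of (P(0), P(1), P(2)) triples, one per individual.
    """
    return [rng.choices([0, 1, 2], weights=p)[0] for p in probabilities]


def pool_rubin(estimates, variances):
    """Combine per-imputation estimates and variances with Rubin's rules.

    Returns (pooled estimate, total variance), where the total variance is
    the mean within-imputation variance plus (1 + 1/m) times the
    between-imputation variance.
    """
    m = len(estimates)
    pooled = mean(estimates)
    within = mean(variances)
    between = variance(estimates) if m > 1 else 0.0
    total = within + (1 + 1 / m) * between
    return pooled, total
```

    In use, each of the m drawn data sets would be passed to an ordinary association test, and the m effect estimates with their squared standard errors would go to `pool_rubin`. The between-imputation term is what carries genotype uncertainty into the final standard error.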

  16. A spatial haplotype copying model with applications to genotype imputation.

    PubMed

    Yang, Wen-Yun; Hormozdiari, Farhad; Eskin, Eleazar; Pasaniuc, Bogdan

    2015-05-01

    Ever since its introduction, the haplotype copy model has proven to be one of the most successful approaches for modeling genetic variation in human populations, with applications ranging from ancestry inference to genotype phasing and imputation. Motivated by coalescent theory, this approach assumes that any chromosome (haplotype) can be modeled as a mosaic of segments copied from a set of chromosomes sampled from the same population. At the core of the model is the assumption that any chromosome from the sample is equally likely to contribute a priori to the copying process. Motivated by recent works that model genetic variation in a geographic continuum, we propose a new spatial-aware haplotype copy model that jointly models geography and the haplotype copying process. We extend hidden Markov models of haplotype diversity such that at any given location, haplotypes that are closest in the genetic-geographic continuum map are a priori more likely to contribute to the copying process than distant ones. Through simulations starting from the 1000 Genomes data, we show that our model achieves superior accuracy in genotype imputation over the standard spatial-unaware haplotype copy model. In addition, we show the utility of our model in selecting a small personalized reference panel for imputation that leads to both improved accuracy as well as to a lower computational runtime than the standard approach. Finally, we show our proposed model can be used to localize individuals on the genetic-geographical map on the basis of their genotype data.
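
    The copying process described above can be sketched as a small hidden Markov model in which the hidden state is the reference haplotype being copied. In the standard model the a priori copying weights are uniform; the spatial-aware variant makes them non-uniform. The code below is a simplified haploid sketch, not the authors' model: the parameter names (`switch`, `err`), their default values, and the way `prior` enters both the initial and the transition probabilities are assumptions made for illustration.

```python
def copy_posteriors(target, panel, prior, switch=0.05, err=0.01):
    """Forward-backward pass over a haplotype-copying HMM.

    target: list of alleles 0/1, with None at untyped sites
    panel:  list of reference haplotypes (lists of 0/1, same length as target)
    prior:  a priori copying weight of each reference haplotype (sums to 1)
    Returns, for each site, the posterior probability of copying each haplotype.
    No rescaling is done, so this is only suitable for short toy sequences.
    """
    n_hap, n_site = len(panel), len(target)

    def emit(k, j):
        if target[j] is None:
            return 1.0  # untyped site carries no information
        return 1.0 - err if panel[k][j] == target[j] else err

    fwd = [[prior[k] * emit(k, 0) for k in range(n_hap)]]
    for j in range(1, n_site):
        total = sum(fwd[-1])
        fwd.append([
            ((1 - switch) * fwd[-1][k] + switch * prior[k] * total) * emit(k, j)
            for k in range(n_hap)
        ])

    bwd = [[1.0] * n_hap for _ in range(n_site)]
    for j in range(n_site - 2, -1, -1):
        jump = sum(prior[h] * emit(h, j + 1) * bwd[j + 1][h] for h in range(n_hap))
        bwd[j] = [
            (1 - switch) * emit(k, j + 1) * bwd[j + 1][k] + switch * jump
            for k in range(n_hap)
        ]

    posteriors = []
    for j in range(n_site):
        weights = [fwd[j][k] * bwd[j][k] for k in range(n_hap)]
        norm = sum(weights)
        posteriors.append([w / norm for w in weights])
    return posteriors


def impute_allele_probs(target, panel, prior, **kwargs):
    """Probability of allele 1 at each site, summed over copied haplotypes."""
    post = copy_posteriors(target, panel, prior, **kwargs)
    return [
        sum(p for p, hap in zip(post[j], panel) if hap[j] == 1)
        for j in range(len(target))
    ]


# Toy panel of two haplotypes; the middle site of the target is untyped.
toy_panel = [[0, 0, 0], [1, 1, 1]]
uniform = impute_allele_probs([1, None, 1], toy_panel, [0.5, 0.5])
```

    Replacing the uniform `prior` with weights that favour haplotypes nearer to the target on a genetic-geographic map is the kind of change the abstract describes; how those weights are actually derived is not specified here.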

  17. A combined reference panel from the 1000 Genomes and UK10K projects improved rare variant imputation in European and Chinese samples

    PubMed Central

    Chou, Wen-Chi; Zheng, Hou-Feng; Cheng, Chia-Ho; Yan, Han; Wang, Li; Han, Fang; Richards, J. Brent; Karasik, David; Kiel, Douglas P.; Hsu, Yi-Hsiang

    2016-01-01

    Imputation using the 1000 Genomes haplotype reference panel has been widely adopted to estimate genotypes in genome-wide association studies. To evaluate imputation quality with a relatively larger reference panel and a reference panel composed of different ethnic populations, we conducted imputations in the Framingham Heart Study and the North Chinese Study using a combined reference panel from the 1000 Genomes (N = 1,092) and UK10K (N = 3,781) projects. For rare variants with 0.01% < MAF ≤ 0.5%, imputation in the Framingham Heart Study with the combined reference panel increased well-imputed genotypes (with imputation quality score ≥0.4) from 62.9% to 76.1% when compared to imputation with the 1000 Genomes. For the North Chinese samples, imputation of rare variants with 0.01% < MAF ≤ 0.5% with the combined reference panel increased well-imputed genotypes from 49.8% to 61.8%. The predominant European ancestry of the UK10K and the combined reference panels may explain why there was less of an increase in imputation success in the North Chinese samples. Our results underscore the importance and potential of larger reference panels to impute rare variants, while recognizing that increasing ethnic specific variants in reference panels may result in better imputation for genotypes in some ethnic groups. PMID:28004816

  18. DISTMIX: direct imputation of summary statistics for unmeasured SNPs from mixed ethnicity cohorts.

    PubMed

    Lee, Donghyung; Bigdeli, T Bernard; Williamson, Vernell S; Vladimirov, Vladimir I; Riley, Brien P; Fanous, Ayman H; Bacanu, Silviu-Alin

    2015-10-01

    To increase the signal resolution for large-scale meta-analyses of genome-wide association studies, genotypes at unmeasured single nucleotide polymorphisms (SNPs) are commonly imputed using large multi-ethnic reference panels. However, the ever increasing size and ethnic diversity of both reference panels and cohorts make genotype imputation computationally challenging for moderately sized computer clusters. Moreover, genotype imputation requires subject-level genetic data, which, unlike the summary statistics provided by virtually all studies, are not publicly available. While there are much less demanding methods that avoid the genotype imputation step by directly imputing SNP statistics, e.g. Directly Imputing summary STatistics (DIST) proposed by our group, their implicit assumptions make them applicable only to ethnically homogeneous cohorts. To decrease computational and access requirements for the analysis of cosmopolitan cohorts, we propose DISTMIX, which extends DIST capabilities to the analysis of mixed ethnicity cohorts. The method uses a relevant reference panel to directly impute unmeasured SNP statistics based only on statistics at measured SNPs and estimated/user-specified ethnic proportions. Simulations show that the proposed method adequately controls the Type I error rates. The 1000 Genomes panel imputation of summary statistics from the ethnically diverse Psychiatric Genetic Consortium Schizophrenia Phase 2 suggests that, when compared to genotype imputation methods, DISTMIX offers comparable imputation accuracy for only a fraction of computational resources. DISTMIX software, its reference population data, and usage examples are publicly available at http://code.google.com/p/distmix. Contact: dlee4@vcu.edu. Supplementary data are available at Bioinformatics online. © The Author 2015. Published by Oxford University Press.
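
    The idea of imputing a summary statistic directly can be illustrated with the usual conditional-mean formula for correlated z-scores, z_u ≈ Σ_ut Σ_tt⁻¹ z_t, where Σ is the SNP correlation (LD) matrix. The sketch below is a simplified illustration and not the DISTMIX implementation: it assumes that cohort LD can be approximated by a proportion-weighted average of per-population correlation matrices, and the helper names (`mix_ld`, `impute_z`, `solve`) and all numbers are hypothetical.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        tail = sum(m[i][c] * x[c] for c in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def mix_ld(ld_by_pop, weights):
    """Proportion-weighted average of per-population correlation matrices (an assumption)."""
    size = len(next(iter(ld_by_pop.values())))
    return [
        [sum(weights[p] * ld_by_pop[p][i][j] for p in weights) for j in range(size)]
        for i in range(size)
    ]


def impute_z(ld, measured, z_measured, target):
    """Conditional mean of the z-score at an unmeasured SNP given measured z-scores."""
    s_tt = [[ld[i][j] for j in measured] for i in measured]
    s_tu = [ld[i][target] for i in measured]
    coeffs = solve(s_tt, s_tu)  # equals Sigma_tt^{-1} Sigma_tu
    return sum(c * z for c, z in zip(coeffs, z_measured))


# Made-up LD for two populations over three SNPs; SNP 2 is unmeasured.
ld_pops = {
    "pop_a": [[1.0, 0.0, 0.8], [0.0, 1.0, 0.5], [0.8, 0.5, 1.0]],
    "pop_b": [[1.0, 0.0, 0.2], [0.0, 1.0, 0.5], [0.2, 0.5, 1.0]],
}
cohort_ld = mix_ld(ld_pops, {"pop_a": 0.5, "pop_b": 0.5})
z_imputed = impute_z(cohort_ld, measured=[0, 1], z_measured=[2.0, 4.0], target=2)
```

    Only summary statistics and a reference LD estimate are needed, which is why this style of method avoids subject-level data.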

  19. Genotype Imputation with Millions of Reference Samples

    PubMed Central

    Browning, Brian L.; Browning, Sharon R.

    2016-01-01

    We present a genotype imputation method that scales to millions of reference samples. The imputation method, based on the Li and Stephens model and implemented in Beagle v.4.1, is parallelized and memory efficient, making it well suited to multi-core computer processors. It achieves fast, accurate, and memory-efficient genotype imputation by restricting the probability model to markers that are genotyped in the target samples and by performing linear interpolation to impute ungenotyped variants. We compare Beagle v.4.1 with Impute2 and Minimac3 by using 1000 Genomes Project data, UK10K Project data, and simulated data. All three methods have similar accuracy but different memory requirements and different computation times. When imputing 10 Mb of sequence data from 50,000 reference samples, Beagle’s throughput was more than 100× greater than Impute2’s throughput on our computer servers. When imputing 10 Mb of sequence data from 200,000 reference samples in VCF format, Minimac3 consumed 26× more memory per computational thread and 15× more CPU time than Beagle. We demonstrate that Beagle v.4.1 scales to much larger reference panels by performing imputation from a simulated reference panel having 5 million samples and a mean marker density of one marker per four base pairs. PMID:26748515
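
    The interpolation step described above can be sketched as follows: state probabilities are computed only at markers genotyped in the target, and the probabilities at an ungenotyped marker lying between two genotyped markers are obtained by linear interpolation. This is an illustrative sketch rather than Beagle's code; the function names, the use of a generic position coordinate, and the example values are assumptions.

```python
def interpolate_state_probs(pos, left_pos, left_probs, right_pos, right_probs):
    """Linearly interpolate HMM state probabilities at an ungenotyped marker.

    pos lies between left_pos and right_pos, the flanking genotyped markers.
    left_probs and right_probs hold one probability per reference haplotype.
    """
    if not left_pos <= pos <= right_pos or left_pos == right_pos:
        raise ValueError("pos must lie between two distinct flanking markers")
    w = (pos - left_pos) / (right_pos - left_pos)
    return [(1 - w) * a + w * b for a, b in zip(left_probs, right_probs)]


def alt_allele_prob(state_probs, ref_alleles):
    """Sum the probability of states whose reference haplotype carries allele 1."""
    return sum(p for p, allele in zip(state_probs, ref_alleles) if allele == 1)


# Two reference haplotypes; the ungenotyped marker sits midway between the flanks.
probs_mid = interpolate_state_probs(150, 100, [0.8, 0.2], 200, [0.4, 0.6])
p_alt = alt_allele_prob(probs_mid, ref_alleles=[0, 1])
```

    Because the model is evaluated only at genotyped markers, the cost of the probabilistic step does not grow with the number of ungenotyped variants in the reference panel.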

  20. Assessing and comparison of different machine learning methods in parent-offspring trios for genotype imputation.

    PubMed

    Mikhchi, Abbas; Honarvar, Mahmood; Kashan, Nasser Emam Jomeh; Aminafshar, Mehdi

    2016-06-21

    Genotype imputation is an important tool for predicting unknown genotypes in both unrelated individuals and parent-offspring trios. Several imputation methods are available; they either employ general-purpose machine learning methods or deploy algorithms dedicated to inferring missing genotypes. In this research, the performance of eight machine learning methods (Support Vector Machine, K-Nearest Neighbors, Extreme Learning Machine (ELM), Radial Basis Function (RBF), Random Forest, AdaBoost, LogitBoost, and TotalBoost) was compared in terms of imputation accuracy, computation time and the factors affecting imputation accuracy. The methods were applied to real and simulated datasets to impute the un-typed SNPs in parent-offspring trios. The results show that imputation of parent-offspring trios can be accurate. Random Forest and Support Vector Machine were more accurate than the other machine learning methods. TotalBoost performed slightly worse than the other methods. Running times differed between methods. ELM was always the fastest algorithm. As the sample size increased, RBF required a long imputation time. The tested methods can be an alternative for imputation of un-typed SNPs in data with a low missing rate. However, it is recommended that other machine learning methods be used for imputation. Copyright © 2016 Elsevier Ltd. All rights reserved.

  1. An imputed genotype resource for the laboratory mouse

    PubMed Central

    Szatkiewicz, Jin P.; Beane, Glen L.; Ding, Yueming; Hutchins, Lucie; de Villena, Fernando Pardo-Manuel; Churchill, Gary A.

    2009-01-01

    We have created a high-density SNP resource encompassing 7.87 million polymorphic loci across 49 inbred mouse strains of the laboratory mouse by combining data available from public databases and training a hidden Markov model to impute missing genotypes in the combined data. The strong linkage disequilibrium found in dense sets of SNP markers in the laboratory mouse provides the basis for accurate imputation. Using genotypes from eight independent SNP resources, we empirically validated the quality of the imputed genotypes and demonstrate that they are highly reliable for most inbred strains. The imputed SNP resource will be useful for studies of natural variation and complex traits. It will facilitate association study designs by providing high density SNP genotypes for large numbers of mouse strains. We anticipate that this resource will continue to evolve as new genotype data become available for laboratory mouse strains. The data are available for bulk download or query at http://cgd.jax.org/. PMID:18301946

  2. The effect of genome-wide association scan quality control on imputation outcome for common variants.

    PubMed

    Southam, Lorraine; Panoutsopoulou, Kalliope; Rayner, N William; Chapman, Kay; Durrant, Caroline; Ferreira, Teresa; Arden, Nigel; Carr, Andrew; Deloukas, Panos; Doherty, Michael; Loughlin, John; McCaskie, Andrew; Ollier, William E R; Ralston, Stuart; Spector, Timothy D; Valdes, Ana M; Wallis, Gillian A; Wilkinson, J Mark; Marchini, Jonathan; Zeggini, Eleftheria

    2011-05-01

    Imputation is an extremely valuable tool in conducting and synthesising genome-wide association studies (GWASs). Directly typed SNP quality control (QC) is thought to affect imputation quality. It is, therefore, common practise to use quality-controlled (QCed) data as an input for imputing genotypes. This study aims to determine the effect of commonly applied QC steps on imputation outcomes. We performed several iterations of imputing SNPs across chromosome 22 in a dataset consisting of 3177 samples with Illumina 610 k (Illumina, San Diego, CA, USA) GWAS data, applying different QC steps each time. The imputed genotypes were compared with the directly typed genotypes. In addition, we investigated the correlation between alternatively QCed data. We also applied a series of post-imputation QC steps balancing elimination of poorly imputed SNPs and information loss. We found that the difference between the unQCed data and the fully QCed data on imputation outcome was minimal. Our study shows that imputation of common variants is generally very accurate and robust to GWAS QC, which is not a major factor affecting imputation outcome. A minority of common-frequency SNPs with particular properties cannot be accurately imputed regardless of QC stringency. These findings may not generalise to the imputation of low frequency and rare variants.

  3. Genotype imputation in a coalescent model with infinitely-many-sites mutation

    PubMed Central

    Huang, Lucy; Buzbas, Erkan O.; Rosenberg, Noah A.

    2012-01-01

    Empirical studies have identified population-genetic factors as important determinants of the properties of genotype-imputation accuracy in imputation-based disease association studies. Here, we develop a simple coalescent model of three sequences that we use to explore the theoretical basis for the influence of these factors on genotype-imputation accuracy, under the assumption of infinitely-many-sites mutation. Employing a demographic model in which two populations diverged at a given time in the past, we derive the approximate expectation and variance of imputation accuracy in a study sequence sampled from one of the two populations, choosing between two reference sequences, one sampled from the same population as the study sequence and the other sampled from the other population. We show that under this model, imputation accuracy—as measured by the proportion of polymorphic sites that are imputed correctly in the study sequence—increases in expectation with the mutation rate, the proportion of the markers in a chromosomal region that are genotyped, and the time to divergence between the study and reference populations. Each of these effects derives largely from an increase in information available for determining the reference sequence that is genetically most similar to the sequence targeted for imputation. We analyze as a function of divergence time the expected gain in imputation accuracy in the target using a reference sequence from the same population as the target rather than from the other population. Together with a growing body of empirical investigations of genotype imputation in diverse human populations, our modeling framework lays a foundation for extending imputation techniques to novel populations that have not yet been extensively examined. PMID:23079542

  4. A Flexible and Accurate Genotype Imputation Method for the Next Generation of Genome-Wide Association Studies

    PubMed Central

    Howie, Bryan N.; Donnelly, Peter; Marchini, Jonathan

    2009-01-01

    Genotype imputation methods are now being widely used in the analysis of genome-wide association studies. Most imputation analyses to date have used the HapMap as a reference dataset, but new reference panels (such as controls genotyped on multiple SNP chips and densely typed samples from the 1,000 Genomes Project) will soon allow a broader range of SNPs to be imputed with higher accuracy, thereby increasing power. We describe a genotype imputation method (IMPUTE version 2) that is designed to address the challenges presented by these new datasets. The main innovation of our approach is a flexible modelling framework that increases accuracy and combines information across multiple reference panels while remaining computationally feasible. We find that IMPUTE v2 attains higher accuracy than other methods when the HapMap provides the sole reference panel, but that the size of the panel constrains the improvements that can be made. We also find that imputation accuracy can be greatly enhanced by expanding the reference panel to contain thousands of chromosomes and that IMPUTE v2 outperforms other methods in this setting at both rare and common SNPs, with overall error rates that are 15%–20% lower than those of the closest competing method. One particularly challenging aspect of next-generation association studies is to integrate information across multiple reference panels genotyped on different sets of SNPs; we show that our approach to this problem has practical advantages over other suggested solutions. PMID:19543373

  5. Genotype Imputation with Millions of Reference Samples.

    PubMed

    Browning, Brian L; Browning, Sharon R

    2016-01-07

    We present a genotype imputation method that scales to millions of reference samples. The imputation method, based on the Li and Stephens model and implemented in Beagle v.4.1, is parallelized and memory efficient, making it well suited to multi-core computer processors. It achieves fast, accurate, and memory-efficient genotype imputation by restricting the probability model to markers that are genotyped in the target samples and by performing linear interpolation to impute ungenotyped variants. We compare Beagle v.4.1 with Impute2 and Minimac3 by using 1000 Genomes Project data, UK10K Project data, and simulated data. All three methods have similar accuracy but different memory requirements and different computation times. When imputing 10 Mb of sequence data from 50,000 reference samples, Beagle's throughput was more than 100× greater than Impute2's throughput on our computer servers. When imputing 10 Mb of sequence data from 200,000 reference samples in VCF format, Minimac3 consumed 26× more memory per computational thread and 15× more CPU time than Beagle. We demonstrate that Beagle v.4.1 scales to much larger reference panels by performing imputation from a simulated reference panel having 5 million samples and a mean marker density of one marker per four base pairs. Copyright © 2016 The American Society of Human Genetics. Published by Elsevier Inc. All rights reserved.

  6. Genotype imputation from various low-density SNP panels and its impact on accuracy of genomic breeding values in pigs.

    PubMed

    Grossi, D A; Brito, L F; Jafarikia, M; Schenkel, F S; Feng, Z

    2018-04-30

    The uptake of genomic selection (GS) by the swine industry is still limited by the costs of genotyping. A feasible alternative to overcome this challenge is to genotype animals using an affordable low-density (LD) single nucleotide polymorphism (SNP) chip panel followed by accurate imputation to a high-density panel. Therefore, the main objective of this study was to screen incremental densities of LD panels in order to systematically identify one that balances the tradeoffs among imputation accuracy, prediction accuracy of genomic estimated breeding values (GEBVs), and genotype density (directly associated with genotyping costs). Genotypes using the Illumina Porcine60K BeadChip were available for 1378 Duroc (DU), 2361 Landrace (LA) and 3192 Yorkshire (YO) pigs. In addition, pseudo-phenotypes (de-regressed estimated breeding values) for five economically important traits were provided for the analysis. The reference population for genotype imputation consisted of 931 DU, 1631 LA and 2103 YO animals, and the remaining individuals were included in the validation population of each breed. An LD panel of 3000 evenly spaced SNPs (LD3K) yielded high imputation accuracy rates: 93.78% (DU), 97.07% (LA) and 97.00% (YO) and high correlations (>0.97) between the predicted GEBVs using the actual 60 K SNP genotypes and the imputed 60 K SNP genotypes for all traits and breeds. The imputation accuracy was influenced by the reference population size as well as the amount of parental genotype information available in the reference population. However, parental genotype information became less important when the LD panel had at least 3000 SNPs. The correlation of the GEBVs directly increased with an increase in imputation accuracy. When genotype information for both parents was available, a panel of 300 SNPs (imputed to 60 K) yielded GEBV predictions highly correlated (⩾0.90) with genomic predictions obtained based on the true 60 K panel, for all traits and breeds. For a small reference population with no parents in the reference population, a panel at least as dense as the LD3K is recommended; when both parents are in the reference population, a panel as small as the LD300 might be a feasible option. These findings are of great importance for the development of LD panels for swine in order to reduce genotyping costs, increase the uptake of GS and, therefore, optimize the profitability of the swine industry.

  7. The use of imputed sibling genotypes in sibship-based association analysis: on modeling alternatives, power and model misspecification.

    PubMed

    Minică, Camelia C; Dolan, Conor V; Hottenga, Jouke-Jan; Willemsen, Gonneke; Vink, Jacqueline M; Boomsma, Dorret I

    2013-05-01

    When phenotypic, but no genotypic data are available for relatives of participants in genetic association studies, previous research has shown that family-based imputed genotypes can boost the statistical power when included in such studies. Here, using simulations, we compared the performance of two statistical approaches suitable to model imputed genotype data: the mixture approach, which involves the full distribution of the imputed genotypes, and the dosage approach, where the mean of the conditional distribution features as the imputed genotype. Simulations were run by varying sibship size, size of the phenotypic correlations among siblings, imputation accuracy and minor allele frequency of the causal SNP. Furthermore, as imputing sibling data and extending the model to include sibships of size two or greater requires modeling the familial covariance matrix, we inquired whether model misspecification affects power. Finally, the results obtained via simulations were empirically verified in two datasets with continuous phenotype data (height) and with a dichotomous phenotype (smoking initiation). Across the settings considered, the mixture and the dosage approach are equally powerful and both produce unbiased parameter estimates. In addition, the likelihood-ratio test in the linear mixed model appears to be robust to the considered misspecification in the background covariance structure, given low to moderate phenotypic correlations among siblings. Empirical results show that the inclusion in association analysis of imputed sibling genotypes does not always result in a larger test statistic. The actual test statistic may drop in value due to small effect sizes. That is, if the power benefit is small, so that the change in distribution of the test statistic under the alternative is relatively small, the probability of obtaining a smaller test statistic is greater. As the genetic effects are typically hypothesized to be small, in practice, the decision on whether family-based imputation could be used as a means to increase power should be informed by prior power calculations and by the consideration of the background correlation.
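
    The difference between the two approaches compared above can be shown for a single individual with a continuous phenotype. The sketch below is a deliberate simplification: it treats individuals as independent and ignores the familial covariance structure that the study models, and the function names and parameter values are assumptions made for illustration.

```python
import math


def dosage(probs):
    """Expected genotype (0 to 2) from probabilities of carrying 0, 1 or 2 copies."""
    return probs[1] + 2 * probs[2]


def normal_pdf(y, mean, sd):
    return math.exp(-0.5 * ((y - mean) / sd) ** 2) / (sd * math.sqrt(2 * math.pi))


def dosage_loglik(y, probs, intercept, beta, sd):
    """Dosage approach: plug the expected genotype into the regression mean."""
    return math.log(normal_pdf(y, intercept + beta * dosage(probs), sd))


def mixture_loglik(y, probs, intercept, beta, sd):
    """Mixture approach: average the likelihood over the genotype distribution."""
    return math.log(
        sum(p * normal_pdf(y, intercept + beta * g, sd) for g, p in enumerate(probs))
    )


# Made-up imputed genotype probabilities and model parameters.
uncertain = [0.2, 0.5, 0.3]
ll_dosage = dosage_loglik(1.0, uncertain, intercept=0.0, beta=0.5, sd=1.0)
ll_mixture = mixture_loglik(1.0, uncertain, intercept=0.0, beta=0.5, sd=1.0)
```

    When the genotype is known with certainty the two likelihoods coincide; they differ only to the extent that the imputed distribution is spread over several genotypes.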

  8. GeneImp: Fast Imputation to Large Reference Panels Using Genotype Likelihoods from Ultralow Coverage Sequencing

    PubMed Central

    Spiliopoulou, Athina; Colombo, Marco; Orchard, Peter; Agakov, Felix; McKeigue, Paul

    2017-01-01

    We address the task of genotype imputation to a dense reference panel given genotype likelihoods computed from ultralow coverage sequencing as inputs. In this setting, the data have a high-level of missingness or uncertainty, and are thus more amenable to a probabilistic representation. Most existing imputation algorithms are not well suited for this situation, as they rely on prephasing for computational efficiency, and, without definite genotype calls, the prephasing task becomes computationally expensive. We describe GeneImp, a program for genotype imputation that does not require prephasing and is computationally tractable for whole-genome imputation. GeneImp does not explicitly model recombination, instead it capitalizes on the existence of large reference panels—comprising thousands of reference haplotypes—and assumes that the reference haplotypes can adequately represent the target haplotypes over short regions unaltered. We validate GeneImp based on data from ultralow coverage sequencing (0.5×), and compare its performance to the most recent version of BEAGLE that can perform this task. We show that GeneImp achieves imputation quality very close to that of BEAGLE, using one to two orders of magnitude less time, without an increase in memory complexity. Therefore, GeneImp is the first practical choice for whole-genome imputation to a dense reference panel when prephasing cannot be applied, for instance, in datasets produced via ultralow coverage sequencing. A related future application for GeneImp is whole-genome imputation based on the off-target reads from deep whole-exome sequencing. PMID:28348060

  9. A comprehensive SNP and indel imputability database.

    PubMed

    Duan, Qing; Liu, Eric Yi; Croteau-Chonka, Damien C; Mohlke, Karen L; Li, Yun

    2013-02-15

    Genotype imputation has become an indispensable step in genome-wide association studies (GWAS). Imputation accuracy, which directly influences downstream analysis, has been shown to improve with re-sequencing-based reference panels; however, this comes at the cost of a high computational burden due to the huge number of potentially imputable markers (tens of millions) discovered through sequencing a large number of individuals. Therefore, there is an increasing need for access to imputation quality information without actually conducting imputation. To facilitate this process, we have established a publicly available SNP and indel imputability database, aiming to provide direct access to imputation accuracy information for markers identified by the 1000 Genomes Project across four major populations and covering multiple GWAS genotyping platforms. SNP and indel imputability information can be retrieved through a user-friendly interface by providing the ID(s) of the desired variant(s) or by specifying the desired genomic region. The query results can be refined by selecting relevant GWAS genotyping platform(s). This is the first database providing variant imputability information specific to each continental group and to each genotyping platform. In Filipino individuals from the Cebu Longitudinal Health and Nutrition Survey, our database can achieve an area under the receiver-operating characteristic curve of 0.97, 0.91, 0.88 and 0.79 for markers with minor allele frequency >5%, 3-5%, 1-3% and 0.5-1%, respectively. Specifically, by filtering out 48.6% of markers (corresponding to a reduction of up to 48.6% in computational costs for actual imputation) based on the imputability information in our database, we can remove 77%, 58%, 51% and 42% of the poorly imputed markers at the cost of only 0.3%, 0.8%, 1.5% and 4.6% of the well-imputed markers with minor allele frequency >5%, 3-5%, 1-3% and 0.5-1%, respectively. http://www.unc.edu/~yunmli/imputability.html

  10. Comparing strategies for selection of low-density SNPs for imputation-mediated genomic prediction in U. S. Holsteins.

    PubMed

    He, Jun; Xu, Jiaqi; Wu, Xiao-Lin; Bauck, Stewart; Lee, Jungjae; Morota, Gota; Kachman, Stephen D; Spangler, Matthew L

    2018-04-01

    SNP chips are commonly used for genotyping animals in genomic selection but strategies for selecting low-density (LD) SNPs for imputation-mediated genomic selection have not been addressed adequately. The main purpose of the present study was to compare the performance of eight LD (6K) SNP panels, each selected by a different strategy exploiting a combination of three major factors: evenly-spaced SNPs, increased minor allele frequencies, and SNP-trait associations either for single traits independently or for all the three traits jointly. The imputation accuracies from 6K to 80K SNP genotypes were between 96.2 and 98.2%. Genomic prediction accuracies obtained using imputed 80K genotypes were between 0.817 and 0.821 for daughter pregnancy rate, between 0.838 and 0.844 for fat yield, and between 0.850 and 0.863 for milk yield. The two SNP panels optimized on the three major factors had the highest genomic prediction accuracy (0.821-0.863), and these accuracies were very close to those obtained using observed 80K genotypes (0.825-0.868). Further exploration of the underlying relationships showed that genomic prediction accuracies did not respond linearly to imputation accuracies, but were significantly affected by genotype (imputation) errors of SNPs in association with the traits to be predicted. SNPs optimal for map coverage and MAF were favorable for obtaining accurate imputation of genotypes whereas trait-associated SNPs improved genomic prediction accuracies. Thus, optimal LD SNP panels were the ones that combined both strengths. The present results have practical implications on the design of LD SNP chips for imputation-enabled genomic prediction.

  11. Fast and accurate genotype imputation in genome-wide association studies through pre-phasing

    PubMed Central

    Howie, Bryan; Fuchsberger, Christian; Stephens, Matthew; Marchini, Jonathan; Abecasis, Gonçalo R.

    2013-01-01

    Sequencing efforts, including the 1000 Genomes Project and disease-specific efforts, are producing large collections of haplotypes that can be used for genotype imputation in genome-wide association studies (GWAS). Imputing from these reference panels can help identify new risk alleles, but the use of large panels with existing methods imposes a high computational burden. To keep imputation broadly accessible, we introduce a strategy called “pre-phasing” that maintains the accuracy of leading methods while cutting computational costs by orders of magnitude. In brief, we first statistically estimate the haplotypes for each GWAS individual (“pre-phasing”) and then impute missing genotypes into these estimated haplotypes. This reduces the computational cost because: (i) the GWAS samples must be phased only once, whereas standard methods would implicitly re-phase with each reference panel update; (ii) it is much faster to match a phased GWAS haplotype to one reference haplotype than to match unphased GWAS genotypes to a pair of reference haplotypes. This strategy will be particularly valuable for repeated imputation as reference panels evolve. PMID:22820512

  12. Fast imputation using medium or low-coverage sequence data

    USDA-ARS's Scientific Manuscript database

    Accurate genotype imputation can greatly reduce costs and increase benefits by combining whole-genome sequence data of varying read depth and microarray genotypes of varying densities. For large populations, an efficient strategy chooses the two haplotypes most likely to form each genotype and updat...

  13. Genotype imputation for African Americans using data from HapMap phase II versus 1000 genomes projects.

    PubMed

    Sung, Yun J; Gu, C Charles; Tiwari, Hemant K; Arnett, Donna K; Broeckel, Ulrich; Rao, Dabeeru C

    2012-07-01

    Genotype imputation provides imputation of untyped single nucleotide polymorphisms (SNPs) that are present on a reference panel such as those from the HapMap Project. It is popular for increasing statistical power and comparing results across studies using different platforms. Imputation for African American populations is challenging because their linkage disequilibrium blocks are shorter and also because no ideal reference panel is available due to admixture. In this paper, we evaluated three imputation strategies for African Americans. The intersection strategy used a combined panel consisting of SNPs polymorphic in both CEU and YRI. The union strategy used a panel consisting of SNPs polymorphic in either CEU or YRI. The merge strategy merged results from two separate imputations, one using CEU and the other using YRI. Because recent investigators are increasingly using the data from the 1000 Genomes (1KG) Project for genotype imputation, we evaluated both 1KG-based imputations and HapMap-based imputations. We used 23,707 SNPs from chromosomes 21 and 22 on Affymetrix SNP Array 6.0 genotyped for 1,075 HyperGEN African Americans. We found that 1KG-based imputations provided a substantially larger number of variants than HapMap-based imputations, about three times as many common variants and eight times as many rare and low-frequency variants. This higher yield is expected because the 1KG panel includes more SNPs. Accuracy rates using 1KG data were slightly lower than those using HapMap data before filtering, but slightly higher after filtering. The union strategy provided the highest imputation yield with next highest accuracy. The intersection strategy provided the lowest imputation yield but the highest accuracy. The merge strategy provided the lowest imputation accuracy. We observed that SNPs polymorphic only in CEU had much lower accuracy, reducing the accuracy of the union strategy. 
Our findings suggest that 1KG-based imputations can facilitate discovery of significant associations for SNPs across the whole MAF spectrum. Because the 1KG Project is still under way, we expect that later versions will provide better imputation performance. © 2012 Wiley Periodicals, Inc.
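
    The intersection and union panels described in this abstract can be sketched with plain set operations. This is an illustrative sketch only, not the authors' pipeline: the function name `build_panels` and the identifiers in the usage note are hypothetical, and the merge strategy is left out because the abstract does not say how the two separate imputation results were combined.

```python
def build_panels(ceu_polymorphic, yri_polymorphic):
    """Return candidate reference-panel SNP sets for an admixed sample.

    Both arguments are iterables of SNP identifiers that are polymorphic
    in the respective reference population (hypothetical inputs).
    """
    ceu = set(ceu_polymorphic)
    yri = set(yri_polymorphic)
    return {
        # SNPs polymorphic in both populations: smallest panel.
        "intersection": ceu & yri,
        # SNPs polymorphic in either population: largest panel.
        "union": ceu | yri,
        # SNPs polymorphic only in CEU, which the abstract reports as
        # having much lower accuracy in African Americans.
        "ceu_only": ceu - yri,
    }
```

    Calling `build_panels(["rs1", "rs2", "rs3"], ["rs2", "rs3", "rs4"])` gives a two-SNP intersection panel, a four-SNP union panel, and a single CEU-only SNP. Keeping the CEU-only set separate makes it easy to filter those SNPs after a union-based imputation.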

  14. Saturated linkage map construction in Rubus idaeus using genotyping by sequencing and genome-independent imputation

    PubMed Central

    2013-01-01

    Background: Rapid development of highly saturated genetic maps aids molecular breeding, which can accelerate gain per breeding cycle in woody perennial plants such as Rubus idaeus (red raspberry). Recently, robust genotyping methods based on high-throughput sequencing were developed, which provide high marker density, but result in some genotype errors and a large number of missing genotype values. Imputation can reduce the number of missing values and can correct genotyping errors, but current methods of imputation require a reference genome and thus are not an option for most species. Results: Genotyping by Sequencing (GBS) was used to produce highly saturated maps for a R. idaeus pseudo-testcross progeny. While low coverage and high variance in sequencing resulted in a large number of missing values for some individuals, a novel method of imputation based on maximum likelihood marker ordering from initial marker segregation overcame the challenge of missing values, and made map construction computationally tractable. The two resulting parental maps contained 4521 and 2391 molecular markers spanning 462.7 and 376.6 cM respectively over seven linkage groups. Detection of precise genomic regions with segregation distortion was possible because of map saturation. Microsatellites (SSRs) linked these results to published maps for cross-validation and map comparison. Conclusions: GBS together with genome-independent imputation provides a rapid method for genetic map construction in any pseudo-testcross progeny. Our method of imputation estimates the correct genotype call of missing values and corrects genotyping errors that lead to inflated map size and reduced precision in marker placement. Comparison of SSRs to published R. idaeus maps showed that the linkage maps constructed with GBS and our method of imputation were robust, and marker positioning reliable. The high marker density allowed identification of genomic regions with segregation distortion in R. idaeus, which may help to identify deleterious alleles that are the basis of inbreeding depression in the species. PMID:23324311

  15. Design and coverage of high throughput genotyping arrays optimized for individuals of East Asian, African American, and Latino race/ethnicity using imputation and a novel hybrid SNP selection algorithm.

    PubMed

    Hoffmann, Thomas J; Zhan, Yiping; Kvale, Mark N; Hesselson, Stephanie E; Gollub, Jeremy; Iribarren, Carlos; Lu, Yontao; Mei, Gangwu; Purdy, Matthew M; Quesenberry, Charles; Rowell, Sarah; Shapero, Michael H; Smethurst, David; Somkin, Carol P; Van den Eeden, Stephen K; Walter, Larry; Webster, Teresa; Whitmer, Rachel A; Finn, Andrea; Schaefer, Catherine; Kwok, Pui-Yan; Risch, Neil

    2011-12-01

    Four custom Axiom genotyping arrays were designed for a genome-wide association (GWA) study of 100,000 participants from the Kaiser Permanente Research Program on Genes, Environment and Health. The array optimized for individuals of European race/ethnicity was previously described. Here we detail the development of three additional microarrays optimized for individuals of East Asian, African American, and Latino race/ethnicity. For these arrays, we decreased redundancy of high-performing SNPs to increase SNP capacity. The East Asian array was designed using greedy pairwise SNP selection. However, removing SNPs from the target set based on imputation coverage is more efficient than pairwise tagging. Therefore, we developed a novel hybrid SNP selection method for the African American and Latino arrays utilizing rounds of greedy pairwise SNP selection, followed by removal from the target set of SNPs covered by imputation. The arrays provide excellent genome-wide coverage and are valuable additions for large-scale GWA studies. Copyright © 2011 Elsevier Inc. All rights reserved.
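
    The greedy pairwise SNP selection step mentioned in this abstract can be sketched as a greedy set-cover loop. This is a minimal sketch under stated assumptions, not the authors' algorithm: the function name, the `r2` lookup keyed by unordered SNP pairs, and the 0.8 threshold are all hypothetical, and the hybrid step (removing SNPs already covered by imputation) is not shown because the abstract gives no detail on how coverage was measured.

```python
def greedy_pairwise_tagging(targets, r2, threshold=0.8):
    """Pick tag SNPs so every target is tagged at pairwise r^2 >= threshold.

    targets:   iterable of SNP identifiers.
    r2:        dict mapping frozenset({snp_a, snp_b}) to pairwise r^2;
               missing pairs are treated as r^2 = 0 (an assumption).
    threshold: r^2 cut-off for one SNP to count as tagging another.
    """
    def covers(a, b):
        # A SNP always covers itself.
        return a == b or r2.get(frozenset((a, b)), 0.0) >= threshold

    uncovered = set(targets)
    candidates = sorted(uncovered)  # sorted for deterministic tie-breaking
    chosen = []
    while uncovered:
        # Choose the candidate that tags the most still-uncovered targets.
        best = max(candidates,
                   key=lambda s: sum(covers(s, t) for t in uncovered))
        chosen.append(best)
        uncovered -= {t for t in uncovered if covers(best, t)}
    return chosen
```

    The loop always terminates because any uncovered SNP covers at least itself. In a hybrid design like the one described, a round of this selection would presumably alternate with dropping targets that imputation already recovers well.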

  16. Novel methods to optimize genotypic imputation for low-coverage, next-generation sequence data in crop plants

    USDA-ARS's Scientific Manuscript database

    Next-generation sequencing technology such as genotyping-by-sequencing (GBS) made low-cost, but often low-coverage, whole-genome sequencing widely available. Extensive inbreeding in crop plants provides an untapped, high quality source of phased haplotypes for imputing missing genotypes. We introduc...

  17. genipe: an automated genome-wide imputation pipeline with automatic reporting and statistical tools.

    PubMed

    Lemieux Perreault, Louis-Philippe; Legault, Marc-André; Asselin, Géraldine; Dubé, Marie-Pierre

    2016-12-01

    Genotype imputation is now commonly performed following genome-wide genotyping experiments. Imputation increases the density of analyzed genotypes in the dataset, enabling fine-mapping across the genome. However, the process of imputation using the most recent publicly available reference datasets can require considerable computation power and the management of hundreds of large intermediate files. We have developed genipe, a complete genome-wide imputation pipeline which includes automatic reporting, imputed data indexing and management, and a suite of statistical tests for imputed data commonly used in genetic epidemiology (Sequence Kernel Association Test, Cox proportional hazards for survival analysis, and linear mixed models for repeated measurements in longitudinal studies). The genipe package is open-source Python software and is freely available for non-commercial use (CC BY-NC 4.0) at https://github.com/pgxcentre/genipe; documentation and tutorials are available at http://pgxcentre.github.io/genipe. Contact: louis-philippe.lemieux.perreault@statgen.org or marie-pierre.dube@statgen.org. Supplementary information: Supplementary data are available at Bioinformatics online. © The Author 2016. Published by Oxford University Press.

  18. Molgenis-impute: imputation pipeline in a box.

    PubMed

    Kanterakis, Alexandros; Deelen, Patrick; van Dijk, Freerk; Byelas, Heorhiy; Dijkstra, Martijn; Swertz, Morris A

    2015-08-19

    Genotype imputation is an important procedure in current genomic analysis such as genome-wide association studies, meta-analyses and fine mapping. Although high quality tools are available that perform the steps of this process, considerable effort and expertise is required to set up and run a best practice imputation pipeline, particularly for larger genotype datasets, where imputation has to scale out in parallel on computer clusters. Here we present MOLGENIS-impute, an 'imputation in a box' solution that seamlessly and transparently automates the set up and running of all the steps of the imputation process. These steps include genome build liftover (liftovering), genotype phasing with SHAPEIT2, quality control, sample and chromosomal chunking/merging, and imputation with IMPUTE2. MOLGENIS-impute builds on MOLGENIS-compute, a simple pipeline management platform for submission and monitoring of bioinformatics tasks in High Performance Computing (HPC) environments like local/cloud servers, clusters and grids. All the required tools, data and scripts are downloaded and installed in a single step. Researchers with diverse backgrounds and expertise have tested MOLGENIS-impute on different locations and imputed over 30,000 samples so far using the 1,000 Genomes Project and new Genome of the Netherlands data as the imputation reference. The tests have been performed on PBS/SGE clusters, cloud VMs and in a grid HPC environment. MOLGENIS-impute gives priority to the ease of setting up, configuring and running an imputation. It has minimal dependencies and wraps the pipeline in a simple command line interface, without sacrificing flexibility to adapt or limiting the options of underlying imputation tools. It does not require knowledge of a workflow system or programming, and is targeted at researchers who just want to apply best practices in imputation via simple commands. 
It is built on the MOLGENIS compute workflow framework to enable customization with additional computational steps or it can be included in other bioinformatics pipelines. It is available as open source from: https://github.com/molgenis/molgenis-imputation.

  19. Genomic prediction using imputed whole-genome sequence data in Holstein Friesian cattle.

    PubMed

    van Binsbergen, Rianne; Calus, Mario P L; Bink, Marco C A M; van Eeuwijk, Fred A; Schrooten, Chris; Veerkamp, Roel F

    2015-09-17

    In contrast to currently used single nucleotide polymorphism (SNP) panels, the use of whole-genome sequence data is expected to enable the direct estimation of the effects of causal mutations on a given trait. This could lead to higher reliabilities of genomic predictions compared to those based on SNP genotypes. Also, at each generation of selection, recombination events between a SNP and a mutation can cause decay in reliability of genomic predictions based on markers rather than on the causal variants. Our objective was to investigate the use of imputed whole-genome sequence genotypes versus high-density SNP genotypes on (the persistency of) the reliability of genomic predictions using real cattle data. Highly accurate phenotypes based on daughter performance and Illumina BovineHD Beadchip genotypes were available for 5503 Holstein Friesian bulls. The BovineHD genotypes (631,428 SNPs) of each bull were used to impute whole-genome sequence genotypes (12,590,056 SNPs) using the Beagle software. Imputation was done using a multi-breed reference panel of 429 sequenced individuals. Genomic estimated breeding values for three traits were predicted using a Bayesian stochastic search variable selection (BSSVS) model and a genome-enabled best linear unbiased prediction model (GBLUP). Reliabilities of predictions were based on 2087 validation bulls, while the other 3416 bulls were used for training. Prediction reliabilities ranged from 0.37 to 0.52. BSSVS performed better than GBLUP in all cases. Reliabilities of genomic predictions were slightly lower with imputed sequence data than with BovineHD chip data. Also, the reliabilities tended to be lower for both sequence data and BovineHD chip data when relationships between training animals were low. No increase in persistency of prediction reliability using imputed sequence data was observed. Compared to BovineHD genotype data, using imputed sequence data for genomic prediction produced no advantage. 
To investigate the putative advantage of genomic prediction using (imputed) sequence data, a training set with a larger number of individuals that are distantly related to each other and genomic prediction models that incorporate biological information on the SNPs or that apply stricter SNP pre-selection should be considered.

  20. Statistical inference for Hardy-Weinberg proportions in the presence of missing genotype information.

    PubMed

    Graffelman, Jan; Sánchez, Milagros; Cook, Samantha; Moreno, Victor

    2013-01-01

    In genetic association studies, tests for Hardy-Weinberg proportions are often employed as a quality control checking procedure. Missing genotypes are typically discarded prior to testing. In this paper we show that inference for Hardy-Weinberg proportions can be biased when missing values are discarded. We propose to use multiple imputation of missing values in order to improve inference for Hardy-Weinberg proportions. For imputation we employ a multinomial logit model that uses information from allele intensities and/or neighbouring markers. Analysis of an empirical data set of single nucleotide polymorphisms possibly related to colon cancer reveals that missing genotypes are not missing completely at random. Deviation from Hardy-Weinberg proportions is mostly due to a lack of heterozygotes. Inbreeding coefficients estimated by multiple imputation of the missings are typically lowered with respect to inbreeding coefficients estimated by discarding the missings. Accounting for missings by multiple imputation qualitatively changed the results of 10 to 17% of the statistical tests performed. Estimates of inbreeding coefficients obtained by multiple imputation showed high correlation with estimates obtained by single imputation using an external reference panel. Our conclusion is that imputation of missing data leads to improved statistical inference for Hardy-Weinberg proportions.
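
    For reference, the quantities this abstract discusses can be computed from complete genotype counts as follows. This is a minimal sketch of the standard one-degree-of-freedom chi-square test and the inbreeding coefficient f = 1 - H_obs / H_exp only; it does not implement the authors' multiple-imputation procedure or their multinomial logit model, and the function name is hypothetical.

```python
import math

def hwe_summary(n_aa, n_ab, n_bb):
    """Chi-square test for Hardy-Weinberg proportions at a biallelic marker.

    Returns (allele frequency of A, chi-square statistic, p-value,
    inbreeding coefficient) from observed genotype counts.
    """
    n = n_aa + n_ab + n_bb
    if n == 0:
        raise ValueError("no genotypes observed")
    p = (2 * n_aa + n_ab) / (2 * n)
    q = 1.0 - p
    if p == 0.0 or q == 0.0:
        raise ValueError("marker is monomorphic")
    observed = (n_aa, n_ab, n_bb)
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # Upper tail of a chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    # Positive values indicate a lack of heterozygotes.
    inbreeding = 1.0 - n_ab / expected[1]
    return p, chi2, p_value, inbreeding
```

    With counts (40, 20, 40), for instance, the expected counts are (25, 50, 25), giving a chi-square statistic of 36 and an inbreeding coefficient of 0.6. Under multiple imputation, a summary like this would be computed on each completed data set and the results then pooled.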

  1. Genotype imputation in a tropical crossbred dairy cattle population

    USDA-ARS's Scientific Manuscript database

    The application of new tools, such as genomic selection and genotype imputation, still presents challenges in crossbred populations because relationships of causal variants with markers may vary across breeds. In order to make genomic selection more cost effective, cheap low density chips are often ...

  2. Estimation of Genetic Relationships Between Individuals Across Cohorts and Platforms: Application to Childhood Height.

    PubMed

    Fedko, Iryna O; Hottenga, Jouke-Jan; Medina-Gomez, Carolina; Pappa, Irene; van Beijsterveldt, Catharina E M; Ehli, Erik A; Davies, Gareth E; Rivadeneira, Fernando; Tiemeier, Henning; Swertz, Morris A; Middeldorp, Christel M; Bartels, Meike; Boomsma, Dorret I

    2015-09-01

    Combining genotype data across cohorts increases power to estimate the heritability due to common single nucleotide polymorphisms (SNPs), based on analyzing a Genetic Relationship Matrix (GRM). However, the combination of SNP data across multiple cohorts may lead to stratification when, for example, different genotyping platforms are used. In the current study, we address issues of combining SNP data from different cohorts, the Netherlands Twin Register (NTR) and the Generation R (GENR) study. Both cohorts include children of Northern European Dutch background (N = 3102 + 2826, respectively) who were genotyped on different platforms. We explore imputation and phasing as a tool and compare three GRM-building strategies, in which data from the two cohorts are (1) just combined, (2) pre-combined and cross-platform imputed and (3) cross-platform imputed and post-combined. We test these three strategies with data on childhood height for unrelated individuals (N = 3124, average age 6.7 years) to explore their effect on SNP-heritability estimates and compare results to those obtained from the independent studies. All combination strategies result in SNP-heritability estimates with a standard error smaller than those of the independent studies. We did not observe a significant difference in estimates of SNP-heritability based on various cross-platform imputed GRMs. SNP-heritability of childhood height was on average estimated as 0.50 (SE = 0.10). Introducing cohort as a covariate resulted in a ≈2% drop. Principal components (PCs) adjustment resulted in SNP-heritability estimates of about 0.39 (SE = 0.11). Strikingly, we did not find a significant difference between cross-platform imputed and combined GRMs. All estimates were significant regardless of the use of PCs adjustment. Based on these analyses we conclude that imputation with a reference set helps to increase power to estimate SNP-heritability by combining cohorts of the same ethnicity genotyped on different platforms. 
However, important factors should be taken into account such as remaining cohort stratification after imputation and/or phenotypic heterogeneity between and within cohorts. Whether one should use imputation, or just combine the genotype data, depends on the number of overlapping SNPs in relation to the total number of genotyped SNPs for both cohorts, and their ability to tag all the genetic variance related to the specific trait of interest.
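
    A Genetic Relationship Matrix of the kind analyzed in this abstract can be sketched as below. The abstract does not state which estimator was used, so this assumes the commonly used form in which each SNP is centred by 2p and scaled by sqrt(2p(1-p)), with allele frequencies taken from the sample itself; the function name is hypothetical.

```python
def genetic_relationship_matrix(genotypes):
    """Build a GRM from allele dosages.

    genotypes: list of individuals, each a list of dosages (0, 1 or 2)
               over the same SNPs, with no missing values.
    Returns an n x n list of lists.
    """
    n = len(genotypes)
    m = len(genotypes[0])
    # Sample allele frequency per SNP.
    freqs = [sum(ind[j] for ind in genotypes) / (2 * n) for j in range(m)]
    # Monomorphic SNPs carry no information and would divide by zero.
    usable = [j for j, p in enumerate(freqs) if 0.0 < p < 1.0]
    if not usable:
        raise ValueError("no polymorphic SNPs")
    # Standardise each dosage: (x - 2p) / sqrt(2p(1-p)).
    z = [[(ind[j] - 2 * freqs[j]) / (2 * freqs[j] * (1 - freqs[j])) ** 0.5
          for j in usable]
         for ind in genotypes]
    # Average the products of standardised dosages over SNPs.
    return [[sum(a * b for a, b in zip(z[i], z[k])) / len(usable)
             for k in range(n)]
            for i in range(n)]
```

    When cohorts genotyped on different platforms are simply combined, only SNPs present in both can enter such a matrix, which is why the share of overlapping SNPs matters for the choice between combining and imputing.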

  3. Imputation of single nucleotide polymorphism genotypes of Hereford cattle: reference panel size, family relationship and population structure

    USDA-ARS's Scientific Manuscript database

    The objective of this study is to investigate single nucleotide polymorphism (SNP) genotypes imputation of Hereford cattle. Purebred Herefords were from two sources, Line 1 Hereford (N=240) and representatives of Industry Herefords (N=311). Using different reference panels of 62 and 494 males with 1...

  4. GACT: a Genome build and Allele definition Conversion Tool for SNP imputation and meta-analysis in genetic association studies.

    PubMed

    Sulovari, Arvis; Li, Dawei

    2014-07-19

    Genome-wide association studies (GWAS) have successfully identified genes associated with complex human diseases. Although much of the heritability remains unexplained, combining single nucleotide polymorphism (SNP) genotypes from multiple studies for meta-analysis will increase the statistical power to identify new disease-associated variants. Meta-analysis requires the same allele definition (nomenclature) and genome build among individual studies. Similarly, imputation, commonly used prior to meta-analysis, requires the same consistency. However, the genotypes from various GWAS are generated using different genotyping platforms, arrays or SNP-calling approaches, resulting in the use of different genome builds and allele definitions. Incorrect assumptions of identical allele definition among combined GWAS lead to a large portion of discarded genotypes or incorrect association findings. There is no published tool that predicts and converts among all major allele definitions. In this study, we have developed a tool, GACT, which stands for Genome build and Allele definition Conversion Tool, that predicts and inter-converts between any of the common SNP allele definitions and between the major genome builds. In addition, we assessed several factors that may affect imputation quality, and our results indicated that inclusion of singletons in the reference had detrimental effects while ambiguous SNPs had no measurable effect. Unexpectedly, exclusion of genotypes with missing rate > 0.001 (40% of study SNPs) showed no significant decrease in imputation quality (quality was even significantly higher than for imputation with singletons in the reference), especially for rare SNPs. 
GACT is a new, powerful, and user-friendly tool with both command-line and interactive online versions that can accurately predict, and convert between any of the common allele definitions and between genome builds for genome-wide meta-analysis and imputation of genotypes from SNP-arrays or deep-sequencing, particularly for data from the dbGaP and other public databases. http://www.uvm.edu/genomics/software/gact.

  5. Accuracy of genotype imputation in Swiss cattle breeds

    USDA-ARS's Scientific Manuscript database

    The objective of this study was to evaluate the accuracy of imputation from Illumina Bovine3k Bead Chip (3k) and Illumina BovineLD (6k) to 54k chip information in Swiss dairy cattle breeds. Genotype data comprised of 54k SNP chip data of Original Braunvieh (OB), Brown Swiss (BS), Swiss Fleckvieh (SF...

  6. Genome-wide association analysis based on multiple imputation with low-depth GBS data: application to biofuel traits in reed canarygrass

    USDA-ARS's Scientific Manuscript database

    Genotyping-by-sequencing allows for large-scale genetic analyses in plant species with no reference genome, creating the challenge of sound inference in the presence of uncertain genotypes. Here we report an imputation-based genome-wide association study (GWAS) in reed canarygrass (Phalaris arundina...

  7. Genome-wide association study based on multiple imputation with low-depth sequencing data: application to biofuel traits in reed canarygrass

    USDA-ARS's Scientific Manuscript database

    Genotyping by sequencing allows for large-scale genetic analyses in plant species with no reference genome, but sets the challenge of sound inference in presence of uncertain genotypes. We report an imputation-based genome-wide association study (GWAS) in reed canarygrass (Phalaris arundinacea L., P...

  8. DISSCO: direct imputation of summary statistics allowing covariates

    PubMed Central

    Xu, Zheng; Duan, Qing; Yan, Song; Chen, Wei; Li, Mingyao; Lange, Ethan; Li, Yun

    2015-01-01

    Background: Imputation of individual level genotypes at untyped markers using an external reference panel of genotyped or sequenced individuals has become standard practice in genetic association studies. Direct imputation of summary statistics can also be valuable, for example in meta-analyses where individual level genotype data are not available. Two methods (DIST and ImpG-Summary/LD), that assume a multivariate Gaussian distribution for the association summary statistics, have been proposed for imputing association summary statistics. However, both methods assume that the correlations between association summary statistics are the same as the correlations between the corresponding genotypes. This assumption can be violated in the presence of confounding covariates. Methods: We analytically show that in the absence of covariates, correlation among association summary statistics is indeed the same as that among the corresponding genotypes, thus serving as a theoretical justification for the recently proposed methods. We continue to prove that in the presence of covariates, correlation among association summary statistics becomes the partial correlation of the corresponding genotypes controlling for covariates. We therefore develop direct imputation of summary statistics allowing covariates (DISSCO). Results: We consider two real-life scenarios where the correlation and partial correlation likely make practical difference: (i) association studies in admixed populations; (ii) association studies in presence of other confounding covariate(s). Application of DISSCO to real datasets under both scenarios shows at least comparable, if not better, performance compared with existing correlation-based methods, particularly for lower frequency variants. For example, DISSCO can reduce the absolute deviation from the truth by 3.9–15.2% for variants with minor allele frequency <5%. Availability and implementation: http://www.unc.edu/∼yunmli/DISSCO. 
Contact: yunli@med.unc.edu Supplementary information: Supplementary data are available at Bioinformatics online. PMID:25810429
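
    The distinction between correlation and partial correlation that motivates DISSCO can be illustrated with a single covariate. This is a sketch of the textbook first-order partial correlation formula only, not the DISSCO implementation; the function names and the toy data in the usage note are hypothetical.

```python
def pearson(x, y):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

def partial_correlation(x, y, z):
    """Correlation between x and y after controlling for one covariate z."""
    r_xy, r_xz, r_yz = pearson(x, y), pearson(x, z), pearson(y, z)
    return (r_xy - r_xz * r_yz) / ((1 - r_xz ** 2) * (1 - r_yz ** 2)) ** 0.5
```

    As a toy example, take dosages `g1 = [2, 2, 1, 1, 1, 1, 0, 0]` and `g2 = [2, 1, 2, 1, 1, 0, 1, 0]` with an ancestry indicator `anc = [1, 1, 1, 1, 0, 0, 0, 0]`. The raw correlation between the two markers is 0.5, but the partial correlation given ancestry is zero, so a method that uses the raw genotype correlation would overstate how much one marker's summary statistic says about the other.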

  9. DISSCO: direct imputation of summary statistics allowing covariates.

    PubMed

    Xu, Zheng; Duan, Qing; Yan, Song; Chen, Wei; Li, Mingyao; Lange, Ethan; Li, Yun

    2015-08-01

    Imputation of individual level genotypes at untyped markers using an external reference panel of genotyped or sequenced individuals has become standard practice in genetic association studies. Direct imputation of summary statistics can also be valuable, for example in meta-analyses where individual level genotype data are not available. Two methods (DIST and ImpG-Summary/LD), that assume a multivariate Gaussian distribution for the association summary statistics, have been proposed for imputing association summary statistics. However, both methods assume that the correlations between association summary statistics are the same as the correlations between the corresponding genotypes. This assumption can be violated in the presence of confounding covariates. We analytically show that in the absence of covariates, correlation among association summary statistics is indeed the same as that among the corresponding genotypes, thus serving as a theoretical justification for the recently proposed methods. We continue to prove that in the presence of covariates, correlation among association summary statistics becomes the partial correlation of the corresponding genotypes controlling for covariates. We therefore develop direct imputation of summary statistics allowing covariates (DISSCO). We consider two real-life scenarios where the correlation and partial correlation likely make practical difference: (i) association studies in admixed populations; (ii) association studies in presence of other confounding covariate(s). Application of DISSCO to real datasets under both scenarios shows at least comparable, if not better, performance compared with existing correlation-based methods, particularly for lower frequency variants. For example, DISSCO can reduce the absolute deviation from the truth by 3.9-15.2% for variants with minor allele frequency <5%. © The Author 2015. Published by Oxford University Press. All rights reserved. 

  10. SparRec: An effective matrix completion framework of missing data imputation for GWAS

    NASA Astrophysics Data System (ADS)

    Jiang, Bo; Ma, Shiqian; Causey, Jason; Qiao, Linbo; Hardin, Matthew Price; Bitts, Ian; Johnson, Daniel; Zhang, Shuzhong; Huang, Xiuzhen

    2016-10-01

    Genome-wide association studies present computational challenges for missing data imputation, while the advances of genotype technologies are generating datasets of large sample sizes with sample sets genotyped on multiple SNP chips. We present a new framework SparRec (Sparse Recovery) for imputation, with the following properties: (1) The optimization models of SparRec, based on low-rank and low number of co-clusters of matrices, are different from current statistics methods. While our low-rank matrix completion (LRMC) model is similar to Mendel-Impute, our matrix co-clustering factorization (MCCF) model is completely new. (2) SparRec, as other matrix completion methods, is flexible to be applied to missing data imputation for large meta-analysis with different cohorts genotyped on different sets of SNPs, even when there is no reference panel. This kind of meta-analysis is very challenging for current statistics based methods. (3) SparRec has consistent performance and achieves high recovery accuracy even when the missing data rate is as high as 90%. Compared with Mendel-Impute, our low-rank based method achieves similar accuracy and efficiency, while the co-clustering based method has advantages in running time. The testing results show that SparRec has significant advantages and competitive performance over other state-of-the-art existing statistics methods including Beagle and fastPhase.

  11. Optimal Design of Low-Density SNP Arrays for Genomic Prediction: Algorithm and Applications.

    PubMed

    Wu, Xiao-Lin; Xu, Jiaqi; Feng, Guofei; Wiggans, George R; Taylor, Jeremy F; He, Jun; Qian, Changsong; Qiu, Jiansheng; Simpson, Barry; Walker, Jeremy; Bauck, Stewart

    2016-01-01

    Low-density (LD) single nucleotide polymorphism (SNP) arrays provide a cost-effective solution for genomic prediction and selection, but algorithms and computational tools are needed for the optimal design of LD SNP chips. A multiple-objective, local optimization (MOLO) algorithm was developed for design of optimal LD SNP chips that can be imputed accurately to medium-density (MD) or high-density (HD) SNP genotypes for genomic prediction. The objective function facilitates maximization of non-gap map length and system information for the SNP chip, and the latter is computed either as locus-averaged (LASE) or haplotype-averaged Shannon entropy (HASE) and adjusted for uniformity of the SNP distribution. HASE performed better than LASE with ≤1,000 SNPs, but required considerably more computing time. Nevertheless, the differences diminished when >5,000 SNPs were selected. Optimization was accomplished conditionally on the presence of SNPs that were obligated to each chromosome. The frame location of SNPs on a chip can be either uniform (evenly spaced) or non-uniform. For the latter design, a tunable empirical Beta distribution was used to guide location distribution of frame SNPs such that both ends of each chromosome were enriched with SNPs. The SNP distribution on each chromosome was finalized through the objective function that was locally and empirically maximized. This MOLO algorithm was capable of selecting a set of approximately evenly-spaced and highly-informative SNPs, which in turn led to increased imputation accuracy compared with selection solely of evenly-spaced SNPs. Imputation accuracy increased with LD chip size, and imputation error rate was extremely low for chips with ≥3,000 SNPs. Assuming that genotyping or imputation error occurs at random, imputation error rate can be viewed as the upper limit for genomic prediction error. Our results show that about 25% of imputation error rate was propagated to genomic prediction in an Angus population. 
The utility of this MOLO algorithm was also demonstrated in a real application, in which a 6K SNP panel was optimized conditional on 5,260 obligatory SNP selected based on SNP-trait association in U.S. Holstein animals. With this MOLO algorithm, both imputation error rate and genomic prediction error rate were minimal.
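
    The locus-averaged Shannon entropy (LASE) term in the objective function can be sketched as follows. This assumes biallelic SNPs, base-2 logarithms and a plain average over loci; the abstract does not give the exact definition or the uniformity adjustment, so both function names and these choices are assumptions, and the haplotype-averaged variant (HASE) is not shown.

```python
import math

def locus_entropy(p):
    """Shannon entropy in bits of a biallelic locus with allele frequency p."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("allele frequency must lie in [0, 1]")
    # Terms with zero frequency contribute nothing to the entropy.
    return -sum(f * math.log2(f) for f in (p, 1.0 - p) if f > 0.0)

def locus_averaged_entropy(freqs):
    """Mean per-locus entropy over the SNPs selected for a chip."""
    freqs = list(freqs)
    if not freqs:
        raise ValueError("no SNPs selected")
    return sum(locus_entropy(p) for p in freqs) / len(freqs)
```

    Entropy peaks at 1 bit when p = 0.5 and falls to 0 for a monomorphic locus, so maximizing this term favours SNPs with intermediate allele frequencies, which is consistent with the abstract's description of selecting highly informative SNPs.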

  12. Optimal Design of Low-Density SNP Arrays for Genomic Prediction: Algorithm and Applications

    PubMed Central

    Wu, Xiao-Lin; Xu, Jiaqi; Feng, Guofei; Wiggans, George R.; Taylor, Jeremy F.; He, Jun; Qian, Changsong; Qiu, Jiansheng; Simpson, Barry; Walker, Jeremy; Bauck, Stewart

    2016-01-01

    Low-density (LD) single nucleotide polymorphism (SNP) arrays provide a cost-effective solution for genomic prediction and selection, but algorithms and computational tools are needed for the optimal design of LD SNP chips. A multiple-objective, local optimization (MOLO) algorithm was developed for design of optimal LD SNP chips that can be imputed accurately to medium-density (MD) or high-density (HD) SNP genotypes for genomic prediction. The objective function facilitates maximization of non-gap map length and system information for the SNP chip, and the latter is computed either as locus-averaged (LASE) or haplotype-averaged Shannon entropy (HASE) and adjusted for uniformity of the SNP distribution. HASE performed better than LASE with ≤1,000 SNPs, but required considerably more computing time. Nevertheless, the differences diminished when >5,000 SNPs were selected. Optimization was accomplished conditionally on the presence of SNPs that were obligated to each chromosome. The frame location of SNPs on a chip can be either uniform (evenly spaced) or non-uniform. For the latter design, a tunable empirical Beta distribution was used to guide location distribution of frame SNPs such that both ends of each chromosome were enriched with SNPs. The SNP distribution on each chromosome was finalized through the objective function that was locally and empirically maximized. This MOLO algorithm was capable of selecting a set of approximately evenly-spaced and highly-informative SNPs, which in turn led to increased imputation accuracy compared with selection solely of evenly-spaced SNPs. Imputation accuracy increased with LD chip size, and imputation error rate was extremely low for chips with ≥3,000 SNPs. Assuming that genotyping or imputation error occurs at random, imputation error rate can be viewed as the upper limit for genomic prediction error. Our results show that about 25% of imputation error rate was propagated to genomic prediction in an Angus population. 
The utility of this MOLO algorithm was also demonstrated in a real application, in which a 6K SNP panel was optimized conditional on 5,260 obligatory SNP selected based on SNP-trait association in U.S. Holstein animals. With this MOLO algorithm, both imputation error rate and genomic prediction error rate were minimal. PMID:27583971

  13. Ascertainment bias from imputation methods evaluation in wheat.

    PubMed

    Brandariz, Sofía P; González Reymúndez, Agustín; Lado, Bettina; Malosetti, Marcos; Garcia, Antonio Augusto Franco; Quincke, Martín; von Zitzewitz, Jarislav; Castro, Marina; Matus, Iván; Del Pozo, Alejandro; Castro, Ariel J; Gutiérrez, Lucía

    2016-10-04

    Whole-genome genotyping techniques like Genotyping-by-sequencing (GBS) are being used for genetic studies such as Genome-Wide Association (GWAS) and Genomewide Selection (GS), where different strategies for imputation have been developed. Nevertheless, imputation error may lead to poor performance (i.e. smaller power or higher false positive rate) when complete data is not required as it is for GWAS, and each marker is taken one at a time. The aim of this study was to compare the performance of GWAS analysis for Quantitative Trait Loci (QTL) of major and minor effect using different imputation methods when no reference panel is available in a wheat GBS panel. In this study, we compared the power and false positive rate of dissecting quantitative traits for imputed and not-imputed marker score matrices in: (1) a complete molecular marker barley panel array, and (2) a GBS wheat panel with missing data. We found that there is an ascertainment bias in imputation method comparisons. Simulating over a complete matrix and creating missing data at random proved that imputation methods have a poorer performance. Furthermore, we found that when QTL were simulated with imputed data, the imputation methods performed better than the not-imputed ones. On the other hand, when QTL were simulated with not-imputed data, the not-imputed method and one of the imputation methods performed better for dissecting quantitative traits. Moreover, larger differences between imputation methods were detected for QTL of major effect than QTL of minor effect. We also compared the different marker score matrices for GWAS analysis in a real wheat phenotype dataset, and we found minimal differences indicating that imputation did not improve the GWAS performance when a reference panel was not available. GWAS analysis performed more poorly when an imputed marker score matrix was used and no reference panel was available in a wheat GBS panel.

  14. A comparison of different algorithms for phasing haplotypes using Holstein cattle genotypes and pedigree data.

    PubMed

    Miar, Younes; Sargolzaei, Mehdi; Schenkel, Flavio S

    2017-04-01

    Phasing genotypes to haplotypes is becoming increasingly important due to its applications in the study of diseases, population and evolutionary genetics, imputation, and so on. Several studies have focused on the development of computational methods that infer haplotype phase from population genotype data. The aim of this study was to compare phasing algorithms implemented in Beagle, Findhap, FImpute, Impute2, and ShapeIt2 software using 50k and 777k (HD) genotyping data. Six scenarios were considered: no-parents, sire-progeny pairs, sire-dam-progeny trios, each with and without pedigree information in Holstein cattle. Algorithms were compared with respect to their phasing accuracy and computational efficiency. In the studied population, Beagle and FImpute were more accurate than other phasing algorithms. Across scenarios, phasing accuracies for Beagle and FImpute were 99.49-99.90% and 99.44-99.99% for 50k, respectively, and 99.90-99.99% and 99.87-99.99% for HD, respectively. Generally, FImpute resulted in higher accuracy when genotypic information of at least one parent was available. In the absence of parental genotypes and pedigree information, Beagle and Impute2 (with double the default number of states) were slightly more accurate than FImpute. Findhap gave high phasing accuracy when parents' genotypes and pedigree information were available. In terms of computing time, Findhap was the fastest algorithm followed by FImpute. FImpute was 30 to 131, 87 to 786, and 353 to 1,400 times faster across scenarios than Beagle, ShapeIt2, and Impute2, respectively. In summary, FImpute and Beagle were the most accurate phasing algorithms. Moreover, the low computational requirement of FImpute makes it an attractive algorithm for phasing genotypes of large livestock populations. Copyright © 2017 American Dairy Science Association. Published by Elsevier Inc. All rights reserved.

  15. Imputation-Based Genomic Coverage Assessments of Current Human Genotyping Arrays

    PubMed Central

    Nelson, Sarah C.; Doheny, Kimberly F.; Pugh, Elizabeth W.; Romm, Jane M.; Ling, Hua; Laurie, Cecelia A.; Browning, Sharon R.; Weir, Bruce S.; Laurie, Cathy C.

    2013-01-01

    Microarray single-nucleotide polymorphism genotyping, combined with imputation of untyped variants, has been widely adopted as an efficient means to interrogate variation across the human genome. “Genomic coverage” is the total proportion of genomic variation captured by an array, either by direct observation or through an indirect means such as linkage disequilibrium or imputation. We have performed imputation-based genomic coverage assessments of eight current genotyping arrays that assay from ~0.3 to ~5 million variants. Coverage was determined separately in each of the four continental ancestry groups in the 1000 Genomes Project phase 1 release. We used the subset of 1000 Genomes variants present on each array to impute the remaining variants and assessed coverage based on correlation between imputed and observed allelic dosages. More than 75% of common variants (minor allele frequency > 0.05) are covered by all arrays in all groups except for African ancestry, and up to ~90% in all ancestries for the highest density arrays. In contrast, less than 40% of less common variants (0.01 < minor allele frequency < 0.05) are covered by low density arrays in all ancestries and 50–80% in high density arrays, depending on ancestry. We also calculated genome-wide power to detect variant-trait association in a case-control design, across varying sample sizes, effect sizes, and minor allele frequency ranges, and compare these array-based power estimates with a hypothetical array that would type all variants in 1000 Genomes. These imputation-based genomic coverage and power analyses are intended as a practical guide to researchers planning genetic studies. PMID:23979933
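
The coverage measure described above can be sketched as the fraction of variants whose imputed dosages correlate well with the observed dosages. The sketch below assumes a squared Pearson correlation and a cutoff of 0.8; the abstract states neither, so both are illustrative choices, and the function names are hypothetical.

```python
def pearson_r2(x, y):
    """Squared Pearson correlation between two equal-length dosage vectors."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    if sxx == 0 or syy == 0:
        return 0.0  # a constant vector cannot track the other one
    return (sxy * sxy) / (sxx * syy)


def genomic_coverage(imputed, observed, threshold=0.8):
    """Fraction of variants with r^2 at or above the (assumed) threshold."""
    covered = sum(
        1 for imp, obs in zip(imputed, observed)
        if pearson_r2(imp, obs) >= threshold
    )
    return covered / len(observed)


# Two hypothetical variants genotyped in four samples (dosage 0..2).
observed = [[0, 1, 2, 1], [0, 1, 2, 1]]
imputed = [[0.1, 0.9, 1.9, 1.1], [1.0, 1.0, 1.0, 1.0]]
print(genomic_coverage(imputed, observed))
```

In the toy data the first variant is imputed almost perfectly and the second is imputed as a constant, so one of the two variants counts as covered.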

  16. Imputation of Missing Genotypes From Sparse to High Density Using Long-Range Phasing

    USDA-ARS's Scientific Manuscript database

    Related individuals in a population share long chromosome segments which trace to a common ancestor. We describe a long-range phasing algorithm that makes use of this property to phase whole chromosomes and simultaneously impute a large number of missing markers. We test our method by imputing marke...

  17. Improved imputation of low-frequency and rare variants using the UK10K haplotype reference panel.

    PubMed

    Huang, Jie; Howie, Bryan; McCarthy, Shane; Memari, Yasin; Walter, Klaudia; Min, Josine L; Danecek, Petr; Malerba, Giovanni; Trabetti, Elisabetta; Zheng, Hou-Feng; Gambaro, Giovanni; Richards, J Brent; Durbin, Richard; Timpson, Nicholas J; Marchini, Jonathan; Soranzo, Nicole

    2015-09-14

    Imputing genotypes from reference panels created by whole-genome sequencing (WGS) provides a cost-effective strategy for augmenting the single-nucleotide polymorphism (SNP) content of genome-wide arrays. The UK10K Cohorts project has generated a data set of 3,781 whole genomes sequenced at low depth (average 7x), aiming to exhaustively characterize genetic variation down to 0.1% minor allele frequency in the British population. Here we demonstrate the value of this resource for improving imputation accuracy at rare and low-frequency variants in both a UK and an Italian population. We show that large increases in imputation accuracy can be achieved by re-phasing WGS reference panels after initial genotype calling. We also present a method for combining WGS panels to improve variant coverage and downstream imputation accuracy, which we illustrate by integrating 7,562 WGS haplotypes from the UK10K project with 2,184 haplotypes from the 1000 Genomes Project. Finally, we introduce a novel approximation that maintains speed without sacrificing imputation accuracy for rare variants.

  18. Improving accuracy of genomic prediction in Brangus cattle by adding animals with imputed low-density SNP genotypes.

    PubMed

    Lopes, F B; Wu, X-L; Li, H; Xu, J; Perkins, T; Genho, J; Ferretti, R; Tait, R G; Bauck, S; Rosa, G J M

    2018-02-01

    Reliable genomic prediction of breeding values for quantitative traits requires the availability of a sufficient number of animals with genotypes and phenotypes in the training set. As of 31 October 2016, there were 3,797 Brangus animals with genotypes and phenotypes. These Brangus animals were genotyped using different commercial SNP chips. Of them, the largest group consisted of 1,535 animals genotyped by the GGP-LDV4 SNP chip. The remaining 2,262 genotypes were imputed to the SNP content of the GGP-LDV4 chip, so that the number of animals available for training the genomic prediction models was more than doubled. The present study showed that the pooling of animals with either original or imputed 40K SNP genotypes substantially increased genomic prediction accuracies on the ten traits. By supplementing imputed genotypes, the relative gains in genomic prediction accuracies on estimated breeding values (EBV) were from 12.60% to 31.27%, and the relative gain in genomic prediction accuracies on de-regressed EBV was slightly smaller (i.e. 0.87%-18.75%). The present study also compared the performance of five genomic prediction models and two cross-validation methods. The five genomic models predicted EBV and de-regressed EBV of the ten traits similarly well. Of the two cross-validation methods, leave-one-out cross-validation maximized the number of animals at the stage of training for genomic prediction. Genomic prediction accuracy (GPA) on the ten quantitative traits was validated in 1,106 newly genotyped Brangus animals based on the SNP effects estimated in the previous set of 3,797 Brangus animals, and they were slightly lower than GPA in the original data. The present study was the first to leverage currently available genotype and phenotype resources in order to harness genomic prediction in Brangus beef cattle. © 2018 Blackwell Verlag GmbH.

  19. Probability genotype imputation method and integrated weighted lasso for QTL identification.

    PubMed

    Demetrashvili, Nino; Van den Heuvel, Edwin R; Wit, Ernst C

    2013-12-30

    Many QTL studies have two common features: (1) often there is missing marker information, (2) among many markers involved in the biological process only a few are causal. In statistics, the second issue falls under the headings "sparsity" and "causal inference". The goal of this work is to develop a two-step statistical methodology for QTL mapping for markers with binary genotypes. The first step introduces a novel imputation method for missing genotypes. Outcomes of the proposed imputation method are probabilities which serve as weights to the second step, namely in weighted lasso. The sparse phenotype inference is employed to select a set of predictive markers for the trait of interest. Simulation studies validate the proposed methodology under a wide range of realistic settings. Furthermore, the methodology outperforms alternative imputation and variable selection methods in such studies. The methodology was applied to an Arabidopsis experiment, containing 69 markers for 165 recombinant inbred lines of an F8 generation. The results confirm previously identified regions; however, several new markers are also found. On the basis of the inferred ROC behavior these markers show good potential for being real, especially for the germination trait Gmax. Our imputation method shows higher accuracy in terms of sensitivity and specificity compared to an alternative imputation method. Also, the proposed weighted lasso outperforms commonly practiced multiple regression as well as the traditional lasso and adaptive lasso with three weighting schemes. This means that under realistic missing data settings this methodology can be used for QTL identification.

  20. A comparison of genomic selection models across time in interior spruce (Picea engelmannii × glauca) using unordered SNP imputation methods

    PubMed Central

    Ratcliffe, B; El-Dien, O G; Klápště, J; Porth, I; Chen, C; Jaquish, B; El-Kassaby, Y A

    2015-01-01

    Genomic selection (GS) potentially offers an unparalleled advantage over traditional pedigree-based selection (TS) methods by reducing the time commitment required to carry out a single cycle of tree improvement. This quality is particularly appealing to tree breeders, where lengthy improvement cycles are the norm. We explored the prospect of implementing GS for interior spruce (Picea engelmannii × glauca) utilizing a genotyped population of 769 trees belonging to 25 open-pollinated families. A series of repeated tree height measurements through ages 3–40 years permitted the testing of GS methods temporally. The genotyping-by-sequencing (GBS) platform was used for single nucleotide polymorphism (SNP) discovery in conjunction with three unordered imputation methods applied to a data set with 60% missing information. Further, three diverse GS models were evaluated based on predictive accuracy (PA), and their marker effects. Moderate levels of PA (0.31–0.55) were observed and were of sufficient capacity to deliver improved selection response over TS. Additionally, PA varied substantially through time accordingly with spatial competition among trees. As expected, temporal PA was well correlated with age-age genetic correlation (r=0.99), and decreased substantially with increasing difference in age between the training and validation populations (0.04–0.47). Moreover, our imputation comparisons indicate that k-nearest neighbor and singular value decomposition yielded a greater number of SNPs and gave higher predictive accuracies than imputing with the mean. Furthermore, the ridge regression (rrBLUP) and BayesCπ (BCπ) models both yielded equal, and better PA than the generalized ridge regression heteroscedastic effect model for the traits evaluated. PMID:26126540
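
Two of the unordered imputation approaches compared above, imputing with the mean and k-nearest neighbor, can be sketched without any marker positions. The code below is a simplified illustration, not the study's implementation: genotypes are assumed to be coded 0/1/2 with `None` for missing, neighbors are other individuals ranked by mean absolute genotype difference over jointly observed markers, and all function names are hypothetical.

```python
def mean_impute(genotypes):
    """Fill each missing call with the mean of the observed calls at that marker."""
    filled = [row[:] for row in genotypes]
    for j in range(len(genotypes[0])):
        observed = [row[j] for row in genotypes if row[j] is not None]
        marker_mean = sum(observed) / len(observed) if observed else 0.0
        for row in filled:
            if row[j] is None:
                row[j] = marker_mean
    return filled


def distance(a, b):
    """Mean absolute difference over markers observed in both individuals."""
    pairs = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
    if not pairs:
        return float("inf")
    return sum(abs(x - y) for x, y in pairs) / len(pairs)


def knn_impute(genotypes, k=2):
    """Fill each missing call with the average call of the k nearest donors."""
    filled = [row[:] for row in genotypes]
    for i, row in enumerate(genotypes):
        for j, call in enumerate(row):
            if call is not None:
                continue
            # Donors are other individuals with an observed call at marker j.
            donors = sorted(
                (distance(row, other), idx)
                for idx, other in enumerate(genotypes)
                if idx != i and other[j] is not None
            )
            nearest = [genotypes[idx][j] for _, idx in donors[:k]]
            if nearest:
                filled[i][j] = sum(nearest) / len(nearest)
    return filled
```

The mean fills every gap at a marker with the same value regardless of the individual, whereas the neighbor-based fill borrows from individuals that look similar at the other markers, which is one plausible reason for the difference in predictive accuracy reported in the abstract.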

  1. A comparison of genomic selection models across time in interior spruce (Picea engelmannii × glauca) using unordered SNP imputation methods.

    PubMed

    Ratcliffe, B; El-Dien, O G; Klápště, J; Porth, I; Chen, C; Jaquish, B; El-Kassaby, Y A

    2015-12-01

    Genomic selection (GS) potentially offers an unparalleled advantage over traditional pedigree-based selection (TS) methods by reducing the time commitment required to carry out a single cycle of tree improvement. This quality is particularly appealing to tree breeders, where lengthy improvement cycles are the norm. We explored the prospect of implementing GS for interior spruce (Picea engelmannii × glauca) utilizing a genotyped population of 769 trees belonging to 25 open-pollinated families. A series of repeated tree height measurements through ages 3-40 years permitted the testing of GS methods temporally. The genotyping-by-sequencing (GBS) platform was used for single nucleotide polymorphism (SNP) discovery in conjunction with three unordered imputation methods applied to a data set with 60% missing information. Further, three diverse GS models were evaluated based on predictive accuracy (PA), and their marker effects. Moderate levels of PA (0.31-0.55) were observed and were of sufficient capacity to deliver improved selection response over TS. Additionally, PA varied substantially through time accordingly with spatial competition among trees. As expected, temporal PA was well correlated with age-age genetic correlation (r=0.99), and decreased substantially with increasing difference in age between the training and validation populations (0.04-0.47). Moreover, our imputation comparisons indicate that k-nearest neighbor and singular value decomposition yielded a greater number of SNPs and gave higher predictive accuracies than imputing with the mean. Furthermore, the ridge regression (rrBLUP) and BayesCπ (BCπ) models both yielded equal, and better PA than the generalized ridge regression heteroscedastic effect model for the traits evaluated.

  2. On the value of Mendelian laws of segregation in families: data quality control, imputation and beyond

    PubMed Central

    Blue, Elizabeth Marchani; Sun, Lei; Tintle, Nathan L.; Wijsman, Ellen M.

    2014-01-01

    When analyzing family data, we dream of perfectly informative data, even whole genome sequences (WGS) for all family members. Reality intervenes, and we find next-generation sequence (NGS) data have error, and are often too expensive or impossible to collect on everyone. Genetic Analysis Workshop 18 groups “Quality Control” and “Dropping WGS through families using GWAS framework” focused on finding, correcting, and using errors within the available sequence and family data, developing methods to infer and analyze missing sequence data among relatives, and testing for linkage and association with simulated blood pressure. We found that single nucleotide polymorphisms, NGS, and imputed data are generally concordant, but that errors are particularly likely at rare variants, homozygous genotypes, within regions with repeated sequences or structural variants, and within sequence data imputed from unrelateds. Admixture complicated identification of cryptic relatedness, but information from Mendelian transmission improved error detection and provided an estimate of the de novo mutation rate. Both genotype and pedigree errors had an adverse effect on subsequent analyses. Computationally fast rules-based imputation was accurate, but could not cover as many loci or subjects as more computationally demanding probability-based methods. Incorporating population-level data into pedigree-based imputation methods improved results. Observed data outperformed imputed data in association testing, but imputed data were also useful. We discuss the strengths and weaknesses of existing methods, and suggest possible future directions. Topics include improving communication between those performing data collection and analysis, establishing thresholds for and improving imputation quality, and incorporating error into imputation and analytical models. PMID:25112184
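
The use of Mendelian transmission for error detection can be illustrated with a trio check. This is a minimal sketch assuming autosomal biallelic genotypes coded as the count of the alternate allele (0, 1, or 2); the function names are hypothetical. A flagged site may reflect a genotyping error, a pedigree error, or a de novo mutation, and the check alone cannot tell these apart.

```python
def transmittable(genotype):
    """Alternate-allele counts (0 or 1) a parent with this genotype can pass on."""
    return {0: {0}, 1: {0, 1}, 2: {1}}[genotype]


def is_mendelian_consistent(father, mother, child):
    """True if the child's genotype can arise from one allele of each parent."""
    return any(
        a + b == child
        for a in transmittable(father)
        for b in transmittable(mother)
    )


def mendelian_errors(trios):
    """Indices of sites whose (father, mother, child) genotypes are inconsistent."""
    return [
        i for i, (father, mother, child) in enumerate(trios)
        if not is_mendelian_consistent(father, mother, child)
    ]


# Four hypothetical sites for one trio: (father, mother, child).
sites = [(0, 0, 0), (0, 0, 1), (1, 1, 2), (2, 2, 0)]
print(mendelian_errors(sites))
```

The second and fourth sites are flagged: two homozygous-reference parents cannot produce a heterozygous child, and two homozygous-alternate parents cannot produce a homozygous-reference child.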

  3. Genotyping by sequencing for genomic prediction in a soybean breeding population.

    PubMed

    Jarquín, Diego; Kocak, Kyle; Posadas, Luis; Hyma, Katie; Jedlicka, Joseph; Graef, George; Lorenz, Aaron

    2014-08-29

    Advances in genotyping technology, such as genotyping by sequencing (GBS), are making genomic prediction more attractive to reduce breeding cycle times and costs associated with phenotyping. Genomic prediction and selection have been studied in several crop species, but no reports exist in soybean. The objectives of this study were to (i) evaluate prospects for genomic selection using GBS in a typical soybean breeding program and (ii) evaluate the effect of GBS marker selection and imputation on genomic prediction accuracy. To achieve these objectives, a set of soybean lines sampled from the University of Nebraska Soybean Breeding Program was genotyped using GBS and evaluated for yield and other agronomic traits at multiple Nebraska locations. Genotyping by sequencing scored 16,502 single nucleotide polymorphisms (SNPs) with minor-allele frequency (MAF) > 0.05 and percentage of missing values ≤ 5% on 301 elite soybean breeding lines. When SNPs with up to 80% missing values were included, 52,349 SNPs were scored. Prediction accuracy for grain yield, assessed using cross validation, was estimated to be 0.64, indicating good potential for using genomic selection for grain yield in soybean. Filtering SNPs based on missing data percentage had little to no effect on prediction accuracy, especially when random forest imputation was used to impute missing values. The highest accuracies were observed when random forest imputation was used on all SNPs, but differences were not significant. A standard additive G-BLUP model was robust; modeling additive-by-additive epistasis did not provide any improvement in prediction accuracy. The effect of training population size on accuracy began to plateau around 100, but accuracy steadily climbed until the largest possible size was used in this analysis. Including only SNPs with MAF > 0.30 provided higher accuracies when training populations were smaller. Using GBS for genomic prediction in soybean holds good potential to expedite genetic gain. Our results suggest that standard additive G-BLUP models can be used on unfiltered, imputed GBS data without loss in accuracy.
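
The marker filters quoted in the abstract (MAF > 0.05 and missing values ≤ 5%, or up to 80% in the relaxed set) can be sketched as below. The genotype coding (alternate-allele count 0/1/2 with `None` for missing) and the function names are assumptions made for illustration; the study's actual pipeline is not described in the abstract.

```python
def marker_stats(calls):
    """Return (minor allele frequency, missing rate) for one marker."""
    observed = [g for g in calls if g is not None]
    missing_rate = 1 - len(observed) / len(calls)
    if not observed:
        return 0.0, 1.0
    alt_freq = sum(observed) / (2 * len(observed))
    return min(alt_freq, 1 - alt_freq), missing_rate


def filter_markers(markers, min_maf=0.05, max_missing=0.05):
    """Indices of markers with MAF above min_maf and missing rate within max_missing."""
    keep = []
    for j, calls in enumerate(markers):
        maf, missing_rate = marker_stats(calls)
        if maf > min_maf and missing_rate <= max_missing:
            keep.append(j)
    return keep


# Three hypothetical markers scored on ten inbred lines.
markers = [
    [0, 2, 0, 2, 0, 2, 0, 2, 0, 2],        # polymorphic, complete
    [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],        # monomorphic
    [0, 2, None, None, 0, 2, 0, 2, 0, 2],  # polymorphic, 20% missing
]
print(filter_markers(markers))
print(filter_markers(markers, max_missing=0.80))
```

Relaxing the missingness threshold admits the third marker, mirroring how the number of scored SNPs grows when markers with up to 80% missing values are included.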

  4. Identity-by-Descent-Based Phasing and Imputation in Founder Populations Using Graphical Models

    PubMed Central

    Palin, Kimmo; Campbell, Harry; Wright, Alan F; Wilson, James F; Durbin, Richard

    2011-01-01

    Accurate knowledge of haplotypes, the combination of alleles co-residing on a single copy of a chromosome, enables powerful gene mapping and sequence imputation methods. Since humans are diploid, haplotypes must be derived from genotypes by a phasing process. In this study, we present a new computational model for haplotype phasing based on pairwise sharing of haplotypes inferred to be Identical-By-Descent (IBD). We apply the Bayesian network based model in a new phasing algorithm, called systematic long-range phasing (SLRP), that can capitalize on the close genetic relationships in isolated founder populations, and show with simulated and real genome-wide genotype data that SLRP substantially reduces the rate of phasing errors compared to previous phasing algorithms. Furthermore, the method accurately identifies regions of IBD, enabling linkage-like studies without pedigrees, and can be used to impute most genotypes with very low error rate. Genet. Epidemiol. 35:853-860, 2011. © 2011 Wiley Periodicals, Inc. PMID:22006673

  5. Applications of the 1000 Genomes Project resources

    PubMed Central

    Zheng-Bradley, Xiangqun

    2017-01-01

    The 1000 Genomes Project created a valuable, worldwide reference for human genetic variation. Common uses of the 1000 Genomes dataset include genotype imputation supporting Genome-wide Association Studies, mapping expression Quantitative Trait Loci, filtering non-pathogenic variants from exome, whole genome and cancer genome sequencing projects, and genetic analysis of population structure and molecular evolution. In this article, we will highlight some of the multiple ways that the 1000 Genomes data can be and has been utilized for genetic studies. PMID:27436001

  6. MaCH-Admix: Genotype Imputation for Admixed Populations

    PubMed Central

    Liu, Eric Yi; Li, Mingyao; Wang, Wei; Li, Yun

    2012-01-01

    Imputation in admixed populations is an important problem but challenging due to the complex linkage disequilibrium (LD) pattern. The emergence of large reference panels such as that from the 1000 Genomes Project enables more accurate imputation in general, and in particular for admixed populations and for uncommon variants. To efficiently benefit from these large reference panels, one key issue to consider in a modern genotype imputation framework is the selection of effective reference panels. In this work, we consider a number of methods for effective reference panel construction inside a hidden Markov model and specific to each target individual. These methods fall into two categories: identity-by-state (IBS) based and ancestry-weighted approaches. We evaluated the performance on individuals from recently admixed populations. Our target samples include 8,421 African Americans and 3,587 Hispanic Americans from the Women’s Health Initiative, which allow assessment of imputation quality for uncommon variants. Our experiments include both large and small reference panels; large, medium, and small target samples; and genome regions of varying levels of LD. We also include BEAGLE and IMPUTE2 for comparison. Experiment results with a large reference panel suggest that our novel piecewise IBS method yields consistently higher imputation quality than other methods/software. The advantage is particularly noteworthy among uncommon variants where we observe up to 5.1% information gain with the difference being highly significant (Wilcoxon signed rank test P-value < 0.0001). Our work is the first that considers various sensible approaches for imputation in admixed populations and presents a comprehensive comparison. PMID:23074066

  7. Phased genotyping-by-sequencing enhances analysis of genetic diversity and reveals divergent copy number variants in maize

    USDA-ARS's Scientific Manuscript database

    High-throughput sequencing of reduced representation genomic libraries has ushered in an era of genotyping-by-sequencing (GBS), where genome-wide genotype data can be obtained for nearly any species. However, there remains a need for imputation-free GBS methods for genotyping large samples taken fr...

  8. Evaluating imputation algorithms for low-depth genotyping-by-sequencing (GBS) data

    USDA-ARS's Scientific Manuscript database

    Well-powered genomic studies require genome-wide marker coverage across many individuals. For non-model species with few genomic resources, high-throughput sequencing (HTS) methods, such as Genotyping-By-Sequencing (GBS), offer an inexpensive alternative to array-based genotyping. Although affordabl...

  9. Genotype Imputation with Thousands of Genomes

    PubMed Central

    Howie, Bryan; Marchini, Jonathan; Stephens, Matthew

    2011-01-01

    Genotype imputation is a statistical technique that is often used to increase the power and resolution of genetic association studies. Imputation methods work by using haplotype patterns in a reference panel to predict unobserved genotypes in a study dataset, and a number of approaches have been proposed for choosing subsets of reference haplotypes that will maximize accuracy in a given study population. These panel selection strategies become harder to apply and interpret as sequencing efforts like the 1000 Genomes Project produce larger and more diverse reference sets, which led us to develop an alternative framework. Our approach is built around a new approximation that uses local sequence similarity to choose a custom reference panel for each study haplotype in each region of the genome. This approximation makes it computationally efficient to use all available reference haplotypes, which allows us to bypass the panel selection step and to improve accuracy at low-frequency variants by capturing unexpected allele sharing among populations. Using data from HapMap 3, we show that our framework produces accurate results in a wide range of human populations. We also use data from the Malaria Genetic Epidemiology Network (MalariaGEN) to provide recommendations for imputation-based studies in Africa. We demonstrate that our approximation improves efficiency in large, sequence-based reference panels, and we discuss general computational strategies for modern reference datasets. Genome-wide association studies will soon be able to harness the power of thousands of reference genomes, and our work provides a practical way for investigators to use this rich information. New methodology from this study is implemented in the IMPUTE2 software package. PMID:22384356
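
The idea of choosing a custom reference panel for each study haplotype by local sequence similarity can be sketched as a nearest-haplotype search within one genomic window. This is an illustration only, assuming similarity is measured by Hamming distance at the markers typed in both datasets; the function names and the parameter `k` are hypothetical and do not describe IMPUTE2's actual interface or internals.

```python
def hamming(a, b):
    """Number of positions at which two equal-length haplotypes differ."""
    return sum(x != y for x, y in zip(a, b))


def select_reference(study_hap, ref_haps, k):
    """Indices of the k reference haplotypes closest to the study haplotype."""
    ranked = sorted(
        range(len(ref_haps)),
        key=lambda i: (hamming(study_hap, ref_haps[i]), i),  # ties broken by index
    )
    return ranked[:k]


# One hypothetical window with four typed markers (alleles coded 0/1).
study = [0, 1, 1, 0]
reference = [
    [0, 1, 1, 0],
    [1, 0, 0, 1],
    [0, 1, 0, 0],
    [1, 1, 1, 1],
]
print(select_reference(study, reference, k=2))
```

Because the selection is repeated per study haplotype and per region, the full reference set stays available while each local computation only touches the closest haplotypes.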

  10. GWAS and fine-mapping of 35 production, reproduction and conformation traits with imputed sequences of 27K Holstein bulls

    USDA-ARS's Scientific Manuscript database

    Fine-mapping of causal variants is becoming feasible for complex traits in livestock GWAS, as an increasing number of animals are sequenced. Imputation has been routinely applied to ascertain sequence variants in large genotyped populations based on small reference populations of sequenced animals. ...

  11. GWAS and fine-mapping of 35 production, reproduction, and conformation traits with imputed sequences of 27K Holstein bulls

    USDA-ARS's Scientific Manuscript database

    Imputation has been routinely applied to ascertain sequence variants in large genotyped populations based on reference populations of sequenced animals. With the implementation of the 1000 Bull Genomes Project and increasing numbers of animals sequenced, fine-mapping of causal variants is becoming f...

  12. Design of a bovine low-density SNP array optimized for imputation

    USDA-ARS's Scientific Manuscript database

    The Illumina BovineLD BeadChip was designed to support imputation to higher density genotypes in dairy and beef breeds by including single-nucleotide polymorphisms (SNPs) that had a high minor allele frequency as well as uniform spacing across the genome except at the ends of the chromosome where de...

  13. Fast and accurate imputation of summary statistics enhances evidence of functional enrichment

    PubMed Central

    Pasaniuc, Bogdan; Zaitlen, Noah; Shi, Huwenbo; Bhatia, Gaurav; Gusev, Alexander; Pickrell, Joseph; Hirschhorn, Joel; Strachan, David P.; Patterson, Nick; Price, Alkes L.

    2014-01-01

    Motivation: Imputation using external reference panels (e.g. 1000 Genomes) is a widely used approach for increasing power in genome-wide association studies and meta-analysis. Existing hidden Markov models (HMM)-based imputation approaches require individual-level genotypes. Here, we develop a new method for Gaussian imputation from summary association statistics, a type of data that is becoming widely available. Results: In simulations using 1000 Genomes (1000G) data, this method recovers 84% (54%) of the effective sample size for common (>5%) and low-frequency (1–5%) variants [increasing to 87% (60%) when summary linkage disequilibrium information is available from target samples] versus the gold standard of 89% (67%) for HMM-based imputation, which cannot be applied to summary statistics. Our approach accounts for the limited sample size of the reference panel, a crucial step to eliminate false-positive associations, and it is computationally very fast. As an empirical demonstration, we apply our method to seven case–control phenotypes from the Wellcome Trust Case Control Consortium (WTCCC) data and a study of height in the British 1958 birth cohort (1958BC). Gaussian imputation from summary statistics recovers 95% (105%) of the effective sample size (as quantified by the ratio of χ2 association statistics) compared with HMM-based imputation from individual-level genotypes at the 227 (176) published single nucleotide polymorphisms (SNPs) in the WTCCC (1958BC height) data. In addition, for publicly available summary statistics from large meta-analyses of four lipid traits, we publicly release imputed summary statistics at 1000G SNPs, which could not have been obtained using previously published methods, and demonstrate their accuracy by masking subsets of the data. 
We show that 1000G imputation using our approach increases the magnitude and statistical evidence of enrichment at genic versus non-genic loci for these traits, as compared with an analysis without 1000G imputation. Thus, imputation of summary statistics will be a valuable tool in future functional enrichment analyses. Availability and implementation: Publicly available software package available at http://bogdan.bioinformatics.ucla.edu/software/. Contact: bpasaniuc@mednet.ucla.edu or aprice@hsph.harvard.edu Supplementary information: Supplementary materials are available at Bioinformatics online. PMID:24990607
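
The abstract does not write out the estimator behind Gaussian imputation of summary statistics. A sketch of the standard result it most plausibly rests on is given here, assuming that association z-scores at typed SNPs ($t$) and untyped SNPs ($i$) are jointly Gaussian with mean zero and covariance equal to the LD (correlation) matrix $\Sigma$ estimated from a reference panel. The notation is illustrative and not taken from the source.

```latex
\begin{aligned}
\begin{pmatrix} z_i \\ z_t \end{pmatrix}
  &\sim \mathcal{N}\!\left(
      \begin{pmatrix} 0 \\ 0 \end{pmatrix},
      \begin{pmatrix} \Sigma_{i,i} & \Sigma_{i,t} \\ \Sigma_{t,i} & \Sigma_{t,t} \end{pmatrix}
    \right), \\
z_i \mid z_t
  &\sim \mathcal{N}\!\left(
      \Sigma_{i,t}\,\Sigma_{t,t}^{-1}\,z_t,\;
      \Sigma_{i,i} - \Sigma_{i,t}\,\Sigma_{t,t}^{-1}\,\Sigma_{t,i}
    \right).
\end{aligned}
```

The conditional mean serves as the imputed z-score and the conditional variance indicates how much information the typed SNPs carry about the untyped one. Since $\Sigma$ is estimated from a finite reference panel, $\Sigma_{t,t}$ generally needs some form of regularization before inversion, which is in line with the abstract's remark that the approach accounts for the limited sample size of the reference panel.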

  14. Applications of the 1000 Genomes Project resources.

    PubMed

    Zheng-Bradley, Xiangqun; Flicek, Paul

    2017-05-01

    The 1000 Genomes Project created a valuable, worldwide reference for human genetic variation. Common uses of the 1000 Genomes dataset include genotype imputation supporting Genome-wide Association Studies, mapping expression Quantitative Trait Loci, filtering non-pathogenic variants from exome, whole genome and cancer genome sequencing projects, and genetic analysis of population structure and molecular evolution. In this article, we will highlight some of the multiple ways that the 1000 Genomes data can be and has been utilized for genetic studies. © The Author 2016. Published by Oxford University Press.

  15. A combined long-range phasing and long haplotype imputation method to impute phase for SNP genotypes

    PubMed Central

    2011-01-01

    Background: Knowing the phase of marker genotype data can be useful in genome-wide association studies, because it makes it possible to use analysis frameworks that account for identity by descent or parent of origin of alleles and it can lead to a large increase in data quantities via genotype or sequence imputation. Long-range phasing and haplotype library imputation constitute a fast and accurate method to impute phase for SNP data. Methods: A long-range phasing and haplotype library imputation algorithm was developed. It combines information from surrogate parents and long haplotypes to resolve phase in a manner that is not dependent on the family structure of a dataset or on the presence of pedigree information. Results: The algorithm performed well in both simulated and real livestock and human datasets in terms of both phasing accuracy and computational efficiency. The percentage of alleles that could be phased in both simulated and real datasets of varying size generally exceeded 98% while the percentage of alleles incorrectly phased in simulated data was generally less than 0.5%. The accuracy of phasing was affected by dataset size, with lower accuracy for dataset sizes less than 1000, but was not affected by effective population size, family data structure, presence or absence of pedigree information, and SNP density. The method was computationally fast. In comparison to a commonly used statistical method (fastPHASE), the current method made about 8% fewer phasing mistakes and ran about 26 times faster for a small dataset. For larger datasets, the differences in computational time are expected to be even greater. A computer program implementing these methods has been made available. Conclusions: The algorithm and software developed in this study make feasible the routine phasing of high-density SNP chips in large datasets. PMID:21388557

  16. Integrating sequence and array data to create an improved 1000 Genomes Project haplotype reference panel.

    PubMed

    Delaneau, Olivier; Marchini, Jonathan

    2014-06-13

    A major use of the 1000 Genomes Project (1000 GP) data is genotype imputation in genome-wide association studies (GWAS). Here we develop a method to estimate haplotypes from low-coverage sequencing data that can take advantage of single-nucleotide polymorphism (SNP) microarray genotypes on the same samples. First the SNP array data are phased to build a backbone (or 'scaffold') of haplotypes across each chromosome. We then phase the sequence data 'onto' this haplotype scaffold. This approach can take advantage of relatedness between sequenced and non-sequenced samples to improve accuracy. We use this method to create a new 1000 GP haplotype reference set for use by the human genetic community. Using a set of validation genotypes at SNP and bi-allelic indels we show that these haplotypes have lower genotype discordance and improved imputation performance into downstream GWAS samples, especially at low-frequency variants.

  17. Identification of Novel Genetic Markers of Breast Cancer Survival

    PubMed Central

    Guo, Qi; Schmidt, Marjanka K.; Kraft, Peter; Canisius, Sander; Chen, Constance; Khan, Sofia; Tyrer, Jonathan; Bolla, Manjeet K.; Wang, Qin; Dennis, Joe; Michailidou, Kyriaki; Lush, Michael; Kar, Siddhartha; Beesley, Jonathan; Dunning, Alison M.; Shah, Mitul; Czene, Kamila; Darabi, Hatef; Eriksson, Mikael; Lambrechts, Diether; Weltens, Caroline; Leunen, Karin; Bojesen, Stig E.; Nordestgaard, Børge G.; Nielsen, Sune F.; Flyger, Henrik; Chang-Claude, Jenny; Rudolph, Anja; Seibold, Petra; Flesch-Janys, Dieter; Blomqvist, Carl; Aittomäki, Kristiina; Fagerholm, Rainer; Muranen, Taru A.; Couch, Fergus J.; Olson, Janet E.; Vachon, Celine; Andrulis, Irene L.; Knight, Julia A.; Glendon, Gord; Mulligan, Anna Marie; Broeks, Annegien; Hogervorst, Frans B.; Haiman, Christopher A.; Henderson, Brian E.; Schumacher, Fredrick; Le Marchand, Loic; Hopper, John L.; Tsimiklis, Helen; Apicella, Carmel; Southey, Melissa C.; Cox, Angela; Cross, Simon S.; Reed, Malcolm W. R.; Giles, Graham G.; Milne, Roger L.; McLean, Catriona; Winqvist, Robert; Pylkäs, Katri; Jukkola-Vuorinen, Arja; Grip, Mervi; Hooning, Maartje J.; Hollestelle, Antoinette; Martens, John W. M.; van den Ouweland, Ans M. W.; Marme, Federik; Schneeweiss, Andreas; Yang, Rongxi; Burwinkel, Barbara; Figueroa, Jonine; Chanock, Stephen J.; Lissowska, Jolanta; Sawyer, Elinor J.; Tomlinson, Ian; Kerin, Michael J.; Miller, Nicola; Brenner, Hermann; Dieffenbach, Aida Karina; Arndt, Volker; Holleczek, Bernd; Mannermaa, Arto; Kataja, Vesa; Kosma, Veli-Matti; Hartikainen, Jaana M.; Li, Jingmei; Brand, Judith S.; Humphreys, Keith; Devilee, Peter; Tollenaar, Rob A. E. M.; Seynaeve, Caroline; Radice, Paolo; Peterlongo, Paolo; Bonanni, Bernardo; Mariani, Paolo; Fasching, Peter A.; Beckmann, Matthias W.; Hein, Alexander; Ekici, Arif B.; Chenevix-Trench, Georgia; Balleine, Rosemary; Phillips, Kelly-Anne; Benitez, Javier; Zamora, M. 
Pilar; Arias Perez, Jose Ignacio; Menéndez, Primitiva; Jakubowska, Anna; Lubinski, Jan; Jaworska-Bieniek, Katarzyna; Durda, Katarzyna; Hamann, Ute; Kabisch, Maria; Ulmer, Hans Ulrich; Rüdiger, Thomas; Margolin, Sara; Kristensen, Vessela; Nord, Silje; Evans, D. Gareth; Abraham, Jean E.; Earl, Helena M.; Hiller, Louise; Dunn, Janet A.; Bowden, Sarah; Berg, Christine; Campa, Daniele; Diver, W. Ryan; Gapstur, Susan M.; Gaudet, Mia M.; Hankinson, Susan E.; Hoover, Robert N.; Hüsing, Anika; Kaaks, Rudolf; Machiela, Mitchell J.; Willett, Walter; Barrdahl, Myrto; Canzian, Federico; Chin, Suet-Feung; Caldas, Carlos; Hunter, David J.; Lindstrom, Sara; García-Closas, Montserrat; Hall, Per; Easton, Douglas F.; Eccles, Diana M.; Rahman, Nazneen; Nevanlinna, Heli; Pharoah, Paul D. P.

    2015-01-01

    Background: Survival after a diagnosis of breast cancer varies considerably between patients, and some of this variation may be because of germline genetic variation. We aimed to identify genetic markers associated with breast cancer–specific survival. Methods: We conducted a large meta-analysis of studies in populations of European ancestry, including 37954 patients with 2900 deaths from breast cancer. Each study had been genotyped for between 200000 and 900000 single nucleotide polymorphisms (SNPs) across the genome; genotypes for nine million common variants were imputed using a common reference panel from the 1000 Genomes Project. We also carried out subtype-specific analyses based on 6881 estrogen receptor (ER)–negative patients (920 events) and 23059 ER-positive patients (1333 events). All statistical tests were two-sided. Results: We identified one new locus (rs2059614 at 11q24.2) associated with survival in ER-negative breast cancer cases (hazard ratio [HR] = 1.95, 95% confidence interval [CI] = 1.55 to 2.47, P = 1.91 × 10^-8). Genotyping a subset of 2113 case patients, of which 300 were ER negative, provided supporting evidence for the quality of the imputation. The association in this set of case patients was stronger for the observed genotypes than for the imputed genotypes. A second locus (rs148760487 at 2q24.2) was associated at genome-wide statistical significance in initial analyses; the association was similar in ER-positive and ER-negative case patients. Here the results of genotyping suggested that the finding was less robust. Conclusions: This is currently the largest study investigating genetic variation associated with breast cancer survival. Our results have potential clinical implications, as they confirm that germline genotype can provide prognostic information in addition to standard tumor prognostic factors. PMID:25890600

  18. Properties of different density genotypes used in dairy cattle evaluation

    USDA-ARS's Scientific Manuscript database

    Dairy cattle breeders have used a 50K chip since April 2008 and a less expensive, lower density (3K) chip since September 2010 in genomic selection. Evaluations from 3K are less reliable because genotype calls are less accurate and missing markers are imputed. After excluding genotypes with < 90% ca...

  19. Reliability increases from combining 50,000- and 777,000-marker genotypes from four countries

    USDA-ARS's Scientific Manuscript database

    Genomic predictions were compared on U.S. scale after combining 50,000 (50K) and 777,000 (HD) marker genotypes across countries. The genotyped Holsteins included 161,341 animals with five marker densities including 1,510 with HD. Imputation was more accurate with FImpute than with findhap across the...

  20. Genomic imputation and evaluation using high density Holstein genotypes

    USDA-ARS's Scientific Manuscript database

    Genomic evaluations for 161,341 Holsteins were computed using 311,725 of the 777,962 markers on the Illumina high-density (HD) chip. Initial edits with 1,741 HD genotypes from 5 breeds revealed that 636,967 markers were usable but that half were redundant. Usable Holstein genotypes included 1,510 an...

  1. Genotype harmonizer: automatic strand alignment and format conversion for genotype data integration.

    PubMed

    Deelen, Patrick; Bonder, Marc Jan; van der Velde, K Joeri; Westra, Harm-Jan; Winder, Erwin; Hendriksen, Dennis; Franke, Lude; Swertz, Morris A

    2014-12-11

    To gain statistical power or to allow fine mapping, researchers typically want to pool data before meta-analyses or genotype imputation. However, the necessary harmonization of genetic datasets is currently error-prone because of many different file formats and lack of clarity about which genomic strand is used as reference. Genotype Harmonizer (GH) is a command-line tool to harmonize genetic datasets by automatically solving issues concerning genomic strand and file format. GH solves the unknown strand issue by aligning ambiguous A/T and G/C SNPs to a specified reference, using linkage disequilibrium patterns without prior knowledge of the used strands. GH supports many common GWAS/NGS genotype formats including PLINK, binary PLINK, VCF, SHAPEIT2 & Oxford GEN. GH is implemented in Java and a large part of the functionality can also be used as Java 'Genotype-IO' API. All software is open source under license LGPLv3 and available from http://www.molgenis.org/systemsgenetics. GH can be used to harmonize genetic datasets across different file formats and can be easily integrated as a step in routine meta-analysis and imputation pipelines.
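
The strand problem described in this abstract can be illustrated with a short sketch. It is not Genotype Harmonizer's API (GH is a Java command-line tool); the function names below are hypothetical, and the sketch only shows why A/T and G/C SNPs cannot be resolved from the alleles alone, whereas other SNPs can be aligned by taking complements.

```python
# Hypothetical helper names for illustration; not part of Genotype Harmonizer.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def is_strand_ambiguous(a1, a2):
    """True for A/T and C/G SNPs, whose allele pair looks identical on both strands."""
    return COMPLEMENT[a1] == a2


def align_to_reference(alleles, ref_alleles):
    """Return the alleles expressed on the reference strand, or None.

    alleles, ref_alleles: 2-tuples of single-base strings.
    None means the alleles alone cannot settle the strand.
    """
    if is_strand_ambiguous(*alleles):
        # Needs extra evidence, for example LD with neighbouring SNPs.
        return None
    if set(alleles) == set(ref_alleles):
        return alleles  # already on the reference strand
    flipped = tuple(COMPLEMENT[a] for a in alleles)
    if set(flipped) == set(ref_alleles):
        return flipped  # opposite strand: report the complements
    return None  # alleles do not match the reference on either strand
```

For an A/G SNP reported as T/C, `align_to_reference(("T", "C"), ("A", "G"))` returns the flipped pair, while any A/T or C/G SNP returns `None` and is left for an LD-based check of the kind the abstract mentions.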

  2. Genomic Prediction and Association Mapping of Curd-Related Traits in Gene Bank Accessions of Cauliflower.

    PubMed

    Thorwarth, Patrick; Yousef, Eltohamy A A; Schmid, Karl J

    2018-02-02

    Genetic resources are an important source of genetic variation for plant breeding. Genome-wide association studies (GWAS) and genomic prediction greatly facilitate the analysis and utilization of useful genetic diversity for improving complex phenotypic traits in crop plants. We explored the potential of GWAS and genomic prediction for improving curd-related traits in cauliflower (Brassica oleracea var. botrytis) by combining 174 randomly selected cauliflower gene bank accessions from two different gene banks. The collection was genotyped with genotyping-by-sequencing (GBS) and phenotyped for six curd-related traits at two locations and three growing seasons. A GWAS analysis based on 120,693 single-nucleotide polymorphisms identified a total of 24 significant associations for curd-related traits. The potential for genomic prediction was assessed with a genomic best linear unbiased prediction model and BayesB. Prediction abilities ranged from 0.10 to 0.66 for different traits and did not differ between prediction methods. Imputation of missing genotypes only slightly improved prediction ability. Our results demonstrate that GWAS and genomic prediction in combination with GBS and phenotyping of highly heritable traits can be used to identify useful quantitative trait loci and genotypes among genetically diverse gene bank material for subsequent utilization as genetic resources in cauliflower breeding. Copyright © 2018 Thorwarth et al.

  3. EPIGEN-Brazil Initiative resources: a Latin American imputation panel and the Scientific Workflow.

    PubMed

    Magalhães, Wagner C S; Araujo, Nathalia M; Leal, Thiago P; Araujo, Gilderlanio S; Viriato, Paula J S; Kehdy, Fernanda S; Costa, Gustavo N; Barreto, Mauricio L; Horta, Bernardo L; Lima-Costa, Maria Fernanda; Pereira, Alexandre C; Tarazona-Santos, Eduardo; Rodrigues, Maíra R

    2018-06-14

    EPIGEN-Brazil is one of the largest Latin American initiatives at the interface of human genomics, public health, and computational biology. Here, we present two resources to address two challenges to the global dissemination of precision medicine and the development of the bioinformatics know-how to support it. To address the underrepresentation of non-European individuals in human genome diversity studies, we present the EPIGEN-5M+1KGP imputation panel, the fusion of the public 1000 Genomes Project (1KGP) Phase 3 imputation panel with haplotypes derived from the EPIGEN-5M data set (a product of the genotyping of 4.3 million SNPs in 265 admixed individuals from the EPIGEN-Brazil Initiative). When we imputed a target SNP data set (6487 admixed individuals genotyped for 2.2 million SNPs from the EPIGEN-Brazil project) with the EPIGEN-5M+1KGP panel, we gained 140,452 more SNPs in total than when using the 1KGP Phase 3 panel alone and 788,873 additional high confidence SNPs (info score ≥ 0.8). Thus, the major effect of the inclusion of the EPIGEN-5M data set in this new imputation panel is not only to gain more SNPs but also to improve the quality of imputation. To address the lack of transparency and reproducibility of bioinformatics protocols, we present a conceptual Scientific Workflow in the form of a website that models the scientific process (by including publications, flowcharts, masterscripts, documents, and bioinformatics protocols), making it accessible and interactive. Its applicability is shown in the context of the development of our EPIGEN-5M+1KGP imputation panel. The Scientific Workflow also serves as a repository of bioinformatics resources. © 2018 Magalhães et al.; Published by Cold Spring Harbor Laboratory Press.

  4. A comprehensive survey of genetic variation in 20,691 subjects from four large cohorts

    PubMed Central

    Loomis, Stephanie; Turman, Constance; Huang, Hongyan; Huang, Jinyan; Aschard, Hugues; Chan, Andrew T.; Choi, Hyon; Cornelis, Marilyn; Curhan, Gary; De Vivo, Immaculata; Eliassen, A. Heather; Fuchs, Charles; Gaziano, Michael; Hankinson, Susan E.; Hu, Frank; Jensen, Majken; Kang, Jae H.; Kabrhel, Christopher; Liang, Liming; Pasquale, Louis R.; Rimm, Eric; Stampfer, Meir J.; Tamimi, Rulla M.; Tworoger, Shelley S.; Wiggs, Janey L.; Hunter, David J.; Kraft, Peter

    2017-01-01

    The Nurses’ Health Study (NHS), Nurses’ Health Study II (NHSII), Health Professionals Follow Up Study (HPFS) and the Physicians Health Study (PHS) have collected detailed longitudinal data on multiple exposures and traits for approximately 310,000 study participants over the last 35 years. Over 160,000 study participants across the cohorts have donated a DNA sample and to date, 20,691 subjects have been genotyped as part of genome-wide association studies (GWAS) of twelve primary outcomes. However, these studies utilized six different GWAS arrays making it difficult to conduct analyses of secondary phenotypes or share controls across studies. To allow for secondary analyses of these data, we have created three new datasets merged by platform family and performed imputation using a common reference panel, the 1,000 Genomes Phase I release. Here, we describe the methodology behind the data merging and imputation and present imputation quality statistics and association results from two GWAS of secondary phenotypes (body mass index (BMI) and venous thromboembolism (VTE)). We observed the strongest BMI association for the FTO SNP rs55872725 (β = 0.45, p = 3.48 × 10^-22), and using a significance level of p = 0.05, we replicated 19 out of 32 known BMI SNPs. For VTE, we observed the strongest association for the rs2040445 SNP (OR = 2.17, 95% CI: 1.79–2.63, p = 2.70 × 10^-15), located downstream of F5 and also observed significant associations for the known ABO and F11 regions. This pooled resource can be used to maximize power in GWAS of phenotypes collected across the cohorts and for studying gene-environment interactions as well as rare phenotypes and genotypes. PMID:28301549

  5. Fast and accurate imputation of summary statistics enhances evidence of functional enrichment.

    PubMed

    Pasaniuc, Bogdan; Zaitlen, Noah; Shi, Huwenbo; Bhatia, Gaurav; Gusev, Alexander; Pickrell, Joseph; Hirschhorn, Joel; Strachan, David P; Patterson, Nick; Price, Alkes L

    2014-10-15

    Imputation using external reference panels (e.g. 1000 Genomes) is a widely used approach for increasing power in genome-wide association studies and meta-analysis. Existing hidden Markov models (HMM)-based imputation approaches require individual-level genotypes. Here, we develop a new method for Gaussian imputation from summary association statistics, a type of data that is becoming widely available. In simulations using 1000 Genomes (1000G) data, this method recovers 84% (54%) of the effective sample size for common (>5%) and low-frequency (1-5%) variants [increasing to 87% (60%) when summary linkage disequilibrium information is available from target samples] versus the gold standard of 89% (67%) for HMM-based imputation, which cannot be applied to summary statistics. Our approach accounts for the limited sample size of the reference panel, a crucial step to eliminate false-positive associations, and it is computationally very fast. As an empirical demonstration, we apply our method to seven case-control phenotypes from the Wellcome Trust Case Control Consortium (WTCCC) data and a study of height in the British 1958 birth cohort (1958BC). Gaussian imputation from summary statistics recovers 95% (105%) of the effective sample size (as quantified by the ratio of χ2 association statistics) compared with HMM-based imputation from individual-level genotypes at the 227 (176) published single nucleotide polymorphisms (SNPs) in the WTCCC (1958BC height) data. In addition, for publicly available summary statistics from large meta-analyses of four lipid traits, we publicly release imputed summary statistics at 1000G SNPs, which could not have been obtained using previously published methods, and demonstrate their accuracy by masking subsets of the data. 
We show that 1000G imputation using our approach increases the magnitude and statistical evidence of enrichment at genic versus non-genic loci for these traits, as compared with an analysis without 1000G imputation. Thus, imputation of summary statistics will be a valuable tool in future functional enrichment analyses. Availability and implementation: Publicly available software package available at http://bogdan.bioinformatics.ucla.edu/software/. Contact: bpasaniuc@mednet.ucla.edu or aprice@hsph.harvard.edu Supplementary information: Supplementary materials are available at Bioinformatics online. © The Author 2014. Published by Oxford University Press.
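
The idea of Gaussian imputation from summary statistics can be sketched as a conditional mean under a multivariate normal model: the z-score at an untyped SNP is predicted from the z-scores at typed SNPs using reference-panel LD. The code below is a minimal illustration under that assumption, not the authors' released software. The function names are hypothetical, and the `ridge` term is an assumed, generic way of regularising an LD matrix estimated from a small reference panel; its default value is illustrative only.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [x - f * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def impute_z(ld_tt, ld_it, z_typed, ridge=0.1):
    """Conditional-mean z-score at one untyped SNP (hypothetical helper).

    ld_tt:   LD (correlation) matrix among typed SNPs, from a reference panel
    ld_it:   correlations between the untyped SNP and each typed SNP
    z_typed: association z-scores observed at the typed SNPs
    ridge:   assumed regularisation added to the diagonal of ld_tt
    """
    n = len(z_typed)
    reg = [[ld_tt[i][j] + (ridge if i == j else 0.0) for j in range(n)]
           for i in range(n)]
    # weights = (ld_tt + ridge * I)^-1 ld_it; valid because the matrix is symmetric
    weights = solve(reg, ld_it)
    return sum(w * z for w, z in zip(weights, z_typed))
```

With a single typed SNP, no regularisation, an LD correlation of 0.8 and an observed z-score of 5.0, the imputed z-score is 0.8 × 5.0 = 4.0; a positive `ridge` shrinks the imputed value toward zero.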

  6. Saturated linkage map construction in Rubus idaeus using genotyping by sequencing and genome-independent imputation

    USDA-ARS's Scientific Manuscript database

    Rapid development of highly saturated genetic maps aids molecular breeding, which can accelerate gain per breeding cycle in woody perennial plants such as Rubus idaeus (red raspberry). Recently, robust genotyping methods based on high-throughput sequencing were developed, which provide high marker d...

  7. Identification and characterization of novel associations in the CASP8/ALS2CR12 region on chromosome 2 with breast cancer risk

    PubMed Central

    Lin, Wei-Yu; Camp, Nicola J.; Ghoussaini, Maya; Beesley, Jonathan; Michailidou, Kyriaki; Hopper, John L.; Apicella, Carmel; Southey, Melissa C.; Stone, Jennifer; Schmidt, Marjanka K.; Broeks, Annegien; Van't Veer, Laura J.; Th Rutgers, Emiel J.; Muir, Kenneth; Lophatananon, Artitaya; Stewart-Brown, Sarah; Siriwanarangsan, Pornthep; Fasching, Peter A.; Haeberle, Lothar; Ekici, Arif B.; Beckmann, Matthias W.; Peto, Julian; Dos-Santos-Silva, Isabel; Fletcher, Olivia; Johnson, Nichola; Bolla, Manjeet K.; Wang, Qin; Dennis, Joe; Sawyer, Elinor J.; Cheng, Timothy; Tomlinson, Ian; Kerin, Michael J.; Miller, Nicola; Marmé, Frederik; Surowy, Harald M.; Burwinkel, Barbara; Guénel, Pascal; Truong, Thérèse; Menegaux, Florence; Mulot, Claire; Bojesen, Stig E.; Nordestgaard, Børge G.; Nielsen, Sune F.; Flyger, Henrik; Benitez, Javier; Zamora, M. Pilar; Arias Perez, Jose Ignacio; Menéndez, Primitiva; González-Neira, Anna; Pita, Guillermo; Alonso, M. Rosario; Álvarez, Nuria; Herrero, Daniel; Anton-Culver, Hoda; Brenner, Hermann; Dieffenbach, Aida Karina; Arndt, Volker; Stegmaier, Christa; Meindl, Alfons; Lichtner, Peter; Schmutzler, Rita K.; Müller-Myhsok, Bertram; Brauch, Hiltrud; Brüning, Thomas; Ko, Yon-Dschun; Tessier, Daniel C.; Vincent, Daniel; Bacot, Francois; Nevanlinna, Heli; Aittomäki, Kristiina; Blomqvist, Carl; Khan, Sofia; Matsuo, Keitaro; Ito, Hidemi; Iwata, Hiroji; Horio, Akiyo; Bogdanova, Natalia V.; Antonenkova, Natalia N.; Dörk, Thilo; Lindblom, Annika; Margolin, Sara; Mannermaa, Arto; Kataja, Vesa; Kosma, Veli-Matti; Hartikainen, Jaana M.; Wu, Anna H.; Tseng, Chiu-Chen; Van Den Berg, David; Stram, Daniel O.; Neven, Patrick; Wauters, Els; Wildiers, Hans; Lambrechts, Diether; Chang-Claude, Jenny; Rudolph, Anja; Seibold, Petra; Flesch-Janys, Dieter; Radice, Paolo; Peterlongo, Paolo; Manoukian, Siranoush; Bonanni, Bernardo; Couch, Fergus J.; Wang, Xianshu; Vachon, Celine; Purrington, Kristen; Giles, Graham G.; Milne, Roger L.; Mclean, Catriona; Haiman, 
Christopher A.; Henderson, Brian E.; Schumacher, Fredrick; Le Marchand, Loic; Simard, Jacques; Goldberg, Mark S.; Labrèche, France; Dumont, Martine; Teo, Soo Hwang; Yip, Cheng Har; Hassan, Norhashimah; Vithana, Eranga Nishanthie; Kristensen, Vessela; Zheng, Wei; Deming-Halverson, Sandra; Shrubsole, Martha J.; Long, Jirong; Winqvist, Robert; Pylkäs, Katri; Jukkola-Vuorinen, Arja; Kauppila, Saila; Andrulis, Irene L.; Knight, Julia A.; Glendon, Gord; Tchatchou, Sandrine; Devilee, Peter; Tollenaar, Robert A.E.M.; Seynaeve, Caroline; Van Asperen, Christi J.; García-Closas, Montserrat; Figueroa, Jonine; Lissowska, Jolanta; Brinton, Louise; Czene, Kamila; Darabi, Hatef; Eriksson, Mikael; Brand, Judith S.; Hooning, Maartje J.; Hollestelle, Antoinette; Van Den Ouweland, Ans M.W.; Jager, Agnes; Li, Jingmei; Liu, Jianjun; Humphreys, Keith; Shu, Xiao-Ou; Lu, Wei; Gao, Yu-Tang; Cai, Hui; Cross, Simon S.; Reed, Malcolm W. R.; Blot, William; Signorello, Lisa B.; Cai, Qiuyin; Pharoah, Paul D.P.; Perkins, Barbara; Shah, Mitul; Blows, Fiona M.; Kang, Daehee; Yoo, Keun-Young; Noh, Dong-Young; Hartman, Mikael; Miao, Hui; Chia, Kee Seng; Putti, Thomas Choudary; Hamann, Ute; Luccarini, Craig; Baynes, Caroline; Ahmed, Shahana; Maranian, Mel; Healey, Catherine S.; Jakubowska, Anna; Lubinski, Jan; Jaworska-Bieniek, Katarzyna; Durda, Katarzyna; Sangrajrang, Suleeporn; Gaborieau, Valerie; Brennan, Paul; Mckay, James; Slager, Susan; Toland, Amanda E.; Yannoukakos, Drakoulis; Shen, Chen-Yang; Hsiung, Chia-Ni; Wu, Pei-Ei; Ding, Shian-ling; Ashworth, Alan; Jones, Michael; Orr, Nick; Swerdlow, Anthony J; Tsimiklis, Helen; Makalic, Enes; Schmidt, Daniel F.; Bui, Quang M.; Chanock, Stephen J.; Hunter, David J.; Hein, Rebecca; Dahmen, Norbert; Beckmann, Lars; Aaltonen, Kirsimari; Muranen, Taru A.; Heikkinen, Tuomas; Irwanto, Astrid; Rahman, Nazneen; Turnbull, Clare A.; Waisfisz, Quinten; Meijers-Heijboer, Hanne E. 
J.; Adank, Muriel A.; Van Der Luijt, Rob B.; Hall, Per; Chenevix-Trench, Georgia; Dunning, Alison; Easton, Douglas F.; Cox, Angela

    2015-01-01

    Previous studies have suggested that polymorphisms in CASP8 on chromosome 2 are associated with breast cancer risk. To clarify the role of CASP8 in breast cancer susceptibility, we carried out dense genotyping of this region in the Breast Cancer Association Consortium (BCAC). Single-nucleotide polymorphisms (SNPs) spanning a 1 Mb region around CASP8 were genotyped in 46 450 breast cancer cases and 42 600 controls of European origin from 41 studies participating in the BCAC as part of a custom genotyping array experiment (iCOGS). Missing genotypes and SNPs were imputed and, after quality exclusions, 501 typed and 1232 imputed SNPs were included in logistic regression models adjusting for study and ancestry principal components. The SNPs retained in the final model were investigated further in data from nine genome-wide association studies (GWAS) comprising in total 10 052 case and 12 575 control subjects. The most significant association signal observed in European subjects was for the imputed intronic SNP rs1830298 in ALS2CR12 (telomeric to CASP8), with per allele odds ratio and 95% confidence interval [OR (95% confidence interval, CI)] for the minor allele of 1.05 (1.03–1.07), P = 1 × 10−5. Three additional independent signals from intronic SNPs were identified, in CASP8 (rs36043647), ALS2CR11 (rs59278883) and CFLAR (rs7558475). The association with rs1830298 was replicated in the imputed results from the combined GWAS (P = 3 × 10−6), yielding a combined OR (95% CI) of 1.06 (1.04–1.08), P = 1 × 10−9. Analyses of gene expression associations in peripheral blood and normal breast tissue indicate that CASP8 might be the target gene, suggesting a mechanism involving apoptosis. PMID:25168388

  8. Identification and characterization of novel associations in the CASP8/ALS2CR12 region on chromosome 2 with breast cancer risk.

    PubMed

    Lin, Wei-Yu; Camp, Nicola J; Ghoussaini, Maya; Beesley, Jonathan; Michailidou, Kyriaki; Hopper, John L; Apicella, Carmel; Southey, Melissa C; Stone, Jennifer; Schmidt, Marjanka K; Broeks, Annegien; Van't Veer, Laura J; Th Rutgers, Emiel J; Muir, Kenneth; Lophatananon, Artitaya; Stewart-Brown, Sarah; Siriwanarangsan, Pornthep; Fasching, Peter A; Haeberle, Lothar; Ekici, Arif B; Beckmann, Matthias W; Peto, Julian; Dos-Santos-Silva, Isabel; Fletcher, Olivia; Johnson, Nichola; Bolla, Manjeet K; Wang, Qin; Dennis, Joe; Sawyer, Elinor J; Cheng, Timothy; Tomlinson, Ian; Kerin, Michael J; Miller, Nicola; Marmé, Frederik; Surowy, Harald M; Burwinkel, Barbara; Guénel, Pascal; Truong, Thérèse; Menegaux, Florence; Mulot, Claire; Bojesen, Stig E; Nordestgaard, Børge G; Nielsen, Sune F; Flyger, Henrik; Benitez, Javier; Zamora, M Pilar; Arias Perez, Jose Ignacio; Menéndez, Primitiva; González-Neira, Anna; Pita, Guillermo; Alonso, M Rosario; Alvarez, Nuria; Herrero, Daniel; Anton-Culver, Hoda; Brenner, Hermann; Dieffenbach, Aida Karina; Arndt, Volker; Stegmaier, Christa; Meindl, Alfons; Lichtner, Peter; Schmutzler, Rita K; Müller-Myhsok, Bertram; Brauch, Hiltrud; Brüning, Thomas; Ko, Yon-Dschun; Tessier, Daniel C; Vincent, Daniel; Bacot, Francois; Nevanlinna, Heli; Aittomäki, Kristiina; Blomqvist, Carl; Khan, Sofia; Matsuo, Keitaro; Ito, Hidemi; Iwata, Hiroji; Horio, Akiyo; Bogdanova, Natalia V; Antonenkova, Natalia N; Dörk, Thilo; Lindblom, Annika; Margolin, Sara; Mannermaa, Arto; Kataja, Vesa; Kosma, Veli-Matti; Hartikainen, Jaana M; Wu, Anna H; Tseng, Chiu-Chen; Van Den Berg, David; Stram, Daniel O; Neven, Patrick; Wauters, Els; Wildiers, Hans; Lambrechts, Diether; Chang-Claude, Jenny; Rudolph, Anja; Seibold, Petra; Flesch-Janys, Dieter; Radice, Paolo; Peterlongo, Paolo; Manoukian, Siranoush; Bonanni, Bernardo; Couch, Fergus J; Wang, Xianshu; Vachon, Celine; Purrington, Kristen; Giles, Graham G; Milne, Roger L; Mclean, Catriona; Haiman, Christopher A; Henderson, Brian E; 
Schumacher, Fredrick; Le Marchand, Loic; Simard, Jacques; Goldberg, Mark S; Labrèche, France; Dumont, Martine; Teo, Soo Hwang; Yip, Cheng Har; Hassan, Norhashimah; Vithana, Eranga Nishanthie; Kristensen, Vessela; Zheng, Wei; Deming-Halverson, Sandra; Shrubsole, Martha J; Long, Jirong; Winqvist, Robert; Pylkäs, Katri; Jukkola-Vuorinen, Arja; Kauppila, Saila; Andrulis, Irene L; Knight, Julia A; Glendon, Gord; Tchatchou, Sandrine; Devilee, Peter; Tollenaar, Robert A E M; Seynaeve, Caroline; Van Asperen, Christi J; García-Closas, Montserrat; Figueroa, Jonine; Lissowska, Jolanta; Brinton, Louise; Czene, Kamila; Darabi, Hatef; Eriksson, Mikael; Brand, Judith S; Hooning, Maartje J; Hollestelle, Antoinette; Van Den Ouweland, Ans M W; Jager, Agnes; Li, Jingmei; Liu, Jianjun; Humphreys, Keith; Shu, Xiao-Ou; Lu, Wei; Gao, Yu-Tang; Cai, Hui; Cross, Simon S; Reed, Malcolm W R; Blot, William; Signorello, Lisa B; Cai, Qiuyin; Pharoah, Paul D P; Perkins, Barbara; Shah, Mitul; Blows, Fiona M; Kang, Daehee; Yoo, Keun-Young; Noh, Dong-Young; Hartman, Mikael; Miao, Hui; Chia, Kee Seng; Putti, Thomas Choudary; Hamann, Ute; Luccarini, Craig; Baynes, Caroline; Ahmed, Shahana; Maranian, Mel; Healey, Catherine S; Jakubowska, Anna; Lubinski, Jan; Jaworska-Bieniek, Katarzyna; Durda, Katarzyna; Sangrajrang, Suleeporn; Gaborieau, Valerie; Brennan, Paul; Mckay, James; Slager, Susan; Toland, Amanda E; Yannoukakos, Drakoulis; Shen, Chen-Yang; Hsiung, Chia-Ni; Wu, Pei-Ei; Ding, Shian-Ling; Ashworth, Alan; Jones, Michael; Orr, Nick; Swerdlow, Anthony J; Tsimiklis, Helen; Makalic, Enes; Schmidt, Daniel F; Bui, Quang M; Chanock, Stephen J; Hunter, David J; Hein, Rebecca; Dahmen, Norbert; Beckmann, Lars; Aaltonen, Kirsimari; Muranen, Taru A; Heikkinen, Tuomas; Irwanto, Astrid; Rahman, Nazneen; Turnbull, Clare A; Waisfisz, Quinten; Meijers-Heijboer, Hanne E J; Adank, Muriel A; Van Der Luijt, Rob B; Hall, Per; Chenevix-Trench, Georgia; Dunning, Alison; Easton, Douglas F; Cox, Angela

    2015-01-01

    Previous studies have suggested that polymorphisms in CASP8 on chromosome 2 are associated with breast cancer risk. To clarify the role of CASP8 in breast cancer susceptibility, we carried out dense genotyping of this region in the Breast Cancer Association Consortium (BCAC). Single-nucleotide polymorphisms (SNPs) spanning a 1 Mb region around CASP8 were genotyped in 46 450 breast cancer cases and 42 600 controls of European origin from 41 studies participating in the BCAC as part of a custom genotyping array experiment (iCOGS). Missing genotypes and SNPs were imputed and, after quality exclusions, 501 typed and 1232 imputed SNPs were included in logistic regression models adjusting for study and ancestry principal components. The SNPs retained in the final model were investigated further in data from nine genome-wide association studies (GWAS) comprising in total 10 052 case and 12 575 control subjects. The most significant association signal observed in European subjects was for the imputed intronic SNP rs1830298 in ALS2CR12 (telomeric to CASP8), with per allele odds ratio and 95% confidence interval [OR (95% confidence interval, CI)] for the minor allele of 1.05 (1.03-1.07), P = 1 × 10^-5. Three additional independent signals from intronic SNPs were identified, in CASP8 (rs36043647), ALS2CR11 (rs59278883) and CFLAR (rs7558475). The association with rs1830298 was replicated in the imputed results from the combined GWAS (P = 3 × 10^-6), yielding a combined OR (95% CI) of 1.06 (1.04-1.08), P = 1 × 10^-9. Analyses of gene expression associations in peripheral blood and normal breast tissue indicate that CASP8 might be the target gene, suggesting a mechanism involving apoptosis. © The Author 2014. Published by Oxford University Press.

  9. Inclusion of Population-specific Reference Panel from India to the 1000 Genomes Phase 3 Panel Improves Imputation Accuracy.

    PubMed

    Ahmad, Meraj; Sinha, Anubhav; Ghosh, Sreya; Kumar, Vikrant; Davila, Sonia; Yajnik, Chittaranjan S; Chandak, Giriraj R

    2017-07-27

    Imputation is a computational method based on the principle of haplotype sharing allowing enrichment of genome-wide association study datasets. It depends on the haplotype structure of the population and density of the genotype data. The 1000 Genomes Project led to the generation of imputation reference panels which have been used globally. However, recent studies have shown that population-specific panels provide better enrichment of genome-wide variants. We compared the imputation accuracy using 1000 Genomes phase 3 reference panel and a panel generated from genome-wide data on 407 individuals from Western India (WIP). The concordance of imputed variants was cross-checked with next-generation re-sequencing data on a subset of genomic regions. Further, using the genome-wide data from 1880 individuals, we demonstrate that WIP works better than the 1000 Genomes phase 3 panel and when merged with it, significantly improves the imputation accuracy throughout the minor allele frequency range. We also show that imputation using only the South Asian component of the 1000 Genomes phase 3 panel works as well as the merged panel, making it a computationally less intensive job. Thus, our study stresses that imputation accuracy using 1000 Genomes phase 3 panel can be further improved by including population-specific reference panels from South Asia.

  10. Exploiting genotyping by sequencing to characterize the genomic structure of the American cranberry through high-density linkage mapping

    USDA-ARS's Scientific Manuscript database

    The application of genotyping by sequencing (GBS) approaches, combined with data imputation methodologies, is narrowing the genetic knowledge gap between major and understudied, minor crops. GBS is an excellent tool to characterize the genomic structure of recently domesticated (~200 years) and unde...

  11. The African Genome Variation Project shapes medical genetics in Africa

    NASA Astrophysics Data System (ADS)

    Gurdasani, Deepti; Carstensen, Tommy; Tekola-Ayele, Fasil; Pagani, Luca; Tachmazidou, Ioanna; Hatzikotoulas, Konstantinos; Karthikeyan, Savita; Iles, Louise; Pollard, Martin O.; Choudhury, Ananyo; Ritchie, Graham R. S.; Xue, Yali; Asimit, Jennifer; Nsubuga, Rebecca N.; Young, Elizabeth H.; Pomilla, Cristina; Kivinen, Katja; Rockett, Kirk; Kamali, Anatoli; Doumatey, Ayo P.; Asiki, Gershim; Seeley, Janet; Sisay-Joof, Fatoumatta; Jallow, Muminatou; Tollman, Stephen; Mekonnen, Ephrem; Ekong, Rosemary; Oljira, Tamiru; Bradman, Neil; Bojang, Kalifa; Ramsay, Michele; Adeyemo, Adebowale; Bekele, Endashaw; Motala, Ayesha; Norris, Shane A.; Pirie, Fraser; Kaleebu, Pontiano; Kwiatkowski, Dominic; Tyler-Smith, Chris; Rotimi, Charles; Zeggini, Eleftheria; Sandhu, Manjinder S.

    2015-01-01

    Given the importance of Africa to studies of human origins and disease susceptibility, detailed characterization of African genetic diversity is needed. The African Genome Variation Project provides a resource with which to design, implement and interpret genomic studies in sub-Saharan Africa and worldwide. The African Genome Variation Project represents dense genotypes from 1,481 individuals and whole-genome sequences from 320 individuals across sub-Saharan Africa. Using this resource, we find novel evidence of complex, regionally distinct hunter-gatherer and Eurasian admixture across sub-Saharan Africa. We identify new loci under selection, including loci related to malaria susceptibility and hypertension. We show that modern imputation panels (sets of reference genotypes from which unobserved or missing genotypes in study sets can be inferred) can identify association signals at highly differentiated loci across populations in sub-Saharan Africa. Using whole-genome sequencing, we demonstrate further improvements in imputation accuracy, strengthening the case for large-scale sequencing efforts of diverse African haplotypes. Finally, we present an efficient genotype array design capturing common genetic variation in Africa.

  12. The African Genome Variation Project shapes medical genetics in Africa.

    PubMed

    Gurdasani, Deepti; Carstensen, Tommy; Tekola-Ayele, Fasil; Pagani, Luca; Tachmazidou, Ioanna; Hatzikotoulas, Konstantinos; Karthikeyan, Savita; Iles, Louise; Pollard, Martin O; Choudhury, Ananyo; Ritchie, Graham R S; Xue, Yali; Asimit, Jennifer; Nsubuga, Rebecca N; Young, Elizabeth H; Pomilla, Cristina; Kivinen, Katja; Rockett, Kirk; Kamali, Anatoli; Doumatey, Ayo P; Asiki, Gershim; Seeley, Janet; Sisay-Joof, Fatoumatta; Jallow, Muminatou; Tollman, Stephen; Mekonnen, Ephrem; Ekong, Rosemary; Oljira, Tamiru; Bradman, Neil; Bojang, Kalifa; Ramsay, Michele; Adeyemo, Adebowale; Bekele, Endashaw; Motala, Ayesha; Norris, Shane A; Pirie, Fraser; Kaleebu, Pontiano; Kwiatkowski, Dominic; Tyler-Smith, Chris; Rotimi, Charles; Zeggini, Eleftheria; Sandhu, Manjinder S

    2015-01-15

    Given the importance of Africa to studies of human origins and disease susceptibility, detailed characterization of African genetic diversity is needed. The African Genome Variation Project provides a resource with which to design, implement and interpret genomic studies in sub-Saharan Africa and worldwide. The African Genome Variation Project represents dense genotypes from 1,481 individuals and whole-genome sequences from 320 individuals across sub-Saharan Africa. Using this resource, we find novel evidence of complex, regionally distinct hunter-gatherer and Eurasian admixture across sub-Saharan Africa. We identify new loci under selection, including loci related to malaria susceptibility and hypertension. We show that modern imputation panels (sets of reference genotypes from which unobserved or missing genotypes in study sets can be inferred) can identify association signals at highly differentiated loci across populations in sub-Saharan Africa. Using whole-genome sequencing, we demonstrate further improvements in imputation accuracy, strengthening the case for large-scale sequencing efforts of diverse African haplotypes. Finally, we present an efficient genotype array design capturing common genetic variation in Africa.

  13. Improved imputation accuracy in Hispanic/Latino populations with larger and more diverse reference panels: applications in the Hispanic Community Health Study/Study of Latinos (HCHS/SOL)

    PubMed Central

    Nelson, Sarah C.; Stilp, Adrienne M.; Papanicolaou, George J.; Taylor, Kent D.; Rotter, Jerome I.; Thornton, Timothy A.; Laurie, Cathy C.

    2016-01-01

    Imputation is commonly used in genome-wide association studies to expand the set of genetic variants available for analysis. Larger and more diverse reference panels, such as the final Phase 3 of the 1000 Genomes Project, hold promise for improving imputation accuracy in genetically diverse populations such as Hispanics/Latinos in the USA. Here, we sought to empirically evaluate imputation accuracy when imputing to a 1000 Genomes Phase 3 versus a Phase 1 reference, using participants from the Hispanic Community Health Study/Study of Latinos. Our assessments included calculating the correlation between imputed and observed allelic dosage in a subset of samples genotyped on a supplemental array. We observed that the Phase 3 reference yielded higher accuracy at rare variants, but that the two reference panels were comparable at common variants. At a sample level, the Phase 3 reference improved imputation accuracy in Hispanic/Latino samples from the Caribbean more than for Mainland samples, which we attribute primarily to the additional reference panel samples available in Phase 3. We conclude that a 1000 Genomes Project Phase 3 reference panel can yield improved imputation accuracy compared with Phase 1, particularly for rare variants and for samples of certain genetic ancestry compositions. Our findings can inform imputation design for other genome-wide association studies of participants with diverse ancestries, especially as larger and more diverse reference panels continue to become available. PMID:27346520
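
    The accuracy measure named in the abstract above, the correlation between imputed and observed allelic dosage, can be illustrated with a minimal sketch. The function name dosage_r2 and the choice to report the squared Pearson correlation for a single variant are assumptions made here for illustration; the abstract does not give the authors' exact computation.

```python
import math


def dosage_r2(observed, imputed):
    """Squared Pearson correlation between observed and imputed dosages.

    Both arguments are equal-length sequences with one allelic dosage
    (a value between 0 and 2) per sample for a single variant.
    Returns NaN when either sequence has no variance.
    """
    n = len(observed)
    if n != len(imputed) or n < 2:
        raise ValueError("need two equal-length sequences with at least 2 samples")
    mean_o = sum(observed) / n
    mean_i = sum(imputed) / n
    # Sums of cross-products and squared deviations from the means.
    cov = sum((o - mean_o) * (i - mean_i) for o, i in zip(observed, imputed))
    var_o = sum((o - mean_o) ** 2 for o in observed)
    var_i = sum((i - mean_i) ** 2 for i in imputed)
    if var_o == 0 or var_i == 0:
        return float("nan")  # monomorphic in this sample: correlation undefined
    return (cov * cov) / (var_o * var_i)
```

    Monomorphic variants return NaN rather than 0 so that they can be excluded from an average instead of dragging it down.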

  14. Imputation of Exome Sequence Variants into Population-Based Samples and Blood-Cell-Trait-Associated Loci in African Americans: NHLBI GO Exome Sequencing Project

    PubMed Central

    Auer, Paul L.; Johnsen, Jill M.; Johnson, Andrew D.; Logsdon, Benjamin A.; Lange, Leslie A.; Nalls, Michael A.; Zhang, Guosheng; Franceschini, Nora; Fox, Keolu; Lange, Ethan M.; Rich, Stephen S.; O’Donnell, Christopher J.; Jackson, Rebecca D.; Wallace, Robert B.; Chen, Zhao; Graubert, Timothy A.; Wilson, James G.; Tang, Hua; Lettre, Guillaume; Reiner, Alex P.; Ganesh, Santhi K.; Li, Yun

    2012-01-01

    Researchers have successfully applied exome sequencing to discover causal variants in selected individuals with familial, highly penetrant disorders. We demonstrate the utility of exome sequencing followed by imputation for discovering low-frequency variants associated with complex quantitative traits. We performed exome sequencing in a reference panel of 761 African Americans and then imputed newly discovered variants into a larger sample of more than 13,000 African Americans for association testing with the blood cell traits hemoglobin, hematocrit, white blood count, and platelet count. First, we illustrate the feasibility of our approach by demonstrating genome-wide-significant associations for variants that are not covered by conventional genotyping arrays; for example, one such association is that between higher platelet count and an MPL c.117G>T (p.Lys39Asn) variant encoding a p.Lys39Asn amino acid substitution of the thrombopoietin receptor gene (p = 1.5 × 10(-11)). Second, we identified an association between missense variants of LCT and higher white blood count (p = 4 × 10(-13)). Third, we identified low-frequency coding variants that might account for allelic heterogeneity at several known blood cell-associated loci: MPL c.754T>C (p.Tyr252His) was associated with higher platelet count; CD36 c.975T>G (p.Tyr325∗) was associated with lower platelet count; and several missense variants at the α-globin gene locus were associated with lower hemoglobin. By identifying low-frequency missense variants associated with blood cell traits not previously reported by genome-wide association studies, we establish that exome sequencing followed by imputation is a powerful approach to dissecting complex, genetically heterogeneous traits in large population-based studies. PMID:23103231

  15. CGDSNPdb: a database resource for error-checked and imputed mouse SNPs.

    PubMed

    Hutchins, Lucie N; Ding, Yueming; Szatkiewicz, Jin P; Von Smith, Randy; Yang, Hyuna; de Villena, Fernando Pardo-Manuel; Churchill, Gary A; Graber, Joel H

    2010-07-06

    The Center for Genome Dynamics Single Nucleotide Polymorphism Database (CGDSNPdb) is an open-source value-added database with more than nine million mouse single nucleotide polymorphisms (SNPs), drawn from multiple sources, with genotypes assigned to multiple inbred strains of laboratory mice. All SNPs are checked for accuracy and annotated for properties specific to the SNP as well as those implied by changes to overlapping protein-coding genes. CGDSNPdb serves as the primary interface to two unique data sets, the 'imputed genotype resource' in which a Hidden Markov Model was used to assess local haplotypes and the most probable base assignment at several million genomic loci in tens of strains of mice, and the Affymetrix Mouse Diversity Genotyping Array, a high density microarray with over 600,000 SNPs and over 900,000 invariant genomic probes. CGDSNPdb is accessible online through either a web-based query tool or a MySQL public login. Database URL: http://cgd.jax.org/cgdsnpdb/

  16. Mendel-GPU: haplotyping and genotype imputation on graphics processing units

    PubMed Central

    Chen, Gary K.; Wang, Kai; Stram, Alex H.; Sobel, Eric M.; Lange, Kenneth

    2012-01-01

    Motivation: In modern sequencing studies, one can improve the confidence of genotype calls by phasing haplotypes using information from an external reference panel of fully typed unrelated individuals. However, the computational demands are so high that they prohibit researchers with limited computational resources from haplotyping large-scale sequence data. Results: Our graphics processing unit based software delivers haplotyping and imputation accuracies comparable to competing programs at a fraction of the computational cost and peak memory demand. Availability: Mendel-GPU, our OpenCL software, runs on Linux platforms and is portable across AMD and nVidia GPUs. Users can download both code and documentation at http://code.google.com/p/mendel-gpu/. Contact: gary.k.chen@usc.edu Supplementary information: Supplementary data are available at Bioinformatics online. PMID:22954633

  17. Dense genotyping of immune-related loci implicates host responses to microbial exposure in Behçet's disease susceptibility.

    PubMed

    Takeuchi, Masaki; Mizuki, Nobuhisa; Meguro, Akira; Ombrello, Michael J; Kirino, Yohei; Satorius, Colleen; Le, Julie; Blake, Mary; Erer, Burak; Kawagoe, Tatsukata; Ustek, Duran; Tugal-Tutkun, Ilknur; Seyahi, Emire; Ozyazgan, Yilmaz; Sousa, Inês; Davatchi, Fereydoun; Francisco, Vânia; Shahram, Farhad; Abdollahi, Bahar Sadeghi; Nadji, Abdolhadi; Shafiee, Niloofar Mojarad; Ghaderibarmi, Fahmida; Ohno, Shigeaki; Ueda, Atsuhisa; Ishigatsubo, Yoshiaki; Gadina, Massimo; Oliveira, Sofia A; Gül, Ahmet; Kastner, Daniel L; Remmers, Elaine F

    2017-03-01

    We analyzed 1,900 Turkish Behçet's disease cases and 1,779 controls genotyped with the Immunochip. The most significantly associated SNP was rs1050502, a tag SNP for HLA-B*51. In the Turkish discovery set, we identified three new risk loci, IL1A-IL1B, IRF8, and CEBPB-PTPN1, with genome-wide significance (P < 5 × 10(-8)) by direct genotyping and ADO-EGR2 by imputation. We replicated the ADO-EGR2, IRF8, and CEBPB-PTPN1 loci by genotyping 969 Iranian cases and 826 controls. Imputed data in 608 Japanese cases and 737 controls further replicated ADO-EGR2 and IRF8, and meta-analysis additionally identified RIPK2 and LACC1. The disease-associated allele of rs4402765, the lead marker at IL1A-IL1B, was associated with both decreased IL-1α and increased IL-1β production. ABO non-secretor genotypes for two ancestry-specific FUT2 SNPs showed strong disease association (P = 5.89 × 10(-15)). Our findings extend the list of susceptibility genes shared with Crohn's disease and leprosy and implicate mucosal factors and the innate immune response to microbial exposure in Behçet's disease susceptibility.

  18. Dense genotyping of immune-related loci implicates host responses to microbial exposure in Behçet’s disease susceptibility

    PubMed Central

    Takeuchi, Masaki; Mizuki, Nobuhisa; Meguro, Akira; Ombrello, Michael J.; Kirino, Yohei; Satorius, Colleen; Le, Julie; Blake, Mary; Erer, Burak; Kawagoe, Tatsukata; Ustek, Duran; Tugal-Tutkun, Ilknur; Seyahi, Emire; Ozyazgan, Yilmaz; Sousa, Inês; Davatchi, Fereydoun; Francisco, Vânia; Shahram, Farhad; Abdollahi, Bahar Sadeghi; Nadji, Abdolhadi; Shafiee, Niloofar Mojarad; Ghaderibarmi, Fahmida; Ohno, Shigeaki; Ueda, Atsuhisa; Ishigatsubo, Yoshiaki; Gadina, Massimo; Oliveira, Sofia A.; Gül, Ahmet; Kastner, Daniel L.; Remmers, Elaine F.

    2017-01-01

    We analyzed 1,900 Turkish Behçet’s disease cases and 1,779 controls genotyped with the Immunochip. The most significantly associated single nucleotide polymorphism (SNP) was rs1050502, a tag SNP for HLA-B*51. In the Turkish discovery set, we identified three novel loci, IL1A-IL1B, IRF8, and CEBPB-PTPN1, with genome-wide significance (P < 5 × 10(-8)) by direct genotyping, and ADO-EGR2 by imputation. ADO-EGR2, IRF8, and CEBPB-PTPN1 replicated by genotyping 969 Iranian cases and 826 controls. Imputed data in 608 Japanese cases and 737 controls replicated ADO-EGR2 and IRF8 and meta-analysis additionally identified RIPK2 and LACC1. The disease-associated allele of rs4402765, the lead marker of the IL1A-IL1B locus, was associated with both decreased interleukin-1α and increased interleukin-1β production. ABO non-secretor genotypes of two ancestry-specific FUT2 SNPs showed strong disease association (P = 5.89 × 10(-15)). Our findings extend shared susceptibility genes with Crohn’s disease and leprosy, and implicate mucosal factors and the innate immune response to microbial exposure in Behçet’s disease susceptibility. PMID:28166214

  19. HLA imputation in an admixed population: An assessment of the 1000 Genomes data as a training set.

    PubMed

    Nunes, Kelly; Zheng, Xiuwen; Torres, Margareth; Moraes, Maria Elisa; Piovezan, Bruno Z; Pontes, Gerlandia N; Kimura, Lilian; Carnavalli, Juliana E P; Mingroni Netto, Regina C; Meyer, Diogo

    2016-03-01

    Methods to impute HLA alleles based on dense single nucleotide polymorphism (SNP) data provide a valuable resource to association studies and evolutionary investigation of the MHC region. The availability of appropriate training sets is critical to the accuracy of HLA imputation, and the inclusion of samples with various ancestries is an important pre-requisite in studies of admixed populations. We assess the accuracy of HLA imputation using 1000 Genomes Project data as a training set, applying it to a highly admixed Brazilian population, the Quilombos from the state of São Paulo. To assess accuracy, we compared imputed and experimentally determined genotypes for 146 samples at 4 HLA classical loci. We found imputation accuracies of 82.9%, 81.8%, 94.8% and 86.6% for HLA-A, -B, -C and -DRB1 respectively (two-field resolution). Accuracies were improved when we included a subset of Quilombo individuals in the training set. We conclude that the 1000 Genomes data is a valuable resource for construction of training sets due to the diversity of ancestries and the potential for a large overlap of SNPs with the target population. We also show that tailoring training sets to features of the target population substantially enhances imputation accuracy. Copyright © 2016 American Society for Histocompatibility and Immunogenetics. Published by Elsevier Inc. All rights reserved.
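
    The comparison of imputed and experimentally determined HLA genotypes at two-field resolution described above can be sketched as follows. This is an illustration only: the helper names two_field and allele_concordance, and the choice to score accuracy as the fraction of matched alleles with phase ignored, are assumptions and not necessarily the definition used by the authors.

```python
from collections import Counter


def two_field(allele):
    """Truncate an HLA allele name to two fields, e.g. 'A*02:01:01' -> 'A*02:01'."""
    return ":".join(allele.split(":")[:2])


def allele_concordance(typed, imputed):
    """Fraction of alleles at one locus that agree between typing and imputation.

    typed and imputed are equal-length lists holding one (allele1, allele2)
    pair per sample. Allele order within a pair is ignored.
    """
    if len(typed) != len(imputed) or not typed:
        raise ValueError("need two non-empty lists of equal length")
    matched = 0
    for typed_pair, imputed_pair in zip(typed, imputed):
        t = Counter(two_field(a) for a in typed_pair)
        i = Counter(two_field(a) for a in imputed_pair)
        # Multiset intersection counts 0, 1 or 2 shared alleles per sample.
        matched += sum((t & i).values())
    return matched / (2 * len(typed))
```

    Using a multiset intersection means a homozygous sample imputed as heterozygous scores one matched allele out of two, not two.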

  20. Underestimation of Variance of Predicted Health Utilities Derived from Multiattribute Utility Instruments.

    PubMed

    Chan, Kelvin K W; Xie, Feng; Willan, Andrew R; Pullenayegum, Eleanor M

    2017-04-01

    Parameter uncertainty in value sets of multiattribute utility-based instruments (MAUIs) has received little attention previously. This false precision leads to underestimation of the uncertainty of the results of cost-effectiveness analyses. The aim of this study is to examine the use of multiple imputation as a method to account for this uncertainty of MAUI scoring algorithms. We fitted a Bayesian model with random effects for respondents and health states to the data from the original US EQ-5D-3L valuation study, thereby estimating the uncertainty in the EQ-5D-3L scoring algorithm. We applied these results to EQ-5D-3L data from the Commonwealth Fund (CWF) Survey for Sick Adults (n = 3958), comparing the standard error of the estimated mean utility in the CWF population using the predictive distribution from the Bayesian mixed-effect model (i.e., incorporating parameter uncertainty in the value set) with the standard error of the estimated mean utilities based on multiple imputation and the standard error using the conventional approach of using MAUI (i.e., ignoring uncertainty in the value set). The mean utility in the CWF population based on the predictive distribution of the Bayesian model was 0.827 with a standard error (SE) of 0.011. When utilities were derived using the conventional approach, the estimated mean utility was 0.827 with an SE of 0.003, which is only 25% of the SE based on the full predictive distribution of the mixed-effect model. Using multiple imputation with 20 imputed sets, the mean utility was 0.828 with an SE of 0.011, which is similar to the SE based on the full predictive distribution. Ignoring uncertainty of the predicted health utilities derived from MAUIs could lead to substantial underestimation of the variance of mean utilities. Multiple imputation corrects for this underestimation so that the results of cost-effectiveness analyses using MAUIs can report the correct degree of uncertainty.
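
    How multiple imputation widens the standard error, as reported above, can be sketched with Rubin's rules, the standard way of pooling estimates across imputed data sets. Whether the authors pooled their 20 imputed sets in exactly this way is an assumption; the abstract does not say. The function name pool_rubin is hypothetical.

```python
import math


def pool_rubin(estimates, standard_errors):
    """Pool per-imputation estimates with Rubin's rules.

    estimates and standard_errors hold one value per imputed data set.
    Returns (pooled_estimate, pooled_standard_error).
    """
    m = len(estimates)
    if m < 2 or m != len(standard_errors):
        raise ValueError("need at least 2 imputations and matching lengths")
    pooled = sum(estimates) / m
    # Within-imputation variance: average of the squared standard errors.
    within = sum(se ** 2 for se in standard_errors) / m
    # Between-imputation variance: sample variance of the estimates.
    between = sum((q - pooled) ** 2 for q in estimates) / (m - 1)
    total = within + (1 + 1 / m) * between
    return pooled, math.sqrt(total)
```

    When every imputed set gives the same estimate, the between-imputation term is zero and the pooled SE reduces to the conventional one; any spread between sets increases it.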

  1. The African Genome Variation Project shapes medical genetics in Africa

    PubMed Central

    Gurdasani, Deepti; Carstensen, Tommy; Tekola-Ayele, Fasil; Pagani, Luca; Tachmazidou, Ioanna; Hatzikotoulas, Konstantinos; Karthikeyan, Savita; Iles, Louise; Pollard, Martin O.; Choudhury, Ananyo; Ritchie, Graham R. S.; Xue, Yali; Asimit, Jennifer; Nsubuga, Rebecca N.; Young, Elizabeth H.; Pomilla, Cristina; Kivinen, Katja; Rockett, Kirk; Kamali, Anatoli; Doumatey, Ayo P.; Asiki, Gershim; Seeley, Janet; Sisay-Joof, Fatoumatta; Jallow, Muminatou; Tollman, Stephen; Mekonnen, Ephrem; Ekong, Rosemary; Oljira, Tamiru; Bradman, Neil; Bojang, Kalifa; Ramsay, Michele; Adeyemo, Adebowale; Bekele, Endashaw; Motala, Ayesha; Norris, Shane A.; Pirie, Fraser; Kaleebu, Pontiano; Kwiatkowski, Dominic; Tyler-Smith, Chris; Rotimi, Charles; Zeggini, Eleftheria; Sandhu, Manjinder S.

    2014-01-01

    Given the importance of Africa to studies of human origins and disease susceptibility, detailed characterisation of African genetic diversity is needed. The African Genome Variation Project (AGVP) provides a resource to help design, implement and interpret genomic studies in sub-Saharan Africa (SSA) and worldwide. The AGVP represents dense genotypes from 1,481 and whole genome sequences (WGS) from 320 individuals across SSA. Using this resource, we find novel evidence of complex, regionally distinct hunter-gatherer and Eurasian admixture across SSA. We identify new loci under selection, including for malaria and hypertension. We show that modern imputation panels can identify association signals at highly differentiated loci across populations in SSA. Using WGS, we show further improvement in imputation accuracy supporting efforts for large-scale sequencing of diverse African haplotypes. Finally, we present an efficient genotype array design capturing common genetic variation in Africa, showing for the first time that such designs are feasible. PMID:25470054

  2. A SPATIOTEMPORAL APPROACH FOR HIGH RESOLUTION TRAFFIC FLOW IMPUTATION

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Han, Lee; Chin, Shih-Miao; Hwang, Ho-Ling

    Along with the rapid development of Intelligent Transportation Systems (ITS), traffic data collection technologies have been evolving dramatically. The emergence of innovative data collection technologies such as Remote Traffic Microwave Sensor (RTMS), Bluetooth sensor, GPS-based Floating Car method, automated license plate recognition (ALPR) (1), etc., creates an explosion of traffic data, which brings transportation engineering into the new era of Big Data. However, despite the advance of technologies, the missing data issue is still inevitable and has posed great challenges for research such as traffic forecasting, real-time incident detection and management, dynamic route guidance, and massive evacuation optimization, because the degree of success of these endeavors depends on the timely availability of relatively complete and reasonably accurate traffic data. A thorough literature review suggests most current imputation models, if not all, focus largely on the temporal nature of the traffic data and fail to consider the fact that traffic stream characteristics at a certain location are closely related to those at neighboring locations, or to utilize these correlations for data imputation. To this end, this paper presents a Kriging based spatiotemporal data imputation approach that is able to fully utilize the spatiotemporal information underlying traffic data. Imputation performance of the proposed approach was tested using simulated scenarios and achieved stable imputation accuracy. Moreover, the proposed Kriging imputation model is more flexible compared to current models.

  3. Consequences of splitting whole-genome sequencing effort over multiple breeds on imputation accuracy.

    PubMed

    Bouwman, Aniek C; Veerkamp, Roel F

    2014-10-03

    The aim of this study was to determine the consequences of splitting sequencing effort over multiple breeds for imputation accuracy from a high-density SNP chip towards whole-genome sequence. Such information would assist, for instance, numerically smaller cattle breeds, but also pig and chicken breeders, who have to choose wisely how to spend their sequencing efforts over all the breeds or lines they evaluate. Sequence data from cattle breeds was used, because there are currently relatively many individuals from several breeds sequenced within the 1,000 Bull Genomes project. The advantage of whole-genome sequence data is that it carries the causal mutations, but the question is whether it is possible to impute the causal variants accurately. This study therefore focussed on imputation accuracy of variants with low minor allele frequency and breed specific variants. Imputation accuracy was assessed for chromosomes 1 and 29 as the correlation between observed and imputed genotypes. For chromosome 1, the average imputation accuracy was 0.70 with a reference population of 20 Holstein, and increased to 0.83 when the reference population was increased by including 3 other dairy breeds with 20 animals each. When the same number of animals from the Holstein breed were added the accuracy improved to 0.88, while adding the 3 other breeds to the reference population of 80 Holstein improved the average imputation accuracy marginally to 0.89. For chromosome 29, the average imputation accuracy was lower. Some variants benefitted from the inclusion of other breeds in the reference population, initially determined by the MAF of the variant in each breed, but even Holstein specific variants did gain imputation accuracy from the multi-breed reference population. This study shows that splitting sequencing effort over multiple breeds and combining the reference populations is a good strategy for imputation from high-density SNP panels towards whole-genome sequence when reference populations are small and sequencing effort is limiting. When sequencing effort is limiting and interest lies in multiple breeds or lines this provides imputation of each breed.

  4. References for Haplotype Imputation in the Big Data Era

    PubMed Central

    Li, Wenzhi; Xu, Wei; Li, Qiling; Ma, Li; Song, Qing

    2016-01-01

    Imputation is a powerful in silico approach to fill in missing values in big datasets. This process requires a reference panel, which is a collection of big data from which the missing information can be extracted and imputed. Haplotype imputation requires ethnicity-matched references; a mismatched reference panel will significantly reduce the quality of imputation. However, currently existing big datasets cover only a small number of ethnicities; there is a lack of ethnicity-matched references for many ethnic populations in the world, which has hampered the data imputation of haplotypes and its downstream applications. To solve this issue, several approaches have been proposed and explored, including the mixed reference panel, the internal reference panel and genotype-converted reference panel. This review article provides the information and comparison between these approaches. Increasing evidence showed that not just one or two genetic elements dictate the gene activity and functions; instead, cis-interactions of multiple elements dictate gene activity. Cis-interactions require the interacting elements to be on the same chromosome molecule, therefore, haplotype analysis is essential for the investigation of cis-interactions among multiple genetic variants at different loci, and appears to be especially important for studying the common diseases. It will be valuable in a wide spectrum of applications from academic research, to clinical diagnosis, prevention, treatment, and pharmaceutical industry. PMID:27274952

  5. Computational strategies for alternative single-step Bayesian regression models with large numbers of genotyped and non-genotyped animals.

    PubMed

    Fernando, Rohan L; Cheng, Hao; Golden, Bruce L; Garrick, Dorian J

    2016-12-08

    Two types of models have been used for single-step genomic prediction and genome-wide association studies that include phenotypes from both genotyped animals and their non-genotyped relatives. The two types are breeding value models (BVM) that fit breeding values explicitly and marker effects models (MEM) that express the breeding values in terms of the effects of observed or imputed genotypes. MEM can accommodate a wider class of analyses, including variable selection or mixture model analyses. The order of the equations that need to be solved and the inverses required in their construction vary widely, and thus the computational effort required depends upon the size of the pedigree, the number of genotyped animals and the number of loci. We present computational strategies to avoid storing large, dense blocks of the MME that involve imputed genotypes. Furthermore, we present a hybrid model that fits a MEM for animals with observed genotypes and a BVM for those without genotypes. The hybrid model is computationally attractive for pedigree files containing millions of animals with a large proportion of those being genotyped. We demonstrate the practicality on both the original MEM and the hybrid model using real data with 6,179,960 animals in the pedigree with 4,934,101 phenotypes and 31,453 animals genotyped at 40,214 informative loci. To complete a single-trait analysis on a desk-top computer with four graphics cards required about 3 h using the hybrid model to obtain both preconditioned conjugate gradient solutions and 42,000 Markov chain Monte-Carlo (MCMC) samples of breeding values, which allowed making inferences from posterior means, variances and covariances. The MCMC sampling required one quarter of the effort when the hybrid model was used compared to the published MEM. We present a hybrid model that fits a MEM for animals with genotypes and a BVM for those without genotypes. Its practicality and considerable reduction in computing effort were demonstrated. This model can readily be extended to accommodate multiple traits, multiple breeds, maternal effects, and additional random effects such as polygenic residual effects.

  6. Genotypic variants at 2q33 and risk of esophageal squamous cell carcinoma in China: a meta-analysis of genome-wide association studies

    PubMed Central

    Abnet, Christian C.; Wang, Zhaoming; Song, Xin; Hu, Nan; Zhou, Fu-You; Freedman, Neal D.; Li, Xue-Min; Yu, Kai; Shu, Xiao-Ou; Yuan, Jian-Min; Zheng, Wei; Dawsey, Sanford M.; Liao, Linda M.; Lee, Maxwell P.; Ding, Ti; Qiao, You-Lin; Gao, Yu-Tang; Koh, Woon-Puay; Xiang, Yong-Bing; Tang, Ze-Zhong; Fan, Jin-Hu; Chung, Charles C.; Wang, Chaoyu; Wheeler, William; Yeager, Meredith; Yuenger, Jeff; Hutchinson, Amy; Jacobs, Kevin B.; Giffen, Carol A.; Burdett, Laurie; Fraumeni, Joseph F.; Tucker, Margaret A.; Chow, Wong-Ho; Zhao, Xue-Ke; Li, Jiang-Man; Li, Ai-Li; Sun, Liang-Dan; Wei, Wu; Li, Ji-Lin; Zhang, Peng; Li, Hong-Lei; Cui, Wen-Yan; Wang, Wei-Peng; Liu, Zhi-Cai; Yang, Xia; Fu, Wen-Jing; Cui, Ji-Li; Lin, Hong-Li; Zhu, Wen-Liang; Liu, Min; Chen, Xi; Chen, Jie; Guo, Li; Han, Jing-Jing; Zhou, Sheng-Li; Huang, Jia; Wu, Yue; Yuan, Chao; Huang, Jing; Ji, Ai-Fang; Kul, Jian-Wei; Fan, Zhong-Min; Wang, Jian-Po; Zhang, Dong-Yun; Zhang, Lian-Qun; Zhang, Wei; Chen, Yuan-Fang; Ren, Jing-Li; Li, Xiu-Min; Dong, Jin-Cheng; Xing, Guo-Lan; Guo, Zhi-Gang; Yang, Jian-Xue; Mao, Yi-Ming; Yuan, Yuan; Guo, Er-Tao; Zhang, Wei; Hou, Zhi-Chao; Liu, Jing; Li, Yan; Tang, Sa; Chang, Jia; Peng, Xiu-Qin; Han, Min; Yin, Wan-Li; Liu, Ya-Li; Hu, Yan-Long; Liu, Yu; Yang, Liu-Qin; Zhu, Fu-Guo; Yang, Xiu-Feng; Feng, Xiao-Shan; Wang, Zhou; Li, Yin; Gao, She-Gan; Liu, Hai-Lin; Yuan, Ling; Jin, Yan; Zhang, Yan-Rui; Sheyhidin, Ilyar; Li, Feng; Chen, Bao-Ping; Ren, Shu-Wei; Liu, Bin; Li, Dan; Zhang, Gao-Fu; Yue, Wen-Bin; Feng, Chang-Wei; Qige, Qirenwang; Zhao, Jian-Ting; Yang, Wen-Jun; Lei, Guang-Yan; Chen, Long-Qi; Li, En-Min; Xu, Li-Yan; Wu, Zhi-Yong; Bao, Zhi-Qin; Chen, Ji-Li; Li, Xian-Chang; Zhuang, Xiang; Zhou, Ying-Fa; Zuo, Xian-Bo; Dong, Zi-Ming; Wang, Lu-Wen; Fan, Xue-Pin; Wang, Jin; Zhou, Qi; Ma, Guo-Shun; Zhang, Qin-Xian; Liu, Hai; Jian, Xin-Ying; Lian, Sin-Yong; Wang, Jin-Sheng; Chang, Fu-Bao; Lu, Chang-Dong; Miao, Jian-Jun; Chen, Zhi-Guo; Wang, Ran; Guo, Ming; Fan, Zeng-Lin; Tao, Ping; Liu, 
Tai-Jing; Wei, Jin-Chang; Kong, Qing-Peng; Fan, Lei; Wang, Xian-Zeng; Gao, Fu-Sheng; Wang, Tian-Yun; Xie, Dong; Wang, Li; Chen, Shu-Qing; Yang, Wan-Cai; Hong, Jun-Yan; Wang, Liang; Qiu, Song-Liang; Goldstein, Alisa M.; Yuan, Zhi-Qing; Chanock, Stephen J.; Zhang, Xue-Jun; Taylor, Philip R.; Wang, Li-Dong

    2012-01-01

    Genome-wide association studies have identified susceptibility loci for esophageal squamous cell carcinoma (ESCC). We conducted a meta-analysis of all single-nucleotide polymorphisms (SNPs) that showed nominally significant P-values in two previously published genome-wide scans that included a total of 2961 ESCC cases and 3400 controls. The meta-analysis revealed five SNPs at 2q33 with P < 5 × 10^-8, and the strongest signal was rs13016963, with a combined odds ratio (95% confidence interval) of 1.29 (1.19–1.40) and P = 7.63 × 10^-10. An imputation analysis of 4304 SNPs at 2q33 suggested a single association signal, and the strongest imputed SNP associations were similar to those from the genotyped SNPs. We conducted an ancestral recombination graph analysis with 53 SNPs to identify one or more haplotypes that harbor the variants directly responsible for the detected association signal. This showed that the five SNPs exist in a single haplotype along with 45 imputed SNPs in strong linkage disequilibrium, and the strongest candidate was rs10201587, one of the genotyped SNPs. Our meta-analysis found genome-wide significant SNPs at 2q33 that map to the CASP8/ALS2CR12/TRAK2 gene region. Variants in CASP8 have been extensively studied across a spectrum of cancers with mixed results. The locus we identified appears to be distinct from the widely studied rs3834129 and rs1045485 SNPs in CASP8. Future studies of esophageal and other cancers should focus on comprehensive sequencing of this 2q33 locus and functional analysis of rs13016963 and rs10201587 and other strongly correlated variants. PMID:22323360

  7. Genotypic variants at 2q33 and risk of esophageal squamous cell carcinoma in China: a meta-analysis of genome-wide association studies.

    PubMed

    Abnet, Christian C; Wang, Zhaoming; Song, Xin; Hu, Nan; Zhou, Fu-You; Freedman, Neal D; Li, Xue-Min; Yu, Kai; Shu, Xiao-Ou; Yuan, Jian-Min; Zheng, Wei; Dawsey, Sanford M; Liao, Linda M; Lee, Maxwell P; Ding, Ti; Qiao, You-Lin; Gao, Yu-Tang; Koh, Woon-Puay; Xiang, Yong-Bing; Tang, Ze-Zhong; Fan, Jin-Hu; Chung, Charles C; Wang, Chaoyu; Wheeler, William; Yeager, Meredith; Yuenger, Jeff; Hutchinson, Amy; Jacobs, Kevin B; Giffen, Carol A; Burdett, Laurie; Fraumeni, Joseph F; Tucker, Margaret A; Chow, Wong-Ho; Zhao, Xue-Ke; Li, Jiang-Man; Li, Ai-Li; Sun, Liang-Dan; Wei, Wu; Li, Ji-Lin; Zhang, Peng; Li, Hong-Lei; Cui, Wen-Yan; Wang, Wei-Peng; Liu, Zhi-Cai; Yang, Xia; Fu, Wen-Jing; Cui, Ji-Li; Lin, Hong-Li; Zhu, Wen-Liang; Liu, Min; Chen, Xi; Chen, Jie; Guo, Li; Han, Jing-Jing; Zhou, Sheng-Li; Huang, Jia; Wu, Yue; Yuan, Chao; Huang, Jing; Ji, Ai-Fang; Kul, Jian-Wei; Fan, Zhong-Min; Wang, Jian-Po; Zhang, Dong-Yun; Zhang, Lian-Qun; Zhang, Wei; Chen, Yuan-Fang; Ren, Jing-Li; Li, Xiu-Min; Dong, Jin-Cheng; Xing, Guo-Lan; Guo, Zhi-Gang; Yang, Jian-Xue; Mao, Yi-Ming; Yuan, Yuan; Guo, Er-Tao; Zhang, Wei; Hou, Zhi-Chao; Liu, Jing; Li, Yan; Tang, Sa; Chang, Jia; Peng, Xiu-Qin; Han, Min; Yin, Wan-Li; Liu, Ya-Li; Hu, Yan-Long; Liu, Yu; Yang, Liu-Qin; Zhu, Fu-Guo; Yang, Xiu-Feng; Feng, Xiao-Shan; Wang, Zhou; Li, Yin; Gao, She-Gan; Liu, Hai-Lin; Yuan, Ling; Jin, Yan; Zhang, Yan-Rui; Sheyhidin, Ilyar; Li, Feng; Chen, Bao-Ping; Ren, Shu-Wei; Liu, Bin; Li, Dan; Zhang, Gao-Fu; Yue, Wen-Bin; Feng, Chang-Wei; Qige, Qirenwang; Zhao, Jian-Ting; Yang, Wen-Jun; Lei, Guang-Yan; Chen, Long-Qi; Li, En-Min; Xu, Li-Yan; Wu, Zhi-Yong; Bao, Zhi-Qin; Chen, Ji-Li; Li, Xian-Chang; Zhuang, Xiang; Zhou, Ying-Fa; Zuo, Xian-Bo; Dong, Zi-Ming; Wang, Lu-Wen; Fan, Xue-Pin; Wang, Jin; Zhou, Qi; Ma, Guo-Shun; Zhang, Qin-Xian; Liu, Hai; Jian, Xin-Ying; Lian, Sin-Yong; Wang, Jin-Sheng; Chang, Fu-Bao; Lu, Chang-Dong; Miao, Jian-Jun; Chen, Zhi-Guo; Wang, Ran; Guo, Ming; Fan, Zeng-Lin; Tao, Ping; Liu, Tai-Jing; Wei, 
Jin-Chang; Kong, Qing-Peng; Fan, Lei; Wang, Xian-Zeng; Gao, Fu-Sheng; Wang, Tian-Yun; Xie, Dong; Wang, Li; Chen, Shu-Qing; Yang, Wan-Cai; Hong, Jun-Yan; Wang, Liang; Qiu, Song-Liang; Goldstein, Alisa M; Yuan, Zhi-Qing; Chanock, Stephen J; Zhang, Xue-Jun; Taylor, Philip R; Wang, Li-Dong

    2012-05-01

    Genome-wide association studies have identified susceptibility loci for esophageal squamous cell carcinoma (ESCC). We conducted a meta-analysis of all single-nucleotide polymorphisms (SNPs) that showed nominally significant P-values in two previously published genome-wide scans that included a total of 2961 ESCC cases and 3400 controls. The meta-analysis revealed five SNPs at 2q33 with P < 5 × 10^-8, and the strongest signal was rs13016963, with a combined odds ratio (95% confidence interval) of 1.29 (1.19-1.40) and P = 7.63 × 10^-10. An imputation analysis of 4304 SNPs at 2q33 suggested a single association signal, and the strongest imputed SNP associations were similar to those from the genotyped SNPs. We conducted an ancestral recombination graph analysis with 53 SNPs to identify one or more haplotypes that harbor the variants directly responsible for the detected association signal. This showed that the five SNPs exist in a single haplotype along with 45 imputed SNPs in strong linkage disequilibrium, and the strongest candidate was rs10201587, one of the genotyped SNPs. Our meta-analysis found genome-wide significant SNPs at 2q33 that map to the CASP8/ALS2CR12/TRAK2 gene region. Variants in CASP8 have been extensively studied across a spectrum of cancers with mixed results. The locus we identified appears to be distinct from the widely studied rs3834129 and rs1045485 SNPs in CASP8. Future studies of esophageal and other cancers should focus on comprehensive sequencing of this 2q33 locus and functional analysis of rs13016963 and rs10201587 and other strongly correlated variants.
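
    The odds ratio and interval reported for rs13016963 are enough for a rough cross-check of the reported P-value. The sketch below is not the authors' analysis: it assumes the 95% confidence interval is a symmetric Wald interval on the log odds-ratio scale, recovers a standard error from the rounded interval, and derives a two-sided normal P-value. The function name `wald_check` is hypothetical. Because the inputs are rounded to two decimals, the result can be expected to agree with the published value only to about the order of magnitude.

```python
import math

Z_975 = 1.959964  # 97.5th percentile of the standard normal distribution


def wald_check(odds_ratio, ci_low, ci_high):
    """Return (SE of log OR, z statistic, two-sided P-value).

    Assumption: the interval is a symmetric Wald 95% interval
    on the log odds-ratio scale.
    """
    log_or = math.log(odds_ratio)
    # Half the interval width on the log scale equals Z_975 * SE.
    se = (math.log(ci_high) - math.log(ci_low)) / (2.0 * Z_975)
    z = log_or / se
    # Two-sided tail probability of a standard normal variable.
    p_two_sided = math.erfc(abs(z) / math.sqrt(2.0))
    return se, z, p_two_sided


se, z, p = wald_check(1.29, 1.19, 1.40)
print(f"SE(log OR) = {se:.4f}, z = {z:.2f}, P = {p:.2e}")
```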

  8. Using imputed genotype data in the joint score tests for genetic association and gene-environment interactions in case-control studies.

    PubMed

    Song, Minsun; Wheeler, William; Caporaso, Neil E; Landi, Maria Teresa; Chatterjee, Nilanjan

    2018-03-01

    Genome-wide association studies (GWAS) are now routinely imputed for untyped single nucleotide polymorphisms (SNPs) based on various powerful statistical algorithms for imputation trained on reference datasets. The use of predicted allele counts for imputed SNPs as the dosage variable is known to produce a valid score test for genetic association. In this paper, we investigate how to best handle imputed SNPs in various modern complex tests for genetic associations incorporating gene-environment interactions. We focus on case-control association studies where inference for an underlying logistic regression model can be performed using alternative methods that rely to varying degrees on an assumption of gene-environment independence in the underlying population. As increasingly large-scale GWAS are being performed through consortium efforts where it is preferable to share only summary-level information across studies, we also describe simple mechanisms for implementing score tests based on standard meta-analysis of "one-step" maximum-likelihood estimates across studies. Applications of the methods in simulation studies and a dataset from a GWAS of lung cancer illustrate the ability of the proposed methods to maintain type-I error rates for the underlying testing procedures. For analysis of imputed SNPs, similar to typed SNPs, the retrospective methods can lead to considerable efficiency gain for modeling of gene-environment interactions under the assumption of gene-environment independence. Methods are made available for public use through the CGEN R software package. © 2017 WILEY PERIODICALS, INC.
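
    The "predicted allele count" used as the dosage variable can be illustrated with a short sketch. It assumes, as is common for imputation output but not stated in the abstract, that each imputed SNP comes with three posterior genotype probabilities (0, 1 or 2 copies of the coded allele) per subject. The function name `dosages` is hypothetical and this is not code from the CGEN package.

```python
def dosages(genotype_probs, tol=1e-6):
    """Expected coded-allele count for each subject.

    genotype_probs: iterable of (p0, p1, p2) tuples, the posterior
    probabilities of carrying 0, 1 or 2 copies of the coded allele.
    """
    result = []
    for p0, p1, p2 in genotype_probs:
        if abs(p0 + p1 + p2 - 1.0) > tol:
            raise ValueError("genotype probabilities must sum to 1")
        # E[count] = 0 * p0 + 1 * p1 + 2 * p2
        result.append(p1 + 2.0 * p2)
    return result


# Three confidently called genotypes and one uncertain imputed genotype.
example = [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0), (0.1, 0.3, 0.6)]
print(dosages(example))
```

    A dosage between the integer values carries the imputation uncertainty into the association test instead of forcing a hard genotype call.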

  9. Using population mixtures to optimize the utility of genomic databases: linkage disequilibrium and association study design in India.

    PubMed

    Pemberton, T J; Jakobsson, M; Conrad, D F; Coop, G; Wall, J D; Pritchard, J K; Patel, P I; Rosenberg, N A

    2008-07-01

    When performing association studies in populations that have not been the focus of large-scale investigations of haplotype variation, it is often helpful to rely on genomic databases in other populations for study design and analysis - such as in the selection of tag SNPs and in the imputation of missing genotypes. One way of improving the use of these databases is to rely on a mixture of database samples that is similar to the population of interest, rather than using the single most similar database sample. We demonstrate the effectiveness of the mixture approach in the application of African, European, and East Asian HapMap samples for tag SNP selection in populations from India, a genetically intermediate region underrepresented in genomic studies of haplotype variation.

  10. The Oxytocin Receptor Gene (OXTR) and Face Recognition.

    PubMed

    Verhallen, Roeland J; Bosten, Jenny M; Goodbourn, Patrick T; Lawrance-Owen, Adam J; Bargary, Gary; Mollon, J D

    2017-01-01

    A recent study has linked individual differences in face recognition to rs237887, a single-nucleotide polymorphism (SNP) of the oxytocin receptor gene (OXTR; Skuse et al., 2014). In that study, participants were assessed using the Warrington Recognition Memory Test for Faces, but performance on Warrington's test has been shown not to rely purely on face recognition processes. We administered the widely used Cambridge Face Memory Test, a purer test of face recognition, to 370 participants. Performance was not significantly associated with rs237887, with 16 other SNPs of OXTR that we genotyped, or with a further 75 imputed SNPs. We also administered three other tests of face processing (the Mooney Face Test, the Glasgow Face Matching Test, and the Composite Face Test), but performance was never significantly associated with rs237887 or with any of the other genotyped or imputed SNPs, after corrections for multiple testing. In addition, we found no associations between OXTR and Autism-Spectrum Quotient scores.

  11. 3D-MICE: integration of cross-sectional and longitudinal imputation for multi-analyte longitudinal clinical data.

    PubMed

    Luo, Yuan; Szolovits, Peter; Dighe, Anand S; Baron, Jason M

    2018-06-01

    A key challenge in clinical data mining is that most clinical datasets contain missing data. Since many commonly used machine learning algorithms require complete datasets (no missing data), clinical analytic approaches often entail an imputation procedure to "fill in" missing data. However, although most clinical datasets contain a temporal component, most commonly used imputation methods do not adequately accommodate longitudinal time-based data. We sought to develop a new imputation algorithm, 3-dimensional multiple imputation with chained equations (3D-MICE), that can perform accurate imputation of missing clinical time series data. We extracted clinical laboratory test results for 13 commonly measured analytes (clinical laboratory tests). We imputed missing test results for the 13 analytes using 3 imputation methods: multiple imputation with chained equations (MICE), Gaussian process (GP), and 3D-MICE. 3D-MICE utilizes both MICE and GP imputation to integrate cross-sectional and longitudinal information. To evaluate imputation method performance, we randomly masked selected test results and imputed these masked results alongside results missing from our original data. We compared predicted results to measured results for masked data points. 3D-MICE performed significantly better than MICE and GP-based imputation in a composite of all 13 analytes, predicting missing results with a normalized root-mean-square error of 0.342, compared to 0.373 for MICE alone and 0.358 for GP alone. 3D-MICE offers a novel and practical approach to imputing clinical laboratory time series data. 3D-MICE may provide an additional tool for use as a foundation in clinical predictive analytics and intelligent clinical decision support.
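
    The evaluation design described above (mask known results, impute them, compare predictions with the measured values) can be sketched as follows. This is an illustration only, not the 3D-MICE code: the function names are hypothetical, missing values are represented as `None`, and the error is normalized by the standard deviation of the measured values, which is one common convention and an assumption here since the abstract does not state the normalization used.

```python
import math
import random


def mask_values(values, fraction, seed=0):
    """Hide a random fraction of the observed entries.

    Returns (masked copy, sorted indices that were masked).
    Entries that are already None are left untouched.
    """
    rng = random.Random(seed)
    observed = [i for i, v in enumerate(values) if v is not None]
    k = int(round(fraction * len(observed)))
    masked_idx = sorted(rng.sample(observed, k))
    masked = list(values)
    for i in masked_idx:
        masked[i] = None
    return masked, masked_idx


def normalized_rmse(measured, predicted):
    """RMSE divided by the (population) standard deviation of measured."""
    n = len(measured)
    mean = sum(measured) / n
    variance = sum((m - mean) ** 2 for m in measured) / n
    if variance == 0:
        raise ValueError("measured values have zero variance")
    mse = sum((m - p) ** 2 for m, p in zip(measured, predicted)) / n
    return math.sqrt(mse) / math.sqrt(variance)


series = [4.1, None, 3.9, 4.4, 4.0, None, 4.6, 4.2]
masked, idx = mask_values(series, fraction=0.5, seed=1)
truth = [series[i] for i in idx]
# Stand-in "imputer": predict the mean of the values still visible.
visible = [v for v in masked if v is not None]
guess = [sum(visible) / len(visible)] * len(idx)
print(idx, normalized_rmse(truth, guess))
```

    Under this normalization, a predictor that always returns the mean of the measured values scores 1.0 and a perfect predictor scores 0.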

  12. Short communication: Validation of 4 candidate causative trait variants in 2 cattle breeds using targeted sequence imputation.

    PubMed

    Pausch, Hubert; Wurmser, Christine; Reinhardt, Friedrich; Emmerling, Reiner; Fries, Ruedi

    2015-06-01

    Most association studies for pinpointing trait-associated variants are performed within breed. The availability of sequence data from key ancestors of several cattle breeds now enables immediate assessment of the frequency of trait-associated variants in populations different from the mapping population and their imputation into large validation populations. The objective of this study was to validate the effects of 4 putatively causative variants on milk production traits, male fertility, and stature in German Fleckvieh and Holstein-Friesian animals using targeted sequence imputation. We used whole-genome sequence data of 456 animals to impute 4 missense mutations in DGAT1, GHR, PRLR, and PROP1 into 10,363 Fleckvieh and 8,812 Holstein animals. The accuracy of the imputed genotypes exceeded 95% for all variants. Association testing with imputed variants revealed consistent antagonistic effects of the DGAT1 p.A232K and GHR p.F279Y variants on milk yield and protein and fat contents, respectively, in both breeds. The allele frequency of both polymorphisms has changed considerably in the past 20 yr, indicating that they were targets of recent selection for milk production traits. The PRLR p.S18N variant was associated with yield traits in Fleckvieh but not in Holstein, suggesting that it may be in linkage disequilibrium with a mutation affecting yield traits rather than being causal. The reported effects of the PROP1 p.H173R variant on milk production, male fertility, and stature could not be confirmed. Our results demonstrate that population-wide imputation of candidate causal variants from sequence data is feasible, enabling their rapid validation in large independent populations. Copyright © 2015 American Dairy Science Association. Published by Elsevier Inc. All rights reserved.

  13. The Ability of Different Imputation Methods to Preserve the Significant Genes and Pathways in Cancer.

    PubMed

    Aghdam, Rosa; Baghfalaki, Taban; Khosravi, Pegah; Saberi Ansari, Elnaz

    2017-12-01

    Deciphering important genes and pathways from incomplete gene expression data could facilitate a better understanding of cancer. Different imputation methods can be applied to estimate the missing values. In our study, we evaluated various imputation methods for their performance in preserving significant genes and pathways. In the first step, 5% of genes were selected at random under two types of missingness mechanism, ignorable and non-ignorable, with various missing rates. Next, 10 well-known imputation methods were applied to the complete datasets. The significance analysis of microarrays (SAM) method was applied to detect the significant genes in rectal and lung cancers to showcase the utility of imputation approaches in preserving significant genes. To determine the impact of different imputation methods on the identification of important genes, the chi-squared test was used to compare the proportions of overlaps between significant genes detected from original data and those detected from the imputed datasets. Additionally, the significant genes were tested for their enrichment in important pathways, using the ConsensusPathDB. Our results showed that almost all the significant genes and pathways of the original dataset can be detected in all imputed datasets, indicating that there is no significant difference in the performance of various imputation methods tested. The source code and selected datasets are available on http://profiles.bs.ipm.ir/softwares/imputation_methods/. Copyright © 2017. Production and hosting by Elsevier B.V.

  14. Prediction accuracies for growth and wood attributes of interior spruce in space using genotyping-by-sequencing.

    PubMed

    Gamal El-Dien, Omnia; Ratcliffe, Blaise; Klápště, Jaroslav; Chen, Charles; Porth, Ilga; El-Kassaby, Yousry A

    2015-05-09

    Genomic selection (GS) in forestry can substantially reduce the length of the breeding cycle and increase gain per unit time through early selection and greater selection intensity, particularly for traits of low heritability and late expression. Affordable next-generation sequencing technologies have made it possible to genotype large numbers of trees at a reasonable cost. Genotyping-by-sequencing was used to genotype 1,126 Interior spruce trees representing 25 open-pollinated families planted over three sites in British Columbia, Canada. Four imputation algorithms were compared (mean value (MI), singular value decomposition (SVD), expectation maximization (EM), and a newly derived, family-based k-nearest neighbor (kNN-Fam)). Trees were phenotyped for several yield and wood attributes. Single- and multi-site GS prediction models were developed using the Ridge Regression Best Linear Unbiased Predictor (RR-BLUP) and the Generalized Ridge Regression (GRR) to test different assumptions about trait architecture. Finally, using PCA, multi-trait GS prediction models were developed. The EM and kNN-Fam imputation methods were superior for 30 and 60% missing data, respectively. The RR-BLUP GS prediction model produced better accuracies than the GRR, indicating that the genetic architecture for these traits is complex. GS prediction accuracies for multi-site were high and better than those of single-sites, while multi-site predictability produced the lowest accuracies, reflecting type-b genetic correlations, and was deemed unreliable. The incorporation of genomic information in quantitative genetics analyses produced more realistic heritability estimates, as the half-sib pedigree tended to inflate the additive genetic variance and subsequently both heritability and gain estimates. Principal component scores as representatives of multi-trait GS prediction models produced surprising results where negatively correlated traits could be concurrently selected for using PCA2 and PCA3. 
The application of GS to open-pollinated family testing, the simplest form of tree improvement evaluation methods, was proven to be effective. Prediction accuracies obtained for all traits greatly support the integration of GS in tree breeding. While the within-site GS prediction accuracies were high, the results clearly indicate that the ability of single-site GS models to predict other sites is unreliable, supporting the use of a multi-site approach. Principal component scores provided an opportunity for the concurrent selection of traits with different phenotypic optima.
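
    Of the four imputation algorithms compared above, mean value imputation (MI) is the simplest, and a minimal sketch may help fix ideas. It assumes genotypes are coded as 0/1/2 allele counts with `None` for missing calls, a coding the abstract does not specify, and the function name `mean_impute` is hypothetical. It does not reproduce the study's SVD, EM or kNN-Fam methods.

```python
def mean_impute(genotypes):
    """Replace each missing call with the mean of that marker's observed calls.

    genotypes: list of rows (one per tree), each a list of marker calls
    coded 0/1/2, with None for a missing call.
    """
    n_markers = len(genotypes[0])
    means = []
    for j in range(n_markers):
        observed = [row[j] for row in genotypes if row[j] is not None]
        if not observed:
            raise ValueError(f"marker {j} has no observed calls")
        means.append(sum(observed) / len(observed))
    return [
        [means[j] if row[j] is None else float(row[j]) for j in range(n_markers)]
        for row in genotypes
    ]


calls = [
    [0, None],
    [2, 1],
    [None, 2],
]
print(mean_impute(calls))
```

    Mean imputation ignores relatedness between trees, which is the information a family-based nearest-neighbor method is designed to use.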

  15. Imputation of unordered markers and the impact on genomic selection accuracy

    USDA-ARS's Scientific Manuscript database

    Genomic selection, a breeding method that promises to accelerate rates of genetic gain, requires dense, genome-wide marker data. Genotyping-by-sequencing can generate a large number of de novo markers. However, without a reference genome, these markers are unordered and typically have a large propo...

  16. Sequence data and association statistics from 12,940 type 2 diabetes cases and controls.

    PubMed

    Flannick, Jason; Fuchsberger, Christian; Mahajan, Anubha; Teslovich, Tanya M; Agarwala, Vineeta; Gaulton, Kyle J; Caulkins, Lizz; Koesterer, Ryan; Ma, Clement; Moutsianas, Loukas; McCarthy, Davis J; Rivas, Manuel A; Perry, John R B; Sim, Xueling; Blackwell, Thomas W; Robertson, Neil R; Rayner, N William; Cingolani, Pablo; Locke, Adam E; Tajes, Juan Fernandez; Highland, Heather M; Dupuis, Josee; Chines, Peter S; Lindgren, Cecilia M; Hartl, Christopher; Jackson, Anne U; Chen, Han; Huyghe, Jeroen R; van de Bunt, Martijn; Pearson, Richard D; Kumar, Ashish; Müller-Nurasyid, Martina; Grarup, Niels; Stringham, Heather M; Gamazon, Eric R; Lee, Jaehoon; Chen, Yuhui; Scott, Robert A; Below, Jennifer E; Chen, Peng; Huang, Jinyan; Go, Min Jin; Stitzel, Michael L; Pasko, Dorota; Parker, Stephen C J; Varga, Tibor V; Green, Todd; Beer, Nicola L; Day-Williams, Aaron G; Ferreira, Teresa; Fingerlin, Tasha; Horikoshi, Momoko; Hu, Cheng; Huh, Iksoo; Ikram, Mohammad Kamran; Kim, Bong-Jo; Kim, Yongkang; Kim, Young Jin; Kwon, Min-Seok; Lee, Juyoung; Lee, Selyeong; Lin, Keng-Han; Maxwell, Taylor J; Nagai, Yoshihiko; Wang, Xu; Welch, Ryan P; Yoon, Joon; Zhang, Weihua; Barzilai, Nir; Voight, Benjamin F; Han, Bok-Ghee; Jenkinson, Christopher P; Kuulasmaa, Teemu; Kuusisto, Johanna; Manning, Alisa; Ng, Maggie C Y; Palmer, Nicholette D; Balkau, Beverley; Stančáková, Alena; Abboud, Hanna E; Boeing, Heiner; Giedraitis, Vilmantas; Prabhakaran, Dorairaj; Gottesman, Omri; Scott, James; Carey, Jason; Kwan, Phoenix; Grant, George; Smith, Joshua D; Neale, Benjamin M; Purcell, Shaun; Butterworth, Adam S; Howson, Joanna M M; Lee, Heung Man; Lu, Yingchang; Kwak, Soo-Heon; Zhao, Wei; Danesh, John; Lam, Vincent K L; Park, Kyong Soo; Saleheen, Danish; So, Wing Yee; Tam, Claudia H T; Afzal, Uzma; Aguilar, David; Arya, Rector; Aung, Tin; Chan, Edmund; Navarro, Carmen; Cheng, Ching-Yu; Palli, Domenico; Correa, Adolfo; Curran, Joanne E; Rybin, Dennis; Farook, Vidya S; Fowler, Sharon P; Freedman, Barry I; 
Griswold, Michael; Hale, Daniel Esten; Hicks, Pamela J; Khor, Chiea-Chuen; Kumar, Satish; Lehne, Benjamin; Thuillier, Dorothée; Lim, Wei Yen; Liu, Jianjun; Loh, Marie; Musani, Solomon K; Puppala, Sobha; Scott, William R; Yengo, Loïc; Tan, Sian-Tsung; Taylor, Herman A; Thameem, Farook; Wilson, Gregory; Wong, Tien Yin; Njølstad, Pål Rasmus; Levy, Jonathan C; Mangino, Massimo; Bonnycastle, Lori L; Schwarzmayr, Thomas; Fadista, João; Surdulescu, Gabriela L; Herder, Christian; Groves, Christopher J; Wieland, Thomas; Bork-Jensen, Jette; Brandslund, Ivan; Christensen, Cramer; Koistinen, Heikki A; Doney, Alex S F; Kinnunen, Leena; Esko, Tõnu; Farmer, Andrew J; Hakaste, Liisa; Hodgkiss, Dylan; Kravic, Jasmina; Lyssenko, Valeri; Hollensted, Mette; Jørgensen, Marit E; Jørgensen, Torben; Ladenvall, Claes; Justesen, Johanne Marie; Käräjämäki, Annemari; Kriebel, Jennifer; Rathmann, Wolfgang; Lannfelt, Lars; Lauritzen, Torsten; Narisu, Narisu; Linneberg, Allan; Melander, Olle; Milani, Lili; Neville, Matt; Orho-Melander, Marju; Qi, Lu; Qi, Qibin; Roden, Michael; Rolandsson, Olov; Swift, Amy; Rosengren, Anders H; Stirrups, Kathleen; Wood, Andrew R; Mihailov, Evelin; Blancher, Christine; Carneiro, Mauricio O; Maguire, Jared; Poplin, Ryan; Shakir, Khalid; Fennell, Timothy; DePristo, Mark; de Angelis, Martin Hrabé; Deloukas, Panos; Gjesing, Anette P; Jun, Goo; Nilsson, Peter; Murphy, Jacquelyn; Onofrio, Robert; Thorand, Barbara; Hansen, Torben; Meisinger, Christa; Hu, Frank B; Isomaa, Bo; Karpe, Fredrik; Liang, Liming; Peters, Annette; Huth, Cornelia; O'Rahilly, Stephen P; Palmer, Colin N A; Pedersen, Oluf; Rauramaa, Rainer; Tuomilehto, Jaakko; Salomaa, Veikko; Watanabe, Richard M; Syvänen, Ann-Christine; Bergman, Richard N; Bharadwaj, Dwaipayan; Bottinger, Erwin P; Cho, Yoon Shin; Chandak, Giriraj R; Chan, Juliana Cn; Chia, Kee Seng; Daly, Mark J; Ebrahim, Shah B; Langenberg, Claudia; Elliott, Paul; Jablonski, Kathleen A; Lehman, Donna M; Jia, Weiping; Ma, Ronald C W; Pollin, Toni I; 
Sandhu, Manjinder; Tandon, Nikhil; Froguel, Philippe; Barroso, Inês; Teo, Yik Ying; Zeggini, Eleftheria; Loos, Ruth J F; Small, Kerrin S; Ried, Janina S; DeFronzo, Ralph A; Grallert, Harald; Glaser, Benjamin; Metspalu, Andres; Wareham, Nicholas J; Walker, Mark; Banks, Eric; Gieger, Christian; Ingelsson, Erik; Im, Hae Kyung; Illig, Thomas; Franks, Paul W; Buck, Gemma; Trakalo, Joseph; Buck, David; Prokopenko, Inga; Mägi, Reedik; Lind, Lars; Farjoun, Yossi; Owen, Katharine R; Gloyn, Anna L; Strauch, Konstantin; Tuomi, Tiinamaija; Kooner, Jaspal Singh; Lee, Jong-Young; Park, Taesung; Donnelly, Peter; Morris, Andrew D; Hattersley, Andrew T; Bowden, Donald W; Collins, Francis S; Atzmon, Gil; Chambers, John C; Spector, Timothy D; Laakso, Markku; Strom, Tim M; Bell, Graeme I; Blangero, John; Duggirala, Ravindranath; Tai, E Shyong; McVean, Gilean; Hanis, Craig L; Wilson, James G; Seielstad, Mark; Frayling, Timothy M; Meigs, James B; Cox, Nancy J; Sladek, Rob; Lander, Eric S; Gabriel, Stacey; Mohlke, Karen L; Meitinger, Thomas; Groop, Leif; Abecasis, Goncalo; Scott, Laura J; Morris, Andrew P; Kang, Hyun Min; Altshuler, David; Burtt, Noël P; Florez, Jose C; Boehnke, Michael; McCarthy, Mark I

    2017-12-19

    To investigate the genetic basis of type 2 diabetes (T2D) to high resolution, the GoT2D and T2D-GENES consortia catalogued variation from whole-genome sequencing of 2,657 European individuals and exome sequencing of 12,940 individuals of multiple ancestries. Over 27M SNPs, indels, and structural variants were identified, including 99% of low-frequency (minor allele frequency [MAF] 0.1-5%) non-coding variants in the whole-genome sequenced individuals and 99.7% of low-frequency coding variants in the whole-exome sequenced individuals. Each variant was tested for association with T2D in the sequenced individuals, and, to increase power, most were tested in larger numbers of individuals (>80% of low-frequency coding variants in ~82 K Europeans via the exome chip, and ~90% of low-frequency non-coding variants in ~44 K Europeans via genotype imputation). The variants, genotypes, and association statistics from these analyses provide the largest reference to date of human genetic information relevant to T2D, for use in activities such as T2D-focused genotype imputation, functional characterization of variants or genes, and other novel analyses to detect associations between sequence variation and T2D.

  17. Sequence data and association statistics from 12,940 type 2 diabetes cases and controls

    PubMed Central

    Jason, Flannick; Fuchsberger, Christian; Mahajan, Anubha; Teslovich, Tanya M.; Agarwala, Vineeta; Gaulton, Kyle J.; Caulkins, Lizz; Koesterer, Ryan; Ma, Clement; Moutsianas, Loukas; McCarthy, Davis J.; Rivas, Manuel A.; Perry, John R. B.; Sim, Xueling; Blackwell, Thomas W.; Robertson, Neil R.; Rayner, N William; Cingolani, Pablo; Locke, Adam E.; Tajes, Juan Fernandez; Highland, Heather M.; Dupuis, Josee; Chines, Peter S.; Lindgren, Cecilia M.; Hartl, Christopher; Jackson, Anne U.; Chen, Han; Huyghe, Jeroen R.; van de Bunt, Martijn; Pearson, Richard D.; Kumar, Ashish; Müller-Nurasyid, Martina; Grarup, Niels; Stringham, Heather M.; Gamazon, Eric R.; Lee, Jaehoon; Chen, Yuhui; Scott, Robert A.; Below, Jennifer E.; Chen, Peng; Huang, Jinyan; Go, Min Jin; Stitzel, Michael L.; Pasko, Dorota; Parker, Stephen C. J.; Varga, Tibor V.; Green, Todd; Beer, Nicola L.; Day-Williams, Aaron G.; Ferreira, Teresa; Fingerlin, Tasha; Horikoshi, Momoko; Hu, Cheng; Huh, Iksoo; Ikram, Mohammad Kamran; Kim, Bong-Jo; Kim, Yongkang; Kim, Young Jin; Kwon, Min-Seok; Lee, Juyoung; Lee, Selyeong; Lin, Keng-Han; Maxwell, Taylor J.; Nagai, Yoshihiko; Wang, Xu; Welch, Ryan P.; Yoon, Joon; Zhang, Weihua; Barzilai, Nir; Voight, Benjamin F.; Han, Bok-Ghee; Jenkinson, Christopher P.; Kuulasmaa, Teemu; Kuusisto, Johanna; Manning, Alisa; Ng, Maggie C. Y.; Palmer, Nicholette D.; Balkau, Beverley; Stančáková, Alena; Abboud, Hanna E.; Boeing, Heiner; Giedraitis, Vilmantas; Prabhakaran, Dorairaj; Gottesman, Omri; Scott, James; Carey, Jason; Kwan, Phoenix; Grant, George; Smith, Joshua D.; Neale, Benjamin M.; Purcell, Shaun; Butterworth, Adam S.; Howson, Joanna M. M.; Lee, Heung Man; Lu, Yingchang; Kwak, Soo-Heon; Zhao, Wei; Danesh, John; Lam, Vincent K. L.; Park, Kyong Soo; Saleheen, Danish; So, Wing Yee; Tam, Claudia H. 
T.; Afzal, Uzma; Aguilar, David; Arya, Rector; Aung, Tin; Chan, Edmund; Navarro, Carmen; Cheng, Ching-Yu; Palli, Domenico; Correa, Adolfo; Curran, Joanne E.; Rybin, Dennis; Farook, Vidya S.; Fowler, Sharon P.; Freedman, Barry I.; Griswold, Michael; Hale, Daniel Esten; Hicks, Pamela J.; Khor, Chiea-Chuen; Kumar, Satish; Lehne, Benjamin; Thuillier, Dorothée; Lim, Wei Yen; Liu, Jianjun; Loh, Marie; Musani, Solomon K.; Puppala, Sobha; Scott, William R.; Yengo, Loïc; Tan, Sian-Tsung; Taylor, Herman A.; Thameem, Farook; Wilson, Gregory; Wong, Tien Yin; Njølstad, Pål Rasmus; Levy, Jonathan C.; Mangino, Massimo; Bonnycastle, Lori L.; Schwarzmayr, Thomas; Fadista, João; Surdulescu, Gabriela L.; Herder, Christian; Groves, Christopher J.; Wieland, Thomas; Bork-Jensen, Jette; Brandslund, Ivan; Christensen, Cramer; Koistinen, Heikki A.; Doney, Alex S. F.; Kinnunen, Leena; Esko, Tõnu; Farmer, Andrew J.; Hakaste, Liisa; Hodgkiss, Dylan; Kravic, Jasmina; Lyssenko, Valeri; Hollensted, Mette; Jørgensen, Marit E.; Jørgensen, Torben; Ladenvall, Claes; Justesen, Johanne Marie; Käräjämäki, Annemari; Kriebel, Jennifer; Rathmann, Wolfgang; Lannfelt, Lars; Lauritzen, Torsten; Narisu, Narisu; Linneberg, Allan; Melander, Olle; Milani, Lili; Neville, Matt; Orho-Melander, Marju; Qi, Lu; Qi, Qibin; Roden, Michael; Rolandsson, Olov; Swift, Amy; Rosengren, Anders H.; Stirrups, Kathleen; Wood, Andrew R.; Mihailov, Evelin; Blancher, Christine; Carneiro, Mauricio O.; Maguire, Jared; Poplin, Ryan; Shakir, Khalid; Fennell, Timothy; DePristo, Mark; de Angelis, Martin Hrabé; Deloukas, Panos; Gjesing, Anette P.; Jun, Goo; Nilsson, Peter; Murphy, Jacquelyn; Onofrio, Robert; Thorand, Barbara; Hansen, Torben; Meisinger, Christa; Hu, Frank B.; Isomaa, Bo; Karpe, Fredrik; Liang, Liming; Peters, Annette; Huth, Cornelia; O'Rahilly, Stephen P; Palmer, Colin N. 
A.; Pedersen, Oluf; Rauramaa, Rainer; Tuomilehto, Jaakko; Salomaa, Veikko; Watanabe, Richard M.; Syvänen, Ann-Christine; Bergman, Richard N.; Bharadwaj, Dwaipayan; Bottinger, Erwin P.; Cho, Yoon Shin; Chandak, Giriraj R.; Chan, Juliana CN; Chia, Kee Seng; Daly, Mark J.; Ebrahim, Shah B.; Langenberg, Claudia; Elliott, Paul; Jablonski, Kathleen A.; Lehman, Donna M.; Jia, Weiping; Ma, Ronald C. W.; Pollin, Toni I.; Sandhu, Manjinder; Tandon, Nikhil; Froguel, Philippe; Barroso, Inês; Teo, Yik Ying; Zeggini, Eleftheria; Loos, Ruth J. F.; Small, Kerrin S.; Ried, Janina S.; DeFronzo, Ralph A.; Grallert, Harald; Glaser, Benjamin; Metspalu, Andres; Wareham, Nicholas J.; Walker, Mark; Banks, Eric; Gieger, Christian; Ingelsson, Erik; Im, Hae Kyung; Illig, Thomas; Franks, Paul W.; Buck, Gemma; Trakalo, Joseph; Buck, David; Prokopenko, Inga; Mägi, Reedik; Lind, Lars; Farjoun, Yossi; Owen, Katharine R.; Gloyn, Anna L.; Strauch, Konstantin; Tuomi, Tiinamaija; Kooner, Jaspal Singh; Lee, Jong-Young; Park, Taesung; Donnelly, Peter; Morris, Andrew D.; Hattersley, Andrew T.; Bowden, Donald W.; Collins, Francis S.; Atzmon, Gil; Chambers, John C.; Spector, Timothy D.; Laakso, Markku; Strom, Tim M.; Bell, Graeme I.; Blangero, John; Duggirala, Ravindranath; Tai, E. Shyong; McVean, Gilean; Hanis, Craig L.; Wilson, James G.; Seielstad, Mark; Frayling, Timothy M.; Meigs, James B.; Cox, Nancy J.; Sladek, Rob; Lander, Eric S.; Gabriel, Stacey; Mohlke, Karen L.; Meitinger, Thomas; Groop, Leif; Abecasis, Goncalo; Scott, Laura J.; Morris, Andrew P.; Kang, Hyun Min; Altshuler, David; Burtt, Noël P.; Florez, Jose C.; Boehnke, Michael; McCarthy, Mark I.

    2017-01-01

    To investigate the genetic basis of type 2 diabetes (T2D) to high resolution, the GoT2D and T2D-GENES consortia catalogued variation from whole-genome sequencing of 2,657 European individuals and exome sequencing of 12,940 individuals of multiple ancestries. Over 27M SNPs, indels, and structural variants were identified, including 99% of low-frequency (minor allele frequency [MAF] 0.1–5%) non-coding variants in the whole-genome sequenced individuals and 99.7% of low-frequency coding variants in the whole-exome sequenced individuals. Each variant was tested for association with T2D in the sequenced individuals, and, to increase power, most were tested in larger numbers of individuals (>80% of low-frequency coding variants in ~82 K Europeans via the exome chip, and ~90% of low-frequency non-coding variants in ~44 K Europeans via genotype imputation). The variants, genotypes, and association statistics from these analyses provide the largest reference to date of human genetic information relevant to T2D, for use in activities such as T2D-focused genotype imputation, functional characterization of variants or genes, and other novel analyses to detect associations between sequence variation and T2D. PMID:29257133

  18. Risk-Stratified Imputation in Survival Analysis

    PubMed Central

    Kennedy, Richard E.; Adragni, Kofi P.; Tiwari, Hemant K.; Voeks, Jenifer H.; Brott, Thomas G.; Howard, George

    2013-01-01

    Background: Censoring that is dependent on covariates associated with survival can arise in randomized trials due to changes in recruitment and eligibility criteria to minimize withdrawals, potentially leading to biased treatment effect estimates. Imputation approaches have been proposed to address censoring in survival analysis; and while these approaches may provide unbiased estimates of treatment effects, imputation of a large number of outcomes may over- or underestimate the associated variance based on the imputation pool selected. Purpose: We propose an improved method, risk-stratified imputation, as an alternative to address withdrawal related to the risk of events in the context of time-to-event analyses. Methods: Our algorithm performs imputation from a pool of replacement subjects with similar values of both treatment and covariate(s) of interest, that is, from a risk-stratified sample. This stratification prior to imputation addresses the requirement of time-to-event analysis that censored observations are representative of all other observations in the risk group with similar exposure variables. We compared our risk-stratified imputation to case deletion and bootstrap imputation in a simulated dataset in which the covariate of interest (study withdrawal) was related to treatment. A motivating example from a recent clinical trial is also presented to demonstrate the utility of our method. Results: In our simulations, risk-stratified imputation gives estimates of treatment effect comparable to bootstrap and auxiliary variable imputation while avoiding inaccuracies of the latter two in estimating the associated variance. Similar results were obtained in analysis of clinical trial data. Limitations: Risk-stratified imputation has little advantage over other imputation methods when covariates of interest are not related to treatment, although its performance is superior when covariates are related to treatment. Risk-stratified imputation is intended for categorical covariates, and may be sensitive to the width of the matching window if continuous covariates are used. Conclusions: The use of the risk-stratified imputation should facilitate the analysis of many clinical trials, in which one group has a higher withdrawal rate that is related to treatment. PMID:23818434
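
    The idea of drawing replacements from a risk-stratified pool can be illustrated with a minimal sketch. This is not the authors' implementation: the record fields (`treatment`, `covariate`, `time`, `event`, `withdrawn`), the function name, and the rule that a donor must still be under observation at the withdrawal time are all assumptions made for illustration.

```python
import random
from collections import defaultdict


def risk_stratified_impute(subjects, rng=None):
    """Hot-deck style imputation within (treatment, covariate) strata.

    Each subject is a dict with the hypothetical keys 'treatment',
    'covariate', 'time', 'event' (bool) and 'withdrawn' (bool).
    Returns new records; the input list is not modified.
    """
    rng = rng or random.Random(0)

    # Build one donor pool per stratum from subjects who did not withdraw.
    pools = defaultdict(list)
    for s in subjects:
        if not s["withdrawn"]:
            pools[(s["treatment"], s["covariate"])].append(s)

    completed = []
    for s in subjects:
        if not s["withdrawn"]:
            completed.append({**s, "imputed": False})
            continue
        # Assumed donor rule: same stratum, followed at least as long
        # as the withdrawn subject was observed.
        stratum = (s["treatment"], s["covariate"])
        donors = [d for d in pools[stratum] if d["time"] >= s["time"]]
        if not donors:
            # No suitable donor: keep the record censored as observed.
            completed.append({**s, "imputed": False})
            continue
        donor = rng.choice(donors)
        completed.append({**s, "time": donor["time"], "event": donor["event"],
                          "withdrawn": False, "imputed": True})
    return completed
```

    Restricting donors to the same treatment arm and covariate level is what distinguishes this from an unstratified draw; the variance estimation discussed in the abstract is not covered by the sketch.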

  19. Rare coding variants in PLCG2, ABI3 and TREM2 implicate microglial-mediated innate immunity in Alzheimer’s disease

    PubMed Central

    Sims, Rebecca; van der Lee, Sven J.; Naj, Adam C.; Bellenguez, Céline; Badarinarayan, Nandini; Jakobsdottir, Johanna; Kunkle, Brian W.; Boland, Anne; Raybould, Rachel; Bis, Joshua C.; Martin, Eden R.; Grenier-Boley, Benjamin; Heilmann-Heimbach, Stefanie; Chouraki, Vincent; Kuzma, Amanda B.; Sleegers, Kristel; Vronskaya, Maria; Ruiz, Agustin; Graham, Robert R.; Olaso, Robert; Hoffmann, Per; Grove, Megan L.; Vardarajan, Badri N.; Hiltunen, Mikko; Nöthen, Markus M.; White, Charles C.; Hamilton-Nelson, Kara L.; Epelbaum, Jacques; Maier, Wolfgang; Choi, Seung-Hoan; Beecham, Gary W.; Dulary, Cécile; Herms, Stefan; Smith, Albert V.; Funk, Cory C.; Derbois, Céline; Forstner, Andreas J.; Ahmad, Shahzad; Li, Hongdong; Bacq, Delphine; Harold, Denise; Satizabal, Claudia L.; Valladares, Otto; Squassina, Alessio; Thomas, Rhodri; Brody, Jennifer A.; Qu, Liming; Sanchez-Juan, Pascual; Morgan, Taniesha; Wolters, Frank J.; Zhao, Yi; Garcia, Florentino Sanchez; Denning, Nicola; Fornage, Myriam; Malamon, John; Naranjo, Maria Candida Deniz; Majounie, Elisa; Mosley, Thomas H.; Dombroski, Beth; Wallon, David; Lupton, Michelle K; Dupuis, Josée; Whitehead, Patrice; Fratiglioni, Laura; Medway, Christopher; Jian, Xueqiu; Mukherjee, Shubhabrata; Keller, Lina; Brown, Kristelle; Lin, Honghuang; Cantwell, Laura B.; Panza, Francesco; McGuinness, Bernadette; Moreno-Grau, Sonia; Burgess, Jeremy D.; Solfrizzi, Vincenzo; Proitsi, Petra; Adams, Hieab H.; Allen, Mariet; Seripa, Davide; Pastor, Pau; Cupples, L. 
Adrienne; Price, Nathan D; Hannequin, Didier; Frank-García, Ana; Levy, Daniel; Chakrabarty, Paramita; Caffarra, Paolo; Giegling, Ina; Beiser, Alexa S.; Giedraitis, Vimantas; Hampel, Harald; Garcia, Melissa E.; Wang, Xue; Lannfelt, Lars; Mecocci, Patrizia; Eiriksdottir, Gudny; Crane, Paul K.; Pasquier, Florence; Boccardi, Virginia; Henández, Isabel; Barber, Robert C.; Scherer, Martin; Tarraga, Lluis; Adams, Perrie M.; Leber, Markus; Chen, Yuning; Albert, Marilyn S.; Riedel-Heller, Steffi; Emilsson, Valur; Beekly, Duane; Braae, Anne; Schmidt, Reinhold; Blacker, Deborah; Masullo, Carlo; Schmidt, Helena; Doody, Rachelle S.; Spalletta, Gianfranco; Longstreth, WT; Fairchild, Thomas J.; Bossù, Paola; Lopez, Oscar L.; Frosch, Matthew P.; Sacchinelli, Eleonora; Ghetti, Bernardino; Sánchez-Juan, Pascual; Yang, Qiong; Huebinger, Ryan M.; Jessen, Frank; Li, Shuo; Kamboh, M. Ilyas; Morris, John; Sotolongo-Grau, Oscar; Katz, Mindy J.; Corcoran, Chris; Himali, Jayanadra J.; Keene, C. Dirk; Tschanz, JoAnn; Fitzpatrick, Annette L.; Kukull, Walter A.; Norton, Maria; Aspelund, Thor; Larson, Eric B.; Munger, Ron; Rotter, Jerome I.; Lipton, Richard B.; Bullido, María J; Hofman, Albert; Montine, Thomas J.; Coto, Eliecer; Boerwinkle, Eric; Petersen, Ronald C.; Alvarez, Victoria; Rivadeneira, Fernando; Reiman, Eric M.; Gallo, Maura; O’Donnell, Christopher J.; Reisch, Joan S.; Bruni, Amalia Cecilia; Royall, Donald R.; Dichgans, Martin; Sano, Mary; Galimberti, Daniela; St George-Hyslop, Peter; Scarpini, Elio; Tsuang, Debby W.; Mancuso, Michelangelo; Bonuccelli, Ubaldo; Winslow, Ashley R.; Daniele, Antonio; Wu, Chuang-Kuo; Peters, Oliver; Nacmias, Benedetta; Riemenschneider, Matthias; Heun, Reinhard; Brayne, Carol; Rubinsztein, David C; Bras, Jose; Guerreiro, Rita; Hardy, John; Al-Chalabi, Ammar; Shaw, Christopher E; Collinge, John; Mann, David; Tsolaki, Magda; Clarimón, Jordi; Sussams, Rebecca; Lovestone, Simon; O’Donovan, Michael C; Owen, Michael J; Behrens, Timothy W.; Mead, Simon; Goate, 
Alison M.; Uitterlinden, Andre G.; Holmes, Clive; Cruchaga, Carlos; Ingelsson, Martin; Bennett, David A.; Powell, John; Golde, Todd E.; Graff, Caroline; De Jager, Philip L.; Morgan, Kevin; Ertekin-Taner, Nilufer; Combarros, Onofre; Psaty, Bruce M.; Passmore, Peter; Younkin, Steven G; Berr, Claudine; Gudnason, Vilmundur; Rujescu, Dan; Dickson, Dennis W.; Dartigues, Jean-Francois; DeStefano, Anita L.; Ortega-Cubero, Sara; Hakonarson, Hakon; Campion, Dominique; Boada, Merce; Kauwe, John “Keoni”; Farrer, Lindsay A.; Van Broeckhoven, Christine; Ikram, M. Arfan; Jones, Lesley; Haines, Johnathan; Tzourio, Christophe; Launer, Lenore J.; Escott-Price, Valentina; Mayeux, Richard; Deleuze, Jean-François; Amin, Najaf; Holmans, Peter A; Pericak-Vance, Margaret A.; Amouyel, Philippe; van Duijn, Cornelia M.; Ramirez, Alfredo; Wang, Li-San; Lambert, Jean-Charles; Seshadri, Sudha; Williams, Julie; Schellenberg, Gerard D.

    2017-01-01

    Introduction: We identified rare coding variants associated with Alzheimer’s disease (AD) in a 3-stage case-control study of 85,133 subjects. In stage 1, 34,174 samples were genotyped using a whole-exome microarray. In stage 2, we tested associated variants (P<1×10⁻⁴) in 35,962 independent samples using de novo genotyping and imputed genotypes. In stage 3, an additional 14,997 samples were used to test the most significant stage 2 associations (P<5×10⁻⁸) using imputed genotypes. We observed 3 novel genome-wide significant (GWS) AD associated non-synonymous variants; a protective variant in PLCG2 (rs72824905/p.P522R, P=5.38×10⁻¹⁰, OR=0.68, MAFcases=0.0059, MAFcontrols=0.0093), a risk variant in ABI3 (rs616338/p.S209F, P=4.56×10⁻¹⁰, OR=1.43, MAFcases=0.011, MAFcontrols=0.008), and a novel GWS variant in TREM2 (rs143332484/p.R62H, P=1.55×10⁻¹⁴, OR=1.67, MAFcases=0.0143, MAFcontrols=0.0089), a known AD susceptibility gene. These protein-coding changes are in genes highly expressed in microglia and highlight an immune-related protein-protein interaction network enriched for previously identified AD risk genes. These genetic findings provide additional evidence that the microglia-mediated innate immune response contributes directly to AD development. PMID:28714976

  20. Genetic risk variants for membranous nephropathy: extension of and association with other chronic kidney disease aetiologies.

    PubMed

    Sekula, Peggy; Li, Yong; Stanescu, Horia C; Wuttke, Matthias; Ekici, Arif B; Bockenhauer, Detlef; Walz, Gerd; Powis, Stephen H; Kielstein, Jan T; Brenchley, Paul; Eckardt, Kai-Uwe; Kronenberg, Florian; Kleta, Robert; Köttgen, Anna

    2017-02-01

    Membranous nephropathy (MN) is a common cause of nephrotic syndrome in adults. Previous genome-wide association studies (GWAS) of 300 000 genotyped variants identified MN-associated loci at HLA-DQA1 and PLA2R1. We used a combined approach of genotype imputation, GWAS, human leucocyte antigen (HLA) imputation and extension to other aetiologies of chronic kidney disease (CKD) to investigate genetic MN risk variants more comprehensively. GWAS using 9 million high-quality imputed genotypes and classical HLA alleles were conducted for 323 MN European-ancestry cases and 345 controls. Additionally, 4960 patients with different CKD aetiologies in the German Chronic Kidney Disease (GCKD) study were genotyped for risk variants at HLA-DQA1 and PLA2R1. In GWAS, lead variants in known loci [rs9272729, HLA-DQA1, odds ratio (OR) = 7.3 per risk allele, P = 5.9 × 10⁻²⁷ and rs17830558, PLA2R1, OR = 2.2, P = 1.9 × 10⁻⁸] were significantly associated with MN. No novel signals emerged in GWAS of X-chromosomal variants or in sex-specific analyses. Classical HLA alleles (DRB1*0301-DQA1*0501-DQB1*0201 haplotype) were associated with MN but provided little additional information beyond rs9272729. Associations were replicated in 137 GCKD patients with MN (HLA-DQA1: P = 6.4 × 10⁻²⁴; PLA2R1: P = 5.0 × 10⁻⁴). MN risk increased steeply for patients with high-risk genotype combinations (OR > 79). While genetic variation in PLA2R1 exclusively associated with MN across 19 CKD aetiologies, the HLA-DQA1 risk allele was also associated with lupus nephritis (P = 2.8 × 10⁻⁶), type 1 diabetic nephropathy (P = 6.9 × 10⁻⁵) and focal segmental glomerulosclerosis (P = 5.1 × 10⁻⁵), but not with immunoglobulin A nephropathy. PLA2R1 and HLA-DQA1 are the predominant risk loci for MN detected by GWAS. While HLA-DQA1 risk variants show an association with other CKD aetiologies, PLA2R1 variants are specific to MN. © The Author 2016. Published by Oxford University Press on behalf of ERA-EDTA. All rights reserved.

  1. Genomic prediction of the polled and horned phenotypes in Merino sheep.

    PubMed

    Duijvesteijn, Naomi; Bolormaa, Sunduimijid; Daetwyler, Hans D; van der Werf, Julius H J

    2018-05-22

    In horned sheep breeds, breeding for polledness has been of interest for decades. The objective of this study was to improve prediction of the horned and polled phenotypes using horn scores classified as polled, scurs, knobs or horns. Derived phenotypes polled/non-polled (P/NP) and horned/non-horned (H/NH) were used to test four different strategies for prediction in 4001 purebred Merino sheep. These strategies include the use of single 'single nucleotide polymorphism' (SNP) genotypes, multiple-SNP haplotypes, genome-wide and chromosome-wide genomic best linear unbiased prediction and information from imputed sequence variants from the region including the RXFP2 gene. Low-density genotypes of these animals were imputed to the Illumina Ovine high-density (600k) chip and the 1.78-kb insertion polymorphism in RXFP2 was included in the imputation process to whole-genome sequence. We evaluated the mode of inheritance and validated models by a fivefold cross-validation and across- and between-family prediction. The most significant SNPs for prediction of P/NP and H/NH were OAR10_29546872.1 and OAR10_29458450, respectively, located on chromosome 10 close to the 1.78-kb insertion at 29.5 Mb. The mode of inheritance included an additive effect and a sex-dependent effect for dominance for P/NP and a sex-dependent additive and dominance effect for H/NH. Models with the highest prediction accuracies for H/NH used either single SNPs or 3-SNP haplotypes and included a polygenic effect estimated based on traditional pedigree relationships. Prediction accuracies for H/NH were 0.323 for females and 0.725 for males. For predicting P/NP, the best models were the same as for H/NH but included a genomic relationship matrix with accuracies of 0.713 for females and 0.620 for males. Our results show that prediction accuracy is high using a single SNP, but does not reach 1 since the causative mutation is not genotyped. 
Incomplete penetrance or allelic heterogeneity, which can influence expression of the phenotype, may explain why prediction accuracy did not approach 1 with any of the genetic models tested here. Nevertheless, a breeding program to eradicate horns from Merino sheep can be effective by selecting genotypes GG of SNP OAR10_29458450 or TT of SNP OAR10_29546872.1 since all sheep with these genotypes will be non-horned.

  2. Imputation of variants from the 1000 Genomes Project modestly improves known associations and can identify low-frequency variant-phenotype associations undetected by HapMap based imputation.

    PubMed

    Wood, Andrew R; Perry, John R B; Tanaka, Toshiko; Hernandez, Dena G; Zheng, Hou-Feng; Melzer, David; Gibbs, J Raphael; Nalls, Michael A; Weedon, Michael N; Spector, Tim D; Richards, J Brent; Bandinelli, Stefania; Ferrucci, Luigi; Singleton, Andrew B; Frayling, Timothy M

    2013-01-01

    Genome-wide association (GWA) studies have been limited by the reliance on common variants present on microarrays or imputable from the HapMap Project data. More recently, the completion of the 1000 Genomes Project has provided variant and haplotype information for several million variants derived from sequencing over 1,000 individuals. To help understand the extent to which more variants (including low frequency (1% ≤ MAF <5%) and rare variants (<1%)) can enhance previously identified associations and identify novel loci, we selected 93 quantitative circulating factors where data was available from the InCHIANTI population study. These phenotypes included cytokines, binding proteins, hormones, vitamins and ions. We selected these phenotypes because many have known strong genetic associations and are potentially important to help understand disease processes. We performed a genome-wide scan for these 93 phenotypes in InCHIANTI. We identified 21 signals and 33 signals that reached P<5×10⁻⁸ based on HapMap and 1000 Genomes imputation, respectively, and 9 and 11 that reached a stricter, likely conservative, threshold of P<5×10⁻¹¹ respectively. Imputation of 1000 Genomes genotype data modestly improved the strength of known associations. Of 20 associations detected at P<5×10⁻⁸ in both analyses (17 of which represent well replicated signals in the NHGRI catalogue), six were captured by the same index SNP, five were nominally more strongly associated in 1000 Genomes imputed data and one was nominally more strongly associated in HapMap imputed data. We also detected an association between a low frequency variant and phenotype that was previously missed by HapMap based imputation approaches. An association between rs112635299 and alpha-1 globulin near the SERPINA gene represented the known association between rs28929474 (MAF = 0.007) and alpha1-antitrypsin that predisposes to emphysema (P = 2.5×10⁻¹²). Our data provide important proof of principle that 1000 Genomes imputation will detect novel, low frequency-large effect associations.

  3. Imputation of Variants from the 1000 Genomes Project Modestly Improves Known Associations and Can Identify Low-frequency Variant - Phenotype Associations Undetected by HapMap Based Imputation

    PubMed Central

    Wood, Andrew R.; Perry, John R. B.; Tanaka, Toshiko; Hernandez, Dena G.; Zheng, Hou-Feng; Melzer, David; Gibbs, J. Raphael; Nalls, Michael A.; Weedon, Michael N.; Spector, Tim D.; Richards, J. Brent; Bandinelli, Stefania; Ferrucci, Luigi; Singleton, Andrew B.; Frayling, Timothy M.

    2013-01-01

    Genome-wide association (GWA) studies have been limited by the reliance on common variants present on microarrays or imputable from the HapMap Project data. More recently, the completion of the 1000 Genomes Project has provided variant and haplotype information for several million variants derived from sequencing over 1,000 individuals. To help understand the extent to which more variants (including low frequency (1% ≤ MAF <5%) and rare variants (<1%)) can enhance previously identified associations and identify novel loci, we selected 93 quantitative circulating factors where data was available from the InCHIANTI population study. These phenotypes included cytokines, binding proteins, hormones, vitamins and ions. We selected these phenotypes because many have known strong genetic associations and are potentially important to help understand disease processes. We performed a genome-wide scan for these 93 phenotypes in InCHIANTI. We identified 21 signals and 33 signals that reached P<5×10⁻⁸ based on HapMap and 1000 Genomes imputation, respectively, and 9 and 11 that reached a stricter, likely conservative, threshold of P<5×10⁻¹¹ respectively. Imputation of 1000 Genomes genotype data modestly improved the strength of known associations. Of 20 associations detected at P<5×10⁻⁸ in both analyses (17 of which represent well replicated signals in the NHGRI catalogue), six were captured by the same index SNP, five were nominally more strongly associated in 1000 Genomes imputed data and one was nominally more strongly associated in HapMap imputed data. We also detected an association between a low frequency variant and phenotype that was previously missed by HapMap based imputation approaches. An association between rs112635299 and alpha-1 globulin near the SERPINA gene represented the known association between rs28929474 (MAF = 0.007) and alpha1-antitrypsin that predisposes to emphysema (P = 2.5×10⁻¹²). Our data provide important proof of principle that 1000 Genomes imputation will detect novel, low frequency-large effect associations. PMID:23696881

  4. A Maximum-Likelihood Method to Correct for Allelic Dropout in Microsatellite Data with No Replicate Genotypes

    PubMed Central

    Wang, Chaolong; Schroeder, Kari B.; Rosenberg, Noah A.

    2012-01-01

    Allelic dropout is a commonly observed source of missing data in microsatellite genotypes, in which one or both allelic copies at a locus fail to be amplified by the polymerase chain reaction. Especially for samples with poor DNA quality, this problem causes a downward bias in estimates of observed heterozygosity and an upward bias in estimates of inbreeding, owing to mistaken classifications of heterozygotes as homozygotes when one of the two copies drops out. One general approach for avoiding allelic dropout involves repeated genotyping of homozygous loci to minimize the effects of experimental error. Existing computational alternatives often require replicate genotyping as well. These approaches, however, are costly and are suitable only when enough DNA is available for repeated genotyping. In this study, we propose a maximum-likelihood approach together with an expectation-maximization algorithm to jointly estimate allelic dropout rates and allele frequencies when only one set of nonreplicated genotypes is available. Our method considers estimates of allelic dropout caused by both sample-specific factors and locus-specific factors, and it allows for deviation from Hardy–Weinberg equilibrium owing to inbreeding. Using the estimated parameters, we correct the bias in the estimation of observed heterozygosity through the use of multiple imputations of alleles in cases where dropout might have occurred. With simulated data, we show that our method can (1) effectively reproduce patterns of missing data and heterozygosity observed in real data; (2) correctly estimate model parameters, including sample-specific dropout rates, locus-specific dropout rates, and the inbreeding coefficient; and (3) successfully correct the downward bias in estimating the observed heterozygosity. We find that our method is fairly robust to violations of model assumptions caused by population structure and by genotyping errors from sources other than allelic dropout. 
Because the data sets imputed under our model can be investigated in additional subsequent analyses, our method will be useful for preparing data for applications in diverse contexts in population genetics and molecular ecology. PMID:22851645
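
    The downward bias in observed heterozygosity described above can be made concrete with a deliberately simplified sketch. It assumes a single dropout probability `dropout` that applies independently to each allele copy, which is an assumption made here for illustration; the abstract's model uses sample-specific and locus-specific rates, inbreeding, and an EM fit, none of which is reproduced. Under this assumption a true heterozygote is scored as a heterozygote with probability (1 − γ)², and any genotype is missing with probability γ², so among non-missing genotypes the observed heterozygosity is H(1 − γ)/(1 + γ).

```python
def observed_heterozygosity(true_het, dropout):
    """Expected heterozygosity among non-missing genotypes.

    Assumes each allele copy drops out independently with probability
    `dropout` (one shared rate, a simplification for illustration).
    """
    if not 0.0 <= dropout < 1.0:
        raise ValueError("dropout must be in [0, 1)")
    # H * (1 - g)^2 / (1 - g^2) simplifies to H * (1 - g) / (1 + g).
    return true_het * (1.0 - dropout) / (1.0 + dropout)


def corrected_heterozygosity(obs_het, dropout):
    """Invert the bias above when the dropout rate is known."""
    if not 0.0 <= dropout < 1.0:
        raise ValueError("dropout must be in [0, 1)")
    return obs_het * (1.0 + dropout) / (1.0 - dropout)
```

    For example, with a true heterozygosity of 0.6 and a per-copy dropout rate of 0.2, the expected observed value under this model is 0.6 × 0.8 / 1.2 = 0.4.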

  5. Genome-wide imputation study identifies novel HLA locus for pulmonary fibrosis and potential role for auto-immunity in fibrotic idiopathic interstitial pneumonia.

    PubMed

    Fingerlin, Tasha E; Zhang, Weiming; Yang, Ivana V; Ainsworth, Hannah C; Russell, Pamela H; Blumhagen, Rachel Z; Schwarz, Marvin I; Brown, Kevin K; Steele, Mark P; Loyd, James E; Cosgrove, Gregory P; Lynch, David A; Groshong, Steve; Collard, Harold R; Wolters, Paul J; Bradford, Williamson Z; Kossen, Karl; Seiwert, Scott D; du Bois, Roland M; Garcia, Christine Kim; Devine, Megan S; Gudmundsson, Gunnar; Isaksson, Helgi J; Kaminski, Naftali; Zhang, Yingze; Gibson, Kevin F; Lancaster, Lisa H; Maher, Toby M; Molyneaux, Philip L; Wells, Athol U; Moffatt, Miriam F; Selman, Moises; Pardo, Annie; Kim, Dong Soon; Crapo, James D; Make, Barry J; Regan, Elizabeth A; Walek, Dinesha S; Daniel, Jerry J; Kamatani, Yoichiro; Zelenika, Diana; Murphy, Elissa; Smith, Keith; McKean, David; Pedersen, Brent S; Talbert, Janet; Powers, Julia; Markin, Cheryl R; Beckman, Kenneth B; Lathrop, Mark; Freed, Brian; Langefeld, Carl D; Schwartz, David A

    2016-06-07

    Fibrotic idiopathic interstitial pneumonias (fIIP) are a group of fatal lung diseases with largely unknown etiology and without definitive treatment other than lung transplant to prolong life. There is strong evidence for the importance of both rare and common genetic risk alleles in familial and sporadic disease. We have previously used genome-wide single nucleotide polymorphism data to identify 10 risk loci for fIIP. Here we extend that work to imputed genome-wide genotypes and conduct new RNA sequencing studies of lung tissue to identify and characterize new fIIP risk loci. We performed genome-wide genotype imputation association analyses in 1616 non-Hispanic white (NHW) cases and 4683 NHW controls followed by validation and replication (878 cases, 2017 controls) genotyping and targeted gene expression in lung tissue. Following meta-analysis of the discovery and replication populations, we identified a novel fIIP locus in the HLA region of chromosome 6 (rs7887, P_meta = 3.7 × 10⁻⁹). Imputation of classic HLA alleles identified two in high linkage disequilibrium that are associated with fIIP (DRB1*15:01, P = 1.3 × 10⁻⁷ and DQB1*06:02, P = 6.1 × 10⁻⁸). Targeted RNA-sequencing of the HLA locus identified 21 genes differentially expressed between fibrotic and control lung tissue (Q < 0.001), many of which are involved in immune and inflammatory response regulation. In addition, the putative risk alleles, DRB1*15:01 and DQB1*06:02, are associated with expression of the DQB1 gene among fIIP cases (Q < 1 × 10⁻¹⁶). We have identified a genome-wide significant association between the HLA region and fIIP. Two HLA alleles are associated with fIIP and affect expression of HLA genes in lung tissue, indicating that the potential genetic risk due to HLA alleles may involve gene regulation in addition to altered protein structure. These studies reveal the importance of the HLA region for risk of fIIP and a basis for the potential etiologic role of auto-immunity in fIIP.

  6. Imputation of microsatellite alleles from dense SNP genotypes for parentage verification across multiple Bos taurus and Bos indicus breeds

    USDA-ARS's Scientific Manuscript database

    Microsatellite markers (MS) have traditionally been used for parental verification and are still the international standard in spite of their higher cost, error rate, and turnaround time compared with Single Nucleotide Polymorphisms (SNP) -based assays. Despite domestic and international demands fr...

  7. Genetic architecture of feed efficiency in mid-lactation Holstein dairy cows

    USDA-ARS's Scientific Manuscript database

    The objective of this study was to explore the genetic architecture and biological basis of feed efficiency in lactating Holstein cows. In total, 4,918 cows with actual or imputed genotypes for 60,671 SNP had individual feed intake, milk yield, milk composition, and body weight records. Cows were ...

  8. The genetic and biological basis of feed efficiency in mid-lactation Holstein dairy cows

    USDA-ARS's Scientific Manuscript database

    The objective of this study was to characterize the genetic architecture and biological basis of feed efficiency in lactating Holstein cows. In total, 4,916 cows with actual or imputed genotypes for 60,671 SNP had individual feed intake, milk yield, milk composition, and body weight records. Cows we...

  9. Very low-depth sequencing in a founder population identifies a cardioprotective APOC3 signal missed by genome-wide imputation.

    PubMed

    Gilly, Arthur; Ritchie, Graham Rs; Southam, Lorraine; Farmaki, Aliki-Eleni; Tsafantakis, Emmanouil; Dedoussis, George; Zeggini, Eleftheria

    2016-06-01

    Cohort-wide very low-depth whole-genome sequencing (WGS) can comprehensively capture low-frequency sequence variation for the cost of a dense genome-wide genotyping array. Here, we analyse 1x sequence data across the APOC3 gene in a founder population from the island of Crete in Greece (n = 1239) and find significant evidence for association with blood triglyceride levels with the previously reported R19X cardioprotective null mutation (β = −1.09, σ = 0.163, P = 8.2 × 10⁻¹¹) and a second loss of function mutation, rs138326449 (β = −1.17, σ = 0.188, P = 1.14 × 10⁻⁹). The signal cannot be recapitulated by imputing genome-wide genotype data on a large reference panel of 5122 individuals including 249 with 4x WGS data from the same population. Gene-level meta-analysis with other studies reporting burden signals at APOC3 provides robust evidence for a replicable cardioprotective rare variant aggregation (P = 3.2 × 10⁻³¹, n = 13 480). © The Author 2016. Published by Oxford University Press.

  10. Very low-depth sequencing in a founder population identifies a cardioprotective APOC3 signal missed by genome-wide imputation

    PubMed Central

    Gilly, Arthur; Ritchie, Graham Rs; Southam, Lorraine; Farmaki, Aliki-Eleni; Tsafantakis, Emmanouil; Dedoussis, George; Zeggini, Eleftheria

    2016-01-01

    Cohort-wide very low-depth whole-genome sequencing (WGS) can comprehensively capture low-frequency sequence variation for the cost of a dense genome-wide genotyping array. Here, we analyse 1x sequence data across the APOC3 gene in a founder population from the island of Crete in Greece (n = 1239) and find significant evidence for association with blood triglyceride levels with the previously reported R19X cardioprotective null mutation (β = −1.09, σ = 0.163, P = 8.2 × 10⁻¹¹) and a second loss of function mutation, rs138326449 (β = −1.17, σ = 0.188, P = 1.14 × 10⁻⁹). The signal cannot be recapitulated by imputing genome-wide genotype data on a large reference panel of 5122 individuals including 249 with 4x WGS data from the same population. Gene-level meta-analysis with other studies reporting burden signals at APOC3 provides robust evidence for a replicable cardioprotective rare variant aggregation (P = 3.2 × 10⁻³¹, n = 13 480). PMID:27146844

  11. Rare coding variants in PLCG2, ABI3, and TREM2 implicate microglial-mediated innate immunity in Alzheimer's disease.

    PubMed

    Sims, Rebecca; van der Lee, Sven J; Naj, Adam C; Bellenguez, Céline; Badarinarayan, Nandini; Jakobsdottir, Johanna; Kunkle, Brian W; Boland, Anne; Raybould, Rachel; Bis, Joshua C; Martin, Eden R; Grenier-Boley, Benjamin; Heilmann-Heimbach, Stefanie; Chouraki, Vincent; Kuzma, Amanda B; Sleegers, Kristel; Vronskaya, Maria; Ruiz, Agustin; Graham, Robert R; Olaso, Robert; Hoffmann, Per; Grove, Megan L; Vardarajan, Badri N; Hiltunen, Mikko; Nöthen, Markus M; White, Charles C; Hamilton-Nelson, Kara L; Epelbaum, Jacques; Maier, Wolfgang; Choi, Seung-Hoan; Beecham, Gary W; Dulary, Cécile; Herms, Stefan; Smith, Albert V; Funk, Cory C; Derbois, Céline; Forstner, Andreas J; Ahmad, Shahzad; Li, Hongdong; Bacq, Delphine; Harold, Denise; Satizabal, Claudia L; Valladares, Otto; Squassina, Alessio; Thomas, Rhodri; Brody, Jennifer A; Qu, Liming; Sánchez-Juan, Pascual; Morgan, Taniesha; Wolters, Frank J; Zhao, Yi; Garcia, Florentino Sanchez; Denning, Nicola; Fornage, Myriam; Malamon, John; Naranjo, Maria Candida Deniz; Majounie, Elisa; Mosley, Thomas H; Dombroski, Beth; Wallon, David; Lupton, Michelle K; Dupuis, Josée; Whitehead, Patrice; Fratiglioni, Laura; Medway, Christopher; Jian, Xueqiu; Mukherjee, Shubhabrata; Keller, Lina; Brown, Kristelle; Lin, Honghuang; Cantwell, Laura B; Panza, Francesco; McGuinness, Bernadette; Moreno-Grau, Sonia; Burgess, Jeremy D; Solfrizzi, Vincenzo; Proitsi, Petra; Adams, Hieab H; Allen, Mariet; Seripa, Davide; Pastor, Pau; Cupples, L Adrienne; Price, Nathan D; Hannequin, Didier; Frank-García, Ana; Levy, Daniel; Chakrabarty, Paramita; Caffarra, Paolo; Giegling, Ina; Beiser, Alexa S; Giedraitis, Vilmantas; Hampel, Harald; Garcia, Melissa E; Wang, Xue; Lannfelt, Lars; Mecocci, Patrizia; Eiriksdottir, Gudny; Crane, Paul K; Pasquier, Florence; Boccardi, Virginia; Henández, Isabel; Barber, Robert C; Scherer, Martin; Tarraga, Lluis; Adams, Perrie M; Leber, Markus; Chen, Yuning; Albert, Marilyn S; Riedel-Heller, Steffi; Emilsson, Valur; Beekly, Duane; 
Braae, Anne; Schmidt, Reinhold; Blacker, Deborah; Masullo, Carlo; Schmidt, Helena; Doody, Rachelle S; Spalletta, Gianfranco; Longstreth, W T; Fairchild, Thomas J; Bossù, Paola; Lopez, Oscar L; Frosch, Matthew P; Sacchinelli, Eleonora; Ghetti, Bernardino; Yang, Qiong; Huebinger, Ryan M; Jessen, Frank; Li, Shuo; Kamboh, M Ilyas; Morris, John; Sotolongo-Grau, Oscar; Katz, Mindy J; Corcoran, Chris; Dunstan, Melanie; Braddel, Amy; Thomas, Charlene; Meggy, Alun; Marshall, Rachel; Gerrish, Amy; Chapman, Jade; Aguilar, Miquel; Taylor, Sarah; Hill, Matt; Fairén, Mònica Díez; Hodges, Angela; Vellas, Bruno; Soininen, Hilkka; Kloszewska, Iwona; Daniilidou, Makrina; Uphill, James; Patel, Yogen; Hughes, Joseph T; Lord, Jenny; Turton, James; Hartmann, Annette M; Cecchetti, Roberta; Fenoglio, Chiara; Serpente, Maria; Arcaro, Marina; Caltagirone, Carlo; Orfei, Maria Donata; Ciaramella, Antonio; Pichler, Sabrina; Mayhaus, Manuel; Gu, Wei; Lleó, Alberto; Fortea, Juan; Blesa, Rafael; Barber, Imelda S; Brookes, Keeley; Cupidi, Chiara; Maletta, Raffaele Giovanni; Carrell, David; Sorbi, Sandro; Moebus, Susanne; Urbano, Maria; Pilotto, Alberto; Kornhuber, Johannes; Bosco, Paolo; Todd, Stephen; Craig, David; Johnston, Janet; Gill, Michael; Lawlor, Brian; Lynch, Aoibhinn; Fox, Nick C; Hardy, John; Albin, Roger L; Apostolova, Liana G; Arnold, Steven E; Asthana, Sanjay; Atwood, Craig S; Baldwin, Clinton T; Barnes, Lisa L; Barral, Sandra; Beach, Thomas G; Becker, James T; Bigio, Eileen H; Bird, Thomas D; Boeve, Bradley F; Bowen, James D; Boxer, Adam; Burke, James R; Burns, Jeffrey M; Buxbaum, Joseph D; Cairns, Nigel J; Cao, Chuanhai; Carlson, Chris S; Carlsson, Cynthia M; Carney, Regina M; Carrasquillo, Minerva M; Carroll, Steven L; Diaz, Carolina Ceballos; Chui, Helena C; Clark, David G; Cribbs, David H; Crocco, Elizabeth A; DeCarli, Charles; Dick, Malcolm; Duara, Ranjan; Evans, Denis A; Faber, Kelley M; Fallon, Kenneth B; Fardo, David W; Farlow, Martin R; Ferris, Steven; Foroud, Tatiana M; 
Galasko, Douglas R; Gearing, Marla; Geschwind, Daniel H; Gilbert, John R; Graff-Radford, Neill R; Green, Robert C; Growdon, John H; Hamilton, Ronald L; Harrell, Lindy E; Honig, Lawrence S; Huentelman, Matthew J; Hulette, Christine M; Hyman, Bradley T; Jarvik, Gail P; Abner, Erin; Jin, Lee-Way; Jun, Gyungah; Karydas, Anna; Kaye, Jeffrey A; Kim, Ronald; Kowall, Neil W; Kramer, Joel H; LaFerla, Frank M; Lah, James J; Leverenz, James B; Levey, Allan I; Li, Ge; Lieberman, Andrew P; Lunetta, Kathryn L; Lyketsos, Constantine G; Marson, Daniel C; Martiniuk, Frank; Mash, Deborah C; Masliah, Eliezer; McCormick, Wayne C; McCurry, Susan M; McDavid, Andrew N; McKee, Ann C; Mesulam, Marsel; Miller, Bruce L; Miller, Carol A; Miller, Joshua W; Morris, John C; Murrell, Jill R; Myers, Amanda J; O'Bryant, Sid; Olichney, John M; Pankratz, Vernon S; Parisi, Joseph E; Paulson, Henry L; Perry, William; Peskind, Elaine; Pierce, Aimee; Poon, Wayne W; Potter, Huntington; Quinn, Joseph F; Raj, Ashok; Raskind, Murray; Reisberg, Barry; Reitz, Christiane; Ringman, John M; Roberson, Erik D; Rogaeva, Ekaterina; Rosen, Howard J; Rosenberg, Roger N; Sager, Mark A; Saykin, Andrew J; Schneider, Julie A; Schneider, Lon S; Seeley, William W; Smith, Amanda G; Sonnen, Joshua A; Spina, Salvatore; Stern, Robert A; Swerdlow, Russell H; Tanzi, Rudolph E; Thornton-Wells, Tricia A; Trojanowski, John Q; Troncoso, Juan C; Van Deerlin, Vivianna M; Van Eldik, Linda J; Vinters, Harry V; Vonsattel, Jean Paul; Weintraub, Sandra; Welsh-Bohmer, Kathleen A; Wilhelmsen, Kirk C; Williamson, Jennifer; Wingo, Thomas S; Woltjer, Randall L; Wright, Clinton B; Yu, Chang-En; Yu, Lei; Garzia, Fabienne; Golamaully, Feroze; Septier, Gislain; Engelborghs, Sebastien; Vandenberghe, Rik; De Deyn, Peter P; Fernadez, Carmen Muñoz; Benito, Yoland Aladro; Thonberg, Hakan; Forsell, Charlotte; Lilius, Lena; Kinhult-Stählbom, Anne; Kilander, Lena; Brundin, RoseMarie; Concari, Letizia; Helisalmi, Seppo; Koivisto, Anne Maria; Haapasalo, 
Annakaisa; Dermecourt, Vincent; Fievet, Nathalie; Hanon, Olivier; Dufouil, Carole; Brice, Alexis; Ritchie, Karen; Dubois, Bruno; Himali, Jayanadra J; Keene, C Dirk; Tschanz, JoAnn; Fitzpatrick, Annette L; Kukull, Walter A; Norton, Maria; Aspelund, Thor; Larson, Eric B; Munger, Ron; Rotter, Jerome I; Lipton, Richard B; Bullido, María J; Hofman, Albert; Montine, Thomas J; Coto, Eliecer; Boerwinkle, Eric; Petersen, Ronald C; Alvarez, Victoria; Rivadeneira, Fernando; Reiman, Eric M; Gallo, Maura; O'Donnell, Christopher J; Reisch, Joan S; Bruni, Amalia Cecilia; Royall, Donald R; Dichgans, Martin; Sano, Mary; Galimberti, Daniela; St George-Hyslop, Peter; Scarpini, Elio; Tsuang, Debby W; Mancuso, Michelangelo; Bonuccelli, Ubaldo; Winslow, Ashley R; Daniele, Antonio; Wu, Chuang-Kuo; Peters, Oliver; Nacmias, Benedetta; Riemenschneider, Matthias; Heun, Reinhard; Brayne, Carol; Rubinsztein, David C; Bras, Jose; Guerreiro, Rita; Al-Chalabi, Ammar; Shaw, Christopher E; Collinge, John; Mann, David; Tsolaki, Magda; Clarimón, Jordi; Sussams, Rebecca; Lovestone, Simon; O'Donovan, Michael C; Owen, Michael J; Behrens, Timothy W; Mead, Simon; Goate, Alison M; Uitterlinden, Andre G; Holmes, Clive; Cruchaga, Carlos; Ingelsson, Martin; Bennett, David A; Powell, John; Golde, Todd E; Graff, Caroline; De Jager, Philip L; Morgan, Kevin; Ertekin-Taner, Nilufer; Combarros, Onofre; Psaty, Bruce M; Passmore, Peter; Younkin, Steven G; Berr, Claudine; Gudnason, Vilmundur; Rujescu, Dan; Dickson, Dennis W; Dartigues, Jean-François; DeStefano, Anita L; Ortega-Cubero, Sara; Hakonarson, Hakon; Campion, Dominique; Boada, Merce; Kauwe, John Keoni; Farrer, Lindsay A; Van Broeckhoven, Christine; Ikram, M Arfan; Jones, Lesley; Haines, Jonathan L; Tzourio, Christophe; Launer, Lenore J; Escott-Price, Valentina; Mayeux, Richard; Deleuze, Jean-François; Amin, Najaf; Holmans, Peter A; Pericak-Vance, Margaret A; Amouyel, Philippe; van Duijn, Cornelia M; Ramirez, Alfredo; Wang, Li-San; Lambert, Jean-Charles; 
Seshadri, Sudha; Williams, Julie; Schellenberg, Gerard D

    2017-09-01

    We identified rare coding variants associated with Alzheimer's disease in a three-stage case-control study of 85,133 subjects. In stage 1, we genotyped 34,174 samples using a whole-exome microarray. In stage 2, we tested associated variants (P < 1 × 10(-4)) in 35,962 independent samples using de novo genotyping and imputed genotypes. In stage 3, we used an additional 14,997 samples to test the most significant stage 2 associations (P < 5 × 10(-8)) using imputed genotypes. We observed three new genome-wide significant nonsynonymous variants associated with Alzheimer's disease: a protective variant in PLCG2 (rs72824905: p.Pro522Arg, P = 5.38 × 10(-10), odds ratio (OR) = 0.68, minor allele frequency (MAF) cases = 0.0059, MAF controls = 0.0093), a risk variant in ABI3 (rs616338: p.Ser209Phe, P = 4.56 × 10(-10), OR = 1.43, MAF cases = 0.011, MAF controls = 0.008), and a new genome-wide significant variant in TREM2 (rs143332484: p.Arg62His, P = 1.55 × 10(-14), OR = 1.67, MAF cases = 0.0143, MAF controls = 0.0089), a known susceptibility gene for Alzheimer's disease. These protein-altering changes are in genes highly expressed in microglia and highlight an immune-related protein-protein interaction network enriched for previously identified risk genes in Alzheimer's disease. These genetic findings provide additional evidence that the microglia-mediated innate immune response contributes directly to the development of Alzheimer's disease.
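    As a rough consistency check on the figures quoted in this abstract, a crude allelic odds ratio can be computed directly from the case and control minor allele frequencies. The sketch below is illustrative only: the function name and the dictionary layout are assumptions, and the crude values are not expected to equal the published ORs, which presumably come from the study's own association model rather than from pooled frequencies.

```python
def allelic_odds_ratio(maf_cases: float, maf_controls: float) -> float:
    """Crude allelic odds ratio: odds of the minor allele in cases
    divided by odds of the minor allele in controls."""
    odds_cases = maf_cases / (1.0 - maf_cases)
    odds_controls = maf_controls / (1.0 - maf_controls)
    return odds_cases / odds_controls


# (MAF in cases, MAF in controls) as quoted in the abstract above.
variants = {
    "PLCG2 p.Pro522Arg": (0.0059, 0.0093),
    "ABI3 p.Ser209Phe": (0.011, 0.008),
    "TREM2 p.Arg62His": (0.0143, 0.0089),
}

crude_or = {name: allelic_odds_ratio(*mafs) for name, mafs in variants.items()}

for name, value in crude_or.items():
    print(f"{name}: crude allelic OR = {value:.2f}")
```

    The crude ratios fall on the same side of 1 as the reported estimates (below 1 for the protective PLCG2 variant, above 1 for ABI3 and TREM2), which is all this check is meant to show.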

  12. Across-Platform Imputation of DNA Methylation Levels Incorporating Nonlocal Information Using Penalized Functional Regression.

    PubMed

    Zhang, Guosheng; Huang, Kuan-Chieh; Xu, Zheng; Tzeng, Jung-Ying; Conneely, Karen N; Guan, Weihua; Kang, Jian; Li, Yun

    2016-05-01

    DNA methylation is a key epigenetic mark involved in both normal development and disease progression. Recent advances in high-throughput technologies have enabled genome-wide profiling of DNA methylation. However, DNA methylation profiling often employs different designs and platforms with varying resolution, which hinders joint analysis of methylation data from multiple platforms. In this study, we propose a penalized functional regression model to impute missing methylation data. By incorporating functional predictors, our model utilizes information from nonlocal probes to improve imputation quality. Here, we compared the performance of our functional model to linear regression and the best single probe surrogate in real data and via simulations. Specifically, we applied different imputation approaches to an acute myeloid leukemia dataset consisting of 194 samples and our method showed higher imputation accuracy, manifested, for example, by a 94% relative increase in information content and up to 86% more CpG sites passing post-imputation filtering. Our simulated association study further demonstrated that our method substantially improves the statistical power to identify trait-associated methylation loci. These findings indicate that the penalized functional regression model is a convenient and valuable imputation tool for methylation data, and it can boost statistical power in downstream epigenome-wide association study (EWAS). © 2016 WILEY PERIODICALS, INC.

  13. Applying an efficient K-nearest neighbor search to forest attribute imputation

    Treesearch

    Andrew O. Finley; Ronald E. McRoberts; Alan R. Ek

    2006-01-01

    This paper explores the utility of an efficient nearest neighbor (NN) search algorithm for applications in multi-source kNN forest attribute imputation. The search algorithm reduces the number of distance calculations between a given target vector and each reference vector, thereby, decreasing the time needed to discover the NN subset. Results of five trials show gains...
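    The basic idea of kNN attribute imputation can be sketched as follows. This is a brute-force version that computes every distance, so it is not the efficient search algorithm the paper evaluates; the function name, the two-feature vectors, and the attribute values are all hypothetical.

```python
import math


def knn_impute(target, references, k=3):
    """Impute an attribute for `target` as the mean attribute value of its
    k nearest reference observations (Euclidean distance, brute force).

    `references` is a list of (feature_vector, attribute_value) pairs.
    """
    # One distance calculation per reference vector; an efficient search
    # would avoid computing most of these.
    distances = sorted(
        (math.dist(target, features), value) for features, value in references
    )
    nearest = distances[:k]
    return sum(value for _, value in nearest) / len(nearest)


# Hypothetical reference plots: (feature vector, observed forest attribute).
reference_plots = [
    ((0.0, 0.0), 10.0),
    ((1.0, 0.0), 12.0),
    ((0.0, 1.0), 14.0),
    ((5.0, 5.0), 40.0),
]

imputed_k3 = knn_impute((0.2, 0.1), reference_plots, k=3)
imputed_k1 = knn_impute((0.2, 0.1), reference_plots, k=1)
print(imputed_k3, imputed_k1)
```

    With k = 3 the distant plot at (5.0, 5.0) is excluded and the imputed value is the mean of the three nearby plots; with k = 1 the value is copied from the single nearest plot.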

  14. Can We Spin Straw Into Gold? An Evaluation of Immigrant Legal Status Imputation Approaches

    PubMed Central

    Van Hook, Jennifer; Bachmeier, James D.; Coffman, Donna; Harel, Ofer

    2014-01-01

    Researchers have developed logical, demographic, and statistical strategies for imputing immigrants’ legal status, but these methods have never been empirically assessed. We used Monte Carlo simulations to test whether, and under what conditions, legal status imputation approaches yield unbiased estimates of the association of unauthorized status with health insurance coverage. We tested five methods under a range of missing data scenarios. Logical and demographic imputation methods yielded biased estimates across all missing data scenarios. Statistical imputation approaches yielded unbiased estimates only when unauthorized status was jointly observed with insurance coverage; when this condition was not met, these methods overestimated insurance coverage for unauthorized relative to legal immigrants. We next showed how bias can be reduced by incorporating prior information about unauthorized immigrants. Finally, we demonstrated the utility of the best-performing statistical method for increasing power. We used it to produce state/regional estimates of insurance coverage among unauthorized immigrants in the Current Population Survey, a data source that contains no direct measures of immigrants’ legal status. We conclude that commonly employed legal status imputation approaches are likely to produce biased estimates, but data and statistical methods exist that could substantially reduce these biases. PMID:25511332

  15. Integrating common and rare genetic variation in diverse human populations.

    PubMed

    Altshuler, David M; Gibbs, Richard A; Peltonen, Leena; Altshuler, David M; Gibbs, Richard A; Peltonen, Leena; Dermitzakis, Emmanouil; Schaffner, Stephen F; Yu, Fuli; Peltonen, Leena; Dermitzakis, Emmanouil; Bonnen, Penelope E; Altshuler, David M; Gibbs, Richard A; de Bakker, Paul I W; Deloukas, Panos; Gabriel, Stacey B; Gwilliam, Rhian; Hunt, Sarah; Inouye, Michael; Jia, Xiaoming; Palotie, Aarno; Parkin, Melissa; Whittaker, Pamela; Yu, Fuli; Chang, Kyle; Hawes, Alicia; Lewis, Lora R; Ren, Yanru; Wheeler, David; Gibbs, Richard A; Muzny, Donna Marie; Barnes, Chris; Darvishi, Katayoon; Hurles, Matthew; Korn, Joshua M; Kristiansson, Kati; Lee, Charles; McCarrol, Steven A; Nemesh, James; Dermitzakis, Emmanouil; Keinan, Alon; Montgomery, Stephen B; Pollack, Samuela; Price, Alkes L; Soranzo, Nicole; Bonnen, Penelope E; Gibbs, Richard A; Gonzaga-Jauregui, Claudia; Keinan, Alon; Price, Alkes L; Yu, Fuli; Anttila, Verneri; Brodeur, Wendy; Daly, Mark J; Leslie, Stephen; McVean, Gil; Moutsianas, Loukas; Nguyen, Huy; Schaffner, Stephen F; Zhang, Qingrun; Ghori, Mohammed J R; McGinnis, Ralph; McLaren, William; Pollack, Samuela; Price, Alkes L; Schaffner, Stephen F; Takeuchi, Fumihiko; Grossman, Sharon R; Shlyakhter, Ilya; Hostetter, Elizabeth B; Sabeti, Pardis C; Adebamowo, Clement A; Foster, Morris W; Gordon, Deborah R; Licinio, Julio; Manca, Maria Cristina; Marshall, Patricia A; Matsuda, Ichiro; Ngare, Duncan; Wang, Vivian Ota; Reddy, Deepa; Rotimi, Charles N; Royal, Charmaine D; Sharp, Richard R; Zeng, Changqing; Brooks, Lisa D; McEwen, Jean E

    2010-09-02

    Despite great progress in identifying genetic variants that influence human disease, most inherited risk remains unexplained. A more complete understanding requires genome-wide studies that fully examine less common alleles in populations with a wide range of ancestry. To inform the design and interpretation of such studies, we genotyped 1.6 million common single nucleotide polymorphisms (SNPs) in 1,184 reference individuals from 11 global populations, and sequenced ten 100-kilobase regions in 692 of these individuals. This integrated data set of common and rare alleles, called 'HapMap 3', includes both SNPs and copy number polymorphisms (CNPs). We characterized population-specific differences among low-frequency variants, measured the improvement in imputation accuracy afforded by the larger reference panel, especially in imputing SNPs with a minor allele frequency of

  16. Modeling and E-M estimation of haplotype-specific relative risks from genotype data for a case-control study of unrelated individuals.

    PubMed

    Stram, Daniel O; Leigh Pearce, Celeste; Bretsky, Phillip; Freedman, Matthew; Hirschhorn, Joel N; Altshuler, David; Kolonel, Laurence N; Henderson, Brian E; Thomas, Duncan C

    2003-01-01

    The US National Cancer Institute has recently sponsored the formation of a Cohort Consortium (http://2002.cancer.gov/scpgenes.htm) to facilitate the pooling of data on very large numbers of people, concerning the effects of genes and environment on cancer incidence. One likely goal of these efforts will be to generate a large population-based case-control series for which a number of candidate genes will be investigated using SNP haplotype as well as genotype analysis. The goal of this paper is to outline the issues involved in choosing a method of estimating haplotype-specific risk estimates for such data that is technically appropriate and yet attractive to epidemiologists who are already comfortable with odds ratios and logistic regression. Our interest is to develop and evaluate extensions of methods, based on haplotype imputation, that have been recently described (Schaid et al., Am J Hum Genet, 2002, and Zaykin et al., Hum Hered, 2002) as providing score tests of the null hypothesis of no effect of SNP haplotypes upon risk, which may be used for more complex tasks, such as providing confidence intervals, and tests of equivalence of haplotype-specific risks in two or more separate populations. In order to do so we (1) develop a cohort approach towards odds ratio analysis by expanding the E-M algorithm to provide maximum likelihood estimates of haplotype-specific odds ratios as well as genotype frequencies; (2) show how to correct the cohort approach, to give essentially unbiased estimates for population-based or nested case-control studies by incorporating the probability of selection as a case or control into the likelihood, based on a simplified model of case and control selection, and (3) finally, in an example data set (CYP17 and breast cancer, from the Multiethnic Cohort Study) we compare likelihood-based confidence interval estimates from the two methods with each other, and with the use of the single-imputation approach of Zaykin et al. applied under both null and alternative hypotheses. We conclude that so long as haplotypes are well predicted by SNP genotypes (we use the R_h^2 criterion of Stram et al. [1]) the differences between the three methods are very small and in particular that the single imputation method may be expected to work extremely well. Copyright 2003 S. Karger AG, Basel
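    For orientation, the plain E-M algorithm that this abstract builds on can be sketched for the simplest case of two biallelic SNPs. The sketch estimates haplotype frequencies only, under Hardy-Weinberg equilibrium; it does not implement the paper's extension to haplotype-specific odds ratios or its case-control correction. All names and the toy genotype data are hypothetical.

```python
# Haplotypes over two biallelic SNPs, coded as (allele at SNP 1, allele at SNP 2).
HAPLOTYPES = [(0, 0), (0, 1), (1, 0), (1, 1)]


def compatible_pairs(genotype):
    """Unordered haplotype pairs whose allele counts add up to the genotype."""
    pairs = []
    for i, h1 in enumerate(HAPLOTYPES):
        for h2 in HAPLOTYPES[i:]:
            if (h1[0] + h2[0], h1[1] + h2[1]) == genotype:
                pairs.append((h1, h2))
    return pairs


def em_haplotype_frequencies(genotypes, n_iter=100):
    """Estimate haplotype frequencies from unphased genotypes by E-M."""
    freqs = {h: 0.25 for h in HAPLOTYPES}  # uniform starting values
    for _ in range(n_iter):
        counts = {h: 0.0 for h in HAPLOTYPES}
        for genotype in genotypes:
            pairs = compatible_pairs(genotype)
            # E-step: probability of each compatible pair given current
            # frequencies (factor 2 for pairs of two different haplotypes).
            weights = [
                (1 if h1 == h2 else 2) * freqs[h1] * freqs[h2]
                for h1, h2 in pairs
            ]
            total = sum(weights)
            for (h1, h2), weight in zip(pairs, weights):
                counts[h1] += weight / total
                counts[h2] += weight / total
        # M-step: expected haplotype counts divided by the number of chromosomes.
        n_chromosomes = 2 * len(genotypes)
        freqs = {h: c / n_chromosomes for h, c in counts.items()}
    return freqs


# Toy data: genotypes are (minor allele count at SNP 1, at SNP 2).
# Only the double heterozygote (1, 1) is phase-ambiguous.
toy_genotypes = [(0, 0)] * 4 + [(2, 2)] * 4 + [(1, 1)] * 2
estimated = em_haplotype_frequencies(toy_genotypes)
print(estimated)
```

    In the toy data the unambiguous individuals carry only the 0-0 and 1-1 haplotypes, so the algorithm resolves the two double heterozygotes as 0-0/1-1 and the estimated frequencies of the 0-1 and 1-0 haplotypes shrink towards zero.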

  17. Constructing linkage maps in the genomics era with MapDisto 2.0.

    PubMed

    Heffelfinger, Christopher; Fragoso, Christopher A; Lorieux, Mathias

    2017-07-15

    Genotyping by sequencing (GBS) generates datasets that are challenging to handle by current genetic mapping software with graphical interface. Geneticists need new user-friendly computer programs that can analyze GBS data on desktop computers. This requires improvements in computation efficiency, both in terms of speed and use of random-access memory (RAM). MapDisto v.2.0 is a user-friendly computer program for construction of genetic linkage maps. It includes several new major features: (i) handling of very large genotyping datasets like the ones generated by GBS; (ii) direct importation and conversion of Variant Call Format (VCF) files; (iii) detection of linkage, i.e. construction of linkage groups in case of segregation distortion; (iv) data imputation on VCF files using a new approach, called LB-Impute. Features i to iv operate through inclusion of new Java modules that are used transparently by MapDisto; (v) QTL detection via a new R/qtl graphical interface. The program is available free of charge at mapdisto.free.fr. Contact: mapdisto@gmail.com. Supplementary data are available at Bioinformatics online. © The Author (2017). Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oup.com

  18. Modeling coverage gaps in haplotype frequencies via Bayesian inference to improve stem cell donor selection.

    PubMed

    Louzoun, Yoram; Alter, Idan; Gragert, Loren; Albrecht, Mark; Maiers, Martin

    2018-05-01

    Regardless of sampling depth, accurate genotype imputation is limited in regions of high polymorphism which often have a heavy-tailed haplotype frequency distribution. Many rare haplotypes are thus unobserved. Statistical methods to improve imputation by extending reference haplotype distributions using linkage disequilibrium patterns that relate allele and haplotype frequencies have not yet been explored. In the field of unrelated stem cell transplantation, imputation of highly polymorphic human leukocyte antigen (HLA) genes has an important application in identifying the best-matched stem cell donor when searching large registries totaling over 28,000,000 donors worldwide. Despite these large registry sizes, a significant proportion of searched patients present novel HLA haplotypes. Supporting this observation, HLA population genetic models have indicated that many extant HLA haplotypes remain unobserved. The absent haplotypes are a significant cause of error in haplotype matching. We have applied a Bayesian inference methodology for extending haplotype frequency distributions, using a model where new haplotypes are created by recombination of observed alleles. Applications of this joint probability model offer significant improvement in frequency distribution estimates over the best existing alternative methods, as we illustrate using five-locus HLA frequency data from the National Marrow Donor Program registry. Transplant matching algorithms and disease association studies involving phasing and imputation of rare variants may benefit from this statistical inference framework.

  19. Association analysis for udder index and milking speed with imputed whole-genome sequence variants in Nordic Holstein cattle.

    PubMed

    Jardim, Júlia Gazzoni; Guldbrandtsen, Bernt; Lund, Mogens Sandø; Sahana, Goutam

    2018-03-01

    Genome-wide association testing facilitates the identification of genetic variants associated with complex traits. Mapping genes that promote genetic resistance to mastitis could reduce the cost of antibiotic use and enhance animal welfare and milk production by improving outcomes of breeding for udder health. Using imputed whole-genome sequence variants, we carried out association studies for 2 traits related to udder health, udder index, and milking speed in Nordic Holstein cattle. A total of 4,921 bulls genotyped with the BovineSNP50 BeadChip array were imputed to high-density genotypes (Illumina BovineHD BeadChip, Illumina, San Diego, CA) and, subsequently, to whole-genome sequence variants. An association analysis was carried out using a linear mixed model. Phenotypes used in the association analyses were deregressed breeding values. Multitrait meta-analysis was carried out for these 2 traits. We identified 10 and 8 chromosomes harboring markers that were significantly associated with udder index and milking speed, respectively. Strongest association signals were observed on chromosome 20 for udder index and chromosome 19 for milking speed. Multitrait meta-analysis identified 13 chromosomes harboring associated markers for the combination of udder index and milking speed. The associated region on chromosome 20 overlapped with earlier reported quantitative trait loci for similar traits in other cattle populations. Moreover, this region was located close to the FYB gene, which is involved in platelet activation and controls IL-2 expression; FYB is a strong candidate gene for udder health and worthy of further investigation. Copyright © 2018 American Dairy Science Association. Published by Elsevier Inc. All rights reserved.

  20. Joint genome-wide prediction in several populations accounting for randomness of genotypes: A hierarchical Bayes approach. II: Multivariate spike and slab priors for marker effects and derivation of approximate Bayes and fractional Bayes factors for the complete family of models.

    PubMed

    Martínez, Carlos Alberto; Khare, Kshitij; Banerjee, Arunava; Elzo, Mauricio A

    2017-03-21

    This study corresponds to the second part of a companion paper devoted to the development of Bayesian multiple regression models accounting for randomness of genotypes in across population genome-wide prediction. This family of models considers heterogeneous and correlated marker effects and allelic frequencies across populations, and has the ability of considering records from non-genotyped individuals and individuals with missing genotypes in any subset of loci without the need for previous imputation, taking into account uncertainty about imputed genotypes. This paper extends this family of models by considering multivariate spike and slab conditional priors for marker allele substitution effects and contains derivations of approximate Bayes factors and fractional Bayes factors to compare models from part I and those developed here with their null versions. These null versions correspond to simpler models ignoring heterogeneity of populations, but still accounting for randomness of genotypes. For each marker locus, the spike component of priors corresponded to point mass at 0 in R^S, where S is the number of populations, and the slab component was an S-variate Gaussian distribution; independent conditional priors were assumed. For the Gaussian components, covariance matrices were assumed to be either the same for all markers or different for each marker. For null models, the priors were simply univariate versions of these finite mixture distributions. Approximate algebraic expressions for Bayes factors and fractional Bayes factors were found using the Laplace approximation. Using the simulated datasets described in part I, these models were implemented and compared with models derived in part I using measures of predictive performance based on squared Pearson correlations, Deviance Information Criterion, Bayes factors, and fractional Bayes factors. The extensions presented here enlarge our family of genome-wide prediction models making it more flexible in the sense that it now offers more modeling options. Copyright © 2017 Elsevier Ltd. All rights reserved.

  1. Comparing genetic ancestry and self-reported race/ethnicity in a multiethnic population in New York City.

    PubMed

    Lee, Yin Leng; Teitelbaum, Susan; Wolff, Mary S; Wetmur, James G; Chen, Jia

    2010-12-01

    Self-reported race/ethnicity is frequently used in epidemiological studies to assess an individual's background origin. However, in admixed populations such as Hispanics, self-reported race/ethnicity may not accurately represent them genetically because they are admixed with European, African and Native American ancestry. We estimated the proportions of genetic admixture in an ethnically diverse population of 396 mothers and 188 of their children with 35 ancestry informative markers (AIMs) using the STRUCTURE version 2.2 program. The majority of the markers showed significant deviation from Hardy-Weinberg equilibrium in our study population. In mothers self-identified as Black and White, the imputed ancestry proportions were 77.6% African and 75.1% European respectively, while the racial composition among self-identified Hispanics was 29.2% European, 26.0% African, and 44.8% Native American. We also investigated the utility of AIMs by showing the improved fitness of models in paraoxonase-1 genotype-phenotype associations after incorporating AIMs; however, the improvement was moderate at best. In summary, a minimal set of 35 AIMs is sufficient to detect population stratification and estimate the proportion of individual genetic admixture; however, the utility of these markers remains questionable.

  2. Performance Comparison of Two Gene Set Analysis Methods for Genome-wide Association Study Results: GSA-SNP vs i-GSEA4GWAS.

    PubMed

    Kwon, Ji-Sun; Kim, Jihye; Nam, Dougu; Kim, Sangsoo

    2012-06-01

    Gene set analysis (GSA) is useful in interpreting a genome-wide association study (GWAS) result in terms of biological mechanism. We compared the performance of two different GSA implementations that accept GWAS p-values of single nucleotide polymorphisms (SNPs) or gene-by-gene summaries thereof, GSA-SNP and i-GSEA4GWAS, under the same settings of inputs and parameters. GSA runs were made with two sets of p-values from a Korean type 2 diabetes mellitus GWAS: 259,188 and 1,152,947 SNPs of the original and imputed genotype datasets, respectively. When Gene Ontology terms were used as gene sets, i-GSEA4GWAS produced 283 and 1,070 hits for the unimputed and imputed datasets, respectively. GSA-SNP, on the other hand, reported 94 and 38 hits for the unimputed and imputed datasets, respectively. Similar trends, though to a lesser degree, were observed with Kyoto Encyclopedia of Genes and Genomes (KEGG) gene sets as well. The huge number of hits by i-GSEA4GWAS for the imputed dataset was probably an artifact due to the scaling step in the algorithm. The decrease in hits by GSA-SNP for the imputed dataset may be due to the fact that it relies on Z-statistics, which are sensitive to variations in the background level of associations. Judicious evaluation of the GSA outcomes, perhaps based on multiple programs, is recommended.

  3. Multiple imputation for cure rate quantile regression with censored data.

    PubMed

    Wu, Yuanshan; Yin, Guosheng

    2017-03-01

    The main challenge in the context of cure rate analysis is that one never knows whether censored subjects are cured or uncured, or whether they are susceptible or insusceptible to the event of interest. Considering the susceptible indicator as missing data, we propose a multiple imputation approach to cure rate quantile regression for censored data with a survival fraction. We develop an iterative algorithm to estimate the conditionally uncured probability for each subject. By utilizing this estimated probability and Bernoulli sample imputation, we can classify each subject as cured or uncured, and then employ the locally weighted method to estimate the quantile regression coefficients with only the uncured subjects. Repeating the imputation procedure multiple times and taking an average over the resultant estimators, we obtain consistent estimators for the quantile regression coefficients. Our approach relaxes the usual global linearity assumption, so that we can apply quantile regression to any particular quantile of interest. We establish asymptotic properties for the proposed estimators, including both consistency and asymptotic normality. We conduct simulation studies to assess the finite-sample performance of the proposed multiple imputation method and apply it to a lung cancer study as an illustration. © 2016, The International Biometric Society.

  4. Genome-wide association study identifies three novel loci for type 2 diabetes.

    PubMed

    Hara, Kazuo; Fujita, Hayato; Johnson, Todd A; Yamauchi, Toshimasa; Yasuda, Kazuki; Horikoshi, Momoko; Peng, Chen; Hu, Cheng; Ma, Ronald C W; Imamura, Minako; Iwata, Minoru; Tsunoda, Tatsuhiko; Morizono, Takashi; Shojima, Nobuhiro; So, Wing Yee; Leung, Ting Fan; Kwan, Patrick; Zhang, Rong; Wang, Jie; Yu, Weihui; Maegawa, Hiroshi; Hirose, Hiroshi; Kaku, Kohei; Ito, Chikako; Watada, Hirotaka; Tanaka, Yasushi; Tobe, Kazuyuki; Kashiwagi, Atsunori; Kawamori, Ryuzo; Jia, Weiping; Chan, Juliana C N; Teo, Yik Ying; Shyong, Tai E; Kamatani, Naoyuki; Kubo, Michiaki; Maeda, Shiro; Kadowaki, Takashi

    2014-01-01

    Although over 60 loci for type 2 diabetes (T2D) have been identified, there still remains a large genetic component to be clarified. To explore unidentified loci for T2D, we performed a genome-wide association study (GWAS) of 6 209 637 single-nucleotide polymorphisms (SNPs), which were directly genotyped or imputed using East Asian references from the 1000 Genomes Project (June 2011 release) in 5976 Japanese patients with T2D and 20 829 nondiabetic individuals. Nineteen unreported loci were selected and taken forward to follow-up analyses. Combined discovery and follow-up analyses (30 392 cases and 34 814 controls) identified three new loci with genome-wide significance, which were MIR129-LEP [rs791595; risk allele = A; risk allele frequency (RAF) = 0.080; P = 2.55 × 10(-13); odds ratio (OR) = 1.17], GPSM1 [rs11787792; risk allele = A; RAF = 0.874; P = 1.74 × 10(-10); OR = 1.15] and SLC16A13 (rs312457; risk allele = G; RAF = 0.078; P = 7.69 × 10(-13); OR = 1.20). This study demonstrates that GWASs based on the imputation of genotypes using modern reference haplotypes such as that from the 1000 Genomes Project data can assist in identification of new loci for common diseases.

  5. 11,670 whole-genome sequences representative of the Han Chinese population from the CONVERGE project.

    PubMed

    Cai, Na; Bigdeli, Tim B; Kretzschmar, Warren W; Li, Yihan; Liang, Jieqin; Hu, Jingchu; Peterson, Roseann E; Bacanu, Silviu; Webb, Bradley Todd; Riley, Brien; Li, Qibin; Marchini, Jonathan; Mott, Richard; Kendler, Kenneth S; Flint, Jonathan

    2017-02-14

    The China, Oxford and Virginia Commonwealth University Experimental Research on Genetic Epidemiology (CONVERGE) project on Major Depressive Disorder (MDD) sequenced 11,670 female Han Chinese at low-coverage (1.7X), providing the first large-scale whole genome sequencing resource representative of the largest ethnic group in the world. Samples are collected from 58 hospitals from 23 provinces around China. We are able to call 22 million high quality single nucleotide polymorphisms (SNP) from the nuclear genome, representing the largest SNP call set from an East Asian population to date. We use these variants for imputation of genotypes across all samples, and this has allowed us to perform a successful genome wide association study (GWAS) on MDD. The utility of these data can be extended to studies of genetic ancestry in the Han Chinese and evolutionary genetics when integrated with data from other populations. Molecular phenotypes, such as copy number variations and structural variations can be detected, quantified and analysed in similar ways.

  6. Extent of linkage disequilibrium, consistency of gametic phase, and imputation accuracy within and across Canadian dairy breeds.

    PubMed

    Larmer, S G; Sargolzaei, M; Schenkel, F S

    2014-05-01

    Genomic selection requires a large reference population to accurately estimate single nucleotide polymorphism (SNP) effects. In some Canadian dairy breeds, the available reference populations are not large enough for accurate estimation of SNP effects for traits of interest. If marker phase is highly consistent across multiple breeds, it is theoretically possible to increase the accuracy of genomic prediction for one or all breeds by pooling several breeds into a common reference population. This study investigated the extent of linkage disequilibrium (LD) in 5 major dairy breeds using a 50,000 (50K) SNP panel and 3 of the same breeds using the 777,000 (777K) SNP panel. Correlation of pair-wise SNP phase was also investigated on both panels. The level of LD was measured using the squared correlation of alleles at 2 loci (r²), and the consistency of SNP gametic phases was correlated using the signed square root of these values. Because of the high cost of the 777K panel, the accuracy of imputation from lower density marker panels [6,000 (6K) or 50K] was examined both within breed and using a multi-breed reference population in Holstein, Ayrshire, and Guernsey. Imputation was carried out using FImpute V2.2 and Beagle 3.3.2 software. Imputation accuracies were then calculated as both the proportion of correct SNP filled in (concordance rate) and allelic R². Computation time was also explored to determine the efficiency of the different algorithms for imputation. Analysis showed that LD values >0.2 were found in all breeds at distances at or shorter than the average adjacent pair-wise distance between SNP on the 50K panel. Correlations of r-values, however, did not reach high levels (<0.9) at these distances. High correlation values of SNP phase between breeds were observed (>0.94) when the average pair-wise distances using the 777K SNP panel were examined. High concordance rate (0.968-0.995) and allelic R² (0.946-0.991) were found for all breeds when imputation was carried out with FImpute from 50K to 777K. Imputation accuracy for Guernsey and Ayrshire was slightly lower when using the imputation method in Beagle. Computing time was significantly greater when using Beagle software, with all comparable procedures being 9 to 13 times less efficient, in terms of time, compared with FImpute. These findings suggest that use of a multi-breed reference population might increase prediction accuracy using the 777K SNP panel and that 777K genotypes can be efficiently and effectively imputed using the lower density 50K SNP panel. Copyright © 2014 American Dairy Science Association. Published by Elsevier Inc. All rights reserved.
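
    The two accuracy measures named in this abstract can be illustrated with a minimal sketch. It assumes genotypes coded as allele counts (0, 1, 2) and takes allelic R-squared to be the squared Pearson correlation between true and imputed allele counts. The function names and toy data are hypothetical and are not taken from FImpute or Beagle.

```python
import math


def concordance_rate(true_genotypes, imputed_genotypes):
    """Proportion of genotypes (coded 0, 1, 2) that were imputed exactly right."""
    matches = sum(t == i for t, i in zip(true_genotypes, imputed_genotypes))
    return matches / len(true_genotypes)


def pearson_r(x, y):
    """Pearson correlation; assumes both inputs have non-zero variance."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def allelic_r2(true_genotypes, imputed_genotypes):
    """Squared correlation between true and imputed allele counts (an assumption)."""
    return pearson_r(true_genotypes, imputed_genotypes) ** 2


# Toy data for one SNP across eight animals (hypothetical values).
true_calls = [0, 1, 2, 1, 0, 2, 1, 1]
imputed_calls = [0, 1, 2, 1, 0, 1, 1, 2]

print(concordance_rate(true_calls, imputed_calls))
print(allelic_r2(true_calls, imputed_calls))
```

    Concordance counts only exact matches, whereas the correlation-based measure also reflects how far off a wrong call is, which is one reason studies often report both.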

  7. Sixteen new lung function signals identified through 1000 Genomes Project reference panel imputation

    PubMed Central

    Artigas, María Soler; Wain, Louise V.; Miller, Suzanne; Kheirallah, Abdul Kader; Huffman, Jennifer E.; Ntalla, Ioanna; Shrine, Nick; Obeidat, Ma'en; Trochet, Holly; McArdle, Wendy L.; Alves, Alexessander Couto; Hui, Jennie; Zhao, Jing Hua; Joshi, Peter K.; Teumer, Alexander; Albrecht, Eva; Imboden, Medea; Rawal, Rajesh; Lopez, Lorna M.; Marten, Jonathan; Enroth, Stefan; Surakka, Ida; Polasek, Ozren; Lyytikäinen, Leo-Pekka; Granell, Raquel; Hysi, Pirro G.; Flexeder, Claudia; Mahajan, Anubha; Beilby, John; Bossé, Yohan; Brandsma, Corry-Anke; Campbell, Harry; Gieger, Christian; Gläser, Sven; González, Juan R.; Grallert, Harald; Hammond, Chris J.; Harris, Sarah E.; Hartikainen, Anna-Liisa; Heliövaara, Markku; Henderson, John; Hocking, Lynne; Horikoshi, Momoko; Hutri-Kähönen, Nina; Ingelsson, Erik; Johansson, Åsa; Kemp, John P.; Kolcic, Ivana; Kumar, Ashish; Lind, Lars; Melén, Erik; Musk, Arthur W.; Navarro, Pau; Nickle, David C.; Padmanabhan, Sandosh; Raitakari, Olli T.; Ried, Janina S.; Ripatti, Samuli; Schulz, Holger; Scott, Robert A.; Sin, Don D.; Starr, John M.; Deloukas, Panos; Hansell, Anna L.; Hubbard, Richard; Jackson, Victoria E.; Marchini, Jonathan; Pavord, Ian; Thomson, Neil C.; Zeggini, Eleftheria; Viñuela, Ana; Völzke, Henry; Wild, Sarah H.; Wright, Alan F.; Zemunik, Tatijana; Jarvis, Deborah L.; Spector, Tim D.; Evans, David M.; Lehtimäki, Terho; Vitart, Veronique; Kähönen, Mika; Gyllensten, Ulf; Rudan, Igor; Deary, Ian J.; Karrasch, Stefan; Probst-Hensch, Nicole M.; Heinrich, Joachim; Stubbe, Beate; Wilson, James F.; Wareham, Nicholas J.; James, Alan L.; Morris, Andrew P.; Jarvelin, Marjo-Riitta; Hayward, Caroline; Sayers, Ian; Strachan, David P.; Hall, Ian P.; Tobin, Martin D.

    2015-01-01

    Lung function measures are used in the diagnosis of chronic obstructive pulmonary disease. In 38,199 European ancestry individuals, we studied genome-wide association of forced expiratory volume in 1 s (FEV1), forced vital capacity (FVC) and FEV1/FVC with 1000 Genomes Project (phase 1)-imputed genotypes and followed up top associations in 54,550 Europeans. We identify 14 novel loci (P < 5 × 10⁻⁸) in or near ENSA, RNU5F-1, KCNS3, AK097794, ASTN2, LHX3, CCDC91, TBX3, TRIP11, RIN3, TEKT5, LTBP4, MN1 and AP1S2, and two novel signals at known loci NPNT and GPR126, providing a basis for new understanding of the genetic determinants of these traits and pulmonary diseases in which they are altered. PMID:26635082

  8. Genomic relationships based on X chromosome markers and accuracy of genomic predictions with and without X chromosome markers

    PubMed Central

    2014-01-01

    Background Although the X chromosome is the second largest bovine chromosome, markers on the X chromosome are not used for genomic prediction in some countries and populations. In this study, we presented a method for computing genomic relationships using X chromosome markers, investigated the accuracy of imputation from a low density (7K) to the 54K SNP (single nucleotide polymorphism) panel, and compared the accuracy of genomic prediction with and without using X chromosome markers. Methods The impact of considering X chromosome markers on prediction accuracy was assessed using data from Nordic Holstein bulls and different sets of SNPs: (a) the 54K SNPs for reference and test animals, (b) SNPs imputed from the 7K to the 54K SNP panel for test animals, (c) SNPs imputed from the 7K to the 54K panel for half of the reference animals, and (d) the 7K SNP panel for all animals. Beagle and Findhap were used for imputation. GBLUP (genomic best linear unbiased prediction) models with or without X chromosome markers and with or without a residual polygenic effect were used to predict genomic breeding values for 15 traits. Results Averaged over the two imputation datasets, correlation coefficients between imputed and true genotypes for autosomal markers, pseudo-autosomal markers, and X-specific markers were 0.971, 0.831 and 0.935 when using Findhap, and 0.983, 0.856 and 0.937 when using Beagle. Estimated reliabilities of genomic predictions based on the imputed datasets using Findhap or Beagle were very close to those using the real 54K data. Genomic prediction using all markers gave slightly higher reliabilities than predictions without X chromosome markers. Based on our data which included only bulls, using a G matrix that accounted for sex-linked relationships did not improve prediction, compared with a G matrix that did not account for sex-linked relationships. 
A model that included a polygenic effect did not recover the loss of prediction accuracy from exclusion of X chromosome markers. Conclusions The results from this study suggest that markers on the X chromosome contribute to accuracy of genomic predictions and should be used for routine genomic evaluation. PMID:25080199

  9. A Kriging based spatiotemporal approach for traffic volume data imputation

    PubMed Central

    Han, Lee D.; Liu, Xiaohan; Pu, Li; Chin, Shih-miao; Hwang, Ho-ling

    2018-01-01

    Along with the rapid development of Intelligent Transportation Systems, traffic data collection technologies have progressed fast. The emergence of innovative data collection technologies such as remote traffic microwave sensor, Bluetooth sensor, GPS-based floating car method, and automated license plate recognition, has significantly increased the variety and volume of traffic data. Despite the development of these technologies, the missing data issue is still a problem that poses great challenge for data based applications such as traffic forecasting, real-time incident detection, dynamic route guidance, and massive evacuation optimization. A thorough literature review suggests most current imputation models either focus on the temporal nature of the traffic data and fail to consider the spatial information of neighboring locations or assume the data follow a certain distribution. These two issues reduce the imputation accuracy and limit the use of the corresponding imputation methods respectively. As a result, this paper presents a Kriging based data imputation approach that is able to fully utilize the spatiotemporal correlation in the traffic data and that does not assume the data follow any distribution. A set of scenarios with different missing rates are used to evaluate the performance of the proposed method. The performance of the proposed method was compared with that of two other widely used methods, historical average and K-nearest neighborhood. Comparison results indicate that the proposed method has the highest imputation accuracy and is more flexible compared to other methods. PMID:29664928

  10. Cohort-specific imputation of gene expression improves prediction of warfarin dose for African Americans.

    PubMed

    Gottlieb, Assaf; Daneshjou, Roxana; DeGorter, Marianne; Bourgeois, Stephane; Svensson, Peter J; Wadelius, Mia; Deloukas, Panos; Montgomery, Stephen B; Altman, Russ B

    2017-11-24

    Genome-wide association studies are useful for discovering genotype-phenotype associations but are limited because they require large cohorts to identify a signal, which can be population-specific. Mapping genetic variation to genes improves power and allows the effects of both protein-coding variation as well as variation in expression to be combined into "gene level" effects. Previous work has shown that warfarin dose can be predicted using information from genetic variation that affects protein-coding regions. Here, we introduce a method that improves dose prediction by integrating tissue-specific gene expression. In particular, we use drug pathways and expression quantitative trait loci knowledge to impute gene expression, on the assumption that differential expression of key pathway genes may impact dose requirement. We focus on 116 genes from the pharmacokinetic and pharmacodynamic pathways of warfarin within training and validation sets comprising both European and African-descent individuals. We build gene-tissue signatures associated with warfarin dose in a cohort-specific manner and identify a signature of 11 gene-tissue pairs that significantly augments the International Warfarin Pharmacogenetics Consortium dosage-prediction algorithm in both populations. Our results demonstrate that imputed expression can improve dose prediction and bridge population-specific compositions. MATLAB code is available at https://github.com/assafgo/warfarin-cohort.

  11. New insights into the pharmacogenomics of antidepressant response from the GENDEP and STAR*D studies: rare variant analysis and high-density imputation.

    PubMed

    Fabbri, C; Tansey, K E; Perlis, R H; Hauser, J; Henigsberg, N; Maier, W; Mors, O; Placentino, A; Rietschel, M; Souery, D; Breen, G; Curtis, C; Sang-Hyuk, L; Newhouse, S; Patel, H; Guipponi, M; Perroud, N; Bondolfi, G; O'Donovan, M; Lewis, G; Biernacka, J M; Weinshilboum, R M; Farmer, A; Aitchison, K J; Craig, I; McGuffin, P; Uher, R; Lewis, C M

    2017-11-21

    Genome-wide association studies have generally failed to identify polymorphisms associated with antidepressant response. Possible reasons include limited coverage of genetic variants, which this study tried to address by exome genotyping and dense imputation. A meta-analysis of Genome-Based Therapeutic Drugs for Depression (GENDEP) and Sequenced Treatment Alternatives to Relieve Depression (STAR*D) studies was performed at the single-nucleotide polymorphism (SNP), gene and pathway levels. Coverage of genetic variants was increased compared with previous studies by adding exome genotypes to previously available genome-wide data and using the Haplotype Reference Consortium panel for imputation. Standard quality control was applied. Phenotypes were symptom improvement and remission after 12 weeks of antidepressant treatment. Significant findings were investigated in NEWMEDS consortium samples and Pharmacogenomic Research Network Antidepressant Medication Pharmacogenomic Study (PGRN-AMPS) for replication. A total of 7,062,950 SNPs were analyzed in GENDEP (n=738) and STAR*D (n=1409). rs116692768 (P=1.80e-08, ITGA9 (integrin α9)) and rs76191705 (P=2.59e-08, NRXN3 (neurexin 3)) were significantly associated with symptom improvement during citalopram/escitalopram treatment. At the gene level, no consistent effect was found. At the pathway level, the Gene Ontology (GO) terms GO:0005694 (chromosome) and GO:0044427 (chromosomal part) were associated with improvement (corrected P=0.007 and 0.045, respectively). The association between rs116692768 and symptom improvement was replicated in PGRN-AMPS (P=0.047), whereas rs76191705 was not. The two SNPs did not replicate in NEWMEDS. ITGA9 codes for a membrane receptor for neurotrophins and NRXN3 is a transmembrane neuronal adhesion receptor involved in synaptic differentiation. Despite their meaningful biological rationale for being involved in antidepressant effect, replication was partial. 
Further studies may help in clarifying their role. The Pharmacogenomics Journal advance online publication, 21 November 2017; doi:10.1038/tpj.2017.44.

  12. OryzaGenome: Genome Diversity Database of Wild Oryza Species.

    PubMed

    Ohyanagi, Hajime; Ebata, Toshinobu; Huang, Xuehui; Gong, Hao; Fujita, Masahiro; Mochizuki, Takako; Toyoda, Atsushi; Fujiyama, Asao; Kaminuma, Eli; Nakamura, Yasukazu; Feng, Qi; Wang, Zi-Xuan; Han, Bin; Kurata, Nori

    2016-01-01

    The species in the genus Oryza, encompassing nine genome types and 23 species, are a rich genetic resource and may have applications in deeper genomic analyses aiming to understand the evolution of plant genomes. With the advancement of next-generation sequencing (NGS) technology, a flood of Oryza species reference genomes and genomic variation information has become available in recent years. This genomic information, combined with the comprehensive phenotypic information that we are accumulating in our Oryzabase, can serve as an excellent genotype-phenotype association resource for analyzing rice functional and structural evolution, and the associated diversity of the Oryza genus. Here we integrate our previous and future phenotypic/habitat information and newly determined genotype information into a united repository, named OryzaGenome, providing the variant information with hyperlinks to Oryzabase. The current version of OryzaGenome includes genotype information of 446 O. rufipogon accessions derived by imputation and of 17 accessions derived by imputation-free deep sequencing. Two variant viewers are implemented: SNP Viewer as a conventional genome browser interface and Variant Table as a text-based browser for precise inspection of each variant one by one. Portable VCF (variant call format) file or tab-delimited file download is also available. Following these SNP (single nucleotide polymorphism) data, reference pseudomolecules/scaffolds/contigs and genome-wide variation information for almost all of the closely and distantly related wild Oryza species from the NIG Wild Rice Collection will be available in future releases. All of the resources can be accessed through http://viewer.shigen.info/oryzagenome/. © The Author 2015. Published by Oxford University Press on behalf of Japanese Society of Plant Physiologists.

  13. TREM2 Variants in Alzheimer's Disease

    PubMed Central

    Guerreiro, Rita; Wojtas, Aleksandra; Bras, Jose; Carrasquillo, Minerva; Rogaeva, Ekaterina; Majounie, Elisa; Cruchaga, Carlos; Sassi, Celeste; Kauwe, John S.K.; Younkin, Steven; Hazrati, Lilinaz; Collinge, John; Pocock, Jennifer; Lashley, Tammaryn; Williams, Julie; Lambert, Jean-Charles; Amouyel, Philippe; Goate, Alison; Rademakers, Rosa; Morgan, Kevin; Powell, John; St. George-Hyslop, Peter; Singleton, Andrew; Hardy, John

    2013-01-01

    BACKGROUND Homozygous loss-of-function mutations in TREM2, encoding the triggering receptor expressed on myeloid cells 2 protein, have previously been associated with an autosomal recessive form of early-onset dementia. METHODS We used genome, exome, and Sanger sequencing to analyze the genetic variability in TREM2 in a series of 1092 patients with Alzheimer's disease and 1107 controls (the discovery set). We then performed a meta-analysis on imputed data for the TREM2 variant rs75932628 (predicted to cause a R47H substitution) from three genomewide association studies of Alzheimer's disease and tested for the association of the variant with disease. We genotyped the R47H variant in an additional 1887 cases and 4061 controls. We then assayed the expression of TREM2 across different regions of the human brain and identified genes that are differentially expressed in a mouse model of Alzheimer's disease and in control mice. RESULTS We found significantly more variants in exon 2 of TREM2 in patients with Alzheimer's disease than in controls in the discovery set (P = 0.02). There were 22 variant alleles in 1092 patients with Alzheimer's disease and 5 variant alleles in 1107 controls (P<0.001). The most commonly associated variant, rs75932628 (encoding R47H), showed highly significant association with Alzheimer's disease (P<0.001). Meta-analysis of rs75932628 genotypes imputed from genomewide association studies confirmed this association (P = 0.002), as did direct genotyping of an additional series of 1887 patients with Alzheimer's disease and 4061 controls (P<0.001). Trem2 expression differed between control mice and a mouse model of Alzheimer's disease. CONCLUSIONS Heterozygous rare variants in TREM2 are associated with a significant increase in the risk of Alzheimer's disease. (Funded by Alzheimer's Research UK and others.) PMID:23150934

  14. Disclosure Control using Partially Synthetic Data for Large-Scale Health Surveys, with Applications to CanCORS

    PubMed Central

    Loong, Bronwyn; Zaslavsky, Alan M.; He, Yulei; Harrington, David P.

    2013-01-01

    Statistical agencies have begun to partially synthesize public-use data for major surveys to protect the confidentiality of respondents’ identities and sensitive attributes, by replacing high disclosure risk and sensitive variables with multiple imputations. To date, there are few applications of synthetic data techniques to large-scale healthcare survey data. Here, we describe partial synthesis of survey data collected by CanCORS, a comprehensive observational study of the experiences, treatments, and outcomes of patients with lung or colorectal cancer in the United States. We review inferential methods for partially synthetic data, and discuss selection of high disclosure risk variables for synthesis, specification of imputation models, and identification disclosure risk assessment. We evaluate data utility by replicating published analyses and comparing results using original and synthetic data, and discuss practical issues in preserving inferential conclusions. We found that important subgroup relationships must be included in the synthetic data imputation model, to preserve the data utility of the observed data for a given analysis procedure. We conclude that synthetic CanCORS data are suited best for preliminary data analyses purposes. These methods address the requirement to share data in clinical research without compromising confidentiality. PMID:23670983

  15. Insights to the Genetics of Diabetic Nephropathy through a Genome-wide Association Study of the GoKinD Collection

    PubMed Central

    Pezzolesi, Marcus G.; Skupien, Jan; Krolewski, Andrzej S.

    2010-01-01

    The Genetics of Kidneys in Diabetes (GoKinD) study was initiated to facilitate research aimed at identifying genes involved in diabetic nephropathy (DN) in type 1 diabetes (T1D). In this review, we present an overview of this study and the various reports that have utilized its collection. At the forefront of these efforts is the recent genome-wide association (GWA) scan implemented on the GoKinD collection. We highlight the results from our analysis of these data and describe compelling evidence from animal models that further supports the potential role of associated loci in the susceptibility to DN. To enhance our analysis of genetic associations in GoKinD, using genome-wide imputation (GWI), we expanded our analysis of this collection to include genotype data from more than 2.4 million common SNPs. We illustrate the added utility of this enhanced dataset through the comprehensive fine-mapping of candidate genomic regions previously linked with DN and the targeted investigation of genes involved in candidate pathways implicated in its pathogenesis. Collectively, GWA and GWI data from the GoKinD collection will serve as a springboard for future investigations into the genetic basis of DN in T1D. PMID:20347642

  16. Enhanced statistical tests for GWAS in admixed populations: assessment using African Americans from CARe and a Breast Cancer Consortium.

    PubMed

    Pasaniuc, Bogdan; Zaitlen, Noah; Lettre, Guillaume; Chen, Gary K; Tandon, Arti; Kao, W H Linda; Ruczinski, Ingo; Fornage, Myriam; Siscovick, David S; Zhu, Xiaofeng; Larkin, Emma; Lange, Leslie A; Cupples, L Adrienne; Yang, Qiong; Akylbekova, Ermeg L; Musani, Solomon K; Divers, Jasmin; Mychaleckyj, Joe; Li, Mingyao; Papanicolaou, George J; Millikan, Robert C; Ambrosone, Christine B; John, Esther M; Bernstein, Leslie; Zheng, Wei; Hu, Jennifer J; Ziegler, Regina G; Nyante, Sarah J; Bandera, Elisa V; Ingles, Sue A; Press, Michael F; Chanock, Stephen J; Deming, Sandra L; Rodriguez-Gil, Jorge L; Palmer, Cameron D; Buxbaum, Sarah; Ekunwe, Lynette; Hirschhorn, Joel N; Henderson, Brian E; Myers, Simon; Haiman, Christopher A; Reich, David; Patterson, Nick; Wilson, James G; Price, Alkes L

    2011-04-01

    While genome-wide association studies (GWAS) have primarily examined populations of European ancestry, more recent studies often involve additional populations, including admixed populations such as African Americans and Latinos. In admixed populations, linkage disequilibrium (LD) exists both at a fine scale in ancestral populations and at a coarse scale (admixture-LD) due to chromosomal segments of distinct ancestry. Disease association statistics in admixed populations have previously considered SNP association (LD mapping) or admixture association (mapping by admixture-LD), but not both. Here, we introduce a new statistical framework for combining SNP and admixture association in case-control studies, as well as methods for local ancestry-aware imputation. We illustrate the gain in statistical power achieved by these methods by analyzing data of 6,209 unrelated African Americans from the CARe project genotyped on the Affymetrix 6.0 chip, in conjunction with both simulated and real phenotypes, as well as by analyzing the FGFR2 locus using breast cancer GWAS data from 5,761 African-American women. We show that, at typed SNPs, our method yields an 8% increase in statistical power for finding disease risk loci compared to the power achieved by standard methods in case-control studies. At imputed SNPs, we observe an 11% increase in statistical power for mapping disease loci when our local ancestry-aware imputation framework and the new scoring statistic are jointly employed. Finally, we show that our method increases statistical power in regions harboring the causal SNP in the case when the causal SNP is untyped and cannot be imputed. Our methods and our publicly available software are broadly applicable to GWAS in admixed populations.

  17. Modeling 3D Facial Shape from DNA

    PubMed Central

    Claes, Peter; Liberton, Denise K.; Daniels, Katleen; Rosana, Kerri Matthes; Quillen, Ellen E.; Pearson, Laurel N.; McEvoy, Brian; Bauchet, Marc; Zaidi, Arslan A.; Yao, Wei; Tang, Hua; Barsh, Gregory S.; Absher, Devin M.; Puts, David A.; Rocha, Jorge; Beleza, Sandra; Pereira, Rinaldo W.; Baynam, Gareth; Suetens, Paul; Vandermeulen, Dirk; Wagner, Jennifer K.; Boster, James S.; Shriver, Mark D.

    2014-01-01

    Human facial diversity is substantial, complex, and largely scientifically unexplained. We used spatially dense quasi-landmarks to measure face shape in population samples with mixed West African and European ancestry from three locations (United States, Brazil, and Cape Verde). Using bootstrapped response-based imputation modeling (BRIM), we uncover the relationships between facial variation and the effects of sex, genomic ancestry, and a subset of craniofacial candidate genes. The facial effects of these variables are summarized as response-based imputed predictor (RIP) variables, which are validated using self-reported sex, genomic ancestry, and observer-based facial ratings (femininity and proportional ancestry) and judgments (sex and population group). By jointly modeling sex, genomic ancestry, and genotype, the independent effects of particular alleles on facial features can be uncovered. Results on a set of 20 genes showing significant effects on facial features provide support for this approach as a novel means to identify genes affecting normal-range facial features and for approximating the appearance of a face from genetic markers. PMID:24651127

  18. Reference-based phasing using the Haplotype Reference Consortium panel.

    PubMed

    Loh, Po-Ru; Danecek, Petr; Palamara, Pier Francesco; Fuchsberger, Christian; A Reshef, Yakir; K Finucane, Hilary; Schoenherr, Sebastian; Forer, Lukas; McCarthy, Shane; Abecasis, Goncalo R; Durbin, Richard; L Price, Alkes

    2016-11-01

    Haplotype phasing is a fundamental problem in medical and population genetics. Phasing is generally performed via statistical phasing in a genotyped cohort, an approach that can yield high accuracy in very large cohorts but attains lower accuracy in smaller cohorts. Here we instead explore the paradigm of reference-based phasing. We introduce a new phasing algorithm, Eagle2, that attains high accuracy across a broad range of cohort sizes by efficiently leveraging information from large external reference panels (such as the Haplotype Reference Consortium; HRC) using a new data structure based on the positional Burrows-Wheeler transform. We demonstrate that Eagle2 attains a ∼20× speedup and ∼10% increase in accuracy compared to reference-based phasing using SHAPEIT2. On European-ancestry samples, Eagle2 with the HRC panel achieves >2× the accuracy of 1000 Genomes-based phasing. Eagle2 is open source and freely available for HRC-based phasing via the Sanger Imputation Service and the Michigan Imputation Server.

  19. Integrative approaches for large-scale transcriptome-wide association studies

    PubMed Central

    Gusev, Alexander; Ko, Arthur; Shi, Huwenbo; Bhatia, Gaurav; Chung, Wonil; Penninx, Brenda W J H; Jansen, Rick; de Geus, Eco JC; Boomsma, Dorret I; Wright, Fred A; Sullivan, Patrick F; Nikkola, Elina; Alvarez, Marcus; Civelek, Mete; Lusis, Aldons J.; Lehtimäki, Terho; Raitoharju, Emma; Kähönen, Mika; Seppälä, Ilkka; Raitakari, Olli T.; Kuusisto, Johanna; Laakso, Markku; Price, Alkes L.; Pajukanta, Päivi; Pasaniuc, Bogdan

    2016-01-01

    Many genetic variants influence complex traits by modulating gene expression, thus altering the abundance levels of one or multiple proteins. Here, we introduce a powerful strategy that integrates gene expression measurements with summary association statistics from large-scale genome-wide association studies (GWAS) to identify genes whose cis-regulated expression is associated to complex traits. We leverage expression imputation to perform a transcriptome wide association scan (TWAS) to identify significant expression-trait associations. We applied our approaches to expression data from blood and adipose tissue measured in ~3,000 individuals overall. We imputed gene expression into GWAS data from over 900,000 phenotype measurements to identify 69 novel genes significantly associated to obesity-related traits (BMI, lipids, and height). Many of the novel genes are associated with relevant phenotypes in the Hybrid Mouse Diversity Panel. Our results showcase the power of integrating genotype, gene expression and phenotype to gain insights into the genetic basis of complex traits. PMID:26854917
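
    The idea of expression imputation mentioned in this abstract can be sketched as a weighted sum of genotypes. The sketch assumes per-SNP weights have already been estimated in a reference panel that has both genotype and expression data. The weights, genotypes, trait values, and function names below are hypothetical, and the example works on individual-level data rather than the summary association statistics the abstract describes.

```python
import math


def impute_expression(genotypes, weights):
    """Predict cis-regulated expression for each individual.

    genotypes: one list of allele counts (0, 1, 2) per individual.
    weights:   one effect-size weight per SNP (hypothetical, pre-estimated).
    """
    return [sum(w * g for w, g in zip(weights, row)) for row in genotypes]


def pearson_r(x, y):
    """Pearson correlation; assumes both inputs have non-zero variance."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Toy example: three individuals, two cis-SNPs for one gene (hypothetical).
snp_weights = [0.5, -0.2]
cohort_genotypes = [[0, 1], [2, 0], [1, 1]]
trait_values = [20.1, 27.5, 23.0]

predicted_expression = impute_expression(cohort_genotypes, snp_weights)
# Association between imputed expression and the trait in the toy cohort.
association = pearson_r(predicted_expression, trait_values)
print(predicted_expression, association)
```

    In practice the association would be tested for significance across many genes, with appropriate correction for multiple testing.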

  20. Whole genome sequencing and imputation in isolated populations identify genetic associations with medically-relevant complex traits

    PubMed Central

    Southam, Lorraine; Gilly, Arthur; Süveges, Dániel; Farmaki, Aliki-Eleni; Schwartzentruber, Jeremy; Tachmazidou, Ioanna; Matchan, Angela; Rayner, Nigel W.; Tsafantakis, Emmanouil; Karaleftheri, Maria; Xue, Yali; Dedoussis, George; Zeggini, Eleftheria

    2017-01-01

    Next-generation association studies can be empowered by sequence-based imputation and by studying founder populations. Here we report ∼9.5 million variants from whole-genome sequencing (WGS) of a Cretan-isolated population, and show enrichment of rare and low-frequency variants with predicted functional consequences. We use a WGS-based imputation approach utilizing 10,422 reference haplotypes to perform genome-wide association analyses and observe 17 genome-wide significant, independent signals, including replicating evidence for association at eight novel low-frequency variant signals. Two novel cardiometabolic associations are at lead variants unique to the founder population sequences: chr16:70790626 (high-density lipoprotein levels beta −1.71 (SE 0.25), P=1.57 × 10⁻¹¹, effect allele frequency (EAF) 0.006); and rs145556679 (triglycerides levels beta −1.13 (SE 0.17), P=2.53 × 10⁻¹¹, EAF 0.013). Our findings add empirical support to the contribution of low-frequency variants in complex traits, demonstrate the advantage of including population-specific sequences in imputation panels and exemplify the power gains afforded by population isolates. PMID:28548082

  1. The use of multiple imputation for the accurate measurements of individual feed intake by electronic feeders.

    PubMed

    Jiao, S; Tiezzi, F; Huang, Y; Gray, K A; Maltecca, C

    2016-02-01

    Obtaining accurate individual feed intake records is the key first step in achieving genetic progress toward more efficient nutrient utilization in pigs. Feed intake records collected by electronic feeding systems contain errors (erroneous and abnormal values exceeding certain cutoff criteria), which are due to feeder malfunction or animal-feeder interaction. In this study, we examined the use of a novel data-editing strategy involving multiple imputation to minimize the impact of errors and missing values on the quality of feed intake data collected by an electronic feeding system. Accuracy of feed intake data adjustment obtained from the conventional linear mixed model (LMM) approach was compared with 2 alternative implementations of multiple imputation by chained equation, denoted as MI (multiple imputation) and MICE (multiple imputation by chained equation). The 3 methods were compared under 3 scenarios, where 5, 10, and 20% feed intake error rates were simulated. Each of the scenarios was replicated 5 times. Accuracy of the alternative error adjustment was measured as the correlation between the true daily feed intake (DFI; daily feed intake in the testing period) or true ADFI (the mean DFI across testing period) and the adjusted DFI or adjusted ADFI. In the editing process, error cutoff criteria are used to define if a feed intake visit contains errors. To investigate the possibility that the error cutoff criteria may affect any of the 3 methods, the simulation was repeated with 2 alternative error cutoff values. Multiple imputation methods outperformed the LMM approach in all scenarios with mean accuracies of 96.7, 93.5, and 90.2% obtained with MI and 96.8, 94.4, and 90.1% obtained with MICE compared with 91.0, 82.6, and 68.7% using LMM for DFI. Similar results were obtained for ADFI. Furthermore, multiple imputation methods consistently performed better than LMM regardless of the cutoff criteria applied to define errors. 
In conclusion, multiple imputation is proposed as a more accurate and flexible method for error adjustments in feed intake data collected by electronic feeders.

  2. Disclosure control using partially synthetic data for large-scale health surveys, with applications to CanCORS.

    PubMed

    Loong, Bronwyn; Zaslavsky, Alan M; He, Yulei; Harrington, David P

    2013-10-30

    Statistical agencies have begun to partially synthesize public-use data for major surveys to protect the confidentiality of respondents' identities and sensitive attributes by replacing high disclosure risk and sensitive variables with multiple imputations. To date, there are few applications of synthetic data techniques to large-scale healthcare survey data. Here, we describe partial synthesis of survey data collected by the Cancer Care Outcomes Research and Surveillance (CanCORS) project, a comprehensive observational study of the experiences, treatments, and outcomes of patients with lung or colorectal cancer in the USA. We review inferential methods for partially synthetic data and discuss selection of high disclosure risk variables for synthesis, specification of imputation models, and identification disclosure risk assessment. We evaluate data utility by replicating published analyses and comparing results using original and synthetic data and discuss practical issues in preserving inferential conclusions. We found that important subgroup relationships must be included in the synthetic data imputation model, to preserve the data utility of the observed data for a given analysis procedure. We conclude that synthetic CanCORS data are suited best for preliminary data analyses purposes. These methods address the requirement to share data in clinical research without compromising confidentiality. Copyright © 2013 John Wiley & Sons, Ltd.

  3. Two-pass imputation algorithm for missing value estimation in gene expression time series.

    PubMed

    Tsiporkova, Elena; Boeva, Veselka

    2007-10-01

    Gene expression microarray experiments frequently generate datasets with multiple values missing. However, most of the analysis, mining, and classification methods for gene expression data require a complete matrix of gene array values. Therefore, the accurate estimation of missing values in such datasets has been recognized as an important issue, and several imputation algorithms have already been proposed to the biological community. Most of these approaches, however, are not particularly suitable for time series expression profiles. In view of this, we propose a novel imputation algorithm, which is specially suited for the estimation of missing values in gene expression time series data. The algorithm utilizes Dynamic Time Warping (DTW) distance in order to measure the similarity between time expression profiles, and subsequently selects for each gene expression profile with missing values a dedicated set of candidate profiles for estimation. Three different DTW-based imputation (DTWimpute) algorithms have been considered: position-wise, neighborhood-wise, and two-pass imputation. These have initially been prototyped in Perl, and their accuracy has been evaluated on yeast expression time series data using several different parameter settings. The experiments have shown that the two-pass algorithm consistently outperforms the neighborhood-wise and position-wise algorithms, in particular for datasets with a higher level of missing entries. The performance of the two-pass DTWimpute algorithm has further been benchmarked against the weighted K-Nearest Neighbors algorithm, which is widely used in the biological community; the two-pass algorithm proved superior. Motivated by these findings, which clearly indicate the added value of DTW techniques for missing value estimation in time series data, we have built an optimized C++ implementation of the two-pass DTWimpute algorithm. The software also provides a choice between three different initial rough imputation methods.
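
    The general idea of DTW-based candidate selection can be sketched in Python as follows. This is a simplified illustration, not the authors' position-wise, neighborhood-wise, or two-pass algorithm: the rough initial fill (the profile's own mean), the absolute-difference cost, the parameter `k`, and the function names are all assumptions.

```python
def dtw_distance(a, b):
    """Classic dynamic-programming DTW distance with absolute-difference cost."""
    inf = float("inf")
    n, m = len(a), len(b)
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            # Extend the cheapest of the three admissible warping steps.
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]

def impute_profile(profile, candidates, k=2):
    """Fill None entries using the k complete profiles closest in DTW distance."""
    observed = [v for v in profile if v is not None]
    rough_mean = sum(observed) / len(observed)
    # Rough initial imputation so that a DTW distance can be computed at all.
    rough = [rough_mean if v is None else v for v in profile]
    nearest = sorted(candidates, key=lambda c: dtw_distance(rough, c))[:k]
    # Replace each missing time point by the candidates' mean at that point.
    return [
        sum(c[t] for c in nearest) / len(nearest) if v is None else v
        for t, v in enumerate(profile)
    ]

# Hypothetical expression profiles over four time points.
complete_profiles = [
    [1.0, 2.0, 3.0, 4.0],
    [1.1, 2.1, 3.1, 4.1],
    [4.0, 3.0, 2.0, 1.0],
]
imputed = impute_profile([1.0, None, 3.0, 4.0], complete_profiles, k=2)
```

    In this toy example the two rising profiles are selected as candidates and the falling profile is ignored, so the missing value is estimated from profiles with a similar temporal shape.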

  4. Joint Effects of Known Type 2 Diabetes Susceptibility Loci in Genome-Wide Association Study of Singapore Chinese: The Singapore Chinese Health Study

    PubMed Central

    Chen, Zhanghua; Pereira, Mark A.; Seielstad, Mark; Koh, Woon-Puay; Tai, E. Shyong; Teo, Yik-Ying; Liu, Jianjun; Hsu, Chris; Wang, Renwei; Odegaard, Andrew O.; Thyagarajan, Bharat; Koratkar, Revati; Yuan, Jian-Min; Gross, Myron D.; Stram, Daniel O.

    2014-01-01

    Background: Genome-wide association studies (GWAS) have identified genetic factors in type 2 diabetes (T2D), mostly among individuals of European ancestry. We tested whether previously identified T2D-associated single nucleotide polymorphisms (SNPs) replicate and whether SNPs in regions near known T2D SNPs were associated with T2D within the Singapore Chinese Health Study. Methods: 2338 cases and 2339 T2D controls from the Singapore Chinese Health Study were genotyped for 507,509 SNPs. Imputation extended the genotyped SNPs to 7,514,461 with high estimated certainty (r² > 0.8). Replication of known index SNP associations in T2D was attempted. Risk scores were computed as the sum of index risk alleles. SNPs in regions ±100 kb around each index were tested for associations with T2D in conditional fine-mapping analysis. Results: Of 69 index SNPs, 20 were genotyped directly and genotypes at 35 others were well imputed. Among the 55 SNPs with data, disease associations were replicated (at p < 0.05) for 15 SNPs, while 32 more were directionally consistent with previous reports. Risk score was a significant predictor, with a 2.03-fold higher risk of T2D (CI 1.69–2.44) comparing the highest to lowest quintile of risk allele burden (p = 5.72 × 10⁻¹⁴). Two improved SNPs around index rs10923931 and 5 new candidate SNPs around indices rs10965250 and rs1111875 passed simple Bonferroni corrections for significance in conditional analysis. Nonetheless, only a small fraction (2.3% on the disease liability scale) of T2D burden in Singapore is explained by these SNPs. Conclusions: While diabetes risk in Singapore Chinese involves genetic variants, most disease risk remains unexplained. Further genetic work is ongoing in the Singapore Chinese population to identify unique common variants not already seen in earlier studies. However, rapid increases in T2D risk have occurred in recent decades in this population, indicating that dynamic environmental influences and possibly gene by environment interactions complicate the genetic architecture of this disease. PMID:24520337
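
    The risk score described above, the sum of index risk alleles, and the quintile comparison can be sketched in Python. The genotype counts and score values are hypothetical, the function names are assumptions, and the rank-based quintile rule is only one of several reasonable choices; imputed SNPs could equally enter as fractional dosages.

```python
def risk_score(risk_allele_counts):
    """Sum of risk-allele counts (0, 1, or 2 per SNP) across index SNPs."""
    return sum(risk_allele_counts.values())

def quintiles(scores):
    """Assign each score a quintile (1 = lowest, 5 = highest) by rank."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    result = [0] * len(scores)
    for rank, i in enumerate(order):
        result[i] = rank * 5 // len(scores) + 1
    return result

# Hypothetical risk-allele counts for one participant at three index SNPs.
participant = {"rs10923931": 1, "rs10965250": 2, "rs1111875": 0}
score = risk_score(participant)

# Hypothetical scores for ten participants, grouped into quintiles.
cohort_scores = [3, 7, 1, 9, 5, 2, 8, 4, 6, 10]
cohort_quintiles = quintiles(cohort_scores)
```

    Comparing disease risk between participants in quintile 5 and quintile 1 corresponds to the highest-versus-lowest contrast reported in the abstract.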

  5. Joint effects of known type 2 diabetes susceptibility loci in genome-wide association study of Singapore Chinese: the Singapore Chinese health study.

    PubMed

    Chen, Zhanghua; Pereira, Mark A; Seielstad, Mark; Koh, Woon-Puay; Tai, E Shyong; Teo, Yik-Ying; Liu, Jianjun; Hsu, Chris; Wang, Renwei; Odegaard, Andrew O; Thyagarajan, Bharat; Koratkar, Revati; Yuan, Jian-Min; Gross, Myron D; Stram, Daniel O

    2014-01-01

    Genome-wide association studies (GWAS) have identified genetic factors in type 2 diabetes (T2D), mostly among individuals of European ancestry. We tested whether previously identified T2D-associated single nucleotide polymorphisms (SNPs) replicate and whether SNPs in regions near known T2D SNPs were associated with T2D within the Singapore Chinese Health Study. 2338 cases and 2339 T2D controls from the Singapore Chinese Health Study were genotyped for 507,509 SNPs. Imputation extended the genotyped SNPs to 7,514,461 with high estimated certainty (r² > 0.8). Replication of known index SNP associations in T2D was attempted. Risk scores were computed as the sum of index risk alleles. SNPs in regions ±100 kb around each index were tested for associations with T2D in conditional fine-mapping analysis. Of 69 index SNPs, 20 were genotyped directly and genotypes at 35 others were well imputed. Among the 55 SNPs with data, disease associations were replicated (at p < 0.05) for 15 SNPs, while 32 more were directionally consistent with previous reports. Risk score was a significant predictor, with a 2.03-fold higher risk of T2D (CI 1.69-2.44) comparing the highest to lowest quintile of risk allele burden (p = 5.72 × 10⁻¹⁴). Two improved SNPs around index rs10923931 and 5 new candidate SNPs around indices rs10965250 and rs1111875 passed simple Bonferroni corrections for significance in conditional analysis. Nonetheless, only a small fraction (2.3% on the disease liability scale) of T2D burden in Singapore is explained by these SNPs. While diabetes risk in Singapore Chinese involves genetic variants, most disease risk remains unexplained. Further genetic work is ongoing in the Singapore Chinese population to identify unique common variants not already seen in earlier studies. However, rapid increases in T2D risk have occurred in recent decades in this population, indicating that dynamic environmental influences and possibly gene by environment interactions complicate the genetic architecture of this disease.

  6. Identifying Heat Waves in Florida: Considerations of Missing Weather Data

    PubMed Central

    Leary, Emily; Young, Linda J.; DuClos, Chris; Jordan, Melissa M.

    2015-01-01

    Background: Using current climate models, regional-scale changes for Florida over the next 100 years are predicted to include warming over terrestrial areas and very likely increases in the number of high temperature extremes. No uniform definition of a heat wave exists. Most past research on heat waves has focused on evaluating the aftermath of known heat waves, with minimal consideration of missing exposure information. Objectives: To identify and discuss methods of handling and imputing missing weather data and how those methods can affect identified periods of extreme heat in Florida. Methods: In addition to ignoring missing data, temporal, spatial, and spatio-temporal models are described and utilized to impute missing historical weather data from 1973 to 2012 from 43 Florida weather monitors. Calculated thresholds are used to define periods of extreme heat across Florida. Results: Modeling of missing data and imputing missing values can affect the identified periods of extreme heat, through the missing data itself or through the computed thresholds. The differences observed are related to the amount of missingness during June, July, and August, the warmest months of the warm season (April through September). Conclusions: Missing data considerations are important when defining periods of extreme heat. Spatio-temporal methods are recommended for data imputation. A heat wave definition that incorporates information from all monitors is advised. PMID:26619198

  7. Identifying Heat Waves in Florida: Considerations of Missing Weather Data.

    PubMed

    Leary, Emily; Young, Linda J; DuClos, Chris; Jordan, Melissa M

    2015-01-01

    Using current climate models, regional-scale changes for Florida over the next 100 years are predicted to include warming over terrestrial areas and very likely increases in the number of high temperature extremes. No uniform definition of a heat wave exists. Most past research on heat waves has focused on evaluating the aftermath of known heat waves, with minimal consideration of missing exposure information. The objective was to identify and discuss methods of handling and imputing missing weather data and how those methods can affect identified periods of extreme heat in Florida. In addition to ignoring missing data, temporal, spatial, and spatio-temporal models are described and utilized to impute missing historical weather data from 1973 to 2012 from 43 Florida weather monitors. Calculated thresholds are used to define periods of extreme heat across Florida. Modeling of missing data and imputing missing values can affect the identified periods of extreme heat, through the missing data itself or through the computed thresholds. The differences observed are related to the amount of missingness during June, July, and August, the warmest months of the warm season (April through September). Missing data considerations are important when defining periods of extreme heat. Spatio-temporal methods are recommended for data imputation. A heat wave definition that incorporates information from all monitors is advised.

  8. Re-Ranking Sequencing Variants in the Post-GWAS Era for Accurate Causal Variant Identification

    PubMed Central

    Faye, Laura L.; Machiela, Mitchell J.; Kraft, Peter; Bull, Shelley B.; Sun, Lei

    2013-01-01

    Next generation sequencing has dramatically increased our ability to localize disease-causing variants by providing base-pair level information at costs increasingly feasible for the large sample sizes required to detect complex-trait associations. Yet, identification of causal variants within an established region of association remains a challenge. Counter-intuitively, certain factors that increase power to detect an associated region can decrease power to localize the causal variant. First, combining GWAS with imputation or low coverage sequencing to achieve the large sample sizes required for high power can have the unintended effect of producing differential genotyping error among SNPs. This tends to bias the relative evidence for association toward better genotyped SNPs. Second, re-use of GWAS data for fine-mapping exploits previous findings to ensure genome-wide significance in GWAS-associated regions. However, using GWAS findings to inform fine-mapping analysis can bias evidence away from the causal SNP toward the tag SNP and SNPs in high LD with the tag. Together these factors can reduce power to localize the causal SNP by more than half. Other strategies commonly employed to increase power to detect association, namely increasing sample size and using higher density genotyping arrays, can, in certain common scenarios, actually exacerbate these effects and further decrease power to localize causal variants. We develop a re-ranking procedure that accounts for these adverse effects and substantially improves the accuracy of causal SNP identification, often doubling the probability that the causal SNP is top-ranked. Application to the NCI BPC3 aggressive prostate cancer GWAS with imputation meta-analysis identified a new top SNP at 2 of 3 associated loci and several additional possible causal SNPs at these loci that may have otherwise been overlooked. This method is simple to implement using R scripts provided on the author's website. PMID:23950724

  9. A new GWAS and meta-analysis with 1000Genomes imputation identifies novel risk variants for colorectal cancer.

    PubMed

    Al-Tassan, Nada A; Whiffin, Nicola; Hosking, Fay J; Palles, Claire; Farrington, Susan M; Dobbins, Sara E; Harris, Rebecca; Gorman, Maggie; Tenesa, Albert; Meyer, Brian F; Wakil, Salma M; Kinnersley, Ben; Campbell, Harry; Martin, Lynn; Smith, Christopher G; Idziaszczyk, Shelley; Barclay, Ella; Maughan, Timothy S; Kaplan, Richard; Kerr, Rachel; Kerr, David; Buchanan, Daniel D; Buchannan, Daniel D; Win, Aung Ko; Hopper, John; Jenkins, Mark; Lindor, Noralane M; Newcomb, Polly A; Gallinger, Steve; Conti, David; Schumacher, Fred; Casey, Graham; Dunlop, Malcolm G; Tomlinson, Ian P; Cheadle, Jeremy P; Houlston, Richard S

    2015-05-20

    Genome-wide association studies (GWAS) of colorectal cancer (CRC) have identified 23 susceptibility loci thus far. Analyses of previously conducted GWAS indicate additional risk loci are yet to be discovered. To identify novel CRC susceptibility loci, we conducted a new GWAS and performed a meta-analysis with five published GWAS (totalling 7,577 cases and 9,979 controls of European ancestry), imputing genotypes utilising the 1000 Genomes Project. The combined analysis identified new, significant associations with CRC at 1p36.2 marked by rs72647484 (minor allele frequency [MAF] = 0.09) near CDC42 and WNT4 (P = 1.21 × 10⁻⁸, odds ratio [OR] = 1.21) and at 16q24.1 marked by rs16941835 (MAF = 0.21, P = 5.06 × 10⁻⁸; OR = 1.15) within the long non-coding RNA (lncRNA) RP11-58A18.1 and ~500 kb from the nearest coding gene FOXL1. Additionally, we identified a promising association at 10p13 with rs10904849 intronic to CUBN (MAF = 0.32, P = 7.01 × 10⁻⁸; OR = 1.14). These findings provide further insights into the genetic and biological basis of inherited genetic susceptibility to CRC. Additionally, our analysis further demonstrates that imputation can be used to exploit GWAS data to identify novel disease-causing variants.

  10. yaImpute: An R package for kNN imputation

    Treesearch

    Nicholas L. Crookston; Andrew O. Finley

    2008-01-01

    This article introduces yaImpute, an R package for nearest neighbor search and imputation. Although nearest neighbor imputation is used in a host of disciplines, the methods implemented in the yaImpute package are tailored to imputation-based forest attribute estimation and mapping. The impetus to writing the yaImpute is a growing interest in nearest neighbor...
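
    The general nearest neighbor imputation idea can be sketched in Python. This is not the yaImpute API (yaImpute is an R package): the function name `knn_impute`, the Euclidean distance, the averaging of neighbor values, and the plot data are assumptions chosen only to illustrate the technique.

```python
from math import dist

def knn_impute(target_x, reference, k=1):
    """Impute a response for target_x from its k nearest reference observations.

    reference is a list of (x_vector, y_value) pairs in which both the
    predictors x and the response y were observed.
    """
    nearest = sorted(reference, key=lambda obs: dist(target_x, obs[0]))[:k]
    return sum(y for _, y in nearest) / len(nearest)

# Hypothetical reference plots: two predictors and one measured attribute.
reference_plots = [
    ((0.0, 0.0), 10.0),
    ((1.0, 1.0), 20.0),
    ((5.0, 5.0), 50.0),
]

# A target location with predictors only; its attribute is imputed.
single_neighbor = knn_impute((0.9, 1.2), reference_plots, k=1)
two_neighbors = knn_impute((0.9, 1.2), reference_plots, k=2)
```

    With k=1 the target simply inherits the value of its closest reference observation; larger k averages over several neighbors.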

  11. Missing data in FFQs: making assumptions about item non-response.

    PubMed

    Lamb, Karen E; Olstad, Dana Lee; Nguyen, Cattram; Milte, Catherine; McNaughton, Sarah A

    2017-04-01

    FFQs are a popular method of capturing dietary information in epidemiological studies and may be used to derive dietary exposures such as nutrient intake or overall dietary patterns and diet quality. As FFQs can involve large numbers of questions, participants may fail to respond to all questions, leaving researchers to decide how to deal with missing data when deriving intake measures. The aim of the present commentary is to discuss the current practice for dealing with item non-response in FFQs and to propose a research agenda for reporting and handling missing data in FFQs. Single imputation techniques, such as zero imputation (assuming no consumption of the item) or mean imputation, are commonly used to deal with item non-response in FFQs. However, single imputation methods make strong assumptions about the missing data mechanism and do not reflect the uncertainty created by the missing data. This can lead to incorrect inference about associations between diet and health outcomes. Although the use of multiple imputation methods in epidemiology has increased, these have seldom been used in the field of nutritional epidemiology to address missing data in FFQs. We discuss methods for dealing with item non-response in FFQs, highlighting the assumptions made under each approach. Researchers analysing FFQs should ensure that missing data are handled appropriately and clearly report how missing data were treated in analyses. Simulation studies are required to enable systematic evaluation of the utility of various methods for handling item non-response in FFQs under different assumptions about the missing data mechanism.
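
    The two single imputation techniques named in this abstract, zero imputation and mean imputation, can be sketched in Python. The item responses are hypothetical and the function names are assumptions; missing responses are represented as `None`.

```python
def zero_impute(responses):
    """Treat every missing response as no consumption of the item."""
    return [0.0 if r is None else r for r in responses]

def mean_impute(responses):
    """Replace every missing response with the mean of the observed ones."""
    observed = [r for r in responses if r is not None]
    mean = sum(observed) / len(observed)
    return [mean if r is None else r for r in responses]

# Hypothetical weekly servings of one FFQ item across five participants.
item = [2.0, None, 4.0, 0.0, None]

zero_filled = zero_impute(item)
mean_filled = mean_impute(item)
```

    Both approaches fill each gap with one fixed value, so neither carries any of the uncertainty about the missing responses into later analyses, which is the limitation the abstract raises.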

  12. Genome-wide and fine-resolution association analysis of malaria in West Africa.

    PubMed

    Jallow, Muminatou; Teo, Yik Ying; Small, Kerrin S; Rockett, Kirk A; Deloukas, Panos; Clark, Taane G; Kivinen, Katja; Bojang, Kalifa A; Conway, David J; Pinder, Margaret; Sirugo, Giorgio; Sisay-Joof, Fatou; Usen, Stanley; Auburn, Sarah; Bumpstead, Suzannah J; Campino, Susana; Coffey, Alison; Dunham, Andrew; Fry, Andrew E; Green, Angela; Gwilliam, Rhian; Hunt, Sarah E; Inouye, Michael; Jeffreys, Anna E; Mendy, Alieu; Palotie, Aarno; Potter, Simon; Ragoussis, Jiannis; Rogers, Jane; Rowlands, Kate; Somaskantharajah, Elilan; Whittaker, Pamela; Widden, Claire; Donnelly, Peter; Howie, Bryan; Marchini, Jonathan; Morris, Andrew; SanJoaquin, Miguel; Achidi, Eric Akum; Agbenyega, Tsiri; Allen, Angela; Amodu, Olukemi; Corran, Patrick; Djimde, Abdoulaye; Dolo, Amagana; Doumbo, Ogobara K; Drakeley, Chris; Dunstan, Sarah; Evans, Jennifer; Farrar, Jeremy; Fernando, Deepika; Hien, Tran Tinh; Horstmann, Rolf D; Ibrahim, Muntaser; Karunaweera, Nadira; Kokwaro, Gilbert; Koram, Kwadwo A; Lemnge, Martha; Makani, Julie; Marsh, Kevin; Michon, Pascal; Modiano, David; Molyneux, Malcolm E; Mueller, Ivo; Parker, Michael; Peshu, Norbert; Plowe, Christopher V; Puijalon, Odile; Reeder, John; Reyburn, Hugh; Riley, Eleanor M; Sakuntabhai, Anavaj; Singhasivanon, Pratap; Sirima, Sodiomon; Tall, Adama; Taylor, Terrie E; Thera, Mahamadou; Troye-Blomberg, Marita; Williams, Thomas N; Wilson, Michael; Kwiatkowski, Dominic P

    2009-06-01

    We report a genome-wide association (GWA) study of severe malaria in The Gambia. The initial GWA scan included 2,500 children genotyped on the Affymetrix 500K GeneChip, and a replication study included 3,400 children. We used this to examine the performance of GWA methods in Africa. We found considerable population stratification, and also that signals of association at known malaria resistance loci were greatly attenuated owing to weak linkage disequilibrium (LD). To investigate possible solutions to the problem of low LD, we focused on the HbS locus, sequencing this region of the genome in 62 Gambian individuals and then using these data to conduct multipoint imputation in the GWA samples. This increased the signal of association, from P = 4 × 10⁻⁷ to P = 4 × 10⁻¹⁴, with the peak of the signal located precisely at the HbS causal variant. Our findings provide proof of principle that fine-resolution multipoint imputation, based on population-specific sequencing data, can substantially boost authentic GWA signals and enable fine mapping of causal variants in African populations.

  13. Further Confirmation of Germline Glioma Risk Variant rs78378222 in TP53 and Its Implication in Tumor Tissues via Integrative Analysis of TCGA Data

    PubMed Central

    Wang, Zhaoming; Rajaraman, Preetha; Melin, Beatrice S.; Chung, Charles C.; Zhang, Weijia; McKean-Cowdin, Roberta; Michaud, Dominique; Yeager, Meredith; Ahlbom, Anders; Albanes, Demetrius; Andersson, Ulrika; Beane Freeman, Laura E.; Buring, Julie E.; Butler, Mary Ann; Carreón, Tania; Feychting, Maria; Gapstur, Susan M.; Gaziano, J. Michael; Giles, Graham G.; Hallmans, Goran; Henriksson, Roger; Hoffman-Bolton, Judith; Inskip, Peter D.; Kitahara, Cari M.; Le Marchand, Loic; Linet, Martha S.; Li, Shengchao; Peters, Ulrike; Purdue, Mark P.; Rothman, Nathaniel; Ruder, Avima M.; Sesso, Howard D.; Severi, Gianluca; Stampfer, Meir; Stevens, Victoria L.; Visvanathan, Kala; Wang, Sophia S.; White, Emily; Zeleniuch-Jacquotte, Anne; Hoover, Robert; Fraumeni, Joseph F.; Chatterjee, Nilanjan; Hartge, Patricia; Chanock, Stephen J.

    2016-01-01

    We confirmed a strong association of rs78378222:A>C (per allele odds ratio [OR] = 3.14; P = 6.48 × 10⁻¹¹), a rare germline single-nucleotide polymorphism (SNP) in TP53, via imputation of a genome-wide association study of glioma (1,856 cases and 4,955 controls). We subsequently performed integrative analyses on the Cancer Genome Atlas (TCGA) data for GBM (glioblastoma multiforme) and LUAD (lung adenocarcinoma). Based on SNP data, we imputed genotypes for rs78378222 and selected individuals carrying the rare risk allele (C). Using RNA sequencing data, we observed aberrant transcripts ~3 kb longer than normal in those individuals. Using exome sequencing data, we further showed that loss of the haplotype carrying the common protective allele (A) occurred somatically in GBM but not in LUAD. Our bioinformatic analysis suggests that the rare risk allele (C) disrupts mRNA termination, and that an allelic loss of a genomic region harboring the common protective allele (A) occurs during tumor initiation or progression for glioma. PMID:25907361

  14. Rare variants of large effect in BRCA2 and CHEK2 affect risk of lung cancer

    PubMed Central

    Wang, Yufei; McKay, James D.; Rafnar, Thorunn; Wang, Zhaoming; Timofeeva, Maria; Broderick, Peter; Zong, Xuchen; Laplana, Marina; Wei, Yongyue; Han, Younghun; Lloyd, Amy; Delahaye-Sourdeix, Manon; Chubb, Daniel; Gaborieau, Valerie; Wheeler, William; Chatterjee, Nilanjan; Thorleifsson, Gudmar; Sulem, Patrick; Liu, Geoffrey; Kaaks, Rudolf; Henrion, Marc; Kinnersley, Ben; Vallée, Maxime; LeCalvez-Kelm, Florence; Stevens, Victoria L.; Gapstur, Susan M.; Chen, Wei V.; Zaridze, David; Szeszenia-Dabrowska, Neonilia; Lissowska, Jolanta; Rudnai, Peter; Fabianova, Eleonora; Mates, Dana; Bencko, Vladimir; Foretova, Lenka; Janout, Vladimir; Krokan, Hans E.; Gabrielsen, Maiken Elvestad; Skorpen, Frank; Vatten, Lars; Njølstad, Inger; Chen, Chu; Goodman, Gary; Benhamou, Simone; Vooder, Tonu; Valk, Kristjan; Nelis, Mari; Metspalu, Andres; Lener, Marcin; Lubiński, Jan; Johansson, Mattias; Vineis, Paolo; Agudo, Antonio; Clavel-Chapelon, Francoise; Bueno-de-Mesquita, H.Bas; Trichopoulos, Dimitrios; Khaw, Kay-Tee; Johansson, Mikael; Weiderpass, Elisabete; Tjønneland, Anne; Riboli, Elio; Lathrop, Mark; Scelo, Ghislaine; Albanes, Demetrius; Caporaso, Neil E.; Ye, Yuanqing; Gu, Jian; Wu, Xifeng; Spitz, Margaret R.; Dienemann, Hendrik; Rosenberger, Albert; Su, Li; Matakidou, Athena; Eisen, Timothy; Stefansson, Kari; Risch, Angela; Chanock, Stephen J.; Christiani, David C.; Hung, Rayjean J.; Brennan, Paul; Landi, Maria Teresa; Houlston, Richard S.; Amos, Christopher I.

    2014-01-01

    We conducted imputation to the 1000 Genomes Project of four genome-wide association studies of lung cancer in populations of European ancestry (11,348 cases and 15,861 controls) and genotyped an additional 10,246 cases and 38,295 controls for follow-up. We identified large-effect genome-wide associations for squamous lung cancer with the rare variants of BRCA2-K3326X (rs11571833; odds ratio [OR]=2.47, P=4.74×10⁻²⁰) and of CHEK2-I157T (rs17879961; OR=0.38, P=1.27×10⁻¹³). We also showed an association between common variation at 3q28 (TP63; rs13314271; OR=1.13, P=7.22×10⁻¹⁰) and lung adenocarcinoma previously only reported in Asians. These findings provide further evidence for inherited genetic susceptibility to lung cancer and its biological basis. Additionally, our analysis demonstrates that imputation can identify rare disease-causing variants having substantive effects on cancer risk from pre-existing GWAS data. PMID:24880342

  15. The population genomics of archaeological transition in west Iberia: Investigation of ancient substructure using imputation and haplotype-based methods

    PubMed Central

    Martiniano, Rui; McLaughlin, Russell; Silva, Nuno M.; Manco, Licinio; Pereira, Tania; Coelho, Maria J.; Serra, Miguel; Burger, Joachim; Parreira, Rui; Moran, Elena; Valera, Antonio C.; Silva, Ana M.

    2017-01-01

    We analyse new genomic data (0.05–2.95x) from 14 ancient individuals from Portugal distributed from the Middle Neolithic (4200–3500 BC) to the Middle Bronze Age (1740–1430 BC) and impute genomewide diploid genotypes in these together with published ancient Eurasians. While discontinuity is evident in the transition to agriculture across the region, sensitive haplotype-based analyses suggest a significant degree of local hunter-gatherer contribution to later Iberian Neolithic populations. A more subtle genetic influx is also apparent in the Bronze Age, detectable from analyses including haplotype sharing with both ancient and modern genomes, D-statistics and Y-chromosome lineages. However, the limited nature of this introgression contrasts with the major Steppe migration turnovers within third Millennium northern Europe and echoes the survival of non-Indo-European language in Iberia. Changes in genomic estimates of individual height across Europe are also associated with these major cultural transitions, and ancestral components continue to correlate with modern differences in stature. PMID:28749934

  16. A Genome-Wide Association Study Identifies A New Ovarian Cancer Susceptibility Locus On 9p22.2

    PubMed Central

    Song, Honglin; Ramus, Susan J.; Tyrer, Jonathan; Bolton, Kelly L.; Gentry-Maharaj, Aleksandra; Wozniak, Eva; Anton-Culver, Hoda; Chang-Claude, Jenny; Cramer, Daniel W.; DiCioccio, Richard; Dörk, Thilo; Goode, Ellen L.; Goodman, Marc T; Schildkraut, Joellen M; Sellers, Thomas; Baglietto, Laura; Beckmann, Matthias W.; Beesley, Jonathan; Blaakaer, Jan; Carney, Michael E; Chanock, Stephen; Chen, Zhihua; Cunningham, Julie M.; Dicks, Ed; Doherty, Jennifer A.; Dürst, Matthias; Ekici, Arif B.; Fenstermacher, David; Fridley, Brooke L.; Giles, Graham; Gore, Martin E.; De Vivo, Immaculata; Hillemanns, Peter; Hogdall, Claus; Hogdall, Estrid; Iversen, Edwin S; Jacobs, Ian J; Jakubowska, Anna; Li, Dong; Lissowska, Jolanta; Lubiński, Jan; Lurie, Galina; McGuire, Valerie; McLaughlin, John; Mędrek, Krzysztof; Moorman, Patricia G.; Moysich, Kirsten; Narod, Steven; Phelan, Catherine; Pye, Carole; Risch, Harvey; Runnebaum, Ingo B; Severi, Gianluca; Southey, Melissa; Stram, Daniel O.; Thiel, Falk C.; Terry, Kathryn L.; Tsai, Ya-Yu; Tworoger, Shelley S.; Van Den Berg, David J.; Vierkant, Robert A.; Wang-Gohrke, Shan; Webb, Penelope M.; Wilkens, Lynne R.; Wu, Anna H; Yang, Hannah; Brewster, Wendy; Ziogas, Argyrios; Houlston, Richard; Tomlinson, Ian; Whittemore, Alice S; Rossing, Mary Anne; Ponder, Bruce A.J.; Pearce, Celeste Leigh; Ness, Roberta B.; Menon, Usha; Kjaer, Susanne Krüger; Gronwald, Jacek; Garcia-Closas, Montserrat; Fasching, Peter A.; Easton, Douglas F; Chenevix-Trench, Georgia; Berchuck, Andrew; Pharoah, Paul D.P.; Gayther, Simon A.

    2009-01-01

    Epithelial ovarian cancer has a major heritable component, but the known susceptibility genes explain less than half the excess familial risk. We performed a genome wide association study (GWAS) to identify common ovarian cancer susceptibility alleles. We evaluated 507,094 SNPs genotyped in 1,817 cases and 2,353 controls from the UK and ~2 million imputed SNPs. We genotyped the 22,790 top ranked SNPs in 4,274 cases and 4,809 controls of European ancestry from Europe, USA and Australia. We identified 12 SNPs at 9p22 associated with disease risk (P < 10⁻⁸). The most significant SNP (rs3814113; P = 2.5 × 10⁻¹⁷) was genotyped in a further 2,670 ovarian cancer cases and 4,668 controls, confirming its association (combined data odds ratio = 0.82, 95% CI 0.79–0.86, P-trend = 5.1 × 10⁻¹⁹). The association differs by histological subtype, being strongest for serous ovarian cancers (OR 0.77, 95% CI 0.73–0.81, P-trend = 4.1 × 10⁻²¹). PMID:19648919

  17. Evidence that breast cancer risk at the 2q35 locus is mediated through IGFBP5 regulation.

    PubMed

    Ghoussaini, Maya; Edwards, Stacey L; Michailidou, Kyriaki; Nord, Silje; Cowper-Sal Lari, Richard; Desai, Kinjal; Kar, Siddhartha; Hillman, Kristine M; Kaufmann, Susanne; Glubb, Dylan M; Beesley, Jonathan; Dennis, Joe; Bolla, Manjeet K; Wang, Qin; Dicks, Ed; Guo, Qi; Schmidt, Marjanka K; Shah, Mitul; Luben, Robert; Brown, Judith; Czene, Kamila; Darabi, Hatef; Eriksson, Mikael; Klevebring, Daniel; Bojesen, Stig E; Nordestgaard, Børge G; Nielsen, Sune F; Flyger, Henrik; Lambrechts, Diether; Thienpont, Bernard; Neven, Patrick; Wildiers, Hans; Broeks, Annegien; Van't Veer, Laura J; Th Rutgers, Emiel J; Couch, Fergus J; Olson, Janet E; Hallberg, Emily; Vachon, Celine; Chang-Claude, Jenny; Rudolph, Anja; Seibold, Petra; Flesch-Janys, Dieter; Peto, Julian; Dos-Santos-Silva, Isabel; Gibson, Lorna; Nevanlinna, Heli; Muranen, Taru A; Aittomäki, Kristiina; Blomqvist, Carl; Hall, Per; Li, Jingmei; Liu, Jianjun; Humphreys, Keith; Kang, Daehee; Choi, Ji-Yeob; Park, Sue K; Noh, Dong-Young; Matsuo, Keitaro; Ito, Hidemi; Iwata, Hiroji; Yatabe, Yasushi; Guénel, Pascal; Truong, Thérèse; Menegaux, Florence; Sanchez, Marie; Burwinkel, Barbara; Marme, Frederik; Schneeweiss, Andreas; Sohn, Christof; Wu, Anna H; Tseng, Chiu-Chen; Van Den Berg, David; Stram, Daniel O; Benitez, Javier; Zamora, M Pilar; Perez, Jose Ignacio Arias; Menéndez, Primitiva; Shu, Xiao-Ou; Lu, Wei; Gao, Yu-Tang; Cai, Qiuyin; Cox, Angela; Cross, Simon S; Reed, Malcolm W R; Andrulis, Irene L; Knight, Julia A; Glendon, Gord; Tchatchou, Sandrine; Sawyer, Elinor J; Tomlinson, Ian; Kerin, Michael J; Miller, Nicola; Haiman, Christopher A; Henderson, Brian E; Schumacher, Fredrick; Le Marchand, Loic; Lindblom, Annika; Margolin, Sara; Teo, Soo Hwang; Yip, Cheng Har; Lee, Daphne S C; Wong, Tien Y; Hooning, Maartje J; Martens, John W M; Collée, J Margriet; van Deurzen, Carolien H M; Hopper, John L; Southey, Melissa C; Tsimiklis, Helen; Kapuscinski, Miroslav K; Shen, Chen-Yang; Wu, Pei-Ei; Yu, Jyh-Cherng; Chen, Shou-Tung; 
Alnæs, Grethe Grenaker; Borresen-Dale, Anne-Lise; Giles, Graham G; Milne, Roger L; McLean, Catriona; Muir, Kenneth; Lophatananon, Artitaya; Stewart-Brown, Sarah; Siriwanarangsan, Pornthep; Hartman, Mikael; Miao, Hui; Buhari, Shaik Ahmad Bin Syed; Teo, Yik Ying; Fasching, Peter A; Haeberle, Lothar; Ekici, Arif B; Beckmann, Matthias W; Brenner, Hermann; Dieffenbach, Aida Karina; Arndt, Volker; Stegmaier, Christa; Swerdlow, Anthony; Ashworth, Alan; Orr, Nick; Schoemaker, Minouk J; García-Closas, Montserrat; Figueroa, Jonine; Chanock, Stephen J; Lissowska, Jolanta; Simard, Jacques; Goldberg, Mark S; Labrèche, France; Dumont, Martine; Winqvist, Robert; Pylkäs, Katri; Jukkola-Vuorinen, Arja; Brauch, Hiltrud; Brüning, Thomas; Koto, Yon-Dschun; Radice, Paolo; Peterlongo, Paolo; Bonanni, Bernardo; Volorio, Sara; Dörk, Thilo; Bogdanova, Natalia V; Helbig, Sonja; Mannermaa, Arto; Kataja, Vesa; Kosma, Veli-Matti; Hartikainen, Jaana M; Devilee, Peter; Tollenaar, Robert A E M; Seynaeve, Caroline; Van Asperen, Christi J; Jakubowska, Anna; Lubinski, Jan; Jaworska-Bieniek, Katarzyna; Durda, Katarzyna; Slager, Susan; Toland, Amanda E; Ambrosone, Christine B; Yannoukakos, Drakoulis; Sangrajrang, Suleeporn; Gaborieau, Valerie; Brennan, Paul; McKay, James; Hamann, Ute; Torres, Diana; Zheng, Wei; Long, Jirong; Anton-Culver, Hoda; Neuhausen, Susan L; Luccarini, Craig; Baynes, Caroline; Ahmed, Shahana; Maranian, Mel; Healey, Catherine S; González-Neira, Anna; Pita, Guillermo; Alonso, M Rosario; Alvarez, Nuria; Herrero, Daniel; Tessier, Daniel C; Vincent, Daniel; Bacot, Francois; de Santiago, Ines; Carroll, Jason; Caldas, Carlos; Brown, Melissa A; Lupien, Mathieu; Kristensen, Vessela N; Pharoah, Paul D P; Chenevix-Trench, Georgia; French, Juliet D; Easton, Douglas F; Dunning, Alison M

    2014-09-23

    GWAS have identified a breast cancer susceptibility locus on 2q35. Here we report the fine mapping of this locus using data from 101,943 subjects from 50 case-control studies. We genotype 276 SNPs using the 'iCOGS' genotyping array and impute genotypes for a further 1,284 using 1000 Genomes Project data. All but two, strongly correlated SNPs (rs4442975 G/T and rs6721996 G/A) are excluded as candidate causal variants at odds against >100:1. The best functional candidate, rs4442975, is associated with oestrogen receptor positive (ER+) disease with an odds ratio (OR) in Europeans of 0.85 (95% confidence interval=0.84-0.87; P=1.7 × 10(-43)) per t-allele. This SNP flanks a transcriptional enhancer that physically interacts with the promoter of IGFBP5 (encoding insulin-like growth factor-binding protein 5) and displays allele-specific gene expression, FOXA1 binding and chromatin looping. Evidence suggests that the g-allele confers increased breast cancer susceptibility through relative downregulation of IGFBP5, a gene with known roles in breast cell biology.

  18. Meta-analysis of 74,046 individuals identifies 11 new susceptibility loci for Alzheimer’s disease

    PubMed Central

    Lambert, Jean-Charles; Ibrahim-Verbaas, Carla A; Harold, Denise; Naj, Adam C; Sims, Rebecca; Bellenguez, Céline; Jun, Gyungah; DeStefano, Anita L; Bis, Joshua C; Beecham, Gary W; Grenier-Boley, Benjamin; Russo, Giancarlo; Thornton-Wells, Tricia A; Jones, Nicola; Smith, Albert V; Chouraki, Vincent; Thomas, Charlene; Ikram, M Arfan; Zelenika, Diana; Vardarajan, Badri N; Kamatani, Yoichiro; Lin, Chiao-Feng; Gerrish, Amy; Schmidt, Helena; Kunkle, Brian; Dunstan, Melanie L; Ruiz, Agustin; Bihoreau, Marie-Thérèse; Choi, Seung-Hoan; Reitz, Christiane; Pasquier, Florence; Hollingworth, Paul; Ramirez, Alfredo; Hanon, Olivier; Fitzpatrick, Annette L; Buxbaum, Joseph D; Campion, Dominique; Crane, Paul K; Baldwin, Clinton; Becker, Tim; Gudnason, Vilmundur; Cruchaga, Carlos; Craig, David; Amin, Najaf; Berr, Claudine; Lopez, Oscar L; De Jager, Philip L; Deramecourt, Vincent; Johnston, Janet A; Evans, Denis; Lovestone, Simon; Letenneur, Luc; Morón, Francisco J; Rubinsztein, David C; Eiriksdottir, Gudny; Sleegers, Kristel; Goate, Alison M; Fiévet, Nathalie; Huentelman, Matthew J; Gill, Michael; Brown, Kristelle; Kamboh, M Ilyas; Keller, Lina; Barberger-Gateau, Pascale; McGuinness, Bernadette; Larson, Eric B; Green, Robert; Myers, Amanda J; Dufouil, Carole; Todd, Stephen; Wallon, David; Love, Seth; Rogaeva, Ekaterina; Gallacher, John; St George-Hyslop, Peter; Clarimon, Jordi; Lleo, Alberto; Bayer, Anthony; Tsuang, Debby W; Yu, Lei; Tsolaki, Magda; Bossù, Paola; Spalletta, Gianfranco; Proitsi, Petroula; Collinge, John; Sorbi, Sandro; Sanchez-Garcia, Florentino; Fox, Nick C; Hardy, John; Deniz Naranjo, Maria Candida; Bosco, Paolo; Clarke, Robert; Brayne, Carol; Galimberti, Daniela; Mancuso, Michelangelo; Matthews, Fiona; Moebus, Susanne; Mecocci, Patrizia; Zompo, Maria Del; Maier, Wolfgang; Hampel, Harald; Pilotto, Alberto; Bullido, Maria; Panza, Francesco; Caffarra, Paolo; Nacmias, Benedetta; Gilbert, John R; Mayhaus, Manuel; Lannfelt, Lars; Hakonarson, Hakon; Pichler, Sabrina; 
Carrasquillo, Minerva M; Ingelsson, Martin; Beekly, Duane; Alvarez, Victoria; Zou, Fanggeng; Valladares, Otto; Younkin, Steven G; Coto, Eliecer; Hamilton-Nelson, Kara L; Gu, Wei; Razquin, Cristina; Pastor, Pau; Mateo, Ignacio; Owen, Michael J; Faber, Kelley M; Jonsson, Palmi V; Combarros, Onofre; O’Donovan, Michael C; Cantwell, Laura B; Soininen, Hilkka; Blacker, Deborah; Mead, Simon; Mosley, Thomas H; Bennett, David A; Harris, Tamara B; Fratiglioni, Laura; Holmes, Clive; de Bruijn, Renee F A G; Passmore, Peter; Montine, Thomas J; Bettens, Karolien; Rotter, Jerome I; Brice, Alexis; Morgan, Kevin; Foroud, Tatiana M; Kukull, Walter A; Hannequin, Didier; Powell, John F; Nalls, Michael A; Ritchie, Karen; Lunetta, Kathryn L; Kauwe, John S K; Boerwinkle, Eric; Riemenschneider, Matthias; Boada, Mercè; Hiltunen, Mikko; Martin, Eden R; Schmidt, Reinhold; Rujescu, Dan; Wang, Li-san; Dartigues, Jean-François; Mayeux, Richard; Tzourio, Christophe; Hofman, Albert; Nöthen, Markus M; Graff, Caroline; Psaty, Bruce M; Jones, Lesley; Haines, Jonathan L; Holmans, Peter A; Lathrop, Mark; Pericak-Vance, Margaret A; Launer, Lenore J; Farrer, Lindsay A; van Duijn, Cornelia M; Van Broeckhoven, Christine; Moskvina, Valentina; Seshadri, Sudha; Williams, Julie; Schellenberg, Gerard D; Amouyel, Philippe

    2013-01-01

    Eleven susceptibility loci for late-onset Alzheimer’s disease (LOAD) were identified by previous studies; however, a large portion of the genetic risk for this disease remains unexplained. We conducted a large, two-stage meta-analysis of genome-wide association studies (GWAS) in individuals of European ancestry. In stage 1, we used genotyped and imputed data (7,055,881 SNPs) to perform meta-analysis on 4 previously published GWAS data sets consisting of 17,008 Alzheimer’s disease cases and 37,154 controls. In stage 2, 11,632 SNPs were genotyped and tested for association in an independent set of 8,572 Alzheimer’s disease cases and 11,312 controls. In addition to the APOE locus (encoding apolipoprotein E), 19 loci reached genome-wide significance (P < 5 × 10(-8)) in the combined stage 1 and stage 2 analysis, of which 11 are newly associated with Alzheimer’s disease. PMID:24162737

  19. 1000 Genomes-based meta-analysis identifies 10 novel loci for kidney function

    PubMed Central

    Gorski, Mathias; van der Most, Peter J.; Teumer, Alexander; Chu, Audrey Y.; Li, Man; Mijatovic, Vladan; Nolte, Ilja M.; Cocca, Massimiliano; Taliun, Daniel; Gomez, Felicia; Li, Yong; Tayo, Bamidele; Tin, Adrienne; Feitosa, Mary F.; Aspelund, Thor; Attia, John; Biffar, Reiner; Bochud, Murielle; Boerwinkle, Eric; Borecki, Ingrid; Bottinger, Erwin P.; Chen, Ming-Huei; Chouraki, Vincent; Ciullo, Marina; Coresh, Josef; Cornelis, Marilyn C.; Curhan, Gary C.; d’Adamo, Adamo Pio; Dehghan, Abbas; Dengler, Laura; Ding, Jingzhong; Eiriksdottir, Gudny; Endlich, Karlhans; Enroth, Stefan; Esko, Tõnu; Franco, Oscar H.; Gasparini, Paolo; Gieger, Christian; Girotto, Giorgia; Gottesman, Omri; Gudnason, Vilmundur; Gyllensten, Ulf; Hancock, Stephen J.; Harris, Tamara B.; Helmer, Catherine; Höllerer, Simon; Hofer, Edith; Hofman, Albert; Holliday, Elizabeth G.; Homuth, Georg; Hu, Frank B.; Huth, Cornelia; Hutri-Kähönen, Nina; Hwang, Shih-Jen; Imboden, Medea; Johansson, Åsa; Kähönen, Mika; König, Wolfgang; Kramer, Holly; Krämer, Bernhard K.; Kumar, Ashish; Kutalik, Zoltan; Lambert, Jean-Charles; Launer, Lenore J.; Lehtimäki, Terho; de Borst, Martin; Navis, Gerjan; Swertz, Morris; Liu, Yongmei; Lohman, Kurt; Loos, Ruth J. F.; Lu, Yingchang; Lyytikäinen, Leo-Pekka; McEvoy, Mark A.; Meisinger, Christa; Meitinger, Thomas; Metspalu, Andres; Metzger, Marie; Mihailov, Evelin; Mitchell, Paul; Nauck, Matthias; Oldehinkel, Albertine J.; Olden, Matthias; WJH Penninx, Brenda; Pistis, Giorgio; Pramstaller, Peter P.; Probst-Hensch, Nicole; Raitakari, Olli T.; Rettig, Rainer; Ridker, Paul M.; Rivadeneira, Fernando; Robino, Antonietta; Rosas, Sylvia E.; Ruderfer, Douglas; Ruggiero, Daniela; Saba, Yasaman; Sala, Cinzia; Schmidt, Helena; Schmidt, Reinhold; Scott, Rodney J.; Sedaghat, Sanaz; Smith, Albert V.; Sorice, Rossella; Stengel, Benedicte; Stracke, Sylvia; Strauch, Konstantin; Toniolo, Daniela; Uitterlinden, Andre G.; Ulivi, Sheila; Viikari, Jorma S.; Völker, Uwe; Vollenweider, Peter; Völzke, Henry; Vuckovic, Dragana; Waldenberger, Melanie; Jin Wang, Jie; Yang, Qiong; Chasman, Daniel I.; Tromp, Gerard; Snieder, Harold; Heid, Iris M.; Fox, Caroline S.; Köttgen, Anna; Pattaro, Cristian; Böger, Carsten A.; Fuchsberger, Christian

    2017-01-01

    HapMap imputed genome-wide association studies (GWAS) have revealed >50 loci at which common variants with minor allele frequency >5% are associated with kidney function. GWAS using more complete reference sets for imputation, such as those from The 1000 Genomes project, promise to identify novel loci that have been missed by previous efforts. To investigate the value of such a more complete variant catalog, we conducted a GWAS meta-analysis of kidney function based on the estimated glomerular filtration rate (eGFR) in 110,517 European ancestry participants using 1000 Genomes imputed data. We identified 10 novel loci with p-value < 5 × 10(-8) previously missed by HapMap-based GWAS. Six of these loci (HOXD8, ARL15, PIK3R1, EYA4, ASTN2, and EPB41L3) are tagged by common SNPs unique to the 1000 Genomes reference panel. Using pathway analysis, we identified 39 significant (FDR < 0.05) genes and 127 significantly (FDR < 0.05) enriched gene sets, which were missed by our previous analyses. Among those, the 10 identified novel genes are part of pathways of kidney development, carbohydrate metabolism, cardiac septum development and glucose metabolism. These results highlight the utility of re-imputing from denser reference panels, until whole-genome sequencing becomes feasible in large samples. PMID:28452372

  20. 1000 Genomes-based meta-analysis identifies 10 novel loci for kidney function.

    PubMed

    Gorski, Mathias; van der Most, Peter J; Teumer, Alexander; Chu, Audrey Y; Li, Man; Mijatovic, Vladan; Nolte, Ilja M; Cocca, Massimiliano; Taliun, Daniel; Gomez, Felicia; Li, Yong; Tayo, Bamidele; Tin, Adrienne; Feitosa, Mary F; Aspelund, Thor; Attia, John; Biffar, Reiner; Bochud, Murielle; Boerwinkle, Eric; Borecki, Ingrid; Bottinger, Erwin P; Chen, Ming-Huei; Chouraki, Vincent; Ciullo, Marina; Coresh, Josef; Cornelis, Marilyn C; Curhan, Gary C; d'Adamo, Adamo Pio; Dehghan, Abbas; Dengler, Laura; Ding, Jingzhong; Eiriksdottir, Gudny; Endlich, Karlhans; Enroth, Stefan; Esko, Tõnu; Franco, Oscar H; Gasparini, Paolo; Gieger, Christian; Girotto, Giorgia; Gottesman, Omri; Gudnason, Vilmundur; Gyllensten, Ulf; Hancock, Stephen J; Harris, Tamara B; Helmer, Catherine; Höllerer, Simon; Hofer, Edith; Hofman, Albert; Holliday, Elizabeth G; Homuth, Georg; Hu, Frank B; Huth, Cornelia; Hutri-Kähönen, Nina; Hwang, Shih-Jen; Imboden, Medea; Johansson, Åsa; Kähönen, Mika; König, Wolfgang; Kramer, Holly; Krämer, Bernhard K; Kumar, Ashish; Kutalik, Zoltan; Lambert, Jean-Charles; Launer, Lenore J; Lehtimäki, Terho; de Borst, Martin; Navis, Gerjan; Swertz, Morris; Liu, Yongmei; Lohman, Kurt; Loos, Ruth J F; Lu, Yingchang; Lyytikäinen, Leo-Pekka; McEvoy, Mark A; Meisinger, Christa; Meitinger, Thomas; Metspalu, Andres; Metzger, Marie; Mihailov, Evelin; Mitchell, Paul; Nauck, Matthias; Oldehinkel, Albertine J; Olden, Matthias; Wjh Penninx, Brenda; Pistis, Giorgio; Pramstaller, Peter P; Probst-Hensch, Nicole; Raitakari, Olli T; Rettig, Rainer; Ridker, Paul M; Rivadeneira, Fernando; Robino, Antonietta; Rosas, Sylvia E; Ruderfer, Douglas; Ruggiero, Daniela; Saba, Yasaman; Sala, Cinzia; Schmidt, Helena; Schmidt, Reinhold; Scott, Rodney J; Sedaghat, Sanaz; Smith, Albert V; Sorice, Rossella; Stengel, Benedicte; Stracke, Sylvia; Strauch, Konstantin; Toniolo, Daniela; Uitterlinden, Andre G; Ulivi, Sheila; Viikari, Jorma S; Völker, Uwe; Vollenweider, Peter; Völzke, Henry; Vuckovic, Dragana; 
Waldenberger, Melanie; Jin Wang, Jie; Yang, Qiong; Chasman, Daniel I; Tromp, Gerard; Snieder, Harold; Heid, Iris M; Fox, Caroline S; Köttgen, Anna; Pattaro, Cristian; Böger, Carsten A; Fuchsberger, Christian

    2017-04-28

    HapMap imputed genome-wide association studies (GWAS) have revealed >50 loci at which common variants with minor allele frequency >5% are associated with kidney function. GWAS using more complete reference sets for imputation, such as those from The 1000 Genomes project, promise to identify novel loci that have been missed by previous efforts. To investigate the value of such a more complete variant catalog, we conducted a GWAS meta-analysis of kidney function based on the estimated glomerular filtration rate (eGFR) in 110,517 European ancestry participants using 1000 Genomes imputed data. We identified 10 novel loci with p-value < 5 × 10(-8) previously missed by HapMap-based GWAS. Six of these loci (HOXD8, ARL15, PIK3R1, EYA4, ASTN2, and EPB41L3) are tagged by common SNPs unique to the 1000 Genomes reference panel. Using pathway analysis, we identified 39 significant (FDR < 0.05) genes and 127 significantly (FDR < 0.05) enriched gene sets, which were missed by our previous analyses. Among those, the 10 identified novel genes are part of pathways of kidney development, carbohydrate metabolism, cardiac septum development and glucose metabolism. These results highlight the utility of re-imputing from denser reference panels, until whole-genome sequencing becomes feasible in large samples.

  1. Multiple imputation for handling missing outcome data when estimating the relative risk.

    PubMed

    Sullivan, Thomas R; Lee, Katherine J; Ryan, Philip; Salter, Amy B

    2017-09-06

    Multiple imputation is a popular approach to handling missing data in medical research, yet little is known about its applicability for estimating the relative risk. Standard methods for imputing incomplete binary outcomes involve logistic regression or an assumption of multivariate normality, whereas relative risks are typically estimated using log binomial models. It is unclear whether misspecification of the imputation model in this setting could lead to biased parameter estimates. Using simulated data, we evaluated the performance of multiple imputation for handling missing data prior to estimating adjusted relative risks from a correctly specified multivariable log binomial model. We considered an arbitrary pattern of missing data in both outcome and exposure variables, with missing data induced under missing at random mechanisms. Focusing on standard model-based methods of multiple imputation, missing data were imputed using multivariate normal imputation or fully conditional specification with a logistic imputation model for the outcome. Multivariate normal imputation performed poorly in the simulation study, consistently producing estimates of the relative risk that were biased towards the null. Despite outperforming multivariate normal imputation, fully conditional specification also produced somewhat biased estimates, with greater bias observed for higher outcome prevalences and larger relative risks. Deleting imputed outcomes from analysis datasets did not improve the performance of fully conditional specification. Both multivariate normal imputation and fully conditional specification produced biased estimates of the relative risk, presumably since both use a misspecified imputation model. Based on simulation results, we recommend researchers use fully conditional specification rather than multivariate normal imputation and retain imputed outcomes in the analysis when estimating relative risks. 
However, fully conditional specification is not without its shortcomings, and so further research is needed to identify optimal approaches for relative risk estimation within the multiple imputation framework.

  2. Genome-wide association study for feed efficiency and growth traits in U.S. beef cattle.

    PubMed

    Seabury, Christopher M; Oldeschulte, David L; Saatchi, Mahdi; Beever, Jonathan E; Decker, Jared E; Halley, Yvette A; Bhattarai, Eric K; Molaei, Maral; Freetly, Harvey C; Hansen, Stephanie L; Yampara-Iquise, Helen; Johnson, Kristen A; Kerley, Monty S; Kim, JaeWoo; Loy, Daniel D; Marques, Elisa; Neibergs, Holly L; Schnabel, Robert D; Shike, Daniel W; Spangler, Matthew L; Weaber, Robert L; Garrick, Dorian J; Taylor, Jeremy F

    2017-05-18

    Single nucleotide polymorphism (SNP) arrays for domestic cattle have catalyzed the identification of genetic markers associated with complex traits for inclusion in modern breeding and selection programs. Using actual and imputed Illumina 778K genotypes for 3887 U.S. beef cattle from 3 populations (Angus, Hereford, SimAngus), we performed genome-wide association analyses for feed efficiency and growth traits including average daily gain (ADG), dry matter intake (DMI), mid-test metabolic weight (MMWT), and residual feed intake (RFI), with marker-based heritability estimates produced for all traits and populations. Moderate and/or large-effect QTL were detected for all traits in all populations, as jointly defined by the estimated proportion of variance explained (PVE) by marker effects (PVE ≥ 1.0%) and a nominal P-value threshold (P ≤ 5e-05). Lead SNPs with PVE ≥ 2.0% were considered putative evidence of large-effect QTL (n = 52), whereas those with PVE ≥ 1.0% but < 2.0% were considered putative evidence for moderate-effect QTL (n = 35). Identical or proximal lead SNPs associated with ADG, DMI, MMWT, and RFI collectively supported the potential for either pleiotropic QTL, or independent but proximal causal mutations for multiple traits within and between the analyzed populations. Marker-based heritability estimates for all investigated traits ranged from 0.18 to 0.60 using 778K genotypes, or from 0.17 to 0.57 using 50K genotypes (reduced from Illumina 778K HD to Illumina Bovine SNP50). An investigation to determine if QTL detected by 778K analysis could also be detected using 50K genotypes produced variable results, suggesting that 50K analyses were generally insufficient for QTL detection in these populations, and that relevant breeding or selection programs should be based on higher density analyses (imputed or directly ascertained). 
Fourteen moderate to large-effect QTL regions which ranged from being physically proximal (lead SNPs ≤ 3Mb) to fully overlapping for RFI, DMI, ADG, and MMWT were detected within and between populations, and included evidence for pleiotropy, proximal but independent causal mutations, and multi-breed QTL. Bovine positional candidate genes for these traits were functionally conserved across vertebrate species.

  3. Seeing the forest for the trees: utilizing modified random forests imputation of forest plot data for landscape-level analyses

    Treesearch

    Karin L. Riley; Isaac C. Grenfell; Mark A. Finney

    2015-01-01

    Mapping the number, size, and species of trees in forests across the western United States has utility for a number of research endeavors, ranging from estimation of terrestrial carbon resources to tree mortality following wildfires. For landscape fire and forest simulations that use the Forest Vegetation Simulator (FVS), a tree-level dataset, or “tree list”, is a...

  4. Applied Missing Data Analysis. Methodology in the Social Sciences Series

    ERIC Educational Resources Information Center

    Enders, Craig K.

    2010-01-01

    Walking readers step by step through complex concepts, this book translates missing data techniques into something that applied researchers and graduate students can understand and utilize in their own research. Enders explains the rationale and procedural details for maximum likelihood estimation, Bayesian estimation, multiple imputation, and…

  5. Reuse of imputed data in microarray analysis increases imputation efficiency

    PubMed Central

    Kim, Ki-Yeol; Kim, Byoung-Jin; Yi, Gwan-Su

    2004-01-01

    Background: The imputation of missing values is necessary for the efficient use of DNA microarray data, because many clustering algorithms and some statistical analysis require a complete data set. A few imputation methods for DNA microarray data have been introduced, but the efficiency of the methods was low and the validity of imputed values in these methods had not been fully checked. Results: We developed a new cluster-based imputation method called sequential K-nearest neighbor (SKNN) method. This imputes the missing values sequentially from the gene having least missing values, and uses the imputed values for the later imputation. Although it uses the imputed values, the efficiency of this new method is greatly improved in its accuracy and computational complexity over the conventional KNN-based method and other methods based on maximum likelihood estimation. The performance of SKNN was in particular higher than other imputation methods for the data with high missing rates and large number of experiments. Application of Expectation Maximization (EM) to the SKNN method improved the accuracy, but increased computational time proportional to the number of iterations. The Multiple Imputation (MI) method, which is well known but not applied previously to microarray data, showed a similarly high accuracy as the SKNN method, with slightly higher dependency on the types of data sets. Conclusions: Sequential reuse of imputed data in KNN-based imputation greatly increases the efficiency of imputation. The SKNN method should be practically useful to save the data of some microarray experiments which have high amounts of missing entries. The SKNN method generates reliable imputed values which can be used for further cluster-based analysis of microarray data. PMID:15504240
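    The sequential reuse described in this record can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function name `sknn_impute`, the plain Euclidean distance over observed columns, the unweighted neighbour mean, and the default `k` are all assumptions made for illustration. The only features taken from the abstract are that rows (genes) are processed in order of increasing missingness and that each imputed row joins the pool of candidate neighbours for later rows.

```python
import math


def sknn_impute(rows, k=2):
    """Sequential KNN imputation sketch; missing values are None.

    Hypothetical helper, not the published SKNN code.
    """
    data = [list(r) for r in rows]  # work on a copy, leave the input untouched

    # Rows without missing values form the initial neighbour pool.
    complete = [i for i, r in enumerate(data) if None not in r]
    if not complete:
        raise ValueError("need at least one complete row to start")

    # Process incomplete rows starting with the fewest missing values.
    incomplete = sorted(
        (i for i, r in enumerate(data) if None in r),
        key=lambda i: sum(v is None for v in data[i]),
    )

    for i in incomplete:
        row = data[i]
        observed = [j for j, v in enumerate(row) if v is not None]

        def dist(c):
            # Euclidean distance using only the columns observed in this row.
            return math.sqrt(sum((row[j] - data[c][j]) ** 2 for j in observed))

        neighbours = sorted(complete, key=dist)[:k]
        for j, v in enumerate(row):
            if v is None:
                # Assumption: unweighted mean of the neighbours' values.
                row[j] = sum(data[c][j] for c in neighbours) / len(neighbours)

        # Sequential reuse: the filled row becomes a candidate neighbour.
        complete.append(i)

    return data


example = [
    [1.0, 2.0, 3.0],
    [1.0, 2.0, None],
    [10.0, 20.0, 30.0],
    [None, None, 30.0],
]
filled = sknn_impute(example, k=1)
```

    In this toy input the row with one missing value is filled first and is then available as a neighbour when the row with two missing values is processed.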

  6. Genome-wide association study with 1000 genomes imputation identifies signals for nine sex hormone-related phenotypes.

    PubMed

    Ruth, Katherine S; Campbell, Purdey J; Chew, Shelby; Lim, Ee Mun; Hadlow, Narelle; Stuckey, Bronwyn G A; Brown, Suzanne J; Feenstra, Bjarke; Joseph, John; Surdulescu, Gabriela L; Zheng, Hou Feng; Richards, J Brent; Murray, Anna; Spector, Tim D; Wilson, Scott G; Perry, John R B

    2016-02-01

    Genetic factors contribute strongly to sex hormone levels, yet knowledge of the regulatory mechanisms remains incomplete. Genome-wide association studies (GWAS) have identified only a small number of loci associated with sex hormone levels, with several reproductive hormones yet to be assessed. The aim of the study was to identify novel genetic variants contributing to the regulation of sex hormones. We performed GWAS using genotypes imputed from the 1000 Genomes reference panel. The study used genotype and phenotype data from a UK twin register. We included 2913 individuals (up to 294 males) from the Twins UK study, excluding individuals receiving hormone treatment. Phenotypes were standardised for age, sex, BMI, stage of menstrual cycle and menopausal status. We tested 7,879,351 autosomal SNPs for association with levels of dehydroepiandrosterone sulphate (DHEAS), oestradiol, free androgen index (FAI), follicle-stimulating hormone (FSH), luteinizing hormone (LH), prolactin, progesterone, sex hormone-binding globulin and testosterone. Eight independent genetic variants reached genome-wide significance (P<5 × 10(-8)), with minor allele frequencies of 1.3-23.9%. Novel signals included variants for progesterone (P=7.68 × 10(-12)), oestradiol (P=1.63 × 10(-8)) and FAI (P=1.50 × 10(-8)). A genetic variant near the FSHB gene was identified which influenced both FSH (P=1.74 × 10(-8)) and LH (P=3.94 × 10(-9)) levels. A separate locus on chromosome 7 was associated with both DHEAS (P=1.82 × 10(-14)) and progesterone (P=6.09 × 10(-14)). This study highlights loci that are relevant to reproductive function and suggests overlap in the genetic basis of hormone regulation.

  7. Genome-wide association study of sporadic brain arteriovenous malformations.

    PubMed

    Weinsheimer, Shantel; Bendjilali, Nasrine; Nelson, Jeffrey; Guo, Diana E; Zaroff, Jonathan G; Sidney, Stephen; McCulloch, Charles E; Al-Shahi Salman, Rustam; Berg, Jonathan N; Koeleman, Bobby P C; Simon, Matthias; Bostroem, Azize; Fontanella, Marco; Sturiale, Carmelo L; Pola, Roberto; Puca, Alfredo; Lawton, Michael T; Young, William L; Pawlikowska, Ludmila; Klijn, Catharina J M; Kim, Helen

    2016-09-01

    The pathogenesis of sporadic brain arteriovenous malformations (BAVMs) remains unknown, but studies suggest a genetic component. We estimated the heritability of sporadic BAVM and performed a genome-wide association study (GWAS) to investigate association of common single nucleotide polymorphisms (SNPs) with risk of sporadic BAVM in the international, multicentre Genetics of Arteriovenous Malformation (GEN-AVM) consortium. The Caucasian discovery cohort included 515 BAVM cases and 1191 controls genotyped using Affymetrix genome-wide SNP arrays. Genotype data were imputed to 1000 Genomes Project data, and well-imputed SNPs (>0.01 minor allele frequency) were analysed for association with BAVM. 57 top BAVM-associated SNPs (51 SNPs with p<10(-05) or p<10(-04) in candidate pathway genes, and 6 candidate BAVM SNPs) were tested in a replication cohort including 608 BAVM cases and 744 controls. The estimated heritability of BAVM was 17.6% (SE 8.9%, age and sex-adjusted p=0.015). None of the SNPs were significantly associated with BAVM in the replication cohort after correction for multiple testing. 6 SNPs had a nominal p<0.1 in the replication cohort and map to introns in EGFEM1P, SP4 and CDKAL1 or near JAG1 and BNC2. Of the 6 candidate SNPs, 2 in ACVRL1 and MMP3 had a nominal p<0.05 in the replication cohort. We performed the first GWAS of sporadic BAVM in the largest BAVM cohort assembled to date. No GWAS SNPs were replicated, suggesting that common SNPs do not contribute strongly to BAVM susceptibility. However, heritability estimates suggest a modest but significant genetic contribution.

  8. Genomic prediction using preselected DNA variants from a GWAS with whole-genome sequence data in Holstein-Friesian cattle.

    PubMed

    Veerkamp, Roel F; Bouwman, Aniek C; Schrooten, Chris; Calus, Mario P L

    2016-12-01

    Whole-genome sequence data is expected to capture genetic variation more completely than common genotyping panels. Our objective was to compare the proportion of variance explained and the accuracy of genomic prediction by using imputed sequence data or preselected SNPs from a genome-wide association study (GWAS) with imputed whole-genome sequence data. Phenotypes were available for 5503 Holstein-Friesian bulls. Genotypes were imputed up to whole-genome sequence (13,789,029 segregating DNA variants) by using run 4 of the 1000 bull genomes project. The program GCTA was used to perform GWAS for protein yield (PY), somatic cell score (SCS) and interval from first to last insemination (IFL). From the GWAS, subsets of variants were selected and genomic relationship matrices (GRM) were used to estimate the variance explained in 2087 validation animals and to evaluate the genomic prediction ability. Finally, two GRM were fitted together in several models to evaluate the effect of selected variants that were in competition with all the other variants. The GRM based on full sequence data explained only marginally more genetic variation than that based on common SNP panels: for PY, SCS and IFL, genomic heritability improved from 0.81 to 0.83, 0.83 to 0.87 and 0.69 to 0.72, respectively. Sequence data also helped to identify more variants linked to quantitative trait loci and resulted in clearer GWAS peaks across the genome. The proportion of total variance explained by the selected variants combined in a GRM was considerably smaller than that explained by all variants (less than 0.31 for all traits). When selected variants were used, accuracy of genomic predictions decreased and bias increased. Although 35 to 42 variants were detected that together explained 13 to 19% of the total variance (18 to 23% of the genetic variance) when fitted alone, there was no advantage in using dense sequence information for genomic prediction in the Holstein data used in our study. 
Detection and selection of variants within a single breed are difficult due to long-range linkage disequilibrium. Stringent selection of variants resulted in more biased genomic predictions, although this might be due to the training population being the same dataset from which the selected variants were identified.
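    The record above relies on genomic relationship matrices (GRM) but does not state how they were built. As a hedged illustration only, the Python sketch below computes a GRM from allele counts using a per-variant standardised form, A_jk = (1/N) Σ_i (x_ij − 2p_i)(x_ik − 2p_i) / (2p_i(1 − p_i)), which is assumed here and is not confirmed by the abstract. The function name `grm`, the allele-frequency estimate from the sample itself, and the skipping of monomorphic variants are likewise assumptions.

```python
import math


def grm(genotypes):
    """Genomic relationship matrix sketch (hypothetical helper).

    genotypes: one list per individual, holding allele counts (0, 1 or 2)
    for each variant.
    """
    n = len(genotypes)
    m = len(genotypes[0])

    # Allele frequency of each variant, estimated from the sample itself.
    freqs = [sum(ind[i] for ind in genotypes) / (2 * n) for i in range(m)]

    # Assumption: monomorphic variants carry no information and are skipped.
    used = [i for i, p in enumerate(freqs) if 0.0 < p < 1.0]
    if not used:
        raise ValueError("no segregating variants")

    # Centre by 2p and scale by sqrt(2p(1-p)) for every used variant.
    z = [
        [
            (ind[i] - 2 * freqs[i]) / math.sqrt(2 * freqs[i] * (1 - freqs[i]))
            for i in used
        ]
        for ind in genotypes
    ]

    # Average the products of standardised genotypes over variants.
    return [
        [sum(a * b for a, b in zip(z[j], z[k])) / len(used) for k in range(n)]
        for j in range(n)
    ]


toy = [[0, 2], [2, 0], [1, 1]]  # three individuals, two variants (made-up data)
G = grm(toy)
```

    Fitting two GRMs together, as the record describes, would mean building one such matrix from the selected variants and another from the remaining variants; that step is not shown here.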

  9. Genome-Wide Association Study in an Amerindian Ancestry Population Reveals Novel Systemic Lupus Erythematosus Risk Loci and the Role of European Admixture.

    PubMed

    Alarcón-Riquelme, Marta E; Ziegler, Julie T; Molineros, Julio; Howard, Timothy D; Moreno-Estrada, Andrés; Sánchez-Rodríguez, Elena; Ainsworth, Hannah C; Ortiz-Tello, Patricia; Comeau, Mary E; Rasmussen, Astrid; Kelly, Jennifer A; Adler, Adam; Acevedo-Vázquez, Eduardo M; Cucho-Venegas, Jorge Mariano; García-De la Torre, Ignacio; Cardiel, Mario H; Miranda, Pedro; Catoggio, Luis J; Maradiaga-Ceceña, Marco; Gaffney, Patrick M; Vyse, Timothy J; Criswell, Lindsey A; Tsao, Betty P; Sivils, Kathy L; Bae, Sang-Cheol; James, Judith A; Kimberly, Robert P; Kaufman, Kenneth M; Harley, John B; Esquivel-Valerio, Jorge A; Moctezuma, José F; García, Mercedes A; Berbotto, Guillermo A; Babini, Alejandra M; Scherbarth, Hugo; Toloza, Sergio; Baca, Vicente; Nath, Swapan K; Aguilar Salinas, Carlos; Orozco, Lorena; Tusié-Luna, Teresa; Zidovetzki, Raphael; Pons-Estel, Bernardo A; Langefeld, Carl D; Jacob, Chaim O

    2016-04-01

    Systemic lupus erythematosus (SLE) is a chronic autoimmune disease with a strong genetic component. We undertook the present work to perform the first genome-wide association study on individuals from the Americas who are enriched for Native American heritage. We analyzed 3,710 individuals from the US and 4 countries of Latin America who were diagnosed as having SLE, and healthy controls. Samples were genotyped with HumanOmni1 BeadChip. Data on out-of-study controls genotyped with HumanOmni2.5 were also included. Statistical analyses were performed using SNPtest and SNPGWA. Data were adjusted for genomic control and false discovery rate. Imputation was performed using Impute2 and, for classic HLA alleles, HiBag. Odds ratios (ORs) and 95% confidence intervals (95% CIs) were calculated. The IRF5-TNPO3 region showed the strongest association and largest OR for SLE (rs10488631: genomic control-adjusted P [Pgcadj ] = 2.61 × 10(-29), OR 2.12 [95% CI 1.88-2.39]), followed by HLA class II on the DQA2-DQB1 loci (rs9275572: Pgcadj  = 1.11 × 10(-16), OR 1.62 [95% CI 1.46-1.80] and rs9271366: Pgcadj  = 6.46 × 10(-12), OR 2.06 [95% CI 1.71-2.50]). Other known SLE loci found to be associated in this population were ITGAM, STAT4, TNIP1, NCF2, and IRAK1. We identified a novel locus on 10q24.33 (rs4917385: Pgcadj  = 1.39 × 10(-8)) with an expression quantitative trait locus (eQTL) effect (Peqtl  = 8.0 × 10(-37) at USMG5/miR1307), and several new suggestive loci. SLE risk loci previously identified in Europeans and Asians were corroborated. Local ancestry estimation showed that the HLA allele risk contribution is of European ancestral origin. Imputation of HLA alleles suggested that autochthonous Native American haplotypes provide protection against development of SLE. Our results demonstrate that studying admixed populations provides new insights in the delineation of the genetic architecture that underlies autoimmune and complex diseases. 

  10. The multiple imputation method: a case study involving secondary data analysis.

    PubMed

    Walani, Salimah R; Cleland, Charles M

    2015-05-01

    To illustrate with the example of a secondary data analysis study the use of the multiple imputation method to replace missing data. Most large public datasets have missing data, which need to be handled by researchers conducting secondary data analysis studies. Multiple imputation is a technique widely used to replace missing values while preserving the sample size and sampling variability of the data. The 2004 National Sample Survey of Registered Nurses. The authors created a model to impute missing values using the chained equation method. They used imputation diagnostics procedures and conducted regression analysis of imputed data to determine the differences between the log hourly wages of internationally educated and US-educated registered nurses. The authors used multiple imputation procedures to replace missing values in a large dataset with 29,059 observations. Five multiple imputed datasets were created. Imputation diagnostics using time series and density plots showed that imputation was successful. The authors also present an example of the use of multiple imputed datasets to conduct regression analysis to answer a substantive research question. Multiple imputation is a powerful technique for imputing missing values in large datasets while preserving the sample size and variance of the data. Even though the chained equation method involves complex statistical computations, recent innovations in software and computation have made it possible for researchers to conduct this technique on large datasets. The authors recommend nurse researchers use multiple imputation methods for handling missing data to improve the statistical power and external validity of their studies.
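    The record above reports that five imputed datasets were created and then analysed by regression, but it does not spell out how the per-dataset results are combined. A common way to do this is Rubin's rules, sketched below in Python under that assumption. The function name `pool_rubin` and the numbers in the usage lines are hypothetical and are not taken from the study.

```python
def pool_rubin(estimates, variances):
    """Pool one coefficient across m imputed datasets with Rubin's rules.

    estimates: the coefficient estimated in each imputed dataset
    variances: its squared standard error in each imputed dataset
    Returns (pooled estimate, total variance).
    """
    m = len(estimates)
    if m < 2 or m != len(variances):
        raise ValueError("need matching lists from at least two imputations")

    q_bar = sum(estimates) / m                 # pooled point estimate
    within = sum(variances) / m                # average within-imputation variance
    between = sum((q - q_bar) ** 2 for q in estimates) / (m - 1)
    total = within + (1 + 1 / m) * between     # total variance of the pooled estimate
    return q_bar, total


# Hypothetical coefficients and squared standard errors from five imputed datasets.
coefs = [0.10, 0.12, 0.11, 0.09, 0.13]
ses_squared = [0.0004, 0.0005, 0.0004, 0.0005, 0.0004]
pooled, total_var = pool_rubin(coefs, ses_squared)
```

    The between-imputation term is what carries the extra uncertainty due to the missing values; analysing a single filled-in dataset would omit it.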

  11. Joint genome-wide prediction in several populations accounting for randomness of genotypes: A hierarchical Bayes approach. I: Multivariate Gaussian priors for marker effects and derivation of the joint probability mass function of genotypes.

    PubMed

    Martínez, Carlos Alberto; Khare, Kshitij; Banerjee, Arunava; Elzo, Mauricio A

    2017-03-21

    It is important to consider heterogeneity of marker effects and allelic frequencies in across-population genome-wide prediction studies. Moreover, all regression models used in genome-wide prediction overlook randomness of genotypes. In this study, we developed a family of hierarchical Bayesian models for across-population genome-wide prediction that model genotypes as random variables and allow population-specific effects for each marker. Models shared a common structure and differed in the priors used and the assumption about residual variances (homogeneous or heterogeneous). Randomness of genotypes was accounted for by deriving the joint probability mass function of marker genotypes conditional on allelic frequencies and pedigree information. As a consequence, these models incorporated kinship and genotypic information that made it possible not only to account for heterogeneity of allelic frequencies, but also to include individuals with missing genotypes at some or all loci without the need for previous imputation. This was possible because the non-observed fraction of the design matrix was treated as an unknown model parameter. For each model, a simpler version ignoring population structure, but still accounting for randomness of genotypes, was proposed. Implementation of these models and computation of some criteria for model comparison were illustrated using two simulated datasets. Theoretical and computational issues along with possible applications, extensions and refinements were discussed. Some features of the models developed in this study make them promising for genome-wide prediction; the use of information contained in the probability distribution of genotypes is perhaps the most appealing. Further studies to assess the performance of the models proposed here and also to compare them with conventional models used in genome-wide prediction are needed. Copyright © 2017 Elsevier Ltd. All rights reserved.

  12. Relative efficiency of joint-model and full-conditional-specification multiple imputation when conditional models are compatible: The general location model.

    PubMed

    Seaman, Shaun R; Hughes, Rachael A

    2018-06-01

    Estimating the parameters of a regression model of interest is complicated by missing data on the variables in that model. Multiple imputation is commonly used to handle these missing data. Joint model multiple imputation and full-conditional specification multiple imputation are known to yield imputed data with the same asymptotic distribution when the conditional models of full-conditional specification are compatible with that joint model. We show that this asymptotic equivalence of imputation distributions does not imply that joint model multiple imputation and full-conditional specification multiple imputation will also yield asymptotically equally efficient inference about the parameters of the model of interest, nor that they will be equally robust to misspecification of the joint model. When the conditional models used by full-conditional specification multiple imputation are linear, logistic and multinomial regressions, these are compatible with a restricted general location joint model. We show that multiple imputation using the restricted general location joint model can be substantially more asymptotically efficient than full-conditional specification multiple imputation, but this typically requires very strong associations between variables. When associations are weaker, the efficiency gain is small. Moreover, full-conditional specification multiple imputation is shown to be potentially much more robust than joint model multiple imputation using the restricted general location model to misspecification of that model when there is substantial missingness in the outcome variable.

  13. Comparison of HapMap and 1000 Genomes Reference Panels in a Large-Scale Genome-Wide Association Study.

    PubMed

    de Vries, Paul S; Sabater-Lleal, Maria; Chasman, Daniel I; Trompet, Stella; Ahluwalia, Tarunveer S; Teumer, Alexander; Kleber, Marcus E; Chen, Ming-Huei; Wang, Jie Jin; Attia, John R; Marioni, Riccardo E; Steri, Maristella; Weng, Lu-Chen; Pool, Rene; Grossmann, Vera; Brody, Jennifer A; Venturini, Cristina; Tanaka, Toshiko; Rose, Lynda M; Oldmeadow, Christopher; Mazur, Johanna; Basu, Saonli; Frånberg, Mattias; Yang, Qiong; Ligthart, Symen; Hottenga, Jouke J; Rumley, Ann; Mulas, Antonella; de Craen, Anton J M; Grotevendt, Anne; Taylor, Kent D; Delgado, Graciela E; Kifley, Annette; Lopez, Lorna M; Berentzen, Tina L; Mangino, Massimo; Bandinelli, Stefania; Morrison, Alanna C; Hamsten, Anders; Tofler, Geoffrey; de Maat, Moniek P M; Draisma, Harmen H M; Lowe, Gordon D; Zoledziewska, Magdalena; Sattar, Naveed; Lackner, Karl J; Völker, Uwe; McKnight, Barbara; Huang, Jie; Holliday, Elizabeth G; McEvoy, Mark A; Starr, John M; Hysi, Pirro G; Hernandez, Dena G; Guan, Weihua; Rivadeneira, Fernando; McArdle, Wendy L; Slagboom, P Eline; Zeller, Tanja; Psaty, Bruce M; Uitterlinden, André G; de Geus, Eco J C; Stott, David J; Binder, Harald; Hofman, Albert; Franco, Oscar H; Rotter, Jerome I; Ferrucci, Luigi; Spector, Tim D; Deary, Ian J; März, Winfried; Greinacher, Andreas; Wild, Philipp S; Cucca, Francesco; Boomsma, Dorret I; Watkins, Hugh; Tang, Weihong; Ridker, Paul M; Jukema, Jan W; Scott, Rodney J; Mitchell, Paul; Hansen, Torben; O'Donnell, Christopher J; Smith, Nicholas L; Strachan, David P; Dehghan, Abbas

    2017-01-01

    An increasing number of genome-wide association (GWA) studies are now using the higher resolution 1000 Genomes Project reference panel (1000G) for imputation, with the expectation that 1000G imputation will lead to the discovery of additional associated loci when compared to HapMap imputation. In order to assess the improvement of 1000G over HapMap imputation in identifying associated loci, we compared the results of GWA studies of circulating fibrinogen based on the two reference panels. Using both HapMap and 1000G imputation we performed a meta-analysis of 22 studies comprising the same 91,953 individuals. We identified six additional signals using 1000G imputation, while 29 loci were associated using both HapMap and 1000G imputation. One locus identified using HapMap imputation was not significant using 1000G imputation. The genome-wide significance threshold of 5×10^-8 is based on the number of independent statistical tests using HapMap imputation, and 1000G imputation may lead to further independent tests that should be corrected for. When using a stricter Bonferroni correction for the 1000G GWA study (P-value < 2.5×10^-8), the number of loci significant only using HapMap imputation increased to 4 while the number of loci significant only using 1000G decreased to 5. In conclusion, 1000G imputation enabled the identification of 20% more loci than HapMap imputation, although the advantage of 1000G imputation became less clear when a stricter Bonferroni correction was used. More generally, our results provide insights that are applicable to the implementation of other dense reference panels that are under development.
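
    The two significance thresholds quoted in this abstract are consistent with a Bonferroni correction at a family-wise error rate of 0.05. A short Python sketch of the arithmetic follows. The test counts of one and two million are assumptions chosen because they reproduce the quoted thresholds; the abstract itself does not report the number of independent tests.

```python
def bonferroni_threshold(alpha, n_tests):
    """Per-test P-value threshold that keeps the family-wise error rate at alpha."""
    if n_tests <= 0:
        raise ValueError("n_tests must be positive")
    return alpha / n_tests


# Assumed numbers of independent tests (hypothetical, for illustration only).
hapmap_threshold = bonferroni_threshold(0.05, 1_000_000)
stricter_threshold = bonferroni_threshold(0.05, 2_000_000)
print(f"{hapmap_threshold:.1e} {stricter_threshold:.1e}")
```

    Doubling the assumed number of independent tests halves the threshold, which is why a denser reference panel can call for a stricter cut-off.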

  14. Comparison of HapMap and 1000 Genomes Reference Panels in a Large-Scale Genome-Wide Association Study

    PubMed Central

    de Vries, Paul S.; Sabater-Lleal, Maria; Chasman, Daniel I.; Trompet, Stella; Kleber, Marcus E.; Chen, Ming-Huei; Wang, Jie Jin; Attia, John R.; Marioni, Riccardo E.; Weng, Lu-Chen; Grossmann, Vera; Brody, Jennifer A.; Venturini, Cristina; Tanaka, Toshiko; Rose, Lynda M.; Oldmeadow, Christopher; Mazur, Johanna; Basu, Saonli; Yang, Qiong; Ligthart, Symen; Hottenga, Jouke J.; Rumley, Ann; Mulas, Antonella; de Craen, Anton J. M.; Grotevendt, Anne; Taylor, Kent D.; Delgado, Graciela E.; Kifley, Annette; Lopez, Lorna M.; Berentzen, Tina L.; Mangino, Massimo; Bandinelli, Stefania; Morrison, Alanna C.; Hamsten, Anders; Tofler, Geoffrey; de Maat, Moniek P. M.; Draisma, Harmen H. M.; Lowe, Gordon D.; Zoledziewska, Magdalena; Sattar, Naveed; Lackner, Karl J.; Völker, Uwe; McKnight, Barbara; Huang, Jie; Holliday, Elizabeth G.; McEvoy, Mark A.; Starr, John M.; Hysi, Pirro G.; Hernandez, Dena G.; Guan, Weihua; Rivadeneira, Fernando; McArdle, Wendy L.; Slagboom, P. Eline; Zeller, Tanja; Psaty, Bruce M.; Uitterlinden, André G.; de Geus, Eco J. C.; Stott, David J.; Binder, Harald; Hofman, Albert; Franco, Oscar H.; Rotter, Jerome I.; Ferrucci, Luigi; Spector, Tim D.; Deary, Ian J.; März, Winfried; Greinacher, Andreas; Wild, Philipp S.; Cucca, Francesco; Boomsma, Dorret I.; Watkins, Hugh; Tang, Weihong; Ridker, Paul M.; Jukema, Jan W.; Scott, Rodney J.; Mitchell, Paul; Hansen, Torben; O'Donnell, Christopher J.; Smith, Nicholas L.; Strachan, David P.

    2017-01-01

    An increasing number of genome-wide association (GWA) studies are now using the higher resolution 1000 Genomes Project reference panel (1000G) for imputation, with the expectation that 1000G imputation will lead to the discovery of additional associated loci when compared to HapMap imputation. In order to assess the improvement of 1000G over HapMap imputation in identifying associated loci, we compared the results of GWA studies of circulating fibrinogen based on the two reference panels. Using both HapMap and 1000G imputation we performed a meta-analysis of 22 studies comprising the same 91,953 individuals. We identified six additional signals using 1000G imputation, while 29 loci were associated using both HapMap and 1000G imputation. One locus identified using HapMap imputation was not significant using 1000G imputation. The genome-wide significance threshold of 5×10^−8 is based on the number of independent statistical tests using HapMap imputation, and 1000G imputation may lead to further independent tests that should be corrected for. When using a stricter Bonferroni correction for the 1000G GWA study (P-value < 2.5×10^−8), the number of loci significant only using HapMap imputation increased to 4 while the number of loci significant only using 1000G decreased to 5. In conclusion, 1000G imputation enabled the identification of 20% more loci than HapMap imputation, although the advantage of 1000G imputation became less clear when a stricter Bonferroni correction was used. More generally, our results provide insights that are applicable to the implementation of other dense reference panels that are under development. PMID:28107422

  15. DrImpute: imputing dropout events in single cell RNA sequencing data.

    PubMed

    Gong, Wuming; Kwak, Il-Youp; Pota, Pruthvi; Koyano-Nakagawa, Naoko; Garry, Daniel J

    2018-06-08

    The single cell RNA sequencing (scRNA-seq) technique began a new era by allowing the observation of gene expression at the single cell level. However, there is also a large amount of technical and biological noise. Because of the low number of RNA transcriptomes and the stochastic nature of the gene expression pattern, there is a high chance that nonzero entries are recorded as zero; these are called dropout events. We develop DrImpute to impute dropout events in scRNA-seq data. We show that DrImpute has significantly better performance on the separation of the dropout zeros from true zeros than existing imputation algorithms. We also demonstrate that DrImpute can significantly improve the performance of existing tools for clustering, visualization and lineage reconstruction of nine published scRNA-seq datasets. DrImpute can serve as a very useful addition to the currently existing statistical tools for single cell RNA-seq analysis. DrImpute is implemented in R and is available at https://github.com/gongx030/DrImpute .

  16. Replication of Associations of Genetic Loci Outside the HLA Region With Susceptibility to Anti–Cyclic Citrullinated Peptide–Negative Rheumatoid Arthritis

    PubMed Central

    Viatte, Sebastien; Massey, Jonathan; Bowes, John; Duffus, Kate; Eyre, Stephen; Barton, Anne; Loughlin, John; Arden, Nigel; Birrell, Fraser; Carr, Andrew; Deloukas, Panos; Doherty, Michael; McCaskie, Andrew W.; Ollier, William E. R.; Rai, Ashok; Ralston, Stuart H.; Spector, Tim D.; Valdes, Ana M.; Wallis, Gillian A.; Wilkinson, J. Mark; Zeggini, Eleftheria

    2016-01-01

    Objective: Genetic polymorphisms within the HLA region explain only a modest proportion of anti–cyclic citrullinated peptide (anti‐CCP)–negative rheumatoid arthritis (RA) heritability. However, few non‐HLA markers have been identified so far. This study was undertaken to replicate the associations of anti‐CCP–negative RA with non‐HLA genetic polymorphisms demonstrated in a previous study. Methods: The Rheumatoid Arthritis Consortium International densely genotyped 186 autoimmune‐related regions in 3,339 anti‐CCP–negative RA patients and 15,870 controls across 6 different populations using the Illumina ImmunoChip array. We performed a case–control replication study of the anti‐CCP–negative markers with the strongest associations in that discovery study, in an independent cohort of anti‐CCP–negative UK RA patients. Individuals from the arcOGEN Consortium and Wellcome Trust Case Control Consortium were used as controls. Genotyping in cases was performed using Sequenom MassArray technology. Genome‐wide data from controls were imputed using the 1000 Genomes Phase I integrated variant call set release version 3 as a reference panel. Results: After genotyping and imputation quality control procedures, data were available for 15 non‐HLA single‐nucleotide polymorphisms in 1,024 cases and 6,348 controls. We confirmed the known markers ANKRD55 (meta‐analysis odds ratio [OR] 0.80; P = 2.8 × 10^−13) and BLK (OR 1.13; P = 7.0 × 10^−6) and identified new and specific markers of anti‐CCP–negative RA (prolactin [PRL] [OR 1.13; P = 2.1 × 10^−6] and NFIA [OR 0.85; P = 2.5 × 10^−6]). Neither of these loci is associated with other common, complex autoimmune diseases. Conclusion: Anti‐CCP–negative RA and anti‐CCP–positive RA are genetically different disease subsets that only partially share susceptibility factors. Genetic polymorphisms located near the PRL and NFIA genes represent examples of genetic susceptibility factors specific for anti‐CCP–negative RA. PMID:26895230

  17. Haplotypic Analysis of Wellcome Trust Case Control Consortium Data

    PubMed Central

    Browning, Brian L.; Browning, Sharon R.

    2008-01-01

    We applied a recently developed multilocus association testing method (localized haplotype clustering) to Wellcome Trust Case Control Consortium data (14,000 cases of seven common diseases and 3,000 shared controls genotyped on the Affymetrix 500K array). After rigorous data quality filtering, we identified three disease-associated loci with strong statistical support from localized haplotype cluster tests but with only marginal significance in single marker tests. These loci are chromosomes 10p15.1 with type 1 diabetes (p = 5.1 × 10^-9), 12q15 with type 2 diabetes (p = 1.9 × 10^-7) and 15q26.2 with hypertension (p = 2.8 × 10^-8). We also detected the association of chromosome 9p21.3 with type 2 diabetes (p = 2.8 × 10^-8), although this locus did not pass our stringent genotype quality filters. The associations of 10p15.1 with type 1 diabetes and 9p21.3 with type 2 diabetes have both been replicated in other studies using independent data sets. Overall, localized haplotype cluster analysis had better success detecting disease-associated variants than a previous single-marker analysis of imputed HapMap SNPs. We found that stringent application of quality score thresholds to genotype data substantially reduced false-positive results arising from genotype error. In addition, we demonstrate that it is possible to simultaneously phase 16,000 individuals genotyped on genome-wide data (450K markers) using the Beagle software package. PMID:18224336

  18. A Study of Imputation Algorithms. Working Paper Series.

    ERIC Educational Resources Information Center

    Hu, Ming-xiu; Salvucci, Sameena

    Many imputation techniques and imputation software packages have been developed over the years to deal with missing data. Different methods may work well under different circumstances, and it is advisable to conduct a sensitivity analysis when choosing an imputation method for a particular survey. This study reviewed about 30 imputation methods…

  19. Evidence that breast cancer risk at the 2q35 locus is mediated through IGFBP5 regulation

    PubMed Central

    Ghoussaini, Maya; Edwards, Stacey L.; Michailidou, Kyriaki; Nord, Silje; Cowper-Sal·lari, Richard; Desai, Kinjal; Kar, Siddhartha; Hillman, Kristine M.; Kaufmann, Susanne; Glubb, Dylan M.; Beesley, Jonathan; Dennis, Joe; Bolla, Manjeet K.; Wang, Qin; Dicks, Ed; Guo, Qi; Schmidt, Marjanka K.; Shah, Mitul; Luben, Robert; Brown, Judith; Czene, Kamila; Darabi, Hatef; Eriksson, Mikael; Klevebring, Daniel; Bojesen, Stig E.; Nordestgaard, Børge G.; Nielsen, Sune F.; Flyger, Henrik; Lambrechts, Diether; Thienpont, Bernard; Neven, Patrick; Wildiers, Hans; Broeks, Annegien; Van’t Veer, Laura J.; Th Rutgers, Emiel J.; Couch, Fergus J.; Olson, Janet E.; Hallberg, Emily; Vachon, Celine; Chang-Claude, Jenny; Rudolph, Anja; Seibold, Petra; Flesch-Janys, Dieter; Peto, Julian; dos-Santos-Silva, Isabel; Gibson, Lorna; Nevanlinna, Heli; Muranen, Taru A.; Aittomäki, Kristiina; Blomqvist, Carl; Hall, Per; Li, Jingmei; Liu, Jianjun; Humphreys, Keith; Kang, Daehee; Choi, Ji-Yeob; Park, Sue K.; Noh, Dong-Young; Matsuo, Keitaro; Ito, Hidemi; Iwata, Hiroji; Yatabe, Yasushi; Guénel, Pascal; Truong, Thérèse; Menegaux, Florence; Sanchez, Marie; Burwinkel, Barbara; Marme, Frederik; Schneeweiss, Andreas; Sohn, Christof; Wu, Anna H.; Tseng, Chiu-chen; Van Den Berg, David; Stram, Daniel O.; Benitez, Javier; Zamora, M. Pilar; Perez, Jose Ignacio Arias; Menéndez, Primitiva; Shu, Xiao-Ou; Lu, Wei; Gao, Yu-Tang; Cai, Qiuyin; Cox, Angela; Cross, Simon S.; Reed, Malcolm W. R.; Andrulis, Irene L.; Knight, Julia A.; Glendon, Gord; Tchatchou, Sandrine; Sawyer, Elinor J.; Tomlinson, Ian; Kerin, Michael J.; Miller, Nicola; Haiman, Christopher A.; Henderson, Brian E.; Schumacher, Fredrick; Le Marchand, Loic; Lindblom, Annika; Margolin, Sara; TEO, Soo Hwang; YIP, Cheng Har; Lee, Daphne S. C.; Wong, Tien Y.; Hooning, Maartje J.; Martens, John W. M.; Collée, J. Margriet; van Deurzen, Carolien H. M.; Hopper, John L.; Southey, Melissa C.; Tsimiklis, Helen; Kapuscinski, Miroslav K.; Shen, Chen-Yang; Wu, Pei-Ei; Yu, Jyh-Cherng; Chen, Shou-Tung; Alnæs, Grethe Grenaker; Borresen-Dale, Anne-Lise; Giles, Graham G.; Milne, Roger L.; McLean, Catriona; Muir, Kenneth; Lophatananon, Artitaya; Stewart-Brown, Sarah; Siriwanarangsan, Pornthep; Hartman, Mikael; Miao, Hui; Buhari, Shaik Ahmad Bin Syed; Teo, Yik Ying; Fasching, Peter A.; Haeberle, Lothar; Ekici, Arif B.; Beckmann, Matthias W.; Brenner, Hermann; Dieffenbach, Aida Karina; Arndt, Volker; Stegmaier, Christa; Swerdlow, Anthony; Ashworth, Alan; Orr, Nick; Schoemaker, Minouk J.; García-Closas, Montserrat; Figueroa, Jonine; Chanock, Stephen J.; Lissowska, Jolanta; Simard, Jacques; Goldberg, Mark S.; Labrèche, France; Dumont, Martine; Winqvist, Robert; Pylkäs, Katri; Jukkola-Vuorinen, Arja; Brauch, Hiltrud; Brüning, Thomas; Koto, Yon-Dschun; Radice, Paolo; Peterlongo, Paolo; Bonanni, Bernardo; Volorio, Sara; Dörk, Thilo; Bogdanova, Natalia V.; Helbig, Sonja; Mannermaa, Arto; Kataja, Vesa; Kosma, Veli-Matti; Hartikainen, Jaana M.; Devilee, Peter; Tollenaar, Robert A. E. M.; Seynaeve, Caroline; Van Asperen, Christi J.; Jakubowska, Anna; Lubinski, Jan; Jaworska-Bieniek, Katarzyna; Durda, Katarzyna; Slager, Susan; Toland, Amanda E.; Ambrosone, Christine B.; Yannoukakos, Drakoulis; Sangrajrang, Suleeporn; Gaborieau, Valerie; Brennan, Paul; McKay, James; Hamann, Ute; Torres, Diana; Zheng, Wei; Long, Jirong; Anton-Culver, Hoda; Neuhausen, Susan L.; Luccarini, Craig; Baynes, Caroline; Ahmed, Shahana; Maranian, Mel; Healey, Catherine S.; González-Neira, Anna; Pita, Guillermo; Alonso, M. Rosario; Álvarez, Nuria; Herrero, Daniel; Tessier, Daniel C.; Vincent, Daniel; Bacot, Francois; de Santiago, Ines; Carroll, Jason; Caldas, Carlos; Brown, Melissa A.; Lupien, Mathieu; Kristensen, Vessela N.; Pharoah, Paul D P; Chenevix-Trench, Georgia; French, Juliet D; Easton, Douglas F.; Dunning, Alison M.; Chenevix-Trench, Georgia; Webb, Penny; Bowtell, David; De Fazio, Anna

    2014-01-01

    GWAS have identified a breast cancer susceptibility locus on 2q35. Here we report the fine mapping of this locus using data from 101,943 subjects from 50 case-control studies. We genotype 276 SNPs using the ‘iCOGS’ genotyping array and impute genotypes for a further 1,284 using 1000 Genomes Project data. All but two, strongly correlated SNPs (rs4442975 G/T and rs6721996 G/A) are excluded as candidate causal variants at odds against >100:1. The best functional candidate, rs4442975, is associated with oestrogen receptor positive (ER+) disease with an odds ratio (OR) in Europeans of 0.85 (95% confidence interval = 0.84-0.87; P = 1.7 × 10^−43) per t-allele. This SNP flanks a transcriptional enhancer that physically interacts with the promoter of IGFBP5 (encoding insulin-like growth factor-binding protein 5) and displays allele-specific gene expression, FOXA1 binding and chromatin looping. Evidence suggests that the g-allele confers increased breast cancer susceptibility through relative downregulation of IGFBP5, a gene with known roles in breast cell biology. PMID:25248036

  20. Development and Genetic Characterization of an Advanced Backcross-Nested Association Mapping (AB-NAM) Population of Wild × Cultivated Barley

    PubMed Central

    Nice, Liana M.; Steffenson, Brian J.; Brown-Guedira, Gina L.; Akhunov, Eduard D.; Liu, Chaochih; Kono, Thomas J. Y.; Morrell, Peter L.; Blake, Thomas K.; Horsley, Richard D.; Smith, Kevin P.; Muehlbauer, Gary J.

    2016-01-01

    The ability to access alleles from unadapted germplasm collections is a long-standing problem for geneticists and breeders. Here we developed, characterized, and demonstrated the utility of a wild barley advanced backcross-nested association mapping (AB-NAM) population. We developed this population by backcrossing 25 wild barley accessions to the six-rowed malting barley cultivar Rasmusson. The 25 wild barley parents were selected from the 318-accession Wild Barley Diversity Collection (WBDC) to maximize allelic diversity. The resulting 796 BC2F4:6 lines were genotyped with 384 SNP markers, and an additional 4022 SNPs and 263,531 sequence variants were imputed onto the population using 9K iSelect SNP genotypes and exome capture sequence of the parents, respectively. On average, 96% of each wild parent was introgressed into the Rasmusson background, and the population exhibited low population structure. While linkage disequilibrium (LD) decay (r^2 = 0.2) was lowest in the WBDC (0.36 cM), the AB-NAM (9.2 cM) exhibited more rapid LD decay than comparable advanced backcross (28.6 cM) and recombinant inbred line (32.3 cM) populations. Three qualitative traits (glossy spike, glossy sheath, and black hull color) were mapped with high resolution to loci corresponding to known barley mutants for these traits. Additionally, a total of 10 QTL were identified for grain protein content. The combination of low LD, negligible population structure, and high diversity in an adapted background makes the AB-NAM an important tool for high-resolution gene mapping and discovery of novel allelic variation using wild barley germplasm. PMID:27182953

  1. A Comparison of Joint Model and Fully Conditional Specification Imputation for Multilevel Missing Data

    ERIC Educational Resources Information Center

    Mistler, Stephen A.; Enders, Craig K.

    2017-01-01

    Multiple imputation methods can generally be divided into two broad frameworks: joint model (JM) imputation and fully conditional specification (FCS) imputation. JM draws missing values simultaneously for all incomplete variables using a multivariate distribution, whereas FCS imputes variables one at a time from a series of univariate conditional…

  2. Ipsative imputation for a 15-item Geriatric Depression Scale in community-dwelling elderly people.

    PubMed

    Imai, Hissei; Furukawa, Toshiaki A; Kasahara, Yoriko; Ishimoto, Yasuko; Kimura, Yumi; Fukutomi, Eriko; Chen, Wen-Ling; Tanaka, Mire; Sakamoto, Ryota; Wada, Taizo; Fujisawa, Michiko; Okumiya, Kiyohito; Matsubayashi, Kozo

    2014-09-01

    Missing data are inevitable in almost all medical studies. Imputation methods using the probabilistic model are common, but they cannot impute individual data and require special software. In contrast, the ipsative imputation method, which substitutes the missing items by the mean of the remaining items within the individual, is easy, does not need any special software, and can provide individual scores. The aim of the present study was to evaluate the validity of the ipsative imputation method using data involving the 15-item Geriatric Depression Scale. Participants were community-dwelling elderly individuals (n = 1178). A structural equation model was constructed. The model fit indexes were calculated to assess the validity of the imputation method when it is used for individuals who were missing 20% of data or less and 40% of data or less, depending on whether we assumed that their correlation coefficients were the same as the dataset with no missing items. Finally, we compared path coefficients of the dataset imputed by ipsative imputation with those by multiple imputation. When compared with the assumption that the datasets differed, all of the model fit indexes were better under the assumption that the dataset without missing data is the same as the dataset that was missing 20% of data or less. However, under the same assumption, the model fit indexes were worse in the dataset that was missing 40% of data or less. The path coefficients of the dataset imputed by ipsative imputation and by multiple imputation were compatible with each other if the proportion of missing items was 20% or less. Ipsative imputation appears to be a valid imputation method and can be used to impute data in studies using the 15-item Geriatric Depression Scale, if the percentage of its missing items is 20% or less. © 2014 The Authors. Psychogeriatrics © 2014 Japanese Psychogeriatric Society.
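
    Ipsative imputation as described in this abstract, replacing each missing item with the mean of the same respondent's observed items, can be sketched in a few lines of Python. The function names and the example respondent are hypothetical; the 0/1 item coding is an assumption about how the 15 yes/no items are scored.

```python
def ipsative_impute(items):
    """Replace missing items (None) with the mean of the respondent's observed items."""
    observed = [x for x in items if x is not None]
    if not observed:
        raise ValueError("cannot impute a respondent with no observed items")
    person_mean = sum(observed) / len(observed)
    return [person_mean if x is None else x for x in items]


def missing_fraction(items):
    """Share of items that are missing for one respondent."""
    return sum(x is None for x in items) / len(items)


# Hypothetical respondent: 15 items scored 0 or 1, three of them missing.
answers = [1, 0, None, 1, 0, 0, 1, None, 0, 0, 1, 0, None, 0, 0]
completed = ipsative_impute(answers)
total_score = sum(completed)
print(missing_fraction(answers), round(total_score, 2))
```

    A check such as `missing_fraction(answers) <= 0.2` could be used to apply the method only within the 20% limit that the abstract reports as valid.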

  3. Risk for ACPA-positive rheumatoid arthritis is driven by shared HLA amino acid polymorphisms in Asian and European populations

    PubMed Central

    Okada, Yukinori; Kim, Kwangwoo; Han, Buhm; Pillai, Nisha E.; Ong, Rick T.-H.; Saw, Woei-Yuh; Luo, Ma; Jiang, Lei; Yin, Jian; Bang, So-Young; Lee, Hye-Soon; Brown, Matthew A.; Bae, Sang-Cheol; Xu, Huji; Teo, Yik-Ying; de Bakker, Paul I.W.; Raychaudhuri, Soumya

    2014-01-01

    Previous studies have emphasized ethnically heterogeneous human leukocyte antigen (HLA) classical allele associations to rheumatoid arthritis (RA) risk. We fine-mapped RA risk alleles within the major histocompatibility complex (MHC) in 2782 seropositive RA cases and 4315 controls of Asian descent. We applied imputation to determine genotypes for eight class I and II HLA genes to Asian populations for the first time using a newly constructed pan-Asian reference panel. First, we empirically measured high imputation accuracy in Asian samples. Then we observed the most significant association in HLA-DRβ1 at amino acid position 13, located outside the classical shared epitope (Pomnibus = 6.9 × 10^−135). The individual residues at position 13 have relative effects that are consistent with published effects in European populations (His > Phe > Arg > Tyr ≅ Gly > Ser), but the observed effects in Asians are generally smaller. Applying stepwise conditional analysis, we identified additional independent associations at positions 57 (conditional Pomnibus = 2.2 × 10^−33) and 74 (conditional Pomnibus = 1.1 × 10^−8). Outside of HLA-DRβ1, we observed independent effects for amino acid polymorphisms within HLA-B (Asp9, conditional P = 3.8 × 10^−6) and HLA-DPβ1 (Phe9, conditional P = 3.0 × 10^−5) concordant with European populations. Our trans-ethnic HLA fine-mapping study reveals that (i) a common set of amino acid residues confer shared effects in European and Asian populations and (ii) these same effects can explain ethnically heterogeneous classical allelic associations (e.g. HLA-DRB1*09:01) due to allele frequency differences between populations. Our study illustrates the value of high-resolution imputation for fine-mapping causal variants in the MHC. PMID:25070946

  4. A bias-corrected estimator in multiple imputation for missing data.

    PubMed

    Tomita, Hiroaki; Fujisawa, Hironori; Henmi, Masayuki

    2018-05-29

    Multiple imputation (MI) is one of the most popular methods to deal with missing data, and its use has been rapidly increasing in medical studies. Although MI is rather appealing in practice since it is possible to use ordinary statistical methods for a complete data set once the missing values are fully imputed, the method of imputation is still problematic. If the missing values are imputed from some parametric model, the validity of imputation is not necessarily ensured, and the final estimate for a parameter of interest can be biased unless the parametric model is correctly specified. Nonparametric methods have also been proposed for MI, but it is not straightforward to produce imputation values from nonparametrically estimated distributions. In this paper, we propose a new method for MI to obtain a consistent (or asymptotically unbiased) final estimate even if the imputation model is misspecified. The key idea is to use an imputation model from which the imputation values are easily produced and to make a proper correction in the likelihood function after the imputation by using the density ratio between the imputation model and the true conditional density function for the missing variable as a weight. Although the conditional density must be nonparametrically estimated, it is not used for the imputation. The performance of our method is evaluated by both theory and simulation studies. A real data analysis is also conducted to illustrate our method by using the Duke Cardiac Catheterization Coronary Artery Disease Diagnostic Dataset. Copyright © 2018 John Wiley & Sons, Ltd.
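
    The idea of drawing from a convenient but wrong model and then correcting with a density-ratio weight can be illustrated with a toy Python example. This is not the authors' estimator: it assumes the true density is known (normal with mean 1), whereas the abstract says it is estimated nonparametrically, and it targets a simple mean rather than a likelihood. All names and parameter values are hypothetical.

```python
import math
import random


def normal_pdf(y, mu, sigma):
    """Density of a normal distribution with mean mu and standard deviation sigma."""
    z = (y - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def weighted_mean_from_wrong_model(n_draws, seed=0):
    """Estimate a mean from draws of a misspecified model, reweighted by a density ratio."""
    rng = random.Random(seed)
    num = 0.0
    den = 0.0
    for _ in range(n_draws):
        y = rng.gauss(0.0, 2.0)  # draw from the (misspecified) imputation model g
        # Weight = assumed true density f divided by imputation density g.
        w = normal_pdf(y, 1.0, 1.0) / normal_pdf(y, 0.0, 2.0)
        num += w * y
        den += w
    return num / den


estimate = weighted_mean_from_wrong_model(20000)
print(round(estimate, 2))
```

    Without the weights, the average of the draws would sit near the imputation model's mean of 0; with them, the estimate moves toward the assumed true mean of 1.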

  5. Multiple imputation of covariates by fully conditional specification: Accommodating the substantive model

    PubMed Central

    Seaman, Shaun R; White, Ian R; Carpenter, James R

    2015-01-01

    Missing covariate data commonly occur in epidemiological and clinical research, and are often dealt with using multiple imputation. Imputation of partially observed covariates is complicated if the substantive model is non-linear (e.g. Cox proportional hazards model), or contains non-linear (e.g. squared) or interaction terms, and standard software implementations of multiple imputation may impute covariates from models that are incompatible with such substantive models. We show how imputation by fully conditional specification, a popular approach for performing multiple imputation, can be modified so that covariates are imputed from models which are compatible with the substantive model. We investigate through simulation the performance of this proposal, and compare it with existing approaches. Simulation results suggest our proposal gives consistent estimates for a range of common substantive models, including models which contain non-linear covariate effects or interactions, provided data are missing at random and the assumed imputation models are correctly specified and mutually compatible. Stata software implementing the approach is freely available. PMID:24525487

  6. Rare-Variant Association Analysis: Study Designs and Statistical Tests

    PubMed Central

    Lee, Seunggeung; Abecasis, Gonçalo R.; Boehnke, Michael; Lin, Xihong

    2014-01-01

    Despite the extensive discovery of trait- and disease-associated common variants, much of the genetic contribution to complex traits remains unexplained. Rare variants can explain additional disease risk or trait variability. An increasing number of studies are underway to identify trait- and disease-associated rare variants. In this review, we provide an overview of statistical issues in rare-variant association studies with a focus on study designs and statistical tests. We present the design and analysis pipeline of rare-variant studies and review cost-effective sequencing designs and genotyping platforms. We compare various gene- or region-based association tests, including burden tests, variance-component tests, and combined omnibus tests, in terms of their assumptions and performance. Also discussed are the related topics of meta-analysis, population-stratification adjustment, genotype imputation, follow-up studies, and heritability due to rare variants. We provide guidelines for analysis and discuss some of the challenges inherent in these studies and future research directions. PMID:24995866

  7. Genome-Wide Gene-Environment Study Identifies Glutamate Receptor Gene GRIN2A as a Parkinson's Disease Modifier Gene via Interaction with Coffee

    PubMed Central

    Hamza, Taye H.; Chen, Honglei; Hill-Burns, Erin M.; Rhodes, Shannon L.; Montimurro, Jennifer; Kay, Denise M.; Tenesa, Albert; Kusel, Victoria I.; Sheehan, Patricia; Eaaswarkhanth, Muthukrishnan; Yearout, Dora; Samii, Ali; Roberts, John W.; Agarwal, Pinky; Bordelon, Yvette; Park, Yikyung; Wang, Liyong; Gao, Jianjun; Vance, Jeffery M.; Kendler, Kenneth S.; Bacanu, Silviu-Alin; Scott, William K.; Ritz, Beate; Nutt, John; Factor, Stewart A.; Zabetian, Cyrus P.; Payami, Haydeh

    2011-01-01

    Our aim was to identify genes that influence the inverse association of coffee with the risk of developing Parkinson's disease (PD). We used genome-wide genotype data and lifetime caffeinated-coffee-consumption data on 1,458 persons with PD and 931 without PD from the NeuroGenetics Research Consortium (NGRC), and we performed a genome-wide association and interaction study (GWAIS), testing each SNP's main-effect plus its interaction with coffee, adjusting for sex, age, and two principal components. We then stratified subjects as heavy or light coffee-drinkers and performed genome-wide association study (GWAS) in each group. We replicated the most significant SNP. Finally, we imputed the NGRC dataset, increasing genomic coverage to examine the region of interest in detail. The primary analyses (GWAIS, GWAS, Replication) were performed using genotyped data. In GWAIS, the most significant signal came from rs4998386 and the neighboring SNPs in GRIN2A. GRIN2A encodes an NMDA-glutamate-receptor subunit and regulates excitatory neurotransmission in the brain. Achieving P_2df = 10⁻⁶, GRIN2A surpassed all known PD susceptibility genes in significance in the GWAIS. In stratified GWAS, the GRIN2A signal was present in heavy coffee-drinkers (OR = 0.43; P = 6×10⁻⁷) but not in light coffee-drinkers. The a priori Replication hypothesis that “Among heavy coffee-drinkers, rs4998386_T carriers have lower PD risk than rs4998386_CC carriers” was confirmed: OR_Replication = 0.59, P_Replication = 10⁻³; OR_Pooled = 0.51, P_Pooled = 7×10⁻⁸. Compared to light coffee-drinkers with rs4998386_CC genotype, heavy coffee-drinkers with rs4998386_CC genotype had 18% lower risk (P = 3×10⁻³), whereas heavy coffee-drinkers with rs4998386_TC genotype had 59% lower risk (P = 6×10⁻¹³). Imputation revealed a block of SNPs that achieved P_2df < 5×10⁻⁸ in GWAIS, and OR = 0.41, P = 3×10⁻⁸ in heavy coffee-drinkers. 
This study is proof of concept that inclusion of environmental factors can help identify genes that are missed in GWAS. Both adenosine antagonists (caffeine-like) and glutamate antagonists (GRIN2A-related) are being tested in clinical trials for treatment of PD. GRIN2A may be a useful pharmacogenetic marker for subdividing individuals in clinical trials to determine which medications might work best for which patients. PMID:21876681

  8. Association of Long Runs of Homozygosity With Alzheimer Disease Among African American Individuals

    PubMed Central

    Ghani, Mahdi; Reitz, Christiane; Cheng, Rong; Vardarajan, Badri Narayan; Jun, Gyungah; Sato, Christine; Naj, Adam; Rajbhandary, Ruchita; Wang, Li-San; Valladares, Otto; Lin, Chiao-Feng; Larson, Eric B.; Graff-Radford, Neill R.; Evans, Denis; De Jager, Philip L.; Crane, Paul K.; Buxbaum, Joseph D.; Murrell, Jill R.; Raj, Towfique; Ertekin-Taner, Nilufer; Logue, Mark; Baldwin, Clinton T.; Green, Robert C.; Barnes, Lisa L.; Cantwell, Laura B.; Fallin, M. Daniele; Go, Rodney C. P.; Griffith, Patrick A.; Obisesan, Thomas O.; Manly, Jennifer J.; Lunetta, Kathryn L.; Kamboh, M. Ilyas; Lopez, Oscar L.; Bennett, David A.; Hendrie, Hugh; Hall, Kathleen S.; Goate, Alison M.; Byrd, Goldie S.; Kukull, Walter A.; Foroud, Tatiana M.; Haines, Jonathan L.; Farrer, Lindsay A.; Pericak-Vance, Margaret A.; Lee, Joseph H.; Schellenberg, Gerard D.; St. George-Hyslop, Peter; Mayeux, Richard; Rogaeva, Ekaterina

    2015-01-01

    IMPORTANCE Mutations in known causal Alzheimer disease (AD) genes account for only 1% to 3% of patients and almost all are dominantly inherited. Recessive inheritance of complex phenotypes can be linked to long (>1-megabase [Mb]) runs of homozygosity (ROHs) detectable by single-nucleotide polymorphism (SNP) arrays. OBJECTIVE To evaluate the association between ROHs and AD in an African American population known to have a risk for AD up to 3 times higher than white individuals. DESIGN, SETTING, AND PARTICIPANTS Case-control study of a large African American data set previously genotyped on different genome-wide SNP arrays conducted from December 2013 to January 2015. Global and locus-based ROH measurements were analyzed using raw or imputed genotype data. We studied the raw genotypes from 2 case-control subsets grouped based on SNP array: Alzheimer’s Disease Genetics Consortium data set (871 cases and 1620 control individuals) and Chicago Health and Aging Project–Indianapolis Ibadan Dementia Study data set (279 cases and 1367 control individuals). We then examined the entire data set using imputed genotypes from 1917 cases and 3858 control individuals. MAIN OUTCOMES AND MEASURES The ROHs larger than 1 Mb, 2 Mb, or 3 Mb were investigated separately for global burden evaluation, consensus regions, and gene-based analyses. RESULTS The African American cohort had a low degree of inbreeding (F ~ 0.006). In the Alzheimer’s Disease Genetics Consortium data set, we detected a significantly higher proportion of cases with ROHs greater than 2 Mb (P = .004) or greater than 3 Mb (P = .02), as well as a significant 114-kilobase consensus region on chr4q31.3 (empirical P value 2 = .04; ROHs >2 Mb). 
In the Chicago Health and Aging Project–Indianapolis Ibadan Dementia Study data set, we identified a significant 202-kilobase consensus region on Chr15q24.1 (empirical P value 2 = .02; ROHs >1 Mb) and a cluster of 13 significant genes on Chr3p21.31 (empirical P value 2 = .03; ROHs >3 Mb). A total of 43 of 49 nominally significant genes common for both data sets also mapped to Chr3p21.31. Analyses of imputed SNP data from the entire data set confirmed the association of AD with global ROH measurements (12.38 ROHs >1 Mb in cases vs 12.11 in controls; 2.986 Mb average size of ROHs >2 Mb in cases vs 2.889 Mb in controls; and 22% of cases with ROHs >3 Mb vs 19% of controls) and a gene-cluster on Chr3p21.31 (empirical P value 2 = .006-.04; ROHs >3 Mb). Also, we detected a significant association between AD and CLDN17 (empirical P value 2 = .01; ROHs >1 Mb), encoding a protein from the Claudin family, members of which were previously suggested as AD biomarkers. CONCLUSIONS AND RELEVANCE To our knowledge, we discovered the first evidence of increased burden of ROHs among patients with AD from an outbred African American population, which could reflect either the cumulative effect of multiple ROHs to AD or the contribution of specific loci harboring recessive mutations and risk haplotypes in a subset of patients. Sequencing is required to uncover AD variants in these individuals. PMID:26366463
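    The record above describes long runs of homozygosity (ROHs) detected from SNP array genotypes and filtered by length (>1 Mb, >2 Mb, >3 Mb). The following is a minimal sketch of that idea only, not the study's pipeline: it scans one chromosome for uninterrupted stretches of homozygous calls and keeps those spanning at least a chosen length. The function name `find_rohs`, the 0/1/2 genotype coding and the toy positions are assumptions made for illustration; production tools typically also tolerate a few heterozygous or missing calls per window.

```python
def find_rohs(positions, genotypes, min_length_bp=1_000_000):
    """Return (start, end) spans of consecutive homozygous SNPs.

    positions: sorted base-pair coordinates on one chromosome.
    genotypes: hypothetical coding, 0 or 2 = homozygous, 1 = heterozygous.
    Only spans of at least min_length_bp are kept.
    """
    runs = []
    start = None   # first position of the current homozygous run
    prev = None    # last homozygous position seen in the current run
    for pos, g in zip(positions, genotypes):
        if g in (0, 2):
            if start is None:
                start = pos
            prev = pos
        else:
            # A heterozygous call closes the current run.
            if start is not None and prev - start >= min_length_bp:
                runs.append((start, prev))
            start = None
    # Close a run that extends to the end of the chromosome.
    if start is not None and prev - start >= min_length_bp:
        runs.append((start, prev))
    return runs


# Toy data: one 1.2 Mb run, a heterozygous call, then a 0.4 Mb run.
positions = [0, 400_000, 800_000, 1_200_000, 1_600_000, 2_000_000, 2_400_000]
genotypes = [0, 2, 2, 0, 1, 2, 2]
rohs = find_rohs(positions, genotypes)
```

    With the 1 Mb default, only the first stretch qualifies; raising `min_length_bp` to 2 Mb or 3 Mb mirrors the separate size classes mentioned in the abstract.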

  9. Should multiple imputation be the method of choice for handling missing data in randomized trials?

    PubMed Central

    Sullivan, Thomas R; White, Ian R; Salter, Amy B; Ryan, Philip; Lee, Katherine J

    2016-01-01

    The use of multiple imputation has increased markedly in recent years, and journal reviewers may expect to see multiple imputation used to handle missing data. However in randomized trials, where treatment group is always observed and independent of baseline covariates, other approaches may be preferable. Using data simulation we evaluated multiple imputation, performed both overall and separately by randomized group, across a range of commonly encountered scenarios. We considered both missing outcome and missing baseline data, with missing outcome data induced under missing at random mechanisms. Provided the analysis model was correctly specified, multiple imputation produced unbiased treatment effect estimates, but alternative unbiased approaches were often more efficient. When the analysis model overlooked an interaction effect involving randomized group, multiple imputation produced biased estimates of the average treatment effect when applied to missing outcome data, unless imputation was performed separately by randomized group. Based on these results, we conclude that multiple imputation should not be seen as the only acceptable way to handle missing data in randomized trials. In settings where multiple imputation is adopted, we recommend that imputation is carried out separately by randomized group. PMID:28034175

  10. Should multiple imputation be the method of choice for handling missing data in randomized trials?

    PubMed

    Sullivan, Thomas R; White, Ian R; Salter, Amy B; Ryan, Philip; Lee, Katherine J

    2016-01-01

    The use of multiple imputation has increased markedly in recent years, and journal reviewers may expect to see multiple imputation used to handle missing data. However in randomized trials, where treatment group is always observed and independent of baseline covariates, other approaches may be preferable. Using data simulation we evaluated multiple imputation, performed both overall and separately by randomized group, across a range of commonly encountered scenarios. We considered both missing outcome and missing baseline data, with missing outcome data induced under missing at random mechanisms. Provided the analysis model was correctly specified, multiple imputation produced unbiased treatment effect estimates, but alternative unbiased approaches were often more efficient. When the analysis model overlooked an interaction effect involving randomized group, multiple imputation produced biased estimates of the average treatment effect when applied to missing outcome data, unless imputation was performed separately by randomized group. Based on these results, we conclude that multiple imputation should not be seen as the only acceptable way to handle missing data in randomized trials. In settings where multiple imputation is adopted, we recommend that imputation is carried out separately by randomized group.

  11. Assessment of imputation methods using varying ecological information to fill the gaps in a tree functional trait database

    NASA Astrophysics Data System (ADS)

    Poyatos, Rafael; Sus, Oliver; Vilà-Cabrera, Albert; Vayreda, Jordi; Badiella, Llorenç; Mencuccini, Maurizio; Martínez-Vilalta, Jordi

    2016-04-01

    Plant functional traits are increasingly being used in ecosystem ecology thanks to the growing availability of large ecological databases. However, these databases usually contain a large fraction of missing data because measuring plant functional traits systematically is labour-intensive and because most databases are compilations of datasets with different sampling designs. As a result, within a given database, there is an inevitable variability in the number of traits available for each data entry and/or the species coverage in a given geographical area. The presence of missing data may severely bias trait-based analyses, such as the quantification of trait covariation or trait-environment relationships and may hamper efforts towards trait-based modelling of ecosystem biogeochemical cycles. Several data imputation (i.e. gap-filling) methods have been recently tested on compiled functional trait databases, but the performance of imputation methods applied to a functional trait database with a regular spatial sampling has not been thoroughly studied. Here, we assess the effects of data imputation on five tree functional traits (leaf biomass to sapwood area ratio, foliar nitrogen, maximum height, specific leaf area and wood density) in the Ecological and Forest Inventory of Catalonia, an extensive spatial database (covering 31,900 km²). We tested the performance of species mean imputation, single imputation by the k-nearest neighbors algorithm (kNN) and a multiple imputation method, Multivariate Imputation with Chained Equations (MICE) at different levels of missing data (10%, 30%, 50%, and 80%). We also assessed the changes in imputation performance when additional predictors (species identity, climate, forest structure, spatial structure) were added in kNN and MICE imputations. 
We evaluated the imputed datasets using a battery of indexes describing departure from the complete dataset in trait distribution, in the mean prediction error, in the correlation matrix and in selected bivariate trait relationships. MICE yielded imputations which better preserved the variability and covariance structure of the data and provided an estimate of between-imputation uncertainty. We found that adding species identity as a predictor in MICE and kNN improved imputation for all traits, but adding climate did not lead to any appreciable improvement. However, forest structure and spatial structure did reduce imputation errors in maximum height and in leaf biomass to sapwood area ratios, respectively. Although species mean imputations showed the lowest error for 3 out of the 5 studied traits, dataset-averaged errors were lowest for MICE imputations with all additional predictors, when missing data levels were 50% or lower. Species mean imputations always resulted in larger errors in the correlation matrix and appreciably altered the studied bivariate trait relationships. In conclusion, MICE imputations using species identity, climate, forest structure and spatial structure as predictors emerged as the most suitable method of the ones tested here, but it was also evident that imputation performance deteriorates at high levels of missing data (80%).
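    As an illustration of the single-imputation kNN approach named in the record above, here is a minimal sketch, assuming a missing trait is filled with the mean of that trait over the k most similar complete rows. The function `knn_impute`, the trait names and the toy values are hypothetical, and the predictors are left unscaled for brevity; a real analysis would normally standardise them first.

```python
import math


def knn_impute(rows, target, predictors, k=2):
    """Fill missing `target` values with the mean of the k nearest donors.

    rows: list of dicts; a missing target is represented by None.
    predictors: keys that are observed in every row and define distance.
    """
    donors = [r for r in rows if r[target] is not None]
    filled = []
    for r in rows:
        if r[target] is not None:
            filled.append(dict(r))
            continue
        # Rank donors by Euclidean distance in predictor space.
        here = [r[p] for p in predictors]
        nearest = sorted(
            donors, key=lambda d: math.dist(here, [d[p] for p in predictors])
        )[:k]
        value = sum(d[target] for d in nearest) / len(nearest)
        filled.append({**r, target: value})
    return filled


# Toy trait table (values invented for illustration).
trees = [
    {"max_height": 10.0, "sla": 5.0, "wood_density": 0.50},
    {"max_height": 12.0, "sla": 6.0, "wood_density": 0.60},
    {"max_height": 30.0, "sla": 15.0, "wood_density": 0.90},
    {"max_height": 11.0, "sla": 5.5, "wood_density": None},
]
imputed = knn_impute(trees, "wood_density", ["max_height", "sla"], k=2)
```

    The fourth tree borrows from its two nearest neighbours rather than from the distant third tree, which is the property that distinguishes kNN from a plain overall-mean fill. Unlike MICE, this produces a single filled value with no estimate of between-imputation uncertainty.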

  12. Missing data imputation: focusing on single imputation.

    PubMed

    Zhang, Zhongheng

    2016-01-01

    Complete case analysis is widely used for handling missing data, and it is the default method in many statistical packages. However, this method may introduce bias, and some useful information will be omitted from analysis. Therefore, many imputation methods have been developed to fill the gap. The present article focuses on single imputation. Imputations with mean, median and mode are simple but, like complete case analysis, can introduce bias on mean and deviation. Furthermore, they ignore relationships with other variables. Regression imputation can preserve the relationship between missing values and other variables. Many sophisticated methods exist to handle missing values in longitudinal data. This article focuses primarily on how to implement R code to perform single imputation, while avoiding complex mathematical calculations.
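    The contrast drawn in the record above between mean imputation and regression imputation can be sketched as follows. This is an illustrative example written in Python rather than the R code the article itself presents; the helper names `mean_impute` and `regression_impute` and the toy data (constructed so that y = 2x) are assumptions.

```python
from statistics import mean, pstdev


def mean_impute(values):
    """Replace each missing value (None) with the mean of the observed ones."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]


def regression_impute(x, y):
    """Fill missing y from a least-squares line fitted on complete (x, y) pairs."""
    pairs = [(xi, yi) for xi, yi in zip(x, y) if yi is not None]
    x_bar = mean([xi for xi, _ in pairs])
    y_bar = mean([yi for _, yi in pairs])
    sxx = sum((xi - x_bar) ** 2 for xi, _ in pairs)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in pairs)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return [intercept + slope * xi if yi is None else yi for xi, yi in zip(x, y)]


# Toy data following y = 2x, with two values of y missing.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [2.0, 4.0, None, 8.0, None, 12.0]

y_mean = mean_impute(y)           # both gaps receive the observed mean
y_reg = regression_impute(x, y)   # gaps follow the fitted line

# Mean imputation shrinks the spread of y; regression imputation keeps
# the x-y relationship for the filled values.
spread_mean = pstdev(y_mean)
spread_reg = pstdev(y_reg)
```

    In this toy case the mean-imputed series has a smaller standard deviation than the regression-imputed one, which illustrates the bias on deviation that the abstract attributes to mean imputation.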

  13. Missing data imputation: focusing on single imputation

    PubMed Central

    2016-01-01

    Complete case analysis is widely used for handling missing data, and it is the default method in many statistical packages. However, this method may introduce bias, and some useful information will be omitted from analysis. Therefore, many imputation methods have been developed to fill the gap. The present article focuses on single imputation. Imputations with mean, median and mode are simple but, like complete case analysis, can introduce bias on mean and deviation. Furthermore, they ignore relationships with other variables. Regression imputation can preserve the relationship between missing values and other variables. Many sophisticated methods exist to handle missing values in longitudinal data. This article focuses primarily on how to implement R code to perform single imputation, while avoiding complex mathematical calculations. PMID:26855945

  14. Accounting for one-channel depletion improves missing value imputation in 2-dye microarray data.

    PubMed

    Ritz, Cecilia; Edén, Patrik

    2008-01-19

    For 2-dye microarray platforms, some missing values may arise from an un-measurably low RNA expression in one channel only. Information on such "one-channel depletion" is so far not included in algorithms for imputation of missing values. Calculating the mean deviation between imputed values and duplicate controls in five datasets, we show that KNN-based imputation gives a systematic bias of the imputed expression values of one-channel depleted spots. Evaluating the correction of this bias by cross-validation showed that the mean square deviation between imputed values and duplicates was reduced by up to 51%, depending on dataset. By including more information in the imputation step, we more accurately estimate missing expression values.

  15. Combining multiple imputation and meta-analysis with individual participant data

    PubMed Central

    Burgess, Stephen; White, Ian R; Resche-Rigon, Matthieu; Wood, Angela M

    2013-01-01

    Multiple imputation is a strategy for the analysis of incomplete data such that the impact of the missingness on the power and bias of estimates is mitigated. When data from multiple studies are collated, we can propose both within-study and multilevel imputation models to impute missing data on covariates. It is not clear how to choose between imputation models or how to combine imputation and inverse-variance weighted meta-analysis methods. This is especially important as often different studies measure data on different variables, meaning that we may need to impute data on a variable which is systematically missing in a particular study. In this paper, we consider a simulation analysis of sporadically missing data in a single covariate with a linear analysis model and discuss how the results would be applicable to the case of systematically missing data. We find in this context that ensuring the congeniality of the imputation and analysis models is important to give correct standard errors and confidence intervals. For example, if the analysis model allows between-study heterogeneity of a parameter, then we should incorporate this heterogeneity into the imputation model to maintain the congeniality of the two models. In an inverse-variance weighted meta-analysis, we should impute missing data and apply Rubin's rules at the study level prior to meta-analysis, rather than meta-analyzing each of the multiple imputations and then combining the meta-analysis estimates using Rubin's rules. We illustrate the results using data from the Emerging Risk Factors Collaboration. PMID:23703895
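    The ordering recommended in the record above (apply Rubin's rules within each study first, then run the inverse-variance weighted meta-analysis) can be sketched as below. This is a minimal fixed-effect illustration, not the authors' code; the function names, study labels and numbers are hypothetical.

```python
from statistics import mean, variance


def rubin_pool(estimates, variances):
    """Pool m imputation-specific estimates with Rubin's rules.

    Returns the pooled estimate and its total variance
    T = W + (1 + 1/m) * B, where W is the mean within-imputation variance
    and B is the between-imputation variance of the estimates.
    """
    m = len(estimates)
    q_bar = mean(estimates)
    within = mean(variances)
    between = variance(estimates)  # sample variance, divisor m - 1
    total = within + (1 + 1 / m) * between
    return q_bar, total


def inverse_variance_meta(estimates, variances):
    """Fixed-effect inverse-variance weighted pooled estimate and variance."""
    weights = [1 / v for v in variances]
    pooled = sum(w * e for w, e in zip(weights, estimates)) / sum(weights)
    return pooled, 1 / sum(weights)


# Hypothetical inputs: for each study, the estimates and variances
# obtained from m = 3 imputed data sets.
studies = {
    "study_A": ([0.50, 0.54, 0.46], [0.010, 0.010, 0.010]),
    "study_B": ([0.30, 0.34, 0.32], [0.020, 0.020, 0.020]),
}

# Step 1: Rubin's rules at the study level.
study_results = [rubin_pool(est, var) for est, var in studies.values()]

# Step 2: meta-analyse the study-level pooled estimates.
meta_est, meta_var = inverse_variance_meta(
    [q for q, _ in study_results], [t for _, t in study_results]
)
```

    Because the between-imputation variance enters each study's total variance before weighting, studies with more imputation uncertainty receive less weight in the second step. A random-effects model for between-study heterogeneity is outside the scope of this sketch.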

  16. Multiple imputation of missing covariates for the Cox proportional hazards cure model

    PubMed Central

    Beesley, Lauren J; Bartlett, Jonathan W; Wolf, Gregory T; Taylor, Jeremy M G

    2016-01-01

    We explore several approaches for imputing partially observed covariates when the outcome of interest is a censored event time and when there is an underlying subset of the population that will never experience the event of interest. We call these subjects “cured,” and we consider the case where the data are modeled using a Cox proportional hazards (CPH) mixture cure model. We study covariate imputation approaches using fully conditional specification (FCS). We derive the exact conditional distribution and suggest a sampling scheme for imputing partially observed covariates in the CPH cure model setting. We also propose several approximations to the exact distribution that are simpler and more convenient to use for imputation. A simulation study demonstrates that the proposed imputation approaches outperform existing imputation approaches for survival data without a cure fraction in terms of bias in estimating CPH cure model parameters. We apply our multiple imputation techniques to a study of patients with head and neck cancer. PMID:27439726

  17. GSimp: A Gibbs sampler based left-censored missing value imputation approach for metabolomics studies

    PubMed Central

    Jia, Erik; Chen, Tianlu

    2018-01-01

    Left-censored missing values commonly exist in targeted metabolomics datasets and can be considered as missing not at random (MNAR). Improper data processing procedures for missing values will cause adverse impacts on subsequent statistical analyses. However, few imputation methods have been developed and applied to the situation of MNAR in the field of metabolomics. Thus, a practical left-censored missing value imputation method is urgently needed. We developed an iterative Gibbs sampler based left-censored missing value imputation approach (GSimp). We compared GSimp with three other imputation methods on two real-world targeted metabolomics datasets and one simulation dataset using our imputation evaluation pipeline. The results show that GSimp outperforms other imputation methods in terms of imputation accuracy, observation distribution, univariate and multivariate analyses, and statistical sensitivity. Additionally, a parallel version of GSimp was developed for dealing with large scale metabolomics datasets. The R code for GSimp, evaluation pipeline, tutorial, real-world and simulated targeted metabolomics datasets are available at: https://github.com/WandeRum/GSimp. PMID:29385130

  18. Propensity score analysis with partially observed covariates: How should multiple imputation be used?

    PubMed

    Leyrat, Clémence; Seaman, Shaun R; White, Ian R; Douglas, Ian; Smeeth, Liam; Kim, Joseph; Resche-Rigon, Matthieu; Carpenter, James R; Williamson, Elizabeth J

    2017-01-01

    Inverse probability of treatment weighting is a popular propensity score-based approach to estimate marginal treatment effects in observational studies at risk of confounding bias. A major issue when estimating the propensity score is the presence of partially observed covariates. Multiple imputation is a natural approach to handle missing data on covariates: covariates are imputed and a propensity score analysis is performed in each imputed dataset to estimate the treatment effect. The treatment effect estimates from each imputed dataset are then combined to obtain an overall estimate. We call this method MIte. However, an alternative approach has been proposed, in which the propensity scores are combined across the imputed datasets (MIps). Therefore, there are remaining uncertainties about how to implement multiple imputation for propensity score analysis: (a) should we apply Rubin's rules to the inverse probability of treatment weighting treatment effect estimates or to the propensity score estimates themselves? (b) does the outcome have to be included in the imputation model? (c) how should we estimate the variance of the inverse probability of treatment weighting estimator after multiple imputation? We studied the consistency and balancing properties of the MIte and MIps estimators and performed a simulation study to empirically assess their performance for the analysis of a binary outcome. We also compared the performance of these methods to complete case analysis and the missingness pattern approach, which uses a different propensity score model for each pattern of missingness, and a third multiple imputation approach in which the propensity score parameters are combined rather than the propensity scores themselves (MIpar). 
Under a missing at random mechanism, complete case and missingness pattern analyses were biased in most cases for estimating the marginal treatment effect, whereas multiple imputation approaches were approximately unbiased as long as the outcome was included in the imputation model. Only MIte was unbiased in all the studied scenarios and Rubin's rules provided good variance estimates for MIte. The propensity score estimated in the MIte approach showed good balancing properties. In conclusion, when using multiple imputation in the inverse probability of treatment weighting context, MIte with the outcome included in the imputation model is the preferred approach.

  19. Imputation-Based Meta-Analysis of Severe Malaria in Three African Populations

    PubMed Central

    Band, Gavin; Le, Quang Si; Jostins, Luke; Pirinen, Matti; Kivinen, Katja; Jallow, Muminatou; Sisay-Joof, Fatoumatta; Bojang, Kalifa; Pinder, Margaret; Sirugo, Giorgio; Conway, David J.; Nyirongo, Vysaul; Kachala, David; Molyneux, Malcolm; Taylor, Terrie; Ndila, Carolyne; Peshu, Norbert; Marsh, Kevin; Williams, Thomas N.; Alcock, Daniel; Andrews, Robert; Edkins, Sarah; Gray, Emma; Hubbart, Christina; Jeffreys, Anna; Rowlands, Kate; Schuldt, Kathrin; Clark, Taane G.; Small, Kerrin S.; Teo, Yik Ying; Kwiatkowski, Dominic P.; Rockett, Kirk A.; Barrett, Jeffrey C.; Spencer, Chris C. A.

    2013-01-01

    Combining data from genome-wide association studies (GWAS) conducted at different locations, using genotype imputation and fixed-effects meta-analysis, has been a powerful approach for dissecting complex disease genetics in populations of European ancestry. Here we investigate the feasibility of applying the same approach in Africa, where genetic diversity, both within and between populations, is far more extensive. We analyse genome-wide data from approximately 5,000 individuals with severe malaria and 7,000 population controls from three different locations in Africa. Our results show that the standard approach is well powered to detect known malaria susceptibility loci when sample sizes are large, and that modern methods for association analysis can control the potential confounding effects of population structure. We show that the pattern of association around the haemoglobin S allele differs substantially across populations due to differences in haplotype structure. Motivated by these observations we consider new approaches to association analysis that might prove valuable for multicentre GWAS in Africa: we relax the assumptions of SNP–based fixed effect analysis; we apply Bayesian approaches to allow for heterogeneity in the effect of an allele on risk across studies; and we introduce a region-based test to allow for heterogeneity in the location of causal alleles. PMID:23717212

  20. Outlier Removal in Model-Based Missing Value Imputation for Medical Datasets.

    PubMed

    Huang, Min-Wei; Lin, Wei-Chao; Tsai, Chih-Fong

    2018-01-01

    Many real-world medical datasets contain some proportion of missing (attribute) values. In general, missing value imputation can be performed to solve this problem, which is to provide estimations for the missing values by a reasoning process based on the (complete) observed data. However, if the observed data contain some noisy information or outliers, the estimations of the missing values may not be reliable or may even be quite different from the real values. The aim of this paper is to examine whether a combination of instance selection from the observed data and missing value imputation offers better performance than performing missing value imputation alone. In particular, three instance selection algorithms, DROP3, GA, and IB3, and three imputation algorithms, KNNI, MLP, and SVM, are used in order to find out the best combination. The experimental results show that performing instance selection can have a positive impact on missing value imputation over the numerical data type of medical datasets, and specific combinations of instance selection and imputation methods can improve the imputation results over the mixed data type of medical datasets. However, instance selection does not have a definitely positive impact on the imputation result for categorical medical datasets.

  1. Comparison of Random Forest and Parametric Imputation Models for Imputing Missing Data Using MICE: A CALIBER Study

    PubMed Central

    Shah, Anoop D.; Bartlett, Jonathan W.; Carpenter, James; Nicholas, Owen; Hemingway, Harry

    2014-01-01

    Multivariate imputation by chained equations (MICE) is commonly used for imputing missing data in epidemiologic research. The “true” imputation model may contain nonlinearities which are not included in default imputation models. Random forest imputation is a machine learning technique which can accommodate nonlinearities and interactions and does not require a particular regression model to be specified. We compared parametric MICE with a random forest-based MICE algorithm in 2 simulation studies. The first study used 1,000 random samples of 2,000 persons drawn from the 10,128 stable angina patients in the CALIBER database (Cardiovascular Disease Research using Linked Bespoke Studies and Electronic Records; 2001–2010) with complete data on all covariates. Variables were artificially made “missing at random,” and the bias and efficiency of parameter estimates obtained using different imputation methods were compared. Both MICE methods produced unbiased estimates of (log) hazard ratios, but random forest was more efficient and produced narrower confidence intervals. The second study used simulated data in which the partially observed variable depended on the fully observed variables in a nonlinear way. Parameter estimates were less biased using random forest MICE, and confidence interval coverage was better. This suggests that random forest imputation may be useful for imputing complex epidemiologic data sets in which some patients have missing data. PMID:24589914

  2. Comparison of random forest and parametric imputation models for imputing missing data using MICE: a CALIBER study.

    PubMed

    Shah, Anoop D; Bartlett, Jonathan W; Carpenter, James; Nicholas, Owen; Hemingway, Harry

    2014-03-15

    Multivariate imputation by chained equations (MICE) is commonly used for imputing missing data in epidemiologic research. The "true" imputation model may contain nonlinearities which are not included in default imputation models. Random forest imputation is a machine learning technique which can accommodate nonlinearities and interactions and does not require a particular regression model to be specified. We compared parametric MICE with a random forest-based MICE algorithm in 2 simulation studies. The first study used 1,000 random samples of 2,000 persons drawn from the 10,128 stable angina patients in the CALIBER database (Cardiovascular Disease Research using Linked Bespoke Studies and Electronic Records; 2001-2010) with complete data on all covariates. Variables were artificially made "missing at random," and the bias and efficiency of parameter estimates obtained using different imputation methods were compared. Both MICE methods produced unbiased estimates of (log) hazard ratios, but random forest was more efficient and produced narrower confidence intervals. The second study used simulated data in which the partially observed variable depended on the fully observed variables in a nonlinear way. Parameter estimates were less biased using random forest MICE, and confidence interval coverage was better. This suggests that random forest imputation may be useful for imputing complex epidemiologic data sets in which some patients have missing data.

  3. Accuracy of genomic predictions in Gyr (Bos indicus) dairy cattle.

    PubMed

    Boison, S A; Utsunomiya, A T H; Santos, D J A; Neves, H H R; Carvalheiro, R; Mészáros, G; Utsunomiya, Y T; do Carmo, A S; Verneque, R S; Machado, M A; Panetto, J C C; Garcia, J F; Sölkner, J; da Silva, M V G B

    2017-07-01

    Genomic selection may accelerate genetic progress in breeding programs of indicine breeds when compared with traditional selection methods. We present results of genomic predictions in Gyr (Bos indicus) dairy cattle of Brazil for milk yield (MY), fat yield (FY), protein yield (PY), and age at first calving using information from bulls and cows. Four different single nucleotide polymorphism (SNP) chips were studied. Additionally, the effect of the use of imputed data on genomic prediction accuracy was studied. A total of 474 bulls and 1,688 cows were genotyped with the Illumina BovineHD (HD; San Diego, CA) and BovineSNP50 (50K) chip, respectively. Genotypes of cows were imputed to HD using FImpute v2.2. After quality check of data, 496,606 markers remained. The HD markers present on the GeneSeek SGGP-20Ki (15,727; Lincoln, NE), 50K (22,152), and GeneSeek GGP-75Ki (65,018) were subset and used to assess the effect of lower SNP density on accuracy of prediction. Deregressed breeding values were used as pseudophenotypes for model training. Data were split into reference and validation to mimic a forward prediction scheme. The reference population consisted of animals whose birth year was ≤2004 and consisted of either only bulls (TR1) or a combination of bulls and dams (TR2), whereas the validation set consisted of younger bulls (born after 2004). Genomic BLUP was used to estimate genomic breeding values (GEBV) and reliability of GEBV (R²PEV) was based on the prediction error variance approach. Reliability of GEBV ranged from ∼0.46 (FY and PY) to 0.56 (MY) with TR1 and from 0.51 (PY) to 0.65 (MY) with TR2. When averaged across all traits, R²PEV were substantially higher (R²PEV of TR1 = 0.50 and TR2 = 0.57) compared with reliabilities of parent averages (0.35) computed from pedigree data and based on diagonals of the coefficient matrix (prediction error variance approach). Reliability was similar for all the 4 marker panels using either TR1 or TR2, except that imputed HD cow data set led to an inflation of reliability. Reliability of GEBV could be increased by enlarging the limited bull reference population with cow information. A reduced panel of ∼15K markers resulted in reliabilities similar to using HD markers. Copyright © 2017 American Dairy Science Association. Published by Elsevier Inc. All rights reserved.

  4. Binary variable multiple-model multiple imputation to address missing data mechanism uncertainty: Application to a smoking cessation trial

    PubMed Central

    Siddique, Juned; Harel, Ofer; Crespi, Catherine M.; Hedeker, Donald

    2014-01-01

    The true missing data mechanism is never known in practice. We present a method for generating multiple imputations for binary variables that formally incorporates missing data mechanism uncertainty. Imputations are generated from a distribution of imputation models rather than a single model, with the distribution reflecting subjective notions of missing data mechanism uncertainty. Parameter estimates and standard errors are obtained using rules for nested multiple imputation. Using simulation, we investigate the impact of missing data mechanism uncertainty on post-imputation inferences and show that incorporating this uncertainty can increase the coverage of parameter estimates. We apply our method to a longitudinal smoking cessation trial where nonignorably missing data were a concern. Our method provides a simple approach for formalizing subjective notions regarding nonresponse and can be implemented using existing imputation software. PMID:24634315
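
    As a point of reference for how multiply imputed estimates are combined, the sketch below applies the ordinary (single-level) Rubin's rules. It is not the nested combining rules the abstract refers to, which add a further variance component for the distribution of imputation models. The function name pool_rubin and the example numbers are illustrative assumptions.

```python
from statistics import mean, variance


def pool_rubin(estimates, variances):
    """Pool m point estimates and their squared standard errors (Rubin's rules)."""
    m = len(estimates)
    if m < 2 or m != len(variances):
        raise ValueError("need at least two imputations with matching variances")
    q_bar = mean(estimates)          # pooled point estimate
    u_bar = mean(variances)          # average within-imputation variance
    b = variance(estimates)          # between-imputation variance (m - 1 denominator)
    total = u_bar + (1 + 1 / m) * b  # total variance of the pooled estimate
    return q_bar, total


# Hypothetical estimates and variances from m = 3 imputed data sets
pooled_estimate, pooled_variance = pool_rubin([1.0, 2.0, 3.0], [0.5, 0.5, 0.5])
print(pooled_estimate, pooled_variance)
```

    The between-imputation term is what carries the uncertainty due to the missing data; a method that widens the spread of imputations across models would enlarge it.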

  5. A procedure for linking psychosocial job characteristics data to health surveys.

    PubMed Central

    Schwartz, J E; Pieper, C F; Karasek, R A

    1988-01-01

    A system is presented for linking information about psychosocial characteristics of job situations to national health surveys. Job information can be imputed to individuals on surveys that contain three-digit US Census occupation codes. Occupational mean scores on psychosocial job characteristics-control over task situation (decision latitude), psychological work load, physical exertion, and other measures-for the linkage system are derived from US national surveys of working conditions (Quality of Employment Surveys 1969, 1972, and 1977). This paper discusses a new method for reducing the biases in multivariate analyses that are likely to arise when utilizing linkage systems based on mean scores. Such biases are reduced by modifying the linkage system to adjust imputed individual scores for demographic factors such as age, education, race, marital status and, implicitly, sex (since men and women have separate linkage data bases). Statistics on the linkage system's efficiency and reliability are reported. All dimensions have high inter-survey reproducibility. Despite their psychosocial nature, decision latitude and physical exertion can be more efficiently imputed with the linkage system than earnings (a non-psychosocial job characteristic). The linkage system presented here is a useful tool for initial epidemiological studies of the consequences of psychosocial job characteristics and constitutes the methodological basis for the subsequent paper. PMID:3389426
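
    The basic linkage step described above, attaching an occupational mean score to each survey respondent by occupation code, can be sketched as follows. This covers only the unadjusted mean-score linkage; the demographic adjustment the paper proposes is not reproduced. The occupation codes, scores, and function names are hypothetical.

```python
from collections import defaultdict
from statistics import mean


def occupation_means(source_rows):
    """Average a job-characteristic score within each occupation code."""
    scores = defaultdict(list)
    for code, score in source_rows:
        scores[code].append(score)
    return {code: mean(values) for code, values in scores.items()}


def link_scores(survey_rows, means):
    """Attach the occupational mean to each respondent (None if the code is unseen)."""
    return [(person, code, means.get(code)) for person, code in survey_rows]


# Hypothetical (occupation code, decision-latitude score) pairs from a source survey
source = [("305", 30.0), ("305", 34.0), ("610", 20.0)]
# Hypothetical (respondent id, occupation code) pairs from a health survey
survey = [("r1", "305"), ("r2", "610"), ("r3", "999")]
linked = link_scores(survey, occupation_means(source))
print(linked)
```

    Because every respondent in an occupation receives the same value, within-occupation variation is lost, which is the source of the bias the adjusted system is meant to reduce.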

  6. Alternative Multiple Imputation Inference for Mean and Covariance Structure Modeling

    ERIC Educational Resources Information Center

    Lee, Taehun; Cai, Li

    2012-01-01

    Model-based multiple imputation has become an indispensable method in the educational and behavioral sciences. Mean and covariance structure models are often fitted to multiply imputed data sets. However, the presence of multiple random imputations complicates model fit testing, which is an important aspect of mean and covariance structure…

  7. Comprehensive Search for Alzheimer Disease Susceptibility Loci in the APOE Region

    PubMed Central

    Jun, Gyungah; Vardarajan, Badri N.; Buros, Jacqueline; Yu, Chang-En; Hawk, Michele V.; Dombroski, Beth A.; Crane, Paul K.; Larson, Eric B.; Mayeux, Richard; Haines, Jonathan L.; Lunetta, Kathryn L.; Pericak-Vance, Margaret A.; Schellenberg, Gerard D.; Farrer, Lindsay A.

    2013-01-01

    Objective: To evaluate the association of risk and age at onset (AAO) of Alzheimer disease (AD) with single-nucleotide polymorphisms (SNPs) in the chromosome 19 region including apolipoprotein E (APOE) and a repeat-length polymorphism in TOMM40 (poly-T, rs10524523). Design: Conditional logistic regression models and survival analysis. Setting: Fifteen genome-wide association study data sets assembled by the Alzheimer's Disease Genetics Consortium. Participants: Eleven thousand eight hundred forty AD cases and 10 931 cognitively normal elderly controls. Main Outcome Measures: Association of AD risk and AAO with genotyped and imputed SNPs located in an 800-Mb region including APOE in the entire Alzheimer's Disease Genetics Consortium data set and with the TOMM40 poly-T marker genotyped in a subset of 1256 cases and 1605 controls. Results: In models adjusting for APOE ε4, no SNPs in the entire region were significantly associated with AAO at P<.001. Rs10524523 was not significantly associated with AD or AAO in models adjusting for APOE genotype or within the subset of ε3/ε3 subjects. Conclusions: APOE alleles ε2, ε3, and ε4 account for essentially all the inherited risk of AD associated with this region. Other variants including a poly-T track in TOMM40 are not independent risk or AAO loci. PMID:22869155

  8. Gaussian mixture clustering and imputation of microarray data.

    PubMed

    Ouyang, Ming; Welsh, William J; Georgopoulos, Panos

    2004-04-12

    In microarray experiments, missing entries arise from blemishes on the chips. In large-scale studies, virtually every chip contains some missing entries and more than 90% of the genes are affected. Many analysis methods require a full set of data. Either those genes with missing entries are excluded, or the missing entries are filled with estimates prior to the analyses. This study compares methods of missing value estimation. Two evaluation metrics of imputation accuracy are employed. First, the root mean squared error measures the difference between the true values and the imputed values. Second, the number of mis-clustered genes measures the difference between clustering with true values and that with imputed values; it examines the bias introduced by imputation to clustering. The Gaussian mixture clustering with model averaging imputation is superior to all other imputation methods, according to both evaluation metrics, on both time-series (correlated) and non-time series (uncorrelated) data sets.
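
    The first evaluation metric can be illustrated with a small sketch: entries are masked in a complete matrix, filled by some imputation method, and the root mean squared error is computed over the masked entries only. Row-mean imputation is used here purely as a simple stand-in, not the Gaussian mixture method of the study, and all values are made up.

```python
import math


def row_mean_impute(matrix):
    """Replace None entries with the mean of the observed values in the same row."""
    filled = []
    for row in matrix:
        observed = [v for v in row if v is not None]
        if not observed:
            raise ValueError("row has no observed values")
        row_mean = sum(observed) / len(observed)
        filled.append([row_mean if v is None else v for v in row])
    return filled


def masked_rmse(truth, masked, imputed):
    """RMSE between true and imputed values, restricted to the masked entries."""
    squared = [
        (imputed[i][j] - truth[i][j]) ** 2
        for i, row in enumerate(masked)
        for j, v in enumerate(row)
        if v is None
    ]
    return math.sqrt(sum(squared) / len(squared))


# Hypothetical expression values (rows = genes, columns = arrays)
truth = [[1.0, 2.0, 3.0], [4.0, 6.0, 8.0]]
masked = [[1.0, None, 3.0], [4.0, 6.0, None]]
score = masked_rmse(truth, masked, row_mean_impute(masked))
print(score)
```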

  9. Multiple Imputation of Multilevel Missing Data-Rigor versus Simplicity

    ERIC Educational Resources Information Center

    Drechsler, Jörg

    2015-01-01

    Multiple imputation is widely accepted as the method of choice to address item-nonresponse in surveys. However, research on imputation strategies for the hierarchical structures that are typically found in the data in educational contexts is still limited. While a multilevel imputation model should be preferred from a theoretical point of view if…

  10. A Comparison of Item-Level and Scale-Level Multiple Imputation for Questionnaire Batteries

    ERIC Educational Resources Information Center

    Gottschall, Amanda C.; West, Stephen G.; Enders, Craig K.

    2012-01-01

    Behavioral science researchers routinely use scale scores that sum or average a set of questionnaire items to address their substantive questions. A researcher applying multiple imputation to incomplete questionnaire data can either impute the incomplete items prior to computing scale scores or impute the scale scores directly from other scale…

  11. Multiple imputation to deal with missing EQ-5D-3L data: Should we impute individual domains or the actual index?

    PubMed

    Simons, Claire L; Rivero-Arias, Oliver; Yu, Ly-Mee; Simon, Judit

    2015-04-01

    Missing data are a well-known and widely documented problem in cost-effectiveness analyses alongside clinical trials using individual patient-level data. Current methodological research recommends multiple imputation (MI) to deal with missing health outcome data, but there is little guidance on whether MI for multi-attribute questionnaires, such as the EQ-5D-3L, should be carried out at domain or at summary score level. In this paper, we evaluated the impact of imputing individual domains versus imputing index values to deal with missing EQ-5D-3L data using a simulation study and developed recommendations for future practice. We simulated missing data in a patient-level dataset with complete EQ-5D-3L data at one point in time from a large multinational clinical trial (n = 1,814). Different proportions of missing data were generated using a missing at random (MAR) mechanism and three different scenarios were studied. The performance of using each method was evaluated using root mean squared error and mean absolute error of the actual versus predicted EQ-5D-3L indices. In large sample sizes (n > 500) and a missing data pattern that follows mainly unit non-response, imputing domains or the index produced similar results. However, domain imputation became more accurate than index imputation with pattern of missingness following an item non-response. For smaller sample sizes (n < 100), index imputation was more accurate. When MI models were misspecified, both domain and index imputations were inaccurate for any proportion of missing data. The decision between imputing the domains or the EQ-5D-3L index scores depends on the observed missing data pattern and the sample size available for analysis. Analysts conducting this type of exercises should also evaluate the sensitivity of the analysis to the MAR assumption and whether the imputation model is correctly specified.

  12. Gap-filling a spatially explicit plant trait database: comparing imputation methods and different levels of environmental information

    NASA Astrophysics Data System (ADS)

    Poyatos, Rafael; Sus, Oliver; Badiella, Llorenç; Mencuccini, Maurizio; Martínez-Vilalta, Jordi

    2018-05-01

    The ubiquity of missing data in plant trait databases may hinder trait-based analyses of ecological patterns and processes. Spatially explicit datasets with information on intraspecific trait variability are rare but offer great promise in improving our understanding of functional biogeography. At the same time, they offer specific challenges in terms of data imputation. Here we compare statistical imputation approaches, using varying levels of environmental information, for five plant traits (leaf biomass to sapwood area ratio, leaf nitrogen content, maximum tree height, leaf mass per area and wood density) in a spatially explicit plant trait dataset of temperate and Mediterranean tree species (Ecological and Forest Inventory of Catalonia, IEFC, dataset for Catalonia, north-east Iberian Peninsula, 31 900 km²). We simulated gaps at different missingness levels (10-80 %) in a complete trait matrix, and we used overall trait means, species means, k nearest neighbours (kNN), ordinary and regression kriging, and multivariate imputation using chained equations (MICE) to impute missing trait values. We assessed these methods in terms of their accuracy and of their ability to preserve trait distributions, multi-trait correlation structure and bivariate trait relationships. The relatively good performance of mean and species mean imputations in terms of accuracy masked a poor representation of trait distributions and multivariate trait structure. Species identity improved MICE imputations for all traits, whereas forest structure and topography improved imputations for some traits. No method performed best consistently for the five studied traits, but, considering all traits and performance metrics, MICE informed by relevant ecological variables gave the best results. However, at higher missingness (> 30 %), species mean imputations and regression kriging tended to outperform MICE for some traits. MICE informed by relevant ecological variables allowed us to fill the gaps in the IEFC incomplete dataset (5495 plots) and quantify imputation uncertainty. Resulting spatial patterns of the studied traits in Catalan forests were broadly similar when using species means, regression kriging or the best-performing MICE application, but some important discrepancies were observed at the local level. Our results highlight the need to assess imputation quality beyond just imputation accuracy and show that including environmental information in statistical imputation approaches yields more plausible imputations in spatially explicit plant trait datasets.
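
    Of the methods compared above, k nearest neighbours is the easiest to show in a few lines. The sketch below is a generic kNN imputation under simplifying assumptions (unscaled Euclidean distance over shared observed columns, unweighted donor mean); it is not the configuration used in the study, and the trait values are hypothetical.

```python
import math


def knn_impute(rows, target, column, k=2):
    """Estimate rows[target][column] as the mean of that column over the k nearest donors."""

    def distance(a, b):
        # Compare only columns observed in both rows, excluding the one being imputed.
        shared = [
            i for i in range(len(a))
            if i != column and a[i] is not None and b[i] is not None
        ]
        if not shared:
            return math.inf
        return math.sqrt(sum((a[i] - b[i]) ** 2 for i in shared) / len(shared))

    donors = [
        (distance(rows[target], row), row[column])
        for j, row in enumerate(rows)
        if j != target and row[column] is not None
    ]
    donors = [d for d in donors if d[0] < math.inf]
    if not donors:
        raise ValueError("no usable donor rows")
    donors.sort(key=lambda d: d[0])
    nearest = donors[:k]
    return sum(value for _, value in nearest) / len(nearest)


# Hypothetical plots: [wood density, maximum height]; the last plot lacks height
plots = [[1.0, 10.0], [1.1, 12.0], [5.0, 50.0], [1.05, None]]
estimate = knn_impute(plots, target=3, column=1, k=2)
print(estimate)
```

    In practice the predictor columns would usually be standardized first so that no single trait dominates the distance.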

  13. Plant genotypic diversity reduces the rate of consumer resource utilization

    PubMed Central

    McArt, Scott H.; Thaler, Jennifer S.

    2013-01-01

    While plant species diversity can reduce herbivore densities and herbivory, little is known regarding how plant genotypic diversity alters resource utilization by herbivores. Here, we show that an invasive folivore—the Japanese beetle (Popillia japonica)—increases 28 per cent in abundance, but consumes 24 per cent less foliage in genotypic polycultures compared with monocultures of the common evening primrose (Oenothera biennis). We found strong complementarity for reduced herbivore damage among plant genotypes growing in polycultures and a weak dominance effect of particularly resistant genotypes. Sequential feeding by P. japonica on different genotypes from polycultures resulted in reduced consumption compared with feeding on different plants of the same genotype from monocultures. Thus, diet mixing among plant genotypes reduced herbivore consumption efficiency. Despite positive complementarity driving an increase in fruit production in polycultures, we observed a trade-off between complementarity for increased plant productivity and resistance to herbivory, suggesting costs in the complementary use of resources by plant genotypes may manifest across trophic levels. These results elucidate mechanisms for how plant genotypic diversity simultaneously alters resource utilization by both producers and consumers, and show that population genotypic diversity can increase the resistance of a native plant to an invasive herbivore. PMID:23658201

  14. Plant genotypic diversity reduces the rate of consumer resource utilization.

    PubMed

    McArt, Scott H; Thaler, Jennifer S

    2013-07-07

    While plant species diversity can reduce herbivore densities and herbivory, little is known regarding how plant genotypic diversity alters resource utilization by herbivores. Here, we show that an invasive folivore--the Japanese beetle (Popillia japonica)--increases 28 per cent in abundance, but consumes 24 per cent less foliage in genotypic polycultures compared with monocultures of the common evening primrose (Oenothera biennis). We found strong complementarity for reduced herbivore damage among plant genotypes growing in polycultures and a weak dominance effect of particularly resistant genotypes. Sequential feeding by P. japonica on different genotypes from polycultures resulted in reduced consumption compared with feeding on different plants of the same genotype from monocultures. Thus, diet mixing among plant genotypes reduced herbivore consumption efficiency. Despite positive complementarity driving an increase in fruit production in polycultures, we observed a trade-off between complementarity for increased plant productivity and resistance to herbivory, suggesting costs in the complementary use of resources by plant genotypes may manifest across trophic levels. These results elucidate mechanisms for how plant genotypic diversity simultaneously alters resource utilization by both producers and consumers, and show that population genotypic diversity can increase the resistance of a native plant to an invasive herbivore.

  15. Nonsyndromic cleft palate: An association study at GWAS candidate loci in a multiethnic sample.

    PubMed

    Ishorst, Nina; Francheschelli, Paola; Böhmer, Anne C; Khan, Mohammad Faisal J; Heilmann-Heimbach, Stefanie; Fricker, Nadine; Little, Julian; Steegers-Theunissen, Regine P M; Peterlin, Borut; Nowak, Stefanie; Martini, Markus; Kruse, Teresa; Dunsche, Anton; Kreusch, Thomas; Gölz, Lina; Aldhorae, Khalid; Halboub, Esam; Reutter, Heiko; Mossey, Peter; Nöthen, Markus M; Rubini, Michele; Ludwig, Kerstin U; Knapp, Michael; Mangold, Elisabeth

    2018-06-01

    Nonsyndromic cleft palate only (nsCPO) is a common and multifactorial form of orofacial clefting. In contrast to successes achieved for the other common form of orofacial clefting, that is, nonsyndromic cleft lip with/without cleft palate (nsCL/P), genome wide association studies (GWAS) of nsCPO have identified only one genome wide significant locus. The aim of the present study was to investigate whether common variants contribute to nsCPO and, if so, to identify novel risk loci. We genotyped 33 SNPs at 27 candidate loci from 2 previously published nsCPO GWAS in an independent multiethnic sample. It included: (i) a family-based sample of European ancestry (n = 212); and (ii) two case/control samples of Central European (n = 94/339) and Arabian ancestry (n = 38/231), respectively. A separate association analysis was performed for each genotyped dataset, and meta-analyses were performed. After association analysis and meta-analyses, none of the 33 SNPs showed genome-wide significance. Two variants showed nominally significant association in the imputed GWAS dataset and exhibited a further decrease in p-value in a European and an overall meta-analysis including imputed GWAS data, respectively (rs395572: P_MetaEU = 3.16 × 10⁻⁴; rs6809420: P_MetaAll = 2.80 × 10⁻⁴). Our findings suggest that there is a limited contribution of common variants to nsCPO. However, the individual effect sizes might be too small for detection of further associations in the present sample sizes. Rare variants may play a more substantial role in nsCPO than in nsCL/P, for which GWAS of smaller sample sizes have identified genome-wide significant loci. Whole-exome/genome sequencing studies of nsCPO are now warranted. © 2018 Wiley Periodicals, Inc.

  16. A Genome-Wide Association Study to Identify Genomic Modulators of Rate Control Therapy in Patients with Atrial Fibrillation

    PubMed Central

    Kolek, Matthew J.; Edwards, Todd L.; Muhammad, Raafia; Balouch, Adnan; Shoemaker, M. Benjamin; Blair, Marcia A.; Kor, Kaylen C.; Takahashi, Atsushi; Kubo, Michiaki; Roden, Dan M.; Tanaka, Toshihiro; Darbar, Dawood

    2014-01-01

    For many patients with atrial fibrillation (AF), ventricular rate control with atrioventricular (AV) nodal blockers is considered first-line therapy, though response to treatment is highly variable. Using an extreme phenotype of failure of rate control necessitating AV nodal ablation and pacemaker implantation, we conducted a genome wide association study (GWAS) to identify genomic modulators of rate control therapy. Cases included 95 patients who failed rate control therapy. Controls (N=190) achieved adequate rate control therapy with ≤2 AV nodal blockers using a conventional clinical definition. Genotyping was performed on the Illumina 610-Quad platform, and results were imputed to the 1000 Genomes reference haplotypes. 554,041 single nucleotide polymorphisms (SNPs) met criteria for minor allele frequency (>0.01), call rate (>95%), and quality control, and 6,055,224 SNPs were available after imputation. No SNP reached the canonical threshold for significance for GWAS of P<5 × 10⁻⁸. Sixty-three SNPs with P<10⁻⁵ at 6 genomic loci were genotyped in a validation cohort of 130 cases and 157 controls. These included 6q24.3 (near SAMD5/SASH1, P=9.36 × 10⁻⁸), 4q12 (IGFBP7, P=1.75 × 10⁻⁷), 6q22.33 (C6orf174, P=4.86 × 10⁻⁷), 3p21.31 (CDCP1, P=1.18 × 10⁻⁶), 12p12.1 (SOX5, P=1.62 × 10⁻⁶), and 7p11 (LANCL2, P=6.51 × 10⁻⁶). However, none of these were significant in the replication cohort or in a meta-analysis of both cohorts. In conclusion, we identified several potentially important genomic modulators of rate control therapy in AF, particularly SOX5, which was previously associated with resting heart rate and PR interval. However, these failed to reach genome-wide significance. PMID:25015694
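
    The two numeric SNP filters quoted in the abstract (minor allele frequency > 0.01 and call rate > 95%) can be sketched as below. The genotype coding (0, 1 or 2 copies of the alternate allele, None for a failed call) and the function names are assumptions made for illustration; the study's additional quality-control criteria are not specified in the abstract and are not modelled.

```python
def call_rate(genotypes):
    """Fraction of samples with a non-missing genotype call."""
    return sum(g is not None for g in genotypes) / len(genotypes)


def minor_allele_frequency(genotypes):
    """Frequency of the rarer allele among called genotypes (0/1/2 allele counts)."""
    called = [g for g in genotypes if g is not None]
    if not called:
        return 0.0
    alt_frequency = sum(called) / (2 * len(called))
    return min(alt_frequency, 1 - alt_frequency)


def passes_qc(genotypes, min_maf=0.01, min_call_rate=0.95):
    """Apply the two thresholds quoted in the abstract as strict inequalities."""
    return (
        minor_allele_frequency(genotypes) > min_maf
        and call_rate(genotypes) > min_call_rate
    )


# Hypothetical SNPs typed in 20 samples
common_snp = [0, 1, 2, 1] * 5                 # fully called, allele frequency 0.5
sparse_snp = [0, 1, None, None] + [0] * 16    # call rate 0.90
monomorphic_snp = [0] * 20                    # no minor allele observed
print(passes_qc(common_snp), passes_qc(sparse_snp), passes_qc(monomorphic_snp))
```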

  17. Partitioning error components for accuracy-assessment of near-neighbor methods of imputation

    Treesearch

    Albert R. Stage; Nicholas L. Crookston

    2007-01-01

    Imputation is applied for two quite different purposes: to supply missing data to complete a data set for subsequent modeling analyses or to estimate subpopulation totals. Error properties of the imputed values have different effects in these two contexts. We partition errors of imputation derived from similar observation units as arising from three sources:...

  18. Multiple Imputation of Item Scores in Test and Questionnaire Data, and Influence on Psychometric Results

    ERIC Educational Resources Information Center

    van Ginkel, Joost R.; van der Ark, L. Andries; Sijtsma, Klaas

    2007-01-01

    The performance of five simple multiple imputation methods for dealing with missing data was compared. In addition, random imputation and multivariate normal imputation were used as lower and upper benchmark, respectively. Test data were simulated and item scores were deleted such that they were either missing completely at random, missing at…

  19. SPSS Syntax for Missing Value Imputation in Test and Questionnaire Data

    ERIC Educational Resources Information Center

    van Ginkel, Joost R.; van der Ark, L. Andries

    2005-01-01

    A well-known problem in the analysis of test and questionnaire data is that some item scores may be missing. Advanced methods for the imputation of missing data are available, such as multiple imputation under the multivariate normal model and imputation under the saturated logistic model (Schafer, 1997). Accompanying software was made available…

  20. Imputing data that are missing at high rates using a boosting algorithm

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Cauthen, Katherine Regina; Lambert, Gregory; Ray, Jaideep

    Traditional multiple imputation approaches may perform poorly for datasets with high rates of missingness unless many m imputations are used. This paper implements an alternative machine learning-based approach to imputing data that are missing at high rates. Here, we use boosting to create a strong learner from a weak learner fitted to a dataset missing many observations. This approach may be applied to a variety of types of learners (models). The approach is demonstrated by application to a spatiotemporal dataset for predicting dengue outbreaks in India from meteorological covariates. A Bayesian spatiotemporal CAR model is boosted to produce imputations, and the overall RMSE from a k-fold cross-validation is used to assess imputation accuracy.

  1. Multiple imputation for IPD meta-analysis: allowing for heterogeneity and studies with missing covariates.

    PubMed

    Quartagno, M; Carpenter, J R

    2016-07-30

    Recently, multiple imputation has been proposed as a tool for individual patient data meta-analysis with sporadically missing observations, and it has been suggested that within-study imputation is usually preferable. However, such within study imputation cannot handle variables that are completely missing within studies. Further, if some of the contributing studies are relatively small, it may be appropriate to share information across studies when imputing. In this paper, we develop and evaluate a joint modelling approach to multiple imputation of individual patient data in meta-analysis, with an across-study probability distribution for the study specific covariance matrices. This retains the flexibility to allow for between-study heterogeneity when imputing while allowing (i) sharing information on the covariance matrix across studies when this is appropriate, and (ii) imputing variables that are wholly missing from studies. Simulation results show both equivalent performance to the within-study imputation approach where this is valid, and good results in more general, practically relevant, scenarios with studies of very different sizes, non-negligible between-study heterogeneity and wholly missing variables. We illustrate our approach using data from an individual patient data meta-analysis of hypertension trials. © 2015 The Authors. Statistics in Medicine Published by John Wiley & Sons Ltd.

  2. Reconstruction of spatially detailed global map of NH4+ and NO3- application in synthetic nitrogen fertilizer

    NASA Astrophysics Data System (ADS)

    Nishina, Kazuya; Ito, Akihiko; Hanasaki, Naota; Hayashi, Seiji

    2017-02-01

    Historical global N fertilizer maps available as input data for global biogeochemical models are still limited, and existing maps do not distinguish NH4+ and NO3- in the fertilizer application rates. This paper provides a method for constructing a new historical global nitrogen fertilizer application map (0.5° × 0.5° resolution) for the period 1961-2010 based on country-specific information from Food and Agriculture Organization statistics (FAOSTAT) and various global datasets. This new map incorporates the fraction of NH4+ (and NO3-) in N fertilizer inputs by utilizing fertilizer species information in FAOSTAT, in which species can be categorized as NH4+- and/or NO3--forming N fertilizers. During data processing, we applied a statistical data imputation method for the missing data (19 % of national N fertilizer consumption) in FAOSTAT. The multiple imputation method enabled us to fill gaps in the time-series data with plausible values using covariate information (year, population, GDP, and crop area). After the imputation, we downscaled the national consumption data to a gridded cropland map. Also, we applied the multiple imputation method to the available chemical fertilizer species consumption, allowing for the estimation of the NH4+ / NO3- ratio in national fertilizer consumption. In this study, the synthetic N fertilizer inputs in 2000 showed a general consistency with the existing N fertilizer map (Potter et al., 2010) in relation to the ranges of N fertilizer inputs. Globally, the estimated N fertilizer inputs based on the sum of filled data increased from 15 to 110 Tg-N during 1961-2010. On the other hand, the global NO3- input started to decline after the late 1980s and the fraction of NO3- in global N fertilizer decreased consistently from 35 to 13 % over a 50-year period. NH4+-forming fertilizers are dominant in most countries; however, the NH4+ / NO3- ratio in N fertilizer inputs shows clear differences temporally and geographically. This new map can be utilized as input data to global model studies and bring new insights for the assessment of historical terrestrial N cycling changes. Datasets available at doi:10.1594/PANGAEA.861203.

  3. Missing in space: an evaluation of imputation methods for missing data in spatial analysis of risk factors for type II diabetes.

    PubMed

    Baker, Jannah; White, Nicole; Mengersen, Kerrie

    2014-11-20

    Spatial analysis is increasingly important for identifying modifiable geographic risk factors for disease. However, spatial health data from surveys are often incomplete, ranging from missing data for only a few variables, to missing data for many variables. For spatial analyses of health outcomes, selection of an appropriate imputation method is critical in order to produce the most accurate inferences. We present a cross-validation approach to select between three imputation methods for health survey data with correlated lifestyle covariates, using as a case study, type II diabetes mellitus (DM II) risk across 71 Queensland Local Government Areas (LGAs). We compare the accuracy of mean imputation to imputation using multivariate normal and conditional autoregressive prior distributions. Choice of imputation method depends upon the application and is not necessarily the most complex method. Mean imputation was selected as the most accurate method in this application. Selecting an appropriate imputation method for health survey data, after accounting for spatial correlation and correlation between covariates, allows more complete analysis of geographic risk factors for disease with more confidence in the results to inform public policy decision-making.
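
    A cross-validation comparison of imputation methods of the kind described above can be sketched as follows: each observed value is hidden in turn, re-imputed by every candidate method, and the method with the lowest error is selected. The neighbour-average method is only a crude stand-in for a spatially structured model (it is not the conditional autoregressive prior used in the study), and the area values and adjacency map are hypothetical.

```python
import math


def loo_rmse(values, impute):
    """Hide each observed value in turn, re-impute it, and return the RMSE."""
    squared = []
    for i, truth in enumerate(values):
        if truth is None:
            continue
        held_out = list(values)
        held_out[i] = None
        squared.append((impute(held_out, i) - truth) ** 2)
    return math.sqrt(sum(squared) / len(squared))


def mean_impute(values, i):
    """Impute position i with the mean of all observed values."""
    observed = [v for v in values if v is not None]
    return sum(observed) / len(observed)


def make_neighbour_impute(neighbours):
    """Build an imputer that averages the observed values of adjacent areas."""
    def impute(values, i):
        observed = [values[j] for j in neighbours[i] if values[j] is not None]
        if not observed:
            return mean_impute(values, i)  # fall back when no neighbour is observed
        return sum(observed) / len(observed)
    return impute


# Hypothetical covariate values for five areas arranged along a line
values = [1.0, 2.0, 3.0, 4.0, 5.0]
neighbours = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
scores = {
    "mean": loo_rmse(values, mean_impute),
    "neighbour": loo_rmse(values, make_neighbour_impute(neighbours)),
}
best = min(scores, key=scores.get)
print(scores, best)
```

    Which method wins depends entirely on the data: the toy values here have a strong spatial trend, whereas in the study's application mean imputation was the most accurate.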

  4. An Overview and Evaluation of Recent Machine Learning Imputation Methods Using Cardiac Imaging Data.

    PubMed

    Liu, Yuzhe; Gopalakrishnan, Vanathi

    2017-03-01

    Many clinical research datasets have a large percentage of missing values that directly impacts their usefulness in yielding high accuracy classifiers when used for training in supervised machine learning. While missing value imputation methods have been shown to work well with smaller percentages of missing values, their ability to impute sparse clinical research data can be problem specific. We previously attempted to learn quantitative guidelines for ordering cardiac magnetic resonance imaging during the evaluation for pediatric cardiomyopathy, but missing data significantly reduced our usable sample size. In this work, we sought to determine if increasing the usable sample size through imputation would allow us to learn better guidelines. We first review several machine learning methods for estimating missing data. Then, we apply four popular methods (mean imputation, decision tree, k-nearest neighbors, and self-organizing maps) to a clinical research dataset of pediatric patients undergoing evaluation for cardiomyopathy. Using Bayesian Rule Learning (BRL) to learn ruleset models, we compared the performance of imputation-augmented models versus unaugmented models. We found that all four imputation-augmented models performed similarly to unaugmented models. While imputation did not improve performance, it did provide evidence for the robustness of our learned models.

  5. Genome-wide association analysis of more than 120,000 individuals identifies 15 new susceptibility loci for breast cancer

    PubMed Central

    Michailidou, Kyriaki; Beesley, Jonathan; Lindstrom, Sara; Canisius, Sander; Dennis, Joe; Lush, Michael; Maranian, Mel J; Bolla, Manjeet K; Wang, Qin; Shah, Mitul; Perkins, Barbara J; Czene, Kamila; Eriksson, Mikael; Darabi, Hatef; Brand, Judith S; Bojesen, Stig E; Nordestgaard, Børge G; Flyger, Henrik; Nielsen, Sune F; Rahman, Nazneen; Turnbull, Clare; Fletcher, Olivia; Peto, Julian; Gibson, Lorna; dos-Santos-Silva, Isabel; Chang-Claude, Jenny; Flesch-Janys, Dieter; Rudolph, Anja; Eilber, Ursula; Behrens, Sabine; Nevanlinna, Heli; Muranen, Taru A; Aittomäki, Kristiina; Blomqvist, Carl; Khan, Sofia; Aaltonen, Kirsimari; Ahsan, Habibul; Kibriya, Muhammad G; Whittemore, Alice S; John, Esther M; Malone, Kathleen E; Gammon, Marilie D; Santella, Regina M; Ursin, Giske; Makalic, Enes; Schmidt, Daniel F; Casey, Graham; Hunter, David J; Gapstur, Susan M; Gaudet, Mia M; Diver, W Ryan; Haiman, Christopher A; Schumacher, Fredrick; Henderson, Brian E; Le Marchand, Loic; Berg, Christine D; Chanock, Stephen; Figueroa, Jonine; Hoover, Robert N; Lambrechts, Diether; Neven, Patrick; Wildiers, Hans; van Limbergen, Erik; Schmidt, Marjanka K; Broeks, Annegien; Verhoef, Senno; Cornelissen, Sten; Couch, Fergus J; Olson, Janet E; Hallberg, Emily; Vachon, Celine; Waisfisz, Quinten; Meijers-Heijboer, Hanne; Adank, Muriel A; van der Luijt, Rob B; Li, Jingmei; Liu, Jianjun; Humphreys, Keith; Kang, Daehee; Choi, Ji-Yeob; Park, Sue K; Yoo, Keun-Young; Matsuo, Keitaro; Ito, Hidemi; Iwata, Hiroji; Tajima, Kazuo; Guénel, Pascal; Truong, Thérèse; Mulot, Claire; Sanchez, Marie; Burwinkel, Barbara; Marme, Frederik; Surowy, Harald; Sohn, Christof; Wu, Anna H; Tseng, Chiu-chen; Van Den Berg, David; Stram, Daniel O; González-Neira, Anna; Benitez, Javier; Zamora, M Pilar; Perez, Jose Ignacio Arias; Shu, Xiao-Ou; Lu, Wei; Gao, Yu-Tang; Cai, Hui; Cox, Angela; Cross, Simon S; Reed, Malcolm WR; Andrulis, Irene L; Knight, Julia A; Glendon, Gord; Mulligan, Anna Marie; Sawyer, Elinor J; Tomlinson, Ian; 
Kerin, Michael J; Miller, Nicola; Lindblom, Annika; Margolin, Sara; Teo, Soo Hwang; Yip, Cheng Har; Taib, Nur Aishah Mohd; TAN, Gie-Hooi; Hooning, Maartje J; Hollestelle, Antoinette; Martens, John WM; Collée, J Margriet; Blot, William; Signorello, Lisa B; Cai, Qiuyin; Hopper, John L; Southey, Melissa C; Tsimiklis, Helen; Apicella, Carmel; Shen, Chen-Yang; Hsiung, Chia-Ni; Wu, Pei-Ei; Hou, Ming-Feng; Kristensen, Vessela N; Nord, Silje; Alnaes, Grethe I Grenaker; Giles, Graham G; Milne, Roger L; McLean, Catriona; Canzian, Federico; Trichopoulos, Dmitrios; Peeters, Petra; Lund, Eiliv; Sund, Malin; Khaw, Kay-Tee; Gunter, Marc J; Palli, Domenico; Mortensen, Lotte Maxild; Dossus, Laure; Huerta, Jose-Maria; Meindl, Alfons; Schmutzler, Rita K; Sutter, Christian; Yang, Rongxi; Muir, Kenneth; Lophatananon, Artitaya; Stewart-Brown, Sarah; Siriwanarangsan, Pornthep; Hartman, Mikael; Miao, Hui; Chia, Kee Seng; Chan, Ching Wan; Fasching, Peter A; Hein, Alexander; Beckmann, Matthias W; Haeberle, Lothar; Brenner, Hermann; Dieffenbach, Aida Karina; Arndt, Volker; Stegmaier, Christa; Ashworth, Alan; Orr, Nick; Schoemaker, Minouk J; Swerdlow, Anthony J; Brinton, Louise; Garcia-Closas, Montserrat; Zheng, Wei; Halverson, Sandra L; Shrubsole, Martha; Long, Jirong; Goldberg, Mark S; Labrèche, France; Dumont, Martine; Winqvist, Robert; Pylkäs, Katri; Jukkola-Vuorinen, Arja; Grip, Mervi; Brauch, Hiltrud; Hamann, Ute; Brüning, Thomas; Radice, Paolo; Peterlongo, Paolo; Manoukian, Siranoush; Bernard, Loris; Bogdanova, Natalia V; Dörk, Thilo; Mannermaa, Arto; Kataja, Vesa; Kosma, Veli-Matti; Hartikainen, Jaana M; Devilee, Peter; Tollenaar, Robert AEM; Seynaeve, Caroline; Van Asperen, Christi J; Jakubowska, Anna; Lubinski, Jan; Jaworska, Katarzyna; Huzarski, Tomasz; Sangrajrang, Suleeporn; Gaborieau, Valerie; Brennan, Paul; McKay, James; Slager, Susan; Toland, Amanda E; Ambrosone, Christine B; Yannoukakos, Drakoulis; Kabisch, Maria; Torres, Diana; Neuhausen, Susan L; Anton-Culver, Hoda; 
Luccarini, Craig; Baynes, Caroline; Ahmed, Shahana; Healey, Catherine S; Tessier, Daniel C; Vincent, Daniel; Bacot, Francois; Pita, Guillermo; Alonso, M Rosario; Álvarez, Nuria; Herrero, Daniel; Simard, Jacques; Pharoah, Paul PDP; Kraft, Peter; Dunning, Alison M; Chenevix-Trench, Georgia; Hall, Per; Easton, Douglas F

    2015-01-01

    Genome-wide association studies (GWAS) and large-scale replication studies have identified common variants in 79 loci associated with breast cancer, explaining ~14% of the familial risk of the disease. To identify new susceptibility loci, we performed a meta-analysis of 11 GWAS comprising 15,748 breast cancer cases and 18,084 controls, and 46,785 cases and 42,892 controls from 41 studies genotyped on a 200K custom array (iCOGS). Analyses were restricted to women of European ancestry. Genotypes for more than 11M SNPs were generated by imputation using the 1000 Genomes Project reference panel. We identified 15 novel loci associated with breast cancer at P < 5 × 10^-8. Combining association analysis with ChIP-Seq data in mammary cell lines and ChIA-PET chromatin interaction data in ENCODE, we identified likely target genes in two regions: SETBP1 on 18q12.3 and RNF115 and PDZK1 on 1q21.1. One association appears to be driven by an amino-acid substitution in EXO1. PMID:25751625

  6. The AraGWAS Catalog: a curated and standardized Arabidopsis thaliana GWAS catalog

    PubMed Central

    Togninalli, Matteo; Seren, Ümit; Meng, Dazhe; Fitz, Joffrey; Nordborg, Magnus; Weigel, Detlef

    2018-01-01

    The abundance of high-quality genotype and phenotype data for the model organism Arabidopsis thaliana enables scientists to study the genetic architecture of many complex traits at an unprecedented level of detail using genome-wide association studies (GWAS). GWAS have been a great success in A. thaliana and many SNP-trait associations have been published. With the AraGWAS Catalog (https://aragwas.1001genomes.org) we provide a publicly available, manually curated and standardized GWAS catalog for all publicly available phenotypes from the central A. thaliana phenotype repository, AraPheno. All GWAS have been recomputed on the latest imputed genotype release of the 1001 Genomes Consortium using a standardized GWAS pipeline to ensure comparability between results. The catalog currently includes 167 phenotypes and more than 222 000 SNP-trait associations with P < 10^-4, of which 3887 are significantly associated using permutation-based thresholds. The AraGWAS Catalog can be accessed via a modern web interface and provides various features to easily access, download and visualize the results and summary statistics across GWAS. PMID:29059333

  7. Large-scale imputation of epigenomic datasets for systematic annotation of diverse human tissues.

    PubMed

    Ernst, Jason; Kellis, Manolis

    2015-04-01

    With hundreds of epigenomic maps, the opportunity arises to exploit the correlated nature of epigenetic signals, across both marks and samples, for large-scale prediction of additional datasets. Here, we undertake epigenome imputation by leveraging such correlations through an ensemble of regression trees. We impute 4,315 high-resolution signal maps, of which 26% are also experimentally observed. Imputed signal tracks show overall similarity to observed signals and surpass experimental datasets in consistency, recovery of gene annotations and enrichment for disease-associated variants. We use the imputed data to detect low-quality experimental datasets, to find genomic sites with unexpected epigenomic signals, to define high-priority marks for new experiments and to delineate chromatin states in 127 reference epigenomes spanning diverse tissues and cell types. Our imputed datasets provide the most comprehensive human regulatory region annotation to date, and our approach and the ChromImpute software constitute a useful complement to large-scale experimental mapping of epigenomic information.

  8. Using Bayesian Imputation to Assess Racial and Ethnic Disparities in Pediatric Performance Measures.

    PubMed

    Brown, David P; Knapp, Caprice; Baker, Kimberly; Kaufmann, Meggen

    2016-06-01

    To analyze health care disparities in pediatric quality of care measures and determine the impact of data imputation. Five HEDIS measures are calculated based on 2012 administrative data for 145,652 children in two public insurance programs in Florida. The Bayesian Improved Surname and Geocoding (BISG) imputation method is used to impute missing race and ethnicity data for 42 percent of the sample (61,954 children). Models are estimated with and without the imputed race and ethnicity data. Dropping individuals with missing race and ethnicity data biases quality of care measures for minorities downward relative to nonminority children for several measures. These results provide further support for the importance of appropriately accounting for missing race and ethnicity data through imputation methods. © Health Research and Educational Trust.
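
    As a rough illustration of the Bayesian update behind BISG-style imputation: the probability of each race/ethnicity category given a surname is combined with the probability of living in a given area given that category, then renormalized. The function name, category labels and all probabilities below are hypothetical and are not taken from the study.

```python
def bisg_posterior(p_race_given_surname, p_geo_given_race):
    """Combine surname-based priors with geography likelihoods via Bayes' rule.

    Both arguments are dicts keyed by the same category labels.
    Returns posterior probabilities that sum to 1.
    """
    unnormalized = {
        race: p_race_given_surname[race] * p_geo_given_race[race]
        for race in p_race_given_surname
    }
    total = sum(unnormalized.values())
    return {race: value / total for race, value in unnormalized.items()}

# Hypothetical inputs for one child: surname priors and, for the child's
# home area, the share of each group's population living in that area.
surname_prior = {"group_a": 0.5, "group_b": 0.5}
area_likelihood = {"group_a": 0.1, "group_b": 0.3}
posterior = bisg_posterior(surname_prior, area_likelihood)
```

    In this toy case an uninformative surname prior is shifted toward group_b because that group is relatively more concentrated in the child's area.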

  9. A Review On Missing Value Estimation Using Imputation Algorithm

    NASA Astrophysics Data System (ADS)

    Armina, Roslan; Zain, Azlan Mohd; Azizah Ali, Nor; Sallehuddin, Roselina

    2017-09-01

    Missing values in a data set have always been a major obstacle to precise prediction. A method for imputing missing values needs to minimize the effect of incomplete data on the prediction model. Many algorithms have been proposed to counter the missing value problem. In this review, we provide a comprehensive analysis of existing imputation algorithms, focusing on the technique used and on whether global or local information from the data set is used for missing value estimation. In addition, validation methods for imputation results and ways to measure the performance of imputation algorithms are described. The objective of this review is to highlight possible improvements to existing methods, and it is hoped that this review gives the reader a better understanding of trends in imputation methods.

  10. Identifying reprioritization response shift in a stroke caregiver population: a comparison of missing data methods.

    PubMed

    Sajobi, Tolulope T; Lix, Lisa M; Singh, Gurbakhshash; Lowerison, Mark; Engbers, Jordan; Mayo, Nancy E

    2015-03-01

    Response shift (RS) is an important phenomenon that influences the assessment of longitudinal changes in health-related quality of life (HRQOL) studies. Given that RS effects are often small, missing data due to attrition or item non-response can contribute to failure to detect RS effects. Since missing data are often encountered in longitudinal HRQOL data, effective strategies to deal with missing data are important to consider. This study aims to compare different imputation methods on the detection of reprioritization RS in the HRQOL of caregivers of stroke survivors. Data were from a Canadian multi-center longitudinal study of caregivers of stroke survivors over a one-year period. The Stroke Impact Scale physical function score at baseline, with a cutoff of 75, was used to measure patient stroke severity for the reprioritization RS analysis. Mean imputation, likelihood-based expectation-maximization (EM) imputation, and multiple imputation methods were compared in test procedures based on changes in relative importance weights to detect RS in SF-36 domains over a 6-month period. Monte Carlo simulation methods were used to compare the statistical powers of relative importance test procedures for detecting RS in incomplete longitudinal data under different missing data mechanisms and imputation methods. Of the 409 caregivers, 15.9% and 31.3% had missing data at baseline and 6 months, respectively. There were no statistically significant changes in relative importance weights on any of the domains when complete-case analysis was adopted, but statistically significant changes were detected on physical functioning and/or vitality domains when mean imputation or EM imputation was adopted. There were also statistically significant changes in relative importance weights for physical functioning, mental health, and vitality domains when the multiple imputation method was adopted. 
Our simulations revealed that relative importance test procedures were least powerful under the complete-case analysis method and most powerful when a mean imputation or multiple imputation method was adopted for missing data, regardless of the missing data mechanism and proportion of missing data. Test procedures based on relative importance measures are sensitive to the type and amount of missing data and imputation method. Relative importance test procedures based on mean imputation and multiple imputation are recommended for detecting RS in incomplete data.

  11. The rise of multiple imputation: a review of the reporting and implementation of the method in medical research.

    PubMed

    Hayati Rezvan, Panteha; Lee, Katherine J; Simpson, Julie A

    2015-04-07

    Missing data are common in medical research, which can lead to a loss in statistical power and potentially biased results if not handled appropriately. Multiple imputation (MI) is a statistical method, widely adopted in practice, for dealing with missing data. Many academic journals now emphasise the importance of reporting information regarding missing data and proposed guidelines for documenting the application of MI have been published. This review evaluated the reporting of missing data, the application of MI including the details provided regarding the imputation model, and the frequency of sensitivity analyses within the MI framework in medical research articles. A systematic review of articles published in the Lancet and New England Journal of Medicine between January 2008 and December 2013 in which MI was implemented was carried out. We identified 103 papers that used MI, with the number of papers increasing from 11 in 2008 to 26 in 2013. Nearly half of the papers specified the proportion of complete cases or the proportion with missing data by each variable. In the majority of the articles (86%) the imputed variables were specified. Of the 38 papers (37%) that stated the method of imputation, 20 used chained equations, 8 used multivariate normal imputation, and 10 used alternative methods. Very few articles (9%) detailed how they handled non-normally distributed variables during imputation. Thirty-nine papers (38%) stated the variables included in the imputation model. Less than half of the papers (46%) reported the number of imputations, and only two papers compared the distribution of imputed and observed data. Sixty-six papers presented the results from MI as a secondary analysis. Only three articles carried out a sensitivity analysis following MI to assess departures from the missing at random assumption, with details of the sensitivity analyses only provided by one article. 
This review outlined deficiencies in the documenting of missing data and the details provided about imputation. Furthermore, only a few articles performed sensitivity analyses following MI even though this is strongly recommended in guidelines. Authors are encouraged to follow the available guidelines and provide information on missing data and the imputation process.

  12. The Minnesota Center for Twin and Family Research Genome-Wide Association Study

    PubMed Central

    Miller, Michael B.; Basu, Saonli; Cunningham, Julie; Eskin, Eleazar; Malone, Steven M.; Oetting, William S.; Schork, Nicholas; Sul, Jae Hoon; Iacono, William G.; McGue, Matt

    2012-01-01

    As part of the Genes, Environment and Development Initiative (GEDI), the Minnesota Center for Twin and Family Research (MCTFR) undertook a genome-wide association study (GWAS), which we describe here. A total of 8405 research participants, clustered in 4-member families, have been successfully genotyped on 527,829 single nucleotide polymorphism (SNP) markers using Illumina’s Human660W-Quad array. Quality control screening of samples and markers as well as SNP imputation procedures are described. We also describe methods for ancestry control and how the familial clustering of the MCTFR sample can be accounted for in the analysis using a Rapid Feasible Generalized Least Squares algorithm. The rich longitudinal MCTFR assessments provide numerous opportunities for collaboration. PMID:23363460

  13. Multiple Imputation of Cognitive Performance as a Repeatedly Measured Outcome

    PubMed Central

    Rawlings, Andreea M.; Sang, Yingying; Sharrett, A. Richey; Coresh, Josef; Griswold, Michael; Kucharska-Newton, Anna M.; Palta, Priya; Wruck, Lisa M.; Gross, Alden L.; Deal, Jennifer A.; Power, Melinda C.; Bandeen-Roche, Karen

    2016-01-01

    Background: Longitudinal studies of cognitive performance are sensitive to dropout, as participants experiencing cognitive deficits are less likely to attend study visits, which may bias estimated associations between exposures of interest and cognitive decline. Multiple imputation is a powerful tool for handling missing data; however, its use for missing cognitive outcome measures in longitudinal analyses remains limited. Methods: We use multiple imputation by chained equations (MICE) to impute cognitive performance scores of participants who did not attend the 2011-2013 exam of the Atherosclerosis Risk in Communities Study. We examined the validity of imputed scores using observed and simulated data under varying assumptions. We examined differences in the estimated association between diabetes at baseline and 20-year cognitive decline with and without imputed values. Lastly, we discuss how different analytic methods (mixed models and models fit using generalized estimating equations) and the choice of for whom to impute result in different estimands. Results: Validation using observed data showed MICE produced unbiased imputations. Simulations showed a substantial reduction in the bias of the 20-year association between diabetes and cognitive decline comparing MICE (3-4% bias) to analyses of available data only (16-23% bias) in a construct where missingness was strongly informative but realistic. Associations between diabetes and 20-year cognitive decline were substantially stronger with MICE than in available-case analyses. Conclusions: Our study suggests that when informative data are available for non-examined participants, MICE can be an effective tool for imputing cognitive performance and improving assessment of cognitive decline, though careful thought should be given to the target imputation population and the analytic model chosen, as they may yield different estimands. PMID:27619926

  14. Multiple imputation in the presence of non-normal data.

    PubMed

    Lee, Katherine J; Carlin, John B

    2017-02-20

    Multiple imputation (MI) is becoming increasingly popular for handling missing data. Standard approaches for MI assume normality for continuous variables (conditionally on the other variables in the imputation model). However, it is unclear how to impute non-normally distributed continuous variables. Using simulation and a case study, we compared various transformations applied prior to imputation, including a novel non-parametric transformation, to imputation on the raw scale and using predictive mean matching (PMM) when imputing non-normal data. We generated data from a range of non-normal distributions, and set 50% to missing completely at random or missing at random. We then imputed missing values on the raw scale, following a zero-skewness log, Box-Cox or non-parametric transformation and using PMM with both type 1 and 2 matching. We compared inferences regarding the marginal mean of the incomplete variable and the association with a fully observed outcome. We also compared results from these approaches in the analysis of depression and anxiety symptoms in parents of very preterm compared with term-born infants. The results provide novel empirical evidence that the decision regarding how to impute a non-normal variable should be based on the nature of the relationship between the variables of interest. If the relationship is linear in the untransformed scale, transformation can introduce bias irrespective of the transformation used. However, if the relationship is non-linear, it may be important to transform the variable to accurately capture this relationship. A useful alternative is to impute the variable using PMM with type 1 matching. Copyright © 2016 John Wiley & Sons, Ltd.

  15. A meta-data based method for DNA microarray imputation.

    PubMed

    Jörnsten, Rebecka; Ouyang, Ming; Wang, Hui-Yu

    2007-03-29

    DNA microarray experiments are conducted in logical sets, such as time course profiling after a treatment is applied to the samples, or comparisons of the samples under two or more conditions. Due to cost and design constraints of spotted cDNA microarray experiments, each logical set commonly includes only a small number of replicates per condition. Despite the vast improvement of the microarray technology in recent years, missing values are prevalent. Intuitively, imputation of missing values is best done using many replicates within the same logical set. In practice, there are few replicates and thus reliable imputation within logical sets is difficult. However, it is in the case of few replicates that the presence of missing values, and how they are imputed, can have the most profound impact on the outcome of downstream analyses (e.g. significance analysis and clustering). This study explores the feasibility of imputation across logical sets, using the vast amount of publicly available microarray data to improve imputation reliability in the small sample size setting. We download all cDNA microarray data of Saccharomyces cerevisiae, Arabidopsis thaliana, and Caenorhabditis elegans from the Stanford Microarray Database. Through cross-validation and simulation, we find that, for all three species, our proposed imputation using data from public databases is far superior to imputation within a logical set, sometimes to an astonishing degree. Furthermore, the imputation root mean square error for significant genes is generally a lot less than that of non-significant ones. Since downstream analysis of significant genes, such as clustering and network analysis, can be very sensitive to small perturbations of estimated gene effects, it is highly recommended that researchers apply reliable data imputation prior to further analysis. Our method can also be applied to cDNA microarray experiments from other species, provided good reference data are available.

  16. GWAS in an Amerindian ancestry population reveals novel systemic lupus erythematosus risk loci and the role of European admixture

    PubMed Central

    Alarcón-Riquelme, Marta E.; Ziegler, Julie T.; Molineros, Julio; Howard, Timothy D.; Moreno-Estrada, Andrés; Sánchez-Rodríguez, Elena; Ainsworth, Hannah C.; Ortiz-Tello, Patricia; Comeau, Mary E.; Rasmussen, Astrid; Kelly, Jennifer A.; Adler, Adam; Acevedo-Vázquez, Eduardo; Cucho, Jorge Mariano; García-De la Torre, Ignacio; Cardiel, Mario H.; Miranda, Pedro; Catoggio, Luis; Maradiaga-Ceceña, Marco; Gaffney, Patrick; Vyse, Timothy; Criswell, Lindsey A.; Tsao, Betty P.; Sivils, Kathy L.; Bae, Sang-Cheol; James, Judith A.; Kimberly, Robert; Kaufman, Ken; Harley, John B.; Esquivel-Valerio, Jorge; Moctezuma, José F.; García, Mercedes A.; Berbotto, Guillermo; Babini, Alejandra; Scherbarth, Hugo; Toloza, Sergio; Baca, Vicente; Nath, Swapan K.; Salinas, Carlos Aguilar; Orozco, Lorena; Tusié-Luna, Teresa; Zidovetzki, Raphael; Pons-Estel, Bernardo A.; Langefeld, Carl D.; Jacob, Chaim O.

    2016-01-01

    OBJECTIVES: Systemic lupus erythematosus (SLE) is a chronic autoimmune disease with a strong genetic component. Our aim was to perform the first genome-wide association study on individuals from the Americas enriched for Native American heritage. MATERIALS and METHODS: We analyzed 3,710 individuals from four countries of Latin America and the United States diagnosed with SLE and healthy controls. Samples were genotyped with the HumanOmni1 BeadChip. Data of out-of-study controls was obtained for the HumanOmni2.5. Statistical analyses were performed using SNPTEST and SNPGWA. Data was adjusted for genomic control and FDR. Imputation was done using IMPUTE2, and HiBAG for classical HLA alleles. RESULTS: The IRF5-TNPO3 region showed the strongest association and largest odds ratio (OR) (rs10488631, Pgcadj = 2.61 × 10^-29, OR = 2.12, 95% CI: 1.88–2.39) followed by the HLA class II on the DQA2-DQB1 loci (rs9275572, Pgcadj = 1.11 × 10^-16, OR = 1.62, 95% CI: 1.46–1.80; rs9271366, Pgcadj = 6.46 × 10^-12, OR = 2.06, 95% CI: 1.71–2.50). Other known SLE loci associated were ITGAM, STAT4, TNIP1, NCF2 and IRAK1. We identified a novel locus on 10q24.33 (rs4917385, Pgcadj = 1.4 × 10^-8) with an eQTL effect (Peqtl = 8.0 × 10^-37 at USMG5/miR1307), and describe novel loci. We corroborate SLE-risk loci previously identified in Europeans and Asians. Local ancestry estimation showed that HLA allele risk contribution is of European ancestral origin. Imputation of HLA alleles suggested that autochthonous Native American haplotypes provide protection. CONCLUSIONS: Our results show the insight gained by studying admixed populations to delineate the genetic architecture that underlies autoimmune and complex diseases. PMID:26606652

  17. Family-based Association Analyses of Imputed Genotypes Reveal Genome-Wide Significant Association of Alzheimer’s disease with OSBPL6, PTPRG and PDCL3

    PubMed Central

    Herold, Christine; Hooli, Basavaraj V.; Mullin, Kristina; Liu, Tian; Roehr, Johannes T; Mattheisen, Manuel; Parrado, Antonio R.; Bertram, Lars; Lange, Christoph; Tanzi, Rudolph E.

    2015-01-01

    The genetic basis of Alzheimer's disease (AD) is complex and heterogeneous. Over 200 highly penetrant pathogenic variants in the genes APP, PSEN1 and PSEN2 cause a subset of early-onset familial Alzheimer's disease (EOFAD). On the other hand, susceptibility to late-onset forms of AD (LOAD) is indisputably associated with the ε4 allele in the gene APOE, and more recently with variants in more than two dozen additional genes identified in large-scale genome-wide association studies (GWAS) and meta-analysis reports. Taken together, however, although the heritability in AD is estimated to be as high as 80%, a large proportion of the underlying genetic factors still remain to be elucidated. In this study we performed a systematic family-based genome-wide association and meta-analysis on close to 15 million imputed variants from three large collections of AD families (~3,500 subjects from 1,070 families). Using a multivariate phenotype combining affection status and onset age, meta-analysis of the association results revealed three single nucleotide polymorphisms (SNPs) that achieved genome-wide significance for association with AD risk: rs7609954 in the gene PTPRG (P-value = 3.98 × 10^-8), rs1347297 in the gene OSBPL6 (P-value = 4.53 × 10^-8), and rs1513625 near PDCL3 (P-value = 4.28 × 10^-8). In addition, rs72953347 in OSBPL6 (P-value = 6.36 × 10^-7) and two SNPs in the gene CDKAL1 showed marginally significant association with LOAD (rs10456232, P-value = 4.76 × 10^-7; rs62400067, P-value = 3.54 × 10^-7). In summary, family-based GWAS meta-analysis of imputed SNPs revealed novel genomic variants in (or near) PTPRG, OSBPL6, and PDCL3 that influence risk for AD with genome-wide significance. PMID:26830138

  18. HLA-DRB1*11 and variants of the MHC class II locus are strong risk factors for systemic juvenile idiopathic arthritis.

    PubMed

    Ombrello, Michael J; Remmers, Elaine F; Tachmazidou, Ioanna; Grom, Alexei; Foell, Dirk; Haas, Johannes-Peter; Martini, Alberto; Gattorno, Marco; Özen, Seza; Prahalad, Sampath; Zeft, Andrew S; Bohnsack, John F; Mellins, Elizabeth D; Ilowite, Norman T; Russo, Ricardo; Len, Claudio; Hilario, Maria Odete E; Oliveira, Sheila; Yeung, Rae S M; Rosenberg, Alan; Wedderburn, Lucy R; Anton, Jordi; Schwarz, Tobias; Hinks, Anne; Bilginer, Yelda; Park, Jane; Cobb, Joanna; Satorius, Colleen L; Han, Buhm; Baskin, Elizabeth; Signa, Sara; Duerr, Richard H; Achkar, J P; Kamboh, M Ilyas; Kaufman, Kenneth M; Kottyan, Leah C; Pinto, Dalila; Scherer, Stephen W; Alarcón-Riquelme, Marta E; Docampo, Elisa; Estivill, Xavier; Gül, Ahmet; de Bakker, Paul I W; Raychaudhuri, Soumya; Langefeld, Carl D; Thompson, Susan; Zeggini, Eleftheria; Thomson, Wendy; Kastner, Daniel L; Woo, Patricia

    2015-12-29

    Systemic juvenile idiopathic arthritis (sJIA) is an often severe, potentially life-threatening childhood inflammatory disease, the pathophysiology of which is poorly understood. To determine whether genetic variation within the MHC locus on chromosome 6 influences sJIA susceptibility, we performed an association study of 982 children with sJIA and 8,010 healthy control subjects from nine countries. Using meta-analysis of directly observed and imputed SNP genotypes and imputed classic HLA types, we identified the MHC locus as a bona fide susceptibility locus with effects on sJIA risk that transcended geographically defined strata. The strongest sJIA-associated SNP, rs151043342 [P = 2.8 × 10^-17, odds ratio (OR) 2.6 (2.1, 3.3)], was part of a cluster of 482 sJIA-associated SNPs that spanned a 400-kb region and included the class II HLA region. Conditional analysis controlling for the effect of rs151043342 found that rs12722051 independently influenced sJIA risk [P = 1.0 × 10^-5, OR 0.7 (0.6, 0.8)]. Meta-analysis of imputed classic HLA-type associations in six study populations of Western European ancestry revealed that HLA-DRB1*11 and its defining amino acid residue, glutamate 58, were strongly associated with sJIA [P = 2.7 × 10^-16, OR 2.3 (1.9, 2.8)], as was the HLA-DRB1*11-HLA-DQA1*05-HLA-DQB1*03 haplotype [6.4 × 10^-17, OR 2.3 (1.9, 2.9)]. By examining the MHC locus in the largest collection of sJIA patients assembled to date, this study solidifies the relationship between the class II HLA region and sJIA, implicating adaptive immune molecules in the pathogenesis of sJIA.

  19. Cox regression analysis with missing covariates via nonparametric multiple imputation.

    PubMed

    Hsu, Chiu-Hsieh; Yu, Mandi

    2018-01-01

    We consider the situation of estimating Cox regression in which some covariates are subject to missingness, and there exists additional information (including observed event time, censoring indicator and fully observed covariates) which may be predictive of the missing covariates. We propose to use two working regression models: one for predicting the missing covariates and the other for predicting the missingness probabilities. For each missing covariate observation, these two working models are used to define a nearest neighbor imputing set. This set is then used to non-parametrically impute covariate values for the missing observation. Upon the completion of imputation, Cox regression is performed on the multiply imputed datasets to estimate the regression coefficients. In a simulation study, we compare the nonparametric multiple imputation approach with the augmented inverse probability weighted (AIPW) method, which directly incorporates the two working models into estimation of Cox regression, and the predictive mean matching (PMM) imputation method. We show that all approaches can reduce bias due to a non-ignorable missingness mechanism. The proposed nonparametric imputation method is robust to misspecification of either one of the two working models and robust to misspecification of the link function of the two working models. In contrast, the PMM method is sensitive to misspecification of the covariates included in imputation. The AIPW method is sensitive to the selection probability. We apply the approaches to a breast cancer dataset from the Surveillance, Epidemiology and End Results (SEER) Program.

  20. Imputation for multisource data with comparison and assessment techniques

    DOE PAGES

    Casleton, Emily Michele; Osthus, David Allen; Van Buren, Kendra Lu

    2017-12-27

    Missing data are a prevalent issue in analyses involving data collection. The problem of missing data is exacerbated for multisource analysis, where data from multiple sensors are combined to arrive at a single conclusion. In this scenario, missing data are more likely to occur and can lead to discarding a large amount of data collected; however, the information from observed sensors can be leveraged to estimate those values not observed. We propose two methods for imputation of multisource data, both of which take advantage of potential correlation between data from different sensors, through ridge regression and a state-space model. These methods, as well as the common median imputation, are applied to data collected from a variety of sensors monitoring an experimental facility. Performance of imputation methods is compared with the mean absolute deviation; however, rather than using this metric to solely rank the methods, we also propose an approach to identify significant differences. Imputation techniques will also be assessed by their ability to produce appropriate confidence intervals, through coverage and length, around the imputed values. Finally, performance of imputed datasets is compared with a marginalized dataset through a weighted k-means clustering. In general, we found that imputation through a dynamic linear model tended to be the most accurate and to produce the most precise confidence intervals, and that imputing the missing values and down weighting them with respect to observed values in the analysis led to the most accurate performance.

  1. Imputation for multisource data with comparison and assessment techniques

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Casleton, Emily Michele; Osthus, David Allen; Van Buren, Kendra Lu

    Missing data are a prevalent issue in analyses involving data collection. The problem of missing data is exacerbated for multisource analysis, where data from multiple sensors are combined to arrive at a single conclusion. In this scenario, missing data are more likely to occur and can lead to discarding a large amount of data collected; however, the information from observed sensors can be leveraged to estimate those values not observed. We propose two methods for imputation of multisource data, both of which take advantage of potential correlation between data from different sensors, through ridge regression and a state-space model. These methods, as well as the common median imputation, are applied to data collected from a variety of sensors monitoring an experimental facility. Performance of imputation methods is compared with the mean absolute deviation; however, rather than using this metric to solely rank the methods, we also propose an approach to identify significant differences. Imputation techniques will also be assessed by their ability to produce appropriate confidence intervals, through coverage and length, around the imputed values. Finally, performance of imputed datasets is compared with a marginalized dataset through a weighted k-means clustering. In general, we found that imputation through a dynamic linear model tended to be the most accurate and to produce the most precise confidence intervals, and that imputing the missing values and down weighting them with respect to observed values in the analysis led to the most accurate performance.

  2. Erythropoietin-Stimulating Agents and Survival in End-Stage Renal Disease: Comparison of Payment Policy Analysis, Instrumental Variables, and Multiple Imputation of Potential Outcomes

    PubMed Central

    Dore, David D.; Swaminathan, Shailender; Gutman, Roee; Trivedi, Amal N.; Mor, Vincent

    2013-01-01

    Objective: To compare the assumptions and estimands across three approaches to estimating the effect of erythropoietin-stimulating agents (ESAs) on mortality. Study Design and Setting: Using data from the Renal Management Information System, we conducted two analyses utilizing a change to bundled payment that we hypothesized mimicked random assignment to ESA (pre-post, difference-in-difference, and instrumental variable analyses). A third analysis was based on multiply imputing potential outcomes using propensity scores. Results: There were 311,087 recipients of ESAs and 13,095 non-recipients. In the pre-post comparison, we identified no clear relationship between bundled payment (measured by calendar time) and the incidence of death within six months (risk difference -1.5%; 95% CI -7.0% to 4.0%). In the instrumental variable analysis, the risk of mortality was similar among ESA recipients (risk difference -0.9%; 95% CI -2.1% to 0.3%). In the multiple imputation analysis, we observed a 4.2% (95% CI 3.4% to 4.9%) absolute reduction in mortality risk with use of ESAs, but closer to the null for patients with baseline hematocrit >36%. Conclusion: Methods emanating from different disciplines often rely on different assumptions, but can be informative about a similar causal contrast. The implications of these distinct approaches are discussed. PMID:23849152

  3. Multiple imputation: an application to income nonresponse in the National Survey on Recreation and the Environment

    Treesearch

    Stanley J. Zarnoch; H. Ken Cordell; Carter J. Betz; John C. Bergstrom

    2010-01-01

    Multiple imputation is used to create values for missing family income data in the National Survey on Recreation and the Environment. We present an overview of the survey and a description of the missingness pattern for family income and other key variables. We create a logistic model for the multiple imputation process and to impute data sets for family income. We...

  4. Baseline predictors of sputum culture conversion in pulmonary tuberculosis: importance of cavities, smoking, time to detection and W-Beijing genotype.

    PubMed

    Visser, Marianne E; Stead, Michael C; Walzl, Gerhard; Warren, Rob; Schomaker, Michael; Grewal, Harleen M S; Swart, Elizabeth C; Maartens, Gary

    2012-01-01

    Time to detection (TTD) on automated liquid mycobacterial cultures is an emerging biomarker of tuberculosis outcomes. The M. tuberculosis W-Beijing genotype is spreading globally, indicating a selective advantage. There is a paucity of data on the association between baseline TTD and W-Beijing genotype and tuberculosis outcomes. To assess baseline predictors of failure of sputum culture conversion, within the first 2 months of antitubercular therapy, in participants with pulmonary tuberculosis. Between May 2005 and August 2008 we conducted a prospective cohort study of time to sputum culture conversion in ambulatory participants with first episodes of smear and culture positive pulmonary tuberculosis attending two primary care clinics in Cape Town, South Africa. Rifampicin resistance (diagnosed on phenotypic susceptibility testing) was an exclusion criterion. Sputum was collected weekly for 8 weeks for mycobacterial culture on liquid media (BACTEC MGIT 960). Due to missing data, multiple imputation was performed. Time to sputum culture conversion was analysed using a Cox-proportional hazards model. Bayesian model averaging determined the posterior effect probability for each variable. 113 participants were enrolled (30.1% female, 10.5% HIV-infected, 44.2% W-Beijing genotype, and 89% cavities). On Kaplan Meier analysis 50.4% of participants underwent sputum culture conversion by 8 weeks. The following baseline factors were associated with slower sputum culture conversion: TTD (adjusted hazard ratio (aHR) = 1.11, 95% CI 1.02; 1.2), lung cavities (aHR = 0.13, 95% CI 0.02; 0.95), ever smoking (aHR = 0.32, 95% CI 0.1; 1.02) and the W-Beijing genotype (aHR = 0.51, 95% CI 0.25; 1.07). On Bayesian model averaging, posterior probability effects were strong for TTD, lung cavitation and smoking and moderate for W-Beijing genotype. We found that baseline TTD, smoking, cavities and W-Beijing genotype were associated with delayed 2 month sputum culture. Larger studies are needed to confirm the relationship between the W-Beijing genotype and sputum culture conversion.

  5. Variable Selection in the Presence of Missing Data: Imputation-based Methods.

    PubMed

    Zhao, Yize; Long, Qi

    2017-01-01

    Variable selection plays an essential role in regression analysis as it identifies important variables that are associated with outcomes and is known to improve predictive accuracy of resulting models. Variable selection methods have been widely investigated for fully observed data. However, in the presence of missing data, methods for variable selection need to be carefully designed to account for missing data mechanisms and statistical techniques used for handling missing data. Since imputation is arguably the most popular method for handling missing data due to its ease of use, statistical methods for variable selection that are combined with imputation are of particular interest. These methods, valid under the assumptions of missing at random (MAR) and missing completely at random (MCAR), largely fall into three general strategies. The first strategy applies existing variable selection methods to each imputed dataset and then combines variable selection results across all imputed datasets. The second strategy applies existing variable selection methods to stacked imputed datasets. The third variable selection strategy combines resampling techniques such as bootstrap with imputation. Despite recent advances, this area remains under-developed and offers fertile ground for further research.

  6. Genome-Wide Association Study Implicates HLA-C*01:02 as a Risk Factor at the Major Histocompatibility Complex Locus in Schizophrenia

    PubMed Central

    2012-01-01

    Background: We performed a genome-wide association study (GWAS) to identify common risk variants for schizophrenia. Methods: The discovery scan included 1606 patients and 1794 controls from Ireland, using 6,212,339 directly genotyped or imputed single nucleotide polymorphisms (SNPs). A subset of this sample (270 cases and 860 controls) was subsequently included in the Psychiatric GWAS Consortium-schizophrenia GWAS meta-analysis. Results: One hundred eight SNPs were taken forward for replication in an independent sample of 13,195 cases and 31,021 control subjects. The most significant associations, corrected for genomic inflation, in discovery (rs204999, p combined = 1.34 × 10(-9)) and in combined samples (rs2523722, p combined = 2.88 × 10(-16)) mapped to the major histocompatibility complex (MHC) region. We imputed classical human leukocyte antigen (HLA) alleles at the locus; the most significant finding was with HLA-C*01:02. This association was distinct from the top SNP signal. The HLA alleles DRB1*03:01 and B*08:01 were protective, replicating a previous study. Conclusions: This study provides further support for involvement of MHC class I molecules in schizophrenia. We found evidence of association with previously reported risk alleles at the TCF4, VRK2, and ZNF804A loci. PMID:22883433

  7. Approaches in Characterizing Genetic Structure and Mapping in a Rice Multiparental Population.

    PubMed

    Raghavan, Chitra; Mauleon, Ramil; Lacorte, Vanica; Jubay, Monalisa; Zaw, Hein; Bonifacio, Justine; Singh, Rakesh Kumar; Huang, B Emma; Leung, Hei

    2017-06-07

    Multi-parent Advanced Generation Intercross (MAGIC) populations are fast becoming mainstream tools for research and breeding, along with the technology and tools for analysis. This paper demonstrates the analysis of a rice MAGIC population from data filtering to imputation and processing of genetic data to characterizing genomic structure, and finally quantitative trait loci (QTL) mapping. In this study, 1316 S6:8 indica MAGIC (MI) lines and the eight founders were sequenced using Genotyping by Sequencing (GBS). As the GBS approach often includes missing data, the first step was to impute the missing SNPs. The observable number of recombinations in the population was then explored. Based on this case study, a general outline of procedures for a MAGIC analysis workflow is provided, as well as for QTL mapping of agronomic traits and biotic and abiotic stress, using the results from both association and interval mapping approaches. QTL for agronomic traits (yield, flowering time, and plant height), physical (grain length and grain width) and cooking properties (amylose content) of the rice grain, abiotic stress (submergence tolerance), and biotic stress (brown spot disease) were mapped. Through presenting this extensive analysis in the MI population in rice, we highlight important considerations when choosing analytical approaches. The methods and results reported in this paper will provide a guide to future genetic analysis methods applied to multi-parent populations. Copyright © 2017 Raghavan et al.

  8. Aluminium tolerance and high phosphorus efficiency helps Stylosanthes better adapt to low-P acid soils.

    PubMed

    Du, Yu-Mei; Tian, Jiang; Liao, Hong; Bai, Chang-Jun; Yan, Xiao-Long; Liu, Guo-Dao

    2009-06-01

    Stylosanthes spp. (stylo) is one of the most important pasture legumes used in a wide range of agricultural systems on acid soils, where aluminium (Al) toxicity and phosphorus (P) deficiency are two major limiting factors for plant growth. However, physiological mechanisms of stylo adaptation to acid soils are not understood. Twelve stylo genotypes were surveyed under field conditions, followed by sand and nutrient solution culture experiments to investigate possible physiological mechanisms of stylo adaptation to low-P acid soils. Stylo genotypes varied substantially in growth and P uptake in low P conditions in the field. Three genotypes contrasting in P efficiency were selected for experiments in nutrient solution and sand culture to examine their Al tolerance and ability to utilize different P sources, including Ca-P, K-P, Al-P, Fe-P and phytate-P. Among the three tested genotypes, the P-efficient genotype 'TPRC2001-1' had higher Al tolerance than the P-inefficient genotype 'Fine-stem' as indicated by relative tap root length and haematoxylin staining. The three genotypes differed in their ability to utilize different P sources. The P-efficient genotype, 'TPRC2001-1', had superior ability to utilize phytate-P. The findings suggest that possible physiological mechanisms of stylo adaptation to low-P acid soils might involve superior ability of plant roots to tolerate Al toxicity and to utilize organic P and Al-P.

  9. [Imputing missing data in public health: general concepts and application to dichotomous variables].

    PubMed

    Hernández, Gilma; Moriña, David; Navarro, Albert

    The presence of missing data in collected variables is common in health surveys, but the subsequent imputation thereof at the time of analysis is not. Working with imputed data may have certain benefits regarding the precision of the estimators and the unbiased identification of associations between variables. The imputation process is probably still little understood by many non-statisticians, who view this process as highly complex and with an uncertain goal. To clarify these questions, this note aims to provide a straightforward, non-exhaustive overview of the imputation process to enable public health researchers to ascertain its strengths. The overview is set in the context of dichotomous variables, which are commonplace in public health. To illustrate these concepts, an example in which missing data are handled by means of simple and multiple imputation is introduced. Copyright © 2017 SESPAS. Published by Elsevier España, S.L.U. All rights reserved.

  10. Genetic variants in Protocadherin-1, bronchial hyper-responsiveness, and asthma subphenotypes in German children.

    PubMed

    Toncheva, Antoaneta A; Suttner, Kathrin; Michel, Sven; Klopp, Norman; Illig, Thomas; Balschun, Tobias; Vogelberg, Christian; von Berg, Andrea; Bufe, Albrecht; Heinzmann, Andrea; Laub, Otto; Rietschel, Ernst; Simma, Burkhard; Frischer, Thomas; Genuneit, Jon; von Mutius, Erika; Kabesch, Michael

    2012-11-01

    Recently, Protocadherin-1 (PCDH1) was reported as a novel susceptibility gene for bronchial hyper-responsiveness (BHR) and asthma. PCDH1 is located on chromosome 5q31-33, in the vicinity of several known candidate genes for asthma and allergy. To exclude that the associations observed for PCDH1 originate from the nearby cytokine cluster, an extensive linkage disequilibrium (LD) analysis was performed. Effects of polymorphisms in PCDH1 on asthma, BHR, and related phenotypes were studied comprehensively. Genotype information was acquired from Illumina HumanHap300Chip genotyping, MALDI-TOF MS genotyping, and imputation. LD was assessed by Haploview 4.2 software. Associations were investigated in a population of 1454 individuals (763 asthmatics) from two German study populations [MAGICS and International Study of Asthma and Allergies in Childhood phase II (ISAAC II)] using logistic regression to model additive effects. No relevant LD between PCDH1 tagging polymorphisms and 98 single nucleotide polymorphisms within the cytokine cluster was detected. While BHR was not associated with PCDH1 polymorphisms, significant associations with subphenotypes of asthma were observed. Protocadherin-1 polymorphisms may specifically affect the development of non-atopic asthma in children. Functional studies are needed to further investigate the role of PCDH1 in BHR and asthma development. © 2012 John Wiley & Sons A/S. Published by Blackwell Publishing Ltd.

  11. Variant calling in low-coverage whole genome sequencing of a Native American population sample.

    PubMed

    Bizon, Chris; Spiegel, Michael; Chasse, Scott A; Gizer, Ian R; Li, Yun; Malc, Ewa P; Mieczkowski, Piotr A; Sailsbery, Josh K; Wang, Xiaoshu; Ehlers, Cindy L; Wilhelmsen, Kirk C

    2014-01-30

    The reduction in the cost of sequencing a human genome has led to the use of genotype sampling strategies in order to impute and infer the presence of sequence variants that can then be tested for associations with traits of interest. Low-coverage Whole Genome Sequencing (WGS) is a sampling strategy that overcomes some of the deficiencies seen in fixed content SNP array studies. Linkage-disequilibrium (LD) aware variant callers, such as the program Thunder, may provide a calling rate and accuracy that make a low-coverage sequencing strategy viable. We examined the performance of an LD-aware variant calling strategy in a population of 708 low-coverage whole genome sequences from a community sample of Native Americans. We assessed variant calling through a comparison of the sequencing results to genotypes measured in 641 of the same subjects using a fixed content first generation exome array. The comparison was made using the variant calling routines GATK Unified Genotyper program and the LD-aware variant caller Thunder. Thunder was found to improve concordance in a coverage dependent fashion, while correctly calling nearly all of the common variants as well as a high percentage of the rare variants present in the sample. Low-coverage WGS is a strategy that appears to collect genetic information intermediate in scope between fixed content genotyping arrays and deep-coverage WGS. Our data suggest that low-coverage WGS is a viable strategy with a greater chance of discovering novel variants and associations than fixed content arrays for large sample association analyses.

  12. Meta‐analysis of test accuracy studies using imputation for partial reporting of multiple thresholds

    PubMed Central

    Deeks, J.J.; Martin, E.C.; Riley, R.D.

    2017-01-01

    Introduction: For tests reporting continuous results, primary studies usually provide test performance at multiple but often different thresholds. This creates missing data when performing a meta‐analysis at each threshold. A standard meta‐analysis (no imputation [NI]) ignores such missing data. A single imputation (SI) approach was recently proposed to recover missing threshold results. Here, we propose a new method that performs multiple imputation of the missing threshold results using discrete combinations (MIDC). Methods: The new MIDC method imputes missing threshold results by randomly selecting from the set of all possible discrete combinations which lie between the results for 2 known bounding thresholds. Imputed and observed results are then synthesised at each threshold. This is repeated multiple times, and the multiple pooled results at each threshold are combined using Rubin's rules to give final estimates. We compared the NI, SI, and MIDC approaches via simulation. Results: Both imputation methods outperform the NI method in simulations. There was generally little difference in the SI and MIDC methods, but the latter was noticeably better in terms of estimating the between‐study variances and generally gave better coverage, due to slightly larger standard errors of pooled estimates. Given selective reporting of thresholds, the imputation methods also reduced bias in the summary receiver operating characteristic curve. Simulations demonstrate the imputation methods rely on an equal threshold spacing assumption. A real example is presented. Conclusions: The SI and, in particular, MIDC methods can be used to examine the impact of missing threshold results in meta‐analysis of test accuracy studies. PMID:29052347
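
    The final pooling step named in the abstract above, Rubin's rules, can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' MIDC implementation: the function name `pool_rubin` and the example numbers are hypothetical, and each imputed dataset is assumed to have already produced one estimate and its within-imputation variance.

```python
from statistics import mean, variance


def pool_rubin(estimates, within_variances):
    """Combine m per-imputation results with Rubin's rules.

    estimates        : point estimate from each imputed dataset
    within_variances : squared standard error from each imputed dataset
    Returns (pooled_estimate, total_variance).
    """
    m = len(estimates)
    if m < 2 or m != len(within_variances):
        raise ValueError("need at least two imputations with matching lengths")

    pooled = mean(estimates)            # average of the m estimates
    within = mean(within_variances)     # average within-imputation variance
    between = variance(estimates)       # sample variance across imputations (divisor m - 1)
    total = within + (1 + 1 / m) * between
    return pooled, total


if __name__ == "__main__":
    # Hypothetical estimates from three imputed datasets.
    est, var = pool_rubin([1.0, 2.0, 3.0], [0.5, 0.5, 0.5])
    print(f"pooled estimate = {est:.3f}, total variance = {var:.3f}")
```

    The between-imputation term is what widens the standard error relative to a single imputation, which is consistent with the abstract's remark that MIDC gave slightly larger standard errors and better coverage.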

  13. The impact of missing trauma data on predicting massive transfusion

    PubMed Central

    Trickey, Amber W.; Fox, Erin E.; del Junco, Deborah J.; Ning, Jing; Holcomb, John B.; Brasel, Karen J.; Cohen, Mitchell J.; Schreiber, Martin A.; Bulger, Eileen M.; Phelan, Herb A.; Alarcon, Louis H.; Myers, John G.; Muskat, Peter; Cotton, Bryan A.; Wade, Charles E.; Rahbar, Mohammad H.

    2013-01-01

    INTRODUCTION: Missing data are inherent in clinical research and may be especially problematic for trauma studies. This study describes a sensitivity analysis to evaluate the impact of missing data on clinical risk prediction algorithms. Three blood transfusion prediction models were evaluated utilizing an observational trauma dataset with valid missing data. METHODS: The PRospective Observational Multi-center Major Trauma Transfusion (PROMMTT) study included patients requiring ≥ 1 unit of red blood cells (RBC) at 10 participating U.S. Level I trauma centers from July 2009 – October 2010. Physiologic, laboratory, and treatment data were collected prospectively up to 24h after hospital admission. Subjects who received ≥ 10 RBC units within 24h of admission were classified as massive transfusion (MT) patients. Correct classification percentages for three MT prediction models were evaluated using complete case analysis and multiple imputation. A sensitivity analysis for missing data was conducted to determine the upper and lower bounds for correct classification percentages. RESULTS: PROMMTT enrolled 1,245 subjects. MT was received by 297 patients (24%). Missing percentage ranged from 2.2% (heart rate) to 45% (respiratory rate). Proportions of complete cases utilized in the MT prediction models ranged from 41% to 88%. All models demonstrated similar correct classification percentages using complete case analysis and multiple imputation. In the sensitivity analysis, correct classification upper-lower bound ranges per model were 4%, 10%, and 12%. Predictive accuracy for all models using PROMMTT data was lower than reported in the original datasets. CONCLUSIONS: Evaluating the accuracy of clinical prediction models with missing data can be misleading, especially with many predictor variables and moderate levels of missingness per variable. The proposed sensitivity analysis describes the influence of missing data on risk prediction algorithms. Reporting upper/lower bounds for percent correct classification may be more informative than multiple imputation, which provided similar results to complete case analysis in this study. PMID:23778514
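
    One simple way to obtain upper and lower bounds of this kind is sketched below. It is an assumption made for illustration, not the PROMMTT procedure itself: every subject with missing predictors is treated as classified correctly (upper bound) or incorrectly (lower bound). The function name and the counts are hypothetical.

```python
def classification_bounds(n_correct_complete, n_complete, n_missing):
    """Bound the correct-classification proportion when some cases are incomplete.

    n_correct_complete : complete cases the model classified correctly
    n_complete         : number of complete cases
    n_missing          : cases that could not be scored because of missing predictors
    Returns (lower, upper) as proportions of all cases.
    """
    if not 0 <= n_correct_complete <= n_complete or n_missing < 0:
        raise ValueError("counts are inconsistent")
    total = n_complete + n_missing
    if total == 0:
        raise ValueError("no cases supplied")

    lower = n_correct_complete / total                # all incomplete cases assumed wrong
    upper = (n_correct_complete + n_missing) / total  # all incomplete cases assumed right
    return lower, upper


if __name__ == "__main__":
    # Hypothetical counts: 80 complete cases (60 correct) and 20 incomplete cases.
    low, high = classification_bounds(60, 80, 20)
    print(f"correct classification between {low:.0%} and {high:.0%}")
```

    The width of the interval equals the share of incomplete cases, so it grows directly with the amount of missingness.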

  14. Impact of missing data imputation methods on gene expression clustering and classification.

    PubMed

    de Souto, Marcilio C P; Jaskowiak, Pablo A; Costa, Ivan G

    2015-02-26

    Several missing value imputation methods for gene expression data have been proposed in the literature. In the past few years, researchers have been putting a great deal of effort into presenting systematic evaluations of the different imputation algorithms. Initially, most algorithms were assessed with an emphasis on the accuracy of the imputation, using metrics such as the root mean squared error. However, it has become clear that the success of the estimation of the expression value should be evaluated in more practical terms as well. One can consider, for example, the ability of the method to preserve the significant genes in the dataset, or its discriminative/predictive power for classification/clustering purposes. We performed a broad analysis of the impact of five well-known missing value imputation methods on three clustering and four classification methods, in the context of 12 cancer gene expression datasets. We employed a statistical framework, for the first time in this field, to assess whether different imputation methods improve the performance of the clustering/classification methods. Our results suggest that the imputation methods evaluated have a minor impact on the classification and downstream clustering analyses. Simple methods such as replacing the missing values by mean or the median values performed as well as more complex strategies. The datasets analyzed in this study are available at http://costalab.org/Imputation/ .
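
    The simple baselines mentioned above, replacing missing values by the mean or the median, can be sketched as follows. This is an illustrative example rather than the code used in the study: the function name `impute_simple`, the use of `None` to mark a missing value, and the sample expression values are all assumptions.

```python
from statistics import mean, median


def impute_simple(values, strategy="mean"):
    """Fill missing entries (None) in one gene's expression profile.

    values   : list of floats, with None marking a missing measurement
    strategy : "mean" or "median" of the observed entries
    """
    if strategy not in ("mean", "median"):
        raise ValueError("strategy must be 'mean' or 'median'")
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("cannot impute a profile with no observed values")

    fill = mean(observed) if strategy == "mean" else median(observed)
    return [fill if v is None else v for v in values]


if __name__ == "__main__":
    # Hypothetical expression values for one gene across four samples.
    profile = [1.0, None, 2.0, 10.0]
    print(impute_simple(profile, "mean"))
    print(impute_simple(profile, "median"))
```

    The median is less affected by a single extreme sample than the mean, which is one reason both variants tend to be reported side by side.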

  15. An original imputation technique of missing data for assessing exposure of newborns to perchlorate in drinking water.

    PubMed

    Caron, Alexandre; Clement, Guillaume; Heyman, Christophe; Aernout, Eva; Chazard, Emmanuel; Le Tertre, Alain

    2015-01-01

    Incompleteness of epidemiological databases is a major drawback when it comes to analyzing data. We conceived an epidemiological study to assess the association between newborn thyroid function and the exposure to perchlorates found in the tap water of the mother's home. Only 9% of newborns' exposure to perchlorate was known. The aim of our study was to design, test and evaluate an original method for imputing perchlorate exposure of newborns based on their maternity of birth. In a first database, an exhaustive collection of newborns' thyroid function measured during a systematic neonatal screening was collected. In this database the municipality of residence of the newborn's mother was only available for 2012. Between 2004 and 2011, the closest data available was the municipality of the maternity of birth. Exposure was assessed using a second database which contained the perchlorate levels for each municipality. We computed the catchment area of every maternity ward based on the French nationwide exhaustive database of inpatient stay. Municipality, and consequently perchlorate exposure, was imputed by a weighted draw in the catchment area. Missing values for remaining covariates were imputed by chained equation. A linear mixture model was computed on each imputed dataset. We compared odds ratios (ORs) and 95% confidence intervals (95% CI) estimated on real versus imputed 2012 data. The same model was then carried out for the whole imputed database. The ORs estimated on 36,695 observations by our multiple imputation method are comparable to the real 2012 data. On the 394,979 observations of the whole database, the ORs remain stable but the 95% CI tighten considerably. The model estimates computed on imputed data are similar to those calculated on real data. The main advantage of multiple imputation is to provide an unbiased estimate of the ORs while maintaining their variances. Thus, our method will be used to increase the statistical power of future studies by including all 394,979 newborns.
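
    The "weighted draw in the catchment area" described above can be illustrated with a short sketch. It is not the authors' code: the function names, the seed, and the catchment shares are hypothetical, and the catchment area is assumed to be given as the share of a ward's births coming from each municipality.

```python
import random


def draw_municipality(catchment, rng):
    """Draw one municipality with probability proportional to its share of births.

    catchment : dict mapping municipality name -> share of the ward's births
    rng       : random.Random instance, passed in so results are reproducible
    """
    names = list(catchment)
    weights = [catchment[name] for name in names]
    return rng.choices(names, weights=weights, k=1)[0]


def multiply_impute(catchment, n_imputations, seed=0):
    """Repeat the weighted draw once per imputed dataset."""
    rng = random.Random(seed)
    return [draw_municipality(catchment, rng) for _ in range(n_imputations)]


if __name__ == "__main__":
    # Hypothetical catchment area of one maternity ward.
    shares = {"Town A": 0.6, "Town B": 0.3, "Town C": 0.1}
    print(multiply_impute(shares, n_imputations=5, seed=42))
```

    Each imputed municipality would then be linked to its measured perchlorate level, so the uncertainty about residence carries through to the exposure variable in every imputed dataset.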

  16. A comparison of multiple imputation methods for incomplete longitudinal binary data.

    PubMed

    Yamaguchi, Yusuke; Misumi, Toshihiro; Maruo, Kazushi

    2018-01-01

    Longitudinal binary data are commonly encountered in clinical trials. Multiple imputation is an approach for obtaining valid estimates of treatment effects under the assumption of a missing at random mechanism. Although there are a variety of multiple imputation methods for longitudinal binary data, only a limited number of studies have reported on the relative performance of the methods. Moreover, when focusing on the treatment effect throughout a period, an endpoint that has often been used in clinical evaluations of specific disease areas, no definitive investigations comparing the methods have been available. We conducted an extensive simulation study to examine the comparative performance of six multiple imputation methods available in the SAS MI procedure for longitudinal binary data, where two endpoints of responder rates, at a specified time point and throughout a period, were assessed. The simulation study suggested that results from the naive approaches of a single imputation with non-responders and a complete case analysis could be very sensitive to missing data. The multiple imputation methods using a monotone method and a full conditional specification with a logistic regression imputation model were recommended for obtaining unbiased and robust estimates of the treatment effect. The methods were illustrated with data from a mental health study.

  17. Review, evaluation, and discussion of the challenges of missing value imputation for mass spectrometry-based label-free global proteomics

    DOE PAGES

    Webb-Robertson, Bobbie-Jo M.; Wiberg, Holli K.; Matzke, Melissa M.; ...

    2015-04-09

    In this review, we apply selected imputation strategies to label-free liquid chromatography–mass spectrometry (LC–MS) proteomics datasets to evaluate the accuracy with respect to metrics of variance and classification. We evaluate several commonly used imputation approaches for individual merits and discuss the caveats of each approach with respect to the example LC–MS proteomics data. In general, local similarity-based approaches, such as the regularized expectation maximization and least-squares adaptive algorithms, yield the best overall performances with respect to metrics of accuracy and robustness. However, no single algorithm consistently outperforms the remaining approaches, and in some cases, performing classification without imputation yielded the most accurate classification. Thus, because of the complex mechanisms of missing data in proteomics, which also vary from peptide to protein, no individual method is a single solution for imputation. In summary, on the basis of the observations in this review, the goal for imputation in the field of computational proteomics should be to develop new approaches that work generically for this data type and new strategies to guide users in the selection of the best imputation for their dataset and analysis objectives.

  18. Gaussian-based routines to impute categorical variables in health surveys.

    PubMed

    Yucel, Recai M; He, Yulei; Zaslavsky, Alan M

    2011-12-20

    The multivariate normal (MVN) distribution is arguably the most popular parametric model used in imputation and is available in most software packages (e.g., SAS PROC MI, R package norm). When it is applied to categorical variables as an approximation, practitioners often either apply simple rounding techniques for ordinal variables or create a distinct 'missing' category and/or disregard the nominal variable from the imputation phase. All of these practices can potentially lead to biased and/or uninterpretable inferences. In this work, we develop a new rounding methodology calibrated to preserve observed distributions to multiply impute missing categorical covariates. The major attractiveness of this method is its flexibility to use any 'working' imputation software, particularly those based on MVN, allowing practitioners to obtain usable imputations with small biases. A simulation study demonstrates the clear advantage of the proposed method in rounding ordinal variables and, in some scenarios, its plausibility in imputing nominal variables. We illustrate our methods on a widely used National Survey of Children with Special Health Care Needs where incomplete values on race posed a valid threat on inferences pertaining to disparities. Copyright © 2011 John Wiley & Sons, Ltd.

  19. Should "Multiple Imputations" Be Treated as "Multiple Indicators"?

    ERIC Educational Resources Information Center

    Mislevy, Robert J.

    1993-01-01

    Multiple imputations for latent variables are constructed so that analyses treating them as true variables have the correct expectations for population characteristics. Analyzing multiple imputations in accordance with their construction yields correct estimates of population characteristics, whereas analyzing them as multiple indicators generally…

  20. Handling Missing Data With Multilevel Structural Equation Modeling and Full Information Maximum Likelihood Techniques.

    PubMed

    Schminkey, Donna L; von Oertzen, Timo; Bullock, Linda

    2016-08-01

    With increasing access to population-based data and electronic health records for secondary analysis, missing data are common. In the social and behavioral sciences, missing data frequently are handled with multiple imputation methods or full information maximum likelihood (FIML) techniques, but healthcare researchers have not embraced these methodologies to the same extent and more often use either traditional imputation techniques or complete case analysis, which can compromise power and introduce unintended bias. This article is a review of options for handling missing data, concluding with a case study demonstrating the utility of multilevel structural equation modeling using full information maximum likelihood (MSEM with FIML) to handle large amounts of missing data. MSEM with FIML is a parsimonious and hypothesis-driven strategy to cope with large amounts of missing data without compromising power or introducing bias. This technique is relevant for nurse researchers faced with ever-increasing amounts of electronic data and decreasing research budgets. © 2016 Wiley Periodicals, Inc.

  1. A context-intensive approach to imputation of missing values in data sets from networks of environmental monitors.

    PubMed

    Larsen, Lawrence C; Shah, Mena

    2016-01-01

    Although networks of environmental monitors are constantly improving through advances in technology and management, instances of missing data still occur. Many methods of imputing values for missing data are available, but they are often difficult to use or produce unsatisfactory results. I-Bot (short for "Imputation Robot") is a context-intensive approach to the imputation of missing data in data sets from networks of environmental monitors. I-Bot is easy to use and routinely produces imputed values that are highly reliable. I-Bot is described and demonstrated using more than 10 years of California data for daily maximum 8-hr ozone, 24-hr PM2.5 (particulate matter with an aerodynamic diameter <2.5 μm), mid-day average surface temperature, and mid-day average wind speed. I-Bot performance is evaluated by imputing values for observed data as if they were missing, and then comparing the imputed values with the observed values. In many cases, I-Bot is able to impute values for long periods with missing data, such as a week, a month, a year, or even longer. Qualitative visual methods and standard quantitative metrics demonstrate the effectiveness of the I-Bot methodology. Many resources are expended every year to analyze and interpret data sets from networks of environmental monitors. A large fraction of those resources is used to cope with difficulties due to the presence of missing data. The I-Bot method of imputing values for such missing data may help convert incomplete data sets into virtually complete data sets that facilitate the analysis and reliable interpretation of vital environmental data.
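
    The evaluation described in this abstract (imputing values for observed data as if they were missing, then comparing imputed with observed values) can be sketched as follows. This is a minimal illustration, not the I-Bot method: the function names `mask_and_score` and `mean_fill`, the 20% hold-out fraction, and the choice of RMSE and MAE as the "standard quantitative metrics" are all assumptions made for the example.

```python
import math
import random


def mask_and_score(series, impute, holdout_fraction=0.2, seed=0):
    """Hide a random subset of observed values, impute them, return (RMSE, MAE).

    `series` is a list of floats with None marking gaps that are already missing.
    `impute` is any function that takes such a list and returns a gap-free list.
    """
    rng = random.Random(seed)
    observed_idx = [i for i, v in enumerate(series) if v is not None]
    k = max(1, int(len(observed_idx) * holdout_fraction))
    held_out = rng.sample(observed_idx, k)

    # Mask the held-out observations as if they had never been recorded.
    masked = list(series)
    for i in held_out:
        masked[i] = None

    filled = impute(masked)

    # Compare imputed values against the true observations that were hidden.
    errors = [filled[i] - series[i] for i in held_out]
    rmse = math.sqrt(sum(e * e for e in errors) / len(errors))
    mae = sum(abs(e) for e in errors) / len(errors)
    return rmse, mae


def mean_fill(values):
    """Placeholder imputer (hypothetical): fill each gap with the overall mean."""
    present = [v for v in values if v is not None]
    mean = sum(present) / len(present)
    return [mean if v is None else v for v in values]
```

    Any imputer with the same list-in, list-out shape can be passed in place of `mean_fill`, so competing methods can be scored on identical hold-out sets by reusing the same `seed`.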

  2. Use of Multiple Imputation Method to Improve Estimation of Missing Baseline Serum Creatinine in Acute Kidney Injury Research

    PubMed Central

    Peterson, Josh F.; Eden, Svetlana K.; Moons, Karel G.; Ikizler, T. Alp; Matheny, Michael E.

    2013-01-01

    Background and objectives: Baseline creatinine (BCr) is frequently missing in AKI studies. Common surrogate estimates can misclassify AKI and adversely affect the study of related outcomes. This study examined whether multiple imputation improved accuracy of estimating missing BCr beyond current recommendations to apply assumed estimated GFR (eGFR) of 75 ml/min per 1.73 m² (eGFR 75). Design, setting, participants, & measurements: From 41,114 unique adult admissions (13,003 with and 28,111 without BCr data) at Vanderbilt University Hospital between 2006 and 2008, a propensity score model was developed to predict likelihood of missing BCr. Propensity scoring identified 6502 patients with highest likelihood of missing BCr among 13,003 patients with known BCr to simulate a “missing” data scenario while preserving actual reference BCr. Within this cohort (n=6502), the ability of various multiple-imputation approaches to estimate BCr and classify AKI was compared with that of eGFR 75. Results: All multiple-imputation methods except the basic one more closely approximated actual BCr than did eGFR 75. Total AKI misclassification was lower with multiple imputation (full multiple imputation + serum creatinine) (9.0%) than with eGFR 75 (12.3%; P<0.001). Improvements in misclassification were greater in patients with impaired kidney function (full multiple imputation + serum creatinine) (15.3%) versus eGFR 75 (40.5%; P<0.001). Multiple imputation improved specificity and positive predictive value for detecting AKI at the expense of modestly decreasing sensitivity relative to eGFR 75. Conclusions: Multiple imputation can improve accuracy in estimating missing BCr and reduce misclassification of AKI beyond currently proposed methods. PMID:23037980

  3. Utilizing random forests imputation of forest plot data for landscape-level wildfire analyses

    Treesearch

    Karin L. Riley; Isaac C. Grenfell; Mark A. Finney; Nicholas L. Crookston

    2014-01-01

    Maps of the number, size, and species of trees in forests across the United States are desirable for a number of applications. For landscape-level fire and forest simulations that use the Forest Vegetation Simulator (FVS), a spatial tree-level dataset, or “tree list”, is a necessity. FVS is widely used at the stand level for simulating fire effects on tree mortality,...

  4. Multiple imputation of missing passenger boarding data in the national census of ferry operators

    DOT National Transportation Integrated Search

    2008-08-01

    This report presents findings from the 2006 National Census of Ferry Operators (NCFO) augmented with imputed values for passengers and passenger miles. Due to the imputation procedures used to calculate missing data, totals in Table 1 may not corre...

  5. The HCUP SID Imputation Project: Improving Statistical Inferences for Health Disparities Research by Imputing Missing Race Data.

    PubMed

    Ma, Yan; Zhang, Wei; Lyman, Stephen; Huang, Yihe

    2018-06-01

    To identify the most appropriate imputation method for missing data in the HCUP State Inpatient Databases (SID) and assess the impact of different missing data methods on racial disparities research. HCUP SID. A novel simulation study compared four imputation methods (random draw, hot deck, joint multiple imputation [MI], conditional MI) for missing values for multiple variables, including race, gender, admission source, median household income, and total charges. The simulation was built on real data from the SID to retain their hierarchical data structures and missing data patterns. Additional predictive information from the U.S. Census and American Hospital Association (AHA) database was incorporated into the imputation. Conditional MI prediction was equivalent or superior to the best performing alternatives for all missing data structures and substantially outperformed each of the alternatives in various scenarios. Conditional MI substantially improved statistical inferences for racial health disparities research with the SID. © Health Research and Educational Trust.

  6. Multiple imputation methods for bivariate outcomes in cluster randomised trials.

    PubMed

    DiazOrdaz, K; Kenward, M G; Gomes, M; Grieve, R

    2016-09-10

    Missing observations are common in cluster randomised trials. The problem is exacerbated when modelling bivariate outcomes jointly, as the proportion of complete cases is often considerably smaller than the proportion having either of the outcomes fully observed. Approaches taken to handling such missing data include the following: complete case analysis, single-level multiple imputation that ignores the clustering, multiple imputation with a fixed effect for each cluster and multilevel multiple imputation. We contrasted the alternative approaches to handling missing data in a cost-effectiveness analysis that uses data from a cluster randomised trial to evaluate an exercise intervention for care home residents. We then conducted a simulation study to assess the performance of these approaches on bivariate continuous outcomes, in terms of confidence interval coverage and empirical bias in the estimated treatment effects. Missing-at-random clustered data scenarios were simulated following a full-factorial design. Across all the missing data mechanisms considered, the multiple imputation methods provided estimators with negligible bias, while complete case analysis resulted in biased treatment effect estimates in scenarios where the randomised treatment arm was associated with missingness. Confidence interval coverage was generally in excess of nominal levels (up to 99.8%) following fixed-effects multiple imputation and too low following single-level multiple imputation. Multilevel multiple imputation led to coverage levels of approximately 95% throughout. © 2016 The Authors. Statistics in Medicine Published by John Wiley & Sons Ltd.

  7. A rare variant in MYH6 is associated with high risk of sick sinus syndrome

    PubMed Central

    Holm, Hilma; Gudbjartsson, Daniel F; Sulem, Patrick; Masson, Gisli; Helgadottir, Hafdis Th; Zanon, Carlo; Magnusson, Olafur Th; Helgason, Agnar; Saemundsdottir, Jona; Gylfason, Arnaldur; Stefansdottir, Hrafnhildur; Gretarsdottir, Solveig; Matthiasson, Stefan E; Thorgeirsson, Guðmundur; Jonasdottir, Aslaug; Sigurdsson, Asgeir; Stefansson, Hreinn; Werge, Thomas; Rafnar, Thorunn; Kiemeney, Lambertus A; Parvez, Babar; Muhammad, Raafia; Roden, Dan M; Darbar, Dawood; Thorleifsson, Gudmar; Walters, G Bragi; Kong, Augustine; Thorsteinsdottir, Unnur; Arnar, David O; Stefansson, Kari

    2011-01-01

    Through complementary application of SNP genotyping, whole-genome sequencing and imputation in 38,384 Icelanders, we have discovered a previously unidentified sick sinus syndrome susceptibility gene, MYH6, encoding the alpha heavy chain subunit of cardiac myosin. A missense variant in this gene, c.2161C>T, results in the conceptual amino acid substitution p.Arg721Trp, has an allelic frequency of 0.38% in Icelanders and associates with sick sinus syndrome with an odds ratio = 12.53 and P = 1.5 × 10^−29. We show that the lifetime risk of being diagnosed with sick sinus syndrome is around 6% for non-carriers of c.2161C>T but is approximately 50% for carriers of the c.2161C>T variant. PMID:21378987

  8. 16 CFR 1115.11 - Imputed knowledge.

    Code of Federal Regulations, 2014 CFR

    2014-01-01

    ... care to ascertain the truth of complaints or other representations. This includes the knowledge a firm... 16 Commercial Practices 2 2014-01-01 2014-01-01 false Imputed knowledge. 1115.11 Section 1115.11... PRODUCT HAZARD REPORTS General Interpretation § 1115.11 Imputed knowledge. (a) In evaluating whether or...

  9. 16 CFR 1115.11 - Imputed knowledge.

    Code of Federal Regulations, 2010 CFR

    2010-01-01

    ... care to ascertain the truth of complaints or other representations. This includes the knowledge a firm... 16 Commercial Practices 2 2010-01-01 2010-01-01 false Imputed knowledge. 1115.11 Section 1115.11... PRODUCT HAZARD REPORTS General Interpretation § 1115.11 Imputed knowledge. (a) In evaluating whether or...

  10. 16 CFR 1115.11 - Imputed knowledge.

    Code of Federal Regulations, 2011 CFR

    2011-01-01

    ... care to ascertain the truth of complaints or other representations. This includes the knowledge a firm... 16 Commercial Practices 2 2011-01-01 2011-01-01 false Imputed knowledge. 1115.11 Section 1115.11... PRODUCT HAZARD REPORTS General Interpretation § 1115.11 Imputed knowledge. (a) In evaluating whether or...

  11. 16 CFR 1115.11 - Imputed knowledge.

    Code of Federal Regulations, 2012 CFR

    2012-01-01

    ... care to ascertain the truth of complaints or other representations. This includes the knowledge a firm... 16 Commercial Practices 2 2012-01-01 2012-01-01 false Imputed knowledge. 1115.11 Section 1115.11... PRODUCT HAZARD REPORTS General Interpretation § 1115.11 Imputed knowledge. (a) In evaluating whether or...

  12. 16 CFR § 1115.11 - Imputed knowledge.

    Code of Federal Regulations, 2013 CFR

    2013-01-01

    ... due care to ascertain the truth of complaints or other representations. This includes the knowledge a... 16 Commercial Practices 2 2013-01-01 2013-01-01 false Imputed knowledge. § 1115.11 Section ... SUBSTANTIAL PRODUCT HAZARD REPORTS General Interpretation § 1115.11 Imputed knowledge. (a) In evaluating...

  13. 5 CFR 919.630 - May the OPM impute conduct of one person to another?

    Code of Federal Regulations, 2010 CFR

    2010-01-01

    ...'s knowledge, approval or acquiescence. The organization's acceptance of the benefits derived from the conduct is evidence of knowledge, approval or acquiescence. (b) Conduct imputed from an... individual to whom the improper conduct is imputed either participated in, had knowledge of, or reason to...

  14. 48 CFR 1830.7002-4 - Determining imputed cost of money.

    Code of Federal Regulations, 2012 CFR

    2012-10-01

    ... of money. 1830.7002-4 Section 1830.7002-4 Federal Acquisition Regulations System NATIONAL AERONAUTICS... Determining imputed cost of money. (a) Determine the imputed cost of money for an asset under construction, fabrication, or development by applying a cost of money rate (see 1830.7002-2) to the representative...

  15. 48 CFR 1830.7002-4 - Determining imputed cost of money.

    Code of Federal Regulations, 2013 CFR

    2013-10-01

    ... of money. 1830.7002-4 Section 1830.7002-4 Federal Acquisition Regulations System NATIONAL AERONAUTICS... Determining imputed cost of money. (a) Determine the imputed cost of money for an asset under construction, fabrication, or development by applying a cost of money rate (see 1830.7002-2) to the representative...

  16. 48 CFR 1830.7002-4 - Determining imputed cost of money.

    Code of Federal Regulations, 2014 CFR

    2014-10-01

    ... of money. 1830.7002-4 Section 1830.7002-4 Federal Acquisition Regulations System NATIONAL AERONAUTICS... Determining imputed cost of money. (a) Determine the imputed cost of money for an asset under construction, fabrication, or development by applying a cost of money rate (see 1830.7002-2) to the representative...

  17. 48 CFR 1830.7002-4 - Determining imputed cost of money.

    Code of Federal Regulations, 2010 CFR

    2010-10-01

    ... money. 1830.7002-4 Section 1830.7002-4 Federal Acquisition Regulations System NATIONAL AERONAUTICS AND... Determining imputed cost of money. (a) Determine the imputed cost of money for an asset under construction, fabrication, or development by applying a cost of money rate (see 1830.7002-2) to the representative...

  18. Obtaining Predictions from Models Fit to Multiply Imputed Data

    ERIC Educational Resources Information Center

    Miles, Andrew

    2016-01-01

    Obtaining predictions from regression models fit to multiply imputed data can be challenging because treatments of multiple imputation seldom give clear guidance on how predictions can be calculated, and because available software often does not have built-in routines for performing the necessary calculations. This research note reviews how…

  19. Missing data imputation and haplotype phase inference for genome-wide association studies

    PubMed Central

    Browning, Sharon R.

    2009-01-01

    Imputation of missing data and the use of haplotype-based association tests can improve the power of genome-wide association studies (GWAS). In this article, I review methods for haplotype inference and missing data imputation, and discuss their application to GWAS. I discuss common features of the best algorithms for haplotype phase inference and missing data imputation in large-scale data sets, as well as some important differences between classes of methods, and highlight the methods that provide the highest accuracy and fastest computational performance. PMID:18850115

  20. Meta-analysis with missing study-level sample variance data.

    PubMed

    Chowdhry, Amit K; Dworkin, Robert H; McDermott, Michael P

    2016-07-30

    We consider a study-level meta-analysis with a normally distributed outcome variable and possibly unequal study-level variances, where the object of inference is the difference in means between a treatment and control group. A common complication in such an analysis is missing sample variances for some studies. A frequently used approach is to impute the weighted (by sample size) mean of the observed variances (mean imputation). Another approach is to include only those studies with variances reported (complete case analysis). Both mean imputation and complete case analysis are only valid under the missing-completely-at-random assumption, and even then the inverse variance weights produced are not necessarily optimal. We propose a multiple imputation method employing gamma meta-regression to impute the missing sample variances. Our method takes advantage of study-level covariates that may be used to provide information about the missing data. Through simulation studies, we show that multiple imputation, when the imputation model is correctly specified, is superior to competing methods in terms of confidence interval coverage probability and type I error probability when testing a specified group difference. Finally, we describe a similar approach to handling missing variances in cross-over studies. Copyright © 2016 John Wiley & Sons, Ltd.
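
    The baseline "mean imputation" approach named in this abstract, which fills each missing sample variance with the sample-size-weighted mean of the observed variances, can be sketched as below. This illustrates only that baseline, not the gamma meta-regression multiple imputation the authors propose; the function name and the use of None for a missing variance are assumptions made for the example.

```python
def impute_variances_weighted_mean(variances, sample_sizes):
    """Fill missing study variances with the sample-size-weighted mean of observed ones.

    `variances` holds one sample variance per study, with None where unreported.
    `sample_sizes` holds the matching study sample sizes, used as weights.
    """
    observed = [(v, n) for v, n in zip(variances, sample_sizes) if v is not None]
    if not observed:
        raise ValueError("at least one study must report a variance")

    total_n = sum(n for _, n in observed)
    weighted_mean = sum(v * n for v, n in observed) / total_n

    # Reported variances are kept as-is; only the gaps are filled.
    return [weighted_mean if v is None else v for v in variances]


# Three studies; the second did not report its variance.
filled = impute_variances_weighted_mean([4.0, None, 2.0], [10, 50, 30])
# Weighted mean of the observed variances: (4.0*10 + 2.0*30) / (10 + 30) = 2.5
```

    As the abstract notes, this single-value fill is valid only when variances are missing completely at random, and every imputed study receives the same variance regardless of its covariates.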

  1. Hierarchical imputation of systematically and sporadically missing data: An approximate Bayesian approach using chained equations.

    PubMed

    Jolani, Shahab

    2018-03-01

    In health and medical sciences, multiple imputation (MI) is now becoming popular to obtain valid inferences in the presence of missing data. However, MI of clustered data such as multicenter studies and individual participant data meta-analysis requires advanced imputation routines that preserve the hierarchical structure of data. In clustered data, a specific challenge is the presence of systematically missing data, when a variable is completely missing in some clusters, and sporadically missing data, when it is partly missing in some clusters. Unfortunately, little is known about how to perform MI when both types of missing data occur simultaneously. We develop a new class of hierarchical imputation approach based on chained equations methodology that simultaneously imputes systematically and sporadically missing data while allowing for arbitrary patterns of missingness among them. Here, we use a random effect imputation model and adopt a simplification over fully Bayesian techniques such as Gibbs sampler to directly obtain draws of parameters within each step of the chained equations. We justify through theoretical arguments and extensive simulation studies that the proposed imputation methodology has good statistical properties in terms of bias and coverage rates of parameter estimates. An illustration is given in a case study with eight individual participant datasets. © 2017 WILEY-VCH Verlag GmbH & Co. KGaA, Weinheim.

  2. Missing value imputation for gene expression data by tailored nearest neighbors.

    PubMed

    Faisal, Shahla; Tutz, Gerhard

    2017-04-25

    High dimensional data like gene expression and RNA-sequences often contain missing values. The subsequent analysis and results based on these incomplete data can suffer strongly from the presence of these missing values. Several approaches to imputation of missing values in gene expression data have been developed but the task is difficult due to the high dimensionality (number of genes) of the data. Here an imputation procedure is proposed that uses weighted nearest neighbors. Instead of using nearest neighbors defined by a distance that includes all genes the distance is computed for genes that are apt to contribute to the accuracy of imputed values. The method aims at avoiding the curse of dimensionality, which typically occurs if local methods as nearest neighbors are applied in high dimensional settings. The proposed weighted nearest neighbors algorithm is compared to existing missing value imputation techniques like mean imputation, KNNimpute and the recently proposed imputation by random forests. We use RNA-sequence and microarray data from studies on human cancer to compare the performance of the methods. The results from simulations as well as real studies show that the weighted distance procedure can successfully handle missing values for high dimensional data structures where the number of predictors is larger than the number of samples. The method typically outperforms the considered competitors.
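
    The general idea of weighted nearest-neighbor imputation can be sketched as follows. This is a generic illustration under stated assumptions, not the authors' procedure: it measures distance over all co-observed features rather than a tailored subset of genes, and it uses inverse-distance weights, which are an assumption made for the example. The function name `weighted_knn_impute` and the use of None for missing entries are likewise hypothetical.

```python
import math


def weighted_knn_impute(rows, k=3, eps=1e-8):
    """Impute None entries from the k nearest rows, weighted by inverse distance.

    `rows` is a list of equal-length lists (samples x features).
    Distances use only the features observed in both rows.
    """
    filled = [list(r) for r in rows]
    for i, target in enumerate(rows):
        for j, value in enumerate(target):
            if value is not None:
                continue
            candidates = []
            for r, donor in enumerate(rows):
                if r == i or donor[j] is None:
                    continue  # a donor must have the feature we need
                shared = [(a, b) for a, b in zip(target, donor)
                          if a is not None and b is not None]
                if not shared:
                    continue  # no common features, distance undefined
                dist = math.sqrt(sum((a - b) ** 2 for a, b in shared) / len(shared))
                candidates.append((dist, donor[j]))
            if not candidates:
                continue  # nothing to borrow from; leave the gap
            candidates.sort(key=lambda pair: pair[0])
            nearest = candidates[:k]
            # Closer donors get larger weights; eps avoids division by zero.
            weights = [1.0 / (d + eps) for d, _ in nearest]
            weighted_sum = sum(w * v for w, (_, v) in zip(weights, nearest))
            filled[i][j] = weighted_sum / sum(weights)
    return filled
```

    Distances are always computed from the original `rows`, so values imputed earlier in the loop do not influence later imputations.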

  3. Missing value imputation for microarray data: a comprehensive comparison study and a web tool.

    PubMed

    Chiu, Chia-Chun; Chan, Shih-Yao; Wang, Chung-Ching; Wu, Wei-Sheng

    2013-01-01

    Microarray data are usually peppered with missing values due to various reasons. However, most of the downstream analyses for microarray data require complete datasets. Therefore, accurate algorithms for missing value estimation are needed for improving the performance of microarray data analyses. Although many algorithms have been developed, there are many debates on the selection of the optimal algorithm. The studies about the performance comparison of different algorithms are still incomprehensive, especially in the number of benchmark datasets used, the number of algorithms compared, the rounds of simulation conducted, and the performance measures used. In this paper, we performed a comprehensive comparison by using (I) thirteen datasets, (II) nine algorithms, (III) 110 independent runs of simulation, and (IV) three types of measures to evaluate the performance of each imputation algorithm fairly. First, the effects of different types of microarray datasets on the performance of each imputation algorithm were evaluated. Second, we discussed whether the datasets from different species have different impact on the performance of different algorithms. To assess the performance of each algorithm fairly, all evaluations were performed using three types of measures. Our results indicate that the performance of an imputation algorithm mainly depends on the type of a dataset but not on the species where the samples come from. In addition to the statistical measure, two other measures with biological meanings are useful to reflect the impact of missing value imputation on the downstream data analyses. Our study suggests that local-least-squares-based methods are good choices to handle missing values for most of the microarray datasets. In this work, we carried out a comprehensive comparison of the algorithms for microarray missing value imputation. Based on such a comprehensive comparison, researchers could choose the optimal algorithm for their datasets easily. 
    Moreover, new imputation algorithms could be compared with the existing algorithms using this comparison strategy as a standard protocol. In addition, to assist researchers in dealing with missing values easily, we built a web-based and easy-to-use imputation tool, MissVIA (http://cosbi.ee.ncku.edu.tw/MissVIA), which supports many imputation algorithms. Once users upload a real microarray dataset and choose the imputation algorithms, MissVIA will determine the optimal algorithm for the users' data through a series of simulations, and then the imputed results can be downloaded for the downstream data analyses.

  4. Pharmacogenetic meta-analysis of baseline risk factors, pharmacodynamic, efficacy and tolerability endpoints from two large global cardiovascular outcomes trials for darapladib

    PubMed Central

    Yeo, Astrid; Warren, Liling; Aponte, Jennifer; Johansson, Kelley; Barnes, Allison; MacPhee, Colin; Davies, Richard; Chissoe, Stephanie; O’Donoghue, Michelle L.; White, Harvey D.

    2017-01-01

    Darapladib, a lipoprotein-associated phospholipase A2 (Lp-PLA2) inhibitor, failed to demonstrate efficacy for the primary endpoints in two large phase III cardiovascular outcomes trials, one in stable coronary heart disease patients (STABILITY) and one in acute coronary syndrome (SOLID-TIMI 52). No major safety signals were observed but tolerability issues of diarrhea and odor were common (up to 13%). We hypothesized that genetic variants associated with Lp-PLA2 activity may influence efficacy and tolerability and therefore performed a comprehensive pharmacogenetic analysis of both trials. We genotyped patients within the STABILITY and SOLID-TIMI 52 trials who provided a DNA sample and consent (n = 13,577 and 10,404 respectively, representing 86% and 82% of the trial participants) using genome-wide arrays with exome content and performed imputation using a 1000 Genomes reference panel. We investigated baseline and change from baseline in Lp-PLA2 activity, two efficacy endpoints (major coronary events and myocardial infarction) as well as tolerability parameters at genome-wide and candidate gene level using a meta-analytic approach. We replicated associations of published loci on baseline Lp-PLA2 activity (APOE, CELSR2, LPA, PLA2G7, LDLR and SCARB1) and identified three novel loci (TOMM5, FRMD5 and LPL) using the GWAS-significance threshold P≤5E-08. Review of the PLA2G7 gene (encoding Lp-PLA2) within these datasets identified V279F null allele carriers as well as three other rare exonic null alleles within various ethnic groups, however none of these variants nor any other loci associated with Lp-PLA2 activity at baseline were associated with any of the drug response endpoints. 
    The analysis of darapladib efficacy endpoints, despite low power, identified six low frequency loci with main genotype effect (though with borderline imputation scores) and one common locus (minor allele frequency 0.24) with genotype by treatment interaction effect passing the GWAS-significance threshold. This locus conferred risk in placebo subjects, hazard ratio (HR) 1.22 with 95% confidence interval (CI) 1.11–1.33, but was protective in darapladib subjects, HR 0.79 (95% CI 0.71–0.88). No major loci for tolerability were found. Thus, genetic analysis confirmed and extended the influence of lipoprotein loci on Lp-PLA2 levels, identified some novel null alleles in the PLA2G7 gene, and only identified one potentially efficacious subgroup within these two large clinical trials. PMID:28753643

  5. Pharmacogenetic meta-analysis of baseline risk factors, pharmacodynamic, efficacy and tolerability endpoints from two large global cardiovascular outcomes trials for darapladib.

    PubMed

    Yeo, Astrid; Li, Li; Warren, Liling; Aponte, Jennifer; Fraser, Dana; King, Karen; Johansson, Kelley; Barnes, Allison; MacPhee, Colin; Davies, Richard; Chissoe, Stephanie; Tarka, Elizabeth; O'Donoghue, Michelle L; White, Harvey D; Wallentin, Lars; Waterworth, Dawn

    2017-01-01

    Darapladib, a lipoprotein-associated phospholipase A2 (Lp-PLA2) inhibitor, failed to demonstrate efficacy for the primary endpoints in two large phase III cardiovascular outcomes trials, one in stable coronary heart disease patients (STABILITY) and one in acute coronary syndrome (SOLID-TIMI 52). No major safety signals were observed but tolerability issues of diarrhea and odor were common (up to 13%). We hypothesized that genetic variants associated with Lp-PLA2 activity may influence efficacy and tolerability and therefore performed a comprehensive pharmacogenetic analysis of both trials. We genotyped patients within the STABILITY and SOLID-TIMI 52 trials who provided a DNA sample and consent (n = 13,577 and 10,404 respectively, representing 86% and 82% of the trial participants) using genome-wide arrays with exome content and performed imputation using a 1000 Genomes reference panel. We investigated baseline and change from baseline in Lp-PLA2 activity, two efficacy endpoints (major coronary events and myocardial infarction) as well as tolerability parameters at genome-wide and candidate gene level using a meta-analytic approach. We replicated associations of published loci on baseline Lp-PLA2 activity (APOE, CELSR2, LPA, PLA2G7, LDLR and SCARB1) and identified three novel loci (TOMM5, FRMD5 and LPL) using the GWAS-significance threshold P≤5E-08. Review of the PLA2G7 gene (encoding Lp-PLA2) within these datasets identified V279F null allele carriers as well as three other rare exonic null alleles within various ethnic groups, however none of these variants nor any other loci associated with Lp-PLA2 activity at baseline were associated with any of the drug response endpoints. 
    The analysis of darapladib efficacy endpoints, despite low power, identified six low frequency loci with main genotype effect (though with borderline imputation scores) and one common locus (minor allele frequency 0.24) with genotype by treatment interaction effect passing the GWAS-significance threshold. This locus conferred risk in placebo subjects, hazard ratio (HR) 1.22 with 95% confidence interval (CI) 1.11-1.33, but was protective in darapladib subjects, HR 0.79 (95% CI 0.71-0.88). No major loci for tolerability were found. Thus, genetic analysis confirmed and extended the influence of lipoprotein loci on Lp-PLA2 levels, identified some novel null alleles in the PLA2G7 gene, and only identified one potentially efficacious subgroup within these two large clinical trials.

  6. Genome-wide association analysis of ischemic stroke in young adults.

    PubMed

    Cheng, Yu-Ching; O'Connell, Jeffrey R; Cole, John W; Stine, O Colin; Dueker, Nicole; McArdle, Patrick F; Sparks, Mary J; Shen, Jess; Laurie, Cathy C; Nelson, Sarah; Doheny, Kimberly F; Ling, Hua; Pugh, Elizabeth W; Brott, Thomas G; Brown, Robert D; Meschia, James F; Nalls, Michael; Rich, Stephen S; Worrall, Bradford; Anderson, Christopher D; Biffi, Alessandro; Cortellini, Lynelle; Furie, Karen L; Rost, Natalia S; Rosand, Jonathan; Manolio, Teri A; Kittner, Steven J; Mitchell, Braxton D

    2011-11-01

    Ischemic stroke (IS) is among the leading causes of death in Western countries. There is a significant genetic component to IS susceptibility, especially among young adults. To date, research to identify genetic loci predisposing to stroke has met only with limited success. We performed a genome-wide association (GWA) analysis of early-onset IS to identify potential stroke susceptibility loci. The GWA analysis was conducted by genotyping 1 million SNPs in a biracial population of 889 IS cases and 927 controls, ages 15-49 years. Genotypes were imputed using the HapMap3 reference panel to provide 1.4 million SNPs for analysis. Logistic regression models adjusting for age, recruitment stages, and population structure were used to determine the association of IS with individual SNPs. Although no single SNP reached genome-wide significance (P < 5 × 10(-8)), we identified two SNPs in chromosome 2q23.3, rs2304556 (in FMNL2; P = 1.2 × 10(-7)) and rs1986743 (in ARL6IP6; P = 2.7 × 10(-7)), strongly associated with early-onset stroke. These data suggest that a novel locus on human chromosome 2q23.3 may be associated with IS susceptibility among young adults.

  7. Genome-wide association analysis of more than 120,000 individuals identifies 15 new susceptibility loci for breast cancer.

    PubMed

    Michailidou, Kyriaki; Beesley, Jonathan; Lindstrom, Sara; Canisius, Sander; Dennis, Joe; Lush, Michael J; Maranian, Mel J; Bolla, Manjeet K; Wang, Qin; Shah, Mitul; Perkins, Barbara J; Czene, Kamila; Eriksson, Mikael; Darabi, Hatef; Brand, Judith S; Bojesen, Stig E; Nordestgaard, Børge G; Flyger, Henrik; Nielsen, Sune F; Rahman, Nazneen; Turnbull, Clare; Fletcher, Olivia; Peto, Julian; Gibson, Lorna; dos-Santos-Silva, Isabel; Chang-Claude, Jenny; Flesch-Janys, Dieter; Rudolph, Anja; Eilber, Ursula; Behrens, Sabine; Nevanlinna, Heli; Muranen, Taru A; Aittomäki, Kristiina; Blomqvist, Carl; Khan, Sofia; Aaltonen, Kirsimari; Ahsan, Habibul; Kibriya, Muhammad G; Whittemore, Alice S; John, Esther M; Malone, Kathleen E; Gammon, Marilie D; Santella, Regina M; Ursin, Giske; Makalic, Enes; Schmidt, Daniel F; Casey, Graham; Hunter, David J; Gapstur, Susan M; Gaudet, Mia M; Diver, W Ryan; Haiman, Christopher A; Schumacher, Fredrick; Henderson, Brian E; Le Marchand, Loic; Berg, Christine D; Chanock, Stephen J; Figueroa, Jonine; Hoover, Robert N; Lambrechts, Diether; Neven, Patrick; Wildiers, Hans; van Limbergen, Erik; Schmidt, Marjanka K; Broeks, Annegien; Verhoef, Senno; Cornelissen, Sten; Couch, Fergus J; Olson, Janet E; Hallberg, Emily; Vachon, Celine; Waisfisz, Quinten; Meijers-Heijboer, Hanne; Adank, Muriel A; van der Luijt, Rob B; Li, Jingmei; Liu, Jianjun; Humphreys, Keith; Kang, Daehee; Choi, Ji-Yeob; Park, Sue K; Yoo, Keun-Young; Matsuo, Keitaro; Ito, Hidemi; Iwata, Hiroji; Tajima, Kazuo; Guénel, Pascal; Truong, Thérèse; Mulot, Claire; Sanchez, Marie; Burwinkel, Barbara; Marme, Frederik; Surowy, Harald; Sohn, Christof; Wu, Anna H; Tseng, Chiu-chen; Van Den Berg, David; Stram, Daniel O; González-Neira, Anna; Benitez, Javier; Zamora, M Pilar; Perez, Jose Ignacio Arias; Shu, Xiao-Ou; Lu, Wei; Gao, Yu-Tang; Cai, Hui; Cox, Angela; Cross, Simon S; Reed, Malcolm W R; Andrulis, Irene L; Knight, Julia A; Glendon, Gord; Mulligan, Anna Marie; Sawyer, Elinor J; Tomlinson, Ian; 
Kerin, Michael J; Miller, Nicola; Lindblom, Annika; Margolin, Sara; Teo, Soo Hwang; Yip, Cheng Har; Taib, Nur Aishah Mohd; Tan, Gie-Hooi; Hooning, Maartje J; Hollestelle, Antoinette; Martens, John W M; Collée, J Margriet; Blot, William; Signorello, Lisa B; Cai, Qiuyin; Hopper, John L; Southey, Melissa C; Tsimiklis, Helen; Apicella, Carmel; Shen, Chen-Yang; Hsiung, Chia-Ni; Wu, Pei-Ei; Hou, Ming-Feng; Kristensen, Vessela N; Nord, Silje; Alnaes, Grethe I Grenaker; Giles, Graham G; Milne, Roger L; McLean, Catriona; Canzian, Federico; Trichopoulos, Dimitrios; Peeters, Petra; Lund, Eiliv; Sund, Malin; Khaw, Kay-Tee; Gunter, Marc J; Palli, Domenico; Mortensen, Lotte Maxild; Dossus, Laure; Huerta, Jose-Maria; Meindl, Alfons; Schmutzler, Rita K; Sutter, Christian; Yang, Rongxi; Muir, Kenneth; Lophatananon, Artitaya; Stewart-Brown, Sarah; Siriwanarangsan, Pornthep; Hartman, Mikael; Miao, Hui; Chia, Kee Seng; Chan, Ching Wan; Fasching, Peter A; Hein, Alexander; Beckmann, Matthias W; Haeberle, Lothar; Brenner, Hermann; Dieffenbach, Aida Karina; Arndt, Volker; Stegmaier, Christa; Ashworth, Alan; Orr, Nick; Schoemaker, Minouk J; Swerdlow, Anthony J; Brinton, Louise; Garcia-Closas, Montserrat; Zheng, Wei; Halverson, Sandra L; Shrubsole, Martha; Long, Jirong; Goldberg, Mark S; Labrèche, France; Dumont, Martine; Winqvist, Robert; Pylkäs, Katri; Jukkola-Vuorinen, Arja; Grip, Mervi; Brauch, Hiltrud; Hamann, Ute; Brüning, Thomas; Radice, Paolo; Peterlongo, Paolo; Manoukian, Siranoush; Bernard, Loris; Bogdanova, Natalia V; Dörk, Thilo; Mannermaa, Arto; Kataja, Vesa; Kosma, Veli-Matti; Hartikainen, Jaana M; Devilee, Peter; Tollenaar, Robert A E M; Seynaeve, Caroline; Van Asperen, Christi J; Jakubowska, Anna; Lubinski, Jan; Jaworska, Katarzyna; Huzarski, Tomasz; Sangrajrang, Suleeporn; Gaborieau, Valerie; Brennan, Paul; McKay, James; Slager, Susan; Toland, Amanda E; Ambrosone, Christine B; Yannoukakos, Drakoulis; Kabisch, Maria; Torres, Diana; Neuhausen, Susan L; Anton-Culver, Hoda; 
Luccarini, Craig; Baynes, Caroline; Ahmed, Shahana; Healey, Catherine S; Tessier, Daniel C; Vincent, Daniel; Bacot, Francois; Pita, Guillermo; Alonso, M Rosario; Álvarez, Nuria; Herrero, Daniel; Simard, Jacques; Pharoah, Paul P D P; Kraft, Peter; Dunning, Alison M; Chenevix-Trench, Georgia; Hall, Per; Easton, Douglas F

    2015-04-01

    Genome-wide association studies (GWAS) and large-scale replication studies have identified common variants in 79 loci associated with breast cancer, explaining ∼14% of the familial risk of the disease. To identify new susceptibility loci, we performed a meta-analysis of 11 GWAS, comprising 15,748 breast cancer cases and 18,084 controls together with 46,785 cases and 42,892 controls from 41 studies genotyped on a 211,155-marker custom array (iCOGS). Analyses were restricted to women of European ancestry. We generated genotypes for more than 11 million SNPs by imputation using the 1000 Genomes Project reference panel, and we identified 15 new loci associated with breast cancer at P < 5 × 10(-8). Combining association analysis with ChIP-seq chromatin binding data in mammary cell lines and ChIA-PET chromatin interaction data from ENCODE, we identified likely target genes in two regions: SETBP1 at 18q12.3 and RNF115 and PDZK1 at 1q21.1. One association appears to be driven by an amino acid substitution encoded in EXO1.

  8. Filling the gap in functional trait databases: use of ecological hypotheses to replace missing data.

    PubMed

    Taugourdeau, Simon; Villerd, Jean; Plantureux, Sylvain; Huguenin-Elie, Olivier; Amiaud, Bernard

    2014-04-01

    Functional trait databases are powerful tools in ecology, though most of them contain large amounts of missing values. The goal of this study was to test the effect of imputation methods on the evaluation of trait values at species level and on the subsequent calculation of functional diversity indices at community level using functional trait databases. Two simple imputation methods (average and median), two methods based on ecological hypotheses, and one multiple imputation method were tested using a large plant trait database, together with the influence of the percentage of missing data and differences between functional traits. At community level, the complete-case approach and three functional diversity indices calculated from grassland plant communities were included. At the species level, one of the methods based on ecological hypothesis was for all traits more accurate than imputation with average or median values, but the multiple imputation method was superior for most of the traits. The method based on functional proximity between species was the best method for traits with an unbalanced distribution, while the method based on the existence of relationships between traits was the best for traits with a balanced distribution. The ranking of the grassland communities for their functional diversity indices was not robust with the complete-case approach, even for low percentages of missing data. With the imputation methods based on ecological hypotheses, functional diversity indices could be computed with a maximum of 30% of missing data, without affecting the ranking between grassland communities. The multiple imputation method performed well, but not better than single imputation based on ecological hypothesis and adapted to the distribution of the trait values for the functional identity and range of the communities. 
Ecological studies using functional trait databases have to deal with missing data using imputation methods corresponding to their specific needs and making the most out of the information available in the databases. Within this framework, this study indicates the possibilities and limits of single imputation methods based on ecological hypothesis and concludes that they could be useful when studying the ranking of communities for their functional diversity indices.

  9. Filling the gap in functional trait databases: use of ecological hypotheses to replace missing data

    PubMed Central

    Taugourdeau, Simon; Villerd, Jean; Plantureux, Sylvain; Huguenin-Elie, Olivier; Amiaud, Bernard

    2014-01-01

    Functional trait databases are powerful tools in ecology, though most of them contain large amounts of missing values. The goal of this study was to test the effect of imputation methods on the evaluation of trait values at species level and on the subsequent calculation of functional diversity indices at community level using functional trait databases. Two simple imputation methods (average and median), two methods based on ecological hypotheses, and one multiple imputation method were tested using a large plant trait database, together with the influence of the percentage of missing data and differences between functional traits. At community level, the complete-case approach and three functional diversity indices calculated from grassland plant communities were included. At the species level, one of the methods based on ecological hypothesis was for all traits more accurate than imputation with average or median values, but the multiple imputation method was superior for most of the traits. The method based on functional proximity between species was the best method for traits with an unbalanced distribution, while the method based on the existence of relationships between traits was the best for traits with a balanced distribution. The ranking of the grassland communities for their functional diversity indices was not robust with the complete-case approach, even for low percentages of missing data. With the imputation methods based on ecological hypotheses, functional diversity indices could be computed with a maximum of 30% of missing data, without affecting the ranking between grassland communities. The multiple imputation method performed well, but not better than single imputation based on ecological hypothesis and adapted to the distribution of the trait values for the functional identity and range of the communities. 
Ecological studies using functional trait databases have to deal with missing data using imputation methods corresponding to their specific needs and making the most out of the information available in the databases. Within this framework, this study indicates the possibilities and limits of single imputation methods based on ecological hypothesis and concludes that they could be useful when studying the ranking of communities for their functional diversity indices. PMID:24772273

  10. Comparison of missing value imputation methods in time series: the case of Turkish meteorological data

    NASA Astrophysics Data System (ADS)

    Yozgatligil, Ceylan; Aslan, Sipan; Iyigun, Cem; Batmaz, Inci

    2013-04-01

    This study aims to compare several imputation methods to complete the missing values of spatio-temporal meteorological time series. To this end, six imputation methods are assessed with respect to various criteria including accuracy, robustness, precision, and efficiency for artificially created missing data in monthly total precipitation and mean temperature series obtained from the Turkish State Meteorological Service. Of these methods, simple arithmetic average, normal ratio (NR), and NR weighted with correlations comprise the simple ones, whereas the multilayer perceptron type neural network and the multiple imputation strategy adopted by Monte Carlo Markov Chain based on expectation-maximization (EM-MCMC) are the computationally intensive ones. In addition, we propose a modification of the EM-MCMC method. Besides using a conventional accuracy measure based on squared errors, we also suggest the correlation dimension (CD) technique of nonlinear dynamic time series analysis, which takes spatio-temporal dependencies into account, for evaluating imputation performance. Based on the detailed graphical and quantitative analysis, although the computationally intensive methods, particularly the EM-MCMC method, are inefficient to compute, they seem favorable for imputation of meteorological time series with respect to different missingness periods, considering both measures and both series studied. To conclude, using the EM-MCMC algorithm for imputing missing values before conducting any statistical analyses of meteorological data will definitely decrease the amount of uncertainty and give more robust results. Moreover, the CD measure can be suggested for the performance evaluation of missing data imputation, particularly with computational methods, since it gives more precise results in meteorological time series.
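
    Illustrative sketch (not taken from the cited study): the abstract names the normal ratio (NR) method without stating its formula. A minimal Python version of the textbook NR estimate is given here, assuming that each neighboring station's observation is rescaled by the ratio of the target station's long-term mean (its "normal") to the neighbor's normal, and that the rescaled values are then averaged. The function name, argument names, and station values are hypothetical.

```python
def normal_ratio_estimate(target_normal, neighbor_normals, neighbor_values):
    """Estimate a missing value at a target station (textbook normal ratio form).

    target_normal    -- long-term mean at the station with the missing value
    neighbor_normals -- long-term means at the neighboring stations
    neighbor_values  -- neighbors' observations for the missing period
                        (None marks a neighbor that is also missing)
    """
    scaled = [
        (target_normal / normal) * value
        for normal, value in zip(neighbor_normals, neighbor_values)
        if value is not None  # skip neighbors with no observation
    ]
    if not scaled:
        raise ValueError("no neighboring observations available")
    return sum(scaled) / len(scaled)


# Hypothetical monthly precipitation totals (mm) for two usable neighbors.
estimate = normal_ratio_estimate(100.0, [50.0, 200.0, 80.0], [20.0, 100.0, None])
print(estimate)  # 45.0
```

    The correlation-weighted NR variant mentioned in the abstract would replace the plain average with a weighted one; the weights are not specified in the abstract, so they are left out of this sketch.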

  11. Is missing geographic positioning system data in accelerometry studies a problem, and is imputation the solution?

    PubMed Central

    Meseck, Kristin; Jankowska, Marta M.; Schipperijn, Jasper; Natarajan, Loki; Godbole, Suneeta; Carlson, Jordan; Takemoto, Michelle; Crist, Katie; Kerr, Jacqueline

    2016-01-01

    The main purpose of the present study was to assess the impact of global positioning system (GPS) signal lapse on physical activity analyses, to discover any existing associations between missing GPS data and environmental and demographic attributes, and to determine whether imputation is an accurate and viable method for correcting GPS data loss. Accelerometer and GPS data of 782 participants from 8 studies were pooled to represent a range of lifestyles and interactions with the built environment. Periods of GPS signal lapse were identified and extracted. Generalised linear mixed models were run with the number of lapses and the length of lapses as outcomes. The signal lapses were imputed using a simple ruleset, and imputation was validated against person-worn camera imagery. A final generalised linear mixed model was used to identify the difference between the amount of GPS minutes pre- and post-imputation for the activity categories of sedentary, light, and moderate-to-vigorous physical activity. GPS data lapses made up over 17% of the dataset. No strong associations were found between increasing lapse length and number of lapses and the demographic and built environment variables. A significant difference was found between the pre- and post-imputation minutes for each activity category. No demographic or environmental bias was found for length or number of lapses, but imputation of GPS data may make a significant difference for inclusion of physical activity data that occurred during a lapse. Imputing GPS data lapses is a viable technique for returning spatial context to accelerometer data and improving the completeness of the dataset. PMID:27245796

  12. Effects of imputation on correlation: implications for analysis of mass spectrometry data from multiple biological matrices.

    PubMed

    Taylor, Sandra L; Ruhaak, L Renee; Kelly, Karen; Weiss, Robert H; Kim, Kyoungmi

    2017-03-01

    With expanded access to, and decreased costs of, mass spectrometry, investigators are collecting and analyzing multiple biological matrices from the same subject such as serum, plasma, tissue and urine to enhance biomarker discoveries, understanding of disease processes and identification of therapeutic targets. Commonly, each biological matrix is analyzed separately, but multivariate methods such as MANOVAs that combine information from multiple biological matrices are potentially more powerful. However, mass spectrometric data typically contain large amounts of missing values, and imputation is often used to create complete data sets for analysis. The effects of imputation on multiple biological matrix analyses have not been studied. We investigated the effects of seven imputation methods (half minimum substitution, mean substitution, k-nearest neighbors, local least squares regression, Bayesian principal components analysis, singular value decomposition and random forest), on the within-subject correlation of compounds between biological matrices and its consequences on MANOVA results. Through analysis of three real omics data sets and simulation studies, we found the amount of missing data and imputation method to substantially change the between-matrix correlation structure. The magnitude of the correlations was generally reduced in imputed data sets, and this effect increased with the amount of missing data. Significant results from MANOVA testing also were substantially affected. In particular, the number of false positives increased with the level of missing data for all imputation methods. No one imputation method was universally the best, but the simple substitution methods (Half Minimum and Mean) consistently performed poorly. © The Author 2016. Published by Oxford University Press. For Permissions, please email: journals.permissions@oup.com.
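
    Illustrative sketch (not taken from the cited study): the two simple substitution methods named in the abstract can be written in a few lines of Python. The sketch assumes that missing intensities are marked with None and that each compound is imputed on its own, using only its observed values. The function names and the example intensities are hypothetical.

```python
from statistics import mean


def half_min_impute(values):
    """Replace each missing value with half of the smallest observed value."""
    observed = [v for v in values if v is not None]
    fill = min(observed) / 2
    return [fill if v is None else v for v in values]


def mean_impute(values):
    """Replace each missing value with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]


# Hypothetical intensities of one compound across five subjects.
serum = [4.0, None, 6.0, 8.0, None]
print(half_min_impute(serum))  # [4.0, 2.0, 6.0, 8.0, 2.0]
print(mean_impute(serum))      # [4.0, 6.0, 6.0, 8.0, 6.0]
```

    Both methods give every missing entry of a compound the same constant, which is one plausible reason why correlations between matrices can shrink after imputation, as the abstract reports.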

  13. 26 CFR 1.401(a)(4)-7 - Imputation of permitted disparity.

    Code of Federal Regulations, 2010 CFR

    2010-04-01

    ... + permitted disparity rate (3) Employees whose plan year compensation exceeds taxable wage base. If an... 26 Internal Revenue 5 2010-04-01 2010-04-01 false Imputation of permitted disparity. 1.401(a)(4)-7... Imputation of permitted disparity. (a) Introduction. In determining whether a plan satisfies section 401(a)(4...

  14. 31 CFR 19.630 - May the Department of the Treasury impute conduct of one person to another?

    Code of Federal Regulations, 2010 CFR

    2010-07-01

    ...'s knowledge, approval or acquiescence. The organization's acceptance of the benefits derived from the conduct is evidence of knowledge, approval or acquiescence. (b) Conduct imputed from an... individual to whom the improper conduct is imputed either participated in, had knowledge of, or reason to...

  15. 22 CFR 1508.630 - May the African Development Foundation impute conduct of one person to another?

    Code of Federal Regulations, 2010 CFR

    2010-04-01

    ... knowledge, approval or acquiescence. The organization's acceptance of the benefits derived from the conduct is evidence of knowledge, approval or acquiescence. (b) Conduct imputed from an organization to an... improper conduct is imputed either participated in, had knowledge of, or reason to know of the improper...

  16. 21 CFR 1404.630 - May the Office of National Drug Control Policy impute conduct of one person to another?

    Code of Federal Regulations, 2010 CFR

    2010-04-01

    ...'s knowledge, approval or acquiescence. The organization's acceptance of the benefits derived from the conduct is evidence of knowledge, approval or acquiescence. (b) Conduct imputed from an... individual to whom the improper conduct is imputed either participated in, had knowledge of, or reason to...

  17. 22 CFR 208.630 - May the U.S. Agency for International Development impute conduct of one person to another?

    Code of Federal Regulations, 2010 CFR

    2010-04-01

    ..., or with the organization's knowledge, approval or acquiescence. The organization's acceptance of the benefits derived from the conduct is evidence of knowledge, approval or acquiescence. (b) Conduct imputed... individual, if the individual to whom the improper conduct is imputed either participated in, had knowledge...

  18. A Comparison of Imputation Methods for Bayesian Factor Analysis Models

    ERIC Educational Resources Information Center

    Merkle, Edgar C.

    2011-01-01

    Imputation methods are popular for the handling of missing data in psychology. The methods generally consist of predicting missing data based on observed data, yielding a complete data set that is amenable to standard statistical analyses. In the context of Bayesian factor analysis, this article compares imputation under an unrestricted…

  19. Standard and Robust Methods in Regression Imputation

    ERIC Educational Resources Information Center

    Moraveji, Behjat; Jafarian, Koorosh

    2014-01-01

    The aim of this paper is to provide an introduction to new imputation algorithms for estimating missing values from official statistics in larger data sets of data pre-processing, or outliers. The goal is to propose a new algorithm called IRMI (iterative robust model-based imputation). This algorithm is able to deal with all challenges like…

  20. DOE Office of Scientific and Technical Information (OSTI.GOV)

    Webb-Robertson, Bobbie-Jo M.; Wiberg, Holli K.; Matzke, Melissa M.

    In this review, we apply selected imputation strategies to label-free liquid chromatography–mass spectrometry (LC–MS) proteomics datasets to evaluate the accuracy with respect to metrics of variance and classification. We evaluate several commonly used imputation approaches for individual merits and discuss the caveats of each approach with respect to the example LC–MS proteomics data. In general, local similarity-based approaches, such as the regularized expectation maximization and least-squares adaptive algorithms, yield the best overall performances with respect to metrics of accuracy and robustness. However, no single algorithm consistently outperforms the remaining approaches, and in some cases, performing classification without imputation sometimes yielded the most accurate classification. Thus, because of the complex mechanisms of missing data in proteomics, which also vary from peptide to protein, no individual method is a single solution for imputation. In summary, on the basis of the observations in this review, the goal for imputation in the field of computational proteomics should be to develop new approaches that work generically for this data type and new strategies to guide users in the selection of the best imputation for their dataset and analysis objectives.

  1. Common variants at the CHEK2 gene locus and risk of epithelial ovarian cancer

    PubMed Central

    Lawrenson, Kate; Iversen, Edwin S.; Tyrer, Jonathan; Weber, Rachel Palmieri; Concannon, Patrick; Hazelett, Dennis J.; Li, Qiyuan; Marks, Jeffrey R.; Berchuck, Andrew; Lee, Janet M.; Aben, Katja K.H.; Anton-Culver, Hoda; Antonenkova, Natalia; Bandera, Elisa V.; Bean, Yukie; Beckmann, Matthias W.; Bisogna, Maria; Bjorge, Line; Bogdanova, Natalia; Brinton, Louise A.; Brooks-Wilson, Angela; Bruinsma, Fiona; Butzow, Ralf; Campbell, Ian G.; Carty, Karen; Chang-Claude, Jenny; Chenevix-Trench, Georgia; Chen, Ann; Chen, Zhihua; Cook, Linda S.; Cramer, Daniel W.; Cunningham, Julie M.; Cybulski, Cezary; Plisiecka-Halasa, Joanna; Dennis, Joe; Dicks, Ed; Doherty, Jennifer A.; Dörk, Thilo; du Bois, Andreas; Eccles, Diana; Easton, Douglas T.; Edwards, Robert P.; Eilber, Ursula; Ekici, Arif B.; Fasching, Peter A.; Fridley, Brooke L.; Gao, Yu-Tang; Gentry-Maharaj, Aleksandra; Giles, Graham G.; Glasspool, Rosalind; Goode, Ellen L.; Goodman, Marc T.; Gronwald, Jacek; Harter, Philipp; Hasmad, Hanis Nazihah; Hein, Alexander; Heitz, Florian; Hildebrandt, Michelle A.T.; Hillemanns, Peter; Hogdall, Estrid; Hogdall, Claus; Hosono, Satoyo; Jakubowska, Anna; Paul, James; Jensen, Allan; Karlan, Beth Y.; Kjaer, Susanne Kruger; Kelemen, Linda E.; Kellar, Melissa; Kelley, Joseph L.; Kiemeney, Lambertus A.; Krakstad, Camilla; Lambrechts, Diether; Lambrechts, Sandrina; Le, Nhu D.; Lee, Alice W.; Cannioto, Rikki; Leminen, Arto; Lester, Jenny; Levine, Douglas A.; Liang, Dong; Lissowska, Jolanta; Lu, Karen; Lubinski, Jan; Lundvall, Lene; Massuger, Leon F.A.G.; Matsuo, Keitaro; McGuire, Valerie; McLaughlin, John R.; Nevanlinna, Heli; McNeish, Iain; Menon, Usha; Modugno, Francesmary; Moysich, Kirsten B.; Narod, Steven A.; Nedergaard, Lotte; Ness, Roberta B.; Noor Azmi, Mat Adenan; Odunsi, Kunle; Olson, Sara H.; Orlow, Irene; Orsulic, Sandra; Pearce, Celeste L.; Pejovic, Tanja; Pelttari, Liisa M.; Permuth-Wey, Jennifer; Phelan, Catherine M.; Pike, Malcolm C.; Poole, Elizabeth M.; Ramus, Susan J.; 
Risch, Harvey A.; Rosen, Barry; Rossing, Mary Anne; Rothstein, Joseph H.; Rudolph, Anja; Runnebaum, Ingo B.; Rzepecka, Iwona K.; Salvesen, Helga B.; Budzilowska, Agnieszka; Sellers, Thomas A.; Shu, Xiao-Ou; Shvetsov, Yurii B.; Siddiqui, Nadeem; Sieh, Weiva; Song, Honglin; Southey, Melissa C.; Sucheston, Lara; Tangen, Ingvild L.; Teo, Soo-Hwang; Terry, Kathryn L.; Thompson, Pamela J.; Timorek, Agnieszka; Tworoger, Shelley S.; Nieuwenhuysen, Els Van; Vergote, Ignace; Vierkant, Robert A.; Wang-Gohrke, Shan; Walsh, Christine; Wentzensen, Nicolas; Whittemore, Alice S.; Wicklund, Kristine G.; Wilkens, Lynne R.; Woo, Yin-Ling; Wu, Xifeng; Wu, Anna H.; Yang, Hannah; Zheng, Wei; Ziogas, Argyrios; Coetzee, Gerhard A.; Freedman, Matthew L.; Monteiro, Alvaro N.A.; Moes-Sosnowska, Joanna; Kupryjanczyk, Jolanta; Pharoah, Paul D.; Gayther, Simon A.; Schildkraut, Joellen M.

    2015-01-01

    Genome-wide association studies have identified 20 genomic regions associated with risk of epithelial ovarian cancer (EOC), but many additional risk variants may exist. Here, we evaluated associations between common genetic variants [single nucleotide polymorphisms (SNPs) and indels] in DNA repair genes and EOC risk. We genotyped 2896 common variants at 143 gene loci in DNA samples from 15 397 patients with invasive EOC and controls. We found evidence of associations with EOC risk for variants at FANCA, EXO1, E2F4, E2F2, CREB5 and CHEK2 genes (P ≤ 0.001). The strongest risk association was for CHEK2 SNP rs17507066 with serous EOC (P = 4.74 × 10(-7)). Additional genotyping and imputation of genotypes from the 1000 Genomes Project identified a slightly more significant association for CHEK2 SNP rs6005807 (r2 with rs17507066 = 0.84, odds ratio (OR) 1.17, 95% CI 1.11–1.24, P = 1.1 × 10(-7)). We identified 293 variants in the region with likelihood ratios of less than 1:100 for representing the causal variant. Functional annotation identified 25 candidate SNPs that alter transcription factor binding sites within regulatory elements active in EOC precursor tissues. In The Cancer Genome Atlas dataset, CHEK2 gene expression was significantly higher in primary EOCs compared to normal fallopian tube tissues (P = 3.72 × 10(-8)). We also identified an association between genotypes of the candidate causal SNP rs12166475 (r2 = 0.99 with rs6005807) and CHEK2 expression (P = 2.70 × 10(-8)). These data suggest that common variants at 22q12.1 are associated with risk of serous EOC and CHEK2 as a plausible target susceptibility gene. PMID:26424751

  2. An initiator codon mutation in SDE2 causes recessive embryonic lethality in Holstein cattle.

    PubMed

    Fritz, Sébastien; Hoze, Chris; Rebours, Emmanuelle; Barbat, Anne; Bizard, Méline; Chamberlain, Amanda; Escouflaire, Clémentine; Vander Jagt, Christy; Boussaha, Mekki; Grohs, Cécile; Allais-Bonnet, Aurélie; Philippe, Maëlle; Vallée, Amélie; Amigues, Yves; Hayes, Benjamin J; Boichard, Didier; Capitan, Aurélien

    2018-04-18

    Researching depletions in homozygous genotypes for specific haplotypes among the large cohorts of animals genotyped for genomic selection is a very efficient strategy to map recessive lethal mutations. In this study, by analyzing real or imputed Illumina BovineSNP50 (Illumina Inc., San Diego, CA) genotypes from more than 250,000 Holstein animals, we identified a new locus called HH6 showing significant negative effects on conception rate and nonreturn rate at 56 d in at-risk versus control mating. We fine-mapped this locus in a 1.1-Mb interval and analyzed genome sequence data from 12 carrier and 284 noncarrier Holstein bulls. We report the identification of a strong candidate mutation in the gene encoding SDE2 telomere maintenance homolog (SDE2), a protein essential for genomic stability in eukaryotes. This A-to-G transition changes the initiator ATG (methionine) codon to ACG because the gene is transcribed on the reverse strand. Using RNA sequencing and quantitative reverse-transcription PCR, we demonstrated that this mutation does not significantly affect SDE2 splicing and expression level in heterozygous carriers compared with control animals. Initiation of translation at the closest in-frame methionine codon would truncate the SDE2 precursor by 83 amino acids, including the cleavage site necessary for its activation. Finally, no homozygote for the G allele was observed in a large population of nearly 29,000 individuals genotyped for the mutation. The low frequency (1.3%) of the derived allele in the French population and the availability of a diagnostic test on the Illumina EuroG10K SNP chip routinely used for genomic evaluation will enable rapid and efficient selection against this deleterious mutation. Copyright © 2018 American Dairy Science Association. Published by Elsevier Inc. All rights reserved.

  3. Genetic relatedness of previously Plant-Variety-Protected commercial maize inbreds.

    PubMed

    Beckett, Travis J; Morales, A Jason; Koehler, Klaus L; Rocheford, Torbert R

    2017-01-01

    The emergence of high-throughput, high-density genotyping methods combined with increasingly powerful computing systems has created opportunities to further discover and exploit the genes controlling agronomic performance in elite maize breeding populations. Understanding the genetic basis of population structure in an elite set of materials is an essential step in this genetic discovery process. This paper presents a genotype-based population analysis of all maize inbreds whose Plant Variety Protection certificates had expired as of the end of 2013 (283 inbreds) as well as 66 public founder inbreds. The results provide accurate population structure information and allow for important inferences in context of the historical development of North American elite commercial maize germplasm. Genotypic data was obtained via genotyping-by-sequencing on 349 inbreds. After filtering for missing data, 77,314 high-quality markers remained. The remaining missing data (average per individual was 6.22 percent) was fully imputed at an accuracy of 83 percent. Calculation of linkage disequilibrium revealed that the average r2 of 0.20 occurs at approximately 1.1 Kb. Results of population genetics analyses agree with previously published studies that divide North American maize germplasm into three heterotic groups: Stiff Stalk, Non-Stiff Stalk, and Iodent. Principal component analysis shows that population differentiation is indeed very complex and present at many levels, yet confirms that division into three main sub-groups is optimal for population description. Clustering based on Nei's genetic distance provides an additional empirical representation of the three main heterotic groups. Overall fixation index (FST), indicating the degree of genetic divergence between the three main heterotic groups, was 0.1361. 
Understanding the genetic relationships and population differentiation of elite germplasm may help breeders to maintain and potentially increase the rate of genetic gain, resulting in higher overall agronomic performance.

  4. [Characteristics of dry matter production and nitrogen accumulation in barley genotypes with high nitrogen utilization efficiency].

    PubMed

    Huang, Yi; Li, Ting-Xuan; Zhang, Xi-Zhou; Ji, Lin

    2014-07-01

    A pot experiment was conducted under low (125 mg x kg(-1)) and normal (250 mg x kg(-1)) nitrogen treatments. The nitrogen uptake and utilization efficiency of 22 barley cultivars were investigated, and the characteristics of dry matter production and nitrogen accumulation in barley were analyzed. The results showed that nitrogen uptake and utilization efficiency differed for barley under the two nitrogen levels. Under the low nitrogen treatment, the maximal values of grain yield, nitrogen utilization efficiency for grain and nitrogen harvest index were 2.87, 2.91 and 2.47 times the lowest values, respectively. Grain yield, nitrogen utilization efficiency for grain and nitrogen harvest index of the barley genotype with high nitrogen utilization efficiency were significantly greater than those of the genotype with low nitrogen utilization efficiency; under the low nitrogen treatment, these parameters were 82.1%, 61.5% and 50.5% higher, respectively, in the high-efficiency genotype. Dry matter mass and nitrogen utilization of the high-efficiency genotype were significantly higher than those of the low-efficiency genotype. A peak of dry matter mass in the high-efficiency genotype occurred between the jointing and heading stages, while that of nitrogen accumulation appeared before jointing. Under the low nitrogen treatment, dry matter mass of DH61 and DH121+ was 34.4% and 38.3% higher, and nitrogen accumulation 54.8% and 58.0% higher, than that of DH80, respectively. Dry matter mass and nitrogen accumulation before the jointing stage strongly affected yield, with contribution rates of 47.9% and 54.7%, respectively, under the low nitrogen treatment. The effect of dry matter and nitrogen accumulation on nitrogen utilization efficiency for grain was largest from the heading to maturity stages, followed by the sowing to jointing stages, with contribution rates of 29.5% and 48.7%, and 29.0% and 15.8%, respectively.
In conclusion, the barley genotype with high nitrogen utilization efficiency had a strong capacity for dry matter production and nitrogen accumulation. It could synergistically improve yield and nitrogen utilization efficiency by enhancing nitrogen uptake and dry matter formation before the jointing stage.

  5. Whole-genome sequencing identifies EN1 as a determinant of bone density and fracture

    PubMed Central

    Zheng, Hou-Feng; Forgetta, Vincenzo; Hsu, Yi-Hsiang; Estrada, Karol; Rosello-Diez, Alberto; Leo, Paul J; Dahia, Chitra L; Park-Min, Kyung Hyun; Tobias, Jonathan H; Kooperberg, Charles; Kleinman, Aaron; Styrkarsdottir, Unnur; Liu, Ching-Ti; Uggla, Charlotta; Evans, Daniel S; Nielson, Carrie M; Walter, Klaudia; Pettersson-Kymmer, Ulrika; McCarthy, Shane; Eriksson, Joel; Kwan, Tony; Jhamai, Mila; Trajanoska, Katerina; Memari, Yasin; Min, Josine; Huang, Jie; Danecek, Petr; Wilmot, Beth; Li, Rui; Chou, Wen-Chi; Mokry, Lauren E; Moayyeri, Alireza; Claussnitzer, Melina; Cheng, Chia-Ho; Cheung, Warren; Medina-Gómez, Carolina; Ge, Bing; Chen, Shu-Huang; Choi, Kwangbom; Oei, Ling; Fraser, James; Kraaij, Robert; Hibbs, Matthew A; Gregson, Celia L; Paquette, Denis; Hofman, Albert; Wibom, Carl; Tranah, Gregory J; Marshall, Mhairi; Gardiner, Brooke B; Cremin, Katie; Auer, Paul; Hsu, Li; Ring, Sue; Tung, Joyce Y; Thorleifsson, Gudmar; Enneman, Anke W; van Schoor, Natasja M; de Groot, Lisette C.P.G.M.; van der Velde, Nathalie; Melin, Beatrice; Kemp, John P; Christiansen, Claus; Sayers, Adrian; Zhou, Yanhua; Calderari, Sophie; van Rooij, Jeroen; Carlson, Chris; Peters, Ulrike; Berlivet, Soizik; Dostie, Josée; Uitterlinden, Andre G; Williams, Stephen R.; Farber, Charles; Grinberg, Daniel; LaCroix, Andrea Z; Haessler, Jeff; Chasman, Daniel I; Giulianini, Franco; Rose, Lynda M; Ridker, Paul M; Eisman, John A; Nguyen, Tuan V; Center, Jacqueline R; Nogues, Xavier; Garcia-Giralt, Natalia; Launer, Lenore L; Gudnason, Vilmunder; Mellström, Dan; Vandenput, Liesbeth; Karlsson, Magnus K; Ljunggren, Östen; Svensson, Olle; Hallmans, Göran; Rousseau, François; Giroux, Sylvie; Bussière, Johanne; Arp, Pascal P; Koromani, Fjorda; Prince, Richard L; Lewis, Joshua R; Langdahl, Bente L; Hermann, A Pernille; Jensen, Jens-Erik B; Kaptoge, Stephen; Khaw, Kay-Tee; Reeve, Jonathan; Formosa, Melissa M; Xuereb-Anastasi, Angela; Åkesson, Kristina; McGuigan, Fiona E; Garg, Gaurav; Olmos, Jose M; 
Zarrabeitia, Maria T; Riancho, Jose A; Ralston, Stuart H; Alonso, Nerea; Jiang, Xi; Goltzman, David; Pastinen, Tomi; Grundberg, Elin; Gauguier, Dominique; Orwoll, Eric S; Karasik, David; Davey-Smith, George; Smith, Albert V; Siggeirsdottir, Kristin; Harris, Tamara B; Zillikens, M Carola; van Meurs, Joyce BJ; Thorsteinsdottir, Unnur; Maurano, Matthew T; Timpson, Nicholas J; Soranzo, Nicole; Durbin, Richard; Wilson, Scott G; Ntzani, Evangelia E; Brown, Matthew A; Stefansson, Kari; Hinds, David A; Spector, Tim; Cupples, L Adrienne; Ohlsson, Claes; Greenwood, Celia MT; Jackson, Rebecca D; Rowe, David W; Loomis, Cynthia A; Evans, David M; Ackert-Bicknell, Cheryl L; Joyner, Alexandra L; Duncan, Emma L; Kiel, Douglas P; Rivadeneira, Fernando; Richards, J Brent

    2016-01-01

    SUMMARY: The extent to which low-frequency (minor allele frequency [MAF] between 1–5%) and rare (MAF ≤ 1%) variants contribute to complex traits and disease in the general population is largely unknown. Bone mineral density (BMD) is highly heritable, is a major predictor of osteoporotic fractures and has been previously associated with common genetic variants (refs. 1–8) and rare, population-specific, coding variants (ref. 9). Here we identify novel non-coding genetic variants with large effects on BMD (n(total) = 53,236) and fracture (n(total) = 508,253) in individuals of European ancestry from the general population. Associations for BMD were derived from whole-genome sequencing (n = 2,882 from UK10K), whole-exome sequencing (n = 3,549), deep imputation of genotyped samples using a combined UK10K/1000Genomes reference panel (n = 26,534), and de novo replication genotyping (n = 20,271). We identified a low-frequency non-coding variant near a novel locus, EN1, with an effect size 4-fold larger than the mean of previously reported common variants for lumbar spine BMD (ref. 8) (rs11692564[T], MAF = 1.7%, replication effect size = +0.20 standard deviations [SD], P(meta) = 2 × 10(-14)), which was also associated with a decreased risk of fracture (OR = 0.85; P = 2 × 10(-11); n(cases) = 98,742 and n(controls) = 409,511). Using an En1(Cre/flox) mouse model, we observed that conditional loss of En1 results in low bone mass, likely as a consequence of high bone turnover. We also identified a novel low-frequency non-coding variant with large effects on BMD near WNT16 (rs148771817[T], MAF = 1.1%, replication effect size = +0.39 SD, P(meta) = 1 × 10(-11)). In general, there was an excess of association signals arising from deleterious coding and conserved non-coding variants. These findings provide evidence that low-frequency non-coding variants have large effects on BMD and fracture, thereby providing rationale for whole-genome sequencing and improved imputation reference panels to study the genetic architecture of complex traits and disease in the general population. PMID:26367794

  6. Genome-Wide Meta-Analysis of Sciatica in Finnish Population.

    PubMed

    Lemmelä, Susanna; Solovieva, Svetlana; Shiri, Rahman; Benner, Christian; Heliövaara, Markku; Kettunen, Johannes; Anttila, Verneri; Ripatti, Samuli; Perola, Markus; Seppälä, Ilkka; Juonala, Markus; Kähönen, Mika; Salomaa, Veikko; Viikari, Jorma; Raitakari, Olli T; Lehtimäki, Terho; Palotie, Aarno; Viikari-Juntura, Eira; Husgafvel-Pursiainen, Kirsti

    2016-01-01

    Sciatica, or the sciatic syndrome, is a common and often disabling low back disorder in the working-age population. It has a relatively high heritability but poorly understood molecular mechanisms. The Finnish population is a genetic isolate in which a small founder population and bottleneck events have led to enrichment of certain rare and low-frequency variants. Here we performed the first genome-wide association study (GWAS) and meta-analysis of sciatica. The meta-analysis was conducted across two GWAS covering 291 Finnish sciatica cases and 3671 controls genotyped and imputed at 7.7 million autosomal variants. The most promising loci (p < 1 × 10(-6)) were tested for replication in 776 Finnish sciatica patients and 18,489 controls. We identified five intragenic variants, with relatively low frequencies, at two novel loci associated with sciatica at genome-wide significance. These included chr9:14344410:I (rs71321981) at 9p22.3 (NFIB gene; p = 1.30 × 10(-8), MAF = 0.08) and four variants at 15q21.2: rs145901849, rs80035109, rs190200374 and rs117458827 (MYO5A; p = 1.34 × 10(-8), MAF = 0.06; p = 2.32 × 10(-8), MAF = 0.07; p = 3.85 × 10(-8), MAF = 0.06; p = 4.78 × 10(-8), MAF = 0.07, respectively). The most significant association in the meta-analysis, a single base insertion rs71321981 within the regulatory region of the transcription factor NFIB, replicated in an independent Finnish population sample (p = 0.04). Although 15q21.2 emerged as a promising locus, we were not able to replicate it. The locus was population-differentiated: the lead variants within 15q21.2 were more frequent in Finland (6-7%) than in other European populations (1-2%). Imputation accuracies of the three significantly associated variants (chr9:14344410:I, rs190200374, and rs80035109) were validated by genotyping. In summary, our results suggest a novel locus, 9p22.3 (NFIB), which may be involved in susceptibility to sciatica. In addition, another locus, 15q21.2, emerged as a promising one, but failed to replicate.

  7. Genome-Wide Meta-Analysis of Sciatica in Finnish Population

    PubMed Central

    Lemmelä, Susanna; Solovieva, Svetlana; Shiri, Rahman; Benner, Christian; Heliövaara, Markku; Kettunen, Johannes; Anttila, Verneri; Ripatti, Samuli; Perola, Markus; Seppälä, Ilkka; Juonala, Markus; Kähönen, Mika; Salomaa, Veikko; Viikari, Jorma; Raitakari, Olli T.; Lehtimäki, Terho; Palotie, Aarno; Viikari-Juntura, Eira; Husgafvel-Pursiainen, Kirsti

    2016-01-01

    Sciatica, or the sciatic syndrome, is a common and often disabling low back disorder in the working-age population. It has a relatively high heritability but poorly understood molecular mechanisms. The Finnish population is a genetic isolate in which a small founder population and bottleneck events have led to enrichment of certain rare and low-frequency variants. Here we performed the first genome-wide association study (GWAS) and meta-analysis of sciatica. The meta-analysis was conducted across two GWAS covering 291 Finnish sciatica cases and 3671 controls genotyped and imputed at 7.7 million autosomal variants. The most promising loci (p < 1 × 10(-6)) were tested for replication in 776 Finnish sciatica patients and 18,489 controls. We identified five intragenic variants, with relatively low frequencies, at two novel loci associated with sciatica at genome-wide significance. These included chr9:14344410:I (rs71321981) at 9p22.3 (NFIB gene; p = 1.30 × 10(-8), MAF = 0.08) and four variants at 15q21.2: rs145901849, rs80035109, rs190200374 and rs117458827 (MYO5A; p = 1.34 × 10(-8), MAF = 0.06; p = 2.32 × 10(-8), MAF = 0.07; p = 3.85 × 10(-8), MAF = 0.06; p = 4.78 × 10(-8), MAF = 0.07, respectively). The most significant association in the meta-analysis, a single base insertion rs71321981 within the regulatory region of the transcription factor NFIB, replicated in an independent Finnish population sample (p = 0.04). Although 15q21.2 emerged as a promising locus, we were not able to replicate it. The locus was population-differentiated: the lead variants within 15q21.2 were more frequent in Finland (6–7%) than in other European populations (1–2%). Imputation accuracies of the three significantly associated variants (chr9:14344410:I, rs190200374, and rs80035109) were validated by genotyping. In summary, our results suggest a novel locus, 9p22.3 (NFIB), which may be involved in susceptibility to sciatica. In addition, another locus, 15q21.2, emerged as a promising one, but failed to replicate. PMID:27764105

  8. A genome-wide association study to identify genomic modulators of rate control therapy in patients with atrial fibrillation.

    PubMed

    Kolek, Matthew J; Edwards, Todd L; Muhammad, Raafia; Balouch, Adnan; Shoemaker, M Benjamin; Blair, Marcia A; Kor, Kaylen C; Takahashi, Atsushi; Kubo, Michiaki; Roden, Dan M; Tanaka, Toshihiro; Darbar, Dawood

    2014-08-15

    For many patients with atrial fibrillation, ventricular rate control with atrioventricular (AV) nodal blockers is considered first-line therapy, although response to treatment is highly variable. Using an extreme phenotype of failure of rate control necessitating AV nodal ablation and pacemaker implantation, we conducted a genome-wide association study (GWAS) to identify genomic modulators of rate control therapy. Cases included 95 patients who failed rate control therapy. Controls (n = 190) achieved adequate rate control therapy with ≤2 AV nodal blockers using a conventional clinical definition. Genotyping was performed on the Illumina 610-Quad platform, and results were imputed to the 1000 Genomes reference haplotypes. A total of 554,041 single-nucleotide polymorphisms (SNPs) met criteria for minor allele frequency (>0.01), call rate (>95%), and quality control, and 6,055,224 SNPs were available after imputation. No SNP reached the canonical threshold for significance for GWAS of p <5 × 10(-8). Sixty-three SNPs with p <10(-5) at 6 genomic loci were genotyped in a validation cohort of 130 cases and 157 controls. These included 6q24.3 (near SAMD5/SASH1, p = 9.36 × 10(-8)), 4q12 (IGFBP7, p = 1.75 × 10(-7)), 6q22.33 (C6orf174, p = 4.86 × 10(-7)), 3p21.31 (CDCP1, p = 1.18 × 10(-6)), 12p12.1 (SOX5, p = 1.62 × 10(-6)), and 7p11 (LANCL2, p = 6.51 × 10(-6)). However, none of these were significant in the replication cohort or in a meta-analysis of both cohorts. In conclusion, we identified several potentially important genomic modulators of rate control therapy in atrial fibrillation, particularly SOX5, which was previously associated with heart rate at rest and PR interval. However, these failed to reach genome-wide significance. Copyright © 2014 Elsevier Inc. All rights reserved.
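    The per-SNP filters quoted in this abstract (minor allele frequency > 0.01, call rate > 95%) can be illustrated with a minimal sketch. The genotype encoding (alternate-allele counts 0/1/2, NaN for a missing call) and the function name `snp_qc_mask` are assumptions made for illustration; the study's actual quality-control pipeline is not described in the abstract.

```python
import numpy as np

def snp_qc_mask(genotypes, maf_min=0.01, call_rate_min=0.95):
    """Return a boolean mask of SNPs passing MAF and call-rate filters.

    genotypes: (n_samples, n_snps) array of alternate-allele counts 0/1/2,
    with np.nan marking a missing call (an assumed encoding).
    """
    g = np.asarray(genotypes, dtype=float)
    called = ~np.isnan(g)
    n_called = called.sum(axis=0)
    # Fraction of samples with a non-missing call at each SNP.
    call_rate = n_called / g.shape[0]
    # Allele frequency over called genotypes only; two alleles per sample.
    alt_count = np.where(called, g, 0.0).sum(axis=0)
    alt_freq = np.divide(alt_count, 2.0 * n_called,
                         out=np.zeros(g.shape[1]), where=n_called > 0)
    maf = np.minimum(alt_freq, 1.0 - alt_freq)
    return (maf > maf_min) & (call_rate > call_rate_min)

# Toy data: SNP 0 is common and fully called, SNP 1 is monomorphic,
# SNP 2 has one missing call out of four samples.
toy = np.array([[0, 0, 1],
                [1, 0, np.nan],
                [2, 0, 1],
                [1, 0, 0]], dtype=float)
print(snp_qc_mask(toy))
```

    Only the first toy SNP passes: the second fails the frequency filter and the third fails the call-rate filter.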

  9. Accounting for Dependence Induced by Weighted KNN Imputation in Paired Samples, Motivated by a Colorectal Cancer Study

    PubMed Central

    Suyundikov, Anvar; Stevens, John R.; Corcoran, Christopher; Herrick, Jennifer; Wolff, Roger K.; Slattery, Martha L.

    2015-01-01

    Missing data can arise in bioinformatics applications for a variety of reasons, and imputation methods are frequently applied to such data. We are motivated by a colorectal cancer study where miRNA expression was measured in paired tumor-normal samples of hundreds of patients, but data for many normal samples were missing due to lack of tissue availability. We compare the precision and power performance of several imputation methods, and draw attention to the statistical dependence induced by K-Nearest Neighbors (KNN) imputation. This imputation-induced dependence has not previously been addressed in the literature. We demonstrate how to account for this dependence, and show through simulation how the choice to ignore or account for this dependence affects both power and type I error rate control. PMID:25849489
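    As a rough illustration of weighted KNN imputation, the sketch below fills each missing value with an inverse-distance-weighted average over the k nearest rows that have that feature observed. The distance (root mean squared difference over shared features), the weighting scheme and the name `weighted_knn_impute` are assumptions chosen for illustration, not the authors' implementation.

```python
import numpy as np

def weighted_knn_impute(X, k=2, eps=1e-12):
    """Fill NaNs in X (samples x features) by inverse-distance-weighted KNN."""
    X = np.asarray(X, dtype=float)
    out = X.copy()
    n = X.shape[0]
    for i in range(n):
        miss = np.isnan(X[i])
        if not miss.any():
            continue
        # Distance from row i to every other row over features observed in both.
        dists = np.full(n, np.inf)
        for j in range(n):
            if j == i:
                continue
            shared = ~miss & ~np.isnan(X[j])
            if shared.any():
                dists[j] = np.sqrt(np.mean((X[i, shared] - X[j, shared]) ** 2))
        for f in np.where(miss)[0]:
            # Candidate donors: rows with feature f observed and a finite distance.
            donors = np.where(~np.isnan(X[:, f]) & np.isfinite(dists))[0]
            if donors.size == 0:
                continue  # nothing to borrow from; leave the value missing
            nearest = donors[np.argsort(dists[donors])[:k]]
            w = 1.0 / (dists[nearest] + eps)
            out[i, f] = np.sum(w * X[nearest, f]) / np.sum(w)
    return out

X = np.array([[1.0, 10.0],
              [2.0, 20.0],
              [4.0, 40.0],
              [3.0, np.nan]])
filled = weighted_knn_impute(X, k=2)
print(filled[3, 1])
```

    Because every imputed value is a weighted average of donor rows, imputed samples are not independent of their donors or of each other, which is the dependence the abstract draws attention to.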

  10. Characterizing and Managing Missing Structured Data in Electronic Health Records: Data Analysis.

    PubMed

    Beaulieu-Jones, Brett K; Lavage, Daniel R; Snyder, John W; Moore, Jason H; Pendergrass, Sarah A; Bauer, Christopher R

    2018-02-23

    Missing data is a challenge for all studies; however, this is especially true for electronic health record (EHR)-based analyses. Failure to appropriately consider missing data can lead to biased results. While there has been extensive theoretical work on imputation, and many sophisticated methods are now available, it remains quite challenging for researchers to implement these methods appropriately. Here, we provide detailed procedures for when and how to conduct imputation of EHR laboratory results. The objective of this study was to demonstrate how the mechanism of missingness can be assessed, evaluate the performance of a variety of imputation methods, and describe some of the most frequent problems that can be encountered. We analyzed clinical laboratory measures from 602,366 patients in the EHR of Geisinger Health System in Pennsylvania, USA. Using these data, we constructed a representative set of complete cases and assessed the performance of 12 different imputation methods for missing data that was simulated based on 4 mechanisms of missingness (missing completely at random, missing not at random, missing at random, and real data modelling). Our results showed that several methods, including variations of Multivariate Imputation by Chained Equations (MICE) and softImpute, consistently imputed missing values with low error; however, only a subset of the MICE methods was suitable for multiple imputation. The analyses we describe provide an outline of considerations for dealing with missing EHR data, steps that researchers can perform to characterize missingness within their own data, and an evaluation of methods that can be applied to impute clinical data. While the performance of methods may vary between datasets, the process we describe can be generalized to the majority of structured data types that exist in EHRs, and all of our methods and code are publicly available. 
©Brett K Beaulieu-Jones, Daniel R Lavage, John W Snyder, Jason H Moore, Sarah A Pendergrass, Christopher R Bauer. Originally published in JMIR Medical Informatics (http://medinform.jmir.org), 23.02.2018.
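    The three classical mechanisms of missingness named in this abstract can be illustrated with a minimal simulation. Everything below is a simplified sketch: the function names are hypothetical, the MAR and MNAR masks use a deterministic top-quantile rule rather than a probabilistic model, the fourth mechanism (real data modelling) is not covered, and mean imputation stands in for the 12 methods the study actually compared.

```python
import numpy as np

def simulate_missing(x, z, mechanism, rate=0.3, rng=None):
    """Return a boolean mask (True = missing) for x under one mechanism.

    "mcar": each value is dropped with the same probability `rate`.
    "mar":  values are dropped where the fully observed covariate z is high.
    "mnar": values are dropped where x itself is high.
    """
    x = np.asarray(x, dtype=float)
    z = np.asarray(z, dtype=float)
    if mechanism == "mcar":
        rng = np.random.default_rng() if rng is None else rng
        return rng.random(x.size) < rate
    if mechanism == "mar":
        driver = z
    elif mechanism == "mnar":
        driver = x
    else:
        raise ValueError(f"unknown mechanism: {mechanism!r}")
    # Deterministic simplification: drop the top `rate` fraction of the driver.
    return driver > np.quantile(driver, 1.0 - rate)

def mean_impute_rmse(x, mask):
    """RMSE of observed-mean imputation, scored on the masked entries only."""
    x = np.asarray(x, dtype=float)
    if not mask.any() or mask.all():
        return float("nan")  # nothing to score, or nothing to learn from
    fill = x[~mask].mean()
    return float(np.sqrt(np.mean((x[mask] - fill) ** 2)))

x = np.arange(10.0)
z = 9.0 - x  # a fully observed covariate, perfectly (negatively) related to x
mnar_mask = simulate_missing(x, z, "mnar")
print(mnar_mask, mean_impute_rmse(x, mnar_mask))
```

    Under the MNAR mask the largest values are the ones removed, so the observed mean underestimates them; this is the kind of bias that makes assessing the mechanism of missingness worthwhile before choosing a method.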

  11. Genomewide association study of cocaine dependence and related traits: FAM53B identified as a risk gene

    PubMed Central

    Gelernter, Joel; Sherva, Richard; Koesterer, Ryan; Almasy, Laura; Zhao, Hongyu; Kranzler, Henry R.; Farrer, Lindsay

    2013-01-01

    We report a GWAS for cocaine dependence (CD) in three sets of African- and European-American subjects (AAs and EAs, respectively), to identify pathways, genes, and alleles important in CD risk. The discovery GWAS dataset (n = 5,697 subjects) was genotyped using the Illumina OmniQuad microarray (890,000 analyzed SNPs). Additional genotypes were imputed based on the 1000 Genomes reference panel. Top-ranked findings were evaluated by incorporating information from publicly available GWAS data from 4,063 subjects. Then, the most significant GWAS SNPs were genotyped in 2,549 independent subjects. We observed one genomewide-significant (GWS) result: rs7086629 at the FAM53B (“family with sequence similarity 53, member B”) locus. This was supported in both AAs and EAs; p-value (meta-analysis of all samples) = 4.28 × 10(-8). The gene maps to the same chromosomal region as the maximum peak we observed in a previous linkage study. NCOR2 (nuclear receptor corepressor 2) SNP rs150954431 was associated with p = 1.19 × 10(-9) in the EA discovery sample. SNP rs2456778, which maps to CDK1 (“cyclin-dependent kinase 1”), was associated with cocaine-induced paranoia in AAs in the discovery sample only (p = 4.68 × 10(-8)). This is the first study to identify risk variants for CD using GWAS. Our results implicate novel risk loci and provide insights into potential therapeutic and prevention strategies. PMID:23958962

  12. A comparison of imputation techniques for handling missing predictor values in a risk model with a binary outcome.

    PubMed

    Ambler, Gareth; Omar, Rumana Z; Royston, Patrick

    2007-06-01

    Risk models that aim to predict the future course and outcome of disease processes are increasingly used in health research, and it is important that they are accurate and reliable. Most of these risk models are fitted using routinely collected data in hospitals or general practices. Clinical outcomes such as short-term mortality will be near-complete, but many of the predictors may have missing values. A common approach to dealing with this is to perform a complete-case analysis. However, this may lead to overfitted models and biased estimates if entire patient subgroups are excluded. The aim of this paper is to investigate a number of methods for imputing missing data to evaluate their effect on risk model estimation and the reliability of the predictions. Multiple imputation methods, including hotdecking and multiple imputation by chained equations (MICE), were investigated along with several single imputation methods. A large national cardiac surgery database was used to create simulated yet realistic datasets. The results suggest that complete case analysis may produce unreliable risk predictions and should be avoided. Conditional mean imputation performed well in our scenario, but may not be appropriate if using variable selection methods. MICE was amongst the best performing multiple imputation methods with regard to the quality of the predictions. Additionally, it produced the least biased estimates, with good coverage, and hence is recommended for use in practice.
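    Conditional mean imputation, one of the single imputation methods mentioned above, can be sketched as follows: a linear regression of the incomplete predictor on fully observed covariates is fitted on the complete cases, and each missing value is replaced by its fitted value. The linear model and the name `conditional_mean_impute` are assumptions for illustration; the abstract does not specify the model the authors used.

```python
import numpy as np

def conditional_mean_impute(x, Z):
    """Replace NaNs in x with predictions from a least-squares fit of x on Z.

    x: (n,) predictor with missing values marked as np.nan.
    Z: (n,) or (n, p) fully observed covariates.
    """
    x = np.asarray(x, dtype=float)
    Z = np.asarray(Z, dtype=float)
    if Z.ndim == 1:
        Z = Z[:, None]
    miss = np.isnan(x)
    # Design matrix with an intercept column.
    A = np.column_stack([np.ones(len(x)), Z])
    # Fit on complete cases only.
    coef, *_ = np.linalg.lstsq(A[~miss], x[~miss], rcond=None)
    out = x.copy()
    out[miss] = A[miss] @ coef
    return out

z = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
x = np.array([1.0, 3.0, np.nan, 7.0, np.nan])  # observed points lie on x = 2z + 1
x_filled = conditional_mean_impute(x, z)
print(x_filled)
```

    Every imputed value sits exactly on the fitted line, so this approach understates the variability of the imputed predictor; multiple imputation methods such as MICE add random draws to reflect that uncertainty.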

  13. Mapping gradients of community composition with nearest-neighbour imputation: extending plot data for landscape analysis

    Treesearch

    Janet L. Ohmann; Matthew J. Gregory; Emilie B. Henderson; Heather M. Roberts

    2011-01-01

    Question: How can nearest-neighbour (NN) imputation be used to develop maps of multiple species and plant communities? Location: Western and central Oregon, USA, but methods are applicable anywhere. Methods: We demonstrate NN imputation by mapping woody plant communities for >100 000 km2 of diverse forests and woodlands. Species abundances on...

  14. Nonparametric Bayesian Multiple Imputation for Incomplete Categorical Variables in Large-Scale Assessment Surveys

    ERIC Educational Resources Information Center

    Si, Yajuan; Reiter, Jerome P.

    2013-01-01

    In many surveys, the data comprise a large number of categorical variables that suffer from item nonresponse. Standard methods for multiple imputation, like log-linear models or sequential regression imputation, can fail to capture complex dependencies and can be difficult to implement effectively in high dimensions. We present a fully Bayesian,…

  15. Users guide to the Most Similar Neighbor Imputation Program Version 2

    Treesearch

    Nicholas L. Crookston; Melinda Moeur; David Renner

    2002-01-01

    The Most Similar Neighbor (MSN, Moeur and Stage 1995) program is used to impute attributes measured on some sample units to sample units where they are not measured. In forestry applications, forest stands or vegetation polygons are examples of sample units. Attributes from detailed vegetation inventories are imputed to sample units where that information is not...

  16. An imputed forest composition map for New England screened by species range boundaries

    Treesearch

    Matthew J. Duveneck; Jonathan R. Thompson; B. Tyler Wilson

    2015-01-01

    Initializing forest landscape models (FLMs) to simulate changes in tree species composition requires accurate fine-scale forest attribute information mapped continuously over large areas. Nearest-neighbor imputation maps, maps developed from multivariate imputation of field plots, have high potential for use as the initial condition within FLMs, but the tendency for...

  17. Missing value imputation for microarray data: a comprehensive comparison study and a web tool

    PubMed Central

    2013-01-01

    Background: Microarray data are usually peppered with missing values due to various reasons. However, most of the downstream analyses for microarray data require complete datasets. Therefore, accurate algorithms for missing value estimation are needed for improving the performance of microarray data analyses. Although many algorithms have been developed, the selection of the optimal algorithm remains debated. Existing performance comparisons of different algorithms are not yet comprehensive, especially in the number of benchmark datasets used, the number of algorithms compared, the rounds of simulation conducted, and the performance measures used. Results: In this paper, we performed a comprehensive comparison by using (I) thirteen datasets, (II) nine algorithms, (III) 110 independent runs of simulation, and (IV) three types of measures to evaluate the performance of each imputation algorithm fairly. First, the effects of different types of microarray datasets on the performance of each imputation algorithm were evaluated. Second, we discussed whether datasets from different species have a different impact on the performance of different algorithms. To assess the performance of each algorithm fairly, all evaluations were performed using three types of measures. Our results indicate that the performance of an imputation algorithm depends mainly on the type of dataset rather than on the species from which the samples come. In addition to the statistical measure, two other measures with biological meanings are useful to reflect the impact of missing value imputation on the downstream data analyses. Our study suggests that local-least-squares-based methods are good choices to handle missing values for most of the microarray datasets. Conclusions: In this work, we carried out a comprehensive comparison of the algorithms for microarray missing value imputation. Based on such a comprehensive comparison, researchers could choose the optimal algorithm for their datasets easily. Moreover, new imputation algorithms could be compared with the existing algorithms using this comparison strategy as a standard protocol. In addition, to assist researchers in dealing with missing values easily, we built a web-based and easy-to-use imputation tool, MissVIA (http://cosbi.ee.ncku.edu.tw/MissVIA), which supports many imputation algorithms. Once users upload a real microarray dataset and choose the imputation algorithms, MissVIA will determine the optimal algorithm for the users' data through a series of simulations, and then the imputed results can be downloaded for the downstream data analyses. PMID:24565220

  18. Flexible Modeling of Survival Data with Covariates Subject to Detection Limits via Multiple Imputation.

    PubMed

    Bernhardt, Paul W; Wang, Huixia Judy; Zhang, Daowen

    2014-01-01

    Models for survival data generally assume that covariates are fully observed. However, in medical studies it is not uncommon for biomarkers to be censored at known detection limits. A computationally-efficient multiple imputation procedure for modeling survival data with covariates subject to detection limits is proposed. This procedure is developed in the context of an accelerated failure time model with a flexible seminonparametric error distribution. The consistency and asymptotic normality of the multiple imputation estimator are established and a consistent variance estimator is provided. An iterative version of the proposed multiple imputation algorithm that approximates the EM algorithm for maximum likelihood is also suggested. Simulation studies demonstrate that the proposed multiple imputation methods work well while alternative methods lead to estimates that are either biased or more variable. The proposed methods are applied to analyze the dataset from a recently-conducted GenIMS study.

  19. PTPN22 association in systemic lupus erythematosus (SLE) with respect to individual ancestry and clinical sub-phenotypes.

    PubMed

    Namjou, Bahram; Kim-Howard, Xana; Sun, Celi; Adler, Adam; Chung, Sharon A; Kaufman, Kenneth M; Kelly, Jennifer A; Glenn, Stuart B; Guthridge, Joel M; Scofield, Robert H; Kimberly, Robert P; Brown, Elizabeth E; Alarcón, Graciela S; Edberg, Jeffrey C; Kim, Jae-Hoon; Choi, Jiyoung; Ramsey-Goldman, Rosalind; Petri, Michelle A; Reveille, John D; Vilá, Luis M; Boackle, Susan A; Freedman, Barry I; Tsao, Betty P; Langefeld, Carl D; Vyse, Timothy J; Jacob, Chaim O; Pons-Estel, Bernardo; Niewold, Timothy B; Moser Sivils, Kathy L; Merrill, Joan T; Anaya, Juan-Manuel; Gilkeson, Gary S; Gaffney, Patrick M; Bae, Sang-Cheol; Alarcón-Riquelme, Marta E; Harley, John B; Criswell, Lindsey A; James, Judith A; Nath, Swapan K

    2013-01-01

    Protein tyrosine phosphatase non-receptor type 22 (PTPN22) is a negative regulator of T-cell activation associated with several autoimmune diseases, including systemic lupus erythematosus (SLE). The missense variant rs2476601 is associated with SLE in individuals with European ancestry. Since the rs2476601 risk allele frequency differs dramatically across ethnicities, we assessed the robustness of the PTPN22 association with SLE and its clinical sub-phenotypes across four ethnically diverse populations. Ten SNPs were genotyped in 8220 SLE cases and 7369 controls from European-Americans (EA), African-Americans (AA), Asians (AS), and Hispanics (HS). We performed imputation-based association followed by conditional analysis to identify independent associations. Significantly associated SNPs were tested for association with SLE clinical sub-phenotypes, including autoantibody profiles. Multiple testing was accounted for by using the false discovery rate. We successfully imputed and tested allelic association for 107 SNPs within the PTPN22 region and detected evidence of ethnic-specific associations in EA and HS. In EA, the strongest association was at rs2476601 (P = 4.7 × 10(-9), OR = 1.40 (95% CI = 1.25-1.56)). Independent association with rs1217414 was also observed in EA, and both SNPs are correlated with increased European ancestry. In HS, an imputed intronic SNP, rs3765598, predicted to be a cis-eQTL, was associated (P = 0.007, OR = 0.79 and 95% CI = 0.67-0.94). No significant associations were observed in AA or AS. Case-only analysis using lupus-related clinical criteria revealed differences between EA SLE patients positive for moderate to high titers of IgG anti-cardiolipin (aCL IgG >20) versus negative aCL IgG at rs2476601 (P = 0.012, OR = 1.65). Association was reinforced when these cases were compared to controls (P = 2.7 × 10(-5), OR = 2.11). Our results validate that rs2476601 is the most significantly associated SNP in individuals with European ancestry. Additionally, rs1217414 and rs3765598 may be associated with SLE. Further studies are required to confirm the involvement of rs2476601 with aCL IgG.

  20. A hybrid computational strategy to address WGS variant analysis in >5000 samples.

    PubMed

    Huang, Zhuoyi; Rustagi, Navin; Veeraraghavan, Narayanan; Carroll, Andrew; Gibbs, Richard; Boerwinkle, Eric; Venkata, Manjunath Gorentla; Yu, Fuli

    2016-09-10

    The decreasing costs of sequencing are driving the need for cost-effective and real-time variant calling of whole genome sequencing data. The scale of these projects is far beyond the capacity of typical computing resources available in most research labs. Other infrastructures, such as the AWS cloud environment and supercomputers, also have limitations that make large-scale joint variant calling infeasible, and infrastructure-specific variant calling strategies either fail to scale up to large datasets or abandon joint calling strategies. We present a high-throughput framework including multiple variant callers for single nucleotide variant (SNV) calling, which leverages hybrid computing infrastructure consisting of cloud AWS, supercomputers and local high performance computing infrastructures. We present a novel binning approach for large-scale joint variant calling and imputation which can scale up to over 10,000 samples while producing SNV callsets with high sensitivity and specificity. As a proof of principle, we present results of analysis on the Cohorts for Heart And Aging Research in Genomic Epidemiology (CHARGE) WGS freeze 3 dataset, in which joint calling, imputation and phasing of over 5300 whole genome samples were completed in under 6 weeks using four state-of-the-art callers. The callers used were SNPTools, GATK-HaplotypeCaller, GATK-UnifiedGenotyper and GotCloud. We used Amazon AWS, a 4000-core in-house cluster at Baylor College of Medicine, IBM power PC Blue BioU at Rice and Rhea at Oak Ridge National Laboratory (ORNL) for the computation. AWS was used for joint calling of 180 TB of BAM files, and ORNL and Rice supercomputers were used for the imputation and phasing step. All other steps were carried out on the local compute cluster. The entire operation used 5.2 million core hours and only transferred a total of 6 TB of data across the platforms. Even with increasing sizes of whole genome datasets, ensemble joint calling of SNVs for low coverage data can be accomplished in a scalable, cost-effective and fast manner by using heterogeneous computing platforms without compromising on the quality of variants.

  1. Germline sequence variants in TGM3 and RGS22 confer risk of basal cell carcinoma

    PubMed Central

    Stacey, Simon N.; Sulem, Patrick; Gudbjartsson, Daniel F.; Jonasdottir, Aslaug; Thorleifsson, Gudmar; Gudjonsson, Sigurjon A.; Masson, Gisli; Gudmundsson, Julius; Sigurgeirsson, Bardur; Benediktsdottir, Kristrun R.; Thorisdottir, Kristin; Ragnarsson, Rafn; Fuentelsaz, Victoria; Corredera, Cristina; Grasa, Matilde; Planelles, Dolores; Sanmartin, Onofre; Rudnai, Peter; Gurzau, Eugene; Koppova, Kvetoslava; Hemminki, Kari; Nexø, Bjørn A; Tjønneland, Anne; Overvad, Kim; Johannsdottir, Hrefna; Helgadottir, Hafdis T.; Thorsteinsdottir, Unnur; Kong, Augustine; Vogel, Ulla; Kumar, Rajiv; Nagore, Eduardo; Mayordomo, José I.; Rafnar, Thorunn; Olafsson, Jon H.; Stefansson, Kari

    2014-01-01

    To search for new sequence variants that confer risk of cutaneous basal cell carcinoma (BCC), we conducted a genome-wide association study of 38.5 million single nucleotide polymorphisms (SNPs) and small indels identified through whole-genome sequencing of 2230 Icelanders. We imputed genotypes for 4208 BCC patients and 109 408 controls using Illumina SNP chip typing data, carried out association tests and replicated the findings in independent population samples. We found new BCC susceptibility loci at TGM3 (rs214782[G], P = 5.5 × 10(-17), OR = 1.29) and RGS22 (rs7006527[C], P = 8.7 × 10(-13), OR = 0.77). TGM3 encodes transglutaminase type 3, which plays a key role in production of the cornified envelope during epidermal differentiation. PMID:24403052

  2. The Effect on Melanoma Risk of Genes Previously Associated With Telomere Length

    PubMed Central

    Bishop, D. Timothy; Taylor, John C.; Hayward, Nicholas K.; Brossard, Myriam; Cust, Anne E.; Dunning, Alison M.; Lee, Jeffrey E.; Moses, Eric K.; Akslen, Lars A.; Andresen, Per A.; Avril, Marie-Françoise; Azizi, Esther; Scarrà, Giovanna Bianchi; Brown, Kevin M.; Dębniak, Tadeusz; Elder, David E.; Friedman, Eitan; Ghiorzo, Paola; Gillanders, Elizabeth M.; Goldstein, Alisa M.; Gruis, Nelleke A.; Hansson, Johan; Harland, Mark; Helsing, Per; Hočevar, Marko; Höiom, Veronica; Ingvar, Christian; Kanetsky, Peter A.; Landi, Maria Teresa; Lang, Julie; Lathrop, G. Mark; Lubiński, Jan; Mackie, Rona M.; Martin, Nicholas G.; Molven, Anders; Montgomery, Grant W.; Novaković, Srdjan; Olsson, Håkan; Puig, Susana; Puig-Butille, Joan Anton; Radford-Smith, Graham L.; Randerson-Moor, Juliette; van der Stoep, Nienke; van Doorn, Remco; Whiteman, David C.; MacGregor, Stuart; Pooley, Karen A.; Ward, Sarah V.; Mann, Graham J.; Amos, Christopher I.; Pharoah, Paul D. P.; Demenais, Florence; Law, Matthew H.; Newton Bishop, Julia A.; Barrett, Jennifer H.

    2014-01-01

    Telomere length has been associated with risk of many cancers, but results are inconsistent. Seven single nucleotide polymorphisms (SNPs) previously associated with mean leukocyte telomere length were either genotyped or well-imputed in 11108 case patients and 13933 control patients from Europe, Israel, the United States and Australia. Four of the seven SNPs reached a P value under .05 (two-sided). A genetic score that predicts telomere length, derived from these seven SNPs, is strongly associated (P = 8.92 × 10^−9, two-sided) with melanoma risk. This demonstrates that the previously observed association between longer telomere length and increased melanoma risk is not attributable to confounding via shared environmental effects (such as ultraviolet exposure) or reverse causality. We provide the first proof that multiple germline genetic determinants of telomere length influence cancer risk. PMID:25231748

  3. Genome-wide analysis of nuclear magnetic resonance metabolites revealed parent-of-origin effect on triglycerides in medium very low-density lipoprotein in PTPRD gene.

    PubMed

    Pervjakova, N; Kukushkina, V; Haller, T; Kasela, S; Joensuu, A; Kristiansson, K; Annilo, T; Perola, M; Salomaa, V; Jousilahti, P; Metspalu, A; Mägi, R

    2018-05-01

    The aim of the study was to explore parent-of-origin effects (POEs) on a range of human nuclear magnetic resonance metabolites. We searched for POEs in 14,815 unrelated individuals from Estonian and Finnish cohorts, applying the POE method to genotype data imputed with the 1000 Genomes reference panel and to 82 nuclear magnetic resonance metabolites. Meta-analysis revealed evidence of a POE for the variant rs1412727 in the PTPRD gene on one metabolite: triglycerides in medium very low-density lipoprotein. No POEs were detected for genetic variants that were previously known to have main effects on circulating metabolites. We demonstrated that POEs on human metabolites can be detected, but the effects are weak and therefore hard to detect with currently available sample sizes.

  4. Nonparametric Multiple Imputation for Questionnaires with Individual Skip Patterns and Constraints: The Case of Income Imputation in The National Educational Panel Study

    ERIC Educational Resources Information Center

    Aßmann, Christian; Würbach, Ariane; Goßmann, Solange; Geissler, Ferdinand; Bela, Anika

    2017-01-01

    Large-scale surveys typically exhibit data structures characterized by rich mutual dependencies between surveyed variables and individual-specific skip patterns. Despite high efforts in fieldwork and questionnaire design, missing values inevitably occur. One approach for handling missing values is to provide multiply imputed data sets, thus…

  5. A Method for Imputing Response Options for Missing Data on Multiple-Choice Assessments

    ERIC Educational Resources Information Center

    Wolkowitz, Amanda A.; Skorupski, William P.

    2013-01-01

    When missing values are present in item response data, there are a number of ways one might impute a correct or incorrect response to a multiple-choice item. There are significantly fewer methods for imputing the actual response option an examinee may have provided if he or she had not omitted the item either purposely or accidentally. This…

  6. Impact of Missing Data on the Detection of Differential Item Functioning: The Case of Mantel-Haenszel and Logistic Regression Analysis

    ERIC Educational Resources Information Center

    Robitzsch, Alexander; Rupp, Andre A.

    2009-01-01

    This article describes the results of a simulation study to investigate the impact of missing data on the detection of differential item functioning (DIF). Specifically, it investigates how four methods for dealing with missing data (listwise deletion, zero imputation, two-way imputation, response function imputation) interact with two methods of…

  7. iVAR: a program for imputing missing data in multivariate time series using vector autoregressive models.

    PubMed

    Liu, Siwei; Molenaar, Peter C M

    2014-12-01

    This article introduces iVAR, an R program for imputing missing data in multivariate time series on the basis of vector autoregressive (VAR) models. We conducted a simulation study to compare iVAR with three methods for handling missing data: listwise deletion, imputation with sample means and variances, and multiple imputation ignoring time dependency. The results showed that iVAR produces better estimates for the cross-lagged coefficients than do the other three methods. We demonstrate the use of iVAR with an empirical example of time series electrodermal activity data and discuss the advantages and limitations of the program.

  8. Dealing with gene expression missing data.

    PubMed

    Brás, L P; Menezes, J C

    2006-05-01

    A comparative evaluation is presented of different methods for estimating missing values in microarray data: weighted K-nearest neighbours imputation (KNNimpute), regression-based methods such as local least squares imputation (LLSimpute) and partial least squares imputation (PLSimpute), and Bayesian principal component analysis (BPCA). The influence on prediction accuracy of several factors is elucidated, such as the methods' parameters, the type of data relationships used in the estimation process (i.e. row-wise, column-wise or both), the missing rate and pattern, and the type of experiment [time series (TS), non-time series (NTS) or mixed (MIX) experiments]. Improvements are proposed based on the iterative use of data (iterative LLS and PLS imputation--ILLSimpute and IPLSimpute), the need to perform initial imputations (modified PLS and Helland PLS imputation--MPLSimpute and HPLSimpute) and the type of relationships employed (KNNarray, LLSarray, HPLSarray and alternating PLS--APLSimpute). Overall, it is shown that data set properties (type of experiment, missing rate and pattern) affect the data similarity structure, therefore influencing the methods' performance. LLSimpute and ILLSimpute are preferable in the presence of data with a stronger similarity structure (TS and MIX experiments), whereas PLS-based methods (MPLSimpute, IPLSimpute and APLSimpute) are preferable when estimating NTS missing data.
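
    The weighted K-nearest neighbours idea named in this abstract can be sketched in a few lines. This is a minimal illustration, not the KNNimpute implementation the authors evaluated: the function name `knn_impute`, the use of `None` for a missing value, the root-mean-square distance over shared columns and the inverse-distance weights are all assumptions made for the example.

```python
import math


def knn_impute(matrix, k=2):
    """Fill None entries using a weighted average over the k nearest rows.

    Rows play the role of genes and columns of arrays (an assumption for
    this sketch). The input matrix is not modified.
    """
    filled = [row[:] for row in matrix]
    for i, row in enumerate(matrix):
        for j, value in enumerate(row):
            if value is not None:
                continue
            candidates = []
            for other_index, other in enumerate(matrix):
                # A donor row must be a different row and observed in column j.
                if other_index == i or other[j] is None:
                    continue
                # Compare only the columns observed in both rows.
                shared = [c for c in range(len(row))
                          if row[c] is not None and other[c] is not None]
                if not shared:
                    continue
                dist = math.sqrt(
                    sum((row[c] - other[c]) ** 2 for c in shared) / len(shared))
                candidates.append((dist, other[j]))
            if not candidates:
                continue  # nothing to borrow from, so the gap stays
            candidates.sort(key=lambda pair: pair[0])
            nearest = candidates[:k]
            # Inverse-distance weights; the epsilon avoids division by zero.
            weights = [1.0 / (d + 1e-9) for d, _ in nearest]
            filled[i][j] = (sum(w * v for w, (_, v) in zip(weights, nearest))
                            / sum(weights))
    return filled
```

    Calling `knn_impute(rows, k=1)` copies the value from the single closest row, while larger `k` blends several donors, which is where the weighting matters.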

  9. Randomly and Non-Randomly Missing Renal Function Data in the Strong Heart Study: A Comparison of Imputation Methods

    PubMed Central

    Shara, Nawar; Yassin, Sayf A.; Valaitis, Eduardas; Wang, Hong; Howard, Barbara V.; Wang, Wenyu; Lee, Elisa T.; Umans, Jason G.

    2015-01-01

    Kidney and cardiovascular disease are widespread among populations with high prevalence of diabetes, such as American Indians participating in the Strong Heart Study (SHS). Studying these conditions simultaneously in longitudinal studies is challenging, because the morbidity and mortality associated with these diseases result in missing data, and these data are likely not missing at random. When such data are merely excluded, study findings may be compromised. In this article, a subset of 2264 participants with complete renal function data from Strong Heart Exams 1 (1989–1991), 2 (1993–1995), and 3 (1998–1999) was used to examine the performance of five methods used to impute missing data: listwise deletion, mean of serial measures, adjacent value, multiple imputation, and pattern-mixture. Three missing at random models and one non-missing at random model were used to compare the performance of the imputation techniques on randomly and non-randomly missing data. The pattern-mixture method was found to perform best for imputing renal function data that were not missing at random. Determining whether data are missing at random or not can help in choosing the imputation method that will provide the most accurate results. PMID:26414328

  11. Multiple imputation by chained equations for systematically and sporadically missing multilevel data.

    PubMed

    Resche-Rigon, Matthieu; White, Ian R

    2018-06-01

    In multilevel settings such as individual participant data meta-analysis, a variable is 'systematically missing' if it is wholly missing in some clusters and 'sporadically missing' if it is partly missing in some clusters. Previously proposed methods to impute incomplete multilevel data handle either systematically or sporadically missing data, but frequently both patterns are observed. We describe a new multiple imputation by chained equations (MICE) algorithm for multilevel data with arbitrary patterns of systematically and sporadically missing variables. The algorithm is described for multilevel normal data but can easily be extended for other variable types. We first propose two methods for imputing a single incomplete variable: an extension of an existing method and a new two-stage method which conveniently allows for heteroscedastic data. We then discuss the difficulties of imputing missing values in several variables in multilevel data using MICE, and show that even the simplest joint multilevel model implies conditional models which involve cluster means and heteroscedasticity. However, a simulation study finds that the proposed methods can be successfully combined in a multilevel MICE procedure, even when cluster means are not included in the imputation models.
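
    For orientation, the basic chained-equations loop that this algorithm builds on can be sketched for two incomplete variables. This is a deliberately simplified, single-level illustration and not the multilevel method described in the abstract: it ignores clusters and heteroscedasticity, and it does not draw the regression parameters from their posterior as a proper multiple imputation procedure would. The names `fit_line` and `chained_equations` are hypothetical.

```python
import random


def fit_line(xs, ys):
    """Least-squares intercept, slope and residual SD for y = a + b*x."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sxx
    intercept = mean_y - slope * mean_x
    rss = sum((y - intercept - slope * x) ** 2 for x, y in zip(xs, ys))
    return intercept, slope, (rss / max(n - 2, 1)) ** 0.5


def chained_equations(a, b, sweeps=10, seed=0):
    """Return one completed copy of (a, b); None marks a missing value."""
    rng = random.Random(seed)
    columns, missing = [], []
    for col in (a, b):
        observed = [v for v in col if v is not None]
        mean = sum(observed) / len(observed)
        missing.append({i for i, v in enumerate(col) if v is None})
        # Crude starting values: the observed mean of each variable.
        columns.append([mean if v is None else v for v in col])
    for _ in range(sweeps):
        for t in (0, 1):  # visit each incomplete variable in turn
            target, predictor = columns[t], columns[1 - t]
            # Fit only on rows where the target was actually observed.
            rows = [i for i in range(len(target)) if i not in missing[t]]
            intercept, slope, sd = fit_line([predictor[i] for i in rows],
                                            [target[i] for i in rows])
            for i in missing[t]:
                # Prediction plus noise, so imputations keep residual spread.
                target[i] = intercept + slope * predictor[i] + rng.gauss(0.0, sd)
    return columns[0], columns[1]
```

    Running the function with several different seeds would give several completed data sets, which is the "multiple" part of multiple imputation.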

  12. Advancing US GHG Inventory by Incorporating Survey Data using Machine-Learning Techniques

    NASA Astrophysics Data System (ADS)

    Alsaker, C.; Ogle, S. M.; Breidt, J.

    2017-12-01

    Crop management data are used in the National Greenhouse Gas Inventory that is compiled annually and reported to the United Nations Framework Convention on Climate Change. Emissions for carbon stock change and N2O emissions for US agricultural soils are estimated using the USDA National Resources Inventory (NRI). NRI provides basic information on land use and cropping histories, but it does not provide much detail on other management practices. In contrast, the Conservation Effects Assessment Project (CEAP) survey collects detailed crop management data that could be used in the GHG Inventory. The survey data were collected every 10 years from a subset of the NRI survey locations. Therefore, imputation of the CEAP data is needed to represent the management practices across all NRI survey locations both spatially and temporally. Predictive mean matching and artificial neural network methods have been applied to develop an imputation model under a multiple imputation framework. Temporal imputation involves adjusting the imputation model using state-level USDA Agricultural Resource Management Survey data. Distributional and predictive accuracy is assessed for the imputed data, providing not only the management data needed for the inventory but also rigorous estimates of uncertainty.

  13. Imputation method for lifetime exposure assessment in air pollution epidemiologic studies

    PubMed Central

    2013-01-01

    Background: Environmental epidemiology, when focused on the life course of exposure to a specific pollutant, requires historical exposure estimates that are difficult to obtain for the full time period due to gaps in the historical record, especially in earlier years. We show that these gaps can be filled by applying multiple imputation methods to a formal risk equation that incorporates lifetime exposure. We also address challenges that arise, including choice of imputation method, potential bias in regression coefficients, and uncertainty in age-at-exposure sensitivities. Methods: During time periods when parameters needed in the risk equation are missing for an individual, the parameters are filled by an imputation model using group level information or interpolation. A random component is added to match the variance found in the estimates for study subjects not needing imputation. The process is repeated to obtain multiple data sets, whose regressions against health data can be combined statistically to develop confidence limits using Rubin’s rules to account for the uncertainty introduced by the imputations. To test for possible recall bias between cases and controls, which can occur when historical residence location is obtained by interview, and which can lead to misclassification of imputed exposure by disease status, we introduce an “incompleteness index,” equal to the percentage of dose imputed (PDI) for a subject. “Effective doses” can be computed using different functional dependencies of relative risk on age of exposure, allowing intercomparison of different risk models. To illustrate our approach, we quantify lifetime exposure (dose) from traffic air pollution in an established case–control study on Long Island, New York, where considerable in-migration occurred over a period of many decades. Results: The major result is the described approach to imputation. The illustrative example revealed potential recall bias, suggesting that regressions against health data should be done as a function of PDI to check for consistency of results. The 1% of study subjects who lived for long durations near heavily trafficked intersections had very high cumulative exposures. Thus, imputation methods must be designed to reproduce non-standard distributions. Conclusions: Our approach meets a number of methodological challenges to extending historical exposure reconstruction over a lifetime and shows promise for environmental epidemiology. Application to assessment of breast cancer risks will be reported in a subsequent manuscript. PMID:23919666
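
    The pooling step that this abstract refers to as Rubin's rules can be written out briefly. The sketch below uses the standard textbook form (pooled estimate = mean of the per-data-set estimates; total variance = within-imputation variance plus (1 + 1/m) times between-imputation variance). The function name `pool_rubin` and the example numbers are hypothetical and are not taken from the study.

```python
import statistics


def pool_rubin(estimates, variances):
    """Combine m per-imputation estimates and their squared standard errors.

    Returns (pooled_estimate, total_variance). Requires m >= 2.
    """
    m = len(estimates)
    pooled = statistics.mean(estimates)
    within = statistics.mean(variances)         # average sampling variance
    between = statistics.variance(estimates)    # spread across imputations (m - 1 divisor)
    total = within + (1.0 + 1.0 / m) * between  # extra term reflects imputation uncertainty
    return pooled, total


# Three imputed data sets, each yielding a regression coefficient and its variance.
coef, var = pool_rubin([1.0, 2.0, 3.0], [0.5, 0.5, 0.5])
```

    The square root of the returned total variance is the standard error from which confidence limits would be built.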

  14. Quantifying the impact of fixed effects modeling of clusters in multiple imputation for cluster randomized trials

    PubMed Central

    Andridge, Rebecca R.

    2011-01-01

    In cluster randomized trials (CRTs), identifiable clusters rather than individuals are randomized to study groups. Resulting data often consist of a small number of clusters with correlated observations within a treatment group. Missing data often present a problem in the analysis of such trials, and multiple imputation (MI) has been used to create complete data sets, enabling subsequent analysis with well-established analysis methods for CRTs. We discuss strategies for accounting for clustering when multiply imputing a missing continuous outcome, focusing on estimation of the variance of group means as used in an adjusted t-test or ANOVA. These analysis procedures are congenial to (can be derived from) a mixed effects imputation model; however, this imputation procedure is not yet available in commercial statistical software. An alternative approach that is readily available and has been used in recent studies is to include fixed effects for cluster, but the impact of using this convenient method has not been studied. We show that under this imputation model the MI variance estimator is positively biased and that smaller ICCs lead to larger overestimation of the MI variance. Analytical expressions for the bias of the variance estimator are derived in the case of data missing completely at random (MCAR), and cases in which data are missing at random (MAR) are illustrated through simulation. Finally, various imputation methods are applied to data from the Detroit Middle School Asthma Project, a recent school-based CRT, and differences in inference are compared. PMID:21259309

  15. Multiple Imputation of Completely Missing Repeated Measures Data within Person from a Complex Sample: Application to Accelerometer Data in the National Health and Nutrition Examination Survey

    PubMed Central

    Liu, Benmei; Yu, Mandi; Graubard, Barry I; Troiano, Richard P; Schenker, Nathaniel

    2016-01-01

    The Physical Activity Monitor (PAM) component was introduced into the 2003-2004 National Health and Nutrition Examination Survey (NHANES) to collect objective information on physical activity including both movement intensity counts and ambulatory steps. Due to an error in the accelerometer device initialization process, the steps data were missing for all participants in several primary sampling units (PSUs), typically a single county or group of contiguous counties, who had intensity count data from their accelerometers. To avoid potential bias and loss in efficiency in estimation and inference involving the steps data, we considered methods to accurately impute the missing values for steps collected in the 2003-2004 NHANES. The objective was to come up with an efficient imputation method which minimized model-based assumptions. We adopted a multiple imputation approach based on Additive Regression, Bootstrapping and Predictive mean matching (ARBP) methods. This method fits alternating conditional expectation (ace) models, which use an automated procedure to estimate optimal transformations for both the predictor and response variables. This paper describes the approaches used in this imputation and evaluates the methods by comparing the distributions of the original and the imputed data. A simulation study using the observed data is also conducted as part of the model diagnostics. Finally, some real data analyses are performed to compare the before and after imputation results. PMID:27488606
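
    The predictive mean matching component of the approach can be illustrated in isolation. The sketch below is a simplification under stated assumptions: it uses one predictor and an ordinary least-squares line instead of the additive regression, bootstrapping and transformation steps of ARBP, and the names `pmm_impute`, `x`, `y` and `k` are hypothetical.

```python
import random


def pmm_impute(x, y, k=3, seed=0):
    """Impute None entries of y by borrowing observed y values from donors.

    Donors are the k observed cases whose predicted y is closest to the
    predicted y of the incomplete case.
    """
    rng = random.Random(seed)
    observed = [i for i, v in enumerate(y) if v is not None]
    # Ordinary least-squares fit of y on x using the complete cases only.
    mean_x = sum(x[i] for i in observed) / len(observed)
    mean_y = sum(y[i] for i in observed) / len(observed)
    sxx = sum((x[i] - mean_x) ** 2 for i in observed)
    sxy = sum((x[i] - mean_x) * (y[i] - mean_y) for i in observed)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x

    def predict(value):
        return intercept + slope * value

    completed = list(y)
    for i, value in enumerate(y):
        if value is not None:
            continue
        target = predict(x[i])
        donors = sorted(observed, key=lambda d: abs(predict(x[d]) - target))[:k]
        # The imputed value is always a value that was actually observed.
        completed[i] = y[rng.choice(donors)]
    return completed
```

    Because every imputed value is copied from a real observation, the completed data cannot contain impossible values such as negative step counts, which is one reason this family of methods is attractive when model-based assumptions are to be kept minimal.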

  16. Multiple imputation of missing data in nested case-control and case-cohort studies.

    PubMed

    Keogh, Ruth H; Seaman, Shaun R; Bartlett, Jonathan W; Wood, Angela M

    2018-06-05

    The nested case-control and case-cohort designs are two main approaches for carrying out a substudy within a prospective cohort. This article adapts multiple imputation (MI) methods for handling missing covariates in full-cohort studies for nested case-control and case-cohort studies. We consider data missing by design and data missing by chance. MI analyses that make use of full-cohort data and MI analyses based on substudy data only are described, alongside an intermediate approach in which the imputation uses full-cohort data but the analysis uses only the substudy. We describe adaptations to two imputation methods: the approximate method (MI-approx) of White and Royston and the "substantive model compatible" (MI-SMC) method of Bartlett et al. We also apply the "MI matched set" approach of Seaman and Keogh to nested case-control studies, which does not require any full-cohort information. The methods are investigated using simulation studies and all perform well when their assumptions hold. Substantial gains in efficiency can be made by imputing data missing by design using the full-cohort approach or by imputing data missing by chance in analyses using the substudy only. The intermediate approach brings greater gains in efficiency relative to the substudy approach and is more robust to imputation model misspecification than the full-cohort approach. The methods are illustrated using the ARIC Study cohort. Supplementary Materials provide R and Stata code. © 2018, The International Biometric Society.

  17. Examining solutions to missing data in longitudinal nursing research.

    PubMed

    Roberts, Mary B; Sullivan, Mary C; Winchester, Suzy B

    2017-04-01

    Longitudinal studies are highly valuable in pediatrics because they provide useful data about developmental patterns of child health and behavior over time. When data are missing, the value of the research is impacted. The study's purpose was to (1) introduce a three-step approach to assess and address missing data and (2) illustrate this approach using categorical and continuous-level variables from a longitudinal study of premature infants. A three-step approach with simulations was followed to assess the amount and pattern of missing data and to determine the most appropriate imputation method for the missing data. Patterns of missingness were Missing Completely at Random, Missing at Random, and Not Missing at Random. Missing continuous-level data were imputed using mean replacement, stochastic regression, multiple imputation, and fully conditional specification (FCS). Missing categorical-level data were imputed using last value carried forward, hot-decking, stochastic regression, and FCS. Simulations were used to evaluate these imputation methods under different patterns of missingness at different levels of missing data. The rate of missingness was 16-23% for continuous variables and 1-28% for categorical variables. FCS imputation provided the least difference in mean and standard deviation estimates for continuous measures. FCS imputation was acceptable for categorical measures. Results obtained through simulation reinforced and confirmed these findings. Significant investments are made in the collection of longitudinal data. The prudent handling of missing data can protect these investments and potentially improve the scientific information contained in pediatric longitudinal studies. © 2017 Wiley Periodicals, Inc.
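
    Two of the simpler single-imputation methods listed in this abstract, mean replacement and last value carried forward, can be sketched as follows. The function names and the use of `None` for a missing measurement are assumptions made for the illustration; the study's own software and settings are not described in the abstract.

```python
def mean_replacement(values):
    """Replace each missing continuous value with the mean of the observed ones."""
    observed = [v for v in values if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in values]


def last_value_carried_forward(values):
    """Replace each missing value with the most recent earlier observation.

    A missing value with no earlier observation stays missing (None).
    """
    filled, last = [], None
    for v in values:
        if v is not None:
            last = v
        filled.append(last)
    return filled


# One child's measurements across study visits (hypothetical data).
weights_kg = mean_replacement([10.0, None, 14.0])
feeding = last_value_carried_forward(["breast", None, None, "solid", None])
```

    Both methods fill every gap with a single fixed value, so they understate variability; that limitation is part of why the abstract compares them against stochastic and multiple-imputation alternatives.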

  18. Coupling of Belowground Carbon Cycling and Stoichiometry from Organisms to Ecosystems along a Soil C Gradient Under Rice Cultivation

    NASA Astrophysics Data System (ADS)

    Hartman, W.; Ye, R.; Horwath, W. R.; Tringe, S. G.

    2015-12-01

    Ecological stoichiometry is a framework linking biogeochemical cycles to organism functional traits that has been widely applied in aquatic ecosystems, animals and plants, but is poorly explored in soil microbes. We evaluated relationships among soil stoichiometry, carbon (C) cycling, and microbial community structure and function along a soil gradient spanning ~5-25% C in cultivated rice fields with experimental nitrogen (N) amendments. We found rates of soil C turnover were associated with nutrient stoichiometry and phosphorus (P) availability at ecosystem, community, and organism scales. At the ecosystem scale, soil C turnover was highest in mineral soils with lower C content and N:P ratios, and was positively correlated with soil inorganic P. Effects of N fertilization on soil C cycling also appeared to be mediated by soil P availability, while microbial community composition (by 16S rRNA sequencing) was not altered by N addition. Microbial communities varied along the soil C gradient, corresponding with highly covariant soil %C, N:P ratios, C quality, and carbon turnover. In contrast, we observed unambiguous shifts in microbial community function, imputed from taxonomy and directly assessed by shotgun sequenced metagenomes. The abundance of genes for carbohydrate utilization decreased with increasing soil C (and declining C turnover), while genes for aromatic C uptake, N fixation and P scavenging increased along with potential incorporation of C into biomass pools. Ecosystem and community-scale associations between C and nutrient substrate availability were also reflected in patterns of resource allocation among individual genomes (imputed and assembled). Microbes associated with higher rates of soil C turnover harbored more genes for carbohydrate utilization, fewer genes for obtaining energetically costly forms of C, N and P, more ribosomal RNA gene copies, and potentially lower C use efficiency. We suggest genome clustering by functional gene suites might yield simplified guilds related to biogeochemical cycling, even when function is imputed directly from taxonomy. Our findings in a controlled model wetland ecosystem bolster evidence for the role of P in influencing soil C cycling, and our approach could be leveraged to reduce complex microbial data for trait-based modeling of soil C cycling.

  19. Random Forest as an Imputation Method for Education and Psychology Research: Its Impact on Item Fit and Difficulty of the Rasch Model

    ERIC Educational Resources Information Center

    Golino, Hudson F.; Gomes, Cristiano M. A.

    2016-01-01

    This paper presents a non-parametric imputation technique, named random forest, from the machine learning field. The random forest procedure has two main tuning parameters: the number of trees grown in the prediction and the number of predictors used. Fifty experimental conditions were created in the imputation procedure, with different…

  20. Imputed forest structure uncertainty varies across elevational and longitudinal gradients in the western Cascade mountains, Oregon, USA

    Treesearch

    David M. Bell; Matthew J. Gregory; Janet L. Ohmann

    2015-01-01

    Imputation provides a useful method for mapping forest attributes across broad geographic areas based on field plot measurements and Landsat multi-spectral data, but the resulting map products may be of limited use without corresponding analyses of uncertainties in predictions. In the case of k-nearest neighbor (kNN) imputation with k = 1, such as the Gradient Nearest...

  1. Influence of lidar, Landsat imagery, disturbance history, plot location accuracy, and plot size on accuracy of imputation maps of forest composition and structure

    Treesearch

    Harold S.J. Zald; Janet L. Ohmann; Heather M. Roberts; Matthew J. Gregory; Emilie B. Henderson; Robert J. McGaughey; Justin Braaten

    2014-01-01

    This study investigated how lidar-derived vegetation indices, disturbance history from Landsat time series (LTS) imagery, plot location accuracy, and plot size influenced accuracy of statistical spatial models (nearest-neighbor imputation maps) of forest vegetation composition and structure. Nearest-neighbor (NN) imputation maps were developed for 539,000 ha in the...

  2. Multiple imputation of missing fMRI data in whole brain analysis

    PubMed Central

    Vaden, Kenneth I.; Gebregziabher, Mulugeta; Kuchinsky, Stefanie E.; Eckert, Mark A.

    2012-01-01

    Whole brain fMRI analyses rarely include the entire brain because of missing data that result from data acquisition limits and susceptibility artifact, in particular. This missing data problem is typically addressed by omitting voxels from analysis, which may exclude brain regions that are of theoretical interest and increase the potential for Type II error at cortical boundaries or Type I error when spatial thresholds are used to establish significance. Imputation could significantly expand statistical map coverage, increase power, and enhance interpretations of fMRI results. We examined multiple imputation for group level analyses of missing fMRI data using methods that leverage the spatial information in fMRI datasets for both real and simulated data. Available case analysis, neighbor replacement, and regression based imputation approaches were compared in a general linear model framework to determine the extent to which these methods quantitatively (effect size) and qualitatively (spatial coverage) increased the sensitivity of group analyses. In both real and simulated data analysis, multiple imputation provided 1) variance that was most similar to estimates for voxels with no missing data, 2) fewer false positive errors in comparison to mean replacement, and 3) fewer false negative errors in comparison to available case analysis. Compared to the standard analysis approach of omitting voxels with missing data, imputation methods increased brain coverage in this study by 35% (from 33,323 to 45,071 voxels). In addition, multiple imputation increased the size of significant clusters by 58% and the number of significant clusters across statistical thresholds, compared to the standard voxel omission approach. While neighbor replacement produced similar results, we recommend multiple imputation because it uses an informed sampling distribution to deal with missing data across subjects that can include neighbor values and other predictors. Multiple imputation is anticipated to be particularly useful for 1) large fMRI data sets with inconsistent missing voxels across subjects and 2) addressing the problem of increased artifact at ultra-high field, which significantly limits the extent of whole brain coverage and interpretations of results. PMID:22500925

  3. Which missing value imputation method to use in expression profiles: a comparative study and two selection schemes.

    PubMed

    Brock, Guy N; Shaffer, John R; Blakesley, Richard E; Lotz, Meredith J; Tseng, George C

    2008-01-10

    Gene expression data frequently contain missing values; however, most downstream analyses for microarray experiments require complete data. In the literature many methods have been proposed to estimate missing values via information of the correlation patterns within the gene expression matrix. Each method has its own advantages, but the specific conditions for which each method is preferred remain largely unclear. In this report we describe an extensive evaluation of eight current imputation methods on multiple types of microarray experiments, including time series, multiple exposures, and multiple exposures × time series data. We then introduce two complementary selection schemes for determining the most appropriate imputation method for any given data set. We found that the optimal imputation algorithms (LSA, LLS, and BPCA) are all highly competitive with each other, and that no method is uniformly superior in all the data sets we examined. The success of each method can also depend on the underlying "complexity" of the expression data, where we take complexity to indicate the difficulty in mapping the gene expression matrix to a lower-dimensional subspace. We developed an entropy measure to quantify the complexity of expression matrices and found that, by incorporating this information, the entropy-based selection (EBS) scheme is useful for selecting an appropriate imputation algorithm. We further propose a simulation-based self-training selection (STS) scheme. This technique has been used previously for microarray data imputation, but for different purposes. The scheme selects the optimal or near-optimal method with high accuracy but at an increased computational cost. Our findings provide insight into the problem of which imputation method is optimal for a given data set. Three top-performing methods (LSA, LLS and BPCA) are competitive with each other. Global-based imputation methods (PLS, SVD, BPCA) performed better on microarray data with lower complexity, while neighbour-based methods (KNN, OLS, LSA, LLS) performed better in data with higher complexity. We also found that the EBS and STS schemes serve as complementary and effective tools for selecting the optimal imputation algorithm.
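
    The self-training selection idea described in this abstract can be illustrated with a minimal sketch: hide a random subset of the observed entries, impute them with each candidate method, score each method against the hidden truth, and keep the one with the lowest error. The two candidate imputers below (row mean and column mean) are simple placeholders, not the LSA, LLS, or BPCA methods evaluated in the paper, and all function names and default parameters are hypothetical.

```python
import math
import random

def row_mean_impute(matrix):
    """Placeholder imputer: fill each None with the mean of the observed values in its row."""
    filled = []
    for row in matrix:
        observed = [v for v in row if v is not None]
        mean = sum(observed) / len(observed) if observed else 0.0
        filled.append([mean if v is None else v for v in row])
    return filled

def col_mean_impute(matrix):
    """Placeholder imputer: fill each None with the mean of the observed values in its column."""
    means = []
    for j in range(len(matrix[0])):
        observed = [row[j] for row in matrix if row[j] is not None]
        means.append(sum(observed) / len(observed) if observed else 0.0)
    return [[means[j] if v is None else v for j, v in enumerate(row)] for row in matrix]

def self_training_select(matrix, imputers, mask_fraction=0.1, n_rounds=20, seed=0):
    """Return (best_method_name, rmse_by_method) from repeated mask-and-impute rounds."""
    rng = random.Random(seed)
    observed = [(i, j) for i, row in enumerate(matrix)
                for j, v in enumerate(row) if v is not None]
    n_mask = max(1, int(mask_fraction * len(observed)))
    squared_error = {name: 0.0 for name in imputers}
    for _ in range(n_rounds):
        hidden = rng.sample(observed, n_mask)
        masked = [row[:] for row in matrix]
        for i, j in hidden:
            masked[i][j] = None  # pretend this known value is missing
        for name, impute in imputers.items():
            filled = impute(masked)
            squared_error[name] += sum((filled[i][j] - matrix[i][j]) ** 2
                                       for i, j in hidden)
    rmse = {name: math.sqrt(total / (n_rounds * n_mask))
            for name, total in squared_error.items()}
    return min(rmse, key=rmse.get), rmse
```

    Repeating the masking over several rounds is what makes this kind of scheme computationally more expensive than a one-shot rule, which matches the trade-off the abstract mentions.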

  4. Missing value imputation in DNA microarrays based on conjugate gradient method.

    PubMed

    Dorri, Fatemeh; Azmi, Paeiz; Dorri, Faezeh

    2012-02-01

    Analysis of gene expression profiles needs a complete matrix of gene array values; consequently, imputation methods have been suggested. In this paper, an algorithm based on the conjugate gradient (CG) method is proposed to estimate missing values. The k-nearest neighbors of the missing entry are first selected based on the absolute values of their Pearson correlation coefficients. Then a subset of genes among the k-nearest neighbors is labeled as the most similar ones. The CG algorithm with this subset as its input is then used to estimate the missing values. Our proposed CG-based algorithm (CGimpute) is evaluated on different data sets. The results are compared with sequential local least squares (SLLSimpute), Bayesian principal component analysis (BPCAimpute), local least squares imputation (LLSimpute), iterated local least squares imputation (ILLSimpute) and adaptive k-nearest neighbors imputation (KNNKimpute) methods. The average of normalized root mean square error (NRMSE) and relative NRMSE in different data sets with various missing rates shows CGimpute outperforms other methods. Copyright © 2011 Elsevier Ltd. All rights reserved.
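
    A rough sketch of the two main steps named in this abstract (neighbor selection by absolute Pearson correlation, then a conjugate gradient solve) might look like the following. It is not the authors' CGimpute: the intermediate step that labels a "most similar" subset of the neighbors is omitted, the least-squares formulation via normal equations is an assumption, and all function names are hypothetical.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length lists (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)

def conjugate_gradient(A, b, tol_sq=1e-10, max_iter=1000):
    """Solve A x = b for a symmetric positive (semi-)definite matrix A."""
    n = len(b)
    x = [0.0] * n
    r = b[:]          # residual b - A x for the starting point x = 0
    p = r[:]          # first search direction
    rs_old = sum(v * v for v in r)
    for _ in range(max_iter):
        if rs_old < tol_sq:   # squared residual norm is small enough
            break
        Ap = [sum(A[i][j] * p[j] for j in range(n)) for i in range(n)]
        pAp = sum(p[i] * Ap[i] for i in range(n))
        if pAp <= 0:
            break
        alpha = rs_old / pAp
        x = [x[i] + alpha * p[i] for i in range(n)]
        r = [r[i] - alpha * Ap[i] for i in range(n)]
        rs_new = sum(v * v for v in r)
        p = [r[i] + (rs_new / rs_old) * p[i] for i in range(n)]
        rs_old = rs_new
    return x

def cg_impute_entry(target, candidates, k=2):
    """Estimate the single None in `target` from complete candidate gene rows."""
    miss = target.index(None)
    obs = [j for j in range(len(target)) if j != miss]
    b = [target[j] for j in obs]
    # Step 1: keep the k candidates with the largest |Pearson correlation|.
    ranked = sorted(candidates,
                    key=lambda g: abs(pearson([g[j] for j in obs], b)),
                    reverse=True)
    neighbours = ranked[:k]
    # Step 2: least-squares weights via CG on the normal equations (assumed formulation).
    gram = [[sum(u[j] * v[j] for j in obs) for v in neighbours] for u in neighbours]
    rhs = [sum(u[j] * target[j] for j in obs) for u in neighbours]
    weights = conjugate_gradient(gram, rhs)
    return sum(w * g[miss] for w, g in zip(weights, neighbours))
```

    For example, with a target gene that is exactly twice one of the candidates, `cg_impute_entry([2.0, 4.0, 6.0, 8.0, None], [[1.0, 2.0, 3.0, 4.0, 5.0], [2.0, 1.0, 4.0, 3.0, 6.0]])` recovers a value close to 10.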

  5. How we load our data sets with theories and why we do so purposefully.

    PubMed

    Rochefort-Maranda, Guillaume

    2016-12-01

    In this paper, I compare theory-laden perceptions with imputed data sets. The similarities between the two allow me to show how the phenomenon of theory-ladenness can manifest itself in statistical analyses. More importantly, elucidating the differences between them will allow me to broaden the focus of the existing literature on theory-ladenness and to introduce some much-needed nuances. The topic of statistical imputation has received no attention in philosophy of science. Yet, imputed data sets are very similar to theory-laden perceptions, and they are now an integral part of many scientific inferences. Unlike the existence of theory-laden perceptions, that of imputed data sets cannot be challenged or reduced to a manageable source of error. In fact, imputed data sets are created purposefully in order to improve the quality of our inferences. They do not undermine the possibility of scientific knowledge; on the contrary, they are epistemically desirable. Copyright © 2016 Elsevier Ltd. All rights reserved.

  6. Genetic variance estimation with imputed variants finds negligible missing heritability for human height and body mass index

    PubMed Central

    Yang, Jian; Bakshi, Andrew; Zhu, Zhihong; Hemani, Gibran; Vinkhuyzen, Anna A.E.; Lee, Sang Hong; Robinson, Matthew R.; Perry, John R.B.; Nolte, Ilja M.; van Vliet-Ostaptchouk, Jana V.; Snieder, Harold; Esko, Tonu; Milani, Lili; Mägi, Reedik; Metspalu, Andres; Hamsten, Anders; Magnusson, Patrik K.E.; Pedersen, Nancy L.; Ingelsson, Erik; Soranzo, Nicole; Keller, Matthew C.; Wray, Naomi R.; Goddard, Michael E.; Visscher, Peter M.

    2015-01-01

    We propose a method (GREML-LDMS) to estimate heritability for human complex traits in unrelated individuals using whole-genome sequencing (WGS) data. We demonstrate using simulations based on WGS data that ~97% and ~68% of variation at common and rare variants, respectively, can be captured by imputation. Using the GREML-LDMS method, we estimate from 44,126 unrelated individuals that all ~17M imputed variants explain 56% (s.e. = 2.3%) of variance for height and 27% (s.e. = 2.5%) for body mass index (BMI), and find evidence that height- and BMI-associated variants have been under natural selection. Considering imperfect tagging of imputation and potential overestimation of heritability from previous family-based studies, heritability is likely to be 60–70% for height and 30–40% for BMI. Therefore, missing heritability is small for both traits. For further gene discovery of complex traits, a design with SNP arrays followed by imputation is more cost-effective than WGS at current prices. PMID:26323059

  7. Accounting for the Multiple Natures of Missing Values in Label-Free Quantitative Proteomics Data Sets to Compare Imputation Strategies.

    PubMed

    Lazar, Cosmin; Gatto, Laurent; Ferro, Myriam; Bruley, Christophe; Burger, Thomas

    2016-04-01

    Missing values are a genuine issue in label-free quantitative proteomics. Recent works have surveyed the different statistical methods to conduct imputation and have compared them on real or simulated data sets and recommended a list of missing value imputation methods for proteomics application. Although insightful, these comparisons do not account for two important facts: (i) depending on the proteomics data set, the missingness mechanism may be of different natures and (ii) each imputation method is devoted to a specific type of missingness mechanism. As a result, we believe that the question at stake is not to find the most accurate imputation method in general but instead the most appropriate one. We describe a series of comparisons that support our views: For instance, we show that a supposedly "under-performing" method (i.e., giving baseline average results), if applied at the "appropriate" time in the data-processing pipeline (before or after peptide aggregation) on a data set with the "appropriate" nature of missing values, can outperform a blindly applied, supposedly "better-performing" method (i.e., the reference method from the state-of-the-art). This leads us to formulate a few practical guidelines regarding the choice and the application of an imputation method in a proteomics context.

  8. Controlled pattern imputation for sensitivity analysis of longitudinal binary and ordinal outcomes with nonignorable dropout.

    PubMed

    Tang, Yongqiang

    2018-04-30

    The controlled imputation method refers to a class of pattern mixture models that have been commonly used as sensitivity analyses of longitudinal clinical trials with nonignorable dropout in recent years. These pattern mixture models assume that participants in the experimental arm after dropout have similar response profiles to the control participants or have worse outcomes than otherwise similar participants who remain on the experimental treatment. In spite of its popularity, the controlled imputation has not been formally developed for longitudinal binary and ordinal outcomes partially due to the lack of a natural multivariate distribution for such endpoints. In this paper, we propose 2 approaches for implementing the controlled imputation for binary and ordinal data based respectively on the sequential logistic regression and the multivariate probit model. Efficient Markov chain Monte Carlo algorithms are developed for missing data imputation by using the monotone data augmentation technique for the sequential logistic regression and a parameter-expanded monotone data augmentation scheme for the multivariate probit model. We assess the performance of the proposed procedures by simulation and the analysis of a schizophrenia clinical trial and compare them with the fully conditional specification, last observation carried forward, and baseline observation carried forward imputation methods. Copyright © 2018 John Wiley & Sons, Ltd.

  9. Using full-cohort data in nested case-control and case-cohort studies by multiple imputation.

    PubMed

    Keogh, Ruth H; White, Ian R

    2013-10-15

    In many large prospective cohorts, expensive exposure measurements cannot be obtained for all individuals. Exposure-disease association studies are therefore often based on nested case-control or case-cohort studies in which complete information is obtained only for sampled individuals. However, in the full cohort, there may be a large amount of information on cheaply available covariates and possibly a surrogate of the main exposure(s), which typically goes unused. We view the nested case-control or case-cohort study plus the remainder of the cohort as a full-cohort study with missing data. Hence, we propose using multiple imputation (MI) to utilise information in the full cohort when data from the sub-studies are analysed. We use the fully observed data to fit the imputation models. We consider using approximate imputation models and also using rejection sampling to draw imputed values from the true distribution of the missing values given the observed data. Simulation studies show that using MI to utilise full-cohort information in the analysis of nested case-control and case-cohort studies can result in important gains in efficiency, particularly when a surrogate of the main exposure is available in the full cohort. In simulations, this method outperforms counter-matching in nested case-control studies and a weighted analysis for case-cohort studies, both of which use some full-cohort information. Approximate imputation models perform well except when there are interactions or non-linear terms in the outcome model, where imputation using rejection sampling works well. Copyright © 2013 John Wiley & Sons, Ltd.

  10. Examining Solutions to Missing Data in Longitudinal Nursing Research

    PubMed Central

    Roberts, Mary B.; Sullivan, Mary C.; Winchester, Suzy B.

    2017-01-01

    Purpose: Longitudinal studies are highly valuable in pediatrics because they provide useful data about developmental patterns of child health and behavior over time. When data are missing, the value of the research is impacted. The study's purpose was to (1) introduce a 3-step approach to assess and address missing data and (2) illustrate this approach using categorical and continuous level variables from a longitudinal study of premature infants. Methods: A three-step approach with simulations was followed to assess the amount and pattern of missing data and to determine the most appropriate imputation method for the missing data. Patterns of missingness were Missing Completely at Random, Missing at Random, and Not Missing at Random. Missing continuous-level data were imputed using mean replacement, stochastic regression, multiple imputation, and fully conditional specification. Missing categorical-level data were imputed using last value carried forward, hot-decking, stochastic regression, and fully conditional specification. Simulations were used to evaluate these imputation methods under different patterns of missingness at different levels of missing data. Results: The rate of missingness was 16–23% for continuous variables and 1–28% for categorical variables. Fully conditional specification imputation provided the least difference in mean and standard deviation estimates for continuous measures. Fully conditional specification imputation was acceptable for categorical measures. Results obtained through simulation reinforced and confirmed these findings. Practice Implications: Significant investments are made in the collection of longitudinal data. The prudent handling of missing data can protect these investments and potentially improve the scientific information contained in pediatric longitudinal studies. PMID:28425202
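
    Two of the simpler methods named in this abstract, mean replacement and last value carried forward, can be sketched in a few lines. This is only an illustration of what those two terms usually mean, not the study's code; the function names are hypothetical and missing values are represented as None.

```python
def mean_replacement(values):
    """Replace each missing value (None) with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    if not observed:
        return list(values)  # nothing observed, so nothing to impute from
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in values]

def last_value_carried_forward(values):
    """Replace each missing value with the most recent earlier observed value.

    Leading missing values stay None because there is nothing to carry forward.
    """
    filled, last = [], None
    for v in values:
        if v is not None:
            last = v
        filled.append(last)
    return filled
```

    Both are single-imputation methods: each missing entry gets one fixed value, so the uncertainty about that value is not reflected in later analyses.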

  11. A suggested approach for imputation of missing dietary data for young children in daycare.

    PubMed

    Stevens, June; Ou, Fang-Shu; Truesdale, Kimberly P; Zeng, Donglin; Vaughn, Amber E; Pratt, Charlotte; Ward, Dianne S

    2015-01-01

    Parent-reported 24-h diet recalls are an accepted method of estimating intake in young children. However, many children eat while at childcare making accurate proxy reports by parents difficult. The goal of this study was to demonstrate a method to impute missing weekday lunch and daytime snack nutrient data for daycare children and to explore the concurrent predictive and criterion validity of the method. Data were from children aged 2-5 years in the My Parenting SOS project (n=308; 870 24-h diet recalls). Mixed models were used to simultaneously predict breakfast, dinner, and evening snacks (B+D+ES); lunch; and daytime snacks for all children after adjusting for age, sex, and body mass index (BMI). From these models, we imputed the missing weekday daycare lunches by interpolation using the mean lunch to B+D+ES [L/(B+D+ES)] ratio among non-daycare children on weekdays and the L/(B+D+ES) ratio for all children on weekends. Daytime snack data were used to impute snacks. The reported mean (± standard deviation) weekday intake was lower for daycare children [725 (±324) kcal] compared to non-daycare children [1,048 (±463) kcal]. Weekend intake for all children was 1,173 (±427) kcal. After imputation, weekday caloric intake for daycare children was 1,230 (±409) kcal. Daily intakes that included imputed data were associated with age and sex but not with BMI. This work indicates that imputation is a promising method for improving the precision of daily nutrient data from young children.
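
    The ratio idea in this abstract can be illustrated with a small sketch: estimate a lunch-to-(B+D+ES) ratio from records where lunch was observed, then multiply a child's observed B+D+ES intake by that ratio. The abstract does not say how the weekday and weekend ratios are combined or whether the "mean ratio" is a mean of per-record ratios, so this sketch assumes a single reference group and a mean of per-record ratios. It also leaves out the mixed-model adjustment for age, sex, and BMI. Function names and the example numbers are made up for illustration.

```python
def mean_lunch_ratio(reference_records):
    """Mean of lunch / (B+D+ES) over (lunch_kcal, other_kcal) pairs with other_kcal > 0."""
    ratios = [lunch / other for lunch, other in reference_records if other > 0]
    return sum(ratios) / len(ratios)

def impute_lunch(other_kcal, ratio):
    """Estimate a missing lunch intake from the observed B+D+ES intake."""
    return ratio * other_kcal

# Hypothetical reference records: (lunch kcal, breakfast + dinner + evening snack kcal).
reference = [(300.0, 600.0), (200.0, 800.0)]
ratio = mean_lunch_ratio(reference)           # (0.5 + 0.25) / 2
imputed_lunch = impute_lunch(800.0, ratio)    # lunch estimate for a child with 800 kcal B+D+ES
```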

  12. Correlation of Lactobacillus rhamnosus Genotypes and Carbohydrate Utilization Signatures Determined by Phenotype Profiling

    PubMed Central

    Lambert, Jolanda; van Limpt, Kees; Wels, Michiel; Smokvina, Tamara; Knol, Jan; Kleerebezem, Michiel

    2015-01-01

    Lactobacillus rhamnosus is a bacterial species commonly colonizing the gastrointestinal (GI) tract of humans and also frequently used in food products. While some strains have been studied extensively, physiological variability among isolates of the species found in healthy humans or their diet is largely unexplored. The aim of this study was to characterize the diversity of carbohydrate utilization capabilities of human isolates and food-derived strains of L. rhamnosus in relation to their niche of isolation and genotype. We investigated the genotypic and phenotypic diversity of 25 out of 65 L. rhamnosus strains from various niches, mainly human feces and fermented dairy products. Genetic fingerprinting of the strains by amplified fragment length polymorphism (AFLP) identified 11 distinct subgroups at 70% similarity and suggested niche enrichment within particular genetic clades. High-resolution carbohydrate utilization profiling (OmniLog) identified 14 carbon sources that could be used by all of the strains tested for growth, while the utilization of 58 carbon sources differed significantly between strains, enabling the stratification of L. rhamnosus strains into three metabolic clusters that partially correlate with the genotypic clades but appear uncorrelated with the strain's origin of isolation. Draft genome sequences of 8 strains were generated and employed in a gene-trait matching (GTM) analysis together with the publicly available genomes of L. rhamnosus GG (ATCC 53103) and HN001 for several carbohydrates that were distinct for the different metabolic clusters: l-rhamnose, cellobiose, l-sorbose, and α-methyl-d-glucoside. From the analysis, candidate genes were identified that correlate with l-sorbose and α-methyl-d-glucoside utilization, and the proposed function of these genes could be confirmed by heterologous expression in a strain lacking the genes. This study expands our insight into the phenotypic and genotypic diversity of the species L. rhamnosus and explores the relationships between specific carbohydrate utilization capacities and genotype and/or niche adaptation of this species. PMID:26048937

  13. A second generation human haplotype map of over 3.1 million SNPs.

    PubMed

    Frazer, Kelly A; Ballinger, Dennis G; Cox, David R; Hinds, David A; Stuve, Laura L; Gibbs, Richard A; Belmont, John W; Boudreau, Andrew; Hardenbol, Paul; Leal, Suzanne M; Pasternak, Shiran; Wheeler, David A; Willis, Thomas D; Yu, Fuli; Yang, Huanming; Zeng, Changqing; Gao, Yang; Hu, Haoran; Hu, Weitao; Li, Chaohua; Lin, Wei; Liu, Siqi; Pan, Hao; Tang, Xiaoli; Wang, Jian; Wang, Wei; Yu, Jun; Zhang, Bo; Zhang, Qingrun; Zhao, Hongbin; Zhao, Hui; Zhou, Jun; Gabriel, Stacey B; Barry, Rachel; Blumenstiel, Brendan; Camargo, Amy; Defelice, Matthew; Faggart, Maura; Goyette, Mary; Gupta, Supriya; Moore, Jamie; Nguyen, Huy; Onofrio, Robert C; Parkin, Melissa; Roy, Jessica; Stahl, Erich; Winchester, Ellen; Ziaugra, Liuda; Altshuler, David; Shen, Yan; Yao, Zhijian; Huang, Wei; Chu, Xun; He, Yungang; Jin, Li; Liu, Yangfan; Shen, Yayun; Sun, Weiwei; Wang, Haifeng; Wang, Yi; Wang, Ying; Xiong, Xiaoyan; Xu, Liang; Waye, Mary M Y; Tsui, Stephen K W; Xue, Hong; Wong, J Tze-Fei; Galver, Luana M; Fan, Jian-Bing; Gunderson, Kevin; Murray, Sarah S; Oliphant, Arnold R; Chee, Mark S; Montpetit, Alexandre; Chagnon, Fanny; Ferretti, Vincent; Leboeuf, Martin; Olivier, Jean-François; Phillips, Michael S; Roumy, Stéphanie; Sallée, Clémentine; Verner, Andrei; Hudson, Thomas J; Kwok, Pui-Yan; Cai, Dongmei; Koboldt, Daniel C; Miller, Raymond D; Pawlikowska, Ludmila; Taillon-Miller, Patricia; Xiao, Ming; Tsui, Lap-Chee; Mak, William; Song, You Qiang; Tam, Paul K H; Nakamura, Yusuke; Kawaguchi, Takahisa; Kitamoto, Takuya; Morizono, Takashi; Nagashima, Atsushi; Ohnishi, Yozo; Sekine, Akihiro; Tanaka, Toshihiro; Tsunoda, Tatsuhiko; Deloukas, Panos; Bird, Christine P; Delgado, Marcos; Dermitzakis, Emmanouil T; Gwilliam, Rhian; Hunt, Sarah; Morrison, Jonathan; Powell, Don; Stranger, Barbara E; Whittaker, Pamela; Bentley, David R; Daly, Mark J; de Bakker, Paul I W; Barrett, Jeff; Chretien, Yves R; Maller, Julian; McCarroll, Steve; Patterson, Nick; Pe'er, Itsik; Price, Alkes; Purcell, Shaun; Richter, 
Daniel J; Sabeti, Pardis; Saxena, Richa; Schaffner, Stephen F; Sham, Pak C; Varilly, Patrick; Altshuler, David; Stein, Lincoln D; Krishnan, Lalitha; Smith, Albert Vernon; Tello-Ruiz, Marcela K; Thorisson, Gudmundur A; Chakravarti, Aravinda; Chen, Peter E; Cutler, David J; Kashuk, Carl S; Lin, Shin; Abecasis, Gonçalo R; Guan, Weihua; Li, Yun; Munro, Heather M; Qin, Zhaohui Steve; Thomas, Daryl J; McVean, Gilean; Auton, Adam; Bottolo, Leonardo; Cardin, Niall; Eyheramendy, Susana; Freeman, Colin; Marchini, Jonathan; Myers, Simon; Spencer, Chris; Stephens, Matthew; Donnelly, Peter; Cardon, Lon R; Clarke, Geraldine; Evans, David M; Morris, Andrew P; Weir, Bruce S; Tsunoda, Tatsuhiko; Mullikin, James C; Sherry, Stephen T; Feolo, Michael; Skol, Andrew; Zhang, Houcan; Zeng, Changqing; Zhao, Hui; Matsuda, Ichiro; Fukushima, Yoshimitsu; Macer, Darryl R; Suda, Eiko; Rotimi, Charles N; Adebamowo, Clement A; Ajayi, Ike; Aniagwu, Toyin; Marshall, Patricia A; Nkwodimmah, Chibuzor; Royal, Charmaine D M; Leppert, Mark F; Dixon, Missy; Peiffer, Andy; Qiu, Renzong; Kent, Alastair; Kato, Kazuto; Niikawa, Norio; Adewole, Isaac F; Knoppers, Bartha M; Foster, Morris W; Clayton, Ellen Wright; Watkin, Jessica; Gibbs, Richard A; Belmont, John W; Muzny, Donna; Nazareth, Lynne; Sodergren, Erica; Weinstock, George M; Wheeler, David A; Yakub, Imtaz; Gabriel, Stacey B; Onofrio, Robert C; Richter, Daniel J; Ziaugra, Liuda; Birren, Bruce W; Daly, Mark J; Altshuler, David; Wilson, Richard K; Fulton, Lucinda L; Rogers, Jane; Burton, John; Carter, Nigel P; Clee, Christopher M; Griffiths, Mark; Jones, Matthew C; McLay, Kirsten; Plumb, Robert W; Ross, Mark T; Sims, Sarah K; Willey, David L; Chen, Zhu; Han, Hua; Kang, Le; Godbout, Martin; Wallenburg, John C; L'Archevêque, Paul; Bellemare, Guy; Saeki, Koji; Wang, Hongguang; An, Daochang; Fu, Hongbo; Li, Qing; Wang, Zhen; Wang, Renwu; Holden, Arthur L; Brooks, Lisa D; McEwen, Jean E; Guyer, Mark S; Wang, Vivian Ota; Peterson, Jane L; Shi, Michael; 
Spiegel, Jack; Sung, Lawrence M; Zacharia, Lynn F; Collins, Francis S; Kennedy, Karen; Jamieson, Ruth; Stewart, John

    2007-10-18

    We describe the Phase II HapMap, which characterizes over 3.1 million human single nucleotide polymorphisms (SNPs) genotyped in 270 individuals from four geographically diverse populations and includes 25-35% of common SNP variation in the populations surveyed. The map is estimated to capture untyped common variation with an average maximum r² of between 0.9 and 0.96 depending on population. We demonstrate that the current generation of commercial genome-wide genotyping products captures common Phase II SNPs with an average maximum r² of up to 0.8 in African and up to 0.95 in non-African populations, and that potential gains in power in association studies can be obtained through imputation. These data also reveal novel aspects of the structure of linkage disequilibrium. We show that 10-30% of pairs of individuals within a population share at least one region of extended genetic identity arising from recent ancestry and that up to 1% of all common variants are untaggable, primarily because they lie within recombination hotspots. We show that recombination rates vary systematically around genes and between genes of different function. Finally, we demonstrate increased differentiation at non-synonymous, compared to synonymous, SNPs, resulting from systematic differences in the strength or efficacy of natural selection between populations.
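
    The "average maximum r2" coverage statistic quoted in this abstract can be made concrete with a short sketch. For alleles coded 0/1 on phased haplotypes, the squared Pearson correlation between two SNPs equals the usual linkage disequilibrium r². For each untyped SNP the sketch takes the largest r² with any typed SNP and then averages those maxima. The function names are hypothetical, and details such as the window of nearby SNPs considered or allele-frequency filters are not given in the abstract and are left out.

```python
def r_squared(x, y):
    """Squared Pearson correlation of two equal-length 0/1 allele vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # a monomorphic SNP carries no correlation information
    return (sxy * sxy) / (sxx * syy)

def average_max_r2(typed, untyped):
    """Mean, over untyped SNPs, of the best r² achieved by any typed SNP."""
    best = [max(r_squared(u, t) for t in typed) for u in untyped]
    return sum(best) / len(best)
```

    With one typed SNP `[0, 0, 1, 1]` and untyped SNPs `[0, 0, 1, 1]`, `[1, 1, 0, 0]`, and `[0, 1, 0, 1]`, the per-SNP maxima are 1, 1, and 0, so the average maximum r² is 2/3.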

  14. The genome sequence of the outbreeding globe artichoke constructed de novo incorporating a phase-aware low-pass sequencing strategy of F1 progeny

    PubMed Central

    Scaglione, Davide; Reyes-Chin-Wo, Sebastian; Acquadro, Alberto; Froenicke, Lutz; Portis, Ezio; Beitel, Christopher; Tirone, Matteo; Mauro, Rosario; Lo Monaco, Antonino; Mauromicale, Giovanni; Faccioli, Primetta; Cattivelli, Luigi; Rieseberg, Loren; Michelmore, Richard; Lanteri, Sergio

    2016-01-01

    Globe artichoke (Cynara cardunculus var. scolymus) is an out-crossing, perennial, multi-use crop species that is grown worldwide and belongs to the Compositae, one of the most successful Angiosperm families. We describe the first genome sequence of globe artichoke. The assembly, comprising 13,588 scaffolds covering 725 of the 1,084 Mb genome, was generated using ~133-fold Illumina sequencing data and encodes 26,889 predicted genes. Re-sequencing (30×) of globe artichoke and cultivated cardoon (C. cardunculus var. altilis) parental genotypes and low-coverage (0.5 to 1×) genotyping-by-sequencing of 163 F1 individuals resulted in 73% of the assembled genome being anchored in 2,178 genetic bins ordered along 17 chromosomal pseudomolecules. This was achieved using a novel pipeline, SOILoCo (Scaffold Ordering by Imputation with Low Coverage), to detect heterozygous regions and assign parental haplotypes with low sequencing read depth and of unknown phase. SOILoCo provides a powerful tool for de novo genome analysis of outcrossing species. Our data will enable genome-scale analyses of evolutionary processes among crops, weeds, and wild species within and beyond the Compositae, and will facilitate the identification of economically important genes from related species. PMID:26786968

  15. Meta-analysis in more than 17,900 cases of ischemic stroke reveals a novel association at 12q24.12.

    PubMed

    Kilarski, Laura L; Achterberg, Sefanja; Devan, William J; Traylor, Matthew; Malik, Rainer; Lindgren, Arne; Pare, Guillame; Sharma, Pankaj; Slowik, Agniesczka; Thijs, Vincent; Walters, Matthew; Worrall, Bradford B; Sale, Michele M; Algra, Ale; Kappelle, L Jaap; Wijmenga, Cisca; Norrving, Bo; Sandling, Johanna K; Rönnblom, Lars; Goris, An; Franke, Andre; Sudlow, Cathie; Rothwell, Peter M; Levi, Christopher; Holliday, Elizabeth G; Fornage, Myriam; Psaty, Bruce; Gretarsdottir, Solveig; Thorsteinsdottir, Unnar; Seshadri, Sudha; Mitchell, Braxton D; Kittner, Steven; Clarke, Robert; Hopewell, Jemma C; Bis, Joshua C; Boncoraglio, Giorgio B; Meschia, James; Ikram, M Arfan; Hansen, Bjorn M; Montaner, Joan; Thorleifsson, Gudmar; Stefanson, Kari; Rosand, Jonathan; de Bakker, Paul I W; Farrall, Martin; Dichgans, Martin; Markus, Hugh S; Bevan, Steve

    2014-08-19

    Objective: To perform a genome-wide association study (GWAS) using the Immunochip array in 3,420 cases of ischemic stroke and 6,821 controls, followed by a meta-analysis with data from more than 14,000 additional ischemic stroke cases. Methods: Using the Immunochip, we genotyped 3,420 ischemic stroke cases and 6,821 controls. After imputation we meta-analyzed the results with imputed GWAS data from 3,548 cases and 5,972 controls recruited from the ischemic stroke WTCCC2 study, and with summary statistics from a further 8,480 cases and 56,032 controls in the METASTROKE consortium. A final in silico "look-up" of 2 single nucleotide polymorphisms in 2,522 cases and 1,899 controls was performed. Associations were also examined in 1,088 cases with intracerebral hemorrhage and 1,102 controls. Results: In an overall analysis of 17,970 cases of ischemic stroke and 70,764 controls, we identified a novel association on chromosome 12q24 (rs10744777, odds ratio [OR] 1.10 [1.07-1.13], p = 7.12 × 10⁻¹¹) with ischemic stroke. The association was with all ischemic stroke rather than an individual stroke subtype, with similar effect sizes seen in different stroke subtypes. There was no association with intracerebral hemorrhage (OR 1.03 [0.90-1.17], p = 0.695). Conclusions: Our results show, for the first time, a genetic risk locus associated with ischemic stroke as a whole, rather than in a subtype-specific manner. This finding was not associated with intracerebral hemorrhage. © 2014 American Academy of Neurology.

  16. Time Series Imputation via L1 Norm-Based Singular Spectrum Analysis

    NASA Astrophysics Data System (ADS)

    Kalantari, Mahdi; Yarmohammadi, Masoud; Hassani, Hossein; Silva, Emmanuel Sirimal

    Missing values in time series data are a well-known and important problem that many researchers have studied extensively in various fields. In this paper, a new nonparametric approach for missing value imputation in time series is proposed. The main novelty of this research is applying the L1 norm-based version of Singular Spectrum Analysis (SSA), namely L1-SSA, which is robust against outliers. The performance of the new imputation method has been compared with many other established methods. The comparison is done by applying them to various real and simulated time series. The obtained results confirm that the SSA-based methods, especially L1-SSA, can provide better imputation in comparison to other methods.

  17. Association of genetic variation with systolic and diastolic blood pressure among African Americans: the Candidate Gene Association Resource study

    PubMed Central

    Fox, Ervin R.; Young, J. Hunter; Li, Yali; Dreisbach, Albert W.; Keating, Brendan J.; Musani, Solomon K.; Liu, Kiang; Morrison, Alanna C.; Ganesh, Santhi; Kutlar, Abdullah; Ramachandran, Vasan S.; Polak, Josef F.; Fabsitz, Richard R.; Dries, Daniel L.; Farlow, Deborah N.; Redline, Susan; Adeyemo, Adebowale; Hirschorn, Joel N.; Sun, Yan V.; Wyatt, Sharon B.; Penman, Alan D.; Palmas, Walter; Rotter, Jerome I.; Townsend, Raymond R.; Doumatey, Ayo P.; Tayo, Bamidele O.; Mosley, Thomas H.; Lyon, Helen N.; Kang, Sun J.; Rotimi, Charles N.; Cooper, Richard S.; Franceschini, Nora; Curb, J. David; Martin, Lisa W.; Eaton, Charles B.; Kardia, Sharon L.R.; Taylor, Herman A.; Caulfield, Mark J.; Ehret, Georg B.; Johnson, Toby; Chakravarti, Aravinda; Zhu, Xiaofeng; Levy, Daniel; Munroe, Patricia B.; Rice, Kenneth M.; Bochud, Murielle; Johnson, Andrew D.; Chasman, Daniel I.; Smith, Albert V.; Tobin, Martin D.; Verwoert, Germaine C.; Hwang, Shih-Jen; Pihur, Vasyl; Vollenweider, Peter; O'Reilly, Paul F.; Amin, Najaf; Bragg-Gresham, Jennifer L.; Teumer, Alexander; Glazer, Nicole L.; Launer, Lenore; Zhao, Jing Hua; Aulchenko, Yurii; Heath, Simon; Sõber, Siim; Parsa, Afshin; Luan, Jian'an; Arora, Pankaj; Dehghan, Abbas; Zhang, Feng; Lucas, Gavin; Hicks, Andrew A.; Jackson, Anne U.; Peden, John F.; Tanaka, Toshiko; Wild, Sarah H.; Rudan, Igor; Igl, Wilmar; Milaneschi, Yuri; Parker, Alex N.; Fava, Cristiano; Chambers, John C.; Kumari, Meena; JinGo, Min; van der Harst, Pim; Kao, Wen Hong Linda; Sjögren, Marketa; Vinay, D.G.; Alexander, Myriam; Tabara, Yasuharu; Shaw-Hawkins, Sue; Whincup, Peter H.; Liu, Yongmei; Shi, Gang; Kuusisto, Johanna; Seielstad, Mark; Sim, Xueling; Nguyen, Khanh-Dung Hoang; Lehtimäki, Terho; Matullo, Giuseppe; Wu, Ying; Gaunt, Tom R.; Charlotte Onland-Moret, N.; Cooper, Matthew N.; Platou, Carl G.P.; Org, Elin; Hardy, Rebecca; Dahgam, Santosh; Palmen, Jutta; Vitart, Veronique; Braund, Peter S.; Kuznetsova, Tatiana; Uiterwaal, Cuno S.P.M.; Campbell, Harry; 
Ludwig, Barbara; Tomaszewski, Maciej; Tzoulaki, Ioanna; Palmer, Nicholette D.; Aspelund, Thor; Garcia, Melissa; Chang, Yen-Pei C.; O'Connell, Jeffrey R.; Steinle, Nanette I.; Grobbee, Diederick E.; Arking, Dan E.; Hernandez, Dena; Najjar, Samer; McArdle, Wendy L.; Hadley, David; Brown, Morris J.; Connell, John M.; Hingorani, Aroon D.; Day, Ian N.M.; Lawlor, Debbie A.; Beilby, John P.; Lawrence, Robert W.; Clarke, Robert; Collins, Rory; Hopewell, Jemma C.; Ongen, Halit; Bis, Joshua C.; Kähönen, Mika; Viikari, Jorma; Adair, Linda S.; Lee, Nanette R.; Chen, Ming-Huei; Olden, Matthias; Pattaro, Cristian; Hoffman Bolton, Judith A.; Köttgen, Anna; Bergmann, Sven; Mooser, Vincent; Chaturvedi, Nish; Frayling, Timothy M.; Islam, Muhammad; Jafar, Tazeen H.; Erdmann, Jeanette; Kulkarni, Smita R.; Bornstein, Stefan R.; Grässler, Jürgen; Groop, Leif; Voight, Benjamin F.; Kettunen, Johannes; Howard, Philip; Taylor, Andrew; Guarrera, Simonetta; Ricceri, Fulvio; Emilsson, Valur; Plump, Andrew; Barroso, Inês; Khaw, Kay-Tee; Weder, Alan B.; Hunt, Steven C.; Bergman, Richard N.; Collins, Francis S.; Bonnycastle, Lori L.; Scott, Laura J.; Stringham, Heather M.; Peltonen, Leena; Perola, Markus; Vartiainen, Erkki; Brand, Stefan-Martin; Staessen, Jan A.; Wang, Thomas J.; Burton, Paul R.; SolerArtigas, Maria; Dong, Yanbin; Snieder, Harold; Wang, Xiaoling; Zhu, Haidong; Lohman, Kurt K.; Rudock, Megan E.; Heckbert, Susan R.; Smith, Nicholas L.; Wiggins, Kerri L.; Shriner, Daniel; Veldre, Gudrun; Viigimaa, Margus; Kinra, Sanjay; Prabhakaran, Dorairajan; Tripathy, Vikal; Langefeld, Carl D.; Rosengren, Annika; Thelle, Dag S.; MariaCorsi, Anna; Singleton, Andrew; Forrester, Terrence; Hilton, Gina; McKenzie, Colin A.; Salako, Tunde; Iwai, Naoharu; Kita, Yoshikuni; Ogihara, Toshio; Ohkubo, Takayoshi; Okamura, Tomonori; Ueshima, Hirotsugu; Umemura, Satoshi; Eyheramendy, Susana; Meitinger, Thomas; Wichmann, H.-Erich; Cho, Yoon Shin; Kim, Hyung-Lae; Lee, Jong-Young; Scott, James; Sehmi, Joban S.; 
Zhang, Weihua; Hedblad, Bo; Nilsson, Peter; Smith, George Davey; Wong, Andrew; Narisu, Narisu; Stančáková, Alena; Raffel, Leslie J.; Yao, Jie; Kathiresan, Sekar; O'Donnell, Chris; Schwartz, Steven M.; Arfan Ikram, M.; Longstreth, Will T.; Seshadri, Sudha; Shrine, Nick R.G.; Wain, Louise V.; Morken, Mario A.; Swift, Amy J.; Laitinen, Jaana; Prokopenko, Inga; Zitting, Paavo; Cooper, Jackie A.; Humphries, Steve E.; Danesh, John; Rasheed, Asif; Goel, Anuj; Hamsten, Anders; Watkins, Hugh; Bakker, Stephan J.L.; van Gilst, Wiek H.; Janipalli, Charles S.; Radha Mani, K.; Yajnik, Chittaranjan S.; Hofman, Albert; Mattace-Raso, Francesco U.S.; Oostra, Ben A.; Demirkan, Ayse; Isaacs, Aaron; Rivadeneira, Fernando; Lakatta, Edward G.; Orru, Marco; Scuteri, Angelo; Ala-Korpela, Mika; Kangas, Antti J.; Lyytikäinen, Leo-Pekka; Soininen, Pasi; Tukiainen, Taru; Würz, Peter; Twee-Hee Ong, Rick; Dörr, Marcus; Kroemer, Heyo K.; Völker, Uwe; Völzke, Henry; Galan, Pilar; Hercberg, Serge; Lathrop, Mark; Zelenika, Diana; Deloukas, Panos; Mangino, Massimo; Spector, Tim D.; Zhai, Guangju; Meschia, James F.; Nalls, Michael A.; Sharma, Pankaj; Terzic, Janos; Kranthi Kumar, M.J.; Denniff, Matthew; Zukowska-Szczechowska, Ewa; Wagenknecht, Lynne E.; Fowkes, Gerald R.; Charchar, Fadi J.; Schwarz, Peter E.H.; Hayward, Caroline; Guo, Xiuqing; Bots, Michiel L.; Brand, Eva; Samani, Nilesh J.; Polasek, Ozren; Talmud, Philippa J.; Nyberg, Fredrik; Kuh, Diana; Laan, Maris; Hveem, Kristian; Palmer, Lyle J.; van der Schouw, Yvonne T.; Casas, Juan P.; Mohlke, Karen L.; Vineis, Paolo; Raitakari, Olli; Wong, Tien Y.; Shyong Tai, E.; Laakso, Markku; Rao, Dabeeru C.; Harris, Tamara B.; Morris, Richard W.; Dominiczak, Anna F.; Kivimaki, Mika; Marmot, Michael G.; Miki, Tetsuro; Saleheen, Danish; Chandak, Giriraj R.; Coresh, Josef; Navis, Gerjan; Salomaa, Veikko; Han, Bok-Ghee; Kooner, Jaspal S.; Melander, Olle; Ridker, Paul M.; Bandinelli, Stefania; Gyllensten, Ulf B.; Wright, Alan F.; Wilson, James F.; Ferrucci, 
Luigi; Farrall, Martin; Tuomilehto, Jaakko; Pramstaller, Peter P.; Elosua, Roberto; Soranzo, Nicole; Sijbrands, Eric J.G.; Altshuler, David; Loos, Ruth J.F.; Shuldiner, Alan R.; Gieger, Christian; Meneton, Pierre; Uitterlinden, Andre G.; Wareham, Nicholas J.; Gudnason, Vilmundur; Rettig, Rainer; Uda, Manuela; Strachan, David P.; Witteman, Jacqueline C.M.; Hartikainen, Anna-Liisa; Beckmann, Jacques S.; Boerwinkle, Eric; Boehnke, Michael; Larson, Martin G.; Järvelin, Marjo-Riitta; Psaty, Bruce M.; Abecasis, Gonçalo R.; Elliott, Paul; van Duijn, Cornelia M.; Newton-Cheh, Christopher

    2011-01-01

    The prevalence of hypertension in African Americans (AAs) is higher than in other US groups; yet, few genome-wide association studies (GWASs) have been performed in AAs. Among people of European descent, GWASs have identified genetic variants at 13 loci that are associated with blood pressure. It is unknown if these variants confer susceptibility in people of African ancestry. Here, we examined genome-wide and candidate gene associations with systolic blood pressure (SBP) and diastolic blood pressure (DBP) using the Candidate Gene Association Resource (CARe) consortium consisting of 8591 AAs. Genotypes included genome-wide single-nucleotide polymorphism (SNP) data utilizing the Affymetrix 6.0 array with imputation to 2.5 million HapMap SNPs and candidate gene SNP data utilizing a 50K cardiovascular gene-centric array (ITMAT-Broad-CARe [IBC] array). For Affymetrix data, the strongest signal for DBP was rs10474346 (P = 3.6 × 10^-8) located near GPR98 and ARRDC3. For SBP, the strongest signal was rs2258119 in C21orf91 (P = 4.7 × 10^-8). The top IBC association for SBP was rs2012318 (P = 6.4 × 10^-6) near SLC25A42 and for DBP was rs2523586 (P = 1.3 × 10^-6) near HLA-B. None of the top variants replicated in additional AA (n = 11 882) or European-American (n = 69 899) cohorts. We replicated previously reported European-American blood pressure SNPs in our AA samples (SH2B3, P = 0.009; TBX3-TBX5, P = 0.03; and CSK-ULK3, P = 0.0004). These genetic loci represent the best evidence of genetic influences on SBP and DBP in AAs to date. More broadly, this work supports the notion that blood pressure among AAs is a trait with genetic underpinnings but also with significant complexity. PMID:21378095

  18. Association of genetic variation with systolic and diastolic blood pressure among African Americans: the Candidate Gene Association Resource study.

    PubMed

    Fox, Ervin R; Young, J Hunter; Li, Yali; Dreisbach, Albert W; Keating, Brendan J; Musani, Solomon K; Liu, Kiang; Morrison, Alanna C; Ganesh, Santhi; Kutlar, Abdullah; Ramachandran, Vasan S; Polak, Josef F; Fabsitz, Richard R; Dries, Daniel L; Farlow, Deborah N; Redline, Susan; Adeyemo, Adebowale; Hirschorn, Joel N; Sun, Yan V; Wyatt, Sharon B; Penman, Alan D; Palmas, Walter; Rotter, Jerome I; Townsend, Raymond R; Doumatey, Ayo P; Tayo, Bamidele O; Mosley, Thomas H; Lyon, Helen N; Kang, Sun J; Rotimi, Charles N; Cooper, Richard S; Franceschini, Nora; Curb, J David; Martin, Lisa W; Eaton, Charles B; Kardia, Sharon L R; Taylor, Herman A; Caulfield, Mark J; Ehret, Georg B; Johnson, Toby; Chakravarti, Aravinda; Zhu, Xiaofeng; Levy, Daniel

    2011-06-01

    The prevalence of hypertension in African Americans (AAs) is higher than in other US groups; yet, few genome-wide association studies (GWASs) have been performed in AAs. Among people of European descent, GWASs have identified genetic variants at 13 loci that are associated with blood pressure. It is unknown if these variants confer susceptibility in people of African ancestry. Here, we examined genome-wide and candidate gene associations with systolic blood pressure (SBP) and diastolic blood pressure (DBP) using the Candidate Gene Association Resource (CARe) consortium consisting of 8591 AAs. Genotypes included genome-wide single-nucleotide polymorphism (SNP) data utilizing the Affymetrix 6.0 array with imputation to 2.5 million HapMap SNPs and candidate gene SNP data utilizing a 50K cardiovascular gene-centric array (ITMAT-Broad-CARe [IBC] array). For Affymetrix data, the strongest signal for DBP was rs10474346 (P = 3.6 × 10^-8) located near GPR98 and ARRDC3. For SBP, the strongest signal was rs2258119 in C21orf91 (P = 4.7 × 10^-8). The top IBC association for SBP was rs2012318 (P = 6.4 × 10^-6) near SLC25A42 and for DBP was rs2523586 (P = 1.3 × 10^-6) near HLA-B. None of the top variants replicated in additional AA (n = 11 882) or European-American (n = 69 899) cohorts. We replicated previously reported European-American blood pressure SNPs in our AA samples (SH2B3, P = 0.009; TBX3-TBX5, P = 0.03; and CSK-ULK3, P = 0.0004). These genetic loci represent the best evidence of genetic influences on SBP and DBP in AAs to date. More broadly, this work supports the notion that blood pressure among AAs is a trait with genetic underpinnings but also with significant complexity.

  19. Fine scale mapping of the 17q22 breast cancer locus using dense SNPs, genotyped within the Collaborative Oncological Gene-Environment Study (COGs)

    PubMed Central

    Darabi, Hatef; Beesley, Jonathan; Droit, Arnaud; Kar, Siddhartha; Nord, Silje; Moradi Marjaneh, Mahdi; Soucy, Penny; Michailidou, Kyriaki; Ghoussaini, Maya; Fues Wahl, Hanna; Bolla, Manjeet K.; Wang, Qin; Dennis, Joe; Alonso, M. Rosario; Andrulis, Irene L.; Anton-Culver, Hoda; Arndt, Volker; Beckmann, Matthias W.; Benitez, Javier; Bogdanova, Natalia V.; Bojesen, Stig E.; Brauch, Hiltrud; Brenner, Hermann; Broeks, Annegien; Brüning, Thomas; Burwinkel, Barbara; Chang-Claude, Jenny; Choi, Ji-Yeob; Conroy, Don M.; Couch, Fergus J.; Cox, Angela; Cross, Simon S.; Czene, Kamila; Devilee, Peter; Dörk, Thilo; Easton, Douglas F.; Fasching, Peter A.; Figueroa, Jonine; Fletcher, Olivia; Flyger, Henrik; Galle, Eva; García-Closas, Montserrat; Giles, Graham G.; Goldberg, Mark S.; González-Neira, Anna; Guénel, Pascal; Haiman, Christopher A.; Hallberg, Emily; Hamann, Ute; Hartman, Mikael; Hollestelle, Antoinette; Hopper, John L.; Ito, Hidemi; Jakubowska, Anna; Johnson, Nichola; Kang, Daehee; Khan, Sofia; Kosma, Veli-Matti; Kriege, Mieke; Kristensen, Vessela; Lambrechts, Diether; Le Marchand, Loic; Lee, Soo Chin; Lindblom, Annika; Lophatananon, Artitaya; Lubinski, Jan; Mannermaa, Arto; Manoukian, Siranoush; Margolin, Sara; Matsuo, Keitaro; Mayes, Rebecca; McKay, James; Meindl, Alfons; Milne, Roger L.; Muir, Kenneth; Neuhausen, Susan L.; Nevanlinna, Heli; Olswold, Curtis; Orr, Nick; Peterlongo, Paolo; Pita, Guillermo; Pylkäs, Katri; Rudolph, Anja; Sangrajrang, Suleeporn; Sawyer, Elinor J.; Schmidt, Marjanka K.; Schmutzler, Rita K.; Seynaeve, Caroline; Shah, Mitul; Shen, Chen-Yang; Shu, Xiao-Ou; Southey, Melissa C.; Stram, Daniel O.; Surowy, Harald; Swerdlow, Anthony; Teo, Soo H.; Tessier, Daniel C.; Tomlinson, Ian; Torres, Diana; Truong, Thérèse; Vachon, Celine M.; Vincent, Daniel; Winqvist, Robert; Wu, Anna H.; Wu, Pei-Ei; Yip, Cheng Har; Zheng, Wei; Pharoah, Paul D. 
P.; Hall, Per; Edwards, Stacey L.; Simard, Jacques; French, Juliet D.; Chenevix-Trench, Georgia; Dunning, Alison M.

    2016-01-01

    Genome-wide association studies have found SNPs at 17q22 to be associated with breast cancer risk. To identify potential causal variants related to breast cancer risk, we performed a high resolution fine-mapping analysis that involved genotyping 517 SNPs using a custom Illumina iSelect array (iCOGS) followed by imputation of genotypes for 3,134 SNPs in more than 89,000 participants of European ancestry from the Breast Cancer Association Consortium (BCAC). We identified 28 highly correlated common variants, in a 53 kb region spanning two introns of the STXBP4 gene, that are strong candidates for driving breast cancer risk (lead SNP rs2787486 (OR = 0.92; CI 0.90–0.94; P = 8.96 × 10^-15)) and are correlated with two previously reported risk-associated variants at this locus, SNPs rs6504950 (OR = 0.94, P = 2.04 × 10^-9, r^2 = 0.73 with lead SNP) and rs1156287 (OR = 0.93, P = 3.41 × 10^-11, r^2 = 0.83 with lead SNP). Analyses indicate only one causal SNP in the region and several enhancer elements targeting STXBP4 are located within the 53 kb association signal. Expression studies in breast tumor tissues found SNP rs2787486 to be associated with increased STXBP4 expression, suggesting this may be a target gene of this locus. PMID:27600471

  20. Dissection of additive, dominance, and imprinting effects for production and reproduction traits in Holstein cattle.

    PubMed

    Jiang, Jicai; Shen, Botong; O'Connell, Jeffrey R; VanRaden, Paul M; Cole, John B; Ma, Li

    2017-05-30

    Although genome-wide association and genomic selection studies have primarily focused on additive effects, dominance and imprinting effects play an important role in mammalian biology and development. The degree to which these non-additive genetic effects contribute to phenotypic variation and whether QTL acting in a non-additive manner can be detected in genetic association studies remain controversial. To empirically answer these questions, we analyzed a large cattle dataset that consisted of 42,701 genotyped Holstein cows with genotyped parents and phenotypic records for eight production and reproduction traits. SNP genotypes were phased in pedigree to determine the parent-of-origin of alleles, and a three-component GREML was applied to obtain variance decomposition for additive, dominance, and imprinting effects. The results showed a significant non-zero contribution from dominance to production traits but not to reproduction traits. Imprinting effects significantly contributed to both production and reproduction traits. Interestingly, imprinting effects contributed more to reproduction traits than to production traits. Using GWAS and imputation-based fine-mapping analyses, we identified and validated a dominance association signal with milk yield near RUNX2, a candidate gene that has been associated with milk production in mice. When adding non-additive effects into the prediction models, however, we observed little or no increase in prediction accuracy for the eight traits analyzed. Collectively, our results suggested that non-additive effects contributed a non-negligible amount (more for reproduction traits) to the total genetic variance of complex traits in cattle, and detection of QTLs with non-additive effect is possible in GWAS using a large dataset.

  1. Association analysis for feet and legs disorders with whole-genome sequence variants in 3 dairy cattle breeds.

    PubMed

    Wu, Xiaoping; Guldbrandtsen, Bernt; Lund, Mogens Sandø; Sahana, Goutam

    2016-09-01

    Identification of genetic variants associated with feet and legs disorders (FLD) will aid in the genetic improvement of these traits by providing knowledge on genes that influence trait variations. In Denmark, FLD in cattle has been recorded since the 1990s. In this report, we used deregressed breeding values as response variables for a genome-wide association study. Bulls (5,334 Danish Holstein, 4,237 Nordic Red Dairy Cattle, and 1,180 Danish Jersey) with deregressed estimated breeding values were genotyped with the Illumina Bovine 54k single nucleotide polymorphism (SNP) genotyping array. Genotypes were imputed to whole-genome sequence variants, and then 22,751,039 SNP on 29 autosomes were used for an association analysis. A modified linear mixed-model approach (efficient mixed-model association eXpedited, EMMAX) and a linear mixed model were used for association analysis. We identified 5 (3,854 SNP), 3 (13,642 SNP), and 0 quantitative trait locus (QTL) regions associated with the FLD index in Danish Holstein, Nordic Red Dairy Cattle, and Danish Jersey populations, respectively. We did not identify any QTL that were common among the 3 breeds. In a meta-analysis of the 3 breeds, 4 QTL regions were significant, but no additional QTL region was identified compared with within-breed analyses. Comparison between top SNP locations within these QTL regions and known genes suggested that RASGRP1, LCORL, MOS, and MITF may be candidate genes for FLD in dairy cattle. Copyright © 2016 American Dairy Science Association. Published by Elsevier Inc. All rights reserved.

  2. Fine scale mapping of the 17q22 breast cancer locus using dense SNPs, genotyped within the Collaborative Oncological Gene-Environment Study (COGs).

    PubMed

    Darabi, Hatef; Beesley, Jonathan; Droit, Arnaud; Kar, Siddhartha; Nord, Silje; Moradi Marjaneh, Mahdi; Soucy, Penny; Michailidou, Kyriaki; Ghoussaini, Maya; Fues Wahl, Hanna; Bolla, Manjeet K; Wang, Qin; Dennis, Joe; Alonso, M Rosario; Andrulis, Irene L; Anton-Culver, Hoda; Arndt, Volker; Beckmann, Matthias W; Benitez, Javier; Bogdanova, Natalia V; Bojesen, Stig E; Brauch, Hiltrud; Brenner, Hermann; Broeks, Annegien; Brüning, Thomas; Burwinkel, Barbara; Chang-Claude, Jenny; Choi, Ji-Yeob; Conroy, Don M; Couch, Fergus J; Cox, Angela; Cross, Simon S; Czene, Kamila; Devilee, Peter; Dörk, Thilo; Easton, Douglas F; Fasching, Peter A; Figueroa, Jonine; Fletcher, Olivia; Flyger, Henrik; Galle, Eva; García-Closas, Montserrat; Giles, Graham G; Goldberg, Mark S; González-Neira, Anna; Guénel, Pascal; Haiman, Christopher A; Hallberg, Emily; Hamann, Ute; Hartman, Mikael; Hollestelle, Antoinette; Hopper, John L; Ito, Hidemi; Jakubowska, Anna; Johnson, Nichola; Kang, Daehee; Khan, Sofia; Kosma, Veli-Matti; Kriege, Mieke; Kristensen, Vessela; Lambrechts, Diether; Le Marchand, Loic; Lee, Soo Chin; Lindblom, Annika; Lophatananon, Artitaya; Lubinski, Jan; Mannermaa, Arto; Manoukian, Siranoush; Margolin, Sara; Matsuo, Keitaro; Mayes, Rebecca; McKay, James; Meindl, Alfons; Milne, Roger L; Muir, Kenneth; Neuhausen, Susan L; Nevanlinna, Heli; Olswold, Curtis; Orr, Nick; Peterlongo, Paolo; Pita, Guillermo; Pylkäs, Katri; Rudolph, Anja; Sangrajrang, Suleeporn; Sawyer, Elinor J; Schmidt, Marjanka K; Schmutzler, Rita K; Seynaeve, Caroline; Shah, Mitul; Shen, Chen-Yang; Shu, Xiao-Ou; Southey, Melissa C; Stram, Daniel O; Surowy, Harald; Swerdlow, Anthony; Teo, Soo H; Tessier, Daniel C; Tomlinson, Ian; Torres, Diana; Truong, Thérèse; Vachon, Celine M; Vincent, Daniel; Winqvist, Robert; Wu, Anna H; Wu, Pei-Ei; Yip, Cheng Har; Zheng, Wei; Pharoah, Paul D P; Hall, Per; Edwards, Stacey L; Simard, Jacques; French, Juliet D; Chenevix-Trench, Georgia; Dunning, Alison M

    2016-09-07

    Genome-wide association studies have found SNPs at 17q22 to be associated with breast cancer risk. To identify potential causal variants related to breast cancer risk, we performed a high resolution fine-mapping analysis that involved genotyping 517 SNPs using a custom Illumina iSelect array (iCOGS) followed by imputation of genotypes for 3,134 SNPs in more than 89,000 participants of European ancestry from the Breast Cancer Association Consortium (BCAC). We identified 28 highly correlated common variants, in a 53 kb region spanning two introns of the STXBP4 gene, that are strong candidates for driving breast cancer risk (lead SNP rs2787486 (OR = 0.92; CI 0.90-0.94; P = 8.96 × 10^-15)) and are correlated with two previously reported risk-associated variants at this locus, SNPs rs6504950 (OR = 0.94, P = 2.04 × 10^-9, r^2 = 0.73 with lead SNP) and rs1156287 (OR = 0.93, P = 3.41 × 10^-11, r^2 = 0.83 with lead SNP). Analyses indicate only one causal SNP in the region and several enhancer elements targeting STXBP4 are located within the 53 kb association signal. Expression studies in breast tumor tissues found SNP rs2787486 to be associated with increased STXBP4 expression, suggesting this may be a target gene of this locus.

  3. Genome-wide association study identifies multiple loci influencing human serum metabolite levels

    PubMed Central

    Kettunen, Johannes; Tukiainen, Taru; Sarin, Antti-Pekka; Ortega-Alonso, Alfredo; Tikkanen, Emmi; Lyytikäinen, Leo-Pekka; Kangas, Antti J; Soininen, Pasi; Würtz, Peter; Silander, Kaisa; Dick, Danielle M; Rose, Richard J; Savolainen, Markku J; Viikari, Jorma; Kähönen, Mika; Lehtimäki, Terho; Pietiläinen, Kirsi H; Inouye, Michael; McCarthy, Mark I; Jula, Antti; Eriksson, Johan; Raitakari, Olli T; Salomaa, Veikko; Kaprio, Jaakko; Järvelin, Marjo-Riitta; Peltonen, Leena; Perola, Markus; Freimer, Nelson B; Ala-Korpela, Mika; Palotie, Aarno; Ripatti, Samuli

    2013-01-01

    Nuclear magnetic resonance assays allow for measurement of a wide range of metabolic phenotypes. We report here the results of a GWAS on 8,330 Finnish individuals genotyped and imputed at 7.7 million SNPs for a range of 216 serum metabolic phenotypes assessed by NMR of serum samples. We identified significant associations (P < 2.31 × 10^-10) at 31 loci, including 11 for which there have not been previous reports of associations to a metabolic trait or disorder. Analyses of Finnish twin pairs suggested that the metabolic measures reported here show higher heritability than comparable conventional metabolic phenotypes. In accordance with our expectations, SNPs at the 31 loci associated with individual metabolites account for a greater proportion of the genetic component of trait variance (up to 40%) than is typically observed for conventional serum metabolic phenotypes. The identification of such associations may provide substantial insight into cardiometabolic disorders. PMID:22286219
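
    The significance cutoff quoted above (P < 2.31 × 10^-10) is numerically what a Bonferroni correction gives when a conventional genome-wide level of 5 × 10^-8 is divided by the 216 phenotypes tested. The abstract does not state how the cutoff was derived, so the 5 × 10^-8 starting point is an assumption here, and the function name is illustrative only. A minimal sketch of the arithmetic:

```python
def bonferroni_threshold(alpha: float, n_tests: int) -> float:
    """Per-test significance level after Bonferroni correction."""
    if n_tests < 1:
        raise ValueError("n_tests must be at least 1")
    return alpha / n_tests


# Assumed baseline: the conventional genome-wide level of 5e-8 (not stated
# in the abstract), corrected for the 216 metabolic phenotypes it mentions.
threshold = bonferroni_threshold(5e-8, 216)
print(f"{threshold:.2e}")  # 2.31e-10
```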

  4. Identification of a BRCA2-Specific Modifier Locus at 6p24 Related to Breast Cancer Risk

    PubMed Central

    Vijai, Joseph; Klein, Robert J.; Kirchhoff, Tomas; McGuffog, Lesley; Barrowdale, Daniel; Dunning, Alison M.; Lee, Andrew; Dennis, Joe; Healey, Sue; Dicks, Ed; Soucy, Penny; Sinilnikova, Olga M.; Pankratz, Vernon S.; Wang, Xianshu; Eldridge, Ronald C.; Tessier, Daniel C.; Vincent, Daniel; Bacot, Francois; Hogervorst, Frans B. L.; Peock, Susan; Stoppa-Lyonnet, Dominique; Peterlongo, Paolo; Schmutzler, Rita K.; Nathanson, Katherine L.; Piedmonte, Marion; Singer, Christian F.; Thomassen, Mads; Hansen, Thomas v. O.; Neuhausen, Susan L.; Blanco, Ignacio; Greene, Mark H.; Garber, Judith; Weitzel, Jeffrey N.; Andrulis, Irene L.; Goldgar, David E.; D'Andrea, Emma; Caldes, Trinidad; Nevanlinna, Heli; Osorio, Ana; van Rensburg, Elizabeth J.; Arason, Adalgeir; Rennert, Gad; van den Ouweland, Ans M. W.; van der Hout, Annemarie H.; Kets, Carolien M.; Aalfs, Cora M.; Wijnen, Juul T.; Ausems, Margreet G. E. M.; Frost, Debra; Ellis, Steve; Fineberg, Elena; Platte, Radka; Evans, D. Gareth; Jacobs, Chris; Adlard, Julian; Tischkowitz, Marc; Porteous, Mary E.; Damiola, Francesca; Golmard, Lisa; Barjhoux, Laure; Longy, Michel; Belotti, Muriel; Ferrer, Sandra Fert; Mazoyer, Sylvie; Spurdle, Amanda B.; Manoukian, Siranoush; Barile, Monica; Genuardi, Maurizio; Arnold, Norbert; Meindl, Alfons; Sutter, Christian; Wappenschmidt, Barbara; Domchek, Susan M.; Pfeiler, Georg; Friedman, Eitan; Jensen, Uffe Birk; Robson, Mark; Shah, Sohela; Lazaro, Conxi; Mai, Phuong L.; Benitez, Javier; Southey, Melissa C.; Schmidt, Marjanka K.; Fasching, Peter A.; Peto, Julian; Humphreys, Manjeet K.; Wang, Qin; Michailidou, Kyriaki; Sawyer, Elinor J.; Burwinkel, Barbara; Guénel, Pascal; Bojesen, Stig E.; Milne, Roger L.; Brenner, Hermann; Lochmann, Magdalena; Aittomäki, Kristiina; Dörk, Thilo; Margolin, Sara; Mannermaa, Arto; Lambrechts, Diether; Chang-Claude, Jenny; Radice, Paolo; Giles, Graham G.; Haiman, Christopher A.; Winqvist, Robert; Devillee, Peter; García-Closas, Montserrat; Schoof, Nils; Hooning, 
Maartje J.; Cox, Angela; Pharoah, Paul D. P.; Jakubowska, Anna; Orr, Nick; González-Neira, Anna; Pita, Guillermo; Alonso, M. Rosario; Hall, Per; Couch, Fergus J.; Simard, Jacques; Altshuler, David; Easton, Douglas F.; Chenevix-Trench, Georgia; Antoniou, Antonis C.; Offit, Kenneth

    2013-01-01

    Common genetic variants contribute to the observed variation in breast cancer risk for BRCA2 mutation carriers; those known to date have all been found through population-based genome-wide association studies (GWAS). To comprehensively identify breast cancer risk modifying loci for BRCA2 mutation carriers, we conducted a deep replication of an ongoing GWAS discovery study. Using the ranked P-values of the breast cancer associations with the imputed genotype of 1.4 M SNPs, 19,029 SNPs were selected and designed for inclusion on a custom Illumina array that included a total of 211,155 SNPs as part of a multi-consortial project. DNA samples from 3,881 breast cancer affected and 4,330 unaffected BRCA2 mutation carriers from 47 studies belonging to the Consortium of Investigators of Modifiers of BRCA1/2 were genotyped and available for analysis. We replicated previously reported breast cancer susceptibility alleles in these BRCA2 mutation carriers and for several regions (including FGFR2, MAP3K1, CDKN2A/B, and PTHLH) identified SNPs that have stronger evidence of association than those previously published. We also identified a novel susceptibility allele at 6p24 that was inversely associated with risk in BRCA2 mutation carriers (rs9348512; per allele HR = 0.85, 95% CI 0.80–0.90, P = 3.9 × 10^-8). This SNP was not associated with breast cancer risk either in the general population or in BRCA1 mutation carriers. The locus lies within a region containing TFAP2A, which encodes a transcriptional activation protein that interacts with several tumor suppressor genes. This report identifies the first breast cancer risk locus specific to a BRCA2 mutation background. This comprehensive update of novel and previously reported breast cancer susceptibility loci contributes to the establishment of a panel of SNPs that modify breast cancer risk in BRCA2 mutation carriers. This panel may have clinical utility for women with BRCA2 mutations weighing options for medical prevention of breast cancer. PMID:23544012

  5. Genetic variants associated with response to lithium treatment in bipolar disorder: a genome-wide association study.

    PubMed

    Hou, Liping; Heilbronner, Urs; Degenhardt, Franziska; Adli, Mazda; Akiyama, Kazufumi; Akula, Nirmala; Ardau, Raffaella; Arias, Bárbara; Backlund, Lena; Banzato, Claudio E M; Benabarre, Antoni; Bengesser, Susanne; Bhattacharjee, Abesh Kumar; Biernacka, Joanna M; Birner, Armin; Brichant-Petitjean, Clara; Bui, Elise T; Cervantes, Pablo; Chen, Guo-Bo; Chen, Hsi-Chung; Chillotti, Caterina; Cichon, Sven; Clark, Scott R; Colom, Francesc; Cousins, David A; Cruceanu, Cristiana; Czerski, Piotr M; Dantas, Clarissa R; Dayer, Alexandre; Étain, Bruno; Falkai, Peter; Forstner, Andreas J; Frisén, Louise; Fullerton, Janice M; Gard, Sébastien; Garnham, Julie S; Goes, Fernando S; Grof, Paul; Gruber, Oliver; Hashimoto, Ryota; Hauser, Joanna; Herms, Stefan; Hoffmann, Per; Hofmann, Andrea; Jamain, Stephane; Jiménez, Esther; Kahn, Jean-Pierre; Kassem, Layla; Kittel-Schneider, Sarah; Kliwicki, Sebastian; König, Barbara; Kusumi, Ichiro; Lackner, Nina; Laje, Gonzalo; Landén, Mikael; Lavebratt, Catharina; Leboyer, Marion; Leckband, Susan G; Jaramillo, Carlos A López; MacQueen, Glenda; Manchia, Mirko; Martinsson, Lina; Mattheisen, Manuel; McCarthy, Michael J; McElroy, Susan L; Mitjans, Marina; Mondimore, Francis M; Monteleone, Palmiero; Nievergelt, Caroline M; Nöthen, Markus M; Ösby, Urban; Ozaki, Norio; Perlis, Roy H; Pfennig, Andrea; Reich-Erkelenz, Daniela; Rouleau, Guy A; Schofield, Peter R; Schubert, K Oliver; Schweizer, Barbara W; Seemüller, Florian; Severino, Giovanni; Shekhtman, Tatyana; Shilling, Paul D; Shimoda, Kazutaka; Simhandl, Christian; Slaney, Claire M; Smoller, Jordan W; Squassina, Alessio; Stamm, Thomas; Stopkova, Pavla; Tighe, Sarah K; Tortorella, Alfonso; Turecki, Gustavo; Volkert, Julia; Witt, Stephanie; Wright, Adam; Young, L Trevor; Zandi, Peter P; Potash, James B; DePaulo, J Raymond; Bauer, Michael; Reininghaus, Eva Z; Novák, Tomas; Aubry, Jean-Michel; Maj, Mario; Baune, Bernhard T; Mitchell, Philip B; Vieta, Eduard; Frye, Mark A; Rybakowski, Janusz K; Kuo, 
Po-Hsiu; Kato, Tadafumi; Grigoroiu-Serbanescu, Maria; Reif, Andreas; Del Zompo, Maria; Bellivier, Frank; Schalling, Martin; Wray, Naomi R; Kelsoe, John R; Alda, Martin; Rietschel, Marcella; McMahon, Francis J; Schulze, Thomas G

    2016-03-12

    Lithium is a first-line treatment in bipolar disorder, but individual response is variable. Previous studies have suggested that lithium response is a heritable trait. However, no genetic markers of treatment response have been reproducibly identified. Here, we report the results of a genome-wide association study of lithium response in 2563 patients collected by 22 participating sites from the International Consortium on Lithium Genetics (ConLiGen). Data from common single nucleotide polymorphisms (SNPs) were tested for association with categorical and continuous ratings of lithium response. Lithium response was measured using a well established scale (Alda scale). Genotyped SNPs were used to generate data at more than 6 million sites, using standard genomic imputation methods. Traits were regressed against genotype dosage. Results were combined across two batches by meta-analysis. A single locus of four linked SNPs on chromosome 21 met genome-wide significance criteria for association with lithium response (rs79663003, p = 1.37 × 10^-8; rs78015114, p = 1.31 × 10^-8; rs74795342, p = 3.31 × 10^-9; and rs75222709, p = 3.50 × 10^-9). In an independent, prospective study of 73 patients treated with lithium monotherapy for a period of up to 2 years, carriers of the response-associated alleles had a significantly lower rate of relapse than carriers of the alternate alleles (p = 0.03268, hazard ratio 3.8, 95% CI 1.1-13.0). The response-associated region contains two genes for long, non-coding RNAs (lncRNAs), AL157359.3 and AL157359.4. LncRNAs are increasingly appreciated as important regulators of gene expression, particularly in the CNS. Confirmed biomarkers of lithium response would constitute an important step forward in the clinical management of bipolar disorder. Further studies are needed to establish the biological context and potential clinical utility of these findings. Funding: Deutsche Forschungsgemeinschaft, National Institute of Mental Health Intramural Research Program. Copyright © 2016 Elsevier Ltd. All rights reserved.
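
    The abstract above says traits were regressed against genotype dosage. For readers unfamiliar with the term, the sketch below shows one common way a dosage is computed from imputed genotype probabilities (the expected count of the alternate allele), followed by a single-predictor least-squares slope. The function names and the toy numbers are hypothetical; the study's actual software and covariates are not described in the abstract.

```python
def expected_dosage(p_ref_ref: float, p_ref_alt: float, p_alt_alt: float) -> float:
    """Expected alternate-allele count (0 to 2) from genotype probabilities."""
    total = p_ref_ref + p_ref_alt + p_alt_alt
    if total <= 0:
        raise ValueError("genotype probabilities must sum to a positive value")
    # Normalise in case the three probabilities do not sum exactly to 1.
    return (p_ref_alt + 2.0 * p_alt_alt) / total


def ols_slope(x: list[float], y: list[float]) -> float:
    """Ordinary least-squares slope of y on a single predictor x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    return sxy / sxx


# Toy, made-up data: three individuals with imputed probabilities and a trait.
dosages = [
    expected_dosage(1.0, 0.0, 0.0),
    expected_dosage(0.0, 1.0, 0.0),
    expected_dosage(0.0, 0.0, 1.0),
]
trait = [1.0, 3.0, 5.0]
print(ols_slope(dosages, trait))  # per-allele effect on the toy trait
```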

  6. Breast Cancer and Modifiable Lifestyle Factors in Argentinean Women: Addressing Missing Data in a Case-Control Study

    PubMed Central

    Coquet, Julia Becaria; Tumas, Natalia; Osella, Alberto Ruben; Tanzi, Matteo; Franco, Isabella; Diaz, Maria Del Pilar

    2016-01-01

    A number of studies have shown the effect of modifiable lifestyle factors such as diet, breastfeeding and nutritional status on breast cancer risk. However, none have addressed the missing data problem in nutritional epidemiologic research in South America. Missing data is a frequent problem in breast cancer studies and epidemiological settings in general. Estimates of effect obtained from these studies may be biased if no appropriate method for handling missing data is applied. We performed Multiple Imputation for missing values on covariates in a breast cancer case-control study of Córdoba (Argentina) to optimize risk estimates. Data were obtained from a breast cancer case-control study conducted from 2008 to 2015 (318 cases, 526 controls). Complete case analysis and multiple imputation using chained equations were the methods applied to estimate the effects of a Traditional dietary pattern and other recognized factors associated with breast cancer. Physical activity and socioeconomic status were imputed. Logistic regression models were fitted. When complete case analysis was performed, only 31% of women were considered. Although a positive association of Traditional dietary pattern and breast cancer was observed from both approaches (complete case analysis OR=1.3, 95%CI=1.0-1.7; multiple imputation OR=1.4, 95%CI=1.2-1.7), effects of other covariates, like BMI and breastfeeding, were only identified when multiple imputation was considered. A Traditional dietary pattern, BMI and breastfeeding are associated with the occurrence of breast cancer in this Argentinean population when multiple imputation is appropriately performed. Multiple Imputation is suggested in Latin America’s epidemiologic studies to optimize effect estimates in the future. PMID:27892664
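
    After multiple imputation, the model is fitted once per imputed dataset and the results are combined. The abstract does not say how the estimates were pooled; the sketch below assumes the standard Rubin's rules and uses made-up log odds ratios and variances purely for illustration (they are not the study's values).

```python
import math
from statistics import mean, variance


def pool_rubin(estimates: list[float], variances: list[float]) -> tuple[float, float]:
    """Pool per-imputation estimates with Rubin's rules.

    Returns the pooled estimate and its total variance.
    """
    m = len(estimates)
    if m < 2 or m != len(variances):
        raise ValueError("need at least two imputations with matching variances")
    pooled = mean(estimates)        # average of the m point estimates
    within = mean(variances)        # average within-imputation variance
    between = variance(estimates)   # between-imputation variance (m - 1 denominator)
    total = within + (1.0 + 1.0 / m) * between
    return pooled, total


# Hypothetical log odds ratios and squared standard errors from m = 3 imputations.
log_or, total_var = pool_rubin([0.2, 0.4, 0.3], [0.01, 0.01, 0.01])
se = math.sqrt(total_var)
low, high = math.exp(log_or - 1.96 * se), math.exp(log_or + 1.96 * se)
print(f"pooled OR = {math.exp(log_or):.2f}, approx. 95% CI {low:.2f}-{high:.2f}")
```

    The interval here uses a normal approximation for brevity; Rubin's rules also define a degrees-of-freedom adjustment that a full analysis would normally apply.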

  7. A comprehensive evaluation of popular proteomics software workflows for label-free proteome quantification and imputation.

    PubMed

    Välikangas, Tommi; Suomi, Tomi; Elo, Laura L

    2017-05-31

    Label-free mass spectrometry (MS) has developed into an important tool applied in various fields of biological and life sciences. Several software packages exist to process the raw MS data into quantified protein abundances, including open source and commercial solutions. Each package includes a set of unique algorithms for different tasks of the MS data processing workflow. While many of these algorithms have been compared separately, a thorough and systematic evaluation of their overall performance is missing. Moreover, systematic information is lacking about the amount of missing values produced by the different proteomics software packages and the capabilities of different data imputation methods to account for them. In this study, we evaluated the performance of five popular quantitative label-free proteomics software workflows using four different spike-in data sets. Our extensive testing included the number of proteins quantified and the number of missing values produced by each workflow, the accuracy of detecting differential expression and logarithmic fold change, and the effect of different imputation and filtering methods on the differential expression results. We found that the Progenesis software performed consistently well in the differential expression analysis and produced few missing values. The missing values produced by the other software packages decreased their performance, but this difference could be mitigated using proper data filtering or imputation methods. Among the imputation methods, we found that the local least squares (lls) regression imputation consistently increased the performance of the software in the differential expression analysis, and a combination of both data filtering and local least squares imputation increased performance the most in the tested data sets. © The Author 2017. Published by Oxford University Press.

  8. Data imputation analysis for Cosmic Rays time series

    NASA Astrophysics Data System (ADS)

    Fernandes, R. C.; Lucio, P. S.; Fernandez, J. H.

    2017-05-01

    Missing data in Galactic Cosmic Rays (GCR) time series are inevitable, since data are lost through mechanical and human failure, technical problems and the different periods of operation of GCR stations. The aim of this study was to perform multiple dataset imputation in order to reconstruct the observational dataset. The study used the monthly GCR time series of Climax (CLMX) and Roma (ROME) from 1960 to 2004 to simulate scenarios with 10%, 20%, 30%, 40%, 50%, 60%, 70%, 80% and 90% of the data missing relative to the observed ROME series, with 50 replicates each. The CLMX station was then used as a proxy for these scenarios. Three methods for monthly dataset imputation were selected: Amelia II, which runs a bootstrap Expectation Maximization algorithm; MICE, which runs Multivariate Imputation by Chained Equations; and MTSDI, an Expectation Maximization algorithm-based method for imputation of missing values in multivariate normal time series. The synthetic time series were compared with the observed ROME series using several skill measures such as RMSE, NRMSE, Agreement Index, R, R^2, F-test and t-test. The results showed that for CLMX and ROME, the R^2 and R statistics were equal to 0.98 and 0.96, respectively. Increases in the number of gaps reduced the quality of the time series. Data imputation was most efficient with the MTSDI method, with negligible errors and the best skill coefficients. For monthly averages, the results suggest a limit of about 60% missing data for imputation. It is noteworthy that the CLMX, ROME and KIEL stations present no missing data in the target period. This methodology allowed reconstructing 43 time series.
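
    The evaluation design described above (mask a known fraction of an observed series, impute, then score the imputed values against the held-back observations) can be sketched as follows. This is only an illustration of the masking step and two of the listed skill measures: the helper names are hypothetical, the NRMSE is normalised by the observed range (one of several conventions, and the abstract does not say which was used), and the imputation methods themselves are not reproduced.

```python
import math
import random


def mask_indices(n: int, fraction: float, seed: int) -> list[int]:
    """Pick a reproducible random subset of positions to treat as missing."""
    k = round(n * fraction)
    return sorted(random.Random(seed).sample(range(n), k))


def rmse(observed: list[float], imputed: list[float]) -> float:
    """Root-mean-square error between observed and imputed values."""
    if len(observed) != len(imputed) or not observed:
        raise ValueError("series must be non-empty and of equal length")
    squared = [(o - i) ** 2 for o, i in zip(observed, imputed)]
    return math.sqrt(sum(squared) / len(squared))


def nrmse(observed: list[float], imputed: list[float]) -> float:
    """RMSE divided by the range of the observed values (assumed convention)."""
    spread = max(observed) - min(observed)
    if spread == 0:
        raise ValueError("observed values have zero range")
    return rmse(observed, imputed) / spread


# Toy example: hide 30% of a ten-point series, then score made-up imputed values.
hidden = mask_indices(10, 0.3, seed=1)
print(hidden)
print(rmse([1.0, 2.0, 3.0], [1.0, 2.0, 5.0]), nrmse([1.0, 2.0, 3.0], [1.0, 2.0, 5.0]))
```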

  9. Use of Multiple Imputation to Estimate the Proportion of Respiratory Virus Detections Among Patients Hospitalized With Community-Acquired Pneumonia.

    PubMed

    Bozio, Catherine H; Flanders, W Dana; Finelli, Lyn; Bramley, Anna M; Reed, Carrie; Gandhi, Neel R; Vidal, Jorge E; Erdman, Dean; Levine, Min Z; Lindstrom, Stephen; Ampofo, Krow; Arnold, Sandra R; Self, Wesley H; Williams, Derek J; Grijalva, Carlos G; Anderson, Evan J; McCullers, Jonathan A; Edwards, Kathryn M; Pavia, Andrew T; Wunderink, Richard G; Jain, Seema

    2018-04-01

    Real-time polymerase chain reaction (PCR) on respiratory specimens and serology on paired blood specimens are used to determine the etiology of respiratory illnesses for research studies. However, convalescent serology is often not collected. We used multiple imputation to assign values for missing serology results to estimate virus-specific prevalence among pediatric and adult community-acquired pneumonia hospitalizations using data from an active population-based surveillance study. Presence of adenoviruses, human metapneumovirus, influenza viruses, parainfluenza virus types 1-3, and respiratory syncytial virus was defined by positive PCR on nasopharyngeal/oropharyngeal specimens or a 4-fold rise in paired serology. We performed multiple imputation by developing a multivariable regression model for each virus using data from patients with available serology results. We calculated absolute and relative differences in the proportion of each virus detected comparing the imputed to observed (nonimputed) results. Among 2222 children and 2259 adults, 98.8% and 99.5% had nasopharyngeal/oropharyngeal specimens and 43.2% and 37.5% had paired serum specimens, respectively. Imputed results increased viral etiology assignments by an absolute difference of 1.6%-4.4% and 0.8%-2.8% in children and adults, respectively; relative differences were 1.1-3.0 times higher. Multiple imputation can be used when serology results are missing, to refine virus-specific prevalence estimates, and these will likely increase estimates.

  10. Multiple Imputation of a Randomly Censored Covariate Improves Logistic Regression Analysis.

    PubMed

    Atem, Folefac D; Qian, Jing; Maye, Jacqueline E; Johnson, Keith A; Betensky, Rebecca A

    2016-01-01

    Randomly censored covariates arise frequently in epidemiologic studies. The most commonly used methods, including complete case and single imputation or substitution, suffer from inefficiency and bias. They make strong parametric assumptions or they consider limit of detection censoring only. We employ multiple imputation, in conjunction with semi-parametric modeling of the censored covariate, to overcome these shortcomings and to facilitate robust estimation. We develop a multiple imputation approach for randomly censored covariates within the framework of a logistic regression model. We use the non-parametric estimate of the covariate distribution or the semiparametric Cox model estimate in the presence of additional covariates in the model. We evaluate this procedure in simulations, and compare its operating characteristics to those from the complete case analysis and a survival regression approach. We apply the procedures to an Alzheimer's study of the association between amyloid positivity and maternal age of onset of dementia. Multiple imputation achieves lower standard errors and higher power than the complete case approach under heavy and moderate censoring and is comparable under light censoring. The survival regression approach achieves the highest power among all procedures, but does not produce interpretable estimates of association. Multiple imputation offers a favorable alternative to complete case analysis and ad hoc substitution methods in the presence of randomly censored covariates within the framework of logistic regression.
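
    The abstract does not describe how estimates from the imputed data sets were combined. A common choice for multiple imputation is Rubin's rules, sketched below under that assumption; the function name and the example numbers (hypothetical log-odds-ratio estimates and their variances from three imputed data sets) are illustrative only.

```python
from statistics import mean, variance


def pool_rubin(estimates, variances):
    """Pool per-imputation estimates with Rubin's rules.

    Returns the pooled estimate and its total variance, where
    total = within + (1 + 1/m) * between.
    """
    m = len(estimates)
    if m < 2 or m != len(variances):
        raise ValueError("need at least two estimates with matching variances")
    pooled = mean(estimates)          # average of the point estimates
    within = mean(variances)          # average within-imputation variance
    between = variance(estimates)     # sample variance across imputations
    total = within + (1 + 1 / m) * between
    return pooled, total


# Hypothetical values, not results from the paper.
estimates = [0.50, 0.60, 0.55]
variances = [0.040, 0.050, 0.045]

pooled, total = pool_rubin(estimates, variances)
print(pooled, total ** 0.5)  # pooled estimate and its standard error
```

    The between-imputation term is what distinguishes multiple from single imputation: it carries the uncertainty about the imputed values into the reported standard error.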

  11. Borehole optical lateral displacement sensor

    DOEpatents

    Lewis, R.E.

    1998-10-20

    There is provided by this invention an optical displacement sensor that utilizes a reflective target connected to a surface to be monitored to reflect light from a light source such that the reflected light is received by a photoelectric transducer. The electric signal from the photoelectric transducer is then imputed into electronic circuitry to generate an electronic image of the target. The target's image is monitored to determine the quantity and direction of any lateral displacement in the target's image which represents lateral displacement in the surface being monitored. 4 figs.

  12. Limitations in Using Multiple Imputation to Harmonize Individual Participant Data for Meta-Analysis.

    PubMed

    Siddique, Juned; de Chavez, Peter J; Howe, George; Cruden, Gracelyn; Brown, C Hendricks

    2018-02-01

    Individual participant data (IPD) meta-analysis is a meta-analysis in which the individual-level data for each study are obtained and used for synthesis. A common challenge in IPD meta-analysis is when variables of interest are measured differently in different studies. The term harmonization has been coined to describe the procedure of placing variables on the same scale in order to permit pooling of data from a large number of studies. Using data from an IPD meta-analysis of 19 adolescent depression trials, we describe a multiple imputation approach for harmonizing 10 depression measures across the 19 trials by treating those depression measures that were not used in a study as missing data. We then apply diagnostics to address the fit of our imputation model. Even after reducing the scale of our application, we were still unable to produce accurate imputations of the missing values. We describe those features of the data that made it difficult to harmonize the depression measures and provide some guidelines for using multiple imputation for harmonization in IPD meta-analysis.

  13. Genetic variance estimation with imputed variants finds negligible missing heritability for human height and body mass index.

    PubMed

    Yang, Jian; Bakshi, Andrew; Zhu, Zhihong; Hemani, Gibran; Vinkhuyzen, Anna A E; Lee, Sang Hong; Robinson, Matthew R; Perry, John R B; Nolte, Ilja M; van Vliet-Ostaptchouk, Jana V; Snieder, Harold; Esko, Tonu; Milani, Lili; Mägi, Reedik; Metspalu, Andres; Hamsten, Anders; Magnusson, Patrik K E; Pedersen, Nancy L; Ingelsson, Erik; Soranzo, Nicole; Keller, Matthew C; Wray, Naomi R; Goddard, Michael E; Visscher, Peter M

    2015-10-01

    We propose a method (GREML-LDMS) to estimate heritability for human complex traits in unrelated individuals using whole-genome sequencing data. We demonstrate using simulations based on whole-genome sequencing data that ∼97% and ∼68% of variation at common and rare variants, respectively, can be captured by imputation. Using the GREML-LDMS method, we estimate from 44,126 unrelated individuals that all ∼17 million imputed variants explain 56% (standard error (s.e.) = 2.3%) of variance for height and 27% (s.e. = 2.5%) of variance for body mass index (BMI), and we find evidence that height- and BMI-associated variants have been under natural selection. Considering the imperfect tagging of imputation and potential overestimation of heritability from previous family-based studies, heritability is likely to be 60-70% for height and 30-40% for BMI. Therefore, the missing heritability is small for both traits. For further discovery of genes associated with complex traits, a study design with SNP arrays followed by imputation is more cost-effective than whole-genome sequencing at current prices.

  14. Covariate Selection for Multilevel Models with Missing Data

    PubMed Central

    Marino, Miguel; Buxton, Orfeu M.; Li, Yi

    2017-01-01

    Missing covariate data hampers variable selection in multilevel regression settings. Current variable selection techniques for multiply-imputed data commonly address missingness in the predictors through list-wise deletion and stepwise-selection methods which are problematic. Moreover, most variable selection methods are developed for independent linear regression models and do not accommodate multilevel mixed effects regression models with incomplete covariate data. We develop a novel methodology that is able to perform covariate selection across multiply-imputed data for multilevel random effects models when missing data is present. Specifically, we propose to stack the multiply-imputed data sets from a multiple imputation procedure and to apply a group variable selection procedure through group lasso regularization to assess the overall impact of each predictor on the outcome across the imputed data sets. Simulations confirm the advantageous performance of the proposed method compared with the competing methods. We applied the method to reanalyze the Healthy Directions-Small Business cancer prevention study, which evaluated a behavioral intervention program targeting multiple risk-related behaviors in a working-class, multi-ethnic population. PMID:28239457

  15. Analysis of partially observed clustered data using generalized estimating equations and multiple imputation

    PubMed Central

    Aloisio, Kathryn M.; Swanson, Sonja A.; Micali, Nadia; Field, Alison; Horton, Nicholas J.

    2015-01-01

    Clustered data arise in many settings, particularly within the social and biomedical sciences. As an example, multiple-source reports are commonly collected in child and adolescent psychiatric epidemiologic studies where researchers use various informants (e.g. parent and adolescent) to provide a holistic view of a subject’s symptomatology. Fitzmaurice et al. (1995) have described estimation of multiple source models using a standard generalized estimating equation (GEE) framework. However, these studies often have missing data due to additional stages of consent and assent required. The usual GEE is unbiased when missingness is Missing Completely at Random (MCAR) in the sense of Little and Rubin (2002). This is a strong assumption that may not be tenable. Other options such as weighted generalized estimating equations (WEEs) are computationally challenging when missingness is non-monotone. Multiple imputation is an attractive method to fit incomplete data models while only requiring the less restrictive Missing at Random (MAR) assumption. Previously, estimation of partially observed clustered data was computationally challenging; however, recent developments in Stata have facilitated the use of these methods in practice. We demonstrate how to utilize multiple imputation in conjunction with a GEE to investigate the prevalence of disordered eating symptoms in adolescents reported by parents and adolescents as well as factors associated with concordance and prevalence. The methods are motivated by the Avon Longitudinal Study of Parents and their Children (ALSPAC), a cohort study that enrolled more than 14,000 pregnant mothers in 1991-92 and has followed the health and development of their children at regular intervals. While point estimates were fairly similar to the GEE under MCAR, the MAR model had smaller standard errors, while requiring less stringent assumptions regarding missingness. PMID:25642154

  16. Incomplete Data in Smart Grid: Treatment of Values in Electric Vehicle Charging Data

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Majipour, Mostafa; Chu, Peter; Gadh, Rajit

    2014-11-03

    In this paper, five imputation methods, namely Constant (zero), Mean, Median, Maximum Likelihood, and Multiple Imputation, have been applied to compensate for missing values in Electric Vehicle (EV) charging data. The outcome of each of these methods has been used as the input to a prediction algorithm to forecast the EV load in the next 24 hours at each individual outlet. The data is real world data at the outlet level from the UCLA campus parking lots. Given the sparsity of the data, both Median and Constant (=zero) imputations improved the prediction results. Since in most missing value cases in our database, all values of that instance are missing, the multivariate imputation methods did not improve the results significantly compared to univariate approaches.
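
    The three univariate methods named above (Constant, Mean and Median) can be sketched in a few lines. This is an illustration under stated assumptions rather than the authors' implementation: missing readings are represented as None, and the function name and sample values are hypothetical.

```python
from statistics import mean, median


def impute(values, strategy="median"):
    """Fill None entries using a constant zero, the mean or the median
    of the observed (non-missing) entries."""
    observed = [v for v in values if v is not None]
    if strategy == "zero":
        fill = 0.0
    elif strategy in ("mean", "median"):
        if not observed:
            raise ValueError("no observed values to compute the fill from")
        fill = mean(observed) if strategy == "mean" else median(observed)
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return [fill if v is None else v for v in values]


# Hypothetical hourly energy readings (kWh) at one outlet; None = missing.
readings = [0.0, None, 3.0, 0.0, None, 9.0]

print(impute(readings, "zero"))
print(impute(readings, "mean"))
print(impute(readings, "median"))
```

    With sparse series that are mostly zero, the median sits close to zero while the mean is pulled up by the few charging events, which is consistent with the abstract's remark that Median and Constant imputation behaved similarly.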

  17. Common variants at the CHEK2 gene locus and risk of epithelial ovarian cancer.

    PubMed

    Lawrenson, Kate; Iversen, Edwin S; Tyrer, Jonathan; Weber, Rachel Palmieri; Concannon, Patrick; Hazelett, Dennis J; Li, Qiyuan; Marks, Jeffrey R; Berchuck, Andrew; Lee, Janet M; Aben, Katja K H; Anton-Culver, Hoda; Antonenkova, Natalia; Bandera, Elisa V; Bean, Yukie; Beckmann, Matthias W; Bisogna, Maria; Bjorge, Line; Bogdanova, Natalia; Brinton, Louise A; Brooks-Wilson, Angela; Bruinsma, Fiona; Butzow, Ralf; Campbell, Ian G; Carty, Karen; Chang-Claude, Jenny; Chenevix-Trench, Georgia; Chen, Ann; Chen, Zhihua; Cook, Linda S; Cramer, Daniel W; Cunningham, Julie M; Cybulski, Cezary; Plisiecka-Halasa, Joanna; Dennis, Joe; Dicks, Ed; Doherty, Jennifer A; Dörk, Thilo; du Bois, Andreas; Eccles, Diana; Easton, Douglas T; Edwards, Robert P; Eilber, Ursula; Ekici, Arif B; Fasching, Peter A; Fridley, Brooke L; Gao, Yu-Tang; Gentry-Maharaj, Aleksandra; Giles, Graham G; Glasspool, Rosalind; Goode, Ellen L; Goodman, Marc T; Gronwald, Jacek; Harter, Philipp; Hasmad, Hanis Nazihah; Hein, Alexander; Heitz, Florian; Hildebrandt, Michelle A T; Hillemanns, Peter; Hogdall, Estrid; Hogdall, Claus; Hosono, Satoyo; Jakubowska, Anna; Paul, James; Jensen, Allan; Karlan, Beth Y; Kjaer, Susanne Kruger; Kelemen, Linda E; Kellar, Melissa; Kelley, Joseph L; Kiemeney, Lambertus A; Krakstad, Camilla; Lambrechts, Diether; Lambrechts, Sandrina; Le, Nhu D; Lee, Alice W; Cannioto, Rikki; Leminen, Arto; Lester, Jenny; Levine, Douglas A; Liang, Dong; Lissowska, Jolanta; Lu, Karen; Lubinski, Jan; Lundvall, Lene; Massuger, Leon F A G; Matsuo, Keitaro; McGuire, Valerie; McLaughlin, John R; Nevanlinna, Heli; McNeish, Iain; Menon, Usha; Modugno, Francesmary; Moysich, Kirsten B; Narod, Steven A; Nedergaard, Lotte; Ness, Roberta B; Noor Azmi, Mat Adenan; Odunsi, Kunle; Olson, Sara H; Orlow, Irene; Orsulic, Sandra; Pearce, Celeste L; Pejovic, Tanja; Pelttari, Liisa M; Permuth-Wey, Jennifer; Phelan, Catherine M; Pike, Malcolm C; Poole, Elizabeth M; Ramus, Susan J; Risch, Harvey A; Rosen, Barry; Rossing, Mary Anne; Rothstein, Joseph H; Rudolph, Anja; Runnebaum, Ingo B; Rzepecka, Iwona K; Salvesen, Helga B; Budzilowska, Agnieszka; Sellers, Thomas A; Shu, Xiao-Ou; Shvetsov, Yurii B; Siddiqui, Nadeem; Sieh, Weiva; Song, Honglin; Southey, Melissa C; Sucheston, Lara; Tangen, Ingvild L; Teo, Soo-Hwang; Terry, Kathryn L; Thompson, Pamela J; Timorek, Agnieszka; Tworoger, Shelley S; Van Nieuwenhuysen, Els; Vergote, Ignace; Vierkant, Robert A; Wang-Gohrke, Shan; Walsh, Christine; Wentzensen, Nicolas; Whittemore, Alice S; Wicklund, Kristine G; Wilkens, Lynne R; Woo, Yin-Ling; Wu, Xifeng; Wu, Anna H; Yang, Hannah; Zheng, Wei; Ziogas, Argyrios; Coetzee, Gerhard A; Freedman, Matthew L; Monteiro, Alvaro N A; Moes-Sosnowska, Joanna; Kupryjanczyk, Jolanta; Pharoah, Paul D; Gayther, Simon A; Schildkraut, Joellen M

    2015-11-01

    Genome-wide association studies have identified 20 genomic regions associated with risk of epithelial ovarian cancer (EOC), but many additional risk variants may exist. Here, we evaluated associations between common genetic variants [single nucleotide polymorphisms (SNPs) and indels] in DNA repair genes and EOC risk. We genotyped 2896 common variants at 143 gene loci in DNA samples from 15 397 patients with invasive EOC and controls. We found evidence of associations with EOC risk for variants at FANCA, EXO1, E2F4, E2F2, CREB5 and CHEK2 genes (P ≤ 0.001). The strongest risk association was for CHEK2 SNP rs17507066 with serous EOC (P = 4.74 × 10⁻⁷). Additional genotyping and imputation of genotypes from the 1000 genomes project identified a slightly more significant association for CHEK2 SNP rs6005807 (r² with rs17507066 = 0.84, odds ratio (OR) 1.17, 95% CI 1.11-1.24, P = 1.1 × 10⁻⁷). We identified 293 variants in the region with likelihood ratios of less than 1:100 for representing the causal variant. Functional annotation identified 25 candidate SNPs that alter transcription factor binding sites within regulatory elements active in EOC precursor tissues. In The Cancer Genome Atlas dataset, CHEK2 gene expression was significantly higher in primary EOCs compared to normal fallopian tube tissues (P = 3.72 × 10⁻⁸). We also identified an association between genotypes of the candidate causal SNP rs12166475 (r² = 0.99 with rs6005807) and CHEK2 expression (P = 2.70 × 10⁻⁸). These data suggest that common variants at 22q12.1 are associated with risk of serous EOC and point to CHEK2 as a plausible target susceptibility gene. © The Author 2015. Published by Oxford University Press.

  18. Penalized regression procedures for variable selection in the potential outcomes framework

    PubMed Central

    Ghosh, Debashis; Zhu, Yeying; Coffman, Donna L.

    2015-01-01

    A recent topic of much interest in causal inference is model selection. In this article, we describe a framework in which to consider penalized regression approaches to variable selection for causal effects. The framework leads to a simple ‘impute, then select’ class of procedures that is agnostic to the type of imputation algorithm as well as penalized regression used. It also clarifies how model selection involves a multivariate regression model for causal inference problems, and that these methods can be applied for identifying subgroups in which treatment effects are homogeneous. Analogies and links with the literature on machine learning methods, missing data and imputation are drawn. A difference LASSO algorithm is defined, along with its multiple imputation analogues. The procedures are illustrated using a well-known right heart catheterization dataset. PMID:25628185

  19. Missing value imputation: with application to handwriting data

    NASA Astrophysics Data System (ADS)

    Xu, Zhen; Srihari, Sargur N.

    2015-01-01

    Missing values make pattern analysis difficult, particularly with limited available data. In longitudinal research, missing values accumulate, thereby aggravating the problem. Here we consider how to deal with temporal data with missing values in handwriting analysis. In the task of studying development of individuality of handwriting, we encountered the fact that feature values are missing for several individuals at several time instances. Six algorithms, i.e., random imputation, mean imputation, most likely independent value imputation, and three methods based on Bayesian network (static Bayesian network, parameter EM, and structural EM), are compared with children's handwriting data. We evaluate the accuracy and robustness of the algorithms under different ratios of missing data and missing values, and useful conclusions are given. Specifically, static Bayesian network is used for our data which contain around 5% missing data to provide adequate accuracy and low computational cost.

  20. Multivariate missing data in hydrology - Review and applications

    NASA Astrophysics Data System (ADS)

    Ben Aissia, Mohamed-Aymen; Chebana, Fateh; Ouarda, Taha B. M. J.

    2017-12-01

    Water resources planning and management require complete data sets of a number of hydrological variables, such as flood peaks and volumes. However, hydrologists are often faced with the problem of missing data (MD) in hydrological databases. Several methods are used to deal with the imputation of MD. During the last decade, multivariate approaches have gained popularity in the field of hydrology, especially in hydrological frequency analysis (HFA). However, treating the MD remains neglected in the multivariate HFA literature whereas the focus has been mainly on the modeling component. For a complete analysis and in order to optimize the use of data, MD should also be treated in the multivariate setting prior to modeling and inference. Imputation of MD in the multivariate hydrological framework can have direct implications on the quality of the estimation. Indeed, the dependence between the series represents important additional information that can be included in the imputation process. The objective of the present paper is to highlight the importance of treating MD in multivariate hydrological frequency analysis by reviewing and applying multivariate imputation methods and by comparing univariate and multivariate imputation methods. An application is carried out for multiple flood attributes on three sites in order to evaluate the performance of the different methods based on the leave-one-out procedure. The results indicate that the performance of imputation methods can be improved by adopting the multivariate setting, compared to mean substitution and interpolation methods, especially when using the copula-based approach.
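
    A leave-one-out comparison of the two univariate baselines mentioned in this abstract, mean substitution and interpolation, can be sketched as below. The copula-based multivariate approach is not reproduced here. The function name, the use of mean absolute error as the score, and the sample series are all assumptions for illustration.

```python
def leave_one_out_errors(series):
    """Hide each interior value in turn, re-estimate it by (a) the mean of
    the remaining values and (b) linear interpolation between its two
    neighbours, and return the mean absolute error of each method."""
    n = len(series)
    if n < 3:
        raise ValueError("need at least three values")
    mean_err = 0.0
    interp_err = 0.0
    for k in range(1, n - 1):
        rest = series[:k] + series[k + 1:]            # series without point k
        mean_guess = sum(rest) / len(rest)
        interp_guess = (series[k - 1] + series[k + 1]) / 2
        mean_err += abs(series[k] - mean_guess)
        interp_err += abs(series[k] - interp_guess)
    count = n - 2                                     # number of hidden points
    return mean_err / count, interp_err / count


# Hypothetical flood-peak series, not data from the paper.
peaks = [1.0, 2.0, 4.0, 7.0, 11.0]

mean_mae, interp_mae = leave_one_out_errors(peaks)
print(mean_mae, interp_mae)
```

    End points are skipped because two-sided interpolation is undefined there; a multivariate method would instead borrow information from a dependent series at the same time step.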

  1. Use of Multiple Imputation to Estimate the Proportion of Respiratory Virus Detections Among Patients Hospitalized With Community-Acquired Pneumonia

    PubMed Central

    Bozio, Catherine H; Flanders, W Dana; Finelli, Lyn; Bramley, Anna M; Reed, Carrie; Gandhi, Neel R; Vidal, Jorge E; Erdman, Dean; Levine, Min Z; Lindstrom, Stephen; Ampofo, Krow; Arnold, Sandra R; Self, Wesley H; Williams, Derek J; Grijalva, Carlos G; Anderson, Evan J; McCullers, Jonathan A; Edwards, Kathryn M; Pavia, Andrew T; Wunderink, Richard G; Jain, Seema

    2018-01-01

    Background: Real-time polymerase chain reaction (PCR) on respiratory specimens and serology on paired blood specimens are used to determine the etiology of respiratory illnesses for research studies. However, convalescent serology is often not collected. We used multiple imputation to assign values for missing serology results to estimate virus-specific prevalence among pediatric and adult community-acquired pneumonia hospitalizations using data from an active population-based surveillance study. Methods: Presence of adenoviruses, human metapneumovirus, influenza viruses, parainfluenza virus types 1–3, and respiratory syncytial virus was defined by positive PCR on nasopharyngeal/oropharyngeal specimens or a 4-fold rise in paired serology. We performed multiple imputation by developing a multivariable regression model for each virus using data from patients with available serology results. We calculated absolute and relative differences in the proportion of each virus detected comparing the imputed to observed (nonimputed) results. Results: Among 2222 children and 2259 adults, 98.8% and 99.5% had nasopharyngeal/oropharyngeal specimens and 43.2% and 37.5% had paired serum specimens, respectively. Imputed results increased viral etiology assignments by an absolute difference of 1.6%–4.4% and 0.8%–2.8% in children and adults, respectively; relative differences were 1.1–3.0 times higher. Conclusions: Multiple imputation can be used when serology results are missing, to refine virus-specific prevalence estimates, and these will likely increase estimates.

  2. Substance dependence among those without symptoms of substance abuse in the World Mental Health Survey.

    PubMed

    Lago, Luise; Glantz, Meyer D; Kessler, Ronald C; Sampson, Nancy A; Al-Hamzawi, Ali; Florescu, Silvia; Moskalewicz, Jacek; Murphy, Sam; Navarro-Mateu, Fernando; Torres de Galvis, Yolanda; Viana, Maria Carmen; Xavier, Miguel; Degenhardt, Louisa

    2017-09-01

    The World Health Organization (WHO) World Mental Health (WMH) Survey Initiative uses the Composite International Diagnostic Interview (CIDI). The first 13 surveys only assessed substance dependence among respondents with a history of substance abuse; later surveys also assessed substance dependence without symptoms of abuse. We compared results across the two sets of surveys to assess implications of the revised logic and develop an imputation model for missing values of lifetime dependence in the earlier surveys. Lifetime dependence without symptoms of abuse was low in the second set of surveys (0.3% alcohol, 0.2% drugs). Regression-based imputation models were built in random half-samples of the new surveys and validated in the other half. There were minimal differences for imputed and actual reported cases in the validation dataset for age, gender and quantity; more mental disorders and days out of role were found in the imputed cases. Concordance between imputed and observed dependence cases in the full sample was high for alcohol [sensitivity 88.0%, specificity 99.8%, total classification accuracy (TCA) 99.5%, area under the curve (AUC) 0.94] and drug dependence (sensitivity 100.0%, specificity 99.8%, TCA 99.8%, AUC 1.00). This provides cross-national evidence of the small degree to which lifetime dependence occurs without symptoms of abuse. Imputation of substance dependence in the earlier WMH surveys improved estimates of dependence. Copyright © 2017 John Wiley & Sons, Ltd.
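
    The concordance measures quoted above (sensitivity, specificity and total classification accuracy) follow directly from a two-by-two comparison of imputed and observed cases. A minimal sketch, with a hypothetical function name and made-up labels rather than WMH survey data:

```python
def concordance(observed, imputed):
    """Compare binary imputed labels with observed ones and return
    sensitivity, specificity and total classification accuracy (TCA)."""
    if len(observed) != len(imputed) or not observed:
        raise ValueError("label lists must be non-empty and of equal length")
    pairs = list(zip(observed, imputed))
    tp = sum(1 for o, i in pairs if o == 1 and i == 1)  # true positives
    fn = sum(1 for o, i in pairs if o == 1 and i == 0)  # false negatives
    tn = sum(1 for o, i in pairs if o == 0 and i == 0)  # true negatives
    fp = sum(1 for o, i in pairs if o == 0 and i == 1)  # false positives
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need at least one observed case and one non-case")
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "tca": (tp + tn) / len(pairs),
    }


# Hypothetical labels: 1 = lifetime dependence, 0 = no dependence.
observed = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
imputed = [1, 1, 1, 0, 0, 0, 0, 0, 0, 1]

print(concordance(observed, imputed))
```

    When the outcome is rare, as with the 0.2% to 0.3% prevalence reported here, TCA is dominated by the true negatives, so sensitivity is the more informative of the three figures.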

  3. The Influence of Co-Morbidity and Other Health Measures on Dental and Medical Care Use among Medicare beneficiaries 2002

    PubMed Central

    Chen, Haiyan; Moeller, John; Manski, Richard J.

    2011-01-01

    Objective: To assess the impact of co-morbidity and other health measures on the use of dental and medical care services among the community-based Medicare population with data from the 2002 Medicare Current Beneficiary Survey. Methods: A co-morbidity index is the main independent variable of our study. It includes oral cancer as a co-morbidity condition and was developed from Medicare claims data. The two outcome variables indicate whether a beneficiary had a dental visit during the year and whether the beneficiary had an inpatient hospital stay during the year. Logistic regressions estimated the relationship between the outcome variables and co-morbidity after controlling for other explanatory variables. Results: High scores on the co-morbidity index, high numbers of self-reported physical limitations, and fair or poor self-reported health status were correlated with higher hospital use and lower dental care utilization. Similar results were found for other types of medical care including medical provider visits, outpatient care, and prescription drugs. A multiple imputation technique was used for the approximate 20% of the sample with missing claims, but the resulting co-morbidity index performed no differently than the index constructed without imputation. Conclusions: Co-morbidities and other health status measures are theorized to play either a predisposing or need role in determining health care utilization. The study’s findings confirm the dominant role of these measures as predisposing factors limiting access to dental care for Medicare beneficiaries and as need factors producing higher levels of inpatient hospital and other medical care for Medicare beneficiaries. PMID:21972460

  4. Biomechanisms of Comorbidity: Reviewing Integrative Analyses of Multi-omics Datasets and Electronic Health Records.

    PubMed

    Pouladi, N; Achour, I; Li, H; Berghout, J; Kenost, C; Gonzalez-Garay, M L; Lussier, Y A

    2016-11-10

    Disease comorbidity is a pervasive phenomenon impacting patients' health outcomes, disease management, and clinical decisions. This review presents past, current and future research directions leveraging both phenotypic and molecular information to uncover disease similarity underpinning the biology and etiology of disease comorbidity. We retrieved ~130 publications and retained 59, ranging from 2006 to 2015, that comprise a minimum number of five diseases and at least one type of biomolecule. We surveyed their methods, disease similarity metrics, and calculation of comorbidities in the electronic health records, if present. Among the surveyed studies, 44% generated or validated disease similarity metrics in context of comorbidity, with 60% being published in the last two years. As inputs, 87% of studies utilized intragenic loci and proteins while 13% employed RNA (mRNA, LncRNA or miRNA). Network modeling was predominantly used (35%) followed by statistics (28%) to impute similarity between these biomolecules and diseases. Studies with large numbers of biomolecules and diseases used network models or naïve overlap of disease-molecule associations, while machine learning, statistics, and information retrieval were utilized in smaller and moderate sized studies. Multiscale computations comprising shared function, network topology, and phenotypes were performed exclusively on proteins. This review highlighted the growing methods for identifying the molecular mechanisms underpinning comorbidities that leverage multiscale molecular information and patterns from electronic health records. The survey unveiled that intergenic polymorphisms have been overlooked for similarity imputation compared to their intragenic counterparts, offering new opportunities to bridge the mechanistic and similarity gaps of comorbidity.

  5. Multiple imputation as one tool to provide longitudinal databases for modelling human height and weight development.

    PubMed

    Aßmann, C

    2016-06-01

    Besides large efforts regarding field work, provision of valid databases requires statistical and informational infrastructure to enable long-term access to longitudinal data sets on height, weight and related issues. To foster use of longitudinal data sets within the scientific community, provision of valid databases has to address data-protection regulations. It is, therefore, of major importance to hinder identifiability of individuals from publicly available databases. To reach this goal, one possible strategy is to provide a synthetic database to the public allowing for pretesting strategies for data analysis. The synthetic databases can be established using multiple imputation tools. Given the approval of the strategy, verification is based on the original data. Multiple imputation by chained equations is illustrated to facilitate provision of synthetic databases as it allows for capturing a wide range of statistical interdependencies. Also missing values, typically occurring within longitudinal databases for reasons of item non-response, can be addressed via multiple imputation when providing databases. The provision of synthetic databases using multiple imputation techniques is one possible strategy to ensure data protection, increase visibility of longitudinal databases and enhance the analytical potential.

  6. Missing value imputation strategies for metabolomics data.

    PubMed

    Armitage, Emily Grace; Godzien, Joanna; Alonso-Herranz, Vanesa; López-Gonzálvez, Ángeles; Barbas, Coral

    2015-12-01

    Missing values can arise for different reasons and, depending on their origin, should be considered and dealt with in different ways. In this research, four methods of imputation have been compared with respect to their effects on the normality and variance of data, on statistical significance and on the approximation of a suitable threshold to accept missing data as truly missing. Additionally, the effects of different strategies for controlling familywise error rate or false discovery and how they work with the different strategies for missing value imputation have been evaluated. Missing values were found to affect normality and variance of data and k-means nearest neighbor imputation was the best method tested for restoring this. Bonferroni correction was the best method for maximizing true positives and minimizing false positives and it was observed that as low as 40% missing data could be truly missing. The range between 40 and 70% missing values was defined as a "gray area" and therefore a strategy has been proposed that provides a balance between the optimal imputation strategy that was k-means nearest neighbor and the best approximation of positioning real zeros. © 2015 WILEY-VCH Verlag GmbH & Co. KGaA, Weinheim.

  7. Missing Value Imputation Approach for Mass Spectrometry-based Metabolomics Data.

    PubMed

    Wei, Runmin; Wang, Jingye; Su, Mingming; Jia, Erik; Chen, Shaoqiu; Chen, Tianlu; Ni, Yan

    2018-01-12

    Missing values exist widely in mass-spectrometry (MS) based metabolomics data. Various methods have been applied for handling missing values, but the selection can significantly affect following data analyses. Typically, there are three types of missing values, missing not at random (MNAR), missing at random (MAR), and missing completely at random (MCAR). Our study comprehensively compared eight imputation methods (zero, half minimum (HM), mean, median, random forest (RF), singular value decomposition (SVD), k-nearest neighbors (kNN), and quantile regression imputation of left-censored data (QRILC)) for different types of missing values using four metabolomics datasets. Normalized root mean squared error (NRMSE) and NRMSE-based sum of ranks (SOR) were applied to evaluate imputation accuracy. Principal component analysis (PCA)/partial least squares (PLS)-Procrustes analysis were used to evaluate the overall sample distribution. Student's t-test followed by correlation analysis was conducted to evaluate the effects on univariate statistics. Our findings demonstrated that RF performed the best for MCAR/MAR and QRILC was the favored one for left-censored MNAR. Finally, we proposed a comprehensive strategy and developed a public-accessible web-tool for the application of missing value imputation in metabolomics ( https://metabolomics.cc.hawaii.edu/software/MetImp/ ).
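
    The accuracy measure named in this abstract can be illustrated with a small sketch. It assumes one common definition of NRMSE (root mean squared error over the masked entries, normalised by the variance of the true values); the function names and the toy numbers are hypothetical and are not taken from the study or from the MetImp tool.

```python
import math
import statistics


def nrmse(true_vals, imputed_vals):
    """Normalised RMSE between held-out true values and their imputed values.

    Assumed definition: sqrt(mean squared error / variance of true values).
    """
    if len(true_vals) != len(imputed_vals) or len(true_vals) < 2:
        raise ValueError("need two equally long sequences with >= 2 values")
    mse = sum((t - i) ** 2 for t, i in zip(true_vals, imputed_vals)) / len(true_vals)
    return math.sqrt(mse / statistics.pvariance(true_vals))


def impute_mean(observed, n_missing):
    """Fill every missing entry with the mean of the observed values."""
    return [statistics.mean(observed)] * n_missing


def impute_half_minimum(observed, n_missing):
    """Fill every missing entry with half of the smallest observed value (HM)."""
    return [min(observed) / 2.0] * n_missing


# Toy feature: the two lowest intensities are masked, mimicking left-censoring.
hidden_truth = [2.0, 4.0]
observed = [6.0, 8.0, 10.0]

for name, method in [("mean", impute_mean), ("half-minimum", impute_half_minimum)]:
    score = nrmse(hidden_truth, method(observed, len(hidden_truth)))
    print(f"{name}: NRMSE = {score:.3f}")
```

    In this toy left-censored case the half-minimum fill scores lower (better) than the mean fill, which only illustrates how such comparisons are scored and says nothing about the ranking reported in the study.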

  8. Whole-genome sequence-based genomic prediction in laying chickens with different genomic relationship matrices to account for genetic architecture.

    PubMed

    Ni, Guiyan; Cavero, David; Fangmann, Anna; Erbe, Malena; Simianer, Henner

    2017-01-16

    With the availability of next-generation sequencing technologies, genomic prediction based on whole-genome sequencing (WGS) data is now feasible in animal breeding schemes and was expected to lead to higher predictive ability, since such data may contain all genomic variants including causal mutations. Our objective was to compare prediction ability with high-density (HD) array data and WGS data in a commercial brown layer line with genomic best linear unbiased prediction (GBLUP) models using various approaches to weight single nucleotide polymorphisms (SNPs). A total of 892 chickens from a commercial brown layer line were genotyped with 336 K segregating SNPs (array data) that included 157 K genic SNPs (i.e. SNPs in or around a gene). For these individuals, genome-wide sequence information was imputed based on data from re-sequencing runs of 25 individuals, leading to 5.2 million (M) imputed SNPs (WGS data), including 2.6 M genic SNPs. De-regressed proofs (DRP) for eggshell strength, feed intake and laying rate were used as quasi-phenotypic data in genomic prediction analyses. Four weighting factors for building a trait-specific genomic relationship matrix were investigated: identical weights, -log10(P) from genome-wide association study results, squares of SNP effects from random regression BLUP, and variable selection based weights (known as BLUP|GA). Predictive ability was measured as the correlation between DRP and direct genomic breeding values in five replications of a fivefold cross-validation. Averaged over the three traits, the highest predictive ability (0.366 ± 0.075) was obtained when only genic SNPs from WGS data were used. Predictive abilities with genic SNPs and all SNPs from HD array data were 0.361 ± 0.072 and 0.353 ± 0.074, respectively. Prediction with -log10(P) or squares of SNP effects as weighting factors for building a genomic relationship matrix or BLUP|GA did not increase accuracy, compared to that with identical weights, regardless of the SNP set used. Our results show that little or no benefit was gained when using all imputed WGS data to perform genomic prediction compared to using HD array data regardless of the weighting factors tested. However, using only genic SNPs from WGS data had a positive effect on prediction ability.
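
    As a rough illustration of the weighting idea in this abstract, the sketch below builds a SNP-weighted genomic relationship matrix from 0/1/2 genotype codes. The function name weighted_grm, the toy genotypes and the choice to fold the weights into the scaling denominator are assumptions made for illustration, not the authors' implementation; with identical weights the formula reduces to the widely used G = ZZ' / (2 Σ p(1 - p)).

```python
def weighted_grm(genotypes, weights=None):
    """Return an n x n relationship matrix from an n x m genotype table.

    genotypes: list of rows, each a list of allele counts (0, 1 or 2).
    weights:   one non-negative weight per SNP; None means identical weights.
    """
    n, m = len(genotypes), len(genotypes[0])
    if weights is None:
        weights = [1.0] * m
    # Allele frequency of the counted allele at each SNP.
    freqs = [sum(row[j] for row in genotypes) / (2.0 * n) for j in range(m)]
    # Centre each genotype by twice the allele frequency.
    centred = [[row[j] - 2.0 * freqs[j] for j in range(m)] for row in genotypes]
    # Assumed scaling: weighted sum of expected heterozygosities.
    scale = sum(2.0 * p * (1.0 - p) * w for p, w in zip(freqs, weights))
    if scale == 0.0:
        raise ValueError("all SNPs are monomorphic or carry zero weight")
    return [
        [
            sum(centred[a][j] * weights[j] * centred[b][j] for j in range(m)) / scale
            for b in range(n)
        ]
        for a in range(n)
    ]


toy_genotypes = [[0, 2], [2, 0]]          # two animals, two SNPs
print(weighted_grm(toy_genotypes))        # identical weights
print(weighted_grm(toy_genotypes, [1.0, 3.0]))  # second SNP up-weighted
```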

  9. Bayesian whole-genome prediction and genome-wide association analysis with missing genotypes using variable selection

    USDA-ARS's Scientific Manuscript database

    Single-step Genomic Best Linear Unbiased Predictor (ssGBLUP) has become increasingly popular for whole-genome prediction (WGP) modeling as it utilizes any available pedigree and phenotypes on both genotyped and non-genotyped individuals. The WGP accuracy of ssGBLUP has been demonstrated to be greate...

  10. Imputing forest carbon stock estimates from inventory plots to a nationally continuous coverage

    PubMed Central

    2013-01-01

    The U.S. has been providing national-scale estimates of forest carbon (C) stocks and stock change to meet United Nations Framework Convention on Climate Change (UNFCCC) reporting requirements for years. Although these currently are provided as national estimates by pool and year to meet greenhouse gas monitoring requirements, there is growing need to disaggregate these estimates to finer scales to enable strategic forest management and monitoring activities focused on various ecosystem services such as C storage enhancement. Through application of a nearest-neighbor imputation approach, spatially extant estimates of forest C density were developed for the conterminous U.S. using the U.S.’s annual forest inventory. Results suggest that an existing forest inventory plot imputation approach can be readily modified to provide raster maps of C density across a range of pools (e.g., live tree to soil organic carbon) and spatial scales (e.g., sub-county to biome). Comparisons among imputed maps indicate strong regional differences across C pools. The C density of pools closely related to detrital input (e.g., dead wood) is often highest in forests suffering from recent mortality events such as those in the northern Rocky Mountains (e.g., beetle infestations). In contrast, live tree carbon density is often highest on the highest quality forest sites such as those found in the Pacific Northwest. Validation results suggest strong agreement between the estimates produced from the forest inventory plots and those from the imputed maps, particularly when the C pool is closely associated with the imputation model (e.g., aboveground live biomass and live tree basal area), with weaker agreement for detrital pools (e.g., standing dead trees). Forest inventory imputed plot maps provide an efficient and flexible approach to monitoring diverse C pools at national (e.g., UNFCCC) and regional scales (e.g., Reducing Emissions from Deforestation and Forest Degradation projects) while allowing timely incorporation of empirical data (e.g., annual forest inventory). PMID:23305341

  11. Working with Missing Values

    ERIC Educational Resources Information Center

    Acock, Alan C.

    2005-01-01

    Less than optimum strategies for missing values can produce biased estimates, distorted statistical power, and invalid conclusions. After reviewing traditional approaches (listwise, pairwise, and mean substitution), selected alternatives are covered including single imputation, multiple imputation, and full information maximum likelihood…

  12. Prediction of regulatory gene pairs using dynamic time warping and gene ontology.

    PubMed

    Yang, Andy C; Hsu, Hui-Huang; Lu, Ming-Da; Tseng, Vincent S; Shih, Timothy K

    2014-01-01

    Selecting informative genes is the most important task for data analysis on microarray gene expression data. In this work, we aim at identifying regulatory gene pairs from microarray gene expression data. However, microarray data often contain multiple missing expression values. Missing value imputation is thus needed before further processing for regulatory gene pairs becomes possible. We develop a novel approach to first impute missing values in microarray time series data by combining k-Nearest Neighbour (KNN), Dynamic Time Warping (DTW) and Gene Ontology (GO). After missing values are imputed, we then perform gene regulation prediction based on our proposed DTW-GO distance measurement of gene pairs. Experimental results show that our approach is more accurate when compared with existing missing value imputation methods on real microarray data sets. Furthermore, our approach can also discover more regulatory gene pairs that are known in the literature than other methods.
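
    For readers unfamiliar with Dynamic Time Warping, the following is a minimal sketch of the classic DTW distance between two expression time series. It shows only plain DTW with an absolute-difference local cost; the abstract does not specify how the authors combine DTW with Gene Ontology or kNN, so none of that is reproduced here, and the function name is hypothetical.

```python
def dtw_distance(series_a, series_b):
    """Classic dynamic-programming DTW distance between two numeric sequences."""
    n, m = len(series_a), len(series_b)
    inf = float("inf")
    # cost[i][j] = cheapest alignment of the first i points of a with the first j of b.
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(series_a[i - 1] - series_b[j - 1])
            cost[i][j] = local + min(
                cost[i - 1][j],      # advance in series_a only
                cost[i][j - 1],      # advance in series_b only
                cost[i - 1][j - 1],  # advance in both
            )
    return cost[n][m]


# Same profile, but the second series lingers one extra time point at level 2.
print(dtw_distance([1.0, 2.0, 3.0], [1.0, 2.0, 2.0, 3.0]))  # 0.0
```

    Because the warping path can stretch one series against the other, two genes with the same expression profile shifted or stretched in time receive a small distance, which is the property that makes DTW attractive for choosing neighbours in time series data.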

  13. CMPK1 and RBP3 are associated with corneal curvature in Asian populations.

    PubMed

    Chen, Peng; Miyake, Masahiro; Fan, Qiao; Liao, Jiemin; Yamashiro, Kenji; Ikram, Mohammad K; Chew, Merywn; Vithana, Eranga N; Khor, Chiea-Chuen; Aung, Tin; Tai, E-Shyong; Wong, Tien-Yin; Teo, Yik-Ying; Yoshimura, Nagahisa; Saw, Seang-Mei; Cheng, Ching-Yu

    2014-11-15

    Corneal curvature (CC) measures the steepness of the cornea and is an important parameter for clinical diseases such as astigmatism and myopia. Despite the high heritability of CC, only two associated genes have been discovered to date. We performed a three-stage genome-wide association study meta-analysis in 12 660 Asian individuals. Our Stage 1 was done in multiethnic cohorts comprising 7440 individuals, followed by a Stage 2 replication in 2473 Chinese and Stage 3 in 2747 Japanese. The SNP array genotype data were imputed up to the 1000 Genomes Project Phase 1 cosmopolitan panel. The SNP association with the radii of CC was investigated in the linear regression model with the adjustment of age, gender and principal components. In addition to the known genes, MTOR (also known as FRAP1) and PDGFRA, we discovered two novel genes associated with CC: CMPK1 (rs17103186, P = 3.3 × 10^-12) and RBP3 (rs11204213 [Val884Met], P = 1.1 × 10^-13). The missense RBP3 SNP, rs11204213, was also associated with axial length (AL) (P = 4.2 × 10^-6) and had larger effects on both CC and AL compared with other SNPs. The index SNPs at the four indicated loci explained 1.9% of CC variance across the Stages 1 and 2 cohorts, while 33.8% of CC variance was explained by the genome-wide imputation data. We identified two novel genes influencing CC, which are related to either corneal shape or eye size. This study provides additional insights into genetic architecture of corneal shape. © The Author 2014. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oup.com.

  14. Circadian CLOCK gene polymorphisms in relation to sleep patterns and obesity in African Americans: findings from the Jackson heart study.

    PubMed

    Riestra, Pia; Gebreab, Samson Y; Xu, Ruihua; Khan, Rumana J; Gaye, Amadou; Correa, Adolfo; Min, Nancy; Sims, Mario; Davis, Sharon K

    2017-06-23

    Circadian rhythms regulate key biological processes and the dysregulation of the intrinsic clock mechanism affects sleep patterns and obesity onset. The CLOCK (circadian locomotor output cycles protein kaput) gene encodes a core transcription factor of the molecular circadian clock influencing diverse metabolic pathways, including glucose and lipid homeostasis. The primary objective of this study was to evaluate the associations between CLOCK single nucleotide polymorphisms (SNPs) and body mass index (BMI). We also evaluated the association of SNPs with BMI related factors such as sleep duration and quality, adiponectin and leptin, in 2962 participants (1116 men and 1810 women) from the Jackson Heart Study. Genotype data for the selected 23 CLOCK gene SNPs were obtained by imputation with IMPUTE2 software and reference phase data from the 1000 Genomes Project. Genetic analyses were conducted with PLINK. Results: We found a significant association between the CLOCK SNP rs2070062 and sleep duration; carriers of the T allele showed significantly shorter sleep duration than non-carriers after adjustment for individual proportions of European ancestry (PEA), socioeconomic status (SES), body mass index (BMI), alcohol consumption and smoking status, an association that reached the significance threshold after multiple testing correction. In addition, we found nominal associations of the CLOCK SNP rs6853192 with longer sleep duration and of rs6820823, rs3792603 and rs11726609 with BMI. However, these associations did not reach the significance threshold after correction for multiple testing. In this work, CLOCK gene variants were associated with sleep duration and BMI, suggesting that the effects of these polymorphisms on circadian rhythmicity may affect sleep duration and body weight regulation in African Americans.

  15. Blood lead levels, iron metabolism gene polymorphisms and homocysteine: a gene-environment interaction study.

    PubMed

    Kim, Kyoung-Nam; Lee, Mee-Ri; Lim, Youn-Hee; Hong, Yun-Chul

    2017-12-01

    Homocysteine has been causally associated with various adverse health outcomes. Evidence supporting the relationship between lead and homocysteine levels has been accumulating, but most prior studies have not focused on the interaction with genetic polymorphisms. From a community-based prospective cohort, we analysed 386 participants (aged 41-71 years) with information regarding blood lead and plasma homocysteine levels. Blood lead levels were measured between 2001 and 2003, and plasma homocysteine levels were measured in 2007. Interactions of lead levels with 42 genotyped single-nucleotide polymorphisms (SNPs) in five genes (TF, HFE, CBS, BHMT and MTR) were assessed via a 2-degree of freedom (df) joint test and a 1-df interaction test. In secondary analyses using imputation, we further assessed 58 imputed SNPs in the TF and MTHFR genes. Blood lead concentrations were positively associated with plasma homocysteine levels (p=0.0276). Six SNPs in the TF and MTR genes were screened using the 2-df joint test, and among them, three SNPs in the TF gene showed interactions with lead with respect to homocysteine levels through the 1-df interaction test (p<0.0083). Seven SNPs in the MTHFR gene were associated with homocysteine levels at an α-level of 0.05, but the associations did not persist after Bonferroni correction. These SNPs did not show interactions with lead levels. Blood lead levels were positively associated with plasma homocysteine levels measured 4-6 years later, and three SNPs in the TF gene modified the association. © Article author(s) (or their employer(s) unless otherwise stated in the text of the article) 2017. All rights reserved. No commercial use is permitted unless otherwise expressly granted.

  16. Novel Loci Associated with PR Interval in a Genome-Wide Association Study of Ten African American Cohorts

    PubMed Central

    Butler, Anne M.; Yin, Xiaoyan; Evans, Daniel S.; Nalls, Michael A.; Smith, Erin N.; Tanaka, Toshiko; Li, Guo; Buxbaum, Sarah G.; Whitsel, Eric A.; Alonso, Alvaro; Arking, Dan E.; Benjamin, Emelia J.; Berenson, Gerald S.; Bis, Josh C.; Chen, Wei; Deo, Rajat; Ellinor, Patrick T.; Heckbert, Susan R.; Heiss, Gerardo; Hsueh, Wen-Chi; Keating, Brendan J.; Kerr, Kathleen F.; Li, Yun; Limacher, Marian C.; Liu, Yongmei; Lubitz, Steven A.; Marciante, Kristin D.; Mehra, Reena; Meng, Yan A.; Newman, Anne B.; Newton-Cheh, Christopher; North, Kari E.; Palmer, Cameron D.; Psaty, Bruce M.; Quibrera, P. Miguel; Redline, Susan; Reiner, Alex P.; Rotter, Jerome I.; Schnabel, Renate B.; Schork, Nicholas J.; Singleton, Andrew B.; Smith, J. Gustav; Soliman, Elsayed Z.; Srinivasan, Sathanur R.; Zhang, Zhu-ming; Zonderman, Alan B.; Ferrucci, Luigi; Murray, Sarah S.; Evans, Michele K.; Sotoodehnia, Nona; Magnani, Jared W.; Avery, Christy L.

    2013-01-01

    Background The PR interval (PR) as measured by the resting, standard 12-lead electrocardiogram (ECG) reflects the duration of atrial/atrioventricular nodal depolarization. Substantial evidence exists for a genetic contribution to PR, including genome-wide association studies that have identified common genetic variants at nine loci influencing PR in populations of European and Asian descent. However, few studies have examined loci associated with PR in African Americans. Methods and Results We present results from the largest genome-wide association study to date of PR in 13,415 adults of African descent from ten cohorts. We tested for association between PR (ms) and approximately 2.8 million genotyped and imputed single nucleotide polymorphisms. Imputation was performed using HapMap 2 YRI and CEU panels. Study-specific results, adjusted for global ancestry and clinical correlates of PR, were meta-analyzed using the inverse variance method. Variation in genome-wide test statistic distributions was noted within studies (lambda range: 0.9–1.1), although not after genomic control correction was applied to the overall meta-analysis (lambda: 1.008). In addition to generalizing previously reported associations with MEIS1, SCN5A, ARHGAP24, CAV1, and TBX5 to African American populations at the genome-wide significance level (P < 5.0 × 10^-8), we also identified a novel locus: ITGA9, located in a region previously implicated in SCN5A expression. The 3p21 region harboring SCN5A also contained two additional independent secondary signals influencing PR (P < 5.0 × 10^-8). Conclusions This study demonstrates the ability to map novel loci in African Americans as well as the generalizability of loci associated with PR across populations of African, European and Asian descent. PMID:23139255
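
    The two statistical steps named in this abstract, inverse variance meta-analysis and the genomic control lambda, can be sketched as follows. This is a minimal fixed-effect illustration under standard textbook definitions; the function names and the per-study numbers are hypothetical and do not come from the ten cohorts.

```python
import math
import statistics

# Median of a chi-square distribution with 1 degree of freedom.
CHI2_1DF_MEDIAN = 0.4549364


def inverse_variance_meta(betas, standard_errors):
    """Fixed-effect pooled estimate: each study is weighted by 1 / SE^2."""
    weights = [1.0 / se ** 2 for se in standard_errors]
    total_weight = sum(weights)
    pooled_beta = sum(w * b for w, b in zip(weights, betas)) / total_weight
    pooled_se = math.sqrt(1.0 / total_weight)
    return pooled_beta, pooled_se, pooled_beta / pooled_se


def genomic_control_lambda(z_scores):
    """Inflation factor: median observed chi-square over its expected median."""
    chi2 = [z ** 2 for z in z_scores]
    return statistics.median(chi2) / CHI2_1DF_MEDIAN


# Hypothetical effect of one SNP on PR (ms per allele) in two studies.
beta, se, z = inverse_variance_meta([0.10, 0.20], [0.05, 0.10])
print(f"pooled beta = {beta:.3f}, SE = {se:.4f}, z = {z:.2f}")
```

    The more precise study (smaller standard error) dominates the pooled estimate. A lambda near 1, such as the 1.008 reported above, indicates that the test statistics are not inflated overall.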

  17. Common and rare genetic markers of lipid variation in subjects with type 2 diabetes from the ACCORD clinical trial.

    PubMed

    Marvel, Skylar W; Rotroff, Daniel M; Wagner, Michael J; Buse, John B; Havener, Tammy M; McLeod, Howard L; Motsinger-Reif, Alison A

    2017-01-01

    Individuals with type 2 diabetes are at an increased risk of cardiovascular disease. Alterations in circulating lipid levels, total cholesterol (TC), low-density lipoprotein (LDL), high-density lipoprotein (HDL), and triglycerides (TG) are heritable risk factors for cardiovascular disease. Here we conduct a genome-wide association study (GWAS) of common and rare variants to investigate associations with baseline lipid levels in 7,844 individuals with type 2 diabetes from the ACCORD clinical trial. DNA extracted from stored blood samples from ACCORD participants was genotyped using the Affymetrix Axiom Biobank 1 Genotyping Array. After quality control and genotype imputation, association of common genetic variants (CV), defined as minor allele frequency (MAF) ≥ 3%, with baseline levels of TC, LDL, HDL, and TG was tested using a linear model. Rare variant (RV) associations (MAF < 3%) were conducted using a suite of methods that collapse multiple RV within individual genes. Many statistically significant CV (p < 1 × 10^-8) replicate findings in large meta-analyses in non-diabetic subjects. RV analyses also confirmed findings in other studies, whereas significant RV associations with CNOT2, HPN-AS1, and SIRPD appear to be novel (q < 0.1). Here we present findings for the largest GWAS of lipid levels in people with type 2 diabetes to date. We identified 17 statistically significant (p < 1 × 10^-8) associations of CV with lipid levels in 11 genes or chromosomal regions, all of which were previously identified in meta-analyses of mostly non-diabetic cohorts. We also identified 13 associations in 11 genes based on RV, several of which represent novel findings.
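
    The common/rare split used in this abstract rests on the minor allele frequency. The sketch below shows one way to compute MAF from 0/1/2 genotype codes and apply the 3% cut-off; the function names, the use of None for missing calls and the toy genotypes are assumptions for illustration only.

```python
def minor_allele_frequency(genotypes):
    """MAF from allele counts (0, 1 or 2 copies); None marks a missing call."""
    called = [g for g in genotypes if g is not None]
    if not called:
        raise ValueError("no called genotypes")
    counted_allele_freq = sum(called) / (2.0 * len(called))
    # The minor allele is whichever of the two alleles is less frequent.
    return min(counted_allele_freq, 1.0 - counted_allele_freq)


def variant_class(genotypes, threshold=0.03):
    """Label a variant 'common' (MAF >= threshold) or 'rare' (MAF < threshold)."""
    return "common" if minor_allele_frequency(genotypes) >= threshold else "rare"


one_carrier_in_100 = [0] * 99 + [1]   # MAF = 1 / 200 = 0.005
print(variant_class(one_carrier_in_100))   # rare
print(variant_class([0, 1, 2, None]))      # common (MAF = 0.5)
```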

  18. Common and rare genetic markers of lipid variation in subjects with type 2 diabetes from the ACCORD clinical trial

    PubMed Central

    Wagner, Michael J.; Buse, John B.; Havener, Tammy M.; McLeod, Howard L.

    2017-01-01

    Background Individuals with type 2 diabetes are at an increased risk of cardiovascular disease. Alterations in circulating lipid levels, total cholesterol (TC), low-density lipoprotein (LDL), high-density lipoprotein (HDL), and triglycerides (TG) are heritable risk factors for cardiovascular disease. Here we conduct a genome-wide association study (GWAS) of common and rare variants to investigate associations with baseline lipid levels in 7,844 individuals with type 2 diabetes from the ACCORD clinical trial. Methods DNA extracted from stored blood samples from ACCORD participants was genotyped using the Affymetrix Axiom Biobank 1 Genotyping Array. After quality control and genotype imputation, association of common genetic variants (CV), defined as minor allele frequency (MAF) ≥ 3%, with baseline levels of TC, LDL, HDL, and TG was tested using a linear model. Rare variant (RV) associations (MAF < 3%) were conducted using a suite of methods that collapse multiple RV within individual genes. Results Many statistically significant CV (p < 1 × 10^-8) replicate findings in large meta-analyses in non-diabetic subjects. RV analyses also confirmed findings in other studies, whereas significant RV associations with CNOT2, HPN-AS1, and SIRPD appear to be novel (q < 0.1). Discussion Here we present findings for the largest GWAS of lipid levels in people with type 2 diabetes to date. We identified 17 statistically significant (p < 1 × 10^-8) associations of CV with lipid levels in 11 genes or chromosomal regions, all of which were previously identified in meta-analyses of mostly non-diabetic cohorts. We also identified 13 associations in 11 genes based on RV, several of which represent novel findings. PMID:28480134

  19. Fine-Mapping of the 1p11.2 Breast Cancer Susceptibility Locus

    PubMed Central

    Horne, Hisani N.; Chung, Charles C.; Zhang, Han; Yu, Kai; Prokunina-Olsson, Ludmila; Michailidou, Kyriaki; Bolla, Manjeet K.; Wang, Qin; Dennis, Joe; Hopper, John L.; Southey, Melissa C.; Schmidt, Marjanka K.; Broeks, Annegien; Muir, Kenneth; Lophatananon, Artitaya; Fasching, Peter A.; Beckmann, Matthias W.; Fletcher, Olivia; Johnson, Nichola; Sawyer, Elinor J.; Tomlinson, Ian; Burwinkel, Barbara; Marme, Frederik; Guénel, Pascal; Truong, Thérèse; Bojesen, Stig E.; Flyger, Henrik; Benitez, Javier; González-Neira, Anna; Anton-Culver, Hoda; Neuhausen, Susan L.; Brenner, Hermann; Arndt, Volker; Meindl, Alfons; Schmutzler, Rita K.; Brauch, Hiltrud; Hamann, Ute; Nevanlinna, Heli; Khan, Sofia; Matsuo, Keitaro; Iwata, Hiroji; Dörk, Thilo; Bogdanova, Natalia V.; Lindblom, Annika; Margolin, Sara; Mannermaa, Arto; Kosma, Veli-Matti; Chenevix-Trench, Georgia; Wu, Anna H.; ven den Berg, David; Smeets, Ann; Zhao, Hui; Chang-Claude, Jenny; Rudolph, Anja; Radice, Paolo; Barile, Monica; Couch, Fergus J.; Vachon, Celine; Giles, Graham G.; Milne, Roger L.; Haiman, Christopher A.; Marchand, Loic Le; Goldberg, Mark S.; Teo, Soo H.; Taib, Nur A. M.; Kristensen, Vessela; Borresen-Dale, Anne-Lise; Zheng, Wei; Shrubsole, Martha; Winqvist, Robert; Jukkola-Vuorinen, Arja; Andrulis, Irene L.; Knight, Julia A.; Devilee, Peter; Seynaeve, Caroline; García-Closas, Montserrat; Czene, Kamila; Darabi, Hatef; Hollestelle, Antoinette; Martens, John W. M.; Li, Jingmei; Lu, Wei; Shu, Xiao-Ou; Cox, Angela; Cross, Simon S.; Blot, William; Cai, Qiuyin; Shah, Mitul; Luccarini, Craig; Baynes, Caroline; Harrington, Patricia; Kang, Daehee; Choi, Ji-Yeob; Hartman, Mikael; Chia, Kee Seng; Kabisch, Maria; Torres, Diana; Jakubowska, Anna; Lubinski, Jan; Sangrajrang, Suleeporn; Brennan, Paul; Slager, Susan; Yannoukakos, Drakoulis; Shen, Chen-Yang; Hou, Ming-Feng; Swerdlow, Anthony; Orr, Nick; Simard, Jacques; Hall, Per; Pharoah, Paul D. P.

    2016-01-01

    The Cancer Genetic Markers of Susceptibility genome-wide association study (GWAS) originally identified a single nucleotide polymorphism (SNP) rs11249433 at 1p11.2 associated with breast cancer risk. To fine-map this locus, we genotyped 92 SNPs in a 900kb region (120,505,799–121,481,132) flanking rs11249433 in 45,276 breast cancer cases and 48,998 controls of European, Asian and African ancestry from 50 studies in the Breast Cancer Association Consortium. Genotyping was done using iCOGS, a custom-built array. Due to the complicated nature of the region on chr1p11.2: 120,300,000–120,505,798, that lies near the centromere and contains seven duplicated genomic segments, we restricted analyses to 429 SNPs excluding the duplicated regions (42 genotyped and 387 imputed). Per-allelic associations with breast cancer risk were estimated using logistic regression models adjusting for study and ancestry-specific principal components. The strongest association observed was with the originally identified index SNP rs11249433 (minor allele frequency (MAF) 0.402; per-allele odds ratio (OR) = 1.10, 95% confidence interval (CI) 1.08–1.13, P = 1.49 × 10^-21). The association for rs11249433 was limited to ER-positive breast cancers (test for heterogeneity P ≤ 8.41 × 10^-5). Additional analyses by other tumor characteristics showed stronger associations with moderately/well differentiated tumors and tumors of lobular histology. Although no significant eQTL associations were observed, in silico analyses showed that rs11249433 was located in a region that is likely a weak enhancer/promoter. Fine-mapping analysis of the 1p11.2 breast cancer susceptibility locus confirms that the risk associated with this region is limited to ER-positive cancers. PMID:27556229

  20. Fine-Mapping of the 1p11.2 Breast Cancer Susceptibility Locus.

    PubMed

    Horne, Hisani N; Chung, Charles C; Zhang, Han; Yu, Kai; Prokunina-Olsson, Ludmila; Michailidou, Kyriaki; Bolla, Manjeet K; Wang, Qin; Dennis, Joe; Hopper, John L; Southey, Melissa C; Schmidt, Marjanka K; Broeks, Annegien; Muir, Kenneth; Lophatananon, Artitaya; Fasching, Peter A; Beckmann, Matthias W; Fletcher, Olivia; Johnson, Nichola; Sawyer, Elinor J; Tomlinson, Ian; Burwinkel, Barbara; Marme, Frederik; Guénel, Pascal; Truong, Thérèse; Bojesen, Stig E; Flyger, Henrik; Benitez, Javier; González-Neira, Anna; Anton-Culver, Hoda; Neuhausen, Susan L; Brenner, Hermann; Arndt, Volker; Meindl, Alfons; Schmutzler, Rita K; Brauch, Hiltrud; Hamann, Ute; Nevanlinna, Heli; Khan, Sofia; Matsuo, Keitaro; Iwata, Hiroji; Dörk, Thilo; Bogdanova, Natalia V; Lindblom, Annika; Margolin, Sara; Mannermaa, Arto; Kosma, Veli-Matti; Chenevix-Trench, Georgia; Wu, Anna H; Ven den Berg, David; Smeets, Ann; Zhao, Hui; Chang-Claude, Jenny; Rudolph, Anja; Radice, Paolo; Barile, Monica; Couch, Fergus J; Vachon, Celine; Giles, Graham G; Milne, Roger L; Haiman, Christopher A; Marchand, Loic Le; Goldberg, Mark S; Teo, Soo H; Taib, Nur A M; Kristensen, Vessela; Borresen-Dale, Anne-Lise; Zheng, Wei; Shrubsole, Martha; Winqvist, Robert; Jukkola-Vuorinen, Arja; Andrulis, Irene L; Knight, Julia A; Devilee, Peter; Seynaeve, Caroline; García-Closas, Montserrat; Czene, Kamila; Darabi, Hatef; Hollestelle, Antoinette; Martens, John W M; Li, Jingmei; Lu, Wei; Shu, Xiao-Ou; Cox, Angela; Cross, Simon S; Blot, William; Cai, Qiuyin; Shah, Mitul; Luccarini, Craig; Baynes, Caroline; Harrington, Patricia; Kang, Daehee; Choi, Ji-Yeob; Hartman, Mikael; Chia, Kee Seng; Kabisch, Maria; Torres, Diana; Jakubowska, Anna; Lubinski, Jan; Sangrajrang, Suleeporn; Brennan, Paul; Slager, Susan; Yannoukakos, Drakoulis; Shen, Chen-Yang; Hou, Ming-Feng; Swerdlow, Anthony; Orr, Nick; Simard, Jacques; Hall, Per; Pharoah, Paul D P; Easton, Douglas F; Chanock, Stephen J; Dunning, Alison M; Figueroa, Jonine D

    2016-01-01

    The Cancer Genetic Markers of Susceptibility genome-wide association study (GWAS) originally identified a single nucleotide polymorphism (SNP) rs11249433 at 1p11.2 associated with breast cancer risk. To fine-map this locus, we genotyped 92 SNPs in a 900kb region (120,505,799-121,481,132) flanking rs11249433 in 45,276 breast cancer cases and 48,998 controls of European, Asian and African ancestry from 50 studies in the Breast Cancer Association Consortium. Genotyping was done using iCOGS, a custom-built array. Due to the complicated nature of the region on chr1p11.2: 120,300,000-120,505,798, that lies near the centromere and contains seven duplicated genomic segments, we restricted analyses to 429 SNPs excluding the duplicated regions (42 genotyped and 387 imputed). Per-allelic associations with breast cancer risk were estimated using logistic regression models adjusting for study and ancestry-specific principal components. The strongest association observed was with the originally identified index SNP rs11249433 (minor allele frequency (MAF) 0.402; per-allele odds ratio (OR) = 1.10, 95% confidence interval (CI) 1.08-1.13, P = 1.49 × 10^-21). The association for rs11249433 was limited to ER-positive breast cancers (test for heterogeneity P ≤ 8.41 × 10^-5). Additional analyses by other tumor characteristics showed stronger associations with moderately/well differentiated tumors and tumors of lobular histology. Although no significant eQTL associations were observed, in silico analyses showed that rs11249433 was located in a region that is likely a weak enhancer/promoter. Fine-mapping analysis of the 1p11.2 breast cancer susceptibility locus confirms that the risk associated with this region is limited to ER-positive cancers.

  1. rs2735383, located at a microRNA binding site in the 3’UTR of NBS1, is not associated with breast cancer risk

    PubMed Central

    Liu, Jingjing; Lončar, Ivona; Collée, J. Margriet; Bolla, Manjeet K.; Dennis, Joe; Michailidou, Kyriaki; Wang, Qin; Andrulis, Irene L.; Barile, Monica; Beckmann, Matthias W.; Behrens, Sabine; Benitez, Javier; Blomqvist, Carl; Boeckx, Bram; Bogdanova, Natalia V.; Bojesen, Stig E.; Brauch, Hiltrud; Brennan, Paul; Brenner, Hermann; Broeks, Annegien; Burwinkel, Barbara; Chang-Claude, Jenny; Chen, Shou-Tung; Chenevix-Trench, Georgia; Cheng, Ching Y.; Choi, Ji-Yeob; Couch, Fergus J.; Cox, Angela; Cross, Simon S.; Cuk, Katarina; Czene, Kamila; Dörk, Thilo; dos-Santos-Silva, Isabel; Fasching, Peter A.; Figueroa, Jonine; Flyger, Henrik; García-Closas, Montserrat; Giles, Graham G.; Glendon, Gord; Goldberg, Mark S.; González-Neira, Anna; Guénel, Pascal; Haiman, Christopher A.; Hamann, Ute; Hart, Steven N.; Hartman, Mikael; Hatse, Sigrid; Hopper, John L.; Ito, Hidemi; Jakubowska, Anna; Kabisch, Maria; Kang, Daehee; Kosma, Veli-Matti; Kristensen, Vessela N.; Le Marchand, Loic; Lee, Eunjung; Li, Jingmei; Lophatananon, Artitaya; Lubinski, Jan; Mannermaa, Arto; Matsuo, Keitaro; Milne, Roger L.; Sahlberg, Kristine K.; Ottestad, Lars; Kåresen, Rolf; Langerød, Anita; Schlichting, Ellen; Holmen, Marit Muri; Sauer, Toril; Haakensen, Vilde; Engebråten, Olav; Naume, Bjørn; Kiserud, Cecile E.; Reinertsen, Kristin V.; Helland, Åslaug; Riis, Margit; Bukholm, Ida; Lønning, Per Eystein; Børresen-Dale, Anne-Lise; Grenaker Alnæs, Grethe I.; Neuhausen, Susan L.; Nevanlinna, Heli; Orr, Nick; Perez, Jose I. A.; Peto, Julian; Putti, Thomas C.; Pylkäs, Katri; Radice, Paolo; Sangrajrang, Suleeporn; Sawyer, Elinor J.; Schmidt, Marjanka K.; Schneeweiss, Andreas; Shen, Chen-Yang; Shrubsole, Martha J.; Shu, Xiao-Ou; Simard, Jacques; Southey, Melissa C.; Swerdlow, Anthony; Teo, Soo H.; Tessier, Daniel C.; Thanasitthichai, Somchai; Tomlinson, Ian; Torres, Diana; Truong, Thérèse; Tseng, Chiu-Chen; Vachon, Celine; Winqvist, Robert; Wu, Anna H.; Yannoukakos, Drakoulis; Zheng, Wei; Hall, Per; Dunning, Alison M.; Easton, Douglas F.; Hooning, Maartje J.; van den Ouweland, Ans M. W.; Martens, John W. M.; Hollestelle, Antoinette

    2016-01-01

    NBS1, also known as NBN, plays an important role in maintaining genomic stability. Interestingly, rs2735383 G > C, located in a microRNA binding site in the 3′-untranslated region (UTR) of NBS1, was shown to be associated with increased susceptibility to lung and colorectal cancer. However, the relation between rs2735383 and susceptibility to breast cancer is not yet clear. Therefore, we genotyped rs2735383 in 1,170 familial non-BRCA1/2 breast cancer cases and 1,077 controls using PCR-based restriction fragment length polymorphism (RFLP-PCR) analysis, but found no association between rs2735383CC and breast cancer risk (OR = 1.214, 95% CI = 0.936–1.574, P = 0.144). Because we could not exclude a small effect size due to a limited sample size, we further analyzed imputed rs2735383 genotypes (r2 > 0.999) of 47,640 breast cancer cases and 46,656 controls from the Breast Cancer Association Consortium (BCAC). However, rs2735383CC was not associated with overall breast cancer risk in European (OR = 1.014, 95% CI = 0.969–1.060, P = 0.556) nor in Asian women (OR = 0.998, 95% CI = 0.905–1.100, P = 0.961). Subgroup analyses by age, age at menarche, age at menopause, menopausal status, number of pregnancies, breast feeding, family history and receptor status also did not reveal a significant association. This study therefore does not support the involvement of the genotype at NBS1 rs2735383 in breast cancer susceptibility. PMID:27845421
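
    The odds ratios, confidence intervals and P values quoted in this abstract are linked by simple arithmetic on the log scale, which the sketch below makes explicit. It assumes a Wald-type 95% interval that is symmetric around log(OR) and a two-sided normal test; the function name is hypothetical and the calculation is only a consistency check, not the authors' analysis.

```python
import math


def p_value_from_or_ci(odds_ratio, ci_low, ci_high, z_crit=1.959964):
    """Recover the standard error, z statistic and two-sided P from an OR and its CI."""
    # Width of the interval on the log scale equals 2 * z_crit * SE.
    se = (math.log(ci_high) - math.log(ci_low)) / (2.0 * z_crit)
    z = math.log(odds_ratio) / se
    # Two-sided tail probability of the standard normal distribution.
    p = math.erfc(abs(z) / math.sqrt(2.0))
    return se, z, p


# Familial case-control result quoted above: OR = 1.214, 95% CI 0.936-1.574.
se, z, p = p_value_from_or_ci(1.214, 0.936, 1.574)
print(f"SE(log OR) = {se:.4f}, z = {z:.3f}, two-sided P = {p:.3f}")
```

    For the familial series the recovered P value is about 0.14, in line with the reported P = 0.144; small differences are expected because the published OR and interval limits are rounded.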

  2. Genotyping-by-sequencing highlights original diversity patterns within a European collection of 1191 maize flint lines, as compared to the maize USDA genebank.

    PubMed

    Gouesnard, Brigitte; Negro, Sandra; Laffray, Amélie; Glaubitz, Jeff; Melchinger, Albrecht; Revilla, Pedro; Moreno-Gonzalez, Jesus; Madur, Delphine; Combes, Valérie; Tollon-Cordet, Christine; Laborde, Jacques; Kermarrec, Dominique; Bauland, Cyril; Moreau, Laurence; Charcosset, Alain; Nicolas, Stéphane

    2017-10-01

    Genotyping by sequencing is suitable for analysis of global diversity in maize. We showed the distinctiveness of flint maize inbred lines of interest to enrich the diversity of breeding programs. Genotyping-by-sequencing (GBS) is a highly cost-effective procedure that permits the analysis of large collections of inbred lines. We used it to characterize diversity in 1191 maize flint inbred lines from the INRA collection, the European Cornfed association panel, and lines recently derived from landraces. We analyzed the properties of GBS data obtained with different imputation methods, through comparison with a 50 K SNP array. We identified seven ancestral groups within the Flint collection (dent, Northern flint, Italy, Pyrenees-Galicia, Argentina, Lacaune, Popcorn) in agreement with breeding knowledge. Analysis highlighted many crosses between different origins and the improvement of flint germplasm with dent germplasm. We performed association studies on different agronomic traits, revealing SNPs associated with cob color, kernel color, and male flowering time variation. We compared the diversity of both our collection and the USDA collection which has been previously analyzed by GBS. The population structure of the 4001 inbred lines confirmed the influence of the historical inbred lines (B73, A632, Oh43, Mo17, W182E, PH207, and Wf9) within the dent group. It showed distinctly different tropical and popcorn groups, a sweet-Northern flint group and a flint group sub-structured in Italian and European flint (Pyrenees-Galicia and Lacaune) groups. Interestingly, we identified several selective sweeps between dent, flint, and tropical inbred lines that co-localized with SNPs associated with flowering time variation. The joint analysis of collections by GBS offers opportunities for a global diversity analysis of maize inbred lines.

  3. Hidden Markov Model-Based CNV Detection Algorithms for Illumina Genotyping Microarrays.

    PubMed

    Seiser, Eric L; Innocenti, Federico

    2014-01-01

    Somatic alterations in DNA copy number have been well studied in numerous malignancies, yet the role of germline DNA copy number variation in cancer is still emerging. Genotyping microarrays generate allele-specific signal intensities to determine genotype, but may also be used to infer DNA copy number using additional computational approaches. Numerous tools have been developed to analyze Illumina genotype microarray data for copy number variant (CNV) discovery, although commonly utilized algorithms freely available to the public employ approaches based upon the use of hidden Markov models (HMMs). QuantiSNP, PennCNV, and GenoCN utilize HMMs with six copy number states but vary in how transition and emission probabilities are calculated. Performance of these CNV detection algorithms has been shown to be variable between both genotyping platforms and data sets, although HMM approaches generally outperform other current methods. Low sensitivity is prevalent with HMM-based algorithms, suggesting the need for continued improvement in CNV detection methodologies.

  4. Multiple Imputation in Two-Stage Cluster Samples Using The Weighted Finite Population Bayesian Bootstrap.

    PubMed

    Zhou, Hanzhi; Elliott, Michael R; Raghunathan, Trivellore E

    2016-06-01

    Multistage sampling is often employed in survey samples for cost and convenience. However, accounting for clustering features when generating datasets for multiple imputation is a nontrivial task, particularly when, as is often the case, cluster sampling is accompanied by unequal probabilities of selection, necessitating case weights. Thus, multiple imputation often ignores complex sample designs and assumes simple random sampling when generating imputations, even though failing to account for complex sample design features is known to yield biased estimates and confidence intervals that have incorrect nominal coverage. In this article, we extend a recently developed, weighted, finite-population Bayesian bootstrap procedure to generate synthetic populations conditional on complex sample design data that can be treated as simple random samples at the imputation stage, obviating the need to directly model design features for imputation. We develop two forms of this method: one where the probabilities of selection are known at the first and second stages of the design, and the other, more common in public use files, where only the final weight based on the product of the two probabilities is known. We show that this method has advantages in terms of bias, mean square error, and coverage properties over methods where sample designs are ignored, with little loss in efficiency, even when compared with correct fully parametric models. An application is made using the National Automotive Sampling System Crashworthiness Data System, a multistage, unequal probability sample of U.S. passenger vehicle crashes, which suffers from a substantial amount of missing data in "Delta-V," a key crash severity measure.

  5. MVIAeval: a web tool for comprehensively evaluating the performance of a new missing value imputation algorithm.

    PubMed

    Wu, Wei-Sheng; Jhou, Meng-Jhun

    2017-01-13

    Missing value imputation is important for microarray data analyses because microarray data with missing values would significantly degrade the performance of the downstream analyses. Although many microarray missing value imputation algorithms have been developed, an objective and comprehensive performance comparison framework is still lacking. To solve this problem, we previously proposed a framework which can perform a comprehensive performance comparison of different existing algorithms. Also the performance of a new algorithm can be evaluated by our performance comparison framework. However, constructing our framework is not an easy task for the interested researchers. To save researchers' time and efforts, here we present an easy-to-use web tool named MVIAeval (Missing Value Imputation Algorithm evaluator) which implements our performance comparison framework. MVIAeval provides a user-friendly interface allowing users to upload the R code of their new algorithm and select (i) the test datasets among 20 benchmark microarray (time series and non-time series) datasets, (ii) the compared algorithms among 12 existing algorithms, (iii) the performance indices from three existing ones, (iv) the comprehensive performance scores from two possible choices, and (v) the number of simulation runs. The comprehensive performance comparison results are then generated and shown as both figures and tables. MVIAeval is a useful tool for researchers to easily conduct a comprehensive and objective performance evaluation of their newly developed missing value imputation algorithm for microarray data or any data which can be represented as a matrix form (e.g. NGS data or proteomics data). Thus, MVIAeval will greatly expedite the progress in the research of missing value imputation algorithms.

  6. Variable selection under multiple imputation using the bootstrap in a prognostic study

    PubMed Central

    Heymans, Martijn W; van Buuren, Stef; Knol, Dirk L; van Mechelen, Willem; de Vet, Henrica CW

    2007-01-01

    Background: Missing data is a challenging problem in many prognostic studies. Multiple imputation (MI) accounts for imputation uncertainty, which allows for adequate statistical testing. We developed and tested a methodology combining MI with bootstrapping techniques for studying prognostic variable selection. Method: In our prospective cohort study we merged data from three different randomized controlled trials (RCTs) to assess prognostic variables for chronicity of low back pain. Among the outcome and prognostic variables, the proportion of missing data ranged from 0 to 48.1%. We used four methods to investigate the influence of sampling and imputation variation, respectively: MI only, bootstrap only, and two methods that combine MI and bootstrapping. Variables were selected based on the inclusion frequency of each prognostic variable, i.e. the proportion of times that the variable appeared in the model. The discriminative and calibrative abilities of prognostic models developed by the four methods were assessed at different inclusion levels. Results: We found that the effect of imputation variation on the inclusion frequency was larger than the effect of sampling variation. When MI and bootstrapping were combined at the range of 0% (full model) to 90% of variable selection, bootstrap corrected c-index values of 0.70 to 0.71 and slope values of 0.64 to 0.86 were found. Conclusion: We recommend accounting for both imputation and sampling variation in data sets with missing values. The new procedure of combining MI with bootstrapping for variable selection results in multivariable prognostic models with good performance and is therefore attractive for data sets with missing values. PMID:17629912

  7. On the multiple imputation variance estimator for control-based and delta-adjusted pattern mixture models.

    PubMed

    Tang, Yongqiang

    2017-12-01

    Control-based pattern mixture models (PMM) and delta-adjusted PMMs are commonly used as sensitivity analyses in clinical trials with non-ignorable dropout. These PMMs assume that the statistical behavior of outcomes varies by pattern in the experimental arm in the imputation procedure, but the imputed data are typically analyzed by a standard method such as the primary analysis model. In the multiple imputation (MI) inference, Rubin's variance estimator is generally biased when the imputation and analysis models are uncongenial. One objective of the article is to quantify the bias of Rubin's variance estimator in the control-based and delta-adjusted PMMs for longitudinal continuous outcomes. These PMMs assume the same observed data distribution as the mixed effects model for repeated measures (MMRM). We derive analytic expressions for the MI treatment effect estimator and the associated Rubin's variance in these PMMs and MMRM as functions of the maximum likelihood estimator from the MMRM analysis and the observed proportion of subjects in each dropout pattern when the number of imputations is infinite. The asymptotic bias is generally small or negligible in the delta-adjusted PMM, but can be sizable in the control-based PMM. This indicates that the inference based on Rubin's rule is approximately valid in the delta-adjusted PMM. A simple variance estimator is proposed to ensure asymptotically valid MI inferences in these PMMs, and compared with the bootstrap variance. The proposed method is illustrated by the analysis of an antidepressant trial, and its performance is further evaluated via a simulation study. © 2017, The International Biometric Society.
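
    For orientation, the standard Rubin's rules that the abstract refers to can be sketched as follows. This shows only the conventional pooling formula (total variance = within-imputation variance + (1 + 1/m) × between-imputation variance), not the corrected variance estimator proposed in the article. The function name `rubin_pool` and the input numbers are hypothetical.

```python
from statistics import mean, variance

def rubin_pool(estimates, variances):
    """Pool m completed-data estimates with the standard Rubin's rules.

    estimates: point estimates, one per imputed data set
    variances: their squared standard errors, one per imputed data set
    """
    m = len(estimates)
    q_bar = mean(estimates)      # pooled point estimate
    w = mean(variances)          # within-imputation variance
    b = variance(estimates)      # between-imputation variance (m - 1 denominator)
    t = w + (1.0 + 1.0 / m) * b  # Rubin's total variance
    return q_bar, t

# Hypothetical treatment-effect estimates from m = 3 imputed data sets.
est = [1.0, 2.0, 3.0]
var = [0.5, 0.5, 0.5]
q_bar, total_var = rubin_pool(est, var)
print(q_bar, total_var)
```

    As the abstract notes, this estimator can be biased when the imputation and analysis models are uncongenial, which is the situation the article analyzes for control-based PMMs.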

  8. Multiple Imputation in Two-Stage Cluster Samples Using The Weighted Finite Population Bayesian Bootstrap

    PubMed Central

    Zhou, Hanzhi; Elliott, Michael R.; Raghunathan, Trivellore E.

    2017-01-01

    Multistage sampling is often employed in survey samples for cost and convenience. However, accounting for clustering features when generating datasets for multiple imputation is a nontrivial task, particularly when, as is often the case, cluster sampling is accompanied by unequal probabilities of selection, necessitating case weights. Thus, multiple imputation often ignores complex sample designs and assumes simple random sampling when generating imputations, even though failing to account for complex sample design features is known to yield biased estimates and confidence intervals that have incorrect nominal coverage. In this article, we extend a recently developed, weighted, finite-population Bayesian bootstrap procedure to generate synthetic populations conditional on complex sample design data that can be treated as simple random samples at the imputation stage, obviating the need to directly model design features for imputation. We develop two forms of this method: one where the probabilities of selection are known at the first and second stages of the design, and the other, more common in public use files, where only the final weight based on the product of the two probabilities is known. We show that this method has advantages in terms of bias, mean square error, and coverage properties over methods where sample designs are ignored, with little loss in efficiency, even when compared with correct fully parametric models. An application is made using the National Automotive Sampling System Crashworthiness Data System, a multistage, unequal probability sample of U.S. passenger vehicle crashes, which suffers from a substantial amount of missing data in “Delta-V,” a key crash severity measure. PMID:29226161

  9. Joint modelling rationale for chained equations

    PubMed Central

    2014-01-01

    Background Chained equations imputation is widely used in medical research. It uses a set of conditional models, so is more flexible than joint modelling imputation for the imputation of different types of variables (e.g. binary, ordinal or unordered categorical). However, chained equations imputation does not correspond to drawing from a joint distribution when the conditional models are incompatible. Concurrently with our work, other authors have shown the equivalence of the two imputation methods in finite samples. Methods Taking a different approach, we prove, in finite samples, sufficient conditions for chained equations and joint modelling to yield imputations from the same predictive distribution. Further, we apply this proof in four specific cases and conduct a simulation study which explores the consequences when the conditional models are compatible but the conditions otherwise are not satisfied. Results We provide an additional “non-informative margins” condition which, together with compatibility, is sufficient. We show that the non-informative margins condition is not satisfied, despite compatible conditional models, in a situation as simple as two continuous variables and one binary variable. Our simulation study demonstrates that as a consequence of this violation order effects can occur; that is, systematic differences depending upon the ordering of the variables in the chained equations algorithm. However, the order effects appear to be small, especially when associations between variables are weak. Conclusions Since chained equations is typically used in medical research for datasets with different types of variables, researchers must be aware that order effects are likely to be ubiquitous, but our results suggest they may be small enough to be negligible. PMID:24559129

  10. The rare TREM2 R47H variant exerts only a modest effect on Alzheimer disease risk.

    PubMed

    Hooli, Basavaraj V; Parrado, Antonio R; Mullin, Kristina; Yip, Wai-Ki; Liu, Tian; Roehr, Johannes T; Qiao, Dandi; Jessen, Frank; Peters, Oliver; Becker, Tim; Ramirez, Alfredo; Lange, Christoph; Bertram, Lars; Tanzi, Rudolph E

    2014-10-07

    Recently, 2 independent studies reported that a rare missense variant, rs75932628 (R47H), in exon 2 of the gene encoding the "triggering receptor expressed on myeloid cells 2" (TREM2) significantly increases the risk of Alzheimer disease (AD) with an effect size comparable to that of the APOE ε4 allele. In this study, we attempted to replicate the association between rs75932628 and AD risk by directly genotyping rs75932628 in 2 independent Caucasian family cohorts consisting of 927 families (with 1,777 affected and 1,235 unaffected) and in 2 Caucasian case-control cohorts composed of 1,314 cases and 1,609 controls. In addition, we imputed genotypes in 3 independent Caucasian case-control cohorts containing 1,906 cases and 1,503 controls. Meta-analysis of the 2 family-based and the 5 case-control cohorts yielded a p value of 0.0029, while the overall summary estimate (using case-control data only) resulted in an odds ratio of 1.67 (95% confidence interval 0.95-2.92) for the association between the TREM2 R47H and increased AD risk. While our results serve to confirm the association between R47H and risk of AD, the observed effect on risk was substantially smaller than that previously reported. © 2014 American Academy of Neurology.

  11. The rare TREM2 R47H variant exerts only a modest effect on Alzheimer disease risk

    PubMed Central

    Hooli, Basavaraj V.; Parrado, Antonio R.; Mullin, Kristina; Yip, Wai-Ki; Liu, Tian; Roehr, Johannes T.; Qiao, Dandi; Jessen, Frank; Peters, Oliver; Becker, Tim; Ramirez, Alfredo; Lange, Christoph; Bertram, Lars

    2014-01-01

    Objectives: Recently, 2 independent studies reported that a rare missense variant, rs75932628 (R47H), in exon 2 of the gene encoding the “triggering receptor expressed on myeloid cells 2” (TREM2) significantly increases the risk of Alzheimer disease (AD) with an effect size comparable to that of the APOE ε4 allele. Methods: In this study, we attempted to replicate the association between rs75932628 and AD risk by directly genotyping rs75932628 in 2 independent Caucasian family cohorts consisting of 927 families (with 1,777 affected and 1,235 unaffected) and in 2 Caucasian case-control cohorts composed of 1,314 cases and 1,609 controls. In addition, we imputed genotypes in 3 independent Caucasian case-control cohorts containing 1,906 cases and 1,503 controls. Results: Meta-analysis of the 2 family-based and the 5 case-control cohorts yielded a p value of 0.0029, while the overall summary estimate (using case-control data only) resulted in an odds ratio of 1.67 (95% confidence interval 0.95–2.92) for the association between the TREM2 R47H and increased AD risk. Conclusions: While our results serve to confirm the association between R47H and risk of AD, the observed effect on risk was substantially smaller than that previously reported. PMID:25186855

  12. Identification of multiple genetic susceptibility loci in Takayasu arteritis.

    PubMed

    Saruhan-Direskeneli, Güher; Hughes, Travis; Aksu, Kenan; Keser, Gokhan; Coit, Patrick; Aydin, Sibel Z; Alibaz-Oner, Fatma; Kamalı, Sevil; Inanc, Murat; Carette, Simon; Hoffman, Gary S; Akar, Servet; Onen, Fatos; Akkoc, Nurullah; Khalidi, Nader A; Koening, Curry; Karadag, Omer; Kiraz, Sedat; Langford, Carol A; McAlear, Carol A; Ozbalkan, Zeynep; Ates, Askin; Karaaslan, Yasar; Maksimowicz-McKinnon, Kathleen; Monach, Paul A; Ozer, Hüseyin T; Seyahi, Emire; Fresko, Izzet; Cefle, Ayse; Seo, Philip; Warrington, Kenneth J; Ozturk, Mehmet A; Ytterberg, Steven R; Cobankara, Veli; Onat, A Mesut; Guthridge, Joel M; James, Judith A; Tunc, Ercan; Duzgun, Nurşen; Bıcakcıgil, Muge; Yentür, Sibel P; Merkel, Peter A; Direskeneli, Haner; Sawalha, Amr H

    2013-08-08

    Takayasu arteritis is a rare inflammatory disease of large arteries. The etiology of Takayasu arteritis remains poorly understood, but genetic contribution to the disease pathogenesis is supported by the genetic association with HLA-B*52. We genotyped ~200,000 genetic variants in two ethnically divergent Takayasu arteritis cohorts from Turkey and North America by using a custom-designed genotyping platform (Immunochip). Additional genetic variants and the classical HLA alleles were imputed and analyzed. We identified and confirmed two independent susceptibility loci within the HLA region (r(2) < 0.2): HLA-B/MICA (rs12524487, OR = 3.29, p = 5.57 × 10(-16)) and HLA-DQB1/HLA-DRB1 (rs113452171, OR = 2.34, p = 3.74 × 10(-9); and rs189754752, OR = 2.47, p = 4.22 × 10(-9)). In addition, we identified and confirmed a genetic association between Takayasu arteritis and the FCGR2A/FCGR3A locus on chromosome 1 (rs10919543, OR = 1.81, p = 5.89 × 10(-12)). The risk allele in this locus results in increased mRNA expression of FCGR2A. We also established the genetic association between IL12B and Takayasu arteritis (rs56167332, OR = 1.54, p = 2.18 × 10(-8)). Copyright © 2013 The American Society of Human Genetics. Published by Elsevier Inc. All rights reserved.

  13. Genome-Wide Meta-Analysis of Longitudinal Alcohol Consumption Across Youth and Early Adulthood.

    PubMed

    Adkins, Daniel E; Clark, Shaunna L; Copeland, William E; Kennedy, Martin; Conway, Kevin; Angold, Adrian; Maes, Hermine; Liu, Youfang; Kumar, Gaurav; Erkanli, Alaattin; Patkar, Ashwin A; Silberg, Judy; Brown, Tyson H; Fergusson, David M; Horwood, L John; Eaves, Lindon; van den Oord, Edwin J C G; Sullivan, Patrick F; Costello, E J

    2015-08-01

    The public health burden of alcohol is unevenly distributed across the life course, with levels of use, abuse, and dependence increasing across adolescence and peaking in early adulthood. Here, we leverage this temporal patterning to search for common genetic variants predicting developmental trajectories of alcohol consumption. Comparable psychiatric evaluations measuring alcohol consumption were collected in three longitudinal community samples (N=2,126, obs=12,166). Repeated consumption measurements spanning adolescence and early adulthood were analyzed using linear mixed models, estimating individual consumption trajectories, which were then tested for association with Illumina 660W-Quad genotype data (866,099 SNPs after imputation and QC). Association results were combined across samples using standard meta-analysis methods. Four meta-analysis associations satisfied our pre-determined genome-wide significance criterion (FDR<0.1) and six others met our 'suggestive' criterion (FDR<0.2). Genome-wide significant associations were highly biologically plausible, including associations within GABA transporter 1, SLC6A1 (solute carrier family 6, member 1), and exonic hits in LOC100129340 (mitofusin-1-like). Pathway analyses elaborated single marker results, indicating significantly enriched associations to intuitive biological mechanisms, including neurotransmission, xenobiotic pharmacodynamics, and nuclear hormone receptors (NHR). These findings underscore the value of combining longitudinal behavioral data and genome-wide genotype information in order to study developmental patterns and improve statistical power in genomic studies.

  14. The chromosome 2p21 region harbors a complex genetic architecture for association with risk for renal cell carcinoma

    PubMed Central

    Han, Summer S.; Yeager, Meredith; Moore, Lee E.; Wei, Ming-Hui; Pfeiffer, Ruth; Toure, Ousmane; Purdue, Mark P.; Johansson, Mattias; Scelo, Ghislaine; Chung, Charles C.; Gaborieau, Valerie; Zaridze, David; Schwartz, Kendra; Szeszenia-Dabrowska, Neonilia; Davis, Faith; Bencko, Vladimir; Colt, Joanne S.; Janout, Vladimir; Matveev, Vsevolod; Foretova, Lenka; Mates, Dana; Navratilova, M.; Boffetta, Paolo; Berg, Christine D.; Grubb, Robert L.; Stevens, Victoria L.; Thun, Michael J.; Diver, W. Ryan; Gapstur, Susan M.; Albanes, Demetrius; Weinstein, Stephanie J.; Virtamo, Jarmo; Burdett, Laurie; Brisuda, Antonin; McKay, James D.; Fraumeni, Joseph F.; Chatterjee, Nilanjan; Rosenberg, Philip S.; Rothman, Nathaniel; Brennan, Paul; Chow, Wong-Ho; Tucker, Margaret A.; Chanock, Stephen J.; Toro, Jorge R.

    2012-01-01

    In follow-up of a recent genome-wide association study (GWAS) that identified a locus in chromosome 2p21 associated with risk for renal cell carcinoma (RCC), we conducted a fine mapping analysis of a 120 kb region that includes EPAS1. We genotyped 59 tagged common single-nucleotide polymorphisms (SNPs) in 2278 RCC and 3719 controls of European background and observed a novel signal for rs9679290 [P = 5.75 × 10^−8, per-allele odds ratio (OR) = 1.27, 95% confidence interval (CI): 1.17–1.39]. Imputation of common SNPs surrounding rs9679290 using HapMap 3 and 1000 Genomes data yielded two additional signals, rs4953346 (P = 4.09 × 10^−14) and rs12617313 (P = 7.48 × 10^−12), both highly correlated with rs9679290 (r^2 > 0.95), but interestingly not correlated with the two SNPs reported in the GWAS: rs11894252 and rs7579899 (r^2 < 0.1 with rs9679290). Genotype analysis of rs12617313 confirmed an association with RCC risk (P = 1.72 × 10^−9, per-allele OR = 1.28, 95% CI: 1.18–1.39). In conclusion, we report that chromosome 2p21 harbors a complex genetic architecture for common RCC risk variants. PMID:22113997

  15. Genome-wide association study identifies three new melanoma susceptibility loci.

    PubMed

    Barrett, Jennifer H; Iles, Mark M; Harland, Mark; Taylor, John C; Aitken, Joanne F; Andresen, Per Arne; Akslen, Lars A; Armstrong, Bruce K; Avril, Marie-Francoise; Azizi, Esther; Bakker, Bert; Bergman, Wilma; Bianchi-Scarrà, Giovanna; Bressac-de Paillerets, Brigitte; Calista, Donato; Cannon-Albright, Lisa A; Corda, Eve; Cust, Anne E; Dębniak, Tadeusz; Duffy, David; Dunning, Alison M; Easton, Douglas F; Friedman, Eitan; Galan, Pilar; Ghiorzo, Paola; Giles, Graham G; Hansson, Johan; Hocevar, Marko; Höiom, Veronica; Hopper, John L; Ingvar, Christian; Janssen, Bart; Jenkins, Mark A; Jönsson, Göran; Kefford, Richard F; Landi, Giorgio; Landi, Maria Teresa; Lang, Julie; Lubiński, Jan; Mackie, Rona; Malvehy, Josep; Martin, Nicholas G; Molven, Anders; Montgomery, Grant W; van Nieuwpoort, Frans A; Novakovic, Srdjan; Olsson, Håkan; Pastorino, Lorenza; Puig, Susana; Puig-Butille, Joan Anton; Randerson-Moor, Juliette; Snowden, Helen; Tuominen, Rainer; Van Belle, Patricia; van der Stoep, Nienke; Whiteman, David C; Zelenika, Diana; Han, Jiali; Fang, Shenying; Lee, Jeffrey E; Wei, Qingyi; Lathrop, G Mark; Gillanders, Elizabeth M; Brown, Kevin M; Goldstein, Alisa M; Kanetsky, Peter A; Mann, Graham J; Macgregor, Stuart; Elder, David E; Amos, Christopher I; Hayward, Nicholas K; Gruis, Nelleke A; Demenais, Florence; Bishop, Julia A Newton; Bishop, D Timothy

    2011-10-09

    We report a genome-wide association study for melanoma that was conducted by the GenoMEL Consortium. Our discovery phase included 2,981 individuals with melanoma and 1,982 study-specific control individuals of European ancestry, as well as an additional 6,426 control subjects from French or British populations, all of whom were genotyped for 317,000 or 610,000 single-nucleotide polymorphisms (SNPs). Our analysis replicated previously known melanoma susceptibility loci. Seven new regions with at least one SNP with P < 10(-5) and further local imputed or genotyped support were selected for replication using two other genome-wide studies (from Australia and Texas, USA). Additional replication came from case-control series from the UK and The Netherlands. Variants at three of the seven loci replicated at P < 10(-3): an SNP in ATM (rs1801516, overall P = 3.4 × 10(-9)), an SNP in MX2 (rs45430, P = 2.9 × 10(-9)) and an SNP adjacent to CASP8 (rs13016963, P = 8.6 × 10(-10)). A fourth locus near CCND1 remains of potential interest, showing suggestive but inconclusive evidence of replication (rs1485993, overall P = 4.6 × 10(-7) under a fixed-effects model and P = 1.2 × 10(-3) under a random-effects model). These newly associated variants showed no association with nevus or pigmentation phenotypes in a large British case-control series.

  16. Efficient genotype compression and analysis of large genetic variation datasets

    PubMed Central

    Layer, Ryan M.; Kindlon, Neil; Karczewski, Konrad J.; Quinlan, Aaron R.

    2015-01-01

    Genotype Query Tools (GQT) is a new indexing strategy that expedites analyses of genome variation datasets in VCF format based on sample genotypes, phenotypes and relationships. GQT’s compressed genotype index minimizes decompression for analysis, and performance relative to existing methods improves with cohort size. We show substantial (up to 443 fold) performance gains over existing methods and demonstrate GQT’s utility for exploring massive datasets involving thousands to millions of genomes. PMID:26550772

  17. Age at menopause: imputing age at menopause for women with a hysterectomy with application to risk of postmenopausal breast cancer

    PubMed Central

    Rosner, Bernard; Colditz, Graham A.

    2011-01-01

    Purpose Age at menopause, a major marker in the reproductive life, may bias results for evaluation of breast cancer risk after menopause. Methods We follow 38,948 premenopausal women in 1980 and identify 2,586 who reported hysterectomy without bilateral oophorectomy, and 31,626 who reported natural menopause during 22 years of follow-up. We evaluate risk factors for natural menopause, impute age at natural menopause for women reporting hysterectomy without bilateral oophorectomy and estimate the hazard of reaching natural menopause in the next 2 years. We apply this imputed age at menopause to both increase sample size and to evaluate the relation between postmenopausal exposures and risk of breast cancer. Results Age, cigarette smoking, age at menarche, pregnancy history, body mass index, history of benign breast disease, and history of breast cancer were each significantly related to age at natural menopause; duration of oral contraceptive use and family history of breast cancer were not. The imputation increased sample size substantially and although some risk factors after menopause were weaker in the expanded model (height, and alcohol use), use of hormone therapy is less biased. Conclusions Imputing age at menopause increases sample size, broadens generalizability making it applicable to women with hysterectomy, and reduces bias. PMID:21441037

  18. Application of a novel hybrid method for spatiotemporal data imputation: A case study of the Minqin County groundwater level

    NASA Astrophysics Data System (ADS)

    Zhang, Zhongrong; Yang, Xuan; Li, Hao; Li, Weide; Yan, Haowen; Shi, Fei

    2017-10-01

    Techniques for data analysis have been widely developed in past years; however, missing data still represent a ubiquitous problem in many scientific fields. In particular, dealing with missing spatiotemporal data presents an enormous challenge. Nonetheless, in recent years, a considerable amount of research has focused on spatiotemporal problems, making spatiotemporal missing data imputation methods increasingly indispensable. In this paper, a novel spatiotemporal hybrid method is proposed to verify and impute spatiotemporal missing values. This new method, termed SOM-FLSSVM, flexibly combines three advanced techniques: self-organizing feature map (SOM) clustering, the fruit fly optimization algorithm (FOA) and the least squares support vector machine (LSSVM). We employ a cross-validation (CV) procedure and FOA swarm intelligence optimization strategy that can search available parameters and determine the optimal imputation model. The spatiotemporal underground water data for Minqin County, China, were selected to test the reliability and imputation ability of SOM-FLSSVM. We carried out a validation experiment and compared three well-studied models with SOM-FLSSVM using missing data ratios from 0.1 to 0.8 in the same data set. The results demonstrate that the new hybrid method performs well in terms of both robustness and accuracy for spatiotemporal missing data.

  19. Statistical Methods for Generalized Linear Models with Covariates Subject to Detection Limits.

    PubMed

    Bernhardt, Paul W; Wang, Huixia J; Zhang, Daowen

    2015-05-01

    Censored observations are a common occurrence in biomedical data sets. Although a large amount of research has been devoted to estimation and inference for data with censored responses, very little research has focused on proper statistical procedures when predictors are censored. In this paper, we consider statistical methods for dealing with multiple predictors subject to detection limits within the context of generalized linear models. We investigate and adapt several conventional methods and develop a new multiple imputation approach for analyzing data sets with predictors censored due to detection limits. We establish the consistency and asymptotic normality of the proposed multiple imputation estimator and suggest a computationally simple and consistent variance estimator. We also demonstrate that the conditional mean imputation method often leads to inconsistent estimates in generalized linear models, while several other methods are either computationally intensive or lead to parameter estimates that are biased or more variable compared to the proposed multiple imputation estimator. In an extensive simulation study, we assess the bias and variability of different approaches within the context of a logistic regression model and compare variance estimation methods for the proposed multiple imputation estimator. Lastly, we apply several methods to analyze the data set from a recently-conducted GenIMS study.

  20. A Hot-Deck Multiple Imputation Procedure for Gaps in Longitudinal Recurrent Event Histories

    PubMed Central

    Wang, Chia-Ning; Little, Roderick; Nan, Bin; Harlow, Siobán D.

    2012-01-01

    Summary We propose a regression-based hot deck multiple imputation method for gaps of missing data in longitudinal studies, where subjects experience a recurrent event process and a terminal event. Examples are repeated asthma episodes and death, or menstrual periods and the menopause, as in our motivating application. Research interest concerns the onset time of a marker event, defined by the recurrent-event process, or the duration from this marker event to the final event. Gaps in the recorded event history make it difficult to determine the onset time of the marker event, and hence, the duration from onset to the final event. Simple approaches such as jumping gap times or dropping cases with gaps have obvious limitations. We propose a procedure for imputing information in the gaps by substituting information in the gap from a matched individual with a completely recorded history in the corresponding interval. Predictive Mean Matching is used to incorporate information on longitudinal characteristics of the repeated process and the final event time. Multiple imputation is used to propagate imputation uncertainty. The procedure is applied to an important data set for assessing the timing and duration of the menopausal transition. The performance of the proposed method is assessed by a simulation study. PMID:21361886

  1. Evaluation of techniques for handling missing cost-to-charge ratios in the USA Nationwide Inpatient Sample: a simulation study.

    PubMed

    Yu, Tzy-Chyi; Zhou, Huanxue

    2015-09-01

    Evaluate performance of techniques used to handle missing cost-to-charge ratio (CCR) data in the USA Healthcare Cost and Utilization Project's Nationwide Inpatient Sample. Four techniques to replace missing CCR data were evaluated: deleting discharges with missing CCRs (complete case analysis), reweighting as recommended by Healthcare Cost and Utilization Project, reweighting by adjustment cells and hot deck imputation by adjustment cells. Bias and root mean squared error of these techniques on hospital cost were evaluated in five disease cohorts. Similar mean cost estimates would be obtained with any of the four techniques when the percentage of missing data is low (<10%). When total cost is the outcome of interest, a reweighting technique to avoid underestimation from dropping observations with missing data should be adopted.
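
    As a rough illustration of the last of the four techniques, hot deck imputation by adjustment cells can be sketched as follows. This is a minimal sketch, not the authors' implementation: the function name `hot_deck_by_cell`, the record fields `cell` and `ccr`, and the choice of a uniformly random donor within each cell are all assumptions made for the example.

```python
import random


def hot_deck_by_cell(records, cell_key, value_key, seed=0):
    """Fill missing values with a randomly drawn donor value from the same cell.

    `records` is a list of dicts; a missing value is represented by None.
    Records in a cell without any donor are left missing.
    """
    rng = random.Random(seed)

    # Collect the observed (donor) values for every adjustment cell.
    donors = {}
    for rec in records:
        if rec[value_key] is not None:
            donors.setdefault(rec[cell_key], []).append(rec[value_key])

    # Copy each record and impute from the donor pool of its own cell.
    imputed = []
    for rec in records:
        new_rec = dict(rec)
        pool = donors.get(new_rec[cell_key])
        if new_rec[value_key] is None and pool:
            new_rec[value_key] = rng.choice(pool)
        imputed.append(new_rec)
    return imputed


# Hypothetical discharges: `cell` is an adjustment cell, `ccr` a cost-to-charge ratio.
discharges = [
    {"cell": "A", "ccr": 0.3},
    {"cell": "A", "ccr": 0.5},
    {"cell": "A", "ccr": None},
    {"cell": "B", "ccr": 0.8},
    {"cell": "B", "ccr": None},
]
completed = hot_deck_by_cell(discharges, "cell", "ccr")
```

    Complete case analysis, by contrast, would simply drop the two records whose `ccr` is None.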

  2. Warfarin Pharmacogenetics

    PubMed Central

    Johnson, Julie A.; Cavallari, Larisa H.

    2014-01-01

    The cytochrome P450 (CYP) 2C9 and vitamin K epoxide reductase complex 1 (VKORC1) genotypes have been strongly and consistently associated with warfarin dose requirements, and dosing algorithms incorporating genetic and clinical information have been shown to be predictive of stable warfarin dose. However, clinical trials evaluating genotype-guided warfarin dosing produced mixed results, calling into question the utility of this approach. Recent trials used surrogate markers as endpoints rather than clinical endpoints, further complicating translation of the data to clinical practice. The present data do not support genetic testing to guide warfarin dosing, but in the setting where genotype data are available, use of such data in those of European ancestry is reasonable. Outcomes data are expected from an on-going trial, observational studies continue, and more work is needed to define dosing algorithms that incorporate appropriate variants in minority populations; all these will further shape guidelines and recommendations on the clinical utility of genotype-guided warfarin dosing. PMID:25282448

  3. Influence of Pattern of Missing Data on Performance of Imputation Methods: An Example Using National Data on Drug Injection in Prisons

    PubMed Central

    Haji-Maghsoudi, Saiedeh; Haghdoost, Ali-akbar; Rastegari, Azam; Baneshi, Mohammad Reza

    2013-01-01

    Background: Policy makers need models to be able to detect groups at high risk of HIV infection. Incomplete records and dirty data are frequently seen in national data sets. The presence of missing data challenges the practice of model development. Several studies have suggested that the performance of imputation methods is acceptable when the missing rate is moderate. One issue that has received less attention, and is addressed here, is the role of the pattern of missing data. Methods: We used information on 2720 prisoners. Results derived from fitting a regression model to the whole data set served as the gold standard. Missing data were then generated so that 10%, 20% and 50% of data were lost. In scenario 1, we generated missing values, at the above rates, in one variable that was significant in the gold model (age). In scenario 2, a small proportion of each independent variable was dropped. Four imputation methods, under different Event Per Variable (EPV) values, were compared in terms of selection of important variables and parameter estimation. Results: In scenario 2, bias in estimates was low and the performances of all methods for handling missing data were similar. All methods at all missing rates were able to detect the significance of age. In scenario 1, biases in estimates increased, in particular at the 50% missing rate. Here, at EPVs of 10 and 5, imputation methods failed to capture the effect of age. Conclusion: In scenario 2, all imputation methods at all missing rates were able to detect age as being significant. This was not the case in scenario 1. Our results showed that the performance of imputation methods depends on the pattern of missing data. PMID:24596839
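
    The two missingness scenarios can be mimicked in a small simulation. The sketch below assumes values are deleted completely at random, which the abstract does not state; the function names `mask_one_variable` and `mask_all_variables` and the toy variables are hypothetical.

```python
import random


def mask_one_variable(rows, variable, rate, seed=0):
    """Scenario 1 style: set a share `rate` of one variable's values to None."""
    rng = random.Random(seed)
    n_missing = round(rate * len(rows))
    chosen = set(rng.sample(range(len(rows)), n_missing))
    return [
        {**row, variable: None} if i in chosen else dict(row)
        for i, row in enumerate(rows)
    ]


def mask_all_variables(rows, variables, rate, seed=0):
    """Scenario 2 style: drop a share `rate` of every listed variable independently."""
    rng = random.Random(seed)
    masked = [dict(row) for row in rows]
    n_missing = round(rate * len(rows))
    for var in variables:
        # A fresh random subset of rows is blanked for each variable.
        for i in rng.sample(range(len(rows)), n_missing):
            masked[i][var] = None
    return masked


# Toy data: 20 rows with two covariates.
rows = [{"age": 20 + i, "sex": i % 2} for i in range(20)]
scenario_1 = mask_one_variable(rows, "age", 0.5)
scenario_2 = mask_all_variables(rows, ["age", "sex"], 0.1)
```

    Each masked data set would then be passed to the imputation methods under comparison and the resulting estimates checked against the model fitted to the complete data.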

  4. Impact of non-anticoagulant therapy on patients with sepsis-induced disseminated intravascular coagulation: A multicenter, case-control study.

    PubMed

    Kudo, Daisuke; Hayakawa, Mineji; Ono, Kota; Yamakawa, Kazuma

    2018-03-01

    Anticoagulant therapy for patients with sepsis is not recommended in the latest Surviving Sepsis Campaign guidelines, and non-anticoagulant therapy is the global standard treatment approach at present. We aimed at elucidating the effect of non-anticoagulant therapy on patients with sepsis-induced disseminated intravascular coagulation (DIC), as evidence on this topic has remained inconclusive. Data from 3195 consecutive adult patients admitted to 42 intensive care units for the treatment of severe sepsis were retrospectively analyzed via propensity score analyses with and without multiple imputation. The primary outcome was in-hospital all-cause mortality. Among 1784 patients with sepsis-induced DIC, 745 (41.8%) were not treated with anticoagulants. The inverse probability of treatment-weighted (with and without multiple imputation) and quintile-stratified propensity score analyses (without multiple imputation) indicated a significant association between non-anticoagulant therapy and higher in-hospital all-cause mortality (odds ratio [95% confidence interval]: 1.59 [1.19-2.12], 1.32 [1.02-1.81], and 1.32 [1.03-1.69], respectively). However, quintile-stratified propensity score analyses with multiple imputation and propensity score matching analysis with and without multiple imputation did not show this association. Survival duration was not significantly different between patients in the propensity score-matched non-anticoagulant therapy group and those in the anticoagulant therapy group (Cox regression analysis with and without multiple imputation: hazard ratio [95% confidence interval]: 1.26 [1.00-1.60] and 1.22 [0.93-1.59], respectively). It remains controversial if non-anticoagulant therapy is harmful, equivalent, or beneficial compared with anticoagulant therapy in the treatment of patients with sepsis-induced DIC. Copyright © 2018 Elsevier Ltd. All rights reserved.

  5. Multiple imputation for assessment of exposures to drinking water contaminants: evaluation with the Atrazine Monitoring Program.

    PubMed

    Jones, Rachael M; Stayner, Leslie T; Demirtas, Hakan

    2014-10-01

    Drinking water may contain pollutants that harm human health. The frequency of pollutant monitoring may occur quarterly, annually, or less frequently, depending upon the pollutant, the pollutant concentration, and community water system. However, birth and other health outcomes are associated with narrow time-windows of exposure. Infrequent monitoring impedes linkage between water quality and health outcomes for epidemiological analyses. To evaluate the performance of multiple imputation to fill in water quality values between measurements in community water systems (CWSs). The multiple imputation method was implemented in a simulated setting using data from the Atrazine Monitoring Program (AMP, 2006-2009 in five Midwestern states). Values were deleted from the AMP data to leave one measurement per month. Four patterns reflecting drinking water monitoring regulations were used to delete months of data in each CWS: three patterns were missing at random and one pattern was missing not at random. Synthetic health outcome data were created using a linear and a Poisson exposure-response relationship with five levels of hypothesized association, respectively. The multiple imputation method was evaluated by comparing the exposure-response relationships estimated based on multiply imputed data with the hypothesized association. The four patterns deleted 65-92% months of atrazine observations in AMP data. Even with these high rates of missing information, our procedure was able to recover most of the missing information when the synthetic health outcome was included for missing at random patterns and for missing not at random patterns with low-to-moderate exposure-response relationships. Multiple imputation appears to be an effective method for filling in water quality values between measurements. Copyright © 2014 Elsevier Inc. All rights reserved.

  6. Site-Specific Management of Miscanthus Genotypes for Combustion and Anaerobic Digestion: A Comparison of Energy Yields.

    PubMed

    Kiesel, Andreas; Nunn, Christopher; Iqbal, Yasir; Van der Weijde, Tim; Wagner, Moritz; Özgüven, Mensure; Tarakanov, Ivan; Kalinina, Olena; Trindade, Luisa M; Clifton-Brown, John; Lewandowski, Iris

    2017-01-01

    In Europe, the perennial C4 grass miscanthus is currently mainly cultivated for energy generation via combustion. In recent years, anaerobic digestion has been identified as a promising alternative utilization pathway. Anaerobic digestion produces a higher-value intermediate (biogas), which can be upgraded to biomethane, stored in the existing natural gas infrastructure and further utilized as a transport fuel or in combined heat and power plants. However, the upgrading of the solid biomass into gaseous fuel leads to conversion-related energy losses, the level of which depends on the cultivation parameters genotype, location, and harvest date. Thus, site-specific crop management needs to be adapted to the intended utilization pathway. The objectives of this paper are to quantify (i) the impact of genotype, location and harvest date on energy yields of anaerobic digestion and combustion and (ii) the conversion losses of upgrading solid biomass into biogas. For this purpose, five miscanthus genotypes (OPM 3, 6, 9, 11, 14), three cultivation locations (Adana, Moscow, Stuttgart), and up to six harvest dates (August-March) were assessed. Anaerobic digestion yielded, on average, 35% less energy than combustion. Genotype, location, and harvest date all had significant impacts on the energy yield. For both, this is determined by dry matter yield and ash content and additionally by substrate-specific methane yield for anaerobic digestion and moisture content for combustion. Averaged over all locations and genotypes, an early harvest in August led to 25% and a late harvest to 45% conversion losses. However, each utilization option has its own optimal harvest date, determined by biomass yield, biomass quality, and cutting tolerance. By applying an autumn green harvest for anaerobic digestion and a delayed harvest for combustion, the conversion-related energy loss was reduced to an average of 18%. This clearly shows that the delayed harvest required to maintain biomass quality for combustion is accompanied by high energy losses through yield reduction over winter. The pre-winter harvest applied in the biogas utilization pathway avoids these yield losses and largely compensates for the conversion-related energy losses of anaerobic digestion.

  7. Site-Specific Management of Miscanthus Genotypes for Combustion and Anaerobic Digestion: A Comparison of Energy Yields

    PubMed Central

    Kiesel, Andreas; Nunn, Christopher; Iqbal, Yasir; Van der Weijde, Tim; Wagner, Moritz; Özgüven, Mensure; Tarakanov, Ivan; Kalinina, Olena; Trindade, Luisa M.; Clifton-Brown, John; Lewandowski, Iris

    2017-01-01

    In Europe, the perennial C4 grass miscanthus is currently mainly cultivated for energy generation via combustion. In recent years, anaerobic digestion has been identified as a promising alternative utilization pathway. Anaerobic digestion produces a higher-value intermediate (biogas), which can be upgraded to biomethane, stored in the existing natural gas infrastructure and further utilized as a transport fuel or in combined heat and power plants. However, the upgrading of the solid biomass into gaseous fuel leads to conversion-related energy losses, the level of which depends on the cultivation parameters genotype, location, and harvest date. Thus, site-specific crop management needs to be adapted to the intended utilization pathway. The objectives of this paper are to quantify (i) the impact of genotype, location and harvest date on energy yields of anaerobic digestion and combustion and (ii) the conversion losses of upgrading solid biomass into biogas. For this purpose, five miscanthus genotypes (OPM 3, 6, 9, 11, 14), three cultivation locations (Adana, Moscow, Stuttgart), and up to six harvest dates (August–March) were assessed. Anaerobic digestion yielded, on average, 35% less energy than combustion. Genotype, location, and harvest date all had significant impacts on the energy yield. For both, this is determined by dry matter yield and ash content and additionally by substrate-specific methane yield for anaerobic digestion and moisture content for combustion. Averaged over all locations and genotypes, an early harvest in August led to 25% and a late harvest to 45% conversion losses. However, each utilization option has its own optimal harvest date, determined by biomass yield, biomass quality, and cutting tolerance. By applying an autumn green harvest for anaerobic digestion and a delayed harvest for combustion, the conversion-related energy loss was reduced to an average of 18%. This clearly shows that the delayed harvest required to maintain biomass quality for combustion is accompanied by high energy losses through yield reduction over winter. The pre-winter harvest applied in the biogas utilization pathway avoids these yield losses and largely compensates for the conversion-related energy losses of anaerobic digestion. PMID:28367151

  8. The Contribution of GWAS Loci in Familial Dyslipidemias

    PubMed Central

    Söderlund, Sanni; Surakka, Ida; Matikainen, Niina; Pirinen, Matti; Pajukanta, Päivi; Service, Susan K.; Laurila, Pirkka-Pekka; Ehnholm, Christian; Salomaa, Veikko; Wilson, Richard K.; Palotie, Aarno; Freimer, Nelson B.; Taskinen, Marja-Riitta; Ripatti, Samuli

    2016-01-01

    Familial combined hyperlipidemia (FCH) is a complex and common familial dyslipidemia characterized by elevated total cholesterol and/or triglyceride levels with over five-fold risk of coronary heart disease. The genetic architecture and contribution of rare Mendelian and common variants to FCH susceptibility is unknown. In 53 Finnish FCH families, we genotyped and imputed nine million variants in 715 family members with DNA available. We studied the enrichment of variants previously implicated with monogenic dyslipidemias and/or lipid levels in the general population by comparing allele frequencies between the FCH families and population samples. We also constructed weighted polygenic scores using 212 lipid-associated SNPs and estimated the relative contributions of Mendelian variants and polygenic scores to the risk of FCH in the families. We identified, across the whole allele frequency spectrum, an enrichment of variants known to elevate, and a deficiency of variants known to lower LDL-C and/or TG levels among both probands and affected FCH individuals. The score based on TG associated SNPs was particularly high among affected individuals compared to non-affected family members. Out of 234 affected FCH individuals across the families, seven (3%) carried Mendelian variants and 83 (35%) showed high accumulation of either known LDL-C or TG elevating variants by having either polygenic score over the 90th percentile in the population. The positive predictive value of high score was much higher for affected FCH individuals than for similar sporadic cases in the population. FCH is highly polygenic, supporting the hypothesis that variants across the whole allele frequency spectrum contribute to this complex familial trait. Polygenic SNP panels improve identification of individuals affected with FCH, but their clinical utility remains to be defined. PMID:27227539
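
    A weighted polygenic score of the kind described above is commonly computed as the sum, over SNPs, of an effect-size weight multiplied by the individual's risk-allele dosage. The following is a minimal sketch under that assumption; the function names, the nearest-rank percentile rule and the toy numbers are illustrative and are not taken from the study.

```python
def weighted_polygenic_score(dosages, weights):
    """Sum of per-SNP weights times risk-allele dosages.

    Dosages lie between 0 and 2 and may be fractional for imputed genotypes.
    """
    if len(dosages) != len(weights):
        raise ValueError("dosages and weights must have the same length")
    return sum(w * d for w, d in zip(weights, dosages))


def percentile_cutoff(population_scores, pct):
    """Nearest-rank percentile of the population scores (pct is an integer 1..100)."""
    ordered = sorted(population_scores)
    rank = (pct * len(ordered) + 99) // 100  # ceil(pct * n / 100) in integer arithmetic
    return ordered[rank - 1]


# Hypothetical example: three SNPs, one individual, ten population scores.
score = weighted_polygenic_score([2, 1, 0.4], [0.10, 0.20, 0.50])
cutoff = percentile_cutoff([0.1 * k for k in range(1, 11)], 90)
has_high_score = score > cutoff
```

    An individual would be flagged as having a high polygenic burden when `has_high_score` is true, mirroring the "over the 90th percentile in the population" criterion in the abstract.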

  9. Gene-diet interaction effects on BMI levels in the Singapore Chinese population.

    PubMed

    Chang, Xuling; Dorajoo, Rajkumar; Sun, Ye; Han, Yi; Wang, Ling; Khor, Chiea-Chuen; Sim, Xueling; Tai, E-Shyong; Liu, Jianjun; Yuan, Jian-Min; Koh, Woon-Puay; van Dam, Rob M; Friedlander, Yechiel; Heng, Chew-Kiat

    2018-02-24

    Recent genome-wide association studies (GWAS) have identified 97 body-mass index (BMI) associated loci. We aimed to evaluate if dietary intake modifies BMI associations at these loci in the Singapore Chinese population. We utilized GWAS information from six data subsets from two adult Chinese populations (N = 7817). Seventy-eight genotyped or imputed index BMI single nucleotide polymorphisms (SNPs) that passed quality control procedures were available in all datasets. Alternative Healthy Eating Index (AHEI)-2010 score and ten nutrient variables were evaluated. Linear regression analyses between z score transformed BMI (Z-BMI) and dietary factors were performed. Interaction analyses were performed by introducing the interaction term (diet x SNP) in the same regression model. Analysis was carried out in each cohort individually and subsequently meta-analyzed using the inverse-variance weighted method. Analyses were also evaluated with a weighted gene-risk score (wGRS) constructed from BMI index SNPs from recent large-scale GWAS studies. Nominal associations between Z-BMI and AHEI-2010 and some dietary factors were identified (P = 0.047-0.010). The BMI wGRS was robustly associated with Z-BMI (P = 1.55 × 10^-15) but not with any dietary variables. Dietary variables did not significantly interact with the wGRS to modify BMI associations. When interaction analyses were repeated using individual SNPs, a significant association between cholesterol intake and rs4740619 (CCDC171) was identified (β = 0.077, adjusted P for interaction = 0.043). The CCDC171 gene locus may interact with cholesterol intake to increase BMI in the Singaporean Chinese population; however, most known obesity risk loci were not associated with dietary intake and did not interact with diet to modify BMI levels.
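
    The inverse-variance weighted method mentioned above combines per-cohort estimates by weighting each one by the reciprocal of its squared standard error. A minimal fixed-effect sketch follows; the function name `ivw_meta` and the example numbers are hypothetical, and the abstract does not say whether a fixed-effect or random-effects model was used.

```python
import math


def ivw_meta(betas, standard_errors):
    """Fixed-effect inverse-variance weighted pooling of per-cohort estimates.

    Returns the pooled estimate and its standard error.
    """
    if not betas or len(betas) != len(standard_errors):
        raise ValueError("need one standard error per estimate")
    weights = [1.0 / se ** 2 for se in standard_errors]
    total_weight = sum(weights)
    pooled_beta = sum(w * b for w, b in zip(weights, betas)) / total_weight
    pooled_se = math.sqrt(1.0 / total_weight)
    return pooled_beta, pooled_se


# Two hypothetical cohorts reporting an interaction coefficient and its SE.
pooled_beta, pooled_se = ivw_meta([0.10, 0.30], [0.10, 0.10])
```

    With equal standard errors the pooled estimate is the simple average of the cohort estimates, and the pooled standard error shrinks by a factor of the square root of the number of cohorts.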

  10. Genome-wide study of resistant hypertension identified from electronic health records.

    PubMed

    Dumitrescu, Logan; Ritchie, Marylyn D; Denny, Joshua C; El Rouby, Nihal M; McDonough, Caitrin W; Bradford, Yuki; Ramirez, Andrea H; Bielinski, Suzette J; Basford, Melissa A; Chai, High Seng; Peissig, Peggy; Carrell, David; Pathak, Jyotishman; Rasmussen, Luke V; Wang, Xiaoming; Pacheco, Jennifer A; Kho, Abel N; Hayes, M Geoffrey; Matsumoto, Martha; Smith, Maureen E; Li, Rongling; Cooper-DeHoff, Rhonda M; Kullo, Iftikhar J; Chute, Christopher G; Chisholm, Rex L; Jarvik, Gail P; Larson, Eric B; Carey, David; McCarty, Catherine A; Williams, Marc S; Roden, Dan M; Bottinger, Erwin; Johnson, Julie A; de Andrade, Mariza; Crawford, Dana C

    2017-01-01

    Resistant hypertension is defined as high blood pressure that remains above treatment goals in spite of the concurrent use of three antihypertensive agents from different classes. Despite the important health consequences of resistant hypertension, few studies of resistant hypertension have been conducted. To perform a genome-wide association study for resistant hypertension, we defined and identified cases of resistant hypertension and hypertensives with treated, controlled hypertension among >47,500 adults residing in the US linked to electronic health records (EHRs) and genotyped as part of the electronic MEdical Records & GEnomics (eMERGE) Network. Electronic selection logic using billing codes, laboratory values, text queries, and medication records was used to identify resistant hypertension cases and controls at each site, and a total of 3,006 cases of resistant hypertension and 876 controlled hypertensives were identified among eMERGE Phase I and II sites. After imputation and quality control, a total of 2,530,150 SNPs were tested for an association among 2,830 multi-ethnic cases of resistant hypertension and 876 controlled hypertensives. No test of association was genome-wide significant in the full dataset or in the dataset limited to European American cases (n = 1,719) and controls (n = 708). The most significant finding was CLNK rs13144136 at p = 1.00 × 10^-6 (odds ratio = 0.68; 95% CI = 0.58-0.80) in the full dataset with similar results in the European American only dataset. We also examined whether SNPs known to influence blood pressure or hypertension also influenced resistant hypertension. None was significant after correction for multiple testing. These data highlight both the difficulties and the potential utility of EHR-linked genomic data to study clinically-relevant traits such as resistant hypertension.

  11. Leaf transpiration plays a role in phosphorus acquisition among a large set of chickpea genotypes.

    PubMed

    Pang, Jiayin; Zhao, Hongxia; Bansal, Ruchi; Bohuon, Emilien; Lambers, Hans; Ryan, Megan H; Siddique, Kadambot H M

    2018-01-09

    Low availability of inorganic phosphorus (P) is considered a major constraint for crop productivity worldwide. A unique set of 266 chickpea (Cicer arietinum L.) genotypes, originating from 29 countries and with diverse genetic background, were used to study P-use efficiency. Plants were grown in pots containing sterilized river sand supplied with P at a rate of 10 μg P g^-1 soil as FePO4, a poorly soluble form of P. The results showed large genotypic variation in plant growth, shoot P content, physiological P-use efficiency, and P-utilization efficiency in response to low P supply. Further investigation of a subset of 100 chickpea genotypes with contrasting growth performance showed significant differences in photosynthetic rate and photosynthetic P-use efficiency. A positive correlation was found between leaf P concentration and transpiration rate of the young fully expanded leaves. For the first time, our study has suggested a role of leaf transpiration in P acquisition, consistent with transpiration-driven mass flow in chickpea grown in low-P sandy soils. The identification of 6 genotypes with high plant growth, P-acquisition, and P-utilization efficiency suggests that the chickpea reference set can be used in breeding programmes to improve both P-acquisition and P-utilization efficiency under low-P conditions. © 2018 John Wiley & Sons Ltd.

  12. Transitioning to multiple imputation: a new method to impute missing blood alcohol concentration (BAC) values in FARS

    DOT National Transportation Integrated Search

    2002-01-01

    The National Center for Statistics and Analysis (NCSA) of the National Highway Traffic Safety Administration (NHTSA) has undertaken several approaches to remedy the problem of missing blood alcohol test results in the Fatality Analysis Reporting ...

  13. Multiple Imputation for Multivariate Missing-Data Problems: A Data Analyst's Perspective.

    ERIC Educational Resources Information Center

    Schafer, Joseph L.; Olsen, Maren K.

    1998-01-01

    The key ideas of multiple imputation for multivariate missing data problems are reviewed. Software programs available for this analysis are described, and their use is illustrated with data from the Adolescent Alcohol Prevention Trial (W. Hansen and J. Graham, 1991). (SLD)

  14. 48 CFR 1830.7002-4 - Determining imputed cost of money.

    Code of Federal Regulations, 2011 CFR

    2011-10-01

    ... 48 Federal Acquisition Regulations System 6 2011-10-01 2011-10-01 false Determining imputed cost... AND SPACE ADMINISTRATION GENERAL CONTRACTING REQUIREMENTS COST ACCOUNTING STANDARDS ADMINISTRATION... investment (see 1830.7002-3). (1) When a representative investment is determined for a cost accounting period...

  15. Imputational Modeling of Spatial Context and Social Environmental Predictors of Walking in an Underserved Community: The PATH Trial

    PubMed Central

    Ellerbe, Caitlyn; Lawson, Andrew B.; Alia, Kassandra A.; Meyers, Duncan C.; Coulon, Sandra M.; Lawman, Hannah G.

    2013-01-01

    Background This study examined imputational modeling effects of spatial proximity and social factors of walking in African American adults. Purpose Models were compared that examined relationships between household proximity to a walking trail and social factors in determining walking status. Methods Participants (N=133; 66% female; mean age=55 yrs) were recruited to a police-supported walking and social marketing intervention. Bayesian modeling was used to identify predictors of walking at 12 months. Results Sensitivity analysis using different imputation approaches, and spatial contextual effects, were compared. All the imputation methods showed social life and income were significant predictors of walking, however, the complete data approach was the best model indicating Age (1.04, 95% OR: 1.00, 1.08), Social Life (0.83, 95% OR: 0.69, 0.98) and Income > $10,000 (0.10, 95% OR: 0.01, 0.97) were all predictors of walking. Conclusions The complete data approach was the best model of predictors of walking in African Americans. PMID:23481250

  16. Imputational modeling of spatial context and social environmental predictors of walking in an underserved community: the PATH trial.

    PubMed

    Wilson, Dawn K; Ellerbe, Caitlyn; Lawson, Andrew B; Alia, Kassandra A; Meyers, Duncan C; Coulon, Sandra M; Lawman, Hannah G

    2013-03-01

    This study examined imputational modeling effects of spatial proximity and social factors of walking in African American adults. Models were compared that examined relationships between household proximity to a walking trail and social factors in determining walking status. Participants (N=133; 66% female; mean age=55 years) were recruited to a police-supported walking and social marketing intervention. Bayesian modeling was used to identify predictors of walking at 12 months. Sensitivity analysis using different imputation approaches, and spatial contextual effects, were compared. All the imputation methods showed social life and income were significant predictors of walking, however, the complete data approach was the best model indicating Age (1.04, 95% OR: 1.00, 1.08), Social Life (0.83, 95% OR: 0.69, 0.98) and Income <$10,000 (0.10, 95% OR: 0.01, 0.97) were all predictors of walking. The complete data approach was the best model of predictors of walking in African Americans. Copyright © 2012 Elsevier Ltd. All rights reserved.

  17. Multiple imputation for multivariate data with missing and below-threshold measurements: time-series concentrations of pollutants in the Arctic.

    PubMed

    Hopke, P K; Liu, C; Rubin, D B

    2001-03-01

    Many chemical and environmental data sets are complicated by the existence of fully missing values or censored values known to lie below detection thresholds. For example, week-long samples of airborne particulate matter were obtained at Alert, NWT, Canada, between 1980 and 1991, where some of the concentrations of 24 particulate constituents were coarsened in the sense of being either fully missing or below detection limits. To facilitate scientific analysis, it is appealing to create complete data by filling in missing values so that standard complete-data methods can be applied. We briefly review commonly used strategies for handling missing values and focus on the multiple-imputation approach, which generally leads to valid inferences when faced with missing data. Three statistical models are developed for multiply imputing the missing values of airborne particulate matter. We expect that these models are useful for creating multiple imputations in a variety of incomplete multivariate time series data sets.
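
    Once several completed data sets have been analysed, the per-imputation results are usually combined with the standard pooling rules for multiple imputation (Rubin's rules). The sketch below shows only that pooling step, under the assumption of a scalar estimate with one variance per imputed data set; it does not represent the three imputation models developed in the paper, and the name `pool_rubin` is hypothetical.

```python
from statistics import mean, variance


def pool_rubin(estimates, variances):
    """Pool m per-imputation estimates and their variances.

    Returns the pooled estimate and its total variance
    T = W + (1 + 1/m) * B, where W is the mean within-imputation variance
    and B is the between-imputation variance of the estimates.
    """
    m = len(estimates)
    if m < 2 or m != len(variances):
        raise ValueError("need at least two imputations, one variance each")
    pooled = mean(estimates)
    within = mean(variances)
    between = variance(estimates)  # sample variance, divisor m - 1
    total = within + (1 + 1 / m) * between
    return pooled, total


# Hypothetical results from three imputed data sets.
pooled, total_var = pool_rubin([1.0, 2.0, 3.0], [0.5, 0.5, 0.5])
```

    The between-imputation term is what carries the uncertainty due to the missing values; a single imputation would omit it and understate the variance.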

  18. Imputation of missing data in time series for air pollutants

    NASA Astrophysics Data System (ADS)

    Junger, W. L.; Ponce de Leon, A.

    2015-02-01

    Missing data are a major concern in epidemiological studies of the health effects of environmental air pollutants. This article presents an imputation-based method that is suitable for multivariate time series data, which uses the EM algorithm under the assumption of normal distribution. Different approaches are considered for filtering the temporal component. A simulation study was performed to assess the validity and performance of the proposed method in comparison with some frequently used methods. Simulations showed that when the amount of missing data was as low as 5%, the complete data analysis yielded satisfactory results regardless of the generating mechanism of the missing data, whereas the validity began to degenerate when the proportion of missing values exceeded 10%. The proposed imputation method exhibited good accuracy and precision in different settings with respect to the patterns of missing observations. Most of the imputations obtained valid results, even under missing not at random. The methods proposed in this study are implemented as a package called mtsdi for the statistical software system R.

  19. Estimating Interaction Effects With Incomplete Predictor Variables

    PubMed Central

    Enders, Craig K.; Baraldi, Amanda N.; Cham, Heining

    2014-01-01

    The existing missing data literature does not provide a clear prescription for estimating interaction effects with missing data, particularly when the interaction involves a pair of continuous variables. In this article, we describe maximum likelihood and multiple imputation procedures for this common analysis problem. We outline 3 latent variable model specifications for interaction analyses with missing data. These models apply procedures from the latent variable interaction literature to analyses with a single indicator per construct (e.g., a regression analysis with scale scores). We also discuss multiple imputation for interaction effects, emphasizing an approach that applies standard imputation procedures to the product of 2 raw score predictors. We thoroughly describe the process of probing interaction effects with maximum likelihood and multiple imputation. For both missing data handling techniques, we outline centering and transformation strategies that researchers can implement in popular software packages, and we use a series of real data analyses to illustrate these methods. Finally, we use computer simulations to evaluate the performance of the proposed techniques. PMID:24707955

  20. Statistical methods for incomplete data: Some results on model misspecification.

    PubMed

    McIsaac, Michael; Cook, R J

    2017-02-01

    Inverse probability weighted estimating equations and multiple imputation are two of the most studied frameworks for dealing with incomplete data in clinical and epidemiological research. We examine the limiting behaviour of estimators arising from inverse probability weighted estimating equations, augmented inverse probability weighted estimating equations and multiple imputation when the requisite auxiliary models are misspecified. We compute limiting values for settings involving binary responses and covariates and illustrate the effects of model misspecification using simulations based on data from a breast cancer clinical trial. We demonstrate that, even when both auxiliary models are misspecified, the asymptotic biases of double-robust augmented inverse probability weighted estimators are often smaller than the asymptotic biases of estimators arising from complete-case analyses, inverse probability weighting or multiple imputation. We further demonstrate that use of inverse probability weighting or multiple imputation with slightly misspecified auxiliary models can actually result in greater asymptotic bias than the use of naïve, complete case analyses. These asymptotic results are shown to be consistent with empirical results from simulation studies.

  1. Cost effectiveness of the addition of a comprehensive CT scan to the abdomen and pelvis for the detection of cancer after unprovoked venous thromboembolism.

    PubMed

    Coyle, Kathryn; Carrier, Marc; Lazo-Langner, Alejandro; Shivakumar, Sudeep; Zarychanski, Ryan; Tagalakis, Vicky; Solymoss, Susan; Routhier, Nathalie; Douketis, James; Coyle, Douglas

    2017-03-01

    Unprovoked venous thromboembolism (VTE) can be the first manifestation of cancer. It is unclear if extensive screening for occult cancer including a comprehensive computed tomography (CT) scan of the abdomen/pelvis is cost-effective in this patient population. To assess the health care related costs, number of missed cancer cases and health related utility values of a limited screening strategy with and without the addition of a comprehensive CT scan of the abdomen/pelvis and to identify to what extent testing should be done in these circumstances to allow early detection of occult cancers. Cost effectiveness analysis using data that was collected alongside the SOME randomized controlled trial which compared an extensive occult cancer screening including a CT of the abdomen/pelvis to a more limited screening strategy in patients with a first unprovoked VTE, was used for the current analyses. Analyses were conducted with a one-year time horizon from a Canadian health care perspective. Primary analysis was based on complete cases, with sensitivity analysis using appropriate multiple imputation methods to account for missing data. Data from a total of 854 patients with a first unprovoked VTE were included in these analyses. The addition of a comprehensive CT scan was associated with higher costs ($551 CDN) with no improvement in utility values or number of missed cancers. Results were consistent when adopting multiple imputation methods. The addition of a comprehensive CT scan of the abdomen/pelvis for the screening of occult cancer in patients with unprovoked VTE is not cost effective, as it is both more costly and not more effective in detecting occult cancer. Copyright © 2017 Elsevier Ltd. All rights reserved.

  2. Cost-effectiveness of a nurse facilitated, cognitive behavioural self-management programme compared with usual care using a CBT manual alone for patients with heart failure: secondary analysis of data from the SEMAPHFOR trial.

    PubMed

    Mejía, Aurelio; Richardson, Gerry; Pattenden, Jill; Cockayne, Sarah; Lewin, Robert

    2014-09-01

    To assess the cost-effectiveness of a nurse facilitated, cognitive behavioural self-management programme for patients with heart failure compared with usual care including the un-facilitated access to the same manual, from the perspective of the NHS. Data were obtained from a pragmatic, multi-centre, randomized controlled 'open' trial conducted in seven centres in the UK between 2006 and 2008. Effectiveness was estimated as Quality-Adjusted Life Years. Resource use was measured prospectively on all patients using information provided by patients in postal questionnaires, case-note review, electronic record review and interviews with patients. Unit costs were obtained from the literature and applied to the relevant resource use to estimate total costs. Multiple imputation was used to handle missing data. There were no substantial differences in the utility scores between treatment groups in all follow-up assessments, in the use of medication or outpatient visits and both groups report a similar frequency of contact with health care professionals. After controlling for baseline utility and using the imputed dataset, treatment was associated with a reduction in QALY of 0.004 and an additional cost of £69.49. The probability that the intervention is cost-effective for thresholds between £20,000 and £30,000 is around 45%. There is little evidence that the addition of the intervention had any effect on costs or outcomes. The uncertainty around both estimates of cost and effectiveness means that it is not reasonable to make recommendations based on cost-effectiveness alone. Copyright © 2014 Elsevier Ltd. All rights reserved.
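
    The point estimates reported above can be turned into an incremental net monetary benefit (NMB) at each willingness-to-pay threshold. The sketch below uses the standard definition, NMB = threshold × ΔQALY − Δcost. The function name is hypothetical, and the calculation ignores the sampling uncertainty that the abstract stresses, so it is only an illustration of the arithmetic, not a reproduction of the authors' analysis.

```python
def net_monetary_benefit(delta_qaly, delta_cost, threshold):
    """Incremental NMB: monetised health gain minus incremental cost."""
    return threshold * delta_qaly - delta_cost


# Point estimates quoted in the abstract (intervention vs. usual care).
delta_qaly = -0.004   # reduction in QALYs
delta_cost = 69.49    # additional cost in GBP

for threshold in (20_000, 30_000):
    nmb = net_monetary_benefit(delta_qaly, delta_cost, threshold)
    print(f"threshold {threshold:,} GBP per QALY: incremental NMB = {nmb:.2f} GBP")
```

    At both thresholds the point estimate of the incremental NMB is negative (about −149 and −189 GBP), which fits a reported probability of cost-effectiveness below 50%.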

  3. Missing data treatments matter: an analysis of multiple imputation for anterior cervical discectomy and fusion procedures.

    PubMed

    Ondeck, Nathaniel T; Fu, Michael C; Skrip, Laura A; McLynn, Ryan P; Cui, Jonathan J; Basques, Bryce A; Albert, Todd J; Grauer, Jonathan N

    2018-04-09

    The presence of missing data is a limitation of large datasets, including the National Surgical Quality Improvement Program (NSQIP). In addressing this issue, most studies use complete case analysis, which excludes cases with missing data, thus potentially introducing selection bias. Multiple imputation, a statistically rigorous approach that approximates missing data and preserves sample size, may be an improvement over complete case analysis. The present study aims to evaluate the impact of using multiple imputation in comparison with complete case analysis for assessing the associations between preoperative laboratory values and adverse outcomes following anterior cervical discectomy and fusion (ACDF) procedures. This is a retrospective review of prospectively collected data. Patients undergoing one-level ACDF were identified in NSQIP 2012-2015. Perioperative adverse outcome variables assessed included the occurrence of any adverse event, severe adverse events, and hospital readmission. Missing preoperative albumin and hematocrit values were handled using complete case analysis and multiple imputation. These preoperative laboratory levels were then tested for associations with 30-day postoperative outcomes using logistic regression. A total of 11,999 patients were included. Of this cohort, 63.5% of patients had missing preoperative albumin and 9.9% had missing preoperative hematocrit. When using complete case analysis, only 4,311 patients were studied. The removed patients were significantly younger, healthier, of a common body mass index, and male. Logistic regression analysis failed to identify either preoperative hypoalbuminemia or preoperative anemia as significantly associated with adverse outcomes. When employing multiple imputation, all 11,999 patients were included. Preoperative hypoalbuminemia was significantly associated with the occurrence of any adverse event and severe adverse events. Preoperative anemia was significantly associated with the occurrence of any adverse event, severe adverse events, and hospital readmission. Multiple imputation is a rigorous statistical procedure that is being increasingly used to address missing values in large datasets. Using this technique for ACDF avoided the loss of cases that may have affected the representativeness and power of the study and led to different results than complete case analysis. Multiple imputation should be considered for future spine studies. Copyright © 2018 Elsevier Inc. All rights reserved.

  4. Assessment of predictive performance in incomplete data by combining internal validation and multiple imputation.

    PubMed

    Wahl, Simone; Boulesteix, Anne-Laure; Zierer, Astrid; Thorand, Barbara; van de Wiel, Mark A

    2016-10-26

    Missing values are a frequent issue in human studies. In many situations, multiple imputation (MI) is an appropriate missing data handling strategy, whereby missing values are imputed multiple times, the analysis is performed in every imputed data set, and the obtained estimates are pooled. If the aim is to estimate (added) predictive performance measures, such as (change in) the area under the receiver-operating characteristic curve (AUC), internal validation strategies become desirable in order to correct for optimism. It is not fully understood how internal validation should be combined with multiple imputation. In a comprehensive simulation study and in a real data set based on blood markers as predictors for mortality, we compare three combination strategies: Val-MI, internal validation followed by MI on the training and test parts separately, MI-Val, MI on the full data set followed by internal validation, and MI(-y)-Val, MI on the full data set omitting the outcome followed by internal validation. Different validation strategies, including bootstrap and cross-validation, different (added) performance measures, and various data characteristics are considered, and the strategies are evaluated with regard to bias and mean squared error of the obtained performance estimates. In addition, we elaborate on the number of resamples and imputations to be used, and adapt a strategy for confidence interval construction to incomplete data. Internal validation is essential in order to avoid optimism, with the bootstrap 0.632+ estimate representing a reliable method to correct for optimism. While estimates obtained by MI-Val are optimistically biased, those obtained by MI(-y)-Val tend to be pessimistic in the presence of a true underlying effect. Val-MI provides largely unbiased estimates, with a slight pessimistic bias with increasing true effect size, number of covariates and decreasing sample size. In Val-MI, accuracy of the estimate is more strongly improved by increasing the number of bootstrap draws rather than the number of imputations. With a simple integrated approach, valid confidence intervals for performance estimates can be obtained. When prognostic models are developed on incomplete data, Val-MI represents a valid strategy to obtain estimates of predictive performance measures.
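
    The "impute, analyse, pool" workflow described at the start of this abstract is commonly completed with Rubin's rules. The following is a minimal sketch of that pooling step only, assuming one scalar estimate and its variance per imputed data set. The function name and the example numbers are hypothetical, and the abstract does not state that the authors pool performance measures in exactly this way.

```python
from statistics import mean, variance


def pool_rubin(estimates, variances):
    """Pool m per-imputation estimates with Rubin's rules.

    Returns the pooled estimate and its total variance
    T = W + (1 + 1/m) * B, where W is the mean within-imputation
    variance and B the between-imputation (sample) variance.
    """
    m = len(estimates)
    if m < 2 or m != len(variances):
        raise ValueError("need at least two estimates and matching variances")
    pooled = mean(estimates)
    within = mean(variances)
    between = variance(estimates)  # sample variance, denominator m - 1
    total = within + (1 + 1 / m) * between
    return pooled, total


# Hypothetical estimates from three imputed data sets.
pooled, total = pool_rubin([0.70, 0.72, 0.74], [0.0004, 0.0004, 0.0004])
print(f"pooled estimate = {pooled:.3f}, total variance = {total:.6f}")
```

    The between-imputation term is what separates multiple imputation from single imputation: it carries the extra uncertainty caused by the missing values into the pooled variance.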

  5. Validation sampling can reduce bias in healthcare database studies: an illustration using influenza vaccination effectiveness

    PubMed Central

    Nelson, Jennifer C.; Marsh, Tracey; Lumley, Thomas; Larson, Eric B.; Jackson, Lisa A.; Jackson, Michael

    2014-01-01

    Objective: Estimates of treatment effectiveness in epidemiologic studies using large observational health care databases may be biased due to inaccurate or incomplete information on important confounders. Study methods that collect and incorporate more comprehensive confounder data on a validation cohort may reduce confounding bias. Study Design and Setting: We applied two such methods, imputation and reweighting, to Group Health administrative data (full sample) supplemented by more detailed confounder data from the Adult Changes in Thought study (validation sample). We used influenza vaccination effectiveness (with an unexposed comparator group) as an example and evaluated each method’s ability to reduce bias using the control time period prior to influenza circulation. Results: Both methods reduced, but did not completely eliminate, the bias compared with traditional effectiveness estimates that do not utilize the validation sample confounders. Conclusion: Although these results support the use of validation sampling methods to improve the accuracy of comparative effectiveness findings from healthcare database studies, they also illustrate that the success of such methods depends on many factors, including the ability to measure important confounders in a representative and large enough validation sample, the comparability of the full sample and validation sample, and the accuracy with which data can be imputed or reweighted using the additional validation sample information. PMID:23849144

  6. Statistical tests for detecting associations with groups of genetic variants: generalization, evaluation, and implementation

    PubMed Central

    Ferguson, John; Wheeler, William; Fu, YiPing; Prokunina-Olsson, Ludmila; Zhao, Hongyu; Sampson, Joshua

    2013-01-01

    With recent advances in sequencing, genotyping arrays, and imputation, GWAS now aim to identify associations with rare and uncommon genetic variants. Here, we describe and evaluate a class of statistics, generalized score statistics (GSS), that can test for an association between a group of genetic variants and a phenotype. GSS are a simple weighted sum of single-variant statistics and their cross-products. We show that the majority of statistics currently used to detect associations with rare variants are equivalent to choosing a specific set of weights within this framework. We then evaluate the power of various weighting schemes as a function of variant characteristics, such as MAF, the proportion associated with the phenotype, and the direction of effect. Ultimately, we find that two classical tests are robust and powerful, but details are provided as to when other GSS may perform favorably. The software package CRaVe is available at our website (http://dceg.cancer.gov/bb/tools/crave). PMID:23092956
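
    The description of GSS as "a simple weighted sum of single-variant statistics and their cross-products" can be written out as a quadratic form. The notation below is an assumption made for illustration (S_i for the statistic of variant i, w_ij for the weights, a_i for per-variant weights) and is not taken from the paper.

```latex
% Generalized score statistic over m variants (assumed notation):
Q = \sum_{i=1}^{m} \sum_{j=1}^{m} w_{ij}\, S_i S_j

% Two familiar special cases of the weight matrix:
%   w_{ij} = a_i a_j       gives a burden-type statistic
Q_{\text{burden}} = \Bigl( \sum_{i=1}^{m} a_i S_i \Bigr)^{2}
%   w_{ij} = 0 for i \neq j gives a sum of squared single-variant statistics
Q_{\text{sq}} = \sum_{i=1}^{m} w_{ii}\, S_i^{2}
```

    In this form, choosing a test amounts to choosing the weights w_ij, which is the sense in which the abstract calls existing rare-variant tests special cases of one framework.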

  7. Fine-mapping identifies multiple prostate cancer risk loci at 5p15, one of which associates with TERT expression

    PubMed Central

    Kote-Jarai, Zsofia; Saunders, Edward J.; Leongamornlert, Daniel A.; Tymrakiewicz, Malgorzata; Dadaev, Tokhir; Jugurnauth-Little, Sarah; Ross-Adams, Helen; Al Olama, Ali Amin; Benlloch, Sara; Halim, Silvia; Russel, Roslin; Dunning, Alison M.; Luccarini, Craig; Dennis, Joe; Neal, David E.; Hamdy, Freddie C.; Donovan, Jenny L.; Muir, Ken; Giles, Graham G.; Severi, Gianluca; Wiklund, Fredrik; Gronberg, Henrik; Haiman, Christopher A.; Schumacher, Fredrick; Henderson, Brian E.; Le Marchand, Loic; Lindstrom, Sara; Kraft, Peter; Hunter, David J.; Gapstur, Susan; Chanock, Stephen; Berndt, Sonja I.; Albanes, Demetrius; Andriole, Gerald; Schleutker, Johanna; Weischer, Maren; Canzian, Federico; Riboli, Elio; Key, Tim J.; Travis, Ruth C.; Campa, Daniele; Ingles, Sue A.; John, Esther M.; Hayes, Richard B.; Pharoah, Paul; Khaw, Kay-Tee; Stanford, Janet L.; Ostrander, Elaine A.; Signorello, Lisa B.; Thibodeau, Stephen N.; Schaid, Dan; Maier, Christiane; Vogel, Walther; Kibel, Adam S.; Cybulski, Cezary; Lubinski, Jan; Cannon-Albright, Lisa; Brenner, Hermann; Park, Jong Y.; Kaneva, Radka; Batra, Jyotsna; Spurdle, Amanda; Clements, Judith A.; Teixeira, Manuel R.; Govindasami, Koveela; Guy, Michelle; Wilkinson, Rosemary A.; Sawyer, Emma J.; Morgan, Angela; Dicks, Ed; Baynes, Caroline; Conroy, Don; Bojesen, Stig E.; Kaaks, Rudolf; Vincent, Daniel; Bacot, François; Tessier, Daniel C.; Easton, Douglas F.; Eeles, Rosalind A.

    2013-01-01

    Associations between single nucleotide polymorphisms (SNPs) at 5p15 and multiple cancer types have been reported. We have previously shown evidence for a strong association between prostate cancer (PrCa) risk and rs2242652 at 5p15, intronic in the telomerase reverse transcriptase (TERT) gene that encodes TERT. To comprehensively evaluate the association between genetic variation across this region and PrCa, we performed a fine-mapping analysis by genotyping 134 SNPs using a custom Illumina iSelect array or Sequenom MassArray iPlex, followed by imputation of 1094 SNPs in 22 301 PrCa cases and 22 320 controls in The PRACTICAL consortium. Multiple stepwise logistic regression analysis identified four signals in the promoter or intronic regions of TERT that independently associated with PrCa risk. Gene expression analysis of normal prostate tissue showed evidence that SNPs within one of these regions also associated with TERT expression, providing a potential mechanism for predisposition to disease. PMID:23535824

  8. Partial F-tests with multiply imputed data in the linear regression framework via coefficient of determination.

    PubMed

    Chaurasia, Ashok; Harel, Ofer

    2015-02-10

    Tests for regression coefficients such as global, local, and partial F-tests are common in applied research. In the framework of multiple imputation, there are several papers addressing tests for regression coefficients. However, for simultaneous hypothesis testing, the existing methods are computationally intensive because they involve calculation with vectors and (inversion of) matrices. In this paper, we propose a simple method based on the scalar entity, coefficient of determination, to perform (global, local, and partial) F-tests with multiply imputed data. The proposed method is evaluated using simulated data and applied to suicide prevention data. Copyright © 2014 John Wiley & Sons, Ltd.
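
    For a single complete data set, the partial F statistic can be computed from the coefficients of determination of the full and reduced models alone, which is the scalar quantity this abstract builds on. The sketch below shows only that complete-data identity. The function and argument names are hypothetical, and the authors' rule for combining results across multiply imputed data sets is not reproduced here.

```python
def partial_f_from_r2(r2_full, r2_reduced, n, p_full, q):
    """Partial F statistic from R-squared values (complete-data case).

    n       number of observations
    p_full  number of predictors in the full model (intercept excluded)
    q       number of coefficients tested (dropped in the reduced model)

    F = ((R2_full - R2_reduced) / q) / ((1 - R2_full) / (n - p_full - 1)),
    referred to an F distribution with (q, n - p_full - 1) degrees of freedom.
    """
    df_resid = n - p_full - 1
    if df_resid <= 0 or q <= 0:
        raise ValueError("degrees of freedom must be positive")
    if not 0.0 <= r2_reduced <= r2_full < 1.0:
        raise ValueError("need 0 <= r2_reduced <= r2_full < 1")
    return ((r2_full - r2_reduced) / q) / ((1.0 - r2_full) / df_resid)


# Hypothetical example: 100 observations, 5 predictors, testing 2 of them.
f_stat = partial_f_from_r2(r2_full=0.5, r2_reduced=0.4, n=100, p_full=5, q=2)
print(f"partial F = {f_stat:.2f} on (2, 94) degrees of freedom")
```

    Setting r2_reduced to 0 and q to p_full gives the global F-test, so the same function covers the global and partial cases named in the abstract.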

  9. 5 CFR 919.630 - May the OPM impute conduct of one person to another?

    Code of Federal Regulations, 2011 CFR

    2011-01-01

    ... to another? 919.630 Section 919.630 Administrative Personnel OFFICE OF PERSONNEL MANAGEMENT...'s knowledge, approval or acquiescence. The organization's acceptance of the benefits derived from the conduct is evidence of knowledge, approval or acquiescence. (b) Conduct imputed from an...

  10. 5 CFR 919.630 - May the OPM impute conduct of one person to another?

    Code of Federal Regulations, 2014 CFR

    2014-01-01

    ... to another? 919.630 Section 919.630 Administrative Personnel OFFICE OF PERSONNEL MANAGEMENT...'s knowledge, approval or acquiescence. The organization's acceptance of the benefits derived from the conduct is evidence of knowledge, approval or acquiescence. (b) Conduct imputed from an...

  11. 5 CFR 919.630 - May the OPM impute conduct of one person to another?

    Code of Federal Regulations, 2012 CFR

    2012-01-01

    ... to another? 919.630 Section 919.630 Administrative Personnel OFFICE OF PERSONNEL MANAGEMENT...'s knowledge, approval or acquiescence. The organization's acceptance of the benefits derived from the conduct is evidence of knowledge, approval or acquiescence. (b) Conduct imputed from an...

  12. 5 CFR 919.630 - May the OPM impute conduct of one person to another?

    Code of Federal Regulations, 2013 CFR

    2013-01-01

    ... to another? 919.630 Section 919.630 Administrative Personnel OFFICE OF PERSONNEL MANAGEMENT...'s knowledge, approval or acquiescence. The organization's acceptance of the benefits derived from the conduct is evidence of knowledge, approval or acquiescence. (b) Conduct imputed from an...

  13. An Introduction to Modern Missing Data Analyses

    ERIC Educational Resources Information Center

    Baraldi, Amanda N.; Enders, Craig K.

    2010-01-01

    A great deal of recent methodological research has focused on two modern missing data analysis methods: maximum likelihood and multiple imputation. These approaches are advantageous to traditional techniques (e.g. deletion and mean imputation techniques) because they require less stringent assumptions and mitigate the pitfalls of traditional…

  14. [Imputation methods for missing data in educational diagnostic evaluation].

    PubMed

    Fernández-Alonso, Rubén; Suárez-Álvarez, Javier; Muñiz, José

    2012-02-01

    In the diagnostic evaluation of educational systems, self-reports are commonly used to collect data, both cognitive and orectic. For various reasons, in these self-reports, some of the students' data are frequently missing. The main goal of this research is to compare the performance of different imputation methods for missing data in the context of the evaluation of educational systems. On an empirical database of 5,000 subjects, 72 conditions were simulated: three levels of missing data, three types of loss mechanisms, and eight methods of imputation. The levels of missing data were 5%, 10%, and 20%. The loss mechanisms were set at: Missing completely at random, moderately conditioned, and strongly conditioned. The eight imputation methods used were: listwise deletion, replacement by the mean of the scale, by the item mean, the subject mean, the corrected subject mean, multiple regression, and Expectation-Maximization (EM) algorithm, with and without auxiliary variables. The results indicate that the recovery of the data is more accurate when using an appropriate combination of different methods of recovering lost data. When a case is incomplete, the mean of the subject works very well, whereas for completely lost data, multiple imputation with the EM algorithm is recommended. The use of this combination is especially recommended when data loss is greater and its loss mechanism is more conditioned. Lastly, the results are discussed, and some future lines of research are analyzed.
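
    Of the methods compared above, replacement by the subject mean is the simplest to state: each missing item is filled with the mean of the items that the same respondent did answer. A minimal sketch follows, assuming responses are stored as lists with None marking a missing item. The function name and data layout are hypothetical, and the "corrected" subject mean variant mentioned in the abstract is not implemented.

```python
def impute_subject_mean(responses):
    """Fill each missing item (None) with the respondent's own observed mean.

    Rows with no observed items are returned unchanged, because a subject
    mean cannot be computed for them.
    """
    imputed = []
    for row in responses:
        observed = [value for value in row if value is not None]
        if not observed:
            imputed.append(list(row))
            continue
        subject_mean = sum(observed) / len(observed)
        imputed.append([subject_mean if value is None else value for value in row])
    return imputed


# Hypothetical item scores for three respondents.
scores = [
    [1, None, 3],        # partially missing: gap filled with (1 + 3) / 2
    [None, None, None],  # completely missing: left as is
    [2, 2, None],
]
print(impute_subject_mean(scores))
```

    The untouched second row mirrors the distinction the abstract draws between incomplete cases and completely lost data, for which it recommends an EM-based approach instead.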

  15. What are the appropriate methods for analyzing patient-reported outcomes in randomized trials when data are missing?

    PubMed

    Hamel, J F; Sebille, V; Le Neel, T; Kubis, G; Boyer, F C; Hardouin, J B

    2017-12-01

    Subjective health measurements using Patient Reported Outcomes (PRO) are increasingly used in randomized trials, particularly for patient group comparisons. Two main types of analytical strategies can be used for such data: Classical Test Theory (CTT) and Item Response Theory models (IRT). These two strategies display very similar characteristics when data are complete, but in the common case when data are missing, whether IRT or CTT would be the most appropriate remains unknown and was investigated using simulations. We simulated PRO data such as quality of life data. Missing responses to items were simulated as being completely random, depending on an observable covariate or on an unobserved latent trait. The considered CTT-based methods allowed comparing scores using complete-case analysis, personal mean imputations or multiple-imputations based on a two-way procedure. The IRT-based method was the Wald test on a Rasch model including a group covariate. The IRT-based method and the multiple-imputations-based method for CTT displayed the highest observed power and were the only unbiased methods whatever the kind of missing data. Online software and Stata® modules compatible with the innate mi impute suite are provided for performing such analyses. Traditional procedures (listwise deletion and personal mean imputations) should be avoided, due to inevitable problems of biases and lack of power.

  16. Combining Fourier and lagged k-nearest neighbor imputation for biomedical time series data.

    PubMed

    Rahman, Shah Atiqur; Huang, Yuxiao; Claassen, Jan; Heintzman, Nathaniel; Kleinberg, Samantha

    2015-12-01

    Most clinical and biomedical data contain missing values. A patient's record may be split across multiple institutions, devices may fail, and sensors may not be worn at all times. While these missing values are often ignored, this can lead to bias and error when the data are mined. Further, the data are not simply missing at random. Instead the measurement of a variable such as blood glucose may depend on its prior values as well as that of other variables. These dependencies exist across time as well, but current methods have yet to incorporate these temporal relationships as well as multiple types of missingness. To address this, we propose an imputation method (FLk-NN) that incorporates time lagged correlations both within and across variables by combining two imputation methods, based on an extension to k-NN and the Fourier transform. This enables imputation of missing values even when all data at a time point is missing and when there are different types of missingness both within and across variables. In comparison to other approaches on three biological datasets (simulated and actual Type 1 diabetes datasets, and multi-modality neurological ICU monitoring) the proposed method has the highest imputation accuracy. This was true for up to half the data being missing and when consecutive missing values are a significant fraction of the overall time series length. Copyright © 2015 Elsevier Inc. All rights reserved.

  17. Investigating the Effects of Imputation Methods for Modelling Gene Networks Using a Dynamic Bayesian Network from Gene Expression Data

    PubMed Central

    CHAI, Lian En; LAW, Chow Kuan; MOHAMAD, Mohd Saberi; CHONG, Chuii Khim; CHOON, Yee Wen; DERIS, Safaai; ILLIAS, Rosli Md

    2014-01-01

    Background: Gene expression data often contain missing expression values. Therefore, several imputation methods have been applied to solve the missing values, which include k-nearest neighbour (kNN), local least squares (LLS), and Bayesian principal component analysis (BPCA). However, the effects of these imputation methods on the modelling of gene regulatory networks from gene expression data have rarely been investigated and analysed using a dynamic Bayesian network (DBN). Methods: In the present study, we separately imputed datasets of the Escherichia coli S.O.S. DNA repair pathway and the Saccharomyces cerevisiae cell cycle pathway with kNN, LLS, and BPCA, and subsequently used these to generate gene regulatory networks (GRNs) using a discrete DBN. We made comparisons on the basis of previous studies in order to select the gene network with the least error. Results: We found that BPCA and LLS performed better on larger networks (based on the S. cerevisiae dataset), whereas kNN performed better on smaller networks (based on the E. coli dataset). Conclusion: The results suggest that the performance of each imputation method is dependent on the size of the dataset, and this subsequently affects the modelling of the resultant GRNs using a DBN. In addition, on the basis of these results, a DBN has the capacity to discover potential edges, as well as display interactions, between genes. PMID:24876803

  18. An integrated SNP mining and utilization (ISMU) pipeline for next generation sequencing data.

    PubMed

    Azam, Sarwar; Rathore, Abhishek; Shah, Trushar M; Telluri, Mohan; Amindala, BhanuPrakash; Ruperao, Pradeep; Katta, Mohan A V S K; Varshney, Rajeev K

    2014-01-01

    Open source single nucleotide polymorphism (SNP) discovery pipelines for next generation sequencing data commonly require working knowledge of the command line interface, massive computational resources and expertise, which is a daunting task for biologists. Further, the SNP information generated may not be readily used for downstream processes such as genotyping. Hence, a comprehensive pipeline has been developed by integrating several open source next generation sequencing (NGS) tools along with a graphical user interface called Integrated SNP Mining and Utilization (ISMU) for SNP discovery and their utilization by developing genotyping assays. The pipeline features functionalities such as pre-processing of raw data, integration of open source alignment tools (Bowtie2, BWA, Maq, NovoAlign and SOAP2), SNP prediction (SAMtools/SOAPsnp/CNS2snp and CbCC) methods and interfaces for developing genotyping assays. The pipeline outputs a list of high quality SNPs between all pairwise combinations of genotypes analyzed, in addition to the reference genome/sequence. Visualization tools (Tablet and Flapjack) integrated into the pipeline enable inspection of the alignment and errors, if any. The pipeline also provides a confidence score or polymorphism information content value with flanking sequences for identified SNPs in standard format required for developing marker genotyping (KASP and Golden Gate) assays. The pipeline enables users to process a range of NGS datasets such as whole genome re-sequencing, restriction site associated DNA sequencing and transcriptome sequencing data at a fast speed. The pipeline is very useful for the plant genetics and breeding community with no computational expertise in order to discover SNPs and utilize them in genomics, genetics and breeding studies. The pipeline has been parallelized to process huge datasets of next generation sequencing. It has been developed in Java and is available at http://hpc.icrisat.cgiar.org/ISMU as standalone free software.

  19. 32 CFR 776.29 - Imputed disqualification: General rule.

    Code of Federal Regulations, 2011 CFR

    2011-07-01

    ... 32 National Defense 5 2011-07-01 2011-07-01 false Imputed disqualification: General rule. 776.29... inferences, deductions, or working presumptions that reasonably may be made about the way in which covered... interests of another. When such independence is lacking or unlikely, representation cannot be zealous. (5...

  20. 32 CFR 776.29 - Imputed disqualification: General rule.

    Code of Federal Regulations, 2010 CFR

    2010-07-01

    ... 32 National Defense 5 2010-07-01 2010-07-01 false Imputed disqualification: General rule. 776.29... inferences, deductions, or working presumptions that reasonably may be made about the way in which covered... interests of another. When such independence is lacking or unlikely, representation cannot be zealous. (5...
